Posts Tagged ownyourdata

Fri May 10

Archiving rooms from a Homeserver (including end-to-end encrypted rooms)

I'm in the middle of a Forever Project, migrating stuff and services off of an old server in my closet at home onto a new (smaller, better, faster!) server in my closet at home.

One such service is a Synapse homeserver that was used as a private Slack-alternative chat for my household, as well as a bridge to some IRC channels. I set it up by hand in haste some years ago and made some not-super-sustainable choices about it, including leaving the database in SQLite (2.2GB and feelin' fine), not documenting my DNS and port-forwarding setup very well, and a few other "oopsies".

I had been keeping the code up to date via "pip install" and the latest "master" tarballs, but when the announcement came about needing valid TLS for federation starting in 0.99.X, I wasn't sure if I was good to upgrade. (I later found out that I was okay, ha!)

I found some docs on the most recent ways to set up Matrix on a new server, and even on how to migrate from SQLite to PostgreSQL. However, I don't know if I'll be able to set aside the time to do it all at once, or if it'll be easier just to set it up fresh, or even if I need a homeserver right now. So, I decided to figure out how to make archives of the rooms I cared about, which included household conversations, recipes, and photos from around the house and on travels.


The process turned out to be pretty involved, which is why it gets a blog post! It boils down to needing these three things:

  • osteele/matrix-archive - Export a Matrix room message archive and photos.
  • matrix-org/pantalaimon - A proxy to handle end-to-end encrypted (E2EE) room content for matrix-archive
  • matrix-org/Olm - C library to handle the actual E2EE processing. Pantalaimon relies on this library and it's Python extensions.

Getting all the tools built required a pretty recent system, which my old server ain't. I ended up building and running them on my personal laptop, running Ubuntu 19.04.

Since both matrix-archive and pantalaimon are Python-based, I created a Python 3.7 virtualenv to keep everything in, rather than installing everything system-wide.


The Olm docs recommend building with CMake, but as someone unfamiliar with CMake I could get it to build and run tests, but could not actually get it installed on my system.

I ended up installing the main lib with:

  make && sudo make install

The Python extensions were a challenge and I am not sure that I remember all the details to properly document them here. I spent a good amount of time trying to follow the Olm instructions to get them installed into my Python virtualenv.

In the end, the pantalaimon install built its own version of the Python Olm extensions, so I'm going to guess this was enough for now.


The pantalaimon README was pretty straightforward, once I installed Olm system-wide. I activated my virtualenv and ran:

  python install

That resulted in a "pantalaimon" script installed in my virtualenv's bin dir, so I could (in theory) run it on the command line, pointing it at my running Synapse server:


That started a service on which matrix-archive would connect over, with pantalaimon handling all the E2EE decryption transparently.


The matrix-archive setup instructions suggest using a dependency manager called "Pipenv" that I was not familiar with. I installed it in my virtualenv, then ran it to setup and install matrix-archive:

  pip install pipenv
  pipenv install

Pipenv "noticed" it was running in a virtualenv, and said so. This didn't seem to be much of a problem, but any command I tried to run with "pipenv run" would fail. I worked around this by looking in the "Pipfile" to see what commands were actually being run, and it turns out it was just calling specific Python scripts in the matrix-archive directory. So, I resolved to run those by hand.


matrix-archive requires MongoDB. I don't use it for anything else, so I had to "sudo apt install mongodb-server".

Running the Import

First, I set the environment variables needed by matrix-archive:

  export MATRIX_USER=<my username>
  export MATRIX_PASSWORD=<my password>
  export MATRIX_HOST=

Then confirmed it was working by getting a list of rooms with IDs:


I set up the list of room IDs in an environment variable:

  export MATRIX_ROOM_IDS=!room@server,!room2@server,...

And slurped in all the messages with:


At the end, it said it had a bunch of messages. Hooray!

Running the Export

This is where things kind of ran off the rails. In trying to export messages I kept seeing Python KeyErrors about a missing 'info' key. It seems like maybe the Matrix protocol was updated to make this an optional key, but the upshot was that matrix-archive seemed to assume that every message with an image attached would have an 'info' with info about a thumbnail for that image.

Additionally, the script to download images had some naive handling for turning attachment URLs like "mxc://" into downloadable URLs. Matrix supports DNS-based delegation, so you can say "the Matrix server for is, and this script didn't handle that.

I did some nasty hacks to only get full-sized images, and from the right host:

  • updated the schema to return the full image URL instead of digging in for a thumbnail
  • added handling to to handle missing 'info', which was used to guess image mimetypes
  • added some hardcoding to map converted "mxc://" URLs to the right host.

Afterwards I was able to do an export of alllllll the images to a "images/" folder:

  python --no-thumbnails

And could then export a particular room's history with:

  python --room-id ROOM-NAME --local-images --filename ROOM-NAME.html

Note that the "--room-id" flag above actually wants the human-readable room name, unless it's actually a room on the main server.

Afterwards, I could open room-name.html in my browser, and see the very important messages and images I worked so hard to archive.

a message exchange. marty asks maktrobot to 'pug me'. maktrobot responds with an image of a pug.
Screenshot from the (very minimal) HTML export, including an image (of a pug, sourced by a chat bot).

What's Next?

For now, I'll be putting these files and images in a safe backup and not worrying about them too much, because I have them. I've already stopped my old Synapse server, and can tackle setting up the new one at my leisure. We've moved our house chats to Signal, and I've moved my IRC usage over to bridged Slack channels.

Running a Matrix Synapse homeserver for the past couple of years has been quite interesting! I really appreciate the hard working community (especially in light of their recent infrastructure troubles), and I recognize that it's a ton of work to build a federating network of real-time, private communication. I enjoyed the freedom of having my own chat service to run bots, share images, and discuss private moments without worrying about who might be reading the messages now or down the road.

That said, there are still some major usability kinks to work out. The end-to-end encryption rollout across homeservers and clients hasn't been the smoothest, and it can be an issue juggling E2EE keys across devices. I look forward to seeing how the community addresses issues like these in the future!

TL;DR - saving an archive of a room's history and files should not be this hard.

Sat May 12
🔖 Bookmarked Pulling My Thangs In Haus — jackyalciné

“Dokku is really nifty in its singular goal - a plugin based platform as a service thingy”

Fri May 4

Leaving Netflix (and taking my data with me)

Netflix has been a staple in my life for years, from the early days of mailing (and neglecting) mostly-unscratched DVDs through the first Netflix original series and films. With Netflix as my catalog, I felt free to rid myself of countless DVDs and series box sets. Through Netflix I caught up on "must-see" films and shows that I missed the first time around, discovered unexpected (and wonderfully strange) things I would never have seen otherwise. Countless conversations have hinged around something that I've seen / am binging / must add to my list.

At times, this has been a problem. It's so easy to start a show on Netflix and simply let it run. At home we frequently spend a whole evening grinding through the show du jour. Sometimes whole days and weekends disappear. This be can true for more and more streaming services but, in my house, it is most true for Netflix. We want to better use our time, and avoid the temptation to put up our stocking'd feet, settle in, and drop out.

It's easy enough to cancel a subscription, and even easier to start up again later if we change our minds. However, Netflix has one, even bigger, hook into my life: my data. Literal years of viewing history, ratings, and the list of films and shows that I (probably) want to watch some day. I wanted to take that data with me, in case we don't come back and they delete it.

Netflix had an API, but after they shut it down in 2014, it's so far dead that even the blog post announcing the API shutdown is now gone from the internet.

Despite no longer having a formal API, Netflix is really into the single-page application style of development for their website. Typically this is a batch of HTML, CSS, and JavaScript that runs in your browser and uses internal APIs to fetch data from their service. Even better, they are fans of infinite scrolling, so you can open up a page like your "My List", and it loads more data as you scroll, until you've got it all loaded in your browser.

Once you've got all that information in your browser, you can script it right out of there!

After some brief Googling, I found a promising result about using the Developer Console in your browser to grab your My List data. That gave me the inspiration I needed to write some JavaScript snippets to grab the three main lists of data that I care about:

Each of these pages displays slightly different data, in HTML that can be extracted with very little javascript, and each loads more data as you scroll the page until it's all loaded. To extract it, I needed some code to walk the entries on the page, extract the info I want, store it in a list, turn the list into a JSON string, and the copy that JSON data to the clipboard. From there I can paste that JSON data into a data file to mess with later, if I want.

Extracting my Netflix Watch List

A screenshot of my watch list

The My List page has a handful of useful pieces of data that we can extract:

  • Name of the show / film
  • URL to that show on Netflix
  • A thumbnail of art for the show!

After eyeballing the HTML, I came up with this snippet of code to pull out the data and copy it to the clipboard:

    .forEach(item => {
        title: item.getAttribute('aria-label'),
        url: item.getAttribute('href'),
        art: item.querySelector('.video-artwork').style.backgroundImage.slice(5,-2)
  copy(JSON.stringify(list, null, 2));

The resulting data in the clipboard is an array of JSON objects like:

    "title": "The Magicians",
    "url": "/watch/80092801?tctx=1%2C1%2C%2C%2C",
    "art": ""

I like this very much! I probably won't end up using the Netflix URLs or the art, since it belongs to Netflix and not me, but a list of show titles will make a nice TODO list, at least for the shows I want to watch that are not on Netflix.

Extracting my Netflix Ratings

A screenshot of my ratings list

More important to me than my to-watch list was my literal years of rating data. This page is very different from the image-heavy watch list page, and is a little harder to find. It's under the account settings section on the website, and is a largely text-based page consisting of:

  • date of rating (as "day/month/year")
  • title
  • URL (on Netflix)
  • rating (as a number of star elements. Lit up stars indicate the rating that I gave, and these can be counted to get a numerical value of the number of stars.)

The code I used to extract this info looks like this:

        date: item.querySelector('.date').innerText,
        title: item.querySelector('.title a').innerText,
        url: item.querySelector('.title a').getAttribute('href'),
        rating: item.querySelectorAll('.rating .star.personal').length
  copy(JSON.stringify(list, null, 2));

The resulting data looks like:

    "date": "9/27/14",
    "title": "Print the Legend",
    "url": "/title/80005444",
    "rating": 5

While the URL probably isn't that useful, I find it super interesting to have the date that I actually rated each show!

One thing to note, although I wasn't affect by it, Netflix has replaced their 0-to-5 star rating system with a thumbs up / down system. You'd have to tweak the script a bit to extract those values.

Extracting my Netflix Watch History

A screenshot of my watch history page

One type of data I was surprised and delighted to find available was my watch history. This was another text-heavy page, with:

  • date watched
  • name of show (including episode for many series)
  • URL (on Netflix)

The code I used to extract it looked like this:

        date: item.querySelector('.date').innerText,
        title: item.querySelector('.title a').innerText,
        url: item.querySelector('.title a').getAttribute('href')
  copy(JSON.stringify(list, null, 2));

The resulting data looks like:

    "date": "3/20/18",
    "title": "Marvel's Jessica Jones: Season 2: \"AKA The Octopus\"",
    "url": "/title/80130318"

Moving forward

One thing I didn't grab was my Netflix Reviews, which are visible to other Netflix users. I never used this feature, so I didn't have anything to extract. If you are leaving and want that data, I hope that it's similarly easy to extract.

With all this data in hand, I felt much safer going through the steps to deactivate my account. Netflix makes this easy enough, though they also make it extremely tempting to reactivate. Not only do they let you keep watching until the end of the current paid period – they tell you clearly that if you reactivate in 10 months, your data won't be deleted.

That particular temptation won't be a problem for me.

Sun Jul 9
🔖 Bookmarked Bookmarks, favs, likes - backfilling years of gaps

“Fast forward a few years: the canonical source is gone. Images, videos, texts, thoughts, life fragments deleted. Domains vanished, re-purposed, sold. There are no redirects, just a lot of 404 Not Found.”

Tue Apr 25

Site Updates: Importing Old Posts, Disqus Comments

Jonathan Prozzi and I have challenged one another to make a post about improving our websites once a week. Here’s mine!

Back in 2008 I started a new blog on Wordpress. It seemed like a good idea! Maybe I would post some useful things and someone would offer me a job! I wanted to allow discussion without the dangers of letting strangers submit data directly to my server, so I set up the JavaScript-based Disqus comments service. I made a few posts per year and it eventually tapered off and I largely forgot about it.

In February 2011 I participated in the Thing-a-Day project on Posterous. It was the first time in a long time that I had published consistently, so when it was announced that Posterous was going away, I worked hard to grab my content and stored it somewhere.

Eventually it was November 2013, Wordpress was “out”, static site generators were “in”, and I wanted to give Octopress a try. I used Octopress’ tools to import all my Wordpress content into Octopress, forgot about adding back the Disqus comments, and posted it all back online. In February 2014, I decided to resurrect my Posterous content, so I created posts for it and got everything looking nice enough.

In 2015 I learned about the IndieWeb, and decided it was time for a new approach to my identity and content online. I set up a new site at based on Jekyll (hey! static sites are still “in”!) and got to work adding IndieWeb features.

Well, today I decided to get some of that old content off my other domain and into my official one. Thankfully, with Octopress being based on Jekyll, it was mostly just a matter of copying over the files in the _posts/ folder. A few tweaks to a few posts to make up for newer parsing in Jekyll, my somewhat odd URL structure, etc., and I was good to go!

“Owning” My Disqus Comments

Though I had long ago considered them lost, I noticed that some of my old posts had a section that the Octopress importer had added to the metadata of my posts from Wordpress:

  _edit_last: ‘1’
  _wp_old_slug: makerbot-cam-1-wiring
  dsq_thread_id: ‘604226727’

All of my Wordpress posts had this dsq_thread_id value, and that got me thinking. Could I export the old Disqus comment data and find a way to display it on my site? (Spoiler alert: yes I could).

Disqus actually has a export feature:

You can request a compressed XML file containing all of your comment data, organized hierarchically into “category” (which I think can be configured per-site), “thread” (individual pages), and “post” (the actual comments), and includes info such as author name and email, the date it was created, the comment message with some whitelisted HTML for formatting and links, whether the comment was identified as spam or has been deleted, etc.

The XML format was making me queasy, and Jekyll data files often come in YAML format for editability, so I did the laziest XML to YAML transform possible, thanks to some Ruby and this StackOverflow post.

require ‘active_support/core_ext/hash/conversions’
require ‘yaml’
file =“disqus_export.xml”, “r”)
hash = Hash.from_xml(
yaml = hash.to_yaml“disqus.yml”, “w”) { |file| file.write(yaml) }

This resulted in a YAML formatted file that looked like:

    dsq:id: ...
    forum: ...

I dropped this into my Jekyll site as _data/disqus.yml, and … that’s it! I could now access the content from my templates in

I wrote a short template snippet that, if the post has a “meta” property with a “dsq_thread_id”, to look in and collect all Disqus comments where “thread.dsq:id” was the same as the “dsq_thread_id” for the post. If there are comments there, they’re displayed in a “Comments” section on the page.

So now some of my oldest posts have some of their discussion back after more than 7 years!

Here’s an example post:

Example of old Disqus comments on display.

I was (pleasantly) surprised to be able to recover and consolidate this older content. Thanks to past me for keeping good backups, and to Disqus for still being around and offering a comprehensive export.

As a bonus, since all of the comments include the commenter’s email address, I could give them avatars with Gravatar, and (though they have no URL to link to) they would almost look right at home alongside the more modern mentions I display on my site.

Update: Yep, added Gravatars.

Old Disqus comments now with avatars by Gravatar

Mon Nov 21

Really excited about end-to-end encryption being available for Matrix! Now you can host your own encrypted communications network!