Archiving rooms from a Matrix.org Homeserver (including end-to-end encrypted rooms)
I'm in the middle of a Forever Project, migrating stuff and services off of an old server in my closet at home onto a new (smaller, better, faster!) server in my closet at home.
One such service is a Matrix.org Synapse homeserver that was used as a private Slack-alternative chat for my household, as well as a bridge to some IRC channels. I set it up by hand in haste some years ago and made some not-super-sustainable choices about it, including leaving the database in SQLite (2.2GB and feelin' fine), not documenting my DNS and port-forwarding setup very well, and a few other "oopsies".
I had been keeping the code up to date via "pip install" and the latest "master" tarballs, but when the announcement came about needing valid TLS for federation starting in 0.99.X, I wasn't sure if I was good to upgrade. (I later found out that I was okay, ha!)
I found some docs on the most recent ways to set up Matrix on a new server, and even on how to migrate from SQLite to PostgreSQL. However, I don't know if I'll be able to set aside the time to do it all at once, or if it'll be easier just to set it up fresh, or even if I need a homeserver right now. So, I decided to figure out how to make archives of the rooms I cared about, which included household conversations, recipes, and photos from around the house and on travels.
Overview
The process turned out to be pretty involved, which is why it gets a blog post! It boils down to needing these three things:
- osteele/matrix-archive - Export a Matrix room message archive and photos.
- matrix-org/pantalaimon - A proxy to handle end-to-end encrypted (E2EE) room content for matrix-archive
- matrix-org/Olm - C library to handle the actual E2EE processing. Pantalaimon relies on this library and its Python extensions.
Getting all the tools built required a pretty recent system, which my old server ain't. I ended up building and running them on my personal laptop, running Ubuntu 19.04.
Since both matrix-archive and pantalaimon are Python-based, I created a Python 3.7 virtualenv to keep everything in, rather than installing everything system-wide.
Olm
The Olm docs recommend building with CMake. As someone unfamiliar with CMake, I could get it to build and run the tests, but could not actually get it installed on my system.
I ended up installing the main lib with:
make && sudo make install
The Python extensions were a challenge and I am not sure that I remember all the details to properly document them here. I spent a good amount of time trying to follow the Olm instructions to get them installed into my Python virtualenv.
In the end, the pantalaimon install built its own version of the Python Olm extensions, so I'm going to guess this was enough for now.
Pantalaimon
The pantalaimon README was pretty straightforward, once I installed Olm system-wide. I activated my virtualenv and ran:
python setup.py install
That resulted in a "pantalaimon" script installed in my virtualenv's bin dir, so I could (in theory) run it on the command line, pointing it at my running Synapse server:
pantalaimon https://matrix.example.com:8448
That started a service on http://127.0.0.1:8009/ that matrix-archive could connect to, with pantalaimon handling all the E2EE decryption transparently.
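A quick way to confirm that the proxy is answering is to ask it for the client API versions. This is a minimal sketch, not something from the pantalaimon docs: it assumes the proxy is listening on the address above and that it passes the standard `/_matrix/client/versions` request through to the homeserver. The function names are made up for illustration.

```python
import json
import urllib.request

# Assumption: pantalaimon is listening on the address quoted above.
PROXY_URL = "http://127.0.0.1:8009"


def versions_url(base_url):
    """Build the URL of the unauthenticated client-server 'versions' endpoint."""
    return base_url.rstrip("/") + "/_matrix/client/versions"


def proxy_is_up(base_url=PROXY_URL, timeout=5):
    """Return True if something speaking the Matrix client API answers."""
    try:
        with urllib.request.urlopen(versions_url(base_url), timeout=timeout) as resp:
            # A homeserver replies with a JSON object containing a "versions" list.
            return "versions" in json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON.
        return False


if __name__ == "__main__":
    print("proxy reachable:", proxy_is_up())
```

If this prints False, it is worth checking that pantalaimon is still running before blaming matrix-archive.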
matrix-archive
The matrix-archive setup instructions suggest using a dependency manager called "Pipenv" that I was not familiar with. I installed it in my virtualenv, then ran it to set up and install matrix-archive:
pip install pipenv
pipenv install
Pipenv "noticed" it was running in a virtualenv, and said so. This didn't seem to be much of a problem, but any command I tried to run with "pipenv run" would fail. I worked around this by looking in the "Pipfile" to see what commands were actually being run, and it turns out it was just calling specific Python scripts in the matrix-archive directory. So, I resolved to run those by hand.
MongoDB
matrix-archive requires MongoDB. I don't use it for anything else, so I had to "sudo apt install mongodb-server".
Running the Import
First, I set the environment variables needed by matrix-archive:
export MATRIX_USER=<my username>
export MATRIX_PASSWORD=<my password>
export MATRIX_HOST=http://127.0.0.1:8009
Then confirmed it was working by getting a list of rooms with IDs:
python list_rooms.py
I set up the list of room IDs in an environment variable:
export MATRIX_ROOM_IDS='!room:server,!room2:server,...'
And slurped in all the messages with:
python import_messages.py
At the end, it said it had a bunch of messages. Hooray!
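Since the whole import is driven by environment variables, a small check before running it can save a confusing failure later. The following is only a sketch of that idea and not how matrix-archive itself reads its settings: `load_settings` is a hypothetical helper, and the room IDs in the usage example are invented.

```python
import os

REQUIRED = ["MATRIX_USER", "MATRIX_PASSWORD", "MATRIX_HOST", "MATRIX_ROOM_IDS"]


def load_settings(env=None):
    """Collect the MATRIX_* settings into a dict, failing early if one is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise SystemExit("missing environment variables: " + ", ".join(missing))
    # MATRIX_ROOM_IDS is a comma-separated list; drop stray spaces and empty items.
    room_ids = [r.strip() for r in env["MATRIX_ROOM_IDS"].split(",") if r.strip()]
    return {
        "user": env["MATRIX_USER"],
        "password": env["MATRIX_PASSWORD"],
        "host": env["MATRIX_HOST"].rstrip("/"),
        "room_ids": room_ids,
    }


if __name__ == "__main__":
    settings = load_settings()
    print("host:", settings["host"])
    print("rooms:", len(settings["room_ids"]))
```

Passing a plain dict instead of `os.environ` makes it easy to try out, for example `load_settings({"MATRIX_USER": "me", "MATRIX_PASSWORD": "secret", "MATRIX_HOST": "http://127.0.0.1:8009", "MATRIX_ROOM_IDS": "!abc:example.com"})`.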
Running the Export
This is where things kind of ran off the rails. In trying to export messages I kept seeing Python KeyErrors about a missing 'info' key. It seems like maybe the Matrix protocol was updated to make this an optional key, but the upshot was that matrix-archive seemed to assume that every message with an image attached would have an 'info' with info about a thumbnail for that image.
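The general shape of a fix for this kind of KeyError is to treat 'info' as optional and fall back to something else when it is absent. Here is a minimal sketch of that approach, not the actual patch I made to matrix-archive: `image_details` is a hypothetical helper, and it assumes the event content follows the usual `m.image` layout (`url`, `body`, and an optional `info` holding `mimetype` and `thumbnail_url`).

```python
import mimetypes


def image_details(content, want_thumbnail=False):
    """Return (mxc_url, mimetype) for an m.image event's content dict.

    Works whether or not the optional 'info' block is present.
    """
    info = content.get("info") or {}
    url = content.get("url")
    mimetype = info.get("mimetype")
    if want_thumbnail and info.get("thumbnail_url"):
        # Only use the thumbnail when the event actually carries one.
        url = info["thumbnail_url"]
        mimetype = (info.get("thumbnail_info") or {}).get("mimetype")
    if not mimetype:
        # No declared type: guess from the filename in 'body' (only a guess).
        mimetype, _ = mimetypes.guess_type(content.get("body") or "")
    return url, mimetype


if __name__ == "__main__":
    # Invented example event content with no 'info' key at all.
    event = {"msgtype": "m.image", "body": "kitchen.png", "url": "mxc://example.com/abc123"}
    print(image_details(event))
```

Using `.get()` with a fallback keeps one odd message from aborting the export of an entire room.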
Additionally, the script to download images had some naive handling for turning attachment URLs like "mxc://example.com/..." into downloadable URLs. Matrix supports DNS-based delegation, so you can say "the Matrix server for example.com is matrix.example.com:8448", and this script didn't handle that.
I did some nasty hacks to only get full-sized images, and from the right host:
- updated the schema to return the full image URL instead of digging in for a thumbnail
- added handling to export_messages.py to handle missing 'info', which was used to guess image mimetypes
- added some hardcoding to map converted "mxc://" URLs to the right host.
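The host-mapping hack can be sketched roughly as follows. This is an illustration rather than the code I actually wrote: `HOST_OVERRIDES` and `mxc_to_http` are hypothetical names, the override table stands in for proper delegation lookup, and the path assumes the r0-era media download endpoint (`/_matrix/media/r0/download/<server>/<media id>`). Newer versions of the Matrix spec serve media from different, authenticated endpoints, so treat the path as an assumption too.

```python
from urllib.parse import urlsplit

# Hypothetical hardcoded table: server name used in mxc:// URLs -> base URL
# of the host that actually serves that server's media.
HOST_OVERRIDES = {"example.com": "https://matrix.example.com:8448"}


def mxc_to_http(mxc_url, overrides=HOST_OVERRIDES):
    """Convert mxc://<server>/<media id> into a plain HTTP download URL."""
    parts = urlsplit(mxc_url)
    media_id = parts.path.strip("/")
    if parts.scheme != "mxc" or not parts.netloc or not media_id:
        raise ValueError("not an mxc:// URL: %r" % (mxc_url,))
    server_name = parts.netloc
    # Without an override, fall back to assuming the server name is the host.
    base = overrides.get(server_name, "https://" + server_name)
    return "%s/_matrix/media/r0/download/%s/%s" % (base, server_name, media_id)


if __name__ == "__main__":
    print(mxc_to_http("mxc://example.com/abc123"))
```

Note that the server name from the mxc:// URL still appears in the path; only the host being contacted changes.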
Afterwards I was able to do an export of alllllll the images to an "images/" folder:
python download_images.py --no-thumbnails
And could then export a particular room's history with:
python export_messages.py --room-id ROOM-NAME --local-images --filename ROOM-NAME.html
Note that the "--room-id" flag above actually wants the human-readable room name, unless it's actually a room on the main matrix.org server.
Afterwards, I could open room-name.html in my browser, and see the very important messages and images I worked so hard to archive.
What's Next?
For now, I'll be putting these files and images in a safe backup and not worrying about them too much, because I have them. I've already stopped my old Synapse server, and can tackle setting up the new one at my leisure. We've moved our house chats to Signal, and I've moved my IRC usage over to bridged Slack channels.
Running a Matrix Synapse homeserver for the past couple of years has been quite interesting! I really appreciate the hard working community (especially in light of their recent infrastructure troubles), and I recognize that it's a ton of work to build a federating network of real-time, private communication. I enjoyed the freedom of having my own chat service to run bots, share images, and discuss private moments without worrying about who might be reading the messages now or down the road.
That said, there are still some major usability kinks to work out. The end-to-end encryption rollout across homeservers and clients hasn't been the smoothest, and it can be an issue juggling E2EE keys across devices. I look forward to seeing how the community addresses issues like these in the future!
TL;DR - saving an archive of a room's history and files should not be this hard.
Mentions
@schmarty hey; just found martymcgui.re/2019/05/10/arc… - thanks for writing it up. and sorry it didn’t work out :( between e2e cross-signing and ux improvements, hope you may give us another chance at some point.