Posts tagged 'media-endpoint'

Site updates: simplifying media, complicating mentions

Jonathan Prozzi and I have challenged one another to make a post about improving our websites once a week. I'm late with this one!

Most of the features on my website are experiments in learning new things. Sometimes I learn a better way of doing something that I've already built into the site and it's time to migrate!

Moving Media files from Git LFS to a Media Endpoint

I build my site with Jekyll, and I store my site's configuration and text content via Git. One of the things that most folks avoid with Git is mixing text content (which fits into Git's model of efficiently storing differences over time) with large binary files like images (which Git cannot manage as efficiently).

When I first set up my site, I made use of Git LFS ("Large File Storage") for managing anything that wasn't text. Any images, video, or audio that I added to my site were stored in an _assets/ folder in a way that matched uploaded files to the posts they were a part of. Git LFS would transparently ship those files off to a secondary server rather than include their content in the Git repository itself. I had to jump through some hoops to set up my local GitLab server to support Git LFS and to set up Git LFS with the server that handles receiving new posts via Micropub, compiling and deploying the site.

It turns out that there are many reasons that a site would want to handle media files separately from the text content that refers to them. In fact, it is a common enough pattern that the Micropub standard includes a definition for a separate "media endpoint" to handle file uploads. I shared a Micropub media endpoint implementation that I built called Spano a while back, and it has been working well with support from tools like Quill. So the text content of my site is served from https://martymcgui.re/, and my media files from https://media.martymcgui.re/. With a couple of changes in my code and my workflows, this has become the way I handle all media files for my site.

However, I still had a bunch of files in my site being handled by Git LFS, and some of my Jekyll code (plugins and templates) for showing embeds expected files to be on the local filesystem. This past week I took some time to write some scripts to find all references to those local files, migrate them to my media server, and update the outgoing links. I also updated my embed handling so it didn't rely on local files. This let me delete a lot of local metadata I was keeping but not using, like all the EXIF tags in uploaded photos. I am now Git LFS free and it feels like one less thing to worry about.

Better Caching for Mentions from Webmention.io

When I finally started displaying webmentions, I had a very simple model for how to cache all the info from webmention.io. Basically: I stored all mentions in a big array and, when my site went to fetch new mentions, it would keep fetching until it saw the "last" mention again. This led to a bit of a bug where someone might send me a mention, update their page, and send the mention again. My site would not be able to recognize the "last" mention, so it would fetch all my mentions again, leading to everything appearing twice.

This past week I rewrote my mention handling to avoid this problem by replacing this array and storing mentions in a hash based on the source and target. The new code also checks to see if the verification date of the mention has changed (giving me a way to detect and notify about changed mentions in the future). I also reorganized my mention cache to include an index by the target URL on my site. This makes it a bit quicker to find mentions for a given page when rendering out the site.
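As a rough illustration of the idea (not the actual plugin code, which lives in a Jekyll plugin), here is a minimal Python sketch of a mention cache keyed on source and target, with a second index by target URL. The class name, method names, and the "source", "target", and "verified_date" field names are assumptions made for this example.

```python
from collections import defaultdict


class MentionCache:
    """Hypothetical cache of mentions keyed by (source, target)."""

    def __init__(self):
        self.by_key = {}                   # (source, target) -> mention dict
        self.by_target = defaultdict(set)  # target URL -> set of (source, target) keys

    def upsert(self, mention):
        """Store a mention and report whether it is new, changed, or unchanged."""
        key = (mention["source"], mention["target"])
        previous = self.by_key.get(key)
        if previous is None:
            status = "new"
        elif previous["verified_date"] != mention["verified_date"]:
            # Same source and target, but re-verified: the sender updated their page.
            status = "changed"
        else:
            status = "unchanged"
        # Overwrite in place, so a re-sent mention never shows up twice.
        self.by_key[key] = mention
        self.by_target[mention["target"]].add(key)
        return status

    def for_target(self, target):
        """Return all cached mentions of one page on the site."""
        return [self.by_key[key] for key in sorted(self.by_target.get(target, ()))]
```

Because a re-sent mention lands on the same key, fetching the same mention twice replaces the old entry instead of appending a duplicate, and looking up the mentions for one page no longer requires scanning the whole cache.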

Neither of these changes is really visible to readers of my site, but they have been useful for cleaning things up. The webmention.io handling in particular has brought my plugin a lot closer to being something I could release for other people to use!

Spano - a minimum-viable Micropub Media Endpoint

Micropub is an open API standard to create posts on one's own domain using third-party clients, and it is currently a W3C Candidate Recommendation. One of the (semi-) recent additions is the idea of a Micropub Media Endpoint. The Media Endpoint provides a way for Micropub clients to upload media files to a Micropub service, receiving a URL that is sent along in place of the file contents when the post is published.

Some of the things I like about Micropub media endpoints include:

  • The spec allows the media endpoint to be on a completely separate domain from the "full" micropub endpoint.
  • The spec doesn't specify anything about how the files are stored or their final URLs or filenames.
  • They make it easy to separate the handling of (large) media files from the (presumably much smaller) content and metadata of a post.
  • They enable Micropub clients to upload multiple files without creating multiple posts. This makes it simpler to create posts that contain multiple images, like a gallery.

Personally, I wanted a Micropub media endpoint server with a few extra properties:

  • It should be able to run completely separately from, and therefore work in conjunction with, any other micropub server implementation.
  • It should not store duplicate files. If the same file is uploaded twice, the same URL should be returned both times.
  • It should not allow overwriting files. If two images of the same name are uploaded, both are kept and receive different URLs.

Enter HashFS

My extra features above essentially describe a content-addressable storage (CAS) system. CAS is a way of storing and accessing data based on some property of the actual content, rather than (potentially arbitrary) files and folders.

HashFS is a Python implementation of a content-addressable file management system. You give it files, and it puts them in a directory structure based on a cryptographic hash of each file's contents. In other words, HashFS can take any file and give back a unique path to that file which will never change (if you later upload a new version of the file, it gets a different path).
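To make the idea concrete, here is a small standard-library sketch of how a content-addressed path can be derived. It is not HashFS's own code: the function name content_path, the choice of SHA-256, and the depth and width values are assumptions, picked so the result has the same general shape as the example URL shown later in this post.

```python
import hashlib


def content_path(data: bytes, extension: str = "", depth: int = 4, width: int = 2) -> str:
    """Derive a nested relative path from the hash of the file contents."""
    digest = hashlib.sha256(data).hexdigest()
    # Use the first depth * width hex characters as nested directory names.
    directories = [digest[i * width:(i + 1) * width] for i in range(depth)]
    # The rest of the digest, plus the original extension, becomes the filename.
    filename = digest[depth * width:] + extension
    return "/".join(directories + [filename])
```

Identical bytes always map to the same path, so duplicates collapse into one stored file, while two different images that happen to share a name get different paths and neither overwrites the other.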

To add to the fun of HashFS, there is a Flask extension called Flask-HashFS which makes it easy to expose a HashFS file store on the web via the Python Flask framework.

Introducing Spano

Spano is a Micropub Media Endpoint server written in Python via the Flask framework which combines Flask-HashFS for file storage with Flask-IndieAuth (introduced earlier) to handle authentication and authorization.

Spano is a server-side web app that basically does one thing: it accepts HTTP POST requests with a valid IndieAuth token and a file named "file", stores that file, and returns a URL to that file. The task of serving uploaded files is left to a dedicated web server like nginx or Apache.

Using Spano

Once Spano has been set up and configured for your domain, uploading is a matter of getting a valid IndieAuth token. IndieAuth-enabled Micropub clients will do this automatically. For testing by hand I like to log in to Quill and copy the access token from the Quill settings page. With token in hand, uploads are as easy as:

curl -D - -F "file=@myfile.jpg" \
  -H "Authorization: Bearer xxxx..." \
  https://media.example.com/micropub/

Which should output a response like:

HTTP/1.1 100 Continue

HTTP/1.0 201 CREATED
Content-Type: text/html; charset=utf-8
Content-Length: 108
Location: https://media.example.com/cc/a5/97/7c/2004..2cb.jpg
Server: Werkzeug/0.11.4 Python/2.7.11
Date: Thu, 26 Jan 2017 02:40:05 GMT

File created: https://media.example.com/cc/a5/97/7c/2004..2cb.jpg
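For scripts where curl is not convenient, the same upload can be sketched in Python with only the standard library. This is a sketch under assumptions rather than part of Spano: build_upload_request and upload are hypothetical helper names, and the endpoint URL and token are the same placeholders used in the curl example.

```python
import urllib.request
import uuid


def build_upload_request(endpoint, token, filename, data,
                         content_type="application/octet-stream"):
    """Build a multipart/form-data POST carrying one part named "file"."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        f"Content-Type: {content_type}\r\n\r\n".encode(),
        data,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def upload(request):
    """Send the request and return the Location header of the response."""
    with urllib.request.urlopen(request) as response:
        return response.headers["Location"]
```

Usage would look something like upload(build_upload_request("https://media.example.com/micropub/", token, "myfile.jpg", image_bytes, "image/jpeg")), where token and image_bytes are values you supply. The returned Location is the URL to send along when the post is published.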

Integrating Spano with your Micropub Endpoint

If you want Micropub clients to use Spano as your Media Endpoint, you need to advertise it. This is handled by your "main" Micropub server using discovery. Essentially, a client will make a configuration request to your server like so:

https://example.com/micropub?q=config

And your server's response should be a JSON-formatted object specifying the "media-endpoint". A bare minimum example:

{
  "media-endpoint": "https://media.example.com/micropub/"
}
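From the client's side, discovery might look roughly like the following sketch. The function names are hypothetical, and it assumes the Micropub endpoint URL has no query string of its own and that the config request is sent with the same bearer token used for posting.

```python
import json
import urllib.request


def media_endpoint_from_config(config_json):
    """Return the advertised media endpoint URL, or None if there is none."""
    config = json.loads(config_json)
    return config.get("media-endpoint")


def fetch_media_endpoint(micropub_url, token):
    """Ask a Micropub server for its configuration and extract the media endpoint."""
    request = urllib.request.Request(
        micropub_url + "?q=config",
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return media_endpoint_from_config(response.read().decode("utf-8"))
```

A client that gets None back has no media endpoint to use and would need to fall back to sending files directly to the main Micropub endpoint.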

In addition to advertising the media-endpoint, your Micropub server must be able to handle lists of URLs in places where it would normally expect a file.

For example, when posting a photo from Quill without a media endpoint, your Micropub server will receive a multipart/form-data encoded file named "photo". When posting from Quill with a media endpoint, your Micropub server will instead receive a list of URLs represented as "photo[]=https://media.example.com/cc/...2cb.jpg". Presumably this pattern would hold for other media types such as video and audio, if you are using Micropub clients that support them.
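On the server side, picking the URLs out of a form-encoded request body can be sketched with the standard library as below. The function name photo_urls is hypothetical, the sketch also accepts a single "photo" value as a convenience, and it does not cover the multipart/form-data file upload case, which needs a multipart parser.

```python
from urllib.parse import parse_qs


def photo_urls(form_body):
    """Collect photo URLs from a form-encoded Micropub request body."""
    fields = parse_qs(form_body)
    # "photo[]" carries a list of URLs; a lone "photo" carries a single one.
    return fields.get("photo[]", []) + fields.get("photo", [])
```

For a body such as "h=entry&photo[]=https%3A%2F%2Fmedia.example.com%2Fa.jpg" this returns a one-item list holding the decoded URL, which the rest of the server can then store in the post instead of writing a file to disk.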

This particular step has been an interesting challenge for my site, which is a static site generated by Jekyll. My previous Micropub file-handling implementation expected all uploaded assets to live on disk next to the post files, and updating my Jekyll theme and plugins to handle the change is a work in progress. I eventually plan to move all my uploads out of the source for my project in favor of storing them with Spano.

Feedback Welcome!

Spano is probably my second public Python project, so I'd love feedback! If you try it out and run into issues, please drop me a line on GitHub. Or you can find me in the #indieweb chat on freenode IRC.

I'd also like to thank Kyle Mahan for his Woodwind Flask server application, which inspired the structure of Spano.