(Semi) handcrafted RSS

I’ve been using a minimalist blog setup for some time now.

I was having something of a framework fatigue after switching between a few static site generators. Trying each new generator usually meant learning either a new programming language (Python, Ruby, Go) just to perform basic setup, or a new template engine syntax1. I typically wasn’t using the vast majority of the features each generator offered. And finally, most of the generators I tried over the years rely on heavy configuration if I want to maintain the site’s organisation and look.

I’ve discussed with some friends and colleagues why, in my opinion, a plain HTML blog is still superior to other solutions (such as Markdown coupled with some generator framework). I’ll leave my arguments to a future post.

However, I am still using some form of a generator. The blog writing process at the moment is the following:

  • I write the content of each post as an HTML fragment (no HEAD, for instance). All files are HTML and live in the same folder.
  • A shell script walks through the files in the input folder, adds a common header and footer, and processes all code blocks with syntax highlighting.
  • The “processed” files are saved to an output folder.
  • The output is uploaded (currently to GitHub, to be served via GitHub Pages).
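The steps above can be sketched as a tiny build script. This is a minimal, hypothetical sketch (the file names and sample content are made up for illustration, not the actual site’s):

```shell
#!/bin/sh
# Sketch of the build step described above: wrap each HTML
# fragment in a shared header and footer. The sample files
# created here are only for the demonstration.
mkdir -p input output
printf '<html><body>\n' > header.html
printf '</body></html>\n' > footer.html
printf '<h1>A post-modern title</h1>\n' > input/post.html

for FILE in input/*.html; do
    FILENAME="${FILE##*/}"
    cat header.html "$FILE" footer.html > "output/$FILENAME"
done
```

Syntax highlighting and anything fancier would slot into the loop body, but the core of this kind of generator really is just cat.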

The HTML fragments are minimal, for instance:

<h1>A post-modern title</h1>

<p>Yes, this could be an entire blog post.</p>

The point is, where do we draw the line on what counts as a static generator? For this post, I won’t consider a loose collection of specialised scripts to be a static generator. There is no configuration, no convention, no theming ability2. You could argue that this is what many generators do under the hood, but I think that’s beyond the scope of this short post.

Since my static blog is straightforward, with minimal markup, why not create something equally simple for RSS generation? To do so, I’ve decided to go the way of “handcrafted” XML.

However, I was accustomed to having a static site generator produce some goodies, such as syndication feeds, automatically.

I’ve decided to add an RSS feed to the site, using minimal dependencies (only shell scripting and a couple of universal user-land tools such as grep and cat). This approach has the added benefit that it can be applied to expose other types of data as an RSS feed, such as server and periodic job logs.

We start by writing the feed header to index.xml:

cat >output/index.xml <<EOF
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
        <channel>

                <title>Rui Vieira's blog</title>
                <description>Rui Vieira's personal blog</description>
                <link>https://ruivieira.dev/</link>
                <lastBuildDate>$(date)</lastBuildDate>
EOF

The RSS 2.0 specification3 is quite simple in terms of the minimum requirements for a valid feed. The mandatory “header” fields are:

  • title, the name of the channel.
  • link, the URL to the HTML website corresponding to the channel.
  • description, phrase or sentence describing the channel.

In terms of feed items, according to the specification, at least one of title or description must be present, and all remaining elements are optional.
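In other words, an item carrying only a description is already valid. A minimal (hypothetical) example:

```xml
<item>
        <description>An item with only a description is still valid RSS 2.0.</description>
</item>
```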

We use the following in this feed:

  • title, the title of the item.
  • link, the URL of the item.
  • pubDate indicates when the item was published.

pubDate needs to conform to the RFC 822 date format.
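An RFC 822 date looks like Sun, 08 Feb 2026 12:00:00 +0000. GNU date can produce one directly with date -R (strictly speaking an RFC 2822 date, which feed readers accept); the explicit format string below should work on both GNU and BSD systems:

```shell
# Emit the current time in RFC-822-style format.
# LC_ALL=C keeps the day and month names in English,
# as the specification expects.
LC_ALL=C date "+%a, %d %b %Y %H:%M:%S %z"
```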

[!INFO] Just as an interesting tidbit: RFC 822 (which defines the Internet text message format) is one of the core email RFCs. It predates ISO 8601 (https://en.wikipedia.org/wiki/ISO_8601) by six years (1982) and is itself based on 1977’s RFC 733 (https://tools.ietf.org/html/rfc733).

We then loop over all the input files to build the RSS entries.

FILES=input/*.html
for FILE in $FILES
do
    FILENAME="${FILE##*/}"
    FILENAME="${FILENAME%.*}"
    # extract title ...
    # write entry to index.xml
done

Using Bash

First, extract the title. The actual title is not inside the <title> tag, but in the first <h1> header.

cat output/nb-estimation.html       |\
grep -E "<h1.*>(.*?)</h1>"          |\
sed 's/.*<h1.*>\(.*\)<\/h1>.*/\1/'

The grep stage produces the matching line:

<h1 id="negative-binomial-estimation">Negative Binomial estimation</h1>

While the sed stage produces:

Negative Binomial estimation

Now, what happens if we have more than one <h1> header? UNIX pipelines to the rescue: we simply retrieve the first matching line from grep by inserting a head -1 into the pipeline.
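A quick way to convince ourselves, using a hypothetical fragment with two headers:

```shell
# head -1 keeps only the first matching <h1> line,
# so the second header is discarded.
printf '<h1>First title</h1>\n<h1>Second title</h1>\n' |
    grep -E "<h1.*>.*</h1>" |
    head -1 |
    sed 's/.*<h1.*>\(.*\)<\/h1>.*/\1/'
```

This prints only First title.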

To get the modification date of $FILE (which already includes the .html extension) we can use:

date -r "$FILE"
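Note that -r behaves differently across implementations: GNU date reads the file’s modification time, while BSD/macOS date expects an epoch timestamp there. On a GNU system the modification time can be emitted directly in feed-ready form (the demo file name here is hypothetical):

```shell
# Create a demo file and print its modification time
# in RFC-822-style format (GNU date only).
touch demo.html
LC_ALL=C date -r demo.html "+%a, %d %b %Y %H:%M:%S %z"
```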

The final RSS feed build is:

cat >output/index.xml <<EOF
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
        <channel>

                <title>Rui Vieira's blog</title>
                <description>Rui Vieira's personal blog</description>
                <link>https://ruivieira.dev/</link>
                <lastBuildDate>$(date)</lastBuildDate>
EOF

FILES=input/*.html
for FILE in $FILES
do
    FILENAME="${FILE##*/}"
    FILENAME="${FILENAME%.*}"
    TITLE=$(grep -E "<h1.*>(.*?)</h1>" "output/$FILENAME.html" | head -1 | sed 's/.*<h1.*>\(.*\)<\/h1>.*/\1/')
    cat >>output/index.xml <<EOF
                <item>
                        <title>$TITLE</title>
                        <link>https://ruivieira.dev/$FILENAME.html</link>
                        <pubDate>$(date -r output/$FILENAME.html)</pubDate>
                </item>
EOF
done

cat >>output/index.xml <<EOF
        </channel>
</rss>
EOF

Using Python

Another possibility is to use a specialised tool to extract the data for an RSS item from an HTML file. To do so, we parse the necessary data in a dedicated script and replace the extraction part of the loop. This is, after all, along the lines of the UNIX philosophy4: create specialised tools with a focus on modularity and reusability.

To do so, we create a simple script called post_title.py. It uses the Beautiful Soup library, which you can install using:

$ pip install beautifulsoup4

The script reads an HTML file, extracts the title and prints it:

from bs4 import BeautifulSoup
import sys

with open(sys.argv[1], 'r') as file:
    data = file.read()

soup = BeautifulSoup(data, features="html.parser")

print(soup.h1.string)

This script can now be used to replace the title extraction:

cat >output/index.xml <<EOF
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
        <channel>
                <title>Rui Vieira's blog</title>
                <description>Rui Vieira's personal blog</description>
                <link>https://ruivieira.dev/</link>
                <lastBuildDate>$(date)</lastBuildDate>
EOF

FILES=input/*.html
for FILE in $FILES
do
    FILENAME="${FILE##*/}"
    FILENAME="${FILENAME%.*}"
    cat >>output/index.xml <<EOF
                <item>
                        <title>$(python post_title.py output/$FILENAME.html)</title>
                        <link>https://ruivieira.dev/$FILENAME.html</link>
                        <pubDate>$(date -r output/$FILENAME.html)</pubDate>
                </item>
EOF
done

cat >>output/index.xml <<EOF
        </channel>
</rss>
EOF

The reason the whole RSS feed is not generated in Python is to keep the title extraction as a small “function” which can be slotted into whatever logic the shell script is using.

I hope this is useful to you. Happy coding!


  1. As it turns out … I reverted to using a static site generator. More information can be found in the site details page.↩︎

  2. Apart from plain CSS theming, that is. ↩︎

  3. https://validator.w3.org/feed/docs/rss2.html ↩︎

  4. See Unix philosophy↩︎