(Semi) Handcrafted RSS

In the old days, before static site generators.

I’ve been using a minimalist blog setup for some time now.

I was having something of a framework fatigue after switching between a few static site generators. Each new generator I tried usually meant learning a new programming language (Python, Ruby, Go) just to perform the basic setup, as well as a new template engine syntax. Typically, I wasn’t using the vast majority of the features each generator offered. And finally, most of the generators I tried over the years relied on heavy configuration if I wanted to maintain the site’s organisation and look. I’ve discussed with some friends and colleagues why, in my opinion, a plain HTML blog is still superior to other solutions (such as Markdown coupled with some generator framework). I’ll leave my arguments for a future post.

However, I am still using some form of a generator. The blog writing process at the moment is the following:

The HTML fragments are minimal, for instance:

<h1>A post-modern title</h1>

<p>Yes, this could be an entire blog post.</p>

The point is, where do we draw the line on what a static generator is? For this post, I won’t consider a loose collection of specialised scripts to be a static generator. There is no configuration, no convention, no theming ability 1. You can argue that this is what many generators do, but I think that’s beyond the scope of this short post.

Since my static blog is straightforward, with minimal markup, why not create something equally simple for RSS generation? To do so, I’ve decided to go the way of “handcrafted” HTML.

However, I was accustomed to static site generators producing some goodies, such as syndication feeds, automatically.

I’ve decided to add an RSS feed to the site, using minimal dependencies (only shell scripting and a couple of universal user-land tools such as grep and cat). This approach has the added benefit that it can also be used to expose other types of data as an RSS feed, such as server and periodic job logs.
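As a rough illustration of that last point, the sketch below turns plain log lines into feed items. It is written in Python only to keep the example self-contained; the helper name `log_lines_to_items` and the one-item-per-line convention are assumptions made for this example, not something this site uses.

```python
from email.utils import formatdate
from xml.sax.saxutils import escape


def log_lines_to_items(lines, timestamp=None):
    """Render each non-empty log line as a minimal RSS <item>.

    Hypothetical helper: one item per line, with the line as the description.
    `timestamp` is seconds since the epoch; None means "now".
    """
    pub_date = formatdate(timestamp, usegmt=True)  # e.g. "Thu, 01 Jan 1970 00:00:00 GMT"
    items = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        items.append(
            "<item>"
            f"<description>{escape(text)}</description>"  # escape &, < and >
            f"<pubDate>{pub_date}</pubDate>"
            "</item>"
        )
    return items
```

The resulting strings could then be written between the feed header and footer, in the same way as the blog entries.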

We start by adding the feed header to the index.xml:

# date -R prints an RFC 822 style date (GNU date; check your platform's date)
cat >output/index.xml <<EOF
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
        <channel>

                <title>Rui Vieira's blog</title>
                <description>Rui Vieira's personal blog</description>
                <link>https://ruivieira.dev/</link>
                <lastBuildDate>$(date -R)</lastBuildDate>
EOF

The RSS 2.0 specification is quite simple in terms of the minimum requirements for a valid feed. The mandatory “header” (channel) fields are title, link and description.

In terms of feed items, according to the specification, at least one of title or description must be present, and all remaining elements are optional.
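These minimum requirements are easy to check mechanically. The following is a minimal sketch, assuming the feed is available as a string; `feed_problems` and `REQUIRED_CHANNEL_FIELDS` are names invented for this example, and the check covers only the rules mentioned above, not the full specification.

```python
import xml.etree.ElementTree as ET

# Channel elements that RSS 2.0 requires.
REQUIRED_CHANNEL_FIELDS = ("title", "link", "description")


def feed_problems(xml_text):
    """Return a list of human-readable problems; an empty list means none found."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    if channel is None:
        return ["missing <channel>"]
    problems = [
        f"channel is missing <{name}>"
        for name in REQUIRED_CHANNEL_FIELDS
        if channel.find(name) is None
    ]
    for position, item in enumerate(channel.findall("item"), start=1):
        # An item needs at least one of <title> or <description>.
        if item.find("title") is None and item.find("description") is None:
            problems.append(f"item {position} has neither <title> nor <description>")
    return problems
```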

We use the following in this feed: title, link and pubDate.

pubDate needs to conform with RFC 822.

As an interesting tidbit, RFC 822 (which defines the format of Internet text messages) is one of the core email RFCs. Published in 1982, it predates ISO 8601 by six years and is itself based on 1977’s RFC 733.
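For comparison, here is a small sketch of producing such a date from a file’s modification time in Python, using the standard library’s `email.utils.formatdate`. The function name `rfc822_mtime` is made up for this example. The output uses a four-digit year, which the RSS specification accepts (and prefers) even though RFC 822 itself used two digits.

```python
import os
from email.utils import formatdate


def rfc822_mtime(path):
    """Return the file's modification time as an RFC 822 style date in GMT."""
    return formatdate(os.path.getmtime(path), usegmt=True)
```

Calling `rfc822_mtime("output/nb-estimation.html")` would return a string of the form `Thu, 01 Jan 1970 00:00:00 GMT`.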

We then loop over all the input files to build the RSS entries.

FILES=input/*.html
for FILE in $FILES
do
    FILENAME="${FILE##*/}"
    FILENAME="${FILENAME%.*}"
    # extract title ...
    # write entry to index.xml
done
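The two parameter expansions in the loop strip the directory and then the extension. For readers less familiar with that shell syntax, a rough Python equivalent (the function name `slug` is an assumption for this example) would be:

```python
from pathlib import PurePosixPath


def slug(path):
    """Mirror ${FILE##*/} followed by ${FILENAME%.*} for ordinary file names."""
    name = PurePosixPath(path).name   # drop everything up to the last "/"
    return PurePosixPath(name).stem   # drop the last ".extension"
```

So `input/nb-estimation.html` becomes `nb-estimation`, which is then reused for both the link and the path of the rendered page.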

Using Bash

First, extract the title. The actual title is not inside the <title> tag, but in the first <h1> heading.

grep -E "<h1[^>]*>.*</h1>" output/nb-estimation.html | sed 's/.*<h1[^>]*>\(.*\)<\/h1>.*/\1/'

The first command (grep) on its own returns the whole matching line:

<div id="main"><h1 id="negative-binomial-estimation">Negative Binomial estimation</h1>

While the second (sed) keeps only the heading text:

Negative Binomial estimation

Now, what happens if we have more than one <h1> header? UNIX pipelines to the rescue. We simply retrieve the first line matched by grep, by inserting a head -1 into the pipeline.

To get the modification date of a post in the format pubDate expects, we can use (with GNU date):

date -R -r "output/$FILENAME.html"

The final RSS feed build is:

# Write the feed header. date -R prints an RFC 822 style date (GNU date).
cat >output/index.xml <<EOF
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
        <channel>

                <title>Rui Vieira's blog</title>
                <description>Rui Vieira's personal blog</description>
                <link>https://ruivieira.dev/</link>
                <lastBuildDate>$(date -R)</lastBuildDate>
EOF

# Add one <item> per post.
FILES=input/*.html
for FILE in $FILES
do
    # Strip the directory and the extension: input/foo.html -> foo
    FILENAME="${FILE##*/}"
    FILENAME="${FILENAME%.*}"
    # The title is the text of the first <h1> in the rendered page.
    TITLE=$(grep -E "<h1[^>]*>.*</h1>" "output/$FILENAME.html" | head -1 | sed 's/.*<h1[^>]*>\(.*\)<\/h1>.*/\1/')
    cat >>output/index.xml <<EOF
                <item>
                        <title>$TITLE</title>
                        <link>https://ruivieira.dev/$FILENAME.html</link>
                        <pubDate>$(date -R -r "output/$FILENAME.html")</pubDate>
                </item>
EOF
done

# Close the feed.
cat >>output/index.xml <<EOF
        </channel>
</rss>
EOF
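One caveat with the heredoc approach is that the extracted title is pasted into the feed verbatim, so a title containing a bare & or an HTML-only entity such as &nbsp; would make the XML invalid. The sketch below shows one way of handling this, by decoding HTML entities and then escaping for XML. It is an illustration rather than part of the script above, and `render_item` is a hypothetical helper name.

```python
from html import unescape
from xml.sax.saxutils import escape


def render_item(raw_title, link, pub_date):
    """Build one <item>, decoding HTML entities and re-escaping for XML."""
    title = escape(unescape(raw_title))  # "&eacute;" -> "é", bare "&" -> "&amp;"
    return (
        "<item>\n"
        f"  <title>{title}</title>\n"
        f"  <link>{escape(link)}</link>\n"
        f"  <pubDate>{pub_date}</pubDate>\n"
        "</item>"
    )
```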

Using Python

Another possibility is to use a specialised tool to extract an RSS item from an HTML file. To do so, we need to parse the necessary data and replace the extraction part of the loop. This is, after all, along the lines of the Unix philosophy: create specialised tools with a focus on modularity and reusability.

To do so, we create a simple script called post_title.py. It uses the Beautiful Soup library, which you can install using:

$ pip install beautifulsoup4

The script reads an HTML file, extracts the title and prints it:

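#!/usr/bin/env python3
"""Print the text of the first <h1> in the HTML file given as argument."""
import sys

from bs4 import BeautifulSoup

with open(sys.argv[1], "r", encoding="utf-8") as file:
    data = file.read()

soup = BeautifulSoup(data, features="html.parser")

# get_text() also copes with inline tags inside the heading,
# where .string would return None.
heading = soup.find("h1")
if heading is None:
    sys.exit(f"no <h1> found in {sys.argv[1]}")
print(heading.get_text().strip())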

This script can now be used to replace the title extraction:

# Write the feed header. date -R prints an RFC 822 style date (GNU date).
cat >output/index.xml <<EOF
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
        <channel>
                <title>Rui Vieira's blog</title>
                <description>Rui Vieira's personal blog</description>
                <link>https://ruivieira.dev/</link>
                <lastBuildDate>$(date -R)</lastBuildDate>
EOF

FILES=input/*.html
for FILE in $FILES
do
    FILENAME="${FILE##*/}"
    FILENAME="${FILENAME%.*}"
    # post_title.py is assumed to be in the current directory.
    cat >>output/index.xml <<EOF
                <item>
                        <title>$(python3 post_title.py "output/$FILENAME.html")</title>
                        <link>https://ruivieira.dev/$FILENAME.html</link>
                        <pubDate>$(date -R -r "output/$FILENAME.html")</pubDate>
                </item>
EOF
done

cat >>output/index.xml <<EOF
        </channel>
</rss>
EOF

The whole RSS feed is not generated in Python so that the title extraction remains a “function”, which the shell script can map to whichever logic it needs.
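To illustrate how that “function” could be swapped, here is a sketch of the same extraction using only Python’s standard library (`html.parser`), which avoids the Beautiful Soup dependency. The names `FirstH1` and `first_h1` are made up for this example, and it has not been tested against the posts on this site.

```python
from html.parser import HTMLParser


class FirstH1(HTMLParser):
    """Collect the text of the first <h1> element, including nested inline tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_h1 = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self.done:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_h1:
            self.in_h1 = False
            self.done = True  # ignore any later <h1>

    def handle_data(self, data):
        if self.in_h1:
            self.parts.append(data)


def first_h1(html_text):
    """Return the first <h1> text, or an empty string if there is none."""
    parser = FirstH1()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts).strip()
```

Wrapped in a small script that reads the file named in `sys.argv[1]` and prints the result, this could stand in for post_title.py without any change to the shell loop.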

Hope this could be useful to you. Happy coding!


  1. Apart from plain CSS theming, that is. ↩︎