
Getting Rid of Duplicates in the Hacker News Feed

I’m happy to say I’ve survived the Google Reader Apocalypse and switched over to Feedbin to synchronize my feeds. Reeder on my iPhone supports this aggregator, so from that perspective the change has had very little impact on me. I do miss the desktop experience I had with NetNewsWire, but the web interface for Feedbin is so much better than Google Reader’s that it removed my need for a dedicated application.

The only new issue I’ve encountered that I don’t remember from the Reader days is duplicated feed entries, which is a real nuisance only with a feed as active as Hacker News. The GitHub issue explains why this happens, but to summarize: Hacker News doesn’t assign a unique ID to each post, so Feedbin tries to generate one for it. Unfortunately, the way the ID is derived means that if a post’s title changes, as is often the case on Hacker News, a new ID is generated and the entry shows up as a completely different post in my feed. It got to the point where I became suspicious of visiting any post on a topic I’d already seen. But this sort of mental burden is exactly the kind of thing software is supposed to relieve us of!

Instead of having to do any mental pre-filtering, I thought I’d let a bot do that work for me. The solution to this problem seems to be pretty simple: if a post links to a URL that I’ve already seen, skip it. So I needed to be able to do the following:

1. Fetch an RSS feed.
2. Compare the links in each entry to the links in previous entries.
3. If the link is found, skip the entry, otherwise mark the entry as visited.
4. Produce a curated feed from the source containing only new entries.

The first step would be the most technically challenging, but instead of writing it from scratch I assumed there would be a library I could use. Python has a pretty good one, feedparser, and I’d been wanting to try the language as a break from what I normally use. It was easy to point it at a URL and get back an object containing the metadata I needed. The next step was some way to compare the links in the source feed to links I had visited previously. An SQLite database made this pretty simple: I just created a table containing a date and a URL. The date isn’t used yet, but I knew I could go back later and keep the database size down by pruning links older than, say, a month.
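In rough outline, the fetch-and-compare steps came together like this. This is a simplified sketch rather than the exact code; the feed URL is real, but the database file name, table layout, and the is_new() helper are my own stand-ins:

```python
import sqlite3
from datetime import datetime, timezone

import feedparser

HN_FEED = "https://news.ycombinator.com/rss"

conn = sqlite3.connect("seen_links.db")
conn.execute("CREATE TABLE IF NOT EXISTS links (date TEXT, url TEXT)")

def is_new(url):
    """Return True if we haven't seen this URL before, recording it if so."""
    if conn.execute("SELECT 1 FROM links WHERE url = ?", (url,)).fetchone():
        return False
    # The date column isn't used yet, but it makes pruning old links possible.
    conn.execute("INSERT INTO links VALUES (?, ?)",
                 (datetime.now(timezone.utc).isoformat(), url))
    conn.commit()
    return True

# feedparser returns an object whose .entries each carry a .link attribute.
feed = feedparser.parse(HN_FEED)
fresh_entries = [e for e in feed.entries if is_new(e.link)]
```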

Now I had a system for looking at an RSS feed and finding links that had already been published, but I needed a way to make use of it. It was pretty easy to construct a template with Jinja and build the feed, but that wasn’t useful unless I could replace the source feed with this curated one. I thought about copying the file to Dropbox, but decided instead to use Amazon S3. There is a really nice Python library for AWS, boto. I was able to take the template, create a uniquely named bucket, and upload the feed. Making the bucket and the feed object publicly readable allows the Feedbin bot access, and I could then subscribe to the S3 URL directly. I wrote a small shell script and added a crontab entry to visit the source feed once an hour, which gives Feedbin plenty of time to see the curated feed before it updates.
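The publishing half looks roughly like this, picking up the fresh_entries list from the sketch above. The template, bucket name, and key name here are illustrative rather than the real ones, and the S3 calls are the classic boto 2.x API:

```python
import boto
from jinja2 import Template

# A minimal RSS 2.0 template; the real one can carry more entry metadata.
RSS_TEMPLATE = Template("""\
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel>
  <title>Hacker News (deduplicated)</title>
  <link>https://news.ycombinator.com/</link>
  <description>Curated Hacker News feed</description>
  {%- for e in entries %}
  <item><title>{{ e.title | escape }}</title><link>{{ e.link | escape }}</link></item>
  {%- endfor %}
</channel></rss>""")

xml = RSS_TEMPLATE.render(entries=fresh_entries)

s3 = boto.connect_s3()  # credentials come from the environment or ~/.boto
bucket = s3.lookup("my-curated-hn-feed") or s3.create_bucket("my-curated-hn-feed")
key = bucket.new_key("hn.xml")
key.set_contents_from_string(
    xml, headers={"Content-Type": "application/rss+xml"})
key.set_acl("public-read")  # so the Feedbin bot can fetch it
```

Subscribing in Feedbin is then just a matter of pointing it at the public S3 URL for hn.xml.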

After using this for a week, I’ve only run into one problem: since I was using a direct string comparison on the entry URL, a trailing slash in one URL but not the other made the same link appear as a unique entry. That was easy enough to fix, and I’m pretty happy with how it has worked. There is plenty of room for improvement, and the code is on GitHub with an MIT license.
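One simple way to handle that is to normalize URLs before comparing them, e.g. stripping any trailing slash. A sketch, reusing is_new() from above, with normalize() as my own stand-in name:

```python
def normalize(url):
    """Treat http://example.com/foo and http://example.com/foo/ as the same link."""
    return url.rstrip("/")

# Applied before every lookup and insert, so stored and incoming URLs match.
fresh_entries = [e for e in feed.entries if is_new(normalize(e.link))]
```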