Five Years Running a News Site on JAMStack
Part I: History and Architecture
I started working as the Director of Technology at Spotlight PA the Tuesday after Memorial Day, 2019, over five years ago. There have been a lot of changes in technology, journalism, and the world since then, not least of which was the COVID-19 pandemic. I thought this anniversary would be a good opportunity to look back and take stock of what went well, what went poorly, what I think would have been the best choice I could have made at the time, and what choices I would make if I had it to do again now.
I’m planning to write this as a four-part series, with the first part focused on the broad-strokes history of Spotlight PA’s technical architecture and later parts drilling down on specific practices, services, and technologies.
“Five Years Running a News Site on JAMStack” has also been accepted by SRCCON 2024 as a session. I don’t believe the session will be recorded, so if you read this and are interested in learning more, please come see me in Minneapolis this August. SRCCON has always been my favorite conference, and I’m looking forward to being back this year.
History
Spotlight PA is an independent, nonpartisan, and nonprofit newsroom dedicated to high-quality investigative and public-service journalism about the Pennsylvania state government and urgent statewide issues. We produce quality journalism and publish it on our own website as well as sharing it for free with more than 100 community newsrooms across the state. We don’t use programmatic ads on our site. Instead we rely on donations from individual readers, media partners, and philanthropic organizations. While we don’t have a physical product, our stories are frequently reprinted on the front pages of local newspapers, such as the Philadelphia Inquirer, Bucks County Courier Times, Scranton Times-Tribune, and many more.
I joined Spotlight PA as the Director of Technology in 2019. In the time since then, I have either been the lone developer on the site or the lead of two developers. It has been an extremely small technical team, but I think we’ve created a site that compares quite well with its category peers. When I was hired, the newsroom for Spotlight PA was still in the process of being formed, and there was no website at all. In fact, I sort of wondered if our Editor in Chief Chris Baxter had maybe gotten the name of his site wrong, because what I did find was PA Spotlight, which is a different site that was funded by a left-wing dark money group.
Originally, Spotlight PA was a collaboration by the Philadelphia Inquirer, the Pittsburgh Post-Gazette, and LNP Media under the auspices of the Lenfest Institute for Journalism. In 2023, we spun off into our own independent 501(c)(3), but for the first four years, my paycheck was signed by the Inky. Because of the Memorial Day holiday, I started work on a Tuesday by taking the train from my home in Baltimore up to Philly for new employee orientation. On the way there, I put a file on Netlify that just said “Coming Soon!” and eventually got someone at the Inquirer to point our DNS at it so we could register with Google and start building up SEO credit.
By Thursday, I had a splash page up with a moody looking public domain picture of the state capitol and our basic mission statement.
This was a simple static site powered by Hugo. The most important thing I learned from working at The Atlantic was to give your projects funny history-themed names (they had Ollie and Waldo), so I called the static site Poor Richard and put it up on GitHub as open source. Eventually, I started using Netlify’s forms feature to put up a page to collect email addresses for our future newsletters. The layout of the brochure site was done with Bulma, a slightly nicer Bootstrap clone.
JAMStack is a marketing term made up by Netlify. Officially it means JavaScript, APIs, and Markup, but in practice it means “a static site plus all the other bits and bobs you need to actually run a website in the real world.” I launched the brochure site on Hugo and JAMStack because I had experience using that approach for this blog and the Baltimore Sun 2018 election site. The plan, however, was to work over the summer to replace the brochure site with a “real site”. The future real site was codenamed Hippogriff and would be Nuxt in the front and Django out back. I had used Vue at the Baltimore Sun, and I had liked it, so I wanted to try using an SSR framework to get the benefits of a modern JavaScript frontend with rendered HTML from the server. Django was created for use at the Lawrence Journal-World (who now use WordPress!💔), and is a common choice in newsrooms because of its easy-to-set-up admin interface.
As the summer went on, however, it became clear that I would not be able to get Hippogriff out the door by September, when we needed to launch our first story for grant fulfillment purposes. At this point, I pivoted to trying to get a site together using WordPress and Largo (called Port Deposit as it would bridge us to the future site). However, I had never developed on WordPress professionally before, and I found the combination of local file and database interactions plus legacy CSS hard to work with. Chris didn’t like the look of the in-progress site, so he asked if we could just take the brochure site, make an article page and a homepage, and call that good enough for now.
And so in the grand tradition of all temporary hacks, Poor Richard became the main web presence for Spotlight PA and remains so to this day. At this point, deploying to Netlify takes around 5 minutes, which is basically the same as the amount of time it takes to clear the CDN on most large sites, so for our publishing needs, it’s fine. If we ever find ourselves waiting 10 or 20 minutes for a deploy it might be worth it to finally migrate to a “real” site, but for the foreseeable future that isn’t on the horizon. Using GitHub as our content store has fun side effects, like letting you do an author rename as a pull request and other content management tricks that would be harder to do with a real database.
Early on it was decided that reporters would file stories into Arc, the CMS developed by the Washington Post and used by the Inquirer. Once I knew a story was done, I would go in and copy the content, run it through a filter to turn it into Markdown, and then paste into a new file on the Hugo site. We were only publishing around once a week, so the amount of time involved wasn’t overwhelming, but it did take away from other work I could be doing. It felt like this was the process for a long time, but in retrospect, the actual period we used this system was fairly short.
By October, there was one CLI app to check for new stories and post a link to them in Slack and another CLI to pull down the stories from Arc and format them. But one problem was that there was no scheduler, so I would still have to be available to publish the stories at whatever time they were supposed to go live. Relatively early on, I hacked together a solution to that problem by starting an EC2 instance and putting another CLI on it to push out changes on schedule via Git. We also had Decap CMS (then called Netlify CMS) to let editors make some basic site content changes on their own.
It was, however, painfully clear that we would need a better system for publishing and also a way to share the incoming stories with our partners. Thus the Almanack was born in late November. Like the almanac in Back to the Future Part II, it predicted future stories. In this case, it did so not by use of a flux capacitor but by using the Arc API to extract stories that were marked as finished drafting and scheduled for publication. Registered partners would be able to log in, see the list of past and upcoming stories, and copy the rich text or HTML for pasting into their own CMSes. At first, it was just a very, very thin Go backend that only existed to do auth and hide our Arc API key, plus a Vue frontend that took the Arc JSON and presented it as a story for partners. I tried to get it working with Auth0, but that was too complicated, so I used Netlify Auth. That locked us into using Netlify for the hosting, which unfortunately also continues to this day.
Before long, I put a Heroku Redis instance behind Almanack to cache responses from Arc (which could be a bit slow). With persistence in place, we could email partners when we noticed a new story and eventually even schedule stories to publish on our site, finally freeing me from the responsibility to do it.
In March 2020, I added Heroku Postgres with sqlc as a not-an-ORM to keep track of partner domains. Also in March, the pandemic hit America, we closed our Harrisburg office except for an individual reporter, and we went from publishing around one story a week to multiple stories per day. This was a huge increase in our pace as a site, and monthly traffic went up to six figures, where it has stayed ever since, although not at the pandemic-level heights.
Eventually I realized that I was using Redis as a database, which was redundant with having Postgres, so I migrated everything to Postgres and removed Redis. Over time more and more things got saved into Almanack, like the most popular article list scraped from Google Analytics or MailChimp newsletters we saved and republished on our own site. Almanack got options to turn on and off house ads and various promotions. Story images started by being stored along with the content on GitHub, but before too long were moved to S3 with imgproxy for dynamic sizing, so that the repo wouldn’t bloat out to gigabytes, and Almanack eventually learned how to manage those uploads. Full-text search was added that summer using Algolia. Our public-facing site was and still is a static site, but increasingly it was supported by dynamic services to augment its capabilities.
We have had a number of special projects and interactives over the years, but one that led to infrastructure still in use today was the mid-2020 COVID Alerts project. My colleague Dan Simmons-Ritchie had the idea to make a newsletter that would send you the COVID numbers for your county in Pennsylvania every week. I made a Go server to manage subscribers to the list on SendGrid. The newsletter was shut down after Dan left Spotlight PA (but right before delta hit 😓), but the server has stuck around as a proxy layer to manage subscribers to our MailChimp newsletters and filter out spammy signups.
On the frontend, at first there was just some vanilla JavaScript, but for the donation page, I used Vue for basic validation. The donation page eventually got replaced with a link out to a vendor’s site, and the remaining bits of Vue were replaced with Alpine.js. I created a feature request for Hugo that led to esbuild being built into Hugo, which simplified our JavaScript build process considerably. Our 2020 summer intern Kent Wilhelm created a new template for featured stories. I rewrote it in Tailwind after he left, and then Tailwind slowly replaced Bulma as the CSS framework on the rest of the site. We contracted with The New Dynamic in 2022 to redo our About page, and some of how they structured content blocks has had a big influence on more recent page layouts.
There are a lot of other twists and turns to Spotlight PA that I could write about, and I will go into more detail about specific technologies in a subsequent post, but to wrap up this historical retrospective, I want to talk about our move off of Arc as the CMS for editors. All the way back at The Atlantic, I had observed that editors and reporters always just use Google Docs for their collaborations, so I wanted to write something that would let them import a story into the CMS directly from there. (I was not the only one to think of this. The CMS for Tiny News Collective works this way too.) First, I wrote a CLI for my own use that would extract the text from a Google Doc. Then I wrote a CLI to take HTML and turn it into Markdown. For a while, I would just use those tools myself in a pipeline whenever I needed to format a special feature story for the organization. Next those two CLIs got merged into a common project with a web frontend, and then eventually the whole thing got incorporated into Almanack in 2023, just in time for the move off of the Inquirer’s CMS. Along the way I came up with a system where formatted tables in a Google Doc can be used for the various bits of non-text in a story like images, HTML embeds, tables of contents, and so forth.
ArchieML was created by people at the New York Times to serve a similar purpose. It’s a YAML-like format that is supposed to be mixed in with paragraphs of normal text in a document. Many newsrooms use it for collaborations between developers and journalists on interactives and special features. It works in that role, but I think it’s a little too code-like to ask editors to deal with it on a daily basis. By contrast, here is a screenshot of a recent article about the Governor’s private plane usage:
While still a bit code-like, at least you can see the photo and it’s relatively clear that the photo is special and not a paragraph that just happens to have a colon in a certain place.
Everyone hates their CMS, but it is my belief that the best CMS is one where users and developers work closely together so that workflows are adjusted to suit the users, rather than users adapting to the inflexible demands of the system.
So that is how Spotlight PA accidentally ended up running a news site on the JAMStack for five years. I would like to thank everyone who helped make this possible, especially Chris Baxter for trusting me with the role; my former coworkers Dan Simmons-Ritchie, Sara Simon, Ethan Edwards Coston, and Jeff Rummel; and the many readers who have funded our work.
Spotlight PA currently has an open position for a Newsroom Developer. If the description of the technical architecture here sounded interesting to you, please apply.
For everyone else, stay tuned to this channel for future posts on the practices, services, and technologies that make up Spotlight PA.