Technical Deep Dive: What it’s like to build at #100MillionCitizens Scale
By Torleiv Flatebo, Director of Platform Engineering and Site Reliability at GovDelivery
While GovDelivery is now over 15 years old, we have always been a Software-as-a-Service company that prides itself on pushing the envelope in cutting-edge government software. That said, over the course of the last decade, technology has changed and so have we: we’ve gone through a lot of different backends and frontends to get to where we are now.
Today we’re pleased to announce that GovDelivery technology has over 100 million users spread across the globe and powers over 50 million messages a day. (See more in the announcement from our CEO Scott Burns.)
With this scale and reach, the technology behind the platform is more crucial than ever. In our Platform Engineering group here at GovDelivery, we support and work on configuration management, deploy tooling, continuous integration, data center automation, and backend services to power communications between 1,000+ agencies and 100+ million citizens. Naturally, I sometimes get the question, “What’s it like to build technology at that scale?”
Three guiding principles keep our technology strategy grounded:
- Future-oriented architecture: We’ve thought about the tools we use and have consciously chosen ones that enable us to scale our platform up to 100M users and beyond by handling heavy traffic and being quick to market for new features.
- Open, agile, and collaborative processes: We incorporate monitoring, open source, pull requests, Scrum, and CI to make software we’re proud to put our name on.
- Reliability at scale: Uptime is really important to us and we have smart people thinking hard about how to make it better, no matter how good it currently is.
The first version of our infrastructure ran on Windows, and we used Java everywhere. This took us farther than I would have ever expected, but as we grew past 10 million, 20 million, 50 million, and now 100 million subscribers, we have evolved our architecture into a much more scalable system.
Our current architecture runs primarily on Ruby on Rails and Java, all of our engineers use Mac OS X or Linux, and our teams strongly believe in test-driven development (TDD), pull requests, and continuous integration (CI). Consistency is essential, as is designing for scale. At the core of the architecture (conceptually and technically) is a set of REST APIs that enable capabilities across product lines and reduce duplication of code. For example, we send a lot of email (more than six billion already this year, and counting), so when a new application needs to send email, we hook it up to a centralized API that manages the complexity. The new app only needs to make REST calls and gets a lot of functionality for free. As we have expanded the GovDelivery product line (with open data platforms from NuCivic and interactive text messaging from Textizen), this strategy has allowed us to grow and scale functionality fairly seamlessly.
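To make the "new app only needs to make REST calls" idea concrete, here is a minimal sketch in Python (chosen for brevity; it is not a statement about what our services are written in). The endpoint URL, the payload fields, and the bearer-token header are all hypothetical placeholders for illustration, not the real GovDelivery API. The function only builds the request; sending it is left to the caller.

```python
import json
from urllib import request

# Hypothetical endpoint: the real API's URL and schema are not shown in this post.
EMAIL_API_URL = "https://messaging.example.internal/v1/emails"


def build_send_email_request(recipients, subject, body, api_token):
    """Build (but do not send) a POST request to the hypothetical email API.

    The calling application only describes *what* to send; queuing, retries,
    and delivery are assumed to be handled by the centralized service.
    """
    payload = {
        "recipients": list(recipients),  # assumed field name
        "subject": subject,              # assumed field name
        "body": body,                    # assumed field name
    }
    return request.Request(
        EMAIL_API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",  # assumed auth scheme
        },
        method="POST",
    )
```

A caller would pass the returned object to `urllib.request.urlopen` (or an equivalent HTTP client). The point of the design is that the application code stays this small no matter how much delivery logic lives behind the API.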
It’d be foolish, though, to think that there weren’t bumps along the way and lessons learned. As we have grown, we have seen a variety of scaling issues, and we try hard to constantly learn and evolve. We use a few rules to help us:
- Bias towards collaboration and consistency: We all work together in an open space, and we heavily use persistent chat rooms to foster communication across teams and visibility for everyone to see what is going on.
- Break down silos: We have regular crossover between teams and embed people in other teams, both to transfer knowledge and to act as a conduit for immediate resolution of anything that needs escalation or assistance from another group.
- Treat 50% capacity as 100%: We need to be able to double at any time.
- Use metrics (thoughtfully): Some issues aren’t really issues. We use data to help understand whether something is a bump in the night or an indicator of something bigger.
- Isolate and categorize all production incidents or anomalies: Early detection is key to identifying novel issues and solving them before they affect our customers.
- Transparency: Put all of your issues in one place where everyone can see them, anytime.
- Focus product teams: Focus as much as possible on a few initiatives, and see those through from MVP to Beta to 1.0.
- Prioritize scale issues: Just because a problem seems like it’s down the road, don’t let it go unnoticed or forgotten; scale problems need to be scoped and fixed diligently.
- Don’t reinvent the wheel: We look to use open source tools first before anything else. Check out our GitHub to see what we are up to.
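The "treat 50% capacity as 100%" rule in the list above boils down to simple arithmetic: if load must be able to double at any time, a system is effectively full once it reaches half of its measured capacity. The sketch below shows one way to express that check in Python. The function names and the `headroom_factor` parameter are illustrative assumptions, not taken from our tooling.

```python
def effective_utilization(current_load, max_capacity, headroom_factor=2.0):
    """Return utilization after reserving room for load to grow by headroom_factor.

    With the default factor of 2.0, a system at 50% of its measured capacity
    reports an effective utilization of 1.0 (that is, 100%).
    """
    if max_capacity <= 0:
        raise ValueError("max_capacity must be positive")
    return (current_load * headroom_factor) / max_capacity


def needs_more_capacity(current_load, max_capacity, headroom_factor=2.0):
    """True when the system could no longer absorb a headroom_factor-fold spike."""
    return effective_utilization(current_load, max_capacity, headroom_factor) >= 1.0
```

For example, a queue rated for 100 units of load would be flagged at 50 units, while one running at 40 units would still have room to double.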
We have an amazing team of engineers constantly working to predict scaling issues, and handle performance issues as they crop up in production. (SaaS really helps us here; we can solve a problem across all of our customers in the blink of an eye.) And we are constantly looking to innovate. There are a lot of new technologies out there that newer companies are using. We like new stuff, but we also have to balance our scale, efficiency, and security needs against using the newest containers or tools.
Some things we are looking at now:
- How can we further automate the spinup of new applications as we break up our stack into more services? What are ways to eliminate complexity and duplication?
- Can we reduce risk at deploy time and speed up the deployment cycle?
- How do we incorporate the next generation of configuration management, cloud services, and monitoring/metrics?
Come help us out
Even though GovDelivery is now a 200+ person company, each member of the team is empowered, and in fact encouraged, to help answer those questions and identify new ones. That’s the only way we can continue to innovate at scale. To be sure, it’s hard work, but it’s fun and important too. We are building software for the public good at a massive scale. And we are always looking for good people eager for a good challenge. If you’re interested in joining the GD Engineering team, have a look at GovDelivery.com/Company/Careers, stop by in St. Paul or DC, or don’t hesitate to drop us a line (or a pull request).
Let’s work together on building this system for the next 100 million.
(Editor’s Note: If you want to learn more about GovDelivery’s scale, and the impact our clients are having with the platform, check out the 100M announcement at govdelivery.com/100M)