Nimble Industries

Under The Hood: Inside a Status Page Aggregator


StatusGator is a status page aggregator. We monitor the world’s status pages and provide a unified dashboard which tracks the status pages each user cares about. But how do we collect and normalize all this data?

To get started with StatusGator, you choose the services you already use from our list. For each service, you select the specific components from its status page that you depend on. StatusGator aggregates the status from all these pages, helping you reduce time wasted tracking down problems and keeping you informed when outages inevitably occur. We do this with an army of electronic Checkers.

Checkers: The Engine of StatusGator

The process starts with what we call a Checker. A Checker is a bit of software that understands the format of a specific status page. We write a Checker for each status page we add. Sometimes they are for specific services, such as Amazon Web Services. Other times we write generic Checkers for status page formats that span multiple services. These include hosted status page services like Status.io and open source tools like Cachet.

Exactly how a Checker works varies greatly from page to page. Some services, such as the Heroku status page, have published and documented status APIs. Others have undocumented APIs from which the data can be reliably fetched. Most simply present a web page and so we resort to scraping the HTML.
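As a rough illustration of the scraping case, here is a minimal sketch using only Python's standard library. The markup in SAMPLE_HTML, the class name ComponentScraper, and the function scrape are all hypothetical, invented for this example. Real status pages each have their own structure, and this is not StatusGator's actual code.

```python
from html.parser import HTMLParser

# Hypothetical markup: real status pages differ, which is why each
# page format needs its own Checker.
SAMPLE_HTML = """
<ul>
  <li class="component"><span class="name">API</span><span class="status">Operational</span></li>
  <li class="component"><span class="name">Dashboard</span><span class="status">Degraded</span></li>
</ul>
"""


class ComponentScraper(HTMLParser):
    """Collects (name, raw status text) pairs from the sample markup."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished (name, status) pairs
        self._field = None   # "name" or "status" while inside such a span
        self._current = {}   # text gathered for the component in progress

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "status"):
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside a name or status span.
        if self._field:
            self._current[self._field] = self._current.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li":
            # Emit the row only if both fields were found.
            if "name" in self._current and "status" in self._current:
                self.rows.append(
                    (self._current["name"].strip(), self._current["status"].strip())
                )
            self._current = {}


def scrape(html):
    """Parse the HTML and return a list of (name, raw status) tuples."""
    parser = ComponentScraper()
    parser.feed(html)
    return parser.rows
```

Calling scrape(SAMPLE_HTML) yields the component names paired with the page's own status wording, which still has to be classified before it is useful across pages.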

Some status pages have a single overall indicator of the service’s status, usually color coded in green, red, and yellow or orange. But most pages list individual services, which we call components. These components are often organized into logical groups by service type or region. Each component typically has a message, icon, or color code that relays its status. We design each Checker to return the same data: a list of components, with three pieces of data about each:

  • Name
    The name of the component or service.
  • Group Name
    Most often a region like “US West” or a type like “APIs”.
  • Status
    One of four classifications: up, warn, down, or maintenance.
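One way to picture this common return shape is a small record type. The sketch below is an assumption about how such data could be modeled; the names Status, Component, and example_checker_output are hypothetical, and the sample values are illustrative only.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """The four classifications every Checker reports."""
    UP = "up"
    WARN = "warn"
    DOWN = "down"
    MAINTENANCE = "maintenance"


@dataclass(frozen=True)
class Component:
    """One row of Checker output: name, group name, and status."""
    name: str
    group_name: str
    status: Status


def example_checker_output():
    """Illustrative values only, not real status data."""
    return [
        Component(name="EC2", group_name="US West", status=Status.UP),
        Component(name="REST API", group_name="APIs", status=Status.WARN),
    ]
```

Because every Checker returns the same structure, the code that stores changes and sends notifications does not need to know which page format the data came from.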

Normalization Challenges

Normalizing the statuses into one of four buckets is the one part of this process that often requires us to make a judgment call. We try to use the same standard of classification across all status pages. For example, if a page has a “degraded” or “partial outage” level, we always classify that as warn. If a page uses terminology like “severe”, we call that down. Occasionally, a service will say “unknown” or “investigating”, and we classify both as warn.
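The rules above can be sketched as a simple lookup table. The entries for degraded, partial outage, severe, unknown, and investigating come from the examples in this article. The entries for "operational" and "scheduled maintenance", along with the names RAW_TO_STATUS and normalize, are assumptions added for illustration.

```python
# Raw status wording mapped to one of the four classifications.
RAW_TO_STATUS = {
    "degraded": "warn",
    "partial outage": "warn",
    "unknown": "warn",
    "investigating": "warn",
    "severe": "down",
    # Assumed entries, not taken from the article:
    "operational": "up",
    "scheduled maintenance": "maintenance",
}


def normalize(raw_status):
    """Classify a page's own wording as up, warn, down, or maintenance.

    Unrecognized wording raises ValueError so that a person can make
    the judgment call instead of the code guessing.
    """
    # Lowercase and collapse whitespace so "Partial  Outage" still matches.
    key = " ".join(raw_status.lower().split())
    try:
        return RAW_TO_STATUS[key]
    except KeyError:
        raise ValueError(f"unclassified status wording: {raw_status!r}") from None
```

Failing loudly on unfamiliar wording is a design choice for this sketch: it keeps the classification standard consistent rather than silently defaulting new terms to one bucket.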

Our hope is that StatusGator subscribers who want notifications of only the most severe or definite outages can receive them by subscribing to down notices, ignoring the intermittent or lower-level outages of the warn level.

Every 5 minutes, we pass each page through its respective Checker. When components of the status page change, StatusGator records the changes. We then send notifications to each subscriber based on their individual preferences. For example, a user might have elected to receive a notification in Slack each time AWS posts an outage affecting EC2 in the US-West region. When that component of Amazon’s status page changes, a notification is sent, and when service is restored, another is posted.
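The change detection step might look something like the sketch below, which compares two consecutive snapshots of a page. The function names diff_statuses and changes_for_subscriber, and the choice to key components by (group name, name), are assumptions for illustration. The sketch also assumes that a component seen for the first time does not count as a change.

```python
def diff_statuses(previous, current):
    """Return (key, old_status, new_status) for every changed component.

    Both arguments map a (group_name, name) tuple to a status string.
    Components missing from the previous snapshot are skipped (assumption).
    """
    changes = []
    for key, new_status in current.items():
        old_status = previous.get(key)
        if old_status is not None and old_status != new_status:
            changes.append((key, old_status, new_status))
    return changes


def changes_for_subscriber(changes, watched_keys):
    """Keep only the changes for components this subscriber selected."""
    return [change for change in changes if change[0] in watched_keys]


# Illustrative snapshots from two consecutive checks.
before = {("US-West", "EC2"): "up", ("US-West", "S3"): "up"}
after = {("US-West", "EC2"): "down", ("US-West", "S3"): "up"}

changes = diff_statuses(before, after)
to_notify = changes_for_subscriber(changes, {("US-West", "EC2")})
```

Running the same comparison on the next check, once EC2 returns to up, would produce a second change record, which corresponds to the recovery notification.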

StatusGator currently checks more than 500 status pages containing 25,000 individual service components. To make StatusGator as useful as possible, we try to add new pages every week, and we rely on your suggestions to find them. If you see a status page that’s not represented on StatusGator, email us! We love to add them and take great pride in being able to add them very quickly.
