Netflix is a collection of microservices that all come together to enable the product you have come to love. Operating these microservices is also distributed across the owning teams and their engineers. We do not run a central operations team that manages these individual services for availability. Instead, we invest in tools that help Netflix engineers operate their services for high availability and resiliency. Today, we are going to talk about one such tool recently built for Netflix engineers: Winston.
Problem space
Consider a typical mid-tier microservice at Netflix. It is a single-purpose service hosted on AWS. It uses Jenkins for builds, Spinnaker for deployment and Atlas for monitoring. Alerts are configured on top of metrics using the Atlas stack language. Atlas supports triggering a set of predefined actions when an alert fires, namely instance-level remediation (terminate instance, reboot, remove from service discovery, etc.), escalations (email, page) or publishing to SQS for further integration.
Any action beyond this small set is not a first-class citizen within the Atlas framework, a deliberate choice to reduce complexity and preserve the resiliency of Atlas itself. Let’s call any of these custom steps for diagnostics and remediation a runbook. Hosting and executing these runbooks usually took one of two forms:
- An email or page to a human who has either documented these runbooks on a wiki/doc or written one-off tools and scripts to code them up.
- A custom microservice that listens to the SQS integration point from Atlas and implements the runbook.
Both of these approaches have drawbacks. Escalating to humans so they can perform manual, repeatable tasks is not the best use of our engineers' time. No one likes to get paged and wake up in the middle of the night to follow some documentation or kick off a script or tool that a piece of software could have easily run.
Building a custom microservice means that the application team now takes on the additional burden of keeping that service highly available and resilient, building integrations with Atlas or other monitoring tools in use, managing the deployment lifecycle and dependency deprecation cycles, and worrying about safety and security as well. Engineers would rather not deal with these ongoing infrastructure tasks just to host and execute the scripts that encapsulate their business logic.
Winston was created to help engineers automate their runbooks without managing the infrastructure and associated core features. And in case you are wondering, it’s named after Winston Wolfe, a character from the movie Pulp Fiction who has a “runbook” for solving problems and creates a controlled, safe environment in which to execute it.
Winston to the rescue
Winston provides an event-driven runbook automation platform for Netflix engineers. It is designed to host and execute runbooks in response to operational events like alerts. Winston’s goal is to act as Tier-1 support for developers: engineers can outsource their repeatable diagnostic and remediation tasks and have them run automatically in response to events.
Customers provide the following inputs to configure Winston for their use case:
- Runbook as code (for our first version, we only support Python for authoring runbooks); a minimal sketch follows this list.
- Events (like Atlas alerts) that should trigger that runbook. This can be one or many.
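To make the first input concrete, here is a minimal sketch of what a Python runbook could look like. The `run` entry point, the event payload fields and the `notify_oncall` helper are illustrative assumptions, not Winston's actual contract.

```python
# Hypothetical runbook sketch. The entry-point signature, the payload
# fields and notify_oncall() are assumptions made for illustration.

def run(event):
    """Invoked when a configured event (e.g. an Atlas alert) fires."""
    instance_id = event.get("instance_id")  # assumed payload field
    if instance_id is None:
        return {"status": "skipped", "reason": "no instance in event"}

    # Diagnostics: gather the context an on-call engineer would
    # otherwise collect by hand.
    diagnostics = {"instance": instance_id, "alert": event.get("alert_name")}

    # Remediation steps, guarded by safety checks, would go here.
    notify_oncall(diagnostics)
    return {"status": "done", "diagnostics": diagnostics}


def notify_oncall(context):
    # Stand-in for an escalation integration (email/page).
    print("escalating with context:", context)
```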
Winston, in turn, provides the following features that make it easier for Netflix engineers to build, manage and execute their runbooks.
Self-serve interface - Winston Studio
From the get-go, we aimed to make Winston a self-serve tool for our engineers. To improve usability and make it easy to experiment and iterate, we created a simple, intuitive interface for our customers. Winston Studio is our one-stop shop for onboarding new runbook automations, configuring existing ones, viewing logs from production runs, debugging and managing the runbook lifecycle.
Here is a snapshot of a runbook automated by the Real Time Data Infrastructure team at Netflix to troubleshoot and remediate the problem when one of their Kafka brokers is detected to be offline. As the snapshot shows, customers can write code to automate their runbooks, configure the events that trigger execution, configure failure notification settings and manually run their automation to test changes before deploying them. A hedged sketch of what such a runbook might look like follows.
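The sketch below shows the rough shape of a "Kafka broker offline" check. The payload field and the remediation stub are assumptions; only the TCP reachability probe (Kafka's default port is 9092) is concrete.

```python
# Sketch of a broker-offline runbook: confirm the problem, then either
# squelch the alert or proceed to remediation. Helper names and the
# payload field are illustrative assumptions.
import socket

def broker_reachable(host, port=9092, timeout=5):
    """Cheap diagnostic: can we open a TCP connection to the broker?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def run(event):
    broker = event["broker_host"]  # assumed payload field
    if broker_reachable(broker):
        # The broker is actually fine; squelch the false positive.
        return {"status": "false_positive", "broker": broker}
    # Broker confirmed offline: remediation (e.g. replacing the instance)
    # would be kicked off here, intentionally left as a stub.
    return {"status": "remediation_needed", "broker": broker}
```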
Users can also view previous executions and individual execution details through Winston Studio, as shown in the following snapshots.
Runbook lifecycle management
Winston implements a paved path for how runbooks are deployed and managed. It supports multiple versions of a given runbook, one for each environment (dev/test/prod). All runbooks given to Winston are stored in Stash, our persistent store for code. Stash supports versioning and appropriate security models, making it a good fit for storing code, which is what a runbook is. Each team gets its own isolated repository in Stash, and each environment (dev/test/prod) is represented by its own branch in the repository. Winston includes an automated promotion and deployment pipeline. Promotions are triggered manually by engineers through the Studio; deployments get triggered every time runbooks are promoted or updated via the Studio. Runbooks get deployed to all instances of Winston in all three zones and across all four AWS regions within minutes. A rough sketch of the branch convention follows.
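Here is a rough sketch of the branch-per-environment convention, under the assumption that promotion amounts to merging one environment branch into the next; Winston's real pipeline may differ in its mechanics.

```python
# Assumed mapping of environments to Stash branches and a merge-based
# promotion step; branch names and the merge strategy are illustrative.
import subprocess

ENV_BRANCHES = {"dev": "dev", "test": "test", "prod": "prod"}

def promote(repo_path, from_env, to_env):
    """Promote runbooks by merging the source environment branch into the target."""
    src, dst = ENV_BRANCHES[from_env], ENV_BRANCHES[to_env]
    subprocess.run(["git", "-C", repo_path, "checkout", dst], check=True)
    subprocess.run(["git", "-C", repo_path, "merge", src], check=True)
    # Deployment is then triggered automatically, pushing the updated
    # branch to every Winston instance across regions within minutes.
```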
HA deployment
Winston deployments are region and stack isolated. Region isolation handles region failures (the us-east-1 region going down should not affect executions in us-west-2). Stack isolation separates our test environment from our critical prod environment and provides an isolated space to test runbooks before deploying them to prod. We also provide a dev environment for developing and manually testing runbooks before deploying them to the test environment.
As you can see in the following diagram, we separate compute from persistence. We use a MongoDB replica set for data resiliency and automatic failover in case the primary dies. Multiple instances in the same region and environment share the same MongoDB cluster. Winston Studio is only a deployment-time dependency, not a runtime dependency, so we chose to host the Studio in a single region but run multiple instances behind a load balancer to handle instance failures. (A sketch of the replica-set connection follows the diagram.)
[Figure: Winston Studio and Winston Deployment]
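Here is a minimal sketch of how a compute instance might connect to the shared replica set with pymongo; the hostnames, replica-set name and database name are placeholders.

```python
# Connecting to a MongoDB replica set with pymongo. The driver discovers
# the current primary and fails over automatically if it dies, which is
# the resiliency property described above. Hostnames, the replica-set
# name and the database name are placeholders.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://mongo-1.example.net,mongo-2.example.net,mongo-3.example.net",
    replicaSet="winston-rs0",
    serverSelectionTimeoutMS=5000,  # fail fast if no primary is reachable
)
db = client["winston"]
```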
If we look at the zoomed-in view of one of the Winston compute instances (shown in the following diagram), we can see that it hosts an SQS sensor to consume incoming events, a rules engine to connect events to runbooks, and action runners to execute the runbooks. (A sketch of this polling loop follows the diagram.)
[Figure: Zoomed-in view of a Winston instance]
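The flow in the diagram can be sketched as a small polling loop: an SQS sensor pulls events, a rules engine maps them to runbooks, and an action runner executes the match. The queue URL, payload shape and rule table below are illustrative assumptions.

```python
# Sensor -> rules engine -> action runner, in miniature. Queue URL,
# payload fields and the rule table are illustrative assumptions.
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/winston-events"  # placeholder

# Rules engine: map an event name to a runbook entry point.
RULES = {
    "kafka_broker_offline": lambda event: print("running Kafka runbook:", event),
}

def poll_forever():
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            event = json.loads(msg["Body"])
            runbook = RULES.get(event.get("alert_name"))
            if runbook:
                runbook(event)  # action runner executes the matched runbook
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```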
Integrations
Winston integrates with Atlas as an event source to manage the pipeline of events coming into it. For Atlas, it uses the SQS action as an integration hook. It also provides a set of outbound integration APIs for talking to the Netflix ecosystem, which engineers can use in their runbooks if they need to. This is an opinionated set of APIs built to make automations easy to write; a sketch of their flavor follows.
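To convey the flavor of these outbound APIs, here is a purely hypothetical sketch; none of these function names are Winston's real API.

```python
# Hypothetical stand-ins for opinionated outbound integrations; inside a
# runbook, an engineer composes calls like these instead of hand-rolling
# an HTTP client for each internal service.

def remove_from_discovery(instance_id):
    """Stub for a wrapper around Netflix service discovery."""
    print("removing", instance_id, "from service discovery")

def page_oncall(team, message):
    """Stub for an escalation integration."""
    print("paging", team, ":", message)

remove_from_discovery("i-0123456789abcdef0")
page_oncall("data-infra", "broker replaced automatically")
```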
Supporting technologies
While Winston serves as a great place to host orchestrator runbooks, we also need a way to execute instance-level runbooks. We built BOLT, a REST-based asynchronous script runner that lives as a daemon on every AWS instance of a given app and provides a platform for hosting and executing instance-level runbooks. BOLT provides an automated deployment pipeline as well for iterating on BOLT runbooks.
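Here is a hedged sketch of driving a BOLT-style daemon over REST from an orchestrator runbook; the port, endpoint paths and response shape are assumptions, since BOLT's actual API is not documented here.

```python
# Kick off an instance-level runbook asynchronously, then poll for the
# result. Address, paths and response fields are illustrative assumptions.
import time
import requests

BOLT = "http://10.0.0.12:8080"  # placeholder instance address and port

def run_instance_runbook(name, params):
    resp = requests.post(f"{BOLT}/runbooks/{name}", json=params, timeout=10)
    resp.raise_for_status()
    execution_id = resp.json()["id"]  # assumed response shape
    while True:  # BOLT is async, so poll until the execution finishes
        status = requests.get(f"{BOLT}/executions/{execution_id}", timeout=10).json()
        if status["state"] in ("succeeded", "failed"):
            return status
        time.sleep(2)
```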
Usage
We have had Winston in production since early this year. So far, seven teams are on board, with 22 unique runbooks hosted on Winston. On average, we run 15 executions an hour. Each of these executions would previously have been manual, or skipped because it required manual intervention from our engineers. Common usage patterns fall into these buckets (an illustrative skeleton combining them follows the list):
- Filter false positives - Squelch alerts from Atlas using custom diagnostic steps specific to the service. This reduces pager fatigue and on-call pain.
- Diagnostics - Collect contextual information for the developer on call.
- Remediation - Under safe conditions, apply mitigation steps to quickly resolve the issue and bring the service back to a healthy state.
- Broker - Pass the alert event to an existing tool that can handle diagnostics and mitigation, managing protocol and data model conversion as part of the runbook.
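The skeleton below illustrates how the four patterns tend to combine inside a single runbook; every helper here is a hypothetical stand-in.

```python
# The four usage patterns in one skeleton; all helpers are stubs.

def run(event):
    # 1. Filter false positives: re-check the signal before acting.
    if service_is_actually_healthy(event):
        return {"status": "false_positive"}

    # 2. Diagnostics: gather context for the on-call developer.
    context = collect_diagnostics(event)

    # 3. Remediation: only under safe, well-understood conditions.
    if safe_to_remediate(context):
        return {"status": "remediated", "context": context}

    # 4. Broker: otherwise hand off to an existing tool, converting the
    #    event into that tool's data model.
    return hand_off_to_existing_tool(event, context)

def service_is_actually_healthy(event):
    return False  # stub: service-specific diagnostic checks go here

def collect_diagnostics(event):
    return {"alert": event.get("alert_name")}  # stub

def safe_to_remediate(context):
    return False  # stub: guard remediation with safety checks

def hand_off_to_existing_tool(event, context):
    return {"status": "brokered", "context": context}  # stub
```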
Build vs. Buy
When kickstarting this project, we evaluated whether to build something custom or reuse what already exists. After some prototyping, conversations with peers in the industry and analysis of different solutions in the market, we chose a hybrid route: reusing an open source solution and building custom software to fill the gaps. We decided to use StackStorm as our underlying engine to host and execute our runbooks. StackStorm was chosen for the following key attributes:
- Alignment with the problem we aimed to solve (event-driven runbook automation).
- Being open source, it allowed us to review code and architecture quality in detail.
- A very pluggable architecture meant we could integrate it with the Netflix environment easily.
- A great and responsive team backing the product.
Choosing StackStorm allowed us to quickly bootstrap without reinventing the wheel, letting us focus on Netflix-specific features and integrations and reducing our time to market significantly.
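As a taste of the pluggability that made StackStorm attractive, here is a minimal custom sensor written against StackStorm's Python sensor interface; the pack/trigger name and payload are illustrative.

```python
# Minimal StackStorm sensor. StackStorm calls setup() once and then run();
# dispatched triggers flow through its rules engine to actions. The
# trigger reference and payload below are illustrative.
from st2reactor.sensor.base import Sensor

class ExampleEventSensor(Sensor):
    def setup(self):
        pass  # open clients/connections here

    def run(self):
        # A real sensor would block on an event source (e.g. SQS) here.
        self.sensor_service.dispatch(
            trigger="example_pack.example_event",
            payload={"alert_name": "kafka_broker_offline"},
        )

    def cleanup(self):
        pass

    # Trigger lifecycle hooks required by the interface; unused here.
    def add_trigger(self, trigger):
        pass

    def update_trigger(self, trigger):
        pass

    def remove_trigger(self, trigger):
        pass
```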
Moving forward
There are a lot of improvements we want to make to the product, both for increased operational resiliency and for more features our customers can build on top of. Some key ones in each category are listed below.
Resiliency
- We are actively looking at providing resource (memory/CPU) and security isolation for individual executions by utilizing container technology.
- We want to invest in an at-least-once delivery guarantee for events flowing through our platform. Currently, events can be abandoned under some failure scenarios.
Features
- Polyglot - We would like to add support for additional languages for runbook authoring (Java is of special interest here).
- More self-serve features - Support one-to-many and many-to-one relationships and custom parameter mappings between events and runbooks.
- Safety - Automated remediation steps gone haywire can cause considerable damage. We would like to provide safety features (e.g., rate limiting, cross-event correlation).
Our goal is to continue increasing adoption within Netflix. We aim to learn and grow the product so that it has a bigger and better impact on the availability of Netflix, while keeping Netflix engineers happy.
Summary
We talked about the need for a product like Winston at Netflix and about our approach of reusing open source and building only when necessary to bootstrap quickly. We went through Winston's high-level architecture, deployment model, features and current usage.
Automated diagnostics and remediation of software is still a niche area at Netflix and in the industry. Our goal is to continue refining a paved path in this space for Netflix and to have a material impact on MTTR and developer productivity. Winston is a step in that direction and provides the right platform to help engineers automate their way out of repeatable tasks. If this area and approach excite you, please reach out; we would love to talk to you about it.