Wednesday, February 29, 2012

Fault Tolerance in a High Volume, Distributed System

by Ben Christensen

In an earlier post by Ben Schmaus, we shared the principles behind our circuit-breaker implementation. In that post, Ben discusses how the Netflix API interacts with dozens of systems in our service-oriented architecture, which makes the API inherently more vulnerable to any system failures or latencies underneath it in the stack. The rest of this post provides a more technical deep-dive into how our API and other systems isolate failure, shed load and remain resilient to failures.

Fault Tolerance is a Requirement, Not a Feature

The Netflix API receives more than 1 billion incoming calls per day, which in turn fan out to several billion outgoing calls (an average ratio of 1:6) to dozens of underlying subsystems, with peaks of over 100k dependency requests per second.




This all occurs in the cloud across thousands of EC2 instances.

Intermittent failure is guaranteed with this many variables, even if every dependency itself has excellent availability and uptime.

Without taking steps to ensure fault tolerance, 30 dependencies each with 99.99% uptime would result in 2+ hours downtime/month (99.99%^30 = 99.7% uptime = 2+ hours in a month).

When a single API dependency fails at high volume with increased latency (causing blocked request threads) it can rapidly (seconds or sub-second) saturate all available Tomcat (or other container such as Jetty) request threads and take down the entire API.


Thus, it is a requirement of high volume, high availability applications to build fault tolerance into their architecture and not expect infrastructure to solve it for them.

Netflix DependencyCommand Implementation

The service-oriented architecture at Netflix allows each team freedom to choose the best transport protocols and formats (XML, JSON, Thrift, Protocol Buffers, etc.) for their needs, so these approaches may vary across services.

In most cases the team providing a service also distributes a Java client library.

Because of this, applications such as API in effect treat the underlying dependencies as 3rd party client libraries whose implementations are "black boxes". This in turn affects how fault tolerance is achieved.

In light of the above architectural considerations we chose to implement a solution that uses a combination of fault tolerance approaches:
  • network timeouts and retries
  • separate threads on per-dependency thread pools
  • semaphores (via a tryAcquire, not a blocking call)
  • circuit breakers
Each of these approaches to fault tolerance has pros and cons, but when combined they provide a comprehensive protective barrier between user requests and underlying dependencies.


The Netflix DependencyCommand implementation wraps a network-bound dependency call with a preference towards executing in a separate thread and defines fallback logic which gets executed (step 8 in flow chart below) for any failure or rejection (steps 3, 4, 5a, 6b below) regardless of which type of fault tolerance (network or thread timeout, thread pool or semaphore rejection, circuit breaker) triggered it.
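
To make the pattern concrete, here is a minimal sketch of such a command in plain Java (illustrative names only, not the actual Netflix DependencyCommand code): a subclass implements run() for the network call and getFallback() for the degraded path, and execute() routes every failure mode through the same fallback.

import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Minimal sketch of the command pattern described above (illustrative, not the
// actual Netflix implementation).
public abstract class DependencyCommandSketch<T> {
    private final ExecutorService dependencyPool; // per-dependency thread pool
    private final long timeoutMillis;             // thread-level timeout

    protected DependencyCommandSketch(ExecutorService dependencyPool, long timeoutMillis) {
        this.dependencyPool = dependencyPool;
        this.timeoutMillis = timeoutMillis;
    }

    // The network-bound dependency call, run on the dependency's own pool.
    protected abstract T run() throws Exception;

    // Fallback logic (step 8 in the flow chart) for any failure or rejection.
    protected abstract T getFallback();

    public T execute() {
        try {
            Future<T> future = dependencyPool.submit(this::run);
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (RejectedExecutionException | TimeoutException | ExecutionException e) {
            // Pool rejection, thread timeout and underlying failure all take
            // the same fallback path.
            return getFallback();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return getFallback();
        }
    }
}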


We decided that the benefits of isolating dependency calls into separate threads outweigh the drawbacks (in most cases). Also, since the API is progressively moving towards increased concurrency, it was a win-win to achieve both fault tolerance and performance gains through concurrency with the same solution. In other words, the overhead of separate threads is being turned into a positive in many use cases by leveraging the concurrency to execute calls in parallel and speed up delivery of the Netflix experience to users.

Thus, most dependency calls now route through a separate thread-pool as the following diagram illustrates:


If a dependency becomes latent (the worst-case type of failure for a subsystem) it can saturate all of the threads in its own thread pool, but Tomcat request threads will timeout or be rejected immediately rather than blocking.


In addition to the isolation benefits and concurrent execution of dependency calls we have also leveraged the separate threads to enable request collapsing (automatic batching) to increase overall efficiency and reduce user request latencies.

Semaphores are used instead of threads for dependency executions known to not perform network calls (such as those only doing in-memory cache lookups) since the overhead of a separate thread is too high for these types of operations.

We also use semaphores to protect against non-trusted fallbacks. Each DependencyCommand is able to define a fallback function (discussed more below) which is performed on the calling user thread and should not perform network calls. Instead of trusting that all implementations will correctly abide by this contract, the fallback too is protected by a semaphore, so that if a fallback implementation does involve a network call and becomes latent, it cannot take down the entire app because it is limited in how many threads it can block.
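
A rough sketch of that guard (the class name and the permit count are invented for illustration):

import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Sketch: bound how many user threads may be inside a fallback at once by
// using a non-blocking tryAcquire, so a misbehaving fallback cannot pile up
// blocked threads. The limit of 10 is an arbitrary example.
public class GuardedFallback<T> {
    private final Semaphore permits = new Semaphore(10);

    public T run(Supplier<T> fallback, T lastResort) {
        if (!permits.tryAcquire()) {
            // Too many fallbacks already in flight; fail fast on this thread.
            return lastResort;
        }
        try {
            return fallback.get();
        } finally {
            permits.release();
        }
    }
}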

Despite the use of separate threads with timeouts, we continue to aggressively set timeouts and retries at the network level (through interaction with client library owners, monitoring, audits etc).

The timeouts at the DependencyCommand threading level are the first line of defense regardless of how the underlying dependency client is configured or behaving, but the network timeouts are still important; otherwise, highly latent network calls could fill the dependency thread-pool indefinitely.

The tripping of circuits kicks in when a DependencyCommand has passed a certain error threshold (such as a 50% error rate in a 10-second period) and will then reject all requests until health checks succeed.

This is used primarily to release the pressure on underlying systems (i.e. shed load) when they are having issues and reduce the user request latency by failing fast (or returning a fallback) when we know it is likely to fail instead of making every user request wait for the timeout to occur.
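
A very simplified sketch of this behavior follows (the real implementation uses rolling statistical windows and periodic health-check requests rather than the crude counters shown here):

import java.util.concurrent.atomic.AtomicLong;

// Crude circuit-breaker sketch: trip open once the error rate in the current
// window crosses a threshold, and send callers straight to their fallback
// while open. Window reset and the closing health check are left out.
public class CircuitBreakerSketch {
    private final AtomicLong total = new AtomicLong();
    private final AtomicLong errors = new AtomicLong();
    private volatile boolean open = false;

    public boolean allowRequest() {
        return !open;
    }

    public void recordSuccess() {
        total.incrementAndGet();
    }

    public void recordFailure() {
        long t = total.incrementAndGet();
        long e = errors.incrementAndGet();
        // Example threshold: at least 20 requests seen and a 50%+ error rate.
        if (t >= 20 && e * 100 / t >= 50) {
            open = true;
        }
    }
}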

How do we respond to a user request when failure occurs?

In each of the scenarios described above, a timeout, thread-pool or semaphore rejection, or short-circuit will result in a request not retrieving the optimal response for our customers.

An immediate failure ("fail fast") throws an exception which causes the app to shed load until the dependency returns to health. This is preferable to requests "piling up" as it keeps Tomcat request threads available to serve requests from healthy dependencies and enables rapid recovery once failed dependencies recover.

However, there are often several preferable options for providing responses in a "fallback mode" to reduce impact of failure on users. Regardless of what causes a failure and how it is intercepted (timeout, rejection, short-circuited etc) the request will always pass through the fallback logic (step 8 in flow chart above) before returning to the user to give a DependencyCommand the opportunity to do something other than "fail fast".

Some approaches to fallbacks we use are, in order of their impact on the user experience:
  • Cache: Retrieve data from local or remote caches if the realtime dependency is unavailable, even if the data ends up being stale
  • Eventual Consistency: Queue writes (such as in SQS) to be persisted once the dependency is available again
  • Stubbed Data: Revert to default values when personalized options can't be retrieved
  • Empty Response ("Fail Silent"): Return a null or empty list which UIs can then ignore
All of this work is to maintain maximum uptime for our users while maintaining the maximum number of features for them to enjoy the richest Netflix experience possible. As a result, our goal is to have the fallbacks deliver responses as close as possible to what the actual dependency would deliver.
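
As a small illustration of the first and last strategies combined (all names here are hypothetical), a fallback might consult a possibly stale local cache and otherwise fail silently with an empty list:

import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical fallback: serve possibly stale data from a local cache, and if
// nothing is cached, "fail silent" with an empty list the UI can ignore.
public class VideoListFallback {
    private final Map<Long, List<String>> localCache;

    public VideoListFallback(Map<Long, List<String>> localCache) {
        this.localCache = localCache;
    }

    public List<String> getFallback(long userId) {
        List<String> cached = localCache.get(userId); // Cache: stale beats empty
        return cached != null ? cached : Collections.emptyList(); // Fail Silent
    }
}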

Example Use Case

Following is an example of how threads, network timeouts and retries combine:



The above diagram shows an example configuration where the dependency has no reason to hit the 99.5th percentile and thus cuts it short at the network timeout layer and immediately retries with the expectation to get median latency most of the time, and accomplish this all within the 300ms thread timeout.

If the dependency has legitimate reasons to sometimes hit the 99.5th percentile (e.g. a cache miss with lazy generation) then the network timeout will be set higher than it, such as at 325ms with 0 or 1 retries and the thread timeout set higher (350ms+).

The threadpool is sized at 10 to handle a burst of 99th percentile requests, but when everything is healthy this threadpool will typically only have 1 or 2 threads active at any given time to serve mostly 40ms median calls.

When configured correctly a timeout at the DependencyCommand layer should be rare, but the protection is there in case something other than network latency affects the time, or the combination of connect+read+retry+connect+read in a worst case scenario still exceeds the configured overall timeout.
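
The arithmetic behind that worst case is simple; the sketch below only shows how the layers add up (the network timeout values are placeholders, not our actual settings).

// Worst-case time spent at the network layer: every attempt can pay the full
// connect + read budget, and retries multiply it. The DependencyCommand thread
// timeout still caps what the user request waits for.
public final class TimeoutBudget {
    public static long worstCaseNetworkMillis(long connectTimeoutMs, long readTimeoutMs, int retries) {
        return (retries + 1) * (connectTimeoutMs + readTimeoutMs);
    }

    public static void main(String[] args) {
        long threadTimeoutMs = 300; // the DependencyCommand-level timeout from the example
        long worst = worstCaseNetworkMillis(100, 150, 1); // placeholder network settings
        System.out.println("worst case network time: " + worst + "ms, thread timeout: " + threadTimeoutMs + "ms");
    }
}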

The aggressiveness of configurations and tradeoffs in each direction are different for each dependency.

Configurations can be changed in realtime as performance characteristics change or as problems are found, all without risking taking down the entire app if a problem or misconfiguration occurs.

Conclusion

The approaches discussed in this post have had a dramatic effect on our ability to tolerate and be resilient to system, infrastructure and application level failures without impacting (or while limiting the impact to) the user experience.

Despite the success of this new DependencyCommand resiliency system over the past 8 months, there is still a lot for us to do in improving our fault tolerance strategies and performance, especially as we continue to add functionality, devices, customers and international markets.

If these kinds of challenges interest you, the API team is actively hiring.




Monday, February 27, 2012

Aegisthus - A Bulk Data Pipeline out of Cassandra


By Charles Smith and Jeff Magnusson

Our job in Data Science and Engineering is to consume all the data that Netflix produces and provide an offline batch-processing platform for analyzing and enriching it. As has been mentioned in previous posts, Netflix has recently been engaged in making the transition to serving a significant amount of data from Cassandra. As with any new data storage technology that is not easily married to our current analytics and reporting platforms, we needed a way to provide a robust set of tools to process and access the data.

With respect to the requirement of bulk processing, there are a couple very basic problems that we need to avoid when acquiring data. First, we don’t want to impact production systems. Second, Netflix is creating an internal infrastructure of decoupled applications, several of which are backed by their own Cassandra cluster. Data Science and Engineering needs to be able to obtain a consistent view of the data across the various clusters.

With these needs in mind and many of our key data sources rapidly migrating from traditional relational database systems into Cassandra, we set out to design a process to extract data from Cassandra and make it available in a generic form that is easily consumable by our bulk analytics platform. Since our desire is to retrieve the data in bulk, we rejected any attempts to query the production clusters directly. While Cassandra is very efficient at serving point queries and we have a lot of great APIs for accessing data here at Netflix, trying to ask a system for all of its data is generally not good for its long or short-term health.

Instead, we wanted to build an offline batch process for extracting the data. A big advantage to hosting our infrastructure on AWS is that we have access to effectively infinite, shared storage on S3. Our production Cassandra clusters are continuously backed up into a S3 bucket using our backup and recovery tool, Priam. Initially we intended to simply bring up a copy of each production cluster from Priam’s backups and extract the data via a Hadoop map-reduce job running against the restored Cassandra cluster. Working forward from that approach, we soon discovered that while it may be feasible for one or two clusters, maintaining the number of moving parts required to deploy this solution to all of our production clusters was going to quickly become unmaintainable. It just didn’t scale.

So, is there a better way to do it?

Taking a step back, it became evident to us that the problem of achieving scale in this architecture was two-fold. First, the overhead of spinning up a new cluster in AWS and restoring it from a backup did not scale well with the number of clusters being pushed to production. Second, we were operating under the constraint that backups have to be restored into a cluster equal in size to production. As data sizes grow, there is not necessarily any motivation for production data sources to increase the number of nodes in their clusters (remember, they are not bulk querying the data – their workloads don’t scale linearly with respect to data size).

Thus, we were unable to leverage a key benefit of processing data on Hadoop – the ability to easily scale computing resources horizontally with respect to the size of your data.

We realized that Hadoop was an excellent fit for processing the data. The problem was that the Cassandra data was stored in a format that was not natively readable (sstables). Standing up Cassandra clusters from backups was simply a way to circumvent that problem. Rather than try to avoid the real issue, we decided to attack it head-on.

Aegisthus is Born

The end result was an application consisting of a constantly running Hadoop cluster capable of processing sstables as they are created by any Cassandra data source in our architecture. We call it Aegisthus, named in honor of Aegisthus’s relationship with Cassandra in Greek mythology.

Running on a single Hadoop cluster gives us the advantage of being able to easily and elastically scale a single computing resource. We were able to reduce the number of moving parts in our architecture while vastly increasing the speed at which we could process Cassandra’s data.

How it Works

A single map-reduce job is responsible for the bulk of the processing Aegisthus performs. The inputs to this map-reduce job are sstables from Cassandra: either a full snapshot of a cluster (backup), or batches of sstables as they are incrementally archived by Priam from the Cassandra cluster. We process the sstables, reduce them into a single consistent view of the data, and convert them to JSON-formatted rows that we stage out to S3 to be picked up by the rest of our analytics pipeline.

A full snapshot of a Cassandra cluster consists of all the sstables required to reconstitute the data into a new cluster that is consistent to the point at which the snapshot was taken. We developed an input format that is able to split the sstables across the entire Hadoop cluster, allowing us to control the amount of compute power we want to throw at the processing (horizontal scale). This was a welcome relief after trying to deal with timeout exceptions when directly using Cassandra as a data source for Hadoop input.

Each row-column value written to Cassandra is replicated and stored in sstables with a corresponding timestamp. The map phase of our job reads the sstables and converts them into JSON format. The reduce phase replicates the internal logic Cassandra uses to return data when it is queried with a consistency level of ALL (i.e. it reduces each row-column value based on the max timestamp).
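
Conceptually the reduce side looks something like the sketch below. It is illustrative only: it assumes the map phase emits each version as a tab-separated "timestamp" and JSON string, and it ignores details such as tombstones that the real job has to handle.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the reduce logic: for each row/column key, keep only the version
// with the highest write timestamp, mirroring a read at consistency level ALL.
// Not the actual Aegisthus code.
public class MaxTimestampReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text rowColumnKey, Iterable<Text> versions, Context context)
            throws IOException, InterruptedException {
        long bestTimestamp = Long.MIN_VALUE;
        String bestJson = null;
        for (Text version : versions) {
            String[] parts = version.toString().split("\t", 2); // "timestamp<TAB>json"
            long timestamp = Long.parseLong(parts[0]);
            if (timestamp > bestTimestamp) {
                bestTimestamp = timestamp;
                bestJson = parts[1];
            }
        }
        if (bestJson != null) {
            context.write(rowColumnKey, new Text(bestJson));
        }
    }
}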

Performance

Aegisthus is currently running on a Hadoop cluster consisting of 24 m2.4xlarge EC2 instances. In the table below, we show some benchmarks for a subset of the Cassandra clusters from which we are pulling data. The table shows the number of nodes in the Cassandra cluster that houses the data, the average size of data per node, the total number of rows in the dataset, the time our map/reduce job takes to run, and the number of rows/sec we are able to process.


The number of rows/sec is highly variable across data sources. This is due to a number of reasons, notably the average size of the rows and the average number of times a row is replicated across the sstables. Further, as can be seen in the last entry in the table, smaller datasets incur a noticeable penalty due to overhead in the map/reduce framework.

We’re constantly optimizing our process and have tons of other interesting and challenging problems to solve. Like what you see? We’re hiring!


Tuesday, February 21, 2012

Announcing Priam

By Praveen Sadhu, Vijay Parthasarathy & Aditya Jami

We talked in the past about our move to NoSQL, and Cassandra has been a big part of that strategy. Cassandra hit a big milestone recently with the announcement of the v1 release. We recently announced Astyanax, Netflix's Java Cassandra client with an improved API and connection management, which we open sourced last month.

Today, we're excited to announce another milestone on our open source journey with an addition to make operations and management of Cassandra easier and more automated.

As we embarked on making Cassandra one of our NoSQL databases in the cloud, we needed tools for managing configuration, providing reliable and automated backup/recovery, and automating token assignment within and across regions. Priam was built to meet these needs. The name 'Priam' refers to the king of Troy, in Greek mythology, who was the father of Cassandra.

What is Priam?

Priam is a co-process that runs alongside Cassandra on every node to provide the following functionality:
  • Backup and recovery
  • Bootstrapping and automated token assignment
  • Centralized configuration management
  • RESTful monitoring and metrics
We are currently using Priam to manage several dozen Cassandra clusters and counting.

Backup and recovery

A dependable backup and recovery process is critical when choosing to run a database in the cloud. With Priam, a daily snapshot and incremental data for all our clusters is backed up to S3. S3 was an obvious choice for backup data due to its simple interface and ability to access any amount of data from anywhere[1].

Snapshot backup

Priam leverages Cassandra's snapshot feature to take an eventually consistent backup[2]. Cassandra flushes data to disk and hard-links all SSTable files (data files) into a snapshot directory. SSTables are immutable files and can be safely copied to an external source. Priam picks up these hard-linked files and uploads them to S3. Snapshots are run on a daily basis for the entire cluster, ideally during non-peak hours. Although a snapshot across the cluster is not guaranteed to produce a consistent backup of the cluster, consistency is recovered upon restore by Cassandra and by running repairs. Snapshots can also be triggered on demand via Priam's REST API during upgrades and maintenance operations.

During the backup process, Priam throttles the data read from disk to avoid contention and interference with Cassandra's disk IO as well as network traffic. Schema files are also backed up in the process.

Incremental backup

When incremental backups are enabled in Cassandra, hard-links are created for all new SSTables in the incremental backup directory. Priam scans this directory frequently for incremental SSTable files and uploads them to S3. Incremental data along with the snapshot data are required for a complete backup.

Compression and multipart uploading

Priam uses snappy compression to compress SSTables on the fly. With S3's multi-part upload feature, files are chunked, compressed and uploaded in parallel. Uploads also ensure the OS file cache is unaffected (via fadvise). Priam reliably handles file sizes on the order of several hundred GB in our production environment for several of our clusters.
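
A much-simplified sketch of the compress-and-upload step follows. Priam actually compresses and uploads chunks in parallel through the multipart API and throttles disk reads; here the AWS SDK's TransferManager, which switches to multipart uploads for large files, stands in for that machinery.

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.xerial.snappy.SnappyOutputStream;

import com.amazonaws.services.s3.transfer.TransferManager;

// Simplified illustration only: snappy-compress an SSTable to a temp file and
// hand it to S3. The real Priam code streams chunks, uploads parts in parallel
// and throttles reads; none of that is shown here.
public class SimpleSstableBackup {
    public static void backup(TransferManager s3, String bucket, String key, File sstable)
            throws IOException, InterruptedException {
        File compressed = new File(sstable.getPath() + ".snappy");
        try (FileInputStream in = new FileInputStream(sstable);
             SnappyOutputStream out = new SnappyOutputStream(new FileOutputStream(compressed))) {
            byte[] buffer = new byte[64 * 1024];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
        // TransferManager transparently uses S3 multipart upload for large files.
        s3.upload(bucket, key, compressed).waitForCompletion();
    }
}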

Restoring data

Priam supports restoring a partial or complete ring. Although the latter is less likely in production, restoring to a full test cluster is a common use case. When restoring data from backup, the Priam process (on each node) locates snapshot files for some or all keyspaces and orchestrates the download of the snapshot, the incremental backup files, and the starting of the cluster. During this process, Priam strips the ring information from the backup, allowing us to restore to a cluster of half the original size (i.e., by skipping alternate nodes and running repair to regain skipped data). Restoring to a different-sized cluster is possible only for keyspaces with a replication factor greater than one. Priam can also restore data to clusters with different names, allowing us to spin up multiple test clusters with the same data.

Restoring prod data for testing

Using production data in a test environment allows you to test on massive volumes of real data to produce realistic benchmarks. One of the goals of Priam was to automate restoration of data into a test cluster. In fact, at Netflix, we bring up test clusters on-demand by pointing them to a snapshot. This also provides a mechanism for validating production data and offline analysis. SSTables are also used directly by our ETL process.


Figure 1: Listing of backup files in S3

Token Assignment

Priam automates the assignment of tokens to Cassandra nodes as they are added, removed or replaced in the ring. Priam relies on centralized external storage (SimpleDB/Cassandra) for storing token and membership information, which is used to bootstrap nodes into the cluster. It allows us to automate replacing nodes without any manual intervention, since we assume failure of nodes, and create failures using Chaos Monkey. The external Priam storage also provides us valuable information for the backup and recovery process.

To survive failures in the AWS environment and provide high availability, we spread our Cassandra nodes across multiple availability zones within regions. Priam's token assignment feature uses the locality information to allocate tokens across zones in an interlaced manner.

One of the challenges with cloud environments is replacing ephemeral nodes which can get terminated without warning. With token information stored in an external datastore (SimpleDB), Priam automates the replacement of dead or misbehaving (due to hardware issues) nodes without requiring any manual intervention.

Priam also lets us add capacity to existing clusters by doubling them. Clusters are doubled by interleaving new tokens between the existing ones. Priam's ring doubling feature does strategic replica placement to make sure that the original principle of having one replica per zone is still valid (when using a replication factor of at least 3).

We are also working closely with the Cassandra community to automate and enhance the token assignment mechanisms for Cassandra.


Address         DC          Rack        Status State   Load            Owns    Token                   
167778111467962714624302757832749846470
10.XX.XXX.XX us-east 1a Up Normal 628.07 GB 1.39% 1808575600
10.XX.XXX.XX us-east 1d Up Normal 491.85 GB 1.39% 2363071992506517107384545886751410400
10.XX.XXX.XX us-east 1c Up Normal 519.49 GB 1.39% 4726143985013034214769091771694245202
10.XX.XXX.XX us-east 1a Up Normal 507.48 GB 1.39% 7089215977519551322153637656637080002
10.XX.XXX.XX us-east 1d Up Normal 503.12 GB 1.39% 9452287970026068429538183541579914805
10.XX.XXX.XX us-east 1c Up Normal 508.85 GB 1.39% 11815359962532585536922729426522749604
10.XX.XXX.XX us-east 1a Up Normal 497.69 GB 1.39% 14178431955039102644307275311465584408
10.XX.XXX.XX us-east 1d Up Normal 495.2 GB 1.39% 16541503947545619751691821196408419206
10.XX.XXX.XX us-east 1c Up Normal 503.94 GB 1.39% 18904575940052136859076367081351254011
10.XX.XXX.XX us-east 1a Up Normal 624.87 GB 1.39% 21267647932558653966460912966294088808
10.XX.XXX.XX us-east 1d Up Normal 498.78 GB 1.39% 23630719925065171073845458851236923614
10.XX.XXX.XX us-east 1c Up Normal 506.46 GB 1.39% 25993791917571688181230004736179758410
10.XX.XXX.XX us-east 1a Up Normal 501.05 GB 1.39% 28356863910078205288614550621122593217
10.XX.XXX.XX us-east 1d Up Normal 814.26 GB 1.39% 30719935902584722395999096506065428012
10.XX.XXX.XX us-east 1c Up Normal 504.83 GB 1.39% 33083007895091239503383642391008262820
Figure 2: Sample cluster created by Priam


Multi-regional clusters

For multi-regional clusters, Priam allocates tokens by interlacing them between regions. Apart from allocating tokens, Priam provides a seed list across regions for Cassandra and automates security group updates for secure cross-regional communications between nodes. We use Cassandra's inter-DC encryption mechanism to encrypt the data between the regions via public Internet. As we span across multiple AWS regions, we rely heavily on these features to bring up new Cassandra clusters within minutes.

In order to have a balanced multi-regional cluster, we place one replica in each zone, across all regions. Priam does this by calculating tokens for each region and padding them with a constant value. For example, US-East will start with token 0 whereas EU-West will start with token 0 + x. This allows us to have different-sized clusters in each region depending on usage.
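
The token math behind this is roughly the sketch below: RandomPartitioner tokens range over 0 to 2^127, and the per-region offset is the constant x mentioned above (the value used here is an arbitrary example, not Priam's actual constant).

import java.math.BigInteger;

// Sketch of interleaved token assignment: split the RandomPartitioner range
// evenly across the nodes of a region, then shift the whole region by a small
// constant offset so tokens from different regions never collide.
public class TokenCalculator {
    private static final BigInteger RANGE = BigInteger.valueOf(2).pow(127);

    public static BigInteger token(int nodesInRegion, int nodePosition, long regionOffset) {
        return RANGE.divide(BigInteger.valueOf(nodesInRegion))
                    .multiply(BigInteger.valueOf(nodePosition))
                    .add(BigInteger.valueOf(regionOffset));
    }

    public static void main(String[] args) {
        // Example: 12 nodes per region, second node (position 1), with one
        // region at offset 0 and another at an arbitrary small offset x.
        System.out.println(token(12, 1, 0L));
        System.out.println(token(12, 1, 1000000L));
    }
}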

When a mutation is performed on a cluster, Cassandra writes to the local nodes and forwards the write asynchronously to the other regions. By placing replicas in a particular order we can withstand zone failures or a region failure without most of our services knowing about them. In the diagram below, A1, A2 and A3 are AWS availability zones in one region and B1, B2 and B3 are AWS availability zones in another region.

Note: Replication Factor and the number of nodes in a zone are independent settings for each region.



Figure 3: Write to Cassandra spanning region A and region B
Address         DC          Rack        Status State   Load            Owns    Token                   
170141183460469231731687303715884105727
176.XX.XXX.XX eu-west 1a Up Normal 35.04 GB 0.00% 372748112
184.XX.XXX.XX us-east 1a Up Normal 56.33 GB 8.33% 14178431955039102644307275309657008810
46.XX.XXX.XX eu-west 1b Up Normal 36.64 GB 0.00% 14178431955039102644307275310029756921
174.XX.XXX.XX us-east 1d Up Normal 34.63 GB 8.33% 28356863910078205288614550619314017620
46.XX.XXX.XX eu-west 1c Up Normal 51.82 GB 0.00% 28356863910078205288614550619686765731
50.XX.XXX.XX us-east 1c Up Normal 34.26 GB 8.33% 42535295865117307932921825928971026430
46.XX.XXX.XX eu-west 1a Up Normal 34.39 GB 0.00% 42535295865117307932921825929343774541
184.XX.XXX.XX us-east 1a Up Normal 56.02 GB 8.33% 56713727820156410577229101238628035240
46.XX.XXX.XX eu-west 1b Up Normal 44.55 GB 0.00% 56713727820156410577229101239000783351
107.XX.XXX.XX us-east 1d Up Normal 36.34 GB 8.33% 70892159775195513221536376548285044050
46.XX.XXX.XX eu-west 1c Up Normal 50.1 GB 0.00% 70892159775195513221536376548657792161
50.XX.XXX.XX us-east 1c Up Normal 39.44 GB 8.33% 85070591730234615865843651857942052858
46.XX.XXX.XX eu-west 1a Up Normal 40.86 GB 0.00% 85070591730234615865843651858314800971
174.XX.XXX.XX us-east 1a Up Normal 43.75 GB 8.33% 99249023685273718510150927167599061670
79.XX.XXX.XX eu-west 1b Up Normal 42.43 GB 0.00% 99249023685273718510150927167971809781
Figure 4: Multi-regional Cassandra cluster created by Priam

All our clusters are centrally configured via properties stored in SimpleDB, which includes setup of critical JVM settings and Cassandra YAML properties.

Priam's REST API

One of the goals of Priam was to support managing multiple Cassandra clusters. To achieve that, Priam's REST APIs provide hooks that support external monitoring and automation scripts. They provide the ability to back up and restore a set of nodes manually and provide insights into Cassandra's ring information. They also expose key Cassandra JMX commands such as repair and refresh.

For a comprehensive listing of the APIs, please visit the github wiki here.

Key Cassandra facts at Netflix:
  • 57 Cassandra clusters running on hundreds of instances are currently in production, many of which are multi-regional
  • Priam backs up tens of TBs of data to S3 per day.
  • Several TBs of production data is restored into our test environment every week.
  • Nodes get replaced almost daily without any manual intervention
  • All of our clusters use random partitioner and are well-balanced
  • Priam was used to create the 288 node Cassandra benchmark cluster discussed in our earlier blog post[3].

Friday, February 10, 2012

Announcing Servo

By Brian Harrington & Greg Orzell

In a previous blog post about auto scaling, I mentioned that we would be open sourcing the library that we use to expose application metrics. Servo is that library. It is designed to make it easy for developers to export metrics from their application code, register them with JMX, and publish them to external monitoring systems such as Amazon's CloudWatch. This is especially important at Netflix because we are a data-driven company and it is essential that we know what is going on inside our applications in near real time. As we increased our use of auto scaling based on application load, it became important for us to be able to publish custom metrics to CloudWatch so that we could configure auto-scaling policies based on the metrics that most accurately capture the load for a given application. We already had the Servo framework in place to publish data to our internal monitoring system, so it was extended to allow for exporting a subset of the data (AWS charges on a per-metric basis) into CloudWatch.

Features

  • Simple: It is trivial to expose and publish metrics without having to write lots of code such as MBean interfaces.
  • JMX Registration: JMX is the standard monitoring interface for Java and can be queried by many existing tools. Servo makes it easy to expose metrics to JMX so they can be viewed from a wide variety of Java tools such as VisualVM.
  • Flexible publishing: Once metrics are exposed, it should be easy to regularly poll the metrics and make them available for internal reporting systems, logs, and services like Amazon's CloudWatch. There is also support for filtering to reduce cost for systems that charge per metric, and asynchronous publishing to help isolate the collection from downstream systems that can have unpredictable latency.

The rest of this post provides a quick preview of Servo, for a more detailed overview see the Servo wiki.

Registering Metrics

Registering metrics is designed to be both easy and flexible. Using annotations you can call out the fields or methods that should be monitored for a class and specify both static and dynamic metadata. The example below shows a basic server class with some stats about the number of connections and the amount of data that has been seen.
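
The original code screenshots are not reproduced here, so the following is a sketch along the lines of the Servo documentation (the class name, metric names and helper methods are invented for illustration).

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

import com.netflix.servo.annotations.DataSourceType;
import com.netflix.servo.annotations.Monitor;

// Sketch of an annotated server class: a gauge for current connections and
// counters for bytes in/out. Reconstructed from the Servo docs, not the exact
// code shown in the original post.
public class BasicServer {
    @Monitor(name = "connectionCount", type = DataSourceType.GAUGE)
    private final AtomicInteger connections = new AtomicInteger(0);

    @Monitor(name = "bytesReceived", type = DataSourceType.COUNTER)
    private final AtomicLong bytesReceived = new AtomicLong(0L);

    @Monitor(name = "bytesSent", type = DataSourceType.COUNTER)
    private final AtomicLong bytesSent = new AtomicLong(0L);

    public void onOpen()          { connections.incrementAndGet(); }
    public void onClose()         { connections.decrementAndGet(); }
    public void onReceive(int n)  { bytesReceived.addAndGet(n); }
    public void onSend(int n)     { bytesSent.addAndGet(n); }
}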





See the annotations wiki page for a more detailed summary of the available annotations and the options that are available. Once you have annotated your class, you will need to register each new object instance with the registry in order for the metrics to get exposed. A default registry is provided that exports metrics to JMX.
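
Again as a sketch standing in for the missing code image (the registration call follows the Servo docs; verify the exact signature against your version):

import com.netflix.servo.monitor.Monitors;

// Register the annotated instance with the default registry, which exports the
// metrics to JMX. "server1" is just an example id.
public class ServerMain {
    public static void main(String[] args) {
        BasicServer server = new BasicServer();
        Monitors.registerObject("server1", server);
    }
}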



Now that the instance is registered, metrics should be visible in tools like VisualVM when you run your application.

Publishing Metrics

After getting the metrics into JMX, the next step is to collect the data and make it available to other systems. The Servo library provides three main interfaces for collecting and publishing data:

  • MetricObserver: an observer is a class that accepts updates to the metric values. Implementations are provided for keeping samples in memory, exporting to files, and exporting to CloudWatch.
  • MetricPoller: a poller provides a way to collect metrics from a given source. Implementations are provided for querying metrics associated with a monitor registry and arbitrary metrics exposed to JMX.
  • MetricFilter: filters are used to restrict the set of metrics that are polled. The filter is passed in to the poll method call so that metrics that can be expensive to collect will be ignored as soon as possible. Implementations are provided for filtering based on a regular expression and prefixes such as package names.

The example below shows how to configure the collection of metrics each minute to store them on the local file system.
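
Since the original code image is missing, here is a sketch of that setup using the classes from Servo's publish package (treat the exact constructors as assumptions and check the Servo wiki for your version).

import java.io.File;
import java.util.concurrent.TimeUnit;

import com.netflix.servo.publish.BasicMetricFilter;
import com.netflix.servo.publish.FileMetricObserver;
import com.netflix.servo.publish.MetricObserver;
import com.netflix.servo.publish.MonitorRegistryMetricPoller;
import com.netflix.servo.publish.PollRunnable;
import com.netflix.servo.publish.PollScheduler;

// Poll the monitor registry once a minute and write the samples to files in
// the current directory. A sketch based on the Servo documentation.
public class FilePublishing {
    public static void start() {
        PollScheduler scheduler = PollScheduler.getInstance();
        scheduler.start();

        MetricObserver fileObserver = new FileMetricObserver("stats", new File("."));
        PollRunnable task = new PollRunnable(
                new MonitorRegistryMetricPoller(), // metrics registered via Monitors
                BasicMetricFilter.MATCH_ALL,       // no filtering
                fileObserver);
        scheduler.addPoller(task, 1, TimeUnit.MINUTES);
    }
}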



By simply using a different observer, we can instead export the metrics to a monitoring system like CloudWatch.
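
A corresponding sketch for CloudWatch is shown below; the observer lives in the servo-cloudwatch module, and the constructor shown here is assumed, so confirm it against the Servo documentation.

import java.util.concurrent.TimeUnit;

import com.amazonaws.auth.BasicAWSCredentials;
import com.netflix.servo.publish.BasicMetricFilter;
import com.netflix.servo.publish.MetricObserver;
import com.netflix.servo.publish.MonitorRegistryMetricPoller;
import com.netflix.servo.publish.PollRunnable;
import com.netflix.servo.publish.PollScheduler;
import com.netflix.servo.publish.cloudwatch.CloudWatchMetricObserver;

// Same polling setup, but pushing to CloudWatch. Namespace and observer name
// are hypothetical; in practice you would also filter the metrics since AWS
// charges per metric.
public class CloudWatchPublishing {
    public static void start(String accessKey, String secretKey) {
        MetricObserver cloudWatch = new CloudWatchMetricObserver(
                "my-app", "NETFLIX/EXAMPLE",
                new BasicAWSCredentials(accessKey, secretKey));

        PollRunnable task = new PollRunnable(
                new MonitorRegistryMetricPoller(),
                BasicMetricFilter.MATCH_ALL,
                cloudWatch);
        PollScheduler.getInstance().start();
        PollScheduler.getInstance().addPoller(task, 1, TimeUnit.MINUTES);
    }
}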



You have to provide your AWS credentials and namespace at initialization. Servo also provides some helpers for tagging the metrics with common dimensions such as the auto scaling group and instance id. CloudWatch data can be retrieved using the standard Amazon tools and APIs.


Related Links

Servo Project
Servo Documentation
Netflix Open Source
