
Today we are happy to announce that Aegisthus, our map/reduce program for reading Cassandra SSTables, is open source. Aegisthus is the tool we use daily to make the data stored in Cassandra available for analysis on our big data platform. We originally wrote about Aegisthus nearly two years ago, so today we want to talk a little about what has changed since then.

Overview
The initial iteration of Aegisthus was designed in a manner very similar to other typical applications in the Netflix infrastructure. As a result, Aegisthus was not just a map/reduce job: it also included a Hadoop cluster that could be launched via Asgard, along with all the tools used to configure and schedule the data conversion. Aegisthus served as a pipeline for the dimensional data we hold in Cassandra, producing data that we would consume in our big data environment. Conceptually, we viewed Aegisthus as similar to Suro, our recently open sourced event data pipeline, in how it makes data available for batch processing.
Since our initial blog post describing Aegisthus, we have invested heavily in building and refining a data platform that facilitates easy and efficient access to and analysis of our data. Genie and Franklin are two of the core services of this platform. Genie - also part of NetflixOSS - is a building block for many of our higher level applications that need to launch Hadoop jobs in our architecture. Franklin is the internal name for our metadata service, and it is heavily leveraged by the tools we build to provide access to various datasets. It provides a standard set of metadata for data across a variety of sources (e.g. MySQL, S3, Hive, Cassandra), including schema, location, and serialization format, and it also allows adding arbitrary additional information as needed. Using these two building blocks, we constructed a platform of tools that generalizes across the different needs of our users.
Earlier this year when we revisited Aegisthus, we realized that the process had much more in common with the tools we were writing on top of our core data platform services than with a stand-alone application. First, we refactored Aegisthus to use Genie to launch as a job on our EMR-based Hadoop clusters rather than run on a dedicated Hadoop cluster. This gives us much better flexibility in preparing for and scaling to additional load. It also brings Aegisthus in line with our other batch processing jobs, giving us a central place to find information about the jobs that have run.
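Since Genie accepts job submissions over a REST API, "launching Aegisthus" now amounts to posting a job description rather than provisioning a cluster. The sketch below is a minimal, hypothetical illustration of that idea; the endpoint path, payload fields, and Aegisthus arguments shown here are assumptions for illustration only, not the exact Genie or Aegisthus contract.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SubmitAegisthusJob {
    public static void main(String[] args) throws Exception {
        // Hypothetical Genie endpoint; the real path depends on the Genie version deployed.
        URL url = new URL("http://genie.example.com:7001/genie/v0/jobs");

        // Hypothetical job payload: the jar, main class, and arguments for the
        // Aegisthus map/reduce job (field names and flags are illustrative only).
        String payload = "{"
                + "\"jobName\": \"aegisthus-sstable-to-json\","
                + "\"jobType\": \"hadoop\","
                + "\"cmdArgs\": \"jar aegisthus-hadoop.jar com.netflix.Aegisthus"
                + " -input s3://backup-bucket/cassandra/sstables/"
                + " -output s3://warehouse-bucket/aegisthus/json/\""
                + "}";

        // Post the job description and let Genie pick a cluster to run it on.
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Genie responded with HTTP " + conn.getResponseCode());
    }
}
```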
Next, we refactored the configuration of Aegisthus jobs to become additional metadata stored in Franklin. This gives us two benefits. First, the information about where a dataset originates is stored alongside the metadata on how to consume it, giving us clear data lineage whenever questions need to be directed to the owners of the source systems. Second, since Franklin provides the metadata that explains how to consume the datasets that Aegisthus produces, we can use the same names both for scheduling the Aegisthus map/reduce job and for the downstream consumers of the datasets in Pig.
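Because Franklin is an internal service, the following sketch is purely illustrative: it assumes a dataset's metadata record has already been fetched as a simple map, and shows how fields such as the source and target locations could be turned into Aegisthus job arguments. All key names here are invented and do not reflect Franklin's real schema.

```java
import java.util.HashMap;
import java.util.Map;

public class AegisthusJobConfig {

    /** Turn a dataset's metadata record into arguments for the Aegisthus job. */
    public static String[] fromMetadata(Map<String, String> metadata) {
        // Hypothetical metadata keys: where the backups live and where the
        // json output should land.
        String input  = metadata.get("source.location");
        String output = metadata.get("target.location");
        return new String[] { "-input", input, "-output", output };
    }

    public static void main(String[] args) {
        // Example record for a hypothetical dataset name that the scheduler
        // and the downstream Pig consumers would both refer to.
        Map<String, String> metadata = new HashMap<>();
        metadata.put("source.location", "s3://backup-bucket/cassandra/mykeyspace/mycf");
        metadata.put("target.location", "s3://warehouse-bucket/aegisthus/mykeyspace/mycf");

        System.out.println(String.join(" ", fromMetadata(metadata)));
    }
}
```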
The flow of Cassandra data
To get an idea of how this all fits together, it is best to consider how information moves from Cassandra into our big data environment. The figure below shows the flow of data from Cassandra SSTables to json-formatted data available in S3 for downstream batch analytics.
- Priam is responsible for backing up Cassandra SSTables. It does that both incrementally, as SSTables are flushed to disk, and nightly as a full snapshot of all nodes.
- Aegisthus reads the SSTables required to create the dataset. For a full backup, we choose the SSTables for a single day. For an incremental backup, we use the most recently flushed files and apply them to our latest json dataset.
- Aegisthus compacts all the records into a single view of the data by removing duplicate records. To handle eventual consistency, where column values differ, Aegisthus keeps the value with the most recent timestamp (a sketch of this merge rule follows the list). Finally, it serializes the data into a format that can be consumed by our batch processing systems, currently a line of json per record.
- The scheduler queries Franklin for information about the job.
- The scheduler submits the Aegisthus job to Genie.
- Genie chooses a cluster to run the job on and the data gets transformed.
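To make the de-duplication step concrete, here is a minimal sketch of the merge rule described above: for each column of a row, keep only the value with the latest write timestamp. The class and field names are illustrative and are not the actual Aegisthus classes; a real implementation also has to honor deletion markers (tombstones) before emitting the merged row as a single line of json.

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnMerger {

    /** A single column version read from some SSTable (illustrative type). */
    public static class Column {
        final String name;
        final String value;
        final long timestamp; // write timestamp, as recorded by Cassandra

        Column(String name, String value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /**
     * Merge every version of a row seen across SSTables into a single view:
     * for each column name, keep only the value with the latest timestamp.
     */
    public static Map<String, Column> merge(Iterable<Column> versions) {
        Map<String, Column> latest = new HashMap<>();
        for (Column c : versions) {
            Column current = latest.get(c.name);
            if (current == null || c.timestamp > current.timestamp) {
                latest.put(c.name, c);
            }
        }
        return latest;
    }
}
```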
Together these tools make it simple for the person who needs to use the data to convert it into a format more suitable for additional processing.