Monday, November 18, 2013

Building the New Netflix Experience for TV

by Joubert Nel

We just launched a new Netflix experience for TV and game consoles. The new design is based on our premise that each show or movie has a tone and a narrative that should be conveyed by the UI. To tell a richer story we provide relevant evidence and cinematic art that better explain why we think you should watch a show or movie.




The new user interface required us to question our paradigms about what can be delivered on TV – not only is this UI more demanding of game consoles than any of our previous UIs, but we also wanted budget devices to deliver a richer experience than what was previously possible.

For the first time we needed a single UI that could accept navigation using a TV remote or game controller, as well as voice commands and remotes that direct a mouse cursor on screen.

Before we get into how we developed for performance and built for different input methods, let’s take a look at our UI stack.


UI Stack

My team builds Netflix UIs for the devices in your living room: PlayStation 3, PlayStation 4, Xbox 360, Roku 3, and recent Smart TVs and Blu-ray players.

We deploy UI updates with new A/B tests, support for new locales like the Netherlands, and new features like Profiles. While remaining flexible, we also want to take advantage of as much of the underlying hardware as possible in a cross-platform way.

So, a few years ago we broke our device client code into two parts: an SDK that runs on the metal, and a UI written in JavaScript. The SDK provides a rendering engine, JavaScript runtime, networking, security, video playback, and other platform hooks. Depending on the device, SDK updates range from quarterly to annually to never. The UI, in contrast, can be updated at any time and is downloaded (or retrieved from disk cache) when the user fires up Netflix.

Key, Voice, Pointer

The traditional way for users to control our UI on a game console or TV is via an LRUD input (left/right/up/down) such as a TV remote control or game controller. Additionally, Xbox 360 users must be able to navigate with voice commands, and owners of an LG Smart TV with a Magic Remote must be able to navigate by pointing the remote at elements on screen. Our new UI is our first to incorporate all three input methods in a single design.

We wanted to build our view components in such a way that their interaction behaviors are encapsulated. This code proximity makes code more maintainable and reusable and the class hierarchy more robust. We needed a consistent way to dispatch the three kinds of user input events to the view hierarchy.

We created a new JavaScript event dispatcher that routes key, pointer, and voice input in a uniform way to views. We needed an incremental solution that didn’t require refactoring the whole codebase, so we designed it to coexist with our legacy key handling and provide a migration path.
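A minimal sketch of the idea, shown in TypeScript for readability (the production UI is JavaScript, and the view and event names here are illustrative rather than our actual classes):

```typescript
// Normalized input events: every source is translated into one shape before dispatch.
type InputEvent =
  | { kind: "key"; key: "left" | "right" | "up" | "down" | "select" | "back" }
  | { kind: "pointer"; x: number; y: number; action: "move" | "click" }
  | { kind: "voice"; command: string };

interface View {
  children: View[];
  // Returns true if the view consumed the event, stopping further dispatch.
  handleInput(event: InputEvent): boolean;
}

class InputDispatcher {
  constructor(private root: View) {}

  dispatch(event: InputEvent): boolean {
    return this.dispatchTo(this.root, event);
  }

  private dispatchTo(view: View, event: InputEvent): boolean {
    // Offer the event to descendants first, then to the view itself.
    for (const child of view.children) {
      if (this.dispatchTo(child, event)) {
        return true;
      }
    }
    return view.handleInput(event);
  }
}
```

Because each view handles a single normalized event type, the same view code works whether the event originated from a remote control, a voice command, or a pointer.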

We must produce JavaScript builds that contain code only for the input methods supported by the target device, because reduced code size yields faster parsing, which in turn reduces startup time.

To produce lean builds, we use a text preprocessor to strip out input handling code that is irrelevant to a target platform. The advantage of using a text preprocessor instead of, for example, using mixins to layer in additional appearances and interactions, is that we get much higher levels of code proximity and simplicity.
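As a rough illustration (the directive syntax below is hypothetical, not our actual preprocessor), input-specific behavior lives next to the rest of the view and is simply stripped from builds for devices that don't support it:

```typescript
class TitleCardView {
  // #if POINTER
  // Only shipped to devices with a pointing remote; removed from LRUD-only builds.
  onPointerEnter(): void {
    this.focus();
  }
  // #endif

  // #if VOICE
  // Only shipped to devices that support voice commands.
  onVoiceCommand(command: string): void {
    if (command === "play") {
      this.play();
    }
  }
  // #endif

  focus(): void {
    /* move selection to this card */
  }

  play(): void {
    /* start playback */
  }
}
```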

Performance

Devices in the living room use DirectFB or OpenGL for graphics (or something OpenGL-like) and can use hardware acceleration for animating elements of the UI. Leveraging the GPU is key in creating a smooth experience that is responsive to user input – we’ve done it on WebKit using accelerated compositing (see WebKit in Your Living Room and Building the Netflix UI for Wii U).

The typical implementation of hardware-accelerated animation of a rectangle requires width x height x bytes per pixel of memory. In our UI we animate entire scenes when transitioning between them; animating one scene at 1080p would require close to 8MB of memory (1920 x 1080 x 4), but at 720p it requires 3.5MB (1280 x 720 x 4). We see devices with as little as 20MB of memory allocated to a hardware-accelerated rendering cache. Moreover, other system resources such as main memory, disk cache, and CPU may also be severely constrained compared to a mobile phone, laptop, or game console.
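The arithmetic behind those numbers, as a quick sketch (the helper is only illustrative):

```typescript
// Memory needed to animate a full scene as a single RGBA surface (4 bytes per pixel).
const surfaceMB = (width: number, height: number, bytesPerPixel = 4): number =>
  (width * height * bytesPerPixel) / (1024 * 1024);

console.log(surfaceMB(1920, 1080).toFixed(1)); // "7.9" -> close to 8MB at 1080p
console.log(surfaceMB(1280, 720).toFixed(1));  // "3.5" -> 3.5MB at 720p
```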

How can we squeeze as much performance as possible out of budget devices and add more cinematic animations on game consoles?

We think JavaScript, HTML, and CSS are great technologies to build compelling experiences with, such as our HTML5 player UI. But we wanted more fine-grained control of the graphics layer and optimizations for apps that do not need reflowable content. Our SDK team built a new rendering engine with which we can deliver animations on very resource-constrained devices, making it possible to give customers our best UI. We can also enrich the experience with cinematic animations and effects on game consoles.

Our second strategy is to group devices into performance classes that give us entry points to turn different knobs, such as pool sizes, prefetch ranges, effects, animations, and caching, to take advantage of fewer or more resources while maintaining the integrity of the UI design and interaction.
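A sketch of what such performance classes might look like; the class names and knob values below are illustrative, not our real configuration:

```typescript
// Each device is assigned a class, and the UI reads its knobs at startup.
interface PerformanceProfile {
  imagePoolSize: number;        // decoded images kept in memory
  prefetchRowRange: number;     // rows of artwork fetched ahead of the focused row
  cinematicTransitions: boolean;
  diskCacheMB: number;
}

const profiles: Record<"budget" | "standard" | "console", PerformanceProfile> = {
  budget:   { imagePoolSize: 40,  prefetchRowRange: 1, cinematicTransitions: false, diskCacheMB: 16 },
  standard: { imagePoolSize: 120, prefetchRowRange: 2, cinematicTransitions: true,  diskCacheMB: 64 },
  console:  { imagePoolSize: 400, prefetchRowRange: 4, cinematicTransitions: true,  diskCacheMB: 256 },
};
```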


Delivering great experiences

In the coming weeks we will be diving into more details of our JavaScript code base on this blog.

Building the new Netflix experience for TV was a lot of work, but it gave us a chance to be a PlayStation 4 launch partner, productize our biggest A/B test successes of 2013, and delight tens of millions of Netflix customers.

If this excites you and you want to help build the future UIs for discovering and watching shows and movies, join our team!


Preparing the Netflix API for Deployment

At Netflix, we are committed to bringing new features and product enhancements to customers rapidly and frequently. Consequently, dozens of engineering teams are constantly innovating on their services, resulting in a rate of change to the overall service that is vast and unending. Because of our appetite for providing such improvements to our members, it is critical for us to maintain a Software Delivery pipeline that allows us to deploy changes to our production environments in an easy, seamless, and quick way, with minimal risk.


The Challenge
The Netflix API is responsible for distributing content that supports our features to the UIs. The API, in turn, is dependent on a large number of services that are responsible for generating the data. We often use an hourglass metaphor to describe the Netflix technology stack, with the API being the slender neck.
In the picture above, the changes in content from the services at the bottom of the hourglass need to be delivered in a rapid fashion to the dozens of UIs at the top. The UI teams are on their own rapid iteration cycles, trying to create or modify user experiences for our customers based on that content. The API is the primary conduit for the flow of this content between the services and UIs. This imposes a unique challenge when it comes to our development, testing and deployment flows: we need to move as fast as or faster than the aggregate of the UIs we serve and the services from which we consume. In order to deliver on that challenge and help us keep up with the demands of the business, we are continually investing in our Software Delivery pipeline, with a view towards Continuous Delivery as our end goal.

We have already described the technology and automation that powers our global, multi-region deployments in an earlier blog post. In this post, we will delve into the earlier stages of our pipeline, focusing on our approach to generating an artifact that is ready for deployment. Automation and Insight remain key themes through this part of our pipeline, just as they are on the deployment side.


Delivery pipeline 
The diagram below illustrates the flow of code through our Delivery pipeline. The rest of this post describes the steps denoted by the gray boxes.
Let’s take a look at how a commit makes its way through the pipeline. A developer starts by making code changes for a feature or bug fix. First, s/he decides which branch in the source tree to work off:
  • Release - on a Weekly deployment schedule
  • Prod - on a Daily deployment schedule
There are no hard and fast rules regarding which branch to use; just broad guidelines. Typically, we use the Release branch for changes that meet one of the following criteria:
  • Require additional testing by people outside of the API Team
  • Require dependency library updates
  • Need additional "bake-time" before they can be deployed to Production

[1] Developers build and run tests in their local environment as part of their normal workflow. They make the determination as to the appropriate level of testing that will give them confidence in their change. They have the capability to wrap their changes with Feature Flags, which can be used to enable or disable code paths at runtime.
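A minimal sketch of the feature-flag idea; the flag names and stub functions below are illustrative, and the real flag service is not shown:

```typescript
// Flags are evaluated at runtime, so a change can ship dark and be enabled later
// without a redeploy.
class FeatureFlags {
  constructor(private readonly flags: Map<string, boolean>) {}

  isEnabled(name: string): boolean {
    return this.flags.get(name) ?? false; // unknown flags default to off
  }
}

const flags = new FeatureFlags(new Map([["newRecommendationPath", false]]));

function getRecommendations(memberId: string): string[] {
  if (flags.isEnabled("newRecommendationPath")) {
    return newRecommendationEngine(memberId); // new code path, off in Production
  }
  return legacyRecommendationEngine(memberId);
}

// Stubs standing in for the real implementations.
function newRecommendationEngine(memberId: string): string[] {
  return [];
}
function legacyRecommendationEngine(memberId: string): string[] {
  return [];
}
```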

[2] Once a change is committed to the repository, it is part of the Delivery pipeline. 

[3][4] Every commit triggers a Build/Test/Deploy cycle on our Continuous Integration server. If the build passes a series of gates, it is deployed to one of our Pre-Production environments [5], depending on the branch in source that the build was generated from. We maintain several environments to satisfy the varied requirements of our user base, which consists of Netflix internal Dev and Test teams as well as external partners. With this process, we ensure that every code change is available for use by our partners within roughly an hour of being committed to the repository.

[6] If the change was submitted to the Prod branch or a manual push trigger was invoked, the build is deployed to the canary cluster as well. It is worth mentioning that our unit of deployment is actually an Amazon Machine Image (AMI), which is generated from a successful build. For the purposes of this post, though, we'll stick to the term “build”.

Dependency Management
This is a good time to take a deeper look at our approach to dependency management. In the Netflix SOA implementation, services expose their interfaces to consumers through client libraries that define the contract between the two services. Those client libraries get incorporated into the consuming service, introducing build and runtime dependencies for the service. This is highly magnified in the case of the API because of its position in the middle of the hourglass, as depicted at the beginning of the post. With dozens of dependencies and hundreds of client libraries, getting a fully functional version of the API server stood up can in itself be tricky. To add to that, at any point in time, several of these libraries are being updated to support critical new functionality or bug fixes.

To keep up with the volume and rate of change, we have a separate instance of our pipeline dedicated to library integration. This integration pipeline builds an API server with the latest releases of all client libraries and runs a suite of functional tests against it. If that succeeds, it generates a configuration file with hardcoded library versions and commits it into SCM. Our developers and the primary build pipeline use this locked set of libraries, which allows us to insulate the team from instabilities introduced by client library changes while still keeping up with library updates.

Testing
Next, we explore our test strategy in detail. This is covered in steps 1-4 in the pipeline diagram above. It is helpful to look at our testing in terms of the Software Testing Pyramid to get a sense for our multi-layered approach.

Local Development/Test iterations

Developers have the ability to run unit and functional tests against a local server to validate their changes. Tests can be run at a much more granular level than on the CI server, so as to keep run times short and support iterative development.

Continuous Integration
Once code has been checked in, the CI servers take over. The goal is to assess the health of the branch in the SCM repository with every commit, including automated code merges between various branches as well as dependency updates. Failure to generate a deployable artifact, or any failing unit tests, causes the build to fail and blocks the pipeline.

Contract validations
Our interface to our consumers is via a Java API that uses the Functional Reactive model implemented by RxJava. The API is consumed by Groovy scripts that our UI teams develop. These scripts are compiled dynamically at runtime within an API server, which is when an incompatible change to the API would be detected. Our build process includes a validation step based on API signature checks to catch such breakages earlier in the cycle. Any non-additive changes to the API as compared to the version that’s currently in Production fail the build and halt the pipeline.
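A sketch of the additive-only check; how signatures are actually extracted from the Java API is out of scope here, and the method strings below are only examples:

```typescript
// Every signature exposed by the version currently in Production must still be
// present in the candidate build; anything removed or changed fails the build.
function findRemovedSignatures(production: Set<string>, candidate: Set<string>): string[] {
  return [...production].filter((signature) => !candidate.has(signature));
}

const productionApi = new Set([
  "Observable<Video> getVideo(String id)",
  "Observable<VideoList> getVideos(int count)",
]);

const candidateApi = new Set([
  "Observable<Video> getVideo(String id)",
  "Observable<VideoList> getVideos(int count)",
  "Observable<VideoList> getVideos(int count, String genre)", // additive: allowed
]);

const removed = findRemovedSignatures(productionApi, candidateApi);
if (removed.length > 0) {
  throw new Error(`Non-additive API change detected: ${removed.join(", ")}`);
}
```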

Integration Tests
A build that has passed unit and contract validation tests gets deployed to the CI server where we test it for readiness to serve traffic. After the server is up and running with the latest build, we run functional tests against it. Given our position in the Netflix stack, any test against the API is by definition an integration test. Running integration tests reliably in a live, cloud-based, distributed environment poses some interesting challenges. For instance, each test request results in multiple network calls on the API server, which implies variability in response times. This makes tests susceptible to timing related failures. Further, the API is designed to degrade gracefully by providing partial or fallback responses in case of a failure or latency in a backend service. So responses are not guaranteed to be static. Several backend services provide personalized data which is dynamic in nature. Factors such as these contribute towards making tests non-deterministic, which can quickly diminish the value of our test suites.

We follow a few guiding principles and techniques to keep the non-determinism under control.
  • Focus on the structure rather than the content of the responses from the API. We leave content verification to the services that are responsible for generating the data. Tests are designed with this in mind (see the sketch after this list).
  • Account for cold caches and connections by priming them prior to test execution.
  • Design the system to provide feedback when it is returning fallback responses or otherwise running in a degraded mode so tests can leverage that information.
  • Quarantine unreliable tests promptly and automate the process of associating them with tasks in our bug tracking system.
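For example, a structural check on a response might look like the sketch below; the field names are hypothetical:

```typescript
// Assert the shape of a response without pinning its (personalized, changing) content.
function assertVideoListShape(response: { videos?: unknown }): void {
  if (!Array.isArray(response.videos)) {
    throw new Error("expected a 'videos' array");
  }
  for (const video of response.videos as { id?: unknown; title?: unknown }[]) {
    if (typeof video.id !== "string" || typeof video.title !== "string") {
      throw new Error("each video must have a string id and title");
    }
  }
  // Deliberately no assertion on which titles come back: that content is owned
  // by the backend services and varies per member.
}
```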
These techniques have allowed us to keep the noise in the test results down to manageable levels. A test failure in our primary suite blocks the build from getting deployed anywhere.  
Another key check we perform at this stage is related to server startup time, which we track across builds. If this metric crosses a certain threshold, the pipeline is halted.
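A sketch of such a gate; the baseline calculation and threshold are illustrative:

```typescript
// Halt the pipeline if the candidate build starts up noticeably slower than the
// average of recent builds.
function checkStartupTime(
  candidateSeconds: number,
  recentBuildSeconds: number[],
  maxRegressionFactor = 1.2,
): void {
  const baseline =
    recentBuildSeconds.reduce((sum, s) => sum + s, 0) / recentBuildSeconds.length;
  if (candidateSeconds > baseline * maxRegressionFactor) {
    throw new Error(
      `Startup took ${candidateSeconds}s, more than ${maxRegressionFactor}x the ` +
        `recent baseline of ${baseline.toFixed(1)}s; halting the pipeline`,
    );
  }
}
```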

User Acceptance Tests
If a build has made it to this point, it is ready to be deployed to one or more internal environments for user-acceptance testing. Users could be UI developers implementing a new feature using the API, UI testers performing end-to-end testing, or automated UI regression tests. As far as possible, we strive not to have user-acceptance tests be a gating factor for our deployments. We do this by wrapping functionality in Feature Flags so that it is turned off in Production while testing is happening in other environments.

Canary Analysis
This is the final step in the process before the code is deployed globally. A build that has reached this step is a candidate for Production deployment from a functional standpoint. The canary process analyzes the readiness of the build from a systems perspective. It is described in detail in the post on Deploying the API.

Insight - tracking changes
Continuous Delivery is all about feedback loops. As a change makes its way through the pipeline, the automation delivers the results of each step to the concerned person(s) in real time via email and/or chat notifications. Build and Test results are also captured and made available on our Build Status Dashboard for later viewing. Everyone committing a change to the Release or Prod branch is responsible for ensuring that the branch is in a deployable state after their change. The Build Dash always displays the latest status of each branch, including a history of recent commits and test runs. Members of our Delivery Team are on a weekly rotation to ensure smooth operation of the pipelines.



Conclusion
The work we’ve done so far has been helpful in increasing our agility and our ability to scale with the growth of the business. But as Netflix grows and our systems evolve, there are newer challenges to overcome. We are experimenting in different areas like developer productivity, improved insight into code usage, and optimizing the pipeline for speed. If any of these sound exciting, take a look at our open roles.


Thursday, November 14, 2013

Netflix Open Source Software Cloud Prize Winners

We launched the Netflix Open Source Software Cloud Prize in March 2013 and it got a lot of attention in the press and blogosphere. Six months later we closed the contest, took a good look at the entrants, and picked the best as nominees; a panel of distinguished judges then decided the winners in each category. The final winners were announced during Werner Vogels’ keynote at AWS Re:Invent on November 14th, 2013.



The ten winners all put in a lot of work to earn their prizes. They each won a trip to Las Vegas and a ticket to AWS Re:Invent, a Cloud Monkey trophy, $10,000 in prize money from Netflix, and $5,000 in AWS credits from Amazon. After the keynote was over we all went on stage to get our photo taken with Werner Vogels.







Peter Sankauskas (@pas256) is a software engineer living in Silicon Valley, and the founder of Answers for AWS (@Answers4AWS). He specializes in automation, scaling and Amazon Web Services. Peter contributed Ansible playbooks, CloudFormation templates and many pre-built AMIs to make it easier for everyone else to get started with NetflixOSS, and put Asgard and Edda in the AWS Marketplace. He recently started his own company, AnsWerS, to help people who want to move to AWS. Getting started with NetflixOSS is made harder because there are 35 different projects to figure out; Peter has created an extremely useful and simple on-ramp for new NetflixOSS users.




Chris Grzegorczyk (eucaflix, grze, of Goleta California) is Chief Architect and Co-Founder, and Vic Iglesias is Quality and Release Manager at Eucalyptus Systems. Eucalyptus have been using NetflixOSS to provide a proof point for portability of applications from AWS to private clouds based on Eucalyptus. Eucalyptus is open source software for building private and hybrid clouds that are compatible with AWS APIs. Their submission enables NetflixOSS projects to treat Eucalyptus as an additional AWS region and to deploy applications to AWS regions and Eucalyptus datacenters from the same Asgard console.


In June 2013 they shipped a major update to Eucalyptus that included advanced AWS features such as Autoscale Groups that NetflixOSS depends on. Eucalyptus have demonstrated working code at several Netflix meetups and have really helped promote the NetflixOSS ecosystem.




IBM had previously created a demonstration application called Acme Air for their WebSphere tools running on IBM SmartCloud. It was a fairly conventional enterprise architecture application, with a Java front end and a database back end. For their winning prize entry, Andrew Spyker (aspyker, of Raleigh, North Carolina) figured out how to re-implement Acme Air as a cloud native example application using NetflixOSS libraries and component services, running on AWS. He then ran some benchmark stress tests to demonstrate scalability. This was demonstrated at a Netflix Meetup last summer. The Acme Air example application combines several NetflixOSS projects: the Eureka service registry, the Hystrix circuit breaker pattern, the Karyon base server framework, the Ribbon HTTP client, and the Asgard provisioning portal. IBM used NetflixOSS to get a deeper understanding of Cloud Native architecture and tools, which it can apply to helping enterprise customers make the transition to cloud.




The Reactive Extensions (Rx) pattern is one of the most advanced and powerful concepts for structuring code to come out in recent years. The original work on Rx at Microsoft by Erik Meijer inspired Netflix to create the RxJava project. We started with a subset of Rx functionality and left a lot of “to do” areas. As the project matured we began to extend RxJava to include other JVM-based languages, and Joachim Hofer (jmhofer, Möhrendorf, Germany) has made major contributions to type safety and Scala support, with over thirty pull requests.


Joachim works at Imbus AG in Möhrendorf, Germany, where he is the lead developer of an agile product team and a Scala enthusiast working on moving their stack from J2EE to Scala/Play/Akka/Spray/RxJava.



Anyone familiar with Hadoop tools and the big data ecosystem knows about the Pig language. It provides a way to specify a high-level dataflow for processing, but Pig scripts can get complex and hard to debug. Netflix built and open sourced a visualization and monitoring tool called Lipstick, and it was adopted by Mark Roddy at a vendor called Mortar (mortardata, of New York, NY), who worked with us to generalize some of the interfaces and integrate it with their own Pig-based Hadoop platform. We saved Mortar from having to create their own tool to do this, and Netflix now has an enthusiastic partner to help improve and extend Lipstick so everyone who uses it benefits.




Jakub Narloch (jmnarloch, of Szczecin, Poland) created a test suite for NetflixOSS Karyon based on JBoss Arquillian. The extension integrates with Karyon's Google Guice dependency injection functionality, allowing developers to write tests that directly access the application's auto-scanned components. The tests are executed in the application container. Arquillian brings wide support for different containers, including Tomcat, Jetty and JBoss AS. Karyon is the base server that underpins NetflixOSS services and acts as the starting point for developing new services. Since Genie is based on Karyon, we were able to leverage this integration to use Arquillian to test Genie, and the changes have been merged into the code that Netflix uses internally.


Jakub Narloch is a software engineer working at Samsung Electronics. He received the JBoss Community Recognition Award this year for his open source contributions. In the past year he has been actively helping to develop the JBoss Arquillian project, authoring four completely new extensions and helping to shape many others. His adventure with the open source world began a couple of years earlier, and he has also contributed code to projects like Spring Framework, Castor XML and NetflixOSS. Last year he graduated with honors from Warsaw University of Technology with an MSc degree in Computer Science. In the past he took part in two editions of Google Summer of Code, and in his free time he likes to solve the software development contests held by TopCoder Inc.




In the real world, complex web service APIs are hard to manage, and NetflixOSS includes the Zuul API gateway, which is used to authenticate, process, and route HTTP requests. The next winner is Neil Beveridge (neilbeveridge, of Kent, United Kingdom). He was interested in porting the Zuul container from Tomcat to Netty, which also provides non-blocking outbound requests, and benchmarking the difference. Neil ran the benchmarks with help from Raamnath Mani, Fanta Gizaw and Will Tomlin at Hotels.com. They ran into an interesting problem with Netty consuming excess CPU and running slower than the original Tomcat version, and then ran into the contest deadline, but have since continued work to debug and tune the Netty code and come up with higher performance for Netty and some comparisons of cloud and bare metal performance for Zuul. Since Netflix is also looking at moving some of our services from Tomcat to Netty, this is a useful and timely contribution. It’s also helpful to other people considering using Zuul to have some published benchmarks to show the throughput on common AWS instance types.




Although the primary storage used by Netflix is based on Cassandra, we also use AWS RDS to create many small MySQL databases for specific purposes. Other AWS customers use RDS much more heavily. Jiaqi Guo (jiaqi, Chicago, Illinois) has built Datamung to automate backup of RDS to S3 and replication of backups across regions for disaster recovery. Datamung is a web-based, Simple Workflow-driven application that backs up RDS MySQL databases into S3 objects by launching an EC2 instance and running the mysqldump command. It makes it possible to replicate RDS across regions, VPCs, and accounts, or outside the AWS network.





When we started to build the Denominator library for portable DNS management we contacted Neustar to discuss their UltraDNS product, and made contact with Jeff Damick (jdamick, of South Riding, Virginia). His input as we structured the early versions of Denominator was extremely useful, and provides a great example of the power of developing code in public. We were able to tap into his years of experience with DNS management, and he was able to contribute code, tests and fixes to the Denominator code and fixes to the UltraDNS API itself.




Justin Santa Barbara (justinsb, of San Francisco, California) decided to make the Chaos Monkey far more evil, and created fourteen new variants, a “barrel of chaos monkeys”. They interfere with the network, causing routing failure, packet loss, network data corruption and extra network latency. They block access to DNS, S3, DynamoDB and the EC2 control plane. They interfere with storage by disconnecting EBS volumes, filling up the root disk, and saturating the disks with IO requests. They interfere with the CPU by consuming all the spare cycles, or killing off all the processes written in Python or Java. When run, a random selection is made, and the victims suffer the consequences. This is an excellent but scary workout for our monitoring and repair/replacement automation.





We are pleased to have such a wide variety of winners, from individuals around the world, small and large companies, vendors and end users. Many thanks to all of them for the work they have put into helping grow the NetflixOSS ecosystem, and thanks to everyone else who just uses NetflixOSS or entered the contest but didn’t make the final cut.

The winners, judges and support team got "Cloud Monkey" trophies custom made by Bleep Labs.


Tuesday, November 5, 2013

Scryer: Netflix’s Predictive Auto Scaling Engine

To deliver the best possible experience to Netflix customers around the world, it is critical for us to maintain a robust, scalable, and resilient system. That is why we have built (and open sourced) applications ranging from Hystrix to Chaos Monkey. All of these tools better enable us to prevent or minimize outages, respond effectively to outages, and anticipate the kinds of operational gaps that may eventually result in outages. Recently we have built another such tool that has been helping us in this ongoing challenge: Scryer.
Scryer is a new system that allows us to provision the right number of AWS instances needed to handle the traffic of our customers. Scryer is different from Amazon Auto Scaling (AAS), which reacts to real-time metrics and adjusts instance counts accordingly; rather, Scryer predicts what the needs will be prior to the time of need and provisions instances based on those predictions.
This post is the first in a series that will provide greater details on what Scryer is, how it works, how it differs from Amazon Auto Scaling, and how we employ it at Netflix.

Amazon Auto Scaling and the Netflix Use Case

At the core, AAS is a reactive auto scaling model. That is, AAS dynamically adjusts server counts based on a cluster’s current workload (most often the metric of choice will be something like load average). When spiking or dropping beyond a certain point, AAS policies will trigger the addition or removal of instances. For Netflix, this has proven to be quite effective at improving system availability, optimizing costs, and in some cases reducing latencies.  Overall, AAS is a big win and companies with any kind of scale in AWS should be employing this service.  
For Netflix, however, there are a range of use cases that are not fully addressed by AAS. The following are some examples:
  • Rapid spike in demand: Instance startup times range from 10 to 45 minutes. During that time our existing servers are vulnerable, especially if the workload continues to increase.
  • Outages: A sudden drop in incoming traffic from an outage is sometimes followed by a retry storm (after the underlying issue has been resolved). A reactive system is vulnerable in such conditions because a drop in workload usually triggers a down scale event, leaving the system under provisioned to handle the ensuing retry storm.
  • Variable traffic patterns: Different times of the day have different workload characteristics and fleet sizes. Some periods show a rapid increase in workload with a relatively small fleet size (20% of maximum), while other periods show a modest increase with a fleet size of 80% of the maximum, making it difficult to handle such variations in optimal ways.
Some of these issues can be mitigated by scaling up aggressively, but this is often undesirable as it may lead to scale-up/scale-down oscillations. Another option is to always run more servers than required, which is clearly not optimal from a cost perspective.

Scryer: Our Predictive Auto Scaling Engine

Scryer was inspired in part by these unaddressed use cases, but its genesis was triggered more by our relatively predictable traffic patterns. The following is an example of five days' worth of traffic:

In this chart, there are very clearly spikes and troughs that sync up with consistent patterns and times of day. There are definitely going to be spikes and valleys that we cannot predict, and the traffic does evolve over longer periods of time. That said, over any given week or month, we have a very good idea of what the traffic will look like, as the basic curves are the same. Moreover, these same five days of the week are likely to have the same patterns the week before and the week after (assuming no outages or special events).
Because of these trends, we believed that we would be able to generate a set of algorithms that could predict our capacity needs ahead of actual demand, rather than simply relying on the reactive model of AAS. The following chart shows the result of that effort: the output from our prediction algorithms aligns very closely to our actual metrics.
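As a toy illustration of the underlying idea (this is not Scryer's actual algorithm), one could predict each day-of-week and time-of-day slot from the average of the same slot in previous weeks:

```typescript
interface Sample {
  timestamp: Date;
  requestsPerSecond: number;
}

// Average historical traffic per (day-of-week, hour, minute) slot; the prediction
// for next Tuesday at 20:15 is the mean of previous Tuesdays at 20:15.
function predictBySlot(history: Sample[]): Map<string, number> {
  const totals = new Map<string, { sum: number; count: number }>();
  for (const { timestamp, requestsPerSecond } of history) {
    const slot = `${timestamp.getUTCDay()}-${timestamp.getUTCHours()}:${timestamp.getUTCMinutes()}`;
    const entry = totals.get(slot) ?? { sum: 0, count: 0 };
    entry.sum += requestsPerSecond;
    entry.count += 1;
    totals.set(slot, entry);
  }
  const prediction = new Map<string, number>();
  for (const [slot, { sum, count }] of totals) {
    prediction.set(slot, sum / count);
  }
  return prediction;
}
```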

Once these predictions were optimized, we attached them to AWS APIs to trigger changes in capacity. The following chart shows that our scheduled scaling action plan closely matches our actual traffic, with each step minimized to achieve the best performance.
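A minimal sketch of turning one predicted capacity step into a scheduled scaling action, using the AWS SDK for JavaScript; the group name, sizes, and lead time are illustrative, and this is not Scryer's actual integration:

```typescript
import AWS from "aws-sdk";

const autoscaling = new AWS.AutoScaling({ region: "us-east-1" });

// Schedule the desired capacity ahead of the predicted demand, so instances are
// up and warm by the time the traffic arrives.
async function scheduleCapacityStep(startTime: Date, desiredCapacity: number): Promise<void> {
  await autoscaling
    .putScheduledUpdateGroupAction({
      AutoScalingGroupName: "api-prod",                        // illustrative group name
      ScheduledActionName: `scryer-${startTime.toISOString()}`,
      StartTime: startTime,
      MinSize: desiredCapacity,
      DesiredCapacity: desiredCapacity,
    })
    .promise();
}
```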
We have been running Scryer in production for a few months. The following is a list of the key benefits that we have seen with it:
  • Improved cluster performance
  • Better service availability
  • Reduced EC2 costs

Predictive-Reactive Auto Scaling - A Hybrid Approach

As effective as Scryer has been in predicting and managing our instance counts, the real strength of Scryer is in how it operates in tandem with AAS’s reactive model.  

If we are able to predict the workload of a cluster in advance, then we can proactively scale the cluster ahead of time to accurately meet workload needs. But there will certainly be cases where Scryer cannot predict our needs, such as an unexpected surge in workload.  In these cases, AAS serves as an excellent safety net for us, adding instances based on those unanticipated, unpredicted needs.
The two auto scaling systems combined provide a much more robust and efficient solution as they complement each other.


Conclusion

Overall, Scryer has been incredibly effective at predicting our metrics and traffic patterns, allowing us to better manage our instance counts and stabilize our systems. We are still rolling this out to the breadth of services within Netflix and will continue to explore its use cases and optimize the algorithms. So far, though, we are excited about the results and are eager to see how it behaves in different environments and conditions.
In the coming weeks, we plan to publish several more posts discussing Scryer in greater detail, digging deeper into its features, design, technology and algorithms. We are exploring the possibility of open sourcing Scryer in the future as well. 

Finally, we work on these kinds of exciting challenges all the time at Netflix.  If you would like to join us in tackling such problems, check out our Jobs site.
