Netflix has very diverse data needs. Those needs fall anywhere between rock-solid durable datastores, like Apache Cassandra, and lossy in-memory stores, such as the current incarnation of Dynomite. Somewhere in that spectrum is the need to store, index and search documents. This is where Elasticsearch has found a niche at Netflix.
Elasticsearch usage at Netflix has proliferated over the past year. It began as one or two isolated deployments, each managed by the team using it. That usage has quickly grown to more than 15 clusters (755 nodes) in production, centrally managed by the Cloud Database Engineering (CDE) team.
CDE, like all of Netflix, believes in automating the operations of our production systems. This is what led us to create tools such as Priam, a sidecar to help manage Apache Cassandra clusters. That same philosophy led us to create Raigad, an Elasticsearch sidecar.
Integration with a centralized monitoring system
Raigad collects Elasticsearch metrics and publishes them to a centralized telemetry, monitoring and alerting system. This is achieved by using the Netflix Open Source project Servo. Raigad’s architecture also allows you to integrate it with your own telemetry system.
Node discovery and tracking
We’ve included a sample implementation that uses Cassandra to let Raigad keep track of metadata about Elasticsearch clusters. During bootstrap, every Elasticsearch instance reads from Cassandra to discover the other nodes it needs to connect to. In this sample implementation, Cassandra eases multi-region Elasticsearch deployments by replicating Elasticsearch metadata across every region where Elasticsearch is deployed. Node discovery could also be implemented using Eureka.
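To make the discovery step concrete, here is a minimal sketch of the lookup an instance might perform at bootstrap. It assumes the replicated metadata has already been read into memory; the `NodeRecord` schema and the `discover_peers` helper are hypothetical names for illustration, not Raigad's actual classes or table layout.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class NodeRecord:
    """One row of cluster metadata (hypothetical schema, not Raigad's)."""
    cluster: str
    region: str
    instance_id: str
    host: str


def discover_peers(records: Iterable[NodeRecord], cluster: str, self_id: str) -> List[str]:
    """Return the hosts of every other node registered for `cluster`, in any region."""
    peers = {
        r.host
        for r in records
        if r.cluster == cluster and r.instance_id != self_id
    }
    # Sort so that every node derives the same, stable host list.
    return sorted(peers)


# Example rows, as they might look after cross-region replication.
records = [
    NodeRecord("es_logs", "us-east-1", "i-aaa", "10.0.0.1"),
    NodeRecord("es_logs", "eu-west-1", "i-bbb", "10.1.0.1"),
    NodeRecord("es_search", "us-east-1", "i-ccc", "10.0.0.9"),
]
print(discover_peers(records, "es_logs", "i-aaa"))  # ['10.1.0.1']
```

Because the metadata store is replicated across regions, the node in us-east-1 sees its eu-west-1 peer without any region-specific logic in the lookup itself.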
Auto configuration of the elasticsearch.yml file
Raigad provides a range of configuration parameters to tune elasticsearch.yml at bootstrap time, e.g., ASG-based dedicated master, data and search node deployments (the default at Netflix), multi-region deployments and tribe node setups.
Raigad also takes care of cleaning up old indices and creating new ones, based on the retention period provided for each index through configuration parameters. We currently support daily, monthly and yearly retention periods.
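As a rough illustration of ASG-based configuration, the sketch below infers a node's role from its ASG name and renders a few settings as YAML. The `-master`/`-data`/`-search` suffix convention and all function names are assumptions made for this example, not Raigad's actual naming scheme or code. The setting keys shown are the ones used by Elasticsearch 1.x-era deployments and may differ in other versions.

```python
from typing import Dict, List


def node_role(asg_name: str) -> str:
    """Infer the node role from an ASG name suffix (naming scheme assumed for illustration)."""
    for role in ("master", "data", "search"):
        if asg_name.endswith("-" + role):
            return role
    raise ValueError(f"cannot infer role from ASG name: {asg_name!r}")


def build_settings(cluster: str, asg_name: str, peers: List[str]) -> Dict[str, object]:
    """Build role-specific settings; a search node is neither master-eligible nor a data node."""
    role = node_role(asg_name)
    return {
        "cluster.name": cluster,
        "node.master": role == "master",
        "node.data": role == "data",
        "discovery.zen.ping.unicast.hosts": list(peers),
    }


def render_yaml(settings: Dict[str, object]) -> str:
    """Render flat settings as simple YAML lines (enough for this sketch, not a general YAML writer)."""
    lines = []
    for key, value in settings.items():
        if isinstance(value, bool):
            rendered = "true" if value else "false"
        elif isinstance(value, list):
            rendered = "[" + ", ".join(f'"{v}"' for v in value) + "]"
        else:
            rendered = str(value)
        lines.append(f"{key}: {rendered}")
    return "\n".join(lines) + "\n"


print(render_yaml(build_settings("es_logs", "es_logs-data", ["10.0.0.1", "10.1.0.1"])))
```

Deriving the role from the ASG name means one machine image can serve every node type, with the deployment group alone deciding how each instance is configured.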
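The retention logic can be sketched as follows. This assumes index names are a prefix followed by a date suffix (`YYYYMMDD` for daily, `YYYYMM` for monthly, `YYYY` for yearly); that naming scheme, and the names `periods_old` and `expired_indices`, are hypothetical and chosen only for illustration. The sketch covers selecting indices for cleanup, not creating new ones.

```python
from datetime import date
from typing import Iterable, List, Optional

# Length of the date suffix for each retention type (assumed naming scheme).
_SUFFIX_LEN = {"daily": 8, "monthly": 6, "yearly": 4}


def periods_old(name: str, prefix: str, retention_type: str, today: date) -> Optional[int]:
    """Return the index's age in days, months or years, or None if the name does not match."""
    if not name.startswith(prefix):
        return None
    suffix = name[len(prefix):]
    if len(suffix) != _SUFFIX_LEN[retention_type] or not suffix.isdecimal():
        return None
    year = int(suffix[:4])
    if retention_type == "yearly":
        return today.year - year
    month = int(suffix[4:6])
    if not 1 <= month <= 12:
        return None
    if retention_type == "monthly":
        return (today.year - year) * 12 + (today.month - month)
    try:
        day = date(year, month, int(suffix[6:8]))
    except ValueError:  # e.g. a suffix such as 20150231
        return None
    return (today - day).days


def expired_indices(names: Iterable[str], prefix: str, retention_type: str,
                    retention_period: int, today: date) -> List[str]:
    """Select the indices whose age exceeds the retention period."""
    expired = []
    for name in names:
        age = periods_old(name, prefix, retention_type, today)
        if age is not None and age > retention_period:
            expired.append(name)
    return expired


names = ["logs20150101", "logs20150110", "logs20150112", "metrics20150101"]
print(expired_indices(names, "logs", "daily", 5, date(2015, 1, 12)))  # ['logs20150101']
```

Indices that do not match the expected prefix and suffix are skipped rather than deleted, which keeps a cleanup job from touching indices it does not manage.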
Improvements to run better in AWS
Raigad is used extensively at Netflix in the AWS environment. As mentioned above, dedicated node deployments rely on an ASG naming convention. With regard to credentials, Raigad supports Amazon’s IAM key profile management. Using IAM credentials allows you to provide access to the AWS API without storing an AccessKeyId or SecretAccessKey anywhere on the machine. If required, you can plug in your own credential implementation instead.
Raigad also supports scheduled nightly snapshot backups to S3, along with restores at startup or via a REST call. (It uses elasticsearch-aws-plugin underneath.)
You can get more info about the features described above or about how to use and install Raigad here.
Distributed systems are complex to operate and to recover from failure. Add to that the huge scale at which Netflix operates, and you quickly need to decide how to operate such systems. You can either scale out a team to handle the load, or build good automation that can monitor, analyze and alleviate issues automatically. Netflix’s approach has always been the latter. Raigad continues this trend by providing a tool to help manage our growing Elasticsearch deployment.
CDE is very excited to add Raigad to our ever-growing NetflixOSS library. If you run Elasticsearch on AWS at scale, we believe Raigad may be useful to you too. As with all of our projects, feedback and code or documentation submissions are always welcome.
If you are passionate about Elasticsearch, or Open Source Software in general, we are always looking for great engineers.