Tuesday, February 21, 2012

Announcing Priam

By Praveen Sadhu, Vijay Parthasarathy & Aditya Jami

We talked in the past about our move to NoSQL and Cassandra has been a big part of that strategy. Cassandra hit a big milestone recently with the announcement of the v1 release. We recently announced Astyanax, Netflix's Java Cassandra client with an improved API and connections management which we open sourced last month.

Today, we're excited to announce another milestone on our open source journey with an addition to make operations and management of Cassandra easier and more automated.

As we embarked on making Cassandra one of our NoSQL databases in the cloud, we needed tools for managing configuration, providing reliable and automated backup/recovery, and automating token assignment within and across regions. Priam was built to meet these needs. The name 'Priam' refers to the king of Troy, in Greek mythology, who was the father of Cassandra.

What is Priam?

Priam is a co-process that runs alongside Cassandra on every node to provide the following functionality:
  • Backup and recovery
  • Bootstrapping and automated token assignment.
  • Centralized configuration management
  • RESTful monitoring and metrics
We are currently using Priam to manage several dozen Cassandra clusters and counting.

Backup and recovery

A dependable backup and recovery process is critical when choosing to run a database in the cloud. With Priam, a daily snapshot and incremental data for all our clusters is backed up to S3. S3 was an obvious choice for backup data due to its simple interface and ability to access any amount of data from anywhere[1].

Snapshot backup

Priam leverages Cassandra's snapshot feature to have an eventually consistent backup[2]. Cassandra flushes data to disk and hard-links all SSTable files (data files) into a snapshot directory. SSTables are immutable files and can be safely copied to an external source. Priam picks up these hard-linked files and uploads them to S3. Snapshots are run on a daily basis for the entire cluster, ideally during non-peak hours. Although snapshot across cluster is not guaranteed to produce a consistent backup of cluster, consistency is recovered upon restore by Cassandra and running repairs. Snapshots can also be triggered on demand via Priam's REST API during upgrades and maintenance operations.

During the backup process, Priam throttles the data read from disk to avoid contention and interference with Cassandra's disk IO as well as network traffic. Schema files are also backed up in the process.

Incremental backup

When incremental backups are enabled in Cassandra, hard-links are created for all new SSTables in the incremental backup directory. Priam scans this directory frequently for incremental SSTable files and uploads them to S3. Incremental data along with the snapshot data are required for a complete backup.

Compression and multipart uploading

Priam uses snappy compression to compress SSTables on the fly. With S3's multi-part upload feature, files are chunked, compressed and uploaded in parallel. Uploads also ensure the file cache is unaffected (set via: fadvise). Priam reliably handles file sizes on the order of several hundred GB in our production environment for several of our clusters.

Restoring data

Priam supports restoring a partial or complete ring. Although the latter is less likely in production, restoring to a full test cluster is a common use case. When restoring data from backup, the Priam process (on each node) locates snapshot files for some or all keyspaces and orchestrates the download of the snapshot, incremental backup files, and starting of the cluster. During this process, Priam strips the ring information from the backup, allowing us to restore to a cluster of half the original size (i.e., by skipping alternate nodes and running repair to regain skipped data). Restoring to a different sized cluster is possible only for the keyspaces with replication factor more than one. Priam can also restore data to clusters with different names allowing us to spin up multiple test clusters with the same data.

Restoring prod data for testing

Using production data in a test environment allows you to test on massive volumes of real data to produce realistic benchmarks. One of the goals of Priam was to automate restoration of data into a test cluster. In fact, at Netflix, we bring up test clusters on-demand by pointing them to a snapshot. This also provides a mechanism for validating production data and offline analysis. SSTables are also used directly by our ETL process.

Figure 1: Listing of backup files in S3

Token Assignment

Priam automates the assignment of tokens to Cassandra nodes as they are added, removed or replaced in the ring. Priam relies on centralized external storage (SimpleDB/Cassandra) for storing token and membership information, which is used to bootstrap nodes into the cluster. It allows us to automate replacing nodes without any manual intervention, since we assume failure of nodes, and create failures using Chaos Monkey. The external Priam storage also provides us valuable information for the backup and recovery process.

To survive failures in the AWS environment and provide high availability, we spread our Cassandra nodes across multiple availability zones within regions. Priam's token assignment feature uses the locality information to allocate tokens across zones in an interlaced manner.

One of the challenges with cloud environments is replacing ephemeral nodes which can get terminated without warning. With token information stored in an external datastore (SimpleDB), Priam automates the replacement of dead or misbehaving (due to hardware issues) nodes without requiring any manual intervention.

Priam also lets us add capacity to existing clusters by doubling them. Clusters are doubled by interleaving new tokens between the existing ones. Priam's ring doubling feature does strategic replica placement to make sure that the original principle of having one replica per zone is still valid (when using a replication factor of at least 3).

We are also working closely with the Cassandra community to automate and enhance the token assignment mechanisms for Cassandra.

Address         DC          Rack        Status State   Load            Owns    Token                   
10.XX.XXX.XX us-east 1a Up Normal 628.07 GB 1.39% 1808575600
10.XX.XXX.XX us-east 1d Up Normal 491.85 GB 1.39% 2363071992506517107384545886751410400
10.XX.XXX.XX us-east 1c Up Normal 519.49 GB 1.39% 4726143985013034214769091771694245202
10.XX.XXX.XX us-east 1a Up Normal 507.48 GB 1.39% 7089215977519551322153637656637080002
10.XX.XXX.XX us-east 1d Up Normal 503.12 GB 1.39% 9452287970026068429538183541579914805
10.XX.XXX.XX us-east 1c Up Normal 508.85 GB 1.39% 11815359962532585536922729426522749604
10.XX.XXX.XX us-east 1a Up Normal 497.69 GB 1.39% 14178431955039102644307275311465584408
10.XX.XXX.XX us-east 1d Up Normal 495.2 GB 1.39% 16541503947545619751691821196408419206
10.XX.XXX.XX us-east 1c Up Normal 503.94 GB 1.39% 18904575940052136859076367081351254011
10.XX.XXX.XX us-east 1a Up Normal 624.87 GB 1.39% 21267647932558653966460912966294088808
10.XX.XXX.XX us-east 1d Up Normal 498.78 GB 1.39% 23630719925065171073845458851236923614
10.XX.XXX.XX us-east 1c Up Normal 506.46 GB 1.39% 25993791917571688181230004736179758410
10.XX.XXX.XX us-east 1a Up Normal 501.05 GB 1.39% 28356863910078205288614550621122593217
10.XX.XXX.XX us-east 1d Up Normal 814.26 GB 1.39% 30719935902584722395999096506065428012
10.XX.XXX.XX us-east 1c Up Normal 504.83 GB 1.39% 33083007895091239503383642391008262820
Figure 2: Sample cluster created by Priam

Multi-regional clusters

For multi-regional clusters, Priam allocates tokens by interlacing them between regions. Apart from allocating tokens, Priam provides a seed list across regions for Cassandra and automates security group updates for secure cross-regional communications between nodes. We use Cassandra's inter-DC encryption mechanism to encrypt the data between the regions via public Internet. As we span across multiple AWS regions, we rely heavily on these features to bring up new Cassandra clusters within minutes.

In order to have a balanced multi-regional cluster, we place one replica in each zone, and across all regions. Priam does this by calculating tokens for each region and padding them with a constant value. For example, US-East will start with token 0 where as EU-West will start with token 0 + x. This allows us to have different size clusters in each regions depending on the usage.

When a mutation is performed on a cluster, Cassandra writes to the local nodes and forwards the write asynchronously to the other regions. By placing replicas in a particular order we can actually withstand zone failures or a region failures without most of our services knowing about them. In the below diagram A1, A2 and A3 are AWS availability zones in one region and B1, B2 and B3 are AWS availability zones in another region.

Note: Replication Factor and the number of nodes in a zone are independent settings for each region.

Figure 3: Write to Cassandra spanning region A and region B
Address         DC          Rack        Status State   Load            Owns    Token                   
176.XX.XXX.XX eu-west 1a Up Normal 35.04 GB 0.00% 372748112
184.XX.XXX.XX us-east 1a Up Normal 56.33 GB 8.33% 14178431955039102644307275309657008810
46.XX.XXX.XX eu-west 1b Up Normal 36.64 GB 0.00% 14178431955039102644307275310029756921
174.XX.XXX.XX us-east 1d Up Normal 34.63 GB 8.33% 28356863910078205288614550619314017620
46.XX.XXX.XX eu-west 1c Up Normal 51.82 GB 0.00% 28356863910078205288614550619686765731
50.XX.XXX.XX us-east 1c Up Normal 34.26 GB 8.33% 42535295865117307932921825928971026430
46.XX.XXX.XX eu-west 1a Up Normal 34.39 GB 0.00% 42535295865117307932921825929343774541
184.XX.XXX.XX us-east 1a Up Normal 56.02 GB 8.33% 56713727820156410577229101238628035240
46.XX.XXX.XX eu-west 1b Up Normal 44.55 GB 0.00% 56713727820156410577229101239000783351
107.XX.XXX.XX us-east 1d Up Normal 36.34 GB 8.33% 70892159775195513221536376548285044050
46.XX.XXX.XX eu-west 1c Up Normal 50.1 GB 0.00% 70892159775195513221536376548657792161
50.XX.XXX.XX us-east 1c Up Normal 39.44 GB 8.33% 85070591730234615865843651857942052858
46.XX.XXX.XX eu-west 1a Up Normal 40.86 GB 0.00% 85070591730234615865843651858314800971
174.XX.XXX.XX us-east 1a Up Normal 43.75 GB 8.33% 99249023685273718510150927167599061670
79.XX.XXX.XX eu-west 1b Up Normal 42.43 GB 0.00% 99249023685273718510150927167971809781
Figure 4: Multi-regional Cassandra cluster created by Priam

All our clusters are centrally configured via properties stored in SimpleDB, which includes setup of critical JVM settings and Cassandra YAML properties.

Priam's REST API

One of goals of Priam was to support managing multiple Cassandra clusters. To achieve that, Priam's REST APIs provides hooks that support external monitoring and automation scripts. They provide the ability to backup, restore a set of nodes manually and provide insights into Cassandra's ring information. They also expose key Cassandra JMX commands such as repair and refresh.

For comprehensive listing of the APIs, please visit the github wiki here.

Key Cassandra facts at Netflix:
  • 57 Cassandra clusters running on hundreds of instances are currently in production, many of which are multi-regional
  • Priam backs up tens of TBs of data to S3 per day.
  • Several TBs of production data is restored into our test environment every week.
  • Nodes get replaced almost daily without any manual intervention
  • All of our clusters use random partitioner and are well-balanced
  • Priam was used to create the 288 node Cassandra benchmark cluster discussed in our earlier blog post[3].
Related Links: