Wednesday, October 28, 2015

Evolution of Open Source at Netflix

When we started our Netflix Open Source (aka NetflixOSS) Program several years ago, we didn’t know how it would turn out.  We did not know whether our OSS contributions would be used, improved, or ignored; whether we’d have a community of companies and developers sending us feedback; or whether middle-tier vendors would integrate our solutions into theirs.  The reasons for starting the OSS Program were shared previously here.


Fast forward to today.  We have over fifty open source projects, ranging from infrastructural platform components to big data tools to deployment automation.  Over time, our OSS site grew crowded as more and more components piled on, and even more components are now on the path to being open sourced.




While many of our OSS projects are being successfully used by many companies all over the world, we got a very clear signal from the community that it was getting harder to figure out which projects were useful for a particular company or team; which were fully independent; and which were coupled together.  The external community was also unclear about which components we (Netflix) continued to invest in and support, and which were in maintenance or sunset mode.  That feedback was very useful to us, as we’re committed to making our OSS Program a success.


We recently updated our Netflix Open Source site on GitHub Pages.  It does not yet address all of the feedback and requests we received, but we think it’s moving us in the right direction:


  1. Clear separation of categories.  Looking for Build and Delivery tools?  You shouldn’t have to wade through many unrelated projects to find them.
  2. With the new overview section for each category, we can now explain in short form how each project should be used in concert with other projects.  With the old “box art” layout, it wasn’t clear how the projects fit together (or whether they did) in a way that provided more value when used together.
  3. Categories now match our internal infrastructure engineering organization.  This means that the content within each category will reflect the approach to engineering within that specific technical area.  We have also appointed internal category leaders who will help keep each category well maintained across the projects in that area.
  4. Clear highlighting of the projects we’re actively investing in and supporting.  If you see a project on the site, it’s under active development and maintenance.  If you don’t see it, it may be in maintenance-only or sunset mode.  We’ll be providing more transparency on that shortly.
  5. Support for multi-repo projects.  We have several big projects that are about to be open sourced, each consisting of many GitHub repos.  The old site would list each repo individually, making the overall navigation even less usable.  The new site lets us group the relevant repos together under a single project.




Other feedback we’re addressing is that it was hard to get started with many of our OSS projects; setup and configuration were often difficult and tricky.  We’re addressing this by packaging most (though not yet all) of our projects in the Docker format for easy setup.  Please note that this packaging is not intended for direct use in production, but purely to shorten the ramp-up curve for understanding the open source projects.  We have found it far easier to help users get started with our projects by shipping pre-built, runnable Docker containers than by publishing source code, build, and setup instructions in prose on a wiki.


The next steps we’ll be taking in our Open Source Program:


  1. Provide full transparency on which projects are archived, i.e. no longer actively developed or maintained.  We will not be removing any code from GitHub repos, but we will make it clear when we’re no longer actively developing or using a particular project.  Netflix’s needs change over time, and those changes will affect, and be reflected in, our OSS projects.
  2. Provide a better roadmap of which new projects we are planning to open and which open projects are still in heavy flux.  This will allow the community to better decide whether particular projects are interesting or useful.
  3. Expose some of the internal metrics we use to evaluate our OSS projects, such as the number of issues and commits.  This will provide better transparency into the maturity and velocity of each project.
  4. Documentation.  Documentation.  Documentation.


While we continue on our path to make NetflixOSS relevant and useful to many companies and developers, your continued feedback is very important to us.  Please let us know what you think at netflixoss@netflix.com.


We’re planning our next NetflixOSS Meetup for early 2016 to coincide with some new and exciting projects that are about to be open sourced.  Stay tuned and follow @netflixoss for announcements and updates.



Tuesday, October 20, 2015

Falcor for Android

We’re happy to have open-sourced the Netflix Falcor library earlier this year. On Android, we wanted to make use of Falcor in our client app for its efficient model of data fetching as well as its inherent cache coherence.

Falcor requires us to model data on both the client and the server in the same way (via a path query language).  This provides the benefit that clients don’t need any translation to fetch data from the server (see What is Falcor).  For example, the application may request the path ["video", 12345, "summary"] from Falcor, and if it doesn’t exist locally, Falcor can request the same path from the server.



Another benefit Falcor provides is that it can easily combine multiple paths into a single http request.  Standard REST APIs may be limited in the kind of data they can provide via one specific URL.  However, Falcor’s path language allows us to retrieve any kind of data the client needs for a given view (see the “Batching” heading in “How Does Falcor Work?”).  This also provides a nice mechanism for prefetching larger chunks of data when needed, which our app does on initialization.
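As a rough sketch of the batching idea (the class and method names here are hypothetical, not the actual Falcor API), several paths for a view can be collapsed into one request:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of batching Falcor-style paths into one request.
// PathBatchSketch is illustrative and not part of the real Falcor library.
public class PathBatchSketch {

    // A path is just an ordered list of keys, e.g. ["video", 12345, "summary"].
    // Joining all requested paths into one query string lets a single HTTP
    // round trip serve every piece of data the view needs.
    static String buildQuery(List<List<Object>> paths) {
        StringBuilder sb = new StringBuilder("paths=");
        for (int i = 0; i < paths.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(paths.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<List<Object>> paths = Arrays.asList(
                Arrays.<Object>asList("video", 12345, "summary"),
                Arrays.<Object>asList("video", 12345, "rating"));
        // One request covers both the summary and the rating of the video.
        System.out.println(buildQuery(paths));
        // prints paths=[video, 12345, summary],[video, 12345, rating]
    }
}
```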

The Problem

As the only Java client at Netflix, we needed to write our own implementation of Falcor.  The primary goal was to increase the efficiency of our caching code; in other words, to decrease the complexity and maintenance costs associated with our previous caching layer.  The secondary goal was to make these changes while maintaining or improving performance (speed and memory usage).

The main challenge was swapping out our existing data caching layer for the new Falcor component with minimal impact on app quality.  This warranted an investment in testing to validate the new caching component, but how could we do this extensive testing most efficiently?

Some history: prior to our Falcor client, we had not invested much in improving the structure or performance of our cache.  After a lightweight first implementation, our cache had grown incoherent (the same item represented in multiple places in memory) and the code was inefficient (lots of hand-parsing of individual http responses).  None of this was good.

Our Solution

Falcor provides cache coherence by making use of a JSON Graph.  This works by using a custom path language to define internal references to other items within the JSON document.  The path language is consistent throughout Falcor, so a path or reference used locally on the client is the same path or reference when sent to the server.

{
   "topVideos": {
       // List, with indices
       0: { $type: "ref", value: ["video", 123] }, // JSON Graph reference
       1: { $type: "ref", value: ["video", 789] }
   },
   "video": {
       // Videos by ID
       123: {
           "name": "Orange Is the New Black",
           "year": 2015,
           ...
       },
       789: {
           "name": "House of Cards",
           "year": 2015,
           ...
       }
   }
}

Our original cache used the gson library for parsing model objects, and we had not implemented any custom deserializers.  This meant we were implicitly relying on gson’s use of reflection to handle response parsing.  We were curious how much of a cost this use of reflection introduced compared with custom deserialization.  Using a subset of model objects, we wrote a benchmark app which showed that deserialization using reflection took about 6x as long as custom parsing.
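The difference is easy to see in a standalone sketch (illustrative only, not our benchmark app): populating the same model via reflective field lookup, roughly what gson does internally without custom adapters, versus direct assignment.

```java
import java.lang.reflect.Field;

// Illustrative comparison of reflection-based vs. hand-written model
// population; the Summary class here is a stand-in, not our real model.
public class ParseCostSketch {

    public static class Summary {
        public String id;
        public String title;
    }

    // Reflection path: every field is looked up by name at runtime,
    // similar in spirit to a reflection-based deserializer.
    public static Summary viaReflection(String id, String title) throws Exception {
        Summary s = new Summary();
        Field idField = Summary.class.getField("id");
        idField.set(s, id);
        Field titleField = Summary.class.getField("title");
        titleField.set(s, title);
        return s;
    }

    // Custom path: direct assignment, no per-field lookup cost.
    public static Summary viaDirect(String id, String title) {
        Summary s = new Summary();
        s.id = id;
        s.title = title;
        return s;
    }

    public static void main(String[] args) throws Exception {
        Summary a = viaReflection("123", "Orange Is the New Black");
        Summary b = viaDirect("123", "Orange Is the New Black");
        System.out.println(a.title.equals(b.title)); // prints true
    }
}
```

Repeating each variant in a tight loop and timing it is how we arrived at the rough 6x figure for our own models.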

We used the transition to Falcor as an opportunity to write custom deserializers that take json as input and set the fields within each model directly.  There is a modest up-front cost to writing parsing code for the models.  However, most models are shared across a few different get requests, so the cost is amortized and seemed worth it given the improved parsing speed.

// Custom deserialization method for Video.Summary model
public void populate(JsonElement jsonElem) {
   JsonObject json = jsonElem.getAsJsonObject();
   for (Map.Entry<String, JsonElement> entry : json.entrySet()) {
       JsonElement value = entry.getValue();
       switch (entry.getKey()) {
       case "id": id = value.getAsString(); break;
       case "title": title = value.getAsString(); break;
       ...
       }
   }
}

Once the Falcor cache was implemented, we compared cache memory usage over a typical user browsing session.  Thanks to cache coherence (no duplicate objects), we found that the cache footprint was reduced by about 10-15%, or about 500kB.

Performance and Threading

When a new path of data is requested from the cache, the following steps occur:
  1. Determine which paths, if any, already exist locally in the cache
  2. Aggregate paths that don't exist locally and request them from the server
  3. Merge server response back into the local cache
  4. Notify callers that data is ready, and/or pass data back via callback methods
We generalized these operations in a component that also manages threading.  By doing this, we were able to take everything off of the main thread except task instantiation; all of the other steps above are done on worker threads.
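A minimal sketch of that flow, with all names hypothetical (this is not the actual Netflix client code), might look like:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the four steps above; FetchTaskSketch and
// fetchRemote are illustrative names, not the real client component.
public class FetchTaskSketch {

    static final ExecutorService worker = Executors.newSingleThreadExecutor();
    static final Map<String, String> cache = new ConcurrentHashMap<>();

    // Stand-in for the server round trip in steps 2 and 3.
    static String fetchRemote(String path) {
        return "value-for-" + path;
    }

    // Only this method runs on the caller's (main) thread; the task body
    // below executes on the worker thread.
    static Future<Map<String, String>> fetch(List<String> paths) {
        return worker.submit(() -> {
            Map<String, String> result = new HashMap<>();
            List<String> missing = new ArrayList<>();
            for (String p : paths) {              // 1. check the local cache
                String v = cache.get(p);
                if (v != null) {
                    result.put(p, v);
                } else {
                    missing.add(p);
                }
            }
            for (String p : missing) {            // 2. request missing paths
                String v = fetchRemote(p);
                cache.put(p, v);                  // 3. merge into the cache
                result.put(p, v);
            }
            return result;                        // 4. hand data to the caller
        });
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetch(Arrays.asList("video/123/summary")).get());
        worker.shutdown();
    }
}
```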

Further, by isolating all of the cache and remote operations into a single component we were easily able to add performance information to all requests. This data could be used for testing purposes (by outputting to a specific logcat channel) or simply as a debugging aid during development.

// Sample logcat output
15:29:10.956: FetchDetailsTask ++ time to build paths: 0ms
15:29:10.956: FetchDetailsTask ++ time to check cache for missing paths: 1ms
15:29:11.476: FetchDetailsTask ~~ http request took: 516ms
15:29:11.486: FetchDetailsTask ++ time to parse json response: 8ms
15:29:11.486: FetchDetailsTask ++ time to fetch results from cache: 0ms
15:29:11.486: FetchDetailsTask == total task time from creation to finish: 531ms

Testing

Although reflection had been costly for the purposes of parsing json, we were able to use reflection on interfaces to our advantage when it came to testing our new cache. In our test harness, we defined tables that mapped test interfaces to each of the model classes. For example, when we made a request to fetch a ShowDetails object, the map defined that the ShowDetails and Playable interfaces should be used to compare the results.

// INTERFACE_MAP sample entries
put(netflix.model.branches.Video.Summary.class,             // Model/class
   new Class<?>[]{netflix.model._interfaces.Video.class}); // Interfaces to test
put(netflix.model.ShowDetails.class,
   new Class<?>[]{netflix.model._interfaces.ShowDetails.class,
                  netflix.model._interfaces.Playable.class});
put(netflix.model.EpisodeDetails.class,
   new Class<?>[]{netflix.model._interfaces.EpisodeDetails.class,
                  netflix.model._interfaces.Playable.class});
// etc.

We then used reflection on the interfaces to get a list of all of their methods, and recursively applied each method to each item (or to each item in a list).  The return values for each method/object pair were compared to find any differences between the previous cache implementation and the Falcor implementation.  This provided a first pass of error detection for the new implementation and caught most problems early on.

private Result validate(Object o1, Object o2) {
    //...snipped...
    Class<?>[] validationInterfaces = INTERFACE_MAP.get(o1.getClass());
    for (Class<?> testingInterface : validationInterfaces) {
        Log.d(TAG, "Getting methods for interface: " + testingInterface);
        Method[] methods = testingInterface.getMethods(); // Public methods only
        for (Method method : methods) {
            Object rtn1 = method.invoke(o1); // Old cache object
            Object rtn2 = method.invoke(o2); // Falcor cache object
            if (rtn1 instanceof FalcorValidator) {
                Result rtn = validate(rtn1, rtn2); // Recursively validate objects
                if (rtn.isError()) {
                    return rtn;
                }
            } else if (!rtn1.equals(rtn2)) {
                return Result.VALUE_MISMATCH.append(rtnMsg);
            }
        }
    }
    return Result.OK;
}

Bonus for Debugging

Because of the structure of the Falcor cache, writing a dump() method was trivial using recursion. This became a very useful utility for debugging since it can succinctly express the whole state of the cache at any point in time, including all internal references. This output can be redirected to the logcat output or to a file.

void doCacheDumpRecursive(StringBuilder output, BranchNode node, int offset) {
   StringBuilder sb = new StringBuilder();
   for (int i = 0; i < offset; i++) {
       sb.append((i == offset - 1) ? " |-" : " | "); // Indentation chars
   }
   String spacer = sb.toString();
   for (String key : node.keySet()) { // assuming BranchNode exposes its keys
       Object value = node.get(key);
       if (value instanceof Ref) {
           output.append(spacer).append(key).append(" -> ")
                 .append(((Ref)value).getRefPath()).append(NEWLINE);
       }
       else {
           output.append(spacer).append(key).append(NEWLINE);
       }
       if (value instanceof BranchNode) {
           doCacheDumpRecursive(output, (BranchNode)value, offset + 1);
       }
   }
}

Sample Cache Dump File

Results

The result of our work was that we created an efficient, coherent cache that reduced its memory footprint when compared with our previous cache component. In addition, the cache was structured in a way that was easier to maintain and extend due to an increase in clarity and a large reduction in redundant code.

We achieved the above objectives while also reducing the time taken to parse json responses, so the speed of the cache improved in most cases.  Finally, we minimized regressions by using a thorough test harness that we wrote efficiently using reflection.

Future Improvements

  • Multiple views may be bound to the same data path, so how can we notify all views when the underlying data changes?  An observer pattern or RxJava could address this.
  • Cache invalidation: we do this manually in a few specific cases now, but we could implement a more holistic approach that includes expiration times for paths that can expire.  If such data is later requested, it would be considered invalid and a remote request would again be required.
  • Disk caching.  It would be fairly straightforward to serialize our cache, or portions of it, to disk.  The caching manager could then check the in-memory cache, then the on-disk cache, and finally go remote if needed.
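As a sketch of what path expiration could look like (hypothetical, not part of the current implementation), each cache entry could carry a deadline alongside its value:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of path expiration as described above; not part of
// the actual Falcor client. Expired entries read as misses, forcing a
// fresh remote request for that path.
public class ExpiringCacheSketch {

    static class Entry {
        final Object value;
        final long expiresAtMillis;
        Entry(Object value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    void put(String path, Object value, long ttlMillis) {
        entries.put(path, new Entry(value, System.currentTimeMillis() + ttlMillis));
    }

    // Returns null when a path is missing or expired, signaling that a
    // remote request is required again.
    Object get(String path) {
        Entry e = entries.get(path);
        if (e == null) return null;
        if (System.currentTimeMillis() > e.expiresAtMillis) {
            entries.remove(path); // expired: invalidate the stale entry
            return null;
        }
        return e.value;
    }
}
```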

Links

Falcor project: https://netflix.github.io/falcor/
Netflix Releases Falcor Developer Preview: http://techblog.netflix.com/2015/08/falcor-developer-preview.html