Addepar's rallying cry, altogether better, represents not only how we build, internally and hand-in-hand with our clients, but also our view that truly consolidated data drives smarter investment decisions and improved outcomes. It drives us to ruthlessly optimize our system for speed, cost and quality to make, and keep, Addepar the platform of choice for sophisticated portfolios.
As engineers, much of our work involves incremental improvement. On rare occasions, however, a clever insight points the way toward a major paradigm shift that makes a system better along every dimension at once: altogether better. In this post, I'll discuss a series of novel enhancements to Addepar's system that exemplify this idea: reworking our data models to improve data access and service architecture. To summarize the headline results, Addepar's enhanced system is:
Faster: under the new system, uncached data queries are
Up to four times faster than prior cached queries
Up to forty times faster than prior uncached queries.
Cheaper: the new system requires dramatically less dynamic memory for caching and therefore massively increases the system’s efficiency.
Altogether better: by reducing the aforementioned dynamic memory caches, the new system significantly increases the elasticity of our server architecture. It also streamlines data querying patterns by restructuring how data is indexed.
What did we change?
The key elements of the enhanced system are:
Use dual data models for flexibility: introduce binary-based and row-based data models, each optimized for different uses
Tailor data workflows to access patterns: think carefully about how data is actually consumed and optimize APIs for those access patterns.
Optimize data compression for performance: reduce data footprint on disk and achieve a commensurate direct reduction in querying time.
I’ll discuss each of these elements in more detail next.
Using dual data models for flexibility
Most modern applications model data in an object-oriented, hierarchical fashion. To persist these models in a durable relational database, the hierarchical objects first need to be marshalled into relational entities, which can be surprisingly costly. In some cases, marshalling data from relational rows back into hierarchical object-oriented models takes as long as querying the data itself!
Non-relational (NoSQL) databases appear to offer an elegant solution, but they do not hold up under scrutiny. Moving from a relational (SQL) database to a non-relational one is not a trivial endeavor: the two have very different semantics and transactionality guarantees. Moreover, NoSQL databases still incur significant data marshalling costs.
Instead of changing the storage technology, we chose to focus on what exactly was being persisted. Many of our existing relational models are highly intuitive, with each row in the table corresponding to a single entity. But for the purpose of powering our calculations, our application models data in terms of hierarchical associations with the securities, positions and accounts that they pertain to. This modeling paradigm makes little sense in a relational database but a lot of sense as inputs into a calculation engine.
The obvious solution here is to store these records in the same format as the in-memory models in our application. In fact, we can take the exact in-memory bit pattern and serialize that as a blob directly into a relational database field. An obvious objection to this is that such a binary representation is no longer human-readable — this is certainly true. But there are also many benefits:
No data marshalling required — the deserialized data is what you initially serialized
Super fast data transfers — binary serialization protocols like protocol buffers are incredibly efficient at serializing and deserializing bytes to and from the wire
No relational modeling constraints — we can store all the records associated with an account in a single blob, rather than having to model individual records.
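As a rough sketch of the idea, here is a toy version in Python, using struct packing as a stand-in for a real binary protocol such as protocol buffers; the table and record layout below are hypothetical, not our actual schema:

```python
import sqlite3
import struct

# Hypothetical record layout: (date_ordinal, security_id, quantity) per record.
RECORD_FORMAT = "<iid"  # int32 date ordinal, int32 security id, float64 quantity
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)

def serialize_records(records):
    """Pack all of an account's records into a single binary blob."""
    return b"".join(struct.pack(RECORD_FORMAT, *r) for r in records)

def deserialize_records(blob):
    """Unpack the blob back into the in-memory tuples -- no relational marshalling."""
    return [struct.unpack_from(RECORD_FORMAT, blob, offset)
            for offset in range(0, len(blob), RECORD_SIZE)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account_blobs (account_id INTEGER PRIMARY KEY, records BLOB)")

records = [(737060, 42, 100.0), (737061, 42, 105.5)]
conn.execute("INSERT INTO account_blobs VALUES (?, ?)", (1, serialize_records(records)))

blob, = conn.execute("SELECT records FROM account_blobs WHERE account_id = 1").fetchone()
assert deserialize_records(blob) == records  # round trip: what you serialized is what you get back
```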
Of course, we still need human-readable data for diagnostics, debugging, ad hoc analysis, etc. The binary blob models complement, rather than replace, the existing row-based model. This allows us in some sense to have our cake and eat it too — but with a few additional complexities:
Avoiding duplicate storage costs: we have to represent the data twice
Optimizing data access patterns: deciding when to use which representation
Maintaining parity: making sure the two representations are consistent with each other
Duplicate storage costs are not a major concern, as durable storage is extremely cheap and the binary representations are extremely compact. Data access patterns and maintaining parity require more careful handling, which we will discuss in the next section.
Tailoring data workflows to access patterns
Data access patterns tend to fall into two broad categories: OLTP and OLAP.
OLTP (Online Transactional Processing): OLTP workflows represent “traditional” relational database processes. Most updates to our data are OLTP workflows.
OLAP (Online Analytical Processing): OLAP workflows analyze, transform and aggregate large volumes of data in a non-transactional context. Most of our calculations are OLAP workflows.
In light of these two categories, the two main requirements of (1) optimizing data access patterns and (2) maintaining parity are in fact deeply intertwined. If the binary format and row format data are always modified atomically in a single OLTP transaction, then the parity of these two stores is guaranteed. So these two requirements may be reformulated as:
Design a single OLTP workflow to modify both binary and row format data atomically
Direct all OLAP workflows to the binary format, which is optimized for them.
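A minimal sketch of what that single OLTP workflow could look like, with a hypothetical schema and SQLite purely for illustration; the binary-format write is shown as an appended update blob, which is the "delta blob" pattern described below:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (account_id INTEGER, date TEXT, value REAL);  -- row format
    CREATE TABLE blobs (account_id INTEGER, kind TEXT, payload BLOB);  -- binary format
""")

def apply_update(conn, account_id, date, value, update_blob):
    """Write the row-format record and the binary-format update in one
    transaction, so the two representations can never diverge."""
    with conn:  # BEGIN ... COMMIT, or ROLLBACK if anything fails
        conn.execute("INSERT INTO records VALUES (?, ?, ?)", (account_id, date, value))
        conn.execute("INSERT INTO blobs VALUES (?, 'delta', ?)", (account_id, update_blob))

apply_update(conn, 1, "2024-01-02", 105.5, b"\x01\x02\x03")  # payload bytes are illustrative
```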
The main pitfall to avoid here is naively updating binary format data, which can be horrendously inefficient. Suppose a blob represents 100,000 records and we want to update just one of those records. Naively updating the binary format data would require rewriting the entire blob, or 100,000 times as much data as the update itself!
Fortunately, this problem has a well-known solution: delta blobs. Instead of rewriting the single main blob, we append smaller "delta" blobs representing individual updates alongside the "main" blob representing the full history.
Updating blobs only requires writing new data, not the entire history.
OLAP workflows combine the main blobs and delta blobs to get the latest result
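Continuing the hypothetical schema above, the OLAP read path might look roughly like this, with a caller-supplied deserialize function standing in for the real binary decoding:

```python
def read_latest(conn, account_id, deserialize):
    """OLAP read path: combine the main blob with any delta blobs appended since
    the last compaction. Later deltas win when they touch the same record key."""
    rows = conn.execute(
        "SELECT kind, payload FROM blobs WHERE account_id = ? ORDER BY rowid",
        (account_id,),
    ).fetchall()
    merged = {}
    for kind, payload in rows:  # main blob first, then deltas in append order
        for key, value in deserialize(payload):
            merged[key] = value
    return merged
```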
The only downside is that delta blobs are not very compact, which becomes a problem as they organically grow over time. Suppose there is a main blob with 10 years of data, i.e., 2,500 trading days, and each day we add one additional day of data as a delta. Initially this is fine, but after five years, a third of all records would be stored as uncompacted deltas!
Hence delta compaction, where the delta blobs are periodically compacted into the main blob. This delivers a compact data model that is efficient for reads and writes:
OLTP workflows that now need to write to both row-based and blob-based tables are fast since we only append rows and delta blobs.
Compaction is performed weekly, so delta blobs represent a very small fraction of the overall number of records and blob storage remains highly compact.
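A sketch of the periodic compaction job, reusing read_latest from the earlier sketch (again hypothetical, not our actual implementation):

```python
def compact(conn, account_id, deserialize, serialize):
    """Periodic compaction: fold all delta blobs into a fresh main blob,
    then drop the deltas, inside one transaction."""
    with conn:
        merged = read_latest(conn, account_id, deserialize)
        conn.execute("DELETE FROM blobs WHERE account_id = ?", (account_id,))
        conn.execute("INSERT INTO blobs VALUES (?, 'main', ?)",
                     (account_id, serialize(sorted(merged.items()))))
```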
So far we've covered (1) why we want to model data in both row and binary formats and (2) how those two formats can be tailored to optimize a variety of workflows. In the next and final section, we'll dive into the details of this binary format and discuss why it is so compact.
Optimizing data compression for performance
All of our data models start as a time series panel of records. At a high level, compressing such data involves
Sorting the data in some economically meaningful way
Identifying a predictable pattern in the data
Exploiting this pattern by encoding it into the compression algorithm
In this section, I’ll discuss two similar and very powerful compression techniques: run-length encoding and delta-run-length encoding.
Run-length encoding
The simplest pattern commonly found in time series panel data is that values in a column tend to repeat. Suppose we have row-based data with just two fields: date and color. From Jan 1, 2000 to Dec 31, 2019, the color is red. Then on Jan 1, 2020, the color switches to blue and remains blue through the end of the dataset on Dec 31, 2025.
| Date | Color |
| --- | --- |
| Jan 1, 2000 | Red |
| … | Red |
| Dec 31, 2019 | Red |
| Jan 1, 2020 | Blue |
| … | Blue |
| Dec 31, 2025 | Blue |
Assuming 250 trading days per year, we have 5,000 red records and 1,500 blue records. We don't actually need to store all of them: by noting the start and end dates of each "run" of values, the four boundary records shown in the table above suffice.
We use start and end dates since this fits better with our application logic, but it is logically equivalent to run-length encoding, which is widely used in columnar store technologies. The compression efficiency of run-length encoding depends on the number of distinct runs and their lengths. In this example, we only require 4 records to represent 7,500 records, so run-length encoding achieves a compression rate of 99.95% — not bad! Even if the value changed 100 times, which would require 200 records to represent, the compression rate would still be 97.33%.
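A toy version of this start/end-date flavor of run-length encoding might look like the following (the field names are made up for illustration):

```python
from datetime import date

def encode_runs(rows):
    """Collapse consecutive rows with the same value into (start, end, value) runs."""
    runs = []
    for day, value in rows:
        if runs and runs[-1][2] == value:
            runs[-1] = (runs[-1][0], day, value)  # extend the current run
        else:
            runs.append((day, day, value))        # start a new run
    return runs

rows = [(date(2000, 1, 1), "Red"), (date(2000, 1, 2), "Red"),
        (date(2020, 1, 1), "Blue"), (date(2020, 1, 2), "Blue")]
print(encode_runs(rows))
# -> two runs, each stored as (start_date, end_date, value)
```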
Delta run-length encoding
Run-length encoding works very well for repeated runs. A slight variation works equally well for sequential runs. If a repeated run consists of the same value repeated 100 times, a sequential run consists of the values
0, 1, 2, …, 99
A sequential run can be turned into a repeated run by subtracting one item from the next:
0, 1, 1, …, 1
The first record has a “delta” of 0 to indicate that it is equal to the initial value. This technique works well even when the run is not perfectly sequential. For example, suppose the sequence
0, 1, 2, …, 99
has a break at 50, where it jumps to 52, rather than continuing with 51. So the sequence is actually
0, 1, 2, …, 50, 52, 53, …, 99
The corresponding deltas would be
0, 1, 1, …, 1, 2, 1, …, 1
So instead of one long run of 1s with length 99, we have a run of fifty 1s and a run of forty-seven 1s, with a single 2 in between them, which compresses almost as well.
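A small sketch of delta run-length encoding, reproducing the example above (the helper names are made up for illustration):

```python
from itertools import groupby

def delta_encode(values):
    """First delta is 0 (the record equals the initial value); the rest are
    successive differences, turning a sequential run into a repeated run."""
    return [0] + [b - a for a, b in zip(values, values[1:])]

def run_length_encode(values):
    """Collapse the delta stream into (value, run_length) pairs."""
    return [(v, sum(1 for _ in group)) for v, group in groupby(values)]

seq = list(range(0, 51)) + list(range(52, 100))   # 0..50, then 52..99 (51 is skipped)
print(run_length_encode(delta_encode(seq)))
# [(0, 1), (1, 50), (2, 1), (1, 47)] -- two runs of 1s with a single 2 in between
```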
Takeaways
I will close out this blog post by reflecting on a few key takeaways.
Challenge conventional wisdom
At the outset of this project, my hope was to improve query performance by 10-20%: a modest but realistic and substantive goal. Many engineers, myself included, would have been skeptical that we could compress our data models by 85% and make queries up to forty times faster. But once the research got going, the results exceeded even my most optimistic expectations.
Optimize statistically
Many of the compression techniques discussed were developed with a statistical mindset — consider what types of patterns a given dataset, or a specific field in a given dataset, tend to exhibit, and optimize for each of those cases. Here are some specific examples of this type of thinking:
Optimize for the worst-case scenario: Often, the most pathological 1% of cases are responsible for 99% of the (in)efficiencies. Focusing on these tail events can often deliver outsized improvements.
Use conditional heuristics: Many times, I found that data tends to exhibit one of two patterns in highly kurtotic ways — that is, it either falls very strongly into type A or type B, with very few records somewhere in the middle, and type A and B each demand very different optimal compression techniques. In this case, developing a simple but reliable heuristic to categorize records as type A or type B, then deploying different compression techniques conditionally, can yield massive compression.