A Deep Dive Into the PatternBuilders Analytics Framework (PAF)
Today one of our Server Engineers is going to give you a deep dive on our architecture. As always on our blog, all of the data is simulated and all trademarks are the property of their respective owners.
Hello everyone! I am going to get fairly technical in this post and go over how PatternBuilders Analytics Framework (PAF) does what it does so well. As Terence has said in the past couple of posts, we have a new architecture that’s based around scalability, streaming, and ease of use. That’s not quite the whole story though; the development of this architecture was in fact driven primarily by performance.
In the beginning…
When we started designing this system several years ago, our goal was to leverage rapidly declining storage costs and create an architecture that could answer any potential question the user asked of the system very quickly (under 5 seconds!).
To accomplish that, we started with a design centered on domain-specific metadata: the relationships between business objects, whether the objects are transactional or not (a retail example: inventory movements are transactional, while products, brands, and locations are not), and how to present the data in a way that users can understand.
After that, we developed a comprehensive indexing strategy that could essentially give us the “address” of any object in the system and, better yet, the “address” of any relationship in the system.
Next, we developed an analytics server that ran through all the applicable transactional data and, using the addresses from the indexing server together with user-defined metrics (key performance indicators), generated a ton of calculated analytics results for every possible combination of input filters. After all this pre-calculation was done, all the query engine had to do was take the user’s input parameters and retrieve the appropriate results. It looked sort of like this:
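To make the pre-calculation idea concrete, here is a minimal Python sketch (not PAF code; every name and figure is invented) that enumerates all filter combinations up front, which also hints at why the result set ballooned:

```python
from itertools import product

# Hypothetical pre-calculation pass: for every combination of filter
# values, aggregate the matching transactions up front so the query
# engine only has to look results up.
transactions = [
    {"channel": "Online", "region": "West", "amount": 120.0},
    {"channel": "Online", "region": "East", "amount": 80.0},
    {"channel": "Store", "region": "West", "amount": 200.0},
]

channels = ["Online", "Store", None]  # None stands for "any channel"
regions = ["West", "East", None]      # None stands for "any region"

precalculated = {}
for channel, region in product(channels, regions):
    precalculated[(channel, region)] = sum(
        t["amount"]
        for t in transactions
        if (channel is None or t["channel"] == channel)
        and (region is None or t["region"] == region)
    )

# At query time, any filter combination is a plain dictionary lookup:
print(precalculated[("Online", None)])  # 200.0
print(precalculated[(None, None)])     # 400.0
```

Note that the number of pre-calculated entries grows multiplicatively with the number of filter dimensions, even for this toy data.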
This approach was the first version of our analytics framework. It was successful, but it also had some drawbacks:
- Whenever a user added a new metric to the system, the system had to crunch all existing transactional data before it could return results for queries. If there was a lot of historical data, this process could take several days or longer. We could not do ad-hoc calculations.
- Our system supported aggregations beyond simple summation, like averaging, percentages, and standard deviation. We also supported complex metrics that combined past data with current data (year-over-year, by-day-of-week, etc.). However, due to the nature of these metrics, there was only a limited amount of information we could pre-calculate. Inevitably, user queries on these metrics triggered additional calculations, leading to worse performance.
- Dependencies between metrics could break scalability. For example, if one metric depended on another, we had to finish recalculating the first before we could even start on the second. We really wanted to be able to deliver dependent results on demand.
- A related drawback: most domains we analyzed had a very limited number of transactional data types, and many different metrics depended on those few types. If the underlying data changed, the system could suddenly face a huge recalculation load.
With the next version of the PAF, we set out to fix these problems while sticking with our core ideas. We did some serious analysis and came to some interesting conclusions.
A set of time series results for a given metric for a given query is actually very small. If we have hourly data over the course of 5 years, and the user queries over the whole date range, the most we get is 5 * 365 * 24 = 43,800 results. From the point of view of modern data warehousing, 43 thousand data points is tiny. A modern workstation PC can run a complex statistical calculation over that data in microseconds.
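The arithmetic above is easy to sanity-check (a throwaway snippet, using a synthetic hourly series rather than real data):

```python
import statistics

# Five years of hourly data, as in the example above.
points = 5 * 365 * 24
assert points == 43_800

# Even pure-Python statistics over a series this size are quick; here we
# take the population standard deviation of a synthetic hourly series.
series = [float(i % 24) for i in range(points)]
sigma = statistics.pstdev(series)
print(points)  # 43800
```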
Obviously though, we are not dealing with just a single metric for a single query. We are potentially dealing with billions of records coming in from all over the world. Our indexing server was very good at “addressing” all that data, but the analytics server still had to pre-calculate and store all these results. If we had 50 GB of input data, that could balloon to 1 TB of results. Despite low storage costs and high machine speeds, our bottleneck inevitably ended up being storage.
Understanding how severely storage performance impacted our framework, combined with the insight of how few calculations users were actually interested in, led us to the solution.
Divide and Conquer
The approach was to divide the problem into two steps:
- Storing indexed transactional data in such a way that querying returns the minimum number of data points. What we do is store the value of the raw data for a given index. For example, this display is straight out of our retail demo:
The first column is really the index (in this case, the first row is for the ‘Online’ retail channel, for the ‘Amazon’ reseller, and the last two rows are for Amazon ‘store’ locations), and the two columns of numbers are the sums of all data values that fit that index for a given date. So when the user queries for data, the system first retrieves the values for the given indexes (regardless of metric!).
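A minimal sketch of that storage scheme (illustrative names only, not PAF’s actual API): raw values are summed into a bucket per (index, date) at ingest time, and a query is a metric-independent lookup.

```python
from collections import defaultdict
from datetime import date

# Rolled-up raw values keyed by (index, date). The index here mimics the
# (channel, reseller, location) tuple from the retail demo above.
rollup = defaultdict(float)

def ingest(index, day, value):
    """Add one raw transactional value into its (index, date) bucket."""
    rollup[(index, day)] += value

def query(index, day):
    """Metric-independent lookup of the rolled-up value."""
    return rollup[(index, day)]

ingest(("Online", "Amazon", None), date(2011, 5, 1), 10.0)
ingest(("Online", "Amazon", None), date(2011, 5, 1), 15.0)
ingest(("Online", "Amazon", "Store-1"), date(2011, 5, 1), 7.0)

print(query(("Online", "Amazon", None), date(2011, 5, 1)))  # 25.0
```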
- Only run the calculations when the user asks for them! After the query returns our set of essentially raw data, we run the calculation for the metric – usually over a very small number of data points. Queries are so fast that complex metrics may actually issue additional queries as part of their calculations. For example, the user queries for “year over year % growth” for a given store. The system gets the aggregated transactional data for that store and then, for each data point, queries for the value from a year ago.
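The year-over-year example can be sketched as follows (a hedged illustration: `query_raw`, the store name, and the figures are all made up, with the raw-value lookup stubbed by a dictionary):

```python
from datetime import date

# Stand-in for the fast indexed lookup: rolled-up raw values per (store, date).
raw = {
    ("Store-42", date(2010, 6, 1)): 100.0,
    ("Store-42", date(2011, 6, 1)): 125.0,
}

def query_raw(store, day):
    return raw.get((store, day), 0.0)

def yoy_growth(store, day):
    """Year-over-year % growth: one extra raw-value query per data point."""
    current = query_raw(store, day)
    prior = query_raw(store, day.replace(year=day.year - 1))
    return (current - prior) / prior * 100.0 if prior else None

print(yoy_growth("Store-42", date(2011, 6, 1)))  # 25.0
```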
If you have heard of MapReduce, this may sound a bit familiar to you – the first step is basically a “map” step, and the second is basically a “reduce.” So how is our approach better than a general MapReduce system?
BI is not all synecdoche!
There are big optimizations that make our system orders of magnitude faster than a general MapReduce implementation, because we recognize that BI is fundamentally different from a general “crunch lots of numbers” problem.
MapReduce is essentially the computer science version of synecdoche (a rarely used figure of speech, from the Greek, “in which a part is substituted for a whole or a whole for a part”). What I mean is that MapReduce takes a bunch of data, separates it into small problems, runs a function to solve the small problems, then runs the function on the solutions of those problems, then on the solutions of those, and so on until it runs out of data. It essentially runs the same function over the parts and the whole. This is all well and good, but it’s actually very difficult to force a lot of BI problems into that paradigm. BI is not all synecdoche!
- We run different calculations on our “part” and “whole” problems. In fact, we try very hard not to run calculations at all! Our “map” step does as much “reduce” as it can – by rolling up raw data for a particular index early, we can store this information independent of any metric and avoid processing it during queries.
- Our “reduce” step runs calculations on time series data, a constraint that is completely fine for 99% of the metrics users are interested in. For the remaining 1%, we can do a full calculation but it is a lot slower. We think covering 99% of all cases efficiently is much better than creating a complex and slow general solution.
- Because our “reduce” step happens on-demand, users can define whatever metrics they need and as soon as the indexing server gets through the data, all metrics that operate on that data become available. This is how we can “stream” data – it just happens, without any special work. An indexed value is created, and queries that come through the system can pick it up! Our streaming UI just continuously queries and results are updated automatically.
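The streaming behavior in the last bullet can be sketched like this (illustrative names only; a real deployment would go through the indexing server, not an in-memory dict):

```python
from collections import defaultdict

rollup = defaultdict(float)

def ingest(key, value):
    rollup[key] += value  # the only work done at ingest time

def average(keys):
    """On-demand 'reduce': average over whichever keys exist so far."""
    vals = [rollup[k] for k in keys if k in rollup]
    return sum(vals) / len(vals) if vals else None

ingest("2011-05-01", 10.0)
print(average(["2011-05-01", "2011-05-02"]))  # 10.0

ingest("2011-05-02", 30.0)  # new data streams in...
print(average(["2011-05-01", "2011-05-02"]))  # ...the same query now sees 20.0
```

Because the metric runs at query time, the newly ingested value shows up in the very next query with no recalculation pass.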
Scaling and Conclusions
Scalability is a huge buzzword these days. Software architectures and corporations live and die by the hype generated any time someone mentions it. Ultimately, concentrating on scalability without considering the problem being solved can be a bit dangerous. We approached our design by looking at real-world data and doing analysis on what kinds of problems people were trying to solve with BI and ended up with a very agile architecture that provided direct benefits to end-users.