It’s About Time: Series Data, Streaming, & Architecture
In previous posts, we have talked a lot about the PatternBuilders Analytics platform and streaming analytics. The platform scales to huge amounts of data and streams results to the user as they are processed, in real time. As mentioned before, we can do this because we have focused on time series analytics, making architectural optimizations that beat generalized MapReduce-style solutions by orders of magnitude. I'd like to discuss that focus and how it came about.
Why time series data?
Time series data is ubiquitous. It’s actually difficult to think of an analytics question a user would be interested in that doesn’t involve time in some capacity. Even a non-numeric query like “Order the list of products by units sold” is almost useless without specifying a time period to sort over.
To put it in even more basic terms, the whole reason businesses desperately need good analytics is that they want to know how well they’re doing and how they can improve their ROI. Ultimately, ROI is “I put in X money last year, and got Y money back.” So in order to gauge progress, time has to be involved one way or another.
Instead of trying to ignore reality by assuming that time is just another variable to be used in calculations, we accept the following premises:
- Any domain/business model that requires analytics has the concept of a transaction: an element that has a time, at least one numerical value, and metadata.
- For retail – an inventory movement or sale.
- For social media – a tweet or a status update.
- For healthcare – a patient visit or transfer.
- For finance – a stock transaction (buy, sell, etc.).
- Any metric the user is interested in can be expressed in terms of aggregates of the numeric values in transactions.
- Any non-numeric query the user is interested in (say, “Top 10 products by sales from Category X”) is some kind of filter using the metrics.
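The premises above can be sketched in a few lines of code. This is a hypothetical illustration, not PatternBuilders’ actual data model: a transaction carries a time, at least one numeric value, and metadata; a metric is an aggregate of those values over a time window; and a non-numeric query like “top products by sales” is a filter or ranking over the metric.

```python
from dataclasses import dataclass, field
from collections import defaultdict
from datetime import datetime

@dataclass
class Transaction:
    time: datetime
    value: float                                  # e.g. sale amount, units moved
    metadata: dict = field(default_factory=dict)  # e.g. {"product": "A"}

def metric_by_key(transactions, key, start, end):
    """A metric: the sum of transaction values per metadata key,
    restricted to a time window [start, end)."""
    totals = defaultdict(float)
    for t in transactions:
        if start <= t.time < end:
            totals[t.metadata.get(key)] += t.value
    return dict(totals)

def top_n(metrics, n):
    """A non-numeric query ("top N products by sales") expressed
    as a ranking over an already-computed metric."""
    return sorted(metrics, key=metrics.get, reverse=True)[:n]
```

For example, summing January sales per product and then asking for the top product is just `top_n(metric_by_key(txs, "product", jan_1, feb_1), 1)` — the time window is inseparable from the question.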
These premises have thus far proven completely true. We have not found a single metric that does not fit these criteria. If we ever do, we can always go to a more generalized calculation method for that particular case, sacrificing performance only when we have to.
To be honest, for the longest time, streaming was not really a goal for the platform. We never explicitly sat down and asked ourselves architecturally what would be required to stream data. What happened was that after we optimized for time series aggregations, the results were available in real time automatically; all we had to do was add some nice UI features to update the display. The real insight came when we started thinking about the potential benefits of displaying analytics without any delays:
- How quickly does your business respond to emergencies? What if you could know about a competitor’s actions impacting your sales within hours or minutes instead of weeks?
- What if you came up with a completely new metric that you wanted to try out on 5 years of historical data? What if you could see the numbers as quickly as you could type the formula?
- What if you could watch the social media ROI metrics update within minutes of posting a press release?
These are powerful ideas and I’m sure our users will come up with their own.
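A minimal sketch may help show why streaming came “for free” once we optimized for time series aggregations. This is an illustrative toy, not our actual implementation: if each incoming transaction updates a running per-period aggregate in constant time, the current value of a metric is always available to display mid-stream, with no batch job to wait for.

```python
from collections import defaultdict

class StreamingSum:
    """Toy incremental aggregate: each transaction updates a running
    per-period total, so the metric is readable at any moment."""

    def __init__(self):
        self.by_period = defaultdict(float)   # e.g. period = (year, month)

    def ingest(self, period, value):
        self.by_period[period] += value       # O(1) update per transaction

    def current(self, period):
        return self.by_period[period]         # readable mid-stream, no recompute
```

Recomputing the same totals from scratch on every query is what makes batch-style approaches feel slow; maintaining them incrementally is what makes the display update as events arrive.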
There’s a great book called In The Plex about how Google became so successful. This quote aligns very well with our own experiences:
In 2007, Google conducted some user studies that measured the behavior of people whose search results were artificially delayed. One might think that the minuscule amounts of latency involved in the experiment would be negligible—they ranged between 100 and 400 milliseconds. But even those tiny hiccups in delivering search results acted as a deterrent to future searches. The reduction in the number of searches was small but significant, and were measurable even with 100 milliseconds (one-tenth of a second) latency. What’s more, even after the delays were removed, the people exposed to the slower results would take a long time to resume their previous level of searching.
Levy, Steven (2011). In The Plex (p. 186). Simon & Schuster.
The never-ending march of development tools has fostered a strong mashup mentality: open source tools, enterprise frameworks for .NET and Java, and cloud platforms have made it easy for companies to realize their business ideas without spending a lot of manpower and money on IT. Because of this, any company touting its own “new” architecture or platform is treated with a great deal of skepticism. “Why don’t you use Hadoop?” and “What do you do that requires more than the LAMP (Linux, Apache, MySQL, Perl/PHP/Python) stack?” are legitimate questions, especially in our case, because ultimately, the whole focus of our business is the analytics platform we have developed.
The truth is: we tried! We really did. Our original approach used a relational database and a fairly platform-independent design. Spending money on hardware and off-the-shelf tools is usually far more effective than spending time and money on custom development, but we saw very early on that a mashup approach to very fast, user-friendly, “big data” analytics just isn’t possible. When the scope of the problem reaches into the terabytes and queries are still expected to come back in milliseconds, you have to think outside the box. We need NoSQL databases for speed, transactional queues for reliability and scaling, clusters of replicated data-loading servers for fault tolerance, a user interface that is powerful and easy to use, and most importantly, an approach to calculation that treats time as fundamental to any business intelligence problem.
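To make the shape of that pipeline concrete, here is a deliberately simplified sketch using Python’s standard library in place of real infrastructure: a `queue.Queue` stands in for the transactional queue, a locked dictionary stands in for the NoSQL aggregate store, and two threads stand in for the cluster of data-loading servers. The component names are illustrative assumptions, not PatternBuilders’ actual stack.

```python
import queue
import threading
from collections import defaultdict

events = queue.Queue()        # stands in for a transactional queue
store = defaultdict(float)    # stands in for the NoSQL aggregate store
lock = threading.Lock()

def loader():
    """Stands in for a data-loading server: drain the queue and
    incrementally update per-period aggregates."""
    while True:
        item = events.get()
        if item is None:                # sentinel: shut this worker down
            events.task_done()
            break
        period, value = item
        with lock:
            store[period] += value      # incremental time-series aggregate
        events.task_done()

workers = [threading.Thread(target=loader) for _ in range(2)]
for w in workers:
    w.start()

# Transactions stream in; queries can read `store` at any point.
for evt in [((2012, 1), 100.0), ((2012, 1), 50.0), ((2012, 2), 75.0)]:
    events.put(evt)
events.join()                           # wait until all events are aggregated

for _ in workers:
    events.put(None)
for w in workers:
    w.join()
```

The point of the sketch is the division of labor: the queue decouples ingestion from aggregation, the workers scale out, and the store always holds the current aggregates rather than raw data awaiting a batch job.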