Riding the Data Waterfall
Our new streaming analytics engine.
As promised, I am going to spend the next few posts discussing some of the new features in our analytics framework, otherwise known as PAF. This is our largest and most complex release so far. We are very proud of it—both in how far the framework has come and how closely it matches our vision of what a world class analytics system would look like when we started the company a few years ago.
One of my favorite features, and certainly the biggest change in this release, is that our analytics engine is now completely streaming based. I think that this, along with our improved ad-hoc analysis support, is going to improve our customers’ day-to-day to experience with both calculating and using analytics in their businesses.
I will describe the new features more fully and how they are implemented in more detail in a later post, but first some background on batch versus streaming analytic systems (if you are a geek, most of this will be familiar to you).
For analytic purposes, batch processing is simply the aggregation of data into a single batch, aka a “job,” that will all be analyzed at one time. Typically, batch systems won’t allow you to see partial results so you have to wait until the analysis of the entire job is done before you can view/act on the results. A good example of a typical batch process is a retailer taking all of their orders for the week and then calculating sales performance across the world.
On the other hand, streaming analytic engines like PatternBuilders, analyze data as it arrives and make results immediately available to the user. This means that analytics are available as quickly as the data received can be loaded into the system. Streaming analytics are also called real-time analytics – before the purists start yelling, yes, I know these systems are not real-time by the classic definition but enough marketing folks have used the term in this context to make it stick. Probably the most famous use of streaming analytics is the real-time (there it is again) stock ticker applications used on trading desks. Streaming analytics are very useful when you need information quickly but have a downside since analysis is never “done” which means you may get misleading interim results.
Batch processing has been the most common approach to analytics for a number of reasons:
- It works well for a lot of problems.
- It is a lot easier to code than streaming systems.
- For some domains where the analytics aren’t time series based, it is the only practical approach.
Batch systems also have some serious downsides, especially in a world where digital devices are outputting data 24/7 and organizations have an ever growing need to both analyze that data and react quickly to what they learn from it. The most common complaints:
- No partial answers – you have to wait for the entire batch to finish. For big batches this can take a lot of time.
- Hardware requirements – because they process everything at once, batch systems typically require more hardware (such as memory, disk and CPU) than streaming systems which can process a transaction and then throw it away.
- Limited ad-hoc capabilities (more on this later).
- All or nothing – Any change in the data usually requires the entire batch to be recalculated.
There are some great analytics systems out there built for batch processing with the best being Hadoop, a very powerful, but also very complex, open source solution. (They also have the coolest logo in the industry.) At the core of Hadoop is the Google algorithm known as MapReduce. A very good description of MapReduce can be found in Ayende’s blog.
MapReduce can be used for any batch based calculation, not just analytics. Hadoop has a lot of users at media companies and websites, like Facebook, that use it for image and video processing among other things. Interestingly enough, Google has recently abandoned MapReduce for a more stream-based approach so they can incrementally update their index instead of recalculating it. That and the fact that streaming is a better fit for supporting ad-hoc analytics was the major reason that we switched as well.
One of the major drawbacks of MapReduce, or any batch based approach to analytics, is that you have to know exactly what you want to analyze before you start the job and then wait for the job to finish.
Because of this, batch systems do not lend themselves to ad-hoc analysis. MapReduce in particular has very specific constraints on how both the Map and Reduce functions must be structured. This can make it very difficult to specify your problem in a natural way.
In contrast, our system’s scripting language has no such structural constraints other than the requirement that any analytics function must return a numeric result. This simplicity in specification, combined with the fact that our engine is streaming, allows you to quickly create an analysis even while data is loading and get immediate results. The power this gives you to do ad-hoc and root cause analysis quickly has to be seen to be believed. The other nice thing about streaming systems is that you can easily emulate a batch system if transient results don’t make sense for your particular problem. You simply submit all your data at once and wait for it all to process.
Entry filed under: Data, General Analytics, PatternBuilders Technology, Technology. Tags: analytic systems, batch processing, Hadoop, MapReduce, PatternBuilders Analytic Framework, real-time analysis, streaming analytics, Technology.