Riding the Data Waterfall

January 26, 2011 at 6:01 pm 11 comments

Our new streaming analytics engine.

By Terence Craig

As promised, I am going to spend the next few posts discussing some of the new features in our analytics framework, otherwise known as PAF.  This is our largest and most complex release so far.  We are very proud of it—both in how far the framework has come and how closely it matches our vision of what a world class analytics system would look like when we started the company a few years ago.

One of my favorite features, and certainly the biggest change in this release, is that our analytics engine is now completely streaming based.  I think that this, along with our improved ad-hoc analysis support, is going to improve our customers’ day-to-day to experience with both calculating and using analytics in their businesses.   

I will describe the new features more fully and how they are implemented in more detail in a later post,  but first some background on batch versus streaming analytic systems (if you are a geek, most of this will be familiar to you).

For analytic purposes, batch processing is simply the aggregation of data into a single batch, aka a “job,” that will all be analyzed at one time.  Typically, batch systems won’t allow you to see partial results so you have to wait until the analysis of the entire job is done before you can view/act on the results.  A good example of a typical batch process is a retailer taking all of their orders for the week and then calculating sales performance across the world.

On the other hand, streaming analytic engines like PatternBuilders, analyze data as it arrives and make results immediately available to the user.  This means that analytics are available as quickly as the data received can be loaded into the system.  Streaming analytics are also called real-time analytics – before the purists start yelling, yes, I know these systems are not real-time by the classic definition but enough marketing folks have used the term in this context to make it stick.  Probably the most famous use of streaming analytics is the real-time (there it is again) stock ticker applications used on trading desks.  Streaming analytics are very useful when you need information quickly but have a downside since analysis is never “done” which means you may get misleading interim results.

Batch processing has been the most common approach to analytics for a number of reasons:

  • It works well for a lot of problems.
  • It is a lot easier to code than streaming systems.
  • For some domains where the analytics aren’t time series based, it is the only practical approach.

Batch systems also have some serious downsides, especially in a world where digital devices are outputting data 24/7 and organizations have an ever growing need to both analyze that data and react quickly to what they learn from it.  The most common complaints:

  • No partial answers – you have to wait for the entire batch to finish. For big batches this can take a lot of time.
  • Hardware requirements – because they process everything at once, batch systems typically require more hardware (such as memory, disk and CPU) than streaming systems which can process a transaction and then throw it away.
  • Limited ad-hoc capabilities (more on this later).
  • All or nothing Any change in the data usually requires the entire batch to be recalculated.

There are some great analytics systems out there built for batch processing with the best being Hadoop, a very powerful, but also very complex, open source solution.  (They also have the coolest logo in the industry.)   At the core of Hadoop is the Google algorithm known as MapReduce. A very good description of MapReduce can be found in Ayende’s blog.

MapReduce can be used for any batch based calculation, not just analytics.  Hadoop has a lot of users at media companies and websites, like Facebook, that use it for image and video processing among other things. Interestingly enough, Google has recently abandoned MapReduce for a more stream-based approach so they can incrementally update their index instead of recalculating it.  That and the fact that streaming is a better fit for supporting ad-hoc analytics was the major reason that we switched as well.

One of the major drawbacks of MapReduce, or any batch based approach to analytics, is that you have to know exactly what you want to analyze before you start the job and then wait for the job to finish.

Because of this, batch systems do not lend themselves to ad-hoc analysis.  MapReduce in particular has very specific constraints on how both the Map and Reduce functions must be structured. This can make it very difficult to specify your problem in a natural way.

In contrast, our system’s scripting language has no such structural constraints other than the requirement that any analytics function must return a numeric result.  This simplicity in specification, combined with the fact that our engine is streaming, allows you to quickly create an analysis even while data is loading and get immediate results.  The power this gives you to do ad-hoc and root cause analysis quickly has to be seen to be believed.  The other nice thing about streaming systems is that you can easily emulate a batch system if transient results don’t make sense for your particular problem.  You simply submit all your data at once and wait for it all to process.

Entry filed under: Data, General Analytics, PatternBuilders Technology, Technology. Tags: , , , , , , , .

Marketing 101: The Mea Maxima Culpa Rule Strata Sneak Peek: Why Nobody Does It Better Than Wal-Mart

11 Comments Add your own

  • 1. Lucas  |  January 27, 2011 at 7:50 am

    Hello, Terence.

    Nice post! it’s the first I’ve ever read comparing batch and streaming systems.

    Have you ever heard about Yahoo S4. It’s a streaming system recently released by Yahoo, which main application was to optimize Yahoo’s ad engine.

    Do you know any other material about streaming computing and maybe something about streaming systems architecture?

    Thanks in advance

    Reply
  • 2. Terence  |  January 27, 2011 at 7:31 pm

    Lucas,

    Thanks for reading and the kind words. I have heard of S4 – while we haven’t used it – people I know say its cool tech. It will be interesting to see how big a community they can build around it. Another one to look at is StreamInsight by Microsoft. I will be comparing and contrasting our approach to theirs in a later post.

    Regarding easily accessible papers your best bet is to Google Complex Event Processing which some folks are now calling Cloud Event Processing gotta love marketing :-). That is where I see most of the writing about streaming servers showing up.

    Cheers,

    Terence

    Reply
  • 3. The Perfect Fit for Analytics « Big Data Big Analytics  |  January 31, 2011 at 3:02 pm

    […] my last post, I gave an overview of the difference between batch and streaming analytics approaches.  It was a […]

    Reply
  • […] Gnip CEO, mentioned that he noticed this as well in the Gnip blog).  For the reasons I mentioned here, I think that streaming is one of the critical pieces for big data and big analytics.  It would be […]

    Reply
  • […] at social media data, primarily because the streaming nature of social media makes it ideal for our streaming analytics engine.  This seems to have been overlooked by most of the existing solutions in the space—this […]

    Reply
  • […] is the case with real-time analytics.  As you may recall, in a previous post Terence pointed out that “real-time,” as it is applied to analytics, does not meet the computer […]

    Reply
  • 7. Real-time Web Analytics  |  August 5, 2011 at 1:37 am

    Great information . If it’s okay with you, I’d like to post a link in my blog and hub.

    Reply
    • 8. Terence Craig  |  August 5, 2011 at 1:48 pm

      Thanks for the kind words – Please do link to the article the more exposure the better. I took a look at your website it looks like you guys have a great product. Best Terence

      Reply
  • […] Analytics Framework is designed to handle big data and real-time analytics (see Terence’s post about our real-time/streaming architecture) and while we’ve posted about the benefits of […]

    Reply
  • […] Riding the Waterfall – Our New Streaming Analytics Engine […]

    Reply
  • […] Besides the ability to spread your bandwidth & storage requirements over time, streaming approaches allow you to generate, calculate and use incremental results in near real-time as data arrives and, just as importantly, makes it easy to provide an environment where you don’t have to redo everything when the inevitable hardware, data corruption or network failure occurs. You simply restart the stream at the point where the failure occurred. A more comprehensive view on my thoughts on streaming vs. batch approaches can be found here. […]

    Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Trackback this post  |  Subscribe to the comments via RSS Feed


Video: Big Data Made Easy

PatternBuilders Corporate

Follow us on Twitter

Special privacy section!

Enter your email address to subscribe.

Join 57 other followers

Recent Posts

Previous Posts


Follow

Get every new post delivered to your Inbox.

Join 57 other followers

%d bloggers like this: