The Perfect Fit for Analytics
In my last post, I gave an overview of the difference between batch and streaming analytics approaches. It was a very popular post and was mentioned on the excellent MyNOSQL blog, which was really appreciated. Their able proprietor, Alex Popescu, had the following comment:
“I cannot put my finger on it right now, but I don’t think stream processing can cover exactly the same wide range of computations available in batch processing:
While I haven’t had the chance to play with real big data, I believe it is not a matter of either or. An ideal system would need to support:
- piping incoming data through a combination of filters, preprocessors/transformers, and calculators/extractors
- preserve (all/relevant) data for later computation
- allow processing of stored data in either streams or batches”
I would agree that streaming has limitations, but for data-intensive analysis of time series data where you can tolerate intermediate results (which covers a lot of real-world cases), I think it is the best fit. At PatternBuilders, our decision to have the platform support only time series based analytics opened the door to a huge number of optimizations. Without those optimizations, the ratio of analytics performance to hardware cost would have been pretty ugly.
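To make the "intermediate results" point concrete, here is a minimal sketch (not PatternBuilders' actual implementation) of the kind of optimization streaming time series analytics permits: running statistics updated one event at a time via Welford's online algorithm, so a valid answer is available at any point and raw events never need to be retained.

```python
class RunningStats:
    """Incrementally track count, mean, and variance of a metric stream
    using Welford's online algorithm -- no raw events are stored."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance; returns 0.0 before any data arrives.
        return self.m2 / self.n if self.n else 0.0

stats = RunningStats()
for price in [10.0, 12.0, 11.0, 13.0]:
    stats.update(price)  # intermediate results are valid after every event

print(stats.mean, stats.variance)  # prints 11.5 1.25
```

A batch engine would recompute these statistics over the full dataset on every query; the streaming version does constant work per event, which is where much of the performance-per-dollar advantage comes from.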
I also think the debate over whether an analytics engine is streaming or batch is just a small part of determining whether a modern analytics system is useful and will be widely adopted. One of the founding principles of our business is that the scalability problems that have been the focus in the past have been solved by the combination of:
- Better hardware and declining costs
- Fan out architectures
- Parallel programming techniques (both cross thread & cross machine)
- Scalable Open Source Data Stores (MongoDB, CouchDB, RavenDB, ….)
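The fan-out and parallel-programming items above boil down to one idea: partition the data, compute mergeable partial aggregates in parallel, and combine them. A minimal cross-thread sketch (the shard layout and helper names are illustrative, not any particular product's API):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards of a metric, as a fan-out architecture might
# partition a time series across workers or machines.
shards = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

def partial_sum_count(shard):
    # Each worker produces a partial aggregate that merges associatively.
    return sum(shard), len(shard)

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_sum_count, shards))

# Merge step: combine the partials into the global answer.
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
print(total / count)  # overall mean: prints 5.0
```

The same merge step works whether the partials come from threads on one box or from separate machines, which is why fan-out architectures and parallel programming techniques compose so well.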
Getting analytics out of the back room and into common use across organizations is less about scalability and more about usability. Usability (to us) comes down to how well a system meets these requirements:
- Does its UI allow non-statisticians to easily query/explore statistics created by others?
- How much hardware do you need for your required performance? Does it support fan-out?
- Does the system come with pre-built common metrics for the customer's particular industry – for example, GEMROI in retail or the Joint Commission statistics for hospitals?
- How easy is it to create new analyses and how long does it take to get results from a new analysis? If creation is fast, it allows quick prototyping which, in statistics just as in programming, increases productivity.
- Is it secure?
- Does it have a flexible deployment model (cloud or on-premises)?
- Does its user interface provide useable performance on the web?
- Can its scripting language be configured to become an analytics DSL in the user’s problem space?
- Can it easily absorb data from different sources and different formats?
- Is it easy to keep up and running?
- Is it as accessible as Excel (still the number one analytics tool in the world)?
If any vendor/platform can answer all of these questions perfectly, they will be able to fundamentally improve how decisions are made in large organizations. At PatternBuilders, we have made progress on all of the above, but neither we nor any other vendor is completely there yet.
Off to Strata. Hope to see you folks there; if you are attending the show, come check out our session. And if the WiFi gods permit, I will try to do some blogging from the show.