Big Data and Cloud not a fit? Comments on InfoWorld Article
Since Disqus seems to have completely eaten (bleh) my comment on @davidlinthicum’s very interesting InfoWorld post, "Big data and the cloud: A far from perfect fit," I decided to expand my comments into a short blog post. IMHO the problems that David describes are more a reflection of problems with batch-oriented technologies like Hadoop (more on my take on Hadoop here) in the cloud than a general problem for cloud-based big data solutions.
Computing always has had, and probably always will have, a bias towards creating batch-focused technologies at the beginning of any large paradigm shift. But as new technologies are absorbed, understood, and move from early adopter to more mainstream use, the batch paradigm inevitably starts to shift to streaming and real-time. We have seen this again and again (from punch cards to touch-sensitive tablets, downloaded media to streaming media, DOM to SAX parsers, HTML to Ajax, paper maps to real-time GPS). The reason this evolution almost always occurs is simple: humans live and think in real-time, and when our tools do as well we are more productive and happier. So why do we have this bias for batch processing in our first-generation computational technologies? Simply put, because batch processing is a lot easier.
The cloud by its very nature makes the major deficiency of batch processing very apparent: you have to have all of your data available at once to do your processing. This of course means that you have to transfer ALL of the data you want to process to the cloud, and store it there, before you can even start processing. This is often problematic for big data problems, even in on-premises data centers, but as David’s article points out it becomes an even bigger problem in the cloud due to network and storage issues. On the other hand, streaming approaches to big data allow you to process data as it is produced and then throw it away. This dramatically lowers costs and infrastructure requirements, and it makes streaming systems much more amenable to adding resources when needed for spikes and then releasing them when they are no longer needed.
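To make the contrast concrete, here is a minimal Python sketch of the same calculation done both ways. It is only an illustration: the function names `batch_mean` and `streaming_mean` and the sample readings are made up for this post and do not come from any particular product or library.

```python
from typing import Iterable, Iterator, List, Tuple


def batch_mean(values: List[float]) -> float:
    """Batch style: the complete data set must be collected before we start."""
    return sum(values) / len(values)


def streaming_mean(values: Iterable[float]) -> Iterator[Tuple[int, float]]:
    """Streaming style: keep only a running count and total.

    Each value is folded into the running state and then dropped, so memory
    use stays constant no matter how much data flows through.
    """
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
        # An up-to-date result is available after every single value.
        yield count, total / count


# Hypothetical sensor readings, used only for illustration.
readings = [10.0, 20.0, 30.0, 40.0]
incremental = list(streaming_mean(iter(readings)))
```

Both versions end at the same answer, but the streaming version never needs the whole data set in one place and produces a usable result after every reading.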
Besides the ability to spread your bandwidth and storage requirements over time, streaming approaches allow you to generate, calculate, and use incremental results in near real-time as data arrives. Just as importantly, they make it easy to provide an environment where you don’t have to redo everything when the inevitable hardware, data corruption, or network failure occurs. You simply restart the stream at the point where the failure occurred. A more comprehensive view of my thoughts on streaming versus batch approaches can be found here.
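The "restart at the point of failure" idea can be sketched in a few lines of Python. This is a toy, under stated assumptions: the checkpoint here is just an in-memory dictionary with hypothetical `offset` and `total` keys, the failure is simulated, and a real system would persist the checkpoint somewhere durable.

```python
from typing import Dict, List


def process_stream(events: List[float], state: Dict[str, float],
                   fail_at: int = -1) -> None:
    """Resume at the checkpointed offset and fold each event into the total.

    `state` acts as the checkpoint: 'offset' is the index of the next
    unprocessed event and 'total' is the running result so far.
    `fail_at` simulates a crash just before the event at that index.
    """
    start = int(state["offset"])
    for i in range(start, len(events)):
        if i == fail_at:
            raise RuntimeError("simulated hardware or network failure")
        state["total"] += events[i]
        state["offset"] = i + 1  # advance the checkpoint after each event


events = [1.0, 2.0, 3.0, 4.0, 5.0]
state = {"offset": 0, "total": 0.0}

try:
    process_stream(events, state, fail_at=3)  # crashes before the 4th event
except RuntimeError:
    pass

after_failure = dict(state)    # progress made before the crash is retained
process_stream(events, state)  # resumes at offset 3; nothing is redone
```

The second call picks up at the fourth event, so the first three are never reprocessed.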
These were the reasons we decided on a streaming architecture for our big data analytics platform, which forms the foundation of our Financial Services analytics product, FinancePBI. It was also the reported reason Google changed its core infrastructure from MapReduce to its Caffeine architecture in 2010.
There are places where batch architectures can be more efficient (many social graph problems, for example). But big data analytics systems must perform their tasks in a world increasingly dominated by IP-enabled, location-aware devices (aka the Internet of Things). The Internet of Things produces huge amounts of streaming/real-time data from devices as diverse as cell phones, blood pressure monitors, and fly-by-wire jets like the 777. More and more, to be useful a big analytics system needs to be able to comfortably take real-time data from devices and mash that up with other critical streaming data sources, such as stock market tickers or the ever-expanding Twitter firehose. As we enter this new world, any technology decision that requires batch processing, or that makes it harder to take advantage of the lower costs and convenience of the cloud, has to be examined carefully!