Posts filed under ‘Data’
I had to miss Strata due to a family emergency. While Mary picked up the slack for me at our privacy session, and by all reports did her usual outstanding job, I also had to cancel a Tuesday night Strata session sponsored by 10Gen on how PatternBuilders has used Mongo and Azure to create a next-generation big data analytics system. The good news is that I should have some time to catch up on my writing this week, so look for a version of what would have been my 10Gen talk shortly. In the meantime, to get me back in the groove, here is a very short post inspired by a Forbes piece written by Dan Everett of SAP on “Hadoopla.”
As the CEO of a real-time big data analytics company that occasionally competes with parts of the Hadoop ecosystem, I may have some biases (you think?). But I certainly agree that there is too much Hadoopla (a great term). If our goal as an industry is to move Big Data out of the lab and into mainstream use by anyone other than the companies that thrive on, and have the staff to support, high-maintenance, very high-skill technologies, then Hadoop is not the answer: it has too many moving parts and is simply too complex.
To quote from a blog post I wrote a year ago:
“Hadoop is a nifty technology that offers one of the best distributed batch processing frameworks available, although there are other very good ones that don’t get nearly as much press, including Condor and Globus. All of these systems fit broadly into the High Performance, Parallel, or Grid computing categories and all have been or are currently used to perform analytics on large data sets (as well as other types of problems that can benefit from bringing the power of multiple computers to bear on a problem). The SETI project is probably the most well known (and IMHO, the coolest) application of these technologies outside of that little company in Mountain View indexing the Internet. But just because a system can be used for analytics doesn’t make it an analytics system…”
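To make the “distributed batch processing” part concrete, here is a toy, single-process sketch of the MapReduce model that Hadoop runs across a cluster. The function names and the in-memory shuffle are my own illustration, not Hadoop’s actual API; the real system partitions the map output across many machines and persists intermediate results to disk.

```python
from collections import defaultdict

def map_phase(records):
    # Map: turn each input record into (key, value) pairs.
    for record in records:
        for word in record.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all values by key (Hadoop does this across the cluster).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: collapse each key's values into a final result.
    return {key: sum(values) for key, values in groups.items()}

docs = ["Hadoop is a batch framework", "Condor is a batch framework too"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

The point of the sketch is the shape of the computation: nothing is reduced until every record has been mapped and shuffled, which is exactly what makes the model batch-oriented.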
Why is the industry so focused on Hadoop? Given the huge amount of venture capital that has been poured into various members of the Hadoop ecosystem, and that ecosystem’s failure to find a breakout business model that isn’t hampered by Hadoop’s intrinsic complexity, there is ample incentive for a lot of very savvy folks to attempt to market around these limitations. But no amount of marketing can change the fact that Hadoop is a tool for companies with elite programmers and top-of-the-line computing infrastructures. In that niche, it excels. But it was not designed for, and in my opinion will never see, broad adoption outside of that niche despite the seemingly endless growth of Hadoopla.
Big Data is Coming of Age in the Capital Markets—Wall Street and Technology’s Deep Dive into “Everything You Need to Know to Unlock Big Data’s Secrets” is a Must Read for All
In the “it’s a small world” category while we were in the midst of launching FinancePBI, the first financial services big data solution built for the cloud and designed to address the needs of the industry, Terence chatted with Melanie Rodier (@mrodier), a Senior Editor at Wall Street and Technology. The topic: big data and the capital markets. That 33-page report is now available and it’s a must read for anyone interested in big data and business.
Why a must read for all? Well, similar to the McKinsey report on big data in 2011, Wall Street and Technology’s big data deep dive covers a lot of ground that applies to any business or organization. In other words, specific industry requirements may be different but big data technology and process challenges are very similar. For example, Wall Street firms—like so many others—find themselves dealing with unstructured data from a variety of sources, including the web, social media, and mobile devices. While there’s value in that data, there are infrastructure issues and a looming talent shortage. Sound familiar? (more…)
Since Disqus seems to have completely eaten (bleh) my comment on @davidlinthicum’s very interesting InfoWorld post, “Big data and the cloud: A far from perfect fit,” I decided to just expand my comments and make a short blog post out of it. IMHO the problems that David is describing are more a reflection of problems with batch-oriented technologies like Hadoop (more on my take on Hadoop here) in the cloud than a general problem for cloud-based big data solutions.
Computing always has, and probably always will have, a bias towards creating batch-focused technologies at the beginning of any large paradigm shift. But as new technologies are absorbed, understood, and move from early adopter to more mainstream use, the batch paradigm inevitably starts to shift to streaming and real-time. We have seen this again and again (from punch cards to touch-sensitive tablets, downloaded media to streaming media, DOM to SAX parsers, HTML to Ajax, paper maps to real-time GPS). The reason this evolution almost always occurs is simple: humans live and think in real-time, and when our tools do as well we are more productive and happier. So why do we have this bias for batch processing in our first-generation computational technologies? Simply put, because batch processing is a lot easier.
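The batch-versus-streaming contrast can be sketched in a few lines. This is my own toy example, not tied to any particular product: both functions compute a running average, but the batch version must see all the data before it can answer, while the streaming version yields a usable answer after every event.

```python
def batch_mean(values):
    # Batch: wait until the entire data set is available, then compute once.
    data = list(values)
    return sum(data) / len(data)

def streaming_mean(values):
    # Streaming: maintain a running mean, updated incrementally per event.
    count, mean = 0, 0.0
    for v in values:
        count += 1
        mean += (v - mean) / count
        yield mean  # a current answer is available after every value
```

The streaming version also hints at why batch came first: the incremental update is the harder design to get right, but it is the one that matches how people actually consume results.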
Greetings one and all and happy new year! As promised, here is part 2 of my post on McKinsey’s drill-down into the tremendous benefits location data offers to new businesses (and business models) as well as to all of us. If you need to refresh your memory (since the author was a wee bit late in meeting her stated publishing date), part 1 is available here. Certainly, the report, “Big data: The next frontier for innovation, competition, and productivity,” is chock-full of illuminating ways that big data can be leveraged within specific industries, but personal location data is a somewhat different beast as it cuts across industries. For example, telecom, retail, and media (through location-based advertising) all stand to reap tremendous rewards.
Now, as I said in part 1 and will state again here in part 2: I have a bit of angst around the collection and use of personal location data (see my many posts on privacy or our book on “Privacy and Big Data”). But that does not negate what can be gained if it is properly collected and used, with the appropriate regulations and guidance in place (my gosh—I am beginning to sound like one of the privacy policies I hate to read!). Put simply: all companies’ data collection and usage policies should be clearly stated and always offered on an opt-in basis. Okay, privacy issues have been dealt with so let’s move on! (more…)
A number of folks have asked me if I was concerned about Microsoft’s recent announcement that they would be partnering with HortonWorks and abandoning their own distributed processing technology in favor of Hadoop. While I thought this was an unfortunate choice on Microsoft’s part (the Dryad project’s implementation of multi-server LINQ was pretty compelling), since HPC is a small part of Microsoft’s business, it probably made sense from a business standpoint. In any case, we (as in all of us at PatternBuilders) are not concerned and, just to be clear, we don’t believe that this announcement (or any other) means that the many Hadoop ecosystem players own the still-forming big data analytics market.
That is not to say that the announcement isn’t proof of the strength of the Hadoop ecosystem. Hadoop is a nifty technology that offers one of the best distributed batch processing frameworks available, although there are other very good ones that don’t get nearly as much press, including Condor and Globus. All of these systems fit broadly into the High Performance, Parallel, or Grid computing categories and all have been or are currently used to perform analytics on large data sets (as well as other types of problems that can benefit from bringing the power of multiple computers to bear on a problem). The SETI project is probably the most well known (and IMHO, the coolest) application of these technologies outside of that little company in Mountain View indexing the Internet. (more…)
All you need is text, text is all you need (sung to the tune of The Beatles’ “All You Need Is Love”). If you are one of our regular readers, you will remember that several months ago I wrote a manifesto on what the perfect analytics system would look like. One of the last points was:
It must be as accessible as Excel (still the number one analytics tool in the world).
I was wrong: Excel is the number one non-specialized analytics tool in the world, but in terms of usage it is dwarfed by a very well-known specialized analytics toolkit. The creator of this tool is a little company that you may have heard of: it does no evil and analyzes the Internet to bring you back everything on the web based on a simple text query. But behind that simple text box, Google has one of the most sophisticated analytics infrastructures in the world. It can:
- Deduce your interests.
- Give you the most relevant results.
- Show you appropriate information based on those interests, as well as bring back highly personalized ads.
Google is not only the largest big data analytics company in the world, but it also has the easiest-to-use tools—proof that text is all you really need!