Posts tagged ‘big analytics’
Marilyn Craig (Managing Director of Insight Voices, frequent guest blogger, marketing colleague, and analytics guru) and I have been watching the big data “V” pile-on with a bit of bemusement lately. We started with the classic 3 V’s, codified by Doug Laney, a META Group and now Gartner analyst, in early 2001 (yes, that’s correct, 2001). Doug puts it this way:
“In the late 1990s, while a META Group analyst (Note: META is now part of Gartner), it was becoming evident that our clients increasingly were encumbered by their data assets. While many pundits were talking about, many clients were lamenting, and many vendors were seizing the opportunity of these fast-growing data stores, I also realized that something else was going on. Sea changes in the speed at which data was flowing mainly due to electronic commerce, along with the increasing breadth of data sources, structures and formats due to the post Y2K-ERP application boom were as or more challenging to data management teams than was the increasing quantity of data.”
Doug worked with clients on these issues and spoke about them at industry conferences. He then wrote a research note (February 2001) entitled “3-D Data Management: Controlling Data Volume, Velocity and Variety,” which is available in its entirety here (pdf too).
In Search of Elusive Big Data Talent: Is Science Big Data’s Biggest Challenge? Or Are We Looking in the Wrong Places? (Part 1 of 3)
When we talk to prospects about their big data initiatives, our conversations usually revolve around issues of complexity and go something like this:
“Big data is so big (no pun intended), there’s such a variety of sources, and it’s coming in so fast. How can we develop and deploy our big data projects when everyone is telling us that we need lots and lots of data scientists and oh, by the way, there aren’t enough?”
Admittedly, many media outlets and pundits are positioning the search for skilled big data resources as what I can only characterize as the battle for the brainiacs. Don’t get me wrong: I am not disputing McKinsey’s big data report from last year, which made it clear that a talent shortage was looming, estimating that the U.S. would need 140,000 to 190,000 people with “deep analytical skills” and 1.5 million managers and analysts to “analyze big data and make decisions based on their findings.” But the hype surrounding the data scientist is getting a bit absurd, and we seem to be forgetting that those 1.5 million managers and analysts may already be “walking amongst us.” Is a shortage of data scientists really big data’s biggest challenge?
Since Disqus seems to have completely eaten (bleh) my comment on @davidlinthicum’s very interesting InfoWorld post – Big data and the cloud: A far from perfect fit – I decided to expand my comments into a short blog post. IMHO, the problems David describes are more a reflection of problems with batch-oriented technologies like Hadoop (more on my take on Hadoop here) in the cloud than a general problem with cloud-based big data solutions.
Computing always has, and probably always will have, a bias toward batch-focused technologies at the beginning of any large paradigm shift. But as new technologies are absorbed, understood, and move from early adopters to more mainstream use, the batch paradigm inevitably starts to shift toward streaming and real-time. We have seen this again and again: from punch cards to touch-sensitive tablets, downloaded media to streaming media, DOM to SAX parsers, HTML to Ajax, paper maps to real-time GPS. The reason this evolution almost always occurs is simple: humans live and think in real time, and when our tools do as well, we are more productive and happier. So why do we have this bias for batch processing in our first-generation computational technologies? Simply put, because batch processing is a lot easier.
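To make the batch-versus-streaming contrast concrete, here is a toy sketch (mine, not from any of the posts or products mentioned here): the same average computed batch-style, after all the data has arrived, and streaming-style, updated incrementally as each value arrives. The streaming version gives you an answer in real time, but only by carrying state forward, which is exactly the extra work that makes first-generation tools favor batch.

```python
def batch_mean(values):
    # Batch: wait for the full data set, then compute in one pass.
    # Simple, but no answer until everything has arrived.
    data = list(values)
    return sum(data) / len(data)

class StreamingMean:
    # Streaming: update running state as each value arrives, so the
    # current answer is always available -- at the cost of managing state.
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, x):
        self.count += 1
        self.total += x
        return self.total / self.count

readings = [3.0, 5.0, 7.0]
sm = StreamingMean()
running = [sm.update(x) for x in readings]   # 3.0, 4.0, 5.0 as data streams in
assert running[-1] == batch_mean(readings) == 5.0
```

The two agree on the final answer; the difference is *when* an answer exists at all.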
A number of folks have asked me if I was concerned about Microsoft’s recent announcement that they would be partnering with Hortonworks and abandoning their own distributed processing technology in favor of Hadoop. While I thought this was an unfortunate choice on Microsoft’s part (the Dryad project’s implementation of multi-server LINQ was pretty compelling), HPC is a small part of Microsoft’s business, so it probably made sense from a business standpoint. In any case, we (as in all of us at PatternBuilders) are not concerned, and just to be clear: we don’t believe that this announcement (or any other) means that the many Hadoop ecosystem players own the still-forming big data analytics market.
That is not to say that the announcement isn’t proof of the strength of the Hadoop ecosystem. Hadoop is a nifty technology that offers one of the best distributed batch processing frameworks available, although there are other very good ones that don’t get nearly as much press, including Condor and Globus. All of these systems fit broadly into the high-performance, parallel, or grid computing categories, and all have been or are currently used to perform analytics on large data sets (as well as other types of problems that benefit from bringing the power of multiple computers to bear). The SETI project is probably the most well-known (and IMHO, the coolest) application of these technologies outside of that little company in Mountain View indexing the Internet.
As you all know, Tim and I spoke at MongoSF recently. Our session was focused on how to build a streaming analytics system with Mongo. For those of you who might have missed this post thread, here are the highlights (with the appropriate links):
- We wanted to make our beta version of PatternBuilders Social Media Analytics demo publicly available on the web.
- We looked at cloud-based deployments as a way to make this economically viable.
- As part of our move to the cloud, we made significant changes to the PatternBuilders Platform architecture—which included adopting MongoDB (a choice that the PatternBuilders development team is very happy with).
Our session was videotaped and I am happy to announce that it is now available on the 10gen site. You’ll notice that we got a lot of great questions. If, after viewing the video, you have some thoughts or questions please send them my way through comments or email—it may take me some time (we are, as Mary said in her last post, crazy busy right now), but I will follow up!
Data ownership, privacy, and security: we are all in this together.
There’s been a lot of marketing “noise” going on about the exponential growth of digital data (and yes, we are partially responsible for some of it) and there’s even a sound bite for it: big data (we did not coin the term but we have used it over and over again). Now, in my defense, I thought that this term made complete sense and was the “perfect” definition for the problem we are all facing. Of course, I forgot an important marketing axiom: test the term with folks outside of the industry to ensure that the meaning is not lost. You know, it’s always fun to spend time with friends and family, especially when they ask “what is it exactly that you do?” In the course of our conversation, I discovered that “big” means, well, big, which does not quite “do justice” to the challenges of the “big data” world that we all live in.
So, what exactly is big data and why should you care? Well, big data is really big—which is how the term big data came to be. For example, IDC’s research on the size of our “digital universe” revealed the following:
- In 2009, the digital universe grew 62%, or almost 800,000 petabytes (for those of you “size-challenged” folks, each petabyte is a million gigabytes, which translates into a stack of DVDs reaching from the earth to the moon and back).
- In 2010, it was projected to grow to 1.2 million petabytes (final counts are not in yet).
- By 2020, it is projected to be 44 times as big as it was in 2009 (those DVDs would be stacked up halfway to Mars).
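As a quick sanity check on the arithmetic (the calculation is mine; the figures are IDC’s as quoted above, in decimal units where 1 PB = 1,000,000 GB and 1 ZB = 1,000,000 PB):

```python
# Back-of-the-envelope check of the IDC figures quoted above.
digital_universe_2009_pb = 800_000      # ~0.8 zettabytes in 2009
digital_universe_2010_pb = 1_200_000    # 2010 projection (1.2 million PB)

# "44 times as big as it was in 2009" for the 2020 projection:
projected_2020_pb = digital_universe_2009_pb * 44
projected_2020_zb = projected_2020_pb / 1_000_000  # 1 ZB = 1,000,000 PB

print(f"2020 projection: {projected_2020_pb:,} PB (~{projected_2020_zb:.1f} ZB)")
# → 2020 projection: 35,200,000 PB (~35.2 ZB)
```

In other words, the projection works out to tens of zettabytes by 2020, which is why “big” starts to feel like an understatement.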
Note the use of the term “digital universe.” This refers to data that is stored in digital form. For example, all data that is stored in a computer is digital. So when we talk about the digital universe, we are in essence talking about all the data in the world that is stored on some sort of computer (big or small), most likely in some sort of database.
I have been involved with databases and analytics for the last 20 years or so. In fact, I remember when relational databases first started to displace IBM’s, Digital’s, and HP’s proprietary databases like IMAGE and RMS. I also remember the heated arguments about whether QUEL or SQL would win the query language wars (IMHO, the worst language won).
It was an exciting time. Data, and how to manage it, was the focal point of the entire technology industry. But as relational databases became ubiquitous, the focus rightly shifted: now that we have stored all that data, what the #@$@$@ do we do with it all?