Data is Officially Big Business and Exploratory Analysis is in the Driver’s Seat
In case you haven’t heard, data is “a $100 billion marketplace.” There are data markets, such as Gnip, Infochimps, and Microsoft’s Azure Marketplace, that offer some datasets for free and others for a price. And those prices may be dropping considerably thanks to data reseller MediaSift, which is providing very low-cost access to social data. This means that companies outside the Fortune 500 can now afford to purchase small, deep slices of social data for “pennies on the dollar” compared to other data markets. There are also public datasets offered by data.gov and others. And all of these datasets are constantly growing, fed by the data generated by social networks, e-commerce, mobile location, and yes, advertising technology. Think about this for a moment:
“… 600 billion electronic transactions are created in the U.S. every day, and many of those transactions come from geo-locational data generated by cell phones, which through cellular towers, triangulate a person’s exact location at any time. Wireless providers have that data in real time.”
Wow. Or put another way:
“Big data — an industry term that refers to large data warehouses — includes machine- and human-generated data such as computer system log files, financial services electronic transactions, Web search streams, e-mail metadata, search engine queries and social networking activity. In 2010 alone, 1.5 zettabytes of that kind of data was created, most of it machine-generated.”
How about big wow? (Yes, pun intended and yes, I think it’s funny!)
But here’s the thing. For any of this to be useful, analytics has to play a major role, and not just in the traditional sense. We are all familiar with business performance KPIs that help us to streamline operations and maximize revenue. In fact, in a recent post I talked about how analytics separated the winners from the losers. But we now have the capacity to collect (buy), store, aggregate, and retrieve enormous amounts of data cheaply and easily. We are dealing with multiple, very large datasets and mashing them all together. Do we even know what we are looking for?
Let me put this another way. If traditional analytics is like finding needles in haystacks, shouldn’t we also, as Semil Shah pointed out, be building “better magnets to draw out all those needles?” Recent research from Aster Data certainly supports this:
“Nearly 30% of respondents thought that exploratory analysis of big data to find ‘the next big business insight’ was a huge business opportunity. This supports the notion that massive data exploration, and the role of data scientists is key for today’s data-driven organizations.”
What exactly is exploratory data analysis? Well, John Tukey felt that too much emphasis was placed on statistical hypothesis testing (otherwise known as confirmatory data analysis) rather than on using the data itself to suggest hypotheses to test (exploratory analysis). In other words, it’s not what you think you know; it’s what you don’t know that you know.
Sound convoluted? Well, yes and no. Look, if we are mashing together different kinds of datasets it’s quite possible that there are insights—when the data is viewed in aggregate—that we simply never would have thought of. If we view the aggregate dataset from a statistical point of view, we may never “discover” those insights. Exploratory analysis allows us to explore the data and let the data itself, “speak” to us.
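To make the contrast concrete, here is a minimal sketch in Python (standard library only). The two series and all the numbers are invented purely for illustration: imagine daily temperature readings mashed together with a retailer’s daily sales. A confirmatory analyst would start with a hypothesis and test it; an exploratory scan simply computes relationships across the aggregate and lets a strong signal suggest a hypothesis worth confirming.

```python
from statistics import mean

# Hypothetical mashed-up dataset: one week of daily temperatures (deg F)
# joined with daily sales (units) from a second data source. Invented
# numbers, for illustration only.
temperature = [61, 65, 72, 78, 83, 88, 91]
sales = [215, 230, 270, 305, 340, 385, 400]

def pearson(xs, ys):
    """Pearson correlation coefficient: a basic exploratory signal
    for a linear relationship between two series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

r = pearson(temperature, sales)
print(f"correlation(temperature, sales) = {r:.2f}")
```

A correlation near 1.0 here is the “magnet” at work: nobody had to hypothesize in advance that weather drives sales; the mashed-up data itself surfaced the relationship, which could then be handed off to confirmatory testing.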
So how can you get started with exploratory analysis? Well, if you are working with smaller datasets you can certainly use Excel (if you are a business user) or SAS or R (if you are an analyst). However, if you are working with a big dataset like the ones I’ve just mentioned, you may want to check out the PatternBuilders Analytics Framework. Yes, this is a shameless plug, and it’s shameless because, to be quite honest, we have not yet seen anyone else come out with an exploratory data analysis (EDA) tool that can support big data. Designed for the average business user as well as the statistician, our EDA tool is:
- Fast—because if it takes hours to perform analysis people won’t use it.
- Automatic—our system tells you when things you are interested in happen; it does not make you look for them.
- Integrated—if you love Excel, then you can still use it after we do the initial analysis.
- Mashup friendly—data and datasets are infinite; you need to be able to blend them as your business requires.
- Extensible—create your own metrics as easily as in Excel.
- Point and clickable—it’s that easy. If you feel the need to ask your VP of Marketing to write an R script or create a MapReduce job instead, go ahead, but you may not like the answer.
- Web friendly—you need your data wherever you are, not just where your analysis tool happens to run.
To see our design philosophy in action, take a look at our correlation engine video.
Here’s our take on the “biggest” big data challenge: exploratory analysis tools will determine whether big data changes the world or just makes a lot of storage vendors very happy. In other words, no EDA, no life-altering discoveries.