Tales of Beers and Diapers
There is an apocryphal story that is often told to illustrate data mining concepts. The story is about beer and diaper sales and usually goes along the lines of:
|INSERT MAJOR RETAILER NAME| found on |INSERT DAY OF THE WEEK| that beer and diaper sales were strongly correlated. Once noticed on |INSERT BI TOOL OF CHOICE|, it was found |PICK ONE|:
- That diapers are too heavy for recently pregnant women so they ask their husbands to pick them up coming home from work and since hubby is off the clock and ready to get his drink on, he also picks up beer.
- That a diaper emergency occurs fairly late in the evening and the husband is sent out while the new mother cares for the baby. Being annoyed, he also picks up a 12 pack to relax.
- That |INSERT SOME EQUALLY GROSS STEREOTYPICAL ASSUMPTION ABOUT THE U.S. WORKING CLASS PARENT|.
The brilliant analyst at |SAME MAJOR RETAILER AS ABOVE| intuits that a simple relocation of beer next to diapers will lead to more purchases of beer and beer sales improve by |INSERT HIGHER %|.
It’s a great story, even though it is almost certainly an urban legend, if for no other reason than that Wal-Mart is usually the big retail chain mentioned. Having direct experience with Wal-Mart and how secretive they are, I can tell you that if the chain were Wal-Mart, you would not have heard about it.
And while stories like this have sold plenty of BI and data mining licenses, the fact of the matter is that running correlations over large data sets, quickly and without a highly trained statistician, is no fun at all with most tools. As a result, correlation is badly underused.
This is a shame, because it is hard to overstate the power of correlation to confirm or, ideally, discover relationships that can positively affect a business. We have used our correlation engine on problems as diverse as determining optimum restocking times for multinational produce retailers and finding the relationship between social media campaigns and actual sales. For this power to be broadly used, correlation tools need to become both easier and faster.
When we were redesigning the correlation module in the PatternBuilders Analytic Framework (PAF), we had these goals:
- It had to be fast over large data sets.
- It had to be easy enough to use that an AVERAGE user could do exploratory data analysis (EDA) without knowing what it is. For example: is the new Twitter campaign driving sales of pink cell phones in the Southern region?
- It had to support time-shifted correlations, so that users can discover relationships where there is a significant time gap between cause and effect. For example: does an ad buy affect sales only two months after the campaign starts?
- It had to produce results that are easy to understand.
- It had to lay the foundation for the holy grail: automatic EDA, where the system tells users automatically which relationships are worth paying attention to.
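To make the time-shifted goal concrete, here is a minimal Python sketch of the general idea. It is not how PAF implements it: the function names (pearson, lagged_correlations, best_lag) and the numbers are made up for illustration, and it assumes two equally long, evenly spaced monthly series. It slides the "response" series against the "driver" series and computes a plain Pearson correlation at each lag.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def lagged_correlations(driver, response, max_lag):
    """Correlate driver[t] with response[t + lag] for lag = 0..max_lag."""
    n = len(driver)
    if n != len(response):
        raise ValueError("series must have the same length")
    if max_lag < 0 or n - max_lag < 2:
        raise ValueError("max_lag leaves fewer than 2 overlapping points")
    return {
        lag: pearson(driver[: n - lag], response[lag:])
        for lag in range(max_lag + 1)
    }


def best_lag(driver, response, max_lag):
    """Lag whose correlation has the largest absolute value."""
    cors = lagged_correlations(driver, response, max_lag)
    return max(cors, key=lambda lag: abs(cors[lag]))


if __name__ == "__main__":
    # Hypothetical monthly figures: sales follow ad spend two months later.
    ad_spend = [10, 50, 20, 80, 30, 60, 40, 70, 15, 55]
    sales = [100, 100] + [100 + 2 * spend for spend in ad_spend[:-2]]

    for lag, r in lagged_correlations(ad_spend, sales, max_lag=3).items():
        print(f"lag {lag} months: r = {r:+.3f}")
    print("strongest lag:", best_lag(ad_spend, sales, max_lag=3))
```

Because the sample sales figures are constructed from ad spend two months earlier, the strongest correlation shows up at a lag of two months. Real data is noisier, and a strong lagged correlation is still only a hint of where to look, not proof of cause.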
For complex hypothesis testing, and for avoiding the "correlation is not causation" trap, you will still need folks with a strong statistics foundation. But putting sophisticated EDA capabilities within reach of anybody who wants to test their gut observations will empower workers throughout an organization and will let a lot of the art of a business (aka hunches) become empirically based science.