In Search of Elusive Big Data Talent: Is Science Big Data’s Biggest Challenge? Or Are We Looking in the Wrong Places? (Part 1 of 3)
When we talk to prospects about their big data initiatives, our conversations usually revolve around issues of complexity and go something like this:
“Big data is so big (no pun intended), there’s such a variety of sources, and it’s coming in so fast. How can we develop and deploy our big data projects when everyone is telling us that we need lots and lots of data scientists and oh, by the way, there aren’t enough?”
Admittedly, many media outlets and pundits are positioning the search for skilled big data resources as what I can only characterize as the battle for the brainiacs. Don’t get me wrong, I am not disputing McKinsey’s report on big data last year that made it clear a talent shortage was looming, estimating that the U.S. would need 140,000 to 190,000 folks with “deep analytical skills” and 1.5 million managers and analysts to “analyze big data and make decisions based on their findings.” But the hype surrounding the data scientist is getting a bit absurd and we seem to be forgetting that those 1.5 million managers and analysts may already be “walking amongst us.” Is a shortage of data scientists really big data’s biggest challenge?
Let’s first consider those folks with deep analytical skills who were dubbed, for want of a better term, data scientists (a term believed to have been coined by D.J. Patil of LinkedIn fame and Jeff Hammerbacher of Facebook way back in 2008). They are, according to a recent ComputerWorld article:
“The top dogs in big data… Many of these people come out of math or traditional statistics. Some have backgrounds or degrees in artificial intelligence, natural language processing or data management.”
Harvard Business Review (HBR) goes so far as to call it the “sexiest job of the 21st century.” According to HBR:
“More than anything, what data scientists do is make discoveries while swimming in data. It’s their preferred method of navigating the world around them. At ease in the digital realm, they are able to bring structure to large quantities of formless data and make analysis possible. They identify rich data sources, join them with other, potentially incomplete data sources, and clean the resulting set. In a competitive landscape where challenges keep changing and data never stop flowing, data scientists help decision makers shift from ad hoc analysis to an ongoing conversation with data. Data scientists realize that they face technical limitations, but they don’t allow that to bog down their search for novel solutions. As they make discoveries, they communicate what they’ve learned and suggest its implications for new business directions. Often they are creative in displaying information visually and making the patterns they find clear and compelling. They advise executives and product managers on the implications of the data for products, processes, and decisions.”
Maybe it’s just me, but this “reads” like a romance novel! I mean, how many of the data scientists we know embody HBR’s fulsome definition? Not many, and honestly, I think we do the data scientist role a disservice by implying, as HBR did in a follow-on post, that data scientists are:
“… the magicians who transform an inchoate mass of bits into a fit subject for analysis. God may have been the first to produce order out of chaos, but data scientists do it too, admittedly on a smaller scale.”
Yes, data scientists are important and yes, we need more of them but let’s not confuse a specific role with the discipline itself. What is data science? I’m glad you asked as my favorite, non-hyperbolic definition comes from Columbia University’s curriculum description of a course entitled “Introduction to Data Science”:
“[Data science] lies at the intersection of statistics, computer science, data visualization and the social sciences.”
And within the data science discipline there are specific roles and functions. Sandeep Sacheti, vice president of customer insights and operational excellence at Wolters Kluwer Corporate Legal Services, looks at it this way:
“He thinks of big data jobs in terms of four ‘buckets of skillsets’: Data scientist, data architect, data visualizer and data change agent.”
Whether or not you agree with the actual titles, these four buckets capture the various skills needed on any big data science project. We’ve already established that the data scientist is the top dog, but that doesn’t mean that he or she oversees a data (science) team. I think of this person as the one you go to when the data challenge is becoming exponentially complicated and suddenly, after working with them, it’s not! The architects are the guys and gals who get their hands dirty: cleaning, organizing, and analyzing the data. They usually come from a programming, business intelligence, or statistics background. The visualizers are the ones who “translate analytics into information a business can use.” And since they are in the insights business, they need to be able to work and communicate with everyone from the ivory tower (where the top executives live) all the way down to the feet on the street (my favorite folks). The change agents will determine whether you are successful (or not), as they are responsible for instituting changes in operations and processes. I worked in the BI industry for many, many years with a product manager who used to say that insights are useless unless acted upon. I cannot tell you how many times this proved true! You have got to be able to institute change. In my opinion, it is the most difficult and overlooked big data task because it often flies in the face of corporate culture and mores.
When “we” (the PatternBuilders team) talk to prospects and customers about big data talent we counsel them to set aside titles and education (how many “data scientists” do you really need?) and focus first on who they already have:
- Who are your A-players? Every company has them: these are the people who consistently demonstrate a willingness to cross chasms and to investigate new ways of doing things, and who are not afraid of technology innovations. They have an intense curiosity and are always poking around the quantitative parts of your business. And while they may not have all the technical skills of a data scientist (though some do), many could fill the other roles and possibly be your Chief Data Officer (more on this a bit later).
- Who are your power users? These are the people you go to when you want to understand something specific about your business that may cross functions, involve math or statistics, require sophisticated analysis, etc. What’s interesting about these folks is that they may not always have a great deal of depth in some areas (like programming) but they always know what needs to be done to figure out the answer, have a deep understanding of your business, and are never afraid to seek help.
- Who are your geeks? Yes, the power users and geeks often go hand in hand and usually have a great deal of respect for one another’s skillsets. The geek is the one that does the heavy lifting when it comes to technology—whether it’s programming, infrastructure, database design, etc.—and is the first person that the power user calls when he/she needs help.
More often than not, a number of people fit the various bills that I outlined, and that’s a very good thing because big data projects (large or small) are always more successful when a majority of the team comes from within. After all, nobody knows your company dynamics, strengths, and weaknesses better than your employees. Technical skills such as programming and statistical techniques can be taught; knowledge of your business and a drive to find out what is happening are much harder to teach.
Now, I am going to digress just a bit here because I am sure that some of our regular readers in the big data space may think that I am just plain crazy (or worse, uninformed) to think that there are a number of people within a company capable of managing and participating in big data initiatives. Two to two-and-a-half years ago, I might have agreed with you (though I still would have argued the point). Back when the Hadoop ecosystem was the only game in town, big data was not an industry, it was a technology. As those of us who saw the rise of business intelligence or CORBA and Java technologies (been there) know, a new technology “begins hard” and often requires intensive development efforts that armies of programmers and service teams support and sustain. But as the market matures, tools and applications replace raw technology, which empowers all kinds of users to participate and encourages widespread adoption.
This is how I see it: Big data 1.0 was all about the technology and how some Fortune 100 companies were using it to derive significant business value. Big data 2.0 is all about tools and applications that companies of all sizes can adopt and use to derive significant business value. By the way, this is why we call our analytics applications data-science-in-a-box. Our goal is, and always has been, to ensure that all types of big data users (from the data scientist to the quants, analysts, and general business users) have the tools they need to perform their respective jobs. And like our other 2.0 brethren, we also ensure that all the big data bells and whistles (like streaming/batch analytics, high performance mashups, complex analytic computation, and multiple deployment options) are there too. Big data 2.0 is the reason why you can (and should) look within to fill specific roles.
Don’t get me wrong: there are probably some roles that you will need to fill from the outside. But start by seeding the team internally and then, with that team, identify which weaknesses you’ll need to address and fill externally. That way you have a group ideally positioned to take on any big data initiative that comes their way!
Now, you may have noticed that this is part 1 of a 3-part post. Coming up:
- Part 2: My colleague Marilyn’s take on analytics, data science, and your business: When is an insight really an insight?
- Part 3: My take on the big data science team (data jujitsu anyone?), the Chief Data Officer, and the roles of IT and the privacy function. I suspect that however I lay this one out, there will be vehement agreement and disagreement, and I look forward to reading all about it in the comments!