Big Data Project: Objectives First, Plan Second (Part 3)
A top-level view of our data project over a series of posts.
By Mary Ludloff
Welcome to the third post in our series on a big data project. Our goal is to walk you all the way through a big data project from its inception through its completion (or depending on the project, through deployment and maintenance). Those of you familiar with our series know that we include our Big Data Playbook rules as we address specific topics—we may repeat some as we go along but if you need to refresh your memory on where we are, go to Part 1 and Part 2.
You now know that we are working with the University of Sydney on a project that looks at the impact social media comments have on a company’s stock and whether this mediates the influence of primary news. Specifically: Is a company’s stock price influenced by both and can we isolate and study the impact of those distinct sources on that stock price?
We then went into discovery mode (Rule #4: Ask questions about the question) until we had a thorough understanding of the question (see Table 1). And now it’s time to make a plan—well actually, it’s time to make a list of objectives which will serve as the launching pad for our plan. This, of course, leads us to our next rule:
Rule #5: Objectives first, plan second.
Why objectives first? Well, there’s a natural tendency to drop into the planning phase before you’ve thought out what you’re trying to do in what I like to call “big animal pictures.” I think of this as descending into the weeds before you’ve got a general idea of what you need to do. Projects, big data or otherwise, generally begin with determining your objectives and then breaking down the resources and tasks needed to complete the project.
Yes, this is a simplistic view of a project as each area can be pushed down and out into all layers of complexity. And that’s my point. At the start of the planning phase of a big data project you don’t want to push down; that will come later and will become an iterative process as you gain information. Rather, you want to sketch out your objectives based on your understanding of the question and then figure out the resources and tasks needed (our original table that provided context to the question is included here for reference). Once you have that, you can then address data, technology, and partner requirements as well as identify gaps in all those areas that will need to be filled.
Since we did a deep-dive into exactly what the original question meant, it’s now time to figure out our objectives. After talking through what social media channels we wanted to focus on with Dr. Briley, we decided to analyze the impact of tweets on a company’s stock price. There’s been a great deal of research on how the Twitter mood can impact the stock market in general, but few projects have taken on the task of looking at a specific company—this seemed like a good area for us to focus on.
We also had to analyze the impact of primary media sources on price and isolate that impact from follow-on tweets. First, we had to select a news source. In this case, Reuters seemed like the natural fit as they are the market leader in this area. Additionally, Reuters could also provide sentiment analysis along with other data that would help us to measure sentiment and influence as well as “study the impact” (see Table 1).
Now we moved on to the “heart” of the question: how can we determine and isolate the propagation mode of “company news” from the reporting of financial news in Reuters to tweets about that information? Naturally, we also wanted to explore the different aspects of a tweet that might make it more or less influential. There are a number of tools available that measure some aspect of social authority but for this project we focused on the following:
- The volume, velocity, and acceleration of tweets generated after a news article reports financial information.
- The social authority (or influence) of the twitterer as indicated by his or her Klout score and number of followers.
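To make the first of those measures a little more concrete, here is a minimal sketch of how volume, velocity, and acceleration might be computed from tweet timestamps. It is an illustration only, not the method used in the project: the function name `tweet_dynamics`, the fixed time buckets, and the definitions themselves (volume as tweets per bucket, velocity as tweets per minute, acceleration as the change in velocity from one bucket to the next) are all assumptions made for this example.

```python
from datetime import datetime, timedelta


def tweet_dynamics(article_time, tweet_times, bucket_minutes=10, n_buckets=6):
    """Hypothetical helper: summarize tweet activity after an article.

    Returns three lists:
      volume       - number of tweets in each time bucket after the article
      velocity     - tweets per minute in each bucket
      acceleration - change in velocity between consecutive buckets
    """
    bucket = timedelta(minutes=bucket_minutes)
    window = bucket * n_buckets
    volume = [0] * n_buckets

    for t in tweet_times:
        offset = t - article_time
        # Ignore tweets posted before the article or after the window closes.
        if timedelta(0) <= offset < window:
            volume[offset // bucket] += 1  # timedelta // timedelta gives an int

    velocity = [count / bucket_minutes for count in volume]
    acceleration = [later - earlier for earlier, later in zip(velocity, velocity[1:])]
    return volume, velocity, acceleration


if __name__ == "__main__":
    # Made-up timestamps, purely for illustration.
    published = datetime(2013, 1, 7, 9, 0)
    tweets = [published + timedelta(seconds=s) for s in (20, 45, 130, 400, 950)]
    vol, vel, acc = tweet_dynamics(published, tweets, bucket_minutes=5, n_buckets=4)
    print("volume:", vol)
    print("velocity:", vel)
    print("acceleration:", acc)
```

The bucket size matters here: short buckets catch the initial burst after an article but produce noisy velocity figures, while long buckets smooth the noise and blur the timing. In practice you would try several sizes before settling on one.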
Finally, once we had all this data, how would we determine algorithmically the impact that the two sources, together and individually, had on a company’s stock price?
Based on what we just covered, here are our four objectives:
Okay, now that we have our objectives, our next post will do a deep dive into the data we need to crunch. By definition, every big data project involves data (big or small and keep in mind that size is just one of the V’s to consider). It goes without saying (but we will) that a majority of the resources we’ll need will be data—of course platform technology (lots of issues to suss out here), partners (how we might leverage our channel ecosystem for institutional knowledge, technology, etc.), and people (it’s time to figure out the skillsets we’re going to need) will also play major roles. But first, we’ll be talking about the data—what we need and where we are getting it. Oh and you’ll also get introduced to some of our partners who are providing all that lovely data!