Greetings one and all! 2012 was a breakout year for PatternBuilders and we are very grateful to all of you for helping to make that happen. But we would also like to take a minute to extend our condolences and share the grief of parents across the world who lost young children to violence. Newtown was singularly horrific, but similar events play out all too often across the globe. We live in an age of technical wonders—surely we can find ways to protect the world’s children.
This is our last post of 2012 and in the spirit of the season, we decided to do something a little different this year. Recently, the Wall Street Journal asked 20 of its “friends” to tell them what books they enjoyed in 2012 and the responses were as eclectic as they were interesting. Not to be outdone, Adam Thierer published his list of cyberlaw and info-tech policy books for 2012. Many of the recommendations culled from both sources ended up on our reading lists for 2013 (folks, 2012 is almost over and between launching AnalyticsPBI for Azure and working on our update for Privacy and Big Data, not a lot of “other” reading is going to happen during the holiday season!) and spurred an interesting discussion about our favorite reads of the year. One caveat: our lists may include books we read this year but that were not necessarily published this year. So without further ado, I give you our favorite reads of 2012!
For the second post on AnalyticsPBI for Azure (first one here), I thought I would give you some insight into what is required for a modern real-time analytics application and talk about the architecture and process used to bring data into AnalyticsPBI and create analytics from it. Then we will do a series of posts on retrieving data. This is a fairly technical post, so if your eyes start to glaze over, you have been warned.
In a world that is quickly moving towards the Internet of Things, the need for real-time analysis of high-velocity, high-volume data has never been more pronounced. Real-time analytics (aka streaming analytics) is all about performing analytic calculations on signals extracted from a data stream as they arrive—for example, a stock tick, RFID read, location ping, blood pressure measurement, clickstream data from a game, etc. The one guaranteed component of any signal is time (the time it was measured and/or the time it was delivered). So any real-time analytics package must make time and time aggregations first-class citizens in its architecture. This time-centric approach provides a huge number of opportunities for performance optimizations. It amazes me that people still try to build real-time analytics products without taking advantage of them.
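To make the time-centric idea concrete, here is a minimal sketch in Python (emphatically not AnalyticsPBI’s actual code) of time-bucketed incremental aggregation: each arriving signal is routed to a time bucket keyed by its timestamp, and per-bucket statistics are updated in constant time, so an aggregate query never has to rescan the raw stream. The `Signal` shape, the field names, and the `bucket_seconds` parameter are all illustrative assumptions.

```python
# A minimal sketch of treating time as a first-class citizen in streaming
# aggregation. All names here are hypothetical, for illustration only.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Signal:
    name: str         # e.g. a stock tick or sensor reading stream
    timestamp: float  # seconds since the epoch: the one guaranteed field
    value: float

class TimeBucketAggregator:
    def __init__(self, bucket_seconds: int = 60):
        self.bucket_seconds = bucket_seconds
        # (signal name, bucket start time) -> running statistics
        self.stats = defaultdict(lambda: {"count": 0, "sum": 0.0,
                                          "min": float("inf"),
                                          "max": float("-inf")})

    def ingest(self, signal: Signal) -> None:
        """Update the signal's time bucket in O(1) as it arrives;
        no batch rescan of historical data is ever needed."""
        bucket = int(signal.timestamp // self.bucket_seconds) * self.bucket_seconds
        s = self.stats[(signal.name, bucket)]
        s["count"] += 1
        s["sum"] += signal.value
        s["min"] = min(s["min"], signal.value)
        s["max"] = max(s["max"], signal.value)

    def mean(self, name: str, bucket_start: int) -> float:
        """Answer an aggregate query from the precomputed bucket."""
        s = self.stats[(name, bucket_start)]
        return s["sum"] / s["count"] if s["count"] else float("nan")

# Illustrative usage: two ticks land in the same one-minute bucket.
agg = TimeBucketAggregator(bucket_seconds=60)
agg.ingest(Signal("AAPL.last_price", timestamp=1356000005.0, value=519.30))
agg.ingest(Signal("AAPL.last_price", timestamp=1356000041.0, value=519.95))
print(agg.mean("AAPL.last_price", bucket_start=1356000000))  # 519.625
```

Because every aggregate is indexed by signal and time bucket, rolling minutes up into hours or days is just a merge of already-computed buckets, which is exactly the kind of optimization a time-centric architecture makes cheap.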
Until AnalyticsPBI, real-time analytics were only available if you built a huge infrastructure yourself (for example, Wal-Mart) or purchased a very expensive solution from a hardware-centric vendor (whose primary focus was serving the needs of the financial services industry). The reason that the current poster children for big data (in terms of marketing spend at least), the Hadoop vendors, are “just” starting their first forays into adding support for streaming data (see Cloudera’s Impala, for example) is that calculating analytics in real-time is very difficult to do. Period.
It has been a while since I’ve done posts that focus on our technology (and big data tech in general). We are now about two months out from the launch of the Azure version of AnalyticsPBI.
But before I start exercising my inner geek, it probably makes sense to take a look at the development philosophy and history that forms the basis of our upcoming release. Historically, we delivered our products in one of two ways:
- As a framework which morphed (as of release 2.0) into AnalyticsPBI, our general analytics application designed for business users, quants, and analysts across industries.
- As vertical applications (customized on top of AnalyticsPBI) for specific industries (like FinancePBI and our original Retail Analytics application) which we sold directly to companies in those industries.
Today, I got the sad news that a dear friend and an early contributor to PatternBuilders passed away.
Andrew (Andrei) Leman was a gruff, kind and generous man who will be deeply missed. Andrei was also a very talented mathematician and software engineer who created some of the fundamental theories around the mathematics of graphs. His papers on that subject are still heavily cited.
More importantly, Andrei was a loving husband to his wife Elena and a great friend and mentor to many, many folks.
He will be missed but his work and the respect and affection he engendered will endure.
May the earth rest lightly on you, my friend.
A week ago, I was in New York City for Strata’s Big Data Conference. The weather was sunny and mild and as I walked around the City I was reminded of just how vibrant it is and told my husband later that evening that we have to visit it more often. After the conference, I headed home and then watched with disbelief as this wonderful city, surrounding areas, and many more states were engulfed by Hurricane Sandy. I was saddened by the destruction and loss of life, but today am reminded of the resilience of its inhabitants as the cleanup and rebuilding begins. For those of you interested in helping, I point you to ABC News’ story and the Wall Street Journal’s article on ways to help the storm victims. Or you can go to the Red Cross home page for information on how to make a financial donation or give blood. To all of you on the East Coast impacted by Hurricane Sandy: Our hearts go out to you and you are in our prayers.
I had to miss Strata due to a family emergency. Mary picked up the slack for me at our privacy session (and by all reports did her usual outstanding job), but I also had to cancel a Tuesday night Strata session sponsored by 10Gen on how PatternBuilders has used Mongo and Azure to create a next-generation big data analytics system. The good news is that I should have some time to catch up on my writing this week, so look for a version of what would have been my 10Gen talk shortly. In the meantime, to get me back in the groove, here is a very short post inspired by a Forbes post written by Dan Everett of SAP on “Hadoopla.”
As the CEO of a real-time big data analytics company that occasionally competes with parts of the Hadoop ecosystem, I may have some biases (you think?). But I certainly agree that there is too much Hadoopla (a great term). If our goal as an industry is to move Big Data out of the lab and into mainstream use by anyone other than the companies that thrive on, and have the staff to support, high-maintenance and very-high-skill technologies, Hadoop is not the answer – it has too many moving parts and is simply too complex.
To quote from a blog post I wrote a year ago:
“Hadoop is a nifty technology that offers one of the best distributed batch processing frameworks available, although there are other very good ones that don’t get nearly as much press, including Condor and Globus. All of these systems fit broadly into the High Performance, Parallel, or Grid computing categories and all have been or are currently used to perform analytics on large data sets (as well as other types of problems that can benefit from bringing the power of multiple computers to bear on a problem). The SETI project is probably the most well known (and IMHO, the coolest) application of these technologies outside of that little company in Mountain View indexing the Internet. But just because a system can be used for analytics doesn’t make it an analytics system…”
Why is the industry so focused on Hadoop? Given the huge amount of venture capital that has been poured into various members of the Hadoop ecosystem and that ecosystem’s failure to find a breakout business model that isn’t hampered by Hadoop’s intrinsic complexity, there is ample incentive for a lot of very savvy folks to attempt to market around these limitations. But no amount of marketing can change the fact that Hadoop is a tool for companies with elite programmers and top-of-the-line computing infrastructures. And in that niche, it excels. But it was not designed for, and in my opinion will never see, broad adoption outside of that niche despite the seemingly endless growth of Hadoopla.
Let me tell you a little secret: I always know when I am talking (and working) with a company that has successfully launched big data initiatives. There are three characteristics that these companies share:
- A C-level executive runs the “[big] data operations.”
- The Chief Data Officer (even if they are the CIO) has a heavy business/operations background.
- The data team is focused on the “business,” not the data.
Did you notice that technology and data science are not reflected in any of the characteristics? Some of you may consider this sacrilege—after all, we are operating in a world where technology (and I happily work for one of those companies) has changed the data collection, usage, and analysis game. Colleges and universities are now offering master’s degrees in analytics. The role of the data scientist has been pretty much deified (I refer you to Part 1 of this series). And we all need to be very worried about the “talent shortage” and our ability to recruit the “right analytical team” (I refer you to Part 2 of this series).
Yes—technology has had a tremendous impact on how much data we can collect and the ways in which we can analyze it, but not everyone needs to be a senior computer programmer. Yes—we all should strive to be more mathematically inclined, but not all of us need master’s degrees or PhDs in statistics or analytics. Yes—some companies, based on their business models, may need a staff of data scientists, but others may get along just fine without one (with the occasional analytics consultant lending a hand).