Introducing AnalyticsPBI for Azure—A Cloud-Centric, Components-Based, Streaming Analytics Product
It has been a while since I’ve done posts that focus on our technology (and big data tech in general). We are now about two months out from the launch of the Azure version of AnalyticsPBI.
But before I start exercising my inner geek, it probably makes sense to take a look at the development philosophy and history that forms the basis of our upcoming release. Historically, we delivered our products in one of two ways:
- As a framework which morphed (as of release 2.0) into AnalyticsPBI, our general analytics application designed for business users, quants, and analysts across industries.
- As vertical applications (customized on top of AnalyticsPBI) for specific industries (like FinancePBI and our original Retail Analytics application) which we sold directly to companies in those industries.
Why did we build our own framework to support the development of analytics applications instead of using something like Hadoop? A couple of reasons:
- First, we were building streaming applications and there weren’t any scalable, usable, software-only solutions available. Almost every system we could find was batch-oriented and clearly not built with the cloud in mind. Additionally, even the unusable ones were geared to specific domains like finance or petroleum distribution. (And no, I don’t think that attempts, like Cloudera’s Impala, to graft streaming onto the intrinsically batch-focused Hadoop change our original assessment.)
- Second, as the CEO of one of the leading Hadoop vendors illustrated by offering to pay ISVs to develop on Hadoop, current big data platforms may work fine for one-off projects but are a poor choice for enterprise application development.
Don’t misunderstand me: I have nothing but respect for the Hadoop committers. Many of them are undoubtedly some of the best systems programmers on the planet. However, folks of that caliber spend most of their time solving problems like helping Yahoo build a better search engine (the genesis of Hadoop). They are focused on supporting the Fortune 10, where resources (time, money, sophisticated skillsets, and administrative personnel) are not an issue. This means that developers working on software like Hadoop don’t have much experience with, or place much importance on, building tools that fit the skillsets and budgets of most IT organizations. And those IT organizations don’t have the resources or patience for “solutions” that require significant development, deployment, and maintenance efforts. What they do need are out-of-the-box big data applications.
This is a story that occurs again and again in technology. For example, a lot of people have forgotten that IBM invented the relational database (RDB). For those of us who were building applications back then (yes, I am that old), the reason folks think that Oracle “equals” relational database is pretty simple. IBM was focused on developing a database that correctly implemented the relational algebra that formed the theoretical underpinnings of the relational database, to which the market said, “Who cares?” In contrast, Oracle focused on creating relational database products and applications that their customers could easily build stuff with (for example, if the customer wants record pointers, the customer gets record pointers, purity be damned). This practical attitude allowed Oracle to dominate the multi-billion-dollar RDB market and gave Larry Ellison the means to buy an entire island in the Hawaiian chain.
The sole purpose of big data platforms and big data analytics is to enable applications to use data in ways and at a scale that were impossible a few years ago. But this can only happen if we build big data tools that are accessible to the analyst or person in IT who just needs to get stuff done and couldn’t care less about the indexing strategy or the efficiency of a bloom filter implementation. What they care about is simple: how easily they can get the answers they need to improve the business, and how nimbly they can react to new data that could help. And they need to do it with a hardware budget that doesn’t rival NASA’s. This (and a bit of island envy) is why PatternBuilders was founded. Sure, we can hold our own as system developers, but our focus is, and always will be, on developing streaming analytics applications that are not only easy to use but easy to customize into great vertical applications (like FinancePBI).
How many big data platforms can do this? Until now, none. This is why we aren’t seeing mainstream adoption of big data outside of the largest and most sophisticated governments and companies. This lack of adoption is a significant issue for all of us in the big data space. We want to see the technology have the dramatic impact on businesses, governments, and societies that drew us to the big data industry in the first place. But developing and deploying batch-focused big data analytics applications has been ridiculously hard, requiring too much hardware and too many highly paid technology consultants. On top of that, we are quickly moving into the world of the Industrial Internet and the consumer Internet of Things, all of which require streaming analytics. Since current big data solutions were not designed for streaming data analytics, developing a real-time analytics application becomes nearly impossible (as opposed to just ridiculously hard!).
This release is the culmination of what we learned from building, using, and administering streaming applications across a variety of industries. AnalyticsPBI for Azure is a powerful, full featured, easy to use streaming analytics application that offers:
- High performance real-time visualizations delivered via a browser.
- A cloud-friendly, multi-threaded, multi-machine model calculation engine that is scriptable in any .NET programming language and supports the plethora of statistical libraries available on Windows-based operating systems.
- A query engine that 1) allows you to query your models and receive real-time updates when new data arrives and 2) supports full text search and geo-queries.
- Secure hybrid deployment on public and private clouds.
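To give a flavor of what “query your models and receive real-time updates” means in practice, here is a tiny, generic Python sketch of that query-plus-push pattern. The class and method names here are illustrative assumptions for the sketch, not AnalyticsPBI’s actual API:

```python
from collections import defaultdict

class LiveQueryEngine:
    """Toy sketch of a query engine that also pushes updates.

    A subscriber gets the current result immediately, then is
    called again whenever new data for that metric arrives.
    (Illustrative only; not the AnalyticsPBI API.)
    """

    def __init__(self):
        self._data = defaultdict(list)         # metric name -> values
        self._subscribers = defaultdict(list)  # metric name -> callbacks

    def subscribe(self, metric, callback):
        # Deliver the current result right away...
        callback(list(self._data[metric]))
        # ...then push every subsequent update.
        self._subscribers[metric].append(callback)

    def ingest(self, metric, value):
        self._data[metric].append(value)
        for callback in self._subscribers[metric]:
            callback(list(self._data[metric]))

engine = LiveQueryEngine()
seen = []
engine.subscribe("latency_ms", seen.append)  # initial (empty) result
engine.ingest("latency_ms", 12)              # pushed update
engine.ingest("latency_ms", 15)              # pushed update
```

The point of the pattern is that the consumer never polls: the engine owns the subscription list and pushes recalculated results the moment new data lands.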
The goal of this release (and every product that we build) is to give analytics professionals an alternative to the size limitations and proprietary languages of solutions like SAS, R, and MATLAB, as well as to address (and significantly reduce) the high administrative overhead and batch-centric natures of the various Hadoop flavors and other big data 1.0 technologies. AnalyticsPBI for Azure puts the focus squarely on using big data to deliver answers and insights, which we should all remember is the reason for all the big data hype in the first place. Our design goals for this release (which will form the basis of upcoming technology posts) were to:
- Leverage our partnership with Microsoft Azure, so that our customers could reap the benefits of Azure’s huge compute infrastructure when they needed it. At the same time, ensure that we continue to support on-premise and hybrid cloud/on-premise installations.
- Continue to push our fan-out capabilities and ability to scale transparently by moving from our homegrown sharding infrastructure to the much-improved tag-aware sharding and locking strategy that our database vendor, 10gen, introduced in MongoDB 2.2. As an aside, Mongo’s domination of the NoSQL database market is a great example of a company that is focused on application developers’ needs, not theoretical orthodoxy.
- Evolve into using a components-based architecture and providing RESTful APIs to make it easy for developers (ours and 3rd parties) to integrate with, customize, and even replace parts of the system to make application development easier.
- Make it trivial to add data streams to the system and perform mashups of large data streams.
- Use a push delivery model wherever practical (in this case, SignalR was a godsend).
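To make the “mashups of large data streams” goal concrete, here is a tiny, generic Python sketch that merges two time-stamped streams into a single time-ordered stream a downstream model could consume. The stream names and tuple layout are assumptions for the sketch, not how AnalyticsPBI actually implements mashups:

```python
import heapq

# Two "streams" of (timestamp, event) pairs -- plain lists here,
# standing in for live feeds that each arrive in time order.
sales  = [(1, "sale:100"), (4, "sale:250"), (6, "sale:80")]
clicks = [(2, "click:home"), (3, "click:cart"), (5, "click:buy")]

# A mashup merges them into one time-ordered stream; heapq.merge
# does this lazily, without buffering either stream in full.
merged = list(heapq.merge(sales, clicks))
```

Because the merge is lazy, the same idea scales from two toy lists to long-running feeds: events flow through in timestamp order as they arrive.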
In follow-ups to this post, I am going to provide a detailed description of how we built AnalyticsPBI for Azure, highlighting some of the choices and trade-offs we made in our own code as well as some of the third-party components we used. I am also going to show some code examples to demonstrate how easy it is to work with the system, including our version of the canonical Hadoop word count example—ours requires less (much less) code and delivers real-time results “automagically” without programmer intervention. I’ll also demonstrate how we “ate our own dog food” and are now using our analytics system to capture performance analytics for a PatternBuilders instance in real time. Eventually, we want to turn this facility into a cloud service for our customers to monitor their systems, similar to how MongoDB creator 10gen does with its MMS service.
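Ahead of those posts, the core streaming idea behind that word count example is easy to sketch generically: instead of scheduling a batch job that re-reads the whole corpus, counts are updated incrementally as each event arrives, so the result is always current. A minimal Python sketch of that incremental-update idea (this is not our actual AnalyticsPBI code):

```python
from collections import Counter

# Incremental word count: counts are updated per event, so the
# "result" is always current -- no batch job to schedule or re-run.
counts = Counter()

def on_text_event(text):
    """Called once per incoming document/message in the stream."""
    counts.update(text.lower().split())
    return counts  # always reflects everything seen so far

on_text_event("big data is big")
on_text_event("streaming data wins")
top_two = counts.most_common(2)  # the two most frequent words so far
```

Contrast this with the batch version, where getting fresh counts means re-running the whole job over all the data accumulated so far.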
Looking forward to our continuing conversation—your comments and feedback are always welcome!