AnalyticsPBI for Azure: Turning Real-Time Signals into Real-Time Analytics

December 12, 2012

By Terence Craig

For the second post on AnalyticsPBI for Azure (first one here), I thought I would give you some insight into what is required for a modern real-time analytics application and talk about the architecture and process used to bring data into AnalyticsPBI and create analytics from it. Then we will do a series of posts on retrieving data. This is a fairly technical post, so if your eyes start to glaze over, you have been warned.

In a world that is quickly moving towards the Internet of Things, the need for real-time analysis of high-velocity, high-volume data has never been more pronounced. Real-time analytics (aka streaming analytics) is all about performing analytic calculations on signals extracted from a data stream as they arrive—for example, a stock tick, RFID read, location ping, blood pressure measurement, clickstream data from a game, etc. The one guaranteed component of any signal is time (the time it was measured and/or the time it was delivered). So any real-time analytics package must make time and time aggregations first-class citizens in its architecture. This time-centric approach provides a huge number of opportunities for performance optimizations. It amazes me that people still try to build real-time analytics products without taking advantage of them.
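
To make the time-centric point concrete, here is a minimal sketch in Python (the names and structure are mine, not AnalyticsPBI internals): if every signal carries a timestamp, you can pre-aggregate into fixed time buckets as signals arrive, so a query over any window becomes a cheap scan of bucket summaries instead of a re-scan of raw signals.

from collections import defaultdict
from datetime import datetime

# Hypothetical illustration: pre-aggregate signals into per-minute buckets
# as they arrive, so window queries read summaries instead of raw events.
buckets = defaultdict(lambda: {"count": 0, "sum": 0.0})

def on_signal(timestamp: datetime, value: float) -> None:
    # Truncate the timestamp to its minute to get the bucket key.
    key = timestamp.replace(second=0, microsecond=0)
    b = buckets[key]
    b["count"] += 1
    b["sum"] += value

def average_over(start: datetime, end: datetime) -> float:
    # Answer a window query from the pre-computed buckets.
    hits = [b for k, b in buckets.items() if start <= k < end]
    total = sum(b["sum"] for b in hits)
    count = sum(b["count"] for b in hits)
    return total / count if count else 0.0

A production system would keep several granularities (minute, hour, day) so that wide windows stay cheap as well.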

Until AnalyticsPBI, real-time analytics were only available if you built a huge infrastructure yourself (as Wal-Mart did, for example) or purchased a very expensive solution from a hardware-centric vendor (whose primary focus was serving the needs of the financial services industry). The reason that the current poster children for big data (in terms of marketing spend, at least), the Hadoop vendors, are “just” starting their first forays into adding support for streaming data (see Cloudera’s Impala, for example) is that calculating analytics in real time is very difficult to do. Period.

IMHO, the Hadoop vendors’ entry into the streaming data market is unfortunate. Anyone who has spent time working with streaming analytics will tell you that it is easy to make a streaming architecture do the batch jobs that Hadoop was designed for (and where it excels). The inverse, however, using batch-centric architectures to support high-velocity streams, is an amazingly painful, inefficient, and hacky process that may make sense from a sales perspective but can’t be justified on technical merits. Why? Streaming analytics present a problem set that is NOT well suited to the batch-oriented programming and database environments of the last 50 years. From databases to programming techniques, streaming data problems require completely different approaches, not bolt-ons to existing technology. In fact, the high velocity and constant change that are the hallmarks of streaming make it the poster child for the eventual-consistency approaches pioneered by the NoSQL movement and for on-demand cloud computing platforms like our platform of choice, Microsoft Azure.

In the big data age, a streaming analytics system needs to provide the speed of the fastest OLTP engine, the analytical power of SAS or MATLAB, the multi-machine processing capabilities of HPC clusters, and the rich query capabilities of a full-text search engine. And lest we forget, given the amount of data that companies typically want to process these days, it must also support idempotent fan-out scalability across all major components, including its:

  • Analytics Engine
  • Message Passing Infrastructure
  • Compute Infrastructure
  • Database
  • Search Engine

Besides fan-out scalability, the system must support both of the common enterprise deployment models, on-premise and cloud, as well as the new hybrid architectural options that are being pushed hard (and rightfully so, IMHO) by Microsoft, with Azure and Windows Server 2012, and by other vendors.
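
As a hedged illustration of what "idempotent" buys you in a fan-out design (all names here are hypothetical): if every event carries a stable id, any worker on any machine can safely reprocess a duplicate delivery without double-counting.

# Hypothetical sketch: idempotent event handling keyed by event id.
# Any worker can process (or re-process) a delivery; duplicates are no-ops,
# which is what makes naive fan-out across machines safe.
seen_ids: set[str] = set()
totals: dict[str, float] = {}

def process(event_id: str, symbol: str, bid: float) -> None:
    if event_id in seen_ids:   # duplicate delivery from a retry or re-fan-out
        return                 # idempotent: applying twice changes nothing
    seen_ids.add(event_id)
    totals[symbol] = totals.get(symbol, 0.0) + bid

In a real deployment the seen-id set would live in shared, durable storage rather than process memory; the point is only that replayed deliveries are harmless.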

First, let’s define what a signal is. A signal is an external event that analysis will be performed on: a stock tick, RFID read, PBI performance event, tweet, etc. AnalyticsPBI can support analysis on any type of signal as long as it has one or more named numeric fields and a date. For example, take a stock tick signal with the following information: 12/1/12 3:13:00 PM, AAPL, Bid $600, Volume 10,000. It would become the following DataEvent (a quick validation sketch follows the JSON):

{
   "EventTime":"3:13 PM - 1 Dec 2012",
   "EventType":"TickEvent",
   "Tags":[
      {
         "Symbol":"AAPL"
      },
      {
         "CompanyName":"Apple, Inc."
      },
      {
         "CompanyLocation":{
            "Lat":37.331789,
            "Lon":-122.029620
         }
      },
      {
         "Feed":"ActiveFinancial"
      },
      {
         "index":"Nasdaq"
      },
      {
         "industry":"technology"
      }
   ],
   "Values":[
      {
         "Bid":600
      },
      {
         "Volume":10000
      }
   ]
}
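
The "one or more named numeric fields and a date" rule is mechanical to check. Here is a minimal validation sketch in Python, assuming the JSON shape above (the helper name is mine, not part of the product):

# Hypothetical sketch: check that a signal qualifies as a DataEvent source,
# i.e. it has an EventTime and at least one named numeric value.
def is_valid_signal(event: dict) -> bool:
    if not event.get("EventTime"):
        return False
    values = event.get("Values", [])
    return any(
        isinstance(v, (int, float))
        for item in values
        for v in item.values()
    )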

In essence, a SignalReader performs the following steps:

  1. Reads the signal (in this case, a stock tick).
  2. Stores pertinent information from the signal itself or from other sources (for TickEvents, an additional source might be EDGAR for officer and exchange information), along with the numeric data contained within the signal.

For ticker events, the above transformation is done by the Ticker SignalReader. A SignalReader is a server written by PatternBuilders, customers, or third parties, and it is responsible for the following (a toy sketch follows the list):

  • Captures a signal – stock tick, RFID read, PBI performance event, tweet, etc.
  • If that signal does not have a numeric field, it creates appropriate numeric data and attaches it to the signal (see the sentiment example below).
  • Uses the signal itself and any appropriate external data to attach metadata to the signal. This metadata is stored in the form of strings that we refer to as Tags. Tags are used to create indexes for analytic calculations and will be discussed in the next post.
  • Converts that signal to the standard PatternBuilders data structure known as a DataEvent.
  • Attaches the Tags to the DataEvent.
  • Submits the new DataEvent via a REST call to the event processing server.
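
Putting those responsibilities together, a toy Ticker SignalReader might look like the sketch below. This is an illustration only: the field names mirror the DataEvent above, but the endpoint URL and helper names are assumptions, not the real PatternBuilders API.

import json
import urllib.request
from datetime import datetime

# Hypothetical sketch of a ticker SignalReader: capture a tick, attach Tags,
# wrap it as a DataEvent, and submit it via REST to the event processor.
EVENT_SERVER = "https://example.invalid/api/dataevents"  # assumed endpoint

def read_tick(symbol: str, bid: float, volume: int) -> dict:
    return {
        "EventTime": datetime.utcnow().isoformat(),  # timestamp format is illustrative
        "EventType": "TickEvent",
        "Tags": [
            {"Symbol": symbol},
            {"Feed": "ActiveFinancial"},
        ],
        "Values": [
            {"Bid": bid},
            {"Volume": volume},
        ],
    }

def submit(data_event: dict) -> None:
    body = json.dumps(data_event).encode("utf-8")
    req = urllib.request.Request(
        EVENT_SERVER, data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # POST the DataEvent

submit(read_tick("AAPL", 600.0, 10_000))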

For example, let’s look at the following tweet, courtesy of the SignalReader that we are using in our joint research project with the University of Sydney (more information coming soon or you can look at a description of the project in Wall Street and Technology):

[Image: tweet from StockTwits: "Canada approves Chinas biggest ever foreign takeover http://stks.co/aG0S via BW $CEO $NXY.CA $NXY"]

Our University of Sydney SignalReader would turn this tweet into a data event resembling the following:

{
   "EventTime":"2:13 PM - 9 Dec 12",
   "EventType":"SymbolTweet",
   "Tags":[
      {
         "Symbol":"CEO"
      },
      {
         "Symbol":"NXY.CA"
      },
      {
         "Symbol":"NXY"
      },
      {
         "Industry":"Energy"
      },
      {
         "Hyperlink":"http:\/\/stks.co\/aG0S"
      },
      {
         "TweetLocation":{
            "Lat":40.65,
            "Lon":73.78
         }
      },
      {
         "author":"StockTwits"
      },
      {
         "tweetid":277898756815466497
      },
      {
         "text":"Canada approves Chinas biggest ever foreign takeover http:\/\/stks.co\/aG0S via BW $CEO $NXY.CA $NXY"
      }
   ],
   "Values":[
      {
         "Sentiment":5
      },
      {
         "AuthorKlout":90
      }
   ]
}

In essence, the SignalReader performs the following steps (a sketch of the parsing step follows the list):

  1. Reads the tweet (this could be from the Twitter firehose, a curated tweet, or a Twitter archive).
  2. Stores the author, tweet id, and the location of the tweet.
  3. Looks up the Klout score of the author.
  4. Parses out the company symbols so they can be applied as Tags.
  5. Looks up metadata about the companies associated with the symbols to determine their industry and applies that as a Tag.
  6. Parses out the hyperlink and sends it to a sentiment engine that will scan it and return a sentiment value.
  7. Bundles all of this data up into a DataEvent for processing.
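
Steps 4 and 6 do the interesting work. Here is a rough sketch of the cashtag parsing in step 4 (the regex and helper are mine; the real reader may parse differently):

import re

# Hypothetical sketch of step 4: pull $SYMBOL cashtags out of tweet text
# so each one can be attached to the DataEvent as a Symbol tag.
CASHTAG = re.compile(r"\$([A-Z]+(?:\.[A-Z]+)?)")

def parse_symbols(tweet_text: str) -> list[dict]:
    return [{"Symbol": s} for s in CASHTAG.findall(tweet_text)]

text = ("Canada approves Chinas biggest ever foreign takeover "
        "http://stks.co/aG0S via BW $CEO $NXY.CA $NXY")
print(parse_symbols(text))
# [{'Symbol': 'CEO'}, {'Symbol': 'NXY.CA'}, {'Symbol': 'NXY'}]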

The DataEvents created by the SignalReader are used to create various simple analytics, such as the average number of negative tweets about $NXY on a particular day, the average sentiment of tweets from StockTwits, etc. In turn, these basic analytics (aka Measures) are used to build more complex analytics (aka Models), which can then be correlated with the real-time ticker prices captured by the StockTicker SignalReader. While SignalReaders are simple to code, they offer AnalyticsPBI users an amazing amount of power. They provide an easy way to introduce almost any data into the PatternBuilders system without worrying about any of the complexities involved in high-velocity, multi-machine processing of real-time data. SignalReaders are the primary route to verticalizing and customizing our baseline application (similar to what we did with FinancePBI). A vertical flavor of AnalyticsPBI is mostly defined by the SignalReaders it supports and the Models it ships with.
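
To make the Measure idea concrete, here is a sketch of "average sentiment per symbol per day" (names of my own choosing, not the product's internals): each incoming DataEvent folds into a running count and sum, so the measure stays current without re-scanning history.

from collections import defaultdict

# Hypothetical sketch of a simple Measure: running average sentiment
# per (symbol, day), updated incrementally as DataEvents arrive.
state = defaultdict(lambda: {"count": 0, "sum": 0.0})

def update_measure(data_event: dict) -> None:
    day = data_event["EventTime"][:10]  # assumes ISO-style dates
    sentiment = next(
        (v["Sentiment"] for v in data_event["Values"] if "Sentiment" in v), None
    )
    if sentiment is None:
        return
    for tag in data_event["Tags"]:
        if "Symbol" in tag:
            key = (tag["Symbol"], day)
            state[key]["count"] += 1
            state[key]["sum"] += sentiment

def average_sentiment(symbol: str, day: str) -> float:
    s = state[(symbol, day)]
    return s["sum"] / s["count"] if s["count"] else 0.0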

Our base AnalyticsPBI application includes a CSV SignalReader (every row of the CSV file is treated as a separate signal—which is what I meant in my last post when I said that batch processing can be supported by streaming architectures), a relational database reader, and a Twitter reader (without sentiment). FinancePBI comes with all of these, plus a ticker reader and the ability to add a sentiment engine to the Twitter reader. We are planning to release our bundled SignalReaders as open source so that others can use them to write additional SignalReaders.
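
The CSV case shows how batch falls out of a streaming design: each row simply becomes one DataEvent, as in this sketch (the column names are assumed for illustration).

import csv
from typing import Iterator

# Hypothetical sketch of the batch-as-streaming idea: read a CSV file and
# emit one DataEvent per row, exactly as if the rows had arrived live.
def csv_signals(path: str) -> Iterator[dict]:
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield {
                "EventTime": row["timestamp"],  # assumed column name
                "EventType": "CsvRowEvent",     # assumed type name
                "Tags": [{"Source": path}],
                "Values": [
                    {k: float(v)} for k, v in row.items()
                    if k != "timestamp" and _is_number(v)
                ],
            }

def _is_number(s: str) -> bool:
    try:
        float(s)
        return True
    except ValueError:
        return False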

My next post will continue focusing on the input and calculation side of AnalyticsPBI, showing how a DataEvent is aggregated with other DataEvents to create analytics (both Measures and Models), with a little aside on the difference between a combination and a permutation.
