Posts tagged ‘big data’

Events to Measures – Scalable Analytics Calculations using PatternBuilders in the Cloud

By Terence Craig

One part of the secret sauce that enables PatternBuilders to provide a more accessible and performant user experience for both creators and consumers of streaming analytics models is its infrastructure. Our infrastructure combines rich search capabilities with a diverse set of standard analytics that can be composed into more complex streaming analytics models. This post describes how we create those standard analytics, which we call Measures.

In my last post about our architecture, we delved into how we used custom SignalReaders as the point of entry for data into AnalyticsPBI. We've tightened up our nomenclature a bit since then, so it's worth reviewing some of our definitions:

Feed: An external source of data to be analyzed. These can include truly real-time feeds, such as stock tickers or the Twitter firehose, as well as batch feeds, such as CSV files converted to data streams.

Event: An external event within a Feed that analysis will be performed on, for example a stock tick, RFID read, PBI performance event, or tweet. AnalyticsPBI can support analysis on any type of event as long as it has one or more named numeric fields and a date. An Event can have multiple Signals.

Signal: A single numeric data element within an Event, tagged with the metadata that accompanied the Event plus any additional metadata (to use NSA parlance) applied by the FeedReader. For example, a stock tick would have Signals of Price and Volume, among others.

Tag: A string representing a piece of metadata about an Event. Tags are combined to form Indexes for both Events and Measures.

FeedReader (formerly SignalReader): A service written by PatternBuilders, customers, or third parties to read particular Feed(s), convert the metadata to Tags, and potentially add metadata from other sources to create Events. Simple examples include a CSV reader and a stock tick reader. A more complex example is the reader we created for the University of Sydney project, which filters the Twitter firehose for mentions of specific stock symbols and hyperlinks to major media articles and then creates an Event that includes a Signal derived from the sentiment scores of those linked articles. That reader was discussed here. A FeedReader's primary responsibility is to create and index an object that converts "raw data" received from one or more Feeds into an Event. To accomplish this, it does the following:

  1. Captures an Event from a feed – stock ticker, RFID channel, the Twitter firehose, etc.
  2. Uses the Event itself and any appropriate external data to enrich the Event with additional metadata and numeric data.
  3. Creates a MasterIndex from all of the metadata attached to the Event. This MasterIndex and the Date associated with the Event are used to create Measures and Models later on in the process. The FeedReader can also attach geo data if appropriate.
  4. Extracts the numeric Signals for that Event.
  5. Pushes the Event object onto a named queue – the "EventToBeCalculatedQueue" – for processing. This queue, like all PatternBuilders queues, has a pluggable implementation. It can be in memory (cheaper and faster) or persistent (more costly and slightly slower). One of the great advantages of the various cloud services, including our reference platform Azure, is the availability of scalable, fast, reliable, persistent queues.
Measure: A basic calculation that is generated automatically by the PatternBuilders calculation service and persisted. Measures are useful in and of themselves, but they are also used to dynamically generate results for more complex streaming Analytic Models.
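
The five FeedReader steps above can be sketched roughly as follows (a simplified Python illustration of the flow; all names are hypothetical and the production services are not implemented this way):

```python
from collections import deque
from datetime import datetime, timezone

event_to_be_calculated_queue = deque()  # stand-in for the pluggable Event queue

def read_tick(raw_tick, enrichment):
    """Sketch of a FeedReader turning one raw stock tick into an Event."""
    # Steps 1-2: capture the tick and enrich it with external metadata
    tags = [raw_tick["symbol"]] + enrichment.get(raw_tick["symbol"], [])
    event = {
        "Feed": "SampleStockTicker",
        "EventDate": datetime.now(timezone.utc).isoformat(),
        # Step 3: the MasterIndex is the canonical concatenation of all Tags
        "MasterIndex": ":".join(sorted(tags)),
        "Tags": tags,
        # Step 4: extract the numeric Signals
        "Signals": {"Price": raw_tick["price"], "Volume": raw_tick["volume"]},
    }
    # Step 5: push the Event onto the EventToBeCalculatedQueue
    event_to_be_calculated_queue.append(event)
    return event
```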

As the topic of this post is Events to Measures, let's create a simple Measure and follow it through the process. For this purpose, we'll work with a simplified StockFeedReader that creates a tick Event from a tick feed, with two Signals – Volume and Price – for stock symbols on a minute-by-minute basis. The reader enriches the Feed's raw tick data with metadata about the company's industries and locations. After enrichment, the JSON version of the Event would look like this:

{
     "Feed": "SampleStockTicker",
     "FeedGranularity": "Minute",
     "EventDate": "Fri, 23 Aug 2013 09:13:32 GMT",
     "MasterIndex": "AcmeSoftware:FTSE:Services:Technology",
     "Locations":  [
          {
              "Americas Sales Office": {
                  "Lat": "40.65",
                  "Long": "73.94"
               }
          }
          {
               "Europe Sales Office": {
                  "Lat": "51.51",
                  "Long": "0.12"
               }
          }
      ],
      "Tags":  [
          {
              "Tag1": "AcmeSoftware",
              "Tag2": "Technology",
              "Tag3": "FTSE"
          }
       ],
       "Signals":  [
          {
               "Price": "20.00",
               "Volume": "10000"
          }
       ]
}

Note that there is a MasterIndex field that is a concatenation of all the Tags about the tick. When the MasterIndex is persisted, it is actually stored in a more space-efficient format, but for clarity we will use the canonical form of the index, as shown above, throughout this post.
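
In its canonical form, the MasterIndex appears to simply be the Event's Tags sorted alphabetically and joined with colons (the sort order is our inference from the example above, not a documented guarantee):

```python
def master_index(tags):
    # Canonical, human-readable form: sorted Tags joined by ":"
    return ":".join(sorted(tags))

master_index(["AcmeSoftware", "Technology", "FTSE", "Services"])
# -> "AcmeSoftware:FTSE:Services:Technology"
```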

A MasterIndex has two purposes in life:

  1. To allow the user to easily find a Signal by searching for particular Tags.
  2. To act as the seed for creating indexes for Measures and Models. These indexes, along with a date range, are all that is required to find any analytic calculations in the system.

Once an Event has been created, the FeedReader uses an API call to place it on the EventToBeCalculatedQueue. Based on beta feedback, we've adopted a pluggable queuing strategy, so before we go any further, let's take a quick detour and talk briefly about what that means. Currently, PatternBuilders supports three types of queues for Events:

  • A pure in-memory queue. This is ideal for customers that want the highest performance and the lowest cost and who are willing to redo calculations in the unlikely event of machine failure. To keep failure risk as low as possible, we actually replicate the queues on different machines and optionally, place those machines in different datacenters.
  • Cloud-based queues. Currently, we use Azure ServiceBus queues, but there is no reason we couldn't support other PaaS vendors' queues as well. The nice thing about ServiceBus queues is that Microsoft's latest update allows them to be used on-premise against Windows Server 2012 with the same code as in the cloud, giving our customers maximum deployment flexibility.
  • The AMQP protocol. This allows our customers to host FeedReaders and Event queues completely on-premise while using our calculation engine. When combined with encrypted Tags, this allows our customers to keep their secrets "secret" and still enjoy the benefits of a real-time cloud analytics infrastructure.
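
A pluggable queuing strategy amounts to a small common contract that each backend implements; a minimal sketch of the idea (our own illustration, not PatternBuilders' actual API):

```python
from abc import ABC, abstractmethod
from collections import deque

class EventQueue(ABC):
    """Contract every queue backend (in-memory, ServiceBus, AMQP) would meet."""
    @abstractmethod
    def push(self, event): ...
    @abstractmethod
    def pop(self): ...

class InMemoryQueue(EventQueue):
    """Cheapest and fastest; events are lost if the machine fails."""
    def __init__(self):
        self._q = deque()
    def push(self, event):
        self._q.append(event)
    def pop(self):
        return self._q.popleft() if self._q else None

class ReplicatedQueue(EventQueue):
    """Sketch of the replication option: mirror every push to peer queues."""
    def __init__(self, *replicas):
        self.replicas = replicas
    def push(self, event):
        for r in self.replicas:
            r.push(event)
    def pop(self):
        return self.replicas[0].pop()
```

A ServiceBus- or AMQP-backed class would implement the same two methods, which is what lets the rest of the pipeline stay agnostic about where the queue actually lives.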

Once the Event is placed on the EventToBeCalculatedQueue, it will be picked up by the first available Indexing server, which monitors that queue for new Events (all queues and Indexing servers can be scaled up or down dynamically). The indexing service is responsible for creating Measure indexes from the Tags associated with the Event. This is the most performance-critical part of loading data, so forgive our skimpiness on implementation details, but we are going to let our competition design this one for themselves :-). Let's just say that, conceptually, the index service creates a text-searchable index for all non-alias Tags and any associated geo data. Some Tags are simply aliases for other Tags and do not need Measures created for them. For example, the symbol AAPL is simply an alternative for Apple Computer, so creating an average volume metric for both AAPL and Apple is pointless since they will always be the same. Being able to find that value by searching on either AAPL or Apple, on the other hand, is amazingly useful and is fully supported by the system.

More formally:

<Geek warning on>

The number of Indexes produced by an Event will be:

sum(k = 1 to n) of C(n, k) = 2^n - 1

where n equals the number of non-alias Tags and k ranges over the sizes of the Tag combinations.

</Geek warning off>

From our simple example above, we have the following Tags: AcmeSoftware, FTSE, Services, and Technology. With n = 4, this trivial example produces 2^4 - 1 = 15 Indexes:

AcmeSoftware
FTSE
Services
Technology
AcmeSoftware:FTSE
AcmeSoftware:Services
AcmeSoftware:Technology
FTSE:Services
FTSE:Technology
Services:Technology
AcmeSoftware:FTSE:Services
AcmeSoftware:FTSE:Technology
AcmeSoftware:Services:Technology
FTSE:Services:Technology
AcmeSoftware:FTSE:Services:Technology
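
The expansion above can be generated in a few lines of Python (our own sketch: every non-empty combination of the sorted Tags, joined by colons):

```python
from itertools import combinations

def build_indexes(tags):
    """Generate every non-empty Tag combination as a colon-joined Index.

    Tags are sorted first so each combination yields one canonical index
    string regardless of the order the Tags arrived in.
    """
    ordered = sorted(tags)
    return [
        ":".join(combo)
        for k in range(1, len(ordered) + 1)
        for combo in combinations(ordered, k)
    ]

indexes = build_indexes(["AcmeSoftware", "FTSE", "Services", "Technology"])
# 2^4 - 1 = 15 indexes, from "AcmeSoftware" up to the full MasterIndex
```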

The indexing service can perform parallel index creation across multiple cores and/or machines if needed. As Indexes are created, each one is combined with each Signal in the Event into a calculation request object and placed on the MeasureCalculationRequestQueue, which is monitored by the Measure Calculation Service.

The calculation service will take each index and use it to create or update all of the standard Measures (Sum, Count, Avg, Standard Deviation, Last, etc.) for each unique combination of index and the Measure's native granularity for each Signal. (Granularity management is complex and will be discussed in my next post.)

Specifically, the Calculation Service will remove a calculation request object from the queue and perform the following steps for all Measures appropriate to the Signal:

  1. Attempt to retrieve the Measure from either cache or persistent storage.
  2. If not found, create the Measure for the appropriate Date and Signal.
  3. Perform the associated calculation and update the Measure.
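
Sketched in Python (a deliberate simplification: the real service has a far richer set of Measures, a cache tier, and persistent storage), the get-or-create-then-update loop looks like:

```python
measure_store = {}  # (index, signal, date) -> state; stands in for cache + storage

def update_measures(index, date, signal_name, value):
    """Update the standard Measures for one (index, date, signal) combination."""
    key = (index, signal_name, date)
    # Steps 1-2: retrieve the Measure, or create it on first sight
    m = measure_store.setdefault(key, {"Count": 0, "Sum": 0.0, "Last": None})
    # Step 3: perform the associated calculations and update the Measure
    m["Count"] += 1
    m["Sum"] += value
    m["Avg"] = m["Sum"] / m["Count"]
    m["Last"] = value
    return m
```

Because each key is independent, many such updates can run concurrently on different threads or machines without coordinating with one another.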

Graphically the whole process looks something like this:

[Diagram: the analytics service pipeline from FeedReader and Event queues through the Indexing and Calculation Services]

The advantages of this approach are manifold. First, it allows for very sophisticated search capabilities across Measures and Models. Second, it allows deep parallelization of Measure calculation: each unique combination of Index, time, and Measure can be calculated by separate threads or even separate machines. This parallelization lets us scale the system by adding more Indexing Services and Calculation Services with no risk of contention, and it is this scalability that allows us to provide near real-time, streaming updates for all Measures and most Models. A Measure can be aggregated up from its native granularity using a pyramidal aggregation scheme if the user requests it (say, by querying for an annual number from a Measure whose Signal has a native granularity of a minute). A proprietary algorithm prevents double counting in the edge cases where Measures with different Indexes are calculated from the same Events.
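
For additive measures such as Sum and Count, the pyramidal roll-up simply aggregates each coarser bucket from the finer buckets below it; a hedged sketch of the idea (the bucketing function and data shapes here are our own invention):

```python
from collections import defaultdict

def roll_up(minute_sums, bucket):
    """Aggregate {timestamp: sum} at minute granularity up to a coarser bucket.

    `bucket` maps a minute timestamp to its coarser key (e.g. hour or day).
    Works for additive measures (Sum, Count); Avg is re-derived as Sum/Count.
    """
    coarser = defaultdict(float)
    for ts, s in minute_sums.items():
        coarser[bucket(ts)] += s
    return dict(coarser)

hourly = roll_up(
    {"09:13": 10.0, "09:47": 5.0, "10:02": 7.0},
    bucket=lambda ts: ts.split(":")[0],  # hypothetical hour key
)
# -> {"09": 15.0, "10": 7.0}
```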

So now you've seen how we get from a raw stream to a Measure, and how, along the way, we enrich metadata and numeric data to enable both richer search capabilities and easier computation of more complex analytics models. In later posts, we'll explore how searches are performed and models are developed; you'll see how this enrichment process makes exploring and creating complex analytics models much easier than in the first generation of big data, business intelligence, or desktop analytics systems.

However, before we get there we need to talk about how PatternBuilders handles dates and Granularity in more detail.  At our core, we are optimized for time-series analytics and how we deal with time is a critical part of our infrastructure. This is why in my next post we will be doing a deep (ok medium deep) dive into how we handle pyramidal aggregation and the always slippery concepts of time and streaming data. Thanks for reading and as always comments are free and welcomed!

August 29, 2013 at 8:18 am 2 comments

Privacy v Security, Transparency v Secrecy: The NSA, PRISM, and the Release of Classified Documents

By Mary Ludloff

Privacy, Anonymity, and Judicial Oversight are on the Endangered List

An age-old debate has once again reared its very ugly head due to whistleblower Edward Snowden's revelations about NSA surveillance, PRISM, and the astounding lack of any rigorous oversight of the NSA's vast data collection apparatus. While PatternBuilders has been incredibly busy, in our non-copious amounts of spare time Terence and I have also been working on our update to Privacy and Big Data (which is undergoing another rewrite due to new government surveillance revelations that for a while happened hourly, then daily, then weekly, but certainly are far from over). It's important to note that pre-revelations our task was already herculean due to mainstream media's pickup of "all stories related to privacy" (a good thing) that often missed the mark on the technical side of the house (we often find ourselves explaining to non-techies just what metadata is, usually after someone on CNN, Fox, NBC, ABC, etc., butchers the definition) or got tripped up by the various Acts, Amendments, state laws, EU Directives, etc., that apply to aspects of privacy.

Over the last few weeks, as details about PRISM emerged, it's become clear to me that Main Street America may still not understand the seismic shift that big data and analytics bring to the privacy debate. Certainly the power of big data and analytics has been lauded or vilified in the press; followers of our twitter feed are used to seeing the pros and cons of big data projects debated pretty much every day. We (Terence and I) have talked and tweeted about privacy issues as they apply to individuals, companies, and governments. Heck, we even wrote a book about privacy and big data. (more…)

July 19, 2013 at 12:14 pm 2 comments

Big Data Project: Start with a Question that You Want to Answer

A top-level view of our data project over a series of posts.

By Marilyn Craig

Welcome to the second post of a series on a big data project that will (Mary and I hope) provide clarity and insights on how to successfully complete a big data initiative. Now, just in case you've forgotten the first two rules in our Big Data Playbook, I am going to repeat them here because they play into our topic of the day, which is all about "starting" your big data project:

Rule #1: Big Data IS NOT rocket science.

Yes, far too often those lucky internal folks tasked with managing a big data project fall into the trap of data science paralysis, which is similar in spirit to analysis paralysis. By this I mean that there are so many moving pieces to capture, so many technology decisions to make, so many skill positions to fill, so many fill-in-the-blanks to get done, that you never actually get started. Which leads me to our second rule:

Rule #2: Garbage in, garbage out.

(more…)

April 3, 2013 at 5:39 pm 3 comments

A Big Data Showdown: How many V’s do we really need? Three!

By Mary Ludloff

Marilyn Craig (Managing Director of Insight Voices, frequent guest blogger, marketing colleague, and analytics guru) and I have been watching the big data "V" pile-on with a bit of bemusement lately. We started with the classic 3 V's, codified by Doug Laney, a META Group and now Gartner analyst, in early 2001 (yes, that's correct, 2001). Doug puts it this way:

“In the late 1990s, while a META Group analyst (Note: META is now part of Gartner), it was becoming evident that our clients increasingly were encumbered by their data assets.  While many pundits were talking about, many clients were lamenting, and many vendors were seizing the opportunity of these fast-growing data stores, I also realized that something else was going on. Sea changes in the speed at which data was flowing mainly due to electronic commerce, along with the increasing breadth of data sources, structures and formats due to the post Y2K-ERP application boom were as or more challenging to data management teams than was the increasing quantity of data.”

Doug worked with clients on these issues as well as spoke about them at industry conferences. He then wrote a research note (February 2001) entitled “3-D Data Management: Controlling Data Volume, Velocity and Variety” which is available in its entirety here (pdf too). (more…)

January 17, 2013 at 7:06 pm 3 comments

Our Favorite Reads of 2012

By Mary Ludloff & Terence Craig

Greetings one and all! 2012 was a breakout year for PatternBuilders and we are very grateful to all of you for helping to make that happen. But we would also like to take a minute to extend our condolences and share the grief of parents across the world that lost young children to violence. Newtown was singularly horrific but similar events play out all too often across the globe. We live in an age of technical wonders—surely we can find ways to protect the world's children.

This is our last post of 2012 and in the spirit of the season, we decided to do something a little different this year. Recently, the Wall Street Journal asked 20 of its “friends” to tell them what books they enjoyed in 2012 and the responses were equally eclectic and interesting. Not to be outdone, Adam Thierer published his list of cyberlaw and info-tech policy books for 2012. Many of the recommendations culled from both sources ended up on our reading lists for 2013 (folks, 2012 is almost over and between launching AnalyticsPBI for Azure and working on our update for Privacy and Big Data, not a lot of “other” reading is going to happen during the holiday season!) and spurred an interesting discussion about our favorite reads of the year. One caveat: Our lists may include books we read but were not necessarily published this year. So without further ado, I give you our favorite reads of 2012! (more…)

December 21, 2012 at 7:07 pm Leave a comment

“Hadoopla”

© Marqin Cook

By Terence Craig

I had to miss Strata due to a family emergency. While Mary picked up the slack for me at our privacy session, and by all reports did her usual outstanding job, I also had to cancel a Tuesday night Strata session sponsored by 10gen on how PatternBuilders has used MongoDB and Azure to create a next-generation big data analytics system. The good news is that I should have some time to catch up on my writing this week, so look for a version of what would have been my 10gen talk shortly. In the meantime, to get me back in the groove, here is a very short post inspired by a Forbes post written by Dan Everett of SAP on "Hadoopla."

As a CEO of a real-time big data analytics company that occasionally competes with parts of the Hadoop ecosystem, I may have some biases (you think?).  But I certainly agree that there is too much Hadoopla (a great term).  If our goal as an industry is to move Big Data out of the lab and into mainstream use by anyone other than the companies that thrive on and have the staff to support high maintenance and very high skill technologies, Hadoop is not the answer – it has too many moving parts and is simply too complex.

To quote from a blog post I wrote a year ago:

“Hadoop is a nifty technology that offers one of the best distributed batch processing frameworks available, although there are other very good ones that don't get nearly as much press, including Condor and Globus. All of these systems fit broadly into the High Performance, Parallel, or Grid computing categories, and all have been or are currently used to perform analytics on large data sets (as well as other types of problems that can benefit from bringing the power of multiple computers to bear on a problem). The SETI project is probably the most well known (and IMHO, the coolest) application of these technologies outside of that little company in Mountain View indexing the Internet. But just because a system can be used for analytics doesn't make it an analytics system…

Why is the industry so focused on Hadoop? Given the huge amount of venture capital that has been poured into various members of the Hadoop ecosystem, and that ecosystem's failure to find a breakout business model that isn't hampered by Hadoop's intrinsic complexity, there is ample incentive for a lot of very savvy folks to attempt to market around these limitations. But no amount of marketing can change the fact that Hadoop is a tool for companies with elite programmers and top-of-the-line computing infrastructures. And in that niche, it excels. But it was not designed for, and in my opinion will never see, broad adoption outside of that niche despite the seemingly endless growth of Hadoopla.”

October 24, 2012 at 1:39 pm 1 comment

Big Data and Science: Focus on the Business and Team, Not the Data (Part 3 of 3)

By Mary Ludloff

Let me tell you a little secret: I always know when I am talking (and working) with a company that has successfully launched big data initiatives. There are three characteristics that these companies share:

  1. A C-level executive runs the “[big] data operations.”
  2. The Chief Data Officer (even if they are the CIO) has a heavy business/operations background.
  3. The data team is focused on the “business,” not the data.

Did you notice that technology and data science are not reflected in any of these characteristics? Some of you may consider this sacrilege; after all, we are operating in a world where technology (and I happily work for one of those companies) has changed the data collection, usage, and analysis game. Colleges and universities are now offering master's degrees in analytics. The role of the data scientist has been pretty much deified (I refer you to Part 1 of this series). And we all need to be very worried about the "talent shortage" and our ability to recruit the "right analytical team" (I refer you to Part 2 of this series).

Yes, technology has had a tremendous impact on how much data we can collect and the ways in which we can analyze it, but not everyone needs to be a senior computer programmer. Yes, we should all strive to be more mathematically inclined, but not all of us need Master's degrees or PhDs in statistics or analytics. Yes, some companies, based on their business models, may have a staff of data scientists, but others may get along just fine without one (with the occasional analytics consultant lending a hand). (more…)

October 20, 2012 at 4:50 am 4 comments

Data Science: What the World Needs is Answers, Not Just Insights Part 2 (of 3)

By Marilyn Craig, Managing Director, Insight Voices

As you may or may not know, we are in the midst of a 3-part series on data science, covering roles, skills, etc.—generally what you should think about as well as what’s not as important (no matter what the latest articles say!). For Part 2, we have a guest poster—Marilyn Craig of Insight Voices. Marilyn is what I like to call a “classic quant.” She has been at the forefront of big data and data science before most people knew these terms (and spaces) existed and has been my go-to person whenever I had an analytics question (see title) that I needed an answer to. In this post, Marilyn looks at insights and makes the case for why we should all care far more about answers. Take it away Marilyn!

Here’s an interesting question for this new world order of Big Data Analytics: what’s an Insight and what’s an Answer? Sometimes they are the same, sometimes not. An insight is a piece of information or understanding. It may or may not be useful. It may or may not help your business improve, solve world hunger, or even make sense. An answer is always useful. It is the result of asking a question. And the best kinds of answers are those that solve the questions that you really care about. (more…)

October 8, 2012 at 10:57 am 5 comments

Speaking on Inman Connect Panel on Real Estate and Big Data

By Terence Craig

I apologize for falling behind on blogging, but between several new hires,  major partnerships, and the industry finally starting to understand the need for product-driven (instead of project-driven) big data, things have been very hectic. Good, but hectic.

I did want to pull my head off my keyboard for a minute to tell you about participating in the big data & real estate panel this Thursday at Connect San Francisco.  Our panel will be moderated by industry luminary Brad Inman @bradInman.

Real estate has always been a data-driven business and is relying more and more on the insights and operational nimbleness provided by big data.  For those of you who are scratching your heads and going, “Huh, Real Estate and big data?” – think about it for a minute.  The real estate industry is “using” big data to do all kinds of things and drive all kinds of business models, such as:

  • Commercial landlords using smart thermostats and smart windows adjusted in real-time to save energy.
  • Capturing real-time parking meter data to make real-time decisions about how long to leave a retail location open.
  • Using real-time video analysis to stop vandalism before it happens.
  • Offering sophisticated analytics – see consumer-facing sites like Trulia and Zillow.
  • Risk Modeling – check out RMS. Like most of the PatternBuilders team, they were “doing” Big Data before the term was invented.

If you are attending the show, stop by and say hi. If you are interested in Big Data & Real Estate, look for our post-Connect blog next week. In it, we will talk about some great insights about the New York real estate market derived from a ton of data we grabbed from the NYC public data market which was then spun up in the PatternBuilders framework on our brand spanking new Microsoft Azure cloud beta release.

August 1, 2012 at 9:37 pm Leave a comment

Big Data Makes Its Broadway Debut (Sort of) and Other News

By Mary Ludloff

What a week for big data in the news! Just two or three years ago, it seemed like big data was the sole purview of "pioneer" companies and industries (like retail and financial services). Today, everyone is writing about it (follow us on @bigdatapbi for the news we come across) and we've gone from industry- or research-related "stories" to mainstream press reporting from outlets like the Wall Street Journal, New York Times, and Forbes. But you really know you've made it when you've taken your metaphorical bow on Broadway.

Full disclosure: I am a theater maven. Love Broadway, love the theater, love going to New York City to see shows. So imagine my early morning pre-coffee surprise when I saw this headline in the Wall Street Journal: Big Data Hits Broadway. Somebody wrote and produced a play about big data? How could I not know about this? Who’s in the cast? Well… not quite! However, the slogan “Conquer Big Data” is on a sign gracing Times Square. As editor Michael Hickins points out in the article:

“As you can see from these roadside signs – one above Broadway in New York’s Times Square district, the other on Highway 101 between San Jose and Redwood City – Big Data has gone from bleeding edge to the edge of the highway.” (more…)

June 22, 2012 at 4:35 pm Leave a comment
