Posts tagged ‘analytic systems’

All Things Must Come to an End—PatternBuilders is Shutting Down

By Terence Craig and Mary Ludloff

There’s a sad, but true, statistic that every entrepreneur knows by heart: 9 out of 10 startups fail. Unfortunately, PatternBuilders is adding its number to this pile. We have been procrastinating writing this post because shutting down a company is hard. When you put your heart and soul into something, you need time to process, reflect, and eventually get to the point where you can move on.

But moving on does not mean that we are disappearing; after all, shutting down the company does not end our passion for big data, privacy, and all things tech-related (especially IoT). To that end, we will be maintaining this blog as our main place to write and comment about those issues. We are also consulting in all areas involving big data and/or privacy (via our existing consulting organization, Ludloff-Craig Associates) and are working on some other things that we are keeping under wraps for now. But if you follow our blog, @terencecraig, or @mludloff, you will be the first to know. And if you have interesting opportunities, consulting projects, or (for the right company) a full-time job, please get in touch.

There are a number of reasons why we are shutting our doors, but suffice to say, we made some decisions we knew might have an adverse effect on the company. And we stand by those decisions. (more…)

January 26, 2016 at 9:05 pm 2 comments

A Sneak Peek at Our New HTML 5 UI and Geek Love for Some of the Libraries Used in Building AnalyticsPBI4Azure

By Terence Craig

Drumroll please! After nearly a year of development work, we are about to offer early access to AnalyticsPBI for Azure, the first real-time/streaming analytics software appliance for the cloud. There will be more forthcoming on the product launch, but the new UI is so cool I had to show it off a bit.

We will be following up with a formal launch and Early Access Program (EAP) signups in the next couple of weeks, so watch this space and patternbuilders.com for details. The big data analytics market is about to change in a big way! Here’s a sneak peek at what we’ve been working on.

For the geek part of my blog I am going to give a shout out to three libraries that we are using – all have made a huge difference in the product’s performance, scalability, and usability. The first two libraries come from Microsoft – Reactive Extensions and TPL Dataflow.  The third library is the open source math and statistics library, Math.Net.

(more…)

November 8, 2013 at 2:07 pm 2 comments

Events to Measures – Scalable Analytics Calculations using PatternBuilders in the Cloud

By Terence Craig

One part of the secret sauce that enables PatternBuilders to provide more accessible and performant user experiences for both creators and consumers of streaming analytics models is its infrastructure. Our infrastructure makes it easy to combine rich search capabilities for a diverse set of standard analytics that can be used to create more complex streaming analytics models. This post will describe how we create those standard analytics that we call Measures.

In my last post about our architecture, we delved into how we used custom SignalReaders as the point of entry for data into AnalyticsPBI. We’ve tightened up our nomenclature a bit since then, so it’s worth reviewing some of our definitions:

Nomenclature: Description

Feed: An external source of data to be analyzed. These can include truly real-time feeds, such as stock tickers or the Twitter firehose, or batch feeds, such as CSV files converted to data streams.

Event: An external event within a Feed that analysis will be performed on. For example, a stock tick, RFID read, PBI performance event, tweet, etc. AnalyticsPBI can support analysis on any type of event as long as it has one or more named numeric fields and a date. An Event can have multiple Signals.

Signal: A single numeric data element within an Event, tagged with the metadata that accompanied the Event, plus any additional metadata (to use NSA parlance) applied by the FeedReader. For example, a stock tick would have Signals of Price and Volume among others.

Tag: A string representing a piece of metadata about an Event. Tags are combined to form Indexes for both Events and Measures.

FeedReader (formerly SignalReader): A service written by PatternBuilders, customers, or third parties to read particular Feed(s), convert the metadata to Tags, and potentially add metadata from other sources to create Events. Simple examples include a CSV reader and a stock tick reader. An example of a more complex reader is the reader we have created for the University of Sydney project that filters the Twitter firehose for mentions of specific stock symbols and hyperlinks to major media articles and then creates an Event that includes a Signal derived from the sentiment scores of those linked articles. That reader was discussed here. A FeedReader’s primary responsibility is to create and index an object that converts “raw data” received from one or more Feeds to an Event. To accomplish this it does the following:

  1. Captures an Event from a feed: stock ticker, RFID channel, the Twitter firehose, etc.
  2. Uses the Event itself and any appropriate external data to attach metadata and numeric data to the Event or to enrich what is already there.
  3. Creates a MasterIndex from all of the metadata attached to the Event. This MasterIndex and the Date associated with this Event are used to create Measures and Models later on in the process. It can also attach geo data if appropriate.
  4. Extracts the numeric Signals for that Event.
  5. Pushes the Event object onto a named queue, the “EventToBeCalculatedQueue”, for processing. This queue, like all PatternBuilders queues, has a pluggable implementation. It can be in memory (cheaper and faster) or persistent (more costly and slightly slower). One of the great advantages of the various cloud services, including our reference platform Azure, is the availability of scalable, fast, reliable, persistent queues.

Measure: A basic calculation that is generated automatically by the PatternBuilders calculation service and persisted. Measures are useful in and of themselves but they are also used to dynamically generate results for more complex streaming Analytic Models.
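
To make the five FeedReader steps a little more concrete, here is a minimal Python sketch. It is not the PatternBuilders implementation: the `Event` class, the `read_tick` function, the `COMPANY_METADATA` lookup, and the shape of the raw tick are all hypothetical, and building the MasterIndex by sorting the Tags and joining them with ":" is an assumption based on the canonical form used in this post.

```python
import queue
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Event:
    """Hypothetical Event: Tags, numeric Signals, a date, and a MasterIndex."""
    feed: str
    event_date: datetime
    tags: list
    signals: dict
    master_index: str = field(init=False)

    def __post_init__(self):
        # Step 3: build the MasterIndex from all Tags on the Event.
        # Sorting and ':' joining are assumptions, not the real storage format.
        self.master_index = ":".join(sorted(self.tags))


# Hypothetical external metadata used to enrich raw ticks (step 2).
COMPANY_METADATA = {
    "ACME": ["AcmeSoftware", "FTSE", "Services", "Technology"],
}

# Stand-in for the EventToBeCalculatedQueue (in-memory variant).
event_to_be_calculated_queue = queue.Queue()


def read_tick(raw_tick: dict) -> Event:
    """Convert one raw tick (step 1) into an enriched, queued Event."""
    tags = COMPANY_METADATA[raw_tick["symbol"]]           # step 2: enrich
    signals = {                                           # step 4: Signals
        "Price": float(raw_tick["price"]),
        "Volume": float(raw_tick["volume"]),
    }
    event = Event(
        feed="SampleStockTicker",
        event_date=datetime.fromisoformat(raw_tick["timestamp"]),
        tags=tags,
        signals=signals,
    )
    event_to_be_calculated_queue.put(event)               # step 5: enqueue
    return event
```

Calling `read_tick` with a tick such as `{"symbol": "ACME", "price": "20.00", "volume": "10000", "timestamp": "2013-08-23T09:13:32+00:00"}` would leave one enriched Event waiting on the queue.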

As the topic of this post is Events to Measures, let’s create a simple Measure and follow it thru the process. For this purpose, we’ll be working with a simplified StockFeedReader that will create a tick Event from a tick feed that includes two Signals – Volume and Price – for stock symbols on a minute-by-minute basis. The reader will enrich the Feed’s raw tick data with metadata about the company’s industries and locations. After enrichment, the JSON version of the event would look like this:

{
     "Feed": "SampleStockTicker",
     "FeedGranularity": "Minute",
     "EventDate": "Fri, 23 Aug 2013 09:13:32 GMT",
     "MasterIndex": "AcmeSoftware:FTSE:Services:Technology",
     "Locations":  [
          {
              "Americas Sales Office": {
                  "Lat": "40.65",
                  "Long": "73.94"
               }
          },
          {
               "Europe Sales Office": {
                  "Lat": "51.51",
                  "Long": "0.12"
               }
          }
      ],
      "Tags":  [
          {
              "Tag1": "AcmeSoftware",
              "Tag2": "Technology",
              "Tag3": "FTSE",
              "Tag4": "Services"
          }
       ],
       "Signals":  [
          {
               "Price": "20.00",
               "Volume": "10000"
          }
       ]
}

Note that there is a MasterIndex field that is a concatenation of all the Tags about the tick. When the MasterIndex is persisted, it is actually stored in a more space efficient format but we will use the canonical form of the index as shown above throughout this post for clarity.

A MasterIndex has two purposes in life:

  1. To allow the user to easily find a Signal by searching for particular Tags.
  2. To act as the seed for creating indexes for Measures and Models. These indexes, along with a date range, are all that is required to find any analytic calculations in the system.

Once an Event has been created by a FeedReader, the FeedReader uses an API call to place the Event on the EventToBeCalculatedQueue. Based on beta feedback, we’ve adopted a pluggable queuing strategy. So before we go any further, let’s take a quick detour and talk briefly about what that means.  Currently, PatternBuilders supports three types of queues for Events:

  • A pure in-memory queue. This is ideal for customers that want the highest performance and the lowest cost and who are willing to redo calculations in the unlikely event of machine failure. To keep failure risk as low as possible, we actually replicate the queues on different machines and optionally, place those machines in different datacenters.
  • Cloud-based queues. Currently, we use Azure ServiceBus Queues but there is no reason that we couldn’t support other PaaS vendors’ queues as well. The nice thing about ServiceBus queues is that the latest update from Microsoft for Windows 2012 allows them to be used on-premise against Windows Server with the same code as for the cloud, giving our customers maximum deployment flexibility.
  • AMQP protocol. This allows our customers to host FeedReaders and Event queues completely on-premise while using our calculation engine. When combined with encrypted Tags, this allows our customers to keep their secrets “secret” and still enjoy the benefits of a real-time cloud analytics infrastructure.
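
For readers wondering what “pluggable” might look like in code, here is a minimal Python sketch under stated assumptions. The `EventQueue` interface, the `InMemoryEventQueue` class, and the `make_queue` factory are hypothetical names invented for illustration; only the in-memory variant is shown, and a ServiceBus or AMQP backed queue would be another subclass registered in the same way.

```python
import queue
from abc import ABC, abstractmethod
from typing import Any, Optional


class EventQueue(ABC):
    """Hypothetical interface every queue implementation would satisfy."""

    @abstractmethod
    def put(self, event: Any) -> None:
        """Add an Event to the queue."""

    @abstractmethod
    def get(self) -> Optional[Any]:
        """Return the next Event, or None if the queue is empty."""


class InMemoryEventQueue(EventQueue):
    """Fast and cheap, but contents are lost if the process dies."""

    def __init__(self) -> None:
        self._queue = queue.Queue()

    def put(self, event: Any) -> None:
        self._queue.put(event)

    def get(self) -> Optional[Any]:
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None


# Registry of available implementations; persistent variants would be
# added here without touching the FeedReader or calculation code.
QUEUE_TYPES = {"memory": InMemoryEventQueue}


def make_queue(kind: str) -> EventQueue:
    """Pick a queue implementation by configuration name."""
    try:
        return QUEUE_TYPES[kind]()
    except KeyError:
        raise ValueError(f"unknown queue type: {kind!r}") from None
```

The point of the design is that a FeedReader only ever sees `put` and the services only ever see `get`, so swapping the queue type becomes a configuration change rather than a code change.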

Once the Event is placed on the queue, it will be picked up by the first available Indexing server, which monitors that queue for new Events (all queues and Indexing servers can be scaled up or down dynamically). The indexing service is responsible for creating Measure indexes from the Tags associated with the Event. This is the most performance-critical part of loading data, so forgive our skimpiness on implementation details, but we are going to let our competition design this one for themselves :-). Let’s just say that conceptually the index service creates a text-searchable index for all non-alias Tags and any associated geo data. Some Tags are simply aliases for other Tags and do not need Measures created for them. For example, the symbol AAPL is simply an alternative for Apple Computer, so creating an average volume metric for both AAPL and Apple is pointless since they will always be the same. Being able to find that value by searching on AAPL or Apple, on the other hand, is amazingly useful and is fully supported by the system.

More formally:

<Geek warning on>

The number of indexes produced by an Event will be:

$$\sum_{k=1}^{n} \binom{n}{k} = \sum_{k=1}^{n} \frac{n!}{k!\,(n-k)!} = 2^{n} - 1$$

where n equals the number of non-alias tags and the upper limit for k is equal to n.

</Geek warning off>

From our simple example above, we have the following Tags: AcmeSoftware, FTSE, Services, and Technology.  This trivial example will produce the following Indexes:

AcmeSoftware
FTSE
Services
Technology
AcmeSoftware:FTSE
AcmeSoftware:Services
AcmeSoftware:Technology
FTSE:Services
FTSE:Technology
Services:Technology
AcmeSoftware:FTSE:Services
AcmeSoftware:FTSE:Technology
AcmeSoftware:Services:Technology
FTSE:Services:Technology
AcmeSoftware:FTSE:Services:Technology
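
The list above is every non-empty combination of the four Tags, which is easy to reproduce. The short Python sketch below does so with the standard library; the function name `measure_indexes` is made up for illustration, the ":" separator and alphabetical ordering are assumptions taken from the canonical form shown in this post, and alias Tags are assumed to have been filtered out beforehand. It says nothing about how the real indexing service is implemented.

```python
from itertools import combinations


def measure_indexes(tags):
    """Return every non-empty combination of Tags as a ':'-joined Index."""
    ordered = sorted(set(tags))  # de-duplicate and fix a canonical order
    return [
        ":".join(combo)
        for k in range(1, len(ordered) + 1)   # k Tags per Index, k = 1..n
        for combo in combinations(ordered, k)
    ]


indexes = measure_indexes(["AcmeSoftware", "FTSE", "Services", "Technology"])
for index in indexes:
    print(index)
```

With four Tags this yields 4 + 6 + 4 + 1 = 15 Indexes, and each additional non-alias Tag roughly doubles the count, which is why alias Tags are worth excluding.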

The indexing service can perform parallel index creation across multiple cores and/or machines if needed. As Indexes are created, they, and each Signal in the Event, are combined into a calculation request object and placed in the MeasureCalculationRequestQueue that is monitored by the Measure Calculation Service.

The analytics service will take each Index and use it to create or update all of the standard Measures (Sum, Count, Avg, Standard Deviation, Last, etc.) for each Signal, one per unique combination of Index and the Measure’s native granularity. (Granularity management is complex and will be discussed in my next post.)

Specifically, the Calculation Service will remove a calculation request object from the queue and perform the following steps for all Measures appropriate to the Signal:

  1. Attempt to retrieve the Measure from either cache or persistent storage.
  2. If not found, create the Measure for the appropriate Date and Signal.
  3. Perform the associated calculation and update the Measure.
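
The three steps above can be sketched in a few lines of Python. This is an illustration only: the `Measure` class, the `measure_store` dictionary standing in for the cache and persistent storage, and the `process_request` function are hypothetical, and the running standard deviation uses Welford’s online algorithm, which is a choice made for this sketch rather than anything confirmed about the PatternBuilders engine.

```python
import math
from dataclasses import dataclass


@dataclass
class Measure:
    """Running Sum, Count, Avg, Standard Deviation, and Last for one Signal."""
    count: int = 0
    total: float = 0.0
    mean: float = 0.0
    m2: float = 0.0    # sum of squared deviations (Welford's algorithm)
    last: float = 0.0

    def update(self, value: float) -> None:
        self.count += 1
        self.total += value
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        self.last = value

    @property
    def std_dev(self) -> float:
        # Population standard deviation of the values seen so far.
        return math.sqrt(self.m2 / self.count) if self.count else 0.0


# Stand-in for the cache / persistent storage, keyed by
# (Index, time bucket at native granularity, Signal name).
measure_store = {}


def process_request(index: str, bucket: str, signal: str, value: float) -> Measure:
    key = (index, bucket, signal)
    measure = measure_store.get(key)      # step 1: try to retrieve
    if measure is None:                   # step 2: create if not found
        measure = Measure()
        measure_store[key] = measure
    measure.update(value)                 # step 3: calculate and update
    return measure
```

Because each key is independent of every other key, requests for different Index, time, and Signal combinations never touch the same Measure.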

Graphically the whole process looks something like this:

[Diagram: the analytics service process, from Event to Measure]

The advantages of this approach are manifold.  First, it allows for very sophisticated search capabilities across Measures and Models.  Second, it allows deep parallelization for Measure calculation. This parallelization allows us to scale the system by creating more Indexing Services and Calculation Services with no risk of contention and it is this scalability which allows us to provide near real-time, streaming updates for all Measures and most Models.  Each Index, time, and measure combination is unique and can be calculated by separate threads or even separate machines. A measure can be aggregated up from its native granularity using a pyramid scheme if the user requests it (say by querying for an annual number from a measure whose Signal has a native granularity of a minute). A proprietary algorithm prevents double counting for the edge cases where Measures with different Indexes are calculated from the same Events.
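
As a rough illustration of aggregating a Measure up from its native granularity, the Python sketch below rolls minute-level Sum and Count values up to the hour and derives the hourly average from them. The function name `roll_up_to_hour` and the input shape are hypothetical, it covers only the additive Measures, and it makes no attempt to show the double-counting protection mentioned above.

```python
from collections import defaultdict
from datetime import datetime


def roll_up_to_hour(minute_measures):
    """Aggregate {minute: (sum, count)} into {hour: (sum, count, avg)}."""
    hourly = defaultdict(lambda: [0.0, 0])
    for minute, (total, count) in minute_measures.items():
        # Truncate the timestamp to the start of its hour.
        hour = minute.replace(minute=0, second=0, microsecond=0)
        hourly[hour][0] += total   # Sums add up directly
        hourly[hour][1] += count   # and so do Counts
    return {
        hour: (total, count, total / count)  # Avg is derived, not averaged
        for hour, (total, count) in hourly.items()
        if count
    }


minute_prices = {
    datetime(2013, 8, 23, 9, 13): (20.0, 1),
    datetime(2013, 8, 23, 9, 14): (22.0, 1),
    datetime(2013, 8, 23, 10, 0): (30.0, 2),
}
hourly_prices = roll_up_to_hour(minute_prices)
```

The same step can be repeated from hours to days, days to months, and so on, which is what makes the scheme pyramidal: each level is computed from the level below rather than from the raw Events.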

So now you’ve seen how we get from a raw stream to a Measure, and how, along the way, we’re able to enrich metadata and numeric data to enable both richer search capabilities and easier computation of more complex analytics models. Later on, we will explore how searches are performed and models are developed. You will see how this enrichment process makes exploring and creating complex analytics models much easier than with the first generation of big data, business intelligence, or desktop analytics systems.

However, before we get there we need to talk about how PatternBuilders handles dates and Granularity in more detail.  At our core, we are optimized for time-series analytics and how we deal with time is a critical part of our infrastructure. This is why in my next post we will be doing a deep (ok medium deep) dive into how we handle pyramidal aggregation and the always slippery concepts of time and streaming data. Thanks for reading and as always comments are free and welcomed!

August 29, 2013 at 8:18 am 2 comments

Enterprise Software in the Cloud: Why We Chose Azure as our First PaaS Platform

By Terence Craig

I’ve been absent from the blog too long, but if you’ve been following my colleagues’ (Mary and Marilyn) postings, you’ll see it’s been a very busy and fruitful time at PatternBuilders. While I’m still overdue for the next segment of the architecture blog series, I thought I would take a break and talk a bit about some of the things we learned as we moved our product and business model to Microsoft Azure.

As someone who has worked with Microsoft technology and partnered with them off and on over the last two decades (even flirting with going to work for them a couple of times), the most surprising discovery was how serious Microsoft has become about the cloud, open source, and being an active and supportive partner for startups.  As many of you who have been around as long as I have will no doubt remember, this is a very different, some would say revolutionary, move for the world’s most powerful proprietary software company.  We had some concerns when we became members of Microsoft’s Azure Startup program BizSpark Plus and subsequently the more exclusive BizSpark One, but it has turned out to be a great experience for us on both the business and technical level. (more…)

May 4, 2013 at 6:10 pm 4 comments

In Search of Elusive Big Data Talent: Is Science Big Data’s Biggest Challenge? Or Are We Looking in the Wrong Places? (Part 1 of 3)

By Mary Ludloff

When we talk to prospects about their big data initiatives, our conversations usually revolve around issues of complexity and go something like this:

“Big data is so big (no pun intended), there’s such a variety of sources, and it’s coming in so fast. How can we develop and deploy our big data projects when everyone is telling us that we need lots and lots of data scientists and oh, by the way, there aren’t enough?”

Admittedly, many media outlets and pundits are positioning the search for skilled big data resources as what I can only characterize as the battle for the brainiacs. Don’t get me wrong, I am not disputing McKinsey’s report on big data last year that made it clear a talent shortage was looming, estimating that the U.S. would need 140,000 to 190,000 folks with “deep analytical skills” and 1.5 million managers and analysts to “analyze big data and make decisions based on their findings.” But the hype surrounding the data scientist is getting a bit absurd and we seem to be forgetting that those 1.5 million managers and analysts may already be “walking amongst us.” Is a shortage of data scientists really big data’s biggest challenge? (more…)

September 30, 2012 at 2:04 pm 7 comments

Big Data is Coming of Age in the Capital Markets—Wall Street and Technology’s Deep Dive into “Everything You Need to Know to Unlock Big Data’s Secrets” is a Must Read for All

By Mary Ludloff

In the “it’s a small world” category: while we were in the midst of launching FinancePBI, the first financial services big data solution built for the cloud and designed to address the needs of the industry, Terence chatted with Melanie Rodier (@mrodier), a Senior Editor at Wall Street and Technology. The topic: big data and the capital markets. The resulting 33-page report is now available and it’s a must read for anyone interested in big data and business.

Why a must read for all? Well, similar to the McKinsey report on big data in 2011, Wall Street and Technology’s big data deep dive covers a lot of ground that applies to any business or organization. In other words, specific industry requirements may be different but big data technology and process challenges are very similar. For example, Wall Street firms—like so many others—find themselves dealing with unstructured data from a variety of sources, including the web, social media, and mobile devices. While there’s value in that data, there are infrastructure issues and a looming talent shortage. Sound familiar? (more…)

May 3, 2012 at 6:41 pm 1 comment

Big Data and Cloud not a fit? Comments on Infoworld Article

By Terence Craig

Since Disqus seems to have completely eaten (bleh) my comment on @davidlinthicum’s very interesting InfoWorld post – Big data and the cloud: A far from perfect fit, I decided to just expand my comments and make a short blog post out of it. IMHO the problems that David is describing are more a reflection of problems with batch oriented technologies like Hadoop (more on my take on Hadoop here) in the cloud than a general problem for cloud based big data solutions.

Computing always has, and probably always will have, a bias towards creating batch focused technologies at the beginning of any large paradigm shift.   But as new technologies are absorbed, understood, and move from early adopter to more mainstream use, the batch paradigm will inevitably start to shift to streaming and real-time. We have seen this again and again (from punch cards to touch sensitive tablets, downloaded media to streaming media, DOM to SAX parsers, HTML to Ajax, paper maps to real-time GPS). The reason this evolution almost always occurs is simple: humans live and think in real-time and when our tools do as well we are more productive and happier.  So why do we have this bias for batch processing in our first generation computational technologies? Simply put, because batch processing is a lot easier.

(more…)

February 23, 2012 at 3:01 pm Leave a comment
