No-SQL – Going All The Way
We have recently made a big architectural change concerning our storage back-end and I wanted to talk about it.
Storage is key to any Big Data problem. As we’ve mentioned in prior posts, most of our performance bottlenecks and optimizations involve storage performance and architecture rather than computation. For the last few years, our architecture has been a hybrid: “no-SQL” analytics storage using MongoDB, and non-analytics data stored in a traditional RDBMS, primarily SQL Server. There were two reasons for this architecture. First, we started off entirely in RDBMS-land, because our initial design was done before no-SQL systems had reached a production level of maturity. Second, most of our customers and prospects had traditional schemas and data organization – integration was easier if we could simply use the same object model.
We have talked at length about the PAF Analytics Server and the design issues we faced there. We use MongoDB exclusively to store our hierarchical indexes and aggregations, interacting with MongoDB at a very low level for performance reasons. In fact, our use of storage for analytics is so simple that the entire “adapter” for MongoDB is only about 500 lines of code.
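To give a sense of how thin such an adapter can be, here is a minimal sketch in Python. Our actual adapter is not public and is written for .NET; the class and method names below are purely illustrative, and a plain dictionary stands in for a MongoDB collection:

```python
# Illustrative sketch of a thin document-store adapter.
# A dict stands in for a MongoDB collection; in a real adapter the
# same three operations would call the driver's insert/find/update.

class DocumentStoreAdapter:
    """Minimal adapter: store, fetch, and upsert documents by key."""

    def __init__(self):
        self._collection = {}  # stand-in for a MongoDB collection

    def save(self, doc_id, document):
        # Schema-less: any dict shape is accepted as-is.
        self._collection[doc_id] = dict(document)

    def load(self, doc_id):
        doc = self._collection.get(doc_id)
        return dict(doc) if doc is not None else None

    def update(self, doc_id, fields):
        # Upsert semantics: create the document if it is missing,
        # otherwise merge in the new fields.
        self._collection.setdefault(doc_id, {}).update(fields)


adapter = DocumentStoreAdapter()
adapter.save("node:1", {"level": 0, "sum": 42.0})
adapter.update("node:1", {"count": 7})
print(adapter.load("node:1"))  # {'level': 0, 'sum': 42.0, 'count': 7}
```

When the storage operations are this small a set – save, load, upsert – there is very little adapter code to write, which is why the analytics side stayed so compact.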
For the non-analytics data, however, our traditional RDBMS approach was far more complex. We used an object-relational mapper called NHibernate as the “glue” between our schemas and the database, so that we could avoid writing platform-specific database code where possible. The trouble with NHibernate is that it can be a configuration and deployment nightmare, especially because we used it across several different domains at the same time. We had to write a lot of custom code to tell NHibernate how to store the data efficiently in every case, and we had to design a “loader” architecture to handle data coming in from disparate data sources and to properly maintain relationships. Our two storage stacks ended up looking like this:
- Analytics: Analytics Server -> MongoDB Adapter -> DB
- Non-Analytics: Data Source Readers -> Loader -> Persistence Layer -> NHibernate -> SQL Server Adapter -> DB
As you can see, the analytics side of things is a lot cleaner. On the non-analytics side, maintaining, extending, and debugging these layers was becoming more and more difficult. We had already been planning a migration to full no-SQL, and when the fragility of the non-analytics storage stack started to impact development time, we had to address it.
In the end, re-implementing our back-end took roughly three days. The schema-less design of no-SQL databases makes object mapping significantly easier, putting the emphasis on serializing and de-serializing objects rather than on database-specific queries and commands. We also decided to stop abstracting queries (as we had with NHibernate), because every storage system has its own performance and feature idiosyncrasies and we had to write custom code anyway. The final stack ended up like this:
- Non-Analytics: Data Source Readers -> Loader -> MongoDB Adapter -> DB
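The serialization-centric approach can be sketched as follows (illustrative Python; the `Measurement` class and its fields are made up for the example). With a schema-less store, persistence reduces to turning an object into a document and back, with no mapping configuration in between:

```python
import json

# Illustrative: with a schema-less store, persistence is just
# object -> document -> object, with no mapping configuration.

class Measurement:
    def __init__(self, source, value, tags=None):
        self.source = source
        self.value = value
        self.tags = tags or []

def to_document(obj):
    # Serialize: the object's fields become the document, whatever they are.
    return dict(vars(obj))

def from_document(cls, doc):
    # De-serialize: rebuild the object straight from the document.
    return cls(**doc)

m = Measurement("sensor-7", 3.14, tags=["raw"])
doc = to_document(m)
# Round-trip through JSON to mimic going to the store and back.
round_trip = from_document(Measurement, json.loads(json.dumps(doc)))
print(round_trip.source, round_trip.value, round_trip.tags)  # sensor-7 3.14 ['raw']
```

Contrast this with an O-R mapper, where the same round trip requires mapping definitions, identity tracking, and per-schema configuration before a single object can be saved.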
One of the surprising consequences of this new approach was an across-the-board increase in performance. We had made our changes without really focusing on optimization (in fact, we could do a lot more with MongoDB), yet our loading and calculation speed went up by an order of magnitude in some cases – very unexpected! The more we can do with one machine, the less our customers have to pay for hardware, whether in the cloud or on-premise. This is a good thing indeed.
Why such big gains? NHibernate was built for a different kind of workload: a stable schema, many different kinds of operations, very high-level access, and complex queries. The overhead of tracking the link between objects in code and rows in the database becomes heavy when dealing with lots of objects, and it takes a great deal of special configuration to perform efficiently. Our system works with dynamic schemas, simple operations, low-level access, and incredibly high throughput. The same mismatch applied to the database itself – we were not really using the RDBMS the way it was meant to be used, because the custom performance optimizations required to do so would have made the PAF even more fragile.
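To illustrate the dynamic-schema point (field names are made up; a list of dicts stands in for a collection): documents with entirely different shapes can live side by side in one schema-less collection, whereas an RDBMS would need a schema migration, or a wide sparse table, for each new variation:

```python
# Illustrative: heterogeneous documents coexist in one schema-less
# collection; an RDBMS would need ALTER TABLE (or EAV workarounds)
# every time a data source introduced a new shape.
collection = []

collection.append({"_id": 1, "kind": "trade", "symbol": "XYZ", "qty": 100})
collection.append({"_id": 2, "kind": "quote", "symbol": "XYZ", "bid": 9.95, "ask": 10.05})
collection.append({"_id": 3, "kind": "audit", "user": "etl", "note": "nightly load"})

# Reads simply skip fields a document does not have.
symbols = {doc["symbol"] for doc in collection if "symbol" in doc}
print(sorted(symbols))  # ['XYZ']
```

For a loader ingesting disparate data sources, this is exactly the flexibility that removes the per-source mapping code.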
Configuration and deployment simplicity improvements were also immediately apparent. Because both of our storage needs are now handled by MongoDB and there is no NHibernate to configure, deploying the platform (once again, whether on-premise or in the cloud) became a one-step process. No more creating SQL Server databases, setting up permissions, setting up indexes, and so on – just hit “Recreate DB” in our Management interface, and within 5 seconds you have a fully set-up instance, ready to accept whatever Big Data problem you need solved!