One size fits all or one size fits none? The relational database model was invented by Ted Codd in the early 1970’s. There were a few early prototypes built, but the model became mainstream when IBM built and marketed DB2 in 1984. By the early 1990’s, relational database systems began to be considered the standard and the only way to solve all database related problems. It didn’t matter what kind of workload one needed to support – relational databases were considered the panacea. In other words, a “one size fits all” concept evolved and stayed until mid to late 2000’s.
It was at this time that researchers and technology leaders started to question the concept. They realized that for every kind of problem, there is a better ways of solving it than the traditional relational database system. Said differently, they realized that “one size fits none”!
Modern Databases for Online Analytical Processing (OLAP) workloads:
Columnar Databases: Datawarehouse models, in a majority of cases, are represented as star schemas (or a variation thereof) with Fact tables at the Center connected to the Dimension tables via primary-key, foreign-key relationships. Analytical queries are written against the Fact tables and typically query and aggregate using only a small fraction of columns available in the tables. On the other hand, traditional relational databases are row-oriented, causing a dataset much larger than required, to get loaded into the memory from the disk. This is because the entire record needs to be loaded into the memory even when only a few columns of the record are referred to in the client query.
The modern take on improving the design is to shift the paradigm by 90 degrees and store data column-by-column rather than row-by-row on the disk. The approach offers multiple advantages over the traditional approach. First of all, it improves performance by allowing the system to load only the data actually required to serve the client query from the disk into the memory. Secondly, the data in a column can be encoded and compressed to a much higher degree than that in a row. This is because column data is of the same kind, whereas row data is not. Data compression leads to reduction in storage required, and reduces the amount of data that needs to be moved from disk to memory and vice versa. Thirdly, the query executor performs much better in a column-oriented database when the decision to select a value is dependent on only a few columns. Hence, in a majority of cases a columnar database suits an OLAP workload much better. HP’s Vertica, SAP’s HANA and Amazon’s Redshift are a few examples of such commercially available databases.
Modern Database Systems for Online Transaction Processing (OLTP) workloads:
- Main Memory Databases: Moore’s law, over the years, has allowed the amount of main memory available in database systems to increase exponentially. On the other hand, data needs of OLTP systems (leaving aside Facebook like systems) typically do not grow exponentially. The entire catalog of Amazon, for example, is of the order of 1 Terabyte and does not grow exponentially as the growth is usually linked with the growth of the company’s business. Surprisingly enough, 1 Terabyte of RAM costs $30,000 or less and is definitely not a big deal for a company like Amazon. It also is a fact that, on an average, around 30% of time spent on executing queries is wasted on managing the Buffer Cache of the database system. Therefore, many modern OLTP database systems attempt to load the entire database in the main memory and get rid of the concept of Buffer Cache and complexities associated with it such as encoding/decoding of data, and sending data back and forth between main memory and the disk.
- Single Threaded Databases: Row level locking is a necessary evil in traditional database systems. It is necessary because these systems are multi-threaded and allow multiple threads to attempt to update the same data at the same time. It is evil because managing locks and associated latches is time consuming. Locking and latching, on an average, take about 30% of time available to a database. This time should ideally be spent on running database queries. A primary reason that traditional databases are multi-threaded is that they don’t want to keep a transaction waiting while the one ahead of it is spending I/O cycles loading data into memory from the disk. Modern main memory-oriented databases, on the other hand, can afford to be single threaded, as they don’t need to send data back and forth between memory and the disk. These modern single threaded databases are therefore able to spend more time on running user queries. Volt DB is one such memory oriented, single threaded commercially available system.
- NoSQL Databases: Another category of modern databases is NoSQL databases. NoSQL databases can prove to be a good fit for certain types of OLTP workloads. They provide the ability to scale horizontally, are fault tolerant, provide a flexible schema, and can accommodate data other than relational such as JSON documents. They also support a variety of data models ranging from the most simple key-value pairs to the more sophisticated models such as document-oriented, column family and graph databases. However, one needs to tread carefully before going for a NoSQL database as they sacrifice certain essential database characteristics such as the ability to “join” and the ability to wrap multiple DML operations in a transaction. Cassandra, MongoDB and CouchDB are some of the popular NoSQL databases out there.
About Ashish Kashyap
Ashish Kashyap is a Senior IT Leader with over 15 years of software consulting experience in a variety of industries such as pharmaceutical, retail, airline, automobile, marketing, and utility. He is an experienced solutions architect and has designed & developed B2B and B2C kind of applications on a variety of platforms including J2EE, COTS, and SAAS.
More on Ashish Kashyap.