Microsoft’s new SQL Server 2012 with its columnar capabilities along with already existing column oriented databases brings new opportunities to flash base storage in the data center. Along with the growing demand for big data in the cloud and massive analytic requirements, flash based storage and cache can assist to further increase performance of analytical processing.
What is Column Oriented Database?
A column-oriented DBMS is a database management system (DBMS) that stores its content by column rather than by row. This has advantages for data warehouses and library catalogues where aggregates are computed over large numbers of similar data items.
In practice, row-oriented architectures are well-suited for OLTP-like workloads which are more heavily loaded with interactive transactions. Column stores are well-suited for OLAP-like workloads (e.g., data warehouses) which typically involve a smaller number of highly complex queries over all data (possibly terabytes).
Up to now, column oriented DBMS were a niche segment in the OLAP world. Although used by system vendors in the OLAP benchmarks, such as TPC-H to demonstrate system capabilities for record breaking, mainstream DMBS such as Microsoft SQL Server 2008 and Oracle 11 still used row oriented architecture.
Commercial columnar DBMS such as HP-Vertica, VectorWise, Sybase IQ and more were used mainly for high analysis environments, usually aside with the main organization’s DBMS. However, the growth of data and the demand for advanced analytics, data warehousing and business intelligence along with the rise of Big Data in cloud environments, have moved more and more organizations to adopt column oriented architecture.
These market demands have caused Microsoft to integrated “Project Apollo” to the new SQL Server 2012 (currently available as RC0 version). Although not a column oriented DBMS (as the above examples), the new SQL server 2012 offers “Column Store Indexes“.
A columnstore index stores each column in a separate set of disk pages, rather than storing multiple rows per page as data traditionally has been stored. In case of the row store indexes multiple pages will contain multiple rows of the columns spanning across multiple pages. In case of column store indexes multiple pages will contain multiple single columns. This will lead only the columns needed to solve a query will be fetched from disk. Additionally there is good chance that there will be redundant data in a single column which will further help to compress the data.
Flash Based Storage Opportunities
So, what are the opportunities for flash base storage with column oriented architecture? For this, let’s review the main features and see how they suit the flash based architecture:
Read Only –OLAP processing is a read only application by its nature. This is ideal for flash storage, both as a primary storage (operates better in reads, no endurance problems) and as cache tier (the cache applications such as Fusion-IO’s IO Turbine, EMC Lightning and OCZ’s VXL operate in a write through mode).
Compressed Data – the columnar nature of the records storage enable high compression of the information. Thus, the database can fit in the more expensive flash storage, both as primary storage or as a cache tier.
Random Workload – as compressed and non-sequenced, the column storage of the information dictates more random and smaller type of read workload. As indexes are spread over the DB and requires less read (does not require massive pages read), the result workload is more random, less sequential and with smaller commands – ideal for flash storage.
Higher Entropy – for optimization, data is further stored with the indexes, causing query process to derive its data from indexes’ search. As a result, the DB information is segmented with high access pick areas and massive low access areas. This histogram is ideal for flash cache application, caching the hot zones of the database (the areas that stores the column indexes).
To demonstrate the capabilities of column oriented databases, we take the TPC-H benchmark, an OLAP benchmark with 22 queries simulating a suite of business oriented ad-hoc queries and concurrent data modifications. The benchmark specifies several database sizes, from 100G to 30T. Our examples relate to the standard 100G benchmark.
SQL Server 2012 (RC0)
As mentioned, the new 2012 SQL Server is not a true column oriented database but supports limited column indexes (one per table). Furthermore, after inserting the column indexes, data cannot be inserted to the database. However, even this limited addition has a huge impact on the performance.
As can be seen, moving from SQL Server 2008 row indexing to SQL Server 2012 with columnstore indexes reduces dramatically the query process time. The total query time reduces from 8047 to 2138, almost 4 times faster. Using flash storage (OCZ Z Driver R4) reduces the overall query time to 618 seconds – 4x improvement.
To see how it is done, we should look at the workload analysis of the server. As can be seen in the histogram (processing Query 1):
The database is divided into two major areas – the data area that stores the records and the index area that stores the columnstore indexes (and some attached data). The size of the data area is about 100G (as regular database) and the index area is about 25G. Obviously, the temperature (i.e., access frequency of the space) of the index area is much higher (10x) than the data area. Hence, flash caching software should mainly warm (pre-fetch) the index area to achieve more than 90% hit ratio. Using OCZ’s VXL cacheing software in VMware environment, the total query time reduces from 2138 to 828 – 4x improvement.
In this example, a true columnar database is tested. HP-Vertica is high performance and high compressed analytical database. Here, no indexes need to be defined and data is inserted in a columnar way. As can be seen from the histogram:
- Database is highly compressed – about 25G comparing to the 160G of SQL server (with indexes).
- Indexes area is 10x hotter than data area. Also, index area is much more narrow than 2012.
Again, flash cache applications can take advantage of this structure and gain high hit ratio. Furthermore, with the high compression (only 25G for 100G database), the entire database can reside in flash and achieve high performance.
Clearly, column oriented databases will enter more and more to data centers and cloud environments, both via mainstream DBMS (i.e., Microsoft) and smaller vendors (e.g., HP-Vertica). The growing demand for analytical services and big data will also push flash based storage either as primary tier or cache to provide the high performance required.