Massive Technical Interviews Tips: Columnar Databases

Friday, March 9, 2018

Columnar Databases

https://www.slideshare.net/straycat90/demystifying-columnar-databases
column-family store - not columnar store
data locality at the partition level

1219 -198.99-424.47
4976 + 500 - 4110 - 1,062.38 -10 -325

Each column is stored in a separate file
Each column for a given row is at the same offset (auto-indexing)
Only read the files you need
Also get improved compression because all data in one file is the same data type.
Columnar databases produce automatic vertical partitioning

InfiniDB also automatically creates horizontal partitions of 8 million rows (default)
Knowing what values are in each partition allows for partition elimination at query time

Bonus: Easy to Add a New Column
Row-oriented: Usually requires rebuilding table
Addition of column shifts every row

Column-oriented: Just create another file

Because of the nature of columnar storage, singlerowoperations can underperform.
Do not attempt OLTP-style transactions on a columnar database

Columnar: new value must be added to each file
Do batch inserts and use cpimport, the bulk loader, instead.
CPIMPORT is your friend.

http://guyharrison.squarespace.com/blog/2009/9/2/columnar-compression-in-11gr2.html

In these scenarios storing the data for a specific column in the same block is far more effective, since data warehousing queries usually aggregate within columns and across rows.

One of the big advantages of column-oriented databases is that very high compression rates can be achieved. Compression works best when repeated patterns are present. The repetition within columns is almost always far greater than the repetition within rows. For instance a SURNAME column can be expected to have many repeated entries, whereas there will be little or no repetition between a SURNAME and a DATE_OF_BIRTH column. Column orientation therefore allows for very high compression ratios to be achieved with only modest CPU requirements.

The big problem with column-oriented database is that DML is very expensive. A single row insert must update all of the blocks containing the columns for that row. In a columnar database, that multiples the write overhead dramatically.

https://www.dunnsolutions.com/content/bi-blog/-/blogs/column-based-or-row-based-that-is-the-question-scene-2

Columnar Benefits

Columnar databases typically provide better query performance when the number of fields required for the query are a subset of the transaction’s fields
Columnar databases can provide more efficient use of space (data compression)

Columnar Drawbacks

Writing transactional data to a columnar database takes more time than to a row-based database (warning, don’t use a columnar store for your transactional systems)
If your query contains all the columns in a transaction, your query will typically perform worse compared to a row-based database

https://www.dunnsolutions.com/content/bi-blog/-/blogs/column-based-or-row-based-that-is-the-question-scene-1

There are a variety of columnar databases: Sybase IQ, InfoBright, and Redshift, to name a few (there are many others).

a columnar database is, simply a database that stores data in a columns rather than rows, and from the outside looks like your typical row-based DBMS

But what if we wanted the total sale for every transaction for the product SKU5678? This is a different type of question. We are not interested in the details of a given transaction, but the sum of the Total Sale of transactions that involved the specific Product SKU5678.

http://www.data-design.org/blog/columnar-database

One of the main benefits of a columnar database is that data can be highly compressed. The compression permits columnar operations — like MIN, MAX, SUM, COUNT and AVG— to be performed very rapidly.

http://db.csail.mit.edu/pubs/abadi-column-stores.pdf

https://en.wikipedia.org/wiki/Column-oriented_DBMS

A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on. For our example table, the data would be stored in this fashion:

10:001,12:002,11:003,22:004;
Smith:001,Jones:002,Johnson:003,Jones:004;
Joe:001,Mary:002,Cathy:003,Bob:004;
40000:001,50000:002,44000:003,55000:004;

In this layout, any one of the columns more closely matches the structure of an index in a row-based system. This may cause confusion that can lead to the mistaken belief a column-oriented store "is really just" a row-store with an index on every column. However, it is the mapping of the data that differs dramatically. In a row-oriented indexed system, the primary key is the rowid that is mapped from indexed data. In the column-oriented system, the primary key is the data, which is mapped from rowids.^[3] This may seem subtle, but the difference can be seen in this common modification to the same store:

…;Smith:001;Jones:002,004;Johnson:003;…

Whether or not a column-oriented system will be more efficient in operation depends heavily on the workload being automated. Operations that retrieve all the data for a given object (the entire row) are slower. A row-based system can retrieve the row in a single disk read, whereas numerous disk operations to collect data from multiple columns are required from a columnar database. However, these whole-row operations are generally rare. In the majority of cases, only a limited subset of data is retrieved. In a rolodex application, for instance, collecting the first and last names from many rows to build a list of contacts is far more common than reading all data for any single address. This is even more true for writing data into the database, especially if the data tends to be "sparse" with many optional columns. For this reason, column stores have demonstrated excellent real-world performance in spite of many theoretical disadvantages.^[4]

https://mariadb.com/products/technology/columnstore

To be clear, a columnar (or column-oriented) database is a relational database system that stores data grouped by each column rather than each row, making massive analytical queries very fast while also offering great compression. A column-family database like Cassandra is completely different and is better described as an advanced key/value store.

Some options for columnar databases available today:

MariaDB (which was forked from mysql) recently released their ColumnStore storage engine.
PostgreSQL has the cstore_fdw extension from the great team at Citus. You can combine this with their main Citus sharding extension to get scale-out column-oriented storage with postgres, but will take much more effort to maintain.
Greenplum is based on the older Postgres 8.x branch and turned into a distributed database that can combine column and row oriented tables along with other unstructured data processing.

http://www.timestored.com/time-series-data/column-oriented-databases
https://loonytek.com/2017/05/04/why-analytic-workloads-are-faster-on-columnar-databases/

Friday, March 9, 2018

Columnar Databases

Labels

Popular Posts