Massive Technical Interviews Tips: Cassandra Storage Format

https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlHowDataWritten.html

Data (Data.db): The SSTable data
Primary Index (Index.db): Index of the row keys with pointers to their positions in the data file
Bloom filter (Filter.db): A structure stored in memory that checks if row data exists in the memtable before accessing SSTables on disk
Compression Information (CompressionInfo.db): A file holding information about uncompressed data length, chunk offsets and other compression information
Statistics (Statistics.db): Statistical metadata about the content of the SSTable
Digest (Digest.crc32, Digest.adler32, Digest.sha1): A file holding adler32 checksum of the data file
CRC (CRC.db): A file holding the CRC32 for chunks in an uncompressed file.
SSTable Index Summary (SUMMARY.db): A sample of the partition index stored in memory
SSTable Table of Contents (TOC.txt): A file that stores the list of all components for the SSTable TOC
Secondary Index (SI_.*.db): Built-in secondary index. Multiple SIs may exist per SSTable

http://distributeddatastore.blogspot.com/2013/08/cassandra-sstable-storage-format.html
Cassandra creates a new SSTable when the data of a column family in Memtable is flushed to disk. SSTable stands for Sorted Strings Table a concept borrowed from Google BigTable which stores a set of immutable row fragments in sorted order based on row keys. SSTable files of a column family are stored in its respective column family directory.

The data in a SSTable is organized in six types of component files. The format of a SSTable component file is
<keyspace>-<column family>-[tmp marker]-<version>-<generation>-<component>.db

<keyspace> and <column family> fields represent the Keyspace and column family of the SSTable, <version> is an alphabetic string which represents SSTable storage format version, <generation> is an index number which is incremented every time a new SSTable is created for a column family and <component> represents the type of information stored in the file. The optional "tmp" marker in the file name indicates that the file is still being created. The six SSTable components are Data, Index, Filter, Summary, CompressionInfo and Statistics.

create keyspace usertable with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:1};
use usertable;
create column family data with comparator=UTF8Type;

The SSTables under cassandra/data/usertable/data directory:

usertable-data-ic-1-CompressionInfo.db
usertable-data-ic-1-Data.db
usertable-data-ic-1-Filter.db
usertable-data-ic-1-Index.db
usertable-data-ic-1-Statistics.db
usertable-data-ic-1-Summary.db
usertable-data-ic-1-TOC.txt

In the above SSTable listing, the SSTable storage format version is ic, generation number is 1. The usertable-data-ic-1-TOC.txt contains the list of components for the SSTable.

Data file stores the base data of SSTable which contains the set of rows and their columns. For each row, it stores the row key, data size, column names bloom filter, columns index, row level tombstone information, column count, and the list of columns. The columns are stored in sorted order by their names. Filter file stores the row keys bloom filter.

Index file contains the SSTable Index which maps row keys to their respective offsets in the Data file. Row keys are stored in sorted order based on their tokens. Each row key is associated with an index entry which includes the position in the Data file where its data is stored. New versions of SSTable (version "ia" and above), promoted additional row level information from Data file to the index entry to improve performance for wide rows. A row's columns index, and its tombstone information are also included in its index entry. SSTable version "ic" also stores column names bloom filter in the index entry.

Summary file contains the index summary and index boundaries of the SSTable index. The index summary is calculated from SSTable index. It samples row indexes that are index_interval (Default index_interval is 128) apart with their respective positions in the index file. Index boundaries include the start and end row keys in the SSTable index.

CompressionInfo file stores compression metadata information that includes uncompressed data length, chuck size, and a list of the chunk offsets. Statistics file contains metadata for a SSTable. The metadata includes histograms for estimated row size and estimated column count. It also includes the partitioner used for distributing the key, the ratio of compressed data to uncompressed data and the list of SSTable generation numbers from which this SSTable is compacted. If a SSTable is created from Memtable flush then the list of ancestor generation numbers will be empty.

Version "ja"

Column names bloom filter, Column count fields are removed from SSTable data component file. The fields, localDeletionTime and markedForDeleteAt added to represent deletion timestamps of a row. For a row tombstone, localDeletionTime represents the timestamp in seconds at which a top level tombstone is created and markedForDeleteAt represents a timestamp in microseconds after which the data in a row should be considered as deleted. The default value for localDeletionTime is 0x7fffffff and the default value for markedForDeleteAt is 0x8000000000000000. If a row's deletion timestamps contain the default values then it contains live data.

Row size field is also removed from data component file. A special 0 length column name is added to represent the end of row. Row size is calculated from Index file. Index file contains starting offsets of each row. Successive offsets are used to calculate each row size except for the last row in SSTable. For the last row, end of data file is used to calculate its size.

Version "ka"

In version "ka", contents of Statistics component file is split in to 3 types of metadata called validation, compaction and stats. Validation metadata is used to validate SSTable which includes partitioner and bloom filter fp chance fields. Compaction metadata includes ancestors information which is also available in older formats and a new field called cardinality estimator. Cardinality estimator is used to efficiently pre-allocate bloom filter space in a merged compaction file by estimating how much the input SStables overlap. Stats metadata contains rest of the information available in old formats and two additional fields. First one is a flag to track the presence of local/remote counter shards and the other one is for storing the repair time.

sstable2json.py

sstable2json.py reads the rows and columns in a given SSTable and converts those to JSON format similar to sstable2json tool available in Cassandra distribution. It doesn't require access to Cassandra column families in system keyspace to decode SSTable data like sstable2json tool. It is tested with version "jb".

http://thelastpickle.com/blog/2016/03/04/introductiont-to-the-apache-cassandra-3-storage-engine.html

The first step in understanding the 3.x storage engine is to accept the new mental model. Those that knew the Thrift API prior to Cassandra 3.0 basically knew how the storage engine worked; each internal row was an ordered list of columns that optionally had a value. This was kind of fun to start with, then we started having more complicated use cases which lead to some common abstractions. These abstractions were formalised in CQL 3, which gave birth to terms such as Partition and Clusteringthat while they have become the lingua franca for Cassandra were not natively supported by the previous storage engine.

Starting with the 3.x storage engine Partitions, Rows, and Clustering are natively supported. A Partition is a collection of Rows that share the same Partition Key(s)that are ordered, within the Partition, by their Clustering Key(s). Rows are then by globally identified by their Primary Key: the combination of Partition Key and Clustering Key. The important change is that the 3.x storage engine now knows about these ideas, it may seem strange but previously it did not know about the Rows in a Partition. The new storage engine was created specifically to handle these concepts in a way that reduces storage requirements and improves performance.

Serialising to the -Data.db component of the SSTable starts with the Partition. When we decide to flush a Memtable to disk a call to Memtable.FlushRunnable.writeSortedContents() is made, this iterates the Partitions in the Memtable making a call for each that ends up at BigTableWriter.append(). As we are focusing on Partitions we will be looking at what happens whenColumnIndex.writeAndBuildIndex() is called.

At a broad level the layout of a Partition in the -Data.db file has three components: A header, following by zero or one static Rows, followed by zero or more Clusterableobjects.

Clusterable objects are simply Objects that can be ordered by a ClusteringPrefix. Generally that means something that extends Unfiltered, which has a handy Kindenum with two values: ROW or RANGE_TOMBSTONE_MARKER. Which means on disk a Partition consists of a header, followed by 1 or more Unfiltered objects.

Partition Header

The partition header has a simple layout, containing the Partition Key and deletion information.

The Partition Key is the concatenated Columns of the Partition Key defined in your Table.
- The length of the Partition Key is encoded using a short, giving it a max length of 65,535 bytes.
- The concatenated bytes of the Partition Key columns are then written out.
The DeletionTime for the partition contains deletion information for partition tombstones.
- DeletionTime.localDeletionTime is the server time in seconds when the deletion occurred, which is compared to gc_grace_seconds to decide when it can be purged. When used with a TTL the localDeletionTime is the time the data expires.
- DeletionTime.markedForDeleteAt is the timestamp of the deletion, data with a timestamp less than this value is considered deleted.

https://github.com/scylladb/scylla/wiki/SSTables-3.0-Data-File-Format

File Type	Typical Name	Description
Compression information	mc-1-big-CompressionInfo.db	Contains information about the compression algorithm, if used
Data file	mc-1-big-Data.db	Stores the actual data
Checksum file	mc-1-big-Digest.crc32	Contains a checksum of the data file
Bloom filter	mc-1-big-Filter.db	Contains a Bloom filter for quickly (but in a probabilistic way) checking whether particular data may be stored in the data file
Index file	mc-1-big-Index.db	Contains the primary index of data for easier search
Statistics file	mc-1-big-Statistics.db	Stores aggregated statistics about the data
Summary file	mc-1-big-Summary.db	Provides sampling of the index file (can be seen as "coarse-grained" index)
Table of contents	mc-1-big-TOC.txt	Lists all the files for the current SSTable

The storage format has been significantly revamped in Cassandra 3.0 compared to the 2.x series. The primary driver for that was the fact that the previous storage format has been devised long before CQL and did not reflect its concepts and abstractions. This, in turn, hindered various fixes and led to suboptimal disk space usage.

To understand what that means, you can refer to the SSTables 2.x data file format description. In brief, in 2.x every data file is a sequence of partitions (called "rows" in pre-CQL terminology) where each partition is more or less a sequence of cells (or, more precisely, atoms with a majority of them being cells). There is no notion of columns or rows (in CQL terminology) at this level. Each cell has its full name comprised of the clustering prefix (values of all the clustering columns) followed by the non-PK column name. This scheme causes a lot of duplication since for every row in a given partition determined by a set of clustering columns values those values are stored in names of all the cells that belong to this row. For composite primary keys with long clustering columns values that would mean a lot of disk space wasted for no good reason. Another consequence is that the storage engine has to recognize and group cells from the row itself without knowing in advance their count or overall size.

SSTables 3.0 format addresses this by re-arranging the way data is stored. Now every partition is comprised of rows. Each row is defined by its clustering columns values and consists of a number of cells that all share those clustering columns values as their name prefix. So now one can reason about the data file as a sequence of partitions where partition consists of rows that consist of cells, whereas before it was a sequence of partitions with each partition consisting of cells. This is oversimplified but helpful for understanding the high-level picture of changes.

SSTables 3.0 data file format relies heavily on the table schema. It contains information about PK and non-PK columns, corresponding data types and width (fixed-width vs. variable-width), clustering columns ordering and so on.

https://stackoverflow.com/questions/24949676/difference-between-partition-key-composite-key-and-clustering-key-in-cassandra

But the primary key can also be COMPOSITE (aka COMPOUND), generated from more columns.

 create table stackoverflow_composite (
      key_part_one text,
      key_part_two int,
      data text,
      PRIMARY KEY(key_part_one, key_part_two)      
  );

In a situation of COMPOSITE primary key, the "first part" of the key is called PARTITION KEY (in this example key_part_one is the partition key) and the second part of the key is the CLUSTERING KEY (in this example key_part_two)

Please note that the both partition and clustering key can be made by more columns, here's how:

 create table stackoverflow_multiple (
      k_part_one text,
      k_part_two int,
      k_clust_one text,
      k_clust_two int,
      k_clust_three uuid,
      data text,
      PRIMARY KEY((k_part_one, k_part_two), k_clust_one, k_clust_two, k_clust_three)      
  );

Behind these names ...

The Partition Key is responsible for data distribution across your nodes.
The Clustering Key is responsible for data sorting within the partition.
The Primary Key is equivalent to the Partition Key in a single-field-key table (i.e. Simple).
The Composite/Compound Key is just any multiple-column key

PRIMARY KEY (a): The partition key is a.
PRIMARY KEY (a, b): The partition key is a, the clustering key is b.
PRIMARY KEY ((a, b)): The composite partition key is (a, b).
PRIMARY KEY (a, b, c): The partition key is a, the composite clustering key is (b, c).
PRIMARY KEY ((a, b), c): The composite partition key is (a, b), the clustering key is c.
PRIMARY KEY ((a, b), c, d): The composite partition key is (a, b), the composite clustering key is (c, d).

https://docs.datastax.com/en/cql/3.3/cql/cql_using/refStaticCol.html

A table that does not define any clustering columns cannot have a static column. The table having no clustering columns has a one-row partition in which every column is inherently static.
A table defined with the COMPACT STORAGE directive cannot have a static column.
A column designated to be the partition key cannot be static.

Thursday, March 28, 2019

Cassandra Storage Format

Partition Header

Labels

Popular Posts