Massive Technical Interviews Tips: Big table

Monday, October 19, 2015

Big table | miafish

What is big table?

A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

Map: Big Table is a collection of (key, value)pairs where the key identifies a row and the value is the set of columns.

Persistent: the data is stored persistently on disk.

Distributed: Big table's data is distributed among many independent machines. The table is broken up among rows, with groups of adjacent rows managed by a server. A row itself is never distributed.

Sparse: the table is sparse, meaning that different rows in a table may use different columns, with many of the columns empty for a particular row.

multidimensional: A table is indexed by rows. Each row contains one or more named column families. Column families are defined when the table is first created. Within a column family, one may have one or more named columns. All data within a column family is usually of the same type. The implementation of BigTable usually compresses all the columns within a column family together. Columns within a column family can be created on the fly. Rows, column families and columns provide a three-level naming hierarchy in identifying data. For example:

edu.rutgers.cs" : { // row "users" : { // column family "watrous": "Donald", // column "hedrick": "Charles", // column "pxk" : "Paul" // column } "sysinfo" : { // another column family "" : "SunOS 5.8" // column (null name) } }

To get data from BigTable, you need to provide a fully-qualified name in the form column-family:column. For example, users:pxk or sysinfo:. The latter shows an null column name.

time-based: time is another dimension in Big Table data. Every column family may keep multiple versions of column family data. if an application does not specify a time-stamp. it will retrieve the latest version of the column family. Alternatively, it can specify a time-stamp and get the latest version that is earlier than or equal to that time-stamp.

Data Model

The map is indexed by a row key, column key and a time stamp.

row key:

column keys:

time stamp:

Implementation:

Locating rows within a big table is managed in a three-level hierarchy.

file stored in Chubby that contains the location of the root tablet(first tablet in the metadata table).
the root tablet contains the location of all tablets in a special metadata table.
each metadata tablet contains the location of a set of user tablets.

Advantages

can grow to immense size with storage distributed across a large number of servers.
Join operations are less costly because of the denormalization

Read full article from Big table | miafish