Friday, October 23, 2015

Google Bigtable Architecture



Self-study notes on distributed system design
BigTable: a column-based NoSQL database built on top of GFS
Note: two common kinds of NoSQL store in big data systems: 1) key/value based, like Redis or MemcacheDB, and 2) column based, like BigTable
SSTable data structure:
1. Row (key): a string of up to 64 KB. Within an SSTable, entries are sorted by row key; hence the name, sorted string table.
2. Column keys: a set, which supports range queries.
3. Value: two parts, the value itself and a timestamp, which is a 64-bit integer.

SSTable file format: immutable by design, append-only; the timestamp decides which value is the latest.
1. Split into data blocks, each around 64 KB.
2. A block index is stored at the end of the SSTable.
3. To query an SSTable, load the index first and then binary-search it (the table cannot simply be put in memory and queried with a hash table!); see the sketch below.
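A minimal sketch of that index-then-binary-search lookup, assuming the block index is just the largest row key of each block; the class name and in-memory layout are illustrative, not the actual on-disk format.

import bisect

class SSTable:
    # Illustrative in-memory model of an SSTable: sorted, immutable blocks
    # plus a block index (loaded from the end of the file in the real format).
    def __init__(self, blocks):
        # blocks: list of dicts, each holding a sorted run of row_key -> value
        self.blocks = blocks
        # Block index: the last (largest) row key in each block.
        self.index = [max(block) for block in blocks]

    def get(self, row_key):
        # Binary-search the index to find the only block that could hold the key.
        i = bisect.bisect_left(self.index, row_key)
        if i == len(self.blocks):
            return None
        return self.blocks[i].get(row_key)

# Usage: two "blocks" covering keys a..c and d..e.
sst = SSTable([{"a": 1, "b": 2, "c": 3}, {"d": 4, "e": 5}])
print(sst.get("b"), sst.get("e"), sst.get("z"))  # 2 5 None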
Bigtable system components:
1. Master server: checks whether tablet servers are alive, handles tablet assignment (sharding) over GFS, and so on. In Bigtable, clients do not talk to the master for data; they go through the lock service.
2. Lock server: implemented with Chubby (or ZooKeeper); provides locking across processes, and its metadata is used to find which server holds a given key.
3. Tablet server: handles read and write operations.
Write path (see the sketch after this list):
1. The client asks the lock service.
2. The lock service returns a tablet server and locks it.
3. Go to the tablet server and write to 1) the commit log (write-ahead log) on disk and 2) the memtable in memory. If memory is lost, recover from the log.
4. Minor compaction: when the memtable hits a size threshold, freeze it and write it to GFS as an SSTable.
5. Major compaction: merge SSTables; for the same row key, keep the most recent record.
6. Finally, the client asks the lock service to unlock the tablet server and update the metadata.
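A minimal sketch of the tablet-server write path described above: append to a write-ahead log, apply to the memtable, and flush the frozen memtable as an SSTable when a threshold is hit. The file name, threshold, and data shapes are made up for illustration.

import json, time

class TabletServer:
    MEMTABLE_LIMIT = 4  # illustrative threshold; real systems flush at megabytes

    def __init__(self, log_path="commit.log"):
        self.log = open(log_path, "a")   # write-ahead log on disk
        self.memtable = {}               # in-memory buffer, sorted at flush time
        self.sstables = []               # flushed, immutable SSTables

    def write(self, row_key, column, value):
        ts = time.time_ns()
        # 1) Append to the commit log first so the write survives a crash.
        self.log.write(json.dumps([row_key, column, ts, value]) + "\n")
        self.log.flush()
        # 2) Apply to the in-memory memtable.
        self.memtable[(row_key, column)] = (ts, value)
        # 3) Minor compaction: freeze and flush when the memtable is too big.
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            frozen = dict(sorted(self.memtable.items()))
            self.sstables.append(frozen)   # stand-in for writing an SSTable to GFS
            self.memtable = {}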
Read path (see the sketch after this list):
1. The client asks the lock service.
2. Check the metadata, return the tablet server, and lock it.
3. Go to the tablet server and check the memtable first.
4. If the key is not in the memtable, check the SSTables one by one.
5. For each SSTable, check its index first. The worst-case time is O(m log k), where m is the number of SSTables and k is the length of each one.
6. Return the value and unlock the server.
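A minimal sketch of the read path above, assuming the memtable and each flushed SSTable are dicts keyed by (row, column) holding (timestamp, value), as in the write-path sketch; a real lookup would binary-search each SSTable's block index rather than probe a dict.

def read(row_key, column, memtable, sstables):
    # memtable: {(row, col): (ts, value)}; sstables: newest-first list of the
    # same shape, standing in for index + block lookups on GFS.
    key = (row_key, column)
    if key in memtable:            # 1) check the memtable first
        return memtable[key][1]
    for sst in sstables:           # 2) then each SSTable, newest first
        if key in sst:             # a real lookup binary-searches the block index
            return sst[key][1]
    return None                    # not found anywhere

# Example: a value that was already flushed to an SSTable.
print(read("com.baidu.www", "contents:", {},
           [{("com.baidu.www", "contents:"): (1, "<html>...")}]))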
Other important optimizations for scale:
1. Read caches on the tablet server: 1) a scan cache keeps key/value pairs that have already been read, to speed up repeated reads; 2) a block cache keeps SSTable blocks read from GFS, to speed up reads of nearby keys.
2. Bloom filters: each SSTable gets a Bloom filter (a bit array set by several hash functions). If a key does not pass an SSTable's Bloom filter, there is no need to read that SSTable's index at all, which saves the binary-search cost (see the sketch below).
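A toy Bloom filter showing why it saves the index lookup: if any of the hashed bit positions is unset, the key is definitely not in that SSTable, so the index need not be read. The filter size and hash count here are arbitrary.

import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # False means "definitely absent": skip this SSTable entirely.
        # True means "possibly present": go on to the index binary search.
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("com.baidu.www")
print(bf.might_contain("com.baidu.www"), bf.might_contain("com.douban.www"))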
https://www.cs.rutgers.edu/~pxk/417/notes/content/bigtable.html
Bigtable is designed with semi-structured data storage in mind. It is a large map that is indexed by a row key, column key, and a timestamp. Each value within the map is an array of bytes that is interpreted by the application. Every read or write of data to a row is atomic, regardless of how many different columns are read or written within that row.


map
map is an associative array; a data structure that allows one to look up a value to a corresponding key quickly. Bigtable is a collection of (key, value) pairs where the key identifies a row and the value is the set of columns.
distributed
Bigtable's data is distributed among many independent machines. At Google, Bigtable is built on top of GFS (Google File System). The Apache open source version of Bigtable, HBase, is built on top of HDFS (Hadoop Distributed File System) or Amazon S3. The table is broken up among rows, with groups of adjacent rows managed by a server. A row itself is never distributed.
sparse
The table is sparse, meaning that different rows in a table may use different columns, with many of the columns empty for a particular row.
sorted
Most associative arrays are not sorted. A key is hashed to a position in a table. Bigtable sorts its data by keys. This helps keep related data close together, usually on the same machine — assuming that one structures keys in such a way that sorting brings the data together. For example, if domain names are used as keys in a Bigtable, it makes sense to store them in reverse order to ensure that related domains are close together.
multidimensional
A table is indexed by rows. Each row contains one or more named column families. Column families are defined when the table is first created. Within a column family, one may have one or more named columns. All data within a column family is usually of the same type. The implementation of Bigtable usually compresses all the columns within a column family together. Columns within a column family can be created on the fly. Rows, column families and columns provide a three-level naming hierarchy in identifying data. For example:
edu.rutgers.cs" : { // row "users" : { // column family "watrous": "Donald", // column "hedrick": "Charles", // column "pxk" : "Paul" // column } "sysinfo" : { // another column family "" : "SunOS 5.8" // column (null name) } }
To get data from Bigtable, you need to provide a fully-qualified name in the form column-family:column. For example, users:pxk or sysinfo:. The latter shows a null column name.
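The row above can be modeled as nested maps; this is a purely illustrative sketch of resolving a fully-qualified column-family:column name against it, not an actual Bigtable API.

row = {
    "users":   {"watrous": "Donald", "hedrick": "Charles", "pxk": "Paul"},
    "sysinfo": {"": "SunOS 5.8"},
}

def get(row, qualified_name):
    # Split "family:column" at the first colon; an empty column name is allowed.
    family, _, column = qualified_name.partition(":")
    return row.get(family, {}).get(column)

print(get(row, "users:pxk"))   # Paul
print(get(row, "sysinfo:"))    # SunOS 5.8  (null column name)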

time-based
Time is another dimension in Bigtable data. Every column family may keep multiple versions of column family data. If an application does not specify a timestamp, it will retrieve the latest version of the column family. Alternatively, it can specify a timestamp and get the latest version that is earlier than or equal to that timestamp.
In Bigtable, however, there is no type associated with the column. It is just a bunch of bytes. The data in a column family may also be large, as in the contents column family. The anchor column family illustrates the extra hierarchy created by having columns within a column family. It also illustrates the fact that columns can be created dynamically (one for each external anchor), unlike column families. Finally, it illustrates the sparse aspect of Bigtable. In this example, the list of columns within the anchor column family will likely vary tremendously for each URL. In all, we may have a huge number (e.g., hundreds of thousands or millions) of columns but the column family for each row will have only a tiny fraction of them populated. While the number of column families will typically be small in a table (at most hundreds), the number of columns is unlimited.
A table is logically split among rows into multiple subtables called tablets. A tablet is a set of consecutive rows of a table and is the unit of distribution and load balancing within Bigtable. Because the table is always sorted by row, reads of short ranges of rows are efficient: one typically communicates with a small number of machines. Hence, a key to ensuring a high degree of locality is to select row keys properly (as in the earlier example of using domain names in reverse order).

Timestamps
A table is configured with per-column-family settings for garbage collection of old versions. A column family can be defined to keep only the latest n versions or to keep only the versions written since some time t.
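A small sketch of the two per-column-family garbage-collection policies just described (keep only the latest n versions, or only versions written since time t), applied to a list of (timestamp, value) cell versions; the function names are made up.

def gc_keep_last_n(versions, n):
    # versions: list of (timestamp, value); keep only the n newest.
    return sorted(versions, key=lambda v: v[0], reverse=True)[:n]

def gc_keep_since(versions, t):
    # Keep only versions written at or after time t.
    return [v for v in versions if v[0] >= t]

cell = [(100, "a"), (200, "b"), (300, "c")]
print(gc_keep_last_n(cell, 2))   # [(300, 'c'), (200, 'b')]
print(gc_keep_since(cell, 150))  # [(200, 'b'), (300, 'c')]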
Implementation
Bigtable comprises a client library (linked with the user's code), a master server that coordinates activity, and many tablet servers. Tablet servers can be added or removed dynamically.

The master assigns tablets to tablet servers and balances tablet server load. It is also responsible for garbage collection of files in GFS and managing schema changes (table and column family creation).

Each tablet server manages a set of tablets (typically 10-1,000 tablets per server). It handles read/write requests to the tablets it manages and splits tablets when a tablet gets too large. Client data does not move through the master; clients communicate directly with tablet servers for reads/writes. The internal file format for storing data is Google's SSTable, which is a persistent, ordered, immutable map from keys to values.

Bigtable uses the Google File System (GFS) for storing both data files and logs. A cluster management system contains software for scheduling jobs, monitoring health, and dealing with failures.
Chubby is a highly available and persistent distributed lock service that manages leases for resources and stores configuration information. The service runs with five active replicas, one of which is elected as the master to serve requests. A majority must be running for the service to work. Paxos is used to keep the replicas consistent. Chubby provides a namespace of files & directories. Each file or directory can be used as a lock.
In Bigtable, Chubby is used to:
  • ensure there is only one active master
  • store the bootstrap location of Bigtable data
  • discover tablet servers
  • store Bigtable schema information
  • store access control lists

A table starts off with just one tablet. As the table grows, it is split into multiple tablets. By default, a table is split at around 100 to 200 MB.
Locating rows within a Bigtable is managed in a three-level hierarchy. The root (top-level) tablet stores the location of all Metadata tablets in a special Metadata table. Each Metadata tablet contains the location of user data tablets. This table is keyed by an encoding of a tablet's table ID and its end row. For efficiency, the client library caches tablet locations.
A tablet is assigned to one tablet server at a time. Chubby keeps track of tablet servers. When a tablet server starts, it creates and acquires an exclusive lock on a uniquely-named file in a Chubby servers directory. The master monitors this directory to discover new tablet servers. When the master starts, it:
  • grabs a unique master lock in Chubby (to prevent multiple masters from starting)
  • scans the servers directory in Chubby to find live tablet servers
  • communicates with each tablet server to discover what tablets are assigned to each server
  • scans the Metadata table to learn the full set of tablets
  • builds a set of unassigned tablets, which are eligible for tablet assignment
https://dzone.com/articles/understanding-hbase-and-bigtab
A Bigtable can be configured for replication to multiple Bigtable clusters in different data centers to ensure availability. Data propagation is asynchronous and results in an eventually consistent model.

A Bigtable is a sparse, distributed, persistent multidimensional sorted map.
The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

The table is stored sparsely, so that rows in the same table can have crazily-varying columns, if the user likes.
At its core, Hbase/BigTable is a map.

sorted
Unlike most map implementations, in Hbase/BigTable the key/value pairs are kept in strict alphabetical order. That is to say that the row for the key "aaaaa" should be right next to the row with key "aaaab" and very far from the row with key "zzzzz".

Because these systems tend to be so huge and distributed, this sorting feature is actually very important. The spatial propinquity of rows with like keys ensures that when you must scan the table, the items of greatest interest to you are near each other.

This is important when choosing a row key convention. For example, consider a table whose keys are domain names. It makes the most sense to list them in reverse notation (so "com.jimbojw.www" rather than "www.jimbojw.com") so that rows about a subdomain will be near the parent domain row.
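A one-liner for the reverse-domain key convention mentioned above; reversing the dot-separated labels keeps all rows for a domain and its subdomains adjacent in the sorted key space.

def domain_row_key(domain):
    # "www.jimbojw.com" -> "com.jimbojw.www"
    return ".".join(reversed(domain.split(".")))

print(sorted(domain_row_key(d) for d in
             ["www.jimbojw.com", "blog.jimbojw.com", "www.example.org"]))
# ['com.jimbojw.blog', 'com.jimbojw.www', 'org.example.www']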
multidimensional
A table's column families are specified when the table is created, and are difficult or impossible to modify later. It can also be expensive to add new column families, so it's a good idea to specify all the ones you'll need up front.
Fortunately, a column family may have any number of columns, denoted by a column "qualifier" or "label".
While the column families are static, the columns themselves are not.
In this case, the "zzzzz" row has exactly one column, "A:catch_phrase". Because each row may have any number of different columns, there's no built-in way to query for a list of all columns in all rows. To get that information, you'd have to do a full table scan. You can however query for a list of all column families since these are immutable (more-or-less).
But How Do They Scale To Petabytes?
First of all, the rows are maintained in alphabetic order (which the paper calls lexicographic order) by row name. The row ranges are broken up into partitions called tablets. Tablets are distributed across servers for load balancing.
So Bigtable Stores Everything Forever . . .
Bigtable allows applications to tell columns to keep either the last n versions of a cell or only versions written in the last n days. The old data eventually gets garbage-collected out of Bigtable. Data can be kept indefinitely, too.

You can read, write, update, delete and control access. You can limit the rows, columns and timestamps produced by a scan of the data. There is also a scripting language, with the great name Sawzall, that enables a variety of data transformations, filtering and summarization.

  • Cluster management. Google has a cluster management system, so far publicly undocumented (maybe they're embarrassed), that schedules, monitors and manages the Bigtable cluster.
  • SSTable. This is the underlying file format used to store Bigtable data. SSTables are designed so that a data access requires, at most, a single disk access. An SSTable, once created, is never changed. If new data is added, a new SSTable is created. Once an old SSTable is no longer needed, it is set out for garbage collection. SSTable immutability is at the core of Bigtable’s data checkpointing and recovery routines.
  • Chubby. Cute name, huh? Chubby is the distributed lock server that allows a multi-thousand node Bigtable cluster to stay coordinated. Chubby itself is a cluster app that maintains five active replicas, one of which is the master. Like GFS’s master node, Chubby is architected to keep lock management traffic very light. Chubby also rules over tablet server life and death, stores access control lists, data schemas and the bootstrap location of Bigtable data.
Bigtable’s three-level addressing scheme accomplishes that while supporting a huge address space.
  • Chubby stores the location of the root tablet
  • The root tablet contains the addresses of all metadata tablets
  • The metadata tablets contain the locations of a group of user tablets
This simple scheme allows a single Bigtable to address over 16 billion tablets.
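The "over 16 billion" figure follows from the paper's numbers: each METADATA row is roughly 1 KB and metadata tablets are limited to about 128 MB, so each metadata level fans out by about 2^17. A quick arithmetic check:

rows_per_metadata_tablet = (128 * 2**20) // (1 * 2**10)   # 128 MB / 1 KB = 2**17
addressable_tablets = rows_per_metadata_tablet ** 2        # two metadata levels
print(addressable_tablets)   # 17179869184 = 2**34 user tablets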
http://storagemojo.com/2006/09/08/google%E2%80%99s-bigtable-distributed-storage-system-pt-ii/
Tablets, which are collections of rows, are stored in SSTables. Once an SSTable is written, it is never changed. If the data in the tablet changes, a new SSTable is generated.


Tablet readers and writers need to have permission, which is checked against a list maintained in Chubby. Writes are committed to the log where they are grouped for efficient writing. Reads are done against a merged view of the data. Since the rows are sorted alphabetically it is easy for the system to form the merged view.
Bigtable uses a number of strategies to control and reduce memory and storage consumption. For example, the contents of critical memory buffers get committed to SSTables when they grow too large.
Client apps can optionally specify compression for SSTables, and can even specify what compression to use. The results can be pretty surprising. For example, due to the repetition across web pages in a website and the types of compression used, compression ratios of 10:1 are common. All of a sudden 10 billion web pages take up no more storage than 1 billion uncompressed web pages.
Everything breaks.
Understand the use of new features. Add only what users really need, not a generalized solution to the entire problem space
System-level monitoring. You can't fix it if you don't know it's broken. Monitor not only Bigtable, but also Chubby, GFS and client processes.
Simple designs are good. For example, only rely upon widely-used features in your code and in the systems you interact with.
http://www.slideshare.net/vanjakom/google-bigtable-paper-presentation-5078506
HBase - Bigtable synonyms
- Bigtable ~ HBase
- tablet ~ region
- Master server ~ HBase master
- Tablet server ~ HBase Region server
- Chubby ~ Zookeeper ???
- GFS ~ HDFS
- MapReduce ~ MapReduce :)
- SSTable file ~ MapFile file

- A MapFile's index is stored in a separate file instead of at the end of the file as in an SSTable
- Unlike Bigtable, which identifies a row range by the table name and end key, HBase identifies a row range by the table name and start key
- When the HBase master dies, the cluster will shut down

Others:
http://hypertable.com/documentation/architecture/

PDF:
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf
http://systemdesigns.blogspot.com/2016/01/bigtable.html
Paper: http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
Bigtable is a distributed storage system for managing structured data that is designed to scale to petabyte-size data across many commodity servers.

It is not a relational database; it is just a table. However, it is designed to work on a huge scale.
It is a large map that is indexed by a row key, column key, and a timestamp:

(row:string, column:string, time:int64) → string



Properties of Bigtable

· map

Bigtable is a collection of (key, value) pairs where the key identifies a row and the value is the set of columns.

· persistent

Data is stored persistently on disk.

· distributed

Data is distributed across many servers. The table is broken up among rows, with groups of adjacent rows managed by a server. A row itself is never distributed. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the smallest unit of distribution and load balancing.

· sparse

Different rows in a table may use different columns, with many of the columns empty for a particular row.

· sorted

Bigtable sorts its data by keys. This helps keep related data close together, usually on the same machine. For example, if domain names are used as keys, they can be stored in reverse order so that related domains end up close together, as below:
  • edu.rutgers.cs
  • edu.rutgers.nb
  • edu.rutgers.www

· multidimensional

A table is indexed by rows. Each row contains one or more named column families. Column families are defined when the table is first created.

Within a column family, one may have one or more named columns. All data within a column family is usually of the same type. The columns within a column family are usually compressed together; the structure is as in the earlier edu.rutgers.cs example.
Retrieving data requires a fully-qualified name of the form column-family:column, for example users:pxk.

· time-based

Time is another dimension. Each column family can store multiple versions of column family data. If the application does not specify a timestamp, the latest version of the column family is read; alternatively, it can specify a timestamp and get the latest version that is earlier than or equal to that timestamp.

Columns and Column Families


In BigTable, there is no type associated with the column. It is just a bunch of bytes.



Rows and partitioning

A table is logically split among rows into multiple subtables called tablets. A tablet is a set of consecutive rows of a table and is the unit of distribution and load balancing within Bigtable. Because the table is always sorted by row, reads of short ranges of rows are efficient: one typically communicates with a small number of machines. Hence, a key to ensuring a high degree of locality is to select row keys properly (as in the earlier example of using domain names in reverse order).

Every read or write of a row is atomic (either all of it occurs, or none of it does).

Implementation

Bigtable comprises:

1. a client library (linked with the user's code),

2. a master server that coordinates activity, and

3. many tablet servers, which can be added or removed dynamically.

The master assigns tablets to tablet servers and balances tablet server load. It is also responsible for garbage collection of files in GFS and managing schema changes (table and column family creation).

Each tablet server manages a set of tablets (typically 10-1,000 tablets per server). It handles read/write requests to the tablets it manages and splits tablets when a tablet gets too large. Client data does not move through the master; clients communicate directly with tablet servers for reads/writes. The internal file format for storing data is Google's SSTable, which is a persistent, ordered, immutable map from keys to values.

Bigtable uses the Google File System (GFS) for storing both data files and logs. A cluster management system contains software for scheduling jobs, monitoring health, and dealing with failures.

Chubby is a highly available and persistent distributed lock service that manages leases for resources and stores configuration information. The service runs with five active replicas, one of which is elected as the master to serve requests. A majority must be running for the service to work. Paxos is used to keep the replicas consistent. Chubby provides a namespace of files & directories. Each file or directory can be used as a lock.

In Bigtable, Chubby is used to:

    · ensure there is only one active master

    · store the bootstrap location of Bigtable data

    · discover tablet servers

    · store Bigtable schema information

    · store access control lists
A table starts off with just one tablet. As the table grows, it is split into multiple tablets. By default, a table is split at around 100 to 200 MB.


Locating rows within a Bigtable is managed in a three-level hierarchy:

    · The root (top-level) tablet stores the location of all Metadata tablets in a special Metadata tablet. 

    · Each Metadata tablet contains the location of user data tablets. The Metadata table is keyed by an encoding of a tablet's table ID and its end row. For efficiency, the client library caches tablet locations.

    · User data tablets.


A tablet is assigned to one tablet server at a time. Chubby keeps track of tablet servers. When a tablet server starts, it creates and acquires an exclusive lock on a uniquely-named file in a Chubby servers directory. The master monitors this directory to discover new tablet servers. When the master starts, it:

1. grabs a unique master lock in Chubby (to prevent multiple masters from starting)

2. scans the servers directory in Chubby to find live tablet servers

3. communicates with each tablet server to discover what tablets are assigned to each server

4. scans the Metadata table to learn the full set of tablets

5. builds a set of unassigned tablets, which are eligible for tablet assignment
A Bigtable can be configured to replicate to multiple Bigtable clusters in different data centers for high availability.

A cluster consists of a master server and tablet servers. The master assigns tablets to tablet servers, while the tablet servers serve the actual data. A tablet server knows the names of all SSTable files for its tablets; with the help of the indexes it can determine which SSTable file contains the data it needs, and then it reads that SSTable's data from GFS.

Tablet data is ultimately written to GFS.

Minor compaction: when a tablet server receives a write request, it first checks that the request is well-formed and authorized. If so, it commits the write to the log and then writes the data into the in-memory memtable. The memtable acts as a buffer in front of the SSTables; when it grows to a certain size it is frozen, Bigtable creates a new memtable, and the frozen memtable is converted to the SSTable format and written to GFS.

When a tablet server receives a read request, it likewise checks that the request is valid. If so, the read operates on a merged view of all the SSTable files and the memtable; because the SSTables and the memtable are themselves sorted, the merge is fast.

Each minor compaction produces a new SSTable file. Too many SSTable files make reads less efficient, so Bigtable periodically performs a merging compaction, which merges several SSTables and the memtable into a single new SSTable.
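A sketch of the merged view used by reads and by merging compaction: because the memtable and every SSTable are each sorted, a k-way merge yields one sorted stream, and keeping only the newest timestamp per key is what a compaction that rewrites everything into one SSTable does. heapq.merge is standard-library; the data shapes are illustrative.

import heapq

def merged_view(*sorted_runs):
    # Each run is a sorted list of (row_key, timestamp, value).
    # Yields one entry per row_key, keeping the newest timestamp.
    current_key, best = None, None
    for key, ts, value in heapq.merge(*sorted_runs):
        if key != current_key:
            if current_key is not None:
                yield current_key, best
            current_key, best = key, (ts, value)
        elif ts > best[0]:
            best = (ts, value)
    if current_key is not None:
        yield current_key, best

memtable = [("b", 300, "b-new")]
sstable1 = [("a", 100, "a1"), ("b", 200, "b-old")]
print(list(merged_view(memtable, sstable1)))
# [('a', (100, 'a1')), ('b', (300, 'b-new'))]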



2. METADATA table - structure of the metadata table


The METADATA table is a special table used for locating data and for some metadata services.

i. The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row 

ii. Each METADATA row stores approximately 1 KB of data in memory (because it is accessed heavily, the metadata table is kept in memory). This feature is useful for small pieces of data that are accessed frequently: we use it internally for the location column family in the METADATA table.

iii. We also store secondary information in the METADATA table, including a log of all events pertaining to each tablet (such as when a server begins serving it).



Figure: a simplified diagram of the Bigtable architecture



The diagram uses a Webtable as an example, storing several web pages from NetEase, Baidu and Douban. If we want to look up yesterday's content of the Baidu Tieba page, we can issue a query to Bigtable against the Webtable for (com.baidu.tieba, contents: yesterday).

Assume the client has nothing cached. Bigtable first contacts the tablet server of the root tablet to learn which metadata tablet holds the location of the tablet containing this page. Searching the root tablet with the row key METADATA.Webtable.com.baidu.tieba, it finds METADATA.Webtable.com.baidu.www as the bounding end row, so the metadata tablet needed is tablet A. It then contacts tablet A's server and searches for Webtable.com.baidu.tieba, finds Webtable.com.baidu.www as the bounding end row, and determines that the tablet needed is tablet B of the Webtable. Finally it contacts tablet B's server and retrieves the data.

Each tablet actually consists of several SSTable files plus a memtable, and these SSTables and the memtable are all sorted. This means that a lookup in tablet B may have to check every SSTable and the memtable. Also, the client presumably does not obtain SSTable file names directly from the metadata table; it only learns which tablet server owns the tablet, and accesses the SSTables through that tablet server as a proxy.
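A sketch of how a tablet keyed by end row can be located, as in the Webtable walk-through above: binary-search the sorted list of tablet end keys for the first one greater than or equal to the lookup key. bisect is standard-library; the tablet list here is made up.

import bisect

# Each metadata entry is (end_row_of_tablet, tablet_id), sorted by end row.
metadata = [("com.baidu.www", "tablet-B"), ("com.douban.www", "tablet-C"),
            ("com.netease.www", "tablet-D")]
end_keys = [end for end, _ in metadata]

def locate(row_key):
    # The owning tablet is the first one whose end row >= row_key.
    i = bisect.bisect_left(end_keys, row_key)
    return metadata[i][1] if i < len(metadata) else None

print(locate("com.baidu.tieba"))   # tablet-B, since com.baidu.tieba <= com.baidu.www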

3. Why use Bigtable instead of a standard Relational Database Management System (RDBMS)?

i. Size: the size of the preprocessing table is the first reason an RDBMS cannot be used to store the data. The roughly 70 terabytes of data (from Google Earth) cannot be stored as one table hosted on a single machine.

ii. Scalability: a relational database also scales, but only within a single node. When the hardware capacity of that single node is reached, the load needs to be distributed to other nodes. An RDBMS is better suited to applications hosted on a single node, whereas Bigtable is designed to distribute data at large scale across hundreds or thousands of machines over the network.

iii. Flexibility: Bigtable allows the data to be managed flexibly, making it easy to add or remove a node from the distributed cluster.

Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

- HBase

HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop. It provides a fault-tolerant way of storing large quantities of sparse data.


HBase uses a data model very similar to that of Bigtable. Users store data rows in labelled tables. A data row has a sortable key and an arbitrary number of columns. The table is stored sparsely, so that rows in the same table can have crazily-varying columns, if the user likes.

The table at the following link summarizes the similarities and differences between Bigtable and HBase:

http://www.larsgeorge.com/2009/11/hbase-vs-bigtable-comparison.html

Feature comparison (Google BigTable vs. Apache HBase):

Atomic Read/Write/Modify. BigTable: Yes, per row. HBase: Yes, per row.
Since BigTable does not strive to be a relational database it does not have transactions. The closest to such a mechanism is the atomic access to each row in the table. HBase also implements a row lock API which allows the user to lock more than one row at a time.

Lexicographic Row Order. BigTable: Yes. HBase: Yes.
All rows are sorted lexicographically in one order and that one order only. Again, this is no SQL database where you can have different sorting orders.

Block Support. BigTable: Yes. HBase: Yes.
Within each storage file, data is written as smaller blocks of data. This enables faster loading of data from large storage files. The size is configurable in either system. The typical size is 64K.

Block Compression. BigTable: Yes, per column family. HBase: Yes, per column family.
Google uses BMDiff and Zippy in a two-step process. BMDiff works really well because neighboring key-value pairs in the store files are often very similar. This can be achieved by using versioning so that all modifications to a value are stored next to each other but still have a lot in common, or by designing the row keys in such a way that, for example, web pages from the same site are all bundled. Zippy is a modified LZW algorithm. HBase, on the other hand, uses the standard Java-supplied GZip or, with a little effort, the GPL-licensed LZO format. There are indications that Hadoop may also want to have BMDiff (HADOOP-5793) and possibly Zippy as well.

Number of Column Families. BigTable: Hundreds at most. HBase: Less than 100.
While the number of rows and columns is theoretically unbounded, the number of column families is not. This is a design trade-off but does not impose too many restrictions if the tables and keys are designed accordingly.

Column Family Name Format. BigTable: Printable. HBase: Printable.
The main reason for HBase here is that column family names are used as directories in the file system.

Qualifier Format. BigTable: Arbitrary. HBase: Arbitrary.
Any arbitrary byte[] array can be used.

Key/Value Format. BigTable: Arbitrary. HBase: Arbitrary.
Like above, any arbitrary byte[] array can be used.

Access Control. BigTable: Yes. HBase: No.
BigTable enforces access control at the column family level. HBase does not yet have that feature (see HBASE-1697).

Cell Versions. BigTable: Yes. HBase: Yes.
Versioning is done using timestamps. See the next feature below too. The number of versions that should be kept is freely configurable at the column family level.

Custom Timestamps. BigTable: Yes (microseconds). HBase: Yes (milliseconds).
With both systems you can either set the timestamp of a stored value yourself or leave the default "now". There are known restrictions in HBase that the outcome is indeterminate when adding older timestamps after already having stored newer ones.

Data Time-To-Live. BigTable: Yes. HBase: Yes.
Besides having versions of data cells, the user can also set a time-to-live on the stored data that allows discarding data after a specific amount of time.

Batch Writes. BigTable: Yes. HBase: Yes.
Both systems allow batching table operations.

Value-based Counters. BigTable: Yes. HBase: Yes.
BigTable and HBase can use a specific column as an atomic counter. HBase does this by acquiring a row lock before the value is incremented.

Row Filters. BigTable: Yes. HBase: Yes.
Again, both systems allow applying filters when scanning rows.

Client Script Execution. BigTable: Yes. HBase: No.
BigTable uses Sawzall to enable users to process the stored data.

MapReduce Support. BigTable: Yes. HBase: Yes.
Both systems have convenience classes that allow scanning a table in MapReduce jobs.

Storage Systems. BigTable: GFS. HBase: HDFS, S3, S3N, EBS.
While BigTable works on Google's GFS, HBase has the option to use any file system as long as there is a proxy or driver class for it.

File Format. BigTable: SSTable. HBase: HFile.

Block Index. BigTable: At end of file. HBase: At end of file.
Both storage file formats have a similar block-oriented structure with the block index stored at the end of the file.

Memory Mapping. BigTable: Yes. HBase: No.
BigTable can memory-map storage files directly into memory.

Lock Service. BigTable: Chubby. HBase: ZooKeeper.
There is a difference in that ZooKeeper is used to coordinate tasks in HBase as opposed to providing locking services. Overall, though, ZooKeeper does for HBase pretty much what Chubby does for BigTable, with slightly different semantics.

Single Master. BigTable: Yes. HBase: No.
HBase recently added support for multiple masters. These are on "hot" standby and monitor the master's ZooKeeper node.

Tablet/Region Count. BigTable: 10-1000. HBase: 10-1000.
Both systems recommend about the same number of regions per region server. Of course this depends on many things, but given a similar setup as far as "commodity" machines are concerned, it seems to result in the same amount of load on each server.

Tablet/Region Size. BigTable: 100-200 MB. HBase: 256 MB.
The maximum region size can be configured for HBase and BigTable. HBase uses 256 MB as the default value.

Root Location. BigTable: 1st META / Chubby. HBase: -ROOT- / ZooKeeper.
HBase handles the Root table slightly differently from BigTable, where it is the first region in the Meta table. HBase uses its own table with a single region to store the Root table. Once either system starts, the address of the server hosting the Root region is stored in ZooKeeper or Chubby so that the clients can resolve its location without hitting the master.

Client Region Cache. BigTable: Yes. HBase: Yes.
The clients in either system cache the location of regions and have appropriate mechanisms to detect stale information and update the local cache.

Meta Prefetch. BigTable: Yes. HBase: No (?).
A design feature of BigTable is to fetch more than one Meta region's information. This proactively fills the client cache for future lookups.

Historian. BigTable: Yes. HBase: Yes.
The history of region-related events (such as splits, assignment, reassignment) is recorded in the Meta table.

Locality Groups. BigTable: Yes. HBase: No.
It is not entirely clear, but it seems everything in BigTable is defined by locality groups. They group multiple column families into one so that they get stored together and also share the same configuration parameters. A single column family is probably a locality group with one member. HBase does not have this option and handles each column family separately.

In-Memory Column Families. BigTable: Yes. HBase: Yes.
These are for relatively small tables that need very fast access times.

KeyValue (Cell) Cache. BigTable: Yes. HBase: No.
This is a cache that serves hot cells.

Block Cache. BigTable: Yes. HBase: Yes.
Blocks read from the storage files are cached internally in configurable caches.

Bloom Filters. BigTable: Yes. HBase: Yes.
These filters allow one to quickly check whether a specific cell exists, at the cost of using memory on the region server.

Write-Ahead Log (WAL). BigTable: Yes. HBase: Yes.
Each region server in either system stores one modification log for all regions it hosts.

Secondary Log. BigTable: Yes. HBase: No.
In addition to the write-ahead log mentioned above, BigTable has a second log that it can use when the first is going slow. This is a performance optimization.

Skip Write-Ahead Log. BigTable: ?. HBase: Yes.
For bulk imports, the client in HBase can opt to skip writing into the WAL.

Fast Table/Region Split. BigTable: Yes. HBase: Yes.
Splitting a region or tablet is fast, as the daughter regions first read the original storage file until a compaction finally rewrites the data into the region's local store.

The only mutable data structure that is accessed by both reads and writes is the memtable. To reduce contention during reads of the memtable, we make each memtable row copy-on-write and allow reads and writes to proceed in parallel.



In case the split notification is lost (either because the tablet server or the master died), the master detects the new tablet when it asks a tablet server to load the tablet that has now split. The tablet server will notify the master of the split, because the tablet entry it finds in the METADATA table will specify only a portion of the tablet that the master asked it to load.

The master node in HBase uses the .META. solely to detect when a region was split but the message was lost. For that reason it scans the .META. on a regular basis to see when a region appears that is not yet assigned. It will then assign that region as per its default strategy.

Compactions

The following are more terminology differences than anything else.

As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This minor compaction process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies. Incoming read and write operations can continue while compactions occur.

HBase has a similar operation but it is referred to as a "flush". Opposed to that "minor compactions" in HBase rewrite the last N used store files, i.e. those with the most recent mutations as they are probably much smaller than previously created files that have more data in them.

... we bound the number of such files by periodically executing a merging compaction in the background. A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished. 

This again refers to what is called "minor compaction" in HBase.

A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction.

Here we have an exact match though, a "major compaction" in HBase also rewrites all files into one.
https://www.scribd.com/document/21244790/Google-Designs-Lessons-and-Advice-from-Building-Large-Distributed-Systems
MVCC: Multiversion concurrency control
