Wednesday, March 23, 2016

HBASE



https://mp.weixin.qq.com/s/JlSfY4Glglxw94u5bSCS4A
The IndexTable RowKey consists of four parts, in order: the DataTable Region StartKey, the IndexName, the IndexValue, and the DataTable RowKey, as shown in the figure.
DataTable Region StartKey. Using the DataTable Region's StartKey as the first part of the IndexTable Region RowKey rests on two considerations.
First, the IndexTable Region and the corresponding DataTable Region then share the same StartKey, which can serve as the key that associates the two Regions;
Second, when a DataTable Region splits, the same SplitKey can be used to split the IndexTable Region accordingly, and the newly created DataTable Region and IndexTable Region can be associated with each other.
IndexName.
Multiple indexes can be defined on one DataTable. If a separate IndexTable were created for each index, a large number of IndexTables would pile up in practice, and every DataTable Region split would also require splitting all of the associated IndexTable Regions, consuming substantial system resources and complicating maintenance. All index data for a DataTable is therefore stored in a single IndexTable, with data belonging to different indexes distinguished by IndexName.
IndexValue. For a single-column index, the IndexValue is a single Column Value of the DataTable Row; for a composite index, the IndexValue is the combination of several Column Values of the DataTable Row.
DataTable RowKey. Used to locate the DataTable Row and retrieve the final data.
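
As a rough illustration of this composition, a helper along the following lines could assemble such a RowKey with HBase's Bytes utility (the 0x00 separator and the helper itself are assumptions for illustration, not the article's exact encoding):

import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical helper, not the article's exact encoding: concatenates the
// four components with a 0x00 separator. A real implementation needs
// unambiguous length-prefixing or escaping so the parts can be decoded.
public class IndexRowKey {
    private static final byte[] SEP = new byte[] { 0x00 };

    public static byte[] build(byte[] regionStartKey, byte[] indexName,
                               byte[] indexValue, byte[] dataRowKey) {
        return Bytes.add(Bytes.add(regionStartKey, SEP, indexName),
                         Bytes.add(SEP, indexValue, SEP),
                         dataRowKey);
    }
}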


https://blogs.apache.org/hbase/entry/hbase_who_needs_a_master
HBase provides low-latency random reads and writes on top of HDFS and it’s able to handle petabytes of data. One of the interesting capabilities in HBase is Auto-Sharding, which simply means that tables are dynamically distributed by the system when they become too large.

Regions and Region Servers

The basic unit of scalability in HBase, providing the horizontal scalability, is called a Region. Regions are a subset of the table’s data: essentially a contiguous, sorted range of rows that are stored together.
Initially, there is only one region for a table. When a region becomes too large after more rows are added, it is split into two at the middle key, creating two roughly equal halves.
Looking back at the HBase architecture, the slaves are called Region Servers. Each Region Server is responsible for serving a set of regions, and one Region (i.e. a range of rows) can be served by only one Region Server.

The HBase architecture has two main services: the HMaster, which is responsible for coordinating Regions in the cluster and executing administrative operations, and the HRegionServer, which is responsible for handling a subset of the table’s data.

HMaster, Region Assignment and Balancing

The HBase Master coordinates the HBase Cluster and is responsible for administrative operations.

A Region Server can serve one or more Regions. Each Region is assigned to a Region Server on startup, and the master can decide to move a Region from one Region Server to another as the result of a load balance operation. The Master also handles Region Server failures by assigning the regions they served to other Region Servers.

The mapping of Regions to Region Servers is kept in a system table called META. By reading META, you can identify which region is responsible for your key. This means that for read and write operations, the master is not involved at all, and clients can go directly to the Region Server responsible for serving the requested data.

Locating a Row-Key: Which Region Server is responsible?

To put or get a row, clients don’t have to contact the master; they can directly contact the Region Server that handles the specified row. In the case of a scan, clients can directly contact the set of Region Servers responsible for handling the set of keys.
To identify the Region Server, the client queries the META table. META is a system table used to keep track of regions. It contains the server name and a region identifier comprising a table name and the start row-key. By looking at a region's start-key and the next region's start-key, clients can identify the range of rows contained in a particular region.
The client keeps a cache of region locations. This avoids having to hit the META table every time an operation on the same region is issued. If a region splits or moves to another Region Server (due to balancing or assignment policies), the client receives an exception in response, and the cache is refreshed by fetching the updated information from the META table.

Since META is a table like the others, the client has to identify on which server META is located. The META locations are stored in a ZooKeeper node on assignment by the Master, and the client reads the node directly to get the address of the Region Server that contains META.

The original design was based on BigTable with another table called -ROOT- containing the META locations and ZooKeeper pointing to it. HBase 0.96 removed that in favor of ZooKeeper only since META cannot be split and therefore consists of a single region.
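
As a minimal client-side sketch (assuming an HBase 1.x style API and a hypothetical table "mytable"): the META/ZooKeeper lookup happens transparently inside the client, but RegionLocator exposes it if you want to see which Region Server owns a key, and reads then go straight to that server.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LocateAndGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            TableName tn = TableName.valueOf("mytable");  // hypothetical table

            // Which Region Server serves this key? Answered from META (and cached).
            try (RegionLocator locator = conn.getRegionLocator(tn)) {
                HRegionLocation loc = locator.getRegionLocation(Bytes.toBytes("row1"));
                System.out.println("row1 is served by " + loc.getServerName());
            }

            // Reads and writes then go straight to that Region Server;
            // the Master is never involved.
            try (Table table = conn.getTable(tn)) {
                Result r = table.get(new Get(Bytes.toBytes("row1")));
                System.out.println(r.isEmpty() ? "miss" : "hit");
            }
        }
    }
}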

http://hbase.apache.org/0.94/book/quickstart.html
HBase expects the loopback IP address to be 127.0.0.1. Ubuntu and some other distributions, for example, will default to 127.0.1.1 and this will cause problems for you.
/etc/hosts should look something like this:
            127.0.0.1 localhost
            127.0.0.1 ubuntu.ubuntu-domain ubuntu
./bin/hbase shell
./bin/stop-hbase.sh

Architecting HBase Applications: A Guidebook for Successful Development and Design
Column families, sparsely populated
Column families must be defined up front, but column names need not be
HBase orders the keys based on their byte values.
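
Because the ordering is purely byte-wise, string-encoded numbers do not sort numerically; a quick check with the standard Bytes utility:

import org.apache.hadoop.hbase.util.Bytes;

public class ByteOrderDemo {
    public static void main(String[] args) {
        // Byte-wise ordering: "row10" sorts before "row9" because '1' < '9'.
        // Zero-padding keys ("row09", "row10") restores numeric ordering.
        int cmp = Bytes.compareTo(Bytes.toBytes("row10"), Bytes.toBytes("row9"));
        System.out.println(cmp < 0);  // prints true
    }
}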

Tables are split into regions, where each region stores a specific range of data
(start key to end key)
Regions can be split or merged.

COLUMN FAMILY
For the same region, different column families store their data in different files and can be configured differently. Data with the same access pattern and the same format should be grouped into the same column family.
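
A sketch of what that looks like at table-creation time, using the pre-2.0 HTableDescriptor API (the table name, family names, and settings are illustrative assumptions): a hot, small family and a cold, bulky family get different block size and compression settings.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;

public class CreateWithFamilies {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("events"));
            // Hot, small cells: smaller blocks favor random reads.
            desc.addFamily(new HColumnDescriptor("meta").setBlocksize(8 * 1024));
            // Cold, bulky cells: compress aggressively.
            desc.addFamily(new HColumnDescriptor("raw")
                    .setCompressionType(Compression.Algorithm.SNAPPY));
            admin.createTable(desc);
        }
    }
}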

Memstores -> HFiles, compacted over time
HFiles are stored in HDFS and so benefit from the Hadoop persistence and replication.

HFile blocks - default 64KB
Data blocks, index blocks,
Bloom filter blocks - used to skip reading a file when looking for a specific key
Data blocks are stored first, then the index blocks, then the Bloom filter blocks, and the trailer blocks are stored at the end.

HBase is a column-oriented database. That means that each column is stored individually, instead of the entire row being stored as one unit.

Cell (KeyValue) layout:
key length | value length | key | value
Scalability: regroup data into bigger files and spread a table across many servers

Compactions, splits, and balancing
Minor and major compactions
Disable automatic major compactions and use cron to trigger them instead (see the Admin API sketch below).

Splits (Auto-Sharding)
If a region grows bigger than the maximum size (10G by default), it is split into two.

HBase Master will run a load balancer to ensure that all the region servers are managing and serving a similar number of regions.
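
All three operations above can also be triggered by hand through the Admin API; a sketch assuming an HBase 1.x style client and a hypothetical table "mytable". (A cron job would typically shell out to the HBase shell instead, e.g. major_compact 'mytable', after setting hbase.hregion.majorcompaction to 0 to disable the periodic major compactions.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class AdminOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableName tn = TableName.valueOf("mytable");  // hypothetical table
            admin.majorCompact(tn);  // what a cron-driven job would request
            admin.split(tn);         // request a split (normally automatic at 10G)
            admin.balancer();        // run the Master's load balancer once
        }
    }
}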

https://blogs.apache.org/hbase/entry/apache_hbase_internals_locking_and
The answer is that HBase guarantees ACID semantics per-row. ACID is an acronym for:
• Atomicity: All parts of a transaction complete, or none do
• Consistency: Only valid data is written to the database
• Isolation: Parallel transactions do not impact each other’s execution
• Durability: Once a transaction is committed, it remains

If you have experience with traditional relational databases, these terms may be familiar to you.  Traditional relational databases typically provide ACID semantics across all the data in the database; for performance reasons, HBase only provides ACID semantics on a per-row basis.

Consider two concurrent writes to HBase that represent {company, role} combinations I’ve held:



Image 1.  Two writes to the same row

From the previously cited Write Path Blog Post, we know that HBase will perform the following steps for each write:

(1) Write to Write-Ahead-Log (WAL)
(2) Update MemStore: write each data cell [the (row, column) pair] to the memstore

We clearly need some concurrency control. The simplest solution is to provide exclusive locks per row in order to provide isolation for writes that update the same row. So, our new list of steps for writes is as follows (the new steps are 0 and 3):

(0) Obtain Row Lock
(1) Write to Write-Ahead-Log (WAL)
(2) Update MemStore: write each cell to the memstore
(3) Release Row Lock



Row locks give us write-write isolation, but a concurrent read could still observe a row with only some of a write's cells applied. Therefore, we need some concurrency control to deal with read-write synchronization. The simplest solution would be to have the reads obtain and release the row locks in the same manner as the writes. This would resolve the ACID violation, but the downside is that our reads and writes would both contend for the row locks, slowing each other down.


Notice the new steps introduced for Multiversion Concurrency Control.  Each write is assigned a write number (step w1), each data cell is written to the memstore with its write number (step w2, e.g. “Cloudera [wn=1]”) and each write completes by finishing its write number (step w3).

A consistent response without requiring the reads to lock the row!

Let’s put this all together by listing the steps for a write with Multiversion Concurrency Control (the new steps required for read-write synchronization are 0a and 2a):
(0) Obtain Row Lock
(0a) Acquire New Write Number
(1) Write to Write-Ahead-Log (WAL)
(2) Update MemStore: write each cell to the memstore
(2a) Finish Write Number
(3) Release Row Lock
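
To make the write-number bookkeeping concrete, here is a toy sketch of the scheme (explicitly not HBase's real MultiVersionConcurrencyControl class, just the idea the steps describe): writers acquire a write number (step 0a), stamp their cells with it, and finish it (step 2a); readers only see cells whose write number is at or below the current read point.

import java.util.TreeSet;

// Toy sketch of the MVCC bookkeeping, not HBase's real implementation.
class ToyMvcc {
    private long nextWriteNumber = 0;
    private volatile long readPoint = 0;           // readers see wn <= readPoint
    private final TreeSet<Long> inFlight = new TreeSet<>();

    synchronized long beginWrite() {               // step 0a: acquire write number
        long wn = ++nextWriteNumber;
        inFlight.add(wn);
        return wn;
    }

    synchronized void finishWrite(long wn) {       // step 2a: finish write number
        inFlight.remove(wn);
        // Everything older than the oldest still-running write is now visible.
        readPoint = inFlight.isEmpty() ? nextWriteNumber : inFlight.first() - 1;
    }

    long getReadPoint() { return readPoint; }      // cells with wn > readPoint are skipped
}
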
http://hbasefly.com/2016/04/03/hbase_hfile_index/
HFile indexes come in two flavors depending on the number of levels: single-level and multi-level, the latter usually two or three levels deep. HFile V1 only had the single-level structure; V2 introduced multi-level indexes. The reason is that as HFiles grow, Data Blocks multiply and the index data grows too large to be loaded into memory in full (in V1, loading one Region Server's index data into memory could occupy close to 6G); a multi-level index allows loading only part of the index, reducing memory usage.

V2 has two kinds of Index Blocks: the Root Index Block and NonRoot Index Blocks, with NonRoot Index Blocks further divided into Intermediate Index Blocks and Leaf Index Blocks. The index structure in an HFile resembles a tree: the Root Index Block is the root of the index, Intermediate Index Blocks are interior nodes, and Leaf Index Blocks are leaves that point directly at the actual data blocks.
Besides the Data Blocks, the previous post mentioned that Bloom Blocks also need an index; that index is in fact a single-level structure, and the Bloom Index Block discussed there is a kind of Root Index Block.


For Data Blocks, the index starts out single-level while the HFile holds little data: there is only the Root Index layer, pointing directly at data blocks. As the data grows and the Root Index Block fills up, the index becomes multi-level, going from one layer to two, with the root pointing at leaf nodes and leaf nodes pointing at the actual data blocks. If the data grows further still, the index goes to three levels.
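
Schematically, a lookup walks this tree from the root: each index block records the first key covered by each child, and a search picks the last entry whose first key is <= the target key, then descends. A toy sketch of that descent (illustrative only; real index blocks are compact binary structures searched in place):

import java.util.Map;
import java.util.TreeMap;

// Toy index node: maps the first key of each child to that child, which is
// either another index node (root/intermediate) or a data block (leaf level).
class ToyIndexNode {
    final TreeMap<String, Object> entries = new TreeMap<>();

    Object locate(String key) {
        Map.Entry<String, Object> e = entries.floorEntry(key); // last firstKey <= key
        if (e == null) return null;                            // key precedes all entries
        Object child = e.getValue();
        return (child instanceof ToyIndexNode)
                ? ((ToyIndexNode) child).locate(key)           // descend one level
                : child;                                       // reached a data block
    }
}
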
http://hbasefly.com/2016/03/25/hbase-hfile/
The logical structure of HFile V2 is shown in the figure below:

The file has four main parts: the Scanned block section, the Non-scanned block section, the Load-on-open section, and the Trailer.
Scanned block section: as the name implies, all blocks in this section are read when the HFile is scanned sequentially, including the Leaf Index Blocks and Bloom Blocks.
Non-scanned block section: blocks in this section are not read during a sequential scan of the HFile; it mainly contains the Meta Blocks and the Intermediate Level Data Index Blocks.
Load-on-open section: this data is loaded into memory when the HBase region server starts; it includes the FileInfo, Bloom filter block, data block index, and meta block index.
Trailer: records the HFile's basic information together with the offsets and addressing information for each of the sections above.
HFile physical structure


http://hbasefly.com/2016/03/23/hbase-memstore-flush/
In HBase, a Region is the smallest unit of data service on a cluster node; a user table consists of one or more Regions. Within a Region, the data of each ColumnFamily forms a Store, and each Store consists of one Memstore and multiple HFiles, as shown in the figure below:
HBase is based on the LSM-Tree model: every update or insert is first written to the Memstore (and appended sequentially to the HLog write-ahead log); once the Memstore reaches a configured size, the accumulated changes are written out in a batch, producing a new HFile. This design greatly improves HBase's write performance. In addition, to make retrieval by RowKey convenient, HBase requires the data inside an HFile to be sorted by RowKey, so the Memstore data is sorted before being flushed to an HFile. Also, by the principle of locality, recently written data is more likely to be read, so when reading, HBase first checks whether the requested data is in the Memstore; on a write-cache miss it checks the read cache, and only on a read-cache miss does it go to the HFiles, finally returning a merged result to the user.

Memstore Flush Process
To reduce the flush's impact on reads and writes, HBase uses an approach similar to two-phase commit, dividing the flush into three phases (see the sketch after this list):
  1. prepare phase: iterate over all Memstores in the current Region, take a snapshot of each Memstore's current data set (kvset), and then create a new, empty kvset. All subsequent writes go into the new kvset; during the flush, reads first search both the kvset and the snapshot, and fall back to the HFiles if the data is not found there. The prepare phase takes an updateLock that blocks write requests and releases it when done; since nothing time-consuming happens in this phase, the lock is held only very briefly.
  2. flush phase: iterate over all Memstores and persist the snapshots produced in the prepare phase as temporary files, collected under the .tmp directory. Because this involves disk IO, it is relatively time-consuming.
  3. commit phase: iterate over all Memstores, move the temporary files produced in the flush phase into the corresponding ColumnFamily directory, create the storefile and Reader for each HFile, add the storefile to the HStore's storefiles list, and finally clear the snapshot taken in the prepare phase.
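
Here is a skeletal sketch of the three phases (a simplification with stubbed-out IO, not the real HStore/Memstore code; the kvset/snapshot names follow the description above):

import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified three-phase flush mirroring the prepare/flush/commit description.
class ToyMemstore {
    volatile ConcurrentSkipListMap<String, String> kvset = new ConcurrentSkipListMap<>();
    volatile ConcurrentSkipListMap<String, String> snapshot = new ConcurrentSkipListMap<>();
    final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock();

    void prepare() {                      // phase 1: cheap pointer swap under the lock
        updatesLock.writeLock().lock();
        try {
            snapshot = kvset;                          // current data set becomes snapshot
            kvset = new ConcurrentSkipListMap<>();     // new writes land here
        } finally {
            updatesLock.writeLock().unlock();          // writes blocked only briefly
        }
    }

    void flush() {                        // phase 2: slow disk IO, no lock held
        writeSnapshotToTmpFile();                      // persist under .tmp (stub)
    }

    void commit() {                       // phase 3: publish file, drop snapshot
        moveTmpFileToFamilyDir();                      // make the HFile visible (stub)
        snapshot = new ConcurrentSkipListMap<>();
    }

    private void writeSnapshotToTmpFile() { /* stub: sequential write of snapshot */ }
    private void moveTmpFileToFamilyDir() { /* stub: rename + open storefile Reader */ }
}
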
However, once a flush is triggered by the Region Server level limit, it has a large impact on user requests: all updates landing on that Region Server are blocked, and the blocking can last a long time, even minutes.


Suppose each Memstore is the default 128M. Under the configuration above, if each Region has two Memstores and the Region Server hosts 100 regions, the total memory consumed is 128M * 100 * 2 = 25.6G > 24.9G, so this situation clearly triggers the Region Server level limit, with a considerable impact on users.

From the analysis above, the factors that trigger the Region Server level limit are the total number of Regions running on a Region Server and the number of Stores per Region (i.e. the table's number of ColumnFamilies). For the former, based on read/write load it is generally recommended to keep around 50-80 Regions per Region Server in production: too few wastes resources, while too many risks triggering other problems. For the latter, the fewer ColumnFamilies the better; if the logic genuinely requires multiple ColumnFamilies, keep it to at most 3.

create 'NewsClickFeedback',{NAME=>'Toutiao',VERSIONS=>1,BLOCKCACHE=>true,BLOOMFILTER=>'ROW',COMPRESSION=>'SNAPPY',TTL=>'259200'},{SPLITS => ['1','2','3','4','5','6','7','8','9','a','b','c','d','e','f']}


The CREATE statement above creates a table named "NewsClickFeedback" containing a single column family, "Toutiao".

VERSIONS

Number of data versions. The HBase data model allows a cell to hold a multi-version data set, with each version carrying a different timestamp; the VERSIONS parameter specifies the maximum number of versions to keep, 1 by default. If a user wants to keep two historical versions, VERSIONS can be set to 2, after which the following Scan command retrieves all historical data:


scan 'NewsClickFeedback',{VERSIONS => 2}

BLOOMFILTER
Bloom filter, which optimizes HBase's random read performance. Possible values are NONE | ROW | ROWCOL, NONE by default; the parameter can be enabled per column family. With the filter enabled, get operations and some scan operations can skip storefiles that cannot contain the requested key, reducing actual IO and improving random read performance. The ROW type suits lookups by row only, while ROWCOL suits combined row + column lookups:
ROW type suits: get 'NewsClickFeedback','row1'
ROWCOL type suits: get 'NewsClickFeedback','row1',{COLUMN => 'Toutiao'}
For workloads with random reads, enabling the ROW type filter is recommended: it trades space for time and improves random read performance.

COMPRESSION

Data compression. HBase supports several compression formats, which on the one hand reduce storage space and on the other reduce the amount of data transferred over the network, improving read efficiency. HBase currently supports three main compression algorithms: GZip | LZO | Snappy; the table below compares them by compression ratio and encode/decode speed:

Snappy has the lowest compression ratio but the highest encode/decode speed and the smallest CPU cost; Snappy is currently the general recommendation.

TTL

Data expiration time, in seconds; the default is to keep data forever. Many applications do not actually need to keep certain data forever: permanent retention makes the data set grow ever larger, which consumes storage and also degrades query efficiency. If a TTL is set, HBase checks during compactions whether data has expired, and expired data is deleted. Users can set it to one month or three months depending on the business scenario. In the example, TTL => '259200' sets the expiration time to three days.

IN_MEMORY

Whether data stays resident in memory; false by default. HBase provides a cache area for frequently accessed data; it is meant for small, frequently accessed data sets, with metadata storage as the typical scenario. By default this cache area is sized at Jvm Heapsize * 0.2 * 0.25, so with Jvm Heapsize = 70G the area is about 3.5G. Note that HBase's META data lives in this area; if business data is set to true and is too large, the META data can be evicted, degrading performance across the whole cluster, so take special care when setting this parameter.

BLOCKCACHE

Whether the block cache is enabled for the family; enabled by default.
SPLITS
Region pre-splitting strategy. Pre-splitting balances the data across multiple machines up front, which to some extent avoids the performance problems caused by automatic splits when a hot application's data volume surges. HBase stores data in ascending rowkey order, so to avoid hotspots regions are usually pre-split using a hash + partition scheme: in the example, the rowkey is first MD5-hashed and then partitioned into 16 buckets by its first character, pre-creating 16 regions (a sketch follows below).
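
A sketch of the hash + partition idea (illustrative: MD5 the natural key so that the digest's first hex character spreads rows evenly across the 16 regions pre-split at '1' through 'f'; digests starting with '0' land in the first region):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class SaltedRowKey {
    // MD5 the natural key; the first hex character of the digest decides
    // which of the 16 pre-split regions the row lands in.
    public static String rowkey(String naturalKey) throws Exception {
        byte[] md5 = MessageDigest.getInstance("MD5")
                .digest(naturalKey.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : md5) hex.append(String.format("%02x", b));
        return hex + "_" + naturalKey;   // e.g. "0cc175b9..._user42"
    }
}
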
http://stackoverflow.com/questions/4356952/how-is-resolved-concurrency-and-locking-in-cassandra-memtable
There is no MVCC because there is no such thing as isolation levels in Cassandra. The atom of work is a single column.
It should also be pointed out that there is generally a way to model your data (and possibly client behavior) so that isolation isn't necessary.

http://apache.claz.org/hbase/


