Wednesday, October 21, 2015

Google File System



Self-study notes on distributed system design
1. GFS. I started from the lowest-level storage structure; a lot of it has not fully sunk in yet. This is a summary from my first read of the paper, and I hope to understand more over time. Please point out anything I got wrong!
GFS (Google File System): a distributed file system used to store data. It sits underneath BigTable (and its SSTables, Sorted String Tables) and is the model HDFS was built on. By spreading files across many disks it can serve a very large volume of reads and writes.
Goals of GFS:
1. Fault tolerance: with 1,000 machines, each failing roughly once a year, about 3 machines go down every day. How do we keep the data safe?
2. Performance: heavy concurrent reads and writes; how should the I/O be handled?
3. Consistency: with many clients reading and writing at the same time, how do we make sure a read returns the latest data?
4. Efficiency: how do we make the best use of network bandwidth when moving data around?
File systems before GFS:
1. Each file is split into blocks (4096 bytes each); each file has its own metadata recording the file name, timestamps, size, and the disk offsets of its blocks.
2. Master-slave design: the data lives on the master and the slaves are backups; if the master dies, a slave is promoted to take over.
3. Drawbacks: random access is slow, concurrent reads and writes require locking, and keeping all data on one master makes everything depend on that machine's bandwidth.
GFS architecture:
1. Clients, e.g. the SSTable layer, which read and write files to GFS.
2. Master: unlike the master in a traditional file system, it stores no file data. It keeps the metadata in memory, including a hash table of the form hashtable[filename][chunk_index] = chunk_server_list.
3. Chunk servers: the machines that actually store the data chunks on disk.
4. Chunks: unlike the usual 4096-byte blocks, each piece of data in GFS is large, 64 MB by default, which keeps the metadata small. This rests on the assumption that tiny file fragments are rare.
5. Metadata: roughly 64 bytes per chunk.
6. Replicas: every chunk is stored 3 times, on different chunk servers, so the data survives a chunk server failure.
Read:
1. The client sends the filename to the master.
2. The master looks up the file's metadata to get its chunk list, then looks up the corresponding chunk server list in its hash map and returns it to the client, which caches the result.
3. In the returned list each chunk maps to up to 3 servers; the client picks the nearest one by IP address and reads the data from it.
*4. In practice a version number is kept both with the chunk data and in the chunk server list; if the version on the chunk server does not match, the client reconnects to the master.
*5. The client keeps the server list in its cache so it does not have to ask the master every time. (A code sketch of this read path follows.)
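Below is a minimal Python sketch of this read path, assuming a hypothetical client API (lookup_chunk_servers, read_range, network_distance are invented names, not the real GFS interfaces):

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

class GFSReadClient:
    """Sketch of the read path: ask the master once, cache the chunk-server
    list, then read from the closest replica."""

    def __init__(self, master):
        self.master = master
        self.location_cache = {}  # (filename, chunk_index) -> list of chunk servers

    def read(self, filename, offset, length):
        chunk_index = offset // CHUNK_SIZE              # which chunk holds this offset
        key = (filename, chunk_index)
        if key not in self.location_cache:
            # one round trip to the master; the answer is cached afterwards
            self.location_cache[key] = self.master.lookup_chunk_servers(filename, chunk_index)
        replicas = self.location_cache[key]
        server = min(replicas, key=self.network_distance)   # "closest" replica, e.g. by IP
        return server.read_range(chunk_index, offset % CHUNK_SIZE, length)

    def network_distance(self, server):
        return 0  # placeholder: in practice derived from IP/rack topology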

Write ("write" here means modify):
1. The client sends the filename to the master.
2. The master replies with the chunk server list, just as for a read, except the chunk servers now play two roles: one is the primary replica and the others are secondary replicas.
3. The data is pushed into the cache of all three replicas (!! not to disk; it sits in an LRU cache).
4. The client sends a write request to the primary replica; the primary writes the new data to disk.
5. The primary forwards the write request to the two secondaries, and the secondaries write the data.
6. When they finish, the secondaries tell the primary the write succeeded.
7. The primary replies to the client; if anything failed, steps 3-7 are retried.
*Unlike traditional file systems, GFS also provides a record append operation, designed so that many clients can write to the same file concurrently. (A sketch of this write flow follows.)
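A simplified Python sketch of steps 2-7 above, with hypothetical object methods (lookup_replicas, push_data, commit are made-up names; this is not the real GFS client library):

class WriteError(Exception):
    pass

def gfs_write(master, filename, chunk_index, data, max_retries=3):
    """Sketch of the mutation flow: push data to every replica's buffer,
    then ask the primary to commit and forward the ordered write."""
    primary, secondaries = master.lookup_replicas(filename, chunk_index)
    for _ in range(max_retries):
        # step 3: push the data into every replica's buffer cache (not disk yet)
        for replica in [primary] + secondaries:
            replica.push_data(data)
        # steps 4-6: the primary applies the write, forwards the request to the
        # secondaries in the same order, and waits for their acknowledgements
        if primary.commit(filename, chunk_index, data, secondaries):
            return True   # step 7: the primary reports success back to the client
    raise WriteError("write failed; retry steps 3-7 or re-contact the master")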

System optimizations:
1. A second master is usually unnecessary; if the master dies, just restart it. The master persists an operation log to disk and recovers its state from that log after a restart.
2. How to tell whether a chunk server is down: heartbeat messages.
3. How to tell whether a block on a chunk server is corrupted: a chunk is broken up into 64 KB blocks, each with a corresponding 32-bit checksum. (A sketch of this check follows.)
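A small sketch of that per-block checksum idea (CRC32 is used here only as an example of a 32-bit checksum; the paper does not say which function GFS uses):

import zlib

BLOCK_SIZE = 64 * 1024  # each chunk is divided into 64 KB blocks

def block_checksums(chunk_bytes):
    """One 32-bit checksum per 64 KB block of a chunk."""
    return [zlib.crc32(chunk_bytes[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_bytes), BLOCK_SIZE)]

def verify_chunk(chunk_bytes, stored_checksums):
    """A chunk server can verify every block before returning data;
    a mismatch means the block is corrupted and must be restored from a replica."""
    return block_checksums(chunk_bytes) == stored_checksums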

2. In real life data does not always fill a 64 MB chunk; there are huge numbers of small files of a few KB to a few tens of KB. The metadata for 1 PB of data runs to at least tens of GB, and at a reasonably large company storage QPS is measured in the millions.
3. If you are interviewing at Databricks you should study Spark and Flink instead; they do not build storage.

Also, a question I would definitely ask as an interviewer:
1. How do you keep the master and the shadow master consistent? What happens if the master crashes halfway through a write?

http://kaushiki-gfs.blogspot.com/
GFS design assumptions:
1) Components in system fail a lot and GFS should be able to recover from it.
2) Files stored are usually large.
3) Reads of two kinds: large streaming reads and small random reads.
4) Once files are written they are mostly read. Most of the write operations are of append type.
5) Support concurrent appends by multiple clients to the same file.
6) High sustained bandwidth and throughput are more important than low latency.
Key Ideas:
Design and Architecture: A GFS cluster comprises a single master and multiple chunkservers, accessed by multiple clients. Since the files stored in GFS are huge, processing and transferring them can consume a lot of bandwidth. To use bandwidth efficiently, files are divided into large 64 MB chunks, each identified by a unique 64-bit chunk handle assigned by the master.
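For illustration, the chunk arithmetic is simple; below is a sketch assuming the master hands out handles from a counter (the real handle-assignment scheme is not described here):

import itertools

CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB
_handle_counter = itertools.count(1)   # hypothetical source of unique 64-bit handles

def new_chunk_handle():
    return next(_handle_counter) & 0xFFFFFFFFFFFFFFFF  # keep it within 64 bits

def chunk_index_for(byte_offset):
    """Which chunk of a file a given byte offset falls into."""
    return byte_offset // CHUNK_SIZE

# e.g. byte offset 200_000_000 of a file falls into chunk index 2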

No caching: File data is not cached by the client or the chunkserver. Large streaming reads gain little from caching, since most of the cached data would be overwritten before it is reused.
Single Master: Simplifies the design and allows simple centralized management. The master stores metadata and coordinates access. All metadata is kept in the master's memory, which makes operations fast; at about 64 bytes per chunk, master memory is not a serious bottleneck. To minimize master involvement, a lease mechanism is used: the lease maintains a consistent mutation (append or write) order across replicas, as sketched below.
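A minimal sketch of the lease bookkeeping the master might keep, assuming a 60-second lease extended via heartbeats (the class and method names are invented):

import time

LEASE_TIMEOUT = 60.0  # seconds

class LeaseTable:
    """At most one primary holds a valid lease per chunk, so all replicas
    apply mutations in the single order chosen by that primary."""

    def __init__(self):
        self.leases = {}  # chunk_handle -> (primary_server, expiry_time)

    def grant_or_get_primary(self, chunk_handle, candidate_server):
        primary, expiry = self.leases.get(chunk_handle, (None, 0.0))
        if time.time() >= expiry:      # no valid lease: grant a fresh one
            primary = candidate_server
            self.leases[chunk_handle] = (primary, time.time() + LEASE_TIMEOUT)
        return primary

    def extend(self, chunk_handle, server):
        """Called when a heartbeat from the current primary arrives."""
        primary, _ = self.leases.get(chunk_handle, (None, 0.0))
        if primary == server:
            self.leases[chunk_handle] = (primary, time.time() + LEASE_TIMEOUT)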
Garbage collection: The system takes a distinctive approach here. When a file is deleted, its resources are not reclaimed immediately; instead the file is renamed into a hidden namespace. Such files are removed during the regular namespace scan once they have existed there for 3 days. The advantages are: 1) simplicity, 2) deletion work can happen during the master's idle periods, and 3) safety against accidental deletion. (A sketch follows.)
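The delete-then-reclaim-later behaviour could look roughly like this (the hidden-name convention and helper names are invented; only the 3-day window comes from the description above):

import time

GRACE_PERIOD = 3 * 24 * 3600  # hidden files older than ~3 days get reclaimed

def delete_file(namespace, path):
    """Deletion just renames the file into a hidden name stamped with the deletion time."""
    hidden = ".deleted/" + path + "#" + str(int(time.time()))
    namespace[hidden] = namespace.pop(path)

def garbage_collect(namespace, now=None):
    """Runs during the master's regular namespace scan (idle periods)."""
    now = now or time.time()
    for name in list(namespace):
        if name.startswith(".deleted/"):
            deleted_at = int(name.rsplit("#", 1)[1])
            if now - deleted_at > GRACE_PERIOD:
                del namespace[name]   # the chunks of this file can now be reclaimed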
Thus the biggest vulnerability in the system is the master, which can be a single point of failure. This is mitigated by replicating the master state: shadow masters can provide read-only access in the absence of the primary master.
Relaxed consistency model
1) File namespace mutations are always atomic.
2) File region is consistent if all clients read same values from replicas.
3) File region is defined if clients see mutation writes in entirety.
4) Operation log ensures metadata stored by master is always consistent and defined.
Pros:
1) Very high availability and fault tolerance through replication: a) Chunk and master replication and b) Chunk and master recovery.
2) Simple and efficient centralized design with a single master. Delivers good performance for what it was designed for i.e. large sequential reads.
3) Concurrent writes to the same file region are not serializable, so replicas might contain duplicate records, but there is no interleaving of records. To ensure data integrity each chunkserver verifies the integrity of its own copy using checksums.
4) Read operations span at least a few 64 KB blocks, so the checksumming cost is amortized.
5) Batched operations, such as writes to the operation log and garbage collection, make better use of bandwidth.
6) Atomic append operations mean no synchronization is needed on the client side.
7) No caching eliminates cache coherence issues.
8) Decoupling the flow of data from the flow of control allows the network to be used efficiently.
9) Orphaned chunks are automatically collected using garbage collection.
10) GFS master constantly monitors each chunkserver through heartbeat messages.
Cons:
1) The special-purpose design is a limitation when applied as a general-purpose file system.
2) Many of their design decisions will be inefficient in case of smaller files:
i) Small files consist of a small number of chunks, perhaps just one, so the chunk servers storing them can become hot spots when many clients request the same file.
ii) Also, with many such small files the master's involvement grows and it can become a bottleneck; having a single master node then becomes an issue.
3) Lazy garbage collection can become a problem where files are not static: with many deletions, the delay in reclaiming physical storage can get in the way of new file creations.
4) Since a relaxed consistency model is used clients have to perform consistency checks on their own.
5) Performance might degrade as the number of writers and random writes grows.
6) Master memory is a limitation.
7) The whole system is tailored according to workloads present in Google. GFS as well as applications are adjusted and tuned as necessary since both are controlled by Google.
8) No reasoning is provided for the choice of standard chunk size (64MB).
Some of the problems Google faced:
1) Storage grew into the petabyte range. The amount of metadata maintained by the master increased, and scanning through such large amounts became an issue. The single master started becoming a bottleneck when thousands of client requests arrived simultaneously.
2) The 64 MB standard chunk size became a problem as the application mix evolved. The system had to deal with applications generating large numbers of small files, e.g. Gmail.
3) The original design choice sacrificed latency. Building many latency-sensitive, user-facing applications like Gmail and YouTube on top of a file system intended for batch-oriented applications was a major challenge.
http://programming-project.blogspot.com/2014/04/general-architecture-of-google-file.html

Chunk Servers are the workhorses of the GFS. They store 64-MB file chunks. The chunk servers don't send chunks to the master server. Instead, they send requested chunks directly to the client. The GFS copies every chunk multiple times and stores it on different chunk servers. Each copy is called a replica. By default, the GFS makes three replicas per chunk, but users can change the setting and make more or fewer replicas if desired.

How the single master is kept from being overloaded in the Google File System
Having a single master enables the master to make sophisticated chunk placement and replication decisions using global knowledge. However, the master's involvement in reads and writes must be minimized so that it does not become a bottleneck. Clients never read or write file data through the master. Instead, a client asks the master which chunk servers it should contact, caches this information for a limited time, and interacts with the chunk servers directly for many subsequent operations.

General scenario of client request handling by GFS
File requests follow a standard work flow. A read request is simple; the client sends a request to the master server to find out where the client can find a particular file on the system. The server responds with the location for the primary replica of the respective chunk. The primary replica holds a lease from the master server for the chunk in question.

If no replica currently holds a lease, the master server designates a chunk as the primary. It does this by comparing the IP address of the client to the addresses of the chunk servers containing the replicas. The master server chooses the chunk server closest to the client. That chunk server's chunk becomes the primary.   The client then contacts the appropriate chunk server directly, which sends the replica to the client.

Write requests are a little more complicated. The client still sends a request to the master server, which replies with the location of the primary and secondary replicas. The client stores this information in a memory cache. That way, if the client needs to refer to the same replica later on, it can bypass the master server. If the primary replica becomes unavailable or the replica changes then the client will have to consult the master server again before contacting a chunk server.

The client then sends the write data to all the replicas, starting with the closest replica and ending with the furthest one. It doesn't matter if the closest replica is a primary or secondary. Google compares this data delivery method to a pipeline.

Once the replicas receive the data, the primary replica begins to assign consecutive serial numbers to each change to the file. Changes are called mutations. The serial numbers instruct the replicas on how to order each mutation. The primary then applies the mutations in sequential order to its own data. Then it sends a write request to the secondary replicas, which follow the same application process. If everything works as it should, all the replicas across the cluster incorporate the new data. The secondary replicas report back to the primary once the application process is over.

At that time, the primary replica reports back to the client. If the process was successful, it ends here. If not, the primary replica tells the client what happened. For example, if one secondary replica failed to update with a particular mutation, the primary replica notifies the client and retries the mutation application several more times. If the secondary replica doesn't update correctly, the primary replica tells the secondary replica to start over from the beginning of the write process. If that doesn't work, the master server will identify the affected replica as garbage.
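The serial-number ordering described above can be sketched like this (class and method names are illustrative, not the real implementation):

class PrimaryReplica:
    """The primary assigns consecutive serial numbers to mutations and forwards
    them, so every replica applies the same changes in the same order."""

    def __init__(self, secondaries):
        self.secondaries = secondaries
        self.next_serial = 0

    def handle_write(self, mutation):
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, mutation)                   # apply locally first
        failed = [s for s in self.secondaries
                  if not s.apply(serial, mutation)]    # same order on every secondary
        return failed == []   # if any secondary failed, the client is told to retry

    def apply(self, serial, mutation):
        return True  # placeholder for writing the mutation to the local chunk file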

Advantages and disadvantages of large sized chunks in Google File System

Chunk size is one of the key design parameters. In GFS it is 64 MB, much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunk server and is extended only as needed.

Advantages

1. It reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information.

2. Since a client is more likely to perform many operations on a given large chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunk server open over an extended period of time.

3. It reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages.

Disadvantages

1. Internal fragmentation could waste space, although lazy space allocation largely avoids this.
2. Even with lazy space allocation, a small file consists of a small number of chunks, perhaps just one. The chunk servers storing those chunks may become hot spots if many clients access the same file. In practice hot spots have not been a major issue because applications mostly read large multi-chunk files sequentially; they can be mitigated with a higher replication factor and by allowing clients to read data from other clients.
http://google-file-system.wikispaces.asu.edu/

Master Server
The master maintains all file system metadata and controls some system-wide activities. Metadata includes the following:
  • File and chunk namespace
  • Access control information
  • Mapping from files to chunks
  • Current locations of chunks
System-wide activities include (explained in detail ahead):
  • Chunk lease management
  • Garbage collection of orphaned chunks
  • Chunk migration between chunkservers.
  • Checking the state of live chunkservers by sending heartbeat message.
Apparently Google has since changed GFS by putting multiple GFS masters on top of a pool of chunkservers (no public documentation yet). They ended up with a multi-cell approach: a master for each cell and more than one cell per data center. Cells across the network function as distinct file systems, and applications are responsible for partitioning data across those cells. The assumption is that each application has its own master (or set of masters).
As a disadvantage of large chunk size, small files are stored as a single chunk. So, when multiple clients perform operations on this file, the single chunk becomes a hot-spot. 

Metadata
Master stores the metadata such as file and chunk namespaces, file to chunks mapping, location of each chunk's replicas on the chunkservers. All this data is stored in memory to increase the response time.
Persistent data: Namespaces and the file-to-chunk mapping are also stored locally on disk by the master using the operation log, and are replicated to other machines (generally shadow masters). These operation logs are used to recover the last running state if the master crashes.
Non-persistent data: Chunk location information is not stored persistently, because the master can easily collect it at start-up from each chunkserver and then refresh it periodically via heartbeat messages. This eliminates the problem of keeping the master and the chunkservers in sync as chunkservers join, leave, get renamed, and so on. (A sketch follows.)
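Because chunk locations are never persisted, the master simply rebuilds its view from whatever the chunkservers report; a sketch (the report format is an assumption):

class ChunkLocations:
    """The master's chunk -> servers map, rebuilt from heartbeat reports rather
    than persisted, so it cannot drift out of sync with reality."""

    def __init__(self):
        self.locations = {}  # chunk_handle -> set of chunkserver ids

    def on_heartbeat(self, server_id, reported_chunks):
        # forget whatever we previously believed about this server...
        for servers in self.locations.values():
            servers.discard(server_id)
        # ...and trust only what it reports right now
        for handle in reported_chunks:
            self.locations.setdefault(handle, set()).add(server_id)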

In-Memory Data Structures

Keeping metadata in memory makes master operations very fast, so the master can do replica management and garbage collection in the background. The only concern with this approach is that the master's memory could become a limiting factor, though in practice it does not: each 64 MB chunk needs only about 64 bytes of metadata.
e.g. If the system stores 1 million chunks (64 TB of data), only 64 MB of memory is needed for their metadata, as the calculation below shows.
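The arithmetic behind that example, as a quick check (64 bytes of master memory per 64 MB chunk):

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB of data per chunk
METADATA_PER_CHUNK = 64         # ~64 bytes of master memory per chunk

def master_memory_for(total_data_bytes):
    chunks = total_data_bytes // CHUNK_SIZE
    return chunks * METADATA_PER_CHUNK

# 64 TB of data = 1,048,576 (~1 million) chunks; times 64 B = 64 MB of master memory
print(master_memory_for(64 * 1024**4) // 1024**2, "MB")   # -> 64 MB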

Operation Log

The operation log contains a historical record of critical metadata changes. So, in case of failure, we can easily recover by replaying these logs. Moreover, this log gives us the exact logical sequence of the concurrent operations that were carried out by the master. Each file, chunk and their version numbers are identified by logical times they were created at.
The operation log is replicated on shadow servers to increase reliability in case the master itself fails. As soon as a metadata change is requested, the operation log is updated on the main server as well as on the shadow servers, and only then is the actual change applied to the metadata and reported to the client.

Here two cases arise in case of master failure:
  • Log updated but operation failed: In this case we can always replay the log and get to the latest version.
  • Log update itself failed: In this case we know the operation failed, so we are at a consistent older version and will have to redo the recent operations.
Operation logs are used with checkpoints to attain better results.

Checkpoint

After a while the operation log grows very large and replaying it takes considerable time. To overcome this, checkpoints were introduced: whenever the log grows beyond a limit, the system checkpoints its state and dumps it to local disk. During recovery, that checkpoint is loaded into memory and only the operation log written after the checkpoint is replayed, which reduces recovery time to a great extent.

Creating a checkpoint can take some time. To avoid delaying incoming mutations, the master switches to a new log file and builds the checkpoint in a separate thread; meanwhile, new mutations are logged in the new file. Using the newest checkpoint plus the new log, the system can recover quickly.
A number of checkpoints are maintained to roll-back to a previous consistent state.
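A toy version of the log-plus-checkpoint recovery from the last two sections (file formats and names are invented; GFS actually switches to a new log file at checkpoint time, whereas this sketch just truncates it for brevity):

import json, os

class MetadataStore:
    """Log every change before applying it; checkpoint occasionally;
    recovery = load the latest checkpoint, then replay the newer log records."""

    def __init__(self, log_path="oplog.jsonl", ckpt_path="checkpoint.json"):
        self.log_path, self.ckpt_path = log_path, ckpt_path
        self.state = {}

    def mutate(self, key, value):
        # write-ahead: the change is durable in the log before it is applied
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
        self.state[key] = value

    def checkpoint(self):
        with open(self.ckpt_path, "w") as ckpt:
            json.dump(self.state, ckpt)
        open(self.log_path, "w").close()   # start with an empty log after the checkpoint

    def recover(self):
        if os.path.exists(self.ckpt_path):
            with open(self.ckpt_path) as ckpt:
                self.state = json.load(ckpt)
        if os.path.exists(self.log_path):
            with open(self.log_path) as log:
                for line in log:
                    record = json.loads(line)
                    self.state[record["key"]] = record["value"]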

http://www.nosqlnotes.net/archives/237
Google File System (GFS) is a large-scale distributed system built on inexpensive commodity servers. It treats server failure as a normal event and tolerates it automatically in software, keeping the system reliable and available while greatly reducing its cost.
GFS is the foundation of Google's cloud storage: other storage systems such as Google Bigtable, Google Megastore and Google Percolator are built directly or indirectly on top of it. Google's large-scale batch system MapReduce also relies on GFS for the input and output of massive data sets.
System architecture
GFS divides the nodes of the system into three roles: the GFS Master (control server), GFS Chunkservers (chunk servers, abbreviated CS) and GFS Clients.
GFS files are split into fixed-size chunks; at creation time the master assigns each chunk a globally unique 64-bit chunk handle. Chunkservers store chunks on disk as ordinary Linux files. For reliability, every chunk is replicated on several machines, three by default.
The master maintains the system's metadata, including the file and chunk namespaces, the mapping from GFS files to chunks, and chunk locations. It is also responsible for global control of the system, such as chunk lease management, garbage collection of unused chunks, and chunk re-replication. The master periodically exchanges information with the chunkservers through heartbeat messages.
The client is the access interface GFS offers to applications. It is a set of dedicated APIs that do not follow the POSIX standard, shipped as a library. To access GFS, a client first contacts the master to learn which chunkservers to talk to, then reads and writes data by interacting with those chunkservers directly.
Note that GFS clients do not cache file data; they only cache the metadata obtained from the master. This is dictated by GFS's workloads. GFS has two main applications: MapReduce and Bigtable. MapReduce clients read and write sequentially, so there is no point in caching file data, while Bigtable, as a cloud table service, implements its own cache internally. Besides, keeping a client cache consistent with the underlying data is an extremely complicated problem.

http://www.theregister.co.uk/2009/08/12/google_file_system_part_deux/
"High sustained bandwidth is more important than low latency," read the original GPS research paper. "Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response-time requirements for an individual read and write."

But even when the system is running well, there can be delays. "There are places in the design where we've tried to optimize for throughput by dumping thousands of operations into a queue and then just processing through them," he continues. "That leads to fine throughput, but it's not great for latency. You can easily get into situations where you might be stuck for seconds at a time in a queue just waiting to get to the head of the queue."

Quinlan and crew are moving to a system that uses not only distributed slaves but distributed masters. And the slaves will store much smaller files. The chunks will go from 64MB down to 1MB.

This takes care of that single point of failure. But it also handles the file-count issue - up to a point. With more masters you can not only provide redundancy, you can also store more metadata. "The distributed master certainly allows you to grow file counts, in line with the number of machines you're willing to throw at it," Quinlan says. "That certainly helps."

http://prismoskills.appspot.com/lessons/System_Design_and_Big_Data/Chapter_04_-_Google_File_System.jsp
Components can fail anytime. Some can fail so badly that they cannot be recovered at all.
Update operations are not recommended. Most operations are either read or append.
GFS should constantly monitor itself and replace lost components quickly.
Atomic appends should be guaranteed but with minimal synchronization overhead.
Latency is not of paramount importance as goal is to support batch systems.
GFS supports all regular file operations like create, delete, open, close, read, write and file hierarchies.
It also supports snapshot and atomic appends.

Architecture
  1. One master and multiple chunkservers.
  2. Files are divided into fixed-size chunks.
    Each chunk is identified by immutable and globally unique 64-bit chunk-ID assigned by master at the time of chunk creation.
    This allows for 2^64 ≈ 1.8x10^19 unique chunks. Even if each chunk were only one MB, the addressable storage would be on the order of 10^13 terabytes.
  3. Usually the chunk size is 64 MB. Such a large chunk size reduces the number of metadata entries in the master node.
    Also, clients can comfortably perform several operations in the vicinity of a hot or current file region without opening too many network connections.
  4. Each chunk is replicated (usually 3 times) for reliability. Replication factor is configurable.
  5. Master manages all metadata and system-wide activities like:
    1. Namespace
    2. Mapping from files to chunks
    3. Garbage collection of orphaned chunks
    4. Chunk migration between servers
    5. Heartbeat management to all chunkservers for health checking.
  6. There is no caching of file chunks - mostly due to the enormous size of chunks and very little re-use of the same portions of a file.
  7. Clients do cache metadata however.
  8. Master node does not participate in any read/write of actual file contents.
    It only hands over replica information to clients, who interact directly with the replicas.
  9. For very frequently accessed files, throughput of queries per second can be increased by adding more replicas.
  10. The master stores all the metadata in memory. Large filenames are compressed using prefix compression.
    This helps the master to be very fast which in turn reduces the load on the master when serving queries.
  1. Master does not persist the chunk-to-chunkserver mapping but gathers it by periodically refreshing it from the chunkservers.
    This is more efficient and reliable when hundreds of nodes are involved.
  2. Operation Log (also called T-log (T=Transaction) in some systems) contains a historical record of critical metadata changes.
    Write calls do not succeed unless an entry has been made in the Operation Log.
    Since it is so critical, this log is replicated across several machines.
    Master can recover from a crash/shutdown by replaying its state from the T-log.
    To avoid full replay, a checkpoint is also stored such that master does not need to replay state before that checkpoint.
  3. A very simple strategy for efficient checkpointing is to roll over the logfile for T-log when a particular number of records have been processed.
  4. Master guarantees consistency by ensuring the order of mutations on all replicas and using chunk version numbers.
    If a replica has incorrect version, it is garbage collected at the earliest.
  5. Since clients cache chunk locations, they could read stale data.
    So clients are expected to refresh their cache frequently.
  6. Checksums are used to check chunks' validity. If a chunk is corrupted, it is restored from other replicas as soon as possible.
  7. GFS guarantees at-least-once writes for writers.
    This means that records could be written more than once as well (although rarely).
    It is the responsibility of the readers to deal with these duplicate records.
    This is achieved by having checksums and serial numbers in the records, which help readers filter out and discard duplicate data (see the sketch after this list).
  8. The application forwards the write data to all replicas, including the primary node.
    Once all replicas have received the data, a write request is sent to the primary node.
    This is done so that the primary can assign a serial number to the mutation and send the same ordering to all other replicas.
    This keeps multiple mutations to the same file ordered consistently even with multiple writers.
  9. If any replica fails, the above step is repeated, which can result in duplicate writes.
    Hence the guarantee is of writing at least once.
  10. Snapshots are created almost instantaneously by following the copy-on-write method.
    This means that when a snapshot request comes, only the metadata is copied.
    Chunk references in the metadata point to older chunks only.
    Only when a write request arrives for the old or the new chunk is an actual copy made.
    Also, the copy is always made locally on the same node to save network traffic.
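A sketch of the reader-side duplicate filtering mentioned in point 7 above; embedding an id and checksum in each record is an assumption about how an application might do it, not a format defined by GFS:

import zlib

def valid_and_new(record_id, payload, checksum, seen_ids):
    """Readers skip corrupt records (bad checksum) and duplicates (id already seen)."""
    if zlib.crc32(payload) != checksum:
        return False          # corrupt record or padding region: skip it
    if record_id in seen_ids:
        return False          # duplicate left over from a retried append: skip it
    seen_ids.add(record_id)
    return True

def read_records(records):
    """records: iterable of (record_id, payload, checksum) tuples, appended at least once."""
    seen = set()
    return [payload for record_id, payload, checksum in records
            if valid_and_new(record_id, payload, checksum, seen)]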
Master-Node Operations
  1. GFS does not have an i-node-like tree structure for directories and files.
    Rather it has a hash map that maps a full pathname to its metadata, and read-write locks are applied on each path node for synchronization.
    In any operation involving /d1/d2/d3/.../dn/leaf, read locks are applied on /d1, /d1/d2/, /d1/d2/d3/, ..., /d1/d2/.../dn/ and a read or write lock on the leaf.
    Due to this, concurrent writes on the same leaf are prevented right away.
    The advantage of this scheme is that it still allows concurrent modifications in the same directory. (A sketch of this lock plan follows the list.)
  2. Replicas should be spread as widely as possible - for example, never all in the same rack.
    This gives better reliability and faster reads at the cost of slower writes.
  3. Stale replicas are detected by version number strategy.
    During any append/write operation, master increments the version number of all nodes participating in the write operation.
    If a node is left out in this process, it will report a lower version number when it comes online.
    At this time, master detects that node as having a stale copy and schedules it for deletion.
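A sketch of the namespace lock plan from point 1 of this list (lock acquisition itself is omitted; the function only shows which paths are locked in which mode):

def lock_plan(leaf_path):
    """GFS-style namespace locking: read locks on every ancestor path,
    a read or write lock (write shown here) on the leaf itself."""
    parts = leaf_path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    return [(p, "read") for p in ancestors] + [("/" + "/".join(parts), "write")]

# e.g. creating /home/user/file takes read locks on /home and /home/user and a
# write lock on /home/user/file, so two clients can create different files in
# /home/user concurrently but cannot mutate the same leaf at once.
print(lock_plan("/home/user/file"))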
http://research.google.com/archive/gfs.html
https://docs.google.com/viewer?url=http%3A%2F%2Fresearch.google.com%2Farchive%2Fgfs-sosp2003.pdf
http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
http://queue.acm.org/detail.cfm?id=1594206
