Thursday, October 22, 2015

f4: Facebook's Warm BLOB Storage System



http://www.umbrant.com/blog/2014/f4_facebook_warm_blob_storage.html
Haystack is very good at what it was designed to do: fast random access to write-once blobs. In short, it writes out all these objects log-structured to large 100GB files on the local filesystem, and maintains an in-memory index of blob locations so it can serve up a blob with at most a single disk seek.
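A minimal sketch of that layout (toy code with hypothetical names, not Facebook's implementation; real Haystack also writes per-needle headers and persists its index): append each blob to one big log file and keep an in-memory map of offsets, so a read is one in-memory lookup plus at most one seek.

import os

class ToyHaystackVolume:
    """Toy log-structured blob volume: one append-only file plus an
    in-memory index of blob locations."""

    def __init__(self, path):
        self.index = {}                       # blob_id -> (offset, length)
        self.log = open(path, "ab+")          # append-only data file

    def put(self, blob_id, data):
        self.log.seek(0, os.SEEK_END)
        offset = self.log.tell()
        self.log.write(data)                  # sequential append
        self.log.flush()
        self.index[blob_id] = (offset, len(data))

    def get(self, blob_id):
        offset, length = self.index[blob_id]  # in-memory lookup, no disk I/O
        self.log.seek(offset)                 # at most one seek...
        return self.log.read(length)          # ...then one sequential read

vol = ToyHaystackVolume("/tmp/toy_volume.dat")
vol.put("photo:42", b"\x89PNG...")
assert vol.get("photo:42") == b"\x89PNG..."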

The downside of Haystack is that it's not very space efficient. Data is replicated at the node level by RAID-6 (1.2x overhead) and also stored in three geographically separate data centers, for a total replication factor of 3.6x. f4 improves on this by using erasure coding, which drops the replication factor to 2.1x. Considering that Facebook has 65PB of warm blobs, we're looking at tens of PBs in savings (meaning millions of dollars).

However, the downsides of erasure coding are a lower sustainable request rate and more expensive failure recovery. With erasure coding, there is only a single copy of each data block available to serve reads. Failure recovery is more expensive since it requires reading the other data and parity blocks in the stripe; in the meantime, client reads require online erasure decoding unless they fail over to another datacenter.

Haystack is great for hot data and f4 is great for warm data; the key is determining where a given blob belongs.

Determining hotness
Facebook's blobs tend to be accessed frequently when they're first uploaded, after which the access rate drops off exponentially. There are a few different types of blobs, e.g. photos, videos, attachments, and each has a different access rate and drop-off. They chose to look at a nifty metric over time: 99th percentile IOPS/TB. Based on synthetic benchmarks, they knew f4's 4TB drives could handle 80 IOPS with acceptable latency, i.e. 20 IOPS/TB. This meant migration made sense when the 99th percentile IOPS/TB for a type of blob fell below 20.

Profile photos, it turned out, do not exhibit a strong drop-off and are never moved to f4. Other photos ended up being hot for about three months, and everything else was only hot for one month.
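As a rough sketch of that tiering decision, using the drive benchmark and hot windows quoted above (the function and names are hypothetical, not Facebook's code):

# Threshold derivation: a 4 TB drive was benchmarked at ~80 IOPS with
# acceptable latency, i.e. 80 IOPS / 4 TB = 20 IOPS/TB.
DRIVE_IOPS = 80
DRIVE_TB = 4
F4_IOPS_PER_TB_LIMIT = DRIVE_IOPS / DRIVE_TB   # = 20

# Per-type hot windows observed from the access data (months).
HOT_WINDOW_MONTHS = {"photo": 3, "video": 1, "attachment": 1}

def belongs_in_f4(blob_type, age_months, p99_iops_per_tb):
    """Warm (f4) if the type's 99th-percentile IOPS/TB has fallen below what
    f4 drives can sustain and the blob is past its type's hot window."""
    if blob_type == "profile_photo":
        return False                            # never migrated: no strong drop-off
    past_hot_window = age_months >= HOT_WINDOW_MONTHS.get(blob_type, 1)
    return past_hot_window and p99_iops_per_tb < F4_IOPS_PER_TB_LIMIT

print(belongs_in_f4("photo", age_months=4, p99_iops_per_tb=8))      # True
print(belongs_in_f4("video", age_months=0.5, p99_iops_per_tb=90))   # False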

http://blog.acolyer.org/2014/12/16/f4-facebooks-warm-blob-storage-system/
It's the story of how Facebook implemented a tiered storage solution for BLOBs and introduced per-data-class (temperature) tuning of replication factor, latency, and time-to-recovery.

There is a strong correlation between the age of a BLOB and its temperature. Newly created BLOBs are requested at a far higher rate than older BLOBs. For instance, the request rate for week-old BLOBs is an order of magnitude lower than for less-than-a-day old content for eight of nine examined types. In addition, there is a strong correlation between age and the deletion rate. We use these findings to inform our design…

These data access patterns support a two-tier BLOB storage solution with Haystack used for ‘hot’ blobs, and content then migrating to f4 after three months for photos, and after one month for other blob types. By reducing the replication factor for ‘warm’ blobs, the average request latency goes up slightly (from 14ms to 17ms), and the time to recovery after failure is increased, but the cost of storage is significantly reduced. The effective replication factor (ratio of physical data size to logical data size) for Haystack is 3.6; for f4 it has been brought down to 2.1.

Using Reed-Solomon coding achieves reliability at lower storage overheads than full replication, at the cost of longer rebuild and recovery times under failure. A (10,4) encoding has a 1.4x expansion factor. Keeping two copies of this would give a 2.8x effective replication factor, but the XOR technique further reduces the replication factor to 2.1.
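Those factors can be checked with quick back-of-the-envelope arithmetic (the XOR buddy scheme is described in more detail in the Register excerpt below):

# Haystack: RAID-6 node-level redundancy (~1.2x) times three geographic copies.
haystack = 1.2 * 3                      # 3.6x

# f4: an RS(10, 4) stripe stores 10 data blocks plus 4 parity blocks.
rs_expansion = (10 + 4) / 10            # 1.4x per coded copy

# Two full coded copies in two regions would cost:
two_copies = 2 * rs_expansion           # 2.8x

# Instead, f4 keeps coded volume A in one region, coded volume B in a second,
# and a coded XOR(A, B) in a third: three coded copies' worth of storage
# covering two logical volumes.
xor_scheme = 3 * rs_expansion / 2       # 2.1x

print(round(haystack, 1), round(two_copies, 1), round(xor_scheme, 1))  # 3.6 2.8 2.1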

Mechanical Sympathy
  • Choosing a 1 GB block size for encoding reduces the number of blobs that span multiple blocks and therefore need multiple I/O operations to read.
  • Separating storage-intensive tasks from compute-intensive tasks in the architecture (for example, the introduction of a separate transformation tier) enables the use of appropriate hardware for each.
  • Candidate hard drives were benchmarked to determine the maximum IOPS they could consistently achieve at the desired latency.
An important consideration in the design of f4 was keeping the hardware and software well matched. Hardware that provides capacity or IOPS that are not used by the software is wasteful; software designed with unrealistic expectations of the hardware will not work. The hardware and software components of f4 were co-designed to ensure they were well-matched by using software measurements to inform hardware choices and vice-versa.

The lower CPU requirements of the storage nodes may even enable the use of lower-powered CPUs in the future, introducing a second form of cost saving.

Self-healing
f4 is designed to handle disk failures, host failures, rack failures, and data center failures. The failure rates are interesting: drives have an annual failure rate of about 1% (so > 1 drive failure per cell per week), hosts fail ‘periodically’, and racks fail ‘multiple times per year’.
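That weekly figure checks out against the cell dimensions quoted later in the Register excerpt (14 racks of 15 hosts with 30 drives each):

# Quick check on the ">1 drive failure per cell per week" figure,
# assuming ~1% annual drive failure rate.
drives_per_cell = 14 * 15 * 30                # 6300 drives per cell
annual_failures = drives_per_cell * 0.01
per_week = annual_failures / 52
print(drives_per_cell, round(per_week, 2))    # 6300, ~1.21 failures/week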

When there are failures in a cell, some data blocks become unavailable, and serving reads for the BLOBs they hold requires reconstructing them online from companion data blocks and parity blocks. Backoff nodes are storage-less, CPU-heavy nodes that handle this online reconstruction of requested BLOBs.

This online reconstruction happens in the request path and rebuilds only the requested blob, not the full block (the blob is typically much smaller than the block). This keeps request latency down. Full block rebuilding is then handled offline by dedicated rebuilder nodes.
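To make the "rebuild only the requested blob" idea concrete, here is a toy sketch. It substitutes a single XOR parity block for f4's RS(10,4) code purely to keep the decode to one line; the point being illustrated is that only the requested byte range of the surviving companion and parity blocks needs to be read.

from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

BLOCK_SIZE = 16          # toy blocks; f4's data blocks are ~1 GB

# A stripe: four data blocks plus one XOR parity block (stand-in for RS parity).
data_blocks = [bytes([i]) * BLOCK_SIZE for i in range(4)]
parity = reduce(xor_bytes, data_blocks)

def read_blob(blocks, parity, block_idx, offset, length, failed=()):
    """Serve a blob living at [offset, offset+length) inside data block block_idx."""
    if block_idx not in failed:
        return blocks[block_idx][offset:offset + length]     # normal single read
    # Backoff-node path: fetch only the same byte range from the surviving
    # companion blocks and the parity block, then decode just that range.
    pieces = [blocks[i][offset:offset + length]
              for i in range(len(blocks)) if i != block_idx]
    pieces.append(parity[offset:offset + length])
    return reduce(xor_bytes, pieces)

# The 4-byte blob in failed block 2 is recovered without rebuilding the whole block.
assert read_blob(data_blocks, parity, 2, offset=5, length=4, failed={2}) == bytes([2]) * 4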

Thus different components at different layers in the system architecture are continually watching, repairing, and rebalancing the system.

Facebook lessons learned
Among these, the importance of simplicity for operational stability, the importance of measuring underlying software for your use case’s efficiency, and the need for heterogeneity in hardware to reduce the likelihood of correlated failures stand out.

Hardware heterogeneity gives immunity against shared weaknesses in homogeneous hardware.

It all begins with a detailed understanding of the use case and usage patterns, backed up by data analysis.

The software and hardware are considered together, so that the system architecture enables the right hardware to be used for the right job (in previous versions of the system, compute-heavy and storage-heavy tasks were mixed on the same nodes). Finally, the multiple layers of repair and recovery create a self-sustaining, self-healing system that can survive multiple and continual failures.

http://www.theregister.co.uk/2014/10/13/facebook_codes_warm_erasure_blobs_storage/
  • Erasure coding: adding calculated parity values (Reed-Solomon codes) to a string of bytes so that the string can be recovered if an error deletes or corrupts part of it. Typically more efficient than RAID at protecting data, as it uses less space.
Facebook’s special problem is that it has three main types of user data, with associated metadata, and these three types need huge amounts of storage. Its main and most-accessed datasets are the recent, less than one-week-old postings on a user’s timeline. These get accessed a lot by the user’s "Friends".

Volumes are stored in one data centre and in cells, where a cell is 14 racks of 15 hosts with 30 x 4TB drives per host. Each volume/stripe/block is paired with a buddy volume/stripe/block in a different geographic region. Facebook stores an XOR of the buddies in a third region. This scheme protects against failure of one of the three regions.
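A minimal sketch of that buddy-XOR recovery, with toy byte strings standing in for the Reed-Solomon-coded volumes:

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

block_a = b"coded volume in region 1"    # primary coded copy, region 1
block_b = b"coded volume in region 2"    # buddy coded copy, region 2
xor_copy = xor(block_a, block_b)         # only the XOR is stored in region 3

# If region 1 is lost, its data is rebuilt from the two surviving regions.
assert xor(xor_copy, block_b) == block_a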

http://blog.dshr.org/2014/10/facebooks-warm-storage.html
A BLOB is a Binary Large OBject. Each type of BLOB contains a single type of immutable binary content, such as photos, videos, documents, etc. Section 3 of the paper is a detailed discussion of the behavior of BLOBs of different kinds in Facebook's storage system.

Figure 3 shows that the rate of I/O requests to BLOBs drops rapidly through time. The rates for different types of BLOB drop differently, but all 9 types have dropped by 2 orders of magnitude within 8 months, and all but 1 (profile photos) have dropped by an order of magnitude within the first week.


  • That significant kinds of data should be moved from expensive, high-performance hot storage to cheaper warm and then cold storage as rapidly as feasible.
  • That the I/O rate that warm storage should be designed to sustain is so different from that of hot storage, at least 2 and often many more orders of magnitude, that attempting to re-use hot storage technology for warm and even worse for cold storage is futile.
http://www.datacenterknowledge.com/archives/2015/05/08/cold-storage-the-facebook-data-centers-that-back-up-the-backup/
The storage server can power up without sending power to any of the drives. Custom software controls which drive powers on when it is needed.
This way, a cold storage facility only needs to supply enough power for six percent of all the drives it houses. Overall, the system needs one-quarter of the power traditional storage servers need.
This allowed the team to further strip down the design. Instead of three power shelves in the Open Rack, a cold storage rack has only one, and there are five power supplies per shelf instead of seven. The number of Open Rack bus bars was reduced from three to one.
Today, data from a hot data center on the West Coast is backed up in a cold storage site on the East Coast and vice versa. The next step would be to apply the Reed-Solomon technique across multiple geographically remote cold storage sites, Patiejunas said.


PDF:
http://www-bcf.usc.edu/~wyattllo/papers/f4-osdi14.pdf

Video:
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/muralidhar
