Saturday, April 2, 2016

Big Data Misc
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.
A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.

在实践中,我们经常需要判断一个元素是否在一个集合中,例如垃圾邮件过滤,爬虫的网址去重,等等。这题也是一道很经典的题目,称为成员查询(Membership Query)。
答案: Bloom Filter
给定一个无限的整数数据流,如何查询值在某个范围内的元素出现的总次数?这个经典的大数据场景下的问题,称为范围查询(Range Query)。

Spark与Hadoop基于相同的HDFS文件存储系统, Spark将中间结果保存在内存中而不是将其写入磁盘,当需要多次处理同一数据集时,这一点特别实用。Spark的设计初衷就是既可以在内存中又可以在磁盘上工作的执行引擎。当内存中的数据不适用时,Spark操作符就会执行外部操作。Spark可以用于处理大于集群内存容量总和的数据集。

case: 将Spark、Kafka和Apache Cassandra结合在一起,其中Kafka负责输入的流式数据,Spark完成计算,最后Cassandra NoSQL数据库用于保存计算结果数据。

Hbase and Lucene flush and compaction (buffer-flush-merge strategy)

First written to an in-memory store (memstore) (heap), once this memstore reaches a certain size, it is flushed to disk into a store file (everything is also written immediately to a log file for durability). The store files created on disk are immutable. Sometimes the store files are merged together, this is done by a process called compaction.
In contrast to Lucene though, were one usually has only one IndexWriter, HBase typically has hundreds of regions open.

 It is a good thing that the number of store files stays small, because when you read a row from HBase it needs to check all the store files and the memstore (this can be optimized through bloomfilters or by specifyingtimestamp ranges).
There are two kinds of compactions:
  • the minor compactions: these are triggered each time a memstore is flushed, and will merge some of the store files, determined by an algorithm described below.
  • the major compactions: these run about every 24 hours (after the currently oldest store file was written), and merge together all store files into one. The 24 hours is adjusted with a random margin of up to 20% to avoid many major compactions happening at the same time. Major compactions can also be triggered manually, via the API or the shell.
There is another difference between minor and major compactions: major compactions process delete markers, max versions, etc, while minor compactions don’t. This is because delete markers might also affect data in the non-merged files, so it is only possible to do this when merging all files.
If, after a compaction, a newly written store file is greater than a certain size (default 256 MB, see property hbase.hregion.max.filesize), the region will be split into two new regions.

When you perform a delete in HBase, nothing gets deleted immediately, rather a delete marker (a.k.a. thombstone) is written. This is because HBase does not modify files once they are written.

(key value pair storage)
hbased - byte[]
Amazon DynamoDB stores structured data, indexed by primary key, and allows low latency read and write access to items ranging from 1 byte up to 64KB. Amazon S3 stores unstructured blobs and suited for storing large objects up to 5 TB. In order to optimize your costs across AWS services, large objects or infrequently accessed data sets should be stored in Amazon S3, while smaller data elements or file pointers (possibly to Amazon S3 objects) are best saved in Amazon DynamoDB.

Cassandra: all nodes serve the same role – that of a primary replica for a portion of the data.  
Systems using master-slave replication and/or mixed-master fail-over schemes sometimes fall into the “distributed” category, but these techniques are fragile and impose unnecessary limits on scalability (usually in the form of hardware interconnection complexity and geographic proximity).

Though the node is the “primary” for a portion of the data in the cluster, the number of copies of the data kept on other nodes in the cluster is configurable. When a node goes down, the other nodes containing copies, referred to as “replicas”, continue to service read requests and will even accept writes for the down node. When the node returns, these queued up writes are sent from the replicas to bring the node back up to date (you can find more detail on this process, known as hinted handoff, and Cassandra’s implementation of such here:
Another benefit of this design is the ease of which new nodes can be added. When a new node is brought in to the cluster, it can take over a portion of the data from existing nodes, relieving them of the responsibility for that range of data. Because all nodes are the same, this communication can happen seamlessly in a running cluster with the nodes exchanging messages to one another and the rest of the cluster as needed.
Having all nodes share the same role also streamlines operations and systems administrations tasks as well. Because Cassandra has a single node type, it has only a single set of requirements for hardware, for monitoring, and deployment.
Gossip Protocol:
Gossip is a peer-to-peer communication protocol in which nodes periodically exchange state information about themselves and about other nodes they know about. The gossip process runs every second and exchanges state messages with up to three other nodes in the cluster. The nodes exchange information about themselves and about the other nodes that they have gossiped about, so all nodes quickly learn about all other nodes in the cluster.
(broadcast routing)


Amazon DynamoDB stores structured data, indexed by primary key, and allows low latency read and write access to items ranging from 1 byte up to 64KB. Amazon S3 stores unstructured blobs and suited for storing large objects up to 5 TB. In order to optimize your costs across AWS services, large objects or infrequently accessed data sets should be stored in Amazon S3, while smaller data elements or file pointers (possibly to Amazon S3 objects) are best saved in Amazon DynamoDB.

Hbase stored  

    No comments:

    Post a Comment


    Review (561) System Design (304) System Design - Review (196) Java (179) Coding (75) Interview-System Design (65) Interview (60) Book Notes (59) Coding - Review (59) to-do (45) Linux (40) Knowledge (39) Interview-Java (35) Knowledge - Review (32) Database (31) Design Patterns (29) Product Architecture (28) Big Data (27) Soft Skills (27) Concurrency (26) MultiThread (26) Miscs (25) Cracking Code Interview (24) Distributed (24) Career (22) Interview - Review (21) Java - Code (21) Operating System (21) Interview Q&A (20) OOD Design (20) System Design - Practice (19) How to Ace Interview (16) Security (16) Algorithm (15) Brain Teaser (14) Google (14) Redis (14) Linux - Shell (13) Spark (13) Spring (13) Code Quality (12) How to (12) Interview-Database (12) Interview-Operating System (12) Tools (12) Architecture Principles (11) Company - LinkedIn (11) Solr (11) Testing (11) Resource (10) Search (10) Amazon (9) Cache (9) Web Dev (9) Architecture Model (8) Better Programmer (8) Company - Uber (8) Interview - MultiThread (8) Java67 (8) Math (8) OO Design principles (8) SOLID (8) Scalability (8) Trouble Shooting (8) Cassandra (7) Company - Facebook (7) Design (7) Git (7) Interview Corner (7) JVM (7) Java Basics (7) Kafka (7) Machine Learning (7) NoSQL (7) C++ (6) File System (6) Highscalability (6) How to Better (6) Network (6) Restful (6) CareerCup (5) Code Review (5) Hash (5) How to Interview (5) JDK Source Code (5) JavaScript (5) Leetcode (5) Must Known (5) API Design (4) Be Architect (4) Big Fata (4) C (4) Company Product Architecture (4) Data structures (4) Design Principles (4) Facebook (4) GeeksforGeeks (4) Generics (4) Google Interview (4) Hardware (4) JDK8 (4) Optimization (4) Product + Framework (4) Puzzles (4) Python (4) Shopping System (4) Source Code (4) Web Service (4) node.js (4) Back-of-Envelope (3) Chrome (3) Company - Pinterest (3) Company - Twiiter (3) Company - Twitter (3) Consistent Hash (3) Elasticsearch (3) GOF (3) Game Design (3) GeoHash (3) Growth (3) Guava (3) Html (3) Interview-Big Data (3) Interview-Linux (3) Interview-Network (3) Java EE Patterns (3) Javarevisited (3) Map Reduce (3) Math - Probabilities (3) Performance (3) RateLimiter (3) Resource-System Desgin (3) Scala (3) UML (3) ZooKeeper (3) geeksquiz (3) AI (2) Advanced data structures (2) AngularJS (2) Behavior Question (2) Bugs (2) Coding Interview (2) Company - Netflix (2) Crawler (2) Cross Data Center (2) Data Structure Design (2) Database-Shard (2) Debugging (2) Docker (2) Garbage Collection (2) Go (2) Hadoop (2) Interview - Soft Skills (2) Interview-Miscs (2) Interview-Web (2) JDK (2) Logging (2) POI (2) Papers (2) Programming (2) Project Practice (2) Random (2) Software Desgin (2) System Design - Feed (2) Thread Synchronization (2) Video (2) reddit (2) Ads (1) Algorithm - Review (1) Android (1) Approximate Algorithms (1) Base X (1) Bash (1) Books (1) C# (1) CSS (1) Client-Side (1) Cloud (1) CodingHorror (1) Company - Yelp (1) Counter (1) DSL (1) Dead Lock (1) Difficult Puzzles (1) Distributed ALgorithm (1) Eclipse (1) Facebook Interview (1) Function Design (1) Functional (1) GoLang (1) How to Solve Problems (1) ID Generation (1) IO (1) Important (1) Internals (1) Interview - Dropbox (1) Interview - Project Experience (1) Interview Stories (1) Interview Tips (1) Interview-Brain Teaser (1) Interview-How (1) Interview-Mics (1) Interview-Process (1) Java Review (1) Jeff Dean (1) Joda (1) LeetCode - Review (1) Library (1) LinkedIn (1) LintCode (1) Mac (1) Micro-Services (1) Mini System (1) MySQL (1) Nigix (1) NonBlock (1) Process (1) Productivity (1) Program Output (1) Programcreek (1) Quora (1) RPC (1) Raft (1) Reactive (1) Reading (1) Reading Code (1) Refactoring (1) Resource-Java (1) Resource-System Design (1) Resume (1) SQL (1) Sampling (1) Shuffle (1) Slide Window (1) Spotify (1) Stability (1) Storm (1) Summary (1) System Design - TODO (1) Tic Tac Toe (1) Time Management (1) Web Tools (1) algolist (1) corejavainterviewquestions (1) martin fowler (1) mitbbs (1)

    Popular Posts