Saturday, April 2, 2016

Big Data Misc



https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.

https://searchaws.techtarget.com/definition/data-lake
A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
https://soulmachine.gitbooks.io/system-design/content/cn/bigdata/membership-query.html
Given an infinite data stream and a finite set, how do you determine whether each element of the stream belongs to the set?

In practice we often need to test whether an element belongs to a set, e.g. spam filtering or URL deduplication in a crawler. This is a classic problem known as membership query.
Answer: Bloom Filter
Given an infinite stream of integers, how do you count the total number of occurrences of elements falling within a given range? This classic big-data problem is known as range query.
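A minimal Bloom filter sketch in Java, assuming a simple double-hashing scheme derived from String.hashCode (the bit-array size and hash count are illustrative, not tuned):

```java
import java.util.BitSet;

// Minimal Bloom filter: k hash positions derived from two base hashes
// (double hashing). False positives are possible, false negatives are not;
// suitable for membership queries such as crawler URL dedup.
public class BloomFilter {
    private final BitSet bits;
    private final int m;   // number of bits
    private final int k;   // number of hash functions

    public BloomFilter(int m, int k) {
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    private int hash(String s, int i) {
        int h1 = s.hashCode();
        int h2 = (h1 >>> 16) | 1;              // second hash, forced odd
        return Math.floorMod(h1 + i * h2, m);
    }

    public void add(String s) {
        for (int i = 0; i < k; i++) bits.set(hash(s, i));
    }

    public boolean mightContain(String s) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(hash(s, i))) return false;  // definitely absent
        }
        return true;                                   // probably present
    }

    public static void main(String[] args) {
        BloomFilter f = new BloomFilter(1 << 20, 5);
        f.add("http://example.com/page1");
        System.out.println(f.mightContain("http://example.com/page1")); // true
        System.out.println(f.mightContain("http://example.com/other")); // almost surely false
    }
}
```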

http://itsumomono.blogspot.com/2015/07/cassandra-hdfs-mr-zk-kafka-storm-hbase.html
Every step of a MapReduce data-processing pipeline requires a Map phase and a Reduce phase, and to use this approach every use case has to be recast in the MapReduce pattern.
Before the next step can start, the previous job's output must be stored in the distributed file system, so replication and disk I/O slow this approach down. Hadoop solutions also typically involve clusters that are hard to install and manage, and covering different big-data use cases means integrating a number of separate tools (e.g. Mahout for machine learning and Storm for stream processing).
Anything complex requires chaining a series of MapReduce jobs and executing them sequentially; each job has high latency, and none can start until the previous one has finished.
Spark, by contrast, lets developers build complex multi-step data pipelines as a directed acyclic graph (DAG), and supports in-memory data sharing across the DAG so that different jobs can work on the same data.
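A sketch of such a multi-step pipeline in Spark's Java API (the input path and filtering logic are made up); each transformation just extends the DAG, and no intermediate output is written to HDFS between steps:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class DagPipeline {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("dag-pipeline").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Step 1: load the raw lines (illustrative path).
            JavaRDD<String> lines = sc.textFile("hdfs:///logs/input.txt");

            // Step 2: filter. In chained MapReduce this would be a second job
            // reading the first job's HDFS output; here it only extends the DAG.
            JavaRDD<String> errors = lines.filter(l -> l.contains("ERROR"));

            // Step 3: count per key; the shuffle stays in memory where possible.
            JavaPairRDD<String, Integer> counts = errors
                    .mapToPair(l -> new Tuple2<>(l.split(" ")[0], 1))
                    .reduceByKey(Integer::sum);

            // Only the final result hits the distributed file system.
            counts.saveAsTextFile("hdfs:///logs/error-counts");
        }
    }
}
```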


Spark builds on the same HDFS file storage as Hadoop, but keeps intermediate results in memory instead of writing them to disk, which is especially useful when the same dataset must be processed multiple times. Spark was designed from the start as an execution engine that works both in memory and on disk: when the data does not fit in memory, Spark's operators spill to external storage, so Spark can handle datasets larger than the aggregate memory of the cluster.
Spark tries to keep as much data as possible in memory before writing to disk; part of a dataset can live in memory while the rest sits on disk. Developers need to estimate memory requirements from their data and use case. Spark's performance advantage comes from this in-memory data storage.
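A sketch of that memory-plus-disk knob using the MEMORY_AND_DISK storage level (the path is illustrative): partitions that fit are cached on heap, the rest spill to local disk, and both actions below reuse the cached data instead of re-reading HDFS:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class PersistExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("persist-example").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> events = sc.textFile("hdfs:///events");   // illustrative path

            // Cache what fits in memory, spill the remainder to disk.
            events.persist(StorageLevel.MEMORY_AND_DISK());

            long total = events.count();                                    // first action materializes the RDD
            long errors = events.filter(l -> l.contains("ERROR")).count(); // second action reuses the cache
            System.out.println(errors + " errors out of " + total + " events");
        }
    }
}
```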

case: combine Spark, Kafka, and Apache Cassandra: Kafka handles the incoming stream data, Spark performs the computation, and the Cassandra NoSQL database stores the results.
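A sketch of that pipeline, assuming the spark-streaming-kafka 0.8 direct API and the DataStax Java driver (broker address, topic, keyspace, and table names are all made up); a production job would reuse Cassandra sessions, or use the spark-cassandra-connector, rather than opening a connection per partition:

```java
import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import scala.Tuple2;

import java.util.*;

public class KafkaSparkCassandra {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("kafka-spark-cassandra").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", "localhost:9092");
        Set<String> topics = Collections.singleton("events");   // made-up topic

        // Kafka is the ingest buffer: Spark pulls each micro-batch from the brokers.
        JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
                jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
                kafkaParams, topics);

        // Spark does the computation: count events per value in each batch.
        stream.mapToPair(kv -> new Tuple2<>(kv._2(), 1L))
              .reduceByKey(Long::sum)
              .foreachRDD(rdd -> rdd.foreachPartition(part -> {
                  // Cassandra stores the results (made-up keyspace/table).
                  try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                       Session session = cluster.connect("metrics")) {
                      while (part.hasNext()) {
                          Tuple2<String, Long> t = part.next();
                          session.execute("INSERT INTO event_counts (event, n) VALUES (?, ?)",
                                          t._1(), t._2());
                      }
                  }
              }));

        jssc.start();
        jssc.awaitTermination();
    }
}
```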

HBase buffers writes in memory before persisting them to disk, and is suited to random, repeated access and modification, offering low-latency lookups and range scans.
HBase and Lucene flush and compaction (buffer-flush-merge strategy)
http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/

Writes go first to an in-memory store (the memstore, on heap); once the memstore reaches a certain size, it is flushed to disk into a store file (everything is also written immediately to a log file for durability). The store files created on disk are immutable. Periodically, store files are merged together by a process called compaction.
In contrast to Lucene, though, where one usually has only one IndexWriter, HBase typically has hundreds of regions open.

It is a good thing that the number of store files stays small, because reading a row from HBase requires checking all the store files plus the memstore (this can be optimized through bloom filters or by specifying timestamp ranges).
There are two kinds of compactions:
  • the minor compactions: these are triggered each time a memstore is flushed, and will merge some of the store files, determined by an algorithm described below.
  • the major compactions: these run about every 24 hours (after the currently oldest store file was written), and merge together all store files into one. The 24 hours is adjusted with a random margin of up to 20% to avoid many major compactions happening at the same time. Major compactions can also be triggered manually, via the API or the shell.
There is another difference between minor and major compactions: major compactions process delete markers, max versions, etc., while minor compactions don't. This is because delete markers might also affect data in the non-merged files, so such cleanup is only possible when merging all files.
If, after a compaction, a newly written store file is greater than a certain size (default 256 MB, see property hbase.hregion.max.filesize), the region will be split into two new regions.
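A toy model of this buffer-flush-merge cycle (thresholds and the merge-everything policy are illustrative; real minor compactions pick a subset of files via the selection algorithm in the linked post):

```java
import java.util.*;

// Toy model of the buffer-flush-merge strategy: writes buffer in a sorted
// memstore, flushes produce immutable "store files", and compactions merge
// several files into one. Reads consult the memstore plus every store file,
// which is why keeping the file count small matters.
public class ToyRegion {
    private static final int MEMSTORE_FLUSH_SIZE = 4;   // entries, not bytes
    private static final int COMPACTION_TRIGGER = 3;    // files before merging

    private final NavigableMap<String, String> memstore = new TreeMap<>();
    private final List<NavigableMap<String, String>> storeFiles = new ArrayList<>();

    public void put(String key, String value) {
        memstore.put(key, value);                        // (a real region also appends to the WAL here)
        if (memstore.size() >= MEMSTORE_FLUSH_SIZE) flush();
    }

    private void flush() {
        storeFiles.add(new TreeMap<>(memstore));         // immutable snapshot on "disk"
        memstore.clear();
        if (storeFiles.size() >= COMPACTION_TRIGGER) compact();
    }

    private void compact() {
        // Merge all current files; newer files win on key collisions because
        // the list is ordered oldest-first and putAll overwrites.
        NavigableMap<String, String> merged = new TreeMap<>();
        for (NavigableMap<String, String> f : storeFiles) merged.putAll(f);
        storeFiles.clear();
        storeFiles.add(merged);
    }

    public String get(String key) {
        if (memstore.containsKey(key)) return memstore.get(key);
        for (int i = storeFiles.size() - 1; i >= 0; i--) {   // newest file first
            String v = storeFiles.get(i).get(key);
            if (v != null) return v;
        }
        return null;
    }

    public static void main(String[] args) {
        ToyRegion r = new ToyRegion();
        for (int i = 0; i < 12; i++) r.put("row" + i, "v" + i);
        System.out.println(r.get("row3"));   // v3
    }
}
```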

delete
When you perform a delete in HBase, nothing is deleted immediately; instead, a delete marker (a.k.a. tombstone) is written, because HBase never modifies files once they are written.
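With the standard HBase Java client, a delete just writes that marker (table and row key below are made up); the covered cells physically disappear only at the next major compaction:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {   // made-up table
            // Writes a tombstone; the row's old cells stay in the immutable
            // store files until a major compaction drops them.
            Delete del = new Delete(Bytes.toBytes("row-42"));             // made-up row key
            table.delete(del);
        }
    }
}
```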

(key-value pair storage)
HBase stores keys and values as uninterpreted byte[] arrays.
Amazon DynamoDB stores structured data, indexed by primary key, and allows low-latency read and write access to items ranging from 1 byte up to 64 KB. Amazon S3 stores unstructured blobs and is suited for storing large objects up to 5 TB. To optimize your costs across AWS services, large objects or infrequently accessed data sets should be stored in Amazon S3, while smaller data elements or file pointers (possibly to Amazon S3 objects) are best saved in Amazon DynamoDB.
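A sketch of that split using the AWS SDK for Java (bucket, table, and key names are made up): the large blob goes to S3, and a small pointer item goes to DynamoDB:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.s3.AmazonS3Client;

import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class PointerPattern {
    public static void main(String[] args) {
        AmazonS3Client s3 = new AmazonS3Client();
        AmazonDynamoDBClient dynamo = new AmazonDynamoDBClient();

        // 1. Store the large object in S3 (made-up bucket/key).
        s3.putObject("media-bucket", "videos/v1.mp4", new File("v1.mp4"));

        // 2. Store the small, indexed metadata item, with an S3 pointer, in DynamoDB.
        Map<String, AttributeValue> item = new HashMap<>();
        item.put("videoId", new AttributeValue("v1"));
        item.put("s3Key", new AttributeValue("s3://media-bucket/videos/v1.mp4"));
        dynamo.putItem("VideoIndex", item);   // made-up table
    }
}
```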

Cassandra: all nodes serve the same role – that of a primary replica for a portion of the data.  
Systems using master-slave replication and/or mixed-master fail-over schemes sometimes fall into the “distributed” category, but these techniques are fragile and impose unnecessary limits on scalability (usually in the form of hardware interconnection complexity and geographic proximity).

Though the node is the “primary” for a portion of the data in the cluster, the number of copies of the data kept on other nodes in the cluster is configurable. When a node goes down, the other nodes containing copies, referred to as “replicas”, continue to service read requests and will even accept writes for the down node. When the node returns, these queued up writes are sent from the replicas to bring the node back up to date (you can find more detail on this process, known as hinted handoff, and Cassandra’s implementation of such here: http://wiki.apache.org/cassandra/HintedHandoff).
Another benefit of this design is the ease with which new nodes can be added. When a new node is brought into the cluster, it can take over a portion of the data from existing nodes, relieving them of responsibility for that range of data. Because all nodes are the same, this can happen seamlessly in a running cluster, with the nodes exchanging messages with one another and the rest of the cluster as needed.
Having all nodes share the same role also streamlines operations and systems administration: because Cassandra has a single node type, there is only a single set of requirements for hardware, monitoring, and deployment.
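The configurable "number of copies" is the keyspace's replication factor. A minimal example via the DataStax Java driver (keyspace name and factor are illustrative):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class ReplicationSetup {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Three copies of every row spread across the peer nodes; any
            // replica can serve reads and writes for its portion of the ring.
            session.execute("CREATE KEYSPACE IF NOT EXISTS metrics "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}");
        }
    }
}
```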
https://en.wikipedia.org/wiki/Gossip_protocol
Gossip Protocol:
Gossip is a peer-to-peer communication protocol in which nodes periodically exchange state information about themselves and about other nodes they know about. The gossip process runs every second and exchanges state messages with up to three other nodes in the cluster. The nodes exchange information about themselves and about the other nodes that they have gossiped about, so all nodes quickly learn about all other nodes in the cluster.
(broadcast routing)
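A toy simulation of that epidemic spread (the fan-out of three mirrors the Cassandra description above; the version map is a stand-in for real heartbeat/endpoint state):

```java
import java.util.*;

// Toy gossip round: each node picks up to FANOUT random peers and merges
// state maps (nodeId -> version). The higher version wins, so information
// spreads epidemically until every node knows about every other node.
public class GossipDemo {
    static final int FANOUT = 3;

    static void gossipRound(List<Map<String, Integer>> nodes, Random rnd) {
        for (int i = 0; i < nodes.size(); i++) {
            for (int f = 0; f < FANOUT; f++) {
                int peer = rnd.nextInt(nodes.size());
                if (peer == i) continue;
                merge(nodes.get(i), nodes.get(peer));   // exchange is bidirectional
                merge(nodes.get(peer), nodes.get(i));
            }
        }
    }

    static void merge(Map<String, Integer> into, Map<String, Integer> from) {
        from.forEach((k, v) -> into.merge(k, v, Math::max)); // keep the newest version
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int n = 50;
        List<Map<String, Integer>> nodes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            Map<String, Integer> state = new HashMap<>();
            state.put("node" + i, 1);                    // each node starts knowing only itself
            nodes.add(state);
        }
        int rounds = 0;
        while (nodes.stream().anyMatch(m -> m.size() < n)) {
            gossipRound(nodes, rnd);
            rounds++;
        }
        System.out.println("All " + n + " nodes converged after " + rounds + " rounds");
    }
}
```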

ZK

