Wednesday, December 2, 2015

Introducing Dynomite - Making Non-Distributed Databases, Distributed



https://medium.com/netflix-techblog/introducing-dynomite-making-non-distributed-databases-distributed-c7bce3d89404
Every time a customer loads up a homepage or starts to stream a movie, there are a number of microservices involved in completing that request. Most of these microservices use some kind of stateful system to store and serve data. A few milliseconds here and there can add up quickly and result in a multi-second response time.

Dynomite is a sharding and replication layer. Dynomite can turn existing non-distributed datastores, such as Redis or Memcached, into fully distributed, multi-datacenter replicating datastores.


In the open source world, there are various single-server datastore solutions, e.g., Memcached, Redis, BerkeleyDb, LevelDb, and MySQL. The availability story for these single-server datastores usually ends up being a master-slave setup. Once traffic demands overrun this setup, the next logical progression is to introduce sharding.

Dynomite’s design goal is to turn those single-server datastore solutions into peer-to-peer, linearly scalable, clustered systems while still preserving the native client/server protocols of the datastores, e.g., Redis protocol.
A Dynomite cluster consists of multiple data centers (DCs). A data center is a group of racks, and a rack is a group of nodes. Each rack holds the entire dataset, partitioned across the nodes in that rack; hence, multiple racks provide higher availability for the data. Each node in a rack has a unique token, which identifies the portion of the dataset it owns.
Each Dynomite node (e.g., a1, b1, or c1) runs a Dynomite process co-located with the datastore server, acting as a proxy, traffic router, coordinator, and gossiper. In the context of the Dynamo paper, Dynomite is the Dynamo layer with additional support for pluggable datastore proxies, preserving the native datastore protocol as much as possible.
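
To illustrate token-based ownership, here is a minimal sketch of a per-rack token ring. The TokenRing class, the node names, the token values, and the CRC32 stand-in hash are all illustrative; Dynomite's actual partitioner and token assignment are server internals.

```java
import java.util.TreeMap;
import java.util.zip.CRC32;

// Minimal sketch: each node in a rack owns a token; the node with the
// smallest token >= hash(key) owns the key, wrapping around the ring.
public class TokenRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(long token, String nodeName) {
        ring.put(token, nodeName);
    }

    public String ownerOf(String key) {
        CRC32 crc = new CRC32();          // stand-in hash, not Dynomite's
        crc.update(key.getBytes());
        Long token = ring.ceilingKey(crc.getValue());
        return ring.get(token != null ? token : ring.firstKey());
    }

    public static void main(String[] args) {
        TokenRing rack = new TokenRing();
        rack.addNode(1431655765L, "a1");  // tokens chosen arbitrarily
        rack.addNode(2863311530L, "a2");
        rack.addNode(4294967295L, "a3");
        System.out.println("key 'movie:42' is owned by " + rack.ownerOf("movie:42"));
    }
}
```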

A datastore can be either a volatile datastore such as Memcached or Redis, or a persistent datastore such as MySQL, BerkeleyDb, or LevelDb. Our current open-sourced Dynomite offering supports Redis and Memcached.


Replication

A client can connect to any node on a Dynomite cluster when sending write traffic. If the Dynomite node happens to own the data based on its token, then the data is written to the local datastore server process and asynchronously replicated to other racks in the cluster across all data centers. If the node does not own the data, it acts as a coordinator and sends the write to the node owning the data in the same rack. It also replicates the writes to the corresponding nodes in other racks and DCs.
In the current implementation, the coordinator returns an OK to the client once a node in the local rack successfully stores the write; all other remote replication happens asynchronously.

The picture below shows an example of the latter case, where the client sends a write request to a non-owning node. The data belongs on nodes a2, b2, c2, and d2 as per the partitioning scheme. The request is sent to a1, which acts as the coordinator and forwards the request to the appropriate nodes.
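
To make the write flow concrete, here is a minimal sketch of the coordinator behavior described above, using hypothetical Node and WriteCoordinator types rather than Dynomite's actual internals: the local-rack owner is written synchronously, and replicas in other racks and DCs are written asynchronously, matching the "OK after local rack write" behavior.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WriteCoordinator {
    interface Node { void write(String key, String value); }

    private final Node localRackOwner;        // owner of the key in this rack
    private final List<Node> remoteReplicas;  // owners in other racks/DCs
    private final ExecutorService async = Executors.newCachedThreadPool();

    WriteCoordinator(Node localRackOwner, List<Node> remoteReplicas) {
        this.localRackOwner = localRackOwner;
        this.remoteReplicas = remoteReplicas;
    }

    // Returns "OK" as soon as the local-rack write succeeds;
    // remote replication continues in the background.
    public String write(String key, String value) {
        localRackOwner.write(key, value);
        for (Node replica : remoteReplicas) {
            async.submit(() -> replica.write(key, value));
        }
        return "OK";
    }
}
```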


Multiple racks and multiple data centers provide high availability. A client can connect to any node to read the data. Similar to writes, a node serves the read request if it owns the data, otherwise it forwards the read request to the data owning node in the same rack. Dynomite clients can fail over to replicas in remote racks and/or data centers in case of node, rack, or data center failures.
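
A matching sketch of the read path, again with hypothetical types: try the owning node in the local rack first, then fail over to replicas in remote racks or data centers.

```java
import java.util.List;
import java.util.Optional;

public class FailoverReader {
    interface Node { String read(String key) throws Exception; }

    private final Node localOwner;
    private final List<Node> remoteReplicas;  // ordered by preference

    FailoverReader(Node localOwner, List<Node> remoteReplicas) {
        this.localOwner = localOwner;
        this.remoteReplicas = remoteReplicas;
    }

    public Optional<String> read(String key) {
        try {
            return Optional.ofNullable(localOwner.read(key));
        } catch (Exception localFailure) {
            for (Node replica : remoteReplicas) {
                try {
                    return Optional.ofNullable(replica.read(key));
                } catch (Exception ignored) { /* try the next replica */ }
            }
        }
        return Optional.empty();
    }
}
```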


Cold cache warm-up
Currently, this feature is available for Dynomite with the Redis datastore. Dynomite can reduce the performance impact of a cold start by filling an empty node (or nodes) with data from its peers.
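
The warm-up mechanism is built into Dynomite itself, but the idea can be illustrated with plain Redis: temporarily make the empty node a replica of a warm peer, then detach it once it has the data. This is only a conceptual sketch; the hosts are placeholders, and the fixed sleep stands in for real replication-status polling.

```java
import redis.clients.jedis.Jedis;

public class ColdCacheWarmup {
    public static void main(String[] args) throws InterruptedException {
        try (Jedis coldNode = new Jedis("cold-node.example.com", 6379)) {
            coldNode.slaveof("warm-peer.example.com", 6379); // start syncing from the warm peer
            Thread.sleep(30_000);   // naive wait; real code would poll replication status
            coldNode.slaveofNoOne();                         // detach once warm
        }
    }
}
```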

Dynomite's built-in gossip helps maintain cluster membership as well as failure detection and recovery. This simplifies maintenance operations on Dynomite clusters.

In an AWS environment, a datacenter is equivalent to an AWS region, and a rack corresponds to an AWS availability zone.

At Netflix, we see the benefit in encapsulating client side complexity and best practices in one place instead of having every application repeat the same engineering effort, e.g., topology-aware routing, effective failover, load shedding with exponential backoff, etc.
Dynomite ships with a Netflix-homegrown client called Dyno. Dyno implements patterns inspired by Astyanax (the Cassandra client used at Netflix) on top of popular clients like Jedis, Redisson, and SpyMemcached, to ease migration to Dyno and Dynomite.
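
For a feel of the client, here is a sketch of constructing a Dyno client following the builder pattern from Dyno's README. The application name, cluster name, hostname, rack, and port 8102 are placeholders/assumptions, and the exact Host construction API varies across Dyno versions.

```java
import java.util.Collections;

import com.netflix.dyno.connectionpool.HostBuilder;
import com.netflix.dyno.connectionpool.HostSupplier;
import com.netflix.dyno.jedis.DynoJedisClient;

public class DynoClientExample {
    public static void main(String[] args) {
        // Static host list for illustration; a real deployment would use
        // cluster discovery instead.
        HostSupplier hostSupplier = () -> Collections.singletonList(
                new HostBuilder()
                        .setHostname("dynomite-node.example.com") // placeholder
                        .setPort(8102)                            // assumed client port
                        .setRack("us-east-1c")                    // placeholder rack
                        .createHost());

        DynoJedisClient client = new DynoJedisClient.Builder()
                .withApplicationName("myApp")          // placeholder
                .withDynomiteClusterName("myCluster")  // placeholder
                .withHostSupplier(hostSupplier)
                .build();

        client.set("movie:42", "cached value");        // Jedis-style commands
        System.out.println(client.get("movie:42"));
    }
}
```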

Dyno Client Features

  • Connection pooling of persistent connections — this helps reduce connection churn on the Dynomite server with client connection reuse.
  • Topology aware load balancing (Token Aware) for avoiding any intermediate hops to a Dynomite coordinator node that is not the owner of the specified data.
  • Application specific local rack affinity based request routing to Dynomite nodes.
  • Application resilience by intelligently failing over to remote racks when local Dynomite rack nodes fail.
  • Application resilience against network glitches by constantly monitoring connection health and recycling unhealthy connections.
  • Capability of surgically routing traffic away from any nodes that need to be taken offline for maintenance.
  • Flexible retry policies, such as exponential backoff.
  • Insight into connection pool metrics.
  • Highly configurable and pluggable connection pool components for implementing advanced features.

  • Micro batching — submitting a batch of requests to a distributed db gets tricky, since different keys map to different servers as per the sharding/hashing strategy. Dyno can take a user-submitted batch, split it into shard-aware micro-batches under the covers, execute them individually, and then stitch the results back together before returning to the user. One has to deal with partial failure here, and Dyno has the intelligence to retry just the failed micro-batch against the remote rack replica responsible for that hash partition (a sketch follows this list).
  • Load shedding — Dyno's per-request interceptor model gives it the ability to do quota management and rate limiting in order to protect the backend Dynomite servers.
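
Here is a minimal sketch of that shard-aware micro-batching flow, using hypothetical Shard and ShardLocator types rather than Dyno's real internals: split the user batch by owning shard, execute each micro-batch, retry a failed micro-batch against the remote-rack replica, and stitch the results back together.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MicroBatcher {
    interface Shard {
        Map<String, String> multiGet(List<String> keys) throws Exception;
    }
    interface ShardLocator {
        Shard localShardFor(String key);     // owner in the local rack
        Shard remoteReplicaFor(String key);  // same partition, remote rack
    }

    private final ShardLocator locator;

    MicroBatcher(ShardLocator locator) { this.locator = locator; }

    public Map<String, String> multiGet(List<String> keys) throws Exception {
        // 1. Split the user batch into per-shard micro-batches.
        Map<Shard, List<String>> microBatches = new HashMap<>();
        for (String key : keys) {
            microBatches.computeIfAbsent(locator.localShardFor(key),
                                         s -> new ArrayList<>()).add(key);
        }
        // 2. Execute each micro-batch; on failure, retry only that
        //    micro-batch against the remote-rack replica for its partition.
        Map<String, String> stitched = new HashMap<>();
        for (Map.Entry<Shard, List<String>> e : microBatches.entrySet()) {
            try {
                stitched.putAll(e.getKey().multiGet(e.getValue()));
            } catch (Exception localFailure) {
                Shard remote = locator.remoteReplicaFor(e.getValue().get(0));
                stitched.putAll(remote.multiGet(e.getValue()));
            }
        }
        // 3. Stitched results go back to the caller as one response.
        return stitched;
    }
}
```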



https://medium.com/netflix-techblog/dynomite-with-redis-on-aws-benchmarks-5c942fc7ca38
Dynomite is a proxy layer that provides sharding and replication and can turn existing non-distributed datastores into a fully distributed system with multi-region replication. 


Dynomite with Redis is now used as a production system within Netflix.

Dynomite extends eventual consistency to tunable consistency in the local region. The consistency level specifies how many replicas must respond to a write or read request before returning data to the client application. Read and write consistency can be configured to manage availability versus data accuracy. Consistency can be configured for read or write operations separately (cluster-wide).
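
Dynomite's documented consistency levels here are DC_ONE and DC_QUORUM. The arithmetic behind them is simple majority counting, sketched below with illustrative replica counts.

```java
// Minimal sketch of the quorum arithmetic behind tunable consistency.
// DC_ONE and DC_QUORUM are Dynomite's consistency levels; the replica
// counts used in main() are illustrative.
public class ConsistencyLevels {
    enum Level { DC_ONE, DC_QUORUM }

    // How many replicas in the local region must ack a read or write
    // before the client gets an answer.
    static int requiredAcks(Level level, int replicasInRegion) {
        switch (level) {
            case DC_ONE:    return 1;                         // local node only
            case DC_QUORUM: return replicasInRegion / 2 + 1;  // majority
            default:        throw new IllegalArgumentException();
        }
    }

    public static void main(String[] args) {
        // With 3 racks in a region (3 replicas), DC_QUORUM needs 2 acks.
        System.out.println(requiredAcks(Level.DC_QUORUM, 3)); // prints 2
    }
}
```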

https://medium.com/netflix-techblog/distributed-delay-queues-based-on-dynomite-6b31eca37fbc
Dynomite is a generic dynamo implementation that can be used with many different key-value pair storage engines. Currently, it provides support for the Redis Serialization Protocol (RESP) and Memcached write protocol. We chose Dynomite for its performance, multi-datacenter replication and high availability. Moreover, Dynomite provides sharding, and pluggable data storage engines, allowing us to scale vertically or horizontally as our data needs increase.

http://techblog.netflix.com/2014/11/introducing-dynomite.html
http://www.zenlife.tk/dynomite.md

Motivation

There are many single-server datastore options today, such as Memcached, Redis, BerkeleyDb, LevelDb, and MySQL. For availability, these all end up with a master-slave setup, and once the request volume grows large enough, with sharding. These operations are all painful, and having application developers maintain the sharding themselves is even worse.
Dynomite's goal is to turn these single-server datastores into peer-to-peer, linearly scalable clustered systems while still preserving the native protocols, e.g., the Redis protocol.

Topology

A cluster has multiple data centers. A data center has a group of racks, and a rack has a group of nodes. Each rack holds the entire dataset, distributed across the nodes within the rack. Each node in a rack has a unique token, which determines which part of the dataset it is responsible for.
Each Dynomite node pairs a Dynomite process with a datastore server; the Dynomite process acts as a proxy, request router, coordinator, and gossiper.
The datastore server can be non-persistent, like Memcached or Redis, or persistent, like MySQL, BerkeleyDb, or LevelDb. Dynomite currently supports Redis and Memcached.

Replication

A client can connect to any node in a Dynomite cluster to send a write request. If that node happens to be responsible for the data, the data is written to local storage and asynchronously replicated to the other racks in all data centers. If the node is not responsible for the data, it acts as the coordinator and forwards the data to the responsible node in the same rack. It also sends the write request to the corresponding nodes in the other racks and data centers.
Currently, an OK is returned to the client as soon as the write lands in the local rack; writes to the other replicas are asynchronous.

Highly available reads

Multiple racks and multiple data centers provide high availability. A client can connect to any node to read data. As with writes, if a node does not have the data, it forwards the read request to the node in the same rack that owns the data. For fault tolerance, the client can request nodes in other racks or data centers.

Pluggable datastores

Redis and Memcached are currently supported, covering most of each API.

Standard Memcached/Redis protocol support

Any Memcached or Redis client can connect to Dynomite directly, with no changes. A few things are still missing, though, such as automatic failover, request rate limiting, and connection pooling; the Dyno client provides these.
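
Since the native protocol is preserved, an unmodified Redis client can talk to a Dynomite node directly. A minimal sketch with Jedis, where the hostname is a placeholder and 8102 is assumed to be Dynomite's client-facing port (check your dynomite.yml):

```java
import redis.clients.jedis.Jedis;

// A plain Redis client speaking RESP to a Dynomite node; no Dynomite-specific
// client code is needed.
public class PlainRedisClientExample {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("dynomite-node.example.com", 8102)) {
            jedis.set("greeting", "hello from a plain Redis client");
            System.out.println(jedis.get("greeting"));
        }
    }
}
```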

Peer-to-peer

All Dynomite nodes in a cluster play the same role, so there is no single point of failure. Nodes can simply be added.

Recovery

An empty node can pull data from other nodes, speeding up recovery.

Heterogeneous datacenter replication

Writes are replicated to multiple data centers. Different data centers can be configured with different numbers of racks and nodes. This feature is particularly useful when the request volume is unbalanced across data centers.

Inter-node communication

Dynomite's built-in gossip maintains cluster membership and handles failure detection and recovery.

Client architecture

Ordinary Redis and Memcached clients work as-is (as mentioned above, server nodes forward requests). Dynomite also provides its own client, Dyno, with features such as:
  • Connection pooling
  • Topology-aware routing and load balancing
  • Application-specified local rack affinity
  • Intelligent failover when a local node or rack fails
  • Connection health monitoring and recycling of unhealthy connections
  • Routing traffic away from nodes that are offline or under maintenance
  • and so on...
Redis Cluster is a CP system, while Dynomite is an AP system (definitions based on the CAP theorem). Search for the word Dynomite in the Redis documentation at https://redis.io/topics/sentinel for more information.
