Wednesday, April 6, 2016

Cross Data Center
One way that many developers have started building fault tolerance into their systems is by distributing their database across multiple data centers. “Eventually consistent” systems, which aim to eventually make data consistent across nodes, can be implemented across data centers more easily than databases with “strong consistency” guarantees because they provide limited guarantees about the state of the data they store. However, that decision to distribute the system across multiple, physical locations is not something that should be taken lightly, as it introduces additional complexity into the entire system. If your distributed database is the main database of record for your application, having inconsistent state or lost data probably isn’t a good idea.

1. Your Application Tier Might Not Be Scalable And Fault Tolerant

An application stack that fully utilizes the capabilities of an MDC database has fault tolerance built into each level of the system, from the DNS information down to the hardware running the application. At the DNS level, you would want to have multiple “A” records that point to multiple load balancers to ensure that the URL for your application can be resolved to an active load balancer IP even if one of those load balancers failed or can’t be reached. (Obviously fault tolerance at the DNS provider level is helpful too, but as the user, you can’t do too much to ensure that your DNS provider is resilient to failures.)
At the next level down, you would want to ensure that your load balancers are able to scale and handle failures as well. There are a number of ways to approach adding fault tolerance at this level, such as using an “elastic” load balancer that gets scaled up as part of what Amazon calls an “auto scaling” group, for example, which also allows it to handle failures of individual server “instances” to eliminate the load balancer tier as a single point of failure.
2. Latency Matters A Lot
An eventually consistent database working in MDCs is a design decision that introduces risks and failure modes as a tradeoff. 

An MDC database deployment can provide valuable benefits to organizations and developers to deal with data center outages, machine failures, and disaster recovery. However, configuring and operating an entire application stack consistently across multiple locations is a non-trivial (and potentially expensive) task that potentially introduces more points of failure than a single-data center deployment.
Cloud-hosted applications and services are often deployed to multiple datacenters. This approach can reduce network latency for globally located users, as well as providing a complete failover capability should one deployment or one datacenter become unavailable for any reason. For best performance, the data that an application uses should be located close to where the application is deployed, so it may be necessary to replicate this data in each datacenter. If the data changes, these modifications must be applied to every copy of the data. This process is called synchronization.
Alternatively, you might choose to build a hybrid application or service solution that stores and retrieves data from an on-premises data store hosted by your own organization. For example, an organization may hold the main data repository on-premises and then replicate only the necessary data to a data store in the cloud. This can help to protect sensitive data that is not required in all applications. It is also a useful approach if updates to the data occur mainly on-premises, such as when maintaining the catalog of an e-commerce retailer or the account details of suppliers and customers.
Master-Master Replication, in which the data in each replica is dynamic and can be updated. This topology requires a two-way synchronization mechanism to keep the replicas up to date and to resolve any conflicts that might occur. In a cloud application, to ensure that response times are kept to a minimum and to reduce the impact of network latency, synchronization typically happens periodically. The changes made to a replica are batched up and synchronized with other replicas according to a defined schedule. While this approach reduces the overheads associated with synchronization, it can introduce some inconsistency between replicas before they are synchronized.

Master-Subordinate Replication, in which the data in only one of the replicas is dynamic (the master), and the remaining replicas are read-only. The synchronization requirements for this topology are simpler than that of the Master-Master Replication topology because conflicts are unlikely to occur. However, the same issues of data consistency apply.
  • To improve performance and scalability:
    • Use Master-Subordinate replication with read-only replicas to improve performance of queries. Locate the replicas close to the applications that access them and use simple one-way synchronization to push updates to them from a master database.
    • Use Master-Master replication to improve the scalability of write operations. Applications can write more quickly to a local copy of the data, but there is additional complexity because two-way synchronization (and possible conflict resolution) with other data stores is required.
    • Include in each replica any reference data that is relatively static, and is required for queries executed against that replica to avoid the requirement to cross the network to another datacenter. For example, you could include postal code lookup tables (for customer addresses) or product catalog information (for an ecommerce application) in each replica.
  • To improve reliability:
    • Deploy replicas close to the applications and inside the network boundaries of the applications that use them to avoid delays caused by accessing data across the Internet. Typically, the latency of the Internet and the correspondingly higher chance of connection failure are the major factors in poor reliability. If replicas are read-only to an application, they can be updated by pushing changes from the master database when connectivity is restored. If the local data is updateable, a more complex two-way synchronization will be required to update all data stores that hold this data.
  • To improve security:
    • In a hybrid application, deploy only non-sensitive data to the cloud and keep the rest on-premises. This approach may also be a regulatory requirement, specified in a service level agreement (SLA), or as a business requirement. Replication and synchronization can take place over the non-sensitive data only.
  • To improve availability:
    • In a global reach scenario, use Master-Master replication in datacenters in each country or region where the application runs. Each deployment of the application can use data located in the same datacenter as that deployment in order to maximize performance and minimize any data transfer costs. Partitioning the data may make it possible to minimize synchronization requirements.
    • Use replication from the master database to replicas in order to provide failover and backup capabilities. By keeping additional copies of the data up to date, perhaps according to a schedule or on demand when any changes are made to the data, it may be possible to switch the application to use the backup data in case of a failure of the original data store.
Decide which type of synchronization you need:
  • Master-Master replication involves a two-way synchronization process that is complex because the same data might be updated in more than one location. This can cause conflicts, and the synchronization must be able to resolve or handle this situation. It may be appropriate for one data store to have precedence and overwrite a conflicting change in other data stores. Other approaches are to implement a mechanism that can automatically resolve the conflict based on timings, or just record the changes and notify an administrator to resolve the conflict.
  • Master-Subordinate replication is simpler because changes are made in the master database and are copied to all subordinates.
  • Decide the frequency of synchronization. Most synchronization frameworks and services perform the synchronization operation on a fixed schedule. If the period between synchronizations is too long, you increase the risk of update conflicts and data in each replica may become stale. If the period is too short you may incur heavy network load, increased data transfer costs, and risk a new synchronization starting before the previous one has finished when there are a lot of updates. It may be possible to propagate changes across replicas as they occur by using background tasks that synchronize the data.
  • Consider how you will deal with failures during replication. This may require rerouting requests for data to another replica if the first cannot be accessed, or even rerouting requests to another deployment of the application.
  • Make sure applications that use replicas of the data can handle situations that may arise when a replica is not fully consistent with the master copy of the data. For example, if a website accepts an order for goods marked as available but a subsequent update shows that no stock is available, the application must manage this—perhaps by sending an email to the customer and/or by placing the item on back order.

No comments:

Post a Comment


Review (561) System Design (304) System Design - Review (196) Java (179) Coding (75) Interview-System Design (65) Interview (60) Book Notes (59) Coding - Review (59) to-do (45) Linux (40) Knowledge (39) Interview-Java (35) Knowledge - Review (32) Database (31) Design Patterns (29) Product Architecture (28) Big Data (27) Soft Skills (27) Concurrency (26) MultiThread (26) Miscs (25) Cracking Code Interview (24) Distributed (24) Career (22) Interview - Review (21) Java - Code (21) Operating System (21) Interview Q&A (20) OOD Design (20) System Design - Practice (19) How to Ace Interview (16) Security (16) Algorithm (15) Brain Teaser (14) Google (14) Redis (14) Linux - Shell (13) Spark (13) Spring (13) Code Quality (12) How to (12) Interview-Database (12) Interview-Operating System (12) Tools (12) Architecture Principles (11) Company - LinkedIn (11) Solr (11) Testing (11) Resource (10) Search (10) Amazon (9) Cache (9) Web Dev (9) Architecture Model (8) Better Programmer (8) Company - Uber (8) Interview - MultiThread (8) Java67 (8) Math (8) OO Design principles (8) SOLID (8) Scalability (8) Trouble Shooting (8) Cassandra (7) Company - Facebook (7) Design (7) Git (7) Interview Corner (7) JVM (7) Java Basics (7) Kafka (7) Machine Learning (7) NoSQL (7) C++ (6) File System (6) Highscalability (6) How to Better (6) Network (6) Restful (6) CareerCup (5) Code Review (5) Hash (5) How to Interview (5) JDK Source Code (5) JavaScript (5) Leetcode (5) Must Known (5) API Design (4) Be Architect (4) Big Fata (4) C (4) Company Product Architecture (4) Data structures (4) Design Principles (4) Facebook (4) GeeksforGeeks (4) Generics (4) Google Interview (4) Hardware (4) JDK8 (4) Optimization (4) Product + Framework (4) Puzzles (4) Python (4) Shopping System (4) Source Code (4) Web Service (4) node.js (4) Back-of-Envelope (3) Chrome (3) Company - Pinterest (3) Company - Twiiter (3) Company - Twitter (3) Consistent Hash (3) Elasticsearch (3) GOF (3) Game Design (3) GeoHash (3) Growth (3) Guava (3) Html (3) Interview-Big Data (3) Interview-Linux (3) Interview-Network (3) Java EE Patterns (3) Javarevisited (3) Map Reduce (3) Math - Probabilities (3) Performance (3) RateLimiter (3) Resource-System Desgin (3) Scala (3) UML (3) ZooKeeper (3) geeksquiz (3) AI (2) Advanced data structures (2) AngularJS (2) Behavior Question (2) Bugs (2) Coding Interview (2) Company - Netflix (2) Crawler (2) Cross Data Center (2) Data Structure Design (2) Database-Shard (2) Debugging (2) Docker (2) Garbage Collection (2) Go (2) Hadoop (2) Interview - Soft Skills (2) Interview-Miscs (2) Interview-Web (2) JDK (2) Logging (2) POI (2) Papers (2) Programming (2) Project Practice (2) Random (2) Software Desgin (2) System Design - Feed (2) Thread Synchronization (2) Video (2) reddit (2) Ads (1) Algorithm - Review (1) Android (1) Approximate Algorithms (1) Base X (1) Bash (1) Books (1) C# (1) CSS (1) Client-Side (1) Cloud (1) CodingHorror (1) Company - Yelp (1) Counter (1) DSL (1) Dead Lock (1) Difficult Puzzles (1) Distributed ALgorithm (1) Eclipse (1) Facebook Interview (1) Function Design (1) Functional (1) GoLang (1) How to Solve Problems (1) ID Generation (1) IO (1) Important (1) Internals (1) Interview - Dropbox (1) Interview - Project Experience (1) Interview Stories (1) Interview Tips (1) Interview-Brain Teaser (1) Interview-How (1) Interview-Mics (1) Interview-Process (1) Java Review (1) Jeff Dean (1) Joda (1) LeetCode - Review (1) Library (1) LinkedIn (1) LintCode (1) Mac (1) Micro-Services (1) Mini System (1) MySQL (1) Nigix (1) NonBlock (1) Process (1) Productivity (1) Program Output (1) Programcreek (1) Quora (1) RPC (1) Raft (1) Reactive (1) Reading (1) Reading Code (1) Refactoring (1) Resource-Java (1) Resource-System Design (1) Resume (1) SQL (1) Sampling (1) Shuffle (1) Slide Window (1) Spotify (1) Stability (1) Storm (1) Summary (1) System Design - TODO (1) Tic Tac Toe (1) Time Management (1) Web Tools (1) algolist (1) corejavainterviewquestions (1) martin fowler (1) mitbbs (1)

Popular Posts