Monday, April 17, 2017

Stories about Scalability



http://blog.wix.engineering/2015/02/25/scaling-wix-100m-users-beginning/
Our first backend engineer built that server based on Tomcat, Hibernate, Ehcache, and MySQL. 

In 2008 I was told by one of the founders, “You told us that we would have to replace our server side about now, you just failed to mention how hard that’s gonna be.” 


The actual story of that transformation is beyond the scope of this post, but I will get to it in a later post. But the questions remain: Did we make the right decision to build on a server that was not well architected? Was it worth it to move fast at the cost of accruing a huge technical debt?
At Wix, the first lesson we learned from the early years is that when we begin a project or a startup, if we do not know which variation of our product will work (as most startups do not), we should move fast. We should be opportunistic, utilizing any tools we are familiar with, regardless of scale or ordered methodologies (such as TDD). And yes, we will accrue technical debt.
The second lesson, the one we failed at, is that we should build fast, but also build for gradual rewrite. From the initial stage we should prepare to replace parts of our system if and when we need to, with minimal effort. Had we done so, we could have replaced our initial server within a year or two, instead of spending four years on that effort. However, we built a classic spaghetti-ball server and paid the price for it.
To summarize, in the early stages, build fast and build for gradual rewrite.

http://blog.wix.engineering/2015/03/18/scaling-to-100m-to-cache-or-not-to-cache/
Ehcache, unlike Memcached, runs in process (in the JVM), and its distributed feature fully replicates cache state between all the nodes in a cluster.
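The replicated mode described above is configured declaratively in `ehcache.xml`. As a rough, hedged sketch of what such a configuration might look like (Ehcache 2.x with RMI replication; cache names and values here are illustrative, not Wix's actual configuration):

```xml
<!-- ehcache.xml fragment (illustrative): peer discovery for the cluster -->
<cacheManagerPeerProviderFactory
    class="net.sf.ehcache.distribution.RMICacheManagerPeerProviderFactory"
    properties="peerDiscovery=automatic,
                multicastGroupAddress=230.0.0.1,
                multicastGroupPort=4446"/>

<!-- An in-process cache whose puts, updates, and removals are replicated
     to every other node, so each JVM holds a full copy of the cache state -->
<cache name="siteDefinitions"
       maxEntriesLocalHeap="10000"
       eternal="false"
       timeToLiveSeconds="300">
  <cacheEventListenerFactory
      class="net.sf.ehcache.distribution.RMICacheReplicatorFactory"
      properties="replicateAsynchronously=true,
                  replicatePuts=true,
                  replicateUpdates=true,
                  replicateRemovals=true"/>
</cache>
```

Full replication means every node pays memory for every entry, which is workable for small, hot data sets but scales poorly as the data grows.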

Unfortunately, even after we fixed all the places where a site definition was stored, customers continued to complain about corrupted sites. This was because we had only fixed the bad state stored in the database, forgetting that the cache also held copies of our data, including the corrupted documents.
How come we forgot the cache? Well, we tend to forget what we cannot see. Ehcache is a "black box" cache: a Java library with no SQL interface to query and no management application to view the cache content. Because we had no easy way to "look" into the cache, we could not diagnose and introspect it when we had corrupted-data incidents.
At this point you should ask, “What about cache invalidation?” Because we were using Ehcache, which has a management API that supports invalidation, we could have written specific code instructing our app servers to invalidate the cache (an invalidation switch). If we did not prepare an invalidation switch for a certain type of data, we would again need to restart both servers at once to clear the bad state.
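The idea of an invalidation switch can be sketched with a minimal in-memory stand-in: a cache keyed by logical data type, where an admin endpoint can wipe one type of data on demand instead of restarting every server. This is illustrative only, with hypothetical names, not Wix's actual code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of an "invalidation switch": each logical data type
// (e.g. "siteDefinition") gets its own cache region that an operator
// can clear on demand, purging bad state without a coordinated restart.
public class InvalidatingCache {
    private final Map<String, Map<String, Object>> cachesByType =
            new ConcurrentHashMap<>();

    public void put(String type, String key, Object value) {
        cachesByType
            .computeIfAbsent(type, t -> new ConcurrentHashMap<>())
            .put(key, value);
    }

    public Object get(String type, String key) {
        Map<String, Object> cache = cachesByType.get(type);
        return cache == null ? null : cache.get(key);
    }

    // The "invalidation switch": wipe every entry of one data type.
    public void invalidateType(String type) {
        Map<String, Object> cache = cachesByType.get(type);
        if (cache != null) {
            cache.clear();
        }
    }
}
```

In a real deployment the clear would be triggered per node (for example via a management API such as Ehcache's), since every server holds its own replicated copy.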

The first thing to do was to check the MySQL statistics. It turned out that, when used correctly, MySQL could give us submillisecond reads, even for large tables. Today we have tables with over 100 million rows that we read from with submillisecond latency. We do so by giving the MySQL process sufficient memory for its disk cache, and by reading single rows by primary key or an index, without joins.
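The read pattern described above can be sketched as follows. Table and column names here are illustrative, not Wix's schema:

```sql
-- Single-row lookup by primary key, no joins: with the working set held
-- in memory (my.cnf: innodb_buffer_pool_size sized generously), this is
-- the kind of query that stays submillisecond even on 100M-row tables.
SELECT document
FROM   site_definitions
WHERE  site_id = ?;   -- site_id is the PRIMARY KEY

-- The same pattern works through a secondary index:
-- CREATE INDEX idx_site_definitions_user ON site_definitions (user_id);
-- SELECT ... WHERE user_id = ?  (index lookup, still a point read)
```

The key point is that these are point reads served from memory; it is joins, scans, and cold buffer pools that make MySQL look slow enough to "need" a cache.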
We claim that a cache is not part of an architecture; rather, it is one of a set of solutions for a performance problem, and not the best at that.
Our guidelines for using a cache are as follows:
  1. You do not need a cache.
  2. Really, you don’t.
  3. If you still have a performance issue, can you solve it at the source? What is slow? Why is it slow? Can you architect it differently to not be slow? Can you prepare data to be read-optimized?
If you do need to introduce a cache, think of the following:
  • How do you invalidate the cache?
  • How do you see the values in the cache (black box vs. white box cache)?
  • What happens in a cold start of your system? Can the system cope with traffic if the cache is empty?
  • What is the performance penalty of using the cache?
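The last three questions in the list can be illustrated with a "white box" read-through cache: every miss falls back to the source of truth, so a cold start still serves traffic (just more slowly), and the contents are inspectable for diagnosis. A hedged sketch with hypothetical names, not a production implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// A read-through cache: misses (including an entirely empty cache at
// cold start) are answered by loading from the source of truth, and
// dump() exposes the cached values so operators can see inside it.
public class ReadThroughCache<K, V> {
    private final Map<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> source; // loader hitting the backing store

    public ReadThroughCache(Function<K, V> source) {
        this.source = source;
    }

    public V get(K key) {
        // On a miss, load from the source and remember the result.
        return entries.computeIfAbsent(key, source);
    }

    // "White box" view: what does the cache actually hold right now?
    public Map<K, V> dump() {
        return Map.copyOf(entries);
    }
}
```

Designing so that the source can absorb a fully cold cache answers the cold-start question; exposing `dump()` answers the black-box-versus-white-box question that bit us with Ehcache.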
http://blog.wix.engineering/2015/04/18/scaling-to-100m-service-level-driven-architecture/
Deploying a new version of our software, in some cases, required a MySQL schema change. Because Hibernate is not forgiving of mismatches between the schema it expects and the actual database (DB) schema, we had to take the servers down, alter the schema, and deploy the new version together, in a two-hour planned downtime.

This two-hour planned downtime often turned out to be more complex because problems would occur during deployment. In some cases, performing the MySQL schema change took considerably longer than planned (altering large tables, rebuilding indexes, disabling constraints for data migrations, etc.). Sometimes, after performing the schema change, the server failed to start because of unforeseen deployment, configuration, or schema issues. And in other cases, the new version of our software was faulty, so to restore service, we would change the MySQL schema again (to match the previous version) and redeploy the old version of our software.

With this realization, we set out to split our system into two different segments: the editor segment, responsible for building websites, and the public segment, responsible for serving websites. This solution enabled us to provide different service levels to meet each of our business functions.

We already understood that the release cycle introduced risk, but we realized that it impacted our two major business functions—building websites and serving websites—differently. So, we learned that we needed to have different service levels for each function, and that we had to architect our system around them.
What are those different service levels? The aspects we considered were availability, performance, risk of change, and time to recover from failure. The public segment, which affects all Wix users and websites, needed to have the highest service level with regard to all of those aspects. But for the editor segment, failure only affects users in the process of building websites, so the business impact is lower. This allowed us to trade off a high service level for better agility, which saved development effort.
Today, when we add new functions to our system, we first ask what service level each requires, and then determine where to position that new function in our architecture.
