http://highscalability.com/blog/2015/12/1/deep-lessons-from-google-and-ebay-on-building-ecosystems-of.html
http://www.infoq.com/presentations/service-arch-scale-google-ebay
Ecosystem Of Services
- What does it look like to have a large scale ecosystem of polyglot microservices?
- At eBay and Google, hundreds to thousands of independent services all work together.
- Modern large scale systems compose services in a graph of relationships, not a hierarchy or set of tiers.
- Each service depends on many other services, while many other services depend on it in turn.
- Older large scale systems were typically organized in strict tiers.
How Is An Ecosystem Of Services Created?
- The best performing systems are more the product of evolution than of intelligent design. At Google, for example, there has never been a centralized, top-down design of the system. It has evolved and grown over time in a very organic way.
- Variation and natural selection. When a problem needs to be solved a new service is created, or more often extracted from an existing service or product. Services survive as long as they are used, as long as they provide value, otherwise they are deprecated.
- These large scale systems develop from the bottom up. Clean design can be an emergent property rather than a product of top down design.
- As an example, consider some of the service layering in Google App Engine.
- The Cloud Datastore (a NoSQL service) is built on the Megastore (a geo-scale structured database) which is built on Bigtable (a cluster-level structured service) which is built on Colossus (a next-generation clustered file system) which is built on Borg (the cluster management infrastructure).
- The layering is clean. Each layer adds something that didn’t belong in the layer below. It was not the product of top down design.
- It was built from the bottom up. Colossus, the Google file system, was built first. Several years later Bigtable was built. Several years later Megastore was built. And several years after that, Cloud Datastore migrated onto Megastore.
- You can have this wonderful separation of concerns without a top down architecture.
- A better way for eBay to have handled the situation would have been to encode the knowledge of the smart, experienced people on the review board into something reusable by individual teams: a library, a service, or even a set of guidelines that people can use on their own, rather than having that expertise enter the process only at the last moment.
How Do Standards Evolve Without Architects?
- The parts of communication that are usually standardized:
- Network protocols. Google uses a proprietary protocol called Stubby. eBay uses REST.
- Data formats. Google uses Protocol Buffers. eBay tends to use JSON.
- Interface schema standard. Google uses Protocol Buffers. For JSON there’s JSON Schema (a sketch follows at the end of this section).
- The pieces of common infrastructure that are usually standardized:
- Source code control.
- Configuration management.
- Cluster manager.
- Monitoring systems.
- Alerting systems.
- Diagnostic tools.
- All these components can evolve out of conventions.
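As a concrete illustration of the interface-schema convention mentioned above, here is a minimal sketch of what a published JSON contract might look like in an eBay-style JSON shop. The schema, field names, and response shape are hypothetical, and it uses the third-party jsonschema package rather than any tool named in the talk.

```python
# A sketch (not eBay's actual tooling) of an interface-schema convention
# for a JSON shop, using the third-party jsonschema package.
# Install with: pip install jsonschema
import json
from jsonschema import validate, ValidationError

# Published alongside the service, this schema *is* the interface contract.
# Field names here are hypothetical.
GET_ITEM_RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "item_id": {"type": "string"},
        "price_cents": {"type": "integer", "minimum": 0},
        "currency": {"type": "string"},
    },
    "required": ["item_id", "price_cents"],
    "additionalProperties": True,  # tolerate unknown fields: forward compatibility
}

def check_response(raw: str) -> dict:
    """Parse a response and verify it honors the published contract."""
    payload = json.loads(raw)
    try:
        validate(instance=payload, schema=GET_ITEM_RESPONSE_SCHEMA)
    except ValidationError as err:
        raise RuntimeError(f"contract violation: {err.message}") from err
    return payload

if __name__ == "__main__":
    print(check_response('{"item_id": "123", "price_cents": 999, "currency": "USD"}'))
```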
Building A Service
- When you are a service owner, what does it look like to build a service in a large scale system of polyglot microservices?
- A well performing service in a large scale architecture is:
- Single-purpose, with a simple, well-defined interface.
- Modular and independent. What we might call a microservice.
- Does not share a persistence layer. More on this later.
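A minimal sketch of what those three properties could look like in code. The inventory domain, method names, and in-memory store are all hypothetical; the point is only that the entire public surface of a single-purpose service fits in a few lines, and that its persistence stays private.

```python
# A sketch of "single-purpose, with a simple, well-defined interface".
# The inventory domain, method names, and in-memory store are hypothetical.
from abc import ABC, abstractmethod

class InventoryService(ABC):
    """Does one thing: track stock levels. This is the entire public surface."""

    @abstractmethod
    def get_stock(self, sku: str) -> int:
        """Return units on hand for a SKU."""

    @abstractmethod
    def reserve(self, sku: str, quantity: int) -> bool:
        """Reserve stock; return False if there is not enough on hand."""

class InMemoryInventory(InventoryService):
    def __init__(self, initial=None):
        self._stock = dict(initial or {})  # private: no shared persistence layer

    def get_stock(self, sku: str) -> int:
        return self._stock.get(sku, 0)

    def reserve(self, sku: str, quantity: int) -> bool:
        if self._stock.get(sku, 0) < quantity:
            return False
        self._stock[sku] -= quantity
        return True

if __name__ == "__main__":
    inv = InMemoryInventory({"widget": 5})
    print(inv.reserve("widget", 3), inv.get_stock("widget"))  # True 2
```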
What Are The Goals Of The Service Owner?
- Meet the needs of your clients: provide the necessary functionality at the proper quality level, meet the negotiated performance levels, maintain stability and reliability, and constantly improve the service over time.
- Meet those needs at minimum cost and effort.
- This goal aligns incentives in a way that encourages the use of common infrastructure.
- Each team has a limited set of resources so it’s in their interest to leverage common battle tested tools, processes, components, and services.
- It also incents good operational behavior, such as automating the building and deployment of your service.
- It also incents optimizing for the efficient use of resources.
What Are The Responsibilities Of The Service Owner?
- You build it you run it.
- The team, typically small, owns the service from design, through development and deployment, all the way to retirement.
- There’s no separate maintenance or sustaining engineering team.
- Teams have the freedom to choose their own technologies, methodologies, and working environment.
- Teams are accountable for the choices they make.
- Service as a bounded context.
- The cognitive load on a team is bounded.
- There’s no need to understand all the other services in the ecosystem.
- A team needs to understand their service in depth and the services they depend on.
- This means teams can be extremely small and nimble. A typical team is 3-5 people. (As an aside a US Marine Corps fireteam has four people.)
- The small team size means communication within the team is at a really high bandwidth and quality.
- Conway’s Law used to your advantage. By organizing in small teams you’ll end up with small individual components.
What Is The Relationship Between Services?
- Think about relationships between services as vendor-customer relationships, even though you are at the same company.
- Be very friendly and cooperative, but be very structured in the relationship.
- Be very clear about ownership.
- Be very clear about who is responsible for what. In large part this is about defining a clear interface and maintaining it.
- Incentives are aligned because customers can choose whether or not to use a service. This encourages services to do right by their customers. It’s also one of the ways new services end up being built.
- Define SLAs. Service providers promise a certain level of service to their customers so that customers can rely on the service.
- Customer teams pay for services.
- Charging for a service aligns economic incentives. It motivates both sides to be extremely efficient about the use of resources (a toy metering sketch follows this block).
- When things are free we tend not to value them and tend not to optimize them.
- For example, an internal customer was using Google App Engine for free, and they were using a lot of resources. Begging them to be more efficient about their use of resources turned out not to be a good strategy. A week after chargebacks kicked in, they were able to reduce their consumption of GAE resources by 90% with one or two simple changes.
- It’s not that the team using GAE was evil; they just had other priorities, so there was no incentive for them to optimize their use of GAE. As it turned out, they actually got better response times with the more efficient architecture.
- Charging also incents a service provider to keep quality high, otherwise an internal customer may go elsewhere. This directly incentivizes good development and management practices. Code reviews are one example. Google’s very large scale build and test system is another. Google runs millions of automated tests every day. Acceptance tests for all dependent code are run every time code is accepted into the repository, which helps all the small teams maintain the quality of their services.
- A chargeback model encourages small incremental changes. Small changes are easier to understand. Also, the risk of a code change grows non-linearly with its size: a thousand-line change is not 10 times riskier than a 100-line change, it’s more like 100 times riskier.
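A toy sketch of the chargeback idea: meter each internal customer’s resource use and bill it at a flat internal rate, so efficiency shows up directly on a team’s budget. The rate, the CPU-millisecond unit, and the team names are all made up for illustration.

```python
# A toy chargeback model: meter each internal customer's resource use and
# bill at a flat internal rate. The rate, unit, and team names are made up.
from collections import defaultdict

CPU_MS_RATE = 0.000002  # hypothetical internal dollars per CPU-millisecond

class UsageMeter:
    def __init__(self):
        self._cpu_ms = defaultdict(float)

    def record(self, team: str, cpu_ms: float) -> None:
        self._cpu_ms[team] += cpu_ms

    def monthly_bill(self) -> dict:
        return {team: ms * CPU_MS_RATE for team, ms in self._cpu_ms.items()}

if __name__ == "__main__":
    meter = UsageMeter()
    meter.record("team-a", 4_000_000)  # heavy, unoptimized usage
    meter.record("team-b", 400_000)    # the same workload after a 90% cut
    print(meter.monthly_bill())        # the bill makes the difference visible
```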
- Maintain full backward / forward compatibility of interfaces.
- Never break client code.
- This means maintaining multiple interface versions. In some nasty situations it means maintaining multiple deployments, one for the new version and others for older versions.
- Usually, because of the small incremental change model, interfaces do not need to change.
- Have an explicit deprecation policy. Then the service provider is strongly incented to move all clients off version N and over to version N+1.
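One hedged sketch of what “never break client code” can look like in practice: keep serving the deprecated version N alongside N+1, adapting old-style requests rather than rejecting them. The route names and payload shapes are hypothetical.

```python
# A sketch of keeping version N alive while clients migrate to N+1.
# Route names and payload shapes are hypothetical.
import warnings

def handle_v2(request: dict) -> dict:
    """Current interface: price is a structured object."""
    return {"item_id": request["item_id"],
            "price": {"amount_cents": 999, "currency": "USD"}}

def handle_v1(request: dict) -> dict:
    """Deprecated interface, kept working until every client moves off it."""
    warnings.warn("v1 is deprecated; migrate to v2", DeprecationWarning)
    v2 = handle_v2(request)
    # Down-convert the v2 shape to exactly what v1 clients expect.
    return {"item_id": v2["item_id"],
            "price_cents": v2["price"]["amount_cents"]}

ROUTES = {"/v1/item": handle_v1, "/v2/item": handle_v2}

if __name__ == "__main__":
    print(ROUTES["/v1/item"]({"item_id": "123"}))  # old clients keep working
    print(ROUTES["/v2/item"]({"item_id": "123"}))
```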
Operating Services At Scale
- As a service provider what does it feel like to operate a service in a large scale system of polyglot microservices?
- Predictable performance is a requirement.
- Services at scale are very exposed to variability in performance.
- Predictability in performance is much more important than average performance.
- Low latency with inconsistent performance is actually not low latency at all.
- It’s far easier for clients to program against a service when it provides consistent performance.
- Tail latencies dominate performance as services use many other services to carry out their work.
- Imagine a service that has 1ms latency at the median, but at the 99.99th percentile (1 request in 10,000) the latency is one second.
- Making one call means you are slow 0.01% of the time.
- If a request fans out to 5,000 machines, as many large scale services at Google do, at least one call is slow on roughly 39% of requests (1 - 0.9999^5000 ≈ 0.39); the back-of-the-envelope estimate of 5,000 × 0.01% puts it at 50%. See the sketch after this list.
- For example, a one-in-a-million problem with memcached was tracked down to a low level data structure reallocation event. This rare problem surfaced at a higher level as latency spikes. Low level details like this turn out to be extremely important in a large scale system.
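The fan-out arithmetic above, written out as a short calculation (the fan-out sizes are just sample values):

```python
# The fan-out arithmetic from the bullets above.
def p_slow(fanout: int, p_single: float = 1e-4) -> float:
    """Probability that at least one of `fanout` independent calls is slow."""
    return 1 - (1 - p_single) ** fanout

for n in (1, 100, 1000, 5000):
    print(f"fan-out {n:>4}: slow on {p_slow(n):.2%} of requests")
# fan-out 5000 -> ~39.35%; the linear estimate 5000 * 0.01% gives 50%.
```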
- Resilience in depth.
- Service disruptions are far more likely to be caused by a person making a mistake than by a hardware or software failure.
- Be resilient to machine, cluster and datacenter failures.
- Load balance and provide flow control when invoking other services.
- Be able to rapidly roll back changes.
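As one illustration of flow control when invoking other services, here is a minimal sketch that caps in-flight requests with a semaphore and sheds load instead of queueing forever. This is an assumed mechanism, not the specific technique Google or eBay uses.

```python
# A minimal flow-control sketch: cap in-flight calls to a downstream
# service with a semaphore and shed load rather than queueing forever.
# One plausible mechanism, not the specific one Google or eBay uses;
# real clients add timeouts, retries with backoff, and load balancing.
import threading

class BoundedClient:
    def __init__(self, call, max_in_flight: int = 100):
        self._call = call  # the underlying RPC, injected for testability
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def request(self, payload):
        # Fail fast when the dependency is saturated instead of piling up.
        if not self._slots.acquire(timeout=0.05):
            raise RuntimeError("downstream saturated, shedding load")
        try:
            return self._call(payload)
        finally:
            self._slots.release()

if __name__ == "__main__":
    client = BoundedClient(lambda p: {"echo": p}, max_in_flight=2)
    print(client.request("hello"))
```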
- Incremental deployment.
- Use a canary system. Don’t deploy to all machines at once. Choose one system, put the new version of the software on it, and see how it behaves.
- If it works, begin a staged rollout. Start off with 10% of the machines, move to 20%, and so on through the rest of the fleet.
- If a problem happens at, say, the 50% point in the deploy, you should be able to roll back.
- eBay made use of feature flags to decouple code deployment from feature deployment. Typically code is deployed with a feature turned off, then it can be turned on or off. This makes sure the code can be properly deployed before a new feature is turned on. It also means if the new feature has a bug, a performance issue, or a business failure, then the feature can be turned off without having to deploy new code.
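A sketch of the feature-flag pattern described above, combined with a percentage-based ramp. The flag name, the hashing scheme, and the in-process flag table are illustrative; real systems typically read flags from a central config service so they can change without a deploy.

```python
# A sketch of flag-gated rollout: deploy the code dark, then ramp the
# flag from 0% to 100% (or back to 0% on trouble) without redeploying.
# The flag table is in-process here for brevity; real systems read flags
# from a central config service. Names and hashing scheme are illustrative.
import hashlib

FLAGS = {"new_checkout": 10}  # percent of users who get the feature

def is_enabled(flag: str, user_id: str) -> bool:
    """Stable per-user bucketing: a user always gets the same answer."""
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < FLAGS.get(flag, 0)

def checkout(user_id: str) -> str:
    if is_enabled("new_checkout", user_id):
        return "new checkout flow"
    return "old checkout flow"  # rollback = set the flag to 0, no deploy

if __name__ == "__main__":
    hits = sum(is_enabled("new_checkout", f"user{i}") for i in range(10_000))
    print(f"~{hits / 100:.1f}% of users see the new flow")  # close to 10%
```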
- You can have too much alerting, but you can never have too much monitoring.
Service Anti-Patterns
- The Mega-Service
- A service that does too much. What you want is an ecosystem of very small clean services.
- A service that does too much is just another monolith. It’s hard to reason about, it’s hard to scale, it’s hard to change, and it also creates more upstream and downstream dependencies than you want.
- Shared Persistence
- In the tiered model services are put in the application tier and the persistence layer is provided as a common service to the applications.
- They did this at eBay and it didn’t work. It breaks the encapsulation of the service. Applications can back-door into your service by updating the database. It ends up reintroducing coupling of services. Shared databases don't allow for loosely coupled services.
- Microservices prevent this problem by being small, isolated, and independent, which is how you keep your ecosystem healthy and growing.
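To make the shared-persistence point concrete, here is a small sketch contrasting the sanctioned path (calls through the service interface, where invariants are enforced) with the back-door path (writing to the service’s store directly). The order service and its invariant are hypothetical.

```python
# Contrast: the sanctioned path through the service interface vs. the
# shared-persistence back door. The order service and its invariant are
# hypothetical.
class OrderService:
    def __init__(self):
        self._orders = {}  # private store; not a database shared across teams

    def place_order(self, order_id: str) -> None:
        # The service is the single place where invariants are enforced.
        if order_id in self._orders:
            raise ValueError("duplicate order")
        self._orders[order_id] = "PLACED"

    def status(self, order_id: str) -> str:
        return self._orders[order_id]

svc = OrderService()
svc.place_order("o-1")        # the sanctioned path: invariants hold
print(svc.status("o-1"))

# The anti-pattern: writing to the store directly skips every invariant
# and couples the two teams to the internal schema forever.
svc._orders["o-1"] = "???"    # don't do this
```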