Architecting for Scale
Build systems keeping availability in mind.
Dependencies
What do you do when a component you depend on fails? How do you retry? What do you do if the failure is an unrecoverable (hard) failure, rather than a recoverable (soft) failure?
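A common pattern here is to retry only the recoverable (soft) failures, with exponential backoff so a struggling dependency is not hammered, and to give up immediately on hard failures. A minimal Python sketch; the exception names are illustrative, not from any particular library:

    import random
    import time

    class PermanentError(Exception):
        """Unrecoverable (hard) failure -- retrying will not help."""

    class TransientError(Exception):
        """Recoverable (soft) failure -- a later retry may succeed."""

    def call_with_retries(operation, max_attempts=4, base_delay=0.5):
        """Retry only soft failures, with exponential backoff and jitter."""
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except PermanentError:
                raise                      # hard failure: give up immediately
            except TransientError:
                if attempt == max_attempts:
                    raise                  # out of attempts
                delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
                time.sleep(delay)          # back off before trying again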
Customers
What do you do when a component that is a customer of your system behaves poorly? Can you handle excessive load on your system? Can you throttle excessive traffic? Can you handle garbage data passed in? What about accidental or intentional Denial of Service (DoS) problems?
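Throttling excessive traffic is often done with a token bucket kept per customer or per API key. A rough sketch, assuming a single-process service; a real deployment would keep the buckets in shared storage so all instances see the same counts:

    import time

    class TokenBucket:
        """Per-client throttle: allow bursts up to `capacity`, refill at
        `rate` tokens per second, reject when the bucket is empty."""

        def __init__(self, rate: float, capacity: float):
            self.rate = rate
            self.capacity = capacity
            self.tokens = capacity
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False   # caller should shed the request (e.g. HTTP 429)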
Limit Scope
When there is a failure, it is often possible for part or even most of your system to continue to operate, even if part of your system is failing. What do you do to make sure a failure has the most limited scope possible? If a dependency fails, can you provide some value even if you cannot provide full services?
Always Think About Scaling.
Use CDNs and caching, and think about how state is maintained.
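For example, a cache-aside helper keeps repeated reads off the origin. The load_from_origin callback below is hypothetical and stands in for a database or upstream service call:

    import time

    _cache = {}  # key -> (value, expires_at)

    def get_with_cache(key, load_from_origin, ttl_seconds=60):
        """Cache-aside: serve from cache when fresh, otherwise load and store."""
        entry = _cache.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]
        value = load_from_origin(key)          # e.g. database or upstream service
        _cache[key] = (value, time.monotonic() + ttl_seconds)
        return value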
Mitigate risk.
A server crashes, a database becomes corrupted, a returned answer is incorrect, a network connection fails.
Decide what to do when a problem occurs in order to reduce its impact as much as possible. Mitigation is about making sure your application works as well and as completely as possible, even when services and resources fail. Risk mitigation requires thinking about the things that can go wrong and putting a plan together, now, so you can handle the situation when it does happen.
Determining where the risk is.
Determining which risk items must be removed and which are acceptable.
Mitigating your remaining risk to reduce its likelihood and severity.
Each risk is measured along two dimensions: severity and likelihood.
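One common way to prioritize is to score each risk as likelihood times severity and work the list from the top. The entries below are illustrative, not taken from the source:

    # Score each risk by likelihood and severity, then sort so the riskiest
    # items (the ones to remove or mitigate first) come out on top.
    RISKS = [
        # (risk, likelihood 1-5, severity 1-5)  -- illustrative entries
        ("Primary database corruption",   1, 5),
        ("Single server crash",           4, 2),
        ("Dependency returns bad answer", 2, 4),
        ("Network connection failure",    3, 3),
    ]

    for risk, likelihood, severity in sorted(RISKS, key=lambda r: r[1] * r[2], reverse=True):
        print(f"{likelihood * severity:>2}  {risk}")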
Monitor availability.
Monitor from an external (customer) perspective, and also with internal monitoring.
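A minimal external probe might look like the following. The /healthz URL is a placeholder, and in practice the probe would run from outside your own network on a schedule so it reflects what customers actually see:

    import urllib.request

    def probe(url: str, timeout: float = 5.0) -> bool:
        """Return True if the endpoint answers with a 2xx within `timeout` seconds."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return 200 <= resp.status < 300
        except Exception:
            return False   # count as unavailable; alert from here

    # Example: run on a schedule from outside your own network.
    # probe("https://example.com/healthz")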
Respond to availability issues in a predictable and defined way.
Establish internal private SLAs for service-to-service communications
Other teams that are closely connected to the troubled service and depend on it may also want to be alerted when problems occur. They typically want to make sure that another team’s problems don’t become their own service’s problems.
Dividing Into Services
Specific Business requirements
Distinct and Separable Team Ownership
Naturally Separable Data
Shared Capabilities/Data
A service must be owned and maintained by a single development team within your organization.
Single Team Owned Service Architecture (STOSA)
Service SLAs
Managing the production environment, along with all dev, staging, and pre-production deployment environments for the service.
Monitoring
Nines     Availability   Allowed downtime per month
2 Nines   99%            432 minutes
3 Nines   99.9%          43 minutes
4 Nines   99.99%         4 minutes
5 Nines   99.999%        26 seconds
6 Nines   99.9999%       2.6 seconds
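These numbers follow directly from the length of a 30-day month (43,200 minutes); a quick Python check:

    MINUTES_PER_MONTH = 30 * 24 * 60          # 43,200 minutes in a 30-day month

    for nines in range(2, 7):
        downtime_fraction = 10 ** -nines       # e.g. 3 nines -> 0.001 unavailable
        downtime_min = MINUTES_PER_MONTH * downtime_fraction
        print(f"{nines} nines: {downtime_min:g} min/month ({downtime_min * 60:g} s)")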
Dealing with Service Failures
Avoid Cascading Service Failures
Responding to a Service Failure
Predictable Response
Having a predictable response is an important part of being a service that other services can depend on. You must provide a predictable response given a specific set of circumstances and requests.
As much as possible, even if your dependencies fail or act unpredictably, it is important that you do not propagate that unpredictability upward to those who depend on you.
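One way to keep responses predictable is to put a time budget on every dependency call and fall back to a fixed, documented default when the budget is exceeded. A sketch with hypothetical names (fetch_personalized, recommendations_for):

    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    _pool = ThreadPoolExecutor(max_workers=8)

    def recommendations_for(user_id, fetch_personalized, default=()):
        """Return personalized results if the dependency answers in time,
        otherwise a fixed, documented default -- never an unbounded wait
        or a surprise exception propagated to our own callers."""
        future = _pool.submit(fetch_personalized, user_id)
        try:
            return future.result(timeout=0.2)   # 200 ms budget for the dependency
        except TimeoutError:
            future.cancel()
            return default
        except Exception:
            return default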
Graceful Degradation - reduced functionality.
Graceful degradation means that when a service lacks needed results from a failed dependency, it reduces its functionality as little as possible and continues doing whatever useful work it still can.
It is important for a service (or application) to provide as much value as it can, even if not all the data it normally would need is available to it due to a dependency failure.
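A sketch of the idea, with hypothetical get_product/get_reviews dependencies: the core data is required, the enrichment is optional, and the response says explicitly what is missing:

    def product_page(product_id, get_product, get_reviews):
        """Build the page from whatever data is available: if the reviews
        service is down, still return the core product details."""
        product = get_product(product_id)        # core data: required
        try:
            reviews = get_reviews(product_id)    # enrichment: optional
            reviews_available = True
        except Exception:
            reviews, reviews_available = [], False
        return {
            "product": product,
            "reviews": reviews,
            "reviews_available": reviews_available,  # lets the UI explain the gap
        }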
Graceful Backoff
Changing what you need to do in a way that provides some value to the consumer, even if you cannot really complete the request, is an example of graceful backoff.
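For example, if the payment dependency is down, an order service might accept the order anyway, park it on a durable queue, and tell the customer it will be confirmed later. A simplified sketch; the in-memory queue stands in for something durable such as SQS or Kafka:

    import json
    import queue

    _pending = queue.Queue()   # stand-in for a durable queue

    def place_order(order, charge_card):
        """If the payment dependency is down, accept the order anyway,
        park it for later processing, and tell the customer what to expect."""
        try:
            receipt = charge_card(order)
            return {"status": "confirmed", "receipt": receipt}
        except Exception:
            _pending.put(json.dumps(order))      # a worker retries these later
            return {"status": "accepted",
                    "message": "We'll email you a confirmation shortly."}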
Fail as Early as Possible
Resource conservation
Responsiveness
The sooner you determine a request will fail, the sooner you can give that result to the requester. This lets the requester move on and make other decisions more quickly.
Error complexity
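A sketch of failing fast on a hypothetical image-resize request: the cheap validation happens before any expensive work or downstream calls are started:

    def resize_image(request):
        """Validate everything we can up front and reject bad requests
        immediately, before reserving workers, memory, or upstream calls."""
        width = request.get("width")
        if not isinstance(width, int) or not (1 <= width <= 4096):
            # fail early: cheap check, clear single-cause error for the caller
            raise ValueError("width must be an integer between 1 and 4096")
        if "image_url" not in request:
            raise ValueError("image_url is required")
        # ...only now start the expensive download/decode/resize work...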
Customer-Caused Problems
If the error message returned indicates that the error is permanent and caused by an invalid argument, the calling service can see the “permanent error” indicator and skip retries that it knows would fail.
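One way to expose this is a "permanent" flag in every error response, so callers can decide whether a retry can possibly help. A simplified sketch; treating 4xx statuses other than 429 as permanent is an assumption for illustration, not a rule from the source:

    from http import HTTPStatus

    def error_response(status: HTTPStatus, message: str) -> dict:
        """Tell the caller whether retrying can possibly help."""
        permanent = 400 <= status < 500 and status != HTTPStatus.TOO_MANY_REQUESTS
        return {
            "error": message,
            "status": int(status),
            "permanent": permanent,   # caller: if True, do not retry
        }

    # Caller-side check (names here are illustrative):
    # resp = call_service(request)
    # if resp.get("permanent"):
    #     record_bad_request(request)   # fix the input; retrying is pointless
    # else:
    #     schedule_retry(request)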
Provide Service Limits
Building Systems with Reduced Risk
Idempotent
Simplicity
Self-Repair
A load balancer that reroutes traffic to a new server when a previous server fails.
A “hot standby” database that is kept up to date with the main production database. If the main production database fails or goes offline for any reason, the hot standby automatically picks up the “master” role and begins processing requests.
A service that retries a request if it gets an error, anticipating that perhaps the original request suffered a transient problem and that the new request will succeed.
A queuing system that keeps track of pending work so that if a request fails, it can be rescheduled to a new worker later, increasing the likelihood of its completion and reducing the chance of losing track of the work.
A background process (for example, something like Netflix’s Chaos Monkey) that deliberately introduces faults into the system, which is then checked to make sure it recovers correctly on its own.
A service that requests multiple, independently developed and managed services to perform the same calculation. If the results from all services are the same, the result is used. If one (or more) independent service returns a different result than the majority, that result is thrown away and the faulty service is shut down for repairs.
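The retry and queuing examples above are only safe when the underlying work is idempotent, which ties back to the Idempotent item earlier in this section. A minimal sketch using a client-supplied idempotency key; the in-memory dict stands in for a durable store, and the payment handler is purely illustrative:

    _processed = {}   # idempotency key -> stored result (use a durable store in practice)

    def apply_payment(key: str, amount_cents: int, account: dict) -> dict:
        """Idempotent handler: retrying the same request (same key) returns the
        original result instead of charging the account twice."""
        if key in _processed:
            return _processed[key]                 # duplicate: replay prior answer
        account["balance"] -= amount_cents         # the actual side effect, once
        result = {"status": "ok", "balance": account["balance"]}
        _processed[key] = result
        return result

    # acct = {"balance": 10_000}
    # apply_payment("req-123", 2_500, acct)   # charges the account
    # apply_payment("req-123", 2_500, acct)   # safe retry: no double charge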