Monday, October 26, 2015

Lessons Learned From Reading Postmortems



https://github.com/danluu/post-mortems

http://danluu.com/postmortem-lessons/
Proper error handling code is hard. Bugs in error handling code are a major cause of bad problems. This means that the probability of having sequential bugs, where an error causes buggy error handling code to run, isn’t just the independent probabilities of the individual errors multiplied. It’s common to have cascading failures cause a serious outage.
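One way to write this down, as a sketch with symbols introduced here for illustration only: let E1 be the initial error and E2 a failure inside the error handling code that E1 triggers.

```latex
P(E_1 \cap E_2) = P(E_1)\,P(E_2 \mid E_1)
```

The error handling path runs only after E1 has happened and is usually the least tested code, so P(E2 | E1) can be far larger than the unconditional P(E2). The joint probability is then much higher than the naive product P(E1) P(E2) suggests.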

Configuration
Configuration bugs, not code bugs, are the most common cause I’ve seen of really bad outages.
Seemingly benign config changes can also cause a company-wide service outage.

Roll out a config change in stages. Start with every server configured to read the property from the DB first.
- Then pick one server: switch it to read the property file first, change the value in its property file, and monitor it.
- If everything works, change the config in the DB and propagate it to all servers.
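The staged rollout above could be sketched roughly like this. It is a minimal illustration only: `resolve`, `canary_hosts`, and the dicts standing in for the DB and the property file are hypothetical names, not any real config system's API.

```python
from typing import Mapping, Optional, Set


def resolve(key: str,
            host: str,
            db_config: Mapping[str, str],
            file_config: Mapping[str, str],
            canary_hosts: Set[str]) -> Optional[str]:
    """Return the value of `key` as seen by `host`.

    Canary hosts read the local property file first and fall back to the DB;
    every other host reads the DB first and falls back to the file.
    """
    if host in canary_hosts:
        first, second = file_config, db_config
    else:
        first, second = db_config, file_config
    if key in first:
        return first[key]
    return second.get(key)


# Hypothetical data: the new value exists only in the canary's property file.
db = {"timeout_ms": "500"}
canary_file = {"timeout_ms": "200"}
canary = {"server-1"}

canary_value = resolve("timeout_ms", "server-1", db, canary_file, canary)
other_value = resolve("timeout_ms", "server-2", db, {}, canary)
```

Only `server-1` sees the new value, so a bad setting is contained to one machine until the DB entry is changed.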

Monitoring / Alerting
https://aws.amazon.com/cn/message/41926/
The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.  One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.  
S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact. We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes. While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected. The index subsystem was the first of the two affected subsystems that needed to be restarted. By 12:26PM PST, the index subsystem had activated enough capacity to begin servicing S3 GET, LIST, and DELETE requests. By 1:18PM PST, the index subsystem was fully recovered and GET, LIST, and DELETE APIs were functioning normally.  The S3 PUT API also required the placement subsystem. The placement subsystem began recovery when the index subsystem was functional and finished recovery at 1:54PM PST. At this point, S3 was operating normally. Other AWS services that were impacted by this event began recovering. Some of these services had accumulated a backlog of work during the S3 disruption and required additional time to fully recover.
We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks. We will also make changes to improve the recovery time of key S3 subsystems. We employ multiple techniques to allow our services to recover from any failure quickly. One of the most important involves breaking services into small partitions which we call cells. By factoring services into cells, engineering teams can assess and thoroughly test recovery processes of even the largest service or subsystem. As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery. During this event, the recovery time of the index subsystem still took longer than we expected. The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately.
From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3. Instead, we used the AWS Twitter feed (@AWSCloud) and SHD banner text to communicate status until we were able to update the individual services’ status on the SHD.  We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions.
Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.
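The safeguard AWS describes (remove capacity more slowly, and never take a subsystem below its minimum required capacity) might look something like the sketch below. The function name, parameters, and numbers are assumptions for illustration; AWS has not published its tool.

```python
class CapacityGuardError(RuntimeError):
    """Raised when a removal request would violate a safety limit."""


def remove_capacity(active: int, requested: int,
                    minimum_required: int, max_per_step: int) -> int:
    """Return the number of active servers left after an approved removal.

    The request is rejected if it exceeds the per-step limit or if it would
    take the subsystem below its minimum required capacity.
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")
    if requested > max_per_step:
        raise CapacityGuardError(
            f"refusing to remove {requested} servers in one step "
            f"(limit: {max_per_step})")
    remaining = active - requested
    if remaining < minimum_required:
        raise CapacityGuardError(
            f"removal would leave {remaining} servers, "
            f"below the minimum of {minimum_required}")
    return remaining


# Hypothetical numbers: a small, intended removal passes the guard.
remaining = remove_capacity(active=100, requested=5,
                            minimum_required=80, max_per_step=10)
```

With a guard like this, a mistyped input such as `requested=50` fails loudly instead of removing the servers.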
https://en.wikipedia.org/wiki/5_Whys
  1. It is necessary to engage the management in the 5 Whys process in the company. For the analysis itself, consider what is the right working group. Also consider bringing in a facilitator for more difficult topics.
  2. Use paper or whiteboard instead of computers.
  3. Write down the problem and make sure that all people understand it.
  4. Distinguish causes from symptoms.
  5. Pay attention to the logic of cause-and-effect relationship.
  6. Make sure that root causes certainly lead to the mistake by reversing the sentences created as a result of the analysis with the use of the expression “and therefore”.
  7. Try to make our answers more precise.
  8. Look for the cause step by step. Don’t jump to conclusions.
  9. Base our statements on facts and knowledge.
  10. Assess the process, not people.
  11. Never leave “human error”, “worker’s inattention”, etc., as the root cause.
  12. Foster an atmosphere of trust and sincerity.
  13. Ask the question “Why” until the root cause is determined, i.e. the cause the elimination of which will prevent the error from occurring again.
  14. When you form the answer for question "Why" - it should happen from the customer's point of view.
http://coolshell.cn/articles/17737.html
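In short, that day an AWS engineer was investigating a billing-related S3 problem in the Northern Virginia (US-EAST-1) Region: the S3 billing system had slowed down. (My guess is that inside Amazon this was a Sev2 incident; Sev2 is a fairly serious level at Amazon and has to be resolved quickly.) The on-call development engineer (note: operations at Amazon are done by the development engineers, which is why SDE, Software Development Engineer, is jokingly expanded inside Amazon as "Someone Do Everything") wanted to remove a small number of servers from one subsystem of the billing system (presumably those servers had problems, so the plan was to remove and redeploy them). One command was entered incorrectly, and a large part of the S3 control systems was removed instead, including two very important subsystems:
1) The S3 object index service (Index), which stores the metadata and location information of S3 objects. This service also serves all GET, LIST, PUT, and DELETE requests.
2) The S3 placement service (Placement), which provides storage locations for objects and relies on the index service. It is mainly used to handle PUT requests for new objects.
That is why S3 became inaccessible.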
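Later in the post, AWS also describes the recovery process, with emphasis on this point:
S3 as a whole was thoroughly designed for failure (note: Design for Failure is one of AWS's seven design principles), so the system can recover even when its most critical components or services fail. But S3 had perhaps been too stable in the past: AWS had not restarted S3's core services for a very long time. Over the last few years the number of objects stored in S3 has grown exponentially (I used to work at Amazon, so I have a rough idea of the order of magnitude; I can't say it here but, honestly, it is staggering). On startup these two core services have to rebuild and validate the integrity of the object index metadata, and that took far longer than anyone expected. The Placement service depends on the Index service, so it took even longer.
Engineers familiar with low-level systems will know how important these two services are. In short, they are like inodes in a Unix/Linux file system, or like the NameNode in HDFS: if this metadata is lost, all of the users' data is essentially lost as well.
Recovering the index system is like an operating system booting after an abnormal shutdown, when the file system has to run its self-check: the bigger the disk and the more files there are, the slower the process.
Also, this time AWS did not use the word "Outage" as it had for earlier incidents, but "Increased Error Rate" instead. My guess is that the two services were not removed completely, so some users could still use S3 while others could not.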
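0) I really like public incident reports such as GitLab's and AWS's, even when the cause is an elementary human mistake of their own. No cover-up, no glossing over, transparent and sincere. Cool!
1) Fortunately no data of this importance was lost in this incident; otherwise it would have been catastrophic.
2) Also, given the enormous number of objects in an old Region like US-EAST-1, recovering within a few hours is no small feat.
3) This incident confirms again the point I made in 《关于高可用的系统》 (On Highly Available Systems): many factors contribute to a system's high availability, not just its architecture. Even more important is highly available operations.
4) For highly available operations, regular failure drills matter a great deal. AWS presumably does not run such drills, so it either goes a long time without failures or gets caught off guard by a big one. Facebook does better here: every quarter they roll the dice and shut down a random IDC for a day. Netflix has its Chaos Monkey, and Reuters, where I used to work, ran a large-scale failure drill (a disaster exercise) every year.
5) AWS's follow-up improvements show its engineering culture: the fix is to use technology to make its own systems more highly available. By contrast, companies in China typically respond to an incident like this as follows:
a) Add more, and stricter, change and approval processes.
b) Use more restrictive permission and approval systems.
c) Put more people on the job (one person does the work while another watches).
d) Use heavier testing and release processes.
e) Punish whoever caused the incident and lecture the engineers on values.
This is my old refrain: if you are a technology company, you will trust technology more than management. If you trust technology, you solve problems with technology; if you trust management, all you have for solving problems is rules, processes, and values. (Note: I am not separating technology from management here; I simply lean toward solving problems with technology.)
Finally, do you want to build a "highly available technical system" or a "highly available management system"? ;-)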

GitLab.com currently uses a single primary and a single secondary in hot-standby mode. The standby is only used for failover purposes. In this setup a single database has to handle all the load, which is not ideal. The primary's hostname is db1.cluster.gitlab.com, while the secondary's hostname is db2.cluster.gitlab.com.
In the past we've had various other issues with this particular setup due to db1.cluster.gitlab.com being a single point of failure. 
We would later find out that part of the load was caused by a background job trying to remove a GitLab employee and their associated data. This was the result of their account being flagged for abuse and accidentally scheduled for removal. More information regarding this particular problem can be found in the issue "Removal of users by spam should not hard delete".
It would later be revealed by another engineer (who wasn't around at the time) that this is normal behaviour: pg_basebackup will wait for the primary to start sending over replication data and it will sit and wait silently until that time. Unfortunately this was not clearly documented in our engineering runbooks nor in the official pg_basebackup document.

Database Backups Using pg_dump

When we went to look for the pg_dump backups we found out they were not there. The S3 bucket was empty, and there was no recent backup to be found anywhere. Upon closer inspection we found out that the backup procedure was using pg_dump 9.2, while our database is running PostgreSQL 9.6 (for Postgres, 9.x releases are considered major). A difference in major versions results in pg_dump producing an error, terminating the backup procedure.
The difference is the result of how our Omnibus package works. We currently support both PostgreSQL 9.2 and 9.6, allowing users to upgrade (either manually or using commands provided by the package). To determine the correct version to use the Omnibus package looks at the PostgreSQL version of the database cluster (as determined by $PGDIR/PG_VERSION, with $PGDIR being the path to the data directory). When PostgreSQL 9.6 is detected Omnibus ensures all binaries use PostgreSQL 9.6, otherwise it defaults to PostgreSQL 9.2.
The pg_dump procedure was executed on a regular application server, not the database server. As a result there is no PostgreSQL data directory present on these servers, thus Omnibus defaults to PostgreSQL 9.2. This in turn resulted in pg_dump terminating with an error.
While notifications are enabled for any cronjobs that error, these notifications are sent by email. For GitLab.com we use DMARC. Unfortunately DMARC was not enabled for the cronjob emails, resulting in them being rejected by the receiver. This means we were never aware of the backups failing, until it was too late.
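A pre-flight check along these lines could have failed the backup job loudly before pg_dump ran. This is a minimal sketch under stated assumptions: the function names are hypothetical, version strings are assumed to be purely numeric and dot-separated (for example the contents of `PG_VERSION` or `9.6.1`), and the check is deliberately stricter than pg_dump itself, which can dump from servers older than its own version.

```python
def pg_major(version: str) -> str:
    """Return the major version of a PostgreSQL version string.

    Before PostgreSQL 10 the first two components form the major version
    (9.2, 9.6); from 10 onward the first component alone does.
    """
    parts = version.strip().split(".")
    if int(parts[0]) >= 10:
        return parts[0]
    return ".".join(parts[:2])


def check_dump_compatible(client_version: str, server_version: str) -> None:
    """Raise if the pg_dump binary and the server differ in major version."""
    client, server = pg_major(client_version), pg_major(server_version)
    if client != server:
        raise RuntimeError(
            f"pg_dump major version {client} does not match "
            f"server major version {server}; aborting backup")


check_dump_compatible("9.6.1", "9.6")   # same major version: passes silently
```

The check only helps if its failure reaches a human, so the exception should feed an alerting channel that is itself monitored, which is exactly where the rejected cron emails fell short.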
http://coolshell.cn/articles/17680.html
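First, an engineer referred to as YP was doing some load-balancing work on GitLab's production database when something unexpected happened: GitLab came under a DDoS attack and database usage spiked. After blocking the attackers' IPs, he found that a staging database (db2.staging) had fallen 4 GB behind the production database. While fixing the replication problem on this staging database, YP found that db2.staging failed to sync with the primary for all sorts of reasons. By this time YP had been working late into the night. After he had tried several approaches, db2.staging still just hung and would not sync, so he decided to delete the db2.staging database and start a fresh replication from scratch. The delete command was mistakenly typed on the production host (db1.cluster), and the entire production database was deleted. (Chen Hao's note: this failure basically comes down to "working too long" plus "getting lost while switching between many terminal windows".)
During recovery, they found that only the db1.staging database could be used to restore, and that the other five backup mechanisms were all unusable. First, database replication: webhooks were not replicated. Second, disk snapshots: none were taken of the database. Third, the pg_dump backups: the version was wrong (pg_dump 9.2 run against 9.6 data), so nothing was dumped. Fourth, the S3 backups: nothing had been backed up at all. Fifth, the backup procedures themselves were riddled with problems: there were only a few crude manual scripts and poor documentation, so the procedures were not only manual but also impossible to execute. (Chen Hao's note: even if all these backup mechanisms had worked there would still be a problem, because most of them run only once every 24 hours, so restoring from them is bound to lose data; only the first one, database replication, is closer to real time.)
In the end, GitLab copied the six-hour-old data back from db1.staging, which turned out to be very slow: the backup node managed only 60 Mbit/s, so the copy took a long time. (Chen Hao's note: why not just promote db1.staging to production? Because that machine's performance is poor.) The data has now been restored, but because the restored data was six hours old, the following was lost:
  • Roughly 4613 projects, 74 forks, and 350 imports were lost. Because the Git repositories still exist, the database records can be reconstructed from them, but the projects' issues and similar data are gone for good.
  • About ±4979 commit records were lost (Chen Hao's note: these can probably also be recovered from the Git repositories).
  • Possibly 707 users were lost, according to the Kibana logs.
  • Webhooks created after 17:20 on January 31 were lost.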
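Because GitLab published the full details of the incident, it also received a lot of outside help. Simon Riggs, CTO of 2nd Quadrant, published a post on his blog, "Dataloss at Gitlab", with some very good advice:
  • The PostgreSQL 9.6 replication hang may be caused by some bugs, which are being fixed.
  • A 4 GB replication lag is normal for PostgreSQL; it is not a problem.
  • Stopping a standby normally makes the primary release its WALSender connections automatically, so the primary's max_wal_senders parameter should not have been reconfigured. However, when a standby is stopped the primary's replication connections are not released right away, and a newly started standby consumes additional connections. In his view, the 32 connections GitLab configured is too high; 2 to 4 is usually enough.
  • Also, GitLab's earlier setting of max_connections=8000 was too high; lowering it to 2000 is reasonable.
  • pg_basebackup first creates a checkpoint on the primary and only then starts copying; this takes about 4 minutes.
  • Deleting a database directory by hand is a very dangerous operation and should be left to a program. He recommends the newly released repmgr.
  • Restoring backups is just as important and should also be done by a program. He recommends barman (which supports S3).
  • Testing backup and restore is an important process.
From the look of it, part of the cause is probably that the GitLab engineers were not very familiar with PostgreSQL.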
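Afterwards, GitLab opened a series of issues on its site; the list is under "Write post-mortem" (it may still be updated).
The list above already shows some of the improvement measures. They are good, but I don't think they go far enough.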

Related Thoughts

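I have done this sort of thing myself (deleted a database by mistake, lost track of which machine I was working on among several terminal windows...). I also saw it once at Amazon and at least four times at Alibaba (manual-operations mistakes at Alibaba are the incidents I have seen most often). I can't share those publicly here, though I can in private. Here I just want to share my experience and understanding from both the non-technical and the technical side.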
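The Technical Side
Manual Operations
I have long felt that typing commands directly on production machines is a very bad habit. I believe the strength of a company's operations capability is tied to how often people type commands in production: the more you like to log in and type commands, the weaker your operations capability; the more you handle problems through automation, the stronger it is. My reasons:
First, if every code change is a release, then any change to the production environment (hardware, operating system, network, software configuration, ...) is a release too. Such a release should go through the release system and release process, with proper testing and with rollout and rollback plans. The key point is that a release process can be recorded, tracked, and traced back, whereas commands typed in production cannot be traced at all. Nobody knows what commands you typed.
Second, truly healthy operations means people manage code and code manages machines, rather than people managing machines. Nobody knows what commands you typed, but if you write a tool to change the production system, anyone can see what the tool did by reading its source.
As the CTO of 2nd Quadrant suggests, what you need is an automated backup and restore tool, not a permission system.
Third, things like using mv instead of rm, introducing a checklist, or adding a heavier process are even worse. The logic is simple: 1) These rules have to be learned and remembered by people. Fundamentally you did not trust people in the first place, which is why you made the rules and processes, yet executing them again depends on people; it is the same thing under a new label. 2) Whatever is written on paper is not executable; only programs are executable, so why not write the checklist and process as code? (You may say programs make mistakes too. They do, but a program's mistakes are consistent, while people's mistakes are inconsistent.)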
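As a small illustration of "checklist as code", the checklist item "confirm you are on the right machine" can be turned into a guard that a destructive tool runs before doing anything. This is a sketch only: `guard_destructive` and `WrongHostError` are hypothetical names, the hostnames are the ones mentioned in the incident, and nothing is actually deleted here.

```python
import socket
from typing import Optional


class WrongHostError(RuntimeError):
    """Raised when a destructive operation is attempted on the wrong host."""


def guard_destructive(expected_host: str,
                      actual_host: Optional[str] = None) -> str:
    """Return the current hostname if it matches `expected_host`, else raise.

    `actual_host` exists for testing; by default the hostname reported by
    the operating system is used.
    """
    actual = actual_host if actual_host is not None else socket.gethostname()
    if actual != expected_host:
        raise WrongHostError(
            f"this operation targets {expected_host!r} "
            f"but is running on {actual!r}; refusing to continue")
    return actual


# The tool declares its target; the guard passes only on that host.
confirmed = guard_destructive("db2.staging", actual_host="db2.staging")
```

Run on db1.cluster, the same call raises before any data directory is touched, and it fails the same way every time, no matter how tired the operator is.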
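Most importantly, data can be lost in many ways, not only through operator error: power failure, disk damage, malware, and so on. In those cases the processes, rules, manual checks, permission systems, and checklists you designed are all useless. What do you do then? You will find that you have no choice but to use better technology to design a highly available system. There is no other way.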
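A system needs data backups, but in the GitLab incident, even if every backup had been usable, some data loss would still have been unavoidable, and there would have been plenty of other problems. The reasons:
1) Backups are usually periodic, so when you restore from your most recent backup, everything written between the backup time and the failure time is lost.
2) Backed-up data can have version incompatibilities. For example, if between your last backup and the failure you changed the data schema or adjusted the data, the backup will be incompatible with the program running in production.
3) Some companies and banks have a disaster-recovery data center that has never been live for a single day. When a real disaster arrives and it has to go live, all kinds of problems keep it from coming up. Read this report from a few years ago to get a feel for it: 《以史为鉴 宁夏银行7月系统瘫痪最新解析》 (Learning from History: The Latest Analysis of Bank of Ningxia's July System Outage).
So when disaster strikes, you will find that your well-designed "backup system" or "disaster-recovery system", even if it works in normal times, still loses data, and that a backup system left unused for a long time is hard to restore from (for example, because of version incompatibilities among applications, tools, and data).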
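Point 1 above is simple arithmetic, sketched below together with a staleness check that a monitoring job could run. The function names and timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Return how much recent data is lost when restoring from `last_backup`."""
    if failure < last_backup:
        raise ValueError("failure time precedes the last backup")
    return failure - last_backup


def backup_is_stale(last_backup: datetime, now: datetime,
                    max_age: timedelta) -> bool:
    """Return True if the newest backup is older than `max_age`."""
    return now - last_backup > max_age


# Hypothetical daily backup taken at midnight, failure late the same evening.
last_backup = datetime(2017, 1, 31, 0, 0)
failure = datetime(2017, 1, 31, 23, 0)

lost = data_loss_window(last_backup, failure)
stale = backup_is_stale(last_backup, failure, max_age=timedelta(hours=25))
```

Here the backup is not stale by the 25-hour rule, yet 23 hours of data would still be lost, which is the point being made: a periodic backup bounds the loss, it does not eliminate it.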
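I previously wrote 《分布式系统的事务处理》 (Transaction Processing in Distributed Systems). Remember the figure from that article? Look at the Data Loss row: under the Backups, Master/Slave, and Master/Master architectures, data can be lost.
AWS S3 offers four nines of availability plus eleven nines of durability. (AWS defines eleven nines of durability this way: if you store 10,000 objects, you can expect to lose one every 10 million years.) This covers more than disk failures and machine power loss: even if an entire data center goes down, S3 is designed to sustain the loss of data in two facilities with the data still available. Think about it: if technology gives your data that level of availability, would you still worry about someone accidentally deleting the data on one node?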
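The "10,000 objects, one loss per 10 million years" figure follows directly from eleven nines, assuming the durability number is an annual, per-object probability. A quick check (the function name is made up for illustration):

```python
from fractions import Fraction


def years_per_lost_object(nines: int, objects: int) -> Fraction:
    """Return the expected number of years between single-object losses.

    Assumes `nines` nines of durability means each object has an independent
    annual loss probability of 10 ** -nines.
    """
    annual_loss_probability = Fraction(1, 10 ** nines)
    expected_losses_per_year = annual_loss_probability * objects
    return 1 / expected_losses_per_year


years = years_per_lost_object(nines=11, objects=10_000)
```

Exact fractions are used instead of floats so that the result comes out as exactly 10,000,000 years.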
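Generally, every incident calls for reflection. At Amazon, incidents of S2 and above require a COE (Correction of Errors), one section of which is "Ask 5 Whys". I noticed that the first paragraph of GitLab's incident review blog post also says they would do a 5 Whys that day. Asking 5 Whys is not actually an Amazon invention; it is common industry practice: keep asking yourself why until you find the root cause, which forces everyone involved to learn and dig into a lot of things. Wikipedia has an entry, 5 Whys, which lists 14 rules:
  1. You need to find the right team to carry out the incident review.
  2. Use paper or a whiteboard instead of a computer.
  3. Write down the whole problem and make sure everyone understands it.
  4. Distinguish causes from symptoms.
  5. Pay particular attention to cause and effect.
  6. State the root cause along with the supporting evidence.
  7. The answers to the 5 whys need to be precise.
  8. Look for the root of the problem step by step rather than jumping to conclusions.
  9. Base statements on objective facts, data, and knowledge.
  10. Assess the process, not the people.
  11. Never treat "human error" or "inattention at work" as the root cause.
  12. Foster an atmosphere and culture of trust and sincerity.
  13. Keep asking "why" until the root cause is found, so that you never fall into the same hole twice.
  14. When you answer a "why", answer from the user's point of view.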
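Many companies follow the same playbook: first cover it up as hard as they can; when that fails, lie; when lying fails, gloss over it, play down what matters, and divert attention. Yet the best way to face a crisis is "more sincerity, fewer tricks", and the best practice for "more sincerity" is to disclose all information transparently. GitLab set a very good example this time. AWS also discloses all of its incidents and their details.
The mistake has already been made; publishing all the details leaves the public far less room for speculation, helps counter rumors and smear campaigns, and also wins understanding and support. GitLab even live-streamed the whole repair on YouTube, which is remarkable. Have a look at their blog: the transparency and openness drew nothing but praise.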
