Sunday, May 22, 2016

How to Troubleshoot



http://blog.thefirehoseproject.com/posts/learn-to-code-and-be-self-reliant/
Debugging error messages is incredibly important. The fact of the matter is, error messages are just a part of programming: they are seen by inexperienced and very experienced developers alike. The only difference is, the more experience you have dealing with error messages, the less time you’ll need to spend trying to fix them. Here’s why:
  • Over time, you will learn how to read error messages and extract the relevant details of the problem quickly. The first time you see an error message, it will take you a while to decode what it actually means. But after you’ve seen hundreds of error messages (and you will see hundreds!), you will be able to pinpoint the problem’s location and the relevant details you need in order to fix it.
  • You should learn from each error message that you resolve. Don’t just fix the error and be done with it; understand what is wrong with the code you’re fixing. By learning from each of your errors, the next time you make the same mistake, you’ll be able to fix it much faster.
  • Initially, you will probably ask for help on each error message you see. Over time, you’ll learn to ask for help less frequently by double-checking your code and conducting smart Google searches.
https://ericlippert.com/2014/03/05/how-to-debug-small-programs/
While you are running the code in the debugger I encourage you to listen to small doubts. Most programmers have a natural bias to believe their program works as expected, but you are debugging it because that assumption is wrong! Very often I’ve been debugging a problem and seen out of the corner of my eye the little highlight show up in Visual Studio that means “a memory location was just modified”, and I know that memory location has nothing to do with my problem. So then why was it modified? Don’t ignore those nagging doubts; study the odd behaviour until you understand why it is either correct or incorrect.
And the next time you write an assignment, write the specification, test cases, preconditions, postconditions and assertions for a method before you write the body of the method! You are much less likely to have a bug, and if you do have a bug, you are much more likely to be able to find it quickly.
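A minimal sketch of that advice, assuming a small hypothetical function `index_of_largest` (not from the linked article): the specification, precondition, postconditions, and test cases are written down first, and the body is filled in last.

```python
def index_of_largest(values):
    """Specification: return the index of the first largest element of a non-empty list."""
    # Precondition: the caller must pass at least one value.
    assert len(values) > 0, "precondition violated: values must be non-empty"

    best = 0
    for i in range(1, len(values)):
        if values[i] > values[best]:
            best = i

    # Postconditions: the index is valid and no element is larger than the chosen one.
    assert 0 <= best < len(values), "postcondition violated: index out of range"
    assert all(values[best] >= v for v in values), "postcondition violated: not the largest"
    return best


# Test cases derived from the specification rather than from the body.
assert index_of_largest([3, 9, 2]) == 1
assert index_of_largest([7]) == 0
assert index_of_largest([4, 4]) == 0  # ties resolve to the first occurrence
```

If the body is later changed incorrectly, one of the assertions fails close to the mistake, which is usually easier to diagnose than a wrong result surfacing far downstream.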
https://technet.microsoft.com/en-us/magazine/ff955771.aspx

If You Don’t Know Why It Works, It Isn’t Fixed

Ask, Narrow and Verify

Open Minds, Simple Solutions

Good troubleshooting skills are a constant and a necessity that’s entirely separate from technical knowledge. Good troubleshooting means applying logic so you can take concrete steps to effectively narrow the possibilities. It also means keeping an open mind and calling on any related knowledge (including how to find help and research the problem) to help you reach the simple solution.
  • Take notes about error messages: If your computer gives you an error message, be sure to write down as much information as possible. You may be able to use this information later to find out if other people are having the same error.
Debug It!
Reproduce -> Diagnose -> Fix -> Reflect
Reproduce:
Find a way to reliably and conveniently reproduce the problem on demand.

Diagnose:
Construct hypotheses, and test them by performing experiments until you are confident that you have identified the underlying cause of the bug.

Fix:
Design and implement changes that fix the problem, avoid introducing regressions, and maintain or improve the overall quality of the software.

Reflect:
Learn the lessons of the bug. Where did things go wrong? Are there any other examples of the same problem that will also need fixing? What can you do to ensure that the same problem doesn’t happen again?


1. High-Level Strategies
Item 1: Handle All Problems through an Issue-Tracking System
Item 2: Use Focused Queries to Search the Web for Insights into Your Problem
Item 3: Confirm That Preconditions and Postconditions Are Satisfied
Item 4: Drill Up from the Problem to the Bug or Down from the Program’s Start to the Bug
Item 5: Find the Difference between a Known Good System and a Failing One
Compare the behavior of a known good system with that of a failing one to find the failure’s cause.
Consider all of the elements that can influence a system’s behavior: code, input, invocation arguments, environment variables, services, and dynamically linked libraries.
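One way to put Item 5 into practice is to capture the same facts on both systems and compare them mechanically. The sketch below is only an illustration: the snapshot keys and values (`app_version`, `libfoo`, and so on) are invented, and `diff_snapshots` is a hypothetical helper, not something from the book.

```python
def diff_snapshots(good, bad):
    """Return {key: (good_value, bad_value)} for every key whose value differs."""
    missing = "<missing>"
    differences = {}
    for key in sorted(set(good) | set(bad)):
        good_value = good.get(key, missing)
        bad_value = bad.get(key, missing)
        if good_value != bad_value:
            differences[key] = (good_value, bad_value)
    return differences


# Hypothetical snapshots: versions, environment variables, library versions.
known_good = {"app_version": "1.4.2", "LANG": "en_US.UTF-8", "libfoo": "2.1", "TZ": "UTC"}
failing = {"app_version": "1.4.2", "LANG": "C", "libfoo": "2.3", "TZ": "UTC", "DEBUG": "1"}

for key, (good_value, bad_value) in diff_snapshots(known_good, failing).items():
    print(f"{key}: good={good_value!r} failing={bad_value!r}")
```

Each reported difference is a candidate cause; changing them one at a time on the failing system shows which one matters.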

Item 6: Use the Software’s Debugging Facilities
Item 7: Diversify Your Build and Execution Environment
Item 8: Focus Your Work on the Most Important Problems
2. General-Purpose Methods and Practices
Item 9: Set Yourself Up for Debugging Success
Prepare a robust minimal test case.
Automate the bug’s reproduction.
Script a log file’s analysis.
Learn how an API or language feature really works.

• Sleep on a difficult problem.
• Don’t give up.
• Invest in your environment, tools, and knowledge.

Item 10: Enable the Efficient Reproduction of the Problem
Reproducible runs simplify your debugging process.
Create a short self-contained example that reproduces the problem.
Have mechanisms to create a replicable execution environment.
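Creating a short example can itself be partly automated. The following is a rough sketch of a greedy input reducer, loosely in the spirit of delta debugging; it is not taken from the book. `shrink` is a hypothetical helper and `still_fails` stands in for "run the program on this input and check whether the bug still appears".

```python
def shrink(failing_input, still_fails):
    """Greedily remove chunks from a failing list while the failure persists."""
    current = list(failing_input)
    chunk = max(1, len(current) // 2)
    while chunk >= 1:
        i = 0
        while i < len(current):
            candidate = current[:i] + current[i + chunk:]
            if still_fails(candidate):
                current = candidate  # smaller input still fails: keep it, retry same position
            else:
                i += chunk           # this chunk is needed to trigger the failure
        chunk //= 2
    return current


def still_fails(values):
    # Hypothetical failure condition: the bug needs both a negative number and a zero.
    return any(v < 0 for v in values) and 0 in values


minimal = shrink([5, 3, -2, 8, 0, 7, 1, 4], still_fails)
print(minimal)
```

The reduced input contains only the elements needed to trigger the failure, which makes the example shorter and easier to reason about.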

Item 11: Minimize the Turnaround Time from Your Changes to Their Result

Item 12: Automate Complex Testing Scenarios
Item 13: Enable a Comprehensive Overview of Your Debugging Data
Item 14: Consider Updating Your Software
Item 15: Consult Third-Party Source Code for Insights on Its Use
Get the source code for third-party code you depend on.
Explore problems with third-party APIs and cryptic error messages by looking at the source code.
Link with the library’s debug build.
Correct third-party code only when there’s no other reasonable alternative.
Item 16: Use Specialized Monitoring and Test Equipment
Item 17: Increase the Prominence of a Failure’s Effects
• Force the execution of suspect paths.
• Increase the magnitude of some effects to make them stand out for study.
• Apply stress to your software to force it out of its comfort zone.
• Perform all your changes under a temporary revision control branch.

Item 18: Enable the Debugging of Unwieldy Systems from Your Desk
Item 19: Automate Debugging Tasks
Automate the exhaustive searching for failures; computer time is cheap, yours is expensive.
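As an illustration of letting the computer do the searching, the sketch below tries every small input and compares a suspect function against a trusted reference. `suspect_max` is a deliberately buggy, hypothetical function, and `find_counterexample` is an assumed helper name.

```python
from itertools import product


def suspect_max(values):
    """Hypothetical function under test, with a deliberate bug for illustration."""
    best = 0  # bug: wrong starting value when every input is negative
    for v in values:
        if v > best:
            best = v
    return best


def find_counterexample(candidate, reference, alphabet, max_length):
    """Try every input up to max_length; return the first one where the results differ."""
    for length in range(1, max_length + 1):
        for values in product(alphabet, repeat=length):
            if candidate(list(values)) != reference(list(values)):
                return list(values)
    return None


counterexample = find_counterexample(suspect_max, max, alphabet=[-2, -1, 0, 1], max_length=3)
print(counterexample)
```

Because the search goes from shorter to longer inputs, the first counterexample found is also among the smallest, which helps with the minimal test case mentioned in Item 9.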
Item 20: Houseclean Before and After Debugging
Item 21: Fix All Instances of a Problem Class
An error in one place is likely to also occur in others. After fixing one fault, find and fix similar ones and take steps to ensure they will not occur in the future.
3. General-Purpose Tools and Techniques
Item 22: Analyze Debug Data with Unix Command-Line Tools
Item 23: Utilize Command-Line Tool Options and Idioms
Item 24: Explore Debug Data with Your Editor
Item 25: Optimize Your Work Environment
Item 26: Hunt the Causes and History of Bugs with the Revision Control System
Item 27: Use Monitoring Tools on Systems Composed of Independent Processes
4. Debugger Techniques
Item 28: Use Code Compiled for Symbolic Debugging
Item 29: Step through the Code
Item 30: Use Code and Data Breakpoints
Item 31: Familiarize Yourself with Reverse Debugging
Item 32: Navigate along the Calls between Routines
Item 33: Look for Errors by Examining the Values of Variables and Expressions
Item 34: Know How to Attach a Debugger to a Running Process
Item 35: Know How to Work with Core Dumps
Item 36: Tune Your Debugging Tools
Item 37: Know How to View Assembly Code and Raw Memory
5. Programming Techniques
Item 38: Review and Manually Execute Suspect Code
Item 39: Have Colleagues Go Over Your Code and Reasoning
Item 40: Add Debugging Functionality
Item 41: Add Logging Statements
Item 42: Use Unit Tests
Item 43: Use Assertions
Item 44: Verify Your Reasoning by Perturbing the Debugged Program
Item 45: Minimize the Differences between a Working Example and the Failing Code
Item 46: Simplify the Suspect Code
Item 47: Consider Rewriting the Suspect Code in Another Language
Item 48: Improve the Suspect Code’s Readability and Structure
Item 49: Fix the Bug’s Cause, Rather Than Its Symptom
6. Compile-Time Techniques
Item 50: Examine Generated Code
Item 51: Use Static Program Analysis
Item 52: Configure Deterministic Builds and Executions
Item 53: Configure the Use of Debugging Libraries and Checks
7. Runtime Techniques
Item 54: Find the Fault by Constructing a Test Case
Item 55: Fail Fast
Item 56: Examine Application Log Files
Item 57: Profile the Operation of Systems and Processes
Item 58: Trace the Code’s Execution
Item 59: Use Dynamic Program Analysis Tools
8. Debugging Multithreaded Code
Item 60: Analyze Deadlocks with Post-Mortem Debugging
Item 61: Capture and Replicate
Item 62: Uncover Deadlocks and Race Conditions with Specialized Tools
Item 63: Isolate and Remove Nondeterminism
Item 64: Investigate Scalability Issues by Looking at Contention
Item 65: Locate False Sharing by Using Performance Counters
Item 66: Consider Rewriting the Code Using Higher-Level Abstractions

http://blog.51cto.com/13527416/2073644

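Murphy's Law

  • Nothing is as simple as it looks
  • Everything takes longer than you expect
  • Anything that can go wrong will go wrong
  • If you worry that something might happen, it is more likely to happen
Murphy's Law implies that if you worry that something may happen, it becomes more likely to happen, and given enough time it will. The warning for Internet companies is clear: never dismiss any odd behaviour or problem in the production environment, and always investigate the cause behind it. Heinrich's Law likewise stresses that every serious accident is preceded by an accumulation of many small problems; once they reach a certain scale the change becomes qualitative, and the serious problem surfaces.
So for anything observed in an online service, even a minor problem, we need to get to the bottom of it and ask the following questions:
  • Why did it happen?
  • How should we respond when it happens?
  • How do we recover?
  • How do we prevent it?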

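Emergency Response Goal

When a failure occurs in the production environment, restore service quickly, avoid or reduce the losses caused by the failure, and avoid or reduce its impact on customers.

Emergency Response Principles

  • Restore the system first rather than fully solving the problem; stop the loss quickly
  • When there is an obvious financial loss, escalate immediately and stop the loss quickly
  • Metrics should revolve around the goal, so that the emergency process and the loss-stopping plan can be started quickly
  • If the current owner cannot solve the problem within a short time, it must be escalated
  • Preserve the scene while handling the incident, as long as doing so does not affect the user experience

Emergency Response Methods and Process

Online emergency response generally goes through five stages:
  1. Discover the problem
  2. Locate the problem
  3. Solve the problem
  4. Review the problem
  5. Take improvement measures
Throughout the process, remember that emergency response has a single overall goal: recover as quickly as possible and eliminate the impact. Whatever stage you are in, the first thought must be recovery. Recovering does not necessarily require locating the problem, nor a perfect solution; it may rely on experience or on flipping a switch. That is enough to restore service quickly. Afterwards, preserve the scene, then locate the problem, solve it, and hold a retrospective.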

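Discovering the Problem

We usually discover problems through monitoring at the system, application, and middleware levels.
  • System-level monitoring includes:
    1. CPU utilization
    2. Load average
    3. Memory
    4. I/O (network and disk)
    5. Swap usage
    6. Thread count
    7. File descriptors, and so on
  • Application-level monitoring includes:
    1. API response time
    2. QPS
    3. Call frequency
    4. API success rate
    5. API fluctuation rate, and so on
  • Middleware-level monitoring covers databases, caches, and message queues:
    1. For databases: load, slow queries, connection count, and so on
    2. For caches: connection count, memory usage, throughput, response time, and so on
    3. For message queues: response time, throughput, load, backlog, and so on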
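To make the monitoring idea concrete, here is a small sketch that checks one sample of metrics against alert thresholds. The metric names, the threshold values, and the `find_alerts` helper are all assumptions for illustration; a real setup would read these from a monitoring system.

```python
def find_alerts(metrics, thresholds):
    """Return (name, value, limit) for every metric that is above its limit."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append((name, value, limit))
    return alerts


# Hypothetical thresholds spanning the system, application, and middleware levels.
thresholds = {
    "cpu_percent": 85,
    "load_average": 8.0,
    "api_p99_ms": 500,
    "db_slow_queries": 10,
    "mq_backlog": 10000,
}
# Hypothetical sample collected at one point in time.
sample = {
    "cpu_percent": 42,
    "load_average": 9.5,
    "api_p99_ms": 730,
    "db_slow_queries": 3,
    "mq_backlog": 120,
}

for name, value, limit in find_alerts(sample, thresholds):
    print(f"ALERT {name}: {value} exceeds {limit}")
```

Keeping all three levels in one table of thresholds makes it easier to see whether a slow API coincides with high load or a growing queue backlog.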

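Locating the Problem

When analysing and locating the problem, first consider what has changed in the system recently, including:
  • Has the failing system been released recently?
  • Have the underlying platforms or resources it depends on been upgraded?
  • Have the systems it depends on been released recently?
  • Has the operations team made any operational changes in the system?
  • Has the network been unstable?
  • Has business volume grown recently?
  • Is the business side running a promotion?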
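If releases, upgrades, and operational changes are recorded with timestamps, the "what changed recently?" questions can be answered with a simple filter. This is only a sketch: the change log entries and the `changes_before` helper are hypothetical.

```python
from datetime import datetime, timedelta


def changes_before(incident_time, changes, window):
    """Return the changes made within `window` before the incident, newest first."""
    earliest = incident_time - window
    recent = [c for c in changes if earliest <= c["time"] <= incident_time]
    return sorted(recent, key=lambda c: c["time"], reverse=True)


# Hypothetical change log collected from releases and operational records.
changes = [
    {"time": datetime(2016, 5, 20, 10, 0), "what": "dependency service release"},
    {"time": datetime(2016, 5, 22, 9, 30), "what": "database parameter change"},
    {"time": datetime(2016, 5, 22, 13, 45), "what": "release of the failing system"},
    {"time": datetime(2016, 5, 22, 15, 0), "what": "promotion campaign starts"},
]
incident = datetime(2016, 5, 22, 14, 10)

suspects = changes_before(incident, changes, timedelta(hours=6))
for change in suspects:
    print(change["time"], change["what"])
```

Listing the newest change first reflects the usual heuristic that the most recent change before an incident is the first one worth checking.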

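Solving the Problem

Solving the problem must build on locating it: clearly identify the root cause first, and only then propose an effective solution. Until the cause is clear, do not try assorted fixes in the hope that one works; you may introduce the next problem before this one is solved. Remember Murphy's Law, mentioned above.

Reviewing the Problem

After the problem is solved, the emergency team and the stakeholders should review the cause of the incident and the soundness of the emergency process, and propose corrective actions, focusing on the following questions:
  • Which similar problems exist but have not yet occurred?
  • What could be done so that the incident never happens again?
  • What could be done so that even if a failure occurs, it has no impact?

Improvement Measures

The improvement measures proposed in the review are managed together as formal projects and tracked using the SMART principle.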
Murphy's law is a popular adage that states that "things will go wrong in any given situation, if you give them a chance," or more commonly, "whatever can go wrong, will go wrong." 
Murphy's law is an adage or epigram that is typically stated as: "Anything that can go wrong will go wrong".


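Heinrich's Law: every safety accident is preventable.
Heinrich's Law is an industrial safety principle put forward by Herbert William Heinrich, an American pioneer of industrial safety. It states that behind every serious accident there are necessarily 29 minor accidents, 300 near misses, and 1,000 latent hazards.


According to Heinrich's Law, when a major accident occurs, we should handle the accident itself and also promptly check for and deal with the "warning signs" and "early symptoms" of similar problems. This prevents similar problems from recurring, removes the hidden risk of another major accident, and resolves problems while they are still in the bud.


Heinrich's Law emphasises two points: first, an accident is the result of quantitative accumulation; second, no matter how good the technology or how perfect the rules, in actual operation they cannot replace people's own competence and sense of responsibility.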

In his 1931 book "Industrial Accident Prevention, A Scientific Approach", Herbert W Heinrich put forward the following concept that became known as Heinrich's Law:
in a workplace, for every accident that causes a major injury, there are 29 accidents that cause minor injuries and 300 accidents that cause no injuries.
This is commonly depicted as a pyramid (in this case with the number of minor incidents shown as 30 for simplicity):
Heinrich's law is based on probability and assumes that the number of accidents is inversely proportional to the severity of those accidents. It leads to the conclusion that minimising the number of minor incidents will lead to a reduction in major accidents, which is not necessarily the case.

http://dbaplus.cn/news-21-625-1.html
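In April I came across an interview question, reportedly asked by a Tencent campus-recruiting interviewer: in a multithreaded, high-concurrency environment, if a bug shows up on average once in a million runs, how do you debug it? (Original Zhihu thread: https://www.zhihu.com/question/43416744)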

https://www.zhihu.com/question/43416744
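As for the first question, fixing a bug starts with reproducing it; the second step is locating and reducing it; the third is fixing it. So whether it is one in a million or one in a hundred thousand, first reproduce it, then identify the machine state at the moment it reproduces. Computers do not lie; every problem has a cause. The only task is to recover that state information. You can use log tracing or similar; how to record and restore it is the engineer's choice. Once you have the relevant state, what remains is locating where the problem is, and here the best tools are simulation and reduction: shrink the problem and eliminate interference from unrelated information. Once simulation and reduction succeed, work on a fix, then estimate the difficulty and cost of fixing it. Some bugs are known, but fixing them is too troublesome and their impact is small, so they are left alone.
1. First find out what kind of bug it is. A crash? Inconsistent data? You may then need to ask for more information before deciding which method to use. The original poster's point about preserving as much state as possible is right. Next, for the relevant parts of the software, see whether a test can reproduce the problem, and narrow the scope step by step. On the management side, weigh the severity of the bug against cost and time. If the problem is eventually found, study how to prevent similar bugs.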
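The "reproduce first, and record the state" advice can be sketched as a harness that runs the suspect operation many times and saves whatever is needed to replay a failing run. The example below is a simulation under loud assumptions: `flaky_transfer` is a hypothetical stand-in whose "thread interleaving" is driven by a seeded random generator, so a failing run can be replayed from its seed. Real thread scheduling is not controlled by a seed, so in practice the recorded state would come from logs, traces, or core dumps.

```python
import random


def flaky_transfer(rng):
    """Hypothetical operation that fails rarely, depending on a simulated interleaving."""
    state = {"balance": 100, "interleaving": [rng.randrange(1000) for _ in range(3)]}
    # Simulated rare race: two steps landing on the same slot corrupt the balance.
    if state["interleaving"][0] == state["interleaving"][1]:
        state["balance"] -= 1
    return state


def hunt(max_runs):
    """Run the operation repeatedly; return (seed, state) for the first failing run."""
    for seed in range(max_runs):
        state = flaky_transfer(random.Random(seed))
        if state["balance"] != 100:  # invariant check: the balance must be unchanged
            return seed, state
    return None


found = hunt(100_000)
if found is not None:
    seed, state = found
    print("failing seed:", seed, "recorded state:", state)
    # Replaying the same seed reproduces the same failing state on demand.
    assert flaky_transfer(random.Random(seed)) == state
```

Once a failing run can be replayed at will, the reduction and fixing steps described above can proceed as with any ordinary bug.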

https://www.jiuzhang.com/qa/3705/
https://www.jiuzhang.com/qa/3815/

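Why must you debug on your own?

There are four reasons:
1. If someone else points out where your program is wrong, you gain nothing and will make the same mistake next time.
2. A bug found through long, hard debugging leaves a deeper impression.
3. Debugging ability is something interviews assess.
4. Practising debugging improves your ability to write bug-free code.

Basic steps of debugging

  1. Reread the program. Walk through it following your original line of thought and check whether the program really follows it (you often forget things as you write). This is the quickest and most effective way to debug.
  2. Find a very, very small input that makes your program fail, such as an empty array, an empty string, an array of 1 to 5 numbers, or a one-character string.
  3. Print intermediate results at several points in the program. For example, print the data right after sorting to check that it really is in the order you expect. This lets you locate the part of the program that goes wrong.
  4. Once you have located the faulty part, check whether its logic is wrong.
    In step 4, if you cannot spot the error by eye, "simulate execution" of the program step by step to find it.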
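Steps 2 and 3 can be illustrated with a tiny input and one recorded intermediate result. The functions below are hypothetical examples written for this note, with a deliberate bug in the first one.

```python
def largest_buggy(tokens, trace=None):
    """Return the largest number from a list of numeric strings (deliberately buggy)."""
    ordered = sorted(tokens)  # bug: strings are compared character by character
    if trace is not None:
        trace.append(("after sort", ordered))  # intermediate result for inspection
    return ordered[-1]


def largest_fixed(tokens, trace=None):
    """Same function with the sort comparing numeric values."""
    ordered = sorted(tokens, key=int)
    if trace is not None:
        trace.append(("after sort", ordered))
    return ordered[-1]


# A very small failing input: two numbers are enough to expose the problem.
trace = []
print(largest_buggy(["9", "10"], trace))  # expected "10", so this result is wrong
print(trace)                              # the order after sorting shows why
print(largest_fixed(["9", "10"]))
```

The recorded intermediate result shows that the list is not in numeric order after sorting, which points directly at the sort call instead of at the code that follows it.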
