Massive Technical Interviews Tips: Troubleshooting Practices

TODO:
https://mp.weixin.qq.com/s/rLdHuwEb7gKjgpbsh1ioSA

https://crossoverjie.top/2018/11/08/java-senior/JVM-concurrent-HashSet-problem

简单来说就是从 MQ 中取出数据然后丢到后面的业务线程池中做具体的业务处理。

而报警的队列正好就是这个线程池的队列。

跟踪代码发现构建线程池的方式如下：


ThreadPoolExecutor executor = new ThreadPoolExecutor(coreSize, maxSize,

0L, TimeUnit.MILLISECONDS,

new LinkedBlockingQueue<Runnable>());;

put(poolName,executor);

采用的是默认的 LinkedBlockingQueue 并没有指定大小（这也是个坑），于是这个队列的默认大小为 Integer.MAX_VALUE。

由于应用已经重启，只能从仅存的线程快照和内存快照进行分析。

其中有两个比较大的对象，一个就是之前线程池存放任务的 LinkedBlockingQueue，还有一个则是 HashSet。

当然其中队列占用了大量的内存，所以优先查看，HashSet 一会儿再看。

由于队列的大小给的够大，所以结合目前的情况来看应当是线程池里的任务处理较慢，导致队列的任务越堆越多，至少这是目前可以得出的结论。

因为从表现来看线程池里的任务迟迟没有执行完毕，所以主要看看它们在干嘛。

正好他们都处于 RUNNABLE 状态，同时堆栈如下：

发现正好就是在处理上文提到的 HashSet，看这个堆栈是在查询 key 是否存在。通过查看 312 行的业务代码确实也是如此。

这里的线程名字也是个坑，让我找了好久。

所有有一个很重要的事实：这几个业务线程在查询 HashSet 的时候运行了 6 7 个小时都没有返回。

同时发现这个应用生产上运行的是 JDK1.7 ，所以我初步认为应该是在查询 key 的时候进入了 HashMap 的环形链表导致 CPU 高负载同时也进入了死循环。

源源不断的从 MQ 中获取数据。
将数据丢到业务线程池中。
判断数据是否已经写入了 Set。
没有则查询数据库。
之后写入到 Set 中。

这里有一个很明显的问题，那就是作为共享资源的 Set 并没有做任何的同步处理。

这里会有多个线程并发的操作，由于 HashSet 其实本质上就是 HashMap，所以它肯定是线程不安全的，所以会出现两个问题：

Set 中的数据在并发写入时被覆盖导致数据不准确。
会在扩容的时候形成环形链表。

第一个问题相对于第二个还能接受。

通过上文的内存分析我们已经知道这个 set 中的数据已经不少了。同时由于初始化时并没有指定大小，仅仅只是默认值，所以在大量的并发写入时候会导致频繁的扩容，而在 1.7 的条件下又可能会形成环形链表。

发现是运行在 HashMap 的 465 行，来看看 1.7 中那里具体在做什么：

已经很明显了。这里在遍历链表，同时由于形成了环形链表导致这个 e.next 永远不为空，所以这个循环也不会退出了。

到这里其实已经找到问题了，但还有一个疑问是为什么线程池里的任务队列会越堆越多。我第一直觉是任务执行太慢导致的。

仔细查看了代码发现只有一个地方可能会慢：也就是有一个数据库的查询。

把这个 SQL 拿到生产环境执行发现确实不快，查看索引发现都有命中。

但我一看表中的数据发现已经快有 7000W 的数据了。同时经过运维得知 MySQL 那台服务器的 IO 压力也比较大。

所以这个原因也比较明显了：

由于每消费一条数据都要去查询一次数据库，MySQL 本身压力就比较大，加上数据量也很高所以导致这个 IO 响应较慢，导致整个任务处理的就比较慢了。

但还有一个原因也不能忽视；由于所有的业务线程在某个时间点都进入了死循环，根本没有执行完任务的机会，而后面的数据还在源源不断的进入，所以这个队列只会越堆越多！

这其实是一个老应用了，可能会有人问为什么之前没出现问题。

这是因为之前数据量都比较少，即使是并发写入也没有出现并发扩容形成环形链表的情况。这段时间业务量的暴增正好把这个隐藏的雷给揪出来了。所以还是得信墨菲他老人家的话。

HashSet 不是线程安全的，换为 ConcurrentHashMap同时把 value 写死一样可以达到 set 的效果。
根据我们后面的监控，初始化 ConcurrentHashMap 的大小尽量大一些，避免频繁的扩容。
MySQL 中很多数据都已经不用了，进行冷热处理。尽量降低单表数据量。同时后期考虑分表。
查数据那里调整为查缓存，提高查询效率。
线程池的名称一定得取的有意义，不然是自己给自己增加难度。
根据监控将线程池的队列大小调整为一个具体值，并且要有拒绝策略。
升级到 JDK1.8。
再一个是报警邮件酌情考虑为电话通知😂。

https://insights.thoughtworks.cn/can-not-log-in

https://tech.meituan.com/netty_direct_memory_screening.html

我们发现打印到控制台的这个 appender 忘记注释掉了，所以初步猜测：因为这个项目打印的日志过多，而 Log4j2 打印到控制台是同步阻塞打印的，所以就导致了这个问题。那么接下来，我们把线上所有机器的这行注释掉，本以为会“大功告成”，但没想到仅仅过了几天，5xx告警又来“敲门”。看来，这个问题并没我们最初想象的那么简单。

CAT 上关于堆外内存的监控没有任何异常（应该是没有统计准确，一直维持在 1M），而这边我们又确认堆外内存已快超过上限，并且已经知道 Netty 底层是使用的哪个字段来统计。那么接下来要做的第一件事情，就是反射拿到这个字段，然后我们自己统计 Netty 使用堆外内存的情况。

堆外内存统计字段是 DIRECT_MEMORY_COUNTER，我们可以通过反射拿到这个字段，然后定期 Check 这个值，就可以监控 Netty 堆外内存的增长情况。

于是我们通过反射拿到这个字段，然后每隔一秒打印，为什么要这样做？

因为，通过我们前面的分析，在爆发大量 OOM 现象之前，没有任何可疑的现象。那么只有两种情况，一种是突然某个瞬间分配了大量的堆外内存导致OOM；一种是堆外内存缓慢增长，到达某个点之后，最后一根稻草将机器压垮。在这段代码加上去之后，我们打包上线。

https://github.com/mrniko/netty-socketio/commit/5ad451ca896937773426507338f53ac0d981f4de#diff-9ae711561f273181c98102849d0660de

这段代码是把 packet 里面一个字段的值转换为一个 char。然而，当我们使用 idea 预执行的时候，却抛出类一个愤怒的 NPE！！也就是说，框架申请到一段内存之后，在 encoder 的时候，自己 GG 了，还给自己挖了个NPE的深坑，最后导致内存无法释放（最外层有堆外内存释放逻辑，现在无法执行到了）。而且越攒越多，直到被“最后一根稻草”压垮，堆外内存就这样爆了

既然 Bug 已经找到，接下来就要解决问题了。这里只需要解决这个NPE异常，就可以 Fix 掉。我们的目标就是，让这个 subType 字段不为空。于是我们先通过 idea 的线程调用栈，定位到这个 packet 是在哪个地方定义的：

我们给这个字段赋值即可，由于这里是连接关闭事件，所以我们给他指定了一个名为 DISCONNECT 的字段（可以改天深入去研究 Socket.IO 的协议），反正这个 Bug 是在连接关闭的时候触发的，就粗暴一点了！

解决这个 Bug 的过程是：将这个框架的源码下载到本地，然后加上这一行，最后重新 Build一下，pom 里改了一下名字，推送到我们公司的仓库。这样，项目就可以直接进行使用了。

https://allegro.tech/2018/05/a-comedy-of-errors-debugging-java-memory-leaks.html
http://www.importnew.com/29591.html
What is the main source of magic in Java applications? Garbage collection of course.

full garbage collections happening once every 15 minutes and causing 20-second long [!] stop-the-world pauses. No wonder the connection to ZooKeeper was timing out despite no issues with either ZooKeeper or the network!

These pauses also explained why the whole application kept dying rather than just timing out and logging an error. Our apps run inside Marathon, which regularly polls a healthcheck endpoint of each instance and if the endpoint isn’t responding within reasonable time, Marathon restarts that instance.

Having previously worked with some applications which allocate a lot of short-lived objects, I increased the size of young generation by increasing both -XX:G1NewSizePercent and -XX:G1MaxNewSizePercent from their default values so that more data could be handled by the young GC rather than being moved to old generation, as Censum showed a lot of premature tenuring. This was also consistent with the full GC collections taking place after some time. Unfortunately, these settings didn’t help one bit.

The next thing I thought was that perhaps the producer generated data too fast for the consumer to consume, thus causing records to be allocated faster than they could be freed. I tried to reduce the speed at which data was produced by the repository by decreasing the size of a thread pool responsible for generating the denormalized records while keeping the size of the consumer data pool which sent them off to ES unchanged. This was a primitive attempt at applying backpressure, but it didn’t help either.

At this point, a colleague who had kept a cooler head, suggested we do what we should have done in the first place, which is to look at what data we actually had in the heap. We set up a development instance with an amount of data comparable to the one in production and a proportionally scaled heap. By connecting to it with jvisualvm and running the memory sampler, we could see the approximate counts and sizes of objects in the heap. A quick look revealed that the number of our domain Ad objects was way larger than it should be and kept growing all the time during indexing, up to a number which bore a striking resemblance to the number of ads we were processing. But… this couldn’t be. After all, we were streaming the records using RX exactly for this reason: in order to avoid loading all of the data into memory.

Lo and behold, we were actually loading all data into memory. It was, of course not intended. Not knowing RxJava well enough at that time, we wanted to parallelize the code in a particular way and resolved to using CompletableFuture along with a separate executor in order to offload some work from the main RX flow. But then, we had to wait for all the CompletableFutures to complete… by storing references to them and calling join(). This caused the references to all futures, and thus also to all the data they referenced, to be kept alive until the end of indexing, and prevented the Garbage Collector from freeing them up earlier.

Software engineering is all about trade-offs, and deciding on the priorities of different tasks is no exception.

Having more RX experience under our belt, we were able to quite easily get rid of the CompletableFutures, rewrite the code to use only RX, migrate to RX2 in the process, and to actually stream the data instead of collecting it in memory. The change passed code review and went to testing in dev environment. To our surprise, the app was still not able to perform indexing with a smaller heap. Memory sampling revealed that the number of ads kept in memory was smaller than previously and it not only grew but sometimes also decreased, so it was not all collected in memory. Still, it seemed as if the data was not being streamed, either.

The relevant keyword was already used in this post: backpressure. When data is streamed, it is common for the speeds of the producer and the consumer to differ. If the producer is faster than the consumer and nothing forces it to slow down, it will keep producing more and more data which can not be consumed just as fast. There will appear a growing buffer of outstanding records waiting for consumption and this is exactly what happened in our application. Backpressure is the mechanism which allows a slow consumer to tell the fast producer to slow down.

Our indexing stream had no notion of backpressure which was not a problem as long as we were storing the whole index in memory anyway. Once we fixed one problem and started to actually stream the data, another problem — the lack of backpressure — became apparent.

This is a pattern I have seen multiple times when dealing with performance issues: fixing one problem reveals another which you were not even aware of because the other issue hid it from view.

In RxJava 2, the original Observable class was split into Observable which does not support backpressure and Flowable which does. Fortunately, there are some neat ways of creating Flowables which give them backpressure support out-of-the-box. This includes creating Flowables from non-reactive sources such as Iterables. Combining such Flowables results in Flowables which also support backpressure, so fixing just one spot quickly gave the whole stream backpressure support.

Looking at GC logs, we still noticed lots of premature tenuring — on the order of 70%. Even though performance was already acceptable, we tried to get rid of this effect, hoping to perhaps also prevent the full garbage collection at the same time.

Premature tenuring (also known as premature promotion) happens when an object is short-lived but gets promoted to the old (tenured) generation anyway. Such objects may affect GC performance since they stuff up the old generation which is usually much larger and uses different GC algorithms than new generation. Therefore, premature promotion is something we want to avoid.

https://crossoverjie.top/2018/07/08/java-senior/JVM-Troubleshoot/

现象:原本只需要执行几分钟的任务执行了几个小时都没退出。翻遍了所有的日志都没找到异常。

所以通常建议大家线程名字给的有意义，在排查问题时很有必要。

其实其他几个线程都和这里的堆栈类似，很明显的看出都是在做 Redis 连接。

于是我登录 Redis 查看了当前的连接数，发现已经非常高了。

这样 Redis 的响应自然也就变慢了。

所以问题的主要原因是：大量的应用并发查询 Redis，导致 Redis 的性能降低。

既然找到了问题，那如何解决呢？

减少同时查询 Redis 的应用，分开时段降低 Redis 的压力。
将 Redis 复制几个集群，各个应用分开查询。但是这样会涉及到数据的同步等运维操作，或者由程序了进行同步也会增加复杂度。

这时就需要相应的工具进行分析了，最常用的自然就是 MAT 了。

我试了一个在线工具也不错（文件大了就不适合了）：

http://heaphero.io/index.jsp

以下是一些日常建议：

尽量不要在线程中做大量耗时的网络操作，如查询数据库（可以的话在一开始就将数据从从 DB 中查出准备好）。
尽可能的减少多线程竞争锁。可以将数据分段，各个线程分别读取。
多利用 CAS+自旋 的方式更新数据，减少锁的使用。
应用中加上 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp 参数，在内存溢出时至少可以拿到内存日志。
线程池监控。如线程池大小、队列大小、最大线程数等数据，可提前做好预估。
JVM 监控，可以看到堆内存的涨幅趋势，GC 曲线等数据，也可以提前做好准备。

怎样了解你的线程在干嘛?

我们前面提到，为了了解线程当前的执行状态，需要通过 jstack 工具来获取线程的 stack 信息，再从中找到我们关心的线程具体的执行情况。

这里首先需要关注的一点是，对于我们自己的线程，一定要「给他一个名字」。否则，在茫茫的线程海里，你只能欲哭无泪。

对于通过 Thread 创建线程时，可以直接通过构造方法指定名称。通过「线程池」来获取线程，需要定义自己的ThreadFactory，在其中指定生成的线程名字。

当然，在Linux环境中，也可以搭配置「top」命令来分析。具体可以按这个步骤来操作：

先用top 命令在机器上分析出当前占用cpu的进程，
通过top -H -p <pid> 查看进程内各个线程的资源占用
第一列显示的是这些线程的pid，此时再搭配 jstack生成的线程 stack信息查找对应的内容。需要注意一点，这里得先把pid转换成十六进制，然后再搜索即可，对应我们上面内容中的nid。

http://afghl.github.io/2017/09/06/production-bug-tracing.html

这个接口是批量操作接口。接受一个ids数组，然后遍历数组，启动事务，向几个表查询，然后向几个表插入记录。

接口没有限制数组大小，也就是如果传一个1000个id的数组，会for循环1000次插入。

可以还原事故现场了：在当时这个接口被调用，传了一个很大的数组，导致DB堵了，所有到DB的连接都堵了，最终导致我们的服务集群无可用线程，报NoAvailableThreadException。

简单验证

确定一下确实是发生了这样的情况：第一个是捞ELK日志：事故前1分钟左右，这个接口的确被调用被传入大数组，而且相同的参数调用了8次。

第二，看thread dump：发现几乎所有线程都wait在C3P0的pool：

这个接口是批量操作接口。接受一个ids数组，然后遍历数组，启动事务，向几个表查询，然后向几个表插入记录。

接口没有限制数组大小，也就是如果传一个1000个id的数组，会for循环1000次插入。

简单验证

确定一下确实是发生了这样的情况：第一个是捞ELK日志：事故前1分钟左右，这个接口的确被调用被传入大数组，而且相同的参数调用了8次。

第二，看thread dump：发现几乎所有线程都wait在C3P0的pool：

基本可以盖棺定论了吧？

可是不对啊

可是不对啊：第一：这时DBA和DAL（数据库中间件团队）告诉我们，DB和DAL根本没读，一直在等客户端发命令下来呢。

第二：insert语气的TPS虽然飙升了，但也就不到100的样子，怎么可能会堵住DB呢？

第三：如果DB堵了，那应该是整个集群都堵了，都会报NoAvailableThreadException，此时我再次查ELK日志确认：报这个异常的节点都被调用了这个接口的，也就是没被调用这个接口的节点，都没事。

而如果是堵在DB的话，那整个集群的所有节点都会堵了。

那就是说是在客户端的某个地方hang住了。而不是在DB层。此时干扰项又出现了：db出现的执行了120s的sql是怎么出现的？

google一下和请教老司机

就这样，想了很久都没有头绪，尝试在测试环境重现，也没法重现（由于线上的这个库是跨机房的，测试环境很难重现这样的网络延迟）。

这时请教DAL组的老大瞅一眼：

应该是触发了C3P0的一个preparedstatment已知bug：preparedstatment的死锁。

再google一下，应该就是这个问题：高并发的时候，触发C3P0的preparedstatment的cache pool死锁，资源不释放，如果线程请求这个cache pool，就会block住。

至于有sql执行很久的问题，再分析一下，是这样的：

这个接口的几个sql语句都在不同的事务中，sql语句发到db之后，db需要等一句commit，这句sql才算是真正执行。如果commit一直没下，就会触发数据库的120s超时。

另一点是应用每一次发送sql语句都需要访问下preparedstatment cache pool，然后再拿一个连接connection。此时如果堵了，那么这条语句就发不到db。

所以DB的120s超时是等一句commit等了120s都等不到，而不是这条sql语句跑了120s。

http://afghl.github.io/2017/09/16/production-cpu-100-tracing.html
记一次线上CPU使用率飙升排查过程

相同的问题一共发生过两次。第一次发生在周二，当时看到的一个现象是CPU使用率飙升，另一个现象是rabbitmq的连接异常。当时第二个现象是一个很大的干扰项，我们怀疑的是rabbitmq中间件出问题了，后来中间件团队的人告诉我们，这个消息的q很久没有改动了，追查就断了。当时是用重启解决问题的。

第二次出现是周五，这时没有报任何异常，只是偶然的发现所有节点的CPU使用率都到了20%：

第二点是cms的GC时间突然上升：

遇到事故的第一个反应当然是：最近有没有发布？没有。最近一次发布是一个礼拜前了，而且那次改动绝不可能导致这个问题。

想到的第一件要做的事是：找出正在使用CPU的线程。SOP：

top 或者 jps找到jvm实例的pid。
top -Hp pid 找到线程的pid
将线程的pid转成16进制，用jstack就可以找到线程，然后可看到线程在跑什么代码。

这时因为我没有在生产环境跑jstack的权限，最后一步竟然扰攘了很久。最后找运维拉了jstack下来，定位到使用CPU的线程都是GC线程：

和第二个现象吻合：说明jvm正在进行大量的GC回收。

GC log

这时候登录到其中一个服务器看GC log，发现CMS的gc log不停地刷出来：说明jvm在不间断的执行full GC！

jstat没权限，于是这时又分析了一轮GC log，因为之前不熟悉这部分内容，所以很多信息都是边看边查。没查出个什么所以然。

推断出是内存泄漏

这时思路有点停滞了。回到监控页面再看集群的监控曲线，看看有没有别的线索。

这时看到堆内存的曲线不对：

图中表示的时间点也就是事故发生的时间点：正常情况下，一次full GC应该会回收大量内存，所以 正常的堆内存曲线应该是呈锯齿形。但在这一次full之后，堆内存再也没有下降。

于是推断： 堆中有大量不能回收的对象且在不停膨胀，使堆的使用占比超过full GC的触发阈值，但又回收不掉，导致full GC一直执行。

推断出是内存泄漏问题了，于是jmap走一波吧。

堆内存使用了3.8G，有一个3.1G的大对象占了82%的容量！

在查看这个对象究竟是什么：是hibernate的queryPlanCache：

找到关键词了：queryPlanCache。google一下，很快定位到原因：http://stackoverflow.com/questions/31557076/spring-hibernate-query-plan-cache-memory-usage

定位root cause

这事还没完，queryPlanCache里缓存了什么？到底是哪些sql语句被缓存了导致泄漏的？

MAT里支持OCL这种查询语法，找到了这个对象大量缓存的是什么：

根据sql语句，搜索一下代码，很容找到root cause了。

问题是之前为什么没有这个问题呢？

因为最近都没有发布，之前为什么没有这个问题呢？看了一下代码，原来是某个代码的错误逻辑：sql语句的生成依赖es的搜索结果，前几天es和db不同步，导致生成了这种in 几千个id的sql语句。

这个系统的qps很低，所以出问题了不是性能出现瓶颈，而是这个诡异的问题。

这问题最快的定位方法是什么？

写起来思路很清晰，但实际上整个过程花费了大半天的时间，其实排查的过程中，有很多关键点没有抓到，有很多现象，是可以凭经验条件反射的推断出原因的：

看到cpu使用率上涨，用jstack看使用cpu的线程，以及该线程在跑什么代码。
找到是gc线程，然后看gc曲线是否正常。
看堆内存曲线，正常的曲线是锯齿形的，如果不是，一次full GC之后内存没有明显下降，那基本可以推断发生内存泄漏了。
怀疑是内存泄漏的问题，可以跑jmap，然后拉到MAT分析。
第四步比较耗时的话，可以同时跑这个命令：jmap -histo pid。看看有没有线索。

拿到进程状态，节点状态的SOP：找运维。

然后是在代码层面提高系统可用性的经验：

主流程应对外部依赖尽量少，比如es，能直接用db完成主流程的场景，不要依赖其他存储。

http://blog.longjiazuo.com/archives/5178

最近集团要做FY19的机器采购预算，我们原来使用的A8和F43机型都不在采购范围内了，新的采购机型推荐F53，需要采购多少台F53？需要有压测数据作为依据来做预算。
优化前，我们的placecard服务在F53上的压测性能结果如下：

备注：

这里的线上是A8机器，32核cpu
F53机器上压测时，压测客户端上出现connection refused错误，服务端上观察到的qps上不去。
可以看到在F53(96核)和线上A8（32核）上的qps差不多，不符合预期。

从问题入手

压测客户端上报了connection refused，怀疑是和tomcat配置有关，当服务端tomcat所有的线程都被占满之后，最大的排队请求数是由acceptCount设置的，当队列中的请求数达到了acceptCount上限后，随后的请求会被拒绝。

http://tomcat.apache.org/tomcat-7.0-doc/config/http.html#Connector_Comparison

The HTTP Connector element represents a Connector component that supports the HTTP/1.1 protocol. It enables Catalina to function as a stand-alone web server, in addition to its ability to execute servlets and JSP pages. A particular instance of this component listens for connections on a specific TCP port number on the server. One or more such Connectors can be configured as part of a single Service, each forwarding to the associated Engine to perform request processing and create the response.

If you wish to configure the Connector that is used for connections to web servers using the AJP protocol (such as the mod_jk 1.2.x connector for Apache 1.3), please refer to the AJP Connector documentation.

Each incoming request requires a thread for the duration of that request. If more simultaneous requests are received than can be handled by the currently available request processing threads, additional threads will be created up to the configured maximum (the value of the maxThreads attribute). If still more simultaneous requests are received, they are stacked up inside the server socket created by the Connector, up to the configured maximum (the value of the acceptCount attribute). Any further simultaneous requests will receive "connection refused" errors, until resources are available to process them.

如果是由这个问题引起的，那么如果把acceptCount配置为一个特别大的值，就应该不会再出现这个问题了，我们修改后的Connector参数配置如下：

再压测果然没有connection refused的情况了（针对压测客户端直连服务器上的其中一个tomcat的场景，我们服务器上部署了3个tomcat），而是出现了新的read timeout的情况，说明大量请求都在队列中，客户端等待响应超时，出现read timeout。顺便说一句，我们压测工具使用的是jmeter，当出现read timeout时，压测客户端由于在等待响应，实际发送的请求qps数减少了，服务器上观察到的qps也没有上去。

增大acceptCount后，压测客户端上看到的情况不一样了，connection refused消失了，变为了readtimeout的错误。说明大量请求都在服务端的队列中了，那么为什么请求会堆积在队列中而不是迅速的被服务端的多线程给处理掉，从上文的贴图中可以看到，服务端的最大线程数配置为了2048（这也是配置了一个很大的值），那么实际有多少个线程，这些线程都在干嘛，是真的这些线程处理不过来吗？拿当时的服务端的jstack快照来看看。
搜一下业务处理的线程数有多少：

发现确实达到了配置的最大线程数2048，这些线程都在干活吗？搜一下runnable状态的有多少

2048个线程中，runnable状态的线程数有75个，waiting状态的有1973个，见下图：

waiting的这些线程都处于BLOCKED状态，是因为写的业务代码没有考虑并发性能问题吗？我们来从线程栈里找找线索，贴上一段jstack信息如下：

"http-bio-8080-exec-2048" #2637 daemon prio=5 os_prio=0 tid=0x00007ff46412b000 nid=0x1ed48 waiting for monitor entry [0x00007feb63ebd000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at java.util.Collections$SynchronizedMap.get(Collections.java:2584)
    - waiting to lock <0x00000007409b1450> (a java.util.Collections$SynchronizedMap)
    at org.dom4j.tree.QNameCache.get(QNameCache.java:117)
    at org.dom4j.DocumentFactory.createQName(DocumentFactory.java:199)
    at org.dom4j.tree.NamespaceStack.createQName(NamespaceStack.java:392)
    at org.dom4j.tree.NamespaceStack.pushQName(NamespaceStack.java:374)
    at org.dom4j.tree.NamespaceStack.getQName(NamespaceStack.java:213)
    at org.dom4j.io.SAXContentHandler.startElement(SAXContentHandler.java:234)

查看锁0x00000007409b1450有多少：1967

锁都发生在这行代码上 'org.dom4j.tree.QNameCache.get(QNameCache.java:117)‘，这是dom4j包里的代码，看来不是因为我们开发同学的代码引起的。
我们这个业务的和下游服务（地图搜索引擎）之间的通讯序列化方式是xml，上述锁发生在response的xml串解析时。按理说不同线程不同请求之间的xml解析相互之间不应该有锁。分析dom4j的代码发现QNameCache的get方法里的确是有synchronizedMap的同步锁，

    /** Cache of {@link QName}instances with no namespace */
    protected Map noNamespaceCache = Collections
            .synchronizedMap(new WeakHashMap());

    /**
     * Cache of {@link Map}instances indexed by namespace which contain caches
     * of {@link QName}for each name
     */
    protected Map namespaceCache = Collections
            .synchronizedMap(new WeakHashMap());

解决dom4j QNameCache的锁竞争问题

解决上述问题的办法是消除这个锁竞争，去掉SynchronizedMap，改为一个ThreadLocal的WeakHashMap，因为是ThreadLocal的，线程独有，所以不存在多个线程之间的锁竞争。当然这样做也有其弊端，多个线程各自去占用nameSpace空间，JVM内存占用的空间比以前大一点，但是在有限的nameSpace情况下，这个多出来的空间基本可以忽略不计。我们的业务场景也不会有无限多的nameSpace类型。最后这个map是一个WeakHashMap，在jvm内存空间不够的时候，WeakHashMap也是可以被JVM强制回收的。
修改之后的jar包已上传集团maven仓库，版本号为：1.6.1-snowman。可能有同学会问，为什么不用最新的dom4j，因为最新版本的dom4j这个问题也依然存在，因而没法采用升级版本的方式来解决这个问题，鉴于我们原来是使用的1.6.1版本，为了避免大的改动带来的风险，我在原来的1.6.1基础之上修复了这个问题，命名版本号：1.6.1-snowman，如果大家遇到类似dom4j的xml解析性能问题，可以尝试下载我们的这个版本来解决这个问题。

继续压测

解决了服务端的上述并发锁问题，我们继续压测，单个tomcat的情况下，qps由以前的1600左右上升到了2200，以为这次能够顺利的提升qps，没想到继续加压后，压测客户端仍然报connection refused问题，服务端上监测到的qps呈如下波动状：

比较奇怪的是这个qps呈波浪形，振荡周期是2分钟，而且很规律，怀疑是压测客户端发送的请求数就是这样的，怀疑是不是压测客户端有问题，这次让压测同学把客户端上报connection refused的完整堆栈信息发出来：

org.apache.http.conn.HttpHostConnectException: Connection to http://10.103.215.139:8080 refused
at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:190)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:643)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
at com.autonavi.snowman.util.HttpURL.returnChunkMessage(HttpURL.java:144)
at com.autonavi.snowman.util.HttpURL.executeSnowmanPbRequest(HttpURL.java:100)
at com.autonavi.snowman.placecard.PlaceCardSearch.runTest(PlaceCardSearch.java:55)
at org.apache.jmeter.protocol.java.sampler.JavaSampler.sample(JavaSampler.java:191)
at org.apache.jmeter.threads.JMeterThread.process_sampler(JMeterThread.java:434)
at org.apache.jmeter.threads.JMeterThread.run(JMeterThread.java:261)
at java.lang.Thread.run(Thread.java:849)&#xa;Caused by: java.net.ConnectException: Cannot assign requested address
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:643)
at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:127)
at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)

发现客户端报错Cannot assign requested address，这其实是客户端自己的端口不够的原因，之前在另外一篇性能调优文章里提到过，在短连接高并发场景下，服务端需要调整tcp端口超时回收时间。没注意到在短连接高并发场景下的压测客户端上同样有类似问题。在短连接高并发场景下，client端频繁建立连接，而端口释放较慢，socket端口正常关闭后，端口不是立即释放的，而是处于TIME_WAIT状态，默认等待60s之后才释放，这就解释了上图中qps的振荡周期是2分钟的原因。

解决压测客户端端口耗尽的问题

在压测客户端上配置下述几个参数可解决这个问题：
1.调低端口释放后的等待时间，默认60s，修改为15s

sysctl -w net.ipv4.tcp_fin_timeout=15

2.修改tcp/ip协议配置，开启对于TCP时间戳的支持,若该项设置为0，则端口快速回收tcp_tw_recycle配置不起作用

sysctl -w net.ipv4.tcp_timestamps=1

3.修改tcp/ip协议配置，快速回收socket资源，默认为0，修改为1

sysctl -w net.ipv4.tcp_tw_recycle=1</blockquote>

做了上述参数修改之后，重新压测，在F53机器上，qps可以到9000了，cpu也上到百分之40多了。

cpu只到百分之40多，但是qps到了9000就上不去了，还得继续找原因，先看看服务端的线程都在干什么，分析过程比较类似，这里就不再赘述了，直接贴一段找到问题的jstack堆栈信息，然后说结论：

"http-bio-8080-exec-6124" #6763 daemon prio=5 os_prio=0 tid=0x00007f7c0023a000 nid=0xe924 waiting for monitor entry [0x00007f72f9a5c000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.catalina.loader.WebappClassLoader.findResources(WebappClassLoader.java:1367)
    - waiting to lock <0x000000074045d298> (a [Ljava.util.jar.JarFile;)
    at java.lang.ClassLoader.getResources(ClassLoader.java:1170)
    at java.util.ServiceLoader$LazyIterator.hasNextService(ServiceLoader.java:348)
    at java.util.ServiceLoader$LazyIterator.hasNext(ServiceLoader.java:393)
    at java.util.ServiceLoader$1.hasNext(ServiceLoader.java:474)
    at javax.xml.parsers.FactoryFinder$1.run(FactoryFinder.java:293)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.xml.parsers.FactoryFinder.findServiceProvider(FactoryFinder.java:289)
    at javax.xml.parsers.FactoryFinder.find(FactoryFinder.java:267)
    at javax.xml.parsers.SAXParserFactory.newInstance(SAXParserFactory.java:127)
    at org.dom4j.io.JAXPHelper.createXMLReader(JAXPHelper.java:46)
    at org.dom4j.io.SAXHelper.createXMLReaderViaJAXP(SAXHelper.java:125)
    at org.dom4j.io.SAXHelper.createXMLReader(SAXHelper.java:78)
    at org.dom4j.io.SAXReader.createXMLReader(SAXReader.java:894)
    at org.dom4j.io.SAXReader.getXMLReader(SAXReader.java:715)
    at org.dom4j.io.SAXReader.read(SAXReader.java:435)
    at org.dom4j.DocumentHelper.parseText(DocumentHelper.java:278)
    at com.autonavi.snowman.lse.datamodel.LocalLbsResultParser.parse(LocalLbsResultParser.java:55)
    at com.autonavi.snowman.placecard.service.server.SearchServer.getQueryResult(SearchServer.java:861)
    at com.autonavi.snowman.placecard.service.server.SearchServer.getQueryIdsResult(SearchServer.java:896)

根源在于parseText里面调用的一些方法，SAXParserFactory.newInstance方法里会先根据system property里配置的javax.xml.parsers.SAXParserFactory属性，去找SAXParserFactory的实现类，找不到，则如果找到了lib/jaxp.propertie，使用lib/jaxp.propertie中定义的实现类，如果还找不到，则从current thead’s contect class loader 和 system class loader中去找SAXParserFactory的实现类，如果再找不到，则最后使用默认值“com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl”来实例化一个parserFactory在从classpath中找资源的时候，会对资源加锁，就是这个锁影响了系统的并发量。

解决xml解析时找SAXParserFactory实现类时出现的并发锁问题

从上述代码分析中，可以看到系统首先会去找system property里面是否配置了实现类，那么我们的解法一就是在启动脚本里指定以下两个system properties:

-Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl
-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl

这个解决办法在没上docker化之前，需要PE来修改基线），另外一个对于开发同学比较可控的解法是在代码层面来解决这个问题：代码里面直接new SAXParserFactory的实现类，直接指定其实现类。将原来调用dom4j包里DoucumentHelper.parseText方法的地方都替换为调用下述DocumentUtils.parseText方法：

注，如果使用上述示例代码中的org.apache.xerces.jaxp.SAXParserFactoryImpl实现类，需要额外引入xerces包，也可不引入新的三方库，而是直接使用jdk自带的com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl实现类，我们做过对比测试，性能方面使用xerces包（2.11.0版本）性能会稍微好一点点。

优化结果

优化后的Placecard在F53机器上的qps达到11k，并且性能瓶颈到了依赖的下游引擎服务端，API端还能承受更高的qps。

压测客户端的注意事项

1.短连接高并发场景下，除了要注意服务端的tcp端口回收配置以外，还需要关注压测客户端的端口回收配置，否则压测客户端上可能会出现上述的端口耗尽的情况。
2.如果是压测java应用，需要先对服务端有个预热，让JIT生效后

使用xml序列化方式时的性能注意事项

如果应用使用xml的序列化方式，需要做如下两个优化工作以提升应用性能。
1. 如果使用官方的dom4j版本，在并发量大的情况下，会出现大量的QNameCache的并发锁，使得大量线程处于BLOCKED状态，服务的qps上不去。推荐使用1.6.1.2-snowman版本来解决这个问题，这个版本我们已经上传到集团的mvn库，可以直接引用了。
2. 默认的DocumentHelper.parseText方法，在高并发场景下找SAXParserFactory的实现时，会出现ClassLoader.getResources时的并发锁，导致大量线程BLOCKED，应用qps上不去。推荐使用上述我们重写的parseText方法来解决这个问题。

其实上文一直讲的都只是“术”，并没有讲“道”。而每次需要调优的时候遇到的问题可能都不一样，用同样的术是解决不了变化的问题的，我们需要的是“道”。“道”是相对不变化的。
光有“术”、例如懂JVM，不一定能解决上述问题，光有“道”，不懂术，也解决不了上述问题，道术兼修，才能在遇到问题时庖丁解牛。

浅聊性能调优过程中的“道”

爱因斯坦说过：“如果给我1个小时解答一道决定我生死的问题，我会花55分钟弄清楚这道题到底在问什么。一旦清楚它到底在问什么，剩下的5分钟足够回答这个问题。“，性能上不去，很多时候都是不知道问题出在哪里、从哪里下手，调优的首要过程就是搞清楚问题在哪里，然后再去研究解决方案。

像大多数调优文章一样，看起来很顺理成章，一步一步的往上调，其实作者可能是有很多插（岔）曲（路）没有写。在时间允许的情况下，可以大胆做各种假设，然后去验证，假设是这些地方的问题，然后去验证这些点是否真的存在问题，有些假设是正确的，的确是这些点有问题，有些假设点被证实为没有问题，这些假设就是岔路，但是不要紧，岔路是为得到正确路径，岔路也是调优的一部分。比如我们这次调优，当qps上不去的时候，发现服务器上in和out的网络流量加起来一直是100多兆，怀疑是因为服务器在千兆机房的网络环境，网络方面达到了上限，这个时候我们使用了request和response都更小的case去跑压测，得到的qps结果还是一样，基本就可以排除是网络瓶颈问题了。

可能在极少数情况下，调优时没有明确的线索，这个时候可以像探案一样去调优，链路上的所有的点都是可以怀疑的对象，可以采用排除法来逐一排除可疑的环节。例如本次压测时，压测客户端上报的connection refused错误，tomcat、nginx、网络层面都可能报这个错，我们将压测客户端绕过nginx，直连tomcat同样也有connection refused的错误，基本上就可以排除nginx问题了。

在没有解决问题之前，你要调优的对象有点类似于一个黑盒子，你可能解决了一个个问题，但是在没有完成之前，是不知道还有多少问题的，你解决了一个问题，往往觉得就ok了，没想到继续压测的时候还是不行，你已经做了很多努力了，其实可能不是你这个问题没解决，可能是这个问题解了，但是新的问题又冒出来了，借用淘宝DBA团队的一句名言：往往在你即将绝望放弃的时候，其实离最终的终点已经非常非常的接近。调优之道还需要一点坚持的精神。

add conditional breakpoint at
java.util.zip.ZipFile.ZipFile(File, int, Charset)

jzfile = open(name, mode, file.lastModified(), usemmap);

System.out.println(arg0);

return false

/.metadata/.plugins/org.eclipse.wst.server.core/tmp0/wtpwebapps/vue-metadata-services-admin/WEB-INF/lib/spring-boot-1.5.0.BUILD-SNAPSHOT.jar

Caused by: java.lang.IllegalArgumentException: java.util.zip.ZipException: invalid CEN header (bad signature)

at org.apache.catalina.webresources.JarResourceSet.initInternal(JarResourceSet.java:96)

at org.apache.catalina.util.LifecycleBase.init(LifecycleBase.java:102)

... 12 more

Caused by: java.util.zip.ZipException: invalid CEN header (bad signature)

at java.util.zip.ZipFile.open(Native Method)

at java.util.zip.ZipFile.<init>(ZipFile.java:220)

at java.util.zip.ZipFile.<init>(ZipFile.java:150)

at java.util.jar.JarFile.<init>(JarFile.java:166)

at java.util.jar.JarFile.<init>(JarFile.java:103)

at org.apache.catalina.webresources.JarResourceSet.initInternal(JarResourceSet.java:88)

... 13 more

http://stackoverflow.com/questions/7959499/force-re-download-of-release-dependency-using-maven

mvn clean install -U

-U means force update of dependencies.

You cannot make Maven re-download dependencies, but what you can do instead is to purge dependencies that were incorrectly downloaded using mvn dependency:purge-local-repository.

See: http://maven.apache.org/plugins/maven-dependency-plugin/purge-local-repository-mojo.html

http://stackoverflow.com/questions/22536279/java-util-zip-ziperror-invalid-cen-header-bad-signature

There are two possible explanations:

You have a corrupted ZIP file - Compile error in maven2: "invalid CEN header (bad signature)"
It is something to do with ZIP-64:
- java.util.zip.ZipException: invalid CEN header (bad signature)
- But also note that ZIP-64 support was apparently only added in 1.7.0_b55 ... https://blogs.oracle.com/xuemingshen/entry/zip64_support_for_4g_zipfile

Caused by: java.rmi.UnmarshalException: error unmarshalling arguments; nested exception is:
java.lang.ClassNotFoundException: org.quartz.core.QuartzScheduler_Stub (no security manager: RMI class loader disabled)
http://stackoverflow.com/questions/8987883/rmi-classnotfoundexception
RMI does not require a security manager to run at all. It requires a security manager if and only if you want to use the codebase feature

A codebase can be defined as a source, or a place, from which to load classes into a Java virtual machine
The codebase feature refers to the system property java.rmi.server.codebase. If you make the remote interface class and its closure available to the client in its CLASSPATH, and to the Registry, you don't have to use the codebase feature

http://docs.oracle.com/javase/7/docs/technotes/guides/rmi/javarmiproperties.html

http://docs.oracle.com/javase/7/docs/technotes/guides/rmi/enhancements-7.html
Running a system with the java.rmi.server.useCodebaseOnly property set to false is not recommended, as it may allow the loading and execution of untrusted code.
the RMI property java.rmi.server.useCodebaseOnly is set to true by default. In earlier releases, the default value was false.

http://stackoverflow.com/questions/3629568/rmi-server-exception

6.1 If you encounter a problem running your RMI server

The first problem you might encounter is the receipt of a ClassNotFoundException when attempting to bind or rebind a remote object to a name in the registry. This exception is usually due to a malformed codebase property, resulting in the registry not being able to locate the remote object's stubs or other classes needed by the stub.

It is important to note that the remote object's stub implements all the same interfaces as the remote object itself, so those interfaces, as well as any other custom classes declared as method parameters or return values, must also be available for download from the specified codebase.

Most frequently, this exception is thrown as a result of omitting the trailing slash from the URL value of the property. Other reasons would include: the value of the property is not a URL; the path to the classes specified in the URL is incorrect or misspelled; the stub class or any other necessary classes are not all available from the specified URL.

http://www.quartz-scheduler.org/documentation/quartz-2.x/configuration/ConfigRMI.html

Some users experience problems with class availability (namely Job classes) between the client and server. To work through these problems you'll need an understanding of RMI's "codebase" and RMI security managers. You may find these resources to be useful:

An excellent description of RMI and codebase: http://www.kedwards.com/jini/codebase.html . One of the important points is to realize that “codebase” is used by the client!

http://docs.oracle.com/javase/7/docs/technotes/guides/rmi/hello/hello-world.html

start java -classpath classDir -Djava.rmi.server.codebase=file:classDir/ example.hello.Server

http://archive.oreilly.com/pub/a/java/archive/quartz.html?page=4

-Djava.rmi.server.codebase
   file:/home/dejanb/quartz/lib/quartz.jar

https://discussions.apple.com/thread/1524974?tstart=0
http://tumblr.gudge.com/post/18186353550/ssh-could-not-resolve-hostname-342200223i

ssh: Could not resolve hostname \342\200\223i name or service unknown

A friend recently got this error on Ubuntu when trying to ssh to a server via ip. The command was like

ssh -i mykey.pem ubuntu@1.2.3.4

Worked fine for me on Windows, OSX etc but not Ubuntu. Turned out to be a copy/paste issue. The -i was converted to – (which is not the minus char) and was difficult to spot.

Monday, December 19, 2016

Troubleshooting Practices

简单验证

简单验证

可是不对啊

google一下和请教老司机

GC log

推断出是内存泄漏

定位root cause

问题是之前为什么没有这个问题呢？

这问题最快的定位方法是什么？

从问题入手

解决dom4j QNameCache的锁竞争问题

继续压测

解决压测客户端端口耗尽的问题

解决xml解析时找SAXParserFactory实现类时出现的并发锁问题

优化结果

压测客户端的注意事项

使用xml序列化方式时的性能注意事项

浅聊性能调优过程中的“道”

Labels

Popular Posts