Massive Technical Interviews Tips: Zookeeper Code

http://www.cnblogs.com/francisYoung/p/5225703.html
http://blog.xiaohansong.com/2016/08/22/zookeeper-watch-async/
When you first learn ZooKeeper, you will notice that the client offers two kinds of callbacks: Watcher and AsyncCallback.

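zooKeeper.getData(root, new Watcher() {
    @Override
    public void process(WatchedEvent event) {
        // Invoked when the server reports a change to the watched znode
    }
}, new AsyncCallback.DataCallback() {
    @Override
    public void processResult(int rc, String path, Object ctx, byte[] data, Stat stat) {
        // Invoked with the result of this asynchronous getData call
    }
}, null);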

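As you can see, getData can register two callbacks at once: a Watcher and an AsyncCallback. Both are callbacks, so what is the difference? To answer that, we need to look at what each interface is for.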

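Watcher: a Watcher listens for changes to znodes and to the session state. For example, if getData sets a watcher on data node a, then when the data of a changes, the client receives a NodeDataChanged notification and invokes the watcher callback.
AsyncCallback: an AsyncCallback handles the result when the ZooKeeper API is used asynchronously. For example, the synchronous version of getData is byte[] getData(String path, boolean watch, Stat stat), and the asynchronous version is void getData(String path, Watcher watcher, AsyncCallback.DataCallback cb, Object ctx). The former returns the result directly; the latter delivers it through the AsyncCallback.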

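ClientWatchManager holds four kinds of Watcher:

defaultWatcher: the Watcher passed in when the ZooKeeper connection is created; it listens for session state changes
dataWatches: holds the Watchers passed to getData
existWatches: holds the Watchers passed to exists; if the node already exists, the Watcher is added to dataWatches instead
childWatches: holds the Watchers passed to getChildren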

The code shows that watchers are stored in a HashMap: the key is the node path and the value is a Set<Watcher>.

private final Map<String, Set<Watcher>> dataWatches =
        new HashMap<String, Set<Watcher>>();

ClientWatchManager has only one method, materialize, which returns the Watchers of the matching kind for a given event type and path.

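Each call removes the node's Watchers from the HashMap, for example addTo(dataWatches.remove(clientPath), result);. This is why a Watcher is one-shot (the defaultWatcher is the exception). Note also that because the Watchers are stored in a HashSet, registering the same Watcher instance more than once still triggers it only once.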
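The remove-on-read behavior can be illustrated with a small stand-alone model. This is only a sketch: OneShotWatchTable and SimpleWatcher are hypothetical names invented for this example, not ZooKeeper classes, and the real materialize also takes the event type into account.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the client-side watch table; not the real ZooKeeper classes.
class OneShotWatchTable {
    /** Hypothetical stand-in for org.apache.zookeeper.Watcher. */
    interface SimpleWatcher {
        void process(String eventType, String path);
    }

    private final Map<String, Set<SimpleWatcher>> dataWatches = new HashMap<>();

    /** Register a watcher for a path; the Set drops duplicate instances. */
    void register(String path, SimpleWatcher watcher) {
        dataWatches.computeIfAbsent(path, p -> new HashSet<>()).add(watcher);
    }

    /** Return the watchers for a path and remove them, so each fires at most once. */
    Set<SimpleWatcher> materialize(String path) {
        Set<SimpleWatcher> result = new HashSet<>();
        Set<SimpleWatcher> registered = dataWatches.remove(path);
        if (registered != null) {
            result.addAll(registered);
        }
        return result;
    }

    public static void main(String[] args) {
        OneShotWatchTable table = new OneShotWatchTable();
        int[] calls = {0};
        SimpleWatcher w = (type, path) -> calls[0]++;
        table.register("/a", w);
        table.register("/a", w); // same instance: stored once
        // First event: the watcher is returned and removed from the table.
        for (SimpleWatcher x : table.materialize("/a")) x.process("NodeDataChanged", "/a");
        // Second event: nothing is registered any more, so nothing fires.
        for (SimpleWatcher x : table.materialize("/a")) x.process("NodeDataChanged", "/a");
        System.out.println("callbacks fired: " + calls[0]); // prints "callbacks fired: 1"
    }
}
```

Registering twice and firing twice still yields a single callback, which matches both properties described above: de-duplication by the Set and removal on materialize.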

SendThread: handles I/O, including sending requests, receiving responses, and sending pings.
EventThread: processes events and runs the callback functions.

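In the ZooKeeper client, Watcher and AsyncCallback are both asynchronous callbacks, but they fire at different times. A Watcher callback is triggered by an event sent from the server, while an AsyncCallback is triggered by the client itself once the response to its own request arrives. What they share is that, after the server's response is received, SendThread writes both into EventThread's waitingEvents queue, and EventThread then takes them off the queue and processes them one by one.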
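The hand-off between the two threads can be sketched with a plain blocking queue. This is a simplified model, not the ZooKeeper implementation: EventQueueSketch and its methods are hypothetical names, and each queued item is reduced to a Runnable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Toy model of the SendThread -> waitingEvents -> EventThread hand-off.
class EventQueueSketch {
    private static final Runnable STOP = () -> { }; // sentinel that ends the loop
    private final BlockingQueue<Runnable> waitingEvents = new LinkedBlockingQueue<>();
    private final Thread eventThread = new Thread(this::drain, "event-thread");

    void start() {
        eventThread.start();
    }

    /** Called by the I/O side for both watch events and async replies. */
    void queue(Runnable callback) {
        waitingEvents.add(callback);
    }

    void stopAndJoin() throws InterruptedException {
        waitingEvents.add(STOP);
        eventThread.join();
    }

    /** Single consumer: callbacks run one at a time, in queue order. */
    private void drain() {
        try {
            while (true) {
                Runnable next = waitingEvents.take();
                if (next == STOP) return;
                next.run();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Queues a watch event and then an async reply; returns the dispatch order. */
    static List<String> demo() {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        EventQueueSketch q = new EventQueueSketch();
        q.start();
        q.queue(() -> log.add("watch: NodeDataChanged /a"));
        q.queue(() -> log.add("async reply: getData /a"));
        try {
            q.stopAndJoin();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because a single thread drains a FIFO queue, the two kinds of callback are dispatched in the order they were queued, and a slow callback delays everything behind it.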

https://zookeeper.apache.org/doc/r3.1.2/zookeeperProgrammers.html

What ZooKeeper Guarantees about Watches

Watches are ordered with respect to other events, other watches, and asynchronous replies. The ZooKeeper client libraries ensure that everything is dispatched in order.

A client will see a watch event for a znode it is watching before seeing the new data that corresponds to that znode.

The order of watch events from ZooKeeper corresponds to the order of the updates as seen by the ZooKeeper service.

Things to Remember about Watches

Watches are one time triggers; if you get a watch event and you want to get notified of future changes, you must set another watch.

Because watches are one time triggers and there is latency between getting the event and sending a new request to get a watch you cannot reliably see every change that happens to a node in ZooKeeper. Be prepared to handle the case where the znode changes multiple times between getting the event and setting the watch again. (You may not care, but at least realize it may happen.)
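A toy in-memory model makes the missed-update window concrete. This is a sketch only: MissedUpdateSketch is a hypothetical class that imitates a single znode with a one-shot data watch, and it does not use the ZooKeeper client API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy in-memory znode with a one-shot data watch; not the ZooKeeper API.
class MissedUpdateSketch {
    private String data = "v0";
    private Consumer<String> watch; // at most one pending watch, cleared when it fires

    /** Read the data and leave a one-shot watch, similar in spirit to getData with a watcher. */
    String getData(Consumer<String> newWatch) {
        watch = newWatch;
        return data;
    }

    void setData(String value) {
        data = value;
        Consumer<String> fired = watch;
        watch = null; // one-time trigger
        if (fired != null) fired.accept("NodeDataChanged");
    }

    /** Returns the values the client actually observed across three updates. */
    static List<String> demo() {
        MissedUpdateSketch node = new MissedUpdateSketch();
        List<String> seen = new ArrayList<>();
        List<String> events = new ArrayList<>();
        seen.add(node.getData(events::add)); // sees v0 and sets the watch
        node.setData("v1");                  // fires the watch
        node.setData("v2");                  // watch not re-set yet: no event
        node.setData("v3");                  // watch not re-set yet: no event
        // The client now handles the single event: re-read and re-arm the watch.
        seen.add(node.getData(events::add)); // sees v3; v1 and v2 were never observed
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [v0, v3]
    }
}
```

The client receives one event for three updates and only ever reads v0 and v3, which is why code built on watches should treat an event as "something changed, re-read the current state" rather than as a change log.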

A watch object, or function/context pair, will only be triggered once for a given notification. For example, if the same watch object is registered for an exists and a getData call for the same file and that file is then deleted, the watch object would only be invoked once with the deletion notification for the file.

When you disconnect from a server (for example, when the server fails), you will not get any watches until the connection is reestablished. For this reason session events are sent to all outstanding watch handlers. Use session events to go into a safe mode: you will not be receiving events while disconnected, so your process should act conservatively in that mode.

http://sel-fish.net/2016/11/29/zk-len-err/

1. Find the source code that matches the error message


ack 'Len error'

src/java/main/org/apache/zookeeper/server/NettyServerCnxn.java
392:                            throw new IOException("Len error " + len);

src/java/main/org/apache/zookeeper/server/NIOServerCnxn.java
540:            throw new IOException("Len error " + len);

The information we want is in NIOServerCnxn.java:


vim src/java/main/org/apache/zookeeper/server/NIOServerCnxn.java +540



539         if (len < 0 || len > BinaryInputArchive.maxBuffer) {
540             throw new IOException("Len error " + len);
541         }

Next, dig into maxBuffer:


ack maxBuffer

src/java/main/org/apache/jute/BinaryInputArchive.java
87:    static public final int maxBuffer = Integer.getInteger("jute.maxbuffer", 0xfffff);
122:    // Since this is a rough sanity check, add some padding to maxBuffer to
126:        if (len < 0 || len > maxBuffer + 1024) {

src/java/main/org/apache/zookeeper/ClientCnxn.java
107:     * jute.maxBuffer value. To avoid this we instead split the watch
110:     * with respect to the server's 1MB default for jute.maxBuffer.

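As you can see, maxBuffer is tied to the jute.maxbuffer Java system property (Integer.getInteger reads a system property, not an environment variable), and its default is 1048575 (0xfffff). This confirms that the packet sent by the client was too large. But why was it so large? If you are lucky, you will notice the second file matched by ack maxBuffer, which contains: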
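The arithmetic and the check can be reproduced in isolation. The sketch below copies the two quoted expressions into a hypothetical MaxBufferSketch class; the rejected helper is an addition for illustration and is not part of ZooKeeper.

```java
import java.io.IOException;

// Stand-alone reproduction of the quoted length check; not ZooKeeper code.
class MaxBufferSketch {
    /** Same lookup as the quoted line: the system property if set, else 0xfffff. */
    static int maxBuffer() {
        return Integer.getInteger("jute.maxbuffer", 0xfffff);
    }

    /** Mirrors the quoted server-side check on the length prefix. */
    static void checkLength(int len, int maxBuffer) throws IOException {
        if (len < 0 || len > maxBuffer) {
            throw new IOException("Len error " + len);
        }
    }

    /** Illustrative helper: reports whether a given length would be rejected. */
    static boolean rejected(int len, int maxBuffer) {
        try {
            checkLength(len, maxBuffer);
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(0xfffff);                            // prints 1048575
        System.out.println(rejected(128 * 1024, 0xfffff));      // prints false
        System.out.println(rejected(2 * 1024 * 1024, 0xfffff)); // prints true
    }
}
```

Since the value comes from Integer.getInteger, it is changed with a JVM flag of the form -Djute.maxbuffer=N rather than by exporting a shell variable.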


104     /* ZOOKEEPER-706: If a session has a large number of watches set then
105      * attempting to re-establish those watches after a connection loss may
106      * fail due to the SetWatches request exceeding the server's configured
107      * jute.maxBuffer value. To avoid this we instead split the watch
108      * re-establishement across multiple SetWatches calls. This constant
109      * controls the size of each call. It is set to 128kB to be conservative
110      * with respect to the server's 1MB default for jute.maxBuffer.
111      */
112     private static final int SET_WATCHES_MAX_LENGTH = 128 * 1024;

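At this point, if you are lucky enough to go and read the description of ZOOKEEPER-706 in Jira, you will find that the error matches exactly. What remains is to reproduce the problem and confirm which version contains the fix.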
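The idea behind the fix, splitting one large request into several size-bounded ones, can be sketched as follows. This is not the code from ClientCnxn: WatchBatchSketch and split are hypothetical names, and the size of a batch is approximated by the summed length of its path strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative batching by cumulative path length; not the ZooKeeper implementation.
class WatchBatchSketch {
    /** Split paths into batches whose summed path lengths stay within maxLength. */
    static List<List<String>> split(List<String> paths, int maxLength) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int currentLength = 0;
        for (String path : paths) {
            // Start a new batch when adding this path would exceed the budget.
            if (!current.isEmpty() && currentLength + path.length() > maxLength) {
                batches.add(current);
                current = new ArrayList<>();
                currentLength = 0;
            }
            current.add(path);
            currentLength += path.length();
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("/aaaa", "/bbbb", "/cccc");
        // With a budget of 10 characters, the first two paths share a batch.
        System.out.println(split(paths, 10)); // prints [[/aaaa, /bbbb], [/cccc]]
    }
}
```

In this sketch a single path longer than the budget still gets a batch of its own, so the bound is a target per request rather than a hard guarantee.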

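In hindsight, many troubleshooting efforts take detours of one kind or another. What matters is to use tools as much as possible to get yourself off the detour quickly.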

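1. Suspecting the business team's code

The server is a shared cluster that serves several businesses. After the restart, only one of them reported errors and the others were fine, so we suspected a problem in that team's code. In hindsight, the cause was that the failing client watched too many paths; it was related to the data, not the code. After a long review, of course, the business code turned out to be fine.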

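2. Suspecting the version used by the business team

Following the same line of thought: since the code was fine, we suspected a version problem and latched onto a rather far-fetched point. Curator 2.8.0 declares a dependency on ZooKeeper 3.4.5, while the business team used 3.4.6. We tried to reproduce the problem along this line. Unsurprisingly, it could not be reproduced, and we were stuck.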

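3. Analyzing the problem with a tcpdump capture

Because the server was later restarted again, the problem reappeared in production, and a few staging business machines that had not been restarted preserved the failure scene for us. We captured and analyzed the traffic with tcpdump.

Judging from the capture, the request packet sent on line 41 has type 101 (in the ZooKeeper protocol, every request packet except ConnectRequest carries a type in its header that identifies the kind of request). According to the code in src/java/main/org/apache/zookeeper/ZooDefs.java, this is a setWatches request.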
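Reading the type out of a captured request can be sketched with ByteBuffer. The frame layout used here (a 4-byte big-endian length prefix, then a header of int xid followed by int type) is an assumption for this example, and RequestHeaderSketch, requestType, and sampleFrame are hypothetical names; only the value 101 comes from the text above.

```java
import java.nio.ByteBuffer;

// Illustrative header parsing under an assumed frame layout; not ZooKeeper code.
class RequestHeaderSketch {
    static final int SET_WATCHES = 101; // the type value the text reads from ZooDefs.java

    /** Reads the type field from a frame assumed to be: int length, int xid, int type, payload. */
    static int requestType(byte[] frame) {
        ByteBuffer buf = ByteBuffer.wrap(frame); // ByteBuffer is big-endian by default
        buf.getInt();        // skip the length prefix
        buf.getInt();        // skip the xid
        return buf.getInt(); // the request type
    }

    /** Builds a header-only sample frame for the demo; real frames also carry a body. */
    static byte[] sampleFrame(int xid, int type) {
        ByteBuffer buf = ByteBuffer.allocate(12);
        buf.putInt(8).putInt(xid).putInt(type); // 8 = bytes following the length prefix
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] frame = sampleFrame(1, SET_WATCHES);
        System.out.println(requestType(frame) == SET_WATCHES); // prints true
    }
}
```

With a real capture, the same three reads would be applied to the TCP payload bytes of the request, provided the assumed layout holds for the ZooKeeper version in use.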

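At this point things started to make sense: the error was probably caused by an oversized setWatches request, so we switched to the "correct approach" described above.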

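Q: Why does the problem occur on reconnect but not on restart?
A: The restart logic and the reconnect logic are different. On restart, the client first calls getChildren2 to fetch the children of the path and then sets the watches one by one. On reconnect, it re-sets all of the previous watches in one batch.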

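Read the code first whenever you can. Googling first is also a good habit, but it is easy to get bogged down searching for similar problems, especially when the keywords are not well chosen or organized.
When you maintain an open-source product, keep an eye on upstream updates.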

Thursday, January 11, 2018
