Sunday, October 25, 2015

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure + Zipkin



A survey of distributed tracing systems
The core of call-chain monitoring is to stitch a distributed request together with a single global ID: inside a JVM the ID is passed via a ThreadLocal, and across JVM boundaries it is carried by middleware (an HTTP header, the RPC framework), so the whole distributed call can be chained together.
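The propagation idea above can be sketched in Java; all names here (TraceContext, the X-Trace-Id header) are illustrative, not from any particular framework:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Minimal sketch of trace-ID propagation: a ThreadLocal carries the ID
// inside one JVM; an outgoing header map carries it across JVMs.
public class TraceContext {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Reuse an incoming ID if present, otherwise start a new trace.
    public static void start(String incomingId) {
        TRACE_ID.set(incomingId != null ? incomingId : UUID.randomUUID().toString());
    }

    public static String current() {
        return TRACE_ID.get();
    }

    // Copy the ID into the headers of an outgoing RPC/HTTP call.
    public static Map<String, String> inject(Map<String, String> headers) {
        headers.put("X-Trace-Id", current());
        return headers;
    }

    public static void clear() {
        TRACE_ID.remove();
    }

    public static void main(String[] args) {
        TraceContext.start(null); // entry point: new trace
        String id = TraceContext.current();
        Map<String, String> headers = TraceContext.inject(new HashMap<>());
        System.out.println(headers.get("X-Trace-Id").equals(id)); // prints true
    }
}
```

The receiving side reads the header and calls start(incomingId), which is exactly how the chain continues across processes.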

What does a single order request go through? → the call stack
Average QPS, peak QPS and fluctuation of order queries → QPS monitoring
Why is this request slow, and which hop went wrong? → monitoring contributing factors
Database request volume suddenly spikes; how do we trace the source? → call-chain analysis
What does this operation depend on, the database or the message queue? If some Redis instance dies, which businesses are affected? → dependency analysis

The pipeline covers in-application instrumentation, log collection, online and offline analysis, and storage and presentation of the results.
Low intrusiveness: as a non-business component it should intrude on business systems as little as possible (ideally not at all), stay transparent to its users, and minimize the burden on developers.
Flexible policy: the scope and granularity of the collected data can be decided (ideally at any time).
Timeliness: from data generation and collection, through computation and processing, to final presentation, everything should be as fast as possible.
Decision support: whether the data can feed decision making, especially from a DevOps perspective.
Visualization is king.
Without a unified wrapper around the RPC framework, instrumentation logging may have to be hacked into every single business project.

Middleware instrumentation and bytecode-enhancement instrumentation (in the style of BTrace) share the same ultimate goal: being transparent to the application.
The Dapper paper discusses sampling: when data volume is large, full tracing creates enormous pressure, so sampling was born. Taobao uses hash-based sampling,
JD samples a fixed number of traces per fixed time window, and Dapper uses adaptive sampling; while traffic is small you can start with full tracing.
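Taobao's hash sampling mentioned above can be sketched like this; the class name and percentage-based rate are illustrative. The key property is that the decision depends only on the trace ID, so every span of one trace is kept or dropped together, with no coordination between services:

```java
// Sketch of trace-ID hash sampling: deterministic per trace ID, so all
// spans in one trace share the same keep/drop decision.
public class HashSampler {
    private final int ratePercent; // e.g. 10 keeps roughly 10% of traces

    public HashSampler(int ratePercent) {
        this.ratePercent = ratePercent;
    }

    public boolean isSampled(long traceId) {
        // Math.floorMod keeps the result in [0, 99] even for negative IDs.
        return Math.floorMod(Long.hashCode(traceId), 100) < ratePercent;
    }
}
```

With ratePercent = 100 this degenerates into the "full tracing while traffic is small" mode above.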
There is also ostrich for defining monitoring metrics (count, gauge, histogram, timer), e.g. the number of online users or the number of threads in the system.
It must not become a performance burden: something whose value is unproven but which hurts performance is very hard to roll out.
Because logs have to be written, the higher the business QPS, the heavier the performance impact; this is solved with sampling and asynchronous logging.

type Span struct {
    TraceID    int64
    Name       string
    ID         int64
    ParentID   int64
    Annotation []Annotation
    Debug      bool
}

A Span represents one particular method call; it has a name and an ID, and consists of a series of annotations.

type Annotation struct {
    Timestamp int64
    Value     string
    Host      Endpoint
    Duration  int32
}

A Trace is the series of Spans associated with the same request.
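In Zipkin's classic model an RPC span carries four annotations, "cs" (client send), "sr" (server receive), "ss" (server send) and "cr" (client receive), and the sr/ss pair also appears in the BraveServletFilter setup later in these notes. A sketch of deriving latencies from them (class name illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: derive durations from Zipkin's classic cs/sr/ss/cr annotations.
public class SpanTiming {
    private final Map<String, Long> annotations = new HashMap<>();

    public void annotate(String value, long timestampMicros) {
        annotations.put(value, timestampMicros);
    }

    // Total time the client waited for the call.
    public long clientDuration() {
        return annotations.get("cr") - annotations.get("cs");
    }

    // Time spent inside the server.
    public long serverDuration() {
        return annotations.get("ss") - annotations.get("sr");
    }

    // Rough network + queueing overhead: client view minus server view.
    public long networkOverhead() {
        return clientDuration() - serverDuration();
    }
}
```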
Each machine runs a daemon for log collection; business processes send their traces to the daemon.
The daemon forwards the collected traces to the next tier.
Multiple tiers of collectors, similar to a pub/sub architecture, allow load balancing.
The aggregated data is analyzed in real time and stored for offline processing.
Offline analysis requires gathering the logs of the same call chain together.

Analysis

Call-chain tracing: collect the Spans sharing one TraceID and sort them by time to get the timeline; chain them by ParentID to get the call stack.
On an exception or timeout, print the TraceID in the log; then query the call chain by TraceID to locate the problem.
Dependency metrics:
Strong dependency: a failed call directly breaks the main flow
High dependency: a single chain is very likely to call this dependency
Frequent dependency: a single chain calls the same dependency many times
Offline analysis groups logs by TraceID and reconstructs the call relationships from each Span's ID and ParentID to analyze the shape of the chain.
Real-time analysis works on individual log entries directly, without grouping or reassembly, yielding current QPS and latency.
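The offline reconstruction described above (group by TraceID, then attach each Span to its ParentID) can be sketched as follows; the field names mirror the Span struct earlier in these notes, with 0 standing in for "no parent":

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: rebuild the call tree of one trace from (ID, ParentID) pairs.
public class CallTree {
    static class Span {
        final long id;
        final long parentId;   // 0 = root in this sketch
        final String name;
        final List<Span> children = new ArrayList<>();
        Span(long id, long parentId, String name) {
            this.id = id; this.parentId = parentId; this.name = name;
        }
    }

    // Index spans by ID, then attach each span to its parent.
    public static Span build(List<Span> spans) {
        Map<Long, Span> byId = new HashMap<>();
        for (Span s : spans) byId.put(s.id, s);
        Span root = null;
        for (Span s : spans) {
            Span parent = byId.get(s.parentId);
            if (parent != null) parent.children.add(s);
            else root = s; // no parent within this trace => root
        }
        return root;
    }
}
```

Sorting each span's children by timestamp would then yield the timeline view.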

Data sources

Request call chains
System and business metrics (CPU, IO, memory, disk, http, servlet, db, cache, jvm, …)
Exception stack traces
GC logs

Storage

Taobao implemented a logging component along these lines on top of the Disruptor.
eBay and Dianping both use LinkedBlockingQueue from the Java concurrency library as the buffer; the advantage of a queue is obvious: reads and writes are simple and performance is high.
Build an in-memory queue on the Disruptor. This is only the current plan; if the queue cannot keep up, Log4j 2 is the fallback.
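A sketch of that queue-as-buffer idea; ArrayBlockingQueue stands in for the LinkedBlockingQueue/Disruptor mentioned above, and dropping on overflow (rather than blocking the business thread) is an assumed policy, not something the sources specify:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: async span reporting through a bounded in-memory queue.
// The business thread only does a non-blocking offer; a background
// thread drains batches and ships them to the collector.
public class AsyncReporter {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
    public volatile long dropped = 0;

    // Hot path: never blocks, drops the span when the queue is full.
    public boolean report(String spanJson) {
        boolean ok = queue.offer(spanJson);
        if (!ok) dropped++;
        return ok;
    }

    // Background thread: collect up to max spans for one flush.
    public List<String> drainBatch(int max) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, max);
        return batch;
    }
}
```

This is also why the performance impact stays bounded at high QPS: the business thread pays only for an offer, and loss under overload shows up as a counter instead of latency.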
Flume as the distributed log collection framework
Storm as the distributed stream processing framework
SEDA, the staged event-driven architecture, with the Disruptor exchanging data between threads; the whole pipeline is abstracted into stages: validation, analysis, alerting, storage
Storage: all data is written in full to HBase
Twitter uses Scribe to ship all tracing data to the Zipkin backend and the Hadoop file system
Data is retained for two weeks

Visualization

Data only has value if it can be read.
Zipkin is a complete end-to-end solution.

BraveServletFilter
brave/brave-web-servlet-filter

http://zipkin.io/pages/instrumenting.html

http://zipkin.io/pages/quickstart
wget -O zipkin.jar 'https://search.maven.org/remote_content?g=io.zipkin.java&a=zipkin-server&v=LATEST&c=exec'
java -jar zipkin.jar 
http://zipkin.io/pages/existing_instrumentations.html
Tracing information is collected on each host using the instrumented libraries and sent to Zipkin. When the host makes a request to another application, it passes a few tracing identifiers along with the request to Zipkin so we can later tie the data together into spans.

https://github.com/openzipkin/zipkin/blob/master/zipkin-server/README.md
Once you've started, browse to http://localhost:9411 to find traces!
  • / - UI
  • /config.json - Configuration for the UI
  • /api/v1 - Api
  • /health - Returns 200 status if OK
  • /info - Provides the version of the running instance
  • /metrics - Includes collector metrics broken down by transport type
There are more built-in endpoints provided by Spring Boot, such as /metrics. To comprehensively list endpoints, GET /mappings.
https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-endpoints.html
Zipkin supports 64 and 128-bit trace identifiers, typically serialized as 16 or 32 character hex strings. By default, spans reported to zipkin with the same trace ID will be considered in the same trace.
For example, 463ac35c9f6413ad48485a3953bb6124 is a 128-bit trace ID, while 48485a3953bb6124 is a 64-bit one.
Note: Span (or parent) IDs within a trace are 64-bit regardless of the length or value of their trace ID.
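The encoding described above can be checked directly with the JDK: a 64-bit ID serializes to 16 lowercase hex characters, and a 128-bit ID is the high 64 bits followed by the low 64 bits. A sketch (class name illustrative):

```java
// Sketch: Zipkin-style trace-ID serialization. A 64-bit ID is 16 hex
// chars; a 128-bit ID is the high word followed by the low word.
public class TraceIds {
    public static String encode64(long id) {
        // %016x zero-pads to exactly 16 lowercase hex characters.
        return String.format("%016x", id);
    }

    public static String encode128(long high, long low) {
        return encode64(high) + encode64(low);
    }

    public static long decode64(String hex) {
        // Unsigned parse, since the top bit of the ID may be set.
        return Long.parseUnsignedLong(hex, 16);
    }
}
```

Running this on the document's example IDs reproduces 48485a3953bb6124 and 463ac35c9f6413ad48485a3953bb6124.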

https://github.com/openzipkin/brave/blob/master/brave/README.md
https://zhuanlan.zhihu.com/p/20941369
Brave.Builder builder = new Brave.Builder("serviceName");
builder.spanCollector(HttpSpanCollector.create(
    "http://localhost:9411",
    new EmptySpanCollectorMetricsHandler()));
Brave brave = builder.build();


<bean id="braveFilter"
  class="com.github.kristofa.brave.servlet.BraveServletFilter">
  <constructor-arg
    value="#{brave.serverRequestInterceptor()}" />
  <constructor-arg
    value="#{brave.serverResponseInterceptor()}" />
  <constructor-arg>
    <bean
      class="com.github.kristofa.brave.http.DefaultSpanNameProvider" />
  </constructor-arg>
</bean>
The last class, com.github.kristofa.brave.http.DefaultSpanNameProvider, lives in the brave-http module, so when using Maven or Gradle that module needs to be declared as a dependency as well.

http://ryanjbaxter.com/cloud/spring%20cloud/spring/2016/07/07/spring-cloud-sleuth.html
https://spring.io/blog/2016/02/15/distributed-tracing-with-spring-cloud-sleuth-and-spring-cloud-zipkin
Spring Cloud Sleuth sets up useful log formatting for you that prints the trace ID and the span ID. Assuming you’re running Spring Cloud Sleuth-enabled code in a microservice whose spring.application.name is my-service-id
http://cloud.spring.io/spring-cloud-sleuth/
You will see traceId and spanId populated in the logs. If this app calls out to another one (e.g. with RestTemplate) it will send the trace data in headers and if the receiver is another Sleuth app you will see the trace continue there.
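The "trace data in headers" that Sleuth sends follows Zipkin's B3 convention: X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId and X-B3-Sampled. A sketch of inject/extract, with a plain map standing in for HTTP headers (class name illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of Zipkin B3 header propagation: the caller injects its trace
// context into outgoing headers; the callee extracts it and continues
// the trace with the caller's span as parent.
public class B3 {
    public final String traceId;
    public final String spanId;
    public final String parentSpanId; // null at the root of a trace

    public B3(String traceId, String spanId, String parentSpanId) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
    }

    public void inject(Map<String, String> headers) {
        headers.put("X-B3-TraceId", traceId);
        headers.put("X-B3-SpanId", spanId);
        if (parentSpanId != null) headers.put("X-B3-ParentSpanId", parentSpanId);
        headers.put("X-B3-Sampled", "1");
    }

    // Receiver side: same trace ID, fresh span ID, caller's span as parent.
    public static B3 extractAsChild(Map<String, String> headers, String newSpanId) {
        return new B3(headers.get("X-B3-TraceId"), newSpanId,
                      headers.get("X-B3-SpanId"));
    }
}
```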
http://www.cnblogs.com/java-zhao/p/5838819.html
    @Bean
    public SpanCollector spanCollector() {
        HttpSpanCollector.Config spanConfig = HttpSpanCollector.Config.builder()
                                              .compressionEnabled(false) // default false; whether spans are gzipped before transport
                                              .connectTimeout(5000)      // 5s, default 10s
                                              .flushInterval(1)          // 1s
                                              .readTimeout(6000)         // 6s, default 60s
                                              .build();
        return HttpSpanCollector.create("http://ip:9411",
                                        spanConfig,
                                        new EmptySpanCollectorMetricsHandler());
    }

    @Bean
    public Brave brave(SpanCollector spanCollector) {
        Brave.Builder builder = new Brave.Builder("service1"); // the serviceName
        builder.spanCollector(spanCollector);
        builder.traceSampler(Sampler.create(1)); // sampling rate
        return builder.build();
    }

http://blog.csdn.net/liaokailin/article/details/52077620
    @Bean
    public BraveServletFilter braveServletFilter(Brave brave) {
        // Install the sr (server receive) and ss (server send) interceptors.
        return new BraveServletFilter(brave.serverRequestInterceptor(),
                                      brave.serverResponseInterceptor(),
                                      new DefaultSpanNameProvider());
    }

PWLSF#1 Ryan Kennedy and Anjali Shenoy on Dapper
Zipkin at Twitter

http://liudanking.com/arch/micro-service-troubleshooting-tool-distributed-tracing
