Friday, October 16, 2015

Metrics & Monitoring



7 Java Performance Metrics to Watch After a Major Release
1. Response times and throughput
2. Load Average
Load Average is a metric that is traditionally split into three values, showing the average load over the last 1, 5 and 15 minutes (left to right).
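A minimal sketch of reading those three values programmatically, assuming a Unix-like host where the standard-library call `os.getloadavg()` is available. The function name `load_averages` and the dictionary keys are illustrative choices, not something the original article defines.

```python
import os


def load_averages():
    """Return the 1-, 5- and 15-minute load averages as a dict.

    os.getloadavg() is Unix-only and raises OSError if the load
    average cannot be obtained.
    """
    one, five, fifteen = os.getloadavg()
    return {"1m": one, "5m": five, "15m": fifteen}


if __name__ == "__main__":
    # Same left-to-right order that tools such as `uptime` print.
    print(load_averages())
```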
3. Error Rates (and how to solve them)
4. GC rate and pause duration
5. Business Metrics
6. Uptime and service health
7. Log size
Fast and furious debugging
“The primary consumer for a log file nowadays for the most part is a machine who will essentially take that logging stream, that stream of logging messages that are being put into the file and begin to glean meaning from that; visualize trend information, anomalies.”

“There’s been a lot of movement within a lot of companies moving towards metric-driven development. So now we can automatically pull metrics instead of putting a lot of information to the log file with the hope that somebody can process that later and then visualize it for us.”

https://www.digitalocean.com/community/tutorials/an-introduction-to-tracking-statistics-with-graphite-statsd-and-collectd

http://www.zenlife.tk/instrument.md

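Online services

An online service is a system that is expected to respond immediately. Most database and HTTP requests, for example, fall into this category.
The key metrics for such a system are the number of requests performed, the number of errors, and latency. The number of requests currently in progress is also worth recording.
When counting requests, record both the start and the end of each one. At the end, you can also record whether it failed and how long it took.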
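The counting rule above can be sketched with a small in-memory helper. `RequestMetrics` and its field names are hypothetical, chosen only for illustration. The sketch is not thread-safe and keeps everything in process memory, where a real service would normally hand these values to a metrics library.

```python
import time
from contextlib import contextmanager


class RequestMetrics:
    """In-memory counters for one endpoint (illustrative, not thread-safe)."""

    def __init__(self):
        self.started = 0        # requests that have begun
        self.finished = 0       # requests that have ended, successfully or not
        self.errors = 0         # requests that ended by raising an exception
        self.latency_sum = 0.0  # total seconds spent in finished requests

    @property
    def in_progress(self):
        # Requests that have started but not yet finished.
        return self.started - self.finished

    @contextmanager
    def track(self):
        """Wrap one request: count its start, then its end, error and latency."""
        self.started += 1
        begin = time.perf_counter()
        try:
            yield
        except Exception:
            self.errors += 1
            raise
        finally:
            self.latency_sum += time.perf_counter() - begin
            self.finished += 1
```

Wrapping each handler body in `with metrics.track():` gives the request count, error count, total latency and in-progress gauge from a single place.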

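Libraries should provide basic instrumentation without requiring any extra configuration from the user.
If a library is used to access something outside the process (for example the network, disk, or inter-process communication), it should at least record the total number of requests, the number of errors, and latency.
Depending on whether the library is lightweight or heavy, it may also be worth recording internal errors and latency within the library itself.
Because a library may be used against several different resources, take care to distinguish them with appropriate labels. For example, a database connection pool should distinguish between the databases it connects to.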
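As a rough illustration of distinguishing resources by label, the sketch below keeps one counter value per label combination. `LabeledCounter`, the metric name `db_pool_requests_total` and the `database` label are all assumptions made for this example. They are not the API of any particular metrics library.

```python
from collections import Counter


class LabeledCounter:
    """A counter that keeps a separate value per label combination."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = Counter()

    def _key(self, labels):
        # Reject calls that omit or invent label names.
        if set(labels) != set(self.label_names):
            raise ValueError(
                f"expected labels {self.label_names}, got {tuple(labels)}"
            )
        return tuple(labels[name] for name in self.label_names)

    def inc(self, amount=1, **labels):
        self._values[self._key(labels)] += amount

    def value(self, **labels):
        return self._values[self._key(labels)]


# Hypothetical usage: one connection-pool library, two databases.
pool_requests = LabeledCounter("db_pool_requests_total", ["database"])
pool_requests.inc(database="orders")
pool_requests.inc(database="users")
```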

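Logging

As a general rule, every kind of log message should have a counter that is incremented each time a line is logged. That way you know how often each kind of message occurs and how many times it has happened.
It is also useful to count the info/error/warning lines the application logs. You can then check those counts and compare them across recent releases to see whether anything changed significantly.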
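One way to get per-level counts without touching every logging call is a counting handler, sketched here with Python's standard `logging` module. The class name `CountingHandler` and the logger name `app.metrics_demo` are placeholders chosen for this example.

```python
import logging
from collections import Counter


class CountingHandler(logging.Handler):
    """Logging handler that counts emitted records per level name."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def emit(self, record):
        # Called once for every record that passes the logger's level check.
        self.counts[record.levelname] += 1


logger = logging.getLogger("app.metrics_demo")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo quiet; drop this in a real app
counter = CountingHandler()
logger.addHandler(counter)
```

Comparing `counter.counts` before and after a release is then a matter of exporting the three numbers alongside the other metrics.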

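Failures

Failures are handled much like logging: every time an operation fails, a counter should be incremented.
When reporting failures, also keep a metric for the total number of attempts. This makes the failure rate easy to calculate.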
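The attempts-plus-failures pairing can be sketched as follows. `FailureTracker` is a hypothetical helper, and returning 0.0 when nothing has been attempted yet is a choice made for this example.

```python
class FailureTracker:
    """Counts attempts and failures so the failure ratio can be derived."""

    def __init__(self):
        self.attempts = 0
        self.failures = 0

    def run(self, operation):
        """Call operation(), counting the attempt and any failure."""
        self.attempts += 1
        try:
            return operation()
        except Exception:
            self.failures += 1
            raise

    def failure_ratio(self):
        # failures / attempts, or 0.0 before the first attempt.
        return self.failures / self.attempts if self.attempts else 0.0
```

Keeping both numbers as plain counters, rather than storing the ratio itself, lets the ratio be computed over any time window later.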

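Thread pools

For any kind of thread pool, the key metrics are the number of queued requests, the number of threads in use, the total number of threads, the number of jobs processed, and how long they took. It is also useful to record how long jobs waited in the queue.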
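A minimal sketch of those pool metrics, wrapping the standard-library `ThreadPoolExecutor`. `InstrumentedPool` and its attribute names are assumptions made for illustration, and `total_threads` simply reports the configured maximum rather than the number of threads actually spawned.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class InstrumentedPool:
    """ThreadPoolExecutor wrapper that tracks basic pool metrics."""

    def __init__(self, max_workers):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self.total_threads = max_workers  # configured pool size
        self.queued = 0                   # submitted but not yet started
        self.running = 0                  # jobs currently executing
        self.completed = 0                # jobs finished, failed or not
        self.run_seconds = 0.0            # total execution time
        self.wait_seconds = 0.0           # total time spent queued

    def submit(self, fn, *args, **kwargs):
        enqueued = time.perf_counter()
        with self._lock:
            self.queued += 1

        def task():
            started = time.perf_counter()
            with self._lock:
                self.queued -= 1
                self.running += 1
                self.wait_seconds += started - enqueued
            try:
                return fn(*args, **kwargs)
            finally:
                with self._lock:
                    self.running -= 1
                    self.completed += 1
                    self.run_seconds += time.perf_counter() - started

        return self._pool.submit(task)

    def shutdown(self):
        self._pool.shutdown(wait=True)
```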

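Caches

The key metrics for a cache are the total number of requests, the number of hits, and latency, followed by the request count, error count, and latency of the specific online service concerned.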
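The sketch below shows one way to collect both groups of cache metrics. It assumes the "online service" is the backend the cache sits in front of, represented here by a hypothetical `loader` callable. `InstrumentedCache` and its attribute names are illustrative only, and the cache has no eviction and is not thread-safe.

```python
import time


class InstrumentedCache:
    """Read-through cache that records cache and backend metrics."""

    def __init__(self, loader):
        self._loader = loader            # hypothetical backend lookup function
        self._data = {}
        self.requests = 0                # every get() call
        self.hits = 0                    # get() calls served from the cache
        self.latency_sum = 0.0           # total seconds spent in get()
        self.backend_requests = 0        # calls forwarded to the backend
        self.backend_errors = 0          # backend calls that raised
        self.backend_latency_sum = 0.0   # total seconds spent in the backend

    def get(self, key):
        begin = time.perf_counter()
        self.requests += 1
        try:
            if key in self._data:
                self.hits += 1
                return self._data[key]
            value = self._load(key)
            self._data[key] = value
            return value
        finally:
            self.latency_sum += time.perf_counter() - begin

    def _load(self, key):
        begin = time.perf_counter()
        self.backend_requests += 1
        try:
            return self._loader(key)
        except Exception:
            self.backend_errors += 1
            raise
        finally:
            self.backend_latency_sum += time.perf_counter() - begin
```

The hit ratio is then `hits / requests`, and the backend counters show how much load the cache is failing to absorb.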
