Monday, June 29, 2015

Design Hit Counter - how to count number of requests in last second, minute and hour - Stack Overflow



http://massivealgorithms.blogspot.com/2016/08/leetcode-362-design-hit-counter.html
统计网站最近5分钟的访问次数
https://www.1point3acres.com/bbs/forum.php?mod=viewthread&tid=320340
设计一个class,要求支持
void hit() <--- 网站访问次数加1
int getHits()  <--- 返回最近5分钟的访问次数

如果网站访问量很大,多个线程可能同时访问hit和getHits,不要求getHits绝对正确,如何提高性能。
搜尋最近 N 分鐘 / 小時 / 天 資料
這類題目的套路目前我的思路是準備很多不同 granularity 的時間桶組,每個資料來時,放到若干對應桶內
例如:
前一小時組 (60桶):
min0, min1, ... min59
前一天組 (24桶):
hour0,hour2, ... hour23
前一個禮拜組 (7桶):
day0, day1, ... day6

當下來了一個時間,就放到 min0, hour0, day0
query: 
最近一個禮拜: 就把 day0 ~ day6 拿來算
最近 3 小時: 就把 hour0, ~ hour2 拿來算

然後每個時間桶組,對應時間到,會自動建新桶,然後淘汰最舊桶

算法如上
但對於這題我不知道實現上
1. 怎麼去實現桶的? 用哪種 DB? 存的方式為何?
2. 怎麼實作自動建新桶,淘汰最舊桶?

請問有大老能解說解說這部分嗎?

另外一個好奇的是
實務上如 Splunk 這類公司,真的是用時間桶在實現嗎? 還是有其他更高深的方法


如果是算法题,那么没有什么比(doubly) linkedlist更适合的了。
每分钟来的,加到dummyTAIL ->队尾->原来的队尾(曾经的min0, 现在升级为min1)->....min59 -> dummyROOT
小时和天也是类似的道理。

但是楼主的问题一是存储方式,那么时间序列数据库应该是一个不错的选择。每个不同的桶就由其开始的时间为key。
考虑到楼主所提的Splunk,那么第二问如果是算法题,在每次更新分钟的时候,原来的min59就会被自动淘汰掉,原来的min58会升级为min59,放在dummyROOT的左边

循环数组比链表更合适,在于快速查询某个bucket的数据,同时由于数组比链表这种分散存储有更好的locality。底层的缓存效果也会更好。

splunk这种专门做检索的不知道,这种统计一般自己写个业务普通关系数据库就够用,因为其内建的存储优化几近极致,同时提供丰富的聚合运算。
存的方式主要考虑重写还是重读,一般读多写少的业务,多建个索引不亏。还有就是incremental id 即有时间顺序,同时主键一般先天就是clustered index,
在做时段统计的时候读取效率更高。但是要注意id做时间顺序脆弱性,比如不经意的插入操作破坏顺序

algorithm - how to count number of requests in last second, minute and hour - Stack Overflow
There is a hypothetical web server which supports only one very simple API - count of requests received in the last hour, minute and second. This server is very popular in the world and received thousands of requests per second.

Aim it to find how to return accurately these 3 counts to every request?
Requests are coming all the time, so the window of one hour, one minute and one second is different per request. How to manage a different window per request so that the counts are correct per request?

You can create an array of size 60x60 for each second in the hour and use it as circular buffer. Each entry contains number of requests for a given second. When you move to next second, clear it and start counting. When you are at then end of array, you start from 0 again, so effectively clearing all counts prior to 1 hour.
  1. For Hour: return sum of all elements
  2. For Minute: return sum of last 60 entries (from currentIndex)
  3. For Second: return count on the currentIndex
So all three have O(1) space and time complexity. Only drawback is, it ignores milliseconds, but you can apply same notion to include milliseconds as well.
http://www.mitbbs.com/article_t/JobHunting/32458451.html
两部分:
1 建立counter for every second
可以用循环数组
2 建立 window (5分钟,1 小时,等等)
这个问题就是一个固定window 滑动的问题了。
可以在 O(1) 返回结果

Second this, if you need sub-second accuracy. You can record each request 
with the timestamp, retrieve the head and tail second records for adjustment.

You'll need a linked list, plus a 3600 second indexing rotational array for data structure, each array entry would point to the first request after a round second. Build another index for minute or whatever period.
http://prismoskills.appspot.com/lessons/System_Design_and_Big_Data/Chapter_03_-_Count_requests_in_last_second_and_hour.jsp
Problem: Given an extremely busy server which receives thousands of requests per second.
It is desired to find the number of requests received in the last second, minute and hour.
What algorithm and data-structures can be used to do this as accurately as possible?

Solution : Some of the first solutions that come to mind are:
Maintain 3 counters, one each for last hour, second and minute.
This is quickly rejected because there is no way to remove the count of requests which fall out of the window of last hour, minute and second.

Maintain 3 lists, one each for last hour, second and minute
This does solve the problem with the above counters, but since the number of requests is huge, a massive synchronization will be required for each thread adding its request to three lists.
Given thousands of requests per second, synchronization alone will slow down the entire lists' updation process.

Best approach: A good solution is one which allows concurrent updates and does not eat up too much memory.
With this idea in mind, the following can be a good solution:

1) Create an array AtomicInteger frequencies[1000]; to store the number of requests received per second.

2) This array will store frequencies of requests for the current second i.e. from HH:MM:SS to HH:MM:SS+1 time

3) We will store current second in some variable, say currentSecond
This can be retrieved in Java as:
Calendar calendar = Calendar.getInstance();
int currentSecond = calendar.get(Calendar.SECOND);

4) frequencies array counts are incremented as:
int newSecond = calendar.get(Calendar.SECOND);
if (newSecond != currentSecond)
{
    synchronized (...)
    {
        int requestsPerSecond = sumFrequenciesInTheSecond (frequencies);
        frequencies = new AtomicInteger[1000];
    }
}
// frequencies points to current second at this point
int requestMillisOfSecond = calendar.get(Calendar.MILLISECOND);
frequencies[requestMillisOfSecond]++;

5) So we are able to get the frequencies per second by just using 1000 AtomicIntegers.

6) To get requests per minute, all we need to do is add the seconds in a minute.
This can be done by storing just 60 seconds of data, i.e. just 60 integers.

7) Similarly, once we have minutes' data, we can get hours' data by adding 60 minutes.
Again there is no need to store more than 60 minutes of data.
Whenever 60 minutes complete for the current hour, we sum up those minutes, store the hour's requests per second and reset the minutes array.

8) Thus, all we need is the following:
AtomicInteger frequencies[1000];
int secondsFrequencies[60]; // Every second gets its value by summing frequencies array
int minutesFrequencies[60]; // Every minute gets its value by summing secondsFrequencies array
int hoursFrequencies[60]; // Every hour gets its value by summing minutesFrequencies array

9) This way, we get our solution in just O(1) space

https://github.com/WeizhengZhou/leetcode3/blob/master/src/LastCounter.java
public class LastCounter {
private static int SIZE  = 6;//should be 60, use 6 for test
private int[] A = null;
private int index = 0;
private long startTime = 0;
private int lastMinuteCount = 0;
public LastCounter(){
this.A = new int[SIZE];
this.index = 0;
//Counter start time, index start from 0
this.startTime = (long) (System.currentTimeMillis()/1E3);
this.lastMinuteCount = 0;
}
public void hit(){
int lastIndex = index;//record last index
this.index = getIndex();//get current index
if(lastIndex == index){//still in this slot
A[index] += 1;//increase current second count by 1
this.lastMinuteCount +=1;//increase last minute count
}

else{
for(int i=(lastIndex+1)%SIZE;i<=this.index;i++){//[lastindex, index-1] has no hits
this.lastMinuteCount -= A[i];//for index, i is 60 seconds ago
A[i] = 0;
}
A[this.index] = 1;//first hit in this second
this.lastMinuteCount +=1;//update lastMinute
}
int sum = 0;
for(int i=0;i<A.length;i++)
sum += A[i];
System.out.println("lastIndex = " + lastIndex +
", index = " + index  +
", count = " + lastMinuteCount +
", A = " + Arrays.toString(A) +
", sum(A) = " + sum);
}
public int lastMinuteCount(){
return lastMinuteCount;
}
// Calculate the index in circular array based on the current time
public int getIndex(){
long curTime = (long) (System.currentTimeMillis()/1E3);
int timeElipse = (int) (curTime - startTime);
System.out.println("timeElipse = "+ timeElipse + ", ");
return (timeElipse % SIZE);
}
}
https://discuss.leetcode.com/topic/181/design-hit-counter
Write a hit counter for hits received in the past 5 minutes.
The HitCounter has two methods:
void hit()   // record a hit.

long getHits()   // Return the number of hits in the last 5 minutes
Here is a suggestion. I assume that we use granularity second.ArrayDeque (thanks to @diptesh for the remind) is used, in which each element is time slot which corresponds to different seconds, when hit arrive. When hit arrives or information about it is looked for , list is resized according to the time.
public class HitCounter {
    private final int WINDOW_SIZE = 300;
    class TimeSlot {
        public TimeSlot(long t) {
            time = t;
            hits = 1;
            
        }
        private long time;
        private long hits;      
    }
    
    private long hits;
    private long time;

    
    private Deque<TimeSlot> slots = new ArrayDeque<>(WINDOW_SIZE * 60);
    
    HitCounter() {
        time = System.currentTimeMillis()/1000;
    }
   private   void resize(long current) {    
        long before = current - WINDOW_SIZE;
        while (slots.size() > 0 && slots.getFirst().time < before) {
            hits -= slots.removeFirst().hits;             
        }
        if (slots.size() == 0) {            
            slots.add(new TimeSlot(current));
            
        }       
        time = slots.getFirst().time;
        
    }
    public void hit() {
        long current = System.currentTimeMillis()/1000;  
        if (time < current - WINDOW_SIZE)  {
            resize(current);           
        }
        else {
            if (slots.size() > 0 && slots.getLast().time == time) {
                slots.getLast().hits++;               
            } else {
                slots.add(new TimeSlot(current));
            }           
        }       
        hits++;
    }

    public long getHits() {
        long current = System.currentTimeMillis()/1000;       
        if (time < current - WINDOW_SIZE) 
            resize(current);
        return hits;
    }
}
Map<Long, Long> map = new HashMap<Long, Long>();
Long hits = 0L, earliestTime = 0L;
int SLOTS = 300 * 60;
public void logHits(long time) { // time in seconds
    if(map.containsKey(time)) {
        map.put(time, map.get(time)+1);
    }else {
        map.put(time, 1L);
    }
    cleanUp(time);
    hits++;
}

public Long getHits(long time) { // time in seconds
    cleanUp(time);
    return hits;
}

public void cleanUp(long time) {
    while(time - earliestTime >= SLOTS) {
        if(map.containsKey(earliestTime)) {
            hits -= map.get(earliestTime);
            map.remove(earliestTime);
        }
        earliestTime = earliestTime + 1;
    }
}
http://stackoverflow.com/questions/11701008/efficient-way-to-compute-number-of-hits-to-a-server-within-the-last-minute-in-r
http://www.careercup.com/question?id=14974850

https://github.com/mission-peace/interview/blob/master/src/com/interview/multithreaded/RealTimeCounter.java
public class RealTimeCounter {

    private final static int GRANULARITY = 300;
    private AtomicLongArray counter = new AtomicLongArray(GRANULARITY);
    private volatile int pos = 0;
 
    private RealTimeCounter(){
        PositionUpdater positionUpdater = new PositionUpdater(this);
        positionUpdater.start();
    }
 
    private static volatile RealTimeCounter INSTANCE;
 
    public static RealTimeCounter getInstance(){
        if(INSTANCE == null){
            synchronized (RealTimeCounter.class) {
                if(INSTANCE == null){
                    INSTANCE = new RealTimeCounter();
                }
            }
        }
        return INSTANCE;
    }
 
    public long getTotalEvents(){
        int total = 0;
        for(int i=0; i < GRANULARITY; i++){
            total += counter.get(i);
        }
        return total;
    }
 
    public void addEvent(){
        counter.getAndIncrement(pos);
    }
 
    void incrementPosition(){
        //first reset the value to 0 at next counter location.
        counter.set((pos + 1)%GRANULARITY, 0);
        pos = (pos + 1)%GRANULARITY;
    }
}
class PositionUpdater extends TimerTask{

    private final RealTimeCounter realTimeCounter;
    private final Timer timer = new Timer(true);
    private static final int DELAY = 1000;
    PositionUpdater(RealTimeCounter realTimeCounter) {
        this.realTimeCounter = realTimeCounter;
    }
 
    public void start(){
        timer.schedule(this, DELAY);
    }
    @Override
    public void run() {
        realTimeCounter.incrementPosition();
    }
}  
http://blog.gainlo.co/index.php/2016/09/12/dropbox-interview-design-hit-counter/
I have a strong preference for questions that are not hard to solve in the simplest case but the discussion can go deeper and deeper by removing/adding specific conditions

Forget about all the hard problems like concurrency and scalability issue, let’s say we only have a single machine with no concurrent requests, how would you get the number of visitors for the past 1 minute?
Apparently, the simplest solution is to store all the visitors with the timestamps in the database. When someone asks for visitor number of the past minute, we just go over the database and do the filtering and counting. A little bit optimization is to order users by timestamp so that we won’t scan the whole table.
The solution is not efficient as the time complexity is O(N) where N is the number of visitors. If the website has a large volume, the function won’t be able to return the number immediately.
Let’s optimize
A couple of ways to think about this problem. Since the above approach returns not only visitor numbers, but also visitors for the past minute, which is not needed in the question. And this is something we can optimize. From a different angle, we only need numbers for the past minute instead of any time range, which is another area that we can improve potentially. In a nutshell, by removing unnecessary features, we can optimize our solution.
A straightforward idea is to only keep users from the past minute and as time passes by, we keep updating the list and its length. This allows us to get the number instantly. In essence, we reduce the cost of fetching the numbers, but have to keep updating the list.
We can use a queue or linked list to store only users from the past minute. We keep all the element in order and when the last user (the earliest user) has the time more than a minute, just remove it from the list and update the length.

Space optimization

There’s little room to improve the speed as we can return the visitor number in O(1) time. However, storing all the users from the past minute can be costly in terms of space. A simple optimization is to only keep the user timestamp in the list rather than the user object, which can save a lot of space especially when the user object is large.
If we want to further reduce the space usage, what approach would you take?
A good way to think about this is that to improve space complexity, what should we sacrifice? Since we still want to keep the time complexity O(1), one thing we can compromise is accuracy. If we can’t guarantee to return the most accurate number, can we use less space?
Instead of tracking users from the past minute, we can only track users from the past second. By doing this, we know exactly how many visitors are from the last second. To get visitor numbers for the past minute, we keep a queue/linked list of 60 spots representing the past 60 seconds. Each spot stores the visitor number of that second. So every second, we remove the last (the earliest) spot from the list and add a new one with the visitor number of past second. Visitor number of the past minute is the sum of the 60 spots.
The minute count can be off by the request of the past second. And you can control the trade-off between accuracy and space by adjusting the unit, e.g. you can store users from past 2 seconds and have 30 spots in the list.

How about concurrent requests?

In production systems, concurrency is the most common problems people face. If there can be multiple users visiting the site simultaneously, does the previous approach still work?
Part of. Apparently, the basic idea still holds. However, when two requests update the list simultaneously, there can be race conditions. It’s possible that the request that updated the list first may not be included eventually.
The most common solution is to use a lock to protect the list. Whenever someone wants to update the list (by either adding new elements or removing the tail), a lock will be placed on the container. After the operation finishes, the list will be unlocked.
This works pretty well when you don’t have a large volume of requests or performance is not a concern. Placing a lock can be costly at some times and when there are too many concurrent requests, the lock may potentially block the system and becomes the performance bottleneck.

Distribute the counter

When a single machine gets too many traffic and performance becomes an issue, it’s the perfect time to think of distributed solution. Distributed system significantly reduces the burden of a single machine by scaling the system to multiple nodes, but at the same time adding complexity.
Let’s say we distribute visit requests to multiple machines equally. I’d like to emphasize the importance of equal distribution first. If particular machines get much more traffic than the rest machines, the system doesn’t get to its full usage and it’s very important to take this into consideration when designing the system. In our case, we can get a hash of users email and distribute by the hash (it’s not a good idea to use email directly as some letter may appear much more frequent than the others).
To count the number, each machine works independently to count its own users from the past minute. When we request the global number, we just need to add all counters together.

Read full article from algorithm - how to count number of requests in last second, minute and hour - Stack Overflow

Labels

Review (572) System Design (334) System Design - Review (198) Java (189) Coding (75) Interview-System Design (65) Interview (63) Book Notes (59) Coding - Review (59) to-do (45) Linux (43) Knowledge (39) Interview-Java (35) Knowledge - Review (32) Database (31) Design Patterns (31) Big Data (29) Product Architecture (28) MultiThread (27) Soft Skills (27) Concurrency (26) Cracking Code Interview (26) Miscs (25) Distributed (24) OOD Design (24) Google (23) Career (22) Interview - Review (21) Java - Code (21) Operating System (21) Interview Q&A (20) System Design - Practice (20) Tips (19) Algorithm (17) Company - Facebook (17) Security (17) How to Ace Interview (16) Brain Teaser (14) Linux - Shell (14) Redis (14) Testing (14) Tools (14) Code Quality (13) Search (13) Spark (13) Spring (13) Company - LinkedIn (12) How to (12) Interview-Database (12) Interview-Operating System (12) Solr (12) Architecture Principles (11) Resource (10) Amazon (9) Cache (9) Git (9) Interview - MultiThread (9) Scalability (9) Trouble Shooting (9) Web Dev (9) Architecture Model (8) Better Programmer (8) Cassandra (8) Company - Uber (8) Java67 (8) Math (8) OO Design principles (8) SOLID (8) Design (7) Interview Corner (7) JVM (7) Java Basics (7) Kafka (7) Mac (7) Machine Learning (7) NoSQL (7) C++ (6) Chrome (6) File System (6) Highscalability (6) How to Better (6) Network (6) Restful (6) CareerCup (5) Code Review (5) Hash (5) How to Interview (5) JDK Source Code (5) JavaScript (5) Leetcode (5) Must Known (5) Python (5)

Popular Posts