Saturday, June 6, 2015

Pavel's Blog: Finding The Median In Large Sets Of Numbers Split Across 1000 Servers

How would you find the median across a thousand servers, each holding a set of a billion numbers?

This question invites a lot of discussion because it is quite vague and requires making some assumptions. Obviously we can't sort everything in memory on one machine because we don't have enough of it. We may possibly fit one set at a time, which is about 4 GB of memory for a billion 32-bit integers.

The first obvious solution is an external merge sort followed by a lookup of the n/2-th element (or the average of the n/2-th and (n/2 + 1)-th elements when n is even). This solution is O(n log n), but you may be asked for an O(n) solution, which someone may find impossible: determining the median seems to require a sorted set, and sorting by comparing elements leads to O(n log n) complexity. But who says we need to sort by comparison?
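As a sketch of the merge step, assume each server can sort its own set locally and stream it in ascending order; a coordinator can then k-way merge the streams with a min-heap and stop as soon as it reaches the middle position, never holding more than one element per server. The class name MergeMedian and the use of small in-memory arrays to stand in for server streams are illustration-only assumptions.

```java
import java.util.List;
import java.util.PriorityQueue;

public class MergeMedian {
    public static double median(List<int[]> sortedSets) {
        long n = 0;
        for (int[] s : sortedSets) n += s.length;

        // min-heap of {value, set index, position within that set}
        PriorityQueue<long[]> heap =
            new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        for (int i = 0; i < sortedSets.size(); i++) {
            if (sortedSets.get(i).length > 0) {
                heap.add(new long[] {sortedSets.get(i)[0], i, 0});
            }
        }

        long lowerMid = (n - 1) / 2;   // 0-based lower middle position
        long seen = -1;
        long lower = 0;
        while (!heap.isEmpty()) {
            long[] top = heap.poll();
            seen++;
            if (n % 2 == 1 && seen == lowerMid) {
                return top[0];                     // odd count: single middle
            }
            if (seen == lowerMid) {
                lower = top[0];                    // even count: remember lower middle
            } else if (seen == lowerMid + 1) {
                return (lower + top[0]) / 2.0;     // average of the two middles
            }
            // advance within the set the popped element came from
            int set = (int) top[1];
            int pos = (int) top[2] + 1;
            if (pos < sortedSets.get(set).length) {
                heap.add(new long[] {sortedSets.get(set)[pos], set, pos});
            }
        }
        throw new IllegalArgumentException("no elements");
    }
}
```

The merge itself is O(n log k) for k servers, so the overall cost is still dominated by the per-server sorts.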

Another solution follows from the observation that we need to find the n/2-th element of the array in sorted order, which generalizes to the order statistics problem: finding the kth smallest number in an unsorted array.

This algorithm is also called quickselect, or simply the selection algorithm. I talked about an implementation of this algorithm in this article. The problem is that it is an in-memory algorithm, and we don't have enough memory to hold all the sets.

So we need a distributed version of this algorithm, which essentially requires a routine to read and swap numbers across servers. We have to assume this routine is optimized and efficient, because there are just too many factors that can affect it (network latency, connection problems, and so on). Quickselect runs in O(n) expected time (O(n²) in the worst case), and the distributed version of kth smallest may look like this:
public static long partition(
        DistributedArray array, long first, long last) {

    int pivot = array.get(first);
    long pivotPosition = first;
    first++;
    while (first <= last) {
        // scan right: skip elements smaller than the pivot
        while ((first <= last) && (array.get(first) < pivot)) {
            first++;
        }

        // scan left: skip elements greater than or equal to the pivot
        while ((last >= first) && (array.get(last) >= pivot)) {
            last--;
        }

        if (first > last) {
            // pointers crossed: move the pivot into its final position
            array.swap(pivotPosition, last);
        }
        else {
            array.swap(first, last);
        }
    }
    return last;
}

/**
 * Returns the kth smallest element (k is 1-based) of array[first..last].
 */
public static int orderStatistic(
        DistributedArray array, long k, long first, long last) {

    long pivotPosition = partition(array, first, last);

    if (pivotPosition == (k - 1)) {
        // the pivot landed exactly at position k - 1
        return array.get(k - 1);
    }
    if ((k - 1) < pivotPosition) {
        return orderStatistic(array, k, first, pivotPosition - 1);
    }
    else {
        return orderStatistic(array, k, pivotPosition + 1, last);
    }
}

public static int kthSmallest(DistributedArray array, int k) {
    return orderStatistic(
        array, k, 0, array.size() - 1);
}

public static int kthLargest(DistributedArray array, int k) {
    return orderStatistic(
        array, array.size() - k + 1, 0, array.size() - 1);
}

/**
 * A class that encapsulates the logic of accessing indexes 
 * across multiple servers.
 */
public abstract class DistributedArray {

    /** Swap the elements at two global indexes, possibly across servers. */
    public abstract void swap(long from, long to);

    /** Return the element at a global index, deciding what server it comes from. */
    public abstract int get(long index);

    /** Return the total number of elements counted across all servers. */
    public abstract long size();
}
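Since DistributedArray is only a stub, here is the same selection logic over a plain in-memory int[], with the comparisons oriented so that kthSmallest really returns the kth smallest; the class name QuickSelect is introduced here for illustration.

```java
public class QuickSelect {
    static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }

    // partition a[first..last] around a[first]; return the pivot's final index
    static int partition(int[] a, int first, int last) {
        int pivot = a[first];
        int pivotPosition = first;
        first++;
        while (first <= last) {
            // scan right: skip elements smaller than the pivot
            while (first <= last && a[first] < pivot) first++;
            // scan left: skip elements greater than or equal to the pivot
            while (last >= first && a[last] >= pivot) last--;
            if (first > last) {
                swap(a, pivotPosition, last);  // pivot into final place
            } else {
                swap(a, first, last);
            }
        }
        return last;
    }

    // kth smallest (k is 1-based) in a[first..last]
    static int orderStatistic(int[] a, int k, int first, int last) {
        int p = partition(a, first, last);
        if (p == k - 1) return a[k - 1];
        if (k - 1 < p) return orderStatistic(a, k, first, p - 1);
        return orderStatistic(a, k, p + 1, last);
    }

    public static int kthSmallest(int[] a, int k) {
        return orderStatistic(a, k, 0, a.length - 1);
    }
}
```

The median of an odd-length array is then kthSmallest(a, a.length / 2 + 1); note that quickselect reorders the array in place.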

Another solution comes from the second observation: that we don't have to use a comparison sort. We could build a histogram of the counts of the numbers across the servers, as in counting sort. Here we have to assume a few more things. First, we must know something about the range of the numbers: if the values fall in a range on the order of a billion, we could store an array of about a billion counters, which is roughly 4 GB for 32-bit counts. Second, we must assume one machine can hold an array of that size, consistent with the earlier assumption that it can process one set at a time. This is again an O(n) solution, because we build the histogram in one pass over all the numbers and then find the kth number (the median) in one pass over the histogram.

public static int median(MedianServer[] srvs) {
    // assumes all values fall in [0, 2^30); the histogram
    // alone takes 4 GB of memory
    int[] histogram = new int[1 << 30];
    long n = 0;

    // build the histogram
    for (MedianServer srv : srvs) {
        for (int i = 0; i < srv.size(); i++) {
            histogram[srv.get(i)]++;
            n++;
        }
    }

    long mid1 = (n + 1) / 2;                      // 1-based lower middle
    long mid2 = (n % 2 == 0) ? mid1 + 1 : mid1;   // upper middle
    int lower = -1;
    long count = 0;

    // accumulate counts until we pass the middle position(s)
    for (int i = 0; i < histogram.length; i++) {
        count += histogram[i];
        if (lower == -1 && count >= mid1) {
            lower = i;
        }
        if (count >= mid2) {
            return (lower + i) / 2;  // equals i itself when n is odd
        }
    }

    return 0;
}

/**
 * A class that encapsulates access to the numbers
 * stored on one server.
 */
private static abstract class MedianServer {

    /** Return the number at the specified index on this server. */
    public abstract int get(int index);

    /** Return how many numbers this server holds. */
    public abstract int size();
}
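To make the counting idea concrete and runnable, here is the same logic over plain arrays with a small value range standing in for the per-server streams; HistogramMedian and the range parameter are illustration-only assumptions. Like the version above it returns an int, so the average of the two middles truncates when n is even.

```java
public class HistogramMedian {
    // sets: one int[] of values per server; all values must lie in [0, range)
    public static int median(int[][] sets, int range) {
        long[] hist = new long[range];
        long n = 0;

        // build the histogram in one pass over every server's numbers
        for (int[] set : sets) {
            for (int v : set) {
                hist[v]++;
                n++;
            }
        }

        long mid1 = (n + 1) / 2;                      // 1-based lower middle
        long mid2 = (n % 2 == 0) ? mid1 + 1 : mid1;   // upper middle
        int lower = -1;
        long count = 0;

        // one pass over the histogram to find the middle position(s)
        for (int i = 0; i < range; i++) {
            count += hist[i];
            if (lower == -1 && count >= mid1) lower = i;
            if (count >= mid2) return (lower + i) / 2;
        }
        return -1;  // unreachable for non-empty input
    }
}
```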

We could think of more solutions if we can assume even more facts. Say we know that the numbers across all servers are uniformly distributed. With this many numbers, the median is then very close to the mean, which for a uniform distribution is the midpoint of the range. So if we also know the range of the distribution, say 0..1 billion, computing the median becomes an O(1) operation: 0 + (1 billion - 0) / 2 = 500 million. If we don't know the range, we can compute the median of the per-server medians in O(n) by running order statistics on each server; note that this gives only an approximation of the true median, not the exact value.
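The median-of-medians estimate can be sketched as below: take each server's local median, then the median of those values. MedianOfMedians is a name introduced here, and sorting stands in for the per-server order statistics routine; keep in mind the result is an estimate, not the exact global median.

```java
import java.util.Arrays;

public class MedianOfMedians {
    // local median of one server's set; sorting stands in for quickselect
    static int localMedian(int[] set) {
        int[] copy = set.clone();
        Arrays.sort(copy);
        return copy[copy.length / 2];
    }

    // approximate global median: the median of the per-server medians
    public static int approximateMedian(int[][] sets) {
        int[] medians = new int[sets.length];
        for (int i = 0; i < sets.length; i++) {
            medians[i] = localMedian(sets[i]);
        }
        return localMedian(medians);
    }
}
```

With 1000 servers, only 1000 local medians ever cross the network, which is why this estimate is attractive when an exact answer isn't required.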

If the distribution is not uniform but we know something about it, we could still estimate the median with some probability of error. Of course, we can find other solutions under various assumptions, but the ones above can probably satisfy any interviewer or whoever asked this question.
