Saturday, October 31, 2015

Playlist Shuffle Algorithm
Since the Spotify service launched, we used Fisher-Yates shuffle to generate aperfectly random shuffling of a playlist. However, perfectly random means that the following two orders are equally likely to occur (different colors represent different artists):
Gambler’s fallacy
some people don’t want the same artist playing two or three times within a short time period.
people often think that if they haven’t won anything in a scratchcard lottery a couple of times in a row, they should have bigger chance of winning now. This phenomenon is called Gambler’s fallacy.

The problem with conventional shuffle algorithms is that they are too random. They lack fairness and uniform distribution.

The problem
Imagine that your music collection consists of tracks of three different genres, with a roughly equal number of tracks per genre. For example, you have 10 tracks of genre A, 11 of genre B and 11 of genre C. A truly random shuffle algorithm like it’s employed in almost all player programs and device firmwares would generate shuffle patterns like this one:


This example, short as it is, already exhibits the two main problems of random shuffle algorithms. The first one is the burst of four adjacent C’s in the middle of the sequence, 

the second one is the lack of B’s in that area (there’s no B for 8 slots, which is quarter of the whole sequence!). Actually, these are two manifestations of the same problem: lack of uniform distribution. A truly random shuffle algorithm of course can, and will, take the liberty of generating such non-uniform spots.

Long story short, if you want shuffled music, you actually don’t want a random playback order. Instead, you want a good, uniformly distributed mix of all the tracks on your disk or flash. How about this:


This pattern doesn’t have any unusual bursts. Instead it’s nicely alternating between the genres, but without being completely regular. In fact, there’s a simple rule that should be followed to construct good shuffle patterns:

Try to spread the tracks of each logical group as far away from each other as possible.

The concept
The basic principle of the algorithm is to split the music collection into multiple »logical groups«. The tracks inside each group are then shuffled to obtain a random playback order. If the group is still divisible into sub-groups (say, you grouped into genres, now you sub-group into artists), the Balanced Shuffle algorithm is used for this step. This way, the algorithm is applied recursively until there’s no further classification possible. In this case, the remaining tracks of this (sub-)group are considered equal in style and can thus be shuffled using a conventional method. When the track lists for all groups are collected, the main part of the algorithm (described below) begins: The per-group playlists are merged into a single one, taking care of the »maximum spread« principle.
    public static void shuffle(Object[] a) {
        int N = a.length;
        for (int i = 0; i < N; i++) {
            // choose index uniformly in [i, N-1]
            int r = i + (int) (Math.random() * (N - i));
            Object swap = a[r];
            a[r] = a[i];
            a[i] = swap;

Java's serialization algorithm

In general the serialization algorithm does the following:

It writes out the metadata of the class associated with an instance.
It recursively writes out the description of the superclass until it finds java.lang.object.
Once it finishes writing the metadata information, it then starts with the actual data associated with the instance. But this time, it starts from the topmost superclass.
It recursively writes the data associated with the instance, starting from the least superclass to the most-derived class.

An outline of the serialization algorithm

AC ED 00 05 73 72 00 0A 53 65 72 69 61 6C 54 65
73 74 05 52 81 5A AC 66 02 F6 02 00 02 49 00 07
76 65 72 73 69 6F 6E 4C 00 03 63 6F 6E 74 00 09
4C 63 6F 6E 74 61 69 6E 3B 78 72 00 06 70 61 72
65 6E 74 0E DB D2 BD 85 EE 63 7A 02 00 01 49 00
0D 70 61 72 65 6E 74 56 65 72 73 69 6F 6E 78 70
00 00 00 0A 00 00 00 42 73 72 00 07 63 6F 6E 74
61 69 6E FC BB E6 0E FB CB 60 C7 02 00 01 49 00
0E 63 6F 6E 74 61 69 6E 56 65 72 73 69 6F 6E 78
70 00 00 00 0B

AC ED: STREAM_MAGIC. Specifies that this is a serialization protocol.
00 05: STREAM_VERSION. The serialization version.
0x73: TC_OBJECT. Specifies that this is a new Object.
The first step of the serialization algorithm is to write the description of the class associated with an instance. The example serializes an object of type SerialTest, so the algorithm starts by writing the description of the SerialTest class.

0x72: TC_CLASSDESC. Specifies that this is a new class.
00 0A: Length of the class name.
53 65 72 69 61 6c 54 65 73 74: SerialTest, the name of the class.
05 52 81 5A AC 66 02 F6: SerialVersionUID, the serial version identifier of this class.
0x02: Various flags. This particular flag says that the object supports serialization.
00 02: Number of fields in this class.
Next, the algorithm writes the field int version = 66;.

0x49: Field type code. 49 represents "I", which stands for Int.
00 07: Length of the field name.
76 65 72 73 69 6F 6E: version, the name of the field.

And then the algorithm writes the next field, contain con = new contain();. This is an object, so it will write the canonical JVM signature of this field.

0x74: TC_STRING. Represents a new string.
00 09: Length of the string.
4C 63 6F 6E 74 61 69 6E 3B: Lcontain;, the canonical JVM signature.
0x78: TC_ENDBLOCKDATA, the end of the optional block data for an object.
The next step of the algorithm is to write the description of the parent class, which is the immediate superclass of SerialTest.

0x72: TC_CLASSDESC. Specifies that this is a new class.
00 06: Length of the class name.
70 61 72 65 6E 74: SerialTest, the name of the class
0E DB D2 BD 85 EE 63 7A: SerialVersionUID, the serial version identifier of this class.
0x02: Various flags. This flag notes that the object supports serialization.
00 01: Number of fields in this class.
Now the algorithm will write the field description for the parent class. parent has one field, int parentVersion = 100;.

0x49: Field type code. 49 represents "I", which stands for Int.
00 0D: Length of the field name.
70 61 72 65 6E 74 56 65 72 73 69 6F 6E: parentVersion, the name of the field.
0x78: TC_ENDBLOCKDATA, the end of block data for this object.
0x70: TC_NULL, which represents the fact that there are no more superclasses because we have reached the top of the class hierarchy.

10 Best Practices for Log Management

Coding with JRebel: Java Forever Changed
Search Resources
So far, the serialization algorithm has written the description of the class associated with the instance and all its superclasses. Next, it will write the actual data associated with the instance. It writes the parent class members first:

00 00 00 0A: 10, the value of parentVersion.
Then it moves on to SerialTest.

00 00 00 42: 66, the value of version.
The next few bytes are interesting. The algorithm needs to write the information about the contain object, shown in Listing 8.

Listing 8. The contain object

contain con = new contain();

Remember, the serialization algorithm hasn't written the class description for the contain class yet. This is the opportunity to write this description.

0x73: TC_OBJECT, designating a new object.
00 07: Length of the class name.
63 6F 6E 74 61 69 6E: contain, the name of the class.
FC BB E6 0E FB CB 60 C7: SerialVersionUID, the serial version identifier of this class.
0x02: Various flags. This flag indicates that this class supports serialization.
00 01: Number of fields in this class.

Next, the algorithm must write the description for contain's only field, int containVersion = 11;.

0x49: Field type code. 49 represents "I", which stands for Int.
00 0E: Length of the field name.
63 6F 6E 74 61 69 6E 56 65 72 73 69 6F 6E: containVersion, the name of the field.
Next, the serialization algorithm checks to see if contain has any parent classes. If it did, the algorithm would start writing that class; but in this case there is no superclass for contain, so the algorithm writes TC_NULL.

0x70: TC_NULL.
Finally, the algorithm writes the actual data associated with contain.

00 00 00 0B: 11, the value of containVersion.

Interview Misc



  • “架构师设计的,我只是使用”
  • “公司要求统一使用redis作为缓存存储”
  • “我比较熟悉redis

  • “因为redis支持集群高可用,redis集群支持固化,所以选择了redis




假如存储的是订单信息,keyoidvalueOrder实体,只作为缓存使用,允许cache miss读取数据库,此时未必需要redis的固化功能。




如何获得这些僵尸进程的 PID?如何把它们终止?我用 ps grep awk xargs 配合管道把这几个问题用一行命令行解决了,他很满意。
ps aux | awk '"[Zz]" ~ $8 { printf("%s, PID = %d\n", $8, $2); }'

    Kill the zombies

    zombies are living dead, so the aren't always easy to kill.
    • Try executing: kill -9 PID
      • example: kill -9 5067
    • See is the zombie is still alive, undead
      • use the techniques above
    • If its still undead
      • get a cross or garlic, well reliable sources tell me the don't work. We must try something else
    • Kill the zombie's parent (process)
    • execute: ps ef
      • this will display a process (family) tree
      • find the command who is the PID matches the zombie then look at the parents and try killing them
    • example:
    5102 pts/3    Ss     0:00 bash MANPATH=/usr/local/share/man:/usr/share/man:/usr
    5229 pts/3    R+     0:00  \_ ps ef MANPATH=/usr/local/share/man:/usr/share/man
    4929 tty1     S+     0:00 -bash TERM=linux HOME=/home/zymos SHELL=/bin/bash USE
    4954 tty1     S+     0:00  \_ /bin/sh /usr/bin/startx MANPATH=/usr/local/share/
    4970 tty1     S+     0:00      \_ xinit /home/zymos/.xinitrc -- -nolisten tcp -
    4975 tty1     S      0:01          \_ icewm MANPATH=/usr/local/share/man:/usr/s
    4977 tty1     S      0:01              \_ xterm MANPATH=/usr/local/share/man:/u
    4987 pts/0    Ss     0:00              |   \_ bash MANPATH=/usr/local/share/man
    5090 pts/0    Sl+    1:31              |       \_ ./hydranode MANPATH=/usr/loca
    5092 pts/0    Sl+    0:29              |           \_ ./hydranode-core --disabl
    4978 tty1     S      0:00              \_ xterm -e /home/zymos/.icewm/startup M
    4986 pts/1    Ss+    0:00              |   \_ /bin/sh /home/zymos/.icewm/startu
    4993 pts/1    S+     0:00              |       \_ gaim    MANPATH=/usr/local/sh
    4994 pts/1    S+     0:00              |       \_ /bin/bash /usr/libexec/mozill
    5030 pts/1    Sl+    1:29              |       |   \_ /usr/lib/mozilla-firefox/
    5067 pts/1    Z+     0:00              |       |       \_ [netstat] <defunct>
    4995 pts/1    S+     0:00              |       \_ xterm MANPATH=/usr/local/shar
    5032 pts/2    Ss+    0:00              |       |   \_ bash MANPATH=/usr/local/s
    5085 pts/1    S+     0:00              |       \_ boinc_client MANPATH=/usr/loc
    4981 tty1     S      0:00              \_ icewmtray MANPATH=/usr/local/share/ma
    5060 pts/1    S+     0:00 /usr/libexec/gconfd-2 13 MANPATH=/usr/local/share/man
    • netstat matches the PID above, 5067
      • so in this example I would try killing
        • 5030 pts/1 Sl+ 1:29 /usr/lib/mozilla-firefox/
        • 4978 tty1 S 0:00 xterm -e /home/zymos/.icewm/startup M
        • 4986 pts/1 Ss+ 0:00 /bin/sh /home/zymos/.icewm/startu
    • Hopefully this will work
    • If you need a  single line command that does the work you can try
    ps axu | awk '"[Zz] ~ $8 { system(sprintf("kill -HUP %d", $2)); }'
    • This will work in most cases
      • Check if the zombie process still exists; if so, run the same command but change -HUP to -TERM.
      • If the zombie is still there, use -9 instead of -HUP or -TERM.
        • -9 will terminate it for sure but it's considered very bad practice
        • FYI - The above is way too much work. Use killall <process-name> e.g. killall firefox
    You can clean up a zombie process by killing its parent process with the following command:
    kill -HUP $(ps -A -ostat,ppid | grep -e '[zZ]'| awk '{ print $2 }')
    A zombie is already dead, so you cannot kill it. To clean up a zombie, it must be waited on by its parent, so killing the parent should work to eliminate the zombie. (After the parent dies, the zombie will be inherited by init, which will wait on it and clear its entry in the process table.) If your daemon is spawning children that become zombies, you have a bug. Your daemon should notice when its children die and wait on them to determine their exit status.
    Example command:
    kill $(ps -A -ostat,ppid | awk '/[zZ]/{print $2}')
    第二问是统计一个 httpd 的访问日志中,访问量最大的前五个 IP。- shell
    我用 Ruby 写了一个,六行代码就能搞定。

    接着面试官了解 DNS 吗,如果浏览器没法上网一般怎么诊断?我的回答是先 dig 看
    DNS 解析是否正确,然后用 ping 判断 IP 是否可以访问,再用 curl 看是不是浏览器
    设置的问题。最后还可以用类似 的服务检查。

    第四个问题是怎么把一个文件复制到多个机器上?我说可以先用 ssh-copy-id 把公钥
    拷过去,然后再用 scp 拷文件。面试官说这样机器多就麻烦了。我解释到可以用
    expect 写脚本自动输入密码,还加了句当然也可以用 Twitter 的开源工具 murder 分

    第五问是文件系统中 soft link 和 hard link 的区别。我的解释是 soft link 是一
    种特殊的文件,它的内容是被指向的文件的路径,而 hard link 是直接指向 inode 的
    。所以 soft link 可以用于目录,但是 hard link 不可以。

    文件系统中 inode 和 path 的区别。我回答是 inode 是文件系统的一个数据结构,指
    向某个磁盘上的文件;而 path 是由多个 struct dentry 组成的,每个 dentry 描述
    了 inode 的父子关系。

    最后一问是如何修改 DNS 服务器?我说可以修改 /etc/resolve.conf。

    phone screen interview

    First round:
    1. check if a string is anagram  (geeks很喜欢问这个问题,遇到说话磕磕绊绊但
    2. Given a matrix which contains black and white grids, use a method to find
    out if the white grids are connected or not, if yes, return true.

    Second round:
    1.  Design an API which is used for a cache big data
    2.  Given a string, find longest substring which contains just two unique characters.
    double a = 0; double b = 99; while(true) { double c = (a+b)/2; if(c == a || c == b) break; a = b; b = c; } cout<<a;
    浮点数判等,会无限循环么?由于浮点数精度有限,所以这个程序迭代个五六十次就终止了,又由于 cout 有默认的输出精度,所以输出结果应该是66。
    1. 面向对象编程的各类问题,多态、虚函数等;
    2. C++ 和 Java 的区别有哪些;
    3. 你用过这种语言的哪些库;
    4. 对 STL 是否熟悉;
    5. Java 虚拟机了解么,解释垃圾回收机制的原理;
    6. 线程/进程间通信如何实现,写过多线程的程序么;
    7. 大数据处理框架用过哪些(其实就算是编程语言的杀手应用吧);
    8. 机器学习、数据可视化的库用过哪些;
    9. java 中 final 的用法,C/C++ 中的 const 的用法;
    10. MySQL 的如何建立索引进行查询的,存储方式是怎样的,为什么要分块;
    11. 堆区和栈区的区别在哪里;
    12. malloc 执行的时候发生了那些事情;
    13. 正则表达式的写法;
    14. vim 中执行某项操作的快捷键;
    • 思维益智题 考察你分析问题的能力,大部分可以归结到二分、动态规划、递归上,重要的是思路,其次是尽量低的复杂度,再次是代码的正确性
    1. 单向网络,起点和终点唯一且连通,寻找那些一旦被删除将导致起点终点无法连通的点;
    2. 有序数组寻找和为某数的一对数字;
    1. 大数据量的 url log,怎么去重且统计每个 url 的出现次数,复杂度分析;
    2. 设计 cache 系统
      • 设计 cache 的接口;
      • 可以用什么数据结构实现;
      • 如何实现可伸缩的容量;
      • cache 的空间管理策略,比如 cache 哪些条目,何时清理;
      • cache 系统启动时分配多大的空间,之后按照怎样的策略增大;
    3. 设计爬虫;
    4. 流媒体播放客户端从多个完全相同的发送方接收视频包,同一发送方的包会按序到达,不同发送方的包则不一定,有可能会丢包,但还是要保证播放流畅度,设计播放客户端的算法。
    • 训练自己算法复杂度的分析能力,有的时候对复杂度的分析会反过来会帮助你找到更好的算法
    • 一定量的题目积累很重要,就好像准备高考数学压轴题一样,见识的越多,思路来得越快,当然,前提是你能够不断总结反思


    Review (572) System Design (334) System Design - Review (198) Java (189) Coding (75) Interview-System Design (65) Interview (63) Book Notes (59) Coding - Review (59) to-do (45) Linux (43) Knowledge (39) Interview-Java (35) Knowledge - Review (32) Database (31) Design Patterns (31) Big Data (29) Product Architecture (28) MultiThread (27) Soft Skills (27) Concurrency (26) Cracking Code Interview (26) Miscs (25) Distributed (24) OOD Design (24) Google (23) Career (22) Interview - Review (21) Java - Code (21) Operating System (21) Interview Q&A (20) System Design - Practice (20) Tips (19) Algorithm (17) Company - Facebook (17) Security (17) How to Ace Interview (16) Brain Teaser (14) Linux - Shell (14) Redis (14) Testing (14) Tools (14) Code Quality (13) Search (13) Spark (13) Spring (13) Company - LinkedIn (12) How to (12) Interview-Database (12) Interview-Operating System (12) Solr (12) Architecture Principles (11) Resource (10) Amazon (9) Cache (9) Git (9) Interview - MultiThread (9) Scalability (9) Trouble Shooting (9) Web Dev (9) Architecture Model (8) Better Programmer (8) Cassandra (8) Company - Uber (8) Java67 (8) Math (8) OO Design principles (8) SOLID (8) Design (7) Interview Corner (7) JVM (7) Java Basics (7) Kafka (7) Mac (7) Machine Learning (7) NoSQL (7) C++ (6) Chrome (6) File System (6) Highscalability (6) How to Better (6) Network (6) Restful (6) CareerCup (5) Code Review (5) Hash (5) How to Interview (5) JDK Source Code (5) JavaScript (5) Leetcode (5) Must Known (5) Python (5)

    Popular Posts