Saturday, October 31, 2015

Playlist Shuffle Algorithm

https://labs.spotify.com/2014/02/28/how-to-shuffle-songs/
Since the Spotify service launched, we used Fisher-Yates shuffle to generate aperfectly random shuffling of a playlist. However, perfectly random means that the following two orders are equally likely to occur (different colors represent different artists):

Gambler’s fallacy
some people don’t want the same artist playing two or three times within a short time period.
people often think that if they haven’t won anything in a scratchcard lottery a couple of times in a row, they should have bigger chance of winning now. This phenomenon is called Gambler’s fallacy.

http://keyj.emphy.de/balanced-shuffle/

The problem with conventional shuffle algorithms is that they are too random. They lack fairness and uniform distribution.

The problem

Imagine that your music collection consists of tracks of three different genres, with a roughly equal number of tracks per genre. For example, you have 10 tracks of genre A, 11 of genre B and 11 of genre C. A truly random shuffle algorithm like it’s employed in almost all player programs and device firmwares would generate shuffle patterns like this one:

AACBBCBACABBCCACCCCABBACBACABABB

This example, short as it is, already exhibits the two main problems of random shuffle algorithms. The first one is the burst of four adjacent C’s in the middle of the sequence,

the second one is the lack of B’s in that area (there’s no B for 8 slots, which is quarter of the whole sequence!). Actually, these are two manifestations of the same problem: lack of uniform distribution. A truly random shuffle algorithm of course can, and will, take the liberty of generating such non-uniform spots.

Long story short, if you want shuffled music, you actually don’t want a random playback order. Instead, you want a good, uniformly distributed mix of all the tracks on your disk or flash. How about this:

ABCBCABACBACBCABCACBABCACBACBCAB

This pattern doesn’t have any unusual bursts. Instead it’s nicely alternating between the genres, but without being completely regular. In fact, there’s a simple rule that should be followed to construct good shuffle patterns:

Try to spread the tracks of each logical group as far away from each other as possible.

The concept

The basic principle of the algorithm is to split the music collection into multiple »logical groups«. The tracks inside each group are then shuffled to obtain a random playback order. If the group is still divisible into sub-groups (say, you grouped into genres, now you sub-group into artists), the Balanced Shuffle algorithm is used for this step. This way, the algorithm is applied recursively until there’s no further classification possible. In this case, the remaining tracks of this (sub-)group are considered equal in style and can thus be shuffled using a conventional method. When the track lists for all groups are collected, the main part of the algorithm (described below) begins: The per-group playlists are merged into a single one, taking care of the »maximum spread« principle.

http://algs4.cs.princeton.edu/11model/Knuth.java.html

    public static void shuffle(Object[] a) {
        int N = a.length;
        for (int i = 0; i < N; i++) {
            // choose index uniformly in [i, N-1]
            int r = i + (int) (Math.random() * (N - i));
            Object swap = a[r];
            a[r] = a[i];
            a[i] = swap;
        }
    }

Java's serialization algorithm

http://www.javaworld.com/article/2072752/the-java-serialization-algorithm-revealed.html

In general the serialization algorithm does the following:

It writes out the metadata of the class associated with an instance.
It recursively writes out the description of the superclass until it finds java.lang.object.
Once it finishes writing the metadata information, it then starts with the actual data associated with the instance. But this time, it starts from the topmost superclass.
It recursively writes the data associated with the instance, starting from the least superclass to the most-derived class.

An outline of the serialization algorithm

AC ED 00 05 73 72 00 0A 53 65 72 69 61 6C 54 65

73 74 05 52 81 5A AC 66 02 F6 02 00 02 49 00 07

76 65 72 73 69 6F 6E 4C 00 03 63 6F 6E 74 00 09

4C 63 6F 6E 74 61 69 6E 3B 78 72 00 06 70 61 72

65 6E 74 0E DB D2 BD 85 EE 63 7A 02 00 01 49 00

0D 70 61 72 65 6E 74 56 65 72 73 69 6F 6E 78 70

00 00 00 0A 00 00 00 42 73 72 00 07 63 6F 6E 74

61 69 6E FC BB E6 0E FB CB 60 C7 02 00 01 49 00

0E 63 6F 6E 74 61 69 6E 56 65 72 73 69 6F 6E 78

70 00 00 00 0B

AC ED: STREAM_MAGIC. Specifies that this is a serialization protocol.

00 05: STREAM_VERSION. The serialization version.

0x73: TC_OBJECT. Specifies that this is a new Object.

The first step of the serialization algorithm is to write the description of the class associated with an instance. The example serializes an object of type SerialTest, so the algorithm starts by writing the description of the SerialTest class.

0x72: TC_CLASSDESC. Specifies that this is a new class.

00 0A: Length of the class name.

53 65 72 69 61 6c 54 65 73 74: SerialTest, the name of the class.

05 52 81 5A AC 66 02 F6: SerialVersionUID, the serial version identifier of this class.

0x02: Various flags. This particular flag says that the object supports serialization.

00 02: Number of fields in this class.

Next, the algorithm writes the field int version = 66;.

0x49: Field type code. 49 represents "I", which stands for Int.

00 07: Length of the field name.

76 65 72 73 69 6F 6E: version, the name of the field.

And then the algorithm writes the next field, contain con = new contain();. This is an object, so it will write the canonical JVM signature of this field.

0x74: TC_STRING. Represents a new string.

00 09: Length of the string.

4C 63 6F 6E 74 61 69 6E 3B: Lcontain;, the canonical JVM signature.

0x78: TC_ENDBLOCKDATA, the end of the optional block data for an object.

The next step of the algorithm is to write the description of the parent class, which is the immediate superclass of SerialTest.

0x72: TC_CLASSDESC. Specifies that this is a new class.

00 06: Length of the class name.

70 61 72 65 6E 74: SerialTest, the name of the class

0E DB D2 BD 85 EE 63 7A: SerialVersionUID, the serial version identifier of this class.

0x02: Various flags. This flag notes that the object supports serialization.

00 01: Number of fields in this class.

Now the algorithm will write the field description for the parent class. parent has one field, int parentVersion = 100;.

0x49: Field type code. 49 represents "I", which stands for Int.

00 0D: Length of the field name.

70 61 72 65 6E 74 56 65 72 73 69 6F 6E: parentVersion, the name of the field.

0x78: TC_ENDBLOCKDATA, the end of block data for this object.

0x70: TC_NULL, which represents the fact that there are no more superclasses because we have reached the top of the class hierarchy.

POPULAR RESOURCES

WHITE PAPER

10 Best Practices for Log Management

WHITE PAPER

Coding with JRebel: Java Forever Changed

SEE ALL

Search Resources

So far, the serialization algorithm has written the description of the class associated with the instance and all its superclasses. Next, it will write the actual data associated with the instance. It writes the parent class members first:

00 00 00 0A: 10, the value of parentVersion.

Then it moves on to SerialTest.

00 00 00 42: 66, the value of version.

The next few bytes are interesting. The algorithm needs to write the information about the contain object, shown in Listing 8.

Listing 8. The contain object

contain con = new contain();

Remember, the serialization algorithm hasn't written the class description for the contain class yet. This is the opportunity to write this description.

0x73: TC_OBJECT, designating a new object.

0x72: TC_CLASSDESC.

00 07: Length of the class name.

63 6F 6E 74 61 69 6E: contain, the name of the class.

FC BB E6 0E FB CB 60 C7: SerialVersionUID, the serial version identifier of this class.

0x02: Various flags. This flag indicates that this class supports serialization.

00 01: Number of fields in this class.

Next, the algorithm must write the description for contain's only field, int containVersion = 11;.

0x49: Field type code. 49 represents "I", which stands for Int.

00 0E: Length of the field name.

63 6F 6E 74 61 69 6E 56 65 72 73 69 6F 6E: containVersion, the name of the field.

0x78: TC_ENDBLOCKDATA.

Next, the serialization algorithm checks to see if contain has any parent classes. If it did, the algorithm would start writing that class; but in this case there is no superclass for contain, so the algorithm writes TC_NULL.

0x70: TC_NULL.

Finally, the algorithm writes the actual data associated with contain.

00 00 00 0B: 11, the value of containVersion.

Interview Misc

https://mp.weixin.qq.com/s/MsS3HwuA2fPqPNs0CHWS-Q

评论的回复中，我提到，自己会在面试中问候选人这个问题。不少同学质疑，问这个问题有什么意义，究竟想考察的是什么。这个问题很严肃，且很重要，它脱离了mc和redis本身，涉及到技术人的自我要求。

我在面试中，会如何向候选人提问？

一般，我不会预设任何问题，更不会问我擅长的领域。

我会先看简历上写了什么，比如“在项目中使用redis作为缓存存储”，我可能就会问，使用redis存储了什么，key是什么，value是什么，为什么选择redis而不是其他缓存。

所以，是不是问redis并不是重点，重点是简历写了什么。如果在简历上写，“在项目中使用hash作为内存存储结构”，我可能就会问，使用hash存储了什么，key是什么，values是什么，为什么选择hash而不是其他数据结构。

有一些候选人，会说：

“架构师设计的，我只是使用”

或者

“公司要求统一使用redis作为缓存存储”

又或者

“我比较熟悉redis”

这类回答，是比较减分的，作为技术人，自己千万不能把自己当做“码农”，而要把自己当做“设计师”，日常工作中不能只是为了“完成交代下来的任务”。

还有一些候选人，他会进一步解释，例如：

“因为redis支持集群高可用，redis集群支持固化，所以选择了redis”

这类回答，说明候选人对redis进行过专门的学习，应该会非常好学。但是，ta们未必经得起后续的系列问题：

“既然缓存的是用户信息，需要高可用么？”

回复：貌似不需要。

“既然缓存的是订单信息，需要固化么？”

回复：貌似不需要。

“那为什么还要选型redis呢？

任何脱离业务的架构设计，方案设计，技术选型都是耍流氓。

这不是一句空话。

做技术方案，技术选型的时候，一定是针对业务需求来折衷的。

假如存储的是用户信息，key是uid，value是User实体，当缓存挂了的时候，如果不会因为流量压到数据库而导致雪崩，此时缓存未必需要redis的集群功能。

假如存储的是订单信息，key是oid，value是Order实体，只作为缓存使用，允许cache miss读取数据库，此时未必需要redis的固化功能。

选择，因为在某个场景下，ta适合。

其实，我并没有对自己的提问，预设任何答案，只要候选人的思路是清晰的，逻辑是自洽的，即使给出的未必是最优的方案，也是能让人眼前一亮。

我们，因为项目的压力，历史的包袱，做出妥协性设计方案的次数还少么？技术人，清楚用什么，清楚怎么用还不够，更重要的是明白为什么。

技术人，需要一些情怀，多问自己一句为什么，对自己有好处。

http://yuanhsh.iteye.com/blog/2194627

http://jiangweiyu.tumblr.com/page/5

T家：

什么是僵尸进程？如何列出所有的僵尸进程？

如何获得这些僵尸进程的 PID？如何把它们终止？我用 ps grep awk xargs 配合管道把这几个问题用一行命令行解决了，他很满意。
http://how-to.wikia.com/wiki/How_to_display_and_kill_zombie_processes

ps aux | awk '"[Zz]" ~ $8 { printf("%s, PID = %d\n", $8, $2); }'

Kill the zombies

zombies are living dead, so the aren't always easy to kill.

Try executing: kill -9 PID
- example: kill -9 5067
See is the zombie is still alive, undead
- use the techniques above

If its still undead
- get a cross or garlic, well reliable sources tell me the don't work. We must try something else
Kill the zombie's parent (process)
execute: ps ef
- this will display a process (family) tree
- find the command who is the PID matches the zombie then look at the parents and try killing them
example:

PID TTY      STAT   TIME COMMAND
5102 pts/3    Ss     0:00 bash MANPATH=/usr/local/share/man:/usr/share/man:/usr
5229 pts/3    R+     0:00  \_ ps ef MANPATH=/usr/local/share/man:/usr/share/man
4929 tty1     S+     0:00 -bash TERM=linux HOME=/home/zymos SHELL=/bin/bash USE
4954 tty1     S+     0:00  \_ /bin/sh /usr/bin/startx MANPATH=/usr/local/share/
4970 tty1     S+     0:00      \_ xinit /home/zymos/.xinitrc -- -nolisten tcp -
4975 tty1     S      0:01          \_ icewm MANPATH=/usr/local/share/man:/usr/s
4977 tty1     S      0:01              \_ xterm MANPATH=/usr/local/share/man:/u
4987 pts/0    Ss     0:00              |   \_ bash MANPATH=/usr/local/share/man
5090 pts/0    Sl+    1:31              |       \_ ./hydranode MANPATH=/usr/loca
5092 pts/0    Sl+    0:29              |           \_ ./hydranode-core --disabl
4978 tty1     S      0:00              \_ xterm -e /home/zymos/.icewm/startup M
4986 pts/1    Ss+    0:00              |   \_ /bin/sh /home/zymos/.icewm/startu
4993 pts/1    S+     0:00              |       \_ gaim    MANPATH=/usr/local/sh
4994 pts/1    S+     0:00              |       \_ /bin/bash /usr/libexec/mozill
5030 pts/1    Sl+    1:29              |       |   \_ /usr/lib/mozilla-firefox/
5067 pts/1    Z+     0:00              |       |       \_ [netstat] <defunct>
4995 pts/1    S+     0:00              |       \_ xterm MANPATH=/usr/local/shar
5032 pts/2    Ss+    0:00              |       |   \_ bash MANPATH=/usr/local/s
5085 pts/1    S+     0:00              |       \_ boinc_client MANPATH=/usr/loc
4981 tty1     S      0:00              \_ icewmtray MANPATH=/usr/local/share/ma
5060 pts/1    S+     0:00 /usr/libexec/gconfd-2 13 MANPATH=/usr/local/share/man

netstat matches the PID above, 5067
- so in this example I would try killing
  - 5030 pts/1 Sl+ 1:29 /usr/lib/mozilla-firefox/
  - 4978 tty1 S 0:00 xterm -e /home/zymos/.icewm/startup M
    
    Or
  - 4986 pts/1 Ss+ 0:00 /bin/sh /home/zymos/.icewm/startu
Hopefully this will work

If you need a single line command that does the work you can try

ps axu | awk '"[Zz] ~ $8 { system(sprintf("kill -HUP %d", $2)); }'

This will work in most cases
- Check if the zombie process still exists; if so, run the same command but change -HUP to -TERM.
- If the zombie is still there, use -9 instead of -HUP or -TERM.
  - -9 will terminate it for sure but it's considered very bad practice
  - FYI - The above is way too much work. Use killall <process-name> e.g. killall firefox

http://stackoverflow.com/questions/16944886/how-to-kill-zombie-process
You can clean up a zombie process by killing its parent process with the following command:

kill -HUP $(ps -A -ostat,ppid | grep -e '[zZ]'| awk '{ print $2 }')

A zombie is already dead, so you cannot kill it. To clean up a zombie, it must be waited on by its parent, so killing the parent should work to eliminate the zombie. (After the parent dies, the zombie will be inherited by init, which will wait on it and clear its entry in the process table.) If your daemon is spawning children that become zombies, you have a bug. Your daemon should notice when its children die and wait on them to determine their exit status.

Example command:

kill $(ps -A -ostat,ppid | awk '/[zZ]/{print $2}')

第二问是统计一个 httpd 的访问日志中，访问量最大的前五个 IP。- shell

我用 Ruby 写了一个，六行代码就能搞定。

接着面试官了解 DNS 吗，如果浏览器没法上网一般怎么诊断？我的回答是先 dig 看
DNS 解析是否正确，然后用 ping 判断 IP 是否可以访问，再用 curl 看是不是浏览器
设置的问题。最后还可以用类似 www.websitedown.info 的服务检查。

第四个问题是怎么把一个文件复制到多个机器上？我说可以先用 ssh-copy-id 把公钥
拷过去，然后再用 scp 拷文件。面试官说这样机器多就麻烦了。我解释到可以用
expect 写脚本自动输入密码，还加了句当然也可以用 Twitter 的开源工具 murder 分
布式部署。

第五问是文件系统中 soft link 和 hard link 的区别。我的解释是 soft link 是一
种特殊的文件，它的内容是被指向的文件的路径，而 hard link 是直接指向 inode 的
。所以 soft link 可以用于目录，但是 hard link 不可以。

文件系统中 inode 和 path 的区别。我回答是 inode 是文件系统的一个数据结构，指
向某个磁盘上的文件；而 path 是由多个 struct dentry 组成的，每个 dentry 描述
了 inode 的父子关系。

最后一问是如何修改 DNS 服务器？我说可以修改 /etc/resolve.conf。

G家：
phone screen interview

First round：
1. check if a string is anagram （geeks很喜欢问这个问题，遇到说话磕磕绊绊但
发音标准的哥哥，应该就是geek了）
2. Given a matrix which contains black and white grids, use a method to find
out if the white grids are connected or not, if yes, return true.
第一次想的是recursion，后来写的时候用的是DP，但是发现不对，想了一下，用了图
论的解法，满打满算的写完了。之后不到半个小时就安排了第二次面试

Second round:
1. Design an API which is used for a cache big data
2. Given a string, find longest substring which contains just two unique characters.

http://frank19900731.github.io/blog/2014/11/25/mian-shi-jing-yan-fen-xiang-zhi-bian-cheng-yu-yan-ti/
double a = 0; double b = 99; while(true) { double c = (a+b)/2; if(c == a || c == b) break; a = b; b = c; } cout<<a;

浮点数判等，会无限循环么？由于浮点数精度有限，所以这个程序迭代个五六十次就终止了，又由于 cout 有默认的输出精度，所以输出结果应该是66。

面向对象编程的各类问题，多态、虚函数等；
C++ 和 Java 的区别有哪些；
你用过这种语言的哪些库；
对 STL 是否熟悉；
Java 虚拟机了解么，解释垃圾回收机制的原理；
线程/进程间通信如何实现，写过多线程的程序么；
大数据处理框架用过哪些（其实就算是编程语言的杀手应用吧）；
机器学习、数据可视化的库用过哪些；
java 中 final 的用法，C/C++ 中的 const 的用法；
MySQL 的如何建立索引进行查询的，存储方式是怎样的，为什么要分块；
堆区和栈区的区别在哪里；
malloc 执行的时候发生了那些事情；
正则表达式的写法；
vim 中执行某项操作的快捷键；

高频题目提供了可比较性，有筛选价值，而即便你背了网上的解答也可能是不全面的（我一般把网上看到的和自己的使用经验一起说，显得比较真实…）

http://frank19900731.github.io/blog/2014/11/18/mian-shi-jing-yan-fen-xiang-zhi-shu-ju-jie-gou-suan-fa-fen-xiang/

思维益智题考察你分析问题的能力，大部分可以归结到二分、动态规划、递归上，重要的是思路，其次是尽量低的复杂度，再次是代码的正确性。

单向网络，起点和终点唯一且连通，寻找那些一旦被删除将导致起点终点无法连通的点；
有序数组寻找和为某数的一对数字；

这类开放题目让你自主选择数据结构，主要是考察求职者对于数据结构的特性与使用场景的综合理解，在面对具体应用场景时能否运用已有的数据结构和算法知识提出合理的解决方案。一般来说在这类问题里哈希表的出场率会比较高……例题如下

大数据量的 url log，怎么去重且统计每个 url 的出现次数，复杂度分析；
设计 cache 系统
- 设计 cache 的接口；
- 可以用什么数据结构实现；
- 如何实现可伸缩的容量；
- cache 的空间管理策略，比如 cache 哪些条目，何时清理；
- cache 系统启动时分配多大的空间，之后按照怎样的策略增大；
设计爬虫；
流媒体播放客户端从多个完全相同的发送方接收视频包，同一发送方的包会按序到达，不同发送方的包则不一定，有可能会丢包，但还是要保证播放流畅度，设计播放客户端的算法。

训练自己算法复杂度的分析能力，有的时候对复杂度的分析会反过来会帮助你找到更好的算法；
一定量的题目积累很重要，就好像准备高考数学压轴题一样，见识的越多，思路来得越快，当然，前提是你能够不断总结反思。

Saturday, October 31, 2015

Playlist Shuffle Algorithm

Java's serialization algorithm

Interview Misc

Kill the zombies

Labels

Popular Posts