Massive Technical Interviews Tips: Hashing Misc

Wednesday, September 30, 2015

Hashing Misc

String.hashCode()

String.hashCode() is a very poor hash, especially for the lower bits. It is standard and cannot be changed or break backward compatibility. However, this doesn't have to be a problem as HashMap uses an agitate function which brings down some of the higher bits to randomise the lower ones.

HashMap

int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

http://www.javamex.com/tutorials/collections/hash_function_technical.shtml
http://www.javamex.com/tutorials/collections/hash_function_technical_2.shtml
as is actually quite typical with strings, characters are biased towards certain bits being set.

The trick of the string hash function is to try and iron out this bias by mixing bits with a biased probability with other bits, in particular those where the probability is already fairly unbiased.

http://stackoverflow.com/questions/9406775/why-does-string-hashcode-in-java-have-many-conflicts

This looks quite confusing to me because it has so many conflicts; although it's not required to be unique (we can still rely on the equals()), but less conflicts means better performance without visiting the entries in a linked list.

Suppose we've two characters, then so long as we can find two strings matching below equation, then we will have the same hashcode()

a * 31 +b = c * 31 +d

It will be easy to conclude that (a-c) * 31 = d-b take an easy example is make a-c = 1 and d-b = 31; so i wrote below codes for simple test

public void testHash() {
    System.out.println("A:" + (int)'A');
    System.out.println("B:" + (int)'B');
    System.out.println("a:" + (int)'a');

    System.out.println("Aa".hashCode() + "," + "BB".hashCode());
    System.out.println("Ba".hashCode() + "," + "CB".hashCode());
    System.out.println("Ca".hashCode() + "," + "DB".hashCode());
    System.out.println("Da".hashCode() + "," + "EB".hashCode());        
}

it will print below results which means all the strings have the same hashcode(), and it's easy to do it in a loop.

even worse, suppose we've 4 characters in the string, according to the algorithm, suppose the first 2 characters produce a2, the 2nd 2 characters produce b2; the hashcode will still be a2 * 31^2 + b2 thus, with a2 and b2 equal between 2 strings, we will get more strings with hashcode() conflict. such examples are "AaAa", "BBBB" and so on; then we will have 6 characters, 8 characters......

suppose most of the time we use characters in ascii table in a string which will be used in a hashmap or hashtable, then the choosen prime number 31 here is definitely too small;

Nobody in their right mind would expect String.hashCode to be highly collision free. It is simply not designed with that in mind. (If you want highly collision free hashing, then use a crypto hash algorithm ... and pay the cost.) String.hashCode() is designed to be reasonably good across the domain of all Strings ... and fast.

https://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html

http://www.ocert.org/advisories/ocert-2011-003.html

Any server in the world running the same version of Ruby will get the same hash values. This means that I can send a specially crafted web request to your server, in which the request parameters contain lots of those strings with the same hash code. Your web framework will probably parse the parameters into a hash table, and they will all end up in the same hash bucket, no matter how big you make the hash table. Whenever you want to access the parameters, you now have to iterate over a long list of hash collisions, and your swift O(1) hash table lookup is suddenly a crawling slow O(n).

I just need to make a small number of these evil requests to your server and I’ve brought it to its knees.

The solution is for the hash code to be consistent within one process, but to be different for different processes.

I understand the use of hashCode() — it’s very tempting. On strings, numbers and collection classes, hashCode() always returns a consistent value, apparently even across different JVM vendors. It’s like that despite the documentation for hashCode()explicitly not guaranteeing consistency across different processes:

Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.

An introduction to optimising a hashing strategy

To reduce the impact of a poor hashing strategy, the HashMap uses an agitating function. In Java 8 it is fairly simple.

static final int hash(Object key) {

int h;

return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);

}

This mixes the high bits of the hash code with the low bits to improve the randomness of the lower bits.

String

public int hashCode() {

int h = hash;

if (h == 0 && value.length > 0) {

char val[] = value;

for (int i = 0; i < value.length; i++) {

h = 31 * h + val[i];

}

hash = h;

}

return h;

}

Unique hashCodes is not enough to avoid collisions.

I Just Need Unique Hash Codes, Don't I?

A common misconception is that to avoid collisions all you need is to have a unique hash code. While having unique hash codes is highly desirable, that is not enough.

The number of buckets needs to be smaller and the hash codes need to be modulo-ed to the number of buckets. If the number of buckets is a power of 2, you can use a mask of the lowest bits.

The problem is that the size of a HashMap is not unlimited (or at least 2^32 in size) This means the hashCode() number has to be reduced to a smaller number of bits.

A simple solution is to have a bucket turn into a tree instead of a linked list. In Java 8, it will do this for String keys, but this could be done for all Comparable types AFAIK.

Another approach is to allow custom hashing strategies to allow the developer to avoid such problems, or to randomize the mutation on a per collection basis, amortizing the cost to the application.

http://stackoverflow.com/questions/7032965/how-do-i-figure-out-and-change-which-version-of-java-maven-is-using-to-execute

mvn -version will output which java it's using. If JAVA_HOME is set to a valid JDK directory and Maven is using something else, then most likely someone has tampered with the way that Maven starts up.

http://sighingnow.github.io/%E5%88%86%E5%B8%83%E5%BC%8F%E7%B3%BB%E7%BB%9F/consistent_hashing_algorithm.html

http://calvin1978.blogcn.com/articles/murmur.html
有些人看到murmur就想到了陌陌就想到了别的，其实是 multiply and rotate的意思，因为算法的核心就是不断的"x *= m; x = rotate_left(x,r);"

Java的HashCode

那Java自己的String的hashCode()呢？用的是Horner法则， s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

实现上是循环原哈希值×31再加下一个Char的值，算法及其复杂度简单一目了然，连我都能看懂。

for (int i = 0; i < str.length(); i++) { hash = 31*hash + str.charAt[i]; }

注意，int会溢出成负数，再溢成正数，比如"abcde"是 92599395， abcdef"是 -1424385949， "abcdefgh" 是 1259673732

Eclipse的自动代码生成的hashCode()函数也是类似的，循环原哈希值×31，再加下一个属性的hashCode()值。

31有个很好的特性，就是用移位和减法来代替乘法，可以得到更好的性能：31*i==(i<<5)-i。现在的VM可以自动完成这种优化。 Java自己的HashMap，会在调用Key的hashCode()得到一个数值后，用以下算法再hash一次，免得Key自己实现的hashCode()质量太差。

static int hash(int h) {
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}

那为什么Cassandra们不选简单直白的Horner法则呢？

我猜原因之一是它的性能，有个评测用js来跑的，还是murmur好些。

再猜原因之二它的变化不够激烈，比如"abc"是96354， "abd"就比它多1。而用 murmur"abc"是1118836419，"abd"是413429783。像在Jedis里，它的几个Shard节点的名字就叫做shard-1,shard-2，shard-3而已，hash这几个字符串，murmur的结果还能均匀的落在一致性哈希环上，用Honer法则就不行了。

Wednesday, September 30, 2015

Hashing Misc

String.hashCode()

Java的HashCode

Labels

Popular Posts