Monday, November 16, 2015

Java: HashCode Implementation Details

http://my.oschina.net/chihz/blog/56256

在编写类的时候，如果覆盖了Object的equals方法，那么必须要覆盖hashCode方法，并且如果两个对象用equals方法比较返回true，那么这两个对象hashCode返回的值也必须是相等的，并且对于同一个对象，equals方法需要比较的属性值没有被修改，那么每次调用hashCode返回的值应该是一致的。

hashCode主要是用于散列集合，通过对象hashCode返回值来与散列中的对象进行匹配，通过hashCode来查找散列中对象的效率为O(1)，如果多个对象具有相同的hashCode，那么散列数据结构在同一个hashCode位置处的元素为一个链表，需要通过遍历链表中的对象，并调用equals来查找元素。这也是为什么要求如果对象通过equals比较返回true,那么其hashCode也必定一致的原因。

Google首席Java架构师Joshua Bloch在他的著作《Effective Java》中提出了一种简单通用的hashCode算法

1. 初始化一个整形变量，为此变量赋予一个非零的常数值，比如int result = 17;

2. 选取equals方法中用于比较的所有域，然后针对每个域的属性进行计算：

(1) 如果是boolean值，则计算f ? 1:0

(2) 如果是byte\char\short\int,则计算(int)f

(3) 如果是long值，则计算(int)(f ^ (f >>> 32))

(4) 如果是float值，则计算Float.floatToIntBits(f)

(5) 如果是double值，则计算Double.doubleToLongBits(f)，然后返回的结果是long,再用规则(3)去处理long,得到int

(6) 如果是对象应用，如果equals方法中采取递归调用的比较方式，那么hashCode中同样采取递归调用hashCode的方式。否则需要为这个域计算一个范式，比如当这个域的值为null的时候，那么hashCode 值为0

(7) 如果是数组，那么需要为每个元素当做单独的域来处理。如果你使用的是1.5及以上版本的JDK，那么没必要自己去重新遍历一遍数组，java.util.Arrays.hashCode方法包含了8种基本类型数组和引用数组的hashCode计算，算法同上，

java.util.Arrays.hashCode(long[])的具体实现:

public static int hashCode(long a[]) {

        if (a == null)

            return 0;

        int result = 1;

        for (long element : a) {

            int elementHash = (int)(element ^ (element >>> 32));

            result = 31 * result + elementHash;

        }

        return result;

}

Arrays.hashCode(...)只会计算一维数组元素的hashCOde,如果是多维数组，那么需要递归进行hashCode的计算，那么就需要使用Arrays.deepHashCode(Object[])方法。

3. 最后，要如同上面的代码，把每个域的散列码合并到result当中：result = 31 * result + elementHash;

4. 测试，hashCode方法是否符合文章开头说的基本原则，这些基本原则虽然不能保证性能，但是可以保证不出错。

这个算法存在这么几个问题需要探讨:

1. 为什么初始值要使用非0的整数?这个的目的主要是为了减少hash冲突，考虑这么个场景，如果初始值为0,并且计算hash值的前几个域hash值计算都为0，那么这几个域就会被忽略掉，但是初始值不为0,这些域就不会被忽略掉，示例代码:

import java.io.Serializable;

public class Test implements Serializable {

    private static final long serialVersionUID = 1L;

    private final int[] array;

    public Test(int... a) {

        array = a;

    }

    @Override

    public int hashCode() {

        int result = 0; //注意，此处初始值为0

        for (int element : array) {

            result = 31 * result + element;

        }

        return result;

    }

    public static void main(String[] args) {

        Test t = new Test(0, 0, 0, 0);

        Test t2 = new Test(0, 0, 0);

        System.out.println(t.hashCode());

        System.out.println(t2.hashCode());

    }

}

如果hashCode中result的初始值为0，那么对象t和对象t2的hashCode值都会为0，尽管这两个对象不同。但如果result的值为17,那么计算hashCode的时候就不会忽略这些为0的值，最后的结果t1是15699857，t2是506447

2. 为什么每次需要使用乘法去操作result? 主要是为了使散列值依赖于域的顺序，还是上面的那个例子，Test t = new Test(1, 0)跟Test t2 = new Test(0, 1), t和t2的最终hashCode返回值是不一样的。

3. 为什么是31? 31是个神奇的数字，因为任何数n * 31就可以被JVM优化为 (n << 5) -n,移位和减法的操作效率要比乘法的操作效率高的多。

另外如果对象是不可变的，那么还推荐使用缓存的方式，在对象中使用一个单独的属性来存储hashCode的值，这样对于这个对象来说hashCode只需要计算一次就可以了。

private volatile int hashCode = 0;

@Override

public int hashCode() {

    int result = hashCode;

    if(result == 0) {

        ...//计算过程

    }

    return result;

}

注意，缓存属性必须是volatile的，这样可以在并发访问环境中保持内存可见性。否则会产生线程安全问题。

org.apache.commons.lang.builder.HashCodeBuilder的实现

org.apache.commons.lang.builder.HashCodeBuilder能够通过反射机制来自动计算对象的hashCode值。其核心方法是静态方法：reflectionHashCode，这个方法有多个重载的版本，这些重载版本最终都是调用的 int reflectionHashCode( intinitialNonZeroOddNumber, int multiplierNonZeroOddNumber, Object object,

boolean testTransients, Class reflectUpToClass, String[] excludeFields)

其余的版本只是提供不同的默认参数，从而简化了构建的过程。比如:

1

2

3

public static int reflectionHashCode(Object object) {

        return reflectionHashCode(17, 37, object, false, null, null);

}

然后构建的过程是这样的，除了指定过滤的，比如transient属性、excludeFields指定的属性之外，会遍历其它的类属性，然后通过反射的方式获取属性值，如果属性是数组，则会遍历数组，否则会调用属性的hashCode, 如果是多维数组，会去递归取hashCode值，对单个属性计算hash值的代码如下：

public HashCodeBuilder append(Object object) {

        if (object == null) {

            iTotal = iTotal * iConstant;

        } else {

            if(object.getClass().isArray()) {

                // 'Switch' on type of array, to dispatch to the correct handler

                // This handles multi dimensional arrays

                if (object instanceof long[]) {

                    append((long[]) object);

                } else if (object instanceof int[]) {

                    append((int[]) object);

                } else if (object instanceof short[]) {

                    append((short[]) object);

                } else if (object instanceof char[]) {

                    append((char[]) object);

                } else if (object instanceof byte[]) {

                    append((byte[]) object);

                } else if (object instanceof double[]) {

                    append((double[]) object);

                } else if (object instanceof float[]) {

                    append((float[]) object);

                } else if (object instanceof boolean[]) {

                    append((boolean[]) object);

                } else {

                    // Not an array of primitives

                    append((Object[]) object);

                }

            } else {

                iTotal = iTotal * iConstant + object.hashCode();

            }

        }

        return this;

    }

这里要小心，因为它是直接调用属性对象的hashCode,如果是基本类型，那么就会调用包装器中提供的hashCode方法，如果是引用类型，那么需要仔细检查引用类型的hashCode方法，以免产生违反hashCode基本原则的情况。

然后剩下的计算过程，同Effective Java中描述的基本类似，不过这里的hash初值和乘数因子可以自己来设置，默认的情况是使用了17 和 37两个互质数。

HashCodeBuilder最大好处是使用方便，一行代码就能搞定hashCode的重写问题，并且让代码很清晰，但是它有这么几个值得注意的地方:

1. 使用反射会对程序的性能造成影响，不过Java反射机制为了把性能影响降到最低，对类似getFields()之类的操作都采用了Cache策略，对一般的程序而言，这些性能开销往往可以忽略不计。另外如果使用的是不可变对象，那么强烈建议把hashCode Cache住，这样可以极大的提高hashCode计算的性能。

2. 因为默认会处理所有的field(除了transient修饰的field),所以一定要测试是否违反hashCode的基本原则(为了保障基本原则的正确，建议跟org.apache.commons.lang.EqualsBuilder搭配使用)，尤其是当类的域中包含引用类型的时候，一定要递归检查引用类型的hashCode.

Double Hashcode
http://stackoverflow.com/questions/9650798/hash-a-double-in-java

Double.hashCode() complex? It basically converts double into a long (no magic here, after all they both are simply 64-bit values in memory) and computing long hash is quite simple. The double -> long conversion is done via public static doubleToLongBits(). What is complex about this?

Examples:

Double.valueOf(42.5).hashCode();        //better answer to everything

Long.valueOf(Double.doubleToLongBits(42.5)).hashCode();

public static long doubleToLongBits(double value) {

long result = doubleToRawLongBits(value);

// Check for NaN based on values of bit fields, maximum

// exponent and nonzero significand.

if ( ((result & DoubleConsts.EXP_BIT_MASK) ==

DoubleConsts.EXP_BIT_MASK) &&

(result & DoubleConsts.SIGNIF_BIT_MASK) != 0L)

result = 0x7ff8000000000000L;

return result;

}

public static native long doubleToRawLongBits(double value);

http://javatutorialhq.com/java/lang/double-class-tutorial/doubletorawlongbits-method-example/

Some takeaways on the doubleToRawLongBits(double value) method:

Returns a representation of the specified floating-point value according to the IEEE 754 floating-point “double format” bit layout, preserving Not-a-Number (NaN) values.
If the argument is positive infinity, the result is 0x7ff0000000000000L.
If the argument is negative infinity, the result is 0xfff0000000000000L.
If the argument is NaN, the result is the long integer representing the actual NaN value. Unlike the doubleToLongBits method, doubleToRawLongBits does not collapse all the bit patterns encoding a NaN to a single “canonical” NaN value.

http://stackoverflow.com/questions/1074781/double-in-hashmap

Short answer: It probably won't work.

Honest answer: It all depends.

Longer answer: The hash code isn't the issue, it's the nature of equal comparisons on floating point. As Nalandial and the commenters on his post point out, ultimately any match against a hash table still ends up using equals to pick the right value.

So the question is, are your doubles generated in such a way that you know that equals really means equals? If you read or compute a value, store it in the hash table, and then later read or compute the value using exactly the same computation, then Double.equals will work. But otherwise it's unreliable: 1.2 + 2.3 does not necessarily equal 3.5, it might equal 3.4999995 or whatever. (Not a real example, I just made that up, but that's the sort of thing that happens.) You can compare floats and doubles reasonably reliably for less or greater, but not for equals.

The problem is not the hash code but the precision of the doubles. This will cause some strange results. Example:

    double x = 371.4;
    double y = 61.9;
    double key = x + y;    // expected 433.3

    Map<Double, String> map = new HashMap<Double, String>();
    map.put(key, "Sum of " + x + " and " + y);

    System.out.println(map.get(433.3));  // prints null

The calculated value (key) is "433.29999999999995" which is not EQUALS to 433.3 and so you don't find the entry in the Map (the hash code probably is also different, but that is not the main problem).

If you use

map.get(key)

it should find the entry... []]

http://buttercola.blogspot.com/2015/11/linkedin-hashcode-of-string-in-java.html

    public static int hashCode(String s) {

        if (s == null || s.length() == 0) {

            return 0;

}

        int hash = 0;

        for (int i = 0; i < s.length(); i++) {

            hash = s.charAt(i) + 31 * hash;

}

        return hash;

}

    public static void main(String[] args) {

        String s = "call";

        int result = hashCode(s);

        System.out.println(result);

}

http://java-bytes.blogspot.com/2009/10/hashcode-of-string-in-java.html

public int hashCode() {

int h = hash;

if (h == 0) {

int off = offset;

char val[] = value;

int len = count;

for (int i = 0; i < len; i++) {

h = 31*h + val[off++];

}

hash = h;

}

return h;

}

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

This wasn't sufficient for me, to understand why 31 is used. I did a bit of research and found some good links, providing some info about why 31 is used. Here are some links with very good info: Stack Overflow - Why does java hashcode use 31 as a multiplier Apart from the above link, there are a couple of other links with pretty good info about hashing, hash collisions and performance of hashing algorithms: Consistency Of Hashcode On A Java String Best Hashing Algos, In Terms Of Collisions and Performance Java Array Hashcode Implementation

http://www.allenlipeng47.com/blog/index.php/2016/01/23/contain-function-in-hashcode/

public class test {

    static class Point {
        int a;
        public Point(int a) {
            this.a = a;
        }
    }

    public static void test1() {
        Point element = new Point(1);
        HashSet<Point> hs = new HashSet<Point>();
        hs.add(element);
        element.a = 2;
        System.out.println(hs.contains(element));
    }

    public static void test2() {
        HashSet<Integer> element = new HashSet<Integer>();
        HashSet<HashSet<Integer>> hs = new HashSet<HashSet<Integer>>();
        hs.add(element);
        element.add(2);
        System.out.println(hs.contains(element));
    }

    public static void main(String[] args) {
        test1();
        test2();
    }

}

final Entry<K,V> getEntry(Object key) {
    if (size == 0) {
        return null;
    }

    int hash = (key == null) ? 0 : hash(key);
    for (Entry<K,V> e = table[indexFor(hash, table.length)];
         e != null;
         e = e.next) {
        Object k;
        if (e.hash == hash &&
            ((k = e.key) == key || (key != null && key.equals(k))))
            return e;
    }
    return null;
}

We can see contains() function both checks hashCode() and equals() function. Only when 2 functions return true, it will return true.

Accordingly, if I change the Point class to below, test1() will return false too.

static class Point {
    int a;
    public Point(int a) {
        this.a = a;
    }
    public int hashCode() {
        return a;
    }
}

Ideally, the element in HashSet() should be immutable. Or if we want to change, we should do remove, update, put operations.

Or another way is that we should guarantee that hashCode for new element should be the same as before, and new element should equal to before. But this is actually a worse way. Remember ideally, element in HashSet should be immutable.

public boolean containsKey(Object key) {
return getNode(hash(key), key) != null;
}
final Node<K,V> getNode(int hash, Object key) {
Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
if ((tab = table) != null && (n = tab.length) > 0 &&
(first = tab[(n - 1) & hash]) != null) {
if (first.hash == hash && // always check first node
((k = first.key) == key || (key != null && key.equals(k))))
return first;
if ((e = first.next) != null) {
if (first instanceof TreeNode)
return ((TreeNode<K,V>)first).getTreeNode(hash, key);
do {
if (e.hash == hash &&
((k = e.key) == key || (key != null && key.equals(k))))
return e;
} while ((e = e.next) != null);
}
}
return null;
}
http://www.allenlipeng47.com/blog/index.php/2016/01/26/hashcode-for-int/

int[] a1 = {1, 2};
int[] a2 = {1, 2};
System.out.println(a1.hashCode());
System.out.println(a2.hashCode());
System.out.println(Arrays.hashCode(a1));
System.out.println(Arrays.hashCode(a2));

The result is:

The reason is because hashCode() int[] is not overwritten. It is treated as an object. Read below explanation from javaworld, we know hashCode() for Object is a mapping to memory.

Basically the default implementation of hashCode() provided by Object is derived by mapping the memory address to an integer value. If look into the source of Object class , you will find the following code for the hashCode.

Monday, November 16, 2015

Java: HashCode Implementation Details

org.apache.commons.lang.builder.HashCodeBuilder的实现

Examples:

Labels

Popular Posts