http://www.mapdb.org/blog/better_primitive_collections_proposal/
https://dzone.com/articles/whats-wrong-with-java-boxed-numbers
http://java-performance.info/bit-sets/
So there is need for new primitive collection. It should store data in-memory, but also be swappable to disk if free memory runs low. On-heap
long[]
is not really swappable, Garbage Collector might randomly touch an array and load it from swap. Also data should be persisted between JVM restarts to save import time (without crash protection). That sounds like a job for memory mapped files.
Bit of theory: Collections such as HashSet<Long> store data in an array:
Long[]
. Primitive collections remove boxing: long[]
. But one could also use raw binary storage such as byte[]
, DirectByteByffer
, RandomAccessFile
or MappedByteBuffer
. Access to longs
will get slightly complicated, but most storage options provide methods such as setLong(index,value)
or getLong(index)
and are very similar to arrays.
Primitive collections relay on code generators to combine various types together. There are many classes such as
IntObjectHashMap<V>
or IntLongHashMap<V>
. So it should be possible to modify code generator and add one extra parameters for storage. Than there will be classes such as MmapIntObjectHashMap<V>
or HeapIntLongHashMap<V>
.
https://github.com/palantir/trove-3.0.3/blob/master/generated-src/gnu/trove/list/array/TLongArrayList.java
https://github.com/palantir/trove-3.0.3/blob/master/generated-src/gnu/trove/list/linked/TIntLinkedList.java
https://github.com/palantir/trove-3.0.3/blob/master/generated-src/gnu/trove/list/linked/TIntLinkedList.java
Trove primitive sets use one array to store set values (using open indexed approach) and one more array of bytes to save each cell status: open, used or deleted. Besides, there are
THashSet<E>
and TLinkedHashSet<E>
, which can replace java.util.HashSet
and java.util.LinkedHashSet
from JDK. These object sets don’t use a separate array of bytes to keep the slot state (unlike primitive sets).
Both JDK sets are based on the
HashMap<E,Object>
, where Object
values are just cell state flags. HashMaps
, in turn, are based on array of Map.Entry
. As a result, JDK HashSet
uses three Objects
per each stored values. This is especially expensive in case when you need to store primitive values in a set.
JDK
LinkedHashSet
is built on top of HashSet
by adding double linked references to each map entry to keep order in which records have been inserted
Note that for a set of integer values (
short/int/long
) it may be worth to replace such set with a bit set. Read "Bit sets" article for more details.http://java-performance.info/bit-sets/
Do not forget about bit sets when you need to map a large number of integer keys to boolean flags.
Sets of integer values should be replaced with bit sets in a lot of cases in order to save a lot of memory.
LongBitSet
to deal with BitSet
problems
Let’s try to deal with these problems. We will implement a
LongBitSet
– a class similar to java.util.BitSet
. The only difference is that its keys will be long
instead of int
.
A
LongBitSet
will be backed by Map<Long, BitSet>
. The idea is to use high bits of a key as a key to this map and low bits as a key to a value BitSet
. Thus we will support both partitioning and long
keys. Problem is solved.
What about size of each
BitSet
? It depends on your data, but keeping up to a million keys in each BitSet
is usually a good choice. Each bit set will require no more than 128K, keeping total size of a LongBitSet
rather small