http://www.mapdb.org/blog/better_primitive_collections_proposal/
https://dzone.com/articles/whats-wrong-with-java-boxed-numbers
http://java-performance.info/bit-sets/
So there is need for new primitive collection. It should store data in-memory, but also be swappable to disk if free memory runs low. On-heap
long[] is not really swappable, Garbage Collector might randomly touch an array and load it from swap. Also data should be persisted between JVM restarts to save import time (without crash protection). That sounds like a job for memory mapped files.
Bit of theory: Collections such as HashSet<Long> store data in an array:
Long[]. Primitive collections remove boxing: long[]. But one could also use raw binary storage such as byte[], DirectByteByffer, RandomAccessFile or MappedByteBuffer. Access to longs will get slightly complicated, but most storage options provide methods such as setLong(index,value) or getLong(index) and are very similar to arrays.
Primitive collections relay on code generators to combine various types together. There are many classes such as
IntObjectHashMap<V> or IntLongHashMap<V>. So it should be possible to modify code generator and add one extra parameters for storage. Than there will be classes such as MmapIntObjectHashMap<V> or HeapIntLongHashMap<V>.
https://github.com/palantir/trove-3.0.3/blob/master/generated-src/gnu/trove/list/array/TLongArrayList.java
https://github.com/palantir/trove-3.0.3/blob/master/generated-src/gnu/trove/list/linked/TIntLinkedList.java
https://github.com/palantir/trove-3.0.3/blob/master/generated-src/gnu/trove/list/linked/TIntLinkedList.java
Trove primitive sets use one array to store set values (using open indexed approach) and one more array of bytes to save each cell status: open, used or deleted. Besides, there are
THashSet<E> and TLinkedHashSet<E>, which can replace java.util.HashSet and java.util.LinkedHashSet from JDK. These object sets don’t use a separate array of bytes to keep the slot state (unlike primitive sets).
Both JDK sets are based on the
HashMap<E,Object>, where Object values are just cell state flags. HashMaps, in turn, are based on array of Map.Entry. As a result, JDK HashSet uses three Objectsper each stored values. This is especially expensive in case when you need to store primitive values in a set.
JDK
LinkedHashSet is built on top of HashSet by adding double linked references to each map entry to keep order in which records have been inserted
Note that for a set of integer values (
short/int/long) it may be worth to replace such set with a bit set. Read "Bit sets" article for more details.http://java-performance.info/bit-sets/
Do not forget about bit sets when you need to map a large number of integer keys to boolean flags.
Sets of integer values should be replaced with bit sets in a lot of cases in order to save a lot of memory.
LongBitSet to deal with BitSet problems
Let’s try to deal with these problems. We will implement a
LongBitSet – a class similar to java.util.BitSet. The only difference is that its keys will be long instead of int.
A
LongBitSet will be backed by Map<Long, BitSet>. The idea is to use high bits of a key as a key to this map and low bits as a key to a value BitSet. Thus we will support both partitioning and long keys. Problem is solved.
What about size of each
BitSet? It depends on your data, but keeping up to a million keys in each BitSet is usually a good choice. Each bit set will require no more than 128K, keeping total size of a LongBitSet rather small