Saturday, December 19, 2015

False Sharing



http://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html
https://dzone.com/articles/false-sharing
Memory is stored within the cache system in units known as cache lines.  A cache line is a power-of-2 number of contiguous bytes, typically 32–256 bytes in size; the most common cache line size is 64 bytes.  False sharing is the term for threads unwittingly impacting each other's performance while modifying independent variables that happen to share the same cache line.  Write contention on cache lines is the single most limiting factor on achieving scalability for parallel threads of execution in an SMP system.  I’ve heard false sharing described as the silent performance killer because it is far from obvious when looking at code.

To achieve linear scalability with the number of threads, we must ensure no two threads write to the same variable or cache line.  Two threads writing to the same variable can be tracked down at the code level.  To know whether independent variables share the same cache line, we need to know the memory layout, or we can get a tool to tell us; Intel VTune is one such profiling tool.  In this article I’ll explain how memory is laid out for Java objects and how we can pad out our cache lines to avoid false sharing.

Figure 1 above illustrates the issue of false sharing.  A thread running on core 1 wants to update variable X while a thread on core 2 wants to update variable Y.  Unfortunately, these two hot variables reside in the same cache line.  Each thread races for ownership of the cache line so it can perform its update.  If core 1 gets ownership, the cache sub-system must invalidate the corresponding cache line for core 2.  When core 2 gets ownership and performs its update, core 1 is told to invalidate its copy of the cache line.  This ping-pongs back and forth via the L3 cache, greatly impacting performance.  The issue would be further exacerbated if the competing cores were on different sockets and additionally had to cross the socket interconnect.
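The scenario above can be sketched in a few lines of Java. This is a minimal, hedged demo, not code from the original articles: two threads each write only their own volatile long, but because the two fields are declared adjacently they very likely land on the same cache line. The class name and iteration count are arbitrary choices.

```java
// Two threads, two logically independent counters, one (likely) shared cache line.
public class SharedLineDemo {
    volatile long x; // written only by thread 1
    volatile long y; // written only by thread 2, but probably on the same 64-byte line as x

    public static void main(String[] args) throws InterruptedException {
        final SharedLineDemo d = new SharedLineDemo();
        final long iterations = 10_000_000L;
        Thread t1 = new Thread(() -> { for (long i = 0; i < iterations; i++) d.x = i; });
        Thread t2 = new Thread(() -> { for (long i = 0; i < iterations; i++) d.y = i; });
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("took " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```

On most x86 hardware, placing 56 bytes of padding between x and y makes this loop run several times faster, which is the technique the rest of the article develops.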


http://stackoverflow.com/questions/7750140/whats-the-difference-between-a-word-and-byte
Word: the natural size with which a processor handles data (the register size). The most common word sizes encountered today are 8, 16, 32 and 64 bits, but other sizes are possible. For example, there were a few 36-bit machines, and even 12-bit machines.
The byte is the smallest addressable unit for a CPU. If you want to set/clear single bits, you first need to fetch the corresponding byte from memory, mess with the bits and then write the byte back to memory.
The word, by contrast, is the biggest chunk of bits with which a processor can do processing (like addition and subtraction) at a time. That definition is a bit fuzzy, as some processors might have different word sizes for different tasks (integer vs. floating-point processing, for example). The word size is what the majority of operations work with.
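The "fetch the byte, mess with the bits, write it back" point can be shown concretely. This is an illustrative sketch of my own, modelling memory as a byte array; the helper name is made up:

```java
public class BitTwiddle {
    // To set a single bit, you must read the whole containing byte,
    // modify it, and write the whole byte back: a read-modify-write.
    static void setBit(byte[] mem, int bitIndex) {
        int byteIndex = bitIndex / 8;          // which byte holds the bit
        int bit = bitIndex % 8;                // position within that byte
        mem[byteIndex] |= (byte) (1 << bit);   // fetch, OR in the bit, store
    }

    public static void main(String[] args) {
        byte[] mem = new byte[4];
        setBit(mem, 11);                       // bit 3 of byte 1
        System.out.println(mem[1]);            // 8
    }
}
```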
Java Memory Layout
For the HotSpot JVM, all objects have a two-word header.  The first is the “mark” word, which is made up of 24 bits of hash code and 8 bits of flags such as the lock state, or it can be swapped for lock objects.  The second is a reference to the class of the object.  Arrays have an additional word for the size of the array.  Every object is aligned to an 8-byte boundary for performance.  Therefore, to pack efficiently, object fields are re-ordered from declaration order into the following order based on size in bytes:
  1. doubles (8) and longs (8)
  2. ints (4) and floats (4)
  3. shorts (2) and chars (2)
  4. booleans (1) and bytes (1)
  5. references (4/8)
  6. <repeat for sub-class fields>
With this knowledge we can pad a cache line between any fields with 7 longs.  Within the Disruptor we pad cache lines around the RingBuffer cursor and BatchEventProcessor sequences.
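As an illustration of the re-ordering rules above, consider a class whose fields are declared in a deliberately scattered order. The class and its comments are my own sketch of the expected behaviour, not verified offsets; actual offsets can be inspected with the OpenJDK JOL (Java Object Layout) tool:

```java
// Fields declared small-to-large on purpose; HotSpot is free to lay them
// out grouped by size instead: longs/doubles, then ints/floats, then
// shorts/chars, then booleans/bytes, then references.
public class LayoutExample {
    byte flag;      // declared first, but likely placed near the end
    short id;
    int count;
    long counter;   // 8-byte fields typically come first after the header
    Object ref;     // references last (4 or 8 bytes, depending on compressed oops)

    public static void main(String[] args) {
        System.out.println("expected packing order: counter, count, id, flag, ref");
    }
}
```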

To show the performance impact, let’s take a few threads, each updating its own independent counter.  These counters will be volatile longs so the world can see their progress.
    public final static class VolatileLong
    {
        public volatile long value = 0L;
        public long p1, p2, p3, p4, p5, p6; // padding; comment out to observe false sharing
    }
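A driver for this class might look as follows. This is a hedged sketch modelled on the harness in the Mechanical Sympathy post linked above, with the VolatileLong class repeated so the example compiles standalone; the thread and iteration counts are arbitrary choices of mine:

```java
// Each thread hammers its own VolatileLong in a shared array. With the
// padding fields present, each counter gets its own cache line; comment
// them out and the counters pack together, causing false sharing.
public final class FalseSharingTest implements Runnable {
    private static final int NUM_THREADS = 4;
    private static final long ITERATIONS = 20_000_000L;
    private static final VolatileLong[] longs = new VolatileLong[NUM_THREADS];
    static {
        for (int i = 0; i < longs.length; i++) longs[i] = new VolatileLong();
    }

    private final int index;
    public FalseSharingTest(int index) { this.index = index; }

    public void run() {
        long i = ITERATIONS;
        while (0 != --i) longs[index].value = i; // each thread writes only its own counter
    }

    public static void main(String[] args) throws Exception {
        Thread[] threads = new Thread[NUM_THREADS];
        for (int i = 0; i < threads.length; i++) threads[i] = new Thread(new FalseSharingTest(i));
        long start = System.nanoTime();
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();
        System.out.println("duration = " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }

    public final static class VolatileLong {
        public volatile long value = 0L;
        public long p1, p2, p3, p4, p5, p6; // padding; comment out to observe false sharing
    }
}
```

Running both variants on the same machine and comparing durations is the simplest way to see the effect; on typical x86 hardware the unpadded version is several times slower.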
http://robsjava.blogspot.com/2014/03/what-is-false-sharing.html
CPUs don't read memory in single bytes; rather, they read 'chunks' of memory, usually in blocks of 64 bytes. These chunks are referred to as cache lines.

If you had two threads (let's call them Thread 1 and Thread 2) both modifying a volatile variable, which we shall call 'x':
volatile long x;
If Thread 1 were to change the value of 'x', and then Thread 2 were to read it:
Thread 1: x = 3;
Thread 2: System.out.print(x);
For the value of x to be passed between the two threads (Thread 1 to Thread 2), a whole 64 bytes will be exchanged, as cores only exchange data in cache lines. It is possible that Thread 1 and Thread 2 may actually run on the same core, but for this oversimplified example let's assume that each thread runs on its own core.

Given that a long is stored in 8 bytes, and in our example a cache line is 64 bytes, the cache line can store 8 longs. We already have the long value 'x' stored in the cache line; let's assume that the rest of the cache line is full of 7 other longs, say v1 to v7:
x, v1, v2, v3, v4, v5, v6, v7
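A quick sketch makes the arithmetic concrete. The base address below is hypothetical (chosen 64-byte aligned); the point is that eight consecutive 8-byte slots all map to the same cache line:

```java
// Map each 8-byte slot to its 64-byte cache line number.
public class LineMap {
    public static void main(String[] args) {
        long base = 0x1040; // hypothetical, 64-byte-aligned address of 'x'
        String[] names = {"x", "v1", "v2", "v3", "v4", "v5", "v6", "v7"};
        for (int i = 0; i < names.length; i++) {
            long addr = base + 8L * i;          // slots are 8 bytes apart
            System.out.println(names[i] + " -> line " + (addr / 64));
        }
    }
}
```

All eight slots print the same line number; any write to any of them invalidates the whole line for every other core caching it.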

False Sharing

This cache line could be used by a number of different threads. If another thread were to modify v2, this would force Thread 1 and Thread 2 to reload the cache line. You may be wondering why Thread 1 and Thread 2 should have to reload it, as the update to v2 should not affect them. Well, even though these updates are logically independent of each other, coherence is maintained on a cache-line basis, not on individual elements. This apparently unnecessary sharing of data is referred to as false sharing.

Padding 

A core can execute hundreds of instructions in the time taken to fetch a single cache line.

If a core has to wait for a cache line to be reloaded, it runs out of things to do; this is called a stall. Stalls can be avoided by reducing false sharing. One technique for reducing false sharing is to pad out data structures so that threads working on independent variables fall into separate cache lines.

An example of a padded class, attempting to place 'x' and 'v1' on separate cache lines :
public class FalseSharingWithPadding { 
 
    public volatile long x; 
    public volatile long p2;   // padding 
    public volatile long p3;   // padding 
    public volatile long p4;   // padding 
    public volatile long p5;   // padding 
    public volatile long p6;   // padding 
    public volatile long p7;   // padding 
    public volatile long p8;   // padding 
    public volatile long v1; 
}
Before you go ahead and pad all your data structures, it's worth bearing in mind that the JVM can eliminate or re-order unused fields, thus re-introducing false sharing. There is also no guarantee where objects will be placed on the heap.

To reduce the chance of your padding fields being eliminated, it usually helps to declare them volatile. I suggest you only apply padding to highly contended concurrent classes, and only if profiling on your target architecture actually shows a difference. It is usually best to measure after at least 10,000 iterations, to eliminate the effects of JIT optimisations.

Java 8 and @Contended 

Rather than introduce padding fields, a cleaner approach would be to annotate the fields that are likely to fall victim to false sharing. This could act as a hint to the JVM, which could move those fields onto separate cache lines. This is the aim of JEP 142.

This JEP introduced the @Contended annotation. The annotation serves as a hint that such objects and fields should reside in locations isolated from those of other objects or fields.

public class Point { 
    int x;
    @Contended
    int y; 
} 
The code above will place x and y on separate cache lines; @Contended also shifts the 'y' field away from the object header. Note that in JDK 8 the annotation lives at sun.misc.Contended (it moved to jdk.internal.vm.annotation.Contended in JDK 9), and the JVM only honours it for non-JDK classes when run with -XX:-RestrictContended.
