Hardware Aware Algorithms (part I) | Balls & Bins, A Mix of Random Thoughts
The reason behind the observation is pretty simple. Here I'll explain the two major factors, right after the code for the two traversal orders.
const m = 2048 // matrix dimension; the value used in the original post is not shown in this excerpt

// horizontal order: row by row, matching the row-major memory layout.
type horizontal struct{}

func (p horizontal) compute(in *[m][m]float64) float64 {
	var sum float64
	for i := 0; i < m; i++ {
		for j := 0; j < m; j++ {
			sum += in[i][j]
		}
	}
	return sum
}

// vertical order: column by column, jumping m*8 bytes between consecutive reads.
type vertical struct{}

func (p vertical) compute(in *[m][m]float64) float64 {
	var sum float64
	for i := 0; i < m; i++ {
		for j := 0; j < m; j++ {
			sum += in[j][i]
		}
	}
	return sum
}
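To reproduce the kind of comparison described here, a minimal benchmark sketch (my own, not from the post, and assumed to sit in the same package as the code above) using Go's standard testing package could look like this:

import "testing"

// summer is satisfied by both horizontal and vertical above.
type summer interface {
	compute(in *[m][m]float64) float64
}

// benchCompute times repeated full passes over one shared m x m matrix.
func benchCompute(b *testing.B, s summer) {
	in := new([m][m]float64) // zero-valued; the contents do not change the access pattern
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = s.compute(in)
	}
}

func BenchmarkHorizontal(b *testing.B) { benchCompute(b, horizontal{}) }
func BenchmarkVertical(b *testing.B)   { benchCompute(b, vertical{}) }

Run it with go test -bench=.; once the matrix no longer fits in cache, the vertical order is typically several times slower even though both functions do exactly the same arithmetic.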
- Data Cache: Modern processors typically have 2 or 3 levels of data cache: L1D, L2, and L3. Each outer level is larger but also has higher latency. To simplify the concept, one can regard data caches as hardware hash tables with a fixed number of slots for each hash value. Items in a cache are evicted in an LRU manner, and prefetchers are responsible for prefetching items into the caches. If an item is available when it is needed, that is a cache hit and the item can be retrieved from the cache directly; otherwise it is a cache miss, and the request is sent to the next level out and may eventually reach main memory, which is several orders of magnitude slower. (A toy model of such a cache is sketched right after this list.)
- Translation Lookaside Buffers (TLBs): When we program, we work in a virtual memory space, so there is a translation between virtual addresses and physical addresses. The virtual memory system usually achieves this by mapping pages with the help of a page table. Modern processors have a small hardware cache, the TLB, that stores part of this mapping. A miss in the TLB can be very expensive because the translation then proceeds by looking up the page table, in a process called a page walk. The page walk is slow relative to the processor speed, as it involves reading the contents of multiple memory locations and using them to compute the physical address. (A toy translation sketch also follows after this list.)
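To make the "hardware hash table" analogy concrete, here is a toy model of a set-associative cache with LRU eviction (my own sketch, not from the post; real caches index sets by address bits and track lines in hardware, but the shape is the same):

// toyCache models a set-associative cache: nSets buckets (the "hash" slots),
// each holding at most ways cache-line addresses, evicted in LRU order.
type toyCache struct {
	nSets, ways int
	sets        [][]uint64 // per set: line addresses, least recently used first
}

func newToyCache(nSets, ways int) *toyCache {
	return &toyCache{nSets: nSets, ways: ways, sets: make([][]uint64, nSets)}
}

// access records a read of byte address addr (64-byte lines assumed) and
// reports whether it would have been a cache hit.
func (c *toyCache) access(addr uint64) bool {
	line := addr / 64
	idx := int(line % uint64(c.nSets)) // the "hash value": low bits of the line address
	set := c.sets[idx]
	for i, tag := range set {
		if tag == line {
			// Hit: move the line to the most-recently-used position.
			c.sets[idx] = append(append(set[:i:i], set[i+1:]...), line)
			return true
		}
	}
	// Miss: evict the least recently used entry if the set is full, then install the line.
	if len(set) == c.ways {
		set = set[1:]
	}
	c.sets[idx] = append(set, line)
	return false
}

Replaying the horizontal and vertical traversals above against such a model shows the difference directly: the row-major order reuses each 64-byte line for eight consecutive float64 reads, while the column-major order touches a different line on every read and keeps cycling through the sets.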
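And a rough sketch of the TLB-versus-page-walk logic (again my own, with assumed 4 KiB pages and a single flat page table for brevity; a real x86-64 walk goes through four levels of tables):

// Assumed geometry: 4 KiB pages, so the low 12 bits of an address are the
// in-page offset and the remaining bits are the virtual page number (VPN).
const pageShift = 12

type mmu struct {
	tlb       map[uint64]uint64 // small VPN -> physical frame cache (the TLB)
	pageTable map[uint64]uint64 // stands in for the in-memory page table
}

// translate maps a virtual address to a physical address.
func (u *mmu) translate(vaddr uint64) (paddr uint64, ok bool) {
	vpn := vaddr >> pageShift
	off := vaddr & ((1 << pageShift) - 1)
	if frame, hit := u.tlb[vpn]; hit {
		return (frame << pageShift) | off, true // TLB hit: cheap
	}
	// TLB miss: the "page walk". On real hardware this means several dependent
	// memory reads through multi-level tables, which is why misses are costly.
	frame, present := u.pageTable[vpn]
	if !present {
		return 0, false // would be a page fault; out of scope here
	}
	u.tlb[vpn] = frame // cache the translation for subsequent accesses
	return (frame << pageShift) | off, true
}

The vertical traversal in the earlier code makes this hurt: consecutive reads are m*8 bytes apart, so once m exceeds 512 every step lands on a different page, and the set of live translations quickly outgrows a TLB that holds at most a few thousand entries.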
Prefetchers nowadays are far more sophisticated than simply fetching contiguous memory locations ahead of time. If the memory access pattern is regular, the prefetcher can still do a pretty good job even if the locations are not near each other.
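As a small illustration (my own sketch, not from the post): a fixed-stride walk like sumStride below is a regular pattern that a stride prefetcher can typically lock onto even though consecutive reads are far apart, while a shuffled walk over the same elements gives it nothing to predict.

// sumStride visits every element, but in stride-sized jumps: a regular,
// predictable pattern even though neighboring reads are far apart in memory.
func sumStride(a []float64, stride int) float64 {
	var sum float64
	for start := 0; start < stride; start++ {
		for i := start; i < len(a); i += stride {
			sum += a[i]
		}
	}
	return sum
}

// sumRandom visits the same elements in a shuffled order: no stride to detect.
func sumRandom(a []float64, order []int) float64 {
	var sum float64
	for _, i := range order {
		sum += a[i]
	}
	return sum
}

Timing sumStride(a, 16) against sumRandom(a, rand.Perm(len(a))) (using math/rand) on a slice larger than the last-level cache typically shows a clear gap, even though both visit exactly the same elements exactly once.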
https://ballsandbins.wordpress.com/2015/03/16/hardware-aware-algorithms-part-ii/
The naive approach parallelizes the outer loop, and thus only enjoys good cache behavior within the inner loop. The divide-and-conquer approach, on the other hand, recursively splits the computation into local ranges, which makes prefetching more effective and lightens the TLB load, at least for the computation within each of the two halves. Admittedly, this approach requires more thought and the code is less clean. For a lot of applications in big data and the financial industry, performance is everything; however, there are also less critical scenarios where code simplicity and readability are valued. A sketch of the two strategies follows.
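The original post does not show this code, but a rough sketch of the two strategies for the row-major sum, reusing m from above, might look like the following (the worker count, the round-robin row assignment, and the cutoff value are my own choices for illustration):

import "sync"

// naiveParallel parallelizes the outer loop directly: rows are dealt out
// round-robin, so each worker's reads hop across the whole matrix.
func naiveParallel(in *[m][m]float64, workers int) float64 {
	partial := make([]float64, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			var local float64
			for i := w; i < m; i += workers {
				for j := 0; j < m; j++ {
					local += in[i][j]
				}
			}
			partial[w] = local
		}(w)
	}
	wg.Wait()
	var sum float64
	for _, p := range partial {
		sum += p
	}
	return sum
}

// dcParallel recursively halves the row range [lo, hi); below the cutoff it
// sums sequentially, so each leaf works on a small, contiguous block of rows
// whose cache lines and pages stay local to one goroutine.
func dcParallel(in *[m][m]float64, lo, hi int) float64 {
	const cutoff = 64 // rows per leaf; a tuning knob, value assumed here
	if hi-lo <= cutoff {
		var sum float64
		for i := lo; i < hi; i++ {
			for j := 0; j < m; j++ {
				sum += in[i][j]
			}
		}
		return sum
	}
	mid := (lo + hi) / 2
	var left float64
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		left = dcParallel(in, lo, mid)
	}()
	right := dcParallel(in, mid, hi)
	wg.Wait()
	return left + right
}

The whole sum is dcParallel(&a, 0, m) for a matrix a. The cutoff is what makes the recursion hardware friendly: it is chosen so that each leaf's working set covers only a small, contiguous range of cache lines and pages, whereas the naive version hands every worker rows scattered across the entire matrix.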
Read full article from Hardware Aware Algorithms (part I) | Balls & Bins, A Mix of Random Thoughts