Massive Technical Interviews Tips: S2 Geometry Library

Thursday, October 8, 2015

S2 Geometry Library

https://www.quora.com/What-algorithms-is-Google-using-in-the-geocoding-and-searching
The geocode it uses in a nutshell:

start with a cube map
create a quadtree with 30 levels on each face
label both leaves and internal nodes with 64 bits
use 3 bits for the face index
use n*2 bits to identify a node at depth n
add a single 1 bit
pad with 0 bits as necessary
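
The steps above can be sketched in a few lines of Python. This is a minimal illustration of the bit layout only, not the S2 library's own code: the function name encode_cell_id and the representation of a node as a face index plus a list of child indices (0-3) are assumptions made for the example, and the real library chooses child indices according to the Hilbert curve.

```python
FACE_BITS = 3    # 3 bits for the face index (faces 0-5)
MAX_LEVEL = 30   # 30 quadtree levels per face


def encode_cell_id(face: int, path: list[int]) -> int:
    """Pack a face index and a root-to-node path into a 64-bit label.

    `path` holds one child index (0-3) per level, so a node at depth n
    contributes n*2 bits. Hypothetical helper, for illustration only.
    """
    if not 0 <= face < 6:
        raise ValueError("face must be in 0..5")
    if len(path) > MAX_LEVEL:
        raise ValueError("path is deeper than 30 levels")

    code = face                       # 3 bits for the face index
    for child in path:
        if not 0 <= child < 4:
            raise ValueError("child index must be in 0..3")
        code = (code << 2) | child    # 2 bits per level
    code = (code << 1) | 1            # add a single 1 bit
    padding = 64 - FACE_BITS - 2 * len(path) - 1
    return code << padding            # pad with 0 bits as necessary


print(hex(encode_cell_id(0, [])))        # whole face 0
print(hex(encode_cell_id(0, [0] * 30)))  # a leaf node on face 0
```

A leaf node uses all 3 + 60 + 1 = 64 bits, so it needs no padding.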

This code has several nice properties, e.g. the depth of a node can be computed from the position of the lowest set bit in its code. It's also pretty amazing that when S2 is used to geocode the surface of the Earth, the leaf nodes are smaller than 1 cm².
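
As a rough sketch of that depth computation, assuming the 64-bit layout described above (3 face bits, 2 bits per level, a single 1 bit, then zero padding): the number of zero bits below the marker bit is 2 * (30 - depth). The function name cell_level is made up for this example.

```python
def cell_level(cell_id: int) -> int:
    """Return the depth of a node from the position of its lowest set bit."""
    if cell_id <= 0:
        raise ValueError("cell_id must be a positive 64-bit code")
    lowest_set_bit = cell_id & -cell_id           # isolate the marker bit
    trailing_zeros = lowest_set_bit.bit_length() - 1
    return 30 - trailing_zeros // 2               # 2 padding bits per missing level


print(cell_level(1 << 60))  # a whole face: 60 padding bits
print(cell_level(1))        # a leaf: no padding bits
```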

Make sure to check out the code, though: it has many clever details, such as using Hilbert curve ordering for the labels, so that nodes with close codes are also close together in space.
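
To see why a Hilbert curve ordering keeps nearby codes spatially close, here is a small sketch of the textbook index-to-coordinate conversion on a flat 2^order x 2^order grid. It is not how S2 implements it (the library works on cube faces with lookup tables); the function name hilbert_d2xy is an assumption for the example.

```python
def hilbert_d2xy(order: int, d: int) -> tuple[int, int]:
    """Map position d along a Hilbert curve to (x, y) on a 2**order grid."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                 # rotate/flip the quadrant when needed
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


# Consecutive positions on the curve always land on adjacent grid cells.
cells = [hilbert_d2xy(3, d) for d in range(64)]
steps = [abs(ax - bx) + abs(ay - by)
         for (ax, ay), (bx, by) in zip(cells, cells[1:])]
print(set(steps))
```

Note that the guarantee runs in one direction: consecutive codes are always adjacent cells, but two adjacent cells can occasionally have codes that are far apart.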

Areas are represented as sets of nodes, so most operations on areas are simple set operations implemented in a way that eliminates redundancy (ensures that parent nodes absorb their children).
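
A minimal sketch of that redundancy elimination, assuming the 64-bit layout described earlier. The helpers parent, child, contains and normalize are written for this example and are not the library's API; the idea is only that a node already covered by an ancestor is dropped, and four sibling nodes are replaced by their parent.

```python
def lsb(cell_id: int) -> int:
    return cell_id & -cell_id            # the marker bit of the code


def parent(cell_id: int) -> int:
    new_lsb = lsb(cell_id) << 2          # marker bit moves up by one level
    return (cell_id & -new_lsb) | new_lsb


def child(cell_id: int, k: int) -> int:
    new_lsb = lsb(cell_id) >> 2          # marker bit moves down by one level
    return cell_id - lsb(cell_id) + (2 * k + 1) * new_lsb


def contains(a: int, b: int) -> bool:
    """True if node a is b itself or an ancestor of b."""
    return a - (lsb(a) - 1) <= b <= a + (lsb(a) - 1)


def normalize(cells: list[int]) -> list[int]:
    out: list[int] = []
    for cid in sorted(set(cells)):
        if out and contains(out[-1], cid):
            continue                     # already covered by an ancestor
        while out and contains(cid, out[-1]):
            out.pop()                    # cid absorbs its descendants
        out.append(cid)
        # Replace four siblings by their parent, repeatedly if possible.
        while len(out) >= 4 and lsb(out[-1]) < (1 << 60):
            p = parent(out[-1])
            if out[-4:] != [child(p, k) for k in range(4)]:
                break
            del out[-4:]
            out.append(p)
    return out


face0 = 1 << 60
quarters = [child(face0, k) for k in range(4)]
print(normalize(quarters) == [face0])    # four children collapse into the face
```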

See Geometry on the Sphere: Google's S2 Library for an in-depth explanation.

http://highscalability.com/blog/2015/9/14/how-uber-scales-their-real-time-market-platform.html
Geospatial Index

The earth is a sphere. It’s hard to do summarization and approximation based purely on longitude and latitude. So Uber divides the earth into tiny cells using the Google S2 library. Each cell has a unique cell ID.
With an int64, every square centimeter on Earth can be represented. Uber uses level 12 cells, which range from 3.31 km² to 6.38 km² depending on where on Earth they are. The cells change shape and size depending on where on the sphere they are.
S2 can give the coverage for a shape. If you want to draw a circle with a 1 km radius centered on London, S2 can tell you which cells are needed to completely cover the shape.
Since each cell has an ID, the ID is used as a sharding key. When a location comes in from supply, the cell ID for the location is determined. Using the cell ID as a shard key, the location of the supply is updated. It is then sent out to a few replicas.
When DISCO needs to find the supply near a location, a circle's worth of coverage is calculated, centered on where the rider is located. Using the cell IDs from the circle area, all the relevant shards are contacted to return supply data.
It's all scalable. Even though it's not as efficient as you might like, fanout is relatively cheap, so the write load can always be scaled by adding more nodes. The read load is scaled through the use of replicas. If more read capacity is needed, the replication factor can be increased.
A limitation is that the cell size is fixed at level 12. Dynamic cell sizes might be supported in the future. There's a tradeoff: the smaller the cell size, the greater the fanout for queries.
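
These figures can be sanity-checked with a little arithmetic: six cube faces, each split into 4^level cells, share the surface of the sphere. The sketch below assumes a mean Earth radius of 6371 km (a value not given in the text) and computes only the average cell area, not the per-cell minimum and maximum.

```python
import math

EARTH_RADIUS_KM = 6371.0   # assumed mean radius, not from the text


def average_cell_area_km2(level: int) -> float:
    """Average area of a cell at the given level: surface / (6 * 4**level)."""
    surface_km2 = 4 * math.pi * EARTH_RADIUS_KM ** 2
    return surface_km2 / (6 * 4 ** level)


level12_km2 = average_cell_area_km2(12)
leaf_cm2 = average_cell_area_km2(30) * 1e10   # 1 km^2 = 1e10 cm^2

print(f"level 12 average: {level12_km2:.2f} km^2")
print(f"level 30 average: {leaf_cm2:.2f} cm^2")
```

The level 12 average falls between the quoted 3.31 km² and 6.38 km², and the level 30 average is below 1 cm².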
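
A toy, in-memory sketch of the sharding idea described above. Everything here is hypothetical: the names SupplyIndex and shard_for_cell, the hash-based shard assignment, and the data layout are assumptions for illustration and say nothing about how Uber's system is actually built. The covering cell IDs are taken as input; in practice they would come from S2.

```python
import hashlib
from collections import defaultdict


def shard_for_cell(cell_id: int, num_shards: int) -> int:
    """Hypothetical shard assignment: stable hash of the 64-bit cell ID."""
    digest = hashlib.sha256(cell_id.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class SupplyIndex:
    """Toy stand-in for a sharded store keyed by cell ID."""

    def __init__(self, num_shards: int) -> None:
        self.num_shards = num_shards
        self.shards = [defaultdict(set) for _ in range(num_shards)]
        self.current_cell: dict[str, int] = {}

    def update(self, driver_id: str, cell_id: int) -> None:
        """Write path: move the driver to the shard that owns its new cell."""
        old = self.current_cell.get(driver_id)
        if old is not None:
            self.shards[shard_for_cell(old, self.num_shards)][old].discard(driver_id)
        self.shards[shard_for_cell(cell_id, self.num_shards)][cell_id].add(driver_id)
        self.current_cell[driver_id] = cell_id

    def query(self, covering_cell_ids: list[int]) -> tuple[set[str], int]:
        """Read path: contact only the shards that own a covering cell."""
        by_shard: dict[int, list[int]] = defaultdict(list)
        for cell_id in covering_cell_ids:
            by_shard[shard_for_cell(cell_id, self.num_shards)].append(cell_id)
        drivers: set[str] = set()
        for shard, cell_ids in by_shard.items():
            for cell_id in cell_ids:
                drivers |= self.shards[shard].get(cell_id, set())
        return drivers, len(by_shard)    # second value is the query fanout


index = SupplyIndex(num_shards=4)
index.update("driver-1", 0x1000001000000000)   # made-up cell IDs
index.update("driver-2", 0x1000003000000000)
print(index.query([0x1000001000000000, 0x1000003000000000]))
```

The fanout returned by query grows with the number of covering cells, which is the tradeoff mentioned above: smaller cells mean more cells per circle and therefore more shards to contact.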

https://en.wikipedia.org/wiki/Quadtree
A quadtree is a tree data structure in which each internal node has exactly four children. Quadtrees are most often used to partition a two-dimensional space by recursively subdividing it into four quadrants or regions. The regions may be square or rectangular, or may have arbitrary shapes.

They decompose space into adaptable cells
Each cell (or bucket) has a maximum capacity. When maximum capacity is reached, the bucket splits
The tree directory follows the spatial decomposition of the quadtree.
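
A minimal sketch of such a bucketed point quadtree, written for this note rather than taken from any library. The class name QuadTree, the default capacity of 4, and the max_depth guard (which stops endless splitting when many identical points arrive) are assumptions made for the example.

```python
class QuadTree:
    """Square region that splits into four quadrants when its bucket overflows."""

    def __init__(self, x, y, size, capacity=4, max_depth=16, depth=0):
        self.x, self.y, self.size = x, y, size   # lower-left corner, side length
        self.capacity = capacity                 # maximum points per bucket
        self.max_depth = max_depth
        self.depth = depth
        self.points = []
        self.children = None                     # None while this node is a leaf

    def contains(self, px, py):
        return (self.x <= px < self.x + self.size
                and self.y <= py < self.y + self.size)

    def _child_for(self, px, py):
        half = self.size / 2
        ix = 1 if px >= self.x + half else 0
        iy = 1 if py >= self.y + half else 0
        return self.children[2 * iy + ix]

    def _split(self):
        half = self.size / 2
        self.children = [
            QuadTree(self.x + dx * half, self.y + dy * half, half,
                     self.capacity, self.max_depth, self.depth + 1)
            for dy in (0, 1) for dx in (0, 1)
        ]
        for p in self.points:                    # redistribute the bucket
            self._child_for(*p).points.append(p)
        self.points = []
        for c in self.children:                  # a child may overflow too
            if len(c.points) > c.capacity and c.depth < c.max_depth:
                c._split()

    def insert(self, px, py):
        if not self.contains(px, py):
            return False
        node = self
        while node.children is not None:         # walk down to the leaf
            node = node._child_for(px, py)
        node.points.append((px, py))
        if len(node.points) > node.capacity and node.depth < node.max_depth:
            node._split()
        return True

    def all_points(self):
        if self.children is None:
            return list(self.points)
        return [p for c in self.children for p in c.all_points()]


tree = QuadTree(0.0, 0.0, 1.0, capacity=2)
for point in [(0.1, 0.1), (0.9, 0.9), (0.2, 0.8)]:
    tree.insert(*point)
print(tree.children is not None)   # the third point overflowed the root bucket
```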
