Monday, June 29, 2015

POI-GeoHash



http://www.learn4master.com/interview-questions/system-design/poi-geohash
地图上分割成不同区域这个设计题的核心是什么来着?
https://www.1point3acres.com/bbs/forum.php?mod=viewthread&tid=301057
记得有个什么算法为核心,就是分割成一块一块,然后有不同情况下的优化,
但是想不起来具体是什么来了。还看过一篇中文的文章讲得不错,可惜记不清细节了
GeoHash算法,地理编码
https://dzone.com/articles/designing-spacial-index
Geographic Information Systems (GIS) is an interesting area of exploration because it poses two significant challenges: latency at scale and modeling spatial locality.

Let's say you're visiting NYC and need an Internet connection. Where's the nearest Wi-Fi hotspot?How might an HBase application help you answer that question? What kind of design decisions go into solving that problem, and how can it be solved in a scalable way?
You know you want fast access to a relevant subset of the data. To achieve that, let's start with two simple and related goals:
  1. You want to store location points on disk such that points close to each other in space are close to each other on disk.
  2. You want to retrieve as few points as possible when responding to a query.
Starting with a compound rowkey
Concatenating X- and Y-axis values as the rowkey won't cut it for an efficient schema.
This sorting results in lots of hopping between the northern and southern clusters because of sorting first on longitude, then latitude.

Using the geohash as a spatially aware rowkey
The geohash makes a great choice for the rowkey because it's inexpensive to calculate and the prefix gets you a long way toward finding nearest neighbors. 

Compared to the target query area, the six-character prefix match areas are huge. Worse still, the query spans two of those larger prefixes. Seen in this context, those five characters of common prefix include far more data than you need. 

Relying on prefix match results in scanning a huge amount of extra area. Of course, there's a trade-off. If your data isn't dense at this precision level, executing fewer, longer scans isn't such a big deal. The scans don't return too much superfluous data, and you can minimize the remote procedure call (RPC) overhead. 

If your data is dense, running more, shorter scans will reduce the number of excess points transported over the wire. Plus, if there's one thing that computers are getting good at these days, it's parallelism. Execute each of those shorter scans on its own CPU core, and the query is still as fast as the slowest scan.

https://www.elastic.co/guide/en/elasticsearch/guide/current/geohashes.html
Geohashes are a way of encoding lat/lon points as strings.

Geohashes divide the world into a grid of 32 cells—4 rows and 8 columns—each represented by a letter or number. The g cell covers half of Greenland, all of Iceland, and most of Great Britian. Each cell can be further divided into another 32 cells, which can be divided into another 32 cells, and so on. The gc cell covers Ireland and England, gcp covers most of London and part of Southern England, and gcpuuz94k is the entrance to Buckingham Palace, accurate to about 5 meters.

In other words, the longer the geohash string, the more accurate it is. If two geohashes share a prefix— and gcpuuz—then it implies that they are near each other. The longer the shared prefix, the closer they are.

That said, two locations that are right next to each other may have completely different geohashes. For instance, the Millenium Dome in London has geohash u10hbp, because it falls into the u cell, the next top-level cell to the east of the g cell.

http://www.bigfastblog.com/geohash-intro
We can keep sub-dividing the space until we get to street-level or beyond. At that point our Geohash will be a bit longer and will look something like this…
0010110101011100011000110001101111000111
This binary can be represented as alphanumeric characters (32 bit encoded). Each 5 bits is converted to one character.
00101  10101  01110  00110  00110  00110  11110  00111
Which comes out as
  5      p      f      6      6      6      y      7
So our Geohash for the party on Saturday night is “5pf666y7″.
Why Use A GeoHash?
Single Simple String Representation
Grouping Of Points
Zooming And Aggregation

Caching At Scale

The Gotcha Solution
There is an algorithm for calculating the geohash values of the 8 surrounding grid-squares of a given grid square, which widens our search net in all directions, allowing us to see nearby points that reside on the other side of a grid’s boundary.
Simple way to encode lat/long into a string
Compact string encoding of geographic coordinate with
arbitrary precision
Bangalore - TDR1, Domlur - TDR1WX

Why do we need it when we have LL?
Big address of this venue
Complicated Lat/Long {12.963787,77.637789}

Subdivides space into "buckets" of grid shape
● Doesn't really represent a point, rather a bounding
area in which the point is present
● Hierarchical spatial structure with gradual degradation
● Longer the geohash
 -> Smaller the area

 Grouping and Zooming
Easy grouping
○ tdr1wxyp5dn7v -> {12.9637069, 77.6377931}
● 4th Main Rd, Domlur II Stage, Domlur, Bengaluru, Karnataka
○ tdr1wxyp5dn7v -> {12.9637, 77.6378}
● Domlur, Bengaluru, Karnataka
○ tdr1wxyp5dn7v -> {12.96, 77.64}
● Bengaluru, Karnataka
○ tdr1wxyp5dn7v -> {13, 78}
● Malur-Bangarapettu Rd

Nearby (Proximity) Search
● Nearby locations usually share similar prefixes.
● Long common prefixes indicate two places are near,
however two nearby places do not always have common prefixes


Finding Nearby Places
Prefix Match:
SELECT * FROM table WHERE place_geohash LIKE hashcode%
● Expand/Neighbors Match
SELECT * FROM table WHERE place_geohash IN (geohash.expand('tzxy1'))

Limitations
● Locality Anomalies
○ Complicates Proximity Searches
○ Need to use bounding box extensively making
search an expensive operation
● Projection based model
○ A geohash of given length will denote different
region size in poles than in equator region
http://blog.factual.com/how-geohashes-work
The geohash format
A geohash is a series of bits that repeatedly bisects a search space of latitudes and longitudes. The first bit bisects the longitude, the next one latitude, the next longitude, etc. As a result, some geohashes specify square-ish (in spherical coordinates) regions, while others specify 2:1 rectangles. Because geohashes are fundamentally binary data, they’re often written in base-32. The alphabet is a little unusual in that it omits some letters that might be confused with numbers.
Typical encoding strategy
long encodeGeohash(double lat, double lng, int bits) {
  double minLat = -90,  maxLat = 90;
  double minLng = -180, maxLng = 180;
  long result = 0;
  for (int i = 0; i < bits; ++i)
    if (i % 2 == 0) {                 // even bit: bisect longitude
      double midpoint = (minLng + maxLng) / 2;
      if (lng < midpoint) {
        result <<= 1;                 // push a zero bit
        maxLng = midpoint;            // shrink range downwards
      } else {
        result = result << 1 | 1;     // push a one bit
        minLng = midpoint;            // shrink range upwards
      }
    } else {                          // odd bit: bisect latitude
      double midpoint = (minLat + maxLat) / 2;
      if (lat < midpoint) {
        result <<= 1;                 // push a zero bit
        maxLat = midpoint;            // shrink range downwards
      } else {
        result = result << 1 | 1;     // push a one bit
        minLat = midpoint;            // shrink range upwards
      }
    }
  return result;
}


First, data indexed by geohash will have all points for a given rectangular area in contiguous slices (the number of slices depends on the precision required and the presence of geohash "fault lines"). This is especially useful in database systems where queries on a single index are much easier or faster than multiple-index queries. Second, this index structure can be used for a quick-and-dirty proximity search - the closest points are often among the closest geohashes. 

1)GeoHash将二维的经纬度转换成字符串
2)字符串越长,表示的范围越精确。
3)字符串相似的表示距离相近
GeoHash就是一种将经纬度转换成字符串的方法,并且使得在大部分情况下,字符串前缀匹配越多的距离越近
只需要将所在位置经纬度转换成GeoHash字符串,并与各个餐馆的GeoHash字符串进行前缀匹配,匹配越多的距离越近。


当geohash base32编码长度为8时,精度在19米左右,而当编码长度为9时,精度在2米左右,编码长度需要根据数据情况进行选择。
这种类型的空间填充曲线的优点是将二维空间转换成一维曲线(事实上是分形维),对大部分而言,编码相似的距离也相近, 但Peano空间填充曲线最大的缺点就是突变性,有些编码相邻但距离却相差很远,比如0111与1000,编码是相邻的,但距离相差很大。
1)由于GeoHash是将区域划分为一个个规则矩形,并对每个矩形进行编码,这样在查询附近POI信息时会导致以下问题,比如红色的点是我们的位置,绿色的两个点分别是附近的两个餐馆,但是在查询的时候会发现距离较远餐馆的GeoHash编码与我们一样(因为在同一个GeoHash区域块上),而较近餐馆的GeoHash编码与我们不一致。这个问题往往产生在边界处。
解决的思路很简单,我们查询时,除了使用定位点的GeoHash编码进行匹配外,还使用周围8个区域的GeoHash编码,这样可以避免这个问题。 
2)我们已经知道现有的GeoHash算法使用的是Peano空间填充曲线,这种曲线会产生突变,造成了编码虽然相似但距离可能相差很大的问题,因此在查询附近餐馆时候,首先筛选GeoHash编码相似的POI点,然后进行实际距离计算。
      geohash只是空间索引的一种方式,特别适合点数据,而对线、面数据采用R树索引更有优势(可参考:深入浅出空间索引:为什么需要空间索引)。

Where’s the nearest wifi hotspot?
You want to store location points on disk such that points close to each other in space are close to each other on disk.
You want to retrieve as few points as possible when responding to a query.

design of the rowkey is the single most important thing you can do in your HBase schema

We claimed earlier that concatenating X- and Y-axis values as the rowkey won’t cut it for an efficient schema.
longitude and latitude are equally important in defining the location of a point. In order to use them as the basis for the spatial index, you need an algorithm to combine them.

You’ll use 12 geohash characters for the precision because that’s the highest precision you can fit in an 8-byte Long and still represent a meaningful character string. By truncating characters from the end of the hash, you get a less precise geohash and a correspondingly less precise selection of the map. Where full precision represents a point, partial precision gives you an area on the map, effectively a bounding box around an area in space.

The geohashes are all represented as character strings in the Base32 encoding alphabet.
Like all such algorithms, it exhibits locality anomalies where two points that are physically close together may have very different geohashes or two points that are very far apart may have nearly identical geohashes, complicating proximity searches. There are adequate workarounds for this (some patented) at the cost of making searches more expensive. 

Using the geohash as a spatially aware rowkey

the optimal approach for the data is to scan the center tile and its eight neighbors. This approach will guarantee correct results while minimizing the amount of unnecessary network IO. Luckily, calculating those neighbors is a simple bit-manipulation away.
http://www.mitbbs.ca/article_t/JobHunting/32476139.html
里面讲的很清楚怎么样apply geohash to query.如果我没有理解错的话,用里面说的
方法可以省去前面大牛讲的design中的index server,就是存放每个grid中的PO壹的信息的
给一个point and range,应该可以计算出minimum geohash prefix that contains thearea

然后用这个a list of geohash prefix去storage layer 拿所有符合要求的POI就可以了。
https://github.com/yinqiwen/ardb/blob/master/doc/spatial-index.md
http://gis.stackexchange.com/questions/18330/would-it-be-possible-to-use-geohash-for-proximity-searches
Store Spatial Data
ZADD key <GeoHash50Bits>  <value>

Encode longitude/latitude coordinate with estimate bits by geohash-int.
Find surrounding 8 neighbors' geohash integer value by geohash-int.
For each geohash integer value, we generate a pair (GeoHashIneger, GeoHashIneger + 1). Then we got 9 pairs.
For each pair, we convert it to a score range. Any integer value in the pair should be left shift to 52 bits. The shifted value is the smallest 52 bits geohash value in the geohash box represent by unshifted GeoHashInteger.
For example, if we need search points in radius 3000m, then we should encode the longitude/latitude coordinate to a integer with 26 bits, then left shift it 26 bits, then we got a 52 bits integer.
For each score range, use 'ZRANGEBYSCORE key min max WITHSCORES' to retrieve all point's value and it's score.
For each point value and it's score, we can decode the score to a GeoHash area by geohash-int and compute the distance with given longitude/latitude , then compare the distance with given radius value to exclude the point not in radius.
Solr Spatial Search
For indexing geodetic points (latitude and longitude), supply the pair of numbers as a string with a comma separating them in latitude then longitude order. For non-geodetic points, the order is x,y for PointType, and for RPT you must use a space instead of a comma, or use WKT.
  • LatLonType and its non-geodetic twin PointType
  • SpatialRecursivePrefixTreeFieldType (RPT for short), including RptWithGeometrySpatialField, a derivative
  • BBoxField
RPT offers more features than LatLonType and fast filter performance, although LatLonType is more appropriate when efficient distance sorting/boosting is desired. They can both be used simultaneously for what each does best – LatLonType for sorting/boosting, RPT for filtering.  If you need to index shapes other than points (e.g. a circle or polygon) then use RPT.
sfield a spatial indexed field
fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5. This filter returns all results within a circle of the given radius around the initial point:
fq={!bbox sfield=store}&pt=45.15,-93.85&d=5. The rectangular shape is faster to compute and so it's sometimes used as an alternative to geofilt when it's acceptable to return points outside of the radius.

Coding

Products
GeoFire
https://www.firebase.com/blog/2013-09-25-location-queries-geofire.html

point of interest, or POI, is a specific point location that someone may find useful or interesting. 
Q1. Given the current location, how to find the most closest k points.
Q2. Given the current location, how to find all the points within k miles. 

A1. Geohash 
A2. K-D tree 
A3. Space-filling Curve 

intersection of intervals -- search points in a range.
intersection of rectangles (using bst)   search overlapped intervals
https://www.youtube.com/watch?v=Igr6yONkpIQ

TODO:

Labels

Review (572) System Design (334) System Design - Review (198) Java (189) Coding (75) Interview-System Design (65) Interview (63) Book Notes (59) Coding - Review (59) to-do (45) Linux (43) Knowledge (39) Interview-Java (35) Knowledge - Review (32) Database (31) Design Patterns (31) Big Data (29) Product Architecture (28) MultiThread (27) Soft Skills (27) Concurrency (26) Cracking Code Interview (26) Miscs (25) Distributed (24) OOD Design (24) Google (23) Career (22) Interview - Review (21) Java - Code (21) Operating System (21) Interview Q&A (20) System Design - Practice (20) Tips (19) Algorithm (17) Company - Facebook (17) Security (17) How to Ace Interview (16) Brain Teaser (14) Linux - Shell (14) Redis (14) Testing (14) Tools (14) Code Quality (13) Search (13) Spark (13) Spring (13) Company - LinkedIn (12) How to (12) Interview-Database (12) Interview-Operating System (12) Solr (12) Architecture Principles (11) Resource (10) Amazon (9) Cache (9) Git (9) Interview - MultiThread (9) Scalability (9) Trouble Shooting (9) Web Dev (9) Architecture Model (8) Better Programmer (8) Cassandra (8) Company - Uber (8) Java67 (8) Math (8) OO Design principles (8) SOLID (8) Design (7) Interview Corner (7) JVM (7) Java Basics (7) Kafka (7) Mac (7) Machine Learning (7) NoSQL (7) C++ (6) Chrome (6) File System (6) Highscalability (6) How to Better (6) Network (6) Restful (6) CareerCup (5) Code Review (5) Hash (5) How to Interview (5) JDK Source Code (5) JavaScript (5) Leetcode (5) Must Known (5) Python (5)

Popular Posts