Making geo joins faster with H3 indexes (floedb.ai)

173 points by matheusalmeida 104 days ago | 69 comments

cullenking 103 days ago |

We do something similar for some limited geospatial search using Elasticsearch. We make a set of H3 indexes for each of the hundreds of millions of GPS recordings on our service and store them in Elasticsearch. Geospatial queries become full-text search queries, where a point is on the line if the track's set of H3 indexes contains the point's cell. You can query on how many cells overlap, which lets you match geospatial tracks that follow the same paths, and with ES coverage queries you can tune how much overlap you want.
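The overlap matching described here can be sketched with plain sets. This is only an illustration of the idea, not the Elasticsearch query itself; the cell IDs, track names, and the `coverage` helper are hypothetical.

```python
def coverage(query_cells: set[str], track_cells: set[str]) -> float:
    """Fraction of the query's cells that also appear in the track."""
    if not query_cells:
        return 0.0
    return len(query_cells & track_cells) / len(query_cells)


# Hypothetical tracks, each reduced to the set of cells it passes through.
tracks = {
    "morning_ride": {"a1", "a2", "a3", "a4"},
    "evening_ride": {"a3", "a4", "b1", "b2"},
}
query = {"a1", "a2", "a3", "a4"}

# Keep only tracks that share at least half of the query's cells.
min_overlap = 0.5
matches = {
    name: coverage(query, cells)
    for name, cells in tracks.items()
    if coverage(query, cells) >= min_overlap
}
```

Raising `min_overlap` plays the role of tuning how much of the path two tracks must share before they count as a match.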

Instead of using integer IDs for the hexes, we created an encoded version of the ID with the property that removing a character gets you the containing parent of the cell. This means we can do basic containment queries by using a low-resolution hex (a short string) as a prefix query. If a GPS track goes through this larger parent cell, the track will have hexes with the same prefix. You don’t get perfect control of distances because hexes have varying diameters (or rather approximate diameters, since they are hexes, not circles), but in practice and at scale, for a product that doesn’t require high precision, it’s very effective.
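One way such a truncatable encoding could look, as a sketch: the comment does not say which encoding was actually used, so `encode_h3_prefix` and its output format are assumptions. It relies only on the documented H3 bit layout (4 resolution bits, 7 base-cell bits, then one 3-bit digit per resolution level) and needs no H3 library.

```python
def encode_h3_prefix(h: int) -> str:
    """Encode a 64-bit H3 cell index as base cell + one character per level.

    Hypothetical format: two hex characters for the base cell (0-121),
    then one digit (0-6) for each resolution level down to the cell's own.
    """
    resolution = (h >> 52) & 0xF
    base_cell = (h >> 45) & 0x7F
    # Digit for level r sits 3 * (15 - r) bits from the bottom of the index.
    digits = [(h >> (3 * (15 - r))) & 0x7 for r in range(1, resolution + 1)]
    return f"{base_cell:02x}" + "".join(str(d) for d in digits)


child = 0x8928308280FFFFF   # example resolution-9 cell
parent = 0x8828308281FFFFF  # same cell with the level-9 digit cleared (resolution 8)

child_key = encode_h3_prefix(child)    # "14060405003"
parent_key = encode_h3_prefix(parent)  # child_key minus its last character
```

With keys like these, "does this track pass through the parent cell" becomes a prefix match on `parent_key`. As noted elsewhere in the thread, an H3 parent only approximately covers its children geometrically, so the match is hierarchical rather than exact.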

I think at the end of this year we’ll have about 6 TB of these hex sets in a four-node, 8-process ES cluster. Performance is pretty good. It also acts as our full-text search. Half the time we want a geo search we also want keyword search, filtering, etc. on the metadata of these trips.

Pretty fun system to build, and the concept works with a wide variety of data stores. Felt like a total hack job but it has stood the test of time.

Thanks Uber, H3 is a great library!

jillesvangurp 103 days ago | |

Elasticsearch and OpenSearch have a built-in geo_shape type that is a bit more optimal for queries like this.

Before that existed (pre-1.0, actually), I did something similar with geohashes, which are similar to H3 but based on simple string-encoded quadtrees. I indexed all the street segments in OpenStreetMap with that (~800 million at the time) and implemented a simple reverse geocoder. Worked shockingly well.
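For readers who haven't seen geohashes, a minimal encoder shows why they behave like the truncatable strings discussed upthread. This is the standard public algorithm (alternating longitude/latitude bisection, 5 bits per base-32 character), written as a sketch rather than taken from the reverse geocoder described here.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 9) -> str:
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    value, nbits = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = lon >= mid
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        value = (value << 1) | int(bit)
        nbits += 1
        use_lon = not use_lon
        if nbits == 5:  # every 5 bits become one character
            chars.append(BASE32[value])
            value, nbits = 0, 0
    return "".join(chars)


fine = geohash_encode(57.64911, 10.40744, 6)    # "u4pruy"
coarse = geohash_encode(57.64911, 10.40744, 3)  # "u4p"
```

Because each extra character only refines the previous box, `fine.startswith(coarse)` holds, and "everything in this cell" is a prefix query on the stored strings.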

The geo_shape type uses a BKD tree in binary format. It's heavily optimized for these kinds of intersects/overlaps queries at scale. It basically does the same thing but uses a lot less disk space and memory. It's similar to what you would find in proper GIS databases. Elasticsearch/OpenSearch also support H3 and geohash grid aggregations on top of geo_shape or geo_point types.

I'm guessing the author is using something like PostgreSQL, which of course has similar geospatial indexing support via PostGIS.

cullenking 102 days ago | | |

Doesn’t meet all our product requirements unfortunately. We use the returned hexes in certain queries, and we also hacked in the directionality of the line using the least significant 12 bits of the hex (we didn’t need that level of hex precision), and we are doing direction-oriented matching and counting. For simpler use cases it’s definitely a better option. Thanks for reminding me and other people reading my comment!
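The comment doesn't say how direction was packed, so the following is only a guess at the shape of the trick. It assumes cells at resolution 11 or coarser, where the low 12 bits of an H3 index hold four unused digit slots (all ones), and it stores a quantized bearing there. `pack_direction` and `unpack_direction` are hypothetical names.

```python
DIR_BITS = 12
DIR_MASK = (1 << DIR_BITS) - 1  # low 12 bits = four unused 3-bit digit slots


def pack_direction(cell: int, bearing_deg: float) -> int:
    """Overwrite the low 12 bits of a coarse H3 index with a bearing bucket."""
    bucket = int((bearing_deg % 360.0) / 360.0 * (1 << DIR_BITS)) & DIR_MASK
    return (cell & ~DIR_MASK) | bucket


def unpack_direction(packed: int) -> tuple[int, float]:
    """Recover the original cell (unused digits back to all ones) and the bearing."""
    cell = packed | DIR_MASK
    bearing = (packed & DIR_MASK) * 360.0 / (1 << DIR_BITS)
    return cell, bearing


cell = 0x8928308280FFFFF  # example resolution-9 cell
packed = pack_direction(cell, 90.0)
restored_cell, bearing = unpack_direction(packed)
```

Two tracks crossing the same cell in opposite directions then produce different packed values, which is what makes direction-aware matching and counting possible.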

anacoluthe 102 days ago | |

Beware that the parent hexagon does not contain its children...

penteract 102 days ago | | |

No idea if they are doing this, but you can use Gosper islands (https://en.wikipedia.org/wiki/Gosper_curve) which are close to hexagons, but can be exactly decomposed into 7 smaller copies.

chrisweekly 102 days ago | |

Awesome comment, thanks for sharing the details. I love this kind of pragmatic optimization. Also, one dev's "total hack* job" [e.g. yourself, in the past] is another's stroke of genius.

* I'd frame it as "kludge", reserving "hack" for the positive HN sense. :)

ajfriend 103 days ago | |

Very cool! And the prefix queries you mention are what I was trying to get at in another comment, but you explained it better :)

freakynit 103 days ago | |

Does this affect writes negatively?

cullenking 102 days ago | | |

Not any differently than another indexed text field

jandrewrogers 103 days ago |

There is a lot of literature on join operations using discrete global grid systems (DGGS). H3 is a widely used DGGS optimized for visualization.

If joins are a critical performance-sensitive operation, the most important property of a DGGS is congruency. H3 is not congruent; it was optimized for visualization, where congruency doesn’t matter, rather than for analytical computation. For example, the article talks about deduplication, which is not even necessary with a congruent DGGS. You can do joins with H3, but it is not recommended as a general rule unless the data is small enough that you can afford to brute-force it to some extent.

H3 is great for doing point geometry aggregates. It shines at that. Not so much geospatial joins though. DGGS optimized for analytic computation (and joins by implication) exist, they just aren’t optimal for trivial visualization.

dgsan 103 days ago |

I don't like what scrolling this site does to my browser history.

nmstoker 103 days ago | |

Yes, noticed that too. Blocked me getting back to HN. Bad behaviour from the site.

adrriv 100 days ago | |

Just pushed a fix so it doesn't clutter your history anymore. Thanks for the heads-up

stevemk14ebr 102 days ago | |

Truly trash blog design

markstos 102 days ago |

There are some competing grid systems with similar features and benefits as well.

Notably A5, which has the property that each cell covers exactly the same area, even when stretched towards the north and south pole. Useful for certain spatial analysis where you need every cell to have the same size.

https://a5geo.org/

jandrewrogers 102 days ago | |

That is a nicely designed DGGS, a lot of attention paid to the details. I hadn't seen it before.

markstos 102 days ago | | |

The author of A5 was recently featured on the Mapscaping podcast: https://mapscaping.com/podcast/a5-pentagons-are-the-new-best...

febed 103 days ago |

Wouldn’t having a spatial index give you most of the performance gains talked about here without needing H3?

feverzsj 103 days ago | |

Yes. And it should be faster. They may have forgotten to create a spatial index.

twelvechairs 102 days ago | | |

Agree with this. They are re-solving a problem that has been solved better by others before (with R-trees).

They may well be using some data storage where spatial indexing is not possible or standard. GeoParquet is a common one now: a great format in many ways, but spatial indexing isn't there.

Postgres may be out of fashion, but an old-fashioned PostGIS server is still the simplest solution sometimes.

hiddew 102 days ago |

Also see PostGIS for Postgres systems: https://postgis.net/docs/.

geophile 102 days ago |

Z-order based indexes avoid the resolution problem. Basically:

- Generate z-values for spatial objects. Points -> a single z-value at the highest resolution of the space. Non-points -> multiple z-values. Each z-value is represented by a single integer. (I use 64-bit z-values, which provide for a space resolution of 56 bits.) Each integer represents a 1-d range. E.g. 0x123 would represent 0x123000 through 0x123fff.

- Spatial join is basically a merge of these z-values. If you are joining one spatial object with a collection of N spatial objects, the time is O(log N). If you are joining two collections, then it's more of a linear-time merge.
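The two steps above can be sketched as follows. This is a simplified illustration, not the geophile implementation: it passes the prefix length separately instead of packing it into the 64-bit value, and `find_region` assumes the stored ranges are disjoint. All function names are hypothetical.

```python
from bisect import bisect_right


def interleave(x: int, y: int, bits: int = 16) -> int:
    """Morton/z-order value: x bits in odd positions, y bits in even positions."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z


def z_range(prefix: int, prefix_bits: int, total_bits: int) -> tuple[int, int]:
    """1-d range covered by a z-value prefix, e.g. 0x123 -> 0x123000..0x123fff."""
    shift = total_bits - prefix_bits
    lo = prefix << shift
    return lo, lo | ((1 << shift) - 1)


def find_region(regions: list[tuple[int, int, str]], z: int):
    """Binary-search disjoint (lo, hi, label) ranges sorted by lo: O(log N)."""
    starts = [lo for lo, _, _ in regions]
    i = bisect_right(starts, z) - 1
    if i >= 0 and regions[i][1] >= z:
        return regions[i][2]
    return None


regions = sorted([
    (*z_range(0x123, 12, 24), "region A"),
    (*z_range(0x7F, 8, 24), "region B"),
])
hit = find_region(regions, 0x123ABC)   # falls inside 0x123000..0x123fff
miss = find_region(regions, 0x124000)  # just past region A
```

Joining two collections replaces the binary search with a single merge pass over both sorted lists of ranges.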

For more information: PROBE Spatial Data Modeling and Query Processing in an Image Database Application. IEEE Trans. Software Eng. 14(5): 611-629 (1988)

An open source java implementation: https://github.com/geophile/geophile. (The documentation includes a number of corrections to the published algorithm.)

galkk 102 days ago |

Ohh, every geo join/spatial thing with a picture of those small cells over a map is such a pet peeve of mine. Facebook Marketplace, Craigslist, Tinder, any app with “proximity search”.

No, this city isn’t 4 miles from my city. There is a literal lake between us. It’s 10+ miles.

Please, invent something, do precompute, but just avoid naive-ish searches.

boxed 102 days ago | |

Is this related to the article?

cyanydeez 102 days ago | | |

He's just angry he's not a crow.

galkk 102 days ago | | |

Yes. The pictures with those small grids that ignore highways, rivers and mountains are what bother me.

analytically 102 days ago |

  At 500 stations:
  - H3: 218µs, 4.7KB, 109 allocs
  - Fallback: 166µs, 1KB, 37 allocs
  - Fallback is 31% faster

  At 1000 stations:
  - H3: 352µs, 4.7KB, 109 allocs
  - Fallback: 312µs, 1KB, 37 allocs
  - Fallback is 13% faster

  At 2000 stations:
  - H3: 664µs, 4.7KB, 109 allocs
  - Fallback: 613µs, 1KB, 37 allocs
  - Fallback is 8% faster

  At 4500 stations (real-world scale):
  - H3: 1.40ms, 4.7KB, 109 allocs
  - Fallback: 1.34ms, 1KB, 37 allocs
  - Fallback is 4% faster

  Conclusion: The gap narrows as station count increases. At 4500 stations they're nearly equivalent. H3 has fixed overhead (~4.7KB/109 allocs for k=2 ring), while fallback scales linearly. The crossover point where H3 wins is likely around 10-20K entries.

mattforrest 102 days ago |

I wrote a post about using H3, or any DGGS for that matter. Yes, it speeds things up, but you lose accuracy. If search is the primary concern it can help, but if any level of accuracy matters I would just use a better engine with GeoParquet to handle it. https://sedona.apache.org/latest/blog/2025/09/05/should-you-...

gct 101 days ago |

You can do this with [S2](https://s2geometry.io/) as well, which has the very nice property that parent cells do indeed always contain their children, and sorting the cell IDs puts them into in-order traversal order.
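Both properties fall out of how the IDs are laid out, which a toy model can show. The sketch below is a simplified stand-in for S2's scheme, not the S2 library: it keeps the "path bits followed by a single marker bit" layout and the range arithmetic, but ignores the cube-face bits and the Hilbert-curve ordering. `cell_id` is a hypothetical helper.

```python
MAX_LEVEL = 30


def cell_id(path: list[int]) -> int:
    """Build an ID from child indices (0-3), one per level, plus a marker bit."""
    bits = 0
    for child in path:
        bits = (bits << 2) | child
    return ((bits << 1) | 1) << (2 * (MAX_LEVEL - len(path)))


def lsb(cid: int) -> int:
    """Lowest set bit: the marker, whose position encodes the level."""
    return cid & -cid


def parent(cid: int) -> int:
    new_lsb = lsb(cid) << 2
    return (cid & -new_lsb) | new_lsb


def contains(outer: int, inner: int) -> bool:
    """A cell contains exactly the IDs within (lsb - 1) of its own ID."""
    span = lsb(outer) - 1
    return outer - span <= inner <= outer + span


p = cell_id([2, 1])
kids = [cell_id([2, 1, c]) for c in range(4)]
ordered = sorted(kids + [p])  # children 0 and 1, then the parent, then 2 and 3
```

Containment is a pair of integer comparisons, so "all descendants of this cell" is a range scan on any sorted index.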

mgaunard 102 days ago |

Why doesn't it use k-d trees or r-trees?

cpa 102 days ago | |

The big reason is that H3 is data-independent. You put your data in predefined bins and then join on them, whereas k-d trees and R-trees depend on the data, and building the trees may become prohibitive or very hard (especially in distributed systems).

mgaunard 102 days ago | | |

Indices are meant to depend on the data yes, not exactly rocket science.

Updating an R-tree is O(log n), just like any other index.

mcherm 102 days ago |

Why does H3 use a nonoverlapping set of hexagons? A square grid would make it even simpler and faster to calculate. I am perfectly happy to believe that a hex grid works better for some reason but what is that reason?

RaczeQ 102 days ago | |

The main design goal was to make the distance between neighbours constant. With squares, you have 4 side neighbours and 4 corner neighbours. With hexagons, it's easier to interpolate paths and analyse distances.
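The neighbour-distance point is easy to check on an idealized flat grid with unit spacing. This sketch uses axial coordinates for the hexagons; on H3's actual spherical grid the distances are only approximately equal.

```python
import math

# All 8 neighbours of a square cell: 4 share a side, 4 share only a corner.
SQUARE_NEIGHBOURS = [
    (dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)
]

# The 6 neighbours of a hexagon in axial (q, r) coordinates.
HEX_NEIGHBOURS = [(1, 0), (0, 1), (-1, 1), (-1, 0), (0, -1), (1, -1)]


def hex_center(q: int, r: int) -> tuple[float, float]:
    """Cartesian center of an axial hex coordinate, with unit center spacing."""
    return (q + r / 2, r * math.sqrt(3) / 2)


square_distances = sorted(
    {round(math.hypot(dx, dy), 6) for dx, dy in SQUARE_NEIGHBOURS}
)
hex_distances = sorted(
    {round(math.hypot(*hex_center(q, r)), 6) for q, r in HEX_NEIGHBOURS}
)
```

`square_distances` has two values (side neighbours at 1, corner neighbours at about 1.414), while `hex_distances` has a single value, which is why rings of hexagons approximate a radius more evenly.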

kbaker 102 days ago | |

Maybe this comparison with S2 will explain:

https://h3geo.org/docs/comparisons/s2/

avereveard 102 days ago |

Nice writeup, terrible website garbling the page history.

I wonder how it compares with geohashing. I know geohashing is not as efficient in terms of partitioning, and queries end up weird since you need to manage decoding neighbor cells, but finding all elements of a cell is a "starts with" query, which lets you store the data effectively in most NoSQL databases that offer some sort of text sorting.

tiagod 102 days ago |

I've used the built-in H3 primitives in ClickHouse and it's a treat.