Making geo joins faster with H3 indexes (floedb.ai)
Instead of using integer IDs for the hexes, we created an encoded version of the ID with the property that removing a character gives you the containing parent of the cell. This means we can do basic containment queries by using a low-resolution hex (a short string) as a prefix query. If a GPS track goes through this larger parent cell, the track will have hexes with the same prefix. You don’t get perfect control of distances because hexes have varying diameters (or rather approximate diameters, since they are hexes, not circles), but in practice and at scale, for a product that doesn’t require high precision, it’s very effective.
I think at the end of this year we’ll have about 6 TB of these hex sets in a four-node, 8-process ES cluster. Performance is pretty good. It also acts as our full-text search: half the time we want a geo search we also want keyword search, filtering, etc. on the metadata of these trips.
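To make the prefix idea concrete, here is a minimal sketch. The ID format is made up for illustration (a two-character base cell followed by one digit 0-6 per resolution level); it is not H3's real string format and not necessarily the encoding described above. The names `parent` and `tracks_through` are hypothetical.

```python
# Hypothetical ID layout: 2 characters for the base cell, then one digit (0-6)
# per resolution level. This is an illustration, not H3's actual encoding.
BASE_LEN = 2


def parent(cell_id: str) -> str:
    """Drop the last character to get the containing parent cell."""
    if len(cell_id) <= BASE_LEN:
        raise ValueError("a base cell has no parent")
    return cell_id[:-1]


def tracks_through(track_cells: dict[str, set[str]], coarse_cell: str) -> set[str]:
    """Return tracks that have at least one cell inside coarse_cell.

    A real data store would answer this with an indexed prefix query;
    the linear scan here only shows the semantics.
    """
    return {
        track
        for track, cells in track_cells.items()
        if any(cell.startswith(coarse_cell) for cell in cells)
    }


tracks = {"t1": {"A7012", "A7013"}, "t2": {"B3456"}}
print(parent("A7012"))                # A701
print(tracks_through(tracks, "A70"))  # {'t1'}
```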
Pretty fun system to build, and the concept works with a wide variety of data stores. Felt like a total hack job but it has stood the test of time.
Thanks Uber, H3 is a great library!
Before that existed (pre 1.0 actually), I did something similar with geohashes, which are similar to H3 but based on simple string-encoded quad trees. I indexed all the street segments in OpenStreetMap with that (~800 million at the time) and implemented a simple reverse geocoder. Worked shockingly well.
The geo_shape type uses a BKD tree in binary format. It's heavily optimized for these kinds of intersects/overlaps queries at scale. Basically does the same thing but using a lot less disk space and memory. It's similar to what you would find in proper GIS databases. Elasticsearch/OpenSearch also support H3 and geohash grid aggregations on top of geo_shape or geo_point types.
I'm guessing the author is using something like PostgreSQL, which of course has similar geospatial indexing support via PostGIS.
* I'd frame it as "kludge", reserving "hack" for the positive HN sense. :)
If joins are a critical performance-sensitive operation, the most important property of a DGGS is congruency. H3 is not congruent: it was optimized for visualization, where congruency doesn’t matter, rather than analytical computation. For example, the article talks about deduplication, which is not even necessary with a congruent DGGS. You can do joins with H3, but it is not recommended as a general rule unless the data is small enough that you can afford to brute-force it to some extent.
H3 is great for doing point geometry aggregates. It shines at that. Not so much geospatial joins though. DGGS optimized for analytic computation (and joins by implication) exist; they just aren’t optimal for trivial visualization.
Notably A5, which has the property that each cell covers exactly the same area, even when stretched towards the north and south pole. Useful for certain spatial analysis where you need every cell to have the same size.
They may well be using some data storage where spatial indexing is not possible or standard. GeoParquet is a common one now: a great format in many ways, but spatial indexing isn't there.
Postgres may be out of fashion, but an old-fashioned PostGIS server is still the simplest solution sometimes.
- Generate z-values for spatial objects. Points -> a single z-value at the highest resolution of the space. Non-points -> multiple z-values. Each z-value is represented by a single integer. (I use 64-bit z-values, which provide a space resolution of 56 bits.) Each integer represents a 1-d range, e.g. 0x123 would represent 0x123000 through 0x123fff.
- Spatial join is basically a merge of these z-values. If you are joining one spatial object with a collection of N spatial objects, the time is O(log N). If you are joining two collections, then it's more of a linear-time merge.
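The two steps above can be sketched roughly as follows. This is a toy version under stated assumptions, not the geophile implementation: coordinates are small non-negative integers, a region is given as a bit prefix plus its length, and the function names (`interleave`, `z_range`, `points_in_region`) are made up for the example.

```python
from bisect import bisect_left, bisect_right


def interleave(x: int, y: int, bits: int) -> int:
    """Interleave the low `bits` bits of x and y into one z-value (x bit first)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z


def z_range(prefix: int, prefix_bits: int, total_bits: int) -> tuple[int, int]:
    """Expand a z-value prefix into the inclusive 1-d range it covers."""
    shift = total_bits - prefix_bits
    lo = prefix << shift
    hi = lo | ((1 << shift) - 1)
    return lo, hi


def points_in_region(sorted_z: list[int], prefix: int, prefix_bits: int,
                     total_bits: int) -> list[int]:
    """Find all point z-values inside one region via two binary searches."""
    lo, hi = z_range(prefix, prefix_bits, total_bits)
    return sorted_z[bisect_left(sorted_z, lo):bisect_right(sorted_z, hi)]


# The example from the comment: prefix 0x123 (12 bits) in a 24-bit space.
print([hex(v) for v in z_range(0x123, 12, 24)])  # ['0x123000', '0x123fff']

# A 4x4 grid (2 bits per axis): the region with prefix 0b11 is one quadrant.
points = sorted(interleave(x, y, 2) for x in range(4) for y in range(4))
print(points_in_region(points, 0b11, 2, 4))      # [12, 13, 14, 15]
```

The single-object lookup is the O(log N) case. Joining two sorted collections would walk both lists in step instead of searching.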
For more information: PROBE Spatial Data Modeling and Query Processing in an Image Database Application. IEEE Trans. Software Eng. 14(5): 611-629 (1988)
An open source java implementation: https://github.com/geophile/geophile. (The documentation includes a number of corrections to the published algorithm.)
No, this city isn’t 4 miles from my city. There is a literal lake between us. It’s 10+ miles.
Please, invent something, do precompute, but just avoid naive-ish searches.
At 500 stations:
- H3: 218µs, 4.7KB, 109 allocs
- Fallback: 166µs, 1KB, 37 allocs
- Fallback is 31% faster
At 1000 stations:
- H3: 352µs, 4.7KB, 109 allocs
- Fallback: 312µs, 1KB, 37 allocs
- Fallback is 13% faster
At 2000 stations:
- H3: 664µs, 4.7KB, 109 allocs
- Fallback: 613µs, 1KB, 37 allocs
- Fallback is 8% faster
At 4500 stations (real-world scale):
- H3: 1.40ms, 4.7KB, 109 allocs
- Fallback: 1.34ms, 1KB, 37 allocs
- Fallback is 4% faster
Conclusion: The gap narrows as station count increases. At 4500 stations they're nearly equivalent. H3 has fixed overhead (~4.7KB/109 allocs for k=2 ring), while fallback scales linearly. The crossover point where H3 wins is likely around 10-20K entries.
Updating an R-tree is log(n) just like any other index.
I wonder how it compares with geohashing. I know geohashing is not as efficient in terms of partitioning, and queries end up weird since you need to manage decoding neighbor cells, but finding all elements of a cell is a "starts with" query, which lets you store the data effectively in most NoSQL databases that have some sort of text sorting.
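For reference, a minimal sketch of standard geohash encoding, which shows where the "starts with" property comes from: each extra character only subdivides the cell named by the characters before it. The function name `geohash_encode` is just for this example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 9) -> str:
    """Encode a WGS84 point as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    value, nbits = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = lon >= mid
            if bit:
                lon_lo = mid
            else:
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            if bit:
                lat_lo = mid
            else:
                lat_hi = mid
        value = (value << 1) | int(bit)
        nbits += 1
        use_lon = not use_lon
        if nbits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            value, nbits = 0, 0
    return "".join(chars)


coarse = geohash_encode(57.64911, 10.40744, 3)
fine = geohash_encode(57.64911, 10.40744, 8)
print(coarse)                   # u4p
print(fine.startswith(coarse))  # True
```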
That said, this feels like an issue with rendering geometry rather than with the index itself. I’m curious to hear more about why you think the lack of congruency affects H3’s performance for spatial joins. Under the hood, it’s still a parent–child hierarchy very similar to S2’s — H3 children are topological rather than geometric children (even though they still mostly overlap).
Congruency allows for much more efficient join schedules and maximizes selectivity. This minimizes data motion, which is particularly important as data becomes large. Congruent shards also tend to be more computationally efficient generally, which does add up.
The other important aspect not raised here is that congruent DGGS have much more scalable performance when they are used to build online indexes during ingestion. This follows from them being much more concurrency friendly.
Not familiar with geo stuff / DGGS. Is H3 not congruent because hexagons, unlike squares or triangles, do not tile the plane perfectly?
I mean: could a system using hexagons ever be congruent?
You could get a Gosper-island like tiling starting from H3 by saying that each "Hex" is defined recursively to be the union of its 6/7 parts (stopping at some small enough hexagons/pentagons if you really want). Away from the pentagons, these tiles would be very close to Gosper islands.
I was wrong about this (e.g. https://en.wikipedia.org/wiki/Rhombic_triacontahedron). It still seems possible to me that there's a limit to the smallest tile that can tile a unit sphere on its own. (Smallest by diameter as a set of points in R^3).
This is all speculation, but intuitively your criticism makes sense.
Also, mapping 147k cities to countries should not take 16 workers and 1 TB of memory; I think the example in the article is not a realistic workload.
Not rocket science but different tradeoffs, that’s what engineering is all about.
The whole advantage over a static partition is that it will allow you to properly deal with data that is irregularly distributed.
Those data structures can definitely be merged if that's what you're asking.
About your binary tree comment: yes this is absolutely valid, but consider then that binary trees also are a bad fit for distributed computing, where data is often partitioned at the top level (making it no longer a binary tree but a set of binary trees) and cross-node joins are expensive.
To me, the big selling point of H3 is that once you’re "in the H3 system", many operations don’t need to worry about geometry at all. Everything is discrete. H3 cells are nodes in a tree with prefixes that can be exploited, and geometry or congruency never really enter the picture at this layer.
Where geometry and congruency do come in is when you translate continuous data (points, polygons, and so on) into H3. In that scenario, I can totally see congruency being a useful property for speed, and that H3 is probably slower than systems that are optimized for that conversion step.
However, in most applications I’ve seen, the continuous-to-H3 conversion happens upstream, or at least isn’t the bottleneck. The primary task is usually operating on already "hexagonified" data, such as joins or other set operations on discrete cell IDs.
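As a rough illustration of what "joins on discrete cell IDs" means here, a plain hash join is enough once both sides are already expressed as cells. The cell IDs below are arbitrary placeholder integers and `cell_join` is a hypothetical helper, not part of any H3 binding. The result is a set of candidate pairs sharing a cell; whether an exact geometry check is still needed afterwards depends on the application.

```python
from collections import defaultdict


def cell_join(left, right):
    """Join two lists of (cell_id, payload) on equal cell IDs.

    No geometry is involved: the join key is just an opaque integer.
    """
    by_cell = defaultdict(list)
    for cell, payload in right:
        by_cell[cell].append(payload)
    return [
        (left_payload, right_payload)
        for cell, left_payload in left
        for right_payload in by_cell.get(cell, [])
    ]


cities = [(1, "city-a"), (2, "city-b")]
countries = [(1, "country-x"), (1, "country-y"), (3, "country-z")]
print(cell_join(cities, countries))
# [('city-a', 'country-x'), ('city-a', 'country-y')]
```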
Am I understanding the bottleneck correctly?
H3 is optimized for equal-area point aggregates. Congruency does not matter for these aggregates because there is only a single resolution. To your point, in H3 these are implemented as simple scalar counting aggregates -- little computational geometry required. Optimized implementations can generate these aggregates more or less at the speed of memory bandwidth. Ideal for building heat maps!
H3 works reasonably for sharding spatial joins if all of the cells have the same resolution and are therefore disjoint. The number of records per cell can be highly variable so this is still suboptimal; adjusting the cell size to get better distribution just moves the suboptimality around. There is also added complexity if polygon data is involved.
The singular importance of congruence as a property is that it enables efficient and uniform sharding of spatial data for distributed indexes, regardless of data distribution or geometry size. The practical benefits follow from efficient and scalable computation over data stored in cells of different size, especially for non-point geometry.
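A toy sketch of the sharding idea, assuming a flat unit square and a plain quadtree rather than any real DGGS: because the four children exactly tile their parent, a crowded cell can be split locally and every point still lands in exactly one shard, whatever the data distribution. The function `build_shards` and its parameters are invented for the example.

```python
def build_shards(points, capacity, cell=(0.0, 0.0, 1.0), max_depth=16):
    """Recursively split a square cell until each leaf holds <= capacity points.

    points: list of (x, y) with 0 <= x, y < 1.
    cell:   (x0, y0, size) of the current square.
    Returns {leaf_cell: points_in_leaf}; leaves may have different sizes.
    """
    if len(points) <= capacity or max_depth == 0:
        return {cell: points}
    x0, y0, size = cell
    half = size / 2
    shards = {}
    for dx in (0, 1):
        for dy in (0, 1):
            cx, cy = x0 + dx * half, y0 + dy * half
            inside = [
                p for p in points
                if cx <= p[0] < cx + half and cy <= p[1] < cy + half
            ]
            shards.update(
                build_shards(inside, capacity, (cx, cy, half), max_depth - 1)
            )
    return shards


# Three points clustered in one corner, one far away.
pts = [(0.10, 0.10), (0.12, 0.11), (0.13, 0.14), (0.90, 0.90)]
shards = build_shards(pts, capacity=2)
print(sorted({size for (_, _, size) in shards}))  # [0.125, 0.25, 0.5]
print(max(len(v) for v in shards.values()))       # 2
```

Only the crowded corner is refined; the rest of the square stays as large cells.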
Some DGGS optimized for equal-area point aggregates are congruent, such as HEALPix[0]. However, that congruency comes at high computational cost and unreasonably difficult technical implementation. Not recommended for geospatial use cases.
Congruence has an important challenge that most overlook: geometric relationships on a 2-spheroid can only be approximated on a discrete computer. If you are not careful, quantization to the discrete during computation can effectively create tiny gaps between cells or tiny slivers of overlap. I've seen bugs in the wild from when the rare point lands in one of these non-congruent slivers. Mitigating this can be costly.
This is how we end up with DGGS that embed the 2-spheroid in a synthetic Euclidean 3-space. Quantization issues on the 2-spheroid become trivial in 3-space. People tend to hate two things about these DGGS designs though, neither of which is a technical critique. First, these are not equal area designs like H3; cell size does not indicate anything about the area on the 2-sphere. Since they are efficiently congruent, the resolution can be locally scaled as needed so there are no technical ramifications. It just isn't intuitive like tiling a map or globe. Second, if you do project the cell boundaries onto the 2-sphere and then project that geometry into something like Web Mercator for visualization, it looks like some kind of insane psychedelic hallucination. These cells are designed for analytic processing, not visualization; the data itself is usually WGS84 and can be displayed in exactly the same way you would if you were using PostGIS, the DGGS just doesn't act as a trivial built-in visualization framework.
Taking data stored in a 3-space embedding and converting it to H3-ified data or aggregates on demand is simple, efficient, and highly scalable. I often do things this way even when the data will only ever be visualized in H3 because it scales better.
If the objective is to overfit for high-performance scalable analytics, including congruency, the most capable DGGS designs are constructed by embedding a 2-spheroid in a synthetic Euclidean 3-space. The metric for the synthetic 3-space is usually defined to be both binary and a whole multiple of meters. The main objection is that it is not an “equal area” DGGS, so not good for a pretty graphic, but it is trivially projected into one as needed so it doesn’t matter that much. The main knobs you might care about are the spatial resolution and how far the 3-space extends, e.g. it is common to include low-earth orbit in the addressable space.
I was working with a few countries on standardizing one such design but we never got it over the line. There is quite a bit of literature on this, but few people read it and most of it is focused on visualization rather than analytic applications.
For the wider tech world, I would say Postgres suffers from being "old tech" and somewhat "monolithic". There have been a lot of trends against it (e.g. NoSQL, fleeing the monolith, data lakes). But also, more practically, for a lot of businesses geospatial is not their primary focus. They bring other tech stacks, so something like PostGIS can seem like duplication if they already use another database, data storage format or data processing pipeline. Also, the proliferation of other software and file formats has made some use cases easier without PostGIS.
Really, I'd say the most common path I've seen for people who don't have an explicit geospatial background and who are starting to implement it is to avoid PostGIS until it becomes absolutely clear that they need it.
You can of course also use H3 in PostGIS directly, as well as R-trees. It helps significantly for heatmap creation and sometimes for neighbourhood searches.
DGGS that use 3-space embeddings are topologically 3-dimensional i.e. purely volumetric. They do not interpret the Earth as a 2-surface. In addition to polar coordinates, you must provide a volumetric model of the Earth to compute the DGGS cell. The shard distributions look very different between a 2-surface and a 3-surface. The latter has significantly better properties for large analytical data models but requires more sophisticated storage architectures.
The synthetic 3-space is optimized for two things. First, you want a maximally efficient mapping function from the typical WGS84 geometry into it. Tidy math, basically. Since it is purely internal, the user will never see it, and it doesn't map to anything real, you have latitude to design it to satisfy software engineering objectives as long as it works. Second, the 3-space references are naturally less compact than 2-surface references at the same resolution even though you'll end up with roughly the same number of shards. A lot of effort goes into schemes to compress out the sparseness so that the storage requirements are similar to 2-surface DGGS, e.g. how often do you need to represent geometry 1000 km below the Earth's surface?
These DGGS also have the low-key advantage that they natively represent and understand 3-space, not just surface geometry, if you move beyond making flat maps.
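To give a feel for the 3-space idea, here is a sketch that uses the standard WGS84 geodetic-to-ECEF conversion as a stand-in for the mapping function and power-of-two cubes in meters as cells. This is an assumption for illustration only: the comments above describe a synthetic 3-space and do not say that ECEF is used. The names `geodetic_to_ecef` and `cell_3d` are hypothetical.

```python
import math

# WGS84 ellipsoid constants.
WGS84_A = 6378137.0                    # semi-major axis, meters
WGS84_F = 1 / 298.257223563            # flattening
WGS84_E2 = WGS84_F * (2 - WGS84_F)     # first eccentricity squared


def geodetic_to_ecef(lat_deg: float, lon_deg: float, h_m: float = 0.0):
    """Convert WGS84 latitude, longitude and ellipsoidal height to ECEF meters."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    n = WGS84_A / math.sqrt(1 - WGS84_E2 * math.sin(lat) ** 2)
    x = (n + h_m) * math.cos(lat) * math.cos(lon)
    y = (n + h_m) * math.cos(lat) * math.sin(lon)
    z = (n * (1 - WGS84_E2) + h_m) * math.sin(lat)
    return x, y, z


def cell_3d(lat_deg: float, lon_deg: float, h_m: float = 0.0, level: int = 10):
    """Index of the cube with edge 2**level meters that contains the point.

    Hypothetical cell scheme: cubes at level L nest exactly inside cubes at
    level L + 1, so the parent index is each component shifted right by one.
    """
    edge = 1 << level
    return tuple(math.floor(c / edge) for c in geodetic_to_ecef(lat_deg, lon_deg, h_m))


child = cell_3d(48.85, 2.35, 35.0, level=10)
parent = cell_3d(48.85, 2.35, 35.0, level=11)
print(tuple(i >> 1 for i in child) == parent)  # True
```

Height is just another input, so a point in low-earth orbit gets a cell in the same way as a point on the surface.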