Hexagons and Hilbert curves – The headaches of distributed spatial indices

Hexagons and Hilbert curves – The headaches of distributed spatial indices(hivekit.io)

89 points by max_sendfeld 2 years ago | 27 comments

spenczar5 2 years ago |

Oh boy, this gives me a chance to talk about one of the gems of astronomy software which deserves to be better known: HEALPixel tesselation!

HEALPixels stand for 'Hierarchical Equal-Area Iso-latitudinal Pixels'. It is a scheme that was developed to analyze signals that cover the entire sky, but with variable density.

Like HTM or Hilbert curves, this can be used to organize spatial data.

The tesselation looks kind of funny but has many good features - it doesn't have discontinuities at poles, and is always equal area. And with the "nested" healpixel formulation, pixels are identified by integers. Pixel IDs are hierarchical based on leading bits - so, for example, pixel 106 (=0110 1010) contains pixel 1709 (=0110 1010 1101). This lets you do some marvelous optimizations in queries if you structure your data appropriately. Nearest neighbor searches can be extremely quick if things are HEALPix-indexed - and so can radius searches, and arbitrary polygon searches.

HEALPixels are used today for more than just their original intent. LSST will use them for storing all-sky data and point source catalogs, for example.

More here:

- Original NASA/JPL site: https://healpix.jpl.nasa.gov/

- Popular Python implementation: https://healpy.readthedocs.io/en/latest/

- Good PDF primer: https://healpix.jpl.nasa.gov/pdf/intro.pdf

And an experimental database being built on healpix for extremely large data volumes (certainly many TB, maybe single-digit PB): https://github.com/astronomy-commons/hipscat

aimor 2 years ago | |

Thanks for sharing, I've needed this for years and never stumbled across it.

eh_why_not 2 years ago | |

Gem indeed! Here's its wikipedia page:

https://en.wikipedia.org/wiki/HEALPix

michelpp 2 years ago |

I experimented with geospatial Hilbert Curves as a Postgres extension [0] for PostGIS using the S2 [1] spherical geometry library. S2 uses a scale free cell coverage pattern that is numbered using six interlocking space filling Hilbert Curves [2]. This approach is similar to what is used in this article. S2 doesn't use hexagons but a different cell structure but the idea is generally the same.

By having both high level (cell) and low level (cell id) geometries it was a very powerful library which allowed projection from the hilbert space into a Postgres spatial index (spgist) including various trees, like noted in this article. It appears to be still quite active in development.

[0] https://github.com/michelp/pgs2

[1] https://s2geometry.io/

[2] https://s2geometry.io/devguide/s2cell_hierarchy

max_sendfeld 2 years ago | |

Ah - I've used S2 in the past. Great work. It scales really well for larger datasets.

Interesting that you mention the use of multiple hilbert curves as well. We also experimented with two Hilbert Curves, rotated by 90 degrees. This helps to get around what we've dubbed the "Hilbert Equator" problem where two objects are quite far on the curve because they are placed close to one of the major fault lines in the fractal (for lack of a better word)

jandrewrogers 2 years ago | | |

There are always caveats with these types of indexing methods, especially if you require dynamic high-performance indexing and support for rectangles/polygons. You can only move the edge cases around, there is no way to eliminate them. This allows you to tailor an indexing scheme for specific workload assumptions but this obviously breaks down if you need an algorithm that generalizes to many workloads and data models. There isn't just one type of edge case with this type of indexing, there are several which may or may not be relevant depending on what you are trying to do.

Some research from the 1980s showed it is only possible to mitigate bounded categories of edge case, thereby improving generality, by indexing on complex higher-dimensionality embeddings. Mitigating more categories requires more dimensions and more complexity. However, no one could figure out how to construct these embeddings for even basic cases or deal with more practical curse of dimensionality issues, so that is largely forgotten (the researchers themselves made comments to the effect that they didn't think a tractable solution was possible). I've never even been able to find that literature in electronic form, unfortunately.

Like AI, it is an interesting open-ended problem space. You can prove that an elegant optimal solution is not tractable so it ends up being a search for asymptotically optimal algorithms that become exponentially more complex the closer you get to optimal.

michelpp 2 years ago | | |

Tackling these boundary problems are the literal "edge cases" in geospatial indexing and they exist everywhere, so this i a good reason for using an existing library as the authors have already solved them.

Hexagons are cool, but they are not necessarily the bestagon for a spherical geometry since you cannot break a hexagon into smaller hexagons, whereas an S2 cell is a "cube" with spherical sides or HEALPix uses a rhombic dodecahedron [0] both of which can be split into smaller divisions of themselves.

Not to discourage you from your experimentation, it's all trade offs and you might find a good one. Good luck!

[0] https://en.wikipedia.org/wiki/HEALPix

plagiarist 2 years ago | | |

Does that end up with specific points that have that same difficulty at intersections of equators? I suppose even if that is the case, a point is something that's easier to position over a dead area than a line would be.

durkie 2 years ago | |

any idea why postgis hasn't formally incorporated S2 cell functions? It seems like it'd be right up their alley.

tromp 2 years ago |

> a vehicle with a Hilbert Curve position of 0.34 is really close to one with 0.35 and really far from one with 0.89

But points with a large difference in their single curve coordinate can be either far apart or close together. E.g. on this 16 point Hilbert curve

     __.  .__ 
     __|  |__ 
    |   __   |
    |__|  |__|

the '.' marked points at 1/16th and 15/16th along the curve are adjacent.

wolframhempel 2 years ago | |

It's an important point. We've ended up using two Hilbert curves, rotated by 90 degrees to tackle this problem. If it is close on either, it is close in 2D space

tromp 2 years ago | | |

Interesting. So one might think that being close together in 2D space corresponds to being close on

BOTH cartesian coordinates / EITHER Hilbert coordinate

And that being far apart in 2D space corresponds to being far apart on

EITHER cartesian coordinate / BOTH Hilbert coordinates

But if we consider the two commas below which are close together in 2D space, we see they are far apart in any rotation of this Hilbert curve?!

        __ __    __ __
    |__|   __|  |__   |__|
     __   |__    __|   __
    |  |__ __|  ;__ __|  |
    |__    __,__ __    __|
     __|  |__    __|  |__
    |   __   |  |   __   |
    |__|  |__|  |__|  |__|

dwallin 2 years ago | | |

As mentioned in other replies, this doesn't fully solve a lot of the cases. I think z order curves might actually be better here, they are slightly less accurate individually, but in a way that would seemingly minimize the worst case results when taking the minimum of the two. But at that point you are probably better off using a single z-order curve and using the coordinates to check the manhattan or Chebyshev distance using bitwise operations.

sp332 2 years ago | | |

So you converted your 2D space into another 2D space?

joe_the_user 2 years ago |

The problem of finding the shortest path on a map (a la google maps etc) is often solved with approaches like "hub networks" and "contraction hierarchies" rather than simple spatial networks. Things like dijkstra's algorithm are only efficient if you're input is the raw graph and point and you only want one path once.

See just for a start: https://arxiv.org/abs/1504.05140

Lichtso 2 years ago |

Here is one of the recent advancements in performing similarity (including nearest neighbor) search on sparse high dimensional data using learned indices: https://arxiv.org/abs/2204.10028

The paper basically deals with the most adverse case but the approach should work in lower dimensions and in data with a non-uniform density distribution as well.

zX41ZdbW 2 years ago |

I've recently implemented a map of the Internet using a similar technique: https://reversedns.space/

In ClickHouse, we have almost every mentioned technique: H3 and S2, geohashes, and indexing by space-filling curves.

esafak 2 years ago | |

For reference:

https://clickhouse.com/docs/en/sql-reference/functions/geo/h...

https://clickhouse.com/docs/en/sql-reference/functions/geo/s...

feverzsj 2 years ago |

The purpose of using space-filling curves is to convert coordinates into 1 dimensional data, thus they can be indexed by ordinary database index, no matter distrusted or not. The problem is it's much slower than a "real" spatial index like R-tree, and the supported queries are rather limited.

On the other hand, spatial data can be easily partitioned, rendering the aforementioned method less useful. Also, I've never seen anyone needed more than a single instance of postgres to index all the spatial data, unless you are dealing with spatial temporal data, which can grow infinitely large. Then again, spatial temporal data can be easily partitioned.

patches11 2 years ago |

It would be interesting to know what other solutions exist in this space and why they weren’t chosen.

I know of at least one, GeoMesa, which seems like it could at least provide the building blocks to achieve what they are trying to do.

joe_the_user 2 years ago | |

There's a long of study on the subject (that I've only surveyed). See my comment https://news.ycombinator.com/item?id=39792884

scentoni 2 years ago |

Also see "Evenly distributing points on a sphere" https://news.ycombinator.com/item?id=17842057

klysm 2 years ago |

I got to use space filling curves in my first ever professional programming experience to build image pyramids in a cache efficient way

xrd 2 years ago |

Makes me want to play more hex.

https://www.drking.org.uk/hexagons/hex/templates.html

otoburb 2 years ago |

I knew about hexagons but not about R-trees and Hilbert curves, and then combining the two. Even though that was an inbound content article, it was at least informative.

__ __ __ __ |__| __| |__ |__| __ |__ __| __ | |__ __| ;__ __| | |__ __,__ __ __| __| |__ __| |__ | __ | | __ | |__| |__| |__| |__|