Ask HN: Storing and Processing Large Amounts of Temporal-Spatial Data

As part of our research group, we're collecting large amounts of location data. Each record essentially looks like (user id, lat/long coordinates, timestamp). There's other metadata involved too, but it isn't relevant here. We're collecting about 2-3 million records a week and expect to end up with about a year's worth of data, which works out to roughly 100-150 million records.

I'd really like some advice on techniques for storing and processing this data. We'd like to be able to answer queries similar to:

(1) For a given location, who was near that location (within a specified distance) over a specified period of time?
(2) Which locations are near each other?

That's the general idea. We don't need real-time responses.

I'm hoping to get pointers towards general strategies:

- How do we store this data? Does it even make sense to store it all in a database, and if so, which databases (or other data storage software) are a good fit?
- I've come across people talking about k-d trees. Do they work at this scale?
- Which software/packages lend themselves well to distance/radius calculations?
- What kind of hardware do we need?

We're most familiar with Python/Linux, would prefer to stay away from Java, and prefer open source/free software. We're new to all this, so pointers to books and papers would also be useful. Any and all advice would be greatly appreciated.
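
To make query (1) concrete, here is a minimal brute-force sketch of what we mean. It assumes records are held in memory as (user_id, lat, lon, timestamp) tuples and uses the haversine formula for great-circle distance; the function names `haversine_km` and `users_near` are hypothetical, just for illustration, not part of any existing package.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def users_near(records, lat, lon, radius_km, start, end):
    """Return the set of user ids seen within radius_km of (lat, lon)
    between start and end (inclusive). Linear scan over all records."""
    return {
        user_id
        for user_id, rlat, rlon, ts in records
        if start <= ts <= end
        and haversine_km(lat, lon, rlat, rlon) <= radius_km
    }


if __name__ == "__main__":
    # Hypothetical sample data, made up for illustration only.
    sample = [
        ("alice", 40.001, -75.0, datetime(2012, 1, 1, 12, 0)),
        ("bob", 41.0, -75.0, datetime(2012, 1, 1, 12, 5)),
    ]
    print(users_near(sample, 40.0, -75.0, 1.0,
                     datetime(2012, 1, 1), datetime(2012, 1, 2)))
```

This scans every record, so it is only a baseline for checking results. Presumably a spatial index (such as the k-d trees mentioned above) or a database with spatial support would replace the linear scan at our scale.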