Ask HN: Storing and processing less than 1TB of unstructured data?

2 points by johnnycarcin 6 years ago | 4 comments

Many HN folks deal with data problems, so I thought I'd ask: if you had to store and index less than 1TB of unstructured (plain text) data, what would you use?

I have a bunch of text files and HTML pages that I'd like to dump into something and then be able to search over, and maybe even find relationships (common terms, phrases, etc.) between the various docs. I've heard of things like Hadoop, but that seems to be overkill for the amount of data I have. I'd also like to keep things as low-cost as possible, as this is just for personal use. I've looked at a few of the cloud providers but am honestly not sure what I'm looking for, so I find myself walking away more confused than when I started.

This seems like an easy problem, but for whatever reason I'm getting wrapped around the axle on it.

dekhn 6 years ago

I recommend the book "Managing Gigabytes", which, while dated, is still relevant. The title doesn't indicate this, but it's heavily focused on data structures for indexing text documents.
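To give a feel for the kind of data structure involved, here is a minimal sketch of an inverted index, which maps each term to the documents that contain it. This is an illustration only, not code from the book: the tokenizer, the function names, and the sample documents are all made up for the example, and a real index would add compression, ranking, and on-disk storage.

```python
import re
from collections import defaultdict

# Assumption: terms are runs of lowercase letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return TOKEN_RE.findall(text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Copy the first posting set so the intersection never mutates the index.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, keyed by file name.
docs = {
    "a.txt": "Hadoop is overkill for small data",
    "b.txt": "Elasticsearch indexes small text data quickly",
}
index = build_index(docs)
print(sorted(search(index, "small data")))  # ['a.txt', 'b.txt']
```

Search engines like Elasticsearch build on the same idea, with a lot more engineering on top.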

But Elasticsearch running on a cloud VM with an attached EBS volume would be a fast way to get work done.

1e10 6 years ago

1TB is nothing these days. If you insist on cloud, Hetzner could be the best bang for the buck. Otherwise, a desktop system with similar capacity can be acquired for less than 1000 USD.

I’d start with Solr or Elasticsearch and a simple, home-rolled Python indexing script.
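A rough sketch of what such a script might look like, using only the standard library and Elasticsearch's `_bulk` REST endpoint. The URL, the index name `docs`, the field names `path` and `content`, and the choice of file extensions are all assumptions for illustration; adjust them to your setup.

```python
import json
import urllib.request
from html.parser import HTMLParser
from pathlib import Path

ES_URL = "http://localhost:9200"  # assumption: local node without auth
INDEX = "docs"                    # hypothetical index name


class TextExtractor(HTMLParser):
    """Collect the text of an HTML page, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def read_text(path):
    """Return the file's text; HTML files are reduced to their visible text."""
    raw = path.read_text(encoding="utf-8", errors="replace")
    if path.suffix.lower() in (".html", ".htm"):
        parser = TextExtractor()
        parser.feed(raw)
        parser.close()
        # Collapse runs of whitespace left behind by the markup.
        return " ".join(" ".join(parser.parts).split())
    return raw


def bulk_body(root, index=INDEX):
    """Build an NDJSON body for the _bulk endpoint from files under root."""
    lines = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() not in (".txt", ".html", ".htm"):
            continue
        doc_id = path.relative_to(root).as_posix()
        # Each document is an action line followed by a source line.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"path": doc_id, "content": read_text(path)}))
    return "\n".join(lines) + "\n" if lines else ""


def send(body, es_url=ES_URL):
    """POST an NDJSON body to the _bulk endpoint and return the parsed reply."""
    req = urllib.request.Request(
        es_url + "/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With a node running, something like `send(bulk_body("/path/to/files"))` would index a small directory in one request. For anything close to 1TB you would want to send the files in batches rather than building one giant body.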

Then you can use the Solr admin UI or something like Jupyter for iterative querying.

I’m not an expert on index tuning, but you might even be able to dump it all into Postgres with JSON types.

Best of luck!

johnnycarcin 6 years ago

Yeah, the amount of data is pretty small in the grand scheme of things; maybe that's why I'm getting so hung up, haha. Elasticsearch was actually the first thing I thought of, so maybe I'll just go with that and see what happens...

johnnycarcin 6 years ago

Coming back, I stumbled over this while looking at options: https://docs.alephdata.org/. It is a bit more heavyweight than plain Elasticsearch, but it has some nice additions that might make it worth it depending on your situation.