Ask HN: Fast, In-Memory, Distributed data analysis and machine learning?

5 points by henrythe9th 12 years ago | 13 comments

We're looking to implement a new data pipeline architecture at work. The primary goal is speed (data size is small enough to fit entirely in memory, sharded across multiple machines if needed). The primary bottleneck is feature extraction, transformation and iteration, which is both CPU and read/write intensive. Model building is not too slow, so no need to distribute training/testing as of yet.

I've heard good things about Spark/Shark and Storm. Does anyone have any experiences or recommendations? Maybe we don't even need a super sophisticated system and a Riak/Redis K-V store cluster would do?

Thanks in advance

karterk 12 years ago |

Hard to offer suggestions without knowing the rough size of the data. Depending on how much money you're willing to cough up, even 1 TB is in "can fit in memory" territory.

Having said that, Spark is really great for running iterative algorithms and will definitely fit with what you have described. I suggest staying away from building it on your own using Riak/Redis (at least until you have ruled out Spark), as you will run into lots of operational issues like handling failures, resource allocation, retries, etc.

henrythe9th 12 years ago | |

Thanks for your input. We're talking roughly 5 GB of data. Data growth should be linear over the next 6 months. Money is not a big concern. Speed of iteration is key.

We frequently run different processing algorithms over the entire stored dataset (stored data doesn't change) and update the calculated features each time. Not sure if this helps narrow things down. Thanks

karterk 12 years ago | | |

A little bit of context: I have done a lot of Hadoop work and am also well acquainted with Spark and Storm. Storm is mostly suited to handling a stream of real-time data. Spark is specifically for running iterative algorithms: it can read from HDFS, and with the expressiveness of Scala, it's great for building machine-learning related stuff.

However, 5 GB of data is literally nothing, and that statement holds until your data size is at least 50-60 GB. Given that 64 GB RAM machines are now commodity, I would just load the entire thing in RAM and write a multi-threaded program. Sounds old school, but regardless of how well documented Hadoop, Spark and Storm are, there is nevertheless a learning curve and a maintenance cost. Both are worth paying only if you see your data rapidly growing into the X TB range. Otherwise, it might be easier to just stick it on a single machine and get stuff done.
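
A minimal sketch of the single-machine approach described above, written in Python for brevity rather than Scala/Java. The record layout, `extract_features`, the worker count and the chunk size are hypothetical placeholders, not anything from this thread. One caveat: threads in CPython only speed up work that releases the GIL (I/O, NumPy and similar), so for pure-Python CPU-bound extraction `ProcessPoolExecutor` is the usual swap.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_features(record):
    # Hypothetical per-record feature extraction; replace with real logic.
    text = record["text"]
    return {"id": record["id"], "length": len(text), "words": len(text.split())}


def chunked(items, size):
    # Yield consecutive slices of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_chunk(chunk):
    # Work on a whole chunk per task to keep scheduling overhead low.
    return [extract_features(r) for r in chunk]


def extract_all(records, workers=4, chunk_size=1000):
    # The whole dataset is already a list in RAM; fan chunks out to a pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(process_chunk, chunked(records, chunk_size))
        # pool.map preserves input order, so features line up with records.
        return [feat for chunk in results for feat in chunk]


# Stand-in for the real dataset, loaded fully into memory.
records = [{"id": i, "text": "sample record number %d" % i} for i in range(5000)]
features = extract_all(records)
```

Because the stored data does not change, the loaded list can stay resident between runs and only `extract_features` needs to change from one iteration to the next.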

You can stick to Scala/Java, and so long as you develop good abstractions around your core algorithms, you can always move to Spark/Hadoop when you need it. Feel free to send me an email if you want to talk more (email in profile).
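
One possible reading of "good abstractions", as a hedged sketch: keep each transformation a pure function of a single record and hide the execution strategy behind one small runner function, so the local runner could later be replaced by a cluster-backed one. Python is used for brevity; `Pipeline`, `run_local` and the two transforms are hypothetical names, and nothing below is Spark or Hadoop API.

```python
from typing import Callable, Iterable, List

Record = dict
Transform = Callable[[Record], Record]
Runner = Callable[[Transform, Iterable[Record]], List[Record]]


def run_local(fn, records):
    # Simplest possible backend: a plain in-process map.
    return [fn(r) for r in records]


class Pipeline:
    def __init__(self, transforms: List[Transform], runner: Runner = run_local):
        self.transforms = transforms
        self.runner = runner  # the only place that knows how work is executed

    def _apply_all(self, record: Record) -> Record:
        # Apply every transform in order to a single record.
        for transform in self.transforms:
            record = transform(record)
        return record

    def run(self, records):
        return self.runner(self._apply_all, records)


def add_length(r):
    # Pure function: returns a new record instead of mutating the input.
    return {**r, "length": len(r["text"])}


def add_is_long(r):
    return {**r, "is_long": r["length"] > 10}


pipeline = Pipeline([add_length, add_is_long])
out = pipeline.run([{"text": "short"}, {"text": "a much longer text"}])
```

The design choice is that the algorithms never reference the runner, so changing how records are distributed should not require touching the transforms.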

agibsonccc 12 years ago |

I can vouch for Storm, if only for the fact that it's pretty easy to set up (especially compared to Hadoop). Being able to leverage ZooKeeper gives you some extra coordination capabilities as well. With that being said, just watch how you build your bolts/spouts. There are lots of ways you can send data into the system, but in general, Storm's documentation has been superb to work with.

I built a mini library for myself to auto construct the topologies based on a set of named dependencies to handle bolt/spout wiring. Aside from that, the builder interface for it is really nice if your data pipeline doesn't change.
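
The idea of deriving wiring from named dependencies can be illustrated with a topological sort. This is not the library mentioned above and it uses no Storm API; the component names and the `wiring_order` and `edges` helpers are made up for illustration, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical components: each name maps to the names it consumes from.
dependencies = {
    "reader": [],                     # a source (a spout, in Storm terms)
    "tokenize": ["reader"],
    "features": ["tokenize"],
    "store": ["features", "reader"],
}


def wiring_order(deps):
    # Order components so that every producer precedes its consumers;
    # raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(deps).static_order())


def edges(deps):
    # (producer, consumer) pairs that a topology builder would need to connect.
    return [(src, dst) for dst, srcs in deps.items() for src in srcs]


order = wiring_order(dependencies)
links = edges(dependencies)
```

A real builder would then walk `order`, register each component, and connect it to its producers from `links`.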

There's good support for testing with a local cluster as well.

henrythe9th 12 years ago | |

Thanks for your suggestion. Do you have any specific readings for me to look into for building bolts/spouts for sending data into the system?

Thanks

agibsonccc 12 years ago | | |

Here's the root wiki: https://github.com/nathanmarz/storm/wiki

Here's the system architecture: https://github.com/nathanmarz/storm/wiki/Concepts

Here's non JVM languages (specifically python) for building spouts/bolts https://github.com/nathanmarz/storm/wiki/Using-non-JVM-langu...

Here's an example project: https://github.com/nathanmarz/storm-starter

x0x0 12 years ago |

you should check out http://0xdata.com/ ; it's built from the ground up on a custom dkv to do in-memory ML. Reasons to check it out:

1 - it's open source https://github.com/0xdata/h2o

2 - ingest data from hdfs, s3, csv

3 - I've built systems like what you're discussing twice; the ML algorithms are often easier to write than expected, while data management (moving data, sending updates, etc.), which initially seems easier, is much harder. 0xdata handles this for you.

4 - under active development

5 - it runs cleanly on your dev box with 1 or many nodes for development; deploying is as simple as uploading a jar to a cluster and putting a single file on each node naming the peers in the cluster

5a - see scripts to walk you through doing this

disclosure: I work on it as of very recently =P

nihar 12 years ago |

Have you looked at Oracle Coherence? It's pretty lightweight and has clustering features as well.

henrythe9th 12 years ago | |

Thanks for the suggestion. Looks very interesting, but I couldn't find much information about it outside of Oracle's own site.

How's the community and use cases for Coherence?

Thanks

nihar 12 years ago | | |

Not much in terms of the open source community, but Oracle forums have some good support for this. Plus, the documentation that comes with the product is pretty decent, and a lot of really large firms use the solution.