Ask HN: Fast, In-Memory, Distributed data analysis and machine learning?

We're looking to implement a new data pipeline architecture at work. The primary goal is speed: the data is small enough to fit entirely in memory, sharded across multiple machines if needed. The main bottleneck is feature extraction, transformation, and iteration, which is both CPU-intensive and read/write-intensive. Model building is not too slow, so there is no need to distribute training/testing yet.

I've heard good things about Spark/Shark and Storm. Does anyone have experience with them, or other recommendations? Maybe we don't even need a sophisticated system, and a Riak/Redis key-value store cluster would do?

Thanks in advance.