Fishing for graphs in a Hadoop data lake

Fishing for graphs in a Hadoop data lake (oreilly.com)

44 points by bjerun 8 years ago | 5 comments

The article is a nice summary. I think the author missed a key argument for his multi-model case: AWS costs. Short queries via Spark/Hadoop will cost more on AWS than the same queries against a focused graph model on a dedicated graph DB or multi-model DB on AWS.

janemanos 8 years ago

Thanks for the article. It seems like a good approach to combine the strengths of both graph databases & Hadoop. I wonder which other use cases, in addition to the ones described, could be suitable here.

Does anyone have an idea?

lmeyerov 8 years ago

We get good visibility into what folks do in practice based on their use of Graphistry: we're a DB-agnostic scalable visual graph analytics environment, so we've been seeing (& assisting) what analysts do standalone / what developers build / what data scientists do from notebooks.

1. Most graphs are small (< 100M nodes & edges, probably even < 1M). So analysts just load a CSV directly into us or dump into pandas and work from there. Most Graphistry users do this. It became so common that we baked in a transform to our library that shortcuts the data wrangling problem of SQL/CSV records -> node table + edge table via our "hypergraph" transform.

2. Sometimes the data is too big or they want to use a query language they're more comfortable with vs. Pandas. We'll see a bunch of SQL (incl. Spark), Splunk, Elastic, etc. when approach #1 isn't enough. No need for Neo4j/Titan/GraphX for that problem. If they end up doing this a lot, Neo4j ends up being a sensible choice because of the ergonomics of the Cypher query language.

3. Sometimes, graph queries or analytics _are_ technically critical. We'll see mostly analytics via use of NetworkX or maybe iGraph, such as for slightly better community detection, or something smarter than degrees for node sizes. Sometimes we'll see query langs, probably Neo4j because (I'm guessing) the database is packaged accessibly. For ergonomic reasons, I've been expecting the efforts around OpenCypher for Spark will eventually supplant GraphX for the exploratory case, and we'll start seeing more Janus as it gains more steam.

4. Even more occasionally, people are building true graph algorithms that cannot be sufficiently approximated with their existing tools. E.g., we're seeing a bunch more in the knowledge graph space (ex: finance), and in security/fraud, we're seeing the bigger enterprises needing the same for correlation work. This gets into powering latency-sensitive ML / detection algorithms, fast analyst experiences, etc. However, stuff like regular SQL & Splunk & Spark still gets _most_ teams mostly there with great scaleout etc., so there's a bit of a problem/time/budget/expertise thing going on.
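To make the "SQL/CSV records -> node table + edge table" step from point 1 concrete, here is a minimal sketch in plain Python. It is not Graphistry's actual "hypergraph" transform; the function name `records_to_graph`, the `column::value` ID scheme, and the sample columns are all assumptions made up for illustration. The idea: each record becomes an "event" node, each distinct value in a chosen column becomes an entity node, and an edge links the event to each of its entities.

```python
import csv
import io


def records_to_graph(rows, entity_columns):
    """Turn flat records into a node table and an edge table.

    Hypothetical helper for illustration only: every row becomes an
    "event" node, every distinct (column, value) pair becomes an entity
    node, and one edge links the event to each entity it mentions.
    """
    nodes = {}  # node_id -> node record, deduplicated by ID
    edges = []
    for i, row in enumerate(rows):
        event_id = f"event::{i}"
        nodes[event_id] = {"node_id": event_id, "type": "event"}
        for col in entity_columns:
            value = row.get(col)
            if value is None or value == "":
                continue  # skip missing values instead of creating empty nodes
            # Assumption: IDs are prefixed with the column name, so the same
            # value in two different columns yields two different nodes.
            entity_id = f"{col}::{value}"
            nodes.setdefault(entity_id, {"node_id": entity_id, "type": col})
            edges.append({"src": event_id, "dst": entity_id, "edge_type": col})
    return list(nodes.values()), edges


# Made-up sample data standing in for a small CSV or SQL export.
SAMPLE_CSV = """src_ip,dst_ip,user
10.0.0.1,10.0.0.2,alice
10.0.0.1,10.0.0.3,bob
10.0.0.4,10.0.0.2,alice
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
nodes, edges = records_to_graph(rows, ["src_ip", "dst_ip", "user"])
print(len(nodes), "nodes,", len(edges), "edges")
```

The two resulting lists can be written back out as CSVs or loaded into whatever dataframe or visualization tool is at hand. Prefixing IDs with the column name is a design choice of this sketch; dropping the prefix would merge identical values across columns (for example, an IP that appears as both source and destination).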

We've been happy to support all these kinds of projects at Graphistry -- and are often part of the entry into them -- so always happy to chat about it. Likewise, I'm not listing work by good teams like those at Datastax Graph, Blazegraph, and Amazon Neptune -- we see them, they're just used more in specific enterprise/federal scenarios.

neunhoef 8 years ago

Author of the posted article here: thanks for the additional pointers. It seems that Graphistry excels at visualization. Essentially, your offering confirms the main story of the article: make more out of your (graph) data by extracting it from Hadoop to a different tool.

And obviously, one should use the right tool for the purpose. I think Graphistry is a good choice for graph visualization, while graph databases like ArangoDB or Neo4j will be good at ad hoc traversals, and multi-model databases like ArangoDB or OrientDB will be good at a wide range of ad hoc queries. Anyway, thanks again for the pointers.