Fishing for graphs in a Hadoop data lake(oreilly.com) |
Does anyone have an idea?
1. Most graphs are small (< 100M nodes & edges, probably even < 1M). So analysts just load a CSV directly into Graphistry or dump it into pandas and work from there. Most Graphistry users do this. It became so common that we baked a "hypergraph" transform into our library that shortcuts the data wrangling problem of turning SQL/CSV records into a node table + edge table.
2. Sometimes the data is too big, or they want to use a query language they're more comfortable with than pandas. We'll see a bunch of SQL (incl. Spark), Splunk, Elastic, etc. when approach #1 isn't enough. No need for Neo4j/Titan/GraphX for that problem. If they end up doing this a lot, Neo4j ends up being a sensible choice because of the ergonomics of the Cypher query language.
3. Sometimes, graph queries or analytics _are_ technically critical. We'll mostly see analytics via NetworkX or maybe igraph, such as for slightly better community detection, or something smarter than degrees for node sizes. Sometimes we'll see query langs, probably Neo4j because (I'm guessing) the database is packaged accessibly. For ergonomic reasons, I've been expecting that the efforts around openCypher for Spark will eventually supplant GraphX for the exploratory case, and that we'll start seeing more JanusGraph as it gains more steam.
4. Even more occasionally, people are building true graph algorithms that cannot be sufficiently approximated with their existing tools. E.g., we're seeing a bunch more in the knowledge graph space (ex: finance), and in security/fraud, we're seeing the bigger enterprises needing the same for correlation work. This gets into powering latency-sensitive ML / detection algorithms, fast analyst experiences, etc. However, stuff like regular SQL & Splunk & Spark still gets _most_ teams mostly there with great scaleout etc., so there's a bit of a problem/time/budget/expertise thing going on.
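To make the "records -> node table + edge table" idea from case 1 concrete, here's a rough sketch in plain Python. It is not Graphistry's actual implementation, just the general shape of a hypergraph-style transform as I understand it: each row becomes an "event" node, each distinct column value becomes an "entity" node, and edges link events to their entities. The sample data, column names, and the `records_to_graph` helper are all made up for illustration. It ends with the plain degree count that case 3 uses as the baseline for node sizes.

```python
import csv
import io
from collections import Counter

# Hypothetical sample records; the column names are invented for this sketch.
RAW = """src_ip,dst_ip,user
10.0.0.1,10.0.0.9,alice
10.0.0.2,10.0.0.9,bob
10.0.0.1,10.0.0.7,alice
"""


def records_to_graph(rows, entity_cols):
    """Turn flat records into a node table and an edge table.

    Each row becomes an "event" node, each distinct value in entity_cols
    becomes an "entity" node, and one edge links every event to each of
    its entities.
    """
    nodes = {}  # node id -> {"id": ..., "type": ...}
    edges = []  # {"src": event id, "dst": entity id, "col": column name}
    for i, row in enumerate(rows):
        event_id = f"event::{i}"
        nodes[event_id] = {"id": event_id, "type": "event"}
        for col in entity_cols:
            value = row.get(col)
            if not value:
                continue  # skip missing or empty values
            # Prefix with the column name so equal values in different
            # columns do not collapse into one node.
            entity_id = f"{col}::{value}"
            nodes.setdefault(entity_id, {"id": entity_id, "type": col})
            edges.append({"src": event_id, "dst": entity_id, "col": col})
    return list(nodes.values()), edges


rows = list(csv.DictReader(io.StringIO(RAW)))
nodes, edges = records_to_graph(rows, ["src_ip", "dst_ip", "user"])

# Degree per node: the simple baseline metric for node sizes.
degree = Counter()
for edge in edges:
    degree[edge["src"]] += 1
    degree[edge["dst"]] += 1
```

With pandas the same thing is a handful of `melt`/`drop_duplicates` calls; the point is only that the wrangling is mechanical enough to automate.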
We've been happy to support all these kinds of projects at Graphistry -- and are often part of the entry into them -- so always happy to chat about it. Likewise, I'm not listing work by good teams like those at Datastax Graph, Blazegraph, and Amazon Neptune -- we see them, they're just used more in specific enterprise/federal scenarios.
And obviously, one should use the right tool for the purpose. I think Graphistry is a good choice for graph visualization, while graph databases like ArangoDB or Neo4j will be good at ad hoc traversals. And multi-model databases like ArangoDB or OrientDB will be good at a wide range of ad hoc queries. Anyway, thanks again for the pointers.
The nuance being... with stuff like data science notebooks and pandas, the people skilled enough to do extraction are also skilled enough that it's easier to just use pandas. The exception is repeat work, or when it is for regular analysts. Friendly query languages like Neo4j's Cypher help there. Not sure what Arango supports... Gremlin? Proprietary?
Graphistry's environment is agnostic, and _not_ a database, so it'd be wrong of me to advocate teams drop their system of record and use just us ;-) We ended up building a visual "playbook" investigation environment to help teams streamline these scenarios. They run visual playbooks against their legacy db (Splunk, Elastic, SQL, ...) for faux-graph queries, or their new graph db for deeper ones (e.g., path queries). So we're more of your system of record + superpowers for your investigations, kind of like a smarter version of what Tableau/Looker do for SQL.
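For a feel of the difference between a "faux-graph" query and a path query, here's a minimal sketch using Python's built-in `sqlite3`. The `edges` table and its contents are made up, and this is only one way to express it, not how any particular playbook works. A fixed number of hops is just a self-join, which any SQL store handles fine. An open-ended path query needs a recursive CTE (with a hop cap so cycles can't run forever), and that's about where a graph query language starts to feel nicer.

```python
import sqlite3

# Hypothetical edge table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e")],
)

# Faux-graph query: nodes exactly two hops from 'a', via a self-join.
two_hop = [
    row[0]
    for row in conn.execute(
        """
        SELECT DISTINCT e2.dst
        FROM edges e1
        JOIN edges e2 ON e1.dst = e2.src
        WHERE e1.src = ?
        ORDER BY e2.dst
        """,
        ("a",),
    )
]

# Path query: every node reachable from 'a' with its minimum hop count,
# via a recursive CTE. The hops < 5 cap bounds the recursion on cyclic data.
reachable = conn.execute(
    """
    WITH RECURSIVE reach(node, hops) AS (
        SELECT ?, 0
        UNION
        SELECT e.dst, r.hops + 1
        FROM reach r
        JOIN edges e ON e.src = r.node
        WHERE r.hops < 5
    )
    SELECT node, MIN(hops) FROM reach GROUP BY node ORDER BY node
    """,
    ("a",),
).fetchall()
```

The self-join stays readable at two hops but grows by one join per extra hop, which is the practical reason teams doing this a lot tend to move to something like Cypher.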