Graph Mining Library

306 points by zuzatm 2 years ago | 101 comments

esafak 2 years ago |

Graph mining was "so hot right now" ten years ago. Remember GraphX (https://spark.apache.org/graphx/) and GraphLab (https://en.wikipedia.org/wiki/GraphLab) ? Or graph databases?

I guess it coincided with the social network phenomenon. Much more recently geometric learning (ML on graphs and other structures) shone, until LLMs stole their thunder. I still think geometric learning has a lot of life left in it, and I would like to see it gain popularity.

PaulHoule 2 years ago | |

There are "graph databases" which see graphs as a universal approach to data, see RDF and SPARQL and numerous pretenders. For that matter, think of a C program where the master data structure is a graph of pointers. In a graph like that there is usually a huge number of different edge types such as "is married to", "has yearly average temperature", ...

Then there are "graph algorithms" such as PageRank, graph centrality, and such. In a lot of those cases there is one edge type or a small number of edge types.
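To make the single-edge-type case concrete, here is a minimal PageRank sketch by power iteration. The function name, the `out_links` dict layout, and the defaults (damping 0.85, 50 iterations) are illustrative assumptions, not taken from any particular library.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a graph with a single edge type.

    out_links maps every node to the list of nodes it links to
    (hypothetical layout; every target must also appear as a key).
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out_links[v]:
                new_rank[w] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": []}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Because there is only one kind of edge, the whole graph fits in one adjacency structure and the inner loop stays trivial; that is a large part of why these algorithms are easy to hand-optimize.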

There are some generic algorithms you can apply to graphs with many typed edges, such as the magic SPARQL pattern

  ?s1 ?p ?o .
  ?s2 ?p ?o .

which finds ?s1 and ?s2 that share a relationship ?p with some ?o and is the basis for a similarity metric between ?s1 and ?s2. Then there are the cases where you pick out nodes with some specific ?p and apply some graph algorithm to those.
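A rough sketch of what that pattern computes, written over an in-memory list of (subject, predicate, object) tuples instead of a SPARQL engine. The function names, the sample triples, and the choice of Jaccard as the similarity metric are assumptions for illustration; the pattern itself only yields the shared (?p, ?o) pairs.

```python
from collections import defaultdict
from itertools import combinations


def shared_pair_counts(triples):
    """For each pair of subjects, count the (predicate, object) pairs they share."""
    subjects_by_po = defaultdict(set)
    for s, p, o in triples:
        subjects_by_po[(p, o)].add(s)
    counts = defaultdict(int)
    for subjects in subjects_by_po.values():
        # Every two subjects attached to the same (p, o) match the pattern once.
        for s1, s2 in combinations(sorted(subjects), 2):
            counts[(s1, s2)] += 1
    return dict(counts)


def jaccard(triples, s1, s2):
    """Shared (p, o) pairs divided by all (p, o) pairs of either subject."""
    po1 = {(p, o) for s, p, o in triples if s == s1}
    po2 = {(p, o) for s, p, o in triples if s == s2}
    union = po1 | po2
    return len(po1 & po2) / len(union) if union else 0.0


TRIPLES = [
    ("alice", "livesIn", "Paris"),
    ("alice", "likes", "jazz"),
    ("bob", "livesIn", "Paris"),
    ("bob", "likes", "jazz"),
    ("bob", "likes", "chess"),
    ("carol", "likes", "chess"),
]
print(shared_pair_counts(TRIPLES))
print(jaccard(TRIPLES, "alice", "bob"))
```

Grouping by (p, o) first avoids comparing every subject with every other subject, although a very popular (p, o) pair still produces a quadratic number of subject pairs.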

The thing about graphs is, in general, they are amorphous and could have any structure (or lack of structure) at all which can be a disaster from a memory latency perspective. Specific graphs usually do have some structure with some locality. There was a time I was using that magic SPARQL pattern and wrote a program that would have taken 100 years to run and then repacked the data structures and discovered an approximation that let me run the calculation in 20 minutes.
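One common way to repack a graph for better locality is a compressed sparse row (CSR) layout, where each node's neighbors sit contiguously in one flat array. This is only a sketch of that general idea, not the repacking described above; the function names and the integer node IDs are assumptions.

```python
from array import array


def to_csr(num_nodes, edges):
    """Pack a (src, dst) edge list into CSR offset and target arrays."""
    offsets = array("q", [0] * (num_nodes + 1))
    for src, _ in edges:
        offsets[src + 1] += 1              # out-degree of each node
    for i in range(num_nodes):
        offsets[i + 1] += offsets[i]       # prefix sum gives the start positions
    targets = array("q", [0] * len(edges))
    cursor = offsets[:num_nodes]           # copy: next free slot per node
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def neighbors(offsets, targets, node):
    """Neighbors of `node` are one contiguous slice of `targets`."""
    return targets[offsets[node]:offsets[node + 1]]


offsets, targets = to_csr(4, [(0, 1), (0, 2), (1, 2), (2, 0), (2, 3)])
print(list(neighbors(offsets, targets, 2)))
```

Compared with a graph of individually allocated nodes and pointers, scanning a node's neighbors here is a sequential read over a typed array, which is much friendlier to caches.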

Thus practitioners tend to be skeptical about general-purpose graph processing libraries, as you may very well have a problem that I could code up a special-purpose answer to, one that runs 1000x faster, in less time than you'll spend fighting with the build system for that thing.

----

If you really want to be fashionable though, arXiv today is just crammed with papers about "graph neural networks" that never seem to get hyped elsewhere. YOShInOn has made me a long queue of GNN papers to look at but I've only skimmed a few. A lot of articles say they can be applied to the text analysis problems I do but they don’t seem to really perform better than the system YOShInOn and I use so I haven’t been in a hurry to get into them.

Someone 2 years ago | | |

> a universal approach to data, see RDF and SPARQL and numerous pretenders. For that matter, think of a C program where the master data structure is a graph of pointers.

A graph of typed pointers. As you likely know, the basic element of RDF is not “foo has a relationship with bar”, but “foo has a relationship with bar of type baz”.

Also, the types themselves can be part of relationships as in “baz has a relationship with quux of type foobar”

> The thing about graphs is, in general, they are amorphous and could have any structure (or lack of structure) at all which can be a disaster from a memory latency perspective

But that’s an implementation detail ;-)

In theory, the engine you use to store the graph could automatically optimize memory layout for both the data and the types of query that are run on it.

Practice is different.

> Thus practitioners tend to be skeptical about general purpose graph processing libraries

I am, too. I think the thing they’re mostly good for is producing PhDs, both on the theory of querying them, ignoring performance, and on improving performance of implementations.

esafak 2 years ago | | |

1. Graph algorithms like the ones you mentioned are processed not by graph databases like Neo4j, but graph processing libraries like the titular Google library.

2. Geometric learning is the broader category that subsumes graph neural networks.

https://geometricdeeplearning.com/

reaperman 2 years ago | |

I still use NetworkX a lot when a problem is best solved with graph analysis, I really enjoy the DevEx of that package.

emmanueloga_ 2 years ago |

For those wanting to play with graphs and ML I was browsing the arangodb docs recently and I saw that it includes integrations to various graph libraries and machine learning frameworks [1]. I also saw a few jupyter notebooks dealing with machine learning from graphs [2].

Integrations include:

* NetworkX -- https://networkx.org/

* DeepGraphLibrary -- https://www.dgl.ai/

* cuGraph (Rapids.ai Graph) -- https://docs.rapids.ai/api/cugraph/stable/

* PyG (PyTorch Geometric) -- https://pytorch-geometric.readthedocs.io/en/latest/

1: https://docs.arangodb.com/3.11/data-science/adapters/

2: https://github.com/arangodb/interactive_tutorials#machine-le...

afandian 2 years ago |

Can someone with familiarity with Bazel give any clues how to build? `bazel build` does something, but I end up with `bazel-build` and `bazel-build` with no obvious build artefacts.

elteto 2 years ago | |

In bazel //... is the equivalent of the 'all' target in make:

    bazel build //...
    bazel test //...
    bazel query //...

The last one should list all targets (from what I remember).

afandian 2 years ago | | |

Thanks! That last one lists 84 results. None looks obviously like 'main'. Trying a random one:

    bazel run //in_memory/clustering:graph
    ERROR: Cannot run target //in_memory/clustering:graph

I'm going to wait until someone updates the readme I think!

itissid 2 years ago |

Noob Q: Would this library be a (good?) candidate to be integrated with wrapper/extension libraries to have all the graph-based clustering algorithms in one place (assuming they are not already)?

Or do (better?) frameworks for the same function as this code already exist (maybe networkx?)?

sbrother 2 years ago |

I might be (very) far behind the times, but does this have any relationship with Pregel?

cmckn 2 years ago | |

Pregel is a distributed graph processing system, this (AFAICT) is a library for working with graphs in-memory on a single computer.

zekenie 2 years ago |

some examples would be super helpful!

specproc 2 years ago | |

Documentation of any sort would be super helpful.

zuzatm 2 years ago | |

It's coming! Check again in 12 hours, I believe it should be up then!

pharmakom 2 years ago |

Can someone explain what this library might be useful for?

oddthink 2 years ago | |

Clustering. I used the correlation clusterer from here for a problem that I could represent as a graph of nodes with similarity measures (this data looks like this other data) and strong repelling features (this data is known to be different from this other, so never merge them).
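To illustrate the shape of that problem (this is not the library's correlation clusterer, just a greedy toy under assumed names and data layout): similarities pull nodes together, and hard "never merge" pairs block any merge that would put them in the same cluster.

```python
def cluster(nodes, similarity, cannot_link):
    """Greedy agglomerative clustering with hard cannot-link constraints.

    similarity:  dict mapping frozenset({a, b}) -> weight (missing pairs count as 0)
    cannot_link: set of frozenset({a, b}) pairs that must never share a cluster
    Both layouts are hypothetical, chosen to keep the sketch short.
    """
    clusters = [{n} for n in nodes]

    def gain(c1, c2):
        # A single forbidden pair vetoes the whole merge.
        if any(frozenset((a, b)) in cannot_link for a in c1 for b in c2):
            return None
        return sum(similarity.get(frozenset((a, b)), 0.0) for a in c1 for b in c2)

    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                g = gain(clusters[i], clusters[j])
                if g is not None and g > 0 and (best is None or g > best[0]):
                    best = (g, i, j)
        if best is None:
            return clusters            # no allowed merge improves the objective
        _, i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]


sim = {
    frozenset(("a", "b")): 0.9,
    frozenset(("b", "c")): 0.8,
    frozenset(("c", "d")): 0.7,
    frozenset(("a", "c")): 0.6,
}
print(cluster(["a", "b", "c", "d"], sim, {frozenset(("b", "c"))}))
```

Here "b" and "c" look similar but are known to be different, so the result keeps them apart even though merging everything would otherwise score well.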

whitten 2 years ago |

Github says it is C, C++, and Starland.

What is Starland ?

Laremere 2 years ago | |

It's Starlark, the language for configuring the build system Bazel. Bazel is the open source port of Google's internal build system, Blaze. Starlark is a subset of Python.

Terr_ 2 years ago | | |

This list of corporate project name associations makes me wonder where Galactus comes in. :P

https://www.youtube.com/watch?v=y8OnoxKotPQ

ashout33 2 years ago | |

if I had to guess, that is a typo and should be starlark, which is the language used for bazel build files. bazel is the build system they use

jefftk 2 years ago | | |

Github says "Starlark 6.2%", so it looks like whitten's typo, not GitHub's.

nolok 2 years ago | | |

On which keyboard layout is rk into nd a typo ...

nologic01 2 years ago |

Graph algorithms cry out for some standardization. Think blas and lapack.

bigbillheck 2 years ago | |

Consider: https://graphblas.org
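The idea behind it is to express graph algorithms as sparse linear algebra. A toy sketch in plain Python (not the GraphBLAS API; the function name and the dense boolean matrix are assumptions) of BFS as repeated matrix-vector products where AND plays the role of multiply and OR the role of add:

```python
def bfs_levels(adj, source):
    """BFS levels via boolean matrix-vector products.

    adj[i][j] is True when there is an edge i -> j.
    Returns a list of depths, with None for unreachable nodes.
    """
    n = len(adj)
    level = [None] * n
    frontier = [i == source for i in range(n)]
    depth = 0
    while any(frontier):
        for i in range(n):
            if frontier[i]:
                level[i] = depth
        # next[j] = OR over i of (frontier[i] AND adj[i][j]),
        # masked so that already-visited nodes are not revisited.
        frontier = [
            level[j] is None and any(frontier[i] and adj[i][j] for i in range(n))
            for j in range(n)
        ]
        depth += 1
    return level


# Edges: 0->1, 0->2, 1->3, 2->3; node 4 is isolated.
edges = {(0, 1), (0, 2), (1, 3), (2, 3)}
adj = [[(i, j) in edges for j in range(5)] for i in range(5)]
print(bfs_levels(adj, 0))
```

A real implementation would use sparse matrices and an optimized kernel for the product; the point is only that the algorithm reduces to one standard primitive, which is what makes BLAS-style standardization plausible.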

nologic01 2 years ago | | |

I wonder how much overlap this new project has with GraphBLAS and older graph libraries like boost::graph https://www.boost.org/doc/libs/1_83_0/libs/graph/doc/

xxpor 2 years ago |

I was hoping this would mine literal stats graphs for anomaly detection

lanstin 2 years ago | |

https://en.wikipedia.org/wiki/Graph_theory

It's interesting and deceptively simple at first.

blitzar 2 years ago | |

I think they use the word "graph" to mean a different thing to what I use the word for.

supriyo-biswas 2 years ago | |

That’s relatively easy, see https://en.m.wikipedia.org/wiki/Interquartile_range
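A minimal sketch of that approach, assuming the usual 1.5 × IQR fences (the multiplier, the function name, and the sample readings are illustrative choices, not something from the linked article's text that I am quoting):

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)   # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


readings = [10, 12, 11, 13, 12, 11, 10, 12, 95, 11]
print(iqr_outliers(readings))
```

This flags point anomalies in a single metric; it does nothing for seasonality or trends, which usually need a model of the expected value first.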

charcircuit 2 years ago |

Most of these files have a double license header.

ldhulipala 2 years ago | |

Thanks for pointing this out (fixed now).

specproc 2 years ago | | |

If you're working on this repo, can we plz haz docs?

tomrod 2 years ago |

This is a big deal, I think. I'm guessing it's not widely used internally anymore if they are open sourcing it. What is used instead?

dllthomas 2 years ago |

Does it accept graphviz?

ponyous 2 years ago |

No idea where the hype is coming from. Who is actually upvoting this? 0 docs, 0 examples, 0 explanation of how it is useful.

Is "Graph Mining" so ubiquitous that people know what this is all about?

ldhulipala 2 years ago | |

We are updating the README to be more descriptive; in the meantime, please see https://gm-neurips-2020.github.io/ or https://research.google/teams/graph-mining/

ldhulipala 2 years ago | | |

There are now more documents linked to in the README.md and an example you can try to run: https://github.com/google/graph-mining

bafe 2 years ago | |

It was hyped some years ago. There are plenty of legitimate applications of graphs, and perhaps the library offers well-optimized implementations of important algorithms. But the past hype around all things "graph" was not rational. As always, you can't solve all problems with a graph, just as you can't with a neural network or with any other structure/algorithm.

MarkMarine 2 years ago |

Whew. Lots of complaints from people who probably will never need to use this code.

If you need docs just read the .h files, they have extensive comments. I’m sure they’ll add them or maybe, just maybe, you could write some to contribute.

This would have made some of my previous work much easier, it’s really nice to see google open source this.

riku_iki 2 years ago | |

> If you need docs just read the .h files

curious if this is typical dev experience inside google..

dekhn 2 years ago | | |

I think in most cases, back when I worked there, I would have instead searched the monorepo for targets that depended on this library (an easy lookup), and look at how they used it.

Some code libraries had excellent docs (recordio, sstable, mapreduce). But yes, reading the header file was often the best place to start.

MarkMarine 2 years ago | | |

I’m not at google so I’ve got no idea.

Reading the code, especially the header files, seems to be pretty standard as far as what I see in non-open source code. So, it’s been my typical dev experience. I’d say if you’re somewhere that has gleaming, easy to understand docs that are actually up to date with the code, you all have too much time on your hands, but I serially work at startups that are running to market.

ls612 2 years ago | |

I think it’s that it’s not at all obvious how to even build the damn thing, so at least a little bit of readme would have been nice. I agree with the sentiment that this looks like a super cool tool.

PaulHoule 2 years ago | | |

It says you're supposed to leave a ticket if you have questions or comments... A README file isn't much to ask for.

helsinki 2 years ago | |

The .proto files are the documentation everyone is looking for.

corentin88 2 years ago |

Interesting fact: the first commit is 2 years old and is entitled "Boilerplate for new Google open source project".

Either they rewrote git history or it took about 2 years to get approval to make this repo public.

j2kun 2 years ago | |

The code has an internal analogue, and the tooling lets you choose whether to export the entire git history or squash it. They may have chosen the former, in which case it could just be 2 years to migrate and rework the code to be ready for open sourcing. In that time I imagine there were four reorgs and countless priority shifts :)

spankalee 2 years ago | |

If you know you want to open source a project eventually, it's easier if you start it in the open source part of the internal repo with all the licensing and headers in place. Open sourcing existing code is harder because you need to review that it hasn't used something that can't be opened.

So probably they just started the project two years ago, had aspirations to open source it, and finally did so now. Some teams might publish earlier, some like to wait until it's had enough internal usage to prove it out.

progval 2 years ago | |

FWIW they had already pushed that commit four months ago: https://archive.softwareheritage.org/browse/snapshot/bd01717...

hiddencost 2 years ago | |

That could be the template it was cloned from

thfuran 2 years ago | |

Now that's bureaucracy.

numpad0 2 years ago | | |

I'd agree if last commit was 2 years ago.

0x6461188A 2 years ago |

How is this usable. I see no documentation. There is a docs folder but all it contains is a code of conduct.

mathisfun123 2 years ago | |

I'm not trying to be snarky but have you considered reading the code? Like I'll be honest I can't remember the last time I looked at docs at all instead of reading the code itself.

n3150n 2 years ago | | |

Are you for real? I'm also not trying to be snarky but...

8360 unique lines scattered across more than 100 files. Good luck deciphering that in a single day!

By the way, the first issue in the repo is a "Request for a more verbose README", which I agree with.

otteromkram 2 years ago | |

This header file has lots of commentary.

https://github.com/google/graph-mining/blob/main/in_memory/c...

This, too:

https://github.com/google/graph-mining/blob/main/in_memory/s...

Same with most of the other files.

How is it usable? It's usable if you want to find data within lots and lots of data efficiently. That's kinda Google's thing. :-D

PaulHoule 2 years ago |

Towards the end of my relationship with a business partner, he was really impressed with a graph processing library released by Intel (because it was Intel), while my thoughts were "ho hum, this looks like it was done by a student" (like a student who got a C-, not an A student), and I thought about how much I liked my really crude graph processing scripts in Pig that were crazy fast because they used compressed data structures and well-chosen algorithms.