Towards Natural Language Semantic Code Search at GitHub (githubengineering.com)

159 points by Chris911 7 years ago | 52 comments

dgreensp 7 years ago |

It is a real cultural problem how engineers get more excited about machine learning than basic usability.

GitHub search can't even search for a literal string, let alone a regex. It can't search a subdirectory. Ranking is indistinguishable from random. It's been this way for years. How about building an actual, usable, basic code search and then getting all fancy with your machine learning?

I almost built my own "online git grep for GitHub" last year.

Sir_Cmpwn 7 years ago | |

I agree with this sentiment 100%. I can use traditional search engines for "how to ping a rest thing in python", but I can't grep Github for even basic snippets of code. I don't think their global code search has ever been useful. Glad they have their priorities straight /s

kornish 7 years ago | |

Agreed. Luckily, we as a community have tools like Sourcegraph which are based on battle-tested pragmatic systems from places like Google.

Disclaimer: no affiliation, just love the team and product.

welder 7 years ago | | |

At first I thought this would replace Sourcegraph, but looks like it's just an experiment with NLP... Thank goodness we have Sourcegraph for searching GH but especially for searching GHEnterprise in an SOA environment where it's impossible to have every repo cloned locally for ripgrep.

P.S. I'm not affiliated, we just use Sourcegraph at a company I work for.

dgreensp 7 years ago | | |

Thanks for the recommendation

bryanrasmussen 7 years ago | | |

Damn, Sourcegraph is very close to something I've been thinking of building.

samlambert 7 years ago | |

All I can say is that we know this. We know it should be better. There's definitely more to come.

neongreen 7 years ago | |

I'm building one! https://codesearch.aelve.com

Currently it runs on a fairly slow machine, so regex-heavy requests will take some time on big package repositories like Rubygems, but I plan to get a nicer machine soon.

If you know Scala, you can even contribute (wink wink), just ping me. A lot of tasks we have at this stage are pretty basic.

nh2 7 years ago | | |

Nice! Next index all public Github repos?

psychometry 7 years ago | |

You also can't search a forked repository, which is pathetic.

Deimorz 7 years ago | | |

This limitation is especially frustrating when a fork becomes the "primary" repo for a project for some reason. It's probably not a common occurrence overall, but I've run into it at least a couple of times.

A good example is that GitHub's own repo for their CommonMark implementation isn't searchable, because it's a fork of cmark: https://github.com/github/cmark/

bradleyjg 7 years ago | |

Exactly this. I don’t understand why the-thing-I-searched-for.java is so rarely on the first page of results. Doesn’t that seem like an obvious thing I might be interested in?!?

brian-armstrong 7 years ago | |

Yes! When I read the post title I was really excited. Then I clicked in and felt my heart sink a little. Engineers and PMs seem to be too easily swayed by shiny things.

boyter 7 years ago | |

To be fair, it's a hard problem to solve, especially with traditional search engine tools.

Take for example,

    for(int i=0;i<100;i++)

And then a search for i++. Due to the way almost every search tool works, that line would be split into the tokens "for int i 0 100", which are not very useful. Even if you include the characters = ; < + ( ) in the search, you break the ability to do things such as boolean queries or fuzzy search (term~1).
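To make the tokenization problem concrete, here is a minimal Python sketch. Both tokenizers are hypothetical stand-ins written for this example (`prose_tokens` imitates a typical natural-language analyzer, `code_tokens` is one possible code-aware split); neither is what GitHub or searchcode.com actually runs.

```python
import re

SNIPPET = "for(int i=0;i<100;i++)"

def prose_tokens(text):
    """Keep only runs of word characters, as a natural-language analyzer typically would."""
    return re.findall(r"\w+", text)

def code_tokens(text):
    """Hypothetical code-aware split: words, a few multi-character operators, then any single symbol."""
    return re.findall(r"\w+|\+\+|--|[<>=!]=|&&|\|\||\S", text)

def contains_phrase(tokens, query_tokens):
    """True if query_tokens occurs as a contiguous run inside tokens."""
    n = len(query_tokens)
    return any(tokens[i:i + n] == query_tokens for i in range(len(tokens) - n + 1))

# The prose tokenizer drops every symbol, so the query "i++" degrades to just "i".
print(prose_tokens(SNIPPET))
print(prose_tokens("i++"))

# The code-aware tokenizer keeps "++" as its own token, so the phrase can be matched.
print(code_tokens(SNIPPET))
print(contains_phrase(code_tokens(SNIPPET), code_tokens("i++")))

# With prose tokens, a line that never increments anything still matches "i++".
print(contains_phrase(prose_tokens("int i = 5;"), prose_tokens("i++")))
```

The cost of the second approach is the one described above: once symbols are ordinary tokens, they can no longer double as query operators without some escaping scheme.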

It's totally possible to solve these issues by tweaking the input into your index, which is what I did with searchcode.com, or with a different approach, which is what Google Code Search did. However, neither has a requirement to be 100% in sync with the repository, which I suspect is something that the GitHub team values.

All the code search tools suffer from this in some way. At small scale it's possible to just brute force the search. At scale you can do it by tweaking your algorithm and sacrificing accuracy. My feeling is that the GitHub team chose accuracy.
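As a rough illustration of the "different approach": Google Code Search has been publicly described as using a trigram index to narrow the candidate files, then checking those candidates by brute force. The sketch below assumes only that general idea; the function names and sample files are made up for the example, and it handles literal strings rather than full regexes.

```python
from collections import defaultdict

def trigrams(text):
    """All overlapping 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(files):
    """Map each trigram to the set of paths whose content contains it."""
    index = defaultdict(set)
    for path, content in files.items():
        for tri in trigrams(content):
            index[tri].add(path)
    return index

def search_literal(index, files, needle):
    """Use the index to pick candidate files, then verify each one by brute force."""
    tris = trigrams(needle)
    if not tris:
        # Needle shorter than 3 characters: the index cannot help, so scan everything.
        candidates = set(files)
    else:
        candidates = set.intersection(*(index.get(t, set()) for t in tris))
    return sorted(path for path in candidates if needle in files[path])

# Hypothetical sample corpus.
FILES = {
    "loop.c": "for(int i=0;i<100;i++) total += i;",
    "incr.py": "i += 1  # no ++ operator here",
    "readme.txt": "increment i with i++ when looping",
}

index = build_index(FILES)
print(search_literal(index, FILES, "i++"))
print(search_literal(index, FILES, "i++)"))
```

Because the index works on raw characters rather than words, symbols such as `+` and `)` are searchable without any special tokenizer. The trade-off is index size and the work of keeping it in sync with every push.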

amelius 7 years ago | | |

But people use grep on their code all the time ...

karmakaze 7 years ago | |

In full agreement. Every now and then I'll expect a search to work. My solution has been to run etsy/hound [0] for my active repos.

  [0] https://github.com/etsy/hound

petters 7 years ago | |

Could not agree more. Everyone who works for, or has worked for, Google in the last few years knows that an excellent code search does not have to be fancy.

VirenM 7 years ago | |

I've honestly moved to Google, adding

`{search query} -site:github.com/{repo}/{file i want to target}`

It's much clearer and more concise.

mullikine 7 years ago | |

I made a regex search for GitHub and an Emacs plugin for it. In theory I could put this on GitHub. It uses the BigQuery GHTorrent table. There's only so much time in a day, though. If you want it, upvote me.

bitL 7 years ago | |

So what? Here their only mistake is not to license/buy some search engine instead of wasting years on developing another "meh" one. What they are doing here with semantic search is the future and their chance to make all existing code search engines obsolete. Use your favorite Internet search engine to find GitHub's snippets of code instead. Those won't give you semantic code search though.

rococode 7 years ago |

This might just be me, but does anyone else feel that GitHub's code search has other points that could be improved first?

My biggest gripe is that the order results show up in seems to be totally random. For example, if I have a Java class called A and I search "class A" in code search, the actual A.java doesn't tend to show up anywhere near the front. I just tried this in a repo and the actual A.java file was on the last page of results when I searched "class A". The vast majority of the results before it didn't even have the words "class" and "A" next to each other, which A.java does...

Maybe I'm doing something wrong (I'd welcome any input on how to use code search correctly!), but it just feels like they're jumping the gun on trying to make their code search more advanced when the basic functionality doesn't work that well.
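For what it's worth, the ranking behavior being asked for can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `score` function, the weights (1, 5, 10), and the sample files are hypothetical and say nothing about how GitHub actually ranks results.

```python
import re
from pathlib import PurePosixPath

def score(path, content, query):
    """Toy relevance score: term hits, plus bonuses for adjacency and a matching file name."""
    terms = query.split()
    total = 0.0
    # One point for each query term that appears as a whole word.
    for term in terms:
        if re.search(rf"\b{re.escape(term)}\b", content):
            total += 1.0
    # Bonus when the terms appear next to each other, in order.
    phrase = r"\b" + r"\s+".join(re.escape(t) for t in terms) + r"\b"
    if re.search(phrase, content):
        total += 5.0
    # Larger bonus when the file name (without extension) is itself a query term.
    if PurePosixPath(path).stem in terms:
        total += 10.0
    return total

def rank(files, query):
    """Return paths ordered from highest to lowest score."""
    return sorted(files, key=lambda p: score(p, files[p], query), reverse=True)

# Hypothetical repository contents.
FILES = {
    "docs/notes.txt": "A class of problems we have not solved",
    "src/B.java": "public class B extends A { }",
    "src/A.java": "public class A { }",
}

print(rank(FILES, "class A"))
```

With these weights, `src/A.java` comes first for the query "class A" because it gets both the adjacency bonus and the file-name bonus, while the other two files only match the individual terms.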

brian-armstrong 7 years ago | |

Yes, GH search leaves so much to be desired. And this post doesn't actually seem to address the weaknesses.

The search appears to be configured for natural language documents, not code. The stopwords are not right and search appears to strip all sigils. They could get pretty far just by parsing documents and changing their Lucene/Elasticsearch configuration.

samlambert 7 years ago | |

We are very aware of the problem. I think you are going to really love what we are working on.

snaky 7 years ago | | |

Is it some search algorithm that is so new, unusual and groundbreaking that you can't talk about it until it's patented, or what?

hiccuphippo 7 years ago | | |

I really hope you are right and have your priorities straight when it comes to search. I'd love for a way to search for usages of a class::method, or for strings that contain the text "hello" or for variables named foo. And if you integrate that into the code itself, Ctrl+click a class method to find all usages, maybe even usages in other repositories, so I can see how other people use a certain library.

And of course good old regex search.
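A small sketch of what "strings that contain hello" versus "variables named foo" means in practice: the search has to know what kind of token it is looking at. The example below assumes Python source and uses the standard library `tokenize` module; `find_tokens` and the sample source are hypothetical, and a real service would need an equivalent lexer or parser for every language it indexes.

```python
import io
import tokenize

def find_tokens(source, kind, predicate):
    """Yield (line_number, text) for tokens of the given type whose text satisfies predicate."""
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == kind and predicate(tok.string):
            yield tok.start[0], tok.string

# Hypothetical file being searched.
SOURCE = '''foo = "hello world"
bar = foo + 1  # hello in a comment is not a string
print("goodbye")
'''

# String literals containing "hello": the comment on line 2 is not reported.
strings = list(find_tokens(SOURCE, tokenize.STRING, lambda s: "hello" in s))

# Identifiers named exactly "foo".
names = list(find_tokens(SOURCE, tokenize.NAME, lambda s: s == "foo"))

print(strings)
print(names)
```

Finding usages of a specific class::method across repositories needs more than this (name resolution, not just token types), but the token-level version already rules out the false positives a plain text match would return.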

finnh 7 years ago |

I would settle for the ability to use logical OR when searching issues/pull requests, or to combine multiple negated searches.

"is:pr is:open ( author:bob OR author:jim )"

The lack of this pretty basic functionality makes issue & PR search much less useful than it could be.
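The semantics being asked for are simple to state as composable predicates. This is only a sketch of the filter logic in Python: the `Item` type, the helper names, and the sample data are all hypothetical, and it does not parse GitHub's query syntax or call any GitHub API.

```python
from dataclasses import dataclass

@dataclass
class Item:
    kind: str    # "pr" or "issue"
    state: str   # "open" or "closed"
    author: str

def is_(value):
    """Matches items whose kind or state equals value, like is:pr or is:open."""
    return lambda item: value in (item.kind, item.state)

def author(name):
    return lambda item: item.author == name

def all_of(*preds):
    return lambda item: all(p(item) for p in preds)

def any_of(*preds):
    return lambda item: any(p(item) for p in preds)

def not_(pred):
    return lambda item: not pred(item)

# Hypothetical data set.
ITEMS = [
    Item("pr", "open", "bob"),
    Item("pr", "closed", "jim"),
    Item("issue", "open", "jim"),
    Item("pr", "open", "jim"),
    Item("pr", "open", "ann"),
]

# is:pr is:open ( author:bob OR author:jim )
either = all_of(is_("pr"), is_("open"), any_of(author("bob"), author("jim")))

# is:pr is:open, excluding both bob and jim (multiple negations combined)
neither = all_of(is_("pr"), is_("open"), not_(author("bob")), not_(author("jim")))

print([i.author for i in ITEMS if either(i)])
print([i.author for i in ITEMS if neither(i)])
```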

matmo 7 years ago | |

Agreed. It'd also be nice to see a list of issues you're subscribed to. Here's a fun issue to follow for that - https://github.com/isaacs/github/issues/283

sam0x17 7 years ago |

It is awesome that they are working on this, but can I just say there are a lot of basic search features they need to add before "doing the hard thing". Here are some things that I should be able to do easily but can't (or can't very easily or well) using GitHub's search mechanism:

1. exact or close string searches for code that involves ![]{}_-*() etc characters

2. searches across past commits (e.g. find a line that used to be in the code)

4. search across pull request + comments (not just issues and commit messages)

5. advanced search operators -- there should be a full filtering UI with ands and ors etc

Because of this I often find myself grepping locally, or (more often) totally out of luck.

aaaaaaaaaab 7 years ago |

Now that’s what I call a misfeature!

GitHub is used by programmers. Surprisingly, they tend to be very good at telling computers precisely what they want, in the computers’ own language.

Natural language search is the exact opposite of this, invented for mom & pops who start their search phrase with “Dear Google, I’d like to search for ...”.

KenanSulayman 7 years ago |

GitHub has been building some amazing stuff recently. I guess now that Microsoft is going to acquire them, there's far less pressure on making GitHub Enterprise profitable.

DannyBee 7 years ago |

I saw this created in another thread and it seems to accurately sum up the comments here: https://imgflip.com/i/2i90x2

sqs 7 years ago | |

What thread did you originally see that in?

paintstripper 7 years ago |

They should add regex search support first before this stuff.

nraynaud 7 years ago |

Wait, they can't search through forks or collate identical results, and they are going into natural language processing?

manigandham 7 years ago |

Devs don't search code repositories using natural language queries, and any scenarios of searching for code examples that way are already extremely well handled by StackOverflow and Google.

This is an incredible waste of time and resources that could be spent making the existing search far better with very minor tweaks. A perfect example of big company project management where nobody seems to know what their users actually want.

tyingq 7 years ago |

I'd settle for github search that's case sensitive and recognizes things like dollar signs, semi-colons, commas, braces, and such.

HereBeBeasties 7 years ago |

Dear GitHub,

Please build search that lets me actually find a given file by name.

You are busy building a space rocket when all we want is a bicycle. Impressive, but useless for just popping down to the shops.

Love,

The rest of the world's developers

mullikine 7 years ago |

I want to work at github. They're making cool things.

guessmyname 7 years ago | |

As of today, GitHub has 89 open positions:

• 2 openings - Business Systems

• 2 openings - Communications

• 38 openings - Engineering

• 3 openings - Finance

• 1 opening - Internal Communications

• 4 openings - Legal

• 8 openings - Marketing

• 2 openings - People Operations

• 1 opening - Policy

• 7 openings - Product

• 8 openings - Sales

• 9 openings - Security

• 1 opening - Services

• 3 openings - Support

— https://github.com/about/careers

brian-armstrong 7 years ago | |

Do they? The main product appears to have 0 product velocity

nkantar 7 years ago | | |

I used to feel this way, but then I discovered https://blog.github.com/ and no longer do.

Sure, they may not be addressing your/my specific concerns, but the product is changing.

person_of_color 7 years ago | |

I want a MS.