GitHub's language detection is broken (github.com)
It's probably also a bug to even have the notion of "a language" for a repo, given the burgeoning polyglot programming trend. So many repos these days contain multiple languages, especially when you consider JavaScript, that I question whether it even makes sense to say 'This project is in language X' at all.
Like you say, the best option really would be to let the repo owners / maintainers just specify this stuff. They are, after all, the ones who know.
Note: I'm not saying they shouldn't have the auto-detection, because it definitely helps if the maintainer doesn't do it, but for those that want to help classify things - let them!
"Sorry, I couldn't determine if you had C code in your repo or is that Limbo code?"
For example, I've got JavaScript modules in repositories. For each module, I make a demo version to show what the module does, and that demo includes a bunch of CSS. Apparently, there is more CSS than there is JavaScript, so GitHub labels the module as CSS, but the important part isn't the CSS, it's the JavaScript. To resolve this, I've had to move the CSS into a different repository and ignore it in the JavaScript repository. That seems like a long way around when all I want to do is correct them and say that the module is actually a JavaScript module.
.rb=RealBasic .m=Mercury .pl=Prolog .js=SomeCrapOrOther …
Perl 83.5% Shell 16.5%
There is not a single .pl or .pm file, nor a single mention of 'perl' anywhere in the repository, and all scripts begin with #!/bin/sh.
A number of my other repositories have similar problems, but this one is by far the worst.
> if you'd like Mercury language detection on GitHub then with the current implementation of Linguist you need to pick a different (unique as Objective-C already defines this) primary_extension and add .m to the extensions array which will force Linguist into using the other detection methods mentioned above.
EDIT: or as I like to yell at Github for Windows when it can't revert out of a merge conflict "WHAT IS EVEN THE POINT OF YOU?!"
Classification is never 100% accurate.
EDIT: The exact method used is reported here: https://github.com/github/linguist/pull/748#issuecomment-374...
I expect that JavaScript's GitHub popularity ranking is (a little bit) inflated due to such issues.
https://github.com/github/linguist/blob/master/lib/linguist/...
I suppose I could Google it and act like I know… naw
Are people even reading the context of the rest of the PR?
And even if it WAS irrelevant and only important to a very small number of people, that doesn't mean it can be ignored.
I don't follow. That sounds like the exact criteria for something to be ignored.
Stand back, gents! This one is a champion!
GitHub isn't discriminating against certain programmers. Stay calm and keep coding!
It is discriminating, and harmful to all programmers. We need to be able to easily search for these lesser known languages – they are important cultural works. The commenter points out: "Limbo ... seems to have heavily inspired Go (which is currently extremely fashionable)". We are worse off for not having our history readily accessible.
- learning heuristics based on user suggestion.
- extension filtering to differentiate similar languages.
- the algo would use the prominence and placement of whitespace and non-word characters to create the DNA of a language. If a file scores below a threshold against the DNA, it doesn't presume, it asks the user. If a file scores high against the DNA, it still allows user override. Whenever a user submits their indicator, the file's source would be used to train the heuristic.
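A rough sketch of what the list above describes, in Ruby since that is what Linguist is written in. Everything here is hypothetical: the `DnaDetector` class, the cosine-similarity scoring, and the 0.8 threshold are assumptions made for illustration, not anything that exists in Linguist. It fingerprints a file by counting its whitespace and non-word characters, asks the user when the best score is below the threshold, and trains on any file the user labels.

```ruby
# Hypothetical sketch only; none of these names exist in Linguist.
class DnaDetector
  THRESHOLD = 0.8 # arbitrary cut-off, chosen for illustration

  def initialize
    # language name => accumulated counts of non-word characters
    @dna = Hash.new { |h, lang| h[lang] = Hash.new(0) }
  end

  # Count every whitespace / non-word character in the source (Ruby 2.7+).
  def self.fingerprint(source)
    source.scan(/\W/).tally
  end

  # Fold a labelled file into that language's profile.
  def train(language, source)
    self.class.fingerprint(source).each { |ch, n| @dna[language][ch] += n }
  end

  # A user-supplied label always wins and is also used as training data.
  # Otherwise return the best-scoring language, or :ask_user when unsure.
  def detect(source, user_override: nil)
    if user_override
      train(user_override, source)
      return user_override
    end

    sample = self.class.fingerprint(source)
    language, score = @dna.map { |lang, profile| [lang, cosine(sample, profile)] }
                          .max_by { |_, s| s }
    return :ask_user if language.nil? || score < THRESHOLD

    language
  end

  private

  # Cosine similarity between two character-count vectors.
  def cosine(a, b)
    dot = a.sum { |ch, n| n * b.fetch(ch, 0) }
    norm = Math.sqrt(a.values.sum { |n| n * n }) * Math.sqrt(b.values.sum { |n| n * n })
    norm.zero? ? 0.0 : dot / norm
  end
end

detector = DnaDetector.new
script = "#!/bin/sh\necho hi\n"
p detector.detect(script)                          # :ask_user (nothing trained yet)
p detector.detect(script, user_override: "Shell")  # "Shell" (and trains on the file)
p detector.detect(script)                          # "Shell"
```

A real classifier would need far more training data and a better feature set than raw character counts; the point is only the ask-when-unsure and learn-from-override flow.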
Yep, seems about right.
The alternative is to fix the design issue. But that's going to be a lot harder and require more than a few days.
The work to fix the design issue was already done by @nox, who submitted a pull request which is still open: https://github.com/github/linguist/pull/985
But I honestly can't tell if that's what he meant, or if it was more of a "not my problem" type of response.
Personally I'd like to have a fixed language that I can set and that the search will use. Next to that, it would be fine for me to statically show what the repository contains, but please use a better language detection, just going by extensions is quite naive.
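To make the "fixed language first, detection second" idea concrete, here is a minimal sketch. The `CANDIDATES` table, the `classify` method, and the `pinned:` parameter are all made up for illustration and are not Linguist's API; the extension clashes are the ones mentioned in this thread. A maintainer-set language wins outright, and an extension claimed by several languages is reported as ambiguous instead of being guessed.

```ruby
# Hypothetical sketch; not Linguist's data or API.
CANDIDATES = {
  ".m"  => ["Objective-C", "Mercury"],
  ".pl" => ["Perl", "Prolog"],
  ".rb" => ["Ruby", "RealBasic"],
  ".sh" => ["Shell"]
}.freeze

# pinned: a language the maintainer set explicitly (assumed feature).
def classify(filename, pinned: nil)
  return pinned if pinned

  candidates = CANDIDATES.fetch(File.extname(filename), [])
  case candidates.length
  when 0 then :unknown
  when 1 then candidates.first
  else [:ambiguous, candidates] # extension alone cannot decide
  end
end

p classify("solver.m")                    # [:ambiguous, ["Objective-C", "Mercury"]]
p classify("solver.m", pinned: "Mercury") # "Mercury"
p classify("build.sh")                    # "Shell"
p classify("README")                      # :unknown
```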
The disambiguation test for C++ headers is ridiculous:
matches << Language["C++"] if data.include?("#include <cstdint>")
I use Github for the visual flair and cool features. If I wanted to run my own fundamental architecture, I'd be doing that.
To me this comes off as assuming the worst intentions on behalf of the github developers.
No, the primary_extension is only used in a gists_helper.rb file outside the Linguist repos. Note that the feature is deprecated anyway.
https://github.com/github/linguist/blob/master/lib/linguist/...
> Basically, Github needs to be accepting of programmers of all stripes, or they are destined to be irrelevant (or at least doing lots of scrambling) once the trendy kids move on from the trendy things they're doing and the currently-popular languages start falling out of style with a reversion to a previous status quo. Github needs to accept that there is a vast wealth of code out there which predates it and which will easily postdate it.
Okay there, buddy. I don't think lack of Lingo support is going to be GitHub's eventual downfall.
[1] https://reference.wolfram.com/mathematica/tutorial/Mathemati...
Of course that PR isn't being accepted either.
The software crisis spelled out in a single sentence.
Luckily I use languages popular enough to be classified correctly.
In other words, it seems like the overall design wouldn't be hurt too much by just extricating primary_extension completely. Best-case scenario, primary_extension is equivalent to always having at least one item in extensions. It does nothing else.
Also, this looks like a bug: "if possible_languages.length > 1 ... else possible_languages.first" What if length is 0? That's not greater than 1. I'm not familiar with Ruby, does first return null on an empty array, or does it error? LINQ in .NET has separate First<T> and FirstOrDefault<T> methods: one errors, the other returns default(T) (which is null in the case of reference types). Or is there a default match in the index that occurs when no other language is found? https://github.com/github/linguist/blob/master/lib/linguist/...
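For what it's worth, on the Ruby question: `Array#first` returns `nil` on an empty array rather than raising, so it behaves like `FirstOrDefault`, not `First`. The closest thing to `First` is `Array#fetch`, which raises `IndexError` unless you give it a default. A minimal sketch, where `possible_languages` is the name from the quoted snippet and `:needs_disambiguation` is a hypothetical placeholder for whatever the multi-match branch does (this is not Linguist's actual code):

```ruby
possible_languages = [] # nothing matched

result =
  if possible_languages.length > 1
    :needs_disambiguation # placeholder for the multi-match branch
  else
    possible_languages.first
  end

p result             # nil, no exception
p ["C"].first        # "C"
p [].fetch(0, :none) # :none, explicit default

begin
  [].fetch(0)        # no default given, so this raises
rescue IndexError => e
  puts "fetch raised #{e.class}"
end
```

So a length of 0 would not crash there; whether a `nil` result is handled sensibly further up is a separate question.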
Not instilling a lot of confidence that someone really thought through this bit of code. I'm not saying I very strictly think through everything I write, but I also don't write software for thousands of users, and I acknowledge that I've grown rather complacent in terms of time spent per unit code.
Code that commonly resides in .asp files is completely different from code that commonly resides in .aspx files. They are not synonyms for each other. Also, I would wager that C# aspx files are a tad more common than VB.NET aspx files.
It's even worse than lumping together .c and .cpp. You at least have some chance of getting .c files to compile in a C++ compiler. There is no chance of running ASP code through the ASP.NET engine.
This is why "ASPX" as a term exists, to differentiate from ASP.
In my years of experience online, trolling was specifically riling someone up by saying things the troll doesn't really believe.
Trolling isn't disagreeing that a workaround is sufficient to ignore an actual issue. But that's just my opinion.
At worst, it is confusing to implementers and requires chicanery to work around... which is exactly the case we're in. We're in the worst case scenario for this bit of code, and there is no upside to its best-case scenario. Just delete the code.
Long-term this is probably the right solution, but why go through all this trouble right now if there is a simple workaround? It seems like the only problem right now is a few people's pride.