There are two kinds of scientific study: those where you can run another experiment (ideally one that approaches the same question orthogonally) along with rigorous controls, and those where you can't.
The first type is much less likely to have results vary based on analytical technique (effectively the second experiment is a new analytical technique). Of course it does happen sometimes, and sometimes the studies are wrong; still, more controls and more experiments are always better.
However, studies where you're limited by ethical or practical constraints (e.g. most experiments involving humans) don't have that luxury and are therefore far more contingent on decisions made at the analysis stage. What's awesome about this paper is that it kind of gets around this limitation by trying different analytical methods, each effectively being a new "experiment", and seeing if they all reach the same consensus.
Interestingly, very few features in the analysis were shared among a large fraction of the teams (only 2 features were used by more than 50% of teams), which suggests that no matter the method, the result holds true. A similar approach to open data and distributed analysis would be a really great way to eliminate some of the recent trouble with reproducibility in the broader scientific literature.
TL;DR summary: Scientific results are highly contingent on subjective decisions at the analysis stage. Different (well-founded) data analysis techniques on a fairly simple and well-defined problem can give radically different results.
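To make that concrete, here's a rough Python sketch of how one defensible choice (adjust for a covariate or not) can move an estimate. All the counts are made up for illustration and are not from the red card dataset; using player position as the covariate is my assumption, not a claim about what any team did. It compares a crude odds ratio with a Mantel-Haenszel odds ratio pooled across strata.

```python
# Each stratum is (a, b, c, d):
#   a = group 1 with the event,    b = group 1 without the event
#   c = group 2 with the event,    d = group 2 without the event
# The numbers below are invented for illustration only.
HYPOTHETICAL_STRATA = {
    "defenders": (40, 360, 20, 180),
    "forwards": (5, 195, 10, 390),
}


def crude_odds_ratio(strata):
    """Ignore the covariate: pool all counts, then take one odds ratio."""
    a = sum(s[0] for s in strata.values())
    b = sum(s[1] for s in strata.values())
    c = sum(s[2] for s in strata.values())
    d = sum(s[3] for s in strata.values())
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(strata):
    """Adjust for the covariate: Mantel-Haenszel pooled odds ratio."""
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


crude = crude_odds_ratio(HYPOTHETICAL_STRATA)
adjusted = mantel_haenszel_odds_ratio(HYPOTHETICAL_STRATA)
print(f"crude OR:    {crude:.2f}")
print(f"adjusted OR: {adjusted:.2f}")
```

With these invented counts the crude odds ratio comes out around 1.5 while the adjusted one is 1.0, because group 1 is over-represented in the stratum where the event is more common. Neither analysis is "wrong" on its face; they just encode different assumptions.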
It's very interesting research -- a great real-life example supporting the models Scott Page et al. use for the value of cognitive diversity. The thrust of the blog post is about where crowdsourcing analysis can be helpful (as well as reasonable caveats about where it might not apply), which is certainly an interesting question. Obviously, there are a lot of other implications to this as well.
http://fivethirtyeight.com/features/science-isnt-broken/
On the bright side, if you look at the 95% CIs for the 29 studies, almost all of them overlap.
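If you wanted to check that kind of claim yourself, a minimal sketch might look like the following. The intervals here are placeholders I invented, not the 29 teams' actual numbers, and keep in mind that overlapping intervals are only a rough eyeball check, not a formal test that the estimates agree.

```python
from itertools import combinations


def intervals_overlap(first, second):
    """True if two closed intervals (low, high) share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]


def overlap_fraction(intervals):
    """Fraction of all pairs of intervals that overlap."""
    pairs = list(combinations(intervals, 2))
    if not pairs:
        return 1.0
    hits = sum(intervals_overlap(a, b) for a, b in pairs)
    return hits / len(pairs)


# Invented (low, high) bounds for illustration only.
made_up_cis = [(0.9, 1.6), (1.0, 1.8), (1.1, 1.5), (1.7, 2.9)]
print(f"{overlap_fraction(made_up_cis):.2f} of pairs overlap")
```

Swap in the real lower and upper bounds from the paper to see how many pairs actually overlap.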
This seems like a topic where one indeed typically winds up with a multitude of competing conclusions.
Among other factors, we have:
* Pre-existing beliefs on the part of researchers.
* Lack of sufficient data.
* Difficulty in defining hypotheses (is there a skin tone cut-off or should one look for degrees of skin tone and degrees of prejudice, should one look at all referees or some referees).
Given this, I'd say it's a mistake to expect just numeric data at the level of complex social interactions to be anything like clear or unambiguous. If studies on topics such as this have value, they have to involve careful arguments concerning data collection, data normalization/massaging, and only then data analysis and conclusions.
But a lot of the context comes from the prevalence of shoddy studies that assume you can throw data in a bucket and draw conclusions, further facilitated by having those conclusions echoed by mainstream media or by the media of one's chosen ideology.
So, for starters: 29 students get the same question on the math/physics/chemistry exam and give 29 different answers. Breaking news? Obviously not. Either the question was outrageously badly worded (not such a rare thing, sadly), or the students didn't do very well and we've got at most 1 correct answer.
Basically, we've got the very same situation here. Except our "students" were doing statistics, which is not really math and not really natural science. Which is why it is somehow "acceptable" to end up with results like that.
If we are doing math, whatever result we get must be backed up with a formally correct proof. Which doesn't mean, of course, that 2 good students cannot get contradicting results, but at least one of their proofs is faulty, which can be shown. And this is how we decide what's "correct".
If we are doing science (e.g. physics) our question must be formulated in such a way that it is verifiable by setting up an experiment. If the experiment didn't get us what we expected, our theory is wrong. If it did, it might be correct.
Here, our original question was "if players with dark skin tone are more likely than light skin toned players to receive red cards from referees", which is shit, and not a scientific hypothesis. We can define "more likely" as we want. What we really want to know is whether, during the next N matches happening in what we can consider "the same environment", black athletes are going to get more red cards than white athletes. Which is quite obviously a bad idea for a study, because the number of trials we need is too big for so loosely defined a setting: not even 1 game will actually happen in an isolated environment, players will be different, referees will be different, each game will change the "state" of our world. Somebody might even say that the whole culture has changed since we started the experiment, so obviously whatever the first dataset was, it's no longer relevant.
Statistics is only a tool, not a "science", as some people might (incorrectly) assume. It is not the fault of the methods we apply that we get something like that, but rather of the discipline that we apply them to. And "results" like that are why physics is accepted as a science, and sociology never really was.
For a real-world example of this, see deworming schoolchildren.
People looking at the educational effects of deworming children reach different conclusions because some of them use a medical model and some of them use an economics model.
http://www.cochrane.org/news/educational-benefits-deworming-...
http://www.cochrane.org/CD000371/INFECTN_deworming-school-ch...
Talked about in this More or Less episode:
http://www.bbc.co.uk/programmes/b0659q1f
http://www.theguardian.com/society/2015/jul/23/research-glob...
https://en.wikipedia.org/wiki/Lies,_damned_lies,_and_statist...
Statistics can be manipulated surprisingly easily.
> "This is interesting, here's some thoughts and ideas that further contribute to this subject"
> "This is interesting, here's a link to some further writing on this subject"
> "The entire concept and discipline of statistics is bullshit."
Par for the course here at Hacker News.
I also think that the quote fits in quite nicely here; it's not a wholesale rejection of statistics.
Also par for the course here :)
That clearly wasn't the case here, so I was wondering why the author chose to use it. I realize that the meaning has changed over time, but I was wondering what meaning he (and others) intend when it is used.
Example:
poster1: only retards use vi, notepad rules
poster2: huge list of reasons why vi is better than notepad
poster1: lol tldr
We already have terms like "summary", "digest", and even "précis"; why create a new term imbued with snark?
But not always. You have to consider how it was used to tell whether it was meant with ill will; the term tl;dr on its own doesn't necessarily tell you enough.
By now, it often is used to be friendly. There's a subconscious acknowledgement that long words take people's time. The speaker can even be talking about his own work and tell everyone "tldr version: " at the top.
Google uses it a lot in their own docs: https://developers.google.com/s/results/?q=tldr
You can also choose to pronounce it "teal deer" and use images of a green deer animal to signify the same.
...
In some situations, if you want, you can say it to be mean.