Twenty-nine teams use same dataset, find contradicting results [pdf](osf.io)

172 points by alexleavitt 10 years ago | 32 comments

entee 10 years ago |

This paper is awesome because it transparently folds the analytical approach into the experiment being conducted.

There are two kinds of scientific study: those where you can run another experiment (ideally one that approaches the same question orthogonally) along with rigorous controls, and those where you can't.

The first type is much less likely to have results vary based on analytical technique (effectively the second experiment is a new analytical technique). Of course it does happen sometimes, and sometimes the studies are wrong; still, more controls and more experiments are always more better.

However, studies where you're limited by ethical or practical constraints (i.e. most experiments involving humans) don't have that luxury and therefore are far more contingent on decisions made at the analysis stage. What's awesome about this paper is that it kind of gets around this limitation by trying different analytical methods, each effectively being a new "experiment", and seeing if they all reach the same consensus.

Interestingly, very few features in the analysis were shared among a large fraction of the teams (only 2 features were used by more than 50% of teams), which suggests that no matter the method, the result holds true. A similar approach to open data and distributed analysis would be a really great way to eliminate some of the recent trouble with reproducibility in the broader scientific literature.

dang 10 years ago |

A blog post giving background is at http://www.nature.com/news/crowdsourced-research-many-hands-....

jdp23 10 years ago | |

The blog post is a great overview as well as useful context, thanks for sharing it.

TL:DR summary: Scientific results are highly contingent on subjective decisions at the analysis stage. Different (well-founded) data analysis techniques on a fairly simple and well-defined problem can give radically different results.
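To make that concrete, here is a minimal Python sketch of how two defensible analyses of the same table can disagree. Everything in it is an assumption for illustration: the counts are made up (not the paper's data), and the "position" covariate and group labels are hypothetical. Analysis A pools all player-games; analysis B estimates the effect within each position.

```python
# Made-up counts for illustration only; NOT data from the paper.
# counts[stratum][group] = (red_cards, player_games)
counts = {
    "defender": {"dark": (16, 800), "light": (4, 200)},
    "forward": {"dark": (2, 200), "light": (8, 800)},
}


def risk_ratio(dark, light):
    """Red-card rate of the dark group divided by that of the light group."""
    (c1, n1), (c0, n0) = dark, light
    return (c1 / n1) / (c0 / n0)


def pooled_risk_ratio(table):
    """Analysis A: ignore the covariate and pool all strata."""
    totals = {}
    for groups in table.values():
        for group, (cards, games) in groups.items():
            total_cards, total_games = totals.get(group, (0, 0))
            totals[group] = (total_cards + cards, total_games + games)
    return risk_ratio(totals["dark"], totals["light"])


def stratified_risk_ratios(table):
    """Analysis B: estimate the ratio separately within each stratum."""
    return {s: risk_ratio(g["dark"], g["light"]) for s, g in table.items()}


print(pooled_risk_ratio(counts))       # pooled ratio, about 1.5
print(stratified_risk_ratios(counts))  # ratio of 1.0 within each stratum
```

With these invented numbers the pooled analysis reports a higher red-card rate for one group, while the stratified analysis reports no difference within either position. Neither choice is obviously wrong, which is the sense in which the result is contingent on the analyst.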

It's very interesting research -- a great real-life example supporting the models Scott Page et al. use for the value of cognitive diversity. The thrust of the blog post is about where crowdsourcing analysis can be helpful (as well as reasonable caveats about where it might not apply), which is certainly an interesting question. Obviously, there are a lot of other implications to this as well.

nkurz 10 years ago | | |

Off-topic, but you seem well positioned to answer: Why do you say "TL:DR" here when summarizing a short blog post that you enjoyed? Clearly the meaning has diverged from the original abbreviated insult of "Too long; didn't read", but I don't understand what people mean when they use it today. Why did you phrase it this way? Are you a native English speaker? If not intended to be derogatory, does the dissonance bother you?

SilasX 10 years ago |

Reminds me of the idea (Robin Hanson's, I think?) to add an extra layer of blindness to studies: during peer review, take the original data, and write a separate paper with the opposite conclusion. Randomize which reviewers get which version. Your original paper is then only accepted if they reject the inverted version.

gwern 10 years ago | |

I think you misremembered it: http://www.overcomingbias.com/2007/01/conclusionblind.html http://www.overcomingbias.com/2010/11/results-blind-peer-rev... Nothing about accepted only if they rejected the reversed version; just that the pro & con versions be supplied (first post), or a paper sans conclusions/results (second post).

sndean 10 years ago |

FiveThirtyEight did a write-up of this paper (part 2):

http://fivethirtyeight.com/features/science-isnt-broken/

On the bright side, if you look at the 95% confidence intervals for the 29 studies, almost all of them overlap.
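A rough way to check that kind of overlap, sketched in Python. The interval bounds below are hypothetical placeholders, not the teams' reported values, and the helper names are mine. The sketch counts how many pairs of intervals overlap and looks for a region shared by all of them.

```python
from itertools import combinations

# Hypothetical (lower, upper) 95% CI bounds for an odds ratio.
# These are placeholders for illustration, NOT values from the paper.
intervals = [(0.95, 1.40), (1.10, 1.60), (1.20, 1.75), (0.90, 1.25), (1.70, 2.90)]


def overlaps(a, b):
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def pairwise_overlap_fraction(cis):
    """Fraction of interval pairs that overlap."""
    pairs = list(combinations(cis, 2))
    return sum(overlaps(a, b) for a, b in pairs) / len(pairs)


def common_region(cis):
    """Interval shared by every CI, or None if there is no such region."""
    lo = max(low for low, _ in cis)
    hi = min(high for _, high in cis)
    return (lo, hi) if lo <= hi else None


print(pairwise_overlap_fraction(intervals))
print(common_region(intervals))
print(common_region(intervals[:-1]))  # same check without the last interval
```

Overlap is only a rough consistency check, not a formal test of whether the estimates differ.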

joe_the_user 10 years ago |

"The primary research question tested in the crowdsourced project was whether soccer players with dark skin tone are more likely than light skin toned players to receive red cards from referees."

This seems like a topic where one indeed typically winds up with a multitude of competing conclusions.

Among other factors, we have:

* Pre-existing beliefs on the part of researchers.

* Lack of sufficient data.

* Difficulty in defining hypotheses (is there a skin tone cut-off or should one look for degrees of skin tone and degrees of prejudice, should one look at all referees or only some referees).

Given this, I'd say it's a mistake to expect just numeric data at the level of complex social interactions to be anything like clear or unambiguous. If studies on topics such as this have value, they have to involve careful arguments concerning data collection, data normalization/massaging, and only then data analysis and conclusions.

But a lot of the context comes from the prevalence of shoddy studies that expect you can throw data in a bucket and draw conclusions, further facilitated by having those conclusions echoed by mainstream media or by the media of one's chosen ideology.

krick 10 years ago |

I understand how tempting it is in our age of big data and all that stuff to perceive this as some curious new phenomenon, but it really is not. This is precisely the reason why we came up with some criteria for "science" quite a while ago. And in fact, this whole experiment is pretty meaningless.

So, for starters: 29 students get the same question on the math/physics/chemistry exam and give 29 different answers. Breaking news? Obviously not. Either the question was outrageously badly worded (not such a rare thing, sadly), or the students didn't do very well and we've got at most 1 correct answer.

Basically, we've got the very same situation here. Except our "students" were doing statistics, which is not really math and not really natural science. Which is why it is somehow "acceptable" to end up with results like that.

If we are doing math, whatever result we get must be backed up with a formally correct proof. Which doesn't mean, of course, that 2 good students cannot get contradicting results, but at least one of their proofs is faulty, which can be shown. And this is how we decide what's "correct".

If we are doing science (e.g. physics), our question must be formulated in such a way that it is verifiable by setting up an experiment. If the experiment didn't get us what we expected, our theory is wrong. If it did, the theory might be correct.

Here, our original question was "if players with dark skin tone are more likely than light skin toned players to receive red cards from referees", which is shit, and not a scientific hypothesis. We can define "more likely" as we want. What we really want to know is whether, during the next N matches happening in what we can consider "the same environment", black athletes are going to get more red cards than white athletes. Which is quite obviously a bad idea for a study, because the number of trials we need is too big for so loosely defined a setting: not even 1 game will actually happen in an isolated environment, players will be different, referees will be different, each game will change the "state" of our world. Somebody might even say that the whole culture has changed since we started the experiment, so obviously whatever the first dataset was, it's no longer relevant.

Statistics is only a tool, not a "science", as some people might (incorrectly) assume. It is not the fault of the methods we apply that we get something like that, but rather of the discipline that we apply them to. And "results" like that are why physics is accepted as a science, and sociology never really was.

DiabloD3 10 years ago |

So, does this mean every team used improper methodology? Or can we meta-review the results and figure out what's really going on?

DanBC 10 years ago | |

It makes it really hard to work out what's happening, especially if you want the result to match existing standards.

For a real world example of this see deworming schoolchildren.

People looking at the educational effects of deworming children reach different conclusions because some of them use a medical model and some of them use an economics model.

http://www.cochrane.org/news/educational-benefits-deworming-...

http://www.cochrane.org/CD000371/INFECTN_deworming-school-ch...

Talked about in this More or Less episode:

http://www.bbc.co.uk/programmes/b0659q1f

http://www.theguardian.com/society/2015/jul/23/research-glob...

LunaSea 10 years ago |

Of course it's social "sciences".

DanBC 10 years ago | |

Also medical treatment: https://news.ycombinator.com/item?id=10387375

hmate9 10 years ago |

Lies, damned lies, and statistics

https://en.wikipedia.org/wiki/Lies,_damned_lies,_and_statist...

Statistics can be manipulated surprisingly easily.

justinlardinois 10 years ago | |

There are three kinds of lies. There are also three kinds of comments I see in this thread:

> "This is interesting, here's some thoughts and ideas that further contribute to this subject"

> "This is interesting, here's a link to some further writing on this subject"

> "The entire concept and discipline of statistics is bullshit."

Par for the course here at Hacker News.

duaneb 10 years ago | | |

Hey, the null hypothesis is powerful and valuable. I, for one, am happy that all three types are well-represented; all three are healthy in moderation.

I also think that the quote fits in quite nicely here; it's not a wholesale rejection of statistics.

jdp23 10 years ago | | |

4. > "About TL:DR ...:"

Also par for the course here :)