Seaborn bug responsible for finding of declining disruptiveness in science (arxiv.org)

82 points by ossicones 2 years ago | 72 comments

I saw this headline and my first thought was that someone was claiming that a mind impacting virus that evolved in the ocean was causing scientists to do research with less ambition. Which is of course ridiculous lol. But a bug in a visualization library impacting science is also ridiculous.

thayne 2 years ago | |

If I understand the abstract correctly (which I very well might not be), I don't think it is saying a bug caused problems across all of science, but that it resulted in an incorrect conclusion in one meta study of disruptiveness in science.

ossicones 2 years ago | |

Oh man, I’ve been tricked so many times by the names of packages and frameworks… and here I did it to others. Sorry!

xeckr 2 years ago | |

Seeing this headline together with the one about the link between Toxoplasma gondii and entrepreneurship made me wonder if I was still dreaming.

RobotToaster 2 years ago | |

Given that the majority of scientists seem to be cat owners, and toxoplasmosis has been linked to mental illness, it's not entirely implausible that a (human) bug is slowing scientific advancement.

toxik 2 years ago | | |

Got a source on that cat claim, doc?

tempodox 2 years ago | |

Have you ever seen the “BrainDead” series? This idea is not unheard of.

https://www.imdb.com/title/tt4877736

a_gnostic 2 years ago | | |

Thanks for reminding me of this oldie https://m.youtube.com/watch?v=GVvL2ca65DA

raziel2701 2 years ago |

Science communication must be at an all-time low. I initially thought the paper was about a sea-borne pathogen being responsible for a decline in disruptiveness in science, which is a crazy statement.

Then I thought that it was a paper claiming that a bug in the seaborn plotting library in Python was responsible for the decline in disruptiveness in science, which is absurd!

Finally I understood that this is a paper debunking another meta paper that claimed that disruptiveness in science had declined. This new arXiv paper shows that a bug in the seaborn plotting library is responsible for the mistake in the analysis that led to that widely publicized conclusion about declining disruptiveness in science. Oh boy, so many levels...

matthewdgreen 2 years ago | |

Neither the paper title nor the abstract leads with “Seaborn.” The decision to start the submission with “Seaborn bug…” is purely an HN artifact, and nothing to do with science communication.

ETA: For those who don’t click through, the paper title is “Dataset Artefacts are the Hidden Drivers of the Declining Disruptiveness in Science.” The first few sentences of the abstract are:

“Park et al. [1] reported a decline in the disruptiveness of scientific and technological knowledge over time. Their main finding is based on the computation of CD indices, a measure of disruption in citation networks [2], across almost 45 million papers and 3.9 million patents. Due to a factual plotting mistake, database entries with zero references were omitted in the CD index distributions, hiding a large number of outliers with a maximum CD index of one, while keeping them in the analysis [1].”
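For readers unfamiliar with the CD index, here is a rough sketch of the commonly cited form of the measure, CD = (n_i - n_j) / (n_i + n_j + n_k), where n_i counts later papers citing only the focal paper, n_j those citing both the focal paper and its references, and n_k those citing only its references. The function name and the input layout are assumptions made for illustration, not code from either paper. It shows why an entry recorded with zero references lands at the maximum value of one as soon as anything cites it.

```python
def cd_index(focal_references, later_papers):
    """Hypothetical helper: simplified CD index for one focal paper.

    focal_references: ids of the works the focal paper cites.
    later_papers: iterable of (cites_focal, cited_ids) pairs, one per later paper.
    """
    refs = set(focal_references)
    n_i = n_j = n_k = 0
    for cites_focal, cited_ids in later_papers:
        cites_refs = bool(refs & set(cited_ids))
        if cites_focal and cites_refs:
            n_j += 1  # cites the focal paper and its references
        elif cites_focal:
            n_i += 1  # cites the focal paper only
        elif cites_refs:
            n_k += 1  # cites the references only
    total = n_i + n_j + n_k
    if total == 0:
        return None  # undefined when nothing cites the paper or its references
    return (n_i - n_j) / total


# Toy citation data, invented for illustration only.
later = [(True, {"A"}), (True, set()), (False, {"B"})]

with_refs = cd_index({"A", "B"}, later)
# Same paper, but its reference list is missing from the database:
# no later paper can overlap with an empty reference set.
without_refs = cd_index(set(), later)

print(with_refs, without_refs)
```

With the reference list intact, the toy paper scores 0; with the references missing, n_j and n_k are both forced to zero and the score jumps to 1.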

mglz 2 years ago | |

> Science communication must be at an all-time low.

It's arxiv, not a press release. :)

bumbledraven 2 years ago |

The seaborn issue linked in the paper, “Treat binwidth as approximate to avoid dropping outermost datapoints” (https://github.com/mwaskom/seaborn/pull/3489), summarizes the problem as follows:

> floating point errors could cause the largest datapoint(s) to be silently dropped

However, the paper does not contain the string “float”, instead saying only:

> A bug in the seaborn 0.11.2 plotting software [3], used by Park et al. [1], silently drops the largest data points in the histograms.

So at the very least, the paper is silent on a key aspect of the bug.
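To make the floating point angle concrete, here is a minimal sketch of this class of error. It is not seaborn's actual code: `naive_edges` and `count_in_range` are hypothetical helpers, and the guard at the end is just one possible fix. Bin edges built from a fixed bin width can end a hair below the largest datapoint, so that point falls outside the last bin.

```python
def naive_edges(start, stop, width):
    """Hypothetical helper: build bin edges by repeatedly adding `width`."""
    n_bins = round((stop - start) / width)
    edges = [start]
    for _ in range(n_bins):
        edges.append(edges[-1] + width)
    return edges


def count_in_range(values, edges):
    """Count the values that fall inside the closed range covered by the bins."""
    return sum(1 for v in values if edges[0] <= v <= edges[-1])


# Toy data; the maximum sits exactly on the intended upper edge.
values = [0.05, 0.5, 0.95, 1.0]

edges = naive_edges(0.0, 1.0, 0.1)
# Ten additions of 0.1 accumulate rounding error, so with IEEE-754 doubles
# the last edge lands just below 1.0.
dropped = len(values) - count_in_range(values, edges)

# One possible guard: stretch the last edge so it covers the data maximum.
safe_edges = edges[:-1] + [max(edges[-1], max(values))]
dropped_after_guard = len(values) - count_in_range(values, safe_edges)

print(edges[-1] < 1.0, dropped, dropped_after_guard)
```

The point at 1.0 vanishes without any warning, which matters here because the maximum CD index is exactly one.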

daveguy 2 years ago |

Seaborn is a visualization library. No statistical tests should have been done with seaborn as an intermediate processing step. I guess they used some of the convenience functions as part of the data analysis. Seaborn is a final step tool, not a data analysis tool. That's an embarrassing lesson to learn post-publication.

sitkack 2 years ago | |

Take a look at the linked chart in my other comment. Visualization is absolutely a driver during research, it isn’t just an embarrassing revelation. Charts killed the Challenger crew.

sillysaurusx 2 years ago | | |

> Charts killed the Challenger crew.

https://www.tiktok.com/t/ZT8oG7ym6/

This is one of my favorite TikToks of all time, and you’ll see why. It goes into detail about how charts killed the Challenger crew. But the storytelling is second to none.

ayhanfuat 2 years ago | | |

It can be a driver. But then you do a deep dive. You create frequency tables, you create crosstabs, you calculate summary statistics, you do inferential statistics. There is no excuse for not catching this pre-publication.
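A frequency table of the kind described above takes a few lines of standard-library Python. This is only a sketch on toy values invented for illustration, not figures from either paper; the variable names are assumptions. Unlike a histogram that drops its top bin, the table cannot hide a pile-up at the maximum.

```python
from collections import Counter

# Toy CD index values, invented for illustration only.
cd_values = [-0.2, 0.0, 0.1, 0.0, 1.0, 1.0, 1.0, 0.3, 1.0]

# Frequency table: how often each distinct value occurs.
freq = Counter(cd_values)
for value in sorted(freq):
    print(f"{value:5.1f}  {freq[value]}")

# Simple summary statistics that would flag a spike at the upper bound.
top_value, top_count = freq.most_common(1)[0]
share_at_max = freq[max(cd_values)] / len(cd_values)
print(top_value, top_count, round(share_at_max, 2))
```

If the most frequent value is also the theoretical maximum, that is worth investigating before trusting any plot of the distribution.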

Aloisius 2 years ago | |

The bulk of the problem was caused by erroneous metadata.

The bug in Seaborn simply meant that the histograms that could have alerted them that something was wrong with their analysis, didn't.

light_hue_1 2 years ago |

I hope that all the publications that celebrated the original work, like the Economist https://www.economist.com/science-and-technology/2023/01/04/..., Nature's news service https://www.nature.com/articles/d41586-022-04577-5, the FT https://www.ft.com/content/c8bfd3da-bf9d-4f9b-ab98-e9677f109..., and others spend as much time on correcting the record as they did on promoting the idea that science is broken.

And I hope the original authors tell Nature to retract their paper. It's already highly influential unfortunately.

sitkack 2 years ago |

This image is the best illustration of the flaw https://arxiv.org/html/2402.14583v1/x1.png

I'm on mobile and can’t read the rest of the paper, but the impact could be massive.

moh_maya 2 years ago |

The submission was flagged, and I am not sure I understand why since the only (negatively) critical discussion I see is on the ambiguity over the title in the HN submission; flagging a submission appears to take it off the HN homepage, and I feel a title ambiguity in the face of the significance of the submission itself isn’t a strong reason for removing the submission from HN? :)

There are (at the time of posting this comment) no comments raising any substantive issue with the arxiv submission itself (which ofc has to go through the peer review process of publication, and hopefully the original authors will respond / rebut this new article) - so I'm curious why it's been flagged. It’s not dead, so I cannot vouch for it.

If folks in the HN community who have flagged it have done so because there are serious issues with what the paper is asserting, please comment / critique instead of just flagging it. If it’s because of the ambiguity in the title, I hope @dang and the moderators editorialize - there are some valuable comments in this thread that helped me understand what the issue is and what the bug is!

math_dandy 2 years ago |

Damn hipsters should just use matplotlib like the rest of us.

StableAlkyne 2 years ago | |

Gonna preface by saying I like what matplotlib is trying to do, and that it has done a lot of good for a lot of people.

Seaborn is a wrapper around matplotlib. It's popular because it removes a lot of the boilerplate from matplotlib and is pandas-aware.

For example, you call the pairplot function with a dataframe, and you just get a matrix of correlation plots and histograms. Compare that with matplotlib, where half the documentation/search results use imperative w/ global state, the other half is OOP, and you have to decipher all the extra subplots shenanigans to get something that looks good.

It's convenience, really. The people who use seaborn don't want to dive into matplotlib because the interface is kinda a mess with multiple incompatible ways to do things. It also documents what arguments mean instead of hiding most of them in **kwargs soup. You get plots in 1 minute of seaborn that would otherwise take 10 minutes in matplotlib to write.

keenmaster 2 years ago |

Bizarre. How do people make such big, splashy findings that can mess with people’s sense of optimism about science and innovation, without doing the simplest types of checks on their data and methodology?

dkasper 2 years ago |

Sounds like a plot point from The Three-Body Problem.

bmitc 2 years ago |

I wonder how much bad science has occurred due to the acceptance of Python as the lingua franca.

KRAKRISMOTT 2 years ago |

The graphing library caused this?

morkalork 2 years ago | |

I must be waking up still because on first reading I interpreted it as a sea-born bug, something infectious or parasitic.

brookst 2 years ago | | |

Totally expected norovirus.

ayhanfuat 2 years ago | |

No, bad analysis caused it. Graphs are at best secondary tools to interpret findings. You don't use graphs to draw conclusions.

SubiculumCode 2 years ago | | |

But we do use plots to help identify problems in datasets, like all the time. Statistics 101.

sergers 2 years ago |

Reading the comments here is hilarious.

Like others, I was expecting a wildly different article...

stavros 2 years ago | |

What were you expecting? I read that as "a bug in the Seaborn graphing library caused wrong conclusions" and don't understand what other interpretations there are.

CoastalCoder 2 years ago | | |

I'd never heard of the Seaborn library. And since Seaborn is the first word of the title, I assumed it was capitalized for that reason only.

So I thought the article would be about some ocean-faring insect or microbe that somehow affected scientists' mental acuity.

mathgradthrow 2 years ago | | |

Ailment transmitted at sea has somehow made science less impactful.

SideburnsOfDoom 2 years ago | | |

I have never heard of the Seaborn graphing library; I was curious as to how a marine virus or bacterium could cause a "finding of declining disruptiveness". Maybe a similar mechanism to Toxoplasma gondii?

rhelz 2 years ago |

Of course, it has nothing to do with rampant fraud, unreproducible results, incentive structures which reward the number of papers over the quality of papers, having researchers spend their prime scientific years writing grant proposals instead of actual research...

...nor does it have anything to do with tech companies hoarding cash by the trillions of dollars overseas instead of spending it on R&D, and even what R&D they internally produce they have no incentive to publish or productize, because virtually no new business will be more profitable than the monopoly business they already have...

asplake 2 years ago |

Seaborn??? Typo surely

Edit: Not mentioned in the abstract but it is in the main paper. Editorialised title.

CoastalCoder 2 years ago | |

This threw me also. I was expecting a really different article.

JW_00000 2 years ago | |

It's referring to the seaborn library (https://seaborn.pydata.org/), a Python library for data visualization (built on top of matplotlib).

asplake 2 years ago | | |

The first word of an editorialised title, frowned upon here, and not mentioned in the abstract (which I did read)

shawnz 2 years ago | |

It's a statistical plotting package:

> A bug in the seaborn 0.11.2 plotting software [3], used by Park et al. [1], silently drops the largest data points in the histograms.