Scientists should use AI as a tool, not an oracle (aisnakeoil.com)

124 points by randomwalker 2 years ago | 106 comments

> Unfortunately, most scientific fields have succumbed to AI hype, leading to a suspension of common sense. For example, a line of research in political science claimed to predict the onset of civil war with an accuracy2 of well over 90%, a number that should sound facially impossible. (It turned out to be leakage, which is what got us interested in this whole line of research.)

This coupled with people acting on its predictions is a kind of self fulfilling prophecy.

which is to ask, are AI safety folks building models of this pattern? :)

godelski 2 years ago | |

This is true for a lot of things, not just AI. But in AI, a guy who didn't get a High School degree and wrote Harry Potter Fan Fiction is one of the leading voices in doomerism.

The problem is you can't just "use logic and reason" because simple models are not good enough. The nuance dominates, but that's why we have experts.

What's funny to me is that people will confidently argue with experts and others value their opinion over the expert's knowledge. But on the other hand, people tend to just take machines at face value. Maybe these aren't overlapping groups, but it does appear that way. There's a great irony in trusting a machine but not the person/s that built said machine.

elicksaur 2 years ago | | |

I don’t trust the machines or the people who make them, and I didn’t have to read the Harry Potter fanfic to know ad hominems are poor arguments. What group does that make me?

FinchNova12 2 years ago | | |

Not sure if you intended this, but it feels like the first sentence of your argument is more broadly a critique of the credentials of AI Safety proponents. Maybe you are distinguishing between doomers vs broader AI Safety proponents, but if not, I feel like the counterargument is that most people on the CAIS letter (https://www.safe.ai/work/statement-on-ai-risk) interface quite frequently with these AI models and are also (purportedly) seriously concerned about AI safety

SrslyJosh 2 years ago | |

> are AI safety folks building models of this pattern?

First you need to ask if AI "safety folks" actually understand the technology, and if they are thinking about it objectively. If they believe that we're a few years away from accidentally creating Skynet, they need to put down the crack pipe and go work in another field.

immibis 2 years ago | | |

We have already created Skynet. Its name is Capitalism. Or the Internet. One of those things.

throwanem 2 years ago | |

How would that work, do you think?

wegfawefgawefg 2 years ago | | |

If you knew everyone would ask gpt before doing anything, you would make gpt say what would generally be considered the better option. Not going to war, not committing suicide, etc. In this way even if war was the optimal decision according to some other utility function, the behavior of people is directed in a positive way. (Presumably)

gerdesj 2 years ago | |

"accuracy2" sigh - the 2 is a superscript to a footnote and not a domain specific term.

"facially impossible" ... does that really riff on "on the face of it", or is it farcically misspelt?

Garbage in, garbage out 8)

tikhonj 2 years ago | | |

"Facially" in the sense of "on the face of it", roughly as a synonym for "obviously", seems like a pretty standard usage to me—this is certainly not the first place I've seen the word used in this sense.

riverdweller 2 years ago | | |

Human recall failure. Probably wanted "seemingly", "apparently", or even "ostensibly", but who's got time for all that when the publish button's right there.

fragmede 2 years ago | | |

Also from the article:

> Also, ML code tends to vastly more complex and less standardized than traditional statistical modeling.

I mean, hey, it's proof that the text isn't AI generated, since ChatGPT is better at English than that, but it makes it hard to read and I'm not going to buy their book if it's going to be full of errors like that.

throwanem 2 years ago |

If this is already such a problem even in the professional discipline and vocation whose sine qua non is the accurate analysis of physical reality, I'm really nervous about the next few years. And I was nervous already...

captainkrtek 2 years ago |

In my professional work, I treat chatgpt as a search engine that I feel I can ask questions of in a natural manner. I often find small flaws in technical solutions it offers, but it can still provide useful starting points to investigate. I rarely trust code it generates (at least for the language I mainly work in) as I’ve seen it make some serious mistakes (e.g. using keywords that don’t exist in the language).

userbinator 2 years ago |

People treating tools like they're infallible has been a problem since computers were invented, but IMHO the biggest difference with AI is how confident and convincing it can be in its output. Much like others here, I already have had to convince, very carefully, many otherwise-decently-intelligent people who believed ChatGPT was correct.

Thus I think the biggest success of AI will be the arts, where imprecision is not fatal, and hallucinations turn into entertainment instead of "truths".

antonvs 2 years ago | |

I think this misses something important. If it makes economic sense, corporations will figure out ways to integrate AI into their processes, even if it's imperfect. After all, companies are already built out of humans who are also often confidently wrong - but successful companies have ways to detect and mitigate that. In fact, that's one of the primary requirements for a company to survive, that it's able to build a functioning system out of imperfect components, particularly humans.

You can see an example of this in the use of LLMs to generate code. In that case, there's a whole SDLC pipeline designed to detect errors: type systems, language compilers and runtimes, tests of various kinds, QA, user feedback, etc. We don't just trust confident software developers to produce correct code.

Even a life-critical function like medical imaging - where imprecision can be fatal - can potentially benefit from this, where AI is used in conjunction with human review. It mainly requires development of some standards of practice - unlike with an average user blindly trusting the output of a model, radiologists would need training on how to use the models in question.

quantum_state 2 years ago |

AI is a tool … a fool with a tool is still a fool … For natural sciences, there is no need to worry since nature would provide the ultimate check … for social “sciences”, it is entirely a different story.

TheRoque 2 years ago |

The worst is having random people questioning your expertise because of what ChatGPT told them.

godelski 2 years ago | |

To be fair, people did this before ChatGPT. It's just the thing they point to as evidence now, and they'll always find something. The underlying problem is much bigger:

1) people confidently arguing with domain experts about topics that they have little to no experience in.

2) people valuing the opinions of arguers from 1 over experts.

alvah 2 years ago | | |

To be extra fair, "domain experts" in some areas have had a bad few years; there are a couple of fields I can think of off the top of my head where the "experts" wheeled out to advise/scare the public are clearly more influenced by politics (or saving their own skin) than science. Replacing trust in experts with trust in LLMs is obviously dumb, but who is Joe Sixpack supposed to turn to?

fragmede 2 years ago | |

Doctors had this moment when Google first came out

ThunderSizzle 2 years ago | | |

To be fair, I came across doctors who are no better than a static webpage from the CDC. I fire those doctors pretty quickly.

benhoyt 2 years ago |

> People should use AI as a tool, not an oracle

There, fixed the title.

az09mugen 2 years ago | |

People must not use AI as an oracle, but rather as a tool.

I think this is even better

bbor 2 years ago |

Wow I came into this article angry, idk if their book title accurately conveys the sober, expert analysis it contains! In case anyone else is curious why they’re talking about “leakage” in the first place instead of the existing term “model bias”, here’s the paper they cite in the “compelling evidence” paper that started these two’s saga with the snake oil salesmen: https://www.cs.umb.edu/~ding/history/470_670_fall_2011/paper...

Crux passage:

> Our focus here is on leakage, which is a specific form of illegitimacy that is an intrinsic property of the observational inputs of a model. This form of illegitimacy remains partly abstract, but could be further defined as follows: Let u be some random variable. We say a second random variable v is u-legitimate if v is observable to the client for the purpose of inferring u. In this case we write v ∈ legit{u}.

> A fully concrete meaning of legitimacy is built-in to any specific inference problem. The trivial legitimacy rule, going back to the first example of leakage given in Section 1, is that the target itself must never be used for inference:

> (1) y ∉ legit{y}

So ultimately this is all about bad experimental discipline re: training and test data, in an abstract way? I’ve been staring at this paper for way too long trying to figure out what exactly each “target” is and how it leaks, but I hope that engineering-translation is close
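To make the trivial legitimacy rule concrete, here is a minimal sketch (not from the paper or the article) of what happens when the target sneaks into the inputs. Everything in it is an illustrative assumption: the synthetic data, the function names `make_rows`, `fit_stump`, and `run`, and the choice of a one-threshold "stump" as the model. The honest feature is a weak, noisy signal; the leaky feature is simply a copy of the label.

```python
import random

def make_rows(n, rng, leak):
    """Synthetic rows of (features, label); all values are made up for illustration."""
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        feats = [y + rng.gauss(0.0, 2.0)]   # legitimate but weak, noisy signal
        if leak:
            feats.append(float(y))          # the target itself: y is not in legit{y}
        rows.append((feats, y))
    return rows

def accuracy(rule, rows):
    """Fraction of rows where 'feature j > threshold t' matches the label."""
    j, t = rule
    return sum((feats[j] > t) == (y == 1) for feats, y in rows) / len(rows)

def fit_stump(rows):
    """Return the (feature index, threshold) rule with the best training accuracy."""
    candidates = [(j, feats[j]) for feats, _ in rows for j in range(len(feats))]
    return max(candidates, key=lambda rule: accuracy(rule, rows))

def run(leak, seed):
    rng = random.Random(seed)
    train, test = make_rows(200, rng, leak), make_rows(1000, rng, leak)
    return accuracy(fit_stump(train), test)

acc_honest = run(leak=False, seed=0)   # limited by the noise in the honest feature
acc_leaky = run(leak=True, seed=1)     # perfect, because the label is an input
print(f"without leak: {acc_honest:.2f}, with leak: {acc_leaky:.2f}")
```

The leaky run scores perfectly on held-out data too, which is why this kind of error survives an ordinary train/test split: the split is fine, the inputs are not.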

dluan 2 years ago |

Scientists have been obsessed with over-optimizing for FOMO for the past decade - what papers should I read that I don't have time for, what grants should I apply for that I don't know about, what projects should I work on that will give me the best ROI, who in my field is poised to disrupt or make a big leap, etc.

Some even think that the end goal is actually an autonomous research agent that can make decisions about what questions to ask and why, and that's one of the true marks of AGI. That to me is insane and misses the entire point of science altogether, even once we reach that technical feasibility. We ask questions about the universe to expand our human relationship with the universe, not to just amass more research capital for the sake of it. And the fact that the AI snake oil has infected big chunks of science reveals which parts of it are just gold rush speculation and which aren't.

There's a more fundamental challenge of training scientists to understand why we ask the questions we ask. You can't just offload that to some background task and trust that it makes sense.

Onawa 2 years ago | |

I understand the point that you're making about overoptimizing for FOMO in science. I wanted to give you another perspective from a scientist working within the US government that doesn't care about playing that game.

Our governmental research agency, and NIH as a whole, has TONS of research data that we don't have the manpower to screen and process. There are also gaps in data that AI/ML could help us simulate. AI research assistants could potentially help us process and evaluate "what questions to ask" by, for example, looking for trends in QSAR (quantitative structure-activity relationship) models for novel chemicals and helping us direct our attention to compounds of toxicological interest.

We've also been trying to use the AI research assistants to speed up the process of evaluating the scientific literature for toxicologists who have to make regulatory decisions. Our agency has a backlog of chemicals that we would love to evaluate, but lacks the manpower to do so.

No profit motive or much "clout" interest, at least that I've seen. Just a lot of public servant scientists who need some extra help protecting the public.

m3kw9 2 years ago |

To know when to be skeptical of LLMs, you have to know how they are trained and how inference works, and you have to use them often to see how they can screw up

cdme 2 years ago |

It's marketed and sold as an oracle. The AGI crowd feels like a cult.

devjab 2 years ago |

I would have thought scientists weren’t going to use these tools to do research considering they as a group are far more exposed to things like peer reviews and critical thinking than general society.

What worries me the most about these AI solutions, however, is their usage in the public sector. They can certainly be useful helpers, like, they can scan images for cancer and if added to existing processes involving humans, often lead to enhanced results. They can’t replace any existing methods, however, as we learned here in Denmark a few years ago. Unfortunately that lesson hasn’t been learned across the public sector. I think medicine and healthcare learned it, but right now, we’re replacing actual human controls, audits and sometimes decision making with AI or an unwarranted trust in AI results. Which is going to lead to some really terrible results considering how bad things like LLMs often are at being lucky in even “common knowledge” situations. It’s further enhanced by how some of the work it’s tasked to do isn’t as black-and-white as writing code is. We use AI tools in our daily work, and they are ok, but as anyone who’s used them for programming probably knows by now, they aren’t exactly great at being lucky. Sometimes they’ll hallucinate solutions that simply do not exist.

This is how they work, and as I said earlier, AIs can be great enhancers. They aren’t replacements though, and if we start treating them like they are, which is very tempting from a change-management and benefit-realisation perspective, we’re just going to get in trouble. This is unfortunately exactly what we’re doing, and why wouldn’t we? Most western public sectors have functioned on at least some form of new public management for two decades by now, sometimes longer. As a result the entire systemic culture is geared toward efficiency and cost reduction, even when it doesn’t really result in either efficiency or cost reduction on a broader perspective.

Now, if scientists are on board, then what hope does a public bureaucracy have?

bitwize 2 years ago |

LLMs are basically Dissociated Press, but with deeper layers of statistics for a better function approximation than a simple Markov chain. It's really doing the same thing though: pick the next sequence of characters that best follows the foregoing characters.
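For readers who have not met the "simple Markov chain" baseline this comment contrasts LLMs with, here is a minimal sketch. It is only the baseline: it works on words rather than characters, it samples from followers it has literally seen rather than from a learned function, and the names `build_chain` and `generate` and the toy corpus are assumptions made for illustration.

```python
import random
from collections import defaultdict

def build_chain(words, order=1):
    """Map each run of `order` words to the list of words seen right after it."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain

def generate(chain, start, length, rng):
    """Extend `start` by sampling a follower of the most recent state each step."""
    out = list(start)
    order = len(start)
    for _ in range(length):
        followers = chain.get(tuple(out[-order:]))
        if not followers:        # state never seen in the corpus: stop
            break
        out.append(rng.choice(followers))   # frequent followers are picked more often
    return out

corpus = "the model predicts the next word and the next word follows the model".split()
chain = build_chain(corpus, order=1)
text = generate(chain, start=("the",), length=12, rng=random.Random(0))
print(" ".join(text))
```

Every adjacent word pair in the output already occurs in the corpus, so the result is locally plausible without any notion of whether it is true.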

Not something I'd trust as a "source of truth". Maybe a neat idea generator. And some of the deep learning algorithms can identify patterns that humans might miss -- patterns that could reveal useful insight. But they're not doing the knowledge work.

shmatt 2 years ago |

I feel like 90% of AI discussions online these days can be shut down with “a probabilistic syllable generator is not intelligence”

rsynnott 2 years ago | |

Even people who _know_ that often seem to have difficulty intuitively believing it, is the trouble; it's very good at _appearing_ to be intelligent, good enough that even people who should know better sometimes think that the correctness problems are just a case of "need more GPUs", rather than insoluble.

BriggyDwiggs42 2 years ago | |

How do you define intelligence?

WalterSear 2 years ago | |

That hasn't worked for me.

Zambyte 2 years ago | |

Humans are not fact machines, we are often wrong. Do humans not have intelligence?

What do you even mean by "intelligence" when you say a probabilistic syllable generator "is not intelligence"?

dotnet00 2 years ago | | |

Like clockwork, out come the "but humans" deflections. An LLM is not a human-like intelligence. This is patently obvious, such comparisons are nonsensical and just further the problem of people anthropomorphizing a tool and treating it like an oracle.

skrap 2 years ago |

...but why wouldn't they use AI as an oracle? From an outsider's perspective, it seems that there's already plenty of incentive to test the margins of acceptable academic practice in order to produce more papers or publish more quickly. Sadly I feel like it'll become the norm to have a chatbot interpret your results and write your paper rather than using those expensive grad students.

I don't have answers; just the lingering question "why are we building this?"

Kalium 2 years ago | |

We're building this because the ability to make narrow, specific predictions can be narrowly and specifically useful. This works if you have a good understanding of both the tools and the domain you're looking to make predictions in.

Unfortunately, from an outsider perspective, this looks like being widely and generically useful. If you don't understand your tools, you're going to misuse them, and this hype cycle is the result.

hulitu 2 years ago |

> Scientists should use AI as a tool, not an oracle

T in AI stands for tool.

chomskyole 2 years ago |

Maybe they should also call it "curve fitting" instead of "AI" so they don't need to call a "poor fit" a "hallucination"

hollerith 2 years ago | |

It's all very simple, eh?

chomskyole 2 years ago | | |

Look, it's a bit late here so I don't really have time to fully refute your sophisticated argument. But let me ask you this: if AI is not curve fitting, what is it then?

logrot 2 years ago |

But surely if it's artificial intelligence then it'd know its limits and would respond appropriately? Oracle use no problem?

Is it because it's actually shit but it's the best thing we've seen yet and everyone is just in denial?

mewpmewp2 2 years ago | |

People constantly misevaluate their own limits though. Why should AI not be allowed to do that?

Jensson 2 years ago | | |

Professionals don't constantly misevaluate their limits, if the AI is to replace a professional it has to know its limits.

threeseed 2 years ago | |

It depends on who you mean.

Most normal people look at AI like ChatGPT as an amazing tool and have used it effectively as a replacement for Google, Grammarly etc. And for them it's fine because any mistakes are localised to them.

The problem is those building products on LLMs, e.g. Legal, Customer Service, who are knowingly misrepresenting the capabilities of what it can do to companies who don't know any better. And I would argue this is fraudulent and where we will see most of the problems.

10000truths 2 years ago |

Is "leakage" just another term for overfitting?

dotnet00 2 years ago | |

I think a popular example of leakage would be that of a tank recognition AI that perfectly handles training/testing data but fails in real use, because all the tanks of one country happen to have a tree in the background, while those of the other do not, effectively leaking the image label and making the model look for a tree instead of the tank. Even if you trained less or used fewer parameters, it'd still go for the easiest route of trying to detect features of a tree. You'd have to change the training data.
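The tank story can be sketched with made-up data. Nothing below comes from a real dataset: the two binary features (`shape`, `tree`), the 80% reliability of the shape feature, and the helper names are all assumptions chosen to mirror the comment. The "model" is as small as it can be, a single chosen feature, and it still takes the easy route.

```python
import random

def sample(n, rng, tree_tracks_label):
    """Fake 'images' as (shape, tree) feature pairs plus a label."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)                 # which country's tank
        # Hull shape: the feature we want used; assumed right 80% of the time.
        shape = label if rng.random() < 0.8 else 1 - label
        # Tree in background: matches the label only in the collected photos.
        tree = label if tree_tracks_label else rng.randint(0, 1)
        data.append(((shape, tree), label))
    return data

def pick_best_feature(data):
    """'Train' by choosing the single feature that agrees with the label most."""
    def agreement(j):
        return sum(x[j] == y for x, y in data) / len(data)
    return max(range(len(data[0][0])), key=agreement)

def accuracy(j, data):
    return sum(x[j] == y for x, y in data) / len(data)

rng = random.Random(0)
train = sample(500, rng, tree_tracks_label=True)
held_out = sample(500, rng, tree_tracks_label=True)     # same photo collection
deployed = sample(2000, rng, tree_tracks_label=False)   # trees no longer informative

chosen = pick_best_feature(train)            # index 1 is the tree feature
held_out_acc = accuracy(chosen, held_out)    # looks perfect
deployed_acc = accuracy(chosen, deployed)    # should fall to around chance
print(chosen, held_out_acc, deployed_acc)
```

Shrinking the model does not help here, which matches the comment: the only fix is to change how the data was collected.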

XenophileJKO 2 years ago | |

No, usually it means the data that you intend to test the model on was accidentally used to train the model. There are more complex scenarios where you get leakage without actually showing the model the test examples, for instance when your features contain future information that you won't have at actual inference time.

So usually it ends up in overfitting, but it is more about the model having information at training time that it shouldn't have.
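A minimal sketch of the first case, test rows accidentally used for training, under deliberately artificial assumptions: the labels are coin flips, the only feature is a row id, and the "model" (`fit_memorizer`, a hypothetical helper) just memorises what it saw. It does not cover the future-information case.

```python
import random
from collections import Counter

rng = random.Random(0)
# Labels are coin flips, so no model can honestly beat chance on unseen rows.
rows = [(i, rng.randint(0, 1)) for i in range(2000)]   # (row id, label)

def fit_memorizer(train):
    """Memorise training labels; fall back to the majority class for unseen ids."""
    table = dict(train)
    default = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: table.get(x, default)

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

train, test = rows[:1000], rows[1000:]
clean_model = fit_memorizer(train)
leaky_model = fit_memorizer(train + test)   # test rows accidentally included

clean_acc = accuracy(clean_model, test)     # should land near chance
leaky_acc = accuracy(leaky_model, test)     # 1.0: pure recall of the test labels
print(f"clean split: {clean_acc:.2f}, contaminated: {leaky_acc:.2f}")
```

The contaminated score says nothing about how the model would do on new data, which is the sense in which leakage differs from ordinary overfitting: the evaluation itself is broken.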

russfink 2 years ago | |

These are two different definitions. Can someone please disambiguate?

teknopaul 2 years ago |

No shit sherlock

LouisSayers 2 years ago |

Not just scientists, but everyone!

My partner recently went a bit nuts writing an article with the help of GPT4. She was very proud of how productive she'd been until I asked if she'd actually searched for the papers GPT4 had referred to.

Of course, many of the referred to papers didn't exist...

duxup 2 years ago |

We use search that way, don’t see why AI trained on similar content wouldn’t be just as variable in terms of reliability.

skywhopper 2 years ago | |

This is incredibly simplistic. Search engine results give a lot of context clues about the reliability of their asserted facts and provide a potential spectrum of answers. LLM-generated answers strip all that away, and give a single authoritatively phrased answer. Even if you’re inclined to disbelieve it, the LLM answer gives you no ability to dig in, refine, or compare. It just is. If you ask a chatbot if it’s sure, it might double down, or apologize and then repeat itself, or say it was right and give a contradictory followup.

Traditional pre-spam-overload Google results could often give a high quality answer, or if not, you’d at least get the sense of the low quality. Not so with LLMs.

duxup 2 years ago | | |

I think you overestimate people’s ability to sniff out bad data on the internet.

Also are you suggesting people fact check an AI by asking it if it is correct? That seems absurd.