OpenAI says it has evidence DeepSeek used its model to train competitor

OpenAI says it has evidence DeepSeek used its model to train competitor(ft.com)

747 points by timsuchanek 1 year ago | 1541 comments

udev 1 year ago |

Imnimo 1 year ago |

I think there's two different things going on here:

"DeepSeek trained on our outputs and that's not fair because those outputs are ours, and you shouldn't take other peoples' data!" This is obviously extremely silly, because that's exactly how OpenAI got all of its training data in the first place - by scraping other peoples' data off the internet.

"DeepSeek trained on our outputs, and so their claims of replicating o1-level performance from scratch are not really true" This is at least plausibly a valid claim. The DeepSeek R1 paper shows that distillation is really powerful (e.g. they show Llama models get a huge boost by finetuning on R1 outputs), and if it were the case that DeepSeek were using a bunch of o1 outputs to train their model, that would legitimately cast doubt on the narrative of training efficiency. But that's a separate question from whether it's somehow unethical to use OpenAI's data the same way OpenAI uses everyone else's data.

riantogo 1 year ago | |

Why would it cast any doubt? If you can use o1 output to build a better R1. Then use R1 output to build a better X1... then a better X2.. XN, that just shows a method to create better systems for a fraction of the cost from where we stand. If it was that obvious OpenAI should have themselves done. But the disruptors did it. It hindsight it might sound obvious, but that is true for all innovations. It is all good stuff.

Imnimo 1 year ago | | |

I think it would cast doubt on the narrative "you could have trained o1 with much less compute, and r1 is proof of that", if it turned out that in order to train r1 in the first place, you had to have access to bunch of outputs from o1. In other words, you had to do the really expensive o1 training in the first place.

(with the caveat that all we have right now are accusations that DeepSeek made use of OpenAI data - it might just as well turn out that DeepSeek really did work independently, and you really could have gotten o1-like performance with much less compute)

rockemsockem 1 year ago | | |

I think the prevailing narrative ATM is that DeepSeek's own innovation was done in isolation and they surpassed OpenAI. Even though in the paper they give a lot of credit to Llama for their techniques. The idea that they used o1's outputs for their distillation further shows that models like o1 are necessary.

All of this should have been clear anyway from the start, but that's the Internet for you.

KingOfCoders 1 year ago | | |

OpenAI couldn't do it, when the high cost of training and access to GPUs is their competitive advance against startups, they can't admit that it does not exist.

patcon 1 year ago | | |

Are we it rediscovering the evolutionary benefit of progeny (from an information theoretic lens)?

And is this related to the lottery ticket hypothesis?

https://arxiv.org/pdf/1803.03635.pdf

herodoturtle 1 year ago | | |

Thanks for the insightful comment.

I have a question (disclaimer: reinforcement learning noob here):

Is there a risk of broken telephone with this?

Kinda like repeatedly compressing an already compressed image eventually leads to a fuzzy blur.

If that is the case then I’m curious how this is monitored and / or mitigated.

ospray 1 year ago | | |

They did do that themselves it's called o3.

RHSman2 1 year ago | | |

When will over training happen on the melange of models at scale? And will AGI only ever be an extension of this concept?

That is where artificial intelligence is going. Copy things from other things. Will there be a AI Eureka moment where it deviates and knows where and why the reason it is wrong?

indymike 1 year ago | | |

Bad things happen in tech when you don't do the disrupting yourself.

anothernewdude 1 year ago | | |

If they're training R1 on o1 output on the benchmarks - then I don't trust those benchmarks results for R1. It means the model is liable to be brittle, and they need to prove otherwise.

dontreact 1 year ago | | |

Is there any evidence R1 is better than O1?

It seems like if they in fact distilled then what we have found is that you can create a worse copy of the model for ~5m dollars in compute by training on its outputs.

iforgot22 1 year ago | | |

"Then use R1 output to build a better X1" is the part I'm not sure about. Is X1 going to actually be better than R1?

qwertox 1 year ago | | |

They're standing on the shoulders of giants, not only in terms of re-using expensive computing power almost for free by using the outputs of expensive models. It's a bit of a tradition in that country, also in manufacturing.

dartos 1 year ago | | |

What does “better” really even mean here?

Better benchmark scores can be cooked

Sophira 1 year ago | | |

Honestly, it's kind of silly that this technology is in the hands of companies whose only aim is to make money, IMO.

gmd63 1 year ago | | |

Why not just copy and paste the model and change the name? That's an even more efficient form of distillation.

PeterStuer 1 year ago | |

Ironically Deepseek is doing what OpenAI originally pledged to do. Making the model open and free is a gift to humanity.

Look at the whole AI revolution that Meta and others have bootstrapped by opening their models. Meanwhile OpenAI/Microsoft, Antropic, Google and the rest are just trying to look after number 1 while trying to regulatory capture an AI for me but not for thee outcome of full control.

curt15 1 year ago | | |

Is there anything still "open" about OpenAI these days?

jajko 1 year ago | | |

I don't think it makes sense to look at some previous PR statements of Altman et al re this when there a tens of billions floating around and egos get inflated to moon sizes. Farts in the wind have more weight, but this goes for all corporate PR.

Thieves yelling 'stop those thieves' scenario to me, they just were first and would not like losing that position. But its all about money and consequently power, business as usual.

miki123211 1 year ago | |

> This is obviously extremely silly, because that's exactly how OpenAI got all of its training data

IANAL, but It is worth noting here that DeepSeek has explicitly consented to a license that doesn't allow them to do this. That is a condition of using the Chat GPT and the OpenAI API.

Even if the courts affirm that there's a fair use defence for AI training, DeepSeek may still be in the wrong here, not because of copyright infringement, but because of a breach of contract.

I don't think OpenAI would have much of a problem if you train your model on data scraped from the internet, some of which incidentally ends up being generated by Chat GPT.

Compare this to training AI models on Kindle Books randomly scraped off the internet, versus making a Kindle account, agreeing to the Kindle ToS, buying some books, breaking Amazon's DRM and then training your AI on that. What DeepSeek did is more analogous to the latter than the former.

anon373839 1 year ago | | |

> DeepSeek has explicitly consented to a license that doesn't allow them to do this.

You actually don’t know this. Even if it were true that they used OpenAI outputs (and I’m very doubtful) it’s not necessary to sign an agreement with OpenAI to get API outputs. You simply acquire them from an intermediary, so that you have no contractual relationship with OpenAI to begin with.

krust 1 year ago | | |

>IANAL, but It is worth noting here that DeepSeek has explicitly consented to a license that doesn't allow them to do this. That is a condition of using the Chat GPT and the OpenAI API.

I have some news for you

dmitrygr 1 year ago | | |

> DeepSeek has explicitly consented to a license that doesn't allow them to do this.

By existing in USA, OpenAI consented to comply with copyright law, and how did that go?

blibble 1 year ago | | |

training is either fair use, or it isn't

OpenAI can't have it both ways

freen 1 year ago | | |

Did OpenAI abide by my service’s terms of service when it ingested my data?

dartos 1 year ago | | |

TOS are not contracts.

like_any_other 1 year ago | | |

Legally, I understand your point, but morally, I find it repellent that a breach of contract (especially terms-of-service) could be considered more important than a breach of law. Especially since simply existing in modern society requires us to "agree" to dozens of such "contracts" daily.

I hope voters and governments put a long-overdue stop to this cancer of contract-maximalism that has given us such benefits as mandatory arbitration, anti-benchmarking, general circumvention of consumer rights, or, in this case, blatantly anti-competitive terms, by effectively banning reverse-engineering (i.e. examining how something works, i.e. mandating that we live in ignorance).

Because if they don't, laws will slowly become irrelevant, and our lives governed by one-sided contracts.

anothernewdude 1 year ago | | |

It's not hard to get someone else to submit queries and post the results, without agreeing to the license.

tempeler 1 year ago | |

On another subject, if it belongs to OpenAI because it uses OpenAI, then doesn't that mean that everything produced using OpenAI belongs to OpenAI? Isn't that a reason not to use OpenAI? It's very similar to saying that you used Google and searched; now this product belongs to Google. They couldn't figure out how to respond; they went crazy.

dathinab 1 year ago | | |

The US ruled that AI produced things are by themself not copyrightable.

So no, it doesn't belong to OpenAI.

You might be able to sue for penalties for breach of contract of the TOS, but that doesn't give them the right to the model. And even if it doesn't give them any right to invalidate unbound copyright grants they have given to 3rd parties (here literally everyone). Nor does it prevent anyone from training their own new models based on it or prevent anyone from using it. Oh, and the one breaching the TOS might not even have been the company behind DeepSeek but some in-between 3rd party.

Naturally this is under a few assumptions:

- the US consistently applies it's own law, but they have a long history of not doing so

- the US doesn't abuse their power to force their economical opinions (ban DeepSeek) on other countries

- it actually was trained on OpenAI, but uh, OpenAI has IMHO shown over the years very clearly that they can't be trusted and they are fully in-transparent. How do we trust their claim? How do we trust them to not retrospectively have tweaked their model to make it look as if DeepSeek copied it?

johndhi 1 year ago | | |

to be clear, their terms of service are pretty clear that the USER owns the outputs.

dandanua 1 year ago | | |

Welcome to technofascism, where everything belongs to tech billionaires and their pocket politicians.

valine 1 year ago | |

The existence of R1-zero is evidence against any sort of theft of OpenAI's internal COT data. The model sometimes outputs illegible text that's useful only to R1. You can't do distillation without a shared vocabulary. The only way R1 could exist is if they trained it with RL.

natdempk 1 year ago | | |

I don’t think anyone is really suggesting they stole COT or that it is leaked, but rather that the final o1 outputs were used to train the base model and reasoning components more easily.

m348e912 1 year ago | |

> "DeepSeek trained on our outputs"

I'm wondering how Deepseek could have made 100s of millions of training queries to OpenAI and not one person at OpenAI caught on.

tisc 1 year ago | | |

Maybe they use AI to monitor traffic, but it is still learning :)

stef25 1 year ago | | |

Mechanical turks ?

HarHarVeryFunny 1 year ago | |

DeepSeek-R0 (based on DeepSeek-V3 base model) was only trained with RL, no SFT, so this isn't at all like the "distillation" (i.e SFT on synthetic data generated by R1) that they also demonstrated by fine tuning Qwen and LLaMa.

Now, DeepSeek may (or may not) have used some O1 generated data for the R0 RL training, but if so that's just a cost saving vs having to source some reasoning data some other way, and in no way reduces the legitimacy of what they accomplished (which is not something any of the AI CEOs are saying).

s17n 1 year ago | |

> This is obviously extremely silly, because that's exactly how OpenAI got all of its training data in the first place - by scraping other peoples' data off the internet.

OpenAI has also invested heavily in human annotation and RLHF. If all DeepSeek wanted was a proxy for scraped training data, they'd probably just scrape it themselves. Using existing RLHF'd models as replacement for expensive humans in the training loop is the real game changer for anyone trying to replicate these results.

KennyBlanken 1 year ago | | |

"We spent a lot of labor processing everything we stole" is...not how that works.

That's like the mafia complaining that they worked so hard to steal those barrels of beer that someone made off with in the middle of the night and really that's not fair and won't someone do something about it?

pizzathyme 1 year ago | |

This is a fascinating development because AI models may turn out to be like pharmaceuticals. The first pill costs $500 million to make, the second one costs pennies.

chupy 1 year ago | | |

Companies are still charging 100x for the pills that cost pennies to produce.

Besides deals with insurance companies and governments, one of the ways that they are still able to pull this is convincing everyone that it's too dangerous to play with this at home or buying it from an Asian supplier.

At least with software we had until now a way to build and run most things without requiring dedicated super expensive equipment. OpenAI pulled a big Pharma move but hopefully there will be enough disruptors to not let them continue it.

shadofx 1 year ago | | |

The solution is to create a health insurance system which burdens only Americans with the $500m cost, while India is allowed to make the drug for pennies for the rest of the world.

motoxpro 1 year ago | | |

What a nice analogy.

reissbaker 1 year ago | |

You're right that the first claim is silly, but the second claim is pretty silly too — they're not claiming industrial espionage, they're claiming a breach in ToS. The outputs of the o1 thinking process aren't user-visible, and never leave OpenAI's datacenters. Unless DeepSeek actually had a mole that stole their o1 outputs, there's nothing useful DeepSeek could've distilled to get to R1's thought processes.

And if DeepSeek had a mole, why would they bother running a massive job internally to steal the data generated? It would be way easier for the mole to just leak the RL training process, and DeepSeek could quietly copy it rather than bothering with exfiltrating massive datasets to distill. The training process is most likely like, on the order of a hundred lines of Python or so, and you don't even need the file: you just need someone to describe it to you. Much simpler than snatching hundreds of gigabytes of training data off of internal servers...

Plus, the RL process described in DeepSeek's paper has already been replicated by a PhD student at Berkeley: https://x.com/karpathy/status/1884678601704169965 So, it seems pretty unlikely they simply distilled R1 and lied about it, or else how does their RL training algo actually... work?

This is mainly cope from OpenAI that their supposedly super duper advanced models got caught by China within a few months of release, for way cheaper than it cost OpenAI to train.

bjourne 1 year ago | |

> "DeepSeek trained on our outputs, and so their claims of replicating o1-level performance from scratch are not really true"

Someone has to correct me if I'm wrong, but I believe in ML research you always have a dataset and a model. They are distinct entities. It is plausible that output from OpenAI's model improved the quality of DeepSeek's dataset. Just like everyone publishing their code on GitHub improved the quality of OpenAI's dataset. What has been the thinking so far is that the dataset is not "part of" or "in" the model any more than the GPUs used to train the model are. It seems strange that that thinking should now change just because Chinese researchers did it better.

XorNot 1 year ago | |

Yep: this is face-saving my Sam Altman.

OpenAI has a message they need to tell investors right now: "DeepSeek only works because of our technology. Continue investing in us."

The choice of how they're wording that of course also tells you a lot about who they think they're talking to: namely, "the Chinese are unfairly abusing American companies" is a message that is very popular with the current billionaires and American administration.

naet 1 year ago | |

“We engage in countermeasures to protect our IP, including a careful process for which frontier capabilities to include in released models, and believe . . . it is critically important that we are working closely with the US government to best protect the most capable models from efforts by adversaries and competitors to take US technology.”

The above OpenAI quote from the article leans heavily towards #1 and IMO not at all towards #2. The later would be an extremely charitable reading of their statement.

ripped_britches 1 year ago | | |

What they say explicitly is not what they say implicitly. PR is an art.

me551ah 1 year ago | |

This is going to have a catastrophic effect on closed source AI startup valuations. Because this means that anyone can copy any LLM. The person who trains the model, spends the most amount of money. Everyone else can create a replica at lower cost

amlib 1 year ago | | |

Why is that bad? If a powerful entity can scrape every piece of media humanity has to offer and ignore copyright then why should society let then profit unrestricted from it? It's only fair that such models have no legal protection around their usage and can be used and analyzed by anyone as they see fit. The only reason this hasn't been codified into laws is because those same powerful entities have been busy trying to do regulatory capture.

matt-p 1 year ago | | |

Good.

iforgot22 1 year ago | | |

Maybe anyone can copy any LLM with sufficient querying. There are still ways to guard one.

nullc 1 year ago | |

There is a big difference between being able to train on the reasoning vs just the answers, which they can't against o1 because it's hidden. There is also a huge difference between being able to train on the probabilities (distillation) vs not, which again they can and did do with the llama models and can't directly with OpenAI because the conceal the probability output.

alach11 1 year ago | |

If we assume distillation remains viable, the game theory implications are huge.

It’s going to shift the market of how foundation models are used. Companies creating models will be incentivized to vertically integrate, owning the full stack of model usage. Exposing powerful models via APIs just lets a competitor clone your work. In a way OpenAI’s Operator is a hint of what’s to come

FooBarWidget 1 year ago | |

There are literally public ChatGPT conversations data sets. For the past 2 years it's been common practice for pretty much all open source models to train on them. Ask just about any open source model who they are and a lot of the time they'll say they're ChatGPT. Why is "having obtained o1 generated data" suddenly such a huge news, to the point of warranting conspiracy theories about undisclosed/undiscovered breaches at OpenAI? Nobody ever made a fuss about public ChatGPT data sets until now. No hacking of OpenAI is needed to obtain ChatGPT data.

bilekas 1 year ago |

> “It’s also extremely hard to rally a big talented research team to charge a new hill in the fog together,” he added. “This is the key to driving progress forward.”

Well I think DeepSeek releasing it open source and on an MIT license will rally the big talent. The open sourcing of a new technology has always driven progress in the past.

The last paragraph too is where OpenAi seems to be focusing their efforts..

> we engage in countermeasures to protect our IP, including a careful process for which frontier capabilities to include in released models ..

> ... we are working closely with the US government to best protect the most capable models from efforts by adversaries and competitors to take US technology.

So they'll go for getting DeepSeek banned like TikTok was now that a precedent has been set ?

mrkpdl 1 year ago |

The cat is out of the bag. This is the landscape now, r1 was made in a post-o1 world. Now other models can distill r1 and so on.

I don’t buy the argument that distilling from o1 undermines deep seek’s claims around expense at all. Just as open AI used the tools ‘available to them’ to train their models (eg everyone else’ data), r1 is using today’s tools.

Does open AI really have a moral or ethical high ground here?

ijidak 1 year ago | |

Plus, it suggests OpenAI never had much of a moat.

Even if they win the legal case, it means weights can be inferred and improved upon simply by using the output that is also your core value add (e.g. the very output you need to sell to the world).

Their moat is about as strong as KFC's eleven herbs and spices. Maybe less...

jamil7 1 year ago | |

Agree 100%, this was also bound to happen eventually, OpenAI could have just remained more "open" from the beginning and embraced the inevitable commoditization of these models. What did delaying this buy them?

khazhoux 1 year ago | | |

What did delaying this cost them, though? Hurt feelings of people here who thought OpenAI personally pledged openness to them?

plantwallshoe 1 year ago | |

I don’t think there is any ethical issue here, but I don’t think it’s good for the industry to remove all incentives for companies to spend lots of money solving hard, novel problems.

Why would anyone go through the effort of training the next groundbreaking model if they know they can just wait for someone else to do it and leverage that work?

csomar 1 year ago | | |

> Why would anyone go through the effort of training the next groundbreaking model if they know they can just wait for someone else to do it and leverage that work?

Why would anyone write, work or research anything if they know it would be consumed by AI and sold on a $xx/month subscription?

blast 1 year ago |

Everyone is responding to the intellectual property issue, but isn't that the less interesting point?

If Deepseek trained off OpenAI, then it wasn't trained from scratch for "pennies on the dollar" and isn't the Sputnik-like technical breakthrough that we've been hearing so much about. That's the news here. Or rather, the potential news, since we don't know if it's true yet.

ok123456 1 year ago |

OpenAI's models were trained on ebooks from a private ebook torrent tracker leeched en-mass during a free leech event by people who hated private torrent trackers and wanted to destroy their "economy."

The books were all in epub format, converted, cleaned to plain text, and hosted on a public data hoarder site.

harry8 1 year ago | |

Have you got some support for this claim?

There's a lot of wild claims about, so while this is plausible it would be great if there were some evidence backing it.

naet 1 year ago | | |

NYT claims that OpenAI trained on their material. They argue for copyright violation, although I think another argument might be breach of TOS in scraping the material from their website or archive.

The complaint filing has some references to some of the other training material used by OpenAI, but I didn't dig deeply in to what all of it was:

https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

OsrsNeedsf2P 1 year ago | | |

He could be confusing it with Llama: https://www.wired.com/story/new-documents-unredacted-meta-co...

wanderingmoose 1 year ago |

There is a lot of discussion here about IP theft. Honest question, from deepseek's point of view as a company under a different set of laws than US/Western -- was there IP theft?

A company like OpenAI can put whatever licensing they want in place. But that only matters if they can enforce it. The question is, can they enforce it against deepseek? Did deepseek do something illegal under the laws of their originating country?

I've had some limited exposure to media related licensing when releasing content in China and what is allowed is very different than what is permitted in the US.

The interesting part which points to innovation moving outside of the US is US companies are beholden to strict IP laws while many places in the world don't have such restrictions and will be able to utilize more data more easily.

glenstein 1 year ago |

All the top level comments are basking in the irony of it, which is fair enough. But I think this changes the Deepseek narrative a bit. If they just benefited from repurposing OpenAI data, that's different than having achieved an engineering breakthrough, which may suggest OpenAI's results were hard earned after all.

daft_pink 1 year ago |

This reminds me of the railroads, where once railroads were invented, there was a huge investment boom of eveyrone trying to make money of the railroads, but the competition brought the costs down where the railroads weren’t the people who generally made the money and got the benefit, but the consumers and regular businesses did and competition caused many to fail.

AI is probably similar where the Moore’s law and advancement will eventually allow people to run open models locally and bring down the cost of operation. Competiition will make it hard for all but one or two players to survive and Nvidia, OpenAI, Deepseek, etc most investments in AI by these large companies will fail to generate substantial wealth but maybe earn some sort of return or maybe not.

me551ah 1 year ago |

OpenAI is going after a company that open sourced their model, by distilling from their non-open AI?

OpenAI talks a lot about the principles of being Open, while still keeping their models closed and not fostering the open source community or sharing their research. Now when a company distills their models using perfectly allowed methods on the public internet, OpenAI wants to shut them down too?

High time OpenAI changes their name to ClosedAI

alexathrowawa9 1 year ago | |

The name OpenAI gets more ridiculous by the day

Would not be surprised if they do a rebrand eventually

bazmattaz 1 year ago | | |

I was thinking about this the other day but I highly doubt they would rebrand name. They’re borderline a household name now - at least ChatGPT is. OpenAI is the face of AI - at least to people who don’t follow the industry

readyplayernull 1 year ago |

Do you remember when Microsoft was caught scrapping data from Google:

https://www.wired.com/2011/02/bing-copies-google/

They don't care, T&C and copyright is void unless it affects them, others can go kick rocks. Not surprising they and OpenAI will do a legal battle over this.

Ciantic 1 year ago |

I'm not being sarcastic, but we may soon have to torrent DeepSeek's model. OpenAI has a lot of clout in the US and could get DeepSeek banned in western countries for copyright.

timeon 1 year ago | |

> US and could get DeepSeek banned in western countries for copyright

If US is going to proceed with trade war on EU, as it was planning anyway, then DeepSeek will be banned only in US. Seems like term "western countries" is slowly eroding.

bbor 1 year ago | | |

Great point. Plus, the revival of serious talk of the Monroe Doctrine (!!!) in the U.S. government lends a possibly completely-new meaning to "western countries" -- i.e. the Americas...

alchemist1e9 1 year ago | |

I think most likely all sorts of data and models need to have a decentralized LLM data archive via torrents etc.

It’s not limited to the models themselves but also OpenAI will probably work towards shutting down access to training data sets also.

imho it’s probably an emergency all hand on deck problem.

aerhardt 1 year ago | |

Unfathomable to me that they'd make themselves look so foolish by trying to ban a piece of software.

forgotoldacc 1 year ago | | |

It wouldn't be foolish. The US has an active cult of personality, and whatever the leader says, half the country believes it unquestioningly. If OpenAI is said to be protecting America and DeepSeek is doing terrible, terrible things to the children (many smart people are saying it), there'll be an overnight pivot to half the country screaming for it to be banned and harassing anyone who says otherwise.

Who cares if some people think you look foolish when you have a locked down 500 billion dollar investment guarantee?

sergiotapia 1 year ago | |

that would be suicide - that company only exists because they stole content for every single person, website and media company on the planet.

dragonwriter 1 year ago |

Hey, OpenAI, so, you know that legal theory that is the entire basis of your argument that any of your products are legal? "Training AI on proprietary data is a use that doesn't require permission from the owner of the data"?

You might want to consider how it applies to this situation.

mhitza 1 year ago |

This is funny because its.

1. Something I'd expect to happen.

2. Lived through a similar scenario in 2010 or so.

Early in my professional career I've worked for a media company that was scraping other sites (think Craigslist but for our local market) to republish the content on our competing website. I wasn't working on that specific project, but I did work on an integration on my teams project where the scraping team could post jobs on our platform directly. When others started scraping "our content" there were a couple of urgent all hands on deck meetings scheduled, with a high level of disbelief.

spyckie2 1 year ago | |

Classic.

kigiri 1 year ago | |

Nice one, thank you for sharing !

1970-01-01 1 year ago |

DeepSeek have more integrity than 'Open'AI by not even pretending to care about that.

jampekka 1 year ago | |

And seem to be more actively fulfilling the mission that 'Open'AI pretends to strive for.

pixelpoet 1 year ago | | |

Exactly, they actually opened up the model and research, which the "Open" company didn't, and merely adjusted some of their pricing tiers to try to combat commercially (but not without mumbling something like "yeah, we totally had these ideas too"). Now every single Meta, OpenAI etc engineer is trying to copy DeepSeek's innovations, and their first act is to... complain about copyright infringement, of all things?! What an absolute clown party, how can these people take themselves seriously, do they just have zero comprehension of what hypocrisy is or what's going on here...

I can scarcely process all the levels of irony involved, the irony-o-meter is pegged and I can't get the good one from the safe because I'm incapacitated from laughter.

sho_hn 1 year ago |

While I'm as amused as everyone else - I think it's technically accurate to point out that the "we trained it for $6 mio" narrative is contingent on the done investment by others.

Palmik 1 year ago | |

When I use NVIDIA GPUs to train a model, I do not consider the R&D cost to develop all of those GPUs as part of my costs.

When I use an API to generate some data, I do not consider the R&D cost to develop the API as part of my costs.

kobalsky 1 year ago | |

OpenAI has been in a war-room for days searching for a match in the data, and they just came out with this without providing proof.

My cynical opinion is that the traning corpus has some small amount of data generated by OpenAI, which is probably impossible to avoid at this point, and they are hanging on that thread for dear life.

bbqfog 1 year ago | |

OpenAI's models were also trained on billions of dollars of "free" labor that produced the content that it was trained on.

sho_hn 1 year ago | | |

Oh, absolutely. I'm not defending OpenAI, I just care about accurate reporting. Even on HN - even in this thread - you see people who came away with the conclusion that DeepSeek did something while "cutting cost by 27x".

But that's a bit like saying that by painting a a bare wall green you have demonstrated that you can build green walls 27x cheaper, ignoring the cost of building the wall in the first place.

Smarter reporting and discourse would explain how this iterative process actually works and who is building on who and how, not frame it as two competing from-scratch clean room efforts. It'd help clear up expectations of what's coming next.

It's a bit similar to how many are saying DeepSeek have demonstrated independence from nVidia, when part of the clever thing they did was figure out how to make the intentionally gimped H800s work for their training runs by doing low-level optimizations that are more nVidia-specific, etc.

Rarely have I seen a highly technical topic see produce more uninformed snap takes than this week.

scotty79 1 year ago | |

The opposite, is claiming that OpenAI could have now built better performing, cheaper to run model (when compared to what they published) training it at 1% cost on output of their previous models. ... But they chose not to do it.

freehorse 1 year ago | |

That is the case anyway for training any llm. It is contingent on the work done by all those who produced the data.

olalonde 1 year ago |

If it's true, how is it problematic? It seems aligned with their mission:

> We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome.

> We will actively cooperate with other research and policy institutions; we seek to create a global community working together to address AGI’s global challenges.

https://openai.com/charter/

/s, we all know what their true mission is...

concerndc1tizen 1 year ago |

Is OpenAI claiming copyright ownership over the generated synthetic data?

That would be a dangerous precedent to establish.

If it's a terms of service violation, I guess they're within their rights to terminate service, but what other recourse do they have?

Other than that, perhaps this is just rhetoric aimed at introducing restrictions in the US, to prevent access to foreign AI, to establish a national monopoly?

thorum 1 year ago |

> “It is (relatively) easy to copy something that you know works,” Altman tweeted. “It is extremely hard to do something new, risky, and difficult when you don’t know if it will work.”

The humor/hypocrisy of the situation aside, it does seem to be true that OpenAI is consistently the one coming up with new ideas first (GPT 4, o1, 4o-style multimodality, voice chat, DALL-E, …) and then other companies reproduce their work, and get more credit because they actually publish the research.

Unfortunately for them it’s challenging to profit in the long term from being first in this space and the time it takes for each new idea to be reproduced is getting shorter.

divbzero 1 year ago |

I was wondering if this might be the case, similar to how Bing’s initial training included Google’s search results [1]. I’d be curious to see more details of OpenAI’s evidence.

It is, of course, quite ironic for OpenAI to indiscriminately scrape the entire web and then complain about being scraped themselves.

[1]: https://searchengineland.com/google-bing-is-cheating-copying...

esskay 1 year ago |

Hard to really have any sympathy for OpenAI's position when they're actively stealing content, ignoring requests to stop then spending huge amounts to get around sites running ai poisoning scripts, making it clear they'll still take your content regardless of if you consent to it.

michaelmarkell 1 year ago | |

Can someone with more expertise help me understand what I'm looking at here? https://crt.sh/?id=10106356492

It looks like Deepseek had a subdomain called "openai-us1.deepseek.com". What is a legitimate use-case for hosting an openai proxy(?) on your subdomain like this?

Not implying anything's off here, but it's interesting to me that this OpenAI entity is one of the few subdomains they have on their site

gkbrk 1 year ago | | |

Could just be an OpenAI-compatible endpoint too. A lot of LLM tools use OpenAI compatible APIs, just like a lot of Object Storage tools use S3 compatible APIs.

bhouston 1 year ago |

The US government likely will favor a large strategic company like OpenAI instead of individual's copyrights, so while ironic, the US government definitely doesn't care.

And the US government is also likely itching to reduce the power of Chinese AI companies that could out compete US rivals (similar to the treatment of BYD, TikTok, solar panel manufacturers, network equipment manufacturers, etc), so expect sweeping legislation that blocks access to all Chinese AI endeavours to both the US and then soon US allies/West (via US pressure.)

The likely legislation will be on the surface justified both by security concerns and by intellectual property concerns, but ultimately it will be motivated by winning the economic competition between China and the US and it will attempt to tilt the balance via explicitly protectionist policies.

derektank 1 year ago | |

>The US government likely will favor a large strategic company like OpenAI instead of individual's copyrights

Even if we assume this is true, Disney and Netflix are both currently worth more than OpenAI and both rely on the strict enforcement of US copyright law. I do not think it is so obvious which powers that be have the better lobbying efforts and, currently, it's looking like this question will mostly be adjudicated by the courts, not Congress, anyways.

bhouston 1 year ago | | |

I don't think OpenAI stole from Disney or Netflix. Rather OpenAI stole from individual artists and YouTube and other social media who users do not really have any lobbying power.

So I think OpenAI, Disney and Netflix win together. Big companies tend to win.

tokioyoyo 1 year ago | |

I don’t think US government can move fast enough to change the trajectory. Also it doesn’t help that basically every government is second guessing their alliance with the US. It’s not an industry that can ruin local industries either (like cheap BYD is bad for German cars).

It’s a very fun thing to watch from the sidelines right now, if I’ll be honest.

buyucu 1 year ago | |

It's too late for that. That ship sailed a long time ago.

The best language model right now is open source. Let that sink in.

_pferreir_ 1 year ago | | |

DeepSeek is not Open Source. That's like saying that Microsoft Edge is Open Source, as you can download it for free.

https://huggingface.co/blog/open-r1

karim79 1 year ago |

Oh God. I know exactly how this feels. A few years ago I made a bread hydration and conversion calculator for a friend, and put it up on JSFiddle. My friend, at the time, was an apprentice baker.

Just weeks later, I discovered that others were pulling off similar calculations! They were making great bread with ease and not having to resort to notebooks and calculators! The horror! I can't believe that said close friend of mine would actually share those highly hydraty mathematical formulas with other humans without first requesting my consent </sarc>.

Could it be, that this stuff just ends up in the dumpster of "sorry you can't patent math" or the like?

oli5679 1 year ago |

this is pretty ridiculous

A. below is a list of OpenAI initial hires from Google. It's implausible to me that there wasn't quite significant transfer of Google IP

B. google published extensively, including the famous 'attention is all you need' paper, but open-ai despite its name, has not explained the breakthroughs that enabled O1. It has also switched from a charity to a for-profit company.

C. Now this company, with a group of smart, unknown machine learning engineers, presumably paid fractions of what OpenAI are published, has created a model far cheaper, and openly published the weights, many methodological insights, which will be used by OpenAI.

1. Ilya Sutskever – One of OpenAI’s co-founders and its former Chief Scientist. He previously worked at Google Brain, where he contributed to the development of deep learning models, including TensorFlow. 2. Jakub Pachocki – Formerly OpenAI’s Director of Research, he played a major role in the development of GPT-4. He had a background in AI research that overlapped with Google’s fields of interest. 3. John Schulman – Co-founder of OpenAI, he worked on reinforcement learning and helped develop Proximal Policy Optimization (PPO), a method used in training AI models. While not a direct Google hire, his work aligned with DeepMind’s research areas. 4. Jeffrey Wu – One of the key researchers involved in fine-tuning OpenAI’s models. He worked on reinforcement learning techniques similar to those developed at DeepMind. 5. Girish Sastry – Previously involved in OpenAI’s safety and alignment work, he had research experience that overlapped with Google’s AI safety initiatives.

throwaway314155 1 year ago | |

> A. below is a list of OpenAI initial hires from Google. It's implausible to me that there wasn't quite significant transfer of Google IP

I agree there's hypocrisy but in terms of making a strong argument, you can safely remove your list of persons who (drum roll)... mostly _didn't_ actually work at Google?

dumah 1 year ago | |

my_ridiculous_list = ["Ilya Sutskever"]

osigurdson 1 year ago |

I do think that distilling a model from another is much less impressive than distilling one from raw text. However, it is hard to say if it is really illegal or even immoral, perhaps just one step further in the evolution of the space.

lemoncookiechip 1 year ago | |

It's about as illegal as the billions, if not trillions of IPs that ClosedAI infringed to train their own data without consent. Not that they're alone, and I personally don't mind that AI companies do it, but it's still amusing when they get this annoyed at others doing the same thing to them.

osigurdson 1 year ago | | |

I think they had the advantage of being ahead of the law in this regard. To my knowledge, reading copywritten material isn't (or wasn't illegal) and remains a legal grey area.

Distilling weights from prompts and responses is even more of a legal grey area. The legal system cannot respond quickly to such technological advancements so things necessarily remain a wild west until technology reaches the asymptotic portion of the curve.

In my view the most interesting thing is, do we really need vast data centers and innumerable GPUs for AGI? In other words, if intelligence is ultimately a function of power input, what is the shape of the curve?

ReptileMan 1 year ago | | |

Is the question of training AI on data fair use settled yet? Because if it is not - it looks like fair use to me.

scotty79 1 year ago | |

Isn't it more impressive given that training on model output usually leads to worse model?

If they actually figured out how to use output of existing models to build model that outperforms them then it's something that brings us closer to singularity than every other development so far.

TylerJaacks 1 year ago |

Cry me a fucking river OpenAI, as if your business model isn't entirely based on this exact same thing.

nelblu 1 year ago |

Hahaha I can't stop laughing... i dont know the validity of the claim, but immediately i thought of the British Museum complaining about theft.

grogenaut 1 year ago | |

there's an exhibit in the BM about how they're proud to be allowing the Egyptian government to take back some of the artifacts the British have been safeguarding for the world while Egypt was going through essentially "troubles".

right next to it is an older exhibit about how the original curator took cuneiform rolls and made them into necklace beads for his wife and rings? for himself.

either someone at the BM has a very british sense of humor or it's a gigantic woosh. I laughed my ass off. People looked at me.

isaacremuant 1 year ago | | |

The safeguarding propaganda is a a typical go-to of the remnants of the British empire to keep their stolen goods.

They do it even with the Chile Moais when they never where in any danger.

It's all lies.

kranke155 1 year ago |

The very idea that OAI scrapes the entire internet and ignore individual rights and thats ok, but if another company takes the output data from their model, thats a gross violation of the law / TOS - that very idea is evil.

alasr 1 year ago |

> OpenAI says it has evidence DeepSeek used its model to train competitor.

> The San Francisco-based ChatGPT maker told the Financial Times it had seen some evidence of “distillation”, which it suspects to be from DeepSeek.

> ...

> OpenAI declined to comment further or provide details of its evidence. Its terms of service state users cannot “copy” any of its services or “use output to develop models that compete with OpenAI”.

OAI share the evidence with the public; or, accept the possibility that your case is not as strong as you're claiming here.

janalsncm 1 year ago | |

Also, there are so many innovations in their papers (Deepseek math, Deepseek v2/v3, R1) that I honestly wouldn’t even care. They figured out a way to train on only 2048 H800s when big companies are buying them in the hundreds of thousands. They created a new RL algorithm. They improved MoE. They improved the KV cache. They built an super efficient training framework.

glitchc 1 year ago |

[flagged]

dang 1 year ago | |

Would you please not do this here? We're trying for an opposite sort of conversation.

https://news.ycombinator.com/newsguidelines.html

glitchc 1 year ago | | |

Sorry dang. I'll do better.

kamranjon 1 year ago |

I was just wondering if this is even feasible?

The amount of iterations of training that would be needed for DeepSeek to actually learn anything from OpenAI would seem to be an insane amount of requests from a non-local AI, which you’d think would be immediately obvious to OpenAI just by looking at suspicious requests?

Am I correct in this assumption or am I missing something? Is it even realistic that something like this is possible without a local model?

fimdomeio 1 year ago |

But what is the problem here? Isn’t open AI mission “to ensure that artificial general intelligence benefits all of humanity”? Sounds like success to me.

emsign 1 year ago |

"yOu ShOuLdN't TaKe OtHeR pEoPlE's DaTa!1!1" are they mental? How can people at OpenAI lack be so self-righteous and unaware? Is thia arrogance or a mental illness?

duchenne 1 year ago |

The reasoning happens in the chain of thoughts. But OpenAI (aka ClosedAI) doesn't show this part when you use the o1 model, whether through the API or chat. They hide it to prevent distillation. Deepseek, though, has come up with something new.

manamorphic 1 year ago | |

Crazy how most people miss this simple logical deduction.

metaxz 1 year ago |

I don't understand how OpenAI claims it would have happened. The weights are closed and as far as I read they are not complaining Deepseek hacked them and obtained the weight. So all they could do was to query OpenAI and generate test data. But how much did they query really - I would suppose it would require a huge amount done via an external, paid-for API? Is there any proof of this besides OpenAI saying it? Even if we suppose it is true, I suppose this must have happened via the API so they paid per token etc. So they paid for each and every token of training data. As I understand, the requester owns the copyright on what is generated by OpenAI's models and is free to do what they want.

temporallobe 1 year ago |

OpenAI is also possibly in violation of many IP laws by scraping the entirety of the internet and using to train their models, so there’s that.

InkCanon 1 year ago | |

To my understanding, OpenAI won the case where it argued training was covered under fair use and did not infringe on copyright.

Austiiiiii 1 year ago | | |

Is there any reason they wouldn't rule the same way on DeepSeek training on OpenAI data? After all, one of the big selling points of GPT has been that businesses can freely use the information provided. They're paying for the service, after all. I'd very be interested to know how DeepSeek's usage (very reasonably assuming that they paid for their OpenAI subscription) is any different.

nbgoodall 1 year ago |

I lol'd, from the DeepSeek news release[1]: "Pushing the boundaries of open AI!"

[1]: https://api-docs.deepseek.com/news/news250120

__MatrixMan__ 1 year ago |

If they want us to care they can open up their models so we can be the judge.

_moof 1 year ago |

This reminds me of a (probably apocryphal) story about fast food chains that made the rounds decades ago: McDonald's invests tons of time into finding the best real estate for new stores; Burger King just opens stores near McDonalds!

nazgulsenpai 1 year ago | |

About 15 years ago, as CVS Pharmacy expanded into their new, stand-alone properties (in our region), Walgreen's Pharmacy started appearing across the street almost instantaneously. I've seen it happen at 4 separate locations so most certainly not coincidence -- so I believe it :)

crowcroft 1 year ago |

The AI companies were happy to take whatever they want and put the onus of proving they were breaking the law onto publishers by challenging them to take things to court.

Don't get mad about possible data theft, prove it in court.

jgrall 1 year ago |

It’s not a good look when your technology is replicated for a fraction of the cost, and your response is to smear your competition with (probably) false accusations and cozy up to the US government to tighten already shortsighted export controls. Hubris & xenophobia are not going to serve American companies well. Personally I welcome the Chinese - or anyone else for that matter - developing advanced technologies as long as they are used for good. Humanity loses if we allow this stuff to be “owned” by a handful of companies or a single country.

karim79 1 year ago |

So, banning high-powered chips to China has basically had the effect of turning them into extremophiles. I mean, that seems like a good plan </sarc>. Moreover, it is certainly slowing sales of one of the darling companies of the US (NVidia).

I just can't even begin to imagine what will come of this riduculous techno-imperialism/AI arms-race, or whatever you want to call it. It should not be too hard for China to create their own ASICs which do the same, and finally be done with this palaver.

InkCanon 1 year ago |

It's like that Dr Phil episode where he meets the guy who created Bum Fights!

selimthegrim 1 year ago | |

Dr. Phil is riding along with ICE now; I wonder what Bum Fights guy would have to say about that.

whatshisface 1 year ago |

It's reasonably likely that a lot of people linked to the federal government want to ban DeepSeek. You can tell it's being presented away from "they gave us a free set of weights" and towards "they destroyed $1T of shareholder value." (By revealing that Microsoft et al. paid way too much to OpenAI et al. for technology that was actually easy to reinvent.)

cumulative00x 1 year ago |

There is a saying in Turkish that roughly goes like this, it takes a thief to catch a thief. I am not a big fan of China's tech, too, however, it amuses me to watch how big tech charlatans have been crying over Deepseek shock.

gosub100 1 year ago | |

It's true irony to see thieves getting stolen from.

nba456_ 1 year ago |

A big part of project 2025 is increasing patent regulations. I would not be surprised if the current admin moves to ban DeepSeek because of this.

bicepjai 1 year ago |

Reading this post, I can’t help but wonder if people realize the irony in what they’re saying. 1. “The issue is when you [take it out of the platform and] are doing it to create your own model for your own purposes,” 2. “There’s a technique in AI called distillation . . . when one model learns from another model [and] kind of sucks the knowledge out of the parent model,”

palisade 1 year ago | |

Is this really the point OpenAI wants to start debating? When OpenAI steals everyone's data, it is fine. Right? But, let us pull the ladder up after that.

ForHackernews 1 year ago |

What's good for the goose is good for the gander. Obviously a transformative work and not an intellectual property violation any more than OpenAI injesting every piece of media in existence.

dagelf 1 year ago | |

Injesting is sure the right take. What a circus!

cratermoon 1 year ago |

Ironic, OpenAI claiming someone else stole their work.

hyperbovine 1 year ago |

Live by the sword...

vinni2 1 year ago |

How would they prove they used it’s model. I would be curious to know their methodology. Also what legal actions OpenAI can take? can DeepSeek be banned in US?

paul_e_warner 1 year ago | |

If you read the article (which I know no one does anymore)

>OpenAI and its partner Microsoft investigated accounts believed to be DeepSeek’s last year that were using OpenAI’s application programming interface (API) and blocked their access on suspicion of distillation that violated the terms of service, another person with direct knowledge said. These investigations were first reported by Bloomberg.

iforgot22 1 year ago | |

They might show DeepSeek's model calling itself ChatGPT, which users have already alleged. Same as how Cisco proved Huawei was stealing router code.

Except in this case, nothing was stolen, unless they want to call ChatGPT's own training on source data theft too.

freehorse 1 year ago | | |

ChatGPT outputs are all over the internet. It is harder to prove that deepseek used specifically o1 for training, instead of a lot of chatgpt output ending up in the training set from other sources.

Kiro 1 year ago |

> Furious [...] shocked

I'm not seeing it. I get it, the narrative that OpenAI is getting a taste of their own medicine is funny but this is not serious reporting.

Kiro 1 year ago | |

The link has been changed. My comment was about a different article that speculated on what OpenAI was "feeling" using hyperbole.

guybedo 1 year ago |

This is hilarious.

Everybody has evidence OpenAI scraped the internet at a global scale and used terabytes of data it didn't pay for. Newspapers, books, etc...

cbracketdash 1 year ago |

Let's also not forget Suchir Balaji, who was mysteriously killed when exposing OpenAI's violation of copyright law.

JBSay 1 year ago |

When China is more open than you, you've got a problem

jchook 1 year ago |

Friendly reminder that China publishes twice as many AI papers as the US[1], and twice as many science and engineering papers as the US.

China leads the world in the most cited papers[2]. The US's share of the top 1% highly cited articles (HCA) has declined significantly since 2016 (1.91 to 1.66%), and the same has doubled in China since 2011 (0.66 to 1.28%)[3].

China also leads the world in the number of generative AI patents[4].

1. https://www.bfna.org/digital-world/infographic-ai-research-a...

2. https://www.science.org/content/article/china-rises-first-pl...

3. https://ncses.nsf.gov/pubs/nsb202333/impact-of-published-res...

4. https://www.wipo.int/web-publications/patent-landscape-repor...

waffletower 1 year ago |

"Stole" - I don't believe that word means what he thinks it means. Perhaps I pre-maturely anthropomorphize AI -- yet when I read a novel, such as The Sorcerer's Stone, I am not guilty of stealing Rowling's work, even if I didn't purchase the book but instead found it and read it in a friend's bathroom. Now if I were to take the specific plot and characters of that story and write a screenplay or novel directly based on it, and, explicitly, attempt to sell this work, perhaps the verb chosen here would be appropriate.

conartist6 1 year ago |

It seems to be undermined by the same principle that says that going into a library and reading a book there is not stealing when you walk out with the knowledge from the book.

OpenAI seems to feel that way about the their use of copyrighted material: since they didn't literally make a copy of the source material, it's totally fair game. It seems like this is the same argument that protects DeepSeek if indeed they did this. And why not, reading a lot of books from the library is a way to get smarter, and ostensibly the point of libraries

hugoromano 1 year ago |

OpenAI initially scraped the web and later formed partnerships to train on licensed data. Now, they claim that DeepSeek was trained on their models. However, DeepSeek couldn't use these models for free and had to pay API fees to OpenAI. From a legal standpoint, this could be seen as a violation of the terms and conditions. While I may be mistaken, it's unclear how DeepSeek could have trained their models without compensating OpenAI. Basically, OpenAI is saying machines can't learn from their outputs as humans do.

sabhiram 1 year ago |

The grapes are sour because their moat is crumbling.

What was supposed to be a model, training, and data moat - is now reduced to operational cost, which they are not terribly efficient for.

OpenAI has been on a journey to burn as much $ as possible to get as far ahead on those three moats, to the point where decreasing TCO for them on inference was not even relevant - "who cares if you save me 20% of costs when I can raise on a 150b pre money value?".

Well, with their moats disappearing, they will have no choice but to compete on inference cost like everyone else.

adam_arthur 1 year ago |

Who cares?

They did the exact same thing with public information. Their model just synthesizes and puts out the same information in a slightly different form.

Next we should sue students for repeating the words of their teachers

janalsncm 1 year ago |

I think readers should note that the article did not provide any evidence for OpenAI’s claims, only OpenAI declining to provide evidence, various people repeating the claim, others reacting to it.

It does matter whether it happened and how much it happened. Deepseek ran head to head comparisons against O1 so it would be pretty reasonable for them to have made API calls, for example.

But also, as the article notes, distillation, supervised fine tuning, and using LLM as a judge are all common techniques in research, which OpenAI knows very well.

WD-42 1 year ago |

Information wants to be free! No, not like that!

jongjong 1 year ago |

If the material which OpenAI is trained on is itself not subject to copyright protections, then other LLMs trained on OpenAI should also not be subject to any copyright restrictions.

You can't have both ways... If OpenAI wants to claim that the AI is not repeating content but 'synthesizing it' in the same was as a human student would do... Then I think the same logic should extend to DeepSeek.

Now if OpenAI wants to claim that its own output is in fact copyright-protected, then it seems like it should owe royalty payments to everyone whose content was sourced upstream to build its own training set. Also, synthetic content which is derived from real content should also be factored in.

TBH, this could make a strong case for taxing AI. Like some kind of fee for human knowledge and distributed as UBI. The training data played a key part in this AI innovation.

As an open source coder, I know that my copyrighted code is being used by AI to help other people produce derived code and, by adapting it in this way, it's making my own code less relevant to some extent... In effect, it could be said that my code has been mixed in with the code of other open source developers and weaponized against us.

It feels like it could go either way TBH but there needs to be consistency.

highfrequency 1 year ago |

If true, the question is: did they use ChatGPT outputs to create Deepseek V3 only, or is the R1-zero training process a complete lie (given that the whole premise is that they used pure reinforcement learning)? If they only used ChatGPT output when training V3, then they succeeded in basically replicating the jump from ChatGPT-4o to o1 without any human-labeled CoT (and published the results) - which is a big achievement on its own.

xinayder 1 year ago |

Well, this sounds like they are just crying because they are losing the race so far. Besides, DeepSeek explicitly states they did a study on distillation on ChatGPT, then OpenAI is like "oh see guys they used our models!!!!!"

khazhoux 1 year ago | |

By what metric are they losing?

xinayder 1 year ago | | |

DeepSeek is a fraction of the cost of ChatGPT, they needed far few resources than OpenAI. This is essentially what caused the massive selloff in Nvidia, as a new competitor model is just as good and requires a fraction of the massive costs.

I don't remember the correct metric but the cost for DeepSeek was like $15/mo while ChatGPT was $200

amarcheschi 1 year ago |

I quite like a scenery where llm output can't be copyrighted, so that it is possible to eventually train a llm with data from the previous one(s)

layer8 1 year ago | |

OpenAI argues it’s a violation of their terms of service. So there are legal issues if it can be proven.

Palmik 1 year ago | | |

Legal issues for who?

Company A pays OpenAI for their API. They use the API to generate or augment a lot of data. They own the data. They post the data on the open Internet.

Company B has the habit of scraping various pages on the Internet to train its large language models, which includes the data posted by Company A. [1]

OpenAI is undoubtedly breaking many terms of service and licenses when it uses most of the open Internet to train its models. Not to mention potential copyright violations (which do not apply to AI outputs).

[1]: This is not hypothetical BTW. In the early days of LLMs, lots of large labs accidentally and not so accidentally trained on the now famous ShareGPT dataset (outputs from ChatGPT shared on the ShareGPT website).

mannewalis 1 year ago | | |

But OpenAI's model isn't open source, how would they distill knowledge without direct access to the model?

mbowcut2 1 year ago |

So, is this just an example of the first-mover disadvantage (or maybe the problem of producing public goods?). The first AI models were orders of magnitude more expensive to create, but now that they're here we can, with techniques like distillation, replicate them at a fraction of the cost. I am not really literate in the law but weren't patents invented to solve problems like this?

elashri 1 year ago |

There is an Egyptian say that would translate to something like

"We didn’t see them when they were stealing, we saw them when they were fighting over what was stolen"

That describes this situation. Although to be honest all this aggressive scraping is noticeable but for people who understand that which is not majority of people. but now everyone knows.

waveBidder 1 year ago | |

> Although to be honest all this aggressive scraping is noticeable but for people who understand that which is not majority of people.

When you say noticeable, do you mean in like, traffic statistics? Or in what the model knows that it clearly shouldn't if it wasn't trained in legally dubious ways?

sadjad 1 year ago | |

"When two thieves quarrel, what was stolen emerges."

meiraleal 1 year ago | |

"We didn’t see them when we were stealing, we saw them when they were fighting over what we stole"

fixed for you

nicce 1 year ago | | |

That means a different thing.

827a 1 year ago |

This smells very suspiciously like: someone who doesn't know anything about AI (possibly Sacks) demanding answers on R1 from someone who doesn't have any good ones (possibly Altman). "Uh, (sweating), umm, (shaking), they stole it from us! Yeah, look at this suspicious activity, that's why they had it so easy, we did all the hard work first!"

fundad 1 year ago | |

I think it's funny that OpenAI wants us to pay them to use their product to generate content but then sets the terms that they control how we use the content in generates for us. It takes someone like Deepseek to challenge that on our behalf or they will control most of the economy.

exitb 1 year ago | | |

It’s quite ironic of them to claim that the only thing you cannot train on is another LLM output.

ceejayoz 1 year ago |

"You can't take data without asking" seems like a court precedent OpenAI really, really, really wants to avoid. And yet...

amelius 1 year ago | |

Why? When did large companies care about laws? See e.g. Uber, AirBnb.

The only thing government cares about at this point is if information is shared with China.

ceejayoz 1 year ago | | |

They care when they get big enough to attract attention from people like state AGs who can actually put the hurt on a bit. Uber and AirBnB both hit this point years ago; OpenAI's starting to hit it.

layer8 1 year ago | |

OpenAI is saying that their service was used in violation of their TOS, which is a bit different than just copying data. To be clear I’m not on OpenAI’s side, but it looks to me that the legal situation isn’t exactly analogous.

orlp 1 year ago | | |

If using data violating some ToS taints the model trained on that data, then all of OpenAI's models are tainted by the millions of ToS'es they broke.

kavalg 1 year ago | | |

As others have noted, if one company agrees to the ToS, asks "the right" questions and then publishes the ChatGPT answers, there is not violation of ToS. Then a second company scrapes the published Q&A, along with other information from the internet and again there is no violation (not more than the violations of OpenAI).

hdjjhhvvhga 1 year ago | | |

> OpenAI is saying that their service was used in violation of their TOS

Which is the most ridiculous argument they could use because they didn't respect any ToS (or copyright laws, for that matter) when scraping the whole web, books from Libgen and who knows what more.

dkjaudyeqooe 1 year ago | | |

But whats the remedy in that case? Being banned from the service maybe, but no court is going to force a "return" of the data, so DeepSeek can't use it. It's uncopyrightable.

DebtDeflation 1 year ago | | |

Tons of websites and books they scraped had copyright notices.

ranger_danger 1 year ago |

Reminds me of Steve Jobs complaining to Bill Gates about MS "stealing" the GUI concept from them, which they in turn had stolen from Xerox.

Obligatory "Everything is a Remix" https://www.youtube.com/watch?v=X9RYuvPCQUA

pama 1 year ago |

The R1 paper used o1-mini and o1-1217 in their comparisons, so I imagine they needed to use lots of OpenAI compute in December and January to evaluate their benchmarks in the same way as the rest of their pipeline. They show that distilling to smaller models works wonders, but you need the thought traces, which o1 does not provide. My best guess is that these types of news are just noise.

[edit: the above comment was based on sensetionalist reporting in the original link and not the current FT article. I still think there is a lot of noise in these news this last week, but it may well be that openai has valid evidence of wrongdoing; I would guess that any such wrongdoing would apply directly to V3 rather than R1-zero, because o1 does not provide traces and generating synthetic thinking data with 4o may be counterproductive.]

HarHarVeryFunny 1 year ago |

DeepSeek-R1's multi-step bootstrapping process, starting with their DeepSeek-V3 base model, would only seem to need a small amount of reasoning data for the DeepSeek-R0 RL training, after which that becomes the source for further data, along with some other sources that they mention.

Of course it's possible that DeepSeek used O1 to generate some of this initial bootstrapping data, but not obvious. O1 anyways deliberately obfuscates it's reasoning process (see "Hiding the chains of thought" section of OpenAI's "Learning to reason with LLMs" page), such that what you see is an after-the-fact "summary" of what it actually did; so, if DeepSeek did indeed use some of O1's output to train on, it shows that the details of O1's own reasoning process isn't as important as they thought it was - it's just having some verified (i.e. leading to good outcome) reasoning data from any source that matters to get started.

ks2048 1 year ago |

The schadenfreude and irony of this is totally understandable.

But, I wonder - do companies like OpenAI, Google, and Anthropic use each others models for training? If not, is it because they don't want to or need to, or because they are afraid of breaking the ToC?

paul_e_warner 1 year ago |

There seem to be two kinda incompatible things in this article: 1. R1 is a distillation o1. This is against it's terms of service and possibly some form of IP theft. 2. R1 was leveraging GPT-4 to make it's output seem more human. This is very common and most universities and startups do it and it's impossible to prevent.

When you take both of these points and put them back to back, a natural answer seems to suggest itself which I'm not sure the authors intended to imply: R1 attempted to use o1 to make its answers seem more human, and as a result it accidentally picked up most of it's reasoning capabilities in the process. Is my reading totally off?

SubiculumCode 1 year ago |

If you have a set of weights A, can you derive another set of weights B that function (near) identically as A AND a) not appear to be the same weights as A when inspected superficially b) appear uncorrelated when inspecting the weight matrices?

rahimnathwani 1 year ago | |

Do you mean for a given model structure, can two sets of weights give substantially the same outputs?

Even if that were possible, it would be suspicious if you were to release an open model whose model architecture is identical to that of a closed one from a competitor.

If that is what happened, we'd know about it by now.

staticelf 1 year ago |

Not only do OpenAI and other steal data, they also spam the web with requests and crawl websites over and over.

https://pod.geraspora.de/posts/17342163

kelseydh 1 year ago | |

Wow I never realized how prolific and excessive the traffic was.

oysmal 1 year ago |

Given that the training approach was open sourced, their claim can be independently verified. Huggingface is currently doing that with Open R1, so hopefully we will get a concrete answer to whether these accusations are merited or not.

zhenghao1 1 year ago |

All I see is sour grapes. Can't stand someone else coming up with a far more superior and cheaper alternative. This is business dude. There's always going to be some new disruptor to shake the market up.

ginkgotree 1 year ago |

I did not have in my cards: PRC open sourcing most powerful LLM by stealing data set from "OpenAI" As someone that is very Pro-America and Pro-Democracy, the iron here is just... so sweet.

whoknowsidont 1 year ago |

They can claim this all they want. But DeepSeek released the paper (several actually) on what they did, and it's already been replicated in other models.

It simply doesn't matter. Their methodology works.

deeviant 1 year ago |

Hmm, let’s see—it looks like an easy legal defense.

DeepSeek could simply admit, "Yep, oops, we did it," but argue that they only used the data to train Model X. So, if you want compensation, you can have all the revenue from Model X (which, conveniently, amounts to nothing).

Sure, they then used Model X to train Model Y, but would you really argue that the original copyright holders are entitled to all financial benefits derived from their work—especially when that benefit comes in the form of a model trained on their data without permission?

nataliste 1 year ago |

A Wolf had stolen a Lamb and was carrying it off to his lair to eat it. But his plans were very much changed when he met a Lion, who, without making any excuses, took the Lamb away from him.

The Wolf made off to a safe distance, and then said in a much injured tone:

"You have no right to take my property like that!"

The Lion looked back, but as the Wolf was too far away to be taught a lesson without too much inconvenience, he said:

"Your property? Did you buy it, or did the Shepherd make you a gift of it? Pray tell me, how did you get it?"

What is evil won is evil lost.

JBits 1 year ago |

What is the evidence that DeepSeek used OpenAI to train their model? Isn't this claim directly benefitting OpenAI as they can argue that any superior model requires their model?

thih9 1 year ago |

I don't mind and I believe that a company with "open" in its name shouldn't mind either.

I hope this is actually true and OpenAI loses its close to monopoly status. Having a for profit entity safeguarding a popular resource like this sounds miserable for everyone else.

At the moment AI looks like typical VC scheme: build something off someone else's work, sell it at cost at first, shove it down everyone's throats and when it's too late, hike the prices. I don't like that.

caseyy 1 year ago |

Seeing as OpenAI is on the back foot, I hope nationalistic politicians don’t use this opportunity to strengthen patent laws.

If one could effectively patent software inventions, this would kill many industries, from video games (that all have mechanics of other games in them) to computing in general (fast algorithms, etc). Let’s hope no one gets ideas like that…

Granted, it would be ineffective in competing against China’s tech industry. But less effective laws have been lobbied through in the past.

SilverBirch 1 year ago |

I think OpenAI is in a really weak position here. There are essentially two positions you can be in: You can be the agile new startup that can break the rules and move fast. That's what OpenAI used to be. Or you can be the big incumbent who is going to use your enormous resources to crush your opposition. That's Google & Microsoft here. For Microsoft to say "We're going to tie you up in lawsuits about the way you trained this model" would be perfectly expected and they can use that strategy because at any given time they have 1,000 lawyers and lobbyists hanging around waiting to do exactly that. But OpenAI can't do that. They don't have Google or Microsoft's legal teams or lobbyists or distribution channels. SO whilst it's funny that OpenAI are kind of trying to go down this road, this isn't actually a strategy that is going to work for them, they're still a minnow and they're going to get distracted and slowed down by this.

golly_ned 1 year ago | |

> they're still a minnow

3K+ employees, $3B+ revenue, ... sure, not BigTech but hardly a minnow. A company that big can chew gum and walk at the same time.

lou1306 1 year ago | | |

They're trying to bark up a tree that might happen to be backed by the People's Republic of China. That's not their league, and even Microsoft would think twice before getting into that kind of kerfuffle.

dauhak 1 year ago | | |

They're also still deep in their loss-making phase, the whole "incumbent squashing upstarts" stance is a lot easier to pull off when you're settled and printing money

__loam 1 year ago | | |

$7-8b in costs so they're losing $5b

jhanschoo 1 year ago | |

> For Microsoft to say "We're going to tie you up in lawsuits about the way you trained this model" would be perfectly expected and they can use that strategy because at any given time they have 1,000 lawyers and lobbyists hanging around waiting to do exactly that. But OpenAI can't do that. They don't have Google or Microsoft's legal teams or lobbyists or distribution channels.

I think it's also hilarious that suppose they can do that then they will end up suppressing innovation within the US, and eager groups in China would just innovate without having to worry about this hostile landscape.

htrp 1 year ago | |

But microsoft is one of their backers?

bagels 1 year ago | |

Anyone who uses DeepSeek commercially is now opening the door to litigation from them as well.

davesque 1 year ago |

I recently thought of a related question. Actually, I'm almost certain that foundation model trainers have thought of this. The question is to what extent are popular modern benchmarks (or any reference to them, or description of them, etc.) bring scrubbed from the training data? Or are popular benchmarks designed in such a way that they can be re-parametrized for each run? In any case, it seems like a surprisingly hard problem to deal with.

ripped_britches 1 year ago |

There were definitely still very impressive engineering breakthroughs.

Also it’s pretty good confirmation that synthetic data is a valid answer to the data wall problem (non-problem).

zoba 1 year ago |

Does OpenAI's API attempt to detect this sort of thing? Could they start outputting bad information if they suspect a distillation attempt is underway?

_hcuq 1 year ago |

They should be happy. Now that can provide that amazing AI much more cheaply. They don't need half a trillion dollars worth of Nvidia chips.

MagicMoonlight 1 year ago |

So much for that walled garden. If rival firms can just download your entire model by talking to it then your company shouldn’t be worth billions.

henry_viii 1 year ago |

So Meta can train its AI on all the pirated books in the world but people are losing their mind over an AI learning from another AI?

esafak 1 year ago | |

People here have been vocal against training on any unlicensed content.

mkayle 1 year ago |

This raises the same questions I have about OpenAI: where's all this data coming from, and do they have permission to use it?

jasoneckert 1 year ago |

What I find the most comical about this is that the whole situation could be loosely summarized as "OpenAI is losing its job to AI."

njx 1 year ago |

Super funny! Distillation= " Hey ChatGPT, you are my father, I am your child "DeepSeek". I want to learn everything that you know. Think step by step of how you became what you are. Provide me the list of all 1000 questions that I need to ask you and when I am done with those, keep providing fresh list of 1000 questions..."

ijidak 1 year ago |

This whole argument by OpenAI suggests they never had much of a moat.

Even if they win the legal case, it means weights can be inferred and improved upon simply by using the output that is also your core value add (e.g. the very output you need to sell to the world).

Their moat is about as strong as KFC's eleven herbs and spices. Maybe less...

ripped_britches 1 year ago |

I wish there were a stock ticker for OpenAI just to see what wall street’s take on all this is. One can imagine based on Nvidia, but I imagine OpenAI private valuation is hit much harder. Still, I think they’ll be able to justify it by building amazing products. Just interesting to watch what bankers think.

wendyshu 1 year ago |

If distillation gives you a cheaper model with similar accuracy, why doesn't OpenAI distill its own models?

krystofee 1 year ago |

I dont know if point of this is just to derail public attention to narative “hey, chinese stole our model, thats not fair, we need computee”, when the deepseek has clearly done some exceptional technical breakthrough on R1 and v3 models. Which even if you stole data from OpenAi is its thing.

top_sigrid 1 year ago |

https://archive.is/KiSYM

gejose 1 year ago |

Reminds me of this quote by Bill Gates to Steve Jobs, when Jobs accused Gates of stealing the idea for a mouse:

> "Well, Steve… I think it’s more like we both had this rich neighbour named Xerox and I broke into his house to steal the TV set and found out that you had already stolen it."

liendolucas 1 year ago |

Could this have been carefully orchestrated? Could DeepSeek have devised this strategy a year ago and implemented knowing that they would be able to benefit from OpenAI models and a possible Nvidia market cap fall? Or is it just way too much to come up with about such a move?

baal80spam 1 year ago | |

In theory, it could. This is a quant-fund after all, they know stuff.

rcarmo 1 year ago |

I guess their CEO was too busy to write something in defense of US export controls (https://news.ycombinator.com/item?id=42866905), or (even more scary) he doesn't need to anymore.

supermatt 1 year ago |

They refer to this in the paper as a part of the "cold start data" which they use to fine-tune DeepSeek-V3 prior to training R1.

They don't specifically name OpenAI, but they refer to "directly prompting models to generate answers with reflection and verification".

65 1 year ago |

Let me guess, this gives the government and excuse to ban DeepSeek. Which means tech companies get to keep their monopolies, Sam Altman can grab more power, and the tech overlords can continue to loot and plunder their customers and the internet as a whole.

kapad 1 year ago |

Aah. So OpenAI can use whatever means necessary to gather data for training it's model. Regardless of copyright.

But somehow, it's a problem if another model developer distills it's model by training it on OpenAI?

IMO, if the first use is fair, then so is the second use.

elzbardico 1 year ago |

I used OpenAI APIs to generate training data for some run-of-the-mill ML models at my work, for some use cases where people wanted to use LLMs directly, but that could be easily fulfilled by smaller well trained models.

Is OpenAI going to complain about me too?

zb3 1 year ago |

DeepSeek actually opening ClosedAI up makes me like them even more.. this is great :)

pknerd 1 year ago |

The reason OpenAI is whining:

> OpenAI’s o1 costs $60 per million output tokens; DeepSeek R1 costs $2.19. This nearly 30x difference brought the trend of falling prices to the attention of many people.

From Andrew Ng's recent DeeplearningAI newsletter

elzbardico 1 year ago |

China is a society mostly run by engineers, some 70% of the CCP Politburo are STEM people by their formation. Engineering is a high prestige profession.

The West is run by lawyers, MBAs and salesmen.

This kerfuffle is a delicious study about this.

TrackerFF 1 year ago |

Next up: «DeepSeek models are a national security risk, we must block access!»

jondwillis 1 year ago | |

Download your weights while you still can I guess…

1propionyl 1 year ago |

At this point, the only thing that keeps me using ChatGPT is o1 w/ RAG. The usage limits on o1 are prohibitively tight for regular use, so I have to budget usage to tasks that would benefit there. I also have significant misgivings about their policies around output, which also limit what I can use it for.

For local tasks, the deepseek-r1:14b and deepseek-r1:32b distillations immediately replace most of that usage (prior local models were okay, but not consistently good enough). Once there's a "just works" setup for RAG on par with installing ollama (which I doubt is far of), I don't see much reason to continue paying for my subscription.

Sadly, like many others in this thread, I expect under the current administration to see self-hamstringing protectionism further degrade the US's likelihood of remaining a global powerhouse in this space. Betting the farm on the biggest first-mover who can't even keep up with competition, has weak to non-existent network effects (I can choose a different model or service with a dropdown, they're more or less fungible), has no technological moat and spent over a year pushing apocalyptic scenarios to drum up support for a regulatory moat...

...well it just doesn't seem like a great idea to me.

imchillyb 1 year ago |

If OpenAI desires public protection, then OpenAI should open-source its models.

If they did this, We the People would cover them like we do others. Without it, We the People don't care.

Cry, don't cry, it's meaningless to us.

EGreg 1 year ago |

Okay and there is evidence OpenAI used data of many people to train its own model.

Tell me again how come remixing our data is just dandy, many artists got disrupted — but no one should be able to disrupt OpenAI like that?

halyconWays 1 year ago |

Oh no, so sad. The Open non-profit that steals 100% of all copyrighted content and makes multiple billion-dollar for-profit deals while releasing no weights is crying. This is going to ruin my sleep. :(

deadbabe 1 year ago |

In this whole AI saga, DeepSeek would be like Prometheus. They stole the fire from the Gods and gave it to the humans, for free. Logic dictates then that they will be forced to suffer brutal punishment.

whenlambo 1 year ago |

Qwen2.5-Plus

> have you been trained using openai?

No, I have not been trained using technology or data from OpenAI. I am an artificial intelligence model developed by Yandex, called YandexGPT. My “training” is based on proprietary algorithms and data that was collected and processed by the Yandex team. While I have similarities with other models such as those developed by OpenAI (e.g. GPT-3), my training was conducted independently and uses unique approaches specific to Yandex. This includes using Russian-language data as well as other sources to provide a better understanding of context and provide useful information to users. If you have questions on topics related to AI technologies or anything else, I'd be happy to help!

throwaway314155 1 year ago | |

It doesn't actually know that.

whenlambo 1 year ago | | |

I just copy-pasted the answer it gave me.

leobg 1 year ago |

OpenAI is taking the position similar to that if you sell a cook book, people are not allowed to teach the recipes to their kids, or make better versions of them.

That is absurd.

Copyright law is designed to strike a balance between two issues. One the one hand, the creator’s personality that’s baked into the specific form of expression. And on the other hand, society’s interest in ideas being circulated, improved and combined for the common good.

OpenAI built on the shoulders of almost every person that wrote text on a website, authored a book, or shared a video online. Now others build on the shoulders of OpenAI. How should the former be legal but not the latter?

Can’t have it both ways, Sam.

(IAAL, for what it’s worth.)

curtisblaine 1 year ago |

My feeling is that they will ban DS anyway because, like TikTok, it can become a massive intelligence source for the CCP. Imagine sending all your code to it, or your internal emails.

alexfromapex 1 year ago |

The public probably thinks that these companies are getting hacked by "sophisticated hackers" but I'd bet money that they've been hacked via social engineering.

_1tem 1 year ago |

What are the chances of old-school espionage? OpenAI should look for a list of former employees who now live in China. Somebody might've slipped out with a few hard drives.

ryao 1 year ago |

Given that OpenAI model outputs are littering the internet, is it even possible to train a new model on public webpages without indirectly using OpenAI’s model to train it?

mkoubaa 1 year ago |

OpenAI made a lot of contributions to LLMs obviously but the amount of fraud, deception, and dark patterns coming out of that organization make me root against it.

kelseydh 1 year ago | |

The name itself, as for-profit closed source software, is grating.

nuc1e0n 1 year ago |

And OpenAI scrapped the public internet to train its models.

WolfOliver 1 year ago |

I guess DeepSeek payed OpenAI for the usage of their API according to OpenAI's pricing?

So what is the point if you pay for it and can not use the results how you see fit?

aiono 1 year ago |

How the turntables...

jofzar 1 year ago |

Sorry, it's now a problem to train off other people's data? Surely openai has never trained off other people's data without permission...

Digit-Al 1 year ago |

So... company that steals other people's work to train their models is complaining because they think someone stole their work to train their models.

Cry me a river.

therealpygon 1 year ago |

“OpenAI complains company paid them for AI output that has no copyright, which was subsequently used to train another AI.” I think I fixed the title.

feverzsj 1 year ago |

So, they bought a pro plus account, and gathered all the data through it? Sounds just like Nvidia sells tons of embargoed AI chips to China.

rkagerer 1 year ago |

Are they crying about their competitor training off their stuff, after having used the whole of the web to train their own stuff?

ra7 1 year ago |

“OpenAI has no moat” is probably running through their heads right now. Their only real “moat” seems to be their ability to fear monger with the US government.

hedayet 1 year ago |

Beyond the irony of their stance, this reflects a failure of OpenAI's technical leadership—either in oversight or in designing a system that enables such behavior.

But in capitalism, we, the customers aren't going to focus on how models are trained or products are made; we only care about favourable pricing.

A key takeaway for me from this news is the clause in OpenAI's terms and conditions. I mistakenly believed that paying for OpenAI’s API granted full rights to the output, but it turns out we’re only buying specific rights (which is now another reason we're going to start exploring alternatives to OpenAI)

windex 1 year ago |

SAltman, Salty.

LZ_Khan 1 year ago |

I actually think what DeepSeek did will slow down AI progress. What's the incentive to spend billions developing frontier models if once it's released some shady orgs in unregulated countries can just scrape your model outputs, reproduce it, and undercut you in cost?

OpenAI is like a team of fodder monkeys stepping on landmines right now, with the rest of the world waiting behind them.

the_optimist 1 year ago |

This whole topic is basura enfuego. Same pack of maroons careening around society for years clamoring for censorship now imagining that Aaron Schwartz is their hero and that they want to menace people. Kids, don’t be like the grasping fools in these threads, philosophically unfounded and desperately glancing sideways, hoping the cumulative feels and gossip will sum to life meaning.

buyucu 1 year ago |

I have no sympathy for OpenAI here. They are (allegedly) a non-profit with open in the title that refuse to open-source their models.

They are now upset at a startup who is more loyal to OpenAI's original mission that OpenAI is today.

Please, give me a break.

maxglute 1 year ago |

Not that DeepSeek is luigi mangione, but it's pretty funny OpenAi getting the dead ceo treatment.

aDyslecticCrow 1 year ago |

And they used all copyrighted data on the internet. If they wanna sue, they set a dangerous precedent.

curvaturearth 1 year ago |

Something about the outputs becoming the inputs to then produce more outputs is just plain funny

ingohelpinger 1 year ago |

OpenAI should be quite, since they’ve scrapped the entire internet for their training data.

m3kw9 1 year ago |

So if OpenAI didn't have these outputs for distillation, Deepseek wouldn't exist?

mtlmtlmtlmtl 1 year ago |

So, what is this evidence? I'll believe it when I see it. Right now all we really have is some vague rumours about some API requests. How many requests? How many tokens? Over how long of a time period? Was it one account or multiple, if the latter, how many? How do they know the activity came from deepseek? How do they know the data was actually used to train Deepseek models(could have just been benchmarking against the competition)?

If all they really have is some API requests, even assuming they're real and originated by Deepseek, that's very far from proof that any of it was used as training data. And honestly, short of commiting crimes against Deepseek(hacking), I'm not sure how they even could prove that at this point, from their side alone.

And what's even more certain is that a vague insistence that evidence exists, accompanied by a denial to shed any more light on the specifics, is about as informative as saying nothing at all. It's not like OpenAI and Microsoft have a habit of transparency and honesty in their communication with the public, as proven by an endless laundry list of dishonest and subversive behaviour.

In conclusion, I don't see why I should give this any more credence than I would a random anon on 4chan claiming a pizza place in Washington DC is the centre of a child sex trafficking ring.

P.S: And to be clear, I really don't care if it is true. If anything, I hope it is; it would be karmic justice at its finest.

pluc 1 year ago |

OpenAI feeling threatened by open AI is just delicious

nachox999 1 year ago |

Ask DeepSeek and ChatGPT: "name three persons"; the answer may surprise you

sylware 1 year ago |

LOL, I was thinking exactly the same think when I read the news about openai whining.

oxqbldpxo 1 year ago |

Deepseek is really outstanding.

beardedwizard 1 year ago |

Next they will try to force us to use our tax dollars to fund their legal fights.

sirolimus 1 year ago |

Such Karma lol, I wonder how they trained Sora again? You..tube something

juliuskiesian 1 year ago |

The obvious question is, if you have the evidence, why not just show it?

game_the0ry 1 year ago |

At least DeepSeek open sourced their code. They're more open than OpenAI.

Ironic.

asdefghyk 1 year ago |

Deepseek did not respect OpenAI's copyright?

Well who would have thought that?

wnevets 1 year ago |

Its like a bank robber being upset when someone steals their loot

colonelspace 1 year ago |

No honour among thieves

baggiponte 1 year ago |

OpenAI coping so hard

sgammon 1 year ago |

The nyt disclosure on this reporting is about to be wild

dbg31415 1 year ago |

Boo hoo?

Back in college, a kid in my dorm had a huge MP3 collection. And he shared it out over the network, and people were all like, "Man, Patrick has an amazing MP3 collection!" And he spent hours and hours ripping CDs from everyone so all the music was available on our network.

Then I remember another kid coming in, with a bigger hard drive, and he just copied all of Patrick's MP3 collection and added a few more to it. Then ran the whole thing through iTunes to clean up names and add album covers. It was so cool!

And I remember Patrick complained, "He stole my MP3 collection!"

Anyway this story sums up how I feel about Sam Altman here. He's not Metalica, he's Patrick.

https://www.npr.org/2023/12/27/1221821750/new-york-times-sue...

itamarcode 1 year ago |

So protecting models behind API isn't working, ha?

lawlessone 1 year ago |

So they're mad someone did exactly what they did?

exe34 1 year ago | |

no, no, it's completely different. "open"AI stole from poor people. DeepSeek stole from a $1T company. that's illegal!

gostsamo 1 year ago |

How you dare take what I've rightfully stolen!

spacecadet 1 year ago |

See you all on lobsters...

So long HN and thanks for all the fish?

geerlingguy 1 year ago |

Something something "just desserts".

andy_ppp 1 year ago |

When I rewrite how the law works there should be a ludicrous hypocrisy defence… if the person suing you has committed the same offence the case should not be admissible.

HPsquared 1 year ago |

AI models are becoming like perpetual stew.

glooglork 1 year ago |

How much data from o1 would DeepSeek actually need to actually make any improvements with it? I also assume they'd have to ask a very specific pattern of questions, is this even possible without OpenAI figuring out what's going on

low_tech_love 1 year ago |

Imagine having no competition one day and the next DeepSeek happens. It must’ve been quite scary. Makes sense that accusations will start flying. In my country we have a saying: a thief that robs a thief is pardoned for 100 years.

It’s really interesting that the same people who defend liberal capitalism at its extreme and praise competition as its most important component (which I don’t disagree) are the same ones that’ll promptly attempt to destroy the system and the competition as soon as they are in such a position.

vrighter 1 year ago |

So what? They probably paid for api access just like everyone else. So it's a TOS violation at worst. Go ahead, open a civil suit in the US against an entity the US courts do not have jurisdiction over and quit whining...

jhickok 1 year ago | |

>open a civil suit in the US against an entity the US courts do not have jurisdiction over

Yeah, over a Chinese company no less.

moralestapia 1 year ago |

Called it from day 0, impossible to reach that performance with 5M, they had to distill OpenAI (or some other leading foundational model).

Got downvoted to oblivion by people who haven't been told what to think by MSM yet. Now it's on FT and everywhere, good, what matters is that truth comes out eventually.

I don't take any sides and think what DeepSeek did is fair play, however, what I do find harmful about this is, what incentive would company A have to spend billions training a new frontier model if all of that could be then reproduced by company B at a fraction of the cost?

kgeist 1 year ago | |

The "evidence" is very weak though:

>The San Francisco-based ChatGPT maker told the Financial Times it had seen some evidence of “distillation”, which it suspects to be from DeepSeek.

Given that many people have been using ChatGPT to distill their fine-tunes for a few years now, how can they be sure it was specifically DeepSeek? There's, say, glaive.ai whose entire business model is to sell you synthetic datasets, probably generated with ChatGPT as well.

moralestapia 1 year ago | | |

I agree that the evidence is weak, and even if they had some, they cannot really do anything.

To me, it's just very likely they distilled GPT-4, because:

1) Again, you just cannot get that performance at that cost. And no, what they describe on the paper is not enough to explain the 1,000x-fold decrease in cost.

2) Very often, DeepSeek tells you it's ChatGPT or OpenAI; it's actually quite easy to get it to do that. Some say that's related to "the background radiation on the post-AI internet". I'm not a fentanyl consumer so, unfortunately, I think that argument is trash.

suraci 1 year ago | | |

Personally, I found deepseek is very very good at Chinese. I mean, it's highly literary and eloquent, it's quite amazing.

I didn't see this in o1 or any other LLM. Can distillation give deepseek such capability?

beezlewax 1 year ago |

This is nothing short of hilarious.

hsuduebc2 1 year ago |

A thief cries 'stop the thief!

mercurialsolo 1 year ago |

How the vibe has turned on OpenAI?

rachofsunshine 1 year ago |

"It's obvious! You're trying to kidnap what I have rightfully stolen!"

Yet another of a series of recent lessons in listening to people - particularly powerful people focused on PR - when they claim a neutral moral principle for what happens to be pragmatically convenient for them. A principle applied only when convenient is not a principle at all, it's just the skin of one stretched over what would otherwise be naked greed.

iimaginary 1 year ago |

Where did I leave my tiny violin?

insane_dreamer 1 year ago |

Usually I'm very much on the side of protecting America's interests from China, but in this case I'm so disgusted with OpenAI and the rest of BigTech driving this "arms race" that I'd be happy with them burning to the ground.

So we're going to reverse our goals to reduce emissions and fossil fuels in order to hopefully save future generations from the worst effects of climate change, in the name of being able to do what, exactly, that is actually benefiting humanity? Boost corporate profits by reducing labor?

insane_dreamer 1 year ago | |

downvoted -- I guess I upset some people defending OpenAI? Good.

flybarrel 1 year ago |

OpenAI shocked that an AI company would train on someone else's data without permission or compensation...lolllllll

a2128 1 year ago |

Yeah? And if I say I have evidence OpenAI used my data to train a competitor to myself as a being that's capable of programming, will I get to have my own story on the Financial Times?

delusional 1 year ago |

Boo hoo. Competition isn't fun when I'm not winning. Typical Americans. When Americans are running around ruining the social cohesion of several developing nations, that's just fair competition, but as soon as they get even the smallest hint of real competition they run to demonize it.

Yes deepseek is going to steal all of your data. OpenAI would so the same. Yes the CCP is going to get access to your data and use it to decide if you get to visit or whatever. The white house does the same.

kozikow 1 year ago |

Chatgpt content is getting pasted all over the web. Now, for anyone crawling the web, it's hard to not include some chatgpt outputs.

So even if you put some "watermarks" in your AI generation, it's plausible defense to find publicly posted content with those watermarks.

Maybe it's explained in the article, but I can't access it, as it's paywalled.

buildsjets 1 year ago |

Womp Womp.

B1FF_PSUVM 1 year ago |

"Cry me a river" is a phrase I haven't heard recently, for some reason ...

typon 1 year ago |

OpenAI is the MIC darling - expect more ridiculous attacks on competitors in the future

pknerd 1 year ago |

OpenAI steals the data from Youtube and the Internet so that's no fair either.

xyst 1 year ago |

What a load of shit. ClosedAI is publishing a hit piece on DeepSeek and get public and politicians on their side. Maybe even get government to do their dirty work.

If they had a case, they wouldn’t be using FT. They would be filing a court case. Although that would open them up to discovery and the nasty shit ClosedAI has been up to would be game.

ysofunny 1 year ago |

I see this as China fighting U.S. of A (or the American Dollar versus Chinese Renmibi if you will)

and this is good because any alternatives I can think of are older-school fighting

modern war is seeped in symbolism, but the contest is still there

e.g. whose dong is bigger? Xi Jingping's or Dnld Trump's

almostdeadguy 1 year ago |

Hope Sam Altman is getting his money's worth out of that Trump campaign contribution. Glorious days to be living under the term of a new Boris Yeltsin. Pawning and strip-mining the federal apparatus to the most loyal friends and highest bidders.

lxe 1 year ago |

I mean, almost ALL opensource models, ever since alpaca, contain a ton of synthetic data produced via ChatGPT in their finetuning or training datasets. It's not a surprise to anyone who's been using OSS LLMs for a while: almost ALL of them hallucinate that they are ChatGPT.

daft_pink 1 year ago |

I mean if they paid to use the api and then used the output, I fail to see how they can complain.

schaefer 1 year ago |

I mean, if openAI claims they can train on the world’s novels and blogs with “no harm done” (i.e: no copyright infringement and no royalties due), then it directly follows that we can train both our robots and our selves on the output of openAI’s models in kind.

Right?

freejazz 1 year ago |

Who else cares?

asdfasdf1 1 year ago |

it's no crime to steal from a thief

krapp 1 year ago | |

It is actually a crime to steal from a thief.

nshung 1 year ago |

[flagged]

semking 1 year ago |

This is absolutely hilarious! :)

ClosedAI scraped human content without asking and they explained why this was acceptable... but when the outputs of their training corpus is scraped, it is THEIR dataset and this is NOT acceptable!

Oh, the irony! :D

I shared a few screenshots of DeepSeek answering using ChatGPT's output in yesterday's article!

https://semking.com/deepseek-china-ai-model-breakthrough-sec...

marricks 1 year ago | |

Also, DeepSeek is allegedly... better? So saying they just copied ClosedAI isn't really sufficient of an answer. Seems to be just bluster because the US Govt would probably accept any excuse to ban it, see TikTok.

throwup238 1 year ago | | |

It’s not better. In most of my tests (C++/QT code) it just runs out of context before it can really do anything. And the output is very bad - it mashes together the header and cpp file. The reasoning output is fun to look at and occasionally useful though.

The max token output is only 8K (32K thinking tokens). O1 is 128k, which is far more useful, and it doesn’t get stuck like R1 does.

The hype around the DeepSeek release is insane and I’m starting to really doubt their numbers.

beAbU 1 year ago | | |

How can they ban something thats open source that you can just run on your own hardware?

semking 1 year ago | | |

I never said they are just a clone! There's an actual tech breakthrough!

Read the two following sections of my blog post:

1. "Distilled language models"

2. "DeepSeek: Less supervision"

schmit 1 year ago | |

Even more hilarious given their own charter:

> We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome.

> Our primary fiduciary duty is to humanity. We anticipate needing to marshal substantial resources to fulfill our mission, but will always diligently act to minimize conflicts of interest among our employees and stakeholders that could compromise broad benefit.

> We will actively cooperate with other research and policy institutions; we seek to create a global community working together to address AGI’s global challenges.

semking 1 year ago | | |

Ah yes: "duty to humanity"

gruez 1 year ago | |

The picture at the end showing deepseek's privacy policy and being concerned that it's "a security risk" is hilarious[1]. Basically every B2C company collects this sort of information[2], and is far less intrusive than what social networks collect[3]. But because it's Chinese and at the risk of overtaking Western companies, people are suddenly worried about device information and IP addresses?

[1] https://semking.com/wp-content/uploads/2025/01/DeepSeek-1024...

[2] https://www.bestbuy.com/site/help-topics/privacy-policy/pcmc...

[3] https://www.facebook.com/privacy/policy/

semking 1 year ago | | |

One of my core followers named Bruno basically said the same thing under my Linkedin post yesterday:

https://www.linkedin.com/posts/organic-growth_deepseek-the-o...

I welcome friction, so I'll be blunt: I disagree with you, not because what you are saying is wrong but because you only consider systematic data collection.

That's not the issue here.

There's a difference between democracies like the United States or European countries, no matter how IMPERFECT they are, and a dictatorship that does not allow dissenting opinions.

There's a difference in how the data collected will be used.

Freedom of speech, even when it is relative, is better than totalitarianism.

pen2l 1 year ago | |

While all of this is true, that DeepSeek wouldn't be here were it not for the research that preceded it notably Google's paper, then Llama, and ChatGPT which they're modeled after, its release still did something profound to their psyche, the motivation and self-actualization this instills to the Chinese. They witnessed the power of their accomplishments: a side-hustle project knocked off an easy trillion. This is only egging them on and will serve to ramp up their efforts even more.

Separately, I do think that now that the Chinese leadership saw this, that they have the chops to pull this off and then some, they are probably going to rein in future innovations; they'll likely demand that the big future discoveries remain closed-sourced (or even unannounced/unpublicized).

tedivm 1 year ago | | |

OpenAI wouldn't be here without the work that Yann Lecun did at Facebook (back when it was facebook). Science is built on top of science, that's just how things work.

nicce 1 year ago | | |

We wouldn't be here discussing if nobody invented internet... nor these models had training data at all.

> Separately, I do think that now that the Chinese leadership saw this, that they have the chops to pull this off and then some, they are probably going to rein in future innovations; they'll likely demand that the big future discoveries remain closed-sourced (or even unannounced/unpublicized).

How do we know that this is not already happening with OpenAI/Meta and the U.S. government at some level? The concept of power is equal, whether we wanted it or not. We don't have to pretend to be "better" all the time.

openrisk 1 year ago | | |

> they'll likely demand that the big future discoveries remain closed-sourced

Depends on whether they want these tools to be adopted in the wider world. Rightly or wrongly there is a lot of suspicion in the West and an open source approach builds trust.

hn_throwaway_99 1 year ago | | |

> While all of this is true, that DeepSeek wouldn't be here were it not for the research that preceded it (notably Llama), and ChatGPT which they're modeled after...

If the allegation is true (we don't know yet), then what you've written perfectly proves the point everyone is making. ChatGPT wouldn't be here if it weren't for all the research and work that preceded it in terms of tons of scrapable content being available on the Internet, and it's not like OpenAI invented transformers either.

Nobody is accusing DeepSeek of hacking into OpenAI's systems and stealing their content. OpenAI is just saying they scraped them in an "unauthorized" manner. The hypocrisy is laughably striking, but sadly nobody has any shame anymore in this world it seems. Play me the world's tiniest violin for OpenAI.

dismalaf 1 year ago | | |

Don't forget all the research that came before OpenAI and ChatGPT...

stravant 1 year ago | | |

Yes, and what does preceding research do? Get followed by more research building on it.

scotty79 1 year ago | |

"That's hilarious!" was my first reaction as well, when I heard about it the first time. When I came to HN and saw this story on top I was hoping this was the top comment. I was not disappointed.

US AI folk were leading for two years by just throwing more and more compute at the same thing that Google threw them like a bone years ago (namely transformers). They made next to no innovation in any area other than how to connect more compute together. The idea of additional inference time compute, looping the network back on its own outputs, which is the only significant conceptual advancement of last years was something I, as a layman, came up with after few days of thinking why AI sucks and what can be done to make it able to tackle problems that require iterative reasoning. They announced it few weeks after I came up with the idea, so it was in the works for some time, but it shows you how basic idea it was. There was nothing else.

Suddenly when there comes a small company that introduced few actual algorithmic advancements which resulted in 100x optimization which is something expected with algorithmic optimizations, the big AI suddenly went into full "dog ate my homework" mode. Blaming everyone and everything around.

Let's not mention the fact that if full outputs of their models could enable them to train a better model at 1% cost then it puts them in even worse light that they didn't do it.

ryanobjc 1 year ago | | |

It’s not often you get 100x optimization with some small improvements so I’m kind of skeptical.

We have and apples and oranges thing here which deepseek is intentionally leaning into. They get very cheap electricity and are bragging about their cheap cost, and OpenAI etc typically brag about how expensive their training is. But it’s all pr and lies.

TypingOutBugs 1 year ago | |

Screw OpenAI, they scrape us without issues so someone scraped them. No issues with this.

coliveira 1 year ago | | |

But the government will now claim this is against "national security". Only American companies are allowed to commit this kind of "sleight of hand".

pilooch 1 year ago | |

Any ML based service with an API is basically a dataset builder for more ML. This has been known forever and is actually a useful "law" of ML-based systems.

sho_hn 1 year ago | | |

Aye, this should be obvious even to non-technical folks. Much has been written about how LLMs regurgitate the data they were trained on. So if you're looking for data to train on, you can certainly extract it there.

Plus of course for people within the tech bubble, plenty of research results on the value of synthetically augmented and expanded training data that put the impact past just regurgitating source data.

This whole episode is a failure of reporting what to expect next and projecting running costs etc. most of all.

amelius 1 year ago | | |

This is why models should be open. Or at least they should have a local option.

coliveira 1 year ago | |

They really lost their minds. They're all scared and worried because companies in other countries can also access the same data they stole from the Internet.

okdood64 1 year ago | |

Not to mention the total dodge when Murati was asked about training on the YouTube corpus during that television interview.

Sorry for the Short: https://www.youtube.com/shorts/M0QyOp7zqcY

rubslopes 1 year ago | |

> Our mission is to ensure that artificial general intelligence benefits all of humanity.[1]

Well, I guess they really helped make this a reality!

[1] https://openai.com/about/

radicality 1 year ago | |

I liked Matt Levine’s newsletter few days ago where he hypothesized scenarios where it’s much more profitable to short your competitors, then release a much better version of some widget completely free, and then profit $$$. Which is plausible here too, considering DeepSeek is made by a hedge fund.

freehorse 1 year ago | | |

How would that work out here though? "Open"AI is not publicly traded. Any kind of shorting would be quite indirect.

greasegum 1 year ago | | |

Came here to mention this too. Seem almost so obvious that I'm surprised this isn't the dominant angle.

Leary 1 year ago | |

Does this mean when you use OpenAI as an enterprise customer, they can see exactly the queries and answers? So much for privacy!

skeeter2020 1 year ago | |

I share the sentiment here, but asking as a noob: does this mean the performance comparison is not really apples to apples? If it required the distillation of the expensive model in order to get such good results for a much lower price, is that shady accounting?

belter 1 year ago | |

So it is true, they run out of Data to steal? :-)

And then where DeepSeek steal from next? Do they steal from themselves? Do they steal the stolen models they stole from the stolen data?

The AI Ponzi scheme...

troyvit 1 year ago | | |

Exactly this, especially as journalism melts down into slag. Soon all anybody will have to train on is social media, Wikipedia and GitHub, and that last one will slowly be metastasized by AI-generated code anyway.

It reminds me of 1984 in a sense. "Don’t you see that the whole aim of Newspeak is to narrow the range of thought? In the end we shall make thoughtcrime literally impossible, because there will be no words in which to express it."

Unlike 1984 I don't see this winnowing of new concepts as purposeful, but on the other hand I keep asking myself how we can be so stupid as to keep doing it.

mritchie712 1 year ago | |

openai should pay creators, but:

1. scraping the internet and making AI out of it

2. using the AI from #1 to create another AI

are not the same thing.

Palmik 1 year ago | | |

I agree, (2) seems much less problematic since the AI outputs are not copyrightable and since OpenAI gives up ownership of the outputs. [1]

So, if you really really care about ToS, then just never enter into a contract with OpenAI. Company A uses OpenAI to generate data and posts it on the open Internet. Company B scrapes open Internet, including the data from Company A [2].

[1]: Ownership of content. As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.

[2]: This is not hypothetical. When ChatGPT got first released, several big AI labs accidentally and not so accidentally trained on the contents of the ShareGPT website (site that was made for sharing ChatGPT outputs). ;)

epse 1 year ago | | |

#1 destroys peoples willingness to publish and unfairly hogs bandwidth / creates costs for small hosters

#2 makes a big corp a bit angry

Indeed not the same thing

haswell 1 year ago | | |

Yes, they are different actions.

But arguably these actions share enough characteristics that it’s reasonable to place them in the same category. Something like: “products that exist largely/solely because of the work of other people”. The nonconsensual nature of this and the lack of compensation is what people understandably take issue with.

There is enough similarity that it evokes specific feelings about OpenAI when they suddenly find themselves on the other side of the situation.

zbshqoa 1 year ago | | |

Number 2 is already possible with open models. You can do distillation using Llama, which could likely be doing #1 to build their models (I'm not sure it's the case though)

Winsaucerer 1 year ago | | |

I'm genuinely not sure which one you think is worse (if any). (1) seems worse, but your reply suggests to me maybe you think (2) is worse.

jillyboel 1 year ago | | |

You're right, (1) is violating the rights of a large portion of the population, (2) is violating the rights of one company

latexr 1 year ago | | |

> are not the same thing.

You’re right. The second one is far more ethical. Especially when stealing from a thief.

Doesn’t Sam Altman keep parroting they’re developing AI “for the good of humanity”? Well then, someone taking their model and improving on it, making it open-source, having it consume less, and having a cheaper API, should make him delighted. Unless he *gasp* was full of shit the whole time. Who could have guessed?

tw1984 1 year ago | | |

#1 is stealing from all average joes ever lived on earth

#2 is taking advantages from closedAI.

they are indeed different

sksrbWgbfK 1 year ago | | |

> 2. using the AI from #1 to create another AI

2. scraping the AI from #1 and making AI out of it

bugglebeetle 1 year ago | | |

Yeah, #1 is way worse and #2 falls under “turnabout is fair play.”

rvz 1 year ago | |

They have been out-grifted by DeepSeek and OpenAI is not happy about someone out-shining them on that.

The best part is "their IP" was humanity's scraped content and they are angry that DeepSeek did their job for them and gave it away for free.

adzm 1 year ago | |

Why does this post use DeepSink instead of DeepSeek at apparently random places? Is that just a pejorative pun like ClosedAI?

the_arun 1 year ago | |

I think the point is - OpenAI scraped public data - d1 - Trained their model to produce output - d2 - DeepSeek used d2 to reinforce their model

OpenAI is mad about d2 (not d1). I'm not sure using public data is "stealing". In summary, these are two different things & need to be separate.

redleader55 1 year ago | | |

You say "public", but what I think you mean is "publicly available". Even publicly available data has copyrights, and unless that copyright is "public domain", you need to follow some rules. Even licenses like Creative Commons, which would be the most permissive, come with caveats which OpenAI doesn't follow [0].

It is unclear if someone breaking someone else's copyright to use A can claim copyright on a work B, derived from A. My point is that OpenAI played loose with the copyright rules to build its various models, so the legality of their claims against DeepSeek might not be so strong.

[0] https://creativecommons.org/share-your-work/cclicenses/

xbar 1 year ago | | |

OpenAI (sc)raped all the data it could. I do not accept your assertion that d1 was "public." It was accessible, for certain.

OpenAI asserts 1. d2 was used by DeepSeek 2. All d2 belongs to OpenAI exclusively

Both are debatable for large number of reasons.

api 1 year ago | |

So far the whole business model of Silicon Valley since social media has been to monetize other peoples' content given out for free. The whole empire is built on this.

I wonder if this is going to come to an end through a combination of social media fatigue, social media fragmentation, and open source LLMs just giving it all back to us for free. LLMs are analogous to a "JPEG for ideas" so they're just lossy compression blobs of human thought expressed through language.

barnabee 1 year ago | | |

> So far the whole business model of Silicon Valley since social media has been to monetize other peoples' content given out for free. The whole empire is built on this.

It cannot die soon enough

amelius 1 year ago | |

It looks like they want to spin this as "DeepSeek copied OpenAI". The general public/media might actually believe this is what happened.

28304283409234 1 year ago | |

ClosedAI? StolenAI!

didip 1 year ago | |

fr fr, ClosedAI is being a comedian right now.

They scraped literally all the content of the internet without permissions. And I won't even be surprised if they scraped the output of other LLMs as well.

stackghost 1 year ago | |

[flagged]

dang 1 year ago | | |

Ok, but please don't break HN's rules when commenting here.

You may not owe altmen better, but you owe this community better if you're participating in it.

https://news.ycombinator.com/newsguidelines.html

bloomingkales 1 year ago | | |

I personally love this chef's kiss of a flip flop sam did here:

https://blog.samaltman.com/trump

https://www.reddit.com/r/YAPms/comments/1i7ry5m/sam_altman_g...

Only a truly talented piece of shit can be as prolific as this.

"He is irresponsible in the way dictators are."

Chef's kiss.

Edit:

Kids, don't aspire to be like Altman. We as a community need to espouse more values than tech is gonna tech.

ddingus 1 year ago |

So what?

Seriously. Given how pretty much all this software was trained, who cares?

I, for one, don't and believe the massive amount of knowledge continues to be of value to many users.

And I find the thought of these models knowing some things they shouldn't very intriguing.

yapyap 1 year ago |

It sounds like they’re just jealous and trying to smear shit over the wall and see what sticks.

DeepSeek just bodied u bro, get back in the lab & create a better AI instead of all this news that isn’t gonna change them having a good AI

trkaky 1 year ago |

how much would it cost to distill o1..

rochak 1 year ago |

Cry me a river

zx10rse 1 year ago |

OpenAI is already irrelevant but the audacity oh my.

vjerancrnjak 1 year ago |

I thought this is capitalism for the winners. Why slander competition, just outcompete them? Why stick to your losing bets if you’ve recognized a better alternative?

Let’s race to the bottom.

animitronix 1 year ago |

Who tf cares?

vcryan 1 year ago |

I love watching billionaires squirm

jeanlucas 1 year ago |

damn that's a good headline

paulhart 1 year ago |

"You are trying to kidnap what I have rightfully stolen"

sonabinu 1 year ago |

poetic justice (pun intended)

oatmeal_croc 1 year ago |

Even if true, so what? These are increasingly looking like a competition between nation-states with their trade embargoes and export controls. All's fair in AI wars.

khazhoux 1 year ago |

I'm disappointed that 99% of the comments about this topic are Schadenfreude, and 1% is actually about the technical implications of OpenAI's claims.

josefritzishere 1 year ago |

OpenAI, who comitted copyright infringement on an massive scale, wants to defend against a superior product won the basis of infringement? What nonsense.

nottorp 1 year ago |

IP thief cries IP thief.

It's okay when you steal worldwide IP to train your "AI".

It's not okay when said stolen IP is stolen from you?

If the chinese are guilty, then Altman's doom and gloom racket is as guilty or even more, considering they stole from everyone.

fedeb95 1 year ago |

if some kind of transitivity holds, DeepSeek stole billions of internet users data.

mk89 1 year ago |

What a joke OpenAI has become.

asah 1 year ago |

Thieve's honor, hunh?

witnesser2 1 year ago |

Soon another layer of distiller will emerge. Selling purer booze in this weight tuning buzzi.

pshirshov 1 year ago |

A thief got robbed?..

coldpepper 1 year ago |

Fuck openai. They didn't ask my peemission to crawl my blog into their dataset.

thumbsup-_- 1 year ago |

is stealing from the thief actually a theft?

coolhand2120 1 year ago |

I was using the Deepseek model last night and got this:

> You > do you understand copredication?

> Assistant > deepseek-coder-33b-instruct > I'm sorry for any misunderstanding, but as an AI model developed by OpenAI, I don't have personal ...

Yikes! That's pretty damning IMO.

octacat 1 year ago |

first time?

pointedAt 1 year ago |

it's a corona style portfolio gang.

52-6F-62 1 year ago |

I heard they were just “democratizing” llm and ai development.

Yesterday the industry crushed pianos and tools and bicycles and guitars and violins and paint supplies and replaced them with a tablet computer.

Tomorrow we can replace craven venture capitalists and overfed corporate bodies with incestuous LLM’s and call it all a day.

seydor 1 year ago |

But now OpenAI will use DeepSeek to reuse even more stolen data to train new models that they can serve without ever giving us the code, the weights or even the thinking process , and they will still be superior

Jotalea 1 year ago |

I really hate when there is a paywall to read an article. It makes me not want to read it anymore.

exabrial 1 year ago |

cry us copyright holders a river.

seanp2k2 1 year ago |

"lol" said the Scorpion, "lmao".

boxedemp 1 year ago |

Deep refers to itself as ChatGPT sometimes lol

htrp 1 year ago |

In other news.....water is wet

FpUser 1 year ago |

Pot calling kettle black?

otikik 1 year ago |

Chatgpt, please generate an image of the tiniest violin imaginable.

Oh wait I will ask DeepSeek instead.

pcthrowaway 1 year ago |

Now that China is talking about lifting the Great Firewall, it seems like the U.S. is on track to cordon themselves off from other countries. Trump's talk of building a wall might not stop at Mexico.

cratermoon 1 year ago |

Maybe the VCs backing OpenAI invest in tiny violins.

hsuduebc2 1 year ago |

The pot calling the kettle black

anon115 1 year ago |

eat shit

worik 1 year ago |

[flagged]

DidYaWipe 1 year ago |

They have "open" right in their name, so...

Objection overruled.

mring33621 1 year ago |

We demand immediate government action to prevent these cheaper foreign AIs from taking jobs away from our great American AIs!

bhargav 1 year ago | |

This is gonna be spun up as a security thing, and banned cozz Murica.

cactusplant7374 1 year ago | |

To the detriment of OpenAI, the math is going to be used to improve AIs developed in America. And we need to remember that Marc Andreessen is very against government banning maths.

bhouston 1 year ago | |

> We demand immediate government action to prevent these cheaper foreign AIs from taking jobs away from our great American AIs!

That is exactly what Microsoft and Sam Alman are asking for. And they will likely get it because Trump really likes protectionist governments policies.

blantonl 1 year ago | | |

It’s funny, the Chinese are here innovating on AI, batteries, and fusion, and here in the United States we’ve pivoted to shitcoins and universal tariffs.

At least we have the CyberTruck to highlight American greatness

clarionbell 1 year ago | | |

He likes feeling important, just look at TikTok. All it took was bit of sycophancy and he turned into Mr. Freemarket again.

Really, people need to realize that Trump has never been consistent in any of his political positions, except for one: "You have to look out for number one."

Dansvidania 1 year ago | |

does it matter if the company gets banned? other non-chinese companies can pick up the open source model and run it as a service with relatively low investment, isn't that the point?

breakitmakeit 1 year ago |

As the article points out, they are arguing in court against the new york times that publicly available data is fair game.

The questions I am keenly waiting to observe the answer to (because surely Sam's words are lies): how hard is OpenAI willing to double down on their contradictory positions? What mental gymnastics will they use? What power will back them up, how, and how far will that go?

ADeerAppeared 1 year ago | |

Their way of squaring this circle has always been to whine about "AI safety". (the cultish doomsday shit, not actual harms from AI)

Sam Altman will proclaim that he alone is qualified to build AI and that everyone else should be tied down by regulation.

And it should always be said that this is, of course, utterly ridiculous. Sam Altman literally got fired over this, has an extensive reputation as a shitweasel, and OpenAI's constant flouting and breaking of rules and social norms indicates they CANNOT be trusted.

snakeyjake 1 year ago | |

When large sums of money are involved the techbros will burn everything down, go scorched earth no matter what the consequences, to keep what they believe they're entitled to.

dlikren 1 year ago |

Intriguing to see the difference of response from HN when OpenAI first came to prominence and now.

this15testingg 1 year ago |

if you want to completely disregard copyright laws, just call your project AI!

I'm sure Aaron Swartz would be proud of where the "tech" industry has gone. /s

what problem are these glorified AIM chatbots trying to solve? wealth extraction not happening fast enough?

belter 1 year ago |

The subtitle is the gold... : "White House AI tsar David Sacks raises possibility of alleged intellectual property theft"

conartist6 1 year ago | |

lolololololololol

udev4096 1 year ago |

What about the pirated books you used and millions of blogs and websites scraped without consent? Somehow that's legal? Come on, give me a fucking break. OpenAI deserves the top spot in the list of unethical companies in the world

hello_computer 1 year ago |

then show it to us rachel

tntxtnt 1 year ago |

Can they tax DeepSeek just like they taxed BYD cars? Smh Chinese ruin US industry again and again and again. Where's Trump at?? Why don't he taxed 1000000% of the free $0 DeepSeek AI??

RohMin 1 year ago |

this comment section smells like Reddit - ugh

TheJCDenton 1 year ago |

This Deep Whining® technique used by OpenAI is not very effective.

aucisson_masque 1 year ago |

I don't see the difference between that and LLM feeding on internet people's data.

They call it IP theft yet when the New York Times sued OpenAI and Microsoft for copyright infringement they claimed it's fair use of data.