So what is the point if you pay for it and cannot use the results however you see fit?
Cry me a river.
But in capitalism, we, the customers, aren't going to focus on how models are trained or products are made; we only care about favourable pricing.
A key takeaway for me from this news is the clause in OpenAI's terms and conditions. I mistakenly believed that paying for OpenAI’s API granted full rights to the output, but it turns out we’re only buying specific rights (which is now another reason we're going to start exploring alternatives to OpenAI)
OpenAI is like a team of fodder monkeys stepping on landmines right now, with the rest of the world waiting behind them.
They are now upset at a startup that is more loyal to OpenAI's original mission than OpenAI is today.
Please, give me a break.
If all they really have is some API requests, even assuming they're real and originated by DeepSeek, that's very far from proof that any of it was used as training data. And honestly, short of committing crimes against DeepSeek (hacking), I'm not sure how they even could prove that at this point, from their side alone.
And what's even more certain is that a vague insistence that evidence exists, accompanied by a refusal to shed any more light on the specifics, is about as informative as saying nothing at all. It's not like OpenAI and Microsoft have a habit of transparency and honesty in their communication with the public, as proven by an endless laundry list of dishonest and subversive behaviour.
In conclusion, I don't see why I should give this any more credence than I would a random anon on 4chan claiming a pizza place in Washington DC is the centre of a child sex trafficking ring.
P.S: And to be clear, I really don't care if it is true. If anything, I hope it is; it would be karmic justice at its finest.
Ironic.
Well who would have thought that?
Back in college, a kid in my dorm had a huge MP3 collection. And he shared it out over the network, and people were all like, "Man, Patrick has an amazing MP3 collection!" And he spent hours and hours ripping CDs from everyone so all the music was available on our network.
Then I remember another kid coming in, with a bigger hard drive, and he just copied all of Patrick's MP3 collection and added a few more to it. Then ran the whole thing through iTunes to clean up names and add album covers. It was so cool!
And I remember Patrick complained, "He stole my MP3 collection!"
Anyway, this story sums up how I feel about Sam Altman here. He's not Metallica, he's Patrick.
https://www.npr.org/2023/12/27/1221821750/new-york-times-sue...
So long HN and thanks for all the fish?
It’s really interesting that the same people who defend liberal capitalism at its extreme and praise competition as its most important component (which I don’t disagree with) are the same ones who will promptly attempt to destroy the system and the competition as soon as they are in a position to do so.
Yeah, over a Chinese company no less.
Got downvoted to oblivion by people who haven't been told what to think by MSM yet. Now it's on FT and everywhere, good, what matters is that truth comes out eventually.
I don't take any sides and think what DeepSeek did is fair play. However, what I do find harmful about this is: what incentive would company A have to spend billions training a new frontier model if all of that could then be reproduced by company B at a fraction of the cost?
>The San Francisco-based ChatGPT maker told the Financial Times it had seen some evidence of “distillation”, which it suspects to be from DeepSeek.
Given that many people have been using ChatGPT to distill their fine-tunes for a few years now, how can they be sure it was specifically DeepSeek? There's, say, glaive.ai whose entire business model is to sell you synthetic datasets, probably generated with ChatGPT as well.
To me, it's just very likely they distilled GPT-4, because:
1) Again, you just cannot get that performance at that cost. And no, what they describe in the paper is not enough to explain the 1,000-fold decrease in cost.
2) Very often, DeepSeek tells you it's ChatGPT or OpenAI; it's actually quite easy to get it to do that. Some say that's related to "the background radiation on the post-AI internet". I'm not a fentanyl consumer so, unfortunately, I think that argument is trash.
I didn't see this in o1 or any other LLM. Can distillation give deepseek such capability?
Yet another of a series of recent lessons in listening to people - particularly powerful people focused on PR - when they claim a neutral moral principle for what happens to be pragmatically convenient for them. A principle applied only when convenient is not a principle at all, it's just the skin of one stretched over what would otherwise be naked greed.
So we're going to reverse our goals to reduce emissions and fossil fuels in order to hopefully save future generations from the worst effects of climate change, in the name of being able to do what, exactly, that is actually benefiting humanity? Boost corporate profits by reducing labor?
Yes, DeepSeek is going to steal all of your data. OpenAI would do the same. Yes, the CCP is going to get access to your data and use it to decide if you get to visit or whatever. The White House does the same.
So even if you put some "watermarks" in your AI generation, a plausible defense is that the content with those watermarks was found publicly posted.
Maybe it's explained in the article, but I can't access it, as it's paywalled.
If they had a case, they wouldn’t be using FT. They would be filing a court case. Although that would open them up to discovery, and the nasty shit ClosedAI has been up to would be fair game.
and this is good because any alternatives I can think of are older-school fighting
modern war is steeped in symbolism, but the contest is still there
e.g. whose dong is bigger? Xi Jinping's or Donald Trump's
Right?
> You > do you understand copredication?
> Assistant > deepseek-coder-33b-instruct > I'm sorry for any misunderstanding, but as an AI model developed by OpenAI, I don't have personal ...
Yikes! That's pretty damning IMO.
Yesterday the industry crushed pianos and tools and bicycles and guitars and violins and paint supplies and replaced them with a tablet computer.
Tomorrow we can replace craven venture capitalists and overfed corporate bodies with incestuous LLMs and call it all a day.
Oh wait I will ask DeepSeek instead.
Objection overruled.
Not to mention it'd defeat the whole purpose of a "free market" economy. (Not that that means much of anything anymore)
I don't think all of that is a bad thing (regulation tends to make it harder to do the first two things), but "free markets" are the economic equivalent to the "point mass" in physics: perhaps useful sometimes to create simple models and explanations of things, but will never exist in the real world.
Every LLM provider in the US will be using it to lower OpEx.
One of them is likely to pass those savings along to consumers to gain market share.
Facebook is in the business of providing weights for free.
The idea that we are all doomed unless we immediately migrate to DeepSeek is fantasy.
The US government has the wherewithal to drag Europe along with it, like they did with Huawei's 5G equipment.
Besides, it's kinda too late for this. The model is freely accessible, so any attempt at banning it would be _completely_ moot. If DeepSeek keeps releasing their future models for free, I don't see how a ban could ever be effective at all. Worst-case scenario, big tech can't use those models... but then individuals (and startups willing to go fast and break laws) will be able to use them and instantly get a leg up on the competition.
If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.
That is exactly what Microsoft and Sam Altman are asking for. And they will likely get it because Trump really likes protectionist government policies.
At least we have the CyberTruck to highlight American greatness
Really, people need to realize that Trump has never been consistent in any of his political positions, except for one: "You have to look out for number one."
The questions I am keenly waiting to observe the answer to (because surely Sam's words are lies): how hard is OpenAI willing to double down on their contradictory positions? What mental gymnastics will they use? What power will back them up, how, and how far will that go?
Sam Altman will proclaim that he alone is qualified to build AI and that everyone else should be tied down by regulation.
And it should always be said that this is, of course, utterly ridiculous. Sam Altman literally got fired over this, has an extensive reputation as a shitweasel, and OpenAI's constant flouting and breaking of rules and social norms indicates they CANNOT be trusted.
I'm sure Aaron Swartz would be proud of where the "tech" industry has gone. /s
what problem are these glorified AIM chatbots trying to solve? wealth extraction not happening fast enough?
They call it IP theft yet when the New York Times sued OpenAI and Microsoft for copyright infringement they claimed it's fair use of data.
But restricting the trade in microchips only because the USA is afraid it will lose a technical and commercial edge is a long, long way from a free market.
It is too late, too. China has broken out and they are ahead in many fields. Not trading chips with them will make them build their own foundries. In two decades they will be as far ahead there as they are in many other fields.
If the USA would trade then the technological capacities of China and the USA would stay matched, as they help each other. China ahead in some areas, the USA ahead in others.
That would still (probably) not be a pure Free Market but it would be a freer market, and better for everybody except a few elites (on both sides)
Madness is taking root
What a bunch of idiots. The propaganda keeps telling us that they don't invent, they can only copy etc., but clearly that's not true.
> He likes feeling important, just look at TikTok. All it took was bit of sycophancy and he turned into Mr. Freemarket again.
Not really. He said that TikTok has to shift towards US ownership if it wants to continue; he just gave them a 90-day extension to allow that change in ownership.
>Very often, DeepSeek tells you it's ChatGPT or OpenAI; it's actually quite easy to get it to do that. Some say that's related to "the background radiation on the post-AI internet". I'm not a fentanyl consumer so, unfortunately, I think that argument is trash.
The exact same thing happened with Llama. Sometimes it also claimed to be Google Assistant or Amazon Alexa.
That's tricky, you can optimize a model to do real well on synthetic benchmarks.
That said, DeepSeek performs a bit worse than GPT-4 in general and substantially wrong on benchmarks like ARC which is designed with this in mind.
Prompt: Find an English word that contains 4 'S' letters and 3 'T' letters.
Deepseek-R1: stethoscopists (correct, thought for 207 seconds)
ChatGPT-o1: substantialists (correct, thought for 188 seconds)
ChatGPT-4o: statistics (wrong) (even with "let's think step by step")
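For anyone who wants to check the three answers above without counting by hand, here is a minimal sketch. It assumes the prompt means exactly 4 S's and 3 T's (not "at least"), and the function name `has_letter_counts` is just something made up for illustration.

```python
from collections import Counter

def has_letter_counts(word, required):
    """Return True if `word` has exactly the required count of each letter."""
    counts = Counter(word.lower())
    return all(counts[letter] == n for letter, n in required.items())

# Assumption: "contains 4 'S' letters and 3 'T' letters" is read as exact counts.
REQUIRED = {"s": 4, "t": 3}

answers = ["stethoscopists", "substantialists", "statistics"]
results = {word: has_letter_counts(word, REQUIRED) for word in answers}

for word, ok in results.items():
    print(word, "correct" if ok else "wrong")
```

"statistics" fails because it only has three S's, which matches the verdicts listed above.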
In almost every example I provide, it's on par with o1 and better than 4o.
>substantially wrong on benchmarks like ARC which is designed with this in mind.
Wasn't it revealed OpenAI trained their model on that benchmark specifically? And had access to the entire dataset?
It is predicting the next token. Are we really taking it at its word and assuming the model knows what it is saying?
What's funny is that people have been saying this since OpenCL was announced, but today we're actually in a worse spot than we were 10 years ago. China too: their inability to replicate EUV advancements has left their lithography in a terrible place.
There's a reason China is literally dependent on Nvidia for competitive hardware. It's their window into export-controlled TSMC wafers and complex streaming multiprocessor designs. Losing access to Nvidia hardware isn't an option for them (or the United States for that matter) which is why the US pushes so hard for export controls. There is no alternative.
Really? I am intrigued. Why do you think that?
I have no idea what the build quality of BYD is, but doing better than Tesla isn't exactly a challenge.
Chinese phones, yes. But I'd argue we're past peak China. Huawei phones briefly were the #1 selling in the world, have since pulled back.
Also, compare V3 to 4o and R1 to o1, that's the right way.
Lol Taiwan is officially the "Republic of China", as per their own claims.
I see what you mean. But there is an alternative point of view.
The indigenous people of Taiwan are very different. DNA-wise, they are the prototypical Polynesians.
When you take both of these points and put them back to back, a natural answer seems to suggest itself which I'm not sure the authors intended to imply: R1 attempted to use o1 to make its answers seem more human, and as a result it accidentally picked up most of its reasoning capabilities in the process. Is my reading totally off?
Even if that were possible, it would be suspicious if you were to release an open model whose model architecture is identical to that of a closed one from a competitor.
If that is what happened, we'd know about it by now.
It simply doesn't matter. Their methodology works.
DeepSeek could simply admit, "Yep, oops, we did it," but argue that they only used the data to train Model X. So, if you want compensation, you can have all the revenue from Model X (which, conveniently, amounts to nothing).
Sure, they then used Model X to train Model Y, but would you really argue that the original copyright holders are entitled to all financial benefits derived from their work—especially when that benefit comes in the form of a model trained on their data without permission?
The Wolf made off to a safe distance, and then said in a much injured tone:
"You have no right to take my property like that!"
The Lion looked back, but as the Wolf was too far away to be taught a lesson without too much inconvenience, he said:
"Your property? Did you buy it, or did the Shepherd make you a gift of it? Pray tell me, how did you get it?"
What is evil won is evil lost.
I hope this is actually true and OpenAI loses its close to monopoly status. Having a for profit entity safeguarding a popular resource like this sounds miserable for everyone else.
At the moment AI looks like typical VC scheme: build something off someone else's work, sell it at cost at first, shove it down everyone's throats and when it's too late, hike the prices. I don't like that.
If one could effectively patent software inventions, this would kill many industries, from video games (that all have mechanics of other games in them) to computing in general (fast algorithms, etc). Let’s hope no one gets ideas like that…
Granted, it would be ineffective in competing against China’s tech industry. But less effective laws have been lobbied through in the past.
3K+ employees, $3B+ revenue, ... sure, not BigTech but hardly a minnow. A company that big can chew gum and walk at the same time.
I think it's also hilarious that, supposing they can do that, they will end up suppressing innovation within the US, and eager groups in China would just innovate without having to worry about this hostile landscape.
Also it’s pretty good confirmation that synthetic data is a valid answer to the data wall problem (non-problem).
Even if they win the legal case, it means weights can be inferred and improved upon simply by using the output that is also your core value add (e.g. the very output you need to sell to the world).
Their moat is about as strong as KFC's eleven herbs and spices. Maybe less...
If there's a better AI, old AI will lose the job first.
As somebody who got to work adjacent to some of these things for a long time, I've been wondering about this. Are LLMs and transformers actually better than these "old" models or is it more of an 80/20 thing where for a lot less work (on developers' behalf) LLMs can get 80% of the efficacy of these old models?
I ask because I worked for a company that had a related content engine back in 2008. It was a simple vector database with some bells and whistles. It didn't need a ton of compute, and GPUs certainly weren't what they are today, but it was pretty fast and worked pretty well too. Now it seems like you can get the same thing with a simple query but it takes a lot more coal to make it go. Is it better?
"OpenAI is losing its job to open AI."
OpenAI's $200 closed AI upended by a hedge fund's free side project.
Quant geeks outcompete overpaid Silicon Valley devs, etc.
Basically, hubris gets its comeuppance, which is a David vs. Goliath biblical archetype, which is why this drama grips all of us.
That said, I feel like "quant geeks" aren't quite underdogs compared to silicon valley devs. wdyt?
We aren’t talking about those indigenous people regarding this topic. We are talking about the Chinese people there. This would be clear and obvious if Confederates were in any of the examples I gave. All 3 examples have indigenous people who aren’t cared about now.
Americans were still actively cleansing Native Americans under 200 years ago. The only country that would do anything serious about an attempt at Chinese reunification would be America [and of course NATO and Europe but if America wasn’t doing anything, Europe wouldn’t either].
If it wasn’t for America, reunification would have already happened.
So the pov of Americans or the west caring about indigenous people is faulty from the pov of most of the rest of the world. The west should care about indigenous people in their own direct spheres of influence first.
Nonetheless, the fact that you can just change the prompt a bit to instruct the model to do what you want makes everything much faster.
Yes, the trade-off is that you need GPUs to make it run, but that's why we have cloud.
Considering fresh signs of rupture in transatlantic relations, maybe Orban will turn out to have had keen foresight. There seems to be some sort of realignment afoot under the Trump administration.
https://en.wikipedia.org/wiki/2024_visits_by_Viktor_Orb%C3%A...
https://en.wikipedia.org/wiki/China%E2%80%93Hungary_relation...
If running ChatGPT costs $1B/y and they make $3B/y on it, tacking on the cost of R&D on top doesn't seem very fair.
Is it? Time will tell, but it wasn't "different" even during the crypto craze when CUDA was literally printing money. We were promised world-changing ASICs just like with Cerebras and Groq, and ended up with nothing in the end.
When AI's popularity blows over (and it will, like crypto), will the market have responded fast enough? From where I'm standing, it looks like every major manufacturer (and even the Chinese market) is trying the ASIC route again. And having watched ASICs die a very painful and unsatisfying death in the mining pools, I'd like to avoid manufacturing purpose-made ICs that are obsolete within months. I'm simply not seeing the sort of large-scale strategy that threatens Nvidia's actual demand.
Crypto mining was just about hash rates, so I don't think it really mattered whether you used CUDA or not. Nvidia cards were just faster usually. People did use AMD too, but that didn't really involve building up the OpenCL ecosystem, just making it run one particular algo. They do also use ASICs for BTC in particular, I don't think that died.
> "Well, Steve… I think it’s more like we both had this rich neighbour named Xerox and I broke into his house to steal the TV set and found out that you had already stolen it."
They don't specifically name OpenAI, but they refer to "directly prompting models to generate answers with reflection and verification".
But somehow, it's a problem if another model developer distills its model by training it on OpenAI?
IMO, if the first use is fair, then so is the second use.
Is OpenAI going to complain about me too?
> OpenAI’s o1 costs $60 per million output tokens; DeepSeek R1 costs $2.19. This nearly 30x difference brought the trend of falling prices to the attention of many people.
From Andrew Ng's recent DeeplearningAI newsletter
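The "nearly 30x" in that quote is easy to sanity-check. A quick sketch using only the two per-million-token prices quoted above; the 5M-token workload is a made-up number purely for illustration.

```python
# Prices per million output tokens, as quoted from the newsletter above.
O1_PRICE_PER_M = 60.00
R1_PRICE_PER_M = 2.19

ratio = O1_PRICE_PER_M / R1_PRICE_PER_M
print(f"o1 is about {ratio:.1f}x the price of R1 per output token")

def output_cost(tokens, price_per_million):
    """Cost in dollars for a given number of output tokens."""
    return tokens / 1_000_000 * price_per_million

# Hypothetical workload: 5 million output tokens.
tokens = 5_000_000
o1_cost = output_cost(tokens, O1_PRICE_PER_M)
r1_cost = output_cost(tokens, R1_PRICE_PER_M)
print(f"o1: ${o1_cost:.2f}, R1: ${r1_cost:.2f}")
```

So the ratio works out closer to 27x than 30x, which is still consistent with "nearly 30x" as a rounded figure.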
The West is run by lawyers, MBAs and salesmen.
This kerfuffle is a delicious study about this.
For local tasks, the deepseek-r1:14b and deepseek-r1:32b distillations immediately replace most of that usage (prior local models were okay, but not consistently good enough). Once there's a "just works" setup for RAG on par with installing ollama (which I doubt is far off), I don't see much reason to continue paying for my subscription.
Sadly, like many others in this thread, I expect under the current administration to see self-hamstringing protectionism further degrade the US's likelihood of remaining a global powerhouse in this space. Betting the farm on the biggest first-mover who can't even keep up with competition, has weak to non-existent network effects (I can choose a different model or service with a dropdown, they're more or less fungible), has no technological moat and spent over a year pushing apocalyptic scenarios to drum up support for a regulatory moat...
...well it just doesn't seem like a great idea to me.
If they did this, We the People would cover them like we do others. Without it, We the People don't care.
Cry, don't cry, it's meaningless to us.
Tell me again how come remixing our data is just dandy, many artists got disrupted — but no one should be able to disrupt OpenAI like that?
> have you been trained using openai?
No, I have not been trained using technology or data from OpenAI. I am an artificial intelligence model developed by Yandex, called YandexGPT. My “training” is based on proprietary algorithms and data that was collected and processed by the Yandex team. While I have similarities with other models such as those developed by OpenAI (e.g. GPT-3), my training was conducted independently and uses unique approaches specific to Yandex. This includes using Russian-language data as well as other sources to provide a better understanding of context and provide useful information to users. If you have questions on topics related to AI technologies or anything else, I'd be happy to help!
That is absurd.
Copyright law is designed to strike a balance between two issues. On the one hand, the creator’s personality that’s baked into the specific form of expression. And on the other hand, society’s interest in ideas being circulated, improved and combined for the common good.
OpenAI built on the shoulders of almost every person that wrote text on a website, authored a book, or shared a video online. Now others build on the shoulders of OpenAI. How should the former be legal but not the latter?
Can’t have it both ways, Sam.
(IAAL, for what it’s worth.)
https://youtu.be/NUhrF0xkhhc?si=1WHWYZrhRmfOYO_y&t=1150 (it's about 2 minutes)
OpenAI's future investments -- billions -- were just threatened to be undercut by several orders of magnitude by a competitor. It's in their best interests to cast doubt on that competitor's achievements. If they can do so by implying that OpenAI are in fact the source of most of DeepSeek's performance then all the better.
It doesn't matter whether there's a compelling legal argument around copyright, or even if it's true that they actually copied. It just needs to be plausible enough that OpenAI can make a reasonable case for continuing investment at the levels it's historically attained.
And plausibility is something they've handily achieved with this announcement -- the sentiment on HN at least is that it is indeed plausible that DeepSeek trained on OpenAI. Which means there's now doubt that a DeepSeek-level model could be trained without making use of OpenAI's substantial levels of investment. Which is the only thing that OpenAI should be caring about.
It is, but the second-order logic says that if they are trying to cast doubt, it means they've got nothing better to offer and casting doubt is the only move they have.
If I were an investor in OpenAI, this would be very scary, as it simply means I've overvalued it.
Sure, OpenAI invested billions banking on the livelihood of everyday people being replaced, or as Sam says, “a renegotiation of the social contract.”
So, as an engineer who is being targeted by Meta and Salesforce under the “not hiring engineers” plan, all I have to say to OpenAI is “welcome to the social contract renegotiation table.”
Indeed, when the alleged infringer is outside US jurisdiction and not violating any local laws in the country where it's domiciled.
The fact that Microsoft cannot even get this app removed from "app stores" tells us all we need to know.
It will be OpenAI and others who will be copying DeepSeek.
Some of us would _love_ to see Microsoft try to assert copyright over an LLM. The question might not be decided in their favour, putting a spectre over all their investment. It is not a risk worth taking.
Anyone remember this one: https://en.wikipedia.org/wiki/Microsoft_Corp._v._Zamos
>there's now doubt that a DeepSeek-level model could be trained without making use of OpenAI's substantial levels of investment.
But, this still seems to be a problem for OpenAI. Who wants to invest "substantially" in a company whose output can be used by competitors to build an equal or better offering for orders of magnitude less?
Seems they'd need to make that copyright stick. But, that's a very tall and ironic order, given how OpenAI obtained its data in the first place.
There's a scenario where this development is catastrophic for OpenAI's business model.
Basically, in a roundabout way, OpenAI is going back to their roots and more - they're something between a charity and Robin Hood, stealing the money of rich investors and giving it to poor and aspirational AI competitors.
"Karma's a bitch, ain't it."
Sam should focus on the product instead of trying to out-jerk Elon and his buddies.
OpenAI has a legally submitted point of view that they believe the benefits of AI to humanity are so great that anyone creating AI should be allowed to trample all over copyright laws, Terms of Use, EULAs, etc.
But OpenAI’s version of benefit to humanity is that they should be allowed to trample over those laws so they can benefit humanity by closely guarding the output of trampling those laws and charging humanity an access fee.
Even if we accept all of OpenAI’s criticisms of DeepSeek, they’re arguing that DeepSeek doing the exact same thing, but releasing the output for free for anyone to use is somehow less beneficial to humanity.
However, OpenAI and Google are far more politically influential than the lobbyists in the 90s, so it is likely to succeed.
My understanding is that legal positions and arguments (within Common Law) need not be consistent across "cases" - they are considered in isolation with regard to the body of law extant at the time.
I think that Sam can quite happily argue two differing points of view to two courts. Until a judgement is made, those arguments are simply arguments and not "binding" or even "influential elsewhere" or whatever the correct terms are.
I think he can legitimately argue both ways but may not have it both ways.
It would be very sensible that if a trial comes up, all these arguments that Sam Altman made for the other side score against him and OpenAI.
What people “suck out” of their API are the general ideas. And they do it specifically so they can reassemble them in their own way.
It’s like reading all the Jack Reacher novels and then creating your own hero living through similar situations, but with a different name.
You’ll read it and you’ll say, dang, that situation/metaphor/expression/character reminds me of that Reacher novel. But there’s nothing Lee Child can do about it.
And that’s perfectly fine. Because he himself took many of his ideas from others, like Le Carré.
It’s the Eternal Tao.
Is there a scenario where it isn’t?
Either (1) a competitor is able to do it better without their work or (2) a competitor is able to use their output and develop a better product.
Either way, given the costs, how do you justify investing in OpenAI if the competitor is going to eat their lunch and you’ll never get a return on your investment?
More specifically, on the legal side I don't see how they can protect their output without stepping on their own argument for ingesting everyone else's. And, if that were to indeed prove impossible, then that would be the catastrophic scenario.
On your point (1), I don't think that's necessarily catastrophic. That's just good old-fashioned competition, and OpenAI would have to simply best them on R&D.
I just don't see how OpenAI makes a legitimate copyright claim without stepping on its entire business model.
>OpenAI and its partner Microsoft investigated accounts believed to be DeepSeek’s last year that were using OpenAI’s application programming interface (API) and blocked their access on suspicion of distillation that violated the terms of service, another person with direct knowledge said. These investigations were first reported by Bloomberg.
Except in this case, nothing was stolen, unless they want to call ChatGPT's own training on source data theft too.
I'm not seeing it. I get it, the narrative that OpenAI is getting a taste of their own medicine is funny but this is not serious reporting.
Everybody has evidence OpenAI scraped the internet at a global scale and used terabytes of data it didn't pay for. Newspapers, books, etc...
China leads the world in the most cited papers[2]. The US's share of the top 1% highly cited articles (HCA) has declined significantly since 2016 (1.91 to 1.66%), and the same has doubled in China since 2011 (0.66 to 1.28%)[3].
China also leads the world in the number of generative AI patents[4].
1. https://www.bfna.org/digital-world/infographic-ai-research-a...
2. https://www.science.org/content/article/china-rises-first-pl...
3. https://ncses.nsf.gov/pubs/nsb202333/impact-of-published-res...
4. https://www.wipo.int/web-publications/patent-landscape-repor...
OpenAI seems to feel that way about their use of copyrighted material: since they didn't literally make a copy of the source material, it's totally fair game. It seems like this is the same argument that protects DeepSeek if indeed they did this. And why not? Reading a lot of books from the library is a way to get smarter, and ostensibly the point of libraries.
What was supposed to be a model, training, and data moat is now reduced to operational cost, at which they are not terribly efficient.
OpenAI has been on a journey to burn as much $ as possible to get as far ahead on those three moats, to the point where decreasing TCO for them on inference was not even relevant - "who cares if you save me 20% of costs when I can raise on a 150b pre money value?".
Well, with their moats disappearing, they will have no choice but to compete on inference cost like everyone else.
They did the exact same thing with public information. Their model just synthesizes and puts out the same information in a slightly different form.
Next we should sue students for repeating the words of their teachers
It does matter whether it happened and how much it happened. DeepSeek ran head-to-head comparisons against o1, so it would be pretty reasonable for them to have made API calls, for example.
But also, as the article notes, distillation, supervised fine tuning, and using LLM as a judge are all common techniques in research, which OpenAI knows very well.
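To make "distillation" a bit more concrete for anyone following along: in its simplest API-based form it just means collecting a stronger model's answers and fine-tuning a smaller model on them. A minimal sketch of the data-collection half, assuming nothing about what DeepSeek or anyone else actually did; `fake_teacher` is a hypothetical stand-in for a real API call, and the prompt/completion record layout is only one common convention.

```python
import json

def fake_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a call to a stronger model's API."""
    return f"Answer to: {prompt}"

def build_sft_records(prompts, teacher):
    """Pair each prompt with the teacher's output.

    A student model would later be fine-tuned on these pairs
    (supervised fine-tuning on teacher outputs).
    """
    return [{"prompt": p, "completion": teacher(p)} for p in prompts]

prompts = ["What is 2 + 2?", "Name a prime number."]
records = build_sft_records(prompts, fake_teacher)

# One JSON object per line (JSONL), a format many fine-tuning tools accept.
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)
```

Note that nothing in the resulting records identifies the teacher, which is part of why provenance is hard to prove from outputs alone.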
You can't have it both ways... If OpenAI wants to claim that the AI is not repeating content but 'synthesizing it' in the same way as a human student would do... Then I think the same logic should extend to DeepSeek.
Now if OpenAI wants to claim that its own output is in fact copyright-protected, then it seems like it should owe royalty payments to everyone whose content was sourced upstream to build its own training set. Also, synthetic content which is derived from real content should also be factored in.
TBH, this could make a strong case for taxing AI. Like some kind of fee for human knowledge and distributed as UBI. The training data played a key part in this AI innovation.
As an open source coder, I know that my copyrighted code is being used by AI to help other people produce derived code and, by adapting it in this way, it's making my own code less relevant to some extent... In effect, it could be said that my code has been mixed in with the code of other open source developers and weaponized against us.
It feels like it could go either way TBH but there needs to be consistency.
Well, this sounds like they are just crying because they are losing the race so far. Besides, DeepSeek explicitly states they did a study on distillation on ChatGPT, then OpenAI is like "oh see guys they used our models!!!!!"
I don't remember the correct metric but the cost for DeepSeek was like $15/mo while ChatGPT was $200
Company A pays OpenAI for their API. They use the API to generate or augment a lot of data. They own the data. They post the data on the open Internet.
Company B has the habit of scraping various pages on the Internet to train its large language models, which includes the data posted by Company A. [1]
OpenAI is undoubtedly breaking many terms of service and licenses when it uses most of the open Internet to train its models. Not to mention potential copyright violations (which do not apply to AI outputs).
[1]: This is not hypothetical BTW. In the early days of LLMs, lots of large labs accidentally and not so accidentally trained on the now famous ShareGPT dataset (outputs from ChatGPT shared on the ShareGPT website).
"We didn’t see them when they were stealing, we saw them when they were fighting over what was stolen"
That describes this situation. Although to be honest, all this aggressive scraping is noticeable, but only to people who understand it, which is not the majority of people. But now everyone knows.
When you say noticeable, do you mean in like, traffic statistics? Or in what the model knows that it clearly shouldn't if it wasn't trained in legally dubious ways?
The only thing government cares about at this point is if information is shared with China.
Which is the most ridiculous argument they could use because they didn't respect any ToS (or copyright laws, for that matter) when scraping the whole web, books from Libgen and who knows what more.
Obligatory "Everything is a Remix" https://www.youtube.com/watch?v=X9RYuvPCQUA
[edit: the above comment was based on sensationalist reporting in the original link and not the current FT article. I still think there has been a lot of noise in the news this last week, but it may well be that OpenAI has valid evidence of wrongdoing; I would guess that any such wrongdoing would apply directly to V3 rather than R1-zero, because o1 does not provide traces and generating synthetic thinking data with 4o may be counterproductive.]
Of course it's possible that DeepSeek used O1 to generate some of this initial bootstrapping data, but it's not obvious. O1 deliberately obfuscates its reasoning process anyway (see the "Hiding the chains of thought" section of OpenAI's "Learning to reason with LLMs" page), such that what you see is an after-the-fact "summary" of what it actually did; so, if DeepSeek did indeed use some of O1's output to train on, it shows that the details of O1's own reasoning process aren't as important as they thought they were - it's just having some verified (i.e. leading to a good outcome) reasoning data from any source that matters to get started.
But, I wonder - do companies like OpenAI, Google, and Anthropic use each other's models for training? If not, is it because they don't want to or need to, or because they are afraid of breaking the ToC?
this implies that when casting doubt the doubt is always false, if the doubt here is true, then it is a good offer.
Commonly the phrase sowing doubt is used to say an argument someone has made is false, but that was evidently not what the parent poster meant, although it was what the comment I replied to probably interpreted it as.
on edit: I believe what the parent poster meant is that whether or not OpenAI/Altman believes the doubts expressed, they are pretty much constrained to cast some doubt as they do whatever else they are planning to deal with the situation. From outside we can't know if they believe it or not.
I don't think that this is a working argument, because all their steps I can imagine are not mutually exclusive.
OpenAI is one of the companies like Company B that is using data from the open Internet.
[1] Ownership of content. As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.
Correct me if my understanding is wrong, but if OpenAI's accusation is correct and DS is a derivative work, then isn't it inaccurate to say DS reached ChatGPT performance "at a fraction of the cost"? If true, seems like it's more accurate to say that they were able to copy an expensive model, at low expense.
DeepSeek did some studies on distillation, which might be what OpenAI is complaining about. But their bigger model is not a distilled version of OpenAI's.
Curious, though, can a corporation be pardoned?
State-level crimes (like his NY felonies) and civil torts (like his case where he owes $500M currently) are separate.
ClosedAI scraped human content without asking and they explained why this was acceptable... but when the outputs of their training corpus are scraped, it is THEIR dataset and this is NOT acceptable!
Oh, the irony! :D
I shared a few screenshots of DeepSeek answering using ChatGPT's output in yesterday's article!
https://semking.com/deepseek-china-ai-model-breakthrough-sec...
The max token output is only 8K (32K thinking tokens). O1 is 128k, which is far more useful, and it doesn’t get stuck like R1 does.
The hype around the DeepSeek release is insane and I’m starting to really doubt their numbers.
Read the two following sections of my blog post:
1. "Distilled language models"
2. "DeepSeek: Less supervision"
> We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome.
> Our primary fiduciary duty is to humanity. We anticipate needing to marshal substantial resources to fulfill our mission, but will always diligently act to minimize conflicts of interest among our employees and stakeholders that could compromise broad benefit.
> We will actively cooperate with other research and policy institutions; we seek to create a global community working together to address AGI’s global challenges.
[1] https://semking.com/wp-content/uploads/2025/01/DeepSeek-1024...
[2] https://www.bestbuy.com/site/help-topics/privacy-policy/pcmc...
https://www.linkedin.com/posts/organic-growth_deepseek-the-o...
I welcome friction, so I'll be blunt: I disagree with you, not because what you are saying is wrong but because you only consider systematic data collection.
That's not the issue here.
There's a difference between democracies like the United States or European countries, no matter how IMPERFECT they are, and a dictatorship that does not allow dissenting opinions.
There's a difference in how the data collected will be used.
Freedom of speech, even when it is relative, is better than totalitarianism.
Separately, I do think that now that the Chinese leadership saw this, that they have the chops to pull this off and then some, they are probably going to rein in future innovations; they'll likely demand that the big future discoveries remain closed-sourced (or even unannounced/unpublicized).
> Separately, I do think that now that the Chinese leadership saw this, that they have the chops to pull this off and then some, they are probably going to rein in future innovations; they'll likely demand that the big future discoveries remain closed-sourced (or even unannounced/unpublicized).
How do we know that this is not already happening with OpenAI/Meta and the U.S. government at some level? The concept of power is equal, whether we wanted it or not. We don't have to pretend to be "better" all the time.
Depends on whether they want these tools to be adopted in the wider world. Rightly or wrongly there is a lot of suspicion in the West and an open source approach builds trust.
If the allegation is true (we don't know yet), then what you've written perfectly proves the point everyone is making. ChatGPT wouldn't be here if it weren't for all the research and work that preceded it in terms of tons of scrapable content being available on the Internet, and it's not like OpenAI invented transformers either.
Nobody is accusing DeepSeek of hacking into OpenAI's systems and stealing their content. OpenAI is just saying they scraped them in an "unauthorized" manner. The hypocrisy is laughably striking, but sadly nobody has any shame anymore in this world it seems. Play me the world's tiniest violin for OpenAI.
US AI folk were leading for two years by just throwing more and more compute at the same thing that Google threw them like a bone years ago (namely transformers). They made next to no innovation in any area other than how to connect more compute together. The idea of additional inference-time compute, looping the network back on its own outputs, which is the only significant conceptual advancement of the last few years, was something I, as a layman, came up with after a few days of thinking about why AI sucks and what can be done to make it able to tackle problems that require iterative reasoning. They announced it a few weeks after I came up with the idea, so it had been in the works for some time, but it shows you how basic an idea it was. There was nothing else.
Suddenly, when a small company comes along and introduces a few actual algorithmic advancements that result in a 100x optimization, which is something to be expected from algorithmic optimizations, big AI goes into full "dog ate my homework" mode, blaming everyone and everything around.
Let's not mention the fact that if full outputs of their models could enable them to train a better model at 1% cost then it puts them in even worse light that they didn't do it.
We have an apples-and-oranges thing here, which DeepSeek is intentionally leaning into. They get very cheap electricity and are bragging about their cheap cost, and OpenAI etc. typically brag about how expensive their training is. But it’s all PR and lies.
Plus of course for people within the tech bubble, plenty of research results on the value of synthetically augmented and expanded training data that put the impact past just regurgitating source data.
This whole episode is a failure of reporting what to expect next and projecting running costs etc. most of all.
Sorry for the Short: https://www.youtube.com/shorts/M0QyOp7zqcY
Well, I guess they really helped make this a reality!
And then where does DeepSeek steal from next? Do they steal from themselves? Do they steal the stolen models they stole from the stolen data?
The AI Ponzi scheme...
It reminds me of 1984 in a sense. "Don’t you see that the whole aim of Newspeak is to narrow the range of thought? In the end we shall make thoughtcrime literally impossible, because there will be no words in which to express it."
Unlike 1984 I don't see this winnowing of new concepts as purposeful, but on the other hand I keep asking myself how we can be so stupid as to keep doing it.
1. scraping the internet and making AI out of it
2. using the AI from #1 to create another AI
are not the same thing.
So, if you really really care about ToS, then just never enter into a contract with OpenAI. Company A uses OpenAI to generate data and posts it on the open Internet. Company B scrapes open Internet, including the data from Company A [2].
[1]: Ownership of content. As between you and OpenAI, and to the extent permitted by applicable law, you (a) retain your ownership rights in Input and (b) own the Output. We hereby assign to you all our right, title, and interest, if any, in and to Output.
[2]: This is not hypothetical. When ChatGPT got first released, several big AI labs accidentally and not so accidentally trained on the contents of the ShareGPT website (site that was made for sharing ChatGPT outputs). ;)
#2 makes a big corp a bit angry
Indeed not the same thing
But arguably these actions share enough characteristics that it’s reasonable to place them in the same category. Something like: “products that exist largely/solely because of the work of other people”. The nonconsensual nature of this and the lack of compensation is what people understandably take issue with.
There is enough similarity that it evokes specific feelings about OpenAI when they suddenly find themselves on the other side of the situation.
You’re right. The second one is far more ethical. Especially when stealing from a thief.
Doesn’t Sam Altman keep parroting they’re developing AI “for the good of humanity”? Well then, someone taking their model and improving on it, making it open-source, having it consume less, and having a cheaper API, should make him delighted. Unless he *gasp* was full of shit the whole time. Who could have guessed?
#2 is taking advantages from closedAI.
they are indeed different
2. scraping the AI from #1 and making AI out of it
The best part is "their IP" was humanity's scraped content and they are angry that DeepSeek did their job for them and gave it away for free.
OpenAI is mad about d2 (not d1). I'm not sure using public data is "stealing". In summary, these are two different things & need to be separate.
It is unclear if someone breaking someone else's copyright to use A can claim copyright on a work B, derived from A. My point is that OpenAI played loose with the copyright rules to build its various models, so the legality of their claims against DeepSeek might not be so strong.
OpenAI asserts 1. d2 was used by DeepSeek 2. All d2 belongs to OpenAI exclusively
Both are debatable for large number of reasons.
I wonder if this is going to come to an end through a combination of social media fatigue, social media fragmentation, and open source LLMs just giving it all back to us for free. LLMs are analogous to a "JPEG for ideas" so they're just lossy compression blobs of human thought expressed through language.
It cannot die soon enough
They scraped literally all the content of the internet without permissions. And I won't even be surprised if they scraped the output of other LLMs as well.
You may not owe Altman better, but you owe this community better if you're participating in it.
https://blog.samaltman.com/trump
https://www.reddit.com/r/YAPms/comments/1i7ry5m/sam_altman_g...
Only a truly talented piece of shit can be as prolific as this.
"He is irresponsible in the way dictators are."
Chef's kiss.
Edit:
Kids, don't aspire to be like Altman. We as a community need to espouse more values than tech is gonna tech.
Seriously. Given how pretty much all this software was trained, who cares?
I, for one, don't and believe the massive amount of knowledge continues to be of value to many users.
And I find the thought of these models knowing some things they shouldn't very intriguing.
DeepSeek just bodied u bro, get back in the lab & create a better AI instead of all this news that isn’t gonna change them having a good AI
Let’s race to the bottom.
It's okay when you steal worldwide IP to train your "AI".
It's not okay when said stolen IP is stolen from you?
If the chinese are guilty, then Altman's doom and gloom racket is as guilty or even more, considering they stole from everyone.
Their failure is important at a minimum to the future of the United States if not the world.
Society will always have crazy sociopaths destroying things for their own gain, and now is Altman's turn.
I think DeepSeek’s strategy to announce a misleading low cost (just the final training run that optimizes a base model that in turn is possibly based on OpenAI) is also purposeful. After all, High Flyer, the parent company of DeepSeek, is a hedge fund - and I bet they took out big short positions on Nvidia before their recent announcements. The Chinese government, of course, benefits from a misleading number being announced broadly, causing doubt among investors who would otherwise continue to prop up American technology startups. Not to mention the big fall in American markets as a result.
I do think there’s also a big difference between scraping the Internet for training data, which might just be fair use, and training off other LLMs or obtaining their assets in some other way. The latter feels like the kind of copying and industrial espionage that used to get China ridiculed in the 2000s and 2010s. Note that DeepSeek has never detailed their training data, even at a high level. This is true even in their previous papers, where they were very vague about the pre training process, which feels suspicious.
I had literally come to this post to say the same. You beat me to it.
The USA is going crazy over DeepSeek, and to me it just shows that the world is a black swan, an AI bubble.
I am not saying AI has no use. I regularly use it to create something, but it's just not recommended. I am going to stop using AI, to grow my mind.
And it's definitely way overpriced. People are investing so much money without seeing the returns, and I think people are also using AI because of a sense of FOMO. I don't know, to me it's funny.
I really, really want to create an index fund with strictly no AI companies, since this doesn't feel diversified enough. Sure, Nvidia gave a quarter of return the last year, but at this point it almost feels the same as Bitcoin. The reason I don't / won't invest in Bitcoin is I don't want "that" risk.
This has been a boggling year.
I have realized that the world is crazy. Truly. From Trump winning after getting shot, to DeepSeek causing Nvidia / the American stock market to go down (heck, even Bitcoin!), to Trump launching his meme coin, it's so crazy. If the world is crazy, just be the sane person around. You will stick around; that's my philosophy. I won't jump on the AI bandwagon. But it's still absolutely wild and horrifying seeing how a "side project" (DeepSeek) absolutely put the American stock market in shambles.
I want more diversification. I am not satisfied with the current system. This feels like a bubble and I want no part in it.
If you give me $500B to train the best model in the world, and then a couple people at a hedge fund in China can use my API to train a model that’s almost equal for a tiny fraction of what I paid, then it appears to be outrageously foolish to build new frontier models.
The only financial move that makes sense is to wait for someone else to burn hundreds of billions building a better model, and then clone it. OpenAI primarily exists to do one of the most foolish things you can possibly do with money. Seems like a really bad deal for investors.
In fact, I’d argue this is even worse, because no matter how much OpenAI improves their product, and Altman is prancing around claiming to need $7Trillion to improve their product, someone else can replicate it for a few million.
First-mover advantage doesn't always have to pay off in the marketplace. FedEx probably has to schedule extra flights between SF and DC just to haul all of OpenAI's patent applications.
I suspect that it's going to end up like the early days of radio, when everybody had to license dozens of key patents from RCA ( https://reason.com/2020/08/05/how-the-government-created-rca... ). For the same reason, Microsoft is reputed to make more money from Android licenses than Google does.
OpenAI spent investor money to enable random Chinese AI startups to offer a better version of their own product at a fraction of the cost. In some ways, this conclusion was inevitable, but I do find the way we arrived at it particularly enjoyable to watch play out.
Is it our job as a thinking public to decry it? Also sure. In fact, wildly yes.
I think it's more accurate to say most people can't (and don't) care about big monetary figures.
As far as Joe Average is concerned, ChatGPT cost $OoomphaDuuumpha and Deepseek cost $RuuunphaBuuunpha. The only thing Joe Average will care is the bill he gets after using it himself.
I now view any moralistic statement by any of these big tech companies as complete and total bullshit, which is probably for the best, because that is what it is. These companies now exist solely to amass power and wealth. They will still use moralistic language to try to motivate their employees, but I hope folks still see it for the complete nonsense that it is.
The cost of $5.5 million was quoted at $2/GPU-hour which is a reasonable price for on-demand H100s that anyone in the US could access, and likely on the high side given bulk pricing and that they are using nerfed versions. OpenAI might be all pr and lies but everything I've seen so far says that deepseek's claims about cost are legit.
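As a back-of-the-envelope check on those two figures, the quoted cost and hourly rate imply a specific GPU-hour budget. The sketch below only does that arithmetic; the cluster size of 2,048 GPUs passed to `wall_clock_days` is a hypothetical value chosen for illustration and is not taken from this thread, and both function names are made up here.

```python
def implied_gpu_hours(total_cost_usd: float, price_per_gpu_hour_usd: float) -> float:
    """GPU-hours that a given budget buys at a given hourly rate."""
    return total_cost_usd / price_per_gpu_hour_usd


def wall_clock_days(gpu_hours: float, num_gpus: int) -> float:
    """Days of continuous training if the GPU-hours are spread evenly across num_gpus."""
    return gpu_hours / num_gpus / 24


# Figures quoted in the comment above: $5.5 million at $2 per GPU-hour.
hours = implied_gpu_hours(5_500_000, 2.0)  # 2,750,000 GPU-hours

# Hypothetical cluster size, for illustration only.
days = wall_clock_days(hours, 2048)  # roughly 56 days
```

So the quoted number corresponds to about 2.75 million GPU-hours, which on a cluster of a couple of thousand GPUs would be a run of roughly two months.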
“I don't want to live in a world where someone else makes the world a better place better than we do”
- Gavin Belson
It's funny if OpenAI were to complain about this, but at least on Twitter I don't see that much whining about it from OpenAI employees. Sam publicly praised DeepSeek.
I do see some of them spreading the "they're hiding GPUs they got through sanction evasion" theory, which is disappointing, though.
right next to it is an older exhibit about how the original curator took cuneiform rolls and made them into necklace beads for his wife and rings? for himself.
either someone at the BM has a very british sense of humor or it's a gigantic woosh. I laughed my ass off. People looked at me.
They do it even with the Chile Moais when they never were in any danger.
It's all lies.
> The San Francisco-based ChatGPT maker told the Financial Times it had seen some evidence of “distillation”, which it suspects to be from DeepSeek.
> ...
> OpenAI declined to comment further or provide details of its evidence. Its terms of service state users cannot “copy” any of its services or “use output to develop models that compete with OpenAI”.
OAI share the evidence with the public; or, accept the possibility that your case is not as strong as you're claiming here.
The amount of iterations of training that would be needed for DeepSeek to actually learn anything from OpenAI would seem to be an insane amount of requests from a non-local AI, which you’d think would be immediately obvious to OpenAI just by looking at suspicious requests?
Am I correct in this assumption or am I missing something? Is it even realistic that something like this is possible without a local model?
Don't get mad about possible data theft, prove it in court.
I just can't even begin to imagine what will come of this ridiculous techno-imperialism/AI arms-race, or whatever you want to call it. It should not be too hard for China to create their own ASICs which do the same, and finally be done with this palaver.
The value was highly speculative, an illusion created by PR and sentiment momentum. "Hype value" not real value (unless you're able to realize it and dump those bags on someone else before fundamentals set in). Same thing happening with power companies downstream of the discovery that AI is not going to be a savior of sagging electricity demand. Overdriving the fundamentals is not value destruction, it is "I gambled and lost."
https://www.bloomberg.com/news/articles/2025-01-28/deepseek-... | https://archive.today/mCemf
"In the short run, the market is a voting machine but in the long run, it is a weighing machine."
All of the BigTech companies have enough cash flow from profitable lines of business to make speculative bets.
The US can ban all they want, but if the rest of the world starts preferring Chinese social media, Chinese AI, and Chinese websites in general, the US is going to lose one of its crown jewels.
The way the US behaves is a problem and makes a lot of people prefer alternatives just for the sake of avoiding the US, which is why it's important that the US get along with other nations, but--well, about that...
I don't think that's at all likely in the current economic system
They can stop DeepSeek doing various things commercially I guess, but stopping Americans using their ideas is simply impossible and stopping use of their source or weights would be (likely successfully) challenged under the first amendment.
There is no law against simply destroying trillions of dollars of shareholder value.
DeepSeek proved the emperor had no clothes and wiped out a lot of their valuation when investors saw that reaching parity with ChatGPT is not really that difficult.
It took them years and years to move forward with the ban on TikTok and it still hasn't been banned yet. There is no way they are going to ban some MIT-licensed weights.
>"they destroyed $1T of shareholder value."
The market has largely recovered.
How much taxpayer money has gone to OpenAI and Anthropic? They are the two big sinners in closed AI.
Trump wanted to ban Tiktok before... and then simply chose not to / forgot about it.
Next round congress acted, and Trump delayed it and has said that he is interested in his friends buying it.
Is there really a competitive plan here or is it just fishing for payouts / grifting for allies?
The context is always about competition, but I'm not even sure that's their plan.
That's why it's called a bubble. Pretty sure my great great grandad also overpaid for some tulips.
Like the fact that the US revolution was basically kickstarted by blatantly breaking patent law (there was this one mill specifically), I think it's a historic event. And now here we are! The scam of national security.
To be honest, people seem to be really keen on the fall of the USA. I am not that interested, since the rise of China terrifies me. But the hypocrisy of the USA and its loss of soft power (here I am, from a random country, critiquing the USA based on facts, which really undercuts its being a superpower) would be the downfall of the USA.
To me, the future is terrifying. In fact, the present terrifies me. I think the world is going crazy, or maybe it's just me.
I find your behavior repulsive and fervently wish you would quit.
It's not a borderline call—I'd post exactly the same thing regardless of who or what such a comment was about.
Not even close.
This guy is actively ruining society while enriching himself in the process, but we somehow can't call a spade a spade?
Pathetic.
Not that we could ever see what the NSA, CISA, ASIS, GCHQ, and other 3/4-letter agencies are actually doing with the collected data.
But they pinky promised to use it properly (or something), so, yay.
China considers industry to be completely subservient to government. Checks and balances are secondary to ideas like harmony and collective well being.
>There's a difference in how the data collected will be used.
>Freedom of speech, even when it is relative, is better than totalitarianism.
I don't disagree with "democracy is better than totalitarianism", but what does that have to do with collecting device information and IP addresses? Is that excuse a cudgel you can use against any behavior that would otherwise be innocuous? It's fine to be against DeepSeek because you're concerned about them getting sensitive data via queries, or even that their models could be a backdoor to project Chinese soft power, but hand-wringing about device information and IP addresses is absurd. It makes as much sense as being concerned that the CCP/DeepSeek holds meetings, because even though every other company holds meetings, CCP/DeepSeek meetings could be used for totalitarianism.
I admit I am concerned when I see blatant algorithmic manipulation of social platforms to favor any narrative that aligns with geopolitical objectives.
I also wrote about the TikTok algo a few days ago. You'll see what I think of user privacy violations (closed ecosystem + basically a keylogger in this case):
https://semking.com/likes-lies-untold-story-tiktok-algorithm...
I cannot stand when dissenting voices or opinions are shadow-banned.
And I have the same opinion regarding U.S. or EU companies.
Our privacy should be respected.
In the meantime: strong encryption at every corner, please!
>Our privacy should be respected.
Characterizing device information and IP addresses as "privacy violations" is a stretch. If you showed a history railing against this sort of stuff, agnostic of geopolitical alignment, then you get a pass, but I think it's fair to assume the converse until proven otherwise.
>In the meantime: strong encryption at every corner, please!
Irrelevant. The data collection is done by first parties. Encryption doesn't do anything.
>I admit I am concerned when I see blatant algorithmic manipulation of social platforms to favor any narrative that aligns with geopolitical objectives.
>I cannot stand when dissenting voices or opinions are shadow-banned.
What does this have to do with privacy? Again, it's fine to be against "blatant algorithmic manipulation of social platforms" or whatever, but dragging in seemingly unrelated topics in an attempt to amass as big a pile of grievances as possible is disingenuous.
>I also wrote about the TikTok algo a few days ago. You'll see what I think of user privacy violations (closed ecosystem + basically a keylogger in this case):
>https://semking.com/likes-lies-untold-story-tiktok-algorithm...
Where's the keylogging? I skimmed the article and the only thing I could find was a passing mention about an article that you "was advised not to publish it and I didn’t". How much keylogging could possibly be going on in a short video app? Is the "keylogging" just a way to make "we measure how engaged someone is with a video" sound as sinister as possible?
I'm curious how robust this principle is for you, because China and Russia are not the first countries that come to mind when talking about the (actual, existing, documented) manipulation of US speech and media by a foreign government.
Yet it seems we can only have this discussion, ironically, when the subject is a US government-approved one like China. Anything else would be problematic and unsafe.
This kind of person has a lot of cognitive dissonance going on.
Good for them! I hope this teaches Wall Street to not freak out about an unverified announcement.
Wall Street lost billions, and I hope they learned their lesson and next time will not crash the market when unverified news comes out.
Being a citizen of a western nation, I'm inclined to agree with the general sentiment here, but how can you definitively say this? You, or I, don't know with any certainty what interference the US government has played with domestic LLMs, or what lies they have fabricated and cultivated, that are now part of those LLMs' collective knowledge. We can see the perceived censorship with deepseek more clearly, but that isn't evidence that we're in any safer territory.
There are loads of examples on the internet of LLMs pushing (foreign) government narratives e.g. on Israel-Palestine.
Just because you might agree with the propaganda doesn't make it any less problematic.
There isn’t even a single example of that. If an LLM is taking a certain position because it has learned from articles on that topic, that’s different from it being manipulated on purpose to answer differently on that topic. You’re confusing an LLM simply reflecting the complexity out there in the world on some topics (showing up in training data), with government forced censorship and propaganda in DeepSeek.
The two aren’t the same, not even remotely close.
That's far more dystopian than a post-hoc "guardrailed" model (that you can run locally without guardrails).
These arguments always remind me of the arguments against Huawei because they _might_ be spying on western countries. On the other hand we had the US government working hand in hand with US corporations in proven spying operations against western allies for political and economic gain. So why should we choose an American supplier over a Chinese one?
> I think DeepSeek’s strategy to announce a misleading low cost (just the final training run that optimizes a base model that in turn is possibly based on OpenAI) is also purposeful. After all, High Flyer, the parent company of DeepSeek, is a hedge fund - and I bet they took out big short positions on Nvidia before their recent announcements. The Chinese government, of course, benefits from a misleading number being announced broadly, causing doubt among investors who would otherwise continue to prop up American technology startups. Not to mention the big fall in American markets as a result.
Why should I care about the stock value of US corporations?
> I do think there’s also a big difference between scraping the Internet for training data, which might just be fair use, and training off other LLMs or obtaining their assets in some other way.
So if training on copyrighted work scraped off the Internet is fair use, how would training on other LLMs not be fair use as well? You can't have it both ways.
Is corporate misinformation so much better? Recall about Tiananmen Square might be more honest, but if LLMs had been available over the past 50 years, I would expect many popular models would have cheerfully told us company towns are a great place to live, cigarettes are healthy, industrial pollution has no impact on your health, and anthropogenic climate change isn't real.
Especially after the recent behaviour of Meta, Twitter, and Amazon in open support of Trump and Republican interests, I'll be shocked if we don't start seeing that reflected in their LLMs over the next few years.
Hollywood also started by using non-regulation / non-licensed movie equipment when nobody was looking.
So, the USA has had this "move fast, break things, and monopolize the new thing so hard that no one can get near" mentality forever, and it moves in cycles.
It's now AI's turn, but it turned out that they democratized things so hard that everybody can act fast now.
In nature, nobody can stay at the top forever. People should understand this.
HN's stated mandate is intellectual curiosity (https://news.ycombinator.com/newsguidelines.html, https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...).
Do you feel that your comment https://news.ycombinator.com/item?id=42866108 was curious (in that sense)? or was it rather that you feel something else is more important?
As for it getting banned, TikTok was banned partly because of credible accounts of it having been used by China to track political enemies. Are we thinking they'll expand the argument on national security to say that any application that transfers data to China is a national security threat? Because that could be a very slippery slope.
And in any case, such a measure seems like it would only bar access to the DeepSeek app. Surely no one could argue that the underlying open source model, if run locally on American soil, could constitute a security threat, right?
Due diligence never goes out of style.
Do you mean cash dividends [1]?
Also, the premise is false. Dividend yields have roughly tracked interest rates [2]. (The difference is a dirty component of the equity risk premium [3].)
[1] https://www.investopedia.com/ask/answers/05/stockcashdividen...
[2] https://www.multpl.com/s-p-500-dividend-yield/table/by-year
[3] https://www.investopedia.com/investing/calculating-equity-ri...
I agree: this is a characterization I never made. FYI, I also collect this type of data about you when you visit my website. That said, telemetry + totalitarianism = bad combo.
>Irrelevant. The data collection is done by first parties. Encryption doesn't do anything.
Even if data is collected by first parties, encryption is still highly relevant because it ensures that the data remains secure in transit and at rest. It does a lot.
>What does this have to do with privacy? Again, it's fine to be against "blatant algorithmic manipulation of social platforms" or whatever, but dragging in seemingly unrelated topics in an attempt to amass as big a pile of grievances as possible is disingenuous.
You are aggressive for no reason whatsoever. There's nothing disingenuous: when users are shadow-banned by platforms under dictatorships, they end up flagged, and their private data is often analyzed for nefarious reasons. There's a link with privacy but I'll stop at this stage if we cannot have a civilized discussion.
>Where's the keylogging? I skimmed the article and the only thing I could find was a passing mention about an article that you "was advised not to publish it and I didn’t". How much keylogging could possibly going on in a short video app? Is the "keylogging" just a way to make "we measure how engaged someone is with a video" as sinister as possible?
“TikTok iOS subscribes to every keystroke (text inputs) happening on third party websites rendered inside the TikTok app. This can include passwords, credit card information and other sensitive user data. (keypress and keydown). We can’t know what TikTok uses the subscription for, but from a technical perspective, this is the equivalent of installing a keylogger on third party websites.”
https://krausefx.com/blog/announcing-inappbrowsercom-see-wha...
Please note that this article is outdated (August 2022). Importantly, the article does not claim that any data logging or transmission is actively occurring. Instead, it highlights the potential technical capabilities of in-app browsers to inject JavaScript code, which could theoretically be used to monitor user interactions.
"We Don't See Things As They Are, We See Them As We Are"
— Samuel b. Nahmani
Do the unit economics make this sustainable?
Meanwhile, to please the US they would probably have to move the entire company to the US. And even that may not be enough
If DeepSeek becomes popular in America I predict it will be blocked, national firewall style. Will EU do the same?
Either way what you describe is perfectly achievable for the workers, but at some point management needs to own up to their failures and getting rewarded because the board is also made up of executives at other big tech companies is a perverse incentive to never actually improve.
Edit: I am not defending OpenAI and we are all enjoying the irony here. But it puts into perspective some of the wilder claims circulating that DeepSeek was able to somehow compete with OpenAI for only $5M, as if on a level playing field.
Edit: Responding to your edit, Deepseek only claimed that the final training run was $5m, not that the whole process cost that (they even call this out). I think it's important to acknowledge that, even if they did get some training data from OpenAI, this is a remarkable achievement.
We don’t know that of course. OpenAI claim to have some evidence and I guess we’ll just have to wait and see how this plays out.
There’s also a substantial difference between training on the entire internet and training that very specifically targets your competitor's products (or any specific work directly).
https://scholar.google.com/citations?view_op=view_citation&h...
Basically all LLMs can trace their origin back to that paper.
This was just a single example though. The whole point is that people build on the work from the past, and that this is normal.
I don't think your statement holds with current behavior.
Being hostile does not bring you friends. Sure, various countries can have reasons to suck it up anyway (e.g. because of sanctions, or because China makes an offer too good to pass, although even that comes with strings attached). But in the long run you just create clients or satellites who will escape at the first occasion.
The American foreign policy around the middle of the 20th century relied very effectively on soft power, which is something you can leverage to get much more out of your investments than their pure monetary value. It is not required in order to gain influence, but it is a force multiplier.
Is hostility a bad idea only for America? Sure hope not.
I think protectionism is long-term bad for every country, but it's especially and uniquely bad for the biggest economy in the world who has net benefitted the most from free trade and competition. There's no denying that China is influential – the argument is that they could've been (and still can be) so much more influential by embracing western tech instead of walling themselves off.
I've also compared o1 and (online-hosted) r1 on Qt/C++ code, being a KDE Plasma dev, and my impression so far was that the output is roughly on par. I've given both models some tricky tasks about dark corners of the meta-object system in crafting classes etc. and they came up with generally the same sort of suggestions and implementations.
I do appreciate that "asking about gotchas with few definitive solutions, even if they require some perspective" and "rote day-to-day coding ops" are very different benchmarks due to how things are represented in the training data corpus, though.
My standard test is to ask the model to write a QSyntaxHighlighter subclass that uses TreeSitter to implement syntax highlighting. O1 can do it after a few iterations, but R1’s output has been a mess. That said, its thought process revealed a few issues that I then fixed in my canonical implementation.
If an org consistently finds one model performs worse on their corpus than another, they aren't going to keep using it because it ranks higher in some set of benchmarks.
For instance Fireworks offers R1 with 164K/164K. They are far more expensive than DeepSeek though
I mean, couldn't that be because they're just overwhelmed by users at the moment?
> And the output is very bad - it mashes together the header and cpp file
That sounds way worse, and like, not something caused by being hugged to death though.
Aider recently stated DeepSeek is placed at the top of their benchmark though[1] so I'm inclined to believe it isn't all hype.
It’s just not as impressive as people make it out to be. It might be better than o1 on Python or JavaScript, which are all over the training data, but o1 is overwhelmingly better at anything outside the happy path.
How is that not consumer harm?
Hopefully in the future the FTC will break up Microsoft, forcing them to split Azure, Office, and Windows. They clearly can't be trusted with all 3.
In practice I don’t think anyone can economically host the whole model plus the kv cache for the entire context size of 128k (and I’m skeptical of Deepseek’s claims now anyway).
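For a rough sense of scale, here is a back-of-the-envelope sketch of how a plain, uncompressed KV cache grows with context length. The model shape below (60 layers, 8 KV heads, head dimension 128, 16-bit values) is purely hypothetical and is not DeepSeek's real configuration; DeepSeek reportedly uses a compressed attention scheme (MLA), so its actual numbers would differ.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of a plain (uncompressed) KV cache for a single sequence.

    The leading 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape for illustration only, NOT DeepSeek's configuration.
size = kv_cache_bytes(layers=60, kv_heads=8, head_dim=128, seq_len=128 * 1024)
print(f"{size / 2**30:.1f} GiB of KV cache per 128k-token sequence")
```

Under those assumptions the cache for one full-length sequence is tens of GiB on top of the weights, which is the kind of cost that pushes hosts to cap context size.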
Edit: a Kagi team member just said on Discord that they’ll be increasing max tokens next release
I haven’t used their official chat interface or API for privacy reasons.
I invite anyone to post a chat transcript showing a successful run of R1 against this prompt (and please tell me which API/service it came from so I can go use it too!)
I'm talking about individuals and organizations making a decision on whether or not to use a model based on their own testing. That's what ultimately matters here.
It’s the business of commodities. The magic is in tiny incremental improvements and distribution. DeepSeek forces us to question if AI—possibly intelligence—is a commodity.
OMG, it seems tech has been invaded by baaing crypto bros
Their moat is >1B people are already using ChatGPT monthly.
They aren't going to switch unless something is substantially better.
Unlike a social network, network effects won't help them - their users don't care how many other users they have, only about the AI output quality.
> They aren't going to switch unless something is substantially better.
Or approximately as good but cheaper.
You're fooling yourself if you think OpenAI is going to pass up implementing the same strategies to get a ~27x cheaper model.
> Unlike a social network, network effects won't help them - their users don't care how many other users they have, only about the AI output quality.
Google Search doesn't have a network effect. Everyone on HN has been saying Google Search is complete garbage for a decade. It still has the same market share (roughly) as it did a decade ago.
« The thing I noticed right away when Claude came out is how little lock-in ChatGPT had established. This was very different to my experience when I first ran a search on Google, sometime in the year 2000. After the first time I used Google, I literally never used another search engine again; it was just light years ahead of its competitors in terms of the quality of its results, and the clarity of its presentation. This week I added a third chatbot to the mix: DeepSeek »
Follow up: https://x.com/TheStalwart/status/1884606421225848889
This is why OpenAI is so deep in the product development phase right now. They have to become the OS to be successful but I don't see that happening
Tell that to Friendster/MySpace and Facebook.
Try again.
There are different moats [1]. You’re describing incumbency, an intangible moat. It’s nice, but it’s fickle. Particularly with something with low switching costs.
OpenAI could argue, before, that it had a natural monopoly. More people use OpenAI so it gets more revenue and more data which lets it raise more capital to train these expensive models. That may not be true, which means it only has that first, shallow moat. It’s Nike. Not Google.
Google has a low switching cost, and hardly anyone switches.
ChatGPT is quite similar to Google in this way.
1B MAUs doesn't look great if half of them come from one source that can easily change to a competitor.
Except one product is 100% free and the other is mostly locked behind paid subscriptions
It’s already unable to keep up with demand, it will never be the default on mobile devices and businesses in the US will never trust it.
moat noun a deep, wide ditch surrounding a castle, fort, or town, typically filled with water and intended as a defense against attack.
Second place just needs a catapult and a diseased cow.
Google has massive network effects on its ad business and a natural monopoly on its search index. Crawling the web is expensive. It’s why Kagi has to pay Google (versus being able to pay them once and then stop).
iOS has 70% market share in the US
I wonder what the direct user counts are.
https://en.wikipedia.org/wiki/Illegal_number
> An AACS encryption key (09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0) that came to prominence in May 2007 is an example of a number claimed to be a secret, and whose publication or inappropriate possession is claimed to be illegal in the United States.
This is a silly take for anyone in tech. Any binary sequence is a number. Any information can be, for practical purposes, rendered in binary [1].
Getting worked up about restrictions on numbers works as a meme, for the masses, because it sounds silly, but is tantamount to technically arguing against privacy, confidentiality, the concept of national secrets, IP as a whole, et cetera.
[1] https://en.m.wikipedia.org/wiki/Shannon%27s_source_coding_th...
> Any piece of digital information is representable as a number; consequently, if communicating a specific set of information is illegal in some way, then the number may be illegal as well.
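To make the "any binary sequence is a number" point concrete, here is a small sketch that reads the 16 bytes quoted above as one big integer and converts it back. The variable names are just for illustration.

```python
# The key bytes exactly as quoted earlier in the thread.
key_hex = "09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0"

# bytes.fromhex skips the spaces between byte pairs.
key_bytes = bytes.fromhex(key_hex)

# Interpret the byte string as a single big-endian integer.
as_number = int.from_bytes(key_bytes, byteorder="big")

# The conversion is lossless: the integer maps back to the same bytes.
restored = as_number.to_bytes(len(key_bytes), byteorder="big")

print(len(key_bytes), "bytes")
print(as_number)
print(restored == key_bytes)
```

The same round trip works for any file, which is why "illegal number" and "illegal information" end up being the same claim.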
> It depends on where you live. In many places, collecting rainwater is completely legal and even encouraged, but some regions have regulations or restrictions.
United States: Most states allow rainwater collection, but some have restrictions on how much you can collect or how it can be used. For example, Colorado has limits on the amount of rainwater homeowners can store. Australia: Generally legal and encouraged, with many homes using rainwater tanks. UK & Canada: Legal with few restrictions. India & Many Other Countries: Often encouraged due to water scarcity.
That's not the same thing as a number being illegal at all. Here, watch this:
> I claim breathing is illegal in the United States
There, now breathing is claimed to be illegal in the United States.
https://www.federalregister.gov/documents/2023/11/01/2023-24...
>(k) The term “dual-use foundation model” means an AI model that is trained on broad data; generally uses self-supervision; contains at least tens of billions of parameters; is applicable across a wide range of contexts; and that exhibits, or could be easily modified to exhibit, high levels of performance at tasks that pose a serious risk to security, national economic security, national public health or safety, or any combination of those matters, such as by: ...
It orders the Secretary of Commerce to "solicit input from the private sector, academia, civil society, and other stakeholders through a public consultation process on potential risks, benefits, other implications, and appropriate policy and regulatory approaches related to dual-use foundation models for which the model weights are widely available".
Congress has never ceded power to anyone. They wield legislative authority and power of the purse, and wield it as they see fit. The special interests campaigning about this are extreme reactionaries whose stated purpose is to make government ineffective.
https://en.wikipedia.org/wiki/Export_of_cryptography_from_th...
Of course Joe Sixpack can throw their code up anywhere, but Joe Corporation gets wrecked if they try to sell it.
https://developer.apple.com/documentation/security/complying...
For example, this is enforced by Apple's App Store.
Open source means two things in spirit:
(a) You have everything you need to be able to re-create something, and at any step of the process change it.
(b) You have broad permissions how to put the result to use.
The "open source" models from both Meta and DeepSeek so far fail either both or one of these checks (Meta's fails both). We should resist the dilution of the term open source to the point where it means nothing useful.
That's why terms like "libre" were born to describe certain kinds of software. And that's what you're describing.
This is a debate that started, like, twenty years ago or something when we started getting big code projects that were open source but encumbered by patents so that they couldn't be redistributed, but could still be read and modified for internal use.
But I think my argument still stands though? Users can run Deepseek locally, so unless the US Gov't wants to reach for book-burning levels of idiocy, there is not really a feasible way to ban the American public from running DeepSeek, no?
No, but it's good enough to replace some office jobs. Which forces us to ask, to what degree is intelligence--unique intelligence--required for useful production? (We can ask the same about physical strength.)
My partner is in law, along with several friends and the amount of completely _useless_ work and ceremony they’re forced to do is insane. It’s a literal waste of their talent and time. We could probably net most of the claimed AI gains by taking a serious look at pointless workloads and come out ahead due to not needing the energy and capital expenditure.
MySpace and Friendster both spent significant time as the #1 social sites. Facebook unseated them rapidly. The same is possible for OpenAI.
Nvidia will definitely stay profitable for now though, as long as Deepseek’s breakthroughs are not further improved upon. But if others find additional compression gains, Nvidia won’t recapture its old premium. Its stock hinged on 80% margins and 75% annual growth; Deepseek broke that premise.
Growth might take a short-term dip, but may well be picked up by induced demand. Being able to train your own models "cheaply" will cause a lot more companies and departments to want to train their own models on their own data, and cause them to retrain more frequently.
The time of being able to sell H100 clusters for inference might be coming to an end though.
It’s good for Nvidia. It’s not as good as it was before. (Assuming DeepSeek’s claims are replicable.)
The important question is "will this and similar optimizations to come permit local LLM use, cutting OpenAI out of the equation entirely?"
This will make the cloud providers - especially AWS, GCP and to a lesser extent the also-ran clouds - more valuable. The other models hosted by AWS on Bedrock are already “good enough” for most business use cases.
And consumers are definitely not going to run LLMs locally on their computers to replicate ChatGPT (the product), any more than they are going to get an FTP account, mount it locally with curlftpfs, use SVN or CVS on the mounted filesystem, and then access the FTP account from Windows or Mac through built-in software instead of using cloud storage like Dropbox. [1]
Whether someone comes up with a better product than ChatGPT and overcomes the brand awareness is yet to be seen.
[1] Also the iPod had no wireless, less space than the Nomad and was lame.
Not personally. They'll let Apple handle it for them.
(This is already a thing. https://machinelearning.apple.com/research/introducing-apple...)
But that would mean a 27x lower valuation.
Not directly. The 27x is about costs. What it means is some order of magnitude of more competition. That reduces natural market share, price leverage and thus future profits.
No.
Valuations are based on future profits. Not future revenues.
You can theoretically lower your prices by 27x and end up with 2x more future profits - if your costs are actually 45x lower (which DeepSeek's method claims to be).
Your relative margin may have doubled, but your absolute profit-per-item hasn't. Say you had a 10% margin before, at a $100 price and $90 cost, for a $10 profit-per-item. Reduce price 27x and cost 45x, so $3.7 price, $2 cost, and $1.7 profit-per-item. 6x less profit - not as bad as 27x, but not good if you're OpenAI.
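The arithmetic is easy to check with a quick sketch. The $100 price, $90 cost, and the 27x and 45x reductions are the figures from the comment above; the function name is just illustrative.

```python
def profit_per_item(price: float, cost: float) -> float:
    """Absolute profit on a single unit sold."""
    return price - cost


# Figures taken from the comment above.
old_price, old_cost = 100.0, 90.0
new_price = old_price / 27  # price cut 27x
new_cost = old_cost / 45    # cost cut 45x

old_profit = profit_per_item(old_price, old_cost)
new_profit = profit_per_item(new_price, new_cost)

print(f"new price: ${new_price:.2f}, new cost: ${new_cost:.2f}")
print(f"profit per item: ${old_profit:.2f} -> ${new_profit:.2f}")
print(f"profit per item shrinks by {old_profit / new_profit:.1f}x")
print(f"margin: {old_profit / old_price:.0%} -> {new_profit / new_price:.0%}")
```

Note this only covers per-item profit at unchanged volume; it says nothing about how many more units sell at the lower price.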
Which are estimated, in significant part, by the chance of a competitor arising.
If the barriers of entry are much lower than originally thought, the potential profit margin plummets.
It absolutely does. People use Google for search -> Websites optimise for Google -> People get “better” results when searching with Google.
The fact that its market share is sticky and not responding quickly to changes in quality is sort of indicative of the network effect.
It would be so much better if all models were trained with LibGen.
Totally agree. But prompting debate or even further thought isn’t the point of the meme.
There is thought-stopping satire and thought-provoking satire. Much of it depends on the context. I’m not getting the latter from a “USA land of the ‘free’” comment.
The local LLMs on iPhones are literally 1% as powerful as server-based models like 4o.
And that’s before you even consider battery life.
Currently, yes. That's why this is a compelling advance - it makes local LLMs much more feasible, especially if this is just the first of many breakthroughs.
A lot of the hype around OpenAI has been due to the fact that buying enough capacity to run these things wasn't all that feasible for competitors. Now, it is, potentially even at the local level.
ChatGPT doesn't have any profits right now.
We have no idea what investors are expecting future profits to be.
> Say you had a 10% margin before, at a $100 price and $90 cost, for a $10 profit-per-item. Reduce price 27x and cost 45x, so $3.7 price, $2 cost, and $1.7 profit-per-item. 6x less profit - not as bad as 27x, but not good if you're OpenAI.
Now do the same thing but assume you have 10x more subscribers because the prices are ~27x lower.
You end up with almost 2x more total profit.
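As a sanity check on the "almost 2x" figure, here is a small sketch using the same per-item numbers from the quoted comment. The 10x subscriber growth is the assumption stated above, not a measured value, and subscriber counts are normalised to 1.

```python
# Per-subscriber profit from the quoted example:
# $100 price / $90 cost before, price cut 27x and cost cut 45x after.
old_profit_per_sub = 100.0 - 90.0
new_profit_per_sub = 100.0 / 27 - 90.0 / 45


def total_profit(profit_per_sub: float, subscribers: float) -> float:
    """Total profit as per-subscriber profit times subscriber count."""
    return profit_per_sub * subscribers


base_subscribers = 1.0  # normalised baseline
growth = 10             # assumed subscriber multiplier at the lower price

ratio = (total_profit(new_profit_per_sub, base_subscribers * growth)
         / total_profit(old_profit_per_sub, base_subscribers))

# Subscriber growth needed just to match the old total profit.
break_even_growth = old_profit_per_sub / new_profit_per_sub

print(f"total profit changes by {ratio:.2f}x with {growth}x subscribers")
print(f"break-even subscriber growth: {break_even_growth:.2f}x")
```

So the whole argument hinges on whether the lower price really brings in more than roughly six times as many paying users.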
Just take ChatGPT's ~$200 subscription. Hardly anyone is going to pay ~$200 a month. Reduce that by 27x - and you're at $7.5 per month. Maybe 10% of people on the planet will pay that.
You're in various spots of this thread pushing the idea that their 1B MAUs make them unassailable. How are they gonna get to 10B in a world with less than that total people?
> Just take ChatGPT's ~$200 subscription. Hardly anyone is going to pay ~$200 a month. Reduce that by 27x - and you're at $7.5 per month. Maybe 10% of people on the planet will pay that.
They can't even make money at the $200 price point, though. https://x.com/sama/status/1876104315296968813
That's https://en.wikipedia.org/wiki/Source-available_software , not 'open source'. The latter was specifically coined [1] as a way to talk about "free software" (with its freedom connotations) without the price connotations:
The argument was as follows: those new to the term "free software" assume it is referring to the price. Oldtimers must then launch into an explanation, usually given as follows: "We mean free as in freedom, not free as in beer." At this point, a discussion on software has turned into one about the price of an alcoholic beverage. The problem was not that explaining the meaning is impossible—the problem was that the name for an important idea should not be so confusing to newcomers. A clearer term was needed. No political issues were raised regarding the free software term; the issue was its lack of clarity to those new to the concept.
[1] https://opensource.com/article/18/2/coining-term-open-source...
And French fries are anything that was fried in France?
No, they also fail even that test. Neither Meta nor DeepSeek have released the source code of their training pipeline or anything like that. There's very little literal "source code" in any of these releases at all.
What you can get from them is the model weights, which for the purpose of this discussion are very similar to a compiler's binary executable output that you cannot easily reverse, which is what open source seeks to address. In the case of Meta, this comes with additional usage limitations on how you may put them to use.
As a sibling comment said, this is basically "freeware" (with asterisks) but has nothing to do with open source, either according to RMS or OSI.
> This is a debate that started, like, twenty years ago
For the record, I do appreciate the distinction. This isn't meant as an argument from authority at all, but I've been an active open source (and free software) developer for close to those 20 years, am on the board of one of the larger FOSS orgs, and most households have a few copies of FOSS code I've written running. It's also why I care! :-)
This debate is over and makes the open source community look silly. Open model and weights is, practically speaking, open source for LLMs.
I have tremendous respect for FOSS and those who build and maintain it. But arguing for open training data means only toy models can practically exist. As a result, the practical definition will prevail. And if the only people putting forward a practical definition are Meta et al, this is what you get: source available.
Completely, fully breaking the meaning of the term "open source" is causing collateral damage outside the AI topic, that's where it really hurts. The open source principle is still useful and necessary, and we need words to communicate about it and raise correct expectations and apply correct standards. As a dev you very likely don't want to live in a tech environment where we regress on this.
It's not "source available" either. There's no source. It's freeware.
"I can download it and run it" isn't open source.
I'm actually not too worried that people won't eventually re-discover the same needs that open source originally discovered, but it's pretty lame if we lose a whole bunch of time and effort to re-learn some lessons yet again.
Also, the amount of training data is massive.
Additionally, what about human in the loop training, do you deliver humans as part of the source?
And don’t aspire to be like those who saw what he is but made peace with it in exchange for silver.
I could understand that from someone with an empty stomach. But so many people doing it when their pockets are already overflowing is exactly the kind of rot that degrades an entire society.
We're all just seeing the results so much better now that they can't even be bothered to pretend they were ever more than this.
Later edit: The way this submission fell ~400 spots after just two hours despite having 1250 points and 550 comments, and had its comments flagged and shuffled around to different submissions as soon as they touched too close to YC&Co, is a good mirror of how today's society works.
Aspire to be like Aaron Swartz.
Kind of a cliche, but aspire to be the best version of yourself every day. Learn from the successes and failures of others, but don't aspire to be anyone else because eventually you'll be very disappointed.
He was lovely. And a genius. Maybe he changed, but he was a truly nice person.
But yes.
But survive. This too will pass.
edit: It appears I'm wrong. Will someone correct me on what he did?
He didn't do it without authorization.
https://en.wikipedia.org/wiki/Aaron_Swartz
> Visitors to MIT's "open campus" were authorized to access JSTOR through its network.
The first link is from mid-2016. The second link is from January 2025.
It is entirely reasonable for someone to genuinely change his or her views of a person over the course of 8.5 years. That is a substantial length of time in a person’s life.
To me a “flip-flop” is when one changes views on something in a very short amount of time.
Most of the things affected are highly political situations, e.g. Trump's ideas or Biden's fitness. But we also seem to have thrown out things that we used to consider cornerstones of liberal democracy, e.g. our ideas regarding free speech and censorship, where we claim that it's not happening because it is a private company.
In 2016: Sam alluded to Trump's rise as not dissimilar to Hitler's. He said that Trump's ideas on how to fix things are so far off the mark that they are dangerous. He even quoted the famous: "The only thing necessary for the triumph of evil is for good men to do nothing."
In 2025: "I'm not going to agree with him on everything, but I think he will be incredible for the country"
This is quite obviously someone who is pandering for their own benefit.
That's the thing though, right? We all created this mess together. Like yeah, why don't you (and the rest of us) blame him? We're all pretty warped and it's going to take collective rehab.
Super pretentious to quote MLK, but the man had stuff to say so here it is (on Inaction):
"He who passively accepts evil is as much involved in it as he who helps to perpetrate it"
"The ultimate tragedy is not the oppression and cruelty by the bad people but the silence over that by the good people"
- Sam Altman - 2016
"If you elect a reality TV star as President, you can't be surprised when you get a reality TV show" - Sam Altman - 2017
"When the future of the republic is at risk, the duty to the country and our values transcends the duty to your particular company and your stock price." - Sam Altman - 2017
"I think I started that a little bit earlier than other people, but at this point I am in really good company" - Sam Altman - 2017 ( On his criticism of Trump )
"Very few people realize just how much @reidhoffman did and spent to stop Trump from getting re-elected -- it seems reasonably likely to me that Trump would still be in office without his efforts. Thank you, Reid!" - Sam Altman - 2020
A community only espouses good values when it punishes bad behavior. How do we do this when those misbehaving are very rich, and attempting to punish the misbehavior has negative consequences on you? There just aren't many available tools that don't require significant sacrifices.
This is the "beauty" of the free market ideology (see e.g. https://a16z.com/the-techno-optimist-manifesto/ ). If all the transactions are voluntary, there is no way to punish anyone.
This is obviously untrue at face value. See: Cancel Culture, Bud Light, and Freedom Fries for examples.
Did you mean something more than what you stated here?
The reason the flip flops are so laughable to me is because they attempt to couch them in some noble, moralistic viewpoint, instead of the obvious reason "We own big companies, the government has extreme power to make or break these companies, and everyone knows kissing up to Trump is what is required to be on his good side."
Profiles in Cowardice, every last one of them.
> After years of pretending to be Democrats, Big Tech leaders are now pretending to be Republicans, in hopes of currying favor with the new administration. Beware of the scummy monopoly campaign to vilify competition law as they rip off consumers and crush competitors.
This is exactly what OpenAI is trying to do with these allegations.
One of my most contrarian positions is I still like and support Altman, despite most of the internet now hating him almost as much as they (justifiably) hate Elon. Was a fan of Sam pre-YC presidency and still am now.
(I also am a big fan of DeepSeek and its CEO.)
Tesla accelerated us forward into the electric car age. SpaceX revolutionized launches.
OpenAI added some real startup oomph to the AI arms race which was dominated by megacorps with entrenched products that they would have disrupted only slowly.
So these guys are doing useful things, however you feel about their other conduct. Personally I find the gross political flip-flops hard to stomach.
I don't support the mafia, but I sympathize with the local store that pays the mafia its racket dues to avoid being ransacked.
There is a fine line between cowardice and common sense.
For a contrast to the Bezos, Zuckerberg and Altman types, look at Tim Cook. Sure, Apple paid the 1 million inauguration "donation", and Cook was at the inauguration, and I'm not arguing he's winning any "Profiles in Courage" awards, but he didn't come out with lots of tweets claiming how massuh Trump is so wise and awesome, Apple didn't do a 180 on their previous policies, etc.
We need to relearn because we need a different definition for LLMs. One that works in practice, not just at the peripheries.
Maybe we can have FOSS LLMs vs open-source ones, like we do with software licenses. The former refers to the hardcore definition. The latter the practical (and widely used) one.
> Maybe we can have FOSS LLMs vs open-source ones, like we do with software licenses.
Why not just call them freeware LLMs, which would be much more accurate?
There's nothing "hardcore" or "zealot" about not calling these open source LLMs because there's just ... absolutely nothing there that you call open source in any way. We don't call any other freeware "open source" for being a free download with a limited use license.
This is just "we chose a word to communicate we are different from the other guys". In games, they chose to call it "free to play (f2p)" when addressing a similar issue (but it's also not a great fit since f2p games usually have a server dependency).
Most of the public is unfamiliar with the term. And with some of the FOSS community arguing for open training data, it was easy to overrule them and take the term.
The set of free/libre licenses (as defined by the FSF) is almost identical to the set of open source licenses (as defined by the OSI).
The debate within FOSS communities has been between copyleft licenses like the GPL, and permissive licenses like the MIT licence. Both copyleft and permissive licenses are considered free/libre by the FSF, and both of them are considered open source by the OSI.
I get your overall take is "this is just how things go in language", but you can escalate that non-caring perspective all the way to entropy and the heat death of the universe, and I guess I prefer being an element that creates some structure in things, however fleeting.
I’d argue otherwise. (Familiar with, not know.) Particularly in policy circles.
> picking one that invites far more questions and needs for explanation
There wasn't ever a debate. And now, not even the OSI demands training data. (It couldn’t. It, too, would be ignored.)
The richest man in the world has a government office down the street from the white house, which the taxpayers are funding. He's rumored to sleep there.
What do you think?
For which MIT can certainly have a) locked the door and b) trespassed him, but that's a very different issue than having authorization to access JSTOR.
Again, I'm not offering an opinion on it.
So, the argument goes that despite its intention, OpenAI has been one of the largest drivers of innovation in an emerging technology.
Striving for good isn’t gonna be fun all the time, but when choosing role models I like to factor in how happy they seem. I’d like to spend some time happy.
Yes, the first mistake was a business leader in tech taking a public political position. It was popular and accepted (if not expected) in the valley in 2016.
Doing that then (and banking the social and reputational proceeds) created the problem of dissonance now. If he'd just stayed neutral in public in 2016, he could do what he's doing now and we could assume he's just being a pragmatic business person lobbying the government to further his company's interests.
And then later they go to the school to teach our future, working with high stress and low salary.
Same with medical school in many countries where healthcare is not privatized. Insane hours, huge responsibilities and poor pay for doctors and nurses in many countries.
Nowadays everyone wants to be an influencer or software developer.
And nurses - they work at minimal salary in Poland. Even in USA if you count hourly rates it will be quite poor salary for nurses.
(Since we are dreaming, can I add sane hours for medical doctors (like <= 8 per day)?)
I have a lot of respect for CEOs who just focus on being a good CEO. It's a hard enough job as is. I don't care about or want to know some CEO's personal position on politics, religion or sports teams. It's all a distraction from the job at hand. Same goes for actors, athletes and singers. They aren't qualified to have an opinion any more relevant than anyone else's, except on acting, athletics, singing - or CEO-ing.
Sadly, my perspective is in the minority. Which is why I think so many public figures keep making this mistake. The media, pundits and social sphere need them to keep making this mistake.
Also it helps to have sincere political views. GitHub's CEO at the time of #DropICE was too cynical and his image suffered because of it.
There are no neutral positions in today's political landscape. I'm not stating my opinion here, this is according to most political positions on the spectrum. You suggested "Progressive" (but without hot button issues) as a way of signaling a neutral position. That may be true in parts of the valley tech sphere but it certainly doesn't hold in the rest of the U.S. "Progressive" is usually defined being to the left of "Liberal", so it's hardly neutral. Over half of U.S. voters cast their ballot for the Republican candidate. Almost all those people interpret anyone identifying themselves as "Liberal" as definitely partisan (and negative, of course). Most of them see "Progressive" as even worse, slipping dangerously toward "Socialist". And the same holds true for the term "Conservative" on the other side of the spectrum, of course.
No, identifying as "Progressive" wouldn't distance you from political connotations and culture warring, it's leaping into the maelstrom yelling "Yippee-ki-yay!" You may want to update your priors regarding how the broad populace perceives political labels. With voters divided almost exactly in half regarding politics and cultural war issues and a large percentage on both sides having "Strong" or "Very Strong" feelings, stating any position will be seen as strongly negative by tens of millions of people. If you're a CEO (or actor, athlete, singer, etc) who relies on appealing to a broad audience, when it comes to publicly discussing politics (or religion), the downsides can be large and long-lasting but the upsides are small and fleeting. As was said in the movie "WarGames", the only winning move is not to play.
"DeepSeek trained on our outputs and that's not fair because those outputs are ours, and you shouldn't take other peoples' data!" This is obviously extremely silly, because that's exactly how OpenAI got all of its training data in the first place - by scraping other peoples' data off the internet.
"DeepSeek trained on our outputs, and so their claims of replicating o1-level performance from scratch are not really true" This is at least plausibly a valid claim. The DeepSeek R1 paper shows that distillation is really powerful (e.g. they show Llama models get a huge boost by finetuning on R1 outputs), and if it were the case that DeepSeek were using a bunch of o1 outputs to train their model, that would legitimately cast doubt on the narrative of training efficiency. But that's a separate question from whether it's somehow unethical to use OpenAI's data the same way OpenAI uses everyone else's data.
(with the caveat that all we have right now are accusations that DeepSeek made use of OpenAI data - it might just as well turn out that DeepSeek really did work independently, and you really could have gotten o1-like performance with much less compute)
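For anyone unsure what "training on another model's outputs" means mechanically, here is a toy sketch. Everything in it is an assumption for illustration: `teacher` is a hypothetical black box standing in for the stronger model, and the "student" is just a least-squares line rather than an LLM. The point is only that the student never sees the teacher's internals or training data, just prompt/output pairs.

```python
def teacher(x: float) -> float:
    # Hypothetical black-box teacher; the student never inspects this body.
    return 3.0 * x + 2.0


def collect_pairs(prompts, teacher_fn):
    # Query the teacher and record (input, output) pairs as training data.
    return [(x, teacher_fn(x)) for x in prompts]


def fit_student(pairs):
    # Closed-form least squares for a one-variable linear student.
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


pairs = collect_pairs([0.0, 1.0, 2.0, 3.0], teacher)
student = fit_student(pairs)
```

The student matches the teacher on inputs it was never shown, at the cost of four queries. Real distillation of a language model is far messier, but the cost asymmetry is the same shape.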
All of this should have been clear anyway from the start, but that's the Internet for you.
And is this related to the lottery ticket hypothesis?
I have a question (disclaimer: reinforcement learning noob here):
Is there a risk of broken telephone with this?
Kinda like repeatedly compressing an already compressed image eventually leads to a fuzzy blur.
If that is the case then I’m curious how this is monitored and / or mitigated.
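A toy way to picture the broken-telephone worry, under loud assumptions: each "generation" learns perfectly from the previous one's outputs, but those outputs are sampled at a temperature below 1, so likely tokens get over-represented. The function names and the `fresh_fraction` knob (mixing some original data back in each round) are made up for this sketch, not taken from any paper or library.

```python
import math


def entropy(p):
    # Shannon entropy in nats; a rough proxy for output diversity.
    return -sum(x * math.log(x) for x in p if x > 0)


def sharpen(p, temperature):
    # Sampling at temperature T < 1 is equivalent to raising probabilities
    # to the power 1/T and renormalising.
    powered = [x ** (1.0 / temperature) for x in p]
    total = sum(powered)
    return [x / total for x in powered]


def run_generations(p0, temperature=0.7, generations=10, fresh_fraction=0.0):
    # Each generation trains on the previous generation's sharpened outputs,
    # optionally mixed with a fraction of the original distribution.
    p = list(p0)
    history = [entropy(p)]
    for _ in range(generations):
        student = sharpen(p, temperature)
        p = [fresh_fraction * a + (1 - fresh_fraction) * b
             for a, b in zip(p0, student)]
        history.append(entropy(p))
    return history


original = [0.5, 0.25, 0.15, 0.1]
collapsed = run_generations(original)
anchored = run_generations(original, fresh_fraction=0.3)
```

In this toy, diversity shrinks every generation when models only feed on each other, and keeping a share of original data in the mix stops it from collapsing all the way. Whether real training pipelines behave like this depends on how the data is filtered, which the sketch does not model.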
That is where artificial intelligence is going. Copy things from other things. Will there be an AI Eureka moment where it deviates and knows where and why it is wrong?
It seems like if they in fact distilled then what we have found is that you can create a worse copy of the model for ~5m dollars in compute by training on its outputs.
Better benchmark scores can be cooked
Look at the whole AI revolution that Meta and others have bootstrapped by opening their models. Meanwhile OpenAI/Microsoft, Anthropic, Google and the rest are just trying to look after number 1 while pursuing regulatory capture for an "AI for me but not for thee" outcome of full control.
A "thieves yelling 'stop those thieves'" scenario to me; they were just first and would not like losing that position. But it's all about money and consequently power, business as usual.
IANAL, but it is worth noting here that DeepSeek has explicitly consented to a license that doesn't allow them to do this. That is a condition of using ChatGPT and the OpenAI API.
Even if the courts affirm that there's a fair use defence for AI training, DeepSeek may still be in the wrong here, not because of copyright infringement, but because of a breach of contract.
I don't think OpenAI would have much of a problem if you train your model on data scraped from the internet, some of which incidentally ends up being generated by ChatGPT.
Compare this to training AI models on Kindle Books randomly scraped off the internet, versus making a Kindle account, agreeing to the Kindle ToS, buying some books, breaking Amazon's DRM and then training your AI on that. What DeepSeek did is more analogous to the latter than the former.
You actually don’t know this. Even if it were true that they used OpenAI outputs (and I’m very doubtful) it’s not necessary to sign an agreement with OpenAI to get API outputs. You simply acquire them from an intermediary, so that you have no contractual relationship with OpenAI to begin with.
I have some news for you
By existing in USA, OpenAI consented to comply with copyright law, and how did that go?
OpenAI can't have it both ways
I hope voters and governments put a long-overdue stop to this cancer of contract-maximalism that has given us such benefits as mandatory arbitration, anti-benchmarking, general circumvention of consumer rights, or, in this case, blatantly anti-competitive terms, by effectively banning reverse-engineering (i.e. examining how something works, i.e. mandating that we live in ignorance).
Because if they don't, laws will slowly become irrelevant, and our lives governed by one-sided contracts.
So no, it doesn't belong to OpenAI.
You might be able to sue for penalties for breach of contract of the TOS, but that doesn't give them the right to the model. And even if it did, it wouldn't give them any right to invalidate the unrestricted copyright grants already given to 3rd parties (here literally everyone). Nor does it prevent anyone from training their own new models based on it or prevent anyone from using it. Oh, and the one breaching the TOS might not even have been the company behind DeepSeek but some in-between 3rd party.
Naturally this is under a few assumptions:
- the US consistently applies its own law, but they have a long history of not doing so
- the US doesn't abuse their power to force their economic opinions (ban DeepSeek) on other countries
- it actually was trained on OpenAI, but uh, OpenAI has IMHO shown over the years very clearly that they can't be trusted and that they are fully non-transparent. How do we trust their claim? How do we trust them not to have retroactively tweaked their model to make it look as if DeepSeek copied it?
I'm wondering how Deepseek could have made 100s of millions of training queries to OpenAI and not one person at OpenAI caught on.
Now, DeepSeek may (or may not) have used some O1 generated data for the R0 RL training, but if so that's just a cost saving vs having to source some reasoning data some other way, and in no way reduces the legitimacy of what they accomplished (which is not something any of the AI CEOs are saying).
OpenAI has also invested heavily in human annotation and RLHF. If all DeepSeek wanted was a proxy for scraped training data, they'd probably just scrape it themselves. Using existing RLHF'd models as replacement for expensive humans in the training loop is the real game changer for anyone trying to replicate these results.
That's like the mafia complaining that they worked so hard to steal those barrels of beer that someone made off with in the middle of the night and really that's not fair and won't someone do something about it?
Besides deals with insurance companies and governments, one of the ways that they are still able to pull this is convincing everyone that it's too dangerous to play with this at home or buying it from an Asian supplier.
At least with software we had until now a way to build and run most things without requiring dedicated super expensive equipment. OpenAI pulled a big Pharma move but hopefully there will be enough disruptors to not let them continue it.
And if DeepSeek had a mole, why would they bother running a massive job internally to steal the data generated? It would be way easier for the mole to just leak the RL training process, and DeepSeek could quietly copy it rather than bothering with exfiltrating massive datasets to distill. The training process is most likely like, on the order of a hundred lines of Python or so, and you don't even need the file: you just need someone to describe it to you. Much simpler than snatching hundreds of gigabytes of training data off of internal servers...
Plus, the RL process described in DeepSeek's paper has already been replicated by a PhD student at Berkeley: https://x.com/karpathy/status/1884678601704169965 So, it seems pretty unlikely they simply distilled R1 and lied about it, or else how does their RL training algo actually... work?
This is mainly cope from OpenAI that their supposedly super duper advanced models got caught by China within a few months of release, for way cheaper than it cost OpenAI to train.
Someone has to correct me if I'm wrong, but I believe in ML research you always have a dataset and a model. They are distinct entities. It is plausible that output from OpenAI's model improved the quality of DeepSeek's dataset. Just like everyone publishing their code on GitHub improved the quality of OpenAI's dataset. What has been the thinking so far is that the dataset is not "part of" or "in" the model any more than the GPUs used to train the model are. It seems strange that that thinking should now change just because Chinese researchers did it better.
OpenAI has a message they need to tell investors right now: "DeepSeek only works because of our technology. Continue investing in us."
The choice of how they're wording that of course also tells you a lot about who they think they're talking to: namely, "the Chinese are unfairly abusing American companies" is a message that is very popular with the current billionaires and American administration.
The above OpenAI quote from the article leans heavily towards #1 and IMO not at all towards #2. The latter would be an extremely charitable reading of their statement.
It’s going to shift the market of how foundation models are used. Companies creating models will be incentivized to vertically integrate, owning the full stack of model usage. Exposing powerful models via APIs just lets a competitor clone your work. In a way OpenAI’s Operator is a hint of what’s to come
Well I think DeepSeek releasing it open source and on an MIT license will rally the big talent. The open sourcing of a new technology has always driven progress in the past.
The last paragraph, too, is where OpenAI seems to be focusing their efforts.
> we engage in countermeasures to protect our IP, including a careful process for which frontier capabilities to include in released models ..
> ... we are working closely with the US government to best protect the most capable models from efforts by adversaries and competitors to take US technology.
So they'll go for getting DeepSeek banned like TikTok was now that a precedent has been set ?
I don’t buy the argument that distilling from o1 undermines DeepSeek’s claims around expense at all. Just as OpenAI used the tools ‘available to them’ to train their models (e.g. everyone else’s data), R1 is using today’s tools.
Does OpenAI really have a moral or ethical high ground here?
Even if they win the legal case, it means weights can be inferred and improved upon simply by using the output that is also your core value add (e.g. the very output you need to sell to the world).
Their moat is about as strong as KFC's eleven herbs and spices. Maybe less...
Why would anyone go through the effort of training the next groundbreaking model if they know they can just wait for someone else to do it and leverage that work?
Why would anyone write, work or research anything if they know it would be consumed by AI and sold on a $xx/month subscription?
If Deepseek trained off OpenAI, then it wasn't trained from scratch for "pennies on the dollar" and isn't the Sputnik-like technical breakthrough that we've been hearing so much about. That's the news here. Or rather, the potential news, since we don't know if it's true yet.
The books were all in epub format, converted, cleaned to plain text, and hosted on a public data hoarder site.
There's a lot of wild claims about, so while this is plausible it would be great if there were some evidence backing it.
The complaint filing has some references to some of the other training material used by OpenAI, but I didn't dig deeply in to what all of it was:
https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...
A company like OpenAI can put whatever licensing they want in place. But that only matters if they can enforce it. The question is, can they enforce it against deepseek? Did deepseek do something illegal under the laws of their originating country?
I've had some limited exposure to media related licensing when releasing content in China and what is allowed is very different than what is permitted in the US.
The interesting part which points to innovation moving outside of the US is US companies are beholden to strict IP laws while many places in the world don't have such restrictions and will be able to utilize more data more easily.
AI is probably similar: Moore’s law and general advancement will eventually allow people to run open models locally and bring down the cost of operation. Competition will make it hard for all but one or two players to survive, and most investments in AI by large companies like Nvidia, OpenAI, Deepseek, etc. will fail to generate substantial wealth, though they may or may not earn some sort of return.
OpenAI talks a lot about the principles of being Open, while still keeping their models closed and not fostering the open source community or sharing their research. Now when a company distills their models using perfectly allowed methods on the public internet, OpenAI wants to shut them down too?
High time OpenAI changes their name to ClosedAI
Would not be surprised if they do a rebrand eventually
https://www.wired.com/2011/02/bing-copies-google/
They don't care, T&C and copyright is void unless it affects them, others can go kick rocks. Not surprising they and OpenAI will do a legal battle over this.
If US is going to proceed with trade war on EU, as it was planning anyway, then DeepSeek will be banned only in US. Seems like term "western countries" is slowly eroding.
It’s not limited to the models themselves but also OpenAI will probably work towards shutting down access to training data sets also.
imho it’s probably an emergency all-hands-on-deck problem.
Who cares if some people think you look foolish when you have a locked down 500 billion dollar investment guarantee?
You might want to consider how it applies to this situation.
1. Something I'd expect to happen.
2. Lived through a similar scenario in 2010 or so.
Early in my professional career I worked for a media company that was scraping other sites (think Craigslist but for our local market) to republish the content on our competing website. I wasn't working on that specific project, but I did work on an integration on my team's project where the scraping team could post jobs on our platform directly. When others started scraping "our content" there were a couple of urgent all-hands-on-deck meetings scheduled, with a high level of disbelief.
I can scarcely process all the levels of irony involved, the irony-o-meter is pegged and I can't get the good one from the safe because I'm incapacitated from laughter.
When I use an API to generate some data, I do not consider the R&D cost to develop the API as part of my costs.
My cynical opinion is that the training corpus has some small amount of data generated by OpenAI, which is probably impossible to avoid at this point, and they are hanging on that thread for dear life.
But that's a bit like saying that by painting a bare wall green you have demonstrated that you can build green walls 27x cheaper, ignoring the cost of building the wall in the first place.
Smarter reporting and discourse would explain how this iterative process actually works and who is building on who and how, not frame it as two competing from-scratch clean room efforts. It'd help clear up expectations of what's coming next.
It's a bit similar to how many are saying DeepSeek have demonstrated independence from nVidia, when part of the clever thing they did was figure out how to make the intentionally gimped H800s work for their training runs by doing low-level optimizations that are more nVidia-specific, etc.
Rarely have I seen a highly technical topic produce more uninformed snap takes than this week.
> We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome.
> We will actively cooperate with other research and policy institutions; we seek to create a global community working together to address AGI’s global challenges.
/s, we all know what their true mission is...
That would be a dangerous precedent to establish.
If it's a terms of service violation, I guess they're within their rights to terminate service, but what other recourse do they have?
Other than that, perhaps this is just rhetoric aimed at introducing restrictions in the US, to prevent access to foreign AI, to establish a national monopoly?
The humor/hypocrisy of the situation aside, it does seem to be true that OpenAI is consistently the one coming up with new ideas first (GPT 4, o1, 4o-style multimodality, voice chat, DALL-E, …) and then other companies reproduce their work, and get more credit because they actually publish the research.
Unfortunately for them it’s challenging to profit in the long term from being first in this space and the time it takes for each new idea to be reproduced is getting shorter.
It is, of course, quite ironic for OpenAI to indiscriminately scrape the entire web and then complain about being scraped themselves.
[1]: https://searchengineland.com/google-bing-is-cheating-copying...
It looks like Deepseek had a subdomain called "openai-us1.deepseek.com". What is a legitimate use-case for hosting an openai proxy(?) on your subdomain like this?
Not implying anything's off here, but it's interesting to me that this OpenAI entity is one of the few subdomains they have on their site
And the US government is also likely itching to reduce the power of Chinese AI companies that could outcompete US rivals (similar to the treatment of BYD, TikTok, solar panel manufacturers, network equipment manufacturers, etc), so expect sweeping legislation that blocks access to all Chinese AI endeavours, first in the US and then soon among US allies/the West (via US pressure).
The likely legislation will be on the surface justified both by security concerns and by intellectual property concerns, but ultimately it will be motivated by winning the economic competition between China and the US and it will attempt to tilt the balance via explicitly protectionist policies.
Even if we assume this is true, Disney and Netflix are both currently worth more than OpenAI and both rely on the strict enforcement of US copyright law. I do not think it is so obvious which powers that be have the better lobbying efforts and, currently, it's looking like this question will mostly be adjudicated by the courts, not Congress, anyways.
So I think OpenAI, Disney and Netflix win together. Big companies tend to win.
It’s a very fun thing to watch from the sidelines right now, if I’m being honest.
The best language model right now is open source. Let that sink in.
Just weeks later, I discovered that others were pulling off similar calculations! They were making great bread with ease and not having to resort to notebooks and calculators! The horror! I can't believe that said close friend of mine would actually share those highly hydraty mathematical formulas with other humans without first requesting my consent </sarc>.
Could it be, that this stuff just ends up in the dumpster of "sorry you can't patent math" or the like?
A. below is a list of OpenAI initial hires from Google. It's implausible to me that there wasn't quite significant transfer of Google IP
B. Google published extensively, including the famous 'Attention Is All You Need' paper, but OpenAI, despite its name, has not explained the breakthroughs that enabled o1. It has also switched from a charity to a for-profit company.
C. Now this company, with a group of smart, unknown machine learning engineers, presumably paid a fraction of what OpenAI's are, has created a far cheaper model and openly published the weights and many methodological insights, which will be used by OpenAI.
1. Ilya Sutskever – One of OpenAI’s co-founders and its former Chief Scientist. He previously worked at Google Brain, where he contributed to the development of deep learning models, including TensorFlow.
2. Jakub Pachocki – Formerly OpenAI’s Director of Research, he played a major role in the development of GPT-4. He had a background in AI research that overlapped with Google’s fields of interest.
3. John Schulman – Co-founder of OpenAI, he worked on reinforcement learning and helped develop Proximal Policy Optimization (PPO), a method used in training AI models. While not a direct Google hire, his work aligned with DeepMind’s research areas.
4. Jeffrey Wu – One of the key researchers involved in fine-tuning OpenAI’s models. He worked on reinforcement learning techniques similar to those developed at DeepMind.
5. Girish Sastry – Previously involved in OpenAI’s safety and alignment work, he had research experience that overlapped with Google’s AI safety initiatives.
I agree there's hypocrisy but in terms of making a strong argument, you can safely remove your list of persons who (drum roll)... mostly _didn't_ actually work at Google?
Distilling weights from prompts and responses is even more of a legal grey area. The legal system cannot respond quickly to such technological advancements so things necessarily remain a wild west until technology reaches the asymptotic portion of the curve.
In my view the most interesting thing is, do we really need vast data centers and innumerable GPUs for AGI? In other words, if intelligence is ultimately a function of power input, what is the shape of the curve?
If they actually figured out how to use output of existing models to build model that outperforms them then it's something that brings us closer to singularity than every other development so far.
Some may view this as partially true, given that o1 does not output its CoT process.
Whatever that means. The legal system right now is in shambles and flat-footed.
Knowing our current government leadership, I think we’re going to see some brute force action backed up by the United States military.
Even if they didn’t directly, intentionally use o1 output (and they didn’t claim they didn’t, so far as I know), AI slop is everywhere. We passed peak original content years ago. Everything is tainted and everything should be understood in that context.
In relative terms, that's obviously and most definitely true.
In absolute terms, that's obviously and most definitely false.
That's honestly such an academic point, who really cares?
They've been outcompeted and the argument is 'well, if we didn't let people access our models, they would have taken longer to get here'. So what??
The only thing this gets them is an explanation as to why training o1 cost them more than 5 million or whatever, but that is in the past: the datacentre has consumed the energy, and the money has gone up in fairly literal steam.
That being said, breaching OAI's systems, re-training a better model on top of their closed source model, then open sourcing it: that's more Robin Hood than villain, I'd say.
Meanwhile, they have access to Meta models and Qwen. And Meta models are very easy to run and there's plenty of published work on them. Occam's Razor.
IMHO the whole world is becoming crazy for a lot of reasons, and pissing off billionaires makes me laugh.
Cheapening a series of fact-checkable innovations because of the country of origin, when so far all that they have shown are signs of good faith, is paranoid at best and, at worst, propaganda to help the billionaire tech lords save face for their own arrogance.
The word "our" does a lot of heavy lifting in politics[0]. America is not a commune, it's a country club, one which we used to own but have been bought out of, and whose new owners view us as moochers but can't actually kick us out (yet). It is in competition with another, worse country club that purports to be a commune. We owe neither country club our loyalty, so when one bloodies the other's nose, I smile.
[0] Some languages have a notion of an "exclusive we". If English had such a concept, this would be an exclusive our.
Also still an unresolved issue is how they will ever comply with a deletion request, should any model output personal data of someone. They are heavily in a gray area, with regards to what should be allowed. If anything, they should really shut up now.
But DeepSeek didn't use that presumably (since it's secret). They definitely can't argue that using copyrighted material for training is fine, but using output from other commercial models isn't. That's too inconsistent.
https://www.linkedin.com/posts/kevinkeller_deepseek-privacy-...
TikTok is a consumption tool, DS is a productive one. They aren't the same.
Can't really ban what can be downloaded for free and hosted by anyone. There are many providers hosting the ~700B parameter version that aren't CCP aligned.
People made shirts with printouts of the code to RSA under the heading "this shirt is a munition." Apparently such shirts are still for sale, even though they are not classified as munitions anymore.
[1] - https://en.wikipedia.org/wiki/Export_of_cryptography_from_th...
Like music? They banned Napster.
The UAE (where I live, happily, and by choice), which desperately wants to be the center of the world in AI and is spending vast time and treasure to make it happen (they've even got their own excellent, government-funded foundation model), would _love_ this. Any attempt to ban DeepSeek in the US would be the most gigantic self-own. Combine that with no income tax, a fantastic standard of living, and a willingness to very easily give out visas to smart people from anywhere in the world, and I have to imagine it is one of several countries desperate for the US to do something so utterly stupid.
everything is already too late.
The US should only ban DeepSeek (and other Chinese companies) from accessing US frontier models.
The US should only ban DeepSeek (and other Chinese companies) from accessing US frontier models designed and trained by Chinese Americans.
fixed for you.
Google Deepmind's recent Gemini 2.0 Flash Thinking is also priced at the new Deepseek level. It's pretty good (unlike previous Gemini models).
I didn't know that. Is this always the case?
Using data from another model won't save you any training time.
It's...not, and it's repeatedly been proven in practice that this is an invalid generalization because it is missing necessary qualifications, and it's funny that this myth keeps persisting.
It's probably a bad idea to use uncurated output from another AI to train a model if you are trying to make a better model rather than a distillation of the first model, and it's definitely (and, ISTR, the actual research result from which the false generalization has developed) a bad idea to iteratively fine-tune a model on its own unfiltered output, but there has been lots of success using AI models to generate data which is curated and used to train other models, which can be much more efficient than trying to create new material without AI once you've gotten to the point where you've already hoovered up all the readily-accessible low hanging fruit of premade content relevant to your training goal.
Re: "generally a bad idea", I'd just highlight "generally" ;) Clearly it worked in this case!
Ah. So if I understand this... once the internet becomes completely overrun with AI-generated articles of no particular substance or importance, we should not bulk-scrape that internet again to train the subsequent generation of models.
I look forward to that day.
It proves we _can_ optimize our training data.
Just like humans have been genetically stable for a long time, the quality & structure of information available to a child today vs that of 2000 years ago makes them more skilled at certain tasks. Math being a good example.
That is not true at all.
We have known how to solve this for at least 2 years now.
All the latest state of the art models depend heavily on training on synthetic data.
It's not apparently obvious to me that that is the case.
I.e., do you need a SOTA model to produce a new SOTA model?
And just because a model trains on some ChatGPT data, doesn't mean that that data is the majority. It's just another dataset.
If OpenAI trained on the intellectual property of others, maybe it wasn't the creativity breakthrough people claim?
Oppositely
If you say ChatGPT was trained on "whatever data was available", and you say Deepseek was trained "whatever data was available", then they sound pretty equivalent.
All the rough consensus language output of humanity is now roughly on the Internet. The various LLMs have roughly distilled that and the results are naturally going to be tighter and tighter. It's not surprising that companies are going to get better and better at solving the same problem. The situation of DeepSeek isn't so much that it promises future achievements but that it shows that OpenAI's string of announcements is incremental progress that isn't going to reach the AGI that Altman now often harps on.
You need to visit mainland China and see how AI applications are everywhere, from transport to goods shipping.
I'm not surprised at all. I hope this in the end makes the US kill its strict IP laws, which is the problem.
If the US doesn't, China will always have a huge edge on it, no matter how much NVidia hardware the US has.
And you know what, Huawei is already making inference hardware... it won't take them long to finally copy the TSMC tech and flip the situation upside down.
When China can make the equivalent of H100s, it will be hilarious because they will sell for $10 in Aliexpress :-)
Let's be fair though. Replicating TSMC isn't something that could happen quickly. Then again, who knows how far along they already are...
BTW, who in China is doing the best AI on goods shipping since you mention it?
I understand ToS violations can lead to a ban. OpenAI is free to ban DeepSeek from using their APIs.
If your own API can leak your secret sauce without any malicious penetration, well, that's on you.
DDOSing web sites and grabbing content without anyone's consent is not hard-earned at all. They did spend billions on their thing, but nothing was earned as they could never do that legally.
But let's keep the eye on the ball for a second. None of that changes the fact that what was built was a capability to reflect that knowledge in dynamic and deep ways in conversation, as well as image and audio recognition.
And did Deepseek also build that? From scratch? Because they might not have.
One way or another, they were able to create something that has WAY cheaper inference costs than o1 at the same level of intelligence. I was paying Anthropic $15/1M tokens to make myself 10x faster at writing software, which was coming out to $10/day. O1 is $60/1M tokens, which for my level of usage would mean that it costs as much as a whole junior software engineer. DeepSeek is able to do it for $2.50/1M tokens.
Either OpenAI was taking a profit margin that would make the US Healthcare industry weep, or DeepSeek made an engineering breakthrough that increases inference efficiency by orders of magnitude.
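The per-day figures implied by those prices can be checked with a rough sketch. It assumes a single flat rate per million tokens (no input/output split) and constant daily usage inferred from the $15 rate and the $10/day figure; `daily_cost` is a hypothetical helper, not anyone's actual billing formula.

```python
# Rough sketch: scale the quoted $10/day at $15 per 1M tokens to other prices.
# Assumption: one flat price per million tokens and unchanged daily usage.
BASELINE_PRICE_PER_MILLION = 15.0   # USD per 1M tokens (quoted above)
BASELINE_DAILY_COST = 10.0          # USD per day (quoted above)

# Daily usage implied by the two baseline figures, in millions of tokens.
implied_million_tokens_per_day = BASELINE_DAILY_COST / BASELINE_PRICE_PER_MILLION


def daily_cost(price_per_million_usd: float) -> float:
    """Daily spend at the same usage level for a different flat price."""
    return implied_million_tokens_per_day * price_per_million_usd


for label, price in [("o1", 60.0), ("DeepSeek", 2.50)]:
    print(f"{label}: ${daily_cost(price):.2f}/day")
```

Under these assumptions the same usage comes to roughly $40/day at the $60 rate and well under $2/day at the $2.50 rate.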
It's been known for a while that competitors used OpenAI to improve their models, that's why they changed the TOS to forbid it.
That doesn't mean the deep seek technical achievements are less valid.
Well, that's literally exactly what it would mean. If DeepSeek relied on OpenAI’s API, their main achievement is in efficiency and cost reduction as opposed to fundamental AI breakthroughs.
Eventually all future AIs will be produced with synthetic input, the amount of (quality) data we humans can produce is quite limited.
The fact that the input of one AI has been used in the training of another one seems irrelevant.
The deeper question is whether Deepseek has achieved real autonomy or if it’s just a derivative work. If the latter, then OpenAI still holds the keys to future advances. If Deepseek truly found a way to be independent while achieving similar performance, then OpenAI has a problem.
The details of how they trained matter more than the inevitability of synthetic data down the line.
Also, if you read their papers it’s quite clear there are several important engineering achievements which enabled this. For example multi head latent attention.
It’s the same problem with pharmaceuticals and generics. It’s great when the price of drugs is low, but without perverse financial incentives no company is going to burn billions of dollars in a risky search for new medicines.
They had to be cheating.
https://news.ycombinator.com/newsguidelines.html
p.s. yes, that goes both ways - that is, if people are slamming a different country from an opposite direction, we say the same thing (provided we see the post in the first place)
Then all the farmers in the midwest went broke not because they couldn't get their goods to market, but because JP Morgan's consolidated syndicates ate all their margin hauling their goods to market.
Consolidation and monopoly over your competition is always the end goal.
Surely that's only possible when you have a large barrier to entry?
What's going to be that barrier in this case? Because it turns out to be neither training costs/hardware nor secret expertise.
The problem for AI is the hardware is commodified and offers no natural monopoly, so there isn't really anything obvious to vertically integrate-towards-monopoly.
> Similarly, business growth, per se, tells us little about value. It's true that growth often has a positive impact on value, sometimes one of spectacular proportions. But such an effect is far from certain. For example, investors have regularly poured money into the domestic airline business to finance profitless (or worse) growth. For these investors, it would have been far better if Orville had failed to get off the ground at Kitty Hawk: The more the industry has grown, the worse the disaster for owners.
Probably won't be Moore's law (which is kind of slowing down) so much as architectural improvements (both on the compute side and the model side - you could say that R1 represents an architectural improvement of efficiency on the model side).
Well, Steve... I think it’s more like we both had this rich neighbor named Xerox and I broke into his house to steal the TV set and found out that you had already stolen it.
Xerox could be seen as Google, whose researchers produced the landmark Attention Is All You Need paper, and the general public, who provided all of the training data to make these models possible.
As far as I can tell o1 was based on Q-star, which could likely be Quiet-STaR, a CoT RL technique developed at Stanford that OpenAI may have learned about before it got published. Presumably that's why they never used the Q-Star name even though it had garnered mystique and would have been good for building hype. This is just speculation, but since OpenAI haven't published their technique then we can't know if it really was their innovation.
I don't understand, you mean OpenAI isn't releasing open models and openly publishing their research?
No, they aren't (other than whisper).
Their "papers" are closer to marketing materials. Very intentionally leaving out tons of technical information.
I claim one just can't put the humor/hypocrisy aside that easily.
What OpenAI did with the release of ChatGPT is productize research that was open and ongoing, with Deepmind and others leading at least as much. And everything after that was an extension of the basic approach - improved, expanded but ultimately the same sort of beast. One might even say the situation of OpenAI to DeepMind was like Apple to Xerox. Productizing is nothing to sneeze at - it requires creativity and work to productize basic research. But naturally you get end-users who consider the productizers the "fountain heads", who overestimate the productizers because products are all they see.
They simply were the first to aggressively pursue scaling the transformer to the extent that is normal for the industry today. Although this has proven to produce interesting results, “simply adding scale” is, in my view, the least interesting development in modern ML. Giving credit where it’s due, they MAY have popularized the RLHF methodology, but I don’t recall them inventing that either?
(feel free to point out any of the above that I falsely attributed to NOT OpenAI.)
Additionally, I seem to remember an interview with Altman circa late ‘21 where he explains the spirit of “OpenAI” and how their only goal is pursuing AGI, and “should someone else come up with a more promising path to get there, we would stop what we’re doing and help them”. I couldn’t find a reference to this interview, but anyone else, please feel free to share (I think it was a youtube link). Fast forward to 2025 and now “OpenAI” is the least open large contributor and indiscernible from your run-of-the-mill AI/ML valley startup insofar as they’re referring to others as “competitors” as opposed to collaborators. Interesting times…
But if you leave someone in the tech industry of SV/SF long enough, they'll start to get high on their own supply and think they're entitled to insane amounts of value, so...
It potentially cost the whole field in terms of innovation. For OpenAI specifically, they now need to scramble to come up with a differentiated business model that makes sense in the new landscape and can justify their valuation. OpenAI’s valuation is based on being the dominant AI company.
I think you misread my comment if you think my feelings are somehow hurt here.
I don't see how, and you're not explaining it. If the models had been public this whole time, then... they would be protected against people publishing derivative models?
> I think you misread my comment if you think my feelings are somehow hurt here.
Not you, but most HNers got emotionally attached to their promise of openness, like they were owed some personal stake in the matter.
Are you suggesting that if OpenAI published their models, they would still want to prevent derivative models? You take the "I wish OpenAI was actually open" and add your own restriction?
Or do you mean that them publishing their models and research openly would not have increased innovation? Because that's quite a claim, and you're the one who has to explain your thinking.
I am not in the field, but my understanding is that ever since the PaLM paper, research has mostly been kept from the public. OpenAI's money making has been a catalyst for that right? Would love some more insight.
The comments were moved here by dang from a flagged article with an editorialized/clickbait title. That flagged post has 1300 points at the time of writing.
https://news.ycombinator.com/item?id=42865527
1.
It should be incumbent on the moderator to at least consider that the motivation for the points and comments may have been because many thought the "hypocrisy" of OpenAI's position was a more important issue than OpenAI's actual claim of DeepSeek violating its ToS. Moving the comments to an article that buries the potential hypocrisy issue that may have driven the original points and comments is not ideal.
2.
This article is from FT, which has a content license deal with OpenAI. Moving the comments to an article from a company that has a conflict of interest due to its commercial relations with the YC company in question is problematic here, especially since dang often states they try to be more hands-off on moderation when the article is about a YC company.
3.
There is a link by dang to this thread from the original thread, but there should also be a link by dang to the original thread from here as well. Why is this not the case?
4.
Ideally, dang should have asked for a more substantial submission that prioritized the hypocrisy point to better match the spirit of the original post instead of moving the comments to this article.
Just don't set the expectations bar too high to start with, is all I am saying. Folks that get so high up money- and power-wise aren't nice people, period. Even if a nice, normal guy without any sociopathic traits suddenly shot up that high, the environment and pressures would deform him pretty quickly.
Also, I would consider only some leaked private conversations with close people as representative truth, not some PR statements carefully crafted by a team of experts.
Happy to be proven wrong, still waiting for an example #1 to give me some hope.
The most famous example would be him contacting ScarJo's agent to hire her to provide her voice for their text-to-speech bot, them being told to go pound sand, and doing it anyway, and then lying about it (which they got away with until her agent released a statement saying they'd approached her and she told them to fuck off).
To my understanding, this is not true. The "Sky" voice was based on a real voice actor they had hired months before contacting Johansson, with the casting call not mentioning anything about sounding like Johansson. [0]
I think it's plausible that they noticed some similarity and that's what prompted them to later reach out to see if they could get Johansson herself, but it's not Johansson's voice and does not appear to be someone hired to sound like her.
Everyone is standing on the shoulders of giants.
But IANAL, so if you have a citation that says otherwise I'd be happy to see it!
You just need to read OpenAI’s arguments about why TOS and copyright laws don’t apply to them when they’re training on other people’s copyrighted and TOS protected data and running roughshod over every legal protection.
For actual, legally binding consent, you'll need to make some real effort to make sure the consumer understands what they are agreeing to.
And this analogy isn't particularly good. Napster was the server, not the product. Whether you got XYZ from Napster or wherever else doesn't matter, because it's the product that you are after, not the way to get the product.
Like you've put, it's still a somewhat gray area, and I personally have nothing against them (or anyone else) using copyrighted content to train models.
I do find it annoying that they're so closed-off about their tech when it's built on the shoulders of openness and other people's hard work. And then they turn around and throw hissy fits when someone copies their homework, allegedly.
Actually, unless the law changes, this is pretty settled territory in US law. The output of AIs is not copyrightable, and is therefore in the public domain. The only legal avenue of attack OpenAI has is a Terms of Service violation, which is a much weaker breach than copyright, if it is even true.
According to a quick Google search, the human body consumes ~145W of power on average (eating 3000 kcal/day). The brain needs ~20% of that, so ~29W. Much less than our current designs of software & (especially) hardware for AI.
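The conversion behind those figures can be sketched in a few lines, assuming 1 kcal = 4184 J and taking the 20% brain share at face value. `average_power_watts` is an illustrative helper name.

```python
# Sketch: convert a daily food-energy intake into average power draw.
# Assumptions: 1 kcal = 4184 J; the 20% brain share is the figure quoted above.
KCAL_TO_JOULES = 4184.0
SECONDS_PER_DAY = 24 * 60 * 60


def average_power_watts(kcal_per_day: float) -> float:
    """Average power in watts (joules per second) for a given daily intake."""
    return kcal_per_day * KCAL_TO_JOULES / SECONDS_PER_DAY


body_watts = average_power_watts(3000)
brain_watts = 0.20 * body_watts
print(f"body: {body_watts:.1f} W, brain: {brain_watts:.1f} W")
```

Since a watt is already a rate (joules per second), the result is an average draw of about 145 W for the body and about 29 W for the brain.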
In a way this is something most companies have been doing with their smaller models, DeepSeek just supposedly* did it better.
So say DS had simply published a paper outlining the RL technique they used, and one of Meta, Google or even OpenAI themselves had used it to train a new model, don't you think they'd have shouted off the rooftops about a new breakthrough? The fact that the provenance of the data is from a rival's model does not negate the value of the research IMHO.
IANAL but it seems to me that OpenAI wouldn’t be able to claim their outputs are IP since they are AI-generated. It may be against their TOS, meaning OpenAI could refuse to provide service to DeepSeek in the future, but they can’t really sue them.
[1]: https://www.reuters.com/legal/ai-generated-art-cannot-receiv...
I don't think OpenAI should be able to make any claims of IP for the AI generated outputs, since they based that on other work, partially copyrighted work, which they hide. They simply throw algorithms at data that is not their data to begin with.
If I steal something, keep the exact thing I stole hidden, and sell a product, that I could only have made, based on the stolen thing, how can I expect that to be even legal, let alone untouchable IP?
I think way too many people have seen too many dollar signs in front of their eyes. The whole thing is outrageous. If they were transparently proving, that they are using open data sets, adhering to licenses, then they would get to claim IP.
Well, they can argue that, if they're fine with being hypocrites.
They're hypocrites.
The US ruled that the AI cannot be the author. That doesn't mean, as so many clickbait articles suggest, that no AI products can be copyrighted.
One activist tried to get the US Copyright Office to acknowledge his LLM as the author, who would then provide him a license to the work.
There was no issue with himself being the original author and copyright holder of the AI works. But that's not what was being challenged.
> We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models
No one is training on indiscriminate synthetic data. It's very much discriminated, but still synthetic.
Like I’ve said time and time again, nobody in this space gives a fuck about anyone that isn’t directly contributing money to their bottom line at that particular instant. The fundamental idea is selfish, damages the fundamental machinery that makes the internet useful by penalizing people that actually make things, and will never, ever do anything for the greater good if it even stands a chance of reducing their standing in this ridiculously overhyped market. Giving people free access to what is for all intents and purposes a black box is not “open” anything, is no more free (as in speech) than Slack is, and all of this is obviously them selling a product at a huge loss to put competing media out of business and grab market share.
IMO, it would look bad for OpenAI to push strongly with this story, it would look like they're losing the technological edge and are now looking for other ways to make sure they remain on top.
Since they have no intellectual property rights in the output, it's not clear to me they have a cause of action to sue over how the output is used.
I wonder if any lawyers have written about this topic.
But in all reality I'm happy to see this day. The fact that OpenAI ripped off everyone and everything they could and, to this day pretend like they didn't, is fantastic.
Sam Altman is a con and it's not surprising that given all the positive press DeepSeek got that it was a full court assault on them within 48 hours.
Part of it is the form of the comment, btw - that one was entirely a sequence of indignation tropes.
What is definitely true is that there are already other providers offering DeepSeek R1 (e.g. on OpenRouter[1]) for $7/m-in and $7/m-out. Meanwhile OpenAI is charging $15/m-in and $60/m-out. So already you're seeing at least 5x cheaper inference with R1 vs O1, with a bunch of confounding factors. But it is hard to say anything truly concrete about efficiency, since OpenAI does not disclose the actual compute required to run inference for O1.
[1] the cheapest I've found, which also happens to run in the EU, is https://studio.nebius.ai/ at $0.8/million input.
Edit: I just saw that openrouter also now has nebius
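How large the price gap looks depends on the mix of input and output tokens. A rough sketch using only the list prices quoted above, with `output_share` as a hypothetical knob for the fraction of billed tokens that are output (it ignores differences in how many reasoning tokens each model emits):

```python
# Sketch: blended price ratio between o1 and a hosted R1 for a given token mix.
# Prices are the per-million figures quoted above; output_share is hypothetical.
O1_IN, O1_OUT = 15.0, 60.0
R1_IN, R1_OUT = 7.0, 7.0


def blended_price(in_price: float, out_price: float, output_share: float) -> float:
    """Average price per 1M tokens when output_share of the tokens are output."""
    return (1.0 - output_share) * in_price + output_share * out_price


def o1_to_r1_ratio(output_share: float) -> float:
    """How many times more expensive o1 is than R1 for this token mix."""
    return blended_price(O1_IN, O1_OUT, output_share) / blended_price(
        R1_IN, R1_OUT, output_share
    )


for share in (0.0, 0.25, 0.5, 1.0):
    print(f"output share {share:.2f}: {o1_to_r1_ratio(share):.2f}x")
```

With these prices the ratio runs from about 2.1x for input-only traffic to about 8.6x for output-only traffic, and it crosses 5x once a bit under half of the tokens are output.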
Also possibly b/c of their sweetheart deal with Azure they've never needed to negotiate enterprise pricing so they're probably calculating margins based on GPU list prices or something insane like that.
Source:
> If not sold within a year, the law would make it illegal for web-hosting services to support TikTok, and it would force Google and Apple to remove TikTok from app stores — rendering the app unusable with time.
https://www.npr.org/2024/04/24/1246663779/biden-ban-tiktok-u...
So no, it doesn't use your exact location, it just uses the censorship mechanisms that Apple and Google gracefully provide.
I'm not going to take a side on whether there is or not.
But, it does sound reminiscent of the reasons used to ban Tik-tok.
Lot of hoops when there's simply other models to utilize publicly
Prompts, they said, were unlikely to be enough to satisfy the requirement of a human controlling the expressive elements, thus most AI output today is probably not copyrightable.
https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...
Prompts alone.
But there are almost no cases of "Prompts Alone" products seeking copyright.
Even, what, 3-4 years ago, AI tools moved onto a collaborative footing. Novel AI forces a collaborative process (and gives you output that can demonstrate your input, which is nice). ChatGPT effectively forces it due to limited memory.
There was a case, posted here to ycombinator, where a Chinese judge held that "significant" human interaction was involved when a user made 20-odd adjustments to their prompt, iterating over produced images, and then added a watermark to the result. I would be very surprised if most sensible jurisdictions didn't follow suit.
Midjourney and ChatGPT already include tools to mask and identify parts of the image to be regenerated. And multiple image generators allow dumb stuff like stick figures and so forth to stand in as part of an uploaded image prompt.
And then there's AI voice, which is another whole bag of tricks.
>thus most AI output today is probably not copyrightable.
Unless it was worked on even slightly as above. In fact it would be hard to imagine much AI work that isn't copyrightable. Maybe those facebook pages that just prompt "Cyberpunk Girl" and spit out endless variations. But I doubt copyright is at the forefront of their mind.
And don't forget the billions in partnerships...
I can see why people refuse to pay taxes.
It seems like the difference between someone doing a better writeup of (say) Wiles's proof vs. proving Fermat's Last Theorem independently.
This is already evident with CNSA/NASA, Huawei/Android, TikTok/Western social media. The Western tech gets mothballed because we stick our heads in the sand and pretend we are undisputed leaders of the world in tech, whereas it is slowly becoming disputable.
> Western tech gets mothballed because we stick our heads in the sand and pretend we are undisputed leaders of the world in tech, whereas it is slowly becoming disputable.
I am hearing Chinese tech is now the best and they achieved it with banning things left and right.
So you can think of training as CI+TEST_ENV and inference as the cost of running your PROD deployments.
Generally in traditional IT infra PROD >> CI+TEST_ENV (10-100 to 1)
The ratio might be quite different for LLM, but still any SUCCESSFUL model will have inference > training at some point in time.
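A back-of-the-envelope way to see when inference spend overtakes training spend: the sketch below uses made-up illustrative numbers, not figures for any real model, and `breakeven_tokens` is a hypothetical helper.

```python
# Sketch: how many served tokens it takes for cumulative inference cost
# to equal a one-off training cost. All numbers here are illustrative.
def breakeven_tokens(training_cost_usd: float,
                     inference_cost_per_million_usd: float) -> float:
    """Tokens served at which total inference cost equals the training cost."""
    return training_cost_usd / inference_cost_per_million_usd * 1_000_000


# Hypothetical example: $5M training run, $1 of compute per 1M tokens served.
tokens = breakeven_tokens(5_000_000, 1.0)
print(f"break-even after {tokens:.2e} tokens")
```

Past that volume the PROD side dominates, which is the sense in which any heavily used model ends up with inference > training.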
I think you're making assumptions here that don't necessarily have to be universally true for all successful models. Even without getting into particularly pathological cases, some models can be successful and profitable while only having a few customers. If you build a model that is very valuable to investment banks, to professional basketball teams, or some other much more limited group than consumers writ large, you might get paid handsomely for a limited amount of inference but still spend a lot on training.
In the U.S. there are 350 million people and the EU has 520 million people (excluding Russia and Turkey).
China alone has 1.4 billion people.
Since there is a language barrier and China isolates themselves pretty well from the internet, we forget that there is a huge society with high focus on science. And most of our tech products are coming from there.
There are some clues that their population count isn't accurate and would be closer to 1.2 billion in reality, not that it changes the conclusion.
So about as many as US has adults.
Many countries in Latin America have better relations and more robust trade partnerships with China.
As for the EU, I think it will be great for it to shed its reliance on the US, and act more independently from it.
Trump has already managed to completely destroy the US reputation within basically the entire continent¹. And he seems intent on creating a commercial war against all the countries here too.
1 - Do not capture and torture random people on the street if you want to maintain some goodwill. Even if you have reasons to capture them.
Scary times.
Not to mention post-training. Their novel GRPO technique used for preference optimization / alignment is also much more efficient than PPO.
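For the curious, the core idea of GRPO as described in DeepSeek's papers is to score each sampled answer relative to the other answers sampled for the same prompt, instead of training a separate value model as PPO does. A minimal sketch of just that normalization step follows; the function name, the `eps` term, and the use of population standard deviation are assumptions made here, and the clipped policy loss and KL penalty are omitted.

```python
# Sketch of GRPO's group-relative advantage: normalize each reward against
# the mean and standard deviation of its own sampling group.
# Assumptions: population std and a small eps to avoid division by zero.
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Advantage of each sampled answer relative to its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four answers sampled for one prompt, rewarded 1.0 if correct and 0.0 if not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Because the baseline comes from the group itself, no critic network has to be trained or kept in memory, which is where the efficiency gain over PPO is usually attributed.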
That's a funny analogy, but in reality DeepSeek did reinforcement learning to generate chain of thought, which was used in the end to finetune LLMs. The RL model was called DeepSeek-R1-Zero, while the SFT model is DeepSeek-R1.
They might have bootstrapped the Zero model with some demonstrations.
> DeepSeek-R1-Zero struggles with challenges like poor readability, and language mixing. To make reasoning processes more readable and share them with the open community, we explore DeepSeek-R1, a method that utilizes RL with human-friendly cold-start data.
> Unlike DeepSeek-R1-Zero, to prevent the early unstable cold start phase of RL training from the base model, for DeepSeek-R1 we construct and collect a small amount of long CoT data to fine-tune the model as the initial RL actor. To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators.
What I'm saying is that in the media it's being portrayed as if DeepSeek did the same thing OpenAI did 27x cheaper, and the outsized market reaction is in large part a response to that narrative. The reality is more that being a fast-follower is cheaper (and the concrete reason is e.g. being able to source training data from prior LLMs synthetically, among other things), which shouldn't have surprised anyone and is just how technology in general trends.
The achievement of DeepSeek is putting together a competent team that excels at end-to-end implementation, which is no small feat and is promising wrt/ their future efforts.
This question is malformed, imo. Every lab is doing derivative work. OpenAI didn’t invent transformers, Google did. Google didn’t invent neural networks or back propagation.
If you mean whether OAI could have prevented DS from succeeding by cutting off their API access, probably not. Maybe they used OAI for supervised fine tuning in certain domains, like creative writing, which are difficult to formally verify (although they claim to have used one of their own models). Or perhaps during human preference tuning at the end. But either way, there are many roads to Rome, and OAI wasn’t the only game in town.
Point is, those future advances are worthless. Eventually anybody will be able to feed each other's data for the training.
There's no moat here. LLMs are commodities.
Contracts are enforceable to the degree to which you can pay lawyers to enforce them.
I will run out of money trying to enforce my terms of service against openAI, while they have a massive war chest to enforce theirs.
Ain’t libertarianism great?
For example, my digital garden is under GFDL, and my blog is CC BY-NC-SA. IOW, they can't remix my digital garden under any license other than GFDL, and they have to credit me if they remix my blog, and can't use it for any commercial endeavor, which OpenAI certainly does now.
So, by scraping my webpages, they agree to my licensing of my data. So they're de-facto breaching my licenses, but they cry "fair-use".
If I tell them that they're breaching the license terms, they'd laugh at me, and maybe give me 2 cents of API access to mock me further. When somebody allegedly uses their API with their unenforceable ToS, they scream like an agitated cockatoo (which is an insult to the cockatoo, BTW. They're devilishly intelligent birds).
Drinking their own poison was mildly painful, I guess...
BTW, I don't believe that Deepseek has copied/used OpenAI models' outputs or training data to train theirs, even if they did, "the cat is out of the bag", "they did something amazing so they needed no permissions", "they moved fast and broke things", and "all is fair-use because it's just research" regardless of how they did it.
Heh.
If the fair use defense holds up, they didn't need a license to scrape your webpage. A contract should still apply if you only showed your content to people who've agreed to it.
> and "all is fair-use because it's just research"
Fair use is a defense to copyright infringement, not breach of contract. You can use contracts, like NDAs, to protect even non-copyright-eligible information.
Morally I'd prefer what DeepSeek allegedly did to be legal, but to my understanding there is a good chance that OpenAI is found legally in the right on both sides.
The ban was on exporting the code, not having the code in possession.
Furthermore it was only the US who had this ban.
I am old enough to remember this and the scoffing that European PGP users had towards their American counterparts
This is immediately obvious if you look at it through a statistical learning lens and not the mystical crystal ball that many view NNs through.
"Play and reflection" is something else, which isn't distillation.
Given this, there’s no reason why it could not be trivial to produce a child model from (filtered) parent output that exceeds the parent model on a different, more meaningful objective like being a useful chatbot. There's no reason why this would have to be limited to domains with verifiable answers either.
It is not distillation. It's like how you can arrive at new knowledge by reflecting on existing knowledge.
Unfiltered? Sure. With human curation of the generated data it certainly can. (Even automated curation can do this, though it's more obvious that human curation can.)
I mean, I can randomly generate fact claims about addition and, if I curate which ones go into a training set, train a model that reflects addition of integers much more accurately than the random process which generated the pre-curation input data.
Without curation, as I already said, the best you get is a distillation of the source model, which is highly improbable to be more accurate.
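To make the addition example concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the function names are made up, the "curator" is a simple rule check standing in for a human reviewer, and no model is actually trained. It only shows that the curated set is far more accurate than the random process that produced it.

```python
import random

def random_addition_claims(n, rng, max_operand=20):
    """Generate n claims 'a + b = c' where c is a random guess, so most are wrong."""
    claims = []
    for _ in range(n):
        a = rng.randint(0, max_operand)
        b = rng.randint(0, max_operand)
        c = rng.randint(0, 2 * max_operand)  # guessed sum, independent of a and b
        claims.append((a, b, c))
    return claims

def is_correct(claim):
    """The curator: a rule check here; a human reviewer would play the same role."""
    a, b, c = claim
    return a + b == c

def accuracy(claims):
    """Fraction of claims that are true; 0.0 for an empty list."""
    if not claims:
        return 0.0
    return sum(is_correct(c) for c in claims) / len(claims)

rng = random.Random(0)
raw = random_addition_claims(10_000, rng)
curated = [c for c in raw if is_correct(c)]  # only curated claims enter the training set

print(f"raw accuracy:     {accuracy(raw):.3f}")
print(f"curated accuracy: {accuracy(curated):.3f}")
```

The new information comes from the curator, not from the generator, which is why the uncurated case can at best reproduce the source.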
That is the existential, $1T question.
Also, can I have some money to build more data centres pls?
I said generally because there are things like adversarial training that use a ruleset to help generate correct datasets that work well. Outside of techniques like that it's not just a rule of thumb, it's always true that training on the output of another model will result in a worse model.
https://www.scientificamerican.com/article/ai-generated-data...
Not convincing.
You can imagine a model doing some primitive thinking and coming to a conclusion. Then you can train another model on the summaries. If everything goes well, it will come to conclusions quicker. That's at the least. Or it may be able to solve more complex problems with the same amount of 'thinking'. It will be self-propelled evolution.
Another option is to use one model to produce the 'thinking' part from known outputs. Then train another to reproduce the thinking to get the right output, unknown to it initially. Using humans to create such a dataset would be slow and very expensive.
PS: if it were impossible, humans would still be living in the trees.
These models don't evolve like that; there is no random process of architectural evolution. Nor is there a fitness function anything like "get better at math."
A system like AlphaZero works because it has rules to use as an oracle: the game rules. The game rules provide the new training information needed to drive the process. Each game played produces new correct training data.
These LLMs have no such oracle. Their fitness function is and remains: predict the next word, followed by: produce text that makes a human happy. Note that it's not "produce text that makes ChatGPT happy."
Yes, there is the question how much ChatGPT data DeepSeek has ingested. Certainly not zero! But if DeepSeek has achieved iterative self-improvement, that'd be huge too!
Even if o1 specifically was used (which is in itself doubtful), it does not mean that this was the main reason r1 succeeded, or that it could not have happened without it. The o1 output hides the CoT part, which is the most important part here. Also, we are in 2025; scratch does not exist anymore. Creating better technology by building upon previous (widely available) technology has never been a controversial issue.
who cares. even if the claim is true, does that make the open source model less attractive?
in fact, it implies that there is no moat in this game. openai can no longer maintain its stupid valuation, as other companies can just scrape its output and build better models at much lower costs.
everything points to the exact same end result - DeepSeek democratized AI, OpenAI's old business model is dead.
Yes! Because whether they reproduced those capabilities independently or copied them by relying on downstream data has everything to do with whether they're actually state of the art.
The US Copyright Office's determination was based on the simple analogy of someone hiring someone else to create a work for them. The person hiring, even if they offer suggestions and veto results, is not contributing enough to the expression and therefore has no right to claim copyright themselves.
If you stand behind a painter and tell them what to do, you don't have any claim to copyright as the painter is still the author of the expression, not you. You must have a hand in the physical expression by painting yourself.
But then
>You must have a hand in the physical expression by painting yourself.
You contradict yourself. Novel AI will literally highlight your contributions separately to the AI so you can prove you also painted. Image generators literally let you paint over the top to select AI boundaries.
Though I’m not suggesting an import ban on DeepSeek would be effective either. Just that the US does have precedent for pulling these kinds of stunts.
You can also look at the 90s subculture for passing DeCSS code (a tool for breaking DVD encryption) to see another example of how people wilfully skirted these kinds of stupid legal limitations.
https://en.m.wikipedia.org/wiki/DeCSS
So if you were to ask me if a ban on DeepSeek would work, the answer is clearly “no”. But that doesn’t mean it’s not going to happen. And if it does, the only people hurt are legitimate US businesses who might get a benefit from DeepSeek but have to follow the law. Those of us outside of America will be completely unaffected. Just like we were when the US tried to limit the distribution of PGP.
As for humans, assuming genetically they have the same intellectual abilities, you can see the difference in development of different groups. It's mostly defined by training the better next generation. Schools are exactly for this.
Hmm, I think the narrative of the rise of LLMs is that once the output of humans has been distilled by the model, the human isn't necessary.
As far as I know, DeepSeek adds only a little to the transformers model while o1/o3 added a special "reasoning component" - if DeepSeek is as good as o1/o3, even taking data from it, then it seems the reasoning component isn't needed.
Distillation is a term of art in AI and it is fundamentally incorrect to talk about distilling human-created data. Only an AI model can be distilled.
https://en.m.wikipedia.org/wiki/Knowledge_distillation#Metho...
It seems clear that the term can be used informally to denote the boiling down of human knowledge, indeed it was used that way before AI appeared in the popular imagination.
- v2/v3 (not r1) seem to be cloned from o1/4o output, and perform worse (this cost the oft-repeated 5ish mm USD)
- r1 is specifically a reasoning step (using RL) _on top of_ v2/v3 and performs similarly to o1 (the cost of this is _not reported anywhere_)
- In the o1 blog post, they specifically say they use RL to add reasoning to LLMs: https://openai.com/index/learning-to-reason-with-llms/
I did not think this, nor did I think this was what others assumed. The narrative, I thought, was that there is little point in paying OpenAI for LLM usage when a much cheaper, similar / better version can be made and used for a fraction of the cost (whether it's on the back of existing LLM research doesn't factor in)
If the narrative is actually that DeepSeek can only reach whatever heights OpenAI has already gotten to with some new tricks, then markets will probably refocus on OpenAI's innovations and price things accordingly, even if the initial cost is huge. It also means OpenAI probably needs a better moat to protect its interests.
I'm not sure where the reality is exactly, but market reactions so far have basically followed that initial narrative and now the rebuttal.
The latter could be a one-time thing, and/or OpenAI could still use their financial might to leverage those innovations and get even better with them.
However, the former destroys their business model and no amount of intelligence and innovation from OpenAI protects them from being copied at a fraction of the cost.
How do you know this?
> If the narrative is actually that DeepSeek can only reach whatever heights OpenAI has already gotten to with some new tricks, then markets will probably refocus on OpenAI's innovations and price things accordingly
Why? If every innovation OpenAI is trying to keep as secret sauce becomes commoditized quickly and cheaply, then why would markets care about any innovations they have? They will be unable to monetize them.
That's what I thought and assumed. This is the narrative that's been running through all the major news outlets.
It didn't even occur to me that DeepSeek could have been training their models using the output of other models until reading this article.
But HOW they are necessary is the change. They went from building blocks to stepping stones. From a business standpoint that's very damaging to OAI and other players.
The first words of Disney's Aladdin (1992) are spoken by the *Peddler*, the mysterious merchant at the beginning of the film. He says:
"Ah, Salaam and good evening to you, worthy friend. Please, please, come closer..."
He then continues with: "Too close! A little too close. There. Welcome to Agrabah. City of mystery, of enchantment, and the finest merchandise this side of the River Jordan, on sale today! Come on down!"
This opening sets the stage for the story, introducing the magical and bustling world of Agrabah.
"stole"?
They consumed publicly available material on the Internet
I am no fan of these billionaire capitalists and their henchpersons, but condemn them for their multitude of sins.
Consuming publicly available Internet resources is not one of them. IMO
I agree that there are some important distinctions and word-choices to be made here, and that there are problems with equating training to "stealing", and that copyright infringement is not theft, etc.
That said, if you zoom out to the overall conduct, it's fair to argue that the companies are doing something unethical, the same as if they paid an army of humans to memorize other people's work and then regurgitate slightly-reworded copies.
I would use the analogy of those humans learning from the material. Like reading books in the library
"regurgitate slightly-reworded copies" in my experience using LLMs (not insubstantial) that is an unfairly pejorative take on what they do
Being on the internet doesn't make it yours, or acceptable to take. In the case of OpenAI (and Anthropic) they should be following the long-held principle of the robots.txt file on sites, which can be specifically set to tell just them that they may not take your content - they openly ignore that request.
OpenAI absolutely is stealing from everyone, hence why most will have little sympathy when they complain someone stole from them.
However I think the reality is that there's only so much coal to be mined, as far as LLM training goes. When we're at "very diminishing returns", SoC/Apple/TSMC-CPU innovations will deliver cheap inference. We only really need an M4 Ultra with 1TB RAM to hollow out the hardware-inference-supplier market.
Very easy to imagine a future where Apple releases an "Apple Intelligence Mac Studio" with the specs for many businesses to run arbitrary models.
The marginal cost of software is zero. You need some kind of perceived advantage to get people to pay for it. This isn't hard, as most people will pay a bit for big-name vs "free". That could change as more open source apps become popular by being awesome.
Speculations aside, from what I understood, something like this shouldn't hold a drop of water under fair-use doctrine, because there's a disproportional damage, plus a huge monopolistic monetary gain because of what they did and how they did.
On the other hand, I don't believe that Deepseek used OpenAI (in any capacity or way or method) to develop their models, but again, it doesn't matter how they did it in the current conjuncture.
What they successfully did was to upset a bunch of high level people, regardless of the technical things they achieved.
IMHO, AI war has similar dynamics to MAD. The best way is not to play, but we are past the Rubicon now. Future looks dirty.
"Something like this" as in what DeepSeek allegedly did, or the web-scraping done by both of them?
For what DeepSeek allegedly did, OpenAI wouldn't have a copyright infringement case against them because the US copyright office determined that AI-generated content is not protected by copyright - and so there's no need here for DeepSeek to invoke fair use. It'll instead be down to whether they agreed to and breached OpenAI's contract.
For the web-scraping it's more complicated. Fair use is determined by the weighing of multiple factors - commercial use and market impact are considered, but do not alone preclude a fair use defense. Machine learning models do seem, at least to me, highly transformative - and "the more transformative the new work, the less will be the significance of other factors".
Additionally, since the market impact factor is the effect of the use of the copyrighted work on the market for that work, I'd say there's a reasonable chance it does not actually include what you may expect it to. For instance if you're a translator suing Google Translate for being trained on your translated book, the impact may not be "how much the existence of Google Translate reduced my future job prospects" nor even "how many fewer people paid for my translated book because of the existence of Google Translate" but rather "how many fewer people paid for my translated book than would have had that book been included in the training data" - which is likely very minor.
The Chinese Communist party very much sees itself in a global rivalry over "new productive forces". That's official policy. And US leadership basically agrees.
The US is playing dirty by essentially embargoing China over big AI - why wouldn't it occur to them to retaliate by playing dirtier?
I mean we probably won't know for sure, but it's much less far fetched than a lot of other speculation in this area.
E.g., R1's cold start training could probably have benefited quite a bit from having access to OpenAI's chain of thought data for training. The paper is a bit light on detail on how it was made.
interestingly, that actually makes the CCP the largest political party pursuing state capitalism.
there won't be any competition between China and the US if the CCP is indeed a communist party as we all know full well that communism doesn't work at all.
Of course this is China so the government can and does intervene at will, but alleging that this required CIA level state espionage to pull off is alien crash levels of implausible. They open sourced the entire thing and published incredibly detailed papers on how they did it!
This is not America. Your ideas do not apply the same way.
I'm European, are you talking about Microsoft, Google, or OpenAI?
Are you trolling me?
The evidence supporting offensive hacking is abundant in recent history; the number of things which have been learned from alien crash data is surely smaller by comparison to the number of things which have been learned from offensive hacking.
e.g. China
Then, you’ve asserted that was the original claim.
Sneaky! But that’s how “arguments” on HN are “won”.
>> The idea that they used o1's outputs for their distillation further shows that models like o1 are necessary.
> Hmm, I think the narrative of the rise of LLMs is that once the output of humans has been distilled by the model, the human isn't necessary.
If deepseek was produced through the distillation (term of art) of o1, then the cost of producing deepseek is strictly higher than the cost of producing o1, and can't be avoided.
Continuing this argument, if the premise is true then deepseek can't be significantly improved without first producing a very expensive hypothetical o1-next model from which to distill better knowledge.
That is the argument that is being made. Please avoid shallow dismissals.
Edit: just to be clear, I doubt that deepseek was produced via distillation (term of art) of o1, since that would require access to o1's weights. It may have used some of o1's outputs to fine tune the model, which still would mean that the cost of training deepseek is strictly higher than training o1.
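For readers unsure about the distinction being drawn here, a rough sketch follows. It assumes the classic formulation of knowledge distillation, where the student is trained to match the teacher's full (temperature-softened) output distribution, so it needs the teacher's logits, which you have if you hold the model and which plain text output does not give you. Fine-tuning on outputs only needs the tokens the teacher emitted. The function names and toy numbers are made up for illustration and say nothing about what either company actually did.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions; needs teacher logits."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def output_finetune_loss(sampled_token, student_logits):
    """Cross-entropy on one token the teacher emitted; needs only its text output."""
    q = softmax(student_logits)
    return -math.log(q[sampled_token])

teacher = [3.0, 1.0, 0.0]   # toy logits over a 3-token vocabulary
student = [0.5, 0.5, 0.5]

print(distillation_loss(teacher, student))   # uses the whole teacher distribution
print(output_finetune_loss(0, student))      # uses only the emitted token (index 0)
```

The first loss carries information about every token the teacher considered; the second only sees what was actually sampled, which is the weaker signal available to someone working from API responses.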
Yeah, your technical point is kind of ridiculous here, given that in all my uses of distillation (and in the comment I quoted), distillation is used in the informal sense and there's no allegation that DeepSeek could have been in possession of OpenAI's model weights, which is what's needed for your "Distillation (term of art)".
Seems reasonable they would be investing heavily in placing state assets within OpenAI so they can copy the models.
'Can't have your data going to China'
'Can't allow companies that do censorship aligned with foreign nations'
'This company violated our laws and used an American company's tech for their training unfairly'
And the government choosing winners.
'The government in announcing 500 billion going to these chosen winners, anyone else take the hint, give up, you won't get government contracts but will get pressure'.
Good thing nobody is making these sorts of arguments today.
Sure US economic power has a long reach right now because of the importance of the dollar etc - but the more it uses that to bully, the more countries are making sure they are independent.
I suspect the "it ain't training costs/hardware" bit is a bit exaggerated since it ignores all the prior work that DeepSeek was built on top of.
But, if all else fails, there's always the tried-and-true approaches: regulatory capture, industry entrenchment, use your VC bucks to be the last one who can wait out the costs the incumbents do face before they fold, etc.
How does it ignore it? The success of Deepseek proves that training costs/hardware are definitely NOT a barrier to entry that protects OpenAI from competition. If anyone can train their model with ChatGPT for a fraction of the cost it took to train ChatGPT and get similar results, then how is that a barrier?
That’s not to say they lie about everything, obviously the thing works amazingly well. The cost is understated by 10x or more, which is still not bad at all I guess? But not mind blowing.
As you grow bigger, you create barriers to entry where none existed before, whether intentionally or unintentionally.
Also barriers to entry aren't the only way to get a consolidated market anyway.
No I’m not trolling you.
However, there is no evidence this is how the data was obtained. Zero, zilch.
So it's a useless statement which only plays on people's bias against their hated nation state du jour.
Also, scraping most of the websites they scrape is not allowed either; they do it anyways.
Looking back on the article, it is indeed using "distillation" as a special/"term of art" but not using it correctly. IE, it's not actually speculating that DeepSeek obtained OpenAI's weights and distilled them down but rather that it used OpenAI's answers/output as a starting point (for which there is a different method/"term of art").
But now we have a bully in the White House who seems to want to literally steal neighboring land, or is throwing shit everywhere to distract from the looting and oligarchy being formed. So I suddenly have more empathy for that position.
Even OpenAI itself hasn’t resorted to these wild conspiracy theories.
Unless you’re an insider in these companies, you’re just like the rest of us, you know nothing.
China steals a lot of documentation from the US but in a tech forum you of all people should be very familiar with how little actual progress a bunch of documentation is towards a finished unit.
The Comac C919 still uses American engines despite all the industrial espionage in the world because most actual engineering is still a brute force affair of finding how things fail and fixing that. That's one of the main advantages SpaceX has proven out with their "eh fuck it, just launch and we will see what breaks" methodology.
Even fraud filled Chinese research makes genuine advancements.
Believing that China, a wealthy nation of over a billion people, with immense unity, nationality, and a regime able to explicitly write blank checks could only possibly beat the US at something by cheating is like, infinite hubris. It's hilarious actually.
I don't know if DeepSeek is actually just a clone of something or a shenanigan, that's possible and China certainly has done those kinds of things before, but to think it's the MOST LIKELY outcome, or to over rely on it in any way is a death sentence. OpenAI claims to have evidence, why do they not show it?
In this study, we demonstrate that reasoning capabilities can be significantly improved through large-scale reinforcement learning (RL), even without using supervised fine-tuning (SFT) as a cold start. Furthermore, performance can be further enhanced with the inclusion of a small amount of cold-start data
Is this cold start data what OpenAI is claiming is their output? If so, what's the big deal?
> To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators.
Maybe they needed OpenAI for their process. But now that their model is open source, anyone can use that as their cold start and spend the same amount.
"From scratch" is a moving target. No one who makes their model with massive data from the net is really doing anything from scratch.
They are using the current SOTA tools and models to build new models for cheaper.
It is no better for OpenAI in this scenario either, any competitor can easily copy their expensive training without spending the same, i.e. there is a second mover advantage and no economic incentive to be the first one.
To put it another way, the $500 billion Stargate investment will be worth just $5 billion once the models become available for consumption, because it will only take that much to replicate the same outcomes with new techniques even if the cold start needed o1 output for RL.
My understanding is this effectively builds on OpenAI's very expensive initial work, provides a "nearly as good as" model for orders of magnitude cheaper to train and run, that also provides a basis to continue building on and improving without openAI, and without human bottlenecks.
That cuts OAI off at the knees in terms of market viability after billions have been spent. If DS can iterate and match the capabilities of the current in-development OAI models in the next year, it may come down to regulatory capture and government intervention to ensure its viability as a company.
Let's just assume that the cost of training can be externalized to other people for free.
If other players can access that data with relatively less effort, then it's futile trying to train your models and improve upon them, as clearly you don't have an architectural moat, just a training moat.
Kind of like an office scene where an introverted hardworker does all the tedious work, while his extroverted colleague promotes it as his and gains credit.
The big question really is: are we doing it wrong? Could we have created o1 for a fraction of the price? Will o4 cost less to train than o1 did?
The second question is, naturally: if we create a smarter LLM, can we use it to create another LLM that is even smarter?
It would have been fantastic if DeepSeek could have come out with an o3 competitor before o3 even became publicly available. That way we would have known for sure that we’re doing it wrong. Cause then either we could have used o1 to train a better AI or we could have just trained in a smarter and cheaper way.
Whether or not you could have, you can now.
The model already embodies the "total sum of a massive amount of compute" used to create it; if it's possible to reuse that embodied compute to create a better model, that's good for the world. Forcing everyone to redo all that compute for themselves is, conversely, bad for the world.
We don't make people figure out how to domesticate a cow every time they want a hamburger. Or test hundreds of thousands of filaments before they can have a lightbulb. Inventions, once invented, exist as giants to stand upon. The inventor can either choose to disclose the invention and earn a patent for exclusive rights, or they can try to keep it a secret and hope nobody reverse engineers it.
So this is the first time I’ve heard the Chinese regime being described in such flowery terms on HN - lol. But ok - haha
Human reasoning, as it exists today, is the result of tens of thousands of years of intuition slowly distilled down to efficient abstract concepts like "numbers", "zero", "angles", "cause", "effect", "energy", "true", "false", ...
I don't know what reasoning from scratch would look like without training on examples from other reasoning beings. As human children do.
First you must invent the universe.
To quote DeepSeek directly:
> DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning. With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL.
This manifold is constructed via learning a decontextualized pattern space on a given set of inputs. Given the inherent probabilistic nature of sampling, true reasoning is expressed in terms of probabilities, not axioms. It may be possible to discover axioms by locating fixed points or attractors on the manifold, but ultimately you're looking at a probabilistic manifold constructed from your input set.
But I don't think you can untie this "reasoning" from your input data. It's possible you will find "meta-reasoning", or similar structures found in any sufficiently advanced reasoning manifold, but these highly decontextualized structures might be entirely useless without proper recontextualization, necessitating that a reasoning manifold is trained on input whose patterns follow learnable underlying rules, if the manifold is to be useful for processing input of that kind.
Decontextualization is learning, decomposing aspects of an input into context-agnostic relationships. But recontextualization is the other half of that, knowing how to take highly abstract, sometimes inexpressible, context-agnostic relationships and transform them into useful analysis in novel domains.
This doesn't mean a well-trained model can't reason about input it hasn't encountered before, just that the input needs to be in some way causally connected to the same laws which governed the input the manifold was trained on.
I'm sure we could create a fully generalized reasoning manifold which could handle anything, but I don't see how we possibly get that without first considering and encountering all possible inputs. But these inputs still have to have some form of constraint governed by laws that must be learned through sampling, otherwise you'd just be training on effectively random data.
The other commenter who suggested simply generating all possible sentences and training on internal consistency should probably consider Gödel's incompleteness theorems, and that internal consistency isn't enough to accurately model and interpret the universe. One could construct a thought experiment about an isolated brain in a jar with effectively unlimited neuronal connections, but no sensory connection to the outside world. It's possible, with enough connections, that the likelihood of the brain conceiving of true events it hasn't actually encountered does increase meaningfully. But the brain still has nothing to validate against, and can't simply assume that because something is internally logically consistent, that it must exist or have existed.
That was the easy part though; figuring out how to handle all the unintended side effects it generated is still an ongoing process. Please sit and relax while we solve the few incidental events occurring here and there; rest assured we are putting our best effort into their resolution.
They probably would need some other source of truth than an existing model, but it isn't clear how much additional data is needed.
I think you have a cartoonish view of China. A huge amount goes on that the government has no idea about. Now that DeepSeek has made a huge media splash, the Chinese government will certainly pay attention to them, but then again, so will the US government.
I’m suggesting it will be happening now and any past efforts will be retroactively analyzed by the appropriate CCP apparatus since everyone is aware of the scale of success as of Monday. It has become a political success, thus it is imperative the CCP partakes in it.
> DeepSeek, illegally, got their hands on an OpenAI model via a breach of OpenAI's systems. [...] given the lengths other Chinese entities have gone to when it comes to replicating Western technology; we should not discount this.
Above, teractiveodular said that "DeepSeek is basically a startup, not a 'foreign nation-state backed organization'". You called teractiveodular naive for saying that. So forgive me if I take the obvious implication that you think DeepSeek is actually a state-backed actor enabled by government hacking of OpenAI.
the attempt in cryptography with PGP and export controls made that clear.
Even if DS specifically is banned (and even effectively), a dozen other clean room replications following their published methods will become available.
It is possible this government will ban all “unapproved” LLMs not running at an authorized provider[1], saying it is a weapon, AGI, Skynet, or whatever else sounds important enough to establish the need for control[2]; the rest of the world will keep innovating.
—-
[1] Bans only need to work economically, not at the information level, i.e. organizations with liability considerations will not use “unapproved” models, and they are the ones who will spend the bulk of the money, and that is what they need to protect.
[2] If they were smart they could do this positively, without the backlash bans would have, by giving protections to compliant models, like legal indemnity for model companies and users, without necessarily blocking others.
On the other hand, my take on it, the ability to do reasoning in a long context is a general capability. And my guess is that it can be bootstrapped from scratch, without having to do training on all of the internet or having to distill models trained on the internet.
But we already know that is the case: the Deepseek v3 paper says it was post-trained partly with an internal version of R1:
> Reasoning Data. For reasoning-related datasets, including those focused on mathematics, code competition problems, and logic puzzles, we generate the data by leveraging an internal DeepSeek-R1 model. Specifically, while the R1-generated data demonstrates strong accuracy, it suffers from issues such as overthinking, poor formatting, and excessive length. Our objective is to balance the high accuracy of R1-generated reasoning data and the clarity and conciseness of regularly formatted reasoning data.
And DeepSeekMath did a repeated cycle of this kind of thing, mixing in 10% of previously seen data with newly generated data from the last generation in a continuous bootstrap.
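To make the mixing idea concrete, here is a minimal sketch of building one round's training set with roughly 10% replayed data. The 10% figure is the one from the comment above; the function name, the parameters and the sampling scheme are all hypothetical, and this is not taken from any DeepSeek code.

```python
import random


def mix_with_replay(new_samples, old_samples, replay_fraction=0.10, seed=0):
    """Return a shuffled training set where ~replay_fraction of items are old data.

    Hypothetical helper for illustration only; not DeepSeek's pipeline.
    """
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_old / (n_new + n_old) = replay_fraction for n_old.
    n_old = round(len(new_samples) * replay_fraction / (1.0 - replay_fraction))
    # Never ask for more old samples than actually exist.
    n_old = min(n_old, len(old_samples))
    mixed = list(new_samples) + rng.sample(list(old_samples), n_old)
    rng.shuffle(mixed)
    return mixed
```

Each bootstrap round would call this with the latest generation's outputs as `new_samples` and everything seen earlier as `old_samples`.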
>We curate reasoning prompts and generate reasoning trajectories by performing rejection sampling from the checkpoint from the above RL training. In the previous stage, we only included data that could be evaluated using rule-based rewards. However, in this stage, we expand the dataset by incorporating additional data, some of which use a generative reward model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment.
>For non-reasoning data, such as writing, factual QA, self-cognition, and translation, we adopt the DeepSeek-V3 pipeline and reuse portions of the SFT dataset of DeepSeek-V3. For certain non-reasoning tasks, we call DeepSeek-V3 to generate a potential chain-of-thought before answering the question by prompting.
The non-reasoning portion of the DeepSeek-V3 dataset is described as:
>For non-reasoning data, such as creative writing, role-play, and simple question answering, we utilize DeepSeek-V2.5 to generate responses and enlist human annotators to verify the accuracy and correctness of the data.
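For anyone unfamiliar with the "rejection sampling" mentioned in the quotes above, a rough sketch: sample several candidate answers per prompt and keep only the ones a checker accepts. Everything here is a toy assumption (`generate`, `is_correct`, the arithmetic prompts); the real pipeline uses a model checkpoint as the generator and rule-based or model-based rewards as the checker.

```python
import random


def rejection_sample(prompt, generate, is_correct, num_candidates=8):
    """Sample candidates for one prompt and keep only the accepted ones."""
    candidates = [generate(prompt) for _ in range(num_candidates)]
    return [c for c in candidates if is_correct(prompt, c)]


def build_sft_dataset(prompts, generate, is_correct, num_candidates=8):
    """Collect accepted (prompt, response) pairs for supervised fine-tuning."""
    dataset = []
    for prompt in prompts:
        for response in rejection_sample(prompt, generate, is_correct, num_candidates):
            dataset.append({"prompt": prompt, "response": response})
    return dataset


def make_toy_generator(rng):
    """Toy stand-in for a model: adds two numbers, sometimes off by one."""
    def generate(prompt):
        a, b = prompt
        return a + b + rng.choice([-1, 0, 0, 1])
    return generate


def toy_is_correct(prompt, response):
    """Toy rule-based reward: exact match against the known sum."""
    a, b = prompt
    return response == a + b
```

Usage would look like `build_sft_dataset([(2, 3), (10, 5)], make_toy_generator(random.Random(0)), toy_is_correct)`; wrong answers are simply dropped, so the resulting dataset only contains checked responses.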
I think if we were to take them at their word on all this, it would imply there is no specific OpenAI data in their pipeline (other than perhaps their pretraining corpus containing some incidental ChatGPT outputs that are posted on the web). I guess it's unclear where they got the "reasoning prompts" and corresponding answers, so you could sneak in some OpenAI data there?
The reasoning prompts and answers for SFT from V3, you mean? No idea. For that matter you have no idea where OpenAI got this data from either. If they open this can of worms, their can of worms will be opened as well.
I assume in their API logs.
That just means that the edge you’re able to retain if you invest $1B is nonexistent. It also means there’s a huge disincentive to invest $1B if your reward instantly evaporates. That would normally be fine if the competitor is otherwise able to get to that new level without the $1B. But if it relies on your $1B to then be able to put in $100M in the first place to replicate your investment, it essentially means the market for improvements disappears OR there’s legislation written to ensure competitors aren’t allowed to do that.
This is a tragedy of the commons and we already have a historical example for how humans tried to deal with it and all the problems that come with it. The cost of producing a book requires substantial capital but the cost of copying it requires a lot less. Copyright law, however flawed and imperfect, tries to protect the incentive to create in the face of that.
That’s why it’s so hard to understand the true cost of training Deepseek whereas it’s a little bit easier for cutting edge models (& even then still difficult).
Tbh a lot of people in the world would love this outcome. They will use AI because not using it puts them at a comparative disadvantage - but would rather AI didn't develop further or hadn't developed at all (i.e. they don't value the absolute advantage/value). There are both good and bad reasons for this.
“Hey OpenAI, if you had to make a clone of yourself again how would you do it and for a lot cheaper?”
Nice move.
Jeez. Must be really tough to have some comparatively small group of people financially destroy your industry with your own mechanically-harvested professional output while dubiously claiming to be better than you when in reality it’s just a lot cheaper. Must be tough.
Maybe they should take some time to self-reflect and make some art and writing about it using the products they make that mechanically harvest the work of millions of people, and have already screwed up the commercial art and writing marketplaces pretty thoroughly. Maybe tell DeepSeek it’s their therapist and get some emotional support and guidance.
This is possibly true, though with billions already invested I'm not sure that OpenAI would just...stop absent legislation. And, there may be technical or other solutions beyond legislation. [0]
But, really, your comment here considers what might come next. OTOH, I was replying to your prior comment that seemed to imply that DeepSeek's achievement was of little consequence if they weren't improving on OpenAI's work. My reply was that simply approximating OpenAI's performance at much lower cost could still be extraordinarily consequential, if for no other reason than the challenges you subsequently outlined in this comment's parent.
[0] On that note, I'm not sure (and admittedly haven't yet researched) how DeepSeek just wholesale ingested ChatGPT's "output" to be used for its own model's training, so not sure what technical measures might be available to prevent this going forward.
Emergent tool use from multi-agent interaction is a good example - https://openai.com/index/emergent-tool-use/
In your particular case the prompt would look something like: <pubmed dump> what are the plants that aren't poisonous to most people?
A general reasoner would recover language and relevant world model from pubmed dump. And then would proceed to reason about it, to perform the task.
It doesn't look like a particularly efficient process.
In an axiomatic system, those solutions are checkable, but how discoverable are they when your search space starts from infinity? How much do you lose by disregarding the gritty reality and foam of human experience? It provides inspirational texture that helps mathematicians in the search at least.
Reality is a massive corpus of cause and effect that can be modeled mathematically. I think you're throwing the baby out with the bathwater if you even want to be able to math in a vacuum. Maybe there is a self optimization spider that can crawl up the axioms and solve all of math. I think you'll find that you can generate new math infinitely, and reality grounds it and provides the gravity to direct efforts towards things that are useful, meaningful and interesting to us.
At the end of the day, all theory must be empirically verified, and contextually useful reasoning simply cannot develop in a vacuum.
Unless the brain is using physics that we don’t understand or can’t replicate, it seems that, at least theoretically, there should be a way to model what it’s doing with silicon and code.
States like inspiration and creativity seem to correlate in an interesting way with ‘temperature’, ‘top p’, and other LLM inputs. By turning up the randomness and accepting a wider range of output, you get more nonsense, but you also potentially get more novel insights and connections. Human creativity seems to work in a somewhat similar way.
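For reference, a minimal sketch of what those two knobs do at sampling time, in plain Python. The function names are made up for illustration, and real inference stacks differ in details, but the idea is the standard one: temperature rescales the logits before the softmax, and top p keeps only the most likely tokens whose cumulative probability reaches the threshold.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; higher temperature flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Draw one token index after applying temperature and top-p filtering."""
    probs = top_p_filter(apply_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With a low temperature or a small `top_p` the sampler almost always picks the top token; raising either lets less likely tokens through, which is the "more nonsense, more novelty" trade-off described above.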
To your point, experience is the training. Without language/data to represent human experience and knowledge to train a model, how would you give it 'experience'?
They were pre-designed to learn what they always learn. Their minds structured to readily make the same connections as puppies, that dogs have always needed to survive.
Not for real reasoning, which by its nature, does not have a limit.
It's easy to train the same things to a degree, but it's amazing to watch different dogs individually learn and reason through things completely differently, even within a breed or even a litter.
Reasoning ability is always limited by the capacity of the thinker to frame the concepts and interactions. It's always limited by definition; we only push that limit farther than other species, and AGI may eventually push it past our abilities.
- humans experience reality at a slower pace than AI could theoretically experience a simulated reality
- humans have to transfer knowledge to the next generation every 80 years (in a manner that's very lossy), and around half of each human lifespan is spent learning things that the previous generation already knew
Reasoning could very well have originally been an emergent property of a group of beings.
The animal kingdom is full of examples of groups being more intelligent than individuals, including in human animals as of today.
It’s entirely possible that reasoning emerged as a property of a group before it emerged in any individual.
What I wonder instead is whether reasoning is a property that is either there or not there, with a sharp boundary of existence.
Do this continually through generations until you arrive at modern society.
On the contrary, when reasoning about the real world, one must reason starting from assumptions that are uncertain (at best) or even "clearly wrong but still probably useful for this particular question" (at worst). Any long and logic-heavy proof would make the results highly dubious.