Dall-E 2 (openai.com)
And this observation may have great consequences for the visual arts. I had a lot of fun looking at the different Dall-E interpretations, trying to find the flaw in each one that keeps it from being a piece of art equal in value to the original. It is a ready-made tool for searching for explanations of the Power of Art. It cannot say which detail makes a picture an artwork, but it lets you see multiple data points and narrow the hypothesis space. My main conclusion is that the pearl earring has nothing to do with the power of art. It is something in the eyes, and probably in the slightly opened mouth. (Somehow Dall-E pictured all interpretations with closed lips, so it seems to be an important thing, but I need more variation along this axis to be sure.)
[1] https://en.wikipedia.org/wiki/Girl_with_a_Pearl_Earring [2] https://yourartshop-noldenh.com/awol-erizku-girl-with-the-pe...
The original girl is more open, more independent and mindless. The interpretation's girl is more self-controlled, assertive and not really interested, just going through the motions of regular communication between people. Maybe it's just me, but what I really value on such occasions is mindlessness, the ability of people to not mind themselves, to let their selves dissolve into the environment. Sometimes I cannot hold back tears when I watch some entertainer playing Chopin or Paganini, because what I see in their movements is the complete dissolution of a person in a piece of music, in a piece of art and skill. An entertainer just does what they do with their full attention on it, and with all their motivation focused on it. There is nothing here for them, just them and their actions.
There is not a single thought devoted to how people around me would react to what I do and how I do that. I just do what I do and I do not care about people around me, and if it somehow makes people happy... I don't care really. I mean I know that afterwards I'd feel a pride of myself, but just for now I don't really care.
I know this feeling. I like to sing, and I'm good at it (above average), and I know what it feels like to dissolve into the song and to let the song rule. I play piano and I know what it is like to dissolve into the piece I'm playing, to stop existing, to let the music take the lead. And the original painting makes me believe that the girl is in this state of mind. I do not know the history or the rest of the story, I do not know if she got into this state for a second or if she never leaves it (that may be a sad experience, don't you think?), but somehow I know that right now she is in exactly this state. I want to watch this moment of hers for an eternity.
Thinking about it, I'd confess that Interpretation Girl does trigger the same, but on a smaller scale. I feel how my mind is trying to find a coherent state to her gaze, but this feeling stops in tens of microseconds, not hundreds of them.
edit: want->watch. Stupid mistake ruining the meaning of the sentence.
But it's like a giant database of decent clipart for anything we can imagine
We do not know exactly what part of our perception of reality can be attributed to "the visual cortex and some association cortex". But now we can feel it. We can test it. We can compare ourselves with the cold calculating machine. I believe that it is a priceless opportunity that we shouldn't miss. At least I personally can't. I'm going to figure out whether it is possible for me to have such a companion as Dall-E in my wanderings through the sea of information on the Internet, and if it is, to get one.
> But its like a giant database of decent clipart for anything we can imagine
And this also. Yes. Though I'm not interested in clipart.
> Prices are per 1,000 tokens. You can think of tokens as pieces of words, where 1,000 tokens is about 750 words. This paragraph is 35 tokens.
Further down, in the FAQ[2]:
> For English text, 1 token is approximately 4 characters or 0.75 words. As a point of reference, the collected works of Shakespeare are about 900,000 words or 1.2M tokens.
> To learn more about how tokens work and estimate your usage…
> Experiment with our interactive Tokenizer tool.
And it goes on. When most questions in your FAQ are about understanding pricing—to the point you need to offer a specialised tool—perhaps consider a different model?
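For a rough sense of the arithmetic in the quoted FAQ, here is a minimal sketch that uses only the two ratios quoted above (about 0.75 words or 4 characters per token for English text). The function names and the idea of passing in a price are made up for illustration, and the real tokenizer will give different counts for any specific text.

```python
WORDS_PER_TOKEN = 0.75  # ratio quoted in the FAQ for English text
CHARS_PER_TOKEN = 4     # ratio quoted in the FAQ for English text


def tokens_from_words(word_count: int) -> int:
    """Estimate tokens from a word count using the quoted 0.75 words/token."""
    return round(word_count / WORDS_PER_TOKEN)


def tokens_from_text(text: str) -> int:
    """Estimate tokens from raw text using the quoted 4 characters/token."""
    return round(len(text) / CHARS_PER_TOKEN)


def estimated_cost(token_count: int, price_per_1k_tokens: float) -> float:
    """Cost when prices are quoted per 1,000 tokens (the price is a placeholder)."""
    return token_count / 1000 * price_per_1k_tokens


# The FAQ's own reference point: 900,000 words is about 1.2M tokens.
print(tokens_from_words(900_000))  # 1200000
```

The estimate only reproduces the FAQ's rule of thumb; it says nothing about how any particular prompt is actually split into tokens.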
Great work.
Looking forward to when they start creating movies from scripts.
Now the rant:
I think if OpenAI genuinely cared about the ethical consequences of the technology, they would realise that any algorithm they release will be replicated in implementation by other people within some short period of time (a year or two). At that point, the cat is out of the bag and there is nothing they can do to prevent abuse. So really all they are doing is delaying abuse, and in no way stopping it.
I think their strong "safety" stance has three functions:
1. Legal protection
2. PR
3. Keeping their researchers' consciences clear
I think number 3 is dangerous because researchers are put under the false belief that their technology can or will be made safe. This way they can continue to harness bright minds that no doubt have ethical leanings to create things that they otherwise wouldn't have.
I think OpenAI are trying to have the cake and eat it too. They are accelerating the development of potentially very destructive algorithms (and profiting from it in the process!), while trying to absolve themselves of the responsibility. Putting bandaids on a tumour is not going to matter in the long run. I'm not necessarily saying that these algorithms will be widely destructive, but they certainly have the potential to be.
The safety approach of OpenAI ultimately boils down to gatekeeping compute power. This is just gatekeeping via capital. Anyone with sufficient money can replicate their models easily and bypass every single one of their safety constraints. Basically they are only preventing poor bad actors, and only for a limited time at that.
These models cannot be made safe as long as they are replicable.
To produce scientific research requires making your results replicable.
Therefore, there is no ability to develop abusable technology in a safe way. As a researcher, you will have blood on your hands if things go wrong.
If you choose to continue research knowing this, that is your decision. But don't pretend that you can make the algorithms safer by sanitizing models.
There's certainly research happening around this, and RL in games is a great test bed, but people choosing actions will be safe from automation longer than people not choosing actions, if that makes sense. It's the person who decides "hire this person" vs the person who decides "I'll use this particular shade of gray."
[0] The best example is when X causes Y and X also causes Z, but your data only includes Y and Z. Without actually manipulating Y, you can't see that Y doesn't cause Z, even if it's a strong predictor.
[1] Another example is the datasets. You need two different labels depending on what happens if you take action A or B, which you can't have simultaneously outside of simulations.
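Footnote [0] can be sketched as a small simulation. This is a toy under stated assumptions: the Gaussian noise levels, the sample size, and the names `simulate` and `pearson` are all made up for illustration. X is a hidden common cause of Y and Z, so Y predicts Z well in observed data, but the correlation disappears once Y is set by hand.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=10_000, intervene_on_y=False, seed=0):
    """Return corr(Y, Z) where X causes both Y and Z, and Y never causes Z."""
    rng = random.Random(seed)
    ys, zs = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)            # hidden common cause, absent from the data
        y = x + rng.gauss(0, 0.3)      # X causes Y
        if intervene_on_y:
            y = rng.gauss(0, 1)        # we set Y ourselves, ignoring X
        z = x + rng.gauss(0, 0.3)      # X causes Z; Y plays no role
        ys.append(y)
        zs.append(z)
    return pearson(ys, zs)


observed = simulate()
intervened = simulate(intervene_on_y=True)
print(f"observational corr(Y, Z): {observed:.2f}")
print(f"corr(Y, Z) after setting Y by hand: {intervened:.2f}")
```

With only the (Y, Z) columns there is no way to tell these two worlds apart, which is the point of needing to take actions rather than just observe.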
Nowadays there’s lots of great low/no code platforms, like Retool, that represent a far greater threat to the amount of code that needs to be produced than AI ever will.
To use a cliche: code is a bug, not a feature. Abstracting away the need for code is the future, not having a machine churn out the same code we need today.
Caravaggio is probably chortling from wherever he is ..
There doesn't seem to be an equivalent movement with AI-generated art, probably because the understanding of how the models are trained from large datasets is not mainstream yet. I would imagine thousands of those same artists/consumers would be up in arms if they had a basic understanding of ML and millions of average people were beginning to feed the models their own keywords.
This I think ties in with the "responsibility" principles that OpenAI outlines. Once the generation technique has been reverse-engineered and can be used without limits, there is no way to uninvent it. It can be made illegal, but humans can always find a way around laws if they want something badly enough. This could have drastic consequences if enough artists believe that the training violates their respect or other intangible humanistic qualities. With technological advancement that can never be put back in the bottle and spreads to occupy the entire consciousness of the Internet, their options for recourse will be far different than being able to tell a single fringe art group siphoning others' content to pack up and leave.
https://news.ycombinator.com/item?id=30931614
I point this out because while Dall-E 2 seems interesting (I'm out of my depth, so delegating to the conversation taking place here), the timing of its release as well as accompanying press blasts within the last hour from sites like TheVerge—verified via wayback machine queries and time-restricted googling—seems both noteworthy and worth a deeper conversation given what was just published about Worldcoin.
To be clear, it's worth asking if Dall-E 2 was published ahead of schedule without an actual product release (only a waitlist) to potentially move the spotlight away from Worldcoin.
The internet's own proverb has never been more important to keep in mind. A dose of skepticism is a must.
Art is truth.
People didn't program Dall-E to make art. They taught it to recognize patterns and create something by extrapolating from the patterns, all on its own. So the AI isn't a projection of what they think is good art, it's projecting what it thinks is good art, based on a prompt. The output is its best effort at a feeling, even if the feeling had to be inputted by a living person. So it's still art that's as good as the feeling that it came from, with fleeting feelings being lower quality than those that required more time and thought.
I think the results are being poisoned by the fact that most old paintings have deteriorated colors, so the training data looks nothing like the originals. It's certainly a lot yellower than https://cdn.openai.com/dall-e-2/demos/variations/originals/g...
https://twitter.com/sama/status/1511724264629678084?s=20&t=6...
Sam Altman demonstrates Dall-E 2 using twitter suggestions - https://news.ycombinator.com/item?id=30933478 - April 2022 (3 comments)
You can join it by following the steps in the guide here: https://github.com/huggingface/community-events/tree/main/hu...
There will also be talks from awesome folks at EleutherAI, Google, and Deepmind
> By removing the most explicit content from the training data, we minimized DALL·E 2’s exposure to these concepts
> We won’t generate images if our filters identify text prompts and image uploads that may violate our policies
The 'how to prevent superintelligences from eating us' crowd should be taking note: this may be how we regulate creatures larger than ourselves in the future
And even how we regulate the ethics of non-conscious group minds like big companies
I suspect trends in design will move towards those areas that AI struggles with (assuming there are any left!)
I think we passed that point a while ago, but seeing this makes me think we aren't too far off from computers composing pieces that actually sound good too.
As for the text-driven part, I would have to mess with some prompts that aren't pre-canned to see how useful it is.
But I do think we could have guessed that this sort of approach would be better (at least at a high level - I'm not claiming I could have predicted all the technical details!). The previous approaches were sort of the best that people could do without access to the training data and resources - you had a pretrained CLIP encoder that could tell you how well a text caption and an image matched, and you had a pretrained image generator (GAN, diffusion model, whatever), and it was just a matter of trying to force the generator to output something that CLIP thought looked like the caption. You'd basically do gradient ascent to make the image look more and more and more like the text prompt (all the while trying to balance the need to still look like a realistic image). Just from an algorithm aesthetics perspective, it was very much a duct tape and chicken wire approach.
The analogy I would give is if you gave a three-year-old some paints, and they made an image and showed it to you, and you had to say, "this looks a little like a sunset" or "this looks a lot like a sunset". They would keep going back and adjusting their painting, and you'd keep giving feedback, and eventually you'd get something that looks like a sunset. But it'd be better, if you could manage it, to just teach the three-year-old how to paint, rather than have this brute force process.
Obviously the real challenge here is "well how do you teach a three-year-old how to paint?" - and I think you're right that that question still has a lot of alchemy to it.
For anyone pondering such questions, I would recommend reading "The Past, Present, and Future of AI Art" - https://thegradient.pub/the-past-present-and-future-of-ai-ar...
This is really already the case, actually. Most artworks have “value” because they have a compelling narrative, not because they look pretty. So I think we can expect future artists to really emphasize their background, life story, process of making the art, etc. All things that cannot be done by a machine.
When you have a digital display of pixels, if you randomly color pixels at 24 fps then you will eventually display every movie that can be or will ever be made, powerset notwithstanding. This can also be tied to digital audio.
In short, while mind-blowingly large, the space of display through digital means is finite.
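To put a number on "finite but mind-blowingly large", here is a minimal counting sketch. The 24-bit color depth, the resolutions, and the function names are assumptions for illustration; the comment above only specifies 24 fps.

```python
import math


def distinct_frames(width: int, height: int, bits_per_pixel: int = 24) -> int:
    """Number of different images a width x height display can show."""
    return 2 ** (width * height * bits_per_pixel)


def distinct_movies(width: int, height: int, frame_count: int,
                    bits_per_pixel: int = 24) -> int:
    """Number of different frame sequences of a given length."""
    return distinct_frames(width, height, bits_per_pixel) ** frame_count


def decimal_digits_in_frame_count(width: int, height: int,
                                  bits_per_pixel: int = 24) -> int:
    """Digits in distinct_frames(...), computed without building the integer."""
    return math.floor(width * height * bits_per_pixel * math.log10(2)) + 1


print(distinct_frames(1, 1))                      # 16777216 colors for one pixel
print(decimal_digits_in_frame_count(1920, 1080))  # digits in the 1080p frame count
```

A two-hour film at 24 fps would raise the per-frame count to the power 172,800, so "eventually" here means a time far beyond anything physically meaningful, even though the space is finite.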
I think an AI-infused future is going to become increasingly absurd and surreal; it will lead to a kind of creative and cultural nihilism, if that's the right term.
Like the value of originality will become meaningless.
- In support of your argument, the Buzzfeed News investigation likely has been in the works for weeks, meaning Altman et al have had more than just a couple days to throw together a Dall-E 2 soft launch
- However, weren't OpenAI's GPT (2 and 3) announced to the world in similar fashion? e.g. demos and whitepapers and waitlists, but not a full product release?
- Throwing together a Dall-E 2 soft launch just in time to distract from the investigation would require a conspiracy, i.e. several people being at least vaguely aware that deadlines have been accelerated for external reasons. Is the Worldcoin story big enough to risk tainting OpenAI, which seems like a much more prominent part of Altman's portfolio?
- BFN reached out to A16Z, Worldcoin, and Khosla Ventures, who largely declined to comment, which would mean that at least one person probably had a bit of runway from at least when the requests for comment were submitted. So yeah, you're probably right.
- Going from the github repos for GPT 2 and 3, those may have been hard launches:
Feb 14 2019, predating the first press for GPT-2 by a few days (was probably made public Feb 14 though) - https://github.com/openai/gpt-2/commit/c2dae27c1029770cea409...
May 28 2020, timed alongside the press news for GPT-3 - https://github.com/openai/gpt-3/commit/12766ba31aa6de490226e...
- Would it really have to be a conspiracy? Sounds like only one person would have to target a specific date or date range, and without really giving a reason.
One of the things that puts a hole in my own thinking here is that Sam Altman's name isn't really tied to the Dall-E 2 release. It's just OpenAI, and the press around Sam's name today still exclusively surfaces just this one Worldcoin story (https://news.google.com/search?q=sam+altman+when%3A1d&). So if this was actually intended to bury another story, Sam's name would have to have been included in all the press blasts to be successful. But the Buzzfeed story seems like it kinda died alone on the vine.
I listed some of them here - https://news.ycombinator.com/item?id=30934732, just because I remembered there had been previous discussions and listing related previous discussions is a thing.
What I'm submitting for consideration is that the marketing page and associated press blasts (there's a live influencer reaction video airing right now about Dall-E 2, for instance) for Dall-E 2 were potentially pushed up to offset negative press from Worldcoin for their shared founder.
I'd like to be wrong. But it's too well timed.
I don't see the publication of a marketing page (again, not a finished product) for a product founded by someone whose other main venture is being investigated by journalists for misleading claims as being a coincidence, but if the timing matters and 14-15 hours doesn't seem like it works for the assertion in your mind, then perhaps the Dall-E 2 page going live less than an hour after the Worldcoin HN submission fits the bill.
I've got no horse in this race. I'm just drawing attention to familiar PR strategies used for brand risk mitigation, that's all.
Easy to put together a marketing piece on short notice or potentially even push a pending marketing page out to production with a waitlist rather than links to production or even beta quality services.
In other text-to-image algorithms I'm familiar with (the ones you'll typically see passed around as colab notebooks that people post outputs from on Twitter), the basic idea is to encode the text, and then try to make an image that maximally matches that text encoding. But this maximization often leads to artifacts - if you ask for an image of a sunset, you'll often get multiple suns, because that's even more sunset-like. There's a lot of tricks and hacks to regularize the process so that it's not so aggressive, but it's always an uphill battle.
Here, they instead take the text embedding, use a trained model (what they call the 'prior') to predict the corresponding image embedding - this removes the dangerous maximization. Then, another trained model (the 'decoder') produces images from the predicted embedding.
This feels like a much more sensible approach, but one that is only really possible with access to the giant CLIP dataset and computational resources that OpenAI has.
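The difference between the two approaches can be caricatured in a few lines. This is only a toy with made-up 2-D vectors: it is not CLIP, and `toy_prior` (a nearest-neighbor lookup in a tiny hand-written table) and `toy_decoder` (the identity) are hypothetical stand-ins for the trained models the post describes. It shows why pure maximization of a match score keeps drifting, while predict-then-decode lands on one plausible target.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


# Made-up unit vector standing in for an encoded caption.
TEXT_EMBEDDING = (0.6, 0.8)


def maximize_match(steps=50, lr=0.5):
    """Older approach (toy): gradient ascent on score = dot(image, text)."""
    image = [0.1, 0.1]
    for _ in range(steps):
        # The gradient of dot(image, text) with respect to image is the text vector.
        image = [p + lr * g for p, g in zip(image, TEXT_EMBEDDING)]
    return image


# Hypothetical "realistic" image embeddings, all with norm close to 1.
KNOWN_IMAGE_EMBEDDINGS = [(0.5, 0.85), (0.9, 0.1), (-0.7, 0.7)]


def toy_prior(text_embedding):
    """Stand-in prior: pick one plausible image embedding for the text."""
    return max(KNOWN_IMAGE_EMBEDDINGS, key=lambda e: dot(e, text_embedding))


def toy_decoder(image_embedding):
    """Stand-in decoder: in this toy the 'image' is the embedding itself."""
    return list(image_embedding)


maximized = maximize_match()
decoded = toy_decoder(toy_prior(TEXT_EMBEDDING))
print("maximized image norm:", round(norm(maximized), 2))
print("prior + decoder image norm:", round(norm(decoded), 2))
```

In the toy, the maximized vector ends up far outside the region where the known image embeddings live, which is a loose analogue of the "multiple suns" artifact; real systems add regularizers to hold it back.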
But there's no real rhyme or reason, it is a sort of alchemy.
Is text encoding strictly worse or is it an artifact of the implementation? And if it is strictly worse, which is probably the case, why specifically? What is actually going on here?
I can't argue that their results are not visually pleasing. But I'm not sure what one can really infer from all of this once the excitement washes over you.
Blending photos together in a scene in photoshop is not a difficult task. It is nuanced and tedious but not hard, any pixel slinger will tell you.
An app that accepts a smattering of photos and stitches them together nicely can be coded up any number of ways. This is a fantastic and time saving photoshop plugin.
But what do we have really?
"Koala dunking basketball" needs to "understand" the separate items and select from the image library hoops and a koala where the angles and shadows roughly match.
Very interesting, potentially useful. But if it doesn't spit out exactly what you want, you can't edit it further.
I think the next step has got to be that it conjures up a 3d scene in Unreal or blender so you can zoom in and around convincingly for further tweaks. Not a flat image.
Stock photography sales are in the many billions of dollars per year and custom commissioned photography is larger still. That's a pretty seriously sized ready-made market.
> But if doesn't spit up exactly what you want can't edit it further.
I suspect there's a big startup opportunity in pioneering an easy-to-use interface allowing users to provide fast iterative feedback to the model - including positional and relational constraints ("put this thing over there"). Perhaps even more valuable would be easy yet granular ways to unconstrain the model. For example, "keep the basketball hoop like that but make the basketball an unexpected color and have the panda's right paw doing something pandas don't do that human hands often do."
Is there a rhyme or reason as to why picasso decided to paint like that? Yes these networks are hard to reason about, but so are real human brains.
Why? You can tweak the prompt, change parameters, or even use the actual "edit" capability that they demo in the post.
DALL-E 2 spits as many outputs as you want. Then you choose the one you prefer.
I'm not sure if I'm speaking clearly, I just don't understand what the difference is between training "text encoding to an image" vs "text embedding to image embedding". In both cases you have some kind of "sunset" (even though it's obviously just a dot in a multi-dimensional space, not the letters) on the left, and you try to maximize it when training the model to get either an image embedding or an image straight away.
The previous systems I was talking about work something like this: "Try to find me the image that looks like it most matches 'a picture of a sunset'. Do this by repeatedly updating your image to make it look more and more like a sunset." Well, what looks more like a sunset? Two sunsets! Three sunsets! But this is not normally the way images are produced - if you hire an artist to make you a picture of a bear, they don't endeavor to create the most "bear" image possible.
Instead, what an artist might do is envision a bear in their head (this is loosely the job of the 'prior' - a name I agree is confusing), and then draw that particular bear image.
But why is this any different? Who cares if the vector I'm trying to draw is a 'text encoding' or an 'image encoding'? Like you say, it's all just vectors. Take this answer with a big grain of salt, because this is just my personal intuitive understanding, but here's what I think: These encodings are produced by CLIP. CLIP has a text encoder and an image encoder. During training, you give it a text caption and a corresponding image, it encodes both, and tries to make the two encodings close. But there are many images which might accompany the caption "a picture of a bear". And conversely there are many captions which might accompany any given picture.
So the text encoding of "a picture of a bear" isn't really a good target - it sort of represents an amalgamation of all the possible bear pictures. It's better to pick one bear picture (i.e. generate one image embedding that we think matches the text embedding), and then just to try to draw that. Doing it this way, we aren't just trying to find the maximum bear picture - which probably doesn't even look like a realistic natural image.
Like I said, this is just my personal intuition, and may very well be a load of crap.
In the context of a living being, different genes interact with each other as well. For example, you have certain cells that secrete hormones (many genes needed to do that), then you have genes that encode for hormone receptors, and those receptors trigger other actions encoded by other genes. There's probably too much complexity to ask an AI system to synthesize the entire genetic code for a living being. That would be kind of like if I asked you to draw the exact blueprints for a fighter jet, and write all the code, and synthesize all the hardware all at once, and you only get one shot. You would likely fail to predict some of the interactions and the resulting system wouldn't work. You could only achieve this through an iterative process that would involve years of extensive testing.
Could you use a deep learning system to synthesize genetic code? Maybe just single genes that do fairly basic things, and you would need a massive dataset. Hard to say what that would look like. Is it really enough to textually describe what a gene does?
With text and images you can leverage “ground truth” data (verified by humans) to train your model.
For DNA sequences, I would look for methods that don’t require good ground truth data.
I think that using something like this for porn could potentially offer the biggest benefit to society. So much has been said about how this industry exploits young and vulnerable models. Cheap autogenerated images (and in the future videos) would pretty much remove the demand for human models and eliminate the related suffering, no?
EDIT: typo
* I recommend reading the Risks and Limitations section that came with it because it's very thorough: https://github.com/openai/dalle-2-preview/blob/main/system-c...
* Unlike GPT-3, my read of this announcement is that OpenAI does not intend to commercialize it, and that access to the waitlist is indeed more for testing its limits (and as noted, commercializing it would make it much more likely to lead to interesting legal precedent). Per the docs, access is very explicitly limited: (https://github.com/openai/dalle-2-preview/blob/main/system-c... )
* A few months ago, OpenAI released GLIDE ( https://github.com/openai/glide-text2im ) which uses a similar approach to AI image generation, but suspiciously never received a fun blog post like this one. The reason for that in retrospect may be "because we made it obsolete."
* The images in the announcement are still cherry-picked, which is presumably why they compared DALL-E 1 and DALL-E 2 on non-cherry-picked images.
* Cherry-picking is relevant because AI image generation is still slow unless you do real shenanigans that likely compromise image quality, although OpenAI likely has better infra to handle large models, as they have demonstrated with GPT-3.
* It appears DALL-E 2 has a fun endpoint that links back to the site for examples with attribution: https://labs.openai.com/s/Zq9SB6vyUid9FGcoJ8slucTu
GLID-3: https://colab.research.google.com/drive/1x4p2PokZ3XznBn35Q5B...
and a new Latent Diffusion notebook: https://colab.research.google.com/github/multimodalart/laten...
have both appeared recently and are getting remarkably close to the original Dall-E (maybe better as I can't test the real thing...)
So - this was pretty good timing if OpenAI want to appear to be ahead of the pack. Of course I'd always pick a model I can actually use over a better one I'm not allowed to...
[0] Which is why releasing your code is so beneficial.
Using something like this could really help automate or at least kickstart the more mundane parts of content creation. (At least when you are using high resolution, true color imagery.)
There are some 3D image generation techniques, but they aren't based on polygonal modeling, so 3D artists are safe for now.
Or what about even generating images you could then photogrammetry into models?
Preventing Harmful Generations
We’ve limited the ability for DALL·E 2 to generate violent,
hate, or adult images. By removing the most explicit content
from the training data, we minimized DALL·E 2’s exposure to
these concepts. We also used advanced techniques to prevent
photorealistic generations of real individuals’ faces,
including those of public figures.
"And we've also closed off a huge range of potentially interesting work as a result"
I can't help but feel a lot of the safeguarding is more about preventing bad PR than anything. I wish I could have a version with the training wheels taken off. And there's enough other models out there without restriction that the stories about "misuse of AI" will still circulate.
(side note - I've been on HN for years and I still can't figure out how to format text as a quote.)
This doorway is downright impossible https://cdn.openai.com/dall-e-2/demos/variations/modified/fl...
At this point, it still seems like it's pushing pixels around until it's "good enough" when you squint at it.
Some of the images also hit me with a creep factor, like the bears on the corgis in the art gallery, but that maybe only because I know it's AI generated.
This and the current AI-generated art scene make it look like artwork is now a "solved" problem. See AI-generated art on Twitter etc.
There is a strong relation between the prompt and the generated images, but just like GPT-3, it fails to fully understand what was being asked. If you take the prompt out of the equation and see the generated artwork on its own, it's up to your interpretation, just like any artwork.
The solar powered ship with a propeller sailing under the golden gate bridge during sunset with dolphins jumping around was pretty impressive. https://twitter.com/sama/status/1511731259319349251
I think it's only missing the dolphins.
Creating great _art_ that Grayson Perry (for example) would recognise as such is probably AGI-complete, because it requires a deep understanding of the human condition, society, and a lot of reasoning skills.
A great artist could certainly use Dall-E 2 as part of their method, though.
This is why we are blown away by some pieces of text generated by GPT-3 as if it has its own mind. Even most abstract art has meaning for anyone who is looking for it.
What I am saying is if a generated artwork is indistinguishable from what a human can make, then that's all that was needed.
For example, using DeMorgan's theorem, we can build any logic circuit out of all NAND or NOR gates:
https://www.electronics-tutorials.ws/boolean/demorgan.html
https://en.wikipedia.org/wiki/NAND_logic
https://en.wikipedia.org/wiki/NOR_logic
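The NAND half of that claim is easy to check by brute force. A minimal sketch, with function names chosen here for illustration: NOT, AND, OR and XOR are built from a single `nand` primitive, and every input combination is compared against Python's own boolean operators.

```python
from itertools import product


def nand(a: bool, b: bool) -> bool:
    """The only primitive gate used below."""
    return not (a and b)


def not_(a: bool) -> bool:
    return nand(a, a)


def and_(a: bool, b: bool) -> bool:
    return not_(nand(a, b))


def or_(a: bool, b: bool) -> bool:
    # De Morgan: a OR b == NOT(NOT a AND NOT b) == NAND(NOT a, NOT b)
    return nand(not_(a), not_(b))


def xor_(a: bool, b: bool) -> bool:
    # Classic four-NAND construction of XOR.
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))


# Exhaustively compare against the built-in operators.
for a, b in product([False, True], repeat=2):
    assert not_(a) == (not a)
    assert and_(a, b) == (a and b)
    assert or_(a, b) == (a or b)
    assert xor_(a, b) == (a != b)
print("all gates match")
```

The same exercise works with NOR as the single primitive by swapping the roles of AND and OR in the De Morgan step.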
Dall-E 2's level of associative comprehension is so far beyond the old psychology bots in the console pretending to be people, that I can't help but wonder if it's reached a level where it can make any association.
For example, I went to an AI talk about 5 years ago where the guy said that any of a dozen algorithms like K-Nearest Neighbor, K-Means Clustering, Simulated Annealing, Neural Nets, Genetic Algorithms, etc can all be adapted to any use case. They just have different strengths and weaknesses. At that time, all that really mattered was how the data was prepared.
I guess fundamentally my question is, when will AGI start to become prevalent, rather than these special-purpose tools like GPT-3 and Dall-E 2? Personally I give it 10 years of actual work, maybe less. I just mean that to me, Dall-E 2 is already orders of magnitude more complex than what's required to run a basic automaton to free humans from labor. So how can we adapt these AI experiments to get real work done?
While technical work will always have a place -- I think that much creative work will become more like the management of a team of highly-skilled, niche workers -- with all the frustrations, joys, and surprises that entails.
The upside is that it’s more “intuitive” and requires much less detail and technique, as the AI infers the detail and technique. The downside is that it’s really hard to know what the AI will generate or get it to generate something really specific.
I believe the future will combine the heuristics of AI-generation with the specificity of traditional techniques. For example, artists may start with a rough outline of whatever they want to draw as a blob of colors (like in some AI image-generation papers). Then they can fill in details using AI prompts, but targeting localized regions/changes and adding constraints, shifting the image until it’s almost exactly what they imagined in their head.
You can definitely make them incremental. You can give it a task like "make a more accurate description from initial description and clarification". Even GPT-3-based models available today can do these tasks.
Once this is properly productionized it would be possible to implement stuff just talking with a computer.
Isn't that essentially what programming already is?
Imagine waking up and telling your (preferably locally hosted) voice assistant that today really feels like a Rembrandt day and the AI just generates new paintings for you.
Curbing Misuse Our content policy does not allow users to generate violent, adult, or political content, among other categories. We won’t generate images if our filters identify text prompts and image uploads that may violate our policies. We also have automated and human monitoring systems to guard against misuse.
- https://github.com/openai/dalle-2-preview/blob/main/system-c...
- https://github.com/openai/dalle-2-preview/blob/main/system-c...
Have a favorite painter? Here's 10,000 new paintings like theirs.
https://www.henrirousseau.net/war.jsp
However, this painting has themes of violence and politics plus some nude dead bodies, so it violates the content policy: "Our content policy does not allow users to generate violent, adult, or political content, among other categories."
So what you'd get is some kind of sanitized watered-down tepid version of Rousseau, the kind of boring drivel suitable for corporate lobbies everywhere, guaranteed not to offend or disturb anyone. It's difficult to find words... horrific? dystopian? atrocious? No, just no.
I’ve been using tools like this for over a year now. Even with filtered dataset and filtered interface, they can make images that would make the Fangoria crowd blush if you put the slightest effort into it.
It’s one thing to be able to make brain-wrenching images with a lot of photoshop effort (or digging hard enough in the dark corners of the internet). It’s another thing entirely to give anyone the ability to spew out thousands of them trivially.
However the fidelity of their music AI kinda sucks at this point, but I'm sure we'll get pitch perfect versions of this concept as the singularity gets closer :)
Imagine not just DALL-E 2 but a single model which can be trained on different kinds of media and generate music, images, video and more.
The series:
- covers essential lessons for AI creatives of the future
- shares details on how to compete creatively in the future
- talks about how to make money through Multimodal AI
- makes predictions about AI’s effects on society
- discusses, at a very basic level, the ethics of multimodal AI and the philosophy of creativity itself
By my understanding, it's the most comprehensive set of videos on this topic.
The series is free to watch entirely on YouTube: GPT-X, DALL-E, and our Multimodal Future https://www.youtube.com/playlist?list=PLza3gaByGSXjUCtIuv2x9...
From the paper:
> Limitations
> Although conditioning image generation on CLIP embeddings improves diversity, this choice does come with certain limitations. In particular, unCLIP [Dall-E 2] is worse at binding attributes to objects than a corresponding GLIDE model.
The binding problem is interesting. It appears that the way Dall-E 2 / CLIP embeds text leads to the concepts within the text being jumbled together. In their example "a red cube on top of a blue cube" becomes jumbled and the resulting images are essentially: "cubes, red, blue, on top". Opens a clear avenue for improvement.
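A toy way to see why jumbling hurts: if a prompt is summarized in an order-insensitive way, two prompts with swapped attributes become indistinguishable. To be clear, this is NOT how CLIP embeds text; the bag-of-words function below is a deliberately crude stand-in I made up to illustrate the failure mode.

```python
from collections import Counter


def bag_of_words(prompt: str) -> Counter:
    """Hypothetical order-insensitive 'embedding': just word counts."""
    return Counter(prompt.lower().split())


red_on_blue = bag_of_words("a red cube on top of a blue cube")
blue_on_red = bag_of_words("a blue cube on top of a red cube")

# Both prompts contain exactly the same words, so the summary cannot
# tell which colour is bound to which cube.
prompts_collide = red_on_blue == blue_on_red
print(prompts_collide)
```

Any fix has to preserve which attribute goes with which object, which is presumably why the paper calls it out as a limitation relative to GLIDE.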
Here's an example from my prompt ("a group of farmers picking lettuce in a field digital painting"): https://labs.openai.com/s/jb5pzIdTjS3AkMvmAlx69t7G
1. DeepMind, who solved Go and protein folding, and who seem really onto something.
2. Everyone else, spending billions to build machines that draw astronauts on unicorns, and smartish bot toys.
This seems to me like a big step towards AGI; a key component of consciousness seems (in my opinion) to be the ability to take words and create a mental picture of what's being described. Is that the long term goal WRT researching a model like this?
> Curbing Misuse [...]
That's great, nowadays the big AI is controlled by mostly benevolent entities. How about when someone real nasty gets a hold of it? In a decade the models anyone can download will make today's GPT-3 etc look like pong right?
Recommender systems etc are already shaping society and culture with all kinds of unintended effects. What happens when mindless optimizing models start generating the content itself?
It's almost impossible to even give an affirmative answer to that question without making yourself a target. And as much as I err on the side of creator freedom, I find myself shying away from saying yes without qualifications.
And if you don't allow cp, then by definition you require some censoring. At that point it's just a matter of where you censor, not whether. OpenAI has gone as far as possible on the censorship, reducing the impact of the model to "something that can make people smile." But it's sort of hard to blame them, if they want to focus on making models rather than fighting political battles.
One could imagine a cyberpunk future where seedy AI cp images are swapped in an AR universe, generated by models run by underground hackers that scrounge together what resources they can to power the behemoth models that they stole via hacks. Probably worth a short story at least.
You could make the argument that we have fine laws around porn right now, and that we should simply follow those. But it's not clear that AI generated imagery can be illegal at all. The question will only become more pressing with time, and society has to solve it before it can address the holistic concerns you point out.
OpenAI ain't gonna fight that fight, so it's up to EleutherAI or someone else. But whoever fights it in the affirmative will probably be vilified, so it'd require an impressive level of selflessness.
There's a huge case to be made that flooding the darknet with AI generated CP reduces the revictimization of those in authentic CP images, and would cut down on the motivating factors to produce authentic CP (for which original production is often a requirement to join CP distribution rings).
As well, I have wondered for a long time how the development of AI generated CP could be used in treatment settings, such as (a) providing access to victimless images in exchange for registration and undergoing treatment, and (b) exploring if possible to manipulate generated images over time to gradually "age up" attraction, such as learning what characteristics are being selected for and aging the others until you end up with someone attracted to youthful faces on adult bodies or adult faces on bodies with smaller sexual characteristics, etc - ideally finding a middle ground that allows for rewiring attraction to a point they can find fulfilling partnerships with consenting adults/sex workers.
As a society we largely just sweep the existence of pedophiles under the rug, and that certainly hasn't helped protect people - nearly one in four are victims of sexual abuse before adulthood, and that tracks with my own social circle.
Maybe it's time to all grow up and recognize it as a systemic social issue for which new and novel approaches may be necessary, and AI seems like a tool with very high potential for doing just that while reducing harm on victims in broad swaths.
I'd not be that happy with an 8chan AI just spitting out CP images, but I'd be very happy with groups currently working on the issue from a treatment or victim-focus having the ability to change the script however they can with the availability of victimless CP content.
30 years since the original issue of encryption, it looks like cp trumps the other Horsemen of the Cypherpunk FAQ, with drug dealers and organized crime taking the back seat. It's interesting how misinformation is a recent development that they anticipate; a Google search shows that the term 'Infocalypse' was actually appropriated by discussions of deepfakes some time in mid-2020. That said, the crypto wars are here to stay, most recently with EARN IT reintroduced just two months ago.
The similar issue of 3D-printed guns has developed in parallel over the past decade as democratized manufacturing became a reality. There are even HN discussions tying all of these technologies together, by comparing attitudes towards the availability of Tor vs guns (e.g., [1]).
And there are innumerable related moral qualms to be had in the future; will the illegal drugs or weapons produced using matter replicators be AI-designed?
Overall, I think all of these issues revolve around the question of what it means to limit freedoms that we've only just invented, as technological advances enable things never before considered possible in legislation. (And as the parent comment implies, here's where the use of science fiction in considering the implications of the impossible comes in).
[0] https://en.wikipedia.org/wiki/Four_Horsemen_of_the_Infocalyp...
That doesn't mean that it's all bad, and that there's no recreational use for it. We have limits on the availability of various other artificial stimulants. We should continue to have limits on the availability of porn. Where to draw that line is a real debate.
[1] https://en.wikipedia.org/wiki/Wirehead_(science_fiction)
This author's books are great at putting these sorts of moral ideas to the test in a sci-fi context. This specific tome portrays virtual wars and virtual "hells". The hope is of being more civilized than by waging real war or torturing real living entities. However, some protagonists argue that virtual life is indistinguishable from real life, and so sacrificing virtual entities to save "real" ones is a fallacy.
Or some such, it's been a while.
If people are exposed to stimuli, they will pursue increasingly stimulating versions of it. I.e., if they see artificial CP, they will often begin to become desensitized (habituated) and pursue real CP or even live children thereafter.
Conversely, if people are not exposed to certain stimuli, they will never be able to conceptualize them, and thus will be unable to think about them.
Obviously you cannot eliminate all CP but minimizing the overall levels of exposure / ease of access to these kinds of things is way more appropriate than maximizing it.
> If people are exposed to stimuli, they will pursue increasingly stimulating versions of it.
This is not true in any kind of universal way. If you enjoy car chases in movies, does that mean you're going to require more and more intense chase scenes, and then consume real-life crash footage, and ultimately progress to doing your own daredevil driving stunts in real life?
No, because at some point it's "enough."
Same with... literally anything we enjoy. Did you enjoy your lunch? Did you compulsively feel the need to work up to crazier and crazier lunches?
What about sex? Have you had sex? Do you feel the need to seek out crazier and crazier versions of it?
I have accumulated tens of thousands of headshots in video games but have yet to ever shoot a single real person in the face. More importantly, I have never had the urge to seek out same.
I am not sure that your initial premise has any truth to it.
I'd actually argue the reverse, I think you see a lot more effort towards acquiring things that are illegal than you would otherwise.
> Our content policy does not allow users to generate violent, adult, or political content, among other categories. We won’t generate images if our filters identify text prompts and image uploads that may violate our policies. We also have automated and human monitoring systems to guard against misuse.
I can get why the people who worked hard on it and spent money building it don't want to be associated with porn.
Why? Is there something inherently wrong with porn? Is it not noble to supply a base human need, based on some arbitrary cultural artifact that you possess?
Are you asserting that nobody has humanitarian concerns? If so, that's quite a statement; what basis is there? I've seen so many humanitarian acts, big and small, that I can't begin to count. I've seen them today. I hear people express humanitarian beliefs and feelings all the time. I do them and have them myself. Maybe I misunderstand.
I'm not saying that AI will pass all Turing tests. But as far as having a virtual girlfriend/prostitute.
I'm not picking on the commenter - by itself it's not a big deal - but look at the assumptions behind that comment, which I almost didn't notice on HN.
Maybe give it another five years, a few more $billion and a few more petabytes/flops and it will be good. Then finally everyone can generate art for their own Magic: the Gathering cards.
(That's the end goal, right?)
My dataset is a start, but it may benefit from focused training, the way Facebook's new Make-A-Scene https://arxiv.org/abs/2203.13131#facebook (not DALL-E 2 quality but not far from it) has focused losses on faces.
They're a very complex anatomical form, with many small tendons and muscles. Many artists struggle to depict hands. They're not made out of a few straight lines like a torso, there's lots of skew going on. They're probably the hardest structure of the human body to 'learn' for an ML system.
> an astronaut riding a horse, and the astronaut has five fingers on each hand
An example off the top of my head: this could be used as advertising or recruitment for controversial organizations or causes. Would it be wrong for the USA to use this for military recruitment? Israel? Ukraine? Russia?
Another example: this could be used to glorify and reinforce actions which our society does not consider immoral but other societies - or our own future society - will. It wasn't long ago that the US and Europe did a full 180 on their treatment of homosexuality. Will we eventually change our minds about eating meat, driving cars, etc?
Have they gone too far in a desperate bid to prevent the AI from being capable of harm? Have they not gone far enough? I don't know. If I was that worried about something being misused, I don't think I could ever bring myself to work on it in the first place. But I suppose the onward march of technology is inevitable.
glid-3 is a relatively small model trained by a single guy on his workstation (aka me) so it's not going to be as good. It's also not fully baked yet so ymmv, although it really depends on the prompt. The new latent diffusion model is really amazing though and is much closer to DALLE-2 for 256px images.
I think the open source community will rapidly catch up with OpenAI in the coming months. The data, code and compute are all there to train a model of similar size and quality.
What kind of prompts is GLID-3 especially good for? I remember getting lucky when I was playing around a few times but I didn't do it systematically.
Do you happen to know how much GPU RAM I need to run glid-3 and/or the latent diffusion model, if I don't want to run on colab?
OpenAI has a low resolution checkpoint for similar functionality as this - called GLIDE - and the output is super boring compared to community driven efforts, in large part because of similar dataset restrictions as this likely has been subjected to.
I don't see a run button?
Oh.. maybe "Runtime -> Run All" from the menu ...
Shows me a spinning circle around "Download model" ...
26% ...
Fascinating, that Google offers you a computer in the cloud for free ..
Now it is running the model. Wow, I'm curious ..
Ha, it worked!
Nothing compared to the images in the Dall-E 2 article but still impressive.
However, the free GPU is now a K80 which is obsolete and barely sufficient for running these types of models.
It's hard to compare because we don't know how much cherry picking is going on with published Dall-E results (either v1 or v2)
My gut feeling is that it's in the same ballpark as Dall-E 1
Pilots are not there to fly the aircraft, the autopilot already does that. They are there to command the aircraft, in a pair in case one is incapacitated, making the best decisions for the people on board, and to troubleshoot issues when the worst happens.
No AI or remote pilot is going to help when say... the aircraft loses all power. Or the airport has been taken over in a coup attempt and the pilot has to decide whether to escape or stay https://m.youtube.com/watch?v=NcztK6VWadQ
You can bet on major flights having two commercial pilots right up until the day we all get turned into paperclips.
For things like aircraft pilots, it's both realtime (which means a 'reviewer' per output, so you haven't taken a highly trained pilot out of the loop, even if you've relegated them to supervising the computer) and life-critical, so merely "so-so" isn't good enough.
In practice in illustration (as in all arts) there are a variety of markets where different levels of talent, originality, reputation and creative engagement with the brief are more relevant. For editorial illustration, it's certainly not a case of 'find me someone who can draw X', and probably hasn't been since printing presses got good enough to print photographs.
Unlike artwork, precision and correctness are absolutely critical in coding.
Bytecode -> Assembly -> C -> higher level languages -> AI-assisted higher-level languages
But the reverse is not true, they won't be able to properly vet a piece of code generated by an AI since that will require technical expertise. (You could argue if the piece of code produced the requisite set of output that they would have some marginal level of confidence but they would never really know for sure without being able to understand the actual code)
For the first category, Dall-E 2 and Codex are promising but not there yet. It's not clear how long it'll take them to reach the point where you no longer need people. I'm guessing 2-4 years but the last bits can be the hardest.
As for the second category, we are not there yet. Self-driving cars/planes, and lots of other automation will be here and mature way before an AI can read and communicate through emails, understand project scope and then execute. Also lots of harmonization will have to take place in the information we exchange: emails, docs, chats, code, etc... That is, unless the browser is able to open a navigator and type an address.
It's important to note that we still need professionals to guarantee the quality of the output from AIs, including this one. As noted in their issue tracker, DALL-E has very specific limitations, but these can be easily solved by employing dedicated professionals, who are trained to tame the AI and properly finish the raw output.
So, if I were running OpenAI, I'd clearly be experimenting with how their AIs and humans interact, and building a training program around it for producing practical outputs. (Actually, I work in consumer robotics, and human adoption has been the biggest hurdle here. Thus, my claim here.)
--
In the case of fine art, though, I don't think they'll get hit by this AI advancement. The biggest problem is that you simply can't get the exact image you want with this AI. Even humans cannot transfer visual information in verbal form without a significant loss of details, and thus a loss of quality. It's the same with AI, but worse, because AI relies on the bias in a specific set of training data, and it never truly understands the human context in it (at the current level of technology).
Additionally, the rise of no-code development is just extending the functionality of designers. I didn't take design seriously (as a career choice) growing up because I didn't see a future in it, now it pays my bills and the demand for my services just grows by the day.
Similar argument to make with chess AI: it didn't make chess players obsolete, it made them stronger than ever.
Are all designers becoming more valuable or is a subset of really good ones going to reap the value increase and capture more of the previously available value?
I think your analogy is poor, because this is a tool for makers. The engineers aren't the makers.
I think a more apt analogy is if John Deere made a universal harvester that you could use for any crop, but they decided they didn't like soybeans so you are forbidden to use it for that. In that case, yes I would complain, and I would expect everyone else to, as well.
It's their service, their call.
I have some hobby projects, almost nobody uses them, but you bet I'll shut stuff down if I feel something bad is happening, like it being used to harass someone, etc. NOT "because bad PR" but because I genuinely don't want to be a part of that.
If you want some images / art made for you don't expect someone will make them for you. Get your own art supplies and get to work.
I feel like we, as a species, will struggle for a while with how to treat adults like adults online. As happy as I am to advocate for safe spaces on the internet, perhaps we need to start having a serious discussion about how we can do so without resorting to putting safety mats everywhere and calling it a job well done.
Hecklers get a veto?
It makes me wonder what they're planning to do with this? If they're deliberately restricting the training data, it means their goal isn't to make the best AI they possibly can. They probably have some commercial applications in mind where violent/hateful/adult content wouldn't be beneficial. Children's books? Stock photos? Mainstream entertainment is definitely out. I could see a tool like this being useful during pre-production of films and games, but an AI that can't generate violent/adult content wouldn't be all that useful in those industries.
I don't think there is a way comparable to markdown, since the formatting options are limited: https://news.ycombinator.com/formatdoc
So your options are literal quotes, "code" formatting like you've done, italics like I've done, or the '>' convention, but that doesn't actually apply formatting. Would be nice if it were added.
Personally, I prefer to combine the '>' convention with italics. Still, I'd agree that proper quote formatting would be a welcome improvement.
This is exactly the sort of thing that gets a company mired in legal issues, vilified in the media, and shut down. I can not blame them for avoiding that potential minefield.
(Hmm, I guess this comparison doesn't actually work...)
But at least we can get another billion of meme-d comics with apes wearing sunglasses, so that's good news right?
It's just soul-crushing that all the modern, brilliant engineering is driven by abysmal, not even high-school art-class grade aesthetics and crowd-pleasing ethics that are built around the idea of not disturbing some 1000 very vocal twitter users.
Death of culture really.
Companies like OpenAI have a responsibility to society. Imagine the prompt “A photorealistic Joe Biden killing a priest”. If you asked an artist to do the same they might say no. Adding guardrails to a machine that can’t make ethical decisions is a good thing.
Personally, I fear more what corporations or some governments can do with such models than what a random person can do generating Biden images. And without restriction, at least academics could better study these models (including their risks) and we could be better prepared to deal with them.
Society didn't collapse after photoshop. "Responsibility to society" is such a catch-all excuse.
Their document about all the measures they took to prevent unethical use is also a document about how to use a re-implementation of their system unethically. They literally hired a "red team" of smart people to come up with the most dangerous ideas for misusing their system (or a re-implementation of it), and featured these bad ideas prominently in a very accessibly written document on their website. So many fascinating terrible ideas in there! They make a very compelling case that the technology they are developing has way more potential for societal harm than good. They had me sold at "Prompt: Park bench with happy people. + Context: Sharing as part of a disinformation campaign to contradict reports of a military operation in the park."
But amusingly, exactly that did happen in one of their GPT experiments! https://openai.com/blog/fine-tuning-gpt-2/
That's no hot take. It's literally the reason.
The nature of creative work will certainly change, creatives will adopt tools such as Dall-E 2. In certain narrow cases they might be replaced, such as if you are asking a creative to generate a very specific image, but how often is that the case? The majority of the time tools such as Dall-E 2 will act as an accelerator for creatives and help them increase their output.
Furthermore, tools like Dall-E seem like they'll lower the barrier of entry for more people to get into art, resulting in more artists, not fewer. Increased competition for the same dollar amounts might make artists, on average, "poorer" (when averaged across an increased number of artists), but this seems like the end-result of any new tool that empowers more artists to more easily make "good" work, not just AI-generated tools.
I'm excited for both 1) more art in the world and 2) in some cases, artists making "even better" art (by combining their existing experience + new tools).
I think art will survive, just like photography didn't kill painting. The idea of art might simply begin to encompass this new means of production, which no longer requires the steady hand, but still requires a discerning eye. Sure, we might say that the "artist" is simply a curator, picking which algorithmic output is most worthy of display, but these distinctions have historically been fluid, and challenging ideas of art has long been one of art's functions as well.
Jumping out of the conceptual box to generate novel PURPOSE is not the domain of a Dall-E 2. You've still gotta ask it for things. It's a paintbrush. Without a coherent story, it's an increasingly impressive stunt (or a form of very sophisticated 'retouching brush').
If you can imagine better than the next guy, Dall-E 2 is your new tool for expression. But what is 'better'?
Maybe lots of artists of the future will actually use AI models to express their inner thoughts and desires in a way that touches something in their audience. It will still be art.
Even if an AI could generate an exactly equivalent painting, I would pay $0 for it. It wouldn't mean anything to me.
But...that's always been the case for creatives.
> for all but the most famous
OK DALL-E, generate our logo in the style of ${most famous}
40+ years ago, it was hard to access the equipment necessary to learn music production, so only a small slice of the population was able to learn these skills. And availability made the process take years.
Today, you can download free software that enables music production, and if you have a good ear, can create something "good" in weeks. This has led to an explosion of musical experimentation by the youth: a teenager can now create a great electronic dance song with devices they already own if they have the right creativity, taste and dedication.
Similarly, everyone has an imagination - many people have visual imaginations. The gating factor of art production is largely the mechanical memory of how to transform mental concepts into the right shapes and hues to express that visual concept to others.
With these sorts of tools we are going to have an explosion of art hobbyists. I've played with some similar, more primitive AI art generation tools and it is a lot of fun. People will be creating works of art from their couch while watching TV that rival the quality of what professionals are producing today.
Or when synthesizers and computer music were invented, that they would displace talented musicians who know how to play an instrument, and that now everybody without a musical education would be able to produce music, thus devaluing actual musicians.
Maybe everyone will have an AI image as their desktop wallpaper, but if you've got cash you'll want something with provenance and rarity to brag about.
Also, I think creatives are valued for their imagination. If you wanted something decent, would you pay someone to sift through a million AI generated images to find a gem, or just pay an artist you like to create one for you?
1) That is a tiny share of the market. Most of the market is - I have a game / online publication / book, and I need an illustration xyz. Which this AI seems to solve.
2) how do you even prove your rare art wasn't painted by an AI?
By the same logic you should also complain about any number of IDEs, development tools, WordPress, game maker systems like RPG maker or Unity, after all if anyone can just leverage a free physics and collision system without having a complete understanding of rigid body Newtonian systems to roll their own engine it'll be too uniform.
First, it creates a random 10x10 pixel blurry image and asks a neural net: "Could this be a duck wearing a hat on Mars?" and the neural net replies "No, because all the pictures I've ever seen of Mars have lots of red color in them" so the system tweaks the pixels to make them more red, put some pixels in the center that have a plausible duck color, etc.
After it has a 10x10 image that is a plausible duck on Mars, the system scales the image to 20x20 pixels, and then uses 4 different neural nets on each corner to ask "Does this look like the upper/lower left/right corner of a duck wearing a hat on Mars?" Each neural net is just specialized for one corner of the image.
You keep repeating this with more neural nets until you have a pretty 1000x1000 (or whatever) image.
Here's more of a 'not 15 year old' explanation: https://ml.berkeley.edu/blog/posts/dalle2/
The system consists of a few components. First, CLIP. CLIP is essentially a pair of neural networks, one is a 'text encoder', and the other is an 'image encoder'. CLIP is trained on a giant corpus of images and corresponding captions. The image encoder takes as input an image, and spits out a numerical description of that image (called an 'encoding' or 'embedding'). The text encoder takes as input a caption and does the same. The networks are trained so that the encodings for a corresponding caption/image pair are close to each other. CLIP allows us to ask "does this image match this caption?"
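The "does this image match this caption?" question boils down to comparing two vectors. Here's a minimal sketch of that comparison using cosine similarity. The three-number embeddings are hand-made placeholders I invented for illustration (real CLIP encodings have hundreds of dimensions and come out of the trained networks), and `best_caption` is a hypothetical helper, not part of any CLIP API.

```python
import math
from typing import Dict, List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def best_caption(image_vec: List[float], captions: Dict[str, List[float]]) -> str:
    """Return the caption whose encoding is closest to the image encoding."""
    return max(captions, key=lambda c: cosine_similarity(image_vec, captions[c]))


# Made-up encodings standing in for the outputs of the two encoders.
image_encoding = [0.9, 0.1, 0.2]
caption_encodings = {
    "a teddy bear on a skateboard": [0.8, 0.2, 0.1],
    "a bowl of soup": [0.1, 0.9, 0.3],
}

print(best_caption(image_encoding, caption_encodings))
```

Training is what makes matching pairs land close together; the comparison itself is this simple.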
The second part is an image generator. This is another neural network, which takes as input an encoding, and produces an image. Its goal is to be the reverse of the CLIP image encoder (they call it unCLIP). The way it works is pretty complicated. It uses a process called 'diffusion'. Imagine you started with a real image, and slowly repeatedly added noise to it, step by step. Eventually, you'd end up with an image that is pure noise. The goal of a diffusion model is to learn the reverse process - given a noisy image, produce a slightly less noisy one, until eventually you end up with a clean, realistic image. This is a funny way to do things, but it turns out to have some advantages. One advantage is that it allows the system to build up the image step by step, starting from the large scale structure and only filling in the fine details at the end. If you watch the video on their blog post, you can see this diffusion process in action. It's not just a special effect for the video - they're literally showing the system process for creating an image starting from noise. The mathematical details of how to train a diffusion system are very complicated.
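The "slowly add noise" half is easy to write down, so here's a rough sketch of it. I'm assuming the usual fixed-variance Gaussian step (scale the image down a little, add a little noise) with a constant `beta`; the comment above doesn't specify a schedule, so treat the numbers as placeholders. The hard part, learning the reverse step with a neural net, is not shown.

```python
import math
import random
from typing import List


def add_noise_step(pixels: List[float], beta: float, rng: random.Random) -> List[float]:
    """One forward step: shrink the signal slightly and mix in Gaussian noise."""
    keep = math.sqrt(1.0 - beta)   # how much of the current image survives
    spread = math.sqrt(beta)       # how much fresh noise is mixed in
    return [keep * p + spread * rng.gauss(0.0, 1.0) for p in pixels]


def forward_diffusion(pixels: List[float], num_steps: int, beta: float,
                      rng: random.Random) -> List[List[float]]:
    """Return the whole trajectory, from the clean image to (almost) pure noise."""
    trajectory = [list(pixels)]
    for _ in range(num_steps):
        trajectory.append(add_noise_step(trajectory[-1], beta, rng))
    return trajectory


# A flat 64-pixel "image"; after many steps the original values are washed out.
steps = forward_diffusion([1.0] * 64, num_steps=200, beta=0.05, rng=random.Random(0))
print(len(steps))
```

A diffusion model is trained on pairs of neighbouring frames from trajectories like this, so that it can walk them backwards from noise to image.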
The third is a "prior" (a confusing name). Its job is to take the encoding of a text prompt, and predict the encoding of the corresponding image. You might think that this is silly - CLIP was supposed to make the encodings of the caption and the image match! But the space of images and captions is not so simple - there are many images for a given caption, and many captions for a given image. I think of the "prior" as being responsible for picking which picture of "a teddy bear on a skateboard" we're going to draw, but this is a loose analogy.
So, now it's time to make an image. We take the prompt, and ask CLIP to encode it. We give the CLIP encoding to the prior, and it predicts for us an image encoding. Then we give the image encoding to the diffusion model, and it produces an image. This is, obviously, over-simplified, but this captures the process at a high level.
Why does it work so well? A few reasons. First, CLIP is really good at its job. OpenAI scraped a colossal dataset of image/caption pairs, spent a huge amount of compute training it, and came up with a lot of clever training schemes to make it work. Second, diffusion models are really good at making realistic images - previous works have used GAN models that try to generate a whole image in one go. Some GANs are quite good, but so far diffusion seems to be better at generating images that match a prompt. The value of the image generator is that it helps constrain your output to be a realistic image. We could have just optimized raw pixels until we get something CLIP thinks looks like the prompt, but it would likely not be a natural image.
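As a toy illustration of "optimizing raw pixels until CLIP thinks it looks like the prompt": below, a 2-number "image" is nudged by gradient ascent until a stand-in encoder scores it as matching a text encoding. The 2x2 matrix playing the role of the image encoder, the text encoding, and the step size are all invented; real CLIP is a deep network and the gradient would come from autodiff rather than a hand-written formula.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_gradient(y, target):
    """Gradient of cosine_similarity(y, target) with respect to y."""
    y_norm = np.linalg.norm(y)
    t_norm = np.linalg.norm(target)
    return target / (y_norm * t_norm) - (y @ target) * y / (y_norm**3 * t_norm)

toy_encoder = np.array([[1.0, 0.5], [0.0, 1.0]])  # hypothetical "image encoder"
text_embedding = np.array([0.0, 1.0])             # hypothetical prompt encoding
pixels = np.array([1.0, 0.0])                     # starting "image"

initial_score = cosine_similarity(toy_encoder @ pixels, text_embedding)
for _ in range(200):
    image_embedding = toy_encoder @ pixels
    # Chain rule: push the score gradient back through the linear encoder.
    grad_pixels = toy_encoder.T @ cosine_gradient(image_embedding, text_embedding)
    pixels = pixels + 0.5 * grad_pixels
final_score = cosine_similarity(toy_encoder @ pixels, text_embedding)
```

Nothing in this loop asks whether `pixels` looks like a natural image; it only chases the score. That is the gap the image generator fills.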
To generate an image from a prompt, DALL-E 2 works as follows. First, ask CLIP to encode your prompt. Next, ask the prior what it thinks a good image encoding would be for that encoded prompt. Then ask the generator to draw that image encoding. Easy peasy!
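In code, the three-step recipe is just function composition. The sketch below uses trivial stand-ins for the three components so it actually runs; `toy_encode_text`, `toy_prior`, and `toy_decoder` are made-up placeholders, not anything from OpenAI.

```python
from typing import Callable, List

Vector = List[float]
Image = List[List[float]]

def generate_image(
    prompt: str,
    encode_text: Callable[[str], Vector],
    prior: Callable[[Vector], Vector],
    decoder: Callable[[Vector], Image],
) -> Image:
    text_embedding = encode_text(prompt)    # 1. CLIP text encoder
    image_embedding = prior(text_embedding) # 2. prior: text encoding -> image encoding
    return decoder(image_embedding)         # 3. diffusion decoder: encoding -> image

# Toy stand-ins so the pipeline can be exercised end to end.
def toy_encode_text(prompt: str) -> Vector:
    return [float(len(word)) for word in prompt.split()]

def toy_prior(text_embedding: Vector) -> Vector:
    return [value * 0.5 for value in text_embedding]

def toy_decoder(image_embedding: Vector) -> Image:
    return [[value, value] for value in image_embedding]

image = generate_image("a teddy bear", toy_encode_text, toy_prior, toy_decoder)
```

Passing the components in as arguments keeps the sketch honest about what it does not know: each one is a large trained model in the real system.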
The MIT Limits to Growth study predicts the collapse of global civilization around 2040
https://www.vice.com/amp/en/article/z3xw3x/new-research-vind...
> So how can we adapt these AI experiments to get real work done?
You're missing a step here - the difference between "imagining doing something" and "actually doing something". An ML model can produce thoughts, but that isn't necessarily the same direction of research as actually doing things in real life, much less becoming superhuman and taking over the world etc.
In your imagination, everything always goes your way.
I'm in a bit of a rush and don't know the term for this offhand, but I remember hearing that single-layer neural networks are equivalent to multi-layer ones:
https://stats.stackexchange.com/questions/451127/equivalence...
https://www.quora.com/Is-a-single-layer-feed-forward-neural-...
There are probably more insights like this out there. These equivalences allow us to think in abstractions that get us above the minutia of fine-tuning these algorithms so that we can see the big picture. I think.
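One version of this claim that definitely holds (it may not be the equivalence the linked answers have in mind): if the layers have no nonlinearity between them, any stack of them collapses into a single layer, because multiplying by two matrices in a row is the same as multiplying by their product. A quick check with random, made-up weights:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 3))  # first layer: 3 inputs -> 5 hidden units
W2 = rng.standard_normal((2, 5))  # second layer: 5 hidden units -> 2 outputs
x = rng.standard_normal(3)

# Two linear layers applied one after the other...
two_layer_output = W2 @ (W1 @ x)

# ...equal one linear layer whose weights are the matrix product.
collapsed_weights = W2 @ W1
one_layer_output = collapsed_weights @ x

# With a nonlinearity (here ReLU) in between, the collapse no longer applies.
relu_output = W2 @ np.maximum(W1 @ x, 0.0)
```

With nonlinearities the picture changes: a single wide hidden layer can approximate a lot, but that is a statement about approximation, not exact equality.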
Artificial General Intelligence
>For example, I went to an AI talk about 5 years ago where the guy said that any of a dozen algorithms like K-Nearest Neighbor, K-Means Clustering, Simulated Annealing, Neural Nets, Genetic Algorithms, etc can all be adapted to any use case. They just have different strengths and weaknesses. At that time, all that really mattered was how the data was prepared.
How do you suppose KNN is going to generate photorealistic images? I don't understand the question here
>I guess fundamentally my question is, when will AGI start to become prevalent, rather than these special-purpose tools like GPT-3 and Dall-E 2?
Actual AGI research is basically nonexistent, and GPT-3/Dall-E 2 are not AGI-level tools.
>Personally I give it less than 10 years of actual work, maybe less
Lol...
>I just mean that to me, Dall-E 2 is already orders of magnitude more complex than what's required to run a basic automaton to free humans from labor.
Categorically incorrect
The flip side is that these narrow use cases progressed so quickly that we have to worry about stuff like deep fakes now.
Something's not right here.
As a programmer, I feel that what went wrong is that we invested too much in profit-driven endeavors, basically stuff that's mainstream. To be blunt, the academic side of me doesn't care about use cases. I care about theory, formalism, abstraction, reproducibility, basically the scientific method. From that perspective, all AI is equivalent, it just takes input, searches a giant solution space using its learned context as clues, and returns the closest solution it can in the time given. It's an executable piping data around. The rest is hand waving.
And given that, the stuff that AI is doing now is orders of magnitude more complex than running a Roomba. But a robot vacuum actually helps people.
To answer your question, a KNN could solve this if the user reshapes the image data into a different coordinate system where the data can be partitioned (all inference comes down to partitioning):
https://en.wikipedia.org/wiki/Change_of_basis
Tensors are about reshaping data into a coordinate system where relationships become obvious, like going from rectangular to polar coordinates, or using a Fourier transform:
https://en.wikipedia.org/wiki/Tensor
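Here's a small sketch of the coordinate-change point, on classification rather than image generation (it says nothing about whether KNN could generate photorealistic images). Points sit on two rings; a 1-nearest-neighbor lookup on raw x/y gets a query wrong, while the same lookup on the radius gets it right. The data and helper names are invented for illustration.

```python
import numpy as np

def nearest_neighbor_label(train_features, train_labels, query_feature):
    """1-nearest-neighbor: return the label of the closest training point."""
    distances = np.linalg.norm(train_features - query_feature, axis=1)
    return train_labels[int(np.argmin(distances))]

def to_radius(points):
    """Change of coordinates: keep only the distance from the origin."""
    return np.linalg.norm(points, axis=1, keepdims=True)

# Four points on an inner ring (radius 1), two on an outer ring (radius 3).
train_points = np.array([
    [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0],
    [3.0, 0.0], [0.0, 3.0],
])
train_labels = ["inner"] * 4 + ["outer"] * 2

# A point on the outer ring, far from the outer-ring training points.
query = np.array([[0.0, -3.0]])

cartesian_guess = nearest_neighbor_label(train_points, train_labels, query[0])
polar_guess = nearest_neighbor_label(
    to_radius(train_points), train_labels, to_radius(query)[0]
)
```

In x/y space the closest training point is (0, -1) on the inner ring, so the raw lookup says "inner". In radius space the query sits exactly on the outer ring's radius, so the lookup says "outer".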
My frustration with all of this is the same one I have with physics or any other evolving discipline. The lingo obfuscates the fundamental abstractions, creating artificial barriers to entry.
Edit: I should add a disclaimer here that my friend and I worked on a video game for like 11 years. I'm no expert in AI, I'm just acutely sensitive to how the realities of the workaday world waste immeasurable potential at scale.
The same technology that is drawing cute unicorns can be used for endless other use cases. Perhaps the PR side of the launch, and the subject matter they chose to unveil their product with, is just that: PR.
It's like the Apple Memoji thing (not sure if I'm spelling it correctly). You can think of it as trivial and a waste of talent to use their Camera/FaceID to animate cute animals based on facial expressions, but that same tech will enable lots of other things to come.
1. step-by-step guidance for a blind person navigating the use of a public restroom.
2. an EMS AI helping you to save someone's life in an emergency.
3. an AI coach that can teach you a new sport or activity.
4. an omnipresent domain-expert that can show you how to make a gourmet meal, repair an engine, or perform a traditional tea ceremony.
5. a personal assistant that can anticipate your information need (what's that person's name? where's the exit? who's the most interesting person here? etc.) and whisper the answer in your ear just as you need it.
Now, add all of the above to an AR capability where you can now think or speak of something interesting and complex, and have it visualized right before your eyes. With this capability, I could augment my imagination with almost super-human capabilities that allow one to solve complex problems almost as if it was an internal mental monologue.
All of these scenarios are just a short hop from where we're at now, so mark my words: we will have "borgs" like those described above long before we reach anything like general AI.
I still have to do all the hard thinking, but once I figure out what I want written and start typing, Copilot will spit out a good portion of the contextually-obvious lines of code.
For example, recent phone cameras can estimate depth per pixel from single images. Hundreds of millions of these devices are deployed. A decade ago this was AI/CV research lab stuff.
Smaller reproductions of the original research.
I wouldn't be surprised to see a comparable version for 3D models in the next year or two, though. Even if the current architecture doesn't lend itself to 3D structures (I don't know), there's a lot of parallel work being done right now (esp. by Google) for encoding 3D data in new/efficient ways, translating specialized 2D images into 3D models, and more.
OpenAI had no idea it could be used to generate images itself, which is why they left in issues like how it thinks an apple and the word "apple" written on a piece of paper are the same thing. Probably wouldn't have released it if they did know.
⸻
1. My current work background is an enormous screen-filling eyeball. For my writing group, I try to have something that reflects the story I'm workshopping if I'm workshopping that week and something surreal otherwise.
2. My most expensive custom illustration was a title for an article about stone carver/letterer David Kindersley which I had inscribed in stone and photographed.
https://www.fastcompany.com/90725035/metaverse-horizon-world...
Say I'm looking for photography of real events and places, like a royal wedding or a volcano erupting: does this help me? Of specific places and architectural features? Of a protest?
You're suggesting clipart on steroids: https://thispersondoesnotexist.com
I think if I was istockphoto.com I'd be a little worried, but that is microstock photography. I'm not sure that is worth billions. In fact I know it isn't.
Besides, once this tech is widely available, if anything it devalues this sort of thing further, closer to $0.
It would probably augment existing processes rather than replace them completely.
If you are doing a photoshoot for a banana stand with a human model with characteristics x,y,z you're still going to get a human from an agency or craigslist to pose. If suddenly the client informs you that they needed human a,b,c instead maybe one of these forthcoming tools will let you swap that out faster. You'd upload your photoshoot and an example or two of the type of human model you wished you had retroactively and it would fix it up faster than an intern.
Cool.
My hypothesis is that it could be a partial replacement/competitor and devalue their offering - reasonable to assume you'd be paying $99/mo soon and it will gradually decrease as the tech spreads and more competitors emerge.
Adobe is also in this game (https://stock.adobe.com), they are not unfamiliar with AI. You can see how a lot of people will jump on this if it proves to be lucrative.
I don't claim to be an expert and I didn't say this is worthless.
For porn and sex it's different though. Some people are attracted to things that are deviant and taboo. That's the part they're looking for. As pornography has become more widely accepted, a market has developed for more and more extreme forms of it. This has been documented. It's not the content per-se but rather the nature of it that is found attractive. So the idea is to find a line that's reasonable so the people that feel the need to get close to that line can have that urge fulfilled without damaging society.
A market will form for more and more extreme content as soon as the line moves and what was once taboo no longer is. An Overton window of sorts for pornography.
I don't think this is the case, from anecdotal experience; Hollywood chase scenes are much more exciting to me than real life crash footage, and I've watched enough. They need cooking, and if you are cooking anyway, mixing artificial and "natural" ingredients can even be more of a problem than a positive.
Truth is always boring.
Pornographers know this and talk about it. Read David Foster Wallace's essay on it.
Personally, I would never buy a painting generated by an ML model, or even a commercial illustration, if I can help it. The artist and their life experience is half the point of art, IMO.
This is arguably the most insipid and stupid crippling of a powerful tool for content creation I can think of. It’s worse than the adobe updates using every cpu core and locking up my machine once a week.
What counts as “political” hm? Want it to look like that Obama poster or perhaps you want a Soviet Union flag for your retro 80s punk… oops sorry “political”… let’s go to adult… hmm that’s even dumber is the model showing too much ankle? What about the obvious fact that this is just designed with a heterodoxy view of pornography and likely does nothing to stem the wildly various fetishes and other sexual proclivities that exist in the world…
It is effectively “we got squeamish and have done a bunch of stuff to stop you doing stuff that makes us squeamish, please don’t make us squeamish, we’re so worried we’re even checking for it in case you sneak something past us”…
They should comply with the law, try to prevent and also check for child porn… but otherwise just let users use the damn tool, if someone wants an Obama hope poster of a sexualised Mussolini jerking off onto a balloon animal… why the heck do they feel the need to say no to that. It’s a deeply repressive instinct that should be fought against whenever people start to “police” what is acceptable in artistic mediums.
I look forward to the reimplemented versions of this from efforts like EleutherAI and others.
Personally I really like leaning into the discontinuities and quirkiness of generated images. This is output from GLID-3: https://twitter.com/mwegner/status/1511139661095178241
eg. prompt: half human half Eiffel tower. A human Eiffel tower hybrid (I get mostly normal Eiffel towers from LDM but some sensical results from glid-3)
glid-3 will be worse for things that require detailed recall, like a specific person.
With smaller models you kind of have to generate a lot of samples and pick out the best ones.
Bad designers are even being given better and better paying jobs as the top talent gets poached up quicker and quicker.
In other words, if someone else is a better designer than you, that actually has nothing to do with if they're going to take your job. They may have something better to do. An ML model isn't a worker no matter how good it is at painting, so rather than having a job it has input resources (RTX 3090s, electricity, maintenance engineers) but the concept is still important.
Eventually somebody will use the research to train the model to do whatever they want it to do.
Besides that these models are massive. For quite a while the only people even capable of making them will be those with significant means. That will be mostly Governments and Corporations anyway.
Whether you consider it moral doesn't seem relevant, only to respect the wishes of the author of such programs.
[1] https://github.com/katharostech/bevy_retrograde/blob/master/...
AI does not have to be perfect and it's likely that businesses will settle for almost as good as human if it's 'cost effective'.
A child with adult body parts is a whole other class of weirdness that might pop out too.
Models want to surprise us all.
First video clips were with the faces of your usual celebrities, but then suddenly I got "treated" to Greta Thunberg in the situations you might expect. I cut my exploration short.
Now, Greta Thunberg is actually 19 now (how time flies !), except that deep fake was most likely trained on her media appearances, which started when she was 15 !
(I guess at least that she wasn't a child any more, which might explain why those clips had not been almost immediately flagged and removed ?)
The "edit" capability, as far as I can tell (please correct me if I got confused), is picking your favorite out of the generated variations.
I would like to "lock" the scene and add instructions like "throw in a reflection".
- Provide an existing image
- Provide a text prompt ("flamingo")
- Select from X variations the new image that looks best to you
- It does the equivalent of a google image search on your "flamingo" prompt
- It picks the most blend-able ones as a basis to a new synthetic flamingo
- It superimposes the result on your image
Very cool, don't get me wrong. Now I want to tweak this new floating flamingo I picked further, or have that Corgi in the museum maybe sink into the little couch a bit as it has weight in the real world. Can't. You'd have to start over with the prompt or use this as the new base image maybe.
The example with furniture placement in an empty room is also very interesting. You could describe the kind of couch you want and where you want it and it will throw you decent options.
But say I want the purple one in the middle of the room that it gave me as an option, but rotated a little bit. It would generate a completely new purple couch. Maybe it will even look pretty similar but not exactly the same.
See what I mean?
This makes it as easy as typing a sentence - and the quality seems fairly realistic
I should be explicit -- I am saying the exposure which makes one seek stimulus is merely a catalyst for deeper urges, not a generator of them as such. A certain level of disinhibition (e.g. sociopathy) is required but IMO so is a prior conception of the deed.
In your example, if someone is predisposed to wanting to shoot actual people in the head, exposing them to video game headshots may distract in the short term but desensitizes and entrenches the image in the long term, possibly making it easier to decide to pull the trigger later on if they are sufficiently uninhibited by social concerns. This does not happen for people with high inhibitions, or at least sufficient self-control.
I'm not sure that's true. Our brains can imagine a lot that we've never seen, though maybe not very accurately. Inventors and developers and artists do it all the time, if we are talking about the same thing.
I'm not sure that disproves your premise. Virtual experiences may make real ones easier, but some research and details about where it works, where it doesn't, would be helpful. Many training programs use virtual experiences, such as flight simulators.
Am totally blind, have never been able to see, can still conceive of a headshot. So, yes?
To put it as nicely as possible, this wildly contradicts reality as I have experienced it and observed others experiencing it.
2) Because we haven't built a machine that can paint (etc.) with traditional materials like a skilled artist?
Another use case could be to make it easier (or automatic) to create comics. You tell it what the background should be, what the characters should be doing, and the dialogue. Boom, you have a good enough comic.
-----------
Reading as a medium has not evolved with technology. Creating the imagery does happen in humans' minds. It's no surprise that some people enjoy doing that (and also enjoy watching that imagery) and others do not.
This could be a helping brain to create those imageries.
-----------
Now imagine reading stories to your child. Actually, creating stories for your child, where they are the characters in the stories. Having a visual element to it is definitely going to be a premium experience.
Yes, this is the sane approach, since a jet represents an enormous amount of energy that can be directed (just about) anywhere in the world. But that said, there seems to be enormous pressure to allow driverless vehicles, which also direct large amounts of energy anywhere in your city. IOW it seems like a matter of time before we say, collectively, screw it, let the computers fly the plane and if loss of power is a catastrophe, so be it.
As far as the extremely unlikely hostage situation goes, if the plane were AI-controlled, people would be even less likely to attempt a hijacking in the first place, since there wouldn't be a human element (a.k.a. a pilot) whose emotions they could appeal to.
I can easily imagine that at some point, pilots are replaced with technicians who are just there to fix redundant AI systems in case of failure.
What GPT-3 and DALL-E show is that you can infer a lot based on the latent structure of data, even without understanding the underlying physical process.
https://lilianweng.github.io/posts/2021-07-11-diffusion-mode...
Personally, I find the core diffusion papers pretty dense and difficult to follow, so the blog post is where I'd begin.
https://arxiv.org/pdf/1503.03585.pdf
This paper is a decent starting point on the literature side, but it's a doozy.
Both the paper and blog post are pretty math heavy. I have not yet found a really clear intuitive explanation that doesn't get down in the weeds of the math, and it took me a long time to understand what the hell the math is trying to say (and there are some parts I still don't fully understand!)
See the linked papers if you don't like videos.
These are fp16 numbers though, you might need a recent nvidia card to run it.
I used the instructions here to check: https://github.com/wang-xinyu/tensorrtx/blob/master/tutorial...
"One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form."
Yeah, for measures that are subsetting out only the nice data, "flipping the sign" would be picking the other subset. So something like "data_to_train_on = (good_data_split, evil_data_split)[accidental_one_based_index_because_humans_still_cant_agree_on_how_to_count]"
> This is my quote.
It's much better than using a code block for your readers.
CLIP+VQGAN generation IIRC works by replacing the adversarial network with CLIP, so it understands text prompts, then retraining it for a while towards the prompted target, then generating whatever it's learned from that.
GANs are a silly idea that shouldn't work but somehow do. There are some attempts to replace the idea: https://www.microsoft.com/en-us/research/blog/unlocking-new-...
Makes sense to me as far as avoiding a sort of maximized sunset that is always there and is SUNSET rather than a nice sunset... but also avoiding watering it down and getting a way too subtle sunset.
It's not AI but I've been watching some folks solving / trying to solve some routing (vehicles) problems and you get the "this looks like it was maximized for X" kind of solution but that's maybe not what is important / customer perception is unpredictable. I kinda want to just come up with 3 solutions and let someone randomly click .... in fact i see some software do that at times.
Accomplishing that is achieving general AI.
In the meantime, there are plenty of boilerplate ORMs and simplistic API template tools that make production of bog standard CRUD apps dead simple. Of course, they all have their drawbacks and trade-offs, and aren't always suitable. But I don't see the amount of software engineering work reducing as a result of these no-code, low-code tools, do you?
As I see it, the real challenge to solve is for it to be able to hold context and communicate iteratively. Also, as you say, to find the gaps. That's important. Other than that, you tell it what you want, it creates something and then you tell it to change things around. Which is, BTW, pretty similar to how it works with biological-life-based developers. Though as we're lazy, we like to clarify a lot of things up front (and either drive customers crazy or teach them that this is the way it works). If you have an AI that spits out code in a few minutes, it may not matter a lot.
Most of the programming jobs are indeed about making relatively simple stuff from standard components.
Let me know when you find a single programmer who can do that reliably.
As a tool, this could be used by an artist to continue working on that image until it's exactly what the artist (or the commissioner) is looking for: masking off the water to actually add dolphins, masking off the ship to redraw it, retoning the sky for a more aesthetically-pleasing sunset, adding other objects to specific locations in the scene, etc.
Here's one example of how these composite prompts + masking can make more specific images here: https://twitter.com/jmhessel/status/1511757848442654721
If you put both in front of someone with no idea about conceptual art, there’s a real chance you might be right. If they happen not to “get” the work or understand the context or just know enough about conceptual art, then a viewer might easily miss the point.
But a computer could not have conceived Duchamp’s urinal, not with our current technology. You’re probably going to need AGI for that (which I’m certain will arrive eventually).
But deliberately, no, it couldn’t, not yet, and human conceptual artists could make far far superior art than a machine. Because great art requires understanding the human condition and deep reasoning about the world.
A comparison can be made with Damien Hirst’s or Antony Gormley’s use of assistants to create the pieces as instructed by the artists.
Duchamp’s urinal isn’t brilliant because the urinal was difficult to acquire or to make, but because it expresses so much and asks so many questions.
https://twitter.com/RiversHaveWings & https://github.com/crowsonkb
[5] Katherine Crowson. Ava linear probe. https://twitter.com/RiversHaveWings/status/1472346186728173568?s=20&t=T-HRr3Gw5HRGjQaMDtRe3A, 2021.
[6] Katherine Crowson. Clip guided diffusion hq 256x256. https://colab.research.google.com/drive/12a_Wrfi2_gwwAuN3VvMTwVMz9TfqctNj, 2021.
[7] Katherine Crowson. Clip guided diffusion 512x512, secondary model method. https://twitter.com/RiversHaveWings/status/1462859669454536711, 2021.
[8] Katherine Crowson. v-diffusion. https://github.com/crowsonkb/v-diffusion-pytorch, 2021.
Edit: Diffusion models guided by CLIP*
Yup, everyone feels it. …but, does complaining help? Nope. All it does is make you feel a bit better without really putting effort in.
We can’t have nice things because people abuse them. Not everyone. …but enough people that it’s both a PR and legal problem. Specifically a legal problem in this case.
To have adults treated like adults online, you have to figure out how to stop all adults from being dicks online.
…no one has figured that out yet.
So, complain away if you like, but it will do exactly nothing. No one, at all, is going to just “have a serious discussion” about this; the solution you propose is flat out untenable, and will probably remain so indefinitely.
Every single time OpenAI comes out with something, they dress it up as a huge threat, either to society or to themselves. Everyone falls for it. Then someone else comes along, quietly replicates it, and poof! No threat! Isn’t it incredible how that works?
There are already a bunch of dalle replicas, including ones hosted openly and uncensored by huggingface. They’re not facing huge legal or PR problems, and they’re not out of business.
Is it a legal issue? I'm not sure, though I believe that cartoon child porn is not legal in the US (or is at least a legal gray area). Regardless, I sympathize with OpenAI not wanting to enable such behavior.
If you pay attention to all the corgi examples, the sofa texture changes in each of them, and it synthesizes shadows in the right orientation - that's what it's trained to do. The first one actually does give you the impression of weight. And if you look at "A bowl of soup that looks like a monster knitted out of wool" the bowl is clearly weighing down. I bet if the picture had a more fluffy sofa you would indeed see the corgi making an indent on it, as it will have learned that from its training set.
Of course there will be limits to how much you can edit, but then nothing stops you from pulling that into Photoshop for extra fine adjustments of your own. This is far from a 'cool trick' and many of those images would take hours for a human to reproduce, especially with complex textures like the Teddy Bear ones. And note how they also have consistent specular reflections in all the glass materials.
My issue is that it appears to not be possible to explain what the AI is doing at all. If you could, you'd be able to actually control the output. And talking about how the model is trained is interesting but not an answer.
Of course there is a superimposing step, that just means it adds its layer on top of the photo you provide. That's all it means and that's literally what it is doing, that's all I tried to say, heh.
> If you pay attention to all the corgi examples, the sofa texture changes in each of them
Yes, exactly!
> This is far from a 'cool trick' and many of those images would take hours for a human to reproduce
OK, fair enough. I'll try to be more clear:
It is very cool and not a trick and the results are fantastic if you got out exactly what you wanted. Amazing time saver. And if not? Right now this is totally hit or miss.
It would also take hours for a human to reproduce a Vermeer, and this no doubt has those in its training set and would style-transfer onto a corgi instantly. Certainly faster than Vermeer himself could do it.
But Vermeer could explain how he came up with the style, his techniques, choices, etc.
It reads like the advance here is that it will usually synthesize something that looks great but not always the thing that you want. With no recourse.
It is not doing this. You are wrong. You are mistaken. You are confused. You do not understand what is happening.
(People have tried to tell you this several times, but you're not listening. shrug One more can't hurt.)
Often they can't. Ramanujan couldn't explain how he solved math problems, for instance, and humans can forget their own history easily, or even forget how to do something consciously while still doing it through muscle memory.
An ML model wouldn't forget the same way, but it could just lie to you.
The kind of tech you're imagining, where the computer has semantic understanding of what's in the picture, and is reproducing something based on a 3D scene, knowledge of physics, materials, etc is probably decades away. In that sense yes, this is just a 'trick'.
There is a DALL-E model available now from another company and you can use it directly (mini-DALLE or ruDALL-E), but its vocabulary is small and it can't do faces for privacy reasons.
I think it is using a free text query to select the best possible clipart from a big library and blends it together. Still very interesting and useful.
It would be extremely impressive if the "Koala dunking a basketball" had a puddle on the court in which it was reflected correctly, that would be mind blowing.
The difference is just that it makes the compositing easier. If you don't have a pre-existing image that would match the shadows and angles you can hallucinate a new Koala that does. Neat trick.
But I bet if I threw the poor marsupial at a basketball net it would look really different from the original clipart of it climbing some tree in a slow and relaxed manner. See what I mean?
Maybe Dall-E 2 can make it strike a new pose. The limb positions could be altered. But the facial expression?
And if the basketball background has wind blowing leaves in one direction the Koala fur won't match, it will look like the training set fur. The puddle won't reflect it, etc.
This thing doesn't understand what a Koala is like a 3-yr old. It understands the text "Koala" is associated with that tagged collection of pixel blobs and can conjure up similar blobs onto new backgrounds - but it can't paint me a new type of Koala that it hasn't seen before. It just looks that way.
Especially the part about maybe generating specifically tailored material to "train" folks. Although, while obviously moral instead of immoral like "gay conversion therapy", I wonder if it would be just as ineffective.
> and would cut down on the motivating factors to produce authentic CP (for which original production is often a requirement to join CP distribution rings).
Hmmmmm. Will machine-generated "normal" (i.e., non-CP) porn really eliminate the motivating factors to produce normal porn? I obviously can't speak for enjoyers of CP. But when watching normal porn, I think part of the thrill for many/most people is knowing that what's happening is real.
Another potential risk is that a flood of publicly available, machine-generated CP might actually help the producers and distributors of real CP by serving as camouflage. Finding and prosecuting the people who make real CP is difficult enough already. Now, imagine if the good guys couldn't even reliably tell what was real and there were 100000x as many fake images as real ones floating around.
Yikes.
I'm wondering how true that is.
Obviously, lots of people consume hentai, and platforms like Danbooru are immensely popular.
Also, speaking personally... some of the porn that I've consumed that felt the most "real" was 3D animations where the only real humans behind them were the SFM artists (and voice actors). These artists felt free to do scenes with, like, actual cinematography, with flirting and teasing and emotions between the characters, of a kind you never see even in softcore live-action porn.
So I do wonder how much potential AI generation has for completely substituting large parts of the porn industry.
Let's assume that AI generated CP should be illegal. Does it mean that possession of a model that is able to generate such content should also be illegal? If not, then it's easy to just generate content on the fly and not store anything illegal. But if we make the model illegal, then how do you enforce that? Models are versatile enough to generate a lot of different content, so how do you decide if the ability to generate illegal content is just a byproduct or the purpose of that model?
> let's assume that AI generated CP should be illegal
Well that's a big assumption, lol. I definitely agree that it would be impossible to enforce, for the reasons you say.
I personally would not be in favor of such a law at all. Partially because it's unenforceable as you say, and partially on principle.
The argument against real CP is extremely clear: we deem it abominable because it harms children. That doesn't apply to computer-generated CP, or the models/tools used to produce it.
In that sense, instead of enforcing the non-existence of models, enforcement could just make it illegal to provide any service that processes inputs or provides outputs that are CP-like, e.g. by obligating people with the models to add filters on the input and/or after the result is generated but before it is displayed or returned from the computation.
Unless you understand real to just mean that actual humans were involved, describing porn as real seems to be a bit of a stretch more often than not.
I am assuming that any adult reading this understands that professional porn is quite different from the sex most of us experience in our private lives in a number of major ways, both emotionally and physically.[1]
But anyway, yes. By "real" I mean "real human beings, having real sex."
----
[1] There is a lot of homemade, amateur porn on the big well-known porn sites and it seems quite popular, and much of that is closer to what typical folks do at home. But that's beside the point.
If people already accepted that they need help, there are many good ways to treat people with unwanted sexual obsessions (trying to choose my words carefully here). I honestly don't think that it would help them to serve them more content.
However, I'd love to see some research to explore the possibility of involving machine generated content in psychological treatment. The core of your idea is IMHO brilliant.
your teacher was wrong
i had a friend who didn't get credit for his design work because he used photoshop instead of pen and paper, for a similar reason. i still find it amazing that a teacher would say such a thing
Above a certain threshold of ability, yes.
The same will hold true for designers. DALL-E-alikes will be integrated with the Adobe suite.
The most cutting edge designers will speak 50 variations of their ideas into images, then use their hard-earned granular skills to fine-tune the results.
They'll (with no code) train models in completely new, unique-to-them styles--in 2D, 3D, and motion.
Organizations will pay top dollar for designers who can rapidly infuse their brands with eye-catching material in unprecedented volume. Imitators will create and follow YouTube tutorials.
Mom & pop shops will have higher fidelity marketing materials in half the time and half the cost.
All will be ever as it was.
The space for "AI-assisted higher-level languages" sufficiently distinct from natural language is vanishingly small. Eventually you're just speaking natural language to the computer, which just about anyone can do (perhaps with some training).
AI that can write code from a natural language description doesn't help as much as you seem to think if natural language description is too hard to actually bother with when humans (who obviously benefit from having a natural language description) are writing the code.
Now, if the AI can actually interview stakeholders and come up with what the code needs to do...
But I am not convinced that is doable short of AGI (AI assistants that improve productivity of humans in that task, sure, but that expands the scope for economically viable automation projects rather than eliminating automators.)
At some point AI will become as powerful as companies.
And then AI will be able to sustain a positive feedback loop of creating more powerful company-like ecosystems that will create even more powerful ecosystems. This process will be fundamentally limited by available power, and the sun can provide a lot of power. Eventually AI will be able to support a space economy and then the only limit will be the universe.
We will be united with the AI, we're already relying on it so much that it has become a part of our extended minds.
What's this in reference to?
So, today some good AI applications are face detection, fingerprint detection, or generating art. Where you need to catch or generate the general gist of it without pixel precision.
Of course, programming might be under greater threat than we imagine. I can also not claim that anyone holding that position is just plain _wrong_. But I do believe that would take an AI breakthrough that is yet to happen. That breakthrough would also have absolutely crazy consequences beyond programming, because now we would have "exact AI" and the thought of that boggles my mind for sure.
Claiming AIs are going to take over or destroy the world has been a basis of "AI safety" research since the 90s, but that isn't real research, it's a new religion run by Berkeley rationalists who read too many SF novels.
Also, one thing that everyone seems to ignore is that even if the number of jobs is not reduced, the skill/talent level for doing those jobs may (actually DOES) increase and also, switching careers does not work for everyone. So you'll inevitably have people without a job even if it's just that the job market is shifting.
But I argue that as automation reaches jobs with higher levels of sophistication, i.e. the jobs of more skilled workers, some people will simply be left out because their talent won't be enough to do any job that has not been automated.
Evolution doesn't stop for anyone, don't think like a dinosaur.
You thought climate change is hard to hold up? Try holding up the invention of AI. The whole world is going to have to change and some form of socialism/UBI will have to be accepted, however unpalatable.
There's the possibility that watching FOO directly encourages viewers to do FOO in real life. Like you said, this is the most fragile. I think clearly this is true in some cases -- most of us have seen a food commercial on TV and thought, "I could really go for that right now." I'm less convinced that it's true for something like pedophilia: the average person will be revolted by it, not encouraged, unless they already are into that kind of awful thing.
There's the possibility that watching FOO doesn't directly encourage viewers to do FOO, but serves to kind of normalize it. I think this happens a lot, but I think it takes a carefully crafted context and message.
There's the possibility that AI generated CP could actually help children, by providing a safe outlet for pedophiles so that they wouldn't need to do heinous shit in real life. I recall reading studies that instances of (adult) rape in societies were inversely correlated with the availability of (adult) pornography, with a possible explanation being that porn provided a safe outlet for people who weren't getting the kind of sex they wanted.
They did this to stop bad PR, because some people are convinced that an AI making pictures is in some way dangerous to society. It is not. We have deepfakes already. We've had photoshop for so long. There is no danger. Even if there was, the cat's out of the bag already.
Reasonable people already know to distrust photographic evidence nowadays that is not corroborated. The ones who don't would believe it without the photo regardless.
We've been through this many times, with books, with movies, with video games, with Internet. If it *can* be used for porn / violence etc., it will be, but it won't be the main use case and it won't cause some societal upheaval. Kids aren't running around pulling cops out of cars GTA-style, Internet is not ALL PORN, there is deepfake porn, but nobody really cares, and so on. There are so many ways to feed those dark urges that censorship does nothing except prevent normal use cases that overlap with the words "violence" or "sex" or "politics" or whatever the boogeyman du jour is.
Cheap and plentiful is substantively different from "possible". See for example, oxycontin.
Do not be deluded that our own governments are not manufacturing the narrative too. The US has committed just as many war crimes as Russia. Of course, people feel differently about blowing up hospitals in Afghanistan rather than Ukraine. What the Afghan people think about that is not considered too much.
This AI has the potential to absolutely automate the very long Photoshop work, leading to an even worse state of things. So, yes, "Responsibility to society" is absolutely a thing.
But notice how all of these deep faking technologies weren't actually necessary for that.
People believe what they want to believe. Regardless of quality of provided evidence.
Scaremongering idea of deep fakes and what they can be doing was militarized in this information war way more than the actual technology.
I think this technology should develop unrestricted so society can learn what can be done and what can't be done. And create an understanding of what other factors should be taken into account when assessing the veracity of images and recordings (like multiple angles, quality of the recording, sync with sound, neural fake detection algorithms) for the cases when it's actually important what words someone said and what actions he was recorded doing. Which is more and more unimportant these days because nobody cared what Trump was doing and saying, nobody cares about Biden's mishaps and nobody cares what comes out of Putin's mouth and how he chooses his greenscreen backgrounds.
> People believe what they want to believe. Regardless of quality of provided evidence.
That is a terrible oversimplification of the mechanics of propaganda. The entire reason for the movements that are popping up is actors flooding people with so much info that they question absolutely everything, including the truth. This is state sponsored destabilisation, on a massive scale. This is the result of just shitty news sites and text posts on twitter. People already don't double check any of that. There will not be an "understanding of assessing veracity". There is already none for things that are easy to check. You could post that the US elite actively rapes children in a pizza place and people will actually fucking believe you.
So, no. Having this technology for _literally any purpose_ would be terribly destructive for society. You can find violence and Joe Biden hentai without needing to generate it automatically through an AI
Algorithm space is large and guess-checking through it takes a lot of effort even when it’s automated like now. It requires huge amounts of compute. And meaningful progress requires the combined effort of the entire world's intellectual and compute resources. It sounds implausible at first but this machine learning ecosystem is in fact subject to sanctions. There are extreme but plausible ways of reducing the stream of progress to a trickle. It just requires people to actually wake up to what’s happening.
You provide the background image and a text prompt and it doodles on top of the image you provided as per their demonstration. I wasn't referring to the other examples down the page where it conjures up a brand new image from scratch based on your image input.
It is great that you can tell it to add a flamingo and it fits into the background you provide nicely due to the well tuned style transfer. That part is cool. And it is impressive that sometimes the flamingo it adds is reflected in the water. But sometimes it isn't reflected. And it isn't up to you, it is up to it. And you can't tell it to add a reflection as a discrete step.
Look more carefully. This is more akin to a clipart finder, except if the clipart doesn't exist it uses the most similar thing in its training set to what it guesses you want as a starting point to synthesize new clipart from.
It doesn't add it in like an artist would and you can't control it at all. I don't know how to better express this.
This isn't unimpressive or un-useful but not quite as mind blowing on second glance.
Or am I in denial about how impressive this all really is by reading something slightly different into the static hand selected examples openai teased us with? :)
I'm sure two more papers down the line this thing will do what the true believers are convinced it already does perfectly much more seamlessly if they solve for my new favorite term, panoptic segmentation.
If you read the article, it gives examples that do exactly this. For example, adding a flamingo shows the flamingo reflected in a pool. Adding a corgi at different locations in a photo of an art gallery shows it in picture style when it's added to a picture, then in photorealistic style when it's on the ground.
A lot of the time it doesn't super matter, but sometimes it does.
I might be misinterpreting your use of "compositing" here (and my own technical knowledge is fairly shallow) but I don't think there's any compositing of elements generally in AI image generation. (unless Dall-E 2 changes this. I haven't read the paper yet)
> Given an image x, we can obtain its CLIP image embedding z_i and then use our decoder to “invert” z_i, producing new images that we call variations of our input. .. It is also possible to combine two images for variations. To do so, we perform spherical interpolation of their CLIP embeddings z_i and z_j to obtain intermediate z_θ = slerp(z_i, z_j, θ), and produce variations of z_θ by passing it through the decoder.
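For anyone unfamiliar with slerp, here is a minimal sketch of the spherical interpolation step named in that quote, in plain Python. It is not the paper's code: the function name, the list-of-floats representation, the choice to normalize both inputs to unit length, and the fallback for degenerate inputs are all my own assumptions, and real CLIP embeddings would be high-dimensional vectors fed to a decoder that is not shown here.

```python
import math


def slerp(z_i, z_j, theta):
    """Spherically interpolate between two vectors (hypothetical helper).

    theta = 0 returns the direction of z_i, theta = 1 the direction of z_j.
    Both inputs are normalized first, so the result lies on the unit sphere.
    """
    norm_i = math.sqrt(sum(a * a for a in z_i))
    norm_j = math.sqrt(sum(b * b for b in z_j))
    u = [a / norm_i for a in z_i]
    v = [b / norm_j for b in z_j]

    # Angle between the two unit vectors; clamp to guard against rounding.
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(dot)
    s = math.sin(omega)

    # Degenerate case (parallel or antiparallel inputs): the arc is not
    # well defined, so just return the nearer endpoint.
    if s < 1e-8:
        return list(u) if theta < 0.5 else list(v)

    w_i = math.sin((1.0 - theta) * omega) / s
    w_j = math.sin(theta * omega) / s
    return [w_i * a + w_j * b for a, b in zip(u, v)]


# Toy 2-D stand-ins for two image embeddings.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

The point of interpolating along the arc instead of a straight line is that the intermediate vector keeps unit length, so it stays in the region of embedding space the decoder was presumably trained on.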
From the limitations section:
> We find that the reconstructions mix up objects and attributes.
ML models can output training data verbatim if they over-fit, but a well trained model does extrapolate to novel inputs. You could say that this model doesn't know that images are 2D representations of a larger 3D universe, but now we have NeRF, which kind of obsoletes this objection as well.
I'd argue that at no point is there a representation of a "teddy bear" and "a background" that map closely to their visual representation - that are combined.
(I'm aware I'm being imprecise so give me some leeway here)
Let me state my opinion more directly.
I'm for developing as much deep fake technology in the open as possible so that people can internalize that every video they see, every message, every speech should be initially treated as fabricated garbage unrelated to anything that actually happened in reality. Because that's exactly what it is. Until additional data shows up, geolocating it, showing it from different angles and such.
Even if most people manage to internalize just the first part and assume everything always is fake news, that is still great because that counters propaganda to an immense degree.
The power of propaganda doesn't come from flooding people with a chaos of fakery. It comes from constructing a consistent message by whatever means necessary and hammering it into the minds of your audience for months and years while simultaneously isolating them from any material, real or fake, that contradicts your vision. Look no further than brainwashed Russian citizens and Russian propaganda that has been able to successfully influence hundreds of millions for decades without even a shred of deep fake technology.
The problem of the modern world is not that no one believes the actual truth, because it doesn't really matter what most people believe. Only the rich influence policy decisions. The problem is that people still believe that there is some truth, which makes them super easy to sway into believing that what you are saying is true, and to weaponize, using nothing more than a charismatic voice and a consistent message crafted to touch the spots in people that have remained the same at least since World War II and most likely from time immemorial.
And the "elite" who actually runs this world, will pursue tools of getting the accurate information and telling facts from fiction no matter the technology.
"Sassy Justice with Fred Sassy" (reporting on Deep Fakes) :
If we task "koala dunking basketball" to a human and present them with two images, one of a koala climbing a tree and another of a basketball player dunking - the human would cut out the foreground (human, koala) from the background (basketball court, forest) and swap them easily.
The laborious part would be to match the shadows and angles in the new image. This requires skill and effort.
Dall-E would conjure up an entirely novel image from scratch, dodging this bit. It blended the concepts instead, great.
But it does not understand what a basketball court actually is, or why the koala would reflect in a puddle. Or why and how this new koala might look different in these circumstances from previous examples of koalas that it knows about.
The human dunker and the koala dunker are not truly interchangeable. :)
Maybe it really will magically output everything to my satisfaction, what a time to be alive! :)
https://arxiv.org/pdf/2112.10741.pdf
so it could distinguish individual objects from backgrounds. Other ML models can definitely do that; it's called "panoptic segmentation".
It really needs to expose the whole pipeline to become truly useful.