Stable Diffusion with Core ML on Apple Silicon(machinelearning.apple.com) |
The repo is aimed at developers and has two parts. The first adapts the ML model to run on Apple Silicon (CPU, GPU, Neural Engine), and the second allows you to easily add Stable Diffusion functionality to your own app.
If you just want an end user app, those already exist, but now it will be easier to make ones that take advantage of Apple's dedicated ML hardware as well as the CPU and GPU.
>This repository comprises:
python_coreml_stable_diffusion, a Python package for converting PyTorch models to Core ML format and performing image generation with Hugging Face diffusers in Python
StableDiffusion, a Swift package that developers can add to their Xcode projects as a dependency to deploy image generation capabilities in their apps. The Swift package relies on the Core ML model files generated by python_coreml_stable_diffusion
https://github.com/apple/ml-stable-diffusion
I imagine that here Apple wants to highlight a more research/interactive use, for example to allow fine-tuning SD on a few samples from a particular domain (a popular customization).
[1] https://onnxruntime.ai/docs/execution-providers/CoreML-Execu...
People who can't get the models to work by themselves given the source code aren't the target audience. There are other projects, though, that do distribute quick and easy scripts and tools to run these models.
Apple stepping in to get Stable Diffusion working on their platform is probably an attempt to get people to take their ML hardware more seriously. I read this more like "look, ma, no CUDA!" than "Mac users can easily use SD now". This module seemed to be designed so that the upstream SD code can easily be ported back to macOS without special tricks.
https://github.com/LaurentMazare/tch-rs
I used this in the past to make a transformer-based syntax annotator. Fully in Rust, no Python required:
> For distilled StableDiffusion 2 which requires 1 to 4 iterations instead of 50, the same M2 device should generate an image in <<1 second
The author has a detailed blogpost outlining how he modified the model to use Metal on iOS devices. https://liuliu.me/eyes/stretch-iphone-to-its-limit-a-2gib-mo...
This site is purely a marketing effort.
Pretty much like Stable Diffusion and the grifters using it in general; they will never credit the artists whose images they stole to generate these images.
Of course you can see the original images (https://rom1504.github.io/clip-retrieval/), it was legal to collect them (they used robots.txt for consent just like Google Image Search) and it was legal to do this with them (but not using US legal principles since it's made in Germany).
"Crediting the artist" isn't a legal principle - it's more like some kind of social media standard which is enforced by random amateur artists yelling at you if you don't do it. It's both impossible (there are no original artists for a given output) and wouldn't do anything to help the main social issue (future artists having their jobs taken by AIs).
two wrongs don't make a right.
I'm not seeing any installation instructions on either link - what am I missing?
Great support for M1, basically since the beginning. The install is painless.
Release video for InvokeAI 2.2: https://www.youtube.com/watch?v=hIYBfDtKaus
This gets you text descriptions to images.
I have seen models that, given a picture, generate similar pictures. I want this because while I have many pictures of my grandmothers, I only have a couple of pictures of my grandfathers and it would be nice to generate a few more.
Core ML is so well done. A year ago I wrote a book on Swift AI and used Core ML in several examples.
Edit: still alive! https://grandperspectiv.sourceforge.net/
Found a >100GB accidental “livestream” recording on one computer. Would have taken forever to find what was taking up all the room otherwise.
GUI apps for this task like GP and the like are more visually complex than they need to be.
/dev/disk3s5 926Gi 857Gi 52Gi 95% 8067489 540828800 1% /System/Volumes/Data
It normally hovers around 30-35Gi free.
One on the GPU and another on the ML core?
Centralized services small and large are guilty of this and I'm sick of it.
If you mean monetary usecases: Roughly something like Photoshop/Blender/UnrealEngine with ML plugins that are low latency, private, and $0 server hosting costs.
3.56 seconds?
They have some benchmarks on the github repo: https://github.com/apple/ml-stable-diffusion
For reference, previously I was getting a little under 3 minutes for 50 iterations on my MacBook Air M1. I haven't yet tried Apple's implementation but it looks like a huge improvement. It might take it from "possible" to "usable".
Mac Studio with M1 Ultra gets 3.3 iters/sec for me.
MacBook Pro M1 Max gets 2.8 iters/sec for me.
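To put these figures on one scale, here is a small sketch that converts iterations per second into time per image. It assumes 50 iterations per image (the count used in the comments above) and ignores model load time and other fixed overhead; the function name is only for illustration.

```python
def seconds_per_image(iters_per_sec: float, steps: int = 50) -> float:
    """Time to run `steps` denoising iterations at a given throughput."""
    if iters_per_sec <= 0:
        raise ValueError("iters_per_sec must be positive")
    return steps / iters_per_sec


# Throughput figures reported in this thread.
reported = {
    "Mac Studio M1 Ultra": 3.3,
    "MacBook Pro M1 Max": 2.8,
}

for machine, rate in reported.items():
    print(f"{machine}: {seconds_per_image(rate):.1f} s for 50 iterations")

# The older MacBook Air M1 figure of roughly 3 minutes per 50 iterations
# corresponds to about 50 / 180 iterations per second.
air_m1_rate = 50 / 180
print(f"MacBook Air M1 (before): {air_m1_rate:.2f} iters/sec")
```

So the two reported rates work out to roughly 15 and 18 seconds per 50-iteration image, against about 180 seconds for the earlier MacBook Air M1 number.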
And the posted benchmarks for the M2 MacBook Air make me consider 'upgrading' to an Air.
DALL-E et al. will still be able to bandwagon off of all the free ecosystem being built around the $10M SD1.4 model that is showing what is possible.
E.g. DALL-E could go straight to Hollywood if their model training works better than SD’s. The toolsets will work
Maybe a dumb question but can the old model still be run?
https://mezha.media/en/2022/10/06/google-is-working-on-image...
Give it some time and SD will be able to do the same.
See deforum[1] and andreasjansson’s stable-diffusion-animation[2]
[1]: https://deforum.github.io/
[2]: https://replicate.com/andreasjansson/stable-diffusion-animat...
What's cool about the era in which we live is if you look at high-performance graphics for games or simulations, for instance, it may in fact be faster to run the model to "enhance" a low-resolution frame rather than trying to render it fully on the machine.
ex. AMD's FSR vs NVIDIA DLSS
- AMD FSR (Fidelity FX Super Resolution): https://www.amd.com/en/technologies/fidelityfx-super-resolut...
- NVIDIA DLSS (Deep Learning Super Sampling): https://www.nvidia.com/en-us/geforce/technologies/dlss/
AMD's approach renders the game at a crummy, low-detail resolution, then "upscales" each frame.
Both FSR and DLSS aim to improve frames-per-second in games by rendering them below your monitor’s native resolution, then upscaling them to make up the difference in sharpness. Currently, FSR uses spatial upscaling, meaning it only applies its upscaling algorithm to one frame at a time. Temporal upscalers, like DLSS, can compare multiple frames at once, to reconstruct a more finely-detailed image that both more closely resembles native res and can better handle motion. DLSS specifically uses the machine learning capabilities of GeForce RTX graphics cards to process all that data in (more or less) real time.
Video is really a series of frames. Film can get away with a framerate of 24 frames/second, which means ~40ms/image for real-time generation.
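As a rough back-of-the-envelope check on that number, the sketch below computes the per-frame time budget and the speedup a generator would need to hit it. The 15-second generation time is a hypothetical input chosen for illustration, not a measurement, and both function names are made up here.

```python
def frame_budget_ms(fps: float) -> float:
    """Milliseconds available to produce one frame at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def required_speedup(seconds_per_image: float, fps: float = 24.0) -> float:
    """How many times faster generation must become to keep up with `fps`."""
    return seconds_per_image * fps


budget = frame_budget_ms(24)
print(f"Budget at 24 fps: {budget:.1f} ms per frame")

# Hypothetical: an image that currently takes 15 seconds to generate.
print(f"Speedup needed from 15 s/image: {required_speedup(15.0):.0f}x")
```

At 24 fps the budget is a bit under 42 ms, so a 15-second image would need to get about 360 times faster to count as real-time.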
What's cool about the era in which we live is if you look at high-performance graphics for games or simulations, it may in fact be faster to run the model on each frame to "enhance" a low-resolution frame rather than trying to render it fully on the machine.
ex. AMD's FSR vs NVIDIA DLSS
- AMD FSR (Fidelity FX Super Resolution): https://www.amd.com/en/technologies/fidelityfx-super-resolut...
- NVIDIA DLSS (Deep Learning Super Sampling): https://www.nvidia.com/en-us/geforce/technologies/dlss/
AMD's approach renders the game at a crummy, low-detail resolution then uses "spatial upscaling" to enhance the images one frame at a time.
NVIDIA DLSS uses "temporal upscaling" to pass over multiple frames and uses other capabilities exclusive to Nvidia's cards to stitch together the frames.
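To make the spatial vs. temporal distinction concrete, here is a toy sketch in plain Python. It is not how FSR or DLSS are implemented: the spatial version just repeats pixels, and the temporal version simply averages a few frames before upscaling, with no motion vectors and no machine learning. All names are invented for this illustration.

```python
from typing import List

Frame = List[List[float]]  # grayscale image as rows of pixel values


def spatial_upscale(frame: Frame, factor: int = 2) -> Frame:
    """Upscale a single frame by repeating each pixel (nearest neighbour)."""
    out: Frame = []
    for row in frame:
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def temporal_upscale(frames: List[Frame], factor: int = 2) -> Frame:
    """Average several consecutive frames, then upscale the blended result.

    Toy stand-in for temporal accumulation: real temporal upscalers also
    align frames using motion information, which is omitted here.
    """
    height, width = len(frames[0]), len(frames[0][0])
    blended = [
        [sum(f[y][x] for f in frames) / len(frames) for x in range(width)]
        for y in range(height)
    ]
    return spatial_upscale(blended, factor)


single = [[1.0, 2.0]]
print(spatial_upscale(single))

history = [[[0.0, 4.0]], [[2.0, 4.0]]]
print(temporal_upscale(history))
```

The only point of the sketch is the difference in inputs: the spatial function sees one frame, while the temporal one gets a short history of frames to draw on.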
This is a different challenge than generating the content from scratch
I don't think this is possible in real-time yet, but someone put a filter trained on the German countryside to produce photorealistic Grand Theft Auto driving gameplay:
https://www.youtube.com/watch?v=P1IcaBn3ej0
Notice the mountains in the background go from Southern California brown to lush green
https://www.rockpapershotgun.com/amd-fsr-20-is-a-more-demand....
I'm not sure exactly what that costs me in terms of power, but it is assuredly less than any of these services charge for a single image generation.
but seriously, I wonder when you'll be able to paste in a script, and get out a storyboard or a movie
E.g., it's about 20x as fast as InvokeAI, which doesn't have an FP16 option that works on a Mac.
One gotcha for me is ncdu2 going Zig and Zig dropping support for OS versions as Apple does.
But also yes, it's gotta be expensive to host these models and I'm not sure where all these subsidies are coming from. I expect that we'll eventually see these things transition to more paid services.
https://explosion.ai/blog/metal-performance-shaders
However the SoC only uses 31W when posting that performance.
So it sounds plausible that the M1 can reach the same level in some use cases with the right optimizations.
0: https://github.com/CompVis/stable-diffusion/blob/main/LICENS...
MacBook Airs (way back when) felt sluggish. The MBA M1 changed that, it was "fine". These M2s are unexpectedly responsive on an ongoing basis.
The MacBook Pro M1 Max is great (would be fantastic except they lost a Thunderbolt port in favor of legacy HDMI and memory card jacks), but you expect that machine to be responsive, so it's less surprising.
The Studio Ultra, though, never slows down for anything.
Still, if the Air could drive two external screens instead of one, I'd "downgrade" from the Max.
I've since moved to the M2 Air, and it is noticeably faster than the M1, but it isn't the huge leap from last-gen Intel that the M1 was. But the hardware itself feels way better.
* Air built-in display
* 2K display connected via USB-C -> DisplayPort adapter
* Two more 2K displays of same model via DisplayLink connected via USB hub
For all practical purposes, it's almost impossible to see any DisplayLink compression artifacts, even in most games.
PS: Each adapter cost me $40:
Been nervous to dip into it, given the architecture change and last year's challenges with DisplayLink docks.
// UPDATE: Oops, looking at the product, I see I should have specified: 4K screens or higher. About half our desks are 2 x 4K, about half 2 x 5K, except the Air M1 folks who are 1 x 5K.
SD2 wasn’t “neutered”, the piece of it from OpenAI that knew a lot of artist names but wasn’t reproducible was replaced with a new one from Stability that doesn’t. You can fine-tune anything you want back in.
The "in the style of Greg Rutkowski" prompts from SD1 though, IIRC, were thought to be proof it was reproducing the training set. But it actually only saw ~27 images of his, and the rest was residual biases from CLIP.
- create a virtual environment (Python 3.8.15 worked best)
- upgrade pip
- pip install wheel
- pip install -r requirements.txt
- and then, python setup.py install
- Had to update my Xcode to use the generated mlpackage files :/
- Expand drawer with instructions and follow them to download model and convert it to Core ML format
- Run their CLI command as mentioned
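The Python setup part of the list above (everything before the Xcode update) can be sketched as a small dry-run helper that only builds the commands, so they can be inspected before anything is run. The `.venv` directory name and `python3` being on the PATH are assumptions, and the commands are meant to be run from a checkout of the repository.

```python
import shlex
from pathlib import Path
from typing import List


def install_commands(env_dir: str = ".venv") -> List[List[str]]:
    """Return the commands for the setup steps, without running them."""
    pip = str(Path(env_dir) / "bin" / "pip")
    python = str(Path(env_dir) / "bin" / "python")
    return [
        ["python3", "-m", "venv", env_dir],          # create a virtual environment
        [pip, "install", "--upgrade", "pip"],        # upgrade pip
        [pip, "install", "wheel"],                   # pip install wheel
        [pip, "install", "-r", "requirements.txt"],  # install the requirements
        [python, "setup.py", "install"],             # install the package itself
    ]


if __name__ == "__main__":
    # Print the commands instead of executing them.
    for cmd in install_commands():
        print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch side-effect free; each list could be passed to `subprocess.run(cmd, check=True)` once the commands look right for your setup.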
I keep running into this, message is
RuntimeError: Error compiling model: "Error reading protobuf spec. validator error: The model supplied is of version 7, intended for a newer version of Xcode. This version of Xcode supports model version 6 or earlier.".
I upgraded Xcode, tried re-installing the command line tools with various invocations of `sudo rm -rf /Library/Developer/CommandLineTools ; xcode-select --install` etc., but still get the above message. (Thanks in advance, in case you see this and reply.)
edit: I see from https://github.com/apple/ml-stable-diffusion/issues/7 that somebody upgraded to macOS 13.0.1 and that fixed the issue for them. I've put off upgrading to Ventura so far and don't want to upgrade just to mess around with Stable Diffusion on M1, if it can be avoided.
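A quick way to check which macOS version you are on before digging further is sketched below. The 13.0.1 threshold is an assumption taken only from the report in the linked issue, not a documented requirement, and the helper names are invented for this example.

```python
import platform
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '13.0.1' into (13, 0, 1)."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def meets_minimum(current: str, minimum: str) -> bool:
    """Compare dotted versions numerically, padding the shorter with zeros."""
    a, b = parse_version(current), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    # platform.mac_ver() returns an empty version string on non-macOS systems.
    current = platform.mac_ver()[0]
    minimum = "13.0.1"  # assumption: the version reported to work in issue #7
    if not current:
        print("Not running on macOS")
    elif meets_minimum(current, minimum):
        print(f"macOS {current} is at least {minimum}")
    else:
        print(f"macOS {current} is older than {minimum}; "
              "version 7 models may fail to compile")
```

Comparing tuples of integers avoids the usual string-comparison trap where "12.10" sorts before "12.6".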
I assume the environment part is what the "conda" commands on the GitHub repo readme are doing, but finding "conda" to install seems to be its own process. It's not on MacPorts, pip seems to only install a Python package instead of an executable, and getting a package from some other site feels sketchy.
What is it with ML and Python, anyway? Why is this amazing new technology being shrouded in an ecosystem and language which… well, I guess if I can't say anything nice…
I think whether you need a virtualenv depends on your system python version and compatibility of any of the dependencies, but it's also pretty nice to be able to spin up or blow away envs without bloating your main python directory or worrying that you're overwriting dependencies for a different project.
brew install miniconda
brew comes from: /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Don't take my word for it, visit https://brew.sh.
For higher resolution some other solution is required.
it's correct enough that if you know your way around a CLI, git, and package management you can figure it out
I can build an Electron app in under a day with a pretty UI. It would take me several months to get anything sensible that is OS native. And I'm not going to sit down and learn the alternative.
So please just say "thank you" to the developers that are sharing free things with you.
Worst of all is the shamelessness, though. Don't Electron developers feel ashamed when they ship their products? Or have their brains been so muddled by this "JavaScript everywhere" mentality that they don't realize it's bad? Will future generations even know what a native application is anymore?
This program suggests quitting other applications while it runs. Maybe that wouldn't be so necessary if it wasn't using a framework which needs like 2GB of memory before it can draw a window.
I note that my OP hasn't been downvoted into oblivion as most of my critical HN posts are. I think there's at least a significant silent minority who agree with me on this one.
I assume the developer went for electron due to familiarity, but it would be a pretty good exercise for someone to port it to SwiftUI and native Swift for the front end.
I would do it myself but sadly am bound by other clauses.
It seems that he works coincidentally because CLIP associates his name with concept art.
that's simply not going to happen. as in every technological development so far, this is just another tool.
1) artists create the styles out of thin air
2) artists create the images out of thin air
3) computers are just collectors of this data and do not actually originate anything new. they are just very clever copycats.
you're looking at an artist tool more than anything. sure, it's an unconventional one and a threatening one, but that's been true of literally every technological development since the Industrial Revolution.
Also, I'm not so sure that language models like SD, Imagen, GPT-3, PaLM are purely copycats. And I'm not so sure that most human artists are not mostly copycats either.
My suspicion is that there's much more overlap between how these models work and what artists do (and how humans think in general), but that we elevate creative work so much that it's difficult to admit the size of the overlap. The reason why I lean this way is because of the supposed role of language in the evolution of human cognition (https://en.m.wikipedia.org/wiki/Origin_of_language)
And the reason I'm not certain that the NN-based models are purely copycats is they have internal state; they can and do perform computations, invent algorithms, and can almost perform "reasoning". I'm very much a layperson but I found this "chains of thought" approach (https://ai.googleblog.com/2022/05/language-models-perform-re...) very interesting, where the reasoning task given to the model is much more explicit. My guess is that some iterative construction like this will be the way the reasoning ability of language/image models will improve.
But at a high level, the only thing we humans have going for us is the anthropic principle. Hopefully there's some magic going on in our brains that's so complicated and unlikely that no one will ever figure out how it works.
BTW, I am a layperson. I am just curious when we will all be killed off by our robot overlords.
all of these assumptions miss something so huge that it surprises me that so many miss it: WHO is doing the art purchasing? WHO is evaluating the "value" of... well... anything, really? It is us. Humans. Machines can't value anything properly (example: Find an algorithm that can spot, or create, the next music hit, BEFORE any humans hear it). Only humans can, because "things" (such as artistic works, which are barely even "things", much more like "arbitrary forms" when considered objectively/from the universe's perspective) only have meaning and value to US.
> when we will all be killed off by our robot overlords
We won't. Not unless those robots are directed or programmed by humans who have passionate, malicious intent. Because machines don't have will, don't have need, and don't have passion. Put bluntly and somewhat sentimentally, machines don't have love (or hate), except that which is simulated or given by a human. So it's always ultimately the human's "fault".
At some point biological humans will either merge with their technology or stop being the forefront of intelligence in our little corner of the universe. Either of those is perfectly acceptable as far as I am concerned and hopefully one or both of them come to pass. The only way they don't IMO is if we manage to exterminate ourselves first.
e.g. SD2 prompted with "not a cat" produces a cat, and "1 + 1" doesn't produce "2".
I don't think it will either, but artists think it will, so it's strange that their proposed solution "credit the original artists behind AI models" won't solve the problem they have with it.
It will indeed happen, though not to all artists.
> as in every technological development so far, this is just another tool.
Just like every other tool, it changes things, and not everyone wants to change. Those who embrace the new tech are more likely to thrive. Those who don't, less likely.
> 1) artists create the styles out of thin air
> 2) artists create the images out of thin air
I understand what you're saying, but as an artist, I can't agree. No artist lives in total isolation. No artist creates images out of thin air. Those who claim to are lying, or just don't realize how they're influenced.
How artists are influenced varies, obviously, but for me I think that however I've been influenced, that influence impacts my output similarly to how the latest generation of AI driven image generation works.
I'm influenced by the collective creative output of every artist whose stuff I've seen. An AI tool is influenced by its model. I don't see a lot of differences there, conceptually speaking. There are obvious differences about human experience, model training, bias, etc, but that's a much larger conversation. Those differences do matter, but I don't think they matter enough to change my stance: conceptually, they work the same in terms of leveraging "influence" to create something unique.
> 3) computers are just collectors of this data and do not actually originate anything new. they are just very clever copycats.
Stable Diffusion does a pretty damn good job of mixing artistic styles to the point where I have no problem disagreeing with you here. It comes as close to originating something new as humans do. You could argue that how it does it disqualifies its output as "origination", but those same arguments would be just as effective at disqualifying humans for the same reasons.
That all said, I agree with you that the tech is a disruptive tool. It's a threat the same way that cameras were a threat to portrait artists, or AutoCAD might be to architects, or CNC machines to machinists. The idea that new tech doesn't take jobs is naive - it always does. But it doesn't always completely eliminate those jobs. Those who adapt and leverage and take advantage of the new tools can still survive and thrive. Those who reject the new tech might not. Some might find a niche in using "old" techniques (which in a way still leverages the new tech - as a marketing/differentiation strategy).
For me, I've been using Stable Diffusion a lot lately as a tool for creating my own art. It's an incredibly useful tool for sketching out ideas, playing with color, lighting, and composition.
In fairness this is an obscure GitHub page that <0.001% of people will be aware of. If the creators of all these AI generating tools had sat down and thought about the consequences, the authors' names could, for example, have been watermarked by default, with the license requiring the watermark to be kept unless the author allowed its removal.
There clearly was no thought around mitigating any of these problems, and so now we have the storm around "robots taking artists' jobs", which they may (at least for some 90% of "artists" who are just rehashing existing styles) or may not; only time will tell.
But given that the entire world of technical documentation assumes all technically-inclined people using Macs are using Homebrew, I’ll probably have to give up and switch over at some point. But not yet.
In fact, if you used it before they cleaned all that up, or used it before moving from Intel to ARM and did a restore to the new arch instead of fresh install, it's worth doing a brew dump to a Brewfile, uninstalling ALL packages and brew, and reinstalling fresh on this side of the permissions and path cleanups.
- Migrate Homebrew from Intel Macs to Apple Silicon Macs:
also wonder if anyone did a blogpost yet
Just because the app is written using a Chromium framework does not necessarily mean that it's written poorly. VS Code is a great example of a fast application written in Electron.
I don't know where you're getting 2 GB of required memory, but if you spin up an Electron app it's rare that it requires more than 100 MB if it's not doing anything.
If you knew anything about these types of Stable Diffusion interfaces, you'd know that they basically have to load the entire model into memory, so that's likely where the multiple gigabytes are coming from.
A lot of us got into development work because we want to create new things, you sound more like the person who spends 99% of their time endlessly optimizing the game engine without actually remembering to build a compelling game experience.
You're getting downvoted because your arrogant tone makes you sound like an insufferable bore.
A vast number of species are no longer around, and we are relatively unusual in being a species that can even contemplate its own demise, so it's entirely reasonable that we would think about and be potentially concerned about our own technological creations supplanting us, possibly maliciously.
Mostly money launderers, I've heard.
>we won't be killed off by AGI because humans don't have malicious intent
I wouldn't say malice is necessary. It's just economics. Humans are lazy, inefficient GI that farts. The only reason the global economy feeds 8 billion of us is that we are the best, cheapest (and only) GI.
It's astonishing how ungrateful people are. Even writing documentation for the software is quite a time-consuming action - writing the software itself is much more time-consuming.
So you are looking at some free software that gives you the ability to play with Stable Diffusion in 2 clicks, has a wide range of features and settings, and surely required a ton of time to implement, and you are arrogantly saying “pff, an Electron app...”
I wasn’t saying that the author of DiffusionBee should make a SwiftUI application. In fact I said the opposite in that I agree that the person who expected a native app is entitled.
I was however refuting the person I was responding to who said making a native app is a huge undertaking, because learning SwiftUI is fairly quick. That’s not to say that the maintainer should learn it but just that it’s fairly quick to learn should someone else want to.
I was also saying that someone (maybe someone other than the maintainer of DiffusionBee) could contribute a SwiftUI front end.
Finally I was saying I would gladly contribute it myself if I could (but unfortunately have other reasons why I can’t)
anyway hopefully that clears things up, and that hostility from your post is unwarranted.
Don't be toxic if you don't want to get that hostility.
If anything you’re the one being toxic because you’re unable to have a reasonable conversation about a misunderstanding, and are instead trying to put words in my virtual mouth to conform to your outrage.