This must mean Hugging Face's bandwidth bill is crazy, or am I missing something? (Maybe they have a peering agreement, or cache things heavily?)
Bandwidth has always been crazy cheap.
In fact, locally I can get an unmetered 10 Gbps home internet connection for $300/mo.
I'm not sure how they'd react if I transferred 1 PB/mo though :)
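For scale, here is a rough back-of-the-envelope sketch of what a saturated 10 Gbps link could move in a month. The 30-day month and decimal petabytes are assumptions, and real links rarely sustain full line rate.

```python
# Rough estimate: data volume of a 10 Gbps link running flat out for a month.
LINK_GBPS = 10                          # line rate quoted in the comment above
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # assumption: a 30-day month


def monthly_capacity_pb(link_gbps: float, seconds: int = SECONDS_PER_MONTH) -> float:
    """Return the volume in decimal petabytes moved at full line rate."""
    bytes_per_second = link_gbps * 1e9 / 8   # bits per second -> bytes per second
    return bytes_per_second * seconds / 1e15


capacity = monthly_capacity_pb(LINK_GBPS)
utilization_for_1pb = 1.0 / capacity

print(f"{capacity:.2f} PB/month at line rate")
print(f"1 PB/month needs about {utilization_for_1pb:.0%} average utilization")
```

Under these assumptions, 1 PB/mo would mean keeping the link roughly a third full around the clock.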
1. AWS is far behind Azure and GCP in AI, so they gotta partner up to gain credibility.
2. Hugging Face probably does face insane bills compared to GitHub. But AWS can probably develop some optimizations to save bandwidth costs. There's 100% some sort of generalized differential storage method being developed for AI models.
Or better yet, how about asking me where I want to store my models?
Unlike musicians who can't be replaced without significant postprocessing, have enough money to not be impacted by competition, and have legal muscle, voice over artists:
- Can be reproduced with good-enough results from out-of-the-box voice cloning settings on ElevenLabs or an open source equivalent (Bark, VALL-E X)
- Are already underpaid for their work as-is
- Have no legal ownership of their voice since they are contractors, and their voicework is owned by their clients, who may not be as incentivised to protect the VO.
I want to write a blog post about it but I suspect most people on Hacker News won't be interested in a treatise on the cultural impacts of the voicework in Persona 5 and Genshin Impact.
I am constantly amazed at how the new AI tech can be used.
Of course this would be illegal under most countries' copyright laws.
While Weird Al himself asks for permission, it's well established that parody is not copyright infringement. There should be room for parody performances by AI voices as well, especially if argued by a good lawyer.
Enjoy: https://youtu.be/gmNSFqyg_Z8
https://arstechnica.com/information-technology/2022/09/james...
Watch Light My Fire on YouTube Music https://music.youtube.com/watch?v=lN3v3EfA6_A&si=_hcG3Wjakxd...
I can't figure out if this is an example of Godwin's Law or not.
No, it sounds like someone doing an impression of Weird Al doing an impression of Michael Jackson. Someone whose mom told them they were special and they believed it.
These examples are standing on a ridge line, surveying the uncanny valley and looking for the best way to cross.
I have an accent. If not for that, I'd be a great presenter.
If I could translate my voice into a poor Neil deGrasse Tyson, a poor Patrick Stewart, a poor Carl Sagan, a poor Morgan Freeman, etc., my presentations would be... better.
This isn't autotune for the spoken word, though. It's not fixing pacing or vocabulary, and in the audio above it isn't even fixing intonation. Yes, a thick German accent will give you away as being of German extraction. But you're also using the word 'since' when Brits and Americans would use 'for', and it's not going to fix that. Any more than it'll fix my French when I make the exact same mistake going the other direction (for=duration vs for=purpose vs for=interval). If I hear 'since one month' you're likely German or Indian. If you ask how long I've been in Marseille you'll know I'm American in about half that time.
> No current artificial intelligence is powerful enough to hide the weirdness of Weird Al.
The existing voice actors will just be out of work. There will be a small cadre of groups that want a real voice. But for some projects that will not be that important.
It's going to get crazy.
Voice cloning is a special case; these models are equally good at making new voices.
Pick your book, pick your reader and away it goes. The Diary of Anne Frank read by Gilbert Gottfried.
I'd like to read it, in any case.
I'm not sure how to feel about that. I'm against the idea that some people "deserve" to be paid for being lucky enough to be born with an interesting voice.
On the other hand, the world has always worked like that. And, say, a hard-working farmer or doctor was also lucky to be born with the traits necessary to make a living, while others weren't.
Singers didn't want software clones, but voice actors are fair game.
AI voice is cheaper, but it's also a more boring and generic performance. There is zero progress made towards any sort of creative AI that produces good unique work.
The market for this then is small businesses who can't afford a professional voice actor. AI is opening up new markets, not killing the jobs of the truly talented.
The real product would have a real voice over actor paid for with VC money.
It's only been a year. Give it some time and I'm sure AI will have much better results. Right now, you can get some of that unique work by finetuning the AI off of a person's existing portfolio.
Most likely you'd see a lot of people saying that somehow getting rid of voice actors is good for "progress". Whatever that means.
Random aside: someone really needs to make a Hacker News that focuses more on game development and other arts, so blog posts like the one you're talking about would have a proper community to discuss them.
* Create dynamic new voice lines at runtime, for example game characters reacting to new situations.
* Operate at a scale that's infeasible for humans, for example turning every ebook into an audiobook.
The work product produced by their voice for fulfilling the contract is owned. No corp owns someone else's voice.
It's one reason why VAs rarely take fan requests for a character they voice.
I'm curious though if some AI soon could in fact synthesize the Beach Boys' style with the actual chords and melody from the NIN song, possibly with some of the pathos of Johnny Cash as well.
The one that always comes to mind for me is this video of an Eminem interview done from scratch as a Talking Heads song: https://www.youtube.com/watch?v=Kfl3N9nesRg
This is potentially something that generative AI could be good at doing (at least recreating vocals), but this parody of the Talking Heads required a lot of very clever insight into what made a good Talking Heads song and returned a convincing and novel melody. And I think we are still a ways off.
It's always more fun when it's a real group of talented people being silly, but I'd listen to an album of weird mashups like this for sure.
Imagine half a million people out in the streets together. You’d definitely notice that. Meanwhile, we can have these massive online communities and you’d never know unless you accidentally stumbled across it or someone told you about it.
In the streets, sure. Meeting up at out of town conference centers a few times a year, probably not. Most real communities have always been "dark matter" to those outside them; Discord working the same way feels more authentic than most of the internet.
There was a generation that preferred mailing lists. There was a generation that preferred IRC and BBS, and "my" generation which likes forums and lengthy comment threads. One would be naive to think this style (the one we're engaging in here) would last forever.
There are definitely very real criticisms of Discord, searchability and discoverability being the most common, but at this point I think the die has been cast. Young people have made their choice.
These big teleconference apps are usually hit or miss, but Discord seems to be the current winner for actual "social networking", helped along by its popularity in the gaming community.
That being said, Discord does have some advantages over older forum-type communities - it's usually way better for cultivating smaller communities, and its no-effort-required chat systems means that you can always hop on and discuss things that are on the cutting edge. This is quite important in a field like AI, where it feels like something revolutionary happens every other week.
(Also, I don't know if that implication was intentional, but gen Z and "underaged" haven't meant the same thing for many years now)
Like, there's a whole lot of "classic song done by presently popular rapper," and I'll be the first to insist that there is nearly nothing vocally interesting at all coming from today's popular hip-hop artists (and I say this as an extreme long-time hip-hop aficionado).
I doubt it's currently actually "the best open source text to speech", but the answer I came up with when throwing a couple of hours at the problem some months ago was "ttsprech" [3].
Following the guide, it was pretty trivial to make the model render my sample text in about 100 English "voices" (many of which were similar to each other, and in varying quality). Sampling those, I got about 10 that were pretty "good". And maybe 6 that were the "best ones" (very natural, not annoying to listen to, actually sounded like a person by and large), and maybe 2 made the top (as in, a tossup for the most listenable, all factors considered).
IIRC, the license was free for noncommercial use only. I'm not sure exactly "how open source" they are, but it was simple to install the dependencies and write the basic Python to try it out; I had to write a for loop to try all the voices like I wanted. I ended up using something else for the project for other reasons, but this could still be a fairly good backup option for some use cases, IMO.
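The "for loop to try all the voices" could look roughly like the sketch below. It is not the commenter's code: `render_all_voices`, `fake_synthesize`, and the voice names are hypothetical, and the real engine call is left as a pluggable `synthesize(text, voice)` function so that nothing here claims a particular TTS library's API. Only the standard library is used to write the WAV files.

```python
import math
import struct
import wave
from pathlib import Path
from typing import Callable, Iterable, Sequence

# Hypothetical interface: (text, voice name) -> mono samples in [-1.0, 1.0].
Synthesizer = Callable[[str, str], Sequence[float]]


def render_all_voices(
    synthesize: Synthesizer,
    voices: Iterable[str],
    text: str,
    out_dir: str,
    sample_rate: int = 48000,
) -> list[Path]:
    """Render the same text once per voice and save each result as <voice>.wav."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for voice in voices:
        samples = synthesize(text, voice)
        # Clamp to [-1, 1] and convert to 16-bit signed PCM.
        pcm = struct.pack(
            f"<{len(samples)}h",
            *(int(max(-1.0, min(1.0, s)) * 32767) for s in samples),
        )
        path = target / f"{voice}.wav"
        with wave.open(str(path), "wb") as wav:
            wav.setnchannels(1)        # mono
            wav.setsampwidth(2)        # 16-bit samples
            wav.setframerate(sample_rate)
            wav.writeframes(pcm)
        written.append(path)
    return written


def fake_synthesize(text: str, voice: str) -> list[float]:
    """Stand-in for a real TTS call: 0.1 s of a tone whose pitch depends on the voice name."""
    freq = 220 + 20 * (sum(map(ord, voice)) % 10)
    return [0.3 * math.sin(2 * math.pi * freq * n / 48000) for n in range(4800)]
```

Swapping `fake_synthesize` for a thin wrapper around whichever engine is installed, and passing that engine's list of voice names, gives one file per voice to audition side by side.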
PRE-EDIT, ERRONEOUS ANSWER
Same as above, but I had said "Silero" [0, 1, 2] originally, which I started trying out too, before switching to a third (less open) option.
[0] https://github.com/snakers4/silero-models#text-to-speech
[1] https://silero.ai
[2] https://github.com/snakers4/silero-models#standalone-use
[3] https://github.com/Grumbel/ttsprech#usage
I'm still awaiting a StyleTTS2 implementation. The audio samples sound top notch: https://styletts2.github.io/
Looks promising, I'm going to check it out too! MIT license, even! If it's fast enough for real time, it could be the new best option. The paper claims faster inference than VITS...
How many audio books is 40 hours?
Also, while its voice cloning was truly amazing, every once in a while the voice would get a little nutty and sound like an insect just flew down their throat, or maybe they had an LSD flashback. Normal normal normal then it's some Bobcat Goldthwait skit. And if you dialed down that parameter (I think it's called stability?) then it goes monotone really quickly.
We're probably several years out from it being something people use personally for audio books.
All of these AI as a Service (AaaS?) API companies are going to race each other to razor thin margins. Immediately after ElevenLabs raised, five other TTS services raised nearly the same amount of money.
Are you reading War & Peace or Cat In The Hat?
I doubt they're better than Google's TTS though.
https://github.com/suno-ai/bark Demo at https://huggingface.co/spaces/suno/bark
In the couple of samples I tried, it was substantially better at picking up meaning compared to VALL-E X.
I haven't re-evaluated OSS TTS options for a few months but from my own experience earlier in the year I've been pleased with the results I've gotten from Piper:
* https://github.com/rhasspy/piper
I've primarily used it with the LibriTTS-based voices due to their license but if it's for personal local use you can probably use some of the other even higher quality voices.
The official samples are here: https://rhasspy.github.io/piper-samples/
Here's a small number of pre-rendered samples I've used that were generated from a WIP Piper port of my Dialogue Tool[0] project: https://rancidbacon.gitlab.io/piper-tts-demos/
While it's not perfect & output quality varies for a number of reasons, I've been using it because it's MIT licensed & there's multiple diverse voice options with licenses that suit my purposes.
(Piper and its predecessors Larynx & Mimic3 are significantly ahead of where other FLOSS options had been up until their existence in terms of quality.)
[0] https://rancidbacon.itch.io/dialogue-tool-for-larynx-text-to...
----
Edit to add links to some of my notes related to FLOSS TTS, in case they're of interest:
* https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...
* https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...
* https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...
That’s what this sounds like. Five syllables of Michael Jackson while he’s trying to be Action Hero or Big Villain, or Funny Sidekick (a problem Eddie Murphy has never had, all evidence from Coming to America notwithstanding).
I'm guessing contracts will need to be updated to say that an AI-made version of a character's voice can't be used, so that a completely different production cannot claim to have the actor attached for publicity purposes.
Glad to hear you and your peers are still posting on the open web!
https://learn.microsoft.com/en-us/windows/win32/fileio/hard-...
https://learn.microsoft.com/en-us/windows-server/administrat...
https://youtu.be/XkLqAlIETkA (Extremely NSFW without headphones)
For me it came from the voice; I hadn't heard of Gilbert Gottfried as a specific person until I read this discussion. The reaction faces of the women listeners were also amusing.
Head over to Audible reviews: some books are widely considered to be great books as written, but the audiobook is reviewed as one to be avoided because it was recorded poorly, the narrator paced it wrong, they had an annoying voice, they couldn’t do a voice of the opposite gender, whatever.
Plus it seems like a great accessibility feature. Many books are recorded for the vision impaired community by volunteers and that’s admirable, but some of the AI today does a much better job.
Any AI voice could save that one. Any of them! Heck, the original voice on the 1984 Macintosh could do better.
The danger behind AI and other manipulative technology is that it erodes trust. We already have serious issues with trust in media, and not just the obvious cases of Russian/Chinese propaganda, but also stuff like kids getting anorexia from extremely photoshopped advertising.
Add AI on top and no one can be certain about anything anymore. Say someone distributes a fake "recording" of the US President calling for glassing Moscow, or the Serbian President declaring war on Kosovo? That has the potential to actually cost lives on a massive scale.
Remember when Stable Diffusion was released a year ago and one of the big artist copes was "sure, it can generate random images, but it'll never be able to generate the same character repeatedly!" They were already wrong because Textual Inversion and DreamBooth were already published, and soon enough, ported to SD and now people could dump out thousands of images of the same character in the same consistent style etc (and did).
As a joke, I can see it being funny, but it was a jarring way to experience it.
California, for example:
"Any person who knowingly uses another’s name, voice, signature, photograph, or likeness, in any manner, on or in products, merchandise, or goods, or for purposes of advertising or selling, or soliciting purchases of, products, merchandise, goods or services, without such person’s prior consent, or, in the case of a minor, the prior consent of his parent or legal guardian, shall be liable for any damages sustained by the person or persons injured as a result thereof."
https://leginfo.legislature.ca.gov/faces/codes_displaySectio....
C'mon, this is hacker news, what happened to "information should be free"?
I would vote for it only if it somehow encouraged voice actors to experiment and create new interesting styles. Kinda like patents were designed to do: encourage inventors (although recently that has become controversial in the IT world).
Of course it's not totally devoid of skill, you need to be able to emote, inflect, and convey emotion, but the bar is far far lower.
The majority of success is attained like this though. Athletes paid for being born strong, tall, and fast, models paid for being pretty, rich families being paid for being born rich, smart people being paid for being born smart, or hardworking, etc. It's the most dominant factor everywhere.
For work I end up transferring 50-150 gigs often, sometimes daily. Never heard a word from them that this has been a problem.
An unmetered 10G port at a US data center is ~$1500/mo. Not particularly expensive.
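To put that price in per-terabyte terms, here is a small sketch. It assumes a 30-day month and decimal terabytes; the 30% utilization case is an illustrative assumption, not a figure from the thread.

```python
# Effective price per terabyte for a flat-rate, unmetered port.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month


def cost_per_tb(price_usd: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Monthly port price divided by the terabytes actually moved."""
    bytes_moved = link_gbps * 1e9 / 8 * SECONDS_PER_MONTH * utilization
    return price_usd / (bytes_moved / 1e12)


full = cost_per_tb(1500, 10)            # port kept saturated all month
partial = cost_per_tb(1500, 10, 0.30)   # assumption: 30% average utilization

print(f"${full:.2f}/TB at line rate, ${partial:.2f}/TB at 30% utilization")
```

Even at partial utilization, the figure works out to a dollar or two per terabyte under these assumptions.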
> Belyaev is a 29-year-old synthetic-speech artist at the Ukrainian start-up Respeecher, which uses archival recordings and a proprietary A.I. algorithm to create new dialogue with the voices of performers from long ago. The company worked with Lucasfilm to generate the voice of a young Luke Skywalker for Disney+’s The Book of Boba Fett, and the recent Obi-Wan Kenobi series tasked them with making Darth Vader sound like James Earl Jones’s dark side villain from 45 years ago, now that Jones’s voice has altered with age and he has stepped back from the role.
Artists have sued, and won, to have artwork moved, shown differently, or force-sold back to the artist.
Now, everything you say is copyright... you. At least in my legal jurisdiction! Even my image is, in Quebec! Yes, that includes if you take my picture outside.
So what of one's voice? Say you don't have a real agreement to use that voice in any way desired, and then you use that voice to... I don't know, advocate for terrorists or something weird.
What then?
I don't think it's completely clear-cut, and I think there will be changes and decisions on this down the road.
If a company edits an actor's previously recorded dialog in a way that makes them sound in favor of terrorism, in an attempt to make people think the actor said the words, we have issues on so many levels. If the dialog is chopped/re-edited to use as dialog for the same character in later works, then I really don't have issues with it.
NAVA also has guidelines for protection against AI abuse: https://navavoices.org/synth-ai/
That seems insane to me. Do you have specific examples?
"Independent of the author's economic rights, and even after the transfer of the said rights, the author shall have the right to claim authorship of the work and to object to any distortion, modification of, or other derogatory action in relation to the said work, which would be prejudicial to the author's honor or reputation."
https://en.wikipedia.org/wiki/Authors%27_rights
"The authors of dramatic works (plays, etc.) also have the right to authorize the public performance of their works (Article 11, Berne Convention)."
"The protection of the moral rights of an author is based on the view that a creative work is in some way an expression of the author's personality: the moral rights are therefore personal to the author and cannot be transferred to another person except by testament when the author dies."
"“Author” is used in a very wide sense, and includes composers, artists, sculptors and even architects"
Architects can deny changes in interior design: Lighting, artwork, etc., long after the building is finished. Just a few days ago I talked with a theater director: The author of the original work has the right to deny a production, for whatever reason, e.g. if they don't like the nose of an actor.
I bet my voice is mine under most jurisdictions (and I mean most; the Berne Convention has been signed by 181 countries), even if I signed a contract that gives you wide permission to use it. And if I didn't, you can't use it outside of the very narrow scope of the work I produced for you. Even if you simply want to reuse an existing recording in another context.
The amount of stuff humans can accomplish is strongly limited by the supply of workers. Automating one job frees them up to do other things.
Unless you're one of the people out of work. And even if you don't care anything about them, if there's enough of them then the resulting unrest will be your problem anyway.
I've thought a lot about it, and I don't think it's true.
It’s a really different world now. I’ve got massive models running on my laptop thanks to Apple Silicon and the unified memory architecture, and the C++ ports of various diffusion image models and several families of large language models work well on my AMD GPU too. It’s so much easier to participate in the current generation of applied ML work without having to go out of my way to get hardware with specific ML support.
100 books/year. That's an impressive feat regardless of the number of pages. Are these downloaded ebooks or physical printed copies of books?
I've seen a few for download, and they're always like hundreds of meg, if not over a gig. And that's in mp3, where it should be compressed heavily.
That's just not good value. Was sort of my point.
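For reference, the size of a constant-bitrate MP3 follows directly from bitrate and duration. The bitrates and running times below are illustrative assumptions, not measurements of any particular audiobook.

```python
# File size of a constant-bitrate audio file: bitrate x duration.
def audiobook_size_mb(hours: float, bitrate_kbps: float) -> float:
    """Return the size in decimal megabytes for the given length and bitrate."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return bytes_per_second * hours * 3600 / 1e6


# Assumed examples: a 10-hour and a 20-hour book at common speech bitrates.
for hours in (10, 20):
    for kbps in (32, 64, 128):
        print(f"{hours:>2} h at {kbps:>3} kbps: {audiobook_size_mb(hours, kbps):7.0f} MB")
```

On those assumptions, a 10-hour book lands between roughly 150 MB and 600 MB, and a 20-hour book at 128 kbps passes 1 GB, so sizes in the hundreds of megabytes are about what the arithmetic predicts.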
That it often results in them getting an equivalent mindshare (or more) of the Representatives' views is also not surprising, and only natural.
It doesn't inspire warm fuzzies in those too busy working to survive though.
My law classes did cover common law, yes, but not favourably (can you guess I come from a civil law country?). Sounds like a system that made sense in 15th century Britain, but is quite the complex beast with many issues nowadays when it doesn't need to be.
However that still doesn't answer my original question, why is there no new legislation to cover the newly existing scenarios talked about? It seems to me that even the UK does that at least for some things, and they're the original common law country.
Eleventh.
We've had an infestation of "pay me or I won't share" types.
The community is not gatekeeping knowledge, anyone can join. It merely tries to keep certain corporations out...
PyTorch nightly (which I use for CUDA 12) doesn't work with Python 3.12, but if you stick with 3.11 or 3.10 you should be OK. The rest of the requirements were listed without version numbers, so on a clean venv you should be fine. However, there's a bug in the Utils lib that requires a 1-line fix if you're trying to run inference (also linked). nltk was the only dependency not listed, so not bad compared to most code drops!
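A quick pre-flight check along those lines might look like the sketch below. The set of working Python versions simply restates the report above as of when it was written and may no longer hold; `python_is_supported` and `missing_modules` are hypothetical helper names.

```python
import importlib.util
import sys

# Assumption from the comment above: 3.10 and 3.11 worked, 3.12 did not (at the time).
WORKING_PYTHONS = {(3, 10), (3, 11)}


def python_is_supported(version=sys.version_info) -> bool:
    """True if the interpreter's major.minor is one reported to work."""
    return (version[0], version[1]) in WORKING_PYTHONS


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if not python_is_supported():
        print(f"Python {sys.version_info[0]}.{sys.version_info[1]} is untested here; try 3.10 or 3.11")
    # nltk was the one dependency reported as unlisted.
    for name in missing_modules(["torch", "nltk"]):
        print(f"missing dependency: {name}")
```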
Thanks for writing up your experience! Good to know it works! And it's fast!
PHONEMIZER_ESPEAK_LIBRARY = c:\Program Files\eSpeak NG\libespeak-ng.dll
PHONEMIZER_ESPEAK_PATH = c:\Program Files\eSpeak NG
Edit: Got it working, sounds really great and is super fast as advertised. Amazing! Just tried modifying the code to make it speak more quickly and it worked first try and still sounds good too! This is way better than using Coqui TTS. Just need a few more pretrained models and the voice cloning that was in the paper and this will become super popular very quickly.
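For anyone who would rather set the two eSpeak NG variables quoted above from Python than in the system settings, a minimal sketch follows. The variable names and paths are taken from the comment; `configure_espeak` is a hypothetical helper, and the install directory will differ per machine.

```python
import os
from pathlib import PureWindowsPath


def configure_espeak(install_dir: str, dll_name: str = "libespeak-ng.dll") -> dict:
    """Point the phonemizer environment variables at an eSpeak NG install directory."""
    root = PureWindowsPath(install_dir)  # Windows path rules, regardless of the host OS
    settings = {
        "PHONEMIZER_ESPEAK_LIBRARY": str(root / dll_name),
        "PHONEMIZER_ESPEAK_PATH": str(root),
    }
    os.environ.update(settings)
    return settings


# Path as quoted in the comment; adjust to the actual install location.
configure_espeak(r"c:\Program Files\eSpeak NG")

# Import the phonemizer-based TTS code only after this point,
# so it sees the variables when it initializes.
```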