Deep Learning for Guitar Effect Emulation (teddykoker.com)
Also, the modern TS9 isn't exactly right. I'd love to see this work applied to vintage vs current TS vs modded units.
The author says it works in real-time, but to non music/audio folks this could mean '100 ms latency is real-time enough, right?'
Generally, I think the audio VST business is a really fun space to be in for a lifestyle business, as it is way too small to be attractive for VCs. It seems like a space that provides many niches for lots of small players to thrive in.
As an aside, it's really quite interesting that a lot of cutting edge tech is now used to emulate the hardware-based tech of yesteryear. Think film filters for photoshop, and about 90% of all audio plugins that emulate high end hardware, compressors, pedals, etc etc.
It's basically a terrible place to be a developer in it for the money. Really fun work otherwise. The cool gigs are the ones where you build custom plugins for someone's crazy idea.
In consumer applications, plugins are used all the time for prototyping before you go to hardware. MATLAB is way too slow for anything useful.
I’m curious if anyone has any direct knowledge about that.
There are so many professional activities similar to that where no one makes any money and people really just do it for the love, and then there are seemingly similar things like that where people make surprisingly large amounts of money.
There are also several solo/small shop developers that do make a living from selling plug-ins. Here are a few that I can think of off the top of my head.
Auburn Sounds: https://www.auburnsounds.com/ Valhalla DSP: https://valhalladsp.com/ Kilohearts: https://kilohearts.com/
It's hard to tell how much Duda is an outlier, though, and how many other people could successfully follow his path.
One of the things I've also heard from labels is that not only is there money in the VST world (it's also very crowded, piracy is rampant as noted, etc.), but a lot of plugins are also ported over to iOS and sold as "virtual pedals". The number of sales and revenue there was noted as being very interesting.
But you have to be willing to put in the time and make phenomenal products, because no one wants average instruments and effects, we can get those for free.
($1200) https://www.native-instruments.com/en/products/komplete/bund...
($300) https://www.soundtoys.com/product/soundtoys-5/
($500) https://www.arturia.com/products/analog-classics/v-collectio...
However, piracy is also pretty big when it comes to plugins.
It doesn't have their breadth, but the tones it does have are nearly as good as it gets without serious air movement.
One meaning is just that you can guarantee specific deadlines. So if your programme can react within an hour guaranteed, that would be real-time. (Though usually we are talking about tighter deadlines, like what's needed to make ABS brakes work.)
For 'real time' music usage you wouldn't need strict guarantees, but something that's usually fast enough.
“Usually fast enough” are three words that guarantee failure in a live show/MIDI environment, which is a large use case of VST and its peers beyond production. By extension, “usually fast enough” further guarantees nobody will ever use your software. That’s noticeable right away.
The question isn’t about compsci real-time theorycrafting, it’s “here’s a buffer of samples, if you don’t give it back in a dozen milliseconds the entire show collapses.” That’s pretty clearly meant by “real time“ contextually.
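To put rough numbers on that deadline, here is a minimal sketch of the arithmetic. The buffer sizes and the 48 kHz sample rate are my assumptions, not figures from the article; the point is only that the processing budget for one buffer is its length divided by the sample rate.

```python
def buffer_deadline_ms(buffer_size: int, sample_rate: int) -> float:
    """Time (in ms) that one buffer of audio covers.

    A plugin has to return the processed buffer within roughly this
    window, or the host has nothing to play next.
    """
    return 1000.0 * buffer_size / sample_rate


# Assumed, typical-looking settings; not taken from the article.
for frames in (64, 128, 256, 512):
    budget = buffer_deadline_ms(frames, 48_000)
    print(f"{frames} frames @ 48 kHz -> {budget:.2f} ms per buffer")
```

Smaller buffers mean lower latency but a tighter deadline per callback, which is why "a dozen milliseconds" is about the upper end of what a live setup would tolerate.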
Also, I don't think this approach will work well with time-varying effects such as chorus, although I'm happy to be proven wrong.
The non-linearity of the ear is frequency dependent[0], but in practice I suspect it would be sufficient to pre-process the linear PCM data with x=sqrt(x) and undo before playback with x=x^2.
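A minimal sketch of that pre/post-processing idea, assuming samples are floats in [-1, 1]. The sign handling is my assumption: the comment only gives x=sqrt(x) and x=x^2, which on their own cannot be inverted for negative samples, so the functions below apply them to the magnitude and keep the sign. The names `compress` and `expand` are hypothetical.

```python
import math


def compress(x: float) -> float:
    """Signed square root: applied to the PCM data before the model."""
    return math.copysign(math.sqrt(abs(x)), x)


def expand(y: float) -> float:
    """Signed square: undoes compress() before playback."""
    return math.copysign(y * y, y)


samples = [-1.0, -0.25, 0.0, 0.04, 1.0]
restored = [expand(compress(s)) for s in samples]
print(restored)
```

Whether this is actually "sufficient in practice" is the commenter's guess; the sketch only shows that the round trip is lossless up to floating-point error.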
A distortion pedal is essentially just a waveshaper [1]. Think of audio in digital terms as just a series of numbers. A waveshaper is just a simple mathematical function. To apply it, you literally just apply the function to each value in the input stream and there's your output stream. There's no memory or interesting algorithms going on. It's the audio equivalent to calling map() on your list of samples with some lambda to produce a new list of samples.
Of course distortion pedals do that in the analogue domain using circuitry, which has some additional complexity because transistors and diodes and friends don't behave exactly like mathematical functions. There's "sag" and some other physical effects that cause the output to also somewhat depend on previous input.
Even so, that can generally be modelled using a simple convolution. Each output sample is calculated by taking some finite number of previous input samples, multiplying each of them by a weight factor, and then summing the results.
Does that sound like a neural net? It is. That's why we call them convolutional neural networks. Convolution is bread and butter in DSP. You can easily generate one that produces the same effect as some piece of hardware or acoustic environment by running an impulse (a single 1.0 sample surrounded by silence) through the system and then recording the result. That "impulse response" essentially is your set of convolution weights.
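Here is a minimal sketch of the impulse trick, with one big assumption: the "hardware" is a hypothetical black box that really is linear and time-invariant (a three-tap filter with made-up weights). The capture only recovers the system exactly under that assumption.

```python
def fir(samples, weights):
    """Each output sample is a weighted sum of the current and previous inputs."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, w in enumerate(weights):
            if n - k >= 0:
                acc += w * samples[n - k]
        out.append(acc)
    return out


def capture_impulse_response(system, length):
    """Feed a single 1.0 followed by silence and record what comes out."""
    impulse = [1.0] + [0.0] * (length - 1)
    return system(impulse)


# Hypothetical linear "hardware" whose weights we pretend not to know.
hidden_weights = [0.5, 0.3, 0.2]


def black_box(samples):
    return fir(samples, hidden_weights)


ir = capture_impulse_response(black_box, 8)
signal = [0.2, -0.4, 0.9, 0.1, 0.0, -0.7]
print(ir)
print(fir(signal, ir))      # emulation using the captured weights
print(black_box(signal))    # the "real" system, for comparison
```

For this linear box the two printed outputs agree. Swap a clipper into `black_box` and they stop agreeing, which is the objection raised elsewhere in the thread.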
So using a deep neural network and then training it sounds a lot like overkill to me. You could accomplish much the same by using a "depth-1 network" and running an impulse through it.
Caveat, though: I am just a novice here, so there could very well be a lot of subtlety I'm missing out on.
Maybe for the average person or buried in the mix, but the audio samples were easy to distinguish for me as a guitarist. The NN samples' unnatural decay was a dead giveaway.
Very impressive given it's from a NN, but I specifically moved to analog for that reason.
It was close though, so maybe for say a beginner on a shoe-string budget it would be perfectly acceptable.
While it may not be able to emulate a real pedal to create one’s own sound, it would be interesting / fun for amateurs when applied as a post-filter with an interface that says “make this sound like X famous incredible track” coming out of a stock guitar signal.
It seems easy for me to differentiate them and I’m a beginner with guitars (~1 month, so I’m your average Joe). It’s pretty good though, I’m sure it can be improved greatly.
That being said the Tube Screamer is a somewhat simple effect: it's just a distortion with the clipping diodes moved to the feedback loop.
How possible would it be to get the famous A/B class amplifier voltage sag and associated changes in parameters of the whole amplifier, or in other words "will it chug"?
Many of the techniques discussed were variations on image processing - transforming the input to the frequency domain then converting this to an image, and applying standard techniques to transform the image, then back to the time domain. There are many compromises with this approach (losing phase information for example) but with a suitable overlap/add the results were better than I expected, and certainly there's room for further investigation to see if there's useful stuff in there.
Another time domain approach that was applicable to your amplifier model question was an attempt to determine hidden variables in a circuit. Basically, the circuit under test is examined, and rather than build a SPICE model (which can be laborious) the technique was to expose the internal voltages following components with memory (so capacitors for example). These outputs were included in the NN training model, and so in effect the normally hidden internal state was exposed and allowed for a very good approximation.
Here's the paper:
Do you know if there will be a DAFx2020? That would make it the first conference in years that I would really want to attend.
Truly effective modelling of analog pedals, tube amps and guitar cabs has been around for years and is way more cost effective from the bedroom to touring bands.
The "purists" are hipsters who value the rarity of some pedals, massive pedalboards and their tube amps. I'm not knocking them - I understand why there is a nostalgia factor and tweaking dials is cool. As a computer guy though, I much prefer the ability to make things like this in my bedroom: https://i.imgur.com/OqMoBxz.png And when I want to tweak a dial, I program an expression foot controller to tweak any parameter (or multiple).
All that said, great to be looking at modelling techniques...
Also, what is the inference latency on your model? A nice thing about analog guitar effects is that they are blazingly fast.
Awesome, I'd love to hear the opinion of Josh from JHS Pedals on this.
It was also published as a realtime JUCE project, which might be more useful for actual (realtime VST/AU) use:
https://github.com/damskaggep/WaveNetVA
Alec Wright has done more work on this since then, using it for amplifiers:
https://www.aalto.fi/en/news/deep-learning-can-fool-listener...
And time variant effects:
I.e., if you want to build a complete model of the Tube Screamer, you'd essentially have to train a model for each possible setting on the pedal - or in other words, every combination of the knobs.
Sounds like a real chore, if you were to actually do that physically - and in the end, don't you just want to learn the impulse response of the circuit?
I know some tools, like the Kemper modelling gear, are made for that exact purpose, and with extremely convincing results.
What I do have a problem with is that if the pedal is already implemented digitally, then all the human interpretability, along with the classic DSP machinery, is thrown out the window. A better approach would be to build the pedal via a differentiable programming language and then try to gradient descent toward some analog "can't get this juicy tube sound digitally" variant.
That would be part of the problem with this approach.
Also with this approach you pretty much have to train the model with a near infinite collection of guitars in front of the model and a near infinite number of other effects turned on and off in front of the model.
I'd be really curious to see if the model could be expressed as a transfer function and compared to the schematic for the pedal. The Tubescreamer is a fairly simple circuit but the mystery surrounding it indicates that there are some weird variables at play with the component properties that would lead to additional factors in the transfer function. Wonder if those variables could be identified somehow.
And then there's the real holy grail of analog simulation: the tube amplifier. I'm not sure SPICE models really capture the limiting behavior of tubes very well. You might need to implement the spec sheet in code. All fun sounding problems, and I'm not sure anyone has even done them yet.
edit: And a Tubescreamer is one of the examples!
But it is sort of using a sledgehammer where a tap from a spoon will do: the original Tube Screamer is just an op amp and a couple of diodes, plus a bit of EQ! Not much to it.
Plus, your real problems are going to be noise level (Tube Screamers in particular are noisy but a discrete transistor distortion can be made very very quiet), your A/D converter, your power requirements (comparable analog distortion effects use a few milliwatts) and cost.
Edit: But that said, this is a super cool project! Good job! Sorry I just realized that what I wrote was kind of negative.
There are other ways to do that, like Volterra Series, used by Nebula plugins [3]
[1] https://www.uaudio.com/webzine/2004/july/text/content2.html
All of this sounds horrible.. it doesn't even sound like his input is an actual guitar, it sounds like he's using a synth guitar sound or something. There's no dynamics, almost no sustain, no articulations. The outputs barely even sound distinguishable as a guitar through a tube screamer, even his actual tube screamer samples. (Possibly cause his interface is terrible?)
The conclusion is ridiculous given how simplistic everything is.
You can't use two tiny little clips to justify your model being high quality.
The true test has to even allow a bunch of guitarists to move all the knobs, plug the model into different amp & guitar combinations, put other effects in front of and behind it, etc..
The Tube Screamer is called a Tube Screamer because its intended use case is to make the tubes in a tube amp "scream". Using it with all the knobs at noon is not consistent with this. It usually gets used with a tube amp that is already on the verge of distortion, and then you use the TS with the volume turned up a lot (3/4-max) and the gain quite low. This might be part of why this sounds so bad to me.
There are actually two different trains of thought on guitar effect modeling:
- Model it based on input & output waveforms like he's doing
- Actually model the circuit as an electrical simulation and then pass the signal through that.
I have personally found the second approach to be way more realistic and satisfying. The Yamaha THR amps work this way and they're really amazing.
One of the tricks here is a listener might not be able to tell a difference, but the guitar player picks up on a perceived change in how the guitar feels with these effects. A tube screamer has a lot of compression built into it for example. It causes everything you play to sound a little dirtier for the same amount of picking energy you put into the guitar. It will cause the player to play a little more lightly than they would without the effect. This is the kind of thing that makes a player reject the model and want to stick with the real thing, whereas the guy in the naive lab building the model thinks it's great cause they're not even playing an actual guitar through it. Once a skilled player tries it the "feel" is a dead giveaway which is which.
It's easy for some of this stuff to get lost on the electronics crowd if the background is electronic music. An actual acoustic piano is the only keyboard based instrument that has anywhere near the nuance that a guitar has, and a guitar still has way more weird stuff going on with dynamics and articulation. The range of inputs you have to feed into any kind of computer model to simulate guitar well is huge.
What that neural network learns is basically an approximation of a static impulse response. So while it can simulate linear time-invariant effects such as reverb quite nicely, it'll surely have issues with chorus.
I wanted to do a very similar project, but with an overdrive. Let's see if I get time anytime soon!
I'm curious, do you have a reference or source for this? Distortion is non-linear, making it impractical to model using an impulse response. Is there something about neural networks that makes them good for modeling non-linear but time-invariant effects?
Like you said, it will most likely have limitations, but it's still one more tool in the belt, regardless.
That isn't reasonable. There are too many variables beyond the effect, like room, fingers, guitar, and amp. Without the knobs, you haven't delivered the effect.
An impulse response will characterize only a system that is
* linear
* time-invariant
Many effects are not linear (especially distortion: the crunchiness comes from the nonlinearity). f(a) + f(b) != f(a+b)
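The superposition failure is easy to check numerically. A minimal sketch, assuming a hard clipper as the "distortion" and a plain gain stage as the linear reference; the sample values are arbitrary.

```python
def hard_clip(x: float, limit: float = 1.0) -> float:
    """Nonlinear: anything beyond +/-limit is flattened."""
    return max(-limit, min(limit, x))


def gain(x: float, g: float = 0.5) -> float:
    """Linear reference system: simple scaling."""
    return g * x


a, b = 0.8, 0.7

# Linear system: processing separately and summing equals processing the sum.
print(gain(a) + gain(b), gain(a + b))

# Clipper: the sum exceeds the limit, so f(a) + f(b) != f(a + b).
print(hard_clip(a) + hard_clip(b), hard_clip(a + b))
```

An impulse response implicitly assumes the first behaviour, so it cannot describe the second.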
And many effects are time varying, for example phasers and choruses which have low frequency oscillators controlling how the sound is shaped depending on when it comes in. Chorus for example will vary the pitch up and down.
Linear adaptive filters have been around for a long long time, and nowadays are everywhere. They can't capture the nonlinear behavior of effect pedals, not even just the waveshaper.
The model you are describing sounds like a 'Wiener model,' which refers to a linear filter followed by some nonlinearity (i.e. the waveshaper).
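For concreteness, a minimal sketch of that structure: an FIR filter feeding a static tanh nonlinearity. The filter weights and `drive` value are made up for illustration and are not fitted to any pedal.

```python
import math


def wiener_model(samples, weights, drive=4.0):
    """Linear FIR filter followed by a memoryless nonlinearity."""
    out = []
    for n in range(len(samples)):
        # Linear stage: weighted sum of current and previous inputs.
        filtered = sum(
            w * samples[n - k] for k, w in enumerate(weights) if n - k >= 0
        )
        # Nonlinear stage: static waveshaper applied to the filtered sample.
        out.append(math.tanh(drive * filtered))
    return out


# Hypothetical weights, chosen only to make the example run.
print(wiener_model([1.0, 0.0, 0.0, -0.5], weights=[0.5, 0.25]))
```

The memory lives entirely in the filter and the distortion entirely in the waveshaper, which is what makes this family cheap to fit but limited in what it can represent.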
There are other approaches to nonlinear adaptive filters, like Volterra series and kernel methods.
People have been using all of these techniques, and more, to approximate analog audio effects for decades.
A 'trained deep neural network' is not in principle that much different or 'less pure' than other nonlinear adaptive filtering techniques, just with a load more parameters. What matters is if the results are sufficiently improved to justify the computation.
The same pedal from this post has been painstakingly circuit modeled by Cytomic[1] over the past few years and still isn't out of beta. Analog circuit modeling is a huge thing in DSP right now because it's the closest we have to proper 1:1 software clones of analog hardware. But it's incredibly time consuming.
I'm really excited by this use of WaveNet. It could drastically cut down the time to clone old costly to maintain hardware. But it will have some way to go before you can tweak the parameters in realtime. Or so I assume?
I imagine the difficulty in designing these models comes from modeling the variable factors, i.e. the parameters normally controlled by the knobs on the amp or effect. Some of these should be straightforward (for example "gain" increasing the volume on the input signal), but I suspect that in some pedals these parameters changing can have impacts on how other parameters behave. I don't see any mention of how this "deep learning" model works with that.
Guitar modeling gear has been around for about 25 years (the first Line6 amp debuted in 1996; I'm not sure if there were earlier products brought to market). They've been derided by purists, but have kind of turned a corner in recent years and are now becoming very mainstream.
Some modern products, such as those sold by Kemper, actually allow you to plug in to your existing gear and generate a profile based on the impulse response. The results, at least according to the reviews I've read, are actually very impressive.
This would be true for a linear impulse response; however, for this kind of effect you need both state/memory (like a convolution) and non-linearity (like a waveshaper), which is why people use RNNs and CNNs.
My personal experience with electronic tools is the lack of feel. Can I make music with digital tools like AxeFX and similar? Absofreakinglutely. No doubt about it.
But those digital tools feel VERY different to me than the real thing. I'm not just talking about a speaker moving air, though that's certainly part of it. My tube amp simply responds differently than any digital model of a similar amp.
I find tools like the Kemper to be amazing, but they're just a snapshot of an amp in a particular configuration in a particular room.
From a technical standpoint, all this modeling stuff is super cool. But it doesn't feel the same at the end of the day and this is a personal opinion and preference on my part.
I look forward to the day that I can get an amp in a pedal (like the Strymon Iridium) and it behaves the same as the real amp. I think Fender's Deluxe Reverb (Tonemaster model) is as close as it has ever gotten, but it very specifically emulates a single amp and does so within a real amp cabinet rather than pushing it out to an audio interface.
Anyway, anything that gets people playing guitar is, in my opinion, a great thing. We live in a golden age of guitar equipment. I don't think it can honestly get much better than it is right now. It's an amazing time to be a guitar player and incredible options are available at amazing prices.
Can you expand on this a bit? Curious what you mean by responds and what the difference is.
It is indeed an amazing time to be a guitar player!
It sure is a great time for guitar equipment, as the digital revolution has made its way there too.
But being a guitar player is also increasingly lonely : https://www.washingtonpost.com/graphics/2017/lifestyle/the-s...
And it's arguably an opportunity cost for a kid to be pouring so much effort today learning the iconic (but tired) instrument of the boomer generation, when they could be breaking new musical ground instead, mastering Ableton's Push for instance. But to each their own, of course.
Quick question - how does that Axe-FX compare to various Amp emulators such as AmpliTube, Line 6 Helix Native, Guitar Rig, Positive Grid BIAS Amp, S-Gear, etc... ?
IMHO for the bedroom player the Helix is the best solution as it's good enough and significantly cheaper than the other options.
If you want to check this out for yourself, try a Line 6 Helix, and then the Helix Native VST with a normal soundcard. For me, there were orders of magnitude of difference. Good modelling boils down to having excellent hardware.
It can be done. I don’t know why we aren’t there.
The audio world is halfway to the alien truther community: the closer a rational outsider looks at it, the crazier they feel. Technically, it’s a trivial field. Yet here we are with snake oil saturation and subpar solutions.
This guy has been doing component level simulation from the beginning. I have one and it is accurate enough to convince some pretty big players to ditch their tube amps.
You will have the same sound from gig to gig and a lot of bands really value this.
Overall, my goal was to add to this discussion by pointing out the massive progress that has been made and also to show off my supercool signal path in the hopes that it would be inspirational to fellow geeks like me.
The price points really aren't horrendous if you consider how expensive the engineering is, how little demand there is, and how long you need to maintain a product. You aren't being ripped off by spending a couple hundred bucks on a plugin. I think we'll end up at a place where everything is a subscription, but I can tell you from experience that it creates friction for the users.
I made fun of him and we wouldn't have trusted it to be used live, but damn it worked impressively well
Distortion is non-linear, it is something like a max(-1, min(1, input)) function (a waveshaper, like you said), and it produces harmonics when applied to audio signals.
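The "produces harmonics" part can be checked with a small DFT. A minimal sketch, assuming a sine that is driven to twice the clipping threshold; the signal length, frequency and the helper `bin_magnitude` are all choices made for this illustration.

```python
import cmath
import math


def hard_clip(x: float) -> float:
    return max(-1.0, min(1.0, x))


def bin_magnitude(samples, k):
    """Amplitude of the k-th DFT bin (naive single-bin DFT)."""
    n = len(samples)
    acc = sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(samples))
    return abs(acc) * 2 / n


N = 256        # number of samples
CYCLES = 4     # the sine completes 4 cycles, so its fundamental is bin 4

sine = [2.0 * math.sin(2 * math.pi * CYCLES * i / N) for i in range(N)]
clipped = [hard_clip(s) for s in sine]

third = 3 * CYCLES  # bin of the third harmonic
print("3rd harmonic, clean sine:  ", bin_magnitude(sine, third))
print("3rd harmonic, clipped sine:", bin_magnitude(clipped, third))
```

The clean sine has essentially no energy at the third harmonic, while the clipped one does; a linear filter could never create that new frequency content.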
However guitar pedals also have some additional circuitry to "sweeten" the distortion, removing the extra harmonics added by the clipping diodes. Tubescreamers are notable for cutting bass and enhancing mids. An IR is able to capture this. This is important for guitar pedals, and the reason multiple of them exist.
If you capture the impulse response of an overdrive pedal you'll be capturing only the frequency response of a distorted impulse. If you process clean guitar through this you'll simulate the frequency response but not the distortion itself, so it will just be a clean guitar with a tinny, shrill sound, not an overdriven guitar sound.
One way around it (other than the idea in this article!) is doing multiple passes of Impulse Response capture with different amplitudes, this will capture this distortion non-linearity. This is supposedly how a Kemper Profiler works.
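A minimal sketch of why a single capture is not enough, using a toy overdrive (tanh clipper into a one-pole smoothing filter). The toy circuit and its parameters are my assumptions, and this says nothing about how a Kemper actually works internally; it only shows that the measured response depends on the level of the test impulse.

```python
import math


def overdrive(samples, drive=5.0, smooth=0.5):
    """Toy pedal: static tanh clipper followed by a one-pole lowpass."""
    out, state = [], 0.0
    for s in samples:
        shaped = math.tanh(drive * s)
        state = smooth * state + (1.0 - smooth) * shaped
        out.append(state)
    return out


def normalized_ir(system, amplitude, length=8):
    """Capture an impulse response at a given level, scaled back by that level."""
    impulse = [amplitude] + [0.0] * (length - 1)
    return [y / amplitude for y in system(impulse)]


quiet = normalized_ir(overdrive, 0.01)
loud = normalized_ir(overdrive, 1.0)
print("quiet capture:", [round(v, 4) for v in quiet[:4]])
print("loud capture: ", [round(v, 4) for v in loud[:4]])
```

For a linear system the two normalized captures would be identical. Here the decay shape (the filter) matches but the effective gain differs a lot between levels, and that level dependence is the part a set of captures at several amplitudes would try to pick up.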
It's a great conference, well worth attending. It's heavy on the maths, but that's DSP for you!
Calling the guitar a boomer generation instrument is odd.
I suspect Martin would argue they were well ahead of the curve, since they've been creating guitars for over a hundred years.
An acquaintance who builds boutique studio gear had some of his creations modeled by them, and we were quite impressed.
The feeling of this is significantly different in almost every emulated/simulated/modeled amp than reality. They can be close, but the "feel" of it on the guitar side is often quite different.
Generally speaking, I feel like I have more control over the sound and how it plays with a real tube amp over a modeled amp.
Does that help? I approached this as if you aren't a guitarist, but if you are, sorry for the boring bits that you already probably know.
As for the collection of guitars and samples - not necessarily, it would depend on how you set up the training.
>Do solo or small shop vst plugin developers make any money?
So realtime might not match your definition, but it is consistent in audio production.
For humans, you can start to notice the lag @ 50ms. (A selection of experimental results summarized here https://gamedev.stackexchange.com/a/74975)
I believe he was incorrect to call guitar players who use analog equipment hipsters, as using analog equipment is the status quo, not some niche subculture outside of the mainstream.
I would like to respectfully suggest being a little less sensitive, though. Not giving new things a chance because of other people's attitudes seems very silly to me.
I've tried a lot of modelers. If all I ever hear is that I'm defective for not thinking they're perfect, why would I be open? It feels like a crusade with a side of propaganda. I really do want them to be good.
You might want to familiarize yourself with [0]. Time-invariance is a specific property of a system, where the output (for any given input) has no dependency on if the input signal happens now or 1 second from now or 100 years from now (except for the corresponding delay). Most reverb models are, to a first approximation, time invariant, because the effect will have the same sound for the same guitar line, no matter when you play the line.
Chorus, on the other hand, has a (perhaps subtle) modulator to get that warbly (scientific word!) sound. It doesn't feel like a time-based effect, but it certainly is and that makes it quite a lot more difficult to mimic with a system that (as others have noted) boils down to an impulse response.
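Time-invariance can be tested directly: delay the input, process it, and compare with processing first and delaying afterwards. A minimal sketch, with two stand-in effects that are my own simplifications: a single fixed echo in place of a reverb, and an LFO-modulated gain in place of a chorus (a real chorus modulates a delay time rather than the gain, but the dependence on absolute time is the same idea).

```python
import math


def echo(samples, delay=3, mix=0.5):
    """Fixed echo: the same line sounds the same whenever it is played."""
    return [
        s + (mix * samples[n - delay] if n >= delay else 0.0)
        for n, s in enumerate(samples)
    ]


def lfo_gain(samples, period=16):
    """Modulated effect: the result depends on where the LFO is at time n."""
    return [
        s * (0.5 + 0.5 * math.sin(2 * math.pi * n / period))
        for n, s in enumerate(samples)
    ]


def is_time_invariant(system, signal, shift=5, tol=1e-12):
    """Compare 'shift then process' against 'process then shift'."""
    pad = [0.0] * shift
    shifted_then_processed = system(pad + signal)
    processed_then_shifted = pad + system(signal)
    return all(
        abs(a - b) <= tol
        for a, b in zip(shifted_then_processed, processed_then_shifted)
    )


line = [1.0, 0.5, -0.5, 1.0, 0.25, -1.0, 0.75, 0.1]
print("echo time-invariant?    ", is_time_invariant(echo, line))
print("lfo_gain time-invariant?", is_time_invariant(lfo_gain, line))
```

A passing check on one signal and one shift is only evidence, not proof, but a single failure is enough to show a system is time-varying.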
Studio reverbs famously aren't, and some of the most popular models (notably Lexicon) have included time-variant algorithms since the late 70s. The processing power to handle IR convolution didn't exist, and it turned out some time variation added lushness and density to the sound that simpler models couldn't capture.
Modelling a chorus or time-variant reverb with any form of convolution - including any convolution-based neural net - is a complete waste of time, because most chorus algos are trivial and convolution is completely the wrong tool for the job.
It's literally about as useful as taking a still picture of a 90 minute movie.
Pasting the other response below as well:
> Ah righto, the reverb pedal I'm most familiar with turns out to not be just reverb - EQD Afterneath does a whole bunch of funky stuff. Plain reverb though, yeah. I was approaching this more from the angle of training a neural network, where the input and output waves have to be correlated over a great span of time/ samples.
You can learn to play it. Pipe organs routinely have more than 50ms latency just from the distance the sound has to travel from the pipes to the organist. Add the time needed to set up steady oscillations in large pipes, and the slow pneumatic actions found in some organs, and >200ms latency is nothing unusual. The important thing is that the latency is consistent.
(jk)
That said, I agree that the question of what "real-time" might mean is irrelevant given the context.
I’ll repeat again that any compsci theorycrafting is not the concern here, and real-time has a very specific meaning in DSP. Computer science does not own the concept of real-time, and the only people tripping over the terminology are those with more compsci experience than DSP. I appreciate everyone trying to explain this to me, but (a) I understand both, and (b) this is like saying “no, Captain, a vector could mean anything like a mathematical collection, air traffic control should learn a thing or two from mathematics.”
It's not "theorycrafting" to say that real-time music software running in a preemptive multitasking operating system without deterministic process time allocation will have to suffer the possibility of occasional drops. It happens in practice and audio drivers have to be implemented to account for the bulk of it, and the VST API is designed in such a way that failure to fill a buffer on time needn't be fatal.
The medium answer is "this is a wavenet model, so inference is probably really expensive unless the continuous output is a huge improvement to performance".
Although the unusual structure of the net here may mean you're doing original and possibly publication-level work to adapt that stuff to this net structure.
If you were really interested in this, there could also be some profit in minimizing the model and then figuring out how to replicate it in a non-neural net way. Direct study of the resulting net may be profitable.
(I'm not in the ML field. I haven't seen anyone report this but I may just not be seeing it. But I'd be intrigued to see the result of running the size reduction on the net, running training on that network, then seeing if maybe you can reduce the resulting network again, then training that, and iterating until you either stop getting reduced sizes or the quality degrades too far. I've also wondered if there is something you could do to a net to encourage it not to have redundancies in it... although in this case the structure itself may do that job.)
Are you speaking from some experience with which I’m unfamiliar where it’s okay for DSP code to fail hourly? Trying to understand your viewpoint.
The article and your comments inspired in me the idea of a wave-net based VST learning wrapper. If the real plugin fails, substitute a wave-net based simulation of the plugin.
The behavior you describe (zero signal on underruns) is a common mitigation. The DAW or the driver itself initializes the buffer that'll eventually be handed to the sound card to zero before the host application requests the plugins to process, and if it doesn't have time to mix the plugin outputs it'll play back the initialized buffer instead.
From aea12 one might think that it's normal for an underrun to be fatal. Because underruns are not an exceptional occurrence during production (where you might occasionally load one plugin too many or run a different application with unpredictable load characteristics like a web browser) it really isn't an unexplored area, and although they are a pretty jarring degradation I've never experienced crashes that directly correlated with underruns.
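A minimal sketch of that mitigation, heavily simplified. Real hosts do this on a dedicated audio thread driven by the driver's callback; the function `render_block`, the plugin-as-callable convention and the deadline check after each plugin are all hypothetical and only meant to show the "pre-zeroed buffer wins on a miss" idea.

```python
import time


def render_block(plugins, block, deadline_s):
    """Return (output, underrun). On a missed deadline the zeroed buffer is used."""
    output = [0.0] * len(block)       # initialized to silence up front
    start = time.perf_counter()
    mixed = list(block)
    for plugin in plugins:
        mixed = plugin(mixed)
        if time.perf_counter() - start > deadline_s:
            return output, True       # underrun: hand over silence, no crash
    output[:] = mixed                 # made it in time: hand over the real audio
    return output, False


def double(block):
    """Fast hypothetical plugin."""
    return [2 * x for x in block]


def sluggish(block):
    """Slow hypothetical plugin that blows a tight deadline."""
    time.sleep(0.02)
    return block


print(render_block([double], [0.1, 0.2], deadline_s=1.0))
print(render_block([sluggish], [0.1, 0.2], deadline_s=0.001))
```

The audible result of the second call is a short dropout rather than a failure of the whole process, which matches the "jarring but not fatal" behaviour described here.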
If your plugins are crashing because of an underrun you have a much more serious problem than underruns. Then you have plugins writing to or reading from memory that wasn't either handed to them by the host or allocated by themselves. That bad code running in your process can cause it to crash is an orthogonal problem to buffer underruns causing skips or stuttering in audio.
Of course audio is block buffered over (mostly) USB, and as long as the buffers are being filled more quickly than they're being played out, the odd ms glitch here and there is irrelevant.
As real-time systems Windows, MacOS and Linux are terrible from a theoretical POV, and they're useless for the kinds of process control applications where even a ms of lag can destroy your control model.
But with adequate buffering and conservative loading they work well enough to handle decent amounts of audio synthesis processing without glitching - live, on stage.
> Of course audio is block buffered over (mostly) USB, and as long as the buffers are being filled more quickly than they're being played out, the odd ms glitch here and there is irrelevant.
As I've noted earlier in the thread. In fact, that the only thing you can offer under such circumstances is that "it usually doesn't happen" because "it's usually fast enough" is my entire point.
> As real-time systems Windows, MacOS and Linux are terrible from a theoretical POV, and they're useless for the kinds of process control applications where even a ms of lag can destroy your control model.
You could employ the same strategies to process control problems where latency is not a problem so much as jitter. You don't, because unlike a music performance an occasional once-in-a-week buffer underflow caused by a system that runs tens to hundreds of processes already at boot can actually make lasting damage there.
(This goes in both directions on the spectrum too. You can have your hearing damaged by infrasound as well.)