Deep Learning for Guitar Effect Emulation

Deep Learning for Guitar Effect Emulation (teddykoker.com)

314 points by teddykoker 6 years ago | 159 comments

fab1an 6 years ago |

Pretty cool, though I wonder what the latency of this would be if used as a plugin?

The author says it works in real-time, but to non music/audio folks this could mean '100 ms latency is real-time enough, right?'

Generally, I think the audio VST business is a really fun space to be in for a lifestyle business, as it is way too small to be attractive for VCs. It seems like a space that provides many niches for lots of small players to thrive in.

As an aside, it's really quite interesting that a lot of cutting edge tech is now used to emulate the hardware-based tech of yesteryear. Think film filters for photoshop, and about 90% of all audio plugins that emulate high end hardware, compressors, pedals, etc etc.

qppo 6 years ago | |

I know of a few shops that took VC money. The big problem isn't the market size so much as how slow the market moves. The product lifetime of a plugin is around a decade. And users hate subscriptions. And it's really hard to determine the value you add to your customers. And no one wants to pay you.

It's basically a terrible place to be a developer in it for the money. Really fun work otherwise. The cool gigs are the ones where you build custom plugins for someone's crazy idea.

In consumer applications, plugins are used all the time for prototyping before you go to hardware. MATLAB is way too slow for anything useful.

nil-sec 6 years ago | | |

The success of splice would disagree with your notion that “users hate subscriptions”. Given the horrendous price point of many of these plugins it seems to be perfect for a subscription based model. To me it always seemed there is more of a pushback from the industry producing vsts than from the consumers.

whiddershins 6 years ago | |

Do solo or small shop vst plugin developers make any money?

I’m curious if anyone has any direct knowledge about that.

There are so many professional activities similar to that where no one makes any money and people really just do it for the love, and then there are seemingly similar things like that where people make surprisingly large amounts of money.

abaga129 6 years ago | | |

I'm fairly new to the game, but I'm a solo developer. Currently I don't make enough to quit my day job, but it is a nice supplementary income, and it's nice to get paid a bit for something I truly enjoy.

There are also several solo/small shop developers that do make a living from selling plug-ins. Here are a few that I can think of off the top of my head.

Auburn Sounds: https://www.auburnsounds.com/ Valhalla DSP: https://valhalladsp.com/ Kilohearts: https://kilohearts.com/

munificent 6 years ago | | |

Steve Duda, the developer of Serum is kind of the poster child for this. He contracts out for pieces of the synth (UI design, resampler, filters), but he's mostly a one-man shop and, as I understand it, Serum pays the bills.

It's hard to tell how much Duda is an outlier, though, and how many other people could successfully follow his path.

gregsadetsky 6 years ago | | |

I was in talks with a (new-style) 'label' that sells samples, sound packs, and VST plugins. Some of their plugins have been purchased 25k times.

One of the things I've also heard from labels is that not only is there money in the VST world (it's also very crowded, piracy is rampant as noted, etc.), but a lot of plugins are also ported over to iOS and sold as "virtual pedals". The sales numbers and revenue there were noted as being very interesting.

TheRealPomax 6 years ago | | |

They do. Strezov sampling is one guy. Serum is one guy. Chris Heinz is one guy, etc. etc.

But you have to be willing to put in the time and make phenomenal products, because no one wants average instruments and effects, we can get those for free.

ff7f00 6 years ago | | |

There are definitely big players making a lot of money from plugins they develop. Here are a few to check out:

($1200) https://www.native-instruments.com/en/products/komplete/bund...

($300) https://www.soundtoys.com/product/soundtoys-5/

($500) https://www.arturia.com/products/analog-classics/v-collectio...

However, piracy is also pretty big when it comes to plugins.

jeremyjh 6 years ago | | |

AFAIK, Mike Scuffham (www.scuffhamamps.com) earns a living developing and selling S-Gear. It might be a semi-retirement or lifestyle type living - not sure - but he's been doing it over a decade now. He doesn't charge as much as he could and gives away free updates for far too long. Despite being a (mostly at least) solo effort, it's widely regarded as being a top-tier amp sim. I personally think it sounds better than both Helix and Bias, which are both heavily bank-rolled outfits.

It doesn't have their breadth, but the tones it does have are nearly as good as it gets without serious air movement.

samplenoise 6 years ago | |

There's latency and there's the somewhat separate question of how much time is needed to make a prediction. Wavenet is causal (no look-ahead) and operates on the sample level so there are no buffers and thus no latency in the strict sense, beyond encoding/decoding into the sample rate and format required by the ML model, which should take <1ms. Whether a model manages to make a prediction in that amount of time depends on things like the receptive field and number of layers. The linked paper says their custom implementation runs at 1.1x real-time. I guess this isn't impossible; their receptive field is ~40ms, vs. 300 ms for the original (notoriously slow) wavenet, and the model is likely to have fewer layers and channels.
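To make the receptive-field point concrete, here is a rough sketch of how the receptive field of a stack of dilated causal convolutions translates into milliseconds. The kernel size and dilation pattern below are hypothetical, picked only for illustration; they are not the configuration from the linked paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in samples, of a stack of dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def samples_to_ms(samples, sample_rate):
    """Convert a sample count to milliseconds at the given sample rate."""
    return 1000.0 * samples / sample_rate

# Hypothetical layout: two repeats of dilations 1, 2, 4, ..., 512, kernel size 2.
dilations = [2 ** i for i in range(10)] * 2
rf = receptive_field(2, dilations)
print(rf, "samples =", round(samples_to_ms(rf, 44100), 1), "ms at 44.1 kHz")
```

Note that the receptive field is how far back the model looks, not a delay: a causal model can still emit each output sample as soon as the matching input sample arrives, provided the computation finishes in time.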

cjlars 6 years ago | | |

"Round trip" latency, from guitar to processing to speakers, needs to be sub-10 ms to be transparent to the musician. Source: spent years playing guitar through my guitar -> ADC -> PC -> DAC -> speaker signal chain

alexlarsson 6 years ago | |

I assume by real-time he meant "able to produce samples at a rate equal to or higher than the audio output sample rate".

TheRealPomax 6 years ago | | |

That's not what real time means, though. Real time processing means taking signals as they come in, and outputting the transformed result such that there is as close to no signal lag as possible. The output can in fact be wildly lower or higher resolution, real-time does not particularly say anything about that. It's all about whether the output plays (for practical purposes) at the perceived "same time" as the input signal. There will always be some delay, but that delay can't be perceivable, and for obvious reasons there can't be any (significant) buffering.

amelius 6 years ago | | |

Latency is equally important.

TheRealPomax 6 years ago | |

While training? Terrible. As a finalised model running in an AU/VST3 wrapper? Probably extremely low.

eru 6 years ago | |

Real-time has a few slightly different meanings. So it's hard to say what the author means.

One meaning is just that you can guarantee specific deadlines. So if your programme can react within an hour guaranteed, that would be real-time. (Though usually we are talking about tighter deadlines, like what's needed to make ABS brakes work.)

For 'real time' music usage you wouldn't need strict guarantees, but something that's usually fast enough.

aea12 6 years ago | | |

Implementing a VST plugin is literally the exact definition of requiring strict latency guarantees. Your comment winds through a lot of unrelated comparisons to ultimately not make any sense.

“Usually fast enough” are three words that guarantee failure in a live show/MIDI environment, which is a large use case of VST and its peers beyond production. By extension, “usually fast enough” further guarantees nobody will ever use your software. That’s noticeable right away.

The question isn’t about compsci real-time theorycrafting, it’s “here’s a buffer of samples, if you don’t give it back in a dozen milliseconds the entire show collapses.” That’s pretty clearly meant by “real time“ contextually.
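As a back-of-the-envelope sketch of that deadline: the time budget for one buffer is simply its length divided by the sample rate. The buffer sizes and the 48 kHz rate below are illustrative assumptions, not values from the article or from any particular host.

```python
def buffer_deadline_ms(buffer_size, sample_rate):
    """Time available to process one buffer before the next one is due, in ms."""
    return 1000.0 * buffer_size / sample_rate

def meets_deadline(processing_ms, buffer_size, sample_rate):
    """True if processing one buffer takes less time than the buffer lasts."""
    return processing_ms < buffer_deadline_ms(buffer_size, sample_rate)

# Illustrative buffer sizes at an assumed 48 kHz sample rate.
for size in (64, 128, 256, 512):
    print(size, "samples ->", round(buffer_deadline_ms(size, 48000), 2), "ms")
```

A model that runs at "1.1x real-time" on average only clears this bar if its worst-case time per buffer also stays under the deadline, which is a stricter condition than average throughput.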

svantana 6 years ago |

End-to-end modelling is very enticing for the lazy engineer, unfortunately parameter control (knobs) are an important feature of most audio effects, and sampling enough of the parameter space will become prohibitive for more complex effects. That's why the traditional approach is divide-and-conquer.

Also, I don't think this approach will work well with time-varying effects such as chorus, although I'm happy to be proven wrong.

mrob 6 years ago |

This isn't bad, but the note decays sound noticeably different. My guess is that the NN doesn't know that human ears have non-linear response that makes them more sensitive to errors in the decay than the attack, so it treats them equivalently. If this is the case then it might be fixable by using logarithmic scale audio samples instead of linear.

The non-linearity of the ear is frequency dependent[0], but in practice I suspect it would be sufficient to pre-process the linear PCM data with x=sqrt(x) and undo before playback with x=x^2.

[0] https://en.wikipedia.org/wiki/Equal-loudness_contour
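A minimal sketch of that pre/post-processing idea, assuming samples are floats in [-1, 1]. Since PCM samples are signed, the square root has to be applied to the magnitude with the sign carried over; that detail is an assumption added here, and the function names are made up for illustration.

```python
import math

def compress(x):
    """Sign-preserving square root: boosts quiet samples relative to loud ones."""
    return math.copysign(math.sqrt(abs(x)), x)

def expand(y):
    """Inverse of compress: sign-preserving square, applied before playback."""
    return math.copysign(y * y, y)

samples = [-1.0, -0.25, 0.0, 0.01, 0.5]
encoded = [compress(s) for s in samples]   # what the network would train on
decoded = [expand(e) for e in encoded]     # back to linear PCM
print(encoded)
print(decoded)
```

With this mapping a sample at 0.01 is encoded at roughly 0.1, so an error of a given size in the encoded domain matters relatively more in the quiet decay than it would in linear PCM.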

thesausageking 6 years ago | |

I came into the comments to say the same thing. To my ears, the NN versions roll off unnaturally at the end and that makes them really easy to identify as artificial.

rubatuga 6 years ago | |

Why square root and not log?

mrob 6 years ago | | |

Cheap and dirty fast calculation. I don't actually know what the best mapping is, so I'd start with this.

munificent 6 years ago |

I'm not an expert on machine learning or DSP, but I do know just enough of each to suspect this isn't anywhere near as impressive as it seems.

A distortion pedal is essentially just a waveshaper [1]. Think of audio in digital terms as just a series of numbers. A waveshaper is just a simple mathematical function. To apply it, you literally just apply the function to each value in the input stream and there's your output stream. There's no memory or interesting algorithms going on. It's the audio equivalent to calling map() on your list of samples with some lambda to produce a new list of samples.
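In code, the map() analogy looks roughly like this. The tanh curve and the drive value are arbitrary choices for illustration, not the transfer curve of any real pedal.

```python
import math

def waveshape(samples, drive=5.0):
    """Memoryless distortion: apply the same function to every sample."""
    return [math.tanh(drive * s) for s in samples]

# A 440 Hz sine at 44.1 kHz as a stand-in for a guitar signal.
sine = [0.8 * math.sin(2 * math.pi * 440 * n / 44100) for n in range(100)]
distorted = waveshape(sine)
print(max(distorted))
```

Each output sample depends only on the input sample at the same position, which is what "no memory" means here.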

Of course distortion pedals do that in the analogue domain using circuitry, which has some additional complexity because transistors and diodes and friends don't behave exactly like mathematical functions. There's "sag" and some other physical effects that cause the output to also somewhat depend on previous input.

Even so, that can generally be modelled using a simple convolution. Each output sample is calculated by taking some finite number of previous input samples, multiplying each of them by a weight factor, and then summing the results.

Does that sound like a neural net? It is. That's why we call them convolutional neural networks. Convolution is bread and butter in DSP. You can easily generate one that produces the same effect as some piece of hardware or acoustic environment by running an impulse (a single 1.0 sample surrounded by silence) through the system and then recording the result. That "impulse response" essentially is your set of convolution weights.
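A small sketch of the impulse-response trick in plain Python. The "black box" here is a hypothetical three-tap linear filter standing in for the hardware; the approach only recovers a system exactly when that system is linear and time-invariant.

```python
def convolve(signal, weights):
    """Causal FIR filter: each output is a weighted sum of current and past inputs."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(weights):
            if n - k >= 0:
                acc += w * signal[n - k]
        out.append(acc)
    return out

def black_box(signal):
    """Hypothetical linear system we pretend not to know the internals of."""
    return convolve(signal, [0.5, 0.3, 0.2])

# Feed in a single 1.0 sample followed by silence and record what comes out.
impulse = [1.0, 0.0, 0.0, 0.0]
impulse_response = black_box(impulse)

# The recorded response now reproduces the black box on other inputs.
test_signal = [0.1, -0.4, 0.7, 0.2, -0.3]
print(convolve(test_signal, impulse_response))
print(black_box(test_signal))
```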

So using a deep neural network and then training sounds a lot like overkill to me. You could accomplish much the same by using a "depth-1 network" and running an impulse through it.

Caveat, though: I am just a novice here, so there could very well be a lot of subtlety I'm missing out on.

[1]: https://en.wikipedia.org/wiki/Waveshaper

jelling 6 years ago |

> We find that the model is able to reproduce a sound nearly indistinguishable from the real analog pedal.

Maybe for the average person or buried in the mix, but the audio samples were easy to distinguish for me as a guitarist. The NN samples' unnatural decay was a dead giveaway.

finder83 6 years ago | |

Agreed, the NN had that "digital" sound you typically get from a simulated tube screamer, such as in a POD HD or something.

Very impressive given it's from a NN, but I specifically moved to analog for that reason.

magicalhippo 6 years ago | |

Even as a regular Joe it was easy for me to distinguish them, and though I was not very confident in my guess, I did guess correctly as well.

It was close though, so maybe for say a beginner on a shoe-string budget it would be perfectly acceptable.

sailfast 6 years ago | |

Yeah - this was clearly audible on my phone speakers, especially during more muddy / multi-note sequences.

While it may not be able to emulate a real pedal to create one’s own sound, it would be interesting / fun for amateurs when applied as a post-filter with an interface that says “make this sound like X famous incredible track” coming out of a stock guitar signal.

hashkb 6 years ago | |

Confirmation bias overrides ear training. Always have an unbiased tone junkie do your blind test.

mrob 6 years ago | |

I could also tell the difference, but I preferred the more staccato sound of the NN version.

jefftk 6 years ago | |

Not really a guitarist, but listening to them I couldn't hear a specific difference. Yet I still liked one of them more. And when I clicked "reveal" that one was the real one, turns out.

zuppy 6 years ago | | |

the real one has longer fading tones, the one generated by machine learning cuts the sound abruptly.

it seems easy for me to differentiate them and I’m a beginner with guitars (~1 month, so I’m your average Joe). it’s pretty good though, I’m sure it can be improved greatly.

Tade0 6 years ago | |

Also the pretty obvious quantization noise which sounds as if the effect had a wide bandwidth, which is impossible with their op-amps at these gains.

zwieback 6 years ago | |

Yeah, real pedal sounds much "better" but maybe we're just used to how they sound.

Tade0 6 years ago |

Sounds great and I had to listen to both of the samples to guess correctly.

That being said the Tube Screamer is a somewhat simple effect: it's just a distortion with the clipping diodes moved to the feedback loop.

How possible would it be to get the famous A/B class amplifier voltage sag and associated changes in parameters of the whole amplifier, or in other words "will it chug"?

cesaref 6 years ago | |

I think this would be very possible - there was quite a bit of discussion of NN techniques for modelling fx at DAFx2019 (http://dafx2019.bcu.ac.uk/). There are a number of papers discussing different techniques in the paper archive.

Many of the techniques discussed were variations on image processing - transforming the input to the frequency domain then converting this to an image, and applying standard techniques to transform the image, then back to the time domain. There are many compromises with this approach (losing phase information for example) but with a suitable overlap/add the results were better than I expected, and certainly there's room for further investigation to see if there's useful stuff in there.

Another time domain approach that was applicable to your amplifier model question was an attempt to determine hidden variables in a circuit. Basically, the circuit under test is examined, and rather than build a SPICE model (which can be laborious) the technique was to expose the internal voltages following components with memory (so capacitors for example). These outputs were included in the NN training model, and so in effect the normally hidden internal state was exposed and allowed for a very good approximation.

Here's the paper:

http://dafx2019.bcu.ac.uk/papers/DAFx2019_paper_42.pdf

Tade0 6 years ago | | |

Thank you very much.

Do you know if there will be a DAFx2020? That would make it the first conference in years that I would really want to attend.

wintermutestwin 6 years ago |

"many purists argue that the sound of analog pedals can not be replaced by their digital counterparts."

Truly effective modelling of analog pedals, tube amps and guitar cabs has been around for years and is way more cost effective from the bedroom to touring bands.

The "purists" are hipsters who value the rarity of some pedals, massive pedalboards and their tube amps. I'm not knocking them - I understand why there is a nostalgia factor and tweaking dials is cool. As a computer guy though, I much prefer the ability to make things like this in my bedroom: https://i.imgur.com/OqMoBxz.png And when I want to tweak a dial, I program an expression foot controller to tweak any parameter (or multiple).

All that said, great to be looking at modelling techniques...

317070 6 years ago |

That is very cool. Though, part of the pedal is, of course, the knobs. You'd need to condition the wavenet on the knobs. Did that work well (I assume that you tried that already)?

Also, what is the inference latency on your model? A nice thing about analog guitar effects is that they are blazingly fast.

ericfrederich 6 years ago |

So this seems similar to an IR (impulse response) where you get a snapshot of an amp mic'd up in a room with knobs fixed at a particular position. In the end, you don't get knobs to fiddle with.

Awesome, I'd love to hear Josh from JHS Pedals' opinion on this.

ratww 6 years ago | |

This is even more impressive since regular IRs can't duplicate the distortion effect itself, only the frequency response

munificent 6 years ago | | |

What is the difference between "distortion itself" and "only the frequency response"? Are you saying the phase response is important?

sdenton4 6 years ago |

It has been said that if we achieve the ability to fully simulate the universe from initial conditions, the first application will be creating a perfect recreation of Marvin Gaye's Roland 808 drum machine in a 1982 performance.

dharma1 6 years ago |

Here is the original paper from 2019 by Eero-Pekka Damskägg- https://research.aalto.fi/en/publications/realtime-modeling-...

It was also published as a realtime JUCE project, which might be more useful for actual (realtime VST/AU) use:

https://github.com/damskaggep/WaveNetVA

Alec Wright has done more work on this since then, using it for amplifiers:

https://www.aalto.fi/en/news/deep-learning-can-fool-listener...

And time variant effects:

https://github.com/Alec-Wright/NeuralTimeVaryFx

TrackerFF 6 years ago |

Isn't this essentially just the case of learning one function, with set parameters?

I.e, if you want to build a complete model of the tubescreamer, you'd essentially have to train a model for each possible setting on the pedal - or in other words, every combination of the knobs.

Sounds like a real chore, if you were to actually do that physically - and in the end, don't you just want to learn the impulse response of the circuit?

I know some tools - like the Kemper modelling gear, are made for that exact purpose, and with extremely convincing results.

Scene_Cast2 6 years ago | |

Not quite. As long as the knobs make consistent changes, just feed some large amount of tests and the model should generalize (smartly interpolate) the rest.
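One common way to set this up is to feed the knob positions to the model as extra input channels next to the audio. The sketch below only shows the data layout; the knob names, values, and samples are hypothetical, and nothing here is taken from the article's code.

```python
def condition_on_knobs(audio, knobs):
    """Pair every audio sample with the (constant) knob settings of its recording."""
    return [[sample, *knobs] for sample in audio]

# Hypothetical recordings keyed by (drive, tone), each normalised to 0..1.
recordings = {
    (0.2, 0.5): [0.01, 0.03, -0.02],
    (0.8, 0.5): [0.02, -0.01, 0.04],
}

training_inputs = []
for knobs, audio in recordings.items():
    training_inputs.extend(condition_on_knobs(audio, knobs))

print(training_inputs[0])
```

Whether a model trained on such a grid interpolates well between settings is an empirical question that depends on how densely and how evenly the knob space is sampled.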

What I do have a problem with is that if the pedal is already implemented digitally, then all the human interpretability, along with the classic DSP machinery, is thrown out the window. A better approach would be to build the pedal via a differentiable programming language and then try to gradient descent toward some analog "can't get this juicy tube sound digitally" variant.

ben7799 6 years ago | | |

The knobs actually don't behave linearly on a tube screamer. Even the "tone" knob (EQ) doesn't behave at all linearly like you might expect out of consumer audio gear. Tube Screamers have an S-curve potentiometer in use for that knob.

That would be part of the problem with this approach.

Also with this approach you pretty much have to train the model with a near infinite collection of guitars in front of the model and a near infinite number of other effects turned on and off in front of the model.

baylessj 6 years ago |

Excellent writeup, I love seeing real engineering applied to guitar pedals rather than black magic tone chasing.

I'd be really curious to see if the model could be expressed as a transfer function and compared to the schematic for the pedal. The Tubescreamer is a fairly simple circuit but the mystery surrounding it indicates that there are some weird variables at play with the component properties that would lead to additional factors in the transfer function. Wonder if those variables could be identified somehow.

hashkb 6 years ago | |

The "weird variables" may have to do with the various changes in manufacturing over the years. "Tube screamer" refers to at least 10 different units. Maxon, Ibanez, TS9, TS808, and zillions of clones.

willis936 6 years ago |

A neat approach for sure. I am more interested in SPICE style modeled VSTs though. There's no need to throw ML at a simple math problem to get a bad approximation. I have not found many VSTs that seem like they're doing proper simulation of analog circuits. The VST space is filled with people claiming awesome results, but never revealing the sauce. If you're making a convincing sounding zener limiter, what are you actually doing? There are a dozen different levels of approximations you could make. Shouldn't a VST that is really simulating the analog circuit advertise that? On paper it should be easy, right? I've sat down with pen and paper to try to write out a simple input/output equation for a zener limiter circuit and I decided it was probably more worth my time to just plop a zener SPICE model into some language that could evaluate expressions and compile to VST (or use a systems of equations solver).
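For a flavour of the "just solve the circuit equation" route, here is a minimal sketch of a static clipper: a series resistor feeding an antiparallel diode pair to ground, using the Shockley diode equation. This is a simplified stand-in and not a zener model; the component values are illustrative assumptions, and capacitors (and so any frequency dependence) are ignored.

```python
import math

def diode_clipper(vin, r=2200.0, i_s=2.52e-9, n=1.752, v_t=0.02585, iters=60):
    """Solve (vin - v)/r = 2*i_s*sinh(v/(n*v_t)) for the output voltage v by bisection."""
    def current_mismatch(v):
        # Resistor current minus diode-pair current; decreases monotonically in v.
        return (vin - v) / r - 2.0 * i_s * math.sinh(v / (n * v_t))

    # The solution always lies between 0 V and the input voltage.
    lo, hi = min(0.0, vin), max(0.0, vin)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if current_mismatch(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for v in (0.1, 0.5, 1.0, 5.0):
    print(v, "->", round(diode_clipper(v), 3))
```

Bisection is used here because the equation is monotonic and it cannot diverge; a real-time implementation would more likely use Newton iteration or a precomputed lookup table.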

And then there's the real holy grail of analog simulation: the tube amplifier. I'm not sure SPICE models really capture the limiting behavior of tubes very well. You might need to implement the spec sheet in code. All fun sounding problems, and I'm not sure anyone has even done them yet.

dsharlet 6 years ago | |

Funny you mention SPICE to VST compilation... It was on my list for this (my) side project but I never got around to it: http://livespice.org/

edit: And a Tubescreamer is one of the examples!

ben7799 6 years ago | |

Right..the Spice modeled version has a much better chance of catching the oddball behavior of guitar effects across the wide span of possible inputs.

fallingfrog 6 years ago |

I think that since most guitar effects are really very simple (gain and clipping, or delaying the signal and adding it back in), this approach will probably work just fine.

But, it is sort of using a sledgehammer where a tap from a spoon will do- the original tube screamer is just an op amp and a couple diodes, plus a bit of eq! Not much to it.

Plus, your real problems are going to be noise level (tube screamers in particular are noisy, but a discrete transistor distortion can be made very, very quiet), your A/D converter, your power requirements (comparable analog distortion effects use a few milliwatts), and cost.

Edit: But that said, this is a super cool project! Good job! Sorry I just realized that what I wrote was kind of negative.

exabrial 6 years ago |

Pretty cool! Is this how Kemper amplifiers work when they do a capture?

ratww 6 years ago | |

AFAIK Kemper performs multiple passes of impulse-response capture, all at multiple signal levels in order to model non-linearities (like distortion). This is called dynamic convolution. [1] [2]

There are other ways to do that, like Volterra Series, used by Nebula plugins [3]

[1] https://www.uaudio.com/webzine/2004/july/text/content2.html

[2] http://www.sintefex.com/docs/appnotes/dynaconv.PDF

[3] https://en.wikipedia.org/wiki/Volterra_series

ZoomZoomZoom 6 years ago |

For anyone planning to try this, don't forget about impedance matching and use a transformer/active reamper. Some pedals may react very differently.

mgamache 6 years ago |

It would be interesting to see how this responds to dynamics. For example, a favorite guitar sound is a fuzz cranked, but with the guitar volume turned down. This results in a compressed dirty sound that can overdrive into distortion if you hit the strings harder (attack).

ben7799 6 years ago |

I play guitar and own a tube amp & a tube screamer.

All of this sounds horrible.. it doesn't even sound like his input is an actual guitar, it sounds like he's using a synth guitar sound or something. There's no dynamics, almost no sustain, no articulations. The outputs barely even sound distinguishable as a guitar through a tube screamer, even his actual tube screamer samples. (Possibly cause his interface is terrible?)

The conclusion is ridiculous given how simplistic everything is.

You can't use two tiny little clips to justify your model being high quality.

The true test has to even allow a bunch of guitarists to move all the knobs, plug the model into different amp & guitar combinations, put other effects in front of and behind it, etc..

The Tube screamer is called a Tube screamer because its intended use case is to make the tubes in a tube amp "scream". Using it with all the knobs at noon is not consistent with this. It usually gets used with a tube amp that is already on the verge of distortion, and then you use the TS with the volume turned up a lot (3/4-max) and the gain quite low. This might be part of why this sounds so bad to me.

There are actually two different trains of thought on guitar effect modeling:

- Model it based on input & output waveforms like he's doing

- Actually model the circuit as an electrical simulation and then pass the signal through that.

I have personally found the second approach to be way more realistic and satisfying. The Yamaha THR amps work this way and they're really amazing.

One of the tricks here is a listener might not be able to tell a difference, but the guitar player picks up on a perceived change in how the guitar feels with these effects. A tube screamer has a lot of compression built into it for example. It causes everything you play to sound a little dirtier for the same amount of picking energy you put into the guitar. It will cause the player to play a little more lightly than they would without the effect. This is the kind of thing that makes a player reject the model and want to stick with the real thing, whereas the guy in the naive lab building the model thinks it's great cause they're not even playing an actual guitar through it. Once a skilled player tries it the "feel" is a dead giveaway which is which.

It's easy for some of this stuff to get lost on the electronics crowd if the background is electronic music. An actual acoustic piano is the only keyboard based instrument that has anywhere near the nuance that a guitar has, and a guitar still has way more weird stuff going on with dynamics and articulation. The range of inputs you have to feed into any kind of computer model to simulate guitar well is huge.

EamonnMR 6 years ago |

Add the ability to train on arbitrary effects as inputs and this will be a best-selling VST for whoever can make it first.

saadalem 6 years ago |

This is actually impressive. I'm wondering if we could transform a smartphone mic recording into a high-quality one with AI.

veenkar 6 years ago |

B-but a simple convolution would do the same. Or for faster operation - a transfer function obtained using least squares method. NN is kinda overkill for this, but it's cool POC anyways ;)

ssalazar 6 years ago | |

Nope- Tubescreamer is non-linear so a simple transfer function won't do it.
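A quick way to see this: a linear system must scale its output when you scale its input, and a clipper does not. The hard clipper below is a crude stand-in for the pedal's clipping stage, chosen only for illustration.

```python
def hard_clip(x, limit=0.5):
    """Crude stand-in for a clipping stage: limit the signal to +/- limit."""
    return max(-limit, min(limit, x))

def linear_gain(x):
    """A genuinely linear system for comparison."""
    return 0.7 * x

def is_homogeneous(system, x, gain, tol=1e-9):
    """A linear system must satisfy system(gain * x) == gain * system(x)."""
    return abs(system(gain * x) - gain * system(x)) < tol

print(is_homogeneous(linear_gain, 0.4, 2.0))  # True
print(is_homogeneous(hard_clip, 0.4, 2.0))    # False
```

Because the clipper's response depends on input level, a single impulse response or transfer function measured at one level cannot describe it at other levels.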

hashkb 6 years ago |

Trey Anastasio of Phish famously uses 2 stacked tube screamers. (And so do many of us phans). He deserves to be mentioned because more notes have hit audience ears through his screamers than anyone else's.

Also, the modern TS9 isn't exactly right. I'd love to see this work applied to vintage vs current TS vs modded units.