Aphex Twin did something similar, but this is more playful in my opinion.
He talks about that and plenty of other cool stuff in his talk at GDC 2017. It's one of my favorite conference talks ever: he did so much cool experimentation to get the sounds he used on the soundtrack, and watching it is one of those moments where you really get to see a master of his craft let loose and explain his process.
Warning - this music freaked my dog out!
Unfortunately it's in MATLAB, so I cannot run it any more.
[1] https://jo-m.ch/posts/2015/01/hack-the-spectrum-hide-images-...
https://www.reddit.com/r/Damnthatsinteresting/comments/kvjil...
1. https://arstechnica.com/tech-policy/2015/11/beware-of-ads-th...
I usually use Audacity to inspect the spectrogram of FLAC files and see whether they really are lossless 44100 Hz audio or whether someone packaged a constant-bitrate 320 kbps MP3 encode into a FLAC file.
Now I can just check it in my browser :D
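The same check can be roughed out offline. This is a minimal sketch, assuming the audio has already been decoded into a NumPy array (FLAC decoding is left out). `estimate_cutoff_hz`, `lowpass` and the -60 dB threshold are made up for illustration and are not taken from any tool mentioned here. Lossy encoders often low-pass the signal, so a 44.1 kHz file whose content stops well short of 22 kHz is suspicious.

```python
import numpy as np

def estimate_cutoff_hz(samples, sample_rate, frame=4096, threshold_db=-60.0):
    """Highest frequency whose average level is within threshold_db of the peak."""
    window = np.hanning(frame)
    n_frames = len(samples) // frame
    frames = samples[:n_frames * frame].reshape(n_frames, frame) * window
    # Average the power spectrum over all frames.
    power = np.mean(np.abs(np.fft.rfft(frames, axis=1)) ** 2, axis=0)
    level_db = 10.0 * np.log10(power / power.max() + 1e-20)
    freqs = np.fft.rfftfreq(frame, 1.0 / sample_rate)
    return freqs[np.nonzero(level_db > threshold_db)[0][-1]]

def lowpass(samples, sample_rate, cutoff_hz):
    """Crude brick-wall low-pass via the FFT, only used to fake a lossy-style cutoff."""
    spectrum = np.fft.rfft(samples)
    freqs = np.fft.rfftfreq(len(samples), 1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(samples))

sr = 44100
noise = np.random.default_rng(0).standard_normal(sr * 2)  # stand-in for full-band audio
full_band_cutoff = estimate_cutoff_hz(noise, sr)
filtered_cutoff = estimate_cutoff_hz(lowpass(noise, sr, 16000.0), sr)
```

A cutoff well below the Nyquist frequency is only a hint, not proof: some genuine lossless masters are band-limited too.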
One place I used these was on a toy AI assistant. I recorded myself saying a trigger word thousands of times, cut the audio into pieces and converted each to a spectrogram image. I then fed those to a training model to help it recognize the trigger word.
Before the spectrogram, I was feeding the WAV file directly, which was incredibly intensive on my laptop. The image files were easier to process in real time. This tool can be used for debugging.
I like the interesting ability to play a "rectangular" (time + frequency limited) section of the audio.
Looks very interesting though.
Nice work.
Can we hire you to help us improve the (broken) spectral visualizations on our app?
Example: https://fakeyou.com/tts/result/TR:9jy3vew9w0s3ew4keay9m330rd...
I would so love to hire you to help us. This is freaking cool.
Even if you're not interested, mad props. I really love this.
Of course, don't forget the window function (Hann, or raised cosine), but it looks like you've got that covered because your spectrogram looks smooth.
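To see why the window matters, here is a small sketch comparing spectral leakage with and without a Hann window. The helper `leakage_db` and the choice of bins are assumptions made for this illustration. The tone deliberately does not fit a whole number of cycles into the frame, which is the usual case for real audio.

```python
import numpy as np

def leakage_db(window, n=1024, cycles=100.5):
    """Worst leakage far from the tone, in dB relative to the spectral peak."""
    k = np.arange(n)
    tone = np.sin(2.0 * np.pi * cycles * k / n)  # 100.5 cycles: does not fit the frame
    mag = np.abs(np.fft.rfft(tone * window))
    far_bins = mag[250:400]                      # well away from the peak near bin 100
    return 20.0 * np.log10(far_bins.max() / mag.max())

n = 1024
rect_db = leakage_db(np.ones(n), n)     # no window (rectangular)
hann_db = leakage_db(np.hanning(n), n)  # Hann window
```

With the Hann window the energy smeared into distant bins drops by many tens of dB, which is what makes a spectrogram look smooth rather than streaky.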
The color palette looks good in your case. FWIW, my color function is like this: pow(fft_amp, 1.5) * rgb(9, 3, 1). The pow() part brightens the low/quiet amplitudes, and the (9,3,1) multiplier displays a 10x wider amp range by mapping it to a visually long black->orange->yellow->white range of colors. Note that I don't do log10 mapping of the amplitudes.
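A sketch of that color function in NumPy, under two assumptions the comment does not spell out: amplitudes are normalized to [0, 1], and each channel is clipped at 1.0 after the multiply. The name `amp_to_rgb` is made up here.

```python
import numpy as np

def amp_to_rgb(fft_amp):
    """Map amplitudes to RGB in [0, 1] via pow(amp, 1.5) * (9, 3, 1)."""
    # Assumption: amplitudes are already normalized to [0, 1].
    amp = np.clip(np.asarray(fft_amp, dtype=float), 0.0, 1.0)
    scaled = np.multiply.outer(np.power(amp, 1.5), np.array([9.0, 3.0, 1.0]))
    # Assumption: channels saturate at 1.0. Red saturates first, then green,
    # then blue, which gives the black -> orange -> yellow -> white progression.
    return np.clip(scaled, 0.0, 1.0)

ramp = amp_to_rgb(np.linspace(0.0, 1.0, 5))  # shape (5, 3), one RGB row per amplitude
```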
- Allow playback via Space button. Show a play marker to let the user know where in the sample they are, even without having selected a part.
- Choose a sample that is easier on the ears than high-pitched bird song. I was really shocked when the first loud part came.
Is there any way to make this display in real time, or is that not (currently?) possible with audio APIs?
https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_A...
If you're referring to generating spectrograms with Fourier transforms, you will need some math background to properly do the calculation by hand. It largely just boils down to "find the amount of each frequency over time"
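As a rough illustration of "the amount of each frequency over time", the sketch below measures a frequency by correlating a frame with a complex sinusoid, which is what one term of the discrete Fourier transform does. The function names and the two-tone test signal are invented for this example. A real spectrogram would use an FFT and a window instead of probing frequencies one at a time.

```python
import numpy as np

def amount_of_frequency(frame, sample_rate, freq_hz):
    """Amplitude of freq_hz in one frame, by correlating with a complex sinusoid."""
    t = np.arange(len(frame)) / sample_rate
    probe = np.exp(-2j * np.pi * freq_hz * t)
    return 2.0 * np.abs(np.sum(frame * probe)) / len(frame)

def amounts_over_time(samples, sample_rate, freqs_hz, frame_len):
    """One row per frame, one column per probed frequency."""
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    return np.array([[amount_of_frequency(samples[s:s + frame_len], sample_rate, f)
                      for f in freqs_hz] for s in starts])

sr = 8000
t = np.arange(sr) / sr  # one second of audio
# 440 Hz for the first half second, 1000 Hz for the second half.
signal = np.where(t < 0.5, np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 1000 * t))
table = amounts_over_time(signal, sr, [440.0, 1000.0], frame_len=800)
```

In `table`, the 440 Hz column is close to 1 for the early frames and close to 0 for the late ones, and the 1000 Hz column is the other way round.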
Last question, if this is the premise of your work, shouldn't you know about it already?
o The tall vertical lines reflect "plosives" - sudden releases of sound energy, often at the beginning of words, from having the mouth/airway closed then open, as in the first letter of "put" or "tea"
o The high frequencies come from "fricatives" like the first letter of "see" or "free" where air is being passed through the teeth or almost closed lips
o The lower frequencies are where most of the recognizable speech content is, corresponding to the way the resonant frequencies of the mouth and throat are being changed (articulation) by moving the tongue, lips and teeth. Specifically the speech content is in changes to the "formants" which are the changing resonant frequencies showing up as bright mostly horizontal bands in the lower frequencies
Noise may show up in various ways depending on what the noise source is. A fixed frequency spectrum background hum is going to show up as one or more horizontal frequency bands across the entire spectrogram. High frequency noise is going to show up as much more energy in the higher frequencies, which don't have a lot of energy for clean speech (fricatives only).
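A fixed hum showing up as a horizontal band can also be picked out numerically. The sketch below takes the median over time of each frequency bin, so only energy that is present in most frames survives. The synthetic signal (a quiet 60 Hz hum plus loud 700 Hz bursts standing in for speech) and the helper names are assumptions for illustration only.

```python
import numpy as np

def spectrogram(x, frame=1024, hop=256):
    """Magnitude spectrogram, shape (time, frequency)."""
    window = np.hanning(frame)
    starts = range(0, len(x) - frame + 1, hop)
    frames = np.stack([x[s:s + frame] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1))

def steady_band_hz(x, sr, frame=1024, hop=256):
    """Frequency of the bin with the highest median level over time."""
    steady = np.median(spectrogram(x, frame, hop), axis=0)
    return np.fft.rfftfreq(frame, 1.0 / sr)[np.argmax(steady)]

sr = 8000
t = np.arange(4 * sr) / sr
hum = 0.1 * np.sin(2 * np.pi * 60 * t)                     # quiet but always present
bursts = np.sin(2 * np.pi * 700 * t) * ((t % 1.0) < 0.2)   # loud but intermittent
x = hum + bursts

hum_hz = steady_band_hz(x, sr)
# For comparison: the bin that is loudest at any single moment.
loudest_hz = np.fft.rfftfreq(1024, 1.0 / sr)[np.argmax(spectrogram(x).max(axis=0))]
```

The loudest bin belongs to the bursts, while the median picks the hum, matching the idea that hum is a band running across the entire spectrogram.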
1. STFT (get frequencies from the audio signal)
2. Log scale / decibel scale (since we hear on a log scale)
3. Optionally convert to the Mel scale (filters to how humans hear)
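The three steps above can be sketched with plain NumPy as follows. This is not the commenter's code: the function names, frame sizes and the 2595 * log10(1 + f/700) mel formula are choices made for this example, and the sketch applies the mel filters to the power spectrum before taking decibels, which is one common ordering.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_edges_hz(n_mels, sr):
    """Band edges in Hz, evenly spaced on the mel scale from 0 to Nyquist."""
    return mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))

def mel_filterbank(n_mels, frame, sr):
    """Triangular filters, shape (n_mels, frame // 2 + 1)."""
    freqs = np.fft.rfftfreq(frame, 1.0 / sr)
    edges = mel_edges_hz(n_mels, sr)
    bank = np.zeros((n_mels, freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_mel_spectrogram(x, sr, frame=512, hop=128, n_mels=40):
    # 1. STFT: windowed frames -> power spectrum per frame.
    window = np.hanning(frame)
    starts = range(0, len(x) - frame + 1, hop)
    frames = np.stack([x[s:s + frame] * window for s in starts])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 3. Mel scale: pool FFT bins into perceptually spaced bands.
    mel_power = power @ mel_filterbank(n_mels, frame, sr).T
    # 2. Decibels: log-compress (small offset avoids log of zero).
    return 10.0 * np.log10(mel_power + 1e-10)

sr = 16000
t = np.arange(sr) / sr
db = log_mel_spectrogram(np.sin(2 * np.pi * 1000 * t), sr)  # shape (time, n_mels)
```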
Happy to answer any questions
Here's a few (very old) plots of how our radars see the world through a spectrogram: https://weibelradars.com/space/space-industry/
What would be cool, would be a browser-based way to do soft analysis of these plots.
edit: tyvm, nice idea! would very much like to try it
For example:
- Sine sweeps (a sine wave that starts at a low frequency and sweeps up to a high one) - to learn to associate the frequencies you hear with the Y-axis
- Sine pulses at various frequencies - to better understand the time axis
- Different types of noise (e.g. white)
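The test signals listed above are easy to generate yourself. Here is a minimal sketch; the function names, durations and frequencies are arbitrary choices for illustration, and `write_wav` uses the standard-library `wave` module to save 16-bit mono audio that a spectrogram tool can load.

```python
import wave
import numpy as np

def sine_sweep(f_start, f_end, seconds, sr):
    """Linear sweep: instantaneous frequency rises steadily from f_start to f_end."""
    t = np.arange(int(seconds * sr)) / sr
    phase = 2 * np.pi * (f_start * t + (f_end - f_start) * t ** 2 / (2 * seconds))
    return np.sin(phase)

def sine_pulses(freqs_hz, pulse_seconds, gap_seconds, sr):
    """Short tone bursts separated by silence, one burst per frequency."""
    t = np.arange(int(pulse_seconds * sr)) / sr
    gap = np.zeros(int(gap_seconds * sr))
    parts = []
    for f in freqs_hz:
        parts += [np.sin(2 * np.pi * f * t), gap]
    return np.concatenate(parts)

def white_noise(seconds, sr, seed=0):
    """Uniform white noise in [-1, 1]."""
    return np.random.default_rng(seed).uniform(-1.0, 1.0, int(seconds * sr))

def write_wav(path, samples, sr):
    """Save mono float samples in [-1, 1] as 16-bit PCM."""
    pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype("<i2")
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(sr)
        f.writeframes(pcm.tobytes())

sr = 16000
sweep = sine_sweep(100.0, 4000.0, 2.0, sr)
pulses = sine_pulses([440.0, 1000.0, 3000.0], 0.1, 0.1, sr)
noise = white_noise(1.0, sr)
```

Calling something like `write_wav("sweep.wav", sweep, sr)` gives a file where the sweep should appear as a rising diagonal line, the pulses as short horizontal dashes and the noise as an even haze.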
Perhaps move on to your own voice as well, and try different scales (log or mel spectrograms, which are commonly used).
With this, I think you can develop a familiarity quickly!
Note that speech, like any audio source, consists of multiple frequencies: a fundamental frequency and its harmonics.
Background noise can be identified as distinct frequency bands that are not part of the vocal range of human speech. E.g. if you see lots of bright lines below or above the human vocal range then there's lots of background noise. Lower frequencies especially can have a big impact on the perceived clarity of a recording, whereas high frequencies come off as being more annoying.
Noise within the frequency range of human speech is harder to spot and you should always use your ears to decide whether it's noise or not.
You can also use a spectrogram to check for plosives and sibilants (e.g. "k", "t" and "s" sounds) as they also can make a recording sound bad/harsh.
Personally I hypothesize that the reason it’s so hard is that the sources are intermixed and share frequencies, so isolating certain frequencies doesn’t isolate a speaker. We’d need something like beamforming to know how much amplitude of each frequency to extract. I’d also hypothesize that humans, while able to focus on a directional source, cannot “extract” a clean signal either (imagine someone talking while a pan crashes on the floor: it completely drowns out what the person said).
When we recognize speech it is almost as if we're hearing the way the speaker is articulating words, since what we're recognizing is the changing resonant frequencies ("formants") of the vocal tract corresponding to articulation, as well as other articulation clues such as the sudden energy onset of plosives or high frequencies of fricatives (see my other post in this topic for a bit more info).
High quality (that is, highly intelligible) speech synthesis has been available for a long time based on this understanding of speech production/recognition. One of the earliest speech synthesizers was DECtalk (from Digital Equipment), introduced in 1984 - a formant-based synthesizer based on the work of linguist Dennis Klatt.
The fact that most of the information in speech comes from the formants can be demonstrated by generating synthetic formant-only speech consisting of just sine waves at the changing formant frequencies. It doesn't sound at all natural, but is nonetheless very easy to recognize.
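For anyone who wants to hear the effect, here is a minimal sketch of the sine-waves-at-formant-frequencies idea. The formant tracks are hypothetical straight-line glides with illustrative values, not measurements of real speech, and `sinewave_speech` is a name invented for this example. Real demonstrations take the tracks from formant analysis of a recording.

```python
import numpy as np

def sinewave_speech(formant_tracks_hz, amplitudes, sr):
    """Sum one sine per formant track; each track holds a frequency in Hz per sample."""
    out = np.zeros(formant_tracks_hz.shape[1])
    for track, amp in zip(formant_tracks_hz, amplitudes):
        phase = 2 * np.pi * np.cumsum(track) / sr  # integrate frequency to get phase
        out += amp * np.sin(phase)
    return out / np.sum(amplitudes)                # keep the result within [-1, 1]

sr = 16000
n = sr // 2  # half a second
# Hypothetical three-formant glide; the values are illustrative only.
tracks = np.stack([
    np.linspace(700.0, 300.0, n),
    np.linspace(1200.0, 2300.0, n),
    np.linspace(2500.0, 3000.0, n),
])
audio = sinewave_speech(tracks, amplitudes=[1.0, 0.5, 0.25], sr=sr)
```

On a spectrogram this shows up as three thin moving lines with no harmonics or noise around them.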
The starting point for human speech recognition is similar to a spectrogram - it's a frequency analysis (cf FFT) done by the ear via the varying length hairs in the inner ear vibrating according to the frequencies present, therefore picking up the dominant formant frequencies.
If you know of any implementations that can look at a spectrogram and say “hey there’s peaks at 150 Hz, 220 Hz and 300 Hz with standard deviations of 5 Hz, 7 Hz and 10 Hz, decreasing in frequency over time thus this is a deep voice saying ‘ay’” and get it right every time I’d be really interested in seeing it (besides neural networks).
Some sources of noise like the constant background hum (e.g. computer fan) are easy to spot though.