Smoothed amplitude envelope
The smoothed amplitude envelope is an acoustic measure that is correlated with for example jaw motion of speech (1. Chandresekaran et al., 2009) and contains gross information about the amplitude of the most dominant frequency of the sound. The smoothed amplitude envelope contains key information about the quasi-rhythmic structure of speech (2. Poeppel & Assaneo, 2020), as well as complex prosodic modulations that occur at multiple time scales (e.g., 2. Tilsen & Arviniti, 2013).
Lets move to the code. We first define our folders, list the waveform (.wav) files that are contained in them, and also define some output folders where to save our smoothed amplitude envelope timeseries to.
Main function
So there is more than one way to compute a smoothed amplitude envelope. We however follow the procedure originally coded in PRAAT by He & Dellwo (2017; ref 4.) which we reimplemented in R. The main function below follow their procedure. Firstly, we read in a sound file, and then apply a Hilbert transform to the waveform signal, which yields what is called a complex analytic signal (containing real and imaginary numbers). From this we can extract the amplitude information by applying what is called the complex modulus, and we are left with a 1D signal that will track the contours of that audio waveform . Following He & Dellwo, since we are interested in an approximation of the syllable cycles that tend to occur no shorter than 200ms, we will smooth the signal at a max 5Hz (1000ms/5Hz = 200ms) using a Hanning Window filter. We then downsample the smoothed amplitude envelope to something more workable, namely, to 100Hz time series, which we can later align for example with other signals (e.g., motion tracking data). For a more information about the key procedure, the Hilbert transformation, see (5. Cohen, 20019)
Note, that the function below, allows you to set the resampling rate differently (e.g., 200Hz instead of 100Hz), as well as the hanning window settings (e.g., 8Hz instead of 5Hz).
library(seewave)
library(signal)
library(rPraat)
library(dplR)
#####################MAIN FUNCTION TO EXTRACT SMOOTHED ENVELOPE###############################
amplitude_envelope.extract <- function(locationsound, smoothingHz, resampledHz)
{
#read the sound file into R
snd <- rPraat::snd.read(locationsound)
#apply the hilbert on the signal
hilb <- seewave::hilbert(snd$sig, f = snd$fs, fftw =FALSE)
#apply complex modulus
env <- as.vector(abs(hilb))
#smooth with a hanning window
env_smoothed <- dplR::hanning(x= env, n = snd$fs/smoothingHz)
#set undeterminable at beginning and end NA's to 0
env_smoothed[is.na(env_smoothed)] <- 0
#resample settings at desired sampling rate
f <- approxfun(1:(snd$duration*snd$fs),env_smoothed)
#resample apply
downsampled <- f(seq(from=0,to=snd$duration*snd$fs,by=snd$fs/resampledHz))
#let function return the downsampled smoothed amplitude envelope
return(downsampled[!is.na(downsampled)])
}
Now that we have initialized our main function. We can apply it to our list of audio file locations that we already have. We can further add some time information based on what we decided to be our sampling rate. We then bind the envelope signal with the time information and save it to an output folder.
Loop through audio files and execute function
########################APPLY MAIN FUNCTION ON THE SOUNDFILES#################################
#loop through soundfile locations
for(wav in list_wavs)
{
#do not run this when these files are already generated
if(!file.exists(paste0(outputfolder, substr(wav, 1, nchar(wav)-4), "_ENV", ".csv")))
{
#location of the current sound file in the loop
locsound <- paste0(data_to_process, wav)
#get the amplitude envelope at location, 5Hz Hanning, 100Hz sampling
env <- amplitude_envelope.extract(locsound, 5, 100)
#make a time vector based on sampling rate (1000/Hz)
time_ms <- seq(1000/100, length(env)*(1000/100), by = 1000/100)
#bind into data frame
ENV <- cbind.data.frame(time_ms, env)
#save it to a folder
write.csv(ENV, file = paste0(paste0(outputfolder, substr(wav, 1, nchar(wav)-4), "_ENV", ".csv")),row.names=FALSE)
}
}
A simple application (approximate syllables and syllable rhythm)
With the smoothed amplitude envelope we can identify peaks in the acoustics that are roughly corresponding with syllable production. We can for example do a simple peak analysis to get an automatic approximation of syllable rate (i.e., speech rate) or rhythmicity (standard deviation of the syllable interval). This is just an example, and for more complex analysis see for example (3. Tilsen & Arviniti, 2013).
library(pracma) #package that has a peakfinding function
library(ggplot2) #plotting
library(gridExtra) #some plotting extras
#lets read in an amplitude envelope time series
env_ts <- read.csv(paste0(outputfolder, "audio_cartoon_ENV.csv"))
#identify peaks: #this will give you the height of the peak,
#the index, the index of the left through [,3] and the right through [,4]
peaks <- pracma::findpeaks(env_ts$env, minpeakdistance = 10,
minpeakheight = mean(env_ts$env)-(1*sd(env_ts$env)))
#initialize a peak variable
env_ts$peaks <- NA
#at each location of the timeseries where there is a peak, load in the value of that peak so we can plot it later
env_ts$peaks[peaks[,2]] <- peaks[,1]
#lets plot a sample of 5 seconds
a <- ggplot(env_ts, aes(x=time_ms, y = env) ) + geom_path() +
geom_point(aes(y=peaks), color = "red", size = 3) + xlim(5000, 10000) + theme_bw()
#lets also plot the original sound file to see how this compares
snd <- rPraat::snd.read(paste0(data_to_process, "audio_cartoon.wav"))
waveformdat <- cbind.data.frame(snd$t, snd$sig)
colnames(waveformdat) <- c("time_sec", "signal")
b <- ggplot(waveformdat, aes(x=time_sec, y = signal))+
geom_path()+ xlim(5, 10) + theme_bw() + ylim(-0.15, 0.15)
grid.arrange(a,b)
We can now extract information about the syllable rate per seconds, as well as the interpeak interval deviation as some measure of rhythmicity, by the following simple queries.
syllable_p_sec <- sum(!is.na(env_ts$peaks))/(max(env_ts$time_ms)/diff(range(env_ts$time_ms)))
print(paste0("N syllables produced p/s: ", syllable_p_sec))
## [1] "N syllables produced p/s: 377.975823472977"
print(paste0("SD in ms syllable interval: ", sd(diff(env_ts$time_ms[!is.na(env_ts$peaks)]))))
## [1] "SD in ms syllable interval: 333.480422379132"