Gesture Networks Module 1: Computing Kinematic Distances Using Dynamic Time Warping

Wim Pouw & James Trujillo

2022-01-03



Background

Human (communicative) body movements are often recurrent: similar shapes or trajectories are used to convey or act on something consistently. How can we make sense of the variability and consistency with which movement events are produced? One way is to quantify the similarities between all the produced movement events. Once we have documented how similar each movement event is relative to all other events, we can visualize and quantify the structure of these interrelationships and discover novel patterns that hold over the entire set of behaviors. For example, in previous work we found that when communicative silent gestures have been practiced and communicated more, those silent gestures start to change their interrelationships in consistent ways (e.g., the kinematic space reduces in complexity) (see [3] Pouw, Dingemanse, Motamedi, & Ozyurek, 2021).

Note that the particular procedure outlined here is introduced in [1] Pouw & Dixon (2020), which we advise you to read before starting (or [3]). The main procedure: we compute some distance measure between two N-dimensional time series i and j, and do this for all pairs of the n events. As in previous work [1] we use dynamic time warping (see [2] Giorgino, 2009) as the distance calculation. We then obtain what is called a distance matrix (M), which has n rows and n columns, where each cell M[i,j] contains the distance score between movement i and movement j. We can then use dimension reduction techniques (e.g., classic multidimensional scaling, t-SNE, UMAP, PCA) to visualize the interrelationships between all movement events in a 2-dimensional space. A minimal toy sketch of this pipeline is given below.
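
To make the general idea concrete before turning to the real data, here is a minimal sketch on hypothetical toy data (five random one-dimensional trajectories of unequal length; in the actual analysis below the events are multivariate motion-tracking traces): we fill a distance matrix M with pairwise normalized DTW distances and then project it into 2D with classic multidimensional scaling.

library(dtw) #dynamic time warping (Giorgino, 2009)

set.seed(42)
events <- lapply(1:5, function(i) cumsum(rnorm(40 + 5*i))) #five toy 1D trajectories of unequal length

n <- length(events)
M <- matrix(0, nrow = n, ncol = n)
for (i in 1:(n-1)) {
  for (j in (i+1):n) {
    #normalized DTW distance between event i and event j (matrix is symmetric, diagonal stays zero)
    M[i, j] <- M[j, i] <- dtw(events[[i]], events[[j]])$normalizedDistance
  }
}

coords2d <- cmdscale(M, k = 2) #classic MDS: 2D coordinates of the interrelationships
plot(coords2d, xlab = "D1", ylab = "D2")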

For the current movement events we use a set of silent gestures produced by an actor conveying certain concepts (open dataset by [4] Ortega & Ozyurek, 2020). We extracted the movement information for each silent gesture using MediaPipe 3D body tracking, and this is the input to our procedure. Note that the current procedure can be applied to any signal (e.g., acoustic trajectories) and also to movement events produced by multiple persons.

Overview processing steps

This markdown file contains the following processing steps.

  • Create a distance matrix with dynamic time warping distances between events

  • Visualize the distance matrix in a 2D space

  • An example application: do ratings of the iconicity of a particular gesture depend on that gesture's interrelationships with the other gestures?

The folder ‘Time_series’ contains MediaPipe body tracking time series for each video. From these we extract nose, elbow, and index finger (x, y, z) information (for both hands). Figure 1 shows an example of the kinematic time series for the silent gesture “TO-SWIM”, which is also shown below. The swimming gesture has a characteristic rhythmic movement, as reflected in the oscillating movements in the horizontal dimension for the left and right index finger.

Figure 1. Example raw time series of the TO-SWIM gesture for the right hand (in red) and the left hand (in blue)

library(ggplot2) #plotting
library(plotly)  #for interactive plots via ggplotly()

#timeserfolder: path to the 'Time_series' folder with the MediaPipe time series
MT <- read.csv(file.path(timeserfolder, "TO-SWIM_silentgesture.csv"))

a <- ggplot(MT, aes(x = time)) + 
                  geom_path(aes(y = X_RIGHT_INDEX), color = "red") + 
                  geom_path(aes(y = X_LEFT_INDEX), color = "blue") + 
                  theme_bw()+ylab("horizontal\nposition normalized")
ggplotly(a)

[Animated GIF: the TO-SWIM gesture]

Main routine: constructing a distance matrix

The next step is to compute an average dynamic time warping distance between joint position traces. For each keypoint, the x, y, z time series of gesture event i and gesture event j are submitted together to a single DTW comparison (a dependent DTW); we perform this procedure for each keypoint separately and then average the resulting distances (averaging by-keypoint DTW comparisons amounts to an independent DTW across keypoints). So we are combining a dependent and an independent DTW version, as illustrated in the toy sketch below.
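
To illustrate the distinction on made-up toy data (a sketch, not part of the actual pipeline): a dependent DTW warps the x, y, z channels of one keypoint together along a single shared path, whereas an independent DTW warps each channel (or, in our routine, each keypoint) separately and averages the resulting distances.

library(dtw)

set.seed(1)
#two toy 'keypoint traces': 50 and 60 samples of x, y, z positions
gestureA <- matrix(rnorm(50*3), ncol = 3)
gestureB <- matrix(rnorm(60*3), ncol = 3)

#dependent DTW: the three channels share one warping path (multivariate input)
dependentDTW <- dtw(gestureA, gestureB)$normalizedDistance

#independent DTW: warp each channel separately, then average the distances
independentDTW <- mean(sapply(1:3, function(d) dtw(gestureA[, d], gestureB[, d])$normalizedDistance))

dependentDTW; independentDTW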

Script procedure 2. This loop makes a distance matrix

#packages needed for the main routine
library(dtw)        #dynamic time warping (Giorgino, 2009)
library(parallel)   #detectCores(), makeCluster(), stopCluster()
library(doParallel) #registerDoParallel()
library(foreach)    #foreach loops
library(iterators)  #icount()

timeseriess <- list.files(timeserfolder)

#########################FUNCTIONS

#identify which columns contain the relevant variables
keypointsindex_right <- which(colnames(MT) %in% c("X_RIGHT_INDEX", "Y_RIGHT_INDEX", "Z_RIGHT_INDEX"))
keypointsindex_left <- which(colnames(MT) %in% c("X_LEFT_INDEX", "Y_LEFT_INDEX", "Z_LEFT_INDEX"))
keypointselbow_right <- which(colnames(MT) %in% c("X_RIGHT_ELBOW", "Y_RIGHT_ELBOW", "Z_RIGHT_ELBOW"))
keypointselbow_left <- which(colnames(MT) %in% c("X_LEFT_ELBOW", "Y_LEFT_ELBOW", "Z_LEFT_ELBOW"))
keypointsnose <- which(colnames(MT) %in% c("X_NOSE", "Y_NOSE", "Z_NOSE"))

#Averaged DTW over keypoints
DTW.compare <- function(TS1, TS2)
{
  #perform the dependent (x,y,z) dynamic time warping for each keypoint separately (independently)
  #and average the normalized distances over keypoints (the distances must be wrapped in c(),
  #otherwise mean() would only return its first argument)
  distancedtw <- mean(c(dtw(TS1[keypointsindex_right], TS2[keypointsindex_right])$normalizedDistance,
                        dtw(TS1[keypointsindex_left], TS2[keypointsindex_left])$normalizedDistance,
                        dtw(TS1[keypointselbow_right], TS2[keypointselbow_right])$normalizedDistance,
                        dtw(TS1[keypointselbow_left], TS2[keypointselbow_left])$normalizedDistance,
                        dtw(TS1[keypointsnose], TS2[keypointsnose])$normalizedDistance))
  return(distancedtw)
}


#########################MAIN ROUTINE
  #initialize some variables
  #ncores: number of available cores for parallel processing--ideally one less than the
  #number of available physical cores (one core for other applications). 
  #MacOS, SunOS and Windows provide this number reliably via detectCores(logical=FALSE)
  #Other systems: we assume 2 virtual cores per physical core to avoid resource-hogging, crashes, etc.
  #However, users familiar with their processors should adjust ncores as they see fit
  ncores <- ifelse(Sys.info()["sysname"] %in% c("Darwin","SunOS","Windows"), detectCores(logical=FALSE)-1, detectCores()/2-1)
  numges <- length(timeseriess) #how many gestures (should be 109)

  catchnames <- substr(timeseriess,1,nchar(timeseriess)-4) #get vector of gesture IDs
  #initially, we neglect the first row (so distmat has numges-1 rows); that row is by definition 
  #a single zero followed by numges-1 NAs, and can be added more efficiently later
  distmat <- matrix(nrow=numges-1, ncol=numges) #108x109 matrix initially
  
  #load all time series into data frames for quick access
  for(tsi in 1:length(timeseriess)) {
    assign(catchnames[tsi], read.csv(file.path(timeserfolder, timeseriess[tsi])))
  }
  
  if (ncores > 1) {              #if at least two dedicated cores are available,  
    cl <- makeCluster(ncores)    #set up a local cluster for parallel processing
    registerDoParallel(cl)       #register that cluster as the parallel backend
    #first just calculate lower triangular matrix (can be reflected across diagonal later, since matrix is symmetric)
    #create a parallel process for each row. On Windows, packages/data called by these procs must be named in the .packages/.export arguments
    distmat <- foreach(TS1index = icount(numges-1), .combine = "rbind", .packages=c("foreach", "iterators", "dtw"), .export = catchnames) %dopar% {
      #each parallel process produces a distance vector
      distvec <- foreach(TS2index = icount(TS1index), .combine = "cbind") %do% {
        #compute DTW. We start with "TS1index+1" because we are neglecting the first row (see above)
        TS1 <- get(catchnames[TS1index+1])
        TS2 <- get(catchnames[TS2index])
        dtwoutput <- DTW.compare(TS1, TS2)
      }
      #ensure correct dimensions by filling any remaining empty cells in distvec with NAs
      distvec <- cbind(distvec, t(rep(NA, numges-length(distvec)))) 
    }
    stopCluster(cl)
  } else {           #serial version: used if only one dedicated or shared core is available
    #loop through all combinations and save DTW score
    #first loop through rows
    #initially, we neglect the first row, so we start with "timeseriess[2]" (see above)
    for(ts1 in timeseriess[2:length(timeseriess)]) #loop through i 
    {
      TS1index <- which(timeseriess == ts1) #index of this gesture
      TS1 <- get(catchnames[TS1index]) #get first time series
      #loop through columns. End index is "TS1index-1" because DTW.compare(TS1index,TS1index) is by definition zero
      #so we can neglect the final column in every loop (the diagonal) and add the zeros later
      for(ts2 in timeseriess[1:(TS1index-1)]) 
      {
        TS2index <- which(timeseriess == ts2) #index of this gesture
        TS2 <- get(catchnames[TS2index]) #get second time series
        #compute DTW and save into matrix
        dtwoutput <- DTW.compare(TS1, TS2)
        #"TS1index-1" because first row is initially neglected
        distmat[TS1index-1,TS2index] <- dtwoutput
      }
    }
  }
  #add previously neglected first row to matrix
  distmat <- rbind(t(c(0, rep(NA, numges-1))), distmat) 
  diag(distmat) <- 0  #set diagonal to zero: distance from self is by definition zero
  distmat <- pmax(distmat, t(distmat), na.rm = T) #reflect values across diagonal
  colnames(distmat) <- catchnames #attach the names to the distance matrix
  #save the distance matrix
  write.csv(distmat,file.path(distancematfolder, "distance_matrix.csv"), row.names = FALSE)
  rm(list=catchnames) #remove timeseries dataframes from memory
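
As a quick sanity check after running the routine (a minimal sketch; the expected size follows from the 109 gestures mentioned above):

dim(distmat)                 #should be 109 x 109
all(diag(distmat) == 0)      #self-distances are zero by definition
isSymmetric(unname(distmat)) #distances are mirrored across the diagonal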

Visualizing interrelationships between movement/gesture events

Now that we have produced a distance matrix containing all the dynamic time warping distances between the silent gestures, we can visualize how these gestures relate to one another. We do this by applying a dimensionality reduction technique, classic multidimensional scaling, to project the high-dimensional distance matrix into a 2D space. We can then inspect some of the nodes. For example, if we look up the accordion and butterfly gestures, we see that the current procedure judges them to be very similar, as they lie close together in the kinematic space. Indeed, as seen in the gestures shown below, these gestures are highly similar in nature.

distmat <- read.csv(file.path(distancematfolder, "distance_matrix.csv"))
#plot distances
plot.space <- function(top, title,cluster1, sizepoint)
{
  top <- cbind.data.frame(top, cluster1)
  colnames(top) <- c("Xpos", "Ypos", "grouping")
  pl <- ggplot(top, aes(x = Xpos, y = Ypos, color = grouping)) + geom_point(size = sizepoint, alpha = 0.5)
  pl <- pl + theme_bw() + theme(legend.position = "none") + ggtitle(title) + xlab("D1") + ylab("D2") #apply theme_bw() first so the legend suppression is not overridden
  return(pl)
}

#plot the distance matrix
topology <- cmdscale(as.matrix(log(distmat+1))) #we do a log(x+1) transform because the distances have a long-tailed distribution
#plot the 2D multidimensional scaling solution
ggplotly(plot.space(topology, "distances represented through classic mds", colnames(distmat),10))

[Animated GIFs: the ACCORDION and BUTTERFLY gestures]
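
The same distance matrix can also be embedded with the other dimension reduction techniques mentioned above (e.g., t-SNE, UMAP, PCA). Below is a sketch of a t-SNE embedding, assuming the Rtsne package is installed; the perplexity value is just an illustrative choice.

library(Rtsne) #t-SNE implementation (assumed to be installed)

set.seed(123)
#run t-SNE directly on the (log-transformed) precomputed distance matrix
tsnecoords <- Rtsne(as.matrix(log(distmat+1)), is_distance = TRUE, perplexity = 30)$Y
ggplotly(plot.space(tsnecoords, "distances represented through t-SNE", colnames(distmat), 10))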

Animated Visualisations

But we can do even better! We can take all the gesture .gif files in our Images folder and position them according to the scaled coordinates of the multidimensional scaling applied to the distance matrix. In this way we can directly inspect whether the kinematic distances make sense when looking at the gestures.

library(stringr) #for string manipulation
library(scales)  #for rescaling 

distmat <- read.csv(file.path(distancematfolder, "distance_matrix.csv")) #read in the distance matrix
topology <- as.data.frame(cmdscale(as.matrix(log(distmat+1)))) #multidimenionsal scaling of the logarithmically transformed distance matrix
topology[,1:2] <- apply(topology[,1:2], 2, FUN=function(x)scales::rescale(x, from=range(x), c(0, 1300))) #rescale the topology to a 0-1300 pixel range
nams <- colnames(distmat)
topology$nams <- stringr::str_replace_all(nams, "[.]", "-") #make sure names are the same as the video (so no "." but "-")

todraw <- ""
layer = 1
adjustright = 300
adjustvertical = -100
#we are only showing 33% of the videos, otherwise it is very crowded
for(nam in topology$nams[1:(length(topology$nams)/3)])
{
todraw <- paste0(todraw, paste0('<img src="Images/', nam, '.gif" style="position: absolute; top: ', round(topology[topology$nams==nam,1]+adjustvertical),'px; right:',  round(topology[topology$nams==nam,2]+adjustright), 'px; width: 6%; height: auto; z-index:',layer,'; border: solid 1px #CCC"/>'))
layer <- layer+1
}
fileConn<-file("animated_kinematicspace.html") #save to a separate html file
writeLines(todraw, fileConn)
close(fileConn)

To inspect the result separately as html see here: https://wimpouw.github.io/EnvisionBootcamp2021/animated_kinematicspace.html

Things to consider

There are many things you can tweak in this general procedure. For example, you can expand the gesture set to comprise gestures of multiple individuals (see gesture network module 2). If you care about the general shape of a gesture but not its size, you might want to normalize (e.g., z-scale) the time series first. Also consider that we currently use only a few keypoints, which do not capture much of the hand shape, so you could add more keypoints. Further, what if a gesture consists of multiple elements, and you want to judge another gesture as similar when it contains the same elements regardless of their order? Then you might not want to use dynamic time warping, as it will judge a time series with elements a-b-c to be completely different from a time series with elements c-b-a (a time series distance measure that does judge such series as similar is the matrix profile distance, MPdist). Finally, there is a range of settings that can optimize dynamic time warping performance, such as not letting the distance calculation depend on the beginning and trailing portions of the time series ([5] Silva et al., 2016; for further references see e.g., Pouw & Dixon, 2020; Pouw et al., 2021). A sketch of two of these tweaks (z-scaling and open-begin/open-end alignment) is given below.
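
Below is a sketch of these two tweaks using the dtw package's open-begin/open-end options; the choice of input (the first two gestures of the set, right index finger only) is arbitrary and just for illustration.

library(dtw)

#take two gestures from the Time_series folder as an example
TSa <- read.csv(file.path(timeserfolder, timeseriess[1]))
TSb <- read.csv(file.path(timeserfolder, timeseriess[2]))

#z-scale each x/y/z column so that gesture size and absolute position drop out
TSa_z <- scale(TSa[keypointsindex_right])
TSb_z <- scale(TSb[keypointsindex_right])

#open-begin/open-end alignment lets the match skip leading and trailing portions
#of the reference series; open.begin requires an asymmetric step pattern
partial <- dtw(TSa_z, TSb_z, step.pattern = asymmetric, open.begin = TRUE, open.end = TRUE)
partial$normalizedDistance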

Applications

This open dataset by Ortega and Ozyurek (2020) also contains information about how each gesture was rated for its iconicity (how well the gesture represents its target referent). Since we now have information about how each gesture relates to other gestures, we can, for example, ask whether gestures that are more differentiable from other gestures are also likely to be seen as more iconic; after all, such a gesture is less likely to be confused with others. We could perform all kinds of more complex operations to characterize how a particular gesture relates to other gestures (see for example Pouw et al., 2021), but for now we keep it simple: what is the average kinematic distance of a gesture to all other gestures (higher average distance = more differentiable?), and what was the mean iconicity rating for that gesture? Below we show this in a plot. It seems that gestures with generally higher kinematic distances to other gestures are also more likely to be judged as more iconic (i.e., more transparent in meaning).

diag(distmat) <- NA #ignore the diagonal (self-distances)
#the distances have a long-tailed distribution, so we log-transform them with log(x+1)
distmatl <- log(distmat+1) 

#compute the average distance of a gesture with all others
avdist <- apply(distmatl, 2, FUN = function(x)mean(x,na.rm=TRUE))
#collect the names of the gestures (stripping the '_silentgesture' suffix)
nam <- substr(colnames(distmatl), 1, nchar(colnames(distmatl))-14)
distsum <- cbind(nam, avdist) #combine in one matrix
#merge our summarized distance data with the iconicity ratings
mergedc <- merge(iconicityratings, distsum, by.x="English", by.y="nam")
mergedc$avdist <- as.numeric(mergedc$avdist) #make sure that avdist is a numeric variable 

#plot the relation
ggplot(mergedc, aes(x= MEAN, y = avdist))+geom_point(size=3)+geom_smooth(method="lm", size = 3)+theme_bw()+
  xlab("iconicity")+ylab("average distance")