Body Tracking Using MediaPipe

Wim Pouw ( wim.pouw@donders.ru.nl )
James Trujillo ( james.trujillo@donders.ru.nl )
18-11-2021

Info documents

This module provides a simple demonstration of how to use MediaPipe for motion tracking of a single person. The approach is a lightweight motion tracking solution and offers several distinct advantages in the type of output it produces.

Resources

In [1]:
from IPython.display import HTML

HTML('<iframe width="935" height="584" src="https://www.youtube.com/embed/mw8RymohMp0?start=5845" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')
C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\display.py:717: UserWarning: Consider using IPython.display.IFrame instead
  warnings.warn("Consider using IPython.display.IFrame instead")
Out[1]:
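The warning above suggests using IPython.display.IFrame; a minimal sketch of the same embed using that class (same video URL and dimensions as above) would be:

from IPython.display import IFrame

#embed the same lecture video with IFrame instead of raw HTML (avoids the warning above)
IFrame("https://www.youtube.com/embed/mw8RymohMp0?start=5845", width=935, height=584)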
In [9]:
%config Completer.use_jedi = False
import cv2
import mediapipe
import pandas as pd
import numpy as np
import csv
 
drawingModule = mediapipe.solutions.drawing_utils #MediaPipe utilities for drawing landmarks onto video frames
poseModule = mediapipe.solutions.pose             #MediaPipe pose (BlazePose) tracking module
In [10]:
#list all videos in mediafolder
from os import listdir
from os.path import isfile, join
mypath = "./MediaToAnalyze/" #this is your folder with (all) your video(s)
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))] #loop through the filenames and collect them in a list
#time series output folder
foldtime = "./Timeseries_Output/"
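Note that onlyfiles will contain every file in the media folder. If the folder may also hold non-video files (e.g., hidden system files), a hedged variant that keeps only common video extensions could look like this (the extension list is an assumption; adjust it to your own material):

video_exts = ('.mp4', '.avi', '.mov') #assumed extensions, extend as needed
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f)) and f.lower().endswith(video_exts)]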
In [15]:
#################some preparatory functions and lists for saving the data

#convert a MediaPipe (Google protobuf) landmark object into a list of strings, one per line
def makegoginto_str(gogobj):
    gogobj = str(gogobj).strip("[]")
    gogobj = gogobj.split("\n")
    return(gogobj[:-1]) #ignore the last element, as it is empty

#the 33 landmarks used by MediaPipe (BlazePose)
markers = ['NOSE', 'LEFT_EYE_INNER', 'LEFT_EYE', 'LEFT_EYE_OUTER', 'RIGHT_EYE_INNER', 'RIGHT_EYE', 'RIGHT_EYE_OUTER',
          'LEFT_EAR', 'RIGHT_EAR', 'MOUTH_LEFT', 'MOUTH_RIGHT', 'LEFT_SHOULDER', 'RIGHT_SHOULDER', 'LEFT_ELBOW', 
          'RIGHT_ELBOW', 'LEFT_WRIST', 'RIGHT_WRIST', 'LEFT_PINKY', 'RIGHT_PINKY', 'LEFT_INDEX', 'RIGHT_INDEX',
          'LEFT_THUMB', 'RIGHT_THUMB', 'LEFT_HIP', 'RIGHT_HIP', 'LEFT_KNEE', 'RIGHT_KNEE', 'LEFT_ANKLE', 'RIGHT_ANKLE',
          'LEFT_HEEL', 'RIGHT_HEEL', 'LEFT_FOOT_INDEX', 'RIGHT_FOOT_INDEX']

#check if there are numbers in a string
def num_there(s):
    return any(i.isdigit() for i in s)

#turn the stringified position traces into clean numerical values
def listpositions(newsamplelmarks):
    tracking_p = []
    for value in newsamplelmarks:
        if num_there(value):
            stripped = value.split(':', 1)[1]
            stripped = stripped.strip() #remove spaces in the string if present
            tracking_p.append(stripped) #add to this list
    return(tracking_p)
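To make the role of these helpers concrete, here is a small illustrative example on a hand-written string in the protobuf text format that str(results.pose_world_landmarks) produces (the values are the nose coordinates from the first row of the output shown further below):

#illustrative landmark printout in MediaPipe's protobuf text format
example = """landmark {
  x: -0.063733
  y: 0.502805
  z: -0.395698
  visibility: 0.999921
}
"""
lines = makegoginto_str(example) #split into one string per line, dropping the empty last element
print(listpositions(lines))      #keeps only the numeric parts -> ['-0.063733', '0.502805', '-0.395698', '0.999921']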

Once we have our preparatory functions set and our packages loaded, we can get to tracking. The code block below does three things: it performs the actual tracking with MediaPipe (via poseModule.Pose), draws the tracked points back onto each frame of the video (using cv2), and collects the coordinates of the tracked points into a time series that is written to a csv file for analysis or further processing.

In [17]:
#loop through all the video files and extract pose information
for ff in onlyfiles:
    #capture the video, and check video settings
    capture = cv2.VideoCapture(mypath+ff)
    frameWidth = capture.get(cv2.CAP_PROP_FRAME_WIDTH)
    frameHeight = capture.get(cv2.CAP_PROP_FRAME_HEIGHT)
    fps = capture.get(cv2.CAP_PROP_FPS)   #fps = frames per second
    print(frameWidth, frameHeight, fps)
    #pose tracking with keypoints save!
    
    #make an 'empty' video file onto which we project the pose tracking
    samplerate = fps #make the same as current video
    fourcc = cv2.VideoWriter_fourcc(*'MP4V') #(*'XVID')
    out = cv2.VideoWriter('Videotracking_output/'+ff[:-4]+'.mp4', fourcc, fps = samplerate, frameSize = (int(frameWidth), int(frameHeight)))

    #make a variable list with x, y, z, info where data is appended to
        #the markers are initialized above
    markerxyz = []
    for mark in markers:
        for pos in ['X', 'Y', 'Z', 'visibility']:
            nm = pos + "_" + mark
            markerxyz.append(nm)
    addvariable = ['time']
    addvariable.extend(markerxyz)

    time = 0 #initialize a time variable that starts at 0
    timeseries = [addvariable] #add the first row of column names to your timeseries data object (time, X_NOSE, .. etc.)
    #MAIN ROUTINE
        #check the settings of your posemodel if you want to finetune (https://google.github.io/mediapipe/solutions/pose.html)
    with poseModule.Pose(min_detection_confidence=0.5, model_complexity = 1, min_tracking_confidence=0.75, smooth_landmarks = True) as pose:
        while True:
            ret, frame = capture.read() #read frames
            if ret == True:
                results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)) #apply the mediapipe pose tracking to the frame
                if results.pose_landmarks is not None: #get the data from the results if there is info
                    newsamplelmarks = makegoginto_str(results.pose_world_landmarks)
                    newsamplelmarks = listpositions(newsamplelmarks)
                    fuldataslice = [str(time)] #this is the first info in the time series slice (time)
                    fuldataslice.extend(newsamplelmarks) #add positions to this slice
                    timeseries.append(fuldataslice) #append slice to the timeseries data object
                    drawingModule.draw_landmarks(frame, results.pose_landmarks, poseModule.POSE_CONNECTIONS) #draw skeleton
                    #for point in poseModule.PoseLandmark: #you can uncomment this if you want to draw points instead of the skeleton
                        #normalizedLandmark = results.pose_landmarks.landmark[point]
                        #pixelCoordinatesLandmark = drawingModule._normalized_to_pixel_coordinates(normalizedLandmark.x, normalizedLandmark.y, frameWidth, frameHeight)
                        #cv2.circle(frame, pixelCoordinatesLandmark, 5, (0, 255, 0), -1)
                cv2.imshow('MediaPipe Pose', frame) #show the current frame with the skeleton tracking
                out.write(frame) #write the frame to your video object (comment this out if you don't want to save a video)
                time = time+(1000/samplerate) #routine is done, next frame will be 1000 milliseconds/samplerate later in time
                if cv2.waitKey(1) == 27: #allow the use of ESCAPE to break the loop
                    break
            if ret == False: #if there are no more frames, break the loop
                break
    #once done de-initialize
    out.release()
    capture.release()
    cv2.destroyAllWindows()

    ####################################################### data to be written row-wise to a csv file
    # opening the csv file in 'w+' mode
    file = open(foldtime + ff[:-4]+'.csv', 'w+', newline ='')
    #write it
    with file:    
        write = csv.writer(file)
        write.writerows(timeseries)
250.0 480.0 30.0
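As an aside, the string parsing above (makegoginto_str / listpositions) works, but MediaPipe also exposes the landmark values directly as attributes. A hedged alternative sketch of the extraction step, assuming the same results object, reads the coordinates without any string handling:

#alternative to string parsing: read the landmark attributes directly from the results object
def landmarks_to_row(world_landmarks):
    row = []
    for lm in world_landmarks.landmark: #33 landmarks, in the same order as the markers list above
        row.extend([lm.x, lm.y, lm.z, lm.visibility])
    return row

#inside the main loop this would replace the makegoginto_str/listpositions calls:
#fuldataslice = [time] + landmarks_to_row(results.pose_world_landmarks)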

Here's a sample frame from the output video:
As well as a sample of the data that we produced:

In [18]:
df_body = pd.read_csv(foldtime + ff[:-4]+'.csv')
df_body.head()
Out[18]:
time X_NOSE Y_NOSE Z_NOSE visibility_NOSE X_LEFT_EYE_INNER Y_LEFT_EYE_INNER Z_LEFT_EYE_INNER visibility_LEFT_EYE_INNER X_LEFT_EYE ... Z_RIGHT_HEEL visibility_RIGHT_HEEL X_LEFT_FOOT_INDEX Y_LEFT_FOOT_INDEX Z_LEFT_FOOT_INDEX visibility_LEFT_FOOT_INDEX X_RIGHT_FOOT_INDEX Y_RIGHT_FOOT_INDEX Z_RIGHT_FOOT_INDEX visibility_RIGHT_FOOT_INDEX
0 0.000000 -0.063733 0.502805 -0.395698 0.999921 -0.083962 0.538291 -0.383986 0.999944 -0.083089 ... 0.130854 0.192507 0.018961 -0.489134 -0.100693 0.418703 -0.002238 -0.516333 0.090475 0.193644
1 33.333333 -0.043211 0.502240 -0.394696 0.999912 -0.071669 0.535697 -0.384011 0.999936 -0.070713 ... 0.119296 0.194849 0.034104 -0.402429 0.040178 0.397703 -0.005346 -0.358187 0.092969 0.195986
2 66.666667 -0.020163 0.503504 -0.386350 0.999909 -0.051884 0.535895 -0.380524 0.999936 -0.050845 ... 0.115370 0.193730 0.053036 -0.289042 0.061571 0.381764 -0.005315 -0.230478 0.111993 0.196811
3 100.000000 -0.008440 0.533132 -0.351903 0.999891 -0.043266 0.560953 -0.352411 0.999932 -0.042234 ... 0.198556 0.207998 0.045575 -0.337201 0.151154 0.368794 0.006082 -0.334947 0.215761 0.206893
4 133.333333 0.002767 0.559002 -0.262275 0.999872 -0.036010 0.584854 -0.269395 0.999932 -0.035020 ... 0.222347 0.214859 0.037744 -0.280906 0.162776 0.364183 0.015523 -0.267834 0.253445 0.212417

5 rows × 133 columns

One advantage of the output that we get here is that even though we used a 2D video, we get 3D tracking coordinates. This is possible because the MediaPipe detector was trained on hand coordinates for which the depth was known. As the authors state: "Synthetic dataset: To even better cover the possible hand poses and provide additional supervision for depth, we render a high-quality synthetic hand model over various backgrounds and map it to the corresponding 3D coordinates. We use a commercial 3D hand model that is rigged with 24 bones and includes 36 blendshapes, which control fingers and palm thickness. The model also provides 5 textures with different skin tones. We created video sequences of transformation between hand poses and sampled 100K images from the videos." (Zhang et al., 2020)

Additionally, the coordinates provided here are given in meters, with the origin (0, 0, 0) at the midpoint between the hips. This is advantageous because it reduces variability between videos when the distance to the camera varies.
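Because the positions are in meters and the time column is in milliseconds, simple kinematic measures follow directly from the saved time series. As a hedged illustration (not part of the original pipeline), the speed of the right wrist could be computed like this:

#hedged illustration: speed of the right wrist in m/s, computed from the loaded time series
t = df_body['time'].to_numpy() / 1000.0 #time is in milliseconds -> convert to seconds
wrist = df_body[['X_RIGHT_WRIST', 'Y_RIGHT_WRIST', 'Z_RIGHT_WRIST']].to_numpy()
velocity = np.gradient(wrist, t, axis=0) #per-axis velocity (m/s)
speed = np.linalg.norm(velocity, axis=1) #Euclidean speed (m/s)
print(speed[:5]) #first few samples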

The major disadvantage to this method is that it is only capable of tracking a single individual at a time. For videos of one speaker/actor, this isn't an issue of course. But if we're interested in multi-party interactions and cannot (or do not wish to) split the video into different individuals (e.g., because of overlapping space between them), we need to use a different solution. We discuss a couple of such options in the modules covering hand tracking with MediaPipe, tracking using DeepLabCut, and tracking using OpenPose.