Multi-Party and Single-Person Tracking with MediaPipe: Top- and Front-View Hand Tracking

Wim Pouw ( wim.pouw@donders.ru.nl )
James Trujillo ( james.trujillo@donders.ru.nl )
18-11-2021

Info documents

In this module, we'll demonstrate how to perform motion tracking using the lightweight tool MediaPipe, and consider some of the pros and cons of this method. Specifically, we'll be using MediaPipe for hand tracking in situations where a) we have multiple people in frame from a top view, b) a single person from a top view, and c) a single person from a front view. Single-person tracking is more easily processed, as we will explain, because it is much easier to identify which hands belong to which person from frame to frame. In the case of multi-person hand tracking we need a bit more post-processing to identify persons from frame to frame (the envision toolbox also contains a script for linking persons from frame to frame in such cases).

Introduction

Here we will cover how to use MediaPipe to acquire motion tracking of the hands from multiple people. MediaPipe offers a nice, computationally lightweight solution for capturing hand motion from multiple people (or just one person). We'll first go over some code to get the hand tracking up and running.
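If you don't have the required packages installed yet, they can typically be installed with pip from within the notebook; the package names below are the standard PyPI names (adapt this to your own setup if needed):

!pip install mediapipe opencv-python pandas numpy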

Resources

  • https://github.com/google/mediapipe

  • https://google.github.io/mediapipe/solutions/hands.html

  • Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., ... & Grundmann, M. (2019). Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172.

Hand Tracking

The hand tracking algorithm provided below captures the x, y, z keypoints of just the hands, for everyone in frame. Let's do some tracking and see what we get!
First, let's load some packages and set our paths.
In [2]:
from IPython.display import HTML

HTML('<iframe width="935" height="584" src="https://www.youtube.com/embed/mw8RymohMp0?start=7442" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')
Out[2]:
In [7]:
%config Completer.use_jedi = False
import cv2
import sys
import mediapipe
import pandas as pd
import numpy as np
import csv
from os import listdir
from os.path import isfile, join
  
#initialize modules
drawingModule = mediapipe.solutions.drawing_utils #the module(s) used from the mediapipe package
handsModule = mediapipe.solutions.hands           #the module(s) used from the mediapipe package
In [8]:
#list all videos in mediafolder
mypath = "./MediaToAnalyze/"
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))] # get all files that are in mediatoanalyze
#time series output folder
foldtime = "./Timeseries_Output/"
In [9]:
################################some preparatory functions and lists for saving the data
#take some google classification object and convert it into a string
def makegoginto_str(gogobj):
    gogobj = str(gogobj).strip("[]")
    gogobj = gogobj.split("\n")
    return(gogobj[:-1]) #ignore last element as this has nothing

#Hand landmarks
markers = ['WRIST', 'THUMB_CMC', 'THUMB_MCP', 'THUMB_IP', 'THUMB_TIP', 
 'INDEX_MCP', 'INDEX_PIP', 'INDEX_DIP', 'INDEX_TIP', 
 'MIDDLE_MCP', 'MIDDLE_PIP', 'MIDDLE_DIP','MIDDLE_TIP', 
 'RING_MCP', 'RING_PIP', 'RING_DIP', 'RING_TIP', 
 'PINKY_MCP', 'PINKY_PIP', 'PINKY_DIP', 'PINKY_TIP']

#make the stringified position traces into clean values
def listpostions(newsamplemarks):
    tracking_p = []
    for value in newsamplemarks:
        stripped = value.split(':', 1)[1]
        stripped = stripped.strip() #remove spaces in the string if present
        tracking_p.append(stripped) #add to this list  
    return(tracking_p)

#a function that only retrieves the numerical info in a string
def only_numerics(seq):
    seq_type= type(seq)
    return seq_type().join(filter(seq_type.isdigit, seq))
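
#To make it concrete what these helpers return, here is a small illustration on made-up strings;
#the exact values are hypothetical, but the format matches how MediaPipe prints its landmark and handedness protobufs.

#hypothetical example of the stringified handedness info for one detected hand
example_handinfo = 'classification {\n  index: 0\n  score: 0.98\n  label: "Left"\n}\n'
print(makegoginto_str(example_handinfo))
#-> ['classification {', '  index: 0', '  score: 0.98', '  label: "Left"', '}']

#hypothetical example of one stringified landmark (x, y, z of a single keypoint)
example_landmark = makegoginto_str('x: 0.64\ny: 0.39\nz: -0.00003\n')
print(listpostions(example_landmark))
#-> ['0.64', '0.39', '-0.00003']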

Now we'll perform the actual tracking. This block goes through each video file in your directory, reads the video frames (images) using cv2, creates an output video file, and then collects the tracked keypoints. The keypoint coordinates are drawn onto a copy of each video frame to visualize the tracking, and are also saved into a .csv file for later analysis.

In [10]:
#loop through the frames of the video
for ff in onlyfiles:
    #capture the video and save some video properties
    capture = cv2.VideoCapture(mypath+ff)
    frameWidth = capture.get(cv2.CAP_PROP_FRAME_WIDTH)
    frameHeight = capture.get(cv2.CAP_PROP_FRAME_HEIGHT)
    fps = capture.get(cv2.CAP_PROP_FPS)

    print(frameWidth, frameHeight, fps ) #print some video info to the console
    
    #make a video file where we will project keypoints on
    samplerate = fps #make the same as current 
    fourcc = cv2.VideoWriter_fourcc(*'XVID') #(*'XVID')
    out = cv2.VideoWriter('Videotracking_output/'+ff[:-4]+'.avi', fourcc, fps= samplerate, frameSize = (int(frameWidth), int(frameHeight))) #make sure that frameheight/width is the same as the original

    #make a variable list with x, y, z, info where data is appended to
    markerxyz = []
    for mark in markers:
        for pos in ['X', 'Y', 'Z']:
            nm = pos + "_" + mark
            markerxyz.append(nm)
    addvariable = ['index', 'confidence', 'hand', 'time']
    addvariable.extend(markerxyz)
    time = 0
    fr = 1
    timeseries = [addvariable]
    #MAIN ROUTINE
    #for fine-tuning the tracking parameters, check: https://google.github.io/mediapipe/solutions/hands.html
    with handsModule.Hands(static_image_mode=False, min_detection_confidence=0.5, min_tracking_confidence=0.75, max_num_hands=6) as hands:
         while (True):
            ret, frame = capture.read()
            if ret == True:
                results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                # the results.multi_hand_landmarks should contain sets of x,y,z values for each landmark
                # However, they have no label or ID, just raw coordinates. 
                # we do know which set of coordinates corresponds to which joint:
                # see https://google.github.io/mediapipe/solutions/hands.html and figure 2.21 on that page
                if results.multi_hand_landmarks != None: 
                    #attach an id based on location                    
                    for handLandmarks, handinfo in zip(results.multi_hand_landmarks,results.multi_handedness):
                        # these first few lines just convert the results output into something more workable
                        newsamplelmarks = makegoginto_str(handLandmarks.landmark)
                        newsamplelmarks = listpostions(newsamplelmarks)
                        newsampleinfo = makegoginto_str(handinfo) #get info the hands
                        # now we compile the data into a complete row, and add it to our dataframe
                        # newsampleinfo holds the handedness classification: the confidence score sits at [2] and the "Left"/"Right" label at [3]
                        fuldataslice = [fr, newsampleinfo[2], newsampleinfo[3]] #frame number, hand confidence, handedness
                        fuldataslice.extend([str(time)]) #add time
                        fuldataslice.extend(newsamplelmarks) #add positions
                        timeseries.append(fuldataslice)
                        for point in handsModule.HandLandmark:
                            normalizedLandmark = handLandmarks.landmark[point]
                            # now draw the landmark onto the video frame
                            pixelCoordinatesLandmark = drawingModule._normalized_to_pixel_coordinates(normalizedLandmark.x, normalizedLandmark.y, frameWidth, frameHeight)
                            if pixelCoordinatesLandmark is not None: #landmarks that fall just outside the frame return None
                                cv2.circle(frame, pixelCoordinatesLandmark, 5, (0, 255, 0), -1)
                if results.multi_hand_landmarks == None:
                    timeseries.append(["NA"]) #add a row of NAs
                cv2.imshow('Test hand', frame)
                out.write(frame)  #comment this out if you don't want to save the tracking video
                time = round(time+1000/samplerate)
                fr = fr+1
                if cv2.waitKey(1) == 27:
                    break
            if ret == False:
                break
    out.release()
    capture.release()
    cv2.destroyAllWindows()

    ####################################################### data to be written row-wise in csv file
    data = timeseries

    # opening the csv file in 'w+' mode
    file = open(foldtime+ff[:-4]+'.csv', 'w+', newline ='')
    #write it
    with file:    
        write = csv.writer(file)
        write.writerows(data)
1280.0 720.0 50.0
1440.0 1080.0 29.97017053449149
720.0 480.0 29.97002997002997

Let's take a first look at the data to see what kind of output we get.

In [11]:
print(foldtime+ff[:-4]+'.csv')
df = pd.read_csv(foldtime+ff[:-4]+'.csv')
df.head()
./Timeseries_Output/singlefirst_person_sample.csv
Out[11]:
index confidence hand time X_WRIST Y_WRIST Z_WRIST X_THUMB_CMC Y_THUMB_CMC Z_THUMB_CMC ... Z_PINKY_MCP X_PINKY_PIP Y_PINKY_PIP Z_PINKY_PIP X_PINKY_DIP Y_PINKY_DIP Z_PINKY_DIP X_PINKY_TIP Y_PINKY_TIP Z_PINKY_TIP
0 1.0 score: 0.6093786954879761 label: "Left" 0.0 0.635196 0.388808 -0.000030 0.602918 0.389601 -0.018807 ... -0.004158 0.634445 0.229552 -0.005760 0.623060 0.205796 -0.000986 0.611503 0.190996 0.006322
1 1.0 score: 0.9999953508377075 label: "Left" 0.0 0.299859 0.397056 -0.000041 0.331408 0.387960 -0.023282 ... -0.003270 0.282278 0.243715 -0.011688 0.283709 0.214422 -0.018607 0.285403 0.189971 -0.026059
2 2.0 score: 0.9450348019599915 label: "Left" 33.0 0.639061 0.392384 -0.000022 0.606290 0.390252 -0.023101 ... -0.004803 0.637472 0.231520 -0.005495 0.626105 0.206491 -0.002203 0.614551 0.191324 0.002124
3 2.0 score: 0.9999712109565735 label: "Left" 33.0 0.299803 0.401946 -0.000042 0.331562 0.390181 -0.022328 ... 0.006947 0.285484 0.243658 0.001423 0.288054 0.216183 -0.001764 0.290211 0.194107 -0.006174
4 3.0 score: 0.9674971699714661 label: "Left" 66.0 0.641997 0.398204 -0.000030 0.607862 0.392848 -0.025636 ... -0.013638 0.642445 0.233575 -0.015278 0.631055 0.207358 -0.012657 0.620473 0.190910 -0.009438

5 rows × 67 columns

Above we have the first 5 rows of our output data. The first named column, "index", gives you the frame number. Note that each frame may have multiple rows if multiple hands are tracked in that frame. We also get a handedness label (left or right), a confidence score, and x, y, z coordinates (scaled to 0-1; see below) for each keypoint.
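For instance, to get a quick sense of how many hands were detected per frame, you could do something like the following sketch (the column names come from the output above; 'hand_label' and 'hands_per_frame' are just names we introduce here, and note that the 'hand' column still contains the raw 'label: "Left"' strings):

#pull a clean Left/Right label out of the raw 'label: "..."' strings
df['hand_label'] = df['hand'].str.extract(r'"(\w+)"', expand=False)
#count how many hands were tracked in each video frame ('index' is the frame number)
hands_per_frame = df.groupby('index')['hand_label'].count()
print(hands_per_frame.head())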

Output Details

The 3D coordinate output is certainly an advantage of MediaPipe, as it is able to provide some sense of depth even if you don't have multiple camera angles or an actual depth image (e.g., as recorded by infrared sensors). The authors of MediaPipe achieve this by training their model on a synthetic dataset in which they could vary the pose and orientation of the hand in many ways while always having ground-truth 3D coordinates.
As they state in the Zhang et al. (2020) paper: "Synthetic dataset: To even better cover the possible hand poses and provide additional supervision for depth, we render a high-quality synthetic hand model over various backgrounds and map it to the corresponding 3D coordinates. We use a commercial 3D hand model that is rigged with 24 bones and includes 36 blendshapes, which control fingers and palm thickness. The model also provides 5 textures with different skin tones. We created video sequences of transformation between hand poses and sampled 100K images from the videos."
It is important to note that the depth provided in this output is relative rather than metric; MediaPipe does, however, also allow estimating positions in meters (see https://google.github.io/mediapipe/solutions/hands.html to see which output you need to use for this). In the current example case, a point with x,y coordinates = 0.5, 0.5 would be in the center of the image, while x,y = 0.25, 0.75 would indicate that the point is 1/4 of the way from left to right and 3/4 of the way from top to bottom (x,y = 0,0 is the top-left corner). For depth, the values are relative to the wrist: the wrist is taken as the origin (0 depth), with smaller values estimated to be closer to the camera and larger values further away.
This relative scaling makes it difficult to compare across videos with different camera set-ups, but it is quite intuitive when comparing the coordinates to the actual video.
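If you want pixel coordinates rather than the normalized ones (for example, to overlay the time series on the video yourself), you can simply multiply the normalized x and y by the frame width and height. A minimal sketch, assuming the 1280x720 example video from above:

#convert the normalized wrist coordinates back to pixel coordinates
frame_w, frame_h = 1280, 720 #fill in the width/height of your own video
x_wrist_px = df['X_WRIST'] * frame_w
y_wrist_px = df['Y_WRIST'] * frame_h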


However, especially for multi-party data, we don't know whether the first row in frame 1 is the same hand as the first row in frame 2. We also don't know which left and right hands belong together, since there are multiple persons! We'll cover a potential solution to this in the module on linking and pairing hands. This is easier when there is just one person, as MediaPipe does differentiate between the left and right hand.
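To give a rough idea of what such linking involves (this is just a minimal sketch of the general idea, not the envision toolbox script), you could match each hand in the current frame to the nearest hand in the previous frame based on its wrist position:

#minimal nearest-neighbour sketch for linking hands across frames (hypothetical helper)
def match_to_previous(prev_xy, curr_xy):
    #prev_xy, curr_xy: numpy arrays of shape (n_hands, 2) with the normalized wrist x,y per hand
    #returns, for each hand in the current frame, the row of the nearest hand in the previous frame
    dists = np.linalg.norm(curr_xy[:, None, :] - prev_xy[None, :, :], axis=2)
    return dists.argmin(axis=1)

A full solution would also need to handle hands appearing and disappearing, and could use an optimal assignment (e.g., the Hungarian algorithm) rather than this greedy matching; see the linking and pairing module for that.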