location code: https://github.com/WimPouw/EnvisionBootcamp2021/tree/main/Python/MediaBodyTracking
citation: Pouw, W., & Trujillo, J. P. (2021-11-18). Body Tracking Using MediaPipe [day you visited the site]. Retrieved from: https://github.com/WimPouw/EnvisionBootcamp2021/blob/main/Python/BodyTracking_MediaPipe
Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., ... & Grundmann, M. (2019). Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172.
opencv-python
from IPython.display import HTML
HTML('<iframe width="935" height="584" src="https://www.youtube.com/embed/mw8RymohMp0?start=5845" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')
%config Completer.use_jedi = False
import cv2
import mediapipe
import pandas as pd
import numpy as np
import csv
drawingModule = mediapipe.solutions.drawing_utils #initialize mediapipe's drawing utilities (to draw landmarks onto video frames)
poseModule = mediapipe.solutions.pose #initialize mediapipe's pose solution (the BlazePose model)
#list all videos in mediafolder
from os import listdir
from os.path import isfile, join
mypath = "./MediaToAnalyze/" #this is your folder with (all) your video(s)
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))] #loop through the filenames and collect them in a list
#time series output folder
foldtime = "./Timeseries_Output/"
#################some preparatory functions and lists for saving the data
#take a google classification object (the landmark printout) and convert it into a list of strings
def makegoginto_str(gogobj):
    gogobj = str(gogobj).strip("[]")
    gogobj = gogobj.split("\n")
    return(gogobj[:-1]) #ignore the last element, as it is empty
#landmarks 33x that are used by Mediapipe (Blazepose)
markers = ['NOSE', 'LEFT_EYE_INNER', 'LEFT_EYE', 'LEFT_EYE_OUTER', 'RIGHT_EYE_INNER', 'RIGHT_EYE', 'RIGHT_EYE_OUTER',
'LEFT_EAR', 'RIGHT_EAR', 'MOUTH_LEFT', 'MOUTH_RIGHT', 'LEFT_SHOULDER', 'RIGHT_SHOULDER', 'LEFT_ELBOW',
'RIGHT_ELBOW', 'LEFT_WRIST', 'RIGHT_WRIST', 'LEFT_PINKY', 'RIGHT_PINKY', 'LEFT_INDEX', 'RIGHT_INDEX',
'LEFT_THUMB', 'RIGHT_THUMB', 'LEFT_HIP', 'RIGHT_HIP', 'LEFT_KNEE', 'RIGHT_KNEE', 'LEFT_ANKLE', 'RIGHT_ANKLE',
'LEFT_HEEL', 'RIGHT_HEEL', 'LEFT_FOOT_INDEX', 'RIGHT_FOOT_INDEX']
#check if there are numbers in a string
def num_there(s):
    return any(i.isdigit() for i in s)

#turn the stringified position traces into clean numerical values
def listpostions(newsamplelmarks):
    tracking_p = []
    for value in newsamplelmarks:
        if num_there(value):
            stripped = value.split(':', 1)[1] #take whatever comes after the first ':' (the numerical value)
            stripped = stripped.strip() #remove spaces in the string if present
            tracking_p.append(stripped) #add to this list
    return(tracking_p)
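To see what these helpers do, here is a minimal sanity check on a mock landmark string (the real input is the printout of MediaPipe's landmark object; the string below only mimics its general layout):

#hypothetical mock-up of a single landmark entry, just to illustrate the parsing
mock_landmarks = "landmark {\n  x: 0.1\n  y: -0.25\n  z: 0.05\n  visibility: 0.98\n}\n"
print(listpostions(makegoginto_str(mock_landmarks))) #expected: ['0.1', '-0.25', '0.05', '0.98']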
Once we have our preparatory functions set and packages loaded, we can get to tracking. The code block below does three things: it performs the actual tracking using MediaPipe (via the pose module initialized above), draws the tracked points back onto each frame of the video (using cv2), and saves the coordinates of the tracked points to a .csv file (which we can later load into a pandas dataframe) for analysis or further processing.
#loop through all the video files and extract pose information
for ff in onlyfiles:
    #capture the video, and check video settings
    capture = cv2.VideoCapture(mypath+ff)
    frameWidth = capture.get(cv2.CAP_PROP_FRAME_WIDTH)
    frameHeight = capture.get(cv2.CAP_PROP_FRAME_HEIGHT)
    fps = capture.get(cv2.CAP_PROP_FPS) #fps = frames per second
    print(frameWidth, frameHeight, fps)

    #pose tracking with keypoints saved!
    #make an 'empty' video file onto which we project the pose tracking
    samplerate = fps #make the sampling rate the same as the current video's frame rate
    fourcc = cv2.VideoWriter_fourcc(*'MP4V') #(*'XVID')
    out = cv2.VideoWriter('Videotracking_output/'+ff[:-4]+'.mp4', fourcc, fps = samplerate, frameSize = (int(frameWidth), int(frameHeight)))

    #make a variable list with x, y, z, visibility info that the data is appended to
    #the markers are initialized above
    markerxyz = []
    for mark in markers:
        for pos in ['X', 'Y', 'Z', 'visibility']:
            nm = pos + "_" + mark
            markerxyz.append(nm)
    addvariable = ['time']
    addvariable.extend(markerxyz)
    time = 0 #initialize a time variable that starts at 0
    timeseries = [addvariable] #add the first row of column names to your timeseries data object (time, X_NOSE, ... etc.)

    #MAIN ROUTINE
    #check the settings of your pose model if you want to fine-tune (https://google.github.io/mediapipe/solutions/pose.html)
    with poseModule.Pose(min_detection_confidence=0.5, model_complexity = 1, min_tracking_confidence=0.75, smooth_landmarks = True) as pose:
        while (True):
            ret, frame = capture.read() #read frames
            if ret == True:
                results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)) #apply the mediapipe pose tracking to the frame
                if results.pose_landmarks != None: #get the data from the results if there is info
                    newsamplelmarks = makegoginto_str(results.pose_world_landmarks)
                    newsamplelmarks = listpostions(newsamplelmarks)
                    fuldataslice = [str(time)] #this is the first info in the time series slice (time)
                    fuldataslice.extend(newsamplelmarks) #add positions to this slice
                    timeseries.append(fuldataslice) #append slice to the timeseries data object
                    drawingModule.draw_landmarks(frame, results.pose_landmarks, poseModule.POSE_CONNECTIONS) #draw skeleton
                    #for point in poseModule.PoseLandmark: #you can uncomment this if you want to draw points instead of a skeleton
                    #    normalizedLandmark = results.pose_landmarks.landmark[point]
                    #    pixelCoordinatesLandmark = drawingModule._normalized_to_pixel_coordinates(normalizedLandmark.x, normalizedLandmark.y, frameWidth, frameHeight)
                    #    cv2.circle(frame, pixelCoordinatesLandmark, 5, (0, 255, 0), -1)
                cv2.imshow('MediaPipe Pose', frame) #show the current frame with the skeleton tracking
                out.write(frame) #write the frame to your video object (comment this out if you don't want to make a video)
                time = time+(1000/samplerate) #routine is done, the next frame will be 1000/samplerate milliseconds later in time
                if cv2.waitKey(1) == 27: #allow the use of ESCAPE to break the loop
                    break
            if ret == False: #if there are no more frames, break the loop
                break

    #once done, de-initialize
    out.release()
    capture.release()
    cv2.destroyAllWindows()

    ####################################################### data to be written row-wise to a csv file
    # opening the csv file in 'w+' mode
    file = open(foldtime + ff[:-4]+'.csv', 'w+', newline ='')
    #write it
    with file:
        write = csv.writer(file)
        write.writerows(timeseries)
Here's a sample frame from the output video:
As well as a sample of the data that we produced:
df_body = pd.read_csv(foldtime + ff[:-4]+'.csv')
df_body.head()
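As a sketch of what you can do with this output (assuming matplotlib is installed; the column names follow the X_/Y_/Z_/visibility_ + marker scheme built above), you could for example plot the vertical trace of the right wrist over time:

import matplotlib.pyplot as plt
#plot the vertical trace of the right wrist over time
#note: in MediaPipe's image-aligned convention y increases downward, so we flip the sign; adjust if your plot looks inverted
plt.plot(df_body['time'], -df_body['Y_RIGHT_WRIST'])
plt.xlabel('time (ms)')
plt.ylabel('right wrist height (m, relative to hip center)')
plt.show()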
One advantage of the output we get here is that, even though we used a 2D video, we get 3D tracking coordinates. This is possible because the MediaPipe detector was trained on hand coordinates for which the depth was known. As the authors state: "Synthetic dataset: To even better cover the possible hand poses and provide additional supervision for depth, we render a high-quality synthetic hand model over various backgrounds and map it to the corresponding 3D coordinates. We use a commercial 3D hand model that is rigged with 24 bones and includes 36 blendshapes, which control fingers and palm thickness. The model also provides 5 textures with different skin tones. We created video sequences of transformation between hand poses and sampled 100K images from the videos." (Zhang et al., 2020)
Additionally, the coordinates provided here are given in meters, with the origin (0, 0, 0) at the midpoint between the hips. This is advantageous because it reduces variability between videos when the distance to the camera varies.
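To illustrate why metric, hip-centered coordinates are convenient, here is a small sketch (using the dataframe loaded above) that computes the distance between the two wrists in meters for every frame; because the units are meters rather than pixels, the values are directly comparable across videos filmed at different distances:

#distance between the wrists in meters, per frame
wrist_dist = np.sqrt((df_body['X_LEFT_WRIST'] - df_body['X_RIGHT_WRIST'])**2 +
                     (df_body['Y_LEFT_WRIST'] - df_body['Y_RIGHT_WRIST'])**2 +
                     (df_body['Z_LEFT_WRIST'] - df_body['Z_RIGHT_WRIST'])**2)
print(wrist_dist.describe()) #summary statistics in meters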
The major disadvantage to this method is that it is only capable of tracking a single individual at a time. For videos of one speaker/actor, this isn't an issue of course. But if we're interested in multi-party interactions and cannot (or do not wish to) split the video into different individuals (e.g., because of overlapping space between them), we need to use a different solution. We discuss a couple of such options in the modules covering hand tracking with MediaPipe, tracking using DeepLabCut, and tracking using OpenPose.