AI & the [Human] Body
In this guide, Cailean Finn tells us about human pose detection and recognition technology. This application of AI and ML hasn’t been accessible to ‘outsiders’ until recently; Cailean walks us through what it is, tells us about its history, and shows examples of how it can be used for creative purposes.
[PUBLISHED]
Dec 2022
[AUTHORS]
[FUNDING]
01_Introduction to Human Pose Recognition
→ Human body language is an intrinsic component of our lived experience. Through our movements we engage in nonverbal communication, expressing ideas and emotions instinctively rather than consciously. In turn, our perceptions of others are heavily influenced by their body language, which communicates a plethora of information to the world. This flow of information does not cease when you stop speaking; even when you are silent, you are still communicating.
→ In the digital age, we have unfortunately seen this complex language fade into the background. This became even more evident during the pandemic, as we were inundated with Zoom meetings, a domain where our level of communication is severely limited. So, how can we harness our body language as a tool for communication in the digital age?
In this guide, I hope to provide a brief historical and technical overview of the many artificial intelligence and machine learning tools for Human Pose Recognition (HPR) that are currently available and in development.
→ Human Pose Recognition is a branch of Computer Vision research, and is essentially a technique that allows us to accurately detect and estimate the pose of a person. This is achieved by identifying and classifying the coordinates of the joints of the human body, such as the wrists, shoulders, knees, elbows (…), commonly known as landmarks.
→ More accurate representations of our physical body enable us to create more natural and complex interactions with virtual environments.
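To make this concrete: a detected pose is typically returned as a list of named landmarks, each with image coordinates and a confidence score. The exact joint names and counts vary from model to model; the sketch below is a hypothetical, hand-written example of that general shape in JavaScript, not the output of any particular library.

```javascript
// A hypothetical example of the data a pose model returns for one person:
// a list of named landmarks with pixel coordinates and a confidence score.
// Joint names and counts differ between models (COCO-style models use 17).
const pose = {
  score: 0.91, // overall confidence for this detection
  keypoints: [
    { part: "nose", x: 312, y: 148, score: 0.98 },
    { part: "leftShoulder", x: 260, y: 230, score: 0.95 },
    { part: "rightShoulder", x: 368, y: 228, score: 0.96 },
    { part: "leftWrist", x: 214, y: 96, score: 0.83 },
    // … elbows, hips, knees, ankles, and so on
  ],
};

// A simple bodily interaction: is the left wrist raised above the shoulders?
// (Image coordinates grow downwards, so "above" means a smaller y value.)
const wrist = pose.keypoints.find((k) => k.part === "leftWrist");
const shoulder = pose.keypoints.find((k) => k.part === "leftShoulder");
if (wrist && shoulder && wrist.y < shoulder.y) {
  console.log("Left hand raised!");
}
```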
Authors Ginés Hidalgo (left) and Hanbyul Joo (right) in front of the CMU Panoptic Studio, OpenPose
→ In the past, there have been many technical barriers preventing artists, designers and creative practitioners from utilising and experimenting with Human Pose Recognition tools. As this technology becomes more accessible through the development of tools like OpenPose and MoveNet, it presents us with the opportunity to explore new modalities of bodily interaction. With this increase in accessibility and speed, human pose recognition is becoming more ubiquitous across numerous ecologies, and we must begin to critically observe how this information could be used when our bodily movements are mediated digitally.
How can we use Human Pose Recognition to translate our intimate bodily movements into a digital environment? What elements do we lose during that process?
Ultimately, the aim of this guide is to provide a foundation for further exploration and experimentation of Human Pose Recognition.
Key Terminology
CV → Computer Vision
HPR → Human Pose Recognition
HPE → Human Pose Estimation
CNN → Convolutional Neural Network
Landmarks → A set of defined coordinates that represent the different joints of the human body. The number of joints mapped varies from model to model, and the relative positions of landmarks can be used to distinguish one pose from another.
CVPR → Computer Vision and Pattern Recognition Conference: An annual conference on computer vision and pattern recognition, regarded as the most important conference in its field.
IMU → Inertial Measurement Unit: An electronic device which measures and records a body’s specific force, angular velocity and, sometimes, its orientation.
COCO → Common Objects in Context: A large-scale object detection, segmentation, and captioning dataset.
OpenPose → An open-source real-time multi-person system to detect not only human body joints but also hand, face and foot keypoints.
PoseNet → A machine learning model that allows for real-time Human Pose Estimation. A TensorFlow.js implementation enables the model to run in the browser, and this implementation has been integrated into the ml5.js library, which makes machine learning for the web more accessible and approachable!
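As a small illustration of how accessible these tools have become, the sketch below uses p5.js together with the ml5.js PoseNet wrapper mentioned above to draw detected landmarks over a webcam feed in the browser. It is a minimal example based on the documented ml5.js (0.x) interface; method names and options may differ between ml5 versions.

```javascript
// Minimal p5.js + ml5.js sketch: draw PoseNet landmarks over a webcam feed.
// Assumes p5.js and ml5.js (0.x) are loaded on the page via <script> tags.
let video;
let poseNet;
let poses = [];

function setup() {
  createCanvas(640, 480);
  video = createCapture(VIDEO);
  video.size(width, height);
  video.hide();

  // Load the PoseNet model and collect pose results as they arrive.
  poseNet = ml5.poseNet(video, () => console.log("PoseNet ready"));
  poseNet.on("pose", (results) => (poses = results));
}

function draw() {
  image(video, 0, 0, width, height);

  // Draw a dot on every landmark detected with reasonable confidence.
  for (const result of poses) {
    for (const keypoint of result.pose.keypoints) {
      if (keypoint.score > 0.2) {
        fill(255, 0, 255);
        noStroke();
        ellipse(keypoint.position.x, keypoint.position.y, 10, 10);
      }
    }
  }
}
```

From here, individual landmarks (wrists, shoulders, and so on) can be mapped to sound, visuals or interaction logic inside the same draw loop.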
02_History of Human Pose Recognition
This section presents an incomplete list of the many developments made in human pose recognition, as well as some early ideas surrounding mapping and representing bodily movements.
→ Long before the advent of computer vision, human pose recognition or computing in general, systems were created to translate the semiotics of our bodily movements into another language.
→ Movement scripts are one instance of such systems, developed to transcribe this visual-kinesthetic language and widely used across Europe in the 15th century. Many were invented to record a particular movement system, such as a dance idiom or a gestural vocabulary. They were seen as a technological breakthrough at the time, as no existing tool had been created for such a purpose.
→ Movement notation never became an integral part of dance study or practice. Technological advancements – especially the spread of video recording in the 1970s – greatly overshadowed movement notation. However, these early systems reflect many of the same goals and motives found in human pose recognition research: striving to create ever more accurate mathematical and graphical representations of movement itself.
→ In computer vision, Human Pose Estimation has been studied for decades. However, many methods prior to 2012/2013 had significant limitations around adaptability, speed, and hardware requirements that extended beyond a single RGB camera or monocular view. Over that time, various algorithms have had their moment in the spotlight – pictorial structures, for example – but in recent years human pose recognition has seen major developments with the advent of larger and more complex datasets (COCO, the CMU Motion Capture Dataset) and new machine-learning algorithms, enabling machines to establish a greater understanding of human body language through pose detection and pose tracking.
Learn how to dance La Macarena with SubZero, at 9gag.com
→ The importance and influence of this technology cannot be overstated, as we now have the capability to extract more information from a single image than ever before. At present, human pose estimation is used across a range of consumer and scientific domains, such as robotics, surveillance, gaming, and sports. It offers a new technique and perspective on how we can view and study body language, and utilise it as a tool (hopefully for good) to create more natural computer interfaces that are inclusive of a more visual/kinaesthetic form of communication.
→ The timeline below presents findings taken from the Computer Vision and Pattern Recognition (CVPR) conference. CVPR is an annual conference held to discuss and showcase the latest developments across a wide range of topics such as object detection, object segmentation, 3D reconstruction and human pose estimation, and is hailed as the most important conference in the field of computer vision.
I have attempted to include projects and papers that reflect key stages of development in human pose estimation over the past 10 years. The timeline is incomplete and its content at times quite technical, but each paper is usually accompanied by a video presentation that is visually fun to watch, providing a glimpse of what the future might look like!
Timeline
(2013)
→ Unconstrained Monocular 3D Pose Estimation by Action Detection and Regression Forest
→ A Stereo Camera Based Full Body Human Motion Capture System Using a Partitioned Particle Filter
(2014)
→ 3D Pose from Motion for Cross-view Action Recognition
→ A Layered Model of Human Body and Garment Deformation
(2015)
→ Pose-Conditioned Joint Angle Limits for 3D Human Pose Reconstruction
→ The Stitched Puppet: A Graphical Model of 3D Human Shape and Pose
→ Simultaneous Pose and Non-Rigid Shape with Particle Dynamics
(2016)
→ DeepHand: Robust Hand Pose Estimation by Completing a Matrix Imputed with Deep Features
→ SMPLify: 3D Human Pose and Shape from a Single Image
→ DeepCut
→ CMU Motion Capture Dataset: http://mocap.cs.cmu.edu/
→ End-to-End Learning of Deformable Mixture of Parts and Deep CNN for Human Pose Estimation
→ OpenPose
(2017)
→ Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields, CVPR 2017 Oral
→ Estimating body shape under clothing
→ A simple yet effective baseline for 3d human pose estimation
→ Monocular 3D Human Pose Estimation In The Wild Using Improved CNN Supervision
(2018)
→ DensePose: Dense Human Pose Estimation In The Wild
→ Using a Single RGB Frame for Real Time 3D Hand Pose Estimation in the Wild
(2019)
→ MonoPerfCap: Human Performance Capture from Monocular Video
→ CVPR 2019 Oral Session 3-2B: Face & Body
→ BodyFusion: Real-time Capture of Human Motion and Surface Geometry Using a Single Depth Camera
→ Fast and Robust Multi-Person 3D Pose Estimation from Multiple Views
→ 5 2D cameras + SMPLify
→ DeepCap
→ SMPL-X
(2020)
→ Object-Occluded Human Shape and Pose Estimation from a Single Color Image
→ Contact and Human Dynamics from Monocular Video
→ ExPose: Monocular Expressive Body Regression through Body-Driven Attention
→ VIBE: Video Inference for Human Body Pose and Shape Estimation
(2021)
→ AGORA human pose and shape dataset
→ FrankMocap: A Strong and Easy-to-use Single View 3D Hand+Body Pose Estimator
→ HybrIK - A Hybrid Analytical-Neural IK Solution for 3D Human Pose and Shape Estimation
→ SimPoE: Simulated Character Control for 3D Human Pose Estimation
→ TUCH: On Self-Contact and Human Pose (interesting for showing issues with contact)
→ POSA: Populating 3D Scenes by Learning Human-Scene Interaction (contact + proximity)
→ Human POSEitioning System uses IMUs
→ img2pose: Face Alignment and Detection via 6DoF, Face Pose Estimation
→ PIFuHD
→ PaMIR
→ ARCH++
(2022)
→ GLAMR: Global Occlusion-Aware Human Mesh Recovery with Dynamic Cameras
→ OSSO: Obtaining Skeletal Shape from Outside
→ BEV: Monocular Regression of Multiple 3D People in Depth
03_Re/defining Creativity
So now, how can we use such tools to translate bodily movements into something meaningful?
Through these developments, human pose recognition has now reached a point where it has become widely accessible and commercially viable. State-of-the-art models such as OpenPose and PoseNet have enabled and inspired more developers and makers to experiment and apply pose detection in their own unique projects.
This field is still in its infancy and has yet to be fully explored by many creative practitioners, designers and artists. However, we should be excited about how its many possibilities could reshape how we interact with digital technologies. As I stated in the introduction, body language is an intrinsic component of how we communicate and share knowledge with each other as humans, and for the most part what we output to the world is done subconsciously. As human pose recognition technologies become more and more ubiquitous, they may shine a spotlight on this forgotten language in the wide digital landscape, and allow us to critically observe and subsequently reconfigure our approach to this (in)visible language which we all know.
“The human mind ‘knows’ body language from a kind of primordial memory. We seem to be capable of reading different meanings in different expressions and postures by the second, translating it into emotions based on our personal and cultural experiences when interacting with others. Teaching this complex and often subconscious ‘body knowledge’ to an AI is a different story.”
Coralie Vogelaar
NOTABLE CREATORS
04_DIY: Proceed with Caution
When talking about anything AI- and ML-related, we must be very careful in how we plan and develop such systems and tools. We should also consider their wider influence and impact on ecologies that extend beyond the more typical use cases for such technologies. In the context of Human Pose Estimation, our approach should be no different. We should be critical of the techniques researchers and big tech companies adopt when working with artificial intelligence; right down to how they curate their datasets, their hardware and power requirements, and how these technologies are applied “in the wild”.
Beyond the overarching problems that plague the AI pipeline, such as labour and bias, I’m unsure what potential issues we may face due to human pose estimation technologies, and so my observations may be speculative at times. However, I also hope to touch upon the limitations of current human pose estimation technologies, alongside the problems that researchers face moving forward.
→ Datasets & Cognitive Sweatshops
A dataset is one of the most important elements behind most of the models and tools we see and use today. Curating a large dataset that contains a diverse set of information such as images and text – or, in our case, a wide set of poses captured from different perspectives and body shapes – allows a more accurate representation and estimation to be achieved. Paired with highly detailed and complex image labelling, this can really assist an algorithm in its quest to achieve its desired output.
“Again, the myth of AI as affordable and efficient depends on layers of exploitation, including the extraction of mass unpaid labor to fine-tune the AI systems of the richest companies on earth” -
Kate Crawford
For example, OpenPose was trained on the Common Objects in Context (COCO) dataset created by Microsoft, which contains over 250,000 people with their keypoints labelled. There are even 3D motion capture datasets, such as the CMU Motion Capture Dataset, which have helped provide even more accurate labelling of motion.
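To give a sense of what this keypoint labelling actually produces, each annotated person in the COCO keypoints data is stored roughly as sketched below: 17 named joints flattened into (x, y, visibility) triplets, plus a bounding box. The record is simplified and the values are invented for illustration; it is not an excerpt from the real annotation files.

```javascript
// Simplified, hand-written sketch of one COCO person keypoint annotation.
// Real annotations live in large JSON files; the values here are invented.
const annotation = {
  image_id: 123456,   // which image this person appears in
  category_id: 1,     // category 1 is "person"
  num_keypoints: 13,  // how many of the 17 joints were actually labelled
  // 17 joints flattened into (x, y, visibility) triplets.
  // visibility: 0 = not labelled, 1 = labelled but occluded, 2 = visible
  keypoints: [
    423, 170, 2,      // nose
    430, 162, 2,      // left_eye
    416, 161, 2,      // right_eye
    // … ears, shoulders, elbows, wrists, hips, knees, ankles follow
  ],
  bbox: [380, 140, 120, 310], // person bounding box: [x, y, width, height]
};
```

Every one of those triplets was placed by a human annotator, which is exactly the kind of work described below.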
Even though these large-scale datasets exist, they require a vast amount of labour to produce, and at times this is not achieved ethically. In the case of COCO, one of the most popular large-scale labelled image datasets, the corpus of images was annotated through “crowdsourcing” tasks, using workers from Amazon’s Mechanical Turk (AMT) platform. Workers are primarily tasked with tagging images and checking whether an algorithm is producing accurate results. In the majority of cases, these workers are underpaid, and their contribution becomes lost in the long AI pipeline; compensating them fairly would only add cost for large companies and corporations driven by capitalistic greed. This is one of the less recognised facts about AI and its infrastructure, and I can only hope it changes in the future.
If you are curious about the images used in the COCO dataset, Roboflow has developed a web application to allow users to easily browse through its corpus. I recently made a post on Instagram that showcases some of the unusual images that I stumbled upon!
→ Body language in the wild
As I mentioned earlier, human pose estimation is a major branch of computer vision research. One of the biggest industries that can take advantage of new developments in computer vision technologies is surveillance. Privacy is practically an illusion at this point, with every corner of the urban landscape infested with cameras that collect, mine and process information about individuals and their interactions with their environment, often for opaque agendas.
Human pose estimation is unique when compared to many other popular machine learning algorithms applied in surveillance systems, and it raises its own questions: what information can be extracted from our body language and movements, and how could that be used to target and harm an individual?
However, some researchers are attempting to offer a new perspective on human pose estimation when integrating it into such systems, arguing that it could be used as a method to preserve privacy by prioritising the reconstructed skeletal structure over any facial features – reducing ethnic bias and complexity.
Anonymized results of a central station scene. Re-identification shows stable track results in the foreground. Background tracks appear more unstable due to smaller boxes and increased occlusion. Temporal consistency partially stabilized background tracks. From “Where Are We With Human Pose Estimation in Real-World Surveillance?”, Mickael Cormier, Aris Clepe, Andreas Specker, Jürgen Beyerer, Fraunhofer IOSB, Karlsruhe, Germany. Image Source
Future Obstacles
Some questions to ask and things to consider in relation to the use and development of HPR:
→ Datasets
→ What datasets are used in the creation of prolific models such as OpenPose?
→ It is possible to use tools for exploring the contents of an image dataset (COCO), and it’s worth doing so
→ What are the logistics of the curation of dataset used to train HPR models?
→ Issues with contemporary HPR models
→ Body diversity and age diversity - issues with recognising different age groups
→ Occlusion - a challenging problem in CV/HPR
→ Most models don’t explicitly model depth
→ State-of-the-art models fail to consider contact, both with others and with oneself (self-contact)
→ Privacy
→ How HPR is used as a mechanism to extract data in surveillance systems
→ What data can be extracted from our body language? How can that data be used unethically? How is it being used now?
→ Its invisible nature and how we might struggle to realise its implications
NOTABLE CREATORS, ARTISTS & RESEARCHERS
05_Tools & Resources
ml5.js → https://ml5js.org/
OpenPose Guide 2022 → https://viso.ai/deep-learning/openpose/
Computer Vision Art Gallery, CVPR. → https://computervisionart.com/
Experimental Film with ML: Week 2 (OpenPose Pose Detection in Google Colab + p5.js)
Cool “2D Image To 3D Mesh Object” AI Research I Found - July 2022
06_Closing remarks
While Human Pose Recognition might be less explored by creatives compared to image-generating and text-generating AIs, it is clear that an in-depth artistic and creative exploration of this technology is happening at the moment. Creators investigating HPR are probing its potential and producing experiments that benefit both art and science. They are also building on the rich tradition, present in media art and net art, of exploring technology to understand what it means to live in a physical body. Using reflection and artistic sensitivity, makers working with HPR address biases and harms done to humans who exist within the global technological stack and call for better ecologies of human-machine coexistence.
Looking at how this technology is advancing and becoming more accessible to independent makers, it is clear that there is a lot more to come. We look forward to seeing more experimentation and more shared resources, and we hope this guide can contribute to the pool of knowledge.
💜
AI PLAYGROUND S01 / BODY
This guide is a part of our community program AI Playground / Body. AI Playground is an event series and a collection of Guides, structured under four topics: Image, Text, Body and Sound. As part of the program we hosted 2 events:
Artist Talk: Navigating the Self + Body on the Internet | Artist talk w/ Maya Man
Workshop: Learn to Fingerp(AI)nt with Words | Workshop w/ Computational Mama