Deep Reinforcement Learning for Human Dressing Motion Synthesis

Learning to dress: Synthesizing human dressing motion via deep reinforcement learning
 
INTRODUCTION
INTRODUCTION
Two main purposes:
Traverse the inside of the garment
Prevent damage to the garment
INTRODUCTION
 
Learning a single control policy to achieve all these distinct motor skills and
execute them sequentially is impractical
 
Break down a full dressing sequence into subtasks and learn a control policy for each subtask
 
Grasping the T-shirt
Tucking a hand into the T-shirt
Pushing a hand through a sleeve
 
Policy sequencing algorithm: handles the transition at each policy switch
INTRODUCTION
Producing a successful policy for a single subtask requires hours of simulation and
optimization
Benefit:
The end result is not a single animation, but a character control policy that is capable of
handling variations in the initial cloth position and character pose.
RELATED WORK
Dexterous Manipulation of Cloth [Bai et al. 2016]
REINFORCEMENT LEARNING BACKGROUND
Markov Decision Process (MDP) is a tuple
S : state space
A : action space
r : reward function
ρ : distribution of the initial state s_0
P_sas' : transition probability
γ : discount factor
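Written out with the symbols above, this is the standard discounted-return formulation of RL; the equations below are the usual textbook objective, not formulas quoted from the paper.

```latex
% MDP tuple and the standard discounted objective (standard RL notation,
% matching the symbols listed on this slide).
\[
\mathcal{M} = (S, A, r, \rho, P_{sas'}, \gamma), \qquad
s_0 \sim \rho, \quad s_{t+1} \sim P(\cdot \mid s_t, a_t)
\]
\[
J(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_{t \ge 0} \gamma^{t}\, r(s_t, a_t) \right],
\qquad \pi^{*} = \arg\max_{\pi} J(\pi)
\]
```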
REINFORCEMENT LEARNING BACKGROUND
Partially Observable Markov Decision Process (POMDP)
Humans do not have direct perception of the full state of the world and themselves
In the case of dressing, humans have limited perception of the state of the garment outside of haptic and visual observations.
O is a subspace of the state space S
Goal: optimize the policy π, represented as a neural network, such that the expected accumulated reward is maximized.
All subtasks share the same action space
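A minimal sketch of what such a policy network might look like. The layer sizes, activations, and use of PyTorch are illustrative assumptions, not details from the paper; only the 163-dimensional observation and the 22 actuated joints (described on the following slides) come from the slides.

```python
import torch
import torch.nn as nn

# Illustrative feedforward policy: observation vector in, one action per actuated joint out.
# Hidden sizes and activations are assumptions, not the paper's exact architecture.
class DressingPolicy(nn.Module):
    def __init__(self, obs_dim: int = 163, act_dim: int = 22, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),  # one output per actuated degree of freedom
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# Usage: one observation vector in, one action vector out.
policy = DressingPolicy()
action = policy(torch.zeros(1, 163))
print(action.shape)  # torch.Size([1, 22])
```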
SEQUENCING CONTROL POLICIES
OBSERVATION SPACE
The full state space of dressing tasks is typically high-dimensional
Formulate a compact observation space that is tailored for
dressing tasks.
O = [O_p, O_f, O_h, O_s, O_t]
With carefully picked components, the observation is a 163-dimensional vector.
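A sketch of how the components could be concatenated into a single flat observation vector. The per-component sizes below are illustrative assumptions; only the 163-dimensional total is stated in the paper.

```python
import numpy as np

def build_observation(o_p, o_f, o_h, o_s, o_t):
    """Concatenate the observation components into one flat vector.

    o_p : proprioception (joint angles)
    o_f : garment feature location
    o_h : haptic sensor readings
    o_s : signed-surface values
    o_t : task vector
    """
    return np.concatenate([np.ravel(x) for x in (o_p, o_f, o_h, o_s, o_t)])

# Component sizes here are illustrative only; the paper states the full vector is 163-dimensional.
obs = build_observation(
    o_p=np.zeros(22),        # 22 actuated joint angles
    o_f=np.zeros(3),         # e.g. centroid of the sleeve opening (size assumed)
    o_h=np.zeros((21, 3)),   # 21 haptic sensors, 3-D reading each
    o_s=np.zeros(21),        # one surface sign per sensor
    o_t=np.zeros(3),         # task vector (size assumed)
)
print(obs.shape)
```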
OBSERVATION SPACE
O_p : proprioception (sense of one's own body pose)
q(s) is the vector of joint angles describing the human pose at state s.
The human model in this work contains 22 degrees of freedom, all of which are actuated.
OBSERVATION SPACE
O_f : garment feature location
The current location of a garment feature (e.g., a sleeve opening)
c : the world position of the centroid
p : the world position of the garment polygon
OBSERVATION SPACE
O_h : haptics
Humans rely on haptic sensing during dressing to avoid damage to clothes and to minimize discomfort.
f_i : 3-dimensional reading of haptic sensor i
n = 21 sensors (22 nodes)
OBSERVATION SPACE
O_s : signed surface
Provide the policy with a surface sign for each haptic sensor i that differentiates contact between the inner and outer surfaces of the garment.
If the sum of the assigned values for sensor i is positive, we consider that the sensor is in contact with the surface from inside.
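A small sketch of how the per-sensor surface sign could be derived. Only the sign-of-sum rule comes from the slide; the ±1 encoding per cloth contact is an assumption used for illustration.

```python
import numpy as np

def surface_signs(contact_values):
    """contact_values[i] is the list of assigned values for haptic sensor i
    (e.g. +1 for inner-surface contacts, -1 for outer-surface contacts; this
    +/-1 encoding is an illustrative assumption).
    Returns +1 if the sensor is considered in contact from inside, -1 otherwise.
    """
    return np.array([1 if sum(vals) > 0 else -1 for vals in contact_values])

# Example: sensor 0 touches mostly the inner surface, sensor 1 mostly the outer surface.
print(surface_signs([[1, 1, -1], [-1, -1]]))  # [ 1 -1]
```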
OBSERVATION SPACE
O_t : task vector
The task vector depends on geodesic information when the limb is in contact with the garment but has not yet entered the garment feature.
REWARD FUNCTION
A good reward function is important to the success of reinforcement learning.
r_p : progress reward
r_d : deformation penalty
r_g : geodesic reward
r_t : end effector motion in the direction of the task vector
r_r : attracts the character to a target position
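A sketch of how these terms might be combined into a single scalar reward. The weighted-sum form and the default weights are assumptions for illustration; only the individual terms come from the slides.

```python
def total_reward(r_p, r_d, r_g, r_t, r_r, weights=None):
    """Combine the per-term rewards into one scalar.

    The weighted-sum form and default weights are illustrative assumptions;
    the individual terms follow the definitions on the surrounding slides.
    Note that r_d is a penalty, so it is subtracted.
    """
    w = weights or {"p": 1.0, "d": 1.0, "g": 1.0, "t": 1.0, "r": 1.0}
    return (w["p"] * r_p - w["d"] * r_d + w["g"] * r_g
            + w["t"] * r_t + w["r"] * r_r)

print(total_reward(r_p=0.4, r_d=0.1, r_g=0.2, r_t=0.05, r_r=0.1))
```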
REWARD FUNCTION
r_p : progress reward
p_i, i = 0, …, m : the joints of the limb
Check c_i for each bone until the first encounter of c_i = 1
r is computed from b_k^int and the feature polygon P
||b_i|| : the length of bone i
c : centroid of the polygon P
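One plausible reading of this progress measure, written out as a sketch: accumulate the lengths of the bones that have already passed through the feature polygon, plus the intersected portion of the first bone that has not. This interpretation and the helper names are assumptions, not the paper's exact definition.

```python
def limb_progress(bones, contained, intersect_fraction):
    """bones[i]          : length ||b_i|| of bone i along the limb
    contained[i]         : c_i flag, True if bone i has fully passed the feature polygon P
    intersect_fraction   : fraction of the first non-contained bone lying past P
    The accumulation rule below is an illustrative interpretation of the slide.
    """
    progress = 0.0
    for length, c in zip(bones, contained):
        if c:                                   # bone fully through the garment feature
            progress += length
        else:                                   # first bone still crossing the feature polygon
            progress += intersect_fraction * length
            break
    return progress

print(limb_progress(bones=[0.25, 0.25, 0.1], contained=[True, False, False],
                    intersect_fraction=0.4))   # 0.35
```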
REWARD FUNCTION
r_d : deformation penalty
w_mid (= 25) : midpoint of the deformation penalty range
w_scale (= 0.14) : scales the slope and upper/lower limits of the deformation penalty function
This formulation of the deformation penalty results in little to no penalty for small deformations, in order to encourage the use of contact for dressing.
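A hedged sketch of a penalty with the properties described above, written with the two listed parameters. The logistic shape is an assumption chosen to match "little to no penalty for small deformations", not a formula quoted from the paper.

```latex
% Illustrative logistic-shaped penalty (assumed form), where d is a scalar
% measure of cloth deformation at the current state:
\[
r_d(d) \;=\; -\,\frac{1}{1 + e^{-w_{\text{scale}}\,(d - w_{\text{mid}})}},
\qquad w_{\text{mid}} = 25,\; w_{\text{scale}} = 0.14
\]
% Small deformations (d << w_mid) give r_d near 0; large deformations saturate near -1.
```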
REWARD FUNCTION
r_g : geodesic contact
REWARD FUNCTION
r_t : task vector displacement
O_t : task vector observation
t : current simulation step
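A sketch of an end-effector displacement reward along the task vector, consistent with "end effector motion in the direction of the task vector"; the exact dot-product form is an assumption.

```latex
% Illustrative form (assumed): reward the displacement of the end effector x_e
% between simulation steps t-1 and t, projected onto the task vector O_t.
\[
r_t \;=\; \big(x_e^{(t)} - x_e^{(t-1)}\big) \cdot \hat{O}_t
\]
```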
REWARD FUNCTION
r_r : target position
q(s) is the current pose of the character; the reward attracts it toward a goal pose.
RESULTS
 
RESULTS