Deep Reinforcement Learning for Human Dressing Motion Synthesis

Learning to dress: Synthesizing human dressing motion via deep reinforcement learning
 
INTRODUCTION
INTRODUCTION
Two main purposes:
Traverse the inside of the garment
Prevent damage to the garment
INTRODUCTION
 
Learning a single control policy to achieve all these distinct motor skills and
execute them sequentially is impractical
 
Break down a full dressing sequence into subtasks and learn a control policy for each subtask
 
Grasping the T-shirt
Tucking a hand into the T-shirt
Pushing a hand through a sleeve
 
Policy sequencing algorithm: handles the transition at each policy switch
INTRODUCTION
Producing a successful policy for a single subtask requires hours of simulation and
optimization
Benefit:
The end result is not a single animation, but a character control policy that is capable of
handling variations in the initial cloth position and character pose.
RELATED WORK
Dexterous Manipulation of Cloth [Bai et al. 2016]
REINFORCEMENT LEARNING BACKGROUND
Markov Decision Process (MDP) is a tuple
S : state space
A : action space
r : reward function
ρ : distribution of the initial state s_0
P_sas' : transition probability
γ : discount factor
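Written out with the symbols above, this is the standard discounted-return formulation of RL; the equations below are the usual textbook objective, not formulas quoted from the paper.

```latex
% MDP tuple and the standard discounted objective (standard RL notation,
% matching the symbols listed on this slide).
\[
\mathcal{M} = (S, A, r, \rho, P_{sas'}, \gamma), \qquad
s_0 \sim \rho, \quad s_{t+1} \sim P(\cdot \mid s_t, a_t)
\]
\[
J(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_{t \ge 0} \gamma^{t}\, r(s_t, a_t) \right],
\qquad \pi^{*} = \arg\max_{\pi} J(\pi)
\]
```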
REINFORCEMENT LEARNING BACKGROUND
Partially Observable Markov Decision Process (POMDP)
Humans do not have direct perception of the full state of the world and themselves
In the case of dressing, humans have limited perception of the state of the garment outside of haptic and visual observations.
O is a subspace of the state space S
Goal: optimize the policy π, represented as a neural network, such that the expected accumulated reward is maximized.
All subtasks share the same action space
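A minimal sketch of what such a policy network might look like. The layer sizes, activations, and use of PyTorch are illustrative assumptions, not details from the paper; only the 163-dimensional observation and the 22 actuated joints (described on the following slides) come from the slides.

```python
import torch
import torch.nn as nn

# Illustrative feedforward policy: observation vector in, one action per actuated joint out.
# Hidden sizes and activations are assumptions, not the paper's exact architecture.
class DressingPolicy(nn.Module):
    def __init__(self, obs_dim: int = 163, act_dim: int = 22, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),  # one output per actuated degree of freedom
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# Usage: one observation vector in, one action vector out.
policy = DressingPolicy()
action = policy(torch.zeros(1, 163))
print(action.shape)  # torch.Size([1, 22])
```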
SEQUENCING CONTROL POLICIES
OBSERVATION SPACE
The full state space of dressing tasks is typically high-dimensional
Formulate a compact observation space that is tailored for
dressing tasks.
O = [O_p, O_f, O_h, O_s, O_t]
With carefully picked components, the observation is a 163-dimensional vector.
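A sketch of how the components could be concatenated into a single flat observation vector. The per-component sizes below are illustrative assumptions; only the 163-dimensional total is stated in the paper.

```python
import numpy as np

def build_observation(o_p, o_f, o_h, o_s, o_t):
    """Concatenate the observation components into one flat vector.

    o_p : proprioception (joint angles)
    o_f : garment feature location
    o_h : haptic sensor readings
    o_s : signed-surface values
    o_t : task vector
    """
    return np.concatenate([np.ravel(x) for x in (o_p, o_f, o_h, o_s, o_t)])

# Component sizes here are illustrative only; the paper states the full vector is 163-dimensional.
obs = build_observation(
    o_p=np.zeros(22),        # 22 actuated joint angles
    o_f=np.zeros(3),         # e.g. centroid of the sleeve opening (size assumed)
    o_h=np.zeros((21, 3)),   # 21 haptic sensors, 3-D reading each
    o_s=np.zeros(21),        # one surface sign per sensor
    o_t=np.zeros(3),         # task vector (size assumed)
)
print(obs.shape)
```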
OBSERVATION SPACE
O_p : proprioception (sense of one's own body pose)
q(s) is the vector of joint angles describing the human pose at state s.
The human model in this work contains 22 degrees of freedom, all of which are actuated.
OBSERVATION SPACE
O_f : garment feature location
The current location of a garment feature (e.g., a sleeve opening)
c : the world position of the centroid
p : the world position of the garment polygon
OBSERVATION SPACE
O_h : haptics
Humans rely on haptic sensing during dressing to avoid damage to clothes and to minimize discomfort.
f_i : 3-dimensional reading of haptic sensor i
n = 21 sensors (22 nodes)
OBSERVATION SPACE
O_s : signed surface
Provide the policy with a surface sign for each haptic sensor i that differentiates contact between the inner and outer surfaces of the garment.
If the sum of the assigned values for sensor i is positive, we consider that the sensor is in contact with the surface from inside.
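A small sketch of how the per-sensor surface sign could be derived. Only the sign-of-sum rule comes from the slide; the ±1 encoding per cloth contact is an assumption used for illustration.

```python
import numpy as np

def surface_signs(contact_values):
    """contact_values[i] is the list of assigned values for haptic sensor i
    (e.g. +1 for inner-surface contacts, -1 for outer-surface contacts; this
    +/-1 encoding is an illustrative assumption).
    Returns +1 if the sensor is considered in contact from inside, -1 otherwise.
    """
    return np.array([1 if sum(vals) > 0 else -1 for vals in contact_values])

# Example: sensor 0 touches mostly the inner surface, sensor 1 mostly the outer surface.
print(surface_signs([[1, 1, -1], [-1, -1]]))  # [ 1 -1]
```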
OBSERVATION SPACE
O_t : task vector
The task vector depends on geodesic information when the limb is in contact with the garment but has not yet entered the garment feature.
REWARD FUNCTION
A good reward function is important to the success of reinforcement learning.
r_p : progress reward
r_d : deformation penalty
r_g : geodesic reward
r_t : end effector motion in the direction of the task vector
r_r : attracts the character to a target position
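A sketch of how these terms might be combined into a single scalar reward. The weighted-sum form and the default weights are assumptions for illustration; only the individual terms come from the slides.

```python
def total_reward(r_p, r_d, r_g, r_t, r_r, weights=None):
    """Combine the per-term rewards into one scalar.

    The weighted-sum form and default weights are illustrative assumptions;
    the individual terms follow the definitions on the surrounding slides.
    Note that r_d is a penalty, so it is subtracted.
    """
    w = weights or {"p": 1.0, "d": 1.0, "g": 1.0, "t": 1.0, "r": 1.0}
    return (w["p"] * r_p - w["d"] * r_d + w["g"] * r_g
            + w["t"] * r_t + w["r"] * r_r)

print(total_reward(r_p=0.4, r_d=0.1, r_g=0.2, r_t=0.05, r_r=0.1))
```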
REWARD FUNCTION
r_p : progress reward
p_i, i = 0, …, m : the joints of the limb
Check c_i for each bone until the first encounter of c_i = 1
r is computed from b_k^int and the feature polygon P
||b_i|| : the length of bone i
c : centroid of the polygon P
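One plausible reading of this progress measure, written out as a sketch: accumulate the lengths of the bones that have already passed through the feature polygon, plus the intersected portion of the first bone that has not. This interpretation and the helper names are assumptions, not the paper's exact definition.

```python
def limb_progress(bones, contained, intersect_fraction):
    """bones[i]          : length ||b_i|| of bone i along the limb
    contained[i]         : c_i flag, True if bone i has fully passed the feature polygon P
    intersect_fraction   : fraction of the first non-contained bone lying past P
    The accumulation rule below is an illustrative interpretation of the slide.
    """
    progress = 0.0
    for length, c in zip(bones, contained):
        if c:                                   # bone fully through the garment feature
            progress += length
        else:                                   # first bone still crossing the feature polygon
            progress += intersect_fraction * length
            break
    return progress

print(limb_progress(bones=[0.25, 0.25, 0.1], contained=[True, False, False],
                    intersect_fraction=0.4))   # 0.35
```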
REWARD FUNCTION
r_d : deformation penalty
w_mid (= 25) : midpoint of the deformation penalty range
w_scale (= 0.14) : scales the slope and upper/lower limits of the deformation penalty function
This formulation of the deformation penalty results in little to no penalty for small deformations, in order to encourage the use of contact for dressing.
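A hedged sketch of a penalty with the properties described above, written with the two listed parameters. The logistic shape is an assumption chosen to match "little to no penalty for small deformations", not a formula quoted from the paper.

```latex
% Illustrative logistic-shaped penalty (assumed form), where d is a scalar
% measure of cloth deformation at the current state:
\[
r_d(d) \;=\; -\,\frac{1}{1 + e^{-w_{\text{scale}}\,(d - w_{\text{mid}})}},
\qquad w_{\text{mid}} = 25,\; w_{\text{scale}} = 0.14
\]
% Small deformations (d << w_mid) give r_d near 0; large deformations saturate near -1.
```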
REWARD FUNCTION
r_g : geodesic contact
REWARD FUNCTION
r_t : task vector displacement
O_t : task vector observation
t : current simulation step
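A sketch of an end-effector displacement reward along the task vector, consistent with "end effector motion in the direction of the task vector"; the exact dot-product form is an assumption.

```latex
% Illustrative form (assumed): reward the displacement of the end effector x_e
% between simulation steps t-1 and t, projected onto the task vector O_t.
\[
r_t \;=\; \big(x_e^{(t)} - x_e^{(t-1)}\big) \cdot \hat{O}_t
\]
```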
REWARD FUNCTION
r_r : target position
q(s) is the current pose of the character; the reward attracts it toward a goal pose.
RESULTS
 
RESULTS