Addressing Monitoring Challenges in Hardware Offloading

move_pages() for physical addresses: move_phys_pages()
Thesis
CXL hardware vendors (especially 2LM) are asking what work can be offloaded
Proposals exist for a variety of offloaded monitoring features
Examples: heatmaps, IDLE bits, hotness lists, access locality info
The Problems:
Devices talk Host Physical Address or Device Physical Address (0-based)
No userland physical-address migration interface. No hardware standard.
move_pages: uses virtual addresses, requires a pid
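To make the "virtual addresses, requires a pid" point concrete, here is a minimal sketch of calling the existing move_pages syscall in query-only mode (nodes == NULL), which reports the NUMA node backing a virtual address of the calling process. The helper name page_node is hypothetical, and the raw syscall() form is used only to avoid a libnuma dependency.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: return the NUMA node backing the page that
 * contains addr in the calling process, or a negative errno value. */
static int page_node(void *addr)
{
    uintptr_t page_size = (uintptr_t)sysconf(_SC_PAGESIZE);
    /* move_pages expects page-aligned *virtual* addresses. */
    void *page = (void *)((uintptr_t)addr & ~(page_size - 1));
    int status = -1;

    /* pid 0 means "the calling process"; nodes == NULL means
     * "do not migrate, only report the current node in status". */
    long rc = syscall(SYS_move_pages, 0, 1UL, &page, NULL, &status, 0);
    if (rc < 0)
        return -errno;
    return status; /* node id, or a negative errno for this page */
}
```

Note that the caller must already know both the owning task and the virtual address. A device that only reports physical addresses has neither.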
Why do we even want to do this?
Current State of Tiering
Interception/code changes ("legacy Optane" methods)
IBS/PEBS-based monitoring (sampling, provides HVA and/or HPA)
Transparent Page Placement (TPP) and AutoNUMA (charging/faults)
Page/folio flag monitoring (IDLE bitmap, PFN-based)
Systems like DAMON already use a large variety of these techniques.
All of these have a variety of performance or functionality issues.
IBS/PEBS (Sample-based)
Most common sampling metric: LLC cache misses
Each sample provides HVA/HPA + PID
Address-to-node mapping is queried and tracked
The Problems
Prefetch traffic is captured differently (or potentially missed)
Runtime overhead depends on sample rate
PMU counters are often already in use (shared/unavailable)
Transparent Page Placement (TPP)
Primary Mechanism: Fault-based
Mark some pages not-present, catch the fault, demote cold pages
Recent extension for promotion based on lower-tier fault detection
Additional extensions are pushing to avoid fault overhead.
Problems
Fault overhead / tail latencies
Can be complicated to tune depending on workload
IDLE-bit Tracking
Mechanism: (struct page)->flags & PG_idle (now in folio)
User can mark PFNs as idle/not idle.
Presumably updated by the kernel to denote that a PFN is not idle.
Problems:
Presently appears broken from user space due to the transition to folios
PFN-based lookup
Still requires a virtual-to-physical translation (/proc/pid/pagemap)
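As a rough illustration of why this mechanism is PFN-based: the kernel's idle page tracking documentation describes /sys/kernel/mm/page_idle/bitmap as an array of 8-byte integers in which PFN i maps to bit i % 64 of element i / 64. The sketch below only computes where a PFN's bit lives. The struct and function names are hypothetical, and no file I/O is attempted.

```c
#include <stdint.h>

/* Hypothetical helper type: where a PFN's idle bit lives in the bitmap. */
struct idle_bit_pos {
    uint64_t byte_offset; /* offset of the 8-byte word (always 8-aligned) */
    unsigned bit;         /* bit index inside that word, 0..63 */
};

/* Map a PFN to its position: bit (pfn % 64) of array element (pfn / 64). */
static struct idle_bit_pos idle_bit_for_pfn(uint64_t pfn)
{
    struct idle_bit_pos pos;
    pos.byte_offset = (pfn / 64) * 8;
    pos.bit = (unsigned)(pfn % 64);
    return pos;
}
```

A tool working from virtual addresses still has to translate each one to a PFN before it can use these offsets.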
Proposed: Offloaded Page-Tracking
Why: Devices with multiple layers of memory (DRAM + SSD + Network)
Idea: 2LM (DRAM+SSD) offloads page faults. Use the data to promote.
Mechanisms: IDLE bits, heatmap, hotness list, etc. (you name it)
The Big Problems
Devices have no concept of tasks / virtual memory
Physical/device addressing: reverse lookup is very expensive
No standard interface: hard to build core kernel support
Reverse Lookup Overhead
Question: How bad is a reverse lookup?
Must convert HPA to HVA to use move_pages
No direct interface, so we have to build a map for all tasks using the node.
Contrived Test:
Run 4-48 memory hogs at different capacities and system loads.
Use /proc/pid/maps and /proc/pid/pagemap to build the reverse map.
Measure build/merge time of the reverse maps (time to insight).
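Building the reverse map means reading one 64-bit /proc/pid/pagemap entry per virtual page and extracting the PFN. The sketch below shows only the decoding step, following the bit layout in the kernel's pagemap documentation (bits 0-54 PFN when present, bit 62 swapped, bit 63 present). The macro and function names are hypothetical. Without CAP_SYS_ADMIN the kernel reports the PFN field as zero.

```c
#include <stdbool.h>
#include <stdint.h>

#define PM_PFN_MASK ((1ULL << 55) - 1) /* bits 0-54: PFN if present */
#define PM_SWAPPED  (1ULL << 62)       /* page is in swap */
#define PM_PRESENT  (1ULL << 63)       /* page is present in RAM */

/* Byte offset of the pagemap entry describing virtual address vaddr. */
static uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
    return (vaddr / page_size) * 8; /* one 8-byte entry per page */
}

/* Extract the PFN from a pagemap entry. Returns false if the page is
 * not present in RAM, in which case *pfn is left untouched. */
static bool pagemap_pfn(uint64_t entry, uint64_t *pfn)
{
    if (!(entry & PM_PRESENT) || (entry & PM_SWAPPED))
        return false;
    *pfn = entry & PM_PFN_MASK;
    return true;
}
```

Every mapped page of every task on the node needs one such read, which is where the cost measured in this test comes from.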
Test Results
Map build time increases ~linearly with capacity, and non-linearly if tasks > CPUs
Hitting /proc/pid/pagemap aggressively
move_phys_pages would alleviate these overheads and reduce time-to-action
ONLY GOOD FOR ONE SNAPSHOT IN TIME
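For illustration, a userland reverse map of the kind measured here can be modelled as an array of (PFN, pid, virtual address) records sorted by PFN and searched with a binary search. This is a sketch under assumed names (rmap_entry, rmap_sort, rmap_find), not the test harness used for these results.

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical record: one mapping of a physical page into a task. */
struct rmap_entry {
    uint64_t pfn;   /* physical frame number (the lookup key) */
    pid_t    pid;   /* task that maps the page */
    uint64_t vaddr; /* virtual address of the page in that task */
};

/* Order entries by PFN so the merged map can be binary searched. */
static int rmap_cmp(const void *a, const void *b)
{
    const struct rmap_entry *x = a, *y = b;
    return (x->pfn > y->pfn) - (x->pfn < y->pfn);
}

static void rmap_sort(struct rmap_entry *map, size_t n)
{
    qsort(map, n, sizeof(*map), rmap_cmp);
}

/* Find some task mapping pfn, or NULL. A shared page can have several
 * entries; this returns an arbitrary one of them. */
static const struct rmap_entry *rmap_find(const struct rmap_entry *map,
                                          size_t n, uint64_t pfn)
{
    struct rmap_entry key = { .pfn = pfn, .pid = 0, .vaddr = 0 };
    return bsearch(&key, map, n, sizeof(*map), rmap_cmp);
}
```

The map goes stale as soon as any task faults, frees, or migrates a page, so it has to be rebuilt for each decision.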
Offloaded Tracking Limitations
Hardware has no contextual information about a page
A "physical page" could be re-used rapidly (process creation/death)
Transparent Huge Pages
Many unknowns due to a chicken-and-egg situation:
No tracking offload implementations in hardware.
No interface because no ("acceptable") use case.
move_phys_pages Implementation
Working branch based on v6.6 on my GitHub
move_pages(pid, count, pages, nodes, status, flags)
Remove pid: move_phys_pages(count, pages, nodes, status, flags)
Re-uses move_pages code: 2 commits + documentation
1 commit to refactor, 1 commit to implement the syscall
Does the reverse lookup anyway, but efficiently (HPA -> folios)
Validates that addresses are valid and movable
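A userland caller of the proposed syscall might look like the sketch below. The argument order follows the slide. The parameter types are an assumption that simply mirrors move_pages, and the wrapper name move_phys_pages_sketch is hypothetical. The syscall is an RFC and has no number in mainline headers, so on an unpatched system this wrapper just fails with ENOSYS.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper for the proposed move_phys_pages syscall.
 * pages[] would hold physical addresses instead of virtual ones, and
 * there is no pid argument. Types are assumed to mirror move_pages. */
static long move_phys_pages_sketch(unsigned long count, void **pages,
                                   const int *nodes, int *status, int flags)
{
#ifdef __NR_move_phys_pages
    /* Only reachable with headers from a kernel carrying the RFC. */
    return syscall(__NR_move_phys_pages, count, pages, nodes, status, flags);
#else
    (void)count; (void)pages; (void)nodes; (void)status; (void)flags;
    errno = ENOSYS; /* not available on this system */
    return -1;
#endif
}
```

Compared with move_pages, the caller no longer needs the owning task or a virtual address, so the physical addresses reported by a device could be passed through directly.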
Validating Movability
Per page:
rmap_walk(folio, &rwc);
static bool phys_page_migratable(struct folio *folio,
                                 struct vm_area_struct *vma,
                                 unsigned long address,
                                 void *arg)
Walks each VMA that maps the page and determines migratability
cpusets intersection (same as move_pages)
vma_migratable (same as move_pages)
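The callback pattern above can be modelled in plain userland C. This is a toy sketch, not kernel code: toy_vma, toy_check, and both functions are hypothetical stand-ins, and the cpuset check is reduced to an assumed bitmask test. It keeps only the control flow: the callback runs once per mapping and returns false to stop the walk at the first mapping that forbids migration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one VMA that maps the page (NOT the kernel struct). */
struct toy_vma {
    bool          migratable;    /* models vma_migratable() */
    unsigned long allowed_nodes; /* models the owner's cpuset as a bitmask */
};

/* Toy stand-in for the argument passed through the walk. */
struct toy_check {
    unsigned long target_mask; /* bit set for the destination node */
    bool          ok;          /* stays true only if every mapping agrees */
};

/* Per-mapping callback: return true to keep walking, false to stop. */
static bool toy_page_migratable(const struct toy_vma *vma, void *arg)
{
    struct toy_check *chk = arg;

    if (!vma->migratable || !(vma->allowed_nodes & chk->target_mask)) {
        chk->ok = false;
        return false; /* one mapping vetoes the move, stop the walk */
    }
    return true;
}

/* Toy walk: visit every mapping of a page, as rmap_walk would. */
static bool toy_can_migrate(const struct toy_vma *vmas, size_t n,
                            unsigned target_node)
{
    struct toy_check chk = { 1UL << target_node, true };

    for (size_t i = 0; i < n; i++)
        if (!toy_page_migratable(&vmas[i], &chk))
            break;
    return chk.ok;
}
```

In this model, a page shared by several tasks is movable only if every mapping allows it.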
Other Ideas and Feedback
Create the interface, but not the syscall
Allows core and drivers to make use of it, but not userland
Issue: Allows development, but only in drivers. May hurt adoption.
Issue: May not encourage open development and standardization.
"Userland shouldn't talk physical addresses, because security"
IBS/PEBS can already be configured to give HVA-to-HPA mappings
This would be a CAP_SYS_ADMIN-only interface
Questions
References
gmprice/linux at sys_move_phys_pages_11_9 (github.com)
[RFC v2 0/5] move_phys_pages syscall - Gregory Price (kernel.org)
[2206.02878] TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory (arxiv.org)
[RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing (kernel.org)
Towards an Adaptable Systems Architecture for Memory Tiering at Warehouse-Scale (micahlerner.com)
Memtis: Efficient Memory Tiering with Dynamic Page Classification and Page Size Determination (acm.org)
HW counters for hot/cold pages - Aneesh Kumar K V, Wei Xu
DAMON updates and future plans - SeongJae Park
Authors
Gregory Price
Svetly Todorov
MemVerge Inc.
Uploaded on Apr 16, 2024