Troubleshooting Memory and Network Stack Tuning in Linux for Highly Loaded Servers

 
 
Memory and network stack tuning in Linux: the story of migrating highly loaded servers to a fresh Linux distribution
 
Dmitry Samsonov
Lead System Administrator at Odnoklassniki
Expertise: Zabbix, CFEngine, Linux tuning
dmitry.samsonov@odnoklassniki.ru
https://www.linkedin.com/in/dmitrysamsonov
 
 
OpenSuSE 10.2: released 07.12.2006, end of life 30.11.2008
CentOS 7: released 07.07.2014, end of life 30.06.2024
 
Video distribution

4 x 10Gbit/s to users
2 x 10Gbit/s to storage
256GB RAM: in-memory cache
22 x 480GB SSD: SSD cache
2 x E5-2690 v2
 
TOC

Memory
  OOM killer
  Swap
Network
  Broken pipe
  Network load distribution between CPU cores
  SoftIRQ
 
Memory: OOM killer
 
[Diagram: how physical memory is organized]
1. All the physical memory is split between NUMA nodes: NODE 0 (CPU N0) and NODE 1 (CPU N1).
2. Each node (NODE 0 shown) is split into zones: ZONE_DMA (0-16MB), ZONE_DMA32 (0-4GB), ZONE_NORMAL (4+GB).
3. Each zone keeps free memory in buddy-allocator blocks, from 2^0*PAGE_SIZE up to 2^10*PAGE_SIZE.
 
What is going on?

OOM killer, system CPU spikes!
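A minimal sketch for confirming the diagnosis in the kernel log (the grep patterns are common kernel log phrases, not an exhaustive list):

    # Has the OOM killer fired, and what did it kill?
    dmesg -T | grep -i -E 'out of memory|oom-killer|killed process'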
 
Memory fragmentation

[Diagram: memory after the server has booted up, after some time, and after some more time: free memory gets progressively fragmented]
 
Why is this happening?

Lack of free memory
Memory pressure
 
What to do about fragmentation?

Increase vm.min_free_kbytes!
The min/low/high watermarks of each zone are visible in /proc/zoneinfo:

    Node 0, zone   Normal
      pages free     2020543
            min      1297238
            low      1621547
            high     1945857
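A minimal sketch of how this is inspected and changed (the value is purely illustrative, not a recommendation):

    # Current reserve, in kilobytes
    sysctl vm.min_free_kbytes

    # Raise it at runtime: all three watermarks scale with it, so kswapd
    # starts reclaiming earlier and keeps a larger contiguous reserve
    sysctl -w vm.min_free_kbytes=1048576

    # Persist across reboots
    echo 'vm.min_free_kbytes = 1048576' >> /etc/sysctl.conf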
 
Current fragmentation status

/proc/buddyinfo shows, per zone, how many free blocks of each order are left; zeros in the right-hand (high-order) columns mean no large contiguous blocks remain:

    Node 0, zone      DMA      0      0      1      0  ...   2      1      1      0      1      1      3
    Node 0, zone    DMA32   1147    980    813    450  ... 386    115     32     14      2      3      5
    Node 0, zone   Normal  55014  15311   1173    120  ...   5      0      0      0      0      0      0
    Node 1, zone   Normal  70581  15309   2604    200  ...  32      0      0      0      0      0      0
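To watch fragmentation evolve, the counts can also be summed per order; a small sketch, assuming the usual 11 orders (2^0 to 2^10 pages per block):

    # Highlight changes in free-block counts every 2 seconds
    watch -d cat /proc/buddyinfo

    # Total free blocks of each order across the Normal zones
    awk '/Normal/ { for (i = 5; i <= NF; i++) sum[i-5] += $i }
         END { for (o = 0; o <= 10; o++) printf "order %2d: %d\n", o, sum[o] }' /proc/buddyinfo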
 
Why is it bad to increase min_free_kbytes?

A min_free_kbytes-sized part of memory will not be available.
 
Memory: Swap

What is going on?

40GB of free memory and vm.swappiness=0, but the server is still swapping!
[Diagram repeated: physical memory split into NUMA nodes (NODE 0 / CPU N0, NODE 1 / CPU N1) and zones (ZONE_DMA 0-16MB, ZONE_DMA32 0-4GB, ZONE_NORMAL 4+GB)]
 
Uneven memory usage between nodes

[Diagram: free vs. used memory on NODE 0 (CPU N0) and NODE 1 (CPU N1); one node is nearly exhausted while the other still has free memory]
 
Current usage by nodes

    numastat -m <PID>
    numastat -m
                    Node 0          Node 1           Total
           --------------- --------------- ---------------
    MemFree       51707.00        23323.77        75030.77
    ...
 
What to do about NUMA?

Turn off NUMA:
  For the whole system (kernel parameter): numa=off
  Per process: numactl --interleave=all <cmd> (sketch below)
Prepare the application:
  Multithreading in all parts
  Node affinity
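A minimal sketch of the per-process route, assuming numactl is installed (the java invocation is purely illustrative):

    # Spread this process's allocations round-robin across all nodes,
    # so one node does not fill up (and push the box into swap)
    # while the other still has free memory
    numactl --interleave=all java -jar app.jar

    # Show the node layout and free memory the kernel sees
    numactl --hardware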
 
Network

What already had to be done

Ring buffer: ethtool -g/-G
Transmit queue length: ip link / ip link set <DEV> txqueuelen <PACKETS>
Receive queue length: net.core.netdev_max_backlog
Socket buffers:
  net.core.<rmem_default|rmem_max>
  net.core.<wmem_default|wmem_max>
  net.ipv4.<tcp_rmem|udp_rmem>
  net.ipv4.<tcp_wmem|udp_wmem>
  net.ipv4.udp_mem
Offload: ethtool -k/-K
(a sketch of a few of these knobs follows)
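A hedged sketch (eth0 and every value here are assumptions for illustration, not recommendations):

    # Ring buffer: show hardware limits, then grow the RX ring
    ethtool -g eth0
    ethtool -G eth0 rx 4096

    # Transmit queue length
    ip link set eth0 txqueuelen 10000

    # Receive backlog and socket buffer ceilings
    sysctl -w net.core.netdev_max_backlog=250000
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216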
 
Network: Broken pipe

What is going on?

A constant background of "Broken pipe" errors; in tcpdump, a half-duplex close sequence.
 
OOO

Out-of-order packet, i.e. a packet with an incorrect SEQuence number.
 
What to do about OOO?

Send all packets of one connection by one route:
  Same CPU core
  Same network interface
  Same NIC queue
Configuration (sketch below):
  Bind threads/processes to CPU cores
  Bind NIC queues to CPU cores
  Use RFS
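A hedged sketch of that configuration (the interface eth0, IRQ number 42, core numbers, binary name, and table sizes are all assumptions):

    # Pin the NIC queue's hardware interrupt to CPU1 (bitmask 0x2)
    echo 2 > /proc/irq/42/smp_affinity

    # Pin the consuming process to the same core
    taskset -c 1 ./video-server

    # RFS: size the global flow table, then give queue rx-0 its share
    echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
    echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt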
 
Before/after

[Graph: broken pipes per second per server, before and after the change]
 
Why is static binding bad?

Load distribution between CPU cores might be uneven.
 
Network: load distribution between CPU cores
 
CPU0 utilization at 100%
 
Why is this happening?

1. Single queue: turn on more with ethtool -l/-L (sketch below)
2. Interrupts are not distributed:
   dynamic distribution: launch irqbalance/irqd/birq
   static distribution: configure RSS
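A minimal sketch for the single-queue case (eth0 and the queue count are assumptions; the NIC must support multiple queues):

    # How many queues the NIC supports vs. how many are enabled
    ethtool -l eth0

    # Enable 8 combined RX/TX queue pairs
    ethtool -L eth0 combined 8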
 
RSS

[Diagram: packets from the network arrive at eth0; RSS hashes each flow to one of the NIC queues Q0-Q7, and each queue interrupts its own CPU core]
 
CPU0-CPU7 utilization at 100%
 
We need more queues!
 
16-core utilization at 100%
 
scaling.txt

RPS = software RSS
XPS = RPS for outgoing packets
RFS? Use the packet consumer's core number

https://www.kernel.org/doc/Documentation/networking/scaling.txt
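RPS and XPS are configured through sysfs, per the scaling.txt document above; a hedged sketch (eth0 and the CPU masks are assumptions):

    # RPS: CPUs 0-3 (mask f) process packets arriving on RX queue 0
    echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus

    # XPS: TX queue 0 is used for packets sent by CPUs 0-3
    echo f > /sys/class/net/eth0/queues/tx-0/xps_cpus

RFS (the flow table sizing) was sketched in the Broken pipe section above.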
 
Why is RPS/RFS/XPS bad?

1. Load distribution between CPU cores might be uneven.
2. CPU overhead.
 
Accelerated RFS

Mellanox supports it, but after switching it on, the maximal throughput on 10G NICs was only 5Gbit/s.
 
Intel

Signature Filter (also known as ATR, Application Targeted Receive): an RPS+RFS counterpart.
 
Network: SoftIRQ
 
How SoftIRQs are born

[Diagram, built up over four slides:]
1. A packet arrives from the network into a NIC queue on eth0.
2. RSS selects the queue, and the NIC raises a hardware interrupt (HW IRQ 42) on a CPU core.
3. The HW interrupt handler finishes quickly and raises SoftIRQ NET_RX on CPU0.
4. In SoftIRQ context, NAPI poll drains the queue.
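The kernel exposes per-CPU SoftIRQ counters, so the imbalance is easy to see; a small sketch:

    # NET_RX counts per CPU; one column growing much faster than the
    # rest means a single core is doing all the receive work
    grep -E 'CPU|NET_RX' /proc/softirqs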
 
What to do about high SoftIRQ?
 
 
Interrupt moderation

ethtool -c/-C
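A hedged sketch (eth0 and the values are assumptions; larger values mean fewer interrupts but higher latency):

    # Current coalescing settings
    ethtool -c eth0

    # Fire the RX interrupt at most every 100 us or every 64 frames
    ethtool -C eth0 rx-usecs 100 rx-frames 64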
 
Why is interrupt moderation bad?

You have to balance between throughput and latency.
 
What is going on?

Too rapid growth.
The Health Ministry warns!

CHANGES -> TESTS -> REVERT! or KEEP IT
Thank you!

Odnoklassniki technical blog on habrahabr.ru: http://habrahabr.ru/company/odnoklassniki/
More about us: http://v.ok.ru/
 
 
 
 
 
 
Dmitry Samsonov
dmitry.samsonov@odnoklassniki.ru