From: Yafang Shao <laoar.shao@gmail.com>
To: akpm@linux-foundation.org
Cc: ying.huang@intel.com, mgorman@techsingularity.net,
linux-mm@kvack.org, Yafang Shao <laoar.shao@gmail.com>
Subject: [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
Date: Sun, 7 Jul 2024 17:49:53 +0800
Message-ID: <20240707094956.94654-1-laoar.shao@gmail.com>
Background
==========
In our containerized environment, one type of container runs 18
processes, each consuming approximately 6GB of RSS. The workload is
organized as separate processes rather than threads because the Python
Global Interpreter Lock (GIL) would be a bottleneck in a multi-threaded
setup. When these containers exit, other containers hosted on the same
machine experience significant latency spikes.
Investigation
=============
My investigation with perf tracing revealed that the root cause of
these spikes is the simultaneous execution of exit_mmap() by the
exiting processes. Their concurrent page freeing contends on
zone->lock, which becomes a hotspot. The perf results below show this
contention as the primary contributor to the observed latency:
+   77.02%     0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
-   76.98%     0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
   - 76.97% exit_mmap
      - 58.58% unmap_vmas
         - 58.55% unmap_single_vma
            - unmap_page_range
               - 58.32% zap_pte_range
                  - 42.88% tlb_flush_mmu
                     - 42.76% free_pages_and_swap_cache
                        - 41.22% release_pages
                           - 33.29% free_unref_page_list
                              - 32.37% free_unref_page_commit
                                 - 31.64% free_pcppages_bulk
                                    + 28.65% _raw_spin_lock
                                      1.28% __list_del_entry_valid
                           + 3.25% folio_lruvec_lock_irqsave
                           + 0.75% __mem_cgroup_uncharge_list
                             0.60% __mod_lruvec_state
                          1.07% free_swap_cache
                  + 11.69% page_remove_rmap
                    0.64% __mod_lruvec_page_state
      - 17.34% remove_vma
         - 17.25% vm_area_free
            - 17.23% kmem_cache_free
               - 17.15% __slab_free
                  - 14.56% discard_slab
                       free_slab
                       __free_slab
                       __free_pages
                     - free_unref_page
                        - 13.50% free_unref_page_commit
                           - free_pcppages_bulk
                              + 13.44% _raw_spin_lock
By enabling the mm_page_pcpu_drain tracepoint, we can identify the
pages being drained; the majority of them are regular order-0 user
pages.
<...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
<...>-1540432 [224] d..3. 618048.023887: <stack trace>
=> free_pcppages_bulk
=> free_unref_page_commit
=> free_unref_page_list
=> release_pages
=> free_pages_and_swap_cache
=> tlb_flush_mmu
=> zap_pte_range
=> unmap_page_range
=> unmap_single_vma
=> unmap_vmas
=> exit_mmap
=> mmput
=> do_exit
=> do_group_exit
=> get_signal
=> arch_do_signal_or_restart
=> exit_to_user_mode_prepare
=> syscall_exit_to_user_mode
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
The affected servers have substantial hardware: 256 CPUs and 1TB of
memory, all within a single NUMA node. The relevant zoneinfo is as
follows:
Node 0, zone   Normal
  pages free     144465775
        boost    0
        min      1309270
        low      1636587
        high     1963904
        spanned  564133888
        present  296747008
        managed  291974346
        cma      0
        protection: (0, 0, 0, 0)
  ...
  pagesets
    cpu: 0
              count: 2217
              high:  6392
              batch: 63
  vm stats threshold: 125
    cpu: 1
              count: 4510
              high:  6392
              batch: 63
  vm stats threshold: 125
    cpu: 2
              count: 3059
              high:  6392
              batch: 63
  ...
The pcp high watermark (6392) is roughly 100 times the batch size (63).
I also traced the latency associated with the free_pcppages_bulk()
function during the container exit process:
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 148 |***************** |
512 -> 1023 : 334 |****************************************|
1024 -> 2047 : 33 |*** |
2048 -> 4095 : 5 | |
4096 -> 8191 : 7 | |
8192 -> 16383 : 12 |* |
16384 -> 32767 : 30 |*** |
32768 -> 65535 : 21 |** |
65536 -> 131071 : 15 |* |
131072 -> 262143 : 27 |*** |
262144 -> 524287 : 84 |********** |
524288 -> 1048575 : 203 |************************ |
1048576 -> 2097151 : 284 |********************************** |
2097152 -> 4194303 : 327 |*************************************** |
4194304 -> 8388607 : 215 |************************* |
8388608 -> 16777215 : 116 |************* |
16777216 -> 33554431 : 47 |***** |
33554432 -> 67108863 : 8 | |
67108864 -> 134217727 : 3 | |
The latency can reach tens of milliseconds.
Experimenting
=============
vm.percpu_pagelist_high_fraction
--------------------------------
Our production environment runs the stable 6.1.y kernel, so my first
attempt was to tune vm.percpu_pagelist_high_fraction. Increasing this
value shrinks the batch size used when draining pages, which in turn
substantially reduces latency. After setting the sysctl to 0x7fffffff,
I observed a notable improvement:
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 120 | |
256 -> 511 : 365 |* |
512 -> 1023 : 201 | |
1024 -> 2047 : 103 | |
2048 -> 4095 : 84 | |
4096 -> 8191 : 87 | |
8192 -> 16383 : 4777 |************** |
16384 -> 32767 : 10572 |******************************* |
32768 -> 65535 : 13544 |****************************************|
65536 -> 131071 : 12723 |************************************* |
131072 -> 262143 : 8604 |************************* |
262144 -> 524287 : 3659 |********** |
524288 -> 1048575 : 921 |** |
1048576 -> 2097151 : 122 | |
2097152 -> 4194303 : 5 | |
However, raising vm.percpu_pagelist_high_fraction also shrinks the pcp
high watermark, down to a floor of four times the batch size. While
this could theoretically hurt throughput, as highlighted by Ying[0], we
have yet to observe any significant throughput difference in our
production environment after this change.
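For reference, here is a simplified sketch of how the 6.1 kernel
derives the pcp high watermark from this sysctl (paraphrased from
zone_highsize() in mm/page_alloc.c; not verbatim, and details vary by
kernel version):

	/* Paraphrased sketch of 6.1's zone_highsize(); not verbatim. */
	static int zone_highsize(struct zone *zone, int batch, int cpu_online)
	{
		unsigned long total_pages;
		int high, nr_split_cpus;

		if (!percpu_pagelist_high_fraction)
			/* default: derive pcp->high from the low watermark */
			total_pages = low_wmark_pages(zone);
		else
			/* sysctl set: a fraction of the zone's managed pages */
			total_pages = zone_managed_pages(zone) /
				      percpu_pagelist_high_fraction;

		/* split the budget across the CPUs local to this zone */
		nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) +
				cpu_online;
		high = total_pages / nr_split_cpus;

		/* floor of four batches, reached when the fraction is huge */
		return max(high, batch << 2);
	}

With the fraction set to 0x7fffffff, total_pages collapses to nearly
zero and the batch << 2 floor takes over, which is why pcp high cannot
drop below four times the batch size.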
Backporting the series "mm: PCP high auto-tuning"
-------------------------------------------------
My second attempt was to backport the nine-patch series
"mm: PCP high auto-tuning"[1] to our 6.1.y stable kernel. After
deploying it in our production environment, I observed a pronounced
reduction in latency:
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 2 | |
2048 -> 4095 : 11 | |
4096 -> 8191 : 3 | |
8192 -> 16383 : 1 | |
16384 -> 32767 : 2 | |
32768 -> 65535 : 7 | |
65536 -> 131071 : 198 |********* |
131072 -> 262143 : 530 |************************ |
262144 -> 524287 : 824 |************************************** |
524288 -> 1048575 : 852 |****************************************|
1048576 -> 2097151 : 714 |********************************* |
2097152 -> 4194303 : 389 |****************** |
4194304 -> 8388607 : 143 |****** |
8388608 -> 16777215 : 29 |* |
16777216 -> 33554431 : 1 | |
Compared to the previous data, the maximum latency has been reduced to
less than 30ms.
Adjusting CONFIG_PCP_BATCH_SCALE_MAX
------------------------------------
Following Ying's suggestion, lowering CONFIG_PCP_BATCH_SCALE_MAX can
reduce the PCP batch size without shrinking the PCP high watermark,
which should mitigate the latency spikes without hurting throughput. My
third attempt therefore focused on this configuration option.
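To make the effect concrete, here is an illustrative sketch of how the
scale factor caps the number of pages freed per zone->lock hold (the
function name pcp_free_batch is mine; this paraphrases the
post-auto-tuning free path in mm/page_alloc.c and is not verbatim
kernel code):

	static int pcp_free_batch(struct per_cpu_pages *pcp, int batch, int high)
	{
		int nr;

		/* PCP disabled or boot pageset */
		if (unlikely(high < batch))
			return 1;

		/*
		 * pcp->free_count grows with consecutive frees but is
		 * capped at batch << CONFIG_PCP_BATCH_SCALE_MAX.  With
		 * batch = 63 on this machine, the default scale of 5
		 * allows up to 63 << 5 = 2016 pages per lock hold; a
		 * scale of 0 allows at most 63.
		 */
		nr = min(pcp->free_count, batch << CONFIG_PCP_BATCH_SCALE_MAX);

		/* always leave at least one batch on the list */
		return clamp(nr, batch, high - batch);
	}

Shrinking the cap trades fewer pages freed per lock hold (lower tail
latency) against more frequent lock acquisitions (potentially lower
throughput).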
To make this easier to adjust, I replaced CONFIG_PCP_BATCH_SCALE_MAX
with a new sysctl knob, vm.pcp_batch_scale_max. Lowering
vm.pcp_batch_scale_max from its default of 5 to 0 reduced the maximum
latency further, to under 2ms:
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 36 | |
2048 -> 4095 : 5063 |***** |
4096 -> 8191 : 31226 |******************************** |
8192 -> 16383 : 37606 |*************************************** |
16384 -> 32767 : 38359 |****************************************|
32768 -> 65535 : 30652 |******************************* |
65536 -> 131071 : 18714 |******************* |
131072 -> 262143 : 7968 |******** |
262144 -> 524287 : 1996 |** |
524288 -> 1048575 : 302 | |
1048576 -> 2097151 : 19 | |
Repeated trials showed no significant variation from run to run.
The Proposal
============
This series contains two minor refinements to the PCP high watermark
auto-tuning mechanism, along with a new sysctl knob that is a more
practical replacement for the Kconfig option.
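A minimal sketch of what the knob's wiring might look like
(illustrative only; it mirrors the existing vm sysctls, with the 0..6
range carried over from the old Kconfig option, and SYSCTL_SIX standing
in for the one-line addition to include/linux/sysctl.h visible in the
diffstat — see patch 3/3 for the real change):

	static int pcp_batch_scale_max = 5;	/* old Kconfig default */

	static struct ctl_table page_alloc_sysctl_table[] = {
		{
			.procname	= "pcp_batch_scale_max",
			.data		= &pcp_batch_scale_max,
			.maxlen		= sizeof(pcp_batch_scale_max),
			.mode		= 0644,
			.proc_handler	= proc_dointvec_minmax,
			.extra1		= SYSCTL_ZERO,
			.extra2		= SYSCTL_SIX,	/* assumed constant */
		},
	};

With this in place, the scale factor can be changed at runtime, e.g.
via "sysctl -w vm.pcp_batch_scale_max=0", instead of requiring a kernel
rebuild.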
Future improvement to zone->lock
================================
Several approaches have been suggested to ultimately resolve the
zone->lock contention. One is to split large zones into multiple
smaller zones, as suggested by Matthew[2]; another is to split
zone->lock using a mechanism similar to memory arenas, moving away from
relying solely on zone_id to identify the range of free lists a
particular page belongs to[3]. Either solution, however, will likely
require a more extended development effort.
Link: https://lore.kernel.org/linux-mm/874j98noth.fsf@yhuang6-desk2.ccr.corp.intel.com/ [0]
Link: https://lore.kernel.org/all/20231016053002.756205-1-ying.huang@intel.com/ [1]
Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [2]
Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@techsingularity.net/ [3]
Changes:
- mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
https://lore.kernel.org/linux-mm/20240701142046.6050-1-laoar.shao@gmail.com/
Yafang Shao (3):
mm/page_alloc: A minor fix to the calculation of pcp->free_count
mm/page_alloc: Avoid changing pcp->high decaying when adjusting
CONFIG_PCP_BATCH_SCALE_MAX
mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
Documentation/admin-guide/sysctl/vm.rst | 15 ++++++++++
include/linux/sysctl.h | 1 +
kernel/sysctl.c | 2 +-
mm/Kconfig | 11 -------
mm/page_alloc.c | 38 ++++++++++++++++++-------
5 files changed, 45 insertions(+), 22 deletions(-)
--
2.43.5