From: Yafang Shao <laoar.shao@gmail.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net,
linux-mm@kvack.org
Subject: Re: [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
Date: Thu, 11 Jul 2024 15:21:43 +0800
Message-ID: <CALOAHbBdQY7C8sttb7T18YrGNLzMAtJKxHAvALs8xxdfPajs4Q@mail.gmail.com>
In-Reply-To: <87sewga0wx.fsf@yhuang6-desk2.ccr.corp.intel.com>
On Thu, Jul 11, 2024 at 2:40 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > Background
> >> > ==========
> >> >
> >> > In our containerized environment, we have a specific type of container
> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> >> > processes are organized as separate processes rather than threads due
> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> >> > multi-threaded setup. Upon the exit of these containers, other
> >> > containers hosted on the same machine experience significant latency
> >> > spikes.
> >> >
> >> > Investigation
> >> > =============
> >> >
> >> > My investigation using perf tracing revealed that the root cause of
> >> > these spikes is the simultaneous execution of exit_mmap() by each of
> >> > the exiting processes. This concurrent access to the zone->lock
> >> > results in contention, which becomes a hotspot and negatively impacts
> >> > performance. The perf results clearly indicate this contention as a
> >> > primary contributor to the observed latency issues.
> >> >
> >> > + 77.02% 0.00% uwsgi [kernel.kallsyms] [k] mmput
> >> > - 76.98% 0.01% uwsgi [kernel.kallsyms] [k] exit_mmap
> >> > - 76.97% exit_mmap
> >> > - 58.58% unmap_vmas
> >> > - 58.55% unmap_single_vma
> >> > - unmap_page_range
> >> > - 58.32% zap_pte_range
> >> > - 42.88% tlb_flush_mmu
> >> > - 42.76% free_pages_and_swap_cache
> >> > - 41.22% release_pages
> >> > - 33.29% free_unref_page_list
> >> > - 32.37% free_unref_page_commit
> >> > - 31.64% free_pcppages_bulk
> >> > + 28.65% _raw_spin_lock
> >> > 1.28% __list_del_entry_valid
> >> > + 3.25% folio_lruvec_lock_irqsave
> >> > + 0.75% __mem_cgroup_uncharge_list
> >> > 0.60% __mod_lruvec_state
> >> > 1.07% free_swap_cache
> >> > + 11.69% page_remove_rmap
> >> > 0.64% __mod_lruvec_page_state
> >> > - 17.34% remove_vma
> >> > - 17.25% vm_area_free
> >> > - 17.23% kmem_cache_free
> >> > - 17.15% __slab_free
> >> > - 14.56% discard_slab
> >> > free_slab
> >> > __free_slab
> >> > __free_pages
> >> > - free_unref_page
> >> > - 13.50% free_unref_page_commit
> >> > - free_pcppages_bulk
> >> > + 13.44% _raw_spin_lock
> >>
> >> I don't think your change will reduce zone->lock contention cycles. So,
> >> I don't find the value of the above data.
> >>
> >> > By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
> >> > with the majority of them being regular order-0 user pages.
> >> >
> >> > <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
> >> > <...>-1540432 [224] d..3. 618048.023887: <stack trace>
> >> > => free_pcppages_bulk
> >> > => free_unref_page_commit
> >> > => free_unref_page_list
> >> > => release_pages
> >> > => free_pages_and_swap_cache
> >> > => tlb_flush_mmu
> >> > => zap_pte_range
> >> > => unmap_page_range
> >> > => unmap_single_vma
> >> > => unmap_vmas
> >> > => exit_mmap
> >> > => mmput
> >> > => do_exit
> >> > => do_group_exit
> >> > => get_signal
> >> > => arch_do_signal_or_restart
> >> > => exit_to_user_mode_prepare
> >> > => syscall_exit_to_user_mode
> >> > => do_syscall_64
> >> > => entry_SYSCALL_64_after_hwframe
> >> >
> >> > The servers experiencing these issues are equipped with impressive
> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
> >> > within a single NUMA node. The zoneinfo is as follows,
> >> >
> >> > Node 0, zone Normal
> >> > pages free 144465775
> >> > boost 0
> >> > min 1309270
> >> > low 1636587
> >> > high 1963904
> >> > spanned 564133888
> >> > present 296747008
> >> > managed 291974346
> >> > cma 0
> >> > protection: (0, 0, 0, 0)
> >> > ...
> >> > pagesets
> >> > cpu: 0
> >> > count: 2217
> >> > high: 6392
> >> > batch: 63
> >> > vm stats threshold: 125
> >> > cpu: 1
> >> > count: 4510
> >> > high: 6392
> >> > batch: 63
> >> > vm stats threshold: 125
> >> > cpu: 2
> >> > count: 3059
> >> > high: 6392
> >> > batch: 63
> >> >
> >> > ...
> >> >
> >> > The pcp high is around 100 times the batch size.
> >> >
> >> > I also traced the latency associated with the free_pcppages_bulk()
> >> > function during the container exit process:
> >> >
> >> > nsecs : count distribution
> >> > 0 -> 1 : 0 | |
> >> > 2 -> 3 : 0 | |
> >> > 4 -> 7 : 0 | |
> >> > 8 -> 15 : 0 | |
> >> > 16 -> 31 : 0 | |
> >> > 32 -> 63 : 0 | |
> >> > 64 -> 127 : 0 | |
> >> > 128 -> 255 : 0 | |
> >> > 256 -> 511 : 148 |***************** |
> >> > 512 -> 1023 : 334 |****************************************|
> >> > 1024 -> 2047 : 33 |*** |
> >> > 2048 -> 4095 : 5 | |
> >> > 4096 -> 8191 : 7 | |
> >> > 8192 -> 16383 : 12 |* |
> >> > 16384 -> 32767 : 30 |*** |
> >> > 32768 -> 65535 : 21 |** |
> >> > 65536 -> 131071 : 15 |* |
> >> > 131072 -> 262143 : 27 |*** |
> >> > 262144 -> 524287 : 84 |********** |
> >> > 524288 -> 1048575 : 203 |************************ |
> >> > 1048576 -> 2097151 : 284 |********************************** |
> >> > 2097152 -> 4194303 : 327 |*************************************** |
> >> > 4194304 -> 8388607 : 215 |************************* |
> >> > 8388608 -> 16777215 : 116 |************* |
> >> > 16777216 -> 33554431 : 47 |***** |
> >> > 33554432 -> 67108863 : 8 | |
> >> > 67108864 -> 134217727 : 3 | |
> >> >
> >> > The latency can reach tens of milliseconds.
> >> >
> >> > Experimenting
> >> > =============
> >> >
> >> > vm.percpu_pagelist_high_fraction
> >> > --------------------------------
> >> >
> >> > The kernel version currently deployed in our production environment is the
> >> > stable 6.1.y, and my initial strategy involves optimizing the
> >>
> >> IMHO, we should focus on upstream activity in the cover letter and patch
> >> description. And I don't think that it's necessary to describe the
> >> alternative solution with too much details.
> >>
> >> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
> >> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
> >> > page draining, which subsequently leads to a substantial reduction in
> >> > latency. After setting the sysctl value to 0x7fffffff, I observed a notable
> >> > improvement in latency.
> >> >
> >> > nsecs : count distribution
> >> > 0 -> 1 : 0 | |
> >> > 2 -> 3 : 0 | |
> >> > 4 -> 7 : 0 | |
> >> > 8 -> 15 : 0 | |
> >> > 16 -> 31 : 0 | |
> >> > 32 -> 63 : 0 | |
> >> > 64 -> 127 : 0 | |
> >> > 128 -> 255 : 120 | |
> >> > 256 -> 511 : 365 |* |
> >> > 512 -> 1023 : 201 | |
> >> > 1024 -> 2047 : 103 | |
> >> > 2048 -> 4095 : 84 | |
> >> > 4096 -> 8191 : 87 | |
> >> > 8192 -> 16383 : 4777 |************** |
> >> > 16384 -> 32767 : 10572 |******************************* |
> >> > 32768 -> 65535 : 13544 |****************************************|
> >> > 65536 -> 131071 : 12723 |************************************* |
> >> > 131072 -> 262143 : 8604 |************************* |
> >> > 262144 -> 524287 : 3659 |********** |
> >> > 524288 -> 1048575 : 921 |** |
> >> > 1048576 -> 2097151 : 122 | |
> >> > 2097152 -> 4194303 : 5 | |
> >> >
> >> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
> >> > pcp high watermark size to a minimum of four times the batch size. While
> >> > this could theoretically affect throughput, as highlighted by Ying[0], we
> >> > have yet to observe any significant difference in throughput within our
> >> > production environment after implementing this change.
> >> >
> >> > Backporting the series "mm: PCP high auto-tuning"
> >> > -------------------------------------------------
> >>
> >> Again, not upstream activity. We can describe the upstream behavior
> >> directly.
> >
> > Andrew has requested that I provide a more comprehensive analysis of
> > this issue, and in response, I have endeavored to outline all the
> > pertinent details in a thorough and detailed manner.
>
> IMHO, upstream activity can provide comprehensive analysis of the issue
> too. And, your patch has changed much from the first version. It's
> better to describe your current version.
After backporting the pcp auto-tuning series to the 6.1.y branch, the
pcp code is almost the same as in the upstream kernel. I have
documented the detailed data from the backported version so that the
results are clear. However, please note that I cannot run the upstream
kernel directly in our production environment due to practical
constraints.
>
> >>
> >> > My second endeavor was to backport the series titled
> >> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
> >> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
> >> > production environment, I noted a pronounced reduction in latency. The
> >> > observed outcomes are as enumerated below:
> >> >
> >> > nsecs : count distribution
> >> > 0 -> 1 : 0 | |
> >> > 2 -> 3 : 0 | |
> >> > 4 -> 7 : 0 | |
> >> > 8 -> 15 : 0 | |
> >> > 16 -> 31 : 0 | |
> >> > 32 -> 63 : 0 | |
> >> > 64 -> 127 : 0 | |
> >> > 128 -> 255 : 0 | |
> >> > 256 -> 511 : 0 | |
> >> > 512 -> 1023 : 0 | |
> >> > 1024 -> 2047 : 2 | |
> >> > 2048 -> 4095 : 11 | |
> >> > 4096 -> 8191 : 3 | |
> >> > 8192 -> 16383 : 1 | |
> >> > 16384 -> 32767 : 2 | |
> >> > 32768 -> 65535 : 7 | |
> >> > 65536 -> 131071 : 198 |********* |
> >> > 131072 -> 262143 : 530 |************************ |
> >> > 262144 -> 524287 : 824 |************************************** |
> >> > 524288 -> 1048575 : 852 |****************************************|
> >> > 1048576 -> 2097151 : 714 |********************************* |
> >> > 2097152 -> 4194303 : 389 |****************** |
> >> > 4194304 -> 8388607 : 143 |****** |
> >> > 8388608 -> 16777215 : 29 |* |
> >> > 16777216 -> 33554431 : 1 | |
> >> >
> >> > Compared to the previous data, the maximum latency has been reduced to
> >> > less than 30ms.
> >>
> >> People don't care too much about page freeing latency during processes
> >> exiting. Instead, they care more about the process exiting time, that
> >> is, throughput. So, it's better to show the page allocation latency
> >> which is affected by the simultaneous processes exiting.
> >
> > I'm confused also. Is this issue really hard to understand ?
>
> IMHO, it's better to prove the issue directly. If you cannot prove it
> directly, you can try alternative one and describe why.
Not every effect can be measured directly or easily. The core problem
is zone->lock contention, so what needs to be measured is the latency
that this contention causes. free_pcppages_bulk() takes zone->lock,
which makes its latency a practical proxy, and that is why I chose to
measure it.

I did not measure allocation latency because that would require a
workload owner willing to tolerate the delays during the experiment,
and nobody volunteered. Measuring free_pcppages_bulk() latency only
requires experimenting with the workload that causes the delays, which
makes it a far more feasible approach.
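
As a rough illustration of how the three free_pcppages_bulk()
histograms quoted above compare, here is a minimal Python sketch. The
bucket counts are copied from the histograms in this thread. The 1 ms
cutoff (approximated by the 1048576 ns bucket boundary) and the helper
name summarize() are assumptions made for illustration only; they are
not part of the patch or of any tracing tool.

```python
# Log2 latency histograms for free_pcppages_bulk(), copied from this thread.
# Each entry maps a bucket's lower bound in nanoseconds to its sample count.
BASELINE = {
    256: 148, 512: 334, 1024: 33, 2048: 5, 4096: 7, 8192: 12,
    16384: 30, 32768: 21, 65536: 15, 131072: 27, 262144: 84,
    524288: 203, 1048576: 284, 2097152: 327, 4194304: 215,
    8388608: 116, 16777216: 47, 33554432: 8, 67108864: 3,
}
HIGH_FRACTION = {  # vm.percpu_pagelist_high_fraction = 0x7fffffff
    128: 120, 256: 365, 512: 201, 1024: 103, 2048: 84, 4096: 87,
    8192: 4777, 16384: 10572, 32768: 13544, 65536: 12723,
    131072: 8604, 262144: 3659, 524288: 921, 1048576: 122, 2097152: 5,
}
AUTOTUNE_BACKPORT = {  # "mm: PCP high auto-tuning" backported to 6.1.y
    1024: 2, 2048: 11, 4096: 3, 8192: 1, 16384: 2, 32768: 7,
    65536: 198, 131072: 530, 262144: 824, 524288: 852, 1048576: 714,
    2097152: 389, 4194304: 143, 8388608: 29, 16777216: 1,
}


def summarize(hist, cutoff_ns=1 << 20):
    """Return (total samples, samples in buckets starting at or above
    cutoff_ns, upper bound in ns of the highest non-empty bucket)."""
    total = sum(hist.values())
    slow = sum(count for low, count in hist.items() if low >= cutoff_ns)
    # A log2 bucket that starts at `low` covers low .. 2 * low - 1.
    worst_ns = 2 * max(low for low, count in hist.items() if count > 0) - 1
    return total, slow, worst_ns


if __name__ == "__main__":
    runs = (
        ("baseline", BASELINE),
        ("high_fraction", HIGH_FRACTION),
        ("autotune_backport", AUTOTUNE_BACKPORT),
    )
    for name, hist in runs:
        total, slow, worst_ns = summarize(hist)
        print(f"{name}: {slow}/{total} calls in buckets >= ~1ms "
              f"({slow / total:.1%}), top bucket ends at "
              f"{worst_ns / 1e6:.1f} ms")
```

The runs recorded very different numbers of calls, so only the
fractions and the top bucket are comparable across them, not the raw
counts.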
--
Regards
Yafang