From mboxrd@z Thu Jan  1 00:00:00 1970
From: Yafang Shao <laoar.shao@gmail.com>
Date: Thu, 11 Jul 2024 17:40:50 +0800
Subject: Re: [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
To: "Huang, Ying"
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net, linux-mm@kvack.org
In-Reply-To: <87bk349vg4.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240707094956.94654-1-laoar.shao@gmail.com>
 <874j8yar3z.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sewga0wx.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87bk349vg4.fsf@yhuang6-desk2.ccr.corp.intel.com>
Content-Type: text/plain; charset="UTF-8"

On Thu, Jul 11, 2024 at 4:38 PM Huang, Ying wrote:
>
> Yafang Shao writes:
>
> > On Thu, Jul 11, 2024 at 2:40 PM Huang, Ying wrote:
> >>
> >> Yafang Shao writes:
> >>
> >> > On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying wrote:
> >> >>
> >> >> Yafang Shao writes:
> >> >>
> >> >> > Background
> >> >> > ==========
> >> >> >
> >> >> > In our containerized environment, we have a specific type of container
> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> >> >> > processes are organized as separate processes rather than threads due
> >> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> >> >> > multi-threaded setup. Upon the exit of these containers, other
> >> >> > containers hosted on the same machine experience significant latency
> >> >> > spikes.
> >> >> >
> >> >> > Investigation
> >> >> > =============
> >> >> >
> >> >> > My investigation using perf tracing revealed that the root cause of
> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
> >> >> > the exiting processes. This concurrent access to the zone->lock
> >> >> > results in contention, which becomes a hotspot and negatively impacts
> >> >> > performance. The perf results clearly indicate this contention as a
> >> >> > primary contributor to the observed latency issues.
> >> >> >
> >> >> >   +   77.02%  0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
> >> >> >   -   76.98%  0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
> >> >> >      - 76.97% exit_mmap
> >> >> >         - 58.58% unmap_vmas
> >> >> >            - 58.55% unmap_single_vma
> >> >> >               - unmap_page_range
> >> >> >                  - 58.32% zap_pte_range
> >> >> >                     - 42.88% tlb_flush_mmu
> >> >> >                        - 42.76% free_pages_and_swap_cache
> >> >> >                           - 41.22% release_pages
> >> >> >                              - 33.29% free_unref_page_list
> >> >> >                                 - 32.37% free_unref_page_commit
> >> >> >                                    - 31.64% free_pcppages_bulk
> >> >> >                                       + 28.65% _raw_spin_lock
> >> >> >                                         1.28% __list_del_entry_valid
> >> >> >                              + 3.25% folio_lruvec_lock_irqsave
> >> >> >                              + 0.75% __mem_cgroup_uncharge_list
> >> >> >                                0.60% __mod_lruvec_state
> >> >> >                             1.07% free_swap_cache
> >> >> >                     + 11.69% page_remove_rmap
> >> >> >                       0.64% __mod_lruvec_page_state
> >> >> >         - 17.34% remove_vma
> >> >> >            - 17.25% vm_area_free
> >> >> >               - 17.23% kmem_cache_free
> >> >> >                  - 17.15% __slab_free
> >> >> >                     - 14.56% discard_slab
> >> >> >                          free_slab
> >> >> >                          __free_slab
> >> >> >                          __free_pages
> >> >> >                        - free_unref_page
> >> >> >                           - 13.50% free_unref_page_commit
> >> >> >                              - free_pcppages_bulk
> >> >> >                                 + 13.44% _raw_spin_lock
> >> >>
> >> >> I don't think your change will reduce zone->lock contention cycles. So,
> >> >> I don't find the value of the above data.
> >> >>
> >> >> > By enabling the mm_page_pcpu_drain tracepoint, we can locate the
> >> >> > pertinent pages, the majority of which are regular order-0 user pages.
> >> >> >
> >> >> >   <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
> >> >> >   <...>-1540432 [224] d..3. 618048.023887:
> >> >> >   => free_pcppages_bulk
> >> >> >   => free_unref_page_commit
> >> >> >   => free_unref_page_list
> >> >> >   => release_pages
> >> >> >   => free_pages_and_swap_cache
> >> >> >   => tlb_flush_mmu
> >> >> >   => zap_pte_range
> >> >> >   => unmap_page_range
> >> >> >   => unmap_single_vma
> >> >> >   => unmap_vmas
> >> >> >   => exit_mmap
> >> >> >   => mmput
> >> >> >   => do_exit
> >> >> >   => do_group_exit
> >> >> >   => get_signal
> >> >> >   => arch_do_signal_or_restart
> >> >> >   => exit_to_user_mode_prepare
> >> >> >   => syscall_exit_to_user_mode
> >> >> >   => do_syscall_64
> >> >> >   => entry_SYSCALL_64_after_hwframe
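(A trace like the one above can be captured by enabling the
kmem:mm_page_pcpu_drain tracepoint together with a stacktrace trigger under
tracefs. The snippet below is only a minimal sketch of one way to do that --
it assumes the usual /sys/kernel/tracing mount point and root privileges, and
is equivalent to a couple of writes into the trigger/enable files; it is not
the exact tooling used for the trace above.)

/* Hypothetical helper: arm kmem:mm_page_pcpu_drain with a stacktrace
 * trigger and stream the resulting records from trace_pipe.
 * Paths assume tracefs is mounted at /sys/kernel/tracing. */
#include <stdio.h>
#include <stdlib.h>

#define EVT "/sys/kernel/tracing/events/kmem/mm_page_pcpu_drain"

static void write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fputs(val, f);          /* single small write, flushed on close */
	fclose(f);
}

int main(void)
{
	char line[4096];
	FILE *pipe;

	write_file(EVT "/trigger", "stacktrace\n"); /* dump a call stack per event */
	write_file(EVT "/enable", "1\n");           /* turn the tracepoint on */

	pipe = fopen("/sys/kernel/tracing/trace_pipe", "r");
	if (!pipe) {
		perror("trace_pipe");
		exit(1);
	}
	while (fgets(line, sizeof(line), pipe))     /* stream records until killed */
		fputs(line, stdout);
	return 0;
}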
> >> >> >
> >> >> > The servers experiencing these issues are equipped with impressive
> >> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
> >> >> > within a single NUMA node. The zoneinfo is as follows,
> >> >> >
> >> >> >   Node 0, zone   Normal
> >> >> >     pages free     144465775
> >> >> >           boost    0
> >> >> >           min      1309270
> >> >> >           low      1636587
> >> >> >           high     1963904
> >> >> >           spanned  564133888
> >> >> >           present  296747008
> >> >> >           managed  291974346
> >> >> >           cma      0
> >> >> >           protection: (0, 0, 0, 0)
> >> >> >   ...
> >> >> >   pagesets
> >> >> >     cpu: 0
> >> >> >               count: 2217
> >> >> >               high:  6392
> >> >> >               batch: 63
> >> >> >   vm stats threshold: 125
> >> >> >     cpu: 1
> >> >> >               count: 4510
> >> >> >               high:  6392
> >> >> >               batch: 63
> >> >> >   vm stats threshold: 125
> >> >> >     cpu: 2
> >> >> >               count: 3059
> >> >> >               high:  6392
> >> >> >               batch: 63
> >> >> >
> >> >> >   ...
> >> >> >
> >> >> > The pcp high is around 100 times the batch size.
> >> >> >
> >> >> > I also traced the latency associated with the free_pcppages_bulk()
> >> >> > function during the container exit process:
> >> >> >
> >> >> >      nsecs               : count     distribution
> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >        256 -> 511        : 148      |*****************                       |
> >> >> >        512 -> 1023       : 334      |****************************************|
> >> >> >       1024 -> 2047       : 33       |***                                     |
> >> >> >       2048 -> 4095       : 5        |                                        |
> >> >> >       4096 -> 8191       : 7        |                                        |
> >> >> >       8192 -> 16383      : 12       |*                                       |
> >> >> >      16384 -> 32767      : 30       |***                                     |
> >> >> >      32768 -> 65535      : 21       |**                                      |
> >> >> >      65536 -> 131071     : 15       |*                                       |
> >> >> >     131072 -> 262143     : 27       |***                                     |
> >> >> >     262144 -> 524287     : 84       |**********                              |
> >> >> >     524288 -> 1048575    : 203      |************************                |
> >> >> >    1048576 -> 2097151    : 284      |**********************************      |
> >> >> >    2097152 -> 4194303    : 327      |*************************************** |
> >> >> >    4194304 -> 8388607    : 215      |*************************               |
> >> >> >    8388608 -> 16777215   : 116      |*************                           |
> >> >> >   16777216 -> 33554431   : 47       |*****                                   |
> >> >> >   33554432 -> 67108863   : 8        |                                        |
> >> >> >   67108864 -> 134217727  : 3        |                                        |
> >> >> >
> >> >> > The latency can reach tens of milliseconds.
> >> >> >
> >> >> > Experimenting
> >> >> > =============
> >> >> >
> >> >> > vm.percpu_pagelist_high_fraction
> >> >> > --------------------------------
> >> >> >
> >> >> > The kernel version currently deployed in our production environment is the
> >> >> > stable 6.1.y, and my initial strategy involves optimizing the
> >> >>
> >> >> IMHO, we should focus on upstream activity in the cover letter and patch
> >> >> description.
> >> >> And I don't think it's necessary to describe the alternative solution
> >> >> in too much detail.
> >> >>
> >> >> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
> >> >> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
> >> >> > page draining, which subsequently leads to a substantial reduction in
> >> >> > latency. After setting the sysctl value to 0x7fffffff, I observed a notable
> >> >> > improvement in latency.
> >> >> >
> >> >> >      nsecs               : count     distribution
> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >        128 -> 255        : 120      |                                        |
> >> >> >        256 -> 511        : 365      |*                                       |
> >> >> >        512 -> 1023       : 201      |                                        |
> >> >> >       1024 -> 2047       : 103      |                                        |
> >> >> >       2048 -> 4095       : 84       |                                        |
> >> >> >       4096 -> 8191       : 87       |                                        |
> >> >> >       8192 -> 16383      : 4777     |**************                          |
> >> >> >      16384 -> 32767      : 10572    |*******************************         |
> >> >> >      32768 -> 65535      : 13544    |****************************************|
> >> >> >      65536 -> 131071     : 12723    |*************************************   |
> >> >> >     131072 -> 262143     : 8604     |*************************               |
> >> >> >     262144 -> 524287     : 3659     |**********                              |
> >> >> >     524288 -> 1048575    : 921      |**                                      |
> >> >> >    1048576 -> 2097151    : 122      |                                        |
> >> >> >    2097152 -> 4194303    : 5        |                                        |
> >> >> >
> >> >> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
> >> >> > pcp high watermark size to a minimum of four times the batch size. While
> >> >> > this could theoretically affect throughput, as highlighted by Ying[0], we
> >> >> > have yet to observe any significant difference in throughput within our
> >> >> > production environment after implementing this change.
> >> >> >
> >> >> > Backporting the series "mm: PCP high auto-tuning"
> >> >> > -------------------------------------------------
> >> >>
> >> >> Again, not upstream activity.  We can describe the upstream behavior
> >> >> directly.
> >> >
> >> > Andrew has requested that I provide a more comprehensive analysis of
> >> > this issue, and in response, I have endeavored to outline all the
> >> > pertinent details in a thorough and detailed manner.
> >>
> >> IMHO, upstream activity can provide a comprehensive analysis of the issue
> >> too.  And your patch has changed a lot since the first version.  It's
> >> better to describe your current version.
> >
> > After backporting the pcp auto-tuning feature to the 6.1.y branch, the
> > code is almost the same as the upstream kernel with respect to the pcp.
> > I have thoroughly documented the detailed data showcasing the changes in
> > the backported version, providing a clear picture of the results. However,
> > it's crucial to note that I am unable to directly run the upstream
> > kernel on our production environment due to practical constraints.
>
> IMHO, the patch is for the upstream kernel, not some downstream kernel, so
> the focus should be the upstream activity: the issue in the upstream
> kernel, and how to resolve it.  The production environment test results
> can be used to support the upstream change.

The only distinction in the pcp code between 6.1.y and the upstream kernel
is the set of modifications you made yourself. Furthermore, given that your
code changes have now been successfully backported, what else do you expect
me to do?
>
> >> >> > My second endeavor was to backport the series titled
> >> >> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
> >> >> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
> >> >> > production environment, I noted a pronounced reduction in latency. The
> >> >> > observed outcomes are as enumerated below:
> >> >> >
> >> >> >      nsecs               : count     distribution
> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >        256 -> 511        : 0        |                                        |
> >> >> >        512 -> 1023       : 0        |                                        |
> >> >> >       1024 -> 2047       : 2        |                                        |
> >> >> >       2048 -> 4095       : 11       |                                        |
> >> >> >       4096 -> 8191       : 3        |                                        |
> >> >> >       8192 -> 16383      : 1        |                                        |
> >> >> >      16384 -> 32767      : 2        |                                        |
> >> >> >      32768 -> 65535      : 7        |                                        |
> >> >> >      65536 -> 131071     : 198      |*********                               |
> >> >> >     131072 -> 262143     : 530      |************************                |
> >> >> >     262144 -> 524287     : 824      |**************************************  |
> >> >> >     524288 -> 1048575    : 852      |****************************************|
> >> >> >    1048576 -> 2097151    : 714      |*********************************       |
> >> >> >    2097152 -> 4194303    : 389      |******************                      |
> >> >> >    4194304 -> 8388607    : 143      |******                                  |
> >> >> >    8388608 -> 16777215   : 29       |*                                       |
> >> >> >   16777216 -> 33554431   : 1        |                                        |
> >> >> >
> >> >> > Compared to the previous data, the maximum latency has been reduced to
> >> >> > less than 30ms.
> >> >>
> >> >> People don't care too much about page freeing latency during process
> >> >> exiting.  Instead, they care more about the process exiting time, that
> >> >> is, throughput.  So, it's better to show the page allocation latency
> >> >> which is affected by the simultaneous processes exiting.
> >> >
> >> > I'm also confused. Is this issue really hard to understand?
> >>
> >> IMHO, it's better to prove the issue directly.  If you cannot prove it
> >> directly, you can try an alternative and describe why.
> >
> > Not all data can be verified straightforwardly or effortlessly. The
> > primary focus lies in the zone->lock contention, which necessitates
> > measuring the latency it incurs. To accomplish this, the
> > free_pcppages_bulk() function serves as an effective tool for
> > evaluation. Therefore, I have opted to specifically measure the
> > latency associated with free_pcppages_bulk().
> >
> > The rationale behind not measuring allocation latency is that it would
> > require finding a willing participant to endure potential delays, and
> > that search proved unsuccessful: no one expressed interest. In
> > contrast, assessing free_pcppages_bulk()'s latency only requires
> > identifying and experimenting with the source of the delays,
> > making it a more feasible approach.
>
> Can you run a benchmark program that does quite some memory allocation by
> yourself to test it?

I can have a try. However, is that the key point here? Why can't the lock
contention be measured on the freeing side?

-- 
Regards
Yafang
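(For reference, below is a minimal sketch of the kind of allocation
benchmark being discussed. It is only an illustration, not the production
workload: the process count, per-child size, and sampling parameters are
placeholder values. The children fault in anonymous memory and then exit
together to trigger the bulk freeing path, while the parent keeps faulting
a fixed-size chunk and reports the slowest fault-in time it observed.)

/* Allocation-latency probe under simultaneous process exits (sketch).
 * NPROC, CHILD_BYTES, CHUNK and the iteration count are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define NPROC        18
#define CHILD_BYTES  (2UL << 30)        /* 2 GB per child, illustrative */
#define CHUNK        (64UL << 20)       /* parent faults 64 MB per sample */
#define PAGE         4096UL

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static void touch(char *buf, size_t len)
{
	for (size_t off = 0; off < len; off += PAGE)
		buf[off] = 1;               /* fault in one page per iteration */
}

int main(void)
{
	for (int i = 0; i < NPROC; i++) {
		if (fork() == 0) {
			char *buf = mmap(NULL, CHILD_BYTES, PROT_READ | PROT_WRITE,
					 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (buf == MAP_FAILED)
				_exit(1);
			touch(buf, CHILD_BYTES);
			sleep(5);           /* let all children finish faulting */
			_exit(0);           /* exit together -> bulk page freeing */
		}
	}

	long long worst = 0;
	/* Sample allocation latency while the children run and then exit. */
	for (int iter = 0; iter < 400; iter++) {
		char *buf = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED)
			break;
		long long t0 = now_ns();
		touch(buf, CHUNK);
		long long dt = now_ns() - t0;
		if (dt > worst)
			worst = dt;
		munmap(buf, CHUNK);
		usleep(10000);
	}
	printf("worst 64MB fault-in latency: %.3f ms\n", worst / 1e6);

	while (wait(NULL) > 0)
		;                           /* reap the children */
	return 0;
}

Comparing the worst-case number with different vm.pcp_batch_scale_max (or
vm.percpu_pagelist_high_fraction) settings would give the allocation-side
view asked for above.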