From: Yafang Shao <laoar.shao@gmail.com>
Date: Tue, 27 May 2025 13:53:09 +0800
Subject: Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
To: Zi Yan
Cc: akpm@linux-foundation.org, david@redhat.com,
    baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
    Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
    dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com,
    gutierrez.asier@huawei-partners.com, willy@infradead.org,
    ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
    bpf@vger.kernel.org, linux-mm@kvack.org
In-Reply-To: <7570019E-1FF1-47E0-82CD-D28378EBD8B6@nvidia.com>
References: <20250520060504.20251-1-laoar.shao@gmail.com>
    <7570019E-1FF1-47E0-82CD-D28378EBD8B6@nvidia.com>
On Mon, May 26, 2025 at 10:32 PM Zi Yan wrote:
>
> On 24 May 2025, at 23:01, Yafang Shao wrote:
>
> > On Tue, May 20, 2025 at 2:05 PM Yafang Shao wrote:
> >>
> >> Background
> >> ----------
> >>
> >> At my current employer, PDD, we have consistently configured THP to
> >> "never" on our production servers due to past incidents caused by its
> >> behavior:
> >>
> >> - Increased memory consumption
> >>   THP significantly raises overall memory usage.
> >>
> >> - Latency spikes
> >>   Random latency spikes occur due to more frequent memory compaction
> >>   activity triggered by THP.
> >>
> >> These issues have made sysadmins hesitant to switch to "madvise" or
> >> "always" modes.
> >>
> >> New Motivation
> >> --------------
> >>
> >> We have now identified that certain AI workloads achieve substantial
> >> performance gains with THP enabled. However, we've also verified that
> >> some workloads see little to no benefit -- or are even negatively
> >> impacted -- by THP.
> >>
> >> In our Kubernetes environment, we deploy mixed workloads on a single
> >> server to maximize resource utilization. Our goal is to selectively
> >> enable THP for services that benefit from it while keeping it disabled
> >> for others. This approach allows us to incrementally enable THP for
> >> additional services and assess how to make it more viable in
> >> production.
> >>
> >> Proposed Solution
> >> -----------------
> >>
> >> For this use case, Johannes suggested introducing a dedicated mode [0].
> >> In this new mode, we could implement BPF-based THP adjustment for
> >> fine-grained control over tasks or cgroups. If no BPF program is
> >> attached, THP remains in "never" mode. This solution elegantly meets
> >> our needs while avoiding the complexity of managing BPF alongside
> >> other THP modes.
> >>
> >> A selftest example demonstrates how to enable THP for the current task
> >> while keeping it disabled for others.
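To make this concrete for anyone who hasn't opened the patches: under
the new mode, the policy boils down to a struct_ops program whose
callback decides, per task, whether THP is allowed. The sketch below is
illustrative only -- the ops name (thp_adjust_ops) and callback
(thp_allowed) are stand-ins rather than the exact interface from patch
3; the real example lives in progs/test_thp_adjust.c:

// SPDX-License-Identifier: GPL-2.0
/* Illustrative sketch only: "thp_adjust_ops" and "thp_allowed" are
 * hypothetical names; see mm/bpf_thp.c in this series for the real ops.
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

/* Return nonzero to allow THP for the current task, 0 to keep the
 * "never" behavior for it. This comm check is why patch 4 exposes
 * get_current_comm through bpf_base_func_proto.
 */
SEC("struct_ops/thp_allowed")
int thp_allowed(void)
{
	char comm[16];

	if (bpf_get_current_comm(comm, sizeof(comm)))
		return 0;	/* on any error, keep THP disabled */
	/* bpf_strncmp() stops at the NUL, so this is an exact match */
	return !bpf_strncmp(comm, sizeof(comm), "ai_service");
}

SEC(".struct_ops.link")
struct thp_adjust_ops thp_ops = {
	.thp_allowed = (void *)thp_allowed,
};

If no such program is loaded, every task stays on "never", which is
what makes the mode safe to roll out incrementally.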
> >> Alternative Proposals
> >> ---------------------
> >>
> >> - Gutierrez's cgroup-based approach [1]
> >>   - Proposed adding a new cgroup file to control THP policy.
> >>   - However, as Johannes noted, cgroups are designed for hierarchical
> >>     resource allocation, not arbitrary policy settings [2].
> >>
> >> - Usama's per-task THP proposal based on prctl() [3]:
> >>   - Enabling THP per task via prctl().
> >>   - As David pointed out, neither madvise() nor prctl() works in
> >>     "never" mode [4], making this solution insufficient for our needs.
> >>
> >> Conclusion
> >> ----------
> >>
> >> Introducing a new "bpf" mode for BPF-based per-task THP adjustments is
> >> the most effective solution for our requirements. This approach
> >> represents a small but meaningful step toward making THP truly
> >> usable -- and manageable -- in production environments.
> >>
> >> This is currently a PoC implementation. Feedback of any kind is
> >> welcome.
> >>
> >> Link: https://lore.kernel.org/linux-mm/20250509164654.GA608090@cmpxchg.org/ [0]
> >> Link: https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/ [1]
> >> Link: https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/ [2]
> >> Link: https://lore.kernel.org/linux-mm/20250519223307.3601786-1-usamaarif642@gmail.com/ [3]
> >> Link: https://lore.kernel.org/linux-mm/41e60fa0-2943-4b3f-ba92-9f02838c881b@redhat.com/ [4]
> >>
> >> RFC v1->v2:
> >> The main changes are as follows:
> >> - Use struct_ops instead of fmod_ret (Alexei)
> >> - Introduce a new THP mode (Johannes)
> >> - Introduce new helpers for BPF hook (Zi)
> >> - Refine the commit log
> >>
> >> RFC v1: https://lwn.net/Articles/1019290/
> >>
> >> Yafang Shao (5):
> >>   mm: thp: Add a new mode "bpf"
> >>   mm: thp: Add hook for BPF based THP adjustment
> >>   mm: thp: add struct ops for BPF based THP adjustment
> >>   bpf: Add get_current_comm to bpf_base_func_proto
> >>   selftests/bpf: Add selftest for THP adjustment
> >>
> >>  include/linux/huge_mm.h                       |  15 +-
> >>  kernel/bpf/cgroup.c                           |   2 -
> >>  kernel/bpf/helpers.c                          |   2 +
> >>  mm/Makefile                                   |   3 +
> >>  mm/bpf_thp.c                                  | 120 ++++++++++++
> >>  mm/huge_memory.c                              |  65 ++++++-
> >>  mm/khugepaged.c                               |   3 +
> >>  tools/testing/selftests/bpf/config            |   1 +
> >>  .../selftests/bpf/prog_tests/thp_adjust.c     | 175 +++++++++++++++++
> >>  .../selftests/bpf/progs/test_thp_adjust.c     |  39 ++++
> >>  10 files changed, 414 insertions(+), 11 deletions(-)
> >>  create mode 100644 mm/bpf_thp.c
> >>  create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
> >>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
> >>
> >> --
> >> 2.43.5
> >>
> >
> > Hi all,
> >
> > Let's summarize the current state of the discussion and identify how
> > to move forward.
> >
> > - Global-Only Control is Not Viable
> >   We all seem to agree that a global-only control for THP is unwise. In
> >   practice, some workloads benefit from THP while others do not, so a
> >   one-size-fits-all approach doesn't work.
> >
> > - Should We Use "Always" or "Madvise"?
> >   I suspect no one would choose "always" in its current state. ;)
> >   Both Lorenzo and David propose relying on the madvise mode. However,
> >   since madvise() is an unprivileged userspace mechanism, any user can
> >   freely adjust their THP policy. This makes fine-grained control
> >   impossible without breaking userspace compatibility -- an undesirable
> >   tradeoff.
> >   Given these limitations, the community should consider introducing a
> >   new "admin" mode for privileged THP policy management.
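(To spell out the compatibility problem: in "madvise" mode, opting in
requires no privilege at all, so any task can override whatever policy
the sysadmin intended. A minimal illustration:

#include <stddef.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL << 20;	/* one PMD-sized (2MB on x86-64) region */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	/* No capability check: any user opts their own mapping into
	 * THP backing (or out of it, via MADV_NOHUGEPAGE). */
	return madvise(buf, len, MADV_HUGEPAGE);
}

Nothing above needs CAP_SYS_ADMIN or any other capability, which is why
an "admin" mode would have to take precedence over per-mapping hints.)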
>
> I agree with the above two points.
>
> > - Can the Kernel Automatically Manage THP Without User Input?
> >   In practice, users define their own success metrics -- such as
> >   latency (RT), queries per second (QPS), or throughput -- to evaluate
> >   a feature's usefulness. If a feature fails to improve these metrics,
> >   it provides no practical value.
> >   Currently, the kernel lacks visibility into user-defined metrics,
> >   making fully automated optimization impossible (at least without
> >   user input). More importantly, automatic management offers no
> >   benefit if it doesn't align with user needs.
>
> Yes, the kernel is basically guessing what userspace wants from hints
> like MADV_HUGEPAGE/MADV_NOHUGEPAGE. But the kernel has a global view
> of memory fragmentation, which userspace cannot easily get.

Correct, memory fragmentation is another critical factor in determining
whether to allocate THP.

> I wonder whether userspace tuning might benefit one set of
> applications while hurting others, or overall performance. Right now,
> THP tuning is 0 or 1: either an application wants THPs or it doesn't.
> We might need a way of ranking THP requests from userspace so the
> kernel can prioritize them. (I am not sure we can add another
> user-input parameter, like THP_nice, to get this done, since
> apparently everyone would set THP_nice to -100 to put themselves at
> the top of the list.)

Interesting idea. Perhaps we could make such a parameter configurable
only by sysadmins.
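For instance, the ranking could live inside the root-loaded BPF policy
itself, so no unprivileged THP_nice knob exists in the first place.
Extending my earlier illustrative sketch -- a "thp_prio" callback is
purely hypothetical and not part of this series:

/* Hypothetical: nice-style ranking, lower value = higher priority.
 * Neither this callback nor the weighting exists in the posted patches.
 */
SEC("struct_ops/thp_prio")
int thp_prio(void)
{
	char comm[16];

	if (bpf_get_current_comm(comm, sizeof(comm)))
		return 19;	/* unknown tasks rank last */
	if (!bpf_strncmp(comm, 7, "ai_serv"))
		return -100;	/* latency-critical services rank first */
	if (!bpf_strncmp(comm, 5, "batch"))
		return 19;	/* batch jobs rank last */
	return 0;
}

Since only a privileged user can load the struct_ops program, nobody
can put themselves at -100 on their own.

-- 
Regards
Yafang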