From: Yafang Shao <laoar.shao@gmail.com>
Date: Tue, 27 May 2025 13:53:09 +0800
Subject: Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
To: Zi Yan
Cc: akpm@linux-foundation.org, david@redhat.com,
    baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
    Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
    dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com,
    gutierrez.asier@huawei-partners.com, willy@infradead.org,
    ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
    bpf@vger.kernel.org, linux-mm@kvack.org
In-Reply-To: <7570019E-1FF1-47E0-82CD-D28378EBD8B6@nvidia.com>
References: <20250520060504.20251-1-laoar.shao@gmail.com>
    <7570019E-1FF1-47E0-82CD-D28378EBD8B6@nvidia.com>
On Mon, May 26, 2025 at 10:32 PM Zi Yan wrote:
>
> On 24 May 2025, at 23:01, Yafang Shao wrote:
>
> > On Tue, May 20, 2025 at 2:05 PM Yafang Shao wrote:
> >>
> >> Background
> >> ----------
> >>
> >> At my current employer, PDD, we have consistently configured THP to
> >> "never" on our production servers due to past incidents caused by its
> >> behavior:
> >>
> >> - Increased memory consumption
> >>   THP significantly raises overall memory usage.
> >>
> >> - Latency spikes
> >>   Random latency spikes occur due to more frequent memory compaction
> >>   activity triggered by THP.
> >>
> >> These issues have made sysadmins hesitant to switch to "madvise" or
> >> "always" modes.
> >>
> >> New Motivation
> >> --------------
> >>
> >> We have now identified that certain AI workloads achieve substantial
> >> performance gains with THP enabled. However, we've also verified that
> >> some workloads see little to no benefit -- or are even negatively
> >> impacted -- by THP.
> >>
> >> In our Kubernetes environment, we deploy mixed workloads on a single
> >> server to maximize resource utilization. Our goal is to selectively
> >> enable THP for services that benefit from it while keeping it disabled
> >> for others. This approach allows us to incrementally enable THP for
> >> additional services and assess how to make it more viable in
> >> production.
> >>
> >> Proposed Solution
> >> -----------------
> >>
> >> For this use case, Johannes suggested introducing a dedicated mode [0].
> >> In this new mode, we could implement BPF-based THP adjustment for
> >> fine-grained control over tasks or cgroups. If no BPF program is
> >> attached, THP remains in "never" mode. This solution elegantly meets
> >> our needs while avoiding the complexity of managing BPF alongside
> >> other THP modes.
> >>
> >> A selftest example demonstrates how to enable THP for the current task
> >> while keeping it disabled for others.
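To make this concrete for anyone who hasn't opened the patches: under
the new mode, the policy boils down to a struct_ops program whose
callback decides, per task, whether THP is allowed. The sketch below is
illustrative only -- the ops name (thp_adjust_ops) and callback
(thp_allowed) are stand-ins rather than the exact interface from patch
3; the real example lives in progs/test_thp_adjust.c:

// SPDX-License-Identifier: GPL-2.0
/* Illustrative sketch only: "thp_adjust_ops" and "thp_allowed" are
 * hypothetical names; see mm/bpf_thp.c in this series for the real ops.
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

/* Return nonzero to allow THP for the current task, 0 to keep the
 * "never" behavior for it. This comm check is why patch 4 exposes
 * get_current_comm through bpf_base_func_proto.
 */
SEC("struct_ops/thp_allowed")
int thp_allowed(void)
{
	char comm[16];

	if (bpf_get_current_comm(comm, sizeof(comm)))
		return 0;	/* on any error, keep THP disabled */
	/* bpf_strncmp() stops at the NUL, so this is an exact match */
	return !bpf_strncmp(comm, sizeof(comm), "ai_service");
}

SEC(".struct_ops.link")
struct thp_adjust_ops thp_ops = {
	.thp_allowed = (void *)thp_allowed,
};

If no such program is loaded, every task stays on "never", which is
what makes the mode safe to roll out incrementally.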
> >> Alternative Proposals
> >> ---------------------
> >>
> >> - Gutierrez's cgroup-based approach [1]
> >>   - Proposed adding a new cgroup file to control THP policy.
> >>   - However, as Johannes noted, cgroups are designed for hierarchical
> >>     resource allocation, not arbitrary policy settings [2].
> >>
> >> - Usama's per-task THP proposal based on prctl() [3]:
> >>   - Enabling THP per task via prctl().
> >>   - As David pointed out, neither madvise() nor prctl() works in
> >>     "never" mode [4], making this solution insufficient for our needs.
> >>
> >> Conclusion
> >> ----------
> >>
> >> Introducing a new "bpf" mode for BPF-based per-task THP adjustments is
> >> the most effective solution for our requirements. This approach
> >> represents a small but meaningful step toward making THP truly
> >> usable -- and manageable -- in production environments.
> >>
> >> This is currently a PoC implementation. Feedback of any kind is
> >> welcome.
> >>
> >> Link: https://lore.kernel.org/linux-mm/20250509164654.GA608090@cmpxchg.org/ [0]
> >> Link: https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/ [1]
> >> Link: https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/ [2]
> >> Link: https://lore.kernel.org/linux-mm/20250519223307.3601786-1-usamaarif642@gmail.com/ [3]
> >> Link: https://lore.kernel.org/linux-mm/41e60fa0-2943-4b3f-ba92-9f02838c881b@redhat.com/ [4]
> >>
> >> RFC v1->v2:
> >> The main changes are as follows:
> >> - Use struct_ops instead of fmod_ret (Alexei)
> >> - Introduce a new THP mode (Johannes)
> >> - Introduce new helpers for BPF hook (Zi)
> >> - Refine the commit log
> >>
> >> RFC v1: https://lwn.net/Articles/1019290/
> >>
> >> Yafang Shao (5):
> >>   mm: thp: Add a new mode "bpf"
> >>   mm: thp: Add hook for BPF based THP adjustment
> >>   mm: thp: add struct ops for BPF based THP adjustment
> >>   bpf: Add get_current_comm to bpf_base_func_proto
> >>   selftests/bpf: Add selftest for THP adjustment
> >>
> >>  include/linux/huge_mm.h                       |  15 +-
> >>  kernel/bpf/cgroup.c                           |   2 -
> >>  kernel/bpf/helpers.c                          |   2 +
> >>  mm/Makefile                                   |   3 +
> >>  mm/bpf_thp.c                                  | 120 ++++++++++++
> >>  mm/huge_memory.c                              |  65 ++++++-
> >>  mm/khugepaged.c                               |   3 +
> >>  tools/testing/selftests/bpf/config            |   1 +
> >>  .../selftests/bpf/prog_tests/thp_adjust.c     | 175 +++++++++++++++++
> >>  .../selftests/bpf/progs/test_thp_adjust.c     |  39 ++++
> >>  10 files changed, 414 insertions(+), 11 deletions(-)
> >>  create mode 100644 mm/bpf_thp.c
> >>  create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
> >>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
> >>
> >> --
> >> 2.43.5
> >>
> >
> > Hi all,
> >
> > Let's summarize the current state of the discussion and identify how
> > to move forward.
> >
> > - Global-Only Control is Not Viable
> >   We all seem to agree that a global-only control for THP is unwise. In
> >   practice, some workloads benefit from THP while others do not, so a
> >   one-size-fits-all approach doesn't work.
> >
> > - Should We Use "Always" or "Madvise"?
> >   I suspect no one would choose "always" in its current state. ;)
> >   Both Lorenzo and David propose relying on the madvise mode. However,
> >   since madvise() is an unprivileged userspace mechanism, any user can
> >   freely adjust their THP policy. This makes fine-grained control
> >   impossible without breaking userspace compatibility -- an undesirable
> >   tradeoff.
> >   Given these limitations, the community should consider introducing a
> >   new "admin" mode for privileged THP policy management.
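(To spell out the compatibility problem: in "madvise" mode, opting in
requires no privilege at all, so any task can override whatever policy
the sysadmin intended. A minimal illustration:

#include <stddef.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL << 20;	/* one PMD-sized (2MB on x86-64) region */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	/* No capability check: any user opts their own mapping into
	 * THP backing (or out of it, via MADV_NOHUGEPAGE). */
	return madvise(buf, len, MADV_HUGEPAGE);
}

Nothing above needs CAP_SYS_ADMIN or any other capability, which is why
an "admin" mode would have to take precedence over per-mapping hints.)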
>
> I agree with the above two points.
>
> > - Can the Kernel Automatically Manage THP Without User Input?
> >   In practice, users define their own success metrics -- such as
> >   latency (RT), queries per second (QPS), or throughput -- to evaluate
> >   a feature's usefulness. If a feature fails to improve these metrics,
> >   it provides no practical value.
> >   Currently, the kernel lacks visibility into user-defined metrics,
> >   making fully automated optimization impossible (at least without
> >   user input). More importantly, automatic management offers no
> >   benefit if it doesn't align with user needs.
>
> Yes, the kernel is basically guessing what userspace wants from hints
> like MADV_HUGEPAGE/MADV_NOHUGEPAGE. But the kernel has a global view
> of memory fragmentation, which userspace cannot easily get.

Correct, memory fragmentation is another critical factor in determining
whether to allocate THP.

> I wonder whether userspace tuning might benefit one set of
> applications while hurting others, or overall performance. Right now,
> THP tuning is 0 or 1: either an application wants THPs or it doesn't.
> We might need a way of ranking THP requests from userspace so the
> kernel can prioritize them. (I am not sure we can add another
> user-input parameter, like THP_nice, to get this done, since
> apparently everyone would set THP_nice to -100 to put themselves at
> the top of the list.)

Interesting idea. Perhaps we could make such a parameter configurable
only by sysadmins.
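For instance, the ranking could live inside the root-loaded BPF policy
itself, so no unprivileged THP_nice knob exists in the first place.
Extending my earlier illustrative sketch -- a "thp_prio" callback is
purely hypothetical and not part of this series:

/* Hypothetical: nice-style ranking, lower value = higher priority.
 * Neither this callback nor the weighting exists in the posted patches.
 */
SEC("struct_ops/thp_prio")
int thp_prio(void)
{
	char comm[16];

	if (bpf_get_current_comm(comm, sizeof(comm)))
		return 19;	/* unknown tasks rank last */
	if (!bpf_strncmp(comm, 7, "ai_serv"))
		return -100;	/* latency-critical services rank first */
	if (!bpf_strncmp(comm, 5, "batch"))
		return 19;	/* batch jobs rank last */
	return 0;
}

Since only a privileged user can load the struct_ops program, nobody
can put themselves at -100 on their own.

-- 
Regards
Yafang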