Subject: Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
Date: Mon, 26 May 2025 10:41:32 +0300
To: Yafang Shao, linux-mm@kvack.org
References: <20250520060504.20251-1-laoar.shao@gmail.com>
From: Gutierrez Asier <gutierrez.asier@huawei-partners.com>

On 5/25/2025 6:01 AM, Yafang Shao wrote:
> On Tue, May 20, 2025 at 2:05 PM Yafang Shao wrote:
>>
>> Background
>> ----------
>>
>> At my current employer, PDD, we have consistently configured THP to "never"
>> on our production servers due to past incidents caused by its behavior:
>>
>> - Increased memory consumption
>>   THP significantly raises overall memory usage.
>>
>> - Latency spikes
>>   Random latency spikes occur due to more frequent memory compaction
>>   activity triggered by THP.
>>
>> These issues have made sysadmins hesitant to switch to "madvise" or
>> "always" modes.
>>
>> New Motivation
>> --------------
>>
>> We have now identified that certain AI workloads achieve substantial
>> performance gains with THP enabled. However, we’ve also verified that some
>> workloads see little to no benefit—or are even negatively impacted—by THP.
>>
>> In our Kubernetes environment, we deploy mixed workloads on a single server
>> to maximize resource utilization. Our goal is to selectively enable THP for
>> services that benefit from it while keeping it disabled for others. This
>> approach allows us to incrementally enable THP for additional services and
>> assess how to make it more viable in production.
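
(For context: the "never"/"madvise"/"always" policy referred to above is the
global sysfs knob /sys/kernel/mm/transparent_hugepage/enabled. A tiny C
sketch, illustrative only and not part of the series, that reads the active
mode:)

/*
 * Illustrative only: the file lists every mode and marks the active one
 * in brackets, e.g. "always madvise [never]".
 */
#include <stdio.h>

int main(void)
{
	char buf[128];
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("global THP policy: %s", buf);
	fclose(f);
	return 0;
}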
>>
>> Proposed Solution
>> -----------------
>>
>> For this use case, Johannes suggested introducing a dedicated mode [0]. In
>> this new mode, we could implement BPF-based THP adjustment for fine-grained
>> control over tasks or cgroups. If no BPF program is attached, THP remains
>> in "never" mode. This solution elegantly meets our needs while avoiding the
>> complexity of managing BPF alongside other THP modes.
>>
>> A selftest example demonstrates how to enable THP for the current task
>> while keeping it disabled for others.
>>
>> Alternative Proposals
>> ---------------------
>>
>> - Gutierrez’s cgroup-based approach [1]
>>   - Proposed adding a new cgroup file to control THP policy.
>>   - However, as Johannes noted, cgroups are designed for hierarchical
>>     resource allocation, not arbitrary policy settings [2].
>>
>> - Usama’s per-task THP proposal based on prctl() [3]:
>>   - Enabling THP per task via prctl().
>>   - As David pointed out, neither madvise() nor prctl() works in "never"
>>     mode [4], making this solution insufficient for our needs.
>>
>> Conclusion
>> ----------
>>
>> Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
>> most effective solution for our requirements. This approach represents a
>> small but meaningful step toward making THP truly usable—and manageable—in
>> production environments.
>>
>> This is currently a PoC implementation. Feedback of any kind is welcome.
>>
>> Link: https://lore.kernel.org/linux-mm/20250509164654.GA608090@cmpxchg.org/ [0]
>> Link: https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/ [1]
>> Link: https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/ [2]
>> Link: https://lore.kernel.org/linux-mm/20250519223307.3601786-1-usamaarif642@gmail.com/ [3]
>> Link: https://lore.kernel.org/linux-mm/41e60fa0-2943-4b3f-ba92-9f02838c881b@redhat.com/ [4]
>>
>> RFC v1->v2:
>> The main changes are as follows,
>> - Use struct_ops instead of fmod_ret (Alexei)
>> - Introduce a new THP mode (Johannes)
>> - Introduce new helpers for BPF hook (Zi)
>> - Refine the commit log
>>
>> RFC v1: https://lwn.net/Articles/1019290/
>>
>> Yafang Shao (5):
>>   mm: thp: Add a new mode "bpf"
>>   mm: thp: Add hook for BPF based THP adjustment
>>   mm: thp: add struct ops for BPF based THP adjustment
>>   bpf: Add get_current_comm to bpf_base_func_proto
>>   selftests/bpf: Add selftest for THP adjustment
>>
>>  include/linux/huge_mm.h                            |  15 +-
>>  kernel/bpf/cgroup.c                                |   2 -
>>  kernel/bpf/helpers.c                               |   2 +
>>  mm/Makefile                                        |   3 +
>>  mm/bpf_thp.c                                       | 120 ++++++++++++
>>  mm/huge_memory.c                                   |  65 ++++++-
>>  mm/khugepaged.c                                    |   3 +
>>  tools/testing/selftests/bpf/config                 |   1 +
>>  .../selftests/bpf/prog_tests/thp_adjust.c          | 175 ++++++++++++++++++
>>  .../selftests/bpf/progs/test_thp_adjust.c          |  39 ++++
>>  10 files changed, 414 insertions(+), 11 deletions(-)
>>  create mode 100644 mm/bpf_thp.c
>>  create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
>>
>> --
>> 2.43.5
>>
>
> Hi all,
>
> Let’s summarize the current state of the discussion and identify how
> to move forward.
>
> - Global-Only Control is Not Viable
> We all seem to agree that a global-only control for THP is unwise. In
> practice, some workloads benefit from THP while others do not, so a
> one-size-fits-all approach doesn’t work.
>
> - Should We Use "Always" or "Madvise"?
> I suspect no one would choose 'always' in its current state.

;)

> Both Lorenzo and David propose relying on the madvise mode. However,
> since madvise is an unprivileged userspace mechanism, any user can
> freely adjust their THP policy. This makes fine-grained control
> impossible without breaking userspace compatibility—an undesirable
> tradeoff.
> Given these limitations, the community should consider introducing a
> new "admin" mode for privileged THP policy management.
>
> - Can the Kernel Automatically Manage THP Without User Input?
> In practice, users define their own success metrics—such as latency
> (RT), queries per second (QPS), or throughput—to evaluate a feature’s
> usefulness. If a feature fails to improve these metrics, it provides
> no practical value.
> Currently, the kernel lacks visibility into user-defined metrics,
> making fully automated optimization impossible (at least without user
> input). More importantly, automatic management offers no benefit if it
> doesn’t align with user needs.

I don't think that using metrics like RPS or QPS is the right way. These
metrics can be affected by many factors beyond our control: network
issues, garbage collectors in user space (JVM, golang, etc.), and so on.
Even noisy neighbors can slow down a service.

> Exception: For kernel-enforced changes (e.g., the page-to-folio
> transition), users must adapt regardless. But THP tuning requires
> flexibility—forcing automation without measurable gains is
> counterproductive.
> (Please correct me if I’ve overlooked anything.)
>

-- 
Asier Gutierrez
Huawei
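
For reference, the two existing unprivileged mechanisms discussed in this
thread, madvise(MADV_HUGEPAGE) and prctl(PR_SET_THP_DISABLE), are already in
mainline; neither can enable THP once the global policy is "never", which is
the gap the proposed "bpf" mode is meant to fill. A minimal user-space sketch
(illustrative only, not taken from the patch set):

/*
 * Illustrative sketch only; not from the RFC patches.
 * madvise(MADV_HUGEPAGE) is a per-range hint (honored in the "always" and
 * "madvise" modes, ignored in "never"); prctl(PR_SET_THP_DISABLE) is a
 * per-process opt-out. Neither one can turn THP on when the global policy
 * is "never", which is the limitation discussed above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/prctl.h>

int main(void)
{
	size_t len = 4UL << 20;		/* 4 MiB */
	void *buf;

	/* 2 MiB-aligned anonymous buffer so it is eligible for THP. */
	if (posix_memalign(&buf, 2UL << 20, len))
		return 1;

	/* Per-range hint: ask the kernel to back this mapping with THP. */
	if (madvise(buf, len, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");

	memset(buf, 0, len);		/* fault the memory in */

	/* Per-process opt-out: no THP for this task from now on. */
	if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
		perror("prctl(PR_SET_THP_DISABLE)");

	return 0;
}

Whether the madvise()d range actually received huge pages can be checked via
the AnonHugePages field in /proc/<pid>/smaps under each global mode.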