From: Yafang Shao
To: akpm@linux-foundation.org, david@redhat.com, ziy@nvidia.com,
    baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com,
    Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
    dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com,
    gutierrez.asier@huawei-partners.com, willy@infradead.org,
    ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org
Cc: bpf@vger.kernel.org, linux-mm@kvack.org, Yafang Shao
Subject: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
Date: Tue, 20 May 2025 14:04:58 +0800
Message-Id: <20250520060504.20251-1-laoar.shao@gmail.com>

Background
----------

At my current employer, PDD, we have consistently configured THP to "never"
on our production servers due to past incidents caused by its behavior:

- Increased memory consumption
  THP significantly raises overall memory usage.

- Latency spikes
  Random latency spikes occur due to more frequent memory compaction
  activity triggered by THP.

These issues have made sysadmins hesitant to switch to "madvise" or
"always" modes.

New Motivation
--------------

We have now identified that certain AI workloads achieve substantial
performance gains with THP enabled. However, we’ve also verified that some
workloads see little to no benefit from THP, or are even negatively
impacted by it.

In our Kubernetes environment, we deploy mixed workloads on a single server
to maximize resource utilization. Our goal is to selectively enable THP for
services that benefit from it while keeping it disabled for others. This
approach allows us to incrementally enable THP for additional services and
assess how to make it more viable in production.

Proposed Solution
-----------------

For this use case, Johannes suggested introducing a dedicated mode [0]. In
this new mode, we could implement BPF-based THP adjustment for fine-grained
control over tasks or cgroups. If no BPF program is attached, THP remains
in "never" mode. This solution elegantly meets our needs while avoiding the
complexity of managing BPF alongside other THP modes.

A selftest example demonstrates how to enable THP for the current task
while keeping it disabled for others.

Alternative Proposals
---------------------

- Gutierrez’s cgroup-based approach [1]
  - Proposes adding a new cgroup file to control THP policy.
  - However, as Johannes noted, cgroups are designed for hierarchical
    resource allocation, not arbitrary policy settings [2].

- Usama’s per-task THP proposal based on prctl() [3]
  - Enables THP per task via prctl().
  - As David pointed out, neither madvise() nor prctl() works in "never"
    mode [4], making this solution insufficient for our needs.

Conclusion
----------

Introducing a new "bpf" mode for BPF-based per-task THP adjustments is the
most effective solution for our requirements. This approach represents a
small but meaningful step toward making THP truly usable, and manageable,
in production environments.

This is currently a PoC implementation. Feedback of any kind is welcome.

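To make the fallback rule from "Proposed Solution" concrete, here is a
minimal userspace sketch in plain C. It is not code from this series: the
struct, the function names, and the "ai_worker" command name are all
hypothetical, and the function pointer only stands in for an attached BPF
program. It models one rule only: with nothing attached, the "bpf" mode
behaves like "never"; otherwise the attached policy decides per task.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the task being evaluated. */
struct task {
	const char *comm;	/* command name of the task */
};

/* Hypothetical policy callback; it plays the role of an attached program. */
typedef bool (*thp_policy_fn)(const struct task *task);

/* Currently attached policy; NULL means nothing is attached. */
static thp_policy_fn attached_policy = NULL;

void thp_policy_attach(thp_policy_fn fn)
{
	attached_policy = fn;
}

void thp_policy_detach(void)
{
	attached_policy = NULL;
}

/* Decide whether THP may be used for @task while in the "bpf" mode. */
bool thp_allowed_in_bpf_mode(const struct task *task)
{
	assert(task != NULL);

	/* No policy attached: same outcome as the "never" mode. */
	if (attached_policy == NULL)
		return false;

	/* Otherwise the attached policy makes the per-task decision. */
	return attached_policy(task);
}

/* Example policy (assumption): allow THP only for one command name. */
bool allow_ai_worker_only(const struct task *task)
{
	return strcmp(task->comm, "ai_worker") == 0;
}
```

With allow_ai_worker_only attached, a task whose comm is "ai_worker" is
allowed to use THP and every other task is not; after detaching, every
task is back to the "never" behavior.
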
Link: https://lore.kernel.org/linux-mm/20250509164654.GA608090@cmpxchg.org/ [0]
Link: https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/ [1]
Link: https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/ [2]
Link: https://lore.kernel.org/linux-mm/20250519223307.3601786-1-usamaarif642@gmail.com/ [3]
Link: https://lore.kernel.org/linux-mm/41e60fa0-2943-4b3f-ba92-9f02838c881b@redhat.com/ [4]

RFC v1->v2:
The main changes are as follows,
- Use struct_ops instead of fmod_ret (Alexei)
- Introduce a new THP mode (Johannes)
- Introduce new helpers for BPF hook (Zi)
- Refine the commit log

RFC v1: https://lwn.net/Articles/1019290/

Yafang Shao (5):
  mm: thp: Add a new mode "bpf"
  mm: thp: Add hook for BPF based THP adjustment
  mm: thp: add struct ops for BPF based THP adjustment
  bpf: Add get_current_comm to bpf_base_func_proto
  selftests/bpf: Add selftest for THP adjustment

 include/linux/huge_mm.h                       |  15 +-
 kernel/bpf/cgroup.c                           |   2 -
 kernel/bpf/helpers.c                          |   2 +
 mm/Makefile                                   |   3 +
 mm/bpf_thp.c                                  | 120 ++++++++++++
 mm/huge_memory.c                              |  65 ++++++-
 mm/khugepaged.c                               |   3 +
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/prog_tests/thp_adjust.c     | 175 ++++++++++++++++++
 .../selftests/bpf/progs/test_thp_adjust.c     |  39 ++++
 10 files changed, 414 insertions(+), 11 deletions(-)
 create mode 100644 mm/bpf_thp.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c

-- 
2.43.5