References: <20250520060504.20251-1-laoar.shao@gmail.com> <7d8a9a5c-e0ef-4e36-9e1d-1ef8e853aed4@redhat.com>
From: Yafang Shao <laoar.shao@gmail.com>
Date: Tue, 27 May 2025 13:46:00 +0800
Subject: Re: [RFC PATCH v2 0/5] mm, bpf: BPF based THP adjustment
To: David Hildenbrand
Cc: akpm@linux-foundation.org, ziy@nvidia.com, baolin.wang@linux.alibaba.com, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com, dev.jain@arm.com, hannes@cmpxchg.org, usamaarif642@gmail.com, gutierrez.asier@huawei-partners.com, willy@infradead.org, ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, bpf@vger.kernel.org, linux-mm@kvack.org

On Mon, May 26, 2025 at 6:49 PM David Hildenbrand wrote:
>
> On 26.05.25 11:37, Yafang Shao wrote:
> > On Mon, May 26, 2025 at 4:14 PM David Hildenbrand wrote:
> >>
> >>> Hi all,
> >>>
> >>> Let's summarize the current state of the discussion and identify how
> >>> to move forward.
> >>>
> >>> - Global-Only Control is Not Viable
> >>> We all seem to agree that a global-only control for THP is unwise. In
> >>> practice, some workloads benefit from THP while others do not, so a
> >>> one-size-fits-all approach doesn't work.
> >>>
> >>> - Should We Use "Always" or "Madvise"?
> >>> I suspect no one would choose 'always' in its current state. ;)
> >>
> >> IIRC, RHEL9 has had the default set to "always" for a long time.
> >
> > Good to know.
> >
> >> I guess it really depends on how different the workloads are that you
> >> are running on the same machine.
> >
> > Correct. If we want to enable THP for specific workloads without
> > modifying the kernel, we must isolate them on dedicated servers.
> > However, this approach wastes resources and is not an acceptable
> > solution.
> >
> >>> Both Lorenzo and David propose relying on the madvise mode. However,
> >>> since madvise is an unprivileged userspace mechanism, any user can
> >>> freely adjust their THP policy. This makes fine-grained control
> >>> impossible without breaking userspace compatibility, an undesirable
> >>> tradeoff.
> >>
> >> If required, we could look into a "sealing" mechanism that would
> >> essentially lock modification attempts performed by the process (i.e.,
> >> MADV_HUGEPAGE).
> >
> > If we don't introduce a new THP mode and instead rely solely on
> > madvise, the "sealing" mechanism could either violate the intended
> > semantics of madvise(), or simply break madvise() entirely, right?
>
> We would have to be a bit careful, yes.
>
> Errors from MADV_HUGEPAGE/MADV_NOHUGEPAGE are often ignored, because
> these options also fail with -EINVAL on kernels without THP support.
>
> Ignoring MADV_NOHUGEPAGE can be problematic with userfaultfd.
>
> What you likely really want to do is seal when you have configured
> MADV_NOHUGEPAGE as the default, and fail MADV_HUGEPAGE later.
> >>
> >> That could be added on top of the current proposals that are flying
> >> around, and could be done e.g., per-process.
> >
> > How about introducing a dedicated "process" mode? This would allow
> > each process to use a different THP mode: some in "always", others in
> > "madvise", and the rest in "never". Future THP modes could also be
> > added to this framework.
>
> We have to be really careful about not creating even more mess with
> more modes.
>
> How would that design look in detail (how would we set it per
> process, etc.)?

I have a preliminary idea to implement this using BPF. We could define
the API as follows:

struct bpf_thp_ops {
        /**
         * @task_thp_mode: Get the THP mode for a specific task
         *
         * Return:
         * - TASK_THP_ALWAYS:  "always" mode
         * - TASK_THP_MADVISE: "madvise" mode
         * - TASK_THP_NEVER:   "never" mode
         * Future modes can also be added.
         */
        int (*task_thp_mode)(struct task_struct *p);
};

For observability, we could add a "THP mode" field to
/proc/[pid]/status. For example:

  $ grep "THP mode" /proc/123/status
  always
  $ grep "THP mode" /proc/456/status
  madvise
  $ grep "THP mode" /proc/789/status
  never

The THP mode for each task would be determined by the attached BPF
program based on the task's attributes, with the BPF hook placed in
the appropriate kernel functions. Note that this setting wouldn't be
inherited across fork/exec; the BPF program would make the decision
dynamically for each task.

This approach also enables runtime adjustment of THP modes based on
system-wide conditions, such as memory fragmentation or other
performance overheads. The BPF program could adapt its policy
dynamically, optimizing THP behavior in response to changing workloads.

As Liam pointed out in another thread, naming is challenging here;
"process" might not be the most accurate term for this context.

--
Regards
Yafang