Message-ID: <5e4c107f-9db8-4212-99b6-a490406fec77@gmail.com>
Date: Thu, 15 May 2025 15:50:47 +0100
From: Usama Arif
Subject: Re: [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY
To: Lorenzo Stoakes
Cc: Andrew Morton, david@redhat.com, linux-mm@kvack.org, hannes@cmpxchg.org,
 shakeel.butt@linux.dev, riel@surriel.com, ziy@nvidia.com,
 laoar.shao@gmail.com, baolin.wang@linux.alibaba.com,
 Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com,
 linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
 kernel-team@meta.com
References: <20250515133519.2779639-1-usamaarif642@gmail.com>
 <6502bbb7-e8b3-4520-9547-823207119061@lucifer.local>
In-Reply-To: <6502bbb7-e8b3-4520-9547-823207119061@lucifer.local>

On 15/05/2025 14:55, Lorenzo Stoakes wrote:
> On Thu, May 15, 2025 at 02:33:29PM +0100, Usama Arif wrote:
>> This allows to change the THP policy of a process, according to the value
>> set in arg2, all of which will be
>> inherited during fork+exec:
>
> This is pretty confusing.
>
> It should be something like 'add a new prctl() option that allows...' etc.
>
>> - PR_THP_POLICY_DEFAULT_HUGE: This will set the MMF2_THP_VMA_DEFAULT_HUGE
>>   process flag which changes the default of new VMAs to be VM_HUGEPAGE. The
>>   call also modifies all existing VMAs that are not VM_NOHUGEPAGE
>>   to be VM_HUGEPAGE.
>
> This is referring to implementation detail that doesn't matter for an overview,
> just add a summary here e.g.
>
> PR_THP_POLICY_DEFAULT_HUGE - set VM_HUGEPAGE flag in all VMAs by default,
> including after fork/exec, ignoring global policy.
>
> PR_THP_POLICY_DEFAULT_NOHUGE - clear VM_HUGEPAGE flag in all VMAs by default,
> including after fork/exec, ignoring global policy.
>
> PR_THP_POLICY_DEFAULT_SYSTEM - Eliminate any policy set above.

Hi Lorenzo,

Thanks for the review. I will make the cover letter clearer in the next
revision.

>
>>   This allows systems where the global policy is set to "madvise"
>>   to effectively have THPs always for the process. In an environment
>>   where different types of workloads are stacked on the same machine
>>   whose global policy is set to "madvise", this will allow workloads
>>   that benefit from always having hugepages to do so, without regressing
>>   those that don't.
>
> So does this just ignore and override the global policy? I'm not sure I'm
> comfortable with that.

No. The decision making of when and at what order THPs are allowed is not
changed, i.e. there are no changes in __thp_vma_allowable_orders and
thp_vma_allowable_orders. David has the same concern as you, and the current
series implements what David suggested in
https://lore.kernel.org/all/3f7ba97d-04d5-4ea4-9f08-6ec3584e0d4c@redhat.com/

It will change the existing VMA (NO)HUGE flags according to the prctl. For
example, doing PR_THP_POLICY_DEFAULT_HUGE will not give a THP when the global
policy is never.

>
> What about if the policy is 'never'? Does this override that?
> That seems completely wrong.

No, it won't override it. hugepage_global_always and hugepage_global_enabled
will still evaluate to false and you won't get a hugepage no matter what prctl
is set.

>
>> - PR_THP_POLICY_DEFAULT_NOHUGE: This will set the MMF2_THP_VMA_DEFAULT_NOHUGE
>>   process flag which changes the default of new VMAs to be VM_NOHUGEPAGE.
>>   The call also modifies all existing VMAs that are not VM_HUGEPAGE
>>   to be VM_NOHUGEPAGE.
>>   This allows systems where the global policy is set to "always"
>>   to effectively have THPs on madvise only for the process. In an
>>   environment where different types of workloads are stacked on the
>>   same machine whose global policy is set to "always", this will allow
>>   workloads that benefit from having hugepages on an madvise basis only
>>   to do so, without regressing those that benefit from having hugepages
>>   always.
>
> Wait, so 'no huge' means 'madvise'? What? This is confusing.

I probably made the cover letter confusing :) or maybe I need to rename the
flags.

This flag works as follows:

a) Changes the default flag of new VMAs to be VM_NOHUGEPAGE
b) Modifies all existing VMAs that are not VM_HUGEPAGE to be VM_NOHUGEPAGE
c) Is inherited during fork+exec

I think maybe I should add VMA to the flag names and rename the flags to
PR_THP_POLICY_DEFAULT_VMA_(NO)HUGE ??

>
>> - PR_THP_POLICY_DEFAULT_SYSTEM: This will clear the MMF2_THP_VMA_DEFAULT_HUGE
>>   and MMF2_THP_VMA_DEFAULT_NOHUGE process flags.
>>
>> These patches are required in rolling out hugepages in hyperscaler
>> configurations for workloads that benefit from them, where workloads are
>> stacked and a single THP global policy is likely to be used across the entire
>> fleet, and prctl will help override it.
>
> I don't understand this justification whatsoever. What does 'stacked' mean? And
> you're not justifying why you'd override the policy?

By stacked I just meant different types of workloads running on the same
machine.
Let's say we have a single server whose global policy is set to madvise.
You can have a container on that server running some database workload that
works best with madvise.
You can have another container on that same server running some AI workload
that would benefit from having VM_HUGEPAGE set on all new VMAs.
We can use prctl PR_THP_POLICY_DEFAULT_HUGE to get VM_HUGEPAGE set by default
on all new VMAs for that container.

>
> This series has no actual justification here at all? You really need to provide one.
>

There was a discussion on the usecases in
https://lore.kernel.org/all/13b68fa0-8755-43d8-8504-d181c2d46134@gmail.com/

I tried (and I guess failed :)) to summarize the justification from that
thread.

I will try and rephrase it here.

In hyperscalers, we have a single THP policy for the entire fleet.
We have different types of workloads (e.g. AI/compute/databases/etc)
running on a single server (this is what I meant by 'stacked').
Some of these workloads will benefit from always getting THPs at fault (or
collapsed by khugepaged), some of them will benefit from only getting them on
madvise.

This series is useful for 2 usecases:

1) global system policy = madvise, while we want some workloads to get THPs
at fault and by khugepaged :- some processes (e.g. AI workloads) benefit from
getting THPs at fault (and collapsed by khugepaged). Other workloads like
databases will incur a regression (either a performance regression, or they
are completely memory bound and even a very slight increase in memory will
cause them to OOM). So what these patches will do is allow setting
prctl(PR_THP_POLICY_DEFAULT_HUGE) on the AI workloads.
(This is how workloads are deployed in our (Meta's/Facebook) fleet at this
moment).

2) global system policy = always, while we want some workloads to get THPs
only on madvise basis :- Same reason as 1). What these patches
will do is allow setting prctl(PR_THP_POLICY_DEFAULT_NOHUGE) on the database
workloads.
(We hope this is us (Meta) in the near future: if a majority of workloads show
that they benefit from always, we flip the default host setting to "always"
across the fleet, and workloads that regress can opt out and be "madvise".
New services developed will then be tested with always by default. "always"
is also the default defconfig option upstream, so I would imagine this is
faced by others as well.)

Hope this makes the justification for the patches clearer :)

>>
>> v1->v2:
>
> Where was the v1? Is it [0]?
>
> This seems like a massive change compared to that series?
>
> You've renamed it and not referenced the old series, please make sure you link
> it or somehow let somebody see what this is against, because it makes review
> difficult.
>

Yes, it's the patch you linked below. Sorry, I should have linked it in this
series. It's a big change, but it was basically incorporating all the feedback
from David, while trying to achieve a similar goal.

Will link it in future series.

> [0]: https://lore.kernel.org/linux-mm/20250507141132.2773275-1-usamaarif642@gmail.com/
>
>> - change from modifying the THP decision making for the process, to modifying
>>   VMA flags only. This prevents further complicating the logic used to
>>   determine THP order (Thanks David!)
>> - change from using a prctl per policy change to just using PR_SET_THP_POLICY
>>   and arg2 to set the policy. (Zi Yan)
>> - Introduce PR_THP_POLICY_DEFAULT_NOHUGE and PR_THP_POLICY_DEFAULT_SYSTEM
>> - Add selftests and documentation.
>>
>> Usama Arif (6):
>>   prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
>>   prctl: introduce PR_THP_POLICY_DEFAULT_NOHUGE for the process
>>   prctl: introduce PR_THP_POLICY_SYSTEM for the process
>>   selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_NOHUGE
>>   selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
>>   docs: transhuge: document process level THP controls
>>
>>  Documentation/admin-guide/mm/transhuge.rst    |  40 +++
>>  include/linux/huge_mm.h                       |   4 +
>>  include/linux/mm_types.h                      |  14 +
>>  include/uapi/linux/prctl.h                    |   6 +
>>  kernel/fork.c                                 |   1 +
>>  kernel/sys.c                                  |  35 +++
>>  mm/huge_memory.c                              |  56 ++++
>>  mm/vma.c                                      |   2 +
>>  tools/include/uapi/linux/prctl.h              |   6 +
>>  .../trace/beauty/include/uapi/linux/prctl.h   |   6 +
>>  tools/testing/selftests/prctl/Makefile        |   2 +-
>>  tools/testing/selftests/prctl/thp_policy.c    | 286 ++++++++++++++++++
>>  12 files changed, 457 insertions(+), 1 deletion(-)
>>  create mode 100644 tools/testing/selftests/prctl/thp_policy.c
>>
>> --
>> 2.47.1
>>
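
For illustration only: the flag semantics discussed in this thread can be
modelled in a few lines of Python. This is a rough sketch, not code from the
series and not the kernel's logic. The class and function names (Process,
set_thp_policy, new_vma_flag, thp_allowed) are hypothetical, the eligibility
check is a heavily simplified stand-in for thp_vma_allowable_orders, and
leaving existing VMAs untouched on PR_THP_POLICY_DEFAULT_SYSTEM is an
assumption, since the thread only says the process flags are cleared.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class GlobalPolicy(Enum):
    """System-wide THP setting ("always", "madvise" or "never")."""
    ALWAYS = "always"
    MADVISE = "madvise"
    NEVER = "never"


class VmaFlag(Enum):
    """Per-VMA hint; NONE means neither flag is set."""
    NONE = "none"
    HUGEPAGE = "VM_HUGEPAGE"
    NOHUGEPAGE = "VM_NOHUGEPAGE"


class ThpPolicy(Enum):
    """Proposed prctl arg2 values (names from the thread, no real numbers)."""
    DEFAULT_SYSTEM = "PR_THP_POLICY_DEFAULT_SYSTEM"
    DEFAULT_HUGE = "PR_THP_POLICY_DEFAULT_HUGE"
    DEFAULT_NOHUGE = "PR_THP_POLICY_DEFAULT_NOHUGE"


@dataclass
class Process:
    """Hypothetical stand-in for a process: a policy plus its VMA flags."""
    policy: ThpPolicy = ThpPolicy.DEFAULT_SYSTEM
    vmas: List[VmaFlag] = field(default_factory=list)


def set_thp_policy(proc: Process, policy: ThpPolicy) -> None:
    """Model the proposed prctl: record the default and rewrite existing VMAs."""
    proc.policy = policy
    if policy is ThpPolicy.DEFAULT_HUGE:
        # Every VMA that is not VM_NOHUGEPAGE becomes VM_HUGEPAGE.
        proc.vmas = [f if f is VmaFlag.NOHUGEPAGE else VmaFlag.HUGEPAGE
                     for f in proc.vmas]
    elif policy is ThpPolicy.DEFAULT_NOHUGE:
        # Every VMA that is not VM_HUGEPAGE becomes VM_NOHUGEPAGE.
        proc.vmas = [f if f is VmaFlag.HUGEPAGE else VmaFlag.NOHUGEPAGE
                     for f in proc.vmas]
    # DEFAULT_SYSTEM: assumption, existing VMA flags are left as they are.


def new_vma_flag(proc: Process) -> VmaFlag:
    """Default flag a newly created VMA would get under the process policy."""
    if proc.policy is ThpPolicy.DEFAULT_HUGE:
        return VmaFlag.HUGEPAGE
    if proc.policy is ThpPolicy.DEFAULT_NOHUGE:
        return VmaFlag.NOHUGEPAGE
    return VmaFlag.NONE


def thp_allowed(global_policy: GlobalPolicy, flag: VmaFlag) -> bool:
    """Simplified eligibility: the global policy is still consulted first."""
    if global_policy is GlobalPolicy.NEVER or flag is VmaFlag.NOHUGEPAGE:
        return False
    if global_policy is GlobalPolicy.ALWAYS:
        return True
    return flag is VmaFlag.HUGEPAGE  # "madvise": only flagged VMAs qualify
```

In this model, a process with DEFAULT_HUGE gets THPs for new VMAs when the
global policy is "madvise" but still gets none under "never", and a process
with DEFAULT_NOHUGE under "always" only gets THPs for VMAs that were already
VM_HUGEPAGE, which is the "madvise-like" behaviour described above.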