From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 14F47C2D0CD for ; Thu, 15 May 2025 16:47:41 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id C80066B009A; Thu, 15 May 2025 12:47:40 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id C2D686B00A2; Thu, 15 May 2025 12:47:40 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A7F796B00A3; Thu, 15 May 2025 12:47:40 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 7C61D6B009A for ; Thu, 15 May 2025 12:47:40 -0400 (EDT) Received: from smtpin16.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id 5D7E01403BB for ; Thu, 15 May 2025 16:47:40 +0000 (UTC) X-FDA: 83445723480.16.9BF755D Received: from mail-wr1-f52.google.com (mail-wr1-f52.google.com [209.85.221.52]) by imf24.hostedemail.com (Postfix) with ESMTP id 27D27180008 for ; Thu, 15 May 2025 16:47:37 +0000 (UTC) Authentication-Results: imf24.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=J9r7lVIT; spf=pass (imf24.hostedemail.com: domain of usamaarif642@gmail.com designates 209.85.221.52 as permitted sender) smtp.mailfrom=usamaarif642@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1747327658; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=TkgpkfBDs4JHFOqb847ZHMG9CL/tOHnupWssCLbExzA=; b=de2QqgB0nTFVlPqszjpY+1H8le68B5qBokEvbhVzIhs9kV4eUn2VG+H29BnwRYxNxm8jIe 1clTrXpIDaPGpZ9k9GYkS1VyEor+sio9NapXgaHLWJDBWFeKE+jYDSO1XPyM05cEk3OplF xfHdc860mbzTP2xL+MbCA+EpT0pSIos= ARC-Authentication-Results: i=1; imf24.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=J9r7lVIT; spf=pass (imf24.hostedemail.com: domain of usamaarif642@gmail.com designates 209.85.221.52 as permitted sender) smtp.mailfrom=usamaarif642@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1747327658; a=rsa-sha256; cv=none; b=GsP7P6jdXrjuNYLnUdM3zscIBcqBFXcyIaa1gAsjj2VkCtkYUo4TdwvRZAPCUiHtAA/7yT 4JDeelYJQ7qTYN8juxmMjmxvE+vVe82TZ86KtFW57qRBEZbLSzL73EuCSvmpIMzS32DOYR pWRF9y6bULgzkIOeZVlAOaeyVlXwrwg= Received: by mail-wr1-f52.google.com with SMTP id ffacd0b85a97d-3a0ac853894so922177f8f.3 for ; Thu, 15 May 2025 09:47:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1747327656; x=1747932456; darn=kvack.org; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=TkgpkfBDs4JHFOqb847ZHMG9CL/tOHnupWssCLbExzA=; b=J9r7lVITNF1oxsvlsmABfawkXUlg2forUEVHp9F3t5xF50vYnmCaw1Bg9L01A3ggWD LIqo0ha0wmIZG+jzELhaFBzJvo6tzISMnLZZJXmS43Cb1aPAj1IFbo6/US+k05aGb/SC /v+QmybW9em+2gHTKA2hsywjkAXDjuYy6AK5OmPVtvPUxpqYIvP/i9LiVBKl90y/Ke2v PJgARLi7H1vGWqzn/IC4kAacisGBVDiCfwxKHJZbbG8TDB+l2G93oSNgrrH2KqK6488p X9uiaMi5G2lBnhsFtrJEJa6P4+Gwds11KQcBRSsHBGQWxBqsOXH0OzljIOid1gLDCxLF c0Iw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1747327656; x=1747932456; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=TkgpkfBDs4JHFOqb847ZHMG9CL/tOHnupWssCLbExzA=; b=WYBPR3Ocl+2wSLVZB6Y3TCtsMse9JHtMNF9tSthjCmZzxOm7kXz88xohlFh67+pTlx u9Q00YR/BJC0NMn7CVsbNO1BWgi2PGrA6t/bucHFi2rMszg9AUQ5o4W1h3wwX+qAd8lh X5l+AmV3A/t3jRfe3psJkdDmiOmGsnQ6m8orTxjzEfE6znhR8eFjZGHZHbgX8bHvPjwF GtY9jxwoQnTVs2IJu6BCmpFb7REQT9LRTenqop/Q9npsxrZ62VuN9V2nZ89VojrVbvc3 I4tyZKYzAGP8MvOB7OxQTlqUqjNLksaFycK5JARoEY92K8gv7RT2fKwADDA2E9tBeoS8 ZLQQ== X-Forwarded-Encrypted: i=1; AJvYcCWh4x+JjKwMyEF881VTjOUhSJdg3KPsvUImG6YllJkii1cF6DYSWVQRzGNDLgVkiBRSr4JkShgsZg==@kvack.org X-Gm-Message-State: AOJu0YxlhkizDDIyMYmO9BPKI+oZZuonVWWggkcprOYeA9sEcEHJYaYW W1rGDo9AXlVG9Q3/LkTVwUD/gaV2nc19gaMFVToQOhD3adJ6Ipc/mLoF X-Gm-Gg: ASbGnctxFnOBjChjYSaas+VdqZZcoL4dB6nHTpOOVVnA+pwJy1MuaQTgTwC6FIPvRiw ahxPTiY2c8jno5aDP6ex5OcIpTnmZm/QshUq7fD/ROty6uvVyI5mDqiOjXlvBsnisx0M1kaejCL I8tyGcVP4V2dq02QbG725aXcEjYluGvMWbHtU2u/r7zmF0Xm2Cw32+ySCYMs2Zu/+BMQL+++pU7 pBV1yuQI7xt5Bq6W6g+5lIMimWKN5/DUGy3QhODxlGuQmlb+WPDDLlReYMb+2jG0EwXLtkbMZUb SyhbL/ex2W0AduhnJ7YJ/gjEYYAz4NIyiaEiflDni844UcBEv4RnY1IbQ1wO/6YYKoFBGcHSRtj XVG8Lx8+WRTRo6qbnhBWVrwie65vyYPmXPvEDEPhtyIY9LlU= X-Google-Smtp-Source: AGHT+IGwoZGXfRfrijk5EbPVA9ERSXANckJYpJ0HgniSNhLa4G+DQiCotf5dit+V2clIYe7I//aTDA== X-Received: by 2002:a05:6000:2385:b0:3a3:4ba4:f3cd with SMTP id ffacd0b85a97d-3a35c821e38mr588370f8f.1.1747327655957; Thu, 15 May 2025 09:47:35 -0700 (PDT) Received: from ?IPV6:2a01:4b00:b211:ad00:1096:2c00:b223:9747? ([2a01:4b00:b211:ad00:1096:2c00:b223:9747]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-442fd594707sm2637355e9.33.2025.05.15.09.47.35 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Thu, 15 May 2025 09:47:35 -0700 (PDT) Message-ID: <732ff995-0e18-4e8c-a0a5-14da400d4078@gmail.com> Date: Thu, 15 May 2025 17:47:34 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process To: Lorenzo Stoakes Cc: Andrew Morton , david@redhat.com, linux-mm@kvack.org, hannes@cmpxchg.org, shakeel.butt@linux.dev, riel@surriel.com, ziy@nvidia.com, laoar.shao@gmail.com, baolin.wang@linux.alibaba.com, Liam.Howlett@oracle.com, npache@redhat.com, ryan.roberts@arm.com, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, kernel-team@meta.com References: <20250515133519.2779639-1-usamaarif642@gmail.com> <20250515133519.2779639-2-usamaarif642@gmail.com> <02ead03b-339b-45c8-b252-d31a66501c39@lucifer.local> Content-Language: en-US From: Usama Arif In-Reply-To: <02ead03b-339b-45c8-b252-d31a66501c39@lucifer.local> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Rspamd-Server: rspam10 X-Stat-Signature: fwu7i4ur8qf4r7o9h51mxp93mnd5sxbm X-Rspamd-Queue-Id: 27D27180008 X-Rspam-User: X-HE-Tag: 1747327657-686026 X-HE-Meta: U2FsdGVkX19cxlKFiIW2HcLYnFoWCs4MCtOPxTaiL3aRQ5BjshkfgxBxC7GwgcpdE1+RzoX9A1ny9H0DN9/rJdw9JO2G97Z/GCvKlH1AwoZx/yEC6+4NjVzCbkzRp0heJ+XrrRMJ2x6C42Wuui4mKp0dE5F/Ea2kbz2RtOjue6LrylwIN3kuxgBEKDCT++0f+xjFop4nHBLMT8CmhP+4KoWTM3F50Pttx2l4CT/xVGaV1ZLLmr4OlYTdEGgcFbkE1v8uzw0kyr4ACRXGHTrR2NOgmt6prB02gKkjkgWNBCyQV98/0OGebraG3wo/VPXTdp4wZaVwwTgz6VWv3jLGQDpxKSkNbg4nfyr6aScFYHw4U1x1hPxjN6KbhcEf6zcaVqj75JSfUxhEBgWPXZ5meOYaAeaAW5z36/bmQnwsj6WF/rFjF5dqsNqTSHPImaXPKVpY8G+XTAr52Vrn/ac0XcuEfiIN20ArVxoF89wrRBc9ePwHxizV9jCRkKUjc1k01it2wwrGVaNJ7WShI+5oulZFpygERA/OoRGO1XFyPuF/+SBtyRCZV2ZcI4gnNIij9Jmgeff9GfCsn/ZlOBjpkBI/k8TFTpsXYgZtmJaN3NK4iAuCOCVvpFuBsipdZu5bb6woOIQTk/urQ56Lm5DkGeiFxuHaOokTGKMre4oYTJhBrMoES1RnrBvLWgfWSe/+5VBZ12PISfJlFxjdFsnZesJYvbCvcT8ZVNTs1VFjOH8LabsBGSG4v2O/B26T/e4T3snXqijqXFPZv/NGV4UFnJg4LG6V3W3anKn/2i4RM5NGjEcpaGx3PqDvEn+84AVq08OJLB/urZXJF0RZQobwLXwzm1/sy2ZkdJGXNkwn4cXi3ZnSqJdaXX5Gx1MUfW42Jq7vYW1Si+w/zmkqM1QJU1pCN6Tn+m2d4N34UzQoYB2U0pMX52c5wEg4Jo/Si+HqNkFF1Fr9WOAJF89hMyN 45F6kM5m PZb0Gzrliae501gTB8S1TjYLt6I/2YwcTLjgdFg1HXBf5ZYT2f6uC+sfdGZHy9ZFB/xy5z/zFk+M7vZKnvPVzic+czDRTJoJi0ezzyZLuy/gB067kiH71lon3LelAysJ2q5RmqdJdCZ2fmP7ynWptd1PYSHPPgMcIgfhmN9+9Ax1wQOPk8vvJ/KPu3nimmvQ3aIuroPYYz4dYaRDG++fjSzDB4ZcPtGqXnHX0qsHEN1gM+ViU/8zm35X3wfgC7kJ9w894TireGYl8F2s2FS6jdzoUKzBDpZ4ERIaURjUo/69GEuH1vdomsHSrXpFrmYlXP3bBFsewAAHDuY94BAOzKl9/FKGe2MJ5W+xjNScs1W8p/ThWyxOTnDgjlu2+OD0PhE8CsBabHFOUjH10WzA/i4HczCgrYd2BB6Fs+3Uaa/1QkXEIY4x/2u7JUVgMAM7A8BAv1VDZBZLnafx7/bGFXYK9jdffs8otvXNb6tZ93UY4zSHAPu8HF/RLVnKZwm62hPAnJN2AvRcx4grT8ZdoidAonBoR5Uu3UjCPbB5yr82z9oJksC3y5hzmzn0ov9SguWKj6GJOMBDyt3lDknhKEwdV+hAoGodneXVeV9TUhxcNp0QBTKO/ZclrAFTTtxQRl7lvr4j4xPYXhqdyaNJnO5NP8bzwiD781/8H X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 15/05/2025 17:06, Lorenzo Stoakes wrote: >> Its doing the same as KSM as suggested by David. Does KSM break these tests? >> Is there some specific test you can point to that I can run that is breaking >> with this patch and not without it? > > They don't build, at all. I explain how you can attempt a build below. > Ah yes, I initially thought by break you meant they were failing. I saw that, will fix it. > And no, KSM doesn't break the tests, because steps were taken to make them > not break the tests :) I mean it's really easy - it's just adding some > trivial stubs. > Yes, will do Thanks! > If you need help with it just ping me in whatever way helps and I can help! > > It's understandable as it's not necessarily clear this is a thing (maybe we > need self tests to build it, but that might break CI setups so unclear). > > The merging is much more important! > >> >> >>> I really feel this series needs to be an RFC until we can get some >>> consensus on how to approach this. >> >> There was consensus in https://lore.kernel.org/all/97702ff0-fc50-4779-bfa8-83dc42352db1@redhat.com/ > > I disagree with this asssessment, that doesn't look like consensus at all, > I think at least this is a very contentious or at least _complicated_ topic > that we need to really dig into. > > So in my view - it's this kind of situation that warrants an RFC until > there's some stabilisation and agreement on a way forward. > Sure will change next revision to RFC, unless hopefully maybe we can get a consensus in this revision :) >> >>> >>> On Thu, May 15, 2025 at 02:33:30PM +0100, Usama Arif wrote: >>>> This is set via the new PR_SET_THP_POLICY prctl. >>> >>> What is? >>> >>> You're making very major changes here, including adding a new flag to >>> mm_struct (!!) and the explanation/justification for this is missing. >>> >> >> I have added the justification in your reply to the coverletter. > > As stated there, you've not explained why alternatives are unworkable, I > think we need this! > > Sort of: > > 1. Why not cgroups? blah blah blah > 2. Why not process_madvise()? blah blah blah > 3. Why not bpf? blah blah blah > 4. Why not ? blah blah blah > I will add this in the next cover letter. >> >>>> This will set the MMF2_THP_VMA_DEFAULT_HUGE process flag >>>> which changes the default of new VMAs to be VM_HUGEPAGE. The >>>> call also modifies all existing VMAs that are not VM_NOHUGEPAGE >>>> to be VM_HUGEPAGE. The policy is inherited during fork+exec. >>> >>> So you can only set this flag? >>> >> >> ?? > > This patch is only allowing the setting of this flag. I am asking 'so you > can only set this flag?' > > To which it appears the answer is, yes I think :) > > An improved cover letter could say something like: > > " > Here we implement the first flag intended to allow the _overriding_ of huge > page policy to ensure that, when > /sys/kernel/mm/transparent_hugepage/enabled is set to madvise, we are able > to maintain fine-grained control of individual processes, including any > fork/exec'd, by setting this flag. > > In subsequent commits, we intend to permit further such control. > " > >> >>>> >>>> This allows systems where the global policy is set to "madvise" >>>> to effectively have THPs always for the process. In an environment >>>> where different types of workloads are stacked on the same machine, >>>> this will allow workloads that benefit from always having hugepages >>>> to do so, without regressing those that don't. >>> >>> Again, this explanation really makes no sense at all to me, I don't really >>> know what you mean, you're not going into what you're doing in this change, >>> this is just a very unclear commit message. >>> >> >> I hope this is answered in my reply to your coverletter. > > You still need to improve the cover letter here I think, see above for a > suggestion! > Sure, will do in the next revision, Thanks! >> >>>> >>>> Signed-off-by: Usama Arif >>>> --- >>>> include/linux/huge_mm.h | 3 ++ >>>> include/linux/mm_types.h | 11 +++++++ >>>> include/uapi/linux/prctl.h | 4 +++ >>>> kernel/fork.c | 1 + >>>> kernel/sys.c | 21 ++++++++++++ >>>> mm/huge_memory.c | 32 +++++++++++++++++++ >>>> mm/vma.c | 2 ++ >>>> tools/include/uapi/linux/prctl.h | 4 +++ >>>> .../trace/beauty/include/uapi/linux/prctl.h | 4 +++ >>>> 9 files changed, 82 insertions(+) >>>> >>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h >>>> index 2f190c90192d..e652ad9ddbbd 100644 >>>> --- a/include/linux/huge_mm.h >>>> +++ b/include/linux/huge_mm.h >>>> @@ -260,6 +260,9 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma, >>>> return orders; >>>> } >>>> >>>> +void vma_set_thp_policy(struct vm_area_struct *vma); >>> >>> This is a VMA-specific function but you're putting it in huge_mm.h? Why >>> can't >> this be in vma.h or vma.c? >>> >> >> Sure can move it there. >> >>>> +void process_vmas_thp_default_huge(struct mm_struct *mm); >>> >>> 'vmas' is redundant here. >>> >> >> Sure. >>>> + >>>> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, >>>> unsigned long vm_flags, >>>> unsigned long tva_flags, >>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h >>>> index e76bade9ebb1..2fe93965e761 100644 >>>> --- a/include/linux/mm_types.h >>>> +++ b/include/linux/mm_types.h >>>> @@ -1066,6 +1066,7 @@ struct mm_struct { >>>> mm_context_t context; >>>> >>>> unsigned long flags; /* Must use atomic bitops to access */ >>>> + unsigned long flags2; >>> >>> >>> Ugh, god really?? >>> >>> I really am not a fan of adding flags2 just to add a prctl() feature like >>> this. This is crazy. >>> >>> Also this is a TERRIBLE name. I mean, no please PLEASE no. >>> >>> Do we really have absolutely no choice but to add a new flags field here? >>> >>> It again doesn't help that you don't mention nor even try to justify this >>> in the commit message or cover letter. >>> >> >> And again, I hope my reply to your email has given you the justification. > > No :) I understood why you did this though of course. > >> >>> If this is a 32-bit kernel vs. 64-bit kernel thing so we 'ran out of bits', >>> let's just go make this flags field 64-bit on 32-bit kernels. >>> >>> I mean - I'm kind of insisting we do that to be honest. Because I really >>> don't like this. >> >> >> If the maintainers want this, I will make it a 64 bit only feature. We >> are only using it for 64 bit servers. But it will probably mean ifdef >> config 64 bit in a lot of places. > > I'm going to presume you are including me in this category rather than > implying that you are deferring only to others :) > Yes ofcourse! I mean all maintainers :) And hopefully everyone else as well :) > So, there's another option: > > Have a prerequisite series that makes mm_struct->flags 64-bit on 32-bit > kernels, which solves this problem everywhere and avoids us wasting a bunch > of memory for a very specific usecase, splitting flag state across 2 fields > (which are no longer atomic as a whole of course), adding confusion, > possibly subtly breaking anywhere that assumes mm->flags completely > describes mm-granularity flag state etc. > This is probably a very basic question, but by make mm_struct->flags 64-bit on 32-bit do you mean convert flags to unsigned long long when !CONIFG_64BIT? > The RoI here is not looking good, otherwise. > >> >>> >>> Also if we _HAVE_ to have this, shouldn't we duplicate that comment about >>> atomic bitops?... >>> >> >> Sure >> >>>> >>>> #ifdef CONFIG_AIO >>>> spinlock_t ioctx_lock; >>>> @@ -1744,6 +1745,11 @@ enum { >>>> MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\ >>>> MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK) >>>> >>>> +#define MMF2_THP_VMA_DEFAULT_HUGE 0 >>> >>> I thought the whole idea was to move away from explicitly refrencing 'THP' >>> in a future where large folios are implicit and now we're saying 'THP'. >>> >>> Anyway the 'VMA' is totally redundant here. >>> >> >> Sure, I can remove VMA. >> I see THP everywhere in the kernel code. >> Its mentioned 108 times in transhuge.rst alone :) >> If you have any suggestion to rename this flag, happy to take it :) > > Yeah I mean it's a mess man, and it's not your fault... Again naming is > hard, I put a suggestion in reply to cover letter anyway... > >> >>>> +#define MMF2_THP_VMA_DEFAULT_HUGE_MASK (1 << MMF2_THP_VMA_DEFAULT_HUGE) >>> >>> Do we really need explicit trivial mask declarations like this? >>> >> >> I have followed the convention that has existed in this file, please see below >> links :) >> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1645 >> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1623 >> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1603 >> https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/mm_types.h#L1582 > > Ack, yuck but ack. > >> >> >>>> + >>>> +#define MMF2_INIT_MASK (MMF2_THP_VMA_DEFAULT_HUGE_MASK) >>> >>>> + >>>> static inline unsigned long mmf_init_flags(unsigned long flags) >>>> { >>>> if (flags & (1UL << MMF_HAS_MDWE_NO_INHERIT)) >>>> @@ -1752,4 +1758,9 @@ static inline unsigned long mmf_init_flags(unsigned long flags) >>>> return flags & MMF_INIT_MASK; >>>> } >>>> >>>> +static inline unsigned long mmf2_init_flags(unsigned long flags) >>>> +{ >>>> + return flags & MMF2_INIT_MASK; >>>> +} >>>> + >>>> #endif /* _LINUX_MM_TYPES_H */ >>>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h >>>> index 15c18ef4eb11..325c72f40a93 100644 >>>> --- a/include/uapi/linux/prctl.h >>>> +++ b/include/uapi/linux/prctl.h >>>> @@ -364,4 +364,8 @@ struct prctl_mm_map { >>>> # define PR_TIMER_CREATE_RESTORE_IDS_ON 1 >>>> # define PR_TIMER_CREATE_RESTORE_IDS_GET 2 >>>> >>>> +#define PR_SET_THP_POLICY 78 >>>> +#define PR_GET_THP_POLICY 79 >>>> +#define PR_THP_POLICY_DEFAULT_HUGE 0 >>>> + >>>> #endif /* _LINUX_PRCTL_H */ >>>> diff --git a/kernel/fork.c b/kernel/fork.c >>>> index 9e4616dacd82..6e5f4a8869dc 100644 >>>> --- a/kernel/fork.c >>>> +++ b/kernel/fork.c >>>> @@ -1054,6 +1054,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, >>>> >>>> if (current->mm) { >>>> mm->flags = mmf_init_flags(current->mm->flags); >>>> + mm->flags2 = mmf2_init_flags(current->mm->flags2); >>>> mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK; >>>> } else { >>>> mm->flags = default_dump_filter; >>>> diff --git a/kernel/sys.c b/kernel/sys.c >>>> index c434968e9f5d..1115f258f253 100644 >>>> --- a/kernel/sys.c >>>> +++ b/kernel/sys.c >>>> @@ -2658,6 +2658,27 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, >>>> clear_bit(MMF_DISABLE_THP, &me->mm->flags); >>>> mmap_write_unlock(me->mm); >>>> break; >>>> + case PR_GET_THP_POLICY: >>>> + if (arg2 || arg3 || arg4 || arg5) >>>> + return -EINVAL; >>>> + if (!!test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2)) >>> >>> I really don't think we need the !!? Do we? >> >> I have followed the convention that has existed in this file already, >> please see: >> https://elixir.bootlin.com/linux/v6.14.6/source/kernel/sys.c#L2644 > > OK, but please don't, I don't see why this is necessary. if (truthy) is > fine. > > Unless somebody has a really good reason why this is necessary, it's just > ugly ceremony. > Agreed :) >> >>> >>> Shouldn't we lock the mm when we do this no? Can't somebody change this? >>> >> >> It wasn't locked in PR_GET_THP_DISABLE >> https://elixir.bootlin.com/linux/v6.14.6/source/kernel/sys.c#L2644 >> >> I can acquire do mmap_write_lock_killable the same as PR_SET_THP_POLICY >> in the next series. >> >> I can also add the lock in PR_GET_THP_DISABLE. > > Well, the issue I guess is... if the flags field is atomic, and we know > over this call maybe we can rely on mm sticking around, then we probalby > don't need an mmap lock actually. > >> >>>> + error = PR_THP_POLICY_DEFAULT_HUGE; > > Wait, error = PR_THP_POLICY_DEFAULT_HUGE? Is this the convention for > returning here? :) I see a few of the PR_GET_.. setting the return value. I hope I didnt misinterpret that. > >>>> + break; >>>> + case PR_SET_THP_POLICY: >>>> + if (arg3 || arg4 || arg5) >>>> + return -EINVAL; >>>> + if (mmap_write_lock_killable(me->mm)) >>>> + return -EINTR; >>>> + switch (arg2) { >>>> + case PR_THP_POLICY_DEFAULT_HUGE: >>>> + set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2); >>>> + process_vmas_thp_default_huge(me->mm); >>>> + break; >>>> + default: >>>> + return -EINVAL; > > Oh I just noticed - this is really broken - you're not unlocking the mmap() > here on error... :) you definitely need to fix this. > Ah yes, will do Thanks! >>>> + } >>>> + mmap_write_unlock(me->mm); >>>> + break; >>>> case PR_MPX_ENABLE_MANAGEMENT: >>>> case PR_MPX_DISABLE_MANAGEMENT: >>>> /* No longer implemented: */ >>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >>>> index 2780a12b25f0..64f66d5295e8 100644 >>>> --- a/mm/huge_memory.c >>>> +++ b/mm/huge_memory.c >>>> @@ -98,6 +98,38 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma) >>>> return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode); >>>> } >>>> >>>> +void vma_set_thp_policy(struct vm_area_struct *vma) >>>> +{ >>>> + struct mm_struct *mm = vma->vm_mm; >>>> + >>>> + if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2)) >>>> + vm_flags_set(vma, VM_HUGEPAGE); >>>> +} >>>> + >>>> +static void vmas_thp_default_huge(struct mm_struct *mm) >>>> +{ >>>> + struct vm_area_struct *vma; >>>> + unsigned long vm_flags; >>>> + >>>> + VMA_ITERATOR(vmi, mm, 0); >>> >>> This is a declaration, it should be grouped with declarations... >>> >> >> Sure, will make the change in next version. >> >> Unfortunately checkpatch didn't complain. > > Checkpatch actually complains the other way :P it doesn't understand > macros. > > So you'll start getting a warning here, which you can ignore. It sucks, but > there we go. Making checkpatch.pl understand that would be a pain, probs. > >> >>>> + for_each_vma(vmi, vma) { >>>> + vm_flags = vma->vm_flags; >>>> + if (vm_flags & VM_NOHUGEPAGE) >>>> + continue; >>> >>> Literally no point in you putting vm_flags as a separate variable here. >>> >> >> Sure, will make the change in next version. > > Thanks! > >> >>> So if you're not overriding VM_NOHUGEPAGE, the whole point of this exercise >>> is to override global 'never'? >>> >> >> Again, I am not overriding never. >> >> hugepage_global_always and hugepage_global_enabled will evaluate to false >> and you will not get a hugepage. > > Yeah, again ack, but I kind of hate that we set VM_HUGEPAGE everywhere even > if the policy is never. > > And we now get into realms of: > > 'Hey I set prctl() to make everything huge pages, and PR_GET_THP_POLICY > says I've set that, but nothing is huge? BUG???' > > Of course then you get into - if somebody sets it to never, do we go around > and remove VM_HUGEPAGE and this MMF_ flag? > >> >> >>> I'm really concerned about this. >>> >>>> + vm_flags_set(vma, VM_HUGEPAGE); >>>> + } >>>> +} >>> >>> Do we have an mmap write lock established here? Can you confirm that? Also >>> you should add an assert for that here. >>> >> >> Yes I do, its only called in PR_SET_THP_POLICY where mmap_write lock was taken. >> I can add an assert if it helps. > > It not only helps, it's utterly critical :) > > 'It's only called in xxx()' is famous last words for a programmer, because > later somebody (maybe even your good self) calls it from somewhere else > and... we've all been there... > Thanks! Will do. >> >>>> + >>>> +void process_vmas_thp_default_huge(struct mm_struct *mm) >>>> +{ >>>> + if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2)) >>>> + return; >>>> + >>>> + set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2); >>>> + vmas_thp_default_huge(mm); >>>> +} >>>> + >>>> + >>>> unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, >>>> unsigned long vm_flags, >>>> unsigned long tva_flags, >>>> diff --git a/mm/vma.c b/mm/vma.c >>>> index 1f2634b29568..101b19c96803 100644 >>>> --- a/mm/vma.c >>>> +++ b/mm/vma.c >>>> @@ -2476,6 +2476,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap) >>>> if (!vma_is_anonymous(vma)) >>>> khugepaged_enter_vma(vma, map->flags); >>>> ksm_add_vma(vma); >>>> + vma_set_thp_policy(vma); >>> >>> You're breaking VMA merging completely by doing this here... >>> >>> Now I can map one VMA with this policy set, then map another immediately >>> next to it and - oops - no merge, ever, because the VM_HUGEPAGE flag is not >>> set in the new VMA on merge attempt. >>> >>> I realise KSM is just as broken (grr) but this doesn't justify us >>> completely breaking VMA merging here. >> >> I think this answers it. Its doing the same as KSM. > > Yes, but as I said there, it's not acceptable, at all. > > You're making it so litearlly VMA merging _does not happen at all_. That's > unacceptable and might even break some workloads. > > You'll certainly cause very big kernel metadata usage. > > Consider: > > |-----------------------------|..................| > | some VMA flags, VM_HUGEPAGE | proposed new VMA | > |-----------------------------|..................| > > Now, because you set VM_HUGEPAGE _after any merge is attempted_, this will > _always_ be fragmented, forever. > So if __mmap_new_vma and do_brk_flags are called after merge attempt, is it possible to vma_set_thp_policy (or do something similar) before the merge attempt? Actually I just read your reply to the next block, so I think its ok? Added more to the next block. I dont have any preference on where its put, so happy with putting this earlier. > That's just not... acceptable. > > The fact KSM is broken this way doesn't make that OK. > > Especially on brk(), which now will _always_ allocate new VMAs for every > brk() expansion which doesn't seem very efficient. > > It may also majorly degrade performance. > > That makes me think we need some perf testing for this ideally... > >> >>> >>> You need to set earlier than this. Then of course a driver might decide to >>> override this, so maybe then we need to override that. >>> >>> But then we're getting into realms of changing fundamental VMA code _just >>> for this feature_. >>> >>> Again I'm iffy about this. Very. >>> >>> Also you've broken the VMA userland tests here: >>> >>> $ cd tools/testing/vma >>> $ make >>> ... >>> In file included from vma.c:33: >>> ../../../mm/vma.c: In function ‘__mmap_new_vma’: >>> ../../../mm/vma.c:2486:9: error: implicit declaration of function ‘vma_set_thp_policy’; did you mean ‘vma_dup_policy’? [-Wimplicit-function-declaration] >>> 2486 | vma_set_thp_policy(vma); >>> | ^~~~~~~~~~~~~~~~~~ >>> | vma_dup_policy >>> make: *** [: vma.o] Error 1 >>> >>> You need to create stubs accordingly. >>> >> >> Thanks will do. > > Thanks! > >> >>>> *vmap = vma; >>>> return 0; >>>> >>>> @@ -2705,6 +2706,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, >>>> mm->map_count++; >>>> validate_mm(mm); >>>> ksm_add_vma(vma); >>>> + vma_set_thp_policy(vma); >>> >>> You're breaking merging again... This is quite a bad case too as now you'll >>> have totally fragmented brk VMAs no? >>> >> >> Again doing it the same as KSM. > > That doesn't make it ok. Just because KSM is broken doesn't make this ok. I > mean grr at KSM :) I'm going to look into that and see about > investigating/fixing that behaviour. > > obviously I can't accept anything that will fundamentally break VMA > merging. > Ofcourse! > The answer really is to do this earlier, but you risk a driver overriding > it, but that's OK I think (I don't even think any in-tree ones do actually > _anywhere_ - and yes I was literally reading through _every single_ .mmap() > callback lately because I am quite obviously insane ;) > > Again I can help with this. > Appreaciate it! I am actually not familiar with the merge code. I will try and have a look, but if you could give a pointer to the file:line after which its not acceptable to have and I can move vma_set_thp_policy to before it or try and do something similar to that. >> >>> We can't have it implemented this way. >>> >>>> out: >>>> perf_event_mmap(vma); >>>> mm->total_vm += len >> PAGE_SHIFT; >>>> diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h >>>> index 35791791a879..f5945ebfe3f2 100644 >>>> --- a/tools/include/uapi/linux/prctl.h >>>> +++ b/tools/include/uapi/linux/prctl.h >>>> @@ -328,4 +328,8 @@ struct prctl_mm_map { >>>> # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC 0x10 /* Clear the aspect on exec */ >>>> # define PR_PPC_DEXCR_CTRL_MASK 0x1f >>>> >>>> +#define PR_SET_THP_POLICY 78 >>>> +#define PR_GET_THP_POLICY 79 >>>> +#define PR_THP_POLICY_DEFAULT_HUGE 0 >>>> + >>>> #endif /* _LINUX_PRCTL_H */ >>>> diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h >>>> index 15c18ef4eb11..325c72f40a93 100644 >>>> --- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h >>>> +++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h >>>> @@ -364,4 +364,8 @@ struct prctl_mm_map { >>>> # define PR_TIMER_CREATE_RESTORE_IDS_ON 1 >>>> # define PR_TIMER_CREATE_RESTORE_IDS_GET 2 >>>> >>>> +#define PR_SET_THP_POLICY 78 >>>> +#define PR_GET_THP_POLICY 79 >>>> +#define PR_THP_POLICY_DEFAULT_HUGE 0 >>>> + >>>> #endif /* _LINUX_PRCTL_H */ >>>> -- >>>> 2.47.1 >>>> >>