From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Usama Arif <usamaarif642@gmail.com>
Cc: David Hildenbrand <david@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Shakeel Butt <shakeel.butt@linux.dev>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
Arnd Bergmann <arnd@arndb.de>,
Christian Brauner <brauner@kernel.org>,
SeongJae Park <sj@kernel.org>, Mike Rapoport <rppt@kernel.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Barry Song <21cnbao@gmail.com>,
linux-mm@kvack.org, linux-arch@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
Pedro Falcato <pfalcato@suse.de>,
Matthew Wilcox <willy@infradead.org>
Subject: Re: [DISCUSSION] proposed mctl() API
Date: Tue, 10 Jun 2025 17:02:06 +0100
Message-ID: <0d2046ef-7ad5-4224-a34c-fec473a0f180@lucifer.local>
In-Reply-To: <2fd7f80c-2b13-4478-900a-d65547586db3@gmail.com>

On Tue, Jun 10, 2025 at 04:30:43PM +0100, Usama Arif wrote:
>
>
> On 10/06/2025 16:17, Lorenzo Stoakes wrote:
> > On Tue, Jun 10, 2025 at 04:03:07PM +0100, Usama Arif wrote:
> >>
> >>
> >> On 30/05/2025 14:10, Lorenzo Stoakes wrote:
> >>> On Thu, May 29, 2025 at 06:21:55PM +0100, Usama Arif wrote:
> >>>>
> >>>>
> >>>> My knowledge of security is limited, so please bear with me, but I actually
> >>>> didn't understand the security issue and the need for CAP_SYS_ADMIN for
> >>>> doing VM_(NO)HUGEPAGE.
> >>>>
> >>>> A process can already madvise its own VMAs, and this is just doing that
> >>>> for the entire process. And VM_INIT_DEF_MASK is already set to VM_NOHUGEPAGE
> >>>> so it will be inherited from the parent. Just adding VM_HUGEPAGE shouldn't be
> >>>> an issue? Inheriting MMF_VM_HUGEPAGE will mean that khugepaged would enter
> >>>> for that process as well, which again doesn't seem like a security issue
> >>>> to me.
> >>>
> >>> W.R.T. the current process, the issue is one Jann raised, in relation to
> >>> propagation of behaviour to privileged (e.g. setuid) processes.
> >>>
> >>
> >> But what is the actual security issue of having hugepages (or not having them) when
> >> the process is running with setuid?
> >
> > Speak to Jann about this. Security isn't my area. He gave feedback on this,
> > which is why I raised it, if you search through previous threads you can find
> > it.
> >
>
> Yes, he is in CC here as well. I have read it in the previous thread. Just raising it
> here as it was mentioned here :)
>
> >>
> >> I know the cgroup proposal has been shot down, but let's imagine if this was a cgroup
> >> setting, similar to the other memory controls we have, e.g. memory.swap.{max,high,peak}.
> >>
> >> We can chown the cgroup so that the property is set by an unprivileged process.
> >>
> >> Having the process swap with setuid when the unprivileged process has swap disabled
> >> in the cgroup is not the right behaviour. What currently happens is that the process,
> >> after obtaining the higher privilege level, doesn't swap either.
> >>
> >> Similarly for hugepages, if it was a cgroup level setting, having the process always
> >> get hugepages with setuid when the unprivileged user had disabled it, or vice versa,
> >> would not be the right behaviour.
> >>
> >> Another example is PR_SET_MEMORY_MERGE, setuid does not change how it works as far as
> >> I can tell.
> >>
> >> So I don't see what the security issue is and why we would need to elevate privileges
> >> to do this.
> >>
> >>> W.R.T. remote processes, obviously we want to make sure we are permitted to do
> >>> so.
> >>>
> >>
> >> I know that this needs to be future proof. But I don't actually know of a real world
> >> use case where we want to do any of these things for remote processes.
> >> Whether it's the existing per-process changes like PR_SET_MEMORY_MERGE for KSM and
> >> PR_SET_THP_DISABLE for THP, or the newer proposals of PR_DEFAULT_MADV_(NO)HUGEPAGE
> >> or Barry's proposal, all of them are for the process itself (and its children by
> >> fork+exec) and not for remote processes. As we try to make our changes use case
> >> driven, I think we should not add support for remote processes (which is another
> >> reason why I think this might sit better in prctl).
> >
> > I'm extremely confused as to why you think this proposal is predicated upon
> > remote process manipulation? It was simply suggested as a possibility for
> > increased flexibility.
> >
> > We can just remove this parameter, no?
> >
>
> Sure.
>
> > It is entirely orthogonal to the prctl() stuff.
> >
> > Overall at this point I share Matthew's point of view on this - we shouldn't be
> > doing any of this upstream.
>
> As I replied to Matthew in [1], it would be amazing if it was not needed, but that's not
> how it works in the medium term and I don't think it will work even in the long term.
> I will paste my answer from [1] below as well:
>
> If we have 2 workloads on the same server, e.g. one is a database where THPs
> just don't do well, but the other one is AI where THPs do really well, how
> will the kernel monitor that the database workload is performing worse
> and the AI one isn't?
>
> I added the THP shrinker to hopefully try and do this automatically, and it does
> really help. But unfortunately it is not a complete solution.
> There are severely memory bound workloads where even a tiny increase
> in memory will lead to an OOM. And if you colocate the container that's running
> that workload with one which would benefit from THPs, we unfortunately
> can't just rely on the system doing the right thing.
>
> It would be awesome if THPs are truly transparent and don't require
> any input, but unfortunately I don't think that there is a solution
> for this with just kernel monitoring.
>
> This is just a big hint from the user. If the global system policy is madvise
> and the workload owner has done their own benchmarks and sees benefits
> with always, they set DEFAULT_MADV_HUGEPAGE for the process to opt in as "always".
> If the global system policy is always and the workload owner has done their own
> benchmarks and sees worse results with always, they set DEFAULT_MADV_NOHUGEPAGE for
> the process to opt in as "madvise".
>
> [1] https://lore.kernel.org/all/162c14e6-0b16-4698-bd76-735037ea0d73@gmail.com/
>
>
Yup, I appreciate these points, and I feel we have discussed them quite a
bit :) I echo them.

Nobody is denying that the interface is sucky, that THPs are not as
transparent as they should be, or that we lack decent non-cgroup 'policy'
manipulation.

BUT.

We're talking about adding a permanent hack into the kernel that
force-sets a VMA flag for all VMAs across fork/exec.

I have simply been trying to flesh out the _least worst_ means of
doing this - _if we have to do it_.

That last bit being operative - I have come to think, based on Matthew's
feedback, that the ROI of permanently adding this hack is not a good one.

I think the case remains to be made for that.
> I haven't seen activity on this thread over the past week, but I was hoping
> we can reach a consensus on which approach to use, prctl or mctl.
> If it's mctl and if you don't think this should be done, please let me know
> if you would like me to work on this instead. This is a valid, big real-world
> use case that is a real blocker for deploying THPs in workloads in servers.

Please exercise patience; upstream moves at its own pace.
>
> Thanks!
> Usama