Re: [DISCUSSION] proposed mctl() API

From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Usama Arif <usamaarif642@gmail.com>
Cc: David Hildenbrand <david@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
	Arnd Bergmann <arnd@arndb.de>,
	Christian Brauner <brauner@kernel.org>,
	SeongJae Park <sj@kernel.org>, Mike Rapoport <rppt@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Barry Song <21cnbao@gmail.com>,
	linux-mm@kvack.org, linux-arch@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
	Pedro Falcato <pfalcato@suse.de>,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: [DISCUSSION] proposed mctl() API
Date: Tue, 10 Jun 2025 17:02:06 +0100
Message-ID: <0d2046ef-7ad5-4224-a34c-fec473a0f180@lucifer.local>
In-Reply-To: <2fd7f80c-2b13-4478-900a-d65547586db3@gmail.com>

On Tue, Jun 10, 2025 at 04:30:43PM +0100, Usama Arif wrote:
>
>
> On 10/06/2025 16:17, Lorenzo Stoakes wrote:
> > On Tue, Jun 10, 2025 at 04:03:07PM +0100, Usama Arif wrote:
> >>
> >>
> >> On 30/05/2025 14:10, Lorenzo Stoakes wrote:
> >>> On Thu, May 29, 2025 at 06:21:55PM +0100, Usama Arif wrote:
> >>>>
> >>>>
> >>>> My knowledge of security is limited, so please bear with me, but I actually
> >>>> didn't understand the security issue and the need for CAP_SYS_ADMIN for
> >>>> doing VM_(NO)HUGEPAGE.
> >>>>
> >>>> A process can already madvise its own VMAs, and this is just doing that
> >>>> for the entire process. And VM_INIT_DEF_MASK is already set to VM_NOHUGEPAGE
> >>>> so it will be inherited from the parent. Just adding VM_HUGEPAGE shouldn't be
> >>>> an issue? Inheriting MMF_VM_HUGEPAGE will mean that khugepaged would enter
> >>>> for that process as well, which again doesn't seem like a security issue
> >>>> to me.
> >>>
> >>> W.R.T. the current process, the issue is one Jann raised, in relation to
> >>> propagation of behaviour to privileged (e.g. setuid) processes.
> >>>
> >>
> >> But what is the actual security issue of having hugepages (or not having them) when
> >> the process is running with setuid?
> >
> > Speak to Jann about this. Security isn't my area. He gave feedback on this,
> > which is why I raised it. If you search through previous threads you can find
> > it.
> >
>
> Yes, he is in CC here as well. I have read it in the previous thread. Just raising it
> here as it was mentioned here :)
>
> >>
> >> I know the cgroup proposal has been shot down, but let's imagine if this was a cgroup
> >> setting, similar to the other memory controls we have, e.g. memory.swap.{max,high,peak}.
> >>
> >> We can chown the cgroup so that the property is set by an unprivileged process.
> >>
> >> Having the process swap with setuid when the unprivileged process has swap disabled
> >> in the cgroup is not the right behaviour. What currently happens is that the process,
> >> after obtaining the higher privilege level, doesn't swap either.
> >>
> >> Similarly for hugepages, if it was a cgroup level setting, having the process always get
> >> hugepages with setuid when the unprivileged user had it disabled, or vice versa,
> >> would not be the right behaviour.
> >>
> >> Another example is PR_SET_MEMORY_MERGE, setuid does not change how it works as far as
> >> I can tell.
> >>
> >> So I don't see what the security issue is and why we would need to elevate privileges
> >> to do this.
> >>
> >>> W.R.T. remote processes, obviously we want to make sure we are permitted to do
> >>> so.
> >>>
> >>
> >> I know that this needs to be future proof. But I don't actually know of a real world
> >> use case where we want to do any of these things for remote processes.
> >> Whether it's the existing per process changes like PR_SET_MEMORY_MERGE for KSM and
> >> PR_SET_THP_DISABLE for THP or the newer proposals of PR_DEFAULT_MADV_(NO)HUGEPAGE
> >> or Barry's proposal.
> >> All of them are for the process itself (and its children by fork+exec) and not for
> >> remote processes. As we try to make our changes use case driven, I think we should
> >> not add support for remote processes (which is another reason why I think this might
> >> sit better in prctl).
> >
> > I'm extremely confused as to why you think this proposal is predicated upon
> > remote process manipulation? It was simply suggested as a possibility for
> > increased flexibility.
> >
> > We can just remove this parameter, no?
> >
>
> Sure.
>
> > It is entirely orthogonal to the prctl() stuff.
> >
> > Overall at this point I share Matthew's point of view on this - we shouldn't be
> > doing any of this upstream.
>
> As I replied to Matthew in [1], it would be amazing if it was not needed, but that's not
> how it works in the medium term and I don't think it will work even in the long term.
> I will paste my answer from [1] below as well:
>
> If we have 2 workloads on the same server, e.g. one is a database where THPs
> just don't do well, but the other one is AI where THPs do really well. How
> will the kernel monitor that the database workload is performing worse
> and the AI one isn't?
>
> I added the THP shrinker to hopefully try and do this automatically, and it does
> really help. But unfortunately it is not a complete solution.
> There are severely memory bound workloads where even a tiny increase
> in memory will lead to an OOM. And if you colocate the container that's running
> that workload with one which will benefit from THPs, we unfortunately
> can't just rely on the system doing the right thing.
>
> It would be awesome if THPs are truly transparent and don't require
> any input, but unfortunately I don't think that there is a solution
> for this with just kernel monitoring.
>
> This is just a big hint from the user. If the global system policy is madvise
> and the workload owner has done their own benchmarks and sees benefits
> with always, they set DEFAULT_MADV_HUGEPAGE for the process to opt in as "always".
> If the global system policy is always and the workload owner has done their own
> benchmarks and sees worse results with always, they set DEFAULT_MADV_NOHUGEPAGE for
> the process to opt in as "madvise".
>
> [1] https://lore.kernel.org/all/162c14e6-0b16-4698-bd76-735037ea0d73@gmail.com/
>
>

Yup I appreciate these points, and we have discussed them I feel quite a
bit :) I echo them.

Nobody denies that the interface is sucky, that THPs are not as transparent
as they should be, or that we lack decent non-cgroup 'policy'
manipulation.

BUT.

We're talking about adding a permanent hack into the kernel that
force-sets a VMA flag for all VMAs across fork/exec.

I have simply been trying to flesh out the _least worst_ means of
doing this - _if we have to do it_.

That last bit being operative - I have come to think, based on Matthew's
feedback, that the RoI of permanently adding this hack is not a good one.

I think the case remains to be made for that.

> I haven't seen activity on this thread over the past week, but I was hoping
> we can reach a consensus on which approach to use, prctl or mctl.
> If it's mctl and if you don't think this should be done, please let me know
> if you would like me to work on this instead. This is a valid, big real-world
> use case that is a real blocker for deploying THPs in workloads on servers.

Please exercise patience, upstream moves at its own pace.

>
> Thanks!
> Usama
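
The opt-in rule quoted above (global "madvise" plus DEFAULT_MADV_HUGEPAGE
behaving like "always", and global "always" plus DEFAULT_MADV_NOHUGEPAGE
behaving like "madvise") can be illustrated with a small model. This is only
a sketch and not kernel code: the function names are hypothetical, the
eligibility rule is a simplification of how the global policy interacts with
VM_HUGEPAGE and VM_NOHUGEPAGE, and it assumes that an explicit madvise() on a
VMA overrides the proposed process-wide default, which the thread does not
spell out.

```python
from typing import Optional, Tuple

# Stand-ins for the per-VMA flags discussed in the thread.
HUGEPAGE = "hugepage"      # models VM_HUGEPAGE
NOHUGEPAGE = "nohugepage"  # models VM_NOHUGEPAGE


def effective_flag(process_default: Optional[str],
                   explicit: Optional[str]) -> Optional[str]:
    """Flag a VMA ends up with.

    Assumption: an explicit madvise() on the VMA wins over the
    process-wide default; otherwise the default (if any) applies.
    """
    return explicit if explicit is not None else process_default


def thp_eligible(global_policy: str, flag: Optional[str]) -> bool:
    """Simplified THP eligibility for one VMA under a global policy."""
    if global_policy == "always":
        return flag != NOHUGEPAGE
    if global_policy == "madvise":
        return flag == HUGEPAGE
    if global_policy == "never":
        return False
    raise ValueError(f"unknown policy: {global_policy!r}")


def behaviour(global_policy: str,
              process_default: Optional[str]) -> Tuple[bool, bool, bool]:
    """Eligibility of three VMAs: untouched, MADV_HUGEPAGE, MADV_NOHUGEPAGE."""
    return tuple(
        thp_eligible(global_policy, effective_flag(process_default, explicit))
        for explicit in (None, HUGEPAGE, NOHUGEPAGE)
    )


if __name__ == "__main__":
    # Global "madvise" + default HUGEPAGE matches plain "always".
    print(behaviour("madvise", HUGEPAGE) == behaviour("always", None))
    # Global "always" + default NOHUGEPAGE matches plain "madvise".
    print(behaviour("always", NOHUGEPAGE) == behaviour("madvise", None))
```

In this model the process-wide default only changes what happens to VMAs the
application never madvise()d, which is why each hint makes the process look as
if it were running under the other global policy.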
