[DISCUSSION] proposed mctl() API

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [DISCUSSION] proposed mctl() API
@ 2025-05-29 14:43 Lorenzo Stoakes
  2025-05-29 15:28 ` Matthew Wilcox
                   ` (3 more replies)
  0 siblings, 4 replies; 33+ messages in thread
From: Lorenzo Stoakes @ 2025-05-29 14:43 UTC (permalink / raw)
  To: Andrew Morton, Shakeel Butt, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	SeongJae Park, Usama Arif, Mike Rapoport, Johannes Weiner,
	Barry Song, linux-mm, linux-arch, linux-kernel, linux-api,
	Pedro Falcato

## INTRODUCTION

After discussions in various threads (Usama's series adding a new prctl()
in [0], and a proposal to adapt process_madvise() to do the same -
conception in [1] and RFC in [2]), it seems fairly clear that it would make
sense to explore a dedicated API to explicitly allow for actions which
affect the virtual address space as a whole.

Also, Barry is implementing a feature (currently under RFC) which could
additionally make use of this API (see [3]).

[0]: https://lore.kernel.org/all/20250515133519.2779639-1-usamaarif642@gmail.com/
[1]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
[2]: https://lore.kernel.org/all/cover.1747686021.git.lorenzo.stoakes@oracle.com/
[3]: https://lore.kernel.org/all/20250514070820.51793-1-21cnbao@gmail.com/

While madvise() and process_madvise() are useful for altering the
attributes of VMAs within a virtual address space, it isn't the right fit
for something that affects the whole address space.

Additionally, a requirement of Usama's proposal (see [0]) is that we have
the ability to propagate the change in behaviour across fork/exec. This
further suggests the need for a dedicated interface, as this really sits
outside the ordinary behaviour of [process_]madvise().

prctl() is too broad and encourages mm code to migrate to kernel/sys.c
where it is at risk of bit-rotting. It can make it harder/impossible to
isolate mm logic for testing and logic there might be missed in changes
moving forward.

It also, like so many kernel interfaces, has 'grown roots out of its pot'
so to speak - while it started off as an ostensible 'process' manipulation
interface, prctl() operations manipulate a myriad of task, virtual
address space and even specific VMA attributes.

At this stage it really is a 'catch-all' for things we simply couldn't fit
elsewhere.

Therefore, as suggested by the rather excellent Liam Howlett, I propose an
mm-specific interface that _explicitly_ manipulates attributes of the
virtual address space as a whole.

I think something that mimics the simplicity of [process_]madvise() makes
sense - have a limited set of actions that can be taken, and treat them as
a simple action - a user requests you do XXX to the virtual address space
(that is, across the mm_struct), and you do it.

## INTERFACE

The proposed interface is simply:

int mctl(int pidfd, int action, unsigned int flags);

Since PIDFD_SELF is now available, it is easy to invoke this for the
current process, while also adding the flexibility of being able to apply
actions to other processes also.

The function will return 0 on success, -1 on failure, with errno set to the
error that arose, standard stuff.

The behaviour will be tailored to each action taken.

To begin with, I propose a single flag:

- MCTL_SET_DEFAULT_EXEC - Persists this behaviour across fork/exec.

This again will be tailored - only certain actions will be allowed to set
this flag, and we will of course assert appropriate capabilities, etc. upon
its use.

All actions would, impact every VMA (if adjustments to VMAs are required).

## SECURITY

Of course, security will be of utmost concern (Jann's input is important
here :)

We can vary security requirements depending on the action taken.

For an initial version I suggest we simply limit operations which:

- Operate on a remote process
- Use the MCTL_SET_DEFAULT_EXEC flag

To those tasks which possess the CAP_SYS_ADMIN capability.

This may be too restrictive - be good to get some feedback on this.

I know Jann raised concerns around privileged execution and perhaps it'd be
useful to see whether this would make more sense for the SET_DEFAULT_EXEC
case or not.

Usama - would requiring CAP_SYS_ADMIN be egregious to your use case?

## IMPLEMENTATION

I think that sensibly we'd need to add some new files here, mm/mctl.c,
include/linux/mctl.h (the latter of providing the MCTL_xxx actions and
flags).

We could find ways to share code between mm files where appropriate to
avoid too much duplication.

I suggest that the best way forward, if we were minded to examine how this
would look in practice, would be for me to implement an RFC that adds the
interface, and a simple MCTL_SET_NOHUGEPAGE, MCTL_CLEAR_NOHUGEPAGE
implementation as a proof of concept.

If we wanted to then go ahead with a non-RFC version, this could then form
a foundation upon which Usama and Barry could implement their features,
with Usama then able to add MCTL_[SET/CLEAR]_HUGEPAGE and Barry
MCTL_[SET/CLEAR]_FADE_ON_DEATH.

Obviously I don't mean to presume to suggest how we might proceed here -
only suggesting this might be a good way of moving forward and getting
things done as quickly as possible while allowing you guys to move forward
with your features.

Let me know if this makes sense, alternatively I could try to find a
relatively benign action to implement as part of the base work, or we could
simply collaborate to do it all in one series with multiple authors?

## RFC

The above is all only in effect 'putting ideas out there' so this is
entirely an RFC in spirit and intent - let me know if this makes sense in
whole or part :)

Thanks!

Lorenzo

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-05-29 14:43 [DISCUSSION] proposed mctl() API Lorenzo Stoakes
@ 2025-05-29 15:28 ` Matthew Wilcox
  2025-05-29 17:54   ` Shakeel Butt
                     ` (2 more replies)
  2025-05-29 17:21 ` Usama Arif
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 33+ messages in thread
From: Matthew Wilcox @ 2025-05-29 15:28 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Shakeel Butt, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	SeongJae Park, Usama Arif, Mike Rapoport, Johannes Weiner,
	Barry Song, linux-mm, linux-arch, linux-kernel, linux-api,
	Pedro Falcato

On Thu, May 29, 2025 at 03:43:26PM +0100, Lorenzo Stoakes wrote:
> After discussions in various threads (Usama's series adding a new prctl()
> in [0], and a proposal to adapt process_madvise() to do the same -
> conception in [1] and RFC in [2]), it seems fairly clear that it would make
> sense to explore a dedicated API to explicitly allow for actions which
> affect the virtual address space as a whole.
> 
> Also, Barry is implementing a feature (currently under RFC) which could
> additionally make use of this API (see [3]).

I think the reason that you're having trouble coming up with a good
place to put these ideas is because they are all bad ideas.  Do none of
them.  Problem solved.

People should put more effort into allocating THPs automatically and
monitoring where they're helping performance and where they're hurting
performance, instead of coming up with these baroque reasons to blame
the sysadmin for not having tweaked some magic knob.

Barry's problem is that we're all nervous about possibly regressing
performance on some unknown workloads.  Just try Barry's proposal, see
if anyone actually compains or if we're just afraid of our own shadows.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-05-29 14:43 [DISCUSSION] proposed mctl() API Lorenzo Stoakes
  2025-05-29 15:28 ` Matthew Wilcox
@ 2025-05-29 17:21 ` Usama Arif
  2025-05-30 13:10   ` Lorenzo Stoakes
  2025-05-29 18:50 ` Andy Lutomirski
  2025-05-29 21:31 ` Andrew Morton
  3 siblings, 1 reply; 33+ messages in thread
From: Usama Arif @ 2025-05-29 17:21 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton, Shakeel Butt, Liam R . Howlett,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Arnd Bergmann,
	Christian Brauner, SeongJae Park, Mike Rapoport, Johannes Weiner,
	Barry Song, linux-mm, linux-arch, linux-kernel, linux-api,
	Pedro Falcato



On 29/05/2025 15:43, Lorenzo Stoakes wrote:
> ## INTRODUCTION
> 
> After discussions in various threads (Usama's series adding a new prctl()
> in [0], and a proposal to adapt process_madvise() to do the same -
> conception in [1] and RFC in [2]), it seems fairly clear that it would make
> sense to explore a dedicated API to explicitly allow for actions which
> affect the virtual address space as a whole.
> 
> Also, Barry is implementing a feature (currently under RFC) which could
> additionally make use of this API (see [3]).
> 
> [0]: https://lore.kernel.org/all/20250515133519.2779639-1-usamaarif642@gmail.com/
> [1]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> [2]: https://lore.kernel.org/all/cover.1747686021.git.lorenzo.stoakes@oracle.com/
> [3]: https://lore.kernel.org/all/20250514070820.51793-1-21cnbao@gmail.com/
> 
> While madvise() and process_madvise() are useful for altering the
> attributes of VMAs within a virtual address space, it isn't the right fit
> for something that affects the whole address space.
> 
> Additionally, a requirement of Usama's proposal (see [0]) is that we have
> the ability to propagate the change in behaviour across fork/exec. This
> further suggests the need for a dedicated interface, as this really sits
> outside the ordinary behaviour of [process_]madvise().
> 
> prctl() is too broad and encourages mm code to migrate to kernel/sys.c
> where it is at risk of bit-rotting. It can make it harder/impossible to
> isolate mm logic for testing and logic there might be missed in changes
> moving forward.
> 
> It also, like so many kernel interfaces, has 'grown roots out of its pot'
> so to speak - while it started off as an ostensible 'process' manipulation
> interface, prctl() operations manipulate a myriad of task, virtual
> address space and even specific VMA attributes.
> 
> At this stage it really is a 'catch-all' for things we simply couldn't fit
> elsewhere.
> 
> Therefore, as suggested by the rather excellent Liam Howlett, I propose an
> mm-specific interface that _explicitly_ manipulates attributes of the
> virtual address space as a whole.
> 
> I think something that mimics the simplicity of [process_]madvise() makes
> sense - have a limited set of actions that can be taken, and treat them as
> a simple action - a user requests you do XXX to the virtual address space
> (that is, across the mm_struct), and you do it.
> 


Hi Lorenzo,

Thanks for writing the proposal, this is awesome!

Whatever the community agrees with, whether its this or prctl, happy to
move forward with either as both should accomplish the usecase proposed.

I will just add some points over here in defence of prctl, this is just for
discussion, and if the community disagrees, completely happy to move forward
with new syscall as well.

When it comes to having mm code in kernel/sys.c, we can just do something
like below that can actually clean it up? 

diff --git a/kernel/sys.c b/kernel/sys.c
index 3a2df1bd9f64..bfadc339e2c5 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2467,6 +2467,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
        error = 0;
        switch (option) {
+       case PR_SET_MM:
+       case PR_GET_THP_DISABLE:
+       case PR_SET_THP_DISABLE:
+       case PR_NEW_MM_THING:
+               error = some_function_in_mm_folder(); // in mm/mctl.c ?
+               break;
        case PR_SET_PDEATHSIG:
                if (!valid_signal(arg2)) {
                        error = -EINVAL;

when it comes to prctl becoming a catch-all thing, with above clean up,
we can be a lot more careful to what gets added to the mm side of prctl.

The advantage of this is it avoids having another syscall.
My personal view (which can be wrong :)) is that a new syscall should be
for something major,
and I feel that PR_DEFAULT_MADV_HUGEPAGE and PR_DEFAULT_MADV_NOHUGEPAGE
might be small enough to fit in prctl? but I completely understand
your point of view as well!

> ## INTERFACE
> 
> The proposed interface is simply:
> 
> int mctl(int pidfd, int action, unsigned int flags);
> 
> Since PIDFD_SELF is now available, it is easy to invoke this for the
> current process, while also adding the flexibility of being able to apply
> actions to other processes also.
> 
> The function will return 0 on success, -1 on failure, with errno set to the
> error that arose, standard stuff.
> 
> The behaviour will be tailored to each action taken.
> 
> To begin with, I propose a single flag:
> 
> - MCTL_SET_DEFAULT_EXEC - Persists this behaviour across fork/exec.
> 
> This again will be tailored - only certain actions will be allowed to set
> this flag, and we will of course assert appropriate capabilities, etc. upon
> its use.
> 

Sounds good to me. Just adding this here, the solution will be used in systemd
in exec_invoke, similar to how KSM is done with prctl in [1], so for the THP
solution, we would need MCTL_SET_DEFAULT_EXEC as it would need to be inherited
across fork+exec. 

[1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5046

> All actions would, impact every VMA (if adjustments to VMAs are required).
> 
> ## SECURITY
> 
> Of course, security will be of utmost concern (Jann's input is important
> here :)
> 
> We can vary security requirements depending on the action taken.
> 
> For an initial version I suggest we simply limit operations which:
> 
> - Operate on a remote process
> - Use the MCTL_SET_DEFAULT_EXEC flag
> 
> To those tasks which possess the CAP_SYS_ADMIN capability.
> 
> This may be too restrictive - be good to get some feedback on this.
> 
> I know Jann raised concerns around privileged execution and perhaps it'd be
> useful to see whether this would make more sense for the SET_DEFAULT_EXEC
> case or not.
> 
> Usama - would requiring CAP_SYS_ADMIN be egregious to your use case?
> 

My knowledge is security is limited, so please bare with me, but I actually
didn't understand the security issue and the need for CAP_SYS_ADMIN for
doing VM_(NO)HUGEPAGE.

A process can already madvise its own VMAs, and this is just doing that
for the entire process. And VM_INIT_DEF_MASK is already set to VM_NOHUGEPAGE
so it will be inherited by the parent. Just adding VM_HUGEPAGE shouldnt be
a issue? Inheriting MMF_VM_HUGEPAGE will mean that khugepaged would enter
for that process as well, which again doesnt seem like a security issue
to me.

> ## IMPLEMENTATION
> 
> I think that sensibly we'd need to add some new files here, mm/mctl.c,
> include/linux/mctl.h (the latter of providing the MCTL_xxx actions and
> flags).
> 
> We could find ways to share code between mm files where appropriate to
> avoid too much duplication.
> 
> I suggest that the best way forward, if we were minded to examine how this
> would look in practice, would be for me to implement an RFC that adds the
> interface, and a simple MCTL_SET_NOHUGEPAGE, MCTL_CLEAR_NOHUGEPAGE
> implementation as a proof of concept.
> 
> If we wanted to then go ahead with a non-RFC version, this could then form
> a foundation upon which Usama and Barry could implement their features,
> with Usama then able to add MCTL_[SET/CLEAR]_HUGEPAGE and Barry
> MCTL_[SET/CLEAR]_FADE_ON_DEATH.
> 
> Obviously I don't mean to presume to suggest how we might proceed here -
> only suggesting this might be a good way of moving forward and getting
> things done as quickly as possible while allowing you guys to move forward
> with your features.
> 
> Let me know if this makes sense, alternatively I could try to find a
> relatively benign action to implement as part of the base work, or we could
> simply collaborate to do it all in one series with multiple authors?
> 
> ## RFC
> 
> The above is all only in effect 'putting ideas out there' so this is
> entirely an RFC in spirit and intent - let me know if this makes sense in
> whole or part :)
> 
> Thanks!
> 
> Lorenzo

Again thanks for the proposal! Happy to move forward with this or prctl.
Just adding my 2 cents in this email.

Thanks
Usama



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-05-29 15:28 ` Matthew Wilcox
@ 2025-05-29 17:54   ` Shakeel Butt
  2025-05-29 18:13     ` Matthew Wilcox
  2025-05-29 21:14   ` Johannes Weiner
  2025-06-04 12:28   ` Lorenzo Stoakes
  2 siblings, 1 reply; 33+ messages in thread
From: Shakeel Butt @ 2025-05-29 17:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Arnd Bergmann,
	Christian Brauner, SeongJae Park, Usama Arif, Mike Rapoport,
	Johannes Weiner, Barry Song, linux-mm, linux-arch, linux-kernel,
	linux-api, Pedro Falcato

Hi Willy,

On Thu, May 29, 2025 at 04:28:46PM +0100, Matthew Wilcox wrote:
> On Thu, May 29, 2025 at 03:43:26PM +0100, Lorenzo Stoakes wrote:
> > After discussions in various threads (Usama's series adding a new prctl()
> > in [0], and a proposal to adapt process_madvise() to do the same -
> > conception in [1] and RFC in [2]), it seems fairly clear that it would make
> > sense to explore a dedicated API to explicitly allow for actions which
> > affect the virtual address space as a whole.
> > 
> > Also, Barry is implementing a feature (currently under RFC) which could
> > additionally make use of this API (see [3]).
> 
> I think the reason that you're having trouble coming up with a good
> place to put these ideas is because they are all bad ideas.  Do none of
> them.  Problem solved.
> 
> People should put more effort into allocating THPs automatically and
> monitoring where they're helping performance and where they're hurting
> performance,

Can you please expand on people putting more effort? Is it about
workloads or maybe malloc implementations (tcmalloc, jemalloc) being
more intelligent in managing their allocations/frees to keep more used
memory in hugepage aligned regions? And conveying to kernel which
regions they prefer hugepage backed and which they do not? Or something
else?

> instead of coming up with these baroque reasons to blame
> the sysadmin for not having tweaked some magic knob.

To me this is not about blaming sysadmin but more about sysadmin wanting
more fine grained control on THP allocation policies for different
workloads running in a multi-tenant environment.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-05-29 17:54   ` Shakeel Butt
@ 2025-05-29 18:13     ` Matthew Wilcox
  2025-05-29 18:32       ` Usama Arif
  0 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2025-05-29 18:13 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Arnd Bergmann,
	Christian Brauner, SeongJae Park, Usama Arif, Mike Rapoport,
	Johannes Weiner, Barry Song, linux-mm, linux-arch, linux-kernel,
	linux-api, Pedro Falcato

On Thu, May 29, 2025 at 10:54:34AM -0700, Shakeel Butt wrote:
> On Thu, May 29, 2025 at 04:28:46PM +0100, Matthew Wilcox wrote:
> > People should put more effort into allocating THPs automatically and
> > monitoring where they're helping performance and where they're hurting
> > performance,
> 
> Can you please expand on people putting more effort? Is it about
> workloads or maybe malloc implementations (tcmalloc, jemalloc) being
> more intelligent in managing their allocations/frees to keep more used
> memory in hugepage aligned regions? And conveying to kernel which
> regions they prefer hugepage backed and which they do not? Or something
> else?

We need infrastructure inside the kernel to monitor whether a task is
making effective use of the THPs that it has, and if it's not then move
those THPs over to where they will be better used.

I don't necessarily object to userspace giving hints like "I think I'm
going to use all of this 20MB region quite heavily", but the kernel should
treat those hints with the appropriate skepticism, otherwise it's just
a turbo button that nobody would ever _not_ press.

> > instead of coming up with these baroque reasons to blame
> > the sysadmin for not having tweaked some magic knob.
> 
> To me this is not about blaming sysadmin but more about sysadmin wanting
> more fine grained control on THP allocation policies for different
> workloads running in a multi-tenant environment.

That's the same thing.  Linux should be auto-tuning, not relying on some
omniscient sysadmin to fix it up.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-05-29 18:13     ` Matthew Wilcox
@ 2025-05-29 18:32       ` Usama Arif
  0 siblings, 0 replies; 33+ messages in thread
From: Usama Arif @ 2025-05-29 18:32 UTC (permalink / raw)
  To: Matthew Wilcox, Shakeel Butt
  Cc: Lorenzo Stoakes, Andrew Morton, Liam R . Howlett,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Arnd Bergmann,
	Christian Brauner, SeongJae Park, Mike Rapoport, Johannes Weiner,
	Barry Song, linux-mm, linux-arch, linux-kernel, linux-api,
	Pedro Falcato

On 29/05/2025 19:13, Matthew Wilcox wrote:
> On Thu, May 29, 2025 at 10:54:34AM -0700, Shakeel Butt wrote:
>> On Thu, May 29, 2025 at 04:28:46PM +0100, Matthew Wilcox wrote:
>>> People should put more effort into allocating THPs automatically and
>>> monitoring where they're helping performance and where they're hurting
>>> performance,
>>
>> Can you please expand on people putting more effort? Is it about
>> workloads or maybe malloc implementations (tcmalloc, jemalloc) being
>> more intelligent in managing their allocations/frees to keep more used
>> memory in hugepage aligned regions? And conveying to kernel which
>> regions they prefer hugepage backed and which they do not? Or something
>> else?
> 
> We need infrastructure inside the kernel to monitor whether a task is
> making effective use of the THPs that it has, and if it's not then move
> those THPs over to where they will be better used.
> 

I think this is the really difficult part.

If we have 2 workloads on the same server, For e.g. one is database where THPs 
just dont do well, but the other one is AI where THPs do really well. How
will the kernel monitor that the database workload is performing worse
and the AI one isnt?

I added THP shrinker to hopefully try and do this automatically, and it does
really help. But unfortunately it is not a complete solution.
There are severely memory bound workloads where even a tiny increase
in memory will lead to an OOM. And if you colocate the container thats running
that workload with one in which we will benefit with THPs, we unfortunately
can't just rely on the system doing the right thing.

It would be awesome if THPs are truly transparent and don't require
any input, but unfortunately I don't think that there is a solution
for this with just kernel monitoring.

This is just a big hint from the user. If the global system policy is madvise
and the workload owner has done their own benchmarks and see benefits
with always, they set DEFAULT_MADV_HUGEPAGE for the process to optin as "always".
If the global system policy is always and the workload owner has done their own 
benchmarks and see worse results with always, they set DEFAULT_MADV_NOHUGEPAGE for 
the process to optin as "madvise". 

> I don't necessarily object to userspace giving hints like "I think I'm
> going to use all of this 20MB region quite heavily", but the kernel should
> treat those hints with the appropriate skepticism, otherwise it's just
> a turbo button that nobody would ever _not_ press.
> 
>>> instead of coming up with these baroque reasons to blame
>>> the sysadmin for not having tweaked some magic knob.
>>
>> To me this is not about blaming sysadmin but more about sysadmin wanting
>> more fine grained control on THP allocation policies for different
>> workloads running in a multi-tenant environment.
> 
> That's the same thing.  Linux should be auto-tuning, not relying on some
> omniscient sysadmin to fix it up.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-05-29 14:43 [DISCUSSION] proposed mctl() API Lorenzo Stoakes
  2025-05-29 15:28 ` Matthew Wilcox
  2025-05-29 17:21 ` Usama Arif
@ 2025-05-29 18:50 ` Andy Lutomirski
  2025-05-29 21:31 ` Andrew Morton
  3 siblings, 0 replies; 33+ messages in thread
From: Andy Lutomirski @ 2025-05-29 18:50 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Shakeel Butt, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	SeongJae Park, Usama Arif, Mike Rapoport, Johannes Weiner,
	Barry Song, linux-mm, linux-arch, linux-kernel, linux-api,
	Pedro Falcato

On Thu, May 29, 2025 at 7:44 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> The behaviour will be tailored to each action taken.
>
> To begin with, I propose a single flag:
>
> - MCTL_SET_DEFAULT_EXEC - Persists this behaviour across fork/exec.

It's hard to comment without a more complete proposal (*what* behavior
is persisted?), but off the top of my head, this isn't so great.

First, the name means nothing to me.  What's "default exec"?  Even
aside from that, why are fork and exec the same flag?

Beyond this, persisting anything across exec is a giant can of worms.
We have PR_SET_NO_NEW_PRIVS to make it less hazardous, but, in
general, it's risky and potentially quite confusing to do anything
that affects exec'd processes.

Oh, and this whole scheme is also potentially nasty for a different
reason: it's not thread safe.  If one thread wants to spawn a process,
it should not interfere with another thread doing something else.  So
making an mm flag that persists across close can interfere a bit, and
persisting it across clone + exec is even gnarlier.

For any of these, there should be matching query features -- CRIU,
debugging, etc should not be an afterthought.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-05-29 15:28 ` Matthew Wilcox
  2025-05-29 17:54   ` Shakeel Butt
@ 2025-05-29 21:14   ` Johannes Weiner
  2025-05-29 21:24     ` Liam R. Howlett
                       ` (3 more replies)
  2025-06-04 12:28   ` Lorenzo Stoakes
  2 siblings, 4 replies; 33+ messages in thread
From: Johannes Weiner @ 2025-05-29 21:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Lorenzo Stoakes, Andrew Morton, Shakeel Butt, Liam R . Howlett,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Arnd Bergmann,
	Christian Brauner, SeongJae Park, Usama Arif, Mike Rapoport,
	Barry Song, linux-mm, linux-arch, linux-kernel, linux-api,
	Pedro Falcato

On Thu, May 29, 2025 at 04:28:46PM +0100, Matthew Wilcox wrote:
> Barry's problem is that we're all nervous about possibly regressing
> performance on some unknown workloads.  Just try Barry's proposal, see
> if anyone actually compains or if we're just afraid of our own shadows.

I actually explained why I think this is a terrible idea. But okay, I
tried the patch anyway.

This is 'git log' on a hot kernel repo after a large IO stream:

                                     VANILLA                      BARRY
Real time                 49.93 (    +0.00%)         60.36 (   +20.48%)
User time                 32.10 (    +0.00%)         32.09 (    -0.04%)
System time               14.41 (    +0.00%)         14.64 (    +1.50%)
pgmajfault              9227.00 (    +0.00%)      18390.00 (   +99.30%)
workingset_refault_file  184.00 (    +0.00%)    236899.00 (+127954.05%)

Clearly we can't generally ignore page cache hits just because the
mmaps() are intermittent.

The whole point is to cache across processes and their various
apertures into a common, long-lived filesystem space.

Barry knows something about the relationship between certain processes
and certain files that he could exploit with MADV_COLD-on-exit
semantics. But that's not something the kernel can safely assume. Not
without defeating the page cache for an entire class of file accesses.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-05-29 21:14   ` Johannes Weiner
@ 2025-05-29 21:24     ` Liam R. Howlett
  2025-05-29 23:14       ` Johannes Weiner
  2025-05-30  7:52     ` Barry Song
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 33+ messages in thread
From: Liam R. Howlett @ 2025-05-29 21:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Matthew Wilcox, Lorenzo Stoakes, Andrew Morton, Shakeel Butt,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Arnd Bergmann,
	Christian Brauner, SeongJae Park, Usama Arif, Mike Rapoport,
	Barry Song, linux-mm, linux-arch, linux-kernel, linux-api,
	Pedro Falcato

* Johannes Weiner <hannes@cmpxchg.org> [250529 17:14]:
> On Thu, May 29, 2025 at 04:28:46PM +0100, Matthew Wilcox wrote:
> > Barry's problem is that we're all nervous about possibly regressing
> > performance on some unknown workloads.  Just try Barry's proposal, see
> > if anyone actually compains or if we're just afraid of our own shadows.
> 
> I actually explained why I think this is a terrible idea. But okay, I
> tried the patch anyway.
> 
> This is 'git log' on a hot kernel repo after a large IO stream:

Can you clarify this benchmark please?

Is this running 'git log', then stream IO, then running 'git log' again?

> 
>                                      VANILLA                      BARRY
> Real time                 49.93 (    +0.00%)         60.36 (   +20.48%)
> User time                 32.10 (    +0.00%)         32.09 (    -0.04%)
> System time               14.41 (    +0.00%)         14.64 (    +1.50%)
> pgmajfault              9227.00 (    +0.00%)      18390.00 (   +99.30%)
> workingset_refault_file  184.00 (    +0.00%)    236899.00 (+127954.05%)
> 
> Clearly we can't generally ignore page cache hits just because the
> mmaps() are intermittent.
> 
> The whole point is to cache across processes and their various
> apertures into a common, long-lived filesystem space.
> 
> Barry knows something about the relationship between certain processes
> and certain files that he could exploit with MADV_COLD-on-exit
> semantics. But that's not something the kernel can safely assume. Not
> without defeating the page cache for an entire class of file accesses.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-05-29 14:43 [DISCUSSION] proposed mctl() API Lorenzo Stoakes
                   ` (2 preceding siblings ...)
  2025-05-29 18:50 ` Andy Lutomirski
@ 2025-05-29 21:31 ` Andrew Morton
  3 siblings, 0 replies; 33+ messages in thread
From: Andrew Morton @ 2025-05-29 21:31 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Shakeel Butt, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	SeongJae Park, Usama Arif, Mike Rapoport, Johannes Weiner,
	Barry Song, linux-mm, linux-arch, linux-kernel, linux-api,
	Pedro Falcato

On Thu, 29 May 2025 15:43:26 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

>
> madvise()
>
> ...
>
> process_madvise()
>
> ...
>
> prctl()
>
> ...
>

Yeah.  I think there has always been an attitude "ooh, new syscalls are
scary and I probably need permission from someone so let's graft it
onto something which is already there and hope nobody notices".

New syscalls are super easy and are cheap - it's just a table entry! 
If it makes sense as a standalone thing, do that.

> The proposed interface is simply:
> 
> int mctl(int pidfd, int action, unsigned int flags);

Well, why `flags'?  One could even add a syscall per operation. 
Debatable.

> Of course, security will be of utmost concern (Jann's input is important
> here :)
> 
> We can vary security requirements depending on the action taken.
> 
> For an initial version I suggest we simply limit operations which:
> 
> - Operate on a remote process
> - Use the MCTL_SET_DEFAULT_EXEC flag
> 
> To those tasks which possess the CAP_SYS_ADMIN capability.
>
> This may be too restrictive - be good to get some feedback on this.

Permissions needs careful consideration on a case-by-case basis.
Clearly in many cases, user A should be able to manipulate user A's
mm's.

And if different modes of mctl() end up with different permission
structures, one wonders why everything was clumped under the same
syscall!  mctl() becomes "prctl() for memory".

Anyway, fun discussion.  I'm with willy ;)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-05-29 21:24     ` Liam R. Howlett
@ 2025-05-29 23:14       ` Johannes Weiner
  0 siblings, 0 replies; 33+ messages in thread
From: Johannes Weiner @ 2025-05-29 23:14 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: Matthew Wilcox, Lorenzo Stoakes, Andrew Morton, Shakeel Butt,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Arnd Bergmann,
	Christian Brauner, SeongJae Park, Usama Arif, Mike Rapoport,
	Barry Song, linux-mm, linux-arch, linux-kernel, linux-api,
	Pedro Falcato

On Thu, May 29, 2025 at 05:24:23PM -0400, Liam R. Howlett wrote:
> * Johannes Weiner <hannes@cmpxchg.org> [250529 17:14]:
> > On Thu, May 29, 2025 at 04:28:46PM +0100, Matthew Wilcox wrote:
> > > Barry's problem is that we're all nervous about possibly regressing
> > > performance on some unknown workloads.  Just try Barry's proposal, see
> > > if anyone actually compains or if we're just afraid of our own shadows.
> > 
> > I actually explained why I think this is a terrible idea. But okay, I
> > tried the patch anyway.
> > 
> > This is 'git log' on a hot kernel repo after a large IO stream:
> 
> Can you clarify this benchmark please?
>
> Is this running 'git log', then stream IO, then running 'git log' again?

Yes, but it's running git log twice first. On the vanilla kernel this
is the number of references when we usually activate.

You can substitute any sequence of commands that would interact with
the git objects repeatedly before a pause where programmer thinks.

You can probably get similar mmapIO patterns with sqlite, lmdb, etc.

Periodically running executables and scripts are another case. They
tend to be less latency-sensitive I suppose, but would still
unnecessarily eat into the available IO bandwidth.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-05-29 21:14   ` Johannes Weiner
  2025-05-29 21:24     ` Liam R. Howlett
@ 2025-05-30  7:52     ` Barry Song
  2025-06-04 12:00       ` Johannes Weiner
  2025-05-30 10:31     ` Vlastimil Babka
  2025-06-02 18:01     ` Matthew Wilcox
  3 siblings, 1 reply; 33+ messages in thread
From: Barry Song @ 2025-05-30  7:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Matthew Wilcox, Lorenzo Stoakes, Andrew Morton, Shakeel Butt,
	Liam R . Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Arnd Bergmann, Christian Brauner, SeongJae Park, Usama Arif,
	Mike Rapoport, linux-mm, linux-arch, linux-kernel, linux-api,
	Pedro Falcato

On Fri, May 30, 2025 at 9:14 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, May 29, 2025 at 04:28:46PM +0100, Matthew Wilcox wrote:
> > Barry's problem is that we're all nervous about possibly regressing
> > performance on some unknown workloads.  Just try Barry's proposal, see
> > if anyone actually compains or if we're just afraid of our own shadows.
>
> I actually explained why I think this is a terrible idea. But okay, I
> tried the patch anyway.
>
> This is 'git log' on a hot kernel repo after a large IO stream:
>
>                                      VANILLA                      BARRY
> Real time                 49.93 (    +0.00%)         60.36 (   +20.48%)
> User time                 32.10 (    +0.00%)         32.09 (    -0.04%)
> System time               14.41 (    +0.00%)         14.64 (    +1.50%)
> pgmajfault              9227.00 (    +0.00%)      18390.00 (   +99.30%)
> workingset_refault_file  184.00 (    +0.00%)    236899.00 (+127954.05%)
>
> Clearly we can't generally ignore page cache hits just because the
> mmaps() are intermittent.

Hi Johannes,
Thanks!

Are you on v1, which lacks folio demotion[1], or v2, which includes it [2]?

[1] https://lore.kernel.org/linux-mm/20250412085852.48524-1-21cnbao@gmail.com/
[2] https://lore.kernel.org/linux-mm/20250514070820.51793-1-21cnbao@gmail.com/

>
> The whole point is to cache across processes and their various
> apertures into a common, long-lived filesystem space.
>
> Barry knows something about the relationship between certain processes
> and certain files that he could exploit with MADV_COLD-on-exit
> semantics. But that's not something the kernel can safely assume. Not
> without defeating the page cache for an entire class of file accesses.

Best Regards
Barry


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-05-29 21:14   ` Johannes Weiner
  2025-05-29 21:24     ` Liam R. Howlett
  2025-05-30  7:52     ` Barry Song
@ 2025-05-30 10:31     ` Vlastimil Babka
  2025-06-04 12:19       ` Johannes Weiner
  2025-06-02 18:01     ` Matthew Wilcox
  3 siblings, 1 reply; 33+ messages in thread
From: Vlastimil Babka @ 2025-05-30 10:31 UTC (permalink / raw)
  To: Johannes Weiner, Matthew Wilcox
  Cc: Lorenzo Stoakes, Andrew Morton, Shakeel Butt, Liam R . Howlett,
	David Hildenbrand, Jann Horn, Arnd Bergmann, Christian Brauner,
	SeongJae Park, Usama Arif, Mike Rapoport, Barry Song, linux-mm,
	linux-arch, linux-kernel, linux-api, Pedro Falcato

On 5/29/25 23:14, Johannes Weiner wrote:
> On Thu, May 29, 2025 at 04:28:46PM +0100, Matthew Wilcox wrote:
>> Barry's problem is that we're all nervous about possibly regressing
>> performance on some unknown workloads.  Just try Barry's proposal, see
>> if anyone actually compains or if we're just afraid of our own shadows.
> 
> I actually explained why I think this is a terrible idea. But okay, I
> tried the patch anyway.
> 
> This is 'git log' on a hot kernel repo after a large IO stream:
> 
>                                      VANILLA                      BARRY
> Real time                 49.93 (    +0.00%)         60.36 (   +20.48%)
> User time                 32.10 (    +0.00%)         32.09 (    -0.04%)
> System time               14.41 (    +0.00%)         14.64 (    +1.50%)
> pgmajfault              9227.00 (    +0.00%)      18390.00 (   +99.30%)
> workingset_refault_file  184.00 (    +0.00%)    236899.00 (+127954.05%)
> 
> Clearly we can't generally ignore page cache hits just because the
> mmaps() are intermittent.
> 
> The whole point is to cache across processes and their various
> apertures into a common, long-lived filesystem space.
> 
> Barry knows something about the relationship between certain processes
> and certain files that he could exploit with MADV_COLD-on-exit
> semantics. But that's not something the kernel can safely assume. Not
> without defeating the page cache for an entire class of file accesses.

I've just read the previous threads about Barry's proposal and if doing this
always isn't feasible, I'm wondering if memcg would be a better interface to
opt-in for this kind of behavior than both prctl or mctl. I think at least
conceptually it fits what memcg is doing? The question is if the
implementation would be feasible, and if android puts apps in separate memcgs...


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-05-29 17:21 ` Usama Arif
@ 2025-05-30 13:10   ` Lorenzo Stoakes
  2025-06-10 15:03     ` Usama Arif
  0 siblings, 1 reply; 33+ messages in thread
From: Lorenzo Stoakes @ 2025-05-30 13:10 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, Shakeel Butt, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	SeongJae Park, Mike Rapoport, Johannes Weiner, Barry Song,
	linux-mm, linux-arch, linux-kernel, linux-api, Pedro Falcato

On Thu, May 29, 2025 at 06:21:55PM +0100, Usama Arif wrote:
>
>
> On 29/05/2025 15:43, Lorenzo Stoakes wrote:
> > ## INTRODUCTION
> >
> > After discussions in various threads (Usama's series adding a new prctl()
> > in [0], and a proposal to adapt process_madvise() to do the same -
> > conception in [1] and RFC in [2]), it seems fairly clear that it would make
> > sense to explore a dedicated API to explicitly allow for actions which
> > affect the virtual address space as a whole.
> >
> > Also, Barry is implementing a feature (currently under RFC) which could
> > additionally make use of this API (see [3]).
> >
> > [0]: https://lore.kernel.org/all/20250515133519.2779639-1-usamaarif642@gmail.com/
> > [1]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> > [2]: https://lore.kernel.org/all/cover.1747686021.git.lorenzo.stoakes@oracle.com/
> > [3]: https://lore.kernel.org/all/20250514070820.51793-1-21cnbao@gmail.com/
> >
> > While madvise() and process_madvise() are useful for altering the
> > attributes of VMAs within a virtual address space, it isn't the right fit
> > for something that affects the whole address space.
> >
> > Additionally, a requirement of Usama's proposal (see [0]) is that we have
> > the ability to propagate the change in behaviour across fork/exec. This
> > further suggests the need for a dedicated interface, as this really sits
> > outside the ordinary behaviour of [process_]madvise().
> >
> > prctl() is too broad and encourages mm code to migrate to kernel/sys.c
> > where it is at risk of bit-rotting. It can make it harder/impossible to
> > isolate mm logic for testing and logic there might be missed in changes
> > moving forward.
> >
> > It also, like so many kernel interfaces, has 'grown roots out of its pot'
> > so to speak - while it started off as an ostensible 'process' manipulation
> > interface, prctl() operations manipulate a myriad of task, virtual
> > address space and even specific VMA attributes.
> >
> > At this stage it really is a 'catch-all' for things we simply couldn't fit
> > elsewhere.
> >
> > Therefore, as suggested by the rather excellent Liam Howlett, I propose an
> > mm-specific interface that _explicitly_ manipulates attributes of the
> > virtual address space as a whole.
> >
> > I think something that mimics the simplicity of [process_]madvise() makes
> > sense - have a limited set of actions that can be taken, and treat them as
> > a simple action - a user requests you do XXX to the virtual address space
> > (that is, across the mm_struct), and you do it.
> >
>
>
> Hi Lorenzo,
>
> Thanks for writing the proposal, this is awesome!

Thanks :)

>
> Whatever the community agrees with, whether its this or prctl, happy to
> move forward with either as both should accomplish the usecase proposed.

Awesome!

>
> I will just add some points over here in defence of prctl, this is just for
> discussion, and if the community disagrees, completely happy to move forward
> with new syscall as well.

Please do - it's vital we have all sides of the argument!

>
> When it comes to having mm code in kernel/sys.c, we can just do something
> like below that can actually clean it up?
>
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 3a2df1bd9f64..bfadc339e2c5 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2467,6 +2467,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>
>         error = 0;
>         switch (option) {
> +       case PR_SET_MM:
> +       case PR_GET_THP_DISABLE:
> +       case PR_SET_THP_DISABLE:
> +       case PR_NEW_MM_THING:
> +               error = some_function_in_mm_folder(); // in mm/mctl.c ?
> +               break;
>         case PR_SET_PDEATHSIG:
>                 if (!valid_signal(arg2)) {
>                         error = -EINVAL;
>
> when it comes to prctl becoming a catch-all thing, with above clean up,
> we can be a lot more careful to what gets added to the mm side of prctl.

Sure, and I think we were to decide prctl() was the way to go this would be the
way it should look, or at least specifically for this case.

This addresses the bit rot issue but not how hidden the functionality is, what
prctl() generally represents de-facto (that is - a catch-all for obscure stuff),
depending on the interface of the function it might be harder/impossible to
userland-test and of course we have the beautiful vector-interface that means we
can't restrict behaviour in the API in the way we'd ideally like.

I mean overall there is no beautiful solution here.

As Matthew says, the reason we're struggling here is probably more so because
this is just showing up a flaw in how THP is these days - we really ideally need
something automagic where the kernel figures things out for us.

But in the meantime we have this frankly kinda terrible interface that's rather
off/on and yet want to establish policy _ourselves_ when the kernel should be
able to figure it out.

But with that said, we must trade that off with the reality that there is a need
for this kind of fine-grained control now, in this less than ideal world we
currently inhabit :)

>
> The advantage of this is it avoids having another syscall.
> My personal view (which can be wrong :)) is that a new syscall should be
> for something major,
> and I feel that PR_DEFAULT_MADV_HUGEPAGE and PR_DEFAULT_MADV_NOHUGEPAGE
> might be small enough to fit in prctl? but I completely understand
> your point of view as well!

See Andrew's point on the syscall thing, they're cheap ;)

I mean the idea of course is that we can put all future stuff like that here
too.

We could even port some existing prctl() stuff across, in theory. But having 2
ways of doing the same thing would probably suck.

>
> > ## INTERFACE
> >
> > The proposed interface is simply:
> >
> > int mctl(int pidfd, int action, unsigned int flags);
> >
> > Since PIDFD_SELF is now available, it is easy to invoke this for the
> > current process, while also adding the flexibility of being able to apply
> > actions to other processes also.
> >
> > The function will return 0 on success, -1 on failure, with errno set to the
> > error that arose, standard stuff.
> >
> > The behaviour will be tailored to each action taken.
> >
> > To begin with, I propose a single flag:
> >
> > - MCTL_SET_DEFAULT_EXEC - Persists this behaviour across fork/exec.
> >
> > This again will be tailored - only certain actions will be allowed to set
> > this flag, and we will of course assert appropriate capabilities, etc. upon
> > its use.
> >
>
> Sounds good to me. Just adding this here, the solution will be used in systemd
> in exec_invoke, similar to how KSM is done with prctl in [1], so for the THP
> solution, we would need MCTL_SET_DEFAULT_EXEC as it would need to be inherited
> across fork+exec.

Right yeah, this was the intent.

>
> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5046
>
> > All actions would, impact every VMA (if adjustments to VMAs are required).
> >
> > ## SECURITY
> >
> > Of course, security will be of utmost concern (Jann's input is important
> > here :)
> >
> > We can vary security requirements depending on the action taken.
> >
> > For an initial version I suggest we simply limit operations which:
> >
> > - Operate on a remote process
> > - Use the MCTL_SET_DEFAULT_EXEC flag
> >
> > To those tasks which possess the CAP_SYS_ADMIN capability.
> >
> > This may be too restrictive - be good to get some feedback on this.
> >
> > I know Jann raised concerns around privileged execution and perhaps it'd be
> > useful to see whether this would make more sense for the SET_DEFAULT_EXEC
> > case or not.
> >
> > Usama - would requiring CAP_SYS_ADMIN be egregious to your use case?
> >
>
> My knowledge is security is limited, so please bare with me, but I actually
> didn't understand the security issue and the need for CAP_SYS_ADMIN for
> doing VM_(NO)HUGEPAGE.
>
> A process can already madvise its own VMAs, and this is just doing that
> for the entire process. And VM_INIT_DEF_MASK is already set to VM_NOHUGEPAGE
> so it will be inherited by the parent. Just adding VM_HUGEPAGE shouldnt be
> a issue? Inheriting MMF_VM_HUGEPAGE will mean that khugepaged would enter
> for that process as well, which again doesnt seem like a security issue
> to me.

W.R.T. the current process, the Issue is one Jann raised, in relation to
propagation of behaviour to privileged (e.g. setuid) processes.

W.R.T. remote processes, obviously we want to make sure we are permitted to do
so.

>
> > ## IMPLEMENTATION
> >
> > I think that sensibly we'd need to add some new files here, mm/mctl.c,
> > include/linux/mctl.h (the latter of providing the MCTL_xxx actions and
> > flags).
> >
> > We could find ways to share code between mm files where appropriate to
> > avoid too much duplication.
> >
> > I suggest that the best way forward, if we were minded to examine how this
> > would look in practice, would be for me to implement an RFC that adds the
> > interface, and a simple MCTL_SET_NOHUGEPAGE, MCTL_CLEAR_NOHUGEPAGE
> > implementation as a proof of concept.
> >
> > If we wanted to then go ahead with a non-RFC version, this could then form
> > a foundation upon which Usama and Barry could implement their features,
> > with Usama then able to add MCTL_[SET/CLEAR]_HUGEPAGE and Barry
> > MCTL_[SET/CLEAR]_FADE_ON_DEATH.
> >
> > Obviously I don't mean to presume to suggest how we might proceed here -
> > only suggesting this might be a good way of moving forward and getting
> > things done as quickly as possible while allowing you guys to move forward
> > with your features.
> >
> > Let me know if this makes sense, alternatively I could try to find a
> > relatively benign action to implement as part of the base work, or we could
> > simply collaborate to do it all in one series with multiple authors?
> >
> > ## RFC
> >
> > The above is all only in effect 'putting ideas out there' so this is
> > entirely an RFC in spirit and intent - let me know if this makes sense in
> > whole or part :)
> >
> > Thanks!
> >
> > Lorenzo
>
> Again thanks for the proposal! Happy to move forward with this or prctl.
> Just adding my 2 cents in this email.

Thank you! And your input is most appreciated - it's important we get all sides
of the story here!

>
> Thanks
> Usama
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-05-29 21:14   ` Johannes Weiner
                       ` (2 preceding siblings ...)
  2025-05-30 10:31     ` Vlastimil Babka
@ 2025-06-02 18:01     ` Matthew Wilcox
  2025-06-04 13:21       ` Johannes Weiner
  3 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2025-06-02 18:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Lorenzo Stoakes, Andrew Morton, Shakeel Butt, Liam R . Howlett,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Arnd Bergmann,
	Christian Brauner, SeongJae Park, Usama Arif, Mike Rapoport,
	Barry Song, linux-mm, linux-arch, linux-kernel, linux-api,
	Pedro Falcato

On Thu, May 29, 2025 at 05:14:23PM -0400, Johannes Weiner wrote:
> On Thu, May 29, 2025 at 04:28:46PM +0100, Matthew Wilcox wrote:
> > Barry's problem is that we're all nervous about possibly regressing
> > performance on some unknown workloads.  Just try Barry's proposal, see
> > if anyone actually compains or if we're just afraid of our own shadows.
> 
> I actually explained why I think this is a terrible idea. But okay, I
> tried the patch anyway.

Sorry, I must've missed that one ;-(

> This is 'git log' on a hot kernel repo after a large IO stream:
> 
>                                      VANILLA                      BARRY
> Real time                 49.93 (    +0.00%)         60.36 (   +20.48%)
> User time                 32.10 (    +0.00%)         32.09 (    -0.04%)
> System time               14.41 (    +0.00%)         14.64 (    +1.50%)
> pgmajfault              9227.00 (    +0.00%)      18390.00 (   +99.30%)
> workingset_refault_file  184.00 (    +0.00%)    236899.00 (+127954.05%)
> 
> Clearly we can't generally ignore page cache hits just because the
> mmaps() are intermittent.
> 
> The whole point is to cache across processes and their various
> apertures into a common, long-lived filesystem space.
> 
> Barry knows something about the relationship between certain processes
> and certain files that he could exploit with MADV_COLD-on-exit
> semantics. But that's not something the kernel can safely assume. Not
> without defeating the page cache for an entire class of file accesses.

So what about distinguishing between exited-normally processes (ie git
log) vs killed-by-oom processes (ie Barry's usecase)?  Update the
referenced bit in the first case and not the second?


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-05-30  7:52     ` Barry Song
@ 2025-06-04 12:00       ` Johannes Weiner
  2025-06-04 12:05         ` David Hildenbrand
  0 siblings, 1 reply; 33+ messages in thread
From: Johannes Weiner @ 2025-06-04 12:00 UTC (permalink / raw)
  To: Barry Song
  Cc: Matthew Wilcox, Lorenzo Stoakes, Andrew Morton, Shakeel Butt,
	Liam R . Howlett, David Hildenbrand, Vlastimil Babka, Jann Horn,
	Arnd Bergmann, Christian Brauner, SeongJae Park, Usama Arif,
	Mike Rapoport, linux-mm, linux-arch, linux-kernel, linux-api,
	Pedro Falcato

On Fri, May 30, 2025 at 07:52:28PM +1200, Barry Song wrote:
> On Fri, May 30, 2025 at 9:14 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, May 29, 2025 at 04:28:46PM +0100, Matthew Wilcox wrote:
> > > Barry's problem is that we're all nervous about possibly regressing
> > > performance on some unknown workloads.  Just try Barry's proposal, see
> > > if anyone actually compains or if we're just afraid of our own shadows.
> >
> > I actually explained why I think this is a terrible idea. But okay, I
> > tried the patch anyway.
> >
> > This is 'git log' on a hot kernel repo after a large IO stream:
> >
> >                                      VANILLA                      BARRY
> > Real time                 49.93 (    +0.00%)         60.36 (   +20.48%)
> > User time                 32.10 (    +0.00%)         32.09 (    -0.04%)
> > System time               14.41 (    +0.00%)         14.64 (    +1.50%)
> > pgmajfault              9227.00 (    +0.00%)      18390.00 (   +99.30%)
> > workingset_refault_file  184.00 (    +0.00%)    236899.00 (+127954.05%)
> >
> > Clearly we can't generally ignore page cache hits just because the
> > mmaps() are intermittent.
> 
> Hi Johannes,
> Thanks!
> 
> Are you on v1, which lacks folio demotion[1], or v2, which includes it [2]?
> 
> [1] https://lore.kernel.org/linux-mm/20250412085852.48524-1-21cnbao@gmail.com/
> [2] https://lore.kernel.org/linux-mm/20250514070820.51793-1-21cnbao@gmail.com/

The subthread is about whether the reference dismissal / demotion
should be unconditional (v1) or opt-in (v2).

I'm arguing for v2.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-06-04 12:00       ` Johannes Weiner
@ 2025-06-04 12:05         ` David Hildenbrand
  0 siblings, 0 replies; 33+ messages in thread
From: David Hildenbrand @ 2025-06-04 12:05 UTC (permalink / raw)
  To: Johannes Weiner, Barry Song
  Cc: Matthew Wilcox, Lorenzo Stoakes, Andrew Morton, Shakeel Butt,
	Liam R . Howlett, Vlastimil Babka, Jann Horn, Arnd Bergmann,
	Christian Brauner, SeongJae Park, Usama Arif, Mike Rapoport,
	linux-mm, linux-arch, linux-kernel, linux-api, Pedro Falcato

On 04.06.25 14:00, Johannes Weiner wrote:
> On Fri, May 30, 2025 at 07:52:28PM +1200, Barry Song wrote:
>> On Fri, May 30, 2025 at 9:14 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>>>
>>> On Thu, May 29, 2025 at 04:28:46PM +0100, Matthew Wilcox wrote:
>>>> Barry's problem is that we're all nervous about possibly regressing
>>>> performance on some unknown workloads.  Just try Barry's proposal, see
>>>> if anyone actually compains or if we're just afraid of our own shadows.
>>>
>>> I actually explained why I think this is a terrible idea. But okay, I
>>> tried the patch anyway.
>>>
>>> This is 'git log' on a hot kernel repo after a large IO stream:
>>>
>>>                                       VANILLA                      BARRY
>>> Real time                 49.93 (    +0.00%)         60.36 (   +20.48%)
>>> User time                 32.10 (    +0.00%)         32.09 (    -0.04%)
>>> System time               14.41 (    +0.00%)         14.64 (    +1.50%)
>>> pgmajfault              9227.00 (    +0.00%)      18390.00 (   +99.30%)
>>> workingset_refault_file  184.00 (    +0.00%)    236899.00 (+127954.05%)
>>>
>>> Clearly we can't generally ignore page cache hits just because the
>>> mmaps() are intermittent.
>>
>> Hi Johannes,
>> Thanks!
>>
>> Are you on v1, which lacks folio demotion[1], or v2, which includes it [2]?
>>
>> [1] https://lore.kernel.org/linux-mm/20250412085852.48524-1-21cnbao@gmail.com/
>> [2] https://lore.kernel.org/linux-mm/20250514070820.51793-1-21cnbao@gmail.com/
> 
> The subthread is about whether the reference dismissal / demotion
> should be unconditional (v1) or opt-in (v2).
> 
> I'm arguing for v2.

+1

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-05-30 10:31     ` Vlastimil Babka
@ 2025-06-04 12:19       ` Johannes Weiner
  2025-06-05 12:31         ` Johannes Weiner
  0 siblings, 1 reply; 33+ messages in thread
From: Johannes Weiner @ 2025-06-04 12:19 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Matthew Wilcox, Lorenzo Stoakes, Andrew Morton, Shakeel Butt,
	Liam R . Howlett, David Hildenbrand, Jann Horn, Arnd Bergmann,
	Christian Brauner, SeongJae Park, Usama Arif, Mike Rapoport,
	Barry Song, linux-mm, linux-arch, linux-kernel, linux-api,
	Pedro Falcato, tj

On Fri, May 30, 2025 at 12:31:35PM +0200, Vlastimil Babka wrote:
> On 5/29/25 23:14, Johannes Weiner wrote:
> > On Thu, May 29, 2025 at 04:28:46PM +0100, Matthew Wilcox wrote:
> >> Barry's problem is that we're all nervous about possibly regressing
> >> performance on some unknown workloads.  Just try Barry's proposal, see
> >> if anyone actually compains or if we're just afraid of our own shadows.
> > 
> > I actually explained why I think this is a terrible idea. But okay, I
> > tried the patch anyway.
> > 
> > This is 'git log' on a hot kernel repo after a large IO stream:
> > 
> >                                      VANILLA                      BARRY
> > Real time                 49.93 (    +0.00%)         60.36 (   +20.48%)
> > User time                 32.10 (    +0.00%)         32.09 (    -0.04%)
> > System time               14.41 (    +0.00%)         14.64 (    +1.50%)
> > pgmajfault              9227.00 (    +0.00%)      18390.00 (   +99.30%)
> > workingset_refault_file  184.00 (    +0.00%)    236899.00 (+127954.05%)
> > 
> > Clearly we can't generally ignore page cache hits just because the
> > mmaps() are intermittent.
> > 
> > The whole point is to cache across processes and their various
> > apertures into a common, long-lived filesystem space.
> > 
> > Barry knows something about the relationship between certain processes
> > and certain files that he could exploit with MADV_COLD-on-exit
> > semantics. But that's not something the kernel can safely assume. Not
> > without defeating the page cache for an entire class of file accesses.
> 
> I've just read the previous threads about Barry's proposal and if doing this
> always isn't feasible, I'm wondering if memcg would be a better interface to
> opt-in for this kind of behavior than both prctl or mctl. I think at least
> conceptually it fits what memcg is doing? The question is if the
> implementation would be feasible, and if android puts apps in separate memcgs...

CCing Tejun.

Cgroups has been trying to resist flag settings like these. The cgroup
tree is a nested hierarchical structure designed for dividing up
system resources. But flag properties don't have natural inheritance
rules. What does it mean if the parent group says one thing and the
child says another? Which one has precedence?

Hence the proposal to make it a per-process property that propagates
through fork() and exec(). This also enables the container usecase (by
setting the flag in the container launching process), without there
being any confusion what the *effective* setting for any given process
in the system is.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-05-29 15:28 ` Matthew Wilcox
  2025-05-29 17:54   ` Shakeel Butt
  2025-05-29 21:14   ` Johannes Weiner
@ 2025-06-04 12:28   ` Lorenzo Stoakes
  2 siblings, 0 replies; 33+ messages in thread
From: Lorenzo Stoakes @ 2025-06-04 12:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Shakeel Butt, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	SeongJae Park, Usama Arif, Mike Rapoport, Johannes Weiner,
	Barry Song, linux-mm, linux-arch, linux-kernel, linux-api,
	Pedro Falcato

On Thu, May 29, 2025 at 04:28:46PM +0100, Matthew Wilcox wrote:
> On Thu, May 29, 2025 at 03:43:26PM +0100, Lorenzo Stoakes wrote:
> > After discussions in various threads (Usama's series adding a new prctl()
> > in [0], and a proposal to adapt process_madvise() to do the same -
> > conception in [1] and RFC in [2]), it seems fairly clear that it would make
> > sense to explore a dedicated API to explicitly allow for actions which
> > affect the virtual address space as a whole.
> >
> > Also, Barry is implementing a feature (currently under RFC) which could
> > additionally make use of this API (see [3]).
>
> I think the reason that you're having trouble coming up with a good
> place to put these ideas is because they are all bad ideas.  Do none of
> them.  Problem solved.
>
> People should put more effort into allocating THPs automatically and
> monitoring where they're helping performance and where they're hurting
> performance, instead of coming up with these baroque reasons to blame
> the sysadmin for not having tweaked some magic knob.
>
> Barry's problem is that we're all nervous about possibly regressing
> performance on some unknown workloads.  Just try Barry's proposal, see
> if anyone actually compains or if we're just afraid of our own shadows.
>

So from my perspective - I very much agree with the concept of doing nothing
here, ideally :)

But I feel we need to look at this problem from both the short term and the
long term - in the long run I absolutely agree with what you say.

In the short term this proposal is broadly 'what is the least worst means
of establishing policy like this?'.

I think there is broad agreement that prctl() is sub-optimal, so an
mm-specific API makes sense.

So it comes down to a question of - how badly do we need to be able to do
this _now_?


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-06-02 18:01     ` Matthew Wilcox
@ 2025-06-04 13:21       ` Johannes Weiner
  0 siblings, 0 replies; 33+ messages in thread
From: Johannes Weiner @ 2025-06-04 13:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Lorenzo Stoakes, Andrew Morton, Shakeel Butt, Liam R . Howlett,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Arnd Bergmann,
	Christian Brauner, SeongJae Park, Usama Arif, Mike Rapoport,
	Barry Song, linux-mm, linux-arch, linux-kernel, linux-api,
	Pedro Falcato

On Mon, Jun 02, 2025 at 07:01:56PM +0100, Matthew Wilcox wrote:
> On Thu, May 29, 2025 at 05:14:23PM -0400, Johannes Weiner wrote:
> > On Thu, May 29, 2025 at 04:28:46PM +0100, Matthew Wilcox wrote:
> > > Barry's problem is that we're all nervous about possibly regressing
> > > performance on some unknown workloads.  Just try Barry's proposal, see
> > > if anyone actually compains or if we're just afraid of our own shadows.
> > 
> > I actually explained why I think this is a terrible idea. But okay, I
> > tried the patch anyway.
> 
> Sorry, I must've missed that one ;-(

Apologies for my tone. The discussion is spread out over too many
threads...

> > This is 'git log' on a hot kernel repo after a large IO stream:
> > 
> >                                      VANILLA                      BARRY
> > Real time                 49.93 (    +0.00%)         60.36 (   +20.48%)
> > User time                 32.10 (    +0.00%)         32.09 (    -0.04%)
> > System time               14.41 (    +0.00%)         14.64 (    +1.50%)
> > pgmajfault              9227.00 (    +0.00%)      18390.00 (   +99.30%)
> > workingset_refault_file  184.00 (    +0.00%)    236899.00 (+127954.05%)
> > 
> > Clearly we can't generally ignore page cache hits just because the
> > mmaps() are intermittent.
> > 
> > The whole point is to cache across processes and their various
> > apertures into a common, long-lived filesystem space.
> > 
> > Barry knows something about the relationship between certain processes
> > and certain files that he could exploit with MADV_COLD-on-exit
> > semantics. But that's not something the kernel can safely assume. Not
> > without defeating the page cache for an entire class of file accesses.
> 
> So what about distinguishing between exited-normally processes (ie git
> log) vs killed-by-oom processes (ie Barry's usecase)?  Update the
> referenced bit in the first case and not the second?

In cloud environments, it's common to restart a workload immediately
after an OOM kill.

The hosts tend to handle a fairly dynamic mix of batch jobs and
semi-predictable user request load, all while also trying to target
decent average host utilization. Adapting to external load peaks is
laggy (spawning new workers, rebalancing).

In such setups, OOM conditions are generally assumed to be highly
transient. And quick restarts are important to avoid cascading
failures in the worker pool during load peaks.

So I don't think OOM is a good universal signal that the workload is
gone and the memory is cold.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-06-04 12:19       ` Johannes Weiner
@ 2025-06-05 12:31         ` Johannes Weiner
  2025-06-09 17:03           ` Tejun Heo
  0 siblings, 1 reply; 33+ messages in thread
From: Johannes Weiner @ 2025-06-05 12:31 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Matthew Wilcox, Lorenzo Stoakes, Andrew Morton, Shakeel Butt,
	Liam R . Howlett, David Hildenbrand, Jann Horn, Arnd Bergmann,
	Christian Brauner, SeongJae Park, Usama Arif, Mike Rapoport,
	Barry Song, linux-mm, linux-arch, linux-kernel, linux-api,
	Pedro Falcato, Tejun Heo

CCing Tejun - with the right mutt alias this time.

On Wed, Jun 04, 2025 at 08:19:28AM -0400, Johannes Weiner wrote:
> On Fri, May 30, 2025 at 12:31:35PM +0200, Vlastimil Babka wrote:
> > On 5/29/25 23:14, Johannes Weiner wrote:
> > > On Thu, May 29, 2025 at 04:28:46PM +0100, Matthew Wilcox wrote:
> > >> Barry's problem is that we're all nervous about possibly regressing
> > >> performance on some unknown workloads.  Just try Barry's proposal, see
> > >> if anyone actually compains or if we're just afraid of our own shadows.
> > > 
> > > I actually explained why I think this is a terrible idea. But okay, I
> > > tried the patch anyway.
> > > 
> > > This is 'git log' on a hot kernel repo after a large IO stream:
> > > 
> > >                                      VANILLA                      BARRY
> > > Real time                 49.93 (    +0.00%)         60.36 (   +20.48%)
> > > User time                 32.10 (    +0.00%)         32.09 (    -0.04%)
> > > System time               14.41 (    +0.00%)         14.64 (    +1.50%)
> > > pgmajfault              9227.00 (    +0.00%)      18390.00 (   +99.30%)
> > > workingset_refault_file  184.00 (    +0.00%)    236899.00 (+127954.05%)
> > > 
> > > Clearly we can't generally ignore page cache hits just because the
> > > mmaps() are intermittent.
> > > 
> > > The whole point is to cache across processes and their various
> > > apertures into a common, long-lived filesystem space.
> > > 
> > > Barry knows something about the relationship between certain processes
> > > and certain files that he could exploit with MADV_COLD-on-exit
> > > semantics. But that's not something the kernel can safely assume. Not
> > > without defeating the page cache for an entire class of file accesses.
> > 
> > I've just read the previous threads about Barry's proposal and if doing this
> > always isn't feasible, I'm wondering if memcg would be a better interface to
> > opt-in for this kind of behavior than both prctl or mctl. I think at least
> > conceptually it fits what memcg is doing? The question is if the
> > implementation would be feasible, and if android puts apps in separate memcgs...
> 
> CCing Tejun.
> 
> Cgroups has been trying to resist flag settings like these. The cgroup
> tree is a nested hierarchical structure designed for dividing up
> system resources. But flag properties don't have natural inheritance
> rules. What does it mean if the parent group says one thing and the
> child says another? Which one has precedence?
> 
> Hence the proposal to make it a per-process property that propagates
> through fork() and exec(). This also enables the container usecase (by
> setting the flag in the container launching process), without there
> being any confusion what the *effective* setting for any given process
> in the system is.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-06-05 12:31         ` Johannes Weiner
@ 2025-06-09 17:03           ` Tejun Heo
  0 siblings, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2025-06-09 17:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Vlastimil Babka, Matthew Wilcox, Lorenzo Stoakes, Andrew Morton,
	Shakeel Butt, Liam R . Howlett, David Hildenbrand, Jann Horn,
	Arnd Bergmann, Christian Brauner, SeongJae Park, Usama Arif,
	Mike Rapoport, Barry Song, linux-mm, linux-arch, linux-kernel,
	linux-api, Pedro Falcato

Hello,

On Thu, Jun 05, 2025 at 08:31:56AM -0400, Johannes Weiner wrote:
> On Wed, Jun 04, 2025 at 08:19:28AM -0400, Johannes Weiner wrote:
> > On Fri, May 30, 2025 at 12:31:35PM +0200, Vlastimil Babka wrote:
...
> > > I've just read the previous threads about Barry's proposal and if doing this
> > > always isn't feasible, I'm wondering if memcg would be a better interface to
> > > opt-in for this kind of behavior than both prctl or mctl. I think at least
> > > conceptually it fits what memcg is doing? The question is if the
> > > implementation would be feasible, and if android puts apps in separate memcgs...
> > 
> > CCing Tejun.
> > 
> > Cgroups has been trying to resist flag settings like these. The cgroup
> > tree is a nested hierarchical structure designed for dividing up
> > system resources. But flag properties don't have natural inheritance
> > rules. What does it mean if the parent group says one thing and the
> > child says another? Which one has precedence?
> > 
> > Hence the proposal to make it a per-process property that propagates
> > through fork() and exec(). This also enables the container usecase (by
> > setting the flag in the container launching process), without there
> > being any confusion what the *effective* setting for any given process
> > in the system is.

+1. If something can work as something which gets inherited through the
process hierarchy, that's usually the better choice than making it a cgroup
property. There isn't much to be gained by making them cgroup properties
especially given that cgroup hierarchy, in most systems at this point, is a
degenerated process hierarchy.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-05-30 13:10   ` Lorenzo Stoakes
@ 2025-06-10 15:03     ` Usama Arif
  2025-06-10 15:17       ` Lorenzo Stoakes
  0 siblings, 1 reply; 33+ messages in thread
From: Usama Arif @ 2025-06-10 15:03 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Shakeel Butt, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	SeongJae Park, Mike Rapoport, Johannes Weiner, Barry Song,
	linux-mm, linux-arch, linux-kernel, linux-api, Pedro Falcato

On 30/05/2025 14:10, Lorenzo Stoakes wrote:
> On Thu, May 29, 2025 at 06:21:55PM +0100, Usama Arif wrote:
>>
>>
>> My knowledge is security is limited, so please bare with me, but I actually
>> didn't understand the security issue and the need for CAP_SYS_ADMIN for
>> doing VM_(NO)HUGEPAGE.
>>
>> A process can already madvise its own VMAs, and this is just doing that
>> for the entire process. And VM_INIT_DEF_MASK is already set to VM_NOHUGEPAGE
>> so it will be inherited by the parent. Just adding VM_HUGEPAGE shouldnt be
>> a issue? Inheriting MMF_VM_HUGEPAGE will mean that khugepaged would enter
>> for that process as well, which again doesnt seem like a security issue
>> to me.
> 
> W.R.T. the current process, the Issue is one Jann raised, in relation to
> propagation of behaviour to privileged (e.g. setuid) processes.
> 

But what is the actual security issue of having hugepages (or not having them) when
the process is running with setuid?

I know the cgroup proposal has been shot down, but lets imagine if this was a cgroup
setting, similar to the other memory controls we have, for e.g. memory.swap.{max,high,peak}.

We can chown the cgroup so that the property is set by unprivileged process. 

Having the process swap with setuid when the unprivileged process has swap disabled
in the cgroup is not the right behaviour. What currently happens is that the process
after obtaining the higher privilege level doesn't swap as well.

Similarly for hugepages, if it was a cgroup level setting, having the process give
hugepages always with setuid when the unprivileged user had it disabled it or vice versa
would not be the right behaviour.

Another example is PR_SET_MEMORY_MERGE, setuid does not change how it works as far as
I can tell.

So madlibs I dont see what the security issue is and why we would need to elevate privileges
to do this.

> W.R.T. remote processes, obviously we want to make sure we are permitted to do
> so.
> 

I know that this needs to be future proof. But I don't actually know of a real world
usecase where we want to do any of these things for remote processes.
Whether its the existing per process changes like PR_SET_MEMORY_MERGE for KSM and
PR_SET_THP_DISABLE for THP or the newer proposals of PR_DEFAULT_MADV_(NO)HUGEPAGE
or Barrys proposal.
All of them are for the process itself (and its children by fork+exec) and not for
remote processes. As we try to make our changes usecase driven, I think we should
not add support for remote processes (which is another reason why I think this might
sit better in prctl).

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-06-10 15:03     ` Usama Arif
@ 2025-06-10 15:17       ` Lorenzo Stoakes
  2025-06-10 15:30         ` Usama Arif
  0 siblings, 1 reply; 33+ messages in thread
From: Lorenzo Stoakes @ 2025-06-10 15:17 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, Shakeel Butt, Liam R . Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	SeongJae Park, Mike Rapoport, Johannes Weiner, Barry Song,
	linux-mm, linux-arch, linux-kernel, linux-api, Pedro Falcato

On Tue, Jun 10, 2025 at 04:03:07PM +0100, Usama Arif wrote:
>
>
> On 30/05/2025 14:10, Lorenzo Stoakes wrote:
> > On Thu, May 29, 2025 at 06:21:55PM +0100, Usama Arif wrote:
> >>
> >>
> >> My knowledge is security is limited, so please bare with me, but I actually
> >> didn't understand the security issue and the need for CAP_SYS_ADMIN for
> >> doing VM_(NO)HUGEPAGE.
> >>
> >> A process can already madvise its own VMAs, and this is just doing that
> >> for the entire process. And VM_INIT_DEF_MASK is already set to VM_NOHUGEPAGE
> >> so it will be inherited by the parent. Just adding VM_HUGEPAGE shouldnt be
> >> a issue? Inheriting MMF_VM_HUGEPAGE will mean that khugepaged would enter
> >> for that process as well, which again doesnt seem like a security issue
> >> to me.
> >
> > W.R.T. the current process, the Issue is one Jann raised, in relation to
> > propagation of behaviour to privileged (e.g. setuid) processes.
> >
>
> But what is the actual security issue of having hugepages (or not having them) when
> the process is running with setuid?

Speak to Jann about this. Security isn't my area. He gave feedback on this,
which is why I raised it, if you search through previous threads you can find
it.

>
> I know the cgroup proposal has been shot down, but lets imagine if this was a cgroup
> setting, similar to the other memory controls we have, for e.g. memory.swap.{max,high,peak}.
>
> We can chown the cgroup so that the property is set by unprivileged process.
>
> Having the process swap with setuid when the unprivileged process has swap disabled
> in the cgroup is not the right behaviour. What currently happens is that the process
> after obtaining the higher privilege level doesn't swap as well.
>
> Similarly for hugepages, if it was a cgroup level setting, having the process give
> hugepages always with setuid when the unprivileged user had it disabled it or vice versa
> would not be the right behaviour.
>
> Another example is PR_SET_MEMORY_MERGE, setuid does not change how it works as far as
> I can tell.
>
> So madlibs I dont see what the security issue is and why we would need to elevate privileges
> to do this.
>
> > W.R.T. remote processes, obviously we want to make sure we are permitted to do
> > so.
> >
>
> I know that this needs to be future proof. But I don't actually know of a real world
> usecase where we want to do any of these things for remote processes.
> Whether its the existing per process changes like PR_SET_MEMORY_MERGE for KSM and
> PR_SET_THP_DISABLE for THP or the newer proposals of PR_DEFAULT_MADV_(NO)HUGEPAGE
> or Barrys proposal.
> All of them are for the process itself (and its children by fork+exec) and not for
> remote processes. As we try to make our changes usecase driven, I think we should
> not add support for remote processes (which is another reason why I think this might
> sit better in prctl).

I'm extremely confused as to why you think this propoal is predicated upon
remote process manipulation? It was simply suggested as a possibility for
increased flexibility.

We can just remove this parameter no?

It is entirely orthogonal to the prctl() stuff.

Overall at this point I share Matthew's point of view on this - we shouldn't be
doing any of this upstream.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-06-10 15:17       ` Lorenzo Stoakes
@ 2025-06-10 15:30         ` Usama Arif
  2025-06-10 15:46           ` Matthew Wilcox
                             ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Usama Arif @ 2025-06-10 15:30 UTC (permalink / raw)
  To: Lorenzo Stoakes, David Hildenbrand
  Cc: Andrew Morton, Shakeel Butt, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Arnd Bergmann, Christian Brauner, SeongJae Park,
	Mike Rapoport, Johannes Weiner, Barry Song, linux-mm, linux-arch,
	linux-kernel, linux-api, Pedro Falcato, Matthew Wilcox

On 10/06/2025 16:17, Lorenzo Stoakes wrote:
> On Tue, Jun 10, 2025 at 04:03:07PM +0100, Usama Arif wrote:
>>
>>
>> On 30/05/2025 14:10, Lorenzo Stoakes wrote:
>>> On Thu, May 29, 2025 at 06:21:55PM +0100, Usama Arif wrote:
>>>>
>>>>
>>>> My knowledge is security is limited, so please bare with me, but I actually
>>>> didn't understand the security issue and the need for CAP_SYS_ADMIN for
>>>> doing VM_(NO)HUGEPAGE.
>>>>
>>>> A process can already madvise its own VMAs, and this is just doing that
>>>> for the entire process. And VM_INIT_DEF_MASK is already set to VM_NOHUGEPAGE
>>>> so it will be inherited by the parent. Just adding VM_HUGEPAGE shouldnt be
>>>> a issue? Inheriting MMF_VM_HUGEPAGE will mean that khugepaged would enter
>>>> for that process as well, which again doesnt seem like a security issue
>>>> to me.
>>>
>>> W.R.T. the current process, the Issue is one Jann raised, in relation to
>>> propagation of behaviour to privileged (e.g. setuid) processes.
>>>
>>
>> But what is the actual security issue of having hugepages (or not having them) when
>> the process is running with setuid?
> 
> Speak to Jann about this. Security isn't my area. He gave feedback on this,
> which is why I raised it, if you search through previous threads you can find
> it.
> 

Yes, he is in CC here as well. I have read it in the previous thread. Just raising it
here as it was mentioned here :)

>>
>> I know the cgroup proposal has been shot down, but lets imagine if this was a cgroup
>> setting, similar to the other memory controls we have, for e.g. memory.swap.{max,high,peak}.
>>
>> We can chown the cgroup so that the property is set by unprivileged process.
>>
>> Having the process swap with setuid when the unprivileged process has swap disabled
>> in the cgroup is not the right behaviour. What currently happens is that the process
>> after obtaining the higher privilege level doesn't swap as well.
>>
>> Similarly for hugepages, if it was a cgroup level setting, having the process give
>> hugepages always with setuid when the unprivileged user had it disabled it or vice versa
>> would not be the right behaviour.
>>
>> Another example is PR_SET_MEMORY_MERGE, setuid does not change how it works as far as
>> I can tell.
>>
>> So madlibs I dont see what the security issue is and why we would need to elevate privileges
>> to do this.
>>
>>> W.R.T. remote processes, obviously we want to make sure we are permitted to do
>>> so.
>>>
>>
>> I know that this needs to be future proof. But I don't actually know of a real world
>> usecase where we want to do any of these things for remote processes.
>> Whether its the existing per process changes like PR_SET_MEMORY_MERGE for KSM and
>> PR_SET_THP_DISABLE for THP or the newer proposals of PR_DEFAULT_MADV_(NO)HUGEPAGE
>> or Barrys proposal.
>> All of them are for the process itself (and its children by fork+exec) and not for
>> remote processes. As we try to make our changes usecase driven, I think we should
>> not add support for remote processes (which is another reason why I think this might
>> sit better in prctl).
> 
> I'm extremely confused as to why you think this propoal is predicated upon
> remote process manipulation? It was simply suggested as a possibility for
> increased flexibility.
> 
> We can just remove this parameter no?
> 

Sure.

> It is entirely orthogonal to the prctl() stuff.
> 
> Overall at this point I share Matthew's point of view on this - we shouldn't be
> doing any of this upstream.

As I replied to Matthew in [1], it would be amazing if it was not needed, but thats not
how it works in the medium term and I dont think it will work even in the long term.
I will paste my answer from [1] below as well:

If we have 2 workloads on the same server, For e.g. one is database where THPs 
just dont do well, but the other one is AI where THPs do really well. How
will the kernel monitor that the database workload is performing worse
and the AI one isnt?

I added THP shrinker to hopefully try and do this automatically, and it does
really help. But unfortunately it is not a complete solution.
There are severely memory bound workloads where even a tiny increase
in memory will lead to an OOM. And if you colocate the container thats running
that workload with one in which we will benefit with THPs, we unfortunately
can't just rely on the system doing the right thing.

It would be awesome if THPs are truly transparent and don't require
any input, but unfortunately I don't think that there is a solution
for this with just kernel monitoring.

This is just a big hint from the user. If the global system policy is madvise
and the workload owner has done their own benchmarks and see benefits
with always, they set DEFAULT_MADV_HUGEPAGE for the process to optin as "always".
If the global system policy is always and the workload owner has done their own 
benchmarks and see worse results with always, they set DEFAULT_MADV_NOHUGEPAGE for 
the process to optin as "madvise". 

[1] https://lore.kernel.org/all/162c14e6-0b16-4698-bd76-735037ea0d73@gmail.com/

I havent seen activity on this thread over the past week, but I was hoping
we can reach a consensus on which approach to use, prctl or mctl.
If its mctl and if you don't think this should be done, please let me know
if you would like me to work on this instead. This is a valid big realworld
usecase that is a real blocker for deploying THPs in workloads in servers.

Thanks!
Usama

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-06-10 15:30         ` Usama Arif
@ 2025-06-10 15:46           ` Matthew Wilcox
  2025-06-10 16:00             ` Usama Arif
  2025-06-10 16:02           ` Lorenzo Stoakes
  2025-07-02 14:15           ` Usama Arif
  2 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2025-06-10 15:46 UTC (permalink / raw)
  To: Usama Arif
  Cc: Lorenzo Stoakes, David Hildenbrand, Andrew Morton, Shakeel Butt,
	Liam R . Howlett, Vlastimil Babka, Jann Horn, Arnd Bergmann,
	Christian Brauner, SeongJae Park, Mike Rapoport, Johannes Weiner,
	Barry Song, linux-mm, linux-arch, linux-kernel, linux-api,
	Pedro Falcato

On Tue, Jun 10, 2025 at 04:30:43PM +0100, Usama Arif wrote:
> If we have 2 workloads on the same server, For e.g. one is database where THPs 
> just dont do well, but the other one is AI where THPs do really well. How
> will the kernel monitor that the database workload is performing worse
> and the AI one isnt?

It can monitor the allocation/access patterns and see who's getting
the benefit.  The two workloads are in competition for memory, and
we can tell which pages are hot and which cold.

And I don't believe it's a binary anyway.  I bet there are some
allocations where the database benefits from having THPs (I mean, I know
a database which invented the entire hugetlbfs subsystem so it could
use PMD entries and avoid one layer of TLB misses!)

> I added THP shrinker to hopefully try and do this automatically, and it does
> really help. But unfortunately it is not a complete solution.
> There are severely memory bound workloads where even a tiny increase
> in memory will lead to an OOM. And if you colocate the container thats running
> that workload with one in which we will benefit with THPs, we unfortunately
> can't just rely on the system doing the right thing.

Then maybe THP aren't for you.  If your workloads are this sensitive,
perhaps you should be using a mechanism which gives you complete control
like hugetlbfs.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-06-10 15:46           ` Matthew Wilcox
@ 2025-06-10 16:00             ` Usama Arif
  2025-06-10 16:26               ` Matthew Wilcox
  0 siblings, 1 reply; 33+ messages in thread
From: Usama Arif @ 2025-06-10 16:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Lorenzo Stoakes, David Hildenbrand, Andrew Morton, Shakeel Butt,
	Liam R . Howlett, Vlastimil Babka, Jann Horn, Arnd Bergmann,
	Christian Brauner, SeongJae Park, Mike Rapoport, Johannes Weiner,
	Barry Song, linux-mm, linux-arch, linux-kernel, linux-api,
	Pedro Falcato



On 10/06/2025 16:46, Matthew Wilcox wrote:
> On Tue, Jun 10, 2025 at 04:30:43PM +0100, Usama Arif wrote:
>> If we have 2 workloads on the same server, For e.g. one is database where THPs 
>> just dont do well, but the other one is AI where THPs do really well. How
>> will the kernel monitor that the database workload is performing worse
>> and the AI one isnt?
> 
> It can monitor the allocation/access patterns and see who's getting
> the benefit.  The two workloads are in competition for memory, and
> we can tell which pages are hot and which cold.
> 
> And I don't believe it's a binary anyway.  I bet there are some
> allocations where the database benefits from having THPs (I mean, I know
> a database which invented the entire hugetlbfs subsystem so it could
> use PMD entries and avoid one layer of TLB misses!)
> 

Sure, but this is just an example. Workload owners are not going to spend time
trying to see how each allocation works and if its hot, they put it in hugetlbfs.
Ofcourse hugetlbfs has its own drawbacks of reserving pages.
This is one of the reasons that we have THPs.

But they will try THPs. i.e. if they see performance benefits from just turning
a knob, they will take it otherwise leave it.

>> I added THP shrinker to hopefully try and do this automatically, and it does
>> really help. But unfortunately it is not a complete solution.
>> There are severely memory bound workloads where even a tiny increase
>> in memory will lead to an OOM. And if you colocate the container thats running
>> that workload with one in which we will benefit with THPs, we unfortunately
>> can't just rely on the system doing the right thing.
> 
> Then maybe THP aren't for you.  If your workloads are this sensitive,
> perhaps you should be using a mechanism which gives you complete control
> like hugetlbfs.

Yes, completely agree, THPs aren't for the workloads that are this sensitive.
But that's why we need this, to disable it for them if the global policy is always,
or enable it on other services that are not sensitive and benefit from THPs
if the global policy is madvise. We have to keep in mind that these workloads
will be colocated on the same server. 

and hugetlbfs isnt transparent enough.. :)



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-06-10 15:30         ` Usama Arif
  2025-06-10 15:46           ` Matthew Wilcox
@ 2025-06-10 16:02           ` Lorenzo Stoakes
  2025-07-02 14:15           ` Usama Arif
  2 siblings, 0 replies; 33+ messages in thread
From: Lorenzo Stoakes @ 2025-06-10 16:02 UTC (permalink / raw)
  To: Usama Arif
  Cc: David Hildenbrand, Andrew Morton, Shakeel Butt, Liam R . Howlett,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	SeongJae Park, Mike Rapoport, Johannes Weiner, Barry Song,
	linux-mm, linux-arch, linux-kernel, linux-api, Pedro Falcato,
	Matthew Wilcox

On Tue, Jun 10, 2025 at 04:30:43PM +0100, Usama Arif wrote:
>
>
> On 10/06/2025 16:17, Lorenzo Stoakes wrote:
> > On Tue, Jun 10, 2025 at 04:03:07PM +0100, Usama Arif wrote:
> >>
> >>
> >> On 30/05/2025 14:10, Lorenzo Stoakes wrote:
> >>> On Thu, May 29, 2025 at 06:21:55PM +0100, Usama Arif wrote:
> >>>>
> >>>>
> >>>> My knowledge is security is limited, so please bare with me, but I actually
> >>>> didn't understand the security issue and the need for CAP_SYS_ADMIN for
> >>>> doing VM_(NO)HUGEPAGE.
> >>>>
> >>>> A process can already madvise its own VMAs, and this is just doing that
> >>>> for the entire process. And VM_INIT_DEF_MASK is already set to VM_NOHUGEPAGE
> >>>> so it will be inherited by the parent. Just adding VM_HUGEPAGE shouldnt be
> >>>> a issue? Inheriting MMF_VM_HUGEPAGE will mean that khugepaged would enter
> >>>> for that process as well, which again doesnt seem like a security issue
> >>>> to me.
> >>>
> >>> W.R.T. the current process, the Issue is one Jann raised, in relation to
> >>> propagation of behaviour to privileged (e.g. setuid) processes.
> >>>
> >>
> >> But what is the actual security issue of having hugepages (or not having them) when
> >> the process is running with setuid?
> >
> > Speak to Jann about this. Security isn't my area. He gave feedback on this,
> > which is why I raised it, if you search through previous threads you can find
> > it.
> >
>
> Yes, he is in CC here as well. I have read it in the previous thread. Just raising it
> here as it was mentioned here :)
>
> >>
> >> I know the cgroup proposal has been shot down, but lets imagine if this was a cgroup
> >> setting, similar to the other memory controls we have, for e.g. memory.swap.{max,high,peak}.
> >>
> >> We can chown the cgroup so that the property is set by unprivileged process.
> >>
> >> Having the process swap with setuid when the unprivileged process has swap disabled
> >> in the cgroup is not the right behaviour. What currently happens is that the process
> >> after obtaining the higher privilege level doesn't swap as well.
> >>
> >> Similarly for hugepages, if it was a cgroup level setting, having the process give
> >> hugepages always with setuid when the unprivileged user had it disabled it or vice versa
> >> would not be the right behaviour.
> >>
> >> Another example is PR_SET_MEMORY_MERGE, setuid does not change how it works as far as
> >> I can tell.
> >>
> >> So madlibs I dont see what the security issue is and why we would need to elevate privileges
> >> to do this.
> >>
> >>> W.R.T. remote processes, obviously we want to make sure we are permitted to do
> >>> so.
> >>>
> >>
> >> I know that this needs to be future proof. But I don't actually know of a real world
> >> usecase where we want to do any of these things for remote processes.
> >> Whether its the existing per process changes like PR_SET_MEMORY_MERGE for KSM and
> >> PR_SET_THP_DISABLE for THP or the newer proposals of PR_DEFAULT_MADV_(NO)HUGEPAGE
> >> or Barrys proposal.
> >> All of them are for the process itself (and its children by fork+exec) and not for
> >> remote processes. As we try to make our changes usecase driven, I think we should
> >> not add support for remote processes (which is another reason why I think this might
> >> sit better in prctl).
> >
> > I'm extremely confused as to why you think this propoal is predicated upon
> > remote process manipulation? It was simply suggested as a possibility for
> > increased flexibility.
> >
> > We can just remove this parameter no?
> >
>
> Sure.
>
> > It is entirely orthogonal to the prctl() stuff.
> >
> > Overall at this point I share Matthew's point of view on this - we shouldn't be
> > doing any of this upstream.
>
> As I replied to Matthew in [1], it would be amazing if it was not needed, but thats not
> how it works in the medium term and I dont think it will work even in the long term.
> I will paste my answer from [1] below as well:
>
> If we have 2 workloads on the same server, For e.g. one is database where THPs
> just dont do well, but the other one is AI where THPs do really well. How
> will the kernel monitor that the database workload is performing worse
> and the AI one isnt?
>
> I added THP shrinker to hopefully try and do this automatically, and it does
> really help. But unfortunately it is not a complete solution.
> There are severely memory bound workloads where even a tiny increase
> in memory will lead to an OOM. And if you colocate the container thats running
> that workload with one in which we will benefit with THPs, we unfortunately
> can't just rely on the system doing the right thing.
>
> It would be awesome if THPs are truly transparent and don't require
> any input, but unfortunately I don't think that there is a solution
> for this with just kernel monitoring.
>
> This is just a big hint from the user. If the global system policy is madvise
> and the workload owner has done their own benchmarks and see benefits
> with always, they set DEFAULT_MADV_HUGEPAGE for the process to optin as "always".
> If the global system policy is always and the workload owner has done their own
> benchmarks and see worse results with always, they set DEFAULT_MADV_NOHUGEPAGE for
> the process to optin as "madvise".
>
> [1] https://lore.kernel.org/all/162c14e6-0b16-4698-bd76-735037ea0d73@gmail.com/
>
>

Yup I appreciate these points, and we have discussed them I feel quite a
bit :) I echo them.

Nobody says that the interface isn't sucky and THPs are not as transparent
as they should be, nor that we lack decent non-cgroup 'policy'
manipulation.

BUT.

We're talking about adding a permanent hack into the kernel that
force-sets a VMA flag for all VMAs across fork/exec.

I have simply been trying to flesh out the _least worst_ means of
doing this - _if we have to do it_.

That last bit being operative - I have come to think, based on Matthew's
feedback, that the RoI of permanently adding this hack is not a good one.

I think the case remains to be made for that.

> I havent seen activity on this thread over the past week, but I was hoping
> we can reach a consensus on which approach to use, prctl or mctl.
> If its mctl and if you don't think this should be done, please let me know
> if you would like me to work on this instead. This is a valid big realworld
> usecase that is a real blocker for deploying THPs in workloads in servers.

Please exercise patience, upstream moves at its own pace.

>
> Thanks!
> Usama


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-06-10 16:00             ` Usama Arif
@ 2025-06-10 16:26               ` Matthew Wilcox
  2025-06-10 17:02                 ` Usama Arif
  0 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2025-06-10 16:26 UTC (permalink / raw)
  To: Usama Arif
  Cc: Lorenzo Stoakes, David Hildenbrand, Andrew Morton, Shakeel Butt,
	Liam R . Howlett, Vlastimil Babka, Jann Horn, Arnd Bergmann,
	Christian Brauner, SeongJae Park, Mike Rapoport, Johannes Weiner,
	Barry Song, linux-mm, linux-arch, linux-kernel, linux-api,
	Pedro Falcato

On Tue, Jun 10, 2025 at 05:00:47PM +0100, Usama Arif wrote:
> On 10/06/2025 16:46, Matthew Wilcox wrote:
> > On Tue, Jun 10, 2025 at 04:30:43PM +0100, Usama Arif wrote:
> >> If we have 2 workloads on the same server, For e.g. one is database where THPs 
> >> just dont do well, but the other one is AI where THPs do really well. How
> >> will the kernel monitor that the database workload is performing worse
> >> and the AI one isnt?
> > 
> > It can monitor the allocation/access patterns and see who's getting
> > the benefit.  The two workloads are in competition for memory, and
> > we can tell which pages are hot and which cold.
> > 
> > And I don't believe it's a binary anyway.  I bet there are some
> > allocations where the database benefits from having THPs (I mean, I know
> > a database which invented the entire hugetlbfs subsystem so it could
> > use PMD entries and avoid one layer of TLB misses!)
> > 
> 
> Sure, but this is just an example. Workload owners are not going to spend time
> trying to see how each allocation works and if its hot, they put it in hugetlbfs.

No, they're not.  It should be automatic.  There are many deficiencies
in the kernel; this is one of them.

> Ofcourse hugetlbfs has its own drawbacks of reserving pages.

Drawback or advantage?  It's a feature.  You're being very strange about
this.  First you want to reserve THPs for some workloads only, then when
given a way to do that you complain that ... you have to reserve hugetlb
pages.  You can't possibly mean both of these things sincerely.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-06-10 16:26               ` Matthew Wilcox
@ 2025-06-10 17:02                 ` Usama Arif
  0 siblings, 0 replies; 33+ messages in thread
From: Usama Arif @ 2025-06-10 17:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Lorenzo Stoakes, David Hildenbrand, Andrew Morton, Shakeel Butt,
	Liam R . Howlett, Vlastimil Babka, Jann Horn, Arnd Bergmann,
	Christian Brauner, SeongJae Park, Mike Rapoport, Johannes Weiner,
	Barry Song, linux-mm, linux-arch, linux-kernel, linux-api,
	Pedro Falcato

On 10/06/2025 17:26, Matthew Wilcox wrote:
> On Tue, Jun 10, 2025 at 05:00:47PM +0100, Usama Arif wrote:
>> On 10/06/2025 16:46, Matthew Wilcox wrote:
>>> On Tue, Jun 10, 2025 at 04:30:43PM +0100, Usama Arif wrote:
>>>> If we have 2 workloads on the same server, For e.g. one is database where THPs 
>>>> just dont do well, but the other one is AI where THPs do really well. How
>>>> will the kernel monitor that the database workload is performing worse
>>>> and the AI one isnt?
>>>
>>> It can monitor the allocation/access patterns and see who's getting
>>> the benefit.  The two workloads are in competition for memory, and
>>> we can tell which pages are hot and which cold.
>>>
>>> And I don't believe it's a binary anyway.  I bet there are some
>>> allocations where the database benefits from having THPs (I mean, I know
>>> a database which invented the entire hugetlbfs subsystem so it could
>>> use PMD entries and avoid one layer of TLB misses!)
>>>
>>
>> Sure, but this is just an example. Workload owners are not going to spend time
>> trying to see how each allocation works and if its hot, they put it in hugetlbfs.
> 
> No, they're not.  It should be automatic.  There are many deficiencies
> in the kernel; this is one of them.
> 
>> Ofcourse hugetlbfs has its own drawbacks of reserving pages.
> 
> Drawback or advantage?  It's a feature.  You're being very strange about
> this.  First you want to reserve THPs for some workloads only, then when
> given a way to do that you complain that ... you have to reserve hugetlb
> pages.  You can't possibly mean both of these things sincerely.
> 

Let me try and explain my view better:

hugetlb requires 2 things, reserving hugepages and passing MAP_HUGETLB at mmap time i.e.
not "transparent". (I know the meaning of transparent even in THP is a bit messed up :))

There are some workload owners that will happily test (and have the resources to do
so) to see what is the best point to use hugetlb. They can go in their code and change
mmap and make the necessary changes to disrupt workload orchestration so that hugetlb
is reserved. This is a small minority.

An extremely large majority of workload owners will not be willing to do this (and don't
have the resources to do so as well).
For them, we have THPs to do it "transparently". If you just give a knob to switch
THP=always on/off for *just their workload* without affecting others on the same server,
they will be happy to try it and other workloads that are running on the same server
in controlled cgroups wont care and won't be affected. i.e.:

- if the machine policy (/sys/kernel/mm/transparent_hugepage/enabled) is madvise, workloads can
  opt-in getting THPs by just having this call (the PR_DEFAULT_MADV_HUGEPAGE version) in systemd.

- if the machine policy is always, and they dont benefit, they can opt-out of getting THPs
  by having this call (the PR_DEFAULT_MADV_NOHUGEPAGE) version in systemd *without* disrupting
  the other workloads that are running on the same server that do.

Doing above is very simple. This is how KSM is done as well. It doesnt require doing any changes
to mmap, i.e. is "transparent" (after the prctl/mctl call :)) and doesn't require reserving anything
for hugetlb before the application starts.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-06-10 15:30         ` Usama Arif
  2025-06-10 15:46           ` Matthew Wilcox
  2025-06-10 16:02           ` Lorenzo Stoakes
@ 2025-07-02 14:15           ` Usama Arif
  2025-07-02 17:38             ` SeongJae Park
  2 siblings, 1 reply; 33+ messages in thread
From: Usama Arif @ 2025-07-02 14:15 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, Andrew Morton
  Cc: Shakeel Butt, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Arnd Bergmann, Christian Brauner, SeongJae Park, Mike Rapoport,
	Johannes Weiner, Barry Song, linux-arch, linux-kernel, linux-api,
	Pedro Falcato, Matthew Wilcox, Lorenzo Stoakes

> As I replied to Matthew in [1], it would be amazing if it was not needed, but thats not
> how it works in the medium term and I dont think it will work even in the long term.
> I will paste my answer from [1] below as well:
> 
> If we have 2 workloads on the same server, For e.g. one is database where THPs 
> just dont do well, but the other one is AI where THPs do really well. How
> will the kernel monitor that the database workload is performing worse
> and the AI one isnt?
> 
> I added THP shrinker to hopefully try and do this automatically, and it does
> really help. But unfortunately it is not a complete solution.
> There are severely memory bound workloads where even a tiny increase
> in memory will lead to an OOM. And if you colocate the container thats running
> that workload with one in which we will benefit with THPs, we unfortunately
> can't just rely on the system doing the right thing.
> 
> It would be awesome if THPs are truly transparent and don't require
> any input, but unfortunately I don't think that there is a solution
> for this with just kernel monitoring.
> 
> This is just a big hint from the user. If the global system policy is madvise
> and the workload owner has done their own benchmarks and see benefits
> with always, they set DEFAULT_MADV_HUGEPAGE for the process to optin as "always".
> If the global system policy is always and the workload owner has done their own 
> benchmarks and see worse results with always, they set DEFAULT_MADV_NOHUGEPAGE for 
> the process to optin as "madvise". 
> 
> [1] https://lore.kernel.org/all/162c14e6-0b16-4698-bd76-735037ea0d73@gmail.com/
> 
> 
> I havent seen activity on this thread over the past week, but I was hoping
> we can reach a consensus on which approach to use, prctl or mctl.
> If its mctl and if you don't think this should be done, please let me know
> if you would like me to work on this instead. This is a valid big realworld
> usecase that is a real blocker for deploying THPs in workloads in servers.
> 

Hi!

Just wanted to check if anyone has any thoughts on this?

I think we are all in agreement for the long term eventual goal, have THP just work
and be default enabled. From our perspective, we (meta) have spent a significant
amount of time and effort over the last 18 months trying to make changes/optimizations
where we could actually have it so and can transparently and reliably get hugepages.
This includes the THP shrinker [1], changes to allocator and reclaim/compaction code
to reduce fragmentation [2] and reducing type mixing [3].
We want to continue spending more time and resources in trying to achieve this.

But in the current state, not being able to selectively enable THPs always for certain
workloads is a significant blocker in trying to roll it out at hyperscaler levels, and
from the attempts made by others, I do believe its a problem others are facing as well.
Our long term aim is to have transparent_hugepage/enabled set to always across the fleet.
But for that we need to have the ability to enable it selectively for workloads that
show benefit, try and solve problems that come up in production when it is enabled,
and see why for those that regress. This can not be done with just transparent_hugepage/enabled
for hyperscalers which run vastly different types of containerized workloads on the same
machine.

There have been multiple efforts from different people on trying to address similar
problems (including via cgroup[4] and bpf[5]). IMHO, its quite clear that unfortunately
just having a system wide setting for THP is not enough at the moment or in the near future,
especially when running workloads that have completely different characteristic on the same server.

In terms of the approach of doing this, IMHO, I dont think the way to do this
is controversial. After the great feedback from Lorenzo on the prctl series, the
approach would be for userpsace to make a call that just does for_each_vma of the process,
madvises the VMAs, and changes the mm->def_flags for the process. We are not making changes
to the pagefaulting path (thp_vma_allowable_orders has no code change which is awesome).
In terms of what the call is going to be, there has been a lot of debate (and my preference
of prctl is clear), I am ok with either with prctl or mctl, as it should not change
the actual implementation. If there is consensus, I would love to send a RFC for how the
prctl or mctl solution would look like.

[1] https://lore.kernel.org/all/20240830100438.3623486-1-usamaarif642@gmail.com/
[2] https://lore.kernel.org/all/20250313210647.1314586-1-hannes@cmpxchg.org/
[3] https://lore.kernel.org/all/20240320180429.678181-1-hannes@cmpxchg.org/
[4] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
[5] https://lore.kernel.org/all/20250520060504.20251-1-laoar.shao@gmail.com/

Thanks,
Usama

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-07-02 14:15           ` Usama Arif
@ 2025-07-02 17:38             ` SeongJae Park
  2025-07-04 10:34               ` David Hildenbrand
  0 siblings, 1 reply; 33+ messages in thread
From: SeongJae Park @ 2025-07-02 17:38 UTC (permalink / raw)
  To: Usama Arif
  Cc: SeongJae Park, David Hildenbrand, linux-mm, Andrew Morton,
	Shakeel Butt, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Arnd Bergmann, Christian Brauner, Mike Rapoport, Johannes Weiner,
	Barry Song, linux-arch, linux-kernel, linux-api, Pedro Falcato,
	Matthew Wilcox, Lorenzo Stoakes

On Wed, 2 Jul 2025 15:15:01 +0100 Usama Arif <usamaarif642@gmail.com> wrote:

[...]
> In terms of the approach of doing this, IMHO, I dont think the way to do this
> is controversial. After the great feedback from Lorenzo on the prctl series, the
> approach would be for userpsace to make a call that just does for_each_vma of the process,
> madvises the VMAs,

One dirty hack that I can think off the top of my head for doing this without
new kernel changes is, unsurprisingly, using DAMOS.  Using DAMOS, users can do
madvise(MADV_HUGEPAGE) to virtual address ranges of specific access patterns.
It is aimed to be used for hot regions, while using similar one of
MADV_NOHUGEPAGE for cold regions.  An experiment with a prototype[1] showed it
eliminates about 80% of internal fragmentation caused memory overhead while
keeping 46% of performance improvement under a constrained situation.

If you set the access pattern as any pattern, hence, you can do
madvise(MADV_HUGEPAGE) for effectively entire virtual address space of the
process.  DAMON user-space tool supports periodically tracking childs and
applying same DAMOS scheme to those.  So, for example, below hack could be
tried.

    # damo start $(pidof XXX) --damos_action hugepage --include_child_tasks

I'm working with Usama at Meta but not very closely involved in THP works, so
I'm not sure if this works for Usama's case and others.  I even not tried this
at all on any test environment.  So I'm not recommending this but just sharing
a thought for more brainsorming, and that's why I call this a dirty hack.

[1] https://assets.amazon.science/b7/2b/ce53222247739b174f2b54498d1a/daos-data-access-aware-operating-system.pdf

Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [DISCUSSION] proposed mctl() API
  2025-07-02 17:38             ` SeongJae Park
@ 2025-07-04 10:34               ` David Hildenbrand
  0 siblings, 0 replies; 33+ messages in thread
From: David Hildenbrand @ 2025-07-04 10:34 UTC (permalink / raw)
  To: SeongJae Park, Usama Arif
  Cc: linux-mm, Andrew Morton, Shakeel Butt, Liam R . Howlett,
	Vlastimil Babka, Jann Horn, Arnd Bergmann, Christian Brauner,
	Mike Rapoport, Johannes Weiner, Barry Song, linux-arch,
	linux-kernel, linux-api, Pedro Falcato, Matthew Wilcox,
	Lorenzo Stoakes

On 02.07.25 19:38, SeongJae Park wrote:
> On Wed, 2 Jul 2025 15:15:01 +0100 Usama Arif <usamaarif642@gmail.com> wrote:
> 
> [...]
>> In terms of the approach of doing this, IMHO, I dont think the way to do this
>> is controversial. After the great feedback from Lorenzo on the prctl series, the
>> approach would be for userpsace to make a call that just does for_each_vma of the process,
>> madvises the VMAs,
> 
> One dirty hack that I can think off the top of my head for doing this without
> new kernel changes is, unsurprisingly, using DAMOS.  Using DAMOS, users can do
> madvise(MADV_HUGEPAGE) to virtual address ranges of specific access patterns.
> It is aimed to be used for hot regions, while using similar one of
> MADV_NOHUGEPAGE for cold regions.  An experiment with a prototype[1] showed it
> eliminates about 80% of internal fragmentation caused memory overhead while
> keeping 46% of performance improvement under a constrained situation.
> 
> If you set the access pattern as any pattern, hence, you can do
> madvise(MADV_HUGEPAGE) for effectively entire virtual address space of the
> process.  DAMON user-space tool supports periodically tracking childs and
> applying same DAMOS scheme to those.  So, for example, below hack could be
> tried.
> 
>      # damo start $(pidof XXX) --damos_action hugepage --include_child_tasks

IIRC, setting MADV_HUGEPAGE on arbitrary VMAs from arbitrary processes 
has the potential of breaking applications.

Just imagine them deliberately setting MADV_NOHUGEPAGE and are intending 
of using userfaultfd, where it is crucial that we don't over-allocate 
memory even before userfaultdf is actually registered.

(QEMU does that)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2025-07-04 10:34 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-29 14:43 [DISCUSSION] proposed mctl() API Lorenzo Stoakes
2025-05-29 15:28 ` Matthew Wilcox
2025-05-29 17:54   ` Shakeel Butt
2025-05-29 18:13     ` Matthew Wilcox
2025-05-29 18:32       ` Usama Arif
2025-05-29 21:14   ` Johannes Weiner
2025-05-29 21:24     ` Liam R. Howlett
2025-05-29 23:14       ` Johannes Weiner
2025-05-30  7:52     ` Barry Song
2025-06-04 12:00       ` Johannes Weiner
2025-06-04 12:05         ` David Hildenbrand
2025-05-30 10:31     ` Vlastimil Babka
2025-06-04 12:19       ` Johannes Weiner
2025-06-05 12:31         ` Johannes Weiner
2025-06-09 17:03           ` Tejun Heo
2025-06-02 18:01     ` Matthew Wilcox
2025-06-04 13:21       ` Johannes Weiner
2025-06-04 12:28   ` Lorenzo Stoakes
2025-05-29 17:21 ` Usama Arif
2025-05-30 13:10   ` Lorenzo Stoakes
2025-06-10 15:03     ` Usama Arif
2025-06-10 15:17       ` Lorenzo Stoakes
2025-06-10 15:30         ` Usama Arif
2025-06-10 15:46           ` Matthew Wilcox
2025-06-10 16:00             ` Usama Arif
2025-06-10 16:26               ` Matthew Wilcox
2025-06-10 17:02                 ` Usama Arif
2025-06-10 16:02           ` Lorenzo Stoakes
2025-07-02 14:15           ` Usama Arif
2025-07-02 17:38             ` SeongJae Park
2025-07-04 10:34               ` David Hildenbrand
2025-05-29 18:50 ` Andy Lutomirski
2025-05-29 21:31 ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).