public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] prctl: require checkpoint_restore_ns_capable for PR_SET_MM_MAP
@ 2026-04-02 11:13 Qi Tang
  2026-04-02 12:57 ` Oleg Nesterov
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Qi Tang @ 2026-04-02 11:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cyrill Gorcunov, David Hildenbrand, Lorenzo Stoakes,
	Oleg Nesterov, linux-kernel, stable, Qi Tang

prctl_set_mm_map() allows modifying all mm_struct boundaries and
the saved auxv vector.  The individual field path (PR_SET_MM_START_CODE
etc.) correctly requires CAP_SYS_RESOURCE, but the PR_SET_MM_MAP path
dispatches before this check and has no capability requirement of its
own when exe_fd is -1.

This means any unprivileged user on a CONFIG_CHECKPOINT_RESTORE kernel
(nearly all distros) can rewrite mm boundaries including start_brk, brk,
arg_start/end, env_start/end and saved_auxv.  Consequences include:

  - SELinux PROCESS__EXECHEAP bypass via start_brk manipulation
  - procfs info disclosure by pointing arg/env ranges at other memory
  - auxv poisoning (AT_SYSINFO_EHDR, AT_BASE, AT_ENTRY)

The original commit f606b77f1a9e ("prctl: PR_SET_MM -- introduce
PR_SET_MM_MAP operation") states "we require the caller to be at least
user-namespace root user", but this was never enforced in the code.

Add a checkpoint_restore_ns_capable() check at the top of
prctl_set_mm_map(), after the PR_SET_MM_MAP_SIZE early return.  This
requires CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN in the caller's
user namespace, matching the stated design intent and the existing
check for exe_fd changes.

Fixes: f606b77f1a9e ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
Cc: stable@vger.kernel.org
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Qi Tang <tpluszz77@gmail.com>
---
 kernel/sys.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sys.c b/kernel/sys.c
index c86eba9aa7e9..2b8c57f23a35 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2071,6 +2071,9 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
 		return put_user((unsigned int)sizeof(prctl_map),
 				(unsigned int __user *)addr);
 
+	if (!checkpoint_restore_ns_capable(current_user_ns()))
+		return -EPERM;
+
 	if (data_size != sizeof(prctl_map))
 		return -EINVAL;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] prctl: require checkpoint_restore_ns_capable for PR_SET_MM_MAP
  2026-04-02 11:13 [PATCH] prctl: require checkpoint_restore_ns_capable for PR_SET_MM_MAP Qi Tang
@ 2026-04-02 12:57 ` Oleg Nesterov
  2026-04-02 13:07   ` Lorenzo Stoakes (Oracle)
  2026-04-02 13:13   ` Oleg Nesterov
  2026-04-02 13:06 ` Lorenzo Stoakes (Oracle)
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 13+ messages in thread
From: Oleg Nesterov @ 2026-04-02 12:57 UTC (permalink / raw)
  To: Qi Tang
  Cc: Andrew Morton, Cyrill Gorcunov, David Hildenbrand,
	Lorenzo Stoakes, linux-kernel, stable

On 04/02, Qi Tang wrote:
>
> The original commit f606b77f1a9e ("prctl: PR_SET_MM -- introduce
> PR_SET_MM_MAP operation") states "we require the caller to be at least
> user-namespace root user", but this was never enforced in the code.
>
> Add a checkpoint_restore_ns_capable() check at the top of
> prctl_set_mm_map(), after the PR_SET_MM_MAP_SIZE early return.  This
> requires CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN in the caller's
> user namespace, matching the stated design intent and the existing
> check for exe_fd changes.

Can't really comment... but if you add this check at the start, then you
should also remove the same checkpoint_restore_ns_capable() check below?
In the "if (prctl_map.exe_fd != (u32)-1)" block.

Oleg.


> Fixes: f606b77f1a9e ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
> Cc: stable@vger.kernel.org
> Cc: Cyrill Gorcunov <gorcunov@openvz.org>
> Signed-off-by: Qi Tang <tpluszz77@gmail.com>
> ---
>  kernel/sys.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/kernel/sys.c b/kernel/sys.c
> index c86eba9aa7e9..2b8c57f23a35 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2071,6 +2071,9 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
>  		return put_user((unsigned int)sizeof(prctl_map),
>  				(unsigned int __user *)addr);
>  
> +	if (!checkpoint_restore_ns_capable(current_user_ns()))
> +		return -EPERM;
> +
>  	if (data_size != sizeof(prctl_map))
>  		return -EINVAL;
>  
> -- 
> 2.43.0
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] prctl: require checkpoint_restore_ns_capable for PR_SET_MM_MAP
  2026-04-02 11:13 [PATCH] prctl: require checkpoint_restore_ns_capable for PR_SET_MM_MAP Qi Tang
  2026-04-02 12:57 ` Oleg Nesterov
@ 2026-04-02 13:06 ` Lorenzo Stoakes (Oracle)
  2026-04-02 13:55   ` David Hildenbrand (Arm)
  2026-04-02 13:30 ` David Hildenbrand (Arm)
  2026-04-02 17:47 ` Andrew Morton
  3 siblings, 1 reply; 13+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-04-02 13:06 UTC (permalink / raw)
  To: Qi Tang
  Cc: Andrew Morton, Cyrill Gorcunov, David Hildenbrand, Oleg Nesterov,
	linux-kernel, stable

On Thu, Apr 02, 2026 at 07:13:32PM +0800, Qi Tang wrote:
> prctl_set_mm_map() allows modifying all mm_struct boundaries and
> the saved auxv vector.  The individual field path (PR_SET_MM_START_CODE
> etc.) correctly requires CAP_SYS_RESOURCE, but the PR_SET_MM_MAP path
> dispatches before this check and has no capability requirement of its
> own when exe_fd is -1.
>
> This means any unprivileged user on a CONFIG_CHECKPOINT_RESTORE kernel
> (nearly all distros) can rewrite mm boundaries including start_brk, brk,
> arg_start/end, env_start/end and saved_auxv.  Consequences include:
>
>   - SELinux PROCESS__EXECHEAP bypass via start_brk manipulation
>   - procfs info disclosure by pointing arg/env ranges at other memory
>   - auxv poisoning (AT_SYSINFO_EHDR, AT_BASE, AT_ENTRY)
>
> The original commit f606b77f1a9e ("prctl: PR_SET_MM -- introduce
> PR_SET_MM_MAP operation") states "we require the caller to be at least
> user-namespace root user", but this was never enforced in the code.
>
> Add a checkpoint_restore_ns_capable() check at the top of
> prctl_set_mm_map(), after the PR_SET_MM_MAP_SIZE early return.  This
> requires CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN in the caller's
> user namespace, matching the stated design intent and the existing
> check for exe_fd changes.
>
> Fixes: f606b77f1a9e ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")

We've had a gaping security hole since 2014 and nobody noticed? I find it
hard to believe.

> Cc: stable@vger.kernel.org
> Cc: Cyrill Gorcunov <gorcunov@openvz.org>
> Signed-off-by: Qi Tang <tpluszz77@gmail.com>
> ---
>  kernel/sys.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sys.c b/kernel/sys.c
> index c86eba9aa7e9..2b8c57f23a35 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2071,6 +2071,9 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
>  		return put_user((unsigned int)sizeof(prctl_map),
>  				(unsigned int __user *)addr);
>
> +	if (!checkpoint_restore_ns_capable(current_user_ns()))
> +		return -EPERM;

Hmm there is already:

	if (prctl_map.exe_fd != (u32)-1) {
		/*
		 * Check if the current user is checkpoint/restore capable.
		 * At the time of this writing, it checks for CAP_SYS_ADMIN
		 * or CAP_CHECKPOINT_RESTORE.
		 * Note that a user with access to ptrace can masquerade an
		 * arbitrary program as any executable, even setuid ones.
		 * This may have implications in the tomoyo subsystem.
		 */
		if (!checkpoint_restore_ns_capable(current_user_ns()))
			return -EPERM;

And you're proposing _adding_ this check on top of that? Seems super
redundant.

but also, this seems super-specific buuut... Then again #ifdef
CONFIG_CHECKPOINT_RESTORE around this. Ugh.

I _hate_ this inteface. HATE HATE HATE it.

Anyway, does updating _your own_ auxv really require elevated permissions
like this?

I don't think so? Couldn't you go and manipulate that anyway without
elevated anything?


> +
>  	if (data_size != sizeof(prctl_map))
>  		return -EINVAL;
>
> --
> 2.43.0
>

This all seems unnecessary and in fact, surely would break userspace? Am I
missing something here?

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] prctl: require checkpoint_restore_ns_capable for PR_SET_MM_MAP
  2026-04-02 12:57 ` Oleg Nesterov
@ 2026-04-02 13:07   ` Lorenzo Stoakes (Oracle)
  2026-04-02 13:13   ` Oleg Nesterov
  1 sibling, 0 replies; 13+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-04-02 13:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Qi Tang, Andrew Morton, Cyrill Gorcunov, David Hildenbrand,
	linux-kernel, stable

On Thu, Apr 02, 2026 at 02:57:51PM +0200, Oleg Nesterov wrote:
> On 04/02, Qi Tang wrote:
> >
> > The original commit f606b77f1a9e ("prctl: PR_SET_MM -- introduce
> > PR_SET_MM_MAP operation") states "we require the caller to be at least
> > user-namespace root user", but this was never enforced in the code.
> >
> > Add a checkpoint_restore_ns_capable() check at the top of
> > prctl_set_mm_map(), after the PR_SET_MM_MAP_SIZE early return.  This
> > requires CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN in the caller's
> > user namespace, matching the stated design intent and the existing
> > check for exe_fd changes.
>
> Can't really comment... but if you add this check at the start, then you
> should also remove the same checkpoint_restore_ns_capable() check below?
> In the "if (prctl_map.exe_fd != (u32)-1)" block.

Ah yeah we noticed the same thing :)

But also as per sub-thread, I question this patch in general... :)

>
> Oleg.
>
>
> > Fixes: f606b77f1a9e ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
> > Cc: stable@vger.kernel.org
> > Cc: Cyrill Gorcunov <gorcunov@openvz.org>
> > Signed-off-by: Qi Tang <tpluszz77@gmail.com>
> > ---
> >  kernel/sys.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index c86eba9aa7e9..2b8c57f23a35 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -2071,6 +2071,9 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
> >  		return put_user((unsigned int)sizeof(prctl_map),
> >  				(unsigned int __user *)addr);
> >
> > +	if (!checkpoint_restore_ns_capable(current_user_ns()))
> > +		return -EPERM;
> > +
> >  	if (data_size != sizeof(prctl_map))
> >  		return -EINVAL;
> >
> > --
> > 2.43.0
> >
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] prctl: require checkpoint_restore_ns_capable for PR_SET_MM_MAP
  2026-04-02 12:57 ` Oleg Nesterov
  2026-04-02 13:07   ` Lorenzo Stoakes (Oracle)
@ 2026-04-02 13:13   ` Oleg Nesterov
  2026-04-02 13:41     ` David Hildenbrand (Arm)
  1 sibling, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2026-04-02 13:13 UTC (permalink / raw)
  To: Qi Tang
  Cc: Andrew Morton, Cyrill Gorcunov, David Hildenbrand,
	Lorenzo Stoakes, linux-kernel, stable

Note also the comment above validate_prctl_map_addr() called by
prctl_set_mm_map(), "we don't require any capability here ...".

Oleg.

On 04/02, Oleg Nesterov wrote:
>
> On 04/02, Qi Tang wrote:
> >
> > The original commit f606b77f1a9e ("prctl: PR_SET_MM -- introduce
> > PR_SET_MM_MAP operation") states "we require the caller to be at least
> > user-namespace root user", but this was never enforced in the code.
> >
> > Add a checkpoint_restore_ns_capable() check at the top of
> > prctl_set_mm_map(), after the PR_SET_MM_MAP_SIZE early return.  This
> > requires CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN in the caller's
> > user namespace, matching the stated design intent and the existing
> > check for exe_fd changes.
>
> Can't really comment... but if you add this check at the start, then you
> should also remove the same checkpoint_restore_ns_capable() check below?
> In the "if (prctl_map.exe_fd != (u32)-1)" block.
>
> Oleg.
>
>
> > Fixes: f606b77f1a9e ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
> > Cc: stable@vger.kernel.org
> > Cc: Cyrill Gorcunov <gorcunov@openvz.org>
> > Signed-off-by: Qi Tang <tpluszz77@gmail.com>
> > ---
> >  kernel/sys.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index c86eba9aa7e9..2b8c57f23a35 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -2071,6 +2071,9 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
> >  		return put_user((unsigned int)sizeof(prctl_map),
> >  				(unsigned int __user *)addr);
> >
> > +	if (!checkpoint_restore_ns_capable(current_user_ns()))
> > +		return -EPERM;
> > +
> >  	if (data_size != sizeof(prctl_map))
> >  		return -EINVAL;
> >
> > --
> > 2.43.0
> >


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] prctl: require checkpoint_restore_ns_capable for PR_SET_MM_MAP
  2026-04-02 11:13 [PATCH] prctl: require checkpoint_restore_ns_capable for PR_SET_MM_MAP Qi Tang
  2026-04-02 12:57 ` Oleg Nesterov
  2026-04-02 13:06 ` Lorenzo Stoakes (Oracle)
@ 2026-04-02 13:30 ` David Hildenbrand (Arm)
  2026-04-02 17:47 ` Andrew Morton
  3 siblings, 0 replies; 13+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-02 13:30 UTC (permalink / raw)
  To: Qi Tang, Andrew Morton
  Cc: Cyrill Gorcunov, Lorenzo Stoakes, Oleg Nesterov, linux-kernel,
	stable

On 4/2/26 13:13, Qi Tang wrote:
> prctl_set_mm_map() allows modifying all mm_struct boundaries and
> the saved auxv vector.  The individual field path (PR_SET_MM_START_CODE
> etc.) correctly requires CAP_SYS_RESOURCE, but the PR_SET_MM_MAP path
> dispatches before this check and has no capability requirement of its
> own when exe_fd is -1.
> 
> This means any unprivileged user on a CONFIG_CHECKPOINT_RESTORE kernel
> (nearly all distros) can rewrite mm boundaries including start_brk, brk,
> arg_start/end, env_start/end and saved_auxv.  Consequences include:
> 
>   - SELinux PROCESS__EXECHEAP bypass via start_brk manipulation
>   - procfs info disclosure by pointing arg/env ranges at other memory
>   - auxv poisoning (AT_SYSINFO_EHDR, AT_BASE, AT_ENTRY)
> 
> The original commit f606b77f1a9e ("prctl: PR_SET_MM -- introduce
> PR_SET_MM_MAP operation") states "we require the caller to be at least
> user-namespace root user", but this was never enforced in the code.

That is taken out of contex, no?

"Still note that updating exe-file link now doesn't require sys-resource
capability anymore, ... Still we require the caller to be at least
user-namespace root user."

That check was added in prctl_set_mm_map()->validate_prctl_map() in the
original patch:

+       /*
+        * Finally, make sure the caller has the rights to
+        * change /proc/pid/exe link: only local root should
+        * be allowed to.
+        */
+       if (prctl_map->exe_fd != (u32)-1) {
+               struct user_namespace *ns = current_user_ns();
+               const struct cred *cred = current_cred();
+
+               if (!uid_eq(cred->uid, make_kuid(ns, 0)) ||
+                   !gid_eq(cred->gid, make_kgid(ns, 0)))
+                       goto out;
+       }


-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] prctl: require checkpoint_restore_ns_capable for PR_SET_MM_MAP
  2026-04-02 13:13   ` Oleg Nesterov
@ 2026-04-02 13:41     ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 13+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-02 13:41 UTC (permalink / raw)
  To: Oleg Nesterov, Qi Tang
  Cc: Andrew Morton, Cyrill Gorcunov, Lorenzo Stoakes, linux-kernel,
	stable

On 4/2/26 15:13, Oleg Nesterov wrote:
> Note also the comment above validate_prctl_map_addr() called by
> prctl_set_mm_map(), "we don't require any capability here ...".
> 
> Oleg.
> 
> On 04/02, Oleg Nesterov wrote:
>>
>> On 04/02, Qi Tang wrote:
>>>
>>> The original commit f606b77f1a9e ("prctl: PR_SET_MM -- introduce
>>> PR_SET_MM_MAP operation") states "we require the caller to be at least
>>> user-namespace root user", but this was never enforced in the code.
>>>
>>> Add a checkpoint_restore_ns_capable() check at the top of
>>> prctl_set_mm_map(), after the PR_SET_MM_MAP_SIZE early return.  This
>>> requires CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN in the caller's
>>> user namespace, matching the stated design intent and the existing
>>> check for exe_fd changes.
>>
>> Can't really comment... but if you add this check at the start, then you
>> should also remove the same checkpoint_restore_ns_capable() check below?
>> In the "if (prctl_map.exe_fd != (u32)-1)" block.
>>
>

I'll note that the man page is wrong as well.

PR_SET_MM lists PR_SET_MM_MAP, but states "The calling process must have
the CAP_SYS_RESOURCE capability."

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] prctl: require checkpoint_restore_ns_capable for PR_SET_MM_MAP
  2026-04-02 13:06 ` Lorenzo Stoakes (Oracle)
@ 2026-04-02 13:55   ` David Hildenbrand (Arm)
  2026-04-02 14:05     ` David Hildenbrand (Arm)
  2026-04-02 14:21     ` Lorenzo Stoakes (Oracle)
  0 siblings, 2 replies; 13+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-02 13:55 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle), Qi Tang
  Cc: Andrew Morton, Cyrill Gorcunov, Oleg Nesterov, linux-kernel,
	stable

On 4/2/26 15:06, Lorenzo Stoakes (Oracle) wrote:
> On Thu, Apr 02, 2026 at 07:13:32PM +0800, Qi Tang wrote:
>> prctl_set_mm_map() allows modifying all mm_struct boundaries and
>> the saved auxv vector.  The individual field path (PR_SET_MM_START_CODE
>> etc.) correctly requires CAP_SYS_RESOURCE, but the PR_SET_MM_MAP path
>> dispatches before this check and has no capability requirement of its
>> own when exe_fd is -1.
>>
>> This means any unprivileged user on a CONFIG_CHECKPOINT_RESTORE kernel
>> (nearly all distros) can rewrite mm boundaries including start_brk, brk,
>> arg_start/end, env_start/end and saved_auxv.  Consequences include:
>>
>>   - SELinux PROCESS__EXECHEAP bypass via start_brk manipulation
>>   - procfs info disclosure by pointing arg/env ranges at other memory
>>   - auxv poisoning (AT_SYSINFO_EHDR, AT_BASE, AT_ENTRY)
>>
>> The original commit f606b77f1a9e ("prctl: PR_SET_MM -- introduce
>> PR_SET_MM_MAP operation") states "we require the caller to be at least
>> user-namespace root user", but this was never enforced in the code.
>>
>> Add a checkpoint_restore_ns_capable() check at the top of
>> prctl_set_mm_map(), after the PR_SET_MM_MAP_SIZE early return.  This
>> requires CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN in the caller's
>> user namespace, matching the stated design intent and the existing
>> check for exe_fd changes.
>>
>> Fixes: f606b77f1a9e ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
> 
> We've had a gaping security hole since 2014 and nobody noticed? I find it
> hard to believe.
> 
>> Cc: stable@vger.kernel.org
>> Cc: Cyrill Gorcunov <gorcunov@openvz.org>
>> Signed-off-by: Qi Tang <tpluszz77@gmail.com>
>> ---
>>  kernel/sys.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index c86eba9aa7e9..2b8c57f23a35 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -2071,6 +2071,9 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
>>  		return put_user((unsigned int)sizeof(prctl_map),
>>  				(unsigned int __user *)addr);
>>
>> +	if (!checkpoint_restore_ns_capable(current_user_ns()))
>> +		return -EPERM;
> 
> Hmm there is already:
> 
> 	if (prctl_map.exe_fd != (u32)-1) {
> 		/*
> 		 * Check if the current user is checkpoint/restore capable.
> 		 * At the time of this writing, it checks for CAP_SYS_ADMIN
> 		 * or CAP_CHECKPOINT_RESTORE.
> 		 * Note that a user with access to ptrace can masquerade an
> 		 * arbitrary program as any executable, even setuid ones.
> 		 * This may have implications in the tomoyo subsystem.
> 		 */
> 		if (!checkpoint_restore_ns_capable(current_user_ns()))
> 			return -EPERM;
> 
> And you're proposing _adding_ this check on top of that? Seems super
> redundant.

Yes, should be moved.

> 
> but also, this seems super-specific buuut... Then again #ifdef
> CONFIG_CHECKPOINT_RESTORE around this. Ugh.
> 
> I _hate_ this inteface. HATE HATE HATE it.
> 
> Anyway, does updating _your own_ auxv really require elevated permissions
> like this?
> 
> I don't think so? Couldn't you go and manipulate that anyway without
> elevated anything?

Hard to believe ...

I was wondering whether this could break some users. At least CRIU doc
states:

    This option tells *criu* to accept the limitations when running
    as non-root. Running as non-root requires *criu* at least to have
    *CAP_SYS_ADMIN* or *CAP_CHECKPOINT_RESTORE*. For details about
    running *criu* as non-root please consult the *NON-ROOT* section.

I mean, the check makes sense given that prctl_set_mm() rejects all
these operations without CAP_SYS_RESOURCE.


CAP_CHECKPOINT_RESTORE was not introduced before

commit 124ea650d3072b005457faed69909221c2905a1f
Author: Adrian Reber <areber@redhat.com>
Date:   Sun Jul 19 12:04:11 2020 +0200

    capabilities: Introduce CAP_CHECKPOINT_RESTORE

So at the time PR_SET_MM_MAP was added there simply was no such capability.

Likely, now that we have it, we should indeed use it.

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] prctl: require checkpoint_restore_ns_capable for PR_SET_MM_MAP
  2026-04-02 13:55   ` David Hildenbrand (Arm)
@ 2026-04-02 14:05     ` David Hildenbrand (Arm)
  2026-04-02 14:21     ` Lorenzo Stoakes (Oracle)
  1 sibling, 0 replies; 13+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-02 14:05 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle), Qi Tang
  Cc: Andrew Morton, Cyrill Gorcunov, Oleg Nesterov, linux-kernel,
	stable

On 4/2/26 15:55, David Hildenbrand (Arm) wrote:
> On 4/2/26 15:06, Lorenzo Stoakes (Oracle) wrote:
>> On Thu, Apr 02, 2026 at 07:13:32PM +0800, Qi Tang wrote:
>>> prctl_set_mm_map() allows modifying all mm_struct boundaries and
>>> the saved auxv vector.  The individual field path (PR_SET_MM_START_CODE
>>> etc.) correctly requires CAP_SYS_RESOURCE, but the PR_SET_MM_MAP path
>>> dispatches before this check and has no capability requirement of its
>>> own when exe_fd is -1.
>>>
>>> This means any unprivileged user on a CONFIG_CHECKPOINT_RESTORE kernel
>>> (nearly all distros) can rewrite mm boundaries including start_brk, brk,
>>> arg_start/end, env_start/end and saved_auxv.  Consequences include:
>>>
>>>   - SELinux PROCESS__EXECHEAP bypass via start_brk manipulation
>>>   - procfs info disclosure by pointing arg/env ranges at other memory
>>>   - auxv poisoning (AT_SYSINFO_EHDR, AT_BASE, AT_ENTRY)
>>>
>>> The original commit f606b77f1a9e ("prctl: PR_SET_MM -- introduce
>>> PR_SET_MM_MAP operation") states "we require the caller to be at least
>>> user-namespace root user", but this was never enforced in the code.
>>>
>>> Add a checkpoint_restore_ns_capable() check at the top of
>>> prctl_set_mm_map(), after the PR_SET_MM_MAP_SIZE early return.  This
>>> requires CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN in the caller's
>>> user namespace, matching the stated design intent and the existing
>>> check for exe_fd changes.
>>>
>>> Fixes: f606b77f1a9e ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
>>
>> We've had a gaping security hole since 2014 and nobody noticed? I find it
>> hard to believe.
>>
>>> Cc: stable@vger.kernel.org
>>> Cc: Cyrill Gorcunov <gorcunov@openvz.org>
>>> Signed-off-by: Qi Tang <tpluszz77@gmail.com>
>>> ---
>>>  kernel/sys.c | 3 +++
>>>  1 file changed, 3 insertions(+)
>>>
>>> diff --git a/kernel/sys.c b/kernel/sys.c
>>> index c86eba9aa7e9..2b8c57f23a35 100644
>>> --- a/kernel/sys.c
>>> +++ b/kernel/sys.c
>>> @@ -2071,6 +2071,9 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
>>>  		return put_user((unsigned int)sizeof(prctl_map),
>>>  				(unsigned int __user *)addr);
>>>
>>> +	if (!checkpoint_restore_ns_capable(current_user_ns()))
>>> +		return -EPERM;
>>
>> Hmm there is already:
>>
>> 	if (prctl_map.exe_fd != (u32)-1) {
>> 		/*
>> 		 * Check if the current user is checkpoint/restore capable.
>> 		 * At the time of this writing, it checks for CAP_SYS_ADMIN
>> 		 * or CAP_CHECKPOINT_RESTORE.
>> 		 * Note that a user with access to ptrace can masquerade an
>> 		 * arbitrary program as any executable, even setuid ones.
>> 		 * This may have implications in the tomoyo subsystem.
>> 		 */
>> 		if (!checkpoint_restore_ns_capable(current_user_ns()))
>> 			return -EPERM;
>>
>> And you're proposing _adding_ this check on top of that? Seems super
>> redundant.
> 
> Yes, should be moved.
> 
>>
>> but also, this seems super-specific buuut... Then again #ifdef
>> CONFIG_CHECKPOINT_RESTORE around this. Ugh.
>>
>> I _hate_ this inteface. HATE HATE HATE it.
>>
>> Anyway, does updating _your own_ auxv really require elevated permissions
>> like this?
>>
>> I don't think so? Couldn't you go and manipulate that anyway without
>> elevated anything?
> 
> Hard to believe ...
> 
> I was wondering whether this could break some users. At least CRIU doc
> states:
> 
>     This option tells *criu* to accept the limitations when running
>     as non-root. Running as non-root requires *criu* at least to have
>     *CAP_SYS_ADMIN* or *CAP_CHECKPOINT_RESTORE*. For details about
>     running *criu* as non-root please consult the *NON-ROOT* section.

Doing some digging, lxc seems to use that interface.

https://github.com/lxc/lxc/blob/3ee89c5d95ee8f31bd81623fd73ad7beea4297f8/src/lxc/initutils.c#L311

I have no clue about capabilities there.

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] prctl: require checkpoint_restore_ns_capable for PR_SET_MM_MAP
  2026-04-02 13:55   ` David Hildenbrand (Arm)
  2026-04-02 14:05     ` David Hildenbrand (Arm)
@ 2026-04-02 14:21     ` Lorenzo Stoakes (Oracle)
  2026-04-02 14:27       ` David Hildenbrand (Arm)
  1 sibling, 1 reply; 13+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-04-02 14:21 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Qi Tang, Andrew Morton, Cyrill Gorcunov, Oleg Nesterov,
	linux-kernel, stable

On Thu, Apr 02, 2026 at 03:55:27PM +0200, David Hildenbrand (Arm) wrote:
> On 4/2/26 15:06, Lorenzo Stoakes (Oracle) wrote:
> > On Thu, Apr 02, 2026 at 07:13:32PM +0800, Qi Tang wrote:
> >> prctl_set_mm_map() allows modifying all mm_struct boundaries and
> >> the saved auxv vector.  The individual field path (PR_SET_MM_START_CODE
> >> etc.) correctly requires CAP_SYS_RESOURCE, but the PR_SET_MM_MAP path
> >> dispatches before this check and has no capability requirement of its
> >> own when exe_fd is -1.
> >>
> >> This means any unprivileged user on a CONFIG_CHECKPOINT_RESTORE kernel
> >> (nearly all distros) can rewrite mm boundaries including start_brk, brk,
> >> arg_start/end, env_start/end and saved_auxv.  Consequences include:
> >>
> >>   - SELinux PROCESS__EXECHEAP bypass via start_brk manipulation
> >>   - procfs info disclosure by pointing arg/env ranges at other memory
> >>   - auxv poisoning (AT_SYSINFO_EHDR, AT_BASE, AT_ENTRY)
> >>
> >> The original commit f606b77f1a9e ("prctl: PR_SET_MM -- introduce
> >> PR_SET_MM_MAP operation") states "we require the caller to be at least
> >> user-namespace root user", but this was never enforced in the code.
> >>
> >> Add a checkpoint_restore_ns_capable() check at the top of
> >> prctl_set_mm_map(), after the PR_SET_MM_MAP_SIZE early return.  This
> >> requires CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN in the caller's
> >> user namespace, matching the stated design intent and the existing
> >> check for exe_fd changes.
> >>
> >> Fixes: f606b77f1a9e ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
> >
> > We've had a gaping security hole since 2014 and nobody noticed? I find it
> > hard to believe.
> >
> >> Cc: stable@vger.kernel.org
> >> Cc: Cyrill Gorcunov <gorcunov@openvz.org>
> >> Signed-off-by: Qi Tang <tpluszz77@gmail.com>
> >> ---
> >>  kernel/sys.c | 3 +++
> >>  1 file changed, 3 insertions(+)
> >>
> >> diff --git a/kernel/sys.c b/kernel/sys.c
> >> index c86eba9aa7e9..2b8c57f23a35 100644
> >> --- a/kernel/sys.c
> >> +++ b/kernel/sys.c
> >> @@ -2071,6 +2071,9 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
> >>  		return put_user((unsigned int)sizeof(prctl_map),
> >>  				(unsigned int __user *)addr);
> >>
> >> +	if (!checkpoint_restore_ns_capable(current_user_ns()))
> >> +		return -EPERM;
> >
> > Hmm there is already:
> >
> > 	if (prctl_map.exe_fd != (u32)-1) {
> > 		/*
> > 		 * Check if the current user is checkpoint/restore capable.
> > 		 * At the time of this writing, it checks for CAP_SYS_ADMIN
> > 		 * or CAP_CHECKPOINT_RESTORE.
> > 		 * Note that a user with access to ptrace can masquerade an
> > 		 * arbitrary program as any executable, even setuid ones.
> > 		 * This may have implications in the tomoyo subsystem.
> > 		 */
> > 		if (!checkpoint_restore_ns_capable(current_user_ns()))
> > 			return -EPERM;
> >
> > And you're proposing _adding_ this check on top of that? Seems super
> > redundant.
>
> Yes, should be moved.

Well, I don't think this patch should be applied at all...

>
> >
> > but also, this seems super-specific buuut... Then again #ifdef
> > CONFIG_CHECKPOINT_RESTORE around this. Ugh.
> >
> > I _hate_ this inteface. HATE HATE HATE it.
> >
> > Anyway, does updating _your own_ auxv really require elevated permissions
> > like this?
> >
> > I don't think so? Couldn't you go and manipulate that anyway without
> > elevated anything?
>
> Hard to believe ...
>
> I was wondering whether this could break some users. At least CRIU doc
> states:
>
>     This option tells *criu* to accept the limitations when running
>     as non-root. Running as non-root requires *criu* at least to have
>     *CAP_SYS_ADMIN* or *CAP_CHECKPOINT_RESTORE*. For details about
>     running *criu* as non-root please consult the *NON-ROOT* section.

Hmm. I wonder if we don't have more users than that though? Hard to rule out
some weird program somewhere using it for some strange reason.

Commit ebd6de681238 ("prctl: Allow local CAP_CHECKPOINT_RESTORE to change
/proc/self/exe") explicitly _only_ restricted the exe link.

So maybe these comment is in reference to _other_ operations other than non-exe
changing PR_SET_MM_MAP, PR_SET_MM_MAP_SIZE?

>
> I mean, the check makes sense given that prctl_set_mm() rejects all
> these operations without CAP_SYS_RESOURCE.

Hmm but the CAP_SYS_RESOURCE check is only applicable to commands other than
PR_SET_MM_MAP or PR_SET_MM_MAP_SIZE?

#ifdef CONFIG_CHECKPOINT_RESTORE
	if (opt == PR_SET_MM_MAP || opt == PR_SET_MM_MAP_SIZE)
		return prctl_set_mm_map(opt, (const void __user *)addr, arg4);
#endif

	if (!capable(CAP_SYS_RESOURCE))
		return -EPERM;

	... rest ...

>
>
> CAP_CHECKPOINT_RESTORE was not introduced before
>
> commit 124ea650d3072b005457faed69909221c2905a1f
> Author: Adrian Reber <areber@redhat.com>
> Date:   Sun Jul 19 12:04:11 2020 +0200
>
>     capabilities: Introduce CAP_CHECKPOINT_RESTORE
>
> So at the time PR_SET_MM_MAP was added there simply was no such capability.
>
> Likely, now that we have it, we should indeed use it.

But we did start using it in the exec_fd != -1 case?

Hmm actually sorry it does more than just manipulating auxv, you can change a
bunch of mm->... stuff.

But if it's your process does it really matter? You can manipulate memory all
over the place in your process...

>
> --
> Cheers,
>
> David

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] prctl: require checkpoint_restore_ns_capable for PR_SET_MM_MAP
  2026-04-02 14:21     ` Lorenzo Stoakes (Oracle)
@ 2026-04-02 14:27       ` David Hildenbrand (Arm)
  2026-04-02 17:46         ` Andrei Vagin
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-02 14:27 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: Qi Tang, Andrew Morton, Cyrill Gorcunov, Oleg Nesterov,
	linux-kernel, stable

On 4/2/26 16:21, Lorenzo Stoakes (Oracle) wrote:
> On Thu, Apr 02, 2026 at 03:55:27PM +0200, David Hildenbrand (Arm) wrote:
>> On 4/2/26 15:06, Lorenzo Stoakes (Oracle) wrote:
>>>
>>> We've had a gaping security hole since 2014 and nobody noticed? I find it
>>> hard to believe.
>>>
>>>
>>> Hmm there is already:
>>>
>>> 	if (prctl_map.exe_fd != (u32)-1) {
>>> 		/*
>>> 		 * Check if the current user is checkpoint/restore capable.
>>> 		 * At the time of this writing, it checks for CAP_SYS_ADMIN
>>> 		 * or CAP_CHECKPOINT_RESTORE.
>>> 		 * Note that a user with access to ptrace can masquerade an
>>> 		 * arbitrary program as any executable, even setuid ones.
>>> 		 * This may have implications in the tomoyo subsystem.
>>> 		 */
>>> 		if (!checkpoint_restore_ns_capable(current_user_ns()))
>>> 			return -EPERM;
>>>
>>> And you're proposing _adding_ this check on top of that? Seems super
>>> redundant.
>>
>> Yes, should be moved.
> 
> Well, I don't think this patch should be applied at all...
> 

I mean a v2 would have to do that. Whether we would merge that is
another discussion :)

>>
>>>
>>> but also, this seems super-specific buuut... Then again #ifdef
>>> CONFIG_CHECKPOINT_RESTORE around this. Ugh.
>>>
>>> I _hate_ this inteface. HATE HATE HATE it.
>>>
>>> Anyway, does updating _your own_ auxv really require elevated permissions
>>> like this?
>>>
>>> I don't think so? Couldn't you go and manipulate that anyway without
>>> elevated anything?
>>
>> Hard to believe ...
>>
>> I was wondering whether this could break some users. At least CRIU doc
>> states:
>>
>>     This option tells *criu* to accept the limitations when running
>>     as non-root. Running as non-root requires *criu* at least to have
>>     *CAP_SYS_ADMIN* or *CAP_CHECKPOINT_RESTORE*. For details about
>>     running *criu* as non-root please consult the *NON-ROOT* section.
> 
> Hmm. I wonder if we don't have more users than that though? Hard to rule out
> some weird program somewhere using it for some strange reason.

See my LXC example. My gut feeling is that there are more users.

Which then raises the question why this is still protected by that
kconfig option.

Something is off here, maybe :)

> 
> Commit ebd6de681238 ("prctl: Allow local CAP_CHECKPOINT_RESTORE to change
> /proc/self/exe") explicitly _only_ restricted the exe link.
> 
> So maybe these comment is in reference to _other_ operations other than non-exe
> changing PR_SET_MM_MAP, PR_SET_MM_MAP_SIZE?
> 
>>
>> I mean, the check makes sense given that prctl_set_mm() rejects all
>> these operations without CAP_SYS_RESOURCE.
> 
> Hmm but the CAP_SYS_RESOURCE check is only applicable to commands other than
> PR_SET_MM_MAP or PR_SET_MM_MAP_SIZE?
> 
> #ifdef CONFIG_CHECKPOINT_RESTORE
> 	if (opt == PR_SET_MM_MAP || opt == PR_SET_MM_MAP_SIZE)
> 		return prctl_set_mm_map(opt, (const void __user *)addr, arg4);
> #endif
> 
> 	if (!capable(CAP_SYS_RESOURCE))
> 		return -EPERM;
> 
> 	... rest ...

My point is that you can perform all these modifications without
CAP_SYS_RESOURCE through prctl_set_mm_map().

Like PR_SET_MM_AUXV.

It's all very inconsistent, that's what I am saying.

> 
>>
>>
>> CAP_CHECKPOINT_RESTORE was not introduced before
>>
>> commit 124ea650d3072b005457faed69909221c2905a1f
>> Author: Adrian Reber <areber@redhat.com>
>> Date:   Sun Jul 19 12:04:11 2020 +0200
>>
>>     capabilities: Introduce CAP_CHECKPOINT_RESTORE
>>
>> So at the time PR_SET_MM_MAP was added there simply was no such capability.
>>
>> Likely, now that we have it, we should indeed use it.
> 
> But we did start using it in the exec_fd != -1 case?

The existing ns check was replaced at some point, yes.

> 
> Hmm actually sorry it does more than just manipulating auxv, you can change a
> bunch of mm->... stuff.
> 
> But if it's your process does it really matter? You can manipulate memory all
> over the place in your process...

Well, I am wondering why e.g., PR_SET_MM_AUXV etc requires CAP_SYS_RESOURCE.

PR_SET_MM_EXE_FILE I understand. The other not.

Extremely inconsistent.

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] prctl: require checkpoint_restore_ns_capable for PR_SET_MM_MAP
  2026-04-02 14:27       ` David Hildenbrand (Arm)
@ 2026-04-02 17:46         ` Andrei Vagin
  0 siblings, 0 replies; 13+ messages in thread
From: Andrei Vagin @ 2026-04-02 17:46 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Cyrill Gorcunov
  Cc: Lorenzo Stoakes (Oracle), Qi Tang, Andrew Morton, Oleg Nesterov,
	linux-kernel, stable, criu, Aleksandr Mikhalitsyn

On Thu, Apr 2, 2026 at 7:29 AM David Hildenbrand (Arm) <david@kernel.org> wrote:
>
> On 4/2/26 16:21, Lorenzo Stoakes (Oracle) wrote:
> > On Thu, Apr 02, 2026 at 03:55:27PM +0200, David Hildenbrand (Arm) wrote:
> >> On 4/2/26 15:06, Lorenzo Stoakes (Oracle) wrote:
> >>>
> >>> We've had a gaping security hole since 2014 and nobody noticed? I find it
> >>> hard to believe.
> >>>
> >>>
> >>> Hmm there is already:
> >>>
> >>>     if (prctl_map.exe_fd != (u32)-1) {
> >>>             /*
> >>>              * Check if the current user is checkpoint/restore capable.
> >>>              * At the time of this writing, it checks for CAP_SYS_ADMIN
> >>>              * or CAP_CHECKPOINT_RESTORE.
> >>>              * Note that a user with access to ptrace can masquerade an
> >>>              * arbitrary program as any executable, even setuid ones.
> >>>              * This may have implications in the tomoyo subsystem.
> >>>              */
> >>>             if (!checkpoint_restore_ns_capable(current_user_ns()))
> >>>                     return -EPERM;
> >>>
> >>> And you're proposing _adding_ this check on top of that? Seems super
> >>> redundant.
> >>
> >> Yes, should be moved.
> >
> > Well, I don't think this patch should be applied at all...
> >
>
> I mean a v2 would have to do that. Whether we would merge that is
> another discussion :)
>
> >>
> >>>
> >>> but also, this seems super-specific buuut... Then again #ifdef
> >>> CONFIG_CHECKPOINT_RESTORE around this. Ugh.
> >>>
> >>> I _hate_ this inteface. HATE HATE HATE it.
> >>>
> >>> Anyway, does updating _your own_ auxv really require elevated permissions
> >>> like this?
> >>>
> >>> I don't think so? Couldn't you go and manipulate that anyway without
> >>> elevated anything?
> >>
> >> Hard to believe ...
> >>
> >> I was wondering whether this could break some users. At least CRIU doc
> >> states:
> >>
> >>     This option tells *criu* to accept the limitations when running
> >>     as non-root. Running as non-root requires *criu* at least to have
> >>     *CAP_SYS_ADMIN* or *CAP_CHECKPOINT_RESTORE*. For details about
> >>     running *criu* as non-root please consult the *NON-ROOT* section.
> >
> > Hmm. I wonder if we don't have more users than that though? Hard to rule out
> > some weird program somewhere using it for some strange reason.
>
> See my LXC example. My gut feeling is that there are more users.
>
> Which then raises the question why this is still protected by that
> kconfig option.
>
> Something is off here, maybe :)
>
> >
> > Commit ebd6de681238 ("prctl: Allow local CAP_CHECKPOINT_RESTORE to change
> > /proc/self/exe") explicitly _only_ restricted the exe link.
> >
> > So maybe these comment is in reference to _other_ operations other than non-exe
> > changing PR_SET_MM_MAP, PR_SET_MM_MAP_SIZE?
> >
> >>
> >> I mean, the check makes sense given that prctl_set_mm() rejects all
> >> these operations without CAP_SYS_RESOURCE.
> >
> > Hmm but the CAP_SYS_RESOURCE check is only applicable to commands other than
> > PR_SET_MM_MAP or PR_SET_MM_MAP_SIZE?
> >
> > #ifdef CONFIG_CHECKPOINT_RESTORE
> >       if (opt == PR_SET_MM_MAP || opt == PR_SET_MM_MAP_SIZE)
> >               return prctl_set_mm_map(opt, (const void __user *)addr, arg4);
> > #endif
> >
> >       if (!capable(CAP_SYS_RESOURCE))
> >               return -EPERM;
> >
> >       ... rest ...
>
> My point is that you can perform all these modifications without
> CAP_SYS_RESOURCE through prctl_set_mm_map().
>
> Like PR_SET_MM_AUXV.
>
> It's all very inconsistent, that's what I am saying.
>
> >
> >>
> >>
> >> CAP_CHECKPOINT_RESTORE was not introduced before
> >>
> >> commit 124ea650d3072b005457faed69909221c2905a1f
> >> Author: Adrian Reber <areber@redhat.com>
> >> Date:   Sun Jul 19 12:04:11 2020 +0200
> >>
> >>     capabilities: Introduce CAP_CHECKPOINT_RESTORE
> >>
> >> So at the time PR_SET_MM_MAP was added there simply was no such capability.
> >>
> >> Likely, now that we have it, we should indeed use it.
> >
> > But we did start using it in the exec_fd != -1 case?
>
> The existing ns check was replaced at some point, yes.
>
> >
> > Hmm actually sorry it does more than just manipulating auxv, you can change a
> > bunch of mm->... stuff.
> >
> > But if it's your process does it really matter? You can manipulate memory all
> > over the place in your process...
>
> Well, I am wondering why e.g., PR_SET_MM_AUXV etc requires CAP_SYS_RESOURCE.
>
> PR_SET_MM_EXE_FILE I understand. The other not.
>
> Extremely inconsistent.

It was a long time ago, and I was not directly involved in introducing
PR_SET_MM_MAP. As far as I remember, the PR_SET_MM_{START,END}_ATTR
commands were introduced first, and they were guarded by the global
CAP_SYS_RESOURCE capability. I believe these prctl calls were part of
the first series of patches for user-process Checkpoint/Restore (C/R)
merged into the Linux kernel.

A few years later, as real users started adopting CRIU, there was a demand
to support user namespaces. There was a discussion about relaxing the
global CAP_SYS_RESOURCE check. Eric expressed
concern that this interface didn’t guarantee the consistency of these
parameters. That was the moment PR_SET_MM_MAP was introduced. I dug into
the history and found the relevant discussion:
https://lkml.iu.edu/hypermail/linux/kernel/1407.1/01621.html
https://lkml.iu.edu/hypermail/linux/kernel/1407.1/00899.html.

```
Subject: prctl: PR_SET_MM -- Introduce PR_SET_MM_MAP operation

During development of c/r we've noticed that in case if we need to
support user namespaces we face a problem with capabilities in
prctl(PR_SET_MM, ...) call, in particular once new user namespace
is created capable(CAP_SYS_RESOURCE) no longer passes.

A approach is to eliminate CAP_SYS_RESOURCE check but pass all
new values in one bundle, which would allow the kernel to make
more intensive test for sanity of values and same time allow us to
support checkpoint/restore of user namespaces.
```

The initial implementation of PR_SET_MM_MAP didn't have the capability
check. It was introduced in 4d28df6152aa ("prctl: Allow local
CAP_SYS_ADMIN changing exe_file").

Thanks,
Andrei

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] prctl: require checkpoint_restore_ns_capable for PR_SET_MM_MAP
  2026-04-02 11:13 [PATCH] prctl: require checkpoint_restore_ns_capable for PR_SET_MM_MAP Qi Tang
                   ` (2 preceding siblings ...)
  2026-04-02 13:30 ` David Hildenbrand (Arm)
@ 2026-04-02 17:47 ` Andrew Morton
  3 siblings, 0 replies; 13+ messages in thread
From: Andrew Morton @ 2026-04-02 17:47 UTC (permalink / raw)
  To: Qi Tang
  Cc: Cyrill Gorcunov, David Hildenbrand, Lorenzo Stoakes,
	Oleg Nesterov, linux-kernel, stable

On Thu,  2 Apr 2026 19:13:32 +0800 Qi Tang <tpluszz77@gmail.com> wrote:

> prctl_set_mm_map() allows modifying all mm_struct boundaries and
> the saved auxv vector.  The individual field path (PR_SET_MM_START_CODE
> etc.) correctly requires CAP_SYS_RESOURCE, but the PR_SET_MM_MAP path
> dispatches before this check and has no capability requirement of its
> own when exe_fd is -1.
> 
> This means any unprivileged user on a CONFIG_CHECKPOINT_RESTORE kernel
> (nearly all distros) can rewrite mm boundaries including start_brk, brk,
> arg_start/end, env_start/end and saved_auxv.  Consequences include:
> 
>   - SELinux PROCESS__EXECHEAP bypass via start_brk manipulation
>   - procfs info disclosure by pointing arg/env ranges at other memory
>   - auxv poisoning (AT_SYSINFO_EHDR, AT_BASE, AT_ENTRY)
> 
> The original commit f606b77f1a9e ("prctl: PR_SET_MM -- introduce
> PR_SET_MM_MAP operation") states "we require the caller to be at least
> user-namespace root user", but this was never enforced in the code.
> 
> Add a checkpoint_restore_ns_capable() check at the top of
> prctl_set_mm_map(), after the PR_SET_MM_MAP_SIZE early return.  This
> requires CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN in the caller's
> user namespace, matching the stated design intent and the existing
> check for exe_fd changes.

Thanks.

AI review claims to have found a couple of things:
	https://sashiko.dev/#/patchset/20260402111332.55957-1-tpluszz77@gmail.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2026-04-02 17:47 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-02 11:13 [PATCH] prctl: require checkpoint_restore_ns_capable for PR_SET_MM_MAP Qi Tang
2026-04-02 12:57 ` Oleg Nesterov
2026-04-02 13:07   ` Lorenzo Stoakes (Oracle)
2026-04-02 13:13   ` Oleg Nesterov
2026-04-02 13:41     ` David Hildenbrand (Arm)
2026-04-02 13:06 ` Lorenzo Stoakes (Oracle)
2026-04-02 13:55   ` David Hildenbrand (Arm)
2026-04-02 14:05     ` David Hildenbrand (Arm)
2026-04-02 14:21     ` Lorenzo Stoakes (Oracle)
2026-04-02 14:27       ` David Hildenbrand (Arm)
2026-04-02 17:46         ` Andrei Vagin
2026-04-02 13:30 ` David Hildenbrand (Arm)
2026-04-02 17:47 ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox