Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH v2 31/32] libluo: introduce luoctl
From: Jason Gunthorpe @ 2025-07-29 16:14 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250723144649.1696299-32-pasha.tatashin@soleen.com>

On Wed, Jul 23, 2025 at 02:46:44PM +0000, Pasha Tatashin wrote:
> From: Pratyush Yadav <ptyadav@amazon.de>
> 
> luoctl is a utility to interact with the LUO state machine. It currently
> supports viewing and change the current state of LUO. This can be used
> by scripts, tools, or developers to control LUO state during the live
> update process.
>
> Example usage:
> 
>     $ luoctl state
>     normal
>     $ luoctl prepare
>     $ luoctl state
>     prepared
>     $ luoctl cancel
>     $ luoctl state
>     normal
> 
> Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  tools/lib/luo/Makefile       |   6 +-
>  tools/lib/luo/cli/.gitignore |   1 +
>  tools/lib/luo/cli/Makefile   |  18 ++++
>  tools/lib/luo/cli/luoctl.c   | 178 +++++++++++++++++++++++++++++++++++
>  4 files changed, 202 insertions(+), 1 deletion(-)
>  create mode 100644 tools/lib/luo/cli/.gitignore
>  create mode 100644 tools/lib/luo/cli/Makefile
>  create mode 100644 tools/lib/luo/cli/luoctl.c

In the calls I thought the plan had changed to put libluo in its own
repository?

There is nothing tightly linked to the kernel here, I think it would
be easier on everyone to not add ordinary libraries to the kernel
tree.

Jason

^ permalink raw reply

* [PATCH v3 1/2] man/man2/mremap.2: describe multiple mapping move
From: Lorenzo Stoakes @ 2025-07-29 13:47 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <cover.1753795807.git.lorenzo.stoakes@oracle.com>

Document the new behaviour introduced in Linux 6.17 whereby it is now
possible to move multiple mappings in a single operation, as long as the
operation is simply an MREMAP_FIXED move - that is old_size is equal to
new_size and MREMAP_FIXED is specified.

To make things clearer, also describe this kind of move operation, before
expanding upon it to describe the newly introduced behaviour.

This change also explains the limitations of of this method and the
possibility of partial failure.

Finally, we pluralise language where it makes sense to do so such that the
documentation does not contradict either this new capability nor the
pre-existing edge case.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 man/man2/mremap.2 | 84 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 73 insertions(+), 11 deletions(-)

diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
index 2168ca728..6ba51310c 100644
--- a/man/man2/mremap.2
+++ b/man/man2/mremap.2
@@ -25,18 +25,47 @@ moving it at the same time (controlled by the
 argument and
 the available virtual address space).
 .P
+Mappings can also simply be moved
+(without any resizing)
+by specifying equal
+.I old_size
+and
+.I new_size
+and using the
+.B MREMAP_FIXED
+flag
+(see below).
+Since Linux 6.17,
+while
+.I old_address
+must reside within a mapping,
+.I old_size
+may span multiple mappings
+which do not have to be
+adjacent to one another when
+performing a move like this.
+The
+.B MREMAP_DONTUNMAP
+flag may be specified.
+.P
+If the operation is not
+simply moving mappings,
+then
+.I old_size
+must span only a single mapping.
+.P
 .I old_address
-is the old address of the virtual memory block that you
-want to expand (or shrink).
+is the old address of the first virtual memory block that you
+want to expand, shrink, and/or move.
 Note that
 .I old_address
 has to be page aligned.
 .I old_size
-is the old size of the
-virtual memory block.
+is the size of the range containing
+virtual memory blocks to be manipulated.
 .I new_size
 is the requested size of the
-virtual memory block after the resize.
+virtual memory blocks after the resize.
 An optional fifth argument,
 .IR new_address ,
 may be provided; see the description of
@@ -105,13 +134,43 @@ If
 is specified, then
 .B MREMAP_MAYMOVE
 must also be specified.
+.IP
+Since Linux 6.17,
+if
+.I old_size
+is equal to
+.I new_size
+and
+.B MREMAP_FIXED
+is specified, then
+.I old_size
+may span beyond the mapping in which
+.I old_address
+resides.
+In this case,
+gaps between mappings in the original range
+are maintained in the new range.
+The whole operation is performed atomically
+unless an error arises,
+in which case the operation may be partially
+completed,
+that is,
+some mappings may be moved and others not.
+.IP
+
+Moving multiple mappings is not permitted if
+any of those mappings have either
+been registered with
+.BR userfaultfd (2) ,
+or map drivers that
+specify their own custom address mapping logic.
 .TP
 .BR MREMAP_DONTUNMAP " (since Linux 5.7)"
 .\" commit e346b3813067d4b17383f975f197a9aa28a3b077
 This flag, which must be used in conjunction with
 .BR MREMAP_MAYMOVE ,
-remaps a mapping to a new address but does not unmap the mapping at
-.IR old_address .
+remaps mappings to a new address but does not unmap them
+from their original address.
 .IP
 The
 .B MREMAP_DONTUNMAP
@@ -149,13 +208,13 @@ mapped.
 See NOTES for some possible applications of
 .BR MREMAP_DONTUNMAP .
 .P
-If the memory segment specified by
+If the memory segments specified by
 .I old_address
 and
 .I old_size
-is locked (using
+are locked (using
 .BR mlock (2)
-or similar), then this lock is maintained when the segment is
+or similar), then this lock is maintained when the segments are
 resized and/or relocated.
 As a consequence, the amount of memory locked by the process may change.
 .SH RETURN VALUE
@@ -188,7 +247,10 @@ virtual memory address for this process.
 You can also get
 .B EFAULT
 even if there exist mappings that cover the
-whole address space requested, but those mappings are of different types.
+whole address space requested, but those mappings are of different types,
+and the
+.BR mremap ()
+operation being performed does not support this.
 .TP
 .B EINVAL
 An invalid argument was given.
--
2.50.1

^ permalink raw reply related

* [PATCH v3 2/2] man/man2/mremap.2: describe previously undocumented shrink behaviour
From: Lorenzo Stoakes @ 2025-07-29 13:47 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <cover.1753795807.git.lorenzo.stoakes@oracle.com>

There is pre-existing logic that appears to be undocumented for an mremap()
shrink operation, where it turns out that the usual 'input range must span
a single mapping' requirement no longer applies.

In fact, it turns out that the input range specified by [old_address,
old_size) may span any number of mappings, as long old_address resides at
or within a mapping and [old_address, new_size) spans only a single
mapping.

Explicitly document this.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 man/man2/mremap.2 | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
index 6ba51310c..bbd807f1b 100644
--- a/man/man2/mremap.2
+++ b/man/man2/mremap.2
@@ -48,8 +48,24 @@ The
 .B MREMAP_DONTUNMAP
 flag may be specified.
 .P
-If the operation is not
-simply moving mappings,
+Equally, if the operation performs a shrink,
+that is if
+.I old_size
+is greater than
+.IR new_size ,
+then
+.I old_size
+may also span multiple mappings
+which do not have to be
+adjacent to one another.
+However in this case,
+.I new_size
+must span only a single mapping.
+.P
+If the operation is neither a
+.B MREMAP_FIXED
+move
+nor a shrink,
 then
 .I old_size
 must span only a single mapping.
-- 
2.50.1


^ permalink raw reply related

* [PATCH v3 0/2] man/man2/mremap.2: describe multiple mapping move, shrink
From: Lorenzo Stoakes @ 2025-07-29 13:47 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api

We have added new functionality to mremap() in Linux 6.17, permitting the
move of multiple VMAs when performing a move alone (that is - providing
MREMAP_MAYMOVE | MREMAP_FIXED flags and specifying old_size == new_size).

We document this new feature.

Additionally, we document previously undocumented behaviour around
shrinking of input VMA ranges which permits the input range to span
multiple VMAs.

v3:
* Use more precise language around mremap() move description as per Jann.
* Fix some typos in commit messages.

v2:
* Split out the two man page changes as requested by Alejandro.
https://lore.kernel.org/all/cover.1753711160.git.lorenzo.stoakes@oracle.com/

v1:
https://lore.kernel.org/all/20250723174634.75054-1-lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (2):
  man/man2/mremap.2: describe multiple mapping move
  man/man2/mremap.2: describe previously undocumented shrink behaviour

 man/man2/mremap.2 | 100 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 89 insertions(+), 11 deletions(-)

--
2.50.1

^ permalink raw reply

* Re: [PATCH v2 1/2] man/man2/mremap.2: describe multiple mapping move
From: Lorenzo Stoakes @ 2025-07-29 12:09 UTC (permalink / raw)
  To: Jann Horn
  Cc: Alejandro Colomar, linux-man, Andrew Morton, Peter Xu,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Pedro Falcato, Rik van Riel, linux-mm,
	linux-kernel, linux-api
In-Reply-To: <CAG48ez2rQfWJwnpAspNr8OtLXgPadG55Re0KoK5ovBKqE3AcbQ@mail.gmail.com>

On Mon, Jul 28, 2025 at 10:34:07PM +0200, Jann Horn wrote:
> On Mon, Jul 28, 2025 at 4:05 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > Document the new behaviour introduced in Linux 6.17 whereby it is now
> > possible to move multiple mappings in a single operation, as long as the
> > operation is purely a move, that is old_size is equal to new_size and
> > MREMAP_FIXED is specified.
> >
> > To make things clearer, also describe this 'pure move' operation, before
> > expanding upon it to describe the newly introduced behaviour.
> >
> > This change also explains the limitations of of this method and the
> > possibility of partial failure.
> >
> > Finally, we pluralise language where it makes sense to so the documentation
> > does not contradict either this new capability nor the pre-existing edge
> > case.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >  man/man2/mremap.2 | 78 ++++++++++++++++++++++++++++++++++++++++-------
> >  1 file changed, 67 insertions(+), 11 deletions(-)
> >
> > diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
> > index 2168ca728..cb3412591 100644
> > --- a/man/man2/mremap.2
> > +++ b/man/man2/mremap.2
> > @@ -25,18 +25,41 @@ moving it at the same time (controlled by the
> >  argument and
> >  the available virtual address space).
> >  .P
> > +Mappings can simply be moved by specifying equal
>
> (Bikeshedding: This "simply" sounds weird to me. If you're trying to
> define a "simple move" with this, the rest of this block is not very
> specific about what exactly that is supposed to be. In my opinion,
> "pure" would also be a nicer word than "simple" if you're looking for
> an expression that means "a move that doesn't do other things".)

Will rephrase, the intent is that I'm saying we can 'simply' perform the
'only move'.

>
> > +.I old_size
> > +and
> > +.I new_size
> > +and specifying
> > +.IR new_address ,
> > +see the description of
> > +.B MREMAP_FIXED
> > +below.
> > +Since Linux 6.17,
> > +while
> > +.I old_address
> > +must reside within a mapping,
> > +.I old_size
> > +may span multiple mappings
> > +which do not have to be
> > +adjacent to one another.
> > +.P
> > +If the operation is not a simple move
> > +then
> > +.I old_size
> > +must span only a single mapping.
>
> I'm reading between the lines that "simple move" is supposed to mean
> "the size is not changing and MREMAP_DONTUNMAP is not set", which then
> implies that in order to actually make anything happen, MREMAP_FIXED
> must be specified?

No, MREMAP_DONTUNMAP can be set, but MREMAP_FIXED must always be set for this to
happen.

Let me rephrase for clarity.

>
> > +.P
> >  .I old_address
> > -is the old address of the virtual memory block that you
> > -want to expand (or shrink).
> > +is the old address of the first virtual memory block that you
> > +want to expand, shrink, and/or move.
> >  Note that
> >  .I old_address
> >  has to be page aligned.
> >  .I old_size
> > -is the old size of the
> > -virtual memory block.
> > +is the size of the range containing
> > +virtual memory blocks to be manipulated.
> >  .I new_size
> >  is the requested size of the
> > -virtual memory block after the resize.
> > +virtual memory blocks after the resize.
> >  An optional fifth argument,
> >  .IR new_address ,
> >  may be provided; see the description of
> > @@ -105,13 +128,43 @@ If
> >  is specified, then
> >  .B MREMAP_MAYMOVE
> >  must also be specified.
> > +.IP
> > +Since Linux 6.17,
> > +if
> > +.I old_size
> > +is equal to
> > +.I new_size
> > +and
> > +.B MREMAP_FIXED
> > +is specified, then
> > +.I old_size
> > +may span beyond the mapping in which
> > +.I old_address
> > +resides.
> > +In this case,
> > +gaps between mappings in the original range
> > +are maintained in the new range.
> > +The whole operation is performed atomically
> > +unless an error arises,
> > +in which case the operation may be partially
> > +completed,
> > +that is,
> > +some mappings may be moved and others not.
>
> This is much clearer to me.

Thanks

^ permalink raw reply

* Re: [PATCH v2 1/2] man/man2/mremap.2: describe multiple mapping move
From: Jann Horn @ 2025-07-28 20:34 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Alejandro Colomar, linux-man, Andrew Morton, Peter Xu,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Pedro Falcato, Rik van Riel, linux-mm,
	linux-kernel, linux-api
In-Reply-To: <386ba8fc99adb7c796d3fc5b867c581d0ad376c7.1753711160.git.lorenzo.stoakes@oracle.com>

On Mon, Jul 28, 2025 at 4:05 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> Document the new behaviour introduced in Linux 6.17 whereby it is now
> possible to move multiple mappings in a single operation, as long as the
> operation is purely a move, that is old_size is equal to new_size and
> MREMAP_FIXED is specified.
>
> To make things clearer, also describe this 'pure move' operation, before
> expanding upon it to describe the newly introduced behaviour.
>
> This change also explains the limitations of of this method and the
> possibility of partial failure.
>
> Finally, we pluralise language where it makes sense to so the documentation
> does not contradict either this new capability nor the pre-existing edge
> case.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  man/man2/mremap.2 | 78 ++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 67 insertions(+), 11 deletions(-)
>
> diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
> index 2168ca728..cb3412591 100644
> --- a/man/man2/mremap.2
> +++ b/man/man2/mremap.2
> @@ -25,18 +25,41 @@ moving it at the same time (controlled by the
>  argument and
>  the available virtual address space).
>  .P
> +Mappings can simply be moved by specifying equal

(Bikeshedding: This "simply" sounds weird to me. If you're trying to
define a "simple move" with this, the rest of this block is not very
specific about what exactly that is supposed to be. In my opinion,
"pure" would also be a nicer word than "simple" if you're looking for
an expression that means "a move that doesn't do other things".)

> +.I old_size
> +and
> +.I new_size
> +and specifying
> +.IR new_address ,
> +see the description of
> +.B MREMAP_FIXED
> +below.
> +Since Linux 6.17,
> +while
> +.I old_address
> +must reside within a mapping,
> +.I old_size
> +may span multiple mappings
> +which do not have to be
> +adjacent to one another.
> +.P
> +If the operation is not a simple move
> +then
> +.I old_size
> +must span only a single mapping.

I'm reading between the lines that "simple move" is supposed to mean
"the size is not changing and MREMAP_DONTUNMAP is not set", which then
implies that in order to actually make anything happen, MREMAP_FIXED
must be specified?

> +.P
>  .I old_address
> -is the old address of the virtual memory block that you
> -want to expand (or shrink).
> +is the old address of the first virtual memory block that you
> +want to expand, shrink, and/or move.
>  Note that
>  .I old_address
>  has to be page aligned.
>  .I old_size
> -is the old size of the
> -virtual memory block.
> +is the size of the range containing
> +virtual memory blocks to be manipulated.
>  .I new_size
>  is the requested size of the
> -virtual memory block after the resize.
> +virtual memory blocks after the resize.
>  An optional fifth argument,
>  .IR new_address ,
>  may be provided; see the description of
> @@ -105,13 +128,43 @@ If
>  is specified, then
>  .B MREMAP_MAYMOVE
>  must also be specified.
> +.IP
> +Since Linux 6.17,
> +if
> +.I old_size
> +is equal to
> +.I new_size
> +and
> +.B MREMAP_FIXED
> +is specified, then
> +.I old_size
> +may span beyond the mapping in which
> +.I old_address
> +resides.
> +In this case,
> +gaps between mappings in the original range
> +are maintained in the new range.
> +The whole operation is performed atomically
> +unless an error arises,
> +in which case the operation may be partially
> +completed,
> +that is,
> +some mappings may be moved and others not.

This is much clearer to me.

^ permalink raw reply

* [PATCH v2 2/2] man/man2/mremap.2: describe previous undocumented shrink behaviour
From: Lorenzo Stoakes @ 2025-07-28 14:04 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <cover.1753711160.git.lorenzo.stoakes@oracle.com>

There is pre-existing logic that appears to be undocumented for an mremap()
shrink operation, where it turns out that the usual 'input range must span
a single mapping' requirement no longer applies.

In fact, it turns out that the input range specified by [old_address,
old_size) may span any number of mappings, as long old_address resides at
or within a mapping and [old_address, new_size) spans only a single
mapping.

Explicitly document this.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 man/man2/mremap.2 | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
index cb3412591..c1a9e7397 100644
--- a/man/man2/mremap.2
+++ b/man/man2/mremap.2
@@ -43,7 +43,22 @@ may span multiple mappings
 which do not have to be
 adjacent to one another.
 .P
-If the operation is not a simple move
+Equally, if the operation performs a shrink,
+that is if
+.I old_size
+is greater than
+.IR new_size ,
+then
+.I old_size
+may also span multiple mappings
+which do not have to be
+adjacent to one another.
+However in this case,
+.I new_size
+must span only a single mapping.
+.P
+If the operation is neither a simple move
+nor a shrink,
 then
 .I old_size
 must span only a single mapping.
-- 
2.50.1


^ permalink raw reply related

* [PATCH v2 1/2] man/man2/mremap.2: describe multiple mapping move
From: Lorenzo Stoakes @ 2025-07-28 14:04 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <cover.1753711160.git.lorenzo.stoakes@oracle.com>

Document the new behaviour introduced in Linux 6.17 whereby it is now
possible to move multiple mappings in a single operation, as long as the
operation is purely a move, that is old_size is equal to new_size and
MREMAP_FIXED is specified.

To make things clearer, also describe this 'pure move' operation, before
expanding upon it to describe the newly introduced behaviour.

This change also explains the limitations of of this method and the
possibility of partial failure.

Finally, we pluralise language where it makes sense to so the documentation
does not contradict either this new capability nor the pre-existing edge
case.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 man/man2/mremap.2 | 78 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 67 insertions(+), 11 deletions(-)

diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
index 2168ca728..cb3412591 100644
--- a/man/man2/mremap.2
+++ b/man/man2/mremap.2
@@ -25,18 +25,41 @@ moving it at the same time (controlled by the
 argument and
 the available virtual address space).
 .P
+Mappings can simply be moved by specifying equal
+.I old_size
+and
+.I new_size
+and specifying
+.IR new_address ,
+see the description of
+.B MREMAP_FIXED
+below.
+Since Linux 6.17,
+while
+.I old_address
+must reside within a mapping,
+.I old_size
+may span multiple mappings
+which do not have to be
+adjacent to one another.
+.P
+If the operation is not a simple move
+then
+.I old_size
+must span only a single mapping.
+.P
 .I old_address
-is the old address of the virtual memory block that you
-want to expand (or shrink).
+is the old address of the first virtual memory block that you
+want to expand, shrink, and/or move.
 Note that
 .I old_address
 has to be page aligned.
 .I old_size
-is the old size of the
-virtual memory block.
+is the size of the range containing
+virtual memory blocks to be manipulated.
 .I new_size
 is the requested size of the
-virtual memory block after the resize.
+virtual memory blocks after the resize.
 An optional fifth argument,
 .IR new_address ,
 may be provided; see the description of
@@ -105,13 +128,43 @@ If
 is specified, then
 .B MREMAP_MAYMOVE
 must also be specified.
+.IP
+Since Linux 6.17,
+if
+.I old_size
+is equal to
+.I new_size
+and
+.B MREMAP_FIXED
+is specified, then
+.I old_size
+may span beyond the mapping in which
+.I old_address
+resides.
+In this case,
+gaps between mappings in the original range
+are maintained in the new range.
+The whole operation is performed atomically
+unless an error arises,
+in which case the operation may be partially
+completed,
+that is,
+some mappings may be moved and others not.
+.IP
+
+Moving multiple mappings is not permitted if
+any of those mappings have either
+been registered with
+.BR userfaultfd (2) ,
+or map drivers that
+specify their own custom address mapping logic.
 .TP
 .BR MREMAP_DONTUNMAP " (since Linux 5.7)"
 .\" commit e346b3813067d4b17383f975f197a9aa28a3b077
 This flag, which must be used in conjunction with
 .BR MREMAP_MAYMOVE ,
-remaps a mapping to a new address but does not unmap the mapping at
-.IR old_address .
+remaps mappings to a new address but does not unmap them
+from their original address.
 .IP
 The
 .B MREMAP_DONTUNMAP
@@ -149,13 +202,13 @@ mapped.
 See NOTES for some possible applications of
 .BR MREMAP_DONTUNMAP .
 .P
-If the memory segment specified by
+If the memory segments specified by
 .I old_address
 and
 .I old_size
-is locked (using
+are locked (using
 .BR mlock (2)
-or similar), then this lock is maintained when the segment is
+or similar), then this lock is maintained when the segments are
 resized and/or relocated.
 As a consequence, the amount of memory locked by the process may change.
 .SH RETURN VALUE
@@ -188,7 +241,10 @@ virtual memory address for this process.
 You can also get
 .B EFAULT
 even if there exist mappings that cover the
-whole address space requested, but those mappings are of different types.
+whole address space requested, but those mappings are of different types,
+and the
+.BR mremap ()
+operation being performed does not support this.
 .TP
 .B EINVAL
 An invalid argument was given.
-- 
2.50.1


^ permalink raw reply related

* [PATCH v2 0/2] man/man2/mremap.2: describe multiple mapping move, shrink
From: Lorenzo Stoakes @ 2025-07-28 14:04 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api

We have added new functionality to mremap() in Linux 6.17, permitting the
move of multiple VMAs when performing a pure move (that is - providing
MREMAP_MAYMOVE | MREMAP_FIXED flags and specifying old_size == new_size).

We document this new feature.

Additionally, we document previously undocumented behaviour around
shrinking of input VMA ranges which permits the input range to span
multiple VMAs.


v2:
* Split out the two man page changes as requested by Alejandro.

v1:
https://lore.kernel.org/all/20250723174634.75054-1-lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (2):
  man/man2/mremap.2: describe multiple mapping move
  man/man2/mremap.2: describe previous undocumented shrink behaviour

 man/man2/mremap.2 | 93 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 82 insertions(+), 11 deletions(-)

--
2.50.1

^ permalink raw reply

* Re: [PATCH v2 04/32] kho: allow to drive kho from within kernel
From: Mike Rapoport @ 2025-07-28 10:18 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu
In-Reply-To: <20250723144649.1696299-5-pasha.tatashin@soleen.com>

On Wed, Jul 23, 2025 at 02:46:17PM +0000, Pasha Tatashin wrote:
> Allow to do finalize and abort from kernel modules, so LUO could
> drive the KHO sequence via its own state machine.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  include/linux/kexec_handover.h | 15 +++++++++
>  kernel/kexec_handover.c        | 58 ++++++++++++++++++++++++++++++++--
>  2 files changed, 71 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h

...

> -static int kho_finalize(void)
> +int kho_abort(void)
> +{
> +	int ret = 0;
> +
> +	if (!kho_enable)
> +		return -EOPNOTSUPP;
> +
> +	mutex_lock(&kho_out.lock);
> +
> +	if (!kho_out.finalized) {
> +		ret = -ENOENT;
> +		goto unlock;
> +	}
> +
> +	ret = __kho_abort();
> +	if (ret)
> +		goto unlock;
> +
> +	kho_out.finalized = false;
> +	ret = kho_out_update_debugfs_fdt();
> +
> +unlock:
> +	mutex_unlock(&kho_out.lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(kho_abort);

I don't think a module should be able to drive KHO. Please drop
EXPORT_SYMBOL_GPL here and for kho_finalize().

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH v2 03/32] kho: warn if KHO is disabled due to an error
From: Mike Rapoport @ 2025-07-28 10:15 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu
In-Reply-To: <20250723144649.1696299-4-pasha.tatashin@soleen.com>

On Wed, Jul 23, 2025 at 02:46:16PM +0000, Pasha Tatashin wrote:
> During boot scratch area is allocated based on command line
> parameters or auto calculated. However, scratch area may fail
> to allocate, and in that case KHO is disabled. Currently,
> no warning is printed that KHO is disabled, which makes it
> confusing for the end user to figure out why KHO is not
> available. Add the missing warning message.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

> ---
>  kernel/kexec_handover.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
> index 1ff6b242f98c..368e23db0a17 100644
> --- a/kernel/kexec_handover.c
> +++ b/kernel/kexec_handover.c
> @@ -565,6 +565,7 @@ static void __init kho_reserve_scratch(void)
>  err_free_scratch_desc:
>  	memblock_free(kho_scratch, kho_scratch_cnt * sizeof(*kho_scratch));
>  err_disable_kho:
> +	pr_warn("Failed to reserve scratch area, disabling kexec handover\n");
>  	kho_enable = false;
>  }
>  
> -- 
> 2.50.0.727.gbf7dc18ff4-goog
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH v2 02/32] kho: mm: Don't allow deferred struct page with KHO
From: Mike Rapoport @ 2025-07-28 10:14 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu
In-Reply-To: <20250723144649.1696299-3-pasha.tatashin@soleen.com>

On Wed, Jul 23, 2025 at 02:46:15PM +0000, Pasha Tatashin wrote:
> KHO uses struct pages for the preserved memory early in boot, however,
> with deferred struct page initialization, only a small portion of
> memory has properly initialized struct pages.
> 
> This problem was detected where vmemmap is poisoned, and illegal flag
> combinations are detected.
> 
> Don't allow them to be enabled together, and later we will have to
> teach KHO to work properly with deferred struct page init kernel
> feature.
> 
> Fixes: 990a950fe8fd ("kexec: add config option for KHO")
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

> ---
>  kernel/Kconfig.kexec | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
> index 2ee603a98813..1224dd937df0 100644
> --- a/kernel/Kconfig.kexec
> +++ b/kernel/Kconfig.kexec
> @@ -97,6 +97,7 @@ config KEXEC_JUMP
>  config KEXEC_HANDOVER
>  	bool "kexec handover"
>  	depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE
> +	depends on !DEFERRED_STRUCT_PAGE_INIT
>  	select MEMBLOCK_KHO_SCRATCH
>  	select KEXEC_FILE
>  	select DEBUG_FS
> -- 
> 2.50.0.727.gbf7dc18ff4-goog
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH v2 01/32] kho: init new_physxa->phys_bits to fix lockdep
From: Mike Rapoport @ 2025-07-28 10:13 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu
In-Reply-To: <20250723144649.1696299-2-pasha.tatashin@soleen.com>

On Wed, Jul 23, 2025 at 02:46:14PM +0000, Pasha Tatashin wrote:
> Lockdep shows the following warning:
> 
> INFO: trying to register non-static key.
> The code is fine but needs lockdep annotation, or maybe
> you didn't initialize this object before use?
> turning off the locking correctness validator.
> 
> [<ffffffff810133a6>] dump_stack_lvl+0x66/0xa0
> [<ffffffff8136012c>] assign_lock_key+0x10c/0x120
> [<ffffffff81358bb4>] register_lock_class+0xf4/0x2f0
> [<ffffffff813597ff>] __lock_acquire+0x7f/0x2c40
> [<ffffffff81360cb0>] ? __pfx_hlock_conflict+0x10/0x10
> [<ffffffff811707be>] ? native_flush_tlb_global+0x8e/0xa0
> [<ffffffff8117096e>] ? __flush_tlb_all+0x4e/0xa0
> [<ffffffff81172fc2>] ? __kernel_map_pages+0x112/0x140
> [<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
> [<ffffffff81359556>] lock_acquire+0xe6/0x280
> [<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
> [<ffffffff8100b9e0>] _raw_spin_lock+0x30/0x40
> [<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
> [<ffffffff813ec327>] xa_load_or_alloc+0x67/0xe0
> [<ffffffff813eb4c0>] kho_preserve_folio+0x90/0x100
> [<ffffffff813ebb7f>] __kho_finalize+0xcf/0x400
> [<ffffffff813ebef4>] kho_finalize+0x34/0x70
> 
> This is becase xa has its own lock, that is not initialized in
> xa_load_or_alloc.
> 
> Modifiy __kho_preserve_order(), to properly call
> xa_init(&new_physxa->phys_bits);
> 
> Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation")
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  kernel/kexec_handover.c | 29 +++++++++++++++++++++++++----
>  1 file changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
> index 5a21dbe17950..1ff6b242f98c 100644
> --- a/kernel/kexec_handover.c
> +++ b/kernel/kexec_handover.c
> @@ -144,14 +144,35 @@ static int __kho_preserve_order(struct kho_mem_track *track, unsigned long pfn,
>  				unsigned int order)
>  {
>  	struct kho_mem_phys_bits *bits;
> -	struct kho_mem_phys *physxa;
> +	struct kho_mem_phys *physxa, *new_physxa;
>  	const unsigned long pfn_high = pfn >> order;
>  
>  	might_sleep();
>  
> -	physxa = xa_load_or_alloc(&track->orders, order, sizeof(*physxa));
> -	if (IS_ERR(physxa))
> -		return PTR_ERR(physxa);
> +	physxa = xa_load(&track->orders, order);
> +	if (!physxa) {
> +		new_physxa = kzalloc(sizeof(*physxa), GFP_KERNEL);
> +		if (!new_physxa)
> +			return -ENOMEM;
> +
> +		xa_init(&new_physxa->phys_bits);
> +		physxa = xa_cmpxchg(&track->orders, order, NULL, new_physxa,
> +				    GFP_KERNEL);
> +		if (xa_is_err(physxa)) {
> +			int err_ret = xa_err(physxa);

Just int err should be fine here, otherwise

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

> +
> +			xa_destroy(&new_physxa->phys_bits);
> +			kfree(new_physxa);
> +
> +			return err_ret;
> +		}
> +		if (physxa) {
> +			xa_destroy(&new_physxa->phys_bits);
> +			kfree(new_physxa);
> +		} else {
> +			physxa = new_physxa;
> +		}
> +	}
>  
>  	bits = xa_load_or_alloc(&physxa->phys_bits, pfn_high / PRESERVE_BITS,
>  				sizeof(*bits));
> -- 
> 2.50.0.727.gbf7dc18ff4-goog
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH] man/man2/mremap.2: describe multiple mapping move, shrink
From: Lorenzo Stoakes @ 2025-07-28  4:34 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <siwe6k4ks44mvdzy7rmir2pmf7547gqxknuoppcn54pkh4lwdb@lko3ecfdjtda>

On Fri, Jul 25, 2025 at 10:44:59PM +0200, Alejandro Colomar wrote:
> Hi Lorenzo,
>
> On Wed, Jul 23, 2025 at 06:46:34PM +0100, Lorenzo Stoakes wrote:
> > There is pre-existing logic that appears to be undocumented for an mremap()
> > shrink operation, where it turns out that the usual 'input range must span
> > a single mapping' requirement no longer applies.
> >
> > In fact, it turns out that the input range specified by [old_address,
> > old_size) may span any number of mappings, as long old_address resides at
> > or within a mapping and [old_address, new_size) spans only a single
> > mapping.
> >
> > Explicitly document this.
> >
> > In addition, document the new behaviour introduced in Linux 6.17 whereby it
> > is now possible to move multiple mappings in a single operation, as long as
> > the operation is purely a move, that is old_size is equal to new_size and
> > MREMAP_FIXED is specified.
>
> Please separate the new behavior into a separate patch.  Each patch
> should change one thing only.

OK will split and send two separate patches. Since this will cause merge pain
otherwise, I'll send it as a series.

^ permalink raw reply

* Re: do_change_type(): refuse to operate on unmounted/not ours mounts
From: Andrei Vagin @ 2025-07-26 21:01 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrei Vagin, Christian Brauner, linux-fsdevel, LKML, criu,
	Linux API, stable
In-Reply-To: <20250726175310.GB222315@ZenIV>

On Sat, Jul 26, 2025 at 10:53 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Sat, Jul 26, 2025 at 10:12:34AM -0700, Andrei Vagin wrote:
> > On Thu, Jul 24, 2025 at 4:00 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > On Thu, Jul 24, 2025 at 01:02:48PM -0700, Andrei Vagin wrote:
> > > > Hi Al and Christian,
> > > >
> > > > The commit 12f147ddd6de ("do_change_type(): refuse to operate on
> > > > unmounted/not ours mounts") introduced an ABI backward compatibility
> > > > break. CRIU depends on the previous behavior, and users are now
> > > > reporting criu restore failures following the kernel update. This change
> > > > has been propagated to stable kernels. Is this check strictly required?
> > >
> > > Yes.
> > >
> > > > Would it be possible to check only if the current process has
> > > > CAP_SYS_ADMIN within the mount user namespace?
> > >
> > > Not enough, both in terms of permissions *and* in terms of "thou
> > > shalt not bugger the kernel data structures - nobody's priveleged
> > > enough for that".
> >
> > Al,
> >
> > I am still thinking in terms of "Thou shalt not break userspace"...
> >
> > Seriously though, this original behavior has been in the kernel for 20
> > years, and it hasn't triggered any corruptions in all that time.
>
> For a very mild example of fun to be had there:
>         mount("none", "/mnt", "tmpfs", 0, "");
>         chdir("/mnt");
>         umount2(".", MNT_DETACH);
>         mount(NULL, ".", NULL, MS_SHARED, NULL);
> Repeat in a loop, watch mount group id leak.  That's a trivial example
> of violating the assertion ("a mount that had been through umount_tree()
> is out of propagation graph and related data structures for good").

I wasn't referring to detached mounts. CRIU modifies mounts from
non-current namespaces.

>
> As for the "CAP_SYS_ADMIN within the mount user namespace" - which
> userns do you have in mind?
>

The user namespace of the target mount:
ns_capable(mnt->mnt_ns->user_ns, CAP_SYS_ADMIN)

^ permalink raw reply

* Re: do_change_type(): refuse to operate on unmounted/not ours mounts
From: Al Viro @ 2025-07-26 17:53 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Christian Brauner, linux-fsdevel, LKML, criu, Linux API, stable
In-Reply-To: <CANaxB-xbsOMkKqfaOJ0Za7-yP2N8axO=E1XS1KufnP78H1YzsA@mail.gmail.com>

On Sat, Jul 26, 2025 at 10:12:34AM -0700, Andrei Vagin wrote:
> On Thu, Jul 24, 2025 at 4:00 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Thu, Jul 24, 2025 at 01:02:48PM -0700, Andrei Vagin wrote:
> > > Hi Al and Christian,
> > >
> > > The commit 12f147ddd6de ("do_change_type(): refuse to operate on
> > > unmounted/not ours mounts") introduced an ABI backward compatibility
> > > break. CRIU depends on the previous behavior, and users are now
> > > reporting criu restore failures following the kernel update. This change
> > > has been propagated to stable kernels. Is this check strictly required?
> >
> > Yes.
> >
> > > Would it be possible to check only if the current process has
> > > CAP_SYS_ADMIN within the mount user namespace?
> >
> > Not enough, both in terms of permissions *and* in terms of "thou
> > shalt not bugger the kernel data structures - nobody's priveleged
> > enough for that".
> 
> Al,
> 
> I am still thinking in terms of "Thou shalt not break userspace"...
> 
> Seriously though, this original behavior has been in the kernel for 20
> years, and it hasn't triggered any corruptions in all that time.

For a very mild example of fun to be had there:
	mount("none", "/mnt", "tmpfs", 0, "");
	chdir("/mnt");
	umount2(".", MNT_DETACH);
	mount(NULL, ".", NULL, MS_SHARED, NULL);
Repeat in a loop, watch mount group id leak.  That's a trivial example
of violating the assertion ("a mount that had been through umount_tree()
is out of propagation graph and related data structures for good").

As for the "CAP_SYS_ADMIN within the mount user namespace" - which
userns do you have in mind?

^ permalink raw reply

* Re: do_change_type(): refuse to operate on unmounted/not ours mounts
From: Andrei Vagin @ 2025-07-26 17:12 UTC (permalink / raw)
  To: Al Viro; +Cc: Christian Brauner, linux-fsdevel, LKML, criu, Linux API, stable
In-Reply-To: <20250724230052.GW2580412@ZenIV>

On Thu, Jul 24, 2025 at 4:00 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Thu, Jul 24, 2025 at 01:02:48PM -0700, Andrei Vagin wrote:
> > Hi Al and Christian,
> >
> > The commit 12f147ddd6de ("do_change_type(): refuse to operate on
> > unmounted/not ours mounts") introduced an ABI backward compatibility
> > break. CRIU depends on the previous behavior, and users are now
> > reporting criu restore failures following the kernel update. This change
> > has been propagated to stable kernels. Is this check strictly required?
>
> Yes.
>
> > Would it be possible to check only if the current process has
> > CAP_SYS_ADMIN within the mount user namespace?
>
> Not enough, both in terms of permissions *and* in terms of "thou
> shalt not bugger the kernel data structures - nobody's priveleged
> enough for that".

Al,

I am still thinking in terms of "Thou shalt not break userspace"...

Seriously though, this original behavior has been in the kernel for 20
years, and it hasn't triggered any corruptions in all that time. I
understand this change might be necessary in its current form, and
that some collateral damage could be unavoidable. But if that's the
case, I'd expect a detailed explanation of why it had to be so and why
userspace breakage is unavoidable.

The original change was merged two decades ago. We need to
consider that some applications might rely on that behavior. I'm not
questioning the security aspect - that must be addressed. But for
anything else, we need to minimize the impact on user applications that
don't violate security.

We can consider a cleaner fix for the upstream kernel, but when we are
talking about stable kernels, the user-space backward compatibility
aspect should be even more critical.

Thanks,
Andrei

^ permalink raw reply

* Re: [PATCH] man/man2/mremap.2: describe multiple mapping move, shrink
From: Alejandro Colomar @ 2025-07-25 20:44 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <20250723174634.75054-1-lorenzo.stoakes@oracle.com>

[-- Attachment #1: Type: text/plain, Size: 5732 bytes --]

Hi Lorenzo,

On Wed, Jul 23, 2025 at 06:46:34PM +0100, Lorenzo Stoakes wrote:
> There is pre-existing logic that appears to be undocumented for an mremap()
> shrink operation, where it turns out that the usual 'input range must span
> a single mapping' requirement no longer applies.
> 
> In fact, it turns out that the input range specified by [old_address,
> old_size) may span any number of mappings, as long old_address resides at
> or within a mapping and [old_address, new_size) spans only a single
> mapping.
> 
> Explicitly document this.
> 
> In addition, document the new behaviour introduced in Linux 6.17 whereby it
> is now possible to move multiple mappings in a single operation, as long as
> the operation is purely a move, that is old_size is equal to new_size and
> MREMAP_FIXED is specified.

Please separate the new behavior into a separate patch.  Each patch
should change one thing only.


Have a lovely night!
Alex

> 
> To make things clearer, also describe this 'pure move' operation, before
> expanding upon it to describe the newly introduced behaviour.
> 
> This change also explains the limitations of of this method and the
> possibility of partial failure.
> 
> Finally, we pluralise language where it makes sense to so the documentation
> does not contradict either this new capability nor the pre-existing edge
> case.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  man/man2/mremap.2 | 93 +++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 82 insertions(+), 11 deletions(-)
> 
> diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
> index 2168ca728..c1a9e7397 100644
> --- a/man/man2/mremap.2
> +++ b/man/man2/mremap.2
> @@ -25,18 +25,56 @@ moving it at the same time (controlled by the
>  argument and
>  the available virtual address space).
>  .P
> +Mappings can simply be moved by specifying equal
> +.I old_size
> +and
> +.I new_size
> +and specifying
> +.IR new_address ,
> +see the description of
> +.B MREMAP_FIXED
> +below.
> +Since Linux 6.17,
> +while
>  .I old_address
> -is the old address of the virtual memory block that you
> -want to expand (or shrink).
> +must reside within a mapping,
> +.I old_size
> +may span multiple mappings
> +which do not have to be
> +adjacent to one another.
> +.P
> +Equally, if the operation performs a shrink,
> +that is if
> +.I old_size
> +is greater than
> +.IR new_size ,
> +then
> +.I old_size
> +may also span multiple mappings
> +which do not have to be
> +adjacent to one another.
> +However in this case,
> +.I new_size
> +must span only a single mapping.
> +.P
> +If the operation is neither a simple move
> +nor a shrink,
> +then
> +.I old_size
> +must span only a single mapping.
> +.P
> +.I old_address
> +is the old address of the first virtual memory block that you
> +want to expand, shrink, and/or move.
>  Note that
>  .I old_address
>  has to be page aligned.
>  .I old_size
> -is the old size of the
> -virtual memory block.
> +is the size of the range containing
> +virtual memory blocks to be manipulated.
>  .I new_size
>  is the requested size of the
> -virtual memory block after the resize.
> +virtual memory blocks after the resize.
>  An optional fifth argument,
>  .IR new_address ,
>  may be provided; see the description of
> @@ -105,13 +143,43 @@ If
>  is specified, then
>  .B MREMAP_MAYMOVE
>  must also be specified.
> +.IP
> +Since Linux 6.17,
> +if
> +.I old_size
> +is equal to
> +.I new_size
> +and
> +.B MREMAP_FIXED
> +is specified, then
> +.I old_size
> +may span beyond the mapping in which
> +.I old_address
> +resides.
> +In this case,
> +gaps between mappings in the original range
> +are maintained in the new range.
> +The whole operation is performed atomically
> +unless an error arises,
> +in which case the operation may be partially
> +completed,
> +that is,
> +some mappings may be moved and others not.
> +.IP
> +
> +Moving multiple mappings is not permitted if
> +any of those mappings have either
> +been registered with
> +.BR userfaultfd (2) ,
> +or map drivers that
> +specify their own custom address mapping logic.
>  .TP
>  .BR MREMAP_DONTUNMAP " (since Linux 5.7)"
>  .\" commit e346b3813067d4b17383f975f197a9aa28a3b077
>  This flag, which must be used in conjunction with
>  .BR MREMAP_MAYMOVE ,
> -remaps a mapping to a new address but does not unmap the mapping at
> -.IR old_address .
> +remaps mappings to a new address but does not unmap them
> +from their original address.
>  .IP
>  The
>  .B MREMAP_DONTUNMAP
> @@ -149,13 +217,13 @@ mapped.
>  See NOTES for some possible applications of
>  .BR MREMAP_DONTUNMAP .
>  .P
> -If the memory segment specified by
> +If the memory segments specified by
>  .I old_address
>  and
>  .I old_size
> -is locked (using
> +are locked (using
>  .BR mlock (2)
> -or similar), then this lock is maintained when the segment is
> +or similar), then this lock is maintained when the segments are
>  resized and/or relocated.
>  As a consequence, the amount of memory locked by the process may change.
>  .SH RETURN VALUE
> @@ -188,7 +256,10 @@ virtual memory address for this process.
>  You can also get
>  .B EFAULT
>  even if there exist mappings that cover the
> -whole address space requested, but those mappings are of different types.
> +whole address space requested, but those mappings are of different types,
> +and the
> +.BR mremap ()
> +operation being performed does not support this.
>  .TP
>  .B EINVAL
>  An invalid argument was given.
> -- 
> 2.50.1
> 
> 

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs
From: Lorenzo Stoakes @ 2025-07-25 19:59 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Peter Xu, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Pedro Falcato,
	Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
	linux-kselftest, Linux API
In-Reply-To: <CAG48ez3qB7W3JqjrkkQ3SRdQNza3Q9noqkgmBg=3F_8vhwQ4gQ@mail.gmail.com>

On Fri, Jul 25, 2025 at 09:10:25PM +0200, Jann Horn wrote:
> On Fri, Jul 25, 2025 at 7:28 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > On Fri, Jul 25, 2025 at 07:11:49PM +0200, Jann Horn wrote:
> > > On Fri, Jul 11, 2025 at 1:38 PM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > > Note that any failures encountered will result in a partial move. Since an
> > > > mremap() can fail at any time, this might result in only some of the VMAs
> > > > being moved.
> > > >
> > > > Note that failures are very rare and typically require an out of a memory
> > > > condition or a mapping limit condition to be hit, assuming the VMAs being
> > > > moved are valid.
> > >
> > > Hrm. So if userspace tries to move a series of VMAs with mremap(), and
> > > the operation fails, and userspace assumes the old syscall semantics,
> > > userspace could assume that its memory is still at the old address,
> > > when that's actually not true; and if userspace tries to access it
> > > there, userspace UAF happens?
> >
> > At 6pm on the last day of the cycle? :) dude :) this long week gets ever
> > longer...
>
> To be clear, I very much do not expect you to instantly reply to
> random patch review mail I send you late on a Friday evening. :P

Haha sure, just keen to clarify things on this!

>
> > Otherwise for mapping limit we likely hit it right away. I moved all the
> > checks up front for standard VMA/param errors.
>
> Ah, I missed that part.

Yeah the refactoring part of the series prior to this patch goes to great
lengths to prepare for this (including in moving tests earlier - all of
which I confirmed were good to be done earlier).

:)

^ permalink raw reply

* Re: [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs
From: Jann Horn @ 2025-07-25 19:10 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Peter Xu, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Pedro Falcato,
	Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
	linux-kselftest, Linux API
In-Reply-To: <892e3e49-dbcd-4c1f-9966-c004d63f52df@lucifer.local>

On Fri, Jul 25, 2025 at 7:28 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> On Fri, Jul 25, 2025 at 07:11:49PM +0200, Jann Horn wrote:
> > On Fri, Jul 11, 2025 at 1:38 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > > Note that any failures encountered will result in a partial move. Since an
> > > mremap() can fail at any time, this might result in only some of the VMAs
> > > being moved.
> > >
> > > Note that failures are very rare and typically require an out of a memory
> > > condition or a mapping limit condition to be hit, assuming the VMAs being
> > > moved are valid.
> >
> > Hrm. So if userspace tries to move a series of VMAs with mremap(), and
> > the operation fails, and userspace assumes the old syscall semantics,
> > userspace could assume that its memory is still at the old address,
> > when that's actually not true; and if userspace tries to access it
> > there, userspace UAF happens?
>
> At 6pm on the last day of the cycle? :) dude :) this long week gets ever
> longer...

To be clear, I very much do not expect you to instantly reply to
random patch review mail I send you late on a Friday evening. :P

> Otherwise for mapping limit we likely hit it right away. I moved all the
> checks up front for standard VMA/param errors.

Ah, I missed that part.

^ permalink raw reply

* Re: [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs
From: Lorenzo Stoakes @ 2025-07-25 17:27 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Peter Xu, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Pedro Falcato,
	Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
	linux-kselftest, Linux API
In-Reply-To: <CAG48ez0KjHHAWsJo76GuuYYaFCH=3n7axN2ryxy7-Vabp5JA-Q@mail.gmail.com>

On Fri, Jul 25, 2025 at 07:11:49PM +0200, Jann Horn wrote:
> On Fri, Jul 11, 2025 at 1:38 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > Note that any failures encountered will result in a partial move. Since an
> > mremap() can fail at any time, this might result in only some of the VMAs
> > being moved.
> >
> > Note that failures are very rare and typically require an out of a memory
> > condition or a mapping limit condition to be hit, assuming the VMAs being
> > moved are valid.
>
> Hrm. So if userspace tries to move a series of VMAs with mremap(), and
> the operation fails, and userspace assumes the old syscall semantics,
> userspace could assume that its memory is still at the old address,
> when that's actually not true; and if userspace tries to access it
> there, userspace UAF happens?

At 6pm on the last day of the cycle? :) dude :) this long week gets ever
longer...

I doubt any logic like this really exists, since mremap() is actually quite hard
to fail, and it'll likely be a mapping limit or oom issue (the latter being 'too
small to fail' so would mean your system is about to die anyway).

So it'd be very strange to then rely on that.

And the _usual_ sensible reasons why this might fail, would likely fail for
the first mapping (mapping limit, you'd probably hit it only on that one,
etc.)

The _unusual_ errors are typically user errors - you try to use with uffd
mappings, etc. well then that's not a functional program and we don't need
to worry.

And I sent a manpage change which very explicitly describes this behaviour.

In any case there is simply _no way_ to not do this.

If the failure's because of oom, we're screwed anyway and user segfault is
inevitable, we'll get into a pickle trying to move back.

Otherwise for mapping limit we likely hit it right away. I moved all the
checks up front for standard VMA/param errors.

The other kinds of errors would require you to try to move normal VMAs
right next to driver VMAs or a mix of uffd and not uffd VMAs or something
that'll be your fault.

So basically it'll nearly never happen, and it doesn't make much sense for
code to rely on this failing.

>
> If we were explicitly killing the userspace process on this error
> path, that'd be fine; but since we're just returning an error, we're
> kind of making userspace believe that the move hasn't happened? (You
> might notice that I'm generally in favor of killing userspace
> processes when userspace does sufficiently weird things.)

Well if we get them to segfault... :P

I think it's userspace's fault if they try to 'recover' based on shakey
assumptions.

>
> I guess it's not going to happen particularly often since mremap()
> with MREMAP_FIXED is a weirdly specific operation in the first place;
> normal users of mremap() (like libc's realloc()) wouldn't have a
> reason to use it...

Yes, and you would have to be using it such a way that things would have
broken before for very specific reasons.

I think the only truly viable 'if this fails assume still in place' might
very well be if the VMAs happened to be fragmented, which obviously this
now changes :)

So I don't think there's an issue here. Additionally it's very much 'the
kernel way' that partially failed aggregate operations don't unwind.

The key thing about mremap is that each individual move will be unwound
such that page tables remain valid...

^ permalink raw reply

* Re: [PATCH v3 09/10] mm/mremap: permit mremap() move of multiple VMAs
From: Jann Horn @ 2025-07-25 17:11 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Peter Xu, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Pedro Falcato,
	Rik van Riel, linux-mm, linux-fsdevel, linux-kernel,
	linux-kselftest, Linux API
In-Reply-To: <8f41e72b0543953d277e96d5e67a52f287cdbac3.1752232673.git.lorenzo.stoakes@oracle.com>

On Fri, Jul 11, 2025 at 1:38 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> Note that any failures encountered will result in a partial move. Since an
> mremap() can fail at any time, this might result in only some of the VMAs
> being moved.
>
> Note that failures are very rare and typically require an out of a memory
> condition or a mapping limit condition to be hit, assuming the VMAs being
> moved are valid.

Hrm. So if userspace tries to move a series of VMAs with mremap(), and
the operation fails, and userspace assumes the old syscall semantics,
userspace could assume that its memory is still at the old address,
when that's actually not true; and if userspace tries to access it
there, userspace UAF happens?

If we were explicitly killing the userspace process on this error
path, that'd be fine; but since we're just returning an error, we're
kind of making userspace believe that the move hasn't happened? (You
might notice that I'm generally in favor of killing userspace
processes when userspace does sufficiently weird things.)

I guess it's not going to happen particularly often since mremap()
with MREMAP_FIXED is a weirdly specific operation in the first place;
normal users of mremap() (like libc's realloc()) wouldn't have a
reason to use it...

^ permalink raw reply

* Re: [PATCH RFC v2 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl
From: Aleksa Sarai @ 2025-07-25  2:24 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Jan Kara, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-api, linux-doc,
	linux-kselftest
In-Reply-To: <20250724-beobachten-verfassen-9a39c0318341@brauner>

[-- Attachment #1: Type: text/plain, Size: 8852 bytes --]

On 2025-07-24, Christian Brauner <brauner@kernel.org> wrote:
> On Wed, Jul 23, 2025 at 09:18:53AM +1000, Aleksa Sarai wrote:
> > /proc has historically had very opaque semantics about PID namespaces,
> > which is a little unfortunate for container runtimes and other programs
> > that deal with switching namespaces very often. One common issue is that
> > of converting between PIDs in the process's namespace and PIDs in the
> > namespace of /proc.
> > 
> > In principle, it is possible to do this today by opening a pidfd with
> > pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
> > contain a PID value translated to the pid namespace associated with that
> > procfs superblock). However, allocating a new file for each PID to be
> > converted is less than ideal for programs that may need to scan procfs,
> > and it is generally useful for userspace to be able to finally get this
> > information from procfs.
> > 
> > So, add a new API for this in the form of an ioctl(2) you can call on
> > the root directory of procfs. The returned file descriptor will have
> > O_CLOEXEC set. This acts as a sister feature to the new "pidns" mount
> > option, finally allowing userspace full control of the pid namespaces
> > associated with procfs instances.
> > 
> > The permission model for this is a bit looser than that of the "pidns"
> > mount option, but this is mainly because /proc/1/ns/pid provides the
> > same information, so as long as you have access to that magic-link (or
> > something equivalently reasonable such as privileges with CAP_SYS_ADMIN
> > or being in an ancestor pid namespace) it makes sense to allow userspace
> > to grab a handle. setns(2) will still have their own permission checks,
> > so being able to open a pidns handle doesn't really provide too many
> > other capabilities.
> > 
> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> > ---
> >  Documentation/filesystems/proc.rst |  4 +++
> >  fs/proc/root.c                     | 54 ++++++++++++++++++++++++++++++++++++--
> >  include/uapi/linux/fs.h            |  3 +++
> >  3 files changed, 59 insertions(+), 2 deletions(-)
> > 
> > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > index c520b9f8a3fd..506383273c9d 100644
> > --- a/Documentation/filesystems/proc.rst
> > +++ b/Documentation/filesystems/proc.rst
> > @@ -2398,6 +2398,10 @@ pidns= specifies a pid namespace (either as a string path to something like
> >  will be used by the procfs instance when translating pids. By default, procfs
> >  will use the calling process's active pid namespace.
> >  
> > +Processes can check which pid namespace is used by a procfs instance by using
> > +the `PROCFS_GET_PID_NAMESPACE` ioctl() on the root directory of the procfs
> > +instance.
> > +
> >  Chapter 5: Filesystem behavior
> >  ==============================
> >  
> > diff --git a/fs/proc/root.c b/fs/proc/root.c
> > index 057c8a125c6e..548a57ec2152 100644
> > --- a/fs/proc/root.c
> > +++ b/fs/proc/root.c
> > @@ -23,8 +23,10 @@
> >  #include <linux/cred.h>
> >  #include <linux/magic.h>
> >  #include <linux/slab.h>
> > +#include <linux/ptrace.h>
> >  
> >  #include "internal.h"
> > +#include "../internal.h"
> >  
> >  struct proc_fs_context {
> >  	struct pid_namespace	*pid_ns;
> > @@ -418,15 +420,63 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
> >  	return proc_pid_readdir(file, ctx);
> >  }
> >  
> > +static long int proc_root_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> > +{
> > +	switch (cmd) {
> > +#ifdef CONFIG_PID_NS
> > +	case PROCFS_GET_PID_NAMESPACE: {
> > +		struct pid_namespace *active = task_active_pid_ns(current);
> > +		struct pid_namespace *ns = proc_pid_ns(file_inode(filp)->i_sb);
> > +		bool can_access_pidns = false;
> > +
> > +		/*
> > +		 * If we are in an ancestors of the pidns, or have join
> > +		 * privileges (CAP_SYS_ADMIN), then it makes sense that we
> > +		 * would be able to grab a handle to the pidns.
> > +		 *
> > +		 * Otherwise, if there is a root process, then being able to
> > +		 * access /proc/$pid/ns/pid is equivalent to this ioctl and so
> > +		 * we should probably match the permission model. For empty
> > +		 * namespaces it seems unlikely for there to be a downside to
> > +		 * allowing unprivileged users to open a handle to it (setns
> > +		 * will fail for unprivileged users anyway).
> > +		 */
> > +		can_access_pidns = pidns_is_ancestor(ns, active) ||
> > +				   ns_capable(ns->user_ns, CAP_SYS_ADMIN);
> 
> This seems to imply that if @ns is a descendant of @active that the
> caller holds privileges over it. Is that actually always true?
> 
> IOW, why is the check different from the previous pidns= mount option
> check. I would've expected:
> 
> ns_capable(_no_audit)(ns->user_ns) && pidns_is_ancestor(ns, active)
> 
> and then the ptrace check as a fallback.

That would mirror pidns_install(), and I did think about it. The primary
(mostly handwave-y) reasoning I had for making it less strict was that:

 * If you are in an ancestor pidns, then you can already see those
   processes in your own /proc. In theory that means that you will be
   able to access /proc/$pid/ns/pid for at least some subprocess there
   (even if some subprocesses have SUID_DUMP_DISABLE, that flag is
   cleared on ).

   Though hypothetically if they are all running as a different user,
   this does not apply (and you could create scenarios where a child
   pidns is owned by a userns that you do not have privileges over -- if
   you deal with setuid binaries). Maybe that risk means we should just
   combine them, I'm not sure.

 * If you have CAP_SYS_ADMIN permissions over the pidns, it seems
   strange to disallow access even if it is not in an ancestor
   namespace. This is distinct to pidns_install(), where you want to
   ensure you cannot escape to a parent pid namespace, this is about
   getting a handle to do other operations (i.e. NS_GET_{P,TG}ID_*_PIDNS).

Maybe they should be combined to match pidns_install(), but then I would
expect the ptrace_may_access() check to apply to all processes in the
pidns to make it less restrictive, which is not something you can
practically do (and there is a higher chance that pid1 will have
SUID_DUMP_DISABLE than some random subprocess, which almost certainly
will not be SUID_DUMP_DISABLE).

Fundamentally, I guess I'm still trying to see what the risk is of
allowing a process to get a handle to a pidns that they have some kind
of privilege over (whether it's CAP_SYS_ADMIN, or by the virtue of being
able to see and address all processes in the namespace, or by being able
to open /proc/$pidns_pid1/ns/pid anyway) but cannot join.

Then again, maybe the fact that it is kind of strange to explain is
enough of a reason to just make it simpler...

> > +		if (!can_access_pidns) {
> > +			bool cannot_ptrace_pid1 = false;
> > +
> > +			read_lock(&tasklist_lock);
> > +			if (ns->child_reaper)
> > +				cannot_ptrace_pid1 = ptrace_may_access(ns->child_reaper,
> > +								       PTRACE_MODE_READ_FSCREDS);
> > +			read_unlock(&tasklist_lock);
> > +			can_access_pidns = !cannot_ptrace_pid1;
> > +		}
> > +		if (!can_access_pidns)
> > +			return -EPERM;
> > +
> > +		/* open_namespace() unconditionally consumes the reference. */
> > +		get_pid_ns(ns);
> > +		return open_namespace(to_ns_common(ns));
> > +	}
> > +#endif /* CONFIG_PID_NS */
> > +	default:
> > +		return -ENOIOCTLCMD;
> > +	}
> > +}
> > +
> >  /*
> >   * The root /proc directory is special, as it has the
> >   * <pid> directories. Thus we don't use the generic
> >   * directory handling functions for that..
> >   */
> >  static const struct file_operations proc_root_operations = {
> > -	.read		 = generic_read_dir,
> > -	.iterate_shared	 = proc_root_readdir,
> > +	.read		= generic_read_dir,
> > +	.iterate_shared	= proc_root_readdir,
> >  	.llseek		= generic_file_llseek,
> > +	.unlocked_ioctl = proc_root_ioctl,
> > +	.compat_ioctl   = compat_ptr_ioctl,
> >  };
> >  
> >  /*
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index 0bd678a4a10e..aa642cb48feb 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -437,6 +437,9 @@ typedef int __bitwise __kernel_rwf_t;
> >  
> >  #define PROCFS_IOCTL_MAGIC 'f'
> >  
> > +/* procfs root ioctls */
> > +#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 1)
> > +
> >  /* Pagemap ioctl */
> >  #define PAGEMAP_SCAN	_IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
> >  
> > 
> > -- 
> > 2.50.0
> > 

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH RFC v2 2/4] procfs: add "pidns" mount option
From: Aleksa Sarai @ 2025-07-25  2:13 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Jan Kara, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-api, linux-doc,
	linux-kselftest
In-Reply-To: <20250724-ammoniak-gepinselt-6dd6255c2368@brauner>

[-- Attachment #1: Type: text/plain, Size: 9153 bytes --]

On 2025-07-24, Christian Brauner <brauner@kernel.org> wrote:
> On Wed, Jul 23, 2025 at 09:18:52AM +1000, Aleksa Sarai wrote:
> > Since the introduction of pid namespaces, their interaction with procfs
> > has been entirely implicit in ways that require a lot of dancing around
> > by programs that need to construct sandboxes with different PID
> > namespaces.
> > 
> > Being able to explicitly specify the pid namespace to use when
> > constructing a procfs super block will allow programs to no longer need
> > to fork off a process which does then does unshare(2) / setns(2) and
> > forks again in order to construct a procfs in a pidns.
> > 
> > So, provide a "pidns" mount option which allows such users to just
> > explicitly state which pid namespace they want that procfs instance to
> > use. This interface can be used with fsconfig(2) either with a file
> > descriptor or a path:
> > 
> >   fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
> >   fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
> 
> Fwiw, namespace mount options could just be VFS generic mount options.
> But it's not something that we need to solve right now.

Yeah if we add this to sysfs it probably should be made generic, but
let's punt this to later. :D

> > or with classic mount(2) / mount(8):
> > 
> >   // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
> >   mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
> > 
> > As this new API is effectively shorthand for setns(2) followed by
> > mount(2), the permission model for this mirrors pidns_install() to avoid
> > opening up new attack surfaces by loosening the existing permission
> > model.
> > 
> > Note that the mount infrastructure also allows userspace to reconfigure
> > the pidns of an existing procfs mount, which may or may not be useful to
> > some users.
> > 
> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> > ---
> >  Documentation/filesystems/proc.rst |  6 +++
> >  fs/proc/root.c                     | 90 +++++++++++++++++++++++++++++++++++---
> >  2 files changed, 90 insertions(+), 6 deletions(-)
> > 
> > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > index 5236cb52e357..c520b9f8a3fd 100644
> > --- a/Documentation/filesystems/proc.rst
> > +++ b/Documentation/filesystems/proc.rst
> > @@ -2360,6 +2360,7 @@ The following mount options are supported:
> >  	hidepid=	Set /proc/<pid>/ access mode.
> >  	gid=		Set the group authorized to learn processes information.
> >  	subset=		Show only the specified subset of procfs.
> > +	pidns=		Specify a the namespace used by this procfs.
> >  	=========	========================================================
> >  
> >  hidepid=off or hidepid=0 means classic mode - everybody may access all
> > @@ -2392,6 +2393,11 @@ information about processes information, just add identd to this group.
> >  subset=pid hides all top level files and directories in the procfs that
> >  are not related to tasks.
> >  
> > +pidns= specifies a pid namespace (either as a string path to something like
> > +`/proc/$pid/ns/pid`, or a file descriptor when using `FSCONFIG_SET_FD`) that
> > +will be used by the procfs instance when translating pids. By default, procfs
> > +will use the calling process's active pid namespace.
> > +
> >  Chapter 5: Filesystem behavior
> >  ==============================
> >  
> > diff --git a/fs/proc/root.c b/fs/proc/root.c
> > index ed86ac710384..057c8a125c6e 100644
> > --- a/fs/proc/root.c
> > +++ b/fs/proc/root.c
> > @@ -38,12 +38,18 @@ enum proc_param {
> >  	Opt_gid,
> >  	Opt_hidepid,
> >  	Opt_subset,
> > +#ifdef CONFIG_PID_NS
> > +	Opt_pidns,
> > +#endif
> >  };
> >  
> >  static const struct fs_parameter_spec proc_fs_parameters[] = {
> > -	fsparam_u32("gid",	Opt_gid),
> > +	fsparam_u32("gid",		Opt_gid),
> >  	fsparam_string("hidepid",	Opt_hidepid),
> >  	fsparam_string("subset",	Opt_subset),
> > +#ifdef CONFIG_PID_NS
> > +	fsparam_file_or_string("pidns",	Opt_pidns),
> > +#endif
> >  	{}
> >  };
> >  
> > @@ -109,11 +115,67 @@ static int proc_parse_subset_param(struct fs_context *fc, char *value)
> >  	return 0;
> >  }
> >  
> > +#ifdef CONFIG_PID_NS
> > +static int proc_parse_pidns_param(struct fs_context *fc,
> > +				  struct fs_parameter *param,
> > +				  struct fs_parse_result *result)
> > +{
> > +	struct proc_fs_context *ctx = fc->fs_private;
> > +	struct pid_namespace *target, *active = task_active_pid_ns(current);
> > +	struct ns_common *ns;
> > +	struct file *ns_filp __free(fput) = NULL;
> > +
> > +	switch (param->type) {
> > +	case fs_value_is_file:
> > +		/* came throug fsconfig, steal the file reference */
> > +		ns_filp = param->file;
> > +		param->file = NULL;
> 
> This can be shortened to:
> 
> ns_filp = no_free_ptr(param->file);

I really need to take a closer look at <linux/cleanup.h>, each time I
look at it I learn about another handy helper.

> > +		break;
> > +	case fs_value_is_string:
> > +		ns_filp = filp_open(param->string, O_RDONLY, 0);
> > +		break;
> > +	default:
> > +		WARN_ON_ONCE(true);
> > +		break;
> > +	}
> > +	if (!ns_filp)
> > +		ns_filp = ERR_PTR(-EBADF);
> > +	if (IS_ERR(ns_filp)) {
> > +		errorfc(fc, "could not get file from pidns argument");
> > +		return PTR_ERR(ns_filp);
> > +	}
> > +
> > +	if (!proc_ns_file(ns_filp))
> > +		return invalfc(fc, "pidns argument is not an nsfs file");
> > +	ns = get_proc_ns(file_inode(ns_filp));
> > +	if (ns->ops->type != CLONE_NEWPID)
> > +		return invalfc(fc, "pidns argument is not a pidns file");
> > +	target = container_of(ns, struct pid_namespace, ns);
> > +
> > +	/*
> > +	 * pidns= is shorthand for joining the pidns to get a fsopen fd, so the
> > +	 * permission model should be the same as pidns_install().
> > +	 */
> > +	if (!ns_capable(target->user_ns, CAP_SYS_ADMIN)) {
> > +		errorfc(fc, "insufficient permissions to set pidns");
> > +		return -EPERM;
> > +	}
> > +	if (!pidns_is_ancestor(target, active))
> > +		return invalfc(fc, "cannot set pidns to non-descendant pidns");
> 
> This made me think. If one rewrote this as:
> 
> if (!ns_capable(task_active_pidns(current)->user_ns, CAP_SYS_ADMIN))
> 
> if (!pidns_is_ancestor(target, active))
> 
> that would also work, right? IOW, you'd be checking whether you're
> capable over your current pid namespace owning userns and if the target
> pidns is an ancestor it's also implied by the first check that you're
> capable over it.
> 
> The only way this would not be true is if a descendant pidns would be
> owned by a userns over which you don't hold privileges and I wondered
> whether that's even possible? I don't think it is but maybe you see a
> way.

Well, if you run a setuid binary, it could create a pidns that is a
child but is owned by a more privileged userns than you. My main goal
here was to just mirror pidns_install() exactly, to make sure that the
permission model was identical.

> > +
> > +	put_pid_ns(ctx->pid_ns);
> > +	ctx->pid_ns = get_pid_ns(target);
> > +	put_user_ns(fc->user_ns);
> > +	fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
> > +	return 0;
> > +}
> > +#endif /* CONFIG_PID_NS */
> > +
> >  static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
> >  {
> >  	struct proc_fs_context *ctx = fc->fs_private;
> >  	struct fs_parse_result result;
> > -	int opt;
> > +	int opt, err;
> >  
> >  	opt = fs_parse(fc, proc_fs_parameters, param, &result);
> >  	if (opt < 0)
> > @@ -125,14 +187,24 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
> >  		break;
> >  
> >  	case Opt_hidepid:
> > -		if (proc_parse_hidepid_param(fc, param))
> > -			return -EINVAL;
> > +		err = proc_parse_hidepid_param(fc, param);
> > +		if (err)
> > +			return err;
> >  		break;
> >  
> >  	case Opt_subset:
> > -		if (proc_parse_subset_param(fc, param->string) < 0)
> > -			return -EINVAL;
> > +		err = proc_parse_subset_param(fc, param->string);
> > +		if (err)
> > +			return err;
> > +		break;
> > +
> > +#ifdef CONFIG_PID_NS
> > +	case Opt_pidns:
> 
> I think it would be easier if we returned EOPNOTSUPP when !CONFIG_PID_NS
> instead of EINVALing this?
> 
> > +		err = proc_parse_pidns_param(fc, param, &result);
> > +		if (err)
> > +			return err;
> >  		break;
> > +#endif
> >  
> >  	default:
> >  		return -EINVAL;
> > @@ -154,6 +226,12 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
> >  		fs_info->hide_pid = ctx->hidepid;
> >  	if (ctx->mask & (1 << Opt_subset))
> >  		fs_info->pidonly = ctx->pidonly;
> > +#ifdef CONFIG_PID_NS
> > +	if (ctx->mask & (1 << Opt_pidns)) {
> > +		put_pid_ns(fs_info->pid_ns);
> > +		fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
> > +	}
> > +#endif
> >  }
> >  
> >  static int proc_fill_super(struct super_block *s, struct fs_context *fc)
> > 
> > -- 
> > 2.50.0
> > 

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* [PATCH v3 4/4] selftests/proc: add tests for new pidns APIs
From: Aleksa Sarai @ 2025-07-24  8:32 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Jonathan Corbet,
	Shuah Khan
  Cc: Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest, Aleksa Sarai
In-Reply-To: <20250724-procfs-pidns-api-v3-0-4c685c910923@cyphar.com>

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 tools/testing/selftests/proc/.gitignore   |   1 +
 tools/testing/selftests/proc/Makefile     |   1 +
 tools/testing/selftests/proc/proc-pidns.c | 252 ++++++++++++++++++++++++++++++
 3 files changed, 254 insertions(+)

diff --git a/tools/testing/selftests/proc/.gitignore b/tools/testing/selftests/proc/.gitignore
index 973968f45bba..2dced03e9e0e 100644
--- a/tools/testing/selftests/proc/.gitignore
+++ b/tools/testing/selftests/proc/.gitignore
@@ -17,6 +17,7 @@
 /proc-tid0
 /proc-uptime-001
 /proc-uptime-002
+/proc-pidns
 /read
 /self
 /setns-dcache
diff --git a/tools/testing/selftests/proc/Makefile b/tools/testing/selftests/proc/Makefile
index b12921b9794b..c6f7046b9860 100644
--- a/tools/testing/selftests/proc/Makefile
+++ b/tools/testing/selftests/proc/Makefile
@@ -27,5 +27,6 @@ TEST_GEN_PROGS += setns-sysvipc
 TEST_GEN_PROGS += thread-self
 TEST_GEN_PROGS += proc-multiple-procfs
 TEST_GEN_PROGS += proc-fsconfig-hidepid
+TEST_GEN_PROGS += proc-pidns
 
 include ../lib.mk
diff --git a/tools/testing/selftests/proc/proc-pidns.c b/tools/testing/selftests/proc/proc-pidns.c
new file mode 100644
index 000000000000..5994375e2377
--- /dev/null
+++ b/tools/testing/selftests/proc/proc-pidns.c
@@ -0,0 +1,252 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <cyphar@cyphar.com>
+ * Copyright (C) 2025 SUSE LLC.
+ */
+
+#include <assert.h>
+#include <errno.h>
+#include <sched.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
+#include <sys/prctl.h>
+
+#include "../kselftest_harness.h"
+
+#define ASSERT_ERRNO(expected, _t, seen)				\
+	__EXPECT(expected, #expected,					\
+		({__typeof__(seen) _tmp_seen = (seen);			\
+		  _tmp_seen >= 0 ? _tmp_seen : -errno; }), #seen, _t, 1)
+
+#define ASSERT_ERRNO_EQ(expected, seen) \
+	ASSERT_ERRNO(expected, ==, seen)
+
+#define ASSERT_SUCCESS(seen) \
+	ASSERT_ERRNO(0, <=, seen)
+
+static int touch(char *path)
+{
+	int fd = open(path, O_WRONLY|O_CREAT|O_CLOEXEC, 0644);
+	if (fd < 0)
+		return -1;
+	return close(fd);
+}
+
+FIXTURE(ns)
+{
+	int host_mntns, host_pidns;
+	int dummy_pidns;
+};
+
+FIXTURE_SETUP(ns)
+{
+	/* Stash the old mntns. */
+	self->host_mntns = open("/proc/self/ns/mnt", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(self->host_mntns);
+
+	/* Create a new mount namespace and make it private. */
+	ASSERT_SUCCESS(unshare(CLONE_NEWNS));
+	ASSERT_SUCCESS(mount(NULL, "/", NULL, MS_PRIVATE|MS_REC, NULL));
+
+	/*
+	 * Create a proper tmpfs that we can use and will disappear once we
+	 * leave this mntns.
+	 */
+	ASSERT_SUCCESS(mount("tmpfs", "/tmp", "tmpfs", 0, NULL));
+
+	/*
+	 * Create a pidns we can use for later tests. We need to fork off a
+	 * child so that we get a usable nsfd that we can bind-mount and open.
+	 */
+	ASSERT_SUCCESS(touch("/tmp/dummy-pidns"));
+
+	self->host_pidns = open("/proc/self/ns/pid", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(self->host_pidns);
+	ASSERT_SUCCESS(unshare(CLONE_NEWPID));
+
+	pid_t pid = fork();
+	ASSERT_SUCCESS(pid);
+	if (!pid) {
+		prctl(PR_SET_PDEATHSIG, SIGKILL);
+		ASSERT_SUCCESS(mount("/proc/self/ns/pid", "/tmp/dummy-pidns", NULL, MS_BIND, 0));
+		exit(0);
+	}
+
+	int wstatus;
+	ASSERT_EQ(waitpid(pid, &wstatus, 0), pid);
+	ASSERT_TRUE(WIFEXITED(wstatus));
+	ASSERT_EQ(WEXITSTATUS(wstatus), 0);
+
+	ASSERT_SUCCESS(setns(self->host_pidns, CLONE_NEWPID));
+
+	self->dummy_pidns = open("/tmp/dummy-pidns", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(self->dummy_pidns);
+}
+
+FIXTURE_TEARDOWN(ns)
+{
+	ASSERT_SUCCESS(setns(self->host_mntns, CLONE_NEWNS));
+	ASSERT_SUCCESS(close(self->host_mntns));
+
+	ASSERT_SUCCESS(close(self->host_pidns));
+	ASSERT_SUCCESS(close(self->dummy_pidns));
+}
+
+TEST_F(ns, pidns_mount_string_path)
+{
+	ASSERT_SUCCESS(mkdir("/tmp/proc-host", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc-host", "proc", 0, "pidns=/proc/self/ns/pid"));
+	ASSERT_SUCCESS(access("/tmp/proc-host/self/", X_OK));
+
+	ASSERT_SUCCESS(mkdir("/tmp/proc-dummy", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc-dummy", "proc", 0, "pidns=/tmp/dummy-pidns"));
+	ASSERT_ERRNO_EQ(-ENOENT, access("/tmp/proc-dummy/1/", X_OK));
+	ASSERT_ERRNO_EQ(-ENOENT, access("/tmp/proc-dummy/self/", X_OK));
+}
+
+TEST_F(ns, pidns_fsconfig_string_path)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_SET_STRING, "pidns", "/tmp/dummy-pidns", 0));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	ASSERT_ERRNO_EQ(-ENOENT, faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_ERRNO_EQ(-ENOENT, faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+TEST_F(ns, pidns_fsconfig_fd)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_SET_FD, "pidns", NULL, self->dummy_pidns));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	ASSERT_ERRNO_EQ(-ENOENT, faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_ERRNO_EQ(-ENOENT, faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+TEST_F(ns, pidns_reconfigure_remount)
+{
+	ASSERT_SUCCESS(mkdir("/tmp/proc", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc", "proc", 0, ""));
+
+	ASSERT_SUCCESS(access("/tmp/proc/1/", X_OK));
+	ASSERT_SUCCESS(access("/tmp/proc/self/", X_OK));
+
+	ASSERT_ERRNO_EQ(-EBUSY, mount(NULL, "/tmp/proc", NULL, MS_REMOUNT, "pidns=/tmp/dummy-pidns"));
+
+	ASSERT_SUCCESS(access("/tmp/proc/1/", X_OK));
+	ASSERT_SUCCESS(access("/tmp/proc/self/", X_OK));
+}
+
+TEST_F(ns, pidns_reconfigure_fsconfig_string_path)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_ERRNO_EQ(-EBUSY, fsconfig(fsfd, FSCONFIG_SET_STRING, "pidns", "/tmp/dummy-pidns", 0));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0)); /* noop */
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+TEST_F(ns, pidns_reconfigure_fsconfig_fd)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_ERRNO_EQ(-EBUSY, fsconfig(fsfd, FSCONFIG_SET_FD, "pidns", NULL, self->dummy_pidns));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0)); /* noop */
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+int is_same_inode(int fd1, int fd2)
+{
+	struct stat stat1, stat2;
+
+	assert(fstat(fd1, &stat1) == 0);
+	assert(fstat(fd2, &stat2) == 0);
+
+	return stat1.st_ino == stat2.st_ino && stat1.st_dev == stat2.st_dev;
+}
+
+#define PROCFS_IOCTL_MAGIC 'f'
+#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 1)
+
+TEST_F(ns, get_pidns_ioctl)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_SET_FD, "pidns", NULL, self->dummy_pidns));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	/* fsmount returns an O_PATH, which ioctl(2) doesn't accept. */
+	int new_mountfd = openat(mountfd, ".", O_RDONLY|O_DIRECTORY|O_CLOEXEC);
+	ASSERT_SUCCESS(new_mountfd);
+
+	ASSERT_SUCCESS(close(mountfd));
+	mountfd = -EBADF;
+
+	int procfs_pidns = ioctl(new_mountfd, PROCFS_GET_PID_NAMESPACE);
+	ASSERT_SUCCESS(procfs_pidns);
+
+	ASSERT_NE(self->dummy_pidns, procfs_pidns);
+	ASSERT_FALSE(is_same_inode(self->host_pidns, procfs_pidns));
+	ASSERT_TRUE(is_same_inode(self->dummy_pidns, procfs_pidns));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(new_mountfd));
+	ASSERT_SUCCESS(close(procfs_pidns));
+}
+
+TEST_HARNESS_MAIN

-- 
2.50.0


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox