Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Mike Rapoport @ 2025-09-03 19:39 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Jason Gunthorpe, Pasha Tatashin, jasonmiu, graf, changyuanl,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <mafs0qzwnvcwk.fsf@kernel.org>

Hi Pratyush,

On Wed, Sep 03, 2025 at 04:17:15PM +0200, Pratyush Yadav wrote:
> On Tue, Sep 02 2025, Mike Rapoport wrote:
> >
> > As for porting kho_preserve_vmalloc() to kho_array, I also feel that it
> > would just make kho_preserve_vmalloc() more complex and I'd rather simplify
> > it even more, e.g. with preallocating all the pages that preserve indices
> > in advance.
> 
> I think there are two parts here. One is the data format of the KHO
> array and the other is the way to build it. I think the format is quite
> simple and versatile, and we can have many strategies of building it.
> 
> For example, if you are only concerned with pre-allocating data, I can
> very well add a way to initialize the KHO array with with a fixed size
> up front.

I wasn't concerned with preallocation vs allocating a page at a time, I
though with preallocation the vmalloc code will become even simpler, but
it's not :)
 
> Beyond that, I think KHO array will actually make kho_preserve_vmalloc()
> simpler since it won't have to deal with the linked list traversal
> logic. It can just do ka_for_each() and just get all the pages.
>
> We can also convert the preservation bitmaps to use it so the linked list
> logic is in one place, and others just build on top of it.

I disagree. The boilerplate to initialize and iterate the kho_array will
not make neither vmalloc nor bitmaps preservation simpler IMO.

And for bitmaps Pasha and Jason M. are anyway working on a different data
structure already, so if their proposal moves forward converting bitmap
preservation to anything would be a wasted effort.

> -- 
> Regards,
> Pratyush Yadav

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* [PATCH v3] uapi/linux/fcntl: remove AT_RENAME* macros
From: Randy Dunlap @ 2025-09-04  6:22 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: patches, Randy Dunlap, Amir Goldstein, Jeff Layton, Chuck Lever,
	Alexander Aring, Josef Bacik, Aleksa Sarai, Jan Kara,
	Christian Brauner, Matthew Wilcox, David Howells, linux-api

Don't define the AT_RENAME_* macros at all since the kernel does not
use them nor does the kernel need to provide them for userspace.
Leave them as comments in <uapi/linux/fcntl.h> only as an example.

The AT_RENAME_* macros have recently been added to glibc's <stdio.h>.
For a kernel allmodconfig build, this made the macros be defined
differently in 2 places (same values but different macro text),
causing build errors/warnings (duplicate definitions) in both
samples/watch_queue/watch_test.c and samples/vfs/test-statx.c.
(<linux/fcntl.h> is included indirecty in both programs above.)

Fixes: b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be allocated")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
---
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Howells <dhowells@redhat.com>
CC: linux-api@vger.kernel.org
To: linux-fsdevel@vger.kernel.org
---
 include/uapi/linux/fcntl.h |    6 ++++++
 1 file changed, 6 insertions(+)

--- linux-next-20250819.orig/include/uapi/linux/fcntl.h
+++ linux-next-20250819/include/uapi/linux/fcntl.h
@@ -155,10 +155,16 @@
  * as possible, so we can use them for generic bits in the future if necessary.
  */
 
+/*
+ * Note: This is an example of how the AT_RENAME_* flags could be defined,
+ * but the kernel has no need to define them, so leave them as comments.
+ */
 /* Flags for renameat2(2) (must match legacy RENAME_* flags). */
+/*
 #define AT_RENAME_NOREPLACE	0x0001
 #define AT_RENAME_EXCHANGE	0x0002
 #define AT_RENAME_WHITEOUT	0x0004
+*/
 
 /* Flag for faccessat(2). */
 #define AT_EACCESS		0x200	/* Test access permitted for

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pratyush Yadav @ 2025-09-04 12:39 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Pratyush Yadav, Jason Gunthorpe, Pasha Tatashin, jasonmiu, graf,
	changyuanl, dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <aLiZbb_F5R2x9-y2@kernel.org>

Hi Mike,

On Wed, Sep 03 2025, Mike Rapoport wrote:
> On Wed, Sep 03, 2025 at 04:17:15PM +0200, Pratyush Yadav wrote:
>> On Tue, Sep 02 2025, Mike Rapoport wrote:
>> >
>> > As for porting kho_preserve_vmalloc() to kho_array, I also feel that it
>> > would just make kho_preserve_vmalloc() more complex and I'd rather simplify
>> > it even more, e.g. with preallocating all the pages that preserve indices
>> > in advance.
[...]
>  
>> Beyond that, I think KHO array will actually make kho_preserve_vmalloc()
>> simpler since it won't have to deal with the linked list traversal
>> logic. It can just do ka_for_each() and just get all the pages.
>>
>> We can also convert the preservation bitmaps to use it so the linked list
>> logic is in one place, and others just build on top of it.
>
> I disagree. The boilerplate to initialize and iterate the kho_array will
> not make neither vmalloc nor bitmaps preservation simpler IMO.

I have done 80% of the work on this already, so let's do this: I will do
the rest of the 20% and publish the patches. Then you and Jason can have
a look and if you still think it's not worth it, I am fine shelving it
for now and revisiting later when there might be a stronger case.

>
> And for bitmaps Pasha and Jason M. are anyway working on a different data
> structure already, so if their proposal moves forward converting bitmap
> preservation to anything would be a wasted effort.

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pratyush Yadav @ 2025-09-04 12:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, Pasha Tatashin, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250903150157.GH470103@nvidia.com>

Hi Jason,

On Wed, Sep 03 2025, Jason Gunthorpe wrote:

> On Wed, Sep 03, 2025 at 04:10:37PM +0200, Pratyush Yadav wrote:
>
>> > So, it could be useful, but I wouldn't use it for memfd, the vmalloc
>> > approach is better and we shouldn't optimize for sparsness which
>> > should never happen.
>> 
>> I disagree. I think we are re-inventing the same data format with minor
>> variations. I think we should define extensible fundamental data formats
>> first, and then use those as the building blocks for the rest of our
>> serialization logic.
>
> page, vmalloc, slab seem to me to be the fundamental units of memory
> management in linux, so they should get KHO support.
>
> If you want to preserve a known-sized array you use vmalloc and then
> write out the per-list items. If it is a dictionary/sparse array then
> you write an index with each item too. This is all trivial and doesn't
> really need more abstraction in of itself, IMHO.

We will use up double the space for tracking metadata, but maybe that is
fine until we start seeing bigger memfds in real workloads.

>
>> cases can then build on top of it. For example, the preservation bitmaps
>> can get rid of their linked list logic and just use KHO array to hold
>> and retrieve its bitmaps. It will make the serialization simpler.
>
> I don't think the bitmaps should, the serialization here is very
> special because it is not actually preserved, it just exists for the
> time while the new kernel runs in scratch and is insta freed once the
> allocators start up.

I don't think it matters if they are preserved or not. The serialization
and deserialization is independent of that. You can very well create a
KHO array that you don't KHO-preserve. On next boot, you can still use
it, you just have to be careful of doing it while scratch-only. Same as
we do now.

>
>> I also don't get why you think sparseness "should never happen". For
>> memfd for example, you say in one of your other emails that "And again
>> in real systems we expect memfd to be fully populated too." Which
>> systems and use cases do you have in mind? Why do you think people won't
>> want a sparse memfd?
>
> memfd should principally be used to back VM memory, and I expect VM
> memory to be fully populated. Why would it be sparse?

For the _hypervisor_ live update case, sure. Though even there, I have a
feeling we will start seeing userspace components on the hypervisor use
memfd for stashing some of their state. Pasha has already mentioned they
have a use case for a memfd that is not VM memory.

But hypervisor live upadte isn't the only use case for LUO. We are
looking at enabling state preservation for "normal" userspace
applications. Think big storage nodes with memory in order of TiB. Those
can use a memfd to back their caches so on a kernel upgrade the caches
don't have to be re-fetched. Sparseness is to be expected for such use
cases.

>
>> All in all, I think KHO array is going to prove useful and will make
>> serialization for subsystems easier. I think sparseness will also prove
>> useful but it is not a hill I want to die on. I am fine with starting
>> with a non-sparse array if people really insist. But I do think we
>> should go with KHO array as a base instead of re-inventing the linked
>> list of pages again and again.
>
> The two main advantages I see to the kho array design vs vmalloc is
> that it should be a bit faster as it doesn't establish a vmap, and it
> handles unknown size lists much better.
>
> Are these important considerations? IDK.
>
> As I said to Chris, I think we should see more examples of what we
> actually need before assuming any certain datastructure is the best
> choice.
>
> So I'd stick to simpler open coded things and go back and improve them
> than start out building the wrong shared data structure.
>
> How about have at least three luo clients that show meaningful benefit
> before proposing something beyond the fundamental page, vmalloc, slab
> things?

I think the fundamentals themselves get some benefit. But anyway, since
I have done most of the work on this feature anyway, I will do the rest
and send the patches out. Then you can have a look and if you're still
not convinced, I am fine shelving it for now to revisit later when a
stronger case can be made.

>
>> What do you mean by "data per version"? I think there should be only one
>> version of the serialized object. Multiple versions of the same thing
>> will get ugly real quick.
>
> If you want to support backwards/forwards compatability then you
> probably should support multiple versions as well. Otherwise it
> could become quite hard to make downgrades..

Hmm, forward can work regardless since a newer kernel should speak older
formats too, but for backwards it makes sense to have an older version.

But perhaps it might be a better idea to come up with a mechanism for
the kernel to discover which formats the "next" kernel speaks so it can
for one decide whether it can do the live update at all, and for another
which formats it should use. Maybe we give a way for luod to choose
formats, and give it the responsibility for doing these checks?

>
> Ideally I'd want to remove the upstream code for obsolete versions
> fairly quickly so I'd imagine kernels will want to generate both
> versions during the transition period and then eventually newer
> kernels will only accept the new version.
>
> I've argued before that the extended matrix of any kernel version to
> any other kernel version should lie with the distro/CSP making the
> kernel fork. They know what their upgrade sequence will be so they can
> manage any missing versions to make it work.
>
> Upstream should do like v6.1 to v6.2 only or something similarly well
> constrained. I think this is a reasonable trade off to get subsystem
> maintainers to even accept this stuff at all.
[...]

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Jason Gunthorpe @ 2025-09-04 14:42 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Pasha Tatashin, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <mafs0a53av0hs.fsf@kernel.org>

On Thu, Sep 04, 2025 at 02:57:35PM +0200, Pratyush Yadav wrote:

> I don't think it matters if they are preserved or not. The serialization
> and deserialization is independent of that. You can very well create a
> KHO array that you don't KHO-preserve. On next boot, you can still use
> it, you just have to be careful of doing it while scratch-only. Same as
> we do now.

The KHO array machinery itself can't preserve its own memory
either.

> For the _hypervisor_ live update case, sure. Though even there, I have a
> feeling we will start seeing userspace components on the hypervisor use
> memfd for stashing some of their state. 

Sure, but don't make excessively sparse memfds for kexec use, why
should that be hard?

> applications. Think big storage nodes with memory in order of TiB. Those
> can use a memfd to back their caches so on a kernel upgrade the caches
> don't have to be re-fetched. Sparseness is to be expected for such use
> cases.

Oh? I'm surpised you'd have sparseness there. sparseness seems like
such a weird feature to want to rely on :\

> But perhaps it might be a better idea to come up with a mechanism for
> the kernel to discover which formats the "next" kernel speaks so it can
> for one decide whether it can do the live update at all, and for another
> which formats it should use. Maybe we give a way for luod to choose
> formats, and give it the responsibility for doing these checks?

I have felt that we should catalog the formats&versions the kernel can
read/write in some way during kbuild.

Maybe this turns into a sysfs directory of all the data with an
'enable_write' flag that luod could set to 0 to optimize.

And maybe this could be a kbuild report that luod could parse to do
this optimization.

And maybe distro/csps use this information mechanically to check if
version pairs are kexec compatible.

Which re-enforces my feeling that the formats/version should be first
class concepts, every version should be registered and luo should
sequence calling the code for the right version at the right time.

Jason

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Jason Gunthorpe @ 2025-09-04 17:34 UTC (permalink / raw)
  To: Chris Li
  Cc: Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CACePvbWGR+XPfTub41=Ekj3aSMjzyO+FyJmzMy5HEQKq0-wqag@mail.gmail.com>

On Wed, Sep 03, 2025 at 05:01:15AM -0700, Chris Li wrote:

> > And if you want to serialize that the optimal path would be to have a
> > vmalloc of all the strings and a vmalloc of the [] data, sort of like
> > the kho array idea.
> 
> The KHO array idea is already implemented in the existing KHO code or
> that is something new you want to propose?

Pratyush has proposed it

> Then we will have to know the combined size of the string up front,
> similar to the FDT story. Ideally the list can incrementally add items
> to it. May be stored as a list as raw pointer without vmalloc
> first,then have a final pass vmalloc and serialize the string and
> data.

There are many options, and the dynamic extendability from the KHO
array might be a good fit here. But you can also just store the
serializations in a linked list and then write them out.

> With the additional detail above, I would like to point out something
> I have observed earlier: even though the core idea of the native C
> struct is simple and intuitive, the end of end implementation is not.
> When we compare C struct implementation, we need to include all those
> additional boilerplate details as a whole, otherwise it is not a apple
> to apple comparison.

You need all of this anyhow, BTF doesn't create version meta data,
evaluate which version are suitable, or de-serialize complex rbtree or
linked lists structures.

> > Your BTF proposal doesn't seem to benifit memfd at all, it was focused
> > on extracting data directly from an existing struct which I feel very
> > strongly we should never do.
> 
> From data flow point of view, the data is get from a C struct and
> eventually store into a C struct. That is no way around that. That is
> the necessary evil if you automate this process. Hey, there is also no
> rule saying that you can't use a bounce buffer of some kind of manual
> control in between.

Yeah but if I already wrote the code to make the required C struct
there only difference is 'memcpy c struct' vs 'serialze with btf c
struct' and that isn't meaningful.

If the boilerplate is around arrays of C structs and things then the
KHO array proposal is a good direction to de-duplicate code.

> It is just a way to automate stuff to reduce the boilerplate. 

You haven't clearly spelled out what the boilerplate even is, this was
my feedback to you to be very clear on what is being improved.

> I feel a much stronger sense of urgency than you though.  The stakes
> are high, currently you already have four departments can use this
> common serialization library right now:
> 1) PCI
> 2) VFIO
> 3) IOMMU
> 4) Memfd.

We don't know what they actually need to write out, we haven't seen
any patches. 

Let's start with simple patches and deal with the fundamental problems
like versioning, then you can come with ideas to optimize if it turns
out there is something to improve here.

I'm not convinced PCI (a few bits per struct pci_device to start),
memfd (xarray) and IOMMU (dictionaries of HW physical pointers) share
a significant overlap of serialization requirements beyond luo level
managing the objects and versioning.

Jason

^ permalink raw reply

* Re: [PATCH v3] uapi/linux/fcntl: remove AT_RENAME* macros
From: Amir Goldstein @ 2025-09-04 18:17 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: linux-fsdevel, patches, Jeff Layton, Chuck Lever, Alexander Aring,
	Josef Bacik, Aleksa Sarai, Jan Kara, Christian Brauner,
	Matthew Wilcox, David Howells, linux-api
In-Reply-To: <20250904062215.2362311-1-rdunlap@infradead.org>

On Thu, Sep 4, 2025 at 8:22 AM Randy Dunlap <rdunlap@infradead.org> wrote:
>
> Don't define the AT_RENAME_* macros at all since the kernel does not
> use them nor does the kernel need to provide them for userspace.
> Leave them as comments in <uapi/linux/fcntl.h> only as an example.
>
> The AT_RENAME_* macros have recently been added to glibc's <stdio.h>.
> For a kernel allmodconfig build, this made the macros be defined
> differently in 2 places (same values but different macro text),
> causing build errors/warnings (duplicate definitions) in both
> samples/watch_queue/watch_test.c and samples/vfs/test-statx.c.
> (<linux/fcntl.h> is included indirecty in both programs above.)
>
> Fixes: b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be allocated")
> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
> ---
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Jeff Layton <jlayton@kernel.org>
> Cc: Chuck Lever <chuck.lever@oracle.com>
> Cc: Alexander Aring <alex.aring@gmail.com>
> Cc: Josef Bacik <josef@toxicpanda.com>
> Cc: Aleksa Sarai <cyphar@cyphar.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: David Howells <dhowells@redhat.com>
> CC: linux-api@vger.kernel.org
> To: linux-fsdevel@vger.kernel.org
> ---
>  include/uapi/linux/fcntl.h |    6 ++++++
>  1 file changed, 6 insertions(+)
>
> --- linux-next-20250819.orig/include/uapi/linux/fcntl.h
> +++ linux-next-20250819/include/uapi/linux/fcntl.h
> @@ -155,10 +155,16 @@
>   * as possible, so we can use them for generic bits in the future if necessary.
>   */
>
> +/*
> + * Note: This is an example of how the AT_RENAME_* flags could be defined,
> + * but the kernel has no need to define them, so leave them as comments.
> + */
>  /* Flags for renameat2(2) (must match legacy RENAME_* flags). */
> +/*
>  #define AT_RENAME_NOREPLACE    0x0001
>  #define AT_RENAME_EXCHANGE     0x0002
>  #define AT_RENAME_WHITEOUT     0x0004
> +*/
>

I find this end result a bit odd, but I don't want to suggest another variant
I already proposed one in v2 review [1] that maybe you did not like.
It's fine.
I'll let Aleksa and Christian chime in to decide on if and how they want this
comment to look or if we should just delete these definitions and be done with
this episode.

Thanks,
Amir.

[1] https://lore.kernel.org/r/CAOQ4uxjXvYBsW1Nb2HKaoUg1qi8Pkq1XKtQEbnAvMUGcp7LrZA@mail.gmail.com/

^ permalink raw reply

* Re: [PATCH v3] uapi/linux/fcntl: remove AT_RENAME* macros
From: Florian Weimer @ 2025-09-04 18:49 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Randy Dunlap, linux-fsdevel, patches, Jeff Layton, Chuck Lever,
	Alexander Aring, Josef Bacik, Aleksa Sarai, Jan Kara,
	Christian Brauner, Matthew Wilcox, David Howells, linux-api
In-Reply-To: <CAOQ4uxiJibbq_MX3HkNaFb3GXGsZ0nNehk+MNODxXxy_khSwEQ@mail.gmail.com>

* Amir Goldstein:

> I find this end result a bit odd, but I don't want to suggest another variant
> I already proposed one in v2 review [1] that maybe you did not like.
> It's fine.
> I'll let Aleksa and Christian chime in to decide on if and how they want this
> comment to look or if we should just delete these definitions and be done with
> this episode.

We should fix the definition in glibc to be identical token-wise to the
kernel's.

Thanks,
Florian


^ permalink raw reply

* Re: [PATCH v3] uapi/linux/fcntl: remove AT_RENAME* macros
From: Randy Dunlap @ 2025-09-04 21:51 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, patches, Jeff Layton, Chuck Lever, Alexander Aring,
	Josef Bacik, Aleksa Sarai, Jan Kara, Christian Brauner,
	Matthew Wilcox, David Howells, linux-api
In-Reply-To: <CAOQ4uxiJibbq_MX3HkNaFb3GXGsZ0nNehk+MNODxXxy_khSwEQ@mail.gmail.com>



On 9/4/25 11:17 AM, Amir Goldstein wrote:
> On Thu, Sep 4, 2025 at 8:22 AM Randy Dunlap <rdunlap@infradead.org> wrote:
>>
>> Don't define the AT_RENAME_* macros at all since the kernel does not
>> use them nor does the kernel need to provide them for userspace.
>> Leave them as comments in <uapi/linux/fcntl.h> only as an example.
>>
>> The AT_RENAME_* macros have recently been added to glibc's <stdio.h>.
>> For a kernel allmodconfig build, this made the macros be defined
>> differently in 2 places (same values but different macro text),
>> causing build errors/warnings (duplicate definitions) in both
>> samples/watch_queue/watch_test.c and samples/vfs/test-statx.c.
>> (<linux/fcntl.h> is included indirecty in both programs above.)
>>
>> Fixes: b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be allocated")
>> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
>> ---
>> Cc: Amir Goldstein <amir73il@gmail.com>
>> Cc: Jeff Layton <jlayton@kernel.org>
>> Cc: Chuck Lever <chuck.lever@oracle.com>
>> Cc: Alexander Aring <alex.aring@gmail.com>
>> Cc: Josef Bacik <josef@toxicpanda.com>
>> Cc: Aleksa Sarai <cyphar@cyphar.com>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Christian Brauner <brauner@kernel.org>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: David Howells <dhowells@redhat.com>
>> CC: linux-api@vger.kernel.org
>> To: linux-fsdevel@vger.kernel.org
>> ---
>>  include/uapi/linux/fcntl.h |    6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> --- linux-next-20250819.orig/include/uapi/linux/fcntl.h
>> +++ linux-next-20250819/include/uapi/linux/fcntl.h
>> @@ -155,10 +155,16 @@
>>   * as possible, so we can use them for generic bits in the future if necessary.
>>   */
>>
>> +/*
>> + * Note: This is an example of how the AT_RENAME_* flags could be defined,
>> + * but the kernel has no need to define them, so leave them as comments.
>> + */
>>  /* Flags for renameat2(2) (must match legacy RENAME_* flags). */
>> +/*
>>  #define AT_RENAME_NOREPLACE    0x0001
>>  #define AT_RENAME_EXCHANGE     0x0002
>>  #define AT_RENAME_WHITEOUT     0x0004
>> +*/
>>
> 
> I find this end result a bit odd, but I don't want to suggest another variant
> I already proposed one in v2 review [1] that maybe you did not like.
> It's fine.

Yes, I replied to that with another problem.

> I'll let Aleksa and Christian chime in to decide on if and how they want this
> comment to look or if we should just delete these definitions and be done with
> this episode.

Sure, I'm ready to just throw my hands up (give up).

> 
> Thanks,
> Amir.
> 
> [1] https://lore.kernel.org/r/CAOQ4uxjXvYBsW1Nb2HKaoUg1qi8Pkq1XKtQEbnAvMUGcp7LrZA@mail.gmail.com/
thanks.
-- 
~Randy


^ permalink raw reply

* Re: [PATCH v3] uapi/linux/fcntl: remove AT_RENAME* macros
From: Randy Dunlap @ 2025-09-04 21:52 UTC (permalink / raw)
  To: Florian Weimer, Amir Goldstein
  Cc: linux-fsdevel, patches, Jeff Layton, Chuck Lever, Alexander Aring,
	Josef Bacik, Aleksa Sarai, Jan Kara, Christian Brauner,
	Matthew Wilcox, David Howells, linux-api
In-Reply-To: <lhua53auk7q.fsf@oldenburg.str.redhat.com>



On 9/4/25 11:49 AM, Florian Weimer wrote:
> * Amir Goldstein:
> 
>> I find this end result a bit odd, but I don't want to suggest another variant
>> I already proposed one in v2 review [1] that maybe you did not like.
>> It's fine.
>> I'll let Aleksa and Christian chime in to decide on if and how they want this
>> comment to look or if we should just delete these definitions and be done with
>> this episode.
> 
> We should fix the definition in glibc to be identical token-wise to the
> kernel's.

That's probably a good suggestion...
while I tried the reverse of that and Amir opposed.

Now I find that I don't care enough to sustain this.

Thanks.

-- 
~Randy


^ permalink raw reply

* Re: [PATCH v3] uapi/linux/fcntl: remove AT_RENAME* macros
From: Aleksa Sarai @ 2025-09-05  5:11 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Randy Dunlap, linux-fsdevel, patches, Jeff Layton, Chuck Lever,
	Alexander Aring, Josef Bacik, Jan Kara, Christian Brauner,
	Matthew Wilcox, David Howells, linux-api
In-Reply-To: <CAOQ4uxiJibbq_MX3HkNaFb3GXGsZ0nNehk+MNODxXxy_khSwEQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3411 bytes --]

On 2025-09-04, Amir Goldstein <amir73il@gmail.com> wrote:
> On Thu, Sep 4, 2025 at 8:22 AM Randy Dunlap <rdunlap@infradead.org> wrote:
> >
> > Don't define the AT_RENAME_* macros at all since the kernel does not
> > use them nor does the kernel need to provide them for userspace.
> > Leave them as comments in <uapi/linux/fcntl.h> only as an example.
> >
> > The AT_RENAME_* macros have recently been added to glibc's <stdio.h>.
> > For a kernel allmodconfig build, this made the macros be defined
> > differently in 2 places (same values but different macro text),
> > causing build errors/warnings (duplicate definitions) in both
> > samples/watch_queue/watch_test.c and samples/vfs/test-statx.c.
> > (<linux/fcntl.h> is included indirecty in both programs above.)
> >
> > Fixes: b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be allocated")
> > Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
> > ---
> > Cc: Amir Goldstein <amir73il@gmail.com>
> > Cc: Jeff Layton <jlayton@kernel.org>
> > Cc: Chuck Lever <chuck.lever@oracle.com>
> > Cc: Alexander Aring <alex.aring@gmail.com>
> > Cc: Josef Bacik <josef@toxicpanda.com>
> > Cc: Aleksa Sarai <cyphar@cyphar.com>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Cc: David Howells <dhowells@redhat.com>
> > CC: linux-api@vger.kernel.org
> > To: linux-fsdevel@vger.kernel.org
> > ---
> >  include/uapi/linux/fcntl.h |    6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > --- linux-next-20250819.orig/include/uapi/linux/fcntl.h
> > +++ linux-next-20250819/include/uapi/linux/fcntl.h
> > @@ -155,10 +155,16 @@
> >   * as possible, so we can use them for generic bits in the future if necessary.
> >   */
> >
> > +/*
> > + * Note: This is an example of how the AT_RENAME_* flags could be defined,
> > + * but the kernel has no need to define them, so leave them as comments.
> > + */
> >  /* Flags for renameat2(2) (must match legacy RENAME_* flags). */
> > +/*
> >  #define AT_RENAME_NOREPLACE    0x0001
> >  #define AT_RENAME_EXCHANGE     0x0002
> >  #define AT_RENAME_WHITEOUT     0x0004
> > +*/
> >
> 
> I find this end result a bit odd, but I don't want to suggest another variant
> I already proposed one in v2 review [1] that maybe you did not like.
> It's fine.
> I'll let Aleksa and Christian chime in to decide on if and how they want this
> comment to look or if we should just delete these definitions and be done with
> this episode.

For my part, I'm fine with these becoming comments or even removing them
outright. I think that defining them as AT_* flags would've been useful
examples of how these flags should be used, but it is what it is.

Then again, AT_EXECVE_CHECK went in and used a higher-level bit despite
the comments describing that this was unfavourable and what should be
done instead, so maybe attempting to avoid conflicts is an exercise in
futility...

If it's too much effort to synchronise them between glibc then it's
better to just close the book on this whole chapter (even though my
impression is that glibc made a mistake or two when adding the
definitions).

In either case, feel free to take my

Acked-by: Aleksa Sarai <cyphar@cyphar.com>

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [PATCH v3] uapi/linux/fcntl: remove AT_RENAME* macros
From: Florian Weimer @ 2025-09-05  7:19 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Amir Goldstein, linux-fsdevel, patches, Jeff Layton, Chuck Lever,
	Alexander Aring, Josef Bacik, Aleksa Sarai, Jan Kara,
	Christian Brauner, Matthew Wilcox, David Howells, linux-api
In-Reply-To: <b35f0ff7-8ffb-400f-b537-d15e83319808@infradead.org>

* Randy Dunlap:

> On 9/4/25 11:49 AM, Florian Weimer wrote:
>> * Amir Goldstein:
>> 
>>> I find this end result a bit odd, but I don't want to suggest another variant
>>> I already proposed one in v2 review [1] that maybe you did not like.
>>> It's fine.
>>> I'll let Aleksa and Christian chime in to decide on if and how they want this
>>> comment to look or if we should just delete these definitions and be done with
>>> this episode.
>> 
>> We should fix the definition in glibc to be identical token-wise to the
>> kernel's.
>
> That's probably a good suggestion...
> while I tried the reverse of that and Amir opposed.

It's certainly odd that the kernel uses different token sequences for
defining AT_RENAME_* and RENAME_*.  But it's probably too late to fix
that.

Here's the glibc patch:

  [PATCH] libio: Define AT_RENAME_* with the same tokens as Linux
  <https://inbox.sourceware.org/libc-alpha/lhubjnpv03o.fsf@oldenburg.str.redhat.com/T/#u>

Thanks,
Florian


^ permalink raw reply

* Re: [PATCH v3] uapi/linux/fcntl: remove AT_RENAME* macros
From: Randy Dunlap @ 2025-09-05  7:36 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Amir Goldstein, linux-fsdevel, patches, Jeff Layton, Chuck Lever,
	Alexander Aring, Josef Bacik, Aleksa Sarai, Jan Kara,
	Christian Brauner, Matthew Wilcox, David Howells, linux-api
In-Reply-To: <lhu7bydv01m.fsf@oldenburg.str.redhat.com>

Hi,

On 9/5/25 12:19 AM, Florian Weimer wrote:
> * Randy Dunlap:
> 
>> On 9/4/25 11:49 AM, Florian Weimer wrote:
>>> * Amir Goldstein:
>>>
>>>> I find this end result a bit odd, but I don't want to suggest another variant
>>>> I already proposed one in v2 review [1] that maybe you did not like.
>>>> It's fine.
>>>> I'll let Aleksa and Christian chime in to decide on if and how they want this
>>>> comment to look or if we should just delete these definitions and be done with
>>>> this episode.
>>>
>>> We should fix the definition in glibc to be identical token-wise to the
>>> kernel's.
>>
>> That's probably a good suggestion...
>> while I tried the reverse of that and Amir opposed.
> 
> It's certainly odd that the kernel uses different token sequences for
> defining AT_RENAME_* and RENAME_*.  But it's probably too late to fix
> that.
> 
> Here's the glibc patch:
> 
>   [PATCH] libio: Define AT_RENAME_* with the same tokens as Linux
>   <https://inbox.sourceware.org/libc-alpha/lhubjnpv03o.fsf@oldenburg.str.redhat.com/T/#u>


Thanks!

-- 
~Randy


^ permalink raw reply

* Re: [PATCH v3] uapi/linux/fcntl: remove AT_RENAME* macros
From: Amir Goldstein @ 2025-09-05  9:17 UTC (permalink / raw)
  To: Aleksa Sarai, Florian Weimer
  Cc: linux-fsdevel, patches, Jeff Layton, Chuck Lever, Alexander Aring,
	Josef Bacik, Jan Kara, Christian Brauner, Matthew Wilcox,
	David Howells, linux-api, Kees Cook, Randy Dunlap
In-Reply-To: <2025-09-05-armless-uneaten-venture-denizen-HnoIhR@cyphar.com>

On Fri, Sep 5, 2025 at 7:11 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2025-09-04, Amir Goldstein <amir73il@gmail.com> wrote:
> > On Thu, Sep 4, 2025 at 8:22 AM Randy Dunlap <rdunlap@infradead.org> wrote:
> > >
> > > Don't define the AT_RENAME_* macros at all since the kernel does not
> > > use them nor does the kernel need to provide them for userspace.
> > > Leave them as comments in <uapi/linux/fcntl.h> only as an example.
> > >
> > > The AT_RENAME_* macros have recently been added to glibc's <stdio.h>.
> > > For a kernel allmodconfig build, this made the macros be defined
> > > differently in 2 places (same values but different macro text),
> > > causing build errors/warnings (duplicate definitions) in both
> > > samples/watch_queue/watch_test.c and samples/vfs/test-statx.c.
> > > (<linux/fcntl.h> is included indirecty in both programs above.)
> > >
> > > Fixes: b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be allocated")
> > > Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
> > > ---
> > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > Cc: Jeff Layton <jlayton@kernel.org>
> > > Cc: Chuck Lever <chuck.lever@oracle.com>
> > > Cc: Alexander Aring <alex.aring@gmail.com>
> > > Cc: Josef Bacik <josef@toxicpanda.com>
> > > Cc: Aleksa Sarai <cyphar@cyphar.com>
> > > Cc: Jan Kara <jack@suse.cz>
> > > Cc: Christian Brauner <brauner@kernel.org>
> > > Cc: Matthew Wilcox <willy@infradead.org>
> > > Cc: David Howells <dhowells@redhat.com>
> > > CC: linux-api@vger.kernel.org
> > > To: linux-fsdevel@vger.kernel.org
> > > ---
> > >  include/uapi/linux/fcntl.h |    6 ++++++
> > >  1 file changed, 6 insertions(+)
> > >
> > > --- linux-next-20250819.orig/include/uapi/linux/fcntl.h
> > > +++ linux-next-20250819/include/uapi/linux/fcntl.h
> > > @@ -155,10 +155,16 @@
> > >   * as possible, so we can use them for generic bits in the future if necessary.
> > >   */
> > >
> > > +/*
> > > + * Note: This is an example of how the AT_RENAME_* flags could be defined,
> > > + * but the kernel has no need to define them, so leave them as comments.
> > > + */
> > >  /* Flags for renameat2(2) (must match legacy RENAME_* flags). */
> > > +/*
> > >  #define AT_RENAME_NOREPLACE    0x0001
> > >  #define AT_RENAME_EXCHANGE     0x0002
> > >  #define AT_RENAME_WHITEOUT     0x0004
> > > +*/
> > >
> >
> > I find this end result a bit odd, but I don't want to suggest another variant
> > I already proposed one in v2 review [1] that maybe you did not like.
> > It's fine.
> > I'll let Aleksa and Christian chime in to decide on if and how they want this
> > comment to look or if we should just delete these definitions and be done with
> > this episode.
>
> For my part, I'm fine with these becoming comments or even removing them
> outright. I think that defining them as AT_* flags would've been useful
> examples of how these flags should be used, but it is what it is.
>
> Then again, AT_EXECVE_CHECK went in and used a higher-level bit despite
> the comments describing that this was unfavourable and what should be
> done instead, so maybe attempting to avoid conflicts is an exercise in
> futility...

That's a bummer :-/
but to be fair, AT_EXECVE_CHECK was merged after v23, so I guess
the patch set started way before this comment and got rebased
after the comment was added, so it was easier to miss it.

>
> If it's too much effort to synchronise them between glibc then it's
> better to just close the book on this whole chapter (even though my
> impression is that glibc made a mistake or two when adding the
> definitions).

Considering that glibc has this fix lined up:
https://inbox.sourceware.org/libc-alpha/lhubjnpv03o.fsf@oldenburg.str.redhat.com/

Do we need to do anything at all?

Florian,

I am not that familiar with packaging and distributions of glibc
headers and kernel headers to downstream users.

What are the chances that us removing these definitions from the
current kernel header is going to help any downstream user in the future?

Thanks,
Amir.

^ permalink raw reply

* Re: [PATCH v3] uapi/linux/fcntl: remove AT_RENAME* macros
From: Florian Weimer @ 2025-09-05  9:41 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Aleksa Sarai, linux-fsdevel, patches, Jeff Layton, Chuck Lever,
	Alexander Aring, Josef Bacik, Jan Kara, Christian Brauner,
	Matthew Wilcox, David Howells, linux-api, Kees Cook, Randy Dunlap
In-Reply-To: <CAOQ4uxjcLDUcfdp72cpQcDQEtZaaR4G+P8oPXL_HbotFirGrKQ@mail.gmail.com>

* Amir Goldstein:

>> If it's too much effort to synchronise them between glibc then it's
>> better to just close the book on this whole chapter (even though my
>> impression is that glibc made a mistake or two when adding the
>> definitions).
>
> Considering that glibc has this fix lined up:
> https://inbox.sourceware.org/libc-alpha/lhubjnpv03o.fsf@oldenburg.str.redhat.com/
>
> Do we need to do anything at all?
>
> Florian,
>
> I am not that familiar with packaging and distributions of glibc
> headers and kernel headers to downstream users.

I don't think kernel changes are necessary or desirable at this point.
The glibc change went into glibc 2.42 only, and at this point in time,
all distributions shipping 2.42 (few of them do) are pretty much
guaranteed to pick up fixes from the 2.42 stable release branch
regularly.  So if we get this into glibc 2.43 and backport it to 2.42,
the problem should disappear quite soon from a developer's perspective.

Thanks,
Florian


^ permalink raw reply

* Re: [PATCH v4 0/4] procfs: make reference pidns more user-visible
From: Aleksa Sarai @ 2025-09-05 14:48 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Jan Kara, Jonathan Corbet, Shuah Khan,
	Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest
In-Reply-To: <20250902-gehofft-ruheraum-3c286b25b6d3@brauner>

[-- Attachment #1: Type: text/plain, Size: 8585 bytes --]

On 2025-09-02, Christian Brauner <brauner@kernel.org> wrote:
> On Tue, Aug 05, 2025 at 03:45:07PM +1000, Aleksa Sarai wrote:
> > Ever since the introduction of pid namespaces, procfs has had very
> > implicit behaviour surrounding them (the pidns used by a procfs mount is
> > auto-selected based on the mounting process's active pidns, and the
> > pidns itself is basically hidden once the mount has been constructed).
> > 
> > /* pidns mount option for procfs */
> > 
> > This implicit behaviour has historically meant that userspace was
> > required to do some special dances in order to configure the pidns of a
> > procfs mount as desired. Examples include:
> > 
> >  * In order to bypass the mnt_too_revealing() check, Kubernetes creates
> >    a procfs mount from an empty pidns so that user namespaced containers
> >    can be nested (without this, the nested containers would fail to
> >    mount procfs). But this requires forking off a helper process because
> >    you cannot just one-shot this using mount(2).
> > 
> >  * Container runtimes in general need to fork into a container before
> >    configuring its mounts, which can lead to security issues in the case
> >    of shared-pidns containers (a privileged process in the pidns can
> >    interact with your container runtime process). While
> >    SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
> >    strict need for this due to a minor uAPI wart is kind of unfortunate.
> > 
> > Things would be much easier if there was a way for userspace to just
> > specify the pidns they want. Patch 1 implements a new "pidns" argument
> > which can be set using fsconfig(2):
> > 
> >     fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
> >     fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
> > 
> > or classic mount(2) / mount(8):
> > 
> >     // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
> >     mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
> > 
> > The initial security model I have in this RFC is to be as conservative
> > as possible and just mirror the security model for setns(2) -- which
> > means that you can only set pidns=... to pid namespaces that your
> > current pid namespace is a direct ancestor of and you have CAP_SYS_ADMIN
> > privileges over the pid namespace. This fulfils the requirements of
> > container runtimes, but I suspect that this may be too strict for some
> > usecases.
> > 
> > The pidns argument is not displayed in mountinfo -- it's not clear to me
> > what value it would make sense to show (maybe we could just use ns_dname
> > to provide an identifier for the namespace, but this number would be
> > fairly useless to userspace). I'm open to suggestions. Note that
> > PROCFS_GET_PID_NAMESPACE (see below) does at least let userspace get
> > information about this outside of mountinfo.
> > 
> > Note that you cannot change the pidns of an already-created procfs
> > instance. The primary reason is that allowing this to be changed would
> > require RCU-protecting proc_pid_ns(sb) and thus auditing all of
> > fs/proc/* and some of the users in fs/* to make sure they wouldn't UAF
> > the pid namespace. Since creating procfs instances is very cheap, it
> > seems unnecessary to overcomplicate this upfront. Trying to reconfigure
> > procfs this way errors out with -EBUSY.
> > 
> > /* ioctl(PROCFS_GET_PID_NAMESPACE) */
> > 
> > In addition, being able to figure out what pid namespace is being used
> > by a procfs mount is quite useful when you have an administrative
> > process (such as a container runtime) which wants to figure out the
> > correct way of mapping PIDs between its own namespace and the namespace
> > for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
> > alternative ways to do this, but they all rely on ancillary information
> > that third-party libraries and tools do not necessarily have access to.
> > 
> > To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
> > can be used to get a reference to the pidns that a procfs is using.
> > 
> > Rather than copying the (fairly strict) security model for setns(2),
> > apply a slightly looser model to better match what userspace can already
> > do:
> > 
> >  * Make the ioctl only valid on the root (meaning that a process without
> >    access to the procfs root -- such as only having an fd to a procfs
> >    file or some open_tree(2)-like subset -- cannot use this API). This
> >    means that the process already has some level of access to the
> >    /proc/$pid directories.
> > 
> >  * If the calling process is in an ancestor pidns, then they can already
> >    create pidfd for processes inside the pidns, which is morally
> >    equivalent to a pidns file descriptor according to setns(2). So it
> >    seems reasonable to just allow it in this case. (The justification
> >    for this model was suggested by Christian.)
> > 
> >  * If the process has access to /proc/1/ns/pid already (i.e. has
> >    ptrace-read access to the pidns pid1), then this ioctl is equivalent
> >    to just opening a handle to it that way.
> > 
> >    Ideally we would check for ptrace-read access against all processes
> >    in the pidns (which is very likely to be true for at least one
> >    process, as SUID_DUMP_DISABLE is cleared on exec(2) and is rarely set
> >    by most programs), but this would obviously not scale.
> > 
> > I'm open to suggestions for whether we need to make this stricter (or
> > possibly allow more cases).
> > 
> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> 
> Thanks for the patchset. Being able to specify what pid namespace the
> procfs instance is supposed to belong to is super useful and will make
> things easier for userspace for sure.

I was going to send a new version changing the whole thing to be struct
path based (and adding FSCONFIG_SET_PATH{,_EMPTY} support) so we don't
need to allocate a file explicitly for the non-FSCONFIG_SET_FD case, but
we can do that as a follow-up I guess.

> The code you added contains a minor wrinkle that I disliked which I've
> changed and you tell me if you can live with this restriction or not.
> 
> The way you've implemented it specifying a pid namespace that the caller
> holds privilege over would silently also override the user namespace the
> filesystem is supposed to belong to.
> 
> Specifically, you did something like:
> 
>         put_pid_ns(ctx->pid_ns);
>         ctx->pid_ns = get_pid_ns(target);
>         put_user_ns(fc->user_ns);
>         fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
> 
> This silently overrides the user namespace recorded at fsopen() time. I
> think that's too subtle and we should just not allow that at all for
> now.
> 
> Instead I've changed this to:
> 
>         if (fc->user_ns != target->user_ns)
>                 return invalfc(fc, "owning user namespace of pid namespace doesn't match procfs user namespace");
> 
>         put_pid_ns(ctx->pid_ns);
>         ctx->pid_ns = get_pid_ns(target);
> 
> so we just refuse different owernship.

That sounds fine, I wasn't quite sure what to do with fc->user_ns to be
honest. Being more conservative is probably the right call here.

> I've also dropped the procfs ioctl because I'm not sure how much value
> it will actually add given that you can do this via /proc/1/ns/pid.
> 
> If that is something that libpathrs despearately needs I would like to
> do it as a separate patch anyways.

The main issues are:

1. pid1 can often be non-dumpable, which can block you from doing that.
   In principle, because the dumpable flag is reset on execve, it is
   theoretically possible to get access to /proc/$pid/ns/pid if you win
   the race in a pid namespace with lots of process activity, but this
   kind of sucks.

2. This approach doesn't work for empty pid namesapces.
   pidns_for_children doesn't let you get a handle to an empty pid
   namespace either (I briefly looked at the history and it seems this
   was silently changed in v2 of the patchset based on some feedback
   that I'm not sure was entirely correct).

3. Now that you can configure the procfs mount, it seems like a
   half-baked interface to not provide diagnostic information about the
   namespace. (I suspect the criu folks would be happy to have this too
   ;).)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [PATCH v20 4/8] fork: Add shadow stack support to clone3()
From: Christian Brauner @ 2025-09-05 15:21 UTC (permalink / raw)
  To: Mark Brown
  Cc: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Shuah Khan, linux-kernel,
	Catalin Marinas, Will Deacon, jannh, Andrew Morton,
	Yury Khrustalev, Wilco Dijkstra, linux-kselftest, linux-api,
	Kees Cook
In-Reply-To: <20250902-clone3-shadow-stack-v20-4-4d9fff1c53e7@kernel.org>

On Tue, Sep 02, 2025 at 11:21:48AM +0100, Mark Brown wrote:
> Unlike with the normal stack there is no API for configuring the shadow
> stack for a new thread, instead the kernel will dynamically allocate a
> new shadow stack with the same size as the normal stack. This appears to
> be due to the shadow stack series having been in development since
> before the more extensible clone3() was added rather than anything more
> deliberate.
> 
> Add a parameter to clone3() specifying a shadow stack pointer to use
> for the new thread, this is inconsistent with the way we specify the
> normal stack but during review concerns were expressed about having to
> identify where the shadow stack pointer should be placed especially in
> cases where the shadow stack has been previously active.  If no shadow
> stack is specified then the existing implicit allocation behaviour is
> maintained.
> 
> If a shadow stack pointer is specified then it is required to have an
> architecture defined token placed on the stack, this will be consumed by
> the new task, the shadow stack is specified by pointing to this token.  If
> no valid token is present then this will be reported with -EINVAL.  This
> token prevents new threads being created pointing at the shadow stack of
> an existing running thread.  On architectures with support for userspace
> pivoting of shadow stacks it is expected that the same format and placement
> of tokens will be used, this is the case for arm64 and x86.
> 
> If the architecture does not support shadow stacks the shadow stack
> pointer must be not be specified, architectures that do support the
> feature are expected to enforce the same requirement on individual
> systems that lack shadow stack support.
> 
> Update the existing arm64 and x86 implementations to pay attention to
> the newly added arguments, in order to maintain compatibility we use the
> existing behaviour if no shadow stack is specified. Since we are now
> using more fields from the kernel_clone_args we pass that into the
> shadow stack code rather than individual fields.
> 
> Portions of the x86 architecture code were written by Rick Edgecombe.
> 
> Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
> Tested-by: Yury Khrustalev <yury.khrustalev@arm.com>
> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Mark Brown <broonie@kernel.org>
> ---
>  arch/arm64/mm/gcs.c              | 47 +++++++++++++++++++-
>  arch/x86/include/asm/shstk.h     | 11 +++--
>  arch/x86/kernel/process.c        |  2 +-
>  arch/x86/kernel/shstk.c          | 53 ++++++++++++++++++++---
>  include/asm-generic/cacheflush.h | 11 +++++
>  include/linux/sched/task.h       | 17 ++++++++
>  include/uapi/linux/sched.h       |  9 ++--
>  kernel/fork.c                    | 93 ++++++++++++++++++++++++++++++++++------
>  8 files changed, 217 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/arm64/mm/gcs.c b/arch/arm64/mm/gcs.c
> index 3abcbf9adb5c..249ff05bca45 100644
> --- a/arch/arm64/mm/gcs.c
> +++ b/arch/arm64/mm/gcs.c
> @@ -43,8 +43,23 @@ int gcs_alloc_thread_stack(struct task_struct *tsk,
>  {
>  	unsigned long addr, size;
>  
> -	if (!system_supports_gcs())
> +	if (!system_supports_gcs()) {
> +		if (args->shadow_stack_token)
> +			return -EINVAL;
> +
>  		return 0;
> +	}
> +
> +	/*
> +	 * If the user specified a GCS then use it, otherwise fall
> +	 * back to a default allocation strategy. Validation is done
> +	 * in arch_shstk_validate_clone().
> +	 */
> +	if (args->shadow_stack_token) {
> +		tsk->thread.gcs_base = 0;
> +		tsk->thread.gcs_size = 0;
> +		return 0;
> +	}
>  
>  	if (!task_gcs_el0_enabled(tsk))
>  		return 0;
> @@ -68,6 +83,36 @@ int gcs_alloc_thread_stack(struct task_struct *tsk,
>  	return 0;
>  }
>  
> +static bool gcs_consume_token(struct vm_area_struct *vma, struct page *page,
> +			      unsigned long user_addr)
> +{
> +	u64 expected = GCS_CAP(user_addr);
> +	u64 *token = page_address(page) + offset_in_page(user_addr);
> +
> +	if (!cmpxchg_to_user_page(vma, page, user_addr, token, expected, 0))
> +		return false;
> +	set_page_dirty_lock(page);
> +
> +	return true;
> +}
> +
> +int arch_shstk_validate_clone(struct task_struct *tsk,
> +			      struct vm_area_struct *vma,
> +			      struct page *page,
> +			      struct kernel_clone_args *args)
> +{
> +	unsigned long gcspr_el0;
> +	int ret = 0;
> +
> +	gcspr_el0 = args->shadow_stack_token;
> +	if (!gcs_consume_token(vma, page, gcspr_el0))
> +		return -EINVAL;
> +
> +	tsk->thread.gcspr_el0 = gcspr_el0 + sizeof(u64);
> +
> +	return ret;
> +}
> +
>  SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
>  {
>  	unsigned long alloc_size;
> diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
> index ba6f2fe43848..827e983430aa 100644
> --- a/arch/x86/include/asm/shstk.h
> +++ b/arch/x86/include/asm/shstk.h
> @@ -6,6 +6,7 @@
>  #include <linux/types.h>
>  
>  struct task_struct;
> +struct kernel_clone_args;
>  struct ksignal;
>  
>  #ifdef CONFIG_X86_USER_SHADOW_STACK
> @@ -16,8 +17,8 @@ struct thread_shstk {
>  
>  long shstk_prctl(struct task_struct *task, int option, unsigned long arg2);
>  void reset_thread_features(void);
> -unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
> -				       unsigned long stack_size);
> +unsigned long shstk_alloc_thread_stack(struct task_struct *p,
> +				       const struct kernel_clone_args *args);
>  void shstk_free(struct task_struct *p);
>  int setup_signal_shadow_stack(struct ksignal *ksig);
>  int restore_signal_shadow_stack(void);
> @@ -28,8 +29,10 @@ static inline long shstk_prctl(struct task_struct *task, int option,
>  			       unsigned long arg2) { return -EINVAL; }
>  static inline void reset_thread_features(void) {}
>  static inline unsigned long shstk_alloc_thread_stack(struct task_struct *p,
> -						     unsigned long clone_flags,
> -						     unsigned long stack_size) { return 0; }
> +						     const struct kernel_clone_args *args)
> +{
> +	return 0;
> +}
>  static inline void shstk_free(struct task_struct *p) {}
>  static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
>  static inline int restore_signal_shadow_stack(void) { return 0; }
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 1b7960cf6eb0..0a54af6c60df 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -209,7 +209,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
>  	 * is disabled, new_ssp will remain 0, and fpu_clone() will know not to
>  	 * update it.
>  	 */
> -	new_ssp = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
> +	new_ssp = shstk_alloc_thread_stack(p, args);
>  	if (IS_ERR_VALUE(new_ssp))
>  		return PTR_ERR((void *)new_ssp);
>  
> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index 2ddf23387c7e..9926d58e5d41 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -191,18 +191,61 @@ void reset_thread_features(void)
>  	current->thread.features_locked = 0;
>  }
>  
> -unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
> -				       unsigned long stack_size)
> +int arch_shstk_validate_clone(struct task_struct *t,
> +			      struct vm_area_struct *vma,
> +			      struct page *page,
> +			      struct kernel_clone_args *args)
> +{
> +	void *maddr = page_address(page);
> +	unsigned long token;
> +	int offset;
> +	u64 expected;
> +
> +	/*
> +	 * kernel_clone_args() verification assures token address is 8
> +	 * byte aligned.
> +	 */
> +	token = args->shadow_stack_token;
> +	expected = (token + SS_FRAME_SIZE) | BIT(0);
> +	offset = offset_in_page(token);
> +
> +	if (!cmpxchg_to_user_page(vma, page, token, (unsigned long *)(maddr + offset),
> +				  expected, 0))
> +		return -EINVAL;
> +	set_page_dirty_lock(page);
> +
> +	return 0;
> +}
> +
> +unsigned long shstk_alloc_thread_stack(struct task_struct *tsk,
> +				       const struct kernel_clone_args *args)
>  {
>  	struct thread_shstk *shstk = &tsk->thread.shstk;
> +	unsigned long clone_flags = args->flags;
>  	unsigned long addr, size;
>  
>  	/*
>  	 * If shadow stack is not enabled on the new thread, skip any
> -	 * switch to a new shadow stack.
> +	 * implicit switch to a new shadow stack and reject attempts to
> +	 * explicitly specify one.
>  	 */
> -	if (!features_enabled(ARCH_SHSTK_SHSTK))
> +	if (!features_enabled(ARCH_SHSTK_SHSTK)) {
> +		if (args->shadow_stack_token)
> +			return (unsigned long)ERR_PTR(-EINVAL);
> +
>  		return 0;
> +	}
> +
> +	/*
> +	 * If the user specified a shadow stack then use it, otherwise
> +	 * fall back to a default allocation strategy. Validation is
> +	 * done in arch_shstk_validate_clone().
> +	 */
> +	if (args->shadow_stack_token) {
> +		shstk->base = 0;
> +		shstk->size = 0;
> +		return args->shadow_stack_token + 8;
> +	}
>  
>  	/*
>  	 * For CLONE_VFORK the child will share the parents shadow stack.
> @@ -222,7 +265,7 @@ unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long cl
>  	if (!(clone_flags & CLONE_VM))
>  		return 0;
>  
> -	size = adjust_shstk_size(stack_size);
> +	size = adjust_shstk_size(args->stack_size);
>  	addr = alloc_shstk(0, size, 0, false);
>  	if (IS_ERR_VALUE(addr))
>  		return addr;
> diff --git a/include/asm-generic/cacheflush.h b/include/asm-generic/cacheflush.h
> index 7ee8a179d103..96cc0c7a5c90 100644
> --- a/include/asm-generic/cacheflush.h
> +++ b/include/asm-generic/cacheflush.h
> @@ -124,4 +124,15 @@ static inline void flush_cache_vunmap(unsigned long start, unsigned long end)
>  	} while (0)
>  #endif
>  
> +#ifndef cmpxchg_to_user_page
> +#define cmpxchg_to_user_page(vma, page, vaddr, ptr, old, new)  \
> +({							  \
> +	bool ret;						  \
> +								  \
> +	ret = try_cmpxchg(ptr, &old, new);			  \
> +	flush_icache_user_page(vma, page, vaddr, sizeof(*ptr));	  \
> +	ret;							  \
> +})
> +#endif
> +
>  #endif /* _ASM_GENERIC_CACHEFLUSH_H */
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index ea41795a352b..b501f752fc9a 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -16,6 +16,7 @@ struct task_struct;
>  struct rusage;
>  union thread_union;
>  struct css_set;
> +struct vm_area_struct;
>  
>  /* All the bits taken by the old clone syscall. */
>  #define CLONE_LEGACY_FLAGS 0xffffffffULL
> @@ -44,6 +45,7 @@ struct kernel_clone_args {
>  	struct cgroup *cgrp;
>  	struct css_set *cset;
>  	unsigned int kill_seq;
> +	unsigned long shadow_stack_token;
>  };
>  
>  /*
> @@ -226,4 +228,19 @@ static inline void task_unlock(struct task_struct *p)
>  
>  DEFINE_GUARD(task_lock, struct task_struct *, task_lock(_T), task_unlock(_T))
>  
> +#ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK
> +int arch_shstk_validate_clone(struct task_struct *p,
> +			      struct vm_area_struct *vma,
> +			      struct page *page,
> +			      struct kernel_clone_args *args);
> +#else
> +static inline int arch_shstk_validate_clone(struct task_struct *p,
> +					    struct vm_area_struct *vma,
> +					    struct page *page,
> +					    struct kernel_clone_args *args)
> +{
> +	return 0;
> +}
> +#endif
> +
>  #endif /* _LINUX_SCHED_TASK_H */
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 359a14cc76a4..9cf5c419e109 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -84,6 +84,7 @@
>   *                kernel's limit of nested PID namespaces.
>   * @cgroup:       If CLONE_INTO_CGROUP is specified set this to
>   *                a file descriptor for the cgroup.
> + * @shadow_stack_token: Pointer to shadow stack token at top of stack.
>   *
>   * The structure is versioned by size and thus extensible.
>   * New struct members must go at the end of the struct and
> @@ -101,12 +102,14 @@ struct clone_args {
>  	__aligned_u64 set_tid;
>  	__aligned_u64 set_tid_size;
>  	__aligned_u64 cgroup;
> +	__aligned_u64 shadow_stack_token;
>  };
>  #endif
>  
> -#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
> -#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
> -#define CLONE_ARGS_SIZE_VER2 88 /* sizeof third published struct */
> +#define CLONE_ARGS_SIZE_VER0  64 /* sizeof first published struct */
> +#define CLONE_ARGS_SIZE_VER1  80 /* sizeof second published struct */
> +#define CLONE_ARGS_SIZE_VER2  88 /* sizeof third published struct */
> +#define CLONE_ARGS_SIZE_VER3  96 /* sizeof fourth published struct */
>  
>  /*
>   * Scheduling policies
> diff --git a/kernel/fork.c b/kernel/fork.c
> index af673856499d..d484ebeded33 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1907,6 +1907,51 @@ static bool need_futex_hash_allocate_default(u64 clone_flags)
>  	return true;
>  }
>  
> +static int shstk_validate_clone(struct task_struct *p,
> +				struct kernel_clone_args *args)
> +{
> +	struct mm_struct *mm;
> +	struct vm_area_struct *vma;
> +	struct page *page;
> +	unsigned long addr;
> +	int ret;
> +
> +	if (!IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK))
> +		return 0;
> +
> +	if (!args->shadow_stack_token)
> +		return 0;
> +
> +	mm = get_task_mm(p);
> +	if (!mm)
> +		return -EFAULT;
> +
> +	mmap_read_lock(mm);
> +
> +	addr = untagged_addr_remote(mm, args->shadow_stack_token);
> +	page = get_user_page_vma_remote(mm, addr, FOLL_FORCE | FOLL_WRITE,
> +					&vma);
> +	if (IS_ERR(page)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	if (!(vma->vm_flags & VM_SHADOW_STACK) ||
> +	    !(vma->vm_flags & VM_WRITE)) {
> +		ret = -EFAULT;
> +		goto out_page;
> +	}
> +
> +	ret = arch_shstk_validate_clone(p, vma, page, args);
> +
> +out_page:
> +	put_page(page);
> +out:
> +	mmap_read_unlock(mm);
> +	mmput(mm);
> +	return ret;
> +}
> +
>  /*
>   * This creates a new process as a copy of the old one,
>   * but does not actually start it yet.
> @@ -2182,6 +2227,9 @@ __latent_entropy struct task_struct *copy_process(
>  	if (retval)
>  		goto bad_fork_cleanup_namespaces;
>  	retval = copy_thread(p, args);
> +	if (retval)
> +		goto bad_fork_cleanup_io;
> +	retval = shstk_validate_clone(p, args);
>  	if (retval)
>  		goto bad_fork_cleanup_io;
>  
> @@ -2763,7 +2811,9 @@ static noinline int copy_clone_args_from_user(struct kernel_clone_args *kargs,
>  		     CLONE_ARGS_SIZE_VER1);
>  	BUILD_BUG_ON(offsetofend(struct clone_args, cgroup) !=
>  		     CLONE_ARGS_SIZE_VER2);
> -	BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER2);
> +	BUILD_BUG_ON(offsetofend(struct clone_args, shadow_stack_token) !=
> +		     CLONE_ARGS_SIZE_VER3);
> +	BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER3);
>  
>  	if (unlikely(usize > PAGE_SIZE))
>  		return -E2BIG;
> @@ -2796,16 +2846,17 @@ static noinline int copy_clone_args_from_user(struct kernel_clone_args *kargs,
>  		return -EINVAL;
>  
>  	*kargs = (struct kernel_clone_args){
> -		.flags		= args.flags,
> -		.pidfd		= u64_to_user_ptr(args.pidfd),
> -		.child_tid	= u64_to_user_ptr(args.child_tid),
> -		.parent_tid	= u64_to_user_ptr(args.parent_tid),
> -		.exit_signal	= args.exit_signal,
> -		.stack		= args.stack,
> -		.stack_size	= args.stack_size,
> -		.tls		= args.tls,
> -		.set_tid_size	= args.set_tid_size,
> -		.cgroup		= args.cgroup,
> +		.flags			= args.flags,
> +		.pidfd			= u64_to_user_ptr(args.pidfd),
> +		.child_tid		= u64_to_user_ptr(args.child_tid),
> +		.parent_tid		= u64_to_user_ptr(args.parent_tid),
> +		.exit_signal		= args.exit_signal,
> +		.stack			= args.stack,
> +		.stack_size		= args.stack_size,
> +		.tls			= args.tls,
> +		.set_tid_size		= args.set_tid_size,
> +		.cgroup			= args.cgroup,
> +		.shadow_stack_token	= args.shadow_stack_token,

I'm not sure why this has to be named "shadow_stack_token" I think
that's just confusing and we should just call it "shadow_stack" and be
done with it. It's also a bit long of a field name imho.

I have a kernel-6.18.clone3 branch
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=kernel-6.18.clone3
because there's another cross-arch cleanup that cleans up copy_thread(),
copy_sighand(), and copy_process() and - surprisingly - also adds
clone3() support for nios2...

Anyway, if you just want me to slap it on top of that branch then I can
simply rename while applying so no need to resend in that case.

>  	};
>  
>  	if (args.set_tid &&
> @@ -2846,6 +2897,24 @@ static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
>  	return true;
>  }
>  
> +/**
> + * clone3_shadow_stack_valid - check and prepare shadow stack
> + * @kargs: kernel clone args
> + *
> + * Verify that shadow stacks are only enabled if supported.
> + */
> +static inline bool clone3_shadow_stack_valid(struct kernel_clone_args *kargs)
> +{
> +	if (!kargs->shadow_stack_token)
> +		return true;
> +
> +	if (!IS_ALIGNED(kargs->shadow_stack_token, sizeof(void *)))
> +		return false;
> +
> +	/* Fail if the kernel wasn't built with shadow stacks */
> +	return IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK);
> +}
> +
>  static bool clone3_args_valid(struct kernel_clone_args *kargs)
>  {
>  	/* Verify that no unknown flags are passed along. */
> @@ -2868,7 +2937,7 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
>  	    kargs->exit_signal)
>  		return false;
>  
> -	if (!clone3_stack_valid(kargs))
> +	if (!clone3_stack_valid(kargs) || !clone3_shadow_stack_valid(kargs))
>  		return false;
>  
>  	return true;
> 
> -- 
> 2.39.5
> 

^ permalink raw reply

* Re: [PATCH v20 4/8] fork: Add shadow stack support to clone3()
From: Mark Brown @ 2025-09-05 15:43 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Shuah Khan, linux-kernel,
	Catalin Marinas, Will Deacon, jannh, Andrew Morton,
	Yury Khrustalev, Wilco Dijkstra, linux-kselftest, linux-api,
	Kees Cook
In-Reply-To: <20250905-nutria-befund-2f3e92003734@brauner>

[-- Attachment #1: Type: text/plain, Size: 1668 bytes --]

On Fri, Sep 05, 2025 at 05:21:59PM +0200, Christian Brauner wrote:
> On Tue, Sep 02, 2025 at 11:21:48AM +0100, Mark Brown wrote:

> > +		.shadow_stack_token	= args.shadow_stack_token,

> I'm not sure why this has to be named "shadow_stack_token" I think
> that's just confusing and we should just call it "shadow_stack" and be
> done with it. It's also a bit long of a field name imho.

I'm not hugely attached to the name, if you want to rename that's
perfectly fine by me.  My thinking was that there's a potential
confusion with it being a pointer to the base of the shadow stack by
comparison with the existing "stack" but I do agree that the resulting
name is quite long and if someone does actually get confused they should
discover the problem fairly rapidly in testing.  ss_token would shorter
but the abbreviation is less clear, whatever name you prefer is fine by
me.

> I have a kernel-6.18.clone3 branch
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=kernel-6.18.clone3
> because there's another cross-arch cleanup that cleans up copy_thread(),
> copy_sighand(), and copy_process() and - surprisingly - also adds
> clone3() support for nios2...

> Anyway, if you just want me to slap it on top of that branch then I can
> simply rename while applying so no need to resend in that case.

That would be amazing, I'm totally happy with you doing that.  If I do
need to rebase and resend let me know.

It's probably worth mentioning that the RISC-V shadow stack support was
getting near to being merged, if that ends up being OK for this release
(it's not been posted yet though) there might be some cross tree merge
needed or something.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH v20 4/8] fork: Add shadow stack support to clone3()
From: Edgecombe, Rick P @ 2025-09-05 15:53 UTC (permalink / raw)
  To: broonie@kernel.org, brauner@kernel.org
  Cc: dietmar.eggemann@arm.com, linux-api@vger.kernel.org,
	Szabolcs.Nagy@arm.com, shuah@kernel.org,
	dave.hansen@linux.intel.com, debug@rivosinc.com, mgorman@suse.de,
	vincent.guittot@linaro.org, fweimer@redhat.com,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	rostedt@goodmis.org, hjl.tools@gmail.com, tglx@linutronix.de,
	akpm@linux-foundation.org, vschneid@redhat.com,
	catalin.marinas@arm.com, kees@kernel.org, will@kernel.org,
	hpa@zytor.com, peterz@infradead.org, jannh@google.com,
	yury.khrustalev@arm.com, bp@alien8.de, wilco.dijkstra@arm.com,
	bsegall@google.com, linux-kselftest@vger.kernel.org,
	x86@kernel.org, juri.lelli@redhat.com
In-Reply-To: <0ff8b70e-283f-4d56-8bab-bcae11cd5bdb@sirena.org.uk>

On Fri, 2025-09-05 at 16:43 +0100, Mark Brown wrote:
> On Fri, Sep 05, 2025 at 05:21:59PM +0200, Christian Brauner wrote:
> > On Tue, Sep 02, 2025 at 11:21:48AM +0100, Mark Brown wrote:
> 
> > > +		.shadow_stack_token	= args.shadow_stack_token,
> 
> > I'm not sure why this has to be named "shadow_stack_token" I think
> > that's just confusing and we should just call it "shadow_stack" and be
> > done with it. It's also a bit long of a field name imho.
> 
> I'm not hugely attached to the name, if you want to rename that's
> perfectly fine by me.  My thinking was that there's a potential
> confusion with it being a pointer to the base of the shadow stack by
> comparison with the existing "stack" but I do agree that the resulting
> name is quite long and if someone does actually get confused they should
> discover the problem fairly rapidly in testing.  ss_token would shorter
> but the abbreviation is less clear, whatever name you prefer is fine by
> me.

Yea the token point here is kind of important. That said, we could probably make
up for it with documentation.

^ permalink raw reply

* Re: [PATCH v20 4/8] fork: Add shadow stack support to clone3()
From: Kees Cook @ 2025-09-05 16:00 UTC (permalink / raw)
  To: Mark Brown
  Cc: Christian Brauner, Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy,
	H.J. Lu, Florian Weimer, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Shuah Khan,
	linux-kernel, Catalin Marinas, Will Deacon, jannh, Andrew Morton,
	Yury Khrustalev, Wilco Dijkstra, linux-kselftest, linux-api
In-Reply-To: <0ff8b70e-283f-4d56-8bab-bcae11cd5bdb@sirena.org.uk>

On Fri, Sep 05, 2025 at 04:43:22PM +0100, Mark Brown wrote:
> On Fri, Sep 05, 2025 at 05:21:59PM +0200, Christian Brauner wrote:
> > On Tue, Sep 02, 2025 at 11:21:48AM +0100, Mark Brown wrote:
> 
> > > +		.shadow_stack_token	= args.shadow_stack_token,
> 
> > I'm not sure why this has to be named "shadow_stack_token" I think
> > that's just confusing and we should just call it "shadow_stack" and be
> > done with it. It's also a bit long of a field name imho.
> 
> I'm not hugely attached to the name, if you want to rename that's
> perfectly fine by me.  My thinking was that there's a potential
> confusion with it being a pointer to the base of the shadow stack by
> comparison with the existing "stack" but I do agree that the resulting
> name is quite long and if someone does actually get confused they should
> discover the problem fairly rapidly in testing.  ss_token would shorter
> but the abbreviation is less clear, whatever name you prefer is fine by
> me.

Bike shed: shstk_token?


-- 
Kees Cook

^ permalink raw reply

* Re: [PATCH v20 4/8] fork: Add shadow stack support to clone3()
From: Mark Brown @ 2025-09-05 16:02 UTC (permalink / raw)
  To: Kees Cook
  Cc: Christian Brauner, Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy,
	H.J. Lu, Florian Weimer, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Shuah Khan,
	linux-kernel, Catalin Marinas, Will Deacon, jannh, Andrew Morton,
	Yury Khrustalev, Wilco Dijkstra, linux-kselftest, linux-api
In-Reply-To: <202509050900.8A01B1E6@keescook>

[-- Attachment #1: Type: text/plain, Size: 375 bytes --]

On Fri, Sep 05, 2025 at 09:00:51AM -0700, Kees Cook wrote:
> On Fri, Sep 05, 2025 at 04:43:22PM +0100, Mark Brown wrote:

> > discover the problem fairly rapidly in testing.  ss_token would shorter
> > but the abbreviation is less clear, whatever name you prefer is fine by
> > me.

> Bike shed: shstk_token?

That also works and is fine by me, probably better than my idea.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* [PATCH 2/3] ext4: add support for 32-bit default reserved uid and gid values
From: Theodore Ts'o via B4 Relay @ 2025-09-09  3:15 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-api
In-Reply-To: <20250908-tune2fs-v1-0-e3a6929f3355@mit.edu>

From: Theodore Ts'o <tytso@mit.edu>

Support for specifying the default user id and group id that is
allowed to use the reserved block space was added way back when Linux
only supported 16-bit uid's and gid's.  (Yeah, that long ago.)  It's
not a commonly used feature, but let's add support for 32-bit user and
group id's.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/ext4/ext4.h  | 16 +++++++++++++++-
 fs/ext4/super.c |  8 ++++----
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 01a6e2de7fc3ef0e20b039d3200b9c9bd656f59f..4bfcd5f0c74fda30db4009ee28fbee00a2f6b76f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1442,7 +1442,9 @@ struct ext4_super_block {
 	__le16  s_encoding;		/* Filename charset encoding */
 	__le16  s_encoding_flags;	/* Filename charset encoding flags */
 	__le32  s_orphan_file_inum;	/* Inode for tracking orphan inodes */
-	__le32	s_reserved[94];		/* Padding to the end of the block */
+	__le16	s_def_resuid_hi;
+	__le16	s_def_resgid_hi;
+	__le32	s_reserved[93];		/* Padding to the end of the block */
 	__le32	s_checksum;		/* crc32c(superblock) */
 };
 
@@ -1812,6 +1814,18 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
 		 ino <= le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count));
 }
 
+static inline int ext4_get_resuid(struct ext4_super_block *es)
+{
+	return(le16_to_cpu(es->s_def_resuid) |
+	       (le16_to_cpu(es->s_def_resuid_hi) << 16));
+}
+
+static inline int ext4_get_resgid(struct ext4_super_block *es)
+{
+	return(le16_to_cpu(es->s_def_resgid) |
+	       (le16_to_cpu(es->s_def_resgid_hi) << 16));
+}
+
 /*
  * Returns: sbi->field[index]
  * Used to access an array element from the following sbi fields which require
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 94c98446c84f9a4614971d246ca7f001de610a8a..0256c8f7c6cee2b8d9295f2fa9a7acd904382e83 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2951,11 +2951,11 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
 	}
 
 	if (nodefs || !uid_eq(sbi->s_resuid, make_kuid(&init_user_ns, EXT4_DEF_RESUID)) ||
-	    le16_to_cpu(es->s_def_resuid) != EXT4_DEF_RESUID)
+	    ext4_get_resuid(es) != EXT4_DEF_RESUID)
 		SEQ_OPTS_PRINT("resuid=%u",
 				from_kuid_munged(&init_user_ns, sbi->s_resuid));
 	if (nodefs || !gid_eq(sbi->s_resgid, make_kgid(&init_user_ns, EXT4_DEF_RESGID)) ||
-	    le16_to_cpu(es->s_def_resgid) != EXT4_DEF_RESGID)
+	    ext4_get_resgid(es) != EXT4_DEF_RESGID)
 		SEQ_OPTS_PRINT("resgid=%u",
 				from_kgid_munged(&init_user_ns, sbi->s_resgid));
 	def_errors = nodefs ? -1 : le16_to_cpu(es->s_errors);
@@ -5270,8 +5270,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
 
 	ext4_set_def_opts(sb, es);
 
-	sbi->s_resuid = make_kuid(&init_user_ns, le16_to_cpu(es->s_def_resuid));
-	sbi->s_resgid = make_kgid(&init_user_ns, le16_to_cpu(es->s_def_resgid));
+	sbi->s_resuid = make_kuid(&init_user_ns, ext4_get_resuid(es));
+	sbi->s_resgid = make_kgid(&init_user_ns, ext4_get_resuid(es));
 	sbi->s_commit_interval = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ;
 	sbi->s_min_batch_time = EXT4_DEF_MIN_BATCH_TIME;
 	sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME;

-- 
2.51.0



^ permalink raw reply related

* [PATCH 1/3] ext4: avoid potential buffer over-read in parse_apply_sb_mount_options()
From: Theodore Ts'o via B4 Relay @ 2025-09-09  3:15 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-api, stable
In-Reply-To: <20250908-tune2fs-v1-0-e3a6929f3355@mit.edu>

From: Theodore Ts'o <tytso@mit.edu>

Unlike other strings in the ext4 superblock, we rely on tune2fs to
make sure s_mount_opts is NUL terminated.  Harden
parse_apply_sb_mount_options() by treating s_mount_opts as a potential
__nonstring.

Cc: stable@vger.kernel.org
Fixes: 8b67f04ab9de ("ext4: Add mount options in superblock")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/ext4/super.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 699c15db28a82f26809bf68533454a242596f0fd..94c98446c84f9a4614971d246ca7f001de610a8a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2460,7 +2460,7 @@ static int parse_apply_sb_mount_options(struct super_block *sb,
 					struct ext4_fs_context *m_ctx)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
-	char *s_mount_opts = NULL;
+	char s_mount_opts[65];
 	struct ext4_fs_context *s_ctx = NULL;
 	struct fs_context *fc = NULL;
 	int ret = -ENOMEM;
@@ -2468,15 +2468,11 @@ static int parse_apply_sb_mount_options(struct super_block *sb,
 	if (!sbi->s_es->s_mount_opts[0])
 		return 0;
 
-	s_mount_opts = kstrndup(sbi->s_es->s_mount_opts,
-				sizeof(sbi->s_es->s_mount_opts),
-				GFP_KERNEL);
-	if (!s_mount_opts)
-		return ret;
+	strscpy_pad(s_mount_opts, sbi->s_es->s_mount_opts);
 
 	fc = kzalloc(sizeof(struct fs_context), GFP_KERNEL);
 	if (!fc)
-		goto out_free;
+		return -ENOMEM;
 
 	s_ctx = kzalloc(sizeof(struct ext4_fs_context), GFP_KERNEL);
 	if (!s_ctx)
@@ -2508,11 +2504,8 @@ static int parse_apply_sb_mount_options(struct super_block *sb,
 	ret = 0;
 
 out_free:
-	if (fc) {
-		ext4_fc_free(fc);
-		kfree(fc);
-	}
-	kfree(s_mount_opts);
+	ext4_fc_free(fc);
+	kfree(fc);
 	return ret;
 }
 

-- 
2.51.0



^ permalink raw reply related

* [PATCH 0/3] ext4: Add support for mounted updates to the superblock via an ioctl
From: Theodore Ts'o via B4 Relay @ 2025-09-09  3:15 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-api, stable

This patch series enables a future version of tune2fs to be able to
modify certain parts of the ext4 superblock without to write to the
block device.

The first patch fixes a potential buffer overrun caused by a
maliciously moified superblock.  The second patch adds support for
32-bit uid and gid's which can have access to the reserved blocks pool.
The last patch adds the ioctl's which will be used by tune2fs.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
Theodore Ts'o (3):
      ext4: avoid potential buffer over-read in parse_apply_sb_mount_options()
      ext4: add support for 32-bit default reserved uid and gid values
      ext4: implemet new ioctls to set and get superblock parameters

 fs/ext4/ext4.h            |  16 ++++-
 fs/ext4/ioctl.c           | 256 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 fs/ext4/super.c           |  25 +++-----
 include/uapi/linux/ext4.h |  75 ++++++++++++++++++++++
 4 files changed, 348 insertions(+), 24 deletions(-)
---
base-commit: b320789d6883cc00ac78ce83bccbfe7ed58afcf0
change-id: 20250830-tune2fs-3376beb72403

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>



^ permalink raw reply

* [PATCH 3/3] ext4: implemet new ioctls to set and get superblock parameters
From: Theodore Ts'o via B4 Relay @ 2025-09-09  3:15 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-api
In-Reply-To: <20250908-tune2fs-v1-0-e3a6929f3355@mit.edu>

From: Theodore Ts'o <tytso@mit.edu>

Implement the EXT4_IOC_GET_TUNE_SB_PARAM and
EXT4_IOC_SET_TUNE_SB_PARAM ioctls, which allow certains superblock
parameters to be set while the file system is mounted, without needing
write access to the block device.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/ext4/ioctl.c           | 256 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 include/uapi/linux/ext4.h |  75 ++++++++++++++++++++++
 2 files changed, 324 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 84e3c73952d72e436429489f5fc8b7ae1c01c7a1..569c98c962af63130c0119f60788a26a2807bd86 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -27,14 +27,16 @@
 #include "fsmap.h"
 #include <trace/events/ext4.h>
 
-typedef void ext4_update_sb_callback(struct ext4_super_block *es,
-				       const void *arg);
+typedef void ext4_update_sb_callback(struct ext4_sb_info *sbi,
+				     struct ext4_super_block *es,
+				     const void *arg);
 
 /*
  * Superblock modification callback function for changing file system
  * label
  */
-static void ext4_sb_setlabel(struct ext4_super_block *es, const void *arg)
+static void ext4_sb_setlabel(struct ext4_sb_info *sbi,
+			     struct ext4_super_block *es, const void *arg)
 {
 	/* Sanity check, this should never happen */
 	BUILD_BUG_ON(sizeof(es->s_volume_name) < EXT4_LABEL_MAX);
@@ -46,7 +48,8 @@ static void ext4_sb_setlabel(struct ext4_super_block *es, const void *arg)
  * Superblock modification callback function for changing file system
  * UUID.
  */
-static void ext4_sb_setuuid(struct ext4_super_block *es, const void *arg)
+static void ext4_sb_setuuid(struct ext4_sb_info *sbi,
+			    struct ext4_super_block *es, const void *arg)
 {
 	memcpy(es->s_uuid, (__u8 *)arg, UUID_SIZE);
 }
@@ -71,7 +74,7 @@ int ext4_update_primary_sb(struct super_block *sb, handle_t *handle,
 		goto out_err;
 
 	lock_buffer(bh);
-	func(es, arg);
+	func(sbi, es, arg);
 	ext4_superblock_csum_set(sb);
 	unlock_buffer(bh);
 
@@ -149,7 +152,7 @@ static int ext4_update_backup_sb(struct super_block *sb,
 		unlock_buffer(bh);
 		goto out_bh;
 	}
-	func(es, arg);
+	func(EXT4_SB(sb), es, arg);
 	if (ext4_has_feature_metadata_csum(sb))
 		es->s_checksum = ext4_superblock_csum(es);
 	set_buffer_uptodate(bh);
@@ -1230,6 +1233,239 @@ static int ext4_ioctl_setuuid(struct file *filp,
 	return ret;
 }
 
+
+#define TUNE_OPS_SUPPORTED (EXT4_TUNE_FL_ERRORS_BEHAVIOR |    \
+	EXT4_TUNE_FL_MNT_COUNT | EXT4_TUNE_FL_MAX_MNT_COUNT | \
+	EXT4_TUNE_FL_CHECKINTRVAL | EXT4_TUNE_FL_LAST_CHECK_TIME | \
+	EXT4_TUNE_FL_RESERVED_BLOCKS | EXT4_TUNE_FL_RESERVED_UID | \
+	EXT4_TUNE_FL_RESERVED_GID | EXT4_TUNE_FL_DEFAULT_MNT_OPTS | \
+	EXT4_TUNE_FL_DEF_HASH_ALG | EXT4_TUNE_FL_RAID_STRIDE | \
+	EXT4_TUNE_FL_RAID_STRIPE_WIDTH | EXT4_TUNE_FL_MOUNT_OPTS | \
+	EXT4_TUNE_FL_FEATURES | EXT4_TUNE_FL_EDIT_FEATURES | \
+	EXT4_TUNE_FL_FORCE_FSCK)
+
+static int ext4_ioctl_get_tune_sb(struct ext4_sb_info *sbi,
+				  struct ext4_tune_sb_params __user *params)
+{
+	struct ext4_tune_sb_params ret;
+	struct ext4_super_block *es = sbi->s_es;
+
+	memset(&ret, 0, sizeof(ret));
+	ret.set_flags = TUNE_OPS_SUPPORTED;
+	ret.errors_behavior = es->s_errors;
+	ret.mnt_count = le16_to_cpu(es->s_mnt_count);
+	ret.max_mnt_count = le16_to_cpu(es->s_max_mnt_count);
+	ret.checkinterval = le32_to_cpu(es->s_checkinterval);
+	ret.last_check_time = le32_to_cpu(es->s_lastcheck);
+	ret.reserved_blocks = ext4_r_blocks_count(es);
+	ret.blocks_count = ext4_blocks_count(es);
+	ret.reserved_uid = ext4_get_resuid(es);
+	ret.reserved_gid = ext4_get_resgid(es);
+	ret.default_mnt_opts = le32_to_cpu(es->s_default_mount_opts);
+	ret.def_hash_alg = es->s_def_hash_version;
+	ret.raid_stride = le16_to_cpu(es->s_raid_stride);
+	ret.raid_stripe_width = le16_to_cpu(es->s_raid_stripe_width);
+	strscpy_pad(ret.mount_opts, es->s_mount_opts);
+	ret.feature_compat = le32_to_cpu(es->s_feature_compat);
+	ret.feature_incompat = le32_to_cpu(es->s_feature_incompat);
+	ret.feature_ro_compat = le32_to_cpu(es->s_feature_ro_compat);
+	ret.set_feature_compat_mask = EXT4_TUNE_SET_COMPAT_SUPP;
+	ret.set_feature_incompat_mask = EXT4_TUNE_SET_INCOMPAT_SUPP;
+	ret.set_feature_ro_compat_mask = EXT4_TUNE_SET_RO_COMPAT_SUPP;
+	ret.clear_feature_compat_mask = EXT4_TUNE_CLEAR_COMPAT_SUPP;
+	ret.clear_feature_incompat_mask = EXT4_TUNE_CLEAR_INCOMPAT_SUPP;
+	ret.clear_feature_ro_compat_mask = EXT4_TUNE_CLEAR_RO_COMPAT_SUPP;
+	if (copy_to_user(params, &ret, sizeof(ret)))
+		return -EFAULT;
+	return 0;
+}
+
+static void ext4_sb_setparams(struct ext4_sb_info *sbi,
+			      struct ext4_super_block *es, const void *arg)
+{
+	const struct ext4_tune_sb_params *params = arg;
+
+	if (params->set_flags & EXT4_TUNE_FL_ERRORS_BEHAVIOR)
+		es->s_errors = cpu_to_le16(params->errors_behavior);
+	if (params->set_flags & EXT4_TUNE_FL_MNT_COUNT)
+		es->s_mnt_count = cpu_to_le16(params->mnt_count);
+	if (params->set_flags & EXT4_TUNE_FL_MAX_MNT_COUNT)
+		es->s_max_mnt_count = cpu_to_le16(params->max_mnt_count);
+	if (params->set_flags & EXT4_TUNE_FL_CHECKINTRVAL)
+		es->s_checkinterval = cpu_to_le32(params->checkinterval);
+	if (params->set_flags & EXT4_TUNE_FL_LAST_CHECK_TIME)
+		es->s_lastcheck = cpu_to_le32(params->last_check_time);
+	if (params->set_flags & EXT4_TUNE_FL_RESERVED_BLOCKS) {
+		ext4_fsblk_t blk = params->reserved_blocks;
+
+		es->s_r_blocks_count_lo = cpu_to_le32((u32)blk);
+		es->s_r_blocks_count_hi = cpu_to_le32(blk >> 32);
+	}
+	if (params->set_flags & EXT4_TUNE_FL_RESERVED_UID) {
+		int uid = params->reserved_uid;
+
+		es->s_def_resuid = cpu_to_le16(uid & 0xFFFF);
+		es->s_def_resuid_hi = cpu_to_le16(uid >> 16);
+	}
+	if (params->set_flags & EXT4_TUNE_FL_RESERVED_GID) {
+		int gid = params->reserved_gid;
+
+		es->s_def_resgid = cpu_to_le16(gid & 0xFFFF);
+		es->s_def_resgid_hi = cpu_to_le16(gid >> 16);
+	}
+	if (params->set_flags & EXT4_TUNE_FL_DEFAULT_MNT_OPTS)
+		es->s_default_mount_opts = cpu_to_le32(params->default_mnt_opts);
+	if (params->set_flags & EXT4_TUNE_FL_DEF_HASH_ALG)
+		es->s_def_hash_version = params->def_hash_alg;
+	if (params->set_flags & EXT4_TUNE_FL_RAID_STRIDE)
+		es->s_raid_stride = cpu_to_le16(params->raid_stride);
+	if (params->set_flags & EXT4_TUNE_FL_RAID_STRIPE_WIDTH)
+		es->s_raid_stripe_width =
+			cpu_to_le16(params->raid_stripe_width);
+	strscpy_pad(es->s_mount_opts, params->mount_opts);
+	if (params->set_flags & EXT4_TUNE_FL_EDIT_FEATURES) {
+		es->s_feature_compat |=
+			cpu_to_le32(params->set_feature_compat_mask);
+		es->s_feature_incompat |=
+			cpu_to_le32(params->set_feature_incompat_mask);
+		es->s_feature_ro_compat |=
+			cpu_to_le32(params->set_feature_ro_compat_mask);
+		es->s_feature_compat &=
+			~cpu_to_le32(params->clear_feature_compat_mask);
+		es->s_feature_incompat &=
+			~cpu_to_le32(params->clear_feature_incompat_mask);
+		es->s_feature_ro_compat &=
+			~cpu_to_le32(params->clear_feature_ro_compat_mask);
+		if (params->set_feature_compat_mask &
+		    EXT4_FEATURE_COMPAT_DIR_INDEX)
+			es->s_def_hash_version = sbi->s_def_hash_version;
+		if (params->set_feature_incompat_mask &
+		    EXT4_FEATURE_INCOMPAT_CSUM_SEED)
+			es->s_checksum_seed = cpu_to_le32(sbi->s_csum_seed);
+	}
+	if (params->set_flags & EXT4_TUNE_FL_FORCE_FSCK)
+		es->s_state |= cpu_to_le16(EXT4_ERROR_FS);
+}
+
+static int ext4_ioctl_set_tune_sb(struct file *filp,
+				  struct ext4_tune_sb_params __user *in)
+{
+	struct ext4_tune_sb_params params;
+	struct super_block *sb = file_inode(filp)->i_sb;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_super_block *es = sbi->s_es;
+	int ret;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (copy_from_user(&params, in, sizeof(params)))
+		return -EFAULT;
+
+	if ((params.set_flags & ~TUNE_OPS_SUPPORTED) != 0)
+		return -EOPNOTSUPP;
+
+	if ((params.set_flags & EXT4_TUNE_FL_ERRORS_BEHAVIOR) &&
+	    (params.errors_behavior > EXT4_ERRORS_PANIC))
+		return -EINVAL;
+
+	if ((params.set_flags & EXT4_TUNE_FL_RESERVED_BLOCKS) &&
+	    (params.reserved_blocks > ext4_blocks_count(sbi->s_es) / 2))
+		return -EINVAL;
+	if ((params.set_flags & EXT4_TUNE_FL_DEF_HASH_ALG) &&
+	    ((params.def_hash_alg > DX_HASH_LAST) ||
+	     (params.def_hash_alg == DX_HASH_SIPHASH)))
+		return -EINVAL;
+	if ((params.set_flags & EXT4_TUNE_FL_FEATURES) &&
+	    (params.set_flags & EXT4_TUNE_FL_EDIT_FEATURES))
+		return -EINVAL;
+
+	if (params.set_flags & EXT4_TUNE_FL_FEATURES) {
+		params.set_feature_compat_mask =
+			params.feature_compat &
+			~le32_to_cpu(es->s_feature_compat);
+		params.set_feature_incompat_mask =
+			params.feature_incompat &
+			~le32_to_cpu(es->s_feature_incompat);
+		params.set_feature_ro_compat_mask =
+			params.feature_ro_compat &
+			~le32_to_cpu(es->s_feature_ro_compat);
+		params.clear_feature_compat_mask =
+			~params.feature_compat &
+			le32_to_cpu(es->s_feature_compat);
+		params.clear_feature_incompat_mask =
+			~params.feature_incompat &
+			le32_to_cpu(es->s_feature_incompat);
+		params.clear_feature_ro_compat_mask =
+			~params.feature_ro_compat &
+			le32_to_cpu(es->s_feature_ro_compat);
+		params.set_flags |= EXT4_TUNE_FL_EDIT_FEATURES;
+	}
+	if (params.set_flags & EXT4_TUNE_FL_EDIT_FEATURES) {
+		if ((params.set_feature_compat_mask &
+		     ~EXT4_TUNE_SET_COMPAT_SUPP) ||
+		    (params.set_feature_incompat_mask &
+		     ~EXT4_TUNE_SET_INCOMPAT_SUPP) ||
+		    (params.set_feature_ro_compat_mask &
+		     ~EXT4_TUNE_SET_RO_COMPAT_SUPP) ||
+		    (params.clear_feature_compat_mask &
+		     ~EXT4_TUNE_CLEAR_COMPAT_SUPP) ||
+		    (params.clear_feature_incompat_mask &
+		     ~EXT4_TUNE_CLEAR_INCOMPAT_SUPP) ||
+		    (params.clear_feature_ro_compat_mask &
+		     ~EXT4_TUNE_CLEAR_RO_COMPAT_SUPP))
+			return -EOPNOTSUPP;
+
+		/*
+		 * Filter out the features that are already set from
+		 * the set_mask.
+		 */
+		params.set_feature_compat_mask &=
+			~le32_to_cpu(es->s_feature_compat);
+		params.set_feature_incompat_mask &=
+			~le32_to_cpu(es->s_feature_incompat);
+		params.set_feature_ro_compat_mask &=
+			~le32_to_cpu(es->s_feature_ro_compat);
+		if ((params.set_feature_compat_mask &
+		     EXT4_FEATURE_COMPAT_DIR_INDEX) &&
+		    !ext4_has_feature_dir_index(sb)) {
+			uuid_t	uu;
+
+			memcpy(&uu, sbi->s_hash_seed, UUID_SIZE);
+			if (uuid_is_null(&uu))
+				generate_random_uuid((char *)
+						     &sbi->s_hash_seed);
+			if (params.set_flags & EXT4_TUNE_FL_DEF_HASH_ALG)
+				sbi->s_def_hash_version = params.def_hash_alg;
+			else if (sbi->s_def_hash_version == 0)
+				sbi->s_def_hash_version = DX_HASH_HALF_MD4;
+			if (!(es->s_flags &
+			      cpu_to_le32(EXT2_FLAGS_UNSIGNED_HASH)) &&
+			    !(es->s_flags &
+			      cpu_to_le32(EXT2_FLAGS_SIGNED_HASH))) {
+#ifdef __CHAR_UNSIGNED__
+				sbi->s_hash_unsigned = 3;
+#else
+				sbi->s_hash_unsigned = 0;
+#endif
+			}
+		}
+	}
+
+
+	ret = mnt_want_write_file(filp);
+	if (ret)
+		return ret;
+
+	ret = ext4_update_superblocks_fn(sb, ext4_sb_setparams, &params);
+	mnt_drop_write_file(filp);
+
+	if (params.set_flags & EXT4_TUNE_FL_DEF_HASH_ALG)
+		sbi->s_def_hash_version = params.def_hash_alg;
+
+	return ret;
+}
+
 static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
 	struct inode *inode = file_inode(filp);
@@ -1616,6 +1852,11 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		return ext4_ioctl_getuuid(EXT4_SB(sb), (void __user *)arg);
 	case EXT4_IOC_SETFSUUID:
 		return ext4_ioctl_setuuid(filp, (const void __user *)arg);
+	case EXT4_IOC_GET_TUNE_SB_PARAM:
+		return ext4_ioctl_get_tune_sb(EXT4_SB(sb),
+					      (void __user *)arg);
+	case EXT4_IOC_SET_TUNE_SB_PARAM:
+		return ext4_ioctl_set_tune_sb(filp, (void __user *)arg);
 	default:
 		return -ENOTTY;
 	}
@@ -1703,7 +1944,8 @@ long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 }
 #endif
 
-static void set_overhead(struct ext4_super_block *es, const void *arg)
+static void set_overhead(struct ext4_sb_info *sbi,
+			 struct ext4_super_block *es, const void *arg)
 {
 	es->s_overhead_clusters = cpu_to_le32(*((unsigned long *) arg));
 }
diff --git a/include/uapi/linux/ext4.h b/include/uapi/linux/ext4.h
index 1c4c2dd29112cda9f7dc91d917492cffc33ee524..145875fd633772e76ce7fd8bc0fef136ff620d2d 100644
--- a/include/uapi/linux/ext4.h
+++ b/include/uapi/linux/ext4.h
@@ -33,6 +33,8 @@
 #define EXT4_IOC_CHECKPOINT		_IOW('f', 43, __u32)
 #define EXT4_IOC_GETFSUUID		_IOR('f', 44, struct fsuuid)
 #define EXT4_IOC_SETFSUUID		_IOW('f', 44, struct fsuuid)
+#define EXT4_IOC_GET_TUNE_SB_PARAM	_IOR('f', 45, struct ext4_tune_sb_params)
+#define EXT4_IOC_SET_TUNE_SB_PARAM	_IOW('f', 46, struct ext4_tune_sb_params)
 
 #define EXT4_IOC_SHUTDOWN _IOR('X', 125, __u32)
 
@@ -108,6 +110,79 @@ struct ext4_new_group_input {
 	__u16 unused;
 };
 
+struct ext4_tune_sb_params {
+	__u32 set_flags;
+	__u32 checkinterval;
+	__u16 errors_behavior;
+	__u16 mnt_count;
+	__u16 max_mnt_count;
+	__u16 raid_stride;
+	__u64 last_check_time;
+	__u64 reserved_blocks;
+	__u64 blocks_count;
+	__u32 default_mnt_opts;
+	__u32 reserved_uid;
+	__u32 reserved_gid;
+	__u32 raid_stripe_width;
+	__u8  def_hash_alg;
+	__u8  pad_1;
+	__u16 pad_2;
+	__u32 feature_compat;
+	__u32 feature_incompat;
+	__u32 feature_ro_compat;
+	__u32 set_feature_compat_mask;
+	__u32 set_feature_incompat_mask;
+	__u32 set_feature_ro_compat_mask;
+	__u32 clear_feature_compat_mask;
+	__u32 clear_feature_incompat_mask;
+	__u32 clear_feature_ro_compat_mask;
+	__u8  mount_opts[64];
+	__u8  pad[64];
+};
+
+#define EXT4_TUNE_FL_ERRORS_BEHAVIOR	0x00000001
+#define EXT4_TUNE_FL_MNT_COUNT		0x00000002
+#define EXT4_TUNE_FL_MAX_MNT_COUNT	0x00000004
+#define EXT4_TUNE_FL_CHECKINTRVAL	0x00000008
+#define EXT4_TUNE_FL_LAST_CHECK_TIME	0x00000010
+#define EXT4_TUNE_FL_RESERVED_BLOCKS	0x00000020
+#define EXT4_TUNE_FL_RESERVED_UID	0x00000040
+#define EXT4_TUNE_FL_RESERVED_GID	0x00000080
+#define EXT4_TUNE_FL_DEFAULT_MNT_OPTS	0x00000100
+#define EXT4_TUNE_FL_DEF_HASH_ALG	0x00000200
+#define EXT4_TUNE_FL_RAID_STRIDE	0x00000400
+#define EXT4_TUNE_FL_RAID_STRIPE_WIDTH	0x00000800
+#define EXT4_TUNE_FL_MOUNT_OPTS		0x00001000
+#define EXT4_TUNE_FL_FEATURES		0x00002000
+#define EXT4_TUNE_FL_EDIT_FEATURES	0x00004000
+#define EXT4_TUNE_FL_FORCE_FSCK		0x00008000
+
+#define EXT4_TUNE_SET_COMPAT_SUPP \
+		(EXT4_FEATURE_COMPAT_DIR_INDEX |	\
+		 EXT4_FEATURE_COMPAT_STABLE_INODES)
+#define EXT4_TUNE_SET_INCOMPAT_SUPP \
+		(EXT4_FEATURE_INCOMPAT_EXTENTS |	\
+		 EXT4_FEATURE_INCOMPAT_EA_INODE |	\
+		 EXT4_FEATURE_INCOMPAT_ENCRYPT |	\
+		 EXT4_FEATURE_INCOMPAT_CSUM_SEED |	\
+		 EXT4_FEATURE_INCOMPAT_LARGEDIR |	\
+		 EXT4_FEATURE_INCOMPAT_CASEFOLD)
+#define EXT4_TUNE_SET_RO_COMPAT_SUPP \
+		(EXT4_FEATURE_RO_COMPAT_LARGE_FILE |	\
+		 EXT4_FEATURE_RO_COMPAT_DIR_NLINK |	\
+		 EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE |	\
+		 EXT4_FEATURE_RO_COMPAT_READONLY |	\
+		 EXT4_FEATURE_RO_COMPAT_PROJECT |	\
+		 EXT4_FEATURE_RO_COMPAT_VERITY)
+
+#define EXT4_TUNE_CLEAR_COMPAT_SUPP (0)
+#define EXT4_TUNE_CLEAR_INCOMPAT_SUPP (0)
+#define EXT4_TUNE_CLEAR_RO_COMPAT_SUPP \
+		(EXT4_FEATURE_RO_COMPAT_LARGE_FILE |	\
+		 EXT4_FEATURE_RO_COMPAT_DIR_NLINK |	\
+		 EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE |	\
+		 EXT4_FEATURE_RO_COMPAT_PROJECT)
+
 /*
  * Returned by EXT4_IOC_GET_ES_CACHE as an additional possible flag.
  * It indicates that the entry in extent status cache is for a hole.

-- 
2.51.0



^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox