Linux userland API discussions

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Chris Li @ 2025-09-03 12:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250902134156.GM186519@nvidia.com>

On Tue, Sep 2, 2025 at 6:42 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Fri, Aug 29, 2025 at 12:18:43PM -0700, Chris Li wrote:
>
> > Another idea is that having a middle layer manages the life cycle of
> > the reserved memory for you. Kind of like a slab allocator for the
> > preserved memory.
>
> If you want a slab allocator then I think you should make slab
> preservable.. Don't need more allocators :\

Sure, we can reuse the slab allocator and add the KHO function to it. I
consider that an implementation detail, and I haven't even started on
it yet. I just want to point out that we might want a high-level
library to take care of the life cycle of the preserved memory, which
means less boilerplate code for the caller.

> > Question: Do we have a matching FDT node to match the memfd C
> > structure hierarchy? Otherwise all the C struct will lump into one FDT
> > node. Maybe one FDT node for all C struct is fine. Then there is a
> > risk of overflowing the 4K buffer limit on the FDT node.
>
> I thought you were getting rid of FDT? My suggestion was to be taken
> as a FDT replacement..

Thanks for the clarification. Yes, I do want to get rid of FDT, very much so.

If we are not using FDT, adding an object might change the underlying
C structure layout, causing a chain reaction of C struct changes back
to the root. That is why I assumed you might still be using FDT. I see
that your later comments address this with a list of objects. I will
discuss it there.

> You need some kind of hierarchy of identifiers, things like memfd
> should chain off some higher level luo object for a file descriptor.

Ack.

>
> PCI should be the same, but not fd based.

Ack.

> It may be that luo maintains some flat dictionary of
>   string -> [object type, version, u64 ptr]*

I see, got it. That answers my question of how to add a new object
without changing the C structure layout. You are using a list of the
same C structure. When adding more objects, you just add more items
to the list. This boilerplate detail was not mentioned in
your original suggestion. I understand your proposal better now.

> And if you want to serialize that the optimal path would be to have a
> vmalloc of all the strings and a vmalloc of the [] data, sort of like
> the kho array idea.

Is the KHO array idea already implemented in the existing KHO code, or
is it something new you want to propose?

Then we will have to know the combined size of the strings up front,
similar to the FDT story. Ideally, items can be added to the list
incrementally. Maybe the items are first stored as a list of raw
pointers without vmalloc, then a final pass does the vmalloc and
serializes the strings and data.
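
A minimal userspace sketch of the two-pass idea above, under stated
assumptions: the entry layout, the field names and the obj_list_*
helpers are all hypothetical, and malloc()/calloc() stand in for
vmalloc(). Items are appended to a plain linked list first; only the
final pass, when the totals are known, allocates one array for the
[type, version, ptr] data and one blob for the strings.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical serialized entry; not taken from any posted patch. */
struct obj_entry {
	uint32_t type;
	uint32_t version;
	uint64_t ptr;		/* would point at the preserved object */
	uint64_t name_off;	/* offset of the name in the string blob */
};

/* In-memory node used while the list is still growing. */
struct obj_node {
	struct obj_node *next;
	const char *name;	/* borrowed; must outlive the list */
	struct obj_entry e;
};

struct obj_list {
	struct obj_node *head, **tail;
	size_t count, str_bytes;
};

static void obj_list_init(struct obj_list *l)
{
	l->head = NULL;
	l->tail = &l->head;
	l->count = 0;
	l->str_bytes = 0;
}

/* Incremental add: no total size needs to be known up front. */
static int obj_list_add(struct obj_list *l, const char *name,
			uint32_t type, uint32_t version, uint64_t ptr)
{
	struct obj_node *n = calloc(1, sizeof(*n));

	if (!n)
		return -1;
	n->name = name;
	n->e.type = type;
	n->e.version = version;
	n->e.ptr = ptr;
	*l->tail = n;
	l->tail = &n->next;
	l->count++;
	l->str_bytes += strlen(name) + 1;
	return 0;
}

/* Final pass: sizes are known, so each buffer is allocated once. */
static int obj_list_serialize(const struct obj_list *l,
			      struct obj_entry **entries, char **strings)
{
	struct obj_entry *e = calloc(l->count ? l->count : 1, sizeof(*e));
	char *s = malloc(l->str_bytes ? l->str_bytes : 1);
	size_t off = 0, i = 0;

	if (!e || !s) {
		free(e);
		free(s);
		return -1;
	}
	for (const struct obj_node *n = l->head; n; n = n->next, i++) {
		size_t len = strlen(n->name) + 1;

		e[i] = n->e;
		e[i].name_off = off;
		memcpy(s + off, n->name, len);
		off += len;
	}
	assert(off == l->str_bytes);
	*entries = e;
	*strings = s;
	return 0;
}
```

The sketch leaves out freeing the nodes and any versioning policy; it
is only meant to show that the bookkeeping for the list is part of the
cost of the C struct approach.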

With the additional detail above, I would like to point out something
I observed earlier: even though the core idea of the native C
struct is simple and intuitive, the end-to-end implementation is not.
When we evaluate the C struct implementation, we need to include all
those additional boilerplate details as a whole; otherwise it is not
an apples-to-apples comparison.

> > At this stage, do you see that exploring such a machine idea can be
> > beneficial or harmful to the project? If such an idea is considered
> > harmful, we should stop discussing such an idea at all. Go back to
> > building more batches of hand crafted screws, which are waiting by the
> > next critical component.
>
> I haven't heard a compelling idea that will obviously make things
> better.. Adding more layers and complexity is not better.

Yes, I completely understand how you reason it, and I agree with your
assessment.

I would like to add that you have been heavily discounting the
boilerplate in the C struct solution. Here is where our viewpoints
might differ:
If the "more layers" have a counterpart in the C struct solution as
well, then it is not "more", it is a necessary evil. We need to
compare apples to apples.

> Your BTF proposal doesn't seem to benifit memfd at all, it was focused
> on extracting data directly from an existing struct which I feel very
> strongly we should never do.

From a data flow point of view, the data is read from a C struct and
eventually stored into a C struct. There is no way around that. That is
a necessary evil if you automate this process. Hey, there is also no
rule saying that you can't use a bounce buffer or some kind of manual
control in between.

It is just a way to automate things and reduce the boilerplate. We can
put a different label on it and argue that the label or concept is bad.
Your C struct approach does the exact same thing: it pulls data from a
C struct and stores it into a C struct. It is just the label we are
arguing about: this label is good and that label is bad. Underneath,
both share the same necessary evil.

> The above dictionary, I also don't see how BTF helps. It is such a
> special encoding. Yes you could make some elaborate serialization
> infrastructure, like FDT, but we have all been saying FDT is too hard
> to use and too much code. I'm not sure I'm convinced there is really a

Are you ready to be convinced? If you treat this as a religion, you can
never be convinced.

FDT is too hard to use for other reasons. FDT is designed to be
constructed by offline tools. In the kernel it is mostly read-only. We
are using FDT outside of its original design parameters. That does not
mean that something (the machine) specifically designed for this
purpose can't be built and be easier to use.

> better middle ground :\

With due respect, it sounds like you risk judging something you
haven't fully understood. I feel that a baby, my baby, has been thrown
out with the bathwater.

As a test of the above statement, can you describe my idea as well as
or better than I do, so that I would say: "yes, this is exactly what I
am trying to build"?

That is the communication barrier I am talking about. I estimate that
at this rate it will take us about 15 email exchanges to get to the
core stuff. It might be much quicker to lock you and me in a room and
only release us when you and I can describe each other's viewpoint at a
mutually satisfactory level. I understand your time is precious, and I
don't want to waste your time. I fully respect and comply with your
decision. If you want me to stop now, I can stop. No questions asked.

That gets back to my original question: do we already have a ruling
that even the discussion of "the machine" idea is forbidden?

> IMHO if there is some way to improve this it still yet to be found,

In my mind, I have found it. I have to get over the communication
barrier to plead my case to you. You can issue a preliminary ruling to
dismiss my case. I just wish you fully understood the case facts
before you make such a ruling.

> and I think we don't well understand what we need to serialize just
> yet.

That may be true, we don't have 100% understanding of what needs to be
serialized. On the other hand, it is not 0% either. Based on what we
understand, we can already use "the machine" to help us do what we
know much more effectively. Of course, there is a trade-off in
developing "the machine". It takes extra time, plus the complexity of
maintaining such a machine. I fully understand that.

> Smaller ideas like preserve the vmalloc will make big improvement
> already.

Yes, I totally agree. It is a local optimization we can do, though it
might not be the global optimum. "The machine" might not use
vmalloc at all, and all these small incremental changes will be thrown
away once we have "the machine".

To put this situation in terms of the airplane story: yes, we can build
diamond-plated files to produce the hand-crafted screws faster. The
missed opportunity is that, if we had "the machine" earlier, we could
pump out machined screws much faster at scale. Even after subtracting
the time to build the machine, it might still be an overall win. We
don't need diamond-plated files if we have the machine.

> Lets not race ahead until we understand the actual problem properly.

Is that the final ruling? It feels like so. Just clarifying what I am receiving.

I feel a much stronger sense of urgency than you do, though. The stakes
are high: you already have four departments that can use this
common serialization library right now:
1) PCI
2) VFIO
3) IOMMU
4) Memfd.

We are getting into the more complex data structures. If we merge this
into the mainline, it is much harder to pull them out later.
Basically, this is a done deal. That is why I am putting my reputation
and my job on the line to pitch "the machine" idea. It is a very risky
move, I fully understand that.

Chris

* Re: [PATCH v20 4/8] fork: Add shadow stack support to clone3()
From: Mark Brown @ 2025-09-03 10:01 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan,
	linux-kernel, Will Deacon, jannh, Andrew Morton, Yury Khrustalev,
	Wilco Dijkstra, linux-kselftest, linux-api, Kees Cook
In-Reply-To: <aLdbT67auUpaOj2T@arm.com>

On Tue, Sep 02, 2025 at 10:02:07PM +0100, Catalin Marinas wrote:
> On Tue, Sep 02, 2025 at 11:21:48AM +0100, Mark Brown wrote:

> > +	mm = get_task_mm(p);
> > +	if (!mm)
> > +		return -EFAULT;

> In theory, I don't think we need the get_task_mm() -> mmget() since
> copy_mm() early on already did this and the task can't disappear from
> underneath while we are creating it.

mmget() will only have been done in the CLONE_VM case; if we're in the
!CLONE_VM case we do a dup_mm(), but that also returns with a reference.
I didn't know if people would be happier with the reference clearly
taken by the code using things or not. The general pattern is that
whenever we're doing anything with remote VMs we take a reference.

> > +	mmap_read_lock(mm);
> > +
> > +	addr = untagged_addr_remote(mm, args->shadow_stack_token);
> > +	page = get_user_page_vma_remote(mm, addr, FOLL_FORCE | FOLL_WRITE,
> > +					&vma);

> However, I wonder whether it makes sense to use the remote mm access
> here at all. Does this code ever run without CLONE_VM? If not, this is
> all done within the current mm context.

Yes, userspace can select if it wants CLONE_VM or not so we should
handle that case.  We discussed this on prior versions and we felt that
while we couldn't immediately see the use case for !CLONE_VM there
wasn't a good reason to restrict the creativity of userspace developers,
and given that you can specify the regular stack in these cases it seems
logical that you'd also be able to specify the shadow stack.

> I can see the x86 shstk_alloc_thread_stack() returns early if !CLONE_VM.
> Similarly on arm64. I think the behaviour is preserved with this series
> but I'm not entirely sure from the contextual diff (I need to apply the
> patches locally).

That is all for the case where the kernel allocates and manages the
shadow stack, it's the behaviour that this series allows userspace to
override.


* Re: [PATCH v2] uapi/fcntl: define RENAME_* and AT_RENAME_* macros
From: Randy Dunlap @ 2025-09-03  0:46 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, patches, Jeff Layton, Chuck Lever, Alexander Aring,
	Josef Bacik, Aleksa Sarai, Jan Kara, Christian Brauner,
	Matthew Wilcox, David Howells, linux-api
In-Reply-To: <5ff4dfe2-271f-4967-bb45-ad59614edc37@infradead.org>



On 9/2/25 2:31 PM, Randy Dunlap wrote:
> Hi,
> 
> On 9/1/25 11:58 PM, Amir Goldstein wrote:
>> On Tue, Sep 2, 2025 at 1:14 AM Randy Dunlap <rdunlap@infradead.org> wrote:
>>>
>>> Define the RENAME_* and AT_RENAME_* macros exactly the same as in
>>> recent glibc <stdio.h> so that duplicate definition build errors in
>>> both samples/watch_queue/watch_test.c and samples/vfs/test-statx.c
>>> no longer happen. When they defined in exactly the same way in
>>> multiple places, the build errors are prevented.
>>>
>>> Defining only the AT_RENAME_* macros is not sufficient since they
>>> depend on the RENAME_* macros, which may not be defined when the
>>> AT_RENAME_* macros are used.
>>>
>>> Build errors being fixed:
>>>
>>> for samples/vfs/test-statx.c:
>>>
>>> In file included from ../samples/vfs/test-statx.c:23:
>>> usr/include/linux/fcntl.h:159:9: warning: ‘AT_RENAME_NOREPLACE’ redefined
>>>   159 | #define AT_RENAME_NOREPLACE     0x0001
>>> In file included from ../samples/vfs/test-statx.c:13:
>>> /usr/include/stdio.h:171:10: note: this is the location of the previous definition
>>>   171 | # define AT_RENAME_NOREPLACE RENAME_NOREPLACE
>>> usr/include/linux/fcntl.h:160:9: warning: ‘AT_RENAME_EXCHANGE’ redefined
>>>   160 | #define AT_RENAME_EXCHANGE      0x0002
>>> /usr/include/stdio.h:173:10: note: this is the location of the previous definition
>>>   173 | # define AT_RENAME_EXCHANGE RENAME_EXCHANGE
>>> usr/include/linux/fcntl.h:161:9: warning: ‘AT_RENAME_WHITEOUT’ redefined
>>>   161 | #define AT_RENAME_WHITEOUT      0x0004
>>> /usr/include/stdio.h:175:10: note: this is the location of the previous definition
>>>   175 | # define AT_RENAME_WHITEOUT RENAME_WHITEOUT
>>>
>>> for samples/watch_queue/watch_test.c:
>>>
>>> In file included from usr/include/linux/watch_queue.h:6,
>>>                  from ../samples/watch_queue/watch_test.c:19:
>>> usr/include/linux/fcntl.h:159:9: warning: ‘AT_RENAME_NOREPLACE’ redefined
>>>   159 | #define AT_RENAME_NOREPLACE     0x0001
>>> In file included from ../samples/watch_queue/watch_test.c:11:
>>> /usr/include/stdio.h:171:10: note: this is the location of the previous definition
>>>   171 | # define AT_RENAME_NOREPLACE RENAME_NOREPLACE
>>> usr/include/linux/fcntl.h:160:9: warning: ‘AT_RENAME_EXCHANGE’ redefined
>>>   160 | #define AT_RENAME_EXCHANGE      0x0002
>>> /usr/include/stdio.h:173:10: note: this is the location of the previous definition
>>>   173 | # define AT_RENAME_EXCHANGE RENAME_EXCHANGE
>>> usr/include/linux/fcntl.h:161:9: warning: ‘AT_RENAME_WHITEOUT’ redefined
>>>   161 | #define AT_RENAME_WHITEOUT      0x0004
>>> /usr/include/stdio.h:175:10: note: this is the location of the previous definition
>>>   175 | # define AT_RENAME_WHITEOUT RENAME_WHITEOUT
>>>
>>> Fixes: b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be allocated")
>>> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
>>> ---
>>> Cc: Amir Goldstein <amir73il@gmail.com>
>>> Cc: Jeff Layton <jlayton@kernel.org>
>>> Cc: Chuck Lever <chuck.lever@oracle.com>
>>> Cc: Alexander Aring <alex.aring@gmail.com>
>>> Cc: Josef Bacik <josef@toxicpanda.com>
>>> Cc: Aleksa Sarai <cyphar@cyphar.com>
>>> Cc: Jan Kara <jack@suse.cz>
>>> Cc: Christian Brauner <brauner@kernel.org>
>>> Cc: Matthew Wilcox <willy@infradead.org>
>>> Cc: David Howells <dhowells@redhat.com>
>>> CC: linux-api@vger.kernel.org
>>> To: linux-fsdevel@vger.kernel.org
>>>
>>>  include/uapi/linux/fcntl.h |    9 ++++++---
>>>  1 file changed, 6 insertions(+), 3 deletions(-)
>>>
>>> --- linux-next-20250819.orig/include/uapi/linux/fcntl.h
>>> +++ linux-next-20250819/include/uapi/linux/fcntl.h
>>> @@ -156,9 +156,12 @@
>>>   */
>>>
>>>  /* Flags for renameat2(2) (must match legacy RENAME_* flags). */
>>> -#define AT_RENAME_NOREPLACE    0x0001
>>> -#define AT_RENAME_EXCHANGE     0x0002
>>> -#define AT_RENAME_WHITEOUT     0x0004
>>> +# define RENAME_NOREPLACE (1 << 0)
>>> +# define AT_RENAME_NOREPLACE RENAME_NOREPLACE
>>> +# define RENAME_EXCHANGE (1 << 1)
>>> +# define AT_RENAME_EXCHANGE RENAME_EXCHANGE
>>> +# define RENAME_WHITEOUT (1 << 2)
>>> +# define AT_RENAME_WHITEOUT RENAME_WHITEOUT
>>>
>>
>> This solution, apart from being terribly wrong (adjust the source to match
>> to value of its downstream copy), does not address the issue that Mathew
>> pointed out on v1 discussion [1]:
> 
> I didn't forget or ignore this.
> If the macros have the same values (well, not just values but also the
> same text), then I don't see why it matters whether they are in some older
> version of glibc.
> 
>> $ grep -r AT_RENAME_NOREPLACE /usr/include
>> /usr/include/linux/fcntl.h:#define AT_RENAME_NOREPLACE  0x0001
>>
>> It's not in stdio.h at all.  This is with libc6 2.41-10
>>
>> [1] https://lore.kernel.org/linux-fsdevel/aKxfGix_o4glz8-Z@casper.infradead.org/
>>
>> I don't know how to resolve the mess that glibc has created.
> 
> Yeah, I guess I don't either.
> 
>> Perhaps like this:
>>
>> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
>> index f291ab4f94ebc..dde14fa3c2007 100644
>> --- a/include/uapi/linux/fcntl.h
>> +++ b/include/uapi/linux/fcntl.h
>> @@ -155,10 +155,16 @@
>>   * as possible, so we can use them for generic bits in the future if necessary.
>>   */
>>
>> -/* Flags for renameat2(2) (must match legacy RENAME_* flags). */
>> -#define AT_RENAME_NOREPLACE    0x0001
>> -#define AT_RENAME_EXCHANGE     0x0002
>> -#define AT_RENAME_WHITEOUT     0x0004
>> +/*
>> + * The legacy renameat2(2) RENAME_* flags are conceptually also
>> syscall-specific
>> + * flags, so it could makes sense to create the AT_RENAME_* aliases
>> for them and
>> + * maybe later add support for generic AT_* flags to this syscall.
>> + * However, following a mismatch of definitions in glibc and since no
>> kernel code
>> + * currently uses the AT_RENAME_* aliases, we leave them undefined here.
>> +#define AT_RENAME_NOREPLACE    RENAME_NOREPLACE
>> +#define AT_RENAME_EXCHANGE     RENAME_EXCHANGE
>> +#define AT_RENAME_WHITEOUT     RENAME_WHITEOUT
>> +*/
> 
> Well, we do have samples/ code that uses fcntl.h (indirectly; maybe
> that can be fixed).
> See the build errors in the patch description.
> 
> 
>>  /* Flag for faccessat(2). */
>>  #define AT_EACCESS             0x200   /* Test access permitted for
> 
> With this patch (your suggestion above):
> 
> IF a userspace program in samples/ uses <uapi/linux/fcntl.h> without
> using <stdio.h>, [yes, I created one to test this] and without using
> <uapi/linux/fs.h> then the build fails with similar build errors:
> 
> ../samples/watch_queue/watch_nostdio.c: In function ‘consumer’:
> ../samples/watch_queue/watch_nostdio.c:33:32: error: ‘RENAME_NOREPLACE’ undeclared (first use in this function)
>    33 |                         return RENAME_NOREPLACE;
> ../samples/watch_queue/watch_nostdio.c:33:32: note: each undeclared identifier is reported only once for each function it appears in
> ../samples/watch_queue/watch_nostdio.c:37:32: error: ‘RENAME_EXCHANGE’ undeclared (first use in this function)
>    37 |                         return RENAME_EXCHANGE;
> ../samples/watch_queue/watch_nostdio.c:41:32: error: ‘RENAME_WHITEOUT’ undeclared (first use in this function)
>    41 |                         return RENAME_WHITEOUT;
> 
> This build succeeds with my version 1 patch (full defining of both
> RENAME_* and AT_RENAME_* macros). It fails with the patch that you suggested
> above.
> 
> OK, here's what I propose.
> 
> a. remove the unused and (sort of) recently added AT_RENAME_* macros
> in include/uapi/linux/fcntl.h. Nothing in the kernel tree uses them.
> This is:
> 
> commit b4fef22c2fb9
> Author: Aleksa Sarai <cyphar@cyphar.com>
> Date:   Wed Aug 28 20:19:42 2024 +1000
>     uapi: explain how per-syscall AT_* flags should be allocated
> 
> These macros should have never been added here IMO.
> Just putting them somewhere as examples (in comments) would be OK.
> 
> This alone fixes all of the build errors in samples/ that I originally
> reported.
> 
> b. if a userspace program wants to use the RENAME_* macros, it should
> #include <linux/fs.h> instead of <linux/fcntl.h>.
> 
> This fixes the "contrived" build error that I manufactured.
> 
> Note that some programs in tools/ do use AT_RENAME_* (all 3 macros)
> but they define those macros locally.
> 

And after more testing, this is what I think works:

a. remove all of the AT_RENAME_* macros from <uapi/linux/fcntl.h>
   (as above)

b. put the AT_RENAME_* macros into <uapi/linux/fs.h> like so:

+/* Flags for renameat2(2) (must match legacy RENAME_* flags). */
+# define AT_RENAME_NOREPLACE RENAME_NOREPLACE
+# define AT_RENAME_EXCHANGE RENAME_EXCHANGE
+# define AT_RENAME_WHITEOUT RENAME_WHITEOUT

so that they match what is in upstream glibc stdio.h, hence not
causing duplicate definition errors.
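
A small standalone sketch of why matching the text matters, not just
the value: C accepts a repeated #define only when the replacement
tokens are identical. The RENAME_* values are the ones from the patch
quoted above; at_rename_all_flags() is a hypothetical helper added
only so the block has something to call.

```c
#include <assert.h>

/* Stand-ins for the definitions that <linux/fs.h> would provide. */
#define RENAME_NOREPLACE (1 << 0)
#define RENAME_EXCHANGE (1 << 1)
#define RENAME_WHITEOUT (1 << 2)

/* First copy, as one header might define the aliases. */
# define AT_RENAME_NOREPLACE RENAME_NOREPLACE
# define AT_RENAME_EXCHANGE RENAME_EXCHANGE
# define AT_RENAME_WHITEOUT RENAME_WHITEOUT

/* Second copy with identical replacement tokens: no diagnostic. */
# define AT_RENAME_NOREPLACE RENAME_NOREPLACE
# define AT_RENAME_EXCHANGE RENAME_EXCHANGE
# define AT_RENAME_WHITEOUT RENAME_WHITEOUT

/*
 * By contrast, a copy with the same value but different tokens, e.g.
 *   #define AT_RENAME_NOREPLACE 0x0001
 * would be reported as "redefined", as in the build log above.
 */

static_assert(AT_RENAME_NOREPLACE == 0x0001, "alias keeps legacy value");
static_assert(AT_RENAME_EXCHANGE == 0x0002, "alias keeps legacy value");
static_assert(AT_RENAME_WHITEOUT == 0x0004, "alias keeps legacy value");

/* Hypothetical helper: the union of all three alias flags. */
static unsigned int at_rename_all_flags(void)
{
	return AT_RENAME_NOREPLACE | AT_RENAME_EXCHANGE | AT_RENAME_WHITEOUT;
}
```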


-- 
~Randy


* Re: [PATCH v2] uapi/fcntl: define RENAME_* and AT_RENAME_* macros
From: Randy Dunlap @ 2025-09-02 21:31 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, patches, Jeff Layton, Chuck Lever, Alexander Aring,
	Josef Bacik, Aleksa Sarai, Jan Kara, Christian Brauner,
	Matthew Wilcox, David Howells, linux-api
In-Reply-To: <CAOQ4uxjXvYBsW1Nb2HKaoUg1qi8Pkq1XKtQEbnAvMUGcp7LrZA@mail.gmail.com>

Hi,

On 9/1/25 11:58 PM, Amir Goldstein wrote:
> On Tue, Sep 2, 2025 at 1:14 AM Randy Dunlap <rdunlap@infradead.org> wrote:
>>
>> Define the RENAME_* and AT_RENAME_* macros exactly the same as in
>> recent glibc <stdio.h> so that duplicate definition build errors in
>> both samples/watch_queue/watch_test.c and samples/vfs/test-statx.c
>> no longer happen. When they defined in exactly the same way in
>> multiple places, the build errors are prevented.
>>
>> Defining only the AT_RENAME_* macros is not sufficient since they
>> depend on the RENAME_* macros, which may not be defined when the
>> AT_RENAME_* macros are used.
>>
>> Build errors being fixed:
>>
>> for samples/vfs/test-statx.c:
>>
>> In file included from ../samples/vfs/test-statx.c:23:
>> usr/include/linux/fcntl.h:159:9: warning: ‘AT_RENAME_NOREPLACE’ redefined
>>   159 | #define AT_RENAME_NOREPLACE     0x0001
>> In file included from ../samples/vfs/test-statx.c:13:
>> /usr/include/stdio.h:171:10: note: this is the location of the previous definition
>>   171 | # define AT_RENAME_NOREPLACE RENAME_NOREPLACE
>> usr/include/linux/fcntl.h:160:9: warning: ‘AT_RENAME_EXCHANGE’ redefined
>>   160 | #define AT_RENAME_EXCHANGE      0x0002
>> /usr/include/stdio.h:173:10: note: this is the location of the previous definition
>>   173 | # define AT_RENAME_EXCHANGE RENAME_EXCHANGE
>> usr/include/linux/fcntl.h:161:9: warning: ‘AT_RENAME_WHITEOUT’ redefined
>>   161 | #define AT_RENAME_WHITEOUT      0x0004
>> /usr/include/stdio.h:175:10: note: this is the location of the previous definition
>>   175 | # define AT_RENAME_WHITEOUT RENAME_WHITEOUT
>>
>> for samples/watch_queue/watch_test.c:
>>
>> In file included from usr/include/linux/watch_queue.h:6,
>>                  from ../samples/watch_queue/watch_test.c:19:
>> usr/include/linux/fcntl.h:159:9: warning: ‘AT_RENAME_NOREPLACE’ redefined
>>   159 | #define AT_RENAME_NOREPLACE     0x0001
>> In file included from ../samples/watch_queue/watch_test.c:11:
>> /usr/include/stdio.h:171:10: note: this is the location of the previous definition
>>   171 | # define AT_RENAME_NOREPLACE RENAME_NOREPLACE
>> usr/include/linux/fcntl.h:160:9: warning: ‘AT_RENAME_EXCHANGE’ redefined
>>   160 | #define AT_RENAME_EXCHANGE      0x0002
>> /usr/include/stdio.h:173:10: note: this is the location of the previous definition
>>   173 | # define AT_RENAME_EXCHANGE RENAME_EXCHANGE
>> usr/include/linux/fcntl.h:161:9: warning: ‘AT_RENAME_WHITEOUT’ redefined
>>   161 | #define AT_RENAME_WHITEOUT      0x0004
>> /usr/include/stdio.h:175:10: note: this is the location of the previous definition
>>   175 | # define AT_RENAME_WHITEOUT RENAME_WHITEOUT
>>
>> Fixes: b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be allocated")
>> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
>> ---
>> Cc: Amir Goldstein <amir73il@gmail.com>
>> Cc: Jeff Layton <jlayton@kernel.org>
>> Cc: Chuck Lever <chuck.lever@oracle.com>
>> Cc: Alexander Aring <alex.aring@gmail.com>
>> Cc: Josef Bacik <josef@toxicpanda.com>
>> Cc: Aleksa Sarai <cyphar@cyphar.com>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Christian Brauner <brauner@kernel.org>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: David Howells <dhowells@redhat.com>
>> CC: linux-api@vger.kernel.org
>> To: linux-fsdevel@vger.kernel.org
>>
>>  include/uapi/linux/fcntl.h |    9 ++++++---
>>  1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> --- linux-next-20250819.orig/include/uapi/linux/fcntl.h
>> +++ linux-next-20250819/include/uapi/linux/fcntl.h
>> @@ -156,9 +156,12 @@
>>   */
>>
>>  /* Flags for renameat2(2) (must match legacy RENAME_* flags). */
>> -#define AT_RENAME_NOREPLACE    0x0001
>> -#define AT_RENAME_EXCHANGE     0x0002
>> -#define AT_RENAME_WHITEOUT     0x0004
>> +# define RENAME_NOREPLACE (1 << 0)
>> +# define AT_RENAME_NOREPLACE RENAME_NOREPLACE
>> +# define RENAME_EXCHANGE (1 << 1)
>> +# define AT_RENAME_EXCHANGE RENAME_EXCHANGE
>> +# define RENAME_WHITEOUT (1 << 2)
>> +# define AT_RENAME_WHITEOUT RENAME_WHITEOUT
>>
> 
> This solution, apart from being terribly wrong (adjust the source to match
> to value of its downstream copy), does not address the issue that Mathew
> pointed out on v1 discussion [1]:

I didn't forget or ignore this.
If the macros have the same values (well, not just values but also the
same text), then I don't see why it matters whether they are in some older
version of glibc.

> $ grep -r AT_RENAME_NOREPLACE /usr/include
> /usr/include/linux/fcntl.h:#define AT_RENAME_NOREPLACE  0x0001
> 
> It's not in stdio.h at all.  This is with libc6 2.41-10
> 
> [1] https://lore.kernel.org/linux-fsdevel/aKxfGix_o4glz8-Z@casper.infradead.org/
> 
> I don't know how to resolve the mess that glibc has created.

Yeah, I guess I don't either.

> Perhaps like this:
> 
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index f291ab4f94ebc..dde14fa3c2007 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -155,10 +155,16 @@
>   * as possible, so we can use them for generic bits in the future if necessary.
>   */
> 
> -/* Flags for renameat2(2) (must match legacy RENAME_* flags). */
> -#define AT_RENAME_NOREPLACE    0x0001
> -#define AT_RENAME_EXCHANGE     0x0002
> -#define AT_RENAME_WHITEOUT     0x0004
> +/*
> + * The legacy renameat2(2) RENAME_* flags are conceptually also
> syscall-specific
> + * flags, so it could makes sense to create the AT_RENAME_* aliases
> for them and
> + * maybe later add support for generic AT_* flags to this syscall.
> + * However, following a mismatch of definitions in glibc and since no
> kernel code
> + * currently uses the AT_RENAME_* aliases, we leave them undefined here.
> +#define AT_RENAME_NOREPLACE    RENAME_NOREPLACE
> +#define AT_RENAME_EXCHANGE     RENAME_EXCHANGE
> +#define AT_RENAME_WHITEOUT     RENAME_WHITEOUT
> +*/

Well, we do have samples/ code that uses fcntl.h (indirectly; maybe
that can be fixed).
See the build errors in the patch description.


>  /* Flag for faccessat(2). */
>  #define AT_EACCESS             0x200   /* Test access permitted for

With this patch (your suggestion above):

IF a userspace program in samples/ uses <uapi/linux/fcntl.h> without
using <stdio.h>, [yes, I created one to test this] and without using
<uapi/linux/fs.h> then the build fails with similar build errors:

../samples/watch_queue/watch_nostdio.c: In function ‘consumer’:
../samples/watch_queue/watch_nostdio.c:33:32: error: ‘RENAME_NOREPLACE’ undeclared (first use in this function)
   33 |                         return RENAME_NOREPLACE;
../samples/watch_queue/watch_nostdio.c:33:32: note: each undeclared identifier is reported only once for each function it appears in
../samples/watch_queue/watch_nostdio.c:37:32: error: ‘RENAME_EXCHANGE’ undeclared (first use in this function)
   37 |                         return RENAME_EXCHANGE;
../samples/watch_queue/watch_nostdio.c:41:32: error: ‘RENAME_WHITEOUT’ undeclared (first use in this function)
   41 |                         return RENAME_WHITEOUT;

This build succeeds with my version 1 patch (full defining of both
RENAME_* and AT_RENAME_* macros). It fails with the patch that you suggested
above.

OK, here's what I propose.

a. remove the unused and (sort of) recently added AT_RENAME_* macros
in include/uapi/linux/fcntl.h. Nothing in the kernel tree uses them.
This is:

commit b4fef22c2fb9
Author: Aleksa Sarai <cyphar@cyphar.com>
Date:   Wed Aug 28 20:19:42 2024 +1000
    uapi: explain how per-syscall AT_* flags should be allocated

These macros should have never been added here IMO.
Just putting them somewhere as examples (in comments) would be OK.

This alone fixes all of the build errors in samples/ that I originally
reported.

b. if a userspace program wants to use the RENAME_* macros, it should
#include <linux/fs.h> instead of <linux/fcntl.h>.

This fixes the "contrived" build error that I manufactured.

Note that some programs in tools/ do use AT_RENAME_* (all 3 macros)
but they define those macros locally.
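
For illustration, a local definition of that kind could look like the
sketch below. This is an assumption about the general pattern, not a
copy of any file in tools/: the #ifndef guards and the
at_rename_flags_known() helper are hypothetical. The guards keep the
local copies from clashing with whatever header happens to define the
macros first.

```c
#include <assert.h>

/* Fall back to local copies only when no header defined the flags. */
#ifndef RENAME_NOREPLACE
#define RENAME_NOREPLACE (1 << 0)
#endif
#ifndef RENAME_EXCHANGE
#define RENAME_EXCHANGE (1 << 1)
#endif
#ifndef RENAME_WHITEOUT
#define RENAME_WHITEOUT (1 << 2)
#endif

#ifndef AT_RENAME_NOREPLACE
#define AT_RENAME_NOREPLACE RENAME_NOREPLACE
#endif
#ifndef AT_RENAME_EXCHANGE
#define AT_RENAME_EXCHANGE RENAME_EXCHANGE
#endif
#ifndef AT_RENAME_WHITEOUT
#define AT_RENAME_WHITEOUT RENAME_WHITEOUT
#endif

/* Whichever definition won, the values must be the legacy ones. */
static_assert(AT_RENAME_NOREPLACE == 0x0001, "unexpected flag value");
static_assert(AT_RENAME_EXCHANGE == 0x0002, "unexpected flag value");
static_assert(AT_RENAME_WHITEOUT == 0x0004, "unexpected flag value");

/* Hypothetical helper: returns 1 if flags has only known bits set. */
static int at_rename_flags_known(unsigned int flags)
{
	const unsigned int known = AT_RENAME_NOREPLACE |
				   AT_RENAME_EXCHANGE |
				   AT_RENAME_WHITEOUT;

	return (flags & ~known) == 0;
}
```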

-- 
~Randy


* Re: [PATCH v20 4/8] fork: Add shadow stack support to clone3()
From: Catalin Marinas @ 2025-09-02 21:02 UTC (permalink / raw)
  To: Mark Brown
  Cc: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan,
	linux-kernel, Will Deacon, jannh, Andrew Morton, Yury Khrustalev,
	Wilco Dijkstra, linux-kselftest, linux-api, Kees Cook
In-Reply-To: <20250902-clone3-shadow-stack-v20-4-4d9fff1c53e7@kernel.org>

On Tue, Sep 02, 2025 at 11:21:48AM +0100, Mark Brown wrote:
> diff --git a/kernel/fork.c b/kernel/fork.c
> index af673856499d..d484ebeded33 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1907,6 +1907,51 @@ static bool need_futex_hash_allocate_default(u64 clone_flags)
>  	return true;
>  }
>  
> +static int shstk_validate_clone(struct task_struct *p,
> +				struct kernel_clone_args *args)
> +{
> +	struct mm_struct *mm;
> +	struct vm_area_struct *vma;
> +	struct page *page;
> +	unsigned long addr;
> +	int ret;
> +
> +	if (!IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK))
> +		return 0;
> +
> +	if (!args->shadow_stack_token)
> +		return 0;
> +
> +	mm = get_task_mm(p);
> +	if (!mm)
> +		return -EFAULT;

In theory, I don't think we need the get_task_mm() -> mmget() since
copy_mm() early on already did this and the task can't disappear from
underneath while we are creating it.

> +
> +	mmap_read_lock(mm);
> +
> +	addr = untagged_addr_remote(mm, args->shadow_stack_token);
> +	page = get_user_page_vma_remote(mm, addr, FOLL_FORCE | FOLL_WRITE,
> +					&vma);

However, I wonder whether it makes sense to use the remote mm access
here at all. Does this code ever run without CLONE_VM? If not, this is
all done within the current mm context.

I can see the x86 shstk_alloc_thread_stack() returns early if !CLONE_VM.
Similarly on arm64. I think the behaviour is preserved with this series
but I'm not entirely sure from the contextual diff (I need to apply the
patches locally).

Otherwise the patch looks fine (well, even the above wouldn't fail, I
just find it strange that we pretend it's a remote mm but on the default
allocation path like alloc_gcs() we go for current->mm).

BTW, if you repost, it might be worth cross-posting to linux-arm-kernel
for wider exposure as not everyone reads LKML (and you can drop
Szabolcs, his arm address is no longer valid).

-- 
Catalin


* Re: [PATCH v4] linux: Add openat2 (BZ 31664)
From: Paul Eggert @ 2025-09-02 18:17 UTC (permalink / raw)
  To: Adhemerval Zanella Netto, Arjun Shankar
  Cc: Aleksa Sarai, libc-alpha, linux-api
In-Reply-To: <2a979a5e-a78d-4ee1-ac96-7176a8c45fb4@linaro.org>

On 2025-09-02 10:11, Adhemerval Zanella Netto wrote:
> if kernel developers are planning to make the argument in/out

I'm not seeing much evidence for such a plan, and if I saw some I'd make 
my feelings known in the appropriate forum. There's no real need there, 
and the only stated proposal would not accomplish its goals.


* Re: [PATCH v4] linux: Add openat2 (BZ 31664)
From: Adhemerval Zanella Netto @ 2025-09-02 17:11 UTC (permalink / raw)
  To: Paul Eggert, Arjun Shankar; +Cc: Aleksa Sarai, libc-alpha, linux-api
In-Reply-To: <d88d7228-fabe-41d1-9a09-298fcb313647@cs.ucla.edu>



On 02/09/25 13:34, Paul Eggert wrote:
> On 2025-09-01 19:41, Arjun Shankar wrote:
>> While it is true that openat cannot be extended in this way, for
>> openat2 (whether or not it eventually materializes in Linux) there
>> already is the RFC patch series proposing CHECK_FIELDS that Aleksa
>> referred to earlier.
> 
> Is this the RFC Aleksa proposed last October <https://lkml.org/lkml/2024/10/10/25>? If so, I don't exactly see a rousing endorsement there.
> 
> If not, where is the later RFC? I'd like to send the critical comments I've already sent on this thread. These comments have not been responded to adequately.
> 
> 
>> Unless the kernel marks open_how as const
> 
> ? The kernel doesn't mark anything as const. It merely copies in or copies out. And for openat2, it copies only one way.

I think in this case we can refer to the SYSCALL_DEFINE4 in the kernel
source, since there is no explicit contract w.r.t. the argument being
pointer-to-const in the kernel header. But I am not sure what the
Linux policy is for changing the implementation, or whether Linus or another
maintainer will chime in if someone tries to do it.


> 
> 
>> future if the kernel starts modifying open_how, glibc's openat2
>> wrapper will no longer align with the kernel's behavior. At that
>> point, glibc will either need to discard the const (which will cause
>> any existing users of the wrapper to fail to recompile),
> 
> There are multiple easy ways out there. For example, glibc could document the argument as being pointer-to-const now, but warn that this may change to unrestricted pointer later, if the misguided change is made to the kernel. This would be similar to the already-existing warning in the proposed glibc patch, which warns that you can't assume sizeof (struct open_how) is a constant and so you can't expose it in library APIs. Of course people can ignore the documentation warnings but that's on them.

Yes, we could properly document it, but changing it later is always troublesome
and may force users to resort to hacks like bypassing the libc with
assembly, redefining the function prototype, or just using syscall()
instead (all far from ideal) to support multiple glibc versions.

So I would *really* like to avoid going forward with this path.

> 
> Better, though, would be to keep the API pointer-to-const. That's much cleaner. We can extend it later for a "give me the supported flags" flag, without changing the API away from pointer-to-const.
> 
>> Earlier on in this thread, Aleksa mentioned sched_setattr as
>> establishing precedent for the kernel modifying non-const objects. It
>> looks like glibc actually does provide a sched_setattr wrapper since
>> 2.41.
> 
> Although it may be too late to change that misfeature, it's not too late to change this one. And even if it was a good idea for sched_setattr, that doesn't mean it's a good idea for openat2.
> 

The main problem I see is that if we settle on pointer-to-const and the kernel
eventually adds some API that changes the input, users will start to deploy the
aforementioned hacks to work around issues with the glibc interface.

So if kernel developers are planning to make the argument in/out, I think
glibc should just follow the kernel.


* Re: [PATCH v4] linux: Add openat2 (BZ 31664)
From: Paul Eggert @ 2025-09-02 16:34 UTC (permalink / raw)
  To: Arjun Shankar
  Cc: Aleksa Sarai, Adhemerval Zanella Netto, libc-alpha, linux-api
In-Reply-To: <CAG_osaYc21nR0M3O6UKs8zna6x_k9U4=Rt4B0mKHog=ZLSH1AQ@mail.gmail.com>

On 2025-09-01 19:41, Arjun Shankar wrote:
> While it is true that openat cannot be extended in this way, for
> openat2 (whether or not it eventually materializes in Linux) there
> already is the RFC patch series proposing CHECK_FIELDS that Aleksa
> referred to earlier.

Is this the RFC Aleksa proposed last October 
<https://lkml.org/lkml/2024/10/10/25>? If so, I don't exactly see a 
rousing endorsement there.

If not, where is the later RFC? I'd like to send the critical comments 
I've already sent on this thread. These comments have not been responded 
to adequately.

> Unless the kernel marks open_how as const

? The kernel doesn't mark anything as const. It merely copies in or 
copies out. And for openat2, it copies only one way.

> future if the kernel starts modifying open_how, glibc's openat2
> wrapper will no longer align with the kernel's behavior. At that
> point, glibc will either need to discard the const (which will cause
> any existing users of the wrapper to fail to recompile),

There are multiple easy ways out there. For example, glibc could 
document the argument as being pointer-to-const now, but warn that this 
may change to unrestricted pointer later, if the misguided change is 
made to the kernel. This would be similar to the already-existing 
warning in the proposed glibc patch, which warns that you can't assume 
sizeof (struct open_how) is a constant and so you can't expose it in 
library APIs. Of course people can ignore the documentation warnings but 
that's on them.

Better, though, would be to keep the API pointer-to-const. That's much 
cleaner. We can extend it later for a "give me the supported flags" 
flag, without changing the API away from pointer-to-const.
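
To illustrate the one-way copy, here is a minimal sketch of a
pointer-to-const wrapper. It is not the proposed glibc patch: the names
example_openat2 and struct example_open_how are made up for this sketch,
the struct is a local mirror of the kernel's three-field struct open_how,
and the fallback syscall number 437 is an assumption that holds on most
architectures but not all.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Local mirror of the kernel's struct open_how (three 64-bit fields),
 * declared here so the sketch does not need <linux/openat2.h>.
 */
struct example_open_how {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

#ifndef SYS_openat2
#define SYS_openat2 437	/* assumption: number used on most architectures */
#endif

/* Hypothetical wrapper: pointer-to-const, size passed explicitly. */
long example_openat2(int dirfd, const char *path,
		     const struct example_open_how *how, size_t size)
{
	return syscall(SYS_openat2, dirfd, path, how, size);
}

/* Returns 1 if the caller's struct is byte-for-byte unchanged afterwards. */
int example_how_unchanged(void)
{
	struct example_open_how how = { .flags = O_RDONLY | O_CLOEXEC };
	struct example_open_how before = how;
	long fd;

	/* The mirror must match the 24-byte version-0 layout. */
	assert(sizeof(how) == 24);

	fd = example_openat2(AT_FDCWD, ".", &how, sizeof(how));
	if (fd >= 0)
		close((int)fd);

	/* fd may be -1 (e.g. ENOSYS on old kernels); the check still applies. */
	return memcmp(&how, &before, sizeof(how)) == 0;
}
```

On kernels that only copy the struct in, example_how_unchanged() should
return 1. That says nothing about what a future kernel might do, which is
the point under dispute.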

> Earlier on in this thread, Aleksa mentioned sched_setattr as
> establishing precedent for the kernel modifying non-const objects. It
> looks like glibc actually does provide a sched_setattr wrapper since
> 2.41.

Although it may be too late to change that misfeature, it's not too late 
to change this one. And even if it was a good idea for sched_setattr, 
that doesn't mean it's a good idea for openat2.


* Re: [PATCH v4] linux: Add openat2 (BZ 31664)
From: enh @ 2025-09-02 16:23 UTC (permalink / raw)
  To: Arjun Shankar
  Cc: Paul Eggert, Aleksa Sarai, Adhemerval Zanella Netto, libc-alpha,
	linux-api
In-Reply-To: <CAG_osaYc21nR0M3O6UKs8zna6x_k9U4=Rt4B0mKHog=ZLSH1AQ@mail.gmail.com>

On Mon, Sep 1, 2025 at 10:42 PM Arjun Shankar <arjun@redhat.com> wrote:
>
> Hi Paul,
>
> > On 2025-08-28 01:42, Aleksa Sarai wrote:
> > >> I still fail to understand how a hypothetical "give me the supported flags"
> > >> openat2 flag would be useful enough to justify complicating the openat2 API
> > >> today.
> > > My only concern is that it would break recompiles if/when we change it
> > > back.
> >
> > OK, but from what I can see there's no identified possibility that
> > openat2 will modify the objects its arguments point to, just as there's
> > no identified possibility that plain openat will do so (in a
> > hypothetical extension to remove unnecessary slashes from its filename
> > argument, say).
>
> While it is true that openat cannot be extended in this way, for
> openat2 (whether or not it eventually materializes in Linux) there
> already is the RFC patch series proposing CHECK_FIELDS that Aleksa
> referred to earlier. And it's not just that: it has been mentioned as
> a potential future direction even when the openat2 syscall was
> implemented [1]. I think we should interpret this to mean that there
> is indeed a possibility for openat2.
>
> > In that case it's pretty clear that glibc should mark the open_how
> > argument as pointer-to-const, just as glibc already marks the filename
> > argument.
>
> Unless the kernel marks open_how as const, glibc marking it as const
> can lead to additional maintenance complications down the line: in the
> future if the kernel starts modifying open_how, glibc's openat2
> wrapper will no longer align with the kernel's behavior. At that
> point, glibc will either need to discard the const (which will cause
> any existing users of the wrapper to fail to recompile), or glibc will
> need to handle the kernel's new behavior in the wrapper (which will
> lead to further divergence from the behavior of the syscall that we
> would claim to wrap). Neither of these seems problem-free. On the
> other hand, following the kernel's declaration will mean that should
> the kernel choose to mark it const, we can easily follow suit in glibc
> without breaking recompiles.
>
> Earlier on in this thread, Aleksa mentioned sched_setattr as
> establishing precedent for the kernel modifying non-const objects. It
> looks like glibc actually does provide a sched_setattr wrapper since
> 2.41. The relevant argument hasn't been marked as const and the kernel
> does modify the contents, and glibc's syscall wrapper simply passes it
> through. So we already do this.

given that

SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
                               unsigned int, flags)

calls sched_setattr(), which is defined thus:

int sched_setattr(struct task_struct *p, const struct sched_attr *attr)
{
        return __sched_setscheduler(p, attr, true, true);
}

i think that's just a copy & paste mistake in the kernel -- carefully
preserved in glibc and bionic -- no?

(i only see the kernel updating its own _copy_ of the passed-in struct.)

> Based on all this, I feel that leaving open_how as-is is the easier
> and more maintenance-friendly choice for the syscall wrapper.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fddb5d430ad9fa91b49b1d34d0202ffe2fa0e179
>
> --
> Arjun Shankar
> he/him/his
>


* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Jason Gunthorpe @ 2025-09-02 13:48 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Pasha Tatashin, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <mafs0h5xmw12a.fsf@kernel.org>

On Mon, Sep 01, 2025 at 07:10:53PM +0200, Pratyush Yadav wrote:
> Building kvalloc on top of this becomes trivial.
> 
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/pratyush/linux.git/commit/?h=kho-array&id=cf4c04c1e9ac854e3297018ad6dada17c54a59af

This isn't really an array, it is a non-seekable serialization of
key/values with some optimization for consecutive keys. IMHO it is
most useful if you don't know the size of the thing you want to
serialize in advance since it has a nice dynamic append.

But if you do know the size, I think it makes more sense just to do a
preserving vmalloc and write out a linear array..

So, it could be useful, but I wouldn't use it for memfd, the vmalloc
approach is better and we shouldn't optimize for sparseness which
should never happen.

> > The versioning should be first class, not hidden away as some emergent
> > property of registering multiple serializers or something like that.
> 
> That makes sense. How about some simple changes to the LUO interfaces to
> make the version more prominent:
> 
> 	int (*prepare)(struct liveupdate_file_handler *handler,
> 		       struct file *file, u64 *data, char **compatible);

Yeah, something more integrated with the ops is better.

You could list the supported versions in the ops itself

  const char **supported_deserialize_versions;

And let the luo framework find the right versions.

But for prepare I would expect an inbetween object:

	int (*prepare)(struct liveupdate_file_handler *handler,
	    	       struct luo_object *obj, struct file *file);

And then you'd do function calls on 'obj' to store 'data' per version.
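
The version list idea can be sketched in plain userspace C, roughly as
below. This is only an illustration under assumptions: struct
example_file_handler, example_find_handler() and the version strings are
hypothetical names, not the LUO interface, and a real framework would
probably also want to pick the newest common version rather than the first
match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical ops: a NULL-terminated list of versions it can restore. */
struct example_file_handler {
	const char *name;
	const char *const *supported_deserialize_versions;
};

/* Return the first handler that lists 'version', or NULL if none does. */
const struct example_file_handler *
example_find_handler(const struct example_file_handler *handlers, size_t n,
		     const char *version)
{
	assert(version != NULL);

	for (size_t i = 0; i < n; i++) {
		const char *const *v = handlers[i].supported_deserialize_versions;

		for (; v && *v; v++)
			if (strcmp(*v, version) == 0)
				return &handlers[i];
	}
	return NULL;
}

/* Made-up version strings, for illustration only. */
const char *const example_memfd_versions[] = { "memfd-v1", "memfd-v2", NULL };
const char *const example_pci_versions[] = { "pci-v1", NULL };

const struct example_file_handler example_handlers[] = {
	{ "memfd", example_memfd_versions },
	{ "pci", example_pci_versions },
};
```

The framework, not each handler, walks the lists, so an unknown version is
rejected in one place before any handler code runs.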

Jason


* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Jason Gunthorpe @ 2025-09-02 13:41 UTC (permalink / raw)
  To: Chris Li
  Cc: Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CAF8kJuPaSQN04M-pvpFTjjpzk3pfHNhpx+mCkvWpZOs=0TF3gg@mail.gmail.com>

On Fri, Aug 29, 2025 at 12:18:43PM -0700, Chris Li wrote:

> Another idea is that having a middle layer manages the life cycle of
> the reserved memory for you. Kind of like a slab allocator for the
> preserved memory. 

If you want a slab allocator then I think you should make slab
preservable.. Don't need more allocators :\

> Question: Do we have a matching FDT node to match the memfd C
> structure hierarchy? Otherwise all the C struct will lump into one FDT
> node. Maybe one FDT node for all C struct is fine. Then there is a
> risk of overflowing the 4K buffer limit on the FDT node.

I thought you were getting rid of FDT? My suggestion was to be taken
as a FDT replacement..

You need some kind of hierarchy of identifiers, things like memfd
should chain off some higher level luo object for a file descriptor.

PCI should be the same, but not fd based.

It may be that luo maintains some flat dictionary of
  string -> [object type, version, u64 ptr]*

And if you want to serialize that the optimal path would be to have a
vmalloc of all the strings and a vmalloc of the [] data, sort of like
the kho array idea.
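
A rough userspace model of that layout, as a sketch only: one linear blob
holding all the strings and one linear array of fixed-size records that
point into it. Everything named example_* is hypothetical, realloc() stands
in for the two preserved vmalloc areas, and for brevity each key carries a
single record rather than a list of them.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed-size record, one per preserved object. */
struct example_entry {
	uint32_t name_off;	/* offset of the NUL-terminated key in the blob */
	uint32_t type;
	uint32_t version;
	uint32_t reserved;
	uint64_t ptr;		/* opaque u64, e.g. a physical address */
};

struct example_dict {
	char *strings;		/* all keys, back to back */
	size_t strings_len;
	struct example_entry *entries;
	size_t nr_entries;
};

/* Append one key and its record; returns 0 on success, -1 on failure. */
int example_dict_add(struct example_dict *d, const char *key,
		     uint32_t type, uint32_t version, uint64_t ptr)
{
	size_t klen;
	char *s;
	struct example_entry *e;

	assert(d != NULL && key != NULL);

	klen = strlen(key) + 1;
	s = realloc(d->strings, d->strings_len + klen);
	if (!s)
		return -1;
	d->strings = s;

	e = realloc(d->entries, (d->nr_entries + 1) * sizeof(*e));
	if (!e)
		return -1;
	d->entries = e;

	memcpy(d->strings + d->strings_len, key, klen);
	e[d->nr_entries] = (struct example_entry){
		.name_off = (uint32_t)d->strings_len,
		.type = type,
		.version = version,
		.ptr = ptr,
	};
	d->strings_len += klen;
	d->nr_entries++;
	return 0;
}

/* Linear lookup by key; returns NULL if the key is absent. */
const struct example_entry *
example_dict_find(const struct example_dict *d, const char *key)
{
	for (size_t i = 0; i < d->nr_entries; i++)
		if (strcmp(d->strings + d->entries[i].name_off, key) == 0)
			return &d->entries[i];
	return NULL;
}
```

Because records refer to keys by offset rather than by pointer, both
buffers stay valid if they land at different addresses after restore.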

> At this stage, do you see that exploring such a machine idea can be
> beneficial or harmful to the project? If such an idea is considered
> harmful, we should stop discussing such an idea at all. Go back to
> building more batches of hand crafted screws, which are waiting by the
> next critical component.

I haven't heard a compelling idea that will obviously make things
better.. Adding more layers and complexity is not better.

Your BTF proposal doesn't seem to benefit memfd at all, it was focused
on extracting data directly from an existing struct which I feel very
strongly we should never do.

The above dictionary, I also don't see how BTF helps. It is such a
special encoding. Yes you could make some elaborate serialization
infrastructure, like FDT, but we have all been saying FDT is too hard
to use and too much code. I'm not sure I'm convinced there is really a
better middle ground :\

IMHO if there is some way to improve this it is still yet to be found,
and I think we don't well understand what we need to serialize just
yet.

Smaller ideas like preserve the vmalloc will make big improvement
already.

Let's not race ahead until we understand the actual problem properly.

Jason


* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Mike Rapoport @ 2025-09-02 11:58 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Jason Gunthorpe, pratyush, jasonmiu, graf, changyuanl, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CA+CK2bC96fxHBb78DvNhyfdjsDfPCLY5J5cN8W0hUDt9KAPBJQ@mail.gmail.com>

On Mon, Sep 01, 2025 at 04:54:15PM +0000, Pasha Tatashin wrote:
> On Mon, Sep 1, 2025 at 4:23 PM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Tue, Aug 26, 2025 at 01:20:19PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
> > >
> > > > +   /*
> > > > +    * Most of the space should be taken by preserved folios. So take its
> > > > +    * size, plus a page for other properties.
> > > > +    */
> > > > +   fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
> > > > +   if (!fdt) {
> > > > +           err = -ENOMEM;
> > > > +           goto err_unpin;
> > > > +   }
> > >
> > > This doesn't seem to have any versioning scheme, it really should..
> > >
> > > > +   err = fdt_property_placeholder(fdt, "folios", preserved_size,
> > > > +                                  (void **)&preserved_folios);
> > > > +   if (err) {
> > > > +           pr_err("Failed to reserve folios property in FDT: %s\n",
> > > > +                  fdt_strerror(err));
> > > > +           err = -ENOMEM;
> > > > +           goto err_free_fdt;
> > > > +   }
> > >
> > > Yuk.
> > >
> > > This really wants some luo helper
> > >
> > > 'luo alloc array'
> > > 'luo restore array'
> > > 'luo free array'
> >
> > We can just add kho_{preserve,restore}_vmalloc(). I've drafted it here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=kho/vmalloc/v1
> 
> The patch looks okay to me, but it doesn't support holes in vmap
> areas. While that is likely acceptable for vmalloc, it could be a
> problem if we want to preserve memfd with holes and using vmap
> preservation as a method, which would require a different approach.
> Still, this would help with preserving memfd.

I can't say I understand what you mean by "holes in vmap areas". We anyway
get an array of folios in memfd_pin_folios() and at that point we know
exactly how many folios there are. So we can do something like

	preserved_folios = vmalloc_array(nr_folios, sizeof(*preserved_folios));
	memfd_luo_preserve_folios(preserved_folios, folios, nr_folios);
	kho_preserve_vmalloc(preserved_folios, &folios_info);

> However, I wonder if we should add a separate preservation library on
> top of the kho and not as part of kho (or at least keep them in a
> separate file from core logic). This would allow us to preserve more
> advanced data structures such as this and define preservation version
> control, similar to Jason's store_object/restore_object proposal.

kho_preserve_vmalloc() seems quite basic and I don't think it should be
separated from kho core. kho_array is already planned in a separate file :)
 
> > Will wait for kbuild and then send proper patches.
> >
> >
> > --
> > Sincerely yours,
> > Mike.
> 

-- 
Sincerely yours,
Mike.


* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Mike Rapoport @ 2025-09-02 11:44 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Jason Gunthorpe, Pasha Tatashin, jasonmiu, graf, changyuanl,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <mafs0ldmyw1hp.fsf@kernel.org>

Hi Pratyush,

On Mon, Sep 01, 2025 at 07:01:38PM +0200, Pratyush Yadav wrote:
> Hi Mike,
> 
> On Mon, Sep 01 2025, Mike Rapoport wrote:
> 
> > On Tue, Aug 26, 2025 at 01:20:19PM -0300, Jason Gunthorpe wrote:
> >> On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
> >> 
> >> > +	/*
> >> > +	 * Most of the space should be taken by preserved folios. So take its
> >> > +	 * size, plus a page for other properties.
> >> > +	 */
> >> > +	fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
> >> > +	if (!fdt) {
> >> > +		err = -ENOMEM;
> >> > +		goto err_unpin;
> >> > +	}
> >> 
> >> This doesn't seem to have any versioning scheme, it really should..
> >> 
> >> > +	err = fdt_property_placeholder(fdt, "folios", preserved_size,
> >> > +				       (void **)&preserved_folios);
> >> > +	if (err) {
> >> > +		pr_err("Failed to reserve folios property in FDT: %s\n",
> >> > +		       fdt_strerror(err));
> >> > +		err = -ENOMEM;
> >> > +		goto err_free_fdt;
> >> > +	}
> >> 
> >> Yuk.
> >> 
> >> This really wants some luo helper
> >> 
> >> 'luo alloc array'
> >> 'luo restore array'
> >> 'luo free array'
> >
> > We can just add kho_{preserve,restore}_vmalloc(). I've drafted it here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=kho/vmalloc/v1
> >
> > Will wait for kbuild and then send proper patches.
> 
> I have been working on something similar, but in a more generic way.
> 
> I have implemented a sparse KHO-preservable array (called kho_array)
> with xarray like properties. It can take in 4-byte aligned pointers and
> supports saving non-pointer values similar to xa_mk_value(). For now it
> doesn't support multi-index entries, but if needed the data format can
> be extended to support it as well.
> 
> The structure is very similar to what you have implemented. It uses a
> linked list of pages with some metadata at the head of each page.
> 
> I have used it for memfd preservation, and I think it is quite
> versatile. For example, your kho_preserve_vmalloc() can be very easily
> built on top of this kho_array by simply saving each physical page
> address at consecutive indices in the array.

I've started to work on something similar to your kho_array for memfd case
and then I thought that since we know the size of the array we can simply
vmalloc it and preserve vmalloc, and that led me to implementing
preservation of vmalloc :)

I like the idea to have kho_array for cases when we don't know the amount
of data to preserve in advance, but for memfd as it's currently
implemented I think that allocating and preserving vmalloc is simpler.

As for porting kho_preserve_vmalloc() to kho_array, I also feel that it
would just make kho_preserve_vmalloc() more complex and I'd rather simplify
it even more, e.g. with preallocating all the pages that preserve indices
in advance.
 
> The code is still WIP and currently a bit hacky, but I will clean it up
> in a couple days and I think it should be ready for posting. You can
> find the current version at [0][1]. Would be good to hear your thoughts,
> and if you agree with the approach, I can also port
> kho_preserve_vmalloc() to work on top of kho_array as well.
> 
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/pratyush/linux.git/commit/?h=kho-array&id=cf4c04c1e9ac854e3297018ad6dada17c54a59af
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/pratyush/linux.git/commit/?h=kho-array&id=5eb0d7316274a9c87acaeedd86941979fc4baf96
> 
> -- 
> Regards,
> Pratyush Yadav

-- 
Sincerely yours,
Mike.


* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Jason Gunthorpe @ 2025-09-02 11:38 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, Mike Rapoport, jasonmiu, graf, changyuanl,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CA+CK2bAb6s=gUTCNjMrOqptZ3a_nj3teuVSZs86AvVymvaURQA@mail.gmail.com>

On Mon, Sep 01, 2025 at 07:02:46PM +0000, Pasha Tatashin wrote:
> > >> > This really wants some luo helper
> > >> >
> > >> > 'luo alloc array'
> > >> > 'luo restore array'
> > >> > 'luo free array'
> > >>
> > >> We can just add kho_{preserve,restore}_vmalloc(). I've drafted it here:
> > >> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=kho/vmalloc/v1
> > >
> > > The patch looks okay to me, but it doesn't support holes in vmap
> > > areas. While that is likely acceptable for vmalloc, it could be a
> > > problem if we want to preserve memfd with holes and using vmap
> > > preservation as a method, which would require a different approach.
> > > Still, this would help with preserving memfd.
> >
> > I agree. I think we should do it the other way round. Build a sparse
> > array first, and then use that to build vmap preservation. Our emails
> 
> Yes, sparse array support would help both: vmalloc and memfd preservation.

Why? vmalloc is always fully populated, no sparseness..

And again in real systems we expect memfd to be fully populated too.

I wouldn't invest any time in something like this right now. Just be
inefficient if there is sparseness for some reason.
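
As a sketch of what "just be inefficient" could look like, assuming a dense
table with one slot per page index and a reserved value for holes. The
names example_dense_table() and EXAMPLE_NO_FOLIO are made up, and treating
0 as "no folio" is an assumption of this sketch, not of the patches.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define EXAMPLE_NO_FOLIO 0	/* hypothetical marker for an unpopulated index */

/*
 * Build a dense table where slot i describes page index i of the file.
 * idx[] and pfn[] list the populated indices and their values; every
 * hole simply costs one unused slot. Caller frees the result.
 */
uint64_t *example_dense_table(const uint64_t *idx, const uint64_t *pfn,
			      size_t nr, size_t nr_slots)
{
	uint64_t *table = calloc(nr_slots, sizeof(*table));

	if (!table)
		return NULL;

	for (size_t i = 0; i < nr; i++) {
		assert(idx[i] < nr_slots);
		table[idx[i]] = pfn[i];
	}
	return table;
}
```

A fully populated file wastes nothing, and a file with holes wastes eight
bytes per hole, with no sparse structure to maintain.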

Jason


* [PATCH v20 8/8] selftests/clone3: Test shadow stack support
From: Mark Brown @ 2025-09-02 10:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook, Shuah Khan
In-Reply-To: <20250902-clone3-shadow-stack-v20-0-4d9fff1c53e7@kernel.org>

Add basic test coverage for specifying the shadow stack for a newly
created thread via clone3(), including coverage of the newly extended
argument structure.  We check that a user specified shadow stack can be
provided, and that invalid combinations of parameters are rejected.

In order to facilitate testing on systems without userspace shadow stack
support we manually enable shadow stacks on startup, this is architecture
specific due to the use of an arch_prctl() on x86. Due to interactions with
potential userspace locking of features we actually detect support for
shadow stacks on the running system by attempting to allocate a shadow
stack page during initialisation using map_shadow_stack(), warning if this
succeeds when the enable failed.

In order to allow testing of user configured shadow stacks on
architectures with that feature we need to ensure that we do not return
from the function where the clone3() syscall is called in the child
process, doing so would trigger a shadow stack underflow.  To do this we
use inline assembly rather than the standard syscall wrapper to call
clone3().  In order to avoid surprises we also use a syscall rather than
the libc exit() function, this should be overly cautious.

Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 tools/testing/selftests/clone3/clone3.c           | 143 +++++++++++++++++++++-
 tools/testing/selftests/clone3/clone3_selftests.h |  63 ++++++++++
 2 files changed, 205 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index 5b8b7d640e70..6fd2b3238e2c 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -3,6 +3,7 @@
 /* Based on Christian Brauner's clone3() example */
 
 #define _GNU_SOURCE
+#include <asm/mman.h>
 #include <errno.h>
 #include <inttypes.h>
 #include <linux/types.h>
@@ -11,6 +12,7 @@
 #include <stdint.h>
 #include <stdio.h>
 #include <stdlib.h>
+#include <sys/mman.h>
 #include <sys/syscall.h>
 #include <sys/types.h>
 #include <sys/un.h>
@@ -19,8 +21,12 @@
 #include <sched.h>
 
 #include "../kselftest.h"
+#include "../ksft_shstk.h"
 #include "clone3_selftests.h"
 
+static bool shadow_stack_supported;
+static size_t max_supported_args_size;
+
 enum test_mode {
 	CLONE3_ARGS_NO_TEST,
 	CLONE3_ARGS_ALL_0,
@@ -28,6 +34,10 @@ enum test_mode {
 	CLONE3_ARGS_INVAL_EXIT_SIGNAL_NEG,
 	CLONE3_ARGS_INVAL_EXIT_SIGNAL_CSIG,
 	CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG,
+	CLONE3_ARGS_SHADOW_STACK,
+	CLONE3_ARGS_SHADOW_STACK_MISALIGNED,
+	CLONE3_ARGS_SHADOW_STACK_NO_TOKEN,
+	CLONE3_ARGS_SHADOW_STACK_NORMAL_MEMORY,
 };
 
 typedef bool (*filter_function)(void);
@@ -44,6 +54,44 @@ struct test {
 	filter_function filter;
 };
 
+
+/*
+ * We check for shadow stack support by attempting to use
+ * map_shadow_stack() since features may have been locked by the
+ * dynamic linker resulting in spurious errors when we attempt to
+ * enable on startup.  We warn if the enable failed.
+ */
+static void test_shadow_stack_supported(void)
+{
+	long ret;
+
+	ret = syscall(__NR_map_shadow_stack, 0, getpagesize(), 0);
+	if (ret == -1) {
+		ksft_print_msg("map_shadow_stack() not supported\n");
+	} else if ((void *)ret == MAP_FAILED) {
+		ksft_print_msg("Failed to map shadow stack\n");
+	} else {
+		ksft_print_msg("Shadow stack supported\n");
+		shadow_stack_supported = true;
+
+		if (!shadow_stack_enabled)
+			ksft_print_msg("Mapped but did not enable shadow stack\n");
+	}
+}
+
+static void *get_shadow_stack_page(unsigned long flags)
+{
+	unsigned long long page;
+
+	page = syscall(__NR_map_shadow_stack, 0, getpagesize(), flags);
+	if ((void *)page == MAP_FAILED) {
+		ksft_print_msg("map_shadow_stack() failed: %d\n", errno);
+		return 0;
+	}
+
+	return (void *)page;
+}
+
 static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 {
 	struct __clone_args args = {
@@ -57,6 +105,7 @@ static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 	} args_ext;
 
 	pid_t pid = -1;
+	void *p;
 	int status;
 
 	memset(&args_ext, 0, sizeof(args_ext));
@@ -89,6 +138,26 @@ static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 	case CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG:
 		args.exit_signal = 0x00000000000000f0ULL;
 		break;
+	case CLONE3_ARGS_SHADOW_STACK:
+		p = get_shadow_stack_page(SHADOW_STACK_SET_TOKEN);
+		p += getpagesize() - sizeof(void *);
+		args.shadow_stack_token = (unsigned long long)p;
+		break;
+	case CLONE3_ARGS_SHADOW_STACK_MISALIGNED:
+		p = get_shadow_stack_page(SHADOW_STACK_SET_TOKEN);
+		p += getpagesize() - sizeof(void *) - 1;
+		args.shadow_stack_token = (unsigned long long)p;
+		break;
+	case CLONE3_ARGS_SHADOW_STACK_NORMAL_MEMORY:
+		p = malloc(getpagesize());
+		p += getpagesize() - sizeof(void *);
+		args.shadow_stack_token = (unsigned long long)p;
+		break;
+	case CLONE3_ARGS_SHADOW_STACK_NO_TOKEN:
+		p = get_shadow_stack_page(0);
+		p += getpagesize() - sizeof(void *);
+		args.shadow_stack_token = (unsigned long long)p;
+		break;
 	}
 
 	memcpy(&args_ext.args, &args, sizeof(struct __clone_args));
@@ -102,7 +171,12 @@ static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 
 	if (pid == 0) {
 		ksft_print_msg("I am the child, my PID is %d\n", getpid());
-		_exit(EXIT_SUCCESS);
+		/*
+		 * Use a raw syscall to ensure we don't get issues
+		 * with manually specified shadow stack and exit handlers.
+		 */
+		syscall(__NR_exit, EXIT_SUCCESS);
+		ksft_print_msg("CHILD FAILED TO EXIT PID is %d\n", getpid());
 	}
 
 	ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
@@ -184,6 +258,26 @@ static bool no_timenamespace(void)
 	return true;
 }
 
+static bool have_shadow_stack(void)
+{
+	if (shadow_stack_supported) {
+		ksft_print_msg("Shadow stack supported\n");
+		return true;
+	}
+
+	return false;
+}
+
+static bool no_shadow_stack(void)
+{
+	if (!shadow_stack_supported) {
+		ksft_print_msg("Shadow stack not supported\n");
+		return true;
+	}
+
+	return false;
+}
+
 static size_t page_size_plus_8(void)
 {
 	return getpagesize() + 8;
@@ -327,6 +421,50 @@ static const struct test tests[] = {
 		.expected = -EINVAL,
 		.test_mode = CLONE3_ARGS_NO_TEST,
 	},
+	{
+		.name = "Shadow stack on system with shadow stack",
+		.size = 0,
+		.expected = 0,
+		.e2big_valid = true,
+		.test_mode = CLONE3_ARGS_SHADOW_STACK,
+		.filter = no_shadow_stack,
+	},
+	{
+		.name = "Shadow stack with misaligned address",
+		.flags = CLONE_VM,
+		.size = 0,
+		.expected = -EINVAL,
+		.e2big_valid = true,
+		.test_mode = CLONE3_ARGS_SHADOW_STACK_MISALIGNED,
+		.filter = no_shadow_stack,
+	},
+	{
+		.name = "Shadow stack with normal memory",
+		.flags = CLONE_VM,
+		.size = 0,
+		.expected = -EFAULT,
+		.e2big_valid = true,
+		.test_mode = CLONE3_ARGS_SHADOW_STACK_NORMAL_MEMORY,
+		.filter = no_shadow_stack,
+	},
+	{
+		.name = "Shadow stack with no token",
+		.flags = CLONE_VM,
+		.size = 0,
+		.expected = -EINVAL,
+		.e2big_valid = true,
+		.test_mode = CLONE3_ARGS_SHADOW_STACK_NO_TOKEN,
+		.filter = no_shadow_stack,
+	},
+	{
+		.name = "Shadow stack on system without shadow stack",
+		.flags = CLONE_VM,
+		.size = 0,
+		.expected = -EFAULT,
+		.e2big_valid = true,
+		.test_mode = CLONE3_ARGS_SHADOW_STACK_NORMAL_MEMORY,
+		.filter = have_shadow_stack,
+	},
 };
 
 int main(int argc, char *argv[])
@@ -334,9 +472,12 @@ int main(int argc, char *argv[])
 	size_t size;
 	int i;
 
+	enable_shadow_stack();
+
 	ksft_print_header();
 	ksft_set_plan(ARRAY_SIZE(tests));
 	test_clone3_supported();
+	test_shadow_stack_supported();
 
 	for (i = 0; i < ARRAY_SIZE(tests); i++)
 		test_clone3(&tests[i]);
diff --git a/tools/testing/selftests/clone3/clone3_selftests.h b/tools/testing/selftests/clone3/clone3_selftests.h
index 939b26c86d42..8151c4fc971a 100644
--- a/tools/testing/selftests/clone3/clone3_selftests.h
+++ b/tools/testing/selftests/clone3/clone3_selftests.h
@@ -31,12 +31,75 @@ struct __clone_args {
 	__aligned_u64 set_tid;
 	__aligned_u64 set_tid_size;
 	__aligned_u64 cgroup;
+#ifndef CLONE_ARGS_SIZE_VER2
+#define CLONE_ARGS_SIZE_VER2 88	/* sizeof third published struct */
+#endif
+	__aligned_u64 shadow_stack_token;
+#ifndef CLONE_ARGS_SIZE_VER3
+#define CLONE_ARGS_SIZE_VER3 96 /* sizeof fourth published struct */
+#endif
 };
 
+/*
+ * For architectures with shadow stack support we need to be
+ * absolutely sure that the clone3() syscall will be inline and not a
+ * function call so we open code.
+ */
+#ifdef __x86_64__
+static __always_inline pid_t sys_clone3(struct __clone_args *args, size_t size)
+{
+	register long _num  __asm__ ("rax") = __NR_clone3;
+	register long _args __asm__ ("rdi") = (long)(args);
+	register long _size __asm__ ("rsi") = (long)(size);
+	long ret;
+
+	__asm__ volatile (
+		"syscall\n"
+		: "=a"(ret)
+		: "r"(_args), "r"(_size),
+		  "0"(_num)
+		: "rcx", "r11", "memory", "cc"
+	);
+
+	if (ret < 0) {
+		errno = -ret;
+		return -1;
+	}
+
+	return ret;
+}
+#elif defined(__aarch64__)
+static __always_inline pid_t sys_clone3(struct __clone_args *args, size_t size)
+{
+	register long _num  __asm__ ("x8") = __NR_clone3;
+	register long _args __asm__ ("x0") = (long)(args);
+	register long _size __asm__ ("x1") = (long)(size);
+	register long arg2 __asm__ ("x2") = 0;
+	register long arg3 __asm__ ("x3") = 0;
+	register long arg4 __asm__ ("x4") = 0;
+
+	__asm__ volatile (
+		"svc #0\n"
+		: "=r"(_args)
+		: "r"(_args), "r"(_size),
+		  "r"(_num), "r"(arg2),
+		  "r"(arg3), "r"(arg4)
+		: "memory", "cc"
+	);
+
+	if ((int)_args < 0) {
+		errno = -((int)_args);
+		return -1;
+	}
+
+	return _args;
+}
+#else
 static pid_t sys_clone3(struct __clone_args *args, size_t size)
 {
 	return syscall(__NR_clone3, args, size);
 }
+#endif
 
 static inline void test_clone3_supported(void)
 {

-- 
2.39.5



* [PATCH v20 7/8] selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
From: Mark Brown @ 2025-09-02 10:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook, Kees Cook, Shuah Khan
In-Reply-To: <20250902-clone3-shadow-stack-v20-0-4d9fff1c53e7@kernel.org>

The clone_args structure is extensible, with the syscall passing in the
length of the structure. Inside the kernel we use copy_struct_from_user()
to read the struct but this has the unfortunate side effect of silently
accepting some overrun in the structure size providing the extra data is
all zeros. This means that we can't discover the clone3() features that
the running kernel supports by simply probing with various struct sizes.
We need to check this for the benefit of test systems which run newer
kselftests on old kernels.
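
The size handling described above can be sketched in user space as
follows. This is only a model for illustration, not the kernel's
copy_struct_from_user(); the name model_copy_struct() and its return
convention are assumptions made for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical user-space model of the size handling described above;
 * this is not the kernel implementation.
 *
 * ksize: size of the struct the "kernel" knows about
 * usize: size the caller claims to have passed
 */
static int model_copy_struct(void *dst, size_t ksize,
			     const void *src, size_t usize)
{
	const unsigned char *s = src;
	size_t copy = usize < ksize ? usize : ksize;
	size_t i;

	/* Caller's struct is larger: the unknown tail must be all zeros. */
	for (i = ksize; i < usize; i++)
		if (s[i])
			return -E2BIG;

	/* Caller's struct is smaller: the fields it lacks read as zero. */
	memset(dst, 0, ksize);
	memcpy(dst, src, copy);
	return 0;
}
```

In this model an oversized struct only fails once one of the extra
bytes is non-zero, which is why a probe with a zero-filled larger struct
tells the caller nothing about what the kernel supports.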

Add a flag which can be set on a test to indicate that clone3() may return
-E2BIG due to the use of newer struct versions. Currently no tests need
this, but it will become an issue when testing clone3() support for shadow
stacks; shadow stack support is already present on x86.
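
The resulting decision can be sketched as a small pure function. The
enum and the name classify_result() are hypothetical and exist only for
this example; the patch itself reports results through the ksft_*
helpers shown in the diff below.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical result type, used only in this sketch. */
enum outcome { OUTCOME_PASS, OUTCOME_FAIL, OUTCOME_SKIP };

/*
 * Model of the check: a mismatch is normally a failure, but -E2BIG is
 * treated as "kernel too old for this struct" when the test allows it.
 */
static enum outcome classify_result(int ret, int expected, bool e2big_valid)
{
	if (ret == expected)
		return OUTCOME_PASS;
	if (e2big_valid && ret == -E2BIG)
		return OUTCOME_SKIP;
	return OUTCOME_FAIL;
}
```

Tests that do not set the flag keep their existing behaviour, so an
unexpected -E2BIG is still reported as a failure for them.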

Reviewed-by: Kees Cook <kees@kernel.org>
Tested-by: Kees Cook <kees@kernel.org>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 tools/testing/selftests/clone3/clone3.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index e066b201fa64..5b8b7d640e70 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -39,6 +39,7 @@ struct test {
 	size_t size;
 	size_function size_function;
 	int expected;
+	bool e2big_valid;
 	enum test_mode test_mode;
 	filter_function filter;
 };
@@ -146,6 +147,11 @@ static void test_clone3(const struct test *test)
 	ksft_print_msg("[%d] clone3() with flags says: %d expected %d\n",
 			getpid(), ret, test->expected);
 	if (ret != test->expected) {
+		if (test->e2big_valid && ret == -E2BIG) {
+			ksft_print_msg("Test reported -E2BIG\n");
+			ksft_test_result_skip("%s\n", test->name);
+			return;
+		}
 		ksft_print_msg(
 			"[%d] Result (%d) is different than expected (%d)\n",
 			getpid(), ret, test->expected);

-- 
2.39.5



* [PATCH v20 6/8] selftests/clone3: Factor more of main loop into test_clone3()
From: Mark Brown @ 2025-09-02 10:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook, Kees Cook, Shuah Khan
In-Reply-To: <20250902-clone3-shadow-stack-v20-0-4d9fff1c53e7@kernel.org>

In order to make it easier to add more configuration for the tests and
more support for runtime detection of when tests can be run, pass the
structure describing the tests into test_clone3() rather than picking
the arguments out of it, and have that function do all the per-test work.

No functional change.
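
As a rough sketch of the per-test work that moves into the function, the
filter and size selection can be modelled on their own. The struct here
is trimmed to the fields involved, and resolve_test(), example_skip()
and example_size() are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

typedef bool (*filter_function)(void);
typedef size_t (*size_function)(void);

/* Trimmed-down copy of the test description, for this sketch only. */
struct test {
	const char *name;
	size_t size;
	size_function size_function;
	filter_function filter;
};

/* Hypothetical callbacks standing in for real filters and size helpers. */
static bool example_skip(void) { return true; }
static size_t example_size(void) { return 42; }

/*
 * Returns true if the test should be skipped, otherwise stores the
 * size to pass to clone3() in *size.
 */
static bool resolve_test(const struct test *test, size_t *size)
{
	if (test->filter && test->filter())
		return true;

	*size = test->size_function ? test->size_function() : test->size;
	return false;
}
```

Keeping this logic behind a single pointer to the test description
means new per-test fields can be added without changing any call site.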

Reviewed-by: Kees Cook <kees@kernel.org>
Tested-by: Kees Cook <kees@kernel.org>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 tools/testing/selftests/clone3/clone3.c | 77 ++++++++++++++++-----------------
 1 file changed, 37 insertions(+), 40 deletions(-)

diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index e61f07973ce5..e066b201fa64 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -30,6 +30,19 @@ enum test_mode {
 	CLONE3_ARGS_INVAL_EXIT_SIGNAL_NSIG,
 };
 
+typedef bool (*filter_function)(void);
+typedef size_t (*size_function)(void);
+
+struct test {
+	const char *name;
+	uint64_t flags;
+	size_t size;
+	size_function size_function;
+	int expected;
+	enum test_mode test_mode;
+	filter_function filter;
+};
+
 static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 {
 	struct __clone_args args = {
@@ -109,30 +122,40 @@ static int call_clone3(uint64_t flags, size_t size, enum test_mode test_mode)
 	return 0;
 }
 
-static bool test_clone3(uint64_t flags, size_t size, int expected,
-			enum test_mode test_mode)
+static void test_clone3(const struct test *test)
 {
+	size_t size;
 	int ret;
 
+	if (test->filter && test->filter()) {
+		ksft_test_result_skip("%s\n", test->name);
+		return;
+	}
+
+	if (test->size_function)
+		size = test->size_function();
+	else
+		size = test->size;
+
+	ksft_print_msg("Running test '%s'\n", test->name);
+
 	ksft_print_msg(
 		"[%d] Trying clone3() with flags %#" PRIx64 " (size %zu)\n",
-		getpid(), flags, size);
-	ret = call_clone3(flags, size, test_mode);
+		getpid(), test->flags, size);
+	ret = call_clone3(test->flags, size, test->test_mode);
 	ksft_print_msg("[%d] clone3() with flags says: %d expected %d\n",
-			getpid(), ret, expected);
-	if (ret != expected) {
+			getpid(), ret, test->expected);
+	if (ret != test->expected) {
 		ksft_print_msg(
 			"[%d] Result (%d) is different than expected (%d)\n",
-			getpid(), ret, expected);
-		return false;
+			getpid(), ret, test->expected);
+		ksft_test_result_fail("%s\n", test->name);
+		return;
 	}
 
-	return true;
+	ksft_test_result_pass("%s\n", test->name);
 }
 
-typedef bool (*filter_function)(void);
-typedef size_t (*size_function)(void);
-
 static bool not_root(void)
 {
 	if (getuid() != 0) {
@@ -160,16 +183,6 @@ static size_t page_size_plus_8(void)
 	return getpagesize() + 8;
 }
 
-struct test {
-	const char *name;
-	uint64_t flags;
-	size_t size;
-	size_function size_function;
-	int expected;
-	enum test_mode test_mode;
-	filter_function filter;
-};
-
 static const struct test tests[] = {
 	{
 		.name = "simple clone3()",
@@ -319,24 +332,8 @@ int main(int argc, char *argv[])
 	ksft_set_plan(ARRAY_SIZE(tests));
 	test_clone3_supported();
 
-	for (i = 0; i < ARRAY_SIZE(tests); i++) {
-		if (tests[i].filter && tests[i].filter()) {
-			ksft_test_result_skip("%s\n", tests[i].name);
-			continue;
-		}
-
-		if (tests[i].size_function)
-			size = tests[i].size_function();
-		else
-			size = tests[i].size;
-
-		ksft_print_msg("Running test '%s'\n", tests[i].name);
-
-		ksft_test_result(test_clone3(tests[i].flags, size,
-					     tests[i].expected,
-					     tests[i].test_mode),
-				 "%s\n", tests[i].name);
-	}
+	for (i = 0; i < ARRAY_SIZE(tests); i++)
+		test_clone3(&tests[i]);
 
 	ksft_finished();
 }

-- 
2.39.5



* [PATCH v20 5/8] selftests/clone3: Remove redundant flushes of output streams
From: Mark Brown @ 2025-09-02 10:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook, Kees Cook, Shuah Khan
In-Reply-To: <20250902-clone3-shadow-stack-v20-0-4d9fff1c53e7@kernel.org>

Since there were widespread issues with output not being flushed, the
kselftest framework was modified to explicitly set the output streams
unbuffered in commit 58e2847ad2e6 ("selftests: line buffer test
program's stdout"), so there is no need to explicitly flush in the clone3
tests.

Reviewed-by: Kees Cook <kees@kernel.org>
Tested-by: Kees Cook <kees@kernel.org>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 tools/testing/selftests/clone3/clone3_selftests.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/tools/testing/selftests/clone3/clone3_selftests.h b/tools/testing/selftests/clone3/clone3_selftests.h
index eeca8005723f..939b26c86d42 100644
--- a/tools/testing/selftests/clone3/clone3_selftests.h
+++ b/tools/testing/selftests/clone3/clone3_selftests.h
@@ -35,8 +35,6 @@ struct __clone_args {
 
 static pid_t sys_clone3(struct __clone_args *args, size_t size)
 {
-	fflush(stdout);
-	fflush(stderr);
 	return syscall(__NR_clone3, args, size);
 }
 

-- 
2.39.5



* [PATCH v20 4/8] fork: Add shadow stack support to clone3()
From: Mark Brown @ 2025-09-02 10:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook
In-Reply-To: <20250902-clone3-shadow-stack-v20-0-4d9fff1c53e7@kernel.org>

Unlike with the normal stack, there is no API for configuring the shadow
stack for a new thread; instead the kernel will dynamically allocate a
new shadow stack with the same size as the normal stack. This appears to
be due to the shadow stack series having been in development since
before the more extensible clone3() was added rather than anything more
deliberate.

Add a parameter to clone3() specifying a shadow stack pointer to use
for the new thread. This is inconsistent with the way we specify the
normal stack, but during review concerns were expressed about having to
identify where the shadow stack pointer should be placed, especially in
cases where the shadow stack has been previously active.  If no shadow
stack is specified then the existing implicit allocation behaviour is
maintained.
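
For illustration, the extended argument layout can be mirrored in user
space and checked at compile time. The struct and macro names below are
local to this sketch, and model_args_size() is a hypothetical caller
strategy, not something this series adds.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the extended layout; field names follow the patch. */
struct clone_args_model {
	uint64_t flags;
	uint64_t pidfd;
	uint64_t child_tid;
	uint64_t parent_tid;
	uint64_t exit_signal;
	uint64_t stack;
	uint64_t stack_size;
	uint64_t tls;
	uint64_t set_tid;
	uint64_t set_tid_size;
	uint64_t cgroup;
	uint64_t shadow_stack_token;
};

#define MODEL_SIZE_VER2 88	/* layout ending at cgroup */
#define MODEL_SIZE_VER3 96	/* layout ending at shadow_stack_token */

_Static_assert(offsetof(struct clone_args_model, shadow_stack_token) ==
	       MODEL_SIZE_VER2, "new field must start where VER2 ends");
_Static_assert(sizeof(struct clone_args_model) == MODEL_SIZE_VER3,
	       "VER3 must cover the whole struct");

/*
 * Hypothetical helper: only claim the larger size when a shadow stack
 * is actually requested, so other callers look unchanged to the kernel.
 */
static size_t model_args_size(const struct clone_args_model *args)
{
	return args->shadow_stack_token ? MODEL_SIZE_VER3 : MODEL_SIZE_VER2;
}
```

A caller that leaves shadow_stack_token at zero gets the implicit
allocation behaviour whichever of the two sizes it passes.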

If a shadow stack pointer is specified then it is required to have an
architecture defined token placed on the stack, which will be consumed by
the new task; the shadow stack is specified by pointing to this token.  If
no valid token is present then this will be reported with -EINVAL.  This
token prevents new threads being created pointing at the shadow stack of
an existing running thread.  On architectures with support for userspace
pivoting of shadow stacks it is expected that the same format and placement
of tokens will be used; this is the case for arm64 and x86.
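
The one-shot nature of the token can be sketched with C11 atomics. This
is a user-space model only: it follows the x86 token value computed in
the diff below, assumes an 8 byte shadow stack frame, and the model_*
names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: 8 byte frames, as on 64-bit x86. */
#define MODEL_SS_FRAME_SIZE 8

/* Value the x86 code in this patch expects to find at the token address. */
static uint64_t model_expected_token(uint64_t token_addr)
{
	return (token_addr + MODEL_SS_FRAME_SIZE) | 1;
}

/*
 * Atomically replace a valid token with 0. A second attempt on the same
 * slot fails, which models why a stack that is already in use cannot be
 * handed to another new thread.
 */
static bool model_consume_token(_Atomic uint64_t *slot, uint64_t token_addr)
{
	uint64_t expected = model_expected_token(token_addr);

	return atomic_compare_exchange_strong(slot, &expected, 0);
}
```

In the kernel a failed compare-and-exchange is what gets reported to
the caller as -EINVAL.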

If the architecture does not support shadow stacks the shadow stack
pointer must not be specified; architectures that do support the
feature are expected to enforce the same requirement on individual
systems that lack shadow stack support.
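
The generic part of this rule can be sketched as a pure function. It
models the build-time check only; the function name and the
arch_has_shstk parameter are assumptions for the example, standing in
for the kernel configuration option.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the argument check: no token is always fine, a token must be
 * pointer aligned, and a token is rejected outright when the kernel was
 * built without shadow stack support.
 */
static bool model_shadow_stack_arg_valid(uint64_t token, bool arch_has_shstk)
{
	if (!token)
		return true;

	if (token % sizeof(void *))
		return false;

	return arch_has_shstk;
}
```

The per-system check, for hardware that lacks the feature on a kernel
built with support, is left to the architecture code.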

Update the existing arm64 and x86 implementations to pay attention to
the newly added arguments. In order to maintain compatibility we use the
existing behaviour if no shadow stack is specified. Since we are now
using more fields from the kernel_clone_args we pass that into the
shadow stack code rather than individual fields.

Portions of the x86 architecture code were written by Rick Edgecombe.

Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
Tested-by: Yury Khrustalev <yury.khrustalev@arm.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 arch/arm64/mm/gcs.c              | 47 +++++++++++++++++++-
 arch/x86/include/asm/shstk.h     | 11 +++--
 arch/x86/kernel/process.c        |  2 +-
 arch/x86/kernel/shstk.c          | 53 ++++++++++++++++++++---
 include/asm-generic/cacheflush.h | 11 +++++
 include/linux/sched/task.h       | 17 ++++++++
 include/uapi/linux/sched.h       |  9 ++--
 kernel/fork.c                    | 93 ++++++++++++++++++++++++++++++++++------
 8 files changed, 217 insertions(+), 26 deletions(-)

diff --git a/arch/arm64/mm/gcs.c b/arch/arm64/mm/gcs.c
index 3abcbf9adb5c..249ff05bca45 100644
--- a/arch/arm64/mm/gcs.c
+++ b/arch/arm64/mm/gcs.c
@@ -43,8 +43,23 @@ int gcs_alloc_thread_stack(struct task_struct *tsk,
 {
 	unsigned long addr, size;
 
-	if (!system_supports_gcs())
+	if (!system_supports_gcs()) {
+		if (args->shadow_stack_token)
+			return -EINVAL;
+
 		return 0;
+	}
+
+	/*
+	 * If the user specified a GCS then use it, otherwise fall
+	 * back to a default allocation strategy. Validation is done
+	 * in arch_shstk_validate_clone().
+	 */
+	if (args->shadow_stack_token) {
+		tsk->thread.gcs_base = 0;
+		tsk->thread.gcs_size = 0;
+		return 0;
+	}
 
 	if (!task_gcs_el0_enabled(tsk))
 		return 0;
@@ -68,6 +83,36 @@ int gcs_alloc_thread_stack(struct task_struct *tsk,
 	return 0;
 }
 
+static bool gcs_consume_token(struct vm_area_struct *vma, struct page *page,
+			      unsigned long user_addr)
+{
+	u64 expected = GCS_CAP(user_addr);
+	u64 *token = page_address(page) + offset_in_page(user_addr);
+
+	if (!cmpxchg_to_user_page(vma, page, user_addr, token, expected, 0))
+		return false;
+	set_page_dirty_lock(page);
+
+	return true;
+}
+
+int arch_shstk_validate_clone(struct task_struct *tsk,
+			      struct vm_area_struct *vma,
+			      struct page *page,
+			      struct kernel_clone_args *args)
+{
+	unsigned long gcspr_el0;
+	int ret = 0;
+
+	gcspr_el0 = args->shadow_stack_token;
+	if (!gcs_consume_token(vma, page, gcspr_el0))
+		return -EINVAL;
+
+	tsk->thread.gcspr_el0 = gcspr_el0 + sizeof(u64);
+
+	return ret;
+}
+
 SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
 {
 	unsigned long alloc_size;
diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index ba6f2fe43848..827e983430aa 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -6,6 +6,7 @@
 #include <linux/types.h>
 
 struct task_struct;
+struct kernel_clone_args;
 struct ksignal;
 
 #ifdef CONFIG_X86_USER_SHADOW_STACK
@@ -16,8 +17,8 @@ struct thread_shstk {
 
 long shstk_prctl(struct task_struct *task, int option, unsigned long arg2);
 void reset_thread_features(void);
-unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
-				       unsigned long stack_size);
+unsigned long shstk_alloc_thread_stack(struct task_struct *p,
+				       const struct kernel_clone_args *args);
 void shstk_free(struct task_struct *p);
 int setup_signal_shadow_stack(struct ksignal *ksig);
 int restore_signal_shadow_stack(void);
@@ -28,8 +29,10 @@ static inline long shstk_prctl(struct task_struct *task, int option,
 			       unsigned long arg2) { return -EINVAL; }
 static inline void reset_thread_features(void) {}
 static inline unsigned long shstk_alloc_thread_stack(struct task_struct *p,
-						     unsigned long clone_flags,
-						     unsigned long stack_size) { return 0; }
+						     const struct kernel_clone_args *args)
+{
+	return 0;
+}
 static inline void shstk_free(struct task_struct *p) {}
 static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
 static inline int restore_signal_shadow_stack(void) { return 0; }
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 1b7960cf6eb0..0a54af6c60df 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -209,7 +209,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	 * is disabled, new_ssp will remain 0, and fpu_clone() will know not to
 	 * update it.
 	 */
-	new_ssp = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
+	new_ssp = shstk_alloc_thread_stack(p, args);
 	if (IS_ERR_VALUE(new_ssp))
 		return PTR_ERR((void *)new_ssp);
 
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 2ddf23387c7e..9926d58e5d41 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -191,18 +191,61 @@ void reset_thread_features(void)
 	current->thread.features_locked = 0;
 }
 
-unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
-				       unsigned long stack_size)
+int arch_shstk_validate_clone(struct task_struct *t,
+			      struct vm_area_struct *vma,
+			      struct page *page,
+			      struct kernel_clone_args *args)
+{
+	void *maddr = page_address(page);
+	unsigned long token;
+	int offset;
+	u64 expected;
+
+	/*
+	 * kernel_clone_args() verification assures token address is 8
+	 * byte aligned.
+	 */
+	token = args->shadow_stack_token;
+	expected = (token + SS_FRAME_SIZE) | BIT(0);
+	offset = offset_in_page(token);
+
+	if (!cmpxchg_to_user_page(vma, page, token, (unsigned long *)(maddr + offset),
+				  expected, 0))
+		return -EINVAL;
+	set_page_dirty_lock(page);
+
+	return 0;
+}
+
+unsigned long shstk_alloc_thread_stack(struct task_struct *tsk,
+				       const struct kernel_clone_args *args)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
+	unsigned long clone_flags = args->flags;
 	unsigned long addr, size;
 
 	/*
 	 * If shadow stack is not enabled on the new thread, skip any
-	 * switch to a new shadow stack.
+	 * implicit switch to a new shadow stack and reject attempts to
+	 * explicitly specify one.
 	 */
-	if (!features_enabled(ARCH_SHSTK_SHSTK))
+	if (!features_enabled(ARCH_SHSTK_SHSTK)) {
+		if (args->shadow_stack_token)
+			return (unsigned long)ERR_PTR(-EINVAL);
+
 		return 0;
+	}
+
+	/*
+	 * If the user specified a shadow stack then use it, otherwise
+	 * fall back to a default allocation strategy. Validation is
+	 * done in arch_shstk_validate_clone().
+	 */
+	if (args->shadow_stack_token) {
+		shstk->base = 0;
+		shstk->size = 0;
+		return args->shadow_stack_token + 8;
+	}
 
 	/*
 	 * For CLONE_VFORK the child will share the parents shadow stack.
@@ -222,7 +265,7 @@ unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long cl
 	if (!(clone_flags & CLONE_VM))
 		return 0;
 
-	size = adjust_shstk_size(stack_size);
+	size = adjust_shstk_size(args->stack_size);
 	addr = alloc_shstk(0, size, 0, false);
 	if (IS_ERR_VALUE(addr))
 		return addr;
diff --git a/include/asm-generic/cacheflush.h b/include/asm-generic/cacheflush.h
index 7ee8a179d103..96cc0c7a5c90 100644
--- a/include/asm-generic/cacheflush.h
+++ b/include/asm-generic/cacheflush.h
@@ -124,4 +124,15 @@ static inline void flush_cache_vunmap(unsigned long start, unsigned long end)
 	} while (0)
 #endif
 
+#ifndef cmpxchg_to_user_page
+#define cmpxchg_to_user_page(vma, page, vaddr, ptr, old, new)  \
+({							  \
+	bool ret;						  \
+								  \
+	ret = try_cmpxchg(ptr, &old, new);			  \
+	flush_icache_user_page(vma, page, vaddr, sizeof(*ptr));	  \
+	ret;							  \
+})
+#endif
+
 #endif /* _ASM_GENERIC_CACHEFLUSH_H */
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ea41795a352b..b501f752fc9a 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -16,6 +16,7 @@ struct task_struct;
 struct rusage;
 union thread_union;
 struct css_set;
+struct vm_area_struct;
 
 /* All the bits taken by the old clone syscall. */
 #define CLONE_LEGACY_FLAGS 0xffffffffULL
@@ -44,6 +45,7 @@ struct kernel_clone_args {
 	struct cgroup *cgrp;
 	struct css_set *cset;
 	unsigned int kill_seq;
+	unsigned long shadow_stack_token;
 };
 
 /*
@@ -226,4 +228,19 @@ static inline void task_unlock(struct task_struct *p)
 
 DEFINE_GUARD(task_lock, struct task_struct *, task_lock(_T), task_unlock(_T))
 
+#ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK
+int arch_shstk_validate_clone(struct task_struct *p,
+			      struct vm_area_struct *vma,
+			      struct page *page,
+			      struct kernel_clone_args *args);
+#else
+static inline int arch_shstk_validate_clone(struct task_struct *p,
+					    struct vm_area_struct *vma,
+					    struct page *page,
+					    struct kernel_clone_args *args)
+{
+	return 0;
+}
+#endif
+
 #endif /* _LINUX_SCHED_TASK_H */
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 359a14cc76a4..9cf5c419e109 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -84,6 +84,7 @@
  *                kernel's limit of nested PID namespaces.
  * @cgroup:       If CLONE_INTO_CGROUP is specified set this to
  *                a file descriptor for the cgroup.
+ * @shadow_stack_token: Pointer to shadow stack token at top of stack.
  *
  * The structure is versioned by size and thus extensible.
  * New struct members must go at the end of the struct and
@@ -101,12 +102,14 @@ struct clone_args {
 	__aligned_u64 set_tid;
 	__aligned_u64 set_tid_size;
 	__aligned_u64 cgroup;
+	__aligned_u64 shadow_stack_token;
 };
 #endif
 
-#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
-#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
-#define CLONE_ARGS_SIZE_VER2 88 /* sizeof third published struct */
+#define CLONE_ARGS_SIZE_VER0  64 /* sizeof first published struct */
+#define CLONE_ARGS_SIZE_VER1  80 /* sizeof second published struct */
+#define CLONE_ARGS_SIZE_VER2  88 /* sizeof third published struct */
+#define CLONE_ARGS_SIZE_VER3  96 /* sizeof fourth published struct */
 
 /*
  * Scheduling policies
diff --git a/kernel/fork.c b/kernel/fork.c
index af673856499d..d484ebeded33 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1907,6 +1907,51 @@ static bool need_futex_hash_allocate_default(u64 clone_flags)
 	return true;
 }
 
+static int shstk_validate_clone(struct task_struct *p,
+				struct kernel_clone_args *args)
+{
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	struct page *page;
+	unsigned long addr;
+	int ret;
+
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK))
+		return 0;
+
+	if (!args->shadow_stack_token)
+		return 0;
+
+	mm = get_task_mm(p);
+	if (!mm)
+		return -EFAULT;
+
+	mmap_read_lock(mm);
+
+	addr = untagged_addr_remote(mm, args->shadow_stack_token);
+	page = get_user_page_vma_remote(mm, addr, FOLL_FORCE | FOLL_WRITE,
+					&vma);
+	if (IS_ERR(page)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	if (!(vma->vm_flags & VM_SHADOW_STACK) ||
+	    !(vma->vm_flags & VM_WRITE)) {
+		ret = -EFAULT;
+		goto out_page;
+	}
+
+	ret = arch_shstk_validate_clone(p, vma, page, args);
+
+out_page:
+	put_page(page);
+out:
+	mmap_read_unlock(mm);
+	mmput(mm);
+	return ret;
+}
+
 /*
  * This creates a new process as a copy of the old one,
  * but does not actually start it yet.
@@ -2182,6 +2227,9 @@ __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_namespaces;
 	retval = copy_thread(p, args);
+	if (retval)
+		goto bad_fork_cleanup_io;
+	retval = shstk_validate_clone(p, args);
 	if (retval)
 		goto bad_fork_cleanup_io;
 
@@ -2763,7 +2811,9 @@ static noinline int copy_clone_args_from_user(struct kernel_clone_args *kargs,
 		     CLONE_ARGS_SIZE_VER1);
 	BUILD_BUG_ON(offsetofend(struct clone_args, cgroup) !=
 		     CLONE_ARGS_SIZE_VER2);
-	BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER2);
+	BUILD_BUG_ON(offsetofend(struct clone_args, shadow_stack_token) !=
+		     CLONE_ARGS_SIZE_VER3);
+	BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER3);
 
 	if (unlikely(usize > PAGE_SIZE))
 		return -E2BIG;
@@ -2796,16 +2846,17 @@ static noinline int copy_clone_args_from_user(struct kernel_clone_args *kargs,
 		return -EINVAL;
 
 	*kargs = (struct kernel_clone_args){
-		.flags		= args.flags,
-		.pidfd		= u64_to_user_ptr(args.pidfd),
-		.child_tid	= u64_to_user_ptr(args.child_tid),
-		.parent_tid	= u64_to_user_ptr(args.parent_tid),
-		.exit_signal	= args.exit_signal,
-		.stack		= args.stack,
-		.stack_size	= args.stack_size,
-		.tls		= args.tls,
-		.set_tid_size	= args.set_tid_size,
-		.cgroup		= args.cgroup,
+		.flags			= args.flags,
+		.pidfd			= u64_to_user_ptr(args.pidfd),
+		.child_tid		= u64_to_user_ptr(args.child_tid),
+		.parent_tid		= u64_to_user_ptr(args.parent_tid),
+		.exit_signal		= args.exit_signal,
+		.stack			= args.stack,
+		.stack_size		= args.stack_size,
+		.tls			= args.tls,
+		.set_tid_size		= args.set_tid_size,
+		.cgroup			= args.cgroup,
+		.shadow_stack_token	= args.shadow_stack_token,
 	};
 
 	if (args.set_tid &&
@@ -2846,6 +2897,24 @@ static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
 	return true;
 }
 
+/**
+ * clone3_shadow_stack_valid - check and prepare shadow stack
+ * @kargs: kernel clone args
+ *
+ * Verify that shadow stacks are only enabled if supported.
+ */
+static inline bool clone3_shadow_stack_valid(struct kernel_clone_args *kargs)
+{
+	if (!kargs->shadow_stack_token)
+		return true;
+
+	if (!IS_ALIGNED(kargs->shadow_stack_token, sizeof(void *)))
+		return false;
+
+	/* Fail if the kernel wasn't built with shadow stacks */
+	return IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK);
+}
+
 static bool clone3_args_valid(struct kernel_clone_args *kargs)
 {
 	/* Verify that no unknown flags are passed along. */
@@ -2868,7 +2937,7 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
 	    kargs->exit_signal)
 		return false;
 
-	if (!clone3_stack_valid(kargs))
+	if (!clone3_stack_valid(kargs) || !clone3_shadow_stack_valid(kargs))
 		return false;
 
 	return true;

-- 
2.39.5



* [PATCH v20 3/8] selftests: Provide helper header for shadow stack testing
From: Mark Brown @ 2025-09-02 10:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook, Kees Cook, Shuah Khan
In-Reply-To: <20250902-clone3-shadow-stack-v20-0-4d9fff1c53e7@kernel.org>

While almost all users of shadow stacks should be relying on the dynamic
linker and libc to enable the feature there are several low level test
programs where it is useful to enable without any libc support, allowing
testing without full system enablement. This low level testing is helpful
during bringup of the support itself, and also in enabling coverage by
automated testing without needing all system components in the target root
filesystems to have enablement.

Provide a header with helpers for this purpose, intended for use only by
test programs directly exercising shadow stack interfaces.

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Tested-by: Kees Cook <kees@kernel.org>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 tools/testing/selftests/ksft_shstk.h | 98 ++++++++++++++++++++++++++++++++++++
 1 file changed, 98 insertions(+)

diff --git a/tools/testing/selftests/ksft_shstk.h b/tools/testing/selftests/ksft_shstk.h
new file mode 100644
index 000000000000..fecf91218ea5
--- /dev/null
+++ b/tools/testing/selftests/ksft_shstk.h
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Helpers for shadow stack enablement, this is intended to only be
+ * used by low level test programs directly exercising interfaces for
+ * working with shadow stacks.
+ *
+ * Copyright (C) 2024 ARM Ltd.
+ */
+
+#ifndef __KSFT_SHSTK_H
+#define __KSFT_SHSTK_H
+
+#include <asm/mman.h>
+
+/* This is currently only defined for x86 */
+#ifndef SHADOW_STACK_SET_TOKEN
+#define SHADOW_STACK_SET_TOKEN (1ULL << 0)
+#endif
+
+static bool shadow_stack_enabled;
+
+#ifdef __x86_64__
+#define ARCH_SHSTK_ENABLE	0x5001
+#define ARCH_SHSTK_SHSTK	(1ULL <<  0)
+
+#define ARCH_PRCTL(arg1, arg2)					\
+({								\
+	long _ret;						\
+	register long _num  asm("eax") = __NR_arch_prctl;	\
+	register long _arg1 asm("rdi") = (long)(arg1);		\
+	register long _arg2 asm("rsi") = (long)(arg2);		\
+								\
+	asm volatile (						\
+		"syscall\n"					\
+		: "=a"(_ret)					\
+		: "r"(_arg1), "r"(_arg2),			\
+		  "0"(_num)					\
+		: "rcx", "r11", "memory", "cc"			\
+	);							\
+	_ret;							\
+})
+
+#define ENABLE_SHADOW_STACK
+static __always_inline void enable_shadow_stack(void)
+{
+	int ret = ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK);
+	if (ret == 0)
+		shadow_stack_enabled = true;
+}
+
+#endif
+
+#ifdef __aarch64__
+#define PR_SET_SHADOW_STACK_STATUS      75
+# define PR_SHADOW_STACK_ENABLE         (1UL << 0)
+
+#define my_syscall2(num, arg1, arg2)                                          \
+({                                                                            \
+	register long _num  __asm__ ("x8") = (num);                           \
+	register long _arg1 __asm__ ("x0") = (long)(arg1);                    \
+	register long _arg2 __asm__ ("x1") = (long)(arg2);                    \
+	register long _arg3 __asm__ ("x2") = 0;                               \
+	register long _arg4 __asm__ ("x3") = 0;                               \
+	register long _arg5 __asm__ ("x4") = 0;                               \
+									      \
+	__asm__  volatile (                                                   \
+		"svc #0\n"                                                    \
+		: "=r"(_arg1)                                                 \
+		: "r"(_arg1), "r"(_arg2),                                     \
+		  "r"(_arg3), "r"(_arg4),                                     \
+		  "r"(_arg5), "r"(_num)					      \
+		: "memory", "cc"                                              \
+	);                                                                    \
+	_arg1;                                                                \
+})
+
+#define ENABLE_SHADOW_STACK
+static __always_inline void enable_shadow_stack(void)
+{
+	int ret;
+
+	ret = my_syscall2(__NR_prctl, PR_SET_SHADOW_STACK_STATUS,
+			  PR_SHADOW_STACK_ENABLE);
+	if (ret == 0)
+		shadow_stack_enabled = true;
+}
+
+#endif
+
+#ifndef __NR_map_shadow_stack
+#define __NR_map_shadow_stack 453
+#endif
+
+#ifndef ENABLE_SHADOW_STACK
+static inline void enable_shadow_stack(void) { }
+#endif
+
+#endif

-- 
2.39.5



* [PATCH v20 2/8] Documentation: userspace-api: Add shadow stack API documentation
From: Mark Brown @ 2025-09-02 10:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook, Kees Cook, Shuah Khan
In-Reply-To: <20250902-clone3-shadow-stack-v20-0-4d9fff1c53e7@kernel.org>

There are a number of architectures with shadow stack features which we are
presenting to userspace with as consistent an API as we can (though there
are some architecture specifics). Especially given that there are some
important considerations for userspace code interacting directly with the
feature let's provide some documentation covering the common aspects.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Tested-by: Kees Cook <kees@kernel.org>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
Reviewed-by: Deepak Gupta <debug@rivosinc.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 Documentation/userspace-api/index.rst        |  1 +
 Documentation/userspace-api/shadow_stack.rst | 44 ++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+)

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index b8c73be4fb11..0167e59b541e 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -62,6 +62,7 @@ Everything else
 
    ELF
    netlink/index
+   shadow_stack
    sysfs-platform_profile
    vduse
    futex2
diff --git a/Documentation/userspace-api/shadow_stack.rst b/Documentation/userspace-api/shadow_stack.rst
new file mode 100644
index 000000000000..42617d0470ba
--- /dev/null
+++ b/Documentation/userspace-api/shadow_stack.rst
@@ -0,0 +1,44 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Shadow Stacks
+=============
+
+Introduction
+============
+
+Several architectures have features which provide backward edge
+control flow protection through a hardware maintained stack, only
+writable by userspace through very limited operations.  This feature
+is referred to as shadow stacks on Linux, on x86 it is part of Intel
+Control Enforcement Technology (CET), on arm64 it is Guarded Control
+Stacks feature (FEAT_GCS) and for RISC-V it is the Zicfiss extension.
+It is expected that this feature will normally be managed by the
+system dynamic linker and libc in ways broadly transparent to
+application code, this document covers interfaces and considerations.
+
+
+Enabling
+========
+
+Shadow stacks default to disabled when a userspace process is
+executed, they can be enabled for the current thread with a syscall:
+
+ - For x86 the ARCH_SHSTK_ENABLE arch_prctl()
+ - For other architectures the PR_SET_SHADOW_STACK_STATUS prctl()
+
+It is expected that this will normally be done by the dynamic linker.
+Any new threads created by a thread with shadow stacks enabled will
+themselves have shadow stacks enabled.
+
+
+Enablement considerations
+=========================
+
+- Returning from the function that enables shadow stacks without first
+  disabling them will cause a shadow stack exception.  This includes
+  any syscall wrapper or other library functions, the syscall will need
+  to be inlined.
+- A lock feature allows userspace to prevent disabling of shadow stacks.
+- Those that change the stack context like longjmp() or use of ucontext
+  changes on signal return will need support from libc.

-- 
2.39.5



* [PATCH v20 1/8] arm64/gcs: Return a success value from gcs_alloc_thread_stack()
From: Mark Brown @ 2025-09-02 10:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook
In-Reply-To: <20250902-clone3-shadow-stack-v20-0-4d9fff1c53e7@kernel.org>

Currently as a result of templating from x86 code gcs_alloc_thread_stack()
returns a pointer as an unsigned long however on arm64 we don't actually use
this pointer value as anything other than a pass/fail flag. Simplify the
interface to just return an int with 0 on success and a negative error code
on failure.

Acked-by: Deepak Gupta <debug@rivosinc.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 arch/arm64/include/asm/gcs.h | 8 ++++----
 arch/arm64/kernel/process.c  | 8 ++++----
 arch/arm64/mm/gcs.c          | 8 ++++----
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/gcs.h b/arch/arm64/include/asm/gcs.h
index 5bc432234d3a..b4bbec9382a1 100644
--- a/arch/arm64/include/asm/gcs.h
+++ b/arch/arm64/include/asm/gcs.h
@@ -64,8 +64,8 @@ static inline bool task_gcs_el0_enabled(struct task_struct *task)
 void gcs_set_el0_mode(struct task_struct *task);
 void gcs_free(struct task_struct *task);
 void gcs_preserve_current_state(void);
-unsigned long gcs_alloc_thread_stack(struct task_struct *tsk,
-				     const struct kernel_clone_args *args);
+int gcs_alloc_thread_stack(struct task_struct *tsk,
+			   const struct kernel_clone_args *args);
 
 static inline int gcs_check_locked(struct task_struct *task,
 				   unsigned long new_val)
@@ -91,8 +91,8 @@ static inline bool task_gcs_el0_enabled(struct task_struct *task)
 static inline void gcs_set_el0_mode(struct task_struct *task) { }
 static inline void gcs_free(struct task_struct *task) { }
 static inline void gcs_preserve_current_state(void) { }
-static inline unsigned long gcs_alloc_thread_stack(struct task_struct *tsk,
-						   const struct kernel_clone_args *args)
+static inline int gcs_alloc_thread_stack(struct task_struct *tsk,
+					 const struct kernel_clone_args *args)
 {
 	return -ENOTSUPP;
 }
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 96482a1412c6..f0b1bea9c873 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -299,7 +299,7 @@ static void flush_gcs(void)
 static int copy_thread_gcs(struct task_struct *p,
 			   const struct kernel_clone_args *args)
 {
-	unsigned long gcs;
+	int ret;
 
 	if (!system_supports_gcs())
 		return 0;
@@ -310,9 +310,9 @@ static int copy_thread_gcs(struct task_struct *p,
 	p->thread.gcs_el0_mode = current->thread.gcs_el0_mode;
 	p->thread.gcs_el0_locked = current->thread.gcs_el0_locked;
 
-	gcs = gcs_alloc_thread_stack(p, args);
-	if (IS_ERR_VALUE(gcs))
-		return PTR_ERR((void *)gcs);
+	ret = gcs_alloc_thread_stack(p, args);
+	if (ret != 0)
+		return ret;
 
 	return 0;
 }
diff --git a/arch/arm64/mm/gcs.c b/arch/arm64/mm/gcs.c
index 6e93f78de79b..3abcbf9adb5c 100644
--- a/arch/arm64/mm/gcs.c
+++ b/arch/arm64/mm/gcs.c
@@ -38,8 +38,8 @@ static unsigned long gcs_size(unsigned long size)
 	return max(PAGE_SIZE, size);
 }
 
-unsigned long gcs_alloc_thread_stack(struct task_struct *tsk,
-				     const struct kernel_clone_args *args)
+int gcs_alloc_thread_stack(struct task_struct *tsk,
+			   const struct kernel_clone_args *args)
 {
 	unsigned long addr, size;
 
@@ -59,13 +59,13 @@ unsigned long gcs_alloc_thread_stack(struct task_struct *tsk,
 	size = gcs_size(size);
 	addr = alloc_gcs(0, size);
 	if (IS_ERR_VALUE(addr))
-		return addr;
+		return PTR_ERR((void *)addr);
 
 	tsk->thread.gcs_base = addr;
 	tsk->thread.gcs_size = size;
 	tsk->thread.gcspr_el0 = addr + size - sizeof(u64);
 
-	return addr;
+	return 0;
 }
 
 SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)

-- 
2.39.5



* [PATCH v20 0/8] fork: Support shadow stacks in clone3()
From: Mark Brown @ 2025-09-02 10:21 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, Wilco Dijkstra, linux-kselftest,
	linux-api, Mark Brown, Kees Cook, Kees Cook, Shuah Khan

[ I think at this point everyone is OK with the ABI, and the x86
  implementation has been tested so hopefully we are near to being
  able to get this merged?  If there are any outstanding issues let
  me know and I can look at addressing them.  The one possible issue
  I am aware of is that the RISC-V shadow stack support was briefly
  in -next but got dropped along with the general RISC-V issues during
  the last merge window, rebasing for that is still in progress.  I
  guess ideally this could be applied on a branch and then pulled into
  the RISC-V tree? ]

The kernel has recently added support for shadow stacks, currently
x86 only using their CET feature but both arm64 and RISC-V have
equivalent features (GCS and Zicfiss respectively), I am actively
working on GCS[1].  With shadow stacks the hardware maintains an
additional stack containing only the return addresses for branch
instructions which is not generally writeable by userspace and ensures
that any returns are to the recorded addresses.  This provides some
protection against ROP attacks and makes it easier to collect call
stacks.  These shadow stacks are allocated in the address space of the
userspace process.

Our API for shadow stacks does not currently offer userspace any
flexibility for managing the allocation of shadow stacks for newly
created threads, instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled.  The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread.  This lack of flexibility and control isn't ideal, in the vast
majority of cases the shadow stack will be over allocated and the
implicit allocation and deallocation is not consistent with other
interfaces.  As far as I can tell the interface is done in this manner
mainly because the shadow stack patches were in development since before
clone3() was implemented.

Since clone3() is readily extensible let's add support for specifying a
shadow stack when creating a new thread or process, keeping the current
implicit allocation behaviour if one is not specified either with
clone3() or through the use of clone().  The user must provide a shadow
stack pointer, this must point to memory mapped for use as a shadow
stack by map_shadow_stack() with an architecture specified shadow stack
token at the top of the stack.
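
As a rough illustration of the userspace side, the sketch below maps a
shadow stack with a token and fills in the extended clone3() arguments.
It is not taken from the series: the struct is a local mirror whose
layout is an assumption, the helper names are hypothetical, and the
token is assumed to sit in the last pointer-sized slot of the mapping,
which is architecture specific.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_map_shadow_stack
#define __NR_map_shadow_stack 453
#endif
#ifndef SHADOW_STACK_SET_TOKEN
#define SHADOW_STACK_SET_TOKEN (1ULL << 0)
#endif

/* Hypothetical local mirror of the extended struct clone_args (assumed layout). */
struct clone_args_shstk {
	uint64_t flags;
	uint64_t pidfd;
	uint64_t child_tid;
	uint64_t parent_tid;
	uint64_t exit_signal;
	uint64_t stack;
	uint64_t stack_size;
	uint64_t tls;
	uint64_t set_tid;
	uint64_t set_tid_size;
	uint64_t cgroup;
	uint64_t shadow_stack_token;	/* field added by this series */
};

/* Assumption: the token occupies the last pointer-sized slot of the mapping. */
static inline uint64_t shstk_token_addr(uint64_t base, uint64_t size)
{
	return base + size - sizeof(void *);
}

/*
 * Map a shadow stack of @size bytes with a token and fill @args for a
 * fork-like clone3().  Returns 0 on success, -1 with errno set if
 * map_shadow_stack() fails (for example without kernel or CPU support).
 */
static inline int prepare_shadow_stack_args(struct clone_args_shstk *args,
					    size_t size)
{
	long base = syscall(__NR_map_shadow_stack, 0UL, (unsigned long)size,
			    (unsigned long)SHADOW_STACK_SET_TOKEN);

	if (base == -1)
		return -1;

	memset(args, 0, sizeof(*args));
	args->exit_signal = SIGCHLD;
	args->shadow_stack_token = shstk_token_addr((uint64_t)base, size);
	return 0;
}
```

The clone3() call itself is deliberately left out: with shadow stacks
enabled the child starts on an empty shadow stack, so the syscall would
need to be issued inline and the child must not return through frames
recorded on the parent's shadow stack.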

Yury Khrustalev has raised questions from the libc side regarding
discoverability of extended clone3() structure sizes[2], this seems like
a general issue with clone3().  There was a suggestion to add a hwcap on
arm64 which isn't ideal but is doable there, though architecture
specific mechanisms would also be needed for x86 (and RISC-V if its
support gets merged before this does).  The idea has, however, had
strong pushback from the architecture maintainers and it is possible to
detect support for this in clone3() by attempting a call with a
misaligned shadow stack pointer specified so no hwcap has been added.
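
A minimal sketch of such a probe follows, under stated assumptions: the
struct is a local mirror of the extended arguments, the enum and helper
names are hypothetical, and the errno interpretation (EINVAL when the
field is recognised and rejected as misaligned, E2BIG when the kernel
does not know the larger structure) is inferred from the validation
code rather than documented behaviour.  EINVAL only says the kernel
understands the field, not that the hardware supports shadow stacks.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef __NR_clone3
#define __NR_clone3 435
#endif

/* Hypothetical local mirror of the extended struct clone_args. */
struct clone_args_probe {
	uint64_t flags, pidfd, child_tid, parent_tid;
	uint64_t exit_signal, stack, stack_size, tls;
	uint64_t set_tid, set_tid_size, cgroup;
	uint64_t shadow_stack_token;
};

enum shstk_probe {
	SHSTK_FIELD_KNOWN,		/* kernel parsed and rejected the token */
	SHSTK_FIELD_UNKNOWN,		/* kernel rejected the larger struct */
	SHSTK_PROBE_INCONCLUSIVE,	/* anything else, e.g. ENOSYS or EPERM */
};

/* Map the errno from the probing clone3() call to a result (assumed mapping). */
static inline enum shstk_probe classify_probe_errno(int err)
{
	switch (err) {
	case EINVAL:
		return SHSTK_FIELD_KNOWN;
	case E2BIG:
		return SHSTK_FIELD_UNKNOWN;
	default:
		return SHSTK_PROBE_INCONCLUSIVE;
	}
}

static inline enum shstk_probe probe_clone3_shadow_stack(void)
{
	struct clone_args_probe args;
	long ret;

	memset(&args, 0, sizeof(args));
	args.exit_signal = SIGCHLD;
	args.shadow_stack_token = 1;	/* deliberately misaligned */

	ret = syscall(__NR_clone3, &args, sizeof(args));
	if (ret == 0)
		_exit(0);		/* unexpected success: we are the child */
	if (ret > 0) {
		waitpid((pid_t)ret, NULL, 0);	/* reap the unexpected child */
		return SHSTK_PROBE_INCONCLUSIVE;
	}
	return classify_probe_errno(errno);
}
```

The probe is expected to fail before any task is created in both the
supported and unsupported cases; the success branches are only there
defensively.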

[1] https://lore.kernel.org/linux-arm-kernel/20241001-arm64-gcs-v13-0-222b78d87eee@kernel.org/T/#mc58f97f27461749ccf400ebabf6f9f937116a86b
[2] https://lore.kernel.org/r/aCs65ccRQtJBnZ_5@arm.com

Signed-off-by: Mark Brown <broonie@kernel.org>
---
Changes in v20:
- Comment fixes and clarifications in x86 arch_shstk_validate_clone()
  from Rick Edgecombe.
- Spelling fix in documentation.
- Link to v19: https://lore.kernel.org/r/20250819-clone3-shadow-stack-v19-0-bc957075479b@kernel.org

Changes in v19:
- Rebase onto v6.17-rc1.
- Link to v18: https://lore.kernel.org/r/20250702-clone3-shadow-stack-v18-0-7965d2b694db@kernel.org

Changes in v18:
- Rebase onto v6.16-rc3.
- Thanks to pointers from Yury Khrustalev this version has been tested
  on x86 so I have removed the RFT tag.
- Clarify clone3_shadow_stack_valid() comment about the Kconfig check.
- Remove redundant GCSB DSYNCs in arm64 code.
- Fix token validation on x86.
- Link to v17: https://lore.kernel.org/r/20250609-clone3-shadow-stack-v17-0-8840ed97ff6f@kernel.org

Changes in v17:
- Rebase onto v6.16-rc1.
- Link to v16: https://lore.kernel.org/r/20250416-clone3-shadow-stack-v16-0-2ffc9ca3917b@kernel.org

Changes in v16:
- Rebase onto v6.15-rc2.
- Roll in fixes from x86 testing from Rick Edgecombe.
- Rework so that the argument is shadow_stack_token.
- Link to v15: https://lore.kernel.org/r/20250408-clone3-shadow-stack-v15-0-3fa245c6e3be@kernel.org

Changes in v15:
- Rebase onto v6.15-rc1.
- Link to v14: https://lore.kernel.org/r/20250206-clone3-shadow-stack-v14-0-805b53af73b9@kernel.org

Changes in v14:
- Rebase onto v6.14-rc1.
- Link to v13: https://lore.kernel.org/r/20241203-clone3-shadow-stack-v13-0-93b89a81a5ed@kernel.org

Changes in v13:
- Rebase onto v6.13-rc1.
- Link to v12: https://lore.kernel.org/r/20241031-clone3-shadow-stack-v12-0-7183eb8bee17@kernel.org

Changes in v12:
- Add the regular prctl() to the userspace API document since arm64
  support is queued in -next.
- Link to v11: https://lore.kernel.org/r/20241005-clone3-shadow-stack-v11-0-2a6a2bd6d651@kernel.org

Changes in v11:
- Rebase onto arm64 for-next/gcs, which is based on v6.12-rc1, and
  integrate arm64 support.
- Rework the interface to specify a shadow stack pointer rather than a
  base and size like we do for the regular stack.
- Link to v10: https://lore.kernel.org/r/20240821-clone3-shadow-stack-v10-0-06e8797b9445@kernel.org

Changes in v10:
- Integrate fixes & improvements for the x86 implementation from Rick
  Edgecombe.
- Require that the shadow stack be VM_WRITE.
- Require that the shadow stack base and size be sizeof(void *) aligned.
- Clean up trailing newline.
- Link to v9: https://lore.kernel.org/r/20240819-clone3-shadow-stack-v9-0-962d74f99464@kernel.org

Changes in v9:
- Pull token validation earlier and report problems with an error return
  to parent rather than signal delivery to the child.
- Verify that the top of the supplied shadow stack is VM_SHADOW_STACK.
- Rework token validation to only do the page mapping once.
- Drop no longer needed support for testing for signals in selftest.
- Fix typo in comments.
- Link to v8: https://lore.kernel.org/r/20240808-clone3-shadow-stack-v8-0-0acf37caf14c@kernel.org

Changes in v8:
- Fix token verification with user specified shadow stack.
- Don't track user managed shadow stacks for child processes.
- Link to v7: https://lore.kernel.org/r/20240731-clone3-shadow-stack-v7-0-a9532eebfb1d@kernel.org

Changes in v7:
- Rebase onto v6.11-rc1.
- Typo fixes.
- Link to v6: https://lore.kernel.org/r/20240623-clone3-shadow-stack-v6-0-9ee7783b1fb9@kernel.org

Changes in v6:
- Rebase onto v6.10-rc3.
- Ensure we don't try to free the parent shadow stack in error paths of
  x86 arch code.
- Spelling fixes in userspace API document.
- Additional cleanups and improvements to the clone3() tests to support
  the shadow stack tests.
- Link to v5: https://lore.kernel.org/r/20240203-clone3-shadow-stack-v5-0-322c69598e4b@kernel.org

Changes in v5:
- Rebase onto v6.8-rc2.
- Rework ABI to have the user allocate the shadow stack memory with
  map_shadow_stack() and a token.
- Force inlining of the x86 shadow stack enablement.
- Move shadow stack enablement out into a shared header for reuse by
  other tests.
- Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@kernel.org

Changes in v4:
- Formatting changes.
- Use a define for minimum shadow stack size and move some basic
  validation to fork.c.
- Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@kernel.org

Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of
  CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@kernel.org

Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the
  desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@kernel.org

---
Mark Brown (8):
      arm64/gcs: Return a success value from gcs_alloc_thread_stack()
      Documentation: userspace-api: Add shadow stack API documentation
      selftests: Provide helper header for shadow stack testing
      fork: Add shadow stack support to clone3()
      selftests/clone3: Remove redundant flushes of output streams
      selftests/clone3: Factor more of main loop into test_clone3()
      selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
      selftests/clone3: Test shadow stack support

 Documentation/userspace-api/index.rst             |   1 +
 Documentation/userspace-api/shadow_stack.rst      |  44 +++++
 arch/arm64/include/asm/gcs.h                      |   8 +-
 arch/arm64/kernel/process.c                       |   8 +-
 arch/arm64/mm/gcs.c                               |  55 +++++-
 arch/x86/include/asm/shstk.h                      |  11 +-
 arch/x86/kernel/process.c                         |   2 +-
 arch/x86/kernel/shstk.c                           |  53 ++++-
 include/asm-generic/cacheflush.h                  |  11 ++
 include/linux/sched/task.h                        |  17 ++
 include/uapi/linux/sched.h                        |   9 +-
 kernel/fork.c                                     |  93 +++++++--
 tools/testing/selftests/clone3/clone3.c           | 226 ++++++++++++++++++----
 tools/testing/selftests/clone3/clone3_selftests.h |  65 ++++++-
 tools/testing/selftests/ksft_shstk.h              |  98 ++++++++++
 15 files changed, 620 insertions(+), 81 deletions(-)
---
base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
change-id: 20231019-clone3-shadow-stack-15d40d2bf536

Best regards,
--  
Mark Brown <broonie@kernel.org>



* Re: [PATCH v4 0/4] procfs: make reference pidns more user-visible
From: Christian Brauner @ 2025-09-02 10:02 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alexander Viro, Jan Kara, Jonathan Corbet, Shuah Khan,
	Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest
In-Reply-To: <20250805-procfs-pidns-api-v4-0-705f984940e7@cyphar.com>

On Tue, Aug 05, 2025 at 03:45:07PM +1000, Aleksa Sarai wrote:
> Ever since the introduction of pid namespaces, procfs has had very
> implicit behaviour surrounding them (the pidns used by a procfs mount is
> auto-selected based on the mounting process's active pidns, and the
> pidns itself is basically hidden once the mount has been constructed).
> 
> /* pidns mount option for procfs */
> 
> This implicit behaviour has historically meant that userspace was
> required to do some special dances in order to configure the pidns of a
> procfs mount as desired. Examples include:
> 
>  * In order to bypass the mnt_too_revealing() check, Kubernetes creates
>    a procfs mount from an empty pidns so that user namespaced containers
>    can be nested (without this, the nested containers would fail to
>    mount procfs). But this requires forking off a helper process because
>    you cannot just one-shot this using mount(2).
> 
>  * Container runtimes in general need to fork into a container before
>    configuring its mounts, which can lead to security issues in the case
>    of shared-pidns containers (a privileged process in the pidns can
>    interact with your container runtime process). While
>    SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
>    strict need for this due to a minor uAPI wart is kind of unfortunate.
> 
> Things would be much easier if there was a way for userspace to just
> specify the pidns they want. Patch 1 implements a new "pidns" argument
> which can be set using fsconfig(2):
> 
>     fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
>     fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
> 
> or classic mount(2) / mount(8):
> 
>     // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
>     mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
> 
> The initial security model I have in this RFC is to be as conservative
> as possible and just mirror the security model for setns(2) -- which
> means that you can only set pidns=... to pid namespaces that your
> current pid namespace is a direct ancestor of and you have CAP_SYS_ADMIN
> privileges over the pid namespace. This fulfils the requirements of
> container runtimes, but I suspect that this may be too strict for some
> usecases.
> 
> The pidns argument is not displayed in mountinfo -- it's not clear to me
> what value it would make sense to show (maybe we could just use ns_dname
> to provide an identifier for the namespace, but this number would be
> fairly useless to userspace). I'm open to suggestions. Note that
> PROCFS_GET_PID_NAMESPACE (see below) does at least let userspace get
> information about this outside of mountinfo.
> 
> Note that you cannot change the pidns of an already-created procfs
> instance. The primary reason is that allowing this to be changed would
> require RCU-protecting proc_pid_ns(sb) and thus auditing all of
> fs/proc/* and some of the users in fs/* to make sure they wouldn't UAF
> the pid namespace. Since creating procfs instances is very cheap, it
> seems unnecessary to overcomplicate this upfront. Trying to reconfigure
> procfs this way errors out with -EBUSY.
> 
> /* ioctl(PROCFS_GET_PID_NAMESPACE) */
> 
> In addition, being able to figure out what pid namespace is being used
> by a procfs mount is quite useful when you have an administrative
> process (such as a container runtime) which wants to figure out the
> correct way of mapping PIDs between its own namespace and the namespace
> for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
> alternative ways to do this, but they all rely on ancillary information
> that third-party libraries and tools do not necessarily have access to.
> 
> To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
> can be used to get a reference to the pidns that a procfs is using.
> 
> Rather than copying the (fairly strict) security model for setns(2),
> apply a slightly looser model to better match what userspace can already
> do:
> 
>  * Make the ioctl only valid on the root (meaning that a process without
>    access to the procfs root -- such as only having an fd to a procfs
>    file or some open_tree(2)-like subset -- cannot use this API). This
>    means that the process already has some level of access to the
>    /proc/$pid directories.
> 
>  * If the calling process is in an ancestor pidns, then they can already
>    create pidfd for processes inside the pidns, which is morally
>    equivalent to a pidns file descriptor according to setns(2). So it
>    seems reasonable to just allow it in this case. (The justification
>    for this model was suggested by Christian.)
> 
>  * If the process has access to /proc/1/ns/pid already (i.e. has
>    ptrace-read access to the pidns pid1), then this ioctl is equivalent
>    to just opening a handle to it that way.
> 
>    Ideally we would check for ptrace-read access against all processes
>    in the pidns (which is very likely to be true for at least one
>    process, as SUID_DUMP_DISABLE is cleared on exec(2) and is rarely set
>    by most programs), but this would obviously not scale.
> 
> I'm open to suggestions for whether we need to make this stricter (or
> possibly allow more cases).
> 
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Thanks for the patchset. Being able to specify what pid namespace the
procfs instance is supposed to belong to is super useful and will make
things easier for userspace for sure.

The code you added contains a minor wrinkle that I disliked which I've
changed and you tell me if you can live with this restriction or not.

The way you've implemented it specifying a pid namespace that the caller
holds privilege over would silently also override the user namespace the
filesystem is supposed to belong to.

Specifically, you did something like:

        put_pid_ns(ctx->pid_ns);
        ctx->pid_ns = get_pid_ns(target);
        put_user_ns(fc->user_ns);
        fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);

This silently overrides the user namespace recorded at fsopen() time. I
think that's too subtle and we should just not allow that at all for
now.

Instead I've changed this to:

        if (fc->user_ns != target->user_ns)
                return invalfc(fc, "owning user namespace of pid namespace doesn't match procfs user namespace");

        put_pid_ns(ctx->pid_ns);
        ctx->pid_ns = get_pid_ns(target);

so we just refuse different ownership.
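
For illustration only, the userspace side of this could look roughly
like the sketch below.  The helper name and the SKETCH_* constants are
hypothetical local copies of the uapi fsconfig command values, and raw
syscalls are used so that no particular libc wrapper is assumed.  With
the check above, a pidns whose owning user namespace differs from the
one recorded at fsopen() time would presumably make the "pidns"
fsconfig() call fail with EINVAL.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_fsopen
#define __NR_fsopen 430
#endif
#ifndef __NR_fsconfig
#define __NR_fsconfig 431
#endif

/* Local copies of the uapi fsconfig command values (hypothetical names). */
#define SKETCH_FSCONFIG_SET_FD		5
#define SKETCH_FSCONFIG_CMD_CREATE	6

/*
 * Create a procfs filesystem context bound to the pid namespace referred
 * to by @nsfd.  Returns the context fd (to be passed to fsmount() by the
 * caller) or -1 with errno set.
 */
static inline int procfs_context_for_pidns(int nsfd)
{
	int fsfd, saved;

	fsfd = (int)syscall(__NR_fsopen, "proc", 0U);
	if (fsfd < 0)
		return -1;

	if (syscall(__NR_fsconfig, fsfd, SKETCH_FSCONFIG_SET_FD,
		    "pidns", NULL, nsfd) < 0 ||
	    syscall(__NR_fsconfig, fsfd, SKETCH_FSCONFIG_CMD_CREATE,
		    NULL, NULL, 0) < 0) {
		saved = errno;
		close(fsfd);
		errno = saved;
		return -1;
	}

	return fsfd;
}
```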

I've also dropped the procfs ioctl because I'm not sure how much value
it will actually add given that you can do this via /proc/1/ns/pid.
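
A minimal sketch of that alternative, assuming the caller already holds
an fd for the root of the procfs instance and has ptrace-read access to
the init process of its pidns (the helper name is hypothetical):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Return an fd for the pid namespace used by the procfs instance whose
 * root directory is @procfd, by opening the namespace link of its pid 1.
 * Returns -1 with errno set on failure.
 */
static inline int pidns_fd_from_procfs_root(int procfd)
{
	return openat(procfd, "1/ns/pid", O_RDONLY | O_CLOEXEC);
}
```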

If that is something that libpathrs desperately needs I would like to
do it as a separate patch anyways.

Thanks for the excellent cover letter. This was a pleasure merging!


* Re: (subset) [PATCH v4 0/4] procfs: make reference pidns more user-visible
From: Christian Brauner @ 2025-09-02  9:54 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Christian Brauner, Andy Lutomirski, linux-kernel, linux-fsdevel,
	linux-api, linux-doc, linux-kselftest, Alexander Viro, Jan Kara,
	Jonathan Corbet, Shuah Khan
In-Reply-To: <20250805-procfs-pidns-api-v4-0-705f984940e7@cyphar.com>

On Tue, 05 Aug 2025 15:45:07 +1000, Aleksa Sarai wrote:
> Ever since the introduction of pid namespaces, procfs has had very
> implicit behaviour surrounding them (the pidns used by a procfs mount is
> auto-selected based on the mounting process's active pidns, and the
> pidns itself is basically hidden once the mount has been constructed).
> 
> /* pidns mount option for procfs */
> 
> [...]

Applied to the vfs-6.18.procfs branch of the vfs/vfs.git tree.
Patches in the vfs-6.18.procfs branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.18.procfs

[1/4] pidns: move is-ancestor logic to helper
      https://git.kernel.org/vfs/vfs/c/60d22c6ef41b
[2/4] procfs: add "pidns" mount option
      https://git.kernel.org/vfs/vfs/c/77e211dd1392
[4/4] selftests/proc: add tests for new pidns APIs
      https://git.kernel.org/vfs/vfs/c/568d4239002c

