Linux userland API discussions
* [PATCH v22 1/8] arm64/gcs: Return a success value from gcs_alloc_thread_stack()
From: Mark Brown @ 2025-10-15 12:49 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, H.J. Lu, Florian Weimer,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, H.J. Lu, Adhemerval Zanella Netto,
	Wilco Dijkstra, Carlos O'Donell, Florian Weimer, Rich Felker,
	linux-kselftest, linux-api, Mark Brown, Kees Cook
In-Reply-To: <20251015-clone3-shadow-stack-v22-0-a8c8da011427@kernel.org>

Currently, as a result of templating from x86 code, gcs_alloc_thread_stack()
returns a pointer as an unsigned long. However, on arm64 we don't actually use
this pointer value as anything other than a pass/fail flag. Simplify the
interface to just return an int with 0 on success and a negative error code
on failure.

Acked-by: Deepak Gupta <debug@rivosinc.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
---
 arch/arm64/include/asm/gcs.h | 8 ++++----
 arch/arm64/kernel/process.c  | 8 ++++----
 arch/arm64/mm/gcs.c          | 8 ++++----
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/gcs.h b/arch/arm64/include/asm/gcs.h
index 8fa0707069e8..534ea5ae9281 100644
--- a/arch/arm64/include/asm/gcs.h
+++ b/arch/arm64/include/asm/gcs.h
@@ -64,8 +64,8 @@ static inline bool task_gcs_el0_enabled(struct task_struct *task)
 void gcs_set_el0_mode(struct task_struct *task);
 void gcs_free(struct task_struct *task);
 void gcs_preserve_current_state(void);
-unsigned long gcs_alloc_thread_stack(struct task_struct *tsk,
-				     const struct kernel_clone_args *args);
+int gcs_alloc_thread_stack(struct task_struct *tsk,
+			   const struct kernel_clone_args *args);
 
 static inline int gcs_check_locked(struct task_struct *task,
 				   unsigned long new_val)
@@ -171,8 +171,8 @@ static inline void put_user_gcs(unsigned long val, unsigned long __user *addr,
 				int *err) { }
 static inline void push_user_gcs(unsigned long val, int *err) { }
 
-static inline unsigned long gcs_alloc_thread_stack(struct task_struct *tsk,
-						   const struct kernel_clone_args *args)
+static inline int gcs_alloc_thread_stack(struct task_struct *tsk,
+					 const struct kernel_clone_args *args)
 {
 	return -ENOTSUPP;
 }
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index fba7ca102a8c..4dadc70df16b 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -299,7 +299,7 @@ static void flush_gcs(void)
 static int copy_thread_gcs(struct task_struct *p,
 			   const struct kernel_clone_args *args)
 {
-	unsigned long gcs;
+	int ret;
 
 	if (!system_supports_gcs())
 		return 0;
@@ -310,9 +310,9 @@ static int copy_thread_gcs(struct task_struct *p,
 	p->thread.gcs_el0_mode = current->thread.gcs_el0_mode;
 	p->thread.gcs_el0_locked = current->thread.gcs_el0_locked;
 
-	gcs = gcs_alloc_thread_stack(p, args);
-	if (IS_ERR_VALUE(gcs))
-		return PTR_ERR((void *)gcs);
+	ret = gcs_alloc_thread_stack(p, args);
+	if (ret != 0)
+		return ret;
 
 	return 0;
 }
diff --git a/arch/arm64/mm/gcs.c b/arch/arm64/mm/gcs.c
index 6e93f78de79b..3abcbf9adb5c 100644
--- a/arch/arm64/mm/gcs.c
+++ b/arch/arm64/mm/gcs.c
@@ -38,8 +38,8 @@ static unsigned long gcs_size(unsigned long size)
 	return max(PAGE_SIZE, size);
 }
 
-unsigned long gcs_alloc_thread_stack(struct task_struct *tsk,
-				     const struct kernel_clone_args *args)
+int gcs_alloc_thread_stack(struct task_struct *tsk,
+			   const struct kernel_clone_args *args)
 {
 	unsigned long addr, size;
 
@@ -59,13 +59,13 @@ unsigned long gcs_alloc_thread_stack(struct task_struct *tsk,
 	size = gcs_size(size);
 	addr = alloc_gcs(0, size);
 	if (IS_ERR_VALUE(addr))
-		return addr;
+		return PTR_ERR((void *)addr);
 
 	tsk->thread.gcs_base = addr;
 	tsk->thread.gcs_size = size;
 	tsk->thread.gcspr_el0 = addr + size - sizeof(u64);
 
-	return addr;
+	return 0;
 }
 
 SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)

-- 
2.47.2
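
For readers outside the kernel tree, the return-value convention this patch
moves to can be sketched in plain userspace C. This is only an illustration
under stated assumptions: fake_alloc_gcs() and struct thread_gcs are
hypothetical stand-ins for alloc_gcs() and the thread fields touched in the
diff, and IS_ERR_VALUE() is re-created locally rather than taken from the
kernel headers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO	4095
/* Userspace rendition of the kernel's "is this value an encoded errno" test. */
#define IS_ERR_VALUE(x)	((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

/* Hypothetical stand-in for the gcs_base/gcs_size/gcspr_el0 thread fields. */
struct thread_gcs {
	unsigned long base;
	unsigned long size;
	unsigned long gcspr;
};

/*
 * Hypothetical stand-in for alloc_gcs(): returns an address on success or
 * a negative errno encoded in the unsigned long on failure.
 */
unsigned long fake_alloc_gcs(unsigned long size)
{
	if (size == 0)
		return (unsigned long)-ENOMEM;
	return 0x70000000UL;
}

/* After the patch: 0 on success, negative errno on failure, no pointer. */
int alloc_thread_stack(struct thread_gcs *t, unsigned long size)
{
	unsigned long addr = fake_alloc_gcs(size);

	if (IS_ERR_VALUE(addr))
		return (int)(long)addr;	/* same value PTR_ERR((void *)addr) gives */

	t->base = addr;
	t->size = size;
	/* Initial stack pointer is the last 8-byte slot of the region. */
	t->gcspr = addr + size - sizeof(uint64_t);

	return 0;
}
```

A caller then only needs "if (ret != 0) return ret;", which is the shape
copy_thread_gcs() takes in the diff above.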



* [PATCH v22 0/8] fork: Support shadow stacks in clone3()
From: Mark Brown @ 2025-10-15 12:49 UTC (permalink / raw)
  To: Rick P. Edgecombe, Deepak Gupta, H.J. Lu, Florian Weimer,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Christian Brauner, Shuah Khan
  Cc: linux-kernel, Catalin Marinas, Will Deacon, jannh, bsegall,
	Andrew Morton, Yury Khrustalev, H.J. Lu, Adhemerval Zanella Netto,
	Wilco Dijkstra, Carlos O'Donell, Florian Weimer, Rich Felker,
	linux-kselftest, linux-api, Mark Brown, Kees Cook, Kees Cook,
	Shuah Khan

At this point I think everyone on the kernel side is happy with
this but there were some questions from the glibc side about the value
of controlling the shadow stack placement and size, especially with the
current inability to reuse the shadow stack for an exited thread.  With
support for reuse it would be possible to have a cache of shadow stacks
as is currently supported for the normal stack.

Since the discussion petered out I'm resending this in order to give
people something to work with while prototyping.  It should be possible to
prototype any potential kernel features to help build out shadow stack
support in userspace by enabling shadow stack writes; as suggested by
Rick Edgecombe, this may end up being required anyway for supporting more
exotic scenarios.  On all current architectures with the feature, writes
to the shadow stack require specific instructions, so there are still
security benefits even with writes enabled.

I did send a change implementing a feature writing a token on thread
exit to allow reuse:

   https://lore.kernel.org/r/20250921-arm64-gcs-exit-token-v1-0-45cf64e648d5@kernel.org

but wasn't planning to refresh it without some indication from the
userspace side that that'd be useful.

Non-process cover letter:

The kernel has added support for shadow stacks, currently x86 only using
their CET feature but both arm64 and RISC-V have equivalent features
(GCS and Zicfiss respectively), I am actively working on GCS[1].  With
shadow stacks the hardware maintains an additional stack containing only
the return addresses for branch instructions which is not generally
writeable by userspace and ensures that any returns are to the recorded
addresses.  This provides some protection against ROP attacks and makes
it easier to collect call stacks.  These shadow stacks are allocated in
the address space of the userspace process.

Our API for shadow stacks does not currently offer userspace any
flexibility for managing the allocation of shadow stacks for newly
created threads, instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled.  The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread.  This lack of flexibility and control isn't ideal: in the vast
majority of cases the shadow stack will be over-allocated, and the
implicit allocation and deallocation is not consistent with other
interfaces.  As far as I can tell the interface is done in this manner
mainly because the shadow stack patches were in development since before
clone3() was implemented.

Since clone3() is readily extensible let's add support for specifying a
shadow stack when creating a new thread or process, keeping the current
implicit allocation behaviour if one is not specified either with
clone3() or through the use of clone().  The user must provide a shadow
stack pointer, which must point to memory mapped for use as a shadow
stack by map_shadow_stack() with an architecture-specified shadow stack
token at the top of the stack.

Yury Khrustalev has raised questions from the libc side regarding
discoverability of extended clone3() structure sizes[2]; this seems like
a general issue with clone3().  There was a suggestion to add a hwcap on
arm64, which isn't ideal but is doable there, though architecture-specific
mechanisms would also be needed for x86 (and RISC-V if its
support gets merged before this does).  The idea has, however, had
strong pushback from the architecture maintainers, and it is possible to
detect support for this in clone3() by attempting a call with a
misaligned shadow stack pointer specified, so no hwcap has been added.
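
As a rough illustration of that probing approach, a userspace sketch might
look like the following.  Several things here are assumptions rather than
settled ABI: the local struct layout (the existing clone_args fields plus one
trailing 64-bit shstk_token field, named after the v21 rename), the helper
names, and in particular the guess that a kernel which understands the field
rejects a misaligned value with EINVAL.  The E2BIG case relies only on the
existing clone3() behaviour of rejecting a larger-than-known structure whose
extra bytes are non-zero.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local mirror of struct clone_args with the proposed trailing field. */
struct clone_args_shstk {
	uint64_t flags;
	uint64_t pidfd;
	uint64_t child_tid;
	uint64_t parent_tid;
	uint64_t exit_signal;
	uint64_t stack;
	uint64_t stack_size;
	uint64_t tls;
	uint64_t set_tid;
	uint64_t set_tid_size;
	uint64_t cgroup;
	uint64_t shstk_token;	/* assumed name and position of the new field */
};

enum shstk_probe { SHSTK_UNKNOWN, SHSTK_ABSENT, SHSTK_FIELD_KNOWN };

/* Map the errno from the deliberately invalid call to a probe result. */
enum shstk_probe classify_probe_errno(int err)
{
	switch (err) {
	case E2BIG:	/* kernel's struct is smaller, our extra bytes are non-zero */
		return SHSTK_ABSENT;
	case EINVAL:	/* assumption: field was parsed and the value rejected */
		return SHSTK_FIELD_KNOWN;
	default:	/* ENOSYS, seccomp filters, anything else: no conclusion */
		return SHSTK_UNKNOWN;
	}
}

/* Issue clone3() with a misaligned token that can never be valid. */
enum shstk_probe probe_clone3_shstk(void)
{
#ifdef SYS_clone3
	struct clone_args_shstk args;
	long ret;

	memset(&args, 0, sizeof(args));
	args.exit_signal = SIGCHLD;
	args.shstk_token = 1;	/* deliberately misaligned */

	ret = syscall(SYS_clone3, &args, sizeof(args));
	if (ret == 0)
		_exit(0);	/* unexpected child: leave immediately */
	if (ret > 0)
		return SHSTK_UNKNOWN;	/* call succeeded, probe told us nothing */
	return classify_probe_errno(errno);
#else
	return SHSTK_UNKNOWN;
#endif
}
```

A libc would presumably run such a probe once and cache the result; the
sketch does not reap the child in the unexpected success case.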

[1] https://lore.kernel.org/linux-arm-kernel/20241001-arm64-gcs-v13-0-222b78d87eee@kernel.org/T/#mc58f97f27461749ccf400ebabf6f9f937116a86b
[2] https://lore.kernel.org/r/aCs65ccRQtJBnZ_5@arm.com

Signed-off-by: Mark Brown <broonie@kernel.org>
---
Changes in v22:
- Rebase onto v6.18-rc1.
- Cover letter updates.
- Link to v21: https://lore.kernel.org/r/20250916-clone3-shadow-stack-v21-0-910493527013@kernel.org

Changes in v21:
- Rebase onto https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git kernel-6.18.clone3
- Rename shadow_stack_token to shstk_token, since it's a simple rename I've
  kept the acks and reviews but I dropped the tested-bys just to be safe.
- Link to v20: https://lore.kernel.org/r/20250902-clone3-shadow-stack-v20-0-4d9fff1c53e7@kernel.org

Changes in v20:
- Comment fixes and clarifications in x86 arch_shstk_validate_clone()
  from Rick Edgecombe.
- Spelling fix in documentation.
- Link to v19: https://lore.kernel.org/r/20250819-clone3-shadow-stack-v19-0-bc957075479b@kernel.org

Changes in v19:
- Rebase onto v6.17-rc1.
- Link to v18: https://lore.kernel.org/r/20250702-clone3-shadow-stack-v18-0-7965d2b694db@kernel.org

Changes in v18:
- Rebase onto v6.16-rc3.
- Thanks to pointers from Yury Khrustalev this version has been tested
  on x86 so I have removed the RFT tag.
- Clarify clone3_shadow_stack_valid() comment about the Kconfig check.
- Remove redundant GCSB DSYNCs in arm64 code.
- Fix token validation on x86.
- Link to v17: https://lore.kernel.org/r/20250609-clone3-shadow-stack-v17-0-8840ed97ff6f@kernel.org

Changes in v17:
- Rebase onto v6.16-rc1.
- Link to v16: https://lore.kernel.org/r/20250416-clone3-shadow-stack-v16-0-2ffc9ca3917b@kernel.org

Changes in v16:
- Rebase onto v6.15-rc2.
- Roll in fixes from x86 testing from Rick Edgecombe.
- Rework so that the argument is shadow_stack_token.
- Link to v15: https://lore.kernel.org/r/20250408-clone3-shadow-stack-v15-0-3fa245c6e3be@kernel.org

Changes in v15:
- Rebase onto v6.15-rc1.
- Link to v14: https://lore.kernel.org/r/20250206-clone3-shadow-stack-v14-0-805b53af73b9@kernel.org

Changes in v14:
- Rebase onto v6.14-rc1.
- Link to v13: https://lore.kernel.org/r/20241203-clone3-shadow-stack-v13-0-93b89a81a5ed@kernel.org

Changes in v13:
- Rebase onto v6.13-rc1.
- Link to v12: https://lore.kernel.org/r/20241031-clone3-shadow-stack-v12-0-7183eb8bee17@kernel.org

Changes in v12:
- Add the regular prctl() to the userspace API document since arm64
  support is queued in -next.
- Link to v11: https://lore.kernel.org/r/20241005-clone3-shadow-stack-v11-0-2a6a2bd6d651@kernel.org

Changes in v11:
- Rebase onto arm64 for-next/gcs, which is based on v6.12-rc1, and
  integrate arm64 support.
- Rework the interface to specify a shadow stack pointer rather than a
  base and size like we do for the regular stack.
- Link to v10: https://lore.kernel.org/r/20240821-clone3-shadow-stack-v10-0-06e8797b9445@kernel.org

Changes in v10:
- Integrate fixes & improvements for the x86 implementation from Rick
  Edgecombe.
- Require that the shadow stack be VM_WRITE.
- Require that the shadow stack base and size be sizeof(void *) aligned.
- Clean up trailing newline.
- Link to v9: https://lore.kernel.org/r/20240819-clone3-shadow-stack-v9-0-962d74f99464@kernel.org

Changes in v9:
- Pull token validation earlier and report problems with an error return
  to parent rather than signal delivery to the child.
- Verify that the top of the supplied shadow stack is VM_SHADOW_STACK.
- Rework token validation to only do the page mapping once.
- Drop no longer needed support for testing for signals in selftest.
- Fix typo in comments.
- Link to v8: https://lore.kernel.org/r/20240808-clone3-shadow-stack-v8-0-0acf37caf14c@kernel.org

Changes in v8:
- Fix token verification with user specified shadow stack.
- Don't track user managed shadow stacks for child processes.
- Link to v7: https://lore.kernel.org/r/20240731-clone3-shadow-stack-v7-0-a9532eebfb1d@kernel.org

Changes in v7:
- Rebase onto v6.11-rc1.
- Typo fixes.
- Link to v6: https://lore.kernel.org/r/20240623-clone3-shadow-stack-v6-0-9ee7783b1fb9@kernel.org

Changes in v6:
- Rebase onto v6.10-rc3.
- Ensure we don't try to free the parent shadow stack in error paths of
  x86 arch code.
- Spelling fixes in userspace API document.
- Additional cleanups and improvements to the clone3() tests to support
  the shadow stack tests.
- Link to v5: https://lore.kernel.org/r/20240203-clone3-shadow-stack-v5-0-322c69598e4b@kernel.org

Changes in v5:
- Rebase onto v6.8-rc2.
- Rework ABI to have the user allocate the shadow stack memory with
  map_shadow_stack() and a token.
- Force inlining of the x86 shadow stack enablement.
- Move shadow stack enablement out into a shared header for reuse by
  other tests.
- Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@kernel.org

Changes in v4:
- Formatting changes.
- Use a define for minimum shadow stack size and move some basic
  validation to fork.c.
- Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@kernel.org

Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of
  CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@kernel.org

Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the
  desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@kernel.org

---
Mark Brown (8):
      arm64/gcs: Return a success value from gcs_alloc_thread_stack()
      Documentation: userspace-api: Add shadow stack API documentation
      selftests: Provide helper header for shadow stack testing
      fork: Add shadow stack support to clone3()
      selftests/clone3: Remove redundant flushes of output streams
      selftests/clone3: Factor more of main loop into test_clone3()
      selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
      selftests/clone3: Test shadow stack support

 Documentation/userspace-api/index.rst             |   1 +
 Documentation/userspace-api/shadow_stack.rst      |  44 +++++
 arch/arm64/include/asm/gcs.h                      |   8 +-
 arch/arm64/kernel/process.c                       |   8 +-
 arch/arm64/mm/gcs.c                               |  55 +++++-
 arch/x86/include/asm/shstk.h                      |  11 +-
 arch/x86/kernel/process.c                         |   2 +-
 arch/x86/kernel/shstk.c                           |  53 ++++-
 include/asm-generic/cacheflush.h                  |  11 ++
 include/linux/sched/task.h                        |  17 ++
 include/uapi/linux/sched.h                        |   9 +-
 kernel/fork.c                                     |  93 +++++++--
 tools/testing/selftests/clone3/clone3.c           | 226 ++++++++++++++++++----
 tools/testing/selftests/clone3/clone3_selftests.h |  65 ++++++-
 tools/testing/selftests/ksft_shstk.h              |  98 ++++++++++
 15 files changed, 620 insertions(+), 81 deletions(-)
---
base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787
change-id: 20231019-clone3-shadow-stack-15d40d2bf536

Best regards,
--  
Mark Brown <broonie@kernel.org>



* Re: [PATCH v3 20/20] mm: stop maintaining the per-page mapcount of large folios (CONFIG_NO_PAGE_MAPCOUNT)
From: Wei Yang @ 2025-10-15  0:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Matthew Wilcox, Wei Yang, linux-kernel, linux-doc, cgroups,
	linux-mm, linux-fsdevel, linux-api, Andrew Morton, Tejun Heo,
	Zefan Li, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, Muchun Song, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn
In-Reply-To: <f9d19f72-58f7-4694-ae18-1d944238a3e7@redhat.com>

On Tue, Oct 14, 2025 at 04:38:38PM +0200, David Hildenbrand wrote:
>On 14.10.25 16:32, Matthew Wilcox wrote:
>> On Tue, Oct 14, 2025 at 02:59:30PM +0200, David Hildenbrand wrote:
>> > > As commit 349994cf61e6 mentioned, we don't support partially mapped PUD-sized
>> > > folio yet.
>> > 
>> > We do support partially mapped PUD-sized folios I think, but not anonymous
>> > PUD-sized folios.
>> 
>> I don't think so?  The only mechanism I know of to allocate PUD-sized
>> chunks of memory is hugetlb, and that doesn't permit partial mappings.
>
>Greetings from the latest DAX rework :)

After a re-think, do you think it's better to align the behavior between
CONFIG_NO_PAGE_MAPCOUNT and CONFIG_PAGE_MAPCOUNT?

It looks like we treat a PUD-sized folio as partially_mapped if CONFIG_NO_PAGE_MAPCOUNT,
but as !partially_mapped if CONFIG_PAGE_MAPCOUNT, if my understanding is correct.

>
>-- 
>Cheers
>
>David / dhildenb

-- 
Wei Yang
Help you, Help me


* Re: [PATCH v3 20/20] mm: stop maintaining the per-page mapcount of large folios (CONFIG_NO_PAGE_MAPCOUNT)
From: David Hildenbrand @ 2025-10-14 14:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Wei Yang, linux-kernel, linux-doc, cgroups, linux-mm,
	linux-fsdevel, linux-api, Andrew Morton, Tejun Heo, Zefan Li,
	Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, Muchun Song, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn
In-Reply-To: <aO5fCT62gZZw9-wQ@casper.infradead.org>

On 14.10.25 16:32, Matthew Wilcox wrote:
> On Tue, Oct 14, 2025 at 02:59:30PM +0200, David Hildenbrand wrote:
>>> As commit 349994cf61e6 mentioned, we don't support partially mapped PUD-sized
>>> folio yet.
>>
>> We do support partially mapped PUD-sized folios I think, but not anonymous
>> PUD-sized folios.
> 
> I don't think so?  The only mechanism I know of to allocate PUD-sized
> chunks of memory is hugetlb, and that doesn't permit partial mappings.

Greetings from the latest DAX rework :)

-- 
Cheers

David / dhildenb



* Re: [PATCH v3 20/20] mm: stop maintaining the per-page mapcount of large folios (CONFIG_NO_PAGE_MAPCOUNT)
From: Matthew Wilcox @ 2025-10-14 14:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Wei Yang, linux-kernel, linux-doc, cgroups, linux-mm,
	linux-fsdevel, linux-api, Andrew Morton, Tejun Heo, Zefan Li,
	Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, Muchun Song, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn
In-Reply-To: <71380b43-c23c-42b5-8aab-f158bb37bc75@redhat.com>

On Tue, Oct 14, 2025 at 02:59:30PM +0200, David Hildenbrand wrote:
> > As commit 349994cf61e6 mentioned, we don't support partially mapped PUD-sized
> > folio yet.
> 
> We do support partially mapped PUD-sized folios I think, but not anonymous
> PUD-sized folios.

I don't think so?  The only mechanism I know of to allocate PUD-sized
chunks of memory is hugetlb, and that doesn't permit partial mappings.


* Re: [PATCH v3 20/20] mm: stop maintaining the per-page mapcount of large folios (CONFIG_NO_PAGE_MAPCOUNT)
From: Wei Yang @ 2025-10-14 13:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Wei Yang, linux-kernel, linux-doc, cgroups, linux-mm,
	linux-fsdevel, linux-api, Andrew Morton, Matthew Wilcox (Oracle),
	Tejun Heo, Zefan Li, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Muchun Song, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn
In-Reply-To: <71380b43-c23c-42b5-8aab-f158bb37bc75@redhat.com>

On Tue, Oct 14, 2025 at 02:59:30PM +0200, David Hildenbrand wrote:
>On 14.10.25 14:23, Wei Yang wrote:
>> On Mon, Mar 03, 2025 at 05:30:13PM +0100, David Hildenbrand wrote:
>> [...]
>> > @@ -1678,6 +1726,22 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>> > 		break;
>> > 	case RMAP_LEVEL_PMD:
>> > 	case RMAP_LEVEL_PUD:
>> > +		if (IS_ENABLED(CONFIG_NO_PAGE_MAPCOUNT)) {
>> > +			last = atomic_add_negative(-1, &folio->_entire_mapcount);
>> > +			if (level == RMAP_LEVEL_PMD && last)
>> > +				nr_pmdmapped = folio_large_nr_pages(folio);
>> > +			nr = folio_dec_return_large_mapcount(folio, vma);
>> > +			if (!nr) {
>> > +				/* Now completely unmapped. */
>> > +				nr = folio_large_nr_pages(folio);
>> > +			} else {
>> > +				partially_mapped = last &&
>> > +						   nr < folio_large_nr_pages(folio);
>> 
>> Hi, David
>
>Hi!
>
>> 
>> Do you think this is better to be?
>> 
>> 	partially_mapped = last && nr < nr_pmdmapped;
>
>I see what you mean, it would be similar to the CONFIG_PAGE_MAPCOUNT case
>below.
>
>But probably it could then be
>
>	partially_mapped = nr < nr_pmdmapped;
>
>because nr_pmdmapped is only set when "last = true".
>
>I'm not sure if there is a good reason to change it at this point though.
>Smells like a micro-optimization for PUD, which we probably shouldn't worry
>about.
>
>> 
>> As commit 349994cf61e6 mentioned, we don't support partially mapped PUD-sized
>> folio yet.
>
>We do support partially mapped PUD-sized folios I think, but not anonymous
>PUD-sized folios.
>
>So consequently the partially_mapped variable will never really be used later
>on, because the folio_test_anon() will never hit in the PUD case.
>

Ok, folio_test_anon() takes care of it. We won't add it to defer list by
accident.

>-- 
>Cheers
>
>David / dhildenb

-- 
Wei Yang
Help you, Help me


* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Pratyush Yadav @ 2025-10-14 13:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pasha Tatashin, Pratyush Yadav, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu, hughd,
	skhawaja, chrisl, steven.sistare
In-Reply-To: <20251010150116.GC3901471@nvidia.com>

On Fri, Oct 10 2025, Jason Gunthorpe wrote:

> On Thu, Oct 09, 2025 at 07:50:12PM -0400, Pasha Tatashin wrote:
>> >   This can look something like:
>> >
>> >   hugetlb_luo_preserve_folio(folio, ...);
>> >
>> >   Nice and simple.
>> >
>> >   Compare this with the new proposed API:
>> >
>> >   liveupdate_fh_global_state_get(h, &hugetlb_data);
>> >   // This will have update serialized state now.
>> >   hugetlb_luo_preserve_folio(hugetlb_data, folio, ...);
>> >   liveupdate_fh_global_state_put(h);
>> >
>> >   We do the same thing but in a very complicated way.
>> >
>> > - When the system-wide preserve happens, the hugetlb subsystem gets a
>> >   callback to serialize. It converts its runtime global state to
>> >   serialized state since now it knows no more FDs will be added.
>> >
>> >   With the new API, this doesn't need to be done since each FD prepare
>> >   already updates serialized state.
>> >
>> > - If there are no hugetlb FDs, then the hugetlb subsystem doesn't put
>> >   anything in LUO. This is same as new API.
>> >
>> > - If some hugetlb FDs are not restored after liveupdate and the finish
>> >   event is triggered, the subsystem gets its finish() handler called and
>> >   it can free things up.
>> >
>> >   I don't get how that would work with the new API.
>> 
>> The new API isn't more complicated; It codifies the common pattern of
>> "create on first use, destroy on last use" into a reusable helper,
>> saving each file handler from having to reinvent the same reference
>> counting and locking scheme. But, as you point out, subsystems provide
>> more control, specifically they handle full creation/free instead of
>> relying on file-handlers for that.
>
> I'd say hugetlb *should* be doing the more complicated thing. We
> should not have global static data for luo floating around the kernel,
> this is too easily abused in bad ways.

Not sure how much difference this makes in practice, but I get your
point.

>
> The above "complicated" sequence forces the caller to have a fd
> session handle, and "hides" the global state inside luo so the
> subsystem can't just randomly reach into it whenever it likes.
>
> This is a deliberate and violent way to force clean coding practices
> and good layering.
>
> Not sure why hugetlb pools would need another xarray??

Not sure myself either. I used it to demonstrate my point of having
runtime state and serialized state separate from each other.

>
> 1) Use a vmalloc and store a list of the PFNs in the pool. Pool becomes
>    frozen, can't add/remove PFNs.

Doesn't that circumvent LUO's state machine? The idea with the state
machine was to have clear points in time when the system goes into the
"limited capacity"/"frozen" state, which is the LIVEUPDATE_PREPARE
event. With what you propose, the first FD being preserved implicitly
triggers the prepare event. Same thing for unprepare/cancel operations.

I am wondering if it is better to do it the other way round: prepare all
files first, and then prepare the hugetlb subsystem at
LIVEUPDATE_PREPARE event. At that point it already knows which pages to
mark preserved so the serialization can be done in one go.

> 2) Require the users of hugetlb memory, like memfd, to
>    preserve/restore the folios they are using (using their hugetlb order)
> 3) Just before kexec run over the PFN list and mark a bit if the folio
>    was preserved by KHO or not. Make sure everything gets KHO
>    preserved.

"just before kexec" would need a callback from LUO. I suppose a
subsystem is the place for that callback. I wrote my email under the
(wrong) impression that we were replacing subsystems.

That makes me wonder: how is the subsystem-level callback supposed to
access the global data? I suppose it can use the liveupdate_file_handler
directly, but it is kind of strange since technically the subsystem and
file handler are two different entities.

Also as Pasha mentioned, 1G pages for guest_memfd will use hugetlb, and
I'm not sure how that would map with this shared global data. memfd and
guest_memfd will likely have different liveupdate_file_handler but would
share data from the same subsystem. Maybe that's a problem to solve for
later...

>
> Restore puts the PFNs that were not preserved directly in the free
> pool, the end user of the folio like the memfd restores and eventually
> normally frees the other folios.

Yeah, on the restore side this idea works fine I think.

>
> It is simple and fits nicely into the infrastructure here, where the
> first time you trigger a global state it does the pfn list and
> freezing, and the lifecycle and locking for this operation is directly
> managed by luo.
>
> The memfd, when it knows it has hugetlb folios inside it, would
> trigger this.
>
> Jason

-- 
Regards,
Pratyush Yadav
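
The "create on first use, destroy on last use" pattern quoted above can be
sketched generically in userspace C as below.  This is not the LUO API: every
name here (global_state_get(), global_state_put(), struct global_state) is
hypothetical, and a pthread mutex stands in for whatever locking the kernel
helper would actually use.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical shared state that several file handlers want to use. */
struct global_state {
	int nr_preserved;
};

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static struct global_state *state;
static unsigned int state_users;

/* First caller allocates the state; later callers share it. */
struct global_state *global_state_get(void)
{
	struct global_state *ret;

	pthread_mutex_lock(&state_lock);
	if (!state)
		state = calloc(1, sizeof(*state));
	if (state)
		state_users++;
	ret = state;
	pthread_mutex_unlock(&state_lock);

	return ret;	/* NULL only if the allocation failed */
}

/* Last caller to drop its reference frees the state. */
void global_state_put(void)
{
	pthread_mutex_lock(&state_lock);
	if (state_users > 0 && --state_users == 0) {
		free(state);
		state = NULL;
	}
	pthread_mutex_unlock(&state_lock);
}

/* Test helper: reports whether the shared state currently exists. */
int global_state_is_live(void)
{
	int live;

	pthread_mutex_lock(&state_lock);
	live = state != NULL;
	pthread_mutex_unlock(&state_lock);

	return live;
}
```

The point of hiding the pointer behind get/put is the one Jason makes: a
caller cannot reach the state without first taking a reference.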


* Re: [PATCH v3 20/20] mm: stop maintaining the per-page mapcount of large folios (CONFIG_NO_PAGE_MAPCOUNT)
From: David Hildenbrand @ 2025-10-14 12:59 UTC (permalink / raw)
  To: Wei Yang
  Cc: linux-kernel, linux-doc, cgroups, linux-mm, linux-fsdevel,
	linux-api, Andrew Morton, Matthew Wilcox (Oracle), Tejun Heo,
	Zefan Li, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, Muchun Song, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn
In-Reply-To: <20251014122335.dpyk5advbkioojnm@master>

On 14.10.25 14:23, Wei Yang wrote:
> On Mon, Mar 03, 2025 at 05:30:13PM +0100, David Hildenbrand wrote:
> [...]
>> @@ -1678,6 +1726,22 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>> 		break;
>> 	case RMAP_LEVEL_PMD:
>> 	case RMAP_LEVEL_PUD:
>> +		if (IS_ENABLED(CONFIG_NO_PAGE_MAPCOUNT)) {
>> +			last = atomic_add_negative(-1, &folio->_entire_mapcount);
>> +			if (level == RMAP_LEVEL_PMD && last)
>> +				nr_pmdmapped = folio_large_nr_pages(folio);
>> +			nr = folio_dec_return_large_mapcount(folio, vma);
>> +			if (!nr) {
>> +				/* Now completely unmapped. */
>> +				nr = folio_large_nr_pages(folio);
>> +			} else {
>> +				partially_mapped = last &&
>> +						   nr < folio_large_nr_pages(folio);
> 
> Hi, David

Hi!

> 
> Do you think this is better to be?
> 
> 	partially_mapped = last && nr < nr_pmdmapped;

I see what you mean, it would be similar to the CONFIG_PAGE_MAPCOUNT 
case below.

But probably it could then be

	partially_mapped = nr < nr_pmdmapped;

because nr_pmdmapped is only set when "last = true".

I'm not sure if there is a good reason to change it at this point 
though. Smells like a micro-optimization for PUD, which we probably 
shouldn't worry about.

> 
> As commit 349994cf61e6 mentioned, we don't support partially mapped PUD-sized
> folio yet.

We do support partially mapped PUD-sized folios I think, but not 
anonymous PUD-sized folios.

So consequently the partially_mapped variable will never really be used 
later on, because the folio_test_anon() will never hit in the PUD case.

-- 
Cheers

David / dhildenb
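
To make the comparison concrete, the three candidate expressions can be
modelled in standalone C.  This is a simplified model, not kernel code: the
function names are made up, "nr" is the non-zero remaining large mapcount
from the else branch, and nr_pmdmapped is computed exactly as in the quoted
hunk (set only for the PMD level when "last" is true, otherwise 0).

```c
#include <assert.h>
#include <stdbool.h>

enum rmap_level { LEVEL_PMD, LEVEL_PUD };

/* nr_pmdmapped as in the quoted hunk: only set for PMD when last is true. */
int model_nr_pmdmapped(enum rmap_level level, bool last, int folio_pages)
{
	return (level == LEVEL_PMD && last) ? folio_pages : 0;
}

/* Expression in the patch as posted. */
bool partially_mapped_patch(bool last, int nr, int folio_pages)
{
	return last && nr < folio_pages;
}

/* Wei Yang's suggestion. */
bool partially_mapped_wei(enum rmap_level level, bool last, int nr,
			  int folio_pages)
{
	return last && nr < model_nr_pmdmapped(level, last, folio_pages);
}

/* The shortened form: nr_pmdmapped is already 0 unless last is true. */
bool partially_mapped_short(enum rmap_level level, bool last, int nr,
			    int folio_pages)
{
	return nr < model_nr_pmdmapped(level, last, folio_pages);
}
```

In this model all three agree for the PMD level, and the only difference is
the PUD level, where the posted expression can be true while the other two
are always false.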



* Re: [PATCH v3 20/20] mm: stop maintaining the per-page mapcount of large folios (CONFIG_NO_PAGE_MAPCOUNT)
From: Wei Yang @ 2025-10-14 12:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-doc, cgroups, linux-mm, linux-fsdevel,
	linux-api, Andrew Morton, Matthew Wilcox (Oracle), Tejun Heo,
	Zefan Li, Johannes Weiner, Michal Koutný, Jonathan Corbet,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, Muchun Song, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn
In-Reply-To: <20250303163014.1128035-21-david@redhat.com>

On Mon, Mar 03, 2025 at 05:30:13PM +0100, David Hildenbrand wrote:
[...]
>@@ -1678,6 +1726,22 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> 		break;
> 	case RMAP_LEVEL_PMD:
> 	case RMAP_LEVEL_PUD:
>+		if (IS_ENABLED(CONFIG_NO_PAGE_MAPCOUNT)) {
>+			last = atomic_add_negative(-1, &folio->_entire_mapcount);
>+			if (level == RMAP_LEVEL_PMD && last)
>+				nr_pmdmapped = folio_large_nr_pages(folio);
>+			nr = folio_dec_return_large_mapcount(folio, vma);
>+			if (!nr) {
>+				/* Now completely unmapped. */
>+				nr = folio_large_nr_pages(folio);
>+			} else {
>+				partially_mapped = last &&
>+						   nr < folio_large_nr_pages(folio);

Hi, David

Do you think this is better to be?

	partially_mapped = last && nr < nr_pmdmapped;

As commit 349994cf61e6 mentioned, we don't support partially mapped PUD-sized
folio yet.

Not sure what I missed here.

>+				nr = 0;
>+			}
>+			break;
>+		}
>+
> 		folio_dec_large_mapcount(folio, vma);
> 		last = atomic_add_negative(-1, &folio->_entire_mapcount);
> 		if (last) {
>-- 
>2.48.1
>
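
A small userspace model may help compare the two expressions under
discussion. It is only a sketch: the function names are made up, and plain
integers stand in for folio_large_nr_pages(), the value returned by
folio_dec_return_large_mapcount() and the result of atomic_add_negative().
It is not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

enum rmap_level { RMAP_LEVEL_PMD, RMAP_LEVEL_PUD };

/*
 * Model of the expression as posted in the patch.
 *   last        - the entire mapping went away
 *   remaining   - large mapcount left after the decrement
 *   folio_pages - number of pages in the folio
 */
static bool partially_mapped_posted(bool last, int remaining, int folio_pages)
{
	assert(remaining >= 0 && folio_pages > 0);

	if (!remaining)
		return false;	/* now completely unmapped */
	return last && remaining < folio_pages;
}

/* Model of the suggested alternative, comparing against nr_pmdmapped. */
static bool partially_mapped_proposed(enum rmap_level level, bool last,
				      int remaining, int folio_pages)
{
	int nr_pmdmapped = 0;

	assert(remaining >= 0 && folio_pages > 0);

	/* nr_pmdmapped is only set for the PMD level, as in the patch. */
	if (level == RMAP_LEVEL_PMD && last)
		nr_pmdmapped = folio_pages;
	if (!remaining)
		return false;	/* now completely unmapped */
	return last && remaining < nr_pmdmapped;
}
```

Under this model the two forms agree for PMD mappings and differ only for
PUD mappings, where the suggested form always yields false. As noted
elsewhere in this thread, the posted form's result is not acted on in the
PUD case either, since folio_test_anon() does not hit there.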

-- 
Wei Yang
Help you, Help me


* Re: [PATCH v2] vdso: Remove struct getcpu_cache
From: Florian Weimer @ 2025-10-14  8:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Thomas Weißschuh, Huacai Chen, WANG Xuerui,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Richard Weinberger, Anton Ivanov, Johannes Berg,
	Vincenzo Frascino, Shuah Khan, loongarch, linux-kernel,
	linux-s390, linux-um, linux-api, linux-kselftest
In-Reply-To: <CALCETrV2W3cZEJ2yy7F-F9=e_8HLP84ZWrOJCzUYn_ASb0+M6A@mail.gmail.com>

* Andy Lutomirski:

> The theory is that people thought that getcpu was going to be kind of
> slow, so userspace would allocate a little cache (IIRC per-thread) and
> pass it in, and the vDSO would do, well, something clever to return
> the right value.  The something clever was probably based on the idea
> that you can't actually tell (in general) if the return value from
> getcpu is stale, since you might well get migrated right as the
> function returns anyway, so the cache could be something silly like
> (jiffies, cpu).

It probably had something to do with per-CPU or per-node mappings of the
vDSO.  Or maybe some non-coherent cache line in the vDSO.  As far as I
understand it, the cache has zero chance of working with the way vDSO
data is currently implemented.

We have the CPU ID and node ID in the restartable sequences area now
(although glibc does not use the node ID yet).  It's not a cache.  So
this clearly supersedes whatever struct getcpu_cache tried to achieve.

Thanks,
Florian



* Re: [PATCH v2] vdso: Remove struct getcpu_cache
From: H. Peter Anvin @ 2025-10-13 21:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Thomas Weißschuh, Huacai Chen, WANG Xuerui,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Richard Weinberger, Anton Ivanov, Johannes Berg,
	Vincenzo Frascino, Shuah Khan, loongarch, linux-kernel,
	linux-s390, linux-um, linux-api, linux-kselftest
In-Reply-To: <CALCETrVuW_MmksnkprK5Ljm-5RBSS=FUA8R8fgGMhZ3BxW15Sw@mail.gmail.com>

On 2025-10-13 13:32, Andy Lutomirski wrote:
> 
> The global timestamp would just be some field in the vvar area, which
> we have plenty of anyway.
> 
> But I agree, accelerating getcpu is pointless.  In any case, anything
> that historically thought it really really wanted accelerated getcpu
> can, and probably does, use rseq these days.
> 

Indeed. And with RDPID it is fast enough that the bulk of the cost is probably
in the vdso call.

	-hpa



* Re: [PATCH v2] vdso: Remove struct getcpu_cache
From: Andy Lutomirski @ 2025-10-13 20:32 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andy Lutomirski, Dave Hansen, Thomas Weißschuh, Huacai Chen,
	WANG Xuerui, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Richard Weinberger, Anton Ivanov, Johannes Berg,
	Vincenzo Frascino, Shuah Khan, loongarch, linux-kernel,
	linux-s390, linux-um, linux-api, linux-kselftest
In-Reply-To: <494caf29-8755-4bc6-a2c3-b9d0b3e9b78d@zytor.com>

On Mon, Oct 13, 2025 at 12:45 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 2025-10-13 10:14, Andy Lutomirski wrote:
> >
> > I don't actually remember whether the kernel ever used this.  It's
> > possible that there are ancient kernels where passing a wild, non-null
> > pointer would blow up.  But it's certainly safe to pass null, and it's
> > certainly safe for the kernel to ignore the parameter.
> >
>
> One could imagine an architecture which would have to execute an actual system
> call wanting to use this, but on x86 it is pointless -- even the LSL trick is
> much faster than a system call, and once you account for whatever hassle you
> would have to deal with to make the cache make sense (probably having a global
> generation number and/or a timestamp to expire it) it well and truly makes no
> sense.

The global timestamp would just be some field in the vvar area, which
we have plenty of anyway.

But I agree, accelerating getcpu is pointless.  In any case, anything
that historically thought it really really wanted accelerated getcpu
can, and probably does, use rseq these days.

--Andy


* Re: [PATCH v2] vdso: Remove struct getcpu_cache
From: H. Peter Anvin @ 2025-10-13 19:44 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: Thomas Weißschuh, Huacai Chen, WANG Xuerui, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Richard Weinberger, Anton Ivanov, Johannes Berg,
	Vincenzo Frascino, Shuah Khan, loongarch, linux-kernel,
	linux-s390, linux-um, linux-api, linux-kselftest
In-Reply-To: <CALCETrV2W3cZEJ2yy7F-F9=e_8HLP84ZWrOJCzUYn_ASb0+M6A@mail.gmail.com>

On 2025-10-13 10:14, Andy Lutomirski wrote:
> 
> I don't actually remember whether the kernel ever used this.  It's
> possible that there are ancient kernels where passing a wild, non-null
> pointer would blow up.  But it's certainly safe to pass null, and it's
> certainly safe for the kernel to ignore the parameter.
> 

One could imagine an architecture which would have to execute an actual system
call wanting to use this, but on x86 it is pointless -- even the LSL trick is
much faster than a system call, and once you account for whatever hassle you
would have to deal with to make the cache make sense (probably having a global
generation number and/or a timestamp to expire it) it well and truly makes no
sense.
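
To illustrate the kind of bookkeeping such a cache would need, here is a
minimal userspace sketch of a per-thread cache that is invalidated by a
global generation number. Everything in it is hypothetical: struct
cpu_cache is not the layout of the removed struct getcpu_cache, and
global_generation, current_cpu and slow_getcpu() merely stand in for
shared vDSO data and an expensive lookup.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-thread cache, owned by the caller. */
struct cpu_cache {
	bool valid;
	unsigned long generation;	/* global generation seen when filled */
	unsigned int cpu;
};

/* Stand-ins for shared data and for the expensive lookup. */
static unsigned long global_generation;
static unsigned int current_cpu;
static unsigned int slow_lookups;

static unsigned int slow_getcpu(void)
{
	slow_lookups++;
	return current_cpu;
}

/* Return the cached CPU unless the generation moved on; NULL disables caching. */
static unsigned int cached_getcpu(struct cpu_cache *cache)
{
	if (!cache)
		return slow_getcpu();

	if (!cache->valid || cache->generation != global_generation) {
		cache->cpu = slow_getcpu();
		cache->generation = global_generation;
		cache->valid = true;
	}

	/* The cache entry is always tagged with the current generation here. */
	assert(cache->generation == global_generation);
	return cache->cpu;
}
```

Even in this toy form, every call pays for a load of the global generation
and a compare, and the result can be stale until the generation changes.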

	-hpa



* Re: [PATCH v2] vdso: Remove struct getcpu_cache
From: Andy Lutomirski @ 2025-10-13 17:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Weißschuh, Huacai Chen, WANG Xuerui, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Richard Weinberger, Anton Ivanov, Johannes Berg,
	Vincenzo Frascino, Shuah Khan, loongarch, linux-kernel,
	linux-s390, linux-um, linux-api, linux-kselftest
In-Reply-To: <e95dc212-6fd3-43e3-aeb7-bf55917e0cd4@intel.com>

On Mon, Oct 13, 2025 at 7:07 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/13/25 02:20, Thomas Weißschuh wrote:
> > -int __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused);
> > -int __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused)
> > +int __vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused);
> > +int __vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused)
> >  {
> >       int cpu_id;
>
> It would ideally be nice to have a _bit_ more history on this about
> how it became unused and why there is such high confidence that
> userspace never tries to use it.

The theory is that people thought that getcpu was going to be kind of
slow, so userspace would allocate a little cache (IIRC per-thread) and
pass it in, and the vDSO would do, well, something clever to return
the right value.  The something clever was probably based on the idea
that you can't actually tell (in general) if the return value from
getcpu is stale, since you might well get migrated right as the
function returns anyway, so the cache could be something silly like
(jiffies, cpu).

I don't actually remember whether the kernel ever used this.  It's
possible that there are ancient kernels where passing a wild, non-null
pointer would blow up.  But it's certainly safe to pass null, and it's
certainly safe for the kernel to ignore the parameter.
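
For illustration, a minimal userspace sketch of calling getcpu with a NULL
third argument through the raw syscall interface. The helper names
getcpu_no_cache() and current_cpu_id() are made up for this example;
syscall() and SYS_getcpu are the usual libc/Linux interfaces.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fill *cpu and *node; the third (cache) argument is always NULL. */
static long getcpu_no_cache(unsigned int *cpu, unsigned int *node)
{
	return syscall(SYS_getcpu, cpu, node, NULL);
}

/* Convenience wrapper returning only the CPU number. */
static unsigned int current_cpu_id(void)
{
	unsigned int cpu = 0, node = 0;
	long ret = getcpu_no_cache(&cpu, &node);

	/* With valid pointers the call is not expected to fail. */
	assert(ret == 0);
	(void)ret;
	return cpu;
}
```

The value can of course be stale by the time the caller looks at it, since
the thread may be migrated right after the call returns.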

--Andy

>
> Let's say someone comes along in a few years and wants to use this
> 'unused' parameter. Could they?
>


* Re: [PATCH v2] vdso: Remove struct getcpu_cache
From: H. Peter Anvin @ 2025-10-13 16:14 UTC (permalink / raw)
  To: Dave Hansen, Thomas Weißschuh, Huacai Chen, WANG Xuerui,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Richard Weinberger, Anton Ivanov, Johannes Berg,
	Vincenzo Frascino, Shuah Khan
  Cc: loongarch, linux-kernel, linux-s390, linux-um, linux-api,
	linux-kselftest
In-Reply-To: <e95dc212-6fd3-43e3-aeb7-bf55917e0cd4@intel.com>

On October 13, 2025 7:06:55 AM PDT, Dave Hansen <dave.hansen@intel.com> wrote:
>On 10/13/25 02:20, Thomas Weißschuh wrote:
>> -int __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused);
>> -int __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused)
>> +int __vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused);
>> +int __vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused)
>>  {
>>  	int cpu_id;
>
>It would ideally be nice to have a _bit_ more history on this about
>how it became unused and why there is such high confidence that
>userspace never tries to use it.
>
>Let's say someone comes along in a few years and wants to use this
>'unused' parameter. Could they?

I believe it was a private storage area for the kernel to use... which it doesn't. Not doing anything at all with the pointer is perfectly legitimate.


* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Pratyush Yadav @ 2025-10-13 15:23 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, jgg, parav, leonro, witu, hughd, skhawaja,
	chrisl, steven.sistare
In-Reply-To: <CA+CK2bB6F634HCw_N5z9E5r_LpbGJrucuFb_5fL4da5_W99e4Q@mail.gmail.com>

On Thu, Oct 09 2025, Pasha Tatashin wrote:

> On Thu, Oct 9, 2025 at 6:58 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>>
>> On Tue, Oct 07 2025, Pasha Tatashin wrote:
>>
>> > On Sun, Sep 28, 2025 at 9:03 PM Pasha Tatashin
>> > <pasha.tatashin@soleen.com> wrote:
>> >>
>> [...]
>> > 4. New File-Lifecycle-Bound Global State
>> > ----------------------------------------
>> > A new mechanism for managing global state was proposed, designed to be
>> > tied to the lifecycle of the preserved files themselves. This would
>> > allow a file owner (e.g., the IOMMU subsystem) to save and retrieve
>> > global state that is only relevant when one or more of its FDs are
>> > being managed by LUO.
>>
>> Is this going to replace LUO subsystems? If yes, then why? The global
>> state will likely need to have its own lifecycle just like the FDs, and
>> subsystems are a simple and clean abstraction to control that. I get the
>> idea of only "activating" a subsystem when one or more of its FDs are
>> participating in LUO, but we can do that while keeping subsystems
>> around.
>
> Thanks for the feedback. The FLB Global State is not replacing the LUO
> subsystems. On the contrary, it's a higher-level abstraction that is
> itself implemented as a LUO subsystem. The goal is to provide a
> solution for a pattern that emerged during the PCI and IOMMU
> discussions.

Okay, makes sense then. I thought we were removing the subsystems idea.
I didn't follow the PCI and IOMMU discussions that closely.

Side note: I see a dependency between subsystems forming. For example,
the FLB subsystem probably wants to make sure all its dependent
subsystems (like LUO files) go through their callbacks before getting
its callback. Maybe in the current implementation doing it in any order
works, but in general, if it manages data of other subsystems, it should
be serialized after them.

Same with the hugetlb subsystem, for example. At prepare or freeze time,
it would probably be a good idea if the files callbacks finish first. I
would imagine most subsystems would want to go after files.

With the current registration mechanism, the order depends on when the
subsystem is registered, which is hard to control. Maybe we should have
a global list of subsystems in which we can manually specify the order?
Not sure if that is a good idea, just throwing it out there off the top
of my head.
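
To make the idea a bit more concrete, here is a rough userspace sketch of
an explicitly ordered list of subsystems. All names (struct demo_subsystem,
demo_prepare_all() and the demo_*_prepare() callbacks) are hypothetical and
not part of the LUO series. The only point is that the callback order
comes from an explicit field rather than from registration time. Placing
hugetlb before FLB here is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical descriptor; not the real LUO subsystem structure. */
struct demo_subsystem {
	const char *name;
	int order;			/* lower values run first */
	int (*prepare)(void);
};

/* Trace of callback invocations, one tag character per call. */
static char demo_trace[8];
static size_t demo_trace_len;

static int demo_record(char tag)
{
	assert(demo_trace_len < sizeof(demo_trace) - 1);
	demo_trace[demo_trace_len++] = tag;
	return 0;
}

static int demo_files_prepare(void)   { return demo_record('f'); }
static int demo_hugetlb_prepare(void) { return demo_record('h'); }
static int demo_flb_prepare(void)     { return demo_record('g'); }

static int demo_cmp_order(const void *a, const void *b)
{
	const struct demo_subsystem *x = a, *y = b;

	return (x->order > y->order) - (x->order < y->order);
}

/* Sort by the explicit order, then run callbacks; stop at the first error. */
static int demo_prepare_all(struct demo_subsystem *subs, size_t n)
{
	qsort(subs, n, sizeof(*subs), demo_cmp_order);

	for (size_t i = 0; i < n; i++) {
		int ret;

		assert(subs[i].prepare);
		ret = subs[i].prepare();
		if (ret)
			return ret;
	}
	return 0;
}
```

With files given the lowest order value, its callback runs before those of
the subsystems that manage data on its behalf, regardless of the order in
which the entries were added to the array.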

>
> You can see the WIP implementation here, which shows it registering as
> a subsystem named "luo-fh-states-v1-struct":
> https://github.com/soleen/linux/commit/94e191aab6b355d83633718bc4a1d27dda390001
>
> The existing subsystem API is a low-level tool that provides for the
> preservation of a raw 8-byte handle. It doesn't provide locking, nor
> is it explicitly tied to the lifecycle of any higher-level object like
> a file handler. The new API is designed to solve a more specific
> problem: allowing global components (like IOMMU or PCI) to
> automatically track when resources relevant to them are added to or
> removed from preservation. If HugeTLB requires a subsystem, it can
> still use it, but I suspect it might benefit from FLB Global State as
> well.

Hmm, right. Let me see how I can make use of it.

>
>> Here is how I imagine the proposed API would compare against subsystems
>> with hugetlb as an example (hugetlb support is still WIP, so I'm still
>> not clear on specifics, but this is how I imagine it will work):
>>
>> - Hugetlb subsystem needs to track its huge page pools and which pages
>>   are allocated and free. This is its global state. The pools get
>>   reconstructed after kexec. Post-kexec, the free pages are ready for
>>   allocation from other "regular" files and the pages used in LUO files
>>   are reserved.
>>
>> - Pre-kexec, when a hugetlb FD is preserved, it marks that as preserved
>>   in hugetlb's global data structure tracking this. This is runtime data
>>   (say xarray), and _not_ serialized data. Reason being, there are
>>   likely more FDs to come so no point in wasting time serializing just
>>   yet.
>>
>>   This can look something like:
>>
>>   hugetlb_luo_preserve_folio(folio, ...);
>>
>>   Nice and simple.
>>
>>   Compare this with the new proposed API:
>>
>>   liveupdate_fh_global_state_get(h, &hugetlb_data);
>>   // This will have updated the serialized state by now.
>>   hugetlb_luo_preserve_folio(hugetlb_data, folio, ...);
>>   liveupdate_fh_global_state_put(h);
>>
>>   We do the same thing but in a very complicated way.
>>
>> - When the system-wide preserve happens, the hugetlb subsystem gets a
>>   callback to serialize. It converts its runtime global state to
>>   serialized state since now it knows no more FDs will be added.
>>
>>   With the new API, this doesn't need to be done since each FD prepare
>>   already updates serialized state.
>>
>> - If there are no hugetlb FDs, then the hugetlb subsystem doesn't put
>>   anything in LUO. This is the same as with the new API.
>>
>> - If some hugetlb FDs are not restored after liveupdate and the finish
>>   event is triggered, the subsystem gets its finish() handler called and
>>   it can free things up.
>>
>>   I don't get how that would work with the new API.
>
> The new API isn't more complicated; it codifies the common pattern of
> "create on first use, destroy on last use" into a reusable helper,
> saving each file handler from having to reinvent the same reference
> counting and locking scheme. But, as you point out, subsystems provide
> more control, specifically they handle full creation/free instead of
> relying on file-handlers for that.
>
>> My point is, I see subsystems working perfectly fine here and I don't
>> get how the proposed API is any better.
>>
>> Am I missing something?
>
> No, I don't think you are. Your analysis is correct that this is
> achievable with subsystems. The goal of the new API is to make that
> specific, common use case simpler.

Right. Thanks for clarifying.
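
For reference, the "create on first use, destroy on last use" pattern
mentioned above can be sketched in plain userspace C roughly as follows.
This is a generic illustration, not the proposed LUO API: struct
demo_global_state, demo_state_get() and demo_state_put() are made-up
names, and a pthread mutex stands in for whatever locking the kernel side
would use.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical global state shared by all preserved FDs of one owner. */
struct demo_global_state {
	int preserved_fds;
};

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static struct demo_global_state *demo_state;
static unsigned int demo_users;

/* Create the state on first use; returns NULL if allocation fails. */
static struct demo_global_state *demo_state_get(void)
{
	struct demo_global_state *s;

	pthread_mutex_lock(&demo_lock);
	if (!demo_state)
		demo_state = calloc(1, sizeof(*demo_state));
	if (demo_state)
		demo_users++;
	s = demo_state;
	pthread_mutex_unlock(&demo_lock);

	return s;
}

/* Drop one reference; the last user frees the state. */
static void demo_state_put(void)
{
	pthread_mutex_lock(&demo_lock);
	assert(demo_users > 0);
	if (--demo_users == 0) {
		free(demo_state);
		demo_state = NULL;
	}
	pthread_mutex_unlock(&demo_lock);
}
```

Each file handler would pair one get with one put, so the shared state
exists exactly as long as at least one of its FDs is being preserved.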

>
> Pasha

-- 
Regards,
Pratyush Yadav


* Re: [PATCH v2] vdso: Remove struct getcpu_cache
From: Dave Hansen @ 2025-10-13 14:06 UTC (permalink / raw)
  To: Thomas Weißschuh, Huacai Chen, WANG Xuerui, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Richard Weinberger, Anton Ivanov, Johannes Berg,
	Vincenzo Frascino, Shuah Khan
  Cc: loongarch, linux-kernel, linux-s390, linux-um, linux-api,
	linux-kselftest
In-Reply-To: <20251013-getcpu_cache-v2-1-880fbfa3b7cc@linutronix.de>

On 10/13/25 02:20, Thomas Weißschuh wrote:
> -int __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused);
> -int __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused)
> +int __vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused);
> +int __vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused)
>  {
>  	int cpu_id;

It would ideally be nice to have a _bit_ more history on this about
how it became unused and why there is such high confidence that
userspace never tries to use it.

Let's say someone comes along in a few years and wants to use this
'unused' parameter. Could they?


* Re: [PATCH v2 2/3] initrd: remove deprecated code path (linuxrc)
From: Askar Safin @ 2025-10-13 10:29 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: linux-fsdevel, linux-kernel, Linus Torvalds, Greg Kroah-Hartman,
	Christian Brauner, Al Viro, Jan Kara, Christoph Hellwig,
	Jens Axboe, Andy Shevchenko, Aleksa Sarai, Thomas Weißschuh,
	Julian Stecklina, Gao Xiang, Art Nikpal, Andrew Morton,
	Alexander Graf, Rob Landley, Lennart Poettering, linux-arch,
	linux-block, initramfs, linux-api, linux-doc, Michal Simek,
	Luis Chamberlain, Kees Cook, Thorsten Blum, Heiko Carstens,
	Arnd Bergmann, Dave Young, Christophe Leroy, Krzysztof Kozlowski,
	Borislav Petkov, Jessica Clarke, Nicolas Schichan,
	David Disseldorp, patches
In-Reply-To: <07ae142e-4266-44a3-9aa1-4b2acbd72c1b@infradead.org>

On Fri, Oct 10, 2025 at 10:31 PM Randy Dunlap <rdunlap@infradead.org> wrote:
> There are more places in Documentation/ that refer to "linuxrc".
> Should those also be removed or fixed?
>
> accounting/delay-accounting.rst
> admin-guide/initrd.rst
> driver-api/early-userspace/early_userspace_support.rst
> power/swsusp-dmcrypt.rst
> translations/zh_CN/accounting/delay-accounting.rst

Yes, they should be removed.
I kept this patchset minimal so that it is easy to revert.
I will remove these linuxrc mentions in a cleanup patchset.

-- 
Askar Safin


* Re: [PATCH v2 2/3] initrd: remove deprecated code path (linuxrc)
From: Askar Safin @ 2025-10-13  9:59 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: linux-fsdevel, linux-kernel, Linus Torvalds, Greg Kroah-Hartman,
	Christian Brauner, Al Viro, Jan Kara, Christoph Hellwig,
	Jens Axboe, Aleksa Sarai, Thomas Weißschuh, Julian Stecklina,
	Gao Xiang, Art Nikpal, Andrew Morton, Alexander Graf, Rob Landley,
	Lennart Poettering, linux-arch, linux-block, initramfs, linux-api,
	linux-doc, Michal Simek, Luis Chamberlain, Kees Cook,
	Thorsten Blum, Heiko Carstens, Arnd Bergmann, Dave Young,
	Christophe Leroy, Krzysztof Kozlowski, Borislav Petkov,
	Jessica Clarke, Nicolas Schichan, David Disseldorp, patches
In-Reply-To: <CAHp75VezkZ7A1VOP8cBH8h0DKVumP66jjUbepMCP87wGOrh+MQ@mail.gmail.com>

On Fri, Oct 10, 2025 at 6:05 PM Andy Shevchenko
<andy.shevchenko@gmail.com> wrote:
> > -       noinitrd        [RAM] Tells the kernel not to load any configured
> > +       noinitrd        [Deprecated,RAM] Tells the kernel not to load any configured
> >                         initial RAM disk.
>
> How is one supposed to run this when just having a kernel is enough?
> At least an (ex)colleague of mine was a heavy user of this option for
> testing kernel builds on the real HW.

This option applies to initrd only, not to initramfs,
except in EFI mode, where it applies to both.

I will remove this option when I remove initrd.

In EFI mode it is easy to simply not pass an initramfs, so all is okay.

Also, I will clarify the docs in v3.

Also, please reply here:
https://lore.kernel.org/regressions/20250918183336.5633-1-safinaskar@gmail.com/

-- 
Askar Safin


* [PATCH v2] vdso: Remove struct getcpu_cache
From: Thomas Weißschuh @ 2025-10-13  9:20 UTC (permalink / raw)
  To: Huacai Chen, WANG Xuerui, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Richard Weinberger,
	Anton Ivanov, Johannes Berg, Vincenzo Frascino, Shuah Khan
  Cc: loongarch, linux-kernel, linux-s390, linux-um, linux-api,
	linux-kselftest, Thomas Weißschuh

The cache parameter of getcpu() is not used by the kernel and no user
ever passes it in anyways.

Remove the struct and its header.

As a side-effect we get rid of an unwanted inclusion of the linux/
header namespace from vDSO code.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
Changes in v2:
- Rebase on v6.18-rc1
- Link to v1: https://lore.kernel.org/r/20250826-getcpu_cache-v1-1-8748318f6141@linutronix.de
---
We could also completely remove the parameter, but I am not sure if
that is a good idea for syscalls and vDSO entrypoints.
---
 arch/loongarch/vdso/vgetcpu.c                   |  5 ++---
 arch/s390/kernel/vdso64/getcpu.c                |  3 +--
 arch/s390/kernel/vdso64/vdso.h                  |  4 +---
 arch/x86/entry/vdso/vgetcpu.c                   |  5 ++---
 arch/x86/include/asm/vdso/processor.h           |  4 +---
 arch/x86/um/vdso/um_vdso.c                      |  7 +++----
 include/linux/getcpu.h                          | 19 -------------------
 include/linux/syscalls.h                        |  3 +--
 kernel/sys.c                                    |  4 +---
 tools/testing/selftests/vDSO/vdso_test_getcpu.c |  4 +---
 10 files changed, 13 insertions(+), 45 deletions(-)

diff --git a/arch/loongarch/vdso/vgetcpu.c b/arch/loongarch/vdso/vgetcpu.c
index 5301cd9d0f839eb0fd7b73a1d36e80aaa75d5e76..aefba899873ed035d70766a95b0b6fea881e94df 100644
--- a/arch/loongarch/vdso/vgetcpu.c
+++ b/arch/loongarch/vdso/vgetcpu.c
@@ -4,7 +4,6 @@
  */
 
 #include <asm/vdso.h>
-#include <linux/getcpu.h>
 
 static __always_inline int read_cpu_id(void)
 {
@@ -20,8 +19,8 @@ static __always_inline int read_cpu_id(void)
 }
 
 extern
-int __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused);
-int __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused)
+int __vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused);
+int __vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused)
 {
 	int cpu_id;
 
diff --git a/arch/s390/kernel/vdso64/getcpu.c b/arch/s390/kernel/vdso64/getcpu.c
index 5c5d4a848b7669436e73df8e3b711e5b876eb3db..1e17665616c5fa766ca66c8de276b212528934bd 100644
--- a/arch/s390/kernel/vdso64/getcpu.c
+++ b/arch/s390/kernel/vdso64/getcpu.c
@@ -2,11 +2,10 @@
 /* Copyright IBM Corp. 2020 */
 
 #include <linux/compiler.h>
-#include <linux/getcpu.h>
 #include <asm/timex.h>
 #include "vdso.h"
 
-int __s390_vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
+int __s390_vdso_getcpu(unsigned *cpu, unsigned *node, void *unused)
 {
 	union tod_clock clk;
 
diff --git a/arch/s390/kernel/vdso64/vdso.h b/arch/s390/kernel/vdso64/vdso.h
index 9e5397e7b590a23c149ccc6043d0c0b0d5ea8457..cadd307d7a365cabf53f5c8d313be3718625533d 100644
--- a/arch/s390/kernel/vdso64/vdso.h
+++ b/arch/s390/kernel/vdso64/vdso.h
@@ -4,9 +4,7 @@
 
 #include <vdso/datapage.h>
 
-struct getcpu_cache;
-
-int __s390_vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused);
+int __s390_vdso_getcpu(unsigned *cpu, unsigned *node, void *unused);
 int __s390_vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz);
 int __s390_vdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts);
 int __s390_vdso_clock_getres(clockid_t clock, struct __kernel_timespec *ts);
diff --git a/arch/x86/entry/vdso/vgetcpu.c b/arch/x86/entry/vdso/vgetcpu.c
index e4640306b2e3c95d74d73037ab6b09294b8e1d6c..6381b472b7c52487bccf3cbf0664c3d7a0e59699 100644
--- a/arch/x86/entry/vdso/vgetcpu.c
+++ b/arch/x86/entry/vdso/vgetcpu.c
@@ -6,17 +6,16 @@
  */
 
 #include <linux/kernel.h>
-#include <linux/getcpu.h>
 #include <asm/segment.h>
 #include <vdso/processor.h>
 
 notrace long
-__vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
+__vdso_getcpu(unsigned *cpu, unsigned *node, void *unused)
 {
 	vdso_read_cpunode(cpu, node);
 
 	return 0;
 }
 
-long getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
+long getcpu(unsigned *cpu, unsigned *node, void *tcache)
 	__attribute__((weak, alias("__vdso_getcpu")));
diff --git a/arch/x86/include/asm/vdso/processor.h b/arch/x86/include/asm/vdso/processor.h
index 7000aeb59aa287e2119c3d43ab3eaf82befb59c4..93e0e24e5cb47f7b0056c13f2a7f2304ed4a0595 100644
--- a/arch/x86/include/asm/vdso/processor.h
+++ b/arch/x86/include/asm/vdso/processor.h
@@ -18,9 +18,7 @@ static __always_inline void cpu_relax(void)
 	native_pause();
 }
 
-struct getcpu_cache;
-
-notrace long __vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused);
+notrace long __vdso_getcpu(unsigned *cpu, unsigned *node, void *unused);
 
 #endif /* __ASSEMBLER__ */
 
diff --git a/arch/x86/um/vdso/um_vdso.c b/arch/x86/um/vdso/um_vdso.c
index cbae2584124fd0ff0f9d240c33fefb8d213c84cd..9aa2c62cce6b7a07bbaf8441014d347162d1950d 100644
--- a/arch/x86/um/vdso/um_vdso.c
+++ b/arch/x86/um/vdso/um_vdso.c
@@ -10,14 +10,13 @@
 #define DISABLE_BRANCH_PROFILING
 
 #include <linux/time.h>
-#include <linux/getcpu.h>
 #include <asm/unistd.h>
 
 /* workaround for -Wmissing-prototypes warnings */
 int __vdso_clock_gettime(clockid_t clock, struct __kernel_old_timespec *ts);
 int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz);
 __kernel_old_time_t __vdso_time(__kernel_old_time_t *t);
-long __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused);
+long __vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused);
 
 int __vdso_clock_gettime(clockid_t clock, struct __kernel_old_timespec *ts)
 {
@@ -60,7 +59,7 @@ __kernel_old_time_t __vdso_time(__kernel_old_time_t *t)
 __kernel_old_time_t time(__kernel_old_time_t *t) __attribute__((weak, alias("__vdso_time")));
 
 long
-__vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused)
+__vdso_getcpu(unsigned int *cpu, unsigned int *node, void *unused)
 {
 	/*
 	 * UML does not support SMP, we can cheat here. :)
@@ -74,5 +73,5 @@ __vdso_getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *unused
 	return 0;
 }
 
-long getcpu(unsigned int *cpu, unsigned int *node, struct getcpu_cache *tcache)
+long getcpu(unsigned int *cpu, unsigned int *node, void *tcache)
 	__attribute__((weak, alias("__vdso_getcpu")));
diff --git a/include/linux/getcpu.h b/include/linux/getcpu.h
deleted file mode 100644
index c304dcdb4eac2a9117080e6a14f4e3f28d07fd56..0000000000000000000000000000000000000000
--- a/include/linux/getcpu.h
+++ /dev/null
@@ -1,19 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_GETCPU_H
-#define _LINUX_GETCPU_H 1
-
-/* Cache for getcpu() to speed it up. Results might be a short time
-   out of date, but will be faster.
-
-   User programs should not refer to the contents of this structure.
-   I repeat they should not refer to it. If they do they will break
-   in future kernels.
-
-   It is only a private cache for vgetcpu(). It will change in future kernels.
-   The user program must store this information per thread (__thread)
-   If you want 100% accurate information pass NULL instead. */
-struct getcpu_cache {
-	unsigned long blob[128 / sizeof(long)];
-};
-
-#endif
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 66c06fcdfe19e27b99eb9a187c22e022e260802f..403488e5eba906ecf40975fc3cb29ed0402491f2 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -59,7 +59,6 @@ struct compat_stat;
 struct old_timeval32;
 struct robust_list_head;
 struct futex_waitv;
-struct getcpu_cache;
 struct old_linux_dirent;
 struct perf_event_attr;
 struct file_handle;
@@ -714,7 +713,7 @@ asmlinkage long sys_getrusage(int who, struct rusage __user *ru);
 asmlinkage long sys_umask(int mask);
 asmlinkage long sys_prctl(int option, unsigned long arg2, unsigned long arg3,
 			unsigned long arg4, unsigned long arg5);
-asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache);
+asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, void __user *cache);
 asmlinkage long sys_gettimeofday(struct __kernel_old_timeval __user *tv,
 				struct timezone __user *tz);
 asmlinkage long sys_settimeofday(struct __kernel_old_timeval __user *tv,
diff --git a/kernel/sys.c b/kernel/sys.c
index 8b58eece4e580b883d19bb1336aff627ae783a4d..f1780ab132a3fbce6aac937ade5b9a35d9837f13 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -31,7 +31,6 @@
 #include <linux/tty.h>
 #include <linux/signal.h>
 #include <linux/cn_proc.h>
-#include <linux/getcpu.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/seccomp.h>
 #include <linux/cpu.h>
@@ -2876,8 +2875,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	return error;
 }
 
-SYSCALL_DEFINE3(getcpu, unsigned __user *, cpup, unsigned __user *, nodep,
-		struct getcpu_cache __user *, unused)
+SYSCALL_DEFINE3(getcpu, unsigned __user *, cpup, unsigned __user *, nodep, void __user *, unused)
 {
 	int err = 0;
 	int cpu = raw_smp_processor_id();
diff --git a/tools/testing/selftests/vDSO/vdso_test_getcpu.c b/tools/testing/selftests/vDSO/vdso_test_getcpu.c
index cdeaed45fb26c61f6314c58fe1b71fa0be3c0108..994ce569dc37c6689b1a3c79156e3dfc8bf27f22 100644
--- a/tools/testing/selftests/vDSO/vdso_test_getcpu.c
+++ b/tools/testing/selftests/vDSO/vdso_test_getcpu.c
@@ -16,9 +16,7 @@
 #include "vdso_config.h"
 #include "vdso_call.h"
 
-struct getcpu_cache;
-typedef long (*getcpu_t)(unsigned int *, unsigned int *,
-			 struct getcpu_cache *);
+typedef long (*getcpu_t)(unsigned int *, unsigned int *, void *);
 
 int main(int argc, char **argv)
 {

---
base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787
change-id: 20250825-getcpu_cache-3abcd2e65437

Best regards,
-- 
Thomas Weißschuh <thomas.weissschuh@linutronix.de>



* Re: [PATCH v2 1/3] init: remove deprecated "load_ramdisk" and "prompt_ramdisk" command line parameters
From: Askar Safin @ 2025-10-13  6:05 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: linux-fsdevel, linux-kernel, Linus Torvalds, Greg Kroah-Hartman,
	Christian Brauner, Al Viro, Jan Kara, Christoph Hellwig,
	Jens Axboe, Aleksa Sarai, Thomas Weißschuh, Julian Stecklina,
	Gao Xiang, Art Nikpal, Andrew Morton, Alexander Graf, Rob Landley,
	Lennart Poettering, linux-arch, linux-block, initramfs, linux-api,
	linux-doc, Michal Simek, Luis Chamberlain, Kees Cook,
	Thorsten Blum, Heiko Carstens, Arnd Bergmann, Dave Young,
	Christophe Leroy, Krzysztof Kozlowski, Borislav Petkov,
	Jessica Clarke, Nicolas Schichan, David Disseldorp, patches
In-Reply-To: <CAHp75VeJM_OoCWDX20FhphRi6e7rG9Z4X6zkjx9vFF12n7Ef7A@mail.gmail.com>

On Fri, Oct 10, 2025 at 6:02 PM Andy Shevchenko
<andy.shevchenko@gmail.com> wrote:
> 1) often the last period is missing in the commit messages;
I will fix this in v3.

> 2) in this change it's unclear for how long (in years) the feature has been
> deprecated; e.g. the other patch mentions 2020 for something else.
> I wonder if this one is of similar age.

These two commits were made in 2020, too. I will fix this in v3.

--
Askar Safin


* Re: [PATCH v6 1/5] Wire up lsm_config_self_policy and lsm_config_system_policy syscalls
From: kernel test robot @ 2025-10-11 12:07 UTC (permalink / raw)
  To: Maxime Bélair, linux-security-module
  Cc: oe-kbuild-all, john.johansen, paul, jmorris, serge, mic, kees,
	stephen.smalley.work, casey, takedakn, penguin-kernel, song,
	rdunlap, linux-api, apparmor, linux-kernel, Maxime Bélair
In-Reply-To: <20251010132610.12001-2-maxime.belair@canonical.com>

Hi Maxime,

kernel test robot noticed the following build errors:

[auto build test ERROR on 9c32cda43eb78f78c73aee4aa344b777714e259b]

url:    https://github.com/intel-lab-lkp/linux/commits/Maxime-B-lair/Wire-up-lsm_config_self_policy-and-lsm_config_system_policy-syscalls/20251010-213606
base:   9c32cda43eb78f78c73aee4aa344b777714e259b
patch link:    https://lore.kernel.org/r/20251010132610.12001-2-maxime.belair%40canonical.com
patch subject: [PATCH v6 1/5] Wire up lsm_config_self_policy and lsm_config_system_policy syscalls
config: sh-randconfig-001-20251011 (https://download.01.org/0day-ci/archive/20251011/202510111947.0ObJ6YUH-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 7.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251011/202510111947.0ObJ6YUH-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510111947.0ObJ6YUH-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from kernel/umh.c:9:0:
>> include/linux/syscalls.h:994:45: error: expected ';', ',' or ')' before 'u32'
              u32 __user size, u32 common_flags u32 flags);
                                                ^~~
--
   In file included from kernel/fork.c:56:0:
>> include/linux/syscalls.h:994:45: error: expected ';', ',' or ')' before 'u32'
              u32 __user size, u32 common_flags u32 flags);
                                                ^~~
   kernel/fork.c: In function '__do_sys_clone3':
   kernel/fork.c:3135:2: warning: #warning clone3() entry point is missing, please fix [-Wcpp]
    #warning clone3() entry point is missing, please fix
     ^~~~~~~


vim +994 include/linux/syscalls.h

   817	
   818	/* CONFIG_MMU only */
   819	asmlinkage long sys_swapon(const char __user *specialfile, int swap_flags);
   820	asmlinkage long sys_swapoff(const char __user *specialfile);
   821	asmlinkage long sys_mprotect(unsigned long start, size_t len,
   822					unsigned long prot);
   823	asmlinkage long sys_msync(unsigned long start, size_t len, int flags);
   824	asmlinkage long sys_mlock(unsigned long start, size_t len);
   825	asmlinkage long sys_munlock(unsigned long start, size_t len);
   826	asmlinkage long sys_mlockall(int flags);
   827	asmlinkage long sys_munlockall(void);
   828	asmlinkage long sys_mincore(unsigned long start, size_t len,
   829					unsigned char __user * vec);
   830	asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
   831	asmlinkage long sys_process_madvise(int pidfd, const struct iovec __user *vec,
   832				size_t vlen, int behavior, unsigned int flags);
   833	asmlinkage long sys_process_mrelease(int pidfd, unsigned int flags);
   834	asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
   835				unsigned long prot, unsigned long pgoff,
   836				unsigned long flags);
   837	asmlinkage long sys_mseal(unsigned long start, size_t len, unsigned long flags);
   838	asmlinkage long sys_mbind(unsigned long start, unsigned long len,
   839					unsigned long mode,
   840					const unsigned long __user *nmask,
   841					unsigned long maxnode,
   842					unsigned flags);
   843	asmlinkage long sys_get_mempolicy(int __user *policy,
   844					unsigned long __user *nmask,
   845					unsigned long maxnode,
   846					unsigned long addr, unsigned long flags);
   847	asmlinkage long sys_set_mempolicy(int mode, const unsigned long __user *nmask,
   848					unsigned long maxnode);
   849	asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
   850					const unsigned long __user *from,
   851					const unsigned long __user *to);
   852	asmlinkage long sys_move_pages(pid_t pid, unsigned long nr_pages,
   853					const void __user * __user *pages,
   854					const int __user *nodes,
   855					int __user *status,
   856					int flags);
   857	asmlinkage long sys_rt_tgsigqueueinfo(pid_t tgid, pid_t  pid, int sig,
   858			siginfo_t __user *uinfo);
   859	asmlinkage long sys_perf_event_open(
   860			struct perf_event_attr __user *attr_uptr,
   861			pid_t pid, int cpu, int group_fd, unsigned long flags);
   862	asmlinkage long sys_accept4(int, struct sockaddr __user *, int __user *, int);
   863	asmlinkage long sys_recvmmsg(int fd, struct mmsghdr __user *msg,
   864				     unsigned int vlen, unsigned flags,
   865				     struct __kernel_timespec __user *timeout);
   866	asmlinkage long sys_recvmmsg_time32(int fd, struct mmsghdr __user *msg,
   867				     unsigned int vlen, unsigned flags,
   868				     struct old_timespec32 __user *timeout);
   869	asmlinkage long sys_wait4(pid_t pid, int __user *stat_addr,
   870					int options, struct rusage __user *ru);
   871	asmlinkage long sys_prlimit64(pid_t pid, unsigned int resource,
   872					const struct rlimit64 __user *new_rlim,
   873					struct rlimit64 __user *old_rlim);
   874	asmlinkage long sys_fanotify_init(unsigned int flags, unsigned int event_f_flags);
   875	#if defined(CONFIG_ARCH_SPLIT_ARG64)
   876	asmlinkage long sys_fanotify_mark(int fanotify_fd, unsigned int flags,
   877	                                unsigned int mask_1, unsigned int mask_2,
   878					int dfd, const char  __user * pathname);
   879	#else
   880	asmlinkage long sys_fanotify_mark(int fanotify_fd, unsigned int flags,
   881					  u64 mask, int fd,
   882					  const char  __user *pathname);
   883	#endif
   884	asmlinkage long sys_name_to_handle_at(int dfd, const char __user *name,
   885					      struct file_handle __user *handle,
   886					      void __user *mnt_id, int flag);
   887	asmlinkage long sys_open_by_handle_at(int mountdirfd,
   888					      struct file_handle __user *handle,
   889					      int flags);
   890	asmlinkage long sys_clock_adjtime(clockid_t which_clock,
   891					struct __kernel_timex __user *tx);
   892	asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
   893					struct old_timex32 __user *tx);
   894	asmlinkage long sys_syncfs(int fd);
   895	asmlinkage long sys_setns(int fd, int nstype);
   896	asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
   897	asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
   898				     unsigned int vlen, unsigned flags);
   899	asmlinkage long sys_process_vm_readv(pid_t pid,
   900					     const struct iovec __user *lvec,
   901					     unsigned long liovcnt,
   902					     const struct iovec __user *rvec,
   903					     unsigned long riovcnt,
   904					     unsigned long flags);
   905	asmlinkage long sys_process_vm_writev(pid_t pid,
   906					      const struct iovec __user *lvec,
   907					      unsigned long liovcnt,
   908					      const struct iovec __user *rvec,
   909					      unsigned long riovcnt,
   910					      unsigned long flags);
   911	asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
   912				 unsigned long idx1, unsigned long idx2);
   913	asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
   914	asmlinkage long sys_sched_setattr(pid_t pid,
   915						struct sched_attr __user *attr,
   916						unsigned int flags);
   917	asmlinkage long sys_sched_getattr(pid_t pid,
   918						struct sched_attr __user *attr,
   919						unsigned int size,
   920						unsigned int flags);
   921	asmlinkage long sys_renameat2(int olddfd, const char __user *oldname,
   922				      int newdfd, const char __user *newname,
   923				      unsigned int flags);
   924	asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
   925				    void __user *uargs);
   926	asmlinkage long sys_getrandom(char __user *buf, size_t count,
   927				      unsigned int flags);
   928	asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags);
   929	asmlinkage long sys_bpf(int cmd, union bpf_attr __user *attr, unsigned int size);
   930	asmlinkage long sys_execveat(int dfd, const char __user *filename,
   931				const char __user *const __user *argv,
   932				const char __user *const __user *envp, int flags);
   933	asmlinkage long sys_userfaultfd(int flags);
   934	asmlinkage long sys_membarrier(int cmd, unsigned int flags, int cpu_id);
   935	asmlinkage long sys_mlock2(unsigned long start, size_t len, int flags);
   936	asmlinkage long sys_copy_file_range(int fd_in, loff_t __user *off_in,
   937					    int fd_out, loff_t __user *off_out,
   938					    size_t len, unsigned int flags);
   939	asmlinkage long sys_preadv2(unsigned long fd, const struct iovec __user *vec,
   940				    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
   941				    rwf_t flags);
   942	asmlinkage long sys_pwritev2(unsigned long fd, const struct iovec __user *vec,
   943				    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
   944				    rwf_t flags);
   945	asmlinkage long sys_pkey_mprotect(unsigned long start, size_t len,
   946					  unsigned long prot, int pkey);
   947	asmlinkage long sys_pkey_alloc(unsigned long flags, unsigned long init_val);
   948	asmlinkage long sys_pkey_free(int pkey);
   949	asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
   950				  unsigned mask, struct statx __user *buffer);
   951	asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
   952				 int flags, uint32_t sig);
   953	asmlinkage long sys_open_tree(int dfd, const char __user *path, unsigned flags);
   954	asmlinkage long sys_open_tree_attr(int dfd, const char __user *path,
   955					   unsigned flags,
   956					   struct mount_attr __user *uattr,
   957					   size_t usize);
   958	asmlinkage long sys_move_mount(int from_dfd, const char __user *from_path,
   959				       int to_dfd, const char __user *to_path,
   960				       unsigned int ms_flags);
   961	asmlinkage long sys_mount_setattr(int dfd, const char __user *path,
   962					  unsigned int flags,
   963					  struct mount_attr __user *uattr, size_t usize);
   964	asmlinkage long sys_fsopen(const char __user *fs_name, unsigned int flags);
   965	asmlinkage long sys_fsconfig(int fs_fd, unsigned int cmd, const char __user *key,
   966				     const void __user *value, int aux);
   967	asmlinkage long sys_fsmount(int fs_fd, unsigned int flags, unsigned int ms_flags);
   968	asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags);
   969	asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
   970					       siginfo_t __user *info,
   971					       unsigned int flags);
   972	asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
   973	asmlinkage long sys_landlock_create_ruleset(const struct landlock_ruleset_attr __user *attr,
   974			size_t size, __u32 flags);
   975	asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type rule_type,
   976			const void __user *rule_attr, __u32 flags);
   977	asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
   978	asmlinkage long sys_memfd_secret(unsigned int flags);
   979	asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
   980						    unsigned long home_node,
   981						    unsigned long flags);
   982	asmlinkage long sys_cachestat(unsigned int fd,
   983			struct cachestat_range __user *cstat_range,
   984			struct cachestat __user *cstat, unsigned int flags);
   985	asmlinkage long sys_map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags);
   986	asmlinkage long sys_lsm_get_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
   987					      u32 __user *size, u32 flags);
   988	asmlinkage long sys_lsm_set_self_attr(unsigned int attr, struct lsm_ctx __user *ctx,
   989					      u32 size, u32 flags);
   990	asmlinkage long sys_lsm_list_modules(u64 __user *ids, u32 __user *size, u32 flags);
   991	asmlinkage long sys_lsm_config_self_policy(u32 lsm_id, u32 op, void __user *buf,
   992						   u32 __user size, u32 common_flags, u32 flags);
   993	asmlinkage long sys_lsm_config_system_policy(u32 lsm_id, u32 op, void __user *buf,
 > 994						     u32 __user size, u32 common_flags u32 flags);
   995	
   996	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
From: Andy Lutomirski @ 2025-10-11  4:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
	Raphael S . Carvalho, linux-api, linux-xfs
In-Reply-To: <aOm0WCB_woFgnv0v@dread.disaster.area>

On Fri, Oct 10, 2025 at 6:35 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Oct 08, 2025 at 02:51:14PM -0700, Andy Lutomirski wrote:
> > On Wed, Oct 8, 2025 at 2:27 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Wed, Oct 08, 2025 at 08:22:35AM -0700, Andy Lutomirski wrote:
> > > > On Mon, Oct 6, 2025 at 10:08 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > > >
> > > > > On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> >
> > >
> > > You are conflating "synchronous update" with "blocking".
> > >
> > > Avoiding the need for synchronous timestamp updates is exactly what
> > > the lazytime mount option provides. i.e. lazytime degrades immediate
> > > consistency requirements to eventual consistency similar to how the
> > > default relatime behaviour defers atime updates for eventual
> > > writeback.
> > >
> > > IOWs, we've already largely addressed the synchronous c/mtime update
> > > problem but what we haven't done is made timestamp updates
> > > fully support non-blocking caller semantics. That's a separate
> > > problem...
> >
> > I'm probably missing something, but is this really different?
>
> Yes, and yes.
>
> > Either the mtime update can block or it can't block.
>
> Sure, but that's not the issue we have to deal with.
>
> In many filesystems and fs operations, we have to know if an
> operation is going to block -before- we start the operation. e.g.
> transactional changes cannot be rolled back once we've started the
> modification if they need to block to make progress (e.g. read in
> on-disk metadata).
>
> This foresight, in many cases, is -unknowable-. Even though the
> operation /likely/ won't block, we cannot *guarantee* ahead of time
> that any given instance of the operation will /not/ block.  Hence
> the reliable non-blocking operation that users are asking for is not
> possible with unknowable implementation characteristics like this.
>
> IOWs, a timestamp update implementation can be synchronous and
> reliably non-blocking if it always knows when blocking will occur
> and can return -EAGAIN instead of blocking to complete the
> operation.
>
> If it can't know when/if blocking will occur, then lazytime allows
> us to defer the (potentially) blocking update operation to another
> context that can block. Queuing for async processing can easily be
> made non-blocking, and __mark_inode_dirty(I_DIRTY_TIME) does this
> for us.
>
> So, yeah, it should be pretty obvious at this point that non-blocking
> implementation is completely independent of whether the operation is
> performed synchronously or asynchronously. It's easier to make async
> operations non-blocking, but that doesn't mean "non_blocking" and
> "asynchronous execution" are interchangeable terms or behaviours.
>
> > I haven't dug all the
> > way into exactly what happens in __mark_inode_dirty(), but there is a
> > lot going on in there even in the I_DIRTY_TIME path.
>
> It's pretty simple, really.  __mark_inode_dirty(I_DIRTY_TIME) is
> non-blocking and queues the inode on the wb->i_dirty_time queue
> for later processing.
>

First, I apologize if I'm off base here.

Second, I don't think I'm entirely nuts, and I'm moderately confident
that, ten-ish years ago, I tested lazytime in the hopes that it would
solve my old problem, and IIRC it didn't help.  I was running a
production workload on ext4 on regrettably slow spinning rust backed
by a truly atrocious HPE controller.  And I was running latencytop to
generate little traces when my task got blocked, and there was no form
of AIO involved.  (And I don't really understand how AIO is wired up
internally...  And yes, in retrospect I should not have been using
shared-writable mmaps or even file-backed things at all for what I was
doing, but I had unrealistic expectations of how mmap worked when I
wrote that code more like 20 years ago, and I wasn't even using Linux
at the time I wrote it.)

I'm looking at the code now, and I see what you're talking about, and
__mark_inode_dirty(inode, I_DIRTY_TIME) looks fairly polite and like
it won't block.  But the relevant code seems to be:

int generic_update_time(struct inode *inode, int flags)
{
        int updated = inode_update_timestamps(inode, flags);
        int dirty_flags = 0;

        if (updated & (S_ATIME|S_MTIME|S_CTIME))
                dirty_flags = inode->i_sb->s_flags & SB_LAZYTIME ?
                                I_DIRTY_TIME : I_DIRTY_SYNC;
        if (updated & S_VERSION)
                dirty_flags |= I_DIRTY_SYNC;
        __mark_inode_dirty(inode, dirty_flags);
        ...

inode_update_timestamps does this, where updated != 0 if the timestamp
actually changed (which is subject to some complex coarse-graining
logic so it may only happen some of the time):

                if (IS_I_VERSION(inode) &&
                    inode_maybe_inc_iversion(inode, updated))
                        updated |= S_VERSION;

IS_I_VERSION seems to be unconditionally true on ext4.
inode_maybe_inc_iversion always returns true if updated is set, so
generic_update_time has a decent chance of doing
__mark_inode_dirty(inode, I_DIRTY_SYNC), which calls
s_op->dirty_inode, which calls ext4_journal_start, which, from my
recollection a decade ago, could easily block for a good second or so
on my delightful, now retired, HP/HPE system.

In my case, I think this is the path that was blocking for me in lots
of do_wp_page calls that would otherwise not have blocked.  I also
don't see any kiocb passed around or any mechanism by which this code
could know that it's supposed to be nonblocking, although I have
approximately no understanding of Linux AIO and I don't really know
what I should be looking for.

I could try to instrument the code a bit and test to see if I've
analyzed it right in a few days.

--Andy
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
From: Dave Chinner @ 2025-10-11  1:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
	Raphael S . Carvalho, linux-api, linux-xfs
In-Reply-To: <CALCETrX-cs5MH3k369q2Fk5Q-pYQfEV6CW3va-4E9vD1CoCaGA@mail.gmail.com>

On Wed, Oct 08, 2025 at 02:51:14PM -0700, Andy Lutomirski wrote:
> On Wed, Oct 8, 2025 at 2:27 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Wed, Oct 08, 2025 at 08:22:35AM -0700, Andy Lutomirski wrote:
> > > On Mon, Oct 6, 2025 at 10:08 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > >
> > > > On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> 
> >
> > You are conflating "synchronous update" with "blocking".
> >
> > Avoiding the need for synchronous timestamp updates is exactly what
> > the lazytime mount option provides. i.e. lazytime degrades immediate
> > consistency requirements to eventual consistency similar to how the
> > default relatime behaviour defers atime updates for eventual
> > writeback.
> >
> > IOWs, we've already largely addressed the synchronous c/mtime update
> > problem but what we haven't done is made timestamp updates
> > fully support non-blocking caller semantics. That's a separate
> > problem...
> 
> I'm probably missing something, but is this really different?

Yes, and yes.

> Either the mtime update can block or it can't block.

Sure, but that's not the issue we have to deal with.

In many filesystems and fs operations, we have to know if an
operation is going to block -before- we start the operation. e.g.
transactional changes cannot be rolled back once we've started the
modification if they need to block to make progress (e.g. read in
on-disk metadata).

This foresight, in many cases, is -unknowable-. Even though the
operation /likely/ won't block, we cannot *guarantee* ahead of time
that any given instance of the operation will /not/ block.  Hence
the reliable non-blocking operation that users are asking for is not
possible with unknowable implementation characteristics like this.

IOWs, a timestamp update implementation can be synchronous and
reliably non-blocking if it always knows when blocking will occur
and can return -EAGAIN instead of blocking to complete the
operation.

If it can't know when/if blocking will occur, then lazytime allows
us to defer the (potentially) blocking update operation to another
context that can block. Queuing for async processing can easily be
made non-blocking, and __mark_inode_dirty(I_DIRTY_TIME) does this
for us.

So, yeah, it should be pretty obvious at this point that non-blocking
implementation is completely independent of whether the operation is
performed synchronously or asynchronously. It's easier to make async
operations non-blocking, but that doesn't mean "non_blocking" and
"asynchronous execution" are interchangeable terms or behaviours.

> I haven't dug all the
> way into exactly what happens in __mark_inode_dirty(), but there is a
> lot going on in there even in the I_DIRTY_TIME path.

It's pretty simple, really.  __mark_inode_dirty(I_DIRTY_TIME) is
non-blocking and queues the inode on the wb->i_dirty_time queue
for later processing.

> And Pavel is
> saying that AIO and mtime updates don't play along well.

Again: this is exactly why lazytime was added to XFS *ten years
ago*. From 2015 (issue #3):

https://lore.kernel.org/linux-xfs/CAD-J=zZh1dtJsfrW_Gwxjg+qvkZMu7ED-QOXrMMO6B-G0HY2-A@mail.gmail.com/

(Oh, look, a discussion that starts from a user suggestion of
exposing FMODE_NOCMTIME to userspace apps! Sound familiar?)

> > IOWs, with lazytime, writeback already persists timestamp updates
> > when appropriate for best performance.
> 
> I'm probably doing a bad job explaining myself.

No, I think Christoph and I both understand exactly what you
are trying to describe.

It seems to me that you haven't yet understood that lazytime already
does exactly what you are asking for. Hence you think we don't
understand the "lazytime" concept you are proposing and keep trying
to reinvent lazytime to convince us that we need "lazytime"
functionality in the kernel...

> > > Thinking out loud, to handle both write_iter and mmap, there might
> > > need to be two bits: one saying "the timestamp needs to be updated"
> > > and another saying "the timestamp has been updated in the in-memory
> > > inode, but the inode hasn't been dirtied yet".
> >
> > The flag that implements the latter is called I_DIRTY_TIME. We have
> > not implemented the former as that's a userspace visible change of
> > behaviour.
> 
> Maybe that change should be done?  Or not -- it wouldn't be terribly
> hard to have a pair of atomic timestamps in struct inode indicating
> what timestamps we want to write the next time we get around to it.

See, you just reinvented the lazytime mechanism. Again. :/

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v6 1/5] Wire up lsm_config_self_policy and lsm_config_system_policy syscalls
From: Casey Schaufler @ 2025-10-10 21:13 UTC (permalink / raw)
  To: Song Liu, Maxime Bélair
  Cc: linux-security-module, john.johansen, paul, jmorris, serge, mic,
	kees, stephen.smalley.work, takedakn, penguin-kernel, rdunlap,
	linux-api, apparmor, linux-kernel, Casey Schaufler
In-Reply-To: <CAHzjS_uBq8xGCSmHC_kBWi0j8DCdwsy4XtfkH2iH6NygCcChNw@mail.gmail.com>

On 10/10/2025 11:06 AM, Song Liu wrote:
> On Fri, Oct 10, 2025 at 6:27 AM Maxime Bélair
> <maxime.belair@canonical.com> wrote:
> [...]
>> --- a/security/lsm_syscalls.c
>> +++ b/security/lsm_syscalls.c
>> @@ -118,3 +118,15 @@ SYSCALL_DEFINE3(lsm_list_modules, u64 __user *, ids, u32 __user *, size,
>>
>>         return lsm_active_cnt;
>>  }
>> +
>> +SYSCALL_DEFINE6(lsm_config_self_policy, u32, lsm_id, u32, op, void __user *,
>> +               buf, u32 __user, size, u32, common_flags, u32, flags)
>> +{
>> +       return 0;
>> +}
>> +
>> +SYSCALL_DEFINE6(lsm_config_system_policy, u32, lsm_id, u32, op, void __user *,
>> +               buf, u32 __user, size, u32, common_flags, u32, flags)
>> +{
>> +       return 0;
>> +}
> These two APIs look the same. Why not just keep one API and use
> one bit in the flag to differentiate "self" vs. "system"?

I think that's a valid point.

>
> Thanks,
> Song
>


