Linux Trace Kernel

Linux Trace Kernel
 help / color / mirror / Atom feed

* Re: [PATCH v4] kernel/trace: fixed static warnings
From: Steven Rostedt @ 2026-04-08 20:24 UTC (permalink / raw)
  To: abhijithsriram95
  Cc: Masami Hiramatsu, Mathieu Desnoyers, open list:TRACING,
	open list:TRACING
In-Reply-To: <20260406072834.243491-2-abhijithsriram95@gmail.com>


Subject should be:

 tracing: Fixed static checker warnings

On Mon,  6 Apr 2026 09:28:34 +0200
abhijithsriram95@gmail.com wrote:

> From: Abhijith Sriram <abhijithsriram95@gmail.com>
> 
> The change in the function argument description
> was due to the static code checker script reading
> the word filter back to back
> 

The below changes should be beneath the '---'

> Changes in v2:

The last change should be first. In fact, I only care about the last change
as the previous versions should have the description of what changed.

> - corrected *m = file->private_data to m = file->private_data
> 
> Changes in v3:
> - reverted the changes for struct seq_file *m and
>   added a new empty line instead
> 
> Changes in v4:

That said, this should really be:

 Changes since v3: https://lore.kernel.org/all/20260406060046.223496-2-abhijithsriram95@gmail.com/


> - added a new empty line before char *buf ...
>   previously this line was relocated to avoid the
>   static check warning.
> 
> Signed-off-by: Abhijith Sriram <abhijithsriram95@gmail.com>
> ---
>  kernel/trace/trace_events_trigger.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
> index 655db2e82513..664283bcd9ea 100644
> --- a/kernel/trace/trace_events_trigger.c
> +++ b/kernel/trace/trace_events_trigger.c
> @@ -246,7 +246,7 @@ event_triggers_post_call(struct trace_event_file *file,
>  }
>  EXPORT_SYMBOL_GPL(event_triggers_post_call);
>  
> -#define SHOW_AVAILABLE_TRIGGERS	(void *)(1UL)
> +#define SHOW_AVAILABLE_TRIGGERS	((void *)(1UL))
>  
>  static void *trigger_next(struct seq_file *m, void *t, loff_t *pos)
>  {
> @@ -352,6 +352,7 @@ static int event_trigger_regex_open(struct inode *inode, struct file *file)
>  		ret = seq_open(file, &event_triggers_seq_ops);
>  		if (!ret) {
>  			struct seq_file *m = file->private_data;
> +

This blank line makes the code look worse. Yes, we usually want a blank
line between the variable declarations and the code, but when it comes to
code blocks (not functions) that rule is not as strict.

Get rid of this newline.


>  			m->private = file;
>  		}
>  	}
> @@ -390,6 +391,7 @@ static ssize_t event_trigger_regex_write(struct file *file,
>  {
>  	struct trace_event_file *event_file;
>  	ssize_t ret;
> +
>  	char *buf __free(kfree) = NULL;

The char *buf is a declaration. It no new line is expected before it.

>  
>  	if (!cnt)
> @@ -633,6 +635,7 @@ clear_event_triggers(struct trace_array *tr)
>  
>  	list_for_each_entry(file, &tr->events, list) {
>  		struct event_trigger_data *data, *n;
> +

Again, if it's in a code block, don't change it.

-- Steve

>  		list_for_each_entry_safe(data, n, &file->triggers, list) {
>  			trace_event_trigger_enable_disable(file, 0);
>  			list_del_rcu(&data->list);
> @@ -785,7 +788,7 @@ static void unregister_trigger(char *glob,
>   *   cmd               - the trigger command name
>   *   glob              - the trigger command name optionally prefaced with '!'
>   *   param_and_filter  - text following cmd and ':'
> - *   param             - text following cmd and ':' and stripped of filter
> + *   param             - text following cmd and ':' and filter removed
>   *   filter            - the optional filter text following (and including) 'if'
>   *
>   * To illustrate the use of these components, here are some concrete


^ permalink raw reply

* Re: [PATCH bpf-next v4 2/2] selftests/bpf: Add test to ensure kprobe_multi is not sleepable
From: Jiri Olsa @ 2026-04-08 20:13 UTC (permalink / raw)
  To: Varun R Mallya
  Cc: bpf, leon.hwang, memxor, ast, daniel, yonghong.song, rostedt,
	linux-kernel, linux-trace-kernel
In-Reply-To: <adai2mXaRGpf1Kvb@computer>

On Thu, Apr 09, 2026 at 12:17:54AM +0530, Varun R Mallya wrote:
> On Thu, Apr 09, 2026 at 12:05:49AM +0530, Varun R Mallya wrote:
> > @@ -676,5 +750,7 @@ void test_kprobe_multi_test(void)
> >  		test_unique_match();
> >  	if (test__start_subtest("attach_write_ctx"))
> >  		test_attach_write_ctx();
> > +	if (test__start_subtest("attach_multi_sleepable"))
> > +		test_attach_multi_sleepable();
> >  	RUN_TESTS(kprobe_multi_verifier);
> Please ignore this patch. I will send a v5 in a few minutes. I forgot to
> remove the selftest from the previous location after moving it into
> attach_api_fails.

also no need to send patch#1 it's already in:
  eb7024bfcc5f bpf: Reject sleepable kprobe_multi programs at attach time

jirka


> > +}
> > +
> > +char _license[] SEC("license") = "GPL";
> > -- 
> > 2.53.0
> > 

^ permalink raw reply

* Re: [PATCH v2] seq_buf: export seq_buf_putmem_hex() and add KUnit tests
From: Steven Rostedt @ 2026-04-08 20:09 UTC (permalink / raw)
  To: Shuvam Pandey
  Cc: Andrew Morton, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260408144449.57375-1-shuvampandey1@gmail.com>

On Wed,  8 Apr 2026 20:29:49 +0545
Shuvam Pandey <shuvampandey1@gmail.com> wrote:

> seq_buf: export seq_buf_putmem_hex() and add KUnit tests
> 
> The seq_buf KUnit suite does not exercise seq_buf_putmem_hex().
> 
> Add one test for the len > 8 chunking path and one overflow test
> where a later chunk no longer fits in the buffer.
> 
> Export seq_buf_putmem_hex() as well so SEQ_BUF_KUNIT_TEST=m links
> cleanly. Without the export, modpost reports seq_buf_putmem_hex as
> undefined when seq_buf_kunit is built as a module.
> 

Please always send a new version as its own thread and not as a reply to
the old version.

Also, for tracing and seq_buf, use capital letters in the subject:

  seq_buf: Export seq_buf_putmem_hex() and add KUnit tests

> Signed-off-by: Shuvam Pandey <shuvampandey1@gmail.com>
> ---
> v2:

This should be:

 Changes since v1: https://lore.kernel.org/all/20260406033728.25998-1-shuvampandey1@gmail.com/


> - export seq_buf_putmem_hex() so SEQ_BUF_KUNIT_TEST=m links cleanly
> - validate with a fresh arm64 build using CONFIG_KUNIT=y and CONFIG_SEQ_BUF_KUNIT_TEST=m
> 
>  lib/seq_buf.c             |  1 +
>  lib/tests/seq_buf_kunit.c | 34 ++++++++++++++++++++++++++++++++++

Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>

-- Steve


^ permalink raw reply

* Re: [PATCH v2] seq_buf: export seq_buf_putmem_hex() and add KUnit tests
From: Steven Rostedt @ 2026-04-08 20:09 UTC (permalink / raw)
  To: Shuvam Pandey
  Cc: Andrew Morton, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, David Gow
In-Reply-To: <20260408144449.57375-1-shuvampandey1@gmail.com>

On Wed,  8 Apr 2026 20:29:49 +0545
Shuvam Pandey <shuvampandey1@gmail.com> wrote:

> seq_buf: export seq_buf_putmem_hex() and add KUnit tests
> 
> The seq_buf KUnit suite does not exercise seq_buf_putmem_hex().
> 
> Add one test for the len > 8 chunking path and one overflow test
> where a later chunk no longer fits in the buffer.
> 
> Export seq_buf_putmem_hex() as well so SEQ_BUF_KUNIT_TEST=m links
> cleanly. Without the export, modpost reports seq_buf_putmem_hex as
> undefined when seq_buf_kunit is built as a module.
> 

Please always send a new version as its own thread and not as a reply to
the old version.

Also, for tracing and seq_buf, use capital letters in the subject:

  seq_buf: Export seq_buf_putmem_hex() and add KUnit tests

> Signed-off-by: Shuvam Pandey <shuvampandey1@gmail.com>
> ---
> v2:

This should be:

 Changes since v1: https://lore.kernel.org/all/20260406033728.25998-1-shuvampandey1@gmail.com/


> - export seq_buf_putmem_hex() so SEQ_BUF_KUNIT_TEST=m links cleanly
> - validate with a fresh arm64 build using CONFIG_KUNIT=y and CONFIG_SEQ_BUF_KUNIT_TEST=m
> 
>  lib/seq_buf.c             |  1 +
>  lib/tests/seq_buf_kunit.c | 34 ++++++++++++++++++++++++++++++++++

Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>

-- Steve


^ permalink raw reply

* Re: [RFC v6 6/7] ext4: fast commit: add lock_updates tracepoint
From: Steven Rostedt @ 2026-04-08 20:04 UTC (permalink / raw)
  To: Li Chen
  Cc: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, Masami Hiramatsu,
	Mathieu Desnoyers, linux-ext4, linux-kernel, linux-trace-kernel
In-Reply-To: <20260408112020.716706-7-me@linux.beauty>

On Wed,  8 Apr 2026 19:20:17 +0800
Li Chen <me@linux.beauty> wrote:

> Commit-time fast commit snapshots run under jbd2_journal_lock_updates(),
> so it is useful to quantify the time spent with updates locked and to
> understand why snapshotting can fail.
> 
> Add a new tracepoint, ext4_fc_lock_updates, reporting the time spent in
> the updates-locked window along with the number of snapshotted inodes
> and ranges. Record the first snapshot failure reason in a stable snap_err
> field for tooling.
> 

[..]

> @@ -1338,13 +1375,13 @@ static int ext4_fc_perform_commit(journal_t *journal)
>  	if (ret)
>  		return ret;
>  
> -
>  	ret = ext4_fc_alloc_snapshot_inodes(sb, &inodes, &inodes_size);
>  	if (ret)
>  		return ret;
>  
>  	/* Step 4: Mark all inodes as being committed. */
>  	jbd2_journal_lock_updates(journal);
> +	lock_start = ktime_get();

ktime_get() is rather quick but if you care about micro-optimizations, you
could have:

	if (trace_ext4_fc_lock_updates_enabled())
		lock_start = ktime_get();
	else	
		lock_start = 0;

>  	/*
>  	 * The journal is now locked. No more handles can start and all the
>  	 * previous handles are now drained. Snapshotting happens in this
> @@ -1358,8 +1395,15 @@ static int ext4_fc_perform_commit(journal_t *journal)
>  	}
>  	ext4_fc_unlock(sb, alloc_ctx);
>  
> -	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size);
> +	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size,
> +				      &snap_inodes, &snap_ranges, &snap_err);
>  	jbd2_journal_unlock_updates(journal);
> +	if (trace_ext4_fc_lock_updates_enabled()) {

	if (trace_ext4_fc_lock_updates_enabled() && lock_start) {

But feel free to ignore this if the overhead of always calling ktime_get()
is not an issue.


> +		locked_ns = ktime_to_ns(ktime_sub(ktime_get(), lock_start));
> +		trace_ext4_fc_lock_updates(sb, commit_tid, locked_ns,
> +					   snap_inodes, snap_ranges, ret,
> +					   snap_err);
> +	}
>  	kvfree(inodes);
>  	if (ret)
>  		return ret;
> @@ -1564,7 +1608,7 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
>  		journal_ioprio = EXT4_DEF_JOURNAL_IOPRIO;
>  	set_task_ioprio(current, journal_ioprio);
>  	fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize;
> -	ret = ext4_fc_perform_commit(journal);
> +	ret = ext4_fc_perform_commit(journal, commit_tid);
>  	if (ret < 0) {
>  		if (ret == -EAGAIN || ret == -E2BIG || ret == -ECANCELED)
>  			status = EXT4_FC_STATUS_INELIGIBLE;
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index f493642cf121..7028a28316fa 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -107,6 +107,26 @@ TRACE_DEFINE_ENUM(EXT4_FC_REASON_VERITY);
>  TRACE_DEFINE_ENUM(EXT4_FC_REASON_MOVE_EXT);
>  TRACE_DEFINE_ENUM(EXT4_FC_REASON_MAX);
>  
> +#undef EM
> +#undef EMe
> +#define EM(a)	TRACE_DEFINE_ENUM(EXT4_FC_SNAP_ERR_##a);
> +#define EMe(a)	TRACE_DEFINE_ENUM(EXT4_FC_SNAP_ERR_##a);
> +
> +#define TRACE_SNAP_ERR						\
> +	EM(NONE)						\
> +	EM(ES_MISS)						\
> +	EM(ES_DELAYED)						\
> +	EM(ES_OTHER)						\
> +	EM(INODES_CAP)						\
> +	EM(RANGES_CAP)						\
> +	EM(NOMEM)						\
> +	EMe(INODE_LOC)
> +
> +TRACE_SNAP_ERR
> +
> +#undef EM
> +#undef EMe
> +
>  #define show_fc_reason(reason)						\
>  	__print_symbolic(reason,					\
>  		{ EXT4_FC_REASON_XATTR,		"XATTR"},		\
> @@ -2818,6 +2838,47 @@ TRACE_EVENT(ext4_fc_commit_stop,
>  		  __entry->num_fc_ineligible, __entry->nblks_agg, __entry->tid)
>  );
>  
> +#define EM(a)	{ EXT4_FC_SNAP_ERR_##a, #a },
> +#define EMe(a)	{ EXT4_FC_SNAP_ERR_##a, #a }
> +
> +TRACE_EVENT(ext4_fc_lock_updates,
> +	    TP_PROTO(struct super_block *sb, tid_t commit_tid, u64 locked_ns,
> +		     unsigned int nr_inodes, unsigned int nr_ranges, int err,
> +		     int snap_err),
> +
> +	TP_ARGS(sb, commit_tid, locked_ns, nr_inodes, nr_ranges, err, snap_err),
> +
> +	TP_STRUCT__entry(/* entry */
> +		__field(dev_t, dev)
> +		__field(tid_t, tid)
> +		__field(u64, locked_ns)
> +		__field(unsigned int, nr_inodes)
> +		__field(unsigned int, nr_ranges)
> +		__field(int, err)
> +		__field(int, snap_err)
> +	),
> +
> +	TP_fast_assign(/* assign */
> +		__entry->dev = sb->s_dev;
> +		__entry->tid = commit_tid;
> +		__entry->locked_ns = locked_ns;
> +		__entry->nr_inodes = nr_inodes;
> +		__entry->nr_ranges = nr_ranges;
> +		__entry->err = err;
> +		__entry->snap_err = snap_err;
> +	),
> +
> +	TP_printk("dev %d,%d tid %u locked_ns %llu nr_inodes %u nr_ranges %u err %d snap_err %s",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->tid,
> +		  __entry->locked_ns, __entry->nr_inodes, __entry->nr_ranges,
> +		  __entry->err, __print_symbolic(__entry->snap_err,
> +						 TRACE_SNAP_ERR))
> +);
> +
> +#undef EM
> +#undef EMe
> +#undef TRACE_SNAP_ERR
> +
>  #define FC_REASON_NAME_STAT(reason)					\
>  	show_fc_reason(reason),						\
>  	__entry->fc_ineligible_rc[reason]

As for the rest:

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>

[ Please add this reviewed-by to any new versions so I remember I already
  looked at it. ]

Thanks,

-- Steve

^ permalink raw reply

* Re: [PATCH 12/24] nfsd: add data structures for handling CB_NOTIFY
From: Jeff Layton @ 2026-04-08 19:53 UTC (permalink / raw)
  To: Chuck Lever, Alexander Viro, Christian Brauner, Jan Kara,
	Chuck Lever, Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein
  Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-doc, linux-nfs
In-Reply-To: <8a6dbbee-5d59-4833-b0f6-22b1e46dfd11@app.fastmail.com>

On Wed, 2026-04-08 at 14:39 -0400, Chuck Lever wrote:
> On Tue, Apr 7, 2026, at 9:21 AM, Jeff Layton wrote:
> > Add the data structures, allocation helpers, and callback operations
> > needed for directory delegation CB_NOTIFY support:
> > 
> > - struct nfsd_notify_event: carries fsnotify events for CB_NOTIFY
> > - struct nfsd4_cb_notify: per-delegation state for notification handling
> > - Union dl_cb_fattr with dl_cb_notify in nfs4_delegation since a
> >   delegation is either a regular file delegation or a directory
> >   delegation, never both
> > 
> > Refactor alloc_init_deleg() into a common __alloc_init_deleg() base
> > with a pluggable sc_free callback, and add alloc_init_dir_deleg() which
> > allocates the page array and notify4 buffer needed for CB_NOTIFY
> > encoding.
> > 
> > Add skeleton nfsd4_cb_notify_ops with done/release handlers that will
> > be filled in when the notification path is wired up.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> 
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index 4afe7e68fb51..b2b8c454fc0f 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> 
> > @@ -3381,6 +3440,30 @@ nfsd4_cb_getattr_release(struct nfsd4_callback 
> > *cb)
> >  	nfs4_put_stid(&dp->dl_stid);
> >  }
> > 
> > +static int
> > +nfsd4_cb_notify_done(struct nfsd4_callback *cb,
> > +				struct rpc_task *task)
> > +{
> > +	switch (task->tk_status) {
> > +	case -NFS4ERR_DELAY:
> > +		rpc_delay(task, 2 * HZ);
> > +		return 0;
> > +	default:
> > +		return 1;
> > +	}
> > +}
> > +
> > +static void
> > +nfsd4_cb_notify_release(struct nfsd4_callback *cb)
> > +{
> > +	struct nfsd4_cb_notify *ncn =
> > +			container_of(cb, struct nfsd4_cb_notify, ncn_cb);
> > +	struct nfs4_delegation *dp =
> > +			container_of(ncn, struct nfs4_delegation, dl_cb_notify);
> > +
> > +	nfs4_put_stid(&dp->dl_stid);
> > +}
> > +
> >  static const struct nfsd4_callback_ops nfsd4_cb_recall_any_ops = {
> >  	.done		= nfsd4_cb_recall_any_done,
> >  	.release	= nfsd4_cb_recall_any_release,
> 
> So when a client responds with NFS4ERR_DELAY, the RPC framework retries
> after 2s. On retry, prepare() is called again, but ncn_evt_cnt is
> already 0 (drained in the first prepare). prepare returns false, which
> destroys the callback.
> 

This is actually not a problem. When ->done() returns 0,
rpc_restart_call_prepare retries the RPC through the RPC-level prepare
(nfsd4_cb_prepare at nfs4callback.c:1479), which just acquires a
session slot and calls rpc_call_start. It does not call cb_ops->prepare
again — the same encoded XDR is re-sent in that case.

> Events arriving during the retry window are dropped because
> nfsd4_run_cb_notify() returns early when NFSD4_CALLBACK_RUNNING is set.
> After the callback is destroyed, future events can queue a new CB_NOTIFY,
> but the window's events are lost.                                                                                        
> 
> The result is that the client misses notifications. Does this impact
> behavioral correctness or spec compliance? Is there a way for that
> client to detect the loss and recover?
> 

There _is_ a problem, however. Events that arrive while the job is
running queue up, but won't get sent until the _next_ event arrives. We
need to make the ->release handler check for new events and requeue the
callback if there are any. I'll plan to fix that up.

Thanks for all the review! I'll plan to send another version soon (but
probably not until after next week's testing event).
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply

* Re: [PATCH mm-unstable v15 03/13] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Nico Pache @ 2026-04-08 19:48 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, lorenzo.stoakes, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <c4e80668-9018-48fc-883c-5d52a5950065@kernel.org>

On Thu, Mar 12, 2026 at 2:56 PM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 3/12/26 21:36, David Hildenbrand (Arm) wrote:
> > On 3/12/26 21:32, David Hildenbrand (Arm) wrote:
> >> On 2/26/26 04:23, Nico Pache wrote:
> >>> generalize the order of the __collapse_huge_page_* functions
> >>> to support future mTHP collapse.
> >>>
> >>> mTHP collapse will not honor the khugepaged_max_ptes_shared or
> >>> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
> >>> shared or swapped entry.
> >>>
> >>> No functional changes in this patch.
> >>>
> >>> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> >>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> >>> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >>> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> >>> Co-developed-by: Dev Jain <dev.jain@arm.com>
> >>> Signed-off-by: Dev Jain <dev.jain@arm.com>
> >>> Signed-off-by: Nico Pache <npache@redhat.com>
> >>> ---
> >>>  mm/khugepaged.c | 73 +++++++++++++++++++++++++++++++------------------
> >>>  1 file changed, 47 insertions(+), 26 deletions(-)
> >>>
> >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >>> index a9b645402b7f..ecdbbf6a01a6 100644
> >>> --- a/mm/khugepaged.c
> >>> +++ b/mm/khugepaged.c
> >>> @@ -535,7 +535,7 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
> >>>
> >>>  static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >>>             unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
> >>> -           struct list_head *compound_pagelist)
> >>> +           unsigned int order, struct list_head *compound_pagelist)
> >>>  {
> >>>     struct page *page = NULL;
> >>>     struct folio *folio = NULL;
> >>> @@ -543,15 +543,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >>>     pte_t *_pte;
> >>>     int none_or_zero = 0, shared = 0, referenced = 0;
> >>>     enum scan_result result = SCAN_FAIL;
> >>> +   const unsigned long nr_pages = 1UL << order;
> >>> +   int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> >>
> >> It might be a bit more readable to move "const unsigned long
> >> nr_pages = 1UL << order;" all the way to the top.
> >>
> >> Then, have here
> >>
> >>      int max_ptes_none = 0;
> >>
> >> and do at the beginning of the function:
> >>
> >>      /* For MADV_COLLAPSE, we always collapse ... */
> >>      if (!cc->is_khugepaged)
> >>              max_ptes_none = HPAGE_PMD_NR;
> >>      /*  ... except if userfaultf relies on MISSING faults. */
> >>      if (!userfaultfd_armed(vma))
> >>              max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> >>
> >> (but see below regarding helper function)
> >>
> >> then the code below becomes ...
> >>
> >>>
> >>> -   for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> >>> +   for (_pte = pte; _pte < pte + nr_pages;
> >>>          _pte++, addr += PAGE_SIZE) {
> >>>             pte_t pteval = ptep_get(_pte);
> >>>             if (pte_none_or_zero(pteval)) {
> >>>                     ++none_or_zero;
> >>>                     if (!userfaultfd_armed(vma) &&
> >>>                         (!cc->is_khugepaged ||
> >>> -                        none_or_zero <= khugepaged_max_ptes_none)) {
> >>> +                        none_or_zero <= max_ptes_none)) {
> >>
> >> ...
> >>
> >>      if (none_or_zero <= max_ptes_none) {
> >>
> >>
> >> I see that you do something like that (but slightly different) in the next
> >> patch. You could easily extend the above by it.
> >>
> >> Or go one step further and move all of that conditional into collapse_max_ptes_none(), whereby
> >> you simply also pass the cc and the vma.
> >>
> >> Then this all gets cleaned up and you'd end up above with
> >>
> >> max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> >> if (max_ptes_none < 0)
> >>      return result;
> >>
> >> I'd do all that in this patch here, getting rid of #4.
> >>
> >>
> >>>                             continue;
> >>>                     } else {
> >>>                             result = SCAN_EXCEED_NONE_PTE;
> >>> @@ -585,8 +587,14 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >>>             /* See collapse_scan_pmd(). */
> >>>             if (folio_maybe_mapped_shared(folio)) {
> >>>                     ++shared;
> >>> -                   if (cc->is_khugepaged &&
> >>> -                       shared > khugepaged_max_ptes_shared) {
> >>> +                   /*
> >>> +                    * TODO: Support shared pages without leading to further
> >>> +                    * mTHP collapses. Currently bringing in new pages via
> >>> +                    * shared may cause a future higher order collapse on a
> >>> +                    * rescan of the same range.
> >>> +                    */
> >>> +                   if (!is_pmd_order(order) || (cc->is_khugepaged &&
> >>> +                       shared > khugepaged_max_ptes_shared)) {
> >>
> >> That's not how we indent within a nested ().
> >>
> >> To make this easier to read, what about similarly having at the beginning
> >> of the function:
> >>
> >> int max_ptes_shared = 0;
> >>
> >> /* For MADV_COLLAPSE, we always collapse. */
> >> if (cc->is_khugepaged)
> >>      max_ptes_none = HPAGE_PMD_NR;
> >> /* TODO ... */
> >> if (is_pmd_order(order))
> >>      max_ptes_none = khugepaged_max_ptes_shared;
> >>
> >> to turn this code into a
> >>
> >>      if (shared > khugepaged_max_ptes_shared)
> >>
> >> Also, here, might make sense to have a collapse_max_ptes_swap(cc, order)
> >> to do that and clean it up.
> >>
> >>
> >>>                             result = SCAN_EXCEED_SHARED_PTE;
> >>>                             count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> >>>                             goto out;
> >>> @@ -679,18 +687,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >>>  }
> >>>
> >>>  static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> >>> -                                           struct vm_area_struct *vma,
> >>> -                                           unsigned long address,
> >>> -                                           spinlock_t *ptl,
> >>> -                                           struct list_head *compound_pagelist)
> >>> +           struct vm_area_struct *vma, unsigned long address,
> >>> +           spinlock_t *ptl, unsigned int order,
> >>> +           struct list_head *compound_pagelist)
> >>>  {
> >>> -   unsigned long end = address + HPAGE_PMD_SIZE;
> >>> +   unsigned long end = address + (PAGE_SIZE << order);
> >>>     struct folio *src, *tmp;
> >>>     pte_t pteval;
> >>>     pte_t *_pte;
> >>>     unsigned int nr_ptes;
> >>> +   const unsigned long nr_pages = 1UL << order;
> >>
> >> Move it further to the top.
> >>
> >>>
> >>> -   for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
> >>> +   for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
> >>>          address += nr_ptes * PAGE_SIZE) {
> >>>             nr_ptes = 1;
> >>>             pteval = ptep_get(_pte);
> >>> @@ -743,13 +751,11 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> >>>  }
> >>>
> >>>  static void __collapse_huge_page_copy_failed(pte_t *pte,
> >>> -                                        pmd_t *pmd,
> >>> -                                        pmd_t orig_pmd,
> >>> -                                        struct vm_area_struct *vma,
> >>> -                                        struct list_head *compound_pagelist)
> >>> +           pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> >>> +           unsigned int order, struct list_head *compound_pagelist)
> >>>  {
> >>>     spinlock_t *pmd_ptl;
> >>> -
> >>> +   const unsigned long nr_pages = 1UL << order;
> >>>     /*
> >>>      * Re-establish the PMD to point to the original page table
> >>>      * entry. Restoring PMD needs to be done prior to releasing
> >>> @@ -763,7 +769,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> >>>      * Release both raw and compound pages isolated
> >>>      * in __collapse_huge_page_isolate.
> >>>      */
> >>> -   release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> >>> +   release_pte_pages(pte, pte + nr_pages, compound_pagelist);
> >>>  }
> >>>
> >>>  /*
> >>> @@ -783,16 +789,16 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> >>>   */
> >>>  static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> >>>             pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> >>> -           unsigned long address, spinlock_t *ptl,
> >>> +           unsigned long address, spinlock_t *ptl, unsigned int order,
> >>>             struct list_head *compound_pagelist)
> >>>  {
> >>>     unsigned int i;
> >>>     enum scan_result result = SCAN_SUCCEED;
> >>> -
> >>> +   const unsigned long nr_pages = 1UL << order;
> >>
> >> Same here, all the way to the top.
> >>
> >>>     /*
> >>>      * Copying pages' contents is subject to memory poison at any iteration.
> >>>      */
> >>> -   for (i = 0; i < HPAGE_PMD_NR; i++) {
> >>> +   for (i = 0; i < nr_pages; i++) {
> >>>             pte_t pteval = ptep_get(pte + i);
> >>>             struct page *page = folio_page(folio, i);
> >>>             unsigned long src_addr = address + i * PAGE_SIZE;
> >>> @@ -811,10 +817,10 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
> >>>
> >>>     if (likely(result == SCAN_SUCCEED))
> >>>             __collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> >>> -                                               compound_pagelist);
> >>> +                                               order, compound_pagelist);
> >>>     else
> >>>             __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> >>> -                                            compound_pagelist);
> >>> +                                            order, compound_pagelist);
> >>>
> >>>     return result;
> >>>  }
> >>> @@ -985,12 +991,12 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
> >>>   * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
> >>>   */
> >>>  static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> >>> -           struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd,
> >>> -           int referenced)
> >>> +           struct vm_area_struct *vma, unsigned long start_addr,
> >>> +           pmd_t *pmd, int referenced, unsigned int order)
> >>>  {
> >>>     int swapped_in = 0;
> >>>     vm_fault_t ret = 0;
> >>> -   unsigned long addr, end = start_addr + (HPAGE_PMD_NR * PAGE_SIZE);
> >>> +   unsigned long addr, end = start_addr + (PAGE_SIZE << order);
> >>>     enum scan_result result;
> >>>     pte_t *pte = NULL;
> >>>     spinlock_t *ptl;
> >>> @@ -1022,6 +1028,19 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> >>>                 pte_present(vmf.orig_pte))
> >>>                     continue;
> >>>
> >>> +           /*
> >>> +            * TODO: Support swapin without leading to further mTHP
> >>> +            * collapses. Currently bringing in new pages via swapin may
> >>> +            * cause a future higher order collapse on a rescan of the same
> >>> +            * range.
> >>> +            */
> >>> +           if (!is_pmd_order(order)) {
> >>> +                   pte_unmap(pte);
> >>> +                   mmap_read_unlock(mm);
> >>> +                   result = SCAN_EXCEED_SWAP_PTE;
> >>> +                   goto out;
> >>> +           }
> >>> +
> >>
> >> Interesting, we just swapin everything we find :)
> >>
> >> But do we really need this check here? I mean, we just found it to be present.
> >>
> >> In the rare event that there was a race, do we really care? It was just
> >> present, now it's swapped. Bad luck. Just swap it in.
> >>
> >
> > Okay, now I am confused. Why are you not taking care of
> > collapse_scan_pmd() in the same context?
> >
> > Because if you make sure that we properly check against a max_ptes_swap
> > similar as in the style above, we'd rule out swapin right from the start?
> >
> > Also, I would expect that all other parameters in there are similarly
> > handled?
> >
>
> Okay, I think you should add the following:

Hey! Thanks for all your reviews here.

For multiple reasons, here is the solution I developed:

Add a patch before the generalize __collapse.. patch that reworks the
max_ptes* handling and introduces the helpers (no functional changes).

I later updated these functions to follow the specific mthp rules in
the generalization patch. Honestly, refactoring much of this has been
very hard without one large patch, which is why we split it up
initially.

How does that sound?

-- Nico

>
> From 17bce81ab93f3b16e044ac2f4f62be19aac38180 Mon Sep 17 00:00:00 2001
> From: "David Hildenbrand (Arm)" <david@kernel.org>
> Date: Thu, 12 Mar 2026 21:54:22 +0100
> Subject: [PATCH] tmp
>
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
> ---
>  mm/khugepaged.c | 89 +++++++++++++++++++++++++++++--------------------
>  1 file changed, 53 insertions(+), 36 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b7b4680d27ab..6a3773bfa0a2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -318,6 +318,34 @@ static ssize_t max_ptes_shared_store(struct kobject *kobj,
>         return count;
>  }
>
> +static int collapse_max_ptes_none(struct collapse_control *cc,
> +               struct vm_area_struct *vma)
> +{
> +       /* We don't mess with MISSING faults. */
> +       if (vma && userfaultfd_armed(vma))
> +               return 0;
> +       /* MADV_COLLAPSE always collapses. */
> +       if (!cc->is_khugepaged)
> +               return HPAGE_PMD_NR;
> +       return khugepaged_max_ptes_none;
> +}
> +
> +static int collapse_max_ptes_shared(struct collapse_control *cc)
> +{
> +       /* MADV_COLLAPSE always collapses. */
> +       if (!cc->is_khugepaged)
> +               return HPAGE_PMD_NR;
> +       return khugepaged_max_ptes_shared;
> +}
> +
> +static int collapse_max_ptes_swap(struct collapse_control *cc)
> +{
> +       /* MADV_COLLAPSE always collapses. */
> +       if (!cc->is_khugepaged)
> +               return HPAGE_PMD_NR;
> +       return khugepaged_max_ptes_swap;
> +}
> +
>  static struct kobj_attribute khugepaged_max_ptes_shared_attr =
>         __ATTR_RW(max_ptes_shared);
>
> @@ -539,6 +567,8 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>                 unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
>                 struct list_head *compound_pagelist)
>  {
> +       const int max_ptes_none = collapse_max_ptes_none(cc, vma);
> +       const int max_ptes_shared = collapse_max_ptes_shared(cc);
>         struct page *page = NULL;
>         struct folio *folio = NULL;
>         unsigned long addr = start_addr;
> @@ -550,16 +580,12 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>              _pte++, addr += PAGE_SIZE) {
>                 pte_t pteval = ptep_get(_pte);
>                 if (pte_none_or_zero(pteval)) {
> -                       ++none_or_zero;
> -                       if (!userfaultfd_armed(vma) &&
> -                           (!cc->is_khugepaged ||
> -                            none_or_zero <= khugepaged_max_ptes_none)) {
> -                               continue;
> -                       } else {
> +                       if (++none_or_zero > max_ptes_none) {
>                                 result = SCAN_EXCEED_NONE_PTE;
>                                 count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>                                 goto out;
>                         }
> +                       continue;
>                 }
>                 if (!pte_present(pteval)) {
>                         result = SCAN_PTE_NON_PRESENT;
> @@ -586,9 +612,7 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
>
>                 /* See hpage_collapse_scan_pmd(). */
>                 if (folio_maybe_mapped_shared(folio)) {
> -                       ++shared;
> -                       if (cc->is_khugepaged &&
> -                           shared > khugepaged_max_ptes_shared) {
> +                       if (++shared > max_ptes_shared) {
>                                 result = SCAN_EXCEED_SHARED_PTE;
>                                 count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>                                 goto out;
> @@ -1247,6 +1271,9 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>                 struct vm_area_struct *vma, unsigned long start_addr,
>                 bool *mmap_locked, struct collapse_control *cc)
>  {
> +       const int max_ptes_none = collapse_max_ptes_none(cc, vma);
> +       const int max_ptes_swap = collapse_max_ptes_swap(cc);
> +       const int max_ptes_shared = collapse_max_ptes_shared(cc);
>         pmd_t *pmd;
>         pte_t *pte, *_pte;
>         int none_or_zero = 0, shared = 0, referenced = 0;
> @@ -1280,36 +1307,28 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>
>                 pte_t pteval = ptep_get(_pte);
>                 if (pte_none_or_zero(pteval)) {
> -                       ++none_or_zero;
> -                       if (!userfaultfd_armed(vma) &&
> -                           (!cc->is_khugepaged ||
> -                            none_or_zero <= khugepaged_max_ptes_none)) {
> -                               continue;
> -                       } else {
> +                       if (++none_or_zero > max_ptes_none) {
>                                 result = SCAN_EXCEED_NONE_PTE;
>                                 count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>                                 goto out_unmap;
>                         }
> +                       continue;
>                 }
>                 if (!pte_present(pteval)) {
> -                       ++unmapped;
> -                       if (!cc->is_khugepaged ||
> -                           unmapped <= khugepaged_max_ptes_swap) {
> -                               /*
> -                                * Always be strict with uffd-wp
> -                                * enabled swap entries.  Please see
> -                                * comment below for pte_uffd_wp().
> -                                */
> -                               if (pte_swp_uffd_wp_any(pteval)) {
> -                                       result = SCAN_PTE_UFFD_WP;
> -                                       goto out_unmap;
> -                               }
> -                               continue;
> -                       } else {
> +                       if (++unmapped > max_ptes_swap) {
>                                 result = SCAN_EXCEED_SWAP_PTE;
>                                 count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
>                                 goto out_unmap;
>                         }
> +                       /*
> +                        * Always be strict with uffd-wp enabled swap entries.
> +                        * See the comment below for pte_uffd_wp().
> +                        */
> +                       if (pte_swp_uffd_wp_any(pteval)) {
> +                               result = SCAN_PTE_UFFD_WP;
> +                               goto out_unmap;
> +                       }
> +                       continue;
>                 }
>                 if (pte_uffd_wp(pteval)) {
>                         /*
> @@ -1348,9 +1367,7 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
>                  * is shared.
>                  */
>                 if (folio_maybe_mapped_shared(folio)) {
> -                       ++shared;
> -                       if (cc->is_khugepaged &&
> -                           shared > khugepaged_max_ptes_shared) {
> +                       if (++shared > max_ptes_shared) {
>                                 result = SCAN_EXCEED_SHARED_PTE;
>                                 count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>                                 goto out_unmap;
> @@ -2305,6 +2322,8 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
>                 unsigned long addr, struct file *file, pgoff_t start,
>                 struct collapse_control *cc)
>  {
> +       const int max_ptes_none = collapse_max_ptes_none(cc, NULL);
> +       const int max_ptes_swap = collapse_max_ptes_swap(cc);
>         struct folio *folio = NULL;
>         struct address_space *mapping = file->f_mapping;
>         XA_STATE(xas, &mapping->i_pages, start);
> @@ -2323,8 +2342,7 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
>
>                 if (xa_is_value(folio)) {
>                         swap += 1 << xas_get_order(&xas);
> -                       if (cc->is_khugepaged &&
> -                           swap > khugepaged_max_ptes_swap) {
> +                       if (swap > max_ptes_swap) {
>                                 result = SCAN_EXCEED_SWAP_PTE;
>                                 count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
>                                 break;
> @@ -2395,8 +2413,7 @@ static enum scan_result hpage_collapse_scan_file(struct mm_struct *mm,
>                 cc->progress += HPAGE_PMD_NR;
>
>         if (result == SCAN_SUCCEED) {
> -               if (cc->is_khugepaged &&
> -                   present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
> +               if (present < HPAGE_PMD_NR - max_ptes_none) {
>                         result = SCAN_EXCEED_NONE_PTE;
>                         count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>                 } else {
> --
> 2.43.0
>
>
> Then extend it by passing an order + return value check in this patch here. You can
> directly squash changes from patch #4 in here then.
>
> --
> Cheers,
>
> David
>


^ permalink raw reply

* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Sean Christopherson @ 2026-04-08 19:48 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Michael Roth, Vishal Annapurve, aik, andrew.jones, binbin.wu,
	brauner, chao.p.peng, david, ira.weiny, jmattson, jthoughton,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
	forkloop, pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm
In-Reply-To: <CAEvNRgEcKnHkgC2iK8hRMybruY5LrxWH+RK94M7MddUqgGChHQ@mail.gmail.com>

On Wed, Apr 08, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> > On Tue, Apr 07, 2026, Michael Roth wrote:
> >> On Tue, Apr 07, 2026 at 02:50:58PM -0700, Vishal Annapurve wrote:
> >> > > So I agree with Ackerley's proposal (which I guess is the same as what's
> >> > > in this series).
> >> > >
> >> > > However, 1 other alternative would be to do what was suggested on the
> >> > > call, but require userspace to subsequently handle the shared->private
> >> > > conversion. I think that would be workable too.
> >> >
> >> > IIUC, Converting memory ranges to private after it essentially is
> >> > treated as private by the KVM CC backend will expose the
> >> > implementation to the same risk of userspace being able to access
> >> > private memory and compromise host safety which guest_memfd was
> >> > invented to address.
> >>
> >> Doh, fair point. Doing conversion as part of the populate call would allow
> >> us to use the filemap write-lock to avoid userspace being able to fault
> >> in private (as tracked by trusted entity) pages before they are
> >> transitioned to private (as tracked by KVM), so it's safer than having
> >> userspace drive it.
> >>
> >> But obviously I still think Ackerley's original proposal has more
> >> upsides than the alternatives mentioned so far.
> >
> > I'm a bit lost.  What exactly is/was Ackerley's original proposal?  If the answer
> > is "convert pages from shared=>private when populating via in-place conversion",
> > then I agree, because AFAICT, that's the only sane option.
> 
> Discussed this at PUCK today 2026-04-08.
> 
> The update is that the KVM_SET_MEMORY_ATTRIBUTES2 guest_memfd ioctl will
> now support the PRESERVE flag for TDX and SNP only if the setup for the
> VM in question hasn't yet been completed (KVM_TDX_FINALIZE_VM or
> KVM_SEV_SNP_LAUNCH_FINISH hasn't completed yet).
> 
> The populate flow will be
> 
> 1a. Get contents to be loaded in guest_memfd (src_addr: NULL) as shared
> OR
> 1b. Provide contents from some other userspace address (src_addr:
>     userspace address)
> 
> 2.  KVM_SET_MEMORY_ATTRIBUTES2(attribute: PRIVATE and flags: PRESERVE)
> 3.  KVM_SEV_SNP_LAUNCH_UPDATE() or KVM_TDX_INIT_MEM_REGION()
> ...
> 4.  KVM_SEV_SNP_LAUNCH_FINISH() or KVM_TDX_FINALIZE_VM()
> 
> This applies whether src_addr is some userspace address that is shared
> or NULL, so the non-in-place loading flow is not considered legacy. ARM
> CCA can still use that flow :)
> 
> Other than supporting PRESERVE only if the setup for the VM in question
> hasn't yet been completed, KVM's fault path will also not permit faults
> if the setup hasn't been completed. (Some exception setup will be used
> for TDX to be able to perform the required fault.)

Nit: as Mike (or Rick?) called out in PUCK, TDX's flow is now a separate path
thanks to commit 3ab3283dbb2c ("KVM: x86/mmu: Add dedicated API to map guest_memfd
pfn into TDP MMU").  I.e. it is NOT a fault in any way, shape, or form.

tdx_mem_page_add() already asserts pre_fault_allowed=false:

	if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm) ||
	    KVM_BUG_ON(!kvm_tdx->page_add_src, kvm))
		return -EIO;

so I think we just to add similar checks in SEV and the MMU.  This can even be
done today as a hardening measure, as the rules aren't changing, we're just
doubling down on disallowing (pre-)faulting during pre-boot.

E.g.

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 73cdcbccc89e..99f070cf2480 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -363,6 +363,9 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
        };
        int r;
 
+       if (KVM_BUG_ON(!vcpu->kvm->arch.pre_fault_allowed, vcpu->kvm))
+               return -EIO;
+
        if (vcpu->arch.mmu->root_role.direct) {
                /*
                 * Things like memslots don't understand the concept of a shared
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 2010b157e288..f0bbbda6e9c4 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2419,6 +2419,9 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
        if (!sev_snp_guest(kvm) || !sev->snp_context)
                return -EINVAL;
 
+       if (KVM_BUG_ON(kvm->arch.pre_fault_allowed, kvm)
+               return -EIO;
+
        if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
                return -EFAULT;

^ permalink raw reply related

* Re: [PATCH 13/24] nfsd: add notification handlers for dir events
From: Jeff Layton @ 2026-04-08 19:40 UTC (permalink / raw)
  To: Chuck Lever, Alexander Viro, Christian Brauner, Jan Kara,
	Chuck Lever, Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein
  Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-doc, linux-nfs
In-Reply-To: <f60c8d0f-45b6-4d4e-8944-dfd69f2cf9bf@app.fastmail.com>

On Wed, 2026-04-08 at 14:34 -0400, Chuck Lever wrote:
> 
> On Tue, Apr 7, 2026, at 9:21 AM, Jeff Layton wrote:
> > Add the necessary parts to accept a fsnotify callback for directory
> > change event and create a CB_NOTIFY request for it. When a dir nfsd_file
> > is created set a handle_event callback to handle the notification.
> > 
> > Use that to allocate a nfsd_notify_event object and then hand off a
> > reference to each delegation's CB_NOTIFY. If anything fails along the
> > way, recall any affected delegations.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> 
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index b2b8c454fc0f..339c3d0bb575 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> 
> > @@ -9796,3 +9887,118 @@ nfsd_get_dir_deleg(struct nfsd4_compound_state 
> > *cstate,
> >  	put_nfs4_file(fp);
> >  	return ERR_PTR(status);
> >  }
> > +
> > +static void
> > +nfsd4_run_cb_notify(struct nfsd4_cb_notify *ncn)
> > +{
> > +	struct nfs4_delegation *dp = container_of(ncn, struct 
> > nfs4_delegation, dl_cb_notify);
> > +
> > +	if (test_and_set_bit(NFSD4_CALLBACK_RUNNING, &ncn->ncn_cb.cb_flags))
> > +		return;
> > +
> > +	if (!refcount_inc_not_zero(&dp->dl_stid.sc_count))
> > +		clear_bit(NFSD4_CALLBACK_RUNNING, &ncn->ncn_cb.cb_flags);
> > +	else
> > +		nfsd4_run_cb(&ncn->ncn_cb);
> > +}
> > +
> > +static struct nfsd_notify_event *
> > +alloc_nfsd_notify_event(u32 mask, const struct qstr *q, struct dentry 
> > *dentry)
> > +{
> > +	struct nfsd_notify_event *ne;
> > +
> > +	ne = kmalloc(sizeof(*ne) + q->len + 1, GFP_KERNEL);
> > +	if (!ne)
> > +		return NULL;
> > +
> > +	memcpy(&ne->ne_name, q->name, q->len);
> > +	refcount_set(&ne->ne_ref, 1);
> > +	ne->ne_mask = mask;
> > +	ne->ne_name[q->len] = '\0';
> > +	ne->ne_namelen = q->len;
> > +	ne->ne_dentry = dget(dentry);
> > +	return ne;
> > +}
> > +
> > +static bool
> > +should_notify_deleg(u32 mask, struct file_lease *fl)
> > +{
> > +	/* Only nfsd leases */
> > +	if (fl->fl_lmops != &nfsd_lease_mng_ops)
> > +		return false;
> > +
> > +	/* Skip if this event wasn't ignored by the lease */
> > +	if ((mask & FS_DELETE) && !(fl->c.flc_flags & FL_IGN_DIR_DELETE))
> > +		return false;
> > +	if ((mask & FS_CREATE) && !(fl->c.flc_flags & FL_IGN_DIR_CREATE))
> > +		return false;
> > +	if ((mask & FS_RENAME) && !(fl->c.flc_flags & FL_IGN_DIR_RENAME))
> > +		return false;
> > +
> > +	return true;
> > +}
> 
> For a cross-directory rename, vfs_rename calls try_break_deleg(old_dir,
> LEASE_BREAK_DIR_DELETE, ...). A delegation with FL_IGN_DIR_DELETE
> (subscribed to NOTIFY4_REMOVE_ENTRY) suppresses the lease break, which
> is correct.
> 
> But fsnotify delivers FS_RENAME on old_dir, not FS_DELETE. In
> should_notify_deleg(), the check (mask & FS_RENAME) &&
> !(fl->c.flc_flags & FL_IGN_DIR_RENAME) fails, because the delegation
> has FL_IGN_DIR_DELETE but not FL_IGN_DIR_RENAME. No notification is
> sent.
> 

Actually, it delivers FS_RENAME_FROM to the old directory and
FS_RENAME_TO to the new one in this case, AFAICT. Basically, we need to
watch for those in addition to FS_CREATE/DELETE and treat them the same
way. I'm looking at a fix now. I'll also see about adding a pynfs test
to do a cross directory rename with notifications. I don't think I have
that now.

Thanks!

> IIUC, a client subscribed to NOTIFY4_REMOVE_ENTRY for old_dir sees
> neither a lease break nor a CB_NOTIFY when a child is renamed out of
> the directory. Is that behavior correct?
> 
> 
> > +
> > +static void
> > +nfsd_recall_all_dir_delegs(const struct inode *dir)
> > +{
> > +	struct file_lock_context *ctx = locks_inode_context(dir);
> > +	struct file_lock_core *flc;
> > +
> > +	spin_lock(&ctx->flc_lock);
> > +	list_for_each_entry(flc, &ctx->flc_lease, flc_list) {
> > +		struct file_lease *fl = container_of(flc, struct file_lease, c);
> > +
> > +		if (fl->fl_lmops == &nfsd_lease_mng_ops)
> > +			nfsd_break_deleg_cb(fl);
> > +	}
> > +	spin_unlock(&ctx->flc_lock);
> > +}
> > +
> > +int
> > +nfsd_handle_dir_event(u32 mask, const struct inode *dir, const void 
> > *data,
> > +		      int data_type, const struct qstr *name)
> > +{
> > +	struct dentry *dentry = fsnotify_data_dentry(data, data_type);
> > +	struct file_lock_context *ctx;
> > +	struct file_lock_core *flc;
> > +	struct nfsd_notify_event *evt;
> > +
> > +	/* Don't do anything if this is not an expected event */
> > +	if (!(mask & (FS_CREATE|FS_DELETE|FS_RENAME)))
> > +		return 0;
> > +
> > +	ctx = locks_inode_context(dir);
> > +	if (!ctx || list_empty(&ctx->flc_lease))
> > +		return 0;
> > +
> > +	evt = alloc_nfsd_notify_event(mask, name, dentry);
> > +	if (!evt) {
> > +		nfsd_recall_all_dir_delegs(dir);
> > +		return 0;
> > +	}
> > +
> > +	spin_lock(&ctx->flc_lock);
> > +	list_for_each_entry(flc, &ctx->flc_lease, flc_list) {
> > +		struct file_lease *fl = container_of(flc, struct file_lease, c);
> > +		struct nfs4_delegation *dp = flc->flc_owner;
> > +		struct nfsd4_cb_notify *ncn = &dp->dl_cb_notify;
> > +
> > +		if (!should_notify_deleg(mask, fl))
> > +			continue;
> > +
> > +		spin_lock(&ncn->ncn_lock);
> > +		if (ncn->ncn_evt_cnt >= NOTIFY4_EVENT_QUEUE_SIZE) {
> > +			/* We're generating notifications too fast. Recall. */
> > +			spin_unlock(&ncn->ncn_lock);
> > +			nfsd_break_deleg_cb(fl);
> > +			continue;
> > +		}
> > +		ncn->ncn_evt[ncn->ncn_evt_cnt++] = nfsd_notify_event_get(evt);
> > +		spin_unlock(&ncn->ncn_lock);
> > +
> > +		nfsd4_run_cb_notify(ncn);
> > +	}
> > +	spin_unlock(&ctx->flc_lock);
> > +	nfsd_notify_event_put(evt);
> > +	return 0;
> > +}
> 

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply

* Re: [PATCH 08/24] nfsd: update the fsnotify mark when setting or removing a dir delegation
From: Jeff Layton @ 2026-04-08 19:29 UTC (permalink / raw)
  To: Chuck Lever, Alexander Viro, Christian Brauner, Jan Kara,
	Chuck Lever, Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein
  Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-doc, linux-nfs
In-Reply-To: <412e143e-1e99-42a6-a959-654bde4e7038@app.fastmail.com>

On Wed, 2026-04-08 at 14:24 -0400, Chuck Lever wrote:
> On Tue, Apr 7, 2026, at 9:21 AM, Jeff Layton wrote:
> > Add a new helper function that will update the mask on the nfsd_file's
> > fsnotify_mark to be a union of all current directory delegations on an
> > inode. Call that when directory delegations are added or removed.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> 
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index c8fb84c38637..9a4cff08c67d 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> 
> > @@ -1266,6 +1297,7 @@ static void nfs4_unlock_deleg_lease(struct 
> > nfs4_delegation *dp)
> >  	WARN_ON_ONCE(!fp->fi_delegees);
> > 
> >  	nfsd4_finalize_deleg_timestamps(dp, nf->nf_file);
> > +	nfsd_fsnotify_recalc_mask(nf);
> >  	kernel_setlease(nf->nf_file, F_UNLCK, NULL, (void **)&dp);
> >  	put_deleg_file(fp);
> >  }
> 
> The grant path in nfsd_get_dir_deleg() uses a different ordering
> (setlease first, recalc_mask after).
> 
> Here, since the delegation being removed is still in flc_lease,
> inode_lease_ignore_mask() includes its ignore flags. The mask is
> computed as if the delegation is still present.
> 
> The result is that stale FS_CREATE/FS_DELETE/FS_RENAME bits remain
> in the fsnotify mark. It might be harmless in practice since the
> handler finds no leases and returns early, but it creates
> unnecessary work.
> 
> Should nfs4_unlock_deleg_lease call nfsd_fsnotify_recalc_mask()
> after kernel_setlease(F_UNLCK)?
> 

Good catch. Will fix.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply

* Re: [PATCH 01/24] filelock: add support for ignoring deleg breaks for dir change events
From: Jeff Layton @ 2026-04-08 19:25 UTC (permalink / raw)
  To: Chuck Lever, Alexander Viro, Christian Brauner, Jan Kara,
	Chuck Lever, Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein
  Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-doc, linux-nfs
In-Reply-To: <bb4fc3da-765d-4284-9282-dd04d286f671@app.fastmail.com>

On Wed, 2026-04-08 at 14:16 -0400, Chuck Lever wrote:
> On Tue, Apr 7, 2026, at 9:21 AM, Jeff Layton wrote:
> > If a NFS client requests a directory delegation with a notification
> > bitmask covering directory change events, the server shouldn't recall
> > the delegation. Instead the client will be notified of the change after
> > the fact.
> > 
> > Add support for ignoring lease breaks on directory changes. Add a new
> > flags parameter to try_break_deleg() and teach __break_lease how to
> > ignore certain types of delegation break events.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> 
> > diff --git a/fs/locks.c b/fs/locks.c
> > index 8e44b1f6c15a..dafa0752fdce 100644
> > --- a/fs/locks.c
> > +++ b/fs/locks.c
> 
> > @@ -1670,7 +1709,7 @@ int __break_lease(struct inode *inode, unsigned int flags)
> >  			locks_delete_lock_ctx(&fl->c, &dispose);
> >  	}
> > 
> > -	if (list_empty(&ctx->flc_lease))
> > +	if (!visible_leases_remaining(inode, flags))
> >  		goto out;
> > 
> >  	if (flags & LEASE_BREAK_NONBLOCK) {
> 
> After breaking visible leases, the restart: label calls any_leases_conflict()
> which does not filter ignored dir-delegation leases. When only ignored leases
> remain, any_leases_conflict returns true, but visible_leases_remaining also
> returned true (triggering the wait). The code picks the first lease (possibly
> ignored), computes break_time = 1 jiffy, blocks, then loops.                                                     
> 
> For example, suppose you have two directory delegations on a directory, one
> with FL_IGN_DIR_DELETE and one without. After the non-ignored one is broken
> and removed, the ignored one keeps any_leases_conflict returning true. The
> loop spins at 1-jiffy intervals until the ignored delegation is released.  
> 
> Should the restart: block skip ignored leases?
> 

Yep. The right fix is to switch the any_leases_conflict() call to
visible_leases_remaining(). I'll code that up and test it.

Thanks for the review!
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply

* [PATCH bpf-next v5 2/2] selftests/bpf: Add test to ensure kprobe_multi is not sleepable
From: Varun R Mallya @ 2026-04-08 19:01 UTC (permalink / raw)
  To: bpf, leon.hwang, memxor, jolsa
  Cc: ast, daniel, yonghong.song, rostedt, linux-kernel,
	linux-trace-kernel, varunrmallya
In-Reply-To: <20260408190137.101418-1-varunrmallya@gmail.com>

Add a selftest to ensure that kprobe_multi programs cannot be attached
using the BPF_F_SLEEPABLE flag. This test succeeds when the kernel
rejects attachment of kprobe_multi when the BPF_F_SLEEPABLE flag is set.

Suggested-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
---
 .../bpf/prog_tests/kprobe_multi_test.c        | 35 ++++++++++++++++++-
 .../bpf/progs/kprobe_multi_sleepable.c        | 25 +++++++++++++
 2 files changed, 59 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c

diff --git a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
index 78c974d4ea33..a07cd853ed2a 100644
--- a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
+++ b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
@@ -10,6 +10,7 @@
 #include "kprobe_multi_session_cookie.skel.h"
 #include "kprobe_multi_verifier.skel.h"
 #include "kprobe_write_ctx.skel.h"
+#include "kprobe_multi_sleepable.skel.h"
 #include "bpf/libbpf_internal.h"
 #include "bpf/hashmap.h"
 
@@ -220,7 +221,9 @@ static void test_attach_api_syms(void)
 static void test_attach_api_fails(void)
 {
 	LIBBPF_OPTS(bpf_kprobe_multi_opts, opts);
+	LIBBPF_OPTS(bpf_test_run_opts, topts);
 	struct kprobe_multi *skel = NULL;
+	struct kprobe_multi_sleepable *sl_skel = NULL;
 	struct bpf_link *link = NULL;
 	unsigned long long addrs[2];
 	const char *syms[2] = {
@@ -228,7 +231,7 @@ static void test_attach_api_fails(void)
 		"bpf_fentry_test2",
 	};
 	__u64 cookies[2];
-	int saved_error;
+	int saved_error, err;
 
 	addrs[0] = ksym_get_addr("bpf_fentry_test1");
 	addrs[1] = ksym_get_addr("bpf_fentry_test2");
@@ -351,9 +354,39 @@ static void test_attach_api_fails(void)
 	if (!ASSERT_EQ(saved_error, -ENOENT, "fail_8_error"))
 		goto cleanup;
 
+	/* fail_9 - sleepable kprobe multi should not attach */
+	sl_skel = kprobe_multi_sleepable__open();
+	if (!ASSERT_OK_PTR(sl_skel, "sleep_skel_open"))
+		goto cleanup;
+
+	sl_skel->bss->user_ptr = sl_skel;
+
+	err = bpf_program__set_flags(sl_skel->progs.handle_kprobe_multi_sleepable,
+				     BPF_F_SLEEPABLE);
+	if (!ASSERT_OK(err, "sleep_skel_set_flags"))
+		goto cleanup;
+
+	err = kprobe_multi_sleepable__load(sl_skel);
+	if (!ASSERT_OK(err, "sleep_skel_load"))
+		goto cleanup;
+
+	link = bpf_program__attach_kprobe_multi_opts(sl_skel->progs.handle_kprobe_multi_sleepable,
+						     "bpf_fentry_test1", NULL);
+	saved_error = -errno;
+
+	if (!ASSERT_ERR_PTR(link, "fail_9"))
+		goto cleanup;
+
+	if (!ASSERT_EQ(saved_error, -EINVAL, "fail_9_error"))
+		goto cleanup;
+
+	err = bpf_prog_test_run_opts(bpf_program__fd(sl_skel->progs.fentry), &topts);
+	ASSERT_OK(err, "bpf_prog_test_run_opts");
+
 cleanup:
 	bpf_link__destroy(link);
 	kprobe_multi__destroy(skel);
+	kprobe_multi_sleepable__destroy(sl_skel);
 }
 
 static void test_session_skel_api(void)
diff --git a/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c b/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c
new file mode 100644
index 000000000000..932e1d9c72e2
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+void *user_ptr = 0;
+
+SEC("kprobe.multi")
+int handle_kprobe_multi_sleepable(struct pt_regs *ctx)
+{
+	int a, err;
+
+	err = bpf_copy_from_user(&a, sizeof(a), user_ptr);
+	barrier_var(a);
+	return err;
+}
+
+SEC("fentry/bpf_fentry_test1")
+int BPF_PROG(fentry)
+{
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.53.0


^ permalink raw reply related

* [PATCH bpf-next v5 1/2] bpf: Reject sleepable kprobe_multi programs at attach time
From: Varun R Mallya @ 2026-04-08 19:01 UTC (permalink / raw)
  To: bpf, leon.hwang, memxor, jolsa
  Cc: ast, daniel, yonghong.song, rostedt, linux-kernel,
	linux-trace-kernel, varunrmallya
In-Reply-To: <20260408190137.101418-1-varunrmallya@gmail.com>

kprobe.multi programs run in atomic/RCU context and cannot sleep.
However, bpf_kprobe_multi_link_attach() did not validate whether the
program being attached had the sleepable flag set, allowing sleepable
helpers such as bpf_copy_from_user() to be invoked from a non-sleepable
context.

This causes a "sleeping function called from invalid context" splat:

  BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169
  in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
  preempt_count: 1, expected: 0
  RCU nest depth: 2, expected: 0

Fix this by rejecting sleepable programs early in
bpf_kprobe_multi_link_attach(), before any further processing.

Fixes: 0dcac2725406 ("bpf: Add multi kprobe link")
Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Leon Hwang <leon.hwang@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>

---
 kernel/trace/bpf_trace.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 0b040a417442..af7079aa0f36 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2752,6 +2752,10 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
 	if (!is_kprobe_multi(prog))
 		return -EINVAL;
 
+	/* kprobe_multi is not allowed to be sleepable. */
+	if (prog->sleepable)
+		return -EINVAL;
+
 	/* Writing to context is not allowed for kprobes. */
 	if (prog->aux->kprobe_write_ctx)
 		return -EINVAL;
-- 
2.53.0


^ permalink raw reply related

* [PATCH bpf-next v5 0/2] Reject sleepable kprobe_multi programs at attach time
From: Varun R Mallya @ 2026-04-08 19:01 UTC (permalink / raw)
  To: bpf, leon.hwang, memxor, jolsa
  Cc: ast, daniel, yonghong.song, rostedt, linux-kernel,
	linux-trace-kernel, varunrmallya

These patches fix an issue where sleepable kprobe_multi programs
were allowed to attach, leading to "sleeping function called from invalid
context" splats.

Because kprobe.multi programs run in atomic/RCU context, they cannot
sleep. However, `bpf_kprobe_multi_link_attach()` previously lacked
validation for the `prog->sleepable` flag. This allowed sleepable
helpers, such as `bpf_copy_from_user()`, to be invoked from an invalid
non-sleepable context.

This series addresses the issue by:
1. Rejecting sleepable kprobe_multi programs early in
   `bpf_kprobe_multi_link_attach()` by returning -EINVAL.
2. Adding selftests to explicitly verify that attaching a sleepable
   kprobe_multi program is rejected by the kernel.

P.S: The first of these two commits has been applied to the bpf tree.

Changes:
v1->v2:
- v1: https://lore.kernel.org/bpf/20260401134921.362148-1-varunrmallya@gmail.com/
- Defective selftest added
v2->v3:
- v2: https://lore.kernel.org/bpf/CAP01T74YgnKop-dgwBToOcfg4_D44t1wUBopFYPMquirCmaLfg@mail.gmail.com/
- Selftest separated from change into different commit.
v3->v4:
- v3: https://lore.kernel.org/bpf/20260401191126.440683-1-varunrmallya@gmail.com/
- Selftest moved to test_attach_api_fails.
- Changed attachment symbol to bpf_fentry_test1 for stability.
- Changes suggested by Leon implemented.
v4->v5:
- v4: https://lore.kernel.org/bpf/20260408183549.92990-1-varunrmallya@gmail.com/
- fix the mistake of leaving test_attach_multi_sleepable after changing
  location.

Varun R Mallya (2):
  bpf: Reject sleepable kprobe_multi programs at attach time
  selftests/bpf: Add test to ensure kprobe_multi is not sleepable

 kernel/trace/bpf_trace.c                      |  4 +++
 .../bpf/prog_tests/kprobe_multi_test.c        | 35 ++++++++++++++++++-
 .../bpf/progs/kprobe_multi_sleepable.c        | 25 +++++++++++++
 3 files changed, 63 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c

-- 
2.53.0

^ permalink raw reply

* Re: [PATCH bpf-next v4 2/2] selftests/bpf: Add test to ensure kprobe_multi is not sleepable
From: Varun R Mallya @ 2026-04-08 18:47 UTC (permalink / raw)
  To: bpf, leon.hwang, memxor, jolsa
  Cc: ast, daniel, yonghong.song, rostedt, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260408183549.92990-3-varunrmallya@gmail.com>

On Thu, Apr 09, 2026 at 12:05:49AM +0530, Varun R Mallya wrote:
> @@ -676,5 +750,7 @@ void test_kprobe_multi_test(void)
>  		test_unique_match();
>  	if (test__start_subtest("attach_write_ctx"))
>  		test_attach_write_ctx();
> +	if (test__start_subtest("attach_multi_sleepable"))
> +		test_attach_multi_sleepable();
>  	RUN_TESTS(kprobe_multi_verifier);
Please ignore this patch. I will send a v5 in a few minutes. I forgot to
remove the selftest from the previous location after moving it into
attach_api_fails.
> +}
> +
> +char _license[] SEC("license") = "GPL";
> -- 
> 2.53.0
> 

^ permalink raw reply

* Re: [PATCH 12/24] nfsd: add data structures for handling CB_NOTIFY
From: Chuck Lever @ 2026-04-08 18:39 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
	Chuck Lever, Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein
  Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-12-aaf68c478abd@kernel.org>


On Tue, Apr 7, 2026, at 9:21 AM, Jeff Layton wrote:
> Add the data structures, allocation helpers, and callback operations
> needed for directory delegation CB_NOTIFY support:
>
> - struct nfsd_notify_event: carries fsnotify events for CB_NOTIFY
> - struct nfsd4_cb_notify: per-delegation state for notification handling
> - Union dl_cb_fattr with dl_cb_notify in nfs4_delegation since a
>   delegation is either a regular file delegation or a directory
>   delegation, never both
>
> Refactor alloc_init_deleg() into a common __alloc_init_deleg() base
> with a pluggable sc_free callback, and add alloc_init_dir_deleg() which
> allocates the page array and notify4 buffer needed for CB_NOTIFY
> encoding.
>
> Add skeleton nfsd4_cb_notify_ops with done/release handlers that will
> be filled in when the notification path is wired up.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 4afe7e68fb51..b2b8c454fc0f 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c

> @@ -3381,6 +3440,30 @@ nfsd4_cb_getattr_release(struct nfsd4_callback 
> *cb)
>  	nfs4_put_stid(&dp->dl_stid);
>  }
> 
> +static int
> +nfsd4_cb_notify_done(struct nfsd4_callback *cb,
> +				struct rpc_task *task)
> +{
> +	switch (task->tk_status) {
> +	case -NFS4ERR_DELAY:
> +		rpc_delay(task, 2 * HZ);
> +		return 0;
> +	default:
> +		return 1;
> +	}
> +}
> +
> +static void
> +nfsd4_cb_notify_release(struct nfsd4_callback *cb)
> +{
> +	struct nfsd4_cb_notify *ncn =
> +			container_of(cb, struct nfsd4_cb_notify, ncn_cb);
> +	struct nfs4_delegation *dp =
> +			container_of(ncn, struct nfs4_delegation, dl_cb_notify);
> +
> +	nfs4_put_stid(&dp->dl_stid);
> +}
> +
>  static const struct nfsd4_callback_ops nfsd4_cb_recall_any_ops = {
>  	.done		= nfsd4_cb_recall_any_done,
>  	.release	= nfsd4_cb_recall_any_release,

So when a client responds with NFS4ERR_DELAY, the RPC framework retries
after 2s. On retry, prepare() is called again, but ncn_evt_cnt is
already 0 (drained in the first prepare). prepare returns false, which
destroys the callback.

Events arriving during the retry window are dropped because
nfsd4_run_cb_notify() returns early when NFSD4_CALLBACK_RUNNING is set.
After the callback is destroyed, future events can queue a new CB_NOTIFY,
but the window's events are lost.                                                                                        

The result is that the client misses notifications. Does this impact
behavioral correctness or spec compliance? Is there a way for that
client to detect the loss and recover?


-- 
Chuck Lever

^ permalink raw reply

* [PATCH bpf-next v4 2/2] selftests/bpf: Add test to ensure kprobe_multi is not sleepable
From: Varun R Mallya @ 2026-04-08 18:35 UTC (permalink / raw)
  To: bpf, leon.hwang, memxor, jolsa
  Cc: ast, daniel, yonghong.song, rostedt, linux-kernel,
	linux-trace-kernel, varunrmallya
In-Reply-To: <20260408183549.92990-1-varunrmallya@gmail.com>

Add a selftest to ensure that kprobe_multi programs cannot be attached
using the BPF_F_SLEEPABLE flag. This test succeeds when the kernel
rejects attachment of kprobe_multi when the BPF_F_SLEEPABLE flag is set.

Suggested-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
---
 .../bpf/prog_tests/kprobe_multi_test.c        | 78 ++++++++++++++++++-
 .../bpf/progs/kprobe_multi_sleepable.c        | 25 ++++++
 2 files changed, 102 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c

diff --git a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
index 78c974d4ea33..e4f9021a84ed 100644
--- a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
+++ b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
@@ -10,6 +10,7 @@
 #include "kprobe_multi_session_cookie.skel.h"
 #include "kprobe_multi_verifier.skel.h"
 #include "kprobe_write_ctx.skel.h"
+#include "kprobe_multi_sleepable.skel.h"
 #include "bpf/libbpf_internal.h"
 #include "bpf/hashmap.h"
 
@@ -220,7 +221,9 @@ static void test_attach_api_syms(void)
 static void test_attach_api_fails(void)
 {
 	LIBBPF_OPTS(bpf_kprobe_multi_opts, opts);
+	LIBBPF_OPTS(bpf_test_run_opts, topts);
 	struct kprobe_multi *skel = NULL;
+	struct kprobe_multi_sleepable *sl_skel = NULL;
 	struct bpf_link *link = NULL;
 	unsigned long long addrs[2];
 	const char *syms[2] = {
@@ -228,7 +231,7 @@ static void test_attach_api_fails(void)
 		"bpf_fentry_test2",
 	};
 	__u64 cookies[2];
-	int saved_error;
+	int saved_error, err;
 
 	addrs[0] = ksym_get_addr("bpf_fentry_test1");
 	addrs[1] = ksym_get_addr("bpf_fentry_test2");
@@ -351,9 +354,39 @@ static void test_attach_api_fails(void)
 	if (!ASSERT_EQ(saved_error, -ENOENT, "fail_8_error"))
 		goto cleanup;
 
+	/* fail_9 - sleepable kprobe multi should not attach */
+	sl_skel = kprobe_multi_sleepable__open();
+	if (!ASSERT_OK_PTR(sl_skel, "sleep_skel_open"))
+		goto cleanup;
+
+	sl_skel->bss->user_ptr = sl_skel;
+
+	err = bpf_program__set_flags(sl_skel->progs.handle_kprobe_multi_sleepable,
+				     BPF_F_SLEEPABLE);
+	if (!ASSERT_OK(err, "sleep_skel_set_flags"))
+		goto cleanup;
+
+	err = kprobe_multi_sleepable__load(sl_skel);
+	if (!ASSERT_OK(err, "sleep_skel_load"))
+		goto cleanup;
+
+	link = bpf_program__attach_kprobe_multi_opts(sl_skel->progs.handle_kprobe_multi_sleepable,
+						     "bpf_fentry_test1", NULL);
+	saved_error = -errno;
+
+	if (!ASSERT_ERR_PTR(link, "fail_9"))
+		goto cleanup;
+
+	if (!ASSERT_EQ(saved_error, -EINVAL, "fail_9_error"))
+		goto cleanup;
+
+	err = bpf_prog_test_run_opts(bpf_program__fd(sl_skel->progs.fentry), &topts);
+	ASSERT_OK(err, "bpf_prog_test_run_opts");
+
 cleanup:
 	bpf_link__destroy(link);
 	kprobe_multi__destroy(skel);
+	kprobe_multi_sleepable__destroy(sl_skel);
 }
 
 static void test_session_skel_api(void)
@@ -609,6 +642,47 @@ static void test_override(void)
 	kprobe_multi_override__destroy(skel);
 }
 
+static void test_attach_multi_sleepable(void)
+{
+	struct kprobe_multi_sleepable *skel;
+	int err;
+
+	LIBBPF_OPTS(bpf_test_run_opts, topts);
+
+	skel = kprobe_multi_sleepable__open();
+	if (!ASSERT_OK_PTR(skel, "kprobe_multi_sleepable__open"))
+		return;
+
+	skel->bss->user_ptr = skel;
+
+	err = bpf_program__set_flags(skel->progs.handle_kprobe_multi_sleepable,
+				     BPF_F_SLEEPABLE);
+	if (!ASSERT_OK(err, "bpf_program__set_flags"))
+		goto cleanup;
+
+	/* Load should succeed even with BPF_F_SLEEPABLE for KPROBE types */
+	err = kprobe_multi_sleepable__load(skel);
+	if (!ASSERT_OK(err, "kprobe_multi_sleepable__load"))
+		goto cleanup;
+
+	skel->links.handle_kprobe_multi_sleepable =
+		bpf_program__attach_kprobe_multi_opts(skel->progs.handle_kprobe_multi_sleepable,
+						      "bpf_fentry_test1", NULL);
+
+	ASSERT_EQ(libbpf_get_error(skel->links.handle_kprobe_multi_sleepable),
+		  -EINVAL, "attach_multi_sleepable_err");
+
+	ASSERT_ERR_PTR(skel->links.handle_kprobe_multi_sleepable,
+		       "bpf_program__attach_kprobe_multi_opts");
+
+	err = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.fentry), &topts);
+
+	ASSERT_OK(err, "bpf_prog_test_run_opts");
+
+cleanup:
+	kprobe_multi_sleepable__destroy(skel);
+}
+
 #ifdef __x86_64__
 static void test_attach_write_ctx(void)
 {
@@ -676,5 +750,7 @@ void test_kprobe_multi_test(void)
 		test_unique_match();
 	if (test__start_subtest("attach_write_ctx"))
 		test_attach_write_ctx();
+	if (test__start_subtest("attach_multi_sleepable"))
+		test_attach_multi_sleepable();
 	RUN_TESTS(kprobe_multi_verifier);
 }
diff --git a/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c b/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c
new file mode 100644
index 000000000000..932e1d9c72e2
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+void *user_ptr = 0;
+
+SEC("kprobe.multi")
+int handle_kprobe_multi_sleepable(struct pt_regs *ctx)
+{
+	int a, err;
+
+	err = bpf_copy_from_user(&a, sizeof(a), user_ptr);
+	barrier_var(a);
+	return err;
+}
+
+SEC("fentry/bpf_fentry_test1")
+int BPF_PROG(fentry)
+{
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.53.0


^ permalink raw reply related

* [PATCH bpf-next v4 1/2] bpf: Reject sleepable kprobe_multi programs at attach time
From: Varun R Mallya @ 2026-04-08 18:35 UTC (permalink / raw)
  To: bpf, leon.hwang, memxor, jolsa
  Cc: ast, daniel, yonghong.song, rostedt, linux-kernel,
	linux-trace-kernel, varunrmallya
In-Reply-To: <20260408183549.92990-1-varunrmallya@gmail.com>

kprobe.multi programs run in atomic/RCU context and cannot sleep.
However, bpf_kprobe_multi_link_attach() did not validate whether the
program being attached had the sleepable flag set, allowing sleepable
helpers such as bpf_copy_from_user() to be invoked from a non-sleepable
context.

This causes a "sleeping function called from invalid context" splat:

  BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169
  in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
  preempt_count: 1, expected: 0
  RCU nest depth: 2, expected: 0

Fix this by rejecting sleepable programs early in
bpf_kprobe_multi_link_attach(), before any further processing.

Fixes: 0dcac2725406 ("bpf: Add multi kprobe link")
Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Leon Hwang <leon.hwang@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
---
 kernel/trace/bpf_trace.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 0b040a417442..af7079aa0f36 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2752,6 +2752,10 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
 	if (!is_kprobe_multi(prog))
 		return -EINVAL;
 
+	/* kprobe_multi is not allowed to be sleepable. */
+	if (prog->sleepable)
+		return -EINVAL;
+
 	/* Writing to context is not allowed for kprobes. */
 	if (prog->aux->kprobe_write_ctx)
 		return -EINVAL;
-- 
2.53.0


^ permalink raw reply related

* [PATCH bpf-next v4 0/2] Reject sleepable kprobe_multi programs at attach time
From: Varun R Mallya @ 2026-04-08 18:35 UTC (permalink / raw)
  To: bpf, leon.hwang, memxor, jolsa
  Cc: ast, daniel, yonghong.song, rostedt, linux-kernel,
	linux-trace-kernel, varunrmallya

These patches fix an issue where sleepable kprobe_multi programs
were allowed to attach, leading to "sleeping function called from invalid
context" splats.

Because kprobe.multi programs run in atomic/RCU context, they cannot
sleep. However, `bpf_kprobe_multi_link_attach()` previously lacked
validation for the `prog->sleepable` flag. This allowed sleepable
helpers, such as `bpf_copy_from_user()`, to be invoked from an invalid
non-sleepable context.

This series addresses the issue by:
1. Rejecting sleepable kprobe_multi programs early in
   `bpf_kprobe_multi_link_attach()` by returning -EINVAL.
2. Adding selftests to explicitly verify that attaching a sleepable
   kprobe_multi program is rejected by the kernel.

P.S: The first of these two commits has been applied to the bpf tree.

Changes:
v1->v2:
- v1: https://lore.kernel.org/bpf/20260401134921.362148-1-varunrmallya@gmail.com/
- Defective selftest added
v2->v3:
- v2: https://lore.kernel.org/bpf/CAP01T74YgnKop-dgwBToOcfg4_D44t1wUBopFYPMquirCmaLfg@mail.gmail.com/
- Selftest separated from change into different commit.
v3->v4:
- v3: https://lore.kernel.org/bpf/20260401191126.440683-1-varunrmallya@gmail.com/
- Selftest moved to test_attach_api_fails.
- Changed attachment symbol to bpf_fentry_test1 for stability.
- Changes suggested by Leon implemented.

Varun R Mallya (2):
  bpf: Reject sleepable kprobe_multi programs at attach time
  selftests/bpf: Add test to ensure kprobe_multi is not sleepable

 kernel/trace/bpf_trace.c                      |  4 +
 .../bpf/prog_tests/kprobe_multi_test.c        | 78 ++++++++++++++++++-
 .../bpf/progs/kprobe_multi_sleepable.c        | 25 ++++++
 3 files changed, 106 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c

-- 
2.53.0

^ permalink raw reply

* Re: [PATCH 13/24] nfsd: add notification handlers for dir events
From: Chuck Lever @ 2026-04-08 18:34 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
	Chuck Lever, Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein
  Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-13-aaf68c478abd@kernel.org>



On Tue, Apr 7, 2026, at 9:21 AM, Jeff Layton wrote:
> Add the necessary parts to accept a fsnotify callback for directory
> change event and create a CB_NOTIFY request for it. When a dir nfsd_file
> is created set a handle_event callback to handle the notification.
>
> Use that to allocate a nfsd_notify_event object and then hand off a
> reference to each delegation's CB_NOTIFY. If anything fails along the
> way, recall any affected delegations.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index b2b8c454fc0f..339c3d0bb575 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c

> @@ -9796,3 +9887,118 @@ nfsd_get_dir_deleg(struct nfsd4_compound_state 
> *cstate,
>  	put_nfs4_file(fp);
>  	return ERR_PTR(status);
>  }
> +
> +static void
> +nfsd4_run_cb_notify(struct nfsd4_cb_notify *ncn)
> +{
> +	struct nfs4_delegation *dp = container_of(ncn, struct 
> nfs4_delegation, dl_cb_notify);
> +
> +	if (test_and_set_bit(NFSD4_CALLBACK_RUNNING, &ncn->ncn_cb.cb_flags))
> +		return;
> +
> +	if (!refcount_inc_not_zero(&dp->dl_stid.sc_count))
> +		clear_bit(NFSD4_CALLBACK_RUNNING, &ncn->ncn_cb.cb_flags);
> +	else
> +		nfsd4_run_cb(&ncn->ncn_cb);
> +}
> +
> +static struct nfsd_notify_event *
> +alloc_nfsd_notify_event(u32 mask, const struct qstr *q, struct dentry 
> *dentry)
> +{
> +	struct nfsd_notify_event *ne;
> +
> +	ne = kmalloc(sizeof(*ne) + q->len + 1, GFP_KERNEL);
> +	if (!ne)
> +		return NULL;
> +
> +	memcpy(&ne->ne_name, q->name, q->len);
> +	refcount_set(&ne->ne_ref, 1);
> +	ne->ne_mask = mask;
> +	ne->ne_name[q->len] = '\0';
> +	ne->ne_namelen = q->len;
> +	ne->ne_dentry = dget(dentry);
> +	return ne;
> +}
> +
> +static bool
> +should_notify_deleg(u32 mask, struct file_lease *fl)
> +{
> +	/* Only nfsd leases */
> +	if (fl->fl_lmops != &nfsd_lease_mng_ops)
> +		return false;
> +
> +	/* Skip if this event wasn't ignored by the lease */
> +	if ((mask & FS_DELETE) && !(fl->c.flc_flags & FL_IGN_DIR_DELETE))
> +		return false;
> +	if ((mask & FS_CREATE) && !(fl->c.flc_flags & FL_IGN_DIR_CREATE))
> +		return false;
> +	if ((mask & FS_RENAME) && !(fl->c.flc_flags & FL_IGN_DIR_RENAME))
> +		return false;
> +
> +	return true;
> +}

For a cross-directory rename, vfs_rename calls try_break_deleg(old_dir,
LEASE_BREAK_DIR_DELETE, ...). A delegation with FL_IGN_DIR_DELETE
(subscribed to NOTIFY4_REMOVE_ENTRY) suppresses the lease break, which
is correct.

But fsnotify delivers FS_RENAME on old_dir, not FS_DELETE. In
should_notify_deleg(), the check (mask & FS_RENAME) &&
!(fl->c.flc_flags & FL_IGN_DIR_RENAME) fails, because the delegation
has FL_IGN_DIR_DELETE but not FL_IGN_DIR_RENAME. No notification is
sent.

IIUC, a client subscribed to NOTIFY4_REMOVE_ENTRY for old_dir sees
neither a lease break nor a CB_NOTIFY when a child is renamed out of
the directory. Is that behavior correct?


> +
> +static void
> +nfsd_recall_all_dir_delegs(const struct inode *dir)
> +{
> +	struct file_lock_context *ctx = locks_inode_context(dir);
> +	struct file_lock_core *flc;
> +
> +	spin_lock(&ctx->flc_lock);
> +	list_for_each_entry(flc, &ctx->flc_lease, flc_list) {
> +		struct file_lease *fl = container_of(flc, struct file_lease, c);
> +
> +		if (fl->fl_lmops == &nfsd_lease_mng_ops)
> +			nfsd_break_deleg_cb(fl);
> +	}
> +	spin_unlock(&ctx->flc_lock);
> +}
> +
> +int
> +nfsd_handle_dir_event(u32 mask, const struct inode *dir, const void 
> *data,
> +		      int data_type, const struct qstr *name)
> +{
> +	struct dentry *dentry = fsnotify_data_dentry(data, data_type);
> +	struct file_lock_context *ctx;
> +	struct file_lock_core *flc;
> +	struct nfsd_notify_event *evt;
> +
> +	/* Don't do anything if this is not an expected event */
> +	if (!(mask & (FS_CREATE|FS_DELETE|FS_RENAME)))
> +		return 0;
> +
> +	ctx = locks_inode_context(dir);
> +	if (!ctx || list_empty(&ctx->flc_lease))
> +		return 0;
> +
> +	evt = alloc_nfsd_notify_event(mask, name, dentry);
> +	if (!evt) {
> +		nfsd_recall_all_dir_delegs(dir);
> +		return 0;
> +	}
> +
> +	spin_lock(&ctx->flc_lock);
> +	list_for_each_entry(flc, &ctx->flc_lease, flc_list) {
> +		struct file_lease *fl = container_of(flc, struct file_lease, c);
> +		struct nfs4_delegation *dp = flc->flc_owner;
> +		struct nfsd4_cb_notify *ncn = &dp->dl_cb_notify;
> +
> +		if (!should_notify_deleg(mask, fl))
> +			continue;
> +
> +		spin_lock(&ncn->ncn_lock);
> +		if (ncn->ncn_evt_cnt >= NOTIFY4_EVENT_QUEUE_SIZE) {
> +			/* We're generating notifications too fast. Recall. */
> +			spin_unlock(&ncn->ncn_lock);
> +			nfsd_break_deleg_cb(fl);
> +			continue;
> +		}
> +		ncn->ncn_evt[ncn->ncn_evt_cnt++] = nfsd_notify_event_get(evt);
> +		spin_unlock(&ncn->ncn_lock);
> +
> +		nfsd4_run_cb_notify(ncn);
> +	}
> +	spin_unlock(&ctx->flc_lock);
> +	nfsd_notify_event_put(evt);
> +	return 0;
> +}


-- 
Chuck Lever

^ permalink raw reply

* Re: [PATCH 08/24] nfsd: update the fsnotify mark when setting or removing a dir delegation
From: Chuck Lever @ 2026-04-08 18:24 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
	Chuck Lever, Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein
  Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-8-aaf68c478abd@kernel.org>


On Tue, Apr 7, 2026, at 9:21 AM, Jeff Layton wrote:
> Add a new helper function that will update the mask on the nfsd_file's
> fsnotify_mark to be a union of all current directory delegations on an
> inode. Call that when directory delegations are added or removed.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index c8fb84c38637..9a4cff08c67d 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c

> @@ -1266,6 +1297,7 @@ static void nfs4_unlock_deleg_lease(struct 
> nfs4_delegation *dp)
>  	WARN_ON_ONCE(!fp->fi_delegees);
> 
>  	nfsd4_finalize_deleg_timestamps(dp, nf->nf_file);
> +	nfsd_fsnotify_recalc_mask(nf);
>  	kernel_setlease(nf->nf_file, F_UNLCK, NULL, (void **)&dp);
>  	put_deleg_file(fp);
>  }

The grant path in nfsd_get_dir_deleg() uses a different ordering
(setlease first, recalc_mask after).

Here, since the delegation being removed is still in flc_lease,
inode_lease_ignore_mask() includes its ignore flags. The mask is
computed as if the delegation is still present.

The result is that stale FS_CREATE/FS_DELETE/FS_RENAME bits remain
in the fsnotify mark. It might be harmless in practice since the
handler finds no leases and returns early, but it creates
unnecessary work.

Should nfs4_unlock_deleg_lease call nfsd_fsnotify_recalc_mask()
after kernel_setlease(F_UNLCK)?


-- 
Chuck Lever

^ permalink raw reply

* Re: [RFC PATCH 3/4] livepatch: Add "replaceable" attribute to klp_patch
From: Song Liu @ 2026-04-08 18:19 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Yafang Shao, Joe Lawrence, Dylan Hatch, jpoimboe, jikos, mbenes,
	rostedt, mhiramat, mathieu.desnoyers, kpsingh, mattbobrowski,
	jolsa, ast, daniel, andrii, martin.lau, eddyz87, memxor,
	yonghong.song, live-patching, linux-kernel, linux-trace-kernel,
	bpf
In-Reply-To: <adY_WgA54CDtWBq6@pathway.suse.cz>

On Wed, Apr 8, 2026 at 4:43 AM Petr Mladek <pmladek@suse.com> wrote:
[...]
> > >
> > > This is weird semantic. Which livepatch tag would be allowed to
> > > supersede it, please?
> > >
> > > Do we still need this category?
> >
> > It can be superseded by any livepatch that has a non-zero tag set.
>
> And this exactly the weird thing.
>
> A patch with the .replace flag set is supposed to obsolete all already
> installed livepatches. It means that it should provide all existing
> fixes and features.
>
> Now, we want to introduce a replace flag/set which would allow to
> replace/obsolete only the livepatch with the same tag/set number.
> And we want to prevent conflicts by making sure that livepatches with
> different tag/set number will never livepatch the same function.
>
> Obviously, livepatches with different tag/set number could not
> obsolete the same no-replace livepatch. They would need to livepatch
> the same functions touched by the no-replace livepatch and would
> conflict.
>
> So, I suggest to remove the no-replace mode completely. It should
> not be needed. A livepatch which should be installed in parallel
> will simply use another unique tag/set number.

I think I see your point now. Existing code works as:
- replace=false doesn't replace anything
- replace=true replaces everything

If we assume false=0 and true=1, it is technically possible to define:
- replace_set=0 doesn't replace anything
- replace_set=1 replaces everything
- replace_set=2+ only replace the same replace_set

This is probably a little too complicated.

> > This ensures backward compatibility: while a non-atomic-replace
> > livepatch can be superseded by an atomic-replace one, the reverse is
> > not permitted—an atomic-replace livepatch cannot be superseded by a
> > non-atomic one.
>
> IMHO, the backward compatibility would just create complexity and mess
> in this case.

Given that livepatch is for expert users, I think we can make this work
without backward compatibility. But breaking compatibility is always not
preferred.

Thanks,
Song

^ permalink raw reply

* Re: [PATCH 01/24] filelock: add support for ignoring deleg breaks for dir change events
From: Chuck Lever @ 2026-04-08 18:16 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
	Chuck Lever, Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein
  Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-1-aaf68c478abd@kernel.org>

On Tue, Apr 7, 2026, at 9:21 AM, Jeff Layton wrote:
> If a NFS client requests a directory delegation with a notification
> bitmask covering directory change events, the server shouldn't recall
> the delegation. Instead the client will be notified of the change after
> the fact.
>
> Add support for ignoring lease breaks on directory changes. Add a new
> flags parameter to try_break_deleg() and teach __break_lease how to
> ignore certain types of delegation break events.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---

> diff --git a/fs/locks.c b/fs/locks.c
> index 8e44b1f6c15a..dafa0752fdce 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c

> @@ -1670,7 +1709,7 @@ int __break_lease(struct inode *inode, unsigned int flags)
>  			locks_delete_lock_ctx(&fl->c, &dispose);
>  	}
> 
> -	if (list_empty(&ctx->flc_lease))
> +	if (!visible_leases_remaining(inode, flags))
>  		goto out;
> 
>  	if (flags & LEASE_BREAK_NONBLOCK) {

After breaking visible leases, the restart: label calls any_leases_conflict()
which does not filter ignored dir-delegation leases. When only ignored leases
remain, any_leases_conflict returns true, but visible_leases_remaining also
returned true (triggering the wait). The code picks the first lease (possibly
ignored), computes break_time = 1 jiffy, blocks, then loops.                                                     

For example, suppose you have two directory delegations on a directory, one
with FL_IGN_DIR_DELETE and one without. After the non-ignored one is broken
and removed, the ignored one keeps any_leases_conflict returning true. The
loop spins at 1-jiffy intervals until the ignored delegation is released.  

Should the restart: block skip ignored leases?

-- 
Chuck Lever

^ permalink raw reply

* Re: [PATCH 1/2] tracing: Store trace_marker_raw payload length in events
From: Steven Rostedt @ 2026-04-08 17:39 UTC (permalink / raw)
  To: Cao Ruichuang
  Cc: mhiramat, mathieu.desnoyers, shuah, linux-kernel,
	linux-trace-kernel, linux-kselftest
In-Reply-To: <20260408153241.15391-1-create0818@163.com>

On Wed,  8 Apr 2026 23:32:40 +0800
Cao Ruichuang <create0818@163.com> wrote:

> trace_marker_raw currently records its bytes in TRACE_RAW_DATA events,
> but the event output path derives the byte count from the padded record
> size in the ring buffer. As a result, the printed raw-data payload is
> rounded up and small writes do not preserve their true length.
> 
> Keep the true payload length in the TRACE_RAW_DATA event itself and use
> that field when printing the bytes. This leaves the ring buffer record
> size semantics unchanged while letting trace_marker_raw report the exact
> payload that was written.

May I ask why?  The above describes what is happening but fails to
leave out the why? Why does the payload length need to be added to the
event? I mean, it's recording raw data, and the user who writes to it
already knows the length as this was made for applications to write
structures directly into the buffer. When reading back from the buffer
the structure size is the length.

Thus, why record the length? I see no reason to. The length wastes
precious space in the ring buffer when the user of trace_marker_raw
should already know its length.

-- Steve

^ permalink raw reply

* Re: [PATCH v2 1/2] tracing/hist: rebuild full_name on each hist_field_name() call
From: Tom Zanussi @ 2026-04-08 17:18 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Pengpeng Hou, mhiramat, mathieu.desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260408122514.60bbfd61@gandalf.local.home>

On Wed, 2026-04-08 at 12:25 -0400, Steven Rostedt wrote:
> On Wed, 08 Apr 2026 10:58:06 -0500
> Tom Zanussi <zanussi@kernel.org> wrote:
> 
> Hi Tom,
> 
> > ->system is set when using fully-qualified variable names. For  
> > instance:
> > 
> > echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> sys/kernel/debug/tracing/events/sched/sched_waking/trigger
> > echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
> > echo 'hist:keys=next_pid:lat0=common_timestamp.usecs-sched.sched_waking.$ts0:lat1=common_timestamp.usecs-sched.sched_wakeup.$ts0' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
> > echo 'hist:keys=next_pid:vals=$lat0,$lat1' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
> > 
> > Here, the sched_switch trigger would error out if the unqualified $ts0
> > variables were used instead of the fully-qualified ones because there's
> > no way to distinguish which $ts0 was meant.
> > 
> 
> Yep I see that now. I never had a need to use it before, but I probably
> should implement this in libtracefs to be safe.
> 
> We should definitely add a selftest that tests this. There's one case that
> does use it but it doesn't use multiple ones. We should add a test that
> does so.
> 
> trigger-multi-actions-accept.tc has the system, but it's not needed here.
> 
> We should also have a test to test the output of theses lines.

Yeah, definitely. I can try adding this as a test..

Tom


> 
> -- Steve


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox