Linux Trace Kernel
* Re: [PATCH 02/24] filelock: add a tracepoint to start of break_lease()
From: Jan Kara @ 2026-04-08 13:45 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
	Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein, Calum Mackay, linux-fsdevel,
	linux-kernel, linux-trace-kernel, linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-2-aaf68c478abd@kernel.org>

On Tue 07-04-26 09:21:15, Jeff Layton wrote:
> ...mostly to show the LEASE_BREAK_* flags.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

OK. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/locks.c                      |  2 ++
>  include/trace/events/filelock.h | 33 +++++++++++++++++++++++++++++++++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index dafa0752fdce..5af6dca2d46c 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -1654,6 +1654,8 @@ int __break_lease(struct inode *inode, unsigned int flags)
>  	bool want_write = !(flags & LEASE_BREAK_OPEN_RDONLY);
>  	int error = 0;
>  
> +	trace_break_lease(inode, flags);
> +
>  	if (flags & LEASE_BREAK_LEASE)
>  		type = FL_LEASE;
>  	else if (flags & LEASE_BREAK_DELEG)
> diff --git a/include/trace/events/filelock.h b/include/trace/events/filelock.h
> index ef4bb0afb86a..fff0ee2d452d 100644
> --- a/include/trace/events/filelock.h
> +++ b/include/trace/events/filelock.h
> @@ -120,6 +120,39 @@ DEFINE_EVENT(filelock_lock, flock_lock_inode,
>  		TP_PROTO(struct inode *inode, struct file_lock *fl, int ret),
>  		TP_ARGS(inode, fl, ret));
>  
> +#define show_lease_break_flags(val)					\
> +	__print_flags(val, "|",						\
> +		{ LEASE_BREAK_LEASE,		"LEASE" },		\
> +		{ LEASE_BREAK_DELEG,		"DELEG" },		\
> +		{ LEASE_BREAK_LAYOUT,		"LAYOUT" },		\
> +		{ LEASE_BREAK_NONBLOCK,		"NONBLOCK" },		\
> +		{ LEASE_BREAK_OPEN_RDONLY,	"OPEN_RDONLY" },	\
> +		{ LEASE_BREAK_DIR_CREATE,	"DIR_CREATE" },		\
> +		{ LEASE_BREAK_DIR_DELETE,	"DIR_DELETE" },		\
> +		{ LEASE_BREAK_DIR_RENAME,	"DIR_RENAME" })
> +
> +TRACE_EVENT(break_lease,
> +	TP_PROTO(struct inode *inode, unsigned int flags),
> +
> +	TP_ARGS(inode, flags),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, i_ino)
> +		__field(dev_t, s_dev)
> +		__field(unsigned int, flags)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->s_dev = inode->i_sb->s_dev;
> +		__entry->i_ino = inode->i_ino;
> +		__entry->flags = flags;
> +	),
> +
> +	TP_printk("dev=0x%x:0x%x ino=0x%lx flags=%s",
> +		  MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
> +		  __entry->i_ino, show_lease_break_flags(__entry->flags))
> +);
> +
>  DECLARE_EVENT_CLASS(filelock_lease,
>  	TP_PROTO(struct inode *inode, struct file_lease *fl),
>  
> 
> -- 
> 2.53.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
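
Editorial sketch: the show_lease_break_flags() table in the patch above maps each flag bit to a short name and joins the names of the set bits with "|". The following user-space C sketch illustrates that idea only. The helper name sketch_format_flags() is hypothetical, the bit positions for LEASE and DELEG are assumed (the thread only shows bits 2 through 8), and unlike the kernel's __print_flags() it silently ignores bits that have no name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct sk_flag_name {
	unsigned int mask;
	const char *name;
};

/* Names follow the patch; the values for LEASE and DELEG are assumptions. */
static const struct sk_flag_name sk_lease_break_names[] = {
	{ 1u << 0, "LEASE" },		/* assumed bit position */
	{ 1u << 1, "DELEG" },		/* assumed bit position */
	{ 1u << 2, "LAYOUT" },
	{ 1u << 3, "NONBLOCK" },
	{ 1u << 4, "OPEN_RDONLY" },
	{ 1u << 6, "DIR_CREATE" },
	{ 1u << 7, "DIR_DELETE" },
	{ 1u << 8, "DIR_RENAME" },
};

/* Write the names of all set, known bits into buf, joined by '|'. */
static char *sketch_format_flags(unsigned int flags, char *buf, size_t len)
{
	size_t count = sizeof(sk_lease_break_names) / sizeof(sk_lease_break_names[0]);
	size_t used = 0;
	size_t i;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	for (i = 0; i < count; i++) {
		int n;

		if (!(flags & sk_lease_break_names[i].mask))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", sk_lease_break_names[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* output was truncated; stop here */
		used += (size_t)n;
	}
	return buf;
}
```

With these assumed values, a flags word with bits 0 and 6 set would be rendered as "LEASE|DIR_CREATE", and a zero flags word as an empty string.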


* Re: [PATCH 07/24] vfs: add fsnotify_modify_mark_mask()
From: Jan Kara @ 2026-04-08 13:51 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
	Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein, Calum Mackay, linux-fsdevel,
	linux-kernel, linux-trace-kernel, linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-7-aaf68c478abd@kernel.org>

On Tue 07-04-26 09:21:20, Jeff Layton wrote:
> nfsd needs to be able to modify the mask on an existing mark when new
> directory delegations are set or unset. Add an exported function that
> allows the caller to set and clear bits in the mark->mask, and does
> the recalculation if something changed.
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza


> ---
>  fs/notify/mark.c                 | 29 +++++++++++++++++++++++++++++
>  include/linux/fsnotify_backend.h |  1 +
>  2 files changed, 30 insertions(+)
> 
> diff --git a/fs/notify/mark.c b/fs/notify/mark.c
> index c2ed5b11b0fe..b1e73c6fd382 100644
> --- a/fs/notify/mark.c
> +++ b/fs/notify/mark.c
> @@ -310,6 +310,35 @@ void fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
>  		fsnotify_conn_set_children_dentry_flags(conn);
>  }
>  
> +/**
> + * fsnotify_modify_mark_mask - set and/or clear flags in a mark's mask
> + * @mark: mark to be modified
> + * @set: bits to be set in mask
> + * @clear: bits to be cleared in mask
> + *
> + * Modify a fsnotify_mark mask as directed, and update its associated conn.
> + * The caller is expected to hold a reference to the mark.
> + */
> +void fsnotify_modify_mark_mask(struct fsnotify_mark *mark, u32 set, u32 clear)
> +{
> +	bool recalc = false;
> +	u32 mask;
> +
> +	WARN_ON_ONCE(clear & set);
> +
> +	spin_lock(&mark->lock);
> +	mask = mark->mask;
> +	mark->mask |= set;
> +	mark->mask &= ~clear;
> +	if (mark->mask != mask)
> +		recalc = true;
> +	spin_unlock(&mark->lock);
> +
> +	if (recalc)
> +		fsnotify_recalc_mask(mark->connector);
> +}
> +EXPORT_SYMBOL_GPL(fsnotify_modify_mark_mask);
> +
>  /* Free all connectors queued for freeing once SRCU period ends */
>  static void fsnotify_connector_destroy_workfn(struct work_struct *work)
>  {
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index 95985400d3d8..66e185bd1b1b 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -917,6 +917,7 @@ extern void fsnotify_get_mark(struct fsnotify_mark *mark);
>  extern void fsnotify_put_mark(struct fsnotify_mark *mark);
>  extern void fsnotify_finish_user_wait(struct fsnotify_iter_info *iter_info);
>  extern bool fsnotify_prepare_user_wait(struct fsnotify_iter_info *iter_info);
> +extern void fsnotify_modify_mark_mask(struct fsnotify_mark *mark, u32 set, u32 clear);
>  
>  static inline void fsnotify_init_event(struct fsnotify_event *event)
>  {
> 
> -- 
> 2.53.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
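
Editorial sketch: the core of fsnotify_modify_mark_mask() as quoted above is "apply the set bits, apply the clear bits, and only recalculate if the mask actually changed". The user-space C sketch below shows just that bit manipulation. struct sk_mark and sk_modify_mask() are hypothetical stand-ins, and the locking and the connector recalculation from the patch are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the mask field of an fsnotify mark. */
struct sk_mark {
	uint32_t mask;
};

/*
 * Set and clear bits in mark->mask. Returns true when the mask changed,
 * which is the case where the patch goes on to recalculate the
 * connector's mask.
 */
static bool sk_modify_mask(struct sk_mark *mark, uint32_t set, uint32_t clear)
{
	uint32_t old = mark->mask;

	mark->mask |= set;
	mark->mask &= ~clear;
	return mark->mask != old;
}
```

Returning the "changed" result rather than recalculating unconditionally mirrors the patch's recalc flag: a call that sets bits already set and clears bits already clear is a no-op.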


* Re: [PATCH 03/24] filelock: add an inode_lease_ignore_mask helper
From: Jan Kara @ 2026-04-08 13:53 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
	Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein, Calum Mackay, linux-fsdevel,
	linux-kernel, linux-trace-kernel, linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-3-aaf68c478abd@kernel.org>

On Tue 07-04-26 09:21:16, Jeff Layton wrote:
> Add a new routine that returns a mask of all dir change events that are
> currently ignored by any leases. nfsd will use this to determine how to
> configure the fsnotify_mark mask.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/locks.c               | 32 ++++++++++++++++++++++++++++++++
>  include/linux/filelock.h |  1 +
>  2 files changed, 33 insertions(+)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index 5af6dca2d46c..04980b065734 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -1597,6 +1597,38 @@ any_leases_conflict(struct inode *inode, struct file_lease *breaker)
>  	return false;
>  }
>  
> +#define IGNORE_MASK	(FL_IGN_DIR_CREATE | FL_IGN_DIR_DELETE | FL_IGN_DIR_RENAME)
> +
> +/**
> + * inode_lease_ignore_mask - return union of all ignored inode events for this inode
> + * @inode: inode of which to get ignore mask
> + *
> + * Walk the list of leases, and return the result of all of
> + * their FL_IGN_DIR_* bits or'ed together.
> + */
> +u32
> +inode_lease_ignore_mask(struct inode *inode)
> +{
> +	struct file_lock_context *ctx;
> +	struct file_lock_core *flc;
> +	u32 mask = 0;
> +
> +	ctx = locks_inode_context(inode);
> +	if (!ctx)
> +		return 0;
> +
> +	spin_lock(&ctx->flc_lock);
> +	list_for_each_entry(flc, &ctx->flc_lease, flc_list) {
> +		mask |= flc->flc_flags & IGNORE_MASK;
> +		/* If we already have everything, we can stop */
> +		if (mask == IGNORE_MASK)
> +			break;
> +	}
> +	spin_unlock(&ctx->flc_lock);
> +	return mask;
> +}
> +EXPORT_SYMBOL_GPL(inode_lease_ignore_mask);
> +
>  static bool
>  ignore_dir_deleg_break(struct file_lease *fl, unsigned int flags)
>  {
> diff --git a/include/linux/filelock.h b/include/linux/filelock.h
> index 5a19cdb047da..416483b136f1 100644
> --- a/include/linux/filelock.h
> +++ b/include/linux/filelock.h
> @@ -236,6 +236,7 @@ int generic_setlease(struct file *, int, struct file_lease **, void **priv);
>  int kernel_setlease(struct file *, int, struct file_lease **, void **);
>  int vfs_setlease(struct file *, int, struct file_lease **, void **);
>  int lease_modify(struct file_lease *, int, struct list_head *);
> +u32 inode_lease_ignore_mask(struct inode *inode);
>  
>  struct notifier_block;
>  int lease_register_notifier(struct notifier_block *);
> 
> -- 
> 2.53.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
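
Editorial sketch: inode_lease_ignore_mask() as quoted above ORs together the ignore bits of every lease and stops early once all of them are present. A minimal user-space C sketch of that accumulation follows. The SK_IGN_* values are hypothetical placeholders for the FL_IGN_DIR_* flags, and a plain array stands in for the lease list and its spinlock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values; only the roles of the flags come from the patch. */
#define SK_IGN_DIR_CREATE	(1u << 0)
#define SK_IGN_DIR_DELETE	(1u << 1)
#define SK_IGN_DIR_RENAME	(1u << 2)
#define SK_IGNORE_MASK \
	(SK_IGN_DIR_CREATE | SK_IGN_DIR_DELETE | SK_IGN_DIR_RENAME)

/* Return the union of the ignore bits found in n per-lease flag words. */
static uint32_t sk_ignore_mask(const uint32_t *lease_flags, size_t n)
{
	uint32_t mask = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		mask |= lease_flags[i] & SK_IGNORE_MASK;
		/* Every ignore bit is already set, so stop walking. */
		if (mask == SK_IGNORE_MASK)
			break;
	}
	return mask;
}
```

Bits outside SK_IGNORE_MASK in a lease's flags are masked off before the OR, so unrelated lease flags never leak into the result.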


* Re: [PATCH 08/24] nfsd: update the fsnotify mark when setting or removing a dir delegation
From: Jan Kara @ 2026-04-08 13:53 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
	Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein, Calum Mackay, linux-fsdevel,
	linux-kernel, linux-trace-kernel, linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-8-aaf68c478abd@kernel.org>

On Tue 07-04-26 09:21:21, Jeff Layton wrote:
> Add a new helper function that will update the mask on the nfsd_file's
> fsnotify_mark to be a union of all current directory delegations on an
> inode. Call that when directory delegations are added or removed.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/nfsd/nfs4state.c | 33 +++++++++++++++++++++++++++++++++
>  1 file changed, 33 insertions(+)
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index c8fb84c38637..9a4cff08c67d 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -1258,6 +1258,37 @@ static void nfsd4_finalize_deleg_timestamps(struct nfs4_delegation *dp, struct f
>  	}
>  }
>  
> +static void nfsd_fsnotify_recalc_mask(struct nfsd_file *nf)
> +{
> +	struct fsnotify_mark *mark = &nf->nf_mark->nfm_mark;
> +	struct inode *inode = file_inode(nf->nf_file);
> +	u32 lease_mask, set = 0, clear = 0;
> +
> +	/* This is only needed when adding or removing dir delegs */
> +	if (!S_ISDIR(inode->i_mode))
> +		return;
> +
> +	/* Set up notifications for any ignored delegation events */
> +	lease_mask = inode_lease_ignore_mask(inode);
> +
> +	if (lease_mask & FL_IGN_DIR_CREATE)
> +		set |= FS_CREATE;
> +	else
> +		clear |= FS_CREATE;
> +
> +	if (lease_mask & FL_IGN_DIR_DELETE)
> +		set |= FS_DELETE;
> +	else
> +		clear |= FS_DELETE;
> +
> +	if (lease_mask & FL_IGN_DIR_RENAME)
> +		set |= FS_RENAME;
> +	else
> +		clear |= FS_RENAME;
> +
> +	fsnotify_modify_mark_mask(mark, set, clear);
> +}
> +
>  static void nfs4_unlock_deleg_lease(struct nfs4_delegation *dp)
>  {
>  	struct nfs4_file *fp = dp->dl_stid.sc_file;
> @@ -1266,6 +1297,7 @@ static void nfs4_unlock_deleg_lease(struct nfs4_delegation *dp)
>  	WARN_ON_ONCE(!fp->fi_delegees);
>  
>  	nfsd4_finalize_deleg_timestamps(dp, nf->nf_file);
> +	nfsd_fsnotify_recalc_mask(nf);
>  	kernel_setlease(nf->nf_file, F_UNLCK, NULL, (void **)&dp);
>  	put_deleg_file(fp);
>  }
> @@ -9652,6 +9684,7 @@ nfsd_get_dir_deleg(struct nfsd4_compound_state *cstate,
>  
>  	if (!status) {
>  		put_nfs4_file(fp);
> +		nfsd_fsnotify_recalc_mask(nf);
>  		return dp;
>  	}
>  
> 
> -- 
> 2.53.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
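
Editorial sketch: nfsd_fsnotify_recalc_mask() as quoted above turns each ignored lease event into either a bit to set or a bit to clear in the notification mask. The user-space C sketch below expresses the same mapping as a small table. All SK_* constants and sk_lease_to_notify() are hypothetical; the real flag values live in the kernel headers and are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for FL_IGN_DIR_* and FS_* respectively. */
#define SK_IGN_DIR_CREATE	(1u << 0)
#define SK_IGN_DIR_DELETE	(1u << 1)
#define SK_IGN_DIR_RENAME	(1u << 2)
#define SK_FS_CREATE		(1u << 8)
#define SK_FS_DELETE		(1u << 9)
#define SK_FS_RENAME		(1u << 10)

struct sk_mask_update {
	uint32_t set;	/* events to start watching */
	uint32_t clear;	/* events to stop watching */
};

/* Each ignored lease event is watched; every other event is unwatched. */
static struct sk_mask_update sk_lease_to_notify(uint32_t lease_mask)
{
	static const struct {
		uint32_t ign;
		uint32_t event;
	} map[] = {
		{ SK_IGN_DIR_CREATE, SK_FS_CREATE },
		{ SK_IGN_DIR_DELETE, SK_FS_DELETE },
		{ SK_IGN_DIR_RENAME, SK_FS_RENAME },
	};
	struct sk_mask_update u = { 0, 0 };
	size_t i;

	for (i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
		if (lease_mask & map[i].ign)
			u.set |= map[i].event;
		else
			u.clear |= map[i].event;
	}
	return u;
}
```

Because every event lands in exactly one of the two words, set and clear never overlap, which is the condition the WARN_ON_ONCE() in fsnotify_modify_mark_mask() checks for.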


* Re: [PATCH 00/24] vfs/nfsd: add support for CB_NOTIFY callbacks in directory delegations
From: Jan Kara @ 2026-04-08 13:55 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
	Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein, Calum Mackay, linux-fsdevel,
	linux-kernel, linux-trace-kernel, linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-0-aaf68c478abd@kernel.org>

On Tue 07-04-26 09:21:13, Jeff Layton wrote:
> This patchset builds on the directory delegation work we did a few
> months ago, to add support for CB_NOTIFY callbacks for some events. In
> particular, creates, unlinks and renames. The server also sends updated
> directory attributes in the notifications. With this support, the client
> can register interest in a directory and get notifications about changes
> within it without losing its lease.
> 
> The series starts with patches to allow the vfs to ignore certain types
> of events on directories. nfsd can then request these sorts of
> delegations on directories, and then set up inotify watches on the
> directory to trigger sending CB_NOTIFY events.
> 
> This has mainly been tested with pynfs, with some new testcases that
> I'll be posting soon. They seem to work fine with those tests, but I
> don't think we'll want to merge these until we have a complete
> client-side implementation to test against.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

The fsnotify changes and generic file locking changes look OK to me. I
don't feel confident enough with NFSD stuff to really review that :)

								Honza

> ---
> Jeff Layton (24):
>       filelock: add support for ignoring deleg breaks for dir change events
>       filelock: add a tracepoint to start of break_lease()
>       filelock: add an inode_lease_ignore_mask helper
>       nfsd: add protocol support for CB_NOTIFY
>       nfs_common: add new NOTIFY4_* flags proposed in RFC8881bis
>       nfsd: allow nfsd to get a dir lease with an ignore mask
>       vfs: add fsnotify_modify_mark_mask()
>       nfsd: update the fsnotify mark when setting or removing a dir delegation
>       nfsd: make nfsd4_callback_ops->prepare operation bool return
>       nfsd: add callback encoding and decoding linkages for CB_NOTIFY
>       nfsd: use RCU to protect fi_deleg_file
>       nfsd: add data structures for handling CB_NOTIFY
>       nfsd: add notification handlers for dir events
>       nfsd: add tracepoint to dir_event handler
>       nfsd: apply the notify mask to the delegation when requested
>       nfsd: add helper to marshal a fattr4 from completed args
>       nfsd: allow nfsd4_encode_fattr4_change() to work with no export
>       nfsd: send basic file attributes in CB_NOTIFY
>       nfsd: allow encoding a filehandle into fattr4 without a svc_fh
>       nfsd: add a fi_connectable flag to struct nfs4_file
>       nfsd: add the filehandle to returned attributes in CB_NOTIFY
>       nfsd: properly track requested child attributes
>       nfsd: track requested dir attributes
>       nfsd: add support to CB_NOTIFY for dir attribute changes
> 
>  Documentation/sunrpc/xdr/nfs4_1.x    | 264 ++++++++++++++-
>  fs/attr.c                            |   2 +-
>  fs/locks.c                           |  89 +++++-
>  fs/namei.c                           |  31 +-
>  fs/nfsd/filecache.c                  |  57 +++-
>  fs/nfsd/nfs4callback.c               |  60 +++-
>  fs/nfsd/nfs4layouts.c                |   5 +-
>  fs/nfsd/nfs4proc.c                   |  15 +
>  fs/nfsd/nfs4state.c                  | 524 ++++++++++++++++++++++++++----
>  fs/nfsd/nfs4xdr.c                    | 300 ++++++++++++++---
>  fs/nfsd/nfs4xdr_gen.c                | 601 ++++++++++++++++++++++++++++++++++-
>  fs/nfsd/nfs4xdr_gen.h                |  20 +-
>  fs/nfsd/state.h                      |  70 +++-
>  fs/nfsd/trace.h                      |  21 ++
>  fs/nfsd/xdr4.h                       |   5 +
>  fs/nfsd/xdr4cb.h                     |  12 +
>  fs/notify/mark.c                     |  29 ++
>  fs/posix_acl.c                       |   4 +-
>  fs/xattr.c                           |   4 +-
>  include/linux/filelock.h             |  54 +++-
>  include/linux/fsnotify_backend.h     |   1 +
>  include/linux/nfs4.h                 | 127 --------
>  include/linux/sunrpc/xdrgen/nfs4_1.h | 291 ++++++++++++++++-
>  include/trace/events/filelock.h      |  38 ++-
>  include/uapi/linux/nfs4.h            |   2 -
>  25 files changed, 2321 insertions(+), 305 deletions(-)
> ---
> base-commit: bd5b9fd5e3d55bc412cec4bebe5a11da2151de4a
> change-id: 20260325-dir-deleg-339066dd1017
> 
> Best regards,
> -- 
> Jeff Layton <jlayton@kernel.org>
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 01/24] filelock: add support for ignoring deleg breaks for dir change events
From: Jeff Layton @ 2026-04-08 14:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Alexander Viro, Christian Brauner, Chuck Lever, Alexander Aring,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, NeilBrown, Olga Kornievskaia,
	Dai Ngo, Tom Talpey, Trond Myklebust, Anna Schumaker,
	Amir Goldstein, Calum Mackay, linux-fsdevel, linux-kernel,
	linux-trace-kernel, linux-doc, linux-nfs
In-Reply-To: <snnggefctfffpb3rsyhjdwmxozqdklqmweiojmxy7owettksgz@6vud2iacgeqc>

On Wed, 2026-04-08 at 15:45 +0200, Jan Kara wrote:
> On Tue 07-04-26 09:21:14, Jeff Layton wrote:
> > If an NFS client requests a directory delegation with a notification
> > bitmask covering directory change events, the server shouldn't recall
> > the delegation. Instead the client will be notified of the change after
> > the fact.
> > 
> > Add support for ignoring lease breaks on directory changes. Add a new
> > flags parameter to try_break_deleg() and teach __break_lease how to
> > ignore certain types of delegation break events.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> 
> Looks good. Feel free to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> > @@ -222,6 +225,10 @@ struct file_lease *locks_alloc_lease(void);
> >  #define LEASE_BREAK_LAYOUT		BIT(2)	// break layouts only
> >  #define LEASE_BREAK_NONBLOCK		BIT(3)	// non-blocking break
> >  #define LEASE_BREAK_OPEN_RDONLY		BIT(4)	// readonly open event
> > +#define LEASE_BREAK_DIR_CREATE		BIT(6)  // dir deleg create event
> > +#define LEASE_BREAK_DIR_DELETE		BIT(7)  // dir deleg delete event
> > +#define LEASE_BREAK_DIR_RENAME		BIT(8)  // dir deleg rename event
> 
> Just curious why you've left out bit 5 here... :)
> 
> 								Honza

No reason. I've had this series for a couple of years now, and I think
bit 5 got removed at some point after I originally did this patch, and
I didn't notice when I fixed up the conflict. I'll plan to renumber
this for neatness' sake.

Thanks for the review!
-- 
Jeff Layton <jlayton@kernel.org>


* [PATCH v2] seq_buf: export seq_buf_putmem_hex() and add KUnit tests
From: Shuvam Pandey @ 2026-04-08 14:44 UTC (permalink / raw)
  To: Andrew Morton, Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, shuvampandey1
In-Reply-To: <20260406033728.25998-1-shuvampandey1@gmail.com>

The seq_buf KUnit suite does not exercise seq_buf_putmem_hex().

Add one test for the len > 8 chunking path and one overflow test
where a later chunk no longer fits in the buffer.

Export seq_buf_putmem_hex() as well so SEQ_BUF_KUNIT_TEST=m links
cleanly. Without the export, modpost reports seq_buf_putmem_hex as
undefined when seq_buf_kunit is built as a module.

Signed-off-by: Shuvam Pandey <shuvampandey1@gmail.com>
---
v2:
- export seq_buf_putmem_hex() so SEQ_BUF_KUNIT_TEST=m links cleanly
- validate with a fresh arm64 build using CONFIG_KUNIT=y and CONFIG_SEQ_BUF_KUNIT_TEST=m

 lib/seq_buf.c             |  1 +
 lib/tests/seq_buf_kunit.c | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/lib/seq_buf.c b/lib/seq_buf.c
index f3f3436d60a9403eae5b1ef9b091b027881f14fb..b59488fa8135cdb0340fbeb43d8d74db8ae13146 100644
--- a/lib/seq_buf.c
+++ b/lib/seq_buf.c
@@ -298,6 +298,7 @@ int seq_buf_putmem_hex(struct seq_buf *s, const void *mem,
 	}
 	return 0;
 }
+EXPORT_SYMBOL_GPL(seq_buf_putmem_hex);
 
 /**
  * seq_buf_path - copy a path into the sequence buffer
diff --git a/lib/tests/seq_buf_kunit.c b/lib/tests/seq_buf_kunit.c
index 8a01579a978e655cd09024d0ea9c4c9cd095263f..eb466386bbefb1c81773cdae65a8ac3df91cd8ea 100644
--- a/lib/tests/seq_buf_kunit.c
+++ b/lib/tests/seq_buf_kunit.c
@@ -184,6 +184,38 @@ static void seq_buf_get_buf_commit_test(struct kunit *test)
 	KUNIT_EXPECT_TRUE(test, seq_buf_has_overflowed(&s));
 }
 
+static void seq_buf_putmem_hex_test(struct kunit *test)
+{
+	DECLARE_SEQ_BUF(s, 24);
+	const u8 data[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
+#ifdef __BIG_ENDIAN
+	const char *expected = "0001020304050607 0809 ";
+#else
+	const char *expected = "0706050403020100 0908 ";
+#endif
+
+	KUNIT_EXPECT_EQ(test, seq_buf_putmem_hex(&s, data, sizeof(data)), 0);
+	KUNIT_EXPECT_FALSE(test, seq_buf_has_overflowed(&s));
+	KUNIT_EXPECT_EQ(test, seq_buf_used(&s), strlen(expected));
+	KUNIT_EXPECT_STREQ(test, seq_buf_str(&s), expected);
+}
+
+static void seq_buf_putmem_hex_overflow_test(struct kunit *test)
+{
+	DECLARE_SEQ_BUF(s, 20);
+	const u8 data[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
+#ifdef __BIG_ENDIAN
+	const char *expected = "0001020304050607 ";
+#else
+	const char *expected = "0706050403020100 ";
+#endif
+
+	KUNIT_EXPECT_EQ(test, seq_buf_putmem_hex(&s, data, sizeof(data)), -1);
+	KUNIT_EXPECT_TRUE(test, seq_buf_has_overflowed(&s));
+	KUNIT_EXPECT_EQ(test, seq_buf_used(&s), 20);
+	KUNIT_EXPECT_STREQ(test, seq_buf_str(&s), expected);
+}
+
 static struct kunit_case seq_buf_test_cases[] = {
 	KUNIT_CASE(seq_buf_init_test),
 	KUNIT_CASE(seq_buf_declare_test),
@@ -194,6 +226,8 @@ static struct kunit_case seq_buf_test_cases[] = {
 	KUNIT_CASE(seq_buf_printf_test),
 	KUNIT_CASE(seq_buf_printf_overflow_test),
 	KUNIT_CASE(seq_buf_get_buf_commit_test),
+	KUNIT_CASE(seq_buf_putmem_hex_test),
+	KUNIT_CASE(seq_buf_putmem_hex_overflow_test),
 	{}
 };
 

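
Editorial sketch: the two KUnit cases above exercise the "len > 8" chunking path, where the input is hex-dumped in groups of up to eight bytes, each group followed by a space. The user-space C sketch below reproduces the expected strings from the little-endian branch of the tests. sk_putmem_hex() is a hypothetical helper, it always reverses bytes within a chunk (the little-endian case), and it does not model how seq_buf itself records an overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_CHUNK_BYTES	8u

/*
 * Hex-dump len bytes of mem into out (capacity cap, including the NUL)
 * in chunks of up to SK_CHUNK_BYTES, each followed by a space. Bytes in
 * a chunk are emitted last-to-first. Returns 0 on success, or -1 when a
 * chunk does not fit; chunks written before that point are kept.
 */
static int sk_putmem_hex(char *out, size_t cap,
			 const unsigned char *mem, size_t len)
{
	static const char digits[] = "0123456789abcdef";
	size_t used = 0;

	if (cap == 0)
		return -1;
	out[0] = '\0';

	while (len) {
		size_t n = len < SK_CHUNK_BYTES ? len : SK_CHUNK_BYTES;
		size_t need = 2 * n + 1;	/* hex digits plus the space */
		size_t i;

		if (used + need >= cap)		/* keep room for the NUL */
			return -1;
		for (i = n; i-- > 0; ) {
			out[used++] = digits[mem[i] >> 4];
			out[used++] = digits[mem[i] & 0xf];
		}
		out[used++] = ' ';
		out[used] = '\0';

		mem += n;
		len -= n;
	}
	return 0;
}
```

Ten input bytes need 17 characters for the first chunk and 5 for the second, so a 24-byte buffer holds both while a 20-byte buffer only holds the first, which is the split the two tests rely on.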

* Re: [PATCH v2 1/2] module/kallsyms: fix nextval for data symbol lookup
From: Petr Pavlu @ 2026-04-08 15:24 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: linux-modules, Sami Tolvanen, Luis Chamberlain, linux-kernel,
	linux-trace-kernel, live-patching, Daniel Gomez, Aaron Tomlin,
	Steven Rostedt, Masami Hiramatsu, Jordan Rome, Viktor Malik
In-Reply-To: <20260327110005.16499-1-stf_xl@wp.pl>

On 3/27/26 12:00 PM, Stanislaw Gruszka wrote:
> The symbol lookup code assumes the queried address resides in either
> MOD_TEXT or MOD_INIT_TEXT. This breaks for addresses in other module
> memory regions (e.g. rodata or data), resulting in incorrect upper
> bounds and wrong symbol size.
> 
> Select the module memory region the address belongs to instead of
> hardcoding text sections. Also initialize the lower bound to the start
> of that region, as searching from address 0 is unnecessary.
> 
> Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl>

Looks ok to me. Feel free to add:

Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>

As a side note, I wonder if manually determining symbol sizes this way
is the best approach for modules, instead of simply returning the
st_size of the symbol. The logic comes from the original implementation
in "[PATCH] kallsyms for new modules" [1]. Unfortunately, the
description doesn't explain this aspect but considering that the patch
rewrote both the main and module kallsyms code, I expect it was done
this way for consistency between vmlinux and modules.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux-fullhistory.git/commit/?id=d069cf94ca296b7fb4c7e362e8f27e2c8aca70f1

-- 
Thanks,
Petr
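
Editorial sketch: the commit message quoted above describes picking the module memory region that contains the address and using that region's start and end as the initial lower and upper bounds, so that the symbol size is the distance to the next symbol or to the end of the region. The user-space C sketch below illustrates that bounding logic under simplifying assumptions. struct sk_region, sk_symbol_bounds() and the flat address array are hypothetical and do not reflect the actual module kallsyms data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one module memory region. */
struct sk_region {
	unsigned long base;
	unsigned long size;
};

/*
 * Find the symbol covering addr. The search is bounded by the region
 * that contains addr: the lower bound starts at the region base and the
 * upper bound at the region end. Returns 0 and fills *start and *size,
 * or -1 if addr is in no region or no symbol precedes it there.
 */
static int sk_symbol_bounds(const unsigned long *syms, size_t nsyms,
			    const struct sk_region *regions, size_t nregions,
			    unsigned long addr,
			    unsigned long *start, unsigned long *size)
{
	unsigned long best = 0, next = 0;
	int in_region = 0, found = 0;
	size_t i;

	for (i = 0; i < nregions; i++) {
		if (addr >= regions[i].base &&
		    addr - regions[i].base < regions[i].size) {
			best = regions[i].base;
			next = regions[i].base + regions[i].size;
			in_region = 1;
			break;
		}
	}
	if (!in_region)
		return -1;

	for (i = 0; i < nsyms; i++) {
		if (syms[i] <= addr && syms[i] >= best) {
			best = syms[i];		/* closest symbol at or below addr */
			found = 1;
		} else if (syms[i] > addr && syms[i] < next) {
			next = syms[i];		/* closest symbol above addr */
		}
	}
	if (!found)
		return -1;

	*start = best;
	*size = next - best;
	return 0;
}
```

For the last symbol in a data region, the upper bound falls back to the end of that region rather than to the end of a text region, which is the case the commit message says was previously handled incorrectly.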


* [PATCH 1/2] tracing: Store trace_marker_raw payload length in events
From: Cao Ruichuang @ 2026-04-08 15:32 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, shuah
  Cc: linux-kernel, linux-trace-kernel, linux-kselftest

trace_marker_raw currently records its bytes in TRACE_RAW_DATA events,
but the event output path derives the byte count from the padded record
size in the ring buffer. As a result, the printed raw-data payload is
rounded up and small writes do not preserve their true length.

Keep the true payload length in the TRACE_RAW_DATA event itself and use
that field when printing the bytes. This leaves the ring buffer record
size semantics unchanged while letting trace_marker_raw report the exact
payload that was written.

Signed-off-by: Cao Ruichuang <create0818@163.com>
---
 kernel/trace/trace.c         | 11 ++++++-----
 kernel/trace/trace_entries.h |  1 +
 kernel/trace/trace_output.c  |  4 ++--
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index a626211ce..d9cb643b8 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6906,11 +6906,13 @@ static ssize_t write_raw_marker_to_buffer(struct trace_array *tr,
 	struct ring_buffer_event *event;
 	struct trace_buffer *buffer;
 	struct raw_data_entry *entry;
+	size_t payload_len;
 	ssize_t written;
 	size_t size;
 
 	/* cnt includes both the entry->id and the data behind it. */
-	size = struct_offset(entry, id) + cnt;
+	payload_len = cnt - sizeof(entry->id);
+	size = struct_offset(entry, buf) + payload_len;
 
 	buffer = tr->array_buffer.buffer;
 
@@ -6924,10 +6926,9 @@ static ssize_t write_raw_marker_to_buffer(struct trace_array *tr,
 		return -EBADF;
 
 	entry = ring_buffer_event_data(event);
-	unsafe_memcpy(&entry->id, buf, cnt,
-		      "id and content already reserved on ring buffer"
-		      "'buf' includes the 'id' and the data."
-		      "'entry' was allocated with cnt from 'id'.");
+	memcpy(&entry->id, buf, sizeof(entry->id));
+	entry->len = payload_len;
+	memcpy(entry->buf, buf + sizeof(entry->id), payload_len);
 	written = cnt;
 
 	__buffer_unlock_commit(buffer, event);
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 54417468f..5f867a144 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -288,6 +288,7 @@ FTRACE_ENTRY(raw_data, raw_data_entry,
 
 	F_STRUCT(
 		__field(	unsigned int,	id	)
+		__field(unsigned int, len)
 		__dynamic_array(	char,	buf	)
 	),
 
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 1996d7aba..4e1edfa05 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1817,13 +1817,13 @@ static enum print_line_t trace_raw_data(struct trace_iterator *iter, int flags,
 					 struct trace_event *event)
 {
 	struct raw_data_entry *field;
-	int i;
+	unsigned int i;
 
 	trace_assign_type(field, iter->ent);
 
 	trace_seq_printf(&iter->seq, "# %x buf:", field->id);
 
-	for (i = 0; i < iter->ent_size - offsetof(struct raw_data_entry, buf); i++)
+	for (i = 0; i < field->len; i++)
 		trace_seq_printf(&iter->seq, " %02x",
 				 (unsigned char)field->buf[i]);
 
-- 
2.39.5 (Apple Git-154)
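
Editorial sketch: the patch above splits a trace_marker_raw write into an id, an explicit payload length, and the payload bytes, so that printing can use the stored length instead of a size rounded up to four bytes (the rounding the old selftest assumed). The user-space C sketch below shows that split. struct sk_raw_entry, the fixed SK_MAX_PAYLOAD bound, and both helpers are hypothetical; the real event uses a dynamic array in the ring buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_MAX_PAYLOAD	64u	/* arbitrary bound for this sketch */

/* Hypothetical stand-in for the raw-data event layout after the patch. */
struct sk_raw_entry {
	unsigned int id;
	unsigned int len;	/* true payload length in bytes */
	unsigned char buf[SK_MAX_PAYLOAD];
};

/*
 * Split a user write of cnt bytes (id followed by payload) into the
 * entry fields. Returns 0 on success, -1 if cnt is too small to hold
 * an id or the payload exceeds the sketch's fixed bound.
 */
static int sk_store_raw(struct sk_raw_entry *e, const void *wr, size_t cnt)
{
	const unsigned char *src = wr;

	if (cnt < sizeof(e->id) || cnt - sizeof(e->id) > SK_MAX_PAYLOAD)
		return -1;

	memcpy(&e->id, src, sizeof(e->id));
	e->len = (unsigned int)(cnt - sizeof(e->id));
	memcpy(e->buf, src + sizeof(e->id), e->len);
	return 0;
}

/* Byte count that would be printed if derived from a 4-byte-padded size. */
static size_t sk_round_up4(size_t n)
{
	return (n + 3) / 4 * 4;
}
```

For a five-byte payload, the stored length is 5 while the padded size is 8, which is the difference between the new and the old printed byte counts.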



* [PATCH 2/2] selftests/ftrace: Check exact trace_marker_raw payload lengths
From: Cao Ruichuang @ 2026-04-08 15:32 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, shuah
  Cc: linux-kernel, linux-trace-kernel, linux-kselftest
In-Reply-To: <20260408153241.15391-1-create0818@163.com>

trace_marker_raw.tc currently depends on awk strtonum() and assumes
that the printed raw-data byte count is rounded up to four bytes.

Now that TRACE_RAW_DATA records keep the true payload length in the
event itself, update the testcase to validate the exact number of bytes
printed for a short sequence of writes. While doing that, make the test
portable to /bin/sh environments that use mawk by replacing strtonum()
and the lscpu endian probe with od-based checks.

Signed-off-by: Cao Ruichuang <create0818@163.com>
---
 .../ftrace/test.d/00basic/trace_marker_raw.tc | 93 ++++++++++++-------
 1 file changed, 59 insertions(+), 34 deletions(-)

diff --git a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
index a2c42e13f..3b37890f8 100644
--- a/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
+++ b/tools/testing/selftests/ftrace/test.d/00basic/trace_marker_raw.tc
@@ -1,11 +1,11 @@
 #!/bin/sh
 # SPDX-License-Identifier: GPL-2.0
 # description: Basic tests on writing to trace_marker_raw
-# requires: trace_marker_raw
+# requires: trace_marker_raw od:program
 # flags: instance
 
 is_little_endian() {
-	if lscpu | grep -q 'Little Endian'; then
+	if [ "$(printf '\001\000\000\000' | od -An -tu4 | tr -d '[:space:]')" = "1" ]; then
 		echo 1;
 	else
 		echo 0;
@@ -34,7 +34,7 @@ make_str() {
 
 	data=`printf -- 'X%.0s' $(seq $cnt)`
 
-	printf "${val}${data}"
+	printf "%b%s" "${val}" "${data}"
 }
 
 write_buffer() {
@@ -47,36 +47,68 @@ write_buffer() {
 
 
 test_multiple_writes() {
+	out_file=$TMPDIR/trace_marker_raw.out
+	match_file=$TMPDIR/trace_marker_raw.lines
+	wait_iter=0
+	pause_on_trace=
+
+	if [ -f options/pause-on-trace ]; then
+		pause_on_trace=`cat options/pause-on-trace`
+		echo 0 > options/pause-on-trace
+	fi
+
+	: > trace
+	cat trace_pipe > $out_file &
+	reader_pid=$!
+	sleep 1
+
+	# Write sizes that cover both the short and long raw-data encodings
+	# without overflowing the trace buffer before we can verify them.
+	for i in `seq 1 12`; do
+		write_buffer 0x12345678 $i
+	done
 
-	# Write a bunch of data where the id is the count of
-	# data to write
-	for i in `seq 1 10` `seq 101 110` `seq 1001 1010`; do
-		write_buffer $i $i
+	while [ "`grep -c ' buf:' $out_file 2> /dev/null || true`" -lt 12 ]; do
+		wait_iter=$((wait_iter + 1))
+		if [ $wait_iter -ge 10 ]; then
+			kill $reader_pid 2> /dev/null || true
+			wait $reader_pid 2> /dev/null || true
+			if [ -n "$pause_on_trace" ]; then
+				echo $pause_on_trace > options/pause-on-trace
+			fi
+			return 1
+		fi
+		sleep 1
 	done
 
 	# add a little buffer
 	echo stop > trace_marker
+	sleep 1
+	kill $reader_pid 2> /dev/null || true
+	wait $reader_pid 2> /dev/null || true
+	if [ -n "$pause_on_trace" ]; then
+		echo $pause_on_trace > options/pause-on-trace
+	fi
 
-	# Check to make sure the number of entries is the id (rounded up by 4)
-	awk '/.*: # [0-9a-f]* / {
-			print;
-			cnt = -1;
-			for (i = 0; i < NF; i++) {
-				# The counter is after the "#" marker
-				if ( $i == "#" ) {
-					i++;
-					cnt = strtonum("0x" $i);
-					num = NF - (i + 1);
-					# The number of items is always rounded up by 4
-					cnt2 = int((cnt + 3) / 4) * 4;
-					if (cnt2 != num) {
-						exit 1;
-					}
-					break;
-				}
-			}
-		}
-	// { if (NR > 30) { exit 0; } } ' trace_pipe;
+	grep ' buf:' $out_file > $match_file || return 1
+	if [ "`wc -l < $match_file`" -ne 12 ]; then
+		cat $match_file
+		return 1
+	fi
+
+	# Check to make sure the number of byte values matches the id exactly.
+	for expected in `seq 1 12`; do
+		line=`sed -n "${expected}p" $match_file`
+		if [ -z "$line" ]; then
+			return 1
+		fi
+		rest=${line#* buf: }
+		set -- $rest
+		if [ "$#" -ne "$expected" ]; then
+			echo "$line"
+			return 1
+		fi
+	done
 }
 
 
@@ -107,13 +139,6 @@ test_buffer() {
 
 ORIG=`cat buffer_size_kb`
 
-# test_multiple_writes test needs at least 12KB buffer
-NEW_SIZE=12
-
-if [ ${ORIG} -lt ${NEW_SIZE} ]; then
-	echo ${NEW_SIZE} > buffer_size_kb
-fi
-
 test_buffer
 if ! test_multiple_writes; then
 	echo ${ORIG} > buffer_size_kb
-- 
2.39.5 (Apple Git-154)



* Re: [PATCH v2 1/2] tracing/hist: rebuild full_name on each hist_field_name() call
From: Tom Zanussi @ 2026-04-08 15:58 UTC (permalink / raw)
  To: Steven Rostedt, Pengpeng Hou
  Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260407210502.102e5d37@gandalf.local.home>

Hi Steve,

On Tue, 2026-04-07 at 21:05 -0400, Steven Rostedt wrote:
> 
> Tom,
> 
> On Wed,  1 Apr 2026 19:22:23 +0800
> Pengpeng Hou <pengpeng@iscas.ac.cn> wrote:
> 
> > hist_field_name() uses a static MAX_FILTER_STR_VAL buffer for fully
> > qualified variable-reference names, but it currently appends into that
> > buffer with strcat() without rebuilding it first. As a result, repeated
> > calls append a new "system.event.field" name onto the previous one,
> > which can eventually run past the end of full_name.
> > 
> > Build the name with snprintf() on each call and return NULL if the fully
> > qualified name does not fit in MAX_FILTER_STR_VAL.
> > 
> > Fixes: 067fe038e70f ("tracing: Add variable reference handling to hist triggers")
> > Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
> > ---
> > Changes since v1: https://lore.kernel.org/all/20260329030950.32503-1-pengpeng@iscas.ac.cn/
> > 
> > - rebuild full_name on each call instead of falling back to field->name
> > - return NULL on overflow as suggested
> > - split out the snprintf() length check instead of using an inline if
> > 
> >  kernel/trace/trace_events_hist.c | 12 +++++++-----
> >  1 file changed, 7 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
> > index 73ea180cad55..f9c8a4f078ea 100644
> > --- a/kernel/trace/trace_events_hist.c
> > +++ b/kernel/trace/trace_events_hist.c
> > @@ -1361,12 +1361,14 @@ static const char *hist_field_name(struct hist_field *field,
> >  		 field->flags & HIST_FIELD_FL_VAR_REF) {
> >  		if (field->system) {
> >  			static char full_name[MAX_FILTER_STR_VAL];
> > +			int len;
> > +
> > +			len = snprintf(full_name, sizeof(full_name), "%s.%s.%s",
> > +				       field->system, field->event_name,
> > +				       field->name);
> > +			if (len >= sizeof(full_name))
> > +				return NULL;
> >  
> > -			strcat(full_name, field->system);
> > -			strcat(full_name, ".");
> > -			strcat(full_name, field->event_name);
> > -			strcat(full_name, ".");
> > -			strcat(full_name, field->name);
> >  			field_name = full_name;
> 
> I wanted to test this but I can't find anything that triggers this path.
> How does a field here get its ->system set?
> 

->system is set when using fully-qualified variable names. For
instance:

echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
echo 'hist:keys=next_pid:lat0=common_timestamp.usecs-sched.sched_waking.$ts0:lat1=common_timestamp.usecs-sched.sched_wakeup.$ts0' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
echo 'hist:keys=next_pid:vals=$lat0,$lat1' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger

Here, the sched_switch trigger would error out if the unqualified $ts0
variables were used instead of the fully-qualified ones because there's
no way to distinguish which $ts0 was meant.
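
As a side illustration of the bug the quoted patch addresses (not of how
->system gets set), here is a minimal userspace sketch. It assumes a small
stand-in buffer size, NAME_BUF_LEN, in place of MAX_FILTER_STR_VAL, and the
helper names name_strcat() and name_snprintf() are made up for this example;
neither is kernel code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: stand-in for MAX_FILTER_STR_VAL; the value is arbitrary. */
#define NAME_BUF_LEN 64

/* Old pattern: append into a static buffer that is never reset. */
static const char *name_strcat(const char *system, const char *event,
			       const char *field)
{
	static char full_name[NAME_BUF_LEN];

	/* Demo-only guard so this sketch never overruns its own buffer. */
	assert(strlen(full_name) + strlen(system) + strlen(event) +
	       strlen(field) + 2 < sizeof(full_name));

	strcat(full_name, system);
	strcat(full_name, ".");
	strcat(full_name, event);
	strcat(full_name, ".");
	strcat(full_name, field);
	return full_name;
}

/* New pattern: rebuild the name on every call, NULL if it does not fit. */
static const char *name_snprintf(const char *system, const char *event,
				 const char *field)
{
	static char full_name[NAME_BUF_LEN];
	int len;

	len = snprintf(full_name, sizeof(full_name), "%s.%s.%s",
		       system, event, field);
	if (len < 0 || (size_t)len >= sizeof(full_name))
		return NULL;
	return full_name;
}
```

Calling name_strcat() twice returns the second name glued onto the first,
while name_snprintf() returns a fresh "system.event.field" string each time.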

Tom



> If there's no way to hit this path, I much rather remove it than "fix" it.
> 
> -- Steve
> 
> 
> >  		} else
> >  			field_name = field->name;
> 


^ permalink raw reply

* Re: [PATCH v2 1/2] tracing/hist: rebuild full_name on each hist_field_name() call
From: Steven Rostedt @ 2026-04-08 16:25 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: Pengpeng Hou, mhiramat, mathieu.desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <f59d594ff21658db45c58a094edeab0f92ae8345.camel@kernel.org>

On Wed, 08 Apr 2026 10:58:06 -0500
Tom Zanussi <zanussi@kernel.org> wrote:

Hi Tom,

> ->system is set when using fully-qualified variable names. For  
> instance:
> 
> echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
> echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
> echo 'hist:keys=next_pid:lat0=common_timestamp.usecs-sched.sched_waking.$ts0:lat1=common_timestamp.usecs-sched.sched_wakeup.$ts0' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
> echo 'hist:keys=next_pid:vals=$lat0,$lat1' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
> 
> Here, the sched_switch trigger would error out if the unqualified $ts0
> variables were used instead of the fully-qualified ones because there's
> no way to distinguish which $ts0 was meant.
> 

Yep I see that now. I never had a need to use it before, but I probably
should implement this in libtracefs to be safe.

We should definitely add a selftest that tests this. There's one case that
does use it but it doesn't use multiple ones. We should add a test that
does so.

trigger-multi-actions-accept.tc has the system, but it's not needed here.

We should also have a test to test the output of these lines.

-- Steve

^ permalink raw reply

* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Ackerley Tng @ 2026-04-08 16:54 UTC (permalink / raw)
  To: Sean Christopherson, Michael Roth
  Cc: Vishal Annapurve, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, ira.weiny, jmattson, jthoughton, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm
In-Reply-To: <adWidf8UgZeYctr1@google.com>

Sean Christopherson <seanjc@google.com> writes:

> On Tue, Apr 07, 2026, Michael Roth wrote:
>> On Tue, Apr 07, 2026 at 02:50:58PM -0700, Vishal Annapurve wrote:
>> > On Tue, Apr 7, 2026 at 2:09 PM Michael Roth <michael.roth@amd.com> wrote:
>> > >
>> > > > TLDR:
>> > > >
>> > > > + Think of populate ioctls not as KVM touching memory, but platform
>> > > >   handling population.
>> > > > + KVM code (kvm_gmem_populate) still doesn't touch memory contents
>> > > > + post_populate is platform-specific code that handles loading into
>> > > >   private destination memory just to support legacy non-in-place
>> > > >   conversion.
>> > > > + Don't complicate populate ioctls by doing conversion just to support
>> > > >   legacy use-cases where platform-specific code has to do copying on
>> > > >   the host.
>> > >
>> > > That's a good point: these are only considerations in the context of
>> > > actually copying from src->dst, but with in-place conversion the
>> > > primary/more-performant approach will be for userspace to initialize
>> > > directly. I.e. if we enforced that, then gmem could rightly ascertain that
>> > > it isn't even writing to private pages via these hooks and any
>> > > manipulation of that memory is purely on the part of the trusted entity
>> > > handling initial encryption/etc.
>> > >
>> > > I understand that we decided to keep the option of allowing separate
>> > > src/dst even with in-place conversion, but it doesn't seem worthwhile if
>> > > that necessarily means we need to glue population+conversion together in
>> > > 1 clumsy interface that needs to handle partial return/error responses to
>> > > userspace (or potentially get stuck forever in the conversion path).
>> >
>> > I think ARM needs userspace to specify separate source and destination
>> > memory ranges for initial population as ARM doesn't support in-place
>> > memory encryption. [1]
>> >
>> > [1] https://lore.kernel.org/kvm/20260318155413.793430-25-steven.price@arm.com/
>> >
>> > >
>> > > So I agree with Ackerley's proposal (which I guess is the same as what's
>> > > in this series).
>> > >
>> > > However, 1 other alternative would be to do what was suggested on the
>> > > call, but require userspace to subsequently handle the shared->private
>> > > conversion. I think that would be workable too.
>> >
>> > IIUC, Converting memory ranges to private after it essentially is
>> > treated as private by the KVM CC backend will expose the
>> > implementation to the same risk of userspace being able to access
>> > private memory and compromise host safety which guest_memfd was
>> > invented to address.
>>
>> Doh, fair point. Doing conversion as part of the populate call would allow
>> us to use the filemap write-lock to avoid userspace being able to fault
>> in private (as tracked by trusted entity) pages before they are
>> transitioned to private (as tracked by KVM), so it's safer than having
>> userspace drive it.
>>
>> But obviously I still think Ackerley's original proposal has more
>> upsides than the alternatives mentioned so far.
>
> I'm a bit lost.  What exactly is/was Ackerley's original proposal?  If the answer
> is "convert pages from shared=>private when populating via in-place conversion",
> then I agree, because AFAICT, that's the only sane option.

Discussed this at PUCK today 2026-04-08.

The update is that the KVM_SET_MEMORY_ATTRIBUTES2 guest_memfd ioctl will
now support the PRESERVE flag for TDX and SNP only if the setup for the
VM in question hasn't yet been completed (KVM_TDX_FINALIZE_VM or
KVM_SEV_SNP_LAUNCH_FINISH hasn't completed yet).

The populate flow will be

1a. Get contents to be loaded in guest_memfd (src_addr: NULL) as shared
OR
1b. Provide contents from some other userspace address (src_addr:
    userspace address)

2.  KVM_SET_MEMORY_ATTRIBUTES2(attribute: PRIVATE and flags: PRESERVE)
3.  KVM_SEV_SNP_LAUNCH_UPDATE() or KVM_TDX_INIT_MEM_REGION()
...
4.  KVM_SEV_SNP_LAUNCH_FINISH() or KVM_TDX_FINALIZE_VM()

This applies whether src_addr is some userspace address that is shared
or NULL, so the non-in-place loading flow is not considered legacy. ARM
CCA can still use that flow :)

Other than supporting PRESERVE only if the setup for the VM in question
hasn't yet been completed, KVM's fault path will also not permit faults
if the setup hasn't been completed. (Some exception setup will be used
for TDX to be able to perform the required fault.)

^ permalink raw reply

* Re: [PATCH v2 1/2] tracing/hist: rebuild full_name on each hist_field_name() call
From: Tom Zanussi @ 2026-04-08 17:18 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Pengpeng Hou, mhiramat, mathieu.desnoyers, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260408122514.60bbfd61@gandalf.local.home>

On Wed, 2026-04-08 at 12:25 -0400, Steven Rostedt wrote:
> On Wed, 08 Apr 2026 10:58:06 -0500
> Tom Zanussi <zanussi@kernel.org> wrote:
> 
> Hi Tom,
> 
> > ->system is set when using fully-qualified variable names. For  
> > instance:
> > 
> > echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
> > echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
> > echo 'hist:keys=next_pid:lat0=common_timestamp.usecs-sched.sched_waking.$ts0:lat1=common_timestamp.usecs-sched.sched_wakeup.$ts0' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
> > echo 'hist:keys=next_pid:vals=$lat0,$lat1' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
> > 
> > Here, the sched_switch trigger would error out if the unqualified $ts0
> > variables were used instead of the fully-qualified ones because there's
> > no way to distinguish which $ts0 was meant.
> > 
> 
> Yep I see that now. I never had a need to use it before, but I probably
> should implement this in libtracefs to be safe.
> 
> We should definitely add a selftest that tests this. There's one case that
> does use it but it doesn't use multiple ones. We should add a test that
> does so.
> 
> trigger-multi-actions-accept.tc has the system, but it's not needed here.
> 
> We should also have a test to test the output of these lines.

Yeah, definitely. I can try adding this as a test.

Tom


> 
> -- Steve


^ permalink raw reply

* Re: [PATCH 1/2] tracing: Store trace_marker_raw payload length in events
From: Steven Rostedt @ 2026-04-08 17:39 UTC (permalink / raw)
  To: Cao Ruichuang
  Cc: mhiramat, mathieu.desnoyers, shuah, linux-kernel,
	linux-trace-kernel, linux-kselftest
In-Reply-To: <20260408153241.15391-1-create0818@163.com>

On Wed,  8 Apr 2026 23:32:40 +0800
Cao Ruichuang <create0818@163.com> wrote:

> trace_marker_raw currently records its bytes in TRACE_RAW_DATA events,
> but the event output path derives the byte count from the padded record
> size in the ring buffer. As a result, the printed raw-data payload is
> rounded up and small writes do not preserve their true length.
> 
> Keep the true payload length in the TRACE_RAW_DATA event itself and use
> that field when printing the bytes. This leaves the ring buffer record
> size semantics unchanged while letting trace_marker_raw report the exact
> payload that was written.

May I ask why? The above describes what is happening but leaves out
the why. Why does the payload length need to be added to the
event? I mean, it's recording raw data, and the user who writes to it
already knows the length as this was made for applications to write
structures directly into the buffer. When reading back from the buffer
the structure size is the length.

Thus, why record the length? I see no reason to. The length wastes
precious space in the ring buffer when the user of trace_marker_raw
should already know its length.
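
To illustrate the argument, here is a minimal userspace sketch of a reader
that recovers a record by its known structure size and ignores any padding.
Everything in it is an assumption for illustration: struct app_event is a
made-up application record, and padded_len() only models the "rounded up by
4" behavior described in this thread; none of it is a kernel or tracefs API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical application record; packed so it is exactly 7 bytes. */
struct app_event {
	uint32_t id;	/* leading tag, as raw markers carry an id first */
	uint16_t cpu;
	uint8_t  state;
} __attribute__((packed));

/* Models a record length that is rounded up to a multiple of 4. */
static size_t padded_len(size_t len)
{
	return (len + 3) & ~(size_t)3;
}

/*
 * Decode one record. The reader copies sizeof(struct app_event) bytes
 * because it already knows the layout; trailing padding is ignored.
 * Returns 0 on success, -1 if the record is too short.
 */
static int decode_event(const unsigned char *rec, size_t rec_len,
			struct app_event *out)
{
	assert(rec != NULL && out != NULL);

	if (rec_len < sizeof(*out))
		return -1;
	memcpy(out, rec, sizeof(*out));
	return 0;
}
```

In this model the 7-byte record is stored as 8 bytes, and the decoder still
reads back the same three fields without needing a stored length.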

-- Steve

^ permalink raw reply

* Re: [PATCH 01/24] filelock: add support for ignoring deleg breaks for dir change events
From: Chuck Lever @ 2026-04-08 18:16 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
	Chuck Lever, Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein
  Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-1-aaf68c478abd@kernel.org>


On Tue, Apr 7, 2026, at 9:21 AM, Jeff Layton wrote:
> If a NFS client requests a directory delegation with a notification
> bitmask covering directory change events, the server shouldn't recall
> the delegation. Instead the client will be notified of the change after
> the fact.
>
> Add support for ignoring lease breaks on directory changes. Add a new
> flags parameter to try_break_deleg() and teach __break_lease how to
> ignore certain types of delegation break events.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---

> diff --git a/fs/locks.c b/fs/locks.c
> index 8e44b1f6c15a..dafa0752fdce 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c

> @@ -1670,7 +1709,7 @@ int __break_lease(struct inode *inode, unsigned int flags)
>  			locks_delete_lock_ctx(&fl->c, &dispose);
>  	}
> 
> -	if (list_empty(&ctx->flc_lease))
> +	if (!visible_leases_remaining(inode, flags))
>  		goto out;
> 
>  	if (flags & LEASE_BREAK_NONBLOCK) {

After breaking visible leases, the restart: label calls any_leases_conflict()
which does not filter ignored dir-delegation leases. When only ignored leases
remain, any_leases_conflict returns true, but visible_leases_remaining also
returned true (triggering the wait). The code picks the first lease (possibly
ignored), computes break_time = 1 jiffy, blocks, then loops.

For example, suppose you have two directory delegations on a directory, one
with FL_IGN_DIR_DELETE and one without. After the non-ignored one is broken
and removed, the ignored one keeps any_leases_conflict returning true. The
loop spins at 1-jiffy intervals until the ignored delegation is released.

Should the restart: block skip ignored leases?


-- 
Chuck Lever

^ permalink raw reply

* Re: [RFC PATCH 3/4] livepatch: Add "replaceable" attribute to klp_patch
From: Song Liu @ 2026-04-08 18:19 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Yafang Shao, Joe Lawrence, Dylan Hatch, jpoimboe, jikos, mbenes,
	rostedt, mhiramat, mathieu.desnoyers, kpsingh, mattbobrowski,
	jolsa, ast, daniel, andrii, martin.lau, eddyz87, memxor,
	yonghong.song, live-patching, linux-kernel, linux-trace-kernel,
	bpf
In-Reply-To: <adY_WgA54CDtWBq6@pathway.suse.cz>

On Wed, Apr 8, 2026 at 4:43 AM Petr Mladek <pmladek@suse.com> wrote:
[...]
> > >
> > > This is weird semantic. Which livepatch tag would be allowed to
> > > supersede it, please?
> > >
> > > Do we still need this category?
> >
> > It can be superseded by any livepatch that has a non-zero tag set.
>
> And this is exactly the weird thing.
>
> A patch with the .replace flag set is supposed to obsolete all already
> installed livepatches. It means that it should provide all existing
> fixes and features.
>
> Now, we want to introduce a replace flag/set which would allow to
> replace/obsolete only the livepatch with the same tag/set number.
> And we want to prevent conflicts by making sure that livepatches with
> different tag/set number will never livepatch the same function.
>
> Obviously, livepatches with different tag/set number could not
> obsolete the same no-replace livepatch. They would need to livepatch
> the same functions touched by the no-replace livepatch and would
> conflict.
>
> So, I suggest to remove the no-replace mode completely. It should
> not be needed. A livepatch which should be installed in parallel
> will simply use another unique tag/set number.

I think I see your point now. Existing code works as:
- replace=false doesn't replace anything
- replace=true replaces everything

If we assume false=0 and true=1, it is technically possible to define:
- replace_set=0 doesn't replace anything
- replace_set=1 replaces everything
- replace_set=2+ only replace the same replace_set

This is probably a little too complicated.
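
For reference, the three-way rule listed above could be sketched as a single
predicate. This is only an illustration of the semantics as written in this
mail; klp_set_replaces() and its parameters are hypothetical names, not
existing livepatch code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: would a new livepatch with replace_set == new_set
 * replace an installed livepatch with replace_set == old_set?
 */
static bool klp_set_replaces(unsigned int new_set, unsigned int old_set)
{
	if (new_set == 0)	/* replace_set=0 doesn't replace anything */
		return false;
	if (new_set == 1)	/* replace_set=1 replaces everything */
		return true;
	return new_set == old_set;	/* 2+ replaces only its own set */
}
```

The special cases for 0 and 1 are what keep the old replace=false/true
behavior, and they are also where the extra complexity comes from.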

> > This ensures backward compatibility: while a non-atomic-replace
> > livepatch can be superseded by an atomic-replace one, the reverse is
> > not permitted—an atomic-replace livepatch cannot be superseded by a
> > non-atomic one.
>
> IMHO, the backward compatibility would just create complexity and mess
> in this case.

Given that livepatch is for expert users, I think we can make this work
without backward compatibility. But breaking compatibility is never
preferred.

Thanks,
Song

^ permalink raw reply

* Re: [PATCH 08/24] nfsd: update the fsnotify mark when setting or removing a dir delegation
From: Chuck Lever @ 2026-04-08 18:24 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
	Chuck Lever, Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein
  Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-8-aaf68c478abd@kernel.org>


On Tue, Apr 7, 2026, at 9:21 AM, Jeff Layton wrote:
> Add a new helper function that will update the mask on the nfsd_file's
> fsnotify_mark to be a union of all current directory delegations on an
> inode. Call that when directory delegations are added or removed.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index c8fb84c38637..9a4cff08c67d 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c

> @@ -1266,6 +1297,7 @@ static void nfs4_unlock_deleg_lease(struct nfs4_delegation *dp)
>  	WARN_ON_ONCE(!fp->fi_delegees);
> 
>  	nfsd4_finalize_deleg_timestamps(dp, nf->nf_file);
> +	nfsd_fsnotify_recalc_mask(nf);
>  	kernel_setlease(nf->nf_file, F_UNLCK, NULL, (void **)&dp);
>  	put_deleg_file(fp);
>  }

The grant path in nfsd_get_dir_deleg() uses a different ordering
(setlease first, recalc_mask after).

Here, since the delegation being removed is still in flc_lease,
inode_lease_ignore_mask() includes its ignore flags. The mask is
computed as if the delegation is still present.

The result is that stale FS_CREATE/FS_DELETE/FS_RENAME bits remain
in the fsnotify mark. It might be harmless in practice since the
handler finds no leases and returns early, but it creates
unnecessary work.

Should nfs4_unlock_deleg_lease call nfsd_fsnotify_recalc_mask()
after kernel_setlease(F_UNLCK)?


-- 
Chuck Lever

^ permalink raw reply

* Re: [PATCH 13/24] nfsd: add notification handlers for dir events
From: Chuck Lever @ 2026-04-08 18:34 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
	Chuck Lever, Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein
  Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-13-aaf68c478abd@kernel.org>



On Tue, Apr 7, 2026, at 9:21 AM, Jeff Layton wrote:
> Add the necessary parts to accept a fsnotify callback for directory
> change event and create a CB_NOTIFY request for it. When a dir nfsd_file
> is created set a handle_event callback to handle the notification.
>
> Use that to allocate a nfsd_notify_event object and then hand off a
> reference to each delegation's CB_NOTIFY. If anything fails along the
> way, recall any affected delegations.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index b2b8c454fc0f..339c3d0bb575 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c

> @@ -9796,3 +9887,118 @@ nfsd_get_dir_deleg(struct nfsd4_compound_state *cstate,
>  	put_nfs4_file(fp);
>  	return ERR_PTR(status);
>  }
> +
> +static void
> +nfsd4_run_cb_notify(struct nfsd4_cb_notify *ncn)
> +{
> +	struct nfs4_delegation *dp = container_of(ncn, struct nfs4_delegation, dl_cb_notify);
> +
> +	if (test_and_set_bit(NFSD4_CALLBACK_RUNNING, &ncn->ncn_cb.cb_flags))
> +		return;
> +
> +	if (!refcount_inc_not_zero(&dp->dl_stid.sc_count))
> +		clear_bit(NFSD4_CALLBACK_RUNNING, &ncn->ncn_cb.cb_flags);
> +	else
> +		nfsd4_run_cb(&ncn->ncn_cb);
> +}
> +
> +static struct nfsd_notify_event *
> +alloc_nfsd_notify_event(u32 mask, const struct qstr *q, struct dentry *dentry)
> +{
> +	struct nfsd_notify_event *ne;
> +
> +	ne = kmalloc(sizeof(*ne) + q->len + 1, GFP_KERNEL);
> +	if (!ne)
> +		return NULL;
> +
> +	memcpy(&ne->ne_name, q->name, q->len);
> +	refcount_set(&ne->ne_ref, 1);
> +	ne->ne_mask = mask;
> +	ne->ne_name[q->len] = '\0';
> +	ne->ne_namelen = q->len;
> +	ne->ne_dentry = dget(dentry);
> +	return ne;
> +}
> +
> +static bool
> +should_notify_deleg(u32 mask, struct file_lease *fl)
> +{
> +	/* Only nfsd leases */
> +	if (fl->fl_lmops != &nfsd_lease_mng_ops)
> +		return false;
> +
> +	/* Skip if this event wasn't ignored by the lease */
> +	if ((mask & FS_DELETE) && !(fl->c.flc_flags & FL_IGN_DIR_DELETE))
> +		return false;
> +	if ((mask & FS_CREATE) && !(fl->c.flc_flags & FL_IGN_DIR_CREATE))
> +		return false;
> +	if ((mask & FS_RENAME) && !(fl->c.flc_flags & FL_IGN_DIR_RENAME))
> +		return false;
> +
> +	return true;
> +}

For a cross-directory rename, vfs_rename calls try_break_deleg(old_dir,
LEASE_BREAK_DIR_DELETE, ...). A delegation with FL_IGN_DIR_DELETE
(subscribed to NOTIFY4_REMOVE_ENTRY) suppresses the lease break, which
is correct.

But fsnotify delivers FS_RENAME on old_dir, not FS_DELETE. In
should_notify_deleg(), the check (mask & FS_RENAME) &&
!(fl->c.flc_flags & FL_IGN_DIR_RENAME) fails, because the delegation
has FL_IGN_DIR_DELETE but not FL_IGN_DIR_RENAME. No notification is
sent.

IIUC, a client subscribed to NOTIFY4_REMOVE_ENTRY for old_dir sees
neither a lease break nor a CB_NOTIFY when a child is renamed out of
the directory. Is that behavior correct?
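
The case in question can be reproduced with a small userspace model of the
filter. This is a sketch under assumptions: the DEMO_* bit values are
invented for illustration (the real FS_* and FL_IGN_DIR_* constants live in
kernel headers), and the fl_lmops check from the patch is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Made-up bit values; only the relationships between them matter here. */
#define DEMO_FS_CREATE		0x1u
#define DEMO_FS_DELETE		0x2u
#define DEMO_FS_RENAME		0x4u
#define DEMO_FS_ALL	(DEMO_FS_CREATE | DEMO_FS_DELETE | DEMO_FS_RENAME)

#define DEMO_IGN_DIR_CREATE	0x1u
#define DEMO_IGN_DIR_DELETE	0x2u
#define DEMO_IGN_DIR_RENAME	0x4u

/* Same three checks as should_notify_deleg() in the quoted patch. */
static bool demo_should_notify(uint32_t mask, uint32_t ign_flags)
{
	assert((mask & ~DEMO_FS_ALL) == 0);

	if ((mask & DEMO_FS_DELETE) && !(ign_flags & DEMO_IGN_DIR_DELETE))
		return false;
	if ((mask & DEMO_FS_CREATE) && !(ign_flags & DEMO_IGN_DIR_CREATE))
		return false;
	if ((mask & DEMO_FS_RENAME) && !(ign_flags & DEMO_IGN_DIR_RENAME))
		return false;

	return true;
}
```

With only DEMO_IGN_DIR_DELETE set, a DEMO_FS_RENAME event is filtered out,
which is the situation described above for a rename out of old_dir.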


> +
> +static void
> +nfsd_recall_all_dir_delegs(const struct inode *dir)
> +{
> +	struct file_lock_context *ctx = locks_inode_context(dir);
> +	struct file_lock_core *flc;
> +
> +	spin_lock(&ctx->flc_lock);
> +	list_for_each_entry(flc, &ctx->flc_lease, flc_list) {
> +		struct file_lease *fl = container_of(flc, struct file_lease, c);
> +
> +		if (fl->fl_lmops == &nfsd_lease_mng_ops)
> +			nfsd_break_deleg_cb(fl);
> +	}
> +	spin_unlock(&ctx->flc_lock);
> +}
> +
> +int
> +nfsd_handle_dir_event(u32 mask, const struct inode *dir, const void *data,
> +		      int data_type, const struct qstr *name)
> +{
> +	struct dentry *dentry = fsnotify_data_dentry(data, data_type);
> +	struct file_lock_context *ctx;
> +	struct file_lock_core *flc;
> +	struct nfsd_notify_event *evt;
> +
> +	/* Don't do anything if this is not an expected event */
> +	if (!(mask & (FS_CREATE|FS_DELETE|FS_RENAME)))
> +		return 0;
> +
> +	ctx = locks_inode_context(dir);
> +	if (!ctx || list_empty(&ctx->flc_lease))
> +		return 0;
> +
> +	evt = alloc_nfsd_notify_event(mask, name, dentry);
> +	if (!evt) {
> +		nfsd_recall_all_dir_delegs(dir);
> +		return 0;
> +	}
> +
> +	spin_lock(&ctx->flc_lock);
> +	list_for_each_entry(flc, &ctx->flc_lease, flc_list) {
> +		struct file_lease *fl = container_of(flc, struct file_lease, c);
> +		struct nfs4_delegation *dp = flc->flc_owner;
> +		struct nfsd4_cb_notify *ncn = &dp->dl_cb_notify;
> +
> +		if (!should_notify_deleg(mask, fl))
> +			continue;
> +
> +		spin_lock(&ncn->ncn_lock);
> +		if (ncn->ncn_evt_cnt >= NOTIFY4_EVENT_QUEUE_SIZE) {
> +			/* We're generating notifications too fast. Recall. */
> +			spin_unlock(&ncn->ncn_lock);
> +			nfsd_break_deleg_cb(fl);
> +			continue;
> +		}
> +		ncn->ncn_evt[ncn->ncn_evt_cnt++] = nfsd_notify_event_get(evt);
> +		spin_unlock(&ncn->ncn_lock);
> +
> +		nfsd4_run_cb_notify(ncn);
> +	}
> +	spin_unlock(&ctx->flc_lock);
> +	nfsd_notify_event_put(evt);
> +	return 0;
> +}


-- 
Chuck Lever

^ permalink raw reply

* [PATCH bpf-next v4 0/2] Reject sleepable kprobe_multi programs at attach time
From: Varun R Mallya @ 2026-04-08 18:35 UTC (permalink / raw)
  To: bpf, leon.hwang, memxor, jolsa
  Cc: ast, daniel, yonghong.song, rostedt, linux-kernel,
	linux-trace-kernel, varunrmallya

These patches fix an issue where sleepable kprobe_multi programs
were allowed to attach, leading to "sleeping function called from invalid
context" splats.

Because kprobe.multi programs run in atomic/RCU context, they cannot
sleep. However, `bpf_kprobe_multi_link_attach()` previously lacked
validation for the `prog->sleepable` flag. This allowed sleepable
helpers, such as `bpf_copy_from_user()`, to be invoked from an invalid
non-sleepable context.

This series addresses the issue by:
1. Rejecting sleepable kprobe_multi programs early in
   `bpf_kprobe_multi_link_attach()` by returning -EINVAL.
2. Adding selftests to explicitly verify that attaching a sleepable
   kprobe_multi program is rejected by the kernel.

P.S: The first of these two commits has been applied to the bpf tree.

Changes:
v1->v2:
- v1: https://lore.kernel.org/bpf/20260401134921.362148-1-varunrmallya@gmail.com/
- Defective selftest added
v2->v3:
- v2: https://lore.kernel.org/bpf/CAP01T74YgnKop-dgwBToOcfg4_D44t1wUBopFYPMquirCmaLfg@mail.gmail.com/
- Selftest separated from change into different commit.
v3->v4:
- v3: https://lore.kernel.org/bpf/20260401191126.440683-1-varunrmallya@gmail.com/
- Selftest moved to test_attach_api_fails.
- Changed attachment symbol to bpf_fentry_test1 for stability.
- Changes suggested by Leon implemented.

Varun R Mallya (2):
  bpf: Reject sleepable kprobe_multi programs at attach time
  selftests/bpf: Add test to ensure kprobe_multi is not sleepable

 kernel/trace/bpf_trace.c                      |  4 +
 .../bpf/prog_tests/kprobe_multi_test.c        | 78 ++++++++++++++++++-
 .../bpf/progs/kprobe_multi_sleepable.c        | 25 ++++++
 3 files changed, 106 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c

-- 
2.53.0


^ permalink raw reply

* [PATCH bpf-next v4 1/2] bpf: Reject sleepable kprobe_multi programs at attach time
From: Varun R Mallya @ 2026-04-08 18:35 UTC (permalink / raw)
  To: bpf, leon.hwang, memxor, jolsa
  Cc: ast, daniel, yonghong.song, rostedt, linux-kernel,
	linux-trace-kernel, varunrmallya
In-Reply-To: <20260408183549.92990-1-varunrmallya@gmail.com>

kprobe.multi programs run in atomic/RCU context and cannot sleep.
However, bpf_kprobe_multi_link_attach() did not validate whether the
program being attached had the sleepable flag set, allowing sleepable
helpers such as bpf_copy_from_user() to be invoked from a non-sleepable
context.

This causes a "sleeping function called from invalid context" splat:

  BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169
  in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
  preempt_count: 1, expected: 0
  RCU nest depth: 2, expected: 0

Fix this by rejecting sleepable programs early in
bpf_kprobe_multi_link_attach(), before any further processing.

Fixes: 0dcac2725406 ("bpf: Add multi kprobe link")
Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Leon Hwang <leon.hwang@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
---
 kernel/trace/bpf_trace.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 0b040a417442..af7079aa0f36 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2752,6 +2752,10 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
 	if (!is_kprobe_multi(prog))
 		return -EINVAL;
 
+	/* kprobe_multi is not allowed to be sleepable. */
+	if (prog->sleepable)
+		return -EINVAL;
+
 	/* Writing to context is not allowed for kprobes. */
 	if (prog->aux->kprobe_write_ctx)
 		return -EINVAL;
-- 
2.53.0


^ permalink raw reply related

* [PATCH bpf-next v4 2/2] selftests/bpf: Add test to ensure kprobe_multi is not sleepable
From: Varun R Mallya @ 2026-04-08 18:35 UTC (permalink / raw)
  To: bpf, leon.hwang, memxor, jolsa
  Cc: ast, daniel, yonghong.song, rostedt, linux-kernel,
	linux-trace-kernel, varunrmallya
In-Reply-To: <20260408183549.92990-1-varunrmallya@gmail.com>

Add a selftest to ensure that kprobe_multi programs cannot be attached
using the BPF_F_SLEEPABLE flag. This test succeeds when the kernel
rejects attachment of kprobe_multi when the BPF_F_SLEEPABLE flag is set.

Suggested-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: Varun R Mallya <varunrmallya@gmail.com>
---
 .../bpf/prog_tests/kprobe_multi_test.c        | 78 ++++++++++++++++++-
 .../bpf/progs/kprobe_multi_sleepable.c        | 25 ++++++
 2 files changed, 102 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c

diff --git a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
index 78c974d4ea33..e4f9021a84ed 100644
--- a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
+++ b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
@@ -10,6 +10,7 @@
 #include "kprobe_multi_session_cookie.skel.h"
 #include "kprobe_multi_verifier.skel.h"
 #include "kprobe_write_ctx.skel.h"
+#include "kprobe_multi_sleepable.skel.h"
 #include "bpf/libbpf_internal.h"
 #include "bpf/hashmap.h"
 
@@ -220,7 +221,9 @@ static void test_attach_api_syms(void)
 static void test_attach_api_fails(void)
 {
 	LIBBPF_OPTS(bpf_kprobe_multi_opts, opts);
+	LIBBPF_OPTS(bpf_test_run_opts, topts);
 	struct kprobe_multi *skel = NULL;
+	struct kprobe_multi_sleepable *sl_skel = NULL;
 	struct bpf_link *link = NULL;
 	unsigned long long addrs[2];
 	const char *syms[2] = {
@@ -228,7 +231,7 @@ static void test_attach_api_fails(void)
 		"bpf_fentry_test2",
 	};
 	__u64 cookies[2];
-	int saved_error;
+	int saved_error, err;
 
 	addrs[0] = ksym_get_addr("bpf_fentry_test1");
 	addrs[1] = ksym_get_addr("bpf_fentry_test2");
@@ -351,9 +354,39 @@ static void test_attach_api_fails(void)
 	if (!ASSERT_EQ(saved_error, -ENOENT, "fail_8_error"))
 		goto cleanup;
 
+	/* fail_9 - sleepable kprobe multi should not attach */
+	sl_skel = kprobe_multi_sleepable__open();
+	if (!ASSERT_OK_PTR(sl_skel, "sleep_skel_open"))
+		goto cleanup;
+
+	sl_skel->bss->user_ptr = sl_skel;
+
+	err = bpf_program__set_flags(sl_skel->progs.handle_kprobe_multi_sleepable,
+				     BPF_F_SLEEPABLE);
+	if (!ASSERT_OK(err, "sleep_skel_set_flags"))
+		goto cleanup;
+
+	err = kprobe_multi_sleepable__load(sl_skel);
+	if (!ASSERT_OK(err, "sleep_skel_load"))
+		goto cleanup;
+
+	link = bpf_program__attach_kprobe_multi_opts(sl_skel->progs.handle_kprobe_multi_sleepable,
+						     "bpf_fentry_test1", NULL);
+	saved_error = -errno;
+
+	if (!ASSERT_ERR_PTR(link, "fail_9"))
+		goto cleanup;
+
+	if (!ASSERT_EQ(saved_error, -EINVAL, "fail_9_error"))
+		goto cleanup;
+
+	err = bpf_prog_test_run_opts(bpf_program__fd(sl_skel->progs.fentry), &topts);
+	ASSERT_OK(err, "bpf_prog_test_run_opts");
+
 cleanup:
 	bpf_link__destroy(link);
 	kprobe_multi__destroy(skel);
+	kprobe_multi_sleepable__destroy(sl_skel);
 }
 
 static void test_session_skel_api(void)
@@ -609,6 +642,47 @@ static void test_override(void)
 	kprobe_multi_override__destroy(skel);
 }
 
+static void test_attach_multi_sleepable(void)
+{
+	struct kprobe_multi_sleepable *skel;
+	int err;
+
+	LIBBPF_OPTS(bpf_test_run_opts, topts);
+
+	skel = kprobe_multi_sleepable__open();
+	if (!ASSERT_OK_PTR(skel, "kprobe_multi_sleepable__open"))
+		return;
+
+	skel->bss->user_ptr = skel;
+
+	err = bpf_program__set_flags(skel->progs.handle_kprobe_multi_sleepable,
+				     BPF_F_SLEEPABLE);
+	if (!ASSERT_OK(err, "bpf_program__set_flags"))
+		goto cleanup;
+
+	/* Load should succeed even with BPF_F_SLEEPABLE for KPROBE types */
+	err = kprobe_multi_sleepable__load(skel);
+	if (!ASSERT_OK(err, "kprobe_multi_sleepable__load"))
+		goto cleanup;
+
+	skel->links.handle_kprobe_multi_sleepable =
+		bpf_program__attach_kprobe_multi_opts(skel->progs.handle_kprobe_multi_sleepable,
+						      "bpf_fentry_test1", NULL);
+
+	ASSERT_EQ(libbpf_get_error(skel->links.handle_kprobe_multi_sleepable),
+		  -EINVAL, "attach_multi_sleepable_err");
+
+	ASSERT_ERR_PTR(skel->links.handle_kprobe_multi_sleepable,
+		       "bpf_program__attach_kprobe_multi_opts");
+
+	err = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.fentry), &topts);
+
+	ASSERT_OK(err, "bpf_prog_test_run_opts");
+
+cleanup:
+	kprobe_multi_sleepable__destroy(skel);
+}
+
 #ifdef __x86_64__
 static void test_attach_write_ctx(void)
 {
@@ -676,5 +750,7 @@ void test_kprobe_multi_test(void)
 		test_unique_match();
 	if (test__start_subtest("attach_write_ctx"))
 		test_attach_write_ctx();
+	if (test__start_subtest("attach_multi_sleepable"))
+		test_attach_multi_sleepable();
 	RUN_TESTS(kprobe_multi_verifier);
 }
diff --git a/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c b/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c
new file mode 100644
index 000000000000..932e1d9c72e2
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+void *user_ptr = 0;
+
+SEC("kprobe.multi")
+int handle_kprobe_multi_sleepable(struct pt_regs *ctx)
+{
+	int a, err;
+
+	err = bpf_copy_from_user(&a, sizeof(a), user_ptr);
+	barrier_var(a);
+	return err;
+}
+
+SEC("fentry/bpf_fentry_test1")
+int BPF_PROG(fentry)
+{
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH 12/24] nfsd: add data structures for handling CB_NOTIFY
From: Chuck Lever @ 2026-04-08 18:39 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
	Chuck Lever, Alexander Aring, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
	Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
	Anna Schumaker, Amir Goldstein
  Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-doc, linux-nfs
In-Reply-To: <20260407-dir-deleg-v1-12-aaf68c478abd@kernel.org>


On Tue, Apr 7, 2026, at 9:21 AM, Jeff Layton wrote:
> Add the data structures, allocation helpers, and callback operations
> needed for directory delegation CB_NOTIFY support:
>
> - struct nfsd_notify_event: carries fsnotify events for CB_NOTIFY
> - struct nfsd4_cb_notify: per-delegation state for notification handling
> - Union dl_cb_fattr with dl_cb_notify in nfs4_delegation since a
>   delegation is either a regular file delegation or a directory
>   delegation, never both
>
> Refactor alloc_init_deleg() into a common __alloc_init_deleg() base
> with a pluggable sc_free callback, and add alloc_init_dir_deleg() which
> allocates the page array and notify4 buffer needed for CB_NOTIFY
> encoding.
>
> Add skeleton nfsd4_cb_notify_ops with done/release handlers that will
> be filled in when the notification path is wired up.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 4afe7e68fb51..b2b8c454fc0f 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c

> @@ -3381,6 +3440,30 @@ nfsd4_cb_getattr_release(struct nfsd4_callback *cb)
>  	nfs4_put_stid(&dp->dl_stid);
>  }
> 
> +static int
> +nfsd4_cb_notify_done(struct nfsd4_callback *cb,
> +				struct rpc_task *task)
> +{
> +	switch (task->tk_status) {
> +	case -NFS4ERR_DELAY:
> +		rpc_delay(task, 2 * HZ);
> +		return 0;
> +	default:
> +		return 1;
> +	}
> +}
> +
> +static void
> +nfsd4_cb_notify_release(struct nfsd4_callback *cb)
> +{
> +	struct nfsd4_cb_notify *ncn =
> +			container_of(cb, struct nfsd4_cb_notify, ncn_cb);
> +	struct nfs4_delegation *dp =
> +			container_of(ncn, struct nfs4_delegation, dl_cb_notify);
> +
> +	nfs4_put_stid(&dp->dl_stid);
> +}
> +
>  static const struct nfsd4_callback_ops nfsd4_cb_recall_any_ops = {
>  	.done		= nfsd4_cb_recall_any_done,
>  	.release	= nfsd4_cb_recall_any_release,

So when a client responds with NFS4ERR_DELAY, the RPC framework retries
after 2s. On retry, prepare() is called again, but ncn_evt_cnt is
already 0 (drained in the first prepare). prepare returns false, which
destroys the callback.

Events arriving during the retry window are dropped because
nfsd4_run_cb_notify() returns early when NFSD4_CALLBACK_RUNNING is set.
After the callback is destroyed, future events can queue a new CB_NOTIFY,
but the window's events are lost.

The result is that the client misses notifications. Does this impact
behavioral correctness or spec compliance? Is there a way for that
client to detect the loss and recover?
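
The sequence described above can be walked through with a small
user-space toy model. This is only a sketch of the scenario as worded in
this review, not nfsd code: struct model_cb, MODEL_ERR_DELAY and the
model_*() helpers are all hypothetical names, and the model assumes
(as stated above) that events arriving while the callback is running are
dropped rather than queued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_ERR_DELAY (-1)	/* hypothetical stand-in for -NFS4ERR_DELAY */

/* Hypothetical model state; not the real struct nfsd4_cb_notify. */
struct model_cb {
	int  queued;	/* pending events, plays the role of ncn_evt_cnt */
	int  in_flight;	/* events drained by the last successful prepare */
	int  delivered;	/* events the client acknowledged */
	int  lost;	/* events that were never delivered */
	bool running;	/* plays the role of NFSD4_CALLBACK_RUNNING */
};

/* A new event arrives; it is dropped while a callback is running. */
bool model_event(struct model_cb *cb)
{
	assert(cb != NULL);
	if (cb->running) {
		cb->lost++;
		return false;
	}
	cb->queued++;
	return true;
}

/* prepare: drain the queue into the RPC; false means "destroy callback". */
bool model_prepare(struct model_cb *cb)
{
	assert(cb != NULL);
	if (cb->queued == 0) {
		/* On a retry the queue is already empty, so the batch that
		 * was drained for the delayed attempt is never resent. */
		cb->lost += cb->in_flight;
		cb->in_flight = 0;
		cb->running = false;
		return false;
	}
	cb->in_flight = cb->queued;
	cb->queued = 0;
	cb->running = true;
	return true;
}

/* done: 0 asks for a retry (prepare runs again), 1 means finished. */
int model_done(struct model_cb *cb, int status)
{
	assert(cb != NULL);
	if (status == MODEL_ERR_DELAY)
		return 0;
	cb->delivered += cb->in_flight;
	cb->in_flight = 0;
	cb->running = false;
	return 1;
}
```

In this model, queueing two events, running prepare, answering with
MODEL_ERR_DELAY, receiving one more event during the retry window, and
then running prepare again leaves delivered at 0 and lost at 3. Whether
the real code behaves the same way is exactly the question above.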


-- 
Chuck Lever

^ permalink raw reply

* Re: [PATCH bpf-next v4 2/2] selftests/bpf: Add test to ensure kprobe_multi is not sleepable
From: Varun R Mallya @ 2026-04-08 18:47 UTC (permalink / raw)
  To: bpf, leon.hwang, memxor, jolsa
  Cc: ast, daniel, yonghong.song, rostedt, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260408183549.92990-3-varunrmallya@gmail.com>

On Thu, Apr 09, 2026 at 12:05:49AM +0530, Varun R Mallya wrote:
> @@ -676,5 +750,7 @@ void test_kprobe_multi_test(void)
>  		test_unique_match();
>  	if (test__start_subtest("attach_write_ctx"))
>  		test_attach_write_ctx();
> +	if (test__start_subtest("attach_multi_sleepable"))
> +		test_attach_multi_sleepable();
>  	RUN_TESTS(kprobe_multi_verifier);

Please ignore this patch. I will send a v5 in a few minutes. I forgot to
remove the selftest from its previous location after moving it into
test_attach_api_fails.

> +}
> +
> +char _license[] SEC("license") = "GPL";
> -- 
> 2.53.0
> 

^ permalink raw reply

* [PATCH bpf-next v5 0/2] Reject sleepable kprobe_multi programs at attach time
From: Varun R Mallya @ 2026-04-08 19:01 UTC (permalink / raw)
  To: bpf, leon.hwang, memxor, jolsa
  Cc: ast, daniel, yonghong.song, rostedt, linux-kernel,
	linux-trace-kernel, varunrmallya

These patches fix an issue where sleepable kprobe_multi programs
were allowed to attach, leading to "sleeping function called from invalid
context" splats.

Because kprobe.multi programs run in atomic/RCU context, they cannot
sleep. However, `bpf_kprobe_multi_link_attach()` previously lacked
validation for the `prog->sleepable` flag. This allowed sleepable
helpers, such as `bpf_copy_from_user()`, to be invoked from an invalid
non-sleepable context.

This series addresses the issue by:
1. Rejecting sleepable kprobe_multi programs early in
   `bpf_kprobe_multi_link_attach()` by returning -EINVAL.
2. Adding selftests to explicitly verify that attaching a sleepable
   kprobe_multi program is rejected by the kernel.
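
As a rough illustration of item 1, the attach-time gate can be modelled
in user space as below. This is a minimal sketch, not the kernel patch
itself: struct model_prog and model_kprobe_multi_attach() are
hypothetical stand-ins for struct bpf_prog and
`bpf_kprobe_multi_link_attach()`, and only the sleepable check is
modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bpf_prog; only the one field that
 * matters for this check is modelled. */
struct model_prog {
	bool sleepable;
};

/* Sketch of the gate: refuse sleepable programs before doing any
 * attach work, since the handler would run in a context that must
 * not sleep. */
int model_kprobe_multi_attach(const struct model_prog *prog)
{
	assert(prog != NULL);

	if (prog->sleepable)
		return -EINVAL;

	/* ... the rest of the attach path would follow here ... */
	return 0;
}
```

With this model, a program marked sleepable gets -EINVAL and a
non-sleepable one gets 0, which mirrors what the selftest in item 2
expects to observe from the kernel.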

P.S: The first of these two commits has been applied to the bpf tree.

Changes:
v1->v2:
- v1: https://lore.kernel.org/bpf/20260401134921.362148-1-varunrmallya@gmail.com/
- Defective selftest added
v2->v3:
- v2: https://lore.kernel.org/bpf/CAP01T74YgnKop-dgwBToOcfg4_D44t1wUBopFYPMquirCmaLfg@mail.gmail.com/
- Selftest separated from change into different commit.
v3->v4:
- v3: https://lore.kernel.org/bpf/20260401191126.440683-1-varunrmallya@gmail.com/
- Selftest moved to test_attach_api_fails.
- Changed attachment symbol to bpf_fentry_test1 for stability.
- Changes suggested by Leon implemented.
v4->v5:
- v4: https://lore.kernel.org/bpf/20260408183549.92990-1-varunrmallya@gmail.com/
- Remove the leftover test_attach_multi_sleepable subtest that was
  mistakenly kept after the test moved into test_attach_api_fails.

Varun R Mallya (2):
  bpf: Reject sleepable kprobe_multi programs at attach time
  selftests/bpf: Add test to ensure kprobe_multi is not sleepable

 kernel/trace/bpf_trace.c                      |  4 +++
 .../bpf/prog_tests/kprobe_multi_test.c        | 35 ++++++++++++++++++-
 .../bpf/progs/kprobe_multi_sleepable.c        | 25 +++++++++++++
 3 files changed, 63 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/kprobe_multi_sleepable.c

-- 
2.53.0


^ permalink raw reply

