* [PATCH 21/24] nfsd: add the filehandle to returned attributes in CB_NOTIFY
From: Jeff Layton @ 2026-04-07 13:21 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260407-dir-deleg-v1-0-aaf68c478abd@kernel.org>
nfsd's usual fh_compose routine requires a svc_export and fills out a
svc_fh. In the context of a CB_NOTIFY there is no such export to
consult.
Add a new routine that composes a filehandle with only a parent
filehandle and nfs4_file. Use that to fill out the fhandle field in the
nfsd4_fattr_args.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/nfs4xdr.c | 37 +++++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index faf0c3d35dee..e468cbc087ad 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4109,6 +4109,39 @@ nfsd4_encode_fattr4(struct svc_rqst *rqstp, struct xdr_stream *xdr,
goto out;
}
+static bool
+setup_notify_fhandle(struct dentry *dentry, struct nfs4_file *fi,
+ struct nfsd_file *nf, struct nfsd4_fattr_args *args)
+{
+ int fileid_type, fsid_len, maxsize, flags = 0;
+ struct knfsd_fh *fhp = &args->fhandle;
+ struct inode *inode = d_inode(dentry);
+ struct inode *parent = NULL;
+ struct fid *fid;
+
+ fsid_len = key_len(fi->fi_fhandle.fh_fsid_type);
+ fhp->fh_size = 4 + fsid_len;
+
+ /* Copy first 4 bytes + fsid */
+ memcpy(&fhp->fh_raw, &fi->fi_fhandle.fh_raw, fhp->fh_size);
+
+ fid = (struct fid *)(fh_fsid(fhp) + fsid_len/4);
+ maxsize = (NFS4_FHSIZE - fhp->fh_size)/4;
+
+ if (fi->fi_connectable && !S_ISDIR(inode->i_mode)) {
+ parent = d_inode(nf->nf_file->f_path.dentry);
+ flags = EXPORT_FH_CONNECTABLE;
+ }
+
+ fileid_type = exportfs_encode_inode_fh(inode, fid, &maxsize, parent, flags);
+ if (fileid_type < 0)
+ return false;
+
+ fhp->fh_fileid_type = fileid_type;
+ fhp->fh_size += maxsize * 4;
+ return true;
+}
+
#define CB_NOTIFY_STATX_REQUEST_MASK (STATX_BASIC_STATS | \
STATX_BTIME | \
STATX_CHANGE_COOKIE)
@@ -4118,6 +4151,7 @@ nfsd4_setup_notify_entry4(struct notify_entry4 *ne, struct xdr_stream *xdr,
struct dentry *dentry, struct nfs4_delegation *dp,
struct nfsd_file *nf, char *name, u32 namelen)
{
+ struct nfs4_file *fi = dp->dl_stid.sc_file;
struct path path = { .mnt = nf->nf_file->f_path.mnt,
.dentry = dentry };
struct nfsd4_fattr_args args = { };
@@ -4156,6 +4190,9 @@ nfsd4_setup_notify_entry4(struct notify_entry4 *ne, struct xdr_stream *xdr,
FATTR4_WORD1_TIME_METADATA | FATTR4_WORD1_TIME_MODIFY;
attrmask[2] = 0;
+ if (setup_notify_fhandle(dentry, fi, nf, &args))
+ attrmask[0] |= FATTR4_WORD0_FILEHANDLE;
+
if (args.stat.result_mask & STATX_BTIME)
attrmask[1] |= FATTR4_WORD1_TIME_CREATE;
--
2.53.0
^ permalink raw reply related
* [PATCH 22/24] nfsd: properly track requested child attributes
From: Jeff Layton @ 2026-04-07 13:21 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260407-dir-deleg-v1-0-aaf68c478abd@kernel.org>
Track the union of requested and supported child attributes in the
delegation, and only encode the attributes in that union when sending
add/remove/rename updates.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/nfs4proc.c | 2 ++
fs/nfsd/nfs4state.c | 18 ++++++++++++++++++
fs/nfsd/nfs4xdr.c | 15 ++++++---------
fs/nfsd/state.h | 3 +++
4 files changed, 29 insertions(+), 9 deletions(-)
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 01e3bf9e1839..a807a55dddf9 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -2553,6 +2553,8 @@ nfsd4_get_dir_delegation(struct svc_rqst *rqstp,
gdd->gddrnf_status = GDD4_OK;
memcpy(&gdd->gddr_stateid, &dd->dl_stid.sc_stateid, sizeof(gdd->gddr_stateid));
+ gdd->gddr_child_attributes[0] = dd->dl_child_attrs[0];
+ gdd->gddr_child_attributes[1] = dd->dl_child_attrs[1];
nfs4_put_stid(&dd->dl_stid);
return nfs_ok;
}
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 0580c935d804..59a9b1ca836b 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -9780,6 +9780,21 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct dentry *dentry,
return status;
}
+#define GDD_WORD0_CHILD_ATTRS (FATTR4_WORD0_TYPE | \
+ FATTR4_WORD0_CHANGE | \
+ FATTR4_WORD0_SIZE | \
+ FATTR4_WORD0_FILEID | \
+ FATTR4_WORD0_FILEHANDLE)
+
+#define GDD_WORD1_CHILD_ATTRS (FATTR4_WORD1_MODE | \
+ FATTR4_WORD1_NUMLINKS | \
+ FATTR4_WORD1_RAWDEV | \
+ FATTR4_WORD1_SPACE_USED | \
+ FATTR4_WORD1_TIME_ACCESS | \
+ FATTR4_WORD1_TIME_METADATA | \
+ FATTR4_WORD1_TIME_MODIFY | \
+ FATTR4_WORD1_TIME_CREATE)
+
/**
* nfsd_get_dir_deleg - attempt to get a directory delegation
* @cstate: compound state
@@ -9849,6 +9864,9 @@ nfsd_get_dir_deleg(struct nfsd4_compound_state *cstate,
dp->dl_stid.sc_export =
exp_get(cstate->current_fh.fh_export);
+ dp->dl_child_attrs[0] = gdd->gdda_child_attributes[0] & GDD_WORD0_CHILD_ATTRS;
+ dp->dl_child_attrs[1] = gdd->gdda_child_attributes[1] & GDD_WORD1_CHILD_ATTRS;
+
fl = nfs4_alloc_init_lease(dp, gdd->gddr_notification[0]);
if (!fl)
goto out_put_stid;
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index e468cbc087ad..35646809becb 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4183,18 +4183,15 @@ nfsd4_setup_notify_entry4(struct notify_entry4 *ne, struct xdr_stream *xdr,
args.change_attr = nfsd4_change_attribute(&args.stat);
- attrmask[0] = FATTR4_WORD0_TYPE | FATTR4_WORD0_CHANGE |
- FATTR4_WORD0_SIZE | FATTR4_WORD0_FILEID;
- attrmask[1] = FATTR4_WORD1_MODE | FATTR4_WORD1_NUMLINKS | FATTR4_WORD1_RAWDEV |
- FATTR4_WORD1_SPACE_USED | FATTR4_WORD1_TIME_ACCESS |
- FATTR4_WORD1_TIME_METADATA | FATTR4_WORD1_TIME_MODIFY;
+ attrmask[0] = dp->dl_child_attrs[0];
+ attrmask[1] = dp->dl_child_attrs[1];
attrmask[2] = 0;
- if (setup_notify_fhandle(dentry, fi, nf, &args))
- attrmask[0] |= FATTR4_WORD0_FILEHANDLE;
+ if (!setup_notify_fhandle(dentry, fi, nf, &args))
+ attrmask[0] &= ~FATTR4_WORD0_FILEHANDLE;
- if (args.stat.result_mask & STATX_BTIME)
- attrmask[1] |= FATTR4_WORD1_TIME_CREATE;
+ if (!(args.stat.result_mask & STATX_BTIME))
+ attrmask[1] &= ~FATTR4_WORD1_TIME_CREATE;
ne->ne_attrs.attrmask.count = 2;
ne->ne_attrs.attr_vals.data = (u8 *)xdr->p;
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index d060d70c5820..7ca5ef9caafe 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -282,6 +282,9 @@ struct nfs4_delegation {
struct timespec64 dl_atime;
struct timespec64 dl_mtime;
struct timespec64 dl_ctime;
+
+ /* For dir delegations */
+ uint32_t dl_child_attrs[2];
};
static inline bool deleg_is_read(u32 dl_type)
--
2.53.0
^ permalink raw reply related
* [PATCH 23/24] nfsd: track requested dir attributes
From: Jeff Layton @ 2026-04-07 13:21 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260407-dir-deleg-v1-0-aaf68c478abd@kernel.org>
Track the union of requested and supported dir attributes in the
delegation, and only encode the attributes in that union when sending
add/remove/rename updates.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/nfs4proc.c | 7 ++++---
fs/nfsd/nfs4state.c | 14 +++++++++++++-
fs/nfsd/state.h | 2 ++
3 files changed, 19 insertions(+), 4 deletions(-)
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index a807a55dddf9..82d7c473e4d3 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -2506,9 +2506,10 @@ nfsd4_verify(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
return status == nfserr_same ? nfs_ok : status;
}
-#define SUPPORTED_NOTIFY_MASK (BIT(NOTIFY4_REMOVE_ENTRY) | \
- BIT(NOTIFY4_ADD_ENTRY) | \
- BIT(NOTIFY4_RENAME_ENTRY) | \
+#define SUPPORTED_NOTIFY_MASK (BIT(NOTIFY4_CHANGE_DIR_ATTRS) | \
+ BIT(NOTIFY4_REMOVE_ENTRY) | \
+ BIT(NOTIFY4_ADD_ENTRY) | \
+ BIT(NOTIFY4_RENAME_ENTRY) | \
BIT(NOTIFY4_GFLAG_EXTEND))
static __be32
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 59a9b1ca836b..c4b6f4d65a47 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -9795,6 +9795,15 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct dentry *dentry,
FATTR4_WORD1_TIME_MODIFY | \
FATTR4_WORD1_TIME_CREATE)
+#define GDD_WORD0_DIR_ATTRS (FATTR4_WORD0_CHANGE | \
+ FATTR4_WORD0_SIZE)
+
+#define GDD_WORD1_DIR_ATTRS (FATTR4_WORD1_NUMLINKS | \
+ FATTR4_WORD1_SPACE_USED | \
+ FATTR4_WORD1_TIME_ACCESS | \
+ FATTR4_WORD1_TIME_METADATA | \
+ FATTR4_WORD1_TIME_MODIFY)
+
/**
* nfsd_get_dir_deleg - attempt to get a directory delegation
* @cstate: compound state
@@ -9864,10 +9873,13 @@ nfsd_get_dir_deleg(struct nfsd4_compound_state *cstate,
dp->dl_stid.sc_export =
exp_get(cstate->current_fh.fh_export);
+ dp->dl_notify_mask = gdd->gddr_notification[0];
dp->dl_child_attrs[0] = gdd->gdda_child_attributes[0] & GDD_WORD0_CHILD_ATTRS;
dp->dl_child_attrs[1] = gdd->gdda_child_attributes[1] & GDD_WORD1_CHILD_ATTRS;
+ dp->dl_dir_attrs[0] = gdd->gdda_dir_attributes[0] & GDD_WORD0_DIR_ATTRS;
+ dp->dl_dir_attrs[1] = gdd->gdda_dir_attributes[1] & GDD_WORD1_DIR_ATTRS;
- fl = nfs4_alloc_init_lease(dp, gdd->gddr_notification[0]);
+ fl = nfs4_alloc_init_lease(dp, dp->dl_notify_mask);
if (!fl)
goto out_put_stid;
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 7ca5ef9caafe..56a3cfb12e65 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -284,7 +284,9 @@ struct nfs4_delegation {
struct timespec64 dl_ctime;
/* For dir delegations */
+ uint32_t dl_notify_mask;
uint32_t dl_child_attrs[2];
+ uint32_t dl_dir_attrs[2];
};
static inline bool deleg_is_read(u32 dl_type)
--
2.53.0
^ permalink raw reply related
* [PATCH 24/24] nfsd: add support to CB_NOTIFY for dir attribute changes
From: Jeff Layton @ 2026-04-07 13:21 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Chuck Lever,
Alexander Aring, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, NeilBrown,
Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust,
Anna Schumaker, Amir Goldstein
Cc: Calum Mackay, linux-fsdevel, linux-kernel, linux-trace-kernel,
linux-doc, linux-nfs, Jeff Layton
In-Reply-To: <20260407-dir-deleg-v1-0-aaf68c478abd@kernel.org>
If the client requested dir attribute change notifications, send those
alongside any set of add/remove/rename events. Note that the server will
still recall the delegation on a SETATTR, so these are only sent for
changes to child dirents.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
fs/nfsd/nfs4state.c | 25 ++++++++++++++++++++--
fs/nfsd/nfs4xdr.c | 61 +++++++++++++++++++++++++++++++++++++++++++++--------
fs/nfsd/xdr4.h | 2 ++
3 files changed, 77 insertions(+), 11 deletions(-)
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index c4b6f4d65a47..01a2fb11dc0e 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -3470,10 +3470,15 @@ nfsd4_cb_notify_prepare(struct nfsd4_callback *cb)
struct nfsd_notify_event *events[NOTIFY4_EVENT_QUEUE_SIZE];
struct xdr_buf xdr = { .buflen = PAGE_SIZE * NOTIFY4_PAGE_ARRAY_SIZE,
.pages = ncn->ncn_pages };
+ int limit = NOTIFY4_EVENT_QUEUE_SIZE;
struct xdr_stream stream;
struct nfsd_file *nf;
- int count, i;
bool error = false;
+ int count, i;
+
+ /* Save a slot for dir attr update if requested */
+ if (dp->dl_notify_mask & BIT(NOTIFY4_CHANGE_DIR_ATTRS))
+ --limit;
xdr_init_encode_pages(&stream, &xdr);
@@ -3487,7 +3492,7 @@ nfsd4_cb_notify_prepare(struct nfsd4_callback *cb)
}
/* we can't keep up! */
- if (count > NOTIFY4_EVENT_QUEUE_SIZE) {
+ if (count > limit) {
spin_unlock(&ncn->ncn_lock);
goto out_recall;
}
@@ -3534,6 +3539,22 @@ nfsd4_cb_notify_prepare(struct nfsd4_callback *cb)
nfsd_notify_event_put(nne);
}
if (!error) {
+ if (dp->dl_notify_mask & BIT(NOTIFY4_CHANGE_DIR_ATTRS)) {
+ u32 *maskp = (u32 *)xdr_reserve_space(&stream, sizeof(*maskp));
+
+ if (maskp) {
+ u8 *p = nfsd4_encode_dir_attr_change(&stream, dp, nf);
+
+ if (p) {
+ *maskp = BIT(NOTIFY4_CHANGE_DIR_ATTRS);
+ ncn->ncn_nf[count].notify_mask.count = 1;
+ ncn->ncn_nf[count].notify_mask.element = maskp;
+ ncn->ncn_nf[count].notify_vals.data = p;
+ ncn->ncn_nf[count].notify_vals.len = (u8 *)stream.p - p;
+ ++count;
+ }
+ }
+ }
ncn->ncn_nf_cnt = count;
nfsd_file_put(nf);
return true;
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 35646809becb..6e76502ca149 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -4152,11 +4152,11 @@ nfsd4_setup_notify_entry4(struct notify_entry4 *ne, struct xdr_stream *xdr,
struct nfsd_file *nf, char *name, u32 namelen)
{
struct nfs4_file *fi = dp->dl_stid.sc_file;
- struct path path = { .mnt = nf->nf_file->f_path.mnt,
- .dentry = dentry };
+ struct path path = nf->nf_file->f_path;
struct nfsd4_fattr_args args = { };
uint32_t *attrmask;
__be32 status;
+ bool parent;
int ret;
/* Reserve space for attrmask */
@@ -4168,6 +4168,9 @@ nfsd4_setup_notify_entry4(struct notify_entry4 *ne, struct xdr_stream *xdr,
ne->ne_file.len = namelen;
ne->ne_attrs.attrmask.element = attrmask;
+ parent = (dentry == path.dentry);
+ path.dentry = dentry;
+
/* FIXME: d_find_alias for inode ? */
if (!path.dentry || !d_inode(path.dentry))
goto noattrs;
@@ -4183,15 +4186,20 @@ nfsd4_setup_notify_entry4(struct notify_entry4 *ne, struct xdr_stream *xdr,
args.change_attr = nfsd4_change_attribute(&args.stat);
- attrmask[0] = dp->dl_child_attrs[0];
- attrmask[1] = dp->dl_child_attrs[1];
- attrmask[2] = 0;
+ if (parent) {
+ attrmask[0] = dp->dl_dir_attrs[0];
+ attrmask[1] = dp->dl_dir_attrs[1];
+ } else {
+ attrmask[0] = dp->dl_child_attrs[0];
+ attrmask[1] = dp->dl_child_attrs[1];
- if (!setup_notify_fhandle(dentry, fi, nf, &args))
- attrmask[0] &= ~FATTR4_WORD0_FILEHANDLE;
+ if (!setup_notify_fhandle(dentry, fi, nf, &args))
+ attrmask[0] &= ~FATTR4_WORD0_FILEHANDLE;
- if (!(args.stat.result_mask & STATX_BTIME))
- attrmask[1] &= ~FATTR4_WORD1_TIME_CREATE;
+ if (!(args.stat.result_mask & STATX_BTIME))
+ attrmask[1] &= ~FATTR4_WORD1_TIME_CREATE;
+ }
+ attrmask[2] = 0;
ne->ne_attrs.attrmask.count = 2;
ne->ne_attrs.attr_vals.data = (u8 *)xdr->p;
@@ -4283,6 +4291,41 @@ u8 *nfsd4_encode_notify_event(struct xdr_stream *xdr, struct nfsd_notify_event *
return NULL;
}
+/**
+ * nfsd4_encode_dir_attr_change
+ * @xdr: stream to which to encode the fattr4
+ * @dp: delegation where the event occurred
+ * @nf: nfsd_file opened on the directory
+ *
+ * Encode a dir attr change event.
+ */
+u8 *nfsd4_encode_dir_attr_change(struct xdr_stream *xdr, struct nfs4_delegation *dp,
+ struct nfsd_file *nf)
+{
+ struct dentry *dentry = nf->nf_file->f_path.dentry;
+ struct notify_attr4 na = { };
+ struct name_snapshot n;
+ bool ret;
+ u8 *p = NULL;
+
+ if (!(dp->dl_notify_mask & BIT(NOTIFY4_CHANGE_DIR_ATTRS)))
+ return NULL;
+
+ take_dentry_name_snapshot(&n, dentry);
+ ret = nfsd4_setup_notify_entry4(&na.na_changed_entry, xdr,
+ dentry, dp, nf, (char *)n.name.name,
+ n.name.len);
+
+ /* Don't bother with the event if we're not encoding attrs */
+ if (ret && na.na_changed_entry.ne_attrs.attr_vals.len) {
+ p = (u8 *)xdr->p;
+ if (!xdrgen_encode_notify_attr4(xdr, &na))
+ p = NULL;
+ }
+ release_dentry_name_snapshot(&n);
+ return p;
+}
+
static void svcxdr_init_encode_from_buffer(struct xdr_stream *xdr,
struct xdr_buf *buf, __be32 *p, int bytes)
{
diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
index d276840aca50..cf7f0df68d63 100644
--- a/fs/nfsd/xdr4.h
+++ b/fs/nfsd/xdr4.h
@@ -958,6 +958,8 @@ __be32 nfsd4_encode_fattr_to_buf(__be32 **p, int words,
u8 *nfsd4_encode_notify_event(struct xdr_stream *xdr, struct nfsd_notify_event *nne,
struct nfs4_delegation *dd, struct nfsd_file *nf,
u32 *notify_mask);
+u8 *nfsd4_encode_dir_attr_change(struct xdr_stream *xdr, struct nfs4_delegation *dp,
+ struct nfsd_file *nf);
extern __be32 nfsd4_setclientid(struct svc_rqst *rqstp,
struct nfsd4_compound_state *, union nfsd4_op_u *u);
extern __be32 nfsd4_setclientid_confirm(struct svc_rqst *rqstp,
--
2.53.0
^ permalink raw reply related
* Re: [RFC PATCH 3/4] livepatch: Add "replaceable" attribute to klp_patch
From: Petr Mladek @ 2026-04-07 13:52 UTC (permalink / raw)
To: Yafang Shao
Cc: Song Liu, Dylan Hatch, jpoimboe, jikos, mbenes, joe.lawrence,
rostedt, mhiramat, mathieu.desnoyers, kpsingh, mattbobrowski,
jolsa, ast, daniel, andrii, martin.lau, eddyz87, memxor,
yonghong.song, live-patching, linux-kernel, linux-trace-kernel,
bpf
In-Reply-To: <CALOAHbCbcw2jpjk9JD9yyf+SMpQ-s9FAonSaz7Gs4XUeP+w+2g@mail.gmail.com>
On Mon 2026-04-06 19:08:05, Yafang Shao wrote:
> On Sat, Apr 4, 2026 at 5:36 AM Song Liu <song@kernel.org> wrote:
> >
> > On Fri, Apr 3, 2026 at 1:55 PM Dylan Hatch <dylanbhatch@google.com> wrote:
> > [...]
> > > > IIRC, the use case for this change is when multiple users load various
> > > > livepatch modules on the same system. I still don't believe this is the
> > > > right way to manage livepatches. That said, I won't really NACK this
> > > > if other folks think this is a useful option.
> > >
> > > In our production fleet, we apply exactly one cumulative livepatch
> > > module, and we use per-kernel build "livepatch release" branches to
> > > track the contents of these cumulative livepatches. This model has
> > > worked relatively well for us, but there are some painpoints.
> > >
> > > We are often under pressure to selectively deploy a livepatch fix to
> > > certain subpopulations of production. If the subpopulation is running
> > > the same build of everything else, this would require us to introduce
> > > another branching factor to the "livepatch release" branches --
> > > something we do not support due to the added toil and complexity.
> > >
> > > However, if we had the ability to build "off-band" livepatch modules
> > > that were marked as non-replaceable, we could support these selective
> > > patches without the additional branching factor. I will have to
> > > circulate the idea internally, but to me this seems like a very useful
> > > option to have in certain cases.
> >
> > IIUC, the plan is:
> >
> > - The regular livepatches are cumulative, have the replace flag; and
> > are replaceable.
> > - The occasional "off-band" livepatches do not have the replace flag,
> > and are not replaceable.
> >
> > With this setup, for systems with off-band livepatches loaded, we can
> > still release a cumulative livepatch to replace the previous cumulative
> > livepatch. Is this the expected use case?
>
> That matches our expected use case.
>
> >
> > I think there are a few issues with this:
> > 1. The "off-band" livepatches cannot be replaced atomically. To upgrade
> > "off-band' livepatches, we will have to unload the old version and load
> > the new version later.
>
> Right. That is how the non-atomic-replace patch works.
>
> > 2. Any conflict with the off-band livepatches and regular livepatches will
> > be difficult to manage.
>
> We need to manage this conflict with a complex user script. That said,
> everything can be controlled from userspace.
>
> > IOW, we kind removed the benefit of cumulative
> > livepatches. For example, what shall we do if we really need two fixes
> > to the same kernel functions: one from the original branch, the other
> > from the off-band branch?
>
> We run tens of livepatches on our production servers and have never
> run into this issue. It's an extremely rare case — and if it does
> happen, a user script should be able to handle it just fine.
Could you please share the script? Or at least summarize the situations
when this script detect a conflict and refuse loading a livepatch?
I believe that most/all of these checks can be implemented in the kernel.
And if we agreed to add a hybrid mode than it should be added
together with the checks.
We have already invested a lot of effort into make the kernel
livepatching as safe as possible. From my POV, the most important
parts are:
+ consistency model: Tasks are transitioned separately when they
do not use any livepatched function.
+ atomic replace: Transition all livepatched functions at once.
If we agree to add the hybrid model then we should add it with
some safety belts as well. And it would be nice to get inspiration
about the safety checks from your script.
Best Regards,
Petr
^ permalink raw reply
* Re: [PATCH v5 0/3] PCI Controller event and LTSSM tracepoint support
From: Manivannan Sadhasivam @ 2026-04-07 14:43 UTC (permalink / raw)
To: Bjorn Helgaas, Shawn Lin
Cc: linux-rockchip, linux-pci, linux-trace-kernel, linux-doc,
Steven Rostedt
In-Reply-To: <1774403912-210670-1-git-send-email-shawn.lin@rock-chips.com>
On Wed, 25 Mar 2026 09:58:29 +0800, Shawn Lin wrote:
> This patch-set adds new pci controller event and LTSSM tracepoint used by host drivers
> which provide LTSSM trace functionality. The first user is pcie-dw-rockchip with a 256
> Bytes FIFO for recording LTSSM transition.
>
> Testing
> =========
>
> [...]
Applied, thanks!
[1/3] PCI: trace: Add PCI controller LTSSM transition tracepoint
commit: d1b7add89c004295cd48d7cd49946ed5cb5cbb55
[2/3] Documentation: tracing: Add PCI controller event documentation
commit: a3966a6f915ea7d1af0941ea26848d921e574c45
[3/3] PCI: dw-rockchip: Add pcie_ltssm_state_transition trace support
commit: a276c0d802d8d2a22088b7919d9e82e936995cf4
Best regards,
--
Manivannan Sadhasivam <mani@kernel.org>
^ permalink raw reply
* Re: [RFC PATCH 3/4] livepatch: Add "replaceable" attribute to klp_patch
From: Petr Mladek @ 2026-04-07 15:08 UTC (permalink / raw)
To: Yafang Shao
Cc: Song Liu, Joe Lawrence, Dylan Hatch, jpoimboe, jikos, mbenes,
rostedt, mhiramat, mathieu.desnoyers, kpsingh, mattbobrowski,
jolsa, ast, daniel, andrii, martin.lau, eddyz87, memxor,
yonghong.song, live-patching, linux-kernel, linux-trace-kernel,
bpf
In-Reply-To: <CALOAHbCxPA0dtsx7L2kYn8wwBdM=krZyOpfRTBiDW9qfA_zmzQ@mail.gmail.com>
On Tue 2026-04-07 17:45:31, Yafang Shao wrote:
> On Tue, Apr 7, 2026 at 11:16 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Tue, Apr 7, 2026 at 10:54 AM Song Liu <song@kernel.org> wrote:
> > >
> > > On Mon, Apr 6, 2026 at 2:12 PM Joe Lawrence <joe.lawrence@redhat.com> wrote:
> > > [...]
> > > > > > > - The regular livepatches are cumulative, have the replace flag; and
> > > > > > > are replaceable.
> > > > > > > - The occasional "off-band" livepatches do not have the replace flag,
> > > > > > > and are not replaceable.
> > > > > > >
> > > > > > > With this setup, for systems with off-band livepatches loaded, we can
> > > > > > > still release a cumulative livepatch to replace the previous cumulative
> > > > > > > livepatch. Is this the expected use case?
> > > > > >
> > > > > > That matches our expected use case.
> > > > >
> > > > > If we really want to serve use cases like this, I think we can introduce
> > > > > some replace tag concept: Each livepatch will have a tag, u32 number.
> > > > > Newly loaded livepatch will only replace existing livepatch with the
> > > > > same tag. We can even reuse the existing "bool replace" in klp_patch,
> > > > > and make it u32: replace=0 means no replace; replace > 0 are the
> > > > > replace tag.
> > > > >
> > > > > For current users of cumulative patches, all the livepatch will have the
> > > > > same tag, say 1. For your use case, you can assign each user a
> > > > > unique tag. Then all these users can do atomic upgrades of their
> > > > > own livepatches.
> > > > >
> > > > > We may also need to check whether two livepatches of different tags
> > > > > touch the same kernel function. When that happens, the later
> > > > > livepatch should fail to load.
I still think how to make the hybrid mode more secure:
+ The isolated sets of livepatched functions look like a good rule.
+ What about isolating the shadow variables/states as well?
> > That sounds like a viable solution. I'll look into it and see how we
> > can implement it.
>
> Does the following change look good to you ?
>
> Subject: [PATCH] livepatch: Support scoped atomic replace using replace tags
>
> Extend the replace attribute from a boolean to a u32 to act as a replace
> tag. This introduces the following semantics:
>
> replace = 0: Atomic replace is disabled. However, this patch remains
> eligible to be superseded by others.
> replace > 0: Enables tagged replace (default is 1). A newly loaded
> livepatch will only replace existing patches that share the
> same tag.
>
> To maintain backward compatibility, a patch with replace == 0 does not
> trigger an outgoing atomic replace, but remains eligible to be superseded
> by any incoming patch with a valid replace tag.
>
> diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
> index ba9e3988c07c..417c67a17b99 100644
> --- a/include/linux/livepatch.h
> +++ b/include/linux/livepatch.h
> @@ -123,7 +123,11 @@ struct klp_state {
> * @mod: reference to the live patch module
> * @objs: object entries for kernel objects to be patched
> * @states: system states that can get modified
> - * @replace: replace all actively used patches
> + * @replace: replace tag:
> + * = 0: Atomic replace is disabled; however, this patch remains
> + * eligible to be superseded by others.
This is weird semantic. Which livepatch tag would be allowed to
supersede it, please?
Do we still need this category?
> + * > 0: Atomic replace is enabled. Only existing patches with a
> + * matching replace tag will be superseded.
> * @list: list node for global list of actively used patches
> * @kobj: kobject for sysfs resources
> * @obj_list: dynamic list of the object entries
> @@ -137,7 +141,7 @@ struct klp_patch {
> struct module *mod;
> struct klp_object *objs;
> struct klp_state *states;
> - bool replace;
> + unsigned int replace;
This already breaks the backward compatibility by changing the type
and semantic of this field. I would also change the name to better
match the new semantic. What about using:
* @replace_set: Livepatch using the same @replace_set will get
atomically replaced, see also conflicts[*].
unsigned int replace_set;
[*] A livepatch module, livepatching an already livepatches function,
can be loaded only when it has the same @replace_set number.
By other words, two livepatches conflict when they have a different
@replace_set number and have at least one livepatched version
in common.
>
> /* internal */
> struct list_head list;
> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> index 28d15ba58a26..e4e5c03b0724 100644
> --- a/kernel/livepatch/core.c
> +++ b/kernel/livepatch/core.c
> @@ -793,6 +793,8 @@ void klp_free_replaced_patches_async(struct
> klp_patch *new_patch)
> klp_for_each_patch_safe(old_patch, tmp_patch) {
> if (old_patch == new_patch)
> return;
> + if (old_patch->replace && old_patch->replace !=
> new_patch->replace)
> + continue;
> klp_free_patch_async(old_patch);
> }
> }
> @@ -1194,6 +1196,8 @@ void klp_unpatch_replaced_patches(struct
> klp_patch *new_patch)
> klp_for_each_patch(old_patch) {
> if (old_patch == new_patch)
> return;
> + if (old_patch->replace && old_patch->replace !=
> new_patch->replace)
> + continue;
>
> old_patch->enabled = false;
> klp_unpatch_objects(old_patch);
This handles only the freeing part. More changes will be
necessary:
+ klp_is_patch_compatible() must check also conflicts
between livepatches with different @replace_set.
The conflicts might be in the lists of:
+ livepatched functions
+ state IDs (aka callbacks and shadow variables IDs)
+ klp_add_nops() must skip livepatches with another @replace_set
+ klp_unpatch_replaced_patches() should unpatch only
patches with the same @replace_set
Finally, we would need to update existing selftests
plus add new selftests.
It is possible that I have missed something.
Anyway, you should wait for more feedback before you do too much
coding, especially the selftests are not needed at RFC stage.
Best Regards,
Petr
^ permalink raw reply
* Re: [PATCH] ring-buffer: Preserve true payload lengths in long data events
From: Steven Rostedt @ 2026-04-07 15:48 UTC (permalink / raw)
To: Cao Ruichuang
Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260407091550.67963-1-create0818@163.com>
On Tue, 7 Apr 2026 17:15:50 +0800
Cao Ruichuang <create0818@163.com> wrote:
> Long ring buffer data records currently store the aligned in-buffer size in
> their length field. That makes ring_buffer_event_length() report padded
> sizes, and small TRACE_PRINT / TRACE_RAW_DATA records lose their true
> payload length entirely when they use the short type_len encoding.
>
> Teach long data events to keep the true payload size in array[0], and let
> the ring buffer derive the aligned in-buffer size separately when it needs
> to walk or discard records. Then add a long-reserve helper and use it for
> TRACE_PRINT and TRACE_RAW_DATA so their zero-length-array tails always
> preserve the real payload size.
Sorry, I'm against this change. I made a conscious decision to have the
ring buffer meta data record the size allocated and not the size requested.
If an event needs to know the actual size, then it needs to add that
information itself. That's not the job of the ring buffer code.
NAK.
-- Steve
^ permalink raw reply
* Re: [PATCH] ring-buffer: Preserve true payload lengths in long data events
From: Cao Ruichuang @ 2026-04-07 16:45 UTC (permalink / raw)
To: rostedt; +Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260407114812.580c27d5@gandalf.local.home>
Thanks for the clarification.
I understand the design boundary now: the ring buffer metadata is
supposed to describe the allocated record size, not the requested
payload size. Given that, I will stop pushing this ring-buffer
approach.
If there is a valid fix here, it should be done at the event layer
for the users that need the true payload length, not in the generic
ring buffer code.
Thanks,
Cao Ruichuang
^ permalink raw reply
* Re: [PATCH] selftests/ftrace: Check exact trace_marker_raw payload lengths
From: Cao Ruichuang @ 2026-04-07 16:45 UTC (permalink / raw)
To: rostedt, mhiramat
Cc: mathieu.desnoyers, shuah, linux-kernel, linux-trace-kernel,
linux-kselftest
In-Reply-To: <20260407101245.78988-1-create0818@163.com>
A follow-up on this patch: I am pausing it in its current form.
The exact payload-length checks here were based on my earlier
ring-buffer change, and Steven has rejected that direction because
the ring buffer is meant to keep the allocated size, not the
requested payload size.
So this selftest should not move forward as-is. If there is a real
fix, it needs to come from an event-specific change that explicitly
stores the true payload length where needed. I will revisit it only
from that direction.
Thanks,
Cao Ruichuang
^ permalink raw reply
* [PATCH bpf-next v6 2/2] selftests/bpf: Add tests for kprobe attachment with duplicate symbols
From: Andrey Grodzovsky @ 2026-04-07 16:51 UTC (permalink / raw)
To: bpf, linux-trace-kernel, jolsa, ihor.solodrai
Cc: ast, daniel, andrii, rostedt, mhiramat, emil, linux-open-source
In-Reply-To: <20260407165145.1651061-1-andrey.grodzovsky@crowdstrike.com>
bpf_fentry_shadow_test exists in both vmlinux (net/bpf/test_run.c) and
bpf_testmod (bpf_testmod.c), creating a duplicate symbol condition when
bpf_testmod is loaded. Add subtests that verify kprobe behavior with
this duplicate symbol:
In attach_probe:
- dup-sym-{default,legacy,perf,link}: unqualified attach succeeds
across all four modes, preferring vmlinux over module shadow.
- dup-sym-module-{default,legacy,perf,link}: MOD:SYM qualification
(bpf_testmod:bpf_fentry_shadow_test) attaches to the module version.
In kprobe_multi_test:
- dup_sym: kprobe_multi attach with kprobe and kretprobe succeeds.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com>
---
.../selftests/bpf/prog_tests/attach_probe.c | 68 +++++++++++++++++++
.../bpf/prog_tests/kprobe_multi_test.c | 39 +++++++++++
2 files changed, 107 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/attach_probe.c b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
index 12a841afda68..2d52547c395e 100644
--- a/tools/testing/selftests/bpf/prog_tests/attach_probe.c
+++ b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
@@ -197,6 +197,65 @@ static void test_attach_kprobe_legacy_by_addr_reject(void)
test_attach_probe_manual__destroy(skel);
}
+/* bpf_fentry_shadow_test exists in both vmlinux (net/bpf/test_run.c) and
+ * bpf_testmod (bpf_testmod.c). When bpf_testmod is loaded the symbol is
+ * duplicated. Test that kprobe attachment handles this correctly:
+ * - Unqualified name ("bpf_fentry_shadow_test") attaches to vmlinux.
+ * - MOD:SYM name ("bpf_testmod:bpf_fentry_shadow_test") attaches to module.
+ *
+ * Note: bpf_fentry_shadow_test is not invoked via test_run, so we only
+ * verify that attach and detach succeed without triggering the probe.
+ */
+static void test_attach_probe_dup_sym(enum probe_attach_mode attach_mode)
+{
+ DECLARE_LIBBPF_OPTS(bpf_kprobe_opts, kprobe_opts);
+ struct bpf_link *kprobe_link, *kretprobe_link;
+ struct test_attach_probe_manual *skel;
+
+ skel = test_attach_probe_manual__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "skel_dup_sym_open_and_load"))
+ return;
+
+ kprobe_opts.attach_mode = attach_mode;
+
+ /* Unqualified: should attach to vmlinux symbol */
+ kprobe_opts.retprobe = false;
+ kprobe_link = bpf_program__attach_kprobe_opts(skel->progs.handle_kprobe,
+ "bpf_fentry_shadow_test",
+ &kprobe_opts);
+ if (!ASSERT_OK_PTR(kprobe_link, "attach_kprobe_vmlinux"))
+ goto cleanup;
+ bpf_link__destroy(kprobe_link);
+
+ kprobe_opts.retprobe = true;
+ kretprobe_link = bpf_program__attach_kprobe_opts(skel->progs.handle_kretprobe,
+ "bpf_fentry_shadow_test",
+ &kprobe_opts);
+ if (!ASSERT_OK_PTR(kretprobe_link, "attach_kretprobe_vmlinux"))
+ goto cleanup;
+ bpf_link__destroy(kretprobe_link);
+
+ /* MOD:SYM qualified: should attach to module symbol */
+ kprobe_opts.retprobe = false;
+ kprobe_link = bpf_program__attach_kprobe_opts(skel->progs.handle_kprobe,
+ "bpf_testmod:bpf_fentry_shadow_test",
+ &kprobe_opts);
+ if (!ASSERT_OK_PTR(kprobe_link, "attach_kprobe_module"))
+ goto cleanup;
+ bpf_link__destroy(kprobe_link);
+
+ kprobe_opts.retprobe = true;
+ kretprobe_link = bpf_program__attach_kprobe_opts(skel->progs.handle_kretprobe,
+ "bpf_testmod:bpf_fentry_shadow_test",
+ &kprobe_opts);
+ if (!ASSERT_OK_PTR(kretprobe_link, "attach_kretprobe_module"))
+ goto cleanup;
+ bpf_link__destroy(kretprobe_link);
+
+cleanup:
+ test_attach_probe_manual__destroy(skel);
+}
+
/* attach uprobe/uretprobe long event name testings */
static void test_attach_uprobe_long_event_name(void)
{
@@ -559,6 +618,15 @@ void test_attach_probe(void)
if (test__start_subtest("kprobe-legacy-by-addr-reject"))
test_attach_kprobe_legacy_by_addr_reject();
+ if (test__start_subtest("dup-sym-default"))
+ test_attach_probe_dup_sym(PROBE_ATTACH_MODE_DEFAULT);
+ if (test__start_subtest("dup-sym-legacy"))
+ test_attach_probe_dup_sym(PROBE_ATTACH_MODE_LEGACY);
+ if (test__start_subtest("dup-sym-perf"))
+ test_attach_probe_dup_sym(PROBE_ATTACH_MODE_PERF);
+ if (test__start_subtest("dup-sym-link"))
+ test_attach_probe_dup_sym(PROBE_ATTACH_MODE_LINK);
+
if (test__start_subtest("auto"))
test_attach_probe_auto(skel);
if (test__start_subtest("kprobe-sleepable"))
diff --git a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
index 78c974d4ea33..d89650e31fd3 100644
--- a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
+++ b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
@@ -633,6 +633,43 @@ static void test_attach_write_ctx(void)
}
#endif
+/* Test kprobe_multi handles shadow symbols (vmlinux + module duplicate).
+ * bpf_fentry_shadow_test exists in both vmlinux and bpf_testmod.
+ * kprobe_multi resolves via ftrace_lookup_symbols() which finds the
+ * vmlinux symbol first and stops, so this should always succeed.
+ */
+static void test_attach_probe_dup_sym(void)
+{
+ LIBBPF_OPTS(bpf_kprobe_multi_opts, opts);
+ const char *syms[1] = { "bpf_fentry_shadow_test" };
+ struct kprobe_multi *skel = NULL;
+ struct bpf_link *link1 = NULL, *link2 = NULL;
+
+ skel = kprobe_multi__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "kprobe_multi__open_and_load"))
+ goto cleanup;
+
+ skel->bss->pid = getpid();
+ opts.syms = syms;
+ opts.cnt = ARRAY_SIZE(syms);
+
+ link1 = bpf_program__attach_kprobe_multi_opts(skel->progs.test_kprobe_manual,
+ NULL, &opts);
+ if (!ASSERT_OK_PTR(link1, "attach_kprobe_multi_dup_sym"))
+ goto cleanup;
+
+ opts.retprobe = true;
+ link2 = bpf_program__attach_kprobe_multi_opts(skel->progs.test_kretprobe_manual,
+ NULL, &opts);
+ if (!ASSERT_OK_PTR(link2, "attach_kretprobe_multi_dup_sym"))
+ goto cleanup;
+
+cleanup:
+ bpf_link__destroy(link2);
+ bpf_link__destroy(link1);
+ kprobe_multi__destroy(skel);
+}
+
void serial_test_kprobe_multi_bench_attach(void)
{
if (test__start_subtest("kernel"))
@@ -676,5 +713,7 @@ void test_kprobe_multi_test(void)
test_unique_match();
if (test__start_subtest("attach_write_ctx"))
test_attach_write_ctx();
+ if (test__start_subtest("dup_sym"))
+ test_attach_probe_dup_sym();
RUN_TESTS(kprobe_multi_verifier);
}
--
2.34.1
^ permalink raw reply related
* [PATCH bpf-next v6 1/2] tracing: Prefer vmlinux symbols over module symbols for unqualified kprobes
From: Andrey Grodzovsky @ 2026-04-07 16:51 UTC (permalink / raw)
To: bpf, linux-trace-kernel, jolsa, ihor.solodrai
Cc: ast, daniel, andrii, rostedt, mhiramat, emil, linux-open-source
In-Reply-To: <20260407165145.1651061-1-andrey.grodzovsky@crowdstrike.com>
When an unqualified kprobe target exists in both vmlinux and a loaded
module, number_of_same_symbols() returns a count greater than 1,
causing kprobe attachment to fail with -EADDRNOTAVAIL even though the
vmlinux symbol is unambiguous.
When no module qualifier is given and the symbol is found in vmlinux,
return the vmlinux-only count without scanning loaded modules. This
preserves the existing behavior for all other cases:
- Symbol only in a module: vmlinux count is 0, falls through to module
scan as before.
- Symbol qualified with MOD:SYM: mod != NULL, unchanged path.
- Symbol ambiguous within vmlinux itself: count > 1 is returned as-is.
Fixes: 926fe783c8a6 ("tracing/kprobes: Fix symbol counting logic by looking at modules as well")
Fixes: 9d8616034f16 ("tracing/kprobes: Add symbol counting check when module loads")
Suggested-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com>
---
kernel/trace/trace_kprobe.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index a5dbb72528e0..99c41ea8b6d7 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -765,6 +765,13 @@ static unsigned int number_of_same_symbols(const char *mod, const char *func_nam
if (!mod)
kallsyms_on_each_match_symbol(count_symbols, func_name, &ctx.count);
+ /* If the symbol is found in vmlinux, use vmlinux resolution only.
+ * This prevents module symbols from shadowing vmlinux symbols
+ * and causing -EADDRNOTAVAIL for unqualified kprobe targets.
+ */
+ if (!mod && ctx.count > 0)
+ return ctx.count;
+
module_kallsyms_on_each_symbol(mod, count_mod_symbols, &ctx);
return ctx.count;
--
2.34.1
^ permalink raw reply related
* [PATCH bpf-next v6 0/2] tracing: Fix kprobe attachment when module shadows vmlinux symbol
From: Andrey Grodzovsky @ 2026-04-07 16:51 UTC (permalink / raw)
To: bpf, linux-trace-kernel, jolsa, ihor.solodrai
Cc: ast, daniel, andrii, rostedt, mhiramat, emil, linux-open-source
When a kernel module exports a symbol with the same name as an existing
vmlinux symbol, kprobe attachment fails with -EADDRNOTAVAIL because
number_of_same_symbols() counts matches across both vmlinux and all
loaded modules, returning a count greater than 1.
This series takes a different approach from v1-v5, which implemented a
libbpf-side fallback parsing /proc/kallsyms and retrying with the
absolute address. That approach was rejected (Andrii Nakryiko, Ihor
Solodrai) because ambiguous symbol resolution does not belong in libbpf,
and because it did not cover the kprobe_multi path.
Following Ihor's suggestion, this series fixes the root cause in the
kernel: when an unqualified symbol name is given and the symbol is found
in vmlinux, prefer the vmlinux symbol and do not scan loaded modules.
This makes the skeleton auto-attach path work transparently with no
libbpf changes needed.
Patch 1: Kernel fix - return vmlinux-only count from
number_of_same_symbols() when the symbol is found in vmlinux,
preventing module shadows from causing -EADDRNOTAVAIL.
Patch 2: Selftests using bpf_fentry_shadow_test which exists in both
vmlinux and bpf_testmod - tests unqualified (vmlinux) and
MOD:SYM (module) attachment across all four attach modes, plus
kprobe_multi with the duplicate symbol.
Changes since v5 [1]:
- Selftest: use existing bpf_fentry_shadow_test (vmlinux + bpf_testmod)
instead of a new bpf_testmod_dup_sym.ko test module (Jiri Olsa).
- Selftest: add MOD:SYM qualification test for module shadow attachment
(Jiri Olsa).
- Selftest: add kprobe_multi dup_sym test in kprobe_multi_test.c
(Jiri Olsa).
[1] https://lore.kernel.org/bpf/20260406193158.754498-1-andrey.grodzovsky@crowdstrike.com/
Andrey Grodzovsky (2):
tracing: Prefer vmlinux symbols over module symbols for unqualified
kprobes
selftests/bpf: Add tests for kprobe attachment with duplicate symbols
kernel/trace/trace_kprobe.c | 7 ++
.../selftests/bpf/prog_tests/attach_probe.c | 68 +++++++++++++++++++
.../bpf/prog_tests/kprobe_multi_test.c | 39 +++++++++++
3 files changed, 114 insertions(+)
--
2.34.1
^ permalink raw reply
* Re: [PATCH] selftests/ftrace: Check exact trace_marker_raw payload lengths
From: Steven Rostedt @ 2026-04-07 16:54 UTC (permalink / raw)
To: CaoRuichuang
Cc: mhiramat, mathieu.desnoyers, shuah, linux-kernel,
linux-trace-kernel, linux-kselftest
In-Reply-To: <20260407101245.78988-1-create0818@163.com>
On Tue, 7 Apr 2026 18:12:45 +0800
CaoRuichuang <create0818@163.com> wrote:
> From: Cao Ruichuang <create0818@163.com>
>
> trace_marker_raw.tc currently depends on awk strtonum() and
> assumes that the printed raw-data byte count is rounded up to four
> bytes.
>
> That makes the test fail on systems that use mawk, and it no longer
> matches the raw_data trace output we want to validate after preserving
> true payload lengths for long records.
>
> Rewrite the test to capture a small sequence of raw marker writes and
> check the exact number of printed payload bytes in order. While doing
> that, use od for the endian probe, switch to a fixed raw marker id so
> the test only varies payload length, and disable pause-on-trace while
> streaming trace_pipe.
>
The tests are to validate the code and if the tests fail when the code is
working as designed, it is the test that is at fault, and the fix is to fix
the tests. Not to make the code match the test.
-- Steve
^ permalink raw reply
* Re: [PATCH bpf-next v6 1/2] tracing: Prefer vmlinux symbols over module symbols for unqualified kprobes
From: Ihor Solodrai @ 2026-04-07 18:42 UTC (permalink / raw)
To: Andrey Grodzovsky, bpf, linux-trace-kernel, jolsa
Cc: ast, daniel, andrii, rostedt, mhiramat, emil, linux-open-source
In-Reply-To: <20260407165145.1651061-2-andrey.grodzovsky@crowdstrike.com>
On 4/7/26 9:51 AM, Andrey Grodzovsky wrote:
> When an unqualified kprobe target exists in both vmlinux and a loaded
> module, number_of_same_symbols() returns a count greater than 1,
> causing kprobe attachment to fail with -EADDRNOTAVAIL even though the
> vmlinux symbol is unambiguous.
>
> When no module qualifier is given and the symbol is found in vmlinux,
> return the vmlinux-only count without scanning loaded modules. This
> preserves the existing behavior for all other cases:
> - Symbol only in a module: vmlinux count is 0, falls through to module
> scan as before.
> - Symbol qualified with MOD:SYM: mod != NULL, unchanged path.
> - Symbol ambiguous within vmlinux itself: count > 1 is returned as-is.
>
> Fixes: 926fe783c8a6 ("tracing/kprobes: Fix symbol counting logic by looking at modules as well")
> Fixes: 9d8616034f16 ("tracing/kprobes: Add symbol counting check when module loads")
> Suggested-by: Ihor Solodrai <ihor.solodrai@linux.dev>
> Acked-by: Jiri Olsa <jolsa@kernel.org>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com>
> ---
> kernel/trace/trace_kprobe.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index a5dbb72528e0..99c41ea8b6d7 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -765,6 +765,13 @@ static unsigned int number_of_same_symbols(const char *mod, const char *func_nam
> if (!mod)
> kallsyms_on_each_match_symbol(count_symbols, func_name, &ctx.count);
>
> + /* If the symbol is found in vmlinux, use vmlinux resolution only.
> + * This prevents module symbols from shadowing vmlinux symbols
> + * and causing -EADDRNOTAVAIL for unqualified kprobe targets.
> + */
> + if (!mod && ctx.count > 0)
> + return ctx.count;
Nice to see this actually works.
Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev>
> +
> module_kallsyms_on_each_symbol(mod, count_mod_symbols, &ctx);
>
> return ctx.count;
^ permalink raw reply
* [RFC 0/2] openrisc: Add support for KProbes
From: Sahil Siddiq @ 2026-04-07 18:56 UTC (permalink / raw)
To: jonas, stefan.kristiansson, shorne, naveen, davem, mhiramat
Cc: peterz, jpoimboe, jbaron, rostedt, ardb, chenmiao.ku, johannes,
nsc, masahiroy, tytso, linux-openrisc, linux-kernel,
linux-trace-kernel, Sahil Siddiq
Hi,
This series adds basic support for KProbes on OpenRISC. There are
a few changes that I would still like to add and test before this
can be considered for merging. I was hoping to get some feedback on
the changes made so far. The implementation in this series is based
on KProbes for LoongArch, MIPS and RISC-V.
The current state of the series allows traps to be inserted dynamically
in the kernel. A KProbe can be inserted via a kernel module whose
init/exit functions are used to register/unregister a KProbe. A pre-
handler and post-handler can also be provisioned in the module, which
are associated with the KProbe and triggered when the probe is hit. See
the documentation on KProbes for a detailed explanation [1].
The following are yet to be implemented for OpenRISC:
1. kretprobes
2. kprobe-based event tracing
3. ftrace, and kprobe features that depend on ftrace (particularly,
dynamic tracing)
I hope to submit a patch for kretprobes soon (possibly in a revision of
this series).
I wrote a couple of kernel modules to test these changes. They can be found
here [2]. I also ran test_kprobes located at ./lib/tests/ against these
changes [3]. The results are as shown below:
/home # insmod test_kprobes.ko
KTAP version 1
1..1
KTAP version 1
# Subtest: kprobes_test
# module: test_kprobes
1..3
ok 1 test_kprobe
ok 2 test_kprobes
ok 3 test_kprobe_missed
# kprobes_test: pass:3 fail:0 skip:0 total:3
# Totals: pass:3 fail:0 skip:0 total:3
ok 1 kprobes_test
/home #
When compiling the kernel, the following options should be enabled:
1. CONFIG_HAVE_KPROBES=y
2. CONFIG_KPROBES=y
Also ensure that CONFIG_KPROBE_EVENTS is disabled.
To compile /lib/tests/test_kprobes.c, add the following to .config:
1. CONFIG_KUNIT=y
2. CONFIG_DEBUG_KERNEL=y
3. CONFIG_KPROBES_SANITY_TEST=m
The first commit cleans up and reorganizes existing functions, fixes
a few issues with instruction simulation, and introduces new structures
and macros that will be used by KProbes and other tracing facilities
in the future.
The second commit adds support for KProbes. Currently, I have
implemented this in such a way that KProbes can't be used to probe
a few "blacklisted" instructions. Probes can't be inserted in a delay
slot either (similar to MIPS). I have also added a few asm functions
to the blacklist that I think should not be probed. For e.g., "memset"
and "_trap_handler" have been blacklisted because probing them causes
the kernel to hang. However, I am not sure if other functions in "entry.S"
need to be added as well to the blacklist.
Thanks,
Sahil
[1] https://www.kernel.org/doc/html/latest/trace/kprobes.html
[2] https://github.com/valdaarhun/or-dev/tree/main/home
[3] https://github.com/openrisc/linux/blob/for-next/lib/tests/test_kprobes.c
Sahil Siddiq (2):
openrisc: Add utilities and clean up simulation of instructions
openrisc: Add KProbes
arch/openrisc/Kconfig | 1 +
arch/openrisc/configs/or1ksim_defconfig | 2 +
arch/openrisc/configs/virt_defconfig | 2 +
arch/openrisc/include/asm/asm.h | 22 ++
arch/openrisc/include/asm/break.h | 19 ++
arch/openrisc/include/asm/insn-def.h | 61 +++-
arch/openrisc/include/asm/kprobes.h | 76 +++++
arch/openrisc/include/asm/spr_defs.h | 1 +
arch/openrisc/kernel/Makefile | 3 +-
arch/openrisc/kernel/entry.S | 16 +
arch/openrisc/kernel/insn.c | 74 +++++
arch/openrisc/kernel/jump_label.c | 2 +-
arch/openrisc/kernel/kprobes.c | 381 ++++++++++++++++++++++++
arch/openrisc/kernel/traps.c | 67 ++---
arch/openrisc/lib/memcpy.c | 2 +
arch/openrisc/lib/memset.S | 4 +
arch/openrisc/mm/fault.c | 5 +
samples/kprobes/kprobe_example.c | 8 +
18 files changed, 701 insertions(+), 45 deletions(-)
create mode 100644 arch/openrisc/include/asm/asm.h
create mode 100644 arch/openrisc/include/asm/break.h
create mode 100644 arch/openrisc/include/asm/kprobes.h
create mode 100644 arch/openrisc/kernel/insn.c
create mode 100644 arch/openrisc/kernel/kprobes.c
--
2.53.0
^ permalink raw reply
* [RFC 1/2] openrisc: Add utilities and clean up simulation of instructions
From: Sahil Siddiq @ 2026-04-07 18:56 UTC (permalink / raw)
To: jonas, stefan.kristiansson, shorne, naveen, davem, mhiramat
Cc: peterz, jpoimboe, jbaron, rostedt, ardb, chenmiao.ku, johannes,
nsc, masahiroy, tytso, linux-openrisc, linux-kernel,
linux-trace-kernel, Sahil Siddiq
In-Reply-To: <20260407185650.79816-1-sahilcdq0@gmail.com>
Introduce new instruction-related utilities and macros for OpenRISC.
This is in preparation for patches that add tracing support such as
KProbes.
Simulate l.adrp. Fix bugs in simulation of l.jal and l.jalr. Earlier,
PC was being updated and then saved in the link register r9, resulting
in a corrupted page table (bad page map in process). Instead, update
PC after storing it in r9.
Move instruction simulation to its own file to enable reuse. Clean it
up and replace hardcoded values with computed expressions.
Link: https://raw.githubusercontent.com/openrisc/doc/master/openrisc-arch-1.4-rev0.pdf
Signed-off-by: Sahil Siddiq <sahilcdq0@gmail.com>
---
arch/openrisc/include/asm/insn-def.h | 61 +++++++++++++++++++++--
arch/openrisc/include/asm/spr_defs.h | 1 +
arch/openrisc/kernel/Makefile | 2 +-
arch/openrisc/kernel/insn.c | 74 ++++++++++++++++++++++++++++
arch/openrisc/kernel/jump_label.c | 2 +-
arch/openrisc/kernel/traps.c | 41 +--------------
6 files changed, 136 insertions(+), 45 deletions(-)
create mode 100644 arch/openrisc/kernel/insn.c
diff --git a/arch/openrisc/include/asm/insn-def.h b/arch/openrisc/include/asm/insn-def.h
index 1e0c028a5b95..c98f9770c52e 100644
--- a/arch/openrisc/include/asm/insn-def.h
+++ b/arch/openrisc/include/asm/insn-def.h
@@ -3,13 +3,66 @@
* Copyright (C) 2025 Chen Miao
*/
+#include <asm/spr.h>
+#include <asm/spr_defs.h>
+
#ifndef __ASM_OPENRISC_INSN_DEF_H
#define __ASM_OPENRISC_INSN_DEF_H
-/* or1k instructions are always 32 bits. */
-#define OPENRISC_INSN_SIZE 4
-
/* or1k nop instruction code */
-#define OPENRISC_INSN_NOP 0x15000000U
+#define INSN_NOP 0x15000000U
+
+#define INSN_CSYNC 0x23000000U
+#define INSN_MSYNC 0x22000000U
+#define INSN_PSYNC 0x22800000U
+
+#define OPCODE_TRAP 0x21000000U
+#define OPCODE_SYS 0x20000000U
+#define OPCODE_MACRC 0x18010000U
+
+struct pt_regs;
+
+enum six_bit_opcodes {
+ l_rfe = 0x09,
+ l_lwa = 0x1b,
+ l_mfspr = 0x2d,
+ l_mtspr = 0x30,
+ l_swa = 0x33,
+ l_j = 0x00,
+ l_jal = 0x01,
+ l_adrp = 0x02,
+ l_bnf = 0x03,
+ l_bf = 0x04,
+ l_jr = 0x11,
+ l_jalr = 0x12,
+};
+
+struct insn {
+ unsigned int opcode: 6;
+ unsigned int operands: 26;
+};
+
+union openrisc_instruction {
+ unsigned int word;
+ struct insn opcodes_6bit;
+};
+
+#define OPENRISC_INSN_SIZE (sizeof(union openrisc_instruction))
+
+/* Helpers for working with l.trap */
+static inline unsigned long __emit_trap(unsigned int code)
+{
+ return (code & 0xffff) | OPCODE_TRAP;
+}
+
+static inline bool has_delay_slot(void)
+{
+ unsigned int cpucfgr = mfspr(SPR_CPUCFGR);
+
+ return !(cpucfgr & SPR_CPUCFGR_ND);
+}
+
+void simulate_pc(struct pt_regs *regs, unsigned int jmp);
+void simulate_branch(struct pt_regs *regs, unsigned int jmp, bool has_delay_slot);
#endif /* __ASM_OPENRISC_INSN_DEF_H */
diff --git a/arch/openrisc/include/asm/spr_defs.h b/arch/openrisc/include/asm/spr_defs.h
index f0b6b492e9f4..5d13a9b96263 100644
--- a/arch/openrisc/include/asm/spr_defs.h
+++ b/arch/openrisc/include/asm/spr_defs.h
@@ -179,6 +179,7 @@
#define SPR_CPUCFGR_OF32S 0x00000080 /* ORFPX32 supported */
#define SPR_CPUCFGR_OF64S 0x00000100 /* ORFPX64 supported */
#define SPR_CPUCFGR_OV64S 0x00000200 /* ORVDX64 supported */
+#define SPR_CPUCFGR_ND 0x00000400 /* No delay slot */
#define SPR_CPUCFGR_RES 0xfffffc00 /* Reserved */
/*
diff --git a/arch/openrisc/kernel/Makefile b/arch/openrisc/kernel/Makefile
index 19e0eb94f2eb..150779fbf010 100644
--- a/arch/openrisc/kernel/Makefile
+++ b/arch/openrisc/kernel/Makefile
@@ -5,7 +5,7 @@
always-$(KBUILD_BUILTIN) := vmlinux.lds
-obj-y := head.o setup.o or32_ksyms.o process.o dma.o \
+obj-y := head.o insn.o setup.o or32_ksyms.o process.o dma.o \
traps.o time.o irq.o entry.o ptrace.o signal.o \
sys_call_table.o unwinder.o cacheinfo.o
diff --git a/arch/openrisc/kernel/insn.c b/arch/openrisc/kernel/insn.c
new file mode 100644
index 000000000000..2c97eceee6d7
--- /dev/null
+++ b/arch/openrisc/kernel/insn.c
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * OpenRISC instruction utils
+ *
+ * Linux architectural port borrowing liberally from similar works of
+ * others. All original copyrights apply as per the original source
+ * declaration.
+ *
+ * OpenRISC implementation:
+ * Copyright (C) 2026 Sahil Siddiq <sahilcdq0@gmail.com>
+ */
+
+#include <linux/ptrace.h>
+#include <asm/insn-def.h>
+
+void simulate_pc(struct pt_regs *regs, unsigned int jmp)
+{
+ int displacement;
+ unsigned int rd, op;
+
+ displacement = sign_extend32(((jmp) & 0x7ffff) << 13, 31);
+ rd = (jmp & 0x3ffffff) >> 21;
+ op = jmp >> 26;
+
+ switch (op) {
+ case l_adrp:
+ regs->gpr[rd] = displacement + (regs->pc & (-8192));
+ return;
+ default:
+ break;
+ }
+}
+
+void simulate_branch(struct pt_regs *regs, unsigned int jmp_insn, bool has_delay_slot)
+{
+ int displacement;
+ unsigned int rb, op, jmp;
+
+ displacement = sign_extend32(((jmp_insn) & 0x3ffffff) << 2, 27);
+ rb = (jmp_insn & 0x0000ffff) >> 11;
+ op = jmp_insn >> 26;
+ jmp = has_delay_slot ? 2 * OPENRISC_INSN_SIZE : OPENRISC_INSN_SIZE;
+
+ switch (op) {
+ case l_j: /* l.j */
+ regs->pc += displacement;
+ return;
+ case l_jal: /* l.jal */
+ regs->gpr[9] = regs->pc + jmp;
+ regs->pc += displacement;
+ return;
+ case l_bnf: /* l.bnf */
+ if (regs->sr & SPR_SR_F)
+ regs->pc += jmp;
+ else
+ regs->pc += displacement;
+ return;
+ case l_bf: /* l.bf */
+ if (regs->sr & SPR_SR_F)
+ regs->pc += displacement;
+ else
+ regs->pc += jmp;
+ return;
+ case l_jr: /* l.jr */
+ regs->pc = regs->gpr[rb];
+ return;
+ case l_jalr: /* l.jalr */
+ regs->gpr[9] = regs->pc + jmp;
+ regs->pc = regs->gpr[rb];
+ return;
+ default:
+ break;
+ }
+}
diff --git a/arch/openrisc/kernel/jump_label.c b/arch/openrisc/kernel/jump_label.c
index ab7137c23b46..fe082eb847a4 100644
--- a/arch/openrisc/kernel/jump_label.c
+++ b/arch/openrisc/kernel/jump_label.c
@@ -34,7 +34,7 @@ bool arch_jump_label_transform_queue(struct jump_entry *entry,
insn = offset;
} else {
- insn = OPENRISC_INSN_NOP;
+ insn = INSN_NOP;
}
if (early_boot_irqs_disabled)
diff --git a/arch/openrisc/kernel/traps.c b/arch/openrisc/kernel/traps.c
index c195be9cc9fc..ee87a3af34fc 100644
--- a/arch/openrisc/kernel/traps.c
+++ b/arch/openrisc/kernel/traps.c
@@ -32,6 +32,7 @@
#include <asm/bug.h>
#include <asm/fpu.h>
+#include <asm/insn-def.h>
#include <asm/io.h>
#include <asm/processor.h>
#include <asm/unwinder.h>
@@ -269,47 +270,9 @@ static inline int in_delay_slot(struct pt_regs *regs)
static inline void adjust_pc(struct pt_regs *regs, unsigned long address)
{
- int displacement;
- unsigned int rb, op, jmp;
-
if (unlikely(in_delay_slot(regs))) {
/* In delay slot, instruction at pc is a branch, simulate it */
- jmp = *((unsigned int *)regs->pc);
-
- displacement = sign_extend32(((jmp) & 0x3ffffff) << 2, 27);
- rb = (jmp & 0x0000ffff) >> 11;
- op = jmp >> 26;
-
- switch (op) {
- case 0x00: /* l.j */
- regs->pc += displacement;
- return;
- case 0x01: /* l.jal */
- regs->pc += displacement;
- regs->gpr[9] = regs->pc + 8;
- return;
- case 0x03: /* l.bnf */
- if (regs->sr & SPR_SR_F)
- regs->pc += 8;
- else
- regs->pc += displacement;
- return;
- case 0x04: /* l.bf */
- if (regs->sr & SPR_SR_F)
- regs->pc += displacement;
- else
- regs->pc += 8;
- return;
- case 0x11: /* l.jr */
- regs->pc = regs->gpr[rb];
- return;
- case 0x12: /* l.jalr */
- regs->pc = regs->gpr[rb];
- regs->gpr[9] = regs->pc + 8;
- return;
- default:
- break;
- }
+ simulate_branch(regs, *((unsigned int *)regs->pc), has_delay_slot());
} else {
regs->pc += 4;
}
--
2.53.0
^ permalink raw reply related
* [RFC 2/2] openrisc: Add KProbes
From: Sahil Siddiq @ 2026-04-07 18:56 UTC (permalink / raw)
To: jonas, stefan.kristiansson, shorne, naveen, davem, mhiramat
Cc: peterz, jpoimboe, jbaron, rostedt, ardb, chenmiao.ku, johannes,
nsc, masahiroy, tytso, linux-openrisc, linux-kernel,
linux-trace-kernel, Sahil Siddiq
In-Reply-To: <20260407185650.79816-1-sahilcdq0@gmail.com>
Add KProbes support for OpenRISC. This work is primarily based
on similar work done for LoongArch, MIPS and RISC-V.
KProbes make it possible to trap at almost any address in the
kernel to collect performance/debugging info.
Signed-off-by: Sahil Siddiq <sahilcdq0@gmail.com>
---
arch/openrisc/Kconfig | 1 +
arch/openrisc/configs/or1ksim_defconfig | 2 +
arch/openrisc/configs/virt_defconfig | 2 +
arch/openrisc/include/asm/asm.h | 22 ++
arch/openrisc/include/asm/break.h | 19 ++
arch/openrisc/include/asm/kprobes.h | 76 +++++
arch/openrisc/kernel/Makefile | 1 +
arch/openrisc/kernel/entry.S | 16 +
arch/openrisc/kernel/kprobes.c | 381 ++++++++++++++++++++++++
arch/openrisc/kernel/traps.c | 26 ++
arch/openrisc/lib/memcpy.c | 2 +
arch/openrisc/lib/memset.S | 4 +
arch/openrisc/mm/fault.c | 5 +
samples/kprobes/kprobe_example.c | 8 +
14 files changed, 565 insertions(+)
create mode 100644 arch/openrisc/include/asm/asm.h
create mode 100644 arch/openrisc/include/asm/break.h
create mode 100644 arch/openrisc/include/asm/kprobes.h
create mode 100644 arch/openrisc/kernel/kprobes.c
diff --git a/arch/openrisc/Kconfig b/arch/openrisc/Kconfig
index 9156635dd264..d240533b424b 100644
--- a/arch/openrisc/Kconfig
+++ b/arch/openrisc/Kconfig
@@ -27,6 +27,7 @@ config OPENRISC
select HAVE_ARCH_JUMP_LABEL
select HAVE_ARCH_JUMP_LABEL_RELATIVE
select HAVE_PCI
+ select HAVE_KPROBES
select HAVE_UID16
select HAVE_PAGE_SIZE_8KB
select HAVE_REGS_AND_STACK_ACCESS_API
diff --git a/arch/openrisc/configs/or1ksim_defconfig b/arch/openrisc/configs/or1ksim_defconfig
index 769705ac24d5..24d2915e7609 100644
--- a/arch/openrisc/configs/or1ksim_defconfig
+++ b/arch/openrisc/configs/or1ksim_defconfig
@@ -10,7 +10,9 @@ CONFIG_EXPERT=y
# CONFIG_KALLSYMS is not set
CONFIG_BUILTIN_DTB_NAME="or1ksim"
CONFIG_HZ_100=y
+CONFIG_OPENRISC=y
CONFIG_JUMP_LABEL=y
+CONFIG_KPROBES=y
CONFIG_MODULES=y
# CONFIG_BLOCK is not set
CONFIG_SLUB_TINY=y
diff --git a/arch/openrisc/configs/virt_defconfig b/arch/openrisc/configs/virt_defconfig
index 0b9979b35ca8..2eccb506032f 100644
--- a/arch/openrisc/configs/virt_defconfig
+++ b/arch/openrisc/configs/virt_defconfig
@@ -11,8 +11,10 @@ CONFIG_OPENRISC_HAVE_INST_SEXT=y
CONFIG_NR_CPUS=8
CONFIG_SMP=y
CONFIG_HZ_100=y
+CONFIG_OPENRISC=y
# CONFIG_OPENRISC_NO_SPR_SR_DSX is not set
CONFIG_JUMP_LABEL=y
+CONFIG_KPROBES=y
# CONFIG_COMPAT_BRK is not set
CONFIG_NET=y
CONFIG_PACKET=y
diff --git a/arch/openrisc/include/asm/asm.h b/arch/openrisc/include/asm/asm.h
new file mode 100644
index 000000000000..1a9c8bbb4430
--- /dev/null
+++ b/arch/openrisc/include/asm/asm.h
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Macros for OpenRISC asm
+ *
+ * Linux architectural port borrowing nearly verbatim from
+ * LoongArch and Arm. All original copyrights apply as per
+ * the original source declaration.
+ */
+
+#ifndef __ASM_ASM_H
+#define __ASM_ASM_H
+
+#ifdef CONFIG_KPROBES
+#define _ASM_NOKPROBE(symbol) \
+ .pushsection "_kprobe_blacklist", "aw"; \
+ .long symbol; \
+ .popsection
+#else
+#define _ASM_NOKPROBE(symbol)
+#endif
+
+#endif /* __ASM_ASM_H */
diff --git a/arch/openrisc/include/asm/break.h b/arch/openrisc/include/asm/break.h
new file mode 100644
index 000000000000..4bd04f4dd17a
--- /dev/null
+++ b/arch/openrisc/include/asm/break.h
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * OpenRISC trap codes used internally by the kernel
+ *
+ * Linux architectural port borrowing liberally from similar works of
+ * others. All original copyrights apply as per the original source
+ * declaration.
+ *
+ * Modifications for the OpenRISC architecture:
+ * Copyright (C) 2026 Sahil Siddiq <sahilcdq0@gmail.com>
+ */
+
+#ifndef __ASM_BREAK_H
+#define __ASM_BREAK_H
+
+#define BRK_KPROBE_BP 512 /* Kprobe break */
+#define BRK_KPROBE_SSTEPBP 1024 /* Kprobe single-step software implementation */
+
+#endif /* __ASM_BREAK_H */
diff --git a/arch/openrisc/include/asm/kprobes.h b/arch/openrisc/include/asm/kprobes.h
new file mode 100644
index 000000000000..50b6dc6d5a0c
--- /dev/null
+++ b/arch/openrisc/include/asm/kprobes.h
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * OpenRISC Linux
+ *
+ * Linux architectural port borrowing liberally from similar works of
+ * others. All original copyrights apply as per the original source
+ * declaration.
+ *
+ * OpenRISC implementation:
+ * Copyright (C) 2026 Sahil Siddiq <sahilcdq0@gmail.com>
+ */
+
+#ifndef __ASM_OPENRISC_KPROBES_H
+#define __ASM_OPENRISC_KPROBES_H
+
+#include <asm-generic/kprobes.h>
+
+#ifdef CONFIG_KPROBES
+#include <asm/break.h>
+#include <asm/cacheflush.h>
+
+#define __ARCH_WANT_KPROBES_INSN_SLOT
+
+struct pt_regs;
+struct kprobe;
+
+typedef u32 kprobe_opcode_t;
+
+/*
+ * MAX_INSN_SIZE is used as the number of slots in an executable
+ * page for single-stepping out of line (SSOL). We need two slots
+ * since we single-step using software breakpoints. The probed
+ * instruction is placed in the first slot with a breakpoint
+ * instruction in the second slot.
+ */
+#define MAX_INSN_SIZE 2
+
+/* Architecture specific copy of original instruction */
+struct arch_specific_insn {
+ /* copy of original instruction */
+ kprobe_opcode_t *insn;
+ /* address of next instruction in case of SSOL */
+ unsigned long restore;
+};
+
+struct prev_kprobe {
+ struct kprobe *kp;
+ unsigned int status;
+};
+
+/* per-cpu kprobe control block */
+struct kprobe_ctlblk {
+ unsigned int kprobe_status;
+ unsigned long irq_flags;
+ struct prev_kprobe prev_kprobe;
+};
+
+#define flush_insn_slot(p) do { } while (0)
+#define kretprobe_blacklist_size 0
+
+void arch_remove_kprobe(struct kprobe *p);
+int kprobe_fault_handler(struct pt_regs *regs, int trapnr);
+bool kprobe_breakpoint_handler(struct pt_regs *regs);
+bool kprobe_singlestep_handler(struct pt_regs *regs);
+#else /* !CONFIG_KPROBES */
+static inline bool kprobe_breakpoint_handler(struct pt_regs *regs)
+{
+ return false;
+}
+
+static inline bool kprobe_singlestep_handler(struct pt_regs *regs)
+{
+ return false;
+}
+#endif /* CONFIG_KPROBES */
+#endif /* __ASM_OPENRISC_KPROBES_H */
diff --git a/arch/openrisc/kernel/Makefile b/arch/openrisc/kernel/Makefile
index 150779fbf010..2ac824867963 100644
--- a/arch/openrisc/kernel/Makefile
+++ b/arch/openrisc/kernel/Makefile
@@ -14,6 +14,7 @@ obj-$(CONFIG_SMP) += smp.o sync-timer.o
obj-$(CONFIG_STACKTRACE) += stacktrace.o
obj-$(CONFIG_MODULES) += module.o
obj-$(CONFIG_OF) += prom.o
+obj-$(CONFIG_KPROBES) += kprobes.o
obj-y += patching.o
clean:
diff --git a/arch/openrisc/kernel/entry.S b/arch/openrisc/kernel/entry.S
index c7e90b09645e..cd28bf1f7a3b 100644
--- a/arch/openrisc/kernel/entry.S
+++ b/arch/openrisc/kernel/entry.S
@@ -15,6 +15,7 @@
#include <linux/linkage.h>
#include <linux/pgtable.h>
+#include <asm/asm.h>
#include <asm/processor.h>
#include <asm/unistd.h>
#include <asm/thread_info.h>
@@ -640,6 +641,7 @@ ENTRY(_sys_call_handler)
/* r30 is the only register we clobber in the fast path */
/* r30 already saved */
/* l.sw PT_GPR30(r1),r30 */
+_ASM_NOKPROBE(_sys_call_handler)
_syscall_check_trace_enter:
/* syscalls run with interrupts enabled */
@@ -652,12 +654,14 @@ _syscall_check_trace_enter:
l.sfne r30,r0
l.bf _syscall_trace_enter
l.nop
+_ASM_NOKPROBE(_syscall_check_trace_enter)
_syscall_check:
/* Ensure that the syscall number is reasonable */
l.sfgeui r11,__NR_syscalls
l.bf _syscall_badsys
l.nop
+_ASM_NOKPROBE(_syscall_check)
_syscall_call:
l.movhi r29,hi(sys_call_table)
@@ -668,12 +672,14 @@ _syscall_call:
l.jalr r29
l.nop
+_ASM_NOKPROBE(_syscall_call)
_syscall_return:
/* All syscalls return here... just pay attention to ret_from_fork
* which does it in a round-about way.
*/
l.sw PT_GPR11(r1),r11 // save return value
+_ASM_NOKPROBE(_syscall_return)
#if 0
_syscall_debug:
@@ -708,6 +714,7 @@ _syscall_check_trace_leave:
l.sfne r30,r0
l.bf _syscall_trace_leave
l.nop
+_ASM_NOKPROBE(_syscall_check_trace_leave)
/* This is where the exception-return code begins... interrupts need to be
* disabled the rest of the way here because we can't afford to miss any
@@ -744,6 +751,7 @@ _syscall_check_work:
/* _work_pending needs to be called with interrupts disabled */
l.j _work_pending
l.nop
+_ASM_NOKPROBE(_syscall_check_work)
_syscall_resume_userspace:
// ENABLE_INTERRUPTS(r29)
@@ -800,6 +808,7 @@ _syscall_resume_userspace:
l.mtspr r0,r13,SPR_EPCR_BASE
l.mtspr r0,r15,SPR_ESR_BASE
l.rfe
+_ASM_NOKPROBE(_syscall_resume_userspace)
/* End of hot path!
* Keep the below tracing and error handling out of the hot path...
@@ -829,6 +838,7 @@ _syscall_trace_enter:
l.j _syscall_check
l.lwz r8,PT_GPR8(r1)
+_ASM_NOKPROBE(_syscall_trace_enter)
_syscall_trace_leave:
l.jal do_syscall_trace_leave
@@ -836,6 +846,7 @@ _syscall_trace_leave:
l.j _syscall_check_work
l.nop
+_ASM_NOKPROBE(_syscall_trace_leave)
_syscall_badsys:
/* Here we effectively pretend to have executed an imaginary
@@ -845,6 +856,7 @@ _syscall_badsys:
*/
l.j _syscall_return
l.addi r11,r0,-ENOSYS
+_ASM_NOKPROBE(_syscall_badsys)
/******* END SYSCALL HANDLING *******/
@@ -870,6 +882,7 @@ EXCEPTION_ENTRY(_trap_handler)
l.j _ret_from_exception
l.nop
+_ASM_NOKPROBE(_trap_handler)
/* ---[ 0xf00: Reserved exception ]-------------------------------------- */
@@ -1004,6 +1017,8 @@ ENTRY(_ret_from_exception)
l.nop
l.j _resume_userspace
l.nop
+_ASM_NOKPROBE(_ret_from_exception)
+_ASM_NOKPROBE(_ret_from_intr)
ENTRY(ret_from_fork)
l.jal schedule_tail
@@ -1038,6 +1053,7 @@ ENTRY(ret_from_fork)
l.j _syscall_return
l.nop
+_ASM_NOKPROBE(ret_from_fork)
/* ========================================================[ switch ] === */
diff --git a/arch/openrisc/kernel/kprobes.c b/arch/openrisc/kernel/kprobes.c
new file mode 100644
index 000000000000..f9073a1cb6eb
--- /dev/null
+++ b/arch/openrisc/kernel/kprobes.c
@@ -0,0 +1,381 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Kernel probes (KProbes) for OpenRISC
+ *
+ * Linux architectural port borrowing liberally from similar works of
+ * others. All original copyrights apply as per the original source
+ * declaration.
+ *
+ * OpenRISC implementation:
+ * Copyright (C) 2026 Sahil Siddiq <sahilcdq0@gmail.com>
+ */
+
+#include <linux/kprobes.h>
+#include <asm/insn-def.h>
+#include <asm/text-patching.h>
+
+DEFINE_PER_CPU(struct kprobe *, current_kprobe);
+DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
+
+#define KPROBE_BP_INSN __emit_trap(BRK_KPROBE_BP)
+#define KPROBE_SSTEPBP_INSN __emit_trap(BRK_KPROBE_SSTEPBP)
+
+static bool insn_not_supported(union openrisc_instruction insn)
+{
+ switch (insn.word) {
+ case INSN_CSYNC:
+ case INSN_MSYNC:
+ case INSN_PSYNC:
+ return true;
+ }
+
+ if ((insn.word & 0xffff0000) == OPCODE_SYS)
+ return true;
+
+ if ((insn.word & 0xfc01ffff) == OPCODE_MACRC)
+ return true;
+
+ switch (insn.opcodes_6bit.opcode) {
+ case l_rfe:
+ case l_lwa:
+ case l_mfspr:
+ case l_mtspr:
+ case l_swa:
+ return true;
+ }
+
+ return false;
+}
+NOKPROBE_SYMBOL(insn_not_supported)
+
+static bool is_branch_insn(union openrisc_instruction insn)
+{
+ switch (insn.opcodes_6bit.opcode) {
+ case l_j:
+ case l_jal:
+ case l_bnf:
+ case l_bf:
+ case l_jr:
+ case l_jalr:
+ return true;
+ }
+
+ return false;
+}
+NOKPROBE_SYMBOL(is_branch_insn)
+
+static bool is_pc_insn(union openrisc_instruction insn)
+{
+ if (insn.opcodes_6bit.opcode == l_adrp)
+ return true;
+
+ return false;
+}
+NOKPROBE_SYMBOL(is_pc_insn)
+
+static bool insns_need_simulation(union openrisc_instruction insn, bool *exec_delay_slot)
+{
+ if (is_branch_insn(insn)) {
+ *exec_delay_slot = has_delay_slot();
+ return true;
+ }
+
+ if (is_pc_insn(insn)) {
+ *exec_delay_slot = false;
+ return true;
+ }
+
+ *exec_delay_slot = false;
+ return false;
+}
+NOKPROBE_SYMBOL(insns_need_simulation)
+
+int arch_prepare_kprobe(struct kprobe *p)
+{
+ union openrisc_instruction insn;
+ union openrisc_instruction prev_insn;
+ unsigned int cpucfgr = mfspr(SPR_CPUCFGR);
+ bool ss_delay_slot = false, exec_delay_slot = false;
+
+ /* Attempt to probe at unaligned address */
+ if ((unsigned long)p->addr & 0x3)
+ return -EILSEQ;
+
+ p->opcode = *p->addr;
+ insn.word = p->opcode;
+
+ if (insn_not_supported(insn)) {
+ pr_notice("Can't insert KProbe at blacklisted instruction.\n");
+ return -EINVAL;
+ }
+
+ if (copy_from_kernel_nofault(&prev_insn, p->addr - 1,
+ OPENRISC_INSN_SIZE) == 0 &&
+ insns_need_simulation(prev_insn, &exec_delay_slot) && exec_delay_slot) {
+ pr_notice("Can't insert KProbe in delay slot.\n");
+ return -EINVAL;
+ }
+
+ if (insns_need_simulation(insn, &exec_delay_slot) && !exec_delay_slot) {
+ p->ainsn.insn = NULL;
+ p->ainsn.restore = 0;
+ } else {
+ /*
+ * Single step probed instruction or, in case of branch instructions, single
+ * step instruction in delay slot.
+ */
+ p->ainsn.insn = get_insn_slot();
+ if (!p->ainsn.insn)
+ return -ENOMEM;
+
+ if (exec_delay_slot) {
+ patch_insn_write(p->ainsn.insn, *(p->addr + 1));
+ p->ainsn.restore = 0;
+ } else {
+ patch_insn_write(p->ainsn.insn, p->opcode);
+ p->ainsn.restore = (unsigned long)p->addr + OPENRISC_INSN_SIZE;
+ }
+ patch_insn_write(&p->ainsn.insn[1], KPROBE_SSTEPBP_INSN);
+ }
+
+ return 0;
+}
+NOKPROBE_SYMBOL(arch_prepare_kprobe);
+
+void arch_arm_kprobe(struct kprobe *p)
+{
+ patch_insn_write(p->addr, KPROBE_BP_INSN);
+ flush_insn_slot(p);
+}
+NOKPROBE_SYMBOL(arch_arm_kprobe);
+
+void arch_disarm_kprobe(struct kprobe *p)
+{
+ patch_insn_write(p->addr, p->opcode);
+ flush_insn_slot(p);
+}
+NOKPROBE_SYMBOL(arch_disarm_kprobe);
+
+void arch_remove_kprobe(struct kprobe *p)
+{
+ if (p->ainsn.insn) {
+ free_insn_slot(p->ainsn.insn, 0);
+ p->ainsn.insn = NULL;
+ }
+}
+NOKPROBE_SYMBOL(arch_remove_kprobe);
+
+static void set_current_kprobe(struct kprobe *p)
+{
+ __this_cpu_write(current_kprobe, p);
+}
+NOKPROBE_SYMBOL(set_current_kprobe);
+
+static void save_previous_kprobe(struct kprobe_ctlblk *kcb)
+{
+ kcb->prev_kprobe.kp = kprobe_running();
+ kcb->prev_kprobe.status = kcb->kprobe_status;
+}
+NOKPROBE_SYMBOL(save_previous_kprobe);
+
+static void restore_previous_kprobe(struct kprobe_ctlblk *kcb)
+{
+ kcb->kprobe_status = kcb->prev_kprobe.status;
+ set_current_kprobe(kcb->prev_kprobe.kp);
+}
+NOKPROBE_SYMBOL(restore_previous_kprobe);
+
+static void post_kprobe_handler(struct kprobe *cur, struct kprobe_ctlblk *kcb,
+ struct pt_regs *regs)
+{
+ if (cur->ainsn.restore != 0)
+ instruction_pointer_set(regs, (unsigned long)cur->ainsn.restore);
+
+ if (kcb->kprobe_status == KPROBE_REENTER) {
+ restore_previous_kprobe(kcb);
+ preempt_enable_no_resched();
+ return;
+ }
+
+ kcb->kprobe_status = KPROBE_HIT_SSDONE;
+
+ if (cur->post_handler)
+ cur->post_handler(cur, regs, 0);
+
+ reset_current_kprobe();
+ preempt_enable_no_resched();
+}
+NOKPROBE_SYMBOL(post_kprobe_handler);
+
+static void setup_singlestep(struct kprobe *p, struct pt_regs *regs,
+ struct kprobe_ctlblk *kcb, int reenter)
+{
+ union openrisc_instruction insn;
+
+ if (reenter) {
+ save_previous_kprobe(kcb);
+ set_current_kprobe(p);
+ kcb->kprobe_status = KPROBE_REENTER;
+ } else {
+ kcb->kprobe_status = KPROBE_HIT_SS;
+ }
+
+ /* Emulate instruction if required. */
+ insn.word = p->opcode;
+ if (is_branch_insn(insn)) {
+ simulate_branch(regs, insn.word, has_delay_slot());
+ /* Save target addr before updating PC to SSOL slot */
+ p->ainsn.restore = regs->pc;
+ } else if (is_pc_insn(insn)) {
+ simulate_pc(regs, insn.word);
+ /* No SSOL is required here, so call post-process immediately. */
+ post_kprobe_handler(p, kcb, regs);
+ return;
+ }
+
+ if (p->ainsn.insn) {
+ /* Disable IRQs before single stepping */
+ local_irq_save(kcb->irq_flags);
+ instruction_pointer_set(regs, (unsigned long)p->ainsn.insn);
+ } else {
+ /*
+ * The instruction is a branch but delay slots are disabled.
+ * Simply call the post-handler.
+ */
+ post_kprobe_handler(p, kcb, regs);
+ }
+}
+NOKPROBE_SYMBOL(setup_singlestep);
+
+static bool reenter_kprobe(struct kprobe *p, struct pt_regs *regs,
+ struct kprobe_ctlblk *kcb)
+{
+ switch (kcb->kprobe_status) {
+ case KPROBE_HIT_SS:
+ case KPROBE_HIT_SSDONE:
+ case KPROBE_HIT_ACTIVE:
+ kprobes_inc_nmissed_count(p);
+ setup_singlestep(p, regs, kcb, 1);
+ break;
+ case KPROBE_REENTER:
+ pr_warn("Unrecoverable KProbe detected.\n");
+ dump_kprobe(p);
+ WARN_ON_ONCE(1);
+ break;
+ default:
+ WARN_ON(1);
+ return false;
+ }
+
+ return true;
+}
+NOKPROBE_SYMBOL(reenter_kprobe);
+
+bool kprobe_breakpoint_handler(struct pt_regs *regs)
+{
+ struct kprobe *p, *curr_kprobe;
+ struct kprobe_ctlblk *kcb;
+ kprobe_opcode_t *addr = (kprobe_opcode_t *)regs->pc;
+
+ /*
+ * We don't want to be preempted for the entire
+ * duration of kprobe processing.
+ */
+ preempt_disable();
+
+ kcb = get_kprobe_ctlblk();
+ curr_kprobe = kprobe_running();
+ p = get_kprobe(addr);
+ if (p) {
+ if (curr_kprobe) {
+ if (reenter_kprobe(p, regs, kcb))
+ return true;
+ } else {
+ /* Probe hit */
+ set_current_kprobe(p);
+ kcb->kprobe_status = KPROBE_HIT_ACTIVE;
+ /*
+ * If there's no pre-handler or it returns 0, continue
+ * processing the KProbe as usual. A non-zero return
+ * value implies the saved registers have been modified
+ * and the execution path might deviate from what's
+ * expected. Reset the KProbe and return.
+ */
+ if (!p->pre_handler || !p->pre_handler(p, regs)) {
+ setup_singlestep(p, regs, kcb, 0);
+ } else {
+ reset_current_kprobe();
+ preempt_enable_no_resched();
+ }
+ return true;
+ }
+ }
+
+ /*
+ * The breakpoint instruction was removed right after we hit it
+ * possibly by another cpu. If the original instruction was also
+ * a trap instruction then return to the trap handler for further
+ * processing, else no further processing is required.
+ */
+ if ((*addr & 0xffff0000) != OPCODE_TRAP) {
+ preempt_enable_no_resched();
+ return true;
+ }
+
+ preempt_enable_no_resched();
+ return false;
+}
+NOKPROBE_SYMBOL(kprobe_breakpoint_handler);
+
+bool kprobe_singlestep_handler(struct pt_regs *regs)
+{
+ struct kprobe *cur = kprobe_running();
+ struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
+ unsigned long addr = instruction_pointer(regs);
+
+ if (cur && (kcb->kprobe_status & (KPROBE_HIT_SS | KPROBE_REENTER)) &&
+ ((unsigned long)&cur->ainsn.insn[1] == addr)) {
+ /* Single stepping has been completed. Enable IRQs. */
+ local_irq_restore(kcb->irq_flags);
+ post_kprobe_handler(cur, kcb, regs);
+ return true;
+ }
+
+ preempt_enable_no_resched();
+ return false;
+}
+NOKPROBE_SYMBOL(kprobe_singlestep_handler);
+
+int kprobe_fault_handler(struct pt_regs *regs, int trapnr)
+{
+ struct kprobe *cur = kprobe_running();
+ struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
+
+ switch (kcb->kprobe_status) {
+ case KPROBE_HIT_SS:
+ case KPROBE_REENTER:
+ /*
+ * The instruction being single-stepped caused a page fault.
+ * Reset the current KProbe and set PC to the probe's address.
+ * Then allow the page fault handler to continue as usual.
+ */
+ instruction_pointer_set(regs, (unsigned long)cur->addr);
+
+ if (kcb->kprobe_status == KPROBE_REENTER) {
+ restore_previous_kprobe(kcb);
+ } else {
+ local_irq_restore(kcb->irq_flags);
+ reset_current_kprobe();
+ preempt_enable_no_resched();
+ }
+ break;
+ }
+
+ return 0;
+}
+NOKPROBE_SYMBOL(kprobe_fault_handler);
+
+int __init arch_init_kprobes(void)
+{
+ return 0;
+}
diff --git a/arch/openrisc/kernel/traps.c b/arch/openrisc/kernel/traps.c
index ee87a3af34fc..ae6bfcca388e 100644
--- a/arch/openrisc/kernel/traps.c
+++ b/arch/openrisc/kernel/traps.c
@@ -30,10 +30,12 @@
#include <linux/kallsyms.h>
#include <linux/uaccess.h>
+#include <asm/break.h>
#include <asm/bug.h>
#include <asm/fpu.h>
#include <asm/insn-def.h>
#include <asm/io.h>
+#include <asm/kprobes.h>
#include <asm/processor.h>
#include <asm/unwinder.h>
#include <asm/sections.h>
@@ -216,10 +218,34 @@ asmlinkage void do_trap(struct pt_regs *regs, unsigned long address)
if (user_mode(regs)) {
force_sig_fault(SIGTRAP, TRAP_BRKPT, (void __user *)regs->pc);
} else {
+ unsigned int trap = *(unsigned int *)regs->pc, cpucfgr = mfspr(SPR_CPUCFGR), bcode;
+
+ /*
+ * Trap instruction was probably removed and no further processing
+ * is required.
+ */
+ if ((trap & 0xffff0000) != OPCODE_TRAP)
+ return;
+
+ bcode = (trap & 0xffff);
+ switch (bcode) {
+ case BRK_KPROBE_BP:
+ if (kprobe_breakpoint_handler(regs))
+ return;
+ break;
+ case BRK_KPROBE_SSTEPBP:
+ if (kprobe_singlestep_handler(regs))
+ return;
+ break;
+ default:
+ break;
+ }
+
pr_emerg("KERNEL: Illegal trap exception 0x%.8lx\n", regs->pc);
die("Die:", regs, SIGILL);
}
}
+NOKPROBE_SYMBOL(do_trap)
asmlinkage void do_unaligned_access(struct pt_regs *regs, unsigned long address)
{
diff --git a/arch/openrisc/lib/memcpy.c b/arch/openrisc/lib/memcpy.c
index e2af9b510804..701262a40134 100644
--- a/arch/openrisc/lib/memcpy.c
+++ b/arch/openrisc/lib/memcpy.c
@@ -16,6 +16,7 @@
#include <linux/export.h>
+#include <linux/kprobes.h>
#include <linux/string.h>
#ifdef CONFIG_OR1K_1200
@@ -122,4 +123,5 @@ void *memcpy(void *dest, __const void *src, __kernel_size_t n)
}
#endif
+NOKPROBE_SYMBOL(memcpy);
EXPORT_SYMBOL(memcpy);
diff --git a/arch/openrisc/lib/memset.S b/arch/openrisc/lib/memset.S
index c3ac2a8b68d3..14e63af12d73 100644
--- a/arch/openrisc/lib/memset.S
+++ b/arch/openrisc/lib/memset.S
@@ -9,6 +9,8 @@
* Copyright (C) 2015 Olof Kindgren <olof.kindgren@gmail.com>
*/
+#include <asm/asm.h>
+
.global memset
.type memset, @function
memset:
@@ -92,3 +94,5 @@ memset:
4: l.jr r9
l.ori r11, r3, 0
+
+_ASM_NOKPROBE(memset)
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 29e232d78d82..5263a832562f 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -14,6 +14,7 @@
#include <linux/mm.h>
#include <linux/interrupt.h>
#include <linux/extable.h>
+#include <linux/kprobes.h>
#include <linux/sched/signal.h>
#include <linux/perf_event.h>
@@ -55,6 +56,9 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
tsk = current;
+ if (kprobe_page_fault(regs, vector))
+ return;
+
/*
* We fault-in kernel-space virtual memory on-demand. The
* 'reference' page table is init_mm.pgd.
@@ -351,3 +355,4 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
return;
}
}
+NOKPROBE_SYMBOL(do_page_fault)
diff --git a/samples/kprobes/kprobe_example.c b/samples/kprobes/kprobe_example.c
index 53ec6c8b8c40..84e26ebef70b 100644
--- a/samples/kprobes/kprobe_example.c
+++ b/samples/kprobes/kprobe_example.c
@@ -59,6 +59,10 @@ static int __kprobes handler_pre(struct kprobe *p, struct pt_regs *regs)
pr_info("<%s> p->addr = 0x%p, era = 0x%lx, estat = 0x%lx\n",
p->symbol_name, p->addr, regs->csr_era, regs->csr_estat);
#endif
+#ifdef CONFIG_OPENRISC
+ pr_info("<%s> p->addr = 0x%p, pc = 0x%lx, status = 0x%lx\n",
+ p->symbol_name, p->addr, regs->pc, regs->sr);
+#endif
/* A dump_stack() here will give a stack backtrace */
return 0;
@@ -100,6 +104,10 @@ static void __kprobes handler_post(struct kprobe *p, struct pt_regs *regs,
pr_info("<%s> p->addr = 0x%p, estat = 0x%lx\n",
p->symbol_name, p->addr, regs->csr_estat);
#endif
+#ifdef CONFIG_OPENRISC
+ pr_info("<%s> p->addr = 0x%p, status = 0x%lx\n",
+ p->symbol_name, p->addr, regs->sr);
+#endif
}
static int __init kprobe_init(void)
--
2.53.0
^ permalink raw reply related
* Re: [PATCH bpf-next v6 1/2] tracing: Prefer vmlinux symbols over module symbols for unqualified kprobes
From: Alexei Starovoitov @ 2026-04-07 19:34 UTC (permalink / raw)
To: Andrey Grodzovsky
Cc: bpf, linux-trace-kernel, Jiri Olsa, Ihor Solodrai,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Steven Rostedt, Masami Hiramatsu, Emil Tsalapatis,
DL Linux Open Source Team
In-Reply-To: <20260407165145.1651061-2-andrey.grodzovsky@crowdstrike.com>
On Tue, Apr 7, 2026 at 9:52 AM Andrey Grodzovsky
<andrey.grodzovsky@crowdstrike.com> wrote:
>
> When an unqualified kprobe target exists in both vmlinux and a loaded
> module, number_of_same_symbols() returns a count greater than 1,
> causing kprobe attachment to fail with -EADDRNOTAVAIL even though the
> vmlinux symbol is unambiguous.
>
> When no module qualifier is given and the symbol is found in vmlinux,
> return the vmlinux-only count without scanning loaded modules. This
> preserves the existing behavior for all other cases:
> - Symbol only in a module: vmlinux count is 0, falls through to module
> scan as before.
> - Symbol qualified with MOD:SYM: mod != NULL, unchanged path.
> - Symbol ambiguous within vmlinux itself: count > 1 is returned as-is.
>
> Fixes: 926fe783c8a6 ("tracing/kprobes: Fix symbol counting logic by looking at modules as well")
> Fixes: 9d8616034f16 ("tracing/kprobes: Add symbol counting check when module loads")
> Suggested-by: Ihor Solodrai <ihor.solodrai@linux.dev>
> Acked-by: Jiri Olsa <jolsa@kernel.org>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com>
> ---
> kernel/trace/trace_kprobe.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index a5dbb72528e0..99c41ea8b6d7 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -765,6 +765,13 @@ static unsigned int number_of_same_symbols(const char *mod, const char *func_nam
> if (!mod)
> kallsyms_on_each_match_symbol(count_symbols, func_name, &ctx.count);
>
> + /* If the symbol is found in vmlinux, use vmlinux resolution only.
> + * This prevents module symbols from shadowing vmlinux symbols
> + * and causing -EADDRNOTAVAIL for unqualified kprobe targets.
> + */
Please do not use networking style comments in the new code.
Same applies to selftests.
pw-bot: cr
^ permalink raw reply
* [PATCH bpf-next v7 1/2] tracing: Prefer vmlinux symbols over module symbols for unqualified kprobes
From: Andrey Grodzovsky @ 2026-04-07 20:39 UTC (permalink / raw)
To: bpf, linux-trace-kernel
Cc: ast, daniel, andrii, jolsa, rostedt, mhiramat, ihor.solodrai,
emil, linux-open-source
In-Reply-To: <20260407203912.1787502-1-andrey.grodzovsky@crowdstrike.com>
When an unqualified kprobe target exists in both vmlinux and a loaded
module, number_of_same_symbols() returns a count greater than 1,
causing kprobe attachment to fail with -EADDRNOTAVAIL even though the
vmlinux symbol is unambiguous.
When no module qualifier is given and the symbol is found in vmlinux,
return the vmlinux-only count without scanning loaded modules. This
preserves the existing behavior for all other cases:
- Symbol only in a module: vmlinux count is 0, falls through to module
scan as before.
- Symbol qualified with MOD:SYM: mod != NULL, unchanged path.
- Symbol ambiguous within vmlinux itself: count > 1 is returned as-is.
Fixes: 926fe783c8a6 ("tracing/kprobes: Fix symbol counting logic by looking at modules as well")
Fixes: 9d8616034f16 ("tracing/kprobes: Add symbol counting check when module loads")
Suggested-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com>
---
kernel/trace/trace_kprobe.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index a5dbb72528e0..058724c41c46 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -765,6 +765,14 @@ static unsigned int number_of_same_symbols(const char *mod, const char *func_nam
if (!mod)
kallsyms_on_each_match_symbol(count_symbols, func_name, &ctx.count);
+ /*
+ * If the symbol is found in vmlinux, use vmlinux resolution only.
+ * This prevents module symbols from shadowing vmlinux symbols
+ * and causing -EADDRNOTAVAIL for unqualified kprobe targets.
+ */
+ if (!mod && ctx.count > 0)
+ return ctx.count;
+
module_kallsyms_on_each_symbol(mod, count_mod_symbols, &ctx);
return ctx.count;
--
2.34.1
^ permalink raw reply related
* [PATCH bpf-next v7 0/2] tracing: Fix kprobe attachment when module shadows vmlinux symbol
From: Andrey Grodzovsky @ 2026-04-07 20:39 UTC (permalink / raw)
To: bpf, linux-trace-kernel
Cc: ast, daniel, andrii, jolsa, rostedt, mhiramat, ihor.solodrai,
emil, linux-open-source
When a kernel module exports a symbol with the same name as an existing
vmlinux symbol, kprobe attachment fails with -EADDRNOTAVAIL because
number_of_same_symbols() counts matches across both vmlinux and all
loaded modules, returning a count greater than 1.
This series takes a different approach from v1-v4, which implemented a
libbpf-side fallback parsing /proc/kallsyms and retrying with the
absolute address. That approach was rejected (Andrii Nakryiko, Ihor
Solodrai) because ambiguous symbol resolution does not belong in libbpf.
Following Ihor's suggestion, this series fixes the root cause in the
kernel: when an unqualified symbol name is given and the symbol is found
in vmlinux, prefer the vmlinux symbol and do not scan loaded modules.
This makes the skeleton auto-attach path work transparently with no
libbpf changes needed.
Patch 1: Kernel fix - return vmlinux-only count from
number_of_same_symbols() when the symbol is found in vmlinux,
preventing module shadows from causing -EADDRNOTAVAIL.
Patch 2: Selftests using bpf_fentry_shadow_test which exists in both
vmlinux and bpf_testmod - tests unqualified (vmlinux) and
MOD:SYM (module) attachment across all four attach modes, plus
kprobe_multi with the duplicate symbol.
Changes since v6 [1]:
- Fix comment style: use /* on its own line instead of networking-style
/* text on opener line (Alexei Starovoitov).
[1] https://lore.kernel.org/bpf/20260407165145.1651061-1-andrey.grodzovsky@crowdstrike.com/
Andrey Grodzovsky (2):
tracing: Prefer vmlinux symbols over module symbols for unqualified
kprobes
selftests/bpf: Add tests for kprobe attachment with duplicate symbols
kernel/trace/trace_kprobe.c | 8 +++
.../selftests/bpf/prog_tests/attach_probe.c | 69 +++++++++++++++++++
.../bpf/prog_tests/kprobe_multi_test.c | 40 +++++++++++
3 files changed, 117 insertions(+)
--
2.34.1
^ permalink raw reply
* [PATCH bpf-next v7 2/2] selftests/bpf: Add tests for kprobe attachment with duplicate symbols
From: Andrey Grodzovsky @ 2026-04-07 20:39 UTC (permalink / raw)
To: bpf, linux-trace-kernel
Cc: ast, daniel, andrii, jolsa, rostedt, mhiramat, ihor.solodrai,
emil, linux-open-source
In-Reply-To: <20260407203912.1787502-1-andrey.grodzovsky@crowdstrike.com>
bpf_fentry_shadow_test exists in both vmlinux (net/bpf/test_run.c) and
bpf_testmod (bpf_testmod.c), creating a duplicate symbol condition when
bpf_testmod is loaded. Add subtests that verify kprobe behavior with
this duplicate symbol:
In attach_probe:
- dup-sym-{default,legacy,perf,link}: unqualified attach succeeds
across all four modes, preferring vmlinux over module shadow.
- MOD:SYM qualification attaches to the module version.
In kprobe_multi_test:
- dup_sym: kprobe_multi attach with kprobe and kretprobe succeeds.
bpf_fentry_shadow_test is not invoked via test_run, so tests verify
attach and detach succeed without triggering the probe.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com>
---
.../selftests/bpf/prog_tests/attach_probe.c | 69 +++++++++++++++++++
.../bpf/prog_tests/kprobe_multi_test.c | 40 +++++++++++
2 files changed, 109 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/attach_probe.c b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
index 12a841afda68..e8c1a619e330 100644
--- a/tools/testing/selftests/bpf/prog_tests/attach_probe.c
+++ b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
@@ -197,6 +197,66 @@ static void test_attach_kprobe_legacy_by_addr_reject(void)
test_attach_probe_manual__destroy(skel);
}
+/*
+ * bpf_fentry_shadow_test exists in both vmlinux (net/bpf/test_run.c) and
+ * bpf_testmod (bpf_testmod.c). When bpf_testmod is loaded the symbol is
+ * duplicated. Test that kprobe attachment handles this correctly:
+ * - Unqualified name ("bpf_fentry_shadow_test") attaches to vmlinux.
+ * - MOD:SYM name ("bpf_testmod:bpf_fentry_shadow_test") attaches to module.
+ *
+ * Note: bpf_fentry_shadow_test is not invoked via test_run, so we only
+ * verify that attach and detach succeed without triggering the probe.
+ */
+static void test_attach_probe_dup_sym(enum probe_attach_mode attach_mode)
+{
+ DECLARE_LIBBPF_OPTS(bpf_kprobe_opts, kprobe_opts);
+ struct bpf_link *kprobe_link, *kretprobe_link;
+ struct test_attach_probe_manual *skel;
+
+ skel = test_attach_probe_manual__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "skel_dup_sym_open_and_load"))
+ return;
+
+ kprobe_opts.attach_mode = attach_mode;
+
+ /* Unqualified: should attach to vmlinux symbol */
+ kprobe_opts.retprobe = false;
+ kprobe_link = bpf_program__attach_kprobe_opts(skel->progs.handle_kprobe,
+ "bpf_fentry_shadow_test",
+ &kprobe_opts);
+ if (!ASSERT_OK_PTR(kprobe_link, "attach_kprobe_vmlinux"))
+ goto cleanup;
+ bpf_link__destroy(kprobe_link);
+
+ kprobe_opts.retprobe = true;
+ kretprobe_link = bpf_program__attach_kprobe_opts(skel->progs.handle_kretprobe,
+ "bpf_fentry_shadow_test",
+ &kprobe_opts);
+ if (!ASSERT_OK_PTR(kretprobe_link, "attach_kretprobe_vmlinux"))
+ goto cleanup;
+ bpf_link__destroy(kretprobe_link);
+
+ /* MOD:SYM qualified: should attach to module symbol */
+ kprobe_opts.retprobe = false;
+ kprobe_link = bpf_program__attach_kprobe_opts(skel->progs.handle_kprobe,
+ "bpf_testmod:bpf_fentry_shadow_test",
+ &kprobe_opts);
+ if (!ASSERT_OK_PTR(kprobe_link, "attach_kprobe_module"))
+ goto cleanup;
+ bpf_link__destroy(kprobe_link);
+
+ kprobe_opts.retprobe = true;
+ kretprobe_link = bpf_program__attach_kprobe_opts(skel->progs.handle_kretprobe,
+ "bpf_testmod:bpf_fentry_shadow_test",
+ &kprobe_opts);
+ if (!ASSERT_OK_PTR(kretprobe_link, "attach_kretprobe_module"))
+ goto cleanup;
+ bpf_link__destroy(kretprobe_link);
+
+cleanup:
+ test_attach_probe_manual__destroy(skel);
+}
+
/* attach uprobe/uretprobe long event name testings */
static void test_attach_uprobe_long_event_name(void)
{
@@ -559,6 +619,15 @@ void test_attach_probe(void)
if (test__start_subtest("kprobe-legacy-by-addr-reject"))
test_attach_kprobe_legacy_by_addr_reject();
+ if (test__start_subtest("dup-sym-default"))
+ test_attach_probe_dup_sym(PROBE_ATTACH_MODE_DEFAULT);
+ if (test__start_subtest("dup-sym-legacy"))
+ test_attach_probe_dup_sym(PROBE_ATTACH_MODE_LEGACY);
+ if (test__start_subtest("dup-sym-perf"))
+ test_attach_probe_dup_sym(PROBE_ATTACH_MODE_PERF);
+ if (test__start_subtest("dup-sym-link"))
+ test_attach_probe_dup_sym(PROBE_ATTACH_MODE_LINK);
+
if (test__start_subtest("auto"))
test_attach_probe_auto(skel);
if (test__start_subtest("kprobe-sleepable"))
diff --git a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
index 78c974d4ea33..20ddff6812e9 100644
--- a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
+++ b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
@@ -633,6 +633,44 @@ static void test_attach_write_ctx(void)
}
#endif
+/*
+ * Test kprobe_multi handles shadow symbols (vmlinux + module duplicate).
+ * bpf_fentry_shadow_test exists in both vmlinux and bpf_testmod.
+ * kprobe_multi resolves via ftrace_lookup_symbols() which finds the
+ * vmlinux symbol first and stops, so this should always succeed.
+ */
+static void test_attach_probe_dup_sym(void)
+{
+ LIBBPF_OPTS(bpf_kprobe_multi_opts, opts);
+ const char *syms[1] = { "bpf_fentry_shadow_test" };
+ struct kprobe_multi *skel = NULL;
+ struct bpf_link *link1 = NULL, *link2 = NULL;
+
+ skel = kprobe_multi__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "kprobe_multi__open_and_load"))
+ goto cleanup;
+
+ skel->bss->pid = getpid();
+ opts.syms = syms;
+ opts.cnt = ARRAY_SIZE(syms);
+
+ link1 = bpf_program__attach_kprobe_multi_opts(skel->progs.test_kprobe_manual,
+ NULL, &opts);
+ if (!ASSERT_OK_PTR(link1, "attach_kprobe_multi_dup_sym"))
+ goto cleanup;
+
+ opts.retprobe = true;
+ link2 = bpf_program__attach_kprobe_multi_opts(skel->progs.test_kretprobe_manual,
+ NULL, &opts);
+ if (!ASSERT_OK_PTR(link2, "attach_kretprobe_multi_dup_sym"))
+ goto cleanup;
+
+cleanup:
+ bpf_link__destroy(link2);
+ bpf_link__destroy(link1);
+ kprobe_multi__destroy(skel);
+}
+
void serial_test_kprobe_multi_bench_attach(void)
{
if (test__start_subtest("kernel"))
@@ -676,5 +714,7 @@ void test_kprobe_multi_test(void)
test_unique_match();
if (test__start_subtest("attach_write_ctx"))
test_attach_write_ctx();
+ if (test__start_subtest("dup_sym"))
+ test_attach_probe_dup_sym();
RUN_TESTS(kprobe_multi_verifier);
}
--
2.34.1
^ permalink raw reply related
* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Michael Roth @ 2026-04-07 21:09 UTC (permalink / raw)
To: Ackerley Tng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm
In-Reply-To: <CAEvNRgEtigp7+PVDkyu_DH947CUqDt312d+P+hWjjd2fHONiag@mail.gmail.com>
On Fri, Apr 03, 2026 at 07:50:16AM -0700, Ackerley Tng wrote:
> Ackerley
> dddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnng <ackerleytng@google.com> writes:
>
> >
> > [...snip...]
> >
> > guest_memfd's populate will first check that the memory is shared, then
> > also set the memory to private after the populate.
> >
> > [...snip...]
> >
> Looking at this again, the above basically means that the entire
> conversion process needs to be performed within populate.
>
> In addition to setting the attributes in guest_memfd as private, for
> consistency, populate will also have to do all the associated
> operations, especially unmapping from the host, checking refcounts,
> and the list of work in conversion will only increase in future with
> direct map removal/restoration and huge page merging.
>
> The complexity of conversion also means possible errors (EAGAIN for
> elevated refcounts and ENOMEM when we're out of memory), and error
> information like the offset where the elevated refcount was.
>
> It doesn't look like there's room for this information to be plumbed out
> through the platform-specific ioctls, and even if we make space, it
> seems odd to have conversion-related error information returned through
> the platform-specific call.
>
>
> I agree with the goal of not having KVM touch private memory contents as
> tracked by guest_memfd, so I'd like to propose that we distinguish:
>
> 1. private as tracked by KVM (guest_memfd/vm_memory_attributes)
> 2. private as tracked by trusted entity
I think this is a good distinction to keep in mind, because if we adopt
the proposal from the call of having userspace set the destination memory
to shared prior to calling kvm_gmem_populate(), then they don't really
stay shared until gmem convert them to private: instead, they get set to
"private as tracked by trusted entity", but at the same time still have
'shared' memory attributes as far as KVM is concerned. Normally (SNP,
at least), the 'private (as tracked by KVM)' state is an intermediate state
on the way to 'private (as tracked by KVM + trusted entity)'.
So we introduce some inconsistencies on that side, in order to address
the inconsistency of kvm_gmem_populate() writing to 'private (as tracked
by KVM)' memory. But as you point out...
> + destination address: private (as tracked by guest_memfd)
> + source address: shared (as tracked by guest_memfd) or NULL
>
> KVM doesn't touch private memory contents, even in this case, because
> it's really a platform-specific ioctl that handles loading, and the
> platform does expect the destination is private for both TDX and SNP
> at the firmware boundary.
...yah, it's not really gmem that's writing to that memory, it's the
platform-specific hooks that 'prepare' the memory as part of population
and puts that in a 'private (as tracked by trusted entity)' state, just as
it's the platform-specific hooks that 'prepare' the memory as part of vCPU
page fault path at run-time and put them into a private (as tracked by
trusted entity). You could even imagine a naive CoCo implementation that
encrypts memory in-place at initial fault time via kvm_gmem_prepare()
hooks... we likely wouldn't insist on some new flow because this results
in gmem calling something that writes to 'private (as tracked by KVM)'
pages and would consider that to be more of a platform-specific
implementation detail that should be handled the same as other
architectures. And that seems like it would be roughly analogous to what
is being discussed here WRT the kvm_gmem_populate() path, so I think it
makes sense to continue to expecting the pages to be marked private in
advance of platform-specific preparation, whether that be via the
populate path or the runtime/fault-time path.
And for recent KVM,
; all the things we exp
about how the callbacks
As far as the copying goes,
By expecting 'private' (as tracked by KVM) as the initial state for
kvm_gmem_populate(), a lot of invariants about private memory (safe
refcount, directmap removal expectations, etc.) remain consistent even
in the populate path, where any special handling for private memory can be
accounted for in the same way rather than "shared, but..." or "private,
but...".
>
> Since SNP (platform-specific) only allows in-place launch update, and
> KVM had to provide an interface that allows a different source address
> for support before in-place conversion, then SNP has to continue
> supporting the to-be-deprecated path by doing the copying within
> platform-specific code.
>
> For consistency, guest_memfd can continue to check that it tracks the
> destination address as private, and sev_gmem_populate will then hide
> the copying code away just to support the legacy case.
>
>
> The flow before in-place conversion is
>
> 1. Load memory (shared or non-guest_memfd memory)
> 2. KVM_SEV_SNP_LAUNCH_UPDATE or KVM_TDX_INIT_MEM_REGION (destination:
> gfn for separate private memory, source: shared memory)
>
> The proposed flow for in-place conversion is
>
> 1. INIT_SHARED or convert to shared
> 2. Load memory while it is shared
> 3. Convert to private (PRESERVE, or new flag?)
> 4. KVM_SEV_SNP_LAUNCH_UPDATE or KVM_TDX_INIT_MEM_REGION (destination:
> gfn for converted private memory, source: NULL)
>
>
> TLDR:
>
> + Think of populate ioctls not as KVM touching memory, but platform
> handling population.
> + KVM code (kvm_gmem_populate) still doesn't touch memory contents
> + post_populate is platform-specific code that handles loading into
> private destination memory just to support legacy non-in-place
> conversion.
> + Don't complicate populate ioctls by doing conversion just to support
> legacy use-cases where platform-specific code has to do copying on
> the host.
That's a good point: these are only considerations in the context of
actually copying from src->dst, but with in-place conversion the
primary/more-performant approach will be for userspace to initial
directly. I.e. if we enforced that, then gmem could right ascertain that
it isn't even writing to private pages via these hooks and any
manipulation of that memory is purely on the part of the trusted entity
handling initial encryption/etc.
I understand that we decided to keep the option of allowing separate
src/dst even with in-place conversion, but it doesn't seem worthwhile if
that necessarily means we need to glue population+conversion together in
1 clumsy interface that needs to handle partial return/error responses to
userspace (or potentially get stuck forever in the conversion path).
So I agree with Ackerley's proposal (which I guess is the same as what's
in this series).
However, 1 other alternative would be to do what was suggested on the
call, but require userspace to subsequently handle the shared->private
conversion. I think that would be workable too.
One other benefit to Ackerley's/current approach however is that it allows
us to potentially keep hugepages intact in the populate path, since
prep'ing/encrypting everything while it's in a shared state means gmem will
split the hugepage and all the firmware/RMP/etc. data structures will only
be able to handle individual 4K pages. I still suspect doing things like
encoding the initial 2MB OVMF image as a single hugepage might yield
enough benefit to explore this (at some point). So there's some niceness
in knowing that Ackerley's approach would allow for that eventually and
not require a complete rethink on these same topics.
Thanks,
Mike
>
> >>>
> >>> [...snip...]
> >>>
^ permalink raw reply
* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Vishal Annapurve @ 2026-04-07 21:50 UTC (permalink / raw)
To: Michael Roth
Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta,
qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm
In-Reply-To: <yvfwexsub7nrogh67hzcsupbrkzer6a7kbeao5tlq4elrzc2iz@xrwdjd7p32pp>
On Tue, Apr 7, 2026 at 2:09 PM Michael Roth <michael.roth@amd.com> wrote:
>
> > TLDR:
> >
> > + Think of populate ioctls not as KVM touching memory, but platform
> > handling population.
> > + KVM code (kvm_gmem_populate) still doesn't touch memory contents
> > + post_populate is platform-specific code that handles loading into
> > private destination memory just to support legacy non-in-place
> > conversion.
> > + Don't complicate populate ioctls by doing conversion just to support
> > legacy use-cases where platform-specific code has to do copying on
> > the host.
>
> That's a good point: these are only considerations in the context of
> actually copying from src->dst, but with in-place conversion the
> primary/more-performant approach will be for userspace to initial
> directly. I.e. if we enforced that, then gmem could right ascertain that
> it isn't even writing to private pages via these hooks and any
> manipulation of that memory is purely on the part of the trusted entity
> handling initial encryption/etc.
>
> I understand that we decided to keep the option of allowing separate
> src/dst even with in-place conversion, but it doesn't seem worthwhile if
> that necessarily means we need to glue population+conversion together in
> 1 clumsy interface that needs to handle partial return/error responses to
> userspace (or potentially get stuck forever in the conversion path).
I think ARM needs userspace to specify separate source and destination
memory ranges for initial population as ARM doesn't support in-place
memory encryption. [1]
[1] https://lore.kernel.org/kvm/20260318155413.793430-25-steven.price@arm.com/
>
> So I agree with Ackerley's proposal (which I guess is the same as what's
> in this series).
>
> However, 1 other alternative would be to do what was suggested on the
> call, but require userspace to subsequently handle the shared->private
> conversion. I think that would be workable too.
IIUC, Converting memory ranges to private after it essentially is
treated as private by the KVM CC backend will expose the
implementation to the same risk of userspace being able to access
private memory and compromise host safety which guest_memfd was
invented to address.
>
> One other benefit to Ackerley's/current approach however is that it allows
> us to potentially keep hugepages intact in the populate path, since
> prep'ing/encrypting everything while it's in a shared state means gmem will
> split the hugepage and all the firmware/RMP/etc. data structures will only
> be able to handle individual 4K pages. I still suspect doing things like
> encoding the initial 2MB OVMF image as a single hugepage might yield
> enough benefit to explore this (at some point). So there's some niceness
> in knowing that Ackerley's approach would allow for that eventually and
> not require a complete rethink on these same topics.
>
> Thanks,
>
> Mike
>
> >
> > >>>
> > >>> [...snip...]
> > >>>
^ permalink raw reply
* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Michael Roth @ 2026-04-07 22:09 UTC (permalink / raw)
To: Vishal Annapurve
Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta,
qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm
In-Reply-To: <CAGtprH-kgRByFvvCYeWMXtsvpb6qpaWAo8k-3PEnioyPg-LEvA@mail.gmail.com>
On Tue, Apr 07, 2026 at 02:50:58PM -0700, Vishal Annapurve wrote:
> On Tue, Apr 7, 2026 at 2:09 PM Michael Roth <michael.roth@amd.com> wrote:
> >
> > > TLDR:
> > >
> > > + Think of populate ioctls not as KVM touching memory, but platform
> > > handling population.
> > > + KVM code (kvm_gmem_populate) still doesn't touch memory contents
> > > + post_populate is platform-specific code that handles loading into
> > > private destination memory just to support legacy non-in-place
> > > conversion.
> > > + Don't complicate populate ioctls by doing conversion just to support
> > > legacy use-cases where platform-specific code has to do copying on
> > > the host.
> >
> > That's a good point: these are only considerations in the context of
> > actually copying from src->dst, but with in-place conversion the
> > primary/more-performant approach will be for userspace to initial
> > directly. I.e. if we enforced that, then gmem could right ascertain that
> > it isn't even writing to private pages via these hooks and any
> > manipulation of that memory is purely on the part of the trusted entity
> > handling initial encryption/etc.
> >
> > I understand that we decided to keep the option of allowing separate
> > src/dst even with in-place conversion, but it doesn't seem worthwhile if
> > that necessarily means we need to glue population+conversion together in
> > 1 clumsy interface that needs to handle partial return/error responses to
> > userspace (or potentially get stuck forever in the conversion path).
>
> I think ARM needs userspace to specify separate source and destination
> memory ranges for initial population as ARM doesn't support in-place
> memory encryption. [1]
>
> [1] https://lore.kernel.org/kvm/20260318155413.793430-25-steven.price@arm.com/
>
> >
> > So I agree with Ackerley's proposal (which I guess is the same as what's
> > in this series).
> >
> > However, 1 other alternative would be to do what was suggested on the
> > call, but require userspace to subsequently handle the shared->private
> > conversion. I think that would be workable too.
>
> IIUC, Converting memory ranges to private after it essentially is
> treated as private by the KVM CC backend will expose the
> implementation to the same risk of userspace being able to access
> private memory and compromise host safety which guest_memfd was
> invented to address.
Doh, fair point. Doing conversion as part of the populate call would allow
us to use the filemap write-lock to avoid userspace being able to fault
in private (as tracked by trusted entity) pages before they are
transitioned to private (as tracked by KVM), so it's safer than having
userspace drive it.
But obviously I still think Ackerley's original proposal has more
upsides than the alternatives mentioned so far.
-Mike
>
> >
> > One other benefit to Ackerley's/current approach however is that it allows
> > us to potentially keep hugepages intact in the populate path, since
> > prep'ing/encrypting everything while it's in a shared state means gmem will
> > split the hugepage and all the firmware/RMP/etc. data structures will only
> > be able to handle individual 4K pages. I still suspect doing things like
> > encoding the initial 2MB OVMF image as a single hugepage might yield
> > enough benefit to explore this (at some point). So there's some niceness
> > in knowing that Ackerley's approach would allow for that eventually and
> > not require a complete rethink on these same topics.
> >
> > Thanks,
> >
> > Mike
> >
> > >
> > > >>>
> > > >>> [...snip...]
> > > >>>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox