* Re: [PATCH v8] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-06-01 17:31 UTC
To: LKML, Linux Trace Kernel
Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland, Peter Zijlstra,
Namhyung Kim, Takaya Saeki, Douglas Raillard, Tom Zanussi,
Andrew Morton, Thomas Gleixner, Ian Rogers, Jiri Olsa
In-Reply-To: <20260601130746.2139d926@gandalf.local.home>
On Mon, 1 Jun 2026 13:07:46 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> Changes since v7: https://patch.msgid.link/20260529110442.0967a64c@fedora
>
> - Add error message in parse_btf_args() for failed parsing of TEVENT.
> (Sashiko)
>
> - Remove TPARG_FL_TYPECAST and just use ctx->struct_btf instead.
> The flag was redundant and added unnecessary complexity.
>
> - Restructure to keep the lifetime of the TYPECAST to the end of
> traceprobe_parse_probe_arg_body(). This allows the last_type to stay
> around in case there's not a type parameter and then btf can still be
> used.
> (Sashiko and Masami Hiramatsu)
And I rebased onto probes/for-next
-- Steve
* Re: [PATCH 0/4] bootconfig: embed kernel.* cmdline at build time
From: Breno Leitao @ 2026-06-01 17:56 UTC
To: Masami Hiramatsu
Cc: Andrew Morton, Nathan Chancellor, paulmck, Nicolas Schier,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
bpf, kernel-team
In-Reply-To: <20260529001519.14ca9dbe92fb2622249137c6@kernel.org>
On Fri, May 29, 2026 at 12:15:19AM +0900, Masami Hiramatsu wrote:
> On Wed, 27 May 2026 09:41:33 -0700
> Breno Leitao <leitao@debian.org> wrote:
>
> > The userspace pieces (xbc_snprint_cmdline() in lib/, tools/bootconfig -C)
> > already landed; this series wires the rendered cmdline into the kernel.
> >
> > Motivation: today the embedded bootconfig is parsed at runtime, after
> > parse_early_param() has already run, so early_param() handlers can't
> > see embedded values. Folding the kernel.* subtree into the cmdline at
> > build time gives a CONFIG_CMDLINE-equivalent for embedded-bootconfig
> > users without forcing them to maintain two cmdline sources.
> >
> > Behaviorally, the "kernel" subtree is rendered to a flat string at
> > build time and stashed in .init.rodata. setup_arch() prepends it to
> > boot_command_line before parse_early_param() runs. Overflow is a soft
> > error: the helper logs and leaves boot_command_line untouched rather
> > than panicking, so an oversized embedded bconf cannot brick a boot.
> >
>
> Thanks Breno, yes, that is what I think about.
> Let me check it. And could you also check Sashiko's comments?
Yes, I've spent some time on them, and they raised some good points, in
fact. I will fix those and resend.
Thanks!
--breno
* Re: [PATCH v2] unwind: Add sframe_(un)register() system calls
From: Andrii Nakryiko @ 2026-06-01 17:57 UTC
To: Steven Rostedt
Cc: LKML, Linux Trace Kernel, bpf, Masami Hiramatsu,
Mathieu Desnoyers, Jens Remus, Josh Poimboeuf, Peter Zijlstra,
Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Heiko Carstens,
Vasily Gorbik, Thomas Weißschuh
In-Reply-To: <20260528222051.60b38433@fedora>
On Thu, May 28, 2026 at 7:20 PM Steven Rostedt <rostedt@kernel.org> wrote:
>
> On Thu, 28 May 2026 16:01:06 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> >
> > [...]
> >
> > > * Architecture-specific system calls
> > > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > > index a627acc8fb5f..17042d7e5e87 100644
> > > --- a/include/uapi/asm-generic/unistd.h
> > > +++ b/include/uapi/asm-generic/unistd.h
> > > @@ -863,8 +863,13 @@ __SYSCALL(__NR_listns, sys_listns)
> > > #define __NR_rseq_slice_yield 471
> > > __SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
> > >
> > > +#define __NR_sframe_register 472
> > > +__SYSCALL(__NR_sframe_register, sys_sframe_register)
> > > +#define __NR_sframe_unregister 473
> > > +__SYSCALL(__NR_sframe_unregister, sys_sframe_unregister)
> > > +
> > > #undef __NR_syscalls
> > > -#define __NR_syscalls 472
> > > +#define __NR_syscalls 474
> > >
> > > /*
> > > * 32 bit systems traditionally used different
> > > diff --git a/include/uapi/linux/sframe.h b/include/uapi/linux/sframe.h
> > > new file mode 100644
> > > index 000000000000..d3c9f88b024b
> > > --- /dev/null
> > > +++ b/include/uapi/linux/sframe.h
> > > @@ -0,0 +1,12 @@
> > > +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> > > +#ifndef _UAPI_LINUX_SFRAME_H
> > > +#define _UAPI_LINUX_SFRAME_H
> > > +
> > > +struct sframe_setup {
> >
> > I'd add `u64 flags;` field for easier and nicer extensibility. Check
> > in the kernel that it is set to zero, future kernels will allow some
> > of the bits to be set.
>
> That sounds reasonable.
>
> >
> > And I still think that prctl() instead of a separate sframe-specific
> > syscall is the way to go. I see no reason for sframe-specific set of
> > syscalls just to set a bit of extra metadata for the entire process.
> > That seems to be the job of prctl().
>
> I personally do not have a preference. I've just heard a lot from
> others where they want to avoid extending an ioctl() like system call
> or even create a new multiplexer syscall.
>
> If we can get a consensus of using prctl() or adding a separate system
> call, I'll go with whatever that is.
prctl() is an already existing multiplexing syscall used to provide
some per-process (or sometimes per-thread, it seems) hints and
options. Please consider sending a prctl() extension and CC me, and
let's see what arguments people have against extending an already
existing syscall.
>
> >
> > > + __u64 sframe_start;
> > > + __u64 sframe_size;
> > > + __u64 text_start;
> > > + __u64 text_size;
> > > +};
> > > +
> >
> > [...]
> >
> > > +
> > > +/**
> > > + * sys_sframe_register - register an address for user space stacktrace walking.
> > > + * @data: Structure of sframe data used to register the sframe section
> > > + * @size: The size of the given structure.
> > > + *
> > > + * This system call is used by dynamic library utilities to inform the kernel
> > > + * of meta data that it loaded that can be used by the kernel to know how
> > > + * to stack walk the given text locations.
> > > + *
> > > + * Return: 0 if successful, otherwise a negative error.
> > > + */
> > > +SYSCALL_DEFINE2(sframe_register, struct sframe_setup __user *, data, size_t, size)
> > > +{
> > > + struct sframe_setup sframe;
> > > +
> > > + if (sizeof(sframe) != size)
> > > + return -EINVAL;
> >
> > This seems overly aggressive. It seems like the pattern is to allow
> > sizes both smaller and bigger:
> > - if user-provided size is smaller than what kernel knows about,
> > treat missing fields as zeroes
>
> Well, that could work with unregister, but for register that isn't
> quite useful, as all fields should be filled (well, if we add flags,
> that may not be 100% true).
>
This is a question of API design. If newly added fields are optional
by default, this works great. And even if you add some fields that
will be mandatory in the future (or mandatory based on flags), it's
super easy to error out if they are not set.
We've been doing this for years now in the bpf() syscall and it works
pretty well overall, while also keeping the user-space side (libbpf, for
instance) *much* simpler. I don't want to imagine a bpf() syscall
which in each kernel version enforces a different size of the bpf_attr
union...
> > - if user-provided size is bigger, then check that space after
> > fields that kernel recognizes are all zeroes.
>
> That is dangerous. A zero with greater size could mean something. If
> the size is greater than expected it should simply fail and let user
> space call it again with the older version.
>
It could, but it shouldn't if we extend the API reasonably. And if it so
happens that zero becomes meaningful, then you add a new flag that has
to be set if that field is present. This is a solved problem.
Requiring user space to use differently-sized structs for different
kernel versions is much, much worse.
> >
> > This allows extensibility without having to change user space code all
> > the time. Old code will provide smaller struct without new (presumably
> > optional) fields, while newer code can use newer and larger struct
> > size, but as long as it clears extra fields old kernel will be fine
> > with that.
>
> The old size will always work, thus old code will always continue to
> work. If we extend the system call, then it must handle both the older
> size as well as the newer size. User space would not need to change. It
> would only change if it wanted to use a new feature, and if it wants to
> work with older kernels it would need to try the bigger size first and
> if that fails, it knows the kernel doesn't support that new feature and
> then user space can figure out what to do. Either use the old system
> call or abort.
See above, many added features are typically optional (e.g., imagine
some extra bits of information that go along with the currently existing
mandatory sframe data). And it's easy to write user space code that can
automatically and gracefully "downgrade" by detecting that the kernel
doesn't support some feature and thus just not setting the field,
leaving it zero. But you won't have to track what the right size of the
struct should be when the struct in your API headers is already larger
because you compiled against newer kernel headers.
Believe me, this is the right way to go with this kind of extendable binary API.
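A minimal user-space sketch of the size handling argued for above. The layout `sframe_setup_v2` (the four fields from the patch plus the suggested flags word) and the helper `copy_struct_checked()` are hypothetical names used only for illustration. In the kernel, copy_struct_from_user() provides these semantics for user pointers; the stand-in below only models them on plain memory:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: the patch's four fields plus the proposed flags. */
struct sframe_setup_v2 {
	uint64_t sframe_start;
	uint64_t sframe_size;
	uint64_t text_start;
	uint64_t text_size;
	uint64_t flags;
};

/*
 * Copy a caller-provided struct of usize bytes into a "kernel" struct of
 * ksize bytes:
 *  - usize < ksize: fields the caller did not provide read as zero
 *  - usize > ksize: the unknown trailing bytes must all be zero,
 *    otherwise fail with -E2BIG
 */
static int copy_struct_checked(void *dst, size_t ksize,
			       const void *src, size_t usize)
{
	size_t size = usize < ksize ? usize : ksize;
	const unsigned char *tail = (const unsigned char *)src + size;
	size_t i;

	/* Only runs when the caller's struct is larger than ours. */
	for (i = 0; i < usize - size; i++) {
		if (tail[i])
			return -E2BIG;
	}

	memset(dst, 0, ksize);
	memcpy(dst, src, size);
	return 0;
}
```

With this shape, an old binary passing only the first four fields and a new binary passing a larger, zero-padded struct are both accepted by the same kernel, and only a caller that actually sets an unknown field gets an error.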
>
> -- Steve
>
> >
> > > +
> > > + if (copy_from_user(&sframe, data, size))
> > > + return -EFAULT;
> > > +
> > > + return sframe_add_section(sframe.sframe_start,
> > > + sframe.sframe_start + sframe.sframe_size,
> > > + sframe.text_start,
> > > + sframe.text_start + sframe.text_size);
> > > +}
> > > +
> >
> > [...]
>
* [PATCH 1/2] rtla/timerlat: Fix parsing of short options with attached arguments
From: John Kacur @ 2026-06-01 21:15 UTC
To: Steven Rostedt, Tomas Glozar, linux-trace-kernel
Cc: Costa Shulyupin, Wander Lairson Costa, Crystal Wood,
Luis Claudio R . Goncalves, linux-kernel
The timerlat hist command fails to parse short options with attached
numeric arguments (e.g., -p100) due to conflicts between digit characters
used as option values and numeric arguments to other options.
This issue was discovered when testing rtla 7.1.0-rc6 with rteval,
which passes arguments in the compact -p100 format. The rteval tests
failed with the confusing error "no-irq and no-thread set, there is
nothing to do here" even though neither option was specified.
The root cause is two-fold:
1. Digit characters ('0'-'9') were used as short option values for
long-only options like --no-irq, --no-thread, etc. This caused
getopt_auto() to generate an option string like 'a:b:...:u0123456:7:8:9'
which made getopt treat digits as valid option characters.
2. The two-phase option parsing approach (alternating calls between
common_parse_options() and local option parsing) confused getopt's
internal state when encountering arguments like -p100.
When a user passed -p100, getopt would incorrectly parse it as four
separate options: -p, -1, -0, and -0, silently setting no_irq and
no_thread flags instead of recognizing "100" as the period argument.
The two-phase parsing was introduced in commit 850cd24cb6d6 ("tools/rtla:
Add common_parse_options()") which first appeared in v7.0-rc1. Prior to
that commit, -p100 worked correctly. The digit characters as option
values existed since the original timerlat implementation, but only
became problematic when combined with the two-phase parsing approach.
Fix this by:
1. Eliminating digit characters from the option string by filtering them
out in getopt_auto(). This prevents conflicts with numeric arguments.
2. Refactoring timerlat_hist_parse_args() to use single-pass option
parsing. Instead of alternating between common_parse_options() and
local parsing, merge all options (common and local) into a single
option table and parse them in one pass. This matches the approach
used by cyclictest and other tools.
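A rough sketch of the digit filter from step 1, assuming getopt_auto() derives its short-option string from the long-option table in roughly this way. `build_optstring()` is a hypothetical stand-in, not the rtla code; only the two range checks mirror the hunk in this patch:

```c
#include <getopt.h>
#include <stddef.h>

/*
 * Build a getopt() short-option string from a long-option table.
 * Returns 0 on success, -1 if buf is too small.
 */
static int build_optstring(const struct option *long_opts, char *buf, size_t len)
{
	size_t n = 0;
	int i;

	for (i = 0; long_opts[i].name; i++) {
		int val = long_opts[i].val;

		/* Values outside this range have no short form. */
		if (val < 32 || val > 127)
			continue;

		/* Digits stay long-only so they cannot shadow numeric arguments. */
		if (val >= '0' && val <= '9')
			continue;

		/* Worst case per entry: the letter, "::", and the final NUL. */
		if (n + 4 > len)
			return -1;

		buf[n++] = (char)val;
		if (long_opts[i].has_arg == required_argument) {
			buf[n++] = ':';
		} else if (long_opts[i].has_arg == optional_argument) {
			buf[n++] = ':';
			buf[n++] = ':';
		}
	}
	buf[n] = '\0';
	return 0;
}
```

For a table holding period ('p', required), a long-only flag mapped to '0', trace ('t', optional) and nano ('n'), this produces "p:t::n": the digit never reaches the option string, so getopt has no reason to treat it as a short option.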
With these changes, all argument formats work correctly:
-p 100 (short with space)
-p100 (short without space)
--period=100 (long with =)
--period 100 (long with space)
This maintains compatibility with existing usage while enabling the
compact -p100 format that users expect from similar tools.
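The four formats can be checked against plain getopt_long() in a single pass. This is a minimal sketch with a hypothetical `parse_period()` helper and a one-entry option table, not the timerlat parser itself:

```c
#include <getopt.h>
#include <stdlib.h>

/* Hypothetical helper: returns the -p/--period value, or -1 if absent or invalid. */
static long parse_period(int argc, char **argv)
{
	static const struct option long_opts[] = {
		{"period", required_argument, NULL, 'p'},
		{NULL, 0, NULL, 0}
	};
	long period = -1;
	int c;

	/* Restart scanning so the helper can be called more than once. */
	optind = 1;

	/* One loop, one option table: no second parser touches getopt's state. */
	while ((c = getopt_long(argc, argv, "p:", long_opts, NULL)) != -1) {
		if (c != 'p')
			return -1;
		period = strtol(optarg, NULL, 10);
	}
	return period;
}
```

Because "p:" declares a required argument, getopt_long() takes the rest of "-p100" as optarg instead of scanning the digits as further options.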
Assisted-by: Claude:claude-sonnet-4-5
Signed-off-by: John Kacur <jkacur@redhat.com>
---
tools/tracing/rtla/src/common.c | 4 ++
tools/tracing/rtla/src/timerlat_hist.c | 55 ++++++++++++++++++++++++--
2 files changed, 56 insertions(+), 3 deletions(-)
diff --git a/tools/tracing/rtla/src/common.c b/tools/tracing/rtla/src/common.c
index 35e3d3aa922e..c2fd051c562c 100644
--- a/tools/tracing/rtla/src/common.c
+++ b/tools/tracing/rtla/src/common.c
@@ -65,6 +65,10 @@ int getopt_auto(int argc, char **argv, const struct option *long_opts)
if (long_opts[i].val < 32 || long_opts[i].val > 127)
continue;
+ /* Skip digit characters to avoid conflicts with numeric arguments */
+ if (long_opts[i].val >= '0' && long_opts[i].val <= '9')
+ continue;
+
if (n + 4 >= sizeof(opts))
fatal("optstring buffer overflow");
diff --git a/tools/tracing/rtla/src/timerlat_hist.c b/tools/tracing/rtla/src/timerlat_hist.c
index 79142af4f566..c0b6d7c30114 100644
--- a/tools/tracing/rtla/src/timerlat_hist.c
+++ b/tools/tracing/rtla/src/timerlat_hist.c
@@ -787,11 +787,24 @@ static struct common_params
static struct option long_options[] = {
{"auto", required_argument, 0, 'a'},
{"bucket-size", required_argument, 0, 'b'},
+ /* Common options */
+ {"cpus", required_argument, 0, 'c'},
+ {"cgroup", optional_argument, 0, 'C'},
+ {"debug", no_argument, 0, 'D'},
+ {"duration", required_argument, 0, 'd'},
+ {"event", required_argument, 0, 'e'},
+ /* End common options */
{"entries", required_argument, 0, 'E'},
{"help", no_argument, 0, 'h'},
+ /* Common option */
+ {"house-keeping", required_argument, 0, 'H'},
+ /* End common option */
{"irq", required_argument, 0, 'i'},
{"nano", no_argument, 0, 'n'},
{"period", required_argument, 0, 'p'},
+ /* Common option */
+ {"priority", required_argument, 0, 'P'},
+ /* End common option */
{"stack", required_argument, 0, 's'},
{"thread", required_argument, 0, 'T'},
{"trace", optional_argument, 0, 't'},
@@ -819,9 +832,6 @@ static struct common_params
{0, 0, 0, 0}
};
- if (common_parse_options(argc, argv, &params->common))
- continue;
-
c = getopt_auto(argc, argv, long_options);
/* detect the end of the options. */
@@ -850,6 +860,35 @@ static struct common_params
params->common.hist.bucket_size >= 1000000)
fatal("Bucket size needs to be > 0 and <= 1000000");
break;
+ case 'c':
+ if (parse_cpu_set(optarg, &params->common.monitored_cpus))
+ fatal("Invalid -c cpu list");
+ params->common.cpus = optarg;
+ break;
+ case 'C':
+ params->common.cgroup = 1;
+ params->common.cgroup_name = parse_optional_arg(argc, argv);
+ break;
+ case 'D':
+ config_debug = 1;
+ break;
+ case 'd':
+ params->common.duration = parse_seconds_duration(optarg);
+ if (!params->common.duration)
+ fatal("Invalid -d duration");
+ break;
+ case 'e':
+ {
+ struct trace_events *tevent;
+ tevent = trace_event_alloc(optarg);
+ if (!tevent)
+ fatal("Error alloc trace event");
+
+ if (params->common.events)
+ tevent->next = params->common.events;
+ params->common.events = tevent;
+ }
+ break;
case 'E':
params->common.hist.entries = get_llong_from_str(optarg);
if (params->common.hist.entries < 10 ||
@@ -860,6 +899,11 @@ static struct common_params
case '?':
timerlat_hist_usage();
break;
+ case 'H':
+ params->common.hk_cpus = 1;
+ if (parse_cpu_set(optarg, &params->common.hk_cpu_set))
+ fatal("Error parsing house keeping CPUs");
+ break;
case 'i':
params->common.stop_us = get_llong_from_str(optarg);
break;
@@ -874,6 +918,11 @@ static struct common_params
if (params->timerlat_period_us > 1000000)
fatal("Period longer than 1 s");
break;
+ case 'P':
+ if (parse_prio(optarg, &params->common.sched_param) == -1)
+ fatal("Invalid -P priority");
+ params->common.set_sched = 1;
+ break;
case 's':
params->print_stack = get_llong_from_str(optarg);
break;
--
2.54.0
* [PATCH 2/2] rtla/timerlat: Add tests for option parsing with attached arguments
From: John Kacur @ 2026-06-01 21:15 UTC
To: Steven Rostedt, Tomas Glozar, linux-trace-kernel
Cc: Costa Shulyupin, Wander Lairson Costa, Crystal Wood,
Luis Claudio R . Goncalves, linux-kernel
In-Reply-To: <20260601211538.381649-1-jkacur@redhat.com>
Add tests to verify that numeric arguments work correctly with both
attached and detached formats:
-p 100 (short with space)
-p100 (short without space)
--period=100 (long with =)
--period 100 (long with space)
These tests prevent regression of the bug fixed in commit eefa8af46ff7
("rtla/timerlat: Fix parsing of short options with attached arguments")
where -p100 was incorrectly parsed as multiple separate options.
The tests verify that:
1. All four argument formats succeed (exit code 0)
2. None trigger the "no-irq and no-thread" error that occurred when
the bug was present
Assisted-by: Claude:claude-sonnet-4-5
Signed-off-by: John Kacur <jkacur@redhat.com>
---
tools/tracing/rtla/tests/timerlat.t | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/tools/tracing/rtla/tests/timerlat.t b/tools/tracing/rtla/tests/timerlat.t
index fd4935fd7b49..1a63301f5d70 100644
--- a/tools/tracing/rtla/tests/timerlat.t
+++ b/tools/tracing/rtla/tests/timerlat.t
@@ -42,6 +42,16 @@ check "verify -c/--cpus" \
check "hist test in nanoseconds" \
"timerlat hist -i 2 -c 0 -n -d 10s" 2 "ns"
+# Option parsing tests - verify attached numeric arguments work correctly
+check "verify -p with space" \
+ "timerlat hist -p 100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+check "verify -p without space (attached argument)" \
+ "timerlat hist -p100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+check "verify --period with equals" \
+ "timerlat hist --period=100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+check "verify --period with space" \
+ "timerlat hist --period 100 -c 0 -d 1s" 0 "" "no-irq and no-thread"
+
# Actions tests
check "trace output through -t" \
"timerlat hist -T 2 -t" 2 "^ Saving trace to timerlat_trace.txt$"
--
2.54.0
* Re: [PATCH v7 09/42] KVM: guest_memfd: Add base support for KVM_SET_MEMORY_ATTRIBUTES2
From: Michael Roth @ 2026-06-01 23:14 UTC
To: ackerleytng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-9-2f0fae496530@google.com>
On Fri, May 22, 2026 at 05:17:51PM -0700, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
>
> Introduce base support for KVM_SET_MEMORY_ATTRIBUTES2 in guest_memfd, which
> just updates attributes tracked by guest_memfd.
>
> Validate input fields in general. Guard usage of KVM_SET_MEMORY_ATTRIBUTES2
> by making sure requested attributes are supported for this instance of kvm.
>
> A new KVM_SET_MEMORY_ATTRIBUTES2 is defined to support writes (unlike
> KVM_SET_MEMORY_ATTRIBUTES) in addition to reads so it can provide error
> details to userspace. This will be used in a later patch.
>
> The two ioctls use their corresponding structs with no overlap, but
> backward compatibility is baked in for future support of
> KVM_SET_MEMORY_ATTRIBUTES2 and struct kvm_memory_attributes2 in the VM
> ioctl.
>
> The process of setting memory attributes is set up such that the later half
> will not fail due to allocation. Any necessary checks are performed before
> the point of no return.
>
> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> Co-developed-by: Sean Christoperson <seanjc@google.com>
> Signed-off-by: Sean Christoperson <seanjc@google.com>
Typo on the "person".
(Sent this earlier but looks like some of my emails never hit the
list so re-sending. Apologies if this is a dupe).
Thanks,
Mike
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
* [PATCH] tracing/events: Expand ring buffer for in-kernel event enables
From: Manjunath Patil @ 2026-06-01 23:24 UTC
To: Steven Rostedt, Masami Hiramatsu
Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel,
Manjunath Patil
Ftrace keeps trace arrays at a boot-minimum ring-buffer size until
tracing is used. Tracefs event-enable paths already call
tracing_update_buffers() before enabling events, but the exported
in-kernel helpers trace_set_clr_event() and trace_array_set_clr_event()
directly enable events through __ftrace_set_clr_event().
This can leave events enabled by in-kernel users recording into the tiny
boot-minimum buffer instead of the configured default-sized buffer. Any
caller that enables events through these exported helpers observes
different buffer-expansion behavior than a userspace tracefs event enable.
Expand the relevant trace array before enabling events through the
exported in-kernel helpers, matching the tracefs event-enable behavior.
Disabling events remains unchanged.
Assisted-by: Codex:gpt-5
Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>
---
kernel/trace/trace_events.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index c46e623e7e0d..3ce5b0121c5c 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1479,10 +1479,22 @@ int ftrace_set_clr_event(struct trace_array *tr, char *buf, int set)
int trace_set_clr_event(const char *system, const char *event, int set)
{
struct trace_array *tr = top_trace_array();
+ int ret;
if (!tr)
return -ENODEV;
+ /*
+ * Keep in-kernel event enabling consistent with tracefs event
+ * enabling: once an event is being enabled, expand the boot-minimum
+ * ring buffer to the configured default size before records arrive.
+ */
+ if (set) {
+ ret = tracing_update_buffers(tr);
+ if (ret < 0)
+ return ret;
+ }
+
return __ftrace_set_clr_event(tr, NULL, system, event, set, NULL);
}
EXPORT_SYMBOL_GPL(trace_set_clr_event);
@@ -1504,11 +1516,24 @@ int trace_array_set_clr_event(struct trace_array *tr, const char *system,
const char *event, bool enable)
{
int set;
+ int ret;
if (!tr)
return -ENOENT;
set = (enable == true) ? 1 : 0;
+
+ /*
+ * Keep in-kernel event enabling consistent with tracefs event
+ * enabling: once an event is being enabled, expand the boot-minimum
+ * ring buffer to the configured default size before records arrive.
+ */
+ if (set) {
+ ret = tracing_update_buffers(tr);
+ if (ret < 0)
+ return ret;
+ }
+
return __ftrace_set_clr_event(tr, NULL, system, event, set, NULL);
}
EXPORT_SYMBOL_GPL(trace_array_set_clr_event);
base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
--
2.47.3
* Re: [PATCH v7] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-06-02 0:03 UTC
To: Steven Rostedt
Cc: LKML, Linux trace kernel, Mathieu Desnoyers, Mark Rutland,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Douglas Raillard,
Tom Zanussi, Andrew Morton, Thomas Gleixner, Ian Rogers,
Jiri Olsa
In-Reply-To: <20260601122126.5ebbd7e7@fedora>
On Mon, 1 Jun 2026 12:21:26 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Sun, 31 May 2026 10:14:58 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
>
> > > > Does this prematurely release the BTF struct reference?
> > > > If TPARG_FL_TYPECAST is unset here and ctx->struct_btf is put, won't
> > > > later steps in traceprobe_parse_probe_arg_body() (like
> > > > find_fetch_type_from_btf_type()) fail to properly infer struct field sizes?
> > > > When ctx_btf(ctx) is called later without TPARG_FL_TYPECAST set, it
> > > > will evaluate to ctx->btf (which is NULL for eprobes).
> > > > Could this potentially lead to silent defaults, such as 64-bit reads for
> > > > smaller fields, or fail to inject pointer dereferences for string fields,
> > > > while also leaving ctx->last_type pointing to a prematurely released BTF
> > > > object?
> > >
> > > Does this mean we need to set ctx->last_type to NULL here too?
> >
> > No, since the member we refer can be different from unsigned long.
> > When we don't have ":type" suffix, we use BTF type information to
> > decide appropriate type.
> >
> > >
> > > Because everything above is pretty much the expected behavior. The put is
> > > *not* premature. The last_struct and struct_btf are both set to NULL. I
> > > guess the only thing missing is to reset last_type as well.
> >
> > No, as I explained, the last_type is used to determine the member type
> > when user does not specify the ":type" suffix.
> >
> > So, what we need to do is deferring the btf_put(struct_btf) as below:
> > (no build test yet.)
>
> OK, but I don't think we want the struct_btf to exist beyond a single
> arg like the btf descriptor does. How about this (on top of this change),
> where it clears the struct_btf at the end of traceprobe_parse_probe_arg_body()?
>
> Also, I see the flag as being redundant and use the existence of
> struct_btf to denote that it's parsing a typedef struct.
Ah, indeed. OK, let me check v8 patch.
Thanks!
>
> -- Steve
>
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index 9246e9c3d066..56b7dc406ca1 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -397,8 +397,7 @@ static bool btf_type_is_char_array(struct btf *btf, const struct btf_type *type)
>
> static struct btf *ctx_btf(struct traceprobe_parse_context *ctx)
> {
> - return ctx->flags & TPARG_FL_TYPECAST ?
> - ctx->struct_btf : ctx->btf;
> + return ctx->struct_btf ? : ctx->btf;
> }
>
> static int check_prepare_btf_string_fetch(char *typename,
> @@ -531,6 +530,15 @@ static int query_btf_context(struct traceprobe_parse_context *ctx)
> return 0;
> }
>
> +static void clear_struct_btf(struct traceprobe_parse_context *ctx)
> +{
> + if (ctx->struct_btf) {
> + btf_put(ctx->struct_btf);
> + ctx->struct_btf = NULL;
> + ctx->last_struct = NULL;
> + }
> +}
> +
> static void clear_btf_context(struct traceprobe_parse_context *ctx)
> {
> if (ctx->btf) {
> @@ -579,7 +587,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
> struct fetch_insn *code = *pcode;
> const struct btf_member *field;
> u32 bitoffs, anon_offs;
> - bool is_struct = ctx->flags & TPARG_FL_TYPECAST;
> + bool is_struct = ctx->struct_btf != NULL;
> struct btf *btf = ctx_btf(ctx);
> char *next;
> int is_ptr;
> @@ -690,7 +698,7 @@ static int parse_btf_arg(char *varname,
> ret = parse_trace_event(varname, code, ctx);
> if (ret < 0)
> return ret;
> - if (WARN_ON_ONCE(!(ctx->flags & TPARG_FL_TYPECAST)))
> + if (WARN_ON_ONCE(ctx->struct_btf == NULL))
> return -EINVAL;
> type = ctx->last_struct;
> goto found_type;
> @@ -804,21 +812,19 @@ static int parse_btf_bitfield(struct fetch_insn **pcode,
>
> static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
> {
> + struct btf *btf = NULL;
> int id;
>
> - if (!ctx->struct_btf) {
> - struct btf *btf;
> -
> - id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
> - if (id < 0)
> - return id;
> - ctx->struct_btf = btf;
> - } else {
> - id = btf_find_by_name_kind(ctx->struct_btf, sname, BTF_KIND_STRUCT);
> - if (id < 0)
> - return id;
> + /* Could be for a structure in a different module */
> + if (ctx->struct_btf) {
> + btf_put(ctx->struct_btf);
> + ctx->struct_btf = NULL;
> }
>
> + id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
> + if (id < 0)
> + return id;
> + ctx->struct_btf = btf;
> ctx->last_struct = btf_type_by_id(ctx->struct_btf, id);
> return 0;
> }
> @@ -848,25 +854,23 @@ static int handle_typecast(char *arg, struct fetch_insn **pcode,
>
> if (ret < 0) {
> trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
> - ret = -EINVAL;
> - goto out_put;
> + return -EINVAL;
> }
>
> - ctx->flags |= TPARG_FL_TYPECAST;
> tmp++;
>
> ctx->offset += tmp - arg;
> ret = parse_btf_arg(tmp, pcode, end, ctx);
> - ctx->flags &= ~TPARG_FL_TYPECAST;
> - ctx->last_struct = NULL;
> -out_put:
> - btf_put(ctx->struct_btf);
> - ctx->struct_btf = NULL;
> return ret;
> }
>
> #else /* !CONFIG_PROBE_EVENTS_BTF_ARGS */
>
> +static void clear_struct_btf(struct traceprobe_parse_context *ctx)
> +{
> + ctx->struct_btf = NULL;
> +}
> +
> static void clear_btf_context(struct traceprobe_parse_context *ctx)
> {
> ctx->btf = NULL;
> @@ -1673,6 +1677,9 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
> }
> kfree(tmp);
>
> + /* struct_btf should not be passed to other arguments */
> + clear_struct_btf(ctx);
> +
> return ret;
> }
>
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 952e3d7582b8..83565f1634db 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -394,7 +394,6 @@ static inline int traceprobe_get_entry_data_size(struct trace_probe *tp)
> * TPARG_FL_KERNEL and TPARG_FL_USER are also mutually exclusive.
> * TPARG_FL_FPROBE and TPARG_FL_TPOINT are optional but it should be with
> * TPARG_FL_KERNEL.
> - * TPARG_FL_TYPECAST is set if an argument was typecast to a structure.
> */
> #define TPARG_FL_RETURN BIT(0)
> #define TPARG_FL_KERNEL BIT(1)
> @@ -403,7 +402,6 @@ static inline int traceprobe_get_entry_data_size(struct trace_probe *tp)
> #define TPARG_FL_USER BIT(4)
> #define TPARG_FL_FPROBE BIT(5)
> #define TPARG_FL_TPOINT BIT(6)
> -#define TPARG_FL_TYPECAST BIT(7)
> #define TPARG_FL_LOC_MASK GENMASK(4, 0)
>
> static inline bool tparg_is_function_entry(unsigned int flags)
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* Re: [PATCH v8] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-06-02 0:06 UTC
To: Steven Rostedt
Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers,
Mark Rutland, Peter Zijlstra, Namhyung Kim, Takaya Saeki,
Douglas Raillard, Tom Zanussi, Andrew Morton, Thomas Gleixner,
Ian Rogers, Jiri Olsa
In-Reply-To: <20260601133129.4a1e9dec@gandalf.local.home>
On Mon, 1 Jun 2026 13:31:29 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 1 Jun 2026 13:07:46 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > Changes since v7: https://patch.msgid.link/20260529110442.0967a64c@fedora
> >
> > - Add error message in parse_btf_args() for failed parsing of TEVENT.
> > (Sashiko)
> >
> > - Remove TPARG_FL_TYPECAST and just use ctx->struct_btf instead.
> > The flag was redundant and added unnecessary complexity.
> >
> > - Restructure to keep the lifetime of the TYPECAST to the end of
> > traceprobe_parse_probe_arg_body(). This allows the last_type to stay
> > around in case there's not a type parameter and then btf can still be
> > used.
> > (Sashiko and Masami Hiramatsu)
>
> And I rebased onto probes/for-next
>
Thanks, but it seems Sashiko failed to apply it (perhaps because it is
using the linux-trace/HEAD branch?). Hmm, we may always need a "base-id" tag.
Thanks,
> -- Steve
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
* [RESEND][PATCH v8] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-06-02 0:25 UTC (permalink / raw)
To: LKML, Linux Trace Kernel
Cc: Masami Hiramatsu, Mathieu Desnoyers, Mark Rutland, Peter Zijlstra,
Namhyung Kim, Takaya Saeki, Douglas Raillard, Tom Zanussi,
Andrew Morton, Thomas Gleixner, Ian Rogers, Jiri Olsa
From: Steven Rostedt <rostedt@goodmis.org>
Add syntax to the parsing of eprobes to be able to typecast a trace event
field that is a pointer to a structure.
Currently, a dereference offset must be a number, so the user has to
manually work out the offset of the structure member that they want to
dereference.
Event probes that record a field that happens to be a pointer to a
structure cannot dereference these values with BTF naming, but must use
numerical offsets.
For example, to find out what device a sk_buff is pointing to in the
net_dev_xmit trace event, one must first use gdb to find the offsets of the
members of the structures:
(gdb) p &((struct sk_buff *)0)->dev
$1 = (struct net_device **) 0x10
(gdb) p &((struct net_device *)0)->name
$2 = (char (*)[16]) 0x118
And then use the raw numbers to dereference:
# echo 'e:xmit net.net_dev_xmit +0x118(+0x10($skbaddr)):string' >> dynamic_events
If BTF is available in the kernel, skbaddr can instead be typecast to
sk_buff and dereferenced with the normal member syntax.
# echo 'e:xmit net.net_dev_xmit (sk_buff)skbaddr->dev->name:string' >> dynamic_events
# echo 1 > events/eprobes/xmit/enable
# cat trace
[..]
sshd-session-1022 [000] b..2. 860.249343: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.250061: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.250142: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.263553: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.283820: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.302716: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.322905: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.342828: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.362268: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.382335: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.400856: xmit: (net.net_dev_xmit) arg1="enp7s0"
sshd-session-1022 [000] b..2. 860.419893: xmit: (net.net_dev_xmit) arg1="enp7s0"
The syntax is simply: (STRUCT)FIELD->MEMBER[->MEMBER..]
Also add comments around the #else and #endif of #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
to know what they are for.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
[ Resend with base-id below, maybe Sashiko will apply it to the correct tree! ]
base-id: 585abc02be3d3ab82fbcc4dbcbbf0ceb61a02129
Changes since v7: https://patch.msgid.link/20260529110442.0967a64c@fedora
- Add error message in parse_btf_arg() for failed parsing of TEVENT.
(Sashiko)
- Remove TPARG_FL_TYPECAST and just use ctx->struct_btf instead.
The flag was redundant and added unnecessary complexity.
- Restructure to keep the lifetime of the TYPECAST to the end of
traceprobe_parse_probe_arg_body(). This allows the last_type to stay
around in case there's not a type parameter and then btf can still be
used.
(Sashiko and Masami Hiramatsu)
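[ Reader's note, not part of the patch: the gdb step in the commit message
amounts to asking for offsetof() of each member. The userspace sketch below
is only an illustration under stated assumptions: demo_skb and demo_netdev
are hypothetical stand-ins, not the kernel layouts, so their offsets will
not be 0x10 and 0x118, and deref_name() is a made-up helper that mimics the
nested +OFFS(+OFFS($field)):string form with raw offsets only. ]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; NOT the real kernel struct layouts. */
struct demo_netdev {
	long pad[4];
	char name[16];
};

struct demo_skb {
	void *next;
	void *prev;
	struct demo_netdev *dev;
};

/*
 * Emulate +NAME_OFFS(+DEV_OFFS(skbaddr)):string using raw offsets only,
 * the way the numeric eprobe syntax has to.
 */
static const char *deref_name(const void *skbaddr, size_t dev_offs,
			      size_t name_offs)
{
	const char *dev;

	assert(skbaddr != NULL);
	/* Inner dereference: load the pointer stored at skbaddr + dev_offs */
	memcpy(&dev, (const char *)skbaddr + dev_offs, sizeof(dev));
	/* Outer step: the string starts at dev + name_offs */
	return dev + name_offs;
}
```

[ Calling deref_name() with offsetof(struct demo_skb, dev) and
offsetof(struct demo_netdev, name) yields the name stored in the pointed-to
device. With the typecast syntax, the BTF lookup would supply those two
offsets so that the user does not have to. ]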
Documentation/trace/eprobetrace.rst | 4 +
kernel/trace/trace_probe.c | 173 +++++++++++++++++++++++-----
kernel/trace/trace_probe.h | 5 +-
3 files changed, 154 insertions(+), 28 deletions(-)
diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
index 89b5157cfab8..fe3602540569 100644
--- a/Documentation/trace/eprobetrace.rst
+++ b/Documentation/trace/eprobetrace.rst
@@ -46,6 +46,10 @@ Synopsis of eprobe_events
(x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
"string", "ustring", "symbol", "symstr" and "bitfield" are
supported.
+ (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
+ a pointer to STRUCT and then dereference the pointer defined by
+ ->MEMBER. Note that when this is used, the FIELD name does not
+ need to be prefixed with a '$'.
Types
-----
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 695310571b08..fd1caa1f9723 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -332,6 +332,23 @@ static int parse_trace_event_arg(char *arg, struct fetch_insn *code,
return -ENOENT;
}
+static int parse_trace_event(char *arg, struct fetch_insn *code,
+ struct traceprobe_parse_context *ctx)
+{
+ int ret;
+
+ if (code->data)
+ return -EFAULT;
+ ret = parse_trace_event_arg(arg, code, ctx);
+ if (!ret)
+ return 0;
+ if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
+ code->op = FETCH_OP_COMM;
+ return 0;
+ }
+ return -EINVAL;
+}
+
#ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
static u32 btf_type_int(const struct btf_type *t)
@@ -376,11 +393,16 @@ static bool btf_type_is_char_array(struct btf *btf, const struct btf_type *type)
&& BTF_INT_BITS(intdata) == 8;
}
+static struct btf *ctx_btf(struct traceprobe_parse_context *ctx)
+{
+ return ctx->struct_btf ? : ctx->btf;
+}
+
static int check_prepare_btf_string_fetch(char *typename,
struct fetch_insn **pcode,
struct traceprobe_parse_context *ctx)
{
- struct btf *btf = ctx->btf;
+ struct btf *btf = ctx_btf(ctx);
if (!btf || !ctx->last_type)
return 0;
@@ -506,6 +528,15 @@ static int query_btf_context(struct traceprobe_parse_context *ctx)
return 0;
}
+static void clear_struct_btf(struct traceprobe_parse_context *ctx)
+{
+ if (ctx->struct_btf) {
+ btf_put(ctx->struct_btf);
+ ctx->struct_btf = NULL;
+ ctx->last_struct = NULL;
+ }
+}
+
static void clear_btf_context(struct traceprobe_parse_context *ctx)
{
if (ctx->btf) {
@@ -554,22 +585,29 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
struct fetch_insn *code = *pcode;
const struct btf_member *field;
u32 bitoffs, anon_offs;
+ bool is_struct = ctx->struct_btf != NULL;
+ struct btf *btf = ctx_btf(ctx);
char *next;
int is_ptr;
s32 tid;
do {
- /* Outer loop for solving arrow operator ('->') */
- if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
- trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
- return -EINVAL;
- }
- /* Convert a struct pointer type to a struct type */
- type = btf_type_skip_modifiers(ctx->btf, type->type, &tid);
- if (!type) {
- trace_probe_log_err(ctx->offset, BAD_BTF_TID);
- return -EINVAL;
+ if (!is_struct) {
+ /* Outer loop for solving arrow operator ('->') */
+ if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
+ trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
+ return -EINVAL;
+ }
+
+ /* Convert a struct pointer type to a struct type */
+ type = btf_type_skip_modifiers(btf, type->type, &tid);
+ if (!type) {
+ trace_probe_log_err(ctx->offset, BAD_BTF_TID);
+ return -EINVAL;
+ }
}
+ /* Only the first type can skip being a pointer */
+ is_struct = false;
bitoffs = 0;
do {
@@ -580,7 +618,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
return is_ptr;
anon_offs = 0;
- field = btf_find_struct_member(ctx->btf, type, fieldname,
+ field = btf_find_struct_member(btf, type, fieldname,
&anon_offs);
if (IS_ERR(field)) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
@@ -602,7 +640,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
ctx->last_bitsize = 0;
}
- type = btf_type_skip_modifiers(ctx->btf, field->type, &tid);
+ type = btf_type_skip_modifiers(btf, field->type, &tid);
if (!type) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
return -EINVAL;
@@ -640,7 +678,7 @@ static int parse_btf_arg(char *varname,
int i, is_ptr, ret;
u32 tid;
- if (WARN_ON_ONCE(!ctx->funcname))
+ if (WARN_ON_ONCE(!ctx->funcname && !(ctx->flags & TPARG_FL_TEVENT)))
return -EINVAL;
is_ptr = split_next_field(varname, &field, ctx);
@@ -653,6 +691,19 @@ static int parse_btf_arg(char *varname,
return -EOPNOTSUPP;
}
+ if (ctx->flags & TPARG_FL_TEVENT) {
+ ret = parse_trace_event(varname, code, ctx);
+ if (ret < 0) {
+ trace_probe_log_err(ctx->offset, BAD_ATTACH_ARG);
+ return ret;
+ }
+ /* TEVENT is only here via a typecast */
+ if (WARN_ON_ONCE(ctx->struct_btf == NULL))
+ return -EINVAL;
+ type = ctx->last_struct;
+ goto found_type;
+ }
+
if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
code->op = FETCH_OP_RETVAL;
/* Check whether the function return type is not void */
@@ -709,6 +760,7 @@ static int parse_btf_arg(char *varname,
found:
type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
+found_type:
if (!type) {
trace_probe_log_err(ctx->offset, BAD_BTF_TID);
return -EINVAL;
@@ -727,7 +779,7 @@ static int parse_btf_arg(char *varname,
static const struct fetch_type *find_fetch_type_from_btf_type(
struct traceprobe_parse_context *ctx)
{
- struct btf *btf = ctx->btf;
+ struct btf *btf = ctx_btf(ctx);
const char *typestr = NULL;
if (btf && ctx->last_type)
@@ -758,7 +810,67 @@ static int parse_btf_bitfield(struct fetch_insn **pcode,
return 0;
}
-#else
+static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
+{
+ struct btf *btf = NULL;
+ int id;
+
+ /* A struct_btf should only be used by a single argument */
+ if (WARN_ON_ONCE(ctx->struct_btf)) {
+ btf_put(ctx->struct_btf);
+ ctx->struct_btf = NULL;
+ }
+
+ id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
+ if (id < 0)
+ return id;
+ ctx->struct_btf = btf;
+ ctx->last_struct = btf_type_by_id(ctx->struct_btf, id);
+ return 0;
+}
+
+static int handle_typecast(char *arg, struct fetch_insn **pcode,
+ struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx)
+{
+ char *tmp;
+ int ret;
+
+ /* Currently this only works for eprobes */
+ if (!(ctx->flags & TPARG_FL_TEVENT)) {
+ trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
+ return -EINVAL;
+ }
+
+ tmp = strchr(arg, ')');
+ if (!tmp) {
+ trace_probe_log_err(ctx->offset + strlen(arg),
+ DEREF_OPEN_BRACE);
+ return -EINVAL;
+ }
+ *tmp = '\0';
+ ret = query_btf_struct(arg + 1, ctx);
+ *tmp = ')';
+
+ if (ret < 0) {
+ trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
+ return -EINVAL;
+ }
+
+ tmp++;
+
+ ctx->offset += tmp - arg;
+ ret = parse_btf_arg(tmp, pcode, end, ctx);
+ return ret;
+}
+
+#else /* !CONFIG_PROBE_EVENTS_BTF_ARGS */
+
+static void clear_struct_btf(struct traceprobe_parse_context *ctx)
+{
+ ctx->struct_btf = NULL;
+}
+
static void clear_btf_context(struct traceprobe_parse_context *ctx)
{
ctx->btf = NULL;
@@ -794,7 +906,15 @@ static int check_prepare_btf_string_fetch(char *typename,
return 0;
}
-#endif
+static int handle_typecast(char *arg, struct fetch_insn **pcode,
+ struct fetch_insn *end,
+ struct traceprobe_parse_context *ctx)
+{
+ trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
+ return -EOPNOTSUPP;
+}
+
+#endif /* CONFIG_PROBE_EVENTS_BTF_ARGS */
#ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
@@ -948,16 +1068,9 @@ static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
int len;
if (ctx->flags & TPARG_FL_TEVENT) {
- if (code->data)
- return -EFAULT;
- ret = parse_trace_event_arg(arg, code, ctx);
- if (!ret)
- return 0;
- if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
- code->op = FETCH_OP_COMM;
- return 0;
- }
- goto inval;
+ if (parse_trace_event(arg, code, ctx) < 0)
+ goto inval;
+ return 0;
}
if (str_has_prefix(arg, "retval")) {
@@ -1224,6 +1337,9 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
code->op = FETCH_OP_IMM;
}
break;
+ case '(':
+ ret = handle_typecast(arg, pcode, end, ctx);
+ break;
default:
if (isalpha(arg[0]) || arg[0] == '_') { /* BTF variable */
if (!tparg_is_function_entry(ctx->flags) &&
@@ -1556,6 +1672,9 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
}
kfree(tmp);
+ /* struct_btf should not be passed to other arguments */
+ clear_struct_btf(ctx);
+
return ret;
}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 1076f1df347b..15758cc11fc6 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -422,7 +422,9 @@ struct traceprobe_parse_context {
const struct btf_param *params; /* Parameter of the function */
s32 nr_params; /* The number of the parameters */
struct btf *btf; /* The BTF to be used */
+ struct btf *struct_btf; /* The BTF to be used for structs */
const struct btf_type *last_type; /* Saved type */
+ const struct btf_type *last_struct; /* Saved structure */
u32 last_bitoffs; /* Saved bitoffs */
u32 last_bitsize; /* Saved bitsize */
struct trace_probe *tp;
@@ -563,7 +565,8 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
C(NEED_STRING_TYPE, "$comm and immediate-string only accepts string type"),\
C(TOO_MANY_ARGS, "Too many arguments are specified"), \
C(TOO_MANY_EARGS, "Too many entry arguments specified"), \
- C(EVENT_TOO_BIG, "Event too big (too many fields?)"),
+ C(EVENT_TOO_BIG, "Event too big (too many fields?)"), \
+ C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"),
#undef C
#define C(a, b) TP_ERR_##a
--
2.53.0
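[ Reader's note, not part of the patch: the way handle_typecast() locates
the closing ')' and hands the remainder to the field parser can be sketched
in userspace roughly as below. This is a simplified illustration, not the
kernel code: struct cast_expr, MAX_MEMBERS and split_typecast() are
hypothetical names, the sketch assumes the ":type" suffix has already been
stripped, and it requires at least one ->MEMBER as in the documented
syntax. ]

```c
#include <assert.h>
#include <string.h>

#define MAX_MEMBERS 8

/* Hypothetical result holder for this sketch; not a kernel structure. */
struct cast_expr {
	char struct_name[64];
	char field[64];
	char members[MAX_MEMBERS][64];
	int nr_members;
};

/*
 * Split "(STRUCT)FIELD->MEMBER[->MEMBER..]" into its parts.
 * Returns 0 on success, -1 on malformed input.
 */
static int split_typecast(const char *arg, struct cast_expr *out)
{
	const char *close, *cur, *arrow;
	size_t len;

	assert(arg && out);
	/* Zeroing up front keeps every copied name NUL-terminated */
	memset(out, 0, sizeof(*out));

	if (arg[0] != '(')
		return -1;
	/* Same first step as the patch: find the matching ')' */
	close = strchr(arg, ')');
	if (!close)
		return -1;
	len = (size_t)(close - (arg + 1));
	if (len == 0 || len >= sizeof(out->struct_name))
		return -1;
	memcpy(out->struct_name, arg + 1, len);

	/* Everything after ')' is FIELD followed by the member chain */
	cur = close + 1;
	arrow = strstr(cur, "->");
	if (!arrow)
		return -1;	/* this sketch requires at least one member */
	len = (size_t)(arrow - cur);
	if (len == 0 || len >= sizeof(out->field))
		return -1;
	memcpy(out->field, cur, len);

	/* Walk the "->MEMBER" chain */
	while (arrow) {
		cur = arrow + 2;
		arrow = strstr(cur, "->");
		len = arrow ? (size_t)(arrow - cur) : strlen(cur);
		if (len == 0 || len >= sizeof(out->members[0]) ||
		    out->nr_members >= MAX_MEMBERS)
			return -1;
		memcpy(out->members[out->nr_members++], cur, len);
	}
	return 0;
}
```

[ For "(sk_buff)skbaddr->dev->name" this yields struct_name "sk_buff", field
"skbaddr" and members "dev" and "name". In the patch the struct name goes to
query_btf_struct() and the rest to parse_btf_arg(); the sketch only shows
the string splitting. ]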
* Re: [PATCH mm-hotfixes-unstable v18 00/14] khugepaged: add mTHP collapse support
From: Lance Yang @ 2026-06-02 1:53 UTC (permalink / raw)
To: Lorenzo Stoakes, Alexander Gordeev
Cc: Andrew Morton, Gerald Schaefer, Nico Pache, linux-doc,
linux-kernel, linux-mm, linux-trace-kernel, aarcange,
anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
liam, mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe,
linux-s390, linux-next
In-Reply-To: <ah2z26OzPktchVeT@lucifer>
On 2026/6/2 01:08, Lorenzo Stoakes wrote:
> On Mon, Jun 01, 2026 at 05:58:08PM +0200, Alexander Gordeev wrote:
>> On Fri, May 22, 2026 at 01:47:24PM -0700, Andrew Morton wrote:
>>
>> Hi Andrew et al,
>>
>>> On Fri, 22 May 2026 08:59:55 -0600 Nico Pache <npache@redhat.com> wrote:
>>>
>>>> The following series provides khugepaged with the capability to collapse
>>>> anonymous memory regions to mTHPs.
>>>
>>> Thanks, I've updated mm.git's mm-unstable branch to this version.
>>>
>>> It sounds like I might be dropping it soon, haven't started looking at
>>> that yet. But let's at least eyeball the latest version at this time.
>>>
>>> Sashiko was able to apply this, so the base-it-on-hotfixes thing worked
>>> well, thanks. The AI checking made a few allegations:
>>
>> This series appears to cause hangs on s390 in linux-next.
>> The issue is not easily reproducible, so it is not yet confirmed.
>> Any ideas for a reliable reproducer that exercises the code path below?
>>
>> [ 2749.385719] sysrq: Show Blocked State
>> [ 2749.385730] task:khugepaged state:D stack:0 pid:209 tgid:209 ppid:2 task_flags:0x200040 flags:0x00000000
>> [ 2749.385735] Call Trace:
>> [ 2749.385736] [<0000017f63c8b226>] __schedule+0x316/0x890
>> [ 2749.385740] [<0000017f63c8b7dc>] schedule+0x3c/0xc0
>> [ 2749.385743] [<0000017f63c8b888>] schedule_preempt_disabled+0x28/0x40
>> [ 2749.385746] [<0000017f63c902ea>] rwsem_down_write_slowpath+0x2fa/0x8b0
>> [ 2749.385749] [<0000017f63c90910>] down_write+0x70/0x80
>> [ 2749.385752] [<0000017f6313407a>] collapse_huge_page+0x2ea/0x9e0
>> [ 2749.385755] [<0000017f6313491e>] mthp_collapse+0x1ae/0x1f0
>> [ 2749.385757] [<0000017f63134fda>] collapse_scan_pmd+0x67a/0x8f0
>> [ 2749.385760] [<0000017f6313751a>] collapse_single_pmd+0x15a/0x260
>> [ 2749.385762] [<0000017f6313792c>] collapse_scan_mm_slot.constprop.0+0x30c/0x470
>> [ 2749.385765] [<0000017f63137cb6>] khugepaged+0x226/0x240
>> [ 2749.385768] [<0000017f62db3128>] kthread+0x148/0x170
>> [ 2749.385770] [<0000017f62d2c238>] __ret_from_fork+0x48/0x220
>> [ 2749.385772] [<0000017f63c95d0a>] ret_from_fork+0xa/0x30
>>
>> Thanks!
>
> Hi Alexander,
>
> Thanks for the report.
>
> It's a pity it's non-repro, I had Claude have a look at it and it couldn't find
> a definite issue with the code at v18, all the locks seem balanced internally.
>
> Things it highlighted FWIW:
>
> - Far more mmap_write_lock()'s being taken - the stack-based approach calls
> collapse_huge_page() multiple times per-PMD each of which entails an mmap read
> lock/unlock and mmap write lock.
>
> - anon_vma write lock held for a much longer period over partial collapse.
>
> So maybe these are triggering issues rather than being the cause of them per-se?
>
> If you happen to see it again could you give the output for:
>
> 'echo t > /proc/sysrq-trigger' so we can track who holds the contended lock and
> get more details on it?
>
> Also the .config would be useful.
>
> I'm guessing you've also not enabled mTHP in any way on the system?
>
> Repro-wise you could also:
>
> # echo 1 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
> # echo 1 > /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
>
> To get khugepaged going more aggressively:
>
> $ for f in /sys/kernel/mm/transparent_hugepage/hugepages-*; do echo always | sudo tee $f/enabled; done
>
> Then maybe some stress-ng like sudo stress-ng --vm 4 --vm-bytes 2G --vm-method
> all --timeout 5m (or maybe something more refined :)?
>
> Maybe some of this will help repro more reliably?
>
Cool!
Maybe also worth trying with CONFIG_DETECT_HUNG_TASK=y and
CONFIG_DETECT_HUNG_TASK_BLOCKER=y.
# detect after 10s in D state instead of default 120s
echo 10 > /proc/sys/kernel/hung_task_timeout_secs
# optional: check more often; 0 means same as timeout
echo 0 > /proc/sys/kernel/hung_task_check_interval_secs
With that enabled, the kernel should hopefully tell us which task likely
owns the rwsem. If it is writer-owned, I would expect that to be fairly
reliable.
Cheers, Lance
* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Balbir Singh @ 2026-06-02 2:16 UTC (permalink / raw)
To: Gregory Price
Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman
In-Reply-To: <ahOqzpzAua96HVkn@gourry-fedora-PF4VCD3F>
On Sun, May 24, 2026 at 09:50:06PM -0400, Gregory Price wrote:
> On Thu, May 21, 2026 at 04:23:28PM +1000, Balbir Singh wrote:
> > On Sun, Feb 22, 2026 at 03:48:15AM -0500, Gregory Price wrote:
> > > Topic type: MM
> > >
> > > Presenter: Gregory Price <gourry@gourry.net>
> > >
> > > This series introduces N_MEMORY_PRIVATE, a NUMA node state for memory
> > > managed by the buddy allocator but excluded from normal allocations.
> > >
> > > I present it with an end-to-end Compressed RAM service (mm/cram.c)
> > > that would otherwise not be possible (or would be considerably more
> > > difficult, be device-specific, and add to the ZONE_DEVICE boondoggle).
> > >
> >
> > Do we have updates/notes from the meeting?
> >
>
> I have been on leave since LSF, but I do have some notes posted:
>
> https://lore.kernel.org/linux-mm/af9i7dkNvGGxPHzu@gourry-fedora-PF4VCD3F/
> https://lore.kernel.org/linux-mm/agYJcRgOHho8upVv@gourry-fedora-PF4VCD3F/
>
> I will be trying to post an updated set stripped down without the GFP
> flag as a first pass w/o RFC tags and no UAPI implications so that
> device folks can play with this upstream.
>
> I'm debating on whether to include OPS_MEMPOLICY in the initial version
> if only because it's not intuitive how it interacts with pagecache. That
> needs more time to bake.
>
It makes sense to look at it first and then decide whether to include it.
> > >
> > > page = alloc_pages_node(nid, __GFP_PRIVATE, 0);
> >
> > Do we want to provide kernel level control over allocation of private
> > pages, I assumed that only user space applications? I would assume
> > node affinity would be the way to do so, unless we have multiple
> >
>
> alloc_pages_node() is the kernel interface
I was thinking we wouldn't need explicit flags, and that allocations would
happen from user space using __GFP_THISNODE to the node or via a nodemask
based on the nodes of interest. Is there a reason to add this flag, such as
a system having more than one source of N_MEMORY_PRIVATE?
>
> > >
> > > /* Ok but I want to do something useful with it */
> > > static const struct node_private_ops ops = {
> > > .migrate_to = my_migrate_to,
> > > .folio_migrate = my_folio_migrate,
> > > .flags = NP_OPS_MIGRATION | NP_OPS_MEMPOLICY,
> > > };
> > > node_private_set_ops(nid, &ops);
> > >
> >
> > Could you explain this further? Why do OPS_MIGRATION
> > and OPS_MEMPOLICY need to be set explicitly?
> >
>
> Both of these have been removed from the upcoming version, but in this
> RFC version I was testing OPS_MIGRATION as an explicit flag that meant
> "migrate.c can touch the folios" while OPS_MEMPOLICY meant "mempolicy.c
> can touch the folios".
>
> As it turns out, OPS_MIGRATION is not a useful filter, as it doesn't
> actually filter anything (anything using OPS_MIGRATION would also need
> its own filter flag, so better to just drop it and do per-server
> opt-ins).
>
Thanks,
Balbir
* Re: [RESEND][PATCH v8] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Masami Hiramatsu @ 2026-06-02 2:28 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Linux Trace Kernel, Masami Hiramatsu, Mathieu Desnoyers,
Mark Rutland, Peter Zijlstra, Namhyung Kim, Takaya Saeki,
Douglas Raillard, Tom Zanussi, Andrew Morton, Thomas Gleixner,
Ian Rogers, Jiri Olsa
In-Reply-To: <20260601202546.564e867b@gandalf.local.home>
On Mon, 1 Jun 2026 20:25:46 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> From: Steven Rostedt <rostedt@goodmis.org>
>
> Add syntax to the parsing of eprobes to be able to typecast a trace event
> field that is a pointer to a structure.
>
> Currently, a dereference must be a number, where the user has to figure
> out manually the offset of a member of a structure that they want to
> dereference.
>
> But for event probes that record a field that happens to be a pointer to
> a structure, it cannot dereference these values with BTF naming, but
> must use numerical offsets.
>
> For example, to find out what device a sk_buff is pointing to in the
> net_dev_xmit trace event, one must first use gdb to find the offsets of the
> members of the structures:
>
> (gdb) p &((struct sk_buff *)0)->dev
> $1 = (struct net_device **) 0x10
> (gdb) p &((struct net_device *)0)->name
> $2 = (char (*)[16]) 0x118
>
> And then use the raw numbers to dereference:
>
> # echo 'e:xmit net.net_dev_xmit +0x118(+0x10($skbaddr)):string' >> dynamic_events
>
> If BTF is in the kernel, then instead, the skbaddr can be typecast to
> sk_buff and use the normal dereference logic.
>
> # echo 'e:xmit net.net_dev_xmit (sk_buff)skbaddr->dev->name:string' >> dynamic_events
> # echo 1 > events/eprobes/xmit/enable
> # cat trace
> [..]
> sshd-session-1022 [000] b..2. 860.249343: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.250061: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.250142: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.263553: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.283820: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.302716: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.322905: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.342828: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.362268: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.382335: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.400856: xmit: (net.net_dev_xmit) arg1="enp7s0"
> sshd-session-1022 [000] b..2. 860.419893: xmit: (net.net_dev_xmit) arg1="enp7s0"
>
> The syntax is simply: (STRUCT)(FIELD)->MEMBER[->MEMBER..]
>
> Also add comments around the #else and #endif of #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
> to know what they are for.
>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
>
> [ Resend with base-id below, maybe Sashiko will apply it to the correct tree! ]
Sashiko still failed to apply this... Not sure why.
https://sashiko.dev/#/message/20260601202546.564e867b%40gandalf.local.home
Maybe better to configure Sashiko via github or sashiko-ml?
https://github.com/sashiko-dev/sashiko/blob/main/MAINTAINERS_GUIDE.md
Anyway, at least for me, this looks good.
Thanks,
>
> base-id: 585abc02be3d3ab82fbcc4dbcbbf0ceb61a02129
>
> Changes since v7: https://patch.msgid.link/20260529110442.0967a64c@fedora
>
> - Add error message in parse_btf_args() for failed parsing of TEVENT.
> (Sashiko)
>
> - Remove TPARG_FL_TYPECAST and just use ctx->struct_btf instead.
> The flag was redundant and added unnecessary complexity.
>
> - Restructure to keep the lifetime of the TYPECAST to the end of
> traceprobe_parse_probe_arg_body(). This allows the last_type to stay
> around in case there's not a type parameter and then btf can still be
> used.
> (Sashiko and Masami Hiramatsu)
>
> Documentation/trace/eprobetrace.rst | 4 +
> kernel/trace/trace_probe.c | 173 +++++++++++++++++++++++-----
> kernel/trace/trace_probe.h | 5 +-
> 3 files changed, 154 insertions(+), 28 deletions(-)
>
> diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
> index 89b5157cfab8..fe3602540569 100644
> --- a/Documentation/trace/eprobetrace.rst
> +++ b/Documentation/trace/eprobetrace.rst
> @@ -46,6 +46,10 @@ Synopsis of eprobe_events
> (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
> "string", "ustring", "symbol", "symstr" and "bitfield" are
> supported.
> + (STRUCT)FIELD->MEMBER[->MEMBER] : If BTF is supported, typecast FIELD to
> + a pointer to STRUCT and then dereference the pointer defined by
> + ->MEMBER. Note that when this is used, the FIELD name does not
> + need to be prefixed with a '$'.
>
> Types
> -----
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index 695310571b08..fd1caa1f9723 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -332,6 +332,23 @@ static int parse_trace_event_arg(char *arg, struct fetch_insn *code,
> return -ENOENT;
> }
>
> +static int parse_trace_event(char *arg, struct fetch_insn *code,
> + struct traceprobe_parse_context *ctx)
> +{
> + int ret;
> +
> + if (code->data)
> + return -EFAULT;
> + ret = parse_trace_event_arg(arg, code, ctx);
> + if (!ret)
> + return 0;
> + if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
> + code->op = FETCH_OP_COMM;
> + return 0;
> + }
> + return -EINVAL;
> +}
> +
> #ifdef CONFIG_PROBE_EVENTS_BTF_ARGS
>
> static u32 btf_type_int(const struct btf_type *t)
> @@ -376,11 +393,16 @@ static bool btf_type_is_char_array(struct btf *btf, const struct btf_type *type)
> && BTF_INT_BITS(intdata) == 8;
> }
>
> +static struct btf *ctx_btf(struct traceprobe_parse_context *ctx)
> +{
> + return ctx->struct_btf ? : ctx->btf;
> +}
> +
> static int check_prepare_btf_string_fetch(char *typename,
> struct fetch_insn **pcode,
> struct traceprobe_parse_context *ctx)
> {
> - struct btf *btf = ctx->btf;
> + struct btf *btf = ctx_btf(ctx);
>
> if (!btf || !ctx->last_type)
> return 0;
> @@ -506,6 +528,15 @@ static int query_btf_context(struct traceprobe_parse_context *ctx)
> return 0;
> }
>
> +static void clear_struct_btf(struct traceprobe_parse_context *ctx)
> +{
> + if (ctx->struct_btf) {
> + btf_put(ctx->struct_btf);
> + ctx->struct_btf = NULL;
> + ctx->last_struct = NULL;
> + }
> +}
> +
> static void clear_btf_context(struct traceprobe_parse_context *ctx)
> {
> if (ctx->btf) {
> @@ -554,22 +585,29 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
> struct fetch_insn *code = *pcode;
> const struct btf_member *field;
> u32 bitoffs, anon_offs;
> + bool is_struct = ctx->struct_btf != NULL;
> + struct btf *btf = ctx_btf(ctx);
> char *next;
> int is_ptr;
> s32 tid;
>
> do {
> - /* Outer loop for solving arrow operator ('->') */
> - if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
> - trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
> - return -EINVAL;
> - }
> - /* Convert a struct pointer type to a struct type */
> - type = btf_type_skip_modifiers(ctx->btf, type->type, &tid);
> - if (!type) {
> - trace_probe_log_err(ctx->offset, BAD_BTF_TID);
> - return -EINVAL;
> + if (!is_struct) {
> + /* Outer loop for solving arrow operator ('->') */
> + if (BTF_INFO_KIND(type->info) != BTF_KIND_PTR) {
> + trace_probe_log_err(ctx->offset, NO_PTR_STRCT);
> + return -EINVAL;
> + }
> +
> + /* Convert a struct pointer type to a struct type */
> + type = btf_type_skip_modifiers(btf, type->type, &tid);
> + if (!type) {
> + trace_probe_log_err(ctx->offset, BAD_BTF_TID);
> + return -EINVAL;
> + }
> }
> + /* Only the first type can skip being a pointer */
> + is_struct = false;
>
> bitoffs = 0;
> do {
> @@ -580,7 +618,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
> return is_ptr;
>
> anon_offs = 0;
> - field = btf_find_struct_member(ctx->btf, type, fieldname,
> + field = btf_find_struct_member(btf, type, fieldname,
> &anon_offs);
> if (IS_ERR(field)) {
> trace_probe_log_err(ctx->offset, BAD_BTF_TID);
> @@ -602,7 +640,7 @@ static int parse_btf_field(char *fieldname, const struct btf_type *type,
> ctx->last_bitsize = 0;
> }
>
> - type = btf_type_skip_modifiers(ctx->btf, field->type, &tid);
> + type = btf_type_skip_modifiers(btf, field->type, &tid);
> if (!type) {
> trace_probe_log_err(ctx->offset, BAD_BTF_TID);
> return -EINVAL;
> @@ -640,7 +678,7 @@ static int parse_btf_arg(char *varname,
> int i, is_ptr, ret;
> u32 tid;
>
> - if (WARN_ON_ONCE(!ctx->funcname))
> + if (WARN_ON_ONCE(!ctx->funcname && !(ctx->flags & TPARG_FL_TEVENT)))
> return -EINVAL;
>
> is_ptr = split_next_field(varname, &field, ctx);
> @@ -653,6 +691,19 @@ static int parse_btf_arg(char *varname,
> return -EOPNOTSUPP;
> }
>
> + if (ctx->flags & TPARG_FL_TEVENT) {
> + ret = parse_trace_event(varname, code, ctx);
> + if (ret < 0) {
> + trace_probe_log_err(ctx->offset, BAD_ATTACH_ARG);
> + return ret;
> + }
> + /* TEVENT is only here via a typecast */
> + if (WARN_ON_ONCE(ctx->struct_btf == NULL))
> + return -EINVAL;
> + type = ctx->last_struct;
> + goto found_type;
> + }
> +
> if (ctx->flags & TPARG_FL_RETURN && !strcmp(varname, "$retval")) {
> code->op = FETCH_OP_RETVAL;
> /* Check whether the function return type is not void */
> @@ -709,6 +760,7 @@ static int parse_btf_arg(char *varname,
>
> found:
> type = btf_type_skip_modifiers(ctx->btf, tid, &tid);
> +found_type:
> if (!type) {
> trace_probe_log_err(ctx->offset, BAD_BTF_TID);
> return -EINVAL;
> @@ -727,7 +779,7 @@ static int parse_btf_arg(char *varname,
> static const struct fetch_type *find_fetch_type_from_btf_type(
> struct traceprobe_parse_context *ctx)
> {
> - struct btf *btf = ctx->btf;
> + struct btf *btf = ctx_btf(ctx);
> const char *typestr = NULL;
>
> if (btf && ctx->last_type)
> @@ -758,7 +810,67 @@ static int parse_btf_bitfield(struct fetch_insn **pcode,
> return 0;
> }
>
> -#else
> +static int query_btf_struct(const char *sname, struct traceprobe_parse_context *ctx)
> +{
> + struct btf *btf = NULL;
> + int id;
> +
> + /* A struct_btf should only be used by a single argument */
> + if (WARN_ON_ONCE(ctx->struct_btf)) {
> + btf_put(ctx->struct_btf);
> + ctx->struct_btf = NULL;
> + }
> +
> + id = bpf_find_btf_id(sname, BTF_KIND_STRUCT, &btf);
> + if (id < 0)
> + return id;
> + ctx->struct_btf = btf;
> + ctx->last_struct = btf_type_by_id(ctx->struct_btf, id);
> + return 0;
> +}
> +
> +static int handle_typecast(char *arg, struct fetch_insn **pcode,
> + struct fetch_insn *end,
> + struct traceprobe_parse_context *ctx)
> +{
> + char *tmp;
> + int ret;
> +
> + /* Currently this only works for eprobes */
> + if (!(ctx->flags & TPARG_FL_TEVENT)) {
> + trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
> + return -EINVAL;
> + }
> +
> + tmp = strchr(arg, ')');
> + if (!tmp) {
> + trace_probe_log_err(ctx->offset + strlen(arg),
> + DEREF_OPEN_BRACE);
> + return -EINVAL;
> + }
> + *tmp = '\0';
> + ret = query_btf_struct(arg + 1, ctx);
> + *tmp = ')';
> +
> + if (ret < 0) {
> + trace_probe_log_err(ctx->offset + 1, NO_PTR_STRCT);
> + return -EINVAL;
> + }
> +
> + tmp++;
> +
> + ctx->offset += tmp - arg;
> + ret = parse_btf_arg(tmp, pcode, end, ctx);
> + return ret;
> +}
> +
> +#else /* !CONFIG_PROBE_EVENTS_BTF_ARGS */
> +
> +static void clear_struct_btf(struct traceprobe_parse_context *ctx)
> +{
> + ctx->struct_btf = NULL;
> +}
> +
> static void clear_btf_context(struct traceprobe_parse_context *ctx)
> {
> ctx->btf = NULL;
> @@ -794,7 +906,15 @@ static int check_prepare_btf_string_fetch(char *typename,
> return 0;
> }
>
> -#endif
> +static int handle_typecast(char *arg, struct fetch_insn **pcode,
> + struct fetch_insn *end,
> + struct traceprobe_parse_context *ctx)
> +{
> + trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
> + return -EOPNOTSUPP;
> +}
> +
> +#endif /* CONFIG_PROBE_EVENTS_BTF_ARGS */
>
> #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API
>
> @@ -948,16 +1068,9 @@ static int parse_probe_vars(char *orig_arg, const struct fetch_type *t,
> int len;
>
> if (ctx->flags & TPARG_FL_TEVENT) {
> - if (code->data)
> - return -EFAULT;
> - ret = parse_trace_event_arg(arg, code, ctx);
> - if (!ret)
> - return 0;
> - if (strcmp(arg, "comm") == 0 || strcmp(arg, "COMM") == 0) {
> - code->op = FETCH_OP_COMM;
> - return 0;
> - }
> - goto inval;
> + if (parse_trace_event(arg, code, ctx) < 0)
> + goto inval;
> + return 0;
> }
>
> if (str_has_prefix(arg, "retval")) {
> @@ -1224,6 +1337,9 @@ parse_probe_arg(char *arg, const struct fetch_type *type,
> code->op = FETCH_OP_IMM;
> }
> break;
> + case '(':
> + ret = handle_typecast(arg, pcode, end, ctx);
> + break;
> default:
> if (isalpha(arg[0]) || arg[0] == '_') { /* BTF variable */
> if (!tparg_is_function_entry(ctx->flags) &&
> @@ -1556,6 +1672,9 @@ static int traceprobe_parse_probe_arg_body(const char *argv, ssize_t *size,
> }
> kfree(tmp);
>
> + /* struct_btf should not be passed to other arguments */
> + clear_struct_btf(ctx);
> +
> return ret;
> }
>
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 1076f1df347b..15758cc11fc6 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -422,7 +422,9 @@ struct traceprobe_parse_context {
> const struct btf_param *params; /* Parameter of the function */
> s32 nr_params; /* The number of the parameters */
> struct btf *btf; /* The BTF to be used */
> + struct btf *struct_btf; /* The BTF to be used for structs */
> const struct btf_type *last_type; /* Saved type */
> + const struct btf_type *last_struct; /* Saved structure */
> u32 last_bitoffs; /* Saved bitoffs */
> u32 last_bitsize; /* Saved bitsize */
> struct trace_probe *tp;
> @@ -563,7 +565,8 @@ extern int traceprobe_define_arg_fields(struct trace_event_call *event_call,
> C(NEED_STRING_TYPE, "$comm and immediate-string only accepts string type"),\
> C(TOO_MANY_ARGS, "Too many arguments are specified"), \
> C(TOO_MANY_EARGS, "Too many entry arguments specified"), \
> - C(EVENT_TOO_BIG, "Event too big (too many fields?)"),
> + C(EVENT_TOO_BIG, "Event too big (too many fields?)"), \
> + C(TYPECAST_NOT_EVENT, "Typecasts are only for eprobe fields"),
>
> #undef C
> #define C(a, b) TP_ERR_##a
> --
> 2.53.0
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH v8 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Miaohe Lin @ 2026-06-02 3:08 UTC (permalink / raw)
To: David Hildenbrand (Arm), Breno Leitao
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team, Lance Yang, Andrew Morton,
Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <e3d023f1-ab6e-4424-b304-55f1294480c3@kernel.org>
On 2026/6/1 21:22, David Hildenbrand (Arm) wrote:
> On 6/1/26 14:28, Miaohe Lin wrote:
>> On 2026/5/27 22:06, Breno Leitao wrote:
>>> get_any_page() collapses every HWPoisonHandlable() rejection into a
>>> single -EIO via the __get_hwpoison_page() -> -EBUSY -> shake_page()
>>> -> retry path. That is correct for the transient case (a userspace
>>> folio briefly off LRU during migration or compaction, which a later
>>> shake can drag back), but wrong for stable kernel-owned pages: slab,
>>> page-table, large-kmalloc and PG_reserved pages will never become
>>> HWPoisonHandlable(), so the retry loop is wasted work and the final
>>> -EIO loses the "this is structurally unrecoverable" information.
>>> memory_failure() then maps -EIO into MF_MSG_GET_HWPOISON, which the
>>> panic-on-unrecoverable sysctl deliberately does not act on.
>>>
>>> Introduce HWPoisonKernelOwned(), a small predicate that positively
>>> identifies pages the hwpoison handler cannot recover from:
>>>
>>> HWPoisonKernelOwned(p, flags) :=
>>> !(MF_SOFT_OFFLINE && page_has_movable_ops(p)) &&
>>> (PageReserved(p) || PageSlab(p) ||
>>> PageTable(p) || PageLargeKmalloc(p))
>>>
>>> The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors the
>>> same exception in HWPoisonHandlable(): soft-offline is allowed to
>>> migrate movable_ops pages even though they are not on the LRU, and
>>> we must not pre-empt that with an unrecoverable verdict.
>>>
>>> The list is intentionally not exhaustive. vmalloc and kernel-stack
>>> pages, for example, do not carry a page_type bit and would need a
>>> different oracle; they keep going through the existing retry path
>>> unchanged. This is the smallest set we can identify with certainty
>>> by page type.
>>>
>>> Wire the helper into the top of get_any_page() to short-circuit
>>> those pages before the retry loop runs. On a hit, drop the caller's
>>> MF_COUNT_INCREASED reference (if any) and return -ENOTRECOVERABLE
>>> straight away. Pages outside the helper's positive list still take
>>> the existing retry path and return -EIO, leaving operator-visible
>>> behaviour for those cases unchanged.
>>>
>>> Extend the unhandlable-page pr_err() to fire for either errno and
>>> update the get_hwpoison_page() kerneldoc to document the new return.
>>>
>>> memory_failure() still folds every negative return into
>>> MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
>>> this patch on its own only changes the errno that soft_offline_page()
>>> can propagate to its callers. A follow-up wires -ENOTRECOVERABLE
>>> through memory_failure() and reports MF_MSG_KERNEL for the
>>> unrecoverable cases, which is what the
>>> panic_on_unrecoverable_memory_failure sysctl observes.
>>
>> Thanks for your patch.
>>
>>>
>>> Suggested-by: David Hildenbrand <david@kernel.org>
>>> Suggested-by: Lance Yang <lance.yang@linux.dev>
>>> Signed-off-by: Breno Leitao <leitao@debian.org>
>>> ---
>>> mm/memory-failure.c | 42 ++++++++++++++++++++++++++++++++++++++++--
>>> 1 file changed, 40 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index f4d3e6e20e13..8f63bdfeff8f 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -1325,6 +1325,28 @@ static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
>>> return PageLRU(page) || is_free_buddy_page(page);
>>> }
>>>
>>> +/*
>>> + * Positive identification of pages the hwpoison handler cannot recover.
>>> + * These page types are owned by kernel internals (no userspace mapping
>>> + * to unmap, no file mapping to invalidate, no migration target), so the
>>> + * shake_page() / retry loop in get_any_page() can never turn them into
>>> + * something HWPoisonHandlable() will accept. Short-circuit them to
>>> + * -ENOTRECOVERABLE so callers can panic on operator request instead of
>>> + * spinning through retries that exit as a transient-looking -EIO.
>>> + *
>>> + * The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors
>>> + * HWPoisonHandlable(): soft-offline is allowed to migrate movable_ops
>>> + * pages even though they are not on the LRU.
>>> + */
>>> +static inline bool HWPoisonKernelOwned(struct page *page, unsigned long flags)
>>> +{
>>> + if ((flags & MF_SOFT_OFFLINE) && page_has_movable_ops(page))
>>> + return false;
>>> +
>>> + return PageReserved(page) || PageSlab(page) ||
>>
>> Once shake_page finds a lightweight range-based way to shrink slab, slab pages could be freed
>> into buddy and above PageSlab test should be removed then. Maybe add a TODO or XXX here?
>>
>>> + PageTable(page) || PageLargeKmalloc(page);
>>
>> I'm not sure but is it safe or a common way to test PageReserved, PageSlab,
>> PageTable and PageLargeKmalloc without extra page refcnt?
>
> Checking typed pages in a racy fashion is fine (PageSlab, PageTable,
> PageLargeKmalloc).
Got it. Thanks.
> Checking PageReserved in a racy fashion is fine as well. TESTPAGEFLAG() will
> allow checking it on compound pages.
It seems PageReserved is not intended to be set on compound pages. I see there is PF_NO_COMPOUND
in its definition: PAGEFLAG(Reserved, reserved, PF_NO_COMPOUND).
>
> For PageLargeKmalloc, we would want to check the head page, though. The page
> type is only stored for the head page.
Maybe we should check the head page for PageSlab and PageTable too? alloc_slab_page only
sets PageSlab on the head page and __pagetable_ctor uses __folio_set_pgtable to set PageTable
on folio.
>
> So maybe we want to lookup the compound head (if any) and perform the type
> checks against that?
Maybe we should or we might miss some pages that could have been handled. And
if compound head is required, should we hold an extra page refcnt to guard against
possible folio split race?
Thanks.
.
^ permalink raw reply
* Re: [PATCH v8 3/6] mm/memory-failure: report MF_MSG_KERNEL for unrecoverable kernel pages
From: Miaohe Lin @ 2026-06-02 3:31 UTC (permalink / raw)
To: Breno Leitao
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <20260527-ecc_panic-v8-3-9ea0cfa16bb0@debian.org>
On 2026/5/27 22:06, Breno Leitao wrote:
> The previous patch teaches get_any_page() to return -ENOTRECOVERABLE
> for stable unhandlable kernel pages (PG_reserved, slab, page tables,
> large-kmalloc). memory_failure() still folds every negative return
> into MF_MSG_GET_HWPOISON, so callers that want to react to the
> unrecoverable cases (a panic option, smarter logging) cannot tell
> them apart from transient page-allocator races.
>
> Turn the post-call branch into a switch over the get_hwpoison_page()
> return code: map -ENOTRECOVERABLE to MF_MSG_KERNEL and any other
> negative return to MF_MSG_GET_HWPOISON. case 0 keeps the existing
> free-buddy / kernel-high-order handling and case 1 falls through to
> the rest of memory_failure() unchanged.
>
> The MF_MSG_KERNEL label and tracepoint string are kept as
> "reserved kernel page" to avoid breaking userspace tools that match
> on those literals; the enum value still adequately tags the failure
> even though it now also covers slab, page tables and large-kmalloc
> pages.
>
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Thanks.
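
For readers following along, the return-code mapping described in the quoted commit message can be modelled in userspace roughly as below. This is only a sketch under stated assumptions: the toy_ enum and toy_classify() are hypothetical names, not kernel symbols, and the real memory_failure() does considerably more work in the 0 and 1 cases than this model shows.

```c
#include <errno.h>

/* Hypothetical stand-in for the message types named in the commit message. */
enum toy_mf_msg {
	TOY_MSG_NONE,		/* not classified at this step */
	TOY_MSG_KERNEL,		/* models MF_MSG_KERNEL */
	TOY_MSG_GET_HWPOISON,	/* models MF_MSG_GET_HWPOISON */
};

/* Model of the switch over the get_hwpoison_page() return code. */
enum toy_mf_msg toy_classify(int res)
{
	switch (res) {
	case -ENOTRECOVERABLE:
		/* Stable kernel-owned page: reported as unrecoverable. */
		return TOY_MSG_KERNEL;
	case 0:
	case 1:
		/* Handled by later stages, which this model leaves out. */
		return TOY_MSG_NONE;
	default:
		/* Any other negative return keeps the old classification. */
		return res < 0 ? TOY_MSG_GET_HWPOISON : TOY_MSG_NONE;
	}
}
```

The point of the model is only that -ENOTRECOVERABLE gets its own label while every other negative errno keeps the old one.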
.
^ permalink raw reply
* Re: [PATCH v8 4/6] mm/memory-failure: add panic option for unrecoverable pages
From: Miaohe Lin @ 2026-06-02 7:05 UTC (permalink / raw)
To: Breno Leitao
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <20260527-ecc_panic-v8-4-9ea0cfa16bb0@debian.org>
On 2026/5/27 22:06, Breno Leitao wrote:
> Add a sysctl panic_on_unrecoverable_memory_failure (disabled by
> default) that triggers a kernel panic when memory_failure()
> encounters pages that cannot be recovered. This provides a clean
> crash with useful debug information rather than allowing silent
> data corruption or a delayed crash at an unrelated code path.
>
> Panic eligibility is intentionally narrow: only MF_MSG_KERNEL with
> result == MF_IGNORED panics. After the previous patch, MF_MSG_KERNEL
> covers PG_reserved pages and the kernel-owned pages promoted from
> get_hwpoison_page() via -ENOTRECOVERABLE (slab, page tables,
> large-kmalloc).
>
> All other action types are excluded:
>
> - MF_MSG_GET_HWPOISON and MF_MSG_KERNEL_HIGH_ORDER can be reached by
> transient refcount races with the page allocator (an in-flight buddy
> allocation has refcount 0 and is no longer on the buddy free list,
> briefly), and panicking on them would risk killing the box for what
> is actually a recoverable userspace page.
>
> - MF_MSG_UNKNOWN means identify_page_state() could not classify the
> page; that is precisely the wrong basis for a panic decision.
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> mm/memory-failure.c | 23 +++++++++++++++++++++++
> 1 file changed, 23 insertions(+)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 14c0a958638c..dcd53dbc6aec 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -74,6 +74,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;
>
> static int sysctl_enable_soft_offline __read_mostly = 1;
>
> +static int sysctl_panic_on_unrecoverable_mf __read_mostly;
> +
> atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
>
> static bool hw_memory_failure __read_mostly = false;
> @@ -155,6 +157,15 @@ static const struct ctl_table memory_failure_table[] = {
> .proc_handler = proc_dointvec_minmax,
> .extra1 = SYSCTL_ZERO,
> .extra2 = SYSCTL_ONE,
> + },
> + {
> + .procname = "panic_on_unrecoverable_memory_failure",
> + .data = &sysctl_panic_on_unrecoverable_mf,
> + .maxlen = sizeof(sysctl_panic_on_unrecoverable_mf),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = SYSCTL_ZERO,
> + .extra2 = SYSCTL_ONE,
> }
> };
>
> @@ -1255,6 +1266,15 @@ static void update_per_node_mf_stats(unsigned long pfn,
> ++mf_stats->total;
> }
>
> +static bool panic_on_unrecoverable_mf(enum mf_action_page_type type,
> + enum mf_result result)
> +{
> + if (!sysctl_panic_on_unrecoverable_mf || result != MF_IGNORED)
> + return false;
> +
> + return type == MF_MSG_KERNEL;
Would it be more straightforward to write it as something like:
	if (!sysctl_panic_on_unrecoverable_mf)
		return false;
	return (type == MF_MSG_KERNEL && result == MF_IGNORED);
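
For comparison, here is a minimal userspace sketch of the two forms. It assumes stand-in enums in place of the kernel's mf_result and mf_action_page_type; the toy_ names and the shortened enum lists are illustrative, not the kernel definitions. Both predicates should agree for every input, so the choice between them is about readability only.

```c
#include <stdbool.h>

/* Illustrative stand-ins; the kernel enums have more members than this. */
enum toy_mf_result {
	TOY_MF_IGNORED,
	TOY_MF_FAILED,
	TOY_MF_DELAYED,
	TOY_MF_RECOVERED,
	TOY_MF_RESULT_MAX,
};

enum toy_mf_type {
	TOY_MF_MSG_KERNEL,
	TOY_MF_MSG_KERNEL_HIGH_ORDER,
	TOY_MF_MSG_GET_HWPOISON,
	TOY_MF_MSG_UNKNOWN,
	TOY_MF_TYPE_MAX,
};

/* Models sysctl_panic_on_unrecoverable_mf: 0 = disabled, 1 = enabled. */
int toy_sysctl_panic;

/* Form used in the patch: sysctl and result are rejected together. */
bool toy_panic_patch(enum toy_mf_type type, enum toy_mf_result result)
{
	if (!toy_sysctl_panic || result != TOY_MF_IGNORED)
		return false;

	return type == TOY_MF_MSG_KERNEL;
}

/* Suggested form: sysctl gate first, then type and result side by side. */
bool toy_panic_suggested(enum toy_mf_type type, enum toy_mf_result result)
{
	if (!toy_sysctl_panic)
		return false;

	return type == TOY_MF_MSG_KERNEL && result == TOY_MF_IGNORED;
}
```

The second form keeps the two conditions that define panic eligibility on one line, which is the readability argument being made.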
Thanks.
.
^ permalink raw reply
* Re: [PATCH v4 06/13] rv: Do not rely on clean monitor when initialising HA
From: Nam Cao @ 2026-06-02 8:52 UTC (permalink / raw)
To: Gabriele Monaco, linux-kernel, Steven Rostedt, Gabriele Monaco,
linux-trace-kernel
Cc: Wen Yang
In-Reply-To: <20260601153840.124372-7-gmonaco@redhat.com>
Gabriele Monaco <gmonaco@redhat.com> writes:
> Hybrid Automata monitors hook into the DA implementation when doing
> da_monitor_reset(). This function is called both on initialisation and
> teardown, HA monitors try to cancel a timer only when it's initialised
> relying on the da_mon->monitoring flag. This flag could however be
> corrupted during initialisation. This happens for instance on per-task
> monitors that share the same storage with different type of monitors
> like LTL or in case of races during a previous teardown.
>
> Stop relying on the monitoring flag during initialisation, assume that
> it can have any value, so use a separate da_reset_state() skipping timer
> cancellation.
> New monitors (e.g. new tasks) are always zero-initialised so it is safe
> to rely on the monitoring flag for those.
>
> Reported-by: Wen Yang <wen.yang@linux.dev>
> Closes: https://lore.kernel.org/lkml/d02c656aada7d071f083460a5c9a454363669b61.1778522945.git.wen.yang@linux.dev
> Suggested-by: Nam Cao <namcao@linutronix.de>
> Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
> Reviewed-by: Wen Yang <wen.yang@linux.dev>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Reviewed-by: Nam Cao <namcao@linutronix.de>
^ permalink raw reply
* Re: [PATCH v7 07/42] KVM: guest_memfd: Only prepare folios for private pages
From: Suzuki K Poulose @ 2026-06-02 8:55 UTC (permalink / raw)
To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-7-2f0fae496530@google.com>
On 23/05/2026 01:17, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
>
> All-shared guest_memfd used to be only supported for non-CoCo VMs where
> preparation doesn't apply. INIT_SHARED is about to be supported for
> non-CoCo VMs in a later patch in this series.
nit: s/non-CoCo/CoCo ?
>
> In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
> guest_memfd in a later patch in this series.
>
> This means that the kvm fault handler may now call kvm_gmem_get_pfn() on a
> shared folio for a CoCo VM where preparation applies.
>
> Add a check to make sure that preparation is only performed for private
> folios.
>
> Preparation will be undone on freeing (see kvm_gmem_free_folio()) and on
> conversion to shared.
>
> Signed-off-by: Michael Roth <michael.roth@amd.com>
nit: Missing Co-Developed-by: ?
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
> virt/kvm/guest_memfd.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 78e5435967341..adf57a3a1f5dd 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -894,6 +894,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> int *max_order)
> {
> pgoff_t index = kvm_gmem_get_index(slot, gfn);
> + struct inode *inode;
> struct folio *folio;
> int r = 0;
>
> @@ -901,7 +902,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> if (!file)
> return -EFAULT;
>
> - filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> + inode = file_inode(file);
> + filemap_invalidate_lock_shared(inode->i_mapping);
>
> folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
> if (IS_ERR(folio)) {
> @@ -914,7 +916,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> folio_mark_uptodate(folio);
> }
>
> - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> + if (kvm_gmem_is_private_mem(inode, index))
Don't we need to make sure the entire folio is private ? Not just the
page at the index ?
if (kvm_gmem_range_is_private(, index, folio_nr_pages(folio)) ?
Suzuki
> + r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>
> folio_unlock(folio);
>
> @@ -924,7 +927,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> folio_put(folio);
>
> out:
> - filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
> + filemap_invalidate_unlock_shared(inode->i_mapping);
> return r;
> }
> EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
>
^ permalink raw reply
* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-06-02 8:57 UTC (permalink / raw)
To: Balbir Singh
Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
terry.bowman
In-Reply-To: <ah47NNhuiClgGCdn@parvat>
On Tue, Jun 02, 2026 at 12:16:50PM +1000, Balbir Singh wrote:
> On Sun, May 24, 2026 at 09:50:06PM -0400, Gregory Price wrote:
> >
> > I'm debating on whether to include OPS_MEMPOLICY in the initial version
> > if only because it's not intuitive how it interacts with pagecache. That
> > needs more time to bake.
> >
>
> It makes sense to look at it and then decide if it makes sense.
>
I am thinking I will ship without any OPS flags at all for now and then
have the introduction of ops as a separate series.
> > alloc_pages_node() is the kernel interface
>
> I was thinking we wouldn't need explicit flags and that allocations would
> happen from user space using __GFP_THISNODE to the node or via a nodemask
> based on nodes of interest. Is there a reason to add this flag, a system
> might have more than one source of N_MEMORY_PRIVATE?
>
There's a few things to unpack here. I discussed this many times on
list and at LSF, but to reiterate.
1) __GFP_THISNODE is insufficient to enforce isolation and otherwise
not particularly useful. Additionally, from userland, it's not
something you can actually set.
for node in possible_nodes:
alloc_pages_node(private_node, __GFP_THISNODE)
In fact it's the opposite semantic of what we want.
THISNODE says: "Do not fall back to OTHER nodes".
The semantic we want is "Do not allow allocations from private
nodes UNLESS we specifically request" (__GFP_PRIVATE).
__GFP_THISNODE does not actually buy you anything here, AND it's
worse, in the scenario where a private node makes its way into the
preferred slot (via possible_nodes or some other nodemask), the
allocator cannot fall back to a node it can access.
__GFP_THISNODE cannot be overloaded to do anything useful here.
2) We're trying not to expose *ANY* userland APIs for this, at all.
The ultimate goal here should be one of two things:
1) fd = open(/dev/xxx, ...);
mem = mmap(fd, ...);
mem[0] = 0xDEADBEEF; /* Fault device page into page table */
In this case, the driver is responsible for doing the
alloc_pages_node() call.
or
2) mem = mmap(NULL, ..., ANON);
mbind(mem, ..., private_node);
mem[0] = 0xDEADBEEF; /* Fault device page into page table */
in this case mempolicy.c is responsible for doing the
alloc_pages_node() call via the _mpol() alloc variants.
Additional OPT flags (reclaim, compaction, whatever), would
(optionally) allow mm/ to operate on the device memory with, for
example, mmu_notifier callbacks to tell the device to invalidate
whatever it's caching about that page.
This would all be relatively transparent to userland; all userland
"knows" is that it's getting memory from a device (/dev/xxx) or a
node it's otherwise aware of hosting device memory somehow.
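
To make the THISNODE vs. private-node distinction in point 1 concrete, here is a toy userspace model. It is a sketch under loud assumptions: toy_alloc(), TOY_THISNODE and TOY_PRIVATE are made-up names, the fallback order is a simple wrap-around, and none of this is the actual page allocator or the RFC's implementation. It only shows why a "no fallback" flag is a different semantic from an "opt in to private nodes" flag.

```c
#include <stdbool.h>

#define TOY_THISNODE	0x1u	/* models "do not fall back to other nodes" */
#define TOY_PRIVATE	0x2u	/* models an explicit opt-in to private nodes */

struct toy_node {
	bool is_private;	/* node hosts device/private memory */
	bool has_free;		/* node has a free page to hand out */
};

/*
 * Returns the node id the allocation is satisfied from, or -1 on failure.
 * Private nodes are skipped unless TOY_PRIVATE is set; TOY_THISNODE stops
 * the search after the preferred node.
 */
int toy_alloc(const struct toy_node *nodes, int nr, int preferred,
	      unsigned int flags)
{
	for (int i = 0; i < nr; i++) {
		int nid = (preferred + i) % nr;
		bool allowed = !nodes[nid].is_private || (flags & TOY_PRIVATE);

		if (allowed && nodes[nid].has_free)
			return nid;
		if (flags & TOY_THISNODE)
			break;	/* no fallback permitted */
	}
	return -1;
}
```

In this model, with node 0 normal and node 1 private, a plain request that prefers node 1 falls back to node 0, the same request with only TOY_THISNODE fails outright, and only TOY_PRIVATE lets the allocation land on node 1.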
~Gregory
^ permalink raw reply
* Re: [PATCH v7 07/42] KVM: guest_memfd: Only prepare folios for private pages
From: Suzuki K Poulose @ 2026-06-02 9:10 UTC (permalink / raw)
To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <d01cf1ec-b85d-4af6-9810-8107c0e2a4ec@arm.com>
On 02/06/2026 09:55, Suzuki K Poulose wrote:
> On 23/05/2026 01:17, Ackerley Tng via B4 Relay wrote:
>> From: Ackerley Tng <ackerleytng@google.com>
>>
>> All-shared guest_memfd used to be only supported for non-CoCo VMs where
>> preparation doesn't apply. INIT_SHARED is about to be supported for
>> non-CoCo VMs in a later patch in this series.
>
> nit: s/non-CoCo/CoCo ?
>
>>
>> In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
>> guest_memfd in a later patch in this series.
>>
>> This means that the kvm fault handler may now call kvm_gmem_get_pfn()
>> on a
>> shared folio for a CoCo VM where preparation applies.
>>
>> Add a check to make sure that preparation is only performed for private
>> folios.
>>
>> Preparation will be undone on freeing (see kvm_gmem_free_folio()) and on
>> conversion to shared.
>>
>> Signed-off-by: Michael Roth <michael.roth@amd.com>
>
> nit: Missing Co-Developed-by: ?
>
>> Reviewed-by: Fuad Tabba <tabba@google.com>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> ---
>> virt/kvm/guest_memfd.c | 9 ++++++---
>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 78e5435967341..adf57a3a1f5dd 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -894,6 +894,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct
>> kvm_memory_slot *slot,
>> int *max_order)
>> {
>> pgoff_t index = kvm_gmem_get_index(slot, gfn);
>> + struct inode *inode;
>> struct folio *folio;
>> int r = 0;
>> @@ -901,7 +902,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct
>> kvm_memory_slot *slot,
>> if (!file)
>> return -EFAULT;
>> - filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
>> + inode = file_inode(file);
>> + filemap_invalidate_lock_shared(inode->i_mapping);
>> folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
>> if (IS_ERR(folio)) {
>> @@ -914,7 +916,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct
>> kvm_memory_slot *slot,
>> folio_mark_uptodate(folio);
>> }
>> - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>> + if (kvm_gmem_is_private_mem(inode, index))
>
> Don't we need to make sure the entire folio is private ? Not just the
> page at the index ?
> if (kvm_gmem_range_is_private(, index, folio_nr_pages(folio)) ?
Or rather, we should go through the individual pages and apply the
prepare for ones that are private ?
Suzuki
>
> Suzuki
>
>> + r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>> folio_unlock(folio);
>> @@ -924,7 +927,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct
>> kvm_memory_slot *slot,
>> folio_put(folio);
>> out:
>> - filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
>> + filemap_invalidate_unlock_shared(inode->i_mapping);
>> return r;
>> }
>> EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
>>
>
^ permalink raw reply
* Re: [PATCH v4 08/13] rv: Ensure synchronous cleanup for HA monitors
From: Nam Cao @ 2026-06-02 9:17 UTC (permalink / raw)
To: Gabriele Monaco, linux-kernel, Steven Rostedt, Gabriele Monaco,
linux-trace-kernel
Cc: Wen Yang
In-Reply-To: <20260601153840.124372-9-gmonaco@redhat.com>
Gabriele Monaco <gmonaco@redhat.com> writes:
> HA monitors may start timers, all cleanup functions currently stop the
> timers asynchronously to avoid sleeping in the wrong context.
> Nothing makes sure running callbacks terminate on cleanup.
>
> Run the entire HA timer callback in an RCU read-side critical section,
> this way we can simply synchronize_rcu() with any pending timer and are
> sure any cleanup using kfree_rcu() runs after callbacks terminated.
> Additionally make sure any unlikely callback running late won't run any
> code if the monitor is marked as disabled or if destruction started.
> Use memory barriers to serialise with racing resets.
>
> Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
> Fixes: 4a24127bd6cb ("rv: Add support for per-object monitors in DA/HA")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Reviewed-by: Nam Cao <namcao@linutronix.de>
^ permalink raw reply
* Re: [PATCH v8 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: David Hildenbrand (Arm) @ 2026-06-02 9:41 UTC (permalink / raw)
To: Miaohe Lin, Breno Leitao
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team, Lance Yang, Andrew Morton,
Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Liam R. Howlett
In-Reply-To: <33ef8821-c809-b7d1-ea77-6e8a07a6e784@huawei.com>
On 6/2/26 05:08, Miaohe Lin wrote:
> On 2026/6/1 21:22, David Hildenbrand (Arm) wrote:
>> On 6/1/26 14:28, Miaohe Lin wrote:
>>>
>>> Thanks for your patch.
>>>
>>>
>>> Once shake_page finds a lightweight range-based way to shrink slab, slab pages could be freed
>>> into buddy and above PageSlab test should be removed then. Maybe add a TODO or XXX here?
>>>
>>>
>>> I'm not sure but is it safe or a common way to test PageReserved, PageSlab,
>>> PageTable and PageLargeKmalloc without extra page refcnt?
>>
>> Checking typed pages in a racy fashion is fine (PageSlab, PageTable,
>> PageLargeKmalloc).
>
> Got it. Thanks.
>
>> Checking PageReserved in a racy fashion is fine as well. TESTPAGEFLAG() will
>> allow checking it on compound pages.
>
> It seems PageReserved is not intended to be set on compound pages. I see there is PF_NO_COMPOUND
> in its definition: PAGEFLAG(Reserved, reserved, PF_NO_COMPOUND).
>
>>
>> For PageLargeKmalloc, we would want to check the head page, though. The page
>> type is only stored for the head page.
>
> Maybe we should check the head page for PageSlab and PageTable too? alloc_slab_page only
> sets PageSlab on the head page and __pagetable_ctor uses __folio_set_pgtable to set PageTable
> on folio.
>
>>
>> So maybe we want to lookup the compound head (if any) and perform the type
>> checks against that?
>
> Maybe we should or we might miss some pages that could have been handled. And
> if compound head is required, should we hold an extra page refcnt to guard against
> possible folio split race?
Races are fine. We might miss some pages, but that can happen with races either way.

I'd just do something like

	if (PageReserved(page))
		return true;
	head = compound_head(page);
	return PageSlab(head) || ...;
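
To make the shape above concrete, here is a small userspace model, purely as a sketch. Everything prefixed with mock_ is a hypothetical stand-in, not the kernel's struct page or its helpers, and the set of types checked after PageSlab is an assumption based on the types discussed earlier in this thread (PageTable, PageLargeKmalloc).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock stand-ins for illustration only; not the kernel's definitions. */
enum mock_type { MOCK_NONE, MOCK_SLAB, MOCK_TABLE, MOCK_LARGE_KMALLOC };

struct mock_page {
	bool reserved;           /* models PageReserved(), a per-page flag */
	enum mock_type type;     /* models the page type, stored on the head only */
	struct mock_page *head;  /* NULL unless this is a tail page */
};

/* Models compound_head(): a tail page resolves to its head page. */
static struct mock_page *mock_compound_head(struct mock_page *page)
{
	return page->head ? page->head : page;
}

/*
 * Racy by design: no reference is taken, so the answer may be stale.
 * The Reserved check uses the page itself, the type checks use the head.
 */
static bool mock_page_is_unhandlable(struct mock_page *page)
{
	struct mock_page *head;

	assert(page != NULL);
	if (page->reserved)
		return true;

	head = mock_compound_head(page);
	return head->type == MOCK_SLAB || head->type == MOCK_TABLE ||
	       head->type == MOCK_LARGE_KMALLOC;
}
```

In this model a tail page whose head is typed as slab is reported the same way as the head itself, which is the case a per-page type check would miss.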
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH mm-unstable v18 05/14] mm/khugepaged: require collapse_huge_page to enter/exit with the lock dropped
From: Nico Pache @ 2026-06-02 10:26 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <ah2Ro54tMDMsPevk@lucifer>
On Mon, Jun 1, 2026 at 8:13 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Fri, May 22, 2026 at 09:00:00AM -0600, Nico Pache wrote:
> > Currently the collapse_huge_page function requires the mmap_read_lock to
> > enter with it held, and exit with it dropped. This function moves the
> > unlock into its parent caller, and changes this semantic to requiring it
> > to enter/exit with it always unlocked.
> >
> > In future patches, we need this expectation, as for in mTHP collapse, we
> > may have already have dropped the lock, and do not want to conditionally
> > check for this by passing through the lock_dropped variable.
> >
> > No functional change is expected as one of the first things the
> > collapse_huge_page function does is drop this lock before allocating the
> > hugepage.
> >
> > Acked-by: David Hildenbrand (Arm) <david@kernel.org>
> > Signed-off-by: Nico Pache <npache@redhat.com>
>
> One small nit below, otherwise LGTM, so:
>
> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Thank you for reviewing!
>
> > ---
> > mm/khugepaged.c | 16 ++++++++--------
> > 1 file changed, 8 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index e98ba5b15163..fab35d318641 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1208,6 +1208,12 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> > return SCAN_SUCCEED;
> > }
> >
> > +/*
> > + * collapse_huge_page expects the mmap_lock to be unlocked before entering and
> > + * will always return with the lock unlocked, to avoid holding the mmap_lock
> > + * while allocating a THP, as that could trigger direct reclaim/compaction.
> > + * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> > + */
> > static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > int referenced, int unmapped, struct collapse_control *cc)
> > {
> > @@ -1223,14 +1229,6 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >
> > VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> > - /*
> > - * Before allocating the hugepage, release the mmap_lock read lock.
> > - * The allocation can take potentially a long time if it involves
> > - * sync compaction, and we do not need to hold the mmap_lock during
> > - * that. We will recheck the vma after taking it again in write mode.
> > - */
> > - mmap_read_unlock(mm);
> > -
>
> NIT: Maybe worth an mmap_assert_locked()?
But it will already be unlocked here. The contract is that we enter
unlocked and exit unlocked.
Cheers,
-- Nico
>
> > result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > if (result != SCAN_SUCCEED)
> > goto out_nolock;
> > @@ -1535,6 +1533,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > out_unmap:
> > pte_unmap_unlock(pte, ptl);
> > if (result == SCAN_SUCCEED) {
> > + /* collapse_huge_page expects the lock to be dropped before calling */
> > + mmap_read_unlock(mm);
> > result = collapse_huge_page(mm, start_addr, referenced,
> > unmapped, cc);
> > /* collapse_huge_page will return with the mmap_lock released */
> > --
> > 2.54.0
> >
>
> Cheers, Lorenzo
>
^ permalink raw reply
* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Nico Pache @ 2026-06-02 10:58 UTC (permalink / raw)
To: Lance Yang
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat, mhocko,
peterx, pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang,
rientjes, rostedt, rppt, ryan.roberts, shivankg, sunnanyong,
surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe
In-Reply-To: <20260531071845.10875-1-lance.yang@linux.dev>
On Sun, May 31, 2026 at 1:19 AM Lance Yang <lance.yang@linux.dev> wrote:
>
>
> On Fri, May 22, 2026 at 09:00:06AM -0600, Nico Pache wrote:
> [...]
> >@@ -1587,10 +1749,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> > if (result == SCAN_SUCCEED) {
> > /* collapse_huge_page expects the lock to be dropped before calling */
> > mmap_read_unlock(mm);
> >- result = collapse_huge_page(mm, start_addr, referenced,
> >- unmapped, cc, HPAGE_PMD_ORDER);
> >- /* collapse_huge_page will return with the mmap_lock released */
> >+ nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> >+ unmapped, cc, enabled_orders);
> >+ /* mmap_lock was released above, set lock_dropped */
> > *lock_dropped = true;
> >+ result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
>
> Hmm ... don't we lose the allocation-failure result here?
>
> Previously collapse_scan_pmd() propagated SCAN_ALLOC_HUGE_PAGE_FAIL from
> collapse_huge_page(), so khugepaged would call khugepaged_alloc_sleep()
> in khugepaged_do_scan().
>
> Now if allocation fails and nr_collapsed stays 0, we just return
> SCAN_FAIL. So we won't back off via khugepaged_alloc_sleep() anymore?
OK, I did the error propagation! I think I handled both of the cases
you brought up pretty easily.

However, I don't know what to do in the following case: we successfully
collapsed some portion of the PMD, but during that process we also
hit an allocation failure. Is it best to back off entirely? Or can we
treat some forward progress as a sign that we can keep trying collapses
without sleeping?

Basically, do we prioritize SCAN_ALLOC_HUGE_PAGE_FAIL or the
successful collapses as the returned value?

This is what I currently have:

done:
	if (collapsed)
		return SCAN_SUCCEED;
	if (alloc_failed)
		return SCAN_ALLOC_HUGE_PAGE_FAIL;
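
To make the two choices easier to compare, here is a standalone sketch of both orderings. The mock_ enum and the function names are hypothetical, and the final fall-through to a plain failure is an assumption, since the snippet above does not show what follows the two checks.

```c
#include <stdbool.h>

/* Mock subset of the scan results named in this thread. */
enum mock_scan_result {
	MOCK_SCAN_FAIL,
	MOCK_SCAN_SUCCEED,
	MOCK_SCAN_ALLOC_HUGE_PAGE_FAIL,
};

/* Option A: any forward progress wins, mirroring the snippet above. */
static enum mock_scan_result mock_prefer_progress(int collapsed, bool alloc_failed)
{
	if (collapsed)
		return MOCK_SCAN_SUCCEED;
	if (alloc_failed)
		return MOCK_SCAN_ALLOC_HUGE_PAGE_FAIL;
	return MOCK_SCAN_FAIL;	/* assumed fall-through */
}

/* Option B: any allocation failure wins, so the caller would back off. */
static enum mock_scan_result mock_prefer_backoff(int collapsed, bool alloc_failed)
{
	if (alloc_failed)
		return MOCK_SCAN_ALLOC_HUGE_PAGE_FAIL;
	return collapsed ? MOCK_SCAN_SUCCEED : MOCK_SCAN_FAIL;
}
```

The two only differ in the mixed case (collapsed != 0 and alloc_failed set): option A reports success and hides the allocation failure, option B reports the failure and hides the partial progress.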
Thanks,
-- Nico
>
> Cheers, Lance
>
^ permalink raw reply
* Re: [PATCH v2 2/8] riscv: stacktrace: Add frame record metadata
From: Shuai Xue @ 2026-06-02 11:18 UTC (permalink / raw)
To: Wang Han, Paul Walmsley, Palmer Dabbelt, Albert Ou
Cc: Steven Rostedt, Alexandre Ghiti, Masami Hiramatsu, Mark Rutland,
Catalin Marinas, Chen Pei, Andy Chiu, Björn Töpel,
Deepak Gupta, Puranjay Mohan, Conor Dooley, Josh Poimboeuf,
Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence,
Shuah Khan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, oliver.yang, zhuo.song, jkchen, linux-riscv,
linux-kernel, linux-trace-kernel, live-patching, linux-kselftest,
linux-perf-users
In-Reply-To: <20260528082310.1994388-3-wanghan@linux.alibaba.com>
On 5/28/26 4:23 PM, Wang Han wrote:
> Reliable frame-pointer unwinding needs an explicit way to identify
> exception boundaries and the final entry frame. The existing unwinder
> infers those boundaries from return addresses, which is too loose for a
> future reliable unwinder.
>
> Add a small metadata frame record to pt_regs and initialize it on
> exception entry, kernel thread fork, user fork, and early idle task
> setup. The record uses a zero {fp, ra} sentinel plus a type field so a
> later unwinder can distinguish a final user-to-kernel boundary from a
> nested kernel pt_regs boundary.
>
> This follows the arm64 metadata frame-record model, adapted to the
> RISC-V {fp, ra} frame record convention.
>
> The metadata is established at the RISC-V entry boundaries that need an
> explicit unwind marker:
>
> * exception entry clears the metadata {fp, ra} pair and uses SPP
> (or MPP in M-mode) to record whether the pt_regs frame is the final
> user-to-kernel boundary or a nested kernel boundary;
> * _start_kernel builds the init task's final metadata record, while
> the secondary CPU path sets up s0 before smp_callin() so idle-task
> unwinding does not inherit an undefined caller frame;
> * copy_thread creates matching final metadata records for new kernel
> and user tasks, and keeps s0 available for the frame-pointer chain;
> * call_on_irq_stack still reserves an aligned stack slot, but links the
> saved {fp, ra} with the raw frame-record size so s0 points at the
> RISC-V frame record rather than past the alignment padding.
>
> These changes keep s0 reserved for the frame-pointer chain at task and
> stack-switch boundaries.
>
> Signed-off-by: Wang Han <wanghan@linux.alibaba.com>
> ---
> arch/riscv/include/asm/ptrace.h | 9 ++++
> arch/riscv/include/asm/stacktrace/frame.h | 53 +++++++++++++++++++++++
> arch/riscv/kernel/asm-offsets.c | 4 ++
> arch/riscv/kernel/entry.S | 30 +++++++++++--
> arch/riscv/kernel/head.S | 23 ++++++++++
> arch/riscv/kernel/process.c | 31 ++++++++++++-
> 6 files changed, 144 insertions(+), 6 deletions(-)
> create mode 100644 arch/riscv/include/asm/stacktrace/frame.h
>
> diff --git a/arch/riscv/include/asm/ptrace.h b/arch/riscv/include/asm/ptrace.h
> index addc8188152f..4b9b0f279214 100644
> --- a/arch/riscv/include/asm/ptrace.h
> +++ b/arch/riscv/include/asm/ptrace.h
> @@ -8,6 +8,7 @@
>
> #include <uapi/asm/ptrace.h>
> #include <asm/csr.h>
> +#include <asm/stacktrace/frame.h>
> #include <linux/compiler.h>
>
> #ifndef __ASSEMBLER__
> @@ -53,6 +54,14 @@ struct pt_regs {
> unsigned long cause;
> /* a0 value before the syscall */
> unsigned long orig_a0;
> +
> + /*
> + * This frame record is entirely zeroed on exception entry, allowing the
> + * unwinder to identify exception boundaries. The type field encodes
> + * whether the exception was taken from user (FINAL) or kernel (PT_REGS)
> + * mode.
> + */
> + struct frame_record_meta stackframe;
> };
>
> #define PTRACE_SYSEMU 0x1f
> diff --git a/arch/riscv/include/asm/stacktrace/frame.h b/arch/riscv/include/asm/stacktrace/frame.h
> new file mode 100644
> index 000000000000..5720a6c65fe8
> --- /dev/null
> +++ b/arch/riscv/include/asm/stacktrace/frame.h
> @@ -0,0 +1,53 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef __ASM_RISCV_STACKTRACE_FRAME_H
> +#define __ASM_RISCV_STACKTRACE_FRAME_H
> +
> +/*
> + * See: arch/arm64/include/asm/stacktrace/frame.h for the reference
> + * implementation.
> + */
> +
> +/*
> + * - FRAME_META_TYPE_NONE
> + *
> + * This value is reserved.
> + *
> + * - FRAME_META_TYPE_FINAL
> + *
> + * The record is the last entry on the stack.
> + * Unwinding should terminate successfully.
> + *
> + * - FRAME_META_TYPE_PT_REGS
> + *
> + * The record is embedded within a struct pt_regs, recording the registers at
> + * an arbitrary point in time.
> + * Unwinding should consume pt_regs::epc, followed by pt_regs::ra.
> + *
> + * Note: all other values are reserved and should result in unwinding
> + * terminating with an error.
> + */
> +#define FRAME_META_TYPE_NONE 0
> +#define FRAME_META_TYPE_FINAL 1
> +#define FRAME_META_TYPE_PT_REGS 2
> +
> +#ifndef __ASSEMBLER__
> +/*
> + * A standard RISC-V frame record.
> + */
> +struct frame_record {
> + unsigned long fp;
> + unsigned long ra;
> +};
> +
> +/*
> + * A metadata frame record indicating a special unwind.
> + * The record::{fp,ra} fields must be zero to indicate the presence of
> + * metadata.
> + */
> +struct frame_record_meta {
> + struct frame_record record;
> + unsigned long type;
> +};
> +#endif /* __ASSEMBLER__ */
> +
> +#endif /* __ASM_RISCV_STACKTRACE_FRAME_H */
> diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> index af827448a609..8dfcb5a44bb8 100644
> --- a/arch/riscv/kernel/asm-offsets.c
> +++ b/arch/riscv/kernel/asm-offsets.c
> @@ -131,6 +131,9 @@ void asm_offsets(void)
> OFFSET(PT_BADADDR, pt_regs, badaddr);
> OFFSET(PT_CAUSE, pt_regs, cause);
>
> + DEFINE(S_STACKFRAME, offsetof(struct pt_regs, stackframe));
> + DEFINE(S_STACKFRAME_TYPE, offsetof(struct pt_regs, stackframe.type));
> +
> OFFSET(SUSPEND_CONTEXT_REGS, suspend_context, regs);
>
> OFFSET(HIBERN_PBE_ADDR, pbe, address);
> @@ -501,6 +504,7 @@ void asm_offsets(void)
> OFFSET(SBI_HART_BOOT_STACK_PTR_OFFSET, sbi_hart_boot_data, stack_ptr);
>
> DEFINE(STACKFRAME_SIZE_ON_STACK, ALIGN(sizeof(struct stackframe), STACK_ALIGN));
> + DEFINE(STACKFRAME_RECORD_SIZE, sizeof(struct stackframe));
> OFFSET(STACKFRAME_FP, stackframe, fp);
> OFFSET(STACKFRAME_RA, stackframe, ra);
> #ifdef CONFIG_FUNCTION_TRACER
> diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
> index d011fb51c59a..9cae0e1eba1c 100644
> --- a/arch/riscv/kernel/entry.S
> +++ b/arch/riscv/kernel/entry.S
> @@ -11,6 +11,7 @@
> #include <asm/asm.h>
> #include <asm/csr.h>
> #include <asm/scs.h>
> +#include <asm/stacktrace/frame.h>
> #include <asm/unistd.h>
> #include <asm/page.h>
> #include <asm/thread_info.h>
> @@ -193,6 +194,27 @@ SYM_CODE_START(handle_exception)
> REG_S s4, PT_CAUSE(sp)
> REG_S s5, PT_TP(sp)
>
> + /*
> + * Create a metadata frame record. The unwinder will use this to
> + * identify and unwind exception boundaries.
> + */
> + REG_S zero, (S_STACKFRAME + STACKFRAME_FP)(sp) /* stackframe.record.fp = 0 */
> + REG_S zero, (S_STACKFRAME + STACKFRAME_RA)(sp) /* stackframe.record.ra = 0 */
> +#ifdef CONFIG_RISCV_M_MODE
> + li t0, SR_MPP
> + and t0, s1, t0
> +#else
> + andi t0, s1, SR_SPP
> +#endif
> + bnez t0, 1f
> + li t0, FRAME_META_TYPE_FINAL
> + j 2f
> +1:
> + li t0, FRAME_META_TYPE_PT_REGS
> +2:
> + REG_S t0, S_STACKFRAME_TYPE(sp)
> + addi s0, sp, S_STACKFRAME + STACKFRAME_RECORD_SIZE
> +

One spot for symmetry (non-blocking, robustness only):

handle_kernel_stack_overflow in entry.S allocates a full
PT_SIZE_ON_STACK frame (including the new stackframe metadata fields)
but, unlike .Lsave_context, never initialises stackframe.{record,type}
nor repoints s0 at the metadata. Those three words are therefore left
as whatever was on the overflow_stack.

In practice this is currently harmless: handle_bad_stack() only calls
__show_regs() (register dump, no unwind) followed by panic(), so the
reliable unwinder never actually consumes that metadata today. So this
is not an active bug, purely a robustness / symmetry gap.

It would still be worth initialising it, because the moment someone
adds a dump_stack() here, or another CPU NMI-backtraces this task, or a
kdump image is walked offline via the frame-record chain, the garbage
type byte would mislead the unwinder. Since the overflow path is by
definition entered from kernel context, FRAME_META_TYPE_PT_REGS is the
right type, and it has the nice property that the unwinder will resume
from frame_pointer(regs)==regs->s0 (the pre-overflow s0 is already
saved into PT_S0 by save_from_x6_to_x31), giving the pre-overflow call
chain instead of a hard stop.
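
To illustrate why uninitialised metadata matters, here is a userspace sketch of how an unwinder could classify a metadata record. The struct layouts and FRAME_META_TYPE_* values are copied from the frame.h hunk above; the mock_ enum and mock_classify() are hypothetical and are not taken from this series.

```c
#define FRAME_META_TYPE_NONE    0
#define FRAME_META_TYPE_FINAL   1
#define FRAME_META_TYPE_PT_REGS 2

struct frame_record {
	unsigned long fp;
	unsigned long ra;
};

struct frame_record_meta {
	struct frame_record record;
	unsigned long type;
};

enum mock_unwind_step {
	MOCK_UNWIND_NOT_META,	/* ordinary record: follow fp/ra as usual */
	MOCK_UNWIND_DONE,	/* FINAL: terminate successfully */
	MOCK_UNWIND_PT_REGS,	/* resume from the enclosing pt_regs */
	MOCK_UNWIND_ERROR,	/* reserved or garbage type */
};

/* A record only counts as metadata when both fp and ra are zero. */
static enum mock_unwind_step mock_classify(const struct frame_record_meta *meta)
{
	if (meta->record.fp || meta->record.ra)
		return MOCK_UNWIND_NOT_META;

	switch (meta->type) {
	case FRAME_META_TYPE_FINAL:
		return MOCK_UNWIND_DONE;
	case FRAME_META_TYPE_PT_REGS:
		return MOCK_UNWIND_PT_REGS;
	default:
		return MOCK_UNWIND_ERROR;
	}
}
```

With stale overflow_stack contents in those three words, such a classifier would either follow a bogus fp/ra pair as if it were an ordinary record or stop with an error, rather than resuming from the saved registers.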
> /*
> * Set the scratch register to 0, so that if a recursive exception
> * occurs, the exception vector knows it came from the kernel
> @@ -357,8 +379,8 @@ ASM_NOKPROBE(handle_kernel_stack_overflow)
>
> SYM_CODE_START(ret_from_fork_kernel_asm)
> call schedule_tail
> - move a0, s1 /* fn_arg */
> - move a1, s0 /* fn */
> + move a0, s3 /* fn_arg */
> + move a1, s2 /* fn */
> move a2, sp /* pt_regs */
> call ret_from_fork_kernel
> j ret_from_exception
> @@ -383,7 +405,7 @@ SYM_FUNC_START(call_on_irq_stack)
> addi sp, sp, -STACKFRAME_SIZE_ON_STACK
> REG_S ra, STACKFRAME_RA(sp)
> REG_S s0, STACKFRAME_FP(sp)
> - addi s0, sp, STACKFRAME_SIZE_ON_STACK
> + addi s0, sp, STACKFRAME_RECORD_SIZE
>
> /* Switch to the per-CPU shadow call stack */
> scs_save_current
> @@ -399,7 +421,7 @@ SYM_FUNC_START(call_on_irq_stack)
> scs_load_current
>
> /* Switch back to the thread stack and restore ra and s0 */
> - addi sp, s0, -STACKFRAME_SIZE_ON_STACK
> + addi sp, s0, -STACKFRAME_RECORD_SIZE

Worth calling out explicitly that this is more than a cosmetic refactor:
on RV32 the previous code is actually wrong, and this hunk fixes it.

	STACKFRAME_SIZE_ON_STACK = ALIGN(sizeof(struct stackframe), STACK_ALIGN)
	STACKFRAME_RECORD_SIZE   = sizeof(struct stackframe)

  RV64: sizeof(stackframe) == STACK_ALIGN == 16, so the two are equal
        and the old code happened to work.

  RV32: sizeof(stackframe) == 8 but STACK_ALIGN == 16, so the old
        "addi s0, sp, STACKFRAME_SIZE_ON_STACK" left s0 pointing 8 bytes
        past the saved {fp, ra} pair, into the alignment padding. An FP
        unwinder that derives the frame record from s0 (e.g. via
        "(struct stackframe *)s0 - 1" or fixed -8/-16(s0) loads) would
        then read garbage instead of the saved fp/ra at the IRQ-stack
        boundary.

After the change s0 lands exactly at the end of the {fp, ra} record on
both RV32 and RV64, while the aligned slot is still reserved by the
unchanged "addi sp, sp, -STACKFRAME_SIZE_ON_STACK" / matching restore.

Could you mention this in the v3 commit message? It's load-bearing
context for anyone bisecting an RV32 unwind regression later, and it
also justifies why the change is correct to apply ahead of the reliable
unwinder rather than folded into it.
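
For reference, the size arithmetic above can be checked with a few lines of standalone C. This is only a sketch: the mock_ helpers are hypothetical, and the register width is passed in as a parameter instead of being taken from the target.

```c
#include <stddef.h>

#define MOCK_STACK_ALIGN 16u	/* STACK_ALIGN as stated above */

/* Round x up to a multiple of a; a must be a power of two. */
static size_t mock_align_up(size_t x, size_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Size of a {fp, ra} record for a given register width in bytes. */
static size_t mock_record_size(size_t xlen_bytes)
{
	return 2 * xlen_bytes;
}

/* Bytes by which the old s0 overshoots the end of the {fp, ra} record. */
static size_t mock_old_s0_overshoot(size_t xlen_bytes)
{
	size_t record = mock_record_size(xlen_bytes);

	return mock_align_up(record, MOCK_STACK_ALIGN) - record;
}
```

Calling mock_old_s0_overshoot(4) models RV32 and gives 8, while mock_old_s0_overshoot(8) models RV64 and gives 0, matching the two cases described above.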
> REG_L ra, STACKFRAME_RA(sp)
> REG_L s0, STACKFRAME_FP(sp)
> addi sp, sp, STACKFRAME_SIZE_ON_STACK
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index f6a8ca49e627..00e16a24f149 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -14,6 +14,7 @@
> #include <asm/hwcap.h>
> #include <asm/image.h>
> #include <asm/scs.h>
> +#include <asm/stacktrace/frame.h>
> #include <asm/usercfi.h>
> #include "efi-header.S"
>
> @@ -177,6 +178,14 @@ secondary_start_sbi:
> REG_S a0, (a1)
> 1:
> #endif
> +
> + /*
> + * Set up the frame pointer for the secondary idle task so reliable
> + * stack unwinding terminates at the metadata frame in task_pt_regs().
> + * Without this, the first frame records can inherit an undefined caller
> + * fp and unwind past smp_callin() into .Lsecondary_park.
> + */
> + addi s0, sp, S_STACKFRAME + STACKFRAME_RECORD_SIZE
> scs_load_current
> call smp_callin
> #endif /* CONFIG_SMP */
> @@ -305,6 +314,20 @@ SYM_CODE_START(_start_kernel)
> la tp, init_task
> la sp, init_thread_union + THREAD_SIZE
> addi sp, sp, -PT_SIZE_ON_STACK
> +
> + /*
> + * Set up a metadata frame record for the init task so that
> + * the unwinder can identify the outermost frame by its
> + * {fp, ra} = {0, 0} sentinel at the bottom of pt_regs.
> + * fp/s0 points above the metadata record (RISC-V
> + * convention).
> + */
> + REG_S zero, (S_STACKFRAME + STACKFRAME_FP)(sp)
> + REG_S zero, (S_STACKFRAME + STACKFRAME_RA)(sp)
> + li t0, FRAME_META_TYPE_FINAL
> + REG_S t0, S_STACKFRAME_TYPE(sp)
> + addi s0, sp, S_STACKFRAME + STACKFRAME_RECORD_SIZE
> +
> #if defined(CONFIG_RISCV_SBI) && defined(CONFIG_RISCV_USER_CFI)
> li a7, SBI_EXT_FWFT
> li a6, SBI_EXT_FWFT_SET
> diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
> index b2df7f72241a..5212926b926b 100644
> --- a/arch/riscv/kernel/process.c
> +++ b/arch/riscv/kernel/process.c
> @@ -258,8 +258,23 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
> /* Supervisor/Machine, irqs on: */
> childregs->status = SR_PP | SR_PIE;
>
> - p->thread.s[0] = (unsigned long)args->fn;
> - p->thread.s[1] = (unsigned long)args->fn_arg;
> + /*
> + * Set up a metadata frame record at the bottom of the
> + * stack for the unwinder. Use FRAME_META_TYPE_FINAL
> + * since this is the outermost kernel entry for the new
> + * task. The frame_record::{fp,ra} are already zero from
> + * memset().
> + *
> + * fp/s0 points above the metadata record (RISC-V
> + * convention). fn and fn_arg are passed via s2/s3,
> + * keeping s0 available for the frame pointer chain.
> + */
> + childregs->stackframe.type = FRAME_META_TYPE_FINAL;
> +
> + p->thread.s[0] = (unsigned long)(&childregs->stackframe)
> + + sizeof(struct frame_record);
> + p->thread.s[2] = (unsigned long)args->fn;
> + p->thread.s[3] = (unsigned long)args->fn_arg;
> p->thread.ra = (unsigned long)ret_from_fork_kernel_asm;
> } else {
> /* allocate new shadow stack if needed. In case of CLONE_VM we have to */
> @@ -278,6 +293,18 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
> if (clone_flags & CLONE_SETTLS)
> childregs->tp = tls;
> childregs->a0 = 0; /* Return value of fork() */
> +
> + /*
> + * Set up the unwind boundary: ensure the metadata
> + * frame record has its {fp,ra} sentinel zeroed and
> + * point fp/s0 above the metadata record. The type
> + * field is inherited from the parent's pt_regs.
> + */
> + childregs->stackframe.record.fp = 0;
> + childregs->stackframe.record.ra = 0;

This relies on the parent always entering kernel via handle_exception
on a user->kernel boundary (which writes FRAME_META_TYPE_FINAL).
That is true for fork()/clone() today, but:

  - The kernel-thread path right above explicitly assigns type =
    FINAL, so the user-thread path looks asymmetric and like a
    possible omission to anyone reading it cold.

  - A future caller invoking kernel_clone() from a nested-kernel
    context (parent pt_regs.type == PT_REGS) would silently produce
    a broken unwind boundary on the new task.

Recommend explicitly setting it here too:

	childregs->stackframe.type = FRAME_META_TYPE_FINAL;

Even if currently redundant, it is one assignment, costs nothing, is
self-documenting, and fails closed instead of open.
Thanks.
Shuai
^ permalink raw reply