* [PATCH v9 3/3] tracing/Documentation: Add a section about backup instance
From: Masami Hiramatsu (Google) @ 2026-03-31 16:32 UTC (permalink / raw)
To: Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel
In-Reply-To: <177497473558.569199.6527680985537865638.stgit@mhiramat.tok.corp.google.com>
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add a section about backup instance to the debugging.rst.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
Changes in v6:
- Fix typos.
---
Documentation/trace/debugging.rst | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/Documentation/trace/debugging.rst b/Documentation/trace/debugging.rst
index 4d88c346fc38..15857951b506 100644
--- a/Documentation/trace/debugging.rst
+++ b/Documentation/trace/debugging.rst
@@ -159,3 +159,22 @@ If setting it from the kernel command line, it is recommended to also
disable tracing with the "traceoff" flag, and enable tracing after boot up.
Otherwise the trace from the most recent boot will be mixed with the trace
from the previous boot, and may make it confusing to read.
+
+Using a backup instance for keeping previous boot data
+------------------------------------------------------
+
+It is also possible to record trace data at system boot time by specifying
+events with the persistent ring buffer, but in this case the data before the
+reboot will be lost before it can be read. This problem can be solved by a
+backup instance. From the kernel command line::
+
+ reserve_mem=12M:4096:trace trace_instance=boot_map@trace,sched,irq trace_instance=backup=boot_map
+
+On boot up, the previous data in the "boot_map" is copied to the "backup"
+instance, and the "sched:*" and "irq:*" events for the current boot are traced
+in the "boot_map". Thus the user can read the previous boot data from the "backup"
+instance without stopping the trace.
+
+Note that this "backup" instance is readonly, and will be removed automatically
+if you clear the trace data or read out all trace data from the "trace_pipe"
+or the "trace_pipe_raw" files.
\ No newline at end of file
^ permalink raw reply related
* Re: [PATCH v15 1/5] ring-buffer: Flush and stop persistent ring buffer on panic
From: Catalin Marinas @ 2026-03-31 17:57 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Steven Rostedt, Will Deacon, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel, Ian Rogers, linux-arm-kernel, Robin Murphy
In-Reply-To: <177494616630.71933.2941681397188791689.stgit@mhiramat.tok.corp.google.com>
On Tue, Mar 31, 2026 at 05:36:06PM +0900, Masami Hiramatsu (Google) wrote:
> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
>
> On real hardware, panic and machine reboot may not flush hardware cache
> to memory. This means the persistent ring buffer, which relies on a
> coherent state of memory, may not have its events written to the buffer
> and they may be lost. Moreover, there may be inconsistency with the
> counters which are used for validation of the integrity of the
> persistent ring buffer which may cause all data to be discarded.
>
> To avoid this issue, stop recording of the ring buffer on panic and
> flush the cache of the ring buffer's memory.
>
> Fixes: e645535a954a ("tracing: Add option to use memmapped memory for trace boot instance")
> Cc: stable@vger.kernel.org
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
[...]
> diff --git a/arch/arm64/include/asm/ring_buffer.h b/arch/arm64/include/asm/ring_buffer.h
> new file mode 100644
> index 000000000000..62316c406888
> --- /dev/null
> +++ b/arch/arm64/include/asm/ring_buffer.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef _ASM_ARM64_RING_BUFFER_H
> +#define _ASM_ARM64_RING_BUFFER_H
> +
> +#include <asm/cacheflush.h>
> +
> +/* Flush D-cache on persistent ring buffer */
> +#define arch_ring_buffer_flush_range(start, end) dcache_clean_pop(start, end)
> +
> +#endif /* _ASM_ARM64_RING_BUFFER_H */
Adding Robin as he wrote the pmem support for arm64.
I assume the ring buffer here is cacheable memory, otherwise we'd also
need a dmb(osh) as in arch_wb_cache_pmem(). If that's correct:
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
^ permalink raw reply
* Re: [PATCH v15 1/5] ring-buffer: Flush and stop persistent ring buffer on panic
From: Steven Rostedt @ 2026-03-31 18:03 UTC (permalink / raw)
To: Catalin Marinas
Cc: Masami Hiramatsu (Google), Will Deacon, Mathieu Desnoyers,
linux-kernel, linux-trace-kernel, Ian Rogers, linux-arm-kernel,
Robin Murphy
In-Reply-To: <acwK-Xu-Mon2_6bT@arm.com>
On Tue, 31 Mar 2026 18:57:13 +0100
Catalin Marinas <catalin.marinas@arm.com> wrote:
> I assume the ring buffer here is cacheable memory, otherwise we'd also
> need a dmb(osh) as in arch_wb_cache_pmem(). If that's correct:
Yes, it's cacheable (I couldn't imagine the performance overhead if it
wasn't! ;-)
>
> Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Thanks!
-- Steve
^ permalink raw reply
* [PATCH v3 1/3] tracing: Have futex syscall trace event show specific user data
From: Steven Rostedt @ 2026-03-31 18:13 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Thomas Gleixner, Peter Zijlstra, Brian Geffon, John Stultz,
Ian Rogers, Suleiman Souhlal
In-Reply-To: <20260331181349.062575155@kernel.org>
From: Steven Rostedt <rostedt@goodmis.org>
Add specific reporting of the futex system call. This allows for debugging
the futex code a bit easier. Instead of just showing the values passed
into the futex system call, read the value of the user space memory
pointed to by the addr parameter.
Also make the op parameter more readable by parsing the values to show
what the command is:
futex_requeue_p-3251 [002] ..... 2101.068479: sys_futex(uaddr: 0x55e79a4da834 (0x80000cb1), FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG, val: 0)
futex_requeue_p-3248 [001] ..... 2101.068970: sys_futex(uaddr: 0x7f859072f990 (0xcb2), FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, val: 3250)
futex_requeue_p-3252 [005] ..... 2101.069108: sys_futex(uaddr: 0x55e79a4da838 (0), FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG, val: 0, timespec: 0x7ffe61076aa0, uaddr2: 0x55e79a4da834, uaddr2: 94453214586932, val3: 0)
futex_requeue_p-3252 [005] ..... 2101.069410: sys_futex(uaddr: 0x55e79a4da834 (0x80000cb1), FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG, val: 0)
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v2: https://patch.msgid.link/20260310201036.542627924@kernel.org
- Removed unused "buf" variable (kernel test robot)
- Iterate __futex_cmds[] make the print statement.
Note this required exposing __futex_cmds[] to trace_syscall.c
- Added back val statement (with the move to futex/syscall.c the
third parameter was dropped).
include/linux/futex.h | 8 +-
kernel/futex/syscalls.c | 97 +++++++++++++++++++++++
kernel/trace/trace_syscalls.c | 144 +++++++++++++++++++++++++++++++++-
3 files changed, 245 insertions(+), 4 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 9e9750f04980..7eaf01ff68cf 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -82,6 +82,11 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
u32 __user *uaddr2, u32 val2, u32 val3);
int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4);
+#ifdef CONFIG_FTRACE_SYSCALLS
+void futex_print_syscall(struct seq_buf *s, int nr_args, unsigned long *args, u32 *ptr);
+extern const char * __futex_cmds[];
+#endif
+
#ifdef CONFIG_FUTEX_PRIVATE_HASH
int futex_hash_allocate_default(void);
void futex_hash_free(struct mm_struct *mm);
@@ -114,7 +119,8 @@ static inline int futex_hash_allocate_default(void)
}
static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
static inline int futex_mm_init(struct mm_struct *mm) { return 0; }
-
+static inline void futex_print_syscall(struct seq_buf *s, int nr_args,
+ unsigned long *args, u32 *ptr) { }
#endif
#endif
diff --git a/kernel/futex/syscalls.c b/kernel/futex/syscalls.c
index 743c7a728237..3ba8ca017c9c 100644
--- a/kernel/futex/syscalls.c
+++ b/kernel/futex/syscalls.c
@@ -171,6 +171,18 @@ static __always_inline bool futex_cmd_has_timeout(u32 cmd)
return false;
}
+static __always_inline bool futex_cmd_has_addr2(u32 cmd)
+{
+ switch (cmd) {
+ case FUTEX_REQUEUE:
+ case FUTEX_CMP_REQUEUE:
+ case FUTEX_WAKE_OP:
+ case FUTEX_WAIT_REQUEUE_PI:
+ return true;
+ }
+ return false;
+}
+
static __always_inline int
futex_init_timeout(u32 cmd, u32 op, struct timespec64 *ts, ktime_t *t)
{
@@ -207,6 +219,91 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
return do_futex(uaddr, op, val, tp, uaddr2, (unsigned long)utime, val3);
}
+#ifdef CONFIG_FTRACE_SYSCALLS
+
+/* End with NULL for iterators */
+const char * __futex_cmds[] =
+{
+ "FUTEX_WAIT", "FUTEX_WAKE", "FUTEX_FD", "FUTEX_REQUEUE",
+ "FUTEX_CMP_REQUEUE", "FUTEX_WAKE_OP", "FUTEX_LOCK_PI",
+ "FUTEX_UNLOCK_PI", "FUTEX_TRYLOCK_PI", "FUTEX_WAIT_BITSET",
+ "FUTEX_WAKE_BITSET", "FUTEX_WAIT_REQUEUE_PI", "FUTEX_CMP_REQUEUE_PI",
+ "FUTEX_LOCK_PI2", NULL
+};
+
+void futex_print_syscall(struct seq_buf *s, int nr_args, unsigned long *args, u32 *ptr)
+{
+ unsigned int op, cmd;
+ bool done = false;
+
+ for (int i = 0; !done && i < nr_args; i++) {
+
+ if (seq_buf_has_overflowed(s))
+ break;
+
+ switch (i) {
+ case 0:
+ seq_buf_printf(s, "uaddr: 0x%lx", args[i]);
+ if (ptr) {
+ u32 val = *ptr;
+ if (val < 10)
+ seq_buf_printf(s, " (%u)", val);
+ else
+ seq_buf_printf(s, " (0x%x)", val);
+ }
+ continue;
+ case 1:
+ op = args[i];
+ cmd = op & FUTEX_CMD_MASK;
+ if (cmd <= FUTEX_LOCK_PI2)
+ seq_buf_printf(s, ", %s", __futex_cmds[cmd]);
+ else
+ seq_buf_puts(s, ", UNKNOWN");
+
+ if (op & FUTEX_PRIVATE_FLAG)
+ seq_buf_puts(s, "|FUTEX_PRIVATE_FLAG");
+ if (op & FUTEX_CLOCK_REALTIME)
+ seq_buf_puts(s, "|FUTEX_CLOCK_REALTIME");
+ continue;
+ case 2:
+ if (args[i] < 10)
+ seq_buf_printf(s, ", val: %ld", args[i]);
+ else
+ seq_buf_printf(s, ", val: 0x%lx", args[i]);
+ continue;
+ case 3:
+ if (!futex_cmd_has_timeout(cmd)) {
+
+ if (!futex_cmd_has_addr2(cmd)) {
+ done = true;
+ continue;
+ }
+
+ seq_buf_printf(s, ", val2: 0x%x", (u32)(long)args[i]);
+ continue;
+ }
+
+ if (!args[i])
+ continue;
+
+ seq_buf_printf(s, ", timespec: 0x%lx", args[i]);
+ continue;
+ case 4:
+ if (!futex_cmd_has_addr2(cmd)) {
+ done = true;
+ continue;
+ }
+ seq_buf_printf(s, ", uaddr2: 0x%lx", args[i]);
+ continue;
+ case 5:
+ seq_buf_printf(s, ", val3: %lu", args[i]);
+ done = true;
+ continue;
+ }
+ }
+}
+#endif
+
/**
* futex_parse_waitv - Parse a waitv array from userspace
* @futexv: Kernel side list of waiters to be filled
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 37317b81fcda..f8213d772f89 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -6,11 +6,13 @@
#include <linux/slab.h>
#include <linux/kernel.h>
#include <linux/module.h> /* for MODULE_NAME_LEN via KSYM_SYMBOL_LEN */
+#include <linux/futex.h>
#include <linux/ftrace.h>
#include <linux/perf_event.h>
#include <linux/xarray.h>
#include <asm/syscall.h>
+
#include "trace_output.h"
#include "trace.h"
@@ -237,6 +239,27 @@ sys_enter_openat_print(struct syscall_trace_enter *trace, struct syscall_metadat
return trace_handle_return(s);
}
+static enum print_line_t
+sys_enter_futex_print(struct syscall_trace_enter *trace, struct syscall_metadata *entry,
+ struct trace_seq *s, struct trace_event *event, int ent_size)
+{
+ void *end = (void *)trace + ent_size;
+ void *ptr;
+
+ /* Set ptr to the user space copied area */
+ ptr = (void *)trace->args + sizeof(unsigned long) * entry->nb_args;
+ if (ptr + 4 > end)
+ ptr = NULL;
+
+ trace_seq_printf(s, "%s(", entry->name);
+
+ futex_print_syscall(&s->seq, entry->nb_args, trace->args, ptr);
+
+ trace_seq_puts(s, ")\n");
+
+ return trace_handle_return(s);
+}
+
static enum print_line_t
print_syscall_enter(struct trace_iterator *iter, int flags,
struct trace_event *event)
@@ -267,6 +290,10 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
if (!tr || !(tr->trace_flags & TRACE_ITER(VERBOSE)))
return sys_enter_openat_print(trace, entry, s, event);
break;
+ case __NR_futex:
+ if (!tr || !(tr->trace_flags & TRACE_ITER(VERBOSE)))
+ return sys_enter_futex_print(trace, entry, s, event, iter->ent_size);
+ break;
default:
break;
}
@@ -437,6 +464,48 @@ sys_enter_openat_print_fmt(struct syscall_metadata *entry, char *buf, int len)
return pos;
}
+static int __init
+sys_enter_futex_print_fmt(struct syscall_metadata *entry, char *buf, int len)
+{
+ int pos = 0;
+
+ pos += snprintf(buf + pos, LEN_OR_ZERO,
+ "\"uaddr: 0x%%lx (0x%%lx) cmd=%%s%%s%%s");
+ pos += snprintf(buf + pos, LEN_OR_ZERO,
+ " val: 0x%%x timeout/val2: 0x%%llx");
+ pos += snprintf(buf + pos, LEN_OR_ZERO,
+ " uaddr2: 0x%%lx val3: 0x%%x\", ");
+
+ pos += snprintf(buf + pos, LEN_OR_ZERO,
+ " REC->uaddr,");
+ pos += snprintf(buf + pos, LEN_OR_ZERO,
+ " REC->__value,");
+ pos += snprintf(buf + pos, LEN_OR_ZERO,
+ " __print_symbolic(REC->op & 0x%x, ", FUTEX_CMD_MASK);
+
+ for (int i = 0; __futex_cmds[i]; i++) {
+ pos += snprintf(buf + pos, LEN_OR_ZERO,
+ "%s{%d, \"%s\"} ",
+ i ? "," : "", i, __futex_cmds[i]);
+ }
+
+ pos += snprintf(buf + pos, LEN_OR_ZERO, "), ");
+
+ pos += snprintf(buf + pos, LEN_OR_ZERO,
+ " (REC->op & %d) ? \"|FUTEX_PRIVATE_FLAG\" : \"\",",
+ FUTEX_PRIVATE_FLAG);
+ pos += snprintf(buf + pos, LEN_OR_ZERO,
+ " (REC->op & %d) ? \"|FUTEX_CLOCK_REALTIME\" : \"\",",
+ FUTEX_CLOCK_REALTIME);
+
+ pos += snprintf(buf + pos, LEN_OR_ZERO,
+ " REC->val, REC->utime,");
+
+ pos += snprintf(buf + pos, LEN_OR_ZERO,
+ " REC->uaddr, REC->val3");
+ return pos;
+}
+
static int __init
__set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
{
@@ -447,6 +516,8 @@ __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
switch (entry->syscall_nr) {
case __NR_openat:
return sys_enter_openat_print_fmt(entry, buf, len);
+ case __NR_futex:
+ return sys_enter_futex_print_fmt(entry, buf, len);
default:
break;
}
@@ -523,6 +594,21 @@ static void __init free_syscall_print_fmt(struct trace_event_call *call)
kfree(call->print_fmt);
}
+static int __init futex_fields(struct trace_event_call *call, int offset)
+{
+ char *arg;
+ int ret;
+
+ arg = kstrdup("__value", GFP_KERNEL);
+ if (WARN_ON_ONCE(!arg))
+ return -ENOMEM;
+ ret = trace_define_field(call, "u32", arg, offset, sizeof(int), 0,
+ FILTER_OTHER);
+ if (ret)
+ kfree(arg);
+ return ret;
+}
+
static int __init syscall_enter_define_fields(struct trace_event_call *call)
{
struct syscall_trace_enter trace;
@@ -544,6 +630,9 @@ static int __init syscall_enter_define_fields(struct trace_event_call *call)
offset += sizeof(unsigned long);
}
+ if (!ret && meta->syscall_nr == __NR_futex)
+ return futex_fields(call, offset);
+
if (ret || !meta->user_mask)
return ret;
@@ -689,6 +778,45 @@ static int syscall_copy_user_array(char *buf, const char __user *ptr,
return 0;
}
+static int
+syscall_get_futex(unsigned long *args, char **buffer, int *size, int buf_size)
+{
+ struct syscall_user_buffer *sbuf;
+ const char __user *ptr;
+
+ /* buf_size of zero means user doesn't want user space read */
+ if (!buf_size)
+ return -1;
+
+ /* If the syscall_buffer is NULL, tracing is being shutdown */
+ sbuf = READ_ONCE(syscall_buffer);
+ if (!sbuf)
+ return -1;
+
+ ptr = (char __user *)args[0];
+
+ *buffer = trace_user_fault_read(&sbuf->buf, ptr, 4, NULL, NULL);
+ if (!*buffer)
+ return -1;
+
+ /* Add room for the value */
+ *size += 4;
+
+ return 0;
+}
+
+static void syscall_put_futex(struct syscall_metadata *sys_data,
+ struct syscall_trace_enter *entry,
+ char *buffer)
+{
+ u32 *ptr;
+
+ /* Place the futex key into the storage */
+ ptr = (void *)entry->args + sizeof(unsigned long) * sys_data->nb_args;
+
+ *ptr = *(u32 *)buffer;
+}
+
static char *sys_fault_user(unsigned int buf_size,
struct syscall_metadata *sys_data,
struct syscall_user_buffer *sbuf,
@@ -905,6 +1033,9 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
if (syscall_get_data(sys_data, args, &user_ptr,
&size, user_sizes, &uargs, tr->syscall_buf_sz) < 0)
return;
+ } else if (syscall_nr == __NR_futex) {
+ if (syscall_get_futex(args, &user_ptr, &size, tr->syscall_buf_sz) < 0)
+ return;
}
size += sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
@@ -921,6 +1052,9 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
if (mayfault)
syscall_put_data(sys_data, entry, user_ptr, size, user_sizes, uargs);
+ else if (syscall_nr == __NR_futex)
+ syscall_put_futex(sys_data, entry, user_ptr);
+
trace_event_buffer_commit(&fbuffer);
}
@@ -971,14 +1105,18 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
{
struct syscall_metadata *sys_data = call->data;
struct trace_array *tr = file->tr;
+ bool enable_faults;
int ret = 0;
int num;
num = sys_data->syscall_nr;
if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
return -ENOSYS;
+
+ enable_faults = sys_data->user_mask || num == __NR_futex;
+
guard(mutex)(&syscall_trace_lock);
- if (sys_data->user_mask) {
+ if (enable_faults) {
ret = syscall_fault_buffer_enable();
if (ret < 0)
return ret;
@@ -986,7 +1124,7 @@ static int reg_event_syscall_enter(struct trace_event_file *file,
if (!tr->sys_refcount_enter) {
ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
if (ret < 0) {
- if (sys_data->user_mask)
+ if (enable_faults)
syscall_fault_buffer_disable();
return ret;
}
@@ -1011,7 +1149,7 @@ static void unreg_event_syscall_enter(struct trace_event_file *file,
WRITE_ONCE(tr->enter_syscall_files[num], NULL);
if (!tr->sys_refcount_enter)
unregister_trace_sys_enter(ftrace_syscall_enter, tr);
- if (sys_data->user_mask)
+ if (sys_data->user_mask || num == __NR_futex)
syscall_fault_buffer_disable();
}
--
2.51.0
^ permalink raw reply related
* [PATCH v3 0/3] tracing: Read user data from futex system call trace event
From: Steven Rostedt @ 2026-03-31 18:13 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Thomas Gleixner, Peter Zijlstra, Brian Geffon, John Stultz,
Ian Rogers, Suleiman Souhlal
We are looking at the performance of futexes and require a bit more
information when tracing them.
The two patches here extend the system call reading of user space to
create specific handling of the futex system call. It now reads the
user space relevant data (the addr, utime and addr2), as well as
parses the flags. This adds a little smarts to the trace event as
it only shows the parameters that are relevant, as well as parses
utime as either a timespec or as val2 depending on the futex_op.
Here's an example of the new output:
sys_futex(uaddr: 0x56196292e830 (0), FUTEX_WAKE|FUTEX_PRIVATE_FLAG)
sys_futex(uaddr: 0x56196292e834 (0x4a7) tid: 1191, FUTEX_UNLOCK_PI|FUTEX_PRIVATE_FLAG)
sys_futex(uaddr: 0x56196292e834 (0) tid: 0, FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG)
sys_futex(uaddr: 0x56196292e830 (0), FUTEX_WAIT|FUTEX_PRIVATE_FLAG)
sys_futex(uaddr: 0x56196292e838 (0), FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG, timespec: 0x7ffc1b91a9f0 (163.048528790), uaddr2: 0x56196292e834 (4aa), val3: 0)
sys_futex(uaddr: 0x56196292e834 (0x4aa) tid: 1194, FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG)
sys_futex(uaddr: 0x56196292e838 (0), FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG, timespec: 0x7ffc1b91a9f0 (163.048528790), uaddr2: 0x56196292e834 (800004aa), val3: 0)
sys_futex(uaddr: 0x7f7ed6b29990 (0x4ab), FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME)
sys_futex(uaddr: 0x56196292e834 (0x800004aa) tid: 1194 (WAITERS), FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG)
sys_futex(uaddr: 0x56196292e838 (0), FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG, timespec: 0x7ffc1b91a9f0 (163.048528790), uaddr2: 0x56196292e834 (800004aa), val3: 0)
sys_futex(uaddr: 0x56196292e834 (0x800004aa) tid: 1194 (WAITERS), FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG)
Changes since v2: https://lore.kernel.org/all/20260310200954.285663884@kernel.org/
- Removed unused "buf" variable (kernel test robot)
- Iterate __futex_cmds[] make the print statement.
Note this required exposing __futex_cmds[] to trace_syscall.c
(Masami Hiramatsu)
- Added back val statement (with the move to futex/syscall.c the
third parameter was dropped).
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
futex/core
Head SHA1: 79b0609ad15b24d0bbfb1790e17902a6c210ae69
Steven Rostedt (3):
tracing: Have futex syscall trace event show specific user data
tracing: Update futex syscall trace event to show more commands
tracing: Show TID and flags for PI futex system call trace event
----
include/linux/futex.h | 39 ++++++-
kernel/futex/syscalls.c | 137 +++++++++++++++++++++---
kernel/trace/trace_syscalls.c | 237 +++++++++++++++++++++++++++++++++++++++++-
3 files changed, 396 insertions(+), 17 deletions(-)
^ permalink raw reply
* [PATCH v3 3/3] tracing: Show TID and flags for PI futex system call trace event
From: Steven Rostedt @ 2026-03-31 18:13 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Thomas Gleixner, Peter Zijlstra, Brian Geffon, John Stultz,
Ian Rogers, Suleiman Souhlal
In-Reply-To: <20260331181349.062575155@kernel.org>
From: Steven Rostedt <rostedt@goodmis.org>
For the futex system call trace event for FUTEX_LOCK_PI and
FUTEX_UNLOCK_PI commands, show the TID from the value (which is usually in
hex) as well as translate the flags (DIED and WAITERS).
pi_mutex_hammer-1098 [000] ..... 121.876928: sys_futex(uaddr: 0x560f40cc8180 (0x450) tid: 1104, FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG, val: 0, timespec: 0x7f2f9d4b1e50 (0.000100000))
pi_mutex_hammer-1128 [000] ..... 121.877120: sys_futex(uaddr: 0x560f40cc8180 (0x8000042a) tid: 1066 (WAITERS), FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG, val: 0, timespec: 0x7f2f8e493e50 (0.000100000))
pi_mutex_hammer-1106 [000] ..... 121.877242: sys_futex(uaddr: 0x560f40cc8180 (0x80000452) tid: 1106 (WAITERS), FUTEX_UNLOCK_PI|FUTEX_PRIVATE_FLAG, val: 0)
This makes it easier to see the hand off of a mutex and who the owner was.
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/futex/syscalls.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
diff --git a/kernel/futex/syscalls.c b/kernel/futex/syscalls.c
index 265ec0a236d0..8144781a9ff0 100644
--- a/kernel/futex/syscalls.c
+++ b/kernel/futex/syscalls.c
@@ -213,6 +213,9 @@ void futex_print_syscall(struct seq_buf *s, int nr_args, unsigned long *args,
unsigned int op, cmd;
bool done = false;
+ op = args[1];
+ cmd = op & FUTEX_CMD_MASK;
+
for (int i = 0; !done && i < nr_args; i++) {
if (seq_buf_has_overflowed(s))
@@ -227,11 +230,30 @@ void futex_print_syscall(struct seq_buf *s, int nr_args, unsigned long *args,
seq_buf_printf(s, " (%u)", val);
else
seq_buf_printf(s, " (0x%x)", val);
+
+ switch(cmd) {
+ case FUTEX_LOCK_PI:
+ case FUTEX_UNLOCK_PI:
+ seq_buf_printf(s, " tid: %d",
+ val & FUTEX_TID_MASK);
+
+ if (!(val & (FUTEX_OWNER_DIED|FUTEX_WAITERS)))
+ break;
+
+ seq_buf_puts(s, " (");
+ if (val & FUTEX_WAITERS)
+ seq_buf_puts(s, "WAITERS");
+ if (val & FUTEX_OWNER_DIED) {
+ if (op & FUTEX_WAITERS)
+ seq_buf_putc(s, '|');
+ seq_buf_puts(s, "DIED");
+ }
+ seq_buf_putc(s, ')');
+ break;
+ }
}
continue;
case 1:
- op = args[i];
- cmd = op & FUTEX_CMD_MASK;
if (cmd <= FUTEX_LOCK_PI2)
seq_buf_printf(s, ", %s", __futex_cmds[cmd]);
else
--
2.51.0
^ permalink raw reply related
* [PATCH v3 2/3] tracing: Update futex syscall trace event to show more commands
From: Steven Rostedt @ 2026-03-31 18:13 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Thomas Gleixner, Peter Zijlstra, Brian Geffon, John Stultz,
Ian Rogers, Suleiman Souhlal
In-Reply-To: <20260331181349.062575155@kernel.org>
From: Steven Rostedt <rostedt@goodmis.org>
Make the futex syscall trace event a little more smart. Have it read the
futex_op instruction to determine what else it can save and print. For the
appropriate options, it will read the utime (timespec) parameter and show
its output as well as the uaddr2.
futex_requeue_p-1154 [004] ..... 144.568339: sys_futex(uaddr: 0x5652b178d834 (0x482), FUTEX_UNLOCK_PI|FUTEX_PRIVATE_FLAG, val: 0)
futex_requeue_p-1162 [002] ..... 144.568696: sys_futex(uaddr: 0x7f763b7fece0 (2), FUTEX_WAIT|FUTEX_PRIVATE_FLAG, val: 2)
futex_requeue_p-1151 [000] ..... 144.568700: sys_futex(uaddr: 0x7f763b7fece0 (0), FUTEX_WAKE|FUTEX_PRIVATE_FLAG, val: 1)
futex_requeue_p-1162 [002] ..... 144.568705: sys_futex(uaddr: 0x7f763b7fece0 (0), FUTEX_WAKE|FUTEX_PRIVATE_FLAG, val: 1)
futex_requeue_p-1151 [000] ..... 144.568715: sys_futex(uaddr: 0x7f764369e990 (0x483), FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, val: 1155)
futex_requeue_p-1155 [005] ..... 144.569420: sys_futex(uaddr: 0x5652b178d838 (0), FUTEX_WAIT_REQUEUE_PI|FUTEX_PRIVATE_FLAG, val: 0, timespec: 0x7ffdacfba500 (143.890024054), uaddr2: 0x5652b178d834 (0), val3: 0)
futex_requeue_p-1155 [005] ..... 144.569454: sys_futex(uaddr: 0x5652b178d834 (0), FUTEX_LOCK_PI|FUTEX_PRIVATE_FLAG, val: 0)
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
include/linux/futex.h | 35 ++++++++-
kernel/futex/syscalls.c | 48 ++++++-------
kernel/trace/trace_syscalls.c | 129 +++++++++++++++++++++++++++++-----
3 files changed, 164 insertions(+), 48 deletions(-)
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 7eaf01ff68cf..ec0d9cfa8a59 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -82,8 +82,35 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
u32 __user *uaddr2, u32 val2, u32 val3);
int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4);
+static __always_inline bool futex_cmd_has_timeout(u32 cmd)
+{
+ switch (cmd) {
+ case FUTEX_WAIT:
+ case FUTEX_LOCK_PI:
+ case FUTEX_LOCK_PI2:
+ case FUTEX_WAIT_BITSET:
+ case FUTEX_WAIT_REQUEUE_PI:
+ return true;
+ }
+ return false;
+}
+
+static __always_inline bool futex_cmd_has_addr2(u32 cmd)
+{
+ switch (cmd) {
+ case FUTEX_REQUEUE:
+ case FUTEX_CMP_REQUEUE:
+ case FUTEX_WAKE_OP:
+ case FUTEX_WAIT_REQUEUE_PI:
+ return true;
+ }
+ return false;
+}
+
#ifdef CONFIG_FTRACE_SYSCALLS
-void futex_print_syscall(struct seq_buf *s, int nr_args, unsigned long *args, u32 *ptr);
+void futex_print_syscall(struct seq_buf *s, int nr_args, unsigned long *args,
+ u32 *ptr1, u32 *ptr2, unsigned long *ts1,
+ unsigned long *ts2);
extern const char * __futex_cmds[];
#endif
@@ -120,7 +147,11 @@ static inline int futex_hash_allocate_default(void)
static inline int futex_hash_free(struct mm_struct *mm) { return 0; }
static inline int futex_mm_init(struct mm_struct *mm) { return 0; }
static inline void futex_print_syscall(struct seq_buf *s, int nr_args,
- unsigned long *args, u32 *ptr) { }
+ unsigned long *args, u32 *ptr1,
+ u32 *ptr2, unsigned long *ts1,
+ unsigned long *ts2) { }
+static __always_inline bool futex_cmd_has_timeout(u32 cmd) { return false; }
+static __always_inline bool futex_cmd_has_addr2(u32 cmd) { return false; }
#endif
#endif
diff --git a/kernel/futex/syscalls.c b/kernel/futex/syscalls.c
index 3ba8ca017c9c..265ec0a236d0 100644
--- a/kernel/futex/syscalls.c
+++ b/kernel/futex/syscalls.c
@@ -158,31 +158,6 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
return -ENOSYS;
}
-static __always_inline bool futex_cmd_has_timeout(u32 cmd)
-{
- switch (cmd) {
- case FUTEX_WAIT:
- case FUTEX_LOCK_PI:
- case FUTEX_LOCK_PI2:
- case FUTEX_WAIT_BITSET:
- case FUTEX_WAIT_REQUEUE_PI:
- return true;
- }
- return false;
-}
-
-static __always_inline bool futex_cmd_has_addr2(u32 cmd)
-{
- switch (cmd) {
- case FUTEX_REQUEUE:
- case FUTEX_CMP_REQUEUE:
- case FUTEX_WAKE_OP:
- case FUTEX_WAIT_REQUEUE_PI:
- return true;
- }
- return false;
-}
-
static __always_inline int
futex_init_timeout(u32 cmd, u32 op, struct timespec64 *ts, ktime_t *t)
{
@@ -231,7 +206,9 @@ const char * __futex_cmds[] =
"FUTEX_LOCK_PI2", NULL
};
-void futex_print_syscall(struct seq_buf *s, int nr_args, unsigned long *args, u32 *ptr)
+void futex_print_syscall(struct seq_buf *s, int nr_args, unsigned long *args,
+ u32 *ptr1, u32 *ptr2, unsigned long *ts1,
+ unsigned long *ts2)
{
unsigned int op, cmd;
bool done = false;
@@ -244,8 +221,8 @@ void futex_print_syscall(struct seq_buf *s, int nr_args, unsigned long *args, u3
switch (i) {
case 0:
seq_buf_printf(s, "uaddr: 0x%lx", args[i]);
- if (ptr) {
- u32 val = *ptr;
+ if (ptr1) {
+ u32 val = *ptr1;
if (val < 10)
seq_buf_printf(s, " (%u)", val);
else
@@ -287,6 +264,15 @@ void futex_print_syscall(struct seq_buf *s, int nr_args, unsigned long *args, u3
continue;
seq_buf_printf(s, ", timespec: 0x%lx", args[i]);
+
+ if (!ts1 || !ts2)
+ continue;
+
+ if (!*ts1 && !*ts2) {
+ seq_buf_puts(s, " (0)");
+ continue;
+ }
+ seq_buf_printf(s, " (%lu.%09lu)", *ts1, *ts2);
continue;
case 4:
if (!futex_cmd_has_addr2(cmd)) {
@@ -294,6 +280,12 @@ void futex_print_syscall(struct seq_buf *s, int nr_args, unsigned long *args, u3
continue;
}
seq_buf_printf(s, ", uaddr2: 0x%lx", args[i]);
+
+ if (!ptr2)
+ continue;
+
+ seq_buf_printf(s, " (%x)", *ptr2);
+
continue;
case 5:
seq_buf_printf(s, ", val3: %lu", args[i]);
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index f8213d772f89..0c86986ec7c4 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -239,21 +239,35 @@ sys_enter_openat_print(struct syscall_trace_enter *trace, struct syscall_metadat
return trace_handle_return(s);
}
+struct futex_data {
+ u32 val1;
+ u32 val2;
+ unsigned long ts1;
+ unsigned long ts2;
+};
+
static enum print_line_t
sys_enter_futex_print(struct syscall_trace_enter *trace, struct syscall_metadata *entry,
struct trace_seq *s, struct trace_event *event, int ent_size)
{
+ struct futex_data *data;
void *end = (void *)trace + ent_size;
- void *ptr;
+ unsigned long *ts1 = NULL, *ts2 = NULL;
+ u32 *ptr1 = NULL, *ptr2 = NULL;
/* Set ptr to the user space copied area */
- ptr = (void *)trace->args + sizeof(unsigned long) * entry->nb_args;
- if (ptr + 4 > end)
- ptr = NULL;
+ data = (void *)trace->args + sizeof(unsigned long) * entry->nb_args;
+ if ((void *)data + sizeof(*data) <= end) {
+ ptr1 = &data->val1;
+ ptr2 = &data->val2;
+ ts1 = &data->ts1;
+ ts2 = &data->ts2;
+ }
trace_seq_printf(s, "%s(", entry->name);
- futex_print_syscall(&s->seq, entry->nb_args, trace->args, ptr);
+ futex_print_syscall(&s->seq, entry->nb_args, trace->args, ptr1, ptr2,
+ ts1, ts2);
trace_seq_puts(s, ")\n");
@@ -472,9 +486,9 @@ sys_enter_futex_print_fmt(struct syscall_metadata *entry, char *buf, int len)
pos += snprintf(buf + pos, LEN_OR_ZERO,
"\"uaddr: 0x%%lx (0x%%lx) cmd=%%s%%s%%s");
pos += snprintf(buf + pos, LEN_OR_ZERO,
- " val: 0x%%x timeout/val2: 0x%%llx");
+ " val: 0x%%x timeout/val2: 0x%%llx (%%lu.%%lu)");
pos += snprintf(buf + pos, LEN_OR_ZERO,
- " uaddr2: 0x%%lx val3: 0x%%x\", ");
+ " uaddr2: 0x%%lx (0x%%lx) val3: 0x%%x\", ");
pos += snprintf(buf + pos, LEN_OR_ZERO,
" REC->uaddr,");
@@ -499,10 +513,12 @@ sys_enter_futex_print_fmt(struct syscall_metadata *entry, char *buf, int len)
FUTEX_CLOCK_REALTIME);
pos += snprintf(buf + pos, LEN_OR_ZERO,
- " REC->val, REC->utime,");
+ " REC->val, REC->utime, REC->__ts1, REC->__ts2,");
pos += snprintf(buf + pos, LEN_OR_ZERO,
- " REC->uaddr, REC->val3");
+ " REC->uaddr,");
+ pos += snprintf(buf + pos, LEN_OR_ZERO,
+ " REC->__value2, REC->val3");
return pos;
}
@@ -605,7 +621,39 @@ static int __init futex_fields(struct trace_event_call *call, int offset)
ret = trace_define_field(call, "u32", arg, offset, sizeof(int), 0,
FILTER_OTHER);
if (ret)
- kfree(arg);
+ goto free;
+ offset += sizeof(int);
+
+ arg = kstrdup("__value2", GFP_KERNEL);
+ if (WARN_ON_ONCE(!arg))
+ return -ENOMEM;
+ ret = trace_define_field(call, "u32", arg, offset, sizeof(int), 0,
+ FILTER_OTHER);
+ if (ret)
+ goto free;
+ offset += sizeof(int);
+
+ arg = kstrdup("__ts1", GFP_KERNEL);
+ if (WARN_ON_ONCE(!arg))
+ return -ENOMEM;
+ ret = trace_define_field(call, "unsigned long", arg, offset,
+ sizeof(unsigned long), 0, FILTER_OTHER);
+ if (ret)
+ goto free;
+ offset += sizeof(long);
+
+ arg = kstrdup("__ts2", GFP_KERNEL);
+ if (WARN_ON_ONCE(!arg))
+ return -ENOMEM;
+ ret = trace_define_field(call, "unsigned long", arg, offset,
+ sizeof(unsigned long), 0, FILTER_OTHER);
+ if (ret)
+ goto free;
+
+ return 0;
+
+free:
+ kfree(arg);
return ret;
}
@@ -778,11 +826,51 @@ static int syscall_copy_user_array(char *buf, const char __user *ptr,
return 0;
}
+struct tp_futex_data {
+ u32 cmd;
+ const u32 __user *val1;
+ const u32 __user *val2;
+ void __user *timeout;
+};
+
+static int syscall_copy_futex(char *buf, const char __user *ptr,
+ size_t size, void *data)
+{
+ struct tp_futex_data *tp_data = data;
+ struct futex_data *fdata = (void *)buf;
+ int cmd = tp_data->cmd & FUTEX_CMD_MASK;
+ int ret;
+
+ memset(fdata, 0, sizeof(*fdata));
+
+ if (tp_data->val1) {
+ ret = __copy_from_user(&fdata->val1, tp_data->val1, 4);
+ if (ret)
+ return -1;
+ }
+
+ if (tp_data->val2 && futex_cmd_has_addr2(cmd)) {
+ ret = __copy_from_user(&fdata->val2, tp_data->val2, 4);
+ if (ret)
+ return -1;
+ }
+
+ if (tp_data->timeout && futex_cmd_has_timeout(cmd)) {
+ /* Copies both ts1 and ts2 */
+ ret = __copy_from_user(&fdata->ts1, tp_data->timeout,
+ sizeof(long) * 2);
+ if (ret)
+ return -1;
+ }
+
+ return 0;
+}
+
static int
syscall_get_futex(unsigned long *args, char **buffer, int *size, int buf_size)
{
struct syscall_user_buffer *sbuf;
- const char __user *ptr;
+ struct tp_futex_data tp_data;
/* buf_size of zero means user doesn't want user space read */
if (!buf_size)
@@ -793,14 +881,18 @@ syscall_get_futex(unsigned long *args, char **buffer, int *size, int buf_size)
if (!sbuf)
return -1;
- ptr = (char __user *)args[0];
+ tp_data.cmd = args[1];
+ tp_data.val1 = (u32 __user *)args[0];
+ tp_data.val2 = (u32 __user *)args[4];
+ tp_data.timeout = (u64 __user *)args[3];
- *buffer = trace_user_fault_read(&sbuf->buf, ptr, 4, NULL, NULL);
+ *buffer = trace_user_fault_read(&sbuf->buf, NULL, 0,
+ syscall_copy_futex, &tp_data);
if (!*buffer)
return -1;
- /* Add room for the value */
- *size += 4;
+ /* Add room for values */
+ *size += sizeof(struct futex_data);
return 0;
}
@@ -809,12 +901,13 @@ static void syscall_put_futex(struct syscall_metadata *sys_data,
struct syscall_trace_enter *entry,
char *buffer)
{
- u32 *ptr;
+ struct futex_data *fdata = (void *)buffer;
+ struct futex_data *data;
/* Place the futex key into the storage */
- ptr = (void *)entry->args + sizeof(unsigned long) * sys_data->nb_args;
+ data = (void *)entry->args + sizeof(unsigned long) * sys_data->nb_args;
- *ptr = *(u32 *)buffer;
+ *data = *fdata;
}
static char *sys_fault_user(unsigned int buf_size,
--
2.51.0
^ permalink raw reply related
* Re: [PATCH 1/2] selftests/tracing: Fix to make --logdir option work again
From: Shuah Khan @ 2026-03-31 19:07 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Steven Rostedt, Shuah Khan, Gabriele Monaco, Mathieu Desnoyers,
linux-kernel, linux-trace-kernel, linux-kselftest, Shuah Khan
In-Reply-To: <20260304110425.3858532b6ba092d84a31595a@kernel.org>
On 3/3/26 19:04, Masami Hiramatsu (Google) wrote:
> Hi Shuah,
>
> Could you pick these 2 patches to the selftests tree or should I pick it?
>
> Thanks,
>
> On Tue, 10 Feb 2026 18:54:12 +0900
> "Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
>
>> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
>>
>
Done. Sorry for the delay on this. I applied it to my next to send
it for Linux 7.1
thanks,
-- Shuah
^ permalink raw reply
* [PATCH] tracing: Allow backup to save persistent ring buffer before it starts
From: Steven Rostedt @ 2026-03-31 20:39 UTC (permalink / raw)
To: LKML, Linux Trace Kernel; +Cc: Masami Hiramatsu, Mathieu Desnoyers, John Stultz
From: Steven Rostedt <rostedt@goodmis.org>
When the persistent ring buffer was first introduced, it did not make
sense to start tracing for it on the kernel command line. That's because
if there was a crash, the start of events would invalidate the events from
the previous boot that had the crash.
But now that there's a "backup" instance that can take a snapshot of the
persistent ring buffer when boot starts, it is possible to have the
persistent ring buffer start events at boot up and not lose the old events.
Update the code where the boot events start after all boot time instances
are created. This will allow the backup instance to copy the persistent
ring buffer from the previous boot, and allow the persistent ring buffer
to start tracing new events for the current boot.
reserve_mem=100M:12M:trace trace_instance=boot_mapped^@trace,sched trace_instance=backup=boot_mapped
The above will create a boot_mapped persistent ring buffer and enabled the
scheduler events. If there's a crash, a "backup" instance will be created
holding the events of the persistent ring buffer from the previous boot,
while the persistent ring buffer will once again start tracing scheduler
events of the current boot.
Now the user doesn't have to remember to start the persistent ring buffer.
It will always have the events started at each boot.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/trace.c | 22 ++++++++++++++++++++++
kernel/trace/trace.h | 5 ++++-
2 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ef55b48064da..5b46083f6f94 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -10871,9 +10871,31 @@ __init static void enable_instances(void)
tr->range_name = no_free_ptr(rname);
}
+ /*
+ * Save the events to start and enabled them after all boot instances
+ * have been created.
+ */
+ tr->boot_events = curr_str;
+ }
+
+ /* Enable the events after all boot instances have been created */
+ list_for_each_entry(tr, &ftrace_trace_arrays, list) {
+
+ if (!tr->boot_events || !(*tr->boot_events)) {
+ tr->boot_events = NULL;
+ continue;
+ }
+
+ curr_str = tr->boot_events;
+
+ /* Clear the instance if this is a persistent buffer */
+ if (tr->flags & TRACE_ARRAY_FL_LAST_BOOT)
+ update_last_data(tr);
+
while ((tok = strsep(&curr_str, ","))) {
early_enable_events(tr, tok, true);
}
+ tr->boot_events = NULL;
}
}
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index b001fbba0881..e68f9c2027eb 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -405,7 +405,10 @@ struct trace_array {
unsigned char trace_flags_index[TRACE_FLAGS_MAX_SIZE];
unsigned int flags;
raw_spinlock_t start_lock;
- const char *system_names;
+ union {
+ const char *system_names;
+ char *boot_events;
+ };
struct list_head err_log;
struct dentry *dir;
struct dentry *options;
--
2.51.0
^ permalink raw reply related
* Re: [PATCH v9 2/3] tracing: Remove the backup instance automatically after read
From: Steven Rostedt @ 2026-03-31 21:19 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <177497475349.569199.11513916633426967730.stgit@mhiramat.tok.corp.google.com>
On Wed, 1 Apr 2026 01:32:33 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 8cec7bd70438..1d73400a01c7 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -539,8 +539,65 @@ void trace_set_ring_buffer_expanded(struct trace_array *tr)
> tr->ring_buffer_expanded = true;
> }
>
> +static int __remove_instance(struct trace_array *tr);
> +
> +static void trace_array_autoremove(struct work_struct *work)
> +{
> + struct trace_array *tr = container_of(work, struct trace_array, autoremove_work);
> +
> + guard(mutex)(&event_mutex);
> + guard(mutex)(&trace_types_lock);
Hmm, should we do a check if the tr still exists? Couldn't the user delete
this via a rmdir after the last file closed and this was kicked?
CPU 0 CPU 1
----- -----
open(trace_pipe);
read(..);
close(trace_pipe);
kick the work queue to delete it....
rmdir();
[instance deleted]
__remove_instance();
[ now the tr is freed, and the remove will crash!]
What would prevent this is this is to use trace_array_destroy() that checks
this and also adds the proper locking:
static void trace_array_autoremove(struct work_struct *work)
{
struct trace_array *tr = container_of(work, struct trace_array, autoremove_work);
trace_array_destroy(tr);
}
> +
> + /*
> + * This can be fail if someone gets @tr before starting this
> + * function, but in that case, this will be kicked again when
> + * putting it. So we don't care about the result.
> + */
> + __remove_instance(tr);
> +}
> +
> +static struct workqueue_struct *autoremove_wq;
> +
> +static void trace_array_kick_autoremove(struct trace_array *tr)
> +{
> + if (autoremove_wq && !work_pending(&tr->autoremove_work))
> + queue_work(autoremove_wq, &tr->autoremove_work);
Doesn't queue_work() check if it's pending? Do we really need to check it
twice?
> +}
> +
> +static void trace_array_cancel_autoremove(struct trace_array *tr)
> +{
> + if (autoremove_wq && work_pending(&tr->autoremove_work))
> + cancel_work(&tr->autoremove_work);
Same here, as can't this be racy?
> +}
> +
> +static void trace_array_init_autoremove(struct trace_array *tr)
> +{
> + INIT_WORK(&tr->autoremove_work, trace_array_autoremove);
> +}
> +
> +static void trace_array_start_autoremove(void)
> +{
> + if (autoremove_wq)
> + return;
> +
> + autoremove_wq = alloc_workqueue("tr_autoremove_wq",
> + WQ_UNBOUND | WQ_HIGHPRI, 0);
> + if (!autoremove_wq)
> + pr_warn("Unable to allocate tr_autoremove_wq. autoremove
> disabled.\n"); +}
> +
> LIST_HEAD(ftrace_trace_arrays);
-- Steve
^ permalink raw reply
* Re: [PATCH v9 3/3] tracing/Documentation: Add a section about backup instance
From: Steven Rostedt @ 2026-03-31 21:21 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <177497476117.569199.18085846838539980210.stgit@mhiramat.tok.corp.google.com>
On Wed, 1 Apr 2026 01:32:41 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
>
> Add a section about backup instance to the debugging.rst.
>
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> ---
> Changes in v6:
> - Fix typos.
> ---
> Documentation/trace/debugging.rst | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)
>
> diff --git a/Documentation/trace/debugging.rst b/Documentation/trace/debugging.rst
> index 4d88c346fc38..15857951b506 100644
> --- a/Documentation/trace/debugging.rst
> +++ b/Documentation/trace/debugging.rst
> @@ -159,3 +159,22 @@ If setting it from the kernel command line, it is recommended to also
> disable tracing with the "traceoff" flag, and enable tracing after boot up.
> Otherwise the trace from the most recent boot will be mixed with the trace
> from the previous boot, and may make it confusing to read.
> +
> +Using a backup instance for keeping previous boot data
> +------------------------------------------------------
> +
> +It is also possible to record trace data at system boot time by specifying
> +events with the persistent ring buffer, but in this case the data before the
> +reboot will be lost before it can be read. This problem can be solved by a
> +backup instance. From the kernel command line::
> +
> + reserve_mem=12M:4096:trace trace_instance=boot_map@trace,sched,irq trace_instance=backup=boot_map
The above didn't actually work well without my patch to enable events on
the persistent ring buffer with the backup. But keep it, as it will work
with my patch ;-)
-- Steve
> +
> +On boot up, the previous data in the "boot_map" is copied to the "backup"
> +instance, and the "sched:*" and "irq:*" events for the current boot are traced
> +in the "boot_map". Thus the user can read the previous boot data from the "backup"
> +instance without stopping the trace.
> +
> +Note that this "backup" instance is readonly, and will be removed automatically
> +if you clear the trace data or read out all trace data from the "trace_pipe"
> +or the "trace_pipe_raw" files.
> \ No newline at end of file
^ permalink raw reply
* Re: [PATCH RFC v4 25/44] KVM: selftests: Test basic single-page conversion flow
From: Ackerley Tng @ 2026-03-31 22:33 UTC (permalink / raw)
To: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jroedel, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Jason Gunthorpe,
Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm
In-Reply-To: <20260326-gmem-inplace-conversion-v4-25-e202fe950ffd@google.com>
Ackerley Tng <ackerleytng@google.com> writes:
> Add a selftest for the guest_memfd memory attribute conversion ioctls.
> The test starts the guest_memfd as all-private (the default state), and
> verifies the basic flow of converting a single page to shared and then back
> to private.
>
> Add infrastructure that supports extensions to other conversion flow
> tests. This infrastructure will be used in upcoming patches for other
> conversion tests.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> tools/testing/selftests/kvm/Makefile.kvm | 1 +
> .../selftests/kvm/guest_memfd_conversions_test.c | 205 +++++++++++++++++++++
> 2 files changed, 206 insertions(+)
>
> diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
> index dc68371f76a33..0e2a9adfca57e 100644
> --- a/tools/testing/selftests/kvm/Makefile.kvm
> +++ b/tools/testing/selftests/kvm/Makefile.kvm
> @@ -147,6 +147,7 @@ TEST_GEN_PROGS_x86 += access_tracking_perf_test
> TEST_GEN_PROGS_x86 += coalesced_io_test
> TEST_GEN_PROGS_x86 += dirty_log_perf_test
> TEST_GEN_PROGS_x86 += guest_memfd_test
> +TEST_GEN_PROGS_x86 += guest_memfd_conversions_test
> TEST_GEN_PROGS_x86 += hardware_disable_test
> TEST_GEN_PROGS_x86 += memslot_modification_stress_test
> TEST_GEN_PROGS_x86 += memslot_perf_test
> diff --git a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
> new file mode 100644
> index 0000000000000..841b2824ae996
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
> @@ -0,0 +1,205 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2024, Google LLC.
> + */
> +#include <sys/mman.h>
> +#include <unistd.h>
> +
> +#include <linux/align.h>
> +#include <linux/kvm.h>
> +#include <linux/sizes.h>
> +
> +#include "kvm_util.h"
> +#include "kselftest_harness.h"
> +#include "test_util.h"
> +#include "ucall_common.h"
> +
> +FIXTURE(gmem_conversions) {
> + struct kvm_vcpu *vcpu;
> + int gmem_fd;
> + /* HVA of the first byte of the memory mmap()-ed from gmem_fd. */
> + char *mem;
> +};
> +
> +typedef FIXTURE_DATA(gmem_conversions) test_data_t;
> +
> +FIXTURE_SETUP(gmem_conversions) { }
> +
> +static uint64_t page_size;
> +
> +static void guest_do_rmw(void);
> +#define GUEST_MEMFD_SHARING_TEST_GVA 0x90000000ULL
> +
> +/*
> + * Defer setup until the individual test is invoked so that tests can specify
> + * the number of pages and flags for the guest_memfd instance.
> + */
> +static void gmem_conversions_do_setup(test_data_t *t, int nr_pages,
> + int gmem_flags)
> +{
> + const struct vm_shape shape = {
> + .mode = VM_MODE_DEFAULT,
> + .type = KVM_X86_SW_PROTECTED_VM,
> + };
> + /*
> + * Use high GPA above APIC_DEFAULT_PHYS_BASE to avoid clashing with
> + * APIC_DEFAULT_PHYS_BASE.
> + */
> + const uint64_t gpa = SZ_4G;
> + const uint32_t slot = 1;
> + u64 supported_flags;
> + struct kvm_vm *vm;
> +
> + vm = __vm_create_shape_with_one_vcpu(shape, &t->vcpu, nr_pages, guest_do_rmw);
> +
> + supported_flags = vm_check_cap(vm, KVM_CAP_MEMORY_ATTRIBUTES2_FLAGS);
> + TEST_REQUIRE(supported_flags & KVM_SET_MEMORY_ATTRIBUTES2_PRESERVE);
> +
> + vm_mem_add(vm, VM_MEM_SRC_SHMEM, gpa, slot, nr_pages,
> + KVM_MEM_GUEST_MEMFD, -1, 0, gmem_flags);
> +
> + t->gmem_fd = kvm_slot_to_fd(vm, slot);
> + t->mem = addr_gpa2hva(vm, gpa);
> + virt_map(vm, GUEST_MEMFD_SHARING_TEST_GVA, gpa, nr_pages);
> +}
> +
> +static void gmem_conversions_do_teardown(test_data_t *t)
> +{
> + /* No need to close gmem_fd, it's owned by the VM structure. */
> + kvm_vm_free(t->vcpu->vm);
> +}
> +
> +FIXTURE_TEARDOWN(gmem_conversions)
> +{
> + gmem_conversions_do_teardown(self);
> +}
> +
> +/*
> + * In these test definition macros, __nr_pages and nr_pages is used to set up
> + * the total number of pages in the guest_memfd under test. This will be
> + * available in the test definitions as nr_pages.
> + */
> +
> +#define __GMEM_CONVERSION_TEST(test, __nr_pages, flags) \
> +static void __gmem_conversions_##test(test_data_t *t, int nr_pages); \
> + \
> +TEST_F(gmem_conversions, test) \
> +{ \
> + gmem_conversions_do_setup(self, __nr_pages, flags); \
> + __gmem_conversions_##test(self, __nr_pages); \
> +} \
> +static void __gmem_conversions_##test(test_data_t *t, int nr_pages) \
> +
> +#define GMEM_CONVERSION_TEST(test, __nr_pages, flags) \
> + __GMEM_CONVERSION_TEST(test, __nr_pages, (flags) | GUEST_MEMFD_FLAG_MMAP)
> +
> +#define __GMEM_CONVERSION_TEST_INIT_PRIVATE(test, __nr_pages) \
> + GMEM_CONVERSION_TEST(test, __nr_pages, 0)
> +
> +#define GMEM_CONVERSION_TEST_INIT_PRIVATE(test) \
> + __GMEM_CONVERSION_TEST_INIT_PRIVATE(test, 1)
> +
> +struct guest_check_data {
> + void *mem;
> + char expected_val;
> + char write_val;
> +};
> +static struct guest_check_data guest_data;
> +
> +static void guest_do_rmw(void)
> +{
> + for (;;) {
> + char *mem = READ_ONCE(guest_data.mem);
> +
> + GUEST_ASSERT_EQ(READ_ONCE(*mem), READ_ONCE(guest_data.expected_val));
> + WRITE_ONCE(*mem, READ_ONCE(guest_data.write_val));
> +
> + GUEST_SYNC(0);
> + }
> +}
> +
> +static void run_guest_do_rmw(struct kvm_vcpu *vcpu, loff_t pgoff,
> + char expected_val, char write_val)
> +{
> + struct ucall uc;
> + int r;
> +
> + guest_data.mem = (void *)GUEST_MEMFD_SHARING_TEST_GVA + pgoff * page_size;
> + guest_data.expected_val = expected_val;
> + guest_data.write_val = write_val;
> + sync_global_to_guest(vcpu->vm, guest_data);
> +
> + do {
> + r = __vcpu_run(vcpu);
> + } while (r == -1 && errno == EINTR);
> +
> + TEST_ASSERT_EQ(r, 0);
TEST_ASSERT_EQ() ends up calling exit() on failures, which skips
FIXTURE_TEARDOWN().
Other than the explicit assertions not working with the
kselftest_harness, kvm selftest library functions like vm_mem_add() also
call TEST_ASSERT, which doesn't play nice with kselftest_harness.
Any suggestions for this? Should we use the kselftest framework with
these tests?
(I ran into this issue while trying to test something else, where I
needed FIXTURE_TEARDOWN() to clean up system state.)
Or is it "okay" in this case since FIXTURE_TEARDOWN() only cleans up
stuff that would happen if the program exits anyway?
>
> [...snip...]
>
^ permalink raw reply
* Re: [PATCH RFC v4 08/44] KVM: Introduce KVM_SET_MEMORY_ATTRIBUTES2
From: Michael Roth @ 2026-03-31 22:53 UTC (permalink / raw)
To: Ackerley Tng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jroedel, jthoughton, oupton, pankaj.gupta,
qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm
In-Reply-To: <20260326-gmem-inplace-conversion-v4-8-e202fe950ffd@google.com>
On Thu, Mar 26, 2026 at 03:24:17PM -0700, Ackerley Tng wrote:
> Introduce a "version 2" of KVM_SET_MEMORY_ATTRIBUTES to support returning
> information back to userspace.
Hi Ackerley,
Not trying to bikeshed below, but I'm working on getting related QEMU
patches cleaned up to post soon and was working through some of the new
uAPI bits, and plumbing some of these capabilities in seems a little
awkward in a couple places so wondering if we should revisit how some of
this API is defined...
>
> This new ioctl and structure will, in a later patch, be shared as a
> guest_memfd ioctl, where the padding in the new kvm_memory_attributes2
> structure will be for writing the response from the guest_memfd ioctl to
> userspace.
>
> A new ioctl is necessary for these reasons:
>
> 1. KVM_SET_MEMORY_ATTRIBUTES is currently a write-only ioctl and does not
> allow userspace to read fields. There's nothing in code (yet?) that
> validates this, but using _IOWR for consistency would be prudent.
>
> 2. KVM_SET_MEMORY_ATTRIBUTES, when used as a guest_memfd ioctl, will need
> an additional field to provide userspace with more error details.
>
> Alternatively, a completely new ioctl could be defined, unrelated to
> KVM_SET_MEMORY_ATTRIBUTES, but using the same ioctl number and struct for
> the vm and guest_memfd ioctls streamlines the interface for userspace. In
> addition, any memory attributes, implemented on the vm or guest_memfd
> ioctl, can be easily shared with the other.
>
> Add KVM_CAP_MEMORY_ATTRIBUTES2 to indicate that struct
> kvm_memory_attributes2 exists and can be used either with
> KVM_SET_MEMORY_ATTRIBUTES2 via the vm or guest_memfd ioctl.
The guest_memfd support for the KVM_SET_MEMORY_ATTRIBUTES2 ioctl isn't
added until patch #10, and to scan for it you sort of need to infer it
via KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES reporting non-zero (i.e.
KVM_MEMORY_ATTRIBUTE_PRIVATE), so it's confusing to state that
KVM_CAP_MEMORY_ATTRIBUTES2 means you can use the struct via a guest_memfd
ioctl.
I think the above is trying to simply say that the corresponding struct
exists, and remain agnostic about how it can be used. But if that were
the case, there would be no way to know when KVM_SET_MEMORY_ATTRIBUTES2 is
available in the first place, so in the case of KVM ioctls at least,
KVM_CAP_MEMORY_ATTRIBUTES2 is advertising both the struct and the ioctl,
whereas for guest_memfd it's only advertising the struct and not saying
anything about whether a similar gmem ioctl is available to use it.
Instead, maybe they should both have the same semantics:
KVM_CAP_MEMORY_ATTRIBUTES2: *SET_ATTRIBUTES* ioctl exists for KVM that utilizes
struct kvm_memory_attributes2
KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES: *SET_ATTRIBUTES* ioctl exists for
guest_memfd that utilizes struct kvm_memory_attributes2
In which case you would leave out any mention of guest_memfd here as far as
the documentation does, and then in patch #10 you could modify it to be
something like:
4.145 KVM_SET_MEMORY_ATTRIBUTES2
---------------------------------
-:Capability: KVM_CAP_MEMORY_ATTRIBUTES2
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES2, KVM_GUEST_MEMFD_CAP_MEMORY_ATTRIBUTES
-:Architectures: x86
+:Architectures: all
-:Type: vm ioctl
+:Type: vm, guest_memfd ioctl
:Parameters: struct kvm_memory_attributes2 (in/out)
:Returns: 0 on success, <0 on error
and *then* add in your mentions of how the usage/fields differ for
guest_memfd/KVM_GUEST_MEMFD_CAP_MEMORY_ATTRIBUTES case vs. KVM ioctls.
This avoids needing to issue 2 checks for the guest_memfd variant vs. 1
for KVM, but more importantly avoids subtle differences in how these
similarly-named capabilities are used/documented that might cause
unecessary confusion.
Thanks,
Mike
>
> Handle KVM_CAP_MEMORY_ATTRIBUTES2 and return the same supported attributes
> as would be returned for KVM_CAP_MEMORY_ATTRIBUTES - the supported
> attributes are the same for now, regardless of the CAP requested.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
> Documentation/virt/kvm/api.rst | 32 ++++++++++++++++++++++++++++++++
> include/uapi/linux/kvm.h | 12 ++++++++++++
> virt/kvm/kvm_main.c | 40 +++++++++++++++++++++++++++++++++++++---
> 3 files changed, 81 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 032516783e962..0b61e2579e1d8 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6359,6 +6359,8 @@ S390:
> Returns -EINVAL if the VM has the KVM_VM_S390_UCONTROL flag set.
> Returns -EINVAL if called on a protected VM.
>
> +.. _KVM_SET_MEMORY_ATTRIBUTES:
> +
> 4.141 KVM_SET_MEMORY_ATTRIBUTES
> -------------------------------
>
> @@ -6551,6 +6553,36 @@ KVM_S390_KEYOP_SSKE
> Sets the storage key for the guest address ``guest_addr`` to the key
> specified in ``key``, returning the previous value in ``key``.
>
> +4.145 KVM_SET_MEMORY_ATTRIBUTES2
> +---------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES2
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct kvm_memory_attributes2 (in/out)
> +:Returns: 0 on success, <0 on error
> +
> +KVM_SET_MEMORY_ATTRIBUTES2 is an extension to
> +KVM_SET_MEMORY_ATTRIBUTES that supports returning (writing) values to
> +userspace. The original (pre-extension) fields are shared with
> +KVM_SET_MEMORY_ATTRIBUTES identically.
> +
> +Attribute values are shared with KVM_SET_MEMORY_ATTRIBUTES.
> +
> +::
> +
> + struct kvm_memory_attributes2 {
> + __u64 address;
> + __u64 size;
> + __u64 attributes;
> + __u64 flags;
> + __u64 reserved[12];
> + };
> +
> + #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
> +
> +See also: :ref: `KVM_SET_MEMORY_ATTRIBUTES`.
> +
> .. _kvm_run:
>
> 5. The kvm_run structure
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 80364d4dbebb0..16567d4a769e5 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -989,6 +989,7 @@ struct kvm_enable_cap {
> #define KVM_CAP_ARM_SEA_TO_USER 245
> #define KVM_CAP_S390_USER_OPEREXEC 246
> #define KVM_CAP_S390_KEYOP 247
> +#define KVM_CAP_MEMORY_ATTRIBUTES2 248
>
> struct kvm_irq_routing_irqchip {
> __u32 irqchip;
> @@ -1637,6 +1638,17 @@ struct kvm_memory_attributes {
> __u64 flags;
> };
>
> +/* Available with KVM_CAP_MEMORY_ATTRIBUTES2 */
> +#define KVM_SET_MEMORY_ATTRIBUTES2 _IOWR(KVMIO, 0xd2, struct kvm_memory_attributes2)
> +
> +struct kvm_memory_attributes2 {
> + __u64 address;
> + __u64 size;
> + __u64 attributes;
> + __u64 flags;
> + __u64 reserved[12];
> +};
> +
> #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
>
> #define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 70b594dafc5cc..3c261904322f0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2621,9 +2621,10 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> return r;
> }
> static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> - struct kvm_memory_attributes *attrs)
> + struct kvm_memory_attributes2 *attrs)
> {
> gfn_t start, end;
> + int i;
>
> /* flags is currently not used. */
> if (attrs->flags)
> @@ -2634,6 +2635,10 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> return -EINVAL;
> if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> return -EINVAL;
> + for (i = 0; i < ARRAY_SIZE(attrs->reserved); i++) {
> + if (attrs->reserved[i])
> + return -EINVAL;
> + }
>
> start = attrs->address >> PAGE_SHIFT;
> end = (attrs->address + attrs->size) >> PAGE_SHIFT;
> @@ -4966,6 +4971,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> case KVM_CAP_DEVICE_CTRL:
> return 1;
> #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> + case KVM_CAP_MEMORY_ATTRIBUTES2:
> case KVM_CAP_MEMORY_ATTRIBUTES:
> if (!vm_memory_attributes)
> return 0;
> @@ -5191,6 +5197,14 @@ do { \
> sizeof_field(struct kvm_userspace_memory_region2, field)); \
> } while (0)
>
> +#define SANITY_CHECK_MEMORY_ATTRIBUTES_FIELD(field) \
> +do { \
> + BUILD_BUG_ON(offsetof(struct kvm_memory_attributes, field) != \
> + offsetof(struct kvm_memory_attributes2, field)); \
> + BUILD_BUG_ON(sizeof_field(struct kvm_memory_attributes, field) != \
> + sizeof_field(struct kvm_memory_attributes2, field)); \
> +} while (0)
> +
> static long kvm_vm_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
> {
> @@ -5373,15 +5387,35 @@ static long kvm_vm_ioctl(struct file *filp,
> }
> #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
> #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> + case KVM_SET_MEMORY_ATTRIBUTES2:
> case KVM_SET_MEMORY_ATTRIBUTES: {
> - struct kvm_memory_attributes attrs;
> + struct kvm_memory_attributes2 attrs;
> + unsigned long size;
> +
> + if (ioctl == KVM_SET_MEMORY_ATTRIBUTES) {
> + /*
> + * Fields beyond struct kvm_memory_attributes shouldn't
> + * be accessed, but avoid leaking kernel memory in case
> + * of a bug.
> + */
> + memset(&attrs, 0, sizeof(attrs));
> + size = sizeof(struct kvm_memory_attributes);
> + } else {
> + size = sizeof(struct kvm_memory_attributes2);
> + }
> +
> + /* Ensure the common parts of the two structs are identical. */
> + SANITY_CHECK_MEMORY_ATTRIBUTES_FIELD(address);
> + SANITY_CHECK_MEMORY_ATTRIBUTES_FIELD(size);
> + SANITY_CHECK_MEMORY_ATTRIBUTES_FIELD(attributes);
> + SANITY_CHECK_MEMORY_ATTRIBUTES_FIELD(flags);
>
> r = -ENOTTY;
> if (!vm_memory_attributes)
> goto out;
>
> r = -EFAULT;
> - if (copy_from_user(&attrs, argp, sizeof(attrs)))
> + if (copy_from_user(&attrs, argp, size))
> goto out;
>
> r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
>
> --
> 2.53.0.1018.g2bb0e51243-goog
>
^ permalink raw reply
* Re: [PATCH RFC v4 10/44] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Michael Roth @ 2026-03-31 23:31 UTC (permalink / raw)
To: Ackerley Tng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jroedel, jthoughton, oupton, pankaj.gupta,
qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
tabba, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, Paolo Bonzini, Sean Christopherson,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm
In-Reply-To: <20260326-gmem-inplace-conversion-v4-10-e202fe950ffd@google.com>
On Thu, Mar 26, 2026 at 03:24:19PM -0700, Ackerley Tng wrote:
> For shared to private conversions, if refcounts on any of the folios
> within the range are elevated, fail the conversion with -EAGAIN.
>
> At the point of shared to private conversion, all folios in range are
> also unmapped. The filemap_invalidate_lock() is held, so no faulting
> can occur. Hence, from that point on, only transient refcounts can be
> taken on the folios associated with that guest_memfd.
>
> Hence, it is safe to do the conversion from shared to private.
>
> After conversion is complete, refcounts may become elevated, but that
> is fine since users of transient refcounts don't actually access
> memory.
>
> For private to shared conversions, there are no refcount checks, since
> the guest is the only user of private pages, and guest_memfd will be the
> only holder of refcounts on private pages.
I think KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES deserves some mention in
the commit log.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> Documentation/virt/kvm/api.rst | 48 +++++++-
> include/linux/kvm_host.h | 10 ++
> include/uapi/linux/kvm.h | 9 +-
> virt/kvm/Kconfig | 1 +
> virt/kvm/guest_memfd.c | 245 ++++++++++++++++++++++++++++++++++++++---
> virt/kvm/kvm_main.c | 17 ++-
> 6 files changed, 300 insertions(+), 30 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 0b61e2579e1d8..15148c80cfdb6 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -117,7 +117,7 @@ description:
> x86 includes both i386 and x86_64.
>
> Type:
> - system, vm, or vcpu.
> + system, vm, vcpu or guest_memfd.
>
> Parameters:
> what parameters are accepted by the ioctl.
> @@ -6557,11 +6557,22 @@ KVM_S390_KEYOP_SSKE
> ---------------------------------
>
> :Capability: KVM_CAP_MEMORY_ATTRIBUTES2
> -:Architectures: x86
> -:Type: vm ioctl
> +:Architectures: all
> +:Type: vm, guest_memfd ioctl
> :Parameters: struct kvm_memory_attributes2 (in/out)
> :Returns: 0 on success, <0 on error
>
> +Errors:
> +
> + ========== ===============================================================
> + EINVAL The specified `offset` or `size` were invalid (e.g. not
> + page aligned, causes an overflow, or size is zero).
> + EFAULT The parameter address was invalid.
> + EAGAIN Some page within requested range had unexpected refcounts. The
> + offset of the page will be returned in `error_offset`.
> + ENOMEM Ran out of memory trying to track private/shared state
> + ========== ===============================================================
> +
> KVM_SET_MEMORY_ATTRIBUTES2 is an extension to
> KVM_SET_MEMORY_ATTRIBUTES that supports returning (writing) values to
> userspace. The original (pre-extension) fields are shared with
> @@ -6572,15 +6583,42 @@ Attribute values are shared with KVM_SET_MEMORY_ATTRIBUTES.
> ::
>
> struct kvm_memory_attributes2 {
> - __u64 address;
> + /* in */
> + union {
> + __u64 address;
> + __u64 offset;
> + };
> __u64 size;
> __u64 attributes;
> __u64 flags;
> - __u64 reserved[12];
> + /* out */
> + __u64 error_offset;
> + __u64 reserved[11];
> };
>
> #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
>
> +Set attributes for a range of offsets within a guest_memfd to
> +KVM_MEMORY_ATTRIBUTE_PRIVATE to limit the specified guest_memfd backed
> +memory range for guest_use. Even if KVM_CAP_GUEST_MEMFD_MMAP is
> +supported, after a successful call to set
> +KVM_MEMORY_ATTRIBUTE_PRIVATE, the requested range will not be mappable
> +into host userspace and will only be mappable by the guest.
> +
> +To allow the range to be mappable into host userspace again, call
> +KVM_SET_MEMORY_ATTRIBUTES2 on the guest_memfd again with
> +KVM_MEMORY_ATTRIBUTE_PRIVATE unset.
> +
> +If this ioctl returns -EAGAIN, the offset of the page with unexpected
> +refcounts will be returned in `error_offset`. This can occur if there
> +are transient refcounts on the pages, taken by other parts of the
> +kernel.
That's only true for the guest_memfd ioctl, for KVM ioctl these new
fields and r/w behavior are basically ignored. So you might need to be
clearer on which fields/behavior are specific to guest_memfd like in
the preceeding paragraphs..
..or maybe it's better to do the opposite and just have a blanket 'for
now, all newly-described behavior pertains only to usage via a
guest_memfd ioctl, and for KVM ioctls only the fields/behaviors common
with KVM_SET_MEMORY_ATTRIBUTES are applicable.', since it doesn't seem
like vm_memory_attributes=1 is long for this world and that's the only
case where KVM memory attribute ioctls seem relevant.
But then it makes me wonder, if we adopt the semantics I mentioned
earlier and have KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES advertise both
the gmem ioctl support as well as the struct kvm_memory_attributes2
support, if we should even advertise KVM_CAP_MEMORY_ATTRIBUTES2 at all
as part of this series.
> +
> +Userspace is expected to figure out how to remove all known refcounts
> +on the shared pages, such as refcounts taken by get_user_pages(), and
> +try the ioctl again. A possible source of these long term refcounts is
> +if the guest_memfd memory was pinned in IOMMU page tables.
One might read this to mean error_offset is used purely for the EAGAIN
case, so it might be worth touching on the other cases as well.
-Mike
> +
> See also: :ref: `KVM_SET_MEMORY_ATTRIBUTES`.
>
> .. _kvm_run:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 19f026f8de390..1ea14c66fc82e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2514,6 +2514,16 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
> }
>
> #ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> +static inline u64 kvm_supported_mem_attributes(struct kvm *kvm)
> +{
> +#ifdef kvm_arch_has_private_mem
> + if (!kvm || kvm_arch_has_private_mem(kvm))
> + return KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +#endif
> +
> + return 0;
> +}
> +
> typedef unsigned long (kvm_get_memory_attributes_t)(struct kvm *kvm, gfn_t gfn);
> DECLARE_STATIC_CALL(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 16567d4a769e5..29baaa60de35a 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -990,6 +990,7 @@ struct kvm_enable_cap {
> #define KVM_CAP_S390_USER_OPEREXEC 246
> #define KVM_CAP_S390_KEYOP 247
> #define KVM_CAP_MEMORY_ATTRIBUTES2 248
> +#define KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES 249
>
> struct kvm_irq_routing_irqchip {
> __u32 irqchip;
> @@ -1642,11 +1643,15 @@ struct kvm_memory_attributes {
> #define KVM_SET_MEMORY_ATTRIBUTES2 _IOWR(KVMIO, 0xd2, struct kvm_memory_attributes2)
>
> struct kvm_memory_attributes2 {
> - __u64 address;
> + union {
> + __u64 address;
> + __u64 offset;
> + };
> __u64 size;
> __u64 attributes;
> __u64 flags;
> - __u64 reserved[12];
> + __u64 error_offset;
> + __u64 reserved[11];
> };
>
> #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 3fea89c45cfb4..e371e079e2c50 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -109,6 +109,7 @@ config KVM_VM_MEMORY_ATTRIBUTES
>
> config KVM_GUEST_MEMFD
> select XARRAY_MULTI
> + select KVM_MEMORY_ATTRIBUTES
> bool
>
> config HAVE_KVM_ARCH_GMEM_PREPARE
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index d414ebfcb4c19..0cff9a85a4c53 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -183,10 +183,12 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>
> static enum kvm_gfn_range_filter kvm_gmem_get_invalidate_filter(struct inode *inode)
> {
> - if (GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED)
> - return KVM_FILTER_SHARED;
> -
> - return KVM_FILTER_PRIVATE;
> + /*
> + * TODO: Limit invalidations based on the to-be-invalidated range, i.e.
> + * invalidate shared/private if and only if there can possibly be
> + * such mappings.
> + */
> + return KVM_FILTER_SHARED | KVM_FILTER_PRIVATE;
> }
>
> static void __kvm_gmem_invalidate_begin(struct gmem_file *f, pgoff_t start,
> @@ -552,11 +554,235 @@ unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> }
> EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_memory_attributes);
>
> +static bool kvm_gmem_range_has_attributes(struct maple_tree *mt,
> + pgoff_t index, size_t nr_pages,
> + u64 attributes)
> +{
> + pgoff_t end = index + nr_pages - 1;
> + void *entry;
> +
> + lockdep_assert(mt_lock_is_held(mt));
> +
> + mt_for_each(mt, entry, index, end) {
> + if (xa_to_value(entry) != attributes)
> + return false;
> + }
> +
> + return true;
> +}
> +
> +static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> + size_t nr_pages, pgoff_t *err_index)
> +{
> + struct address_space *mapping = inode->i_mapping;
> + const int filemap_get_folios_refcount = 1;
> + pgoff_t last = start + nr_pages - 1;
> + struct folio_batch fbatch;
> + bool safe = true;
> + int i;
> +
> + folio_batch_init(&fbatch);
> + while (safe && filemap_get_folios(mapping, &start, last, &fbatch)) {
> +
> + for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> + struct folio *folio = fbatch.folios[i];
> +
> + if (folio_ref_count(folio) !=
> + folio_nr_pages(folio) + filemap_get_folios_refcount) {
> + safe = false;
> + *err_index = folio->index;
> + break;
> + }
> + }
> +
> + folio_batch_release(&fbatch);
> + cond_resched();
> + }
> +
> + return safe;
> +}
> +
> +/*
> + * Preallocate memory for attributes to be stored on a maple tree, pointed to
> + * by mas. Adjacent ranges with attributes identical to the new attributes
> + * will be merged. Also sets mas's bounds up for storing attributes.
> + *
> + * This maintains the invariant that ranges with the same attributes will
> + * always be merged.
> + */
> +static int kvm_gmem_mas_preallocate(struct ma_state *mas, u64 attributes,
> + pgoff_t start, size_t nr_pages)
> +{
> + pgoff_t end = start + nr_pages;
> + pgoff_t last = end - 1;
> + void *entry;
> +
> + /* Try extending range. entry is NULL on overflow/wrap-around. */
> + mas_set_range(mas, end, end);
> + entry = mas_find(mas, end);
> + if (entry && xa_to_value(entry) == attributes)
> + last = mas->last;
> +
> + if (start > 0) {
> + mas_set_range(mas, start - 1, start - 1);
> + entry = mas_find(mas, start - 1);
> + if (entry && xa_to_value(entry) == attributes)
> + start = mas->index;
> + }
> +
> + mas_set_range(mas, start, last);
> + return mas_preallocate(mas, xa_mk_value(attributes), GFP_KERNEL);
> +}
> +
> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
> +{
> + struct folio_batch fbatch;
> + pgoff_t next = start;
> + int i;
> +
> + folio_batch_init(&fbatch);
> + while (filemap_get_folios(inode->i_mapping, &next, end - 1, &fbatch)) {
> + for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> + struct folio *folio = fbatch.folios[i];
> + unsigned long pfn = folio_pfn(folio);
> +
> + kvm_arch_gmem_invalidate(pfn, pfn + folio_nr_pages(folio));
> + }
> +
> + folio_batch_release(&fbatch);
> + cond_resched();
> + }
> +}
> +#else
> +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
> +#endif
> +
> +static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> + size_t nr_pages, uint64_t attrs,
> + pgoff_t *err_index)
> +{
> + bool to_private = attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> + struct address_space *mapping = inode->i_mapping;
> + struct gmem_inode *gi = GMEM_I(inode);
> + pgoff_t end = start + nr_pages;
> + struct maple_tree *mt;
> + struct ma_state mas;
> + int r;
> +
> + mt = &gi->attributes;
> +
> + filemap_invalidate_lock(mapping);
> +
> + mas_init(&mas, mt, start);
> +
> + if (kvm_gmem_range_has_attributes(mt, start, nr_pages, attrs)) {
> + r = 0;
> + goto out;
> + }
> +
> + r = kvm_gmem_mas_preallocate(&mas, attrs, start, nr_pages);
> + if (r) {
> + *err_index = start;
> + goto out;
> + }
> +
> + if (to_private) {
> + unmap_mapping_pages(mapping, start, nr_pages, false);
> +
> + if (!kvm_gmem_is_safe_for_conversion(inode, start, nr_pages,
> + err_index)) {
> + mas_destroy(&mas);
> + r = -EAGAIN;
> + goto out;
> + }
> + }
> +
> + /*
> + * From this point on guest_memfd has performed necessary
> + * checks and can proceed to do guest-breaking changes.
> + */
> +
> + kvm_gmem_invalidate_begin(inode, start, end);
> +
> + if (!to_private)
> + kvm_gmem_invalidate(inode, start, end);
> +
> + mas_store_prealloc(&mas, xa_mk_value(attrs));
> +
> + kvm_gmem_invalidate_end(inode, start, end);
> +out:
> + filemap_invalidate_unlock(mapping);
> + return r;
> +}
> +
> +static long kvm_gmem_set_attributes(struct file *file, void __user *argp)
> +{
> + struct gmem_file *f = file->private_data;
> + struct inode *inode = file_inode(file);
> + struct kvm_memory_attributes2 attrs;
> + pgoff_t err_index;
> + size_t nr_pages;
> + pgoff_t index;
> + int i, r;
> +
> + if (copy_from_user(&attrs, argp, sizeof(attrs)))
> + return -EFAULT;
> +
> + if (attrs.flags)
> + return -EINVAL;
> + if (attrs.error_offset)
> + return -EINVAL;
> + for (i = 0; i < ARRAY_SIZE(attrs.reserved); i++) {
> + if (attrs.reserved[i])
> + return -EINVAL;
> + }
> + if (attrs.attributes & ~kvm_supported_mem_attributes(f->kvm))
> + return -EINVAL;
> + if (attrs.size == 0 || attrs.offset + attrs.size < attrs.offset)
> + return -EINVAL;
> + if (!PAGE_ALIGNED(attrs.offset) || !PAGE_ALIGNED(attrs.size))
> + return -EINVAL;
> +
> + if (attrs.offset >= inode->i_size ||
> + attrs.offset + attrs.size > inode->i_size)
> + return -EINVAL;
> +
> + nr_pages = attrs.size >> PAGE_SHIFT;
> + index = attrs.offset >> PAGE_SHIFT;
> + r = __kvm_gmem_set_attributes(inode, index, nr_pages, attrs.attributes,
> + &err_index);
> + if (r) {
> + attrs.error_offset = ((uint64_t)err_index) << PAGE_SHIFT;
> +
> + if (copy_to_user(argp, &attrs, sizeof(attrs)))
> + return -EFAULT;
> + }
> +
> + return r;
> +}
> +
> +static long kvm_gmem_ioctl(struct file *file, unsigned int ioctl,
> + unsigned long arg)
> +{
> + switch (ioctl) {
> + case KVM_SET_MEMORY_ATTRIBUTES2:
> + if (vm_memory_attributes)
> + return -ENOTTY;
> +
> + return kvm_gmem_set_attributes(file, (void __user *)arg);
> + default:
> + return -ENOTTY;
> + }
> +}
> +
> +
> static struct file_operations kvm_gmem_fops = {
> .mmap = kvm_gmem_mmap,
> .open = generic_file_open,
> .release = kvm_gmem_release,
> .fallocate = kvm_gmem_fallocate,
> + .unlocked_ioctl = kvm_gmem_ioctl,
> };
>
> static int kvm_gmem_migrate_folio(struct address_space *mapping,
> @@ -942,20 +1168,13 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
> static bool kvm_gmem_range_is_private(struct gmem_inode *gi, pgoff_t index,
> size_t nr_pages, struct kvm *kvm, gfn_t gfn)
> {
> - pgoff_t end = index + nr_pages - 1;
> - void *entry;
> -
> if (vm_memory_attributes)
> return kvm_range_has_vm_memory_attributes(kvm, gfn, gfn + nr_pages,
> KVM_MEMORY_ATTRIBUTE_PRIVATE,
> KVM_MEMORY_ATTRIBUTE_PRIVATE);
>
> - mt_for_each(&gi->attributes, entry, index, end) {
> - if (xa_to_value(entry) != KVM_MEMORY_ATTRIBUTE_PRIVATE)
> - return false;
> - }
> -
> - return true;
> + return kvm_gmem_range_has_attributes(&gi->attributes, index, nr_pages,
> + KVM_MEMORY_ATTRIBUTE_PRIVATE);
> }
>
> static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 3c261904322f0..85c14197587d4 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2435,16 +2435,6 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>
> #ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
> -static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> -{
> -#ifdef kvm_arch_has_private_mem
> - if (!kvm || kvm_arch_has_private_mem(kvm))
> - return KVM_MEMORY_ATTRIBUTE_PRIVATE;
> -#endif
> -
> - return 0;
> -}
> -
> #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
> static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
> {
> @@ -2635,6 +2625,8 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> return -EINVAL;
> if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> return -EINVAL;
> + if (attrs->error_offset)
> + return -EINVAL;
> for (i = 0; i < ARRAY_SIZE(attrs->reserved); i++) {
> if (attrs->reserved[i])
> return -EINVAL;
> @@ -4983,6 +4975,11 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> return 1;
> case KVM_CAP_GUEST_MEMFD_FLAGS:
> return kvm_gmem_get_supported_flags(kvm);
> + case KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES:
> + if (vm_memory_attributes)
> + return 0;
> +
> + return kvm_supported_mem_attributes(kvm);
> #endif
> default:
> break;
>
> --
> 2.53.0.1018.g2bb0e51243-goog
>
^ permalink raw reply
* Re: [PATCH v2] bootconfig: Apply early options from embedded config
From: Masami Hiramatsu @ 2026-04-01 1:02 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Breno Leitao, Jonathan Corbet, Shuah Khan, linux-kernel,
linux-trace-kernel, linux-doc, oss, paulmck, rostedt, kernel-team
In-Reply-To: <acueCFv4neO7zQGI@thinkstation>
On Tue, 31 Mar 2026 11:18:33 +0100
Kiryl Shutsemau <kirill@shutemov.name> wrote:
> On Wed, Mar 25, 2026 at 11:22:04PM +0900, Masami Hiramatsu wrote:
> > > diff --git a/init/main.c b/init/main.c
> > > index 453ac9dff2da0..14a04c283fa48 100644
> > > --- a/init/main.c
> > > +++ b/init/main.c
> > > @@ -416,9 +416,64 @@ static int __init warn_bootconfig(char *str)
> > > return 0;
> > > }
> > >
> > > +/*
> > > + * do_early_param() is defined later in this file but called from
> > > + * bootconfig_apply_early_params() below, so we need a forward declaration.
> > > + */
> > > +static int __init do_early_param(char *param, char *val,
> > > + const char *unused, void *arg);
> > > +
> > > +/*
> > > + * bootconfig_apply_early_params - dispatch kernel.* keys from the embedded
> > > + * bootconfig as early_param() calls.
> > > + *
> > > + * early_param() handlers must run before most of the kernel initialises
> > > + * (e.g. before the GIC driver reads irqchip.gicv3_pseudo_nmi). A bootconfig
> > > + * attached to the initrd arrives too late for this because the initrd is not
> > > + * mapped yet when early params are processed. The embedded bootconfig lives
> > > + * in the kernel image itself (.init.data), so it is always reachable.
> > > + *
> > > + * This function is called from setup_boot_config() which runs in
> > > + * start_kernel() before parse_early_param(), making the timing correct.
> > > + */
> > > +static void __init bootconfig_apply_early_params(void)
> >
> > [sashiko comment]
> > | Does this run early enough for architectural parameters?
> > | While setup_boot_config() runs before parse_early_param() in start_kernel(),
> > | it runs after setup_arch(). setup_boot_config() relies on xbc_init() which
> > | uses the memblock allocator, requiring setup_arch() to have already
> > | initialized it.
> > | However, the kernel expects many early parameters (like mem=, earlycon,
> > | noapic, and iommu) to be parsed during setup_arch() via the architecture's
> > | call to parse_early_param(). Since setup_arch() completes before
> > | setup_boot_config() runs, will these architectural early parameters be
> > | silently ignored because the decisions they influence were already
> > | finalized?
> >
> > This is the major reason that I did not support early parameter
> > in bootconfig. Some archs initialize kernel_cmdline in setup_arch()
> > and setup early parameters in it.
> > To fix this, we need to change setup_arch() for each architecture so
> > that it calls this bootconfig_apply_early_params().
>
> Hi Masami,
>
> I have a question about bootconfig design. Is it necessary to parse the
> bootconfig at boot time?
>
> We could avoid a lot of complexity if we flattened the bootconfig into a
> simple key-value list before embedding it in the kernel image or
> attaching it to the initrd. That would eliminate the need for memory
> allocations and allow the config to be used earlier during boot.
Hi Kiryl,
Thanks for a good question.
If it is embedded in the kernel, yes, we can compile it. But if it is
attached to initrd, the kernel needs to verify it. So anyway we have to
parse the key-value. Of course we can make it a binary data structure,
but it still needs verification. Moreover, memory allocation is not
required by design (the first design uses a statically allocated data).
Even if it is normalized as key-value, we can not trust the contents
without outer verification system, like verified boot. Therefore, our
basic strategy is to parse the text.
However, if the content can be verified externally, for example, if a
verified boot program verifies the compiled boot configuration, then
it would be possible to use that directly.
Thank you,
>
> What am I missing?
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* [PATCH 0/2] Fix trace remotes read with an offline CPU
From: Vincent Donnefort @ 2026-04-01 2:50 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel, maz
Cc: kernel-team, linux-kernel, Vincent Donnefort
This small series is fixing non-consuming read of a trace remote when the
trace_buffer is created after a CPU is offline.
It also extends hotplug testing coverage to include this test case.
I have based this series on top of kvmarm/next which contains the hypervisor
tracing patches.
Vincent Donnefort (2):
tracing: Non-consuming read for trace remotes with an offline CPU
tracing: selftests: Extend hotplug testing for trace remotes
kernel/trace/trace_remote.c | 21 ++++-
.../selftests/ftrace/test.d/remotes/functions | 15 +++-
.../ftrace/test.d/remotes/hotplug.tc | 86 +++++++++++++++++++
.../test.d/remotes/hypervisor/hotplug.tc | 11 +++
.../selftests/ftrace/test.d/remotes/trace.tc | 27 +-----
.../ftrace/test.d/remotes/trace_pipe.tc | 25 ------
6 files changed, 129 insertions(+), 56 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/hotplug.tc
create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc
base-commit: 5ad2ff071b5980f072a85c8114649218971c586e
--
2.53.0.1118.gaef5881109-goog
^ permalink raw reply
* [PATCH 1/2] tracing: Non-consuming read for trace remotes with an offline CPU
From: Vincent Donnefort @ 2026-04-01 2:50 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel, maz
Cc: kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260401025003.3258729-1-vdonnefort@google.com>
When a trace_buffer is created while a CPU is offline, this CPU is
cleared from the trace_buffer CPU mask, preventing the creation of a
non-consuming iterator (ring_buffer_iter). For trace remotes, it means
the iterator fails to be allocated (-ENOMEM) even though there are
available ring buffers in the trace_buffer.
For non-consuming reads of trace remotes, skip missing ring_buffer_iter
to allow reading the available ring buffers.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/kernel/trace/trace_remote.c b/kernel/trace/trace_remote.c
index 0d78e5f5fe98..39d5414c1723 100644
--- a/kernel/trace/trace_remote.c
+++ b/kernel/trace/trace_remote.c
@@ -282,6 +282,14 @@ static void trace_remote_put(struct trace_remote *remote)
trace_remote_try_unload(remote);
}
+static bool trace_remote_has_cpu(struct trace_remote *remote, int cpu)
+{
+ if (cpu == RING_BUFFER_ALL_CPUS)
+ return true;
+
+ return ring_buffer_poll_remote(remote->trace_buffer, cpu) == 0;
+}
+
static void __poll_remote(struct work_struct *work)
{
struct delayed_work *dwork = to_delayed_work(work);
@@ -324,6 +332,10 @@ static int __alloc_ring_buffer_iter(struct trace_remote_iterator *iter, int cpu)
iter->rb_iters[cpu] = ring_buffer_read_start(iter->remote->trace_buffer, cpu,
GFP_KERNEL);
if (!iter->rb_iters[cpu]) {
+ /* This CPU isn't part of trace_buffer. Skip it */
+ if (!trace_remote_has_cpu(iter->remote, cpu))
+ continue;
+
__free_ring_buffer_iter(iter, RING_BUFFER_ALL_CPUS);
return -ENOMEM;
}
@@ -347,10 +359,10 @@ static struct trace_remote_iterator
if (ret)
return ERR_PTR(ret);
- /* Test the CPU */
- ret = ring_buffer_poll_remote(remote->trace_buffer, cpu);
- if (ret)
+ if (!trace_remote_has_cpu(remote, cpu)) {
+ ret = -ENODEV;
goto err;
+ }
iter = kzalloc_obj(*iter);
if (iter) {
@@ -476,6 +488,9 @@ __peek_event(struct trace_remote_iterator *iter, int cpu, u64 *ts, unsigned long
return ring_buffer_peek(iter->remote->trace_buffer, cpu, ts, lost_events);
case TRI_NONCONSUMING:
rb_iter = __get_rb_iter(iter, cpu);
+ if (!rb_iter)
+ return NULL;
+
rb_evt = ring_buffer_iter_peek(rb_iter, ts);
if (!rb_evt)
return NULL;
--
2.53.0.1118.gaef5881109-goog
^ permalink raw reply related
* [PATCH 2/2] tracing: selftests: Extend hotplug testing for trace remotes
From: Vincent Donnefort @ 2026-04-01 2:50 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel, maz
Cc: kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260401025003.3258729-1-vdonnefort@google.com>
The hotplug testing only tries reading a trace remote buffer, loaded
before a CPU is offline. Extend this testing to cover:
* A trace remote buffer loaded after a CPU is offline.
* A trace remote buffer loaded before a CPU is online.
Because of these added test cases, move the hotplug testing into a
separate hotplug.tc file.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/functions b/tools/testing/selftests/ftrace/test.d/remotes/functions
index 97a09d564a34..ca531a0336dc 100644
--- a/tools/testing/selftests/ftrace/test.d/remotes/functions
+++ b/tools/testing/selftests/ftrace/test.d/remotes/functions
@@ -32,6 +32,15 @@ assert_unloaded()
grep -q "(unloaded)" buffer_size_kb
}
+reload_remote()
+{
+ echo 0 > tracing_on
+ clear_trace
+ assert_unloaded
+ echo 1 > tracing_on
+ assert_loaded
+}
+
dump_trace_pipe()
{
output=$(mktemp $TMPDIR/remote_test.XXXXXX)
@@ -79,10 +88,12 @@ get_cpu_ids()
sed -n 's/^processor\s*:\s*\([0-9]\+\).*/\1/p' /proc/cpuinfo
}
-get_page_size() {
+get_page_size()
+{
sed -ne 's/^.*data.*size:\([0-9][0-9]*\).*/\1/p' events/header_page
}
-get_selftest_event_size() {
+get_selftest_event_size()
+{
sed -ne 's/^.*field:.*;.*size:\([0-9][0-9]*\);.*/\1/p' events/*/selftest/format | awk '{s+=$1} END {print s}'
}
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hotplug.tc b/tools/testing/selftests/ftrace/test.d/remotes/hotplug.tc
new file mode 100644
index 000000000000..86b55824d631
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/remotes/hotplug.tc
@@ -0,0 +1,86 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test trace remote read with an offline CPU
+# requires: remotes/test
+
+. $TEST_DIR/remotes/functions
+
+hotunplug_one_cpu()
+{
+ [ "$(get_cpu_ids | wc -l)" -ge 2 ] || return 1
+
+ for cpu in $(get_cpu_ids); do
+ echo 0 > /sys/devices/system/cpu/cpu$cpu/online || return 1
+ break
+ done
+
+ echo $cpu
+}
+
+# Check non-consuming and consuming read
+check_read()
+{
+ for i in $(seq 1 8); do
+ echo $i > write_event
+ done
+
+ check_trace 1 8 trace
+
+ output=$(dump_trace_pipe)
+ check_trace 1 8 $output
+ rm $output
+}
+
+test_hotplug()
+{
+ echo 0 > trace
+ assert_loaded
+
+ #
+ # Test a trace buffer containing an offline CPU
+ #
+
+ cpu=$(hotunplug_one_cpu) || exit_unsupported
+
+ check_read
+
+ #
+ # Test a trace buffer with a missing CPU
+ #
+
+ reload_remote
+
+ check_read
+
+ #
+ # Test a trace buffer with a CPU added later
+ #
+
+ echo 1 > /sys/devices/system/cpu/cpu$cpu/online
+ assert_loaded
+
+ check_read
+
+ # Test if the ring-buffer for the newly added CPU is both writable and
+ # readable
+ for i in $(seq 1 8); do
+ taskset -c $cpu echo $i > write_event
+ done
+
+ cd per_cpu/cpu$cpu/
+
+ check_trace 1 8 trace
+
+ output=$(dump_trace_pipe)
+ check_trace 1 8 $output
+ rm $output
+
+ cd -
+}
+
+if [ -z "$SOURCE_REMOTE_TEST" ]; then
+ set -e
+
+ setup_remote_test
+ test_hotplug
+fi
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc
new file mode 100644
index 000000000000..580ec32c8f81
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc
@@ -0,0 +1,11 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test hypervisor trace read with an offline CPU
+# requires: remotes/hypervisor/write_event
+
+SOURCE_REMOTE_TEST=1
+. $TEST_DIR/remotes/hotplug.tc
+
+set -e
+setup_remote "hypervisor"
+test_hotplug
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/trace.tc b/tools/testing/selftests/ftrace/test.d/remotes/trace.tc
index 170f7648732a..bc9377a70e8d 100644
--- a/tools/testing/selftests/ftrace/test.d/remotes/trace.tc
+++ b/tools/testing/selftests/ftrace/test.d/remotes/trace.tc
@@ -58,11 +58,7 @@ test_trace()
#
# Ensure the writer is not on the reader page by reloading the buffer
- echo 0 > tracing_on
- echo 0 > trace
- assert_unloaded
- echo 1 > tracing_on
- assert_loaded
+ reload_remote
# Ensure ring-buffer overflow by emitting events from the same CPU
for cpu in $(get_cpu_ids); do
@@ -96,27 +92,6 @@ test_trace()
cd - > /dev/null
done
-
- #
- # Test with hotplug
- #
-
- [ "$(get_cpu_ids | wc -l)" -ge 2 ] || return 0
-
- echo 0 > trace
-
- for cpu in $(get_cpu_ids); do
- echo 0 > /sys/devices/system/cpu/cpu$cpu/online || return 0
- break
- done
-
- for i in $(seq 1 8); do
- echo $i > write_event
- done
-
- check_trace 1 8 trace
-
- echo 1 > /sys/devices/system/cpu/cpu$cpu/online
}
if [ -z "$SOURCE_REMOTE_TEST" ]; then
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc b/tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc
index 669a7288ed7c..7f7b7b79c490 100644
--- a/tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc
+++ b/tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc
@@ -92,31 +92,6 @@ test_trace_pipe()
rm $output
cd - > /dev/null
done
-
- #
- # Test interaction with hotplug
- #
-
- [ "$(get_cpu_ids | wc -l)" -ge 2 ] || return 0
-
- echo 0 > trace
-
- for cpu in $(get_cpu_ids); do
- echo 0 > /sys/devices/system/cpu/cpu$cpu/online || return 0
- break
- done
-
- for i in $(seq 1 8); do
- echo $i > write_event
- done
-
- output=$(dump_trace_pipe)
-
- check_trace 1 8 $output
-
- rm $output
-
- echo 1 > /sys/devices/system/cpu/cpu$cpu/online
}
if [ -z "$SOURCE_REMOTE_TEST" ]; then
--
2.53.0.1118.gaef5881109-goog
^ permalink raw reply related
* Re: [PATCH v9 3/3] tracing/Documentation: Add a section about backup instance
From: Masami Hiramatsu @ 2026-04-01 3:15 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260331172105.609d59aa@gandalf.local.home>
On Tue, 31 Mar 2026 17:21:05 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Wed, 1 Apr 2026 01:32:41 +0900
> "Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
>
> > From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> >
> > Add a section about backup instance to the debugging.rst.
> >
> > Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > ---
> > Changes in v6:
> > - Fix typos.
> > ---
> > Documentation/trace/debugging.rst | 19 +++++++++++++++++++
> > 1 file changed, 19 insertions(+)
> >
> > diff --git a/Documentation/trace/debugging.rst b/Documentation/trace/debugging.rst
> > index 4d88c346fc38..15857951b506 100644
> > --- a/Documentation/trace/debugging.rst
> > +++ b/Documentation/trace/debugging.rst
> > @@ -159,3 +159,22 @@ If setting it from the kernel command line, it is recommended to also
> > disable tracing with the "traceoff" flag, and enable tracing after boot up.
> > Otherwise the trace from the most recent boot will be mixed with the trace
> > from the previous boot, and may make it confusing to read.
> > +
> > +Using a backup instance for keeping previous boot data
> > +------------------------------------------------------
> > +
> > +It is also possible to record trace data at system boot time by specifying
> > +events with the persistent ring buffer, but in this case the data before the
> > +reboot will be lost before it can be read. This problem can be solved by a
> > +backup instance. From the kernel command line::
> > +
> > + reserve_mem=12M:4096:trace trace_instance=boot_map@trace,sched,irq trace_instance=backup=boot_map
>
> The above didn't actually work well without my patch to enable events on
> the persistent ring buffer with the backup. But keep it, as it will work
> with my patch ;-)
Ah, thanks!
Let me rebase on it.
>
> -- Steve
>
>
> > +
> > +On boot up, the previous data in the "boot_map" is copied to the "backup"
> > +instance, and the "sched:*" and "irq:*" events for the current boot are traced
> > +in the "boot_map". Thus the user can read the previous boot data from the "backup"
> > +instance without stopping the trace.
> > +
> > +Note that this "backup" instance is readonly, and will be removed automatically
> > +if you clear the trace data or read out all trace data from the "trace_pipe"
> > +or the "trace_pipe_raw" files.
> > \ No newline at end of file
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* Re: [PATCH v9 2/3] tracing: Remove the backup instance automatically after read
From: Masami Hiramatsu @ 2026-04-01 3:19 UTC (permalink / raw)
To: Steven Rostedt; +Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260331171936.6f84e357@gandalf.local.home>
On Tue, 31 Mar 2026 17:19:36 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Wed, 1 Apr 2026 01:32:33 +0900
> "Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
>
> > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> > index 8cec7bd70438..1d73400a01c7 100644
> > --- a/kernel/trace/trace.c
> > +++ b/kernel/trace/trace.c
> > @@ -539,8 +539,65 @@ void trace_set_ring_buffer_expanded(struct trace_array *tr)
> > tr->ring_buffer_expanded = true;
> > }
> >
> > +static int __remove_instance(struct trace_array *tr);
> > +
> > +static void trace_array_autoremove(struct work_struct *work)
> > +{
> > + struct trace_array *tr = container_of(work, struct trace_array, autoremove_work);
> > +
> > + guard(mutex)(&event_mutex);
> > + guard(mutex)(&trace_types_lock);
>
> Hmm, should we do a check if the tr still exists? Couldn't the user delete
> this via a rmdir after the last file closed and this was kicked?
>
> CPU 0 CPU 1
> ----- -----
> open(trace_pipe);
> read(..);
> close(trace_pipe);
> kick the work queue to delete it....
> rmdir();
> [instance deleted]
I thought this requires trace_types_lock, and after kicked the queue,
can rmdir() gets the tr? (__trace_array_get() return error if
tr->free_on_close is set)
>
> __remove_instance();
>
> [ now the tr is freed, and the remove will crash!]
>
>
> What would prevent this is this is to use trace_array_destroy() that checks
> this and also adds the proper locking:
>
> static void trace_array_autoremove(struct work_struct *work)
> {
> struct trace_array *tr = container_of(work, struct trace_array, autoremove_work);
>
> trace_array_destroy(tr);
> }
OK, let's use it.
>
>
> > +
> > + /*
> > + * This can be fail if someone gets @tr before starting this
> > + * function, but in that case, this will be kicked again when
> > + * putting it. So we don't care about the result.
> > + */
> > + __remove_instance(tr);
> > +}
> > +
> > +static struct workqueue_struct *autoremove_wq;
> > +
> > +static void trace_array_kick_autoremove(struct trace_array *tr)
> > +{
> > + if (autoremove_wq && !work_pending(&tr->autoremove_work))
> > + queue_work(autoremove_wq, &tr->autoremove_work);
>
> Doesn't queue_work() check if it's pending? Do we really need to check it
> twice?
Indeed, it checked the flag.
>
> > +}
> > +
> > +static void trace_array_cancel_autoremove(struct trace_array *tr)
> > +{
> > + if (autoremove_wq && work_pending(&tr->autoremove_work))
> > + cancel_work(&tr->autoremove_work);
>
> Same here, as can't this be racy?
Yeah, and this should use cancel_work_sync().
>
> > +}
> > +
> > +static void trace_array_init_autoremove(struct trace_array *tr)
> > +{
> > + INIT_WORK(&tr->autoremove_work, trace_array_autoremove);
> > +}
> > +
> > +static void trace_array_start_autoremove(void)
> > +{
> > + if (autoremove_wq)
> > + return;
> > +
> > + autoremove_wq = alloc_workqueue("tr_autoremove_wq",
> > + WQ_UNBOUND | WQ_HIGHPRI, 0);
> > + if (!autoremove_wq)
> > + pr_warn("Unable to allocate tr_autoremove_wq. autoremove
> > disabled.\n"); +}
> > +
> > LIST_HEAD(ftrace_trace_arrays);
>
> -- Steve
>
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
* [PATCH v2 0/2] Fix trace remotes read with an offline CPU
From: Vincent Donnefort @ 2026-04-01 4:50 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel, maz
Cc: kernel-team, linux-kernel, Vincent Donnefort
This small series is fixing non-consuming read of a trace remote when the
trace_buffer is created after a CPU is offline.
It also extends hotplug testing coverage to include this test case.
I have based this series on top of kvmarm/next which contains the hypervisor
tracing patches.
Changes in v2:
* Add a trap to restore offlined CPU on test failure in hotplug.tc.
* Fix assert_loaded/assert_unloaded error propagation.
* Ensure we probe for existing events when setting up consuming read
iterator.
v1: https://lore.kernel.org/all/20260401025003.3258729-1-vdonnefort@google.com/
Vincent Donnefort (2):
tracing: Non-consuming read for trace remotes with an offline CPU
tracing: selftests: Extend hotplug testing for trace remotes
kernel/trace/trace_remote.c | 22 ++++-
.../selftests/ftrace/test.d/remotes/functions | 19 +++-
.../ftrace/test.d/remotes/hotplug.tc | 88 +++++++++++++++++++
.../test.d/remotes/hypervisor/hotplug.tc | 11 +++
.../selftests/ftrace/test.d/remotes/trace.tc | 27 +-----
.../ftrace/test.d/remotes/trace_pipe.tc | 25 ------
6 files changed, 134 insertions(+), 58 deletions(-)
create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/hotplug.tc
create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc
base-commit: 5ad2ff071b5980f072a85c8114649218971c586e
--
2.53.0.1118.gaef5881109-goog
^ permalink raw reply
* [PATCH v2 1/2] tracing: Non-consuming read for trace remotes with an offline CPU
From: Vincent Donnefort @ 2026-04-01 4:50 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel, maz
Cc: kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260401045100.3394299-1-vdonnefort@google.com>
When a trace_buffer is created while a CPU is offline, this CPU is
cleared from the trace_buffer CPU mask, preventing the creation of a
non-consuming iterator (ring_buffer_iter). For trace remotes, it means
the iterator fails to be allocated (-ENOMEM) even though there are
available ring buffers in the trace_buffer.
For non-consuming reads of trace remotes, skip missing ring_buffer_iter
to allow reading the available ring buffers.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/kernel/trace/trace_remote.c b/kernel/trace/trace_remote.c
index 0d78e5f5fe98..d6c3f94d67cd 100644
--- a/kernel/trace/trace_remote.c
+++ b/kernel/trace/trace_remote.c
@@ -282,6 +282,14 @@ static void trace_remote_put(struct trace_remote *remote)
trace_remote_try_unload(remote);
}
+static bool trace_remote_has_cpu(struct trace_remote *remote, int cpu)
+{
+ if (cpu == RING_BUFFER_ALL_CPUS)
+ return true;
+
+ return ring_buffer_poll_remote(remote->trace_buffer, cpu) == 0;
+}
+
static void __poll_remote(struct work_struct *work)
{
struct delayed_work *dwork = to_delayed_work(work);
@@ -324,6 +332,10 @@ static int __alloc_ring_buffer_iter(struct trace_remote_iterator *iter, int cpu)
iter->rb_iters[cpu] = ring_buffer_read_start(iter->remote->trace_buffer, cpu,
GFP_KERNEL);
if (!iter->rb_iters[cpu]) {
+ /* This CPU isn't part of trace_buffer. Skip it */
+ if (!trace_remote_has_cpu(iter->remote, cpu))
+ continue;
+
__free_ring_buffer_iter(iter, RING_BUFFER_ALL_CPUS);
return -ENOMEM;
}
@@ -347,10 +359,10 @@ static struct trace_remote_iterator
if (ret)
return ERR_PTR(ret);
- /* Test the CPU */
- ret = ring_buffer_poll_remote(remote->trace_buffer, cpu);
- if (ret)
+ if (!trace_remote_has_cpu(remote, cpu)) {
+ ret = -ENODEV;
goto err;
+ }
iter = kzalloc_obj(*iter);
if (iter) {
@@ -361,6 +373,7 @@ static struct trace_remote_iterator
switch (type) {
case TRI_CONSUMING:
+ ring_buffer_poll_remote(remote->trace_buffer, cpu);
INIT_DELAYED_WORK(&iter->poll_work, __poll_remote);
schedule_delayed_work(&iter->poll_work, msecs_to_jiffies(remote->poll_ms));
break;
@@ -476,6 +489,9 @@ __peek_event(struct trace_remote_iterator *iter, int cpu, u64 *ts, unsigned long
return ring_buffer_peek(iter->remote->trace_buffer, cpu, ts, lost_events);
case TRI_NONCONSUMING:
rb_iter = __get_rb_iter(iter, cpu);
+ if (!rb_iter)
+ return NULL;
+
rb_evt = ring_buffer_iter_peek(rb_iter, ts);
if (!rb_evt)
return NULL;
--
2.53.0.1118.gaef5881109-goog
^ permalink raw reply related
* [PATCH v2 2/2] tracing: selftests: Extend hotplug testing for trace remotes
From: Vincent Donnefort @ 2026-04-01 4:51 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel, maz
Cc: kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260401045100.3394299-1-vdonnefort@google.com>
The hotplug testing only tries reading a trace remote buffer, loaded
before a CPU is offline. Extend this testing to cover:
* A trace remote buffer loaded after a CPU is offline.
* A trace remote buffer loaded before a CPU is online.
Because of these added test cases, move the hotplug testing into a
separate hotplug.tc file.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/functions b/tools/testing/selftests/ftrace/test.d/remotes/functions
index 97a09d564a34..05224fac3653 100644
--- a/tools/testing/selftests/ftrace/test.d/remotes/functions
+++ b/tools/testing/selftests/ftrace/test.d/remotes/functions
@@ -24,12 +24,21 @@ setup_remote_test()
assert_loaded()
{
- grep -q "(loaded)" buffer_size_kb
+ grep -q "(loaded)" buffer_size_kb || return 1
}
assert_unloaded()
{
- grep -q "(unloaded)" buffer_size_kb
+ grep -q "(unloaded)" buffer_size_kb || return 1
+}
+
+reload_remote()
+{
+ echo 0 > tracing_on
+ clear_trace
+ assert_unloaded
+ echo 1 > tracing_on
+ assert_loaded
}
dump_trace_pipe()
@@ -79,10 +88,12 @@ get_cpu_ids()
sed -n 's/^processor\s*:\s*\([0-9]\+\).*/\1/p' /proc/cpuinfo
}
-get_page_size() {
+get_page_size()
+{
sed -ne 's/^.*data.*size:\([0-9][0-9]*\).*/\1/p' events/header_page
}
-get_selftest_event_size() {
+get_selftest_event_size()
+{
sed -ne 's/^.*field:.*;.*size:\([0-9][0-9]*\);.*/\1/p' events/*/selftest/format | awk '{s+=$1} END {print s}'
}
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hotplug.tc b/tools/testing/selftests/ftrace/test.d/remotes/hotplug.tc
new file mode 100644
index 000000000000..145617eb8061
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/remotes/hotplug.tc
@@ -0,0 +1,88 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test trace remote read with an offline CPU
+# requires: remotes/test
+
+. $TEST_DIR/remotes/functions
+
+hotunplug_one_cpu()
+{
+ [ "$(get_cpu_ids | wc -l)" -ge 2 ] || return 1
+
+ for cpu in $(get_cpu_ids); do
+ echo 0 > /sys/devices/system/cpu/cpu$cpu/online || return 1
+ break
+ done
+
+ echo $cpu
+}
+
+# Check non-consuming and consuming read
+check_read()
+{
+ for i in $(seq 1 8); do
+ echo $i > write_event
+ done
+
+ check_trace 1 8 trace
+
+ output=$(dump_trace_pipe)
+ check_trace 1 8 $output
+ rm $output
+}
+
+test_hotplug()
+{
+ echo 0 > trace
+ assert_loaded
+
+ #
+ # Test a trace buffer containing an offline CPU
+ #
+
+ cpu=$(hotunplug_one_cpu) || exit_unsupported
+ trap "echo 1 > /sys/devices/system/cpu/cpu$cpu/online" EXIT
+
+ check_read
+
+ #
+ # Test a trace buffer with a missing CPU
+ #
+
+ reload_remote
+
+ check_read
+
+ #
+ # Test a trace buffer with a CPU added later
+ #
+
+ echo 1 > /sys/devices/system/cpu/cpu$cpu/online
+ trap "" EXIT
+ assert_loaded
+
+ check_read
+
+ # Test if the ring-buffer for the newly added CPU is both writable and
+ # readable
+ for i in $(seq 1 8); do
+ taskset -c $cpu echo $i > write_event
+ done
+
+ cd per_cpu/cpu$cpu/
+
+ check_trace 1 8 trace
+
+ output=$(dump_trace_pipe)
+ check_trace 1 8 $output
+ rm $output
+
+ cd -
+}
+
+if [ -z "$SOURCE_REMOTE_TEST" ]; then
+ set -e
+
+ setup_remote_test
+ test_hotplug
+fi
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc
new file mode 100644
index 000000000000..580ec32c8f81
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/remotes/hypervisor/hotplug.tc
@@ -0,0 +1,11 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test hypervisor trace read with an offline CPU
+# requires: remotes/hypervisor/write_event
+
+SOURCE_REMOTE_TEST=1
+. $TEST_DIR/remotes/hotplug.tc
+
+set -e
+setup_remote "hypervisor"
+test_hotplug
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/trace.tc b/tools/testing/selftests/ftrace/test.d/remotes/trace.tc
index 170f7648732a..bc9377a70e8d 100644
--- a/tools/testing/selftests/ftrace/test.d/remotes/trace.tc
+++ b/tools/testing/selftests/ftrace/test.d/remotes/trace.tc
@@ -58,11 +58,7 @@ test_trace()
#
# Ensure the writer is not on the reader page by reloading the buffer
- echo 0 > tracing_on
- echo 0 > trace
- assert_unloaded
- echo 1 > tracing_on
- assert_loaded
+ reload_remote
# Ensure ring-buffer overflow by emitting events from the same CPU
for cpu in $(get_cpu_ids); do
@@ -96,27 +92,6 @@ test_trace()
cd - > /dev/null
done
-
- #
- # Test with hotplug
- #
-
- [ "$(get_cpu_ids | wc -l)" -ge 2 ] || return 0
-
- echo 0 > trace
-
- for cpu in $(get_cpu_ids); do
- echo 0 > /sys/devices/system/cpu/cpu$cpu/online || return 0
- break
- done
-
- for i in $(seq 1 8); do
- echo $i > write_event
- done
-
- check_trace 1 8 trace
-
- echo 1 > /sys/devices/system/cpu/cpu$cpu/online
}
if [ -z "$SOURCE_REMOTE_TEST" ]; then
diff --git a/tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc b/tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc
index 669a7288ed7c..7f7b7b79c490 100644
--- a/tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc
+++ b/tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc
@@ -92,31 +92,6 @@ test_trace_pipe()
rm $output
cd - > /dev/null
done
-
- #
- # Test interaction with hotplug
- #
-
- [ "$(get_cpu_ids | wc -l)" -ge 2 ] || return 0
-
- echo 0 > trace
-
- for cpu in $(get_cpu_ids); do
- echo 0 > /sys/devices/system/cpu/cpu$cpu/online || return 0
- break
- done
-
- for i in $(seq 1 8); do
- echo $i > write_event
- done
-
- output=$(dump_trace_pipe)
-
- check_trace 1 8 $output
-
- rm $output
-
- echo 1 > /sys/devices/system/cpu/cpu$cpu/online
}
if [ -z "$SOURCE_REMOTE_TEST" ]; then
--
2.53.0.1118.gaef5881109-goog
^ permalink raw reply related
* [PATCH] ring-buffer: Enforce read ordering of trace_buffer cpumask and buffers
From: Vincent Donnefort @ 2026-04-01 5:36 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel
Cc: kernel-team, linux-kernel, Vincent Donnefort
On CPU hotplug, if it is the first time a trace_buffer sees a CPU, a
ring_buffer_per_cpu will be allocated and its corresponding bit toggled
in the cpumask. Many readers check this cpumask to know if they can
safely read the ring_buffer_per_cpu but they are doing so without memory
ordering and may observe the cpumask bit set while having NULL buffer
pointer.
Enforce the memory read ordering by sending an IPI to all online CPUs.
The hotplug path is a slow-path anyway and it saves us from adding read
barriers in numerous call sites.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 170170bd83bd..10d2d0404434 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -7468,6 +7468,12 @@ int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
return 0;
}
+static void rb_cpu_sync(void *data)
+{
+ /* Not really needed, but documents what is happening */
+ smp_rmb();
+}
+
/*
* We only allocate new buffers, never free them if the CPU goes down.
* If we were to free the buffer, then the user would lose any trace that was in
@@ -7506,7 +7512,18 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node)
cpu);
return -ENOMEM;
}
- smp_wmb();
+
+ /*
+ * Ensure trace_buffer readers observe the newly allocated
+ * ring_buffer_per_cpu before they check the cpumask. Instead of using a
+ * read barrier for all readers, send an IPI.
+ */
+ if (unlikely(system_state == SYSTEM_RUNNING)) {
+ on_each_cpu(rb_cpu_sync, NULL, 1);
+ /* Not really needed, but documents what is happening */
+ smp_wmb();
+ }
+
cpumask_set_cpu(cpu, buffer->cpumask);
return 0;
}
base-commit: 7aaa8047eafd0bd628065b15757d9b48c5f9c07d
--
2.53.0.1118.gaef5881109-goog
^ permalink raw reply related
* [PATCH v10 0/3] tracing: Remove backup instance after read all
From: Masami Hiramatsu (Google) @ 2026-04-01 6:37 UTC (permalink / raw)
To: Steven Rostedt
Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
linux-trace-kernel
Hi,
Here is the v10 of the series to improve backup instances of
the persistent ring buffer. The previous version is here:
https://lore.kernel.org/all/177497473558.569199.6527680985537865638.stgit@mhiramat.tok.corp.google.com/
In this version, I fixed to use trace_array_destroy() and
cancel_work_sync() to avoid possible race. And remove unneeded
work_pending() check in trace_array_kick_autoremove()
(for trace_array_cancel_autoremove() to avoid deadlock of
workqueue, it still use work_pending() check) [2/3] and
add a newline at EOF [3/3]
Series Description
------------------
Since backup instances are a kind of snapshot of the persistent
ring buffer, it should be readonly. And if it is readonly
there is no reason to keep it after reading all data via trace_pipe
because the data has been consumed. But user should be able to remove
the readonly instance by rmdir or truncating `trace` file.
Thus, [1/3] makes backup instances readonly (not able to write any
events, cleanup trace, change buffer size). Also, [2/3] removes the
backup instance after consuming all data via trace_pipe.
With this improvements, even if we makes a backup instance (using
the same amount of memory of the persistent ring buffer), it will
be removed after reading the data automatically.
Thanks,
---
Masami Hiramatsu (Google) (3):
tracing: Make the backup instance non-reusable
tracing: Remove the backup instance automatically after read
tracing/Documentation: Add a section about backup instance
Documentation/trace/debugging.rst | 19 ++++
kernel/trace/trace.c | 160 ++++++++++++++++++++++++++++---------
kernel/trace/trace.h | 13 +++
kernel/trace/trace_boot.c | 5 +
kernel/trace/trace_events.c | 76 ++++++++++--------
5 files changed, 202 insertions(+), 71 deletions(-)
--
Masami Hiramatsu (Google) <mhiramat@kernel.org>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox