Linux Trace Kernel

* [PATCH v2] tracing: Add NULL pointer check to trigger_data_free()
From: Guenter Roeck @ 2026-03-05 19:33 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Guenter Roeck, Miaoqian Lin

If trigger_data_alloc() fails and returns NULL, event_hist_trigger_parse()
jumps to the out_free error path. While kfree() safely handles a NULL
pointer, trigger_data_free() does not. This causes a NULL pointer
dereference in trigger_data_free() when evaluating
data->cmd_ops->set_filter.

Fix the problem by adding a NULL pointer check to trigger_data_free().

The problem was found by an experimental code review agent based on
gemini-3.1-pro while reviewing backports into v6.18.y.

Assisted-by: Gemini:gemini-3.1-pro
Cc: Miaoqian Lin <linmq006@gmail.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Fixes: 0550069cc25f ("tracing: Properly process error handling in event_hist_trigger_parse()")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
---
v2: Add NULL pointer check to trigger_data_free() to make it more robust
    instead of changing the calling code.
    Note: Changed patch description to reflect new functionality.

 kernel/trace/trace_events_trigger.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
index fecbd679d432..d5230b759a2d 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -50,6 +50,9 @@ static int trigger_kthread_fn(void *ignore)
 
 void trigger_data_free(struct event_trigger_data *data)
 {
+	if (!data)
+		return;
+
 	if (data->cmd_ops->set_filter)
 		data->cmd_ops->set_filter(NULL, data, NULL);
 
-- 
2.45.2


^ permalink raw reply related
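
The NULL-tolerant free described in the patch above can be illustrated in userspace. This is only a minimal sketch: every demo_* name is hypothetical, the structures are simplified stand-ins rather than the kernel definitions, and calloc()/free() take the place of the kernel allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel structures; not the real layout. */
struct demo_cmd_ops {
	void (*set_filter)(void *data);
};

struct demo_trigger_data {
	const struct demo_cmd_ops *cmd_ops;
};

/* Counts how often the filter callback ran, so callers can observe it. */
static int filter_resets;

static void demo_set_filter(void *data)
{
	/* The callback is only ever reached with a valid object. */
	assert(data != NULL);
	filter_resets++;
}

static const struct demo_cmd_ops demo_ops = { .set_filter = demo_set_filter };

/* Returns NULL when 'fail' is set, modelling an allocation failure. */
static struct demo_trigger_data *demo_trigger_data_alloc(int fail)
{
	struct demo_trigger_data *data;

	if (fail)
		return NULL;

	data = calloc(1, sizeof(*data));
	if (data)
		data->cmd_ops = &demo_ops;
	return data;
}

/* The early return mirrors the check the patch adds: NULL is a no-op. */
static void demo_trigger_data_free(struct demo_trigger_data *data)
{
	if (!data)
		return;

	if (data->cmd_ops->set_filter)
		data->cmd_ops->set_filter(data);

	free(data);
}
```

With the check in place, an error path can call demo_trigger_data_free() unconditionally, the same way it would call free() or kfree(), without first testing whether the allocation succeeded.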

* Re: [PATCH v13 00/18] unwind_deferred: Implement sframe handling
From: Steven Rostedt @ 2026-03-05 20:18 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Jens Remus, linux-kernel, linux-trace-kernel, bpf, x86, linux-mm,
	Josh Poimboeuf, Masami Hiramatsu, Mathieu Desnoyers,
	Peter Zijlstra, Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, Jose E. Marchesi,
	Beau Belgrave, Linus Torvalds, Andrew Morton, Florian Weimer,
	Kees Cook, Carlos O'Donell, Sam James, Dylan Hatch,
	Borislav Petkov, Dave Hansen, David Hildenbrand, H. Peter Anvin,
	Liam R. Howlett, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Heiko Carstens,
	Vasily Gorbik
In-Reply-To: <aYvKqse1aXxGqFwR@google.com>

On Tue, 10 Feb 2026 16:17:46 -0800
Namhyung Kim <namhyung@kernel.org> wrote:

> > 
> > Maybe it would make sense not to "overload" the perf record option
> > "--call-graph fp,defer" and use it for all deferred unwinding methods.
> > 
> > What about "--call-graph defer", "--call-graph any,defer", or
> > "--call-graph *,defer"?  
> 
> Sounds better.  But I think it cannot enforce "--call-graph fp,defer" to
> use frame pointers when SFrame is available.. Hmm.

Yeah, let's just call it --defer. The "fp" part isn't really something perf
has control over. It's an implementation detail. Perf should really just
care about getting a stack trace and not how the kernel goes about it.

The only difference is that it needs to know if it is deferred or not, as
perf needs to do things differently when it is.

-- Steve

^ permalink raw reply

* Re: [PATCH v2] tracefs: Use dentry name snapshots instead of heap allocation
From: Steven Rostedt @ 2026-03-05 21:52 UTC (permalink / raw)
  To: AnishMulay
  Cc: viro, mhiramat, mathieu.desnoyers, linux-trace-kernel,
	linux-kernel
In-Reply-To: <20260227211505.226643-1-anishm7030@gmail.com>

On Fri, 27 Feb 2026 16:15:05 -0500
AnishMulay <anishm7030@gmail.com> wrote:

> diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
> index 86ba8dc25aaef..ad322e8f9e2ad 100644
> --- a/fs/tracefs/inode.c
> +++ b/fs/tracefs/inode.c
> @@ -94,23 +94,14 @@ static struct tracefs_dir_ops {
>  	int (*rmdir)(const char *name);
>  } tracefs_ops __ro_after_init;
>  
> -static char *get_dname(struct dentry *dentry)
> -{
> -	return kmemdup_nul(dentry->d_name.name, dentry->d_name.len, GFP_KERNEL);
> -}
> -
>  static struct dentry *tracefs_syscall_mkdir(struct mnt_idmap *idmap,
>  					    struct inode *inode, struct dentry *dentry,
>  					    umode_t mode)

I can't even apply your patch, as it appears you based it on a local
branch that already has your previous version of the patch applied.

Please rebase it onto 7.0-rc2 and resend.

Thanks!

-- Steve

^ permalink raw reply

* Re: [PATCH bpf-next v4 0/3] Optimize kprobe.session attachment for exact function names
From: patchwork-bot+netdevbpf @ 2026-03-05 23:30 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: bpf, linux-trace-kernel, ast, daniel, andrii, jolsa, rostedt,
	linux-open-source
In-Reply-To: <20260302200837.317907-1-andrey.grodzovsky@crowdstrike.com>

Hello:

This series was applied to bpf/bpf-next.git (master)
by Andrii Nakryiko <andrii@kernel.org>:

On Mon, 2 Mar 2026 15:08:34 -0500 you wrote:
> When libbpf attaches kprobe.session programs with exact function names
> (the common case: SEC("kprobe.session/vfs_read")), the current code path
> has two independent performance bottlenecks:
> 
> 1. Userspace (libbpf): attach_kprobe_session() always parses
> /proc/kallsyms to resolve function names, even when the name is exact
> (no wildcards).  This takes ~150ms per function.
> 
> [...]

Here is the summary with links:
  - [bpf-next,v4,1/3] libbpf: Optimize kprobe.session attachment for exact function names
    https://git.kernel.org/bpf/bpf-next/c/6afc431db1b4
  - [bpf-next,v4,2/3] ftrace: Use kallsyms binary search for single-symbol lookup
    (no matching commit)
  - [bpf-next,v4,3/3] selftests/bpf: add tests for kprobe.session optimization
    https://git.kernel.org/bpf/bpf-next/c/a28441dd2961

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply
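
The idea in the quoted cover letter, skipping the full symbol scan when the name is exact, can be sketched as follows. This is not the libbpf or ftrace code: the demo_* helpers are hypothetical, the wildcard set ("*" and "?") is an assumption, and the sorted array merely stands in for whatever name-ordered index the real lookup uses.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* True if the pattern has glob metacharacters and so needs a full scan. */
static bool demo_has_wildcard(const char *pattern)
{
	return strpbrk(pattern, "*?") != NULL;
}

/* bsearch() comparator: the key is a string, each element a string pointer. */
static int demo_cmp(const void *key, const void *elem)
{
	return strcmp((const char *)key, *(const char *const *)elem);
}

/*
 * Look up an exact name in a table sorted with strcmp() ordering.
 * Returns the index of the match, or -1 if the name is not present.
 */
static long demo_lookup_exact(const char *name,
			      const char *const *sorted, size_t n)
{
	const char *const *hit;

	hit = bsearch(name, sorted, n, sizeof(*sorted), demo_cmp);
	return hit ? (long)(hit - sorted) : -1;
}
```

A caller would test demo_has_wildcard() first and fall back to a linear pattern match only when it returns true; an exact name takes the O(log n) path instead.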

* Re: [PATCH v4 4/5] mm: rename zone->lock to zone->_lock
From: SeongJae Park @ 2026-03-06  1:20 UTC (permalink / raw)
  To: Dmitry Ilvokhin
  Cc: SeongJae Park, Vlastimil Babka (SUSE), Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Rafael J. Wysocki,
	Pavel Machek, Len Brown, Brendan Jackman, Johannes Weiner, Zi Yan,
	Oscar Salvador, Qi Zheng, Shakeel Butt, linux-kernel, linux-mm,
	linux-trace-kernel, linux-pm
In-Reply-To: <aanSnywUXTVPaYUj@shell.ilvokhin.com>

On Thu, 5 Mar 2026 18:59:43 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
[...]
> Following the suggestion from SJ and Vlastimil, I prepared fixup to
> standardize documentation and comments on the term "zone lock".
> 
> The patch is based on top of the current mm-new.
> 
> Andrew, please let me know if you would prefer a respin of the series
> instead.
> 
> From 267cda3e0e160f97b346009bc48819bfeed92e52 Mon Sep 17 00:00:00 2001
> From: Dmitry Ilvokhin <d@ilvokhin.com>
> Date: Thu, 5 Mar 2026 10:36:17 -0800
> Subject: [PATCH] mm: documentation: standardize on "zone lock" terminology
> 
> During review of the zone lock tracing series it was suggested to
> standardize documentation and comments on the term "zone lock"
> instead of using zone_lock or referring to the internal field
> zone->_lock.
> 
> Update references accordingly.
> 
> Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>

Acked-by: SeongJae Park <sj@kernel.org>


Thanks,
SJ

[...]

^ permalink raw reply

* Re: [PATCH] tracing: Fix use-after-free race in copy_trace_marker on instance removal
From: Steven Rostedt @ 2026-03-06  2:14 UTC (permalink / raw)
  To: Sasha Levin; +Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260225133122.237275-1-sashal@kernel.org>

On Wed, 25 Feb 2026 08:31:22 -0500
Sasha Levin <sashal@kernel.org> wrote:

> When a trace instance with copy_trace_marker enabled is removed,
> __remove_instance() first iterates ZEROED_TRACE_FLAGS (which includes
> COPY_MARKER), calling set_tracer_flag() -> update_marker_trace(tr, 0).
> This removes the instance from the marker_copies RCU list via
> list_del_init() and returns immediately.

Hmm, did the AI write the change log too?

It breaks things as much as it fixes.

> 
> The subsequent explicit update_marker_trace(tr, 0) call then finds
> list_empty(&tr->marker_list) is true and returns false, causing
> synchronize_rcu() to be skipped. The ring buffer and trace_array are
> then freed while a concurrent writer in tracing_mark_write() may still
> hold an RCU-protected reference, leading to use-after-free.
> 
>   BUG: KASAN: slab-use-after-free in write_marker_to_buffer+0x1e7/0x610 kernel/trace/trace.c:6527
>   Write of size 4054 at addr ffff888103af7058 by task syz.0.277/5019
> 
>   CPU: 3 UID: 0 PID: 5019 Comm: syz.0.277 Tainted: G                 N  7.0.0-rc1-00001-gc5447a46efed #51 PREEMPT(full)
>   Tainted: [N]=TEST
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
>   Call Trace:
>    <TASK>
>    __dump_stack lib/dump_stack.c:94 [inline]
>    dump_stack_lvl+0xba/0x110 lib/dump_stack.c:120
>    print_address_description mm/kasan/report.c:378 [inline]
>    print_report+0x156/0x4d9 mm/kasan/report.c:482
>    kasan_report+0xf6/0x1f0 mm/kasan/report.c:595
>    check_region_inline mm/kasan/generic.c:186 [inline]
>    kasan_check_range+0x125/0x200 mm/kasan/generic.c:200
>    __asan_memcpy+0x3c/0x60 mm/kasan/shadow.c:106
>    write_marker_to_buffer+0x1e7/0x610 kernel/trace/trace.c:6527
>    tracing_mark_write+0x218/0x3f0 kernel/trace/trace.c:6875
>    vfs_write+0x2b7/0x1070 fs/read_write.c:686
>    ksys_write+0x1f8/0x250 fs/read_write.c:740
>    do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
>    do_syscall_64+0xf3/0x700 arch/x86/entry/syscall_64.c:94
>    entry_SYSCALL_64_after_hwframe+0x77/0x7f
>   RIP: 0033:0x7fdb7eb9df29
>   Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48
>    c7 c1 e8 ff ff ff f7 d8 64 89 01 48
>   RSP: 002b:00007fdb7fa81008 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>   RAX: ffffffffffffffda RBX: 00007fdb7ee15fa0 RCX: 00007fdb7eb9df29
>   RDX: 0000000000001000 RSI: 0000200000000300 RDI: 0000000000000003
>   RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
>   R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>   R13: 00007ffec21bfd06 R14: 00007fdb7fa81ce4 R15: 00007fdb7fa61000
>    </TASK>
> 
>   The buggy address belongs to the physical page:
>   page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffffffffffffffff pfn:0x103af7
>   flags: 0x200000000000000(node=0|zone=2)
>   raw: 0200000000000000 0000000000000000 dead000000000122 0000000000000000
>   raw: ffffffffffffffff 0000000000000000 00000001ffffffff 0000000000000000
>   page dumped because: kasan: bad access detected
> 
>   Memory state around the buggy address:
>    ffff888103af7f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>    ffff888103af7f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>   >ffff888103af8000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb  
>                      ^
>    ffff888103af8080: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc
>    ffff888103af8100: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
> 
> Fix this by:
> 
> 1. Removing TRACE_ITER(COPY_MARKER) from ZEROED_TRACE_FLAGS so the flag
>    loop doesn't pre-clear it. The explicit update_marker_trace(tr, 0) +
>    synchronize_rcu() then correctly waits for RCU readers to finish
>    before freeing.

There's a specific reason COPY_MARKER is part of ZEROED_TRACE_FLAGS, and
this patch doesn't address it by removing the flag. That macro lists all
the flags that are not copied when creating an instance. Other bad things
can happen by allowing it.

> 
> 2. Replacing list_del_init() with list_del_rcu() in update_marker_trace()
>    for proper RCU list removal semantics. list_del_init() overwrites
>    entry->next to point to itself, which can cause concurrent RCU readers
>    to loop infinitely. list_del_rcu() preserves entry->next so readers
>    can safely finish their traversal. The duplicate-operation guards are
>    changed from list_empty() to trace_flags bit checks accordingly, since
>    list_del_rcu() does not reinitialize the list head.

The above change is a fix, but it does look like AI wrote it, as it
added a lot of extra text that isn't needed for this change log. Yes,
we know why list_del_rcu() is used. It doesn't need to go into details
about use cases for list_del_rcu().

> 
> Fixes: 7b382efd5e8a ("tracing: Allow the top level trace_marker to write into another instances")
> Assisted-by: Claude:claude-opus-4-6
> Signed-off-by: Sasha Levin <sashal@kernel.org>
> ---
>  kernel/trace/trace.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 23de3719f4952..fa413214da764 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -523,8 +523,7 @@ EXPORT_SYMBOL_GPL(unregister_ftrace_export);
>  
>  /* trace_flags that are default zero for instances */
>  #define ZEROED_TRACE_FLAGS \
> -	(TRACE_ITER(EVENT_FORK) | TRACE_ITER(FUNC_FORK) | TRACE_ITER(TRACE_PRINTK) | \
> -	 TRACE_ITER(COPY_MARKER))
> +	(TRACE_ITER(EVENT_FORK) | TRACE_ITER(FUNC_FORK) | TRACE_ITER(TRACE_PRINTK))

This is wrong.

>  
>  /*
>   * The global_trace is the descriptor that holds the top-level tracing
> @@ -555,7 +554,7 @@ static bool update_marker_trace(struct trace_array *tr, int enabled)
>  	lockdep_assert_held(&event_mutex);
>  
>  	if (enabled) {
> -		if (!list_empty(&tr->marker_list))
> +		if (tr->trace_flags & TRACE_ITER(COPY_MARKER))
>  			return false;

This is fine.

>  
>  		list_add_rcu(&tr->marker_list, &marker_copies);
> @@ -563,10 +562,10 @@ static bool update_marker_trace(struct trace_array *tr, int enabled)
>  		return true;
>  	}
>  
> -	if (list_empty(&tr->marker_list))
> +	if (!(tr->trace_flags & TRACE_ITER(COPY_MARKER)))

This is fine.

>  		return false;
>  
> -	list_del_init(&tr->marker_list);
> +	list_del_rcu(&tr->marker_list);

This is fine.

>  	tr->trace_flags &= ~TRACE_ITER(COPY_MARKER);
>  	return true;
>  }

What is needed is to just reverse the order of the checks:

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 855cba2ff5b5..42ea3770859b 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9743,18 +9743,18 @@ static int __remove_instance(struct trace_array *tr)
 
 	list_del(&tr->list);
 
-	/* Disable all the flags that were enabled coming in */
-	for (i = 0; i < TRACE_FLAGS_MAX_SIZE; i++) {
-		if ((1ULL << i) & ZEROED_TRACE_FLAGS)
-			set_tracer_flag(tr, 1ULL << i, 0);
-	}
-
 	if (printk_trace == tr)
 		update_printk_trace(&global_trace);
 
 	if (update_marker_trace(tr, 0))
 		synchronize_rcu();
 
+	/* Disable all the flags that were enabled coming in */
+	for (i = 0; i < TRACE_FLAGS_MAX_SIZE; i++) {
+		if ((1ULL << i) & ZEROED_TRACE_FLAGS)
+			set_tracer_flag(tr, 1ULL << i, 0);
+	}
+
 	tracing_set_nop(tr);
 	clear_ftrace_function_probes(tr);
 	event_trace_del_tracer(tr);

-- Steve

^ permalink raw reply related
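
Why the order of the two steps matters can be shown with a small userspace model. It assumes the behaviour described in the quoted changelog: the removal helper returns true only when it actually unlinks the instance, and only the explicit call in the teardown path acts on that return value. All demo_* names are hypothetical, and a counter stands in for synchronize_rcu().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_instance {
	bool on_marker_list;	/* stands in for !list_empty(&tr->marker_list) */
	int rcu_syncs;		/* counts simulated synchronize_rcu() calls */
};

/* Returns true only if this call changed the list membership. */
static bool demo_update_marker(struct demo_instance *tr, bool enabled)
{
	assert(tr != NULL);

	if (enabled) {
		if (tr->on_marker_list)
			return false;
		tr->on_marker_list = true;
		return true;
	}

	if (!tr->on_marker_list)
		return false;
	tr->on_marker_list = false;
	return true;
}

/* Models the flag-clearing loop: unlinks, but ignores the return value. */
static void demo_clear_zeroed_flags(struct demo_instance *tr)
{
	demo_update_marker(tr, false);
}

/* Models the explicit removal: waits for readers only if it unlinked. */
static void demo_remove_marker(struct demo_instance *tr)
{
	if (demo_update_marker(tr, false))
		tr->rcu_syncs++;
}

/* Original order: the flag loop runs first, so the wait is skipped. */
static int demo_teardown_flags_first(void)
{
	struct demo_instance tr = { .on_marker_list = true };

	demo_clear_zeroed_flags(&tr);
	demo_remove_marker(&tr);
	return tr.rcu_syncs;
}

/* Reordered as in the reply: the explicit removal runs first and waits. */
static int demo_teardown_marker_first(void)
{
	struct demo_instance tr = { .on_marker_list = true };

	demo_remove_marker(&tr);
	demo_clear_zeroed_flags(&tr);
	return tr.rcu_syncs;
}
```

In this model demo_teardown_flags_first() performs no wait at all, while demo_teardown_marker_first() performs exactly one, which is the effect the reordering in the reply is after.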

* [PATCH] tracing: Move snapshot code out of trace.c and into trace_snapshot.c
From: Steven Rostedt @ 2026-03-06  2:18 UTC (permalink / raw)
  To: LKML, Linux trace kernel; +Cc: Masami Hiramatsu, Mathieu Desnoyers

From: Steven Rostedt <rostedt@goodmis.org>

The trace.c file was a dumping ground for most tracing code. Start
organizing it better by moving various functions out into their own files.
Move all the snapshot code, including the max trace code, into its own
trace_snapshot.c file.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes since v1: https://patch.msgid.link/20260206195936.803146337@kernel.org

- Rebased on top of v7.0-rc2

- Fixed up a few config issues

 include/linux/ftrace.h        |    2 +-
 kernel/trace/Makefile         |    1 +
 kernel/trace/trace.c          | 1273 ++-------------------------------
 kernel/trace/trace.h          |  105 ++-
 kernel/trace/trace_snapshot.c | 1067 +++++++++++++++++++++++++++
 5 files changed, 1230 insertions(+), 1218 deletions(-)
 create mode 100644 kernel/trace/trace_snapshot.c

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index c242fe49af4c..28b30c6f1031 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -31,7 +31,7 @@
 #define ARCH_SUPPORTS_FTRACE_OPS 0
 #endif
 
-#ifdef CONFIG_TRACING
+#ifdef CONFIG_TRACER_SNAPSHOT
 extern void ftrace_boot_snapshot(void);
 #else
 static inline void ftrace_boot_snapshot(void) { }
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 04096c21d06b..83aeb5c77008 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -69,6 +69,7 @@ obj-$(CONFIG_TRACING) += trace_seq.o
 obj-$(CONFIG_TRACING) += trace_stat.o
 obj-$(CONFIG_TRACING) += trace_printk.o
 obj-$(CONFIG_TRACING) += trace_pid.o
+obj-$(CONFIG_TRACER_SNAPSHOT) += trace_snapshot.o
 obj-$(CONFIG_TRACING) += 	pid_list.o
 obj-$(CONFIG_TRACING_MAP) += tracing_map.o
 obj-$(CONFIG_PREEMPTIRQ_DELAY_TEST) += preemptirq_delay_test.o
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 23de3719f495..05fb9964fd4e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -47,7 +47,6 @@
 #include <linux/trace.h>
 #include <linux/sched/clock.h>
 #include <linux/sched/rt.h>
-#include <linux/fsnotify.h>
 #include <linux/irq_work.h>
 #include <linux/workqueue.h>
 #include <linux/sort.h>
@@ -219,15 +218,9 @@ static void ftrace_trace_userstack(struct trace_array *tr,
 static char bootup_tracer_buf[MAX_TRACER_SIZE] __initdata;
 static char *default_bootup_tracer;
 
-static bool allocate_snapshot;
-static bool snapshot_at_boot;
-
 static char boot_instance_info[COMMAND_LINE_SIZE] __initdata;
 static int boot_instance_index;
 
-static char boot_snapshot_info[COMMAND_LINE_SIZE] __initdata;
-static int boot_snapshot_index;
-
 static int __init set_cmdline_ftrace(char *str)
 {
 	strscpy(bootup_tracer_buf, str, MAX_TRACER_SIZE);
@@ -276,38 +269,6 @@ static int __init stop_trace_on_warning(char *str)
 }
 __setup("traceoff_on_warning", stop_trace_on_warning);
 
-static int __init boot_alloc_snapshot(char *str)
-{
-	char *slot = boot_snapshot_info + boot_snapshot_index;
-	int left = sizeof(boot_snapshot_info) - boot_snapshot_index;
-	int ret;
-
-	if (str[0] == '=') {
-		str++;
-		if (strlen(str) >= left)
-			return -1;
-
-		ret = snprintf(slot, left, "%s\t", str);
-		boot_snapshot_index += ret;
-	} else {
-		allocate_snapshot = true;
-		/* We also need the main ring buffer expanded */
-		trace_set_ring_buffer_expanded(NULL);
-	}
-	return 1;
-}
-__setup("alloc_snapshot", boot_alloc_snapshot);
-
-
-static int __init boot_snapshot(char *str)
-{
-	snapshot_at_boot = true;
-	boot_alloc_snapshot(str);
-	return 1;
-}
-__setup("ftrace_boot_snapshot", boot_snapshot);
-
-
 static int __init boot_instance(char *str)
 {
 	char *slot = boot_instance_info + boot_instance_index;
@@ -807,47 +768,6 @@ void tracing_on(void)
 EXPORT_SYMBOL_GPL(tracing_on);
 
 #ifdef CONFIG_TRACER_SNAPSHOT
-static void tracing_snapshot_instance_cond(struct trace_array *tr,
-					   void *cond_data)
-{
-	unsigned long flags;
-
-	if (in_nmi()) {
-		trace_array_puts(tr, "*** SNAPSHOT CALLED FROM NMI CONTEXT ***\n");
-		trace_array_puts(tr, "*** snapshot is being ignored        ***\n");
-		return;
-	}
-
-	if (!tr->allocated_snapshot) {
-		trace_array_puts(tr, "*** SNAPSHOT NOT ALLOCATED ***\n");
-		trace_array_puts(tr, "*** stopping trace here!   ***\n");
-		tracer_tracing_off(tr);
-		return;
-	}
-
-	if (tr->mapped) {
-		trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n");
-		trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n");
-		return;
-	}
-
-	/* Note, snapshot can not be used when the tracer uses it */
-	if (tracer_uses_snapshot(tr->current_trace)) {
-		trace_array_puts(tr, "*** LATENCY TRACER ACTIVE ***\n");
-		trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n");
-		return;
-	}
-
-	local_irq_save(flags);
-	update_max_tr(tr, current, smp_processor_id(), cond_data);
-	local_irq_restore(flags);
-}
-
-void tracing_snapshot_instance(struct trace_array *tr)
-{
-	tracing_snapshot_instance_cond(tr, NULL);
-}
-
 /**
  * tracing_snapshot - take a snapshot of the current buffer.
  *
@@ -870,138 +790,6 @@ void tracing_snapshot(void)
 }
 EXPORT_SYMBOL_GPL(tracing_snapshot);
 
-/**
- * tracing_snapshot_cond - conditionally take a snapshot of the current buffer.
- * @tr:		The tracing instance to snapshot
- * @cond_data:	The data to be tested conditionally, and possibly saved
- *
- * This is the same as tracing_snapshot() except that the snapshot is
- * conditional - the snapshot will only happen if the
- * cond_snapshot.update() implementation receiving the cond_data
- * returns true, which means that the trace array's cond_snapshot
- * update() operation used the cond_data to determine whether the
- * snapshot should be taken, and if it was, presumably saved it along
- * with the snapshot.
- */
-void tracing_snapshot_cond(struct trace_array *tr, void *cond_data)
-{
-	tracing_snapshot_instance_cond(tr, cond_data);
-}
-EXPORT_SYMBOL_GPL(tracing_snapshot_cond);
-
-/**
- * tracing_cond_snapshot_data - get the user data associated with a snapshot
- * @tr:		The tracing instance
- *
- * When the user enables a conditional snapshot using
- * tracing_snapshot_cond_enable(), the user-defined cond_data is saved
- * with the snapshot.  This accessor is used to retrieve it.
- *
- * Should not be called from cond_snapshot.update(), since it takes
- * the tr->max_lock lock, which the code calling
- * cond_snapshot.update() has already done.
- *
- * Returns the cond_data associated with the trace array's snapshot.
- */
-void *tracing_cond_snapshot_data(struct trace_array *tr)
-{
-	void *cond_data = NULL;
-
-	local_irq_disable();
-	arch_spin_lock(&tr->max_lock);
-
-	if (tr->cond_snapshot)
-		cond_data = tr->cond_snapshot->cond_data;
-
-	arch_spin_unlock(&tr->max_lock);
-	local_irq_enable();
-
-	return cond_data;
-}
-EXPORT_SYMBOL_GPL(tracing_cond_snapshot_data);
-
-static int resize_buffer_duplicate_size(struct array_buffer *trace_buf,
-					struct array_buffer *size_buf, int cpu_id);
-static void set_buffer_entries(struct array_buffer *buf, unsigned long val);
-
-int tracing_alloc_snapshot_instance(struct trace_array *tr)
-{
-	int order;
-	int ret;
-
-	if (!tr->allocated_snapshot) {
-
-		/* Make the snapshot buffer have the same order as main buffer */
-		order = ring_buffer_subbuf_order_get(tr->array_buffer.buffer);
-		ret = ring_buffer_subbuf_order_set(tr->snapshot_buffer.buffer, order);
-		if (ret < 0)
-			return ret;
-
-		/* allocate spare buffer */
-		ret = resize_buffer_duplicate_size(&tr->snapshot_buffer,
-				   &tr->array_buffer, RING_BUFFER_ALL_CPUS);
-		if (ret < 0)
-			return ret;
-
-		tr->allocated_snapshot = true;
-	}
-
-	return 0;
-}
-
-static void free_snapshot(struct trace_array *tr)
-{
-	/*
-	 * We don't free the ring buffer. instead, resize it because
-	 * The max_tr ring buffer has some state (e.g. ring->clock) and
-	 * we want preserve it.
-	 */
-	ring_buffer_subbuf_order_set(tr->snapshot_buffer.buffer, 0);
-	ring_buffer_resize(tr->snapshot_buffer.buffer, 1, RING_BUFFER_ALL_CPUS);
-	set_buffer_entries(&tr->snapshot_buffer, 1);
-	tracing_reset_online_cpus(&tr->snapshot_buffer);
-	tr->allocated_snapshot = false;
-}
-
-static int tracing_arm_snapshot_locked(struct trace_array *tr)
-{
-	int ret;
-
-	lockdep_assert_held(&trace_types_lock);
-
-	spin_lock(&tr->snapshot_trigger_lock);
-	if (tr->snapshot == UINT_MAX || tr->mapped) {
-		spin_unlock(&tr->snapshot_trigger_lock);
-		return -EBUSY;
-	}
-
-	tr->snapshot++;
-	spin_unlock(&tr->snapshot_trigger_lock);
-
-	ret = tracing_alloc_snapshot_instance(tr);
-	if (ret) {
-		spin_lock(&tr->snapshot_trigger_lock);
-		tr->snapshot--;
-		spin_unlock(&tr->snapshot_trigger_lock);
-	}
-
-	return ret;
-}
-
-int tracing_arm_snapshot(struct trace_array *tr)
-{
-	guard(mutex)(&trace_types_lock);
-	return tracing_arm_snapshot_locked(tr);
-}
-
-void tracing_disarm_snapshot(struct trace_array *tr)
-{
-	spin_lock(&tr->snapshot_trigger_lock);
-	if (!WARN_ON(!tr->snapshot))
-		tr->snapshot--;
-	spin_unlock(&tr->snapshot_trigger_lock);
-}
-
 /**
  * tracing_alloc_snapshot - allocate snapshot buffer.
  *
@@ -1023,129 +811,12 @@ int tracing_alloc_snapshot(void)
 	return ret;
 }
 EXPORT_SYMBOL_GPL(tracing_alloc_snapshot);
-
-/**
- * tracing_snapshot_alloc - allocate and take a snapshot of the current buffer.
- *
- * This is similar to tracing_snapshot(), but it will allocate the
- * snapshot buffer if it isn't already allocated. Use this only
- * where it is safe to sleep, as the allocation may sleep.
- *
- * This causes a swap between the snapshot buffer and the current live
- * tracing buffer. You can use this to take snapshots of the live
- * trace when some condition is triggered, but continue to trace.
- */
-void tracing_snapshot_alloc(void)
-{
-	int ret;
-
-	ret = tracing_alloc_snapshot();
-	if (ret < 0)
-		return;
-
-	tracing_snapshot();
-}
-EXPORT_SYMBOL_GPL(tracing_snapshot_alloc);
-
-/**
- * tracing_snapshot_cond_enable - enable conditional snapshot for an instance
- * @tr:		The tracing instance
- * @cond_data:	User data to associate with the snapshot
- * @update:	Implementation of the cond_snapshot update function
- *
- * Check whether the conditional snapshot for the given instance has
- * already been enabled, or if the current tracer is already using a
- * snapshot; if so, return -EBUSY, else create a cond_snapshot and
- * save the cond_data and update function inside.
- *
- * Returns 0 if successful, error otherwise.
- */
-int tracing_snapshot_cond_enable(struct trace_array *tr, void *cond_data,
-				 cond_update_fn_t update)
-{
-	struct cond_snapshot *cond_snapshot __free(kfree) =
-		kzalloc_obj(*cond_snapshot);
-	int ret;
-
-	if (!cond_snapshot)
-		return -ENOMEM;
-
-	cond_snapshot->cond_data = cond_data;
-	cond_snapshot->update = update;
-
-	guard(mutex)(&trace_types_lock);
-
-	if (tracer_uses_snapshot(tr->current_trace))
-		return -EBUSY;
-
-	/*
-	 * The cond_snapshot can only change to NULL without the
-	 * trace_types_lock. We don't care if we race with it going
-	 * to NULL, but we want to make sure that it's not set to
-	 * something other than NULL when we get here, which we can
-	 * do safely with only holding the trace_types_lock and not
-	 * having to take the max_lock.
-	 */
-	if (tr->cond_snapshot)
-		return -EBUSY;
-
-	ret = tracing_arm_snapshot_locked(tr);
-	if (ret)
-		return ret;
-
-	local_irq_disable();
-	arch_spin_lock(&tr->max_lock);
-	tr->cond_snapshot = no_free_ptr(cond_snapshot);
-	arch_spin_unlock(&tr->max_lock);
-	local_irq_enable();
-
-	return 0;
-}
-EXPORT_SYMBOL_GPL(tracing_snapshot_cond_enable);
-
-/**
- * tracing_snapshot_cond_disable - disable conditional snapshot for an instance
- * @tr:		The tracing instance
- *
- * Check whether the conditional snapshot for the given instance is
- * enabled; if so, free the cond_snapshot associated with it,
- * otherwise return -EINVAL.
- *
- * Returns 0 if successful, error otherwise.
- */
-int tracing_snapshot_cond_disable(struct trace_array *tr)
-{
-	int ret = 0;
-
-	local_irq_disable();
-	arch_spin_lock(&tr->max_lock);
-
-	if (!tr->cond_snapshot)
-		ret = -EINVAL;
-	else {
-		kfree(tr->cond_snapshot);
-		tr->cond_snapshot = NULL;
-	}
-
-	arch_spin_unlock(&tr->max_lock);
-	local_irq_enable();
-
-	tracing_disarm_snapshot(tr);
-
-	return ret;
-}
-EXPORT_SYMBOL_GPL(tracing_snapshot_cond_disable);
 #else
 void tracing_snapshot(void)
 {
 	WARN_ONCE(1, "Snapshot feature not enabled, but internal snapshot used");
 }
 EXPORT_SYMBOL_GPL(tracing_snapshot);
-void tracing_snapshot_cond(struct trace_array *tr, void *cond_data)
-{
-	WARN_ONCE(1, "Snapshot feature not enabled, but internal conditional snapshot used");
-}
-EXPORT_SYMBOL_GPL(tracing_snapshot_cond);
 int tracing_alloc_snapshot(void)
 {
 	WARN_ONCE(1, "Snapshot feature not enabled, but snapshot allocation used");
@@ -1158,23 +829,6 @@ void tracing_snapshot_alloc(void)
 	tracing_snapshot();
 }
 EXPORT_SYMBOL_GPL(tracing_snapshot_alloc);
-void *tracing_cond_snapshot_data(struct trace_array *tr)
-{
-	return NULL;
-}
-EXPORT_SYMBOL_GPL(tracing_cond_snapshot_data);
-int tracing_snapshot_cond_enable(struct trace_array *tr, void *cond_data, cond_update_fn_t update)
-{
-	return -ENODEV;
-}
-EXPORT_SYMBOL_GPL(tracing_snapshot_cond_enable);
-int tracing_snapshot_cond_disable(struct trace_array *tr)
-{
-	return false;
-}
-EXPORT_SYMBOL_GPL(tracing_snapshot_cond_disable);
-#define free_snapshot(tr)	do { } while (0)
-#define tracing_arm_snapshot_locked(tr) ({ -EBUSY; })
 #endif /* CONFIG_TRACER_SNAPSHOT */
 
 void tracer_tracing_off(struct trace_array *tr)
@@ -1487,206 +1141,6 @@ static ssize_t trace_seq_to_buffer(struct trace_seq *s, void *buf, size_t cnt)
 
 unsigned long __read_mostly	tracing_thresh;
 
-#ifdef CONFIG_TRACER_MAX_TRACE
-#ifdef LATENCY_FS_NOTIFY
-static struct workqueue_struct *fsnotify_wq;
-
-static void latency_fsnotify_workfn(struct work_struct *work)
-{
-	struct trace_array *tr = container_of(work, struct trace_array,
-					      fsnotify_work);
-	fsnotify_inode(tr->d_max_latency->d_inode, FS_MODIFY);
-}
-
-static void latency_fsnotify_workfn_irq(struct irq_work *iwork)
-{
-	struct trace_array *tr = container_of(iwork, struct trace_array,
-					      fsnotify_irqwork);
-	queue_work(fsnotify_wq, &tr->fsnotify_work);
-}
-
-__init static int latency_fsnotify_init(void)
-{
-	fsnotify_wq = alloc_workqueue("tr_max_lat_wq",
-				      WQ_UNBOUND | WQ_HIGHPRI, 0);
-	if (!fsnotify_wq) {
-		pr_err("Unable to allocate tr_max_lat_wq\n");
-		return -ENOMEM;
-	}
-	return 0;
-}
-
-late_initcall_sync(latency_fsnotify_init);
-
-void latency_fsnotify(struct trace_array *tr)
-{
-	if (!fsnotify_wq)
-		return;
-	/*
-	 * We cannot call queue_work(&tr->fsnotify_work) from here because it's
-	 * possible that we are called from __schedule() or do_idle(), which
-	 * could cause a deadlock.
-	 */
-	irq_work_queue(&tr->fsnotify_irqwork);
-}
-#endif /* !LATENCY_FS_NOTIFY */
-
-static const struct file_operations tracing_max_lat_fops;
-
-static void trace_create_maxlat_file(struct trace_array *tr,
-				     struct dentry *d_tracer)
-{
-#ifdef LATENCY_FS_NOTIFY
-	INIT_WORK(&tr->fsnotify_work, latency_fsnotify_workfn);
-	init_irq_work(&tr->fsnotify_irqwork, latency_fsnotify_workfn_irq);
-#endif
-	tr->d_max_latency = trace_create_file("tracing_max_latency",
-					      TRACE_MODE_WRITE,
-					      d_tracer, tr,
-					      &tracing_max_lat_fops);
-}
-
-/*
- * Copy the new maximum trace into the separate maximum-trace
- * structure. (this way the maximum trace is permanently saved,
- * for later retrieval via /sys/kernel/tracing/tracing_max_latency)
- */
-static void
-__update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu)
-{
-	struct array_buffer *trace_buf = &tr->array_buffer;
-	struct trace_array_cpu *data = per_cpu_ptr(trace_buf->data, cpu);
-	struct array_buffer *max_buf = &tr->snapshot_buffer;
-	struct trace_array_cpu *max_data = per_cpu_ptr(max_buf->data, cpu);
-
-	max_buf->cpu = cpu;
-	max_buf->time_start = data->preempt_timestamp;
-
-	max_data->saved_latency = tr->max_latency;
-	max_data->critical_start = data->critical_start;
-	max_data->critical_end = data->critical_end;
-
-	strscpy(max_data->comm, tsk->comm);
-	max_data->pid = tsk->pid;
-	/*
-	 * If tsk == current, then use current_uid(), as that does not use
-	 * RCU. The irq tracer can be called out of RCU scope.
-	 */
-	if (tsk == current)
-		max_data->uid = current_uid();
-	else
-		max_data->uid = task_uid(tsk);
-
-	max_data->nice = tsk->static_prio - 20 - MAX_RT_PRIO;
-	max_data->policy = tsk->policy;
-	max_data->rt_priority = tsk->rt_priority;
-
-	/* record this tasks comm */
-	tracing_record_cmdline(tsk);
-	latency_fsnotify(tr);
-}
-#else
-static inline void trace_create_maxlat_file(struct trace_array *tr,
-					    struct dentry *d_tracer) { }
-static inline void __update_max_tr(struct trace_array *tr,
-				   struct task_struct *tsk, int cpu) { }
-#endif /* CONFIG_TRACER_MAX_TRACE */
-
-#ifdef CONFIG_TRACER_SNAPSHOT
-/**
- * update_max_tr - snapshot all trace buffers from global_trace to max_tr
- * @tr: tracer
- * @tsk: the task with the latency
- * @cpu: The cpu that initiated the trace.
- * @cond_data: User data associated with a conditional snapshot
- *
- * Flip the buffers between the @tr and the max_tr and record information
- * about which task was the cause of this latency.
- */
-void
-update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu,
-	      void *cond_data)
-{
-	if (tr->stop_count)
-		return;
-
-	WARN_ON_ONCE(!irqs_disabled());
-
-	if (!tr->allocated_snapshot) {
-		/* Only the nop tracer should hit this when disabling */
-		WARN_ON_ONCE(tr->current_trace != &nop_trace);
-		return;
-	}
-
-	arch_spin_lock(&tr->max_lock);
-
-	/* Inherit the recordable setting from array_buffer */
-	if (ring_buffer_record_is_set_on(tr->array_buffer.buffer))
-		ring_buffer_record_on(tr->snapshot_buffer.buffer);
-	else
-		ring_buffer_record_off(tr->snapshot_buffer.buffer);
-
-	if (tr->cond_snapshot && !tr->cond_snapshot->update(tr, cond_data)) {
-		arch_spin_unlock(&tr->max_lock);
-		return;
-	}
-
-	swap(tr->array_buffer.buffer, tr->snapshot_buffer.buffer);
-
-	__update_max_tr(tr, tsk, cpu);
-
-	arch_spin_unlock(&tr->max_lock);
-
-	/* Any waiters on the old snapshot buffer need to wake up */
-	ring_buffer_wake_waiters(tr->array_buffer.buffer, RING_BUFFER_ALL_CPUS);
-}
-
-/**
- * update_max_tr_single - only copy one trace over, and reset the rest
- * @tr: tracer
- * @tsk: task with the latency
- * @cpu: the cpu of the buffer to copy.
- *
- * Flip the trace of a single CPU buffer between the @tr and the max_tr.
- */
-void
-update_max_tr_single(struct trace_array *tr, struct task_struct *tsk, int cpu)
-{
-	int ret;
-
-	if (tr->stop_count)
-		return;
-
-	WARN_ON_ONCE(!irqs_disabled());
-	if (!tr->allocated_snapshot) {
-		/* Only the nop tracer should hit this when disabling */
-		WARN_ON_ONCE(tr->current_trace != &nop_trace);
-		return;
-	}
-
-	arch_spin_lock(&tr->max_lock);
-
-	ret = ring_buffer_swap_cpu(tr->snapshot_buffer.buffer, tr->array_buffer.buffer, cpu);
-
-	if (ret == -EBUSY) {
-		/*
-		 * We failed to swap the buffer due to a commit taking
-		 * place on this CPU. We fail to record, but we reset
-		 * the max trace buffer (no one writes directly to it)
-		 * and flag that it failed.
-		 * Another reason is resize is in progress.
-		 */
-		trace_array_printk_buf(tr->snapshot_buffer.buffer, _THIS_IP_,
-			"Failed to swap buffers due to commit or resize in progress\n");
-	}
-
-	WARN_ON_ONCE(ret && ret != -EAGAIN && ret != -EBUSY);
-
-	__update_max_tr(tr, tsk, cpu);
-	arch_spin_unlock(&tr->max_lock);
-}
-#endif /* CONFIG_TRACER_SNAPSHOT */
-
 struct pipe_wait {
 	struct trace_iterator		*iter;
 	int				wait_index;
@@ -1995,7 +1449,7 @@ int __init register_tracer(struct tracer *type)
 	return 0;
 }
 
-static void tracing_reset_cpu(struct array_buffer *buf, int cpu)
+void tracing_reset_cpu(struct array_buffer *buf, int cpu)
 {
 	struct trace_buffer *buffer = buf->buffer;
 
@@ -3760,50 +3214,6 @@ static void test_ftrace_alive(struct seq_file *m)
 		    "#          MAY BE MISSING FUNCTION EVENTS\n");
 }
 
-#ifdef CONFIG_TRACER_SNAPSHOT
-static void show_snapshot_main_help(struct seq_file *m)
-{
-	seq_puts(m, "# echo 0 > snapshot : Clears and frees snapshot buffer\n"
-		    "# echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.\n"
-		    "#                      Takes a snapshot of the main buffer.\n"
-		    "# echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free)\n"
-		    "#                      (Doesn't have to be '2' works with any number that\n"
-		    "#                       is not a '0' or '1')\n");
-}
-
-static void show_snapshot_percpu_help(struct seq_file *m)
-{
-	seq_puts(m, "# echo 0 > snapshot : Invalid for per_cpu snapshot file.\n");
-#ifdef CONFIG_RING_BUFFER_ALLOW_SWAP
-	seq_puts(m, "# echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.\n"
-		    "#                      Takes a snapshot of the main buffer for this cpu.\n");
-#else
-	seq_puts(m, "# echo 1 > snapshot : Not supported with this kernel.\n"
-		    "#                     Must use main snapshot file to allocate.\n");
-#endif
-	seq_puts(m, "# echo 2 > snapshot : Clears this cpu's snapshot buffer (but does not allocate)\n"
-		    "#                      (Doesn't have to be '2' works with any number that\n"
-		    "#                       is not a '0' or '1')\n");
-}
-
-static void print_snapshot_help(struct seq_file *m, struct trace_iterator *iter)
-{
-	if (iter->tr->allocated_snapshot)
-		seq_puts(m, "#\n# * Snapshot is allocated *\n#\n");
-	else
-		seq_puts(m, "#\n# * Snapshot is freed *\n#\n");
-
-	seq_puts(m, "# Snapshot commands:\n");
-	if (iter->cpu_file == RING_BUFFER_ALL_CPUS)
-		show_snapshot_main_help(m);
-	else
-		show_snapshot_percpu_help(m);
-}
-#else
-/* Should never be called */
-static inline void print_snapshot_help(struct seq_file *m, struct trace_iterator *iter) { }
-#endif
-
 static int s_show(struct seq_file *m, void *v)
 {
 	struct trace_iterator *iter = v;
@@ -3852,17 +3262,6 @@ static int s_show(struct seq_file *m, void *v)
 	return 0;
 }
 
-/*
- * Should be used after trace_array_get(), trace_types_lock
- * ensures that i_cdev was already initialized.
- */
-static inline int tracing_get_cpu(struct inode *inode)
-{
-	if (inode->i_cdev) /* See trace_create_cpu_file() */
-		return (long)inode->i_cdev - 1;
-	return RING_BUFFER_ALL_CPUS;
-}
-
 static const struct seq_operations tracer_seq_ops = {
 	.start		= s_start,
 	.next		= s_next,
@@ -3889,7 +3288,7 @@ static void free_trace_iter_content(struct trace_iterator *iter)
 	free_cpumask_var(iter->started);
 }
 
-static struct trace_iterator *
+struct trace_iterator *
 __tracing_open(struct inode *inode, struct file *file, bool snapshot)
 {
 	struct trace_array *tr = inode->i_private;
@@ -4071,7 +3470,7 @@ int tracing_single_release_file_tr(struct inode *inode, struct file *filp)
 	return single_release(inode, filp);
 }
 
-static int tracing_release(struct inode *inode, struct file *file)
+int tracing_release(struct inode *inode, struct file *file)
 {
 	struct trace_array *tr = inode->i_private;
 	struct seq_file *m = file->private_data;
@@ -5222,7 +4621,7 @@ int tracer_init(struct tracer *t, struct trace_array *tr)
 	return t->init(tr);
 }
 
-static void set_buffer_entries(struct array_buffer *buf, unsigned long val)
+void trace_set_buffer_entries(struct array_buffer *buf, unsigned long val)
 {
 	int cpu;
 
@@ -5233,40 +4632,12 @@ static void set_buffer_entries(struct array_buffer *buf, unsigned long val)
 static void update_buffer_entries(struct array_buffer *buf, int cpu)
 {
 	if (cpu == RING_BUFFER_ALL_CPUS) {
-		set_buffer_entries(buf, ring_buffer_size(buf->buffer, 0));
+		trace_set_buffer_entries(buf, ring_buffer_size(buf->buffer, 0));
 	} else {
 		per_cpu_ptr(buf->data, cpu)->entries = ring_buffer_size(buf->buffer, cpu);
 	}
 }
 
-#ifdef CONFIG_TRACER_SNAPSHOT
-/* resize @tr's buffer to the size of @size_tr's entries */
-static int resize_buffer_duplicate_size(struct array_buffer *trace_buf,
-					struct array_buffer *size_buf, int cpu_id)
-{
-	int cpu, ret = 0;
-
-	if (cpu_id == RING_BUFFER_ALL_CPUS) {
-		for_each_tracing_cpu(cpu) {
-			ret = ring_buffer_resize(trace_buf->buffer,
-				 per_cpu_ptr(size_buf->data, cpu)->entries, cpu);
-			if (ret < 0)
-				break;
-			per_cpu_ptr(trace_buf->data, cpu)->entries =
-				per_cpu_ptr(size_buf->data, cpu)->entries;
-		}
-	} else {
-		ret = ring_buffer_resize(trace_buf->buffer,
-				 per_cpu_ptr(size_buf->data, cpu_id)->entries, cpu_id);
-		if (ret == 0)
-			per_cpu_ptr(trace_buf->data, cpu_id)->entries =
-				per_cpu_ptr(size_buf->data, cpu_id)->entries;
-	}
-
-	return ret;
-}
-#endif /* CONFIG_TRACER_SNAPSHOT */
-
 static int __tracing_resize_ring_buffer(struct trace_array *tr,
 					unsigned long size, int cpu)
 {
@@ -5685,9 +5056,8 @@ tracing_set_trace_write(struct file *filp, const char __user *ubuf,
 	return ret;
 }
 
-static ssize_t
-tracing_nsecs_read(unsigned long *ptr, char __user *ubuf,
-		   size_t cnt, loff_t *ppos)
+ssize_t tracing_nsecs_read(unsigned long *ptr, char __user *ubuf,
+			   size_t cnt, loff_t *ppos)
 {
 	char buf[64];
 	int r;
@@ -5699,9 +5069,8 @@ tracing_nsecs_read(unsigned long *ptr, char __user *ubuf,
 	return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
 }
 
-static ssize_t
-tracing_nsecs_write(unsigned long *ptr, const char __user *ubuf,
-		    size_t cnt, loff_t *ppos)
+ssize_t tracing_nsecs_write(unsigned long *ptr, const char __user *ubuf,
+			    size_t cnt, loff_t *ppos)
 {
 	unsigned long val;
 	int ret;
@@ -5743,28 +5112,6 @@ tracing_thresh_write(struct file *filp, const char __user *ubuf,
 	return cnt;
 }
 
-#ifdef CONFIG_TRACER_MAX_TRACE
-
-static ssize_t
-tracing_max_lat_read(struct file *filp, char __user *ubuf,
-		     size_t cnt, loff_t *ppos)
-{
-	struct trace_array *tr = filp->private_data;
-
-	return tracing_nsecs_read(&tr->max_latency, ubuf, cnt, ppos);
-}
-
-static ssize_t
-tracing_max_lat_write(struct file *filp, const char __user *ubuf,
-		      size_t cnt, loff_t *ppos)
-{
-	struct trace_array *tr = filp->private_data;
-
-	return tracing_nsecs_write(&tr->max_latency, ubuf, cnt, ppos);
-}
-
-#endif
-
 static int open_pipe_on_cpu(struct trace_array *tr, int cpu)
 {
 	if (cpu == RING_BUFFER_ALL_CPUS) {
@@ -7052,266 +6399,78 @@ static ssize_t tracing_clock_write(struct file *filp, const char __user *ubuf,
 	const char *clockstr;
 	int ret;
 
-	if (cnt >= sizeof(buf))
-		return -EINVAL;
-
-	if (copy_from_user(buf, ubuf, cnt))
-		return -EFAULT;
-
-	buf[cnt] = 0;
-
-	clockstr = strstrip(buf);
-
-	ret = tracing_set_clock(tr, clockstr);
-	if (ret)
-		return ret;
-
-	*fpos += cnt;
-
-	return cnt;
-}
-
-static int tracing_clock_open(struct inode *inode, struct file *file)
-{
-	struct trace_array *tr = inode->i_private;
-	int ret;
-
-	ret = tracing_check_open_get_tr(tr);
-	if (ret)
-		return ret;
-
-	ret = single_open(file, tracing_clock_show, inode->i_private);
-	if (ret < 0)
-		trace_array_put(tr);
-
-	return ret;
-}
-
-static int tracing_time_stamp_mode_show(struct seq_file *m, void *v)
-{
-	struct trace_array *tr = m->private;
-
-	guard(mutex)(&trace_types_lock);
-
-	if (ring_buffer_time_stamp_abs(tr->array_buffer.buffer))
-		seq_puts(m, "delta [absolute]\n");
-	else
-		seq_puts(m, "[delta] absolute\n");
-
-	return 0;
-}
-
-static int tracing_time_stamp_mode_open(struct inode *inode, struct file *file)
-{
-	struct trace_array *tr = inode->i_private;
-	int ret;
-
-	ret = tracing_check_open_get_tr(tr);
-	if (ret)
-		return ret;
-
-	ret = single_open(file, tracing_time_stamp_mode_show, inode->i_private);
-	if (ret < 0)
-		trace_array_put(tr);
-
-	return ret;
-}
-
-u64 tracing_event_time_stamp(struct trace_buffer *buffer, struct ring_buffer_event *rbe)
-{
-	if (rbe == this_cpu_read(trace_buffered_event))
-		return ring_buffer_time_stamp(buffer);
-
-	return ring_buffer_event_time_stamp(buffer, rbe);
-}
-
-struct ftrace_buffer_info {
-	struct trace_iterator	iter;
-	void			*spare;
-	unsigned int		spare_cpu;
-	unsigned int		spare_size;
-	unsigned int		read;
-};
-
-#ifdef CONFIG_TRACER_SNAPSHOT
-static int tracing_snapshot_open(struct inode *inode, struct file *file)
-{
-	struct trace_array *tr = inode->i_private;
-	struct trace_iterator *iter;
-	struct seq_file *m;
-	int ret;
-
-	ret = tracing_check_open_get_tr(tr);
-	if (ret)
-		return ret;
-
-	if (file->f_mode & FMODE_READ) {
-		iter = __tracing_open(inode, file, true);
-		if (IS_ERR(iter))
-			ret = PTR_ERR(iter);
-	} else {
-		/* Writes still need the seq_file to hold the private data */
-		ret = -ENOMEM;
-		m = kzalloc_obj(*m);
-		if (!m)
-			goto out;
-		iter = kzalloc_obj(*iter);
-		if (!iter) {
-			kfree(m);
-			goto out;
-		}
-		ret = 0;
-
-		iter->tr = tr;
-		iter->array_buffer = &tr->snapshot_buffer;
-		iter->cpu_file = tracing_get_cpu(inode);
-		m->private = iter;
-		file->private_data = m;
-	}
-out:
-	if (ret < 0)
-		trace_array_put(tr);
-
-	return ret;
-}
-
-static void tracing_swap_cpu_buffer(void *tr)
-{
-	update_max_tr_single((struct trace_array *)tr, current, smp_processor_id());
-}
-
-static ssize_t
-tracing_snapshot_write(struct file *filp, const char __user *ubuf, size_t cnt,
-		       loff_t *ppos)
-{
-	struct seq_file *m = filp->private_data;
-	struct trace_iterator *iter = m->private;
-	struct trace_array *tr = iter->tr;
-	unsigned long val;
-	int ret;
-
-	ret = tracing_update_buffers(tr);
-	if (ret < 0)
-		return ret;
+	if (cnt >= sizeof(buf))
+		return -EINVAL;
 
-	ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
-	if (ret)
-		return ret;
+	if (copy_from_user(buf, ubuf, cnt))
+		return -EFAULT;
 
-	guard(mutex)(&trace_types_lock);
+	buf[cnt] = 0;
 
-	if (tracer_uses_snapshot(tr->current_trace))
-		return -EBUSY;
+	clockstr = strstrip(buf);
 
-	local_irq_disable();
-	arch_spin_lock(&tr->max_lock);
-	if (tr->cond_snapshot)
-		ret = -EBUSY;
-	arch_spin_unlock(&tr->max_lock);
-	local_irq_enable();
+	ret = tracing_set_clock(tr, clockstr);
 	if (ret)
 		return ret;
 
-	switch (val) {
-	case 0:
-		if (iter->cpu_file != RING_BUFFER_ALL_CPUS)
-			return -EINVAL;
-		if (tr->allocated_snapshot)
-			free_snapshot(tr);
-		break;
-	case 1:
-/* Only allow per-cpu swap if the ring buffer supports it */
-#ifndef CONFIG_RING_BUFFER_ALLOW_SWAP
-		if (iter->cpu_file != RING_BUFFER_ALL_CPUS)
-			return -EINVAL;
-#endif
-		if (tr->allocated_snapshot)
-			ret = resize_buffer_duplicate_size(&tr->snapshot_buffer,
-					&tr->array_buffer, iter->cpu_file);
+	*fpos += cnt;
 
-		ret = tracing_arm_snapshot_locked(tr);
-		if (ret)
-			return ret;
+	return cnt;
+}
 
-		/* Now, we're going to swap */
-		if (iter->cpu_file == RING_BUFFER_ALL_CPUS) {
-			local_irq_disable();
-			update_max_tr(tr, current, smp_processor_id(), NULL);
-			local_irq_enable();
-		} else {
-			smp_call_function_single(iter->cpu_file, tracing_swap_cpu_buffer,
-						 (void *)tr, 1);
-		}
-		tracing_disarm_snapshot(tr);
-		break;
-	default:
-		if (tr->allocated_snapshot) {
-			if (iter->cpu_file == RING_BUFFER_ALL_CPUS)
-				tracing_reset_online_cpus(&tr->snapshot_buffer);
-			else
-				tracing_reset_cpu(&tr->snapshot_buffer, iter->cpu_file);
-		}
-		break;
-	}
+static int tracing_clock_open(struct inode *inode, struct file *file)
+{
+	struct trace_array *tr = inode->i_private;
+	int ret;
 
-	if (ret >= 0) {
-		*ppos += cnt;
-		ret = cnt;
-	}
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
+
+	ret = single_open(file, tracing_clock_show, inode->i_private);
+	if (ret < 0)
+		trace_array_put(tr);
 
 	return ret;
 }
 
-static int tracing_snapshot_release(struct inode *inode, struct file *file)
+static int tracing_time_stamp_mode_show(struct seq_file *m, void *v)
 {
-	struct seq_file *m = file->private_data;
-	int ret;
-
-	ret = tracing_release(inode, file);
+	struct trace_array *tr = m->private;
 
-	if (file->f_mode & FMODE_READ)
-		return ret;
+	guard(mutex)(&trace_types_lock);
 
-	/* If write only, the seq_file is just a stub */
-	if (m)
-		kfree(m->private);
-	kfree(m);
+	if (ring_buffer_time_stamp_abs(tr->array_buffer.buffer))
+		seq_puts(m, "delta [absolute]\n");
+	else
+		seq_puts(m, "[delta] absolute\n");
 
 	return 0;
 }
 
-static int tracing_buffers_open(struct inode *inode, struct file *filp);
-static ssize_t tracing_buffers_read(struct file *filp, char __user *ubuf,
-				    size_t count, loff_t *ppos);
-static int tracing_buffers_release(struct inode *inode, struct file *file);
-static ssize_t tracing_buffers_splice_read(struct file *file, loff_t *ppos,
-		   struct pipe_inode_info *pipe, size_t len, unsigned int flags);
-
-static int snapshot_raw_open(struct inode *inode, struct file *filp)
+static int tracing_time_stamp_mode_open(struct inode *inode, struct file *file)
 {
-	struct ftrace_buffer_info *info;
+	struct trace_array *tr = inode->i_private;
 	int ret;
 
-	/* The following checks for tracefs lockdown */
-	ret = tracing_buffers_open(inode, filp);
-	if (ret < 0)
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
 		return ret;
 
-	info = filp->private_data;
-
-	if (tracer_uses_snapshot(info->iter.trace)) {
-		tracing_buffers_release(inode, filp);
-		return -EBUSY;
-	}
-
-	info->iter.snapshot = true;
-	info->iter.array_buffer = &info->iter.tr->snapshot_buffer;
+	ret = single_open(file, tracing_time_stamp_mode_show, inode->i_private);
+	if (ret < 0)
+		trace_array_put(tr);
 
 	return ret;
 }
 
-#endif /* CONFIG_TRACER_SNAPSHOT */
+u64 tracing_event_time_stamp(struct trace_buffer *buffer, struct ring_buffer_event *rbe)
+{
+	if (rbe == this_cpu_read(trace_buffered_event))
+		return ring_buffer_time_stamp(buffer);
 
+	return ring_buffer_event_time_stamp(buffer, rbe);
+}
 
 static const struct file_operations tracing_thresh_fops = {
 	.open		= tracing_open_generic,
@@ -7320,16 +6479,6 @@ static const struct file_operations tracing_thresh_fops = {
 	.llseek		= generic_file_llseek,
 };
 
-#ifdef CONFIG_TRACER_MAX_TRACE
-static const struct file_operations tracing_max_lat_fops = {
-	.open		= tracing_open_generic_tr,
-	.read		= tracing_max_lat_read,
-	.write		= tracing_max_lat_write,
-	.llseek		= generic_file_llseek,
-	.release	= tracing_release_generic_tr,
-};
-#endif
-
 static const struct file_operations set_tracer_fops = {
 	.open		= tracing_open_generic_tr,
 	.read		= tracing_set_trace_read,
@@ -7416,24 +6565,6 @@ static const struct file_operations last_boot_fops = {
 	.release	= tracing_seq_release,
 };
 
-#ifdef CONFIG_TRACER_SNAPSHOT
-static const struct file_operations snapshot_fops = {
-	.open		= tracing_snapshot_open,
-	.read		= seq_read,
-	.write		= tracing_snapshot_write,
-	.llseek		= tracing_lseek,
-	.release	= tracing_snapshot_release,
-};
-
-static const struct file_operations snapshot_raw_fops = {
-	.open		= snapshot_raw_open,
-	.read		= tracing_buffers_read,
-	.release	= tracing_buffers_release,
-	.splice_read	= tracing_buffers_splice_read,
-};
-
-#endif /* CONFIG_TRACER_SNAPSHOT */
-
 /*
  * trace_min_max_write - Write a u64 value to a trace_min_max_param struct
  * @filp: The active open file structure
@@ -7793,7 +6924,7 @@ static const struct file_operations tracing_err_log_fops = {
 	.release        = tracing_err_log_release,
 };
 
-static int tracing_buffers_open(struct inode *inode, struct file *filp)
+int tracing_buffers_open(struct inode *inode, struct file *filp)
 {
 	struct trace_array *tr = inode->i_private;
 	struct ftrace_buffer_info *info;
@@ -7841,9 +6972,8 @@ tracing_buffers_poll(struct file *filp, poll_table *poll_table)
 	return trace_poll(iter, filp, poll_table);
 }
 
-static ssize_t
-tracing_buffers_read(struct file *filp, char __user *ubuf,
-		     size_t count, loff_t *ppos)
+ssize_t tracing_buffers_read(struct file *filp, char __user *ubuf,
+			     size_t count, loff_t *ppos)
 {
 	struct ftrace_buffer_info *info = filp->private_data;
 	struct trace_iterator *iter = &info->iter;
@@ -7944,7 +7074,7 @@ static int tracing_buffers_flush(struct file *file, fl_owner_t id)
 	return 0;
 }
 
-static int tracing_buffers_release(struct inode *inode, struct file *file)
+int tracing_buffers_release(struct inode *inode, struct file *file)
 {
 	struct ftrace_buffer_info *info = file->private_data;
 	struct trace_iterator *iter = &info->iter;
@@ -8018,10 +7148,9 @@ static void buffer_spd_release(struct splice_pipe_desc *spd, unsigned int i)
 	spd->partial[i].private = 0;
 }
 
-static ssize_t
-tracing_buffers_splice_read(struct file *file, loff_t *ppos,
-			    struct pipe_inode_info *pipe, size_t len,
-			    unsigned int flags)
+ssize_t tracing_buffers_splice_read(struct file *file, loff_t *ppos,
+				    struct pipe_inode_info *pipe, size_t len,
+				    unsigned int flags)
 {
 	struct ftrace_buffer_info *info = file->private_data;
 	struct trace_iterator *iter = &info->iter;
@@ -8175,44 +7304,6 @@ static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned
 	return 0;
 }
 
-#ifdef CONFIG_TRACER_SNAPSHOT
-static int get_snapshot_map(struct trace_array *tr)
-{
-	int err = 0;
-
-	/*
-	 * Called with mmap_lock held. lockdep would be unhappy if we would now
-	 * take trace_types_lock. Instead use the specific
-	 * snapshot_trigger_lock.
-	 */
-	spin_lock(&tr->snapshot_trigger_lock);
-
-	if (tr->snapshot || tr->mapped == UINT_MAX)
-		err = -EBUSY;
-	else
-		tr->mapped++;
-
-	spin_unlock(&tr->snapshot_trigger_lock);
-
-	/* Wait for update_max_tr() to observe iter->tr->mapped */
-	if (tr->mapped == 1)
-		synchronize_rcu();
-
-	return err;
-
-}
-static void put_snapshot_map(struct trace_array *tr)
-{
-	spin_lock(&tr->snapshot_trigger_lock);
-	if (!WARN_ON(!tr->mapped))
-		tr->mapped--;
-	spin_unlock(&tr->snapshot_trigger_lock);
-}
-#else
-static inline int get_snapshot_map(struct trace_array *tr) { return 0; }
-static inline void put_snapshot_map(struct trace_array *tr) { }
-#endif
-
 static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
 {
 	struct ftrace_buffer_info *info = vma->vm_file->private_data;
@@ -8380,170 +7471,6 @@ static const struct file_operations tracing_dyn_info_fops = {
 };
 #endif /* CONFIG_DYNAMIC_FTRACE */
 
-#if defined(CONFIG_TRACER_SNAPSHOT) && defined(CONFIG_DYNAMIC_FTRACE)
-static void
-ftrace_snapshot(unsigned long ip, unsigned long parent_ip,
-		struct trace_array *tr, struct ftrace_probe_ops *ops,
-		void *data)
-{
-	tracing_snapshot_instance(tr);
-}
-
-static void
-ftrace_count_snapshot(unsigned long ip, unsigned long parent_ip,
-		      struct trace_array *tr, struct ftrace_probe_ops *ops,
-		      void *data)
-{
-	struct ftrace_func_mapper *mapper = data;
-	long *count = NULL;
-
-	if (mapper)
-		count = (long *)ftrace_func_mapper_find_ip(mapper, ip);
-
-	if (count) {
-
-		if (*count <= 0)
-			return;
-
-		(*count)--;
-	}
-
-	tracing_snapshot_instance(tr);
-}
-
-static int
-ftrace_snapshot_print(struct seq_file *m, unsigned long ip,
-		      struct ftrace_probe_ops *ops, void *data)
-{
-	struct ftrace_func_mapper *mapper = data;
-	long *count = NULL;
-
-	seq_printf(m, "%ps:", (void *)ip);
-
-	seq_puts(m, "snapshot");
-
-	if (mapper)
-		count = (long *)ftrace_func_mapper_find_ip(mapper, ip);
-
-	if (count)
-		seq_printf(m, ":count=%ld\n", *count);
-	else
-		seq_puts(m, ":unlimited\n");
-
-	return 0;
-}
-
-static int
-ftrace_snapshot_init(struct ftrace_probe_ops *ops, struct trace_array *tr,
-		     unsigned long ip, void *init_data, void **data)
-{
-	struct ftrace_func_mapper *mapper = *data;
-
-	if (!mapper) {
-		mapper = allocate_ftrace_func_mapper();
-		if (!mapper)
-			return -ENOMEM;
-		*data = mapper;
-	}
-
-	return ftrace_func_mapper_add_ip(mapper, ip, init_data);
-}
-
-static void
-ftrace_snapshot_free(struct ftrace_probe_ops *ops, struct trace_array *tr,
-		     unsigned long ip, void *data)
-{
-	struct ftrace_func_mapper *mapper = data;
-
-	if (!ip) {
-		if (!mapper)
-			return;
-		free_ftrace_func_mapper(mapper, NULL);
-		return;
-	}
-
-	ftrace_func_mapper_remove_ip(mapper, ip);
-}
-
-static struct ftrace_probe_ops snapshot_probe_ops = {
-	.func			= ftrace_snapshot,
-	.print			= ftrace_snapshot_print,
-};
-
-static struct ftrace_probe_ops snapshot_count_probe_ops = {
-	.func			= ftrace_count_snapshot,
-	.print			= ftrace_snapshot_print,
-	.init			= ftrace_snapshot_init,
-	.free			= ftrace_snapshot_free,
-};
-
-static int
-ftrace_trace_snapshot_callback(struct trace_array *tr, struct ftrace_hash *hash,
-			       char *glob, char *cmd, char *param, int enable)
-{
-	struct ftrace_probe_ops *ops;
-	void *count = (void *)-1;
-	char *number;
-	int ret;
-
-	if (!tr)
-		return -ENODEV;
-
-	/* hash funcs only work with set_ftrace_filter */
-	if (!enable)
-		return -EINVAL;
-
-	ops = param ? &snapshot_count_probe_ops :  &snapshot_probe_ops;
-
-	if (glob[0] == '!') {
-		ret = unregister_ftrace_function_probe_func(glob+1, tr, ops);
-		if (!ret)
-			tracing_disarm_snapshot(tr);
-
-		return ret;
-	}
-
-	if (!param)
-		goto out_reg;
-
-	number = strsep(&param, ":");
-
-	if (!strlen(number))
-		goto out_reg;
-
-	/*
-	 * We use the callback data field (which is a pointer)
-	 * as our counter.
-	 */
-	ret = kstrtoul(number, 0, (unsigned long *)&count);
-	if (ret)
-		return ret;
-
- out_reg:
-	ret = tracing_arm_snapshot(tr);
-	if (ret < 0)
-		return ret;
-
-	ret = register_ftrace_function_probe(glob, tr, ops, count);
-	if (ret < 0)
-		tracing_disarm_snapshot(tr);
-
-	return ret < 0 ? ret : 0;
-}
-
-static struct ftrace_func_command ftrace_snapshot_cmd = {
-	.name			= "snapshot",
-	.func			= ftrace_trace_snapshot_callback,
-};
-
-static __init int register_snapshot_cmd(void)
-{
-	return register_ftrace_command(&ftrace_snapshot_cmd);
-}
-#else
-static inline __init int register_snapshot_cmd(void) { return 0; }
-#endif /* defined(CONFIG_TRACER_SNAPSHOT) && defined(CONFIG_DYNAMIC_FTRACE) */
-
 static struct dentry *tracing_get_dentry(struct trace_array *tr)
 {
 	/* Top directory uses NULL as the parent */
@@ -9336,7 +8263,7 @@ static void setup_trace_scratch(struct trace_array *tr,
 	memset(tscratch, 0, size);
 }
 
-static int
+int
 allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, int size)
 {
 	enum ring_buffer_flags rb_flags;
@@ -9376,8 +8303,8 @@ allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, int size
 	}
 
 	/* Allocate the first page for all buffers */
-	set_buffer_entries(&tr->array_buffer,
-			   ring_buffer_size(tr->array_buffer.buffer, 0));
+	trace_set_buffer_entries(&tr->array_buffer,
+				 ring_buffer_size(tr->array_buffer.buffer, 0));
 
 	return 0;
 }
@@ -9400,23 +8327,11 @@ static int allocate_trace_buffers(struct trace_array *tr, int size)
 	if (ret)
 		return ret;
 
-#ifdef CONFIG_TRACER_SNAPSHOT
-	/* Fix mapped buffer trace arrays do not have snapshot buffers */
-	if (tr->range_addr_start)
-		return 0;
-
-	ret = allocate_trace_buffer(tr, &tr->snapshot_buffer,
-				    allocate_snapshot ? size : 1);
-	if (MEM_FAIL(ret, "Failed to allocate trace buffer\n")) {
+	ret = trace_allocate_snapshot(tr, size);
+	if (MEM_FAIL(ret, "Failed to allocate trace buffer\n"))
 		free_trace_buffer(&tr->array_buffer);
-		return -ENOMEM;
-	}
-	tr->allocated_snapshot = allocate_snapshot;
-
-	allocate_snapshot = false;
-#endif
 
-	return 0;
+	return ret;
 }
 
 static void free_trace_buffers(struct trace_array *tr)
@@ -10523,47 +9438,6 @@ ssize_t trace_parse_run_command(struct file *file, const char __user *buffer,
 	return done;
 }
 
-#ifdef CONFIG_TRACER_SNAPSHOT
-__init static bool tr_needs_alloc_snapshot(const char *name)
-{
-	char *test;
-	int len = strlen(name);
-	bool ret;
-
-	if (!boot_snapshot_index)
-		return false;
-
-	if (strncmp(name, boot_snapshot_info, len) == 0 &&
-	    boot_snapshot_info[len] == '\t')
-		return true;
-
-	test = kmalloc(strlen(name) + 3, GFP_KERNEL);
-	if (!test)
-		return false;
-
-	sprintf(test, "\t%s\t", name);
-	ret = strstr(boot_snapshot_info, test) == NULL;
-	kfree(test);
-	return ret;
-}
-
-__init static void do_allocate_snapshot(const char *name)
-{
-	if (!tr_needs_alloc_snapshot(name))
-		return;
-
-	/*
-	 * When allocate_snapshot is set, the next call to
-	 * allocate_trace_buffers() (called by trace_array_get_by_name())
-	 * will allocate the snapshot buffer. That will also clear
-	 * this flag.
-	 */
-	allocate_snapshot = true;
-}
-#else
-static inline void do_allocate_snapshot(const char *name) { }
-#endif
-
 __init static int backup_instance_area(const char *backup,
 				       unsigned long *addr, phys_addr_t *size)
 {
@@ -10713,8 +9587,7 @@ __init static void enable_instances(void)
 			}
 		} else {
 			/* Only non mapped buffers have snapshot buffers */
-			if (IS_ENABLED(CONFIG_TRACER_SNAPSHOT))
-				do_allocate_snapshot(name);
+			do_allocate_snapshot(name);
 		}
 
 		tr = trace_array_create_systems(name, NULL, addr, size);
@@ -10906,24 +9779,6 @@ struct trace_array *trace_get_global_array(void)
 }
 #endif
 
-void __init ftrace_boot_snapshot(void)
-{
-#ifdef CONFIG_TRACER_SNAPSHOT
-	struct trace_array *tr;
-
-	if (!snapshot_at_boot)
-		return;
-
-	list_for_each_entry(tr, &ftrace_trace_arrays, list) {
-		if (!tr->allocated_snapshot)
-			continue;
-
-		tracing_snapshot_instance(tr);
-		trace_array_puts(tr, "** Boot snapshot taken **\n");
-	}
-#endif
-}
-
 void __init early_trace_init(void)
 {
 	if (tracepoint_printk) {
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index b8f3804586a0..b0fb9bec1357 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -264,6 +264,7 @@ static inline bool still_need_pid_events(int type, struct trace_pid_list *pid_li
 
 typedef bool (*cond_update_fn_t)(struct trace_array *tr, void *cond_data);
 
+#ifdef CONFIG_TRACER_SNAPSHOT
 /**
  * struct cond_snapshot - conditional snapshot data and callback
  *
@@ -306,6 +307,7 @@ struct cond_snapshot {
 	void				*cond_data;
 	cond_update_fn_t		update;
 };
+#endif /* CONFIG_TRACER_SNAPSHOT */
 
 /*
  * struct trace_func_repeats - used to keep track of the consecutive
@@ -675,6 +677,7 @@ void tracing_reset_all_online_cpus(void);
 void tracing_reset_all_online_cpus_unlocked(void);
 int tracing_open_generic(struct inode *inode, struct file *filp);
 int tracing_open_generic_tr(struct inode *inode, struct file *filp);
+int tracing_release(struct inode *inode, struct file *file);
 int tracing_release_generic_tr(struct inode *inode, struct file *file);
 int tracing_open_file_tr(struct inode *inode, struct file *filp);
 int tracing_release_file_tr(struct inode *inode, struct file *filp);
@@ -684,12 +687,48 @@ void tracer_tracing_on(struct trace_array *tr);
 void tracer_tracing_off(struct trace_array *tr);
 void tracer_tracing_disable(struct trace_array *tr);
 void tracer_tracing_enable(struct trace_array *tr);
+int allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, int size);
 struct dentry *trace_create_file(const char *name,
 				 umode_t mode,
 				 struct dentry *parent,
 				 void *data,
 				 const struct file_operations *fops);
 
+struct trace_iterator *__tracing_open(struct inode *inode, struct file *file,
+				      bool snapshot);
+int tracing_buffers_open(struct inode *inode, struct file *filp);
+ssize_t tracing_buffers_read(struct file *filp, char __user *ubuf,
+			     size_t count, loff_t *ppos);
+int tracing_buffers_release(struct inode *inode, struct file *file);
+ssize_t tracing_buffers_splice_read(struct file *file, loff_t *ppos,
+		   struct pipe_inode_info *pipe, size_t len, unsigned int flags);
+
+ssize_t tracing_nsecs_read(unsigned long *ptr, char __user *ubuf,
+			   size_t cnt, loff_t *ppos);
+ssize_t tracing_nsecs_write(unsigned long *ptr, const char __user *ubuf,
+			    size_t cnt, loff_t *ppos);
+
+void trace_set_buffer_entries(struct array_buffer *buf, unsigned long val);
+
+/*
+ * Should be used after trace_array_get(), trace_types_lock
+ * ensures that i_cdev was already initialized.
+ */
+static inline int tracing_get_cpu(struct inode *inode)
+{
+	if (inode->i_cdev) /* See trace_create_cpu_file() */
+		return (long)inode->i_cdev - 1;
+	return RING_BUFFER_ALL_CPUS;
+}
+void tracing_reset_cpu(struct array_buffer *buf, int cpu);
+
+struct ftrace_buffer_info {
+	struct trace_iterator	iter;
+	void			*spare;
+	unsigned int		spare_cpu;
+	unsigned int		spare_size;
+	unsigned int		read;
+};
 
 /**
  * tracer_tracing_is_on_cpu - show real state of ring buffer enabled on for a cpu
@@ -828,11 +867,15 @@ static inline bool tracer_uses_snapshot(struct tracer *tracer)
 {
 	return tracer->use_max_tr;
 }
+void trace_create_maxlat_file(struct trace_array *tr,
+			      struct dentry *d_tracer);
 #else
 static inline bool tracer_uses_snapshot(struct tracer *tracer)
 {
 	return false;
 }
+static inline void trace_create_maxlat_file(struct trace_array *tr,
+					    struct dentry *d_tracer) { }
 #endif
 
 void trace_last_func_repeats(struct trace_array *tr,
@@ -2135,12 +2178,6 @@ static inline bool event_command_needs_rec(struct event_command *cmd_ops)
 
 extern int trace_event_enable_disable(struct trace_event_file *file,
 				      int enable, int soft_disable);
-extern int tracing_alloc_snapshot(void);
-extern void tracing_snapshot_cond(struct trace_array *tr, void *cond_data);
-extern int tracing_snapshot_cond_enable(struct trace_array *tr, void *cond_data, cond_update_fn_t update);
-
-extern int tracing_snapshot_cond_disable(struct trace_array *tr);
-extern void *tracing_cond_snapshot_data(struct trace_array *tr);
 
 extern const char *__start___trace_bprintk_fmt[];
 extern const char *__stop___trace_bprintk_fmt[];
@@ -2228,19 +2265,71 @@ static inline void trace_event_update_all(struct trace_eval_map **map, int len)
 #endif
 
 #ifdef CONFIG_TRACER_SNAPSHOT
+extern const struct file_operations snapshot_fops;
+extern const struct file_operations snapshot_raw_fops;
+
+/* Used when creating instances */
+int trace_allocate_snapshot(struct trace_array *tr, int size);
+
+int tracing_alloc_snapshot(void);
+void tracing_snapshot_cond(struct trace_array *tr, void *cond_data);
+int tracing_snapshot_cond_enable(struct trace_array *tr, void *cond_data, cond_update_fn_t update);
+int tracing_snapshot_cond_disable(struct trace_array *tr);
+void *tracing_cond_snapshot_data(struct trace_array *tr);
 void tracing_snapshot_instance(struct trace_array *tr);
 int tracing_alloc_snapshot_instance(struct trace_array *tr);
+int tracing_arm_snapshot_locked(struct trace_array *tr);
 int tracing_arm_snapshot(struct trace_array *tr);
 void tracing_disarm_snapshot(struct trace_array *tr);
-#else
+void free_snapshot(struct trace_array *tr);
+void print_snapshot_help(struct seq_file *m, struct trace_iterator *iter);
+int get_snapshot_map(struct trace_array *tr);
+void put_snapshot_map(struct trace_array *tr);
+int resize_buffer_duplicate_size(struct array_buffer *trace_buf,
+				 struct array_buffer *size_buf, int cpu_id);
+__init void do_allocate_snapshot(const char *name);
+# ifdef CONFIG_DYNAMIC_FTRACE
+__init int register_snapshot_cmd(void);
+# else
+static inline int register_snapshot_cmd(void) { return 0; }
+# endif
+#else /* !CONFIG_TRACER_SNAPSHOT */
+static inline int trace_allocate_snapshot(struct trace_array *tr, int size) { return 0; }
 static inline void tracing_snapshot_instance(struct trace_array *tr) { }
 static inline int tracing_alloc_snapshot_instance(struct trace_array *tr)
 {
 	return 0;
 }
+static inline int tracing_arm_snapshot_locked(struct trace_array *tr) { return -EBUSY; }
 static inline int tracing_arm_snapshot(struct trace_array *tr) { return 0; }
 static inline void tracing_disarm_snapshot(struct trace_array *tr) { }
-#endif
+static inline void free_snapshot(struct trace_array *tr) {}
+static inline void tracing_snapshot_cond(struct trace_array *tr, void *cond_data)
+{
+	WARN_ONCE(1, "Snapshot feature not enabled, but internal conditional snapshot used");
+}
+static inline void *tracing_cond_snapshot_data(struct trace_array *tr)
+{
+	return NULL;
+}
+static inline int tracing_snapshot_cond_enable(struct trace_array *tr, void *cond_data, cond_update_fn_t update)
+{
+	return -ENODEV;
+}
+static inline int tracing_snapshot_cond_disable(struct trace_array *tr)
+{
+	return false;
+}
+static inline void print_snapshot_help(struct seq_file *m, struct trace_iterator *iter)
+{
+	/* Should never be called */
+	WARN_ONCE(1, "Snapshot print function called without snapshot configured");
+}
+static inline int get_snapshot_map(struct trace_array *tr) { return 0; }
+static inline void put_snapshot_map(struct trace_array *tr) { }
+static inline void do_allocate_snapshot(const char *name) { }
+static inline int register_snapshot_cmd(void) { return 0; }
+#endif /* CONFIG_TRACER_SNAPSHOT */
 
 #ifdef CONFIG_PREEMPT_TRACER
 void tracer_preempt_on(unsigned long a0, unsigned long a1);
diff --git a/kernel/trace/trace_snapshot.c b/kernel/trace/trace_snapshot.c
new file mode 100644
index 000000000000..06561d6c004a
--- /dev/null
+++ b/kernel/trace/trace_snapshot.c
@@ -0,0 +1,1067 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/fsnotify.h>
+
+#include <asm/setup.h> /* COMMAND_LINE_SIZE */
+
+#include "trace.h"
+
+/* Used if snapshot allocated at boot */
+static bool allocate_snapshot;
+static bool snapshot_at_boot;
+
+static char boot_snapshot_info[COMMAND_LINE_SIZE] __initdata;
+static int boot_snapshot_index;
+
+static int __init boot_alloc_snapshot(char *str)
+{
+	char *slot = boot_snapshot_info + boot_snapshot_index;
+	int left = sizeof(boot_snapshot_info) - boot_snapshot_index;
+	int ret;
+
+	if (str[0] == '=') {
+		str++;
+		if (strlen(str) >= left)
+			return -1;
+
+		ret = snprintf(slot, left, "%s\t", str);
+		boot_snapshot_index += ret;
+	} else {
+		allocate_snapshot = true;
+		/* We also need the main ring buffer expanded */
+		trace_set_ring_buffer_expanded(NULL);
+	}
+	return 1;
+}
+__setup("alloc_snapshot", boot_alloc_snapshot);
+
+
+static int __init boot_snapshot(char *str)
+{
+	snapshot_at_boot = true;
+	boot_alloc_snapshot(str);
+	return 1;
+}
+__setup("ftrace_boot_snapshot", boot_snapshot);
+static void tracing_snapshot_instance_cond(struct trace_array *tr,
+					   void *cond_data)
+{
+	unsigned long flags;
+
+	if (in_nmi()) {
+		trace_array_puts(tr, "*** SNAPSHOT CALLED FROM NMI CONTEXT ***\n");
+		trace_array_puts(tr, "*** snapshot is being ignored        ***\n");
+		return;
+	}
+
+	if (!tr->allocated_snapshot) {
+		trace_array_puts(tr, "*** SNAPSHOT NOT ALLOCATED ***\n");
+		trace_array_puts(tr, "*** stopping trace here!   ***\n");
+		tracer_tracing_off(tr);
+		return;
+	}
+
+	if (tr->mapped) {
+		trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n");
+		trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n");
+		return;
+	}
+
+	/* Note, snapshot can not be used when the tracer uses it */
+	if (tracer_uses_snapshot(tr->current_trace)) {
+		trace_array_puts(tr, "*** LATENCY TRACER ACTIVE ***\n");
+		trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n");
+		return;
+	}
+
+	local_irq_save(flags);
+	update_max_tr(tr, current, smp_processor_id(), cond_data);
+	local_irq_restore(flags);
+}
+
+void tracing_snapshot_instance(struct trace_array *tr)
+{
+	tracing_snapshot_instance_cond(tr, NULL);
+}
+
+/**
+ * tracing_snapshot_cond - conditionally take a snapshot of the current buffer.
+ * @tr:		The tracing instance to snapshot
+ * @cond_data:	The data to be tested conditionally, and possibly saved
+ *
+ * This is the same as tracing_snapshot() except that the snapshot is
+ * conditional - the snapshot will only happen if the
+ * cond_snapshot.update() implementation receiving the cond_data
+ * returns true, which means that the trace array's cond_snapshot
+ * update() operation used the cond_data to determine whether the
+ * snapshot should be taken, and if it was, presumably saved it along
+ * with the snapshot.
+ */
+void tracing_snapshot_cond(struct trace_array *tr, void *cond_data)
+{
+	tracing_snapshot_instance_cond(tr, cond_data);
+}
+EXPORT_SYMBOL_GPL(tracing_snapshot_cond);
+
+/**
+ * tracing_cond_snapshot_data - get the user data associated with a snapshot
+ * @tr:		The tracing instance
+ *
+ * When the user enables a conditional snapshot using
+ * tracing_snapshot_cond_enable(), the user-defined cond_data is saved
+ * with the snapshot.  This accessor is used to retrieve it.
+ *
+ * Should not be called from cond_snapshot.update(), since it takes
+ * the tr->max_lock lock, which the code calling
+ * cond_snapshot.update() has already done.
+ *
+ * Returns the cond_data associated with the trace array's snapshot.
+ */
+void *tracing_cond_snapshot_data(struct trace_array *tr)
+{
+	void *cond_data = NULL;
+
+	local_irq_disable();
+	arch_spin_lock(&tr->max_lock);
+
+	if (tr->cond_snapshot)
+		cond_data = tr->cond_snapshot->cond_data;
+
+	arch_spin_unlock(&tr->max_lock);
+	local_irq_enable();
+
+	return cond_data;
+}
+EXPORT_SYMBOL_GPL(tracing_cond_snapshot_data);
+
+/* Resize @trace_buf's buffer to the size of @size_buf's entries */
+int resize_buffer_duplicate_size(struct array_buffer *trace_buf,
+				 struct array_buffer *size_buf, int cpu_id)
+{
+	int cpu, ret = 0;
+
+	if (cpu_id == RING_BUFFER_ALL_CPUS) {
+		for_each_tracing_cpu(cpu) {
+			ret = ring_buffer_resize(trace_buf->buffer,
+				 per_cpu_ptr(size_buf->data, cpu)->entries, cpu);
+			if (ret < 0)
+				break;
+			per_cpu_ptr(trace_buf->data, cpu)->entries =
+				per_cpu_ptr(size_buf->data, cpu)->entries;
+		}
+	} else {
+		ret = ring_buffer_resize(trace_buf->buffer,
+				 per_cpu_ptr(size_buf->data, cpu_id)->entries, cpu_id);
+		if (ret == 0)
+			per_cpu_ptr(trace_buf->data, cpu_id)->entries =
+				per_cpu_ptr(size_buf->data, cpu_id)->entries;
+	}
+
+	return ret;
+}
+
+int tracing_alloc_snapshot_instance(struct trace_array *tr)
+{
+	int order;
+	int ret;
+
+	if (!tr->allocated_snapshot) {
+
+		/* Make the snapshot buffer have the same order as main buffer */
+		order = ring_buffer_subbuf_order_get(tr->array_buffer.buffer);
+		ret = ring_buffer_subbuf_order_set(tr->snapshot_buffer.buffer, order);
+		if (ret < 0)
+			return ret;
+
+		/* allocate spare buffer */
+		ret = resize_buffer_duplicate_size(&tr->snapshot_buffer,
+				   &tr->array_buffer, RING_BUFFER_ALL_CPUS);
+		if (ret < 0)
+			return ret;
+
+		tr->allocated_snapshot = true;
+	}
+
+	return 0;
+}
+
+void free_snapshot(struct trace_array *tr)
+{
+	/*
+	 * We don't free the ring buffer; instead, we resize it, because
+	 * the max_tr ring buffer has some state (e.g. ring->clock) that
+	 * we want to preserve.
+	 */
+	ring_buffer_subbuf_order_set(tr->snapshot_buffer.buffer, 0);
+	ring_buffer_resize(tr->snapshot_buffer.buffer, 1, RING_BUFFER_ALL_CPUS);
+	trace_set_buffer_entries(&tr->snapshot_buffer, 1);
+	tracing_reset_online_cpus(&tr->snapshot_buffer);
+	tr->allocated_snapshot = false;
+}
+
+int tracing_arm_snapshot_locked(struct trace_array *tr)
+{
+	int ret;
+
+	lockdep_assert_held(&trace_types_lock);
+
+	spin_lock(&tr->snapshot_trigger_lock);
+	if (tr->snapshot == UINT_MAX || tr->mapped) {
+		spin_unlock(&tr->snapshot_trigger_lock);
+		return -EBUSY;
+	}
+
+	tr->snapshot++;
+	spin_unlock(&tr->snapshot_trigger_lock);
+
+	ret = tracing_alloc_snapshot_instance(tr);
+	if (ret) {
+		spin_lock(&tr->snapshot_trigger_lock);
+		tr->snapshot--;
+		spin_unlock(&tr->snapshot_trigger_lock);
+	}
+
+	return ret;
+}
+
+int tracing_arm_snapshot(struct trace_array *tr)
+{
+	guard(mutex)(&trace_types_lock);
+	return tracing_arm_snapshot_locked(tr);
+}
+
+void tracing_disarm_snapshot(struct trace_array *tr)
+{
+	spin_lock(&tr->snapshot_trigger_lock);
+	if (!WARN_ON(!tr->snapshot))
+		tr->snapshot--;
+	spin_unlock(&tr->snapshot_trigger_lock);
+}
+
+/**
+ * tracing_snapshot_alloc - allocate and take a snapshot of the current buffer.
+ *
+ * This is similar to tracing_snapshot(), but it will allocate the
+ * snapshot buffer if it isn't already allocated. Use this only
+ * where it is safe to sleep, as the allocation may sleep.
+ *
+ * This causes a swap between the snapshot buffer and the current live
+ * tracing buffer. You can use this to take snapshots of the live
+ * trace when some condition is triggered, but continue to trace.
+ */
+void tracing_snapshot_alloc(void)
+{
+	int ret;
+
+	ret = tracing_alloc_snapshot();
+	if (ret < 0)
+		return;
+
+	tracing_snapshot();
+}
+EXPORT_SYMBOL_GPL(tracing_snapshot_alloc);
+
+/**
+ * tracing_snapshot_cond_enable - enable conditional snapshot for an instance
+ * @tr:		The tracing instance
+ * @cond_data:	User data to associate with the snapshot
+ * @update:	Implementation of the cond_snapshot update function
+ *
+ * Check whether the conditional snapshot for the given instance has
+ * already been enabled, or if the current tracer is already using a
+ * snapshot; if so, return -EBUSY, else create a cond_snapshot and
+ * save the cond_data and update function inside.
+ *
+ * Returns 0 if successful, error otherwise.
+ */
+int tracing_snapshot_cond_enable(struct trace_array *tr, void *cond_data,
+				 cond_update_fn_t update)
+{
+	struct cond_snapshot *cond_snapshot __free(kfree) =
+		kzalloc_obj(*cond_snapshot);
+	int ret;
+
+	if (!cond_snapshot)
+		return -ENOMEM;
+
+	cond_snapshot->cond_data = cond_data;
+	cond_snapshot->update = update;
+
+	guard(mutex)(&trace_types_lock);
+
+	if (tracer_uses_snapshot(tr->current_trace))
+		return -EBUSY;
+
+	/*
+	 * The cond_snapshot can only change to NULL without the
+	 * trace_types_lock. We don't care if we race with it going
+	 * to NULL, but we want to make sure that it's not set to
+	 * something other than NULL when we get here, which we can
+	 * do safely with only holding the trace_types_lock and not
+	 * having to take the max_lock.
+	 */
+	if (tr->cond_snapshot)
+		return -EBUSY;
+
+	ret = tracing_arm_snapshot_locked(tr);
+	if (ret)
+		return ret;
+
+	local_irq_disable();
+	arch_spin_lock(&tr->max_lock);
+	tr->cond_snapshot = no_free_ptr(cond_snapshot);
+	arch_spin_unlock(&tr->max_lock);
+	local_irq_enable();
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(tracing_snapshot_cond_enable);
+
+/**
+ * tracing_snapshot_cond_disable - disable conditional snapshot for an instance
+ * @tr:		The tracing instance
+ *
+ * Check whether the conditional snapshot for the given instance is
+ * enabled; if so, free the cond_snapshot associated with it,
+ * otherwise return -EINVAL.
+ *
+ * Returns 0 if successful, error otherwise.
+ */
+int tracing_snapshot_cond_disable(struct trace_array *tr)
+{
+	int ret = 0;
+
+	local_irq_disable();
+	arch_spin_lock(&tr->max_lock);
+
+	if (!tr->cond_snapshot)
+		ret = -EINVAL;
+	else {
+		kfree(tr->cond_snapshot);
+		tr->cond_snapshot = NULL;
+	}
+
+	arch_spin_unlock(&tr->max_lock);
+	local_irq_enable();
+
+	tracing_disarm_snapshot(tr);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tracing_snapshot_cond_disable);
+
+#ifdef CONFIG_TRACER_MAX_TRACE
+#ifdef LATENCY_FS_NOTIFY
+static struct workqueue_struct *fsnotify_wq;
+
+static void latency_fsnotify_workfn(struct work_struct *work)
+{
+	struct trace_array *tr = container_of(work, struct trace_array,
+					      fsnotify_work);
+	fsnotify_inode(tr->d_max_latency->d_inode, FS_MODIFY);
+}
+
+static void latency_fsnotify_workfn_irq(struct irq_work *iwork)
+{
+	struct trace_array *tr = container_of(iwork, struct trace_array,
+					      fsnotify_irqwork);
+	queue_work(fsnotify_wq, &tr->fsnotify_work);
+}
+
+__init static int latency_fsnotify_init(void)
+{
+	fsnotify_wq = alloc_workqueue("tr_max_lat_wq",
+				      WQ_UNBOUND | WQ_HIGHPRI, 0);
+	if (!fsnotify_wq) {
+		pr_err("Unable to allocate tr_max_lat_wq\n");
+		return -ENOMEM;
+	}
+	return 0;
+}
+
+late_initcall_sync(latency_fsnotify_init);
+
+void latency_fsnotify(struct trace_array *tr)
+{
+	if (!fsnotify_wq)
+		return;
+	/*
+	 * We cannot call queue_work(&tr->fsnotify_work) from here because it's
+	 * possible that we are called from __schedule() or do_idle(), which
+	 * could cause a deadlock.
+	 */
+	irq_work_queue(&tr->fsnotify_irqwork);
+}
+#else
+static inline void latency_fsnotify(struct trace_array *tr) { }
+#endif /* LATENCY_FS_NOTIFY */
+static const struct file_operations tracing_max_lat_fops;
+
+void trace_create_maxlat_file(struct trace_array *tr,
+			      struct dentry *d_tracer)
+{
+#ifdef LATENCY_FS_NOTIFY
+	INIT_WORK(&tr->fsnotify_work, latency_fsnotify_workfn);
+	init_irq_work(&tr->fsnotify_irqwork, latency_fsnotify_workfn_irq);
+#endif
+	tr->d_max_latency = trace_create_file("tracing_max_latency",
+					      TRACE_MODE_WRITE,
+					      d_tracer, tr,
+					      &tracing_max_lat_fops);
+}
+
+/*
+ * Copy the new maximum trace into the separate maximum-trace
+ * structure. (this way the maximum trace is permanently saved,
+ * for later retrieval via /sys/kernel/tracing/tracing_max_latency)
+ */
+static void
+__update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu)
+{
+	struct array_buffer *trace_buf = &tr->array_buffer;
+	struct trace_array_cpu *data = per_cpu_ptr(trace_buf->data, cpu);
+	struct array_buffer *max_buf = &tr->snapshot_buffer;
+	struct trace_array_cpu *max_data = per_cpu_ptr(max_buf->data, cpu);
+
+	max_buf->cpu = cpu;
+	max_buf->time_start = data->preempt_timestamp;
+
+	max_data->saved_latency = tr->max_latency;
+	max_data->critical_start = data->critical_start;
+	max_data->critical_end = data->critical_end;
+
+	strscpy(max_data->comm, tsk->comm);
+	max_data->pid = tsk->pid;
+	/*
+	 * If tsk == current, then use current_uid(), as that does not use
+	 * RCU. The irq tracer can be called out of RCU scope.
+	 */
+	if (tsk == current)
+		max_data->uid = current_uid();
+	else
+		max_data->uid = task_uid(tsk);
+
+	max_data->nice = tsk->static_prio - 20 - MAX_RT_PRIO;
+	max_data->policy = tsk->policy;
+	max_data->rt_priority = tsk->rt_priority;
+
+	/* record this task's comm */
+	tracing_record_cmdline(tsk);
+	latency_fsnotify(tr);
+}
+#else
+static inline void __update_max_tr(struct trace_array *tr,
+				   struct task_struct *tsk, int cpu) { }
+#endif /* CONFIG_TRACER_MAX_TRACE */
+
+/**
+ * update_max_tr - snapshot all trace buffers from global_trace to max_tr
+ * @tr: tracer
+ * @tsk: the task with the latency
+ * @cpu: The cpu that initiated the trace.
+ * @cond_data: User data associated with a conditional snapshot
+ *
+ * Flip the buffers between the @tr and the max_tr and record information
+ * about which task was the cause of this latency.
+ */
+void
+update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu,
+	      void *cond_data)
+{
+	if (tr->stop_count)
+		return;
+
+	WARN_ON_ONCE(!irqs_disabled());
+
+	if (!tr->allocated_snapshot) {
+		/* Only the nop tracer should hit this when disabling */
+		WARN_ON_ONCE(tr->current_trace != &nop_trace);
+		return;
+	}
+
+	arch_spin_lock(&tr->max_lock);
+
+	/* Inherit the recordable setting from array_buffer */
+	if (ring_buffer_record_is_set_on(tr->array_buffer.buffer))
+		ring_buffer_record_on(tr->snapshot_buffer.buffer);
+	else
+		ring_buffer_record_off(tr->snapshot_buffer.buffer);
+
+	if (tr->cond_snapshot && !tr->cond_snapshot->update(tr, cond_data)) {
+		arch_spin_unlock(&tr->max_lock);
+		return;
+	}
+
+	swap(tr->array_buffer.buffer, tr->snapshot_buffer.buffer);
+
+	__update_max_tr(tr, tsk, cpu);
+
+	arch_spin_unlock(&tr->max_lock);
+
+	/* Any waiters on the old snapshot buffer need to wake up */
+	ring_buffer_wake_waiters(tr->array_buffer.buffer, RING_BUFFER_ALL_CPUS);
+}
+
+/**
+ * update_max_tr_single - only copy one trace over, and reset the rest
+ * @tr: tracer
+ * @tsk: task with the latency
+ * @cpu: the cpu of the buffer to copy.
+ *
+ * Flip the trace of a single CPU buffer between the @tr and the max_tr.
+ */
+void
+update_max_tr_single(struct trace_array *tr, struct task_struct *tsk, int cpu)
+{
+	int ret;
+
+	if (tr->stop_count)
+		return;
+
+	WARN_ON_ONCE(!irqs_disabled());
+	if (!tr->allocated_snapshot) {
+		/* Only the nop tracer should hit this when disabling */
+		WARN_ON_ONCE(tr->current_trace != &nop_trace);
+		return;
+	}
+
+	arch_spin_lock(&tr->max_lock);
+
+	ret = ring_buffer_swap_cpu(tr->snapshot_buffer.buffer, tr->array_buffer.buffer, cpu);
+
+	if (ret == -EBUSY) {
+		/*
+		 * We failed to swap the buffer due to a commit taking
+		 * place on this CPU. We fail to record, but we reset
+		 * the max trace buffer (no one writes directly to it)
+		 * and flag that it failed.
+		 * Another possible reason is that a resize is in progress.
+		 */
+		trace_array_printk_buf(tr->snapshot_buffer.buffer, _THIS_IP_,
+			"Failed to swap buffers due to commit or resize in progress\n");
+	}
+
+	WARN_ON_ONCE(ret && ret != -EAGAIN && ret != -EBUSY);
+
+	__update_max_tr(tr, tsk, cpu);
+	arch_spin_unlock(&tr->max_lock);
+}
+
+static void show_snapshot_main_help(struct seq_file *m)
+{
+	seq_puts(m, "# echo 0 > snapshot : Clears and frees snapshot buffer\n"
+		    "# echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.\n"
+		    "#                      Takes a snapshot of the main buffer.\n"
+		    "# echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free)\n"
+		    "#                      (Doesn't have to be '2' works with any number that\n"
+		    "#                       is not a '0' or '1')\n");
+}
+
+static void show_snapshot_percpu_help(struct seq_file *m)
+{
+	seq_puts(m, "# echo 0 > snapshot : Invalid for per_cpu snapshot file.\n");
+#ifdef CONFIG_RING_BUFFER_ALLOW_SWAP
+	seq_puts(m, "# echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.\n"
+		    "#                      Takes a snapshot of the main buffer for this cpu.\n");
+#else
+	seq_puts(m, "# echo 1 > snapshot : Not supported with this kernel.\n"
+		    "#                     Must use main snapshot file to allocate.\n");
+#endif
+	seq_puts(m, "# echo 2 > snapshot : Clears this cpu's snapshot buffer (but does not allocate)\n"
+		    "#                      (Doesn't have to be '2' works with any number that\n"
+		    "#                       is not a '0' or '1')\n");
+}
+
+void print_snapshot_help(struct seq_file *m, struct trace_iterator *iter)
+{
+	if (iter->tr->allocated_snapshot)
+		seq_puts(m, "#\n# * Snapshot is allocated *\n#\n");
+	else
+		seq_puts(m, "#\n# * Snapshot is freed *\n#\n");
+
+	seq_puts(m, "# Snapshot commands:\n");
+	if (iter->cpu_file == RING_BUFFER_ALL_CPUS)
+		show_snapshot_main_help(m);
+	else
+		show_snapshot_percpu_help(m);
+}
+
+static int tracing_snapshot_open(struct inode *inode, struct file *file)
+{
+	struct trace_array *tr = inode->i_private;
+	struct trace_iterator *iter;
+	struct seq_file *m;
+	int ret;
+
+	ret = tracing_check_open_get_tr(tr);
+	if (ret)
+		return ret;
+
+	if (file->f_mode & FMODE_READ) {
+		iter = __tracing_open(inode, file, true);
+		if (IS_ERR(iter))
+			ret = PTR_ERR(iter);
+	} else {
+		/* Writes still need the seq_file to hold the private data */
+		ret = -ENOMEM;
+		m = kzalloc_obj(*m);
+		if (!m)
+			goto out;
+		iter = kzalloc_obj(*iter);
+		if (!iter) {
+			kfree(m);
+			goto out;
+		}
+		ret = 0;
+
+		iter->tr = tr;
+		iter->array_buffer = &tr->snapshot_buffer;
+		iter->cpu_file = tracing_get_cpu(inode);
+		m->private = iter;
+		file->private_data = m;
+	}
+out:
+	if (ret < 0)
+		trace_array_put(tr);
+
+	return ret;
+}
+
+static void tracing_swap_cpu_buffer(void *tr)
+{
+	update_max_tr_single((struct trace_array *)tr, current, smp_processor_id());
+}
+
+static ssize_t
+tracing_snapshot_write(struct file *filp, const char __user *ubuf, size_t cnt,
+		       loff_t *ppos)
+{
+	struct seq_file *m = filp->private_data;
+	struct trace_iterator *iter = m->private;
+	struct trace_array *tr = iter->tr;
+	unsigned long val;
+	int ret;
+
+	ret = tracing_update_buffers(tr);
+	if (ret < 0)
+		return ret;
+
+	ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
+	if (ret)
+		return ret;
+
+	guard(mutex)(&trace_types_lock);
+
+	if (tracer_uses_snapshot(tr->current_trace))
+		return -EBUSY;
+
+	local_irq_disable();
+	arch_spin_lock(&tr->max_lock);
+	if (tr->cond_snapshot)
+		ret = -EBUSY;
+	arch_spin_unlock(&tr->max_lock);
+	local_irq_enable();
+	if (ret)
+		return ret;
+
+	switch (val) {
+	case 0:
+		if (iter->cpu_file != RING_BUFFER_ALL_CPUS)
+			return -EINVAL;
+		if (tr->allocated_snapshot)
+			free_snapshot(tr);
+		break;
+	case 1:
+/* Only allow per-cpu swap if the ring buffer supports it */
+#ifndef CONFIG_RING_BUFFER_ALLOW_SWAP
+		if (iter->cpu_file != RING_BUFFER_ALL_CPUS)
+			return -EINVAL;
+#endif
+		if (tr->allocated_snapshot)
+			ret = resize_buffer_duplicate_size(&tr->snapshot_buffer,
+					&tr->array_buffer, iter->cpu_file);
+
+		ret = tracing_arm_snapshot_locked(tr);
+		if (ret)
+			return ret;
+
+		/* Now, we're going to swap */
+		if (iter->cpu_file == RING_BUFFER_ALL_CPUS) {
+			local_irq_disable();
+			update_max_tr(tr, current, smp_processor_id(), NULL);
+			local_irq_enable();
+		} else {
+			smp_call_function_single(iter->cpu_file, tracing_swap_cpu_buffer,
+						 (void *)tr, 1);
+		}
+		tracing_disarm_snapshot(tr);
+		break;
+	default:
+		if (tr->allocated_snapshot) {
+			if (iter->cpu_file == RING_BUFFER_ALL_CPUS)
+				tracing_reset_online_cpus(&tr->snapshot_buffer);
+			else
+				tracing_reset_cpu(&tr->snapshot_buffer, iter->cpu_file);
+		}
+		break;
+	}
+
+	if (ret >= 0) {
+		*ppos += cnt;
+		ret = cnt;
+	}
+
+	return ret;
+}
+
+static int tracing_snapshot_release(struct inode *inode, struct file *file)
+{
+	struct seq_file *m = file->private_data;
+	int ret;
+
+	ret = tracing_release(inode, file);
+
+	if (file->f_mode & FMODE_READ)
+		return ret;
+
+	/* If write only, the seq_file is just a stub */
+	if (m)
+		kfree(m->private);
+	kfree(m);
+
+	return 0;
+}
+
+static int snapshot_raw_open(struct inode *inode, struct file *filp)
+{
+	struct ftrace_buffer_info *info;
+	int ret;
+
+	/* The following checks for tracefs lockdown */
+	ret = tracing_buffers_open(inode, filp);
+	if (ret < 0)
+		return ret;
+
+	info = filp->private_data;
+
+	if (tracer_uses_snapshot(info->iter.trace)) {
+		tracing_buffers_release(inode, filp);
+		return -EBUSY;
+	}
+
+	info->iter.snapshot = true;
+	info->iter.array_buffer = &info->iter.tr->snapshot_buffer;
+
+	return ret;
+}
+
+const struct file_operations snapshot_fops = {
+	.open		= tracing_snapshot_open,
+	.read		= seq_read,
+	.write		= tracing_snapshot_write,
+	.llseek		= tracing_lseek,
+	.release	= tracing_snapshot_release,
+};
+
+const struct file_operations snapshot_raw_fops = {
+	.open		= snapshot_raw_open,
+	.read		= tracing_buffers_read,
+	.release	= tracing_buffers_release,
+	.splice_read	= tracing_buffers_splice_read,
+};
+
+#ifdef CONFIG_TRACER_MAX_TRACE
+static ssize_t
+tracing_max_lat_read(struct file *filp, char __user *ubuf,
+		     size_t cnt, loff_t *ppos)
+{
+	struct trace_array *tr = filp->private_data;
+
+	return tracing_nsecs_read(&tr->max_latency, ubuf, cnt, ppos);
+}
+
+static ssize_t
+tracing_max_lat_write(struct file *filp, const char __user *ubuf,
+		      size_t cnt, loff_t *ppos)
+{
+	struct trace_array *tr = filp->private_data;
+
+	return tracing_nsecs_write(&tr->max_latency, ubuf, cnt, ppos);
+}
+
+static const struct file_operations tracing_max_lat_fops = {
+	.open		= tracing_open_generic_tr,
+	.read		= tracing_max_lat_read,
+	.write		= tracing_max_lat_write,
+	.llseek		= generic_file_llseek,
+	.release	= tracing_release_generic_tr,
+};
+#endif /* CONFIG_TRACER_MAX_TRACE */
+
+int get_snapshot_map(struct trace_array *tr)
+{
+	int err = 0;
+
+	/*
+	 * Called with mmap_lock held. lockdep would be unhappy if we would now
+	 * take trace_types_lock. Instead use the specific
+	 * snapshot_trigger_lock.
+	 */
+	spin_lock(&tr->snapshot_trigger_lock);
+
+	if (tr->snapshot || tr->mapped == UINT_MAX)
+		err = -EBUSY;
+	else
+		tr->mapped++;
+
+	spin_unlock(&tr->snapshot_trigger_lock);
+
+	/* Wait for update_max_tr() to observe tr->mapped */
+	if (tr->mapped == 1)
+		synchronize_rcu();
+
+	return err;
+
+}
+void put_snapshot_map(struct trace_array *tr)
+{
+	spin_lock(&tr->snapshot_trigger_lock);
+	if (!WARN_ON(!tr->mapped))
+		tr->mapped--;
+	spin_unlock(&tr->snapshot_trigger_lock);
+}
+
+#ifdef CONFIG_DYNAMIC_FTRACE
+static void
+ftrace_snapshot(unsigned long ip, unsigned long parent_ip,
+		struct trace_array *tr, struct ftrace_probe_ops *ops,
+		void *data)
+{
+	tracing_snapshot_instance(tr);
+}
+
+static void
+ftrace_count_snapshot(unsigned long ip, unsigned long parent_ip,
+		      struct trace_array *tr, struct ftrace_probe_ops *ops,
+		      void *data)
+{
+	struct ftrace_func_mapper *mapper = data;
+	long *count = NULL;
+
+	if (mapper)
+		count = (long *)ftrace_func_mapper_find_ip(mapper, ip);
+
+	if (count) {
+
+		if (*count <= 0)
+			return;
+
+		(*count)--;
+	}
+
+	tracing_snapshot_instance(tr);
+}
+
+static int
+ftrace_snapshot_print(struct seq_file *m, unsigned long ip,
+		      struct ftrace_probe_ops *ops, void *data)
+{
+	struct ftrace_func_mapper *mapper = data;
+	long *count = NULL;
+
+	seq_printf(m, "%ps:", (void *)ip);
+
+	seq_puts(m, "snapshot");
+
+	if (mapper)
+		count = (long *)ftrace_func_mapper_find_ip(mapper, ip);
+
+	if (count)
+		seq_printf(m, ":count=%ld\n", *count);
+	else
+		seq_puts(m, ":unlimited\n");
+
+	return 0;
+}
+
+static int
+ftrace_snapshot_init(struct ftrace_probe_ops *ops, struct trace_array *tr,
+		     unsigned long ip, void *init_data, void **data)
+{
+	struct ftrace_func_mapper *mapper = *data;
+
+	if (!mapper) {
+		mapper = allocate_ftrace_func_mapper();
+		if (!mapper)
+			return -ENOMEM;
+		*data = mapper;
+	}
+
+	return ftrace_func_mapper_add_ip(mapper, ip, init_data);
+}
+
+static void
+ftrace_snapshot_free(struct ftrace_probe_ops *ops, struct trace_array *tr,
+		     unsigned long ip, void *data)
+{
+	struct ftrace_func_mapper *mapper = data;
+
+	if (!ip) {
+		if (!mapper)
+			return;
+		free_ftrace_func_mapper(mapper, NULL);
+		return;
+	}
+
+	ftrace_func_mapper_remove_ip(mapper, ip);
+}
+
+static struct ftrace_probe_ops snapshot_probe_ops = {
+	.func			= ftrace_snapshot,
+	.print			= ftrace_snapshot_print,
+};
+
+static struct ftrace_probe_ops snapshot_count_probe_ops = {
+	.func			= ftrace_count_snapshot,
+	.print			= ftrace_snapshot_print,
+	.init			= ftrace_snapshot_init,
+	.free			= ftrace_snapshot_free,
+};
+
+static int
+ftrace_trace_snapshot_callback(struct trace_array *tr, struct ftrace_hash *hash,
+			       char *glob, char *cmd, char *param, int enable)
+{
+	struct ftrace_probe_ops *ops;
+	void *count = (void *)-1;
+	char *number;
+	int ret;
+
+	if (!tr)
+		return -ENODEV;
+
+	/* hash funcs only work with set_ftrace_filter */
+	if (!enable)
+		return -EINVAL;
+
+	ops = param ? &snapshot_count_probe_ops :  &snapshot_probe_ops;
+
+	if (glob[0] == '!') {
+		ret = unregister_ftrace_function_probe_func(glob+1, tr, ops);
+		if (!ret)
+			tracing_disarm_snapshot(tr);
+
+		return ret;
+	}
+
+	if (!param)
+		goto out_reg;
+
+	number = strsep(&param, ":");
+
+	if (!strlen(number))
+		goto out_reg;
+
+	/*
+	 * We use the callback data field (which is a pointer)
+	 * as our counter.
+	 */
+	ret = kstrtoul(number, 0, (unsigned long *)&count);
+	if (ret)
+		return ret;
+
+ out_reg:
+	ret = tracing_arm_snapshot(tr);
+	if (ret < 0)
+		return ret;
+
+	ret = register_ftrace_function_probe(glob, tr, ops, count);
+	if (ret < 0)
+		tracing_disarm_snapshot(tr);
+
+	return ret < 0 ? ret : 0;
+}
+
+static struct ftrace_func_command ftrace_snapshot_cmd = {
+	.name			= "snapshot",
+	.func			= ftrace_trace_snapshot_callback,
+};
+
+__init int register_snapshot_cmd(void)
+{
+	return register_ftrace_command(&ftrace_snapshot_cmd);
+}
+#endif /* CONFIG_DYNAMIC_FTRACE */
+
+int trace_allocate_snapshot(struct trace_array *tr, int size)
+{
+	int ret;
+
+	/* Fixed mapped buffer trace arrays do not have snapshot buffers */
+	if (tr->range_addr_start)
+		return 0;
+
+	/* allocate_snapshot can only be true during system boot */
+	ret = allocate_trace_buffer(tr, &tr->snapshot_buffer,
+				    allocate_snapshot ? size : 1);
+	if (ret < 0)
+		return -ENOMEM;
+
+	tr->allocated_snapshot = allocate_snapshot;
+
+	allocate_snapshot = false;
+	return 0;
+}
+
+__init static bool tr_needs_alloc_snapshot(const char *name)
+{
+	char *test;
+	int len = strlen(name);
+	bool ret;
+
+	if (!boot_snapshot_index)
+		return false;
+
+	if (strncmp(name, boot_snapshot_info, len) == 0 &&
+	    boot_snapshot_info[len] == '\t')
+		return true;
+
+	test = kmalloc(strlen(name) + 3, GFP_KERNEL);
+	if (!test)
+		return false;
+
+	sprintf(test, "\t%s\t", name);
+	ret = strstr(boot_snapshot_info, test) == NULL;
+	kfree(test);
+	return ret;
+}
+
+__init void do_allocate_snapshot(const char *name)
+{
+	if (!tr_needs_alloc_snapshot(name))
+		return;
+
+	/*
+	 * When allocate_snapshot is set, the next call to
+	 * allocate_trace_buffers() (called by trace_array_get_by_name())
+	 * will allocate the snapshot buffer. That will also clear
+	 * this flag.
+	 */
+	allocate_snapshot = true;
+}
+
+void __init ftrace_boot_snapshot(void)
+{
+	struct trace_array *tr;
+
+	if (!snapshot_at_boot)
+		return;
+
+	list_for_each_entry(tr, &ftrace_trace_arrays, list) {
+		if (!tr->allocated_snapshot)
+			continue;
+
+		tracing_snapshot_instance(tr);
+		trace_array_puts(tr, "** Boot snapshot taken **\n");
+	}
+}
+
-- 
2.51.0



* Re: [RFC PATCH 1/2] locking: add mutex_lock_nospin()
From: Yafang Shao @ 2026-03-06  2:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Waiman Long, David Laight, Peter Zijlstra, mingo, will, boqun,
	mhiramat, mark.rutland, mathieu.desnoyers, linux-kernel,
	linux-trace-kernel, bpf
In-Reply-To: <20260305082125.35d8539c@gandalf.local.home>

On Thu, Mar 5, 2026 at 9:20 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 5 Mar 2026 13:40:27 +0800
> Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > Exactly. ftrace is intended for debugging and should not significantly
> > impact real workloads. Therefore, it's reasonable to make it sleep if
> > it cannot acquire the lock immediately, rather than spinning and
> > consuming CPU cycles.
>
> Actually, ftrace is more than just debugging. It is the infrastructure for
> live kernel patching as well.

Good to know.

>
> >
> > >
> > > BTW, you should expand the commit log of patch 1 to include the
> > > rationale of why we should add this feature to mutex as the information
> > > in the cover letter won't get included in the git log if this patch
> > > series is merged. You should also elaborate in a comment on the
> > > conditions under which this new mutex API should be used.
> >
> > Sure.  I will update it.
> >
> > BTW, these issues are notably hard to find. I suspect there are other
> > locks out there with the same problem.
>
> As I mentioned, I'm not against the change. I just want to make sure the
> rationale is strong enough to make the change.
>
> One thing that should be modified with your patch is the name. "nospin"
> references the implementation of the mutex. Instead it should be called
> something like: "noncritical" or "slowpath" stating that the grabbing of
> this mutex is not of a critical section.
>
> Maybe an entirely new interface should be defined:
>
>
> struct slow_mutex;

Is it necessary to define a new structure for this slow mutex? We
could simply reuse the existing struct mutex instead. Alternatively,
should we add some new flags to this slow_mutex for debugging
purposes?

>
> slow_mutex_lock()
> slow_mutex_unlock()

These two APIs appear sufficient to handle this use case.

>
> etc,
>
> that makes it obvious that this mutex may be held for long periods of time.
> In fact, this would be useful for RT workloads, as these mutexes could be
> flagged to warn RT critical tasks if those tasks were to take one of them.
>
> There has been some talk to mark paths in the kernel that RT tasks would
> get a SIGKILL if they were to hit a path that is known to be non
> deterministic.

Thanks for the information.

-- 
Regards
Yafang


* Re: [RFC PATCH 1/2] locking: add mutex_lock_nospin()
From: Yafang Shao @ 2026-03-06  2:27 UTC (permalink / raw)
  To: Waiman Long
  Cc: Steven Rostedt, David Laight, Peter Zijlstra, mingo, will, boqun,
	mhiramat, mark.rutland, mathieu.desnoyers, linux-kernel,
	linux-trace-kernel, bpf
In-Reply-To: <9e4e356e-25b0-4ac5-8d25-ad093b241b94@redhat.com>

On Fri, Mar 6, 2026 at 2:45 AM Waiman Long <longman@redhat.com> wrote:
>
> On 3/5/26 1:34 PM, Waiman Long wrote:
> > On 3/5/26 12:40 AM, Yafang Shao wrote:
> >> On Thu, Mar 5, 2026 at 12:30 PM Waiman Long <longman@redhat.com> wrote:
> >>> On 3/4/26 10:08 PM, Yafang Shao wrote:
> >>>> On Thu, Mar 5, 2026 at 11:00 AM Steven Rostedt
> >>>> <rostedt@goodmis.org> wrote:
> >>>>> On Thu, 5 Mar 2026 10:33:00 +0800
> >>>>> Yafang Shao <laoar.shao@gmail.com> wrote:
> >>>>>
> >>>>>> Other tools may also read available_filter_functions, requiring each
> >>>>>> one to be patched individually to avoid this flaw—a clearly
> >>>>>> impractical solution.
> >>>>> What exactly is the issue?
> >>>> It makes no sense to spin unnecessarily when it can be avoided. We
> >>>> continuously improve the kernel to do the right thing—and unnecessary
> >>>> spinning is certainly not the right thing.
> >>>>
> >>>>> If a task does a while 1 in user space, it
> >>>>> wouldn't be much different.
> >>>> The while loop in user space performs actual work, whereas useless
> >>>> spinning does nothing but burn CPU cycles. My point is simple: if this
> >>>> unnecessary spinning isn't already considered an issue, it should
> >>>> be—it's something that clearly needs improvement.
> >>> The whole point of optimistic spinning is to reduce the lock
> >>> acquisition
> >>> latency. If the waiter sleeps, the unlock operation will have to
> >>> wake up
> >>> the waiter which can have a variable latency depending on how busy the
> >>> system is at the time. Yes, it is burning CPU cycles while spinning,
> >>> but most workloads will gain performance with this optimistic spinning
> >>> feature. You do have a point that system monitoring tools that
> >>> observe the system behavior shouldn't burn so much CPU time that
> >>> they affect the performance of the real workload being monitored.
> >> Exactly. ftrace is intended for debugging and should not significantly
> >> impact real workloads. Therefore, it's reasonable to make it sleep if
> >> it cannot acquire the lock immediately, rather than spinning and
> >> consuming CPU cycles.
> >
> > Your patch series uses wording that gives optimistic spinning a
> > negative connotation, making it look like a bad thing.

Perhaps I didn't phrase that well. I do understand that optimistic
spinning is valuable in use cases where we wouldn't want to disable
CONFIG_MUTEX_SPIN_ON_OWNER.

> In fact, it is
> > just a request for a new mutex API for use cases where they can suffer
> > higher latency in order to minimize the system overhead they incur. So
> > don't bad-mouth optimistic spinning and emphasize the use cases you
> > want to support with the new API in your next version.
>
> BTW, for any new mutex API introduced, you should also provide an
> equivalent version in kernel/locking/rtmutex_api.c for PREEMPT_RT kernel.

Thanks for the suggestion.

-- 
Regards
Yafang


* Re: [RFC PATCH 1/2] locking: add mutex_lock_nospin()
From: Yafang Shao @ 2026-03-06  2:33 UTC (permalink / raw)
  To: Waiman Long
  Cc: David Laight, Steven Rostedt, Peter Zijlstra, mingo, will, boqun,
	mhiramat, mark.rutland, mathieu.desnoyers, linux-kernel,
	linux-trace-kernel, bpf
In-Reply-To: <d161ea96-8023-4a5e-a4eb-3e17780c8308@redhat.com>

On Fri, Mar 6, 2026 at 3:00 AM Waiman Long <longman@redhat.com> wrote:
>
> On 3/5/26 4:32 AM, David Laight wrote:
> > On Wed, 4 Mar 2026 23:30:40 -0500
> > Waiman Long <longman@redhat.com> wrote:
> >
> >> On 3/4/26 10:08 PM, Yafang Shao wrote:
> >>> On Thu, Mar 5, 2026 at 11:00 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> >>>> On Thu, 5 Mar 2026 10:33:00 +0800
> >>>> Yafang Shao <laoar.shao@gmail.com> wrote:
> >>>>
> >>>>> Other tools may also read available_filter_functions, requiring each
> >>>>> one to be patched individually to avoid this flaw—a clearly
> >>>>> impractical solution.
> >>>> What exactly is the issue?
> >>> It makes no sense to spin unnecessarily when it can be avoided. We
> >>> continuously improve the kernel to do the right thing—and unnecessary
> >>> spinning is certainly not the right thing.
> >>>
> >>>> If a task does a while 1 in user space, it
> >>>> wouldn't be much different.
> >>> The while loop in user space performs actual work, whereas useless
> >>> spinning does nothing but burn CPU cycles. My point is simple: if this
> >>> unnecessary spinning isn't already considered an issue, it should
> >>> be—it's something that clearly needs improvement.
> >> The whole point of optimistic spinning is to reduce the lock acquisition
> >> latency. If the waiter sleeps, the unlock operation will have to wake up
> >> the waiter which can have a variable latency depending on how busy the
> >> system is at the time. Yes, it is burning CPU cycles while spinning,
> >> but most workloads will gain performance with this optimistic spinning
> >> feature. You do have a point that system monitoring tools that
> >> observe the system behavior shouldn't burn so much CPU time that
> >> they affect the performance of the real workload being monitored.
> >>
> >> BTW, you should expand the commit log of patch 1 to include the
> >> rationale of why we should add this feature to mutex as the information
> >> in the cover letter won't get included in the git log if this patch
> >> series is merged. You should also elaborate in a comment on the
> >> conditions under which this new mutex API should be used.
> > Isn't changing mutex_lock() the wrong place anyway?
> > What you need is for the code holding the lock to indicate that
> > it isn't worth waiters spinning because the lock will be held
> > for a long time.
>
> I have actually thought about having a flag somewhere in the mutex
> itself to indicate that optimistic spinning isn't needed. However the
> owner field is running out of usable flag bits.

True. Introducing a new MUTEX_FLAGS would likely require a substantial
refactor of the mutex code, which may not be worth it.

> The other option is to
> add it to osq as it doesn't really need to use the full 32 bits for the
> tail. In this case, we can just initialize the mutex to say that we
> don't need optimistic spinning and no new mutex_lock() API will be needed.

I believe a new mutex_lock() variant would be clearer and easier to
understand. It also has the advantage of requiring minimal changes to
the existing mutex code.
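
The trade-off debated in this thread can be sketched in userspace C. This is
only a toy model, not the kernel mutex: toy_mutex, toy_lock_spin() and
toy_lock_nospin() are hypothetical names, sched_yield() stands in for sleeping
on a wait queue, and TOY_SPIN_LIMIT is an arbitrary assumption. The spinning
variant busy-waits for a bounded number of failed attempts before it starts
giving up the CPU, while the nospin variant gives up the CPU as soon as the
first attempt fails.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_SPIN_LIMIT	1000UL	/* arbitrary bound, assumption for illustration */
#define TOY_MUTEX_INIT	{ ATOMIC_FLAG_INIT }

/* Hypothetical toy lock; the real struct mutex is far more involved. */
struct toy_mutex {
	atomic_flag held;	/* set while some caller owns the lock */
};

/* Try to take the lock once; returns true on success. */
bool toy_trylock(struct toy_mutex *m)
{
	return !atomic_flag_test_and_set_explicit(&m->held,
						  memory_order_acquire);
}

void toy_unlock(struct toy_mutex *m)
{
	atomic_flag_clear_explicit(&m->held, memory_order_release);
}

/*
 * "Optimistic" variant: burn CPU for up to TOY_SPIN_LIMIT failed attempts,
 * hoping the owner releases the lock soon, and only then start yielding.
 * Returns the number of times the CPU was given up.
 */
unsigned long toy_lock_spin(struct toy_mutex *m)
{
	unsigned long spins = 0, yields = 0;

	while (!toy_trylock(m)) {
		if (spins < TOY_SPIN_LIMIT) {
			spins++;
		} else {
			sched_yield();
			yields++;
		}
	}
	return yields;
}

/*
 * "nospin" variant: never busy-wait; give up the CPU after every failed
 * attempt. Returns the number of times the CPU was given up.
 */
unsigned long toy_lock_nospin(struct toy_mutex *m)
{
	unsigned long yields = 0;

	while (!toy_trylock(m)) {
		sched_yield();
		yields++;
	}
	return yields;
}
```

With a long-held lock, a waiter in toy_lock_spin() first burns TOY_SPIN_LIMIT
iterations of CPU time, whereas a waiter in toy_lock_nospin() yields right
away. With a briefly held lock the spinning variant may acquire it without
ever leaving the CPU, which is the latency benefit Waiman describes.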

-- 
Regards
Yafang


* Re: [PATCH v3 02/12] audit: widen ino fields to u64
From: Paul Moore @ 2026-03-06  3:09 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Dan Williams, Eric Biggers,
	Theodore Y. Ts'o, Muchun Song, Oscar Salvador,
	David Hildenbrand, David Howells, Paulo Alcantara, Andreas Dilger,
	Jan Kara, Jaegeuk Kim, Chao Yu, Trond Myklebust, Anna Schumaker,
	Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Steve French, Ronnie Sahlberg, Shyam Prasad N, Bharath SM,
	Alexander Aring, Ryusuke Konishi, Viacheslav Dubeyko,
	Eric Van Hensbergen, Latchesar Ionkov, Dominique Martinet,
	Christian Schoenebeck, David Sterba, Marc Dionne, Ian Kent,
	Luis de Bethencourt, Salah Triki, Tigran A. Aivazian,
	Ilya Dryomov, Alex Markuze, Jan Harkes, coda, Nicolas Pitre,
	Tyler Hicks, Amir Goldstein, Christoph Hellwig,
	John Paul Adrian Glaubitz, Yangtao Li, Mikulas Patocka,
	David Woodhouse, Richard Weinberger, Dave Kleikamp,
	Konstantin Komarov, Mark Fasheh, Joel Becker, Joseph Qi,
	Mike Marshall, Martin Brandenburg, Miklos Szeredi, Anders Larsen,
	Zhihao Cheng, Damien Le Moal, Naohiro Aota, Johannes Thumshirn,
	John Johansen, James Morris, Serge E. Hallyn, Mimi Zohar,
	Roberto Sassu, Dmitry Kasatkin, Eric Snowberg, Fan Wu,
	Stephen Smalley, Ondrej Mosnacek, Casey Schaufler, Alex Deucher,
	Christian König, David Airlie, Simona Vetter, Sumit Semwal,
	Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S. Miller, Jakub Kicinski, Simon Horman, Oleg Nesterov,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, James Clark, Darrick J. Wong,
	Martin Schiller, Eric Paris, Joerg Reuter, Marcel Holtmann,
	Johan Hedberg, Luiz Augusto von Dentz, Oliver Hartkopp,
	Marc Kleine-Budde, David Ahern, Neal Cardwell, Steffen Klassert,
	Herbert Xu, Remi Denis-Courmont, Marcelo Ricardo Leitner,
	Xin Long, Magnus Karlsson, Maciej Fijalkowski, Stanislav Fomichev,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, linux-fsdevel, linux-kernel, linux-trace-kernel,
	nvdimm, fsverity, linux-mm, netfs, linux-ext4, linux-f2fs-devel,
	linux-nfs, linux-cifs, samba-technical, linux-nilfs, v9fs,
	linux-afs, autofs, ceph-devel, codalist, ecryptfs, linux-mtd,
	jfs-discussion, ntfs3, ocfs2-devel, devel, linux-unionfs,
	apparmor, linux-security-module, linux-integrity, selinux,
	amd-gfx, dri-devel, linux-media, linaro-mm-sig, netdev,
	linux-perf-users, linux-fscrypt, linux-xfs, linux-hams, linux-x25,
	audit, linux-bluetooth, linux-can, linux-sctp, bpf
In-Reply-To: <20260304-iino-u64-v3-2-2257ad83d372@kernel.org>

On Wed, Mar 4, 2026 at 10:33 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> inode->i_ino is being widened from unsigned long to u64. The audit
> subsystem uses unsigned long ino in struct fields, function parameters,
> and local variables that store inode numbers from arbitrary filesystems.
> On 32-bit platforms this truncates inode numbers that exceed 32 bits,
> which will cause incorrect audit log entries and broken watch/mark
> comparisons.
>
> Widen all audit ino fields, parameters, and locals to u64, and update
> the inode format string from %lu to %llu to match.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  include/linux/audit.h   |  2 +-
>  kernel/audit.h          | 13 ++++++-------
>  kernel/audit_fsnotify.c |  4 ++--
>  kernel/audit_watch.c    | 12 ++++++------
>  kernel/auditsc.c        |  4 ++--
>  5 files changed, 17 insertions(+), 18 deletions(-)

Acked-by: Paul Moore <paul@paul-moore.com>
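
A minimal sketch of the truncation the quoted commit message describes,
assuming a 32-bit platform where unsigned long is 32 bits wide. The helper
names are hypothetical and uint32_t only models the old field type; none of
this is taken from the audit code itself.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Model storing a 64-bit inode number in a 32-bit "unsigned long" field. */
uint32_t toy_ino_as_ulong32(uint64_t ino)
{
	return (uint32_t)ino;	/* upper 32 bits are silently dropped */
}

/* Returns 1 when two distinct inode numbers compare equal after truncation. */
int toy_inos_collide_32(uint64_t a, uint64_t b)
{
	return a != b && toy_ino_as_ulong32(a) == toy_ino_as_ulong32(b);
}

/* Format a full-width inode number; PRIu64 plays the role of %llu here. */
int toy_format_ino(char *buf, size_t len, uint64_t ino)
{
	return snprintf(buf, len, "ino=%" PRIu64, ino);
}
```

Two inodes numbered 0x100000001 and 0x1 are different files, yet
toy_inos_collide_32() reports them as equal, which is the kind of broken
watch/mark comparison the patch avoids by widening the fields to u64.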

-- 
paul-moore.com


* [PATCH V2] tracing: Revert "tracing: Remove pid in task_rename tracing output"
From: Xuewen Yan @ 2026-03-06  7:59 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, elver, kees
  Cc: lorenzo.stoakes, brauner, schuster.simon, david, linux-kernel,
	linux-trace-kernel, guohua.yan, ke.wang, xuewen.yan94, jing.xia

This reverts commit e3f6a42272e028c46695acc83fc7d7c42f2750ad.

The commit says that the tracepoint only deals with the current task,
however, in the following case the renamed task is not the current task:

comm_write() {
    p = get_proc_task(inode);
    if (!p)
        return -ESRCH;

    if (same_thread_group(current, p))
        set_task_comm(p, buffer);
}
where set_task_comm() calls __set_task_comm() which records
the update of p and not current.

So revert the patch to show pid.

Fixes: e3f6a42272e0 ("tracing: Remove pid in task_rename tracing output")
Reported-by: Guohua Yan <guohua.yan@unisoc.com>
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
---
v2:
- update commit message (Steven)
---
 include/trace/events/task.h | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/task.h b/include/trace/events/task.h
index 4f0759634306..b9a129eb54d9 100644
--- a/include/trace/events/task.h
+++ b/include/trace/events/task.h
@@ -38,19 +38,22 @@ TRACE_EVENT(task_rename,
 	TP_ARGS(task, comm),
 
 	TP_STRUCT__entry(
+		__field(	pid_t,	pid)
 		__array(	char, oldcomm,  TASK_COMM_LEN)
 		__array(	char, newcomm,  TASK_COMM_LEN)
 		__field(	short,	oom_score_adj)
 	),
 
 	TP_fast_assign(
+		__entry->pid = task->pid;
 		memcpy(entry->oldcomm, task->comm, TASK_COMM_LEN);
 		strscpy(entry->newcomm, comm, TASK_COMM_LEN);
 		__entry->oom_score_adj = task->signal->oom_score_adj;
 	),
 
-	TP_printk("oldcomm=%s newcomm=%s oom_score_adj=%hd",
-		  __entry->oldcomm, __entry->newcomm, __entry->oom_score_adj)
+	TP_printk("pid=%d oldcomm=%s newcomm=%s oom_score_adj=%hd",
+		__entry->pid, __entry->oldcomm,
+		__entry->newcomm, __entry->oom_score_adj)
 );
 
 /**
-- 
2.25.1
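
Why the pid field matters can be sketched with a toy model in userspace C.
Everything here is hypothetical (toy_task, toy_rename(), the pid values); it
only mimics the two output formats from the diff above and is not kernel code.
The point is that when one thread renames a sibling thread, the task doing
the write and the task being renamed differ, so only the pid field identifies
the renamed task.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_COMM_LEN 16

/* Hypothetical stand-in for the few task_struct fields the event uses. */
struct toy_task {
	int pid;
	char comm[TOY_COMM_LEN];
};

/*
 * Rename 'task' and write a task_rename-style line into 'buf'.
 * show_pid != 0 selects the format with the pid field, show_pid == 0
 * the format without it. Returns what snprintf() returned for the line.
 */
int toy_rename(char *buf, size_t len, struct toy_task *task,
	       const char *newcomm, int show_pid)
{
	int n;

	if (show_pid)
		n = snprintf(buf, len, "pid=%d oldcomm=%s newcomm=%s",
			     task->pid, task->comm, newcomm);
	else
		n = snprintf(buf, len, "oldcomm=%s newcomm=%s",
			     task->comm, newcomm);

	/* Apply the rename only after the old name has been recorded. */
	snprintf(task->comm, sizeof(task->comm), "%s", newcomm);
	return n;
}
```

If a thread with pid 100 renames a sibling with pid 101, the line without the
pid field says nothing about which of the two was renamed, while the line with
it names pid 101 explicitly.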



* Re: [PATCH v4 4/5] mm: rename zone->lock to zone->_lock
From: Vlastimil Babka (SUSE) @ 2026-03-06  8:05 UTC (permalink / raw)
  To: Dmitry Ilvokhin
  Cc: SeongJae Park, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Rafael J. Wysocki, Pavel Machek, Len Brown, Brendan Jackman,
	Johannes Weiner, Zi Yan, Oscar Salvador, Qi Zheng, Shakeel Butt,
	linux-kernel, linux-mm, linux-trace-kernel, linux-pm
In-Reply-To: <aanSnywUXTVPaYUj@shell.ilvokhin.com>

On 3/5/26 19:59, Dmitry Ilvokhin wrote:
> On Thu, Mar 05, 2026 at 06:16:26PM +0000, Dmitry Ilvokhin wrote:
>> On Thu, Mar 05, 2026 at 10:27:07AM +0100, Vlastimil Babka (SUSE) wrote:
>> > On 3/4/26 16:13, SeongJae Park wrote:
>> > > On Wed, 4 Mar 2026 13:01:45 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
>> > > 
>> > >> On Tue, Mar 03, 2026 at 05:50:34PM -0800, SeongJae Park wrote:
>> > >> > On Tue, 3 Mar 2026 14:25:55 +0000 Dmitry Ilvokhin <d@ilvokhin.com> wrote:
>> > >> > 
>> > >> > > On Mon, Mar 02, 2026 at 02:37:43PM -0800, Andrew Morton wrote:
>> > >> > > > On Mon, 2 Mar 2026 15:10:03 +0100 "Vlastimil Babka (SUSE)" <vbabka@kernel.org> wrote:
>> > >> > > > 
>> > >> > > > > On 2/27/26 17:00, Dmitry Ilvokhin wrote:
>> > >> > > > > > This intentionally breaks direct users of zone->lock at compile time so
>> > >> > > > > > all call sites are converted to the zone lock wrappers. Without the
>> > >> > > > > > rename, present and future out-of-tree code could continue using
>> > >> > > > > > spin_lock(&zone->lock) and bypass the wrappers and tracing
>> > >> > > > > > infrastructure.
>> > >> > > > > > 
>> > >> > > > > > No functional change intended.
>> > >> > > > > > 
>> > >> > > > > > Suggested-by: Andrew Morton <akpm@linux-foundation.org>
>> > >> > > > > > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
>> > >> > > > > > Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
>> > >> > > > > > Acked-by: SeongJae Park <sj@kernel.org>
>> > >> > > > > 
>> > >> > > > > I see some more instances of 'zone->lock' in comments in
>> > >> > > > > include/linux/mmzone.h and under Documentation/ but otherwise LGTM.
>> > >> > > > > 
>> > >> > > > 
>> > >> > > > I fixed (most of) that in the previous version but my fix was lost.
>> > >> > > 
>> > >> > > Thanks for the fixups, Andrew.
>> > >> > > 
>> > >> > > I still see a few 'zone->lock' references in Documentation remain on
>> > >> > > mm-new. This patch cleans them up, as noted by Vlastimil.
>> > >> > > 
>> > >> > > I'm happy to adjust this patch if anything else needs attention.
>> > >> > > 
>> > >> > > From 9142d5a8b60038fa424a6033253960682e5a51f4 Mon Sep 17 00:00:00 2001
>> > >> > > From: Dmitry Ilvokhin <d@ilvokhin.com>
>> > >> > > Date: Tue, 3 Mar 2026 06:13:13 -0800
>> > >> > > Subject: [PATCH] mm: fix remaining zone->lock references
>> > >> > > 
>> > >> > > Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
>> > >> > > ---
>> > >> > >  Documentation/mm/physical_memory.rst | 4 ++--
>> > >> > >  Documentation/trace/events-kmem.rst  | 8 ++++----
>> > >> > >  2 files changed, 6 insertions(+), 6 deletions(-)
>> > >> > > 
>> > >> > > diff --git a/Documentation/mm/physical_memory.rst b/Documentation/mm/physical_memory.rst
>> > >> > > index b76183545e5b..e344f93515b6 100644
>> > >> > > --- a/Documentation/mm/physical_memory.rst
>> > >> > > +++ b/Documentation/mm/physical_memory.rst
>> > >> > > @@ -500,11 +500,11 @@ General
>> > >> > >  ``nr_isolate_pageblock``
>> > >> > >    Number of isolated pageblocks. It is used to solve incorrect freepage counting
>> > >> > >    problem due to racy retrieving migratetype of pageblock. Protected by
>> > >> > > -  ``zone->lock``. Defined only when ``CONFIG_MEMORY_ISOLATION`` is enabled.
>> > >> > > +  ``zone_lock``. Defined only when ``CONFIG_MEMORY_ISOLATION`` is enabled.
>> > >> > 
>> > >> > Dmitry's original patch [1] was doing 's/zone->lock/zone->_lock/', which aligns
>> > >> > with my expectation.  But this patch is doing 's/zone->lock/zone_lock/'.  Same
>> > >> > for the rest of this patch.
>> > >> > 
>> > >> > I was initially thinking this is just a mistake, but I also found Andrew is
>> > >> > doing the same change [2], so I'm a bit confused.  Is this an intentional change?
>> > >> > 
>> > >> > [1] https://lore.kernel.org/d61500c5784c64e971f4d328c57639303c475f81.1772206930.git.d@ilvokhin.com
>> > >> > [2] https://lore.kernel.org/20260302143743.220eed4feb36d7572fe726cc@linux-foundation.org
>> > >> > 
>> > >> 
>> > >> Good catch, thanks for pointing this out, SJ.
>> > >> 
>> > >> Originally the mechanical rename was indeed zone->lock -> zone->_lock.
>> > >> However, in Documentation I intentionally switched references to
>> > >> zone_lock instead of zone->_lock. The reasoning is that _lock is now an
>> > >> internal implementation detail, and direct access is discouraged. The
>> > >> intended interface is via the zone_lock_*() / zone_unlock_*() wrappers,
>> > >> so referencing zone_lock in documentation felt more appropriate than
>> > >> mentioning the private struct field (zone->_lock).
>> > > 
>> > > Thank you for this nice and kind clarification, Dmitry!  I agree mentioning
>> > > zone_[un]lock_*() helpers instead of the hidden member (zone->_lock) can be
>> > > better.
>> > > 
>> > > But I'm concerned that people like me might not be aware of the intention behind
>> > > 'zone_lock'.  If there is a well-known convention that allows people to know it
>> > > is for 'zone_[un]lock_*()' helpers, making it more clear would be nice, in my
>> > > humble opinion.  If there is such a convention but I'm just missing it, please
>> > > ignore.  If I'm not, for example,
>> > > 
>> > > "protected by ``zone->lock``" could be rewritten as
>> > > "protected by ``zone_[un]lock_*()`` locking helpers" or,
>> > > "protected by zone lock helper functions (``zone_[un]lock_*()``)" ?
>> > > 
>> > >> 
>> > >> That said, I agree this creates inconsistency with the mechanical
>> > >> rename, and I'm happy to adjust either way: either consistently refer
>> > >> to the wrapper API, or keep documentation aligned with zone->_lock.
>> > >> 
>> > >> I slightly prefer referring to the wrapper API, but don't have a strong
>> > >> preference as long as we're consistent.
>> > > 
>> > > I also think both approaches are good.  But for the wrapper approach, I think
>> > > giving readers more context than just ``zone_lock`` would be nice.
>> > 
>> > Grep tells me that we also have comments mentioning simply "zone lock", btw.
>> > And it's also a term used often in informal conversations. Maybe we could
>> > just standardize on that in comments/documentations as it's easier to read.
>> > Discovering that the field is called _lock and that wrappers should be used,
>> > is hopefully not that difficult.
>> 
>> Thanks for the suggestion, Vlastimil. That sounds reasonable to me as
>> well. I'll update the comments and documentation to consistently use
>> "zone lock".
> 
> Following the suggestion from SJ and Vlastimil, I prepared a fixup to
> standardize documentation and comments on the term "zone lock".
> 
> The patch is based on top of the current mm-new.
> 
> Andrew, please let me know if you would prefer a respin of the series
> instead.
> 
> From 267cda3e0e160f97b346009bc48819bfeed92e52 Mon Sep 17 00:00:00 2001
> From: Dmitry Ilvokhin <d@ilvokhin.com>
> Date: Thu, 5 Mar 2026 10:36:17 -0800
> Subject: [PATCH] mm: documentation: standardize on "zone lock" terminology
> 
> During review of the zone lock tracing series it was suggested to
> standardize documentation and comments on the term "zone lock"
> instead of using zone_lock or referring to the internal field
> zone->_lock.
> 
> Update references accordingly.
> 
> Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>

Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>

Thanks!



* Re: [PATCH v5 2/3] ring-buffer: Handle RB_MISSED_* flags on commit field correctly
From: Masami Hiramatsu @ 2026-03-06  8:46 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Mathieu Desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260305130348.3266e4c9@gandalf.local.home>

On Thu, 5 Mar 2026 13:03:48 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 26 Feb 2026 22:38:43 +0900
> "Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:
> 
> > From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > 
> > Since the MSBs of rb_data_page::commit are used for storing
> > RB_MISSED_EVENTS and RB_MISSED_STORED, we need to mask out those bits
> > when it is used for finding the size of data pages.
> > 
> > Fixes: 5f3b6e839f3c ("ring-buffer: Validate boot range memory events")
> > Fixes: 5b7be9c709e1 ("ring-buffer: Add test to validate the time stamp deltas")
> > Cc: stable@vger.kernel.org
> 
> This is unneeded for the current way things work.
> 
> The missed events flags are added when a page is read, so the commits in
> the write buffer should never have those flags set. If they did, the ring
> buffer code itself would break.

Hmm, but commit ca296d32ece3 ("tracing: ring_buffer: Rewind persistent 
ring buffer on reboot") may change it. Maybe we should treat it while
unwinding it?

> 
> But as patch 3 is adding a flag, you should likely merge this and patch 3
> together, as the only way that flag would get set is if the validator set
> it on a previous boot. And then this would be needed for subsequent boots
> that did not reset the buffer.

It is OK to combine these 2 patches. But my question is when the flag
must be checked and when it must be ignored. Since the flags are encoded
in the commit field, if that field is used for limiting or indexing inside
the page, we must mask the flags or check the maximum size to avoid
accessing outside of the subpage.
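
The masking and bounds check in question can be sketched in userspace C as
follows. This is an illustration only: the TOY_ names are hypothetical, the
bit positions (31 and 30) are an assumption about where the missed-event flags
sit in the commit value, and page_data_size is a caller-supplied stand-in for
the usable size of a subpage.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed flag layout in the upper bits of the commit value. */
#define TOY_RB_MISSED_EVENTS	(UINT64_C(1) << 31)
#define TOY_RB_MISSED_STORED	(UINT64_C(1) << 30)
#define TOY_RB_MISSED_MASK	(TOY_RB_MISSED_EVENTS | TOY_RB_MISSED_STORED)

/* Strip the flag bits so only the byte count remains. */
uint64_t toy_commit_size(uint64_t commit)
{
	return commit & ~TOY_RB_MISSED_MASK;
}

/*
 * Return the usable data length, or -1 if the masked size would still
 * reach past the end of the subpage's data area.
 */
int64_t toy_checked_commit_size(uint64_t commit, uint64_t page_data_size)
{
	uint64_t size = toy_commit_size(commit);

	return size <= page_data_size ? (int64_t)size : -1;
}
```

Without the mask, a commit value of 100 with the missed-events bit set would
be read as a size of over two billion bytes, so either the mask or the bounds
check is needed wherever the value is used as a length or index.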

> 
> Hmm, I don't think we even need to do that! Because if it is set, it would
> simply warn again that a page is invalid, and I think we *want* that! As it
> would preserve that pages were invalid and not be cleared with a simple
> reboot.

OK, then I won't mark it, and will just invalidate the subpage.

Thanks,

> 
> -- Steve
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [PATCH v6 08/16] sched: Add task enqueue/dequeue trace points
From: Gabriele Monaco @ 2026-03-06  8:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
	Masami Hiramatsu, Ingo Molnar, K Prateek Nayak,
	linux-trace-kernel
In-Reply-To: <20260225095122.80683-9-gmonaco@redhat.com>

Hi Peter,

On Wed, 2026-02-25 at 10:51 +0100, Gabriele Monaco wrote:
> From: Nam Cao <namcao@linutronix.de>
> 
> Add trace points into enqueue_task() and dequeue_task().
> 
> Signed-off-by: Nam Cao <namcao@linutronix.de>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Co-developed-by: Gabriele Monaco <gmonaco@redhat.com>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>


Could I get your Ack to this?

Thanks,
Gabriele

> ---
> 
> Notes:
>     V5:
>     * Do not fire enqueue tracepoint for delayed enqueues
> 
>  include/trace/events/sched.h |  8 ++++++++
>  kernel/sched/core.c          | 12 +++++++++++-
>  kernel/sched/sched.h         |  2 ++
>  3 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 7b2645b50e78..5844147ec5fd 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -896,6 +896,14 @@ DECLARE_TRACE(sched_set_need_resched,
>  	TP_PROTO(struct task_struct *tsk, int cpu, int tif),
>  	TP_ARGS(tsk, cpu, tif));
>  
> +DECLARE_TRACE(sched_enqueue,
> +	TP_PROTO(struct task_struct *tsk, int cpu),
> +	TP_ARGS(tsk, cpu));
> +
> +DECLARE_TRACE(sched_dequeue,
> +	TP_PROTO(struct task_struct *tsk, int cpu),
> +	TP_ARGS(tsk, cpu));
> +
>  #endif /* _TRACE_SCHED_H */
>  
>  /* This part must be outside protection */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 759777694c78..4ca79ff58fca 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -122,6 +122,8 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
>  EXPORT_TRACEPOINT_SYMBOL_GPL(sched_entry_tp);
>  EXPORT_TRACEPOINT_SYMBOL_GPL(sched_exit_tp);
>  EXPORT_TRACEPOINT_SYMBOL_GPL(sched_set_need_resched_tp);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(sched_enqueue_tp);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(sched_dequeue_tp);
>  
>  DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>  DEFINE_PER_CPU(struct rnd_state, sched_rnd_state);
> @@ -2094,6 +2096,9 @@ unsigned long get_wchan(struct task_struct *p)
>  
>  void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>  {
> +	if (trace_sched_enqueue_tp_enabled() && !(flags & ENQUEUE_DELAYED))
> +		trace_sched_enqueue_tp(p, rq->cpu);
> +
>  	if (!(flags & ENQUEUE_NOCLOCK))
>  		update_rq_clock(rq);
>  
> @@ -2120,6 +2125,8 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
>   */
>  inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
>  {
> +	int ret;
> +
>  	if (sched_core_enabled(rq))
>  		sched_core_dequeue(rq, p, flags);
>  
> @@ -2136,7 +2143,10 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
>  	 * and mark the task ->sched_delayed.
>  	 */
>  	uclamp_rq_dec(rq, p);
> -	return p->sched_class->dequeue_task(rq, p, flags);
> +	ret = p->sched_class->dequeue_task(rq, p, flags);
> +	if (trace_sched_dequeue_tp_enabled() && !(flags & DEQUEUE_SLEEP))
> +		trace_sched_dequeue_tp(p, rq->cpu);
> +	return ret;
>  }
>  
>  void activate_task(struct rq *rq, struct task_struct *p, int flags)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b82fb70a9d54..a83177742603 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2944,6 +2944,8 @@ static inline void sub_nr_running(struct rq *rq, unsigned count)
>  
>  static inline void __block_task(struct rq *rq, struct task_struct *p)
>  {
> +	trace_sched_dequeue_tp(p, rq->cpu);
> +
>  	if (p->sched_contributes_to_load)
>  		rq->nr_uninterruptible++;
>  



* Re: [PATCH v3 00/12] vfs: change inode->i_ino from unsigned long to u64
From: Christian Brauner @ 2026-03-06  9:09 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Christian Brauner, linux-fsdevel, linux-kernel,
	linux-trace-kernel, nvdimm, fsverity, linux-mm, netfs, linux-ext4,
	linux-f2fs-devel, linux-nfs, linux-cifs, samba-technical,
	linux-nilfs, v9fs, linux-afs, autofs, ceph-devel, codalist,
	ecryptfs, linux-mtd, jfs-discussion, ntfs3, ocfs2-devel, devel,
	linux-unionfs, apparmor, linux-security-module, linux-integrity,
	selinux, amd-gfx, dri-devel, linux-media, linaro-mm-sig, netdev,
	linux-perf-users, linux-fscrypt, linux-xfs, linux-hams, linux-x25,
	audit, linux-bluetooth, linux-can, linux-sctp, bpf,
	Alexander Viro, Jan Kara, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Dan Williams, Eric Biggers,
	Theodore Y. Ts'o, Muchun Song, Oscar Salvador,
	David Hildenbrand, David Howells, Paulo Alcantara, Andreas Dilger,
	Jan Kara, Jaegeuk Kim, Chao Yu, Trond Myklebust, Anna Schumaker,
	Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Steve French, Ronnie Sahlberg, Shyam Prasad N, Bharath SM,
	Alexander Aring, Ryusuke Konishi, Viacheslav Dubeyko,
	Eric Van Hensbergen, Latchesar Ionkov, Dominique Martinet,
	Christian Schoenebeck, David Sterba, Marc Dionne, Ian Kent,
	Luis de Bethencourt, Salah Triki, Tigran A. Aivazian,
	Ilya Dryomov, Alex Markuze, Jan Harkes, coda, Nicolas Pitre,
	Tyler Hicks, Amir Goldstein, Christoph Hellwig,
	John Paul Adrian Glaubitz, Yangtao Li, Mikulas Patocka,
	David Woodhouse, Richard Weinberger, Dave Kleikamp,
	Konstantin Komarov, Mark Fasheh, Joel Becker, Joseph Qi,
	Mike Marshall, Martin Brandenburg, Miklos Szeredi, Anders Larsen,
	Zhihao Cheng, Damien Le Moal, Naohiro Aota, Johannes Thumshirn,
	John Johansen, Paul Moore, James Morris, Serge E. Hallyn,
	Mimi Zohar, Roberto Sassu, Dmitry Kasatkin, Eric Snowberg, Fan Wu,
	Stephen Smalley, Ondrej Mosnacek, Casey Schaufler, Alex Deucher,
	Christian König, David Airlie, Simona Vetter, Sumit Semwal,
	Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
	David S. Miller, Jakub Kicinski, Simon Horman, Oleg Nesterov,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, James Clark, Darrick J. Wong,
	Martin Schiller, Eric Paris, Joerg Reuter, Marcel Holtmann,
	Johan Hedberg, Luiz Augusto von Dentz, Oliver Hartkopp,
	Marc Kleine-Budde, David Ahern, Neal Cardwell, Steffen Klassert,
	Herbert Xu, Remi Denis-Courmont, Marcelo Ricardo Leitner,
	Xin Long, Magnus Karlsson, Maciej Fijalkowski, Stanislav Fomichev,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend
In-Reply-To: <20260304-iino-u64-v3-0-2257ad83d372@kernel.org>

On Wed, 04 Mar 2026 10:32:30 -0500, Jeff Layton wrote:
> Christian said [1] to "just do it" when I proposed this, so here we are!
> 
> For historical reasons, the inode->i_ino field is an unsigned long,
> which means that it's 32 bits on 32 bit architectures. This has caused a
> number of filesystems to implement hacks to hash a 64-bit identifier
> into a 32-bit field, and deprives us of a universal identifier field for
> an inode.
> 
> [...]

This series makes me happy. We've been talking about this conversion for
a while and I'm thankful that you did this work. Without the automation
available this probably wouldn't have happened as quickly as it did now.
Let's see what bits and pieces it missed.

---

Applied to the vfs-7.1.kino branch of the vfs/vfs.git tree.
Patches in the vfs-7.1.kino branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-7.1.kino

[01/12] vfs: widen inode hash/lookup functions to u64
        https://git.kernel.org/vfs/vfs/c/2412a9fa518a
[02/12] audit: widen ino fields to u64
        https://git.kernel.org/vfs/vfs/c/a5e863be4d02
[03/12] net: change sock.sk_ino and sock_i_ino() to u64
        https://git.kernel.org/vfs/vfs/c/c21144a0a33f
[04/12] vfs: widen trace event i_ino fields to u64
        https://git.kernel.org/vfs/vfs/c/5e5c380870b2
[05/12] cachefiles: widen trace event i_ino fields to u64
        https://git.kernel.org/vfs/vfs/c/25291f67aad7
[06/12] ext2: widen trace event i_ino fields to u64
        https://git.kernel.org/vfs/vfs/c/797d04a355e3
[07/12] hugetlbfs: widen trace event i_ino fields to u64
        https://git.kernel.org/vfs/vfs/c/3c976fb36a9a
[08/12] zonefs: widen trace event i_ino fields to u64
        https://git.kernel.org/vfs/vfs/c/988f68c01b3a
[09/12] ext4: widen trace event i_ino fields to u64
        https://git.kernel.org/vfs/vfs/c/1c1427c79bc2
[10/12] f2fs: widen trace event i_ino fields to u64
        https://git.kernel.org/vfs/vfs/c/6e62bf74bd8a
[11/12] nilfs2: widen trace event i_ino fields to u64
        https://git.kernel.org/vfs/vfs/c/6ce73711525a
[12/12] treewide: change inode->i_ino from unsigned long to u64
        https://git.kernel.org/vfs/vfs/c/af82d143e869

^ permalink raw reply

* Re: [PATCH] kernel/trace/ftrace: introduce ftrace module notifier
From: Petr Mladek @ 2026-03-06  9:57 UTC (permalink / raw)
  To: Song Chen
  Cc: Steven Rostedt, Miroslav Benes, mcgrof, petr.pavlu, da.gomez,
	samitolvanen, atomlin, mhiramat, mark.rutland, mathieu.desnoyers,
	linux-modules, linux-kernel, linux-trace-kernel, live-patching
In-Reply-To: <321d4670-27cb-453f-a50d-426c83894074@189.cn>

On Fri 2026-02-27 09:34:59, Song Chen wrote:
> Hi,
> 
> On 2026/2/27 01:30, Steven Rostedt wrote:
> > On Thu, 26 Feb 2026 11:51:53 +0100 (CET)
> > Miroslav Benes <mbenes@suse.cz> wrote:
> > 
> > > > Let me see if there is any way to use notifier and remain below calling
> > > > sequence:
> > > > 
> > > > ftrace_module_enable
> > > > klp_module_coming
> > > > blocking_notifier_call_chain_robust(MODULE_STATE_COMING)
> > > > 
> > > > blocking_notifier_call_chain(MODULE_STATE_GOING)
> > > > klp_module_going
> > > > ftrace_release_mod
> > > 
> > > Both klp and ftrace used module notifiers in the past. We abandoned that
> > > and opted for direct calls due to issues with ordering at the time. I do
> > > not have the list of problems at hand but I remember it was very fragile.
> > > 
> > > See commits 7dcd182bec27 ("ftrace/module: remove ftrace module
> > > notifier"), 7e545d6eca20 ("livepatch/module: remove livepatch module
> > > notifier") and their surroundings.
> > > 
> > > So unless there is a reason for the change (which should be then carefully
> > > reviewed and properly tested), I would prefer to keep it as is. What is
> > > the motivation? I am failing to find it in the commit log.
> 
> There is no special motivation. I was just reading the BTF initialization in
> module loading and found the direct calls to ftrace and klp. I thought someone
> had simply forgotten to use a notifier, and I didn't even search the git log
> to verify. Sorry about that.
> 
> > 
> > Honestly, I do think just decoupling ftrace and live kernel patching from
> > modules is rationale enough, as it makes the code a bit cleaner. But to do
> > so, we really need to make sure there are absolutely no regressions.
> > 
> > Thus, to allow such a change, I would ask those that are proposing it, show
> > a full work flow of how ftrace, live kernel patching, and modules work with
> > each other and why those functions are currently injected in the module code.
> > 
> > As Miroslav stated, we tried to do it via notifiers in the past and it
> > failed. I don't want to find out why they failed by just adding them back
> > to notifiers again. Instead, the reasons must be fully understood and
> > updates made to make sure they will not fail in the future.
> 
> Yes, you are right. I read the commit message of 7dcd182bec27; this patch
> simply reverts it and would bring the ordering issue back. I will first try
> to find out what the problem was in the past.

AFAIK, the root of the problem is that livepatch uses the ftrace
framework. It means that:

   + ftrace must be initialized before livepatch gets enabled
   + livepatch must be disabled before ftrace support gets removed

My understanding is that this can't be achieved by notifiers easily
because they are always processed in the same order.

An elegant solution would be to introduce notifier_reverse_call_chain()
which would process the callbacks in the reverse order. But it might
be non-trivial:

  + We would need to make sure that it does not break some
    existing "hidden" dependencies.

  + notifier_call_chain() uses RCU to process the list of registered
    callbacks. I am not sure how complicated it would be to make it safe
    in both directions.
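
To make the ordering constraint concrete, here is a minimal user-space
sketch, not kernel code. Everything prefixed with demo_ is hypothetical;
the list is kept sorted by priority the way I understand the kernel
notifier chains to be, and the reverse walk simply recurses. It ignores
RCU and locking entirely, which is exactly the hard part mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a notifier block (not the kernel struct). */
struct demo_notifier {
	void (*call)(char *log, size_t len);
	int priority;
	struct demo_notifier *next;
};

/* Keep the list sorted: higher priority callbacks come first. */
static void demo_register(struct demo_notifier **head, struct demo_notifier *nb)
{
	while (*head && (*head)->priority >= nb->priority)
		head = &(*head)->next;
	nb->next = *head;
	*head = nb;
}

/* Forward walk: the only order a plain call chain offers. */
static void demo_call_chain(struct demo_notifier *nb, char *log, size_t len)
{
	for (; nb; nb = nb->next)
		nb->call(log, len);
}

/* Reverse walk: recurse to the tail first, then call on the way back. */
static void demo_reverse_call_chain(struct demo_notifier *nb, char *log, size_t len)
{
	if (!nb)
		return;
	demo_reverse_call_chain(nb->next, log, len);
	nb->call(log, len);
}

/* Append a name to the log so the call order can be inspected. */
static void demo_append(char *log, size_t len, const char *s)
{
	assert(len > strlen(log));
	strncat(log, s, len - strlen(log) - 1);
}

static void demo_ftrace_cb(char *log, size_t len) { demo_append(log, len, "ftrace "); }
static void demo_klp_cb(char *log, size_t len) { demo_append(log, len, "klp "); }
```

With ftrace registered at a higher priority than klp, demo_call_chain()
gives the order wanted for MODULE_STATE_COMING (ftrace, then klp) and
demo_reverse_call_chain() gives the order wanted for MODULE_STATE_GOING
(klp, then ftrace).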

Best Regards,
Petr

^ permalink raw reply

* Re: [RFC PATCH 1/2] locking: add mutex_lock_nospin()
From: David Laight @ 2026-03-06 10:00 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Steven Rostedt, Waiman Long, Peter Zijlstra, mingo, will, boqun,
	mhiramat, mark.rutland, mathieu.desnoyers, linux-kernel,
	linux-trace-kernel, bpf
In-Reply-To: <CALOAHbA9u2sHSYRF_kDKpf2d_UADQFu-1BMWFhBTo3mvgf6kFQ@mail.gmail.com>

On Fri, 6 Mar 2026 10:22:11 +0800
Yafang Shao <laoar.shao@gmail.com> wrote:

> On Thu, Mar 5, 2026 at 9:20 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Thu, 5 Mar 2026 13:40:27 +0800
> > Yafang Shao <laoar.shao@gmail.com> wrote:
> >  
> > > Exactly. ftrace is intended for debugging and should not significantly
> > > impact real workloads. Therefore, it's reasonable to make it sleep if
> > > it cannot acquire the lock immediately, rather than spinning and
> > > consuming CPU cycles.  
> >
> > Actually, ftrace is more than just debugging. It is the infrastructure for
> > live kernel patching as well.  
> 
> good to know.
> 
> >  
> > >  
> > > >
> > > > BTW, you should expand the commit log of patch 1 to include the
> > > > rationale of why we should add this feature to mutex as the information
> > > > in the cover letter won't get included in the git log if this patch
> > > > series is merged. You should also elaborate in a comment on under what
> > > > conditions this new mutex API should be used.
> > >
> > > Sure.  I will update it.
> > >
> > > BTW, these issues are notably hard to find. I suspect there are other
> > > locks out there with the same problem.  
> >
> > As I mentioned, I'm not against the change. I just want to make sure the
> > rationale is strong enough to make the change.
> >
> > One thing that should be modified with your patch is the name. "nospin"
> > references the implementation of the mutex. Instead it should be called
> > something like: "noncritical" or "slowpath" stating that the grabbing of
> > this mutex is not of a critical section.
> >
> > Maybe an entirely new interface should be defined:
> >
> >
> > struct slow_mutex;  
> 
> Is it necessary to define a new structure for this slow mutex? We
> could simply reuse the existing struct mutex instead. Alternatively,
> should we add some new flags to this slow_mutex for debugging
> purposes?
> 
> >
> > slow_mutex_lock()
> > slow_mutex_unlock()  
> 
> These two APIs appear sufficient to handle this use case.

Don't semaphores still exist?
IIRC they always sleep.

Although I wonder if the mutex needs to be held for as long as it is.
ISTR one of the tracebacks was on the 'address to name' lookup;
that code will be slow.
Since the mutex can't be held across the multiple reads that are done
to read the full list of tracepoints it must surely be possible to
release it across the name lookup?
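
Roughly what I mean, as a user-space sketch with pthreads rather than
the real ftrace code. All the demo_ names are made up, and it assumes
the address copied under the lock stays meaningful after the unlock,
which would need checking for the real code (module unload, for one).

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

#define DEMO_NR_ENTRIES 3

/* Hypothetical list protected by a mutex. */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long demo_addrs[DEMO_NR_ENTRIES] = { 0x1000, 0x2000, 0x3000 };

/* Stand-in for the slow 'address to name' lookup; needs no lock. */
static int demo_slow_name_lookup(unsigned long addr, char *buf, size_t len)
{
	return snprintf(buf, len, "func_%lx", addr);
}

/*
 * Format entry idx into buf. Returns the snprintf() length, or -1 if
 * idx is past the end of the list.
 */
static int demo_show_entry(size_t idx, char *buf, size_t len)
{
	unsigned long addr;

	assert(buf && len);

	pthread_mutex_lock(&demo_lock);
	if (idx >= DEMO_NR_ENTRIES) {
		pthread_mutex_unlock(&demo_lock);
		return -1;
	}
	addr = demo_addrs[idx];	/* copy what is needed while locked */
	pthread_mutex_unlock(&demo_lock);

	/* The slow part runs with the mutex released. */
	return demo_slow_name_lookup(addr, buf, len);
}
```

The mutex is then only held for the list access itself, so other
waiters are not stuck behind the lookup.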

	David

> 
> >
> > etc,
> >
> > that makes it obvious that this mutex may be held for long periods of time.
> > In fact, this would be useful for RT workloads, as these mutexes could be
> > flagged to warn RT critical tasks if those tasks were to take one of them.
> >
> > There has been some talk to mark paths in the kernel that RT tasks would
> > get a SIGKILL if they were to hit a path that is known to be non
> > deterministic.  
> 
> Thanks for your information.
> 


^ permalink raw reply

* Re: [PATCH v4 4/5] mm: rename zone->lock to zone->_lock
From: Pedro Falcato @ 2026-03-06 10:30 UTC (permalink / raw)
  To: Dmitry Ilvokhin
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Rafael J. Wysocki, Pavel Machek, Len Brown, Brendan Jackman,
	Johannes Weiner, Zi Yan, Oscar Salvador, Qi Zheng, Shakeel Butt,
	linux-kernel, linux-mm, linux-trace-kernel, linux-pm,
	SeongJae Park
In-Reply-To: <d61500c5784c64e971f4d328c57639303c475f81.1772206930.git.d@ilvokhin.com>

On Fri, Feb 27, 2026 at 04:00:26PM +0000, Dmitry Ilvokhin wrote:
> This intentionally breaks direct users of zone->lock at compile time so
> all call sites are converted to the zone lock wrappers. Without the
> rename, present and future out-of-tree code could continue using
> spin_lock(&zone->lock) and bypass the wrappers and tracing
> infrastructure.
> 
> No functional change intended.
> 
> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Acked-by: SeongJae Park <sj@kernel.org>
> ---
>  include/linux/mmzone.h      |  7 +++++--
>  include/linux/mmzone_lock.h | 12 ++++++------
>  mm/compaction.c             |  4 ++--
>  mm/internal.h               |  2 +-
>  mm/page_alloc.c             | 16 ++++++++--------
>  mm/page_isolation.c         |  4 ++--
>  mm/page_owner.c             |  2 +-
>  7 files changed, 25 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 3e51190a55e4..32bca655fce5 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1009,8 +1009,11 @@ struct zone {
>  	/* zone flags, see below */
>  	unsigned long		flags;
>  
> -	/* Primarily protects free_area */
> -	spinlock_t		lock;
> +	/*
> +	 * Primarily protects free_area. Should be accessed via zone_lock_*
> +	 * helpers.
> +	 */
> +	spinlock_t		_lock;

I really don't like this uglification.
Suggestion:
	spinlock_t __private	lock;

>  
>  	/* Pages to be freed when next trylock succeeds */
>  	struct llist_head	trylock_free_pages;
> diff --git a/include/linux/mmzone_lock.h b/include/linux/mmzone_lock.h
> index a1cfba8408d6..62e34d500078 100644
> --- a/include/linux/mmzone_lock.h
> +++ b/include/linux/mmzone_lock.h
> @@ -7,32 +7,32 @@
>  
>  static inline void zone_lock_init(struct zone *zone)
>  {
> -	spin_lock_init(&zone->lock);

and then ACCESS_PRIVATE() all over these helpers. This will not make a
difference to the compiler, but it will work with sparse.

It's not that I don't understand what you're doing, but we're going to need
to look at this code and refer to this code 20 years from now, and I would
rather not refer to zone->_lock :)
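
For reference, a stand-alone sketch of the pattern. The two macros are
written from memory of include/linux/compiler_types.h, so treat them as
an approximation; the demo_ types and helpers are hypothetical stand-ins
for struct zone and the zone_lock_* wrappers.

```c
#include <assert.h>

/*
 * Only sparse (__CHECKER__) sees the annotation. For the compiler the
 * member is an ordinary field and ACCESS_PRIVATE() is a plain access.
 */
#ifdef __CHECKER__
# define __private	__attribute__((noderef))
# define ACCESS_PRIVATE(p, member) \
	(*((__typeof__((p)->member) __attribute__((force)) *)&(p)->member))
#else
# define __private
# define ACCESS_PRIVATE(p, member) ((p)->member)
#endif

/* Hypothetical stand-in for spinlock_t. */
typedef struct { int locked; } demo_spinlock_t;

struct demo_zone {
	unsigned long flags;
	/* Primarily protects free_area. Use the demo_zone_lock_* helpers. */
	demo_spinlock_t __private lock;
};

static inline void demo_zone_lock_init(struct demo_zone *zone)
{
	ACCESS_PRIVATE(zone, lock).locked = 0;
}

static inline void demo_zone_lock(struct demo_zone *zone)
{
	demo_spinlock_t *l = &ACCESS_PRIVATE(zone, lock);

	assert(!l->locked);	/* single-threaded demo: no real spinning */
	l->locked = 1;
}

static inline void demo_zone_unlock(struct demo_zone *zone)
{
	ACCESS_PRIVATE(zone, lock).locked = 0;
}

static inline int demo_zone_is_locked(struct demo_zone *zone)
{
	return ACCESS_PRIVATE(zone, lock).locked;
}
```

The field keeps its natural name, and a direct zone->lock access outside
the helpers is flagged by sparse instead of failing the build.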

-- 
Pedro

^ permalink raw reply

* Re: [PATCH v3 01/12] vfs: widen inode hash/lookup functions to u64
From: Jeff Layton @ 2026-03-06 12:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Dan Williams, Eric Biggers,
	Theodore Y. Ts'o, Muchun Song, Oscar Salvador,
	David Hildenbrand, David Howells, Paulo Alcantara, Andreas Dilger,
	Jan Kara, Jaegeuk Kim, Chao Yu, Trond Myklebust, Anna Schumaker,
	Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Steve French, Ronnie Sahlberg, Shyam Prasad N, Bharath SM,
	Alexander Aring, Ryusuke Konishi, Viacheslav Dubeyko,
	Eric Van Hensbergen, Latchesar Ionkov, Dominique Martinet,
	Christian Schoenebeck, David Sterba, Marc Dionne, Ian Kent,
	Luis de Bethencourt, Salah Triki, Tigran A. Aivazian,
	Ilya Dryomov, Alex Markuze, Jan Harkes, coda, Nicolas Pitre,
	Tyler Hicks, Amir Goldstein, John Paul Adrian Glaubitz,
	Yangtao Li, Mikulas Patocka, David Woodhouse, Richard Weinberger,
	Dave Kleikamp, Konstantin Komarov, Mark Fasheh, Joel Becker,
	Joseph Qi, Mike Marshall, Martin Brandenburg, Miklos Szeredi,
	Anders Larsen, Zhihao Cheng, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, John Johansen, Paul Moore, James Morris,
	Serge E. Hallyn, Mimi Zohar, Roberto Sassu, Dmitry Kasatkin,
	Eric Snowberg, Fan Wu, Stephen Smalley, Ondrej Mosnacek,
	Casey Schaufler, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Sumit Semwal, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Oleg Nesterov, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Darrick J. Wong, Martin Schiller, Eric Paris,
	Joerg Reuter, Marcel Holtmann, Johan Hedberg,
	Luiz Augusto von Dentz, Oliver Hartkopp, Marc Kleine-Budde,
	David Ahern, Neal Cardwell, Steffen Klassert, Herbert Xu,
	Remi Denis-Courmont, Marcelo Ricardo Leitner, Xin Long,
	Magnus Karlsson, Maciej Fijalkowski, Stanislav Fomichev,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, linux-fsdevel, linux-kernel, linux-trace-kernel,
	nvdimm, fsverity, linux-mm, netfs, linux-ext4, linux-f2fs-devel,
	linux-nfs, linux-cifs, samba-technical, linux-nilfs, v9fs,
	linux-afs, autofs, ceph-devel, codalist, ecryptfs, linux-mtd,
	jfs-discussion, ntfs3, ocfs2-devel, devel, linux-unionfs,
	apparmor, linux-security-module, linux-integrity, selinux,
	amd-gfx, dri-devel, linux-media, linaro-mm-sig, netdev,
	linux-perf-users, linux-fscrypt, linux-xfs, linux-hams, linux-x25,
	audit, linux-bluetooth, linux-can, linux-sctp, bpf
In-Reply-To: <aamSFgXhrORAJLBC@infradead.org>

On Thu, 2026-03-05 at 06:24 -0800, Christoph Hellwig wrote:
> >  extern struct inode *ilookup5_nowait(struct super_block *sb,
> > -		unsigned long hashval, int (*test)(struct inode *, void *),
> > +		u64 hashval, int (*test)(struct inode *, void *),
> >  		void *data, bool *isnew);
> > -extern struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
> > +extern struct inode *ilookup5(struct super_block *sb, u64 hashval,
> >  		int (*test)(struct inode *, void *), void *data);
> 
> ...
> 
> Can you please drop all these pointless externs while you're at it?
> 

I was planning to do that, but then Christian merged it!

I'll do a patch on top of this that does this in the range of fs.h that
the patch touches. Christian can throw it on top of the series, and
that shouldn't be too bad for backports.

> Otherwise looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks for the review!
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply

* Re: [PATCH v3 01/12] vfs: widen inode hash/lookup functions to u64
From: Christian Brauner @ 2026-03-06 13:28 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Christoph Hellwig, Alexander Viro, Jan Kara, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Dan Williams, Eric Biggers,
	Theodore Y. Ts'o, Muchun Song, Oscar Salvador,
	David Hildenbrand, David Howells, Paulo Alcantara, Andreas Dilger,
	Jan Kara, Jaegeuk Kim, Chao Yu, Trond Myklebust, Anna Schumaker,
	Chuck Lever, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Steve French, Ronnie Sahlberg, Shyam Prasad N, Bharath SM,
	Alexander Aring, Ryusuke Konishi, Viacheslav Dubeyko,
	Eric Van Hensbergen, Latchesar Ionkov, Dominique Martinet,
	Christian Schoenebeck, David Sterba, Marc Dionne, Ian Kent,
	Luis de Bethencourt, Salah Triki, Tigran A. Aivazian,
	Ilya Dryomov, Alex Markuze, Jan Harkes, coda, Nicolas Pitre,
	Tyler Hicks, Amir Goldstein, John Paul Adrian Glaubitz,
	Yangtao Li, Mikulas Patocka, David Woodhouse, Richard Weinberger,
	Dave Kleikamp, Konstantin Komarov, Mark Fasheh, Joel Becker,
	Joseph Qi, Mike Marshall, Martin Brandenburg, Miklos Szeredi,
	Anders Larsen, Zhihao Cheng, Damien Le Moal, Naohiro Aota,
	Johannes Thumshirn, John Johansen, Paul Moore, James Morris,
	Serge E. Hallyn, Mimi Zohar, Roberto Sassu, Dmitry Kasatkin,
	Eric Snowberg, Fan Wu, Stephen Smalley, Ondrej Mosnacek,
	Casey Schaufler, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Sumit Semwal, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Oleg Nesterov, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Darrick J. Wong, Martin Schiller, Eric Paris,
	Joerg Reuter, Marcel Holtmann, Johan Hedberg,
	Luiz Augusto von Dentz, Oliver Hartkopp, Marc Kleine-Budde,
	David Ahern, Neal Cardwell, Steffen Klassert, Herbert Xu,
	Remi Denis-Courmont, Marcelo Ricardo Leitner, Xin Long,
	Magnus Karlsson, Maciej Fijalkowski, Stanislav Fomichev,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, linux-fsdevel, linux-kernel, linux-trace-kernel,
	nvdimm, fsverity, linux-mm, netfs, linux-ext4, linux-f2fs-devel,
	linux-nfs, linux-cifs, samba-technical, linux-nilfs, v9fs,
	linux-afs, autofs, ceph-devel, codalist, ecryptfs, linux-mtd,
	jfs-discussion, ntfs3, ocfs2-devel, devel, linux-unionfs,
	apparmor, linux-security-module, linux-integrity, selinux,
	amd-gfx, dri-devel, linux-media, linaro-mm-sig, netdev,
	linux-perf-users, linux-fscrypt, linux-xfs, linux-hams, linux-x25,
	audit, linux-bluetooth, linux-can, linux-sctp, bpf
In-Reply-To: <c1845a4b8d35d367953ac6cbfcf91ac36958ba51.camel@kernel.org>

On Fri, Mar 06, 2026 at 07:03:15AM -0500, Jeff Layton wrote:
> On Thu, 2026-03-05 at 06:24 -0800, Christoph Hellwig wrote:
> > >  extern struct inode *ilookup5_nowait(struct super_block *sb,
> > > -		unsigned long hashval, int (*test)(struct inode *, void *),
> > > +		u64 hashval, int (*test)(struct inode *, void *),
> > >  		void *data, bool *isnew);
> > > -extern struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
> > > +extern struct inode *ilookup5(struct super_block *sb, u64 hashval,
> > >  		int (*test)(struct inode *, void *), void *data);
> > 
> > ...
> > 
> > Can you please drop all these pointless externs while you're at it?
> > 
> 
> I was planning to do that, but then Christian merged it!
> 
> I'll do a patch on top of this that does this in the range of fs.h that
> the patch touches. Christian can throw it on top of the series, and
> that shouldn't be too bad for backports.

I can easily drop those so no need to resend for stuff like this as per
the usual protocol.

^ permalink raw reply

* [PATCH v13 00/32] Tracefs support for pKVM
From: Vincent Donnefort @ 2026-03-06 14:35 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel, maz,
	oliver.upton, joey.gouly, suzuki.poulose, yuzenghui
  Cc: kvmarm, linux-arm-kernel, jstultz, qperret, will, aneesh.kumar,
	kernel-team, linux-kernel, Vincent Donnefort

The growing set of features supported by the hypervisor in protected
mode necessitates debugging and profiling tools. Tracefs is the
ideal candidate for this task:

  * It is simple to use and to script.

  * It is supported by various tools, from the trace-cmd CLI to the
    Android web-based perfetto.

  * The ring-buffer, where trace events are stored, consists of linked
    pages, making it an ideal structure for sharing between kernel and
    hypervisor.

This series first introduces a new generic way of creating remote events and
remote buffers. Then it adds support to the pKVM hypervisor.

1. ring-buffer
--------------

To set up the per-cpu ring-buffers, a new interface is created:

  ring_buffer_remote:	Describes what the kernel needs to know about the
			remote writer, that is, the set of pages forming the
			ring-buffer and a callback for the reader/head
			swapping (enables consuming read)

  ring_buffer_remote():	Creates a read-only ring-buffer from a
			ring_buffer_remote.

To keep the internals of `struct ring_buffer` in sync with the remote,
the meta-page is used. It was originally introduced to enable user-space
mapping of the ring-buffer [1]. In this case, the kernel is not the
producer anymore but the reader. The function to read that meta-page is:

  ring_buffer_poll_remote():
			Update `struct ring_buffer` based on the remote
			meta-page. Wake-up readers if necessary.

The kernel has to poll the meta-page to be notified of newly written
events.
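
As a rough illustration of the polling idea only (this is not the code
from the series), the reader side can be pictured as below. The struct
layouts, field names and demo_poll_remote() are hypothetical; the real
meta-page layout and ring_buffer_poll_remote() differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of a meta-page, written only by the remote. */
struct demo_meta_page {
	_Atomic uint64_t entries;	/* events written so far */
	_Atomic uint64_t overrun;	/* events lost so far */
};

/* Reader-side cached view, standing in for struct ring_buffer state. */
struct demo_reader_state {
	uint64_t entries;
	uint64_t overrun;
	unsigned int wakeups;	/* how many times readers were woken */
};

/*
 * Refresh the cached view from the meta-page. Returns true, and counts
 * a wake-up, when the remote has written events since the last poll.
 */
static bool demo_poll_remote(struct demo_reader_state *st,
			     struct demo_meta_page *meta)
{
	uint64_t entries = atomic_load_explicit(&meta->entries,
						memory_order_acquire);
	uint64_t overrun = atomic_load_explicit(&meta->overrun,
						memory_order_acquire);
	bool new_events = entries != st->entries;

	st->entries = entries;
	st->overrun = overrun;
	if (new_events)
		st->wakeups++;	/* the kernel would wake up waiters here */

	return new_events;
}
```

Since the remote cannot notify the kernel, a poll like this has to run
periodically; a poll that finds nothing new costs two loads and no
wake-up.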

2. Tracefs
----------

This series introduces a new trace_remote that links tracefs and the
remote ring-buffer.

The interface is found in the remotes/ directory at the root of the
tracefs mount point. Each remote is like an instance and you'll find
there a subset of the regular Tracefs user-space interface:

   remotes/test
	|-- buffer_size_kb
	|-- events
	|   |-- enable
	|   |-- header_event
	|   |-- header_page
	|   `-- test
	|       `-- selftest
	|           |-- enable
	|           |-- format
	|           `-- id
	|-- per_cpu
	|   `-- cpu0
	|       |-- trace
	|       `-- trace_pipe
	|-- trace
	|-- trace_pipe
	|-- tracing_on

Behind the scenes, kernel/trace/trace_remote.c creates this tracefs
hierarchy without relying on kernel/trace/trace.c. This is due to
fundamental differences:

  * Remote tracing doesn't support trace_array's system-specific
    features (snapshots, tracers, etc.).

  * Logged event formats differ (e.g., no PID for remote events).

  * Buffer operations require specific remote interactions.

3. Simple Ring-Buffer
---------------------

As the current ring-buffer.c implementation has too many dependencies to
be used directly by the pKVM hypervisor, a new simple implementation is
created and can be found in kernel/trace/simple-ring-buffer.c.

This implementation is write-only and is used by both the pKVM
hypervisor and a trace_remote test module.

4. Events
---------

A new REMOTE_EVENT() macro is added to simplify the creation of events
on the kernel side. As remote tracing buffers are read-only, only the
event structure and a way of printing must be declared. The prototype of
the macro is very similar to the well-known TRACE_EVENT():

 REMOTE_EVENT(my_event, id,
     RE_STRUCT(
         re_field(u64, foobar)
     ),
     RE_PRINTK("foobar=%lld", __entry->foobar)
 )

5. pKVM
-------

The pKVM support simply creates a "hypervisor" trace_remote on the
kernel side and inherits from simple-ring-buffer.c on the hypervisor
side.

A new event macro, HYP_EVENT(), is created. Under the hood it re-uses
REMOTE_EVENT() (defined in the previous section) and also generates the
hypervisor-specific struct and trace_<event>() functions.

6. Limitations
--------------

Non-consuming reading of the buffer isn't supported (i.e. cat trace ->
-EPERM) due to the current lack of support in the ring-buffer meta-page.

[1] https://tracingsummit.org/ts/2022/hypervisortracing/
[2] https://lore.kernel.org/all/20240510140435.3550353-1-vdonnefort@google.com/

Changes since v12 (https://lore.kernel.org/all/20260219150307.14538-1-vdonnefort@google.com/)

  - Rebase on v7.0-rc2
  - use kzalloc_obj() in trace_remote.c and remote_test.c

Changes since v11 (https://lore.kernel.org/all/20260131132848.254084-1-vdonnefort@google.com/)

  - Fix kerneldoc (Steven)
  - Remove useless ring_buffer_event_data type cast (Steven)
  - Fix __free_ring_buffer_iter() (Steven)
  - Move trace seq locking into start/stop (Steven)

Changes since v10 (https://lore.kernel.org/all/20260126104419.1649811-1-vdonnefort@google.com/)

  - Move kerneldoc to .c files (Steven)
  - Return EBUSY on buffer_size_kb write if buffer is loaded (Steven)
  - Remove rb_iter/rb_iters union in trace_remote_iterator (Steven)
  - Rename and refactor trace file seq_operations (Steven)
  - Make trace_get_cpu() accessible to trace_remote.c (Steven)
  - Remove unnecessary cpus_read_unlock() (Steven)
  - !preempt on remote_test driver buffer writing (Steven)
  - Do not fail selftest if cpu/online is unavailable (Steven)
  - Add rationale for trace_remote into documentation (Steven)

Changes since v9 (https://lore.kernel.org/all/20251202093623.2337860-1-vdonnefort@google.com/)

  - Add vCPU PID to hyp_enter/hyp_exit (Marc)
  - Remove useless X1 setting for tracing HVCs (Marc)
  - Fix REMOTE_PRINTK_COUNT_ARGS()
  - Rebase on 6.19-rc7

Changes since v8 (https://lore.kernel.org/all/20251107093840.3779150-1-vdonnefort@google.com/)

  - Do not enable tracing if unstable cntvct (Marc)
  - Add support for nVHE (Marc)
  - Add PKVM_DISABLE_STAGE2_ON_PANIC (Marc)
  - NVHE_EL2_TRACING depends on NVHE_EL2_DEBUG (Marc)
  - Add a reason for hyp_enter/hyp_exit events (Marc)
  - Remove PKVM_SELFTESTS in favour of NVHE_EL2_DEBUG
  - Add wrapper for arm_smccc_1_2, now used in nvhe/ffa.c

Changes since v7 (https://lore.kernel.org/all/20251003133825.2068970-1-vdonnefort@google.com/)

  - Add missing EXPORT_SYMBOL_GPL for remote_test.ko
  - Rebase on 6.18-rc4

Changes since v6 (https://lore.kernel.org/all/20250821081412.1008261-1-vdonnefort@google.com/)

  - Add requires field to the selftest (Masami)
  - Use guard() for ring_buffer_poll_remote (Steven)
  - Rename ring_buffer_remote() to ring_buffer_alloc_remote() (Steven)
  - kerneldoc for trace_buffer_remote and simple_ring_buffer (Steven)
  - Validate trace_buffer_desc size in trace_remote_alloc_buffer
    (Steven)
  - Add non-consuming ring-buffer read (Steven)
  - Add spinning failsafe in simple_ring_buffer (Steven)
  - Range check for hyp_trace_desc::bpages_backing_* in hyp_trace_desc_validate()
  - unsigned int cpu in hyp_trace_desc_validate()
  - Fix event/format file
  - Add tests with an offline CPU
  - Add tests for non-consuming read
  - Add documentation
  - Rebase on 6.17

Changes since v5 (https://lore.kernel.org/all/20250516134031.661124-1-vdonnefort@google.com/)

  - Add tishift lib to the hyp (Aneesh)
  - Rebase on 6.17-rc2

Changes since v4 (https://lore.kernel.org/all/20250506164820.515876-1-vdonnefort@google.com/)

  - Extend meta-page with pages_touched and pages_lost
  - Create ring_buffer_types.h
  - Fix simple_ring_buffer build for 32-bits arch and x86
  - Try unload buffer on reset (+ test)
  - Minor renaming and comments

Changes since v3 (https://lore.kernel.org/all/20250224121353.98697-1-vdonnefort@google.com/)

  - Move tracefs support from kvm/hyp_trace.c into a generic trace_remote.c.
  - Move ring-buffer implementation from nvhe/trace.c into a generic
    simple-ring-buffer.c
  - Rebase on 6.15-rc4.

Changes since v2 (https://lore.kernel.org/all/20250108114536.627715-1-vdonnefort@google.com/)

  - Fix ring-buffer remote reset
  - Fix fast-forward in rb_page_desc()
  - Refactor nvhe/trace.c
  - struct hyp_buffer_page more compact
  - Add a struct_len to trace_page_desc
  - Extend reset testing
  - Rebase on 6.14-rc3

Changes since v1 (https://lore.kernel.org/all/20240911093029.3279154-1-vdonnefort@google.com/)

  - Add 128-bits mult fallback in the unlikely event of an overflow. (John)
  - Fix ELF section sort.
  - __always_inline trace_* event macros.
  - Fix events/<event>/enable permissions.
  - Rename ring-buffer "writer" to "remote".
  - Rename CONFIG_PROTECTED_NVHE_TESTING to PKVM_SELFTEST to align with
    Quentin's upcoming selftest
  - Rebase on 6.13-rc3.

Changes since RFC (https://lore.kernel.org/all/20240805173234.3542917-1-vdonnefort@google.com/)

  - hypervisor trace clock:
     - mult/shift computed in hyp_trace.c. (John)
     - Update clock when it deviates from kernel boot clock. (John)
     - Add trace_clock file.
     - Separate patch for better readability.
  - Add a proper reset interface which does not need to teardown the
    tracing buffers. (Steven)
  - Return -EPERM on trace access. (Steven)
  - Add per-cpu trace file.
  - Automatically teardown and free the tracing buffer when it is empty,
    without readers and not currently tracing.
  - Show in buffer_size_kb if the buffer is loaded in the hypervisor or
    not.
  - Extend tests to cover reset and unload.
  - CC timekeeping folks on relevant patches (Marc)

Vincent Donnefort (32):
  ring-buffer: Add page statistics to the meta-page
  ring-buffer: Store bpage pointers into subbuf_ids
  ring-buffer: Introduce ring-buffer remotes
  ring-buffer: Add non-consuming read for ring-buffer remotes
  tracing: Introduce trace remotes
  tracing: Add reset to trace remotes
  tracing: Add non-consuming read to trace remotes
  tracing: Add init callback to trace remotes
  tracing: Add events to trace remotes
  tracing: Add events/ root files to trace remotes
  tracing: Add helpers to create trace remote events
  ring-buffer: Export buffer_data_page and macros
  tracing: Introduce simple_ring_buffer
  tracing: Add a trace remote module for testing
  tracing: selftests: Add trace remote tests
  Documentation: tracing: Add tracing remotes
  tracing: load/unload page callbacks for simple_ring_buffer
  tracing: Check for undefined symbols in simple_ring_buffer
  KVM: arm64: Add PKVM_DISABLE_STAGE2_ON_PANIC
  KVM: arm64: Add clock support to nVHE/pKVM hyp
  KVM: arm64: Initialise hyp_nr_cpus for nVHE hyp
  KVM: arm64: Support unaligned fixmap in the pKVM hyp
  KVM: arm64: Add tracing capability for the nVHE/pKVM hyp
  KVM: arm64: Add trace remote for the nVHE/pKVM hyp
  KVM: arm64: Sync boot clock with the nVHE/pKVM hyp
  KVM: arm64: Add trace reset to the nVHE/pKVM hyp
  KVM: arm64: Add event support to the nVHE/pKVM hyp and trace remote
  KVM: arm64: Add hyp_enter/hyp_exit events to nVHE/pKVM hyp
  KVM: arm64: Add selftest event support to nVHE/pKVM hyp
  tracing: selftests: Add hypervisor trace remote tests
  fixup! tracing: Add a trace remote module for testing
  fixup! tracing: Add a trace remote module for testing

 Documentation/trace/index.rst                 |   11 +
 Documentation/trace/remotes.rst               |   66 +
 arch/arm64/include/asm/kvm_asm.h              |    8 +
 arch/arm64/include/asm/kvm_define_hypevents.h |   16 +
 arch/arm64/include/asm/kvm_host.h             |    3 +
 arch/arm64/include/asm/kvm_hyp.h              |    4 +-
 arch/arm64/include/asm/kvm_hypevents.h        |   60 +
 arch/arm64/include/asm/kvm_hyptrace.h         |   26 +
 arch/arm64/kernel/image-vars.h                |    4 +
 arch/arm64/kernel/vmlinux.lds.S               |   18 +
 arch/arm64/kvm/Kconfig                        |   64 +-
 arch/arm64/kvm/Makefile                       |    2 +
 arch/arm64/kvm/arm.c                          |   12 +-
 arch/arm64/kvm/handle_exit.c                  |    2 +-
 arch/arm64/kvm/hyp/include/nvhe/arm-smccc.h   |   23 +
 arch/arm64/kvm/hyp/include/nvhe/clock.h       |   16 +
 .../kvm/hyp/include/nvhe/define_events.h      |   14 +
 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |    2 -
 arch/arm64/kvm/hyp/include/nvhe/trace.h       |   70 +
 arch/arm64/kvm/hyp/nvhe/Makefile              |    6 +-
 arch/arm64/kvm/hyp/nvhe/clock.c               |   65 +
 arch/arm64/kvm/hyp/nvhe/events.c              |   25 +
 arch/arm64/kvm/hyp/nvhe/ffa.c                 |   28 +-
 arch/arm64/kvm/hyp/nvhe/host.S                |    2 +-
 arch/arm64/kvm/hyp/nvhe/hyp-main.c            |   87 +-
 arch/arm64/kvm/hyp/nvhe/hyp.lds.S             |    6 +
 arch/arm64/kvm/hyp/nvhe/mm.c                  |    4 +-
 arch/arm64/kvm/hyp/nvhe/psci-relay.c          |    7 +-
 arch/arm64/kvm/hyp/nvhe/setup.c               |    4 +-
 arch/arm64/kvm/hyp/nvhe/stacktrace.c          |    6 +-
 arch/arm64/kvm/hyp/nvhe/switch.c              |    5 +-
 arch/arm64/kvm/hyp/nvhe/trace.c               |  306 ++++
 arch/arm64/kvm/hyp_trace.c                    |  443 ++++++
 arch/arm64/kvm/hyp_trace.h                    |   11 +
 arch/arm64/kvm/stacktrace.c                   |    8 +-
 fs/tracefs/inode.c                            |    1 +
 include/linux/ring_buffer.h                   |   58 +
 include/linux/ring_buffer_types.h             |   41 +
 include/linux/simple_ring_buffer.h            |   65 +
 include/linux/trace_remote.h                  |   48 +
 include/linux/trace_remote_event.h            |   33 +
 include/trace/define_remote_events.h          |   73 +
 include/uapi/linux/trace_mmap.h               |    8 +-
 kernel/trace/Kconfig                          |   14 +
 kernel/trace/Makefile                         |   20 +
 kernel/trace/remote_test.c                    |  261 ++++
 kernel/trace/remote_test_events.h             |   10 +
 kernel/trace/ring_buffer.c                    |  356 ++++-
 kernel/trace/simple_ring_buffer.c             |  517 +++++++
 kernel/trace/trace.c                          |    4 +-
 kernel/trace/trace.h                          |    7 +
 kernel/trace/trace_remote.c                   | 1368 +++++++++++++++++
 .../ftrace/test.d/remotes/buffer_size.tc      |   25 +
 .../selftests/ftrace/test.d/remotes/functions |   88 ++
 .../test.d/remotes/hypervisor/buffer_size.tc  |   11 +
 .../ftrace/test.d/remotes/hypervisor/reset.tc |   11 +
 .../ftrace/test.d/remotes/hypervisor/trace.tc |   11 +
 .../test.d/remotes/hypervisor/trace_pipe.tc   |   11 +
 .../test.d/remotes/hypervisor/unloading.tc    |   11 +
 .../selftests/ftrace/test.d/remotes/reset.tc  |   90 ++
 .../selftests/ftrace/test.d/remotes/trace.tc  |  127 ++
 .../ftrace/test.d/remotes/trace_pipe.tc       |  127 ++
 .../ftrace/test.d/remotes/unloading.tc        |   41 +
 63 files changed, 4751 insertions(+), 120 deletions(-)
 create mode 100644 Documentation/trace/remotes.rst
 create mode 100644 arch/arm64/include/asm/kvm_define_hypevents.h
 create mode 100644 arch/arm64/include/asm/kvm_hypevents.h
 create mode 100644 arch/arm64/include/asm/kvm_hyptrace.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/arm-smccc.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/clock.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/define_events.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/trace.h
 create mode 100644 arch/arm64/kvm/hyp/nvhe/clock.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/events.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/trace.c
 create mode 100644 arch/arm64/kvm/hyp_trace.c
 create mode 100644 arch/arm64/kvm/hyp_trace.h
 create mode 100644 include/linux/ring_buffer_types.h
 create mode 100644 include/linux/simple_ring_buffer.h
 create mode 100644 include/linux/trace_remote.h
 create mode 100644 include/linux/trace_remote_event.h
 create mode 100644 include/trace/define_remote_events.h
 create mode 100644 kernel/trace/remote_test.c
 create mode 100644 kernel/trace/remote_test_events.h
 create mode 100644 kernel/trace/simple_ring_buffer.c
 create mode 100644 kernel/trace/trace_remote.c
 create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/buffer_size.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/functions
 create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/hypervisor/buffer_size.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/hypervisor/reset.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/hypervisor/trace.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/hypervisor/trace_pipe.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/hypervisor/unloading.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/reset.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/trace.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/trace_pipe.tc
 create mode 100644 tools/testing/selftests/ftrace/test.d/remotes/unloading.tc


base-commit: 11439c4635edd669ae435eec308f4ab8a0804808
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply

* [PATCH v13 01/32] ring-buffer: Add page statistics to the meta-page
From: Vincent Donnefort @ 2026-03-06 14:35 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel, maz,
	oliver.upton, joey.gouly, suzuki.poulose, yuzenghui
  Cc: kvmarm, linux-arm-kernel, jstultz, qperret, will, aneesh.kumar,
	kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260306143536.339777-1-vdonnefort@google.com>

Add two fields, pages_touched and pages_lost, to the ring-buffer
meta-page. These fields make it possible to compute the number of used
pages in the ring-buffer.
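
As a rough illustration of how a user-space reader might combine the two
counters, the sketch below estimates the number of pages currently holding
data. The struct meta_counters and the helper used_pages() are hypothetical
names introduced here. Only pages_touched and pages_lost come from this
patch, and the subtraction is an assumption about how the counters relate,
not something the patch itself states.

```c
#include <stdint.h>

/* Hypothetical subset of the meta-page: only the two counters from this patch. */
struct meta_counters {
	uint64_t pages_lost;	/* pages overwritten by the writer */
	uint64_t pages_touched;	/* pages written by the writer */
};

/*
 * Assumed estimate: pages written minus pages overwritten.
 * Returns 0 if the snapshot is inconsistent (lost > touched).
 */
static uint64_t used_pages(const struct meta_counters *m)
{
	if (m->pages_lost > m->pages_touched)
		return 0;

	return m->pages_touched - m->pages_lost;
}
```

The consistency check matters because the counters are updated by the
writer while the reader samples them, so a snapshot may be briefly
incoherent.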

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index c102ef35d11e..e8185889a1c8 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -17,8 +17,8 @@
  * @entries:		Number of entries in the ring-buffer.
  * @overrun:		Number of entries lost in the ring-buffer.
  * @read:		Number of entries that have been read.
- * @Reserved1:		Internal use only.
- * @Reserved2:		Internal use only.
+ * @pages_lost:		Number of pages overwritten by the writer.
+ * @pages_touched:	Number of pages written by the writer.
  */
 struct trace_buffer_meta {
 	__u32		meta_page_size;
@@ -39,8 +39,8 @@ struct trace_buffer_meta {
 	__u64	overrun;
 	__u64	read;
 
-	__u64	Reserved1;
-	__u64	Reserved2;
+	__u64	pages_lost;
+	__u64	pages_touched;
 };
 
 #define TRACE_MMAP_IOCTL_GET_READER		_IO('R', 0x20)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f16f053ef77d..1b8be41faa78 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -6154,6 +6154,8 @@ static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
 	meta->entries = local_read(&cpu_buffer->entries);
 	meta->overrun = local_read(&cpu_buffer->overrun);
 	meta->read = cpu_buffer->read;
+	meta->pages_lost = local_read(&cpu_buffer->pages_lost);
+	meta->pages_touched = local_read(&cpu_buffer->pages_touched);
 
 	/* Some archs do not have data cache coherency between kernel and user-space */
 	flush_kernel_vmap_range(cpu_buffer->meta_page, PAGE_SIZE);
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply related

* [PATCH v13 02/32] ring-buffer: Store bpage pointers into subbuf_ids
From: Vincent Donnefort @ 2026-03-06 14:35 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel, maz,
	oliver.upton, joey.gouly, suzuki.poulose, yuzenghui
  Cc: kvmarm, linux-arm-kernel, jstultz, qperret, will, aneesh.kumar,
	kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260306143536.339777-1-vdonnefort@google.com>

The subbuf_ids field makes it possible to look up a specific page of the
ring-buffer by its ID. In preparation for the upcoming ring-buffer
remote support, make this array point to the buffer_page instead of the
buffer_data_page.

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 1b8be41faa78..45ec49d6c81d 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -555,7 +555,7 @@ struct ring_buffer_per_cpu {
 	unsigned int			mapped;
 	unsigned int			user_mapped;	/* user space mapping */
 	struct mutex			mapping_lock;
-	unsigned long			*subbuf_ids;	/* ID to subbuf VA */
+	struct buffer_page		**subbuf_ids;	/* ID to subbuf VA */
 	struct trace_buffer_meta	*meta_page;
 	struct ring_buffer_cpu_meta	*ring_meta;
 
@@ -7036,7 +7036,7 @@ static void rb_free_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
 }
 
 static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer,
-				   unsigned long *subbuf_ids)
+				   struct buffer_page **subbuf_ids)
 {
 	struct trace_buffer_meta *meta = cpu_buffer->meta_page;
 	unsigned int nr_subbufs = cpu_buffer->nr_pages + 1;
@@ -7045,7 +7045,7 @@ static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer,
 	int id = 0;
 
 	id = rb_page_id(cpu_buffer, cpu_buffer->reader_page, id);
-	subbuf_ids[id++] = (unsigned long)cpu_buffer->reader_page->page;
+	subbuf_ids[id++] = cpu_buffer->reader_page;
 	cnt++;
 
 	first_subbuf = subbuf = rb_set_head_page(cpu_buffer);
@@ -7055,7 +7055,7 @@ static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer,
 		if (WARN_ON(id >= nr_subbufs))
 			break;
 
-		subbuf_ids[id] = (unsigned long)subbuf->page;
+		subbuf_ids[id] = subbuf;
 
 		rb_inc_page(&subbuf);
 		id++;
@@ -7064,7 +7064,7 @@ static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer,
 
 	WARN_ON(cnt != nr_subbufs);
 
-	/* install subbuf ID to kern VA translation */
+	/* install subbuf ID to bpage translation */
 	cpu_buffer->subbuf_ids = subbuf_ids;
 
 	meta->meta_struct_len = sizeof(*meta);
@@ -7220,13 +7220,15 @@ static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
 	}
 
 	while (p < nr_pages) {
+		struct buffer_page *subbuf;
 		struct page *page;
 		int off = 0;
 
 		if (WARN_ON_ONCE(s >= nr_subbufs))
 			return -EINVAL;
 
-		page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);
+		subbuf = cpu_buffer->subbuf_ids[s];
+		page = virt_to_page((void *)subbuf->page);
 
 		for (; off < (1 << (subbuf_order)); off++, page++) {
 			if (p >= nr_pages)
@@ -7253,7 +7255,8 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu,
 		    struct vm_area_struct *vma)
 {
 	struct ring_buffer_per_cpu *cpu_buffer;
-	unsigned long flags, *subbuf_ids;
+	struct buffer_page **subbuf_ids;
+	unsigned long flags;
 	int err;
 
 	if (!cpumask_test_cpu(cpu, buffer->cpumask))
@@ -7277,7 +7280,7 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu,
 	if (err)
 		return err;
 
-	/* subbuf_ids include the reader while nr_pages does not */
+	/* subbuf_ids includes the reader while nr_pages does not */
 	subbuf_ids = kcalloc(cpu_buffer->nr_pages + 1, sizeof(*subbuf_ids), GFP_KERNEL);
 	if (!subbuf_ids) {
 		rb_free_meta_page(cpu_buffer);
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply related

* [PATCH v13 03/32] ring-buffer: Introduce ring-buffer remotes
From: Vincent Donnefort @ 2026-03-06 14:35 UTC (permalink / raw)
  To: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel, maz,
	oliver.upton, joey.gouly, suzuki.poulose, yuzenghui
  Cc: kvmarm, linux-arm-kernel, jstultz, qperret, will, aneesh.kumar,
	kernel-team, linux-kernel, Vincent Donnefort
In-Reply-To: <20260306143536.339777-1-vdonnefort@google.com>

A ring-buffer remote is an entity outside of the kernel (most likely
firmware or a hypervisor) capable of writing events into a ring-buffer
following the same format as the tracefs ring-buffer.

To set up the ring-buffer on the kernel side, a description of the pages
forming the ring-buffer (struct trace_buffer_desc) must be given.
Callbacks (swap_reader_page and reset) must also be provided.

The remote is expected to keep the meta-page updated.
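
The descriptor layout can be sketched in user space as follows. This is a
simplified mirror of the structures added by this patch, not kernel code:
__counted_by is dropped, the __data member is named data, the callbacks
are left out, and the helpers rb_desc_len(), next_desc(), build_desc() and
find_desc() are hypothetical names used only for illustration. It shows
how the per-CPU descriptors are packed back to back and walked by
variable-length stride.

```c
#include <stddef.h>
#include <stdlib.h>

/* User-space mirror of the per-CPU descriptor. */
struct ring_buffer_desc {
	int		cpu;
	unsigned int	nr_page_va;	/* excludes the meta page */
	unsigned long	meta_va;
	unsigned long	page_va[];
};

struct trace_buffer_desc {
	int		nr_cpus;
	size_t		struct_len;
	char		data[];		/* packed list of ring_buffer_desc */
};

/* Size in bytes of one descriptor holding nr_page_va page addresses. */
static size_t rb_desc_len(unsigned int nr_page_va)
{
	return sizeof(struct ring_buffer_desc) +
	       nr_page_va * sizeof(unsigned long);
}

/* Step over one variable-length descriptor to reach the next. */
static struct ring_buffer_desc *next_desc(struct ring_buffer_desc *d)
{
	return (struct ring_buffer_desc *)((char *)d + rb_desc_len(d->nr_page_va));
}

/* Allocate a zeroed descriptor for nr_cpus CPUs with nr_pages pages each. */
static struct trace_buffer_desc *build_desc(int nr_cpus, unsigned int nr_pages)
{
	size_t len = offsetof(struct trace_buffer_desc, data) +
		     (size_t)nr_cpus * rb_desc_len(nr_pages);
	struct trace_buffer_desc *t = calloc(1, len);
	struct ring_buffer_desc *d;
	int cpu;

	if (!t)
		return NULL;

	t->nr_cpus = nr_cpus;
	t->struct_len = len;

	d = (struct ring_buffer_desc *)t->data;
	for (cpu = 0; cpu < nr_cpus; cpu++) {
		d->cpu = cpu;
		d->nr_page_va = nr_pages;	/* must be set before stepping */
		d = next_desc(d);
	}

	return t;
}

/* Linear search for the descriptor of a given CPU, NULL if absent. */
static struct ring_buffer_desc *find_desc(struct trace_buffer_desc *t, int cpu)
{
	struct ring_buffer_desc *d = (struct ring_buffer_desc *)t->data;
	int i;

	for (i = 0; i < t->nr_cpus; i++, d = next_desc(d)) {
		if (d->cpu == cpu)
			return d;
	}

	return NULL;
}
```

A caller would fill in meta_va and page_va[] for each CPU after
build_desc() and release the whole description with free().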

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 876358cfe1b1..41193c5b0d28 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -250,4 +250,62 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu,
 		    struct vm_area_struct *vma);
 int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
 int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
+
+struct ring_buffer_desc {
+	int		cpu;
+	unsigned int	nr_page_va; /* excludes the meta page */
+	unsigned long	meta_va;
+	unsigned long	page_va[] __counted_by(nr_page_va);
+};
+
+struct trace_buffer_desc {
+	int		nr_cpus;
+	size_t		struct_len;
+	char		__data[]; /* list of ring_buffer_desc */
+};
+
+static inline struct ring_buffer_desc *__next_ring_buffer_desc(struct ring_buffer_desc *desc)
+{
+	size_t len = struct_size(desc, page_va, desc->nr_page_va);
+
+	return (struct ring_buffer_desc *)((void *)desc + len);
+}
+
+static inline struct ring_buffer_desc *__first_ring_buffer_desc(struct trace_buffer_desc *desc)
+{
+	return (struct ring_buffer_desc *)(&desc->__data[0]);
+}
+
+static inline size_t trace_buffer_desc_size(size_t buffer_size, unsigned int nr_cpus)
+{
+	unsigned int nr_pages = max(DIV_ROUND_UP(buffer_size, PAGE_SIZE), 2UL) + 1;
+	struct ring_buffer_desc *rbdesc;
+
+	return size_add(offsetof(struct trace_buffer_desc, __data),
+			size_mul(nr_cpus, struct_size(rbdesc, page_va, nr_pages)));
+}
+
+#define for_each_ring_buffer_desc(__pdesc, __cpu, __trace_pdesc)		\
+	for (__pdesc = __first_ring_buffer_desc(__trace_pdesc), __cpu = 0;	\
+	     (__cpu) < (__trace_pdesc)->nr_cpus;				\
+	     (__cpu)++, __pdesc = __next_ring_buffer_desc(__pdesc))
+
+struct ring_buffer_remote {
+	struct trace_buffer_desc	*desc;
+	int				(*swap_reader_page)(unsigned int cpu, void *priv);
+	int				(*reset)(unsigned int cpu, void *priv);
+	void				*priv;
+};
+
+int ring_buffer_poll_remote(struct trace_buffer *buffer, int cpu);
+
+struct trace_buffer *
+__ring_buffer_alloc_remote(struct ring_buffer_remote *remote,
+			   struct lock_class_key *key);
+
+#define ring_buffer_alloc_remote(remote)			\
+({								\
+	static struct lock_class_key __key;			\
+	__ring_buffer_alloc_remote(remote, &__key);		\
+})
 #endif /* _LINUX_RING_BUFFER_H */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 45ec49d6c81d..da0cd8e82105 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -559,6 +559,8 @@ struct ring_buffer_per_cpu {
 	struct trace_buffer_meta	*meta_page;
 	struct ring_buffer_cpu_meta	*ring_meta;
 
+	struct ring_buffer_remote	*remote;
+
 	/* ring buffer pages to update, > 0 to add, < 0 to remove */
 	long				nr_pages_to_update;
 	struct list_head		new_pages; /* new pages to add */
@@ -581,6 +583,8 @@ struct trace_buffer {
 
 	struct ring_buffer_per_cpu	**buffers;
 
+	struct ring_buffer_remote	*remote;
+
 	struct hlist_node		node;
 	u64				(*clock)(void);
 
@@ -2238,6 +2242,40 @@ static void rb_meta_buffer_update(struct ring_buffer_per_cpu *cpu_buffer,
 	}
 }
 
+static struct ring_buffer_desc *ring_buffer_desc(struct trace_buffer_desc *trace_desc, int cpu)
+{
+	struct ring_buffer_desc *desc, *end;
+	size_t len;
+	int i;
+
+	if (!trace_desc)
+		return NULL;
+
+	if (cpu >= trace_desc->nr_cpus)
+		return NULL;
+
+	end = (struct ring_buffer_desc *)((void *)trace_desc + trace_desc->struct_len);
+	desc = __first_ring_buffer_desc(trace_desc);
+	len = struct_size(desc, page_va, desc->nr_page_va);
+	desc = (struct ring_buffer_desc *)((void *)desc + (len * cpu));
+
+	if (desc < end && desc->cpu == cpu)
+		return desc;
+
+	/* Missing CPUs, need to linear search */
+	for_each_ring_buffer_desc(desc, i, trace_desc) {
+		if (desc->cpu == cpu)
+			return desc;
+	}
+
+	return NULL;
+}
+
+static void *ring_buffer_desc_page(struct ring_buffer_desc *desc, int page_id)
+{
+	return page_id > desc->nr_page_va ? NULL : (void *)desc->page_va[page_id];
+}
+
 static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 		long nr_pages, struct list_head *pages)
 {
@@ -2245,6 +2283,7 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 	struct ring_buffer_cpu_meta *meta = NULL;
 	struct buffer_page *bpage, *tmp;
 	bool user_thread = current->mm != NULL;
+	struct ring_buffer_desc *desc = NULL;
 	long i;
 
 	/*
@@ -2273,6 +2312,12 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 	if (buffer->range_addr_start)
 		meta = rb_range_meta(buffer, nr_pages, cpu_buffer->cpu);
 
+	if (buffer->remote) {
+		desc = ring_buffer_desc(buffer->remote->desc, cpu_buffer->cpu);
+		if (!desc || WARN_ON(desc->nr_page_va != (nr_pages + 1)))
+			return -EINVAL;
+	}
+
 	for (i = 0; i < nr_pages; i++) {
 
 		bpage = alloc_cpu_page(cpu_buffer->cpu);
@@ -2297,6 +2342,16 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 				rb_meta_buffer_update(cpu_buffer, bpage);
 			bpage->range = 1;
 			bpage->id = i + 1;
+		} else if (desc) {
+			void *p = ring_buffer_desc_page(desc, i + 1);
+
+			if (WARN_ON(!p))
+				goto free_pages;
+
+			bpage->page = p;
+			bpage->range = 1; /* bpage->page can't be freed */
+			bpage->id = i + 1;
+			cpu_buffer->subbuf_ids[i + 1] = bpage;
 		} else {
 			int order = cpu_buffer->buffer->subbuf_order;
 			bpage->page = alloc_cpu_data(cpu_buffer->cpu, order);
@@ -2394,6 +2449,30 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
 		if (cpu_buffer->ring_meta->head_buffer)
 			rb_meta_buffer_update(cpu_buffer, bpage);
 		bpage->range = 1;
+	} else if (buffer->remote) {
+		struct ring_buffer_desc *desc = ring_buffer_desc(buffer->remote->desc, cpu);
+
+		if (!desc)
+			goto fail_free_reader;
+
+		cpu_buffer->remote = buffer->remote;
+		cpu_buffer->meta_page = (struct trace_buffer_meta *)(void *)desc->meta_va;
+		cpu_buffer->nr_pages = nr_pages;
+		cpu_buffer->subbuf_ids = kcalloc(cpu_buffer->nr_pages + 1,
+						 sizeof(*cpu_buffer->subbuf_ids), GFP_KERNEL);
+		if (!cpu_buffer->subbuf_ids)
+			goto fail_free_reader;
+
+		/* Remote buffers are read-only and immutable */
+		atomic_inc(&cpu_buffer->record_disabled);
+		atomic_inc(&cpu_buffer->resize_disabled);
+
+		bpage->page = ring_buffer_desc_page(desc, cpu_buffer->meta_page->reader.id);
+		if (!bpage->page)
+			goto fail_free_reader;
+
+		bpage->range = 1;
+		cpu_buffer->subbuf_ids[0] = bpage;
 	} else {
 		int order = cpu_buffer->buffer->subbuf_order;
 		bpage->page = alloc_cpu_data(cpu, order);
@@ -2453,6 +2532,9 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
 
 	irq_work_sync(&cpu_buffer->irq_work.work);
 
+	if (cpu_buffer->remote)
+		kfree(cpu_buffer->subbuf_ids);
+
 	free_buffer_page(cpu_buffer->reader_page);
 
 	if (head) {
@@ -2475,7 +2557,8 @@ static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
 					 int order, unsigned long start,
 					 unsigned long end,
 					 unsigned long scratch_size,
-					 struct lock_class_key *key)
+					 struct lock_class_key *key,
+					 struct ring_buffer_remote *remote)
 {
 	struct trace_buffer *buffer __free(kfree) = NULL;
 	long nr_pages;
@@ -2515,6 +2598,8 @@ static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
 	if (!buffer->buffers)
 		goto fail_free_cpumask;
 
+	cpu = raw_smp_processor_id();
+
 	/* If start/end are specified, then that overrides size */
 	if (start && end) {
 		unsigned long buffers_start;
@@ -2570,6 +2655,15 @@ static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
 		buffer->range_addr_end = end;
 
 		rb_range_meta_init(buffer, nr_pages, scratch_size);
+	} else if (remote) {
+		struct ring_buffer_desc *desc = ring_buffer_desc(remote->desc, cpu);
+
+		buffer->remote = remote;
+		/* The writer is remote. This ring-buffer is read-only */
+		atomic_inc(&buffer->record_disabled);
+		nr_pages = desc->nr_page_va - 1;
+		if (nr_pages < 2)
+			goto fail_free_buffers;
 	} else {
 
 		/* need at least two pages */
@@ -2578,7 +2672,6 @@ static struct trace_buffer *alloc_buffer(unsigned long size, unsigned flags,
 			nr_pages = 2;
 	}
 
-	cpu = raw_smp_processor_id();
 	cpumask_set_cpu(cpu, buffer->cpumask);
 	buffer->buffers[cpu] = rb_allocate_cpu_buffer(buffer, nr_pages, cpu);
 	if (!buffer->buffers[cpu])
@@ -2620,7 +2713,7 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
 					struct lock_class_key *key)
 {
 	/* Default buffer page size - one system page */
-	return alloc_buffer(size, flags, 0, 0, 0, 0, key);
+	return alloc_buffer(size, flags, 0, 0, 0, 0, key, NULL);
 
 }
 EXPORT_SYMBOL_GPL(__ring_buffer_alloc);
@@ -2647,7 +2740,18 @@ struct trace_buffer *__ring_buffer_alloc_range(unsigned long size, unsigned flag
 					       struct lock_class_key *key)
 {
 	return alloc_buffer(size, flags, order, start, start + range_size,
-			    scratch_size, key);
+			    scratch_size, key, NULL);
+}
+
+/**
+ * __ring_buffer_alloc_remote - allocate a new ring_buffer from a remote
+ * @remote: Contains a description of the ring-buffer pages and remote callbacks.
+ * @key: ring buffer reader_lock_key.
+ */
+struct trace_buffer *__ring_buffer_alloc_remote(struct ring_buffer_remote *remote,
+						struct lock_class_key *key)
+{
+	return alloc_buffer(0, 0, 0, 0, 0, 0, key, remote);
 }
 
 void *ring_buffer_meta_scratch(struct trace_buffer *buffer, unsigned int *size)
@@ -5274,6 +5378,16 @@ unsigned long ring_buffer_overruns(struct trace_buffer *buffer)
 }
 EXPORT_SYMBOL_GPL(ring_buffer_overruns);
 
+static bool rb_read_remote_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	local_set(&cpu_buffer->entries, READ_ONCE(cpu_buffer->meta_page->entries));
+	local_set(&cpu_buffer->overrun, READ_ONCE(cpu_buffer->meta_page->overrun));
+	local_set(&cpu_buffer->pages_touched, READ_ONCE(cpu_buffer->meta_page->pages_touched));
+	local_set(&cpu_buffer->pages_lost, READ_ONCE(cpu_buffer->meta_page->pages_lost));
+
+	return rb_num_of_entries(cpu_buffer);
+}
+
 static void rb_iter_reset(struct ring_buffer_iter *iter)
 {
 	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
@@ -5428,7 +5542,43 @@ rb_update_iter_read_stamp(struct ring_buffer_iter *iter,
 }
 
 static struct buffer_page *
-rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
+__rb_get_reader_page_from_remote(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct buffer_page *new_reader, *prev_reader;
+
+	if (!rb_read_remote_meta_page(cpu_buffer))
+		return NULL;
+
+	/* More to read on the reader page */
+	if (cpu_buffer->reader_page->read < rb_page_size(cpu_buffer->reader_page)) {
+		if (!cpu_buffer->reader_page->read)
+			cpu_buffer->read_stamp = cpu_buffer->reader_page->page->time_stamp;
+		return cpu_buffer->reader_page;
+	}
+
+	prev_reader = cpu_buffer->subbuf_ids[cpu_buffer->meta_page->reader.id];
+
+	WARN_ON_ONCE(cpu_buffer->remote->swap_reader_page(cpu_buffer->cpu,
+							  cpu_buffer->remote->priv));
+	/* nr_pages doesn't include the reader page */
+	if (WARN_ON_ONCE(cpu_buffer->meta_page->reader.id > cpu_buffer->nr_pages))
+		return NULL;
+
+	new_reader = cpu_buffer->subbuf_ids[cpu_buffer->meta_page->reader.id];
+
+	WARN_ON_ONCE(prev_reader == new_reader);
+
+	cpu_buffer->reader_page->page = new_reader->page;
+	cpu_buffer->reader_page->id = new_reader->id;
+	cpu_buffer->reader_page->read = 0;
+	cpu_buffer->read_stamp = cpu_buffer->reader_page->page->time_stamp;
+	cpu_buffer->lost_events = cpu_buffer->meta_page->reader.lost_events;
+
+	return rb_page_size(cpu_buffer->reader_page) ? cpu_buffer->reader_page : NULL;
+}
+
+static struct buffer_page *
+__rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
 {
 	struct buffer_page *reader = NULL;
 	unsigned long bsize = READ_ONCE(cpu_buffer->buffer->subbuf_size);
@@ -5598,6 +5748,13 @@ rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
 	return reader;
 }
 
+static struct buffer_page *
+rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return cpu_buffer->remote ? __rb_get_reader_page_from_remote(cpu_buffer) :
+				    __rb_get_reader_page(cpu_buffer);
+}
+
 static void rb_advance_reader(struct ring_buffer_per_cpu *cpu_buffer)
 {
 	struct ring_buffer_event *event;
@@ -5998,7 +6155,7 @@ ring_buffer_read_start(struct trace_buffer *buffer, int cpu, gfp_t flags)
 	struct ring_buffer_per_cpu *cpu_buffer;
 	struct ring_buffer_iter *iter;
 
-	if (!cpumask_test_cpu(cpu, buffer->cpumask))
+	if (!cpumask_test_cpu(cpu, buffer->cpumask) || buffer->remote)
 		return NULL;
 
 	iter = kzalloc_obj(*iter, flags);
@@ -6166,6 +6323,23 @@ rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
 {
 	struct buffer_page *page;
 
+	if (cpu_buffer->remote) {
+		if (!cpu_buffer->remote->reset)
+			return;
+
+		cpu_buffer->remote->reset(cpu_buffer->cpu, cpu_buffer->remote->priv);
+		rb_read_remote_meta_page(cpu_buffer);
+
+		/* Read related values, not covered by the meta-page */
+		local_set(&cpu_buffer->pages_read, 0);
+		cpu_buffer->read = 0;
+		cpu_buffer->read_bytes = 0;
+		cpu_buffer->last_overrun = 0;
+		cpu_buffer->reader_page->read = 0;
+
+		return;
+	}
+
 	rb_head_page_deactivate(cpu_buffer);
 
 	cpu_buffer->head_page
@@ -6396,6 +6570,48 @@ bool ring_buffer_empty_cpu(struct trace_buffer *buffer, int cpu)
 }
 EXPORT_SYMBOL_GPL(ring_buffer_empty_cpu);
 
+int ring_buffer_poll_remote(struct trace_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (cpu != RING_BUFFER_ALL_CPUS) {
+		if (!cpumask_test_cpu(cpu, buffer->cpumask))
+			return -EINVAL;
+
+		cpu_buffer = buffer->buffers[cpu];
+
+		guard(raw_spinlock)(&cpu_buffer->reader_lock);
+		if (rb_read_remote_meta_page(cpu_buffer))
+			rb_wakeups(buffer, cpu_buffer);
+
+		return 0;
+	}
+
+	cpus_read_lock();
+
+	/*
+	 * Make sure all the ring buffers are up to date before we start reading
+	 * them.
+	 */
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+
+		guard(raw_spinlock)(&cpu_buffer->reader_lock);
+		rb_read_remote_meta_page(cpu_buffer);
+	}
+
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+
+		if (rb_num_of_entries(cpu_buffer))
+			rb_wakeups(buffer, cpu_buffer);
+	}
+
+	cpus_read_unlock();
+
+	return 0;
+}
+
 #ifdef CONFIG_RING_BUFFER_ALLOW_SWAP
 /**
  * ring_buffer_swap_cpu - swap a CPU buffer between two ring buffers
@@ -6634,6 +6850,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 	unsigned int commit;
 	unsigned int read;
 	u64 save_timestamp;
+	bool force_memcpy;
 
 	if (!cpumask_test_cpu(cpu, buffer->cpumask))
 		return -1;
@@ -6671,6 +6888,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 	/* Check if any events were dropped */
 	missed_events = cpu_buffer->lost_events;
 
+	force_memcpy = cpu_buffer->mapped || cpu_buffer->remote;
+
 	/*
 	 * If this page has been partially read or
 	 * if len is not big enough to read the rest of the page or
@@ -6680,7 +6899,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 	 */
 	if (read || (len < (commit - read)) ||
 	    cpu_buffer->reader_page == cpu_buffer->commit_page ||
-	    cpu_buffer->mapped) {
+	    force_memcpy) {
 		struct buffer_data_page *rpage = cpu_buffer->reader_page->page;
 		unsigned int rpos = read;
 		unsigned int pos = 0;
@@ -7259,7 +7478,7 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu,
 	unsigned long flags;
 	int err;
 
-	if (!cpumask_test_cpu(cpu, buffer->cpumask))
+	if (!cpumask_test_cpu(cpu, buffer->cpumask) || buffer->remote)
 		return -EINVAL;
 
 	cpu_buffer = buffer->buffers[cpu];
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply related
