* [for-linus][PATCH 0/8] tracing: Updates for v6.7-rc4
From: Steven Rostedt @ 2023-12-07 2:37 UTC
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton
tracing fixes for v6.7-rc4:
- Snapshot buffer issues

  1. When instances were given the ability to use latency tracers, they
     gained a snapshot buffer (a second buffer that is not written to,
     but is swapped with the main buffer that is; a conceptual sketch
     follows this list). The snapshot buffer needs to be the same size
     as the main buffer. But when snapshot buffers were added to
     instances, the code that keeps the snapshot buffer the same size
     as the main buffer was still only applied to the top-level buffer
     and not to the instances.

  2. The current tracer needs to be stopped while resizing the buffers.
     Otherwise there is a race if the tracer decides to take a snapshot
     between the resize of the main buffer and the resize of the
     snapshot buffer.

  3. When a tracer is "stopped", it disables both the main buffer and
     the snapshot buffer. This needs to be done for instances too, and
     not only for the top-level buffer, now that instances also have a
     snapshot buffer.
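The size requirement in item 1 comes from the swap itself. A conceptual
sketch of what the latency tracers do (an illustration only; the real
check is quoted in patch 1 below):

	/* The tracer swaps the per-CPU buffers of the main and the
	 * snapshot ("max") ring buffer in place, and the swap refuses
	 * buffers with a different number of pages: */
	if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages)
		return -EINVAL;
	swap(buffer_a->buffers[cpu], buffer_b->buffers[cpu]);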
- Buffered event filtering issues

  When filtering is enabled, events are often dropped, so it is quicker
  to copy an event into a per-CPU temporary buffer and then either
  write it into the main buffer (if it passes the filter) or simply
  drop it (if it does not), than to write the event into the ring
  buffer and then try to discard it. This temporary buffer is allocated
  dynamically and its teardown needs special synchronization. There
  were some issues with that (a sketch of the fast path follows this
  list):

  1. When disabling the filter and freeing the buffer, a call on all
     CPUs is required to stop each per-CPU usage. But the code called
     smp_call_function_many(), which does not run the callback on the
     current CPU. If the task is then migrated to another CPU, the CPU
     it was originally on never has its buffered-event usage updated,
     and this causes issues later on (see patch 4 for the full race).
     Use on_each_cpu_mask() instead, which includes the current CPU.

  2. When the allocation of the buffered event fails, a warning is
     issued. But the buffered event is just an optimization (it is
     still fine to write into the ring buffer directly), so do not
     WARN in this case.

  3. Freeing the buffered event requires synchronization. First the
     per-CPU counters are raised so that no new uses of the buffers
     can start, then the buffers are freed and the buffered event
     pointers are set to NULL, and finally the counters are released
     again. There is a synchronize_rcu() between raising the counters
     and the freeing, but only an smp_wmb() between the freeing and
     releasing the counters. It is theoretically possible for a user
     that read the old pointer to miss the raised counter and use the
     buffer after it has been freed. Another synchronize_rcu() is
     needed in place of that smp_wmb().
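A simplified sketch of the buffered-event fast path described above
(condensed from trace_event_buffer_lock_reserve(); the exact flags and
conditions in the kernel differ slightly):

	struct ring_buffer_event *entry;

	preempt_disable_notrace();
	if ((trace_file->flags & EVENT_FILE_FL_FILTERED) &&
	    (entry = __this_cpu_read(trace_buffered_event))) {
		/* Claim this CPU's temp buffer (0 -> 1 means it is free) */
		int val = this_cpu_inc_return(trace_buffered_event_cnt);

		if (val == 1 && len <= FTRACE_MAX_EVENT) {
			trace_event_setup(entry, type, trace_ctx);
			entry->array[0] = len;
			return entry;	/* the filter runs on this copy */
		}
		this_cpu_dec(trace_buffered_event_cnt);
	}
	/* Fall back to reserving space in the ring buffer itself */
	return __trace_buffer_lock_reserve(buffer, type, len, trace_ctx);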
- Ring buffer timestamps on 32-bit machines

  The ring buffer timestamp on 32-bit machines has to break the 64-bit
  number into multiple values, as cmpxchg is required on it and a
  64-bit cmpxchg on 32-bit architectures is very slow. The code used to
  use just two 32-bit values, forming a 60-bit timestamp where the
  remaining 4 bits served as counters for synchronization. It later
  became known that the timestamp on 32-bit machines still needs all
  64 bits in some cases, so three words were created to hold the 64
  bits. But issues arose with this (a simplified model follows this
  list):

  1. The synchronization logic still only compared the counter of the
     first two words, not the third, so the synchronization could fail
     unnoticed.

  2. A check on discard of an event could race if another event came in
     between the discard and the update of one of the time stamps. The
     stamp needs to be updated (forcing an absolute timestamp instead
     of a delta) before the actual discard happens.
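For illustration, a self-contained model of the three-word scheme and
the read-side counter check (a simplified stand-in, not the kernel's
rb_time_t code, which packs the bits differently):

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdint.h>

	#define SHIFT	30			/* payload bits per word */
	#define VAL(w)	((w) & ((1u << SHIFT) - 1))
	#define CNT(w)	((w) >> SHIFT)		/* top bits: update counter */

	struct rb_time {			/* simplified rb_time_t */
		atomic_uint top, bottom, msb;
	};

	/* A read is valid only if all three words carry the same counter,
	 * i.e. no writer interrupted us between loading the words. The
	 * bug fixed by patch 8 was comparing only the first two words. */
	static bool rb_time_read(struct rb_time *t, uint64_t *ret)
	{
		unsigned int top    = atomic_load(&t->top);
		unsigned int bottom = atomic_load(&t->bottom);
		unsigned int msb    = atomic_load(&t->msb);

		if (CNT(top) != CNT(bottom) || CNT(top) != CNT(msb))
			return false;	/* interrupted a write; retry */

		*ret = ((uint64_t)VAL(msb) << (2 * SHIFT)) |
		       ((uint64_t)VAL(top) << SHIFT) | VAL(bottom);
		return true;
	}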
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
trace/urgent
Head SHA1: f458a1453424e03462b5bb539673c9a3cddda480
Petr Pavlu (3):
tracing: Fix incomplete locking when disabling buffered events
tracing: Fix a warning when allocating buffered events fails
tracing: Fix a possible race when disabling buffered events
Steven Rostedt (Google) (5):
tracing: Always update snapshot buffer size
tracing: Stop current tracer when resizing buffer
tracing: Disable snapshot buffer when stopping instance tracers
ring-buffer: Force absolute timestamp on discard of event
ring-buffer: Test last update in 32bit version of __rb_time_read()
----
kernel/trace/ring_buffer.c | 23 +++----
kernel/trace/trace.c | 158 +++++++++++++++++----------------------------
2 files changed, 69 insertions(+), 112 deletions(-)
* [for-linus][PATCH 1/8] tracing: Always update snapshot buffer size
From: Steven Rostedt @ 2023-12-07 2:37 UTC
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
stable
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
It used to be that only the top-level instance had a snapshot buffer
(used by latency tracers like wakeup and irqsoff). The ring buffer size
update would check whether the instance was the top level and, if so,
also update the snapshot buffer, as it needs to be the same size as the
main buffer.

Now that lower-level instances also have a snapshot buffer, they too
need their snapshot buffer size updated when the main buffer is
changed, otherwise the following can be triggered:
# cd /sys/kernel/tracing
# echo 1500 > buffer_size_kb
# mkdir instances/foo
# echo irqsoff > instances/foo/current_tracer
# echo 1000 > instances/foo/buffer_size_kb
Produces:
WARNING: CPU: 2 PID: 856 at kernel/trace/trace.c:1938 update_max_tr_single.part.0+0x27d/0x320
Which is:
ret = ring_buffer_swap_cpu(tr->max_buffer.buffer, tr->array_buffer.buffer, cpu);
if (ret == -EBUSY) {
[..]
}
WARN_ON_ONCE(ret && ret != -EAGAIN && ret != -EBUSY); <== here
That's because ring_buffer_swap_cpu() has:
int ret = -EINVAL;
[..]
/* At least make sure the two buffers are somewhat the same */
if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages)
goto out;
[..]
out:
return ret;
}
Instead, update all instances' snapshot buffer sizes when their main
buffer size is updated.
Link: https://lkml.kernel.org/r/20231205220010.454662151@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 6d9b3fa5e7f6 ("tracing: Move tracing_max_latency into trace_array")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/trace.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 9aebf904ff97..231c173ec04f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6392,8 +6392,7 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
return ret;
#ifdef CONFIG_TRACER_MAX_TRACE
- if (!(tr->flags & TRACE_ARRAY_FL_GLOBAL) ||
- !tr->current_trace->use_max_tr)
+ if (!tr->current_trace->use_max_tr)
goto out;
ret = ring_buffer_resize(tr->max_buffer.buffer, size, cpu);
--
2.42.0
* [for-linus][PATCH 2/8] tracing: Stop current tracer when resizing buffer
From: Steven Rostedt @ 2023-12-07 2:37 UTC
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
stable
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
When the ring buffer is being resized, it can cause side effects for the
running tracer. For instance, there is a race with the irqsoff tracer,
which swaps individual per-CPU buffers between the main buffer and the
snapshot buffer. The resize operation modifies the main buffer and then
the snapshot buffer. If a swap happens in between those two operations,
it will break the tracer.

Simply stop the running tracer before resizing the buffers and enable it
again when finished, as in the condensed sketch below.
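The resulting order of operations in __tracing_resize_ring_buffer(),
condensed from the diff that follows:

	tracing_stop_tr(tr);	/* no snapshot swaps while stopped */
	ret = ring_buffer_resize(tr->array_buffer.buffer, size, cpu);
	...
	ret = ring_buffer_resize(tr->max_buffer.buffer, size, cpu);
	...
	tracing_start_tr(tr);	/* the tracer may swap buffers again */
	return ret;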
Link: https://lkml.kernel.org/r/20231205220010.748996423@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 3928a8a2d9808 ("ftrace: make work with new ring buffer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/trace.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 231c173ec04f..e978868b1a22 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6387,9 +6387,12 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
if (!tr->array_buffer.buffer)
return 0;
+ /* Do not allow tracing while resizng ring buffer */
+ tracing_stop_tr(tr);
+
ret = ring_buffer_resize(tr->array_buffer.buffer, size, cpu);
if (ret < 0)
- return ret;
+ goto out_start;
#ifdef CONFIG_TRACER_MAX_TRACE
if (!tr->current_trace->use_max_tr)
@@ -6417,7 +6420,7 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
WARN_ON(1);
tracing_disabled = 1;
}
- return ret;
+ goto out_start;
}
update_buffer_entries(&tr->max_buffer, cpu);
@@ -6426,7 +6429,8 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
#endif /* CONFIG_TRACER_MAX_TRACE */
update_buffer_entries(&tr->array_buffer, cpu);
-
+ out_start:
+ tracing_start_tr(tr);
return ret;
}
--
2.42.0
* [for-linus][PATCH 3/8] tracing: Disable snapshot buffer when stopping instance tracers
From: Steven Rostedt @ 2023-12-07 2:37 UTC
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
stable
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
It used to be that only the top-level instance had a snapshot buffer
(used by latency tracers like wakeup and irqsoff). Stopping a tracer in
an instance would not disable the snapshot buffer, which could have
unintended consequences if the irqsoff tracer is enabled.

Consolidate tracing_start/stop() with tracing_start/stop_tr() so that
all instances behave the same. The tracing_start/stop() functions now
just call their respective tracing_start/stop_tr() with global_trace
passed in.
Link: https://lkml.kernel.org/r/20231205220011.041220035@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 6d9b3fa5e7f6 ("tracing: Move tracing_max_latency into trace_array")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/trace.c | 110 +++++++++++++------------------------------
1 file changed, 34 insertions(+), 76 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index e978868b1a22..2492c6c76850 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2360,13 +2360,7 @@ int is_tracing_stopped(void)
return global_trace.stop_count;
}
-/**
- * tracing_start - quick start of the tracer
- *
- * If tracing is enabled but was stopped by tracing_stop,
- * this will start the tracer back up.
- */
-void tracing_start(void)
+static void tracing_start_tr(struct trace_array *tr)
{
struct trace_buffer *buffer;
unsigned long flags;
@@ -2374,119 +2368,83 @@ void tracing_start(void)
if (tracing_disabled)
return;
- raw_spin_lock_irqsave(&global_trace.start_lock, flags);
- if (--global_trace.stop_count) {
- if (global_trace.stop_count < 0) {
+ raw_spin_lock_irqsave(&tr->start_lock, flags);
+ if (--tr->stop_count) {
+ if (WARN_ON_ONCE(tr->stop_count < 0)) {
/* Someone screwed up their debugging */
- WARN_ON_ONCE(1);
- global_trace.stop_count = 0;
+ tr->stop_count = 0;
}
goto out;
}
/* Prevent the buffers from switching */
- arch_spin_lock(&global_trace.max_lock);
+ arch_spin_lock(&tr->max_lock);
- buffer = global_trace.array_buffer.buffer;
+ buffer = tr->array_buffer.buffer;
if (buffer)
ring_buffer_record_enable(buffer);
#ifdef CONFIG_TRACER_MAX_TRACE
- buffer = global_trace.max_buffer.buffer;
+ buffer = tr->max_buffer.buffer;
if (buffer)
ring_buffer_record_enable(buffer);
#endif
- arch_spin_unlock(&global_trace.max_lock);
-
- out:
- raw_spin_unlock_irqrestore(&global_trace.start_lock, flags);
-}
-
-static void tracing_start_tr(struct trace_array *tr)
-{
- struct trace_buffer *buffer;
- unsigned long flags;
-
- if (tracing_disabled)
- return;
-
- /* If global, we need to also start the max tracer */
- if (tr->flags & TRACE_ARRAY_FL_GLOBAL)
- return tracing_start();
-
- raw_spin_lock_irqsave(&tr->start_lock, flags);
-
- if (--tr->stop_count) {
- if (tr->stop_count < 0) {
- /* Someone screwed up their debugging */
- WARN_ON_ONCE(1);
- tr->stop_count = 0;
- }
- goto out;
- }
-
- buffer = tr->array_buffer.buffer;
- if (buffer)
- ring_buffer_record_enable(buffer);
+ arch_spin_unlock(&tr->max_lock);
out:
raw_spin_unlock_irqrestore(&tr->start_lock, flags);
}
/**
- * tracing_stop - quick stop of the tracer
+ * tracing_start - quick start of the tracer
*
- * Light weight way to stop tracing. Use in conjunction with
- * tracing_start.
+ * If tracing is enabled but was stopped by tracing_stop,
+ * this will start the tracer back up.
*/
-void tracing_stop(void)
+void tracing_start(void)
+
+{
+ return tracing_start_tr(&global_trace);
+}
+
+static void tracing_stop_tr(struct trace_array *tr)
{
struct trace_buffer *buffer;
unsigned long flags;
- raw_spin_lock_irqsave(&global_trace.start_lock, flags);
- if (global_trace.stop_count++)
+ raw_spin_lock_irqsave(&tr->start_lock, flags);
+ if (tr->stop_count++)
goto out;
/* Prevent the buffers from switching */
- arch_spin_lock(&global_trace.max_lock);
+ arch_spin_lock(&tr->max_lock);
- buffer = global_trace.array_buffer.buffer;
+ buffer = tr->array_buffer.buffer;
if (buffer)
ring_buffer_record_disable(buffer);
#ifdef CONFIG_TRACER_MAX_TRACE
- buffer = global_trace.max_buffer.buffer;
+ buffer = tr->max_buffer.buffer;
if (buffer)
ring_buffer_record_disable(buffer);
#endif
- arch_spin_unlock(&global_trace.max_lock);
+ arch_spin_unlock(&tr->max_lock);
out:
- raw_spin_unlock_irqrestore(&global_trace.start_lock, flags);
+ raw_spin_unlock_irqrestore(&tr->start_lock, flags);
}
-static void tracing_stop_tr(struct trace_array *tr)
+/**
+ * tracing_stop - quick stop of the tracer
+ *
+ * Light weight way to stop tracing. Use in conjunction with
+ * tracing_start.
+ */
+void tracing_stop(void)
{
- struct trace_buffer *buffer;
- unsigned long flags;
-
- /* If global, we need to also stop the max tracer */
- if (tr->flags & TRACE_ARRAY_FL_GLOBAL)
- return tracing_stop();
-
- raw_spin_lock_irqsave(&tr->start_lock, flags);
- if (tr->stop_count++)
- goto out;
-
- buffer = tr->array_buffer.buffer;
- if (buffer)
- ring_buffer_record_disable(buffer);
-
- out:
- raw_spin_unlock_irqrestore(&tr->start_lock, flags);
+ return tracing_stop_tr(&global_trace);
}
static int trace_save_cmdline(struct task_struct *tsk)
--
2.42.0
* [for-linus][PATCH 4/8] tracing: Fix incomplete locking when disabling buffered events
From: Steven Rostedt @ 2023-12-07 2:37 UTC
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
stable, Petr Pavlu
From: Petr Pavlu <petr.pavlu@suse.com>
The following warning appears when using buffered events:
[ 203.556451] WARNING: CPU: 53 PID: 10220 at kernel/trace/ring_buffer.c:3912 ring_buffer_discard_commit+0x2eb/0x420
[...]
[ 203.670690] CPU: 53 PID: 10220 Comm: stress-ng-sysin Tainted: G E 6.7.0-rc2-default #4 56e6d0fcf5581e6e51eaaecbdaec2a2338c80f3a
[ 203.670704] Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
[ 203.670709] RIP: 0010:ring_buffer_discard_commit+0x2eb/0x420
[ 203.735721] Code: 4c 8b 4a 50 48 8b 42 48 49 39 c1 0f 84 b3 00 00 00 49 83 e8 01 75 b1 48 8b 42 10 f0 ff 40 08 0f 0b e9 fc fe ff ff f0 ff 47 08 <0f> 0b e9 77 fd ff ff 48 8b 42 10 f0 ff 40 08 0f 0b e9 f5 fe ff ff
[ 203.735734] RSP: 0018:ffffb4ae4f7b7d80 EFLAGS: 00010202
[ 203.735745] RAX: 0000000000000000 RBX: ffffb4ae4f7b7de0 RCX: ffff8ac10662c000
[ 203.735754] RDX: ffff8ac0c750be00 RSI: ffff8ac10662c000 RDI: ffff8ac0c004d400
[ 203.781832] RBP: ffff8ac0c039cea0 R08: 0000000000000000 R09: 0000000000000000
[ 203.781839] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 203.781842] R13: ffff8ac10662c000 R14: ffff8ac0c004d400 R15: ffff8ac10662c008
[ 203.781846] FS: 00007f4cd8a67740(0000) GS:ffff8ad798880000(0000) knlGS:0000000000000000
[ 203.781851] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 203.781855] CR2: 0000559766a74028 CR3: 00000001804c4000 CR4: 00000000001506f0
[ 203.781862] Call Trace:
[ 203.781870] <TASK>
[ 203.851949] trace_event_buffer_commit+0x1ea/0x250
[ 203.851967] trace_event_raw_event_sys_enter+0x83/0xe0
[ 203.851983] syscall_trace_enter.isra.0+0x182/0x1a0
[ 203.851990] do_syscall_64+0x3a/0xe0
[ 203.852075] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 203.852090] RIP: 0033:0x7f4cd870fa77
[ 203.982920] Code: 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 b8 89 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 43 0e 00 f7 d8 64 89 01 48
[ 203.982932] RSP: 002b:00007fff99717dd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000089
[ 203.982942] RAX: ffffffffffffffda RBX: 0000558ea1d7b6f0 RCX: 00007f4cd870fa77
[ 203.982948] RDX: 0000000000000000 RSI: 00007fff99717de0 RDI: 0000558ea1d7b6f0
[ 203.982957] RBP: 00007fff99717de0 R08: 00007fff997180e0 R09: 00007fff997180e0
[ 203.982962] R10: 00007fff997180e0 R11: 0000000000000246 R12: 00007fff99717f40
[ 204.049239] R13: 00007fff99718590 R14: 0000558e9f2127a8 R15: 00007fff997180b0
[ 204.049256] </TASK>
For instance, it can be triggered by running these two commands in
parallel:
$ while true; do
echo hist:key=id.syscall:val=hitcount > \
/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger;
done
$ stress-ng --sysinfo $(nproc)
The warning indicates that the current ring_buffer_per_cpu is not in the
committing state. It happens because the active ring_buffer_event
doesn't actually come from the ring_buffer_per_cpu but is allocated from
trace_buffered_event.
The bug is in function trace_buffered_event_disable(), where the
following normally happens:

* The code invokes disable_trace_buffered_event() via
  smp_call_function_many() and follows it with synchronize_rcu(). This
  increments the per-CPU variable trace_buffered_event_cnt on each
  target CPU and grants trace_buffered_event_disable() exclusive access
  to the per-CPU variable trace_buffered_event.

* Maintenance is performed on trace_buffered_event; all per-CPU event
  buffers get freed.

* The code invokes enable_trace_buffered_event() via
  smp_call_function_many(). This decrements trace_buffered_event_cnt
  and releases the access to trace_buffered_event.

A problem is that smp_call_function_many() runs a given function on all
target CPUs except the current one. The following can then occur:

* Task X, executing trace_buffered_event_disable(), runs on CPU 0.

* The control reaches synchronize_rcu() and the task gets rescheduled
  to another CPU, CPU 1.

* The RCU synchronization finishes. At this point,
  trace_buffered_event_disable() has exclusive access to all
  trace_buffered_event variables except trace_buffered_event[CPU0],
  because trace_buffered_event_cnt[CPU0] was never incremented and, if
  the buffer is currently unused, remains set to 0.

* A different task Y is scheduled on CPU 0 and hits a trace event. The
  code in trace_event_buffer_lock_reserve() sees that
  trace_buffered_event_cnt[CPU0] is set to 0 and decides to use the
  buffer provided by trace_buffered_event[CPU0].

* Task X continues its execution in trace_buffered_event_disable(). The
  code incorrectly frees the event buffer pointed to by
  trace_buffered_event[CPU0] and resets the variable to NULL.

* Task Y writes event data to the now-freed buffer and later detects
  the resulting inconsistency.
The issue is observable since commit dea499781a11 ("tracing: Fix warning
in trace_buffered_event_disable()") which moved the call of
trace_buffered_event_disable() in __ftrace_event_enable_disable()
earlier, prior to invoking call->class->reg(.. TRACE_REG_UNREGISTER ..).
The underlying problem in trace_buffered_event_disable() is however
present since the original implementation in commit 0fc1b09ff1ff
("tracing: Use temp buffer when filtering events").
Fix the problem by replacing the two smp_call_function_many() calls with
on_each_cpu_mask(), which invokes a given callback on all CPUs in the
mask, including the current one. The difference is sketched below.
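As a minimal sketch (the callback name and CPU mask are the kernel's;
the surrounding code is condensed from the diff below):

	/* Buggy: smp_call_function_many() runs the callback on every
	 * CPU in the mask EXCEPT the calling CPU, so if the task has
	 * been migrated, the per-CPU state of its old CPU is silently
	 * left untouched. */
	preempt_disable();
	smp_call_function_many(tracing_buffer_mask,
			       disable_trace_buffered_event, NULL, 1);
	preempt_enable();

	/* Fixed: on_each_cpu_mask() also runs the callback on the
	 * calling CPU (when it is in the mask), closing the hole. */
	on_each_cpu_mask(tracing_buffer_mask, disable_trace_buffered_event,
			 NULL, true);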
Link: https://lore.kernel.org/all/20231127151248.7232-2-petr.pavlu@suse.com/
Link: https://lkml.kernel.org/r/20231205161736.19663-2-petr.pavlu@suse.com
Cc: stable@vger.kernel.org
Fixes: 0fc1b09ff1ff ("tracing: Use temp buffer when filtering events")
Fixes: dea499781a11 ("tracing: Fix warning in trace_buffered_event_disable()")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/trace.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2492c6c76850..6aeffa4a6994 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2781,11 +2781,9 @@ void trace_buffered_event_disable(void)
if (--trace_buffered_event_ref)
return;
- preempt_disable();
/* For each CPU, set the buffer as used. */
- smp_call_function_many(tracing_buffer_mask,
- disable_trace_buffered_event, NULL, 1);
- preempt_enable();
+ on_each_cpu_mask(tracing_buffer_mask, disable_trace_buffered_event,
+ NULL, true);
/* Wait for all current users to finish */
synchronize_rcu();
@@ -2800,11 +2798,9 @@ void trace_buffered_event_disable(void)
*/
smp_wmb();
- preempt_disable();
/* Do the work on each cpu */
- smp_call_function_many(tracing_buffer_mask,
- enable_trace_buffered_event, NULL, 1);
- preempt_enable();
+ on_each_cpu_mask(tracing_buffer_mask, enable_trace_buffered_event, NULL,
+ true);
}
static struct trace_buffer *temp_buffer;
--
2.42.0
* [for-linus][PATCH 5/8] tracing: Fix a warning when allocating buffered events fails
From: Steven Rostedt @ 2023-12-07 2:37 UTC
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Petr Pavlu
From: Petr Pavlu <petr.pavlu@suse.com>
Function trace_buffered_event_disable() produces an unexpected warning
when the previous call to trace_buffered_event_enable() fails to
allocate pages for buffered events.
The situation can occur as follows:
* The counter trace_buffered_event_ref is at 0.
* The soft mode gets enabled for some event and
trace_buffered_event_enable() is called. The function increments
trace_buffered_event_ref to 1 and starts allocating event pages.
* The allocation fails for some page and trace_buffered_event_disable()
is called for cleanup.
* Function trace_buffered_event_disable() decrements
trace_buffered_event_ref back to 0, recognizes that it was the last
use of buffered events and frees all allocated pages.
* The control goes back to trace_buffered_event_enable() which returns.
The caller of trace_buffered_event_enable() has no information that
the function actually failed.
* Some time later, the soft mode is disabled for the same event.
Function trace_buffered_event_disable() is called. It warns on
"WARN_ON_ONCE(!trace_buffered_event_ref)" and returns.
Buffered events are just an optimization and can handle failures. Make
trace_buffered_event_enable() exit on the first failure and leave any
cleanup to when trace_buffered_event_disable() is later called.
Link: https://lore.kernel.org/all/20231127151248.7232-2-petr.pavlu@suse.com/
Link: https://lkml.kernel.org/r/20231205161736.19663-3-petr.pavlu@suse.com
Fixes: 0fc1b09ff1ff ("tracing: Use temp buffer when filtering events")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/trace.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6aeffa4a6994..ef72354f61ce 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2728,8 +2728,11 @@ void trace_buffered_event_enable(void)
for_each_tracing_cpu(cpu) {
page = alloc_pages_node(cpu_to_node(cpu),
GFP_KERNEL | __GFP_NORETRY, 0);
- if (!page)
- goto failed;
+ /* This is just an optimization and can handle failures */
+ if (!page) {
+ pr_err("Failed to allocate event buffer\n");
+ break;
+ }
event = page_address(page);
memset(event, 0, sizeof(*event));
@@ -2743,10 +2746,6 @@ void trace_buffered_event_enable(void)
WARN_ON_ONCE(1);
preempt_enable();
}
-
- return;
- failed:
- trace_buffered_event_disable();
}
static void enable_trace_buffered_event(void *data)
--
2.42.0
* [for-linus][PATCH 6/8] tracing: Fix a possible race when disabling buffered events
From: Steven Rostedt @ 2023-12-07 2:37 UTC
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
stable, Petr Pavlu
From: Petr Pavlu <petr.pavlu@suse.com>
Function trace_buffered_event_disable() is responsible for freeing pages
backing buffered events and this process can run concurrently with
trace_event_buffer_lock_reserve().
The following race is currently possible:
* Function trace_buffered_event_disable() is called on CPU 0. It
increments trace_buffered_event_cnt on each CPU and waits via
synchronize_rcu() for each user of trace_buffered_event to complete.
* After synchronize_rcu() is finished, function
trace_buffered_event_disable() has the exclusive access to
trace_buffered_event. All counters trace_buffered_event_cnt are at 1
and all pointers trace_buffered_event are still valid.
* At this point, on a different CPU 1, the execution reaches
trace_event_buffer_lock_reserve(). The function calls
preempt_disable_notrace() and only now enters an RCU read-side
critical section. The function proceeds and reads a still valid
pointer from trace_buffered_event[CPU1] into the local variable
"entry". However, it doesn't yet read trace_buffered_event_cnt[CPU1]
which happens later.
* Function trace_buffered_event_disable() continues. It frees
trace_buffered_event[CPU1] and decrements
trace_buffered_event_cnt[CPU1] back to 0.
* Function trace_event_buffer_lock_reserve() continues. It reads and
increments trace_buffered_event_cnt[CPU1] from 0 to 1. This makes it
believe that it can use the "entry" that it already obtained but the
pointer is now invalid and any access results in a use-after-free.
Fix the problem by making a second synchronize_rcu() call after all
trace_buffered_event values are set to NULL. This waits on all potential
users in trace_event_buffer_lock_reserve() that still hold a previously
read pointer from trace_buffered_event. The resulting teardown order is
sketched below.
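Together with the previous two fixes, the teardown in
trace_buffered_event_disable() then proceeds as follows (condensed from
the patched code):

	/* Raise every CPU's counter so no new users claim a buffer */
	on_each_cpu_mask(tracing_buffer_mask, disable_trace_buffered_event,
			 NULL, true);

	synchronize_rcu();	/* wait for all current users to finish */

	for_each_tracing_cpu(cpu) {
		free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
		per_cpu(trace_buffered_event, cpu) = NULL;
	}

	/* Second grace period: anyone who read the old pointer must
	 * have finished (and seen the raised counter) before the
	 * counters are released below. */
	synchronize_rcu();

	on_each_cpu_mask(tracing_buffer_mask, enable_trace_buffered_event,
			 NULL, true);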
Link: https://lore.kernel.org/all/20231127151248.7232-2-petr.pavlu@suse.com/
Link: https://lkml.kernel.org/r/20231205161736.19663-4-petr.pavlu@suse.com
Cc: stable@vger.kernel.org
Fixes: 0fc1b09ff1ff ("tracing: Use temp buffer when filtering events")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/trace.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ef72354f61ce..fbcd3bafb93e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2791,13 +2791,17 @@ void trace_buffered_event_disable(void)
free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
per_cpu(trace_buffered_event, cpu) = NULL;
}
+
/*
- * Make sure trace_buffered_event is NULL before clearing
- * trace_buffered_event_cnt.
+ * Wait for all CPUs that potentially started checking if they can use
+ * their event buffer only after the previous synchronize_rcu() call and
+ * they still read a valid pointer from trace_buffered_event. It must be
+ * ensured they don't see cleared trace_buffered_event_cnt else they
+ * could wrongly decide to use the pointed-to buffer which is now freed.
*/
- smp_wmb();
+ synchronize_rcu();
- /* Do the work on each cpu */
+ /* For each CPU, relinquish the buffer */
on_each_cpu_mask(tracing_buffer_mask, enable_trace_buffered_event, NULL,
true);
}
--
2.42.0
* [for-linus][PATCH 7/8] ring-buffer: Force absolute timestamp on discard of event
From: Steven Rostedt @ 2023-12-07 2:37 UTC
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
stable
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
There's a race where, if an event is discarded from the ring buffer and
an interrupt happens at that moment and inserts an event, the time
stamp of the discarded event is still used as an offset. This can throw
off the timings.

If an event is going to be discarded, set the "before_stamp" to zero
first. When a new event comes in, it compares the "before_stamp" with
the "write_stamp", and if they are not equal, it inserts an absolute
timestamp. This prevents the timings from getting out of sync due to
the discarded event. A condensed sketch of that check follows.
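A condensed sketch of that check on the writer side (illustrative; the
real logic lives in __rb_reserve_next() and rb_add_timestamp()):

	u64 before, write;
	bool a_ok = rb_time_read(&cpu_buffer->before_stamp, &before);
	bool b_ok = rb_time_read(&cpu_buffer->write_stamp, &write);

	if (!a_ok || !b_ok || before != write) {
		/* The saved stamps cannot be trusted as a delta base
		 * (e.g. a discard zeroed before_stamp), so emit a full
		 * absolute timestamp with this event instead. */
		info->add_timestamp |= RB_ADD_STAMP_FORCE | RB_ADD_STAMP_EXTEND;
	}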
Link: https://lore.kernel.org/linux-trace-kernel/20231206100244.5130f9b3@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 6f6be606e763f ("ring-buffer: Force before_stamp and write_stamp to be different on discard")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/ring_buffer.c | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 43cc47d7faaf..a6da2d765c78 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3030,22 +3030,19 @@ rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
local_read(&bpage->write) & ~RB_WRITE_MASK;
unsigned long event_length = rb_event_length(event);
+ /*
+ * For the before_stamp to be different than the write_stamp
+ * to make sure that the next event adds an absolute
+ * value and does not rely on the saved write stamp, which
+ * is now going to be bogus.
+ */
+ rb_time_set(&cpu_buffer->before_stamp, 0);
+
/* Something came in, can't discard */
if (!rb_time_cmpxchg(&cpu_buffer->write_stamp,
write_stamp, write_stamp - delta))
return false;
- /*
- * It's possible that the event time delta is zero
- * (has the same time stamp as the previous event)
- * in which case write_stamp and before_stamp could
- * be the same. In such a case, force before_stamp
- * to be different than write_stamp. It doesn't
- * matter what it is, as long as its different.
- */
- if (!delta)
- rb_time_set(&cpu_buffer->before_stamp, 0);
-
/*
* If an event were to come in now, it would see that the
* write_stamp and the before_stamp are different, and assume
--
2.42.0
* [for-linus][PATCH 8/8] ring-buffer: Test last update in 32bit version of __rb_time_read()
From: Steven Rostedt @ 2023-12-07 2:38 UTC
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
stable
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Since a 64-bit cmpxchg() is very expensive on 32-bit architectures, the
timestamp used by the ring buffer does some interesting tricks to still
provide an atomic 64-bit number. It originally used just 60 bits,
broken up into two 32-bit words where the extra 2 bits of each word
were used for synchronization. But this was not enough for all use
cases, and all 64 bits became required.

The 32-bit version of the ring buffer timestamp was then broken up into
three 32-bit words using the same counter trick. But one update was
missed: the check to see whether the read operation completed without
interruption only compared the first two words and not the last one (as
it had before this change). Fix it by making sure all three words were
updated without interruption, comparing the initial counter with the
counter of the last updated word.
Link: https://lore.kernel.org/linux-trace-kernel/20231206100050.3100b7bb@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: f03f2abce4f39 ("ring-buffer: Have 32 bit time stamps use all 64 bits")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/ring_buffer.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a6da2d765c78..8d2a4f00eca9 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -644,8 +644,8 @@ static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt)
*cnt = rb_time_cnt(top);
- /* If top and bottom counts don't match, this interrupted a write */
- if (*cnt != rb_time_cnt(bottom))
+ /* If top and msb counts don't match, this interrupted a write */
+ if (*cnt != rb_time_cnt(msb))
return false;
/* The shift to msb will lose its cnt bits */
--
2.42.0