[for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5

All of lore.kernel.org
 help / color / mirror / Atom feed

* [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5
@ 2023-12-16  4:22 Steven Rostedt
  2023-12-16  4:22 ` [for-linus][PATCH 01/15] ring-buffer: Fix writing to the buffer with max_data_size Steven Rostedt
                   ` (14 more replies)
  0 siblings, 15 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16  4:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton


Tracing fixes for v6.7-rc5:

- Fix eventfs to check creating new files for events with names greater than
  NAME_MAX. The eventfs lookup needs to check the return result of
  simple_lookup().

- Fix the ring buffer to check the proper max data size. Events must be able to
  fit on the ring buffer sub-buffer, if it cannot, then it fails to be written
  and the logic to add the event is avoided. The code to check if an event can
  fit failed to add the possible absolute timestamp which may make the event
  not be able to fit. This causes the ring buffer to go into an infinite loop
  trying to find a sub-buffer that would fit the event. Luckily, there's a check
  that will bail out if it looped over a 1000 times and it also warns.

  The real fix is not to add the absolute timestamp to an event that is
  starting at the beginning of a sub-buffer because it uses the sub-buffer
  timestamp. By avoiding the timestamp at the start of the sub-buffer allows
  events that pass the first check to always find a sub-buffer that it can fit
  on.

- Have large events that do not fit on a trace_seq to print "LINE TOO BIG" like
  it does for the trace_pipe instead of what it does now which is to silently
  drop the output.

- Fix a memory leak of forgetting to free the spare page that is saved by a
  trace instance.

- Update the size of the snapshot buffer when the main buffer is updated if the
  snapshot buffer is allocated.

- Fix ring buffer timestamp logic by removing all the places that tried to put
  the before_stamp back to the write stamp so that the next event doesn't add
  an absolute timestamp. But each of these updates added a race where by making
  the two timestamp equal, it was validating the write_stamp so that it can be
  incorrectly used for calculating the delta of an event.

- There's a temp buffer used for printing the event that was using the event
  data size for allocation when it needed to use the size of the entire event
  (meta-data and payload data)

- For hardening, use "%.*s" for printing the trace_marker output, to limit the
  amount that is printed by the size of the event. This was discovered by
  development that added a bug that truncated the '\0' and caused a crash.

- Fix a use-after-free bug in the use of the histogram files when an instance
  is being removed.

- Remove a useless update in the rb_try_to_discard of the write_stamp. The
  before_stamp was already changed to force the next event to add an absolute
  timestamp that the write_stamp is not used. But the write_stamp is modified
  again using an unneeded 64-bit cmpxchg.

- Fix several races in the 32-bit implementation of the rb_time_cmpxchg() that
  does a 64-bit cmpxchg.

- While looking at fixing the 64-bit cmpxchg, I noticed that because the ring
  buffer uses normal cmpxchg, and this can be done in NMI context, there's some
  architectures that do not have a working cmpxchg in NMI context. For these
  architectures, fail recording events that happen in NMI context.
  git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
trace/urgent

Head SHA1: 712292308af2265cd9b126aedfa987f10f452a33


Beau Belgrave (1):
      eventfs: Fix events beyond NAME_MAX blocking tasks

Mathieu Desnoyers (1):
      ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg()

Steven Rostedt (Google) (12):
      ring-buffer: Fix writing to the buffer with max_data_size
      tracing: Have large events show up as '[LINE TOO BIG]' instead of nothing
      ring-buffer: Fix memory leak of free page
      tracing: Update snapshot buffer on resize if it is allocated
      ring-buffer: Do not update before stamp when switching sub-buffers
      ring-buffer: Have saved event hold the entire event
      tracing: Add size check when printing trace_marker output
      ring-buffer: Do not try to put back write_stamp
      ring-buffer: Remove useless update to write_stamp in rb_try_to_discard()
      ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs
      ring-buffer: Have rb_time_cmpxchg() set the msb counter too
      ring-buffer: Do not record in NMI if the arch does not support cmpxchg in NMI

Zheng Yejian (1):
      tracing: Fix uaf issue when open the hist or hist_debug file

----
 fs/tracefs/event_inode.c         |   4 ++
 kernel/trace/ring_buffer.c       | 115 ++++++++++++++-------------------------
 kernel/trace/trace.c             |  16 +++++-
 kernel/trace/trace.h             |   1 +
 kernel/trace/trace_events_hist.c |  12 ++--
 kernel/trace/trace_output.c      |   6 +-
 6 files changed, 72 insertions(+), 82 deletions(-)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [for-linus][PATCH 01/15] ring-buffer: Fix writing to the buffer with max_data_size
  2023-12-16  4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
@ 2023-12-16  4:22 ` Steven Rostedt
  2023-12-16  4:22 ` [for-linus][PATCH 02/15] tracing: Have large events show up as [LINE TOO BIG] instead of nothing Steven Rostedt
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16  4:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	stable

From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

The maximum ring buffer data size is the maximum size of data that can be
recorded on the ring buffer. Events must be smaller than the sub buffer
data size minus any meta data. This size is checked before trying to
allocate from the ring buffer because the allocation assumes that the size
will fit on the sub buffer.

The maximum size was calculated as the size of a sub buffer page (which is
currently PAGE_SIZE minus the sub buffer header) minus the size of the
meta data of an individual event. But it missed the possible adding of a
time stamp for events that are added long enough apart that the event meta
data can't hold the time delta.

When an event is added that is greater than the current BUF_MAX_DATA_SIZE
minus the size of a time stamp, but still less than or equal to
BUF_MAX_DATA_SIZE, the ring buffer would go into an infinite loop, looking
for a page that can hold the event. Luckily, there's a check for this loop
and after 1000 iterations and a warning is emitted and the ring buffer is
disabled. But this should never happen.

This can happen when a large event is added first, or after a long period
where an absolute timestamp is prefixed to the event, increasing its size
by 8 bytes. This passes the check and then goes into the algorithm that
causes the infinite loop.

For events that are the first event on the sub-buffer, it does not need to
add a timestamp, because the sub-buffer itself contains an absolute
timestamp, and adding one is redundant.

The fix is to check if the event is to be the first event on the
sub-buffer, and if it is, then do not add a timestamp.

This also fixes 32 bit adding a timestamp when a read of before_stamp or
write_stamp is interrupted. There's still no need to add that timestamp if
the event is going to be the first event on the sub buffer.

Also, if the buffer has "time_stamp_abs" set, then also check if the
length plus the timestamp is greater than the BUF_MAX_DATA_SIZE.

Link: https://lore.kernel.org/all/20231212104549.58863438@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20231212071837.5fdd6c13@gandalf.local.home
Link: https://lore.kernel.org/linux-trace-kernel/20231212111617.39e02849@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: a4543a2fa9ef3 ("ring-buffer: Get timestamp after event is allocated")
Fixes: 58fbc3c63275c ("ring-buffer: Consolidate add_timestamp to remove some branches")
Reported-by: Kent Overstreet <kent.overstreet@linux.dev> # (on IRC)
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8d2a4f00eca9..b8986f82eccf 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3579,7 +3579,10 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 		 * absolute timestamp.
 		 * Don't bother if this is the start of a new page (w == 0).
 		 */
-		if (unlikely(!a_ok || !b_ok || (info->before != info->after && w))) {
+		if (!w) {
+			/* Use the sub-buffer timestamp */
+			info->delta = 0;
+		} else if (unlikely(!a_ok || !b_ok || info->before != info->after)) {
 			info->add_timestamp |= RB_ADD_STAMP_FORCE | RB_ADD_STAMP_EXTEND;
 			info->length += RB_LEN_TIME_EXTEND;
 		} else {
@@ -3737,6 +3740,8 @@ rb_reserve_next_event(struct trace_buffer *buffer,
 	if (ring_buffer_time_stamp_abs(cpu_buffer->buffer)) {
 		add_ts_default = RB_ADD_STAMP_ABSOLUTE;
 		info.length += RB_LEN_TIME_EXTEND;
+		if (info.length > BUF_MAX_DATA_SIZE)
+			goto out_fail;
 	} else {
 		add_ts_default = RB_ADD_STAMP_NONE;
 	}
-- 
2.42.0

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [for-linus][PATCH 02/15] tracing: Have large events show up as [LINE TOO BIG] instead of nothing
  2023-12-16  4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
  2023-12-16  4:22 ` [for-linus][PATCH 01/15] ring-buffer: Fix writing to the buffer with max_data_size Steven Rostedt
@ 2023-12-16  4:22 ` Steven Rostedt
  2023-12-16  4:22 ` [for-linus][PATCH 03/15] eventfs: Fix events beyond NAME_MAX blocking tasks Steven Rostedt
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16  4:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton

From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

If a large event was added to the ring buffer that is larger than what the
trace_seq can handle, it just drops the output:

 ~# cat /sys/kernel/tracing/trace
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 2/2   #P:8
 #
 #                                _-----=> irqs-off/BH-disabled
 #                               / _----=> need-resched
 #                              | / _---=> hardirq/softirq
 #                              || / _--=> preempt-depth
 #                              ||| / _-=> migrate-disable
 #                              |||| /     delay
 #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
 #              | |         |   |||||     |         |
            <...>-859     [001] .....   141.118951: tracing_mark_write           <...>-859     [001] .....   141.148201: tracing_mark_write: 78901234

Instead, catch this case and add some context:

 ~# cat /sys/kernel/tracing/trace
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 2/2   #P:8
 #
 #                                _-----=> irqs-off/BH-disabled
 #                               / _----=> need-resched
 #                              | / _---=> hardirq/softirq
 #                              || / _--=> preempt-depth
 #                              ||| / _-=> migrate-disable
 #                              |||| /     delay
 #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
 #              | |         |   |||||     |         |
            <...>-852     [001] .....   121.550551: tracing_mark_write[LINE TOO BIG]
            <...>-852     [001] .....   121.550581: tracing_mark_write: 78901234

This now emulates the same output as trace_pipe.

Link: https://lore.kernel.org/linux-trace-kernel/20231209171058.78c1a026@gandalf.local.home

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index fbcd3bafb93e..aa8f99f3e5de 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4722,7 +4722,11 @@ static int s_show(struct seq_file *m, void *v)
 		iter->leftover = ret;
 
 	} else {
-		print_trace_line(iter);
+		ret = print_trace_line(iter);
+		if (ret == TRACE_TYPE_PARTIAL_LINE) {
+			iter->seq.full = 0;
+			trace_seq_puts(&iter->seq, "[LINE TOO BIG]\n");
+		}
 		ret = trace_print_seq(m, &iter->seq);
 		/*
 		 * If we overflow the seq_file buffer, then it will
-- 
2.42.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [for-linus][PATCH 03/15] eventfs: Fix events beyond NAME_MAX blocking tasks
  2023-12-16  4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
  2023-12-16  4:22 ` [for-linus][PATCH 01/15] ring-buffer: Fix writing to the buffer with max_data_size Steven Rostedt
  2023-12-16  4:22 ` [for-linus][PATCH 02/15] tracing: Have large events show up as [LINE TOO BIG] instead of nothing Steven Rostedt
@ 2023-12-16  4:22 ` Steven Rostedt
  2023-12-16  4:22 ` [for-linus][PATCH 04/15] ring-buffer: Fix memory leak of free page Steven Rostedt
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16  4:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Beau Belgrave

From: Beau Belgrave <beaub@linux.microsoft.com>

Eventfs uses simple_lookup(), however, it will fail if the name of the
entry is beyond NAME_MAX length. When this error is encountered, eventfs
still tries to create dentries instead of skipping the dentry creation.
When the dentry is attempted to be created in this state d_wait_lookup()
will loop forever, waiting for the lookup to be removed.

Fix eventfs to return the error in simple_lookup() back to the caller
instead of continuing to try to create the dentry.

Link: https://lore.kernel.org/linux-trace-kernel/20231210213534.497-1-beaub@linux.microsoft.com

Fixes: 63940449555e ("eventfs: Implement eventfs lookup, read, open functions")
Link: https://lore.kernel.org/linux-trace-kernel/20231208183601.GA46-beaub@linux.microsoft.com/
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 fs/tracefs/event_inode.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 0b90869fd805..43e237864a42 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -546,6 +546,8 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
 		if (strcmp(ei_child->name, name) != 0)
 			continue;
 		ret = simple_lookup(dir, dentry, flags);
+		if (IS_ERR(ret))
+			goto out;
 		create_dir_dentry(ei, ei_child, ei_dentry, true);
 		created = true;
 		break;
@@ -568,6 +570,8 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
 			if (r <= 0)
 				continue;
 			ret = simple_lookup(dir, dentry, flags);
+			if (IS_ERR(ret))
+				goto out;
 			create_file_dentry(ei, i, ei_dentry, name, mode, cdata,
 					   fops, true);
 			break;
-- 
2.42.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [for-linus][PATCH 04/15] ring-buffer: Fix memory leak of free page
  2023-12-16  4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
                   ` (2 preceding siblings ...)
  2023-12-16  4:22 ` [for-linus][PATCH 03/15] eventfs: Fix events beyond NAME_MAX blocking tasks Steven Rostedt
@ 2023-12-16  4:22 ` Steven Rostedt
  2023-12-16  4:22 ` [for-linus][PATCH 05/15] tracing: Update snapshot buffer on resize if it is allocated Steven Rostedt
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16  4:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	stable

From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

Reading the ring buffer does a swap of a sub-buffer within the ring buffer
with a empty sub-buffer. This allows the reader to have full access to the
content of the sub-buffer that was swapped out without having to worry
about contention with the writer.

The readers call ring_buffer_alloc_read_page() to allocate a page that
will be used to swap with the ring buffer. When the code is finished with
the reader page, it calls ring_buffer_free_read_page(). Instead of freeing
the page, it stores it as a spare. Then next call to
ring_buffer_alloc_read_page() will return this spare instead of calling
into the memory management system to allocate a new page.

Unfortunately, on freeing of the ring buffer, this spare page is not
freed, and causes a memory leak.

Link: https://lore.kernel.org/linux-trace-kernel/20231210221250.7b9cc83c@rorschach.local.home

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 73a757e63114d ("ring-buffer: Return reader page back into existing ring buffer")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index b8986f82eccf..dcd47895b424 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1787,6 +1787,8 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
 		free_buffer_page(bpage);
 	}

+	free_page((unsigned long)cpu_buffer->free_page);
+
 	kfree(cpu_buffer);
 }

-- 
2.42.0

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [for-linus][PATCH 05/15] tracing: Update snapshot buffer on resize if it is allocated
  2023-12-16  4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
                   ` (3 preceding siblings ...)
  2023-12-16  4:22 ` [for-linus][PATCH 04/15] ring-buffer: Fix memory leak of free page Steven Rostedt
@ 2023-12-16  4:22 ` Steven Rostedt
  2023-12-16  4:22 ` [for-linus][PATCH 06/15] ring-buffer: Do not update before stamp when switching sub-buffers Steven Rostedt
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16  4:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	stable

From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

The snapshot buffer is to mimic the main buffer so that when a snapshot is
needed, the snapshot and main buffer are swapped. When the snapshot buffer
is allocated, it is set to the minimal size that the ring buffer may be at
and still functional. When it is allocated it becomes the same size as the
main ring buffer, and when the main ring buffer changes in size, it should
do.

Currently, the resize only updates the snapshot buffer if it's used by the
current tracer (ie. the preemptirqsoff tracer). But it needs to be updated
anytime it is allocated.

When changing the size of the main buffer, instead of looking to see if
the current tracer is utilizing the snapshot buffer, just check if it is
allocated to know if it should be updated or not.

Also fix typo in comment just above the code change.

Link: https://lore.kernel.org/linux-trace-kernel/20231210225447.48476a6a@rorschach.local.home

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: ad909e21bbe69 ("tracing: Add internal tracing_snapshot() functions")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index aa8f99f3e5de..6c79548f9574 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6348,7 +6348,7 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
 	if (!tr->array_buffer.buffer)
 		return 0;
 
-	/* Do not allow tracing while resizng ring buffer */
+	/* Do not allow tracing while resizing ring buffer */
 	tracing_stop_tr(tr);
 
 	ret = ring_buffer_resize(tr->array_buffer.buffer, size, cpu);
@@ -6356,7 +6356,7 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
 		goto out_start;
 
 #ifdef CONFIG_TRACER_MAX_TRACE
-	if (!tr->current_trace->use_max_tr)
+	if (!tr->allocated_snapshot)
 		goto out;
 
 	ret = ring_buffer_resize(tr->max_buffer.buffer, size, cpu);
-- 
2.42.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [for-linus][PATCH 06/15] ring-buffer: Do not update before stamp when switching sub-buffers
  2023-12-16  4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
                   ` (4 preceding siblings ...)
  2023-12-16  4:22 ` [for-linus][PATCH 05/15] tracing: Update snapshot buffer on resize if it is allocated Steven Rostedt
@ 2023-12-16  4:22 ` Steven Rostedt
  2023-12-16  4:22 ` [for-linus][PATCH 07/15] ring-buffer: Have saved event hold the entire event Steven Rostedt
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16  4:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	stable

From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

The ring buffer timestamps are synchronized by two timestamp placeholders.
One is the "before_stamp" and the other is the "write_stamp" (sometimes
referred to as the "after stamp" but only in the comments. These two
stamps are key to knowing how to handle nested events coming in with a
lockless system.

When moving across sub-buffers, the before stamp is updated but the write
stamp is not. There's an effort to put back the before stamp to something
that seems logical in case there's nested events. But as the current event
is about to cross sub-buffers, and so will any new nested event that happens,
updating the before stamp is useless, and could even introduce new race
conditions.

The first event on a sub-buffer simply uses the sub-buffer's timestamp
and keeps a "delta" of zero. The "before_stamp" and "write_stamp" are not
used in the algorithm in this case. There's no reason to try to fix the
before_stamp when this happens.

As a bonus, it removes a cmpxchg() when crossing sub-buffers!

Link: https://lore.kernel.org/linux-trace-kernel/20231211114420.36dde01b@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: a389d86f7fd09 ("ring-buffer: Have nested events still record running time stamp")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index dcd47895b424..c7abcc215fe2 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3607,14 +3607,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,

 	/* See if we shot pass the end of this buffer page */
 	if (unlikely(write > BUF_PAGE_SIZE)) {
-		/* before and after may now different, fix it up*/
-		b_ok = rb_time_read(&cpu_buffer->before_stamp, &info->before);
-		a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
-		if (a_ok && b_ok && info->before != info->after)
-			(void)rb_time_cmpxchg(&cpu_buffer->before_stamp,
-					      info->before, info->after);
-		if (a_ok && b_ok)
-			check_buffer(cpu_buffer, info, CHECK_FULL_PAGE);
+		check_buffer(cpu_buffer, info, CHECK_FULL_PAGE);
 		return rb_move_tail(cpu_buffer, tail, info);
 	}

-- 
2.42.0

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [for-linus][PATCH 07/15] ring-buffer: Have saved event hold the entire event
  2023-12-16  4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
                   ` (5 preceding siblings ...)
  2023-12-16  4:22 ` [for-linus][PATCH 06/15] ring-buffer: Do not update before stamp when switching sub-buffers Steven Rostedt
@ 2023-12-16  4:22 ` Steven Rostedt
  2023-12-16  4:22 ` [for-linus][PATCH 08/15] tracing: Add size check when printing trace_marker output Steven Rostedt
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16  4:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	stable

From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

For the ring buffer iterator (non-consuming read), the event needs to be
copied into the iterator buffer to make sure that a writer does not
overwrite it while the user is reading it. If a write happens during the
copy, the buffer is simply discarded.

But the temp buffer itself was not big enough. The allocation of the
buffer was only BUF_MAX_DATA_SIZE, which is the maximum data size that can
be passed into the ring buffer and saved. But the temp buffer needs to
hold the meta data as well. That would be BUF_PAGE_SIZE and not
BUF_MAX_DATA_SIZE.

Link: https://lore.kernel.org/linux-trace-kernel/20231212072558.61f76493@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 785888c544e04 ("ring-buffer: Have rb_iter_head_event() handle concurrent writer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index c7abcc215fe2..1d9caee7f542 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2409,7 +2409,7 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
 	 */
 	barrier();
 
-	if ((iter->head + length) > commit || length > BUF_MAX_DATA_SIZE)
+	if ((iter->head + length) > commit || length > BUF_PAGE_SIZE)
 		/* Writer corrupted the read? */
 		goto reset;
 
@@ -5118,7 +5118,8 @@ ring_buffer_read_prepare(struct trace_buffer *buffer, int cpu, gfp_t flags)
 	if (!iter)
 		return NULL;
 
-	iter->event = kmalloc(BUF_MAX_DATA_SIZE, flags);
+	/* Holds the entire event: data and meta data */
+	iter->event = kmalloc(BUF_PAGE_SIZE, flags);
 	if (!iter->event) {
 		kfree(iter);
 		return NULL;
-- 
2.42.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [for-linus][PATCH 08/15] tracing: Add size check when printing trace_marker output
  2023-12-16  4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
                   ` (6 preceding siblings ...)
  2023-12-16  4:22 ` [for-linus][PATCH 07/15] ring-buffer: Have saved event hold the entire event Steven Rostedt
@ 2023-12-16  4:22 ` Steven Rostedt
  2023-12-16  4:22 ` [for-linus][PATCH 09/15] tracing: Fix uaf issue when open the hist or hist_debug file Steven Rostedt
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16  4:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton

From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

If for some reason the trace_marker write does not have a nul byte for the
string, it will overflow the print:

  trace_seq_printf(s, ": %s", field->buf);

The field->buf could be missing the nul byte. To prevent overflow, add the
max size that the buf can be by using the event size and the field
location.

  int max = iter->ent_size - offsetof(struct print_entry, buf);

  trace_seq_printf(s, ": %*.s", max, field->buf);

Link: https://lore.kernel.org/linux-trace-kernel/20231212084444.4619b8ce@gandalf.local.home

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace_output.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index d8b302d01083..3e7fa44dc2b2 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1587,11 +1587,12 @@ static enum print_line_t trace_print_print(struct trace_iterator *iter,
 {
 	struct print_entry *field;
 	struct trace_seq *s = &iter->seq;
+	int max = iter->ent_size - offsetof(struct print_entry, buf);
 
 	trace_assign_type(field, iter->ent);
 
 	seq_print_ip_sym(s, field->ip, flags);
-	trace_seq_printf(s, ": %s", field->buf);
+	trace_seq_printf(s, ": %.*s", max, field->buf);
 
 	return trace_handle_return(s);
 }
@@ -1600,10 +1601,11 @@ static enum print_line_t trace_print_raw(struct trace_iterator *iter, int flags,
 					 struct trace_event *event)
 {
 	struct print_entry *field;
+	int max = iter->ent_size - offsetof(struct print_entry, buf);
 
 	trace_assign_type(field, iter->ent);
 
-	trace_seq_printf(&iter->seq, "# %lx %s", field->ip, field->buf);
+	trace_seq_printf(&iter->seq, "# %lx %.*s", field->ip, max, field->buf);
 
 	return trace_handle_return(&iter->seq);
 }
-- 
2.42.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [for-linus][PATCH 09/15] tracing: Fix uaf issue when open the hist or hist_debug file
  2023-12-16  4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
                   ` (7 preceding siblings ...)
  2023-12-16  4:22 ` [for-linus][PATCH 08/15] tracing: Add size check when printing trace_marker output Steven Rostedt
@ 2023-12-16  4:22 ` Steven Rostedt
  2023-12-16  4:22 ` [for-linus][PATCH 10/15] ring-buffer: Do not try to put back write_stamp Steven Rostedt
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16  4:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Zheng Yejian

From: Zheng Yejian <zhengyejian1@huawei.com>

KASAN report following issue. The root cause is when opening 'hist'
file of an instance and accessing 'trace_event_file' in hist_show(),
but 'trace_event_file' has been freed due to the instance being removed.
'hist_debug' file has the same problem. To fix it, call
tracing_{open,release}_file_tr() in file_operations callback to have
the ref count and avoid 'trace_event_file' being freed.

  BUG: KASAN: slab-use-after-free in hist_show+0x11e0/0x1278
  Read of size 8 at addr ffff242541e336b8 by task head/190

  CPU: 4 PID: 190 Comm: head Not tainted 6.7.0-rc5-g26aff849438c #133
  Hardware name: linux,dummy-virt (DT)
  Call trace:
   dump_backtrace+0x98/0xf8
   show_stack+0x1c/0x30
   dump_stack_lvl+0x44/0x58
   print_report+0xf0/0x5a0
   kasan_report+0x80/0xc0
   __asan_report_load8_noabort+0x1c/0x28
   hist_show+0x11e0/0x1278
   seq_read_iter+0x344/0xd78
   seq_read+0x128/0x1c0
   vfs_read+0x198/0x6c8
   ksys_read+0xf4/0x1e0
   __arm64_sys_read+0x70/0xa8
   invoke_syscall+0x70/0x260
   el0_svc_common.constprop.0+0xb0/0x280
   do_el0_svc+0x44/0x60
   el0_svc+0x34/0x68
   el0t_64_sync_handler+0xb8/0xc0
   el0t_64_sync+0x168/0x170

  Allocated by task 188:
   kasan_save_stack+0x28/0x50
   kasan_set_track+0x28/0x38
   kasan_save_alloc_info+0x20/0x30
   __kasan_slab_alloc+0x6c/0x80
   kmem_cache_alloc+0x15c/0x4a8
   trace_create_new_event+0x84/0x348
   __trace_add_new_event+0x18/0x88
   event_trace_add_tracer+0xc4/0x1a0
   trace_array_create_dir+0x6c/0x100
   trace_array_create+0x2e8/0x568
   instance_mkdir+0x48/0x80
   tracefs_syscall_mkdir+0x90/0xe8
   vfs_mkdir+0x3c4/0x610
   do_mkdirat+0x144/0x200
   __arm64_sys_mkdirat+0x8c/0xc0
   invoke_syscall+0x70/0x260
   el0_svc_common.constprop.0+0xb0/0x280
   do_el0_svc+0x44/0x60
   el0_svc+0x34/0x68
   el0t_64_sync_handler+0xb8/0xc0
   el0t_64_sync+0x168/0x170

  Freed by task 191:
   kasan_save_stack+0x28/0x50
   kasan_set_track+0x28/0x38
   kasan_save_free_info+0x34/0x58
   __kasan_slab_free+0xe4/0x158
   kmem_cache_free+0x19c/0x508
   event_file_put+0xa0/0x120
   remove_event_file_dir+0x180/0x320
   event_trace_del_tracer+0xb0/0x180
   __remove_instance+0x224/0x508
   instance_rmdir+0x44/0x78
   tracefs_syscall_rmdir+0xbc/0x140
   vfs_rmdir+0x1cc/0x4c8
   do_rmdir+0x220/0x2b8
   __arm64_sys_unlinkat+0xc0/0x100
   invoke_syscall+0x70/0x260
   el0_svc_common.constprop.0+0xb0/0x280
   do_el0_svc+0x44/0x60
   el0_svc+0x34/0x68
   el0t_64_sync_handler+0xb8/0xc0
   el0t_64_sync+0x168/0x170

Link: https://lore.kernel.org/linux-trace-kernel/20231214012153.676155-1-zhengyejian1@huawei.com

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/trace.c             |  6 ++++++
 kernel/trace/trace.h             |  1 +
 kernel/trace/trace_events_hist.c | 12 ++++++++----
 3 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6c79548f9574..199df497db07 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4968,6 +4968,12 @@ int tracing_release_file_tr(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+int tracing_single_release_file_tr(struct inode *inode, struct file *filp)
+{
+	tracing_release_file_tr(inode, filp);
+	return single_release(inode, filp);
+}
+
 static int tracing_mark_open(struct inode *inode, struct file *filp)
 {
 	stream_open(inode, filp);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index b7f4ea25a194..0489e72c8169 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -617,6 +617,7 @@ int tracing_open_generic(struct inode *inode, struct file *filp);
 int tracing_open_generic_tr(struct inode *inode, struct file *filp);
 int tracing_open_file_tr(struct inode *inode, struct file *filp);
 int tracing_release_file_tr(struct inode *inode, struct file *filp);
+int tracing_single_release_file_tr(struct inode *inode, struct file *filp);
 bool tracing_is_disabled(void);
 bool tracer_tracing_is_on(struct trace_array *tr);
 void tracer_tracing_on(struct trace_array *tr);
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 1abc07fba1b9..5ecf3c8bde20 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -5623,10 +5623,12 @@ static int event_hist_open(struct inode *inode, struct file *file)
 {
 	int ret;
 
-	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	ret = tracing_open_file_tr(inode, file);
 	if (ret)
 		return ret;
 
+	/* Clear private_data to avoid warning in single_open() */
+	file->private_data = NULL;
 	return single_open(file, hist_show, file);
 }
 
@@ -5634,7 +5636,7 @@ const struct file_operations event_hist_fops = {
 	.open = event_hist_open,
 	.read = seq_read,
 	.llseek = seq_lseek,
-	.release = single_release,
+	.release = tracing_single_release_file_tr,
 };
 
 #ifdef CONFIG_HIST_TRIGGERS_DEBUG
@@ -5900,10 +5902,12 @@ static int event_hist_debug_open(struct inode *inode, struct file *file)
 {
 	int ret;
 
-	ret = security_locked_down(LOCKDOWN_TRACEFS);
+	ret = tracing_open_file_tr(inode, file);
 	if (ret)
 		return ret;
 
+	/* Clear private_data to avoid warning in single_open() */
+	file->private_data = NULL;
 	return single_open(file, hist_debug_show, file);
 }
 
@@ -5911,7 +5915,7 @@ const struct file_operations event_hist_debug_fops = {
 	.open = event_hist_debug_open,
 	.read = seq_read,
 	.llseek = seq_lseek,
-	.release = single_release,
+	.release = tracing_single_release_file_tr,
 };
 #endif
 
-- 
2.42.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [for-linus][PATCH 10/15] ring-buffer: Do not try to put back write_stamp
  2023-12-16  4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
                   ` (8 preceding siblings ...)
  2023-12-16  4:22 ` [for-linus][PATCH 09/15] tracing: Fix uaf issue when open the hist or hist_debug file Steven Rostedt
@ 2023-12-16  4:22 ` Steven Rostedt
  2023-12-16  4:22 ` [for-linus][PATCH 11/15] ring-buffer: Remove useless update to write_stamp in rb_try_to_discard() Steven Rostedt
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16  4:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	stable, Joel Fernandes, Vincent Donnefort

From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

If an update to an event is interrupted by another event between the time
the initial event allocated its buffer and where it wrote to the
write_stamp, the code try to reset the write stamp back to the what it had
just overwritten. It knows that it was overwritten via checking the
before_stamp, and if it didn't match what it wrote to the before_stamp
before it allocated its space, it knows it was overwritten.

To put back the write_stamp, it uses the before_stamp it read. The problem
here is that by writing the before_stamp to the write_stamp it makes the
two equal again, which means that the write_stamp can be considered valid
as the last timestamp written to the ring buffer. But this is not
necessarily true. The event that interrupted the event could have been
interrupted in a way that it was interrupted as well, and can end up
leaving with an invalid write_stamp. But if this happens and returns to
this context that uses the before_stamp to update the write_stamp again,
it can possibly incorrectly make it valid, causing later events to have in
correct time stamps.

As it is OK to leave this function with an invalid write_stamp (one that
doesn't match the before_stamp), there's no reason to try to make it valid
again in this case. If this race happens, then just leave with the invalid
write_stamp and the next event to come along will just add a absolute
timestamp and validate everything again.

Bonus points: This gets rid of another cmpxchg64!

Link: https://lore.kernel.org/linux-trace-kernel/20231214222921.193037a7@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Fixes: a389d86f7fd09 ("ring-buffer: Have nested events still record running time stamp")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 29 ++++++-----------------------
 1 file changed, 6 insertions(+), 23 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 1d9caee7f542..2668dde23343 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3612,14 +3612,14 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 	}

 	if (likely(tail == w)) {
-		u64 save_before;
-		bool s_ok;
-
 		/* Nothing interrupted us between A and C */
  /*D*/		rb_time_set(&cpu_buffer->write_stamp, info->ts);
-		barrier();
- /*E*/		s_ok = rb_time_read(&cpu_buffer->before_stamp, &save_before);
-		RB_WARN_ON(cpu_buffer, !s_ok);
+		/*
+		 * If something came in between C and D, the write stamp
+		 * may now not be in sync. But that's fine as the before_stamp
+		 * will be different and then next event will just be forced
+		 * to use an absolute timestamp.
+		 */
 		if (likely(!(info->add_timestamp &
 			     (RB_ADD_STAMP_FORCE | RB_ADD_STAMP_ABSOLUTE))))
 			/* This did not interrupt any time update */
@@ -3627,24 +3627,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 		else
 			/* Just use full timestamp for interrupting event */
 			info->delta = info->ts;
-		barrier();
 		check_buffer(cpu_buffer, info, tail);
-		if (unlikely(info->ts != save_before)) {
-			/* SLOW PATH - Interrupted between C and E */
-
-			a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
-			RB_WARN_ON(cpu_buffer, !a_ok);
-
-			/* Write stamp must only go forward */
-			if (save_before > info->after) {
-				/*
-				 * We do not care about the result, only that
-				 * it gets updated atomically.
-				 */
-				(void)rb_time_cmpxchg(&cpu_buffer->write_stamp,
-						      info->after, save_before);
-			}
-		}
 	} else {
 		u64 ts;
 		/* SLOW PATH - Interrupted between A and C */
-- 
2.42.0

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [for-linus][PATCH 11/15] ring-buffer: Remove useless update to write_stamp in rb_try_to_discard()
  2023-12-16  4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
                   ` (9 preceding siblings ...)
  2023-12-16  4:22 ` [for-linus][PATCH 10/15] ring-buffer: Do not try to put back write_stamp Steven Rostedt
@ 2023-12-16  4:22 ` Steven Rostedt
  2023-12-16  4:22 ` [for-linus][PATCH 12/15] ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs Steven Rostedt
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16  4:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Joel Fernandes, Vincent Donnefort

From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

When filtering is enabled, a temporary buffer is created to place the
content of the trace event output so that the filter logic can decide
from the trace event output if the trace event should be filtered out or
not. If it is to be filtered out, the content in the temporary buffer is
simply discarded, otherwise it is written into the trace buffer.

But if an interrupt were to come in while a previous event was using that
temporary buffer, the event written by the interrupt would actually go
into the ring buffer itself to prevent corrupting the data on the
temporary buffer. If the event is to be filtered out, the event in the
ring buffer is discarded, or if it fails to discard because another event
were to have already come in, it is turned into padding.

The update to the write_stamp in the rb_try_to_discard() happens after a
fix was made to force the next event after the discard to use an absolute
timestamp by setting the before_stamp to zero so it does not match the
write_stamp (which causes an event to use the absolute timestamp).

But there's an effort in rb_try_to_discard() to put back the write_stamp
to what it was before the event was added. But this is useless and
wasteful because nothing is going to be using that write_stamp for
calculations as it still will not match the before_stamp.

Remove this useless update, and in doing so, we remove another
cmpxchg64()!

Also update the comments to reflect this change as well as remove some
extra white space in another comment.

Link: https://lore.kernel.org/linux-trace-kernel/20231215081810.1f4f38fe@rorschach.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Vincent Donnefort   <vdonnefort@google.com>
Fixes: b2dd797543cf ("ring-buffer: Force absolute timestamp on discard of event")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 47 +++++++++-----------------------------
 1 file changed, 11 insertions(+), 36 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 2668dde23343..ad4af0cba159 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2983,25 +2983,6 @@ static unsigned rb_calculate_event_length(unsigned length)
 	return length;
 }
 
-static u64 rb_time_delta(struct ring_buffer_event *event)
-{
-	switch (event->type_len) {
-	case RINGBUF_TYPE_PADDING:
-		return 0;
-
-	case RINGBUF_TYPE_TIME_EXTEND:
-		return rb_event_time_stamp(event);
-
-	case RINGBUF_TYPE_TIME_STAMP:
-		return 0;
-
-	case RINGBUF_TYPE_DATA:
-		return event->time_delta;
-	default:
-		return 0;
-	}
-}
-
 static inline bool
 rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
 		  struct ring_buffer_event *event)
@@ -3009,8 +2990,6 @@ rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
 	unsigned long new_index, old_index;
 	struct buffer_page *bpage;
 	unsigned long addr;
-	u64 write_stamp;
-	u64 delta;
 
 	new_index = rb_event_index(event);
 	old_index = new_index + rb_event_ts_length(event);
@@ -3019,14 +2998,10 @@ rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
 
 	bpage = READ_ONCE(cpu_buffer->tail_page);
 
-	delta = rb_time_delta(event);
-
-	if (!rb_time_read(&cpu_buffer->write_stamp, &write_stamp))
-		return false;
-
-	/* Make sure the write stamp is read before testing the location */
-	barrier();
-
+	/*
+	 * Make sure the tail_page is still the same and
+	 * the next write location is the end of this event
+	 */
 	if (bpage->page == (void *)addr && rb_page_write(bpage) == old_index) {
 		unsigned long write_mask =
 			local_read(&bpage->write) & ~RB_WRITE_MASK;
@@ -3037,20 +3012,20 @@ rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
 		 * to make sure that the next event adds an absolute
 		 * value and does not rely on the saved write stamp, which
 		 * is now going to be bogus.
+		 *
+		 * By setting the before_stamp to zero, the next event
+		 * is not going to use the write_stamp and will instead
+		 * create an absolute timestamp. This means there's no
+		 * reason to update the wirte_stamp!
 		 */
 		rb_time_set(&cpu_buffer->before_stamp, 0);
 
-		/* Something came in, can't discard */
-		if (!rb_time_cmpxchg(&cpu_buffer->write_stamp,
-				       write_stamp, write_stamp - delta))
-			return false;
-
 		/*
 		 * If an event were to come in now, it would see that the
 		 * write_stamp and the before_stamp are different, and assume
 		 * that this event just added itself before updating
 		 * the write stamp. The interrupting event will fix the
-		 * write stamp for us, and use the before stamp as its delta.
+		 * write stamp for us, and use an absolute timestamp.
 		 */
 
 		/*
@@ -3487,7 +3462,7 @@ static void check_buffer(struct ring_buffer_per_cpu *cpu_buffer,
 		return;
 
 	/*
-	 * If this interrupted another event, 
+	 * If this interrupted another event,
 	 */
 	if (atomic_inc_return(this_cpu_ptr(&checking)) != 1)
 		goto out;
-- 
2.42.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [for-linus][PATCH 12/15] ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs
  2023-12-16  4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
                   ` (10 preceding siblings ...)
  2023-12-16  4:22 ` [for-linus][PATCH 11/15] ring-buffer: Remove useless update to write_stamp in rb_try_to_discard() Steven Rostedt
@ 2023-12-16  4:22 ` Steven Rostedt
  2023-12-16  4:22 ` [for-linus][PATCH 13/15] ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg() Steven Rostedt
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16  4:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	stable

From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

Mathieu Desnoyers pointed out an issue in the rb_time_cmpxchg() for 32 bit
architectures. That is:

 static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
 {
	unsigned long cnt, top, bottom, msb;
	unsigned long cnt2, top2, bottom2, msb2;
	u64 val;

	/* The cmpxchg always fails if it interrupted an update */
	 if (!__rb_time_read(t, &val, &cnt2))
		 return false;

	 if (val != expect)
		 return false;

<<<< interrupted here!

	 cnt = local_read(&t->cnt);

The problem is that the synchronization counter in the rb_time_t is read
*after* the value of the timestamp is read. That means if an interrupt
were to come in between the value being read and the counter being read,
it can change the value and the counter and the interrupted process would
be clueless about it!

The counter needs to be read first and then the value. That way it is easy
to tell if the value is stale or not. If the counter hasn't been updated,
then the value is still good.

Link: https://lore.kernel.org/linux-trace-kernel/20231211201324.652870-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/linux-trace-kernel/20231212115301.7a9c9a64@gandalf.local.home

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes: 10464b4aa605e ("ring-buffer: Add rb_time_t 64 bit operations for speeding up 32 bit")
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index ad4af0cba159..b8ab0557bd1b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -706,6 +706,9 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
 	unsigned long cnt2, top2, bottom2, msb2;
 	u64 val;
 
+	/* Any interruptions in this function should cause a failure */
+	cnt = local_read(&t->cnt);
+
 	/* The cmpxchg always fails if it interrupted an update */
 	 if (!__rb_time_read(t, &val, &cnt2))
 		 return false;
@@ -713,7 +716,6 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
 	 if (val != expect)
 		 return false;
 
-	 cnt = local_read(&t->cnt);
 	 if ((cnt & 3) != cnt2)
 		 return false;
 
-- 
2.42.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [for-linus][PATCH 13/15] ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg()
  2023-12-16  4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
                   ` (11 preceding siblings ...)
  2023-12-16  4:22 ` [for-linus][PATCH 12/15] ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs Steven Rostedt
@ 2023-12-16  4:22 ` Steven Rostedt
  2023-12-16  4:22 ` [for-linus][PATCH 14/15] ring-buffer: Have rb_time_cmpxchg() set the msb counter too Steven Rostedt
  2023-12-16  4:22 ` [for-linus][PATCH 15/15] ring-buffer: Do not record in NMI if the arch does not support cmpxchg in NMI Steven Rostedt
  14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16  4:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton

From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

The following race can cause rb_time_read() to observe a corrupted time
stamp:

rb_time_cmpxchg()
[...]
        if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
                return false;
        if (!rb_time_read_cmpxchg(&t->top, top, top2))
                return false;
<interrupted before updating bottom>
__rb_time_read()
[...]
        do {
                c = local_read(&t->cnt);
                top = local_read(&t->top);
                bottom = local_read(&t->bottom);
                msb = local_read(&t->msb);
        } while (c != local_read(&t->cnt));

        *cnt = rb_time_cnt(top);

        /* If top and msb counts don't match, this interrupted a write */
        if (*cnt != rb_time_cnt(msb))
                return false;
          ^ this check fails to catch that "bottom" is still not updated.

So the old "bottom" value is returned, which is wrong.

Fix this by checking that all three of msb, top, and bottom 2-bit cnt
values match.

The reason to favor checking all three fields over requiring a specific
update order for both rb_time_set() and rb_time_cmpxchg() is because
checking all three fields is more robust to handle partial failures of
rb_time_cmpxchg() when interrupted by nested rb_time_set().

Link: https://lore.kernel.org/lkml/20231211201324.652870-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/linux-trace-kernel/20231212193049.680122-1-mathieu.desnoyers@efficios.com

Fixes: f458a1453424e ("ring-buffer: Test last update in 32bit version of __rb_time_read()")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index b8ab0557bd1b..f22a849da179 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -644,8 +644,8 @@ static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt)
 
 	*cnt = rb_time_cnt(top);
 
-	/* If top and msb counts don't match, this interrupted a write */
-	if (*cnt != rb_time_cnt(msb))
+	/* If top, msb or bottom counts don't match, this interrupted a write */
+	if (*cnt != rb_time_cnt(msb) || *cnt != rb_time_cnt(bottom))
 		return false;
 
 	/* The shift to msb will lose its cnt bits */
-- 
2.42.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [for-linus][PATCH 14/15] ring-buffer: Have rb_time_cmpxchg() set the msb counter too
  2023-12-16  4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
                   ` (12 preceding siblings ...)
  2023-12-16  4:22 ` [for-linus][PATCH 13/15] ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg() Steven Rostedt
@ 2023-12-16  4:22 ` Steven Rostedt
  2023-12-16  4:22 ` [for-linus][PATCH 15/15] ring-buffer: Do not record in NMI if the arch does not support cmpxchg in NMI Steven Rostedt
  14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16  4:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	stable

From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

The rb_time_cmpxchg() on 32-bit architectures requires setting three
32-bit words to represent the 64-bit timestamp, with some salt for
synchronization. Those are: msb, top, and bottom

The issue is, the rb_time_cmpxchg() did not properly salt the msb portion,
and the msb that was written was stale.

Link: https://lore.kernel.org/linux-trace-kernel/20231215084114.20899342@rorschach.local.home

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: f03f2abce4f39 ("ring-buffer: Have 32 bit time stamps use all 64 bits")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f22a849da179..f4679013289b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -722,10 +722,12 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
 	 cnt2 = cnt + 1;
 
 	 rb_time_split(val, &top, &bottom, &msb);
+	 msb = rb_time_val_cnt(msb, cnt);
 	 top = rb_time_val_cnt(top, cnt);
 	 bottom = rb_time_val_cnt(bottom, cnt);
 
 	 rb_time_split(set, &top2, &bottom2, &msb2);
+	 msb2 = rb_time_val_cnt(msb2, cnt);
 	 top2 = rb_time_val_cnt(top2, cnt2);
 	 bottom2 = rb_time_val_cnt(bottom2, cnt2);
 
-- 
2.42.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [for-linus][PATCH 15/15] ring-buffer: Do not record in NMI if the arch does not support cmpxchg in NMI
  2023-12-16  4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
                   ` (13 preceding siblings ...)
  2023-12-16  4:22 ` [for-linus][PATCH 14/15] ring-buffer: Have rb_time_cmpxchg() set the msb counter too Steven Rostedt
@ 2023-12-16  4:22 ` Steven Rostedt
  14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16  4:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton

From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

As the ring buffer recording requires cmpxchg() to work, if the
architecture does not support cmpxchg in NMI, then do not do any recording
within an NMI.

Link: https://lore.kernel.org/linux-trace-kernel/20231213175403.6fc18540@gandalf.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f4679013289b..5a114e752f11 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3674,6 +3674,12 @@ rb_reserve_next_event(struct trace_buffer *buffer,
 	int nr_loops = 0;
 	int add_ts_default;
 
+	/* ring buffer does cmpxchg, make sure it is safe in NMI context */
+	if (!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) &&
+	    (unlikely(in_nmi()))) {
+		return NULL;
+	}
+
 	rb_start_commit(cpu_buffer);
 	/* The commit page can not change after this */
 
-- 
2.42.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-12-16  4:21 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-12-16  4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
2023-12-16  4:22 ` [for-linus][PATCH 01/15] ring-buffer: Fix writing to the buffer with max_data_size Steven Rostedt
2023-12-16  4:22 ` [for-linus][PATCH 02/15] tracing: Have large events show up as [LINE TOO BIG] instead of nothing Steven Rostedt
2023-12-16  4:22 ` [for-linus][PATCH 03/15] eventfs: Fix events beyond NAME_MAX blocking tasks Steven Rostedt
2023-12-16  4:22 ` [for-linus][PATCH 04/15] ring-buffer: Fix memory leak of free page Steven Rostedt
2023-12-16  4:22 ` [for-linus][PATCH 05/15] tracing: Update snapshot buffer on resize if it is allocated Steven Rostedt
2023-12-16  4:22 ` [for-linus][PATCH 06/15] ring-buffer: Do not update before stamp when switching sub-buffers Steven Rostedt
2023-12-16  4:22 ` [for-linus][PATCH 07/15] ring-buffer: Have saved event hold the entire event Steven Rostedt
2023-12-16  4:22 ` [for-linus][PATCH 08/15] tracing: Add size check when printing trace_marker output Steven Rostedt
2023-12-16  4:22 ` [for-linus][PATCH 09/15] tracing: Fix uaf issue when open the hist or hist_debug file Steven Rostedt
2023-12-16  4:22 ` [for-linus][PATCH 10/15] ring-buffer: Do not try to put back write_stamp Steven Rostedt
2023-12-16  4:22 ` [for-linus][PATCH 11/15] ring-buffer: Remove useless update to write_stamp in rb_try_to_discard() Steven Rostedt
2023-12-16  4:22 ` [for-linus][PATCH 12/15] ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs Steven Rostedt
2023-12-16  4:22 ` [for-linus][PATCH 13/15] ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg() Steven Rostedt
2023-12-16  4:22 ` [for-linus][PATCH 14/15] ring-buffer: Have rb_time_cmpxchg() set the msb counter too Steven Rostedt
2023-12-16  4:22 ` [for-linus][PATCH 15/15] ring-buffer: Do not record in NMI if the arch does not support cmpxchg in NMI Steven Rostedt

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.