* [for-linus][PATCH 01/15] ring-buffer: Fix writing to the buffer with max_data_size
2023-12-16 4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
@ 2023-12-16 4:22 ` Steven Rostedt
2023-12-16 4:22 ` [for-linus][PATCH 02/15] tracing: Have large events show up as [LINE TOO BIG] instead of nothing Steven Rostedt
` (13 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16 4:22 UTC (permalink / raw)
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
stable
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
The maximum ring buffer data size is the maximum size of data that can be
recorded on the ring buffer. Events must be smaller than the sub buffer
data size minus any meta data. This size is checked before trying to
allocate from the ring buffer because the allocation assumes that the size
will fit on the sub buffer.
The maximum size was calculated as the size of a sub buffer page (which is
currently PAGE_SIZE minus the sub buffer header) minus the size of the
meta data of an individual event. But it missed the possible adding of a
time stamp for events that are added long enough apart that the event meta
data can't hold the time delta.
When an event is added that is greater than the current BUF_MAX_DATA_SIZE
minus the size of a time stamp, but still less than or equal to
BUF_MAX_DATA_SIZE, the ring buffer would go into an infinite loop, looking
for a page that can hold the event. Luckily, there's a check for this loop
and after 1000 iterations and a warning is emitted and the ring buffer is
disabled. But this should never happen.
This can happen when a large event is added first, or after a long period
where an absolute timestamp is prefixed to the event, increasing its size
by 8 bytes. This passes the check and then goes into the algorithm that
causes the infinite loop.
For events that are the first event on the sub-buffer, it does not need to
add a timestamp, because the sub-buffer itself contains an absolute
timestamp, and adding one is redundant.
The fix is to check if the event is to be the first event on the
sub-buffer, and if it is, then do not add a timestamp.
This also fixes 32 bit adding a timestamp when a read of before_stamp or
write_stamp is interrupted. There's still no need to add that timestamp if
the event is going to be the first event on the sub buffer.
Also, if the buffer has "time_stamp_abs" set, then also check if the
length plus the timestamp is greater than the BUF_MAX_DATA_SIZE.
Link: https://lore.kernel.org/all/20231212104549.58863438@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20231212071837.5fdd6c13@gandalf.local.home
Link: https://lore.kernel.org/linux-trace-kernel/20231212111617.39e02849@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: a4543a2fa9ef3 ("ring-buffer: Get timestamp after event is allocated")
Fixes: 58fbc3c63275c ("ring-buffer: Consolidate add_timestamp to remove some branches")
Reported-by: Kent Overstreet <kent.overstreet@linux.dev> # (on IRC)
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/ring_buffer.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8d2a4f00eca9..b8986f82eccf 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3579,7 +3579,10 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
* absolute timestamp.
* Don't bother if this is the start of a new page (w == 0).
*/
- if (unlikely(!a_ok || !b_ok || (info->before != info->after && w))) {
+ if (!w) {
+ /* Use the sub-buffer timestamp */
+ info->delta = 0;
+ } else if (unlikely(!a_ok || !b_ok || info->before != info->after)) {
info->add_timestamp |= RB_ADD_STAMP_FORCE | RB_ADD_STAMP_EXTEND;
info->length += RB_LEN_TIME_EXTEND;
} else {
@@ -3737,6 +3740,8 @@ rb_reserve_next_event(struct trace_buffer *buffer,
if (ring_buffer_time_stamp_abs(cpu_buffer->buffer)) {
add_ts_default = RB_ADD_STAMP_ABSOLUTE;
info.length += RB_LEN_TIME_EXTEND;
+ if (info.length > BUF_MAX_DATA_SIZE)
+ goto out_fail;
} else {
add_ts_default = RB_ADD_STAMP_NONE;
}
--
2.42.0
^ permalink raw reply related [flat|nested] 16+ messages in thread* [for-linus][PATCH 02/15] tracing: Have large events show up as [LINE TOO BIG] instead of nothing
2023-12-16 4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
2023-12-16 4:22 ` [for-linus][PATCH 01/15] ring-buffer: Fix writing to the buffer with max_data_size Steven Rostedt
@ 2023-12-16 4:22 ` Steven Rostedt
2023-12-16 4:22 ` [for-linus][PATCH 03/15] eventfs: Fix events beyond NAME_MAX blocking tasks Steven Rostedt
` (12 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16 4:22 UTC (permalink / raw)
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
If a large event was added to the ring buffer that is larger than what the
trace_seq can handle, it just drops the output:
~# cat /sys/kernel/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 2/2 #P:8
#
# _-----=> irqs-off/BH-disabled
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
<...>-859 [001] ..... 141.118951: tracing_mark_write <...>-859 [001] ..... 141.148201: tracing_mark_write: 78901234
Instead, catch this case and add some context:
~# cat /sys/kernel/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 2/2 #P:8
#
# _-----=> irqs-off/BH-disabled
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
<...>-852 [001] ..... 121.550551: tracing_mark_write[LINE TOO BIG]
<...>-852 [001] ..... 121.550581: tracing_mark_write: 78901234
This now emulates the same output as trace_pipe.
Link: https://lore.kernel.org/linux-trace-kernel/20231209171058.78c1a026@gandalf.local.home
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/trace.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index fbcd3bafb93e..aa8f99f3e5de 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4722,7 +4722,11 @@ static int s_show(struct seq_file *m, void *v)
iter->leftover = ret;
} else {
- print_trace_line(iter);
+ ret = print_trace_line(iter);
+ if (ret == TRACE_TYPE_PARTIAL_LINE) {
+ iter->seq.full = 0;
+ trace_seq_puts(&iter->seq, "[LINE TOO BIG]\n");
+ }
ret = trace_print_seq(m, &iter->seq);
/*
* If we overflow the seq_file buffer, then it will
--
2.42.0
^ permalink raw reply related [flat|nested] 16+ messages in thread* [for-linus][PATCH 03/15] eventfs: Fix events beyond NAME_MAX blocking tasks
2023-12-16 4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
2023-12-16 4:22 ` [for-linus][PATCH 01/15] ring-buffer: Fix writing to the buffer with max_data_size Steven Rostedt
2023-12-16 4:22 ` [for-linus][PATCH 02/15] tracing: Have large events show up as [LINE TOO BIG] instead of nothing Steven Rostedt
@ 2023-12-16 4:22 ` Steven Rostedt
2023-12-16 4:22 ` [for-linus][PATCH 04/15] ring-buffer: Fix memory leak of free page Steven Rostedt
` (11 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16 4:22 UTC (permalink / raw)
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Beau Belgrave
From: Beau Belgrave <beaub@linux.microsoft.com>
Eventfs uses simple_lookup(), however, it will fail if the name of the
entry is beyond NAME_MAX length. When this error is encountered, eventfs
still tries to create dentries instead of skipping the dentry creation.
When the dentry is attempted to be created in this state d_wait_lookup()
will loop forever, waiting for the lookup to be removed.
Fix eventfs to return the error in simple_lookup() back to the caller
instead of continuing to try to create the dentry.
Link: https://lore.kernel.org/linux-trace-kernel/20231210213534.497-1-beaub@linux.microsoft.com
Fixes: 63940449555e ("eventfs: Implement eventfs lookup, read, open functions")
Link: https://lore.kernel.org/linux-trace-kernel/20231208183601.GA46-beaub@linux.microsoft.com/
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
fs/tracefs/event_inode.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 0b90869fd805..43e237864a42 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -546,6 +546,8 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
if (strcmp(ei_child->name, name) != 0)
continue;
ret = simple_lookup(dir, dentry, flags);
+ if (IS_ERR(ret))
+ goto out;
create_dir_dentry(ei, ei_child, ei_dentry, true);
created = true;
break;
@@ -568,6 +570,8 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
if (r <= 0)
continue;
ret = simple_lookup(dir, dentry, flags);
+ if (IS_ERR(ret))
+ goto out;
create_file_dentry(ei, i, ei_dentry, name, mode, cdata,
fops, true);
break;
--
2.42.0
^ permalink raw reply related [flat|nested] 16+ messages in thread* [for-linus][PATCH 04/15] ring-buffer: Fix memory leak of free page
2023-12-16 4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
` (2 preceding siblings ...)
2023-12-16 4:22 ` [for-linus][PATCH 03/15] eventfs: Fix events beyond NAME_MAX blocking tasks Steven Rostedt
@ 2023-12-16 4:22 ` Steven Rostedt
2023-12-16 4:22 ` [for-linus][PATCH 05/15] tracing: Update snapshot buffer on resize if it is allocated Steven Rostedt
` (10 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16 4:22 UTC (permalink / raw)
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
stable
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Reading the ring buffer does a swap of a sub-buffer within the ring buffer
with a empty sub-buffer. This allows the reader to have full access to the
content of the sub-buffer that was swapped out without having to worry
about contention with the writer.
The readers call ring_buffer_alloc_read_page() to allocate a page that
will be used to swap with the ring buffer. When the code is finished with
the reader page, it calls ring_buffer_free_read_page(). Instead of freeing
the page, it stores it as a spare. Then next call to
ring_buffer_alloc_read_page() will return this spare instead of calling
into the memory management system to allocate a new page.
Unfortunately, on freeing of the ring buffer, this spare page is not
freed, and causes a memory leak.
Link: https://lore.kernel.org/linux-trace-kernel/20231210221250.7b9cc83c@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 73a757e63114d ("ring-buffer: Return reader page back into existing ring buffer")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/ring_buffer.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index b8986f82eccf..dcd47895b424 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1787,6 +1787,8 @@ static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
free_buffer_page(bpage);
}
+ free_page((unsigned long)cpu_buffer->free_page);
+
kfree(cpu_buffer);
}
--
2.42.0
^ permalink raw reply related [flat|nested] 16+ messages in thread* [for-linus][PATCH 05/15] tracing: Update snapshot buffer on resize if it is allocated
2023-12-16 4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
` (3 preceding siblings ...)
2023-12-16 4:22 ` [for-linus][PATCH 04/15] ring-buffer: Fix memory leak of free page Steven Rostedt
@ 2023-12-16 4:22 ` Steven Rostedt
2023-12-16 4:22 ` [for-linus][PATCH 06/15] ring-buffer: Do not update before stamp when switching sub-buffers Steven Rostedt
` (9 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16 4:22 UTC (permalink / raw)
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
stable
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
The snapshot buffer is to mimic the main buffer so that when a snapshot is
needed, the snapshot and main buffer are swapped. When the snapshot buffer
is allocated, it is set to the minimal size that the ring buffer may be at
and still functional. When it is allocated it becomes the same size as the
main ring buffer, and when the main ring buffer changes in size, it should
do.
Currently, the resize only updates the snapshot buffer if it's used by the
current tracer (ie. the preemptirqsoff tracer). But it needs to be updated
anytime it is allocated.
When changing the size of the main buffer, instead of looking to see if
the current tracer is utilizing the snapshot buffer, just check if it is
allocated to know if it should be updated or not.
Also fix typo in comment just above the code change.
Link: https://lore.kernel.org/linux-trace-kernel/20231210225447.48476a6a@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: ad909e21bbe69 ("tracing: Add internal tracing_snapshot() functions")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/trace.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index aa8f99f3e5de..6c79548f9574 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6348,7 +6348,7 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
if (!tr->array_buffer.buffer)
return 0;
- /* Do not allow tracing while resizng ring buffer */
+ /* Do not allow tracing while resizing ring buffer */
tracing_stop_tr(tr);
ret = ring_buffer_resize(tr->array_buffer.buffer, size, cpu);
@@ -6356,7 +6356,7 @@ static int __tracing_resize_ring_buffer(struct trace_array *tr,
goto out_start;
#ifdef CONFIG_TRACER_MAX_TRACE
- if (!tr->current_trace->use_max_tr)
+ if (!tr->allocated_snapshot)
goto out;
ret = ring_buffer_resize(tr->max_buffer.buffer, size, cpu);
--
2.42.0
^ permalink raw reply related [flat|nested] 16+ messages in thread* [for-linus][PATCH 06/15] ring-buffer: Do not update before stamp when switching sub-buffers
2023-12-16 4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
` (4 preceding siblings ...)
2023-12-16 4:22 ` [for-linus][PATCH 05/15] tracing: Update snapshot buffer on resize if it is allocated Steven Rostedt
@ 2023-12-16 4:22 ` Steven Rostedt
2023-12-16 4:22 ` [for-linus][PATCH 07/15] ring-buffer: Have saved event hold the entire event Steven Rostedt
` (8 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16 4:22 UTC (permalink / raw)
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
stable
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
The ring buffer timestamps are synchronized by two timestamp placeholders.
One is the "before_stamp" and the other is the "write_stamp" (sometimes
referred to as the "after stamp" but only in the comments. These two
stamps are key to knowing how to handle nested events coming in with a
lockless system.
When moving across sub-buffers, the before stamp is updated but the write
stamp is not. There's an effort to put back the before stamp to something
that seems logical in case there's nested events. But as the current event
is about to cross sub-buffers, and so will any new nested event that happens,
updating the before stamp is useless, and could even introduce new race
conditions.
The first event on a sub-buffer simply uses the sub-buffer's timestamp
and keeps a "delta" of zero. The "before_stamp" and "write_stamp" are not
used in the algorithm in this case. There's no reason to try to fix the
before_stamp when this happens.
As a bonus, it removes a cmpxchg() when crossing sub-buffers!
Link: https://lore.kernel.org/linux-trace-kernel/20231211114420.36dde01b@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: a389d86f7fd09 ("ring-buffer: Have nested events still record running time stamp")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/ring_buffer.c | 9 +--------
1 file changed, 1 insertion(+), 8 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index dcd47895b424..c7abcc215fe2 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3607,14 +3607,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
/* See if we shot pass the end of this buffer page */
if (unlikely(write > BUF_PAGE_SIZE)) {
- /* before and after may now different, fix it up*/
- b_ok = rb_time_read(&cpu_buffer->before_stamp, &info->before);
- a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
- if (a_ok && b_ok && info->before != info->after)
- (void)rb_time_cmpxchg(&cpu_buffer->before_stamp,
- info->before, info->after);
- if (a_ok && b_ok)
- check_buffer(cpu_buffer, info, CHECK_FULL_PAGE);
+ check_buffer(cpu_buffer, info, CHECK_FULL_PAGE);
return rb_move_tail(cpu_buffer, tail, info);
}
--
2.42.0
^ permalink raw reply related [flat|nested] 16+ messages in thread* [for-linus][PATCH 07/15] ring-buffer: Have saved event hold the entire event
2023-12-16 4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
` (5 preceding siblings ...)
2023-12-16 4:22 ` [for-linus][PATCH 06/15] ring-buffer: Do not update before stamp when switching sub-buffers Steven Rostedt
@ 2023-12-16 4:22 ` Steven Rostedt
2023-12-16 4:22 ` [for-linus][PATCH 08/15] tracing: Add size check when printing trace_marker output Steven Rostedt
` (7 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16 4:22 UTC (permalink / raw)
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
stable
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
For the ring buffer iterator (non-consuming read), the event needs to be
copied into the iterator buffer to make sure that a writer does not
overwrite it while the user is reading it. If a write happens during the
copy, the buffer is simply discarded.
But the temp buffer itself was not big enough. The allocation of the
buffer was only BUF_MAX_DATA_SIZE, which is the maximum data size that can
be passed into the ring buffer and saved. But the temp buffer needs to
hold the meta data as well. That would be BUF_PAGE_SIZE and not
BUF_MAX_DATA_SIZE.
Link: https://lore.kernel.org/linux-trace-kernel/20231212072558.61f76493@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 785888c544e04 ("ring-buffer: Have rb_iter_head_event() handle concurrent writer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/ring_buffer.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index c7abcc215fe2..1d9caee7f542 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2409,7 +2409,7 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
*/
barrier();
- if ((iter->head + length) > commit || length > BUF_MAX_DATA_SIZE)
+ if ((iter->head + length) > commit || length > BUF_PAGE_SIZE)
/* Writer corrupted the read? */
goto reset;
@@ -5118,7 +5118,8 @@ ring_buffer_read_prepare(struct trace_buffer *buffer, int cpu, gfp_t flags)
if (!iter)
return NULL;
- iter->event = kmalloc(BUF_MAX_DATA_SIZE, flags);
+ /* Holds the entire event: data and meta data */
+ iter->event = kmalloc(BUF_PAGE_SIZE, flags);
if (!iter->event) {
kfree(iter);
return NULL;
--
2.42.0
^ permalink raw reply related [flat|nested] 16+ messages in thread* [for-linus][PATCH 08/15] tracing: Add size check when printing trace_marker output
2023-12-16 4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
` (6 preceding siblings ...)
2023-12-16 4:22 ` [for-linus][PATCH 07/15] ring-buffer: Have saved event hold the entire event Steven Rostedt
@ 2023-12-16 4:22 ` Steven Rostedt
2023-12-16 4:22 ` [for-linus][PATCH 09/15] tracing: Fix uaf issue when open the hist or hist_debug file Steven Rostedt
` (6 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16 4:22 UTC (permalink / raw)
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
If for some reason the trace_marker write does not have a nul byte for the
string, it will overflow the print:
trace_seq_printf(s, ": %s", field->buf);
The field->buf could be missing the nul byte. To prevent overflow, add the
max size that the buf can be by using the event size and the field
location.
int max = iter->ent_size - offsetof(struct print_entry, buf);
trace_seq_printf(s, ": %*.s", max, field->buf);
Link: https://lore.kernel.org/linux-trace-kernel/20231212084444.4619b8ce@gandalf.local.home
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/trace_output.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index d8b302d01083..3e7fa44dc2b2 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1587,11 +1587,12 @@ static enum print_line_t trace_print_print(struct trace_iterator *iter,
{
struct print_entry *field;
struct trace_seq *s = &iter->seq;
+ int max = iter->ent_size - offsetof(struct print_entry, buf);
trace_assign_type(field, iter->ent);
seq_print_ip_sym(s, field->ip, flags);
- trace_seq_printf(s, ": %s", field->buf);
+ trace_seq_printf(s, ": %.*s", max, field->buf);
return trace_handle_return(s);
}
@@ -1600,10 +1601,11 @@ static enum print_line_t trace_print_raw(struct trace_iterator *iter, int flags,
struct trace_event *event)
{
struct print_entry *field;
+ int max = iter->ent_size - offsetof(struct print_entry, buf);
trace_assign_type(field, iter->ent);
- trace_seq_printf(&iter->seq, "# %lx %s", field->ip, field->buf);
+ trace_seq_printf(&iter->seq, "# %lx %.*s", field->ip, max, field->buf);
return trace_handle_return(&iter->seq);
}
--
2.42.0
^ permalink raw reply related [flat|nested] 16+ messages in thread* [for-linus][PATCH 09/15] tracing: Fix uaf issue when open the hist or hist_debug file
2023-12-16 4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
` (7 preceding siblings ...)
2023-12-16 4:22 ` [for-linus][PATCH 08/15] tracing: Add size check when printing trace_marker output Steven Rostedt
@ 2023-12-16 4:22 ` Steven Rostedt
2023-12-16 4:22 ` [for-linus][PATCH 10/15] ring-buffer: Do not try to put back write_stamp Steven Rostedt
` (5 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16 4:22 UTC (permalink / raw)
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Zheng Yejian
From: Zheng Yejian <zhengyejian1@huawei.com>
KASAN report following issue. The root cause is when opening 'hist'
file of an instance and accessing 'trace_event_file' in hist_show(),
but 'trace_event_file' has been freed due to the instance being removed.
'hist_debug' file has the same problem. To fix it, call
tracing_{open,release}_file_tr() in file_operations callback to have
the ref count and avoid 'trace_event_file' being freed.
BUG: KASAN: slab-use-after-free in hist_show+0x11e0/0x1278
Read of size 8 at addr ffff242541e336b8 by task head/190
CPU: 4 PID: 190 Comm: head Not tainted 6.7.0-rc5-g26aff849438c #133
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x98/0xf8
show_stack+0x1c/0x30
dump_stack_lvl+0x44/0x58
print_report+0xf0/0x5a0
kasan_report+0x80/0xc0
__asan_report_load8_noabort+0x1c/0x28
hist_show+0x11e0/0x1278
seq_read_iter+0x344/0xd78
seq_read+0x128/0x1c0
vfs_read+0x198/0x6c8
ksys_read+0xf4/0x1e0
__arm64_sys_read+0x70/0xa8
invoke_syscall+0x70/0x260
el0_svc_common.constprop.0+0xb0/0x280
do_el0_svc+0x44/0x60
el0_svc+0x34/0x68
el0t_64_sync_handler+0xb8/0xc0
el0t_64_sync+0x168/0x170
Allocated by task 188:
kasan_save_stack+0x28/0x50
kasan_set_track+0x28/0x38
kasan_save_alloc_info+0x20/0x30
__kasan_slab_alloc+0x6c/0x80
kmem_cache_alloc+0x15c/0x4a8
trace_create_new_event+0x84/0x348
__trace_add_new_event+0x18/0x88
event_trace_add_tracer+0xc4/0x1a0
trace_array_create_dir+0x6c/0x100
trace_array_create+0x2e8/0x568
instance_mkdir+0x48/0x80
tracefs_syscall_mkdir+0x90/0xe8
vfs_mkdir+0x3c4/0x610
do_mkdirat+0x144/0x200
__arm64_sys_mkdirat+0x8c/0xc0
invoke_syscall+0x70/0x260
el0_svc_common.constprop.0+0xb0/0x280
do_el0_svc+0x44/0x60
el0_svc+0x34/0x68
el0t_64_sync_handler+0xb8/0xc0
el0t_64_sync+0x168/0x170
Freed by task 191:
kasan_save_stack+0x28/0x50
kasan_set_track+0x28/0x38
kasan_save_free_info+0x34/0x58
__kasan_slab_free+0xe4/0x158
kmem_cache_free+0x19c/0x508
event_file_put+0xa0/0x120
remove_event_file_dir+0x180/0x320
event_trace_del_tracer+0xb0/0x180
__remove_instance+0x224/0x508
instance_rmdir+0x44/0x78
tracefs_syscall_rmdir+0xbc/0x140
vfs_rmdir+0x1cc/0x4c8
do_rmdir+0x220/0x2b8
__arm64_sys_unlinkat+0xc0/0x100
invoke_syscall+0x70/0x260
el0_svc_common.constprop.0+0xb0/0x280
do_el0_svc+0x44/0x60
el0_svc+0x34/0x68
el0t_64_sync_handler+0xb8/0xc0
el0t_64_sync+0x168/0x170
Link: https://lore.kernel.org/linux-trace-kernel/20231214012153.676155-1-zhengyejian1@huawei.com
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/trace.c | 6 ++++++
kernel/trace/trace.h | 1 +
kernel/trace/trace_events_hist.c | 12 ++++++++----
3 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6c79548f9574..199df497db07 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4968,6 +4968,12 @@ int tracing_release_file_tr(struct inode *inode, struct file *filp)
return 0;
}
+int tracing_single_release_file_tr(struct inode *inode, struct file *filp)
+{
+ tracing_release_file_tr(inode, filp);
+ return single_release(inode, filp);
+}
+
static int tracing_mark_open(struct inode *inode, struct file *filp)
{
stream_open(inode, filp);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index b7f4ea25a194..0489e72c8169 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -617,6 +617,7 @@ int tracing_open_generic(struct inode *inode, struct file *filp);
int tracing_open_generic_tr(struct inode *inode, struct file *filp);
int tracing_open_file_tr(struct inode *inode, struct file *filp);
int tracing_release_file_tr(struct inode *inode, struct file *filp);
+int tracing_single_release_file_tr(struct inode *inode, struct file *filp);
bool tracing_is_disabled(void);
bool tracer_tracing_is_on(struct trace_array *tr);
void tracer_tracing_on(struct trace_array *tr);
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 1abc07fba1b9..5ecf3c8bde20 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -5623,10 +5623,12 @@ static int event_hist_open(struct inode *inode, struct file *file)
{
int ret;
- ret = security_locked_down(LOCKDOWN_TRACEFS);
+ ret = tracing_open_file_tr(inode, file);
if (ret)
return ret;
+ /* Clear private_data to avoid warning in single_open() */
+ file->private_data = NULL;
return single_open(file, hist_show, file);
}
@@ -5634,7 +5636,7 @@ const struct file_operations event_hist_fops = {
.open = event_hist_open,
.read = seq_read,
.llseek = seq_lseek,
- .release = single_release,
+ .release = tracing_single_release_file_tr,
};
#ifdef CONFIG_HIST_TRIGGERS_DEBUG
@@ -5900,10 +5902,12 @@ static int event_hist_debug_open(struct inode *inode, struct file *file)
{
int ret;
- ret = security_locked_down(LOCKDOWN_TRACEFS);
+ ret = tracing_open_file_tr(inode, file);
if (ret)
return ret;
+ /* Clear private_data to avoid warning in single_open() */
+ file->private_data = NULL;
return single_open(file, hist_debug_show, file);
}
@@ -5911,7 +5915,7 @@ const struct file_operations event_hist_debug_fops = {
.open = event_hist_debug_open,
.read = seq_read,
.llseek = seq_lseek,
- .release = single_release,
+ .release = tracing_single_release_file_tr,
};
#endif
--
2.42.0
^ permalink raw reply related [flat|nested] 16+ messages in thread* [for-linus][PATCH 10/15] ring-buffer: Do not try to put back write_stamp
2023-12-16 4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
` (8 preceding siblings ...)
2023-12-16 4:22 ` [for-linus][PATCH 09/15] tracing: Fix uaf issue when open the hist or hist_debug file Steven Rostedt
@ 2023-12-16 4:22 ` Steven Rostedt
2023-12-16 4:22 ` [for-linus][PATCH 11/15] ring-buffer: Remove useless update to write_stamp in rb_try_to_discard() Steven Rostedt
` (4 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16 4:22 UTC (permalink / raw)
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
stable, Joel Fernandes, Vincent Donnefort
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
If an update to an event is interrupted by another event between the time
the initial event allocated its buffer and where it wrote to the
write_stamp, the code try to reset the write stamp back to the what it had
just overwritten. It knows that it was overwritten via checking the
before_stamp, and if it didn't match what it wrote to the before_stamp
before it allocated its space, it knows it was overwritten.
To put back the write_stamp, it uses the before_stamp it read. The problem
here is that by writing the before_stamp to the write_stamp it makes the
two equal again, which means that the write_stamp can be considered valid
as the last timestamp written to the ring buffer. But this is not
necessarily true. The event that interrupted the event could have been
interrupted in a way that it was interrupted as well, and can end up
leaving with an invalid write_stamp. But if this happens and returns to
this context that uses the before_stamp to update the write_stamp again,
it can possibly incorrectly make it valid, causing later events to have in
correct time stamps.
As it is OK to leave this function with an invalid write_stamp (one that
doesn't match the before_stamp), there's no reason to try to make it valid
again in this case. If this race happens, then just leave with the invalid
write_stamp and the next event to come along will just add a absolute
timestamp and validate everything again.
Bonus points: This gets rid of another cmpxchg64!
Link: https://lore.kernel.org/linux-trace-kernel/20231214222921.193037a7@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Fixes: a389d86f7fd09 ("ring-buffer: Have nested events still record running time stamp")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/ring_buffer.c | 29 ++++++-----------------------
1 file changed, 6 insertions(+), 23 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 1d9caee7f542..2668dde23343 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3612,14 +3612,14 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
}
if (likely(tail == w)) {
- u64 save_before;
- bool s_ok;
-
/* Nothing interrupted us between A and C */
/*D*/ rb_time_set(&cpu_buffer->write_stamp, info->ts);
- barrier();
- /*E*/ s_ok = rb_time_read(&cpu_buffer->before_stamp, &save_before);
- RB_WARN_ON(cpu_buffer, !s_ok);
+ /*
+ * If something came in between C and D, the write stamp
+ * may now not be in sync. But that's fine as the before_stamp
+ * will be different and then next event will just be forced
+ * to use an absolute timestamp.
+ */
if (likely(!(info->add_timestamp &
(RB_ADD_STAMP_FORCE | RB_ADD_STAMP_ABSOLUTE))))
/* This did not interrupt any time update */
@@ -3627,24 +3627,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
else
/* Just use full timestamp for interrupting event */
info->delta = info->ts;
- barrier();
check_buffer(cpu_buffer, info, tail);
- if (unlikely(info->ts != save_before)) {
- /* SLOW PATH - Interrupted between C and E */
-
- a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
- RB_WARN_ON(cpu_buffer, !a_ok);
-
- /* Write stamp must only go forward */
- if (save_before > info->after) {
- /*
- * We do not care about the result, only that
- * it gets updated atomically.
- */
- (void)rb_time_cmpxchg(&cpu_buffer->write_stamp,
- info->after, save_before);
- }
- }
} else {
u64 ts;
/* SLOW PATH - Interrupted between A and C */
--
2.42.0
^ permalink raw reply related [flat|nested] 16+ messages in thread* [for-linus][PATCH 11/15] ring-buffer: Remove useless update to write_stamp in rb_try_to_discard()
2023-12-16 4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
` (9 preceding siblings ...)
2023-12-16 4:22 ` [for-linus][PATCH 10/15] ring-buffer: Do not try to put back write_stamp Steven Rostedt
@ 2023-12-16 4:22 ` Steven Rostedt
2023-12-16 4:22 ` [for-linus][PATCH 12/15] ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs Steven Rostedt
` (3 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16 4:22 UTC (permalink / raw)
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Joel Fernandes, Vincent Donnefort
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
When filtering is enabled, a temporary buffer is created to place the
content of the trace event output so that the filter logic can decide
from the trace event output if the trace event should be filtered out or
not. If it is to be filtered out, the content in the temporary buffer is
simply discarded, otherwise it is written into the trace buffer.
But if an interrupt were to come in while a previous event was using that
temporary buffer, the event written by the interrupt would actually go
into the ring buffer itself to prevent corrupting the data on the
temporary buffer. If the event is to be filtered out, the event in the
ring buffer is discarded, or if it fails to discard because another event
were to have already come in, it is turned into padding.
The update to the write_stamp in the rb_try_to_discard() happens after a
fix was made to force the next event after the discard to use an absolute
timestamp by setting the before_stamp to zero so it does not match the
write_stamp (which causes an event to use the absolute timestamp).
But there's an effort in rb_try_to_discard() to put back the write_stamp
to what it was before the event was added. But this is useless and
wasteful because nothing is going to be using that write_stamp for
calculations as it still will not match the before_stamp.
Remove this useless update, and in doing so, we remove another
cmpxchg64()!
Also update the comments to reflect this change as well as remove some
extra white space in another comment.
Link: https://lore.kernel.org/linux-trace-kernel/20231215081810.1f4f38fe@rorschach.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Fixes: b2dd797543cf ("ring-buffer: Force absolute timestamp on discard of event")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/ring_buffer.c | 47 +++++++++-----------------------------
1 file changed, 11 insertions(+), 36 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 2668dde23343..ad4af0cba159 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2983,25 +2983,6 @@ static unsigned rb_calculate_event_length(unsigned length)
return length;
}
-static u64 rb_time_delta(struct ring_buffer_event *event)
-{
- switch (event->type_len) {
- case RINGBUF_TYPE_PADDING:
- return 0;
-
- case RINGBUF_TYPE_TIME_EXTEND:
- return rb_event_time_stamp(event);
-
- case RINGBUF_TYPE_TIME_STAMP:
- return 0;
-
- case RINGBUF_TYPE_DATA:
- return event->time_delta;
- default:
- return 0;
- }
-}
-
static inline bool
rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
struct ring_buffer_event *event)
@@ -3009,8 +2990,6 @@ rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
unsigned long new_index, old_index;
struct buffer_page *bpage;
unsigned long addr;
- u64 write_stamp;
- u64 delta;
new_index = rb_event_index(event);
old_index = new_index + rb_event_ts_length(event);
@@ -3019,14 +2998,10 @@ rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
bpage = READ_ONCE(cpu_buffer->tail_page);
- delta = rb_time_delta(event);
-
- if (!rb_time_read(&cpu_buffer->write_stamp, &write_stamp))
- return false;
-
- /* Make sure the write stamp is read before testing the location */
- barrier();
-
+ /*
+ * Make sure the tail_page is still the same and
+ * the next write location is the end of this event
+ */
if (bpage->page == (void *)addr && rb_page_write(bpage) == old_index) {
unsigned long write_mask =
local_read(&bpage->write) & ~RB_WRITE_MASK;
@@ -3037,20 +3012,20 @@ rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
* to make sure that the next event adds an absolute
* value and does not rely on the saved write stamp, which
* is now going to be bogus.
+ *
+ * By setting the before_stamp to zero, the next event
+ * is not going to use the write_stamp and will instead
+ * create an absolute timestamp. This means there's no
+ * reason to update the wirte_stamp!
*/
rb_time_set(&cpu_buffer->before_stamp, 0);
- /* Something came in, can't discard */
- if (!rb_time_cmpxchg(&cpu_buffer->write_stamp,
- write_stamp, write_stamp - delta))
- return false;
-
/*
* If an event were to come in now, it would see that the
* write_stamp and the before_stamp are different, and assume
* that this event just added itself before updating
* the write stamp. The interrupting event will fix the
- * write stamp for us, and use the before stamp as its delta.
+ * write stamp for us, and use an absolute timestamp.
*/
/*
@@ -3487,7 +3462,7 @@ static void check_buffer(struct ring_buffer_per_cpu *cpu_buffer,
return;
/*
- * If this interrupted another event,
+ * If this interrupted another event,
*/
if (atomic_inc_return(this_cpu_ptr(&checking)) != 1)
goto out;
--
2.42.0
^ permalink raw reply related [flat|nested] 16+ messages in thread* [for-linus][PATCH 12/15] ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs
2023-12-16 4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
` (10 preceding siblings ...)
2023-12-16 4:22 ` [for-linus][PATCH 11/15] ring-buffer: Remove useless update to write_stamp in rb_try_to_discard() Steven Rostedt
@ 2023-12-16 4:22 ` Steven Rostedt
2023-12-16 4:22 ` [for-linus][PATCH 13/15] ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg() Steven Rostedt
` (2 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16 4:22 UTC (permalink / raw)
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
stable
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Mathieu Desnoyers pointed out an issue in the rb_time_cmpxchg() for 32 bit
architectures. That is:
static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
{
unsigned long cnt, top, bottom, msb;
unsigned long cnt2, top2, bottom2, msb2;
u64 val;
/* The cmpxchg always fails if it interrupted an update */
if (!__rb_time_read(t, &val, &cnt2))
return false;
if (val != expect)
return false;
<<<< interrupted here!
cnt = local_read(&t->cnt);
The problem is that the synchronization counter in the rb_time_t is read
*after* the value of the timestamp is read. That means if an interrupt
were to come in between the value being read and the counter being read,
it can change the value and the counter and the interrupted process would
be clueless about it!
The counter needs to be read first and then the value. That way it is easy
to tell if the value is stale or not. If the counter hasn't been updated,
then the value is still good.
Link: https://lore.kernel.org/linux-trace-kernel/20231211201324.652870-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/linux-trace-kernel/20231212115301.7a9c9a64@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes: 10464b4aa605e ("ring-buffer: Add rb_time_t 64 bit operations for speeding up 32 bit")
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/ring_buffer.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index ad4af0cba159..b8ab0557bd1b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -706,6 +706,9 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
unsigned long cnt2, top2, bottom2, msb2;
u64 val;
+ /* Any interruptions in this function should cause a failure */
+ cnt = local_read(&t->cnt);
+
/* The cmpxchg always fails if it interrupted an update */
if (!__rb_time_read(t, &val, &cnt2))
return false;
@@ -713,7 +716,6 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
if (val != expect)
return false;
- cnt = local_read(&t->cnt);
if ((cnt & 3) != cnt2)
return false;
--
2.42.0
^ permalink raw reply related [flat|nested] 16+ messages in thread* [for-linus][PATCH 13/15] ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg()
2023-12-16 4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
` (11 preceding siblings ...)
2023-12-16 4:22 ` [for-linus][PATCH 12/15] ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs Steven Rostedt
@ 2023-12-16 4:22 ` Steven Rostedt
2023-12-16 4:22 ` [for-linus][PATCH 14/15] ring-buffer: Have rb_time_cmpxchg() set the msb counter too Steven Rostedt
2023-12-16 4:22 ` [for-linus][PATCH 15/15] ring-buffer: Do not record in NMI if the arch does not support cmpxchg in NMI Steven Rostedt
14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16 4:22 UTC (permalink / raw)
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
The following race can cause rb_time_read() to observe a corrupted time
stamp:
rb_time_cmpxchg()
[...]
if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
return false;
if (!rb_time_read_cmpxchg(&t->top, top, top2))
return false;
<interrupted before updating bottom>
__rb_time_read()
[...]
do {
c = local_read(&t->cnt);
top = local_read(&t->top);
bottom = local_read(&t->bottom);
msb = local_read(&t->msb);
} while (c != local_read(&t->cnt));
*cnt = rb_time_cnt(top);
/* If top and msb counts don't match, this interrupted a write */
if (*cnt != rb_time_cnt(msb))
return false;
^ this check fails to catch that "bottom" is still not updated.
So the old "bottom" value is returned, which is wrong.
Fix this by checking that all three of msb, top, and bottom 2-bit cnt
values match.
The reason to favor checking all three fields over requiring a specific
update order for both rb_time_set() and rb_time_cmpxchg() is because
checking all three fields is more robust to handle partial failures of
rb_time_cmpxchg() when interrupted by nested rb_time_set().
Link: https://lore.kernel.org/lkml/20231211201324.652870-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/linux-trace-kernel/20231212193049.680122-1-mathieu.desnoyers@efficios.com
Fixes: f458a1453424e ("ring-buffer: Test last update in 32bit version of __rb_time_read()")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/ring_buffer.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index b8ab0557bd1b..f22a849da179 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -644,8 +644,8 @@ static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt)
*cnt = rb_time_cnt(top);
- /* If top and msb counts don't match, this interrupted a write */
- if (*cnt != rb_time_cnt(msb))
+ /* If top, msb or bottom counts don't match, this interrupted a write */
+ if (*cnt != rb_time_cnt(msb) || *cnt != rb_time_cnt(bottom))
return false;
/* The shift to msb will lose its cnt bits */
--
2.42.0
^ permalink raw reply related [flat|nested] 16+ messages in thread* [for-linus][PATCH 14/15] ring-buffer: Have rb_time_cmpxchg() set the msb counter too
2023-12-16 4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
` (12 preceding siblings ...)
2023-12-16 4:22 ` [for-linus][PATCH 13/15] ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg() Steven Rostedt
@ 2023-12-16 4:22 ` Steven Rostedt
2023-12-16 4:22 ` [for-linus][PATCH 15/15] ring-buffer: Do not record in NMI if the arch does not support cmpxchg in NMI Steven Rostedt
14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16 4:22 UTC (permalink / raw)
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
stable
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
The rb_time_cmpxchg() on 32-bit architectures requires setting three
32-bit words to represent the 64-bit timestamp, with some salt for
synchronization. Those are: msb, top, and bottom
The issue is, the rb_time_cmpxchg() did not properly salt the msb portion,
and the msb that was written was stale.
Link: https://lore.kernel.org/linux-trace-kernel/20231215084114.20899342@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: f03f2abce4f39 ("ring-buffer: Have 32 bit time stamps use all 64 bits")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/ring_buffer.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f22a849da179..f4679013289b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -722,10 +722,12 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
cnt2 = cnt + 1;
rb_time_split(val, &top, &bottom, &msb);
+ msb = rb_time_val_cnt(msb, cnt);
top = rb_time_val_cnt(top, cnt);
bottom = rb_time_val_cnt(bottom, cnt);
rb_time_split(set, &top2, &bottom2, &msb2);
+ msb2 = rb_time_val_cnt(msb2, cnt);
top2 = rb_time_val_cnt(top2, cnt2);
bottom2 = rb_time_val_cnt(bottom2, cnt2);
--
2.42.0
^ permalink raw reply related [flat|nested] 16+ messages in thread* [for-linus][PATCH 15/15] ring-buffer: Do not record in NMI if the arch does not support cmpxchg in NMI
2023-12-16 4:22 [for-linus][PATCH 00/15] tracing: Fixes for 6.7-rc5 Steven Rostedt
` (13 preceding siblings ...)
2023-12-16 4:22 ` [for-linus][PATCH 14/15] ring-buffer: Have rb_time_cmpxchg() set the msb counter too Steven Rostedt
@ 2023-12-16 4:22 ` Steven Rostedt
14 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-12-16 4:22 UTC (permalink / raw)
To: linux-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
As the ring buffer recording requires cmpxchg() to work, if the
architecture does not support cmpxchg in NMI, then do not do any recording
within an NMI.
Link: https://lore.kernel.org/linux-trace-kernel/20231213175403.6fc18540@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
kernel/trace/ring_buffer.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f4679013289b..5a114e752f11 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3674,6 +3674,12 @@ rb_reserve_next_event(struct trace_buffer *buffer,
int nr_loops = 0;
int add_ts_default;
+ /* ring buffer does cmpxchg, make sure it is safe in NMI context */
+ if (!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) &&
+ (unlikely(in_nmi()))) {
+ return NULL;
+ }
+
rb_start_commit(cpu_buffer);
/* The commit page can not change after this */
--
2.42.0
^ permalink raw reply related [flat|nested] 16+ messages in thread