* [PATCH 0/2] ring-buffer: Fix a race between readers and resize checks @ 2024-05-17 13:40 Petr Pavlu 2024-05-17 13:40 ` [PATCH 1/2] ring-buffer: Correct stale comments related to non-consuming readers Petr Pavlu 2024-05-17 13:40 ` [PATCH 2/2] ring-buffer: Fix a race between readers and resize checks Petr Pavlu 0 siblings, 2 replies; 8+ messages in thread From: Petr Pavlu @ 2024-05-17 13:40 UTC (permalink / raw) To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers Cc: linux-trace-kernel, linux-kernel, Petr Pavlu Fix a race between a reader swapping a new reader page into the ring-buffer from rb_get_reader_page() and the rb_check_pages() check invoked at the end of the resize operation in ring_buffer_resize(). Note that a similar problem with rb_check_pages() being executed with concurrent readers appears to exist for ring_buffer_subbuf_order_set() too. I plan to tackle it in a separate series. Petr Pavlu (2): ring-buffer: Correct stale comments related to non-consuming readers ring-buffer: Fix a race between readers and resize checks kernel/trace/ring_buffer.c | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) base-commit: ea5f6ad9ad9645733b72ab53a98e719b460d36a6 -- 2.35.3 ^ permalink raw reply [flat|nested] 8+ messages in thread
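The heart of the race is that the reader's page swap is a two-step update of a doubly-linked list. A minimal userspace sketch of the swap (illustrative only: C11 atomics stand in for the kernel's cmpxchg, and struct page_node is an invented placeholder for the ring buffer's page structure):

#include <stdatomic.h>
#include <stdbool.h>

struct page_node {
	_Atomic(struct page_node *) next;
	struct page_node *prev;
};

/* Mirrors the update order in rb_get_reader_page(): the forward pointer
 * is flipped first with a cmpxchg, the back pointer only afterwards. */
static bool swap_reader_page(struct page_node *old, struct page_node *new)
{
	struct page_node *expected = old;

	atomic_store(&new->next, atomic_load(&old->next));
	new->prev = old->prev;

	/* Step 1: old->prev->next now leads to the new page. */
	if (!atomic_compare_exchange_strong(&old->prev->next, &expected, new))
		return false;	/* lost a race, the caller spins and retries */

	/*
	 * Inconsistency window: old->next->prev still points at 'old', so a
	 * concurrent list walk observes page->prev->next != page here.
	 */

	/* Step 2: the back pointer catches up. */
	atomic_load(&new->next)->prev = new;
	return true;
}

Any walk of the list that lands between step 1 and step 2 sees a broken invariant; that is exactly the window the two patches below deal with.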
* [PATCH 1/2] ring-buffer: Correct stale comments related to non-consuming readers 2024-05-17 13:40 [PATCH 0/2] ring-buffer: Fix a race between readers and resize checks Petr Pavlu @ 2024-05-17 13:40 ` Petr Pavlu 2024-05-17 13:40 ` [PATCH 2/2] ring-buffer: Fix a race between readers and resize checks Petr Pavlu 1 sibling, 0 replies; 8+ messages in thread From: Petr Pavlu @ 2024-05-17 13:40 UTC (permalink / raw) To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers Cc: linux-trace-kernel, linux-kernel, Petr Pavlu Adjust the following code documentation: * Kernel-doc comments for ring_buffer_read_prepare() and ring_buffer_read_finish() mention that recording to the ring buffer is disabled when the read is active. Remove mention of this restriction because it was already lifted in commit 1039221cc278 ("ring-buffer: Do not disable recording when there is an iterator"). * Function ring_buffer_read_finish() performs a self-check of the ring-buffer by locking cpu_buffer->reader_lock and then calling rb_check_pages(). The preceding comment explains that the lock is needed because rb_check_pages() clears the HEAD flag required by readers which might be running in parallel. Remove this explanation because commit 8843e06f67b1 ("ring-buffer: Handle race between rb_move_tail and rb_check_pages") simplified the function so it no longer resets the mentioned flag. Nonetheless, the lock is still needed because a reader swapping a page into the ring buffer can make the underlying doubly-linked list temporarily inconsistent. This is a non-functional change. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> --- kernel/trace/ring_buffer.c | 16 +++------------- 1 file changed, 3 insertions(+), 13 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 6511dc3a00da..0ae569eae55a 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -5036,13 +5036,9 @@ EXPORT_SYMBOL_GPL(ring_buffer_consume); * @flags: gfp flags to use for memory allocation * * This performs the initial preparations necessary to iterate - * through the buffer. Memory is allocated, buffer recording + * through the buffer. Memory is allocated, buffer resizing * is disabled, and the iterator pointer is returned to the caller. * - * Disabling buffer recording prevents the reading from being - * corrupted. This is not a consuming read, so a producer is not - * expected. - * * After a sequence of ring_buffer_read_prepare calls, the user is * expected to make at least one call to ring_buffer_read_prepare_sync. * Afterwards, ring_buffer_read_start is invoked to get things going @@ -5129,8 +5125,7 @@ EXPORT_SYMBOL_GPL(ring_buffer_read_start); * ring_buffer_read_finish - finish reading the iterator of the buffer * @iter: The iterator retrieved by ring_buffer_start * - * This re-enables the recording to the buffer, and frees the - * iterator. + * This re-enables resizing of the buffer, and frees the iterator. */ void ring_buffer_read_finish(struct ring_buffer_iter *iter) @@ -5138,12 +5133,7 @@ ring_buffer_read_finish(struct ring_buffer_iter *iter) struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer; unsigned long flags; - /* - * Ring buffer is disabled from recording, here's a good place - * to check the integrity of the ring buffer. - * Must prevent readers from trying to read, as the check - * clears the HEAD page and readers require it. - */ + /* Use this opportunity to check the integrity of the ring buffer. 
*/ raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); rb_check_pages(cpu_buffer); raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); -- 2.35.3 ^ permalink raw reply related [flat|nested] 8+ messages in thread
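For reference, the consistency test that needs the reader_lock boils down to the invariant below (a simplified sketch; the real rb_check_pages() additionally masks the flag bits that the ring buffer keeps in the low bits of the next pointers):

#include <stdbool.h>

struct list_node {
	struct list_node *next;
	struct list_node *prev;
};

/* Verify that every node of a circular doubly-linked list agrees with
 * its neighbours; a half-finished reader-page swap fails this test. */
static bool list_is_consistent(struct list_node *head)
{
	struct list_node *node = head;

	do {
		if (node->next->prev != node || node->prev->next != node)
			return false;
		node = node->next;
	} while (node != head);

	return true;
}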
* [PATCH 2/2] ring-buffer: Fix a race between readers and resize checks 2024-05-17 13:40 [PATCH 0/2] ring-buffer: Fix a race between readers and resize checks Petr Pavlu 2024-05-17 13:40 ` [PATCH 1/2] ring-buffer: Correct stale comments related to non-consuming readers Petr Pavlu @ 2024-05-17 13:40 ` Petr Pavlu 2024-05-20 13:50 ` Steven Rostedt 1 sibling, 1 reply; 8+ messages in thread From: Petr Pavlu @ 2024-05-17 13:40 UTC (permalink / raw) To: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers Cc: linux-trace-kernel, linux-kernel, Petr Pavlu The reader code in rb_get_reader_page() swaps a new reader page into the ring buffer by doing cmpxchg on old->list.prev->next to point it to the new page. Following that, if the operation is successful, old->list.next->prev gets updated too. This means the underlying doubly-linked list is temporarily inconsistent: page->prev->next or page->next->prev might not point back to page for some page in the ring buffer. The resize operation in ring_buffer_resize() can be invoked in parallel. It calls rb_check_pages() which can detect the described inconsistency and stop further tracing: [ 190.271762] ------------[ cut here ]------------ [ 190.271771] WARNING: CPU: 1 PID: 6186 at kernel/trace/ring_buffer.c:1467 rb_check_pages.isra.0+0x6a/0xa0 [ 190.271789] Modules linked in: [...] [ 190.271991] Unloaded tainted modules: intel_uncore_frequency(E):1 skx_edac(E):1 [ 190.272002] CPU: 1 PID: 6186 Comm: cmd.sh Kdump: loaded Tainted: G E 6.9.0-rc6-default #5 158d3e1e6d0b091c34c3b96bfd99a1c58306d79f [ 190.272011] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552c-rebuilt.opensuse.org 04/01/2014 [ 190.272015] RIP: 0010:rb_check_pages.isra.0+0x6a/0xa0 [ 190.272023] Code: [...] [ 190.272028] RSP: 0018:ffff9c37463abb70 EFLAGS: 00010206 [ 190.272034] RAX: ffff8eba04b6cb80 RBX: 0000000000000007 RCX: ffff8eba01f13d80 [ 190.272038] RDX: ffff8eba01f130c0 RSI: ffff8eba04b6cd00 RDI: ffff8eba0004c700 [ 190.272042] RBP: ffff8eba0004c700 R08: 0000000000010002 R09: 0000000000000000 [ 190.272045] R10: 00000000ffff7f52 R11: ffff8eba7f600000 R12: ffff8eba0004c720 [ 190.272049] R13: ffff8eba00223a00 R14: 0000000000000008 R15: ffff8eba067a8000 [ 190.272053] FS: 00007f1bd64752c0(0000) GS:ffff8eba7f680000(0000) knlGS:0000000000000000 [ 190.272057] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 190.272061] CR2: 00007f1bd6662590 CR3: 000000010291e001 CR4: 0000000000370ef0 [ 190.272070] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 190.272073] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 190.272077] Call Trace: [ 190.272098] <TASK> [ 190.272189] ring_buffer_resize+0x2ab/0x460 [ 190.272199] __tracing_resize_ring_buffer.part.0+0x23/0xa0 [ 190.272206] tracing_resize_ring_buffer+0x65/0x90 [ 190.272216] tracing_entries_write+0x74/0xc0 [ 190.272225] vfs_write+0xf5/0x420 [ 190.272248] ksys_write+0x67/0xe0 [ 190.272256] do_syscall_64+0x82/0x170 [ 190.272363] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 190.272373] RIP: 0033:0x7f1bd657d263 [ 190.272381] Code: [...]
[ 190.272385] RSP: 002b:00007ffe72b643f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 190.272391] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f1bd657d263 [ 190.272395] RDX: 0000000000000002 RSI: 0000555a6eb538e0 RDI: 0000000000000001 [ 190.272398] RBP: 0000555a6eb538e0 R08: 000000000000000a R09: 0000000000000000 [ 190.272401] R10: 0000555a6eb55190 R11: 0000000000000246 R12: 00007f1bd6662500 [ 190.272404] R13: 0000000000000002 R14: 00007f1bd6667c00 R15: 0000000000000002 [ 190.272412] </TASK> [ 190.272414] ---[ end trace 0000000000000000 ]--- Note that ring_buffer_resize() calls rb_check_pages() only if the parent trace_buffer has recording disabled. Recent commit d78ab792705c ("tracing: Stop current tracer when resizing buffer") means this is now always the case, which makes it more likely to hit this issue. The window to hit this race is nonetheless very small. To help reproduce it, one can add a delay loop in rb_get_reader_page(): ret = rb_head_page_replace(reader, cpu_buffer->reader_page); if (!ret) goto spin; for (unsigned i = 0; i < 1U << 26; i++) /* inserted delay loop */ __asm__ __volatile__ ("" : : : "memory"); rb_list_head(reader->list.next)->prev = &cpu_buffer->reader_page->list; ... and then run the following commands on the target system: echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable while true; do echo 16 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1 echo 8 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1 done & while true; do for i in /sys/kernel/tracing/per_cpu/*; do timeout 0.1 cat $i/trace_pipe; sleep 0.2 done done To fix the problem, make sure ring_buffer_resize() doesn't invoke rb_check_pages() concurrently with a reader operating on the same ring_buffer_per_cpu by taking its cpu_buffer->reader_lock. Fixes: 659f451ff213 ("ring-buffer: Add integrity check at end of iter read") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> --- kernel/trace/ring_buffer.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 0ae569eae55a..967655591719 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1449,6 +1449,11 @@ static void rb_check_bpage(struct ring_buffer_per_cpu *cpu_buffer, * * As a safety measure we check to make sure the data pages have not * been corrupted. + * + * Callers of this function need to guarantee that the list of pages doesn't get + * modified during the check. In particular, if it's possible that the function + * is invoked with concurrent readers which can swap in a new reader page then + * the caller should take cpu_buffer->reader_lock. */ static void rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer) { @@ -2200,8 +2205,13 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size, */ synchronize_rcu(); for_each_buffer_cpu(buffer, cpu) { + unsigned long flags; + cpu_buffer = buffer->buffers[cpu]; + raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); rb_check_pages(cpu_buffer); + raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, + flags); } atomic_dec(&buffer->record_disabled); } -- 2.35.3 ^ permalink raw reply related [flat|nested] 8+ messages in thread
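The shape of the race, and why taking reader_lock around the check closes it, can also be demonstrated entirely in userspace (an assumed model, not kernel code: a 4-node ring plus one spare node playing the reader page; build with cc -O2 race.c -lpthread, and add -DLOCKED_CHECK to make the checker take the lock the way this patch does, which drops the reported count to zero):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NNODES 4

struct node {
	_Atomic(struct node *) next;
	_Atomic(struct node *) prev;
};

static struct node nodes[NNODES + 1];	/* nodes[NNODES] plays the reader page */
static pthread_mutex_t reader_lock = PTHREAD_MUTEX_INITIALIZER;
static _Atomic long inconsistencies;

/* Reader: swap the spare node in with the two-step update, under the lock. */
static void *reader_thread(void *arg)
{
	struct node *old = &nodes[0], *new = &nodes[NNODES], *tmp;

	for (long i = 0; i < 1000000; i++) {
		struct node *expected = old;

		pthread_mutex_lock(&reader_lock);
		atomic_store(&new->next, atomic_load(&old->next));
		atomic_store(&new->prev, atomic_load(&old->prev));
		/* step 1: the forward pointer flips to the new node */
		atomic_compare_exchange_strong(&atomic_load(&old->prev)->next,
					       &expected, new);
		/* widen the window, like the delay loop suggested above */
		for (volatile int d = 0; d < 200; d++)
			;
		/* step 2: the back pointer catches up */
		atomic_store(&atomic_load(&new->next)->prev, new);
		pthread_mutex_unlock(&reader_lock);

		tmp = old;	/* the swapped-out node becomes the spare */
		old = new;
		new = tmp;
	}
	return NULL;
}

/* Resizer: walk the ring checking next->prev, like rb_check_pages() does. */
static void *checker_thread(void *arg)
{
	for (long i = 0; i < 1000000; i++) {
		struct node *n = &nodes[1];	/* nodes[1] is never swapped out */

#ifdef LOCKED_CHECK	/* the fix: hold reader_lock across the whole walk */
		pthread_mutex_lock(&reader_lock);
#endif
		for (int k = 0; k < NNODES; k++) {
			if (atomic_load(&atomic_load(&n->next)->prev) != n)
				atomic_fetch_add(&inconsistencies, 1);
			n = atomic_load(&n->next);
		}
#ifdef LOCKED_CHECK
		pthread_mutex_unlock(&reader_lock);
#endif
	}
	return NULL;
}

int main(void)
{
	pthread_t r, c;

	for (int i = 0; i < NNODES; i++) {
		atomic_store(&nodes[i].next, &nodes[(i + 1) % NNODES]);
		atomic_store(&nodes[i].prev, &nodes[(i + NNODES - 1) % NNODES]);
	}
	pthread_create(&r, NULL, reader_thread, NULL);
	pthread_create(&c, NULL, checker_thread, NULL);
	pthread_join(r, NULL);
	pthread_join(c, NULL);
	printf("inconsistencies observed: %ld\n", atomic_load(&inconsistencies));
	return 0;
}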
* Re: [PATCH 2/2] ring-buffer: Fix a race between readers and resize checks 2024-05-17 13:40 ` [PATCH 2/2] ring-buffer: Fix a race between readers and resize checks Petr Pavlu @ 2024-05-20 13:50 ` Steven Rostedt 2024-05-27 9:36 ` Petr Pavlu 0 siblings, 1 reply; 8+ messages in thread From: Steven Rostedt @ 2024-05-20 13:50 UTC (permalink / raw) To: Petr Pavlu Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-trace-kernel, linux-kernel, linux-rt-users On Fri, 17 May 2024 15:40:08 +0200 Petr Pavlu <petr.pavlu@suse.com> wrote: > The reader code in rb_get_reader_page() swaps a new reader page into the > ring buffer by doing cmpxchg on old->list.prev->next to point it to the > new page. Following that, if the operation is successful, > old->list.next->prev gets updated too. This means the underlying > doubly-linked list is temporarily inconsistent, page->prev->next or > page->next->prev might not be equal back to page for some page in the > ring buffer. > > The resize operation in ring_buffer_resize() can be invoked in parallel. > It calls rb_check_pages() which can detect the described inconsistency > and stop further tracing: > > [ 190.271762] ------------[ cut here ]------------ > [ 190.271771] WARNING: CPU: 1 PID: 6186 at kernel/trace/ring_buffer.c:1467 rb_check_pages.isra.0+0x6a/0xa0 > [ 190.271789] Modules linked in: [...] > [ 190.271991] Unloaded tainted modules: intel_uncore_frequency(E):1 skx_edac(E):1 > [ 190.272002] CPU: 1 PID: 6186 Comm: cmd.sh Kdump: loaded Tainted: G E 6.9.0-rc6-default #5 158d3e1e6d0b091c34c3b96bfd99a1c58306d79f > [ 190.272011] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552c-rebuilt.opensuse.org 04/01/2014 > [ 190.272015] RIP: 0010:rb_check_pages.isra.0+0x6a/0xa0 > [ 190.272023] Code: [...] > [ 190.272028] RSP: 0018:ffff9c37463abb70 EFLAGS: 00010206 > [ 190.272034] RAX: ffff8eba04b6cb80 RBX: 0000000000000007 RCX: ffff8eba01f13d80 > [ 190.272038] RDX: ffff8eba01f130c0 RSI: ffff8eba04b6cd00 RDI: ffff8eba0004c700 > [ 190.272042] RBP: ffff8eba0004c700 R08: 0000000000010002 R09: 0000000000000000 > [ 190.272045] R10: 00000000ffff7f52 R11: ffff8eba7f600000 R12: ffff8eba0004c720 > [ 190.272049] R13: ffff8eba00223a00 R14: 0000000000000008 R15: ffff8eba067a8000 > [ 190.272053] FS: 00007f1bd64752c0(0000) GS:ffff8eba7f680000(0000) knlGS:0000000000000000 > [ 190.272057] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 190.272061] CR2: 00007f1bd6662590 CR3: 000000010291e001 CR4: 0000000000370ef0 > [ 190.272070] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 190.272073] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > [ 190.272077] Call Trace: > [ 190.272098] <TASK> > [ 190.272189] ring_buffer_resize+0x2ab/0x460 > [ 190.272199] __tracing_resize_ring_buffer.part.0+0x23/0xa0 > [ 190.272206] tracing_resize_ring_buffer+0x65/0x90 > [ 190.272216] tracing_entries_write+0x74/0xc0 > [ 190.272225] vfs_write+0xf5/0x420 > [ 190.272248] ksys_write+0x67/0xe0 > [ 190.272256] do_syscall_64+0x82/0x170 > [ 190.272363] entry_SYSCALL_64_after_hwframe+0x76/0x7e > [ 190.272373] RIP: 0033:0x7f1bd657d263 > [ 190.272381] Code: [...] 
> [ 190.272385] RSP: 002b:00007ffe72b643f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 > [ 190.272391] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f1bd657d263 > [ 190.272395] RDX: 0000000000000002 RSI: 0000555a6eb538e0 RDI: 0000000000000001 > [ 190.272398] RBP: 0000555a6eb538e0 R08: 000000000000000a R09: 0000000000000000 > [ 190.272401] R10: 0000555a6eb55190 R11: 0000000000000246 R12: 00007f1bd6662500 > [ 190.272404] R13: 0000000000000002 R14: 00007f1bd6667c00 R15: 0000000000000002 > [ 190.272412] </TASK> > [ 190.272414] ---[ end trace 0000000000000000 ]--- > > Note that ring_buffer_resize() calls rb_check_pages() only if the parent > trace_buffer has recording disabled. Recent commit d78ab792705c > ("tracing: Stop current tracer when resizing buffer") causes that it is > now always the case which makes it more likely to experience this issue. > > The window to hit this race is nonetheless very small. To help > reproducing it, one can add a delay loop in rb_get_reader_page(): > > ret = rb_head_page_replace(reader, cpu_buffer->reader_page); > if (!ret) > goto spin; > for (unsigned i = 0; i < 1U << 26; i++) /* inserted delay loop */ > __asm__ __volatile__ ("" : : : "memory"); > rb_list_head(reader->list.next)->prev = &cpu_buffer->reader_page->list; > > .. and then run the following commands on the target system: > > echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable > while true; do > echo 16 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1 > echo 8 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1 > done & > while true; do > for i in /sys/kernel/tracing/per_cpu/*; do > timeout 0.1 cat $i/trace_pipe; sleep 0.2 > done > done > > To fix the problem, make sure ring_buffer_resize() doesn't invoke > rb_check_pages() concurrently with a reader operating on the same > ring_buffer_per_cpu by taking its cpu_buffer->reader_lock. Definitely a bug. Thanks for catching it. But... > > Fixes: 659f451ff213 ("ring-buffer: Add integrity check at end of iter read") > Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> > --- > kernel/trace/ring_buffer.c | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > index 0ae569eae55a..967655591719 100644 > --- a/kernel/trace/ring_buffer.c > +++ b/kernel/trace/ring_buffer.c > @@ -1449,6 +1449,11 @@ static void rb_check_bpage(struct ring_buffer_per_cpu *cpu_buffer, > * > * As a safety measure we check to make sure the data pages have not > * been corrupted. > + * > + * Callers of this function need to guarantee that the list of pages doesn't get > + * modified during the check. In particular, if it's possible that the function > + * is invoked with concurrent readers which can swap in a new reader page then > + * the caller should take cpu_buffer->reader_lock. > */ > static void rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer) > { > @@ -2200,8 +2205,13 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size, > */ > synchronize_rcu(); > for_each_buffer_cpu(buffer, cpu) { > + unsigned long flags; > + > cpu_buffer = buffer->buffers[cpu]; > + raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); > rb_check_pages(cpu_buffer); > + raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, > + flags); Putting my RT hat on, I really don't like the above fix. The rb_check_pages() iterates all subbuffers which makes the time interrupts are disabled non-deterministic. Instead, I would rather have something where we disable readers while we do the check, and re-enable them. 
raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); cpu_buffer->read_disabled++; raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); // Also, don't put flags on a new line. We are allowed to go 100 characters now. rb_check_pages(cpu_buffer); raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); cpu_buffer->read_disabled--; raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); Or something like that. Yes, that also requires creating a new "read_disabled" field in the ring_buffer_per_cpu code. That said, I'm going to accept these patches as is (moving flags onto the same line). But I would like the above code for the next merge window as it would then not affect RT. I'll accept these patches because they do fix the bug now. -- Steve > } > atomic_dec(&buffer->record_disabled); > } ^ permalink raw reply [flat|nested] 8+ messages in thread
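Spelled out, the idea sketched above might look roughly as follows (an untested sketch against no particular tree; the read_disabled field and the reader-side bail-out are hypothetical and do not exist in the current ring-buffer code):

/* Hypothetical new field in struct ring_buffer_per_cpu: */
	unsigned int		read_disabled;	/* protected by reader_lock */

static void rb_check_pages_nonatomic(struct ring_buffer_per_cpu *cpu_buffer)
{
	unsigned long flags;

	/*
	 * Taking reader_lock once guarantees that any in-flight
	 * rb_get_reader_page() has finished its two-step swap; setting
	 * the flag then keeps new readers out.
	 */
	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
	cpu_buffer->read_disabled++;
	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);

	/* The long list walk now runs with interrupts enabled. */
	rb_check_pages(cpu_buffer);

	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
	cpu_buffer->read_disabled--;
	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
}

/* Readers would then need an early exit, e.g. in rb_get_reader_page(): */
	if (cpu_buffer->read_disabled)
		return NULL;	/* hypothetical -EBUSY-style bail-out */

The reader-side error path implied by that last fragment is exactly what the rest of the thread turns on.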
* Re: [PATCH 2/2] ring-buffer: Fix a race between readers and resize checks 2024-05-20 13:50 ` Steven Rostedt @ 2024-05-27 9:36 ` Petr Pavlu 2024-05-27 23:43 ` Steven Rostedt 0 siblings, 1 reply; 8+ messages in thread From: Petr Pavlu @ 2024-05-27 9:36 UTC (permalink / raw) To: Steven Rostedt Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-trace-kernel, linux-kernel, linux-rt-users On 5/20/24 15:50, Steven Rostedt wrote: > On Fri, 17 May 2024 15:40:08 +0200 > Petr Pavlu <petr.pavlu@suse.com> wrote: > >> The reader code in rb_get_reader_page() swaps a new reader page into the >> ring buffer by doing cmpxchg on old->list.prev->next to point it to the >> new page. Following that, if the operation is successful, >> old->list.next->prev gets updated too. This means the underlying >> doubly-linked list is temporarily inconsistent, page->prev->next or >> page->next->prev might not be equal back to page for some page in the >> ring buffer. >> >> The resize operation in ring_buffer_resize() can be invoked in parallel. >> It calls rb_check_pages() which can detect the described inconsistency >> and stop further tracing: >> >> [ 190.271762] ------------[ cut here ]------------ >> [ 190.271771] WARNING: CPU: 1 PID: 6186 at kernel/trace/ring_buffer.c:1467 rb_check_pages.isra.0+0x6a/0xa0 >> [ 190.271789] Modules linked in: [...] >> [ 190.271991] Unloaded tainted modules: intel_uncore_frequency(E):1 skx_edac(E):1 >> [ 190.272002] CPU: 1 PID: 6186 Comm: cmd.sh Kdump: loaded Tainted: G E 6.9.0-rc6-default #5 158d3e1e6d0b091c34c3b96bfd99a1c58306d79f >> [ 190.272011] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552c-rebuilt.opensuse.org 04/01/2014 >> [ 190.272015] RIP: 0010:rb_check_pages.isra.0+0x6a/0xa0 >> [ 190.272023] Code: [...] >> [ 190.272028] RSP: 0018:ffff9c37463abb70 EFLAGS: 00010206 >> [ 190.272034] RAX: ffff8eba04b6cb80 RBX: 0000000000000007 RCX: ffff8eba01f13d80 >> [ 190.272038] RDX: ffff8eba01f130c0 RSI: ffff8eba04b6cd00 RDI: ffff8eba0004c700 >> [ 190.272042] RBP: ffff8eba0004c700 R08: 0000000000010002 R09: 0000000000000000 >> [ 190.272045] R10: 00000000ffff7f52 R11: ffff8eba7f600000 R12: ffff8eba0004c720 >> [ 190.272049] R13: ffff8eba00223a00 R14: 0000000000000008 R15: ffff8eba067a8000 >> [ 190.272053] FS: 00007f1bd64752c0(0000) GS:ffff8eba7f680000(0000) knlGS:0000000000000000 >> [ 190.272057] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [ 190.272061] CR2: 00007f1bd6662590 CR3: 000000010291e001 CR4: 0000000000370ef0 >> [ 190.272070] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> [ 190.272073] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >> [ 190.272077] Call Trace: >> [ 190.272098] <TASK> >> [ 190.272189] ring_buffer_resize+0x2ab/0x460 >> [ 190.272199] __tracing_resize_ring_buffer.part.0+0x23/0xa0 >> [ 190.272206] tracing_resize_ring_buffer+0x65/0x90 >> [ 190.272216] tracing_entries_write+0x74/0xc0 >> [ 190.272225] vfs_write+0xf5/0x420 >> [ 190.272248] ksys_write+0x67/0xe0 >> [ 190.272256] do_syscall_64+0x82/0x170 >> [ 190.272363] entry_SYSCALL_64_after_hwframe+0x76/0x7e >> [ 190.272373] RIP: 0033:0x7f1bd657d263 >> [ 190.272381] Code: [...] 
>> [ 190.272385] RSP: 002b:00007ffe72b643f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 >> [ 190.272391] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f1bd657d263 >> [ 190.272395] RDX: 0000000000000002 RSI: 0000555a6eb538e0 RDI: 0000000000000001 >> [ 190.272398] RBP: 0000555a6eb538e0 R08: 000000000000000a R09: 0000000000000000 >> [ 190.272401] R10: 0000555a6eb55190 R11: 0000000000000246 R12: 00007f1bd6662500 >> [ 190.272404] R13: 0000000000000002 R14: 00007f1bd6667c00 R15: 0000000000000002 >> [ 190.272412] </TASK> >> [ 190.272414] ---[ end trace 0000000000000000 ]--- >> >> Note that ring_buffer_resize() calls rb_check_pages() only if the parent >> trace_buffer has recording disabled. Recent commit d78ab792705c >> ("tracing: Stop current tracer when resizing buffer") causes that it is >> now always the case which makes it more likely to experience this issue. >> >> The window to hit this race is nonetheless very small. To help >> reproducing it, one can add a delay loop in rb_get_reader_page(): >> >> ret = rb_head_page_replace(reader, cpu_buffer->reader_page); >> if (!ret) >> goto spin; >> for (unsigned i = 0; i < 1U << 26; i++) /* inserted delay loop */ >> __asm__ __volatile__ ("" : : : "memory"); >> rb_list_head(reader->list.next)->prev = &cpu_buffer->reader_page->list; >> >> .. and then run the following commands on the target system: >> >> echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable >> while true; do >> echo 16 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1 >> echo 8 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1 >> done & >> while true; do >> for i in /sys/kernel/tracing/per_cpu/*; do >> timeout 0.1 cat $i/trace_pipe; sleep 0.2 >> done >> done >> >> To fix the problem, make sure ring_buffer_resize() doesn't invoke >> rb_check_pages() concurrently with a reader operating on the same >> ring_buffer_per_cpu by taking its cpu_buffer->reader_lock. > > Definitely a bug. Thanks for catching it. But... > >> >> Fixes: 659f451ff213 ("ring-buffer: Add integrity check at end of iter read") >> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> >> --- >> kernel/trace/ring_buffer.c | 10 ++++++++++ >> 1 file changed, 10 insertions(+) >> >> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c >> index 0ae569eae55a..967655591719 100644 >> --- a/kernel/trace/ring_buffer.c >> +++ b/kernel/trace/ring_buffer.c >> @@ -1449,6 +1449,11 @@ static void rb_check_bpage(struct ring_buffer_per_cpu *cpu_buffer, >> * >> * As a safety measure we check to make sure the data pages have not >> * been corrupted. >> + * >> + * Callers of this function need to guarantee that the list of pages doesn't get >> + * modified during the check. In particular, if it's possible that the function >> + * is invoked with concurrent readers which can swap in a new reader page then >> + * the caller should take cpu_buffer->reader_lock. >> */ >> static void rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer) >> { >> @@ -2200,8 +2205,13 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size, >> */ >> synchronize_rcu(); >> for_each_buffer_cpu(buffer, cpu) { >> + unsigned long flags; >> + >> cpu_buffer = buffer->buffers[cpu]; >> + raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); >> rb_check_pages(cpu_buffer); >> + raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, >> + flags); > > Putting my RT hat on, I really don't like the above fix. The > rb_check_pages() iterates all subbuffers which makes the time interrupts > are disabled non-deterministic. 
I see, this also applies to the same rb_check_pages() validation invoked from ring_buffer_read_finish(). > > Instead, I would rather have something where we disable readers while we do > the check, and re-enable them. > > raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); > cpu_buffer->read_disabled++; > raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); > > // Also, don't put flags on a new line. We are allowed to go 100 characters now. Noted. > > > rb_check_pages(cpu_buffer); > raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); > cpu_buffer->read_disabled--; > raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); > > Or something like that. Yes, that also requires creating a new > "read_disabled" field in the ring_buffer_per_cpu code. I think this would work but I'm personally not immediately sold on this approach. If I understand the idea correctly, readers should then check whether cpu_buffer->read_disabled is set and bail out with some error if that is the case. The rb_check_pages() function is only self-check code and as such, I feel it doesn't justify disrupting readers with a new error condition and adding more complex locking. I've been considering how to approach this RT issue differently. One obvious approach would be to drop the rb_check_pages() validation but that is probably not desirable. Another option could be to make the check less thorough and walk only a part of the list which is bounded by some constant; typically one would want to check the part where some change was just made. In case of a smaller list, it should still be possible to traverse it completely. This is an idea that I'm currently looking at. > > That said, I'm going to accept these patches as is (moving flags onto the > same line). But I would like the above code for the next merge window as it > would then not affect RT. > > I'll accept these patches because they do fix the bug now. Thanks, Petr ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 2/2] ring-buffer: Fix a race between readers and resize checks 2024-05-27 9:36 ` Petr Pavlu @ 2024-05-27 23:43 ` Steven Rostedt 2024-06-07 8:29 ` Petr Pavlu 0 siblings, 1 reply; 8+ messages in thread From: Steven Rostedt @ 2024-05-27 23:43 UTC (permalink / raw) To: Petr Pavlu Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-trace-kernel, linux-kernel, linux-rt-users On Mon, 27 May 2024 11:36:55 +0200 Petr Pavlu <petr.pavlu@suse.com> wrote: > >> static void rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer) > >> { > >> @@ -2200,8 +2205,13 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size, > >> */ > >> synchronize_rcu(); > >> for_each_buffer_cpu(buffer, cpu) { > >> + unsigned long flags; > >> + > >> cpu_buffer = buffer->buffers[cpu]; > >> + raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); > >> rb_check_pages(cpu_buffer); > >> + raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, > >> + flags); > > > > Putting my RT hat on, I really don't like the above fix. The > > rb_check_pages() iterates all subbuffers which makes the time interrupts > > are disabled non-deterministic. > > I see, this also applies to the same rb_check_pages() validation invoked > from ring_buffer_read_finish(). > > > > > Instead, I would rather have something where we disable readers while we do > > the check, and re-enable them. > > > > raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); > > cpu_buffer->read_disabled++; > > raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); > > > > // Also, don't put flags on a new line. We are allowed to go 100 characters now. > > Noted. > > > > > > > rb_check_pages(cpu_buffer); > > raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); > > cpu_buffer->read_disabled--; > > raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); > > > > Or something like that. Yes, that also requires creating a new > > "read_disabled" field in the ring_buffer_per_cpu code. > > I think this would work but I'm personally not immediately sold on this > approach. If I understand the idea correctly, readers should then check > whether cpu_buffer->read_disabled is set and bail out with some error if > that is the case. The rb_check_pages() function is only self-check > code and as such, I feel it doesn't justify disrupting readers with > a new error condition and adding more complex locking. Honestly, this code was never made for more than one reader per cpu_buffer. I'm perfectly fine if a running check_pages() causes other readers of the same per_cpu buffer to get -EBUSY. Do you really see this being a problem? What use case is there for hitting the check_pages() and reading the same cpu buffer at the same time? -- Steve ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 2/2] ring-buffer: Fix a race between readers and resize checks 2024-05-27 23:43 ` Steven Rostedt @ 2024-06-07 8:29 ` Petr Pavlu 2024-06-10 15:56 ` Steven Rostedt 0 siblings, 1 reply; 8+ messages in thread From: Petr Pavlu @ 2024-06-07 8:29 UTC (permalink / raw) To: Steven Rostedt Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-trace-kernel, linux-kernel, linux-rt-users On 5/28/24 01:43, Steven Rostedt wrote: > On Mon, 27 May 2024 11:36:55 +0200 > Petr Pavlu <petr.pavlu@suse.com> wrote: > >>>> static void rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer) >>>> { >>>> @@ -2200,8 +2205,13 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size, >>>> */ >>>> synchronize_rcu(); >>>> for_each_buffer_cpu(buffer, cpu) { >>>> + unsigned long flags; >>>> + >>>> cpu_buffer = buffer->buffers[cpu]; >>>> + raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); >>>> rb_check_pages(cpu_buffer); >>>> + raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, >>>> + flags); >>> >>> Putting my RT hat on, I really don't like the above fix. The >>> rb_check_pages() iterates all subbuffers which makes the time interrupts >>> are disabled non-deterministic. >> >> I see, this also applies to the same rb_check_pages() validation invoked >> from ring_buffer_read_finish(). >> >>> >>> Instead, I would rather have something where we disable readers while we do >>> the check, and re-enable them. >>> >>> raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); >>> cpu_buffer->read_disabled++; >>> raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); >>> >>> // Also, don't put flags on a new line. We are allowed to go 100 characters now. >> >> Noted. >> >>> >>> >>> rb_check_pages(cpu_buffer); >>> raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); >>> cpu_buffer->read_disabled--; >>> raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); >>> >>> Or something like that. Yes, that also requires creating a new >>> "read_disabled" field in the ring_buffer_per_cpu code. >> >> I think this would work but I'm personally not immediately sold on this >> approach. If I understand the idea correctly, readers should then check >> whether cpu_buffer->read_disabled is set and bail out with some error if >> that is the case. The rb_check_pages() function is only self-check >> code and as such, I feel it doesn't justify disrupting readers with >> a new error condition and adding more complex locking. > > Honestly, this code was never made for more than one reader per > cpu_buffer. I'm perfectly fine if a running check_pages() causes other > readers of the same per_cpu buffer to get -EBUSY. > > Do you really see this being a problem? What use case is there for > hitting the check_pages() and reading the same cpu buffer at the same > time? My main issue is the added complexity of checking the new read_disabled flag. The rb_check_pages() part is simple and you showed what to do. The readers part is what I struggle with. I think the read_disabled flag needs to be checked either once in rb_get_reader_page(), where the actual problem of making the list temporarily inconsistent exists, or alternatively by direct or indirect users of rb_get_reader_page() just after they take the reader_lock. Looking at the final rb_get_reader_page() function, it currently always returns a valid reader page unless the buffer doesn't contain any additional entry or a serious problem is detected by RB_WARN_ON() checks.
This is simple for callers to handle: either they get a reader page and can continue, or they stop. Returning -EBUSY means that callers have a new case and need to decide what to do about it. They need to propagate the error up the chain or attempt to handle it. This involves ring-buffer functions rb_advance_reader(), rb_buffer_peek(), ring_buffer_peek(), ring_buffer_consume(), ring_buffer_read_page(), ring_buffer_map_get_reader() and their callers from other source files. It is possible to handle this new case in these functions but I'm not sure if adding this logic is justified. I feel I must have misunderstood something about how it should work? I was further thinking about alternatives that would possibly provide a less thorough check but have their complexity limited only to rb_check_pages(). The already mentioned idea is to have the function look only at surrounding nodes where some change in the list occurred. Another option could be to try traversing the whole list in smaller parts and give up the reader_lock in between them. This would need some care to make sure that the operation completes, e.g. the code would need to bail out if it detects a change to cpu_buffer->pages_read. Thanks, Petr ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 2/2] ring-buffer: Fix a race between readers and resize checks 2024-06-07 8:29 ` Petr Pavlu @ 2024-06-10 15:56 ` Steven Rostedt 0 siblings, 0 replies; 8+ messages in thread From: Steven Rostedt @ 2024-06-10 15:56 UTC (permalink / raw) To: Petr Pavlu Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-trace-kernel, linux-kernel, linux-rt-users On Fri, 7 Jun 2024 10:29:03 +0200 Petr Pavlu <petr.pavlu@suse.com> wrote: > Another option could be to try traversing the whole list in smaller > parts and give up the reader_lock in between them. This would need some > care to make sure that the operation completes, e.g. the code would need > to bail out if it detects a change to cpu_buffer->pages_read. I think I like this approach the most. Perhaps even have a counter that gets incremented every time a new reader page is taken. And if it detects that, it restarts the check? To prevent a DOS, we restart 3 times at most, and then just say "the list is OK" and exit. So basically, we release the lock within the loop for each sub-buffer, and then check if a reader touched it when reacquiring the lock. -- Steve ^ permalink raw reply [flat|nested] 8+ messages in thread
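Put together, the incremental check being discussed might take a shape like the following (a rough sketch only; the reader_swaps counter is a hypothetical field that rb_get_reader_page() would increment under reader_lock on every successful page swap, and the walk is reduced to one direction of the invariant for brevity):

static void rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer)
{
	struct list_head *head, *page;
	unsigned long flags, swaps;
	int restarts = 0;

again:
	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
	swaps = cpu_buffer->reader_swaps;	/* hypothetical counter */
	head = rb_list_head(cpu_buffer->pages);
	page = head;
	do {
		if (RB_WARN_ON(cpu_buffer,
			       rb_list_head(rb_list_head(page->next)->prev) != page))
			break;
		page = rb_list_head(page->next);

		/* Bounded IRQ-off time: let readers in between sub-buffers. */
		raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
		raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);

		if (cpu_buffer->reader_swaps != swaps) {
			/* A reader moved pages under us; restart, 3 times at most. */
			raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
			if (++restarts < 3)
				goto again;
			return;	/* give up and assume the list is OK */
		}
	} while (page != head);
	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
}

Note that 'page' is only dereferenced again after the counter comparison confirms no reader ran in the gap, so the saved pointer cannot be stale when it is used.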