public inbox for linux-block@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] blktrace: fix __this_cpu_read/write in preemptible context
@ 2026-02-27  5:03 Chaitanya Kulkarni
  2026-02-27 16:48 ` Jens Axboe
  2026-02-27 19:19 ` Steven Rostedt
  0 siblings, 2 replies; 4+ messages in thread
From: Chaitanya Kulkarni @ 2026-02-27  5:03 UTC (permalink / raw)
  To: axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: shinichiro.kawasaki, linux-block, linux-trace-kernel,
	Chaitanya Kulkarni

tracing_record_cmdline() internally uses __this_cpu_read() and
__this_cpu_write() on the per-CPU variable trace_taskinfo_save, and
trace_save_cmdline() explicitly asserts preemption is disabled via
lockdep_assert_preemption_disabled().  These operations are safe only
when preemption is off, as they were designed to be called from the
scheduler's context probe_wakeup_sched_switch() / probe_wakeup().

__blk_add_trace() calls tracing_record_cmdline(current) from process
context where preemption is fully enabled, triggering the following
splat on using blktests/blktrace/002:

blktrace/002 (blktrace ftrace corruption with sysfs trace)   [failed]
    runtime  0.367s  ...  0.437s
    something found in dmesg:
    [   81.211018] run blktests blktrace/002 at 2026-02-25 22:24:33
    [   81.239580] null_blk: disk nullb1 created
    [   81.357294] BUG: using __this_cpu_read() in preemptible [00000000] code: dd/2516
    [   81.362842] caller is tracing_record_cmdline+0x10/0x40
    [   81.362872] CPU: 16 UID: 0 PID: 2516 Comm: dd Tainted: G                 N  7.0.0-rc1lblk+ #84 PREEMPT(full)
    [   81.362877] Tainted: [N]=TEST
    [   81.362878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
    [   81.362881] Call Trace:
    [   81.362884]  <TASK>
    [   81.362886]  dump_stack_lvl+0x8d/0xb0
    ...
    (See '/mnt/sda/blktests/results/nodev/blktrace/002.dmesg' for the entire message)

[   81.211018] run blktests blktrace/002 at 2026-02-25 22:24:33
[   81.239580] null_blk: disk nullb1 created
[   81.357294] BUG: using __this_cpu_read() in preemptible [00000000] code: dd/2516
[   81.362842] caller is tracing_record_cmdline+0x10/0x40
[   81.362872] CPU: 16 UID: 0 PID: 2516 Comm: dd Tainted: G                 N  7.0.0-rc1lblk+ #84 PREEMPT(full)
[   81.362877] Tainted: [N]=TEST
[   81.362878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[   81.362881] Call Trace:
[   81.362884]  <TASK>
[   81.362886]  dump_stack_lvl+0x8d/0xb0
[   81.362895]  check_preemption_disabled+0xce/0xe0
[   81.362902]  tracing_record_cmdline+0x10/0x40
[   81.362923]  __blk_add_trace+0x307/0x5d0
[   81.362934]  ? lock_acquire+0xe0/0x300
[   81.362940]  ? iov_iter_extract_pages+0x101/0xa30
[   81.362959]  blk_add_trace_bio+0x106/0x1e0
[   81.362968]  submit_bio_noacct_nocheck+0x24b/0x3a0
[   81.362979]  ? lockdep_init_map_type+0x58/0x260
[   81.362988]  submit_bio_wait+0x56/0x90
[   81.363009]  __blkdev_direct_IO_simple+0x16c/0x250
[   81.363026]  ? __pfx_submit_bio_wait_endio+0x10/0x10
[   81.363038]  ? rcu_read_lock_any_held+0x73/0xa0
[   81.363051]  blkdev_read_iter+0xc1/0x140
[   81.363059]  vfs_read+0x20b/0x330
[   81.363083]  ksys_read+0x67/0xe0
[   81.363090]  do_syscall_64+0xbf/0xf00
[   81.363102]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   81.363106] RIP: 0033:0x7f281906029d
[   81.363111] Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d 66 63 0a 00 e8 59 ff 01 00 66 0f 1f 84 00 00 00 00 00 80 3d 41 33 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec
[   81.363113] RSP: 002b:00007ffca127dd48 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   81.363120] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281906029d
[   81.363122] RDX: 0000000000001000 RSI: 0000559f8bfae000 RDI: 0000000000000000
[   81.363123] RBP: 0000000000001000 R08: 0000002863a10a81 R09: 00007f281915f000
[   81.363124] R10: 00007f2818f77b60 R11: 0000000000000246 R12: 0000559f8bfae000
[   81.363126] R13: 0000000000000000 R14: 0000000000000000 R15: 000000000000000a
[   81.363142]  </TASK>

The same BUG fires from blk_add_trace_plug(), blk_add_trace_unplug(),
and blk_add_trace_rq() paths as well.

Fix by wrapping the tracing_record_cmdline() call with
preempt_disable()/preempt_enable().  This is a best-effort "record
the comm string for this PID" operation and briefly disabling
preemption around it is both should be safe.

With this patch now blktests for blktrace pass :-

blktests (master) # ./check blktrace
blktrace/001 (blktrace zone management command tracing)      [passed]
    runtime  3.652s  ...  3.649s
blktrace/002 (blktrace ftrace corruption with sysfs trace)   [passed]
    runtime  0.437s  ...  0.389s
blktests (master) #

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 kernel/trace/blktrace.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 3b7c102a6eb3..488552036583 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -383,7 +383,9 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 	cpu = raw_smp_processor_id();
 
 	if (blk_tracer) {
+		preempt_disable_notrace();
 		tracing_record_cmdline(current);
+		preempt_enable_notrace();
 
 		buffer = blk_tr->array_buffer.buffer;
 		trace_ctx = tracing_gen_ctx_flags(0);
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] blktrace: fix __this_cpu_read/write in preemptible context
  2026-02-27  5:03 [PATCH] blktrace: fix __this_cpu_read/write in preemptible context Chaitanya Kulkarni
@ 2026-02-27 16:48 ` Jens Axboe
  2026-02-27 19:19 ` Steven Rostedt
  1 sibling, 0 replies; 4+ messages in thread
From: Jens Axboe @ 2026-02-27 16:48 UTC (permalink / raw)
  To: Chaitanya Kulkarni, rostedt, mhiramat, mathieu.desnoyers
  Cc: shinichiro.kawasaki, linux-block, linux-trace-kernel

On 2/26/26 10:03 PM, Chaitanya Kulkarni wrote:
> tracing_record_cmdline() internally uses __this_cpu_read() and
> __this_cpu_write() on the per-CPU variable trace_taskinfo_save, and
> trace_save_cmdline() explicitly asserts preemption is disabled via
> lockdep_assert_preemption_disabled().  These operations are safe only
> when preemption is off, as they were designed to be called from the
> scheduler's context probe_wakeup_sched_switch() / probe_wakeup().
> 
> __blk_add_trace() calls tracing_record_cmdline(current) from process
> context where preemption is fully enabled, triggering the following
> splat on using blktests/blktrace/002:
> 
> blktrace/002 (blktrace ftrace corruption with sysfs trace)   [failed]
>     runtime  0.367s  ...  0.437s
>     something found in dmesg:
>     [   81.211018] run blktests blktrace/002 at 2026-02-25 22:24:33
>     [   81.239580] null_blk: disk nullb1 created
>     [   81.357294] BUG: using __this_cpu_read() in preemptible [00000000] code: dd/2516
>     [   81.362842] caller is tracing_record_cmdline+0x10/0x40
>     [   81.362872] CPU: 16 UID: 0 PID: 2516 Comm: dd Tainted: G                 N  7.0.0-rc1lblk+ #84 PREEMPT(full)
>     [   81.362877] Tainted: [N]=TEST
>     [   81.362878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
>     [   81.362881] Call Trace:
>     [   81.362884]  <TASK>
>     [   81.362886]  dump_stack_lvl+0x8d/0xb0
>     ...
>     (See '/mnt/sda/blktests/results/nodev/blktrace/002.dmesg' for the entire message)
> 
> [   81.211018] run blktests blktrace/002 at 2026-02-25 22:24:33
> [   81.239580] null_blk: disk nullb1 created
> [   81.357294] BUG: using __this_cpu_read() in preemptible [00000000] code: dd/2516
> [   81.362842] caller is tracing_record_cmdline+0x10/0x40
> [   81.362872] CPU: 16 UID: 0 PID: 2516 Comm: dd Tainted: G                 N  7.0.0-rc1lblk+ #84 PREEMPT(full)
> [   81.362877] Tainted: [N]=TEST
> [   81.362878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
> [   81.362881] Call Trace:
> [   81.362884]  <TASK>
> [   81.362886]  dump_stack_lvl+0x8d/0xb0
> [   81.362895]  check_preemption_disabled+0xce/0xe0
> [   81.362902]  tracing_record_cmdline+0x10/0x40
> [   81.362923]  __blk_add_trace+0x307/0x5d0
> [   81.362934]  ? lock_acquire+0xe0/0x300
> [   81.362940]  ? iov_iter_extract_pages+0x101/0xa30
> [   81.362959]  blk_add_trace_bio+0x106/0x1e0
> [   81.362968]  submit_bio_noacct_nocheck+0x24b/0x3a0
> [   81.362979]  ? lockdep_init_map_type+0x58/0x260
> [   81.362988]  submit_bio_wait+0x56/0x90
> [   81.363009]  __blkdev_direct_IO_simple+0x16c/0x250
> [   81.363026]  ? __pfx_submit_bio_wait_endio+0x10/0x10
> [   81.363038]  ? rcu_read_lock_any_held+0x73/0xa0
> [   81.363051]  blkdev_read_iter+0xc1/0x140
> [   81.363059]  vfs_read+0x20b/0x330
> [   81.363083]  ksys_read+0x67/0xe0
> [   81.363090]  do_syscall_64+0xbf/0xf00
> [   81.363102]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   81.363106] RIP: 0033:0x7f281906029d
> [   81.363111] Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d 66 63 0a 00 e8 59 ff 01 00 66 0f 1f 84 00 00 00 00 00 80 3d 41 33 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec
> [   81.363113] RSP: 002b:00007ffca127dd48 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [   81.363120] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281906029d
> [   81.363122] RDX: 0000000000001000 RSI: 0000559f8bfae000 RDI: 0000000000000000
> [   81.363123] RBP: 0000000000001000 R08: 0000002863a10a81 R09: 00007f281915f000
> [   81.363124] R10: 00007f2818f77b60 R11: 0000000000000246 R12: 0000559f8bfae000
> [   81.363126] R13: 0000000000000000 R14: 0000000000000000 R15: 000000000000000a
> [   81.363142]  </TASK>

Is this a new issue? Needs a Fixes tag.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] blktrace: fix __this_cpu_read/write in preemptible context
  2026-02-27  5:03 [PATCH] blktrace: fix __this_cpu_read/write in preemptible context Chaitanya Kulkarni
  2026-02-27 16:48 ` Jens Axboe
@ 2026-02-27 19:19 ` Steven Rostedt
  2026-02-27 23:08   ` Chaitanya Kulkarni
  1 sibling, 1 reply; 4+ messages in thread
From: Steven Rostedt @ 2026-02-27 19:19 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: axboe, mhiramat, mathieu.desnoyers, shinichiro.kawasaki,
	linux-block, linux-trace-kernel

On Thu, 26 Feb 2026 21:03:03 -0800
Chaitanya Kulkarni <kch@nvidia.com> wrote:

> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 3b7c102a6eb3..488552036583 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -383,7 +383,9 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
>  	cpu = raw_smp_processor_id();
>  
>  	if (blk_tracer) {
> +		preempt_disable_notrace();
>  		tracing_record_cmdline(current);
> +		preempt_enable_notrace();
>  
>  		buffer = blk_tr->array_buffer.buffer;
>  		trace_ctx = tracing_gen_ctx_flags(0);

Do you know when this started? rcu_read_lock() doesn't disable preemption
in PREEMPT environments, and hasn't for a very long time. I'm surprised it
took this long to detect this? Perhaps this was a bug from day one?

Anyway, the tracing_record_cmdline() is to update the COMM cache so that
the trace has way to show the task->comm based on the saved PID in the
trace. It sets a flag to record the COMM from the sched_switch event if a
trace event happened. It's not needed if no trace event occurred. That
means, instead of adding preempt_disable() here, just move it after the
ring buffer event is reserved, as that means preemption is disabled until
the event is committed.

i.e.

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index e6988929ead2..3735cbc1f99f 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -383,8 +383,6 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 	cpu = raw_smp_processor_id();
 
 	if (blk_tracer) {
-		tracing_record_cmdline(current);
-
 		buffer = blk_tr->array_buffer.buffer;
 		trace_ctx = tracing_gen_ctx_flags(0);
 		switch (bt->version) {
@@ -419,6 +417,8 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
 		if (!event)
 			return;
 
+		tracing_record_cmdline(current);
+
 		switch (bt->version) {
 		case 1:
 			record_blktrace_event(ring_buffer_event_data(event),

-- Steve

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] blktrace: fix __this_cpu_read/write in preemptible context
  2026-02-27 19:19 ` Steven Rostedt
@ 2026-02-27 23:08   ` Chaitanya Kulkarni
  0 siblings, 0 replies; 4+ messages in thread
From: Chaitanya Kulkarni @ 2026-02-27 23:08 UTC (permalink / raw)
  To: Steven Rostedt, axboe@kernel.dk
  Cc: mhiramat@kernel.org, mathieu.desnoyers@efficios.com,
	shinichiro.kawasaki@wdc.com, linux-block@vger.kernel.org,
	Chaitanya Kulkarni, linux-trace-kernel@vger.kernel.org

On 2/27/26 11:19, Steven Rostedt wrote:
> On Thu, 26 Feb 2026 21:03:03 -0800
> Chaitanya Kulkarni <kch@nvidia.com> wrote:
>
>> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
>> index 3b7c102a6eb3..488552036583 100644
>> --- a/kernel/trace/blktrace.c
>> +++ b/kernel/trace/blktrace.c
>> @@ -383,7 +383,9 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
>>   	cpu = raw_smp_processor_id();
>>   
>>   	if (blk_tracer) {
>> +		preempt_disable_notrace();
>>   		tracing_record_cmdline(current);
>> +		preempt_enable_notrace();
>>   
>>   		buffer = blk_tr->array_buffer.buffer;
>>   		trace_ctx = tracing_gen_ctx_flags(0);
> Do you know when this started? rcu_read_lock() doesn't disable preemption
> in PREEMPT environments, and hasn't for a very long time. I'm surprised it
> took this long to detect this? Perhaps this was a bug from day one?

This started with latest pull which I did on Wed.

Last time same test passed on 2/11/26 since I ran blktests and posted
following patch on 2/11/26 12:47 pacific :-

[PATCH V4] blktrace: log dropped REQ_OP_ZONE_XXX events ver1

Shinichiro CC'd here also reported same bug.
I think Fixes tag would be :-

Fixes: 7ffbd48d5cab ("tracing: Cache comms only after an event occurred")

Since above commit added __this_cpu_read(trace_cmdline_save) and
__this_cpu_write(trace_cmdline_save) :-

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index b90a827a4641..88111b08b2c1 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -77,6 +77,13 @@ static int dummy_set_flag(u32 old_flags, u32 bit, int set)
         return 0;
  }
  
+/*
+ * To prevent the comm cache from being overwritten when no
+ * tracing is active, only save the comm when a trace event
+ * occurred.
+ */
+static DEFINE_PER_CPU(bool, trace_cmdline_save);
+
  /*
   * Kill all tracing for good (never come back).
   * It is initialized to 1 but will turn to zero if the initialization
@@ -1135,6 +1142,11 @@ void tracing_record_cmdline(struct task_struct *tsk)
             !tracing_is_on())
                 return;
  
+       if (!__this_cpu_read(trace_cmdline_save))
+               return;
+
+       __this_cpu_write(trace_cmdline_save, false);
+
         trace_save_cmdline(tsk);
  }
  
Full disclosure I've no idea why it has started showing up this month.
I've worked on blktrace extensively and tested my code a lot for form
2020-2021 after above commit but never has seen this even with my patches.

> Anyway, the tracing_record_cmdline() is to update the COMM cache so that
> the trace has way to show the task->comm based on the saved PID in the
> trace. It sets a flag to record the COMM from the sched_switch event if a
> trace event happened. It's not needed if no trace event occurred. That
> means, instead of adding preempt_disable() here, just move it after the
> ring buffer event is reserved, as that means preemption is disabled until
> the event is committed.
>
> i.e.
>
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index e6988929ead2..3735cbc1f99f 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -383,8 +383,6 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
>   	cpu = raw_smp_processor_id();
>   
>   	if (blk_tracer) {
> -		tracing_record_cmdline(current);
> -
>   		buffer = blk_tr->array_buffer.buffer;
>   		trace_ctx = tracing_gen_ctx_flags(0);
>   		switch (bt->version) {
> @@ -419,6 +417,8 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
>   		if (!event)
>   			return;
>   
> +		tracing_record_cmdline(current);
> +
>   		switch (bt->version) {
>   		case 1:
>   			record_blktrace_event(ring_buffer_event_data(event),
>
> -- Steve

Above does fix the problem and make testcase pass :-

blktests (master) # ./check blktrace
blktrace/001 (blktrace zone management command tracing)      [passed]
     runtime  3.650s  ...  3.647s
blktrace/002 (blktrace ftrace corruption with sysfs trace)   [passed]
     runtime  0.411s  ...  0.384s
blktests (master) #

I'll send a V2.

-ck



^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2026-02-27 23:08 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-02-27  5:03 [PATCH] blktrace: fix __this_cpu_read/write in preemptible context Chaitanya Kulkarni
2026-02-27 16:48 ` Jens Axboe
2026-02-27 19:19 ` Steven Rostedt
2026-02-27 23:08   ` Chaitanya Kulkarni

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox