KCSAN: data-race in desc_read / prb_reserve_in

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* KCSAN: data-race in desc_read / prb_reserve_in_last
@ 2026-03-11  3:31 Jianzhou Zhao
  2026-03-11 13:27 ` Petr Mladek
  0 siblings, 1 reply; 5+ messages in thread
From: Jianzhou Zhao @ 2026-03-11  3:31 UTC (permalink / raw)
  To: linux-kernel, senozhatsky, pmladek, rostedt, john.ogness



Subject: [BUG] printk: KCSAN: data-race in desc_read / prb_reserve_in_last

Dear Maintainers,

We are writing to report a KCSAN-detected data-race vulnerability in the Linux kernel. This bug was found by our custom fuzzing tool, RacePilot. The bug occurs in the printk ringbuffer system, dealing with concurrent modifications and unannotated reads of the data block logical limits. We observed this on the Linux kernel version 6.18.0-08691-g2061f18ad76e-dirty.

Call Trace & Context
==================================================================
BUG: KCSAN: data-race in desc_read / prb_reserve_in_last

write to 0xffffffff869276a8 of 8 bytes by task 14248 on cpu 0:
 data_realloc kernel/printk/printk_ringbuffer.c:1252 [inline]
 prb_reserve_in_last+0x831/0xb20 kernel/printk/printk_ringbuffer.c:1529
 vprintk_store+0x603/0x980 kernel/printk/printk.c:2283
 vprintk_emit+0xfd/0x540 kernel/printk/printk.c:2412
 vprintk_default+0x26/0x30 kernel/printk/printk.c:2451
 vprintk+0x1d/0x30 kernel/printk/printk_safe.c:82
 _printk+0x63/0x90 kernel/printk/printk.c:2461
 disk_unlock_native_capacity block/partitions/core.c:520 [inline]
 blk_add_partition block/partitions/core.c:543 [inline]
 blk_add_partitions block/partitions/core.c:633 [inline]
 bdev_disk_changed block/partitions/core.c:693 [inline]
 bdev_disk_changed+0xae3/0xeb0 block/partitions/core.c:642
 loop_reread_partitions+0x44/0xc0 drivers/block/loop.c:449
 loop_set_status+0x41c/0x580 drivers/block/loop.c:1278
 loop_set_status64 drivers/block/loop.c:1374 [inline]
 lo_ioctl+0xf0/0x1170 drivers/block/loop.c:1560
 blkdev_ioctl+0x377/0x420 block/ioctl.c:707
 vfs_ioctl fs/ioctl.c:52 [inline]
 __do_sys_ioctl fs/ioctl.c:605 [inline]
 __se_sys_ioctl fs/ioctl.c:591 [inline]
 __x64_sys_ioctl+0x121/0x170 fs/ioctl.c:591
 x64_sys_call+0xc3a/0x2030 arch/x86/include/generated/asm/syscalls_64.h:17
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xae/0x2c0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffffffff869276a0 of 16 bytes by task 3004 on cpu 1:
 desc_read+0x115/0x250 kernel/printk/printk_ringbuffer.c:499
 desc_read_finalized_seq+0x40/0x140 kernel/printk/printk_ringbuffer.c:1972
 prb_read kernel/printk/printk_ringbuffer.c:2018 [inline]
 _prb_read_valid+0xc1/0x550 kernel/printk/printk_ringbuffer.c:2213
 prb_read_valid_info+0x74/0xa0 kernel/printk/printk_ringbuffer.c:2321
 devkmsg_poll+0xa1/0x120 kernel/printk/printk.c:906
 vfs_poll include/linux/poll.h:82 [inline]
 ep_item_poll.isra.0+0xb0/0x110 fs/eventpoll.c:1059
 ep_send_events+0x231/0x670 fs/eventpoll.c:1818
 ep_try_send_events fs/eventpoll.c:1905 [inline]
 ep_poll fs/eventpoll.c:1970 [inline]
 do_epoll_wait+0x2a8/0x9c0 fs/eventpoll.c:2461
 __do_sys_epoll_wait fs/eventpoll.c:2469 [inline]
 __se_sys_epoll_wait fs/eventpoll.c:2464 [inline]
 __x64_sys_epoll_wait+0xcb/0x190 fs/eventpoll.c:2464
 x64_sys_call+0x194e/0x2030 arch/x86/include/generated/asm/syscalls_64.h:233
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xae/0x2c0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 UID: 0 PID: 3004 Comm: systemd-journal Not tainted 6.18.0-08691-g2061f18ad76e-dirty #50 PREEMPT(voluntary) 
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
==================================================================

Execution Flow & Code Context
The writer task executes `prb_reserve_in_last()`, which concurrently reallocates and extends space for the newest record data by calling `data_realloc()`. Inside `data_realloc()`, it updates `blk_lpos->next` (which points to `desc->text_blk_lpos.next`) with a plain write:
```c
// kernel/printk/printk_ringbuffer.c
static char *data_realloc(struct printk_ringbuffer *rb, unsigned int size,
			  struct prb_data_blk_lpos *blk_lpos, unsigned long id)
{
	...
	blk_lpos->next = next_lpos; // <-- Write
	return &blk->data[0];
}
```

Concurrently, another task acting as a reader accesses the ringbuffer via `desc_read()` to copy the descriptor structure to a local copy `desc_out`. It uses an unsynchronized `memcpy()` to read the `text_blk_lpos` structure which includes the 16 bytes representing both `begin` and `next` positions:
```c
// kernel/printk/printk_ringbuffer.c
static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
				 unsigned long id, struct prb_desc *desc_out,
				 u64 *seq_out, u32 *caller_id_out)
{
	...
	if (desc_out) {
		memcpy(&desc_out->text_blk_lpos, &desc->text_blk_lpos,
		       sizeof(desc_out->text_blk_lpos)); /* LMM(desc_read:C) */ // <-- Lockless Read
	}
	...
}
```

Root Cause Analysis
A data race occurs because the writer modifies `desc->text_blk_lpos.next` (via `blk_lpos->next = next_lpos` in `data_realloc()`) without concurrency synchronization, while a reader copies the entire `text_blk_lpos` structure (via `memcpy()`) concurrently. Since plain writes and `memcpy` lack atomic annotations, the compiler is free to tear the reads and stores or optimize them under the assumption of exclusivity. This can result in a torn or partially updated `text_blk_lpos` record being copied by the reader.
Unfortunately, we were unable to generate a reproducer for this bug.

Potential Impact
If `memcpy` in `desc_read()` tears the read, `desc_out->text_blk_lpos` may end up containing a logically inconsistent `begin` and `next` pair (e.g., an obsolete `begin` combined with an updated `next`, or simply a mangled 64-bit value). This could cause the printk readers to miscalculate the bounds of the data block, leading to reading corrupted text data, traversing out-of-bounds in the text data ring, or encountering infinite loops within the reader utilities, thereby causing a local DoS or syslog corruption.

Proposed Fix
To avoid tearing and resolve the sanitiser warnings while adhering to the memory model, we should replace the plain `memcpy` in `desc_read()` with individual `READ_ONCE()` calls for the `begin` and `next` fields. In addition, we should proactively wrap the concurrent assignments to `blk_lpos` in `data_alloc()` and `data_realloc()` with `WRITE_ONCE()`.

```diff
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -496,8 +496,8 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
 	 * cannot be used because of the atomic_t @state_var field.
 	 */
 	if (desc_out) {
-		memcpy(&desc_out->text_blk_lpos, &desc->text_blk_lpos,
-		       sizeof(desc_out->text_blk_lpos)); /* LMM(desc_read:C) */
+		desc_out->text_blk_lpos.begin = READ_ONCE(desc->text_blk_lpos.begin);
+		desc_out->text_blk_lpos.next = READ_ONCE(desc->text_blk_lpos.next);
 	}
 	if (seq_out) {
 		*seq_out = info->seq; /* also part of desc_read:C */
@@ -1083,8 +1083,8 @@ static char *data_alloc(struct printk_ringbuffer *rb, unsigned int size,
 		 * reader will recognize these special lpos values and handle
 		 * it appropriately.
 		 */
-		blk_lpos->begin = EMPTY_LINE_LPOS;
-		blk_lpos->next = EMPTY_LINE_LPOS;
+		WRITE_ONCE(blk_lpos->begin, EMPTY_LINE_LPOS);
+		WRITE_ONCE(blk_lpos->next, EMPTY_LINE_LPOS);
 		return NULL;
 	}
 
@@ -1107,8 +1107,8 @@ static char *data_alloc(struct printk_ringbuffer *rb, unsigned int size,
 		if (WARN_ON_ONCE(next_lpos - begin_lpos >
 				 DATA_SIZE(data_ring)) ||
 		    !data_push_tail(rb, next_lpos - DATA_SIZE(data_ring))) {
-			blk_lpos->begin = FAILED_LPOS;
-			blk_lpos->next = FAILED_LPOS;
+			WRITE_ONCE(blk_lpos->begin, FAILED_LPOS);
+			WRITE_ONCE(blk_lpos->next, FAILED_LPOS);
 			return NULL;
 		}
 
@@ -1148,8 +1148,8 @@ static char *data_alloc(struct printk_ringbuffer *rb, unsigned int size,
 		blk->id = id;
 	}
 
-	blk_lpos->begin = begin_lpos;
-	blk_lpos->next = next_lpos;
+	WRITE_ONCE(blk_lpos->begin, begin_lpos);
+	WRITE_ONCE(blk_lpos->next, next_lpos);
 
 	return &blk->data[0];
 }
@@ -1249,7 +1249,7 @@ static char *data_realloc(struct printk_ringbuffer *rb, unsigned int size,
 		}
 	}
 
-	blk_lpos->next = next_lpos;
+	WRITE_ONCE(blk_lpos->next, next_lpos);
 
 	return &blk->data[0];
 }
```

We would be highly honored if this could be of any help.

Best regards,
RacePilot Team

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: KCSAN: data-race in desc_read / prb_reserve_in_last
  2026-03-11  3:31 KCSAN: data-race in desc_read / prb_reserve_in_last Jianzhou Zhao
@ 2026-03-11 13:27 ` Petr Mladek
  2026-03-12 11:17   ` Petr Mladek
  0 siblings, 1 reply; 5+ messages in thread
From: Petr Mladek @ 2026-03-11 13:27 UTC (permalink / raw)
  To: Jianzhou Zhao; +Cc: linux-kernel, senozhatsky, rostedt, john.ogness

On Wed 2026-03-11 11:31:08, Jianzhou Zhao wrote:
> 
> 
> Subject: [BUG] printk: KCSAN: data-race in desc_read / prb_reserve_in_last
> 
> Dear Maintainers,
> 
> We are writing to report a KCSAN-detected data-race vulnerability in the Linux kernel. This bug was found by our custom fuzzing tool, RacePilot. The bug occurs in the printk ringbuffer system, dealing with concurrent modifications and unannotated reads of the data block logical limits. We observed this on the Linux kernel version 6.18.0-08691-g2061f18ad76e-dirty.
> 
> Call Trace & Context
> ==================================================================
> BUG: KCSAN: data-race in desc_read / prb_reserve_in_last
> 
> write to 0xffffffff869276a8 of 8 bytes by task 14248 on cpu 0:
>  data_realloc kernel/printk/printk_ringbuffer.c:1252 [inline]
>  prb_reserve_in_last+0x831/0xb20 kernel/printk/printk_ringbuffer.c:1529
>  vprintk_store+0x603/0x980 kernel/printk/printk.c:2283
>  vprintk_emit+0xfd/0x540 kernel/printk/printk.c:2412
>  vprintk_default+0x26/0x30 kernel/printk/printk.c:2451
>  vprintk+0x1d/0x30 kernel/printk/printk_safe.c:82
>  _printk+0x63/0x90 kernel/printk/printk.c:2461
>  disk_unlock_native_capacity block/partitions/core.c:520 [inline]
>  blk_add_partition block/partitions/core.c:543 [inline]
>  blk_add_partitions block/partitions/core.c:633 [inline]
>  bdev_disk_changed block/partitions/core.c:693 [inline]
>  bdev_disk_changed+0xae3/0xeb0 block/partitions/core.c:642
>  loop_reread_partitions+0x44/0xc0 drivers/block/loop.c:449
>  loop_set_status+0x41c/0x580 drivers/block/loop.c:1278
>  loop_set_status64 drivers/block/loop.c:1374 [inline]
>  lo_ioctl+0xf0/0x1170 drivers/block/loop.c:1560
>  blkdev_ioctl+0x377/0x420 block/ioctl.c:707
>  vfs_ioctl fs/ioctl.c:52 [inline]
>  __do_sys_ioctl fs/ioctl.c:605 [inline]
>  __se_sys_ioctl fs/ioctl.c:591 [inline]
>  __x64_sys_ioctl+0x121/0x170 fs/ioctl.c:591
>  x64_sys_call+0xc3a/0x2030 arch/x86/include/generated/asm/syscalls_64.h:17
>  do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
>  do_syscall_64+0xae/0x2c0 arch/x86/entry/syscall_64.c:94
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> read to 0xffffffff869276a0 of 16 bytes by task 3004 on cpu 1:
>  desc_read+0x115/0x250 kernel/printk/printk_ringbuffer.c:499
>  desc_read_finalized_seq+0x40/0x140 kernel/printk/printk_ringbuffer.c:1972
>  prb_read kernel/printk/printk_ringbuffer.c:2018 [inline]
>  _prb_read_valid+0xc1/0x550 kernel/printk/printk_ringbuffer.c:2213
>  prb_read_valid_info+0x74/0xa0 kernel/printk/printk_ringbuffer.c:2321
>  devkmsg_poll+0xa1/0x120 kernel/printk/printk.c:906
>  vfs_poll include/linux/poll.h:82 [inline]
>  ep_item_poll.isra.0+0xb0/0x110 fs/eventpoll.c:1059
>  ep_send_events+0x231/0x670 fs/eventpoll.c:1818
>  ep_try_send_events fs/eventpoll.c:1905 [inline]
>  ep_poll fs/eventpoll.c:1970 [inline]
>  do_epoll_wait+0x2a8/0x9c0 fs/eventpoll.c:2461
>  __do_sys_epoll_wait fs/eventpoll.c:2469 [inline]
>  __se_sys_epoll_wait fs/eventpoll.c:2464 [inline]
>  __x64_sys_epoll_wait+0xcb/0x190 fs/eventpoll.c:2464
>  x64_sys_call+0x194e/0x2030 arch/x86/include/generated/asm/syscalls_64.h:233
>  do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
>  do_syscall_64+0xae/0x2c0 arch/x86/entry/syscall_64.c:94
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> Reported by Kernel Concurrency Sanitizer on:
> CPU: 1 UID: 0 PID: 3004 Comm: systemd-journal Not tainted 6.18.0-08691-g2061f18ad76e-dirty #50 PREEMPT(voluntary) 
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> ==================================================================
> 
> Execution Flow & Code Context
> The writer task executes `prb_reserve_in_last()`, which concurrently reallocates and extends space for the newest record data by calling `data_realloc()`. Inside `data_realloc()`, it updates `blk_lpos->next` (which points to `desc->text_blk_lpos.next`) with a plain write:
> ```c
> // kernel/printk/printk_ringbuffer.c
> static char *data_realloc(struct printk_ringbuffer *rb, unsigned int size,
> 			  struct prb_data_blk_lpos *blk_lpos, unsigned long id)
> {
> 	...
> 	blk_lpos->next = next_lpos; // <-- Write
> 	return &blk->data[0];
> }
> ```
> 
> Concurrently, another task acting as a reader accesses the ringbuffer via `desc_read()` to copy the descriptor structure to a local copy `desc_out`. It uses an unsynchronized `memcpy()` to read the `text_blk_lpos` structure which includes the 16 bytes representing both `begin` and `next` positions:
> ```c
> // kernel/printk/printk_ringbuffer.c
> static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
> 				 unsigned long id, struct prb_desc *desc_out,
> 				 u64 *seq_out, u32 *caller_id_out)
> {
> 	...
> 	if (desc_out) {
> 		memcpy(&desc_out->text_blk_lpos, &desc->text_blk_lpos,
> 		       sizeof(desc_out->text_blk_lpos)); /* LMM(desc_read:C) */ // <-- Lockless Read
> 	}

The race is actually expected, see the comment above this code:

	/*
	 * Copy the descriptor data. The data is not valid until the
	 * state has been re-checked. A memcpy() for all of @desc
	 * cannot be used because of the atomic_t @state_var field.
	 */
	if (desc_out) {
		memcpy(&desc_out->text_blk_lpos, &desc->text_blk_lpos,
		       sizeof(desc_out->text_blk_lpos)); /* LMM(desc_read:C) */
	}

The data are copied. But the reader is allowed to use
them only when the state of the block is marked as finalized.

And there is a long way between data_realloc() and
setting the finalized state:

   1. Writer calls prb_reserve_in_last() -> data_realloc().
      Note that the reserved size can only get increased here.

   2. If prb_reserve_in_last() succeeded that writer copies
      new data into the extended buffer.

   3. The record is finalized in two situations:

	a) When the appended text ended with newline '\n'
	   and the writer calls prb_final_commit().

	b) When the writer calls pr_commit() and next
	   record is already reserved.

	c) When another writer calls prb_reserve() and the last
	   (extended) record has already been committed.

So, the race should be safe.

But I agree that some race is there. The reader might copy
an inconsistent descriptor, get delayed by long NMI, and
then see already finalized state.

I mean the following:

CPU0				CPU1

prb_read()
  desc_read()
				prb_prb_reserve_in_last()
				  data_realloc()
    memcpy(text_blk_lpos)	    blk_lpos->next = next_lpos

Warning: memcpy copies inconsistent lpos !!!

      NMI enter

				# Writer appends data and finalize
				# the record
				prb_final_commit()

     NMI handler ends

  state_val = atomic_long_read(state_var);
  d_state = get_desc_state(id, state_val);

BANG: The reader gets inconsistent blk_lpos and sees
      finalized state.

On the positive side:
---------------------

There are still some sanity checks, for example:

static const char *get_data(struct prb_data_ring *data_ring,
			    struct prb_data_blk_lpos *blk_lpos,
			    unsigned int *data_size)
{
[...]
		*data_size = blk_lpos->next - blk_lpos->begin;
[...]
	if (WARN_ON_ONCE(!data_check_size(data_ring, *data_size) || !*data_size))
		return NULL;
	...

They might catch races when the data_size is negative number.

As a result, I am not sure if any race could even produce
bigger *data_size which might point behind the log buffer.


Catch the race:
---------------

We still should try to catch the race.

For example, desc_read() reads the state before and after
reading the data. It might invalidate them when the states
differ. Something like:

diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 56c8e3d031f4..a5a456904b78 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -454,13 +454,13 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
 	struct printk_info *info = to_info(desc_ring, id);
 	struct prb_desc *desc = to_desc(desc_ring, id);
 	atomic_long_t *state_var = &desc->state_var;
-	enum desc_state d_state;
+	enum desc_state d_state_before, d_state_after;
 	unsigned long state_val;
 
 	/* Check the descriptor state. */
 	state_val = atomic_long_read(state_var); /* LMM(desc_read:A) */
-	d_state = get_desc_state(id, state_val);
-	if (d_state == desc_miss || d_state == desc_reserved) {
+	d_state_before = get_desc_state(id, state_val);
+	if (d_state_before == desc_miss || d_state_before == desc_reserved) {
 		/*
 		 * The descriptor is in an inconsistent state. Set at least
 		 * @state_var so that the caller can see the details of
@@ -546,11 +546,27 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
 	 * which may have changed since the load above.
 	 */
 	state_val = atomic_long_read(state_var); /* LMM(desc_read:E) */
-	d_state = get_desc_state(id, state_val);
+	d_state_after = get_desc_state(id, state_val);
+
+	/*
+	 * The copied output data might be inconsistent when the state
+	 * has changed in the meantime. Make the caller aware that
+	 * they are not final.
+	 *
+	 * FIXME: We need to double check all callers ignore the copied
+	 *        data when desc_reserved is returned!
+	 *
+	 *	  A cleaner solution might be introducing another state
+	 *	  for this.
+	 *
+	 *	  Also we might want to check/modify @state_val instead.
+	 */
+	if (d_state_after != d_state_before)
+		d_state_after = desc_reserved;
 out:
 	if (desc_out)
 		atomic_long_set(&desc_out->state_var, state_val);
-	return d_state;
+	return d_state_after;
 }
 
 /*

Also we might add more consistency checks when reading the data.
For example:

diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 56c8e3d031f4..e252678337f6 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -1319,6 +1335,12 @@ static const char *get_data(struct prb_data_ring *data_ring,
 	/* Subtract block ID space from size to reflect data size. */
 	*data_size -= sizeof(db->id);
 
+	/* Sanity check that we stay inside the printk data buffer */
+	if (WARN_ON_ONCE(&db->data[0] < data_ring->data ||
+			 &db->data[0] + *data_size >= data_ring->data + DATA_SIZE(data_ring)))
+		return NULL;
+
+
 	return &db->data[0];
 }


Hmm, I have tried to run printk printk_ringbuffer_kunit_test
and I have triggered the WARN_ON_ONCE() in get_data():

[  269.954151] KTAP version 1
[  269.954601] 1..1
[  269.955248]     KTAP version 1
[  269.955718]     # Subtest: printk-ringbuffer
[  269.956287]     # module: printk_ringbuffer_kunit_test
[  269.956294]     1..1
[  269.957591]     # test_readerwriter: running for 10000 ms
[  269.958731]     # test_readerwriter: start thread 001 (writer)
[  269.958827]     # test_readerwriter: start thread 002 (writer)
[  269.959031]     # test_readerwriter: start thread 003 (writer)
[  269.959287]     # test_readerwriter: start thread 004 (writer)
[  269.959550]     # test_readerwriter: start thread 005 (writer)
[  269.959649]     # test_readerwriter: start thread 006 (writer)
[  269.959800]     # test_readerwriter: start thread 007 (writer)
[  269.959888]     # test_readerwriter: start thread 008 (writer)
[  269.963125]     # test_readerwriter: start thread 009 (writer)
[  269.963205]     # test_readerwriter: start thread 010 (writer)
[  269.963275]     # test_readerwriter: starting test
[  269.963284]     # test_readerwriter: start thread 011 (writer)
[  269.963317]     # test_readerwriter: start reader
[  269.963919] ------------[ cut here ]------------
[  269.972098] &db->data[0] < data_ring->data || &db->data[0] + *data_size >= data_ring->data + DATA_SIZE(data_ring)
[  269.972101] WARNING: kernel/printk/printk_ringbuffer.c:1339 at get_data+0x13a/0x150, CPU#0: kunit_try_catch/4052
[  269.972112] Modules linked in: printk_ringbuffer_kunit_test(E)
[  269.972122] CPU: 0 UID: 0 PID: 4052 Comm: kunit_try_catch Kdump: loaded Tainted: G            E    N  7.0.0-rc3-default-dirty #562 PREEMPT(full)  60283ab8e9196bbd61ecb2de6cdd826bbd2749b4
[  269.972128] Tainted: [E]=UNSIGNED_MODULE, [N]=TEST
[  269.972130] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-2-g4f253b9b-prebuilt.qemu.org 04/01/2014
[  269.972133] RIP: 0010:get_data+0x13a/0x150
[  269.972138] Code: 44 29 d2 48 f7 d0 4c 21 d0 48 03 47 08 41 89 10 85 d2 0f 85 66 ff ff ff 0f 0b 31 c0 e9 0a f2 cd ff 0f 0b 31 c0 e9 01 f2 cd ff <0f> 0b 31 c0 e9 f8 f1 cd ff 0f 0b eb e3 0f 0b eb df 0f 0b eb db 90
[  269.972141] RSP: 0018:ffffd084024d3b98 EFLAGS: 00010246
[  269.972145] RAX: ffffffffc028cfc0 RBX: 0000000000002b57 RCX: 000000000000000d
[  269.972148] RDX: ffffffffc028d040 RSI: ffffffffc028d040 RDI: 0000000000002000
[  269.972151] RBP: 3fffffffffffffff R08: ffffd084024d3bdc R09: 00000000000d7f78
[  269.972153] R10: 00000000000d7f78 R11: 0000000000002000 R12: 000000000000007c
[  269.972156] R13: 00000000000004f8 R14: ffffffffc0295860 R15: 0000000000002610
[  269.972160] FS:  0000000000000000(0000) GS:ffff894007af0000(0000) knlGS:0000000000000000
[  269.972163] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  269.972166] CR2: 00007fc8e1e68758 CR3: 0000000142458000 CR4: 0000000000750ef0
[  269.972171] PKRU: 55555554
[  269.972173] Call Trace:
[  269.972177]  <TASK>
[  269.972179]  _prb_read_valid+0x179/0x370
[  269.972210]  prb_read_valid+0x1b/0x30
[  269.997890]  prbtest_reader.isra.0+0xec/0x304 [printk_ringbuffer_kunit_test 67c42ea3f25a5751580d5cf445aab4427a263db6]
[  269.997905]  ? __pfx_autoremove_wake_function+0x10/0x10
[  270.000468]  ? __pfx_prbtest_wakeup_callback+0x10/0x10 [printk_ringbuffer_kunit_test 67c42ea3f25a5751580d5cf445aab4427a263db6]
[  270.002296]  test_readerwriter.cold+0x62/0x6bc [printk_ringbuffer_kunit_test 67c42ea3f25a5751580d5cf445aab4427a263db6]
[  270.003999]  ? srso_alias_return_thunk+0x5/0xfbef5
[  270.004006]  ? lockdep_hardirqs_on+0x8c/0x130
[  270.004019]  ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
[  270.004026]  kunit_try_run_case+0x72/0xf0
[  270.004033]  ? srso_alias_return_thunk+0x5/0xfbef5
[  270.004037]  ? srso_alias_return_thunk+0x5/0xfbef5
[  270.004040]  ? lockdep_hardirqs_on+0x8c/0x130
[  270.004043]  ? srso_alias_return_thunk+0x5/0xfbef5
[  270.004052]  ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
[  270.004057]  kunit_generic_run_threadfn_adapter+0x21/0x40
[  270.004063]  kthread+0x103/0x140
[  270.004068]  ? __pfx_kthread+0x10/0x10
[  270.004079]  ret_from_fork+0x276/0x3a0
[  270.004088]  ? __pfx_kthread+0x10/0x10
[  270.004092]  ? __pfx_kthread+0x10/0x10
[  270.004100]  ret_from_fork_asm+0x1a/0x30
[  270.004135]  </TASK>
[  270.004137] irq event stamp: 879
[  270.004139] hardirqs last  enabled at (885): [<ffffffff8d41e9b5>] console_trylock_spinning+0x1c5/0x1f0
[  270.004144] hardirqs last disabled at (890): [<ffffffff8d41e968>] console_trylock_spinning+0x178/0x1f0
[  270.004148] softirqs last  enabled at (458): [<ffffffff8d36c7c0>] handle_softirqs+0x380/0x450
[  270.004154] softirqs last disabled at (449): [<ffffffff8d36c979>] __irq_exit_rcu+0xd9/0x150
[  270.004158] ---[ end trace 0000000000000000 ]---
[  280.158960]     # test_readerwriter: end reader: read=17402256 seq=17524469
[  280.160183]     # test_readerwriter: completed test
[  280.163200]     # test_readerwriter: end thread 011: wrote=1619327
[  280.164338]     # test_readerwriter: end thread 010: wrote=1590337
[  280.165441]     # test_readerwriter: end thread 009: wrote=1625807
[  280.166375]     # test_readerwriter: end thread 008: wrote=1592085
[  280.167406]     # test_readerwriter: end thread 007: wrote=1593299
[  280.168341]     # test_readerwriter: end thread 006: wrote=1591461
[  280.169378]     # test_readerwriter: end thread 005: wrote=1593876
[  280.170253]     # test_readerwriter: end thread 004: wrote=1608503
[  280.171123]     # test_readerwriter: end thread 003: wrote=1620013
[  280.171970]     # test_readerwriter: end thread 002: wrote=1616222
[  280.172798]     # test_readerwriter: end thread 001: wrote=1595192
[  280.173652]     # test_readerwriter.speed: slow
[  280.173660]     ok 1 test_readerwriter
[  280.174266] ok 1 printk-ringbuffer


I have to think more about it.

Anyway, I wonder whether to discuss this in public. I believe that if
there was a race that it would be very hard to misuse in practice.

Best Regards,
Petr

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: KCSAN: data-race in desc_read / prb_reserve_in_last
  2026-03-11 13:27 ` Petr Mladek
@ 2026-03-12 11:17   ` Petr Mladek
  2026-03-13 16:31     ` John Ogness
  0 siblings, 1 reply; 5+ messages in thread
From: Petr Mladek @ 2026-03-12 11:17 UTC (permalink / raw)
  To: Jianzhou Zhao; +Cc: linux-kernel, senozhatsky, rostedt, john.ogness

On Wed 2026-03-11 14:27:46, Petr Mladek wrote:
> On Wed 2026-03-11 11:31:08, Jianzhou Zhao wrote:
> > 
> > 
> > Subject: [BUG] printk: KCSAN: data-race in desc_read / prb_reserve_in_last
> > 
> > Dear Maintainers,
> > 
> > We are writing to report a KCSAN-detected data-race vulnerability in the Linux kernel. This bug was found by our custom fuzzing tool, RacePilot. The bug occurs in the printk ringbuffer system, dealing with concurrent modifications and unannotated reads of the data block logical limits. We observed this on the Linux kernel version 6.18.0-08691-g2061f18ad76e-dirty.
> > 
> > Call Trace & Context
> > ==================================================================
> > BUG: KCSAN: data-race in desc_read / prb_reserve_in_last
> > 
> > write to 0xffffffff869276a8 of 8 bytes by task 14248 on cpu 0:
> >  data_realloc kernel/printk/printk_ringbuffer.c:1252 [inline]
> >  prb_reserve_in_last+0x831/0xb20 kernel/printk/printk_ringbuffer.c:1529
> >  vprintk_store+0x603/0x980 kernel/printk/printk.c:2283
> >  vprintk_emit+0xfd/0x540 kernel/printk/printk.c:2412
> >  vprintk_default+0x26/0x30 kernel/printk/printk.c:2451
> >  vprintk+0x1d/0x30 kernel/printk/printk_safe.c:82
> >  _printk+0x63/0x90 kernel/printk/printk.c:2461
> >  disk_unlock_native_capacity block/partitions/core.c:520 [inline]
> >  blk_add_partition block/partitions/core.c:543 [inline]
> >  blk_add_partitions block/partitions/core.c:633 [inline]
> >  bdev_disk_changed block/partitions/core.c:693 [inline]
> >  bdev_disk_changed+0xae3/0xeb0 block/partitions/core.c:642
> >  loop_reread_partitions+0x44/0xc0 drivers/block/loop.c:449
> >  loop_set_status+0x41c/0x580 drivers/block/loop.c:1278
> >  loop_set_status64 drivers/block/loop.c:1374 [inline]
> >  lo_ioctl+0xf0/0x1170 drivers/block/loop.c:1560
> >  blkdev_ioctl+0x377/0x420 block/ioctl.c:707
> >  vfs_ioctl fs/ioctl.c:52 [inline]
> >  __do_sys_ioctl fs/ioctl.c:605 [inline]
> >  __se_sys_ioctl fs/ioctl.c:591 [inline]
> >  __x64_sys_ioctl+0x121/0x170 fs/ioctl.c:591
> >  x64_sys_call+0xc3a/0x2030 arch/x86/include/generated/asm/syscalls_64.h:17
> >  do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
> >  do_syscall_64+0xae/0x2c0 arch/x86/entry/syscall_64.c:94
> >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > 
> > read to 0xffffffff869276a0 of 16 bytes by task 3004 on cpu 1:
> >  desc_read+0x115/0x250 kernel/printk/printk_ringbuffer.c:499
> >  desc_read_finalized_seq+0x40/0x140 kernel/printk/printk_ringbuffer.c:1972
> >  prb_read kernel/printk/printk_ringbuffer.c:2018 [inline]
> >  _prb_read_valid+0xc1/0x550 kernel/printk/printk_ringbuffer.c:2213
> >  prb_read_valid_info+0x74/0xa0 kernel/printk/printk_ringbuffer.c:2321
> >  devkmsg_poll+0xa1/0x120 kernel/printk/printk.c:906
> >  vfs_poll include/linux/poll.h:82 [inline]
> >  ep_item_poll.isra.0+0xb0/0x110 fs/eventpoll.c:1059
> >  ep_send_events+0x231/0x670 fs/eventpoll.c:1818
> >  ep_try_send_events fs/eventpoll.c:1905 [inline]
> >  ep_poll fs/eventpoll.c:1970 [inline]
> >  do_epoll_wait+0x2a8/0x9c0 fs/eventpoll.c:2461
> >  __do_sys_epoll_wait fs/eventpoll.c:2469 [inline]
> >  __se_sys_epoll_wait fs/eventpoll.c:2464 [inline]
> >  __x64_sys_epoll_wait+0xcb/0x190 fs/eventpoll.c:2464
> >  x64_sys_call+0x194e/0x2030 arch/x86/include/generated/asm/syscalls_64.h:233
> >  do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
> >  do_syscall_64+0xae/0x2c0 arch/x86/entry/syscall_64.c:94
> >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > 
> > Reported by Kernel Concurrency Sanitizer on:
> > CPU: 1 UID: 0 PID: 3004 Comm: systemd-journal Not tainted 6.18.0-08691-g2061f18ad76e-dirty #50 PREEMPT(voluntary) 
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> > ==================================================================
> > 
> > Execution Flow & Code Context
> > The writer task executes `prb_reserve_in_last()`, which concurrently reallocates and extends space for the newest record data by calling `data_realloc()`. Inside `data_realloc()`, it updates `blk_lpos->next` (which points to `desc->text_blk_lpos.next`) with a plain write:
> > ```c
> > // kernel/printk/printk_ringbuffer.c
> > static char *data_realloc(struct printk_ringbuffer *rb, unsigned int size,
> > 			  struct prb_data_blk_lpos *blk_lpos, unsigned long id)
> > {
> > 	...
> > 	blk_lpos->next = next_lpos; // <-- Write
> > 	return &blk->data[0];
> > }
> > ```
> > 
> > Concurrently, another task acting as a reader accesses the ringbuffer via `desc_read()` to copy the descriptor structure to a local copy `desc_out`. It uses an unsynchronized `memcpy()` to read the `text_blk_lpos` structure which includes the 16 bytes representing both `begin` and `next` positions:
> > ```c
> > // kernel/printk/printk_ringbuffer.c
> > static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
> > 				 unsigned long id, struct prb_desc *desc_out,
> > 				 u64 *seq_out, u32 *caller_id_out)
> > {
> > 	...
> > 	if (desc_out) {
> > 		memcpy(&desc_out->text_blk_lpos, &desc->text_blk_lpos,
> > 		       sizeof(desc_out->text_blk_lpos)); /* LMM(desc_read:C) */ // <-- Lockless Read
> > 	}
> 
> The race is actually expected, see the comment above this code:
> 
> 	/*
> 	 * Copy the descriptor data. The data is not valid until the
> 	 * state has been re-checked. A memcpy() for all of @desc
> 	 * cannot be used because of the atomic_t @state_var field.
> 	 */
> 	if (desc_out) {
> 		memcpy(&desc_out->text_blk_lpos, &desc->text_blk_lpos,
> 		       sizeof(desc_out->text_blk_lpos)); /* LMM(desc_read:C) */
> 	}
> 
> The data are copied. But the reader is allowed to use
> them only when the state of the block is marked as finalized.
> 
> And there is a long way between data_realloc() and
> setting the finalized state:
> 
>    1. Writer calls prb_reserve_in_last() -> data_realloc().
>       Note that the reserved size can only get increased here.
> 
>    2. If prb_reserve_in_last() succeeded that writer copies
>       new data into the extended buffer.
> 
>    3. The record is finalized in two situations:
> 
> 	a) When the appended text ended with newline '\n'
> 	   and the writer calls prb_final_commit().
> 
> 	b) When the writer calls pr_commit() and next
> 	   record is already reserved.
> 
> 	c) When another writer calls prb_reserve() and the last
> 	   (extended) record has already been committed.
> 
> So, the race should be safe.
> 
> But I agree that some race is there. The reader might copy
> an inconsistent descriptor, get delayed by long NMI, and
> then see already finalized state.
> 
> I mean the following:
> 
> CPU0				CPU1
> 
> prb_read()
>   desc_read()
> 				prb_prb_reserve_in_last()
> 				  data_realloc()
>     memcpy(text_blk_lpos)	    blk_lpos->next = next_lpos
> 
> Warning: memcpy copies inconsistent lpos !!!
> 
>       NMI enter
> 
> 				# Writer appends data and finalize
> 				# the record
> 				prb_final_commit()
> 
>      NMI handler ends
> 
>   state_val = atomic_long_read(state_var);
>   d_state = get_desc_state(id, state_val);
> 
> BANG: The reader gets inconsistent blk_lpos and sees
>       finalized state.
> 
> On the positive side:
> ---------------------
> 
> There are still some sanity checks, for example:
> 
> static const char *get_data(struct prb_data_ring *data_ring,
> 			    struct prb_data_blk_lpos *blk_lpos,
> 			    unsigned int *data_size)
> {
> [...]
> 		*data_size = blk_lpos->next - blk_lpos->begin;
> [...]
> 	if (WARN_ON_ONCE(!data_check_size(data_ring, *data_size) || !*data_size))
> 		return NULL;
> 	...
> 
> They might catch races when the data_size is negative number.
> 
> As a result, I am not sure if any race could even produce
> bigger *data_size which might point behind the log buffer.
> 
> 
> Catch the race:
> ---------------
> 
> We still should try to catch the race.
> 
> For example, desc_read() reads the state before and after
> reading the data. It might invalidate them when the states
> differ. Something like:
> 
> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
> index 56c8e3d031f4..a5a456904b78 100644
> --- a/kernel/printk/printk_ringbuffer.c
> +++ b/kernel/printk/printk_ringbuffer.c
> @@ -454,13 +454,13 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
>  	struct printk_info *info = to_info(desc_ring, id);
>  	struct prb_desc *desc = to_desc(desc_ring, id);
>  	atomic_long_t *state_var = &desc->state_var;
> -	enum desc_state d_state;
> +	enum desc_state d_state_before, d_state_after;
>  	unsigned long state_val;
>  
>  	/* Check the descriptor state. */
>  	state_val = atomic_long_read(state_var); /* LMM(desc_read:A) */
> -	d_state = get_desc_state(id, state_val);
> -	if (d_state == desc_miss || d_state == desc_reserved) {
> +	d_state_before = get_desc_state(id, state_val);
> +	if (d_state_before == desc_miss || d_state_before == desc_reserved) {
>  		/*
>  		 * The descriptor is in an inconsistent state. Set at least
>  		 * @state_var so that the caller can see the details of
> @@ -546,11 +546,27 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
>  	 * which may have changed since the load above.
>  	 */
>  	state_val = atomic_long_read(state_var); /* LMM(desc_read:E) */
> -	d_state = get_desc_state(id, state_val);
> +	d_state_after = get_desc_state(id, state_val);
> +
> +	/*
> +	 * The copied output data might be inconsistent when the state
> +	 * has changed in the meantime. Make the caller aware that
> +	 * they are not final.
> +	 *
> +	 * FIXME: We need to double check all callers ignore the copied
> +	 *        data when desc_reserved is returned!
> +	 *
> +	 *	  A cleaner solution might be introducing another state
> +	 *	  for this.
> +	 *
> +	 *	  Also we might want to check/modify @state_val instead.
> +	 */
> +	if (d_state_after != d_state_before)
> +		d_state_after = desc_reserved;

As explained in the "FIXME: ..." comment above, the proposed approach
is complicated.

It might be much easier to use "goto try_again" and re-read the data
when "state_val" has changed when reading. It will make sure that
the read data are in sync with the returned state.

>  out:
>  	if (desc_out)
>  		atomic_long_set(&desc_out->state_var, state_val);
> -	return d_state;
> +	return d_state_after;
>  }

Something like [rfc v2]:

diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 56c8e3d031f4..c080edf5636a 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -454,11 +454,12 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
 	struct printk_info *info = to_info(desc_ring, id);
 	struct prb_desc *desc = to_desc(desc_ring, id);
 	atomic_long_t *state_var = &desc->state_var;
+	unsigned long state_val, state_val_check;
 	enum desc_state d_state;
-	unsigned long state_val;
 
 	/* Check the descriptor state. */
 	state_val = atomic_long_read(state_var); /* LMM(desc_read:A) */
+try_again:
 	d_state = get_desc_state(id, state_val);
 	if (d_state == desc_miss || d_state == desc_reserved) {
 		/*
@@ -466,7 +467,9 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
 		 * @state_var so that the caller can see the details of
 		 * the inconsistent state.
 		 */
-		goto out;
+		if (desc_out)
+			atomic_long_set(&desc_out->state_var, state_val);
+		return d_state;
 	}
 
 	/*
@@ -542,14 +545,19 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
 	smp_rmb(); /* LMM(desc_read:D) */
 
 	/*
-	 * The data has been copied. Return the current descriptor state,
-	 * which may have changed since the load above.
+	 * The data has been copied. Make sure that they match the given @id
+	 * and @d_state read at the beginning. Re-read them otherwise.
+	 *
+	 * Note that this simplifies the logic. Especially, the reader API
+	 * expects that all values are stable when the record is finalized.
+	 * But they are stable only when it stays finalized all the time.
 	 */
-	state_val = atomic_long_read(state_var); /* LMM(desc_read:E) */
-	d_state = get_desc_state(id, state_val);
-out:
-	if (desc_out)
-		atomic_long_set(&desc_out->state_var, state_val);
+	state_val_check = atomic_long_read(state_var); /* LMM(desc_read:E) */
+	if (state_val_check != state_val) {
+		state_val = state_val_check;
+		goto try_again;
+	}
+
 	return d_state;
 }

>  /*
> 
> Also we might add more consistency checks when reading the data.
> For example:
> 
> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
> index 56c8e3d031f4..e252678337f6 100644
> --- a/kernel/printk/printk_ringbuffer.c
> +++ b/kernel/printk/printk_ringbuffer.c
> @@ -1319,6 +1335,12 @@ static const char *get_data(struct prb_data_ring *data_ring,
>  	/* Subtract block ID space from size to reflect data size. */
>  	*data_size -= sizeof(db->id);
>  
> +	/* Sanity check that we stay inside the printk data buffer */
> +	if (WARN_ON_ONCE(&db->data[0] < data_ring->data ||
> +			 &db->data[0] + *data_size >= data_ring->data + DATA_SIZE(data_ring)))

I added some debug output:

		pr_err_once("Printk RB begin: 0x%llx vs. 0x%llx\n",
			    &db->data[0], data_ring->data);
		pr_err_once("Printk RB end:   0x%llx vs. 0x%llx\n",
			    &db->data[0] + *data_size,
			    data_ring->data + DATA_SIZE(data_ring));
		pr_err_once("Printk RB size:  0x%x vs. 0x%x\n",
			    *data_size, DATA_SIZE(data_ring));

and WARN_ON_ONCE() triggeres when the data fills the very end of the
buffer, for example:

[   69.349012] ------------[ cut here ]------------
[   69.360993] &db->data[0] < data_ring->data || &db->data[0] + *data_size >= data_ring->data + DATA_SIZE(data_ring)
[   69.360996] WARNING: kernel/printk/printk_ringbuffer.c:1331 at get_data+0x153/0x1a0, CPU#0: kunit_try_catch/1630
[...]
   69.424288] ---[ end trace 0000000000000000 ]---
[   69.424292] Printk RB begin: 0xffffffffc04a8fd8 vs. 0xffffffffc04a7040
[   69.424294] Printk RB end:   0xffffffffc04a9040 vs. 0xffffffffc04a9040
                                ^^^^^^^^^^^^^^^^^^     ^^^^^^^^^^^^^^^^^^
[   69.424297] Printk RB size:  0x68 vs. 0x2000

And obviously, there should be s|>=|>|, aka

 +	if (WARN_ON_ONCE(&db->data[0] < data_ring->data ||
 +			 &db->data[0] + *data_size > data_ring->data + DATA_SIZE(data_ring)))

We are comparing size vs size...

I have never seen the warning anymore with the ">" check.

> +		return NULL;
> +
> +
>  	return &db->data[0];
>  }

Best Regards,
Petr

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: KCSAN: data-race in desc_read / prb_reserve_in_last
  2026-03-12 11:17   ` Petr Mladek
@ 2026-03-13 16:31     ` John Ogness
  2026-03-19 11:17       ` Petr Mladek
  0 siblings, 1 reply; 5+ messages in thread
From: John Ogness @ 2026-03-13 16:31 UTC (permalink / raw)
  To: Petr Mladek, Jianzhou Zhao; +Cc: linux-kernel, senozhatsky, rostedt

Hi Petr,

On 2026-03-12, Petr Mladek <pmladek@suse.com> wrote:
> [rfc v2]:
>
> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
> index 56c8e3d031f4..c080edf5636a 100644
> --- a/kernel/printk/printk_ringbuffer.c
> +++ b/kernel/printk/printk_ringbuffer.c
> @@ -454,11 +454,12 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
>  	struct printk_info *info = to_info(desc_ring, id);
>  	struct prb_desc *desc = to_desc(desc_ring, id);
>  	atomic_long_t *state_var = &desc->state_var;
> +	unsigned long state_val, state_val_check;
>  	enum desc_state d_state;
> -	unsigned long state_val;
>  
>  	/* Check the descriptor state. */
>  	state_val = atomic_long_read(state_var); /* LMM(desc_read:A) */
> +try_again:
>  	d_state = get_desc_state(id, state_val);
>  	if (d_state == desc_miss || d_state == desc_reserved) {
>  		/*
> @@ -466,7 +467,9 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
>  		 * @state_var so that the caller can see the details of
>  		 * the inconsistent state.
>  		 */
> -		goto out;
> +		if (desc_out)
> +			atomic_long_set(&desc_out->state_var, state_val);
> +		return d_state;

I would keep the "goto out" so that the success scenario can also set
the state_var of @desc_out.

>  	}
>  
>  	/*
> @@ -542,14 +545,19 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
>  	smp_rmb(); /* LMM(desc_read:D) */
>  
>  	/*
> -	 * The data has been copied. Return the current descriptor state,
> -	 * which may have changed since the load above.
> +	 * The data has been copied. Make sure that they match the given @id
> +	 * and @d_state read at the beginning. Re-read them otherwise.
> +	 *
> +	 * Note that this simplifies the logic. Especially, the reader API
> +	 * expects that all values are stable when the record is finalized.
> +	 * But they are stable only when it stays finalized all the time.
>  	 */
> -	state_val = atomic_long_read(state_var); /* LMM(desc_read:E) */
> -	d_state = get_desc_state(id, state_val);
> -out:
> -	if (desc_out)
> -		atomic_long_set(&desc_out->state_var, state_val);
> +	state_val_check = atomic_long_read(state_var); /* LMM(desc_read:E) */
> +	if (state_val_check != state_val) {
> +		state_val = state_val_check;
> +		goto try_again;
> +	}
> +

The out label and code could stay as it was. Or is there some reason you
do not want to set the state_var in the success case?

out:
	if (desc_out)
		atomic_long_set(&desc_out->state_var, state_val);

>  	return d_state;
>  }

We should probably annotate the memcpy() with a wrapping
data_race(). Sadly we were aware of and expected the data race, but did
not realize that commit 4cfc7258f876 ("printk: ringbuffer: add
finalization/extension support") made the data race a problem.

John

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: KCSAN: data-race in desc_read / prb_reserve_in_last
  2026-03-13 16:31     ` John Ogness
@ 2026-03-19 11:17       ` Petr Mladek
  0 siblings, 0 replies; 5+ messages in thread
From: Petr Mladek @ 2026-03-19 11:17 UTC (permalink / raw)
  To: John Ogness; +Cc: Jianzhou Zhao, linux-kernel, senozhatsky, rostedt

On Fri 2026-03-13 17:37:55, John Ogness wrote:
> Hi Petr,
> 
> On 2026-03-12, Petr Mladek <pmladek@suse.com> wrote:
> > [rfc v2]:
> >
> > diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
> > index 56c8e3d031f4..c080edf5636a 100644
> > --- a/kernel/printk/printk_ringbuffer.c
> > +++ b/kernel/printk/printk_ringbuffer.c
> > @@ -454,11 +454,12 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
> >  	struct printk_info *info = to_info(desc_ring, id);
> >  	struct prb_desc *desc = to_desc(desc_ring, id);
> >  	atomic_long_t *state_var = &desc->state_var;
> > +	unsigned long state_val, state_val_check;
> >  	enum desc_state d_state;
> > -	unsigned long state_val;
> >  
> >  	/* Check the descriptor state. */
> >  	state_val = atomic_long_read(state_var); /* LMM(desc_read:A) */
> > +try_again:
> >  	d_state = get_desc_state(id, state_val);
> >  	if (d_state == desc_miss || d_state == desc_reserved) {
> >  		/*
> > @@ -466,7 +467,9 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
> >  		 * @state_var so that the caller can see the details of
> >  		 * the inconsistent state.
> >  		 */
> > -		goto out;
> > +		if (desc_out)
> > +			atomic_long_set(&desc_out->state_var, state_val);
> > +		return d_state;
> 
> I would keep the "goto out" so that the success scenario can also set
> the state_var of @desc_out.

You are right.

> >  	}
> >  
> >  	/*
> > @@ -542,14 +545,19 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
> >  	smp_rmb(); /* LMM(desc_read:D) */
> >  
> >  	/*
> > -	 * The data has been copied. Return the current descriptor state,
> > -	 * which may have changed since the load above.
> > +	 * The data has been copied. Make sure that they match the given @id
> > +	 * and @d_state read at the beginning. Re-read them otherwise.
> > +	 *
> > +	 * Note that this simplifies the logic. Especially, the reader API
> > +	 * expects that all values are stable when the record is finalized.
> > +	 * But they are stable only when it stays finalized all the time.
> >  	 */
> > -	state_val = atomic_long_read(state_var); /* LMM(desc_read:E) */
> > -	d_state = get_desc_state(id, state_val);
> > -out:
> > -	if (desc_out)
> > -		atomic_long_set(&desc_out->state_var, state_val);
> > +	state_val_check = atomic_long_read(state_var); /* LMM(desc_read:E) */
> > +	if (state_val_check != state_val) {
> > +		state_val = state_val_check;
> > +		goto try_again;
> > +	}
> > +
> 
> The out label and code could stay as it was. Or is there some reason you
> do not want to set the state_var in the success case?

It was a mistake. I was confused by the following code:

	/*
	 * Copy the descriptor data. The data is not valid until the
	 * state has been re-checked. A memcpy() for all of @desc
	 * cannot be used because of the atomic_t @state_var field.
	 */
	if (desc_out) {
		memcpy(&desc_out->text_blk_lpos, &desc->text_blk_lpos,
		       sizeof(desc_out->text_blk_lpos)); /* LMM(desc_read:C) */
	}

I somehow assumed that the memcpy copied everything. And we will not
longer need to update state_var when state_val did not change.

But the code copies only text_blk_lpos.

/o\

I am going to send a proper patch where this will be fixed.

> out:
> 	if (desc_out)
> 		atomic_long_set(&desc_out->state_var, state_val);
> 
> >  	return d_state;
> >  }
> 
> We should probably annotate the memcpy() with a wrapping
> data_race(). Sadly we were aware of and expected the data race, but did
> not realize that commit 4cfc7258f876 ("printk: ringbuffer: add
> finalization/extension support") made the data race a problem.

Yup. I'll add the data_race() as well.

Thanks a lot for review.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-03-19 11:17 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-11  3:31 KCSAN: data-race in desc_read / prb_reserve_in_last Jianzhou Zhao
2026-03-11 13:27 ` Petr Mladek
2026-03-12 11:17   ` Petr Mladek
2026-03-13 16:31     ` John Ogness
2026-03-19 11:17       ` Petr Mladek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox