Linux block layer
 help / color / mirror / Atom feed
* [PATCH v5] block: Remove queue freezing from several sysfs store callbacks
@ 2025-11-10 16:24 Bart Van Assche
  2025-11-11  1:18 ` Ming Lei
  2025-11-11  6:25 ` Nilay Shroff
  0 siblings, 2 replies; 5+ messages in thread
From: Bart Van Assche @ 2025-11-10 16:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Ming Lei,
	Nilay Shroff, Martin Wilck, Benjamin Marzinski, stable,
	Damien Le Moal, Chaitanya Kulkarni, Hannes Reinecke

Freezing the request queue from inside sysfs store callbacks may cause a
deadlock in combination with the dm-multipath driver and the
queue_if_no_path option. Additionally, freezing the request queue slows
down system boot on systems where sysfs attributes are set synchronously.

Fix this by removing the blk_mq_freeze_queue() / blk_mq_unfreeze_queue()
calls from the store callbacks that do not strictly need these callbacks.
This patch may cause a small delay in applying the new settings.

This patch affects the following sysfs attributes:
* io_poll_delay
* io_timeout
* nomerges
* read_ahead_kb
* rq_affinity

Here is an example of a deadlock triggered by running test srp/002:

task:multipathd
Call Trace:
 <TASK>
 __schedule+0x8c1/0x1bf0
 schedule+0xdd/0x270
 schedule_preempt_disabled+0x1c/0x30
 __mutex_lock+0xb89/0x1650
 mutex_lock_nested+0x1f/0x30
 dm_table_set_restrictions+0x823/0xdf0
 __bind+0x166/0x590
 dm_swap_table+0x2a7/0x490
 do_resume+0x1b1/0x610
 dev_suspend+0x55/0x1a0
 ctl_ioctl+0x3a5/0x7e0
 dm_ctl_ioctl+0x12/0x20
 __x64_sys_ioctl+0x127/0x1a0
 x64_sys_call+0xe2b/0x17d0
 do_syscall_64+0x96/0x3a0
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
 </TASK>
task:(udev-worker)
Call Trace:
 <TASK>
 __schedule+0x8c1/0x1bf0
 schedule+0xdd/0x270
 blk_mq_freeze_queue_wait+0xf2/0x140
 blk_mq_freeze_queue_nomemsave+0x23/0x30
 queue_ra_store+0x14e/0x290
 queue_attr_store+0x23e/0x2c0
 sysfs_kf_write+0xde/0x140
 kernfs_fop_write_iter+0x3b2/0x630
 vfs_write+0x4fd/0x1390
 ksys_write+0xfd/0x230
 __x64_sys_write+0x76/0xc0
 x64_sys_call+0x276/0x17d0
 do_syscall_64+0x96/0x3a0
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
 </TASK>

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Nilay Shroff <nilay@linux.ibm.com>
Cc: Martin Wilck <mwilck@suse.com>
Cc: Benjamin Marzinski <bmarzins@redhat.com>
Cc: stable@vger.kernel.org
Fixes: af2814149883 ("block: freeze the queue in queue_attr_store")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---

Changes compared to v4:
 - Use WRITE_ONCE() to update bdi->ra_pages.
 - Move a data_race() annotation from queue_io_timeout_store() into
   blk_queue_rq_timeout().

Changes compared to v3:
 - Added two data_race() annotations.

Changes compared to v2:
 - Dropped the controversial patch "block: Restrict the duration of sysfs
   attribute changes".

Changes compared to v1:
 - Added patch "block: Restrict the duration of sysfs attribute changes".
 - Remove queue freezing from more sysfs callbacks.
 
 block/blk-settings.c |  9 ++++++++-
 block/blk-sysfs.c    | 30 ++++++++++++------------------
 2 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 78dfef117623..b5587cc535e5 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -23,7 +23,14 @@
 
 void blk_queue_rq_timeout(struct request_queue *q, unsigned int timeout)
 {
-	WRITE_ONCE(q->rq_timeout, timeout);
+	/*
+	 * Use WRITE_ONCE() to write q->rq_timeout once. Use data_race() to
+	 * suppress KCSAN race reports against the write below.
+	 */
+	data_race(({
+		WRITE_ONCE(q->rq_timeout, timeout);
+		0;
+	}));
 }
 EXPORT_SYMBOL_GPL(blk_queue_rq_timeout);
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 76c47fe9b8d6..99e78d907c1c 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -143,21 +143,26 @@ queue_ra_store(struct gendisk *disk, const char *page, size_t count)
 {
 	unsigned long ra_kb;
 	ssize_t ret;
-	unsigned int memflags;
 	struct request_queue *q = disk->queue;
 
 	ret = queue_var_store(&ra_kb, page, count);
 	if (ret < 0)
 		return ret;
 	/*
-	 * ->ra_pages is protected by ->limits_lock because it is usually
-	 * calculated from the queue limits by queue_limits_commit_update.
+	 * The ->ra_pages change below is protected by ->limits_lock because it
+	 * is usually calculated from the queue limits by
+	 * queue_limits_commit_update().
+	 *
+	 * bdi->ra_pages reads are not serialized against bdi->ra_pages writes.
+	 * Use WRITE_ONCE() to write bdi->ra_pages once. Use data_race() to
+	 * suppress KCSAN race reports against the write below.
 	 */
 	mutex_lock(&q->limits_lock);
-	memflags = blk_mq_freeze_queue(q);
-	disk->bdi->ra_pages = ra_kb >> (PAGE_SHIFT - 10);
+	data_race(({
+		WRITE_ONCE(disk->bdi->ra_pages, ra_kb >> (PAGE_SHIFT - 10));
+		0;
+	}));
 	mutex_unlock(&q->limits_lock);
-	blk_mq_unfreeze_queue(q, memflags);
 
 	return ret;
 }
@@ -375,21 +380,18 @@ static ssize_t queue_nomerges_store(struct gendisk *disk, const char *page,
 				    size_t count)
 {
 	unsigned long nm;
-	unsigned int memflags;
 	struct request_queue *q = disk->queue;
 	ssize_t ret = queue_var_store(&nm, page, count);
 
 	if (ret < 0)
 		return ret;
 
-	memflags = blk_mq_freeze_queue(q);
 	blk_queue_flag_clear(QUEUE_FLAG_NOMERGES, q);
 	blk_queue_flag_clear(QUEUE_FLAG_NOXMERGES, q);
 	if (nm == 2)
 		blk_queue_flag_set(QUEUE_FLAG_NOMERGES, q);
 	else if (nm)
 		blk_queue_flag_set(QUEUE_FLAG_NOXMERGES, q);
-	blk_mq_unfreeze_queue(q, memflags);
 
 	return ret;
 }
@@ -409,7 +411,6 @@ queue_rq_affinity_store(struct gendisk *disk, const char *page, size_t count)
 #ifdef CONFIG_SMP
 	struct request_queue *q = disk->queue;
 	unsigned long val;
-	unsigned int memflags;
 
 	ret = queue_var_store(&val, page, count);
 	if (ret < 0)
@@ -421,7 +422,6 @@ queue_rq_affinity_store(struct gendisk *disk, const char *page, size_t count)
 	 * are accessed individually using atomic test_bit operation. So we
 	 * don't grab any lock while updating these flags.
 	 */
-	memflags = blk_mq_freeze_queue(q);
 	if (val == 2) {
 		blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
 		blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, q);
@@ -432,7 +432,6 @@ queue_rq_affinity_store(struct gendisk *disk, const char *page, size_t count)
 		blk_queue_flag_clear(QUEUE_FLAG_SAME_COMP, q);
 		blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q);
 	}
-	blk_mq_unfreeze_queue(q, memflags);
 #endif
 	return ret;
 }
@@ -446,11 +445,9 @@ static ssize_t queue_poll_delay_store(struct gendisk *disk, const char *page,
 static ssize_t queue_poll_store(struct gendisk *disk, const char *page,
 				size_t count)
 {
-	unsigned int memflags;
 	ssize_t ret = count;
 	struct request_queue *q = disk->queue;
 
-	memflags = blk_mq_freeze_queue(q);
 	if (!(q->limits.features & BLK_FEAT_POLL)) {
 		ret = -EINVAL;
 		goto out;
@@ -459,7 +456,6 @@ static ssize_t queue_poll_store(struct gendisk *disk, const char *page,
 	pr_info_ratelimited("writes to the poll attribute are ignored.\n");
 	pr_info_ratelimited("please use driver specific parameters instead.\n");
 out:
-	blk_mq_unfreeze_queue(q, memflags);
 	return ret;
 }
 
@@ -472,7 +468,7 @@ static ssize_t queue_io_timeout_show(struct gendisk *disk, char *page)
 static ssize_t queue_io_timeout_store(struct gendisk *disk, const char *page,
 				  size_t count)
 {
-	unsigned int val, memflags;
+	unsigned int val;
 	int err;
 	struct request_queue *q = disk->queue;
 
@@ -480,9 +476,7 @@ static ssize_t queue_io_timeout_store(struct gendisk *disk, const char *page,
 	if (err || val == 0)
 		return -EINVAL;
 
-	memflags = blk_mq_freeze_queue(q);
 	blk_queue_rq_timeout(q, msecs_to_jiffies(val));
-	blk_mq_unfreeze_queue(q, memflags);
 
 	return count;
 }

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH v5] block: Remove queue freezing from several sysfs store callbacks
  2025-11-10 16:24 [PATCH v5] block: Remove queue freezing from several sysfs store callbacks Bart Van Assche
@ 2025-11-11  1:18 ` Ming Lei
  2025-11-11  6:25 ` Nilay Shroff
  1 sibling, 0 replies; 5+ messages in thread
From: Ming Lei @ 2025-11-11  1:18 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Nilay Shroff,
	Martin Wilck, Benjamin Marzinski, stable, Damien Le Moal,
	Chaitanya Kulkarni, Hannes Reinecke

On Mon, Nov 10, 2025 at 08:24:18AM -0800, Bart Van Assche wrote:
> Freezing the request queue from inside sysfs store callbacks may cause a
> deadlock in combination with the dm-multipath driver and the
> queue_if_no_path option. Additionally, freezing the request queue slows
> down system boot on systems where sysfs attributes are set synchronously.
> 
> Fix this by removing the blk_mq_freeze_queue() / blk_mq_unfreeze_queue()
> calls from the store callbacks that do not strictly need these callbacks.
> This patch may cause a small delay in applying the new settings.
> 
> This patch affects the following sysfs attributes:
> * io_poll_delay
> * io_timeout
> * nomerges
> * read_ahead_kb
> * rq_affinity

I'd suggest to add words why freeze isn't needed, such as:

```
Intermediate value isn't possible, and either the old or new value is just fine to take
in io fast path.
```

otherwise this patch is good for me.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v5] block: Remove queue freezing from several sysfs store callbacks
  2025-11-10 16:24 [PATCH v5] block: Remove queue freezing from several sysfs store callbacks Bart Van Assche
  2025-11-11  1:18 ` Ming Lei
@ 2025-11-11  6:25 ` Nilay Shroff
  2025-11-11 19:52   ` Bart Van Assche
  1 sibling, 1 reply; 5+ messages in thread
From: Nilay Shroff @ 2025-11-11  6:25 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Ming Lei, Martin Wilck,
	Benjamin Marzinski, stable, Damien Le Moal, Chaitanya Kulkarni,
	Hannes Reinecke



On 11/10/25 9:54 PM, Bart Van Assche wrote:
> Freezing the request queue from inside sysfs store callbacks may cause a
> deadlock in combination with the dm-multipath driver and the
> queue_if_no_path option. Additionally, freezing the request queue slows
> down system boot on systems where sysfs attributes are set synchronously.
> 
> Fix this by removing the blk_mq_freeze_queue() / blk_mq_unfreeze_queue()
> calls from the store callbacks that do not strictly need these callbacks.
> This patch may cause a small delay in applying the new settings.
> 
> This patch affects the following sysfs attributes:
> * io_poll_delay
> * io_timeout
> * nomerges
> * read_ahead_kb
> * rq_affinity

I applied your patch on my linux tree and ran some tests. And as I earlier 
suspected, I found the following race from KCSAN:

1. Running q->io_timeout update concurrently while running I/O:

 BUG: KCSAN: data-race in blk_add_timer+0x64/0x1cc
 
 race at unknown origin, with read to 0xc000000142575fa8 of 4 bytes by task 5528 on cpu 49:
 blk_add_timer+0x64/0x1cc
 blk_mq_start_request+0xd0/0x534
 nvme_prep_rq.part.0+0x55c/0x1940 [nvme]
 nvme_queue_rqs+0x1e0/0x4c4 [nvme]
 blk_mq_dispatch_queue_requests+0x4f0/0x8c0
 blk_mq_flush_plug_list+0xb4/0x3cc
 __blk_flush_plug+0x294/0x358
 __submit_bio+0x308/0x3a4
 submit_bio_noacct_nocheck+0x5e4/0x92c
 submit_bio_noacct+0x3a4/0xec8
 submit_bio+0x70/0x2f0
 submit_bio_wait+0xc8/0x180
 __blkdev_direct_IO_simple+0x254/0x4a8
 blkdev_direct_IO+0x6d4/0x1000
 blkdev_write_iter+0x50c/0x658
 vfs_write+0x678/0x874
 sys_pwrite64+0x130/0x16c
 system_call_exception+0x1a0/0x500
 system_call_vectored_common+0x15c/0x2ec

value changed: 0x000005dc -> 0x00000bb8

Reported by Kernel Concurrency Sanitizer on:
CPU: 49 UID: 0 PID: 5528 Comm: fio Kdump: loaded Not tainted 6.18.0-rc4+ #65 VOLUNTARY 
Hardware name: IBM,9105-22A Power11 (architected) 0x820200 0xf000007 of:IBM,FW1120.00 (RB1120_115) hv:phyp pSeries

2. Updating ->ra_pages while it's also being simultaneously accessed:

 BUG: KCSAN: data-race in queue_ra_show / read_ahead_kb_store
 
 write to 0xc00000000c107030 of 8 bytes by task 97376 on cpu 19:
  read_ahead_kb_store+0x84/0xcc
  dev_attr_store+0x70/0x9c
  sysfs_kf_write+0xbc/0x110
  kernfs_fop_write_iter+0x284/0x3b4
  vfs_write+0x678/0x874
  ksys_write+0xb0/0x1ac
  sys_write+0x68/0x84
  system_call_exception+0x1a0/0x500
  system_call_vectored_common+0x15c/0x2ec
 
 read to 0xc00000000c107030 of 8 bytes by task 167534 on cpu 33:
  queue_ra_show+0x70/0xd4
  queue_attr_show+0x90/0x170
  sysfs_kf_seq_show+0x144/0x28c
  kernfs_seq_show+0xbc/0xe0
  seq_read_iter+0x384/0xb4c
  kernfs_fop_read_iter+0x308/0x418
  vfs_read+0x420/0x5ac
  ksys_read+0xb0/0x1b0
  sys_read+0x68/0x84
  system_call_exception+0x1a0/0x500
  system_call_vectored_common+0x15c/0x2ec
 
 value changed: 0x0000000000000004 -> 0x0000000000000010
 
 Reported by Kernel Concurrency Sanitizer on:
 CPU: 33 UID: 0 PID: 167534 Comm: cat Kdump: loaded Not tainted 6.18.0-rc4+ #65 VOLUNTARY 
 Hardware name: IBM,9105-22A Power11 (architected) 0x820200 0xf000007 of:IBM,FW1120.00 (RB1120_115) hv:phyp pSeries


So from the above trace it seems obvious that we need to mark both
writers and readers to avoid potential race.

Thanks,
--Nilay


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v5] block: Remove queue freezing from several sysfs store callbacks
  2025-11-11  6:25 ` Nilay Shroff
@ 2025-11-11 19:52   ` Bart Van Assche
  2025-11-13 13:25     ` Nilay Shroff
  0 siblings, 1 reply; 5+ messages in thread
From: Bart Van Assche @ 2025-11-11 19:52 UTC (permalink / raw)
  To: Nilay Shroff, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Ming Lei, Martin Wilck,
	Benjamin Marzinski, stable, Damien Le Moal, Chaitanya Kulkarni,
	Hannes Reinecke

On 11/10/25 10:25 PM, Nilay Shroff wrote:
> I applied your patch on my linux tree and ran some tests. And as I earlier
> suspected, I found the following race from KCSAN:
> 
> [ ... ] 

Thank you for having run these tests. It's unfortunate that I couldn't
trigger these KCSAN complaints in my tests with KCSAN enabled in the
kernel configuration.
> So from the above trace it seems obvious that we need to mark both
> writers and readers to avoid potential race.

That would be an intrusive change. I don't think that the kernel
maintainers would agree with marking all rq_timeout and all ra_pages
reads with READ_ONCE(). I propose to annotate both the rq_timeout and
ra_pages data members with __data_racy to suppress these KCSAN reports.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v5] block: Remove queue freezing from several sysfs store callbacks
  2025-11-11 19:52   ` Bart Van Assche
@ 2025-11-13 13:25     ` Nilay Shroff
  0 siblings, 0 replies; 5+ messages in thread
From: Nilay Shroff @ 2025-11-13 13:25 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Ming Lei, Martin Wilck,
	Benjamin Marzinski, stable, Damien Le Moal, Chaitanya Kulkarni,
	Hannes Reinecke



On 11/12/25 1:22 AM, Bart Van Assche wrote:
> On 11/10/25 10:25 PM, Nilay Shroff wrote:
>> I applied your patch on my linux tree and ran some tests. And as I earlier
>> suspected, I found the following race from KCSAN:
>>
>> [ ... ] 
> 
> Thank you for having run these tests. It's unfortunate that I couldn't
> trigger these KCSAN complaints in my tests with KCSAN enabled in the
> kernel configuration.
>> So from the above trace it seems obvious that we need to mark both
>> writers and readers to avoid potential race.
> 
> That would be an intrusive change. I don't think that the kernel
> maintainers would agree with marking all rq_timeout and all ra_pages
> reads with READ_ONCE(). I propose to annotate both the rq_timeout and
> ra_pages data members with __data_racy to suppress these KCSAN reports.
> 
Yes, that should also work. I validated the use of __data_racy on my test kernel.

However, while compiling the kernel with __data_racy applied to q->rq_timeout,
I encountered a build failure. After some investigation, I found that the issue
occurred because my kernel configuration had CONFIG_DEBUG_INFO_BTF enabled. 
During the build, when the compiler attempted to generate BTF types from the
vmlinux.unstripped binary, it failed. 

Mu guess is that some compilation units have KCSAN disabled, in which case
the pre-processor expands __data_racy to nothing. In other units where KCSAN
is enabled, __data_racy expands to the volatile qualifier. As a result, BTF
resolver encountered two versions of struct request_queue: one where q->rq_timeout
was declared with the volatile keyword and another where it was declared without it.
This type mismatch caused resolver to fail during the BTF extraction phase.

Yes this is something not related to block layer and has to fixed by 
KCSAN/eBPF developers.

Thanks,
--Nilay

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-11-13 13:25 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-10 16:24 [PATCH v5] block: Remove queue freezing from several sysfs store callbacks Bart Van Assche
2025-11-11  1:18 ` Ming Lei
2025-11-11  6:25 ` Nilay Shroff
2025-11-11 19:52   ` Bart Van Assche
2025-11-13 13:25     ` Nilay Shroff

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox