* [PATCH v5] block: Remove queue freezing from several sysfs store callbacks
@ 2025-11-10 16:24 Bart Van Assche
2025-11-11 1:18 ` Ming Lei
2025-11-11 6:25 ` Nilay Shroff
0 siblings, 2 replies; 5+ messages in thread
From: Bart Van Assche @ 2025-11-10 16:24 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Bart Van Assche, Ming Lei,
Nilay Shroff, Martin Wilck, Benjamin Marzinski, stable,
Damien Le Moal, Chaitanya Kulkarni, Hannes Reinecke
Freezing the request queue from inside sysfs store callbacks may cause a
deadlock in combination with the dm-multipath driver and the
queue_if_no_path option. Additionally, freezing the request queue slows
down system boot on systems where sysfs attributes are set synchronously.
Fix this by removing the blk_mq_freeze_queue() / blk_mq_unfreeze_queue()
calls from the store callbacks that do not strictly need these callbacks.
This patch may cause a small delay in applying the new settings.
This patch affects the following sysfs attributes:
* io_poll_delay
* io_timeout
* nomerges
* read_ahead_kb
* rq_affinity
Here is an example of a deadlock triggered by running test srp/002:
task:multipathd
Call Trace:
<TASK>
__schedule+0x8c1/0x1bf0
schedule+0xdd/0x270
schedule_preempt_disabled+0x1c/0x30
__mutex_lock+0xb89/0x1650
mutex_lock_nested+0x1f/0x30
dm_table_set_restrictions+0x823/0xdf0
__bind+0x166/0x590
dm_swap_table+0x2a7/0x490
do_resume+0x1b1/0x610
dev_suspend+0x55/0x1a0
ctl_ioctl+0x3a5/0x7e0
dm_ctl_ioctl+0x12/0x20
__x64_sys_ioctl+0x127/0x1a0
x64_sys_call+0xe2b/0x17d0
do_syscall_64+0x96/0x3a0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
task:(udev-worker)
Call Trace:
<TASK>
__schedule+0x8c1/0x1bf0
schedule+0xdd/0x270
blk_mq_freeze_queue_wait+0xf2/0x140
blk_mq_freeze_queue_nomemsave+0x23/0x30
queue_ra_store+0x14e/0x290
queue_attr_store+0x23e/0x2c0
sysfs_kf_write+0xde/0x140
kernfs_fop_write_iter+0x3b2/0x630
vfs_write+0x4fd/0x1390
ksys_write+0xfd/0x230
__x64_sys_write+0x76/0xc0
x64_sys_call+0x276/0x17d0
do_syscall_64+0x96/0x3a0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Nilay Shroff <nilay@linux.ibm.com>
Cc: Martin Wilck <mwilck@suse.com>
Cc: Benjamin Marzinski <bmarzins@redhat.com>
Cc: stable@vger.kernel.org
Fixes: af2814149883 ("block: freeze the queue in queue_attr_store")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
Changes compared to v4:
- Use WRITE_ONCE() to update bdi->ra_pages.
- Move a data_race() annotation from queue_io_timeout_store() into
blk_queue_rq_timeout().
Changes compared to v3:
- Added two data_race() annotations.
Changes compared to v2:
- Dropped the controversial patch "block: Restrict the duration of sysfs
attribute changes".
Changes compared to v1:
- Added patch "block: Restrict the duration of sysfs attribute changes".
- Remove queue freezing from more sysfs callbacks.
block/blk-settings.c | 9 ++++++++-
block/blk-sysfs.c | 30 ++++++++++++------------------
2 files changed, 20 insertions(+), 19 deletions(-)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 78dfef117623..b5587cc535e5 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -23,7 +23,14 @@
void blk_queue_rq_timeout(struct request_queue *q, unsigned int timeout)
{
- WRITE_ONCE(q->rq_timeout, timeout);
+ /*
+ * Use WRITE_ONCE() to write q->rq_timeout once. Use data_race() to
+ * suppress KCSAN race reports against the write below.
+ */
+ data_race(({
+ WRITE_ONCE(q->rq_timeout, timeout);
+ 0;
+ }));
}
EXPORT_SYMBOL_GPL(blk_queue_rq_timeout);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 76c47fe9b8d6..99e78d907c1c 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -143,21 +143,26 @@ queue_ra_store(struct gendisk *disk, const char *page, size_t count)
{
unsigned long ra_kb;
ssize_t ret;
- unsigned int memflags;
struct request_queue *q = disk->queue;
ret = queue_var_store(&ra_kb, page, count);
if (ret < 0)
return ret;
/*
- * ->ra_pages is protected by ->limits_lock because it is usually
- * calculated from the queue limits by queue_limits_commit_update.
+ * The ->ra_pages change below is protected by ->limits_lock because it
+ * is usually calculated from the queue limits by
+ * queue_limits_commit_update().
+ *
+ * bdi->ra_pages reads are not serialized against bdi->ra_pages writes.
+ * Use WRITE_ONCE() to write bdi->ra_pages once. Use data_race() to
+ * suppress KCSAN race reports against the write below.
*/
mutex_lock(&q->limits_lock);
- memflags = blk_mq_freeze_queue(q);
- disk->bdi->ra_pages = ra_kb >> (PAGE_SHIFT - 10);
+ data_race(({
+ WRITE_ONCE(disk->bdi->ra_pages, ra_kb >> (PAGE_SHIFT - 10));
+ 0;
+ }));
mutex_unlock(&q->limits_lock);
- blk_mq_unfreeze_queue(q, memflags);
return ret;
}
@@ -375,21 +380,18 @@ static ssize_t queue_nomerges_store(struct gendisk *disk, const char *page,
size_t count)
{
unsigned long nm;
- unsigned int memflags;
struct request_queue *q = disk->queue;
ssize_t ret = queue_var_store(&nm, page, count);
if (ret < 0)
return ret;
- memflags = blk_mq_freeze_queue(q);
blk_queue_flag_clear(QUEUE_FLAG_NOMERGES, q);
blk_queue_flag_clear(QUEUE_FLAG_NOXMERGES, q);
if (nm == 2)
blk_queue_flag_set(QUEUE_FLAG_NOMERGES, q);
else if (nm)
blk_queue_flag_set(QUEUE_FLAG_NOXMERGES, q);
- blk_mq_unfreeze_queue(q, memflags);
return ret;
}
@@ -409,7 +411,6 @@ queue_rq_affinity_store(struct gendisk *disk, const char *page, size_t count)
#ifdef CONFIG_SMP
struct request_queue *q = disk->queue;
unsigned long val;
- unsigned int memflags;
ret = queue_var_store(&val, page, count);
if (ret < 0)
@@ -421,7 +422,6 @@ queue_rq_affinity_store(struct gendisk *disk, const char *page, size_t count)
* are accessed individually using atomic test_bit operation. So we
* don't grab any lock while updating these flags.
*/
- memflags = blk_mq_freeze_queue(q);
if (val == 2) {
blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, q);
@@ -432,7 +432,6 @@ queue_rq_affinity_store(struct gendisk *disk, const char *page, size_t count)
blk_queue_flag_clear(QUEUE_FLAG_SAME_COMP, q);
blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q);
}
- blk_mq_unfreeze_queue(q, memflags);
#endif
return ret;
}
@@ -446,11 +445,9 @@ static ssize_t queue_poll_delay_store(struct gendisk *disk, const char *page,
static ssize_t queue_poll_store(struct gendisk *disk, const char *page,
size_t count)
{
- unsigned int memflags;
ssize_t ret = count;
struct request_queue *q = disk->queue;
- memflags = blk_mq_freeze_queue(q);
if (!(q->limits.features & BLK_FEAT_POLL)) {
ret = -EINVAL;
goto out;
@@ -459,7 +456,6 @@ static ssize_t queue_poll_store(struct gendisk *disk, const char *page,
pr_info_ratelimited("writes to the poll attribute are ignored.\n");
pr_info_ratelimited("please use driver specific parameters instead.\n");
out:
- blk_mq_unfreeze_queue(q, memflags);
return ret;
}
@@ -472,7 +468,7 @@ static ssize_t queue_io_timeout_show(struct gendisk *disk, char *page)
static ssize_t queue_io_timeout_store(struct gendisk *disk, const char *page,
size_t count)
{
- unsigned int val, memflags;
+ unsigned int val;
int err;
struct request_queue *q = disk->queue;
@@ -480,9 +476,7 @@ static ssize_t queue_io_timeout_store(struct gendisk *disk, const char *page,
if (err || val == 0)
return -EINVAL;
- memflags = blk_mq_freeze_queue(q);
blk_queue_rq_timeout(q, msecs_to_jiffies(val));
- blk_mq_unfreeze_queue(q, memflags);
return count;
}
^ permalink raw reply related [flat|nested] 5+ messages in thread* Re: [PATCH v5] block: Remove queue freezing from several sysfs store callbacks
2025-11-10 16:24 [PATCH v5] block: Remove queue freezing from several sysfs store callbacks Bart Van Assche
@ 2025-11-11 1:18 ` Ming Lei
2025-11-11 6:25 ` Nilay Shroff
1 sibling, 0 replies; 5+ messages in thread
From: Ming Lei @ 2025-11-11 1:18 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Christoph Hellwig, Nilay Shroff,
Martin Wilck, Benjamin Marzinski, stable, Damien Le Moal,
Chaitanya Kulkarni, Hannes Reinecke
On Mon, Nov 10, 2025 at 08:24:18AM -0800, Bart Van Assche wrote:
> Freezing the request queue from inside sysfs store callbacks may cause a
> deadlock in combination with the dm-multipath driver and the
> queue_if_no_path option. Additionally, freezing the request queue slows
> down system boot on systems where sysfs attributes are set synchronously.
>
> Fix this by removing the blk_mq_freeze_queue() / blk_mq_unfreeze_queue()
> calls from the store callbacks that do not strictly need these callbacks.
> This patch may cause a small delay in applying the new settings.
>
> This patch affects the following sysfs attributes:
> * io_poll_delay
> * io_timeout
> * nomerges
> * read_ahead_kb
> * rq_affinity
I'd suggest to add words why freeze isn't needed, such as:
```
Intermediate value isn't possible, and either the old or new value is just fine to take
in io fast path.
```
otherwise this patch is good for me.
Thanks,
Ming
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v5] block: Remove queue freezing from several sysfs store callbacks
2025-11-10 16:24 [PATCH v5] block: Remove queue freezing from several sysfs store callbacks Bart Van Assche
2025-11-11 1:18 ` Ming Lei
@ 2025-11-11 6:25 ` Nilay Shroff
2025-11-11 19:52 ` Bart Van Assche
1 sibling, 1 reply; 5+ messages in thread
From: Nilay Shroff @ 2025-11-11 6:25 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Ming Lei, Martin Wilck,
Benjamin Marzinski, stable, Damien Le Moal, Chaitanya Kulkarni,
Hannes Reinecke
On 11/10/25 9:54 PM, Bart Van Assche wrote:
> Freezing the request queue from inside sysfs store callbacks may cause a
> deadlock in combination with the dm-multipath driver and the
> queue_if_no_path option. Additionally, freezing the request queue slows
> down system boot on systems where sysfs attributes are set synchronously.
>
> Fix this by removing the blk_mq_freeze_queue() / blk_mq_unfreeze_queue()
> calls from the store callbacks that do not strictly need these callbacks.
> This patch may cause a small delay in applying the new settings.
>
> This patch affects the following sysfs attributes:
> * io_poll_delay
> * io_timeout
> * nomerges
> * read_ahead_kb
> * rq_affinity
I applied your patch on my linux tree and ran some tests. And as I earlier
suspected, I found the following race from KCSAN:
1. Running q->io_timeout update concurrently while running I/O:
BUG: KCSAN: data-race in blk_add_timer+0x64/0x1cc
race at unknown origin, with read to 0xc000000142575fa8 of 4 bytes by task 5528 on cpu 49:
blk_add_timer+0x64/0x1cc
blk_mq_start_request+0xd0/0x534
nvme_prep_rq.part.0+0x55c/0x1940 [nvme]
nvme_queue_rqs+0x1e0/0x4c4 [nvme]
blk_mq_dispatch_queue_requests+0x4f0/0x8c0
blk_mq_flush_plug_list+0xb4/0x3cc
__blk_flush_plug+0x294/0x358
__submit_bio+0x308/0x3a4
submit_bio_noacct_nocheck+0x5e4/0x92c
submit_bio_noacct+0x3a4/0xec8
submit_bio+0x70/0x2f0
submit_bio_wait+0xc8/0x180
__blkdev_direct_IO_simple+0x254/0x4a8
blkdev_direct_IO+0x6d4/0x1000
blkdev_write_iter+0x50c/0x658
vfs_write+0x678/0x874
sys_pwrite64+0x130/0x16c
system_call_exception+0x1a0/0x500
system_call_vectored_common+0x15c/0x2ec
value changed: 0x000005dc -> 0x00000bb8
Reported by Kernel Concurrency Sanitizer on:
CPU: 49 UID: 0 PID: 5528 Comm: fio Kdump: loaded Not tainted 6.18.0-rc4+ #65 VOLUNTARY
Hardware name: IBM,9105-22A Power11 (architected) 0x820200 0xf000007 of:IBM,FW1120.00 (RB1120_115) hv:phyp pSeries
2. Updating ->ra_pages while it's also being simultaneously accessed:
BUG: KCSAN: data-race in queue_ra_show / read_ahead_kb_store
write to 0xc00000000c107030 of 8 bytes by task 97376 on cpu 19:
read_ahead_kb_store+0x84/0xcc
dev_attr_store+0x70/0x9c
sysfs_kf_write+0xbc/0x110
kernfs_fop_write_iter+0x284/0x3b4
vfs_write+0x678/0x874
ksys_write+0xb0/0x1ac
sys_write+0x68/0x84
system_call_exception+0x1a0/0x500
system_call_vectored_common+0x15c/0x2ec
read to 0xc00000000c107030 of 8 bytes by task 167534 on cpu 33:
queue_ra_show+0x70/0xd4
queue_attr_show+0x90/0x170
sysfs_kf_seq_show+0x144/0x28c
kernfs_seq_show+0xbc/0xe0
seq_read_iter+0x384/0xb4c
kernfs_fop_read_iter+0x308/0x418
vfs_read+0x420/0x5ac
ksys_read+0xb0/0x1b0
sys_read+0x68/0x84
system_call_exception+0x1a0/0x500
system_call_vectored_common+0x15c/0x2ec
value changed: 0x0000000000000004 -> 0x0000000000000010
Reported by Kernel Concurrency Sanitizer on:
CPU: 33 UID: 0 PID: 167534 Comm: cat Kdump: loaded Not tainted 6.18.0-rc4+ #65 VOLUNTARY
Hardware name: IBM,9105-22A Power11 (architected) 0x820200 0xf000007 of:IBM,FW1120.00 (RB1120_115) hv:phyp pSeries
So from the above trace it seems obvious that we need to mark both
writers and readers to avoid potential race.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v5] block: Remove queue freezing from several sysfs store callbacks
2025-11-11 6:25 ` Nilay Shroff
@ 2025-11-11 19:52 ` Bart Van Assche
2025-11-13 13:25 ` Nilay Shroff
0 siblings, 1 reply; 5+ messages in thread
From: Bart Van Assche @ 2025-11-11 19:52 UTC (permalink / raw)
To: Nilay Shroff, Jens Axboe
Cc: linux-block, Christoph Hellwig, Ming Lei, Martin Wilck,
Benjamin Marzinski, stable, Damien Le Moal, Chaitanya Kulkarni,
Hannes Reinecke
On 11/10/25 10:25 PM, Nilay Shroff wrote:
> I applied your patch on my linux tree and ran some tests. And as I earlier
> suspected, I found the following race from KCSAN:
>
> [ ... ]
Thank you for having run these tests. It's unfortunate that I couldn't
trigger these KCSAN complaints in my tests with KCSAN enabled in the
kernel configuration.
> So from the above trace it seems obvious that we need to mark both
> writers and readers to avoid potential race.
That would be an intrusive change. I don't think that the kernel
maintainers would agree with marking all rq_timeout and all ra_pages
reads with READ_ONCE(). I propose to annotate both the rq_timeout and
ra_pages data members with __data_racy to suppress these KCSAN reports.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v5] block: Remove queue freezing from several sysfs store callbacks
2025-11-11 19:52 ` Bart Van Assche
@ 2025-11-13 13:25 ` Nilay Shroff
0 siblings, 0 replies; 5+ messages in thread
From: Nilay Shroff @ 2025-11-13 13:25 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Ming Lei, Martin Wilck,
Benjamin Marzinski, stable, Damien Le Moal, Chaitanya Kulkarni,
Hannes Reinecke
On 11/12/25 1:22 AM, Bart Van Assche wrote:
> On 11/10/25 10:25 PM, Nilay Shroff wrote:
>> I applied your patch on my linux tree and ran some tests. And as I earlier
>> suspected, I found the following race from KCSAN:
>>
>> [ ... ]
>
> Thank you for having run these tests. It's unfortunate that I couldn't
> trigger these KCSAN complaints in my tests with KCSAN enabled in the
> kernel configuration.
>> So from the above trace it seems obvious that we need to mark both
>> writers and readers to avoid potential race.
>
> That would be an intrusive change. I don't think that the kernel
> maintainers would agree with marking all rq_timeout and all ra_pages
> reads with READ_ONCE(). I propose to annotate both the rq_timeout and
> ra_pages data members with __data_racy to suppress these KCSAN reports.
>
Yes, that should also work. I validated the use of __data_racy on my test kernel.
However, while compiling the kernel with __data_racy applied to q->rq_timeout,
I encountered a build failure. After some investigation, I found that the issue
occurred because my kernel configuration had CONFIG_DEBUG_INFO_BTF enabled.
During the build, when the compiler attempted to generate BTF types from the
vmlinux.unstripped binary, it failed.
Mu guess is that some compilation units have KCSAN disabled, in which case
the pre-processor expands __data_racy to nothing. In other units where KCSAN
is enabled, __data_racy expands to the volatile qualifier. As a result, BTF
resolver encountered two versions of struct request_queue: one where q->rq_timeout
was declared with the volatile keyword and another where it was declared without it.
This type mismatch caused resolver to fail during the BTF extraction phase.
Yes this is something not related to block layer and has to fixed by
KCSAN/eBPF developers.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2025-11-13 13:25 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-10 16:24 [PATCH v5] block: Remove queue freezing from several sysfs store callbacks Bart Van Assche
2025-11-11 1:18 ` Ming Lei
2025-11-11 6:25 ` Nilay Shroff
2025-11-11 19:52 ` Bart Van Assche
2025-11-13 13:25 ` Nilay Shroff
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox