[PATCH RFC 0/1] block: fix concurrent elevator change failure


* [PATCH RFC 0/1] block: fix concurrent elevator change failure
@ 2026-06-11  7:41 Shin'ichiro Kawasaki
  2026-06-11  7:42 ` [PATCH RFC 1/1] block: serialize whole elevator change steps for the same queue Shin'ichiro Kawasaki
  2026-06-11 11:22 ` [PATCH RFC 0/1] block: fix concurrent elevator change failure Ming Lei
  0 siblings, 2 replies; 9+ messages in thread
From: Shin'ichiro Kawasaki @ 2026-06-11  7:41 UTC (permalink / raw)
  To: linux-block, Jens Axboe; +Cc: Ming Lei, Nilay Shroff, Shin'ichiro Kawasaki

I observed that the blktests test case block/005 hangs on specific
server hardware with a specific HDD as the block device. During the test
case run, the kernel reported a KASAN null-ptr-deref (and other memory
corruption symptoms) [2]. The failure looked sporadic and
hardware-dependent.

From the kernel message, I noticed that udev-worker wrote to the
queue/scheduler sysfs attribute to change the IO scheduler, or elevator.
The test case block/005 also wrote to the same sysfs attribute, which
indicated that a concurrent elevator change caused the failure. I
created a new blktests test case that simply does the concurrent
elevator change with a null_blk device [1]. It recreates the failure in
a stable manner on various server hardware.
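
For readers without blktests at hand, the concurrent write can be
sketched in plain C as below. This is an illustration only, not the
test case in [1]: the helper names write_sched() and
race_sched_writes(), the loop count, and the choice of the "none" and
"mq-deadline" schedulers are assumptions.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write one scheduler name to the attribute file. Returns 0 on success. */
static int write_sched(const char *path, const char *name)
{
	size_t len = strlen(name);
	int fd = open(path, O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, name, len);
	close(fd);
	return n == (ssize_t)len ? 0 : -1;
}

/*
 * Fork two writers that each flip the attribute 'loops' times.
 * Errors from individual writes (e.g. EBUSY) are ignored on purpose:
 * only the overlap between the two writers matters here.
 * Returns the number of child processes reaped (2 if both ran).
 */
static int race_sched_writes(const char *path, int loops)
{
	static const char *const names[] = { "none", "mq-deadline" };
	int reaped = 0;

	for (int w = 0; w < 2; w++) {
		pid_t pid = fork();

		if (pid < 0)
			break;
		if (pid == 0) {
			/* The writers start with different names. */
			for (int i = 0; i < loops; i++)
				write_sched(path, names[(i + w) % 2]);
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		reaped++;
	return reaped;
}
```

A caller would pass the queue/scheduler attribute path of a disposable
test device, for example a null_blk device, and a loop count large
enough for the two writers to overlap.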

Using the new test case, I bisected and found that the failure first
appears at commit 370ac285f23a ("block: avoid cpu_hotplug_lock
depedency on freeze_lock") in the kernel tag v6.17-rc3. However, that
commit does not appear to explain the failure by itself: it changed the
queue freeze behavior and probably only exposed an existing race.
Looking back at the changes to elevator_change(), I think the actual
cause is commit 559dc11143eb ("block: move
elv_register[unregister]_queue out of elevator_lock") in the kernel tag
v6.16-rc1. That commit moved elevator_change_done() out of the region
guarded by ->elevator_lock and the queue freeze. As a result, when two
threads write to the same queue/scheduler attribute concurrently,
elevator_change_done() runs in parallel, causing the memory corruption
and the hang.

As a fix attempt, I created the patch in this series. It adds a new
mutex that serializes the whole elevator switch sequence, including the
elevator_change_done() call. I ran the reproducer with lockdep enabled
and confirmed that the patch avoids the failure and that no new WARN
was observed.
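
The locking shape can be modeled in userspace as below. This is a
sketch under stated assumptions, not kernel code: struct model_queue,
model_change(), model_change_done() and the counters are hypothetical
names, and pthread mutexes stand in for ->elevator_lock and the new
mutex. The model only counts how many writers are inside the "done"
step at once.

```c
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the relevant request_queue fields. */
struct model_queue {
	pthread_mutex_t elevator_lock;	/* models ->elevator_lock */
	pthread_mutex_t outer_lock;	/* models the mutex added by the patch */
	pthread_mutex_t stat_lock;	/* protects the two counters below */
	int in_done;			/* writers currently in the "done" step */
	int max_in_done;		/* highest overlap seen so far */
	bool serialize;			/* take outer_lock, as the patch does */
};

/* Models elevator_change_done(): runs outside elevator_lock. */
static void model_change_done(struct model_queue *q)
{
	pthread_mutex_lock(&q->stat_lock);
	if (++q->in_done > q->max_in_done)
		q->max_in_done = q->in_done;
	pthread_mutex_unlock(&q->stat_lock);

	sched_yield();	/* give the other writer a chance to overlap */

	pthread_mutex_lock(&q->stat_lock);
	q->in_done--;
	pthread_mutex_unlock(&q->stat_lock);
}

/* Models elevator_change(): core switch first, then the "done" step. */
static void model_change(struct model_queue *q)
{
	if (q->serialize)
		pthread_mutex_lock(&q->outer_lock);

	pthread_mutex_lock(&q->elevator_lock);
	/* the core elevator switch would happen here */
	pthread_mutex_unlock(&q->elevator_lock);

	model_change_done(q);

	if (q->serialize)
		pthread_mutex_unlock(&q->outer_lock);
}

static void *model_writer(void *arg)
{
	for (int i = 0; i < 1000; i++)
		model_change(arg);
	return NULL;
}

/*
 * Run two writers. Returns the highest overlap seen in the "done"
 * step, or -1 if the threads could not be created.
 */
static int model_run(bool serialize)
{
	struct model_queue q = { .serialize = serialize };
	pthread_t t[2];
	int started = 0;

	pthread_mutex_init(&q.elevator_lock, NULL);
	pthread_mutex_init(&q.outer_lock, NULL);
	pthread_mutex_init(&q.stat_lock, NULL);

	while (started < 2 &&
	       pthread_create(&t[started], NULL, model_writer, &q) == 0)
		started++;
	for (int i = 0; i < started; i++)
		pthread_join(t[i], NULL);

	pthread_mutex_destroy(&q.stat_lock);
	pthread_mutex_destroy(&q.outer_lock);
	pthread_mutex_destroy(&q.elevator_lock);
	return started == 2 ? q.max_in_done : -1;
}
```

With serialize set, the overlap can never exceed 1, because the "done"
step runs entirely inside the outer mutex. Without it, an overlap of 2
is possible but depends on scheduling, which matches the sporadic
nature of the failure.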

However, the fix patch adds a new lock, and I'm not sure if it is the best
solution. Comments on the patch, or suggestions for a better solution,
would be appreciated.

[1] https://github.com/kawasaki/blktests/commit/4f8c63ed7d049f5e9c935c3fe00142b2a3629826

[2]

[30102.760660] [ T186170] run blktests block/005 at 2026-05-11 05:53:53
[30104.969837] [ T186111] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI
[30104.983590] [ T186111] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[30104.992929] [ T186111] CPU: 2 UID: 0 PID: 186111 Comm: (udev-worker) Not tainted 7.1.0-rc2-kts+ #1 PREEMPT(lazy)
[30105.004019] [ T186111] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015
[30105.013216] [ T186111] RIP: 0010:blk_mq_debugfs_register_sched+0x46/0x210
[30105.020667] [ T186111] Code: 48 89 fa 48 c1 ea 03 48 83 ec 10 80 3c 02 00 0f 85 83 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 8b 6b 08 48 89 ea 48 c1 ea 03 <80> 3c 02 00 0f 85 57 01 00 00 48 c7 c0 24 a3 b3 97 48 8b 6d 00 48
[30105.041036] [ T186111] RSP: 0018:ffff88816b9c7708 EFLAGS: 00010246
[30105.048111] [ T186111] RAX: dffffc0000000000 RBX: ffff888117f18000 RCX: 0000000000000000
[30105.057097] [ T186111] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff888117f18008
[30105.066086] [ T186111] RBP: 0000000000000000 R08: ffffffff957c47ac R09: fffffbfff2f6633c
[30105.075083] [ T186111] R10: ffff88816b9c7730 R11: 0000000000000001 R12: ffff88814c1f2000
[30105.084088] [ T186111] R13: ffff88814c1f2018 R14: ffff8881b8a336ac R15: ffffffff95bfae30
[30105.093111] [ T186111] FS:  00007fc1c7970c40(0000) GS:ffff8887c534e000(0000) knlGS:0000000000000000
[30105.103093] [ T186111] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[30105.110751] [ T186111] CR2: 000055fa37e182c0 CR3: 0000000108350003 CR4: 00000000001726f0
[30105.119796] [ T186111] Call Trace:
[30105.124154] [ T186111]  <TASK>
[30105.128301] [ T186111]  blk_mq_sched_reg_debugfs+0x8d/0x1a0
[30105.134193] [ T186111]  elevator_change_done+0x2f2/0x610
[30105.140037] [ T186111]  ? __pfx_elevator_change_done+0x10/0x10
[30105.146409] [ T186111]  ? __pfx_sysfs_kf_write+0x10/0x10
[30105.152246] [ T186111]  ? __pfx_sysfs_kf_write+0x10/0x10
[30105.158189] [ T186111]  elevator_change+0x283/0x4f0
[30105.163342] [ T186111]  ? __pfx_sysfs_kf_write+0x10/0x10
[30105.168932] [ T186111]  elv_iosched_store+0x30c/0x3a0
[30105.174265] [ T186111]  ? __pfx_elv_iosched_store+0x10/0x10
[30105.180797] [ T186111]  ? lock_acquire.part.0+0xb8/0x230
[30105.187066] [ T186111]  ? kernfs_fop_write_iter+0x25b/0x5e0
[30105.193594] [ T186111]  ? lock_acquire.part.0+0xb8/0x230
[30105.199931] [ T186111]  ? lock_acquire+0x126/0x140
[30105.205683] [ T186111]  ? __pfx_sysfs_kf_write+0x10/0x10
[30105.211924] [ T186111]  queue_attr_store+0x23f/0x360
[30105.217796] [ T186111]  ? __pfx_queue_attr_store+0x10/0x10
[30105.224180] [ T186111]  ? __lock_acquire+0x55d/0xbd0
[30105.230049] [ T186111]  ? lock_acquire.part.0+0xb8/0x230
[30105.236247] [ T186111]  ? sysfs_file_kobj+0x1d/0x1b0
[30105.242093] [ T186111]  ? find_held_lock+0x2b/0x80
[30105.247763] [ T186111]  ? __lock_release.isra.0+0x59/0x170
[30105.254122] [ T186111]  ? lock_release.part.0+0x1c/0x50
[30105.260226] [ T186111]  ? sysfs_file_kobj+0xb9/0x1b0
[30105.266048] [ T186111]  ? sysfs_kf_write+0x65/0x170
[30105.271778] [ T186111]  ? __pfx_sysfs_kf_write+0x10/0x10
[30105.277934] [ T186111]  kernfs_fop_write_iter+0x3da/0x5e0
[30105.284173] [ T186111]  ? __pfx_kernfs_fop_write_iter+0x10/0x10
[30105.290926] [ T186111]  vfs_write+0x524/0x1010
[30105.296215] [ T186111]  ? __pfx_vfs_write+0x10/0x10
[30105.301905] [ T186111]  ? kasan_quarantine_put+0xf5/0x240
[30105.308092] [ T186111]  ? kasan_quarantine_put+0xf5/0x240
[30105.314246] [ T186111]  ksys_write+0xff/0x200
[30105.319331] [ T186111]  ? __pfx_ksys_write+0x10/0x10
[30105.325007] [ T186111]  do_syscall_64+0xf4/0x1550
[30105.330407] [ T186111]  ? __pfx___x64_sys_openat+0x10/0x10
[30105.336566] [ T186111]  ? seccomp_run_filters+0xeb/0x560
[30105.342517] [ T186111]  ? do_syscall_64+0x1d7/0x1550
[30105.348096] [ T186111]  ? __seccomp_filter+0xa2/0x920
[30105.353749] [ T186111]  ? __pfx___seccomp_filter+0x10/0x10
[30105.359830] [ T186111]  ? trace_hardirqs_on_prepare+0x150/0x1a0
[30105.366344] [ T186111]  ? do_syscall_64+0x1b9/0x1550
[30105.371892] [ T186111]  ? do_syscall_64+0x1d7/0x1550
[30105.377422] [ T186111]  ? do_syscall_64+0x1d7/0x1550
[30105.382922] [ T186111]  ? do_syscall_64+0x1b9/0x1550
[30105.388401] [ T186111]  ? do_syscall_64+0x34/0x1550
[30105.393777] [ T186111]  ? do_syscall_64+0xab/0x1550
[30105.399129] [ T186111]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[30105.405624] [ T186111] RIP: 0033:0x7fc1c7c4fbbe
[30105.410647] [ T186111] Code: 4d 89 d8 e8 34 bd 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 13 ff ff ff 0f 1f 00 f3 0f 1e fa
[30105.431611] [ T186111] RSP: 002b:00007ffefd3bdd90 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[30105.440716] [ T186111] RAX: ffffffffffffffda RBX: 000055fa3f0f4b80 RCX: 00007fc1c7c4fbbe
[30105.449404] [ T186111] RDX: 000000000000000b RSI: 000055fa3ed9d550 RDI: 0000000000000015
[30105.458090] [ T186111] RBP: 00007ffefd3bdda0 R08: 0000000000000000 R09: 0000000000000000
[30105.466787] [ T186111] R10: 0000000000000000 R11: 0000000000000202 R12: 000000000000000b
[30105.475479] [ T186111] R13: 000000000000000b R14: 000055fa3ed9d550 R15: 000055fa3ed9d550
[30105.484182] [ T186111]  </TASK>
[30105.487920] [ T186111] Modules linked in: iscsi_target_mod tcm_loop target_core_pscsi target_core_file target_core_iblock xfs nft_masq nft_reject_ipv4 act_csum cls_u32 sch_htb nf_nat_tftp nf_conntrack_tftp bridge stp llc target_core_user target_core_mod rfkill nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security nf_tables ip6table_filter ip6_tables iptable_filter ip_tables qrtr intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt intel_pmc_bxt kvm_intel kvm irqbypass rapl sunrpc intel_cstate intel_uncore pcspkr i2c_i801 i2c_smbus mei_me igb lpc_ich mei ioatdma dca wmi binfmt_misc joydev acpi_power_meter acpi_pad btrfs raid6_pq xor ses enclosure loop dm_multipath nfnetlink zram lz4hc_compress lz4_compress
[30105.488278] [ T186111]  zstd_compress ast drm_client_lib i2c_algo_bit drm_shmem_helper drm_kms_helper mpt3sas drm mpi3mr raid_class scsi_transport_sas scsi_dh_rdac scsi_dh_emc scsi_dh_alua i2c_dev fuse [last unloaded: zonefs]
[30105.609649] [ T186111] ---[ end trace 0000000000000000 ]---
[30105.648290] [ T186111] pstore: backend (erst) writing error (-28)
[30105.654739] [ T186111] RIP: 0010:blk_mq_debugfs_register_sched+0x46/0x210
[30105.662519] [ T186111] Code: 48 89 fa 48 c1 ea 03 48 83 ec 10 80 3c 02 00 0f 85 83 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 8b 6b 08 48 89 ea 48 c1 ea 03 <80> 3c 02 00 0f 85 57 01 00 00 48 c7 c0 24 a3 b3 97 48 8b 6d 00 48
[30105.683653] [ T186111] RSP: 0018:ffff88816b9c7708 EFLAGS: 00010246
[30105.691248] [ T186111] RAX: dffffc0000000000 RBX: ffff888117f18000 RCX: 0000000000000000
[30105.700121] [ T186111] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff888117f18008
[30105.708841] [ T186111] RBP: 0000000000000000 R08: ffffffff957c47ac R09: fffffbfff2f6633c
[30105.717829] [ T186111] R10: ffff88816b9c7730 R11: 0000000000000001 R12: ffff88814c1f2000
[30105.726550] [ T186111] R13: ffff88814c1f2018 R14: ffff8881b8a336ac R15: ffffffff95bfae30
[30105.735306] [ T186111] FS:  00007fc1c7970c40(0000) GS:ffff8887c54ce000(0000) knlGS:0000000000000000
[30105.745003] [ T186111] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[30105.752368] [ T186111] CR2: 00007f251f9bc0e8 CR3: 0000000108350002 CR4: 00000000001726f0


Shin'ichiro Kawasaki (1):
  block: serialize whole elevator change steps for the same queue

 block/blk-core.c       | 1 +
 block/elevator.c       | 9 +++++++++
 include/linux/blkdev.h | 7 +++++++
 3 files changed, 17 insertions(+)

-- 
2.54.0



* [PATCH RFC 1/1] block: serialize whole elevator change steps for the same queue
  2026-06-11  7:41 [PATCH RFC 0/1] block: fix concurrent elevator change failure Shin'ichiro Kawasaki
@ 2026-06-11  7:42 ` Shin'ichiro Kawasaki
  2026-06-11 11:22 ` [PATCH RFC 0/1] block: fix concurrent elevator change failure Ming Lei
  1 sibling, 0 replies; 9+ messages in thread
From: Shin'ichiro Kawasaki @ 2026-06-11  7:42 UTC (permalink / raw)
  To: linux-block, Jens Axboe; +Cc: Ming Lei, Nilay Shroff, Shin'ichiro Kawasaki

When elevator_change() is called concurrently for the same queue, the
elevator_change_done() function runs concurrently as well. This function
adds or deletes kobjects for the debugfs entry of the queue. Then the
concurrent calls cause memory corruption of the kobjects and result in a
process hang. The core part of the elevator switch is protected by queue
freeze and q->elevator_lock. However, since the commit 559dc11143eb
("block: move elv_register[unregister]_queue out of elevator_lock"), the
elevator_change_done() is not serialized. Hence the memory corruption
and the hang.

The failures are observed when udev-worker writes to a sysfs
queue/scheduler attribute file while the blktests test case block/005
writes to the same attribute file. The failure can also be recreated by
running two processes that write to the same queue/scheduler file
concurrently. The failure has been observed since another commit,
370ac285f23a ("block: avoid cpu_hotplug_lock depedency on freeze_lock"),
which changed the behavior of queue freeze and exposed the failure.

Fix the failure by adding a new per-queue lock 'elevator_queue_lock',
which serializes the whole elevator switch steps for the same queue
including the elevator_change_done() call.

Fixes: 559dc11143eb ("block: move elv_register[unregister]_queue out of elevator_lock")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 block/blk-core.c       | 1 +
 block/elevator.c       | 9 +++++++++
 include/linux/blkdev.h | 7 +++++++
 3 files changed, 17 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 17450058ea6d..c6418889897a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -430,6 +430,7 @@ struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id)
 	refcount_set(&q->refs, 1);
 	mutex_init(&q->debugfs_mutex);
 	mutex_init(&q->elevator_lock);
+	mutex_init(&q->elevator_queue_lock);
 	mutex_init(&q->sysfs_lock);
 	mutex_init(&q->limits_lock);
 	mutex_init(&q->rq_qos_mutex);
diff --git a/block/elevator.c b/block/elevator.c
index 3bcd37c2aa34..65bdea27aa8a 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -665,6 +665,13 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
 			return ret;
 	}

+	/*
+	 * Acquire elevator_queue_lock to serialize the debugfs (un)register
+	 * steps for the same queue. The elevator switch core part is protected
+	 * by queue freezing and ->elevator_lock.
+	 */
+	mutex_lock(&q->elevator_queue_lock);
+
 	memflags = blk_mq_freeze_queue(q);
 	/*
 	 * May be called before adding disk, when there isn't any FS I/O,
@@ -690,6 +697,8 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
 	if (!ctx->new)
 		blk_mq_free_sched_res(&ctx->res, ctx->type, set);

+	mutex_unlock(&q->elevator_queue_lock);
+
 	return ret;
 }

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 890128cdea1c..cfeddd3ded95 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -606,6 +606,13 @@ struct request_queue {
 	 */
 	struct mutex		elevator_lock;

+	/*
+	 * Serializes the whole elevator change operation for the same queue,
+	 * including the debugfs (un)register steps. Must be acquired before
+	 * freezing the queue and acquiring elevator_lock.
+	 */
+	struct mutex		elevator_queue_lock;
+
 	struct mutex		sysfs_lock;
 	/*
 	 * Protects queue limits and also sysfs attribute read_ahead_kb.
-- 
2.54.0


* Re: [PATCH RFC 0/1] block: fix concurrent elevator change failure
  2026-06-11  7:41 [PATCH RFC 0/1] block: fix concurrent elevator change failure Shin'ichiro Kawasaki
  2026-06-11  7:42 ` [PATCH RFC 1/1] block: serialize whole elevator change steps for the same queue Shin'ichiro Kawasaki
@ 2026-06-11 11:22 ` Ming Lei
  2026-06-12  9:47   ` Shin'ichiro Kawasaki
  1 sibling, 1 reply; 9+ messages in thread
From: Ming Lei @ 2026-06-11 11:22 UTC (permalink / raw)
  To: Shin'ichiro Kawasaki; +Cc: linux-block, Jens Axboe, Nilay Shroff

Hi Shin'ichiro,

On Thu, Jun 11, 2026 at 04:41:59PM +0900, Shin'ichiro Kawasaki wrote:
> I observed that the blktests test case block/005 hangs on a specific
> server hardware using a specific HDD as a block device. During the test
> case run, the kernel reported a KASAN null-ptr-deref (and other memory
> corruption symptoms) [2]. This failure looked sporadic and hardware-
> dependent.
> 
> From the kernel message, I noticed that udev-worker wrote to the
> queue/scheduler sysfs attribute to change the IO scheduler, or elevator.
> The test case block/005 also wrote to the same sysfs attribute, which

sysfs write is supposed to be serialized...

> indicated that a concurrent elevator change caused the failure. I
> created a new blktests test case that simply does the concurrent
> elevator change with a null_blk device [1]. It recreates the failure in
> a stable manner on various server hardware.
> 
> Using the new test case, I bisected and found that the failure first
> appears at the commit 370ac285f23a ("block: avoid cpu_hotplug_lock
> depedency on freeze_lock") in the kernel tag v6.17-rc3. However, that
> commit does not appear to explain the failure by itself: it changed the
> queue freeze behavior and only unveiled a race, probably. Looking back
> at the changes to elevator_change(), I think the actual cause is the
> commit 559dc11143eb ("block: move elv_register[unregister]_queue out of
> elevator_lock") in the kernel tag v6.16-rc1. This commit moved
> elevator_change_done() out of the guard of ->elevator_lock and the queue
> freeze. As a result, when two threads write to the same queue/scheduler
> attribute concurrently, elevator_change_done() runs in parallel causing
> the memory corruption and the hang.
> 
> As the fix attempt, I created the patch in this series. It adds a new
> mutex that serializes the whole elevator switch sequence, including the
> elevator_change_done() call. I ran the reproducer with lockdep enabled
> and confirmed that the patch avoids the failure and new WARN was not
> observed.
> 
> However, the fix patch adds a new lock, and I'm not sure if it is the best
> solution. Comments on the patch, or suggestions for a better solution,
> would be appreciated.
> 
> [1] https://github.com/kawasaki/blktests/commit/4f8c63ed7d049f5e9c935c3fe00142b2a3629826
> 
> [2]
> 
> [30102.760660] [ T186170] run blktests block/005 at 2026-05-11 05:53:53
> [30104.969837] [ T186111] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI
> [30104.983590] [ T186111] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
> [30104.992929] [ T186111] CPU: 2 UID: 0 PID: 186111 Comm: (udev-worker) Not tainted 7.1.0-rc2-kts+ #1 PREEMPT(lazy)
> [30105.004019] [ T186111] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015
> [30105.013216] [ T186111] RIP: 0010:blk_mq_debugfs_register_sched+0x46/0x210
> [30105.020667] [ T186111] Code: 48 89 fa 48 c1 ea 03 48 83 ec 10 80 3c 02 00 0f 85 83 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 8b 6b 08 48 89 ea 48 c1 ea 03 <80> 3c 02 00 0f 85 57 01 00 00 48 c7 c0 24 a3 b3 97 48 8b 6d 00 48
> [30105.041036] [ T186111] RSP: 0018:ffff88816b9c7708 EFLAGS: 00010246
> [30105.048111] [ T186111] RAX: dffffc0000000000 RBX: ffff888117f18000 RCX: 0000000000000000
> [30105.057097] [ T186111] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff888117f18008
> [30105.066086] [ T186111] RBP: 0000000000000000 R08: ffffffff957c47ac R09: fffffbfff2f6633c
> [30105.075083] [ T186111] R10: ffff88816b9c7730 R11: 0000000000000001 R12: ffff88814c1f2000
> [30105.084088] [ T186111] R13: ffff88814c1f2018 R14: ffff8881b8a336ac R15: ffffffff95bfae30
> [30105.093111] [ T186111] FS:  00007fc1c7970c40(0000) GS:ffff8887c534e000(0000) knlGS:0000000000000000
> [30105.103093] [ T186111] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [30105.110751] [ T186111] CR2: 000055fa37e182c0 CR3: 0000000108350003 CR4: 00000000001726f0
> [30105.119796] [ T186111] Call Trace:
> [30105.124154] [ T186111]  <TASK>
> [30105.128301] [ T186111]  blk_mq_sched_reg_debugfs+0x8d/0x1a0
> [30105.134193] [ T186111]  elevator_change_done+0x2f2/0x610

blk_mq_sched_reg_debugfs already takes the debugfs lock, so I feel the
proper fix could be to check for and avoid the null-ptr-deref.

Adding a new lock should usually be the last resort, especially since
queue freeze depends on this one.

Thanks,
Ming


* Re: [PATCH RFC 0/1] block: fix concurrent elevator change failure
  2026-06-11 11:22 ` [PATCH RFC 0/1] block: fix concurrent elevator change failure Ming Lei
@ 2026-06-12  9:47   ` Shin'ichiro Kawasaki
  2026-06-12 11:06     ` Ming Lei
  0 siblings, 1 reply; 9+ messages in thread
From: Shin'ichiro Kawasaki @ 2026-06-12  9:47 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, Jens Axboe, Nilay Shroff

On Jun 11, 2026 / 06:22, Ming Lei wrote:
> Hi Shin'ichiro,

Hi Ming, thanks for the comments.

> 
> On Thu, Jun 11, 2026 at 04:41:59PM +0900, Shin'ichiro Kawasaki wrote:
> > I observed that the blktests test case block/005 hangs on a specific
> > server hardware using a specific HDD as a block device. During the test
> > case run, the kernel reported a KASAN null-ptr-deref (and other memory
> > corruption symptoms) [2]. This failure looked sporadic and hardware-
> > dependent.
> > 
> > From the kernel message, I noticed that udev-worker wrote to the
> > queue/scheduler sysfs attribute to change the IO scheduler, or elevator.
> > The test case block/005 also wrote to the same sysfs attribute, which
> 
> sysfs write is supposed to be serialized...

I checked the sysfs write handler elv_iosched_store() in block/elevator.c.
I found that the elevator_change() call is guarded by the rw_semaphore
"set->update_nr_hwq_lock", but it takes the reader lock, not the writer
lock. Readers do not exclude each other, so this does not serialize the
sysfs writes.

I tried the patch below, which replaces the reader lock with the writer lock.
In a quick trial, it appears to work: the kernel message is no longer observed
and the new test case does not cause hangs. I will do further testing to
confirm that this change does not trigger other new lockdep WARNs. Assuming it
has no such side effects, I hope this fix approach is acceptable. It doesn't
add a new lock, so I think it is the better approach.

diff --git a/block/elevator.c b/block/elevator.c
index 3bcd37c2aa34..b03185a217ff 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -813,7 +813,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 	 *   update_nr_hwq_lock -> kn->active (via del_gendisk -> kobject_del)
 	 *   kn->active -> update_nr_hwq_lock (via this sysfs write path)
 	 */
-	if (!down_read_trylock(&set->update_nr_hwq_lock)) {
+	if (!down_write_trylock(&set->update_nr_hwq_lock)) {
 		ret = -EBUSY;
 		goto out;
 	}
@@ -824,7 +824,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
 	} else {
 		ret = -ENOENT;
 	}
-	up_read(&set->update_nr_hwq_lock);
+	up_write(&set->update_nr_hwq_lock);
 
 out:
 	if (ctx.type)

[...]

> blk_mq_sched_reg_debugfs already includes debugfs lock, so I feel the proper
> fix could be check & avoid the null-ptr-deref.

Actually, the null-ptr-deref is only one of the failure symptoms. A KASAN
slab-use-after-free is also observed [3]. So I guess that adding null checks
may not be enough.

> Adding new lock should be the last straw usually, especially this one is
> depended by queue freeze.

Got it, thanks.


[3] KASAN slab-use-after-free

[  802.836569][ T3919] run blktests block/005 at 2026-05-11 10:42:39
[  804.256901][ T3866] debugfs: 'sched' already exists in 'sdd'
[  804.874743][ T3919] debugfs: 'sched' already exists in 'sdd'
[  804.882124][ T3919] ==================================================================
[  804.882154][ T3866] debugfs: 'sched' already exists in 'sdd'
[  804.890039][ T3919] BUG: KASAN: slab-use-after-free in elevator_change_done+0x304/0x610
[  804.890053][ T3919] Write of size 8 at addr ffff8881273e08e0 by task check/3919
[  804.890061][ T3919]
[  804.890069][ T3919] CPU: 4 UID: 0 PID: 3919 Comm: check Not tainted 7.1.0-rc2-kts+ #1 PREEMPT(lazy)
[  804.890080][ T3919] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015
[  804.890086][ T3919] Call Trace:
[  804.890092][ T3919]  <TASK>
[  804.890098][ T3919]  dump_stack_lvl+0x6e/0xa0
[  804.890118][ T3919]  print_address_description.constprop.0+0x70/0x300
[  804.890135][ T3919]  ? elevator_change_done+0x304/0x610
[  804.890145][ T3919]  print_report+0xfc/0x1ff
[  804.890154][ T3919]  ? __virt_addr_valid+0x1d1/0x3f0
[  804.890163][ T3919]  ? elevator_change_done+0x304/0x610
[  804.890168][ T3919]  kasan_report+0xf6/0x1c0
[  804.890176][ T3919]  ? elevator_change_done+0x304/0x610
[  804.890185][ T3919]  kasan_check_range+0x125/0x200
[  804.890192][ T3919]  elevator_change_done+0x304/0x610
[  804.890198][ T3919]  ? sysfs_file_ops+0x70/0x140
[  804.890206][ T3919]  ? __pfx_elevator_change_done+0x10/0x10
[  804.890213][ T3919]  ? __pfx_sysfs_kf_write+0x10/0x10
[  804.890220][ T3919]  ? __pfx_sysfs_kf_write+0x10/0x10
[  804.890225][ T3919]  elevator_change+0x283/0x4f0
[  804.890233][ T3919]  ? __pfx_sysfs_kf_write+0x10/0x10
[  804.890239][ T3919]  elv_iosched_store+0x30c/0x3a0
[  804.890246][ T3919]  ? __pfx_elv_iosched_store+0x10/0x10
[  804.890255][ T3919]  ? lock_acquire.part.0+0xb8/0x230
[  804.890262][ T3919]  ? kernfs_fop_write_iter+0x25b/0x5e0
[  804.890268][ T3919]  ? lock_acquire.part.0+0xb8/0x230
[  804.890274][ T3919]  ? lock_acquire+0x126/0x140
[  804.890281][ T3919]  ? __pfx_sysfs_kf_write+0x10/0x10
[  804.890286][ T3919]  queue_attr_store+0x23f/0x360
[  804.890295][ T3919]  ? __pfx_queue_attr_store+0x10/0x10
[  804.890300][ T3919]  ? __lock_acquire+0x55d/0xbd0
[  804.890308][ T3919]  ? lock_acquire.part.0+0xb8/0x230
[  804.890314][ T3919]  ? sysfs_file_kobj+0x1d/0x1b0
[  804.890319][ T3919]  ? find_held_lock+0x2b/0x80
[  804.890326][ T3919]  ? __lock_release.isra.0+0x59/0x170
[  804.890334][ T3919]  ? lock_release.part.0+0x1c/0x50
[  804.890340][ T3919]  ? sysfs_file_kobj+0xb9/0x1b0
[  804.890345][ T3919]  ? sysfs_kf_write+0x65/0x170
[  804.890352][ T3919]  ? __pfx_sysfs_kf_write+0x10/0x10
[  804.890357][ T3919]  kernfs_fop_write_iter+0x3da/0x5e0
[  804.890363][ T3919]  ? __pfx_kernfs_fop_write_iter+0x10/0x10
[  804.890368][ T3919]  vfs_write+0x524/0x1010
[  804.890378][ T3919]  ? __pfx_vfs_write+0x10/0x10
[  804.890393][ T3919]  ksys_write+0xff/0x200
[  804.890401][ T3919]  ? __pfx_ksys_write+0x10/0x10
[  804.890408][ T3919]  ? __pfx_pte_val+0x10/0x10
[  804.890414][ T3919]  ? folio_xchg_last_cpupid+0xc6/0x130
[  804.890421][ T3919]  do_syscall_64+0xf4/0x1550
[  804.890429][ T3919]  ? __lock_release.isra.0+0x59/0x170
[  804.890437][ T3919]  ? lock_release.part.0+0x1c/0x50
[  804.890444][ T3919]  ? rcu_read_unlock+0x1c/0x60
[  804.890449][ T3919]  ? wp_page_reuse+0x160/0x1e0
[  804.890455][ T3919]  ? do_wp_page+0x5db/0x10a0
[  804.890465][ T3919]  ? handle_pte_fault+0x54e/0x760
[  804.890472][ T3919]  ? __pfx_handle_pte_fault+0x10/0x10
[  804.890479][ T3919]  ? __pfx_pmd_val+0x10/0x10
[  804.890485][ T3919]  ? __handle_mm_fault+0xa02/0xef0
[  804.890493][ T3919]  ? __lock_acquire+0x55d/0xbd0
[  804.890499][ T3919]  ? __pfx_css_rstat_updated+0x10/0x10
[  804.890509][ T3919]  ? lock_acquire.part.0+0xb8/0x230
[  804.890515][ T3919]  ? count_memcg_events_mm.constprop.0+0x22/0x130
[  804.890522][ T3919]  ? find_held_lock+0x2b/0x80
[  804.890528][ T3919]  ? __lock_release.isra.0+0x59/0x170
[  804.890536][ T3919]  ? find_held_lock+0x2b/0x80
[  804.890542][ T3919]  ? __lock_release.isra.0+0x59/0x170
[  804.890550][ T3919]  ? do_user_addr_fault+0x811/0xed0
[  804.890559][ T3919]  ? do_syscall_64+0x34/0x1550
[  804.890564][ T3919]  ? lockdep_hardirqs_on_prepare.part.0+0x9b/0x140
[  804.890570][ T3919]  ? do_syscall_64+0x34/0x1550
[  804.890575][ T3919]  ? trace_hardirqs_on+0x19/0x1a0
[  804.890584][ T3919]  ? do_syscall_64+0xab/0x1550
[  804.890590][ T3919]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  804.890596][ T3919] RIP: 0033:0x7ff08cbe3bbe
[  804.890603][ T3919] Code: 4d 89 d8 e8 34 bd 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 13 ff ff ff 0f 1f 00 f3 0f 1e fa
[  804.890609][ T3919] RSP: 002b:00007ffc95718820 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[  804.890616][ T3919] RAX: ffffffffffffffda RBX: 00007ff08cd5f5c0 RCX: 00007ff08cbe3bbe
[  804.890621][ T3919] RDX: 0000000000000006 RSI: 0000563340f2c390 RDI: 0000000000000001
[  804.890624][ T3919] RBP: 00007ffc95718830 R08: 0000000000000000 R09: 0000000000000000
[  804.890627][ T3919] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000006
[  804.890630][ T3919] R13: 0000000000000006 R14: 0000563340f2c390 R15: 0000563340f96890
[  804.890641][ T3919]  </TASK>
[  804.890643][ T3919]
[  805.368835][ T3919] Allocated by task 3919:
[  805.373543][ T3919]  kasan_save_stack+0x30/0x50
[  805.378559][ T3919]  kasan_save_track+0x14/0x30
[  805.383559][ T3919]  __kasan_kmalloc+0x9a/0xb0
[  805.388465][ T3919]  elevator_alloc+0xc5/0x2b0
[  805.393366][ T3919]  blk_mq_init_sched+0xa6/0x5e0
[  805.398554][ T3919]  elevator_switch+0x18e/0x680
[  805.403702][ T3919]  elevator_change+0x2d8/0x4f0
[  805.408802][ T3919]  elv_iosched_store+0x30c/0x3a0
[  805.414116][ T3919]  queue_attr_store+0x23f/0x360
[  805.419289][ T3919]  kernfs_fop_write_iter+0x3da/0x5e0
[  805.424938][ T3919]  vfs_write+0x524/0x1010
[  805.429600][ T3919]  ksys_write+0xff/0x200
[  805.434159][ T3919]  do_syscall_64+0xf4/0x1550
[  805.439064][ T3919]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  805.445273][ T3919]
[  805.447927][ T3919] Freed by task 3866:
[  805.452231][ T3919]  kasan_save_stack+0x30/0x50
[  805.457287][ T3919]  kasan_save_track+0x14/0x30
[  805.462282][ T3919]  kasan_save_free_info+0x3b/0x70
[  805.467645][ T3919]  __kasan_slab_free+0x6b/0x90
[  805.472736][ T3919]  kfree+0x21c/0x620
[  805.476953][ T3919]  kobject_cleanup+0x105/0x3a0
[  805.482039][ T3919]  elevator_change_done+0x196/0x610
[  805.487633][ T3919]  elevator_change+0x283/0x4f0
[  805.492730][ T3919]  elv_iosched_store+0x30c/0x3a0
[  805.497989][ T3919]  queue_attr_store+0x23f/0x360
[  805.503144][ T3919]  kernfs_fop_write_iter+0x3da/0x5e0
[  805.508747][ T3919]  vfs_write+0x524/0x1010
[  805.513381][ T3919]  ksys_write+0xff/0x200
[  805.517944][ T3919]  do_syscall_64+0xf4/0x1550
[  805.522862][ T3919]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  805.529118][ T3919]
[  805.531858][ T3919] The buggy address belongs to the object at ffff8881273e0800
[  805.531858][ T3919]  which belongs to the cache kmalloc-rnd-13-1k of size 1024
[  805.547392][ T3919] The buggy address is located 224 bytes inside of
[  805.547392][ T3919]  freed 1024-byte region [ffff8881273e0800, ffff8881273e0c00)
[  805.562078][ T3919]
[  805.564734][ T3919] The buggy address belongs to the physical page:
[  805.571446][ T3919] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1273e0
[  805.580609][ T3919] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  805.589411][ T3919] flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
[  805.597524][ T3919] page_type: f5(slab)
[  805.601916][ T3919] raw: 0017ffffc0000040 ffff88810005c640 dead000000000100 dead000000000122
[  805.610881][ T3919] raw: 0000000000000000 0000000800100010 00000000f5000000 0000000000000000
[  805.619808][ T3919] head: 0017ffffc0000040 ffff88810005c640 dead000000000100 dead000000000122
[  805.628815][ T3919] head: 0000000000000000 0000000800100010 00000000f5000000 0000000000000000
[  805.637838][ T3919] head: 0017ffffc0000003 fffffffffffffe01 00000000ffffffff 00000000ffffffff
[  805.646901][ T3919] head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
[  805.655983][ T3919] page dumped because: kasan: bad access detected
[  805.662913][ T3919]
[  805.665657][ T3919] Memory state around the buggy address:
[  805.671717][ T3919]  ffff8881273e0780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  805.680194][ T3919]  ffff8881273e0800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  805.688697][ T3919] >ffff8881273e0880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  805.697130][ T3919]                                                        ^
[  805.704717][ T3919]  ffff8881273e0900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  805.713179][ T3919]  ffff8881273e0980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  805.721720][ T3919] ==================================================================
[  805.730526][ T3919] Disabling lock debugging due to kernel taint
...


* Re: [PATCH RFC 0/1] block: fix concurrent elevator change failure
  2026-06-12  9:47   ` Shin'ichiro Kawasaki
@ 2026-06-12 11:06     ` Ming Lei
  2026-06-12 11:45       ` Nilay Shroff
  0 siblings, 1 reply; 9+ messages in thread
From: Ming Lei @ 2026-06-12 11:06 UTC (permalink / raw)
  To: Shin'ichiro Kawasaki; +Cc: linux-block, Jens Axboe, Nilay Shroff

On Fri, Jun 12, 2026 at 06:47:50PM +0900, Shin'ichiro Kawasaki wrote:
> On Jun 11, 2026 / 06:22, Ming Lei wrote:
> > Hi Shin'ichiro,
> 
> Hi Ming, thanks for the comments.
> 
> > 
> > On Thu, Jun 11, 2026 at 04:41:59PM +0900, Shin'ichiro Kawasaki wrote:
> > > I observed that the blktests test case block/005 hangs on a specific
> > > server hardware using a specific HDD as a block device. During the test
> > > case run, the kernel reported a KASAN null-ptr-deref (and other memory
> > > corruption symptoms) [2]. This failure looked sporadic and hardware-
> > > dependent.
> > > 
> > > From the kernel message, I noticed that udev-worker wrote to the
> > > queue/scheduler sysfs attribute to change the IO scheduler, or elevator.
> > > The test case block/005 also wrote to the same sysfs attribute, which
> > 
> > sysfs write is supposed to be serialized...
> 
> I checked the sysfs write handler elv_iosched_store() in block/elevator.c.
> I found elevator_change() call is guarded with the rw_semaphore
> "set->update_nr_hwq_lock", but the guard is not the writer lock but the reader
> lock. This does not serialize the sysfs writes.

Please see kernfs_fop_write_iter(), in which a mutex is held before calling
->write().

> 
> I tried the patch below to replace the reader lock with the writer lock. With
> a quick trial, it looks working. The kernel message is no longer observed and
> the new test case does not cause hangs. I will do further testing to confirm
> that this change does not trigger other new lockdep WARNs. Assuming it does not
> have such side effects, I hope this fix approach is acceptable. It doesn't add
> the new lock, so I think it's the better.
> 
> diff --git a/block/elevator.c b/block/elevator.c
> index 3bcd37c2aa34..b03185a217ff 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -813,7 +813,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
>  	 *   update_nr_hwq_lock -> kn->active (via del_gendisk -> kobject_del)
>  	 *   kn->active -> update_nr_hwq_lock (via this sysfs write path)
>  	 */
> -	if (!down_read_trylock(&set->update_nr_hwq_lock)) {
> +	if (!down_write_trylock(&set->update_nr_hwq_lock)) {
>  		ret = -EBUSY;
>  		goto out;
>  	}
> @@ -824,7 +824,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
>  	} else {
>  		ret = -ENOENT;
>  	}
> -	up_read(&set->update_nr_hwq_lock);
> +	up_write(&set->update_nr_hwq_lock);
>  
>  out:
>  	if (ctx.type)
> 
> [...]
> 
> > blk_mq_sched_reg_debugfs already includes debugfs lock, so I feel the proper
> > fix could be check & avoid the null-ptr-deref.
> 
> Actually, null-ptr-deref is one of the failure symptoms. KASAN slab-user-after
> free is also observed [3]. Then I'm guessing adding null checks may not be
> enough.
> 
> > Adding new lock should be the last straw usually, especially this one is
> > depended by queue freeze.
> 
> Got it, thanks.
> 
> 
> [3] KASAN slab-use-after-free

Then you need to figure out the exact slab type and check whether the pointer is
cleared during free.

Anyway, there is a guard already, so I don't see a reason to add a new lock to
cover it.


Thanks,
Ming


* Re: [PATCH RFC 0/1] block: fix concurrent elevator change failure
  2026-06-12 11:06     ` Ming Lei
@ 2026-06-12 11:45       ` Nilay Shroff
  2026-06-16  1:20         ` Shin'ichiro Kawasaki
  0 siblings, 1 reply; 9+ messages in thread
From: Nilay Shroff @ 2026-06-12 11:45 UTC (permalink / raw)
  To: Ming Lei, Shin'ichiro Kawasaki; +Cc: linux-block, Jens Axboe

On 6/12/26 4:36 PM, Ming Lei wrote:
> On Fri, Jun 12, 2026 at 06:47:50PM +0900, Shin'ichiro Kawasaki wrote:
>> On Jun 11, 2026 / 06:22, Ming Lei wrote:
>>> Hi Shin'ichiro,
>>
>> Hi Ming, thanks for the comments.
>>
>>>
>>> On Thu, Jun 11, 2026 at 04:41:59PM +0900, Shin'ichiro Kawasaki wrote:
>>>> I observed that the blktests test case block/005 hangs on a specific
>>>> server hardware using a specific HDD as a block device. During the test
>>>> case run, the kernel reported a KASAN null-ptr-deref (and other memory
>>>> corruption symptoms) [2]. This failure looked sporadic and hardware-
>>>> dependent.
>>>>
>>>> From the kernel message, I noticed that udev-worker wrote to the
>>>> queue/scheduler sysfs attribute to change the IO scheduler, or elevator.
>>>> The test case block/005 also wrote to the same sysfs attribute, which
>>>
>>> sysfs write is supposed to be serialized...
>>
>> I checked the sysfs write handler elv_iosched_store() in block/elevator.c.
>> I found elevator_change() call is guarded with the rw_semaphore
>> "set->update_nr_hwq_lock", but the guard is not the writer lock but the reader
>> lock. This does not serialize the sysfs writes.
> 
> Please see kernfs_fop_write_iter(), in which mutex is held before calling
> ->write().
> 
I think you're referring to @of->mutex here. However, of->mutex is per struct
kernfs_open_file, which is associated with one open instance of the sysfs file.
The important point is that two separate opens can have different kernfs_open_file
instances and therefore different mutexes. Thus, concurrent writes to the same sysfs
attribute from two different processes may still be possible.


>>
>> I tried the patch below to replace the reader lock with the writer lock. With
>> a quick trial, it looks working. The kernel message is no longer observed and
>> the new test case does not cause hangs. I will do further testing to confirm
>> that this change does not trigger other new lockdep WARNs. Assuming it does not
>> have such side effects, I hope this fix approach is acceptable. It doesn't add
>> the new lock, so I think it's the better.
>>
>> diff --git a/block/elevator.c b/block/elevator.c
>> index 3bcd37c2aa34..b03185a217ff 100644
>> --- a/block/elevator.c
>> +++ b/block/elevator.c
>> @@ -813,7 +813,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
>>   	 *   update_nr_hwq_lock -> kn->active (via del_gendisk -> kobject_del)
>>   	 *   kn->active -> update_nr_hwq_lock (via this sysfs write path)
>>   	 */
>> -	if (!down_read_trylock(&set->update_nr_hwq_lock)) {
>> +	if (!down_write_trylock(&set->update_nr_hwq_lock)) {
>>   		ret = -EBUSY;
>>   		goto out;
>>   	}
>> @@ -824,7 +824,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
>>   	} else {
>>   		ret = -ENOENT;
>>   	}
>> -	up_read(&set->update_nr_hwq_lock);
>> +	up_write(&set->update_nr_hwq_lock);
>>   
>>   out:
>>   	if (ctx.type)
>>
>> [...]
>>
>>> blk_mq_sched_reg_debugfs already includes debugfs lock, so I feel the proper
>>> fix could be check & avoid the null-ptr-deref.
>>
>> Actually, null-ptr-deref is one of the failure symptoms. KASAN slab-user-after
>> free is also observed [3]. Then I'm guessing adding null checks may not be
>> enough.
>>
>>> Adding new lock should be the last straw usually, especially this one is
>>> depended by queue freeze.
>>
>> Got it, thanks.
>>
>>
>> [3] KASAN slab-use-after-free
> 
> Then you need to figure out the exact slab type and check if the pointer is cleared
> during free.
> 
> Anyway, there is guard already, not see reason to add new lock for covering
> it.
> 
Regarding the observed failure, my understanding is that blk_mq_debugfs_register_sched()
and blk_mq_debugfs_register_sched_hctx() access q->elevator without holding q->elevator_lock.
If multiple scheduler update paths run concurrently, one path can replace and free the
elevator while another path is still using it, which would explain the observed KASAN
use-after-free and NULL pointer dereference reports.

With the proposed change, upgrading update_nr_hwq_lock from a reader lock to a writer
lock in elv_iosched_store() would serialize concurrent scheduler updates and therefore
prevent multiple elevator switch operations from running at the same time.

Another way to fix this might be to acquire q->elevator_lock in blk_mq_sched_reg_debugfs()
and thus serialize access to q->elevator in blk_mq_debugfs_register_sched() and
blk_mq_debugfs_register_sched_hctx().

Thanks,
--Nilay


* Re: [PATCH RFC 0/1] block: fix concurrent elevator change failure
  2026-06-12 11:45       ` Nilay Shroff
@ 2026-06-16  1:20         ` Shin'ichiro Kawasaki
  2026-06-17 11:08           ` Nilay Shroff
  0 siblings, 1 reply; 9+ messages in thread
From: Shin'ichiro Kawasaki @ 2026-06-16  1:20 UTC (permalink / raw)
  To: Nilay Shroff; +Cc: Ming Lei, linux-block, Jens Axboe

On Jun 12, 2026 / 17:15, Nilay Shroff wrote:
> On 6/12/26 4:36 PM, Ming Lei wrote:
> > On Fri, Jun 12, 2026 at 06:47:50PM +0900, Shin'ichiro Kawasaki wrote:
> > > On Jun 11, 2026 / 06:22, Ming Lei wrote:
> > > > Hi Shin'ichiro,
> > > 
> > > Hi Ming, thanks for the comments.
> > > 
> > > > 
> > > > On Thu, Jun 11, 2026 at 04:41:59PM +0900, Shin'ichiro Kawasaki wrote:
> > > > > I observed that the blktests test case block/005 hangs on a specific
> > > > > server hardware using a specific HDD as a block device. During the test
> > > > > case run, the kernel reported a KASAN null-ptr-deref (and other memory
> > > > > corruption symptoms) [2]. This failure looked sporadic and hardware-
> > > > > dependent.
> > > > > 
> > > > > From the kernel message, I noticed that udev-worker wrote to the
> > > > > queue/scheduler sysfs attribute to change the IO scheduler, or elevator.
> > > > > The test case block/005 also wrote to the same sysfs attribute, which
> > > > 
> > > > sysfs write is supposed to be serialized...
> > > 
> > > I checked the sysfs write handler elv_iosched_store() in block/elevator.c.
> > > I found elevator_change() call is guarded with the rw_semaphore
> > > "set->update_nr_hwq_lock", but the guard is not the writer lock but the reader
> > > lock. This does not serialize the sysfs writes.
> > 
> > Please see kernfs_fop_write_iter(), in which mutex is held before calling
> > ->write().
> > 
> I think you're referring to @of->mutex here; however of->mutex is per struct
> kernfs_open_file, which is associated with an open instance of the sysfs file.
> The important point is that two separate opens can have different kernfs_open_file
> instances and therefore different mutexes. Thus, concurrent write to same sysfs
> attribute from two different processes may still be possible.

Thanks Nilay. I added debug prints to show the @of->mutex address, and observed
that the address is different for each process and each file open. So, I don't
think sysfs writes are serialized.

> 
> 
> > > 
> > > I tried the patch below to replace the reader lock with the writer lock. With
> > > a quick trial, it looks working. The kernel message is no longer observed and
> > > the new test case does not cause hangs. I will do further testing to confirm
> > > that this change does not trigger other new lockdep WARNs. Assuming it does not
> > > have such side effects, I hope this fix approach is acceptable. It doesn't add
> > > the new lock, so I think it's the better.
> > > 
> > > diff --git a/block/elevator.c b/block/elevator.c
> > > index 3bcd37c2aa34..b03185a217ff 100644
> > > --- a/block/elevator.c
> > > +++ b/block/elevator.c
> > > @@ -813,7 +813,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
> > >   	 *   update_nr_hwq_lock -> kn->active (via del_gendisk -> kobject_del)
> > >   	 *   kn->active -> update_nr_hwq_lock (via this sysfs write path)
> > >   	 */
> > > -	if (!down_read_trylock(&set->update_nr_hwq_lock)) {
> > > +	if (!down_write_trylock(&set->update_nr_hwq_lock)) {
> > >   		ret = -EBUSY;
> > >   		goto out;
> > >   	}
> > > @@ -824,7 +824,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
> > >   	} else {
> > >   		ret = -ENOENT;
> > >   	}
> > > -	up_read(&set->update_nr_hwq_lock);
> > > +	up_write(&set->update_nr_hwq_lock);
> > >   out:
> > >   	if (ctx.type)
> > > 
> > > [...]
> > > 
> > > > blk_mq_sched_reg_debugfs already includes debugfs lock, so I feel the proper
> > > > fix could be check & avoid the null-ptr-deref.
> > > 
> > > Actually, null-ptr-deref is one of the failure symptoms. KASAN slab-user-after
> > > free is also observed [3]. Then I'm guessing adding null checks may not be
> > > enough.
> > > 
> > > > Adding new lock should be the last straw usually, especially this one is
> > > > depended by queue freeze.
> > > 
> > > Got it, thanks.
> > > 
> > > 
> > > [3] KASAN slab-use-after-free
> > 
> > Then you need to figure out the exact slab type and check if the pointer is cleared
> > during free.
> > 
> > Anyway, there is guard already, not see reason to add new lock for covering
> > it.
> > 
> Regarding the observed failure, my understanding is that blk_mq_debugfs_register_sched()
> and blk_mq_debugfs_register_sched_hctx() access q->elevator without holding q->elevator_lock.
> If multiple scheduler update paths run concurrently, one path can replace and free the
> elevator while another path is still using it, which would explain the observed KASAN
> use-after-free and NULL pointer dereference reports.

I have the same view. I think the use-after-free and the null-ptr-deref indicate
that the elevator_queue object address in q->elevator is the problem. References
to the object are also kept in struct elv_change_ctx as ctx->old and ctx->new.
Because these multiple references are used concurrently, I'm not sure that
adding pointer clears and null checks would fix the problem.

> 
> With the proposed change, upgrading update_nr_hwq_lock from a reader lock to a writer
> lock in elv_iosched_store() would serialize concurrent scheduler updates and therefore
> prevent multiple elevator switch operations from running at the same time.
> 
> The another way to fix this might be to acquire q->elevator_lock in blk_mq_sched_reg_debugfs()
> and thus serialize access to q->elevator in blk_mq_debugfs_register_sched() and
> blk_mq_debugfs_register_sched_hctx().

Thanks for the idea. I tried the patch below [X], but it triggered a WARN in
debugfs_create_files() in block/blk-mq-debugfs.c [Y]. So I'm afraid this
approach does not work.

At this moment, the writer lock in elv_iosched_store() looks like the solution
to me, but further comments on other possible solutions are welcome.


[X]

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 0a00f5a76f5a..12c582b6c713 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -394,9 +394,11 @@ void blk_mq_sched_reg_debugfs(struct request_queue *q)
 	unsigned long i;
 
 	memflags = blk_debugfs_lock(q);
+	mutex_lock(&q->elevator_lock);
 	blk_mq_debugfs_register_sched(q);
 	queue_for_each_hw_ctx(q, hctx, i)
 		blk_mq_debugfs_register_sched_hctx(q, hctx);
+	mutex_unlock(&q->elevator_lock);
 	blk_debugfs_unlock(q, memflags);
 }
 

[Y]

static void debugfs_create_files(struct request_queue *q, struct dentry *parent,
				 void *data,
				 const struct blk_mq_debugfs_attr *attr)
{
	lockdep_assert_held(&q->debugfs_mutex);
	/*
	 * debugfs_mutex should not be nested under other locks that can be
	 * grabbed while queue is frozen.
	 */
	lockdep_assert_not_held(&q->elevator_lock);	<----
	lockdep_assert_not_held(&q->rq_qos_mutex);



* Re: [PATCH RFC 0/1] block: fix concurrent elevator change failure
  2026-06-16  1:20         ` Shin'ichiro Kawasaki
@ 2026-06-17 11:08           ` Nilay Shroff
  2026-06-18  8:04             ` Shin'ichiro Kawasaki
  0 siblings, 1 reply; 9+ messages in thread
From: Nilay Shroff @ 2026-06-17 11:08 UTC (permalink / raw)
  To: Shin'ichiro Kawasaki; +Cc: Ming Lei, linux-block, Jens Axboe

On 6/16/26 6:50 AM, Shin'ichiro Kawasaki wrote:
> On Jun 12, 2026 / 17:15, Nilay Shroff wrote:
>> On 6/12/26 4:36 PM, Ming Lei wrote:
>>> On Fri, Jun 12, 2026 at 06:47:50PM +0900, Shin'ichiro Kawasaki wrote:
>>>> On Jun 11, 2026 / 06:22, Ming Lei wrote:
>>>>> Hi Shin'ichiro,
>>>>
>>>> Hi Ming, thanks for the comments.
>>>>
>>>>>
>>>>> On Thu, Jun 11, 2026 at 04:41:59PM +0900, Shin'ichiro Kawasaki wrote:
>>>>>> I observed that the blktests test case block/005 hangs on a specific
>>>>>> server hardware using a specific HDD as a block device. During the test
>>>>>> case run, the kernel reported a KASAN null-ptr-deref (and other memory
>>>>>> corruption symptoms) [2]. This failure looked sporadic and hardware-
>>>>>> dependent.
>>>>>>
>>>>>> From the kernel message, I noticed that udev-worker wrote to the
>>>>>> queue/scheduler sysfs attribute to change the IO scheduler, or elevator.
>>>>>> The test case block/005 also wrote to the same sysfs attribute, which
>>>>>
>>>>> sysfs write is supposed to be serialized...
>>>>
>>>> I checked the sysfs write handler elv_iosched_store() in block/elevator.c.
>>>> I found elevator_change() call is guarded with the rw_semaphore
>>>> "set->update_nr_hwq_lock", but the guard is not the writer lock but the reader
>>>> lock. This does not serialize the sysfs writes.
>>>
>>> Please see kernfs_fop_write_iter(), in which mutex is held before calling
>>> ->write().
>>>
>> I think you're referring to @of->mutex here; however of->mutex is per struct
>> kernfs_open_file, which is associated with an open instance of the sysfs file.
>> The important point is that two separate opens can have different kernfs_open_file
>> instances and therefore different mutexes. Thus, concurrent write to same sysfs
>> attribute from two different processes may still be possible.
> 
> Thanks Nilay, I added debug prints to print @of->mutex address, and it observed
> the address is different for each process and each file open. So, I don't think
> sysfs write is serialized.
> 
>>
>>
>>>>
>>>> I tried the patch below to replace the reader lock with the writer lock. With
>>>> a quick trial, it looks working. The kernel message is no longer observed and
>>>> the new test case does not cause hangs. I will do further testing to confirm
>>>> that this change does not trigger other new lockdep WARNs. Assuming it does not
>>>> have such side effects, I hope this fix approach is acceptable. It doesn't add
>>>> the new lock, so I think it's the better.
>>>>
>>>> diff --git a/block/elevator.c b/block/elevator.c
>>>> index 3bcd37c2aa34..b03185a217ff 100644
>>>> --- a/block/elevator.c
>>>> +++ b/block/elevator.c
>>>> @@ -813,7 +813,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
>>>>    	 *   update_nr_hwq_lock -> kn->active (via del_gendisk -> kobject_del)
>>>>    	 *   kn->active -> update_nr_hwq_lock (via this sysfs write path)
>>>>    	 */
>>>> -	if (!down_read_trylock(&set->update_nr_hwq_lock)) {
>>>> +	if (!down_write_trylock(&set->update_nr_hwq_lock)) {
>>>>    		ret = -EBUSY;
>>>>    		goto out;
>>>>    	}
>>>> @@ -824,7 +824,7 @@ ssize_t elv_iosched_store(struct gendisk *disk, const char *buf,
>>>>    	} else {
>>>>    		ret = -ENOENT;
>>>>    	}
>>>> -	up_read(&set->update_nr_hwq_lock);
>>>> +	up_write(&set->update_nr_hwq_lock);
>>>>    out:
>>>>    	if (ctx.type)
>>>>
>>>> [...]
>>>>
>>>>> blk_mq_sched_reg_debugfs already includes debugfs lock, so I feel the proper
>>>>> fix could be check & avoid the null-ptr-deref.
>>>>
>>>> Actually, null-ptr-deref is one of the failure symptoms. KASAN slab-user-after
>>>> free is also observed [3]. Then I'm guessing adding null checks may not be
>>>> enough.
>>>>
>>>>> Adding new lock should be the last straw usually, especially this one is
>>>>> depended by queue freeze.
>>>>
>>>> Got it, thanks.
>>>>
>>>>
>>>> [3] KASAN slab-use-after-free
>>>
>>> Then you need to figure out the exact slab type and check if the pointer is cleared
>>> during free.
>>>
>>> Anyway, there is guard already, not see reason to add new lock for covering
>>> it.
>>>
>> Regarding the observed failure, my understanding is that blk_mq_debugfs_register_sched()
>> and blk_mq_debugfs_register_sched_hctx() access q->elevator without holding q->elevator_lock.
>> If multiple scheduler update paths run concurrently, one path can replace and free the
>> elevator while another path is still using it, which would explain the observed KASAN
>> use-after-free and NULL pointer dereference reports.
> 
> I have the same view. I think the use-after-free and the null-ptr-deref indicate
> that elevator_queue object address in q->elevator is the problem. The references
> of the object is also kept in the struct elv_change_ctx as ctx->old and
> ctx->new. These multiple references are used concurrently, then I'm not sure if
> adding pointer clears and null checks would fix the problem.
> 
>>
>> With the proposed change, upgrading update_nr_hwq_lock from a reader lock to a writer
>> lock in elv_iosched_store() would serialize concurrent scheduler updates and therefore
>> prevent multiple elevator switch operations from running at the same time.
>>
>> The another way to fix this might be to acquire q->elevator_lock in blk_mq_sched_reg_debugfs()
>> and thus serialize access to q->elevator in blk_mq_debugfs_register_sched() and
>> blk_mq_debugfs_register_sched_hctx().
> 
> Thanks for the idea. I tried the patch below [X], but it triggered WARN in
> debugfs_create_files() in block/blk-mq-debugfs.c [Y]. Then I'm afraid, this
> approach does not look working.
> 
> At this moment, the writer lock in elv_iosched_store() looks like the solution
> to me, but further comments on other solution possibility will be welcomed.
> 
> 
> [X]
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 0a00f5a76f5a..12c582b6c713 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -394,9 +394,11 @@ void blk_mq_sched_reg_debugfs(struct request_queue *q)
>   	unsigned long i;
>   
>   	memflags = blk_debugfs_lock(q);
> +	mutex_lock(&q->elevator_lock);
>   	blk_mq_debugfs_register_sched(q);
>   	queue_for_each_hw_ctx(q, hctx, i)
>   		blk_mq_debugfs_register_sched_hctx(q, hctx);
> +	mutex_unlock(&q->elevator_lock);
>   	blk_debugfs_unlock(q, memflags);
>   }
>   
> 
> [Y]
> 
> static void debugfs_create_files(struct request_queue *q, struct dentry *parent,
> 				 void *data,
> 				 const struct blk_mq_debugfs_attr *attr)
> {
> 	lockdep_assert_held(&q->debugfs_mutex);
> 	/*
> 	 * debugfs_mutex should not be nested under other locks that can be
> 	 * grabbed while queue is frozen.
> 	 */
> 	lockdep_assert_not_held(&q->elevator_lock);	<----
> 	lockdep_assert_not_held(&q->rq_qos_mutex);
> 

Yeah, I recall that assertion was added to avoid potential circular lockdep dependencies
when reclaim recurses back into the block layer. The concern is that ->elevator_lock and
->rq_qos_mutex can be acquired in code paths after the queue has been frozen. Consider
a scenario where one task freezes the queue and then attempts to acquire ->elevator_lock,
while another task already holds ->elevator_lock and subsequently triggers memory reclaim.
If reclaim recurses into the block layer, it may require forward progress on the same
frozen queue, which cannot happen until the freeze is lifted. This creates a circular
dependency involving queue freeze, reclaim, and ->elevator_lock (or ->rq_qos_mutex).

Given the above, I'm fine with the earlier approach of upgrading update_nr_hwq_lock from
a reader lock to a writer lock in elv_iosched_store(). That directly serializes concurrent
scheduler updates and avoids the race on q->elevator without introducing additional lock
ordering concerns.

Thanks,
--Nilay


* Re: [PATCH RFC 0/1] block: fix concurrent elevator change failure
  2026-06-17 11:08           ` Nilay Shroff
@ 2026-06-18  8:04             ` Shin'ichiro Kawasaki
  0 siblings, 0 replies; 9+ messages in thread
From: Shin'ichiro Kawasaki @ 2026-06-18  8:04 UTC (permalink / raw)
  To: Nilay Shroff; +Cc: Ming Lei, linux-block, Jens Axboe

On Jun 17, 2026 / 16:38, Nilay Shroff wrote:
[...]
> Given the above, I'm fine with the earlier approach of upgrading update_nr_hwq_lock from
> a reader lock to a writer lock in elv_iosched_store(). That directly serializes concurrent
> scheduler updates and avoids the race on q->elevator without introducing additional lock
> ordering concerns.

Thanks for the comment. I will prepare the "writer lock in elv_iosched_store()"
approach as a v2 patch.


end of thread, other threads:[~2026-06-18  8:04 UTC | newest]

Thread overview: 9+ messages
2026-06-11  7:41 [PATCH RFC 0/1] block: fix concurrent elevator change failure Shin'ichiro Kawasaki
2026-06-11  7:42 ` [PATCH RFC 1/1] block: serialize whole elevator change steps for the same queue Shin'ichiro Kawasaki
2026-06-11 11:22 ` [PATCH RFC 0/1] block: fix concurrent elevator change failure Ming Lei
2026-06-12  9:47   ` Shin'ichiro Kawasaki
2026-06-12 11:06     ` Ming Lei
2026-06-12 11:45       ` Nilay Shroff
2026-06-16  1:20         ` Shin'ichiro Kawasaki
2026-06-17 11:08           ` Nilay Shroff
2026-06-18  8:04             ` Shin'ichiro Kawasaki
