Linux block layer
* [PATCH RFC 0/1] block: fix concurrent elevator change failure
From: Shin'ichiro Kawasaki @ 2026-06-11  7:41 UTC
  To: linux-block, Jens Axboe; +Cc: Ming Lei, Nilay Shroff, Shin'ichiro Kawasaki

I observed that the blktests test case block/005 hangs on specific
server hardware with a specific HDD as the block device. During the test
case run, the kernel reported a KASAN null-ptr-deref (and other memory
corruption symptoms) [2]. The failure looked sporadic and
hardware-dependent.

From the kernel message, I noticed that udev-worker wrote to the
queue/scheduler sysfs attribute to change the IO scheduler, or elevator.
The test case block/005 also wrote to the same sysfs attribute, which
indicated that a concurrent elevator change caused the failure. I
created a new blktests test case that simply performs concurrent
elevator changes on a null_blk device [1]. It reproduces the failure
reliably on various server hardware.

Using the new test case, I bisected and found that the failure first
appears at commit 370ac285f23a ("block: avoid cpu_hotplug_lock
depedency on freeze_lock") in kernel tag v6.17-rc3. However, that
commit does not appear to explain the failure by itself: it changed the
queue freeze behavior and probably only exposed an existing race. Looking
back at the changes to elevator_change(), I think the actual cause is
commit 559dc11143eb ("block: move elv_register[unregister]_queue out of
elevator_lock") in kernel tag v6.16-rc1. This commit moved
elevator_change_done() out of the protection of ->elevator_lock and the
queue freeze. As a result, when two threads write to the same
queue/scheduler attribute concurrently, elevator_change_done() runs in
parallel, causing the memory corruption and the hang.

As an attempted fix, I created the patch in this series. It adds a new
mutex that serializes the whole elevator switch sequence, including the
elevator_change_done() call. I ran the reproducer with lockdep enabled
and confirmed that the patch avoids the failure and that no new WARN was
observed.
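
To illustrate the serialization idea, here is a minimal user-space sketch
using pthreads. It is not the patch itself: struct demo_queue, the demo_*
functions and the switch_lock field are hypothetical names made up for
this illustration, and the two "steps" only model the locked part of the
switch and the elevator_change_done() part. The point is that a single
mutex covers both steps, so two concurrent writers never interleave them.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a request queue; not the kernel's struct. */
struct demo_queue {
	pthread_mutex_t switch_lock;	/* serializes the whole switch sequence */
	int registered;			/* 1 while a scheduler is registered */
	int max_registered;		/* highest value 'registered' ever reached */
	int switches;			/* number of completed switches */
};

/* Step 1: tear down the old scheduler state (models the locked part). */
static void demo_switch(struct demo_queue *q)
{
	/* Exactly one scheduler must be registered before teardown. */
	assert(q->registered == 1);
	q->registered--;
}

/* Step 2: register the new state (models elevator_change_done()). */
static void demo_change_done(struct demo_queue *q)
{
	q->registered++;
	if (q->registered > q->max_registered)
		q->max_registered = q->registered;
	q->switches++;
}

/* One write to queue/scheduler: both steps run under the same mutex. */
static void demo_elevator_change(struct demo_queue *q)
{
	pthread_mutex_lock(&q->switch_lock);
	demo_switch(q);
	demo_change_done(q);
	pthread_mutex_unlock(&q->switch_lock);
}

static void *demo_writer(void *arg)
{
	for (int i = 0; i < 10000; i++)
		demo_elevator_change(arg);
	return NULL;
}

/* Run two concurrent writers; returns 0 if the invariant held. */
int demo_run(void)
{
	struct demo_queue q = { .registered = 1, .max_registered = 1 };
	pthread_t a, b;

	pthread_mutex_init(&q.switch_lock, NULL);
	pthread_create(&a, NULL, demo_writer, &q);
	pthread_create(&b, NULL, demo_writer, &q);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	pthread_mutex_destroy(&q.switch_lock);

	/* Serialized: one registration at any time and no lost updates. */
	return (q.registered == 1 && q.max_registered == 1 &&
		q.switches == 20000) ? 0 : -1;
}
```

Calling demo_run() from a small main() should return 0. If the mutex
covered only demo_switch(), the two writers could run demo_change_done()
in parallel, which is the kind of interleaving described above.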

However, the fix patch adds a new lock, and I'm not sure if it is the best
solution. Comments on the patch, or suggestions for a better solution,
would be appreciated.

[1] https://github.com/kawasaki/blktests/commit/4f8c63ed7d049f5e9c935c3fe00142b2a3629826

[2]

[30102.760660] [ T186170] run blktests block/005 at 2026-05-11 05:53:53
[30104.969837] [ T186111] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI
[30104.983590] [ T186111] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[30104.992929] [ T186111] CPU: 2 UID: 0 PID: 186111 Comm: (udev-worker) Not tainted 7.1.0-rc2-kts+ #1 PREEMPT(lazy)
[30105.004019] [ T186111] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015
[30105.013216] [ T186111] RIP: 0010:blk_mq_debugfs_register_sched+0x46/0x210
[30105.020667] [ T186111] Code: 48 89 fa 48 c1 ea 03 48 83 ec 10 80 3c 02 00 0f 85 83 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 8b 6b 08 48 89 ea 48 c1 ea 03 <80> 3c 02 00 0f 85 57 01 00 00 48 c7 c0 24 a3 b3 97 48 8b 6d 00 48
[30105.041036] [ T186111] RSP: 0018:ffff88816b9c7708 EFLAGS: 00010246
[30105.048111] [ T186111] RAX: dffffc0000000000 RBX: ffff888117f18000 RCX: 0000000000000000
[30105.057097] [ T186111] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff888117f18008
[30105.066086] [ T186111] RBP: 0000000000000000 R08: ffffffff957c47ac R09: fffffbfff2f6633c
[30105.075083] [ T186111] R10: ffff88816b9c7730 R11: 0000000000000001 R12: ffff88814c1f2000
[30105.084088] [ T186111] R13: ffff88814c1f2018 R14: ffff8881b8a336ac R15: ffffffff95bfae30
[30105.093111] [ T186111] FS:  00007fc1c7970c40(0000) GS:ffff8887c534e000(0000) knlGS:0000000000000000
[30105.103093] [ T186111] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[30105.110751] [ T186111] CR2: 000055fa37e182c0 CR3: 0000000108350003 CR4: 00000000001726f0
[30105.119796] [ T186111] Call Trace:
[30105.124154] [ T186111]  <TASK>
[30105.128301] [ T186111]  blk_mq_sched_reg_debugfs+0x8d/0x1a0
[30105.134193] [ T186111]  elevator_change_done+0x2f2/0x610
[30105.140037] [ T186111]  ? __pfx_elevator_change_done+0x10/0x10
[30105.146409] [ T186111]  ? __pfx_sysfs_kf_write+0x10/0x10
[30105.152246] [ T186111]  ? __pfx_sysfs_kf_write+0x10/0x10
[30105.158189] [ T186111]  elevator_change+0x283/0x4f0
[30105.163342] [ T186111]  ? __pfx_sysfs_kf_write+0x10/0x10
[30105.168932] [ T186111]  elv_iosched_store+0x30c/0x3a0
[30105.174265] [ T186111]  ? __pfx_elv_iosched_store+0x10/0x10
[30105.180797] [ T186111]  ? lock_acquire.part.0+0xb8/0x230
[30105.187066] [ T186111]  ? kernfs_fop_write_iter+0x25b/0x5e0
[30105.193594] [ T186111]  ? lock_acquire.part.0+0xb8/0x230
[30105.199931] [ T186111]  ? lock_acquire+0x126/0x140
[30105.205683] [ T186111]  ? __pfx_sysfs_kf_write+0x10/0x10
[30105.211924] [ T186111]  queue_attr_store+0x23f/0x360
[30105.217796] [ T186111]  ? __pfx_queue_attr_store+0x10/0x10
[30105.224180] [ T186111]  ? __lock_acquire+0x55d/0xbd0
[30105.230049] [ T186111]  ? lock_acquire.part.0+0xb8/0x230
[30105.236247] [ T186111]  ? sysfs_file_kobj+0x1d/0x1b0
[30105.242093] [ T186111]  ? find_held_lock+0x2b/0x80
[30105.247763] [ T186111]  ? __lock_release.isra.0+0x59/0x170
[30105.254122] [ T186111]  ? lock_release.part.0+0x1c/0x50
[30105.260226] [ T186111]  ? sysfs_file_kobj+0xb9/0x1b0
[30105.266048] [ T186111]  ? sysfs_kf_write+0x65/0x170
[30105.271778] [ T186111]  ? __pfx_sysfs_kf_write+0x10/0x10
[30105.277934] [ T186111]  kernfs_fop_write_iter+0x3da/0x5e0
[30105.284173] [ T186111]  ? __pfx_kernfs_fop_write_iter+0x10/0x10
[30105.290926] [ T186111]  vfs_write+0x524/0x1010
[30105.296215] [ T186111]  ? __pfx_vfs_write+0x10/0x10
[30105.301905] [ T186111]  ? kasan_quarantine_put+0xf5/0x240
[30105.308092] [ T186111]  ? kasan_quarantine_put+0xf5/0x240
[30105.314246] [ T186111]  ksys_write+0xff/0x200
[30105.319331] [ T186111]  ? __pfx_ksys_write+0x10/0x10
[30105.325007] [ T186111]  do_syscall_64+0xf4/0x1550
[30105.330407] [ T186111]  ? __pfx___x64_sys_openat+0x10/0x10
[30105.336566] [ T186111]  ? seccomp_run_filters+0xeb/0x560
[30105.342517] [ T186111]  ? do_syscall_64+0x1d7/0x1550
[30105.348096] [ T186111]  ? __seccomp_filter+0xa2/0x920
[30105.353749] [ T186111]  ? __pfx___seccomp_filter+0x10/0x10
[30105.359830] [ T186111]  ? trace_hardirqs_on_prepare+0x150/0x1a0
[30105.366344] [ T186111]  ? do_syscall_64+0x1b9/0x1550
[30105.371892] [ T186111]  ? do_syscall_64+0x1d7/0x1550
[30105.377422] [ T186111]  ? do_syscall_64+0x1d7/0x1550
[30105.382922] [ T186111]  ? do_syscall_64+0x1b9/0x1550
[30105.388401] [ T186111]  ? do_syscall_64+0x34/0x1550
[30105.393777] [ T186111]  ? do_syscall_64+0xab/0x1550
[30105.399129] [ T186111]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[30105.405624] [ T186111] RIP: 0033:0x7fc1c7c4fbbe
[30105.410647] [ T186111] Code: 4d 89 d8 e8 34 bd 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 13 ff ff ff 0f 1f 00 f3 0f 1e fa
[30105.431611] [ T186111] RSP: 002b:00007ffefd3bdd90 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[30105.440716] [ T186111] RAX: ffffffffffffffda RBX: 000055fa3f0f4b80 RCX: 00007fc1c7c4fbbe
[30105.449404] [ T186111] RDX: 000000000000000b RSI: 000055fa3ed9d550 RDI: 0000000000000015
[30105.458090] [ T186111] RBP: 00007ffefd3bdda0 R08: 0000000000000000 R09: 0000000000000000
[30105.466787] [ T186111] R10: 0000000000000000 R11: 0000000000000202 R12: 000000000000000b
[30105.475479] [ T186111] R13: 000000000000000b R14: 000055fa3ed9d550 R15: 000055fa3ed9d550
[30105.484182] [ T186111]  </TASK>
[30105.487920] [ T186111] Modules linked in: iscsi_target_mod tcm_loop target_core_pscsi target_core_file target_core_iblock xfs nft_masq nft_reject_ipv4 act_csum cls_u32 sch_htb nf_nat_tftp nf_conntrack_tftp bridge stp llc target_core_user target_core_mod rfkill nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security nf_tables ip6table_filter ip6_tables iptable_filter ip_tables qrtr intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt intel_pmc_bxt kvm_intel kvm irqbypass rapl sunrpc intel_cstate intel_uncore pcspkr i2c_i801 i2c_smbus mei_me igb lpc_ich mei ioatdma dca wmi binfmt_misc joydev acpi_power_meter acpi_pad btrfs raid6_pq xor ses enclosure loop dm_multipath nfnetlink zram lz4hc_compress lz4_compress
[30105.488278] [ T186111]  zstd_compress ast drm_client_lib i2c_algo_bit drm_shmem_helper drm_kms_helper mpt3sas drm mpi3mr raid_class scsi_transport_sas scsi_dh_rdac scsi_dh_emc scsi_dh_alua i2c_dev fuse [last unloaded: zonefs]
[30105.609649] [ T186111] ---[ end trace 0000000000000000 ]---
[30105.648290] [ T186111] pstore: backend (erst) writing error (-28)
[30105.654739] [ T186111] RIP: 0010:blk_mq_debugfs_register_sched+0x46/0x210
[30105.662519] [ T186111] Code: 48 89 fa 48 c1 ea 03 48 83 ec 10 80 3c 02 00 0f 85 83 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 8b 6b 08 48 89 ea 48 c1 ea 03 <80> 3c 02 00 0f 85 57 01 00 00 48 c7 c0 24 a3 b3 97 48 8b 6d 00 48
[30105.683653] [ T186111] RSP: 0018:ffff88816b9c7708 EFLAGS: 00010246
[30105.691248] [ T186111] RAX: dffffc0000000000 RBX: ffff888117f18000 RCX: 0000000000000000
[30105.700121] [ T186111] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff888117f18008
[30105.708841] [ T186111] RBP: 0000000000000000 R08: ffffffff957c47ac R09: fffffbfff2f6633c
[30105.717829] [ T186111] R10: ffff88816b9c7730 R11: 0000000000000001 R12: ffff88814c1f2000
[30105.726550] [ T186111] R13: ffff88814c1f2018 R14: ffff8881b8a336ac R15: ffffffff95bfae30
[30105.735306] [ T186111] FS:  00007fc1c7970c40(0000) GS:ffff8887c54ce000(0000) knlGS:0000000000000000
[30105.745003] [ T186111] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[30105.752368] [ T186111] CR2: 00007f251f9bc0e8 CR3: 0000000108350002 CR4: 00000000001726f0


Shin'ichiro Kawasaki (1):
  block: serialize whole elevator change steps for the same queue

 block/blk-core.c       | 1 +
 block/elevator.c       | 9 +++++++++
 include/linux/blkdev.h | 7 +++++++
 3 files changed, 17 insertions(+)

-- 
2.54.0



* Re: [PATCH v2 00/14] list: Prepare entry iterators to cache cursor state
From: Kaitao Cheng @ 2026-06-11  7:36 UTC
  To: Andy Shevchenko
  Cc: Christian König, Thierry Reding, Jonathan Hunter,
	Sowjanya Komatineni, Davidlohr Bueso, Paul E . McKenney,
	Josh Triplett, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Boqun Feng, Liam Girdwood, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Huang Rui, Eddie James, Mark Brown,
	Maxime Coquelin, Alexandre Torgue, Laxman Dewangan,
	Neil Armstrong, Robert Foss, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Laurent Pinchart,
	Jonas Karlman, Jernej Skrabec, Matthew Auld, Matthew Brost,
	Waiman Long, drbd-dev, linux-block, linux1394-devel, dri-devel,
	intel-gfx, linux-spi, linux-stm32, linux-arm-kernel, linux-tegra,
	linux-sound, linux-kernel, Andrew Morton, Randy Dunlap,
	Christian Brauner, David Howells, Luca Ceresoli, Kaito Cheng,
	Muchun Song, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Takashi Sakamoto,
	Andrzej Hajda, Jaroslav Kysela, Takashi Iwai
In-Reply-To: <aipbojSeMH-usARY@ashevche-desk.local>

On 2026/6/11 14:54, Andy Shevchenko wrote:
> On Thu, Jun 11, 2026 at 12:42:02PM +0800, Kaitao Cheng wrote:
>> On 2026/6/10 22:43, Andy Shevchenko wrote:
>>> On Wed, Jun 10, 2026 at 02:14:06PM +0800, Kaitao Cheng wrote:
>>>> On 2026/6/9 18:33, Christian König wrote:
>>>>> On 6/9/26 08:13, Kaitao Cheng wrote:
> 
>>>>>> This series prepares for, and then updates, the list_for_each_entry()
>>>>>> family so the common entry iterators cache their next or previous cursor
>>>>>> before the loop body runs.
>>>>>
>>>>> Why in the world would we want to do that?
>>>>>
>>>>> The safe and non-safe variants have very distinct use cases and that is completely intentional.
>>>>>
>>>>> What we could maybe improve is the documentation; in my experience, an astonishingly large number of people have misconceptions about the safe variants.
>>>>>
>>>>>> The first 13 patches open-code loops that intentionally depend on the
>>>>>> old "derive the next entry from the current cursor at the end of the
>>>>>> iteration" behaviour.  These loops append work to the list being walked,
>>>>>> restart traversal after dropping a lock, skip an entry consumed by the
>>>>>> current iteration, or otherwise adjust the cursor in the loop body.
>>>>>
>>>>> Well I have to clearly reject the changes for subsystems/components I'm maintaining, that just looks horrible to me and I clearly don't see a good reason for that.
>>>>
>>>> Hi Christian and Andy Shevchenko,
>>>>
>>>> Thanks for taking a look. I would like to clarify the point you raised.
>>>>
>>>> The reason I started looking at this is the original motivation behind
>>>> the _safe() variants.  They exist because some users need to remove, move
>>>> or otherwise consume the current entry while walking the list.  In that
>>>> case the next cursor has to be preserved before the loop body can modify
>>>> the current entry.
>>>>
>>>> The unfortunate part is that this could not be expressed with the
>>>> existing list_for_each_entry() interface without changing its calling
>>>> convention.  The _safe() variants had to grow an extra argument for the
>>>> temporary cursor, and that is why we ended up with a separate family of
>>>> macros.
>>>>
>>>> But conceptually, the distinction does not have to be exposed as two
>>>> different iterator families forever.  The difference is an implementation
>>>> detail: whether the iterator keeps the next/previous cursor before the
>>>> body runs.  This series makes the common list_for_each_entry() iterators
>>>> do that internally, so the safe and non-safe forms can effectively be
>>>> folded together, or at least the need for a separate public _safe()
>>>> interface becomes much weaker.
>>>>
>>>> There is also a usability issue with the current _safe() interface.  The
>>>> caller is forced to define a temporary cursor outside the macro and pass
>>>> it in, even though almost all users never use that cursor directly.  It is
>>>> just boilerplate required by the macro implementation.  I find that
>>>> redundant and awkward: the temporary cursor is an internal detail of the
>>>> iteration, but every caller has to spell it out.
>>>
>>> Ah, I think the distinct macro families are what we want.
>>> But the hiding of the parameter can be done inside list_for_each_*_safe().
>>> You can do a treewide change with coccinelle.
>>>
>>> Sorry if I didn't get the whole idea from your previous contributions.
>>>
>>> Note, even cases that would need a temporary cursor may be switched to
>>> new list_for_each_*_safe(), see how PCI macros for iterating over resources
>>> are implemented (include/linux/pci.h).
>>
>> Thanks for your suggestions. I've written a demo based on your feedback.
>> Could you please review it and share your thoughts on this approach?
> 
> Have you checked how many users actually need the temporary storage?

In Muchun's reply, he mentioned the following:

There are 9,925 list_for_each_entry() call sites in total. Among them,
9,919 do not require any adaptation, and only 6 need to be refactored.

As for list_for_each_entry_safe(), there are 4,572 callers. 4,550 of them
can be directly replaced by the new list_for_each_entry(), while 22 cannot
be replaced.

https://lore.kernel.org/all/2B3BFA1E-08B8-42AB-87D6-A28BF15E5C58@linux.dev/


I only used Coccinelle to scan for list_for_each_entry() call sites, and
found the 13 call sites shown in the current patch series, which cover
the 6 cases mentioned in Muchun's email. I have not yet run the Coccinelle
scan for list_for_each_entry_safe().

If we need to handle all 9,925 list_for_each_entry() call sites or all 4,572
list_for_each_entry_safe() call sites in one go, would such a change be too
large? I expect it would affect almost every kernel subsystem.

I wonder whether it would be better to first provide the necessary
compatibility APIs, and then let each subsystem owner update their code as
appropriate. That would make the impact more controlled, similar to how
the ongoing conversion from struct page to folios is being handled.
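
To make the discussion concrete, here is a minimal user-space sketch of
the idea of hiding the temporary cursor inside a _safe() style iterator.
It is only an illustration under stated assumptions: the list helpers are
simplified stand-ins for the kernel ones, demo_for_each_entry_safe() is a
hypothetical name that does not exist in include/linux/list.h, and it
relies on the GNU __typeof__ extension (GCC and Clang).

```c
#include <assert.h>
#include <stddef.h>

/* Minimal user-space stand-ins for the kernel's list primitives. */
struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = entry->prev = NULL;	/* poison: stale use would fault */
}

/*
 * Hypothetical macro: behaves like list_for_each_entry_safe(), but the
 * "next" cursor is declared inside the for statement instead of being
 * passed in by the caller.
 */
#define demo_for_each_entry_safe(pos, head, member)			\
	for (struct list_head *cur_ = (head)->next, *nxt_ = cur_->next;	\
	     cur_ != (head) &&						\
	     ((pos) = container_of(cur_, __typeof__(*(pos)), member), 1); \
	     cur_ = nxt_, nxt_ = cur_->next)

struct item {
	int value;
	struct list_head node;
};

/* Unlink every even-valued item while walking; return how many remain. */
int demo_remove_even(void)
{
	struct item items[6];
	struct list_head head;
	struct item *it;
	int remaining = 0;

	list_init(&head);
	for (int i = 0; i < 6; i++) {
		items[i].value = i;
		list_add_tail(&items[i].node, &head);
	}

	/* Deleting the current entry is fine: the next one is cached. */
	demo_for_each_entry_safe(it, &head, node)
		if (it->value % 2 == 0)
			list_del(&it->node);

	demo_for_each_entry_safe(it, &head, node)
		remaining++;

	return remaining;
}
```

With values 0..5 on the list, the three even entries are unlinked during
the walk and demo_remove_even() returns 3. The caller never declares the
temporary cursor, while the distinct safe and non-safe macro families can
stay as they are.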

>>>> With the updated list_for_each_entry() implementation, that extra cursor
>>>> can be kept inside the iterator itself.  Callers that only want to walk
>>>> the list, including callers that delete or consume the current entry, no
>>>> longer need to carry an otherwise-unused temporary variable just to make
>>>> the macro work.
>>>>
>>>>>> The final patch changes include/linux/list.h to keep a private cursor in
>>>>>> the common entry iterators while preserving the public macro interface.
>>>>>> The safe variants remain available when callers need the temporary
>>>>>> cursor explicitly or have stronger mutation requirements.
> 

-- 
Thanks
Kaitao Cheng



* Re: [PATCH v3] block: rust: fix `Send` bound for `GenDisk`
From: Miguel Ojeda @ 2026-06-11  7:32 UTC
  To: Yuan Tan
  Cc: a.hindborg, boqun, linux-block, rust-for-linux, zhiyunq, ardalan,
	pgovind2, dzueck, Yuan Tan
In-Reply-To: <20260611003220.3512652-1-yuantan098@gmail.com>

On Thu, Jun 11, 2026 at 2:32 AM Yuan Tan <yuantan098@gmail.com> wrote:
>
> I am not sure whether it is appropriate for me to take Andreas' patch and
> only adjust the trailers. Please correct me, and my apologies if this is
> not the right way to handle it.

It is OK to re-send a patch from someone else, but you need to make
sure you give the proper attribution etc.

In particular, you should keep the authorship from Andreas if he was
the author of the actual fix (though it sounds like Andreas is OK with
you as author due to v1 (?)), and you should also keep the existing
Signed-off-by (again, assuming you were actually picking his patch,
which I don't know if it is the case here), adding your own afterwards
because you carried the patch. And if you make any changes to the
patch, you are supposed to mention that too, inside square brackets,
etc.

Furthermore, if you reported the issue but you are not the author,
then you should use Reported-by for yourself too (even if you have a
Signed-off-by because you are re-sending his patch). Either way, a
Link tag to the original report would be nice if one is available.

In addition, the Fixes tag means there should likely be a Cc: stable
tag too, given the hash covers other stable releases, unless it
shouldn't be backported (in which case, it should be justified).

The document that explains these tags etc. is at:

  https://docs.kernel.org/process/submitting-patches.html

I hope that helps!

Cheers,
Miguel


* Re: [PATCH v2 00/14] list: Prepare entry iterators to cache cursor state
From: Andy Shevchenko @ 2026-06-11  6:54 UTC
  To: Kaitao Cheng
  Cc: Christian König, Thierry Reding, Jonathan Hunter,
	Sowjanya Komatineni, Davidlohr Bueso, Paul E . McKenney,
	Josh Triplett, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Boqun Feng, Liam Girdwood, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Huang Rui, Eddie James, Mark Brown,
	Maxime Coquelin, Alexandre Torgue, Laxman Dewangan,
	Neil Armstrong, Robert Foss, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Laurent Pinchart,
	Jonas Karlman, Jernej Skrabec, Matthew Auld, Matthew Brost,
	Waiman Long, drbd-dev, linux-block, linux1394-devel, dri-devel,
	intel-gfx, linux-spi, linux-stm32, linux-arm-kernel, linux-tegra,
	linux-sound, linux-kernel, Andrew Morton, Randy Dunlap,
	Christian Brauner, David Howells, Luca Ceresoli, Kaito Cheng,
	Muchun Song, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Takashi Sakamoto,
	Andrzej Hajda, Jaroslav Kysela, Takashi Iwai
In-Reply-To: <9b98e860-11df-44bf-9a95-3046d2c274a6@linux.dev>

On Thu, Jun 11, 2026 at 12:42:02PM +0800, Kaitao Cheng wrote:
> On 2026/6/10 22:43, Andy Shevchenko wrote:
> > On Wed, Jun 10, 2026 at 02:14:06PM +0800, Kaitao Cheng wrote:
> >> On 2026/6/9 18:33, Christian König wrote:
> >>> On 6/9/26 08:13, Kaitao Cheng wrote:

> >>>> This series prepares for, and then updates, the list_for_each_entry()
> >>>> family so the common entry iterators cache their next or previous cursor
> >>>> before the loop body runs.
> >>>
> >>> Why in the world would we want to do that?
> >>>
> >>> The safe and non-safe variants have very distinct use cases and that is completely intentional.
> >>>
> >>> What we could maybe improve is the documentation; in my experience, an astonishingly large number of people have misconceptions about the safe variants.
> >>>
> >>>> The first 13 patches open-code loops that intentionally depend on the
> >>>> old "derive the next entry from the current cursor at the end of the
> >>>> iteration" behaviour.  These loops append work to the list being walked,
> >>>> restart traversal after dropping a lock, skip an entry consumed by the
> >>>> current iteration, or otherwise adjust the cursor in the loop body.
> >>>
> >>> Well I have to clearly reject the changes for subsystems/components I'm maintaining, that just looks horrible to me and I clearly don't see a good reason for that.
> >>
> >> Hi Christian and Andy Shevchenko,
> >>
> >> Thanks for taking a look. I would like to clarify the point you raised.
> >>
> >> The reason I started looking at this is the original motivation behind
> >> the _safe() variants.  They exist because some users need to remove, move
> >> or otherwise consume the current entry while walking the list.  In that
> >> case the next cursor has to be preserved before the loop body can modify
> >> the current entry.
> >>
> >> The unfortunate part is that this could not be expressed with the
> >> existing list_for_each_entry() interface without changing its calling
> >> convention.  The _safe() variants had to grow an extra argument for the
> >> temporary cursor, and that is why we ended up with a separate family of
> >> macros.
> >>
> >> But conceptually, the distinction does not have to be exposed as two
> >> different iterator families forever.  The difference is an implementation
> >> detail: whether the iterator keeps the next/previous cursor before the
> >> body runs.  This series makes the common list_for_each_entry() iterators
> >> do that internally, so the safe and non-safe forms can effectively be
> >> folded together, or at least the need for a separate public _safe()
> >> interface becomes much weaker.
> >>
> >> There is also a usability issue with the current _safe() interface.  The
> >> caller is forced to define a temporary cursor outside the macro and pass
> >> it in, even though almost all users never use that cursor directly.  It is
> >> just boilerplate required by the macro implementation.  I find that
> >> redundant and awkward: the temporary cursor is an internal detail of the
> >> iteration, but every caller has to spell it out.
> > 
> > Ah, I think the distinct macro families are what we want.
> > But the hiding of the parameter can be done inside list_for_each_*_safe().
> > You can do a treewide change with coccinelle.
> > 
> > Sorry if I didn't get the whole idea from your previous contributions.
> > 
> > Note, even cases that would need a temporary cursor may be switched to
> > new list_for_each_*_safe(), see how PCI macros for iterating over resources
> > are implemented (include/linux/pci.h).
> 
> Thanks for your suggestions. I've written a demo based on your feedback.
> Could you please review it and share your thoughts on this approach?

Have you checked how many users actually need the temporary storage?

> >> With the updated list_for_each_entry() implementation, that extra cursor
> >> can be kept inside the iterator itself.  Callers that only want to walk
> >> the list, including callers that delete or consume the current entry, no
> >> longer need to carry an otherwise-unused temporary variable just to make
> >> the macro work.
> >>
> >>>> The final patch changes include/linux/list.h to keep a private cursor in
> >>>> the common entry iterators while preserving the public macro interface.
> >>>> The safe variants remain available when callers need the temporary
> >>>> cursor explicitly or have stronger mutation requirements.

-- 
With Best Regards,
Andy Shevchenko




* Re: [PATCH v3 1/4] crypto: skcipher - add per-tfm data_unit_size for batched requests
From: Herbert Xu @ 2026-06-11  5:07 UTC
  To: Leonid Ravich
  Cc: Alasdair Kergon, Ard Biesheuvel, Eric Biggers, Jens Axboe,
	Horia Geanta, Gilad Ben-Yossef, linux-crypto, dm-devel,
	linux-block
In-Reply-To: <20260601085644.13026-2-lravich@amazon.com>

On Mon, Jun 01, 2026 at 08:56:41AM +0000, Leonid Ravich wrote:
>
> diff --git a/crypto/skcipher.c b/crypto/skcipher.c
> index 2b31d1d5d268..bc37bd554aec 100644
> --- a/crypto/skcipher.c
> +++ b/crypto/skcipher.c
> @@ -432,13 +432,119 @@ int crypto_skcipher_setkey(struct crypto_skcipher *tfm, const u8 *key,
>  }
>  EXPORT_SYMBOL_GPL(crypto_skcipher_setkey);
>  
> +int crypto_skcipher_set_data_unit_size(struct crypto_skcipher *tfm,
> +				       unsigned int data_unit_size)
> +{
> +	unsigned int blocksize;
> +
> +	if (!data_unit_size) {
> +		tfm->data_unit_size = 0;
> +		return 0;
> +	}
> +
> +	if (!crypto_skcipher_supports_multi_data_unit(tfm))
> +		return -EOPNOTSUPP;
> +
> +	blocksize = crypto_skcipher_blocksize(tfm);
> +	if (data_unit_size < blocksize || data_unit_size % blocksize)
> +		return -EINVAL;
> +
> +	tfm->data_unit_size = data_unit_size;
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(crypto_skcipher_set_data_unit_size);

The unit size should be a per-request attribute, not per tfm.

> @@ -492,6 +517,66 @@ static inline unsigned int crypto_lskcipher_chunksize(
>  	return crypto_lskcipher_alg(tfm)->co.chunksize;
>  }
>  
> +/**
> + * crypto_skcipher_supports_multi_data_unit() - test multi-data-unit support
> + * @tfm: cipher handle
> + *
> + * Return: true if the algorithm advertises that it can process multiple
> + *	   data units in a single skcipher_request.
> + */
> +static inline bool
> +crypto_skcipher_supports_multi_data_unit(struct crypto_skcipher *tfm)
> +{
> +	return crypto_skcipher_alg_common(tfm)->base.cra_flags &
> +		CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT;
> +}

My preference is to always use multi-unit submission if the user
is capable of doing it.  The Crypto API should automatically divide
up the units if the underlying driver does not support it.
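
As a rough illustration of such an automatic split, here is a user-space
sketch in which the unit size travels with the request and a generic
fallback feeds the driver one data unit at a time. Everything named
demo_* is hypothetical and not part of the Crypto API; the size checks
simply mirror the ones in the quoted patch, and the XOR "cipher" only
exists to make the split observable.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-request descriptor; not the kernel's skcipher_request. */
struct demo_request {
	uint8_t *buf;			/* data to process in place */
	size_t len;			/* total length in bytes */
	size_t data_unit_size;		/* 0 means one unit spanning len */
	uint64_t first_unit;		/* index of the first data unit */
};

/* Hypothetical single-unit operation that a driver would provide. */
typedef int (*demo_unit_fn)(uint8_t *unit, size_t len, uint64_t unit_index);

/*
 * Generic fallback: validate the per-request unit size, then hand the
 * driver one data unit at a time.
 */
int demo_process(const struct demo_request *req, size_t blocksize,
		 demo_unit_fn one_unit)
{
	size_t unit;

	assert(req && one_unit);
	unit = req->data_unit_size ? req->data_unit_size : req->len;

	/* Same constraints as the quoted patch: a multiple of blocksize. */
	if (!blocksize || unit < blocksize || unit % blocksize)
		return -EINVAL;
	/* The request must contain a whole number of data units. */
	if (req->len % unit)
		return -EINVAL;

	for (size_t off = 0, i = 0; off < req->len; off += unit, i++) {
		int ret = one_unit(req->buf + off, unit, req->first_unit + i);

		if (ret)
			return ret;
	}
	return 0;
}

/* Toy "cipher" for the sketch: XOR each byte with the unit index. */
int demo_xor_unit(uint8_t *unit, size_t len, uint64_t unit_index)
{
	for (size_t i = 0; i < len; i++)
		unit[i] ^= (uint8_t)unit_index;
	return 0;
}
```

A caller would fill in a struct demo_request with, for example, a 64-byte
buffer, data_unit_size = 32 and first_unit = 5, and pass demo_xor_unit as
the driver callback; the callback then runs twice, once per 32-byte unit.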

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH 20/27] nbd: Enable lock context analysis
From: Nilay Shroff @ 2026-06-11  5:02 UTC
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Marco Elver, Christoph Hellwig,
	Josef Bacik
In-Reply-To: <ba18cfd5-0a52-4e2e-85a1-a9f0a20ff957@acm.org>

On 6/10/26 10:46 PM, Bart Van Assche wrote:
> On 6/10/26 1:02 AM, Nilay Shroff wrote:
>> Above changes are good, however I see nbd also uses @nbd_index_mutex
>> which guards @nbd_index_idr. So should we also annotate @nbd_index_idr
>> using __guarded_by(&nbd_index_mutex)?
> 
> How about adding these changes as an additional patch?
> 
> Thanks,
> 
> Bart.
> 
> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> index 345e4b73009d..b9e0ad0b3ca0 100644
> --- a/drivers/block/nbd.c
> +++ b/drivers/block/nbd.c
> @@ -49,8 +49,8 @@
>   #define CREATE_TRACE_POINTS
>   #include <trace/events/nbd.h>
> 
> -static DEFINE_IDR(nbd_index_idr);
>   static DEFINE_MUTEX(nbd_index_mutex);
> +static __guarded_by(&nbd_index_mutex) DEFINE_IDR(nbd_index_idr);
>   static struct workqueue_struct *nbd_del_wq;
>   static int nbd_total_devices = 0;
> 
> @@ -2739,7 +2739,9 @@ static void __exit nbd_cleanup(void)
>       /* Also wait for nbd_dev_remove_work() completes */
>       destroy_workqueue(nbd_del_wq);
> 
> -    idr_destroy(&nbd_index_idr);
> +    scoped_guard(mutex_init, &nbd_index_mutex)
> +        idr_destroy(&nbd_index_idr);
> +
>       unregister_blkdev(NBD_MAJOR, "nbd");
>   }
> 
> 
Looks good. But as I said earlier for the similar changes in the
loop driver, you may want to consider folding the above changes
into the current patch (instead of adding an additional patch),
since it already enables lock context analysis for the nbd
driver.

Thanks,
--Nilay



* Re: [PATCH 18/27] loop: Add lock context annotations
From: Nilay Shroff @ 2026-06-11  5:00 UTC
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Marco Elver, Nathan Chancellor
In-Reply-To: <63ffdebb-24f2-4842-8e65-53045d74dace@acm.org>

On 6/10/26 10:43 PM, Bart Van Assche wrote:
> On 6/10/26 2:21 AM, Nilay Shroff wrote:
>> One thing I noticed while looking through the loop driver is that it also defines
>> @loop_ctl_mutex, which protects @loop_index_idr. It might be worth annotating
>> @loop_index_idr with `__guarded_by(&loop_ctl_mutex)` as well so that Clang can
>> validate accesses to the IDR against the corresponding locking requirements.
> 
> I'm considering adding the changes below as an additional patch:
> 
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index ff7eff102c5a..30a2b2696368 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -90,8 +90,8 @@ struct loop_cmd {
>   #define LOOP_IDLE_WORKER_TIMEOUT (60 * HZ)
>   #define LOOP_DEFAULT_HW_Q_DEPTH 128
> 
> -static DEFINE_IDR(loop_index_idr);
>   static DEFINE_MUTEX(loop_ctl_mutex);
> +static __guarded_by(&loop_ctl_mutex) DEFINE_IDR(loop_index_idr);
>   static DEFINE_MUTEX(loop_validate_mutex);
> 
>   /**
> @@ -2326,6 +2326,8 @@ static void __exit loop_exit(void)
>       struct loop_device *lo;
>       int id;
> 
> +    guard(mutex_init)(&loop_ctl_mutex);
> +
>       unregister_blkdev(LOOP_MAJOR, "loop");
>       misc_deregister(&loop_misc);
> 
> 
Okay, looks good. Alternatively, you may consider folding the
above change into the existing patch, since it already adds
lock context annotations for the loop driver. Either approach
is fine with me: updating the current patch or adding a new
one.

Thanks,
--Nilay


* Re: [PATCH v2 00/14] list: Prepare entry iterators to cache cursor state
From: Kaitao Cheng @ 2026-06-11  4:42 UTC
  To: Andy Shevchenko, Christian König
  Cc: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
	Davidlohr Bueso, Paul E . McKenney, Josh Triplett, Peter Zijlstra,
	Ingo Molnar, Will Deacon, Boqun Feng, Liam Girdwood, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin, Huang Rui,
	Eddie James, Mark Brown, Maxime Coquelin, Alexandre Torgue,
	Laxman Dewangan, Neil Armstrong, Robert Foss, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Laurent Pinchart, Jonas Karlman, Jernej Skrabec, Matthew Auld,
	Matthew Brost, Waiman Long, drbd-dev, linux-block,
	linux1394-devel, dri-devel, intel-gfx, linux-spi, linux-stm32,
	linux-arm-kernel, linux-tegra, linux-sound, linux-kernel,
	Andrew Morton, Randy Dunlap, Christian Brauner, David Howells,
	Luca Ceresoli, Kaito Cheng, Muchun Song, Philipp Reisner,
	Lars Ellenberg, Christoph Böhmwalder, Jens Axboe,
	Takashi Sakamoto, Andrzej Hajda, Jaroslav Kysela, Takashi Iwai
In-Reply-To: <ail4AvzqAOXNaU6N@ashevche-desk.local>



On 2026/6/10 22:43, Andy Shevchenko wrote:
> On Wed, Jun 10, 2026 at 02:14:06PM +0800, Kaitao Cheng wrote:
>> On 2026/6/9 18:33, Christian König wrote:
>>> On 6/9/26 08:13, Kaitao Cheng wrote:
>>>>
>>>> This series prepares for, and then updates, the list_for_each_entry()
>>>> family so the common entry iterators cache their next or previous cursor
>>>> before the loop body runs.
>>>
>>> Why in the world would we want to do that?
>>>
>>> The safe and non-safe variants have very distinct use cases and that is completely intentional.
>>>
>>> What we could maybe improve is the documentation; in my experience an astonishingly large number of people have misconceptions about the safe variants.
>>>
>>>> The first 13 patches open-code loops that intentionally depend on the
>>>> old "derive the next entry from the current cursor at the end of the
>>>> iteration" behaviour.  These loops append work to the list being walked,
>>>> restart traversal after dropping a lock, skip an entry consumed by the
>>>> current iteration, or otherwise adjust the cursor in the loop body.
>>>
>>> Well I have to clearly reject the changes for subsystems/components I'm maintaining, that just looks horrible to me and I clearly don't see a good reason for that.
>>
>> Hi Christian and Andy Shevchenko,
>>
>> Thanks for taking a look. I would like to clarify the point you raised.
>>
>> The reason I started looking at this is the original motivation behind
>> the _safe() variants.  They exist because some users need to remove, move
>> or otherwise consume the current entry while walking the list.  In that
>> case the next cursor has to be preserved before the loop body can modify
>> the current entry.
>>
>> The unfortunate part is that this could not be expressed with the
>> existing list_for_each_entry() interface without changing its calling
>> convention.  The _safe() variants had to grow an extra argument for the
>> temporary cursor, and that is why we ended up with a separate family of
>> macros.
>>
>> But conceptually, the distinction does not have to be exposed as two
>> different iterator families forever.  The difference is an implementation
>> detail: whether the iterator keeps the next/previous cursor before the
>> body runs.  This series makes the common list_for_each_entry() iterators
>> do that internally, so the safe and non-safe forms can effectively be
>> folded together, or at least the need for a separate public _safe()
>> interface becomes much weaker.
>>
>> There is also a usability issue with the current _safe() interface.  The
>> caller is forced to define a temporary cursor outside the macro and pass
>> it in, even though almost all users never use that cursor directly.  It is
>> just boilerplate required by the macro implementation.  I find that
>> redundant and awkward: the temporary cursor is an internal detail of the
>> iteration, but every caller has to spell it out.
> 
> Ah, I think the distinct macro families are what we want.
> But the hiding of the parameter can be done inside list_for_each_*_safe().
> You can do a treewide change with coccinelle.
> 
> Sorry if I didn't get the whole idea from your previous contributions.
> 
> Note, even cases that would need a temporary cursor may be switched to
> new list_for_each_*_safe(), see how PCI macros for iterating over resources
> are implemented (include/linux/pci.h).

Thanks for your suggestions. I've written a demo based on your feedback.
Could you please review it and share your thoughts on this approach?


diff --git a/include/linux/list.h b/include/linux/list.h
index 9df84a56a789..306554ab1841 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -7,6 +7,7 @@
 #include <linux/stddef.h>
 #include <linux/poison.h>
 #include <linux/const.h>
+#include <linux/args.h>

 #include <asm/barrier.h>

@@ -911,20 +912,34 @@ static inline size_t list_count_nodes(struct list_head *head)
        for (; !list_entry_is_head(pos, head, member);                  \
             pos = list_prev_entry(pos, member))

+#define __list_for_each_entry_safe_internal(pos, next, head, member)   \
+       for (typeof(pos) next = list_next_entry(pos =                   \
+               list_first_entry(head, typeof(*pos), member), member);  \
+            !list_entry_is_head(pos, head, member);                    \
+            pos = next, next = list_next_entry(next, member))
+
+#define __list_for_each_entry_safe2(pos, head, member)                 \
+       __list_for_each_entry_safe_internal(pos, __UNIQUE_ID(next), head, member)
+
+#define __list_for_each_entry_safe3(pos, next, head, member)           \
+       for (pos = list_first_entry(head, typeof(*pos), member),        \
+               next = list_next_entry(pos, member);                    \
+            !list_entry_is_head(pos, head, member);                    \
+            pos = next, next = list_next_entry(next, member))
+
 /**
  * list_for_each_entry_safe - iterate over list of given type safe against removal of list entry
  * @pos:       the type * to use as a loop cursor.
- * @n:         another type * to use as temporary storage
- * @head:      the head for your list.
- * @member:    the name of the list_head within the struct.
+ * @...:       either (head, member) or (next, head, member)
+ *     @next:  another type * to use as optional temporary storage. The temporary
+ *             cursor is internal unless explicitly supplied by the caller.
+ *     @head:  the head for your list.
+ *     @member:the name of the list_head within the struct.
  *
  */
-#define list_for_each_entry_safe(pos, n, head, member)                 \
-       for (pos = list_first_entry(head, typeof(*pos), member),        \
-               n = list_next_entry(pos, member);                       \
-            !list_entry_is_head(pos, head, member);                    \
-            pos = n, n = list_next_entry(n, member))
+#define list_for_each_entry_safe(pos, ...)                             \
+       CONCATENATE(__list_for_each_entry_safe, COUNT_ARGS(__VA_ARGS__))\
+               (pos, __VA_ARGS__)

 /**
  * list_for_each_entry_safe_continue - continue list iteration safe against removal
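
For readers who want to try the idea outside the kernel tree, here is a
minimal userspace sketch of the same dispatch: the macro counts its
trailing arguments and selects either the hidden-cursor form or the
explicit-cursor form. Every demo_* and DEMO_* name is hypothetical and only
stands in for the kernel helpers (list_first_entry(), list_next_entry(),
COUNT_ARGS(), CONCATENATE(), __UNIQUE_ID()); it relies on the GCC/Clang
extensions __typeof__ and __COUNTER__ and is not the proposed patch itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-ins for the kernel list helpers (names are hypothetical). */
struct list_head { struct list_head *next, *prev; };

static void demo_list_init(struct list_head *h) { h->next = h->prev = h; }

static void demo_list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void demo_list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

#define demo_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define demo_next(pos, member) \
	demo_entry((pos)->member.next, __typeof__(*(pos)), member)

/* Count two or three trailing arguments and paste the count onto a prefix. */
#define DEMO_COUNT_(a, b, c, n, ...) n
#define DEMO_COUNT(...) DEMO_COUNT_(__VA_ARGS__, 3, 2, 1)
#define DEMO_CAT_(a, b) a##b
#define DEMO_CAT(a, b) DEMO_CAT_(a, b)

/* Lookahead cursor declared in the for-init, invisible to the caller. */
#define demo_safe_hidden(pos, nxt, head, member)                          \
	for (__typeof__(pos) nxt = (pos = demo_entry((head)->next,         \
				__typeof__(*(pos)), member),               \
				demo_next(pos, member));                   \
	     &(pos)->member != (head);                                     \
	     pos = nxt, nxt = demo_next(nxt, member))

/* Two trailing arguments: generate a unique name for the hidden cursor. */
#define demo_safe_n2(pos, head, member) \
	demo_safe_hidden(pos, DEMO_CAT(demo_nxt_, __COUNTER__), head, member)

/* Three trailing arguments: the caller supplies the cursor, as today. */
#define demo_safe_n3(pos, nxt, head, member)                              \
	for (pos = demo_entry((head)->next, __typeof__(*(pos)), member),   \
	     nxt = demo_next(pos, member);                                 \
	     &(pos)->member != (head);                                     \
	     pos = nxt, nxt = demo_next(nxt, member))

#define demo_for_each_safe(pos, ...) \
	DEMO_CAT(demo_safe_n, DEMO_COUNT(__VA_ARGS__))(pos, __VA_ARGS__)

struct demo_item { int value; struct list_head node; };

/*
 * Build 1..5, delete the even items with the hidden cursor, then free the
 * rest with an explicit cursor. Returns remaining * 10 + freed.
 */
static int demo_run(void)
{
	struct list_head head;
	struct demo_item *it, *tmp;
	int remaining = 0, freed = 0;

	demo_list_init(&head);
	for (int v = 1; v <= 5; v++) {
		it = malloc(sizeof(*it));
		if (!it)
			return -1;
		it->value = v;
		demo_list_add_tail(&it->node, &head);
	}

	demo_for_each_safe(it, &head, node) {		/* two-argument form */
		if (it->value % 2 == 0) {
			demo_list_del(&it->node);
			free(it);
		} else {
			remaining++;
		}
	}

	demo_for_each_safe(it, tmp, &head, node) {	/* three-argument form */
		demo_list_del(&it->node);
		free(it);
		freed++;
	}
	return remaining * 10 + freed;
}
```

Both call sites use the same macro name; only the argument count differs, so
existing three-argument callers would keep compiling unchanged.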

>> With the updated list_for_each_entry() implementation, that extra cursor
>> can be kept inside the iterator itself.  Callers that only want to walk
>> the list, including callers that delete or consume the current entry, no
>> longer need to carry an otherwise-unused temporary variable just to make
>> the macro work.
>>
>>>> The final patch changes include/linux/list.h to keep a private cursor in
>>>> the common entry iterators while preserving the public macro interface.
>>>> The safe variants remain available when callers need the temporary
>>>> cursor explicitly or have stronger mutation requirements.
> 
> 

-- 
Thanks
Kaitao Cheng


^ permalink raw reply related

* Re: [PATCH] iomap: enforce DIO alignment check in iomap]
From: Ming Lei @ 2026-06-11  2:49 UTC (permalink / raw)
  To: Keith Busch; +Cc: Carlos Maiolino, brauner, linux-block
In-Reply-To: <ainBCDneRqNvmMT_@kbusch-mbp>

On Wed, Jun 10, 2026 at 01:54:48PM -0600, Keith Busch wrote:
> On Wed, Jun 10, 2026 at 08:19:53PM +0200, Carlos Maiolino wrote:
> > On Wed, Jun 10, 2026 at 11:14:30AM -0600, Keith Busch wrote:
> > > 
> > > It does require that someone calls the bio split-to-limits routine,
> > > which I had taken for granted as a given, but I realize that some
> > > drivers don't do that. What block device are you using for your test?
> > 
> > In the PPC machine, it's a virtual scsi vdasd device from one of the
> > virtual nodes
> > 
> > NAME HCTL       TYPE VENDOR   MODEL  REV SERIAL                             TRAN
> > sda  0:0:1:0    disk AIX      VDASD 0001 000a508a00007a0000000175dcba35ac.5
> > 
> > ibmvfc                262144  0
> > ibmvscsi              196608  2
> > 
> > For my x86 machine (recall that I reduced the buffer size to 512 on x86), it's
> > a commodity sata samsung SSD:
> 
> Okay, these are under blk-mq so always call __bio_split_to_limits.
> However, I see there's an optimization to skip the checks we're
> depending on if bio_may_need_split doesn't think it needs to be split,
> which is a problem for your observation. I don't think the current
> expectations can allow us to take this optimization anymore when page
> offsets are used.
> 
> This should fix it:
> 
> ---
> diff --git a/block/blk.h b/block/blk.h
> index 1a2d9101bba04..3731f3c5ed140 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -404,7 +404,7 @@ static inline bool bio_may_need_split(struct bio *bio,
>  	bv = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
>  	if (bio->bi_iter.bi_size > bv->bv_len - bio->bi_iter.bi_bvec_done)
>  		return true;
> -	return bv->bv_len + bv->bv_offset > lim->max_fast_segment_size;
> +	return bv->bv_offset || bv->bv_len > lim->max_fast_segment_size;
>  }

This should work for the unaligned DMA buffer, but it might hurt
performance for any sub-page IO.

Given that you have moved DIO buffer alignment validation into bio
splitting, it should be fine to check ->dma_alignment here by putting the
three limits fields into the same cache line.
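
To illustrate the difference, here is a simplified userspace model of a
fast-path check that consults the alignment mask instead of treating every
non-zero offset as "needs split". The struct and function names are
hypothetical and the fields are reduced to what the check needs; this is a
sketch of one possible reading, not the actual bio_may_need_split() code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins; not the kernel's real structures. */
struct model_bvec {
	unsigned int len;	/* bytes in this segment */
	unsigned int offset;	/* offset of the data within its page */
	unsigned int done;	/* bytes of this segment already consumed */
};

struct model_limits {
	unsigned int max_fast_segment_size;
	unsigned int dma_alignment;	/* mask, e.g. 511 for 512-byte alignment */
};

/* Returns true when the bio must take the slow path that validates it. */
static bool model_may_need_split(unsigned int bio_size,
				 const struct model_bvec *bv,
				 const struct model_limits *lim)
{
	if (bio_size > bv->len - bv->done)
		return true;
	/* Only a misaligned buffer is forced onto the slow path. */
	if (bv->offset & lim->dma_alignment)
		return true;
	return bv->len + bv->offset > lim->max_fast_segment_size;
}

/* Returns 1 when aligned sub-page IO stays fast and misaligned IO does not. */
static int model_check(void)
{
	struct model_limits lim = { 4096, 511 };
	struct model_bvec aligned = { 512, 512, 0 };
	struct model_bvec misaligned = { 512, 100, 0 };

	return !model_may_need_split(512, &aligned, &lim) &&
	       model_may_need_split(512, &misaligned, &lim);
}
```

In this model a 512-byte IO at page offset 512 keeps the fast path, whereas a
plain "offset != 0" test would send it to the slow path.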


Thanks,
Ming

^ permalink raw reply

* [PATCH v3] block: rust: fix `Send` bound for `GenDisk`
From: Yuan Tan @ 2026-06-11  0:32 UTC (permalink / raw)
  To: a.hindborg
  Cc: boqun, linux-block, rust-for-linux, zhiyunq, ardalan, pgovind2,
	dzueck, yuantan098, Yuan Tan

From: Yuan Tan <ytan089@ucr.edu>

The `Send` implementation for `GenDisk<T>` was conditioned on `T: Send`.
This constrains the wrong type. `T` is the `Operations` implementation,
which is typically a zero-sized marker type that carries no data, so `T:
Send` says nothing about whether the data a `GenDisk` actually owns can be
moved to another thread.

A `GenDisk<T>` owns the queue data `T::QueueData` (stored as the
`gendisk`'s `queuedata` and dropped when the `GenDisk` is dropped) and an
`Arc<TagSet<T>>`. These are the values transferred when a `GenDisk` is sent
across a thread boundary, so the `Send` bound must constrain exactly them.
Bound `T::QueueData: Send` and `Arc<TagSet<T>>: Send` instead.

Fixes: 3253aba3408a ("rust: block: introduce `kernel::block::mq` module")
Reported-by: Priya Bala Govindasamy <pgovind2@uci.edu>
Reported-by: Dylan Zueck <dzueck@uci.edu>
Suggested-by: Andreas Hindborg <a.hindborg@kernel.org>
Signed-off-by: Yuan Tan <ytan089@ucr.edu>
---

Changes in v3:
  - Add Priya and Dylan's names to the `Reported-by` tags
Link to v2:
  - https://lore.kernel.org/all/20260609-rnull-v6-19-rc5-send-v2-1-82c7404542e2@kernel.org/
Link to v1:
  - https://lore.kernel.org/all/cover.1780633578.git.ytan089@ucr.edu/

I am a bit unsure how to handle this v3.

The change in this v3 is adding the missing trailers.
Andreas' v2 already addresses the TagSet issue from my v1, and his commit
message is also more appropriate. Therefore this v3 has no changes other
than the trailers.

I am not sure whether it is appropriate for me to take Andreas' patch and
only adjust the trailers. Please correct me, and my apologies if this is
not the right way to handle it.

 rust/kernel/block/mq/gen_disk.rs | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/rust/kernel/block/mq/gen_disk.rs b/rust/kernel/block/mq/gen_disk.rs
index 912cb805caf5..b36d24382cc3 100644
--- a/rust/kernel/block/mq/gen_disk.rs
+++ b/rust/kernel/block/mq/gen_disk.rs
@@ -199,8 +199,14 @@ pub struct GenDisk<T: Operations> {
 }
 
 // SAFETY: `GenDisk` is an owned pointer to a `struct gendisk` and an `Arc` to a
-// `TagSet` It is safe to send this to other threads as long as T is Send.
-unsafe impl<T: Operations + Send> Send for GenDisk<T> {}
+// `TagSet`. It is safe to send this to other threads as long as these two are `Send`.
+unsafe impl<T> Send for GenDisk<T>
+where
+    T: Operations,
+    T::QueueData: Send,
+    Arc<TagSet<T>>: Send,
+{
+}
 
 impl<T: Operations> Drop for GenDisk<T> {
     fn drop(&mut self) {
-- 
2.43.2


^ permalink raw reply related

* Re: [LSF/MM/BPF TOPIC] Memory fragmentation with large block sizes
From: Karim Manaouil @ 2026-06-10 22:27 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: lsf-pc, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, linux-mm
In-Reply-To: <f22caf98-1375-493a-a275-0500ffac3e81@suse.de>

On Thu, Feb 19, 2026 at 10:54:48AM +0100, Hannes Reinecke wrote:
> Hi all,
> 
> I (together with the Czech Technical University) did some experiments trying
> to measure memory fragmentation with large block sizes.
> Testbed used was an nvme setup talking to a nvmet storage over
> the network.
> 
> Doing so raised some challenges:
> 
> - How do you _generate_ memory fragmentation? The MM subsystem is
>   precisely geared up to avoid it, so you would need to come up
>   with some idea how to defeat it. With the help from Willy I managed
>   to come up with something, but I really would like to discuss
>   what would be the best option here.

thpchallenge from mmtests has been a staple for the compaction/anti
fragmentation folks.

And check this https://patchwork.freedesktop.org/patch/716404/?series=164353&rev=1

Btw, do you mind sharing what workloads you discussed with Matthew?

> - What is acceptable memory fragmentation? Are we good enough if the
>   measured fragmentation does not grow during the test runs?
> - Do we have better visibility into memory fragmentation other than
>   just reading /proc/buddyinfo?
> 
> And, of course, I would like to present (and discuss) the results
> of the testruns done on 4k, 8k, and 16k blocksizes.
> 
> Not sure if this should be a storage or MM topic; I'll let the
> lsf-pc decide.
> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke                  Kernel Storage Architect
> hare@suse.de                                +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
> 

-- 
~karim

^ permalink raw reply

* Re: [PATCH] iomap: enforce DIO alignment check in iomap]
From: Keith Busch @ 2026-06-10 19:54 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: brauner, linux-block
In-Reply-To: <aimn3kipMHdRmTTe@nidhogg.toxiclabs.cc>

On Wed, Jun 10, 2026 at 08:19:53PM +0200, Carlos Maiolino wrote:
> On Wed, Jun 10, 2026 at 11:14:30AM -0600, Keith Busch wrote:
> > 
> > It does require that someone calls the bio split-to-limits routine,
> > which I had taken for granted as a given, but I realize that some
> > drivers don't do that. What block device are you using for your test?
> 
> In the PPC machine, it's a virtual scsi vdasd device from one of the
> virtual nodes
> 
> NAME HCTL       TYPE VENDOR   MODEL  REV SERIAL                             TRAN
> sda  0:0:1:0    disk AIX      VDASD 0001 000a508a00007a0000000175dcba35ac.5
> 
> ibmvfc                262144  0
> ibmvscsi              196608  2
> 
> For my x86 machine (recall that I reduced the buffer size to 512 on x86), it's
> a commodity sata samsung SSD:

Okay, these are under blk-mq so always call __bio_split_to_limits.
However, I see there's an optimization to skip the checks we're
depending on if bio_may_need_split doesn't think it needs to be split,
which is a problem for your observation. I don't think the current
expectations can allow us to take this optimization anymore when page
offsets are used.

This should fix it:

---
diff --git a/block/blk.h b/block/blk.h
index 1a2d9101bba04..3731f3c5ed140 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -404,7 +404,7 @@ static inline bool bio_may_need_split(struct bio *bio,
 	bv = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
 	if (bio->bi_iter.bi_size > bv->bv_len - bio->bi_iter.bi_bvec_done)
 		return true;
-	return bv->bv_len + bv->bv_offset > lim->max_fast_segment_size;
+	return bv->bv_offset || bv->bv_len > lim->max_fast_segment_size;
 }
 
 /**
--

^ permalink raw reply related

* [PATCH] rnbd-clt: Use common error handling code in rnbd_get_iu()
From: Markus Elfring @ 2026-06-10 19:03 UTC (permalink / raw)
  To: linux-block, Jack Wang, Jens Axboe, Md. Haris Iqbal; +Cc: LKML, kernel-janitors

From: Markus Elfring <elfring@users.sourceforge.net>
Date: Wed, 10 Jun 2026 20:58:47 +0200

Use an additional label so that a bit of exception handling can be better
reused at the end of an if branch.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
---
 drivers/block/rnbd/rnbd-clt.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/block/rnbd/rnbd-clt.c b/drivers/block/rnbd/rnbd-clt.c
index 4d6725a0035e..d8e3f145ee2f 100644
--- a/drivers/block/rnbd/rnbd-clt.c
+++ b/drivers/block/rnbd/rnbd-clt.c
@@ -329,10 +329,8 @@ static struct rnbd_iu *rnbd_get_iu(struct rnbd_clt_session *sess,
 		return NULL;
 
 	permit = rnbd_get_permit(sess, con_type, wait);
-	if (!permit) {
-		kfree(iu);
-		return NULL;
-	}
+	if (!permit)
+		goto free_iu;
 
 	iu->permit = permit;
 	/*
@@ -349,6 +347,7 @@ static struct rnbd_iu *rnbd_get_iu(struct rnbd_clt_session *sess,
 
 	if (sg_alloc_table(&iu->sgt, 1, GFP_KERNEL)) {
 		rnbd_put_permit(sess, permit);
+free_iu:
 		kfree(iu);
 		return NULL;
 	}
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH] iomap: enforce DIO alignment check in iomap]
From: Carlos Maiolino @ 2026-06-10 18:19 UTC (permalink / raw)
  To: Keith Busch; +Cc: brauner, linux-block
In-Reply-To: <aimbdmqC10uXrZuo@kbusch-mbp>

On Wed, Jun 10, 2026 at 11:14:30AM -0600, Keith Busch wrote:
> On Wed, Jun 10, 2026 at 06:58:13PM +0200, Carlos Maiolino wrote:
> > On Wed, Jun 10, 2026 at 09:37:17AM -0600, Keith Busch wrote:
> > > On Wed, Jun 10, 2026 at 05:27:42PM +0200, Carlos Maiolino wrote:
> > > > The DIO alignment check has been lifted from iomap layer to rely on the
> > > > block layer to enforce proper alignment when issuing direct IO
> > > > operations. This though, depending on the IO size and buffer address
> > > > passed to the IO operation may lead to user-visible behavior change.
> > > > 
> > > > This has been caught initially by LTP test diotest4 running on
> > > > PPC architecture, where the test fails because a read() operation
> > > > with a supposedly misaligned buffer succeeds instead of an expected
> > > > -EINVAL.
> > > 
> > > It's not supposed to matter where in the stack we determined this be an
> > > invalid request: it should still fail if it's misaligned. Could you
> > > clarify how this is succeeding?
> > 
> > Fair enough, can you point me to where the alignment is supposed to be
> > checked? 
> 
> https://elixir.bootlin.com/linux/v7.1-rc7/source/block/blk-merge.c#L352
> 
> It does require that someone calls the bio split-to-limits routine,
> which I had taken for granted as a given, but I realize that some
> drivers don't do that. What block device are you using for your test?

In the PPC machine, it's a virtual scsi vdasd device from one of the
virtual nodes

NAME HCTL       TYPE VENDOR   MODEL  REV SERIAL                             TRAN
sda  0:0:1:0    disk AIX      VDASD 0001 000a508a00007a0000000175dcba35ac.5

ibmvfc                262144  0
ibmvscsi              196608  2

For my x86 machine (recall that I reduced the buffer size to 512 on x86), it's
a commodity sata samsung SSD:

sdb  6:0:0:0    disk ATA      Samsung SSD 860 EVO 1TB RVT02B6Q S3Z9NB0KB82416J   sata

Both with an ext4 straight on top of a partition (no device-mapper or
any other volume layer in between), although I also tried with a
linear device-mapper (lvm) device and the results seemed the same.

Those are the two machines where I have had the most reliable results. My
laptop with a Samsung NVMe drive makes read() return -EIO, but I can't tell
yet whether that is a device failure or just the wrong error being returned.

^ permalink raw reply

* Re: [PATCH 21/27] null_blk: Enable lock context analysis
From: Bart Van Assche @ 2026-06-10 18:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Marco Elver, Keith Busch, Damien Le Moal,
	Chaitanya Kulkarni, Johannes Thumshirn, Nilay Shroff,
	Genjian Zhang, Kees Cook
In-Reply-To: <aij2yP1Y1Dven4Bz@infradead.org>

On 6/9/26 10:31 PM, Christoph Hellwig wrote:
> On Tue, Jun 09, 2026 at 03:05:08PM -0700, Bart Van Assche wrote:
>> Add __must_hold() annotations where these are missing. Annotate two
>> functions that use conditional locking with __context_unsafe().
> 
> Please explain why that is needed and there is no better way to have
> proper annotations.  Both here and in a comment in the code.

Hi Christoph,

Is the change shown below considered acceptable? It converts the
null_lock_zone() and null_unlock_zone() calls into scoped_guard()
and thereby eliminates the risk that any null_lock_zone() call
would not be paired properly with a null_unlock_zone() call. The
DEFINE_CLASS() macro below does the following:
* Before any code block that is protected by scoped_guard(null_zone)
   or guard(null_zone), call null_lock_zone() and create a local variable
   with type struct nullb_dev_and_zone. That local variable is declared
   with the __cleanup__ attribute.
* The __cleanup__ attribute causes null_unlock_zone() to be called with
   the same arguments as null_lock_zone().
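
The mechanism described in the two bullets above can be tried in userspace
with the GCC/Clang cleanup attribute. The sketch below uses hypothetical
demo_* names and a counter in place of a real lock; it only shows that the
unlock runs on every exit from the scope, including early returns, and it
is not the DEFINE_CLASS() expansion itself.

```c
#include <assert.h>

/* Hypothetical stand-in for a zone; depth counts current lock holders. */
struct demo_zone { int depth; int wp; };

static void demo_lock(struct demo_zone *z)   { z->depth++; }
static void demo_unlock(struct demo_zone *z) { z->depth--; }

/* The guard object remembers what to unlock when it goes out of scope. */
struct demo_guard { struct demo_zone *zone; };

static inline struct demo_guard demo_guard_enter(struct demo_zone *z)
{
	demo_lock(z);
	return (struct demo_guard){ z };
}

static inline void demo_guard_exit(struct demo_guard *g)
{
	demo_unlock(g->zone);
}

#define DEMO_GUARD(z)							\
	struct demo_guard demo_scope					\
		__attribute__((cleanup(demo_guard_exit), unused)) =	\
		demo_guard_enter(z)

/* Advance the write pointer; returns the lock depth seen while held. */
static int demo_advance(struct demo_zone *z, int n)
{
	DEMO_GUARD(z);

	if (n < 0)
		return -1;	/* demo_guard_exit() still runs here */
	z->wp += n;
	return z->depth;	/* evaluated before the cleanup runs */
}

/* Returns 1 when the lock is balanced on both the normal and error path. */
static int demo_check(void)
{
	struct demo_zone z = { 0, 0 };
	int held = demo_advance(&z, 4);
	int err = demo_advance(&z, -1);

	return held == 1 && err == -1 && z.depth == 0 && z.wp == 4;
}
```

Because the cleanup is tied to the variable's scope, no unlock label is
needed on the error path.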

Thanks,

Bart.


diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index 12a3534ecb85..3bf3b6057e8a 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -50,6 +50,23 @@ static inline void null_unlock_zone(struct nullb_device *dev,
  		mutex_unlock(&zone->mutex);
  }

+struct nullb_dev_and_zone {
+	struct nullb_device *dev;
+	struct nullb_zone *zone;
+};
+
+DEFINE_CLASS(null_zone, struct nullb_dev_and_zone,
+	     null_unlock_zone(_T.dev, _T.zone),
+	     ({
+		     null_lock_zone(dev, zone);
+		     (struct nullb_dev_and_zone){dev, zone};
+	     }),
+	     struct nullb_device *dev, struct nullb_zone *zone)
+
+DEFINE_CLASS_IS_UNCONDITIONAL(null_zone)
+
  int null_init_zoned_dev(struct nullb_device *dev,
  			struct queue_limits *lim)
  {
@@ -218,14 +235,14 @@ int null_report_zones(struct gendisk *disk, sector_t sector,
  		 * So use a local copy to avoid corruption of the device zone
  		 * array.
  		 */
-		null_lock_zone(dev, zone);
+		scoped_guard(null_zone, dev, zone) {
  		blkz.start = zone->start;
  		blkz.len = zone->len;
  		blkz.wp = zone->wp;
  		blkz.type = zone->type;
  		blkz.cond = zone->cond;
  		blkz.capacity = zone->capacity;
-		null_unlock_zone(dev, zone);
+		}

  		error = disk_report_zone(disk, &blkz, i, args);
  		if (error)
@@ -366,7 +383,7 @@ static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
  		return null_process_cmd(cmd, REQ_OP_WRITE, sector, nr_sectors);
  	}

-	null_lock_zone(dev, zone);
+	scoped_guard(null_zone, dev, zone) {

  	/*
  	 * Regular writes must be at the write pointer position. Zone append
@@ -446,7 +463,8 @@ static blk_status_t null_zone_write(struct nullb_cmd *cmd, sector_t sector,
  	ret = badblocks_ret;

  unlock_zone:
-	null_unlock_zone(dev, zone);
+	;
+	}

  	return ret;
  }
@@ -657,14 +675,14 @@ static blk_status_t null_zone_mgmt(struct nullb_cmd *cmd, enum req_op op,
  	if (op == REQ_OP_ZONE_RESET_ALL) {
  		for (i = dev->zone_nr_conv; i < dev->nr_zones; i++) {
  			zone = &dev->zones[i];
-			null_lock_zone(dev, zone);
+			scoped_guard(null_zone, dev, zone) {
  			if (zone->cond != BLK_ZONE_COND_EMPTY &&
  			    zone->cond != BLK_ZONE_COND_READONLY &&
  			    zone->cond != BLK_ZONE_COND_OFFLINE) {
  				null_reset_zone(dev, zone);
  				trace_nullb_zone_op(cmd, i, zone->cond);
  			}
-			null_unlock_zone(dev, zone);
+			}
  		}
  		return BLK_STS_OK;
  	}
@@ -672,7 +690,7 @@ static blk_status_t null_zone_mgmt(struct nullb_cmd *cmd, enum req_op op,
  	zone_no = null_zone_no(dev, sector);
  	zone = &dev->zones[zone_no];

-	null_lock_zone(dev, zone);
+	scoped_guard(null_zone, dev, zone) {

  	if (zone->cond == BLK_ZONE_COND_READONLY ||
  	    zone->cond == BLK_ZONE_COND_OFFLINE) {
@@ -702,7 +720,8 @@ static blk_status_t null_zone_mgmt(struct nullb_cmd *cmd, enum req_op op,
  		trace_nullb_zone_op(cmd, zone_no, zone->cond);

  unlock:
-	null_unlock_zone(dev, zone);
+	;
+	}

  	return ret;
  }
@@ -712,7 +731,6 @@ blk_status_t null_process_zoned_cmd(struct nullb_cmd *cmd, enum req_op op,
  {
  	struct nullb_device *dev;
  	struct nullb_zone *zone;
-	blk_status_t sts;

  	switch (op) {
  	case REQ_OP_WRITE:
@@ -731,10 +749,8 @@ blk_status_t null_process_zoned_cmd(struct nullb_cmd *cmd, enum req_op op,
  		if (zone->cond == BLK_ZONE_COND_OFFLINE)
  			return BLK_STS_IOERR;

-		null_lock_zone(dev, zone);
-		sts = null_process_cmd(cmd, op, sector, nr_sectors);
-		null_unlock_zone(dev, zone);
-		return sts;
+		scoped_guard(null_zone, dev, zone)
+			return null_process_cmd(cmd, op, sector, nr_sectors);
  	}
  }

@@ -748,7 +764,7 @@ static void null_set_zone_cond(struct nullb_device *dev,
  			 cond != BLK_ZONE_COND_OFFLINE))
  		return;

-	null_lock_zone(dev, zone);
+	guard(null_zone)(dev, zone);

  	/*
  	 * If the read-only condition is requested again to zones already in
@@ -769,8 +785,6 @@ static void null_set_zone_cond(struct nullb_device *dev,
  		zone->cond = cond;
  		zone->wp = NULL_ZONE_INVALID_WP;
  	}
-
-	null_unlock_zone(dev, zone);
  }

  /*


^ permalink raw reply related

* Re: [PATCH 20/27] nbd: Enable lock context analysis
From: Bart Van Assche @ 2026-06-10 17:16 UTC (permalink / raw)
  To: Nilay Shroff, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Marco Elver, Christoph Hellwig,
	Josef Bacik
In-Reply-To: <4c8438e3-2415-43c9-ba6a-27321070c58e@linux.ibm.com>

On 6/10/26 1:02 AM, Nilay Shroff wrote:
> Above changes are good, however I see nbd also uses @nbd_index_mutex
> which guards @nbd_index_idr. So should we also annotate @nbd_index_idr
> using __guarded_by(&nbd_index_mutex)?

How about adding these changes as an additional patch?

Thanks,

Bart.

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 345e4b73009d..b9e0ad0b3ca0 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -49,8 +49,8 @@
  #define CREATE_TRACE_POINTS
  #include <trace/events/nbd.h>

-static DEFINE_IDR(nbd_index_idr);
  static DEFINE_MUTEX(nbd_index_mutex);
+static __guarded_by(&nbd_index_mutex) DEFINE_IDR(nbd_index_idr);
  static struct workqueue_struct *nbd_del_wq;
  static int nbd_total_devices = 0;

@@ -2739,7 +2739,9 @@ static void __exit nbd_cleanup(void)
  	/* Also wait for nbd_dev_remove_work() completes */
  	destroy_workqueue(nbd_del_wq);

-	idr_destroy(&nbd_index_idr);
+	scoped_guard(mutex_init, &nbd_index_mutex)
+		idr_destroy(&nbd_index_idr);
+
  	unregister_blkdev(NBD_MAJOR, "nbd");
  }



^ permalink raw reply related

* Re: [PATCH] iomap: enforce DIO alignment check in iomap]
From: Keith Busch @ 2026-06-10 17:14 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: brauner, linux-block
In-Reply-To: <aimGzU_UY3jV-ece@nidhogg.toxiclabs.cc>

On Wed, Jun 10, 2026 at 06:58:13PM +0200, Carlos Maiolino wrote:
> On Wed, Jun 10, 2026 at 09:37:17AM -0600, Keith Busch wrote:
> > On Wed, Jun 10, 2026 at 05:27:42PM +0200, Carlos Maiolino wrote:
> > > The DIO alignment check has been lifted from iomap layer to rely on the
> > > block layer to enforce proper alignment when issuing direct IO
> > > operations. This though, depending on the IO size and buffer address
> > > passed to the IO operation may lead to user-visible behavior change.
> > > 
> > > This has been caught initially by LTP test diotest4 running on
> > > PPC architecture, where the test fails because a read() operation
> > > with a supposedly misaligned buffer succeeds instead of an expected
> > > -EINVAL.
> > 
> > It's not supposed to matter where in the stack we determined this be an
> > invalid request: it should still fail if it's misaligned. Could you
> > clarify how this is succeeding?
> 
> Fair enough, can you point me to where the alignment is supposed to be
> checked? 

https://elixir.bootlin.com/linux/v7.1-rc7/source/block/blk-merge.c#L352

It does require that someone calls the bio split-to-limits routine,
which I had taken for granted as a given, but I realize that some
drivers don't do that. What block device are you using for your test?

^ permalink raw reply

* Re: [PATCH 18/27] loop: Add lock context annotations
From: Bart Van Assche @ 2026-06-10 17:13 UTC (permalink / raw)
  To: Nilay Shroff, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Marco Elver, Nathan Chancellor
In-Reply-To: <c71180a3-c049-490f-a8ab-b3faa9714de1@linux.ibm.com>

On 6/10/26 2:21 AM, Nilay Shroff wrote:
> One thing I noticed while looking through the loop driver is that it also
> defines @loop_ctl_mutex, which protects @loop_index_idr. It might be worth
> annotating @loop_index_idr with `__guarded_by(&loop_ctl_mutex)` as well so
> that Clang can validate accesses to the IDR against the corresponding
> locking requirements.

I'm considering adding the changes below as an additional patch:


diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index ff7eff102c5a..30a2b2696368 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -90,8 +90,8 @@ struct loop_cmd {
  #define LOOP_IDLE_WORKER_TIMEOUT (60 * HZ)
  #define LOOP_DEFAULT_HW_Q_DEPTH 128

-static DEFINE_IDR(loop_index_idr);
  static DEFINE_MUTEX(loop_ctl_mutex);
+static __guarded_by(&loop_ctl_mutex) DEFINE_IDR(loop_index_idr);
  static DEFINE_MUTEX(loop_validate_mutex);

  /**
@@ -2326,6 +2326,8 @@ static void __exit loop_exit(void)
  	struct loop_device *lo;
  	int id;

+	guard(mutex_init)(&loop_ctl_mutex);
+
  	unregister_blkdev(LOOP_MAJOR, "loop");
  	misc_deregister(&loop_misc);



^ permalink raw reply related

* Re: [PATCH 11/27] drbd: Split drbd_req_state()
From: Bart Van Assche @ 2026-06-10 17:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Marco Elver, Philipp Reisner,
	Lars Ellenberg, Christoph Böhmwalder
In-Reply-To: <61eb3024-f9bb-465a-938c-c12db1218923@acm.org>

On 6/10/26 10:05 AM, Bart Van Assche wrote:
> Do you perhaps want me to combine this patch and the next patch in this
> series (12/27)?

Answering my own question: I just noticed that this has been requested
in the review comment on the next patch. I will combine these two
patches.

Bart.

^ permalink raw reply

* Re: [PATCH 11/27] drbd: Split drbd_req_state()
From: Bart Van Assche @ 2026-06-10 17:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Marco Elver, Philipp Reisner,
	Lars Ellenberg, Christoph Böhmwalder
In-Reply-To: <aij1DtWM8N7O3q-E@infradead.org>

On 6/9/26 10:24 PM, Christoph Hellwig wrote:
>> +{
>> +	enum drbd_state_rv rv;
>> +
>> +	if (f & CS_SERIALIZE)
>> +		mutex_lock(device->state_mutex);
>> +	rv = __drbd_req_state(device, mask, val, f & ~CS_SERIALIZE);
>>   	if (f & CS_SERIALIZE)
>>   		mutex_unlock(device->state_mutex);
> 
> Wouldn't something like:
> 
> 	if (f & CS_SERIALIZE) {
> 		mutex_lock(device->state_mutex);
> 		rv = __drbd_req_state(device, mask, val, f & ~CS_SERIALIZE);
>   		mutex_unlock(device->state_mutex);
> 	} else {
> 		rv = __drbd_req_state(device, mask, val, f & ~CS_SERIALIZE);
> 	}
> 
> be easier to follow?  Is there much of a point in the CS_SERIALIZE
> clearing here?

Do you perhaps want me to combine this patch and the next patch in this
series (12/27)?

Thanks,

Bart.

^ permalink raw reply

* Re: [PATCH 00/27] Enable lock context analysis in drivers/block/
From: Bart Van Assche @ 2026-06-10 17:00 UTC (permalink / raw)
  To: Christoph Hellwig, Marco Elver; +Cc: Jens Axboe, linux-block
In-Reply-To: <ailK6eqeWrwRLPkz@infradead.org>


On 6/10/26 4:30 AM, Christoph Hellwig wrote:
> Bart: maybe for next version just enable it on a per-driver basis.
> Once all are covered we can switch to directory-wide.

I can do that. Since the merge window likely will open this Monday, I
probably should wait with reposting any patches from this series until
after the merge window has closed.

Thanks for having taken the time to review this patch series.

Bart.

^ permalink raw reply

* [PATCH 1/1] block: partitions: bound sysv68 slice table count
From: Ren Wei @ 2026-06-10 16:58 UTC (permalink / raw)
  To: linux-block
  Cc: kees, axboe, objecting, akpm, phdm, yuantan098, zcliangcn, bird,
	zzhan461, n05ec
In-Reply-To: <cover.1781036698.git.zzhan461@ucr.edu>

From: Zhao Zhang <zzhan461@ucr.edu>

sysv68_partition() reads a single sector for the slice table, but it
trusts ios_slccnt from disk and walks that many entries after skipping
the synthetic whole-disk slice. A crafted image can set ios_slccnt
larger than the 64 struct slice records that fit in one sector and
trigger an out-of-bounds read while scanning partitions.

Limit the slice count to the number of records that fit in the sector
returned by read_part_sector(), then drop the whole-disk entry only
when the bounded count is non-zero.

Fixes: 19d0e8ce856a ("partition: add support for sysv68 partitions")
Cc: stable@vger.kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Assisted-by: Codex:GPT-5.4
Signed-off-by: Zhao Zhang <zzhan461@ucr.edu>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
---
 block/partitions/sysv68.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/block/partitions/sysv68.c b/block/partitions/sysv68.c
index 470e0f9de7be..5110ed83c541 100644
--- a/block/partitions/sysv68.c
+++ b/block/partitions/sysv68.c
@@ -48,7 +48,8 @@ struct slice {
 
 int sysv68_partition(struct parsed_partitions *state)
 {
-	int i, slices;
+	sector_t slice_sector;
+	unsigned int i, slices;
 	int slot = 1;
 	Sector sect;
 	unsigned char *data;
@@ -65,14 +66,16 @@ int sysv68_partition(struct parsed_partitions *state)
 		return 0;
 	}
 	slices = be16_to_cpu(b->dk_ios.ios_slccnt);
-	i = be32_to_cpu(b->dk_ios.ios_slcblk);
+	slice_sector = be32_to_cpu(b->dk_ios.ios_slcblk);
 	put_dev_sector(sect);
 
-	data = read_part_sector(state, i, &sect);
+	data = read_part_sector(state, slice_sector, &sect);
 	if (!data)
 		return -1;
 
-	slices -= 1; /* last slice is the whole disk */
+	slices = min_t(unsigned int, slices, SECTOR_SIZE / sizeof(*slice));
+	if (slices)
+		slices -= 1; /* last slice is the whole disk */
 	seq_buf_printf(&state->pp_buf, "sysV68: %s(s%u)", state->name, slices);
 	slice = (struct slice *)data;
 	for (i = 0; i < slices; i++, slice++) {
-- 
2.47.3


^ permalink raw reply related
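
The bounding logic described in the patch above can be tried outside
the kernel with a small sketch. This is an illustration only, under
stated assumptions: struct demo_slice mirrors an 8-byte slice record
(which matches the "64 records per sector" figure in the commit
message), DEMO_SECTOR_SIZE is taken to be 512, and usable_slices() is a
hypothetical helper that does not exist in block/partitions/sysv68.c.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_SECTOR_SIZE 512u	/* assumed sector size */

/* Assumed layout of one on-disk slice record: two 32-bit fields. */
struct demo_slice {
	uint32_t nblocks;
	uint32_t blkoff;
};

/*
 * Hypothetical helper: clamp the untrusted on-disk count to the number
 * of records that fit in one sector, then drop the trailing whole-disk
 * entry only when at least one record remains.
 */
static unsigned int usable_slices(uint16_t on_disk_count)
{
	unsigned int max_slices = DEMO_SECTOR_SIZE / sizeof(struct demo_slice);
	unsigned int slices = on_disk_count;

	if (slices > max_slices)
		slices = max_slices;
	if (slices)
		slices -= 1;	/* last slice is the whole disk */
	return slices;
}
```

Using an unsigned count and guarding the decrement matters here: with a
count of zero, an unconditional "slices -= 1" on an unsigned variable
would wrap to a very large value instead of leaving the loop empty.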

* Re: [PATCH] iomap: enforce DIO alignment check in iomap]
From: Carlos Maiolino @ 2026-06-10 16:58 UTC (permalink / raw)
  To: Keith Busch; +Cc: brauner, linux-block
In-Reply-To: <aimErft1NoW-4Map@kbusch-mbp>

On Wed, Jun 10, 2026 at 09:37:17AM -0600, Keith Busch wrote:
> On Wed, Jun 10, 2026 at 05:27:42PM +0200, Carlos Maiolino wrote:
> > The DIO alignment check has been lifted from iomap layer to rely on the
> > block layer to enforce proper alignment when issuing direct IO
> > operations. This, though, depending on the IO size and buffer address
> > passed to the IO operation, may lead to a user-visible behavior change.
> > 
> > This has been caught initially by LTP test diotest4 running on
> > PPC architecture, where the test fails because a read() operation
> > with a supposedly misaligned buffer succeeds instead of an expected
> > -EINVAL.
> 
> It's not supposed to matter where in the stack we determined this to be an
> invalid request: it should still fail if it's misaligned. Could you
> clarify how this is succeeding?

Fair enough, can you point me to where the alignment is supposed to be
checked? I've been seeing different behaviors on different machines, so
I'm not sure where the alignment is supposed to be validated within the
block layer (I should probably have tagged this patch as RFC, as my
understanding of the block layer is superficial).

In a few runs where the test failed (because the read() call succeeded),
I could see this, for example:

openat(dirfd=AT_FDCWD, pathname="testdata-4.2063", flags=O_RDWR|O_DIRECT) = 3
_llseek(fd=3, offset=4096, result=[4096], whence=SEEK_SET) = 0
read(arg1=0x3, arg2=0x25960001, arg3=0x1000) = 0x1000

FWIW, on my laptop, the test fails because read() returns -EIO (while
the test expects -EINVAL). I'm pointing the finger at faulty hardware
for now, but I haven't ruled out the possibility that -EIO is being
returned where -EINVAL should have been returned instead.
I'm trying to find another machine to test and see what differences
appear...

Cheers

^ permalink raw reply
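
As a reading aid for the alignment question above, the following is a
minimal sketch of what a direct-IO alignment check looks like in
general. It is not the kernel's implementation: dio_request_aligned()
is a hypothetical helper, and the memory-alignment mask and block size
are plain parameters here, whereas in a real system they would come
from the device's queue limits.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when a direct-IO request satisfies
 * both constraints.
 *   mem_align_mask: required buffer address alignment minus one
 *   block_size:     logical block size, assumed to be a power of two
 */
static bool dio_request_aligned(uintptr_t buf, uint64_t offset, uint64_t len,
				uint64_t mem_align_mask, uint64_t block_size)
{
	/* The user buffer address must honour the memory alignment. */
	if (buf & mem_align_mask)
		return false;
	/* File offset and length must be multiples of the block size. */
	if ((offset | len) & (block_size - 1))
		return false;
	return true;
}
```

With the values from the trace above (buffer 0x25960001, offset 4096,
length 0x1000), the offset and length pass for a 512-byte block size,
so the outcome depends entirely on the memory-alignment mask: any
non-zero odd mask rejects the request, while a mask of zero accepts it.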

* Re: [PATCH 09/27] drbd: Split drbd_nl_get_connections_dumpit()
From: Bart Van Assche @ 2026-06-10 16:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Marco Elver, Philipp Reisner,
	Lars Ellenberg, Christoph Böhmwalder
In-Reply-To: <aij0fuVz_Bxi6NXi@infradead.org>

On 6/9/26 10:22 PM, Christoph Hellwig wrote:
> On Tue, Jun 09, 2026 at 03:04:56PM -0700, Bart Van Assche wrote:
>> +static int drbd_nl_put_dump_connections_result(
>> +	struct sk_buff *skb, struct netlink_callback *cb,
>> +	struct drbd_resource *resource, struct drbd_connection *connection,
>> +	enum drbd_ret_code retcode)
> 
> Weird indentation here.  The usual style is either lining up after the
> opening brace or two tabs.

This formatting comes from git clang-format. Anyway, I will fix the
formatting.

Thanks,

Bart.



^ permalink raw reply

* Re: [PATCH v3 3/4] block: drop shared-tag fairness throttling
From: Keith Busch @ 2026-06-10 16:35 UTC (permalink / raw)
  To: Sumit Saxena
  Cc: Christoph Hellwig, Martin K . Petersen, Jens Axboe,
	James E . J . Bottomley, linux-scsi, linux-block, Adam Radford,
	Khalid Aziz, Adaptec OEM Raid Solutions, Matthew Wilcox,
	Hannes Reinecke, Juergen E . Fischer, Russell King,
	linux-arm-kernel, Finn Thain, Michael Schmitz, Anil Gurumurthy,
	Sudarsana Kalluru, Oliver Neukum, Ali Akcaagac, Jamie Lenehan,
	Ram Vegesna, target-devel, Bradley Grove, Satish Kharat,
	Sesidhar Baddela, Karan Tilak Kumar, Yihang Li, Don Brace,
	storagedev, HighPoint Linux Team, Tyrel Datwyler,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, linuxppc-dev, Brian King, Lee Duncan,
	Chris Leech, Mike Christie, open-iscsi, Justin Tee, Paul Ely,
	Kashyap Desai, Shivasharan S, Chandrakanth Patil,
	megaraidlinux.pdl, Sathya Prakash Veerichetty, Sreekanth Reddy,
	mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani, Ranjan Kumar,
	MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori, YOKOTA Hiroshi,
	Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
	Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Stefan Hajnoczi,
	Eugenio Perez, virtualization, Vishal Bhakta,
	bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, xen-devel, Bart Van Assche
In-Reply-To: <CAL2rwxr1uGshb1o=jvP2OnBffNz2cKXj8tHuAUCN5HFuy2vB_g@mail.gmail.com>

On Wed, Jun 10, 2026 at 09:16:11PM +0530, Sumit Saxena wrote:
> The motivation for this change stems from performance issue we
> encountered due to false sharing of the
> 'nr_active_requests_shared_tags' counter on certain CPU architectures.
> I initially submitted a patch to move that counter to its own cache
> line to avoid conflicts with 'nr_requests' and other hot fields (see:
> https://patchwork.kernel.org/project/linux-scsi/patch/20260402074637.92417-3-sumit.saxena@broadcom.com/
> ).
> 
> During the review, Bart shared his work, which eliminates the counter
> entirely by removing the fairness throttling. My testing confirmed
> that this approach resolved the performance issues and improved IOPS.
> This patch is part of a larger set, and I have reported the cumulative
> performance improvements in the cover letter.

So the problem is just the atomic operation accounting overhead? I
previously thought the device just really needed to consume all the tags
to hit performance.

^ permalink raw reply

