Linux block layer


* [PATCH] rnbd-clt: Use common error handling code in rnbd_get_iu()
From: Markus Elfring @ 2026-06-10 19:03 UTC (permalink / raw)
  To: linux-block, Jack Wang, Jens Axboe, Md. Haris Iqbal; +Cc: LKML, kernel-janitors

From: Markus Elfring <elfring@users.sourceforge.net>
Date: Wed, 10 Jun 2026 20:58:47 +0200

Use an additional label so that a bit of exception handling can be better
reused at the end of an if branch.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
---
 drivers/block/rnbd/rnbd-clt.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/block/rnbd/rnbd-clt.c b/drivers/block/rnbd/rnbd-clt.c
index 4d6725a0035e..d8e3f145ee2f 100644
--- a/drivers/block/rnbd/rnbd-clt.c
+++ b/drivers/block/rnbd/rnbd-clt.c
@@ -329,10 +329,8 @@ static struct rnbd_iu *rnbd_get_iu(struct rnbd_clt_session *sess,
 		return NULL;
 
 	permit = rnbd_get_permit(sess, con_type, wait);
-	if (!permit) {
-		kfree(iu);
-		return NULL;
-	}
+	if (!permit)
+		goto free_iu;
 
 	iu->permit = permit;
 	/*
@@ -349,6 +347,7 @@ static struct rnbd_iu *rnbd_get_iu(struct rnbd_clt_session *sess,
 
 	if (sg_alloc_table(&iu->sgt, 1, GFP_KERNEL)) {
 		rnbd_put_permit(sess, permit);
+free_iu:
 		kfree(iu);
 		return NULL;
 	}
-- 
2.54.0
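
For readers less familiar with the idiom, the shared error-handling pattern that this patch moves towards can be sketched in userspace C. This is only a minimal sketch, not the driver code: `struct iu`, `struct permit`, `get_permit()`, `put_permit()`, the `table` member and the `permits_released` counter are hypothetical stand-ins, and the labels sit at the end of the function (the more conventional layout) rather than inside the `if` branch as in the patch above.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the rnbd types; not the driver's definitions. */
struct permit { int id; };
struct iu { struct permit *permit; int *table; };

/* Counts releases so a caller can observe that cleanup ran. */
static int permits_released;

static struct permit *get_permit(bool available)
{
	return available ? calloc(1, sizeof(struct permit)) : NULL;
}

static void put_permit(struct permit *p)
{
	permits_released++;
	free(p);
}

/*
 * Allocate an iu. On failure, unwind in reverse order of acquisition
 * through labels placed at the end of the function, so each cleanup
 * step is written exactly once.
 */
static struct iu *get_iu(bool permit_ok, bool table_ok)
{
	struct iu *iu = calloc(1, sizeof(*iu));

	if (!iu)
		return NULL;

	iu->permit = get_permit(permit_ok);
	if (!iu->permit)
		goto free_iu;

	iu->table = table_ok ? calloc(1, sizeof(int)) : NULL;
	if (!iu->table)
		goto release_permit;

	return iu;

release_permit:
	put_permit(iu->permit);
free_iu:
	free(iu);
	return NULL;
}
```

A failure to get the permit only frees the iu, while a later failure releases the permit first and then falls through to the same `free()` call.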



* Re: [PATCH] iomap: enforce DIO alignment check in iomap]
From: Keith Busch @ 2026-06-10 19:54 UTC (permalink / raw)
  To: Carlos Maiolino; +Cc: brauner, linux-block
In-Reply-To: <aimn3kipMHdRmTTe@nidhogg.toxiclabs.cc>

On Wed, Jun 10, 2026 at 08:19:53PM +0200, Carlos Maiolino wrote:
> On Wed, Jun 10, 2026 at 11:14:30AM -0600, Keith Busch wrote:
> > 
> > It does require that someone calls the bio split-to-limits routine,
> > which I had taken for granted as a given, but I realize that some
> > drivers don't do that. What block device are you using for your test?
> 
> In the PPC machine, it's a virtual scsi vdasd device from one of the
> virtual nodes
> 
> NAME HCTL       TYPE VENDOR   MODEL  REV SERIAL                             TRAN
> sda  0:0:1:0    disk AIX      VDASD 0001 000a508a00007a0000000175dcba35ac.5
> 
> ibmvfc                262144  0
> ibmvscsi              196608  2
> 
> For my x86 machine (reminder: I reduced the buffer size to 512 on x86), it's
> a commodity SATA Samsung SSD:

Okay, these are under blk-mq, so they always call __bio_split_to_limits.
However, I see there's an optimization that skips the checks we're
depending on when bio_may_need_split doesn't think the bio needs to be
split, which is a problem given your observation. I don't think the
current expectations allow us to take this optimization anymore when
page offsets are used.

This should fix it:

---
diff --git a/block/blk.h b/block/blk.h
index 1a2d9101bba04..3731f3c5ed140 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -404,7 +404,7 @@ static inline bool bio_may_need_split(struct bio *bio,
 	bv = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
 	if (bio->bi_iter.bi_size > bv->bv_len - bio->bi_iter.bi_bvec_done)
 		return true;
-	return bv->bv_len + bv->bv_offset > lim->max_fast_segment_size;
+	return bv->bv_offset || bv->bv_len > lim->max_fast_segment_size;
 }
 
 /**
--
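
The effect of the one-line change can be illustrated with a small userspace model. This is a sketch under stated assumptions: `struct seg_model` is a hypothetical stand-in for the two `bio_vec` fields the check reads, `max_fast` stands in for `lim->max_fast_segment_size`, and the earlier `bi_size` comparison in `bio_may_need_split()` is left out.

```c
#include <stdbool.h>

/* Hypothetical model of the fields read by the final check. */
struct seg_model {
	unsigned int len;    /* models bv->bv_len */
	unsigned int offset; /* models bv->bv_offset */
};

/* Before the change: only the end of the segment is compared. */
static bool may_need_split_old(const struct seg_model *s,
			       unsigned int max_fast)
{
	return s->len + s->offset > max_fast;
}

/* After the change: any non-zero offset takes the slow path. */
static bool may_need_split_new(const struct seg_model *s,
			       unsigned int max_fast)
{
	return s->offset || s->len > max_fast;
}
```

In this model, a 512-byte segment at offset 256 with a 4096-byte limit skips the split path under the old check and takes it under the new one, while a segment at offset 0 is treated the same by both.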


* Re: [LSF/MM/BPF TOPIC] Memory fragmentation with large block sizes
From: Karim Manaouil @ 2026-06-10 22:27 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: lsf-pc, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, linux-mm
In-Reply-To: <f22caf98-1375-493a-a275-0500ffac3e81@suse.de>

On Thu, Feb 19, 2026 at 10:54:48AM +0100, Hannes Reinecke wrote:
> Hi all,
> 
> I (together with the Czech Technical University) did some experiments trying
> to measure memory fragmentation with large block sizes.
> Testbed used was an nvme setup talking to a nvmet storage over
> the network.
> 
> Doing so raised some challenges:
> 
> - How do you _generate_ memory fragmentation? The MM subsystem is
>   precisely geared up to avoid it, so you would need to come up
>   with some idea how to defeat it. With the help from Willy I managed
>   to come up with something, but I really would like to discuss
>   what would be the best option here.

thpchallenge from mmtests has been a staple for the
compaction/anti-fragmentation folks.

Also, see https://patchwork.freedesktop.org/patch/716404/?series=164353&rev=1

Btw, would you mind sharing which workloads you discussed with Matthew?

> - What is acceptable memory fragmentation? Are we good enough if the
>   measured fragmentation does not grow during the test runs?
> - Do we have better visibility into memory fragmentation other than
>   just reading /proc/buddyinfo?
> 
> And, of course, I would like to present (and discuss) the results
> of the testruns done on 4k, 8k, and 16k blocksizes.
> 
> Not sure if this should be a storage or MM topic; I'll let the
> lsf-pc decide.
> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke                  Kernel Storage Architect
> hare@suse.de                                +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
> 

-- 
~karim
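
As a point of reference for the /proc/buddyinfo question quoted above, the per-order counts in one buddyinfo line can be reduced to a single number with a few lines of C. This is only a sketch: `parse_buddyinfo_line()`, `high_order_share()` and `MAX_ORDERS` are hypothetical names, and the "share of free pages held in blocks of at least a given order" is an ad-hoc measure chosen for illustration, not a metric defined by the kernel or proposed in this thread.

```c
#include <stdio.h>

#define MAX_ORDERS 16

/*
 * Parse the per-order free block counts from one /proc/buddyinfo line,
 * e.g. "Node 0, zone   Normal   4   2   1   0".
 * Returns the number of orders parsed, or -1 if the prefix is not
 * recognised.
 */
static int parse_buddyinfo_line(const char *line,
				unsigned long counts[MAX_ORDERS])
{
	int node, consumed = 0, n = 0;
	char zone[32];

	if (sscanf(line, "Node %d, zone %31s%n", &node, zone, &consumed) != 2)
		return -1;
	line += consumed;

	while (n < MAX_ORDERS &&
	       sscanf(line, "%lu%n", &counts[n], &consumed) == 1) {
		line += consumed;
		n++;
	}
	return n;
}

/* Share of free pages held in blocks of at least min_order, in [0, 1]. */
static double high_order_share(const unsigned long *counts, int n,
			       int min_order)
{
	unsigned long long total = 0, high = 0;

	for (int order = 0; order < n; order++) {
		/* A free block of a given order spans 2^order pages. */
		unsigned long long pages =
			(unsigned long long)counts[order] << order;

		total += pages;
		if (order >= min_order)
			high += pages;
	}
	return total ? (double)high / (double)total : 0.0;
}
```

Sampling this value before and after a test run would show whether free memory has shifted towards smaller blocks, which is one way to phrase the "does not grow during the test runs" criterion.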


* [PATCH v3] block: rust: fix `Send` bound for `GenDisk`
From: Yuan Tan @ 2026-06-11  0:32 UTC (permalink / raw)
  To: a.hindborg
  Cc: boqun, linux-block, rust-for-linux, zhiyunq, ardalan, pgovind2,
	dzueck, yuantan098, Yuan Tan

From: Yuan Tan <ytan089@ucr.edu>

The `Send` implementation for `GenDisk<T>` was conditioned on `T: Send`.
This constrains the wrong type. `T` is the `Operations` implementation,
which is typically a zero-sized marker type that carries no data, so `T:
Send` says nothing about whether the data a `GenDisk` actually owns can be
moved to another thread.

A `GenDisk<T>` owns the queue data `T::QueueData` (stored as the
`gendisk`'s `queuedata` and dropped when the `GenDisk` is dropped) and an
`Arc<TagSet<T>>`. These are the values transferred when a `GenDisk` is sent
across a thread boundary, so the `Send` bound must constrain exactly them.
Bound `T::QueueData: Send` and `Arc<TagSet<T>>: Send` instead.

Fixes: 3253aba3408a ("rust: block: introduce `kernel::block::mq` module")
Reported-by: Priya Bala Govindasamy <pgovind2@uci.edu>
Reported-by: Dylan Zueck <dzueck@uci.edu>
Suggested-by: Andreas Hindborg <a.hindborg@kernel.org>
Signed-off-by: Yuan Tan <ytan089@ucr.edu>
---

Changes in v3:
  - Add Priya and Dylan's names to the `Reported-by` tags
Link to v2:
  - https://lore.kernel.org/all/20260609-rnull-v6-19-rc5-send-v2-1-82c7404542e2@kernel.org/
Link to v1:
  - https://lore.kernel.org/all/cover.1780633578.git.ytan089@ucr.edu/

I am a bit unsure how to handle this v3.

The change in this v3 is adding the missing trailers.
Andreas' v2 already addresses the TagSet issue from my v1, and his commit
message is also more appropriate. Therefore this v3 has no changes other
than the trailers.

I am not sure whether it is appropriate for me to take Andreas' patch and
only adjust the trailers. Please correct me, and my apologies if this is
not the right way to handle it.

 rust/kernel/block/mq/gen_disk.rs | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/rust/kernel/block/mq/gen_disk.rs b/rust/kernel/block/mq/gen_disk.rs
index 912cb805caf5..b36d24382cc3 100644
--- a/rust/kernel/block/mq/gen_disk.rs
+++ b/rust/kernel/block/mq/gen_disk.rs
@@ -199,8 +199,14 @@ pub struct GenDisk<T: Operations> {
 }

 // SAFETY: `GenDisk` is an owned pointer to a `struct gendisk` and an `Arc` to a
-// `TagSet` It is safe to send this to other threads as long as T is Send.
-unsafe impl<T: Operations + Send> Send for GenDisk<T> {}
+// `TagSet`. It is safe to send this to other threads as long as these two are `Send`.
+unsafe impl<T> Send for GenDisk<T>
+where
+    T: Operations,
+    T::QueueData: Send,
+    Arc<TagSet<T>>: Send,
+{
+}

 impl<T: Operations> Drop for GenDisk<T> {
     fn drop(&mut self) {
-- 
2.43.2


* Re: [PATCH] iomap: enforce DIO alignment check in iomap]
From: Ming Lei @ 2026-06-11  2:49 UTC (permalink / raw)
  To: Keith Busch; +Cc: Carlos Maiolino, brauner, linux-block
In-Reply-To: <ainBCDneRqNvmMT_@kbusch-mbp>

On Wed, Jun 10, 2026 at 01:54:48PM -0600, Keith Busch wrote:
> On Wed, Jun 10, 2026 at 08:19:53PM +0200, Carlos Maiolino wrote:
> > On Wed, Jun 10, 2026 at 11:14:30AM -0600, Keith Busch wrote:
> > > 
> > > It does require that someone calls the bio split-to-limits routine,
> > > which I had taken for granted as a given, but I realize that some
> > > drivers don't do that. What block device are you using for your test?
> > 
> > In the PPC machine, it's a virtual scsi vdasd device from one of the
> > virtual nodes
> > 
> > NAME HCTL       TYPE VENDOR   MODEL  REV SERIAL                             TRAN
> > sda  0:0:1:0    disk AIX      VDASD 0001 000a508a00007a0000000175dcba35ac.5
> > 
> > ibmvfc                262144  0
> > ibmvscsi              196608  2
> > 
> > For my x86 machine (reminder: I reduced the buffer size to 512 on x86), it's
> > a commodity SATA Samsung SSD:
> 
> Okay, these are under blk-mq, so they always call __bio_split_to_limits.
> However, I see there's an optimization that skips the checks we're
> depending on when bio_may_need_split doesn't think the bio needs to be
> split, which is a problem given your observation. I don't think the
> current expectations allow us to take this optimization anymore when
> page offsets are used.
> 
> This should fix it:
> 
> ---
> diff --git a/block/blk.h b/block/blk.h
> index 1a2d9101bba04..3731f3c5ed140 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -404,7 +404,7 @@ static inline bool bio_may_need_split(struct bio *bio,
>  	bv = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
>  	if (bio->bi_iter.bi_size > bv->bv_len - bio->bi_iter.bi_bvec_done)
>  		return true;
> -	return bv->bv_len + bv->bv_offset > lim->max_fast_segment_size;
> +	return bv->bv_offset || bv->bv_len > lim->max_fast_segment_size;
>  }

This should work for unaligned DMA buffers, but it might hurt performance
for any sub-page IO.

Given that you have moved DIO buffer alignment validation into bio
splitting, it should be fine to check ->dma_alignment here, provided the
three limits fields are placed in the same cache line.


Thanks,
Ming
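
One possible reading of the ->dma_alignment suggestion, as a userspace sketch: send only misaligned segments to the slow path, so that aligned sub-page IO keeps the fast path. Everything here is an assumption made for illustration: `struct seg_model`, the function name, treating the alignment as a mask (for example 511), and checking the length as well as the offset. It is not the change proposed in this thread.

```c
#include <stdbool.h>

/* Hypothetical model of the fields read by the check. */
struct seg_model {
	unsigned int len;    /* models bv->bv_len */
	unsigned int offset; /* models bv->bv_offset */
};

/*
 * Slow path only when the segment is misaligned with respect to
 * align_mask, or when it does not fit the fast segment size.
 */
static bool may_need_split_aligned(const struct seg_model *s,
				   unsigned int max_fast,
				   unsigned int align_mask)
{
	if ((s->offset | s->len) & align_mask)
		return true;
	return s->len + s->offset > max_fast;
}
```

With a mask of 511 and a 4096-byte limit, a 512-byte segment at offset 512 stays on the fast path in this model, whereas the same segment at offset 100 does not.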


* Re: [PATCH v2 00/14] list: Prepare entry iterators to cache cursor state
From: Kaitao Cheng @ 2026-06-11  4:42 UTC (permalink / raw)
  To: Andy Shevchenko, Christian König
  Cc: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
	Davidlohr Bueso, Paul E . McKenney, Josh Triplett, Peter Zijlstra,
	Ingo Molnar, Will Deacon, Boqun Feng, Liam Girdwood, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin, Huang Rui,
	Eddie James, Mark Brown, Maxime Coquelin, Alexandre Torgue,
	Laxman Dewangan, Neil Armstrong, Robert Foss, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Laurent Pinchart, Jonas Karlman, Jernej Skrabec, Matthew Auld,
	Matthew Brost, Waiman Long, drbd-dev, linux-block,
	linux1394-devel, dri-devel, intel-gfx, linux-spi, linux-stm32,
	linux-arm-kernel, linux-tegra, linux-sound, linux-kernel,
	Andrew Morton, Randy Dunlap, Christian Brauner, David Howells,
	Luca Ceresoli, Kaito Cheng, Muchun Song, Philipp Reisner,
	Lars Ellenberg, Christoph Böhmwalder, Jens Axboe,
	Takashi Sakamoto, Andrzej Hajda, Jaroslav Kysela, Takashi Iwai
In-Reply-To: <ail4AvzqAOXNaU6N@ashevche-desk.local>



On 2026/6/10 22:43, Andy Shevchenko wrote:
> On Wed, Jun 10, 2026 at 02:14:06PM +0800, Kaitao Cheng wrote:
>> On 2026/6/9 18:33, Christian König wrote:
>>> On 6/9/26 08:13, Kaitao Cheng wrote:
>>>>
>>>> This series prepares for, and then updates, the list_for_each_entry()
>>>> family so the common entry iterators cache their next or previous cursor
>>>> before the loop body runs.
>>>
>>> Why in the world would we want to do that?
>>>
>>> The safe and non-safe variants have very distinct use cases and that is completely intentional.
>>>
>>> What we could maybe improve is the documentation; in my experience an astonishingly large number of people have misconceptions about the safe variants.
>>>
>>>> The first 13 patches open-code loops that intentionally depend on the
>>>> old "derive the next entry from the current cursor at the end of the
>>>> iteration" behaviour.  These loops append work to the list being walked,
>>>> restart traversal after dropping a lock, skip an entry consumed by the
>>>> current iteration, or otherwise adjust the cursor in the loop body.
>>>
>>> Well I have to clearly reject the changes for subsystems/components I'm maintaining, that just looks horrible to me and I clearly don't see a good reason for that.
>>
>> Hi Christian and Andy Shevchenko,
>>
>> Thanks for taking a look. I would like to clarify the point you raised.
>>
>> The reason I started looking at this is the original motivation behind
>> the _safe() variants.  They exist because some users need to remove, move
>> or otherwise consume the current entry while walking the list.  In that
>> case the next cursor has to be preserved before the loop body can modify
>> the current entry.
>>
>> The unfortunate part is that this could not be expressed with the
>> existing list_for_each_entry() interface without changing its calling
>> convention.  The _safe() variants had to grow an extra argument for the
>> temporary cursor, and that is why we ended up with a separate family of
>> macros.
>>
>> But conceptually, the distinction does not have to be exposed as two
>> different iterator families forever.  The difference is an implementation
>> detail: whether the iterator keeps the next/previous cursor before the
>> body runs.  This series makes the common list_for_each_entry() iterators
>> do that internally, so the safe and non-safe forms can effectively be
>> folded together, or at least the need for a separate public _safe()
>> interface becomes much weaker.
>>
>> There is also a usability issue with the current _safe() interface.  The
>> caller is forced to define a temporary cursor outside the macro and pass
>> it in, even though almost all users never use that cursor directly.  It is
>> just boilerplate required by the macro implementation.  I find that
>> redundant and awkward: the temporary cursor is an internal detail of the
>> iteration, but every caller has to spell it out.
> 
> Ah, I think distinct macro families are what we want.
> But the hiding of the parameter can be done inside list_for_each_*_safe().
> You can do a treewide change with coccinelle.
> 
> Sorry if I didn't get the whole idea from your previous contributions.
> 
> Note, even cases that would need a temporary cursor may be switched to
> new list_for_each_*_safe(), see how PCI macros for iterating over resources
> are implemented (include/linux/pci.h).

Thanks for your suggestions. I've written a demo based on your feedback.
Could you please review it and share your thoughts on this approach?


diff --git a/include/linux/list.h b/include/linux/list.h
index 9df84a56a789..306554ab1841 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -7,6 +7,7 @@
 #include <linux/stddef.h>
 #include <linux/poison.h>
 #include <linux/const.h>
+#include <linux/args.h>

 #include <asm/barrier.h>

@@ -911,20 +912,34 @@ static inline size_t list_count_nodes(struct list_head *head)
        for (; !list_entry_is_head(pos, head, member);                  \
             pos = list_prev_entry(pos, member))

+#define __list_for_each_entry_safe_internal(pos, next, head, member)   \
+       for (typeof(pos) next = list_next_entry(pos =                   \
+               list_first_entry(head, typeof(*pos), member), member);  \
+            !list_entry_is_head(pos, head, member);                    \
+            pos = next, next = list_next_entry(next, member))
+
+#define __list_for_each_entry_safe2(pos, head, member)                 \
+       __list_for_each_entry_safe_internal(pos, __UNIQUE_ID(next), head, member)
+
+#define __list_for_each_entry_safe3(pos, next, head, member)           \
+       for (pos = list_first_entry(head, typeof(*pos), member),        \
+               next = list_next_entry(pos, member);                    \
+            !list_entry_is_head(pos, head, member);                    \
+            pos = next, next = list_next_entry(next, member))
+
 /**
  * list_for_each_entry_safe - iterate over list of given type safe against removal of list entry
  * @pos:       the type * to use as a loop cursor.
- * @n:         another type * to use as temporary storage
- * @head:      the head for your list.
- * @member:    the name of the list_head within the struct.
+ * @...:       either (head, member) or (next, head, member)
+ *     @next:  another type * to use as optional temporary storage. The temporary
+ *             cursor is internal unless explicitly supplied by the caller.
+ *     @head:  the head for your list.
+ *     @member:the name of the list_head within the struct.
  *
  */
-#define list_for_each_entry_safe(pos, n, head, member)                 \
-       for (pos = list_first_entry(head, typeof(*pos), member),        \
-               n = list_next_entry(pos, member);                       \
-            !list_entry_is_head(pos, head, member);                    \
-            pos = n, n = list_next_entry(n, member))
+#define list_for_each_entry_safe(pos, ...)                             \
+       CONCATENATE(__list_for_each_entry_safe, COUNT_ARGS(__VA_ARGS__))\
+               (pos, __VA_ARGS__)

 /**
  * list_for_each_entry_safe_continue - continue list iteration safe against removal

>> With the updated list_for_each_entry() implementation, that extra cursor
>> can be kept inside the iterator itself.  Callers that only want to walk
>> the list, including callers that delete or consume the current entry, no
>> longer need to carry an otherwise-unused temporary variable just to make
>> the macro work.
>>
>>>> The final patch changes include/linux/list.h to keep a private cursor in
>>>> the common entry iterators while preserving the public macro interface.
>>>> The safe variants remain available when callers need the temporary
>>>> cursor explicitly or have stronger mutation requirements.
> 
> 

-- 
Thanks
Kaitao Cheng



* Re: [PATCH 18/27] loop: Add lock context annotations
From: Nilay Shroff @ 2026-06-11  5:00 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Marco Elver, Nathan Chancellor
In-Reply-To: <63ffdebb-24f2-4842-8e65-53045d74dace@acm.org>

On 6/10/26 10:43 PM, Bart Van Assche wrote:
> On 6/10/26 2:21 AM, Nilay Shroff wrote:
>> One thing I noticed while looking through the loop driver is that it also defines
>> @loop_ctl_mutex, which protects @loop_index_idr. It might be worth annotating
>> @loop_index_idr with `__guarded_by(&loop_ctl_mutex)` as well so that Clang can
>> validate accesses to the IDR against the corresponding locking requirements.
> 
> I'm considering adding the changes below as an additional patch:
> 
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index ff7eff102c5a..30a2b2696368 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -90,8 +90,8 @@ struct loop_cmd {
>   #define LOOP_IDLE_WORKER_TIMEOUT (60 * HZ)
>   #define LOOP_DEFAULT_HW_Q_DEPTH 128
> 
> -static DEFINE_IDR(loop_index_idr);
>   static DEFINE_MUTEX(loop_ctl_mutex);
> +static __guarded_by(&loop_ctl_mutex) DEFINE_IDR(loop_index_idr);
>   static DEFINE_MUTEX(loop_validate_mutex);
> 
>   /**
> @@ -2326,6 +2326,8 @@ static void __exit loop_exit(void)
>       struct loop_device *lo;
>       int id;
> 
> +    guard(mutex_init)(&loop_ctl_mutex);
> +
>       unregister_blkdev(LOOP_MAJOR, "loop");
>       misc_deregister(&loop_misc);
> 
> 
Okay, looks good. Alternatively, you could fold the above change
into the existing patch, since that patch already adds the lock
context annotations for the loop driver. Either way is fine with
me: updating the current patch or adding the change as a new
patch.

Thanks,
--Nilay


* Re: [PATCH 20/27] nbd: Enable lock context analysis
From: Nilay Shroff @ 2026-06-11  5:02 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Marco Elver, Christoph Hellwig,
	Josef Bacik
In-Reply-To: <ba18cfd5-0a52-4e2e-85a1-a9f0a20ff957@acm.org>

On 6/10/26 10:46 PM, Bart Van Assche wrote:
> On 6/10/26 1:02 AM, Nilay Shroff wrote:
>> Above changes are good, however I see nbd also uses @nbd_index_mutex
>> which guards @nbd_index_idr. So should we also annotate @nbd_index_idr
>> using __guarded_by(&nbd_index_mutex)?
> 
> How about adding these changes as an additional patch?
> 
> Thanks,
> 
> Bart.
> 
> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> index 345e4b73009d..b9e0ad0b3ca0 100644
> --- a/drivers/block/nbd.c
> +++ b/drivers/block/nbd.c
> @@ -49,8 +49,8 @@
>   #define CREATE_TRACE_POINTS
>   #include <trace/events/nbd.h>
> 
> -static DEFINE_IDR(nbd_index_idr);
>   static DEFINE_MUTEX(nbd_index_mutex);
> +static __guarded_by(&nbd_index_mutex) DEFINE_IDR(nbd_index_idr);
>   static struct workqueue_struct *nbd_del_wq;
>   static int nbd_total_devices = 0;
> 
> @@ -2739,7 +2739,9 @@ static void __exit nbd_cleanup(void)
>       /* Also wait for nbd_dev_remove_work() completes */
>       destroy_workqueue(nbd_del_wq);
> 
> -    idr_destroy(&nbd_index_idr);
> +    scoped_guard(mutex_init, &nbd_index_mutex)
> +        idr_destroy(&nbd_index_idr);
> +
>       unregister_blkdev(NBD_MAJOR, "nbd");
>   }
> 
> 
Looks good. But as I said earlier for the similar changes in the
loop driver, you may want to consider folding the above changes
into the current patch (instead of adding an additional patch),
since it already enables lock context analysis for the nbd
driver.

Thanks,
--Nilay



* Re: [PATCH v3 1/4] crypto: skcipher - add per-tfm data_unit_size for batched requests
From: Herbert Xu @ 2026-06-11  5:07 UTC (permalink / raw)
  To: Leonid Ravich
  Cc: Alasdair Kergon, Ard Biesheuvel, Eric Biggers, Jens Axboe,
	Horia Geanta, Gilad Ben-Yossef, linux-crypto, dm-devel,
	linux-block
In-Reply-To: <20260601085644.13026-2-lravich@amazon.com>

On Mon, Jun 01, 2026 at 08:56:41AM +0000, Leonid Ravich wrote:
>
> diff --git a/crypto/skcipher.c b/crypto/skcipher.c
> index 2b31d1d5d268..bc37bd554aec 100644
> --- a/crypto/skcipher.c
> +++ b/crypto/skcipher.c
> @@ -432,13 +432,119 @@ int crypto_skcipher_setkey(struct crypto_skcipher *tfm, const u8 *key,
>  }
>  EXPORT_SYMBOL_GPL(crypto_skcipher_setkey);
>  
> +int crypto_skcipher_set_data_unit_size(struct crypto_skcipher *tfm,
> +				       unsigned int data_unit_size)
> +{
> +	unsigned int blocksize;
> +
> +	if (!data_unit_size) {
> +		tfm->data_unit_size = 0;
> +		return 0;
> +	}
> +
> +	if (!crypto_skcipher_supports_multi_data_unit(tfm))
> +		return -EOPNOTSUPP;
> +
> +	blocksize = crypto_skcipher_blocksize(tfm);
> +	if (data_unit_size < blocksize || data_unit_size % blocksize)
> +		return -EINVAL;
> +
> +	tfm->data_unit_size = data_unit_size;
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(crypto_skcipher_set_data_unit_size);

The unit size should be a per-request attribute, not per tfm.

> @@ -492,6 +517,66 @@ static inline unsigned int crypto_lskcipher_chunksize(
>  	return crypto_lskcipher_alg(tfm)->co.chunksize;
>  }
>  
> +/**
> + * crypto_skcipher_supports_multi_data_unit() - test multi-data-unit support
> + * @tfm: cipher handle
> + *
> + * Return: true if the algorithm advertises that it can process multiple
> + *	   data units in a single skcipher_request.
> + */
> +static inline bool
> +crypto_skcipher_supports_multi_data_unit(struct crypto_skcipher *tfm)
> +{
> +	return crypto_skcipher_alg_common(tfm)->base.cra_flags &
> +		CRYPTO_ALG_SKCIPHER_MULTI_DATA_UNIT;
> +}

My preference is to always use multi-unit submission if the user
is capable of doing it.  The Crypto API should automatically divide
up the units if the underlying driver does not support it.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
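
The fallback described above, where the API itself divides a request into data units when the driver cannot, can be sketched in userspace C. This is a minimal sketch under stated assumptions: `unit_fn`, `process_in_units()`, `struct unit_stats` and `count_unit()` are hypothetical names, the per-unit index passed to the callback is an assumption about how each unit would be told apart, and only the size validation mirrors the quoted patch (with an extra guard against a zero block size).

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-unit callback: handles exactly one data unit. */
typedef int (*unit_fn)(unsigned char *buf, size_t len,
		       unsigned long long unit_index, void *ctx);

/* Same rule as the quoted patch: a non-zero multiple of the block size. */
static int check_data_unit_size(unsigned int data_unit_size,
				unsigned int blocksize)
{
	if (!blocksize || data_unit_size < blocksize ||
	    data_unit_size % blocksize)
		return -EINVAL;
	return 0;
}

/* Walk a multi-unit buffer and submit it one data unit at a time. */
static int process_in_units(unsigned char *buf, size_t len,
			    unsigned int data_unit_size,
			    unsigned int blocksize,
			    unsigned long long first_unit,
			    unit_fn fn, void *ctx)
{
	int err = check_data_unit_size(data_unit_size, blocksize);

	if (err)
		return err;
	if (len % data_unit_size)
		return -EINVAL;

	for (size_t off = 0; off < len; off += data_unit_size) {
		err = fn(buf + off, data_unit_size,
			 first_unit + off / data_unit_size, ctx);
		if (err)
			return err;
	}
	return 0;
}

/* Example callback: counts units and records the last index seen. */
struct unit_stats { unsigned int calls; unsigned long long last_index; };

static int count_unit(unsigned char *buf, size_t len,
		      unsigned long long unit_index, void *ctx)
{
	struct unit_stats *stats = ctx;

	(void)buf;
	(void)len;
	stats->calls++;
	stats->last_index = unit_index;
	return 0;
}
```

Passing the data unit size as an argument of each call, rather than storing it in a long-lived handle, is in the spirit of the per-request attribute requested above.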


* Re: [PATCH v2 00/14] list: Prepare entry iterators to cache cursor state
From: Andy Shevchenko @ 2026-06-11  6:54 UTC (permalink / raw)
  To: Kaitao Cheng
  Cc: Christian König, Thierry Reding, Jonathan Hunter,
	Sowjanya Komatineni, Davidlohr Bueso, Paul E . McKenney,
	Josh Triplett, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Boqun Feng, Liam Girdwood, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Huang Rui, Eddie James, Mark Brown,
	Maxime Coquelin, Alexandre Torgue, Laxman Dewangan,
	Neil Armstrong, Robert Foss, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Laurent Pinchart,
	Jonas Karlman, Jernej Skrabec, Matthew Auld, Matthew Brost,
	Waiman Long, drbd-dev, linux-block, linux1394-devel, dri-devel,
	intel-gfx, linux-spi, linux-stm32, linux-arm-kernel, linux-tegra,
	linux-sound, linux-kernel, Andrew Morton, Randy Dunlap,
	Christian Brauner, David Howells, Luca Ceresoli, Kaito Cheng,
	Muchun Song, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Takashi Sakamoto,
	Andrzej Hajda, Jaroslav Kysela, Takashi Iwai
In-Reply-To: <9b98e860-11df-44bf-9a95-3046d2c274a6@linux.dev>

On Thu, Jun 11, 2026 at 12:42:02PM +0800, Kaitao Cheng wrote:
> On 2026/6/10 22:43, Andy Shevchenko wrote:
> > On Wed, Jun 10, 2026 at 02:14:06PM +0800, Kaitao Cheng wrote:
> >> On 2026/6/9 18:33, Christian König wrote:
> >>> On 6/9/26 08:13, Kaitao Cheng wrote:

> >>>> This series prepares for, and then updates, the list_for_each_entry()
> >>>> family so the common entry iterators cache their next or previous cursor
> >>>> before the loop body runs.
> >>>
> >>> Why in the world would we want to do that?
> >>>
> >>> The safe and non-safe variants have very distinct use cases and that is completely intentional.
> >>>
> >>> What we could maybe improve is the documentation; in my experience an astonishingly large number of people have misconceptions about the safe variants.
> >>>
> >>>> The first 13 patches open-code loops that intentionally depend on the
> >>>> old "derive the next entry from the current cursor at the end of the
> >>>> iteration" behaviour.  These loops append work to the list being walked,
> >>>> restart traversal after dropping a lock, skip an entry consumed by the
> >>>> current iteration, or otherwise adjust the cursor in the loop body.
> >>>
> >>> Well I have to clearly reject the changes for subsystems/components I'm maintaining, that just looks horrible to me and I clearly don't see a good reason for that.
> >>
> >> Hi Christian and Andy Shevchenko,
> >>
> >> Thanks for taking a look. I would like to clarify the point you raised.
> >>
> >> The reason I started looking at this is the original motivation behind
> >> the _safe() variants.  They exist because some users need to remove, move
> >> or otherwise consume the current entry while walking the list.  In that
> >> case the next cursor has to be preserved before the loop body can modify
> >> the current entry.
> >>
> >> The unfortunate part is that this could not be expressed with the
> >> existing list_for_each_entry() interface without changing its calling
> >> convention.  The _safe() variants had to grow an extra argument for the
> >> temporary cursor, and that is why we ended up with a separate family of
> >> macros.
> >>
> >> But conceptually, the distinction does not have to be exposed as two
> >> different iterator families forever.  The difference is an implementation
> >> detail: whether the iterator keeps the next/previous cursor before the
> >> body runs.  This series makes the common list_for_each_entry() iterators
> >> do that internally, so the safe and non-safe forms can effectively be
> >> folded together, or at least the need for a separate public _safe()
> >> interface becomes much weaker.
> >>
> >> There is also a usability issue with the current _safe() interface.  The
> >> caller is forced to define a temporary cursor outside the macro and pass
> >> it in, even though almost all users never use that cursor directly.  It is
> >> just boilerplate required by the macro implementation.  I find that
> >> redundant and awkward: the temporary cursor is an internal detail of the
> >> iteration, but every caller has to spell it out.
> > 
> > Ah, I think distinct macro families are what we want.
> > But the hiding of the parameter can be done inside list_for_each_*_safe().
> > You can do a treewide change with coccinelle.
> > 
> > Sorry if I didn't get the whole idea from your previous contributions.
> > 
> > Note, even cases that would need a temporary cursor may be switched to
> > new list_for_each_*_safe(), see how PCI macros for iterating over resources
> > are implemented (include/linux/pci.h).
> 
> Thanks for your suggestions. I've written a demo based on your feedback.
> Could you please review it and share your thoughts on this approach?

Have you checked how many users actually need the temporary storage?

> >> With the updated list_for_each_entry() implementation, that extra cursor
> >> can be kept inside the iterator itself.  Callers that only want to walk
> >> the list, including callers that delete or consume the current entry, no
> >> longer need to carry an otherwise-unused temporary variable just to make
> >> the macro work.
> >>
> >>>> The final patch changes include/linux/list.h to keep a private cursor in
> >>>> the common entry iterators while preserving the public macro interface.
> >>>> The safe variants remain available when callers need the temporary
> >>>> cursor explicitly or have stronger mutation requirements.

-- 
With Best Regards,
Andy Shevchenko




* Re: [PATCH v3] block: rust: fix `Send` bound for `GenDisk`
From: Miguel Ojeda @ 2026-06-11  7:32 UTC (permalink / raw)
  To: Yuan Tan
  Cc: a.hindborg, boqun, linux-block, rust-for-linux, zhiyunq, ardalan,
	pgovind2, dzueck, Yuan Tan
In-Reply-To: <20260611003220.3512652-1-yuantan098@gmail.com>

On Thu, Jun 11, 2026 at 2:32 AM Yuan Tan <yuantan098@gmail.com> wrote:
>
> I am not sure whether it is appropriate for me to take Andreas' patch and
> only adjust the trailers. Please correct me, and my apologies if this is
> not the right way to handle it.

It is OK to re-send a patch from someone else, but you need to make
sure you give the proper attribution etc.

In particular, you should keep the authorship from Andreas if he was
the author of the actual fix (though it sounds like Andreas is OK with
you as author due to v1 (?)), and you should also keep the existing
Signed-off-by (again, assuming you were actually picking his patch,
which I don't know if it is the case here), adding your own afterwards
because you carried the patch. And if you make any changes to the
patch, you are supposed to mention that too, inside square brackets,
etc.

Furthermore, if you reported the issue but you are not the author,
then you should use Reported-by for yourself too (even if you have a
Signed-off-by because you are re-sending his patch). Either way, a
Link tag to the original report would be nice if one is available.

In addition, the Fixes tag means there should likely be a Cc: stable
tag too, given the hash covers other stable releases, unless it
shouldn't be backported (in which case, it should be justified).

The document that explains these tags etc. is at:

  https://docs.kernel.org/process/submitting-patches.html

I hope that helps!

Cheers,
Miguel


* Re: [PATCH v2 00/14] list: Prepare entry iterators to cache cursor state
From: Kaitao Cheng @ 2026-06-11  7:36 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Christian König, Thierry Reding, Jonathan Hunter,
	Sowjanya Komatineni, Davidlohr Bueso, Paul E . McKenney,
	Josh Triplett, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Boqun Feng, Liam Girdwood, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Huang Rui, Eddie James, Mark Brown,
	Maxime Coquelin, Alexandre Torgue, Laxman Dewangan,
	Neil Armstrong, Robert Foss, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Laurent Pinchart,
	Jonas Karlman, Jernej Skrabec, Matthew Auld, Matthew Brost,
	Waiman Long, drbd-dev, linux-block, linux1394-devel, dri-devel,
	intel-gfx, linux-spi, linux-stm32, linux-arm-kernel, linux-tegra,
	linux-sound, linux-kernel, Andrew Morton, Randy Dunlap,
	Christian Brauner, David Howells, Luca Ceresoli, Kaito Cheng,
	Muchun Song, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Takashi Sakamoto,
	Andrzej Hajda, Jaroslav Kysela, Takashi Iwai
In-Reply-To: <aipbojSeMH-usARY@ashevche-desk.local>

On 2026/6/11 14:54, Andy Shevchenko wrote:
> On Thu, Jun 11, 2026 at 12:42:02PM +0800, Kaitao Cheng wrote:
>> On 2026/6/10 22:43, Andy Shevchenko wrote:
>>> On Wed, Jun 10, 2026 at 02:14:06PM +0800, Kaitao Cheng wrote:
>>>> On 2026/6/9 18:33, Christian König wrote:
>>>>> On 6/9/26 08:13, Kaitao Cheng wrote:
> 
>>>>>> This series prepares for, and then updates, the list_for_each_entry()
>>>>>> family so the common entry iterators cache their next or previous cursor
>>>>>> before the loop body runs.
>>>>>
>>>>> Why in the world would we want to do that?
>>>>>
>>>>> The safe and non-safe variants have very distinct use cases and that is completely intentional.
>>>>>
>>>>> What we could improve maybe is the documentation, from my experience an astonishing large amount of people have misconceptions about the safe variants.
>>>>>
>>>>>> The first 13 patches open-code loops that intentionally depend on the
>>>>>> old "derive the next entry from the current cursor at the end of the
>>>>>> iteration" behaviour.  These loops append work to the list being walked,
>>>>>> restart traversal after dropping a lock, skip an entry consumed by the
>>>>>> current iteration, or otherwise adjust the cursor in the loop body.
>>>>>
>>>>> Well I have to clearly reject the changes for subsystems/components I'm maintaining, that just looks horrible to me and I clearly don't see a good reason for that.
>>>>
>>>> Hi Christian and Andy Shevchenko,
>>>>
>>>> Thanks for taking a look. I would like to clarify the point you raised.
>>>>
>>>> The reason I started looking at this is the original motivation behind
>>>> the _safe() variants.  They exist because some users need to remove, move
>>>> or otherwise consume the current entry while walking the list.  In that
>>>> case the next cursor has to be preserved before the loop body can modify
>>>> the current entry.
>>>>
>>>> The unfortunate part is that this could not be expressed with the
>>>> existing list_for_each_entry() interface without changing its calling
>>>> convention.  The _safe() variants had to grow an extra argument for the
>>>> temporary cursor, and that is why we ended up with a separate family of
>>>> macros.
>>>>
>>>> But conceptually, the distinction does not have to be exposed as two
>>>> different iterator families forever.  The difference is an implementation
>>>> detail: whether the iterator keeps the next/previous cursor before the
>>>> body runs.  This series makes the common list_for_each_entry() iterators
>>>> do that internally, so the safe and non-safe forms can effectively be
>>>> folded together, or at least the need for a separate public _safe()
>>>> interface becomes much weaker.
>>>>
>>>> There is also a usability issue with the current _safe() interface.  The
>>>> caller is forced to define a temporary cursor outside the macro and pass
>>>> it in, even though almost all users never use that cursor directly.  It is
>>>> just boilerplate required by the macro implementation.  I find that
>>>> redundant and awkward: the temporary cursor is an internal detail of the
>>>> iteration, but every caller has to spell it out.
>>>
>>> Ah, I think the distinct macro families is that what we want.
>>> But the hiding of the parameter can be done inside list_for_each_*_safe().
>>> You can do a treewide change with coccinelle.
>>>
>>> Sorry if I didn't get the whole idea from your previous contributions.
>>>
>>> Note, even cases that would need a temporary cursor may be switched to
>>> new list_for_each_*_safe(), see how PCI macros for iterating over resources
>>> are implemented (include/linux/pci.h).
>>
>> Thanks for your suggestions. I've written a demo based on your feedback.
>> Could you please review it and share your thoughts on this approach?
> 
> Have you checked how many users actually need the temporary storage?

In Muchun's reply, he mentioned the following:

There are 9,925 list_for_each_entry() call sites in total. Among them,
9,919 do not require any adaptation, and only 6 need to be refactored:

As for list_for_each_entry_safe(), there are 4,572 callers. 4,550 of them
can be directly replaced by the new list_for_each_entry(), while 22 cannot
be replaced

https://lore.kernel.org/all/2B3BFA1E-08B8-42AB-87D6-A28BF15E5C58@linux.dev/


I only used Coccinelle to scan for list_for_each_entry() call sites, and
found the 13 call sites shown in the current patch series, which cover
the 6 cases mentioned in Muchun's email. I have not yet run the Coccinelle
scan for list_for_each_entry_safe().

If we need to handle all 9,925 list_for_each_entry() call sites or all 4,572
list_for_each_entry_safe() call sites in one go, would such a change be too
large? I expect it would affect almost every kernel subsystem.

I wonder whether it would be better to first provide the necessary
compatibility APIs, and then let each subsystem owner update their code as
appropriate. That would make the impact more controlled, similar to how
the current folio replacement of page is being handled.

>>>> With the updated list_for_each_entry() implementation, that extra cursor
>>>> can be kept inside the iterator itself.  Callers that only want to walk
>>>> the list, including callers that delete or consume the current entry, no
>>>> longer need to carry an otherwise-unused temporary variable just to make
>>>> the macro work.
>>>>
>>>>>> The final patch changes include/linux/list.h to keep a private cursor in
>>>>>> the common entry iterators while preserving the public macro interface.
>>>>>> The safe variants remain available when callers need the temporary
>>>>>> cursor explicitly or have stronger mutation requirements.
> 

-- 
Thanks
Kaitao Cheng


^ permalink raw reply

* [PATCH RFC 0/1] block: fix concurrent elevator change failure
From: Shin'ichiro Kawasaki @ 2026-06-11  7:41 UTC (permalink / raw)
  To: linux-block, Jens Axboe; +Cc: Ming Lei, Nilay Shroff, Shin'ichiro Kawasaki

I observed that the blktests test case block/005 hangs on a specific
server hardware using a specific HDD as a block device. During the test
case run, the kernel reported a KASAN null-ptr-deref (and other memory
corruption symptoms) [2]. This failure looked sporadic and hardware-
dependent.

From the kernel message, I noticed that udev-worker wrote to the
queue/scheduler sysfs attribute to change the IO scheduler, or elevator.
The test case block/005 also wrote to the same sysfs attribute, which
indicated that a concurrent elevator change caused the failure. I
created a new blktests test case that simply does the concurrent
elevator change with a null_blk device [1]. It recreates the failure in
a stable manner on various server hardware.

Using the new test case, I bisected and found that the failure first
appears at the commit 370ac285f23a ("block: avoid cpu_hotplug_lock
depedency on freeze_lock") in the kernel tag v6.17-rc3. However, that
commit does not appear to explain the failure by itself: it changed the
queue freeze behavior and probably only unveiled a race. Looking back
at the changes to elevator_change(), I think the actual cause is the
commit 559dc11143eb ("block: move elv_register[unregister]_queue out of
elevator_lock") in the kernel tag v6.16-rc1. This commit moved
elevator_change_done() out of the guard of ->elevator_lock and the queue
freeze. As a result, when two threads write to the same queue/scheduler
attribute concurrently, elevator_change_done() runs in parallel, causing
the memory corruption and the hang.

As a fix attempt, I created the patch in this series. It adds a new
mutex that serializes the whole elevator switch sequence, including the
elevator_change_done() call. I ran the reproducer with lockdep enabled
and confirmed that the patch avoids the failure and that no new WARN was
observed.
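
The locking idea described above can be sketched in userspace with
pthreads. This is only an illustration under stated assumptions, not the
kernel code: struct fake_queue, fake_change_done() and
fake_elevator_change() are hypothetical stand-ins, core_lock plays the
role of ->elevator_lock, and change_lock plays the role of the new outer
mutex that also covers the "done" step.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a request queue; not kernel code. */
struct fake_queue {
	pthread_mutex_t core_lock;	/* role of ->elevator_lock */
	pthread_mutex_t change_lock;	/* role of the new outer mutex */
	int switches;			/* number of completed core switches */
	int registered;			/* toggled by the "done" step */
	int in_done;			/* threads currently in the "done" step */
};

/* Runs outside core_lock, like the step that is raced in the report. */
static void fake_change_done(struct fake_queue *q)
{
	assert(q->in_done == 0);	/* holds only if callers are serialized */
	q->in_done++;
	q->registered = !q->registered;	/* unregister old, register new */
	q->in_done--;
}

static void fake_elevator_change(struct fake_queue *q)
{
	/* Outer lock first: it covers the whole sequence. */
	pthread_mutex_lock(&q->change_lock);

	/* Core switch, protected by the inner lock as before. */
	pthread_mutex_lock(&q->core_lock);
	q->switches++;
	pthread_mutex_unlock(&q->core_lock);

	/* The "done" step is outside core_lock but still serialized. */
	fake_change_done(q);

	pthread_mutex_unlock(&q->change_lock);
}

static void *writer(void *arg)
{
	struct fake_queue *q = arg;

	for (int i = 0; i < 10000; i++)
		fake_elevator_change(q);
	return NULL;
}

/* Two concurrent "sysfs writers"; returns the total number of switches. */
static int run_two_writers(void)
{
	struct fake_queue q = { .switches = 0, .registered = 0, .in_done = 0 };
	pthread_t a, b;

	pthread_mutex_init(&q.core_lock, NULL);
	pthread_mutex_init(&q.change_lock, NULL);
	pthread_create(&a, NULL, writer, &q);
	pthread_create(&b, NULL, writer, &q);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	pthread_mutex_destroy(&q.core_lock);
	pthread_mutex_destroy(&q.change_lock);
	return q.switches;
}
```

Calling run_two_writers() from a test program exercises two concurrent
writers. Dropping the change_lock calls in this sketch would let the two
threads enter fake_change_done() at the same time, which is the userspace
analogue of the race described above.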

However, the fix patch adds a new lock, and I'm not sure if it is the best
solution. Comments on the patch, or suggestions for a better solution,
would be appreciated.

[1] https://github.com/kawasaki/blktests/commit/4f8c63ed7d049f5e9c935c3fe00142b2a3629826

[2]

[30102.760660] [ T186170] run blktests block/005 at 2026-05-11 05:53:53
[30104.969837] [ T186111] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI
[30104.983590] [ T186111] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[30104.992929] [ T186111] CPU: 2 UID: 0 PID: 186111 Comm: (udev-worker) Not tainted 7.1.0-rc2-kts+ #1 PREEMPT(lazy)
[30105.004019] [ T186111] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015
[30105.013216] [ T186111] RIP: 0010:blk_mq_debugfs_register_sched+0x46/0x210
[30105.020667] [ T186111] Code: 48 89 fa 48 c1 ea 03 48 83 ec 10 80 3c 02 00 0f 85 83 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 8b 6b 08 48 89 ea 48 c1 ea 03 <80> 3c 02 00 0f 85 57 01 00 00 48 c7 c0 24 a3 b3 97 48 8b 6d 00 48
[30105.041036] [ T186111] RSP: 0018:ffff88816b9c7708 EFLAGS: 00010246
[30105.048111] [ T186111] RAX: dffffc0000000000 RBX: ffff888117f18000 RCX: 0000000000000000
[30105.057097] [ T186111] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff888117f18008
[30105.066086] [ T186111] RBP: 0000000000000000 R08: ffffffff957c47ac R09: fffffbfff2f6633c
[30105.075083] [ T186111] R10: ffff88816b9c7730 R11: 0000000000000001 R12: ffff88814c1f2000
[30105.084088] [ T186111] R13: ffff88814c1f2018 R14: ffff8881b8a336ac R15: ffffffff95bfae30
[30105.093111] [ T186111] FS:  00007fc1c7970c40(0000) GS:ffff8887c534e000(0000) knlGS:0000000000000000
[30105.103093] [ T186111] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[30105.110751] [ T186111] CR2: 000055fa37e182c0 CR3: 0000000108350003 CR4: 00000000001726f0
[30105.119796] [ T186111] Call Trace:
[30105.124154] [ T186111]  <TASK>
[30105.128301] [ T186111]  blk_mq_sched_reg_debugfs+0x8d/0x1a0
[30105.134193] [ T186111]  elevator_change_done+0x2f2/0x610
[30105.140037] [ T186111]  ? __pfx_elevator_change_done+0x10/0x10
[30105.146409] [ T186111]  ? __pfx_sysfs_kf_write+0x10/0x10
[30105.152246] [ T186111]  ? __pfx_sysfs_kf_write+0x10/0x10
[30105.158189] [ T186111]  elevator_change+0x283/0x4f0
[30105.163342] [ T186111]  ? __pfx_sysfs_kf_write+0x10/0x10
[30105.168932] [ T186111]  elv_iosched_store+0x30c/0x3a0
[30105.174265] [ T186111]  ? __pfx_elv_iosched_store+0x10/0x10
[30105.180797] [ T186111]  ? lock_acquire.part.0+0xb8/0x230
[30105.187066] [ T186111]  ? kernfs_fop_write_iter+0x25b/0x5e0
[30105.193594] [ T186111]  ? lock_acquire.part.0+0xb8/0x230
[30105.199931] [ T186111]  ? lock_acquire+0x126/0x140
[30105.205683] [ T186111]  ? __pfx_sysfs_kf_write+0x10/0x10
[30105.211924] [ T186111]  queue_attr_store+0x23f/0x360
[30105.217796] [ T186111]  ? __pfx_queue_attr_store+0x10/0x10
[30105.224180] [ T186111]  ? __lock_acquire+0x55d/0xbd0
[30105.230049] [ T186111]  ? lock_acquire.part.0+0xb8/0x230
[30105.236247] [ T186111]  ? sysfs_file_kobj+0x1d/0x1b0
[30105.242093] [ T186111]  ? find_held_lock+0x2b/0x80
[30105.247763] [ T186111]  ? __lock_release.isra.0+0x59/0x170
[30105.254122] [ T186111]  ? lock_release.part.0+0x1c/0x50
[30105.260226] [ T186111]  ? sysfs_file_kobj+0xb9/0x1b0
[30105.266048] [ T186111]  ? sysfs_kf_write+0x65/0x170
[30105.271778] [ T186111]  ? __pfx_sysfs_kf_write+0x10/0x10
[30105.277934] [ T186111]  kernfs_fop_write_iter+0x3da/0x5e0
[30105.284173] [ T186111]  ? __pfx_kernfs_fop_write_iter+0x10/0x10
[30105.290926] [ T186111]  vfs_write+0x524/0x1010
[30105.296215] [ T186111]  ? __pfx_vfs_write+0x10/0x10
[30105.301905] [ T186111]  ? kasan_quarantine_put+0xf5/0x240
[30105.308092] [ T186111]  ? kasan_quarantine_put+0xf5/0x240
[30105.314246] [ T186111]  ksys_write+0xff/0x200
[30105.319331] [ T186111]  ? __pfx_ksys_write+0x10/0x10
[30105.325007] [ T186111]  do_syscall_64+0xf4/0x1550
[30105.330407] [ T186111]  ? __pfx___x64_sys_openat+0x10/0x10
[30105.336566] [ T186111]  ? seccomp_run_filters+0xeb/0x560
[30105.342517] [ T186111]  ? do_syscall_64+0x1d7/0x1550
[30105.348096] [ T186111]  ? __seccomp_filter+0xa2/0x920
[30105.353749] [ T186111]  ? __pfx___seccomp_filter+0x10/0x10
[30105.359830] [ T186111]  ? trace_hardirqs_on_prepare+0x150/0x1a0
[30105.366344] [ T186111]  ? do_syscall_64+0x1b9/0x1550
[30105.371892] [ T186111]  ? do_syscall_64+0x1d7/0x1550
[30105.377422] [ T186111]  ? do_syscall_64+0x1d7/0x1550
[30105.382922] [ T186111]  ? do_syscall_64+0x1b9/0x1550
[30105.388401] [ T186111]  ? do_syscall_64+0x34/0x1550
[30105.393777] [ T186111]  ? do_syscall_64+0xab/0x1550
[30105.399129] [ T186111]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[30105.405624] [ T186111] RIP: 0033:0x7fc1c7c4fbbe
[30105.410647] [ T186111] Code: 4d 89 d8 e8 34 bd 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 13 ff ff ff 0f 1f 00 f3 0f 1e fa
[30105.431611] [ T186111] RSP: 002b:00007ffefd3bdd90 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[30105.440716] [ T186111] RAX: ffffffffffffffda RBX: 000055fa3f0f4b80 RCX: 00007fc1c7c4fbbe
[30105.449404] [ T186111] RDX: 000000000000000b RSI: 000055fa3ed9d550 RDI: 0000000000000015
[30105.458090] [ T186111] RBP: 00007ffefd3bdda0 R08: 0000000000000000 R09: 0000000000000000
[30105.466787] [ T186111] R10: 0000000000000000 R11: 0000000000000202 R12: 000000000000000b
[30105.475479] [ T186111] R13: 000000000000000b R14: 000055fa3ed9d550 R15: 000055fa3ed9d550
[30105.484182] [ T186111]  </TASK>
[30105.487920] [ T186111] Modules linked in: iscsi_target_mod tcm_loop target_core_pscsi target_core_file target_core_iblock xfs nft_masq nft_reject_ipv4 act_csum cls_u32 sch_htb nf_nat_tftp nf_conntrack_tftp bridge stp llc target_core_user target_core_mod rfkill nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security nf_tables ip6table_filter ip6_tables iptable_filter ip_tables qrtr intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt intel_pmc_bxt kvm_intel kvm irqbypass rapl sunrpc intel_cstate intel_uncore pcspkr i2c_i801 i2c_smbus mei_me igb lpc_ich mei ioatdma dca wmi binfmt_misc joydev acpi_power_meter acpi_pad btrfs raid6_pq xor ses enclosure loop dm_multipath nfnetlink zram lz4hc_compress lz4_compress
[30105.488278] [ T186111]  zstd_compress ast drm_client_lib i2c_algo_bit drm_shmem_helper drm_kms_helper mpt3sas drm mpi3mr raid_class scsi_transport_sas scsi_dh_rdac scsi_dh_emc scsi_dh_alua i2c_dev fuse [last unloaded: zonefs]
[30105.609649] [ T186111] ---[ end trace 0000000000000000 ]---
[30105.648290] [ T186111] pstore: backend (erst) writing error (-28)
[30105.654739] [ T186111] RIP: 0010:blk_mq_debugfs_register_sched+0x46/0x210
[30105.662519] [ T186111] Code: 48 89 fa 48 c1 ea 03 48 83 ec 10 80 3c 02 00 0f 85 83 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 8b 6b 08 48 89 ea 48 c1 ea 03 <80> 3c 02 00 0f 85 57 01 00 00 48 c7 c0 24 a3 b3 97 48 8b 6d 00 48
[30105.683653] [ T186111] RSP: 0018:ffff88816b9c7708 EFLAGS: 00010246
[30105.691248] [ T186111] RAX: dffffc0000000000 RBX: ffff888117f18000 RCX: 0000000000000000
[30105.700121] [ T186111] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff888117f18008
[30105.708841] [ T186111] RBP: 0000000000000000 R08: ffffffff957c47ac R09: fffffbfff2f6633c
[30105.717829] [ T186111] R10: ffff88816b9c7730 R11: 0000000000000001 R12: ffff88814c1f2000
[30105.726550] [ T186111] R13: ffff88814c1f2018 R14: ffff8881b8a336ac R15: ffffffff95bfae30
[30105.735306] [ T186111] FS:  00007fc1c7970c40(0000) GS:ffff8887c54ce000(0000) knlGS:0000000000000000
[30105.745003] [ T186111] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[30105.752368] [ T186111] CR2: 00007f251f9bc0e8 CR3: 0000000108350002 CR4: 00000000001726f0


Shin'ichiro Kawasaki (1):
  block: serialize whole elevator change steps for the same queue

 block/blk-core.c       | 1 +
 block/elevator.c       | 9 +++++++++
 include/linux/blkdev.h | 7 +++++++
 3 files changed, 17 insertions(+)

-- 
2.54.0


^ permalink raw reply

* [PATCH RFC 1/1] block: serialize whole elevator change steps for the same queue
From: Shin'ichiro Kawasaki @ 2026-06-11  7:42 UTC (permalink / raw)
  To: linux-block, Jens Axboe; +Cc: Ming Lei, Nilay Shroff, Shin'ichiro Kawasaki
In-Reply-To: <20260611074200.474676-1-shinichiro.kawasaki@wdc.com>

When elevator_change() is called concurrently for the same queue, the
elevator_change_done() function runs concurrently as well. This function
adds or deletes kobjects for the debugfs entry of the queue, so the
concurrent calls corrupt the memory of the kobjects and result in a
process hang. The core part of the elevator switch is protected by queue
freeze and q->elevator_lock. However, since commit 559dc11143eb
("block: move elv_register[unregister]_queue out of elevator_lock"),
elevator_change_done() is no longer serialized, hence the memory
corruption and the hang.

The failures are observed when udev-worker writes to a sysfs
queue/scheduler attribute file while the blktests test case block/005
writes to the same attribute file. The failure also can be recreated by
running two processes that write to the same queue/scheduler file
concurrently. The failure is observed since another commit 370ac285f23a
("block: avoid cpu_hotplug_lock depedency on freeze_lock"). This commit
changed the behavior of queue freeze and it unveiled the failure.

Fix the failure by adding a new per-queue lock 'elevator_queue_lock',
which serializes the whole elevator switch steps for the same queue
including the elevator_change_done() call.

Fixes: 559dc11143eb ("block: move elv_register[unregister]_queue out of elevator_lock")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 block/blk-core.c       | 1 +
 block/elevator.c       | 9 +++++++++
 include/linux/blkdev.h | 7 +++++++
 3 files changed, 17 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 17450058ea6d..c6418889897a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -430,6 +430,7 @@ struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id)
 	refcount_set(&q->refs, 1);
 	mutex_init(&q->debugfs_mutex);
 	mutex_init(&q->elevator_lock);
+	mutex_init(&q->elevator_queue_lock);
 	mutex_init(&q->sysfs_lock);
 	mutex_init(&q->limits_lock);
 	mutex_init(&q->rq_qos_mutex);
diff --git a/block/elevator.c b/block/elevator.c
index 3bcd37c2aa34..65bdea27aa8a 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -665,6 +665,13 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
 			return ret;
 	}

+	/*
+	 * Acquire elevator_queue_lock to serialize the debugfs (un)register
+	 * steps for the same queue. The elevator switch core part is protected
+	 * by queue freezing and ->elevator_lock.
+	 */
+	mutex_lock(&q->elevator_queue_lock);
+
 	memflags = blk_mq_freeze_queue(q);
 	/*
 	 * May be called before adding disk, when there isn't any FS I/O,
@@ -690,6 +697,8 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
 	if (!ctx->new)
 		blk_mq_free_sched_res(&ctx->res, ctx->type, set);

+	mutex_unlock(&q->elevator_queue_lock);
+
 	return ret;
 }

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 890128cdea1c..cfeddd3ded95 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -606,6 +606,13 @@ struct request_queue {
 	 */
 	struct mutex		elevator_lock;

+	/*
+	 * Serializes the whole elevator change operation for the same queue,
+	 * including the debugfs (un)register steps. Must be acquired before
+	 * freezing the queue and acquiring elevator_lock.
+	 */
+	struct mutex		elevator_queue_lock;
+
 	struct mutex		sysfs_lock;
 	/*
 	 * Protects queue limits and also sysfs attribute read_ahead_kb.
-- 
2.54.0

^ permalink raw reply related

* Re: [PATCH v2 00/14] list: Prepare entry iterators to cache cursor state
From: Andy Shevchenko @ 2026-06-11  7:52 UTC (permalink / raw)
  To: Kaitao Cheng
  Cc: Christian König, Thierry Reding, Jonathan Hunter,
	Sowjanya Komatineni, Davidlohr Bueso, Paul E . McKenney,
	Josh Triplett, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Boqun Feng, Liam Girdwood, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Huang Rui, Eddie James, Mark Brown,
	Maxime Coquelin, Alexandre Torgue, Laxman Dewangan,
	Neil Armstrong, Robert Foss, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Laurent Pinchart,
	Jonas Karlman, Jernej Skrabec, Matthew Auld, Matthew Brost,
	Waiman Long, drbd-dev, linux-block, linux1394-devel, dri-devel,
	intel-gfx, linux-spi, linux-stm32, linux-arm-kernel, linux-tegra,
	linux-sound, linux-kernel, Andrew Morton, Randy Dunlap,
	Christian Brauner, David Howells, Luca Ceresoli, Kaito Cheng,
	Muchun Song, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Takashi Sakamoto,
	Andrzej Hajda, Jaroslav Kysela, Takashi Iwai
In-Reply-To: <83ba73d8-27d3-4ee9-a143-7dfe4cb827be@linux.dev>

On Thu, Jun 11, 2026 at 03:36:01PM +0800, Kaitao Cheng wrote:
> On 2026/6/11 14:54, Andy Shevchenko wrote:
> > On Thu, Jun 11, 2026 at 12:42:02PM +0800, Kaitao Cheng wrote:
> >> On 2026/6/10 22:43, Andy Shevchenko wrote:
> >>> On Wed, Jun 10, 2026 at 02:14:06PM +0800, Kaitao Cheng wrote:
> >>>> On 2026/6/9 18:33, Christian König wrote:
> >>>>> On 6/9/26 08:13, Kaitao Cheng wrote:
> > 
> >>>>>> This series prepares for, and then updates, the list_for_each_entry()
> >>>>>> family so the common entry iterators cache their next or previous cursor
> >>>>>> before the loop body runs.
> >>>>>
> >>>>> Why in the world would we want to do that?
> >>>>>
> >>>>> The safe and non-safe variants have very distinct use cases and that is completely intentional.
> >>>>>
> >>>>> What we could improve maybe is the documentation, from my experience an astonishing large amount of people have misconceptions about the safe variants.
> >>>>>
> >>>>>> The first 13 patches open-code loops that intentionally depend on the
> >>>>>> old "derive the next entry from the current cursor at the end of the
> >>>>>> iteration" behaviour.  These loops append work to the list being walked,
> >>>>>> restart traversal after dropping a lock, skip an entry consumed by the
> >>>>>> current iteration, or otherwise adjust the cursor in the loop body.
> >>>>>
> >>>>> Well I have to clearly reject the changes for subsystems/components I'm maintaining, that just looks horrible to me and I clearly don't see a good reason for that.
> >>>>
> >>>> Hi Christian and Andy Shevchenko,
> >>>>
> >>>> Thanks for taking a look. I would like to clarify the point you raised.
> >>>>
> >>>> The reason I started looking at this is the original motivation behind
> >>>> the _safe() variants.  They exist because some users need to remove, move
> >>>> or otherwise consume the current entry while walking the list.  In that
> >>>> case the next cursor has to be preserved before the loop body can modify
> >>>> the current entry.
> >>>>
> >>>> The unfortunate part is that this could not be expressed with the
> >>>> existing list_for_each_entry() interface without changing its calling
> >>>> convention.  The _safe() variants had to grow an extra argument for the
> >>>> temporary cursor, and that is why we ended up with a separate family of
> >>>> macros.
> >>>>
> >>>> But conceptually, the distinction does not have to be exposed as two
> >>>> different iterator families forever.  The difference is an implementation
> >>>> detail: whether the iterator keeps the next/previous cursor before the
> >>>> body runs.  This series makes the common list_for_each_entry() iterators
> >>>> do that internally, so the safe and non-safe forms can effectively be
> >>>> folded together, or at least the need for a separate public _safe()
> >>>> interface becomes much weaker.
> >>>>
> >>>> There is also a usability issue with the current _safe() interface.  The
> >>>> caller is forced to define a temporary cursor outside the macro and pass
> >>>> it in, even though almost all users never use that cursor directly.  It is
> >>>> just boilerplate required by the macro implementation.  I find that
> >>>> redundant and awkward: the temporary cursor is an internal detail of the
> >>>> iteration, but every caller has to spell it out.
> >>>
> >>> Ah, I think the distinct macro families is that what we want.
> >>> But the hiding of the parameter can be done inside list_for_each_*_safe().
> >>> You can do a treewide change with coccinelle.
> >>>
> >>> Sorry if I didn't get the whole idea from your previous contributions.
> >>>
> >>> Note, even cases that would need a temporary cursor may be switched to
> >>> new list_for_each_*_safe(), see how PCI macros for iterating over resources
> >>> are implemented (include/linux/pci.h).
> >>
> >> Thanks for your suggestions. I've written a demo based on your feedback.
> >> Could you please review it and share your thoughts on this approach?
> > 
> > Have you checked how many users actually need the temporary storage?
> 
> In Muchun's reply, he mentioned the following:
> 
> There are 9,925 list_for_each_entry() call sites in total. Among them,
> 9,919 do not require any adaptation, and only 6 need to be refactored:
> 
> As for list_for_each_entry_safe(), there are 4,572 callers. 4,550 of them
> can be directly replaced by the new list_for_each_entry(), while 22 cannot
> be replaced
> 
> https://lore.kernel.org/all/2B3BFA1E-08B8-42AB-87D6-A28BF15E5C58@linux.dev/
> 
> I only used Coccinelle to scan for list_for_each_entry() call sites, and
> found the 13 call sites shown in the current patch series, which cover
> the 6 cases mentioned in Muchun's email. I have not yet run the Coccinelle
> scan for list_for_each_entry_safe().
> 
> If we need to handle all 9,925 list_for_each_entry() call sites or all 4,572
> list_for_each_entry_safe() call sites in one go, would such a change be too
> large? I expect it would affect almost every kernel subsystem.

If it's done by Linus himself during the day when he prepares -rc1, it's fine.
You would need to provide a good justification for the change, though.

But the statistics above show 4572 vs. 4550, so the first step is to investigate
why a temporary cursor is used in those 22 cases and what we can do to avoid that.

> I wonder whether it would be better to first provide the necessary
> compatibility APIs, and then let each subsystem owner update their code as
> appropriate. That would make the impact more controlled, similar to how
> the current folio replacement of page is being handled.
> 
> >>>> With the updated list_for_each_entry() implementation, that extra cursor
> >>>> can be kept inside the iterator itself.  Callers that only want to walk
> >>>> the list, including callers that delete or consume the current entry, no
> >>>> longer need to carry an otherwise-unused temporary variable just to make
> >>>> the macro work.
> >>>>
> >>>>>> The final patch changes include/linux/list.h to keep a private cursor in
> >>>>>> the common entry iterators while preserving the public macro interface.
> >>>>>> The safe variants remain available when callers need the temporary
> >>>>>> cursor explicitly or have stronger mutation requirements.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply

* Re: [PATCH v2 00/14] list: Prepare entry iterators to cache cursor state
From: Christian König @ 2026-06-11  8:01 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Kaitao Cheng, Thierry Reding, Jonathan Hunter,
	Sowjanya Komatineni, Davidlohr Bueso, Paul E . McKenney,
	Josh Triplett, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Boqun Feng, Liam Girdwood, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Huang Rui, Eddie James, Mark Brown,
	Maxime Coquelin, Alexandre Torgue, Laxman Dewangan,
	Neil Armstrong, Robert Foss, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Laurent Pinchart,
	Jonas Karlman, Jernej Skrabec, Matthew Auld, Matthew Brost,
	Waiman Long, drbd-dev, linux-block, linux1394-devel, dri-devel,
	intel-gfx, linux-spi, linux-stm32, linux-arm-kernel, linux-tegra,
	linux-sound, linux-kernel, Andrew Morton, Randy Dunlap,
	Christian Brauner, David Howells, Luca Ceresoli, Kaito Cheng,
	Muchun Song, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Takashi Sakamoto,
	Andrzej Hajda, Jaroslav Kysela, Takashi Iwai
In-Reply-To: <ail8iNvPrJnE7p58@ashevche-desk.local>

On 6/10/26 17:02, Andy Shevchenko wrote:
> On Wed, Jun 10, 2026 at 11:11:34AM +0200, Christian König wrote:
>> On 6/10/26 10:18, Kaitao Cheng wrote:
>>> On 2026/6/10 16:07, Christian König wrote:
> 
> ...
> 
>>> Should we revert to v1, or keep list_for_each_entry() and
>>> list_for_each_entry_safe() as they are, close this thread, and make no
>>> changes?
>>>
>>> Link to v1:
>>> https://lore.kernel.org/all/20260529082149.76764-1-kaitao.cheng@linux.dev/
>>>
>>> Or do you have any better suggestions?
>>
>> v1 looks perfectly reasonable to me.
> 
> But why not just hiding that once for all (in case they don't use the temporary
> iterator)? Easy to automate, robust — everyone is happy?

As far as I can see that is an extremely bad idea.

The distinction between the use cases of 'iterating the list' and 'iterating the list while you modify it' is completely intentional.

See, the bool type could be implemented as int as well, but it serves a different use case.

Regards,
Christian.

> 
>> You should just include some patches in the same patch set to actually use
>> the new macros.
>>
>> If you modify the files under drivers/dma-buf or drivers/gpu/drm/amd to use
>> the new macro I'm happy to review that.
> 


^ permalink raw reply

* Re: [PATCH v2 00/14] list: Prepare entry iterators to cache cursor state
From: Andy Shevchenko @ 2026-06-11  8:29 UTC (permalink / raw)
  To: Christian König
  Cc: Kaitao Cheng, Thierry Reding, Jonathan Hunter,
	Sowjanya Komatineni, Davidlohr Bueso, Paul E . McKenney,
	Josh Triplett, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Boqun Feng, Liam Girdwood, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Huang Rui, Eddie James, Mark Brown,
	Maxime Coquelin, Alexandre Torgue, Laxman Dewangan,
	Neil Armstrong, Robert Foss, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Laurent Pinchart,
	Jonas Karlman, Jernej Skrabec, Matthew Auld, Matthew Brost,
	Waiman Long, drbd-dev, linux-block, linux1394-devel, dri-devel,
	intel-gfx, linux-spi, linux-stm32, linux-arm-kernel, linux-tegra,
	linux-sound, linux-kernel, Andrew Morton, Randy Dunlap,
	Christian Brauner, David Howells, Luca Ceresoli, Kaito Cheng,
	Muchun Song, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Takashi Sakamoto,
	Andrzej Hajda, Jaroslav Kysela, Takashi Iwai
In-Reply-To: <92683537-8404-47fe-a4ba-160e54870f0b@amd.com>

On Thu, Jun 11, 2026 at 10:01:25AM +0200, Christian König wrote:
> On 6/10/26 17:02, Andy Shevchenko wrote:
> > On Wed, Jun 10, 2026 at 11:11:34AM +0200, Christian König wrote:
> >> On 6/10/26 10:18, Kaitao Cheng wrote:
> >>> On 2026/6/10 16:07, Christian König wrote:

...

> >>> Should we revert to v1, or keep list_for_each_entry() and
> >>> list_for_each_entry_safe() as they are, close this thread, and make no
> >>> changes?
> >>>
> >>> Link to v1:
> >>> https://lore.kernel.org/all/20260529082149.76764-1-kaitao.cheng@linux.dev/
> >>>
> >>> Or do you have any better suggestions?
> >>
> >> v1 looks perfectly reasonable to me.
> > 
> > But why not just hiding that once for all (in case they don't use the temporary
> > iterator)? Easy to automate, robust — everyone is happy?
> 
> As far as I can see that is an extremely bad idea.
> 
> The distinction between the use cases of 'iterating the list' and 'iterating
> the list while you modify it' is completely intentional.

What I meant is to keep the name and just drop the parameter (make it hidden
and define it inside the list_for_each_*_safe() macros).

> See the bool type can be implemented by int as well, but it is just a
> different use case.

> >> You should just include some patches in the same patch set to actually use
> >> the new macros.
> >>
> >> If you modify the files under drivers/dma-buf or drivers/gpu/drm/amd to use
> >> the new macro I'm happy to review that.
> > 
> 

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply

* Re: [PATCH v2 00/14] list: Prepare entry iterators to cache cursor state
From: Christian König @ 2026-06-11  8:39 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Kaitao Cheng, Thierry Reding, Jonathan Hunter,
	Sowjanya Komatineni, Davidlohr Bueso, Paul E . McKenney,
	Josh Triplett, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Boqun Feng, Liam Girdwood, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Huang Rui, Eddie James, Mark Brown,
	Maxime Coquelin, Alexandre Torgue, Laxman Dewangan,
	Neil Armstrong, Robert Foss, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Laurent Pinchart,
	Jonas Karlman, Jernej Skrabec, Matthew Auld, Matthew Brost,
	Waiman Long, drbd-dev, linux-block, linux1394-devel, dri-devel,
	intel-gfx, linux-spi, linux-stm32, linux-arm-kernel, linux-tegra,
	linux-sound, linux-kernel, Andrew Morton, Randy Dunlap,
	Christian Brauner, David Howells, Luca Ceresoli, Kaito Cheng,
	Muchun Song, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Takashi Sakamoto,
	Andrzej Hajda, Jaroslav Kysela, Takashi Iwai
In-Reply-To: <aipx1goKIsk40vrF@ashevche-desk.local>

On 6/11/26 10:29, Andy Shevchenko wrote:
> On Thu, Jun 11, 2026 at 10:01:25AM +0200, Christian König wrote:
>> On 6/10/26 17:02, Andy Shevchenko wrote:
>>> On Wed, Jun 10, 2026 at 11:11:34AM +0200, Christian König wrote:
>>>> On 6/10/26 10:18, Kaitao Cheng wrote:
>>>>> On 2026/6/10 16:07, Christian König wrote:
> 
> ...
> 
>>>>> Should we revert to v1, or keep list_for_each_entry() and
>>>>> list_for_each_entry_safe() as they are, close this thread, and make no
>>>>> changes?
>>>>>
>>>>> Link to v1:
>>>>> https://lore.kernel.org/all/20260529082149.76764-1-kaitao.cheng@linux.dev/
>>>>>
>>>>> Or do you have any better suggestions?
>>>>
>>>> v1 looks perfectly reasonable to me.
>>>
>>> But why not just hiding that once for all (in case they don't use the temporary
>>> iterator)? Easy to automate, robust — everyone is happy?
>>
>> As far as I can see that is an extremely bad idea.
>>
>> The distinction between the use cases of 'iterating the list' and 'iterating
>> the list while you modify it' is completely intentional.
> 
> What I meant is to keep the name, just drop the parameter (make it hidden and
> being defined inside list_for_each_*_safe() cases).

Ah, sorry, I still thought the suggestion was to merge list_for_each_entry() and list_for_each_entry_safe().

Whether the modification is done all at once or in steps doesn't really matter to me, as long as the patch can be re-created reproducibly.

But I'm wondering if we couldn't improve the name at the same time. The _safe() postfix has caused tons of confusion, especially among beginners who thought it was a thread-safe variant, which it clearly isn't.

The _mutable() postfix sounds like a much better description of what happens here.

Regards,
Christian.

> 
>> See the bool type can be implemented by int as well, but it is just a
>> different use case.
> 
>>>> You should just include some patches in the same patch set to actually use
>>>> the new macros.
>>>>
>>>> If you modify the files under drivers/dma-buf or drivers/gpu/drm/amd to use
>>>> the new macro I'm happy to review that.
>>>
>>
> 
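For readers less familiar with the distinction discussed above, here is a
minimal userspace sketch (not kernel code) of why a loop that removes the
current entry has to read the successor before the body runs. The list
helpers and the names item, remove_even() and remaining_after_prune() are
made up for illustration; they only mimic the shape of the kernel's list API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical payload type used only for this illustration. */
struct item {
	int value;
	struct list_head node;
};

static void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

#define item_of(ptr) \
	((struct item *)((char *)(ptr) - offsetof(struct item, node)))

/*
 * Remove and free every item with an even value. The successor is read
 * before the body runs, so freeing the current item is harmless. Deriving
 * the successor from the freed item instead would be a use-after-free.
 */
static int remove_even(struct list_head *head)
{
	struct list_head *pos, *next;
	int removed = 0;

	for (pos = head->next, next = pos->next; pos != head;
	     pos = next, next = pos->next) {
		struct item *it = item_of(pos);

		if (it->value % 2 == 0) {
			list_del(pos);
			free(it);
			removed++;
		}
	}
	return removed;
}

/* Build a list holding 1..n, prune the even values, count what is left. */
static int remaining_after_prune(int n)
{
	struct list_head head, *pos, *next;
	int left = 0;

	list_init(&head);
	for (int i = 1; i <= n; i++) {
		struct item *it = malloc(sizeof(*it));

		assert(it);
		it->value = i;
		list_add_tail(&it->node, &head);
	}
	remove_even(&head);
	for (pos = head.next; pos != &head; pos = next) {
		next = pos->next;
		free(item_of(pos));
		left++;
	}
	return left;
}
```

Nothing in this loop involves locking, which is the point made about the
naming: the cached cursor only protects against the body removing the
current entry, not against concurrent modification by another thread.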


^ permalink raw reply

* Re: [PATCH] iomap: enforce DIO alignment check in iomap
From: Carlos Maiolino @ 2026-06-11 10:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: brauner, linux-block, linux-fsdevel, linux-ext4, linux-xfs,
	Keith Busch, Hannes Reinecke, Martin K. Petersen, Jens Axboe
In-Reply-To: <20260611055744.GA18538@lst.de>

On Thu, Jun 11, 2026 at 07:57:44AM +0200, Christoph Hellwig wrote:
> On Wed, Jun 10, 2026 at 04:52:11PM +0200, cem@kernel.org wrote:
> > From: Carlos Maiolino <cem@kernel.org>
> > 
> > The DIO alignment check has been lifted from iomap layer to rely on the
> > block layer to enforce proper alignment when issuing direct IO
> > operations. This though, depending on the IO size and buffer address
> > passed to the IO operation may lead to user-visible behavior change.
> > 
> > This has been caught initially by LTP test diotest4 running on
> > PPC architecture, where the test fails because a read() operation
> > with a supposedly misaligned buffer succeeds instead of an expected
> > -EINVAL.
> > This has no direct relationship with PPC, but seems to do with the
> > IO size crossing page borders or not.
> 
> I don't understand the problem here.  Why do we want to insist on a
> failure when we can support it?  I think the test is just broken.

The problem I see here from my POV is this changed the behavior expected
from the syscalls when the passed in buffer is misaligned as the read()
(in the test) succeeds when the passed in buffer does not match the
alignment requirements (see below).

I am pretty happy to declare this a test bug, but I thought it would be
worth starting a discussion about the sudden/unexpected behavior change.
Not to mention now different filesystems will have different alignment
requirements which seems at least "weird" to me. I mean, now suddenly
iomap-based filesystems have a more relaxed alignment constraint than
for example btrfs.

> 
> > The problematic behavior is reproducible on x86 by reducing the IO size
> > to something < PAGE_SIZE, so the misaligned read()s will also be accepted
> > by the block layer.
> 
> What do you mean with misaligned here?  For a long time the kernel
> supports basically arbitrary low memory alignment for direct I/O,
> just bounded by the device capabilities (typical 4 byte alignment).

The test passes read() a buffer misaligned by 1 byte (see below), which
doesn't match the system's alignment constraints, at least from the
perspective of the user-passed buffer.
I've been assuming it should match device's dma_alignment constraints.
The typical 4 byte alignment indeed is the requirement from my PPC
machine, but not for my x86:

> 
> The supported memory alignment is reported in the statx
> dio_mem_align.  What does that say compared to the alignment
> expectations in this test?

From my x86:
dio_mem_align: 512
dio_offset_align: 512

From PPC:
dio_mem_align: 4
dio_offset_align: 512

But this does not explain how the following call would succeed in either
case (below one taken from PPC):

openat(dirfd=AT_FDCWD, pathname="testdata-4.135256", flags=O_RDWR|O_DIRECT) = 3
_llseek(fd=3, offset=4096, result=[4096], whence=SEEK_SET) = 0
read(arg1=0x3, arg2=0x1003af80001, arg3=0x1000) = 0x1000

The passed in address 0x1003af80001 is one byte misaligned and shouldn't
(at least in theory) ever be accepted no? Or am I missing something
else?
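The alignment arithmetic behind that question can be sketched in a few lines
of C. This is only an illustration of the check being discussed, not the
kernel's implementation; the helper name addr_is_aligned() is made up, and the
alignment values are the dio_mem_align numbers reported above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if x is a non-zero power of two. */
static bool is_pow2(uint64_t x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

/*
 * Hypothetical helper: does a user buffer address satisfy a power-of-two
 * memory alignment such as the one reported in statx dio_mem_align?
 */
static bool addr_is_aligned(uint64_t addr, uint64_t align)
{
	assert(is_pow2(align));
	return (addr & (align - 1)) == 0;
}
```

With the buffer address from the trace, addr_is_aligned(0x1003af80001, 4) and
addr_is_aligned(0x1003af80001, 512) are both false, so the address satisfies
neither the PPC nor the x86 dio_mem_align value quoted above.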

^ permalink raw reply

* Re: [PATCH v3 3/4] block: drop shared-tag fairness throttling
From: Sumit Saxena @ 2026-06-11 10:43 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Martin K . Petersen, Jens Axboe,
	James E . J . Bottomley, linux-scsi, linux-block, Adam Radford,
	Khalid Aziz, Adaptec OEM Raid Solutions, Matthew Wilcox,
	Hannes Reinecke, Juergen E . Fischer, Russell King,
	linux-arm-kernel, Finn Thain, Michael Schmitz, Anil Gurumurthy,
	Sudarsana Kalluru, Oliver Neukum, Ali Akcaagac, Jamie Lenehan,
	Ram Vegesna, target-devel, Bradley Grove, Satish Kharat,
	Sesidhar Baddela, Karan Tilak Kumar, Yihang Li, Don Brace,
	storagedev, HighPoint Linux Team, Tyrel Datwyler,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, linuxppc-dev, Brian King, Lee Duncan,
	Chris Leech, Mike Christie, open-iscsi, Justin Tee, Paul Ely,
	Kashyap Desai, Shivasharan S, Chandrakanth Patil,
	megaraidlinux.pdl, Sathya Prakash Veerichetty, Sreekanth Reddy,
	mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani, Ranjan Kumar,
	MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori, YOKOTA Hiroshi,
	Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
	Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Stefan Hajnoczi,
	Eugenio Perez, virtualization, Vishal Bhakta,
	bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, xen-devel, Bart Van Assche
In-Reply-To: <aimSb9I0Vl-68hy9@kbusch-mbp>


On Wed, Jun 10, 2026 at 10:06 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Wed, Jun 10, 2026 at 09:16:11PM +0530, Sumit Saxena wrote:
> > The motivation for this change stems from performance issue we
> > encountered due to false sharing of the 'nr_active_requests_shared_tags'
> > counter on certain CPU architectures. I initially submitted a patch to
> > move that counter to its own cache line to avoid conflicts with
> > 'nr_requests' and other hot fields (see:
> > https://patchwork.kernel.org/project/linux-scsi/patch/20260402074637.92417-3-sumit.saxena@broadcom.com/
> > ).
> >
> > During the review, Bart shared his work, which eliminates the
> > counter entirely by removing the fairness throttling. My testing
> > confirmed that this approach resolved the performance issues and
> > improved IOPS.
> > This patch is part of a larger set, and I have reported the cumulative
> > performance improvements in the cover letter.
>
> So the problem is just the atomic operation accounting overhead? I
> previously thought the device just really needed to consume all the tags
> to hit performance.
That's correct, it's the atomic operation accounting overhead.

Thanks,
Sumit
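As background on the false sharing mentioned in the quoted text, the sketch
below shows in plain C11 how giving a hot counter its own cache line changes
the layout. It is an illustration only: the struct and field names are
hypothetical stand-ins, and the 64-byte line size is an assumption that does
not hold on every architecture.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64 /* assumption: 64-byte cache lines */

/* Both fields are adjacent, so they can share one cache line. */
struct hot_shared {
	unsigned long nr_requests;
	int active_requests;
};

/* The counter is pushed onto its own cache line. */
struct hot_separated {
	unsigned long nr_requests;
	alignas(CACHE_LINE_SIZE) int active_requests;
};

static_assert(alignof(struct hot_separated) == CACHE_LINE_SIZE,
	      "struct inherits the alignment of its widest-aligned member");

/*
 * Compare two member offsets, assuming the struct itself starts on a
 * cache-line boundary.
 */
static bool same_cache_line(size_t off_a, size_t off_b)
{
	return off_a / CACHE_LINE_SIZE == off_b / CACHE_LINE_SIZE;
}
```

Separating the fields only avoids the cross-field conflict. Every update is
still an atomic operation on a shared counter, which is the overhead confirmed
above and the reason the counter was removed rather than relocated.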


^ permalink raw reply

* Re: [PATCH] rust: block: require `Sync` for `Operations::QueueData`
From: Andreas Hindborg @ 2026-06-10 18:02 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: Boqun Feng, Miguel Ojeda, Gary Guo, Björn Roy Baron,
	Benno Lossin, Alice Ryhl, Trevor Gross, Danilo Krummrich,
	Jens Axboe, Daniel Almeida, linux-block, rust-for-linux,
	linux-kernel
In-Reply-To: <CANiq72mGmRTfauU-HUOyBVSG0GM7q39=6aP8p1djN2cHr5C_Kw@mail.gmail.com>

Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> writes:

> On Mon, Jun 8, 2026 at 10:25 AM Andreas Hindborg <a.hindborg@kernel.org> wrote:
>>
>> Fixes: 90d952fac8ac ("rust: block: add `GenDisk` private data support")
>
> Should this be Cc: stable given the hash is old? i.e. 6.18.y+

Yes, it should have been Cc'd to stable, sorry.


Best regards,
Andreas Hindborg




^ permalink raw reply

* Re: [PATCH RFC 0/1] block: fix concurrent elevator change failure
From: Ming Lei @ 2026-06-11 11:22 UTC (permalink / raw)
  To: Shin'ichiro Kawasaki; +Cc: linux-block, Jens Axboe, Nilay Shroff
In-Reply-To: <20260611074200.474676-1-shinichiro.kawasaki@wdc.com>

Hi Shin'ichiro,

On Thu, Jun 11, 2026 at 04:41:59PM +0900, Shin'ichiro Kawasaki wrote:
> I observed that the blktests test case block/005 hangs on a specific
> server hardware using a specific HDD as a block device. During the test
> case run, the kernel reported a KASAN null-ptr-deref (and other memory
> corruption symptoms) [2]. This failure looked sporadic and hardware-
> dependent.
> 
> From the kernel message, I noticed that udev-worker wrote to the
> queue/scheduler sysfs attribute to change the IO scheduler, or elevator.
> The test case block/005 also wrote to the same sysfs attribute, which

sysfs write is supposed to be serialized...

> indicated that a concurrent elevator change caused the failure. I
> created a new blktests test case that simply does the concurrent
> elevator change with a null_blk device [1]. It recreates the failure in
> a stable manner on various server hardware.
> 
> Using the new test case, I bisected and found that the failure first
> appears at the commit 370ac285f23a ("block: avoid cpu_hotplug_lock
> depedency on freeze_lock") in the kernel tag v6.17-rc3. However, that
> commit does not appear to explain the failure by itself: it changed the
> queue freeze behavior and only unveiled a race, probably. Looking back
> at the changes to elevator_change(), I think the actual cause is the
> commit 559dc11143eb ("block: move elv_register[unregister]_queue out of
> elevator_lock") in the kernel tag v6.16-rc1. This commit moved
> elevator_change_done() out of the guard of ->elevator_lock and the queue
> freeze. As a result, when two threads write to the same queue/scheduler
> attribute concurrently, elevator_change_done() runs in parallel causing
> the memory corruption and the hang.
> 
> As the fix attempt, I created the patch in this series. It adds a new
> mutex that serializes the whole elevator switch sequence, including the
> elevator_change_done() call. I ran the reproducer with lockdep enabled
> and confirmed that the patch avoids the failure and new WARN was not
> observed.
> 
> However, the fix patch adds a new lock, and I'm not sure if it is the best
> solution. Comments on the patch, or suggestions for a better solution,
> would be appreciated.
> 
> [1] https://github.com/kawasaki/blktests/commit/4f8c63ed7d049f5e9c935c3fe00142b2a3629826
> 
> [2]
> 
> [30102.760660] [ T186170] run blktests block/005 at 2026-05-11 05:53:53
> [30104.969837] [ T186111] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI
> [30104.983590] [ T186111] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
> [30104.992929] [ T186111] CPU: 2 UID: 0 PID: 186111 Comm: (udev-worker) Not tainted 7.1.0-rc2-kts+ #1 PREEMPT(lazy)
> [30105.004019] [ T186111] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015
> [30105.013216] [ T186111] RIP: 0010:blk_mq_debugfs_register_sched+0x46/0x210
> [30105.020667] [ T186111] Code: 48 89 fa 48 c1 ea 03 48 83 ec 10 80 3c 02 00 0f 85 83 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 8b 6b 08 48 89 ea 48 c1 ea 03 <80> 3c 02 00 0f 85 57 01 00 00 48 c7 c0 24 a3 b3 97 48 8b 6d 00 48
> [30105.041036] [ T186111] RSP: 0018:ffff88816b9c7708 EFLAGS: 00010246
> [30105.048111] [ T186111] RAX: dffffc0000000000 RBX: ffff888117f18000 RCX: 0000000000000000
> [30105.057097] [ T186111] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff888117f18008
> [30105.066086] [ T186111] RBP: 0000000000000000 R08: ffffffff957c47ac R09: fffffbfff2f6633c
> [30105.075083] [ T186111] R10: ffff88816b9c7730 R11: 0000000000000001 R12: ffff88814c1f2000
> [30105.084088] [ T186111] R13: ffff88814c1f2018 R14: ffff8881b8a336ac R15: ffffffff95bfae30
> [30105.093111] [ T186111] FS:  00007fc1c7970c40(0000) GS:ffff8887c534e000(0000) knlGS:0000000000000000
> [30105.103093] [ T186111] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [30105.110751] [ T186111] CR2: 000055fa37e182c0 CR3: 0000000108350003 CR4: 00000000001726f0
> [30105.119796] [ T186111] Call Trace:
> [30105.124154] [ T186111]  <TASK>
> [30105.128301] [ T186111]  blk_mq_sched_reg_debugfs+0x8d/0x1a0
> [30105.134193] [ T186111]  elevator_change_done+0x2f2/0x610

blk_mq_sched_reg_debugfs() already includes the debugfs lock, so I feel the
proper fix could be to check for and avoid the null-ptr-deref.

Adding a new lock should usually be the last resort, especially since queue
freeze depends on this one.

Thanks,
Ming
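For readers following the quoted cover letter, the serialization it proposes
can be illustrated with a userspace analogy. This is not the kernel code and
not the proposed patch: fake_queue, fake_elevator_switch() and
run_two_writers() are hypothetical names, and a pthread mutex stands in for
the new lock. The point is only that the "switch" step and the "done" step
form one critical section, so two writers cannot interleave them.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a request queue with a switchable scheduler. */
struct fake_queue {
	pthread_mutex_t switch_lock; /* covers the whole switch sequence */
	int *sched_data;             /* stands in for scheduler private data */
	long completed;
};

static void fake_elevator_switch(struct fake_queue *q, int id)
{
	pthread_mutex_lock(&q->switch_lock);

	/* Step 1: tear down the old scheduler data and install new data. */
	free(q->sched_data);
	q->sched_data = malloc(sizeof(*q->sched_data));
	assert(q->sched_data);
	*q->sched_data = id;

	/* Step 2: the "done" step relies on what step 1 installed. */
	assert(*q->sched_data == id);
	q->completed++;

	pthread_mutex_unlock(&q->switch_lock);
}

struct writer_arg {
	struct fake_queue *q;
	int id;
	int rounds;
};

static void *writer(void *p)
{
	struct writer_arg *arg = p;

	for (int i = 0; i < arg->rounds; i++)
		fake_elevator_switch(arg->q, arg->id);
	return NULL;
}

/* Run two concurrent writers and return the number of completed switches. */
static long run_two_writers(int rounds)
{
	struct fake_queue q = { .sched_data = NULL, .completed = 0 };
	struct writer_arg a = { &q, 1, rounds }, b = { &q, 2, rounds };
	pthread_t ta, tb;
	int rc;

	pthread_mutex_init(&q.switch_lock, NULL);
	rc = pthread_create(&ta, NULL, writer, &a);
	assert(rc == 0);
	rc = pthread_create(&tb, NULL, writer, &b);
	assert(rc == 0);
	(void)rc;
	pthread_join(ta, NULL);
	pthread_join(tb, NULL);
	pthread_mutex_destroy(&q.switch_lock);
	free(q.sched_data);
	return q.completed;
}
```

The alternative suggested in this reply would instead leave the locking as it
is and make the second step tolerate missing data, which avoids adding a lock
that queue freeze would depend on.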

^ permalink raw reply

* Re: [LSF/MM/BPF RFC PATCH 00/13]
From: Leon Romanovsky @ 2026-06-11 11:59 UTC (permalink / raw)
  To: Haris Iqbal
  Cc: linux-block, linux-rdma, linux-kernel, axboe, bvanassche, hch,
	jgg, jinpu.wang
In-Reply-To: <CAJpMwyg-6Qxskq2ktuhvf46yD5848J9BYLMPPfBLjg2Uzs=xnw@mail.gmail.com>

On Wed, May 27, 2026 at 02:44:08PM +0200, Haris Iqbal wrote:
> On Tue, May 12, 2026 at 12:34 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Tue, May 05, 2026 at 09:46:12AM +0200, Md Haris Iqbal wrote:
> > > Following a conversation with Bart yesterday, I am sending the RMR+BRMR
> > > code through patch for easier review.
> > >
> > > The patches apply over the for-next branch of the block tree over commit
> > > 07dfa981ca3
> > >
> > > For context,
> > > RMR (Reliable Multicast over RTRS) is a kernel module that provides
> > > active-active block-level replication over RDMA. It guarantees delivery
> > > of IO to a group of storage nodes and handles resynchronization of data
> > > directly between storage nodes without involving the compute client.
> > >
> > > BRMR (Block device over RMR) sits on top of RMR and exposes a standard
> > > Linux block device (/dev/brmrX) backed by an RMR pool. Together, RMR and
> > > BRMR provide a single-hop replication and resynchronization solution for
> > > RDMA-connected storage clusters.
> > >
> > > My session is on Wednesday, at 12 in the storage room (Istanbul).
> >
> > To summarize the discussion:
> >
> > 1. Move as much logic as possible into the block layer; RDMA should serve
> >    strictly as a transport.
> > 2. Identify another in‑kernel user of this functionality, and add support for
> >    it if required. At least accommodate potential users elsewhere in the
> >    kernel.
> 
> Thanks for the summary Leon.
> 
> The main logic which handles multicast/replication legs, missed I/O
> tracking, re-synchronization, etc are the core parts of RMR.
> If we move those to a separate module, there won't be much left in
> RMR. RMR already uses RTRS from the RDMA subsystem as transport.
> 
> Having said that, I am not against moving RMR out of the RDMA layer.
> It can serve as a reliable replication service/library for any other
> user in the kernel to use.
> Which subsystem (block or something else) would be a better fit then,
> can be discussed.
> 
> PS: Would this be a good candidate for a session/discussion in the upcoming LPC?

Probably yes.

Thanks

> 
> >
> > Thanks

^ permalink raw reply

* Re: [PATCH v2 00/14] list: Prepare entry iterators to cache cursor state
From: Kaitao Cheng @ 2026-06-11 12:04 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Christian König, Thierry Reding, Jonathan Hunter,
	Sowjanya Komatineni, Davidlohr Bueso, Paul E . McKenney,
	Josh Triplett, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Boqun Feng, Liam Girdwood, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Tvrtko Ursulin, Huang Rui, Eddie James, Mark Brown,
	Maxime Coquelin, Alexandre Torgue, Laxman Dewangan,
	Neil Armstrong, Robert Foss, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Laurent Pinchart,
	Jonas Karlman, Jernej Skrabec, Matthew Auld, Matthew Brost,
	Waiman Long, drbd-dev, linux-block, linux1394-devel, dri-devel,
	intel-gfx, linux-spi, linux-stm32, linux-arm-kernel, linux-tegra,
	linux-sound, linux-kernel, Andrew Morton, Randy Dunlap,
	Christian Brauner, David Howells, Luca Ceresoli, Kaito Cheng,
	Muchun Song, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Takashi Sakamoto,
	Andrzej Hajda, Jaroslav Kysela, Takashi Iwai
In-Reply-To: <aippVAj83dCzscTN@ashevche-desk.local>



On 2026/6/11 15:52, Andy Shevchenko wrote:
> On Thu, Jun 11, 2026 at 03:36:01PM +0800, Kaitao Cheng wrote:
>> On 2026/6/11 14:54, Andy Shevchenko wrote:
>>> On Thu, Jun 11, 2026 at 12:42:02PM +0800, Kaitao Cheng wrote:
>>>> On 2026/6/10 22:43, Andy Shevchenko wrote:
>>>>> On Wed, Jun 10, 2026 at 02:14:06PM +0800, Kaitao Cheng wrote:
>>>>>> On 2026/6/9 18:33, Christian König wrote:
>>>>>>> On 6/9/26 08:13, Kaitao Cheng wrote:
>>>
>>>>>>>> This series prepares for, and then updates, the list_for_each_entry()
>>>>>>>> family so the common entry iterators cache their next or previous cursor
>>>>>>>> before the loop body runs.
>>>>>>>
>>>>>>> Why in the world would we want to do that?
>>>>>>>
>>>>>>> The safe and non-safe variants have very distinct use cases and that is completely intentional.
>>>>>>>
>>>>>>> What we could improve maybe is the documentation, from my experience an astonishing large amount of people have misconceptions about the safe variants.
>>>>>>>
>>>>>>>> The first 13 patches open-code loops that intentionally depend on the
>>>>>>>> old "derive the next entry from the current cursor at the end of the
>>>>>>>> iteration" behaviour.  These loops append work to the list being walked,
>>>>>>>> restart traversal after dropping a lock, skip an entry consumed by the
>>>>>>>> current iteration, or otherwise adjust the cursor in the loop body.
>>>>>>>
>>>>>>> Well I have to clearly reject the changes for subsystems/components I'm maintaining, that just looks horrible to me and I clearly don't see a good reason for that.
>>>>>>
>>>>>> Hi Christian and Andy Shevchenko,
>>>>>>
>>>>>> Thanks for taking a look. I would like to clarify the point you raised.
>>>>>>
>>>>>> The reason I started looking at this is the original motivation behind
>>>>>> the _safe() variants.  They exist because some users need to remove, move
>>>>>> or otherwise consume the current entry while walking the list.  In that
>>>>>> case the next cursor has to be preserved before the loop body can modify
>>>>>> the current entry.
>>>>>>
>>>>>> The unfortunate part is that this could not be expressed with the
>>>>>> existing list_for_each_entry() interface without changing its calling
>>>>>> convention.  The _safe() variants had to grow an extra argument for the
>>>>>> temporary cursor, and that is why we ended up with a separate family of
>>>>>> macros.
>>>>>>
>>>>>> But conceptually, the distinction does not have to be exposed as two
>>>>>> different iterator families forever.  The difference is an implementation
>>>>>> detail: whether the iterator keeps the next/previous cursor before the
>>>>>> body runs.  This series makes the common list_for_each_entry() iterators
>>>>>> do that internally, so the safe and non-safe forms can effectively be
>>>>>> folded together, or at least the need for a separate public _safe()
>>>>>> interface becomes much weaker.
>>>>>>
>>>>>> There is also a usability issue with the current _safe() interface.  The
>>>>>> caller is forced to define a temporary cursor outside the macro and pass
>>>>>> it in, even though almost all users never use that cursor directly.  It is
>>>>>> just boilerplate required by the macro implementation.  I find that
>>>>>> redundant and awkward: the temporary cursor is an internal detail of the
>>>>>> iteration, but every caller has to spell it out.
>>>>>
>>>>> Ah, I think the distinct macro families is that what we want.
>>>>> But the hiding of the parameter can be done inside list_for_each_*_safe().
>>>>> You can do a treewide change with coccinelle.
>>>>>
>>>>> Sorry if I didn't get the whole idea from your previous contributions.
>>>>>
>>>>> Note, even cases that would need a temporary cursor may be switched to
>>>>> new list_for_each_*_safe(), see how PCI macros for iterating over resources
>>>>> are implemented (include/linux/pci.h).
>>>>
>>>> Thanks for your suggestions. I've written a demo based on your feedback.
>>>> Could you please review it and share your thoughts on this approach?
>>>
>>> Have you checked how many users actually need the temporary storage?
>>
>> In Muchun's reply, he mentioned the following:
>>
>> There are 9,925 list_for_each_entry() call sites in total. Among them,
>> 9,919 do not require any adaptation, and only 6 need to be refactored:
>>
>> As for list_for_each_entry_safe(), there are 4,572 callers. 4,550 of them
>> can be directly replaced by the new list_for_each_entry(), while 22 cannot
>> be replaced
>>
>> https://lore.kernel.org/all/2B3BFA1E-08B8-42AB-87D6-A28BF15E5C58@linux.dev/
>>
>> I only used Coccinelle to scan for list_for_each_entry() call sites, and
>> found the 13 call sites shown in the current patch series, which cover
>> the 6 cases mentioned in Muchun's email. I have not yet run the Coccinelle
>> scan for list_for_each_entry_safe().
>>
>> If we need to handle all 9,925 list_for_each_entry() call sites or all 4,572
>> list_for_each_entry_safe() call sites in one go, would such a change be too
>> large? I expect it would affect almost every kernel subsystem.
> 
> If it's done by Linus himself during the day when he prepares -rc1, it's fine.
> You would need to provide a good justification for the change, though.
> 
> But in the above statistics the 4572 vs 4550, so the first step is to investigate
> why temporary cursor is used in those 22 cases and what we can do to avoid that.

Here is one example: in shmem_unuse() in mm/shmem.c, list_for_each_entry_safe()
is used. In this case, the caller releases shmem_swaplist_lock inside the loop.
During that window, the list may be modified, and the previously saved next may
become stale. Therefore, next needs to be recomputed so that subsequent iteration
is based on the latest list state.

This leads to two possible approaches:

1. Change list_for_each_entry_safe(pos, n, head, member) directly to
list_for_each_entry_safe(pos, head, member). If we do this, the case
above would need to be converted to an open-coded form.

2. Support both forms, list_for_each_entry_safe(pos, n, head, member)
and list_for_each_entry_safe(pos, head, member), as described in the
link below.
https://lore.kernel.org/all/9b98e860-11df-44bf-9a95-3046d2c274a6@linux.dev/

Do you have any other thoughts on this?

>> I wonder whether it would be better to first provide the necessary
>> compatibility APIs, and then let each subsystem owner update their code as
>> appropriate. That would make the impact more controlled, similar to how
>> the current folio replacement of page is being handled.
-- 
Thanks
Kaitao Cheng
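To make approach 1 concrete, here is a minimal userspace sketch of a
three-argument safe iterator whose temporary cursor is declared inside the
macro. It is an illustration under assumptions, not the proposed kernel
change: the macro name list_for_each_entry_safe3(), the entry_of() helper and
the item type are hypothetical, and the sketch relies on the GNU __typeof__
extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

#define entry_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Hypothetical three-argument form: the cursor next_ is declared in the
 * for-init clause, so callers no longer define or pass it.
 */
#define list_for_each_entry_safe3(pos, head, member)                         \
	for (__typeof__(pos) next_ =                                          \
		     ((pos) = entry_of((head)->next, __typeof__(*(pos)), member), \
		      entry_of((pos)->member.next, __typeof__(*(pos)), member)); \
	     &(pos)->member != (head);                                        \
	     (pos) = next_,                                                   \
	     next_ = entry_of(next_->member.next, __typeof__(*next_), member))

/* Hypothetical payload type used only for this illustration. */
struct item {
	int value;
	struct list_head node;
};

/* Unlink and free every entry; the hidden cursor keeps the walk valid. */
static int drain(struct list_head *head)
{
	struct item *it;
	int freed = 0;

	list_for_each_entry_safe3(it, head, node) {
		list_del(&it->node);
		free(it);
		freed++;
	}
	return freed;
}

/* Build a list of n items, drain it, and return how many were freed. */
static int fill_and_drain(int n)
{
	struct list_head head;

	list_init(&head);
	for (int i = 0; i < n; i++) {
		struct item *it = malloc(sizeof(*it));

		assert(it);
		it->value = i;
		list_add_tail(&it->node, &head);
	}
	return drain(&head);
}
```

Because the cursor is out of the caller's reach, a loop such as the
shmem_unuse() case above, which has to recompute the next entry after
dropping a lock, could not use this form and would need the four-argument
variant or an open-coded loop.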


^ permalink raw reply

* Re: [PATCH v2 00/14] list: Prepare entry iterators to cache cursor state
From: Kaitao Cheng @ 2026-06-11 12:27 UTC (permalink / raw)
  To: Andy Shevchenko, Christian König
  Cc: Thierry Reding, Jonathan Hunter, Sowjanya Komatineni,
	Davidlohr Bueso, Paul E . McKenney, Josh Triplett, Peter Zijlstra,
	Ingo Molnar, Will Deacon, Boqun Feng, Liam Girdwood, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin, Huang Rui,
	Eddie James, Mark Brown, Maxime Coquelin, Alexandre Torgue,
	Laxman Dewangan, Neil Armstrong, Robert Foss, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Laurent Pinchart, Jonas Karlman, Jernej Skrabec, Matthew Auld,
	Matthew Brost, Waiman Long, drbd-dev, linux-block,
	linux1394-devel, dri-devel, intel-gfx, linux-spi, linux-stm32,
	linux-arm-kernel, linux-tegra, linux-sound, linux-kernel,
	Andrew Morton, Randy Dunlap, Christian Brauner, David Howells,
	Luca Ceresoli, Kaito Cheng, Muchun Song, Philipp Reisner,
	Lars Ellenberg, Christoph Böhmwalder, Jens Axboe,
	Takashi Sakamoto, Andrzej Hajda, Jaroslav Kysela, Takashi Iwai
In-Reply-To: <aipx1goKIsk40vrF@ashevche-desk.local>



On 2026/6/11 16:29, Andy Shevchenko wrote:
> On Thu, Jun 11, 2026 at 10:01:25AM +0200, Christian König wrote:
>> On 6/10/26 17:02, Andy Shevchenko wrote:
>>> On Wed, Jun 10, 2026 at 11:11:34AM +0200, Christian König wrote:
>>>> On 6/10/26 10:18, Kaitao Cheng wrote:
>>>>> On 2026/6/10 16:07, Christian König wrote:
> 
> ...
> 
>>>>> Should we revert to v1, or keep list_for_each_entry() and
>>>>> list_for_each_entry_safe() as they are, close this thread, and make no
>>>>> changes?
>>>>>
>>>>> Link to v1:
>>>>> https://lore.kernel.org/all/20260529082149.76764-1-kaitao.cheng@linux.dev/
>>>>>
>>>>> Or do you have any better suggestions?
>>>>
>>>> v1 looks perfectly reasonable to me.
>>>
>>> But why not just hiding that once for all (in case they don't use the temporary
>>> iterator)? Easy to automate, robust — everyone is happy?
>>
>> As far as I can see that is an extremely bad idea.
>>
>> The distinction between the use cases of 'iterating the list' and 'iterating
>> the list while you modify it' is completely intentional.

I agree with this point. It is very reasonable for list_for_each_entry()
to be used only for 'iterating the list'. In practice, however, we do not
have an effective way to enforce that rule for users, whereas the distinction
between bool and int can be enforced by the compiler. The 13 patches in the
current series are all real examples where users modify the list while using
list_for_each_entry(). Is a rule that cannot actually be enforced reasonable?
This is just my humble opinion, and I am raising it here only for discussion.

> What I meant is to keep the name, just drop the parameter (make it hidden and
> being defined inside list_for_each_*_safe() cases).

I agree with this approach, but the specific details still need to be settled,
including the issue described in the link below.

https://lore.kernel.org/all/0a333eb8-fc29-4b85-993e-6b726f4c7cf0@linux.dev/

Of course, there is also the suffix-renaming issue raised by Christian.

>> See the bool type can be implemented by int as well, but it is just a
>> different use case.
> 
>>>> You should just include some patches in the same patch set to actually use
>>>> the new macros.
>>>>
>>>> If you modify the files under drivers/dma-buf or drivers/gpu/drm/amd to use
>>>> the new macro I'm happy to review that.
>>>
>>
> 

-- 
Thanks
Kaitao Cheng


^ permalink raw reply
