Linux block layer


* Re: [PATCH] block: clear zone write plugging flag before failing rejected BIOs
From: Jackie Liu @ 2026-06-09  0:36 UTC (permalink / raw)
  To: Damien Le Moal, axboe; +Cc: linux-block
In-Reply-To: <33035623-1f1b-4391-9212-e2af5fd9457f@kernel.org>

On June 8, 2026 at 19:42, "Damien Le Moal" <dlemoal@kernel.org> wrote:


> 
> On 2026/06/07 11:18, Jackie Liu wrote:
> 
> > 
> > From: Jackie Liu <liuyun01@kylinos.cn>
> >  
> >  Commit fe0418eb9bd6 ("block: Prevent potential deadlocks in zone write plug
> >  error recovery") changed blk_zone_wplug_handle_write() to fail BIOs
> >  directly when blk_zone_wplug_prepare_bio() rejects them, for example
> >  because the write is not aligned to the cached write pointer or the plug
> >  needs a write pointer update. However, the BIO is already marked with
> >  BIO_ZONE_WRITE_PLUGGING at that point even though it is not issued.
> >  
> >  Completing such a BIO with bio_io_error() makes bio_endio() call
> >  blk_zone_write_plug_bio_endio(), which treats the completion as a failed
> >  device write and may poison the cached zone write pointer state by setting
> >  BLK_ZONE_WPLUG_NEED_WP_UPDATE.
> > 
> Yes, true. But you did not explain clearly why that is a problem. After all, if
> we hit this case, the user issued an unaligned BIO, and so forcing it to do a
> report zones to get everything in sync and the correct write pointer is not a
> bad thing.
> 
> If fe0418eb9bd6 change is actually causing you problems, please describe that
> problem clearly. But ideally, I do not want to special case some error
> completions over others and prefer to have a single error path that result in
> the same state for the zone write plugs, regardless of a write error root cause.

Thanks for the review. I agree that the changelog did not describe a
concrete user-visible problem clearly enough.

I was treating NEED_WP_UPDATE on a BIO rejected before submission as stale
state poisoning, because no device write was actually issued. But as you
pointed out, for an invalid/non-sequential write, forcing the user to
resynchronize the write pointer through report zones is consistent with the
current conservative recovery model.

I do not have a concrete regression from fe0418eb9bd6 beyond that extra
recovery requirement, so please drop this patch for now.

Thanks.

Jackie

> 
> > 
> > Clear BIO_ZONE_WRITE_PLUGGING and drop the zone write plug reference before
> >  failing the rejected BIO.
> >  
> >  Fixes: fe0418eb9bd6 ("block: Prevent potential deadlocks in zone write plug error recovery")
> >  Cc: stable@vger.kernel.org # 6.13+
> >  Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
> >  ---
> >  block/blk-zoned.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >  
> >  diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> >  index 6a221c180889..855767d8bfc1 100644
> >  --- a/block/blk-zoned.c
> >  +++ b/block/blk-zoned.c
> >  @@ -1502,7 +1502,9 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
> >  goto queue_bio;
> >  
> >  if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
> >  + bio_clear_flag(bio, BIO_ZONE_WRITE_PLUGGING);
> >  spin_unlock_irqrestore(&zwplug->lock, flags);
> >  + disk_put_zone_wplug(zwplug);
> >  bio_io_error(bio);
> >  return true;
> >  }
> > 
> -- 
> Damien Le Moal
> Western Digital Research
>

^ permalink raw reply

* [PATCH] brd: normalize non-positive max_part before rounding it up
From: Samuel Moelius @ 2026-06-08 23:59 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Samuel Moelius, open list:BLOCK LAYER, open list

`max_part` is an `int` module parameter, but brd only resets zero before
rounding non-divisor values with `1UL << fls(max_part)`.

A negative value such as -1 passes the initial zero check.  The modulo
test then reaches the roundup, where `fls(-1)` yields 32.  On 64-bit
builds that produces 4294967296, which is then assigned back to `int
max_part` as zero.  `brd_alloc()` passes that zero value to
`disk->minors`, and block core warns and rejects the disk.

Normalize non-positive values to the existing one-partition fallback
before the modulo/roundup, and apply the existing `DISK_MAX_PARTS` clamp
before the roundup so it only operates on representable, in-range
values.

Assisted-by: Codex:gpt-5.5-cyber-preview
Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
---
 drivers/block/brd.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 00cc8122068f..ed9567f74579 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -371,9 +371,15 @@ static void brd_cleanup(void)

 static inline void brd_check_and_reset_par(void)
 {
-	if (unlikely(!max_part))
+	if (unlikely(max_part <= 0))
 		max_part = 1;

+	if (max_part > DISK_MAX_PARTS) {
+		pr_info("brd: max_part can't be larger than %d, reset max_part = %d.\n",
+			DISK_MAX_PARTS, DISK_MAX_PARTS);
+		max_part = DISK_MAX_PARTS;
+	}
+
 	/*
 	 * make sure 'max_part' can be divided exactly by (1U << MINORBITS),
 	 * otherwise, it is possiable to get same dev_t when adding partitions.
@@ -381,11 +387,6 @@ static inline void brd_check_and_reset_par(void)
 	if ((1U << MINORBITS) % max_part != 0)
 		max_part = 1UL << fls(max_part);

-	if (max_part > DISK_MAX_PARTS) {
-		pr_info("brd: max_part can't be larger than %d, reset max_part = %d.\n",
-			DISK_MAX_PARTS, DISK_MAX_PARTS);
-		max_part = DISK_MAX_PARTS;
-	}
 }

 static int __init brd_init(void)
-- 
2.43.0

^ permalink raw reply related
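
For illustration, the arithmetic described in the changelog above can be reproduced in user space. This is only a sketch: `fls_sketch()`, `needs_roundup()`, `roundup_wide()` and `roundup_narrowed()` are hypothetical stand-ins, not brd code, `MINORBITS_SKETCH` is assumed to be 20 (the value of `MINORBITS` in current kernels), and `1ULL` is used instead of `1UL` so the 64-bit result is visible regardless of the width of `long`.

```c
#define MINORBITS_SKETCH 20	/* assumption: mirrors the kernel's MINORBITS */

/* User-space stand-in for fls(): 1-based index of the most significant set
 * bit, or 0 when no bit is set. */
static int fls_sketch(unsigned int x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* The modulo test that guards the round-up.  A negative max_part is converted
 * to a large unsigned value, so the remainder is non-zero.  The caller must
 * have excluded zero already, as brd does. */
static int needs_roundup(int max_part)
{
	return (1U << MINORBITS_SKETCH) % (unsigned int)max_part != 0;
}

/* The round-up itself, kept wide so the intermediate value can be inspected. */
static unsigned long long roundup_wide(int max_part)
{
	return 1ULL << fls_sketch((unsigned int)max_part);
}

/* What is left once the result is stored back in an int: the low 32 bits. */
static int roundup_narrowed(int max_part)
{
	return (int)(unsigned int)roundup_wide(max_part);
}
```

With these helpers, `needs_roundup(-1)` is true, `roundup_wide(-1)` is 4294967296 and `roundup_narrowed(-1)` is 0, which matches the sequence the changelog describes. A small positive non-divisor such as 3 is simply rounded up to 4.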

* Re: [PATCH 4/4] block: add configurable error injection
From: Bart Van Assche @ 2026-06-08 22:08 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Jonathan Corbet, Damien Le Moal, Hannes Reinecke, Keith Busch,
	linux-block, linux-doc, Hannes Reinecke
In-Reply-To: <20260608051416.1205282-5-hch@lst.de>

On 6/7/26 10:14 PM, Christoph Hellwig wrote:
> +Configurable error injection allows injecting specific block layer status codes
> +for ranges of a block device.  Errors can be injected unconditionally, or with a

ranges -> sector ranges?

> +static void error_inject_removall(struct gendisk *disk)
> +{

Is a letter "e" perhaps missing from the above function name? (remov -> 
remove)

Thanks,

Bart.

^ permalink raw reply

* Re: [PATCH 3/4] block: add a str_to_blk_op helper
From: Bart Van Assche @ 2026-06-08 21:57 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Jonathan Corbet, Damien Le Moal, Hannes Reinecke, Keith Busch,
	linux-block, linux-doc, Hannes Reinecke
In-Reply-To: <20260608051416.1205282-4-hch@lst.de>

On 6/7/26 10:14 PM, Christoph Hellwig wrote:
> +enum req_op str_to_blk_op(const char *op)
> +{
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(blk_op_name); i++)
> +		if (blk_op_name[i] && !strcmp(blk_op_name[i], op))
> +			return (enum req_op)i;
> +	return REQ_OP_LAST;
> +}
The above function is similar but not identical to
__sysfs_match_string(). Is __sysfs_match_string() good enough in this
context?

Thanks,

Bart.


^ permalink raw reply
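
One way to see why the two lookups are "similar but not identical" is to run both styles over a table with holes. The sketch below is a user-space illustration only: `op_names[]`, `match_skip_holes()` and `match_stop_at_null()` are hypothetical names, and the second function models the assumption that a generic matcher treats the first NULL entry as the end of the table, which should be checked against the real `__sysfs_match_string()` before relying on it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sparse name table: indices 1 and 3 stay NULL, as happens when
 * designated initializers skip values that have no name. */
static const char *const op_names[] = {
	[0] = "READ",
	[2] = "FLUSH",
	[4] = "DISCARD",
};
#define OP_NAMES_LEN (sizeof(op_names) / sizeof(op_names[0]))

/* Same shape as the quoted loop: skip NULL holes, return the table size on a
 * miss (the quoted code returns REQ_OP_LAST instead). */
static size_t match_skip_holes(const char *op)
{
	size_t i;

	for (i = 0; i < OP_NAMES_LEN; i++)
		if (op_names[i] && !strcmp(op_names[i], op))
			return i;
	return OP_NAMES_LEN;
}

/* Assumed alternative: stop scanning at the first NULL entry. */
static size_t match_stop_at_null(const char *op)
{
	size_t i;

	for (i = 0; i < OP_NAMES_LEN; i++) {
		if (!op_names[i])
			break;
		if (!strcmp(op_names[i], op))
			return i;
	}
	return OP_NAMES_LEN;
}
```

In this sketch `match_skip_holes("FLUSH")` finds index 2, while `match_stop_at_null("FLUSH")` reports a miss because it gives up at the hole at index 1. Whether that difference matters depends on whether the real name table is dense.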

* Re: [PATCH 2/4] block: add a "tag" for block status codes
From: Bart Van Assche @ 2026-06-08 21:55 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Jonathan Corbet, Damien Le Moal, Hannes Reinecke, Keith Busch,
	linux-block, linux-doc, Hannes Reinecke
In-Reply-To: <20260608051416.1205282-3-hch@lst.de>

On 6/7/26 10:14 PM, Christoph Hellwig wrote:
> +const char *blk_status_to_tag(blk_status_t status)
> +{
> +	int idx = (__force int)status;
> +
> +	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors)))
> +		return "<null>";
> +	return blk_errors[idx].tag;
> +}

Since designated initializers are used to initialize blk_errors[], it's
probably a good idea to check the value of blk_errors[idx].tag, e.g. as
follows:

return blk_errors[idx].tag ?: "<null>";

Thanks,

Bart.


^ permalink raw reply
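
The concern raised above can be shown with a small user-space sketch. Everything here is hypothetical (`struct status_entry`, `entries[]`, `status_to_tag()` and the tag strings are made up for illustration, not taken from the series); the point is only that a table filled with designated initializers can contain in-range entries whose pointer members are NULL. The GNU `a ?: b` shorthand from the suggestion is written out as a standard conditional.

```c
#include <stddef.h>

struct status_entry {
	int errno_val;
	const char *tag;
};

/* Hypothetical table: index 2 is never initialized, so its tag is NULL even
 * though the index is within bounds. */
static const struct status_entry entries[] = {
	[0] = { 0, "OK" },
	[1] = { -5, "IOERR" },
	[3] = { -28, "NOSPC" },
};
#define ENTRIES_LEN (sizeof(entries) / sizeof(entries[0]))

static const char *status_to_tag(unsigned int idx)
{
	/* Out-of-range index: same fallback as the quoted function. */
	if (idx >= ENTRIES_LEN)
		return "<null>";
	/* In-range hole: fall back instead of returning a NULL pointer. */
	return entries[idx].tag ? entries[idx].tag : "<null>";
}
```

Here `status_to_tag(2)` returns "<null>" rather than a NULL pointer, so a caller that passes the result straight to a format string stays safe.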

* Re: [PATCH 1/4] block: add a macro to initialize the status table
From: Bart Van Assche @ 2026-06-08 21:51 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Jonathan Corbet, Damien Le Moal, Hannes Reinecke, Keith Busch,
	linux-block, linux-doc, Hannes Reinecke
In-Reply-To: <20260608051416.1205282-2-hch@lst.de>

On 6/7/26 10:14 PM, Christoph Hellwig wrote:
> Prepare for adding a new value to the error table by adding a macro
> to fill it.
Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply

* Re: [PATCH] block: fix arg type in `blk_mq_update_nr_hw_queues`
From: Bart Van Assche @ 2026-06-08 15:50 UTC (permalink / raw)
  To: Andreas Hindborg, Jens Axboe; +Cc: linux-block, linux-kernel
In-Reply-To: <20260608-update-hw-nodes-arg-v1-1-19aba440b32e@kernel.org>

On 6/8/26 1:39 AM, Andreas Hindborg wrote:
> The type of the argument `nr_hw_queues` in the function
> `blk_mq_update_nr_hw_queues` is a signed integer. This is wrong,
> considering the field `nr_hw_queues` of `struct blk_mq_tag_set` is
> unsigned. Thus, change the type of the parameter to unsigned.

Will there ever be storage devices that support more than 2**31 hardware
queues? If not, I think the word "wrong" in the commit message is too
strong.

If this patch does not change the behavior of the code for any practical
use case, that would be good to mention.

Thanks,

Bart.

^ permalink raw reply

* Re: [PATCH]block: Observing higher CPU utilization during random IO testing
From: Jens Axboe @ 2026-06-08 15:18 UTC (permalink / raw)
  To: wenxiong, linux-block; +Cc: tom.leiming, yukuai, wenxiong
In-Reply-To: <76b46c10-201a-4bd1-85a8-49ae6bace572@kernel.dk>

On 6/8/26 9:15 AM, Jens Axboe wrote:
> On 6/4/26 10:27 AM, wenxiong@linux.ibm.com wrote:
>> From: Wen Xiong <wenxiong@linux.ibm.com>
>>
>> Hi All,
>>
>> Our performance team observed the higher CPU utilization in RHEL10 compared
>> to RHEL9.8, observed the similar issue in upstream kernel(v7.1-rc6) as well
>> when running FIO random IO tests. Random IO tests are more CPU intensive
>> than sequential IO tests due to several factors: more context switching,
>> interrupt Handling, cache Inefficiency etc.
>>
>> Given commit 060406c61c7c ("block: add plug while submitting IO")
>> causes performance regression. This patch reverts it.
>>
>> Below is performance comparison with the latest upstream kernel.
>>
>> Iotype  qd   nj    rmix    mpstat busy        mpstat busy
>>                          with inner plug    without inner plug
>> Randrw  1    20    100       53%                 24%
>> Randrw  1    40    100       70%                 24%
>> Randrw  1    20    70        40%                 24%
>> Randrw  1    40    70        60%                 26%
>> Randrw  1    20    0         14%                 6%
>> Randrw  1    40    0         20%                 7%
> 
> I'm fine with the actual change, but the commit title and commit message
> are pretty abysmal. If you were browsing git log and came across
> something that said
> 
> "block: Observing higher CPU utilization during random IO testing"
> 
> then what would you think the change did? You would have no idea. That
> title is a bug report title, it's not a git commit title. A good git
> commit message should tell you WHY a change is being made, and the title
> should be a short summary of that change.
> 
> I'll rewrite these, but please keep this in mind for future submissions.
> Should not be necessary for me to completely rewrite your git commit
> message.
> 
> You also identify 060406c61c7c as the commit that this is fixing. This
> goes into a Fixes line before your signed-off-by. And you would
> presumably want this to go into stable as well, yet that's not tagged.
> Please read the documentation on how to submit patches.

Patch also doesn't apply without fuzz. Please submit a v2 that's against
for-7.2/block and heeds the above advice to turn this into something
that's closer to a good patch submission.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH]block: Observing higher CPU utilization during random IO testing
From: Jens Axboe @ 2026-06-08 15:15 UTC (permalink / raw)
  To: wenxiong, linux-block; +Cc: tom.leiming, yukuai, wenxiong
In-Reply-To: <20260604162709.3006327-1-wenxiong@linux.ibm.com>

On 6/4/26 10:27 AM, wenxiong@linux.ibm.com wrote:
> From: Wen Xiong <wenxiong@linux.ibm.com>
> 
> Hi All,
> 
> Our performance team observed the higher CPU utilization in RHEL10 compared
> to RHEL9.8, observed the similar issue in upstream kernel(v7.1-rc6) as well
> when running FIO random IO tests. Random IO tests are more CPU intensive
> than sequential IO tests due to several factors: more context switching,
> interrupt Handling, cache Inefficiency etc.
> 
> Given commit 060406c61c7c ("block: add plug while submitting IO")
> causes performance regression. This patch reverts it.
> 
> Below is performance comparison with the latest upstream kernel.
> 
> Iotype  qd   nj    rmix    mpstat busy        mpstat busy
>                          with inner plug    without inner plug
> Randrw  1    20    100       53%                 24%
> Randrw  1    40    100       70%                 24%
> Randrw  1    20    70        40%                 24%
> Randrw  1    40    70        60%                 26%
> Randrw  1    20    0         14%                 6%
> Randrw  1    40    0         20%                 7%

I'm fine with the actual change, but the commit title and commit message
are pretty abysmal. If you were browsing git log and came across
something that said

"block: Observing higher CPU utilization during random IO testing"

then what would you think the change did? You would have no idea. That
title is a bug report title, it's not a git commit title. A good git
commit message should tell you WHY a change is being made, and the title
should be a short summary of that change.

I'll rewrite these, but please keep this in mind for future submissions.
Should not be necessary for me to completely rewrite your git commit
message.

You also identify 060406c61c7c as the commit that this is fixing. This
goes into a Fixes line before your signed-off-by. And you would
presumably want this to go into stable as well, yet that's not tagged.
Please read the documentation on how to submit patches.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH]block: Observing higher CPU utilization during random IO testing
From: Ming Lei @ 2026-06-08 15:04 UTC (permalink / raw)
  To: wenxiong; +Cc: linux-block, axboe, yukuai, wenxiong
In-Reply-To: <20260604162709.3006327-1-wenxiong@linux.ibm.com>

On Thu, Jun 04, 2026 at 12:27:09PM -0400, wenxiong@linux.ibm.com wrote:
> From: Wen Xiong <wenxiong@linux.ibm.com>
> 
> Hi All,
> 
> Our performance team observed the higher CPU utilization in RHEL10 compared
> to RHEL9.8, observed the similar issue in upstream kernel(v7.1-rc6) as well
> when running FIO random IO tests. Random IO tests are more CPU intensive
> than sequential IO tests due to several factors: more context switching,
> interrupt Handling, cache Inefficiency etc.
> 
> Given commit 060406c61c7c ("block: add plug while submitting IO")
> causes performance regression. This patch reverts it.

One thing is that plugging should be applied in the outer I/O code path; also,
it doesn't make sense to apply this timestamp optimization to a single-I/O batch.

> 
> Below is performance comparison with the latest upstream kernel.
> 
> Iotype  qd   nj    rmix    mpstat busy        mpstat busy
>                          with inner plug    without inner plug
> Randrw  1    20    100       53%                 24%
> Randrw  1    40    100       70%                 24%
> Randrw  1    20    70        40%                 24%
> Randrw  1    40    70        60%                 26%
> Randrw  1    20    0         14%                 6%
> Randrw  1    40    0         20%                 7%
> 
> Signed-off-by: Wen Xiong <wenxiong@linux.ibm.com>
> Suggested-by: Ming Lei <tom.leiming@gmail.com>

With commit log update, this patch looks fine:

Reviewed-by: Ming Lei <tom.leiming@gmail.com>


Thanks,
Ming

^ permalink raw reply

* [PATCH v3 08/22] iov_iter: Add a segmented queue of bio_vec[]
From: David Howells @ 2026-06-08 14:54 UTC (permalink / raw)
  To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
  Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky,
	Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen,
	Dominique Martinet, Ilya Dryomov, Trond Myklebust, netfs,
	linux-afs, linux-cifs, linux-nfs, ceph-devel, v9fs, linux-erofs,
	linux-fsdevel, linux-kernel, linux-block
In-Reply-To: <20260608145432.681865-1-dhowells@redhat.com>

Add the concept of a segmented queue of bio_vec[] arrays.  This allows an
indefinite quantity of elements to be handled and allows things like
network filesystems and crypto drivers to glue bits on the ends without
having to reallocate the array.

The bvecq struct that defines each segment also carries capacity/usage
information along with flags indicating whether the constituent memory
regions need freeing or unpinning and the file position of the first
element in a segment.  The bvecq structs are refcounted to allow a queue to
be extracted in batches and split between a number of subrequests.

The bvecq can have the bio_vec[] it manages allocated together with it, but
this is not required.  A flag indicates when this is the case, as comparing
->bv to ->__bv is not sufficient to detect it.

Add an iterator type ITER_BVECQ for it.  This is intended to replace
ITER_FOLIOQ (and ITER_XARRAY).

Note that the prev pointer is only really needed for iov_iter_revert() and
could be dispensed with if struct iov_iter contained the head information
as well as the current point.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 include/linux/bvecq.h      |  56 +++++++
 include/linux/iov_iter.h   |  69 +++++++-
 include/linux/uio.h        |  11 ++
 lib/iov_iter.c             | 322 ++++++++++++++++++++++++++++++++++++-
 lib/scatterlist.c          |  70 +++++++-
 lib/tests/kunit_iov_iter.c | 262 ++++++++++++++++++++++++++++++
 6 files changed, 784 insertions(+), 6 deletions(-)
 create mode 100644 include/linux/bvecq.h
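
As a reading aid, the chaining described in the changelog can be modelled in user space. This is a simplified sketch, not the code from the patch: `struct seg_vec`, `struct seg_queue`, `struct seg_cursor` and `seg_advance()` are hypothetical stand-ins that keep only element lengths and the `next` link, and, unlike the kernel helper, the function reports how many bytes could not be consumed when the queue ends.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct bio_vec: only the length is modelled. */
struct seg_vec {
	size_t len;
};

/* Hypothetical stand-in for struct bvecq: one segment of the queue. */
struct seg_queue {
	struct seg_queue *next;		/* NULL at the tail */
	unsigned int nr_slots;		/* elements of v[] in use */
	const struct seg_vec *v;
};

/* Position inside the queue: segment, slot and byte offset into that slot. */
struct seg_cursor {
	const struct seg_queue *q;
	unsigned int slot;
	size_t offset;
};

/* Move the cursor forward by @by bytes, hopping to the next segment whenever
 * the current one runs out of slots.  Returns the number of bytes left over
 * if the tail of the queue is reached first. */
static size_t seg_advance(struct seg_cursor *c, size_t by)
{
	by += c->offset;	/* measure from the start of the current slot */
	c->offset = 0;

	for (;;) {
		while (c->slot >= c->q->nr_slots) {
			if (!c->q->next)
				return by;	/* ran off the tail */
			c->q = c->q->next;
			c->slot = 0;
		}
		if (by < c->q->v[c->slot].len) {
			c->offset = by;
			return 0;
		}
		by -= c->q->v[c->slot].len;
		c->slot++;
	}
}
```

With two chained segments holding elements of 4+4 bytes and 8 bytes, advancing a fresh cursor by 10 bytes lands it in the second segment at slot 0, offset 2. Extra elements can be attached by linking another segment at the tail, without touching the arrays already in the chain.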

diff --git a/include/linux/bvecq.h b/include/linux/bvecq.h
new file mode 100644
index 000000000000..15f16f905877
--- /dev/null
+++ b/include/linux/bvecq.h
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Implementation of a segmented queue of bio_vec[].
+ *
+ * Copyright (C) 2026 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#ifndef _LINUX_BVECQ_H
+#define _LINUX_BVECQ_H
+
+#include <linux/bvec.h>
+
+/*
+ * The type of memory retention used by the elements in bvecq->bv[] and how to
+ * clean it up.
+ */
+enum bvecq_mem {
+	BVECQ_MEM_EXTERNAL,	/* Externally retained memory - no freeing */
+	BVECQ_MEM_PAGECACHE,	/* Ref'd pagecache pages - must put */
+	BVECQ_MEM_GUP,		/* Pinned memory from get_user_pages() - unpin */
+	BVECQ_MEM_ALLOCED,	/* Memory alloc'd by bvecq - can be freed/pooled */
+} __mode(byte);
+
+/*
+ * Segmented bio_vec queue.
+ *
+ * These can be linked together to form messages of indefinite length and
+ * iterated over with an ITER_BVECQ iterator.  The list is non-circular; next
+ * and prev are NULL at the ends.
+ *
+ * The bv pointer points to the bio_vec array; this may be __bv if allocated
+ * together.  The caller is responsible for determining whether or not this is
+ * the case as the array pointed to by bv may follow on directly from the
+ * bvecq by accident of allocation (ie. ->bv == ->__bv is *not* sufficient to
+ * determine this).
+ *
+ * The file position and discontiguity flag allow non-contiguous data sets to
+ * be chained together, but still teased apart without the need to convert the
+ * info in the bio_vec back into a folio pointer.
+ */
+struct bvecq {
+	struct bvecq	*next;		/* Next bvec in the list or NULL */
+	struct bvecq	*prev;		/* Prev bvec in the list or NULL */
+	unsigned long long fpos;	/* File position */
+	refcount_t	ref;
+	u32		priv;		/* Private data */
+	u16		nr_slots;	/* Number of elements in bv[] used */
+	u16		max_slots;	/* Number of elements allocated in bv[] */
+	enum bvecq_mem	mem_type:3;	/* What sort of memory and how to free it */
+	bool		inline_bv:1;	/* T if __bv[] is being used */
+	bool		discontig:1;	/* T if not contiguous with previous bvecq */
+	struct bio_vec	*bv;		/* Pointer to array of page fragments */
+	struct bio_vec	__bv[];		/* Default array (if ->inline_bv) */
+};
+
+#endif /* _LINUX_BVECQ_H */
diff --git a/include/linux/iov_iter.h b/include/linux/iov_iter.h
index f9a17fbbd398..c19a4c561ab4 100644
--- a/include/linux/iov_iter.h
+++ b/include/linux/iov_iter.h
@@ -10,6 +10,7 @@
 
 #include <linux/uio.h>
 #include <linux/bvec.h>
+#include <linux/bvecq.h>
 #include <linux/folio_queue.h>
 
 typedef size_t (*iov_step_f)(void *iter_base, size_t progress, size_t len,
@@ -141,6 +142,66 @@ size_t iterate_bvec(struct iov_iter *iter, size_t len, void *priv, void *priv2,
 	return progress;
 }
 
+/*
+ * Handle ITER_BVECQ.
+ */
+static __always_inline
+size_t iterate_bvecq(struct iov_iter *iter, size_t len, void *priv, void *priv2,
+		     iov_step_f step)
+{
+	const struct bvecq *bq = iter->bvecq;
+	unsigned int slot = iter->bvecq_slot;
+	size_t progress = 0, skip = iter->iov_offset;
+
+	do {
+		const struct bio_vec *bvec;
+		struct page *page;
+		size_t poff, plen;
+		void *base;
+
+		if (slot >= bq->nr_slots) {
+			if (!bq->next)
+				break;
+			bq = bq->next;
+			slot = 0;
+		}
+
+		bvec = &bq->bv[slot];
+		page = bvec->bv_page + (bvec->bv_offset + skip) / PAGE_SIZE;
+		poff = (bvec->bv_offset + skip) % PAGE_SIZE;
+		plen = umin(bvec->bv_len - skip, len);
+
+		while (plen > 0) {
+			size_t part, remain, consumed;
+
+			part = umin(plen, PAGE_SIZE - poff);
+			base = kmap_local_page(page) + poff;
+			remain = step(base, progress, part, priv, priv2);
+			kunmap_local(base);
+
+			consumed = part - remain;
+			progress += consumed;
+			skip += consumed;
+			len -= consumed;
+			if (!len || remain)
+				goto stop;
+			page++;
+			poff = 0;
+			plen -= consumed;
+		}
+
+		skip = 0;
+		slot++;
+	} while (len);
+
+stop:
+	iter->bvecq_slot = slot;
+	iter->bvecq = bq;
+	iter->iov_offset = skip;
+	iter->count -= progress;
+	return progress;
+}
+
 /*
  * Handle ITER_FOLIOQ.
  */
@@ -306,6 +367,8 @@ size_t iterate_and_advance2(struct iov_iter *iter, size_t len, void *priv,
 		return iterate_bvec(iter, len, priv, priv2, step);
 	if (iov_iter_is_kvec(iter))
 		return iterate_kvec(iter, len, priv, priv2, step);
+	if (iov_iter_is_bvecq(iter))
+		return iterate_bvecq(iter, len, priv, priv2, step);
 	if (iov_iter_is_folioq(iter))
 		return iterate_folioq(iter, len, priv, priv2, step);
 	if (iov_iter_is_xarray(iter))
@@ -342,8 +405,8 @@ size_t iterate_and_advance(struct iov_iter *iter, size_t len, void *priv,
  * buffer is presented in segments, which for kernel iteration are broken up by
  * physical pages and mapped, with the mapped address being presented.
  *
- * [!] Note This will only handle BVEC, KVEC, FOLIOQ, XARRAY and DISCARD-type
- * iterators; it will not handle UBUF or IOVEC-type iterators.
+ * [!] Note This will only handle BVEC, KVEC, BVECQ, FOLIOQ, XARRAY and
+ * DISCARD-type iterators; it will not handle UBUF or IOVEC-type iterators.
  *
  * A step functions, @step, must be provided, one for handling mapped kernel
  * addresses and the other is given user addresses which have the potential to
@@ -370,6 +433,8 @@ size_t iterate_and_advance_kernel(struct iov_iter *iter, size_t len, void *priv,
 		return iterate_bvec(iter, len, priv, priv2, step);
 	if (iov_iter_is_kvec(iter))
 		return iterate_kvec(iter, len, priv, priv2, step);
+	if (iov_iter_is_bvecq(iter))
+		return iterate_bvecq(iter, len, priv, priv2, step);
 	if (iov_iter_is_folioq(iter))
 		return iterate_folioq(iter, len, priv, priv2, step);
 	if (iov_iter_is_xarray(iter))
diff --git a/include/linux/uio.h b/include/linux/uio.h
index a9bc5b3067e3..f7cfa6ea8213 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -26,6 +26,7 @@ enum iter_type {
 	ITER_IOVEC,
 	ITER_BVEC,
 	ITER_KVEC,
+	ITER_BVECQ,
 	ITER_FOLIOQ,
 	ITER_XARRAY,
 	ITER_DISCARD,
@@ -68,6 +69,7 @@ struct iov_iter {
 				const struct iovec *__iov;
 				const struct kvec *kvec;
 				const struct bio_vec *bvec;
+				const struct bvecq *bvecq;
 				const struct folio_queue *folioq;
 				struct xarray *xarray;
 				void __user *ubuf;
@@ -77,6 +79,7 @@ struct iov_iter {
 	};
 	union {
 		unsigned long nr_segs;
+		u16 bvecq_slot;
 		u8 folioq_slot;
 		loff_t xarray_start;
 	};
@@ -145,6 +148,11 @@ static inline bool iov_iter_is_discard(const struct iov_iter *i)
 	return iov_iter_type(i) == ITER_DISCARD;
 }
 
+static inline bool iov_iter_is_bvecq(const struct iov_iter *i)
+{
+	return iov_iter_type(i) == ITER_BVECQ;
+}
+
 static inline bool iov_iter_is_folioq(const struct iov_iter *i)
 {
 	return iov_iter_type(i) == ITER_FOLIOQ;
@@ -295,6 +303,9 @@ void iov_iter_kvec(struct iov_iter *i, unsigned int direction, const struct kvec
 void iov_iter_bvec(struct iov_iter *i, unsigned int direction, const struct bio_vec *bvec,
 			unsigned long nr_segs, size_t count);
 void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
+void iov_iter_bvec_queue(struct iov_iter *i, unsigned int direction,
+			 const struct bvecq *bvecq,
+			 unsigned int first_slot, unsigned int offset, size_t count);
 void iov_iter_folio_queue(struct iov_iter *i, unsigned int direction,
 			  const struct folio_queue *folioq,
 			  unsigned int first_slot, unsigned int offset, size_t count);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index cac7d7364bc2..63fc75c2bc48 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -538,6 +538,39 @@ static void iov_iter_iovec_advance(struct iov_iter *i, size_t size)
 	i->__iov = iov;
 }
 
+static void iov_iter_bvecq_advance(struct iov_iter *i, size_t by)
+{
+	const struct bvecq *bq = i->bvecq;
+	unsigned int slot = i->bvecq_slot;
+
+	if (!i->count)
+		return;
+	i->count -= by;
+
+	by += i->iov_offset; /* From beginning of current segment. */
+	do {
+		size_t len;
+
+		while (slot >= bq->nr_slots) {
+			if (!bq->next)
+				break;
+			bq = bq->next;
+			slot = 0;
+		}
+
+		len = bq->bv[slot].bv_len;
+
+		if (likely(by < len))
+			break;
+		by -= len;
+		slot++;
+	} while (by);
+
+	i->iov_offset = by;
+	i->bvecq_slot = slot;
+	i->bvecq = bq;
+}
+
 static void iov_iter_folioq_advance(struct iov_iter *i, size_t size)
 {
 	const struct folio_queue *folioq = i->folioq;
@@ -583,6 +616,8 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
 		iov_iter_iovec_advance(i, size);
 	} else if (iov_iter_is_bvec(i)) {
 		iov_iter_bvec_advance(i, size);
+	} else if (iov_iter_is_bvecq(i)) {
+		iov_iter_bvecq_advance(i, size);
 	} else if (iov_iter_is_folioq(i)) {
 		iov_iter_folioq_advance(i, size);
 	} else if (iov_iter_is_discard(i)) {
@@ -591,6 +626,32 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
 }
 EXPORT_SYMBOL(iov_iter_advance);
 
+static void iov_iter_bvecq_revert(struct iov_iter *i, size_t unroll)
+{
+	const struct bvecq *bq = i->bvecq;
+	unsigned int slot = i->bvecq_slot;
+
+	for (;;) {
+		size_t len;
+
+		if (slot == 0) {
+			bq = bq->prev;
+			slot = bq->nr_slots;
+		}
+		slot--;
+
+		len = bq->bv[slot].bv_len;
+		if (unroll <= len) {
+			i->iov_offset = len - unroll;
+			break;
+		}
+		unroll -= len;
+	}
+
+	i->bvecq_slot = slot;
+	i->bvecq = bq;
+}
+
 static void iov_iter_folioq_revert(struct iov_iter *i, size_t unroll)
 {
 	const struct folio_queue *folioq = i->folioq;
@@ -648,6 +709,9 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
 			}
 			unroll -= n;
 		}
+	} else if (iov_iter_is_bvecq(i)) {
+		i->iov_offset = 0;
+		iov_iter_bvecq_revert(i, unroll);
 	} else if (iov_iter_is_folioq(i)) {
 		i->iov_offset = 0;
 		iov_iter_folioq_revert(i, unroll);
@@ -678,9 +742,24 @@ size_t iov_iter_single_seg_count(const struct iov_iter *i)
 		if (iov_iter_is_bvec(i))
 			return min(i->count, i->bvec->bv_len - i->iov_offset);
 	}
+	if (!i->count)
+		return 0;
+	if (unlikely(iov_iter_is_bvecq(i))) {
+		const struct bvecq *bq = i->bvecq;
+		unsigned int slot = i->bvecq_slot;
+		size_t offset = i->iov_offset;
+
+		while (slot >= bq->nr_slots) {
+			bq = bq->next;
+			if (!bq)
+				return 0;
+			slot = 0;
+			offset = 0;
+		}
+		return umin(i->count, bq->bv[slot].bv_len - offset);
+	}
 	if (unlikely(iov_iter_is_folioq(i)))
-		return !i->count ? 0 :
-			umin(folioq_folio_size(i->folioq, i->folioq_slot), i->count);
+		return umin(folioq_folio_size(i->folioq, i->folioq_slot), i->count);
 	return i->count;
 }
 EXPORT_SYMBOL(iov_iter_single_seg_count);
@@ -717,6 +796,35 @@ void iov_iter_bvec(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_bvec);
 
+/**
+ * iov_iter_bvec_queue - Initialise an I/O iterator to use a segmented bvec queue
+ * @i: The iterator to initialise.
+ * @direction: The direction of the transfer.
+ * @bvecq: The starting point in the bvec queue.
+ * @first_slot: The first slot in the bvec queue to use
+ * @offset: The offset into the bvec in the first slot to start at
+ * @count: The size of the I/O buffer in bytes.
+ *
+ * Set up an I/O iterator to either draw data out of the buffers attached to an
+ * inode or to inject data into those buffers.  The pages *must* be prevented
+ * from evaporation by the caller.
+ */
+void iov_iter_bvec_queue(struct iov_iter *i, unsigned int direction,
+			 const struct bvecq *bvecq, unsigned int first_slot,
+			 unsigned int offset, size_t count)
+{
+	WARN_ON(direction & ~(READ | WRITE));
+	*i = (struct iov_iter) {
+		.iter_type	= ITER_BVECQ,
+		.data_source	= direction,
+		.bvecq		= bvecq,
+		.bvecq_slot	= first_slot,
+		.count		= count,
+		.iov_offset	= offset,
+	};
+}
+EXPORT_SYMBOL(iov_iter_bvec_queue);
+
 /**
  * iov_iter_folio_queue - Initialise an I/O iterator to use the folios in a folio queue
  * @i: The iterator to initialise.
@@ -839,6 +947,37 @@ static unsigned long iov_iter_alignment_bvec(const struct iov_iter *i)
 	return res;
 }
 
+static unsigned long iov_iter_alignment_bvecq(const struct iov_iter *iter)
+{
+	const struct bvecq *bq;
+	unsigned long res = 0;
+	unsigned int slot = iter->bvecq_slot;
+	size_t skip = iter->iov_offset;
+	size_t size = iter->count;
+
+	if (!size)
+		return res;
+
+	for (bq = iter->bvecq; bq; bq = bq->next) {
+		for (; slot < bq->nr_slots; slot++) {
+			const struct bio_vec *bvec = &bq->bv[slot];
+			size_t part = umin(bvec->bv_len - skip, size);
+
+			res |= bvec->bv_offset + skip;
+			res |= part;
+
+			size -= part;
+			if (size == 0)
+				return res;
+			skip = 0;
+		}
+
+		slot = 0;
+	}
+
+	return res;
+}
+
 unsigned long iov_iter_alignment(const struct iov_iter *i)
 {
 	if (likely(iter_is_ubuf(i))) {
@@ -854,6 +993,8 @@ unsigned long iov_iter_alignment(const struct iov_iter *i)
 
 	if (iov_iter_is_bvec(i))
 		return iov_iter_alignment_bvec(i);
+	if (iov_iter_is_bvecq(i))
+		return iov_iter_alignment_bvecq(i);
 
 	/* With both xarray and folioq types, we're dealing with whole folios. */
 	if (iov_iter_is_folioq(i))
@@ -1066,6 +1207,36 @@ static int bvec_npages(const struct iov_iter *i, int maxpages)
 	return npages;
 }
 
+static size_t iov_npages_bvecq(const struct iov_iter *iter, size_t maxpages)
+{
+	const struct bvecq *bq;
+	unsigned int slot = iter->bvecq_slot;
+	size_t npages = 0;
+	size_t skip = iter->iov_offset;
+	size_t size = iter->count;
+
+	for (bq = iter->bvecq; bq; bq = bq->next) {
+		for (; slot < bq->nr_slots; slot++) {
+			const struct bio_vec *bvec = &bq->bv[slot];
+			size_t offs = (bvec->bv_offset + skip) % PAGE_SIZE;
+			size_t part = umin(bvec->bv_len - skip, size);
+
+			npages += DIV_ROUND_UP(offs + part, PAGE_SIZE);
+			if (npages >= maxpages)
+				goto out;
+
+			size -= part;
+			if (!size)
+				goto out;
+			skip = 0;
+		}
+
+		slot = 0;
+	}
+out:
+	return umin(npages, maxpages);
+}
+
 int iov_iter_npages(const struct iov_iter *i, int maxpages)
 {
 	if (unlikely(!i->count))
@@ -1080,6 +1251,8 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
 		return iov_npages(i, maxpages);
 	if (iov_iter_is_bvec(i))
 		return bvec_npages(i, maxpages);
+	if (iov_iter_is_bvecq(i))
+		return iov_npages_bvecq(i, maxpages);
 	if (iov_iter_is_folioq(i)) {
 		unsigned offset = i->iov_offset % PAGE_SIZE;
 		int npages = DIV_ROUND_UP(offset + i->count, PAGE_SIZE);
@@ -1366,6 +1539,147 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
 	i->nr_segs = state->nr_segs;
 }
 
+/*
+ * Count the number of virtually contiguous pages coming up next in an
+ * ITER_BVECQ iterator, up to the specified maxima.
+ */
+static unsigned int iter_count_bvecq_pages(const struct iov_iter *iter,
+					   size_t maxsize,
+					   unsigned int maxpages)
+{
+	const struct bvecq *bvecq = iter->bvecq;
+	unsigned int slot = iter->bvecq_slot;
+	ssize_t remain = umin(maxsize, iter->count);
+	size_t count = 0, offset = iter->iov_offset;
+
+	for (;; slot++) {
+		const struct bio_vec *bv;
+		size_t boff, blen;
+
+		while (slot >= bvecq->nr_slots) {
+			if (!bvecq->next) {
+				WARN_ON_ONCE(remain > 0);
+				break;
+			}
+			bvecq = bvecq->next;
+			slot = 0;
+		}
+
+		bv = &bvecq->bv[slot];
+		boff = bv->bv_offset;
+		blen = bv->bv_len;
+
+		if (unlikely(!bv->bv_page)) {
+			if (blen && count > 0)
+				break;
+			continue;
+		}
+		if (!PAGE_ALIGNED(boff) && count > 0)
+			break;
+
+		boff += offset;
+		blen -= offset;
+		offset = 0;
+		if (!blen)
+			continue;
+
+		blen = umin(blen, remain);
+		remain -= blen;
+		blen += offset_in_page(boff);
+		count += DIV_ROUND_UP(blen, PAGE_SIZE);
+
+		if (!PAGE_ALIGNED(blen))
+			break;
+		if (remain <= 0)
+			break;
+		if (count >= maxpages)
+			break;
+	}
+
+	return umin(count, maxpages);
+}
+
+/*
+ * Extract a list of virtually contiguous pages from an ITER_BVECQ iterator.
+ * This does not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_bvecq_pages(struct iov_iter *iter,
+					    struct page ***pages, size_t maxsize,
+					    unsigned int maxpages,
+					    iov_iter_extraction_t extraction_flags,
+					    size_t *offset0)
+{
+	const struct bvecq *bvecq;
+	struct page **p;
+	unsigned int slot, nr = 0;
+	size_t extracted = 0, offset;
+
+	/* Count the next run of virtually contiguous pages. */
+	maxpages = iter_count_bvecq_pages(iter, maxsize, maxpages);
+
+	if (!*pages) {
+		*pages = kvmalloc_array(maxpages, sizeof(struct page *), GFP_KERNEL);
+		if (!*pages)
+			return -ENOMEM;
+	}
+
+	p = *pages;
+
+	/* Now transcribe the page pointers. */
+	extracted = 0;
+	bvecq = iter->bvecq;
+	offset = iter->iov_offset;
+	slot = iter->bvecq_slot;
+
+	do {
+		const struct bio_vec *bv;
+		size_t boff, blen;
+
+		while (slot >= bvecq->nr_slots) {
+			if (!bvecq->next) {
+				WARN_ON_ONCE(extracted < iter->count);
+				break;
+			}
+			bvecq = bvecq->next;
+			slot = 0;
+		}
+
+		bv = &bvecq->bv[slot];
+		boff = bv->bv_offset;
+		blen = bv->bv_len;
+
+		if (!bv->bv_page)
+			blen = 0;
+
+		if (offset < blen) {
+			size_t part = umin(maxsize - extracted, blen - offset);
+			size_t poff = (boff + offset) % PAGE_SIZE;
+			size_t pix = (boff + offset) / PAGE_SIZE;
+
+			if (poff + part > PAGE_SIZE)
+				part = PAGE_SIZE - poff;
+
+			if (!extracted)
+				*offset0 = poff;
+
+			p[nr++] = bv->bv_page + pix;
+			offset += part;
+			extracted += part;
+		}
+
+		if (offset >= blen) {
+			offset = 0;
+			slot++;
+		}
+	} while (nr < maxpages && extracted < maxsize);
+
+	iter->bvecq = bvecq;
+	iter->bvecq_slot = slot;
+	iter->iov_offset = offset;
+	iter->count -= extracted;
+	return extracted;
+}
+
 /*
  * Extract a list of contiguous pages from an ITER_FOLIOQ iterator.  This does
  * not get references on the pages, nor does it get a pin on them.
@@ -1708,6 +2022,10 @@ ssize_t iov_iter_extract_pages(struct iov_iter *i,
 		return iov_iter_extract_bvec_pages(i, pages, maxsize,
 						   maxpages, extraction_flags,
 						   offset0);
+	if (iov_iter_is_bvecq(i))
+		return iov_iter_extract_bvecq_pages(i, pages, maxsize,
+						    maxpages, extraction_flags,
+						    offset0);
 	if (iov_iter_is_folioq(i))
 		return iov_iter_extract_folioq_pages(i, pages, maxsize,
 						     maxpages, extraction_flags,
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index b7fe91ef35b8..b92144659543 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -10,6 +10,7 @@
 #include <linux/highmem.h>
 #include <linux/kmemleak.h>
 #include <linux/bvec.h>
+#include <linux/bvecq.h>
 #include <linux/uio.h>
 #include <linux/folio_queue.h>
 
@@ -1267,6 +1268,68 @@ static ssize_t extract_kvec_to_sg(struct iov_iter *iter,
 	return ret;
 }
 
+/*
+ * Extract up to sg_max folios from a BVECQ-type iterator and add them to
+ * the scatterlist.  The pages are not pinned.
+ */
+static ssize_t extract_bvecq_to_sg(struct iov_iter *iter,
+				   ssize_t maxsize,
+				   struct sg_table *sgtable,
+				   unsigned int sg_max,
+				   iov_iter_extraction_t extraction_flags)
+{
+	const struct bvecq *bvecq = iter->bvecq;
+	struct scatterlist *sg = sgtable->sgl + sgtable->nents;
+	unsigned int seg = iter->bvecq_slot;
+	ssize_t ret = 0;
+	size_t offset = iter->iov_offset;
+
+	if (seg >= bvecq->nr_slots) {
+		bvecq = bvecq->next;
+		if (WARN_ON_ONCE(!bvecq))
+			return 0;
+		seg = 0;
+	}
+
+	do {
+		const struct bio_vec *bv = &bvecq->bv[seg];
+		size_t blen = bv->bv_len;
+
+		if (!bv->bv_page)
+			blen = 0;
+
+		if (offset < blen) {
+			size_t part = umin(maxsize - ret, blen - offset);
+
+			sg_set_page(sg, bv->bv_page, part, bv->bv_offset + offset);
+			sgtable->nents++;
+			sg++;
+			sg_max--;
+			offset += part;
+			ret += part;
+		}
+
+		if (offset >= blen) {
+			offset = 0;
+			seg++;
+			if (seg >= bvecq->nr_slots) {
+				if (!bvecq->next) {
+					WARN_ON_ONCE(ret < iter->count);
+					break;
+				}
+				bvecq = bvecq->next;
+				seg = 0;
+			}
+		}
+	} while (sg_max > 0 && ret < maxsize);
+
+	iter->bvecq = bvecq;
+	iter->bvecq_slot = seg;
+	iter->iov_offset = offset;
+	iter->count -= ret;
+	return ret;
+}
+
 /*
  * Extract up to sg_max folios from an FOLIOQ-type iterator and add them to
  * the scatterlist.  The pages are not pinned.
@@ -1390,8 +1453,8 @@ static ssize_t extract_xarray_to_sg(struct iov_iter *iter,
  * addition of @sg_max elements.
  *
  * The pages referred to by UBUF- and IOVEC-type iterators are extracted and
- * pinned; BVEC-, KVEC-, FOLIOQ- and XARRAY-type are extracted but aren't
- * pinned; DISCARD-type is not supported.
+ * pinned; BVEC-, BVECQ-, KVEC-, FOLIOQ- and XARRAY-type are extracted but
+ * aren't pinned; DISCARD-type is not supported.
  *
  * No end mark is placed on the scatterlist; that's left to the caller.
  *
@@ -1423,6 +1486,9 @@ ssize_t extract_iter_to_sg(struct iov_iter *iter, size_t maxsize,
 	case ITER_KVEC:
 		return extract_kvec_to_sg(iter, maxsize, sgtable, sg_max,
 					  extraction_flags);
+	case ITER_BVECQ:
+		return extract_bvecq_to_sg(iter, maxsize, sgtable, sg_max,
+					   extraction_flags);
 	case ITER_FOLIOQ:
 		return extract_folioq_to_sg(iter, maxsize, sgtable, sg_max,
 					    extraction_flags);
diff --git a/lib/tests/kunit_iov_iter.c b/lib/tests/kunit_iov_iter.c
index f02f7b7aa796..d3e8e22ca9ca 100644
--- a/lib/tests/kunit_iov_iter.c
+++ b/lib/tests/kunit_iov_iter.c
@@ -12,6 +12,7 @@
 #include <linux/mm.h>
 #include <linux/uio.h>
 #include <linux/bvec.h>
+#include <linux/bvecq.h>
 #include <linux/folio_queue.h>
 #include <linux/scatterlist.h>
 #include <linux/minmax.h>
@@ -545,6 +546,185 @@ static void __init iov_kunit_copy_from_folioq(struct kunit *test)
 	KUNIT_SUCCEED(test);
 }
 
+static void iov_kunit_destroy_bvecq(void *data)
+{
+	struct bvecq *bq, *next;
+
+	for (bq = data; bq; bq = next) {
+		next = bq->next;
+		for (int i = 0; i < bq->nr_slots; i++)
+			if (bq->bv[i].bv_page)
+				put_page(bq->bv[i].bv_page);
+		kfree(bq);
+	}
+}
+
+static struct bvecq *iov_kunit_alloc_bvecq(struct kunit *test, unsigned int max_slots)
+{
+	struct bvecq *bq;
+
+	bq = kzalloc(struct_size(bq, __bv, max_slots), GFP_KERNEL);
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, bq);
+	bq->max_slots = max_slots;
+	bq->bv = bq->__bv;
+	bq->inline_bv = true;
+	return bq;
+}
+
+static struct bvecq *iov_kunit_create_bvecq(struct kunit *test, unsigned int max_slots)
+{
+	struct bvecq *bq;
+
+	bq = iov_kunit_alloc_bvecq(test, max_slots);
+	kunit_add_action_or_reset(test, iov_kunit_destroy_bvecq, bq);
+	return bq;
+}
+
+static void __init iov_kunit_load_bvecq(struct kunit *test,
+					struct iov_iter *iter, int dir,
+					struct bvecq *bq_head,
+					struct page **pages, size_t npages)
+{
+	struct bvecq *bq = bq_head;
+	size_t size = 0;
+
+	for (int i = 0; i < npages; i++) {
+		if (bq->nr_slots >= bq->max_slots) {
+			bq->next = iov_kunit_alloc_bvecq(test, 13);
+			bq->next->prev = bq;
+			bq = bq->next;
+		}
+		bvec_set_page(&bq->bv[bq->nr_slots], pages[i], PAGE_SIZE, 0);
+		bq->nr_slots++;
+		size += PAGE_SIZE;
+	}
+	iov_iter_bvec_queue(iter, dir, bq_head, 0, 0, size);
+}
+
+/*
+ * Test copying to an ITER_BVECQ-type iterator.
+ */
+static void __init iov_kunit_copy_to_bvecq(struct kunit *test)
+{
+	const struct kvec_test_range *pr;
+	struct iov_iter iter;
+	struct bvecq *bq;
+	struct page **spages, **bpages;
+	u8 *scratch, *buffer;
+	size_t bufsize, npages, size, copied;
+	int i, patt;
+
+	bufsize = 0x100000;
+	npages = bufsize / PAGE_SIZE;
+
+	bq = iov_kunit_create_bvecq(test, 13);
+
+	scratch = iov_kunit_create_buffer(test, &spages, npages);
+	for (i = 0; i < bufsize; i++)
+		scratch[i] = pattern(i);
+
+	buffer = iov_kunit_create_buffer(test, &bpages, npages);
+	memset(buffer, 0, bufsize);
+
+	iov_kunit_load_bvecq(test, &iter, READ, bq, bpages, npages);
+
+	i = 0;
+	for (pr = kvec_test_ranges; pr->from >= 0; pr++) {
+		size = pr->to - pr->from;
+		KUNIT_ASSERT_LE(test, pr->to, bufsize);
+
+		iov_iter_bvec_queue(&iter, READ, bq, 0, 0, pr->to);
+		iov_iter_advance(&iter, pr->from);
+		copied = copy_to_iter(scratch + i, size, &iter);
+
+		KUNIT_EXPECT_EQ(test, copied, size);
+		KUNIT_EXPECT_EQ(test, iter.count, 0);
+		i += size;
+		if (test->status == KUNIT_FAILURE)
+			goto stop;
+	}
+
+	/* Build the expected image in the scratch buffer. */
+	patt = 0;
+	memset(scratch, 0, bufsize);
+	for (pr = kvec_test_ranges; pr->from >= 0; pr++)
+		for (i = pr->from; i < pr->to; i++)
+			scratch[i] = pattern(patt++);
+
+	/* Compare the images */
+	for (i = 0; i < bufsize; i++) {
+		KUNIT_EXPECT_EQ_MSG(test, buffer[i], scratch[i], "at i=%x", i);
+		if (buffer[i] != scratch[i])
+			return;
+	}
+
+stop:
+	KUNIT_SUCCEED(test);
+}
+
+/*
+ * Test copying from an ITER_BVECQ-type iterator.
+ */
+static void __init iov_kunit_copy_from_bvecq(struct kunit *test)
+{
+	const struct kvec_test_range *pr;
+	struct iov_iter iter;
+	struct bvecq *bq;
+	struct page **spages, **bpages;
+	u8 *scratch, *buffer;
+	size_t bufsize, npages, size, copied;
+	int i, j;
+
+	bufsize = 0x100000;
+	npages = bufsize / PAGE_SIZE;
+
+	bq = iov_kunit_create_bvecq(test, 13);
+
+	buffer = iov_kunit_create_buffer(test, &bpages, npages);
+	for (i = 0; i < bufsize; i++)
+		buffer[i] = pattern(i);
+
+	scratch = iov_kunit_create_buffer(test, &spages, npages);
+	memset(scratch, 0, bufsize);
+
+	iov_kunit_load_bvecq(test, &iter, READ, bq, bpages, npages);
+
+	i = 0;
+	for (pr = kvec_test_ranges; pr->from >= 0; pr++) {
+		size = pr->to - pr->from;
+		KUNIT_ASSERT_LE(test, pr->to, bufsize);
+
+		iov_iter_bvec_queue(&iter, WRITE, bq, 0, 0, pr->to);
+		iov_iter_advance(&iter, pr->from);
+		copied = copy_from_iter(scratch + i, size, &iter);
+
+		KUNIT_EXPECT_EQ(test, copied, size);
+		KUNIT_EXPECT_EQ(test, iter.count, 0);
+		i += size;
+	}
+
+	/* Build the expected image in the main buffer. */
+	i = 0;
+	memset(buffer, 0, bufsize);
+	for (pr = kvec_test_ranges; pr->from >= 0; pr++) {
+		for (j = pr->from; j < pr->to; j++) {
+			buffer[i++] = pattern(j);
+			if (i >= bufsize)
+				goto stop;
+		}
+	}
+stop:
+
+	/* Compare the images */
+	for (i = 0; i < bufsize; i++) {
+		KUNIT_EXPECT_EQ_MSG(test, scratch[i], buffer[i], "at i=%x", i);
+		if (scratch[i] != buffer[i])
+			return;
+	}
+
+	KUNIT_SUCCEED(test);
+}
+
 static void iov_kunit_destroy_xarray(void *data)
 {
 	struct xarray *xarray = data;
@@ -860,6 +1040,85 @@ static void __init iov_kunit_extract_pages_bvec(struct kunit *test)
 	KUNIT_SUCCEED(test);
 }
 
+/*
+ * Test the extraction of ITER_BVECQ-type iterators.
+ */
+static void __init iov_kunit_extract_pages_bvecq(struct kunit *test)
+{
+	const struct kvec_test_range *pr;
+	struct iov_iter iter;
+	struct bvecq *bq;
+	struct page **bpages, *pagelist[8], **pages = pagelist;
+	ssize_t len;
+	size_t bufsize, size = 0, npages;
+	int i, from;
+
+	bufsize = 0x100000;
+	npages = bufsize / PAGE_SIZE;
+
+	bq = iov_kunit_create_bvecq(test, 13);
+
+	iov_kunit_create_buffer(test, &bpages, npages);
+	iov_kunit_load_bvecq(test, &iter, READ, bq, bpages, npages);
+
+	for (pr = kvec_test_ranges; pr->from >= 0; pr++) {
+		from = pr->from;
+		size = pr->to - from;
+		KUNIT_ASSERT_LE(test, pr->to, bufsize);
+
+		iov_iter_bvec_queue(&iter, WRITE, bq, 0, 0, pr->to);
+		iov_iter_advance(&iter, from);
+
+		do {
+			size_t offset0 = LONG_MAX;
+
+			for (i = 0; i < ARRAY_SIZE(pagelist); i++)
+				pagelist[i] = (void *)(unsigned long)0xaa55aa55aa55aa55ULL;
+
+			len = iov_iter_extract_pages(&iter, &pages, 100 * 1024,
+						     ARRAY_SIZE(pagelist), 0, &offset0);
+			KUNIT_EXPECT_GE(test, len, 0);
+			if (len < 0)
+				break;
+			KUNIT_EXPECT_LE(test, len, size);
+			KUNIT_EXPECT_EQ(test, iter.count, size - len);
+			if (len == 0)
+				break;
+			size -= len;
+			KUNIT_EXPECT_GE(test, (ssize_t)offset0, 0);
+			KUNIT_EXPECT_LT(test, offset0, PAGE_SIZE);
+
+			for (i = 0; i < ARRAY_SIZE(pagelist); i++) {
+				struct page *p;
+				ssize_t part = min_t(ssize_t, len, PAGE_SIZE - offset0);
+				int ix;
+
+				KUNIT_ASSERT_GE(test, part, 0);
+				ix = from / PAGE_SIZE;
+				KUNIT_ASSERT_LT(test, ix, npages);
+				p = bpages[ix];
+				KUNIT_EXPECT_PTR_EQ(test, pagelist[i], p);
+				KUNIT_EXPECT_EQ(test, offset0, from % PAGE_SIZE);
+				from += part;
+				len -= part;
+				KUNIT_ASSERT_GE(test, len, 0);
+				if (len == 0)
+					break;
+				offset0 = 0;
+			}
+
+			if (test->status == KUNIT_FAILURE)
+				goto stop;
+		} while (iov_iter_count(&iter) > 0);
+
+		KUNIT_EXPECT_EQ(test, size, 0);
+		KUNIT_EXPECT_EQ(test, iter.count, 0);
+	}
+
+stop:
+	KUNIT_SUCCEED(test);
+}
+
 /*
  * Test the extraction of ITER_FOLIOQ-type iterators.
  */
@@ -1219,12 +1478,15 @@ static struct kunit_case __refdata iov_kunit_cases[] = {
 	KUNIT_CASE(iov_kunit_copy_from_kvec),
 	KUNIT_CASE(iov_kunit_copy_to_bvec),
 	KUNIT_CASE(iov_kunit_copy_from_bvec),
+	KUNIT_CASE(iov_kunit_copy_to_bvecq),
+	KUNIT_CASE(iov_kunit_copy_from_bvecq),
 	KUNIT_CASE(iov_kunit_copy_to_folioq),
 	KUNIT_CASE(iov_kunit_copy_from_folioq),
 	KUNIT_CASE(iov_kunit_copy_to_xarray),
 	KUNIT_CASE(iov_kunit_copy_from_xarray),
 	KUNIT_CASE(iov_kunit_extract_pages_kvec),
 	KUNIT_CASE(iov_kunit_extract_pages_bvec),
+	KUNIT_CASE(iov_kunit_extract_pages_bvecq),
 	KUNIT_CASE(iov_kunit_extract_pages_folioq),
 	KUNIT_CASE(iov_kunit_extract_pages_xarray),
 	KUNIT_CASE(iov_kunit_iter_to_sg_kvec),


* [PATCH v3 07/22] iov_iter: Make iov_iter_get_pages*() wrap iov_iter_extract_pages()
From: David Howells @ 2026-06-08 14:54 UTC (permalink / raw)
  To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
  Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky,
	Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen,
	Dominique Martinet, Ilya Dryomov, Trond Myklebust, netfs,
	linux-afs, linux-cifs, linux-nfs, ceph-devel, v9fs, linux-erofs,
	linux-fsdevel, linux-kernel, linux-block
In-Reply-To: <20260608145432.681865-1-dhowells@redhat.com>

Make iov_iter_get_pages*() wrap iov_iter_extract_pages() for kernel
iterator types (e.g. ITER_BVEC, ITER_FOLIOQ, ITER_XARRAY).  The pages
obtained have their refcounts incremented afterwards if they're not slab
pages.  ITER_KVEC is left returning -EFAULT.
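
As a rough illustration of the "extract, then take references" step described
above, the following userspace sketch models the loop that walks the extracted
page array. It is only a model under stated assumptions: struct model_page,
model_get_extracted() and MODEL_PAGE_SIZE are hypothetical, a 4 KiB page size
is assumed, and the reference is counted per page here whereas the patch takes
it on the page's folio.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096L	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for a page with a refcount and a slab marker. */
struct model_page {
	int refcount;
	bool is_slab;
};

/*
 * Walk the pages covered by 'extracted' bytes that begin 'start_offset'
 * bytes into the first page, and bump the refcount of every non-slab page.
 * Returns the number of pages visited.
 */
static size_t model_get_extracted(struct model_page **pages,
				  size_t start_offset, long extracted)
{
	size_t n = 0;
	long done;

	assert(extracted > 0 && start_offset < (size_t)MODEL_PAGE_SIZE);

	/* One iteration per page: ceil((extracted + start_offset) / PAGE_SIZE). */
	for (done = extracted + (long)start_offset; done > 0;
	     done -= MODEL_PAGE_SIZE) {
		if (!pages[n]->is_slab)
			pages[n]->refcount++;
		n++;
	}
	return n;
}
```

With start_offset = 100 and extracted = 8192 the loop visits three pages,
since the data spills 100 bytes into a third page.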

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 lib/iov_iter.c | 164 ++++++-------------------------------------------
 1 file changed, 19 insertions(+), 145 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 243662af1af7..cac7d7364bc2 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -910,118 +910,34 @@ static int want_pages_array(struct page ***res, size_t size,
 	return count;
 }
 
-static ssize_t iter_folioq_get_pages(struct iov_iter *iter,
+/*
+ * Wrap iov_iter_extract_pages() and then take a ref on the non-slab pages we
+ * got back.  This only works for non-user iterator types as get_pages uses
+ * get_user_pages not pin_user_pages.
+ */
+static ssize_t iter_get_kernel_pages(struct iov_iter *iter,
 				     struct page ***ppages, size_t maxsize,
 				     unsigned maxpages, size_t *_start_offset)
 {
-	const struct folio_queue *folioq = iter->folioq;
 	struct page **pages;
-	unsigned int slot = iter->folioq_slot;
-	size_t extracted = 0, count = iter->count, iov_offset = iter->iov_offset;
+	ssize_t ret, done;
 
-	if (slot >= folioq_nr_slots(folioq)) {
-		folioq = folioq->next;
-		slot = 0;
-		if (WARN_ON(iov_offset != 0))
-			return -EIO;
-	}
+	ret = iov_iter_extract_pages(iter, ppages, maxsize, maxpages,
+				     0, _start_offset);
+	if (ret <= 0)
+		return ret;
 
-	maxpages = want_pages_array(ppages, maxsize, iov_offset & ~PAGE_MASK, maxpages);
-	if (!maxpages)
-		return -ENOMEM;
-	*_start_offset = iov_offset & ~PAGE_MASK;
 	pages = *ppages;
+	for (done = ret + *_start_offset; done > 0; done -= PAGE_SIZE) {
+		struct folio *folio = page_folio(*pages);
 
-	for (;;) {
-		struct folio *folio = folioq_folio(folioq, slot);
-		size_t offset = iov_offset, fsize = folioq_folio_size(folioq, slot);
-		size_t part = PAGE_SIZE - offset % PAGE_SIZE;
-
-		if (offset < fsize) {
-			part = umin(part, umin(maxsize - extracted, fsize - offset));
-			count -= part;
-			iov_offset += part;
-			extracted += part;
-
-			*pages = folio_page(folio, offset / PAGE_SIZE);
-			get_page(*pages);
-			pages++;
-			maxpages--;
-		}
-
-		if (maxpages == 0 || extracted >= maxsize)
-			break;
-
-		if (iov_offset >= fsize) {
-			iov_offset = 0;
-			slot++;
-			if (slot == folioq_nr_slots(folioq) && folioq->next) {
-				folioq = folioq->next;
-				slot = 0;
-			}
-		}
-	}
-
-	iter->count = count;
-	iter->iov_offset = iov_offset;
-	iter->folioq = folioq;
-	iter->folioq_slot = slot;
-	return extracted;
-}
-
-static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarray *xa,
-					  pgoff_t index, unsigned int nr_pages)
-{
-	XA_STATE(xas, xa, index);
-	struct folio *folio;
-	unsigned int ret = 0;
-
-	rcu_read_lock();
-	for (folio = xas_load(&xas); folio; folio = xas_next(&xas)) {
-		if (xas_retry(&xas, folio))
-			continue;
-
-		/* Has the folio moved or been split? */
-		if (unlikely(folio != xas_reload(&xas))) {
-			xas_reset(&xas);
-			continue;
-		}
-
-		pages[ret] = folio_file_page(folio, xas.xa_index);
-		folio_get(folio);
-		if (++ret == nr_pages)
-			break;
+		if (!folio_test_slab(folio))
+			folio_get(folio);
+		pages++;
 	}
-	rcu_read_unlock();
 	return ret;
 }
 
-static ssize_t iter_xarray_get_pages(struct iov_iter *i,
-				     struct page ***pages, size_t maxsize,
-				     unsigned maxpages, size_t *_start_offset)
-{
-	unsigned nr, offset, count;
-	pgoff_t index;
-	loff_t pos;
-
-	pos = i->xarray_start + i->iov_offset;
-	index = pos >> PAGE_SHIFT;
-	offset = pos & ~PAGE_MASK;
-	*_start_offset = offset;
-
-	count = want_pages_array(pages, maxsize, offset, maxpages);
-	if (!count)
-		return -ENOMEM;
-	nr = iter_xarray_populate_pages(*pages, i->xarray, index, count);
-	if (nr == 0)
-		return 0;
-
-	maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
-	i->iov_offset += maxsize;
-	i->count -= maxsize;
-	return maxsize;
-}
-
 /* must be done on non-empty ITER_UBUF or ITER_IOVEC one */
 static unsigned long first_iovec_segment(const struct iov_iter *i, size_t *size)
 {
@@ -1044,22 +960,6 @@ static unsigned long first_iovec_segment(const struct iov_iter *i, size_t *size)
 	BUG(); // if it had been empty, we wouldn't get called
 }
 
-/* must be done on non-empty ITER_BVEC one */
-static struct page *first_bvec_segment(const struct iov_iter *i,
-				       size_t *size, size_t *start)
-{
-	struct page *page;
-	size_t skip = i->iov_offset, len;
-
-	len = i->bvec->bv_len - skip;
-	if (*size > len)
-		*size = len;
-	skip += i->bvec->bv_offset;
-	page = i->bvec->bv_page + skip / PAGE_SIZE;
-	*start = skip % PAGE_SIZE;
-	return page;
-}
-
 static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   unsigned int maxpages, size_t *start)
@@ -1095,36 +995,10 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		iov_iter_advance(i, maxsize);
 		return maxsize;
 	}
-	if (iov_iter_is_bvec(i)) {
-		struct page **p;
-		struct page *page;
 
-		page = first_bvec_segment(i, &maxsize, start);
-		n = want_pages_array(pages, maxsize, *start, maxpages);
-		if (!n)
-			return -ENOMEM;
-		p = *pages;
-		for (int k = 0; k < n; k++) {
-			struct folio *folio = page_folio(page + k);
-			p[k] = page + k;
-			if (!folio_test_slab(folio))
-				folio_get(folio);
-		}
-		maxsize = min_t(size_t, maxsize, n * PAGE_SIZE - *start);
-		i->count -= maxsize;
-		i->iov_offset += maxsize;
-		if (i->iov_offset == i->bvec->bv_len) {
-			i->iov_offset = 0;
-			i->bvec++;
-			i->nr_segs--;
-		}
-		return maxsize;
-	}
-	if (iov_iter_is_folioq(i))
-		return iter_folioq_get_pages(i, pages, maxsize, maxpages, start);
-	if (iov_iter_is_xarray(i))
-		return iter_xarray_get_pages(i, pages, maxsize, maxpages, start);
-	return -EFAULT;
+	if (iov_iter_is_kvec(i))
+		return -EFAULT;
+	return iter_get_kernel_pages(i, pages, maxsize, maxpages, start);
 }
 
 ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,


* [PATCH v3 06/22] Add a function to kmap one page of a multipage bio_vec
From: David Howells @ 2026-06-08 14:54 UTC (permalink / raw)
  To: Christian Brauner, Matthew Wilcox, Christoph Hellwig
  Cc: David Howells, Paulo Alcantara, Jens Axboe, Leon Romanovsky,
	Steve French, ChenXiaoSong, Marc Dionne, Eric Van Hensbergen,
	Dominique Martinet, Ilya Dryomov, Trond Myklebust, netfs,
	linux-afs, linux-cifs, linux-nfs, ceph-devel, v9fs, linux-erofs,
	linux-fsdevel, linux-kernel, linux-block
In-Reply-To: <20260608145432.681865-1-dhowells@redhat.com>

Add a function to kmap one page of a multipage bio_vec by offset (which is
added to the offset in the bio_vec internally).  The caller is responsible
for calculating how much of the page is then available.
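
To make the caller's side of that contract concrete, here is a small userspace
sketch of the arithmetic for "how much of the page is then available". It is
an assumption-laden model, not part of the patch: model_partial_page(),
model_partial_avail() and MODEL_PAGE_SIZE are hypothetical names and a 4 KiB
page size is assumed. The usable length is bounded both by the end of the
mapped page and by the end of the bio_vec.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE ((size_t)4096)	/* assumption: 4 KiB pages */

/* Index of the page (relative to bv_page) that holds byte 'offset'. */
static size_t model_partial_page(size_t bv_offset, size_t offset)
{
	return (bv_offset + offset) / MODEL_PAGE_SIZE;
}

/*
 * Bytes usable from the address a bvec_kmap_partial()-style helper would
 * return: up to the end of that page, or the end of the bvec if sooner.
 */
static size_t model_partial_avail(size_t bv_offset, size_t bv_len,
				  size_t offset)
{
	size_t in_page, to_page_end, to_bvec_end;

	assert(offset < bv_len);	/* the caller must not overrun the bvec */

	in_page = (bv_offset + offset) % MODEL_PAGE_SIZE;
	to_page_end = MODEL_PAGE_SIZE - in_page;
	to_bvec_end = bv_len - offset;
	return to_page_end < to_bvec_end ? to_page_end : to_bvec_end;
}
```

For a bvec with bv_offset = 512 and bv_len = 12288, offset 4000 lands 416
bytes into page 1, leaving 3680 bytes before the page ends; offset 12200 lands
in page 3 with only 88 bytes of the bvec left.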

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@infradead.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 include/linux/bvec.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index d36dd476feda..f834a862224e 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -299,4 +299,22 @@ static inline phys_addr_t bvec_phys(const struct bio_vec *bvec)
 	return page_to_phys(bvec->bv_page) + bvec->bv_offset;
 }
 
+/**
+ * bvec_kmap_partial - Map part of a bvec into the kernel virtual address space
+ * @bvec: bvec to map
+ * @offset: Offset into bvec
+ *
+ * Map the page containing the byte at @offset into the kernel virtual address
+ * space.  The caller is responsible for making sure this doesn't overrun.
+ *
+ * Call kunmap_local on the returned address to unmap.
+ */
+static inline void *bvec_kmap_partial(struct bio_vec *bvec, size_t offset)
+{
+	offset += bvec->bv_offset;
+
+	return kmap_local_page(bvec->bv_page + (offset >> PAGE_SHIFT)) +
+		(offset & ~PAGE_MASK);
+}
+
 #endif /* __LINUX_BVEC_H */


* Re: [PATCH 4/4] block: add configurable error injection
From: Jens Axboe @ 2026-06-08 14:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jonathan Corbet, Damien Le Moal, Hannes Reinecke, Keith Busch,
	linux-block, linux-doc, Hannes Reinecke
In-Reply-To: <20260608051416.1205282-5-hch@lst.de>

On 6/7/26 11:14 PM, Christoph Hellwig wrote:
> diff --git a/block/blk.h b/block/blk.h
> index e8b7d5517086..10df23b2cb90 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -660,6 +660,18 @@ static inline bool should_fail_request(struct block_device *part,
>  }
>  #endif /* CONFIG_FAIL_MAKE_REQUEST */
>  
> +void blk_error_injection_init(struct gendisk *disk);
> +void blk_error_injection_exit(struct gendisk *disk);
> +bool __blk_error_inject(struct bio *bio);
> +static inline bool blk_error_inject(struct bio *bio)
> +{
> +	if (!IS_ENABLED(CONFIG_BLK_ERROR_INJECTION))
> +		return false;
> +	if (!test_bit(GD_ERROR_INJECT, &bio->bi_bdev->bd_disk->state))
> +		return false;
> +	return __blk_error_inject(bio);
> +}

I really hate this part, that's a pretty deep set of pointer chasings to
figure out if injection is enabled or not, when in practice error
injection is only ever enabled for specific test cases and distros
invariably will set CONFIG_BLK_ERROR_INJECTION because they turn on
every damn thing under the sun.

IOW, that won't fly for the hot path. Maybe a static key would be useful
here?
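
A userspace sketch of the shape being suggested, under stated assumptions: in
the kernel the global gate would be a static key (DEFINE_STATIC_KEY_FALSE(),
tested with static_branch_unlikely() and flipped with static_branch_inc() and
static_branch_dec()), which patches the branch out of the hot path. Here a
plain counter stands in for the key so the example runs outside the kernel,
and all model_* names are hypothetical. The point it shows is ordering: the
global check comes first, and the bio -> bdev -> disk pointer chase only
happens once some disk has injection configured.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the objects reached via bio->bi_bdev->bd_disk. */
struct model_disk { bool inject_enabled; };
struct model_bdev { struct model_disk *disk; };
struct model_bio  { struct model_bdev *bdev; };

/* Stand-in for the static key: number of disks with injection configured. */
static int inject_users;
/* Counts how often the slow path (the pointer chase) was actually taken. */
static unsigned long slow_checks;

static void model_inject_enable(struct model_disk *disk)
{
	if (!disk->inject_enabled) {
		disk->inject_enabled = true;
		inject_users++;		/* kernel: static_branch_inc() */
	}
}

static void model_inject_disable(struct model_disk *disk)
{
	if (disk->inject_enabled) {
		disk->inject_enabled = false;
		inject_users--;		/* kernel: static_branch_dec() */
	}
}

static bool model_should_inject(const struct model_bio *bio)
{
	/* Fast path: nothing dereferenced while injection is unused. */
	if (inject_users == 0)		/* kernel: !static_branch_unlikely() */
		return false;

	slow_checks++;
	return bio->bdev->disk->inject_enabled;
}
```

With this layout a kernel built with the option enabled pays only for the
gate until a test actually configures injection on some disk.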

-- 
Jens Axboe

* Re: [PATCH] block: optimize I/O merge hot path with unlikely() hints
From: Jens Axboe @ 2026-06-08 14:34 UTC (permalink / raw)
  To: Steven Feng; +Cc: linux-block, linux-kernel
In-Reply-To: <tencent_79B652BD0CC23E093F27914380F161E7E505@qq.com>


On Sat, 06 Jun 2026 10:42:18 +0800, Steven Feng wrote:
> Remove redundant '== false' comparisons and add unlikely() branch
> prediction hints in block I/O merge path functions.
> 
> These functions (ll_new_hw_segment, ll_merge_requests_fn, and
> blk_rq_merge_ok) are executed on every I/O request merge attempt,
> making them critical hot paths. Data integrity check failures are
> rare events, so marking these conditions as unlikely() helps the
> CPU optimize the common case by improving branch prediction.
> 
> [...]
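
For reference, a minimal userspace sketch of the kind of change the quoted
changelog describes. The function and parameter names are hypothetical and do
not come from the patch; model_unlikely() mirrors the usual
__builtin_expect(!!(x), 0) definition and falls back to a no-op on compilers
without that builtin. The before and after forms are behaviourally identical;
only the branch hint and the spelling of the test differ.

```c
#include <assert.h>
#include <stdbool.h>

#if defined(__GNUC__) || defined(__clang__)
#define model_unlikely(x) __builtin_expect(!!(x), 0)
#else
#define model_unlikely(x) (x)
#endif

/* Hypothetical stand-in for an integrity compatibility check. */
static bool model_integrity_ok(int req_profile, int bio_profile)
{
	return req_profile == bio_profile;
}

/* Before: explicit comparison against false, no branch hint. */
static bool merge_before(int req_profile, int bio_profile)
{
	if (model_integrity_ok(req_profile, bio_profile) == false)
		return false;
	return true;
}

/* After: negation plus a hint that the failure branch is the rare one. */
static bool merge_after(int req_profile, int bio_profile)
{
	if (model_unlikely(!model_integrity_ok(req_profile, bio_profile)))
		return false;
	return true;
}
```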

Applied, thanks!

[1/1] block: optimize I/O merge hot path with unlikely() hints
      commit: 7ed4aab1381f3439f45032eb860f89d9da5e45c2

Best regards,
-- 
Jens Axboe




* Re: [PATCH next] drivers/block/rbd: Use strscpy() to copy strings into arrays
From: Jens Axboe @ 2026-06-08 14:34 UTC (permalink / raw)
  To: Kees Cook, linux-hardening, Arnd Bergmann, ceph-devel,
	linux-block, linux-kernel, david.laight.linux
  Cc: Ilya Dryomov
In-Reply-To: <20260606202744.5113-5-david.laight.linux@gmail.com>


On Sat, 06 Jun 2026 21:27:44 +0100, david.laight.linux@gmail.com wrote:
> Replacing strcpy() with strscpy() ensures that overflow of the target
> buffer cannot happen.
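
A small userspace model of the behaviour being relied on, for readers who have
not used strscpy(). This is a sketch, not the kernel implementation:
model_strscpy() and MODEL_E2BIG are hypothetical names, written to follow the
documented contract (never write past the given size, always NUL-terminate a
non-empty destination, return the copied length or -E2BIG on truncation).

```c
#include <assert.h>
#include <stddef.h>

enum { MODEL_E2BIG = 7 };	/* E2BIG is 7 on Linux */

/*
 * Copy 'src' into 'dst', which holds 'size' bytes.  Returns the number of
 * characters copied (excluding the NUL), or -MODEL_E2BIG if 'size' is zero
 * or the source had to be truncated.
 */
static long model_strscpy(char *dst, const char *src, size_t size)
{
	size_t i;

	if (size == 0)
		return -MODEL_E2BIG;

	/* Leave room for the terminator: copy at most size - 1 characters. */
	for (i = 0; i < size - 1 && src[i]; i++)
		dst[i] = src[i];
	dst[i] = '\0';

	/* If the source has not ended yet, the copy was truncated. */
	return src[i] ? -MODEL_E2BIG : (long)i;
}
```

Unlike strcpy(), an over-long source here yields a truncated, terminated
string and an error return rather than a write past the end of the array.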

Applied, thanks!

[1/1] drivers/block/rbd: Use strscpy() to copy strings into arrays
      commit: 5ef1b0194b382fafe5023b5b014e4db3b948ee15

Best regards,
-- 
Jens Axboe




* Re: [PATCH] partitions: aix: bound the pp_count scan to the ppe array
From: Jens Axboe @ 2026-06-08 14:34 UTC (permalink / raw)
  To: Bryam Vargas
  Cc: Philippe De Muyter, Kees Cook, Michael Bommarito, linux-block,
	linux-kernel
In-Reply-To: <20260607064137.302574-1-hexlabsecurity@proton.me>


On Sun, 07 Jun 2026 06:41:43 +0000, Bryam Vargas wrote:
> aix_partition() reads the physical volume descriptor into a fixed-size
> struct pvd and then scans its physical-partition-extent array:
> 
> 	int numpps = be16_to_cpu(pvd->pp_count);
> 	...
> 	for (i = 0; i < numpps; i += 1) {
> 		struct ppe *p = pvd->ppe + i;
> 		...
> 		lp_ix = be16_to_cpu(p->lp_ix);
> 
> [...]
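
The quoted changelog is elided above, so the following is only a guess at the
general shape of such a bound, based on the subject line. It is a userspace
sketch with hypothetical names (struct model_pvd, model_scan_limit(),
MODEL_PPE_SLOTS) and a deliberately tiny array; byte-order conversion is left
out. The idea is that a count read from disk is clamped to the number of
entries the fixed-size array really has before it is used as a loop limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PPE_SLOTS 4	/* hypothetical size, kept small for the example */

struct model_ppe {
	uint16_t lp_ix;
};

struct model_pvd {
	uint16_t pp_count;	/* read from the disk, so untrusted */
	struct model_ppe ppe[MODEL_PPE_SLOTS];
};

/* Clamp the on-disk count to the capacity of the ppe array. */
static size_t model_scan_limit(const struct model_pvd *pvd)
{
	size_t slots = sizeof(pvd->ppe) / sizeof(pvd->ppe[0]);
	size_t numpps = (size_t)pvd->pp_count;

	return numpps < slots ? numpps : slots;
}

/* Count entries with a non-zero lp_ix without reading past the array. */
static unsigned int model_count_mapped(const struct model_pvd *pvd)
{
	size_t i, limit = model_scan_limit(pvd);
	unsigned int mapped = 0;

	for (i = 0; i < limit; i++)
		if (pvd->ppe[i].lp_ix)
			mapped++;
	return mapped;
}
```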

Applied, thanks!

[1/1] partitions: aix: bound the pp_count scan to the ppe array
      commit: 2dc0bfd2fe355fb930de63c2f2eb8ced8570c579

Best regards,
-- 
Jens Axboe




* Re: [PATCH v3 4/7] block: implement NVMEM provider
From: Loic Poulain @ 2026-06-08 13:00 UTC (permalink / raw)
  To: Bartosz Golaszewski, daniel
  Cc: linux-mmc, devicetree, linux-kernel, linux-arm-msm, linux-block,
	linux-wireless, ath10k, linux-bluetooth, netdev, Ulf Hansson,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley, Bjorn Andersson,
	Konrad Dybcio, Jens Axboe, Johannes Berg, Jeff Johnson,
	Marcel Holtmann, Luiz Augusto von Dentz, Balakrishna Godavarthi,
	Rocky Liao, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Srinivas Kandagatla, Andrew Lunn,
	Heiner Kallweit, Russell King, Saravana Kannan
In-Reply-To: <CAMRc=McmoWvezeH05_5AR7ZbAyg1L567HsKWbuD7711LwnjV0Q@mail.gmail.com>

On Mon, Jun 8, 2026 at 1:17 PM Bartosz Golaszewski <brgl@kernel.org> wrote:
>
> On Mon, 8 Jun 2026 12:50:41 +0200, Loic Poulain
> <loic.poulain@oss.qualcomm.com> said:
> > From: Daniel Golle <daniel@makrotopia.org>
> >
> > On embedded devices using an eMMC it is common that one or more partitions
> > on the eMMC are used to store MAC addresses and Wi-Fi calibration EEPROM
> > data. Allow referencing the partition in device tree for the kernel and
> > Wi-Fi drivers accessing it via the NVMEM layer.
> >
> > To safely defer the freeing of the provider private data until all
> > consumers have released their cells, a nvmem_dev() accessor is added to
> > the NVMEM core to expose the struct device embedded in struct nvmem_device.
> > This allows registering a devm action on the nvmem device itself, ensuring
> > the private data outlives any active cell references.
> >
> > Signed-off-by: Daniel Golle <daniel@makrotopia.org>
> > Co-developed-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
> > Signed-off-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
> > ---
> >  block/Kconfig                  |   9 +++
> >  block/Makefile                 |   1 +
> >  block/blk-nvmem.c              | 171 +++++++++++++++++++++++++++++++++++++++++
> >  drivers/nvmem/core.c           |  13 ++++
> >  include/linux/nvmem-consumer.h |   6 ++
> >  5 files changed, 200 insertions(+)
> >
> > diff --git a/block/Kconfig b/block/Kconfig
> > index 15027963472d7b40e27b9097a5993c457b5b3054..0b33747e16dc33473683706f75c92bdf8b648f7c 100644
> > --- a/block/Kconfig
> > +++ b/block/Kconfig
> > @@ -209,6 +209,15 @@ config BLK_INLINE_ENCRYPTION_FALLBACK
> >         by falling back to the kernel crypto API when inline
> >         encryption hardware is not present.
> >
> > +config BLK_NVMEM
> > +     bool "Block device NVMEM provider"
> > +     depends on OF
> > +     depends on NVMEM
> > +     help
> > +       Allow block devices (or partitions) to act as NVMEM providers,
> > +       typically used with eMMC to store MAC addresses or Wi-Fi
> > +       calibration data on embedded devices.
> > +
> >  source "block/partitions/Kconfig"
> >
> >  config BLK_PM
> > diff --git a/block/Makefile b/block/Makefile
> > index 7dce2e44276c4274c11a0a61121c83d9c43d6e0c..d7ac389e71902bc091a8800ea266190a43b3e63d 100644
> > --- a/block/Makefile
> > +++ b/block/Makefile
> > @@ -36,3 +36,4 @@ obj-$(CONFIG_BLK_INLINE_ENCRYPTION) += blk-crypto.o blk-crypto-profile.o \
> >                                          blk-crypto-sysfs.o
> >  obj-$(CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK) += blk-crypto-fallback.o
> >  obj-$(CONFIG_BLOCK_HOLDER_DEPRECATED)        += holder.o
> > +obj-$(CONFIG_BLK_NVMEM)                += blk-nvmem.o
> > diff --git a/block/blk-nvmem.c b/block/blk-nvmem.c
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..99c7728fb7bccdc2216780a73a89a9210f925049
> > --- /dev/null
> > +++ b/block/blk-nvmem.c
> > @@ -0,0 +1,171 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +/*
> > + * block device NVMEM provider
> > + *
> > + * Copyright (c) 2024 Daniel Golle <daniel@makrotopia.org>
> > + * Copyright (c) Qualcomm Technologies, Inc. and/or its subsidiaries.
> > + *
> > + * Useful on devices using a partition on an eMMC for MAC addresses or
> > + * Wi-Fi calibration EEPROM data.
> > + */
> > +
> > +#include <linux/cleanup.h>
> > +#include <linux/mutex.h>
> > +#include <linux/nvmem-provider.h>
> > +#include <linux/nvmem-consumer.h>
> > +#include <linux/of.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/property.h>
> > +
> > +#include "blk.h"
> > +
> > +
>
> Stray newline?
>
> > +/* List of all NVMEM devices */
> > +static LIST_HEAD(nvmem_devices);
> > +static DEFINE_MUTEX(devices_mutex);
> > +
> > +struct blk_nvmem {
> > +     struct nvmem_device     *nvmem;
> > +     dev_t                   devt;
> > +     struct list_head        list;
> > +};
> > +
> > +static int blk_nvmem_reg_read(void *priv, unsigned int from,
> > +                           void *val, size_t bytes)
> > +{
> > +     blk_mode_t mode = BLK_OPEN_READ | BLK_OPEN_RESTRICT_WRITES;
> > +     struct blk_nvmem *bnv = priv;
> > +     size_t bytes_left = bytes;
> > +     struct file *bdev_file;
> > +     loff_t pos = from;
> > +     int ret = 0;
> > +
> > +     bdev_file = bdev_file_open_by_dev(bnv->devt, mode, priv, NULL);
> > +     if (!bdev_file)
> > +             return -ENODEV;
> > +
> > +     if (IS_ERR(bdev_file))
> > +             return PTR_ERR(bdev_file);
> > +
> > +     while (bytes_left) {
> > +             pgoff_t f_index = pos >> PAGE_SHIFT;
> > +             struct folio *folio;
> > +             size_t folio_off;
> > +             size_t to_read;
> > +
> > +             folio = read_mapping_folio(bdev_file->f_mapping, f_index, NULL);
> > +             if (IS_ERR(folio)) {
> > +                     ret = PTR_ERR(folio);
> > +                     goto err_release_bdev;
> > +             }
> > +
> > +             folio_off = offset_in_folio(folio, pos);
> > +             to_read = min(bytes_left, folio_size(folio) - folio_off);
> > +             memcpy_from_folio(val, folio, folio_off, to_read);
> > +             pos += to_read;
> > +             bytes_left -= to_read;
> > +             val += to_read;
> > +             folio_put(folio);
> > +     }
> > +
> > +err_release_bdev:
> > +     fput(bdev_file);
>
> There's a __free() action defined in linux/file.h so you can use:
>
>         struct file *bdev_file __free(fput) = ...
>
> and avoid this label.

Ok, thanks, will use.
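
For reference, a minimal userspace sketch of the scope-based cleanup
pattern that __free() relies on. It assumes GCC or Clang (the `cleanup`
variable attribute) and uses FILE/fclose() as a stand-in for
struct file/fput(). The names auto_fclose, close_file_ptr and
read_first_byte are hypothetical and are not kernel API.

```c
#include <stdio.h>

/* Cleanup handler: the compiler passes a pointer to the annotated variable. */
static void close_file_ptr(FILE **fp)
{
	if (*fp)
		fclose(*fp);
}

/* Hypothetical macro playing the role of the kernel's __free(fput). */
#define auto_fclose __attribute__((cleanup(close_file_ptr)))

/* Return the first byte of the file at path, or -1 on any error. */
static int read_first_byte(const char *path)
{
	FILE *f auto_fclose = fopen(path, "rb");
	int c;

	if (!f)
		return -1;

	c = fgetc(f);
	if (c == EOF)
		return -1;	/* f is closed automatically on this path */

	return c;		/* and on this one, with no goto label */
}
```

Every return path releases the file, which is what removes the need for
an err_release_bdev style label.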

>
> > +
> > +     return ret;
> > +}
> > +
> > +static int blk_nvmem_register(struct device *dev)
> > +{
> > +     struct device_node *child, *np = dev_of_node(dev);
> > +     struct block_device *bdev = dev_to_bdev(dev);
> > +     struct nvmem_config config = {};
> > +     struct blk_nvmem *bnv;
> > +
> > +     /* skip devices which do not have a device tree node */
> > +     if (!np)
> > +             return 0;
> > +
> > +     /* skip devices without an nvmem layout defined */
> > +     child = of_get_child_by_name(np, "nvmem-layout");
> > +     if (!child)
> > +             return 0;
> > +     of_node_put(child);
>
> Same as above, can be:
>
>         struct device_node *child __free(device_node) = ...

Ack.

>
> > +
> > +     /*
> > +      * skip block device too large to be represented as NVMEM devices,
> > +      * the NVMEM reg_read callback uses an unsigned int offset
> > +      */
> > +     if (bdev_nr_bytes(bdev) > UINT_MAX)
> > +             return -EFBIG;
>
> This will mean a failed probe. Wouldn't it be better to use -ENODEV?

That would indeed be an appropriate error.

>
> > +
> > +     bnv = kzalloc_obj(*bnv);
> > +     if (!bnv)
> > +             return -ENOMEM;
> > +
> > +     config.id = NVMEM_DEVID_NONE;
> > +     config.dev = &bdev->bd_device;
> > +     config.name = dev_name(&bdev->bd_device);
> > +     config.owner = THIS_MODULE;
> > +     config.priv = bnv;
> > +     config.reg_read = blk_nvmem_reg_read;
> > +     config.size = bdev_nr_bytes(bdev);
> > +     config.word_size = 1;
> > +     config.stride = 1;
> > +     config.read_only = true;
> > +     config.root_only = true;
> > +     config.ignore_wp = true;
> > +     config.of_node = to_of_node(dev->fwnode);
> > +
> > +     bnv->devt = bdev->bd_device.devt;
> > +     bnv->nvmem = nvmem_register(&config);
> > +     if (IS_ERR(bnv->nvmem)) {
> > +             dev_err_probe(&bdev->bd_device, PTR_ERR(bnv->nvmem),
> > +                           "Failed to register NVMEM device\n");
> > +             kfree(bnv);
> > +             return PTR_ERR(bnv->nvmem);
> > +     }
> > +
> > +     scoped_guard(mutex, &devices_mutex)
> > +             list_add_tail(&bnv->list, &nvmem_devices);
>
> I'm not sure I understand the need to store these? Whatever you need to do in
> remove() can be scheduled in a devres action here.

I think the devm_ approach would work fine in practice. The only
difference is that NVMEM unregistration would be delayed from
device_del() to device_release(), but during that window any read
attempt would simply return -ENODEV, so there is no real race or
safety concern AFAIU. I guess the explicit list was initially kept to
mirror the add_dev/remove_dev symmetry of the class interface. But
unless there is a strong technical argument against devm_, I will
switch to that simpler approach in the next version.

Daniel, feel free to nack or ask for authorship removal if needed.
This patch submitted in your name has accumulated enough changes since
the original submission that the current shape may no longer reflect
your intent.
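
As an illustration of the devres idea only, here is a small userspace
model, not kernel code. In the kernel the action would be registered
with a helper such as devm_add_action_or_reset(); here model_dev,
model_add_action, model_release and model_bnv are hypothetical names,
and the model assumes actions run in reverse order of registration when
the device is released.

```c
#include <stdlib.h>

/* One deferred cleanup action attached to a device. */
struct action {
	void (*fn)(void *data);
	void *data;
	struct action *next;
};

struct model_dev {
	struct action *actions;	/* most recently added action first */
};

/* Attach a cleanup action to the device; returns 0 on success, -1 on OOM. */
static int model_add_action(struct model_dev *dev, void (*fn)(void *),
			    void *data)
{
	struct action *a = malloc(sizeof(*a));

	if (!a)
		return -1;

	a->fn = fn;
	a->data = data;
	a->next = dev->actions;
	dev->actions = a;
	return 0;
}

/* Run and drop all actions, newest first, as on device release. */
static void model_release(struct model_dev *dev)
{
	while (dev->actions) {
		struct action *a = dev->actions;

		dev->actions = a->next;
		a->fn(a->data);
		free(a);
	}
}

/* Stand-in for the per-device private data. */
struct model_bnv {
	int registered;
};

/* Stand-in for the nvmem_unregister() + kfree() done at removal time. */
static void model_bnv_unregister(void *data)
{
	struct model_bnv *bnv = data;

	bnv->registered = 0;
}
```

With this shape the private data is reachable from the action itself,
so no global list or lookup by devt is needed at removal time.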

>
> > +
> > +     return 0;
> > +}
> > +
> > +static void blk_nvmem_unregister(struct device *dev)
> > +{
> > +     struct blk_nvmem *bnv_c, *bnv_t, *bnv = NULL;
> > +
> > +     scoped_guard(mutex, &devices_mutex) {
> > +             list_for_each_entry_safe(bnv_c, bnv_t, &nvmem_devices, list) {
> > +                     if (bnv_c->devt == dev_to_bdev(dev)->bd_device.devt) {
> > +                             bnv = bnv_c;
> > +                             list_del(&bnv->list);
> > +                             break;
> > +                     }
> > +             }
> > +
> > +             if (!bnv)
> > +                     return;
> > +     }
> > +
> > +     nvmem_unregister(bnv->nvmem);
> > +     kfree(bnv);
> > +}
> > +
> > +static struct class_interface blk_nvmem_bus_interface __refdata = {
> > +     .class = &block_class,
> > +     .add_dev = &blk_nvmem_register,
> > +     .remove_dev = &blk_nvmem_unregister,
> > +};
> > +
> > +static int __init blk_nvmem_init(void)
> > +{
> > +     int ret;
> > +
> > +     ret = class_interface_register(&blk_nvmem_bus_interface);
> > +     if (ret)
> > +             return ret;
> > +
> > +     return 0;
> > +}
> > +device_initcall(blk_nvmem_init);
> > diff --git a/drivers/nvmem/core.c b/drivers/nvmem/core.c
> > index 311cb2e5a5c02d2c6979d7c9bbb7f94abdfbdad1..ee3481229c20b3063c86d0dd66aabbf6b5e29169 100644
> > --- a/drivers/nvmem/core.c
> > +++ b/drivers/nvmem/core.c
> > @@ -2154,6 +2154,19 @@ const char *nvmem_dev_name(struct nvmem_device *nvmem)
> >  }
> >  EXPORT_SYMBOL_GPL(nvmem_dev_name);
> >
> > +/**
> > + * nvmem_dev() - Get the struct device of a given nvmem device.
> > + *
> > + * @nvmem: nvmem device.
> > + *
> > + * Return: pointer to the struct device of the nvmem device.
> > + */
> > +struct device *nvmem_dev(struct nvmem_device *nvmem)
> > +{
> > +     return &nvmem->dev;
> > +}
> > +EXPORT_SYMBOL_GPL(nvmem_dev);
>
> This should still be a separate patch.

Well yes, actually I should even remove this as it is no longer needed.

Regards,
Loic

^ permalink raw reply

* Re: [PATCH] partitions: aix: bound the pp_count scan to the ppe array
From: Philippe De Muyter @ 2026-06-08 11:46 UTC (permalink / raw)
  To: Bryam Vargas
  Cc: Jens Axboe, Kees Cook, Michael Bommarito, linux-block,
	linux-kernel
In-Reply-To: <20260607064137.302574-1-hexlabsecurity@proton.me>

Hello Bryam,

On Sun, Jun 07, 2026 at 06:41:43AM +0000, Bryam Vargas wrote:
> aix_partition() reads the physical volume descriptor into a fixed-size
> struct pvd and then scans its physical-partition-extent array:
> 
> 	int numpps = be16_to_cpu(pvd->pp_count);
> 	...
> 	for (i = 0; i < numpps; i += 1) {
> 		struct ppe *p = pvd->ppe + i;
> 		...
> 		lp_ix = be16_to_cpu(p->lp_ix);
> 
> pvd points at a single kmalloc()'d struct pvd whose ppe[] member holds a
> fixed ARRAY_SIZE(pvd->ppe) (1016) entries, but the loop runs up to the
> on-disk pp_count.  pp_count is an unvalidated __be16 read straight from
> the descriptor, so a crafted AIX image with pp_count larger than 1016
> drives the loop to read pvd->ppe[i] past the end of the allocation (up to
> 65535 entries, ~2 MB out of bounds).
> 
> The partition scan runs without mounting anything, when a block device
> with a crafted AIX/IBM partition table appears (an attacker-supplied
> image attached with losetup -P, or a device auto-scanned by udev), via
> msdos_partition() -> aix_partition().
> 
> Clamp the scan to the number of entries the ppe[] array can hold.
> 
> Fixes: 6ceea22bbbc8 ("partitions: add aix lvm partition support files")
> Cc: stable@vger.kernel.org
> Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
> ---
> Reproduced on v7.1-rc6 with KASAN (CONFIG_PARTITION_ADVANCED +
> CONFIG_AIX_PARTITION).  A crafted disk image whose AIX/IBM partition table
> sets pp_count to 0xffff, attached with `losetup -fP image.img` (in-kernel
> partition scan, no mount), is reported by KASAN:
> 
>   BUG: KASAN: slab-out-of-bounds in aix_partition+0xb6e/0xee0
>   Read of size 2 at addr ... by task losetup
>    aix_partition
>    msdos_partition
>    bdev_disk_changed
>    loop_reread_partitions
>    loop_configure
>    lo_ioctl
>    __x64_sys_ioctl
> 
> i.e. a read past the end of the kmalloc(sizeof(struct pvd)) object.  A control
> image with pp_count == 1016 (== ARRAY_SIZE(pvd->ppe)) is clean.  With this
> patch the crafted image is parsed with no out-of-bounds access.
> 
> This is the read-loop sibling of the lvd scan bounded by Michael Bommarito's
> "partitions: aix: bound the lvd scan to one sector"; that change does not
> touch the pp_count/ppe[] loop, so the two are complementary (separate hunks).
> 
>  block/partitions/aix.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/block/partitions/aix.c b/block/partitions/aix.c
> index 29b8f4cebb63..f3c4174e003e 100644
> --- a/block/partitions/aix.c
> +++ b/block/partitions/aix.c
> @@ -226,6 +226,15 @@ int aix_partition(struct parsed_partitions *state)
>  		int next_lp_ix = 1;
>  		int lp_ix;
>  
> +		/*
> +		 * pvd was read into a fixed-size struct pvd whose ppe[] array
> +		 * holds ARRAY_SIZE(pvd->ppe) entries.  pp_count is an
> +		 * unvalidated on-disk __be16, so clamp the scan to the array
> +		 * size to avoid walking past the allocation.
> +		 */
> +		if (numpps > ARRAY_SIZE(pvd->ppe))
> +			numpps = ARRAY_SIZE(pvd->ppe);
> +
>  		for (i = 0; i < numpps; i += 1) {
>  			struct ppe *p = pvd->ppe + i;
>  			unsigned int lv_ix;
> -- 
> 2.43.0
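
The clamp can be exercised in isolation with a small userspace sketch.
The helper names model_be16 and model_clamp_numpps are hypothetical,
and MODEL_PPE_MAX takes the 1016 figure from the changelog above rather
than from the real struct pvd definition.

```c
#include <stdint.h>

#define MODEL_PPE_MAX 1016	/* ARRAY_SIZE(pvd->ppe) per the changelog */

/* Decode a big-endian 16-bit value from two raw on-disk bytes. */
static unsigned int model_be16(const uint8_t *p)
{
	return ((unsigned int)p[0] << 8) | p[1];
}

/* Return how many ppe[] entries may be scanned without leaving the array. */
static unsigned int model_clamp_numpps(const uint8_t *raw_pp_count)
{
	unsigned int numpps = model_be16(raw_pp_count);

	if (numpps > MODEL_PPE_MAX)
		numpps = MODEL_PPE_MAX;

	return numpps;
}
```

A crafted pp_count of 0xffff is reduced to 1016, while values at or
below the array size pass through unchanged.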

Thank you for your patch.

Acked-by: Philippe De Muyter <phdm@macqel.be>

Best regards

Philippe

^ permalink raw reply

* Re: [PATCH v3 7/7] Bluetooth: qca: Set NVMEM BD address quirks when address is invalid
From: Loic Poulain @ 2026-06-08 11:44 UTC (permalink / raw)
  To: Konrad Dybcio
  Cc: Ulf Hansson, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Bjorn Andersson, Konrad Dybcio, Jens Axboe, Johannes Berg,
	Jeff Johnson, Bartosz Golaszewski, Marcel Holtmann,
	Luiz Augusto von Dentz, Balakrishna Godavarthi, Rocky Liao,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Srinivas Kandagatla, Andrew Lunn, Heiner Kallweit,
	Russell King, Saravana Kannan, linux-mmc, devicetree,
	linux-kernel, linux-arm-msm, linux-block, linux-wireless, ath10k,
	linux-bluetooth, netdev, daniel, Bartosz Golaszewski
In-Reply-To: <f528672e-ab4f-4844-bc7c-1f8f1c4dbd3d@oss.qualcomm.com>

On Mon, Jun 8, 2026 at 1:29 PM Konrad Dybcio
<konrad.dybcio@oss.qualcomm.com> wrote:
>
> On 6/8/26 12:50 PM, Loic Poulain wrote:
> > When the controller BD address is invalid (zero or default),
> > set the NVMEM quirks to allow retrieving the address from a
> > 'local-bd-address' NVMEM cell. The BD address is often stored
> > alongside the WiFi MAC address in big-endian format, so also
> > set the big-endian quirk.
> >
> > Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
> > Signed-off-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
> > ---
> >  arch/arm64/boot/dts/qcom/qrb2210-arduino-imola.dts | 39 ++++++++++++++++++++++
> >  drivers/bluetooth/btqca.c                          |  5 ++-
>
> Squash mistake?

Indeed, thanks Konrad.

^ permalink raw reply

* [PATCH] blk-flush: fix possible deadlock when processing nvme_timeout()
From: Ye Bin @ 2026-06-08 11:39 UTC (permalink / raw)
  To: axboe, linux-block, yebin, yebin10; +Cc: kbusch, hch, sagi, linux-nvme

From: Ye Bin <yebin10@huawei.com>

 The following issue occurs when processing nvme_timeout():
 [  206.734601][ T8184] nvme nvme0: I/O tag 512 (1200) opcode 0x0 (I/O Cmd) QID 3 timeout, aborting req_op:FLUSH(2) size:0
 [  206.736112][    C0] nvme nvme0: Abort status: 0x0
 [  208.094637][ T8184] nvme nvme0: I/O tag 512 (1200) opcode 0x0 (I/O Cmd) QID 3 timeout, reset controller

 [root@localhost ~]# cat /proc/8184/stack
 [<0>] msleep+0x37/0x50
 [<0>] blk_mq_tagset_wait_completed_request+0x6f/0xe0
 [<0>] nvme_cancel_tagset+0x79/0xa0
 [<0>] nvme_dev_disable+0x55c/0x7e0
 [<0>] nvme_timeout+0x25b/0x1530
 [<0>] blk_mq_handle_expired+0x210/0x2c0
 [<0>] bt_iter+0x2bb/0x360
 [<0>] blk_mq_queue_tag_busy_iter+0x9f8/0x1f30
 [<0>] blk_mq_timeout_work+0x5dc/0x7d0
 [<0>] process_one_work+0xa08/0x1d00
 [<0>] worker_thread+0x698/0xeb0
 [<0>] kthread+0x408/0x540
 [<0>] ret_from_fork+0xa4d/0xdd0
 [<0>] ret_from_fork_asm+0x1a/0x30

 Above issue may happen as follows:
 nvme_timeout  // first timeout of the tag 512 flush request
   iod->aborted = 1;
   abort_req = nvme_alloc_request(dev->ctrl.admin_q, &cmd,
          BLK_MQ_REQ_NOWAIT, NVME_QID_ANY);  // Abort tag 512 flush request
   blk_execute_rq_nowait(abort_req->q, NULL, abort_req, 0, abort_endio);
      // The abort request is issued asynchronously, without waiting
         ....
  **** 'abort_req' has not completed yet ****
         ....
 nvme_timeout  // second timeout of the tag 512 flush request
  if (!nvmeq->qid || (iod->flags & IOD_ABORTED))
    nvme_req(req)->flags |= NVME_REQ_CANCELLED;
    goto disable;
      ...
    **** the tag 512 flush request completes ****
         nvme_try_complete_req
          blk_mq_complete_request_remote(req);
           WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
            ...
             nvme_end_req(req);
              blk_mq_end_request(req, status);
               __blk_mq_end_request(rq, error);
                if (rq->end_io)
                 rq->end_io(rq, error);
                  flush_end_io(rq, error);
                  // The timeout handler still holds a reference,
                  // so the request stays in the MQ_RQ_COMPLETE state
                   if (!refcount_dec_and_test(&flush_rq->ref))
                    fq->rq_status = error;
                    return;
    **** the tag 512 flush request is in the MQ_RQ_COMPLETE state ****
 disable:
   nvme_dev_disable(dev, false);
     nvme_cancel_tagset(&dev->ctrl);
       blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request,
                               &dev->ctrl);
         nvme_cancel_request
           if (blk_mq_request_completed(req))
             return true;
      blk_mq_tagset_wait_completed_request(&dev->tagset);
        while (true)
          blk_mq_tagset_busy_iter(tagset,
                           blk_mq_tagset_count_completed_rqs, &count);
             blk_mq_tagset_count_completed_rqs();
             // request is in the MQ_RQ_COMPLETE state
                if (blk_mq_request_completed(rq))   // returns true
                  (*count)++;
          if (!count) // 'count' is never 0, so the loop never ends
              break;
          msleep(5);

The problem occurs because the timeout handler holds a reference to the
request, and the flush request stays in the MQ_RQ_COMPLETE state because
of the special handling of flush requests. As a result, an endless loop
occurs in nvme_dev_disable().
To fix this, when a flush request has timed out and the timeout handler
holds the only remaining reference, change the request state to
MQ_RQ_IDLE in advance. It is then safe to call
blk_mq_tagset_wait_completed_request() during timeout handling.
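
The wait loop described above can be modelled in userspace as follows.
This is only a sketch: the model_* names are hypothetical, the request
state is reduced to a three-value enum, and a max_polls bound replaces
the unbounded msleep(5) loop so that the stuck case can be observed.

```c
#include <stdbool.h>
#include <stddef.h>

enum model_state { MODEL_IDLE, MODEL_IN_FLIGHT, MODEL_COMPLETE };

struct model_rq {
	enum model_state state;
};

/* Count the requests that are still in the COMPLETE state. */
static unsigned int model_count_completed(const struct model_rq *rqs,
					   size_t n)
{
	unsigned int count = 0;

	for (size_t i = 0; i < n; i++)
		if (rqs[i].state == MODEL_COMPLETE)
			count++;

	return count;
}

/*
 * Poll until no request is COMPLETE. Returns true once drained, or
 * false after max_polls attempts (the kernel loop has no such bound).
 */
static bool model_wait_completed(const struct model_rq *rqs, size_t n,
				 unsigned int max_polls)
{
	while (max_polls--) {
		if (!model_count_completed(rqs, n))
			return true;
		/* the kernel loop sleeps here before polling again */
	}

	return false;
}
```

A request left in the COMPLETE state keeps the count non-zero on every
poll, while moving it to IDLE lets the wait terminate.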

Fixes: e1569a16180a ("nvme: do not restart the request timeout if we're resetting the controller")
Signed-off-by: Ye Bin <yebin10@huawei.com>
---
 block/blk-flush.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 403a46c86411..d12839b1fcb5 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -213,6 +213,18 @@ static enum rq_end_io_ret flush_end_io(struct request *flush_rq,
 
 	if (!req_ref_put_and_test(flush_rq)) {
 		fq->rq_status = error;
+
+		/*
+		 * The timeout processing flow holds the reference count
+		 * of flush_rq. If the last reference count is held by the
+		 * timeout processing flow, the status of flush_rq must be
+		 * changed to MQ_RQ_IDLE in advance. Otherwise, a deadlock
+		 * occurs when blk_mq_tagset_wait_completed_request() is
+		 * called in the timeout processing flow.
+		 */
+		if (req_ref_read(flush_rq) == 1 &&
+		    flush_rq->rq_flags & RQF_TIMED_OUT)
+			WRITE_ONCE(flush_rq->state, MQ_RQ_IDLE);
 		spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
 		return RQ_END_IO_NONE;
 	}
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH] block: clear zone write plugging flag before failing rejected BIOs
From: Damien Le Moal @ 2026-06-08 11:42 UTC (permalink / raw)
  To: Jackie Liu, axboe; +Cc: linux-block
In-Reply-To: <20260607031814.19188-1-liu.yun@linux.dev>

On 2026/06/07 11:18, Jackie Liu wrote:
> From: Jackie Liu <liuyun01@kylinos.cn>
> 
> Commit fe0418eb9bd6 ("block: Prevent potential deadlocks in zone write plug
> error recovery") changed blk_zone_wplug_handle_write() to fail BIOs
> directly when blk_zone_wplug_prepare_bio() rejects them, for example
> because the write is not aligned to the cached write pointer or the plug
> needs a write pointer update. However, the BIO is already marked with
> BIO_ZONE_WRITE_PLUGGING at that point even though it is not issued.
> 
> Completing such a BIO with bio_io_error() makes bio_endio() call
> blk_zone_write_plug_bio_endio(), which treats the completion as a failed
> device write and may poison the cached zone write pointer state by setting
> BLK_ZONE_WPLUG_NEED_WP_UPDATE.

Yes, true. But you did not explain clearly why that is a problem. After all, if
we hit this case, the user issued an unaligned BIO, and so forcing it to do a
report zones to get everything in sync and the correct write pointer is not a
bad thing.

If the fe0418eb9bd6 change is actually causing you problems, please describe
that problem clearly. But ideally, I do not want to special-case some error
completions over others and prefer to have a single error path that results in
the same state for the zone write plugs, regardless of a write error root cause.
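
For readers following the thread, the two behaviours under discussion
can be modelled in userspace as below. This is a sketch under stated
assumptions: the model_* types and MODEL_* flags are hypothetical
stand-ins for the BIO flag and zone write plug flag named in the
changelog, and the completion hook is reduced to the single effect the
changelog describes.

```c
#include <stdbool.h>

#define MODEL_BIO_ZONE_WRITE_PLUGGING	(1u << 0)
#define MODEL_WPLUG_NEED_WP_UPDATE	(1u << 0)

struct model_wplug {
	unsigned int flags;
};

struct model_bio {
	unsigned int flags;
	bool failed;
};

/* Completion hook: a failed, plug-flagged BIO marks the plug for a WP update. */
static void model_bio_endio(struct model_bio *bio, struct model_wplug *zwplug)
{
	if ((bio->flags & MODEL_BIO_ZONE_WRITE_PLUGGING) && bio->failed)
		zwplug->flags |= MODEL_WPLUG_NEED_WP_UPDATE;
}

/*
 * Fail a BIO that was rejected before submission. With clear_flag set,
 * the plugging flag is dropped first, as the proposed patch does.
 */
static void model_fail_rejected_bio(struct model_bio *bio,
				    struct model_wplug *zwplug,
				    bool clear_flag)
{
	if (clear_flag)
		bio->flags &= ~MODEL_BIO_ZONE_WRITE_PLUGGING;

	bio->failed = true;
	model_bio_endio(bio, zwplug);
}
```

Without the clear, the rejected BIO leaves the plug marked as needing a
write pointer update. With it, the plug state is untouched. Whether the
first outcome is actually harmful is the open question in this thread.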

> 
> Clear BIO_ZONE_WRITE_PLUGGING and drop the zone write plug reference before
> failing the rejected BIO.
> 
> Fixes: fe0418eb9bd6 ("block: Prevent potential deadlocks in zone write plug error recovery")
> Cc: stable@vger.kernel.org # 6.13+
> Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
> ---
>  block/blk-zoned.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 6a221c180889..855767d8bfc1 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -1502,7 +1502,9 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
>  		goto queue_bio;
>  
>  	if (!blk_zone_wplug_prepare_bio(zwplug, bio)) {
> +		bio_clear_flag(bio, BIO_ZONE_WRITE_PLUGGING);
>  		spin_unlock_irqrestore(&zwplug->lock, flags);
> +		disk_put_zone_wplug(zwplug);
>  		bio_io_error(bio);
>  		return true;
>  	}


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply

* Re: [PATCH v3 7/7] Bluetooth: qca: Set NVMEM BD address quirks when address is invalid
From: Konrad Dybcio @ 2026-06-08 11:29 UTC (permalink / raw)
  To: Loic Poulain, Ulf Hansson, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley, Bjorn Andersson, Konrad Dybcio, Jens Axboe,
	Johannes Berg, Jeff Johnson, Bartosz Golaszewski, Marcel Holtmann,
	Luiz Augusto von Dentz, Balakrishna Godavarthi, Rocky Liao,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Srinivas Kandagatla, Andrew Lunn, Heiner Kallweit,
	Russell King, Saravana Kannan
  Cc: linux-mmc, devicetree, linux-kernel, linux-arm-msm, linux-block,
	linux-wireless, ath10k, linux-bluetooth, netdev, daniel,
	Bartosz Golaszewski
In-Reply-To: <20260608-block-as-nvmem-v3-7-82681f50aa35@oss.qualcomm.com>

On 6/8/26 12:50 PM, Loic Poulain wrote:
> When the controller BD address is invalid (zero or default),
> set the NVMEM quirks to allow retrieving the address from a
> 'local-bd-address' NVMEM cell. The BD address is often stored
> alongside the WiFi MAC address in big-endian format, so also
> set the big-endian quirk.
> 
> Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
> Signed-off-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
> ---
>  arch/arm64/boot/dts/qcom/qrb2210-arduino-imola.dts | 39 ++++++++++++++++++++++
>  drivers/bluetooth/btqca.c                          |  5 ++-

Squash mistake?

Konrad

^ permalink raw reply

* Re: [PATCH v3 6/7] Bluetooth: hci_sync: Add NVMEM-backed BD address retrieval
From: Bartosz Golaszewski @ 2026-06-08 11:19 UTC (permalink / raw)
  To: Loic Poulain
  Cc: linux-mmc, devicetree, linux-kernel, linux-arm-msm, linux-block,
	linux-wireless, ath10k, linux-bluetooth, netdev, daniel,
	Ulf Hansson, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Bjorn Andersson, Konrad Dybcio, Jens Axboe, Johannes Berg,
	Jeff Johnson, Bartosz Golaszewski, Marcel Holtmann,
	Luiz Augusto von Dentz, Balakrishna Godavarthi, Rocky Liao,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Srinivas Kandagatla, Andrew Lunn, Heiner Kallweit,
	Russell King, Saravana Kannan
In-Reply-To: <20260608-block-as-nvmem-v3-6-82681f50aa35@oss.qualcomm.com>

On Mon, 8 Jun 2026 12:50:43 +0200, Loic Poulain
<loic.poulain@oss.qualcomm.com> said:
> Some devices store the Bluetooth BD address in non-volatile
> memory, which can be accessed through the NVMEM framework.
> Similar to Ethernet or WiFi MAC addresses, add support for
> reading the BD address from a 'local-bd-address' NVMEM cell.
>
> As with the device-tree provided BD address, add a quirk to
> indicate whether a device or platform should attempt to read
> the address from NVMEM when no valid in-chip address is present.
> Also add a quirk to indicate if the address is stored in
> big-endian byte order.
>
> Signed-off-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
> ---
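
A small userspace sketch of what a big-endian quirk implies for the
6-byte address. It assumes the address must be byte-reversed before
use, i.e. that the stack expects the opposite byte order from the one
stored in NVMEM. model_bdaddr_swap and MODEL_BDADDR_LEN are
hypothetical names, not the helpers used by the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_BDADDR_LEN 6

/* Copy src into dst with the byte order reversed. */
static void model_bdaddr_swap(uint8_t dst[MODEL_BDADDR_LEN],
			      const uint8_t src[MODEL_BDADDR_LEN])
{
	for (size_t i = 0; i < MODEL_BDADDR_LEN; i++)
		dst[i] = src[MODEL_BDADDR_LEN - 1 - i];
}
```

An address stored as 00:11:22:33:44:55 comes out as the byte sequence
55 44 33 22 11 00.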

Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>

^ permalink raw reply

* Re: [PATCH v3 2/7] dt-bindings: net: wireless: qcom,ath10k: Document NVMEM cells
From: Bartosz Golaszewski @ 2026-06-08 11:18 UTC (permalink / raw)
  To: Loic Poulain
  Cc: linux-mmc, devicetree, linux-kernel, linux-arm-msm, linux-block,
	linux-wireless, ath10k, linux-bluetooth, netdev, daniel,
	Ulf Hansson, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Bjorn Andersson, Konrad Dybcio, Jens Axboe, Johannes Berg,
	Jeff Johnson, Bartosz Golaszewski, Marcel Holtmann,
	Luiz Augusto von Dentz, Balakrishna Godavarthi, Rocky Liao,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Srinivas Kandagatla, Andrew Lunn, Heiner Kallweit,
	Russell King, Saravana Kannan
In-Reply-To: <20260608-block-as-nvmem-v3-2-82681f50aa35@oss.qualcomm.com>

On Mon, 8 Jun 2026 12:50:39 +0200, Loic Poulain
<loic.poulain@oss.qualcomm.com> said:
> Document the NVMEM cells supported by the ath10k driver, the
> mac-address, pre-calibration data, and calibration data.
>
> Signed-off-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
> ---

Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>

^ permalink raw reply
