* [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance
@ 2025-10-14 15:04 Keith Busch
2025-10-14 15:04 ` [PATCHv5 1/2] block: accumulate memory segment gaps per bio Keith Busch
` (3 more replies)
0 siblings, 4 replies; 20+ messages in thread
From: Keith Busch @ 2025-10-14 15:04 UTC (permalink / raw)
To: hch, linux-nvme, linux-block, axboe; +Cc: Keith Busch
From: Keith Busch <kbusch@kernel.org>
Previous version here:
https://lore.kernel.org/linux-block/20251007175245.3898972-1-kbusch@meta.com/
The purpose is to allow optimization decisions to happen per IO, and to
provide the flexibility to use unaligned buffers on hardware that
supports them.
The virtual boundary that NVMe uses provides specific guarantees about
data alignment, but that guarantee may not be large enough for some CPU
architectures to take advantage of, even when an application submits
aligned data buffers that could benefit.
At the same time, the virtual boundary prevents the driver from directly
using memory in ways the hardware may be capable of accessing. This
unnecessarily forces applications to double buffer their data into a
more restrictive virtually contiguous format.
This patch series provides an efficient way to track segment boundary
gaps per IO so that the optimizations can be decided per IO. This gives
the flexibility to use each device to the full extent of its
capabilities, beyond what the virtual boundary mask can express.
Note, abuse of this capability may result in worse performance compared
to the bounce buffer solutions. Sending a bunch of tiny vectors for one
IO incurs significant protocol overhead, so while this patch set allows
you to do that, I recommend that you don't. We can't enforce a minimum
size though because vectors may straddle pages with only a few words in
the first and/or last pages, which we do need to support.
Changes from v4:
* Keep the same lowest-set-bit representation in the request as the
bio; provide a helper to turn it into a mask
* Open-code the bvec gaps calculation since the helper is being removed
* Additional code comments
* Keeping the virt boundary unchanged for the loop target for now. Only
pci, tcp, and fc are not reporting such a boundary.
Keith Busch (2):
block: accumulate memory segment gaps per bio
nvme: remove virtual boundary for sgl capable devices
block/bio.c | 1 +
block/blk-map.c | 3 +++
block/blk-merge.c | 39 ++++++++++++++++++++++++++++++++++---
block/blk-mq-dma.c | 3 +--
block/blk-mq.c | 6 ++++++
drivers/nvme/host/apple.c | 1 +
drivers/nvme/host/core.c | 10 +++++-----
drivers/nvme/host/fabrics.h | 6 ++++++
drivers/nvme/host/fc.c | 1 +
drivers/nvme/host/nvme.h | 7 +++++++
drivers/nvme/host/pci.c | 28 +++++++++++++++++++++++---
drivers/nvme/host/rdma.c | 1 +
drivers/nvme/host/tcp.c | 1 +
drivers/nvme/target/loop.c | 1 +
include/linux/bio.h | 2 ++
include/linux/blk-mq.h | 16 +++++++++++++++
include/linux/blk_types.h | 12 ++++++++++++
17 files changed, 125 insertions(+), 13 deletions(-)
--
2.47.3
^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-10-14 15:04 [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance Keith Busch
@ 2025-10-14 15:04 ` Keith Busch
  2025-10-14 21:18   ` Martin K. Petersen
  ` (2 more replies)
  2025-10-14 15:04 ` [PATCHv5 2/2] nvme: remove virtual boundary for sgl capable devices Keith Busch
  ` (2 subsequent siblings)
  3 siblings, 3 replies; 20+ messages in thread
From: Keith Busch @ 2025-10-14 15:04 UTC (permalink / raw)
  To: hch, linux-nvme, linux-block, axboe; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

The blk-mq dma iterator has an optimization for requests that align to
the device's iommu merge boundary. This boundary may be larger than the
device's virtual boundary, but the code had been depending on that queue
limit to know ahead of time if the request is guaranteed to align to
that optimization.

Rather than rely on that queue limit, which many devices may not report,
save the lowest set bit of any boundary gap between each segment in the
bio while checking the segments. The request stores the value for
merging and quickly checking per io if the request can use iova
optimizations.
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/bio.c               |  1 +
 block/blk-map.c           |  3 +++
 block/blk-merge.c         | 39 ++++++++++++++++++++++++++++++++++++---
 block/blk-mq-dma.c        |  3 +--
 block/blk-mq.c            |  6 ++++++
 include/linux/bio.h       |  2 ++
 include/linux/blk-mq.h    | 16 ++++++++++++++++
 include/linux/blk_types.h | 12 ++++++++++++
 8 files changed, 77 insertions(+), 5 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index b3a79285c278d..7b13bdf72de09 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -253,6 +253,7 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 	bio->bi_write_hint = 0;
 	bio->bi_write_stream = 0;
 	bio->bi_status = 0;
+	bio->bi_bvec_gap_bit = 0;
 	bio->bi_iter.bi_sector = 0;
 	bio->bi_iter.bi_size = 0;
 	bio->bi_iter.bi_idx = 0;
diff --git a/block/blk-map.c b/block/blk-map.c
index 60faf036fb6e4..17a1dc2886786 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -459,6 +459,8 @@ int blk_rq_append_bio(struct request *rq, struct bio *bio)
 	if (rq->bio) {
 		if (!ll_back_merge_fn(rq, bio, nr_segs))
 			return -EINVAL;
+		rq->phys_gap_bit = bio_seg_gap(rq->q, rq->biotail, bio,
+					       rq->phys_gap_bit);
 		rq->biotail->bi_next = bio;
 		rq->biotail = bio;
 		rq->__data_len += bio->bi_iter.bi_size;
@@ -469,6 +471,7 @@ int blk_rq_append_bio(struct request *rq, struct bio *bio)
 	rq->nr_phys_segments = nr_segs;
 	rq->bio = rq->biotail = bio;
 	rq->__data_len = bio->bi_iter.bi_size;
+	rq->phys_gap_bit = bio->bi_bvec_gap_bit;
 	return 0;
 }
 EXPORT_SYMBOL(blk_rq_append_bio);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 002c594798242..2a1da435462de 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -302,6 +302,12 @@ static unsigned int bio_split_alignment(struct bio *bio,
 	return lim->logical_block_size;
 }
 
+static inline unsigned int bvec_seg_gap(struct bio_vec *bvprv,
+					struct bio_vec *bv)
+{
+	return bv->bv_offset | (bvprv->bv_offset + bvprv->bv_len);
+}
+
 /**
  * bio_split_io_at - check if and where to split a bio
  * @bio:	[in] bio to be split
@@ -319,8 +325,8 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
 		unsigned *segs, unsigned max_bytes, unsigned len_align_mask)
 {
 	struct bio_vec bv, bvprv, *bvprvp = NULL;
+	unsigned nsegs = 0, bytes = 0, gaps = 0;
 	struct bvec_iter iter;
-	unsigned nsegs = 0, bytes = 0;
 
 	bio_for_each_bvec(bv, bio, iter) {
 		if (bv.bv_offset & lim->dma_alignment ||
@@ -331,8 +337,11 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
 		 * If the queue doesn't support SG gaps and adding this
 		 * offset would create a gap, disallow it.
 		 */
-		if (bvprvp && bvec_gap_to_prev(lim, bvprvp, bv.bv_offset))
-			goto split;
+		if (bvprvp) {
+			if (bvec_gap_to_prev(lim, bvprvp, bv.bv_offset))
+				goto split;
+			gaps |= bvec_seg_gap(bvprvp, &bv);
+		}
 
 		if (nsegs < lim->max_segments &&
 		    bytes + bv.bv_len <= max_bytes &&
@@ -350,6 +359,7 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
 	}
 
 	*segs = nsegs;
+	bio->bi_bvec_gap_bit = ffs(gaps);
 	return 0;
 split:
 	if (bio->bi_opf & REQ_ATOMIC)
@@ -385,6 +395,7 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
 	 * big IO can be trival, disable iopoll when split needed.
 	 */
 	bio_clear_polled(bio);
+	bio->bi_bvec_gap_bit = ffs(gaps);
 	return bytes >> SECTOR_SHIFT;
 }
 EXPORT_SYMBOL_GPL(bio_split_io_at);
@@ -721,6 +732,21 @@ static bool blk_atomic_write_mergeable_rqs(struct request *rq,
 	return (rq->cmd_flags & REQ_ATOMIC) == (next->cmd_flags & REQ_ATOMIC);
 }
 
+u8 bio_seg_gap(struct request_queue *q, struct bio *prev, struct bio *next,
+	       u8 gaps_bit)
+{
+	struct bio_vec pb, nb;
+
+	gaps_bit = min_not_zero(gaps_bit, prev->bi_bvec_gap_bit);
+	gaps_bit = min_not_zero(gaps_bit, next->bi_bvec_gap_bit);
+
+	bio_get_last_bvec(prev, &pb);
+	bio_get_first_bvec(next, &nb);
+	if (!biovec_phys_mergeable(q, &pb, &nb))
+		gaps_bit = min_not_zero(gaps_bit, ffs(bvec_seg_gap(&pb, &nb)));
+	return gaps_bit;
+}
+
 /*
  * For non-mq, this has to be called with the request spinlock acquired.
  * For mq with scheduling, the appropriate queue wide lock should be held.
@@ -785,6 +811,9 @@ static struct request *attempt_merge(struct request_queue *q,
 	if (next->start_time_ns < req->start_time_ns)
 		req->start_time_ns = next->start_time_ns;
 
+	req->phys_gap_bit = bio_seg_gap(req->q, req->biotail, next->bio,
+					min_not_zero(next->phys_gap_bit,
+						     req->phys_gap_bit));
 	req->biotail->bi_next = next->bio;
 	req->biotail = next->biotail;
 
@@ -908,6 +937,8 @@ enum bio_merge_status bio_attempt_back_merge(struct request *req,
 	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
 		blk_zone_write_plug_bio_merged(bio);
 
+	req->phys_gap_bit = bio_seg_gap(req->q, req->biotail, bio,
+					req->phys_gap_bit);
 	req->biotail->bi_next = bio;
 	req->biotail = bio;
 	req->__data_len += bio->bi_iter.bi_size;
@@ -942,6 +973,8 @@ static enum bio_merge_status bio_attempt_front_merge(struct request *req,
 
 	blk_update_mixed_merge(req, bio, true);
 
+	req->phys_gap_bit = bio_seg_gap(req->q, bio, req->bio,
+					req->phys_gap_bit);
 	bio->bi_next = req->bio;
 	req->bio = bio;
diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
index 449950029872a..94d3461b5bc8e 100644
--- a/block/blk-mq-dma.c
+++ b/block/blk-mq-dma.c
@@ -79,8 +79,7 @@ static bool blk_map_iter_next(struct request *req, struct blk_map_iter *iter,
 static inline bool blk_can_dma_map_iova(struct request *req,
 		struct device *dma_dev)
 {
-	return !((queue_virt_boundary(req->q) + 1) &
-		 dma_get_merge_boundary(dma_dev));
+	return !(req_phys_gap_mask(req) & dma_get_merge_boundary(dma_dev));
 }
 
 static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 09f5794141615..c12e5ee315fce 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -376,6 +376,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	INIT_LIST_HEAD(&rq->queuelist);
 	rq->q = q;
 	rq->__sector = (sector_t) -1;
+	rq->phys_gap_bit = 0;
 	INIT_HLIST_NODE(&rq->hash);
 	RB_CLEAR_NODE(&rq->rb_node);
 	rq->tag = BLK_MQ_NO_TAG;
@@ -668,6 +669,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
 		goto out_queue_exit;
 	}
 	rq->__data_len = 0;
+	rq->phys_gap_bit = 0;
 	rq->__sector = (sector_t) -1;
 	rq->bio = rq->biotail = NULL;
 	return rq;
@@ -748,6 +750,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	rq = blk_mq_rq_ctx_init(&data, blk_mq_tags_from_data(&data), tag);
 	blk_mq_rq_time_init(rq, alloc_time_ns);
 	rq->__data_len = 0;
+	rq->phys_gap_bit = 0;
 	rq->__sector = (sector_t) -1;
 	rq->bio = rq->biotail = NULL;
 	return rq;
@@ -2674,6 +2677,8 @@ static void blk_mq_bio_to_request(struct request *rq, struct bio *bio,
 	rq->bio = rq->biotail = bio;
 	rq->__sector = bio->bi_iter.bi_sector;
 	rq->__data_len = bio->bi_iter.bi_size;
+	rq->phys_gap_bit = bio->bi_bvec_gap_bit;
+
 	rq->nr_phys_segments = nr_segs;
 	if (bio_integrity(bio))
 		rq->nr_integrity_segments = blk_rq_count_integrity_sg(rq->q,
@@ -3380,6 +3385,7 @@ int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
 	}
 	rq->nr_phys_segments = rq_src->nr_phys_segments;
 	rq->nr_integrity_segments = rq_src->nr_integrity_segments;
+	rq->phys_gap_bit = rq_src->phys_gap_bit;
 
 	if (rq->bio && blk_crypto_rq_bio_prep(rq, rq->bio, gfp_mask) < 0)
 		goto free_and_out;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 16c1c85613b76..ad2d57908c1c0 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -324,6 +324,8 @@ extern struct bio *bio_split(struct bio *bio, int sectors, gfp_t gfp,
 			     struct bio_set *bs);
 int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
 		    unsigned *segs, unsigned max_bytes, unsigned len_align);
+u8 bio_seg_gap(struct request_queue *q, struct bio *prev, struct bio *next,
+	       u8 gaps_bit);
 
 /**
  * bio_next_split - get next @sectors from a bio, splitting if necessary
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b25d12545f46d..ddda00f07fae4 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -152,6 +152,14 @@ struct request {
 	unsigned short nr_phys_segments;
 	unsigned short nr_integrity_segments;
 
+	/*
+	 * The lowest set bit for address gaps between physical segments. This
+	 * provides information necessary for dma optimization opprotunities,
+	 * like for testing if the segments can be coalesced against the
+	 * device's iommu granule.
+	 */
+	unsigned char phys_gap_bit;
+
 #ifdef CONFIG_BLK_INLINE_ENCRYPTION
 	struct bio_crypt_ctx *crypt_ctx;
 	struct blk_crypto_keyslot *crypt_keyslot;
@@ -208,6 +216,14 @@ struct request {
 	void *end_io_data;
 };
 
+/*
+ * Returns a mask with all bits starting at req->phys_gap_bit set to 1.
+ */
+static inline unsigned long req_phys_gap_mask(const struct request *req)
+{
+	return ~(((1 << req->phys_gap_bit) >> 1) - 1);
+}
+
 static inline enum req_op req_op(const struct request *req)
 {
 	return req->cmd_flags & REQ_OP_MASK;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 8e8d1cc8b06c4..53501ebb0623e 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -218,6 +218,18 @@ struct bio {
 	enum rw_hint bi_write_hint;
 	u8 bi_write_stream;
 	blk_status_t bi_status;
+
+	/*
+	 * The bvec gap bit indicates the lowest set bit in any address offset
+	 * between all bi_io_vecs. This field is initialized only after the bio
+	 * is split to the hardware limits (see bio_split_io_at()). The value
+	 * may be used to consider DMA optimization when performing that
+	 * mapping. The value is compared to a power of two mask where the
+	 * result depends on any bit set within the mask, so saving the lowest
+	 * bit is sufficient to know if any segment gap collides with the mask.
+	 */
+	u8 bi_bvec_gap_bit;
+
 	atomic_t __bi_remaining;
 
 	struct bvec_iter bi_iter;
-- 
2.47.3
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-10-14 15:04 ` [PATCHv5 1/2] block: accumulate memory segment gaps per bio Keith Busch
@ 2025-10-14 21:18   ` Martin K. Petersen
  2025-10-15  4:08   ` Christoph Hellwig
  2025-11-11  4:26   ` Matthew Wilcox
  2 siblings, 0 replies; 20+ messages in thread
From: Martin K. Petersen @ 2025-10-14 21:18 UTC (permalink / raw)
  To: Keith Busch; +Cc: hch, linux-nvme, linux-block, axboe, Keith Busch

Keith,

> The blk-mq dma iterator has an optimization for requests that align to
> the device's iommu merge boundary. This boundary may be larger than
> the device's virtual boundary, but the code had been depending on that
> queue limit to know ahead of time if the request is guaranteed to
> align to that optimization.
>
> Rather than rely on that queue limit, which many devices may not
> report, save the lowest set bit of any boundary gap between each
> segment in the bio while checking the segments. The request stores the
> value for merging and quickly checking per io if the request can use
> iova optimizations.

Looks OK to me.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-10-14 15:04 ` [PATCHv5 1/2] block: accumulate memory segment gaps per bio Keith Busch
  2025-10-14 21:18 ` Martin K. Petersen
@ 2025-10-15  4:08   ` Christoph Hellwig
  2025-11-11  4:26   ` Matthew Wilcox
  2 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2025-10-15 4:08 UTC (permalink / raw)
  To: Keith Busch; +Cc: hch, linux-nvme, linux-block, axboe, Keith Busch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-10-14 15:04 ` [PATCHv5 1/2] block: accumulate memory segment gaps per bio Keith Busch
  2025-10-14 21:18 ` Martin K. Petersen
  2025-10-15  4:08 ` Christoph Hellwig
@ 2025-11-11  4:26   ` Matthew Wilcox
  2025-11-11  4:50   ` Keith Busch
  2 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2025-11-11 4:26 UTC (permalink / raw)
  To: Keith Busch; +Cc: hch, linux-nvme, linux-block, axboe, Keith Busch

On Tue, Oct 14, 2025 at 08:04:55AM -0700, Keith Busch wrote:
> The blk-mq dma iterator has an optimization for requests that align to
> the device's iommu merge boundary. This boundary may be larger than the
> device's virtual boundary, but the code had been depending on that queue
> limit to know ahead of time if the request is guaranteed to align to
> that optimization.
>
> Rather than rely on that queue limit, which many devices may not report,
> save the lowest set bit of any boundary gap between each segment in the
> bio while checking the segments. The request stores the value for
> merging and quickly checking per io if the request can use iova
> optimizations.

Hi Keith,

I just hit this bug:

generic/455
run fstests generic/455 at 2025-11-11 04:11:25
XFS (vdb): Mounting V5 Filesystem 54edd3b5-5306-493b-9ecd-f06cd9a8d669
XFS (vdb): Ending clean mount
XFS (dm-4): Mounting V5 Filesystem 3eb16918-3537-4d69-999a-ba226510f6c2
XFS (dm-4): Ending clean mount
XFS (dm-4): Unmounting Filesystem 3eb16918-3537-4d69-999a-ba226510f6c2
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 1614197 Comm: kworker/u64:2 Not tainted 6.18.0-rc4-next-20251110-ktest-00017-g2307dc640a8d #131 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Workqueue: dm-thin do_worker
RIP: 0010:bio_get_last_bvec+0x20/0xe0
Code: 90 90 90 90 90 90 90 90 90 90 55 49 89 f2 48 89 f9 48 89 e5 53 8b 77 2c 8b 47 30 44 8b 4f 28 49 89 f0 49 c1 e0 04 4c 03 47 50 <41> 8b 78 08 41 8b 58 0c 4d 8b 18 29 c7 44 39 cf 4d 89 1a 41 0f 47
RSP: 0018:ffff88814d1ef8d8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8881044b5500 RCX: ffff88811fd23d78
RDX: ffff88811fd235f8 RSI: 0000000000000000 RDI: ffff88811fd23d78
RBP: ffff88814d1ef8e0 R08: 0000000000000000 R09: 0000000000020000
R10: ffff88814d1ef8f0 R11: 0000000000000200 R12: 0000000000000000
R13: ffff88811fd235f8 R14: 0000000000000000 R15: ffff8881044b5500
FS:  0000000000000000(0000) GS:ffff8881f6ac9000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000000263a000 CR4: 0000000000750eb0
PKRU: 55555554
Call Trace:
 <TASK>
 bio_seg_gap+0x4c/0x150
 bio_attempt_front_merge+0x19a/0x3a0
 blk_attempt_bio_merge.part.0+0xb4/0x110
 blk_attempt_plug_merge+0xd6/0xe0
 blk_mq_submit_bio+0x76c/0x9f0
 ? lock_release+0xbb/0x260
 __submit_bio+0xa5/0x380
 submit_bio_noacct_nocheck+0x126/0x380
 ? submit_bio_noacct_nocheck+0x126/0x380
 submit_bio_noacct+0x17f/0x3c0
 ? __cond_resched+0x1e/0x60
 submit_bio+0xd6/0x100
 end_discard+0x3a/0x90
 process_prepared_discard_passdown_pt1+0xff/0x180
 process_discard_cell_passdown+0x19e/0x2a0
 process_discard_bio+0x105/0x1a0
 do_worker+0x824/0xa40
 ? process_one_work+0x1ad/0x530
 process_one_work+0x1ed/0x530
 ? move_linked_works+0x77/0xb0
 worker_thread+0x1cf/0x3d0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x100/0x220
 ? _raw_spin_unlock_irq+0x2b/0x40
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x249/0x280
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Modules linked in: [last unloaded: crc_t10dif]
CR2: 0000000000000008
---[ end trace 0000000000000000 ]---
RIP: 0010:bio_get_last_bvec+0x20/0xe0
Code: 90 90 90 90 90 90 90 90 90 90 55 49 89 f2 48 89 f9 48 89 e5 53 8b 77 2c 8b 47 30 44 8b 4f 28 49 89 f0 49 c1 e0 04 4c 03 47 50 <41> 8b 78 08 41 8b 58 0c 4d 8b 18 29 c7 44 39 cf 4d 89 1a 41 0f 47
RSP: 0018:ffff88814d1ef8d8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8881044b5500 RCX: ffff88811fd23d78
RDX: ffff88811fd235f8 RSI: 0000000000000000 RDI: ffff88811fd23d78
RBP: ffff88814d1ef8e0 R08: 0000000000000000 R09: 0000000000020000
R10: ffff88814d1ef8f0 R11: 0000000000000200 R12: 0000000000000000
R13: ffff88811fd235f8 R14: 0000000000000000 R15: ffff8881044b5500
FS:  0000000000000000(0000) GS:ffff8881f6ac9000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000000263a000 CR4: 0000000000750eb0
PKRU: 55555554
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception ]---

I'm not saying it's definitely your patch; after all, there's 17 of
my slab patches on top of next-20251110, but when I looked on lore for
'bio_get_last_bvec' this was the only patch since 2021 that mentioned it,
so I thought I'd drop you a note in case you see the bug immediately.
I'm heading to bed, and will be out tomorrow, so my opportunities to be
helpful will be limited.
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-11-11  4:26 ` Matthew Wilcox
@ 2025-11-11  4:50   ` Keith Busch
  2025-11-11  7:14   ` Christoph Hellwig
  0 siblings, 1 reply; 20+ messages in thread
From: Keith Busch @ 2025-11-11 4:50 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Keith Busch, hch, linux-nvme, linux-block, axboe

On Tue, Nov 11, 2025 at 04:26:21AM +0000, Matthew Wilcox wrote:
> On Tue, Oct 14, 2025 at 08:04:55AM -0700, Keith Busch wrote:
> > device's virtual boundary, but the code had been depending on that queue
> > limit to know ahead of time if the request is guaranteed to align to
> > that optimization.
> >
> > Rather than rely on that queue limit, which many devices may not report,
> > save the lowest set bit of any boundary gap between each segment in the
> > bio while checking the segments. The request stores the value for
> > merging and quickly checking per io if the request can use iova
> > optimizations.
>
> Hi Keith,
>
> I just hit this bug:
...
> RIP: 0010:bio_get_last_bvec+0x20/0xe0
> Code: 90 90 90 90 90 90 90 90 90 90 55 49 89 f2 48 89 f9 48 89 e5 53 8b 77 2c 8b 47 30 44 8b 4f 28 49 89 f0 49 c1 e0 04 4c 03 47 50 <41> 8b 78 08 41 8b 58 0c 4d 8b 18 29 c7 44 39 cf 4d 89 1a 41 0f 47
> RSP: 0018:ffff88814d1ef8d8 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffff8881044b5500 RCX: ffff88811fd23d78
> RDX: ffff88811fd235f8 RSI: 0000000000000000 RDI: ffff88811fd23d78
> RBP: ffff88814d1ef8e0 R08: 0000000000000000 R09: 0000000000020000
> R10: ffff88814d1ef8f0 R11: 0000000000000200 R12: 0000000000000000
> R13: ffff88811fd235f8 R14: 0000000000000000 R15: ffff8881044b5500
> FS:  0000000000000000(0000) GS:ffff8881f6ac9000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000008 CR3: 000000000263a000 CR4: 0000000000750eb0
> PKRU: 55555554
> Kernel panic - not syncing: Fatal exception
> Kernel Offset: disabled
> ---[ end Kernel panic - not syncing: Fatal exception ]---
>
> I'm not saying it's definitely your patch; after all, there's 17 of
> my slab patches on top of next-20251110, but when I looked on lore for
> 'bio_get_last_bvec' this was the only patch since 2021 that mentioned it,
> so I thought I'd drop you a note in case you see the bug immediately.
> I'm heading to bed, and will be out tomorrow, so my opportunities to be
> helpful will be limited.

Thanks for the heads up. This is in the path I'd been modifying lately,
so sounds plausible that I introduced the bug. The information here
should be enough for me to make progress: it looks like req->bio is NULL
in your trace, which I did not expect would happen. But it's late here
too, so look with fresh eyes in the morning.
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-11-11  4:50 ` Keith Busch
@ 2025-11-11  7:14   ` Christoph Hellwig
  2025-11-11  9:36   ` Yu Kuai
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2025-11-11 7:14 UTC (permalink / raw)
  To: Keith Busch
  Cc: Matthew Wilcox, Keith Busch, hch, linux-nvme, linux-block, axboe

On Mon, Nov 10, 2025 at 11:50:51PM -0500, Keith Busch wrote:
> Thanks for the heads up. This is in the path I'd been modifying lately,
> so sounds plausible that I introduced the bug. The information here
> should be enough for me to make progress: it looks like req->bio is NULL
> in your trace, which I did not expect would happen. But it's late here
> too, so look with fresh eyes in the morning.

req->bio should only be NULL for flush requests or passthrough requests
that do not transfer data. None of them should end up in this path.
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-11-11  7:14 ` Christoph Hellwig
@ 2025-11-11  9:36   ` Yu Kuai
  2025-11-11  9:39   ` Christoph Hellwig
  0 siblings, 1 reply; 20+ messages in thread
From: Yu Kuai @ 2025-11-11 9:36 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch
  Cc: Matthew Wilcox, Keith Busch, linux-nvme, linux-block, axboe, yukuai

Hi,

On 2025/11/11 15:14, Christoph Hellwig wrote:
> On Mon, Nov 10, 2025 at 11:50:51PM -0500, Keith Busch wrote:
>> Thanks for the heads up. This is in the path I'd been modifying lately,
>> so sounds plausible that I introduced the bug. The information here
>> should be enough for me to make progress: it looks like req->bio is NULL
>> in your trace, which I did not expect would happen. But it's late here
>> too, so look with fresh eyes in the morning.
> req->bio should only be NULL for flush requests or passthrough requests
> that do not transfer data. None of them should end up in this path.

This can be reproduced 100% with branch for-6.19/block now, just:

blkdiscard /dev/md0

Where the discard IO will be split to different underlying disks and
then merged. And for a discard bio, bio->bi_io_vec is NULL. So when a
discard bio ends up in the merge path, bio->bi_io_vec will be
dereferenced unconditionally.

How about the following simple fix:

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3ca6fbf8b787..31f460422fe3 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -740,6 +740,9 @@ u8 bio_seg_gap(struct request_queue *q, struct bio *prev, struct bio *next,
 	gaps_bit = min_not_zero(gaps_bit, prev->bi_bvec_gap_bit);
 	gaps_bit = min_not_zero(gaps_bit, next->bi_bvec_gap_bit);
 
+	if (op_is_discard(prev->bi_opf) || op_is_discard(next->bi_opf))
+		return gaps_bit;
+
 	bio_get_last_bvec(prev, &pb);
 	bio_get_first_bvec(next, &nb);
 	if (!biovec_phys_mergeable(q, &pb, &nb))

Thanks,
Kuai
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-11-11  9:36 ` Yu Kuai
@ 2025-11-11  9:39   ` Christoph Hellwig
  2025-11-11 10:14   ` Yu Kuai
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2025-11-11 9:39 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Christoph Hellwig, Keith Busch, Matthew Wilcox, Keith Busch,
	linux-nvme, linux-block, axboe

On Tue, Nov 11, 2025 at 05:36:39PM +0800, Yu Kuai wrote:
> This can be reproduced 100% with branch for-6.19/block now, just:
>
> blkdiscard /dev/md0
>
> Where the discard IO will be split to different underlying disks and
> then merged. And for a discard bio, bio->bi_io_vec is NULL. So when a
> discard bio ends up in the merge path, bio->bi_io_vec will be
> dereferenced unconditionally.

Ah, so it's not a NULL req->bio but bio->bi_io_vec.

> How about the following simple fix:
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 3ca6fbf8b787..31f460422fe3 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -740,6 +740,9 @@ u8 bio_seg_gap(struct request_queue *q, struct bio *prev, struct bio *next,
>  	gaps_bit = min_not_zero(gaps_bit, prev->bi_bvec_gap_bit);
>  	gaps_bit = min_not_zero(gaps_bit, next->bi_bvec_gap_bit);
>
> +	if (op_is_discard(prev->bi_opf) || op_is_discard(next->bi_opf))
> +		return gaps_bit;
> +
>  	bio_get_last_bvec(prev, &pb);
>  	bio_get_first_bvec(next, &nb);
>  	if (!biovec_phys_mergeable(q, &pb, &nb))

I think the problem is how we even end up here?  The only merging
for discard should be the special multi-segment merge.  So I think
something higher up is messed up
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-11-11  9:39 ` Christoph Hellwig
@ 2025-11-11 10:14   ` Yu Kuai
  2025-11-11 13:25   ` Keith Busch
  0 siblings, 1 reply; 20+ messages in thread
From: Yu Kuai @ 2025-11-11 10:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Matthew Wilcox, Keith Busch, linux-nvme,
	linux-block, axboe, yukuai

Hi,

On 2025/11/11 17:39, Christoph Hellwig wrote:
> On Tue, Nov 11, 2025 at 05:36:39PM +0800, Yu Kuai wrote:
>> This can be reproduced 100% with branch for-6.19/block now, just:
>>
>> blkdiscard /dev/md0
>>
>> Where the discard IO will be split to different underlying disks and
>> then merged. And for a discard bio, bio->bi_io_vec is NULL. So when a
>> discard bio ends up in the merge path, bio->bi_io_vec will be
>> dereferenced unconditionally.
> I think the problem is how we even end up here?  The only merging
> for discard should be the special multi-segment merge.  So I think
> something higher up is messed up

At least from blk_try_merge(), blk_discard_mergable() does return false;
however, the following checks pass and we end up in the back merge path.

blk_try_merge
  if (blk_discard_mergable())
    // false due to max_discard_segments is 1
  else if (...)
    return ELEVATOR_BACK_MERGE

Perhaps you are suggesting to change this to:

 enum elv_merge blk_try_merge(struct request *rq, struct bio *bio)
 {
-	if (blk_discard_mergable(rq))
-		return ELEVATOR_DISCARD_MERGE;
-	else if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
+	if (req_op(rq) == REQ_OP_DISCARD) {
+		if (blk_discard_mergable((rq)))
+			return ELEVATOR_DISCARD_MERGE;
+		return ELEVATOR_NO_MERGE;
+	}
+
+	if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
 		return ELEVATOR_BACK_MERGE;
-	else if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
+	if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
 		return ELEVATOR_FRONT_MERGE;
 	return ELEVATOR_NO_MERGE;
 }

And the same for the other callers of blk_discard_mergable().

Thanks,
Kuai
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-11-11 10:14 ` Yu Kuai
@ 2025-11-11 13:25   ` Keith Busch
  2025-11-11 13:40   ` Christoph Hellwig
  0 siblings, 1 reply; 20+ messages in thread
From: Keith Busch @ 2025-11-11 13:25 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Christoph Hellwig, Matthew Wilcox, Keith Busch, linux-nvme,
	linux-block, axboe

On Tue, Nov 11, 2025 at 06:14:24PM +0800, Yu Kuai wrote:
> At least from blk_try_merge(), blk_discard_mergable() does return false;
> however, the following checks pass and we end up in the back merge path.
>
> blk_try_merge
>   if (blk_discard_mergable())
>     // false due to max_discard_segments is 1
>   else if (...)
>     return ELEVATOR_BACK_MERGE
>
> Perhaps you are suggesting to change this to:
>
>  enum elv_merge blk_try_merge(struct request *rq, struct bio *bio)
>  {
> -	if (blk_discard_mergable(rq))
> -		return ELEVATOR_DISCARD_MERGE;
> -	else if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
> +	if (req_op(rq) == REQ_OP_DISCARD) {
> +		if (blk_discard_mergable((rq)))
> +			return ELEVATOR_DISCARD_MERGE;
> +		return ELEVATOR_NO_MERGE;
> +	}
> +
> +	if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
>  		return ELEVATOR_BACK_MERGE;
> -	else if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
> +	if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
>  		return ELEVATOR_FRONT_MERGE;
>  	return ELEVATOR_NO_MERGE;
>  }
>
> And the same for the other callers of blk_discard_mergable().

Ah, so we're merging a discard for a device that doesn't support
vectored discard. I think we still want to be able to front/back merge
such requests, though.
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-11-11 13:25                 ` Keith Busch
@ 2025-11-11 13:40                   ` Christoph Hellwig
  2025-11-11 13:54                     ` Keith Busch
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2025-11-11 13:40 UTC (permalink / raw)
  To: Keith Busch
  Cc: Yu Kuai, Christoph Hellwig, Matthew Wilcox, Keith Busch,
	linux-nvme, linux-block, axboe

On Tue, Nov 11, 2025 at 08:25:38AM -0500, Keith Busch wrote:
> Ah, so we're merging a discard for a device that doesn't support
> vectored discard. I think we still want to be able to front/back merge
> such requests, though.

Yes, but purely based on bi_sector/bi_size, not based on the payload.
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-11-11 13:40                   ` Christoph Hellwig
@ 2025-11-11 13:54                     ` Keith Busch
  2025-11-11 13:58                       ` Christoph Hellwig
  2025-11-11 14:03                       ` Yu Kuai
  1 sibling, 2 replies; 20+ messages in thread
From: Keith Busch @ 2025-11-11 13:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Yu Kuai, Matthew Wilcox, Keith Busch, linux-nvme, linux-block, axboe

On Tue, Nov 11, 2025 at 02:40:01PM +0100, Christoph Hellwig wrote:
> On Tue, Nov 11, 2025 at 08:25:38AM -0500, Keith Busch wrote:
> > Ah, so we're merging a discard for a device that doesn't support
> > vectored discard. I think we still want to be able to front/back merge
> > such requests, though.
>
> Yes, but purely based on bi_sector/bi_size, not based on the payload.

This should do it:

---
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3ca6fbf8b7870..d3115d7469df0 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -737,6 +737,9 @@ u8 bio_seg_gap(struct request_queue *q, struct bio *prev, struct bio *next,
 {
 	struct bio_vec pb, nb;

+	if (!bio_has_data(prev))
+		return 0;
+
 	gaps_bit = min_not_zero(gaps_bit, prev->bi_bvec_gap_bit);
 	gaps_bit = min_not_zero(gaps_bit, next->bi_bvec_gap_bit);
--
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-11-11 13:54                     ` Keith Busch
@ 2025-11-11 13:58                       ` Christoph Hellwig
  1 sibling, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2025-11-11 13:58 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Yu Kuai, Matthew Wilcox, Keith Busch,
	linux-nvme, linux-block, axboe

On Tue, Nov 11, 2025 at 08:54:53AM -0500, Keith Busch wrote:
> On Tue, Nov 11, 2025 at 02:40:01PM +0100, Christoph Hellwig wrote:
> > On Tue, Nov 11, 2025 at 08:25:38AM -0500, Keith Busch wrote:
> > > Ah, so we're merging a discard for a device that doesn't support
> > > vectored discard. I think we still want to be able to front/back merge
> > > such requests, though.
> >
> > Yes, but purely based on bi_sector/bi_size, not based on the payload.
>
> This should do it:

Yes, that looks sensible.
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-11-11 13:54                     ` Keith Busch
  2025-11-11 13:58                       ` Christoph Hellwig
@ 2025-11-11 14:03                       ` Yu Kuai
  1 sibling, 0 replies; 20+ messages in thread
From: Yu Kuai @ 2025-11-11 14:03 UTC (permalink / raw)
  To: Keith Busch, Christoph Hellwig
  Cc: Matthew Wilcox, Keith Busch, linux-nvme, linux-block, axboe, yukuai

Hi,

On 2025/11/11 21:54, Keith Busch wrote:
> On Tue, Nov 11, 2025 at 02:40:01PM +0100, Christoph Hellwig wrote:
>> On Tue, Nov 11, 2025 at 08:25:38AM -0500, Keith Busch wrote:
>>> Ah, so we're merging a discard for a device that doesn't support
>>> vectored discard. I think we still want to be able to front/back merge
>>> such requests, though.
>> Yes, but purely based on bi_sector/bi_size, not based on the payload.
> This should do it:
>
> ---
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 3ca6fbf8b7870..d3115d7469df0 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -737,6 +737,9 @@ u8 bio_seg_gap(struct request_queue *q, struct bio *prev, struct bio *next,
>  {
>  	struct bio_vec pb, nb;
>
> +	if (!bio_has_data(prev))
> +		return 0;
> +
>  	gaps_bit = min_not_zero(gaps_bit, prev->bi_bvec_gap_bit);
>  	gaps_bit = min_not_zero(gaps_bit, next->bi_bvec_gap_bit);
>
> --

This looks good.

Thanks,
Kuai
* [PATCHv5 2/2] nvme: remove virtual boundary for sgl capable devices
  2025-10-14 15:04 [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance Keith Busch
  2025-10-14 15:04 ` [PATCHv5 1/2] block: accumulate memory segment gaps per bio Keith Busch
@ 2025-10-14 15:04 ` Keith Busch
  2025-10-14 21:19   ` Martin K. Petersen
  2025-11-07  1:03 ` [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance Keith Busch
  2025-11-07  1:12 ` Jens Axboe
  3 siblings, 1 reply; 20+ messages in thread
From: Keith Busch @ 2025-10-14 15:04 UTC (permalink / raw)
  To: hch, linux-nvme, linux-block, axboe; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

The nvme virtual boundary is only required for the PRP format. Devices
that can use SGL for DMA don't need it for IO queues. Drop reporting it
for such devices; rdma fabrics controllers will continue to use the
limit as they currently don't report any boundary requirements, but tcp
and fc never needed it in the first place so they get to report no
virtual boundary.

Applications may continue to align to the same virtual boundaries for
optimization purposes if they want, and the driver will continue to
decide whether to use the PRP format the same as before if the IO
allows it.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/apple.c   |  1 +
 drivers/nvme/host/core.c    | 10 +++++-----
 drivers/nvme/host/fabrics.h |  6 ++++++
 drivers/nvme/host/fc.c      |  1 +
 drivers/nvme/host/nvme.h    |  7 +++++++
 drivers/nvme/host/pci.c     | 28 +++++++++++++++++++++++++---
 drivers/nvme/host/rdma.c    |  1 +
 drivers/nvme/host/tcp.c     |  1 +
 drivers/nvme/target/loop.c  |  1 +
 9 files changed, 48 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
index f35d3f71d14f3..15b3d07f8ccdd 100644
--- a/drivers/nvme/host/apple.c
+++ b/drivers/nvme/host/apple.c
@@ -1283,6 +1283,7 @@ static const struct nvme_ctrl_ops nvme_ctrl_ops = {
 	.reg_read64 = apple_nvme_reg_read64,
 	.free_ctrl = apple_nvme_free_ctrl,
 	.get_address = apple_nvme_get_address,
+	.get_virt_boundary = nvme_get_virt_boundary,
 };

 static void apple_nvme_async_probe(void *data, async_cookie_t cookie)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fa4181d7de736..63e15cce3699c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2069,13 +2069,13 @@ static u32 nvme_max_drv_segments(struct nvme_ctrl *ctrl)
 }

 static void nvme_set_ctrl_limits(struct nvme_ctrl *ctrl,
-		struct queue_limits *lim)
+		struct queue_limits *lim, bool is_admin)
 {
 	lim->max_hw_sectors = ctrl->max_hw_sectors;
 	lim->max_segments = min_t(u32, USHRT_MAX,
 		min_not_zero(nvme_max_drv_segments(ctrl), ctrl->max_segments));
 	lim->max_integrity_segments = ctrl->max_integrity_segments;
-	lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1;
+	lim->virt_boundary_mask = ctrl->ops->get_virt_boundary(ctrl, is_admin);
 	lim->max_segment_size = UINT_MAX;
 	lim->dma_alignment = 3;
 }
@@ -2177,7 +2177,7 @@ static int nvme_update_ns_info_generic(struct nvme_ns *ns,
 	int ret;

 	lim = queue_limits_start_update(ns->disk->queue);
-	nvme_set_ctrl_limits(ns->ctrl, &lim);
+	nvme_set_ctrl_limits(ns->ctrl, &lim, false);

 	memflags = blk_mq_freeze_queue(ns->disk->queue);
 	ret = queue_limits_commit_update(ns->disk->queue, &lim);
@@ -2381,7 +2381,7 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
 	ns->head->lba_shift = id->lbaf[lbaf].ds;
 	ns->head->nuse = le64_to_cpu(id->nuse);
 	capacity = nvme_lba_to_sect(ns->head, le64_to_cpu(id->nsze));
-	nvme_set_ctrl_limits(ns->ctrl, &lim);
+	nvme_set_ctrl_limits(ns->ctrl, &lim, false);
 	nvme_configure_metadata(ns->ctrl, ns->head, id, nvm, info);
 	nvme_set_chunk_sectors(ns, id, &lim);
 	if (!nvme_update_disk_info(ns, id, &lim))
@@ -3589,7 +3589,7 @@ static int nvme_init_identify(struct nvme_ctrl *ctrl)
 		min_not_zero(ctrl->max_hw_sectors, max_hw_sectors);

 	lim = queue_limits_start_update(ctrl->admin_q);
-	nvme_set_ctrl_limits(ctrl, &lim);
+	nvme_set_ctrl_limits(ctrl, &lim, true);
 	ret = queue_limits_commit_update(ctrl->admin_q, &lim);
 	if (ret)
 		goto out_free;
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index 1b58ee7d0dcee..caf5503d08332 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -217,6 +217,12 @@ static inline unsigned int nvmf_nr_io_queues(struct nvmf_ctrl_options *opts)
 		min(opts->nr_poll_queues, num_online_cpus());
 }

+static inline unsigned long nvmf_get_virt_boundary(struct nvme_ctrl *ctrl,
+		bool is_admin)
+{
+	return 0;
+}
+
 int nvmf_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val);
 int nvmf_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val);
 int nvmf_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val);
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 03987f497a5b5..70c066c2e2d42 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -3360,6 +3360,7 @@ static const struct nvme_ctrl_ops nvme_fc_ctrl_ops = {
 	.submit_async_event = nvme_fc_submit_async_event,
 	.delete_ctrl = nvme_fc_delete_ctrl,
 	.get_address = nvmf_get_address,
+	.get_virt_boundary = nvmf_get_virt_boundary,
 };

 static void
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 102fae6a231c5..7f7cb823d60d8 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -558,6 +558,12 @@ static inline bool nvme_ns_has_pi(struct nvme_ns_head *head)
 	return head->pi_type && head->ms == head->pi_size;
 }

+static inline unsigned long nvme_get_virt_boundary(struct nvme_ctrl *ctrl,
+		bool is_admin)
+{
+	return NVME_CTRL_PAGE_SIZE - 1;
+}
+
 struct nvme_ctrl_ops {
 	const char *name;
 	struct module *module;
@@ -578,6 +584,7 @@ struct nvme_ctrl_ops {
 	int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
 	void (*print_device_info)(struct nvme_ctrl *ctrl);
 	bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
+	unsigned long (*get_virt_boundary)(struct nvme_ctrl *ctrl, bool is_admin);
 };

 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c916176bd9f05..3c1727df1e36f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -613,9 +613,22 @@ static inline enum nvme_use_sgl nvme_pci_use_sgls(struct nvme_dev *dev,
 	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;

 	if (nvmeq->qid && nvme_ctrl_sgl_supported(&dev->ctrl)) {
-		if (nvme_req(req)->flags & NVME_REQ_USERCMD)
-			return SGL_FORCED;
-		if (req->nr_integrity_segments > 1)
+		/*
+		 * When the controller is capable of using SGL, there are
+		 * several conditions that we force to use it:
+		 *
+		 * 1. A request containing page gaps within the controller's
+		 *    mask can not use the PRP format.
+		 *
+		 * 2. User commands use SGL because that lets the device
+		 *    validate the requested transfer lengths.
+		 *
+		 * 3. Multiple integrity segments must use SGL as that's the
+		 *    only way to describe such a command in NVMe.
+		 */
+		if (req_phys_gap_mask(req) & (NVME_CTRL_PAGE_SIZE - 1) ||
+		    nvme_req(req)->flags & NVME_REQ_USERCMD ||
+		    req->nr_integrity_segments > 1)
 			return SGL_FORCED;
 		return SGL_SUPPORTED;
 	}
@@ -3243,6 +3256,14 @@ static bool nvme_pci_supports_pci_p2pdma(struct nvme_ctrl *ctrl)
 	return dma_pci_p2pdma_supported(dev->dev);
 }

+static unsigned long nvme_pci_get_virt_boundary(struct nvme_ctrl *ctrl,
+		bool is_admin)
+{
+	if (!nvme_ctrl_sgl_supported(ctrl) || is_admin)
+		return NVME_CTRL_PAGE_SIZE - 1;
+	return 0;
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.name = "pcie",
 	.module = THIS_MODULE,
@@ -3257,6 +3278,7 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.get_address = nvme_pci_get_address,
 	.print_device_info = nvme_pci_print_device_info,
 	.supports_pci_p2pdma = nvme_pci_supports_pci_p2pdma,
+	.get_virt_boundary = nvme_pci_get_virt_boundary,
 };

 static int nvme_dev_map(struct nvme_dev *dev)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 190a4cfa8a5ee..35c0822edb2d7 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -2202,6 +2202,7 @@ static const struct nvme_ctrl_ops nvme_rdma_ctrl_ops = {
 	.delete_ctrl = nvme_rdma_delete_ctrl,
 	.get_address = nvmf_get_address,
 	.stop_ctrl = nvme_rdma_stop_ctrl,
+	.get_virt_boundary = nvme_get_virt_boundary,
 };

 /*
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 1413788ca7d52..82875351442a0 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2862,6 +2862,7 @@ static const struct nvme_ctrl_ops nvme_tcp_ctrl_ops = {
 	.delete_ctrl = nvme_tcp_delete_ctrl,
 	.get_address = nvme_tcp_get_address,
 	.stop_ctrl = nvme_tcp_stop_ctrl,
+	.get_virt_boundary = nvmf_get_virt_boundary,
 };

 static bool
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index f85a8441bcc6e..9fe88a489eb71 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -511,6 +511,7 @@ static const struct nvme_ctrl_ops nvme_loop_ctrl_ops = {
 	.submit_async_event = nvme_loop_submit_async_event,
 	.delete_ctrl = nvme_loop_delete_ctrl_host,
 	.get_address = nvmf_get_address,
+	.get_virt_boundary = nvme_get_virt_boundary,
 };

 static int nvme_loop_create_io_queues(struct nvme_loop_ctrl *ctrl)
--
2.47.3
* Re: [PATCHv5 2/2] nvme: remove virtual boundary for sgl capable devices
  2025-10-14 15:04 ` [PATCHv5 2/2] nvme: remove virtual boundary for sgl capable devices Keith Busch
@ 2025-10-14 21:19   ` Martin K. Petersen
  0 siblings, 0 replies; 20+ messages in thread
From: Martin K. Petersen @ 2025-10-14 21:19 UTC (permalink / raw)
  To: Keith Busch; +Cc: hch, linux-nvme, linux-block, axboe, Keith Busch

Keith,

> The nvme virtual boundary is only required for the PRP format. Devices
> that can use SGL for DMA don't need it for IO queues. Drop reporting
> it for such devices; rdma fabrics controllers will continue to use the
> limit as they currently don't report any boundary requirements, but
> tcp and fc never needed it in the first place so they get to report no
> virtual boundary.
>
> Applications may continue to align to the same virtual boundaries for
> optimization purposes if they want, and the driver will continue to
> decide whether to use the PRP format the same as before if the IO
> allows it.

Looks fine.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

--
Martin K. Petersen
* Re: [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance
  2025-10-14 15:04 [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance Keith Busch
  2025-10-14 15:04 ` [PATCHv5 1/2] block: accumulate memory segment gaps per bio Keith Busch
  2025-10-14 15:04 ` [PATCHv5 2/2] nvme: remove virtual boundary for sgl capable devices Keith Busch
@ 2025-11-07  1:03 ` Keith Busch
  2025-11-07  1:12   ` Jens Axboe
  3 siblings, 1 reply; 20+ messages in thread
From: Keith Busch @ 2025-11-07  1:03 UTC (permalink / raw)
  To: Keith Busch; +Cc: hch, linux-nvme, linux-block, axboe

On Tue, Oct 14, 2025 at 08:04:54AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> The purpose is to allow optimization decisions to happen per IO, and
> flexibility to utilize unaligned buffers for hardware that supports it.

Hi Jens,

Did you have a chance to consider this one? Thanks,
* Re: [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance
  2025-11-07  1:03 ` [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance Keith Busch
@ 2025-11-07  1:12   ` Jens Axboe
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2025-11-07  1:12 UTC (permalink / raw)
  To: Keith Busch, Keith Busch; +Cc: hch, linux-nvme, linux-block

On 11/6/25 6:03 PM, Keith Busch wrote:
> On Tue, Oct 14, 2025 at 08:04:54AM -0700, Keith Busch wrote:
>> From: Keith Busch <kbusch@kernel.org>
>>
>> The purpose is to allow optimization decisions to happen per IO, and
>> flexibility to utilize unaligned buffers for hardware that supports it.
>
> Hi Jens,
>
> Did you have a chance to consider this one? Thanks,

Looks like I missed this as it was during my last race^Wpto, now queued
up!

--
Jens Axboe
* Re: [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance
  2025-10-14 15:04 [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance Keith Busch
                   ` (2 preceding siblings ...)
  2025-11-07  1:03 ` [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance Keith Busch
@ 2025-11-07  1:12 ` Jens Axboe
  3 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2025-11-07  1:12 UTC (permalink / raw)
  To: hch, linux-nvme, linux-block, Keith Busch; +Cc: Keith Busch

On Tue, 14 Oct 2025 08:04:54 -0700, Keith Busch wrote:
> Previous version here:
>
> https://lore.kernel.org/linux-block/20251007175245.3898972-1-kbusch@meta.com/
>
> The purpose is to allow optimization decisions to happen per IO, and
> flexibility to utilize unaligned buffers for hardware that supports it.
>
> [...]

Applied, thanks!

[1/2] block: accumulate memory segment gaps per bio
      commit: 2f6b2565d43cdb5087cac23d530cca84aa3d897e
[2/2] nvme: remove virtual boundary for sgl capable devices
      commit: bc840b21a25a50f00e2b240329c09281506df387

Best regards,
--
Jens Axboe
end of thread, other threads:[~2025-11-11 14:03 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-14 15:04 [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance Keith Busch
2025-10-14 15:04 ` [PATCHv5 1/2] block: accumulate memory segment gaps per bio Keith Busch
2025-10-14 21:18   ` Martin K. Petersen
2025-10-15  4:08   ` Christoph Hellwig
2025-11-11  4:26   ` Matthew Wilcox
2025-11-11  4:50     ` Keith Busch
2025-11-11  7:14       ` Christoph Hellwig
2025-11-11  9:36         ` Yu Kuai
2025-11-11  9:39           ` Christoph Hellwig
2025-11-11 10:14             ` Yu Kuai
2025-11-11 13:25               ` Keith Busch
2025-11-11 13:40                 ` Christoph Hellwig
2025-11-11 13:54                   ` Keith Busch
2025-11-11 13:58                     ` Christoph Hellwig
2025-11-11 14:03                     ` Yu Kuai
2025-10-14 15:04 ` [PATCHv5 2/2] nvme: remove virtual boundary for sgl capable devices Keith Busch
2025-10-14 21:19   ` Martin K. Petersen
2025-11-07  1:03 ` [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance Keith Busch
2025-11-07  1:12   ` Jens Axboe
2025-11-07  1:12 ` Jens Axboe