* [PATCHv4 0/2] block+nvme: removing virtual boundary mask reliance
@ 2025-10-07 17:52 Keith Busch
From: Keith Busch @ 2025-10-07 17:52 UTC (permalink / raw)
To: linux-block, linux-nvme; +Cc: hch, axboe, Keith Busch
From: Keith Busch <kbusch@kernel.org>
Previous version here:
https://lore.kernel.org/linux-nvme/20250821204420.2267923-1-kbusch@meta.com/
The purpose is to allow optimization decisions to happen per IO, and to
give the flexibility to use unaligned buffers on hardware that supports
them.
The virtual boundary that NVMe uses provides specific guarantees about
data alignment, but that alignment might not be large enough for some
CPU architectures to take advantage of, even when an application submits
aligned data buffers that could otherwise benefit.
At the same time, the virtual boundary prevents the driver from directly
using memory in ways the hardware is capable of accessing. This
unnecessarily forces applications to double buffer their data into a
more restrictive, virtually contiguous format.
This patch series provides an efficient way to track segment boundary
gaps per IO so that these optimizations can be decided per IO. That
gives the flexibility to use each device to its full ability, beyond
what the virtual boundary mask can express.
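As a rough sketch of the end result (mirroring the blk_can_dma_map_iova()
change in patch 1; the helper name here is illustrative only):

	/* per-IO check: can the IOMMU coalesce this request's segments? */
	static inline bool can_coalesce_iova(struct request *req,
					     struct device *dma_dev)
	{
		/* phys_gap has a bit set for every virtual gap between
		 * the request's physical segments */
		return !(req->phys_gap & dma_get_merge_boundary(dma_dev));
	}
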
Note, abuse of this capability may result in worse performance compared
to the bounce buffer solutions. Sending a bunch of tiny vectors for one
IO incurs significant protocol overhead, so while this patch set allows
you to do that, I recommend that you don't. We can't enforce a minimum
size though because vectors may straddle pages with only a few words in
the first and/or last pages, which we do need to support.
Changes from v3:
- More comments explaining what the new fields are for
- A bit of refactoring to reuse the bvec gap code
- Also count gaps for passthrough commands, as it's possible to send
vectored IO through that interface too.
- The nvme side has all the transport ops specify a callback to get the
desired virtual boundary. PCI supports no boundary for SGL capable
devices, while TCP and FC never needed it. RDMA and Apple continue to
use the current virtual boundary mask as it's not clear if it's safe to
remove it for those.
Keith Busch (2):
block: accumulate memory segment gaps per bio
nvme: remove virtual boundary for sgl capable devices
block/bio.c | 1 +
block/blk-map.c | 3 +++
block/blk-merge.c | 39 ++++++++++++++++++++++++++++++++++---
block/blk-mq-dma.c | 3 +--
block/blk-mq.c | 10 ++++++++++
block/blk.h | 9 +++++++--
drivers/nvme/host/apple.c | 1 +
drivers/nvme/host/core.c | 10 +++++-----
drivers/nvme/host/fabrics.h | 6 ++++++
drivers/nvme/host/fc.c | 1 +
drivers/nvme/host/nvme.h | 7 +++++++
drivers/nvme/host/pci.c | 28 +++++++++++++++++++++++---
drivers/nvme/host/rdma.c | 1 +
drivers/nvme/host/tcp.c | 1 +
drivers/nvme/target/loop.c | 1 +
include/linux/bio.h | 2 ++
include/linux/blk-mq.h | 8 ++++++++
include/linux/blk_types.h | 12 ++++++++++++
18 files changed, 128 insertions(+), 15 deletions(-)
--
2.47.3
* [PATCHv4 1/2] block: accumulate memory segment gaps per bio
@ 2025-10-07 17:52 Keith Busch
From: Keith Busch @ 2025-10-07 17:52 UTC (permalink / raw)
To: linux-block, linux-nvme; +Cc: hch, axboe, Keith Busch
From: Keith Busch <kbusch@kernel.org>
The blk-mq dma iterator has an optimization for requests that align to
the device's iommu merge boundary. This boundary may be larger than the
device's virtual boundary, but the code had been depending on that queue
limit to know ahead of time whether a request is guaranteed to align to
that optimization.
Rather than rely on that queue limit, which many devices may not report,
save the lowest set bit of any boundary gap between each segment in the
bio while checking the segments. The request turns this into a mask used
for merging and for quickly checking per IO whether the request can use
the iova optimization.
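To make the encoding concrete, here is a worked sketch with made-up
offsets (not taken from the patch): if one segment ends at offset 0x1200
and the next begins at offset 0x800, then:

	unsigned int gaps = 0;
	gaps |= 0x1200;		/* bvprv->bv_offset + bvprv->bv_len */
	gaps |= 0x0800;		/* bv->bv_offset */
	/* gaps == 0x1a00, lowest set bit 0x200, so ffs(gaps) == 10 */
	bio->bi_bvec_gap_bit = ffs(gaps);
	/* request side: a single bit suffices, because any
	 * power-of-two-minus-one boundary mask that catches gaps also
	 * catches its lowest set bit */
	req->phys_gap = 1 << (bio->bi_bvec_gap_bit - 1);	/* 0x200 */
	/* with a 4K iommu granule (mask 0xfff): 0x200 & 0xfff != 0,
	 * so the iova optimization is skipped for this request */
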
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
block/bio.c | 1 +
block/blk-map.c | 3 +++
block/blk-merge.c | 39 ++++++++++++++++++++++++++++++++++++---
block/blk-mq-dma.c | 3 +--
block/blk-mq.c | 10 ++++++++++
block/blk.h | 9 +++++++--
include/linux/bio.h | 2 ++
include/linux/blk-mq.h | 8 ++++++++
include/linux/blk_types.h | 12 ++++++++++++
9 files changed, 80 insertions(+), 7 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index b3a79285c278d..7b13bdf72de09 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -253,6 +253,7 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
bio->bi_write_hint = 0;
bio->bi_write_stream = 0;
bio->bi_status = 0;
+ bio->bi_bvec_gap_bit = 0;
bio->bi_iter.bi_sector = 0;
bio->bi_iter.bi_size = 0;
bio->bi_iter.bi_idx = 0;
diff --git a/block/blk-map.c b/block/blk-map.c
index 60faf036fb6e4..770baf6001718 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -459,6 +459,7 @@ int blk_rq_append_bio(struct request *rq, struct bio *bio)
if (rq->bio) {
if (!ll_back_merge_fn(rq, bio, nr_segs))
return -EINVAL;
+ rq->phys_gap |= bio_seg_gap(rq->q, rq->biotail, bio);
rq->biotail->bi_next = bio;
rq->biotail = bio;
rq->__data_len += bio->bi_iter.bi_size;
@@ -469,6 +470,8 @@ int blk_rq_append_bio(struct request *rq, struct bio *bio)
rq->nr_phys_segments = nr_segs;
rq->bio = rq->biotail = bio;
rq->__data_len = bio->bi_iter.bi_size;
+ if (bio->bi_bvec_gap_bit)
+ rq->phys_gap = 1 << (bio->bi_bvec_gap_bit - 1);
return 0;
}
EXPORT_SYMBOL(blk_rq_append_bio);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 37864c5d287ef..aadf8cab92ec5 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -302,6 +302,12 @@ static unsigned int bio_split_alignment(struct bio *bio,
return lim->logical_block_size;
}
+static inline unsigned int bvec_seg_gap(struct bio_vec *bvprv,
+ struct bio_vec *bv)
+{
+ return __bvec_gap(bvprv, bv->bv_offset, U32_MAX);
+}
+
/**
* bio_split_io_at - check if and where to split a bio
* @bio: [in] bio to be split
@@ -318,9 +324,9 @@ static unsigned int bio_split_alignment(struct bio *bio,
int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
unsigned *segs, unsigned max_bytes, unsigned len_align_mask)
{
+ unsigned nsegs = 0, bytes = 0, gaps = 0;
struct bio_vec bv, bvprv, *bvprvp = NULL;
struct bvec_iter iter;
- unsigned nsegs = 0, bytes = 0;
bio_for_each_bvec(bv, bio, iter) {
if (bv.bv_offset & lim->dma_alignment ||
@@ -331,8 +337,11 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
* If the queue doesn't support SG gaps and adding this
* offset would create a gap, disallow it.
*/
- if (bvprvp && bvec_gap_to_prev(lim, bvprvp, bv.bv_offset))
- goto split;
+ if (bvprvp) {
+ if (bvec_gap_to_prev(lim, bvprvp, bv.bv_offset))
+ goto split;
+ gaps |= bvec_seg_gap(bvprvp, &bv);
+ }
if (nsegs < lim->max_segments &&
bytes + bv.bv_len <= max_bytes &&
@@ -350,6 +359,7 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
}
*segs = nsegs;
+ bio->bi_bvec_gap_bit = ffs(gaps);
return 0;
split:
if (bio->bi_opf & REQ_ATOMIC)
@@ -385,6 +395,7 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
* big IO can be trivial, disable iopoll when split needed.
*/
bio_clear_polled(bio);
+ bio->bi_bvec_gap_bit = ffs(gaps);
return bytes >> SECTOR_SHIFT;
}
EXPORT_SYMBOL_GPL(bio_split_io_at);
@@ -721,6 +732,24 @@ static bool blk_atomic_write_mergeable_rqs(struct request *rq,
return (rq->cmd_flags & REQ_ATOMIC) == (next->cmd_flags & REQ_ATOMIC);
}
+inline unsigned int bio_seg_gap(struct request_queue *q, struct bio *prev,
+ struct bio *next)
+{
+ unsigned int gaps = 0;
+ struct bio_vec pb, nb;
+
+ if (prev->bi_bvec_gap_bit)
+ gaps |= 1 << (prev->bi_bvec_gap_bit - 1);
+ if (next->bi_bvec_gap_bit)
+ gaps |= 1 << (next->bi_bvec_gap_bit - 1);
+
+ bio_get_last_bvec(prev, &pb);
+ bio_get_first_bvec(next, &nb);
+ if (!biovec_phys_mergeable(q, &pb, &nb))
+ gaps |= bvec_seg_gap(&pb, &nb);
+ return gaps;
+}
+
/*
* For non-mq, this has to be called with the request spinlock acquired.
* For mq with scheduling, the appropriate queue wide lock should be held.
@@ -785,6 +814,8 @@ static struct request *attempt_merge(struct request_queue *q,
if (next->start_time_ns < req->start_time_ns)
req->start_time_ns = next->start_time_ns;
+ req->phys_gap |= bio_seg_gap(req->q, req->biotail, next->bio) |
+ next->phys_gap;
req->biotail->bi_next = next->bio;
req->biotail = next->biotail;
@@ -908,6 +939,7 @@ enum bio_merge_status bio_attempt_back_merge(struct request *req,
if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
blk_zone_write_plug_bio_merged(bio);
+ req->phys_gap |= bio_seg_gap(req->q, req->biotail, bio);
req->biotail->bi_next = bio;
req->biotail = bio;
req->__data_len += bio->bi_iter.bi_size;
@@ -942,6 +974,7 @@ static enum bio_merge_status bio_attempt_front_merge(struct request *req,
blk_update_mixed_merge(req, bio, true);
+ req->phys_gap |= bio_seg_gap(req->q, bio, req->bio);
bio->bi_next = req->bio;
req->bio = bio;
diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
index 449950029872a..0f2de31987def 100644
--- a/block/blk-mq-dma.c
+++ b/block/blk-mq-dma.c
@@ -79,8 +79,7 @@ static bool blk_map_iter_next(struct request *req, struct blk_map_iter *iter,
static inline bool blk_can_dma_map_iova(struct request *req,
struct device *dma_dev)
{
- return !((queue_virt_boundary(req->q) + 1) &
- dma_get_merge_boundary(dma_dev));
+ return !(req->phys_gap & dma_get_merge_boundary(dma_dev));
}
static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 09f5794141615..a5095bfc15964 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -376,6 +376,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
INIT_LIST_HEAD(&rq->queuelist);
rq->q = q;
rq->__sector = (sector_t) -1;
+ rq->phys_gap = 0;
INIT_HLIST_NODE(&rq->hash);
RB_CLEAR_NODE(&rq->rb_node);
rq->tag = BLK_MQ_NO_TAG;
@@ -668,6 +669,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
goto out_queue_exit;
}
rq->__data_len = 0;
+ rq->phys_gap = 0;
rq->__sector = (sector_t) -1;
rq->bio = rq->biotail = NULL;
return rq;
@@ -748,6 +750,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
rq = blk_mq_rq_ctx_init(&data, blk_mq_tags_from_data(&data), tag);
blk_mq_rq_time_init(rq, alloc_time_ns);
rq->__data_len = 0;
+ rq->phys_gap = 0;
rq->__sector = (sector_t) -1;
rq->bio = rq->biotail = NULL;
return rq;
@@ -2674,6 +2677,12 @@ static void blk_mq_bio_to_request(struct request *rq, struct bio *bio,
rq->bio = rq->biotail = bio;
rq->__sector = bio->bi_iter.bi_sector;
rq->__data_len = bio->bi_iter.bi_size;
+
+ if (bio->bi_bvec_gap_bit)
+ rq->phys_gap = 1 << (bio->bi_bvec_gap_bit - 1);
+ else
+ rq->phys_gap = 0;
+
rq->nr_phys_segments = nr_segs;
if (bio_integrity(bio))
rq->nr_integrity_segments = blk_rq_count_integrity_sg(rq->q,
@@ -3380,6 +3389,7 @@ int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
}
rq->nr_phys_segments = rq_src->nr_phys_segments;
rq->nr_integrity_segments = rq_src->nr_integrity_segments;
+ rq->phys_gap = rq_src->phys_gap;
if (rq->bio && blk_crypto_rq_bio_prep(rq, rq->bio, gfp_mask) < 0)
goto free_and_out;
diff --git a/block/blk.h b/block/blk.h
index 170794632135d..0617b167fcb8d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -140,11 +140,16 @@ static inline bool biovec_phys_mergeable(struct request_queue *q,
return true;
}
+static inline u32 __bvec_gap(struct bio_vec *bprv, unsigned int offset,
+ unsigned long mask)
+{
+ return (offset | (bprv->bv_offset + bprv->bv_len)) & mask;
+}
+
static inline bool __bvec_gap_to_prev(const struct queue_limits *lim,
struct bio_vec *bprv, unsigned int offset)
{
- return (offset & lim->virt_boundary_mask) ||
- ((bprv->bv_offset + bprv->bv_len) & lim->virt_boundary_mask);
+ return __bvec_gap(bprv, offset, lim->virt_boundary_mask);
}
/*
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 16c1c85613b76..3ae305d7a6c59 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -324,6 +324,8 @@ extern struct bio *bio_split(struct bio *bio, int sectors,
gfp_t gfp, struct bio_set *bs);
int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
unsigned *segs, unsigned max_bytes, unsigned len_align);
+unsigned int bio_seg_gap(struct request_queue *q, struct bio *prev,
+ struct bio *next);
/**
* bio_next_split - get next @sectors from a bio, splitting if necessary
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b25d12545f46d..31546f3e943b2 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -152,6 +152,14 @@ struct request {
unsigned short nr_phys_segments;
unsigned short nr_integrity_segments;
+ /*
+ * A mask that contains bits set for virtual address gaps between
+ * physical segments. This provides information necessary for dma
+ * optimization opportunities, like for testing if the segments can be
+ * coalesced against the device's iommu granule.
+ */
+ unsigned int phys_gap;
+
#ifdef CONFIG_BLK_INLINE_ENCRYPTION
struct bio_crypt_ctx *crypt_ctx;
struct blk_crypto_keyslot *crypt_keyslot;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 8e8d1cc8b06c4..4162afd313463 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -218,6 +218,18 @@ struct bio {
enum rw_hint bi_write_hint;
u8 bi_write_stream;
blk_status_t bi_status;
+
+ /*
+ * The bvec gap bit indicates the lowest set bit in any address offset
+ * between all bi_io_vecs. This field is initialized only after
+ * splitting to the hardware limits. It may be used to consider DMA
+ * optimization when performing that mapping. The value is compared to
+ * a power of two mask where the result depends on any bit set within
+ * the mask, so saving the lowest bit is sufficient to know if any
+ * segment gap collides with the mask.
+ */
+ u8 bi_bvec_gap_bit;
+
atomic_t __bi_remaining;
struct bvec_iter bi_iter;
--
2.47.3
* [PATCHv4 2/2] nvme: remove virtual boundary for sgl capable devices
@ 2025-10-07 17:52 Keith Busch
From: Keith Busch @ 2025-10-07 17:52 UTC (permalink / raw)
To: linux-block, linux-nvme; +Cc: hch, axboe, Keith Busch
From: Keith Busch <kbusch@kernel.org>
The nvme virtual boundary is only required for the PRP format. Devices
that can use SGL for DMA don't need it for IO queues. Drop reporting it
for such devices; RDMA fabrics controllers will continue to use the
limit as they currently don't report any boundary requirements, but TCP
and FC never needed it in the first place, so they now report no virtual
boundary.
Applications may continue to align to the same virtual boundaries for
optimization purposes if they want, and the driver will continue to
decide, as before, whether to use the PRP format when the IO allows it.
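Condensed from the pci.c hunk below, the per-IO decision mentioned above
reduces to roughly this check (a sketch, not the full function):

	/* a request with sub-page gaps cannot be expressed as PRPs */
	if (req->phys_gap & (NVME_CTRL_PAGE_SIZE - 1))
		return SGL_FORCED;
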
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
drivers/nvme/host/apple.c | 1 +
drivers/nvme/host/core.c | 10 +++++-----
drivers/nvme/host/fabrics.h | 6 ++++++
drivers/nvme/host/fc.c | 1 +
drivers/nvme/host/nvme.h | 7 +++++++
drivers/nvme/host/pci.c | 28 +++++++++++++++++++++++++---
drivers/nvme/host/rdma.c | 1 +
drivers/nvme/host/tcp.c | 1 +
drivers/nvme/target/loop.c | 1 +
9 files changed, 48 insertions(+), 8 deletions(-)
diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
index f35d3f71d14f3..15b3d07f8ccdd 100644
--- a/drivers/nvme/host/apple.c
+++ b/drivers/nvme/host/apple.c
@@ -1283,6 +1283,7 @@ static const struct nvme_ctrl_ops nvme_ctrl_ops = {
.reg_read64 = apple_nvme_reg_read64,
.free_ctrl = apple_nvme_free_ctrl,
.get_address = apple_nvme_get_address,
+ .get_virt_boundary = nvme_get_virt_boundary,
};
static void apple_nvme_async_probe(void *data, async_cookie_t cookie)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fa4181d7de736..63e15cce3699c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2069,13 +2069,13 @@ static u32 nvme_max_drv_segments(struct nvme_ctrl *ctrl)
}
static void nvme_set_ctrl_limits(struct nvme_ctrl *ctrl,
- struct queue_limits *lim)
+ struct queue_limits *lim, bool is_admin)
{
lim->max_hw_sectors = ctrl->max_hw_sectors;
lim->max_segments = min_t(u32, USHRT_MAX,
min_not_zero(nvme_max_drv_segments(ctrl), ctrl->max_segments));
lim->max_integrity_segments = ctrl->max_integrity_segments;
- lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1;
+ lim->virt_boundary_mask = ctrl->ops->get_virt_boundary(ctrl, is_admin);
lim->max_segment_size = UINT_MAX;
lim->dma_alignment = 3;
}
@@ -2177,7 +2177,7 @@ static int nvme_update_ns_info_generic(struct nvme_ns *ns,
int ret;
lim = queue_limits_start_update(ns->disk->queue);
- nvme_set_ctrl_limits(ns->ctrl, &lim);
+ nvme_set_ctrl_limits(ns->ctrl, &lim, false);
memflags = blk_mq_freeze_queue(ns->disk->queue);
ret = queue_limits_commit_update(ns->disk->queue, &lim);
@@ -2381,7 +2381,7 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
ns->head->lba_shift = id->lbaf[lbaf].ds;
ns->head->nuse = le64_to_cpu(id->nuse);
capacity = nvme_lba_to_sect(ns->head, le64_to_cpu(id->nsze));
- nvme_set_ctrl_limits(ns->ctrl, &lim);
+ nvme_set_ctrl_limits(ns->ctrl, &lim, false);
nvme_configure_metadata(ns->ctrl, ns->head, id, nvm, info);
nvme_set_chunk_sectors(ns, id, &lim);
if (!nvme_update_disk_info(ns, id, &lim))
@@ -3589,7 +3589,7 @@ static int nvme_init_identify(struct nvme_ctrl *ctrl)
min_not_zero(ctrl->max_hw_sectors, max_hw_sectors);
lim = queue_limits_start_update(ctrl->admin_q);
- nvme_set_ctrl_limits(ctrl, &lim);
+ nvme_set_ctrl_limits(ctrl, &lim, true);
ret = queue_limits_commit_update(ctrl->admin_q, &lim);
if (ret)
goto out_free;
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index 1b58ee7d0dcee..caf5503d08332 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -217,6 +217,12 @@ static inline unsigned int nvmf_nr_io_queues(struct nvmf_ctrl_options *opts)
min(opts->nr_poll_queues, num_online_cpus());
}
+static inline unsigned long nvmf_get_virt_boundary(struct nvme_ctrl *ctrl,
+ bool is_admin)
+{
+ return 0;
+}
+
int nvmf_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val);
int nvmf_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val);
int nvmf_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val);
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 03987f497a5b5..70c066c2e2d42 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -3360,6 +3360,7 @@ static const struct nvme_ctrl_ops nvme_fc_ctrl_ops = {
.submit_async_event = nvme_fc_submit_async_event,
.delete_ctrl = nvme_fc_delete_ctrl,
.get_address = nvmf_get_address,
+ .get_virt_boundary = nvmf_get_virt_boundary,
};
static void
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 102fae6a231c5..7f7cb823d60d8 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -558,6 +558,12 @@ static inline bool nvme_ns_has_pi(struct nvme_ns_head *head)
return head->pi_type && head->ms == head->pi_size;
}
+static inline unsigned long nvme_get_virt_boundary(struct nvme_ctrl *ctrl,
+ bool is_admin)
+{
+ return NVME_CTRL_PAGE_SIZE - 1;
+}
+
struct nvme_ctrl_ops {
const char *name;
struct module *module;
@@ -578,6 +584,7 @@ struct nvme_ctrl_ops {
int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
void (*print_device_info)(struct nvme_ctrl *ctrl);
bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
+ unsigned long (*get_virt_boundary)(struct nvme_ctrl *ctrl, bool is_admin);
};
/*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c916176bd9f05..86d0f05523aec 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -613,9 +613,22 @@ static inline enum nvme_use_sgl nvme_pci_use_sgls(struct nvme_dev *dev,
struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
if (nvmeq->qid && nvme_ctrl_sgl_supported(&dev->ctrl)) {
- if (nvme_req(req)->flags & NVME_REQ_USERCMD)
- return SGL_FORCED;
- if (req->nr_integrity_segments > 1)
+ /*
+ * When the controller is capable of using SGL, there are
+ * several conditions under which we force its use:
+ *
+ * 1. A request containing page gaps within the controller's
+ * mask cannot use the PRP format.
+ *
+ * 2. User commands use SGL because that lets the device
+ * validate the requested transfer lengths.
+ *
+ * 3. Multiple integrity segments must use SGL as that's the
+ * only way to describe such a command in NVMe.
+ */
+ if (req->phys_gap & (NVME_CTRL_PAGE_SIZE - 1) ||
+ nvme_req(req)->flags & NVME_REQ_USERCMD ||
+ req->nr_integrity_segments > 1)
return SGL_FORCED;
return SGL_SUPPORTED;
}
@@ -3243,6 +3256,14 @@ static bool nvme_pci_supports_pci_p2pdma(struct nvme_ctrl *ctrl)
return dma_pci_p2pdma_supported(dev->dev);
}
+static unsigned long nvme_pci_get_virt_boundary(struct nvme_ctrl *ctrl,
+ bool is_admin)
+{
+ if (!nvme_ctrl_sgl_supported(ctrl) || is_admin)
+ return NVME_CTRL_PAGE_SIZE - 1;
+ return 0;
+}
+
static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name = "pcie",
.module = THIS_MODULE,
@@ -3257,6 +3278,7 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.get_address = nvme_pci_get_address,
.print_device_info = nvme_pci_print_device_info,
.supports_pci_p2pdma = nvme_pci_supports_pci_p2pdma,
+ .get_virt_boundary = nvme_pci_get_virt_boundary,
};
static int nvme_dev_map(struct nvme_dev *dev)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 190a4cfa8a5ee..35c0822edb2d7 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -2202,6 +2202,7 @@ static const struct nvme_ctrl_ops nvme_rdma_ctrl_ops = {
.delete_ctrl = nvme_rdma_delete_ctrl,
.get_address = nvmf_get_address,
.stop_ctrl = nvme_rdma_stop_ctrl,
+ .get_virt_boundary = nvme_get_virt_boundary,
};
/*
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 1413788ca7d52..82875351442a0 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2862,6 +2862,7 @@ static const struct nvme_ctrl_ops nvme_tcp_ctrl_ops = {
.delete_ctrl = nvme_tcp_delete_ctrl,
.get_address = nvme_tcp_get_address,
.stop_ctrl = nvme_tcp_stop_ctrl,
+ .get_virt_boundary = nvmf_get_virt_boundary,
};
static bool
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index f85a8441bcc6e..9fe88a489eb71 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -511,6 +511,7 @@ static const struct nvme_ctrl_ops nvme_loop_ctrl_ops = {
.submit_async_event = nvme_loop_submit_async_event,
.delete_ctrl = nvme_loop_delete_ctrl_host,
.get_address = nvmf_get_address,
+ .get_virt_boundary = nvmf_get_virt_boundary,
};
static int nvme_loop_create_io_queues(struct nvme_loop_ctrl *ctrl)
--
2.47.3
* Re: [PATCHv4 1/2] block: accumulate memory segment gaps per bio
@ 2025-10-10 5:34 Christoph Hellwig
From: Christoph Hellwig @ 2025-10-10 5:34 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-block, linux-nvme, hch, axboe, Keith Busch
On Tue, Oct 07, 2025 at 10:52:44AM -0700, Keith Busch wrote:
> +static inline unsigned int bvec_seg_gap(struct bio_vec *bvprv,
> + struct bio_vec *bv)
> +{
> + return __bvec_gap(bvprv, bv->bv_offset, U32_MAX);
> +}
I find this helper (and the existing __bvec_gap* ones, but I'll send
patches to clean that up in a bit..) very confusing. Just open coding
it in the callers like:
gaps |= (bvprvp->bv_offset + bvprvp->bv_len);
gaps |= bv.bv_offset;
makes the intent clear, and also removes the pointless masking by
U32_MAX.
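Concretely, the open-coded form would sit in bio_split_io_at()'s bvec
loop roughly like this (a sketch based on the hunk above, not an applied
patch):

	bio_for_each_bvec(bv, bio, iter) {
		...
		if (bvprvp) {
			if (bvec_gap_to_prev(lim, bvprvp, bv.bv_offset))
				goto split;
			/* record both boundary offsets; only the lowest
			 * set bit is kept in the end */
			gaps |= bvprvp->bv_offset + bvprvp->bv_len;
			gaps |= bv.bv_offset;
		}
		...
	}
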
> + /*
> + * A mask that contains bits set for virtual address gaps between
> + * physical segments. This provides information necessary for dma
> + * optimization opprotunities, like for testing if the segments can be
> + * coalesced against the device's iommu granule.
> + */
> + unsigned int phys_gap;
Any reason this is not a mask like in the bio? Having the representation
and naming match between the bio and request should make the code a bit
easier to understand.
> +
> + /*
> + * The bvec gap bit indicates the lowest set bit in any address offset
> + * between all bi_io_vecs. This field is initialized only after
> + * splitting to the hardware limits. It may be used to consider DMA
> + * optimization when performing that mapping. The value is compared to
> + * a power of two mask where the result depends on any bit set within
> + * the mask, so saving the lowest bit is sufficient to know if any
> + * segment gap collides with the mask.
> + */
This should grow a sentence explaining that the field is only set by
bio_split_io_at, and not valid before as that's very different from the
other bio fields.
> + u8 bi_bvec_gap_bit;
Aren't we normally calling something like this _mask, i.e., something
like:
bi_bvec_page_gap_mask;
* Re: [PATCHv4 2/2] nvme: remove virtual boundary for sgl capable devices
@ 2025-10-10 5:34 Christoph Hellwig
From: Christoph Hellwig @ 2025-10-10 5:34 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-block, linux-nvme, hch, axboe, Keith Busch
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCHv4 1/2] block: accumulate memory segment gaps per bio
@ 2025-10-13 21:33 Keith Busch
From: Keith Busch @ 2025-10-13 21:33 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Keith Busch, linux-block, linux-nvme, axboe
On Fri, Oct 10, 2025 at 07:34:22AM +0200, Christoph Hellwig wrote:
> On Tue, Oct 07, 2025 at 10:52:44AM -0700, Keith Busch wrote:
> > +static inline unsigned int bvec_seg_gap(struct bio_vec *bvprv,
> > + struct bio_vec *bv)
> > +{
> > + return __bvec_gap(bvprv, bv->bv_offset, U32_MAX);
> > +}
>
> I find this helper (and the existing __bvec_gap* ones, but I'll send
> patches to clean that up in a bit..) very confusing. Just open coding
> it in the callers like:
>
> gaps |= (bvprvp->bv_offset + bvprvp->bv_len);
> gaps |= bv.bv_offset;
>
> makes the intent clear, and also removes the pointless masking by
> U32_MAX.
Sounds good, I'll rebase on your cleanup patch.
> > + /*
> > + * A mask that contains bits set for virtual address gaps between
> > + * physical segments. This provides information necessary for dma
> > + * optimization opportunities, like for testing if the segments can be
> > + * coalesced against the device's iommu granule.
> > + */
> > + unsigned int phys_gap;
>
> Any reason this is not a mask like in the bio? Having the representation
> and naming match between the bio and request should make the code a bit
> easier to understand.
I thought it easier for the users to deal with the mask rather than a
set bit value. Not a big deal, I'll just introduce a helper to return a
mask from the value.
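For illustration, such a helper might look like the following sketch (the
name and exact form are guesses, not settled in this thread):

	static inline unsigned int bio_gap_mask(const struct bio *bio)
	{
		if (!bio->bi_bvec_gap_bit)
			return 0;
		return 1U << (bio->bi_bvec_gap_bit - 1);
	}
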
> > +
> > + /*
> > + * The bvec gap bit indicates the lowest set bit in any address offset
> > + * between all bi_io_vecs. This field is initialized only after
> > + * splitting to the hardware limits. It may be used to consider DMA
> > + * optimization when performing that mapping. The value is compared to
> > + * a power of two mask where the result depends on any bit set within
> > + * the mask, so saving the lowest bit is sufficient to know if any
> > + * segment gap collides with the mask.
> > + */
>
> This should grow a sentence explaining that the field is only set by
> bio_split_io_at, and not valid before as that's very different from the
> other bio fields.
I didn't mention the function by name, but the comment does say it's not
initialized until you split to limits. I'll add a pointer to
bio_split_io_at().
> > + u8 bi_bvec_gap_bit;
>
> Aren't we normally calling something like this _mask, i.e., something
> like:
>
> bi_bvec_page_gap_mask;
A "mask" suffix in the name suggests you can AND it directly with
another value to get a useful result, but that's not how this is
encoded. You have to shift it to generate the intended mask.