linux-block.vger.kernel.org archive mirror
* [PATCHv2 0/2] block: replace reliance on virt boundary
@ 2025-08-06 14:51 Keith Busch
  2025-08-06 14:51 ` [PATCHv2 1/2] block: accumulate segment page gaps per bio Keith Busch
  2025-08-06 14:51 ` [PATCHv2 2/2] nvme: remove virtual boundary for sgl capable devices Keith Busch
  0 siblings, 2 replies; 5+ messages in thread
From: Keith Busch @ 2025-08-06 14:51 UTC (permalink / raw)
  To: linux-block, linux-nvme; +Cc: hch, axboe, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Changes from v1:

 - Replaced macro with inline function (Caleb)

 - Replaced unsafe tail bio vec access with safe access

 - For nvme, always set the max_segment_size limit rather than depend on
   it being set through the virt_boundary_mask usage.

Keith Busch (2):
  block: accumulate segment page gaps per bio
  nvme: remove virtual boundary for sgl capable devices

 block/bio.c               |  1 +
 block/blk-merge.c         | 32 +++++++++++++++++++++++++++++---
 block/blk-mq-dma.c        |  3 +--
 block/blk-mq.c            |  5 +++++
 drivers/nvme/host/core.c  | 15 ++++++++++-----
 drivers/nvme/host/pci.c   |  6 +++---
 include/linux/blk-mq.h    |  6 ++++++
 include/linux/blk_types.h |  2 ++
 8 files changed, 57 insertions(+), 13 deletions(-)

-- 
2.47.3



* [PATCHv2 1/2] block: accumulate segment page gaps per bio
  2025-08-06 14:51 [PATCHv2 0/2] block: replace reliance on virt boundary Keith Busch
@ 2025-08-06 14:51 ` Keith Busch
  2025-08-10 14:55   ` Christoph Hellwig
  2025-08-06 14:51 ` [PATCHv2 2/2] nvme: remove virtual boundary for sgl capable devices Keith Busch
  1 sibling, 1 reply; 5+ messages in thread
From: Keith Busch @ 2025-08-06 14:51 UTC (permalink / raw)
  To: linux-block, linux-nvme; +Cc: hch, axboe, Keith Busch

From: Keith Busch <kbusch@kernel.org>

The blk-mq dma iteration has an optimization for requests that align to
the device's iommu merge boundary. This boundary may be larger than the
device's virtual boundary, but the code had been depending on that queue
limit to know ahead of time whether a request qualifies for the
optimization.

Rather than rely on that queue limit, which many devices may not even
have, store the page boundary gaps of each segment in the bio as a mask
while the segments are checked and merged. We can then quickly check per
IO whether the request can use the optimization.
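
For example, if the previous vector ends 0x200 bytes into a page and the
next vector starts at page offset 0x200, the accumulation below (names as
in the patch) folds both gaps into the running mask:

	/* bv.bv_offset == 0x200, previous end offset == 0x200 */
	page_gaps |= bv.bv_offset |
		((bvprv.bv_offset + bvprv.bv_len) & (PAGE_SIZE - 1));

With a 4k iommu granule dma_get_merge_boundary() returns 0xfff, so the
stored mask has bits below the merge boundary set and the request takes
the per-segment dma mapping path instead of the single iova mapping.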

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/bio.c               |  1 +
 block/blk-merge.c         | 32 +++++++++++++++++++++++++++++---
 block/blk-mq-dma.c        |  3 +--
 block/blk-mq.c            |  5 +++++
 include/linux/blk-mq.h    |  6 ++++++
 include/linux/blk_types.h |  2 ++
 6 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index b7636a2662f7d..8ed71c284b408 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -275,6 +275,7 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 	bio->bi_integrity = NULL;
 #endif
 	bio->bi_vcnt = 0;
+	bio->page_gaps = 0;
 
 	atomic_set(&bio->__bi_remaining, 1);
 	atomic_set(&bio->__bi_cnt, 1);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 81bdad915699a..a09749f9ee3cb 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -278,6 +278,11 @@ static unsigned int bio_split_alignment(struct bio *bio,
 	return lim->logical_block_size;
 }
 
+static inline unsigned int bvec_seg_gap(struct bio_vec bv, struct bio_vec bp)
+{
+	return bv.bv_offset | ((bp.bv_offset + bp.bv_len) & (PAGE_SIZE - 1));
+}
+
 /**
  * bio_split_rw_at - check if and where to split a read/write bio
  * @bio:  [in] bio to be split
@@ -293,9 +298,9 @@ static unsigned int bio_split_alignment(struct bio *bio,
 int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
 		unsigned *segs, unsigned max_bytes)
 {
+	unsigned nsegs = 0, bytes = 0, page_gaps = 0;
 	struct bio_vec bv, bvprv, *bvprvp = NULL;
 	struct bvec_iter iter;
-	unsigned nsegs = 0, bytes = 0;
 
 	bio_for_each_bvec(bv, bio, iter) {
 		if (bv.bv_offset & lim->dma_alignment)
@@ -305,8 +310,11 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
 		 * If the queue doesn't support SG gaps and adding this
 		 * offset would create a gap, disallow it.
 		 */
-		if (bvprvp && bvec_gap_to_prev(lim, bvprvp, bv.bv_offset))
-			goto split;
+		if (bvprvp) {
+			if (bvec_gap_to_prev(lim, bvprvp, bv.bv_offset))
+				goto split;
+			page_gaps |= bvec_seg_gap(bv, bvprv);
+		}
 
 		if (nsegs < lim->max_segments &&
 		    bytes + bv.bv_len <= max_bytes &&
@@ -324,6 +332,7 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
 	}
 
 	*segs = nsegs;
+	bio->page_gaps = page_gaps;
 	return 0;
 split:
 	if (bio->bi_opf & REQ_ATOMIC)
@@ -353,6 +362,7 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
 	 * big IO can be trival, disable iopoll when split needed.
 	 */
 	bio_clear_polled(bio);
+	bio->page_gaps = page_gaps;
 	return bytes >> SECTOR_SHIFT;
 }
 EXPORT_SYMBOL_GPL(bio_split_rw_at);
@@ -689,6 +699,19 @@ static bool blk_atomic_write_mergeable_rqs(struct request *rq,
 	return (rq->cmd_flags & REQ_ATOMIC) == (next->cmd_flags & REQ_ATOMIC);
 }
 
+static inline unsigned int bio_seg_gap(struct request_queue *q,
+				struct bio *next, struct bio *prev)
+{
+	struct bio_vec pb, nb;
+
+	bio_get_last_bvec(prev, &pb);
+	bio_get_first_bvec(next, &nb);
+	if (biovec_phys_mergeable(q, &pb, &nb))
+		return 0;
+
+	return next->page_gaps | bvec_seg_gap(nb, pb);
+}
+
 /*
  * For non-mq, this has to be called with the request spinlock acquired.
  * For mq with scheduling, the appropriate queue wide lock should be held.
@@ -753,6 +776,7 @@ static struct request *attempt_merge(struct request_queue *q,
 	if (next->start_time_ns < req->start_time_ns)
 		req->start_time_ns = next->start_time_ns;
 
+	req->__page_gaps |= bio_seg_gap(req->q, next->bio, req->biotail);
 	req->biotail->bi_next = next->bio;
 	req->biotail = next->biotail;
 
@@ -876,6 +900,7 @@ enum bio_merge_status bio_attempt_back_merge(struct request *req,
 	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
 		blk_zone_write_plug_bio_merged(bio);
 
+	req->__page_gaps |= bio_seg_gap(req->q, bio, req->biotail);
 	req->biotail->bi_next = bio;
 	req->biotail = bio;
 	req->__data_len += bio->bi_iter.bi_size;
@@ -910,6 +935,7 @@ static enum bio_merge_status bio_attempt_front_merge(struct request *req,
 
 	blk_update_mixed_merge(req, bio, true);
 
+	req->__page_gaps |= bio_seg_gap(req->q, req->bio, bio);
 	bio->bi_next = req->bio;
 	req->bio = bio;
 
diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
index faa36ff6465ee..a03067c4a268f 100644
--- a/block/blk-mq-dma.c
+++ b/block/blk-mq-dma.c
@@ -73,8 +73,7 @@ static bool blk_map_iter_next(struct request *req, struct blk_map_iter *iter)
 static inline bool blk_can_dma_map_iova(struct request *req,
 		struct device *dma_dev)
 {
-	return !((queue_virt_boundary(req->q) + 1) &
-		dma_get_merge_boundary(dma_dev));
+	return !(blk_rq_page_gaps(req) & dma_get_merge_boundary(dma_dev));
 }
 
 static bool blk_dma_map_bus(struct blk_dma_iter *iter)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b67d6c02ecebd..09134a66c5666 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -376,6 +376,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	INIT_LIST_HEAD(&rq->queuelist);
 	rq->q = q;
 	rq->__sector = (sector_t) -1;
+	rq->__page_gaps = 0;
 	INIT_HLIST_NODE(&rq->hash);
 	RB_CLEAR_NODE(&rq->rb_node);
 	rq->tag = BLK_MQ_NO_TAG;
@@ -659,6 +660,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
 			goto out_queue_exit;
 	}
 	rq->__data_len = 0;
+	rq->__page_gaps = 0;
 	rq->__sector = (sector_t) -1;
 	rq->bio = rq->biotail = NULL;
 	return rq;
@@ -739,6 +741,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	rq = blk_mq_rq_ctx_init(&data, blk_mq_tags_from_data(&data), tag);
 	blk_mq_rq_time_init(rq, alloc_time_ns);
 	rq->__data_len = 0;
+	rq->__page_gaps = 0;
 	rq->__sector = (sector_t) -1;
 	rq->bio = rq->biotail = NULL;
 	return rq;
@@ -2665,6 +2668,7 @@ static void blk_mq_bio_to_request(struct request *rq, struct bio *bio,
 	rq->bio = rq->biotail = bio;
 	rq->__sector = bio->bi_iter.bi_sector;
 	rq->__data_len = bio->bi_iter.bi_size;
+	rq->__page_gaps = bio->page_gaps;
 	rq->nr_phys_segments = nr_segs;
 	if (bio_integrity(bio))
 		rq->nr_integrity_segments = blk_rq_count_integrity_sg(rq->q,
@@ -3363,6 +3367,7 @@ int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
 
 	/* Copy attributes of the original request to the clone request. */
 	rq->__sector = blk_rq_pos(rq_src);
+	rq->__page_gaps = blk_rq_page_gaps(rq_src);
 	rq->__data_len = blk_rq_bytes(rq_src);
 	if (rq_src->rq_flags & RQF_SPECIAL_PAYLOAD) {
 		rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2a5a828f19a0b..093f6c5cd7b74 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -115,6 +115,7 @@ struct request {
 
 	/* the following two fields are internal, NEVER access directly */
 	unsigned int __data_len;	/* total data len */
+	unsigned int __page_gaps;	/* a mask of all the segment gaps */
 	sector_t __sector;		/* sector cursor */
 
 	struct bio *bio;
@@ -1080,6 +1081,11 @@ static inline sector_t blk_rq_pos(const struct request *rq)
 	return rq->__sector;
 }
 
+static inline unsigned int blk_rq_page_gaps(const struct request *rq)
+{
+	return rq->__page_gaps;
+}
+
 static inline unsigned int blk_rq_bytes(const struct request *rq)
 {
 	return rq->__data_len;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 0a29b20939d17..d0ed28d40fe02 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -264,6 +264,8 @@ struct bio {
 
 	unsigned short		bi_max_vecs;	/* max bvl_vecs we can hold */
 
+	unsigned int		page_gaps;	/* a mask of all the vector gaps */
+
 	atomic_t		__bi_cnt;	/* pin count */
 
 	struct bio_vec		*bi_io_vec;	/* the actual vec list */
-- 
2.47.3



* [PATCHv2 2/2] nvme: remove virtual boundary for sgl capable devices
  2025-08-06 14:51 [PATCHv2 0/2] block: replace reliance on virt boundary Keith Busch
  2025-08-06 14:51 ` [PATCHv2 1/2] block: accumulate segment page gaps per bio Keith Busch
@ 2025-08-06 14:51 ` Keith Busch
  2025-08-10 14:59   ` Christoph Hellwig
  1 sibling, 1 reply; 5+ messages in thread
From: Keith Busch @ 2025-08-06 14:51 UTC (permalink / raw)
  To: linux-block, linux-nvme; +Cc: hch, axboe, Keith Busch

From: Keith Busch <kbusch@kernel.org>

The nvme virtual boundary is only needed for the PRP format. Devices
that can use the SGL format don't need it for IO queues, so stop
reporting it for such PCIe devices; fabrics controllers will continue to
use the limit.

Applications can continue to align to the boundary for optimization
purposes, and the driver will still decide whether to use the PRP format
when the IO allows it.
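
With the boundary gone the decision is made per IO: any accumulated
segment gap below the PRP page size simply forces the SGL format, as in
the pci.c hunk below:

	if (blk_rq_page_gaps(req) & (NVME_CTRL_PAGE_SIZE - 1))
		return SGL_FORCED;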

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/core.c | 15 ++++++++++-----
 drivers/nvme/host/pci.c  |  6 +++---
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 812c1565114fd..39c74e3a18a04 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2065,13 +2065,18 @@ static u32 nvme_max_drv_segments(struct nvme_ctrl *ctrl)
 }
 
 static void nvme_set_ctrl_limits(struct nvme_ctrl *ctrl,
-		struct queue_limits *lim)
+		struct queue_limits *lim, bool admin)
 {
 	lim->max_hw_sectors = ctrl->max_hw_sectors;
 	lim->max_segments = min_t(u32, USHRT_MAX,
 		min_not_zero(nvme_max_drv_segments(ctrl), ctrl->max_segments));
 	lim->max_integrity_segments = ctrl->max_integrity_segments;
-	lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1;
+	lim->max_segment_size = UINT_MAX;
+	if (!nvme_ctrl_sgl_supported(ctrl) || admin ||
+	    ctrl->ops->flags & NVME_F_FABRICS)
+		lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1;
+	else
+		lim->virt_boundary_mask = 0;
 	lim->max_segment_size = UINT_MAX;
 	lim->dma_alignment = 3;
 }
@@ -2173,7 +2178,7 @@ static int nvme_update_ns_info_generic(struct nvme_ns *ns,
 	int ret;
 
 	lim = queue_limits_start_update(ns->disk->queue);
-	nvme_set_ctrl_limits(ns->ctrl, &lim);
+	nvme_set_ctrl_limits(ns->ctrl, &lim, false);
 
 	memflags = blk_mq_freeze_queue(ns->disk->queue);
 	ret = queue_limits_commit_update(ns->disk->queue, &lim);
@@ -2377,7 +2382,7 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
 	ns->head->lba_shift = id->lbaf[lbaf].ds;
 	ns->head->nuse = le64_to_cpu(id->nuse);
 	capacity = nvme_lba_to_sect(ns->head, le64_to_cpu(id->nsze));
-	nvme_set_ctrl_limits(ns->ctrl, &lim);
+	nvme_set_ctrl_limits(ns->ctrl, &lim, false);
 	nvme_configure_metadata(ns->ctrl, ns->head, id, nvm, info);
 	nvme_set_chunk_sectors(ns, id, &lim);
 	if (!nvme_update_disk_info(ns, id, &lim))
@@ -3580,7 +3585,7 @@ static int nvme_init_identify(struct nvme_ctrl *ctrl)
 		min_not_zero(ctrl->max_hw_sectors, max_hw_sectors);
 
 	lim = queue_limits_start_update(ctrl->admin_q);
-	nvme_set_ctrl_limits(ctrl, &lim);
+	nvme_set_ctrl_limits(ctrl, &lim, true);
 	ret = queue_limits_commit_update(ctrl->admin_q, &lim);
 	if (ret)
 		goto out_free;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 4a23729f399ac..3da9e4a584682 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -619,9 +619,9 @@ static inline enum nvme_use_sgl nvme_pci_use_sgls(struct nvme_dev *dev,
 	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 
 	if (nvmeq->qid && nvme_ctrl_sgl_supported(&dev->ctrl)) {
-		if (nvme_req(req)->flags & NVME_REQ_USERCMD)
-			return SGL_FORCED;
-		if (req->nr_integrity_segments > 1)
+		if (blk_rq_page_gaps(req) & (NVME_CTRL_PAGE_SIZE - 1) ||
+		    nvme_req(req)->flags & NVME_REQ_USERCMD ||
+		    req->nr_integrity_segments > 1)
 			return SGL_FORCED;
 		return SGL_SUPPORTED;
 	}
-- 
2.47.3



* Re: [PATCHv2 1/2] block: accumulate segment page gaps per bio
  2025-08-06 14:51 ` [PATCHv2 1/2] block: accumulate segment page gaps per bio Keith Busch
@ 2025-08-10 14:55   ` Christoph Hellwig
  0 siblings, 0 replies; 5+ messages in thread
From: Christoph Hellwig @ 2025-08-10 14:55 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, linux-nvme, hch, axboe, Keith Busch

On Wed, Aug 06, 2025 at 07:51:35AM -0700, Keith Busch wrote:
> +static inline unsigned int bvec_seg_gap(struct bio_vec bv, struct bio_vec bp)
> +{
> +	return bv.bv_offset | ((bp.bv_offset + bp.bv_len) & (PAGE_SIZE - 1));
> +}

Can you just pass a pointer to the bio_vec even if the compiler
is probably optimizing away the function either way?

Also bp is a bit of an odd name for a bio_vec.  Not that bvprv flows
well, but it is what the caller uses, so maybe stick to it?
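
I.e. something like this (untested):

static inline unsigned int bvec_seg_gap(const struct bio_vec *bv,
		const struct bio_vec *bvprv)
{
	return bv->bv_offset |
		((bvprv->bv_offset + bvprv->bv_len) & (PAGE_SIZE - 1));
}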

>  	/* the following two fields are internal, NEVER access directly */
>  	unsigned int __data_len;	/* total data len */
> +	unsigned int __page_gaps;	/* a mask of all the segment gaps */
>  	sector_t __sector;		/* sector cursor */
>  
>  	struct bio *bio;
> @@ -1080,6 +1081,11 @@ static inline sector_t blk_rq_pos(const struct request *rq)
>  	return rq->__sector;
>  }
>  
> +static inline unsigned int blk_rq_page_gaps(const struct request *rq)
> +{
> +	return rq->__page_gaps;
> +}

I don't think we really need the __ and the access helper here.  This was
mostly done for fields where historically drivers accessed the field
directly, but it subtly changed semantics making these direct accesses
unsafe.  
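
I.e. just:

	unsigned int page_gaps;	/* mask of all the segment gaps */

in struct request, with plain req->page_gaps accesses at the call sites
instead of the blk_rq_page_gaps() helper.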



* Re: [PATCHv2 2/2] nvme: remove virtual boundary for sgl capable devices
  2025-08-06 14:51 ` [PATCHv2 2/2] nvme: remove virtual boundary for sgl capable devices Keith Busch
@ 2025-08-10 14:59   ` Christoph Hellwig
  0 siblings, 0 replies; 5+ messages in thread
From: Christoph Hellwig @ 2025-08-10 14:59 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, linux-nvme, hch, axboe, Keith Busch

> +	lim->max_segment_size = UINT_MAX;
> +	if (!nvme_ctrl_sgl_supported(ctrl) || admin ||
> +	    ctrl->ops->flags & NVME_F_FABRICS)
> +		lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1;
> +	else
> +		lim->virt_boundary_mask = 0;

We start out with a zero virt_boundary_mask by default, so the else branch
here can go away.  This would also benefit from a comment explaining that
we only need the virt_boundary_mask for PRPs, and that as soon as SGLs
are supported for a given queue we don't set it because we don't want to
pay for the overhead it generates.
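
E.g. something like this (just a sketch):

	/*
	 * The virt boundary is only needed for the PRP format.  Don't set
	 * it when the queue can use SGLs so we don't pay for the bio
	 * splitting overhead it causes.
	 */
	if (!nvme_ctrl_sgl_supported(ctrl) || admin ||
	    ctrl->ops->flags & NVME_F_FABRICS)
		lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1;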

>  	if (nvmeq->qid && nvme_ctrl_sgl_supported(&dev->ctrl)) {
> -		if (nvme_req(req)->flags & NVME_REQ_USERCMD)
> -			return SGL_FORCED;
> -		if (req->nr_integrity_segments > 1)
> +		if (blk_rq_page_gaps(req) & (NVME_CTRL_PAGE_SIZE - 1) ||
> +		    nvme_req(req)->flags & NVME_REQ_USERCMD ||
> +		    req->nr_integrity_segments > 1)

And this would also really benefit from a comment explaining the high
level rationale behind the checks.
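
E.g. something like:

	/*
	 * Force the SGL format whenever PRPs can't safely describe the
	 * transfer: segments with gaps below the controller page size,
	 * user passthrough commands, or metadata that needs more than
	 * one segment.
	 */
	if (blk_rq_page_gaps(req) & (NVME_CTRL_PAGE_SIZE - 1) ||
	    nvme_req(req)->flags & NVME_REQ_USERCMD ||
	    req->nr_integrity_segments > 1)
		return SGL_FORCED;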



