* [PATCHv3 0/2] block+nvme: reducing virtual boundary mask reliance @ 2025-08-21 20:44 Keith Busch 2025-08-21 20:44 ` [PATCHv3 1/2] block: accumulate segment page gaps per bio Keith Busch 2025-08-21 20:44 ` [PATCHv3 2/2] nvme: remove virtual boundary for sgl capable devices Keith Busch 0 siblings, 2 replies; 13+ messages in thread From: Keith Busch @ 2025-08-21 20:44 UTC (permalink / raw) To: linux-block, linux-nvme; +Cc: hch, axboe, Keith Busch From: Keith Busch <kbusch@kernel.org> Previous version is here: https://lore.kernel.org/linux-nvme/20250805195608.2379107-1-kbusch@meta.com/ This patch set depends on this unmerged series for flexible direct-io here: https://lore.kernel.org/linux-block/20250819164922.640964-1-kbusch@meta.com/ The purpose of this is to allow optimization decisions to happen per IO. The virtual boundary that NVMe reports provides specific guarantees about data alignment, but it might not be large enough for some CPU architectures to take advantage of, even if an application uses aligned data buffers that could benefit from it. At the same time, the virtual boundary prevents the driver from directly using memory in ways the hardware may be capable of accessing. This forces applications to unnecessarily double-buffer their data into the more restrictive virtually contiguous format. This patch series provides an efficient way to track page boundary gaps per IO so that these optimization decisions can be made per IO. This provides the flexibility to use hardware to its full ability, beyond what the virtual boundary mask can express. Note that abusing this capability may result in worse performance than the bounce buffer solutions. Sending a bunch of tiny vectors for one IO incurs significant protocol overhead, so while this patch set allows you to do that, I recommend that you don't. We can't enforce a minimum size, though, because vectors may straddle pages with only a few words in the first and/or last pages, which we do need to support. Changes from v2: - We only need to know about the lowest set bit in any bio vector page gap. Use that to avoid increasing the bio size - Fixed back merges; the previous version could miss one of the bio's gaps - Use pointers instead of relying on the inline to generate good code. - Trivial name changes - Comments explaining the new bio field, and the nvme usage for deciding on SGL vs PRP DMA. Keith Busch (2): block: accumulate segment page gaps per bio nvme: remove virtual boundary for sgl capable devices block/bio.c | 1 + block/blk-merge.c | 39 ++++++++++++++++++++++++++++++++++++--- block/blk-mq-dma.c | 3 +-- block/blk-mq.c | 10 ++++++++++ drivers/nvme/host/core.c | 21 ++++++++++++++++----- drivers/nvme/host/pci.c | 16 +++++++++++++--- include/linux/blk-mq.h | 2 ++ include/linux/blk_types.h | 8 ++++++++ 8 files changed, 87 insertions(+), 13 deletions(-) -- 2.47.3 ^ permalink raw reply [flat|nested] 13+ messages in thread
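[Editor's sketch] To make the per-IO gap tracking described in the cover letter concrete, here is a rough user-space sketch of the accumulation the series performs. It uses a simplified struct vec in place of struct bio_vec; the names and the stand-alone setting are illustrative assumptions, not the kernel code itself. The idea is to OR together, for every pair of adjacent vectors, the in-page offset where the previous vector ends and the in-page offset where the next one begins; the resulting word is later tested against a device's merge boundary mask.

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Simplified stand-in for a bio_vec: one physically contiguous chunk. */
struct vec {
	uint64_t addr;	/* physical address of the chunk */
	uint32_t len;	/* length in bytes */
};

/*
 * OR together every in-page offset at which adjacent vectors break page
 * continuity: where the previous vector ends inside a page and where the
 * next one starts inside a page.
 */
static uint32_t accumulate_page_gaps(const struct vec *v, int n)
{
	uint32_t gaps = 0;

	for (int i = 1; i < n; i++) {
		uint32_t prev_end = (v[i - 1].addr + v[i - 1].len) & (PAGE_SIZE - 1);
		uint32_t next_off = v[i].addr & (PAGE_SIZE - 1);

		gaps |= prev_end | next_off;
	}
	return gaps;
}

int main(void)
{
	/* Two full pages followed by a 512-byte tail in a third page. */
	struct vec io[] = {
		{ 0x100000, PAGE_SIZE },
		{ 0x200000, PAGE_SIZE },
		{ 0x300000, 512 },
	};
	uint32_t gaps = accumulate_page_gaps(io, 3);

	/*
	 * The request can use the single-IOVA DMA optimization only if no
	 * gap bit falls inside the device's merge boundary mask (for an
	 * IOMMU, typically granule size - 1).
	 */
	uint32_t merge_boundary = PAGE_SIZE - 1;

	printf("gaps=0x%x iova-capable=%s\n", (unsigned)gaps,
	       (gaps & merge_boundary) ? "no" : "yes");
	return 0;
}

If a vector in the middle started or ended mid-page, gaps would pick up those offset bits and the check above would fail for that one request only, which is exactly the per-IO information a queue-wide virt_boundary_mask cannot express.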
* [PATCHv3 1/2] block: accumulate segment page gaps per bio 2025-08-21 20:44 [PATCHv3 0/2] block+nvme: reducing virtual boundary mask reliance Keith Busch @ 2025-08-21 20:44 ` Keith Busch 2025-08-25 13:46 ` Christoph Hellwig 2025-08-21 20:44 ` [PATCHv3 2/2] nvme: remove virtual boundary for sgl capable devices Keith Busch 1 sibling, 1 reply; 13+ messages in thread From: Keith Busch @ 2025-08-21 20:44 UTC (permalink / raw) To: linux-block, linux-nvme; +Cc: hch, axboe, Keith Busch From: Keith Busch <kbusch@kernel.org> The blk-mq dma iteration has an optimization for requests that align to the device's iommu merge boundary. This boundary may be larger than the device's virtual boundary (if any), but the code had been depending on that queue limit to know ahead of time if the request is guaranteed to align to that optimization. Rather than rely on that queue limit, which many devices may not report, store the lowest set bit of any page boundary gap between each segment into the bio while checking the segments. The request turns this into a mask for merging and quickly checking per io if the request can use the iova optimization. Signed-off-by: Keith Busch <kbusch@kernel.org> --- block/bio.c | 1 + block/blk-merge.c | 39 ++++++++++++++++++++++++++++++++++++--- block/blk-mq-dma.c | 3 +-- block/blk-mq.c | 10 ++++++++++ include/linux/blk-mq.h | 2 ++ include/linux/blk_types.h | 8 ++++++++ 6 files changed, 58 insertions(+), 5 deletions(-) diff --git a/block/bio.c b/block/bio.c index f185119b8b40b..8bbb9236c2567 100644 --- a/block/bio.c +++ b/block/bio.c @@ -253,6 +253,7 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table, bio->bi_write_hint = 0; bio->bi_write_stream = 0; bio->bi_status = 0; + bio->bi_pg_bit = 0; bio->bi_iter.bi_sector = 0; bio->bi_iter.bi_size = 0; bio->bi_iter.bi_idx = 0; diff --git a/block/blk-merge.c b/block/blk-merge.c index 35e99dd1c5fd8..7e81a729d5067 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -278,6 +278,12 @@ static unsigned int bio_split_alignment(struct bio *bio, return lim->logical_block_size; } +static inline unsigned int bvec_seg_gap(struct bio_vec *bv, struct bio_vec *bvprv) +{ + return ((bvprv->bv_offset + bvprv->bv_len) & (PAGE_SIZE - 1)) | + bv->bv_offset; +} + /** * bio_split_io_at - check if and where to split a bio * @bio: [in] bio to be split @@ -294,9 +300,9 @@ static unsigned int bio_split_alignment(struct bio *bio, int bio_split_io_at(struct bio *bio, const struct queue_limits *lim, unsigned *segs, unsigned max_bytes, unsigned len_align_mask) { + unsigned nsegs = 0, bytes = 0, page_gaps = 0; struct bio_vec bv, bvprv, *bvprvp = NULL; struct bvec_iter iter; - unsigned nsegs = 0, bytes = 0; bio_for_each_bvec(bv, bio, iter) { if (bv.bv_offset & lim->dma_alignment || @@ -307,8 +313,11 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim, * If the queue doesn't support SG gaps and adding this * offset would create a gap, disallow it. 
*/ - if (bvprvp && bvec_gap_to_prev(lim, bvprvp, bv.bv_offset)) - goto split; + if (bvprvp) { + if (bvec_gap_to_prev(lim, bvprvp, bv.bv_offset)) + goto split; + page_gaps |= bvec_seg_gap(&bv, &bvprv); + } if (nsegs < lim->max_segments && bytes + bv.bv_len <= max_bytes && @@ -326,6 +335,7 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim, } *segs = nsegs; + bio->bi_pg_bit = ffs(page_gaps); return 0; split: if (bio->bi_opf & REQ_ATOMIC) @@ -361,6 +371,7 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim, * big IO can be trival, disable iopoll when split needed. */ bio_clear_polled(bio); + bio->bi_pg_bit = ffs(page_gaps); return bytes >> SECTOR_SHIFT; } EXPORT_SYMBOL_GPL(bio_split_io_at); @@ -697,6 +708,24 @@ static bool blk_atomic_write_mergeable_rqs(struct request *rq, return (rq->cmd_flags & REQ_ATOMIC) == (next->cmd_flags & REQ_ATOMIC); } +static inline unsigned int bio_seg_gap(struct request_queue *q, + struct bio *next, struct bio *prev) +{ + unsigned int page_gaps = 0; + struct bio_vec pb, nb; + + if (prev->bi_pg_bit) + page_gaps |= 1 << (prev->bi_pg_bit - 1); + if (next->bi_pg_bit) + page_gaps |= 1 << (next->bi_pg_bit - 1); + + bio_get_last_bvec(prev, &pb); + bio_get_first_bvec(next, &nb); + if (!biovec_phys_mergeable(q, &pb, &nb)) + page_gaps |= bvec_seg_gap(&nb, &pb); + return page_gaps; +} + /* * For non-mq, this has to be called with the request spinlock acquired. * For mq with scheduling, the appropriate queue wide lock should be held. @@ -761,6 +790,8 @@ static struct request *attempt_merge(struct request_queue *q, if (next->start_time_ns < req->start_time_ns) req->start_time_ns = next->start_time_ns; + req->page_gap |= bio_seg_gap(req->q, next->bio, req->biotail) | + next->page_gap; req->biotail->bi_next = next->bio; req->biotail = next->biotail; @@ -884,6 +915,7 @@ enum bio_merge_status bio_attempt_back_merge(struct request *req, if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING) blk_zone_write_plug_bio_merged(bio); + req->page_gap |= bio_seg_gap(req->q, bio, req->biotail); req->biotail->bi_next = bio; req->biotail = bio; req->__data_len += bio->bi_iter.bi_size; @@ -918,6 +950,7 @@ static enum bio_merge_status bio_attempt_front_merge(struct request *req, blk_update_mixed_merge(req, bio, true); + req->page_gap |= bio_seg_gap(req->q, req->bio, bio); bio->bi_next = req->bio; req->bio = bio; diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c index 660b5e200ccf6..ee542ef28de0c 100644 --- a/block/blk-mq-dma.c +++ b/block/blk-mq-dma.c @@ -79,8 +79,7 @@ static bool blk_map_iter_next(struct request *req, struct blk_map_iter *iter, static inline bool blk_can_dma_map_iova(struct request *req, struct device *dma_dev) { - return !((queue_virt_boundary(req->q) + 1) & - dma_get_merge_boundary(dma_dev)); + return !(req->page_gap & dma_get_merge_boundary(dma_dev)); } static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec) diff --git a/block/blk-mq.c b/block/blk-mq.c index b67d6c02ecebd..8d1cde13742d4 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -376,6 +376,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq) INIT_LIST_HEAD(&rq->queuelist); rq->q = q; rq->__sector = (sector_t) -1; + rq->page_gap = 0; INIT_HLIST_NODE(&rq->hash); RB_CLEAR_NODE(&rq->rb_node); rq->tag = BLK_MQ_NO_TAG; @@ -659,6 +660,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf, goto out_queue_exit; } rq->__data_len = 0; + rq->page_gap = 0; rq->__sector = (sector_t) -1; rq->bio = rq->biotail = NULL; return rq; @@ -739,6 
+741,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, rq = blk_mq_rq_ctx_init(&data, blk_mq_tags_from_data(&data), tag); blk_mq_rq_time_init(rq, alloc_time_ns); rq->__data_len = 0; + rq->page_gap = 0; rq->__sector = (sector_t) -1; rq->bio = rq->biotail = NULL; return rq; @@ -2665,6 +2668,12 @@ static void blk_mq_bio_to_request(struct request *rq, struct bio *bio, rq->bio = rq->biotail = bio; rq->__sector = bio->bi_iter.bi_sector; rq->__data_len = bio->bi_iter.bi_size; + + if (bio->bi_pg_bit) + rq->page_gap = 1 << (bio->bi_pg_bit - 1); + else + rq->page_gap = 0; + rq->nr_phys_segments = nr_segs; if (bio_integrity(bio)) rq->nr_integrity_segments = blk_rq_count_integrity_sg(rq->q, @@ -3370,6 +3379,7 @@ int blk_rq_prep_clone(struct request *rq, struct request *rq_src, } rq->nr_phys_segments = rq_src->nr_phys_segments; rq->nr_integrity_segments = rq_src->nr_integrity_segments; + rq->page_gap = rq_src->page_gap; if (rq->bio && blk_crypto_rq_bio_prep(rq, rq->bio, gfp_mask) < 0) goto free_and_out; diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index 2a5a828f19a0b..5ef0cef8823be 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -152,6 +152,8 @@ struct request { unsigned short nr_phys_segments; unsigned short nr_integrity_segments; + unsigned int page_gap; + #ifdef CONFIG_BLK_INLINE_ENCRYPTION struct bio_crypt_ctx *crypt_ctx; struct blk_crypto_keyslot *crypt_keyslot; diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 930daff207df2..0bfea2d4cd03a 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -222,6 +222,14 @@ struct bio { enum rw_hint bi_write_hint; u8 bi_write_stream; blk_status_t bi_status; + + /* + * The page gap bit indicates the lowest set bit in any page address + * offset between all bi_io_vecs. This field is initialized only after + * splitting to the hardware limits. + */ + u8 bi_pg_bit; + atomic_t __bi_remaining; struct bvec_iter bi_iter; -- 2.47.3 ^ permalink raw reply related [flat|nested] 13+ messages in thread
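[Editor's sketch] One detail worth spelling out, and which the review that follows asks about: the bio only stores ffs(page_gaps) in an 8-bit field rather than the whole gap word. That is lossless for how the value is consumed, because every consumer ANDs the reconstructed word against a power-of-two-minus-one mask (the IOMMU merge boundary here, or NVME_CTRL_PAGE_SIZE - 1 in patch 2), and for such masks only the position of the lowest set bit decides the outcome. A small brute-force check of that equivalence, written as plain user-space C rather than kernel code:

#include <assert.h>
#include <stdint.h>
#include <strings.h>	/* ffs() */

/*
 * For any power-of-two-minus-one boundary mask, testing the full gap word
 * and testing only its lowest set bit give the same answer.
 */
int main(void)
{
	for (uint32_t gaps = 0; gaps < (1u << 16); gaps++) {
		int bit = ffs(gaps);	/* 0 when gaps == 0 */
		uint32_t lowest = bit ? 1u << (bit - 1) : 0;

		for (int k = 0; k <= 16; k++) {
			uint32_t mask = (1u << k) - 1;

			assert(!!(gaps & mask) == !!(lowest & mask));
		}
	}
	return 0;
}

If the boundary masks were not of this power-of-two form, the compression would not hold; that assumption is exactly what the discussion below circles around.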
* Re: [PATCHv3 1/2] block: accumulate segment page gaps per bio 2025-08-21 20:44 ` [PATCHv3 1/2] block: accumulate segment page gaps per bio Keith Busch @ 2025-08-25 13:46 ` Christoph Hellwig 2025-08-25 14:10 ` Keith Busch 0 siblings, 1 reply; 13+ messages in thread From: Christoph Hellwig @ 2025-08-25 13:46 UTC (permalink / raw) To: Keith Busch; +Cc: linux-block, linux-nvme, hch, axboe, Keith Busch On Thu, Aug 21, 2025 at 01:44:19PM -0700, Keith Busch wrote: > +static inline unsigned int bvec_seg_gap(struct bio_vec *bv, struct bio_vec *bvprv) Nit: overly long line. > +{ > + return ((bvprv->bv_offset + bvprv->bv_len) & (PAGE_SIZE - 1)) | > + bv->bv_offset; But what's actually more important is a good name, and a good comment. Without much of an explanation this just looks like black magic :) Also use the chance to document why all this is PAGE_SIZE based and not based on either the iommu granule size or the virt boundary. > + if (bvprvp) { > + if (bvec_gap_to_prev(lim, bvprvp, bv.bv_offset)) > + goto split; > + page_gaps |= bvec_seg_gap(&bv, &bvprv); > + } > > if (nsegs < lim->max_segments && > bytes + bv.bv_len <= max_bytes && > @@ -326,6 +335,7 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim, > } > > *segs = nsegs; > + bio->bi_pg_bit = ffs(page_gaps); Calling this "bit" feels odd. I guess the idea is that you only care about power of two alignments? I think this would be much easier with the whole theory of operation spelled out somewhere in detail, including why the compression to the set bits works, why the PAGE granularity matters, why we only need to set this bit when splitting but not on bios that never get split or at least looped over for splitting decisions. > enum rw_hint bi_write_hint; > u8 bi_write_stream; > blk_status_t bi_status; > + > + /* > + * The page gap bit indicates the lowest set bit in any page address > + * offset between all bi_io_vecs. This field is initialized only after > + * splitting to the hardware limits. > + */ > + u8 bi_pg_bit; Maybe move this one up so that all the fields only set on the submission side stay together? ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCHv3 1/2] block: accumulate segment page gaps per bio 2025-08-25 13:46 ` Christoph Hellwig @ 2025-08-25 14:10 ` Keith Busch 2025-08-26 13:03 ` Christoph Hellwig 0 siblings, 1 reply; 13+ messages in thread From: Keith Busch @ 2025-08-25 14:10 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Keith Busch, linux-block, linux-nvme, hch, axboe On Mon, Aug 25, 2025 at 06:46:50AM -0700, Christoph Hellwig wrote: > On Thu, Aug 21, 2025 at 01:44:19PM -0700, Keith Busch wrote: > > Also use the chance to document why all this is PAGE_SIZE based and > not based on either the iommu granule size or the virt boundary. This is a good opportunity to double check my assumptions: PAGE_SIZEs, iommu granules, and virt boundaries are all power-of-two values, and PAGE_SIZE is always the largest (or tied for largest) of these. If that's accurate, storing the lowest page offset is sufficient to cover all the boundary masks. If that's not accurate, then this kind of falls apart. I didn't find anything enforcing this assumption, but I can't imagine it would make sense for a device to require the virtual boundary to be larger than the page size. It'd be difficult to do IO to that. I also know the iommu granule may be different from PAGE_SIZE too, but it's always the smaller of the two if they are not the same. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCHv3 1/2] block: accumulate segment page gaps per bio 2025-08-25 14:10 ` Keith Busch @ 2025-08-26 13:03 ` Christoph Hellwig 2025-08-26 13:47 ` Keith Busch 0 siblings, 1 reply; 13+ messages in thread From: Christoph Hellwig @ 2025-08-26 13:03 UTC (permalink / raw) To: Keith Busch Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme, axboe, iommu On Mon, Aug 25, 2025 at 08:10:59AM -0600, Keith Busch wrote: > On Mon, Aug 25, 2025 at 06:46:50AM -0700, Christoph Hellwig wrote: > > On Thu, Aug 21, 2025 at 01:44:19PM -0700, Keith Busch wrote: > > > > Also use the chance to document why all this is PAGE_SIZE based and > > not based on either the iommu granule size or the virt boundary. > > This is a good opportunity to double check my assumptions: Always a good idea! > > PAGE_SIZEs, iommu granules, and virt boundaries are all power-of-two > values, and PAGE_SIZE is always the largest (or tied for largest) of > these. I just had an offlist conversation with someone trying to make a nvme device with a virt boundary larger than PAGE_SIZE work. No idea where that device came from. I think IOMMU granule > PAGE_SIZE would be painful, but adding the IOMMU list to double check. It would also be really good to document all these assumptions with both comments and asserts so that we immediately see when they are violated. ^ permalink raw reply [flat|nested] 13+ messages in thread
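[Editor's sketch] A sanity check along the lines requested above could be as small as the snippet below. This is only an illustration of the idea, not part of the posted series: the function name and its placement (somewhere on the queue or DMA setup path) are assumptions, while queue_virt_boundary(), dma_get_merge_boundary() and WARN_ON_ONCE() are existing helpers the series already relies on.

/*
 * Illustrative only: warn if either boundary mask is ever coarser than the
 * PAGE_SIZE granularity that bi_pg_bit/page_gap are accumulated at, since
 * that would invalidate the lowest-set-bit bookkeeping.
 */
static void blk_check_gap_granularity(struct request_queue *q,
				      struct device *dma_dev)
{
	WARN_ON_ONCE(queue_virt_boundary(q) > PAGE_SIZE - 1);
	WARN_ON_ONCE(dma_get_merge_boundary(dma_dev) > PAGE_SIZE - 1);
}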
* Re: [PATCHv3 1/2] block: accumulate segment page gaps per bio 2025-08-26 13:03 ` Christoph Hellwig @ 2025-08-26 13:47 ` Keith Busch 2025-08-26 13:57 ` Christoph Hellwig 0 siblings, 1 reply; 13+ messages in thread From: Keith Busch @ 2025-08-26 13:47 UTC (permalink / raw) To: Christoph Hellwig Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme, axboe, iommu On Tue, Aug 26, 2025 at 03:03:44PM +0200, Christoph Hellwig wrote: > On Mon, Aug 25, 2025 at 08:10:59AM -0600, Keith Busch wrote: > > > > PAGE_SIZEs, iommu granules, and virt boundaries are all power-of-two > > values, and PAGE_SIZE is always the largest (or tied for largest) of > > these. > > I just had an offlist conversation with someone trying to make a nvme > device with a virt boundary larger than PAGE_SIZE work. No idea > where that device came from. Currently, the virtual boundary is always compared to bv_offset, which is a page offset. If the virtual boundary is larger than a page, then we need something like "page_to_phys(bv.bv_page) + bv.bv_offset" every place we need to check against the virt boundary. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCHv3 1/2] block: accumulate segment page gaps per bio 2025-08-26 13:47 ` Keith Busch @ 2025-08-26 13:57 ` Christoph Hellwig 2025-08-26 22:33 ` Keith Busch 0 siblings, 1 reply; 13+ messages in thread From: Christoph Hellwig @ 2025-08-26 13:57 UTC (permalink / raw) To: Keith Busch Cc: Christoph Hellwig, Christoph Hellwig, Keith Busch, linux-block, linux-nvme, axboe, iommu On Tue, Aug 26, 2025 at 07:47:46AM -0600, Keith Busch wrote: > Currently, the virtual boundary is always compared to bv_offset, which > is a page offset. If the virtual boundary is larger than a page, then we > need something like "page_to_phys(bv.bv_page) + bv.bv_offset" every > place we need to check against the virt boundary. bv_offset is only guaranteed to be a page offset if you use bio_for_each_segment(_all) or the low-level helpers implementing it and not bio_for_each_bvec(_all), where it can be much larger than PAGE_SIZE. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCHv3 1/2] block: accumulate segment page gaps per bio 2025-08-26 13:57 ` Christoph Hellwig @ 2025-08-26 22:33 ` Keith Busch 2025-08-27 7:37 ` Christoph Hellwig 0 siblings, 1 reply; 13+ messages in thread From: Keith Busch @ 2025-08-26 22:33 UTC (permalink / raw) To: Christoph Hellwig Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme, axboe, iommu On Tue, Aug 26, 2025 at 03:57:34PM +0200, Christoph Hellwig wrote: > On Tue, Aug 26, 2025 at 07:47:46AM -0600, Keith Busch wrote: > > Currently, the virtual boundary is always compared to bv_offset, which > > is a page offset. If the virtual boundary is larger than a page, then we > > need something like "page_to_phys(bv.bv_page) + bv.bv_offset" every > > place we need to check against the virt boundary. > > bv_offset is only guaranteed to be a page offset if your use > bio_for_each_segment(_all) or the low-level helpers implementing > it and not bio_for_each_bvec(_all) where it can be much larger > than PAGE_SIZE. Yes, good point. So we'd have a folio offset when it's not a single page, but I don't think we want to special case large folios for every virt boundary check. It's looking like replace bvec's "page + offset" with phys addrs, yeah?! ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCHv3 1/2] block: accumulate segment page gaps per bio 2025-08-26 22:33 ` Keith Busch @ 2025-08-27 7:37 ` Christoph Hellwig 2025-08-30 1:47 ` Keith Busch 0 siblings, 1 reply; 13+ messages in thread From: Christoph Hellwig @ 2025-08-27 7:37 UTC (permalink / raw) To: Keith Busch Cc: Christoph Hellwig, Christoph Hellwig, Keith Busch, linux-block, linux-nvme, axboe, iommu On Tue, Aug 26, 2025 at 04:33:15PM -0600, Keith Busch wrote: > On Tue, Aug 26, 2025 at 03:57:34PM +0200, Christoph Hellwig wrote: > > On Tue, Aug 26, 2025 at 07:47:46AM -0600, Keith Busch wrote: > > > Currently, the virtual boundary is always compared to bv_offset, which > > > is a page offset. If the virtual boundary is larger than a page, then we > > > need something like "page_to_phys(bv.bv_page) + bv.bv_offset" every > > > place we need to check against the virt boundary. > > > > bv_offset is only guaranteed to be a page offset if your use > > bio_for_each_segment(_all) or the low-level helpers implementing > > it and not bio_for_each_bvec(_all) where it can be much larger > > than PAGE_SIZE. > > Yes, good point. So we'd have a folio offset when it's not a single > page, but I don't think we want to special case large folios for every > virt boundary check. It's looking like replace bvec's "page + offset" > with phys addrs, yeah?! Basically everything should be using physical address. The page + offset is just a weird and inefficient way to represent that and we really need to get rid of it. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCHv3 1/2] block: accumulate segment page gaps per bio 2025-08-27 7:37 ` Christoph Hellwig @ 2025-08-30 1:47 ` Keith Busch 2025-09-02 5:36 ` Christoph Hellwig 0 siblings, 1 reply; 13+ messages in thread From: Keith Busch @ 2025-08-30 1:47 UTC (permalink / raw) To: Christoph Hellwig Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme, axboe, iommu On Wed, Aug 27, 2025 at 09:37:09AM +0200, Christoph Hellwig wrote: > On Tue, Aug 26, 2025 at 04:33:15PM -0600, Keith Busch wrote: > > virt boundary check. It's looking like replace bvec's "page + offset" > > with phys addrs, yeah?! > > Basically everything should be using physical address. The page + offset > is just a weird and inefficient way to represent that and we really > need to get rid of it. I was plowing ahead with converting to phys addrs only to discover skb_frag_t overlays a bvec with tightly coupled expectations on its layout. I'm not comfortable right now messing with that type. I think it may need to be decoupled to proceed on this path. :( ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCHv3 1/2] block: accumulate segment page gaps per bio 2025-08-30 1:47 ` Keith Busch @ 2025-09-02 5:36 ` Christoph Hellwig 0 siblings, 0 replies; 13+ messages in thread From: Christoph Hellwig @ 2025-09-02 5:36 UTC (permalink / raw) To: Keith Busch Cc: Christoph Hellwig, Christoph Hellwig, Keith Busch, linux-block, linux-nvme, axboe, iommu, willy, netdev, linux-mm On Fri, Aug 29, 2025 at 07:47:40PM -0600, Keith Busch wrote: > On Wed, Aug 27, 2025 at 09:37:09AM +0200, Christoph Hellwig wrote: > > On Tue, Aug 26, 2025 at 04:33:15PM -0600, Keith Busch wrote: > > > virt boundary check. It's looking like replace bvec's "page + offset" > > > with phys addrs, yeah?! > > > > Basically everything should be using physical address. The page + offset > > is just a weird and inefficient way to represent that and we really > > need to get rid of it. > > I was plowing ahead with converting to phys addrs only to discover > skb_frag_t overlays a bvec with tightly coupled expectations on its > layout. I'm not comfortable right now messing with that type. I think it > may need to be decoupled to proceed on this path. :( willy got really angry at this for all the right reasons, and we still need it fixed. Can we just revert the unreviewed crap the network folks did here to move the kernel forward? ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCHv3 2/2] nvme: remove virtual boundary for sgl capable devices 2025-08-21 20:44 [PATCHv3 0/2] block+nvme: reducing virtual boundary mask reliance Keith Busch 2025-08-21 20:44 ` [PATCHv3 1/2] block: accumulate segment page gaps per bio Keith Busch @ 2025-08-21 20:44 ` Keith Busch 2025-08-25 13:49 ` Christoph Hellwig 1 sibling, 1 reply; 13+ messages in thread From: Keith Busch @ 2025-08-21 20:44 UTC (permalink / raw) To: linux-block, linux-nvme; +Cc: hch, axboe, Keith Busch From: Keith Busch <kbusch@kernel.org> The nvme virtual boundary is only required for the PRP format. Devices that can use SGL for DMA don't need it for IO queues. Drop reporting it for such PCIe devices; fabrics controllers will continue to use the limit as they currently don't report any boundary requirements. Applications may continue to align to the same virtual boundaries for optimization purposes if they want, and the driver will continue to decide whether to use the PRP format the same as before if the IO allows it. Signed-off-by: Keith Busch <kbusch@kernel.org> --- drivers/nvme/host/core.c | 21 ++++++++++++++++----- drivers/nvme/host/pci.c | 16 +++++++++++++--- 2 files changed, 29 insertions(+), 8 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 812c1565114fd..cce1fbf2becd8 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -2065,13 +2065,24 @@ static u32 nvme_max_drv_segments(struct nvme_ctrl *ctrl) } static void nvme_set_ctrl_limits(struct nvme_ctrl *ctrl, - struct queue_limits *lim) + struct queue_limits *lim, bool admin) { lim->max_hw_sectors = ctrl->max_hw_sectors; lim->max_segments = min_t(u32, USHRT_MAX, min_not_zero(nvme_max_drv_segments(ctrl), ctrl->max_segments)); lim->max_integrity_segments = ctrl->max_integrity_segments; - lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1; + + /* + * The virtual boundary mask is not necessary for PCI controllers that + * support SGL for DMA. It's only necessary when using PRP. Admin + * queues only support PRP, and fabrics drivers currently don't report + * what boundaries they require, so set the virtual boundary for + * either. 
+ */ + if (!nvme_ctrl_sgl_supported(ctrl) || admin || + ctrl->ops->flags & NVME_F_FABRICS) + lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1; + lim->max_segment_size = UINT_MAX; lim->dma_alignment = 3; } @@ -2173,7 +2184,7 @@ static int nvme_update_ns_info_generic(struct nvme_ns *ns, int ret; lim = queue_limits_start_update(ns->disk->queue); - nvme_set_ctrl_limits(ns->ctrl, &lim); + nvme_set_ctrl_limits(ns->ctrl, &lim, false); memflags = blk_mq_freeze_queue(ns->disk->queue); ret = queue_limits_commit_update(ns->disk->queue, &lim); @@ -2377,7 +2388,7 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns, ns->head->lba_shift = id->lbaf[lbaf].ds; ns->head->nuse = le64_to_cpu(id->nuse); capacity = nvme_lba_to_sect(ns->head, le64_to_cpu(id->nsze)); - nvme_set_ctrl_limits(ns->ctrl, &lim); + nvme_set_ctrl_limits(ns->ctrl, &lim, false); nvme_configure_metadata(ns->ctrl, ns->head, id, nvm, info); nvme_set_chunk_sectors(ns, id, &lim); if (!nvme_update_disk_info(ns, id, &lim)) @@ -3580,7 +3591,7 @@ static int nvme_init_identify(struct nvme_ctrl *ctrl) min_not_zero(ctrl->max_hw_sectors, max_hw_sectors); lim = queue_limits_start_update(ctrl->admin_q); - nvme_set_ctrl_limits(ctrl, &lim); + nvme_set_ctrl_limits(ctrl, &lim, true); ret = queue_limits_commit_update(ctrl->admin_q, &lim); if (ret) goto out_free; diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index d8a9dee55de33..6e848aa4531b6 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -619,9 +619,19 @@ static inline enum nvme_use_sgl nvme_pci_use_sgls(struct nvme_dev *dev, struct nvme_queue *nvmeq = req->mq_hctx->driver_data; if (nvmeq->qid && nvme_ctrl_sgl_supported(&dev->ctrl)) { - if (nvme_req(req)->flags & NVME_REQ_USERCMD) - return SGL_FORCED; - if (req->nr_integrity_segments > 1) + /* + * A request with page gaps within the controller's mask can + * not use the PRP format. + * + * We force user commands to use SGL because that lets the + * device validate the requested transfer lengths. + * + * Multiple integrity segments must use SGL as that's the only + * way to describe such a command. + */ + if (req->page_gap & (NVME_CTRL_PAGE_SIZE - 1) || + nvme_req(req)->flags & NVME_REQ_USERCMD || + req->nr_integrity_segments > 1) return SGL_FORCED; return SGL_SUPPORTED; } -- 2.47.3 ^ permalink raw reply related [flat|nested] 13+ messages in thread
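[Editor's sketch] The decision the nvme_pci_use_sgls() hunk above makes can be summarized in a stand-alone form. The snippet below is a deliberately simplified model, not the driver code: CTRL_PAGE_SIZE stands in for NVME_CTRL_PAGE_SIZE, the other names are illustrative, and the real function additionally distinguishes forced from merely supported SGL use so the submission path can still prefer PRP for small, well-aligned transfers.

#include <stdbool.h>
#include <stdint.h>

#define CTRL_PAGE_SIZE 4096u	/* models NVME_CTRL_PAGE_SIZE */

enum xfer_format { USE_PRP, USE_SGL };

/*
 * PRP requires every data pointer after the first to start on a controller
 * page boundary and every entry except the last to run to one, so any gap
 * bit inside the controller page mask rules PRP out.  User commands are
 * steered to SGL so the device can validate transfer lengths, and multiple
 * integrity segments can only be described with SGL.
 */
static enum xfer_format choose_format(bool sgl_supported, uint32_t page_gap,
				      bool user_cmd, int integrity_segments)
{
	if (!sgl_supported)
		return USE_PRP;
	if ((page_gap & (CTRL_PAGE_SIZE - 1)) || user_cmd ||
	    integrity_segments > 1)
		return USE_SGL;
	return USE_PRP;	/* SGL is allowed, but PRP remains an option */
}

int main(void)
{
	/* A request whose only gap bit is at offset 512 cannot use PRP. */
	return choose_format(true, 512, false, 1) == USE_SGL ? 0 : 1;
}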
* Re: [PATCHv3 2/2] nvme: remove virtual boundary for sgl capable devices 2025-08-21 20:44 ` [PATCHv3 2/2] nvme: remove virtual boundary for sgl capable devices Keith Busch @ 2025-08-25 13:49 ` Christoph Hellwig 0 siblings, 0 replies; 13+ messages in thread From: Christoph Hellwig @ 2025-08-25 13:49 UTC (permalink / raw) To: Keith Busch; +Cc: linux-block, linux-nvme, hch, axboe, Keith Busch On Thu, Aug 21, 2025 at 01:44:20PM -0700, Keith Busch wrote: > + /* > + * The virtual boundary mask is not necessary for PCI controllers that > + * support SGL for DMA. It's only necessary when using PRP. Admin > + * queues only support PRP, and fabrics drivers currently don't report > + * what boundaries they require, so set the virtual boundary for > + * either. > + */ > + if (!nvme_ctrl_sgl_supported(ctrl) || admin || > + ctrl->ops->flags & NVME_F_FABRICS) > + lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1; Fabrics itself never needs the virt boundary. And for TCP, which is a software-only transport, I think we can just do away with it. For FC I suspect we can do away with it as well, as all the FC HBAs support proper SGLs. For RDMA the standard MR methods do require the virtual boundary, but somewhat recent Mellanox / Nvidia hardware does not. No need for you to update all these, but I think having the transport advertise the capability is probably better than a bunch of random conditions in the core code. ^ permalink raw reply [flat|nested] 13+ messages in thread