* [PATCHv3 0/2] block+nvme: reducing virtual boundary mask reliance @ 2025-08-21 20:44 Keith Busch 2025-08-21 20:44 ` [PATCHv3 1/2] block: accumulate segment page gaps per bio Keith Busch 2025-08-21 20:44 ` [PATCHv3 2/2] nvme: remove virtual boundary for sgl capable devices Keith Busch 0 siblings, 2 replies; 13+ messages in thread From: Keith Busch @ 2025-08-21 20:44 UTC (permalink / raw) To: linux-block, linux-nvme; +Cc: hch, axboe, Keith Busch From: Keith Busch <kbusch@kernel.org> Previous version is here: https://lore.kernel.org/linux-nvme/20250805195608.2379107-1-kbusch@meta.com/ This patch set depends on this unmerged series for flexible direct-io here: https://lore.kernel.org/linux-block/20250819164922.640964-1-kbusch@meta.com/ The purpose of this is to allow optimization decisions to happen per IO. The virtual boundary that NVMe reports provides specific guarantees about data alignment, but it might not be large enough for some CPU architectures to take advantage of, even if an application uses aligned data buffers that could benefit from it. At the same time, the virtual boundary prevents the driver from directly using memory in ways the hardware may be capable of accessing. This forces applications to unnecessarily double-buffer their data into the more restrictive virtually contiguous format. This patch series provides an efficient way to track page boundary gaps per IO so that these optimization decisions can be made per IO. This provides the flexibility to use hardware to its full ability, beyond what the virtual boundary mask can express. Note that abusing this capability may result in worse performance than the bounce buffer solutions. Sending a bunch of tiny vectors for one IO incurs significant protocol overhead, so while this patch set allows you to do that, I recommend that you don't. We can't enforce a minimum size, though, because vectors may straddle pages with only a few words in the first and/or last pages, which we do need to support. Changes from v2: - We only need to know about the lowest set bit in any bio vector page gap. Use that to avoid increasing the bio size - Fixed back merges; the previous version could miss one of the bio's gaps - Use pointers instead of relying on the inline to generate good code. - Trivial name changes - Comments explaining the new bio field, and the nvme usage for deciding on SGL vs PRP DMA. Keith Busch (2): block: accumulate segment page gaps per bio nvme: remove virtual boundary for sgl capable devices block/bio.c | 1 + block/blk-merge.c | 39 ++++++++++++++++++++++++++++++++++++--- block/blk-mq-dma.c | 3 +-- block/blk-mq.c | 10 ++++++++++ drivers/nvme/host/core.c | 21 ++++++++++++++++----- drivers/nvme/host/pci.c | 16 +++++++++++++--- include/linux/blk-mq.h | 2 ++ include/linux/blk_types.h | 8 ++++++++ 8 files changed, 87 insertions(+), 13 deletions(-) -- 2.47.3 ^ permalink raw reply [flat|nested] 13+ messages in thread
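[Editor's sketch] To make the per-IO gap tracking described in the cover letter concrete, here is a rough user-space sketch of the accumulation the series performs. It uses a simplified struct vec in place of struct bio_vec; the names and the stand-alone setting are illustrative assumptions, not the kernel code itself. The idea is to OR together, for every pair of adjacent vectors, the in-page offset where the previous vector ends and the in-page offset where the next one begins; the resulting word is later tested against a device's merge boundary mask.

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Simplified stand-in for a bio_vec: one physically contiguous chunk. */
struct vec {
	uint64_t addr;	/* physical address of the chunk */
	uint32_t len;	/* length in bytes */
};

/*
 * OR together every in-page offset at which adjacent vectors break page
 * continuity: where the previous vector ends inside a page and where the
 * next one starts inside a page.
 */
static uint32_t accumulate_page_gaps(const struct vec *v, int n)
{
	uint32_t gaps = 0;

	for (int i = 1; i < n; i++) {
		uint32_t prev_end = (v[i - 1].addr + v[i - 1].len) & (PAGE_SIZE - 1);
		uint32_t next_off = v[i].addr & (PAGE_SIZE - 1);

		gaps |= prev_end | next_off;
	}
	return gaps;
}

int main(void)
{
	/* Two full pages followed by a 512-byte tail in a third page. */
	struct vec io[] = {
		{ 0x100000, PAGE_SIZE },
		{ 0x200000, PAGE_SIZE },
		{ 0x300000, 512 },
	};
	uint32_t gaps = accumulate_page_gaps(io, 3);

	/*
	 * The request can use the single-IOVA DMA optimization only if no
	 * gap bit falls inside the device's merge boundary mask (for an
	 * IOMMU, typically granule size - 1).
	 */
	uint32_t merge_boundary = PAGE_SIZE - 1;

	printf("gaps=0x%x iova-capable=%s\n", (unsigned)gaps,
	       (gaps & merge_boundary) ? "no" : "yes");
	return 0;
}

If a vector in the middle started or ended mid-page, gaps would pick up those offset bits and the check above would fail for that one request only, which is exactly the per-IO information a queue-wide virt_boundary_mask cannot express.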
* [PATCHv3 1/2] block: accumulate segment page gaps per bio 2025-08-21 20:44 [PATCHv3 0/2] block+nvme: reducing virtual boundary mask reliance Keith Busch @ 2025-08-21 20:44 ` Keith Busch 2025-08-25 13:46 ` Christoph Hellwig 2025-08-21 20:44 ` [PATCHv3 2/2] nvme: remove virtual boundary for sgl capable devices Keith Busch 1 sibling, 1 reply; 13+ messages in thread From: Keith Busch @ 2025-08-21 20:44 UTC (permalink / raw) To: linux-block, linux-nvme; +Cc: hch, axboe, Keith Busch From: Keith Busch <kbusch@kernel.org> The blk-mq dma iteration has an optimization for requests that align to the device's iommu merge boundary. This boundary may be larger than the device's virtual boundary (if any), but the code had been depending on that queue limit to know ahead of time if the request is guaranteed to align to that optimization. Rather than rely on that queue limit, which many devices may not report, store the lowest set bit of any page boundary gap between each segment into the bio while checking the segments. The request turns this into a mask for merging and quickly checking per io if the request can use the iova optimization. Signed-off-by: Keith Busch <kbusch@kernel.org> --- block/bio.c | 1 + block/blk-merge.c | 39 ++++++++++++++++++++++++++++++++++++--- block/blk-mq-dma.c | 3 +-- block/blk-mq.c | 10 ++++++++++ include/linux/blk-mq.h | 2 ++ include/linux/blk_types.h | 8 ++++++++ 6 files changed, 58 insertions(+), 5 deletions(-) diff --git a/block/bio.c b/block/bio.c index f185119b8b40b..8bbb9236c2567 100644 --- a/block/bio.c +++ b/block/bio.c @@ -253,6 +253,7 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table, bio->bi_write_hint = 0; bio->bi_write_stream = 0; bio->bi_status = 0; + bio->bi_pg_bit = 0; bio->bi_iter.bi_sector = 0; bio->bi_iter.bi_size = 0; bio->bi_iter.bi_idx = 0; diff --git a/block/blk-merge.c b/block/blk-merge.c index 35e99dd1c5fd8..7e81a729d5067 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -278,6 +278,12 @@ static unsigned int bio_split_alignment(struct bio *bio, return lim->logical_block_size; } +static inline unsigned int bvec_seg_gap(struct bio_vec *bv, struct bio_vec *bvprv) +{ + return ((bvprv->bv_offset + bvprv->bv_len) & (PAGE_SIZE - 1)) | + bv->bv_offset; +} + /** * bio_split_io_at - check if and where to split a bio * @bio: [in] bio to be split @@ -294,9 +300,9 @@ static unsigned int bio_split_alignment(struct bio *bio, int bio_split_io_at(struct bio *bio, const struct queue_limits *lim, unsigned *segs, unsigned max_bytes, unsigned len_align_mask) { + unsigned nsegs = 0, bytes = 0, page_gaps = 0; struct bio_vec bv, bvprv, *bvprvp = NULL; struct bvec_iter iter; - unsigned nsegs = 0, bytes = 0; bio_for_each_bvec(bv, bio, iter) { if (bv.bv_offset & lim->dma_alignment || @@ -307,8 +313,11 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim, * If the queue doesn't support SG gaps and adding this * offset would create a gap, disallow it. 
*/ - if (bvprvp && bvec_gap_to_prev(lim, bvprvp, bv.bv_offset)) - goto split; + if (bvprvp) { + if (bvec_gap_to_prev(lim, bvprvp, bv.bv_offset)) + goto split; + page_gaps |= bvec_seg_gap(&bv, &bvprv); + } if (nsegs < lim->max_segments && bytes + bv.bv_len <= max_bytes && @@ -326,6 +335,7 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim, } *segs = nsegs; + bio->bi_pg_bit = ffs(page_gaps); return 0; split: if (bio->bi_opf & REQ_ATOMIC) @@ -361,6 +371,7 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim, * big IO can be trival, disable iopoll when split needed. */ bio_clear_polled(bio); + bio->bi_pg_bit = ffs(page_gaps); return bytes >> SECTOR_SHIFT; } EXPORT_SYMBOL_GPL(bio_split_io_at); @@ -697,6 +708,24 @@ static bool blk_atomic_write_mergeable_rqs(struct request *rq, return (rq->cmd_flags & REQ_ATOMIC) == (next->cmd_flags & REQ_ATOMIC); } +static inline unsigned int bio_seg_gap(struct request_queue *q, + struct bio *next, struct bio *prev) +{ + unsigned int page_gaps = 0; + struct bio_vec pb, nb; + + if (prev->bi_pg_bit) + page_gaps |= 1 << (prev->bi_pg_bit - 1); + if (next->bi_pg_bit) + page_gaps |= 1 << (next->bi_pg_bit - 1); + + bio_get_last_bvec(prev, &pb); + bio_get_first_bvec(next, &nb); + if (!biovec_phys_mergeable(q, &pb, &nb)) + page_gaps |= bvec_seg_gap(&nb, &pb); + return page_gaps; +} + /* * For non-mq, this has to be called with the request spinlock acquired. * For mq with scheduling, the appropriate queue wide lock should be held. @@ -761,6 +790,8 @@ static struct request *attempt_merge(struct request_queue *q, if (next->start_time_ns < req->start_time_ns) req->start_time_ns = next->start_time_ns; + req->page_gap |= bio_seg_gap(req->q, next->bio, req->biotail) | + next->page_gap; req->biotail->bi_next = next->bio; req->biotail = next->biotail; @@ -884,6 +915,7 @@ enum bio_merge_status bio_attempt_back_merge(struct request *req, if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING) blk_zone_write_plug_bio_merged(bio); + req->page_gap |= bio_seg_gap(req->q, bio, req->biotail); req->biotail->bi_next = bio; req->biotail = bio; req->__data_len += bio->bi_iter.bi_size; @@ -918,6 +950,7 @@ static enum bio_merge_status bio_attempt_front_merge(struct request *req, blk_update_mixed_merge(req, bio, true); + req->page_gap |= bio_seg_gap(req->q, req->bio, bio); bio->bi_next = req->bio; req->bio = bio; diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c index 660b5e200ccf6..ee542ef28de0c 100644 --- a/block/blk-mq-dma.c +++ b/block/blk-mq-dma.c @@ -79,8 +79,7 @@ static bool blk_map_iter_next(struct request *req, struct blk_map_iter *iter, static inline bool blk_can_dma_map_iova(struct request *req, struct device *dma_dev) { - return !((queue_virt_boundary(req->q) + 1) & - dma_get_merge_boundary(dma_dev)); + return !(req->page_gap & dma_get_merge_boundary(dma_dev)); } static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec) diff --git a/block/blk-mq.c b/block/blk-mq.c index b67d6c02ecebd..8d1cde13742d4 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -376,6 +376,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq) INIT_LIST_HEAD(&rq->queuelist); rq->q = q; rq->__sector = (sector_t) -1; + rq->page_gap = 0; INIT_HLIST_NODE(&rq->hash); RB_CLEAR_NODE(&rq->rb_node); rq->tag = BLK_MQ_NO_TAG; @@ -659,6 +660,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf, goto out_queue_exit; } rq->__data_len = 0; + rq->page_gap = 0; rq->__sector = (sector_t) -1; rq->bio = rq->biotail = NULL; return rq; @@ -739,6 
+741,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, rq = blk_mq_rq_ctx_init(&data, blk_mq_tags_from_data(&data), tag); blk_mq_rq_time_init(rq, alloc_time_ns); rq->__data_len = 0; + rq->page_gap = 0; rq->__sector = (sector_t) -1; rq->bio = rq->biotail = NULL; return rq; @@ -2665,6 +2668,12 @@ static void blk_mq_bio_to_request(struct request *rq, struct bio *bio, rq->bio = rq->biotail = bio; rq->__sector = bio->bi_iter.bi_sector; rq->__data_len = bio->bi_iter.bi_size; + + if (bio->bi_pg_bit) + rq->page_gap = 1 << (bio->bi_pg_bit - 1); + else + rq->page_gap = 0; + rq->nr_phys_segments = nr_segs; if (bio_integrity(bio)) rq->nr_integrity_segments = blk_rq_count_integrity_sg(rq->q, @@ -3370,6 +3379,7 @@ int blk_rq_prep_clone(struct request *rq, struct request *rq_src, } rq->nr_phys_segments = rq_src->nr_phys_segments; rq->nr_integrity_segments = rq_src->nr_integrity_segments; + rq->page_gap = rq_src->page_gap; if (rq->bio && blk_crypto_rq_bio_prep(rq, rq->bio, gfp_mask) < 0) goto free_and_out; diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index 2a5a828f19a0b..5ef0cef8823be 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -152,6 +152,8 @@ struct request { unsigned short nr_phys_segments; unsigned short nr_integrity_segments; + unsigned int page_gap; + #ifdef CONFIG_BLK_INLINE_ENCRYPTION struct bio_crypt_ctx *crypt_ctx; struct blk_crypto_keyslot *crypt_keyslot; diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 930daff207df2..0bfea2d4cd03a 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -222,6 +222,14 @@ struct bio { enum rw_hint bi_write_hint; u8 bi_write_stream; blk_status_t bi_status; + + /* + * The page gap bit indicates the lowest set bit in any page address + * offset between all bi_io_vecs. This field is initialized only after + * splitting to the hardware limits. + */ + u8 bi_pg_bit; + atomic_t __bi_remaining; struct bvec_iter bi_iter; -- 2.47.3 ^ permalink raw reply related [flat|nested] 13+ messages in thread
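[Editor's sketch] One detail worth spelling out, and which the review that follows asks about: the bio only stores ffs(page_gaps) in an 8-bit field rather than the whole gap word. That is lossless for how the value is consumed, because every consumer ANDs the reconstructed word against a power-of-two-minus-one mask (the IOMMU merge boundary here, or NVME_CTRL_PAGE_SIZE - 1 in patch 2), and for such masks only the position of the lowest set bit decides the outcome. A small brute-force check of that equivalence, written as plain user-space C rather than kernel code:

#include <assert.h>
#include <stdint.h>
#include <strings.h>	/* ffs() */

/*
 * For any power-of-two-minus-one boundary mask, testing the full gap word
 * and testing only its lowest set bit give the same answer.
 */
int main(void)
{
	for (uint32_t gaps = 0; gaps < (1u << 16); gaps++) {
		int bit = ffs(gaps);	/* 0 when gaps == 0 */
		uint32_t lowest = bit ? 1u << (bit - 1) : 0;

		for (int k = 0; k <= 16; k++) {
			uint32_t mask = (1u << k) - 1;

			assert(!!(gaps & mask) == !!(lowest & mask));
		}
	}
	return 0;
}

If the boundary masks were not of this power-of-two form, the compression would not hold; that assumption is exactly what the discussion below circles around.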
* Re: [PATCHv3 1/2] block: accumulate segment page gaps per bio 2025-08-21 20:44 ` [PATCHv3 1/2] block: accumulate segment page gaps per bio Keith Busch @ 2025-08-25 13:46 ` Christoph Hellwig 2025-08-25 14:10 ` Keith Busch 0 siblings, 1 reply; 13+ messages in thread From: Christoph Hellwig @ 2025-08-25 13:46 UTC (permalink / raw) To: Keith Busch; +Cc: linux-block, linux-nvme, hch, axboe, Keith Busch On Thu, Aug 21, 2025 at 01:44:19PM -0700, Keith Busch wrote: > +static inline unsigned int bvec_seg_gap(struct bio_vec *bv, struct bio_vec *bvprv) Nit: overly long line. > +{ > + return ((bvprv->bv_offset + bvprv->bv_len) & (PAGE_SIZE - 1)) | > + bv->bv_offset; But what's actually more important is a good name, and a good comment. Without much of an explanation this just looks like black magic :) Also use the chance to document why all this is PAGE_SIZE based and not based on either the iommu granule size or the virt boundary. > + if (bvprvp) { > + if (bvec_gap_to_prev(lim, bvprvp, bv.bv_offset)) > + goto split; > + page_gaps |= bvec_seg_gap(&bv, &bvprv); > + } > > if (nsegs < lim->max_segments && > bytes + bv.bv_len <= max_bytes && > @@ -326,6 +335,7 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim, > } > > *segs = nsegs; > + bio->bi_pg_bit = ffs(page_gaps); Calling this "bit" feels odd. I guess the idea is that you only care about power of two alignments? I think this would be much easier with the whole theory of operation spelled out somewhere in detail, including why the compression to the set bits works, why the PAGE granularity matters, why we only need to set this bit when splitting but not on bios that never get split or at least looped over for splitting decisions. > enum rw_hint bi_write_hint; > u8 bi_write_stream; > blk_status_t bi_status; > + > + /* > + * The page gap bit indicates the lowest set bit in any page address > + * offset between all bi_io_vecs. This field is initialized only after > + * splitting to the hardware limits. > + */ > + u8 bi_pg_bit; Maybe move this one up so that all the fields only set on the submission side stay together? ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCHv3 1/2] block: accumulate segment page gaps per bio 2025-08-25 13:46 ` Christoph Hellwig @ 2025-08-25 14:10 ` Keith Busch 2025-08-26 13:03 ` Christoph Hellwig 0 siblings, 1 reply; 13+ messages in thread From: Keith Busch @ 2025-08-25 14:10 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Keith Busch, linux-block, linux-nvme, hch, axboe On Mon, Aug 25, 2025 at 06:46:50AM -0700, Christoph Hellwig wrote: > On Thu, Aug 21, 2025 at 01:44:19PM -0700, Keith Busch wrote: > > Also use the chance to document why all this is PAGE_SIZE based and > not based on either the iommu granule size or the virt boundary. This is a good opportunity to double check my assumptions: PAGE_SIZEs, iommu granules, and virt boundaries are all power-of-two values, and PAGE_SIZE is always the largest (or tied for largest) of these. If that's accurate, storing the lowest page offset is sufficient to cover all the boundary masks. If that's not accurate, then this kind of falls apart. I didn't find anything enforcing this assumption, but I can't imagine it would make sense for a device to require the virtual boundary to be larger than the page size. It'd be difficult to do IO to that. I also know the iommu granule may be different from PAGE_SIZE too, but it's always the smaller of the two if they are not the same. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCHv3 1/2] block: accumulate segment page gaps per bio 2025-08-25 14:10 ` Keith Busch @ 2025-08-26 13:03 ` Christoph Hellwig 2025-08-26 13:47 ` Keith Busch 0 siblings, 1 reply; 13+ messages in thread From: Christoph Hellwig @ 2025-08-26 13:03 UTC (permalink / raw) To: Keith Busch Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme, axboe, iommu On Mon, Aug 25, 2025 at 08:10:59AM -0600, Keith Busch wrote: > On Mon, Aug 25, 2025 at 06:46:50AM -0700, Christoph Hellwig wrote: > > On Thu, Aug 21, 2025 at 01:44:19PM -0700, Keith Busch wrote: > > > > Also use the chance to document why all this is PAGE_SIZE based and > > not based on either the iommu granule size or the virt boundary. > > This is a good opportunity to double check my assumptions: Always a good idea! > > PAGE_SIZEs, iommu granules, and virt boundaries are all power-of-two > values, and PAGE_SIZE is always the largest (or tied for largest) of > these. I just had an offlist conversation with someone trying to make a nvme device with a virt boundary larger than PAGE_SIZE work. No idea where that device came from. I think IOMMU granule > PAGE_SIZE would be painful, but adding the IOMMU list to double check. It would also be really good to document all these assumptions with both comments and asserts so that we immediately see when they are violated. ^ permalink raw reply [flat|nested] 13+ messages in thread
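[Editor's sketch] A sanity check along the lines requested above could be as small as the snippet below. This is only an illustration of the idea, not part of the posted series: the function name and its placement (somewhere on the queue or DMA setup path) are assumptions, while queue_virt_boundary(), dma_get_merge_boundary() and WARN_ON_ONCE() are existing helpers the series already relies on.

/*
 * Illustrative only: warn if either boundary mask is ever coarser than the
 * PAGE_SIZE granularity that bi_pg_bit/page_gap are accumulated at, since
 * that would invalidate the lowest-set-bit bookkeeping.
 */
static void blk_check_gap_granularity(struct request_queue *q,
				      struct device *dma_dev)
{
	WARN_ON_ONCE(queue_virt_boundary(q) > PAGE_SIZE - 1);
	WARN_ON_ONCE(dma_get_merge_boundary(dma_dev) > PAGE_SIZE - 1);
}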
* Re: [PATCHv3 1/2] block: accumulate segment page gaps per bio 2025-08-26 13:03 ` Christoph Hellwig @ 2025-08-26 13:47 ` Keith Busch 2025-08-26 13:57 ` Christoph Hellwig 0 siblings, 1 reply; 13+ messages in thread From: Keith Busch @ 2025-08-26 13:47 UTC (permalink / raw) To: Christoph Hellwig Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme, axboe, iommu On Tue, Aug 26, 2025 at 03:03:44PM +0200, Christoph Hellwig wrote: > On Mon, Aug 25, 2025 at 08:10:59AM -0600, Keith Busch wrote: > > > > PAGE_SIZEs, iommu granules, and virt boundaries are all power-of-two > > values, and PAGE_SIZE is always the largest (or tied for largest) of > > these. > > I just had an offlist conversation with someone trying to make a nvme > device with a virt boundary larger than PAGE_SIZE work. No idea > where that device came from. Currently, the virtual boundary is always compared to bv_offset, which is a page offset. If the virtual boundary is larger than a page, then we need something like "page_to_phys(bv.bv_page) + bv.bv_offset" every place we need to check against the virt boundary. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCHv3 1/2] block: accumulate segment page gaps per bio 2025-08-26 13:47 ` Keith Busch @ 2025-08-26 13:57 ` Christoph Hellwig 2025-08-26 22:33 ` Keith Busch 0 siblings, 1 reply; 13+ messages in thread From: Christoph Hellwig @ 2025-08-26 13:57 UTC (permalink / raw) To: Keith Busch Cc: Christoph Hellwig, Christoph Hellwig, Keith Busch, linux-block, linux-nvme, axboe, iommu On Tue, Aug 26, 2025 at 07:47:46AM -0600, Keith Busch wrote: > Currently, the virtual boundary is always compared to bv_offset, which > is a page offset. If the virtual boundary is larger than a page, then we > need something like "page_to_phys(bv.bv_page) + bv.bv_offset" every > place we need to check against the virt boundary. bv_offset is only guaranteed to be a page offset if you use bio_for_each_segment(_all) or the low-level helpers implementing it and not bio_for_each_bvec(_all), where it can be much larger than PAGE_SIZE. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCHv3 1/2] block: accumulate segment page gaps per bio 2025-08-26 13:57 ` Christoph Hellwig @ 2025-08-26 22:33 ` Keith Busch 2025-08-27 7:37 ` Christoph Hellwig 0 siblings, 1 reply; 13+ messages in thread From: Keith Busch @ 2025-08-26 22:33 UTC (permalink / raw) To: Christoph Hellwig Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme, axboe, iommu On Tue, Aug 26, 2025 at 03:57:34PM +0200, Christoph Hellwig wrote: > On Tue, Aug 26, 2025 at 07:47:46AM -0600, Keith Busch wrote: > > Currently, the virtual boundary is always compared to bv_offset, which > > is a page offset. If the virtual boundary is larger than a page, then we > > need something like "page_to_phys(bv.bv_page) + bv.bv_offset" every > > place we need to check against the virt boundary. > > bv_offset is only guaranteed to be a page offset if your use > bio_for_each_segment(_all) or the low-level helpers implementing > it and not bio_for_each_bvec(_all) where it can be much larger > than PAGE_SIZE. Yes, good point. So we'd have a folio offset when it's not a single page, but I don't think we want to special case large folios for every virt boundary check. It's looking like replace bvec's "page + offset" with phys addrs, yeah?! ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCHv3 1/2] block: accumulate segment page gaps per bio 2025-08-26 22:33 ` Keith Busch @ 2025-08-27 7:37 ` Christoph Hellwig 2025-08-30 1:47 ` Keith Busch 0 siblings, 1 reply; 13+ messages in thread From: Christoph Hellwig @ 2025-08-27 7:37 UTC (permalink / raw) To: Keith Busch Cc: Christoph Hellwig, Christoph Hellwig, Keith Busch, linux-block, linux-nvme, axboe, iommu On Tue, Aug 26, 2025 at 04:33:15PM -0600, Keith Busch wrote: > On Tue, Aug 26, 2025 at 03:57:34PM +0200, Christoph Hellwig wrote: > > On Tue, Aug 26, 2025 at 07:47:46AM -0600, Keith Busch wrote: > > > Currently, the virtual boundary is always compared to bv_offset, which > > > is a page offset. If the virtual boundary is larger than a page, then we > > > need something like "page_to_phys(bv.bv_page) + bv.bv_offset" every > > > place we need to check against the virt boundary. > > > > bv_offset is only guaranteed to be a page offset if your use > > bio_for_each_segment(_all) or the low-level helpers implementing > > it and not bio_for_each_bvec(_all) where it can be much larger > > than PAGE_SIZE. > > Yes, good point. So we'd have a folio offset when it's not a single > page, but I don't think we want to special case large folios for every > virt boundary check. It's looking like replace bvec's "page + offset" > with phys addrs, yeah?! Basically everything should be using physical address. The page + offset is just a weird and inefficient way to represent that and we really need to get rid of it. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCHv3 1/2] block: accumulate segment page gaps per bio 2025-08-27 7:37 ` Christoph Hellwig @ 2025-08-30 1:47 ` Keith Busch 2025-09-02 5:36 ` Christoph Hellwig 0 siblings, 1 reply; 13+ messages in thread From: Keith Busch @ 2025-08-30 1:47 UTC (permalink / raw) To: Christoph Hellwig Cc: Christoph Hellwig, Keith Busch, linux-block, linux-nvme, axboe, iommu On Wed, Aug 27, 2025 at 09:37:09AM +0200, Christoph Hellwig wrote: > On Tue, Aug 26, 2025 at 04:33:15PM -0600, Keith Busch wrote: > > virt boundary check. It's looking like replace bvec's "page + offset" > > with phys addrs, yeah?! > > Basically everything should be using physical address. The page + offset > is just a weird and inefficient way to represent that and we really > need to get rid of it. I was plowing ahead with converting to phys addrs only to discover skb_frag_t overlays a bvec with tightly coupled expectations on its layout. I'm not comfortable right now messing with that type. I think it may need to be decoupled to proceed on this path. :( ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCHv3 1/2] block: accumulate segment page gaps per bio 2025-08-30 1:47 ` Keith Busch @ 2025-09-02 5:36 ` Christoph Hellwig 0 siblings, 0 replies; 13+ messages in thread From: Christoph Hellwig @ 2025-09-02 5:36 UTC (permalink / raw) To: Keith Busch Cc: Christoph Hellwig, Christoph Hellwig, Keith Busch, linux-block, linux-nvme, axboe, iommu, willy, netdev, linux-mm On Fri, Aug 29, 2025 at 07:47:40PM -0600, Keith Busch wrote: > On Wed, Aug 27, 2025 at 09:37:09AM +0200, Christoph Hellwig wrote: > > On Tue, Aug 26, 2025 at 04:33:15PM -0600, Keith Busch wrote: > > > virt boundary check. It's looking like replace bvec's "page + offset" > > > with phys addrs, yeah?! > > > > Basically everything should be using physical address. The page + offset > > is just a weird and inefficient way to represent that and we really > > need to get rid of it. > > I was plowing ahead with converting to phys addrs only to discover > skb_frag_t overlays a bvec with tightly coupled expectations on its > layout. I'm not comfortable right now messing with that type. I think it > may need to be decoupled to proceed on this path. :( willy got really angry at this for all the right reasons, and we still need it fixed. Can we just revert the unreviewed crap the network folks did here to move the kernel forward? ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCHv3 2/2] nvme: remove virtual boundary for sgl capable devices 2025-08-21 20:44 [PATCHv3 0/2] block+nvme: reducing virtual boundary mask reliance Keith Busch 2025-08-21 20:44 ` [PATCHv3 1/2] block: accumulate segment page gaps per bio Keith Busch @ 2025-08-21 20:44 ` Keith Busch 2025-08-25 13:49 ` Christoph Hellwig 1 sibling, 1 reply; 13+ messages in thread From: Keith Busch @ 2025-08-21 20:44 UTC (permalink / raw) To: linux-block, linux-nvme; +Cc: hch, axboe, Keith Busch From: Keith Busch <kbusch@kernel.org> The nvme virtual boundary is only required for the PRP format. Devices that can use SGL for DMA don't need it for IO queues. Drop reporting it for such PCIe devices; fabrics controllers will continue to use the limit as they currently don't report any boundary requirements. Applications may continue to align to the same virtual boundaries for optimization purposes if they want, and the driver will continue to decide whether to use the PRP format the same as before if the IO allows it. Signed-off-by: Keith Busch <kbusch@kernel.org> --- drivers/nvme/host/core.c | 21 ++++++++++++++++----- drivers/nvme/host/pci.c | 16 +++++++++++++--- 2 files changed, 29 insertions(+), 8 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 812c1565114fd..cce1fbf2becd8 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -2065,13 +2065,24 @@ static u32 nvme_max_drv_segments(struct nvme_ctrl *ctrl) } static void nvme_set_ctrl_limits(struct nvme_ctrl *ctrl, - struct queue_limits *lim) + struct queue_limits *lim, bool admin) { lim->max_hw_sectors = ctrl->max_hw_sectors; lim->max_segments = min_t(u32, USHRT_MAX, min_not_zero(nvme_max_drv_segments(ctrl), ctrl->max_segments)); lim->max_integrity_segments = ctrl->max_integrity_segments; - lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1; + + /* + * The virtual boundary mask is not necessary for PCI controllers that + * support SGL for DMA. It's only necessary when using PRP. Admin + * queues only support PRP, and fabrics drivers currently don't report + * what boundaries they require, so set the virtual boundary for + * either. 
+ */ + if (!nvme_ctrl_sgl_supported(ctrl) || admin || + ctrl->ops->flags & NVME_F_FABRICS) + lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1; + lim->max_segment_size = UINT_MAX; lim->dma_alignment = 3; } @@ -2173,7 +2184,7 @@ static int nvme_update_ns_info_generic(struct nvme_ns *ns, int ret; lim = queue_limits_start_update(ns->disk->queue); - nvme_set_ctrl_limits(ns->ctrl, &lim); + nvme_set_ctrl_limits(ns->ctrl, &lim, false); memflags = blk_mq_freeze_queue(ns->disk->queue); ret = queue_limits_commit_update(ns->disk->queue, &lim); @@ -2377,7 +2388,7 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns, ns->head->lba_shift = id->lbaf[lbaf].ds; ns->head->nuse = le64_to_cpu(id->nuse); capacity = nvme_lba_to_sect(ns->head, le64_to_cpu(id->nsze)); - nvme_set_ctrl_limits(ns->ctrl, &lim); + nvme_set_ctrl_limits(ns->ctrl, &lim, false); nvme_configure_metadata(ns->ctrl, ns->head, id, nvm, info); nvme_set_chunk_sectors(ns, id, &lim); if (!nvme_update_disk_info(ns, id, &lim)) @@ -3580,7 +3591,7 @@ static int nvme_init_identify(struct nvme_ctrl *ctrl) min_not_zero(ctrl->max_hw_sectors, max_hw_sectors); lim = queue_limits_start_update(ctrl->admin_q); - nvme_set_ctrl_limits(ctrl, &lim); + nvme_set_ctrl_limits(ctrl, &lim, true); ret = queue_limits_commit_update(ctrl->admin_q, &lim); if (ret) goto out_free; diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index d8a9dee55de33..6e848aa4531b6 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -619,9 +619,19 @@ static inline enum nvme_use_sgl nvme_pci_use_sgls(struct nvme_dev *dev, struct nvme_queue *nvmeq = req->mq_hctx->driver_data; if (nvmeq->qid && nvme_ctrl_sgl_supported(&dev->ctrl)) { - if (nvme_req(req)->flags & NVME_REQ_USERCMD) - return SGL_FORCED; - if (req->nr_integrity_segments > 1) + /* + * A request with page gaps within the controller's mask can + * not use the PRP format. + * + * We force user commands to use SGL because that lets the + * device validate the requested transfer lengths. + * + * Multiple integrity segments must use SGL as that's the only + * way to describe such a command. + */ + if (req->page_gap & (NVME_CTRL_PAGE_SIZE - 1) || + nvme_req(req)->flags & NVME_REQ_USERCMD || + req->nr_integrity_segments > 1) return SGL_FORCED; return SGL_SUPPORTED; } -- 2.47.3 ^ permalink raw reply related [flat|nested] 13+ messages in thread
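[Editor's sketch] The decision the nvme_pci_use_sgls() hunk above makes can be summarized in a stand-alone form. The snippet below is a deliberately simplified model, not the driver code: CTRL_PAGE_SIZE stands in for NVME_CTRL_PAGE_SIZE, the other names are illustrative, and the real function additionally distinguishes forced from merely supported SGL use so the submission path can still prefer PRP for small, well-aligned transfers.

#include <stdbool.h>
#include <stdint.h>

#define CTRL_PAGE_SIZE 4096u	/* models NVME_CTRL_PAGE_SIZE */

enum xfer_format { USE_PRP, USE_SGL };

/*
 * PRP requires every data pointer after the first to start on a controller
 * page boundary and every entry except the last to run to one, so any gap
 * bit inside the controller page mask rules PRP out.  User commands are
 * steered to SGL so the device can validate transfer lengths, and multiple
 * integrity segments can only be described with SGL.
 */
static enum xfer_format choose_format(bool sgl_supported, uint32_t page_gap,
				      bool user_cmd, int integrity_segments)
{
	if (!sgl_supported)
		return USE_PRP;
	if ((page_gap & (CTRL_PAGE_SIZE - 1)) || user_cmd ||
	    integrity_segments > 1)
		return USE_SGL;
	return USE_PRP;	/* SGL is allowed, but PRP remains an option */
}

int main(void)
{
	/* A request whose only gap bit is at offset 512 cannot use PRP. */
	return choose_format(true, 512, false, 1) == USE_SGL ? 0 : 1;
}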
* Re: [PATCHv3 2/2] nvme: remove virtual boundary for sgl capable devices 2025-08-21 20:44 ` [PATCHv3 2/2] nvme: remove virtual boundary for sgl capable devices Keith Busch @ 2025-08-25 13:49 ` Christoph Hellwig 0 siblings, 0 replies; 13+ messages in thread From: Christoph Hellwig @ 2025-08-25 13:49 UTC (permalink / raw) To: Keith Busch; +Cc: linux-block, linux-nvme, hch, axboe, Keith Busch On Thu, Aug 21, 2025 at 01:44:20PM -0700, Keith Busch wrote: > + /* > + * The virtual boundary mask is not necessary for PCI controllers that > + * support SGL for DMA. It's only necessary when using PRP. Admin > + * queues only support PRP, and fabrics drivers currently don't report > + * what boundaries they require, so set the virtual boundary for > + * either. > + */ > + if (!nvme_ctrl_sgl_supported(ctrl) || admin || > + ctrl->ops->flags & NVME_F_FABRICS) > + lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1; Fabrics itself never needs the virt boundary. And for TCP, which is a software-only transport, I think we can just do away with it. For FC I suspect we can do away with it as well, as all the FC HBAs support proper SGLs. For RDMA the standard MR methods do require the virtual boundary, but somewhat recent Mellanox / Nvidia hardware does not. No need for you to update all these, but I think having the transport advertise the capability is probably better than a bunch of random conditions in the core code. ^ permalink raw reply [flat|nested] 13+ messages in thread