* [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance
@ 2025-10-14 15:04 Keith Busch
2025-10-14 15:04 ` [PATCHv5 1/2] block: accumulate memory segment gaps per bio Keith Busch
` (3 more replies)
0 siblings, 4 replies; 20+ messages in thread
From: Keith Busch @ 2025-10-14 15:04 UTC (permalink / raw)
To: hch, linux-nvme, linux-block, axboe; +Cc: Keith Busch
From: Keith Busch <kbusch@kernel.org>
Previous version here:
https://lore.kernel.org/linux-block/20251007175245.3898972-1-kbusch@meta.com/
The purpose is to allow optimization decisions to happen per IO, and to
provide the flexibility to use unaligned buffers on hardware that
supports them.
The virtual boundary that NVMe uses provides specific guarantees about
data alignment, but that guarantee may not be large enough for some CPU
architectures to take advantage of, even when an application submits
aligned data buffers that could benefit.
At the same time, the virtual boundary prevents the driver from directly
using memory in ways the hardware may be capable of accessing. This
unnecessarily forces applications to double buffer their data into a
more restrictive virtually contiguous format.
This patch series provides an efficient way to track segment boundary
gaps per IO so that the optimizations can be decided per IO. This gives
the flexibility to use each device to the full extent of its
capabilities, beyond what the virtual boundary mask can express.
Note, abuse of this capability may result in worse performance compared
to the bounce buffer solutions. Sending a bunch of tiny vectors for one
IO incurs significant protocol overhead, so while this patch set allows
you to do that, I recommend that you don't. We can't enforce a minimum
size though because vectors may straddle pages with only a few words in
the first and/or last pages, which we do need to support.
Changes from v4:
* Keep the same lowest-set-bit representation in the request as the
bio; provide a helper to turn it into a mask
* Open-code the bvec gaps calculation since the helper is being removed
* Additional code comments
* Keeping the virt boundary unchanged for the loop target for now. Only
pci, tcp, and fc are not reporting such a boundary.
Keith Busch (2):
block: accumulate memory segment gaps per bio
nvme: remove virtual boundary for sgl capable devices
block/bio.c | 1 +
block/blk-map.c | 3 +++
block/blk-merge.c | 39 ++++++++++++++++++++++++++++++++++---
block/blk-mq-dma.c | 3 +--
block/blk-mq.c | 6 ++++++
drivers/nvme/host/apple.c | 1 +
drivers/nvme/host/core.c | 10 +++++-----
drivers/nvme/host/fabrics.h | 6 ++++++
drivers/nvme/host/fc.c | 1 +
drivers/nvme/host/nvme.h | 7 +++++++
drivers/nvme/host/pci.c | 28 +++++++++++++++++++++++---
drivers/nvme/host/rdma.c | 1 +
drivers/nvme/host/tcp.c | 1 +
drivers/nvme/target/loop.c | 1 +
include/linux/bio.h | 2 ++
include/linux/blk-mq.h | 16 +++++++++++++++
include/linux/blk_types.h | 12 ++++++++++++
17 files changed, 125 insertions(+), 13 deletions(-)
--
2.47.3
^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-10-14 15:04 [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance Keith Busch
@ 2025-10-14 15:04 ` Keith Busch
  2025-10-14 21:18   ` Martin K. Petersen
  ` (2 more replies)
  2025-10-14 15:04 ` [PATCHv5 2/2] nvme: remove virtual boundary for sgl capable devices Keith Busch
  ` (2 subsequent siblings)
  3 siblings, 3 replies; 20+ messages in thread
From: Keith Busch @ 2025-10-14 15:04 UTC (permalink / raw)
  To: hch, linux-nvme, linux-block, axboe; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

The blk-mq dma iterator has an optimization for requests that align to
the device's iommu merge boundary. This boundary may be larger than the
device's virtual boundary, but the code had been depending on that queue
limit to know ahead of time if the request is guaranteed to align to
that optimization.

Rather than rely on that queue limit, which many devices may not report,
save the lowest set bit of any boundary gap between each segment in the
bio while checking the segments. The request stores the value for
merging and quickly checking per io if the request can use iova
optimizations.
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/bio.c               |  1 +
 block/blk-map.c           |  3 +++
 block/blk-merge.c         | 39 ++++++++++++++++++++++++++++++++++++---
 block/blk-mq-dma.c        |  3 +--
 block/blk-mq.c            |  6 ++++++
 include/linux/bio.h       |  2 ++
 include/linux/blk-mq.h    | 16 ++++++++++++++++
 include/linux/blk_types.h | 12 ++++++++++++
 8 files changed, 77 insertions(+), 5 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index b3a79285c278d..7b13bdf72de09 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -253,6 +253,7 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 	bio->bi_write_hint = 0;
 	bio->bi_write_stream = 0;
 	bio->bi_status = 0;
+	bio->bi_bvec_gap_bit = 0;
 	bio->bi_iter.bi_sector = 0;
 	bio->bi_iter.bi_size = 0;
 	bio->bi_iter.bi_idx = 0;
diff --git a/block/blk-map.c b/block/blk-map.c
index 60faf036fb6e4..17a1dc2886786 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -459,6 +459,8 @@ int blk_rq_append_bio(struct request *rq, struct bio *bio)
 	if (rq->bio) {
 		if (!ll_back_merge_fn(rq, bio, nr_segs))
 			return -EINVAL;
+		rq->phys_gap_bit = bio_seg_gap(rq->q, rq->biotail, bio,
+					       rq->phys_gap_bit);
 		rq->biotail->bi_next = bio;
 		rq->biotail = bio;
 		rq->__data_len += bio->bi_iter.bi_size;
@@ -469,6 +471,7 @@ int blk_rq_append_bio(struct request *rq, struct bio *bio)
 	rq->nr_phys_segments = nr_segs;
 	rq->bio = rq->biotail = bio;
 	rq->__data_len = bio->bi_iter.bi_size;
+	rq->phys_gap_bit = bio->bi_bvec_gap_bit;
 	return 0;
 }
 EXPORT_SYMBOL(blk_rq_append_bio);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 002c594798242..2a1da435462de 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -302,6 +302,12 @@ static unsigned int bio_split_alignment(struct bio *bio,
 	return lim->logical_block_size;
 }
 
+static inline unsigned int bvec_seg_gap(struct bio_vec *bvprv,
+					struct bio_vec *bv)
+{
+	return bv->bv_offset | (bvprv->bv_offset + bvprv->bv_len);
+}
+
 /**
  * bio_split_io_at - check if and where to split a bio
  * @bio:	[in] bio to be split
@@ -319,8 +325,8 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
 		unsigned *segs, unsigned max_bytes, unsigned len_align_mask)
 {
 	struct bio_vec bv, bvprv, *bvprvp = NULL;
+	unsigned nsegs = 0, bytes = 0, gaps = 0;
 	struct bvec_iter iter;
-	unsigned nsegs = 0, bytes = 0;
 
 	bio_for_each_bvec(bv, bio, iter) {
 		if (bv.bv_offset & lim->dma_alignment ||
@@ -331,8 +337,11 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
 		 * If the queue doesn't support SG gaps and adding this
 		 * offset would create a gap, disallow it.
 		 */
-		if (bvprvp && bvec_gap_to_prev(lim, bvprvp, bv.bv_offset))
-			goto split;
+		if (bvprvp) {
+			if (bvec_gap_to_prev(lim, bvprvp, bv.bv_offset))
+				goto split;
+			gaps |= bvec_seg_gap(bvprvp, &bv);
+		}
 
 		if (nsegs < lim->max_segments &&
 		    bytes + bv.bv_len <= max_bytes &&
@@ -350,6 +359,7 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
 	}
 
 	*segs = nsegs;
+	bio->bi_bvec_gap_bit = ffs(gaps);
 	return 0;
 split:
 	if (bio->bi_opf & REQ_ATOMIC)
@@ -385,6 +395,7 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
 	 * big IO can be trival, disable iopoll when split needed.
 	 */
 	bio_clear_polled(bio);
+	bio->bi_bvec_gap_bit = ffs(gaps);
 	return bytes >> SECTOR_SHIFT;
 }
 EXPORT_SYMBOL_GPL(bio_split_io_at);
@@ -721,6 +732,21 @@ static bool blk_atomic_write_mergeable_rqs(struct request *rq,
 	return (rq->cmd_flags & REQ_ATOMIC) == (next->cmd_flags & REQ_ATOMIC);
 }
 
+u8 bio_seg_gap(struct request_queue *q, struct bio *prev, struct bio *next,
+	       u8 gaps_bit)
+{
+	struct bio_vec pb, nb;
+
+	gaps_bit = min_not_zero(gaps_bit, prev->bi_bvec_gap_bit);
+	gaps_bit = min_not_zero(gaps_bit, next->bi_bvec_gap_bit);
+
+	bio_get_last_bvec(prev, &pb);
+	bio_get_first_bvec(next, &nb);
+	if (!biovec_phys_mergeable(q, &pb, &nb))
+		gaps_bit = min_not_zero(gaps_bit, ffs(bvec_seg_gap(&pb, &nb)));
+	return gaps_bit;
+}
+
 /*
  * For non-mq, this has to be called with the request spinlock acquired.
  * For mq with scheduling, the appropriate queue wide lock should be held.
@@ -785,6 +811,9 @@ static struct request *attempt_merge(struct request_queue *q,
 	if (next->start_time_ns < req->start_time_ns)
 		req->start_time_ns = next->start_time_ns;
 
+	req->phys_gap_bit = bio_seg_gap(req->q, req->biotail, next->bio,
+					min_not_zero(next->phys_gap_bit,
+						     req->phys_gap_bit));
 	req->biotail->bi_next = next->bio;
 	req->biotail = next->biotail;
 
@@ -908,6 +937,8 @@ enum bio_merge_status bio_attempt_back_merge(struct request *req,
 	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
 		blk_zone_write_plug_bio_merged(bio);
 
+	req->phys_gap_bit = bio_seg_gap(req->q, req->biotail, bio,
+					req->phys_gap_bit);
 	req->biotail->bi_next = bio;
 	req->biotail = bio;
 	req->__data_len += bio->bi_iter.bi_size;
@@ -942,6 +973,8 @@ static enum bio_merge_status bio_attempt_front_merge(struct request *req,
 
 	blk_update_mixed_merge(req, bio, true);
 
+	req->phys_gap_bit = bio_seg_gap(req->q, bio, req->bio,
+					req->phys_gap_bit);
 	bio->bi_next = req->bio;
 	req->bio = bio;
diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
index 449950029872a..94d3461b5bc8e 100644
--- a/block/blk-mq-dma.c
+++ b/block/blk-mq-dma.c
@@ -79,8 +79,7 @@ static bool blk_map_iter_next(struct request *req, struct blk_map_iter *iter,
 static inline bool blk_can_dma_map_iova(struct request *req,
 		struct device *dma_dev)
 {
-	return !((queue_virt_boundary(req->q) + 1) &
-		 dma_get_merge_boundary(dma_dev));
+	return !(req_phys_gap_mask(req) & dma_get_merge_boundary(dma_dev));
 }
 
 static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 09f5794141615..c12e5ee315fce 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -376,6 +376,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	INIT_LIST_HEAD(&rq->queuelist);
 	rq->q = q;
 	rq->__sector = (sector_t) -1;
+	rq->phys_gap_bit = 0;
 	INIT_HLIST_NODE(&rq->hash);
 	RB_CLEAR_NODE(&rq->rb_node);
 	rq->tag = BLK_MQ_NO_TAG;
@@ -668,6 +669,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
 		goto out_queue_exit;
 	}
 	rq->__data_len = 0;
+	rq->phys_gap_bit = 0;
 	rq->__sector = (sector_t) -1;
 	rq->bio = rq->biotail = NULL;
 	return rq;
@@ -748,6 +750,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	rq = blk_mq_rq_ctx_init(&data, blk_mq_tags_from_data(&data), tag);
 	blk_mq_rq_time_init(rq, alloc_time_ns);
 	rq->__data_len = 0;
+	rq->phys_gap_bit = 0;
 	rq->__sector = (sector_t) -1;
 	rq->bio = rq->biotail = NULL;
 	return rq;
@@ -2674,6 +2677,8 @@ static void blk_mq_bio_to_request(struct request *rq, struct bio *bio,
 	rq->bio = rq->biotail = bio;
 	rq->__sector = bio->bi_iter.bi_sector;
 	rq->__data_len = bio->bi_iter.bi_size;
+	rq->phys_gap_bit = bio->bi_bvec_gap_bit;
+
 	rq->nr_phys_segments = nr_segs;
 	if (bio_integrity(bio))
 		rq->nr_integrity_segments = blk_rq_count_integrity_sg(rq->q,
@@ -3380,6 +3385,7 @@ int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
 	}
 	rq->nr_phys_segments = rq_src->nr_phys_segments;
 	rq->nr_integrity_segments = rq_src->nr_integrity_segments;
+	rq->phys_gap_bit = rq_src->phys_gap_bit;
 
 	if (rq->bio && blk_crypto_rq_bio_prep(rq, rq->bio, gfp_mask) < 0)
 		goto free_and_out;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 16c1c85613b76..ad2d57908c1c0 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -324,6 +324,8 @@ extern struct bio *bio_split(struct bio *bio, int sectors, gfp_t gfp,
 			     struct bio_set *bs);
 int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
 		    unsigned *segs, unsigned max_bytes, unsigned len_align);
+u8 bio_seg_gap(struct request_queue *q, struct bio *prev, struct bio *next,
+	       u8 gaps_bit);
 
 /**
  * bio_next_split - get next @sectors from a bio, splitting if necessary
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b25d12545f46d..ddda00f07fae4 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -152,6 +152,14 @@ struct request {
 	unsigned short nr_phys_segments;
 	unsigned short nr_integrity_segments;
 
+	/*
+	 * The lowest set bit for address gaps between physical segments. This
+	 * provides information necessary for dma optimization opprotunities,
+	 * like for testing if the segments can be coalesced against the
+	 * device's iommu granule.
+	 */
+	unsigned char phys_gap_bit;
+
 #ifdef CONFIG_BLK_INLINE_ENCRYPTION
 	struct bio_crypt_ctx *crypt_ctx;
 	struct blk_crypto_keyslot *crypt_keyslot;
@@ -208,6 +216,14 @@ struct request {
 	void *end_io_data;
 };
 
+/*
+ * Returns a mask with all bits starting at req->phys_gap_bit set to 1.
+ */
+static inline unsigned long req_phys_gap_mask(const struct request *req)
+{
+	return ~(((1 << req->phys_gap_bit) >> 1) - 1);
+}
+
 static inline enum req_op req_op(const struct request *req)
 {
 	return req->cmd_flags & REQ_OP_MASK;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 8e8d1cc8b06c4..53501ebb0623e 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -218,6 +218,18 @@ struct bio {
 	enum rw_hint bi_write_hint;
 	u8 bi_write_stream;
 	blk_status_t bi_status;
+
+	/*
+	 * The bvec gap bit indicates the lowest set bit in any address offset
+	 * between all bi_io_vecs. This field is initialized only after the bio
+	 * is split to the hardware limits (see bio_split_io_at()). The value
+	 * may be used to consider DMA optimization when performing that
+	 * mapping. The value is compared to a power of two mask where the
+	 * result depends on any bit set within the mask, so saving the lowest
+	 * bit is sufficient to know if any segment gap collides with the mask.
+	 */
+	u8 bi_bvec_gap_bit;
+
 	atomic_t __bi_remaining;
 
 	struct bvec_iter bi_iter;
-- 
2.47.3
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-10-14 15:04 ` [PATCHv5 1/2] block: accumulate memory segment gaps per bio Keith Busch
@ 2025-10-14 21:18   ` Martin K. Petersen
  2025-10-15  4:08   ` Christoph Hellwig
  2025-11-11  4:26   ` Matthew Wilcox
  2 siblings, 0 replies; 20+ messages in thread
From: Martin K. Petersen @ 2025-10-14 21:18 UTC (permalink / raw)
  To: Keith Busch; +Cc: hch, linux-nvme, linux-block, axboe, Keith Busch

Keith,

> The blk-mq dma iterator has an optimization for requests that align to
> the device's iommu merge boundary. This boundary may be larger than
> the device's virtual boundary, but the code had been depending on that
> queue limit to know ahead of time if the request is guaranteed to
> align to that optimization.
>
> Rather than rely on that queue limit, which many devices may not
> report, save the lowest set bit of any boundary gap between each
> segment in the bio while checking the segments. The request stores the
> value for merging and quickly checking per io if the request can use
> iova optimizations.

Looks OK to me.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-10-14 15:04 ` [PATCHv5 1/2] block: accumulate memory segment gaps per bio Keith Busch
  2025-10-14 21:18 ` Martin K. Petersen
@ 2025-10-15  4:08   ` Christoph Hellwig
  2025-11-11  4:26   ` Matthew Wilcox
  2 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2025-10-15 4:08 UTC (permalink / raw)
  To: Keith Busch; +Cc: hch, linux-nvme, linux-block, axboe, Keith Busch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-10-14 15:04 ` [PATCHv5 1/2] block: accumulate memory segment gaps per bio Keith Busch
  2025-10-14 21:18 ` Martin K. Petersen
  2025-10-15  4:08 ` Christoph Hellwig
@ 2025-11-11  4:26   ` Matthew Wilcox
  2025-11-11  4:50   ` Keith Busch
  2 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2025-11-11 4:26 UTC (permalink / raw)
  To: Keith Busch; +Cc: hch, linux-nvme, linux-block, axboe, Keith Busch

On Tue, Oct 14, 2025 at 08:04:55AM -0700, Keith Busch wrote:
> The blk-mq dma iterator has an optimization for requests that align to
> the device's iommu merge boundary. This boundary may be larger than the
> device's virtual boundary, but the code had been depending on that queue
> limit to know ahead of time if the request is guaranteed to align to
> that optimization.
>
> Rather than rely on that queue limit, which many devices may not report,
> save the lowest set bit of any boundary gap between each segment in the
> bio while checking the segments. The request stores the value for
> merging and quickly checking per io if the request can use iova
> optimizations.

Hi Keith,

I just hit this bug:

generic/455
run fstests generic/455 at 2025-11-11 04:11:25
XFS (vdb): Mounting V5 Filesystem 54edd3b5-5306-493b-9ecd-f06cd9a8d669
XFS (vdb): Ending clean mount
XFS (dm-4): Mounting V5 Filesystem 3eb16918-3537-4d69-999a-ba226510f6c2
XFS (dm-4): Ending clean mount
XFS (dm-4): Unmounting Filesystem 3eb16918-3537-4d69-999a-ba226510f6c2
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 1614197 Comm: kworker/u64:2 Not tainted 6.18.0-rc4-next-20251110-ktest-00017-g2307dc640a8d #131 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Workqueue: dm-thin do_worker
RIP: 0010:bio_get_last_bvec+0x20/0xe0
Code: 90 90 90 90 90 90 90 90 90 90 55 49 89 f2 48 89 f9 48 89 e5 53 8b 77 2c 8b 47 30 44 8b 4f 28 49 89 f0 49 c1 e0 04 4c 03 47 50 <41> 8b 78 08 41 8b 58 0c 4d 8b 18 29 c7 44 39 cf 4d 89 1a 41 0f 47
RSP: 0018:ffff88814d1ef8d8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8881044b5500 RCX: ffff88811fd23d78
RDX: ffff88811fd235f8 RSI: 0000000000000000 RDI: ffff88811fd23d78
RBP: ffff88814d1ef8e0 R08: 0000000000000000 R09: 0000000000020000
R10: ffff88814d1ef8f0 R11: 0000000000000200 R12: 0000000000000000
R13: ffff88811fd235f8 R14: 0000000000000000 R15: ffff8881044b5500
FS:  0000000000000000(0000) GS:ffff8881f6ac9000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000000263a000 CR4: 0000000000750eb0
PKRU: 55555554
Call Trace:
 <TASK>
 bio_seg_gap+0x4c/0x150
 bio_attempt_front_merge+0x19a/0x3a0
 blk_attempt_bio_merge.part.0+0xb4/0x110
 blk_attempt_plug_merge+0xd6/0xe0
 blk_mq_submit_bio+0x76c/0x9f0
 ? lock_release+0xbb/0x260
 __submit_bio+0xa5/0x380
 submit_bio_noacct_nocheck+0x126/0x380
 ? submit_bio_noacct_nocheck+0x126/0x380
 submit_bio_noacct+0x17f/0x3c0
 ? __cond_resched+0x1e/0x60
 submit_bio+0xd6/0x100
 end_discard+0x3a/0x90
 process_prepared_discard_passdown_pt1+0xff/0x180
 process_discard_cell_passdown+0x19e/0x2a0
 process_discard_bio+0x105/0x1a0
 do_worker+0x824/0xa40
 ? process_one_work+0x1ad/0x530
 process_one_work+0x1ed/0x530
 ? move_linked_works+0x77/0xb0
 worker_thread+0x1cf/0x3d0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x100/0x220
 ? _raw_spin_unlock_irq+0x2b/0x40
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x249/0x280
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Modules linked in: [last unloaded: crc_t10dif]
CR2: 0000000000000008
---[ end trace 0000000000000000 ]---
RIP: 0010:bio_get_last_bvec+0x20/0xe0
Code: 90 90 90 90 90 90 90 90 90 90 55 49 89 f2 48 89 f9 48 89 e5 53 8b 77 2c 8b 47 30 44 8b 4f 28 49 89 f0 49 c1 e0 04 4c 03 47 50 <41> 8b 78 08 41 8b 58 0c 4d 8b 18 29 c7 44 39 cf 4d 89 1a 41 0f 47
RSP: 0018:ffff88814d1ef8d8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8881044b5500 RCX: ffff88811fd23d78
RDX: ffff88811fd235f8 RSI: 0000000000000000 RDI: ffff88811fd23d78
RBP: ffff88814d1ef8e0 R08: 0000000000000000 R09: 0000000000020000
R10: ffff88814d1ef8f0 R11: 0000000000000200 R12: 0000000000000000
R13: ffff88811fd235f8 R14: 0000000000000000 R15: ffff8881044b5500
FS:  0000000000000000(0000) GS:ffff8881f6ac9000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000000263a000 CR4: 0000000000750eb0
PKRU: 55555554
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception ]---

I'm not saying it's definitely your patch; after all, there's 17 of
my slab patches on top of next-20251110, but when I looked on lore for
'bio_get_last_bvec' this was the only patch since 2021 that mentioned it,
so I thought I'd drop you a note in case you see the bug immediately.
I'm heading to bed, and will be out tomorrow, so my opportunities to be
helpful will be limited.
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-11-11  4:26 ` Matthew Wilcox
@ 2025-11-11  4:50   ` Keith Busch
  2025-11-11  7:14   ` Christoph Hellwig
  0 siblings, 1 reply; 20+ messages in thread
From: Keith Busch @ 2025-11-11 4:50 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Keith Busch, hch, linux-nvme, linux-block, axboe

On Tue, Nov 11, 2025 at 04:26:21AM +0000, Matthew Wilcox wrote:
> On Tue, Oct 14, 2025 at 08:04:55AM -0700, Keith Busch wrote:
> > device's virtual boundary, but the code had been depending on that queue
> > limit to know ahead of time if the request is guaranteed to align to
> > that optimization.
> >
> > Rather than rely on that queue limit, which many devices may not report,
> > save the lowest set bit of any boundary gap between each segment in the
> > bio while checking the segments. The request stores the value for
> > merging and quickly checking per io if the request can use iova
> > optimizations.
>
> Hi Keith,
>
> I just hit this bug:
...
> RIP: 0010:bio_get_last_bvec+0x20/0xe0
> Code: 90 90 90 90 90 90 90 90 90 90 55 49 89 f2 48 89 f9 48 89 e5 53 8b 77 2c 8b 47 30 44 8b 4f 28 49 89 f0 49 c1 e0 04 4c 03 47 50 <41> 8b 78 08 41 8b 58 0c 4d 8b 18 29 c7 44 39 cf 4d 89 1a 41 0f 47
> RSP: 0018:ffff88814d1ef8d8 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffff8881044b5500 RCX: ffff88811fd23d78
> RDX: ffff88811fd235f8 RSI: 0000000000000000 RDI: ffff88811fd23d78
> RBP: ffff88814d1ef8e0 R08: 0000000000000000 R09: 0000000000020000
> R10: ffff88814d1ef8f0 R11: 0000000000000200 R12: 0000000000000000
> R13: ffff88811fd235f8 R14: 0000000000000000 R15: ffff8881044b5500
> FS:  0000000000000000(0000) GS:ffff8881f6ac9000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000008 CR3: 000000000263a000 CR4: 0000000000750eb0
> PKRU: 55555554
> Kernel panic - not syncing: Fatal exception
> Kernel Offset: disabled
> ---[ end Kernel panic - not syncing: Fatal exception ]---
>
> I'm not saying it's definitely your patch; after all, there's 17 of
> my slab patches on top of next-20251110, but when I looked on lore for
> 'bio_get_last_bvec' this was the only patch since 2021 that mentioned it,
> so I thought I'd drop you a note in case you see the bug immediately.
> I'm heading to bed, and will be out tomorrow, so my opportunities to be
> helpful will be limited.

Thanks for the heads up. This is in the path I'd been modifying lately,
so sounds plausible that I introduced the bug. The information here
should be enough for me to make progress: it looks like req->bio is NULL
in your trace, which I did not expect would happen. But it's late here
too, so look with fresh eyes in the morning.
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-11-11  4:50 ` Keith Busch
@ 2025-11-11  7:14   ` Christoph Hellwig
  2025-11-11  9:36   ` Yu Kuai
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2025-11-11 7:14 UTC (permalink / raw)
  To: Keith Busch
  Cc: Matthew Wilcox, Keith Busch, hch, linux-nvme, linux-block, axboe

On Mon, Nov 10, 2025 at 11:50:51PM -0500, Keith Busch wrote:
> Thanks for the heads up. This is in the path I'd been modifying lately,
> so sounds plausible that I introduced the bug. The information here
> should be enough for me to make progress: it looks like req->bio is NULL
> in your trace, which I did not expect would happen. But it's late here
> too, so look with fresh eyes in the morning.

req->bio should only be NULL for flush requests or passthrough requests
that do not transfer data. None of them should end up in this path.
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-11-11  7:14 ` Christoph Hellwig
@ 2025-11-11  9:36   ` Yu Kuai
  2025-11-11  9:39   ` Christoph Hellwig
  0 siblings, 1 reply; 20+ messages in thread
From: Yu Kuai @ 2025-11-11 9:36 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch
  Cc: Matthew Wilcox, Keith Busch, linux-nvme, linux-block, axboe, yukuai

Hi,

On 2025/11/11 15:14, Christoph Hellwig wrote:
> On Mon, Nov 10, 2025 at 11:50:51PM -0500, Keith Busch wrote:
>> Thanks for the heads up. This is in the path I'd been modifying lately,
>> so sounds plausible that I introduced the bug. The information here
>> should be enough for me to make progress: it looks like req->bio is NULL
>> in your trace, which I did not expect would happen. But it's late here
>> too, so look with fresh eyes in the morning.
> req->bio should only be NULL for flush requests or passthrough requests
> that do not transfer data. None of them should end up in this path.

This can be reproduced 100% with branch for-6.19/block now, just:

blkdiscard /dev/md0

Where the discard IO will be split to different underlying disks and
then merged. And for a discard bio, bio->bi_io_vec is NULL. So when a
discard bio ends up in the merge path, bio->bi_io_vec will be
dereferenced unconditionally.

How about the following simple fix:

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3ca6fbf8b787..31f460422fe3 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -740,6 +740,9 @@ u8 bio_seg_gap(struct request_queue *q, struct bio *prev, struct bio *next,
 	gaps_bit = min_not_zero(gaps_bit, prev->bi_bvec_gap_bit);
 	gaps_bit = min_not_zero(gaps_bit, next->bi_bvec_gap_bit);
 
+	if (op_is_discard(prev->bi_opf) || op_is_discard(next->bi_opf))
+		return gaps_bit;
+
 	bio_get_last_bvec(prev, &pb);
 	bio_get_first_bvec(next, &nb);
 	if (!biovec_phys_mergeable(q, &pb, &nb))

Thanks,
Kuai
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-11-11  9:36 ` Yu Kuai
@ 2025-11-11  9:39   ` Christoph Hellwig
  2025-11-11 10:14   ` Yu Kuai
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2025-11-11 9:39 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Christoph Hellwig, Keith Busch, Matthew Wilcox, Keith Busch,
	linux-nvme, linux-block, axboe

On Tue, Nov 11, 2025 at 05:36:39PM +0800, Yu Kuai wrote:
> This can be reproduced 100% with branch for-6.19/block now, just:
>
> blkdiscard /dev/md0
>
> Where the discard IO will be split to different underlying disks and
> then merged. And for a discard bio, bio->bi_io_vec is NULL. So when a
> discard bio ends up in the merge path, bio->bi_io_vec will be
> dereferenced unconditionally.

Ah, so it's not a NULL req->bio but bio->bi_io_vec.

> How about the following simple fix:
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 3ca6fbf8b787..31f460422fe3 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -740,6 +740,9 @@ u8 bio_seg_gap(struct request_queue *q, struct bio *prev, struct bio *next,
>  	gaps_bit = min_not_zero(gaps_bit, prev->bi_bvec_gap_bit);
>  	gaps_bit = min_not_zero(gaps_bit, next->bi_bvec_gap_bit);
>
> +	if (op_is_discard(prev->bi_opf) || op_is_discard(next->bi_opf))
> +		return gaps_bit;
> +
>  	bio_get_last_bvec(prev, &pb);
>  	bio_get_first_bvec(next, &nb);
>  	if (!biovec_phys_mergeable(q, &pb, &nb))

I think the problem is how we even end up here?  The only merging
for discard should be the special multi-segment merge.  So I think
something higher up is messed up
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-11-11  9:39 ` Christoph Hellwig
@ 2025-11-11 10:14   ` Yu Kuai
  2025-11-11 13:25   ` Keith Busch
  0 siblings, 1 reply; 20+ messages in thread
From: Yu Kuai @ 2025-11-11 10:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Matthew Wilcox, Keith Busch, linux-nvme,
	linux-block, axboe, yukuai

Hi,

On 2025/11/11 17:39, Christoph Hellwig wrote:
> On Tue, Nov 11, 2025 at 05:36:39PM +0800, Yu Kuai wrote:
>> This can be reproduced 100% with branch for-6.19/block now, just:
>>
>> blkdiscard /dev/md0
>>
>> Where the discard IO will be split to different underlying disks and
>> then merged. And for a discard bio, bio->bi_io_vec is NULL. So when a
>> discard bio ends up in the merge path, bio->bi_io_vec will be
>> dereferenced unconditionally.
> I think the problem is how we even end up here?  The only merging
> for discard should be the special multi-segment merge.  So I think
> something higher up is messed up

At least from blk_try_merge(), blk_discard_mergable() does return false;
however, the following checks pass and we end up in the back merge path.

blk_try_merge
  if (blk_discard_mergable())
    // false due to max_discard_segments is 1
  else if (...)
    return ELEVATOR_BACK_MERGE

Perhaps you are suggesting to change this to:

 enum elv_merge blk_try_merge(struct request *rq, struct bio *bio)
 {
-	if (blk_discard_mergable(rq))
-		return ELEVATOR_DISCARD_MERGE;
-	else if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
+	if (req_op(rq) == REQ_OP_DISCARD) {
+		if (blk_discard_mergable((rq)))
+			return ELEVATOR_DISCARD_MERGE;
+		return ELEVATOR_NO_MERGE;
+	}
+
+	if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
 		return ELEVATOR_BACK_MERGE;
-	else if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
+	if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
 		return ELEVATOR_FRONT_MERGE;
 	return ELEVATOR_NO_MERGE;
 }

And the same for the other callers of blk_discard_mergable().

Thanks,
Kuai
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-11-11 10:14 ` Yu Kuai
@ 2025-11-11 13:25   ` Keith Busch
  2025-11-11 13:40   ` Christoph Hellwig
  0 siblings, 1 reply; 20+ messages in thread
From: Keith Busch @ 2025-11-11 13:25 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Christoph Hellwig, Matthew Wilcox, Keith Busch, linux-nvme,
	linux-block, axboe

On Tue, Nov 11, 2025 at 06:14:24PM +0800, Yu Kuai wrote:
> At least from blk_try_merge(), blk_discard_mergable() does return false;
> however, the following checks pass and we end up in the back merge path.
>
> blk_try_merge
>   if (blk_discard_mergable())
>     // false due to max_discard_segments is 1
>   else if (...)
>     return ELEVATOR_BACK_MERGE
>
> Perhaps you are suggesting to change this to:
>
>  enum elv_merge blk_try_merge(struct request *rq, struct bio *bio)
>  {
> -	if (blk_discard_mergable(rq))
> -		return ELEVATOR_DISCARD_MERGE;
> -	else if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
> +	if (req_op(rq) == REQ_OP_DISCARD) {
> +		if (blk_discard_mergable((rq)))
> +			return ELEVATOR_DISCARD_MERGE;
> +		return ELEVATOR_NO_MERGE;
> +	}
> +
> +	if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
>  		return ELEVATOR_BACK_MERGE;
> -	else if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
> +	if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
>  		return ELEVATOR_FRONT_MERGE;
>  	return ELEVATOR_NO_MERGE;
>  }
>
> And the same for the other callers of blk_discard_mergable().

Ah, so we're merging a discard for a device that doesn't support
vectored discard. I think we still want to be able to front/back merge
such requests, though.
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-11-11 13:25                 ` Keith Busch
@ 2025-11-11 13:40                   ` Christoph Hellwig
  2025-11-11 13:54                     ` Keith Busch
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2025-11-11 13:40 UTC (permalink / raw)
  To: Keith Busch
  Cc: Yu Kuai, Christoph Hellwig, Matthew Wilcox, Keith Busch,
	linux-nvme, linux-block, axboe

On Tue, Nov 11, 2025 at 08:25:38AM -0500, Keith Busch wrote:
> Ah, so we're merging a discard for a device that doesn't support
> vectored discard. I think we still want to be able to front/back merge
> such requests, though.

Yes, but purely based on bi_sector/bi_size, not based on the payload.
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-11-11 13:40                   ` Christoph Hellwig
@ 2025-11-11 13:54                     ` Keith Busch
  2025-11-11 13:58                       ` Christoph Hellwig
  2025-11-11 14:03                       ` Yu Kuai
  1 sibling, 2 replies; 20+ messages in thread
From: Keith Busch @ 2025-11-11 13:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Yu Kuai, Matthew Wilcox, Keith Busch, linux-nvme, linux-block, axboe

On Tue, Nov 11, 2025 at 02:40:01PM +0100, Christoph Hellwig wrote:
> On Tue, Nov 11, 2025 at 08:25:38AM -0500, Keith Busch wrote:
> > Ah, so we're merging a discard for a device that doesn't support
> > vectored discard. I think we still want to be able to front/back merge
> > such requests, though.
>
> Yes, but purely based on bi_sector/bi_size, not based on the payload.

This should do it:

---
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3ca6fbf8b7870..d3115d7469df0 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -737,6 +737,9 @@ u8 bio_seg_gap(struct request_queue *q, struct bio *prev, struct bio *next,
 {
 	struct bio_vec pb, nb;

+	if (!bio_has_data(prev))
+		return 0;
+
 	gaps_bit = min_not_zero(gaps_bit, prev->bi_bvec_gap_bit);
 	gaps_bit = min_not_zero(gaps_bit, next->bi_bvec_gap_bit);
--
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-11-11 13:54                     ` Keith Busch
@ 2025-11-11 13:58                       ` Christoph Hellwig
  1 sibling, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2025-11-11 13:58 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Yu Kuai, Matthew Wilcox, Keith Busch,
	linux-nvme, linux-block, axboe

On Tue, Nov 11, 2025 at 08:54:53AM -0500, Keith Busch wrote:
> On Tue, Nov 11, 2025 at 02:40:01PM +0100, Christoph Hellwig wrote:
> > On Tue, Nov 11, 2025 at 08:25:38AM -0500, Keith Busch wrote:
> > > Ah, so we're merging a discard for a device that doesn't support
> > > vectored discard. I think we still want to be able to front/back merge
> > > such requests, though.
> >
> > Yes, but purely based on bi_sector/bi_size, not based on the payload.
>
> This should do it:

Yes, that looks sensible.
* Re: [PATCHv5 1/2] block: accumulate memory segment gaps per bio
  2025-11-11 13:54                     ` Keith Busch
  2025-11-11 13:58                       ` Christoph Hellwig
@ 2025-11-11 14:03                       ` Yu Kuai
  1 sibling, 0 replies; 20+ messages in thread
From: Yu Kuai @ 2025-11-11 14:03 UTC (permalink / raw)
  To: Keith Busch, Christoph Hellwig
  Cc: Matthew Wilcox, Keith Busch, linux-nvme, linux-block, axboe, yukuai

Hi,

On 2025/11/11 21:54, Keith Busch wrote:
> On Tue, Nov 11, 2025 at 02:40:01PM +0100, Christoph Hellwig wrote:
>> On Tue, Nov 11, 2025 at 08:25:38AM -0500, Keith Busch wrote:
>>> Ah, so we're merging a discard for a device that doesn't support
>>> vectored discard. I think we still want to be able to front/back merge
>>> such requests, though.
>> Yes, but purely based on bi_sector/bi_size, not based on the payload.
> This should do it:
>
> ---
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 3ca6fbf8b7870..d3115d7469df0 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -737,6 +737,9 @@ u8 bio_seg_gap(struct request_queue *q, struct bio *prev, struct bio *next,
>  {
>  	struct bio_vec pb, nb;
>
> +	if (!bio_has_data(prev))
> +		return 0;
> +
>  	gaps_bit = min_not_zero(gaps_bit, prev->bi_bvec_gap_bit);
>  	gaps_bit = min_not_zero(gaps_bit, next->bi_bvec_gap_bit);
>
> --

This looks good.

Thanks,
Kuai
* [PATCHv5 2/2] nvme: remove virtual boundary for sgl capable devices
  2025-10-14 15:04 [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance Keith Busch
  2025-10-14 15:04 ` [PATCHv5 1/2] block: accumulate memory segment gaps per bio Keith Busch
@ 2025-10-14 15:04 ` Keith Busch
  2025-10-14 21:19   ` Martin K. Petersen
  2025-11-07  1:03 ` [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance Keith Busch
  2025-11-07  1:12 ` Jens Axboe
  3 siblings, 1 reply; 20+ messages in thread
From: Keith Busch @ 2025-10-14 15:04 UTC (permalink / raw)
  To: hch, linux-nvme, linux-block, axboe; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

The nvme virtual boundary is only required for the PRP format. Devices
that can use SGL for DMA don't need it for IO queues. Drop reporting it
for such devices; rdma fabrics controllers will continue to use the
limit as they currently don't report any boundary requirements, but tcp
and fc never needed it in the first place so they get to report no
virtual boundary.

Applications may continue to align to the same virtual boundaries for
optimization purposes if they want, and the driver will continue to
decide whether to use the PRP format the same as before if the IO
allows it.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/apple.c   |  1 +
 drivers/nvme/host/core.c    | 10 +++++-----
 drivers/nvme/host/fabrics.h |  6 ++++++
 drivers/nvme/host/fc.c      |  1 +
 drivers/nvme/host/nvme.h    |  7 +++++++
 drivers/nvme/host/pci.c     | 28 +++++++++++++++++++++++++---
 drivers/nvme/host/rdma.c    |  1 +
 drivers/nvme/host/tcp.c     |  1 +
 drivers/nvme/target/loop.c  |  1 +
 9 files changed, 48 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
index f35d3f71d14f3..15b3d07f8ccdd 100644
--- a/drivers/nvme/host/apple.c
+++ b/drivers/nvme/host/apple.c
@@ -1283,6 +1283,7 @@ static const struct nvme_ctrl_ops nvme_ctrl_ops = {
 	.reg_read64 = apple_nvme_reg_read64,
 	.free_ctrl = apple_nvme_free_ctrl,
 	.get_address = apple_nvme_get_address,
+	.get_virt_boundary = nvme_get_virt_boundary,
 };

 static void apple_nvme_async_probe(void *data, async_cookie_t cookie)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fa4181d7de736..63e15cce3699c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2069,13 +2069,13 @@ static u32 nvme_max_drv_segments(struct nvme_ctrl *ctrl)
 }

 static void nvme_set_ctrl_limits(struct nvme_ctrl *ctrl,
-		struct queue_limits *lim)
+		struct queue_limits *lim, bool is_admin)
 {
 	lim->max_hw_sectors = ctrl->max_hw_sectors;
 	lim->max_segments = min_t(u32, USHRT_MAX,
 		min_not_zero(nvme_max_drv_segments(ctrl), ctrl->max_segments));
 	lim->max_integrity_segments = ctrl->max_integrity_segments;
-	lim->virt_boundary_mask = NVME_CTRL_PAGE_SIZE - 1;
+	lim->virt_boundary_mask = ctrl->ops->get_virt_boundary(ctrl, is_admin);
 	lim->max_segment_size = UINT_MAX;
 	lim->dma_alignment = 3;
 }
@@ -2177,7 +2177,7 @@ static int nvme_update_ns_info_generic(struct nvme_ns *ns,
 	int ret;

 	lim = queue_limits_start_update(ns->disk->queue);
-	nvme_set_ctrl_limits(ns->ctrl, &lim);
+	nvme_set_ctrl_limits(ns->ctrl, &lim, false);

 	memflags = blk_mq_freeze_queue(ns->disk->queue);
 	ret = queue_limits_commit_update(ns->disk->queue, &lim);
@@ -2381,7 +2381,7 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
 	ns->head->lba_shift = id->lbaf[lbaf].ds;
 	ns->head->nuse = le64_to_cpu(id->nuse);
 	capacity = nvme_lba_to_sect(ns->head, le64_to_cpu(id->nsze));
-	nvme_set_ctrl_limits(ns->ctrl, &lim);
+	nvme_set_ctrl_limits(ns->ctrl, &lim, false);
 	nvme_configure_metadata(ns->ctrl, ns->head, id, nvm, info);
 	nvme_set_chunk_sectors(ns, id, &lim);
 	if (!nvme_update_disk_info(ns, id, &lim))
@@ -3589,7 +3589,7 @@ static int nvme_init_identify(struct nvme_ctrl *ctrl)
 		min_not_zero(ctrl->max_hw_sectors, max_hw_sectors);

 	lim = queue_limits_start_update(ctrl->admin_q);
-	nvme_set_ctrl_limits(ctrl, &lim);
+	nvme_set_ctrl_limits(ctrl, &lim, true);
 	ret = queue_limits_commit_update(ctrl->admin_q, &lim);
 	if (ret)
 		goto out_free;
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index 1b58ee7d0dcee..caf5503d08332 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -217,6 +217,12 @@ static inline unsigned int nvmf_nr_io_queues(struct nvmf_ctrl_options *opts)
 		min(opts->nr_poll_queues, num_online_cpus());
 }

+static inline unsigned long nvmf_get_virt_boundary(struct nvme_ctrl *ctrl,
+		bool is_admin)
+{
+	return 0;
+}
+
 int nvmf_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val);
 int nvmf_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val);
 int nvmf_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val);
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 03987f497a5b5..70c066c2e2d42 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -3360,6 +3360,7 @@ static const struct nvme_ctrl_ops nvme_fc_ctrl_ops = {
 	.submit_async_event = nvme_fc_submit_async_event,
 	.delete_ctrl = nvme_fc_delete_ctrl,
 	.get_address = nvmf_get_address,
+	.get_virt_boundary = nvmf_get_virt_boundary,
 };

 static void
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 102fae6a231c5..7f7cb823d60d8 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -558,6 +558,12 @@ static inline bool nvme_ns_has_pi(struct nvme_ns_head *head)
 	return head->pi_type && head->ms == head->pi_size;
 }

+static inline unsigned long nvme_get_virt_boundary(struct nvme_ctrl *ctrl,
+		bool is_admin)
+{
+	return NVME_CTRL_PAGE_SIZE - 1;
+}
+
 struct nvme_ctrl_ops {
 	const char *name;
 	struct module *module;
@@ -578,6 +584,7 @@ struct nvme_ctrl_ops {
 	int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
 	void (*print_device_info)(struct nvme_ctrl *ctrl);
 	bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
+	unsigned long (*get_virt_boundary)(struct nvme_ctrl *ctrl, bool is_admin);
 };

 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c916176bd9f05..3c1727df1e36f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -613,9 +613,22 @@ static inline enum nvme_use_sgl nvme_pci_use_sgls(struct nvme_dev *dev,
 	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;

 	if (nvmeq->qid && nvme_ctrl_sgl_supported(&dev->ctrl)) {
-		if (nvme_req(req)->flags & NVME_REQ_USERCMD)
-			return SGL_FORCED;
-		if (req->nr_integrity_segments > 1)
+		/*
+		 * When the controller is capable of using SGL, there are
+		 * several conditions that we force to use it:
+		 *
+		 * 1. A request containing page gaps within the controller's
+		 *    mask can not use the PRP format.
+		 *
+		 * 2. User commands use SGL because that lets the device
+		 *    validate the requested transfer lengths.
+		 *
+		 * 3. Multiple integrity segments must use SGL as that's the
+		 *    only way to describe such a command in NVMe.
+		 */
+		if (req_phys_gap_mask(req) & (NVME_CTRL_PAGE_SIZE - 1) ||
+		    nvme_req(req)->flags & NVME_REQ_USERCMD ||
+		    req->nr_integrity_segments > 1)
 			return SGL_FORCED;
 		return SGL_SUPPORTED;
 	}
@@ -3243,6 +3256,14 @@ static bool nvme_pci_supports_pci_p2pdma(struct nvme_ctrl *ctrl)
 	return dma_pci_p2pdma_supported(dev->dev);
 }

+static unsigned long nvme_pci_get_virt_boundary(struct nvme_ctrl *ctrl,
+		bool is_admin)
+{
+	if (!nvme_ctrl_sgl_supported(ctrl) || is_admin)
+		return NVME_CTRL_PAGE_SIZE - 1;
+	return 0;
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.name = "pcie",
 	.module = THIS_MODULE,
@@ -3257,6 +3278,7 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.get_address = nvme_pci_get_address,
 	.print_device_info = nvme_pci_print_device_info,
 	.supports_pci_p2pdma = nvme_pci_supports_pci_p2pdma,
+	.get_virt_boundary = nvme_pci_get_virt_boundary,
 };

 static int nvme_dev_map(struct nvme_dev *dev)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 190a4cfa8a5ee..35c0822edb2d7 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -2202,6 +2202,7 @@ static const struct nvme_ctrl_ops nvme_rdma_ctrl_ops = {
 	.delete_ctrl = nvme_rdma_delete_ctrl,
 	.get_address = nvmf_get_address,
 	.stop_ctrl = nvme_rdma_stop_ctrl,
+	.get_virt_boundary = nvme_get_virt_boundary,
 };

 /*
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 1413788ca7d52..82875351442a0 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2862,6 +2862,7 @@ static const struct nvme_ctrl_ops nvme_tcp_ctrl_ops = {
 	.delete_ctrl = nvme_tcp_delete_ctrl,
 	.get_address = nvme_tcp_get_address,
 	.stop_ctrl = nvme_tcp_stop_ctrl,
+	.get_virt_boundary = nvmf_get_virt_boundary,
 };

 static bool
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index f85a8441bcc6e..9fe88a489eb71 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -511,6 +511,7 @@ static const struct nvme_ctrl_ops nvme_loop_ctrl_ops = {
 	.submit_async_event = nvme_loop_submit_async_event,
 	.delete_ctrl = nvme_loop_delete_ctrl_host,
 	.get_address = nvmf_get_address,
+	.get_virt_boundary = nvme_get_virt_boundary,
 };

 static int nvme_loop_create_io_queues(struct nvme_loop_ctrl *ctrl)
--
2.47.3
* Re: [PATCHv5 2/2] nvme: remove virtual boundary for sgl capable devices
  2025-10-14 15:04 ` [PATCHv5 2/2] nvme: remove virtual boundary for sgl capable devices Keith Busch
@ 2025-10-14 21:19   ` Martin K. Petersen
  0 siblings, 0 replies; 20+ messages in thread
From: Martin K. Petersen @ 2025-10-14 21:19 UTC (permalink / raw)
  To: Keith Busch; +Cc: hch, linux-nvme, linux-block, axboe, Keith Busch

Keith,

> The nvme virtual boundary is only required for the PRP format. Devices
> that can use SGL for DMA don't need it for IO queues. Drop reporting
> it for such devices; rdma fabrics controllers will continue to use the
> limit as they currently don't report any boundary requirements, but
> tcp and fc never needed it in the first place so they get to report no
> virtual boundary.
>
> Applications may continue to align to the same virtual boundaries for
> optimization purposes if they want, and the driver will continue to
> decide whether to use the PRP format the same as before if the IO
> allows it.

Looks fine.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

--
Martin K. Petersen
* Re: [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance
  2025-10-14 15:04 [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance Keith Busch
  2025-10-14 15:04 ` [PATCHv5 1/2] block: accumulate memory segment gaps per bio Keith Busch
  2025-10-14 15:04 ` [PATCHv5 2/2] nvme: remove virtual boundary for sgl capable devices Keith Busch
@ 2025-11-07  1:03 ` Keith Busch
  2025-11-07  1:12   ` Jens Axboe
  3 siblings, 1 reply; 20+ messages in thread
From: Keith Busch @ 2025-11-07  1:03 UTC (permalink / raw)
  To: Keith Busch; +Cc: hch, linux-nvme, linux-block, axboe

On Tue, Oct 14, 2025 at 08:04:54AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> The purpose is to allow optimization decisions to happen per IO, and
> flexibility to utilize unaligned buffers for hardware that supports it.

Hi Jens,

Did you have a chance to consider this one? Thanks,
* Re: [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance
  2025-11-07  1:03 ` [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance Keith Busch
@ 2025-11-07  1:12   ` Jens Axboe
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2025-11-07  1:12 UTC (permalink / raw)
  To: Keith Busch, Keith Busch; +Cc: hch, linux-nvme, linux-block

On 11/6/25 6:03 PM, Keith Busch wrote:
> On Tue, Oct 14, 2025 at 08:04:54AM -0700, Keith Busch wrote:
>> From: Keith Busch <kbusch@kernel.org>
>>
>> The purpose is to allow optimization decisions to happen per IO, and
>> flexibility to utilize unaligned buffers for hardware that supports it.
>
> Hi Jens,
>
> Did you have a chance to consider this one? Thanks,

Looks like I missed this as it was during my last race^Wpto, now queued
up!

--
Jens Axboe
* Re: [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance
  2025-10-14 15:04 [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance Keith Busch
                   ` (2 preceding siblings ...)
  2025-11-07  1:03 ` [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance Keith Busch
@ 2025-11-07  1:12 ` Jens Axboe
  3 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2025-11-07  1:12 UTC (permalink / raw)
  To: hch, linux-nvme, linux-block, Keith Busch; +Cc: Keith Busch

On Tue, 14 Oct 2025 08:04:54 -0700, Keith Busch wrote:
> Previous version here:
>
> https://lore.kernel.org/linux-block/20251007175245.3898972-1-kbusch@meta.com/
>
> The purpose is to allow optimization decisions to happen per IO, and
> flexibility to utilize unaligned buffers for hardware that supports it.
>
> [...]

Applied, thanks!

[1/2] block: accumulate memory segment gaps per bio
      commit: 2f6b2565d43cdb5087cac23d530cca84aa3d897e
[2/2] nvme: remove virtual boundary for sgl capable devices
      commit: bc840b21a25a50f00e2b240329c09281506df387

Best regards,
--
Jens Axboe
end of thread, other threads:[~2025-11-11 14:03 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-14 15:04 [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance Keith Busch
2025-10-14 15:04 ` [PATCHv5 1/2] block: accumulate memory segment gaps per bio Keith Busch
2025-10-14 21:18   ` Martin K. Petersen
2025-10-15  4:08   ` Christoph Hellwig
2025-11-11  4:26   ` Matthew Wilcox
2025-11-11  4:50     ` Keith Busch
2025-11-11  7:14       ` Christoph Hellwig
2025-11-11  9:36         ` Yu Kuai
2025-11-11  9:39           ` Christoph Hellwig
2025-11-11 10:14             ` Yu Kuai
2025-11-11 13:25               ` Keith Busch
2025-11-11 13:40                 ` Christoph Hellwig
2025-11-11 13:54                   ` Keith Busch
2025-11-11 13:58                     ` Christoph Hellwig
2025-11-11 14:03                     ` Yu Kuai
2025-10-14 15:04 ` [PATCHv5 2/2] nvme: remove virtual boundary for sgl capable devices Keith Busch
2025-10-14 21:19   ` Martin K. Petersen
2025-11-07  1:03 ` [PATCHv5 0/2] block, nvme: removing virtual boundary mask reliance Keith Busch
2025-11-07  1:12   ` Jens Axboe
2025-11-07  1:12 ` Jens Axboe