* [PATCH 1/9] block: don't merge different kinds of P2P transfers in a single bio
2025-06-10 5:06 new DMA API conversion for nvme-pci Christoph Hellwig
@ 2025-06-10 5:06 ` Christoph Hellwig
2025-06-10 12:44 ` Leon Romanovsky
` (3 more replies)
2025-06-10 5:06 ` [PATCH 2/9] block: add scatterlist-less DMA mapping helpers Christoph Hellwig
` (7 subsequent siblings)
8 siblings, 4 replies; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-10 5:06 UTC (permalink / raw)
To: Jens Axboe
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
To get out of the DMA mapping helpers having to check every segment for
it's P2P status, ensure that bios either contain P2P transfers or non-P2P
transfers, and that a P2P bio only contains ranges from a single device.
This means we do the page zone access in the bio add path where it should
be still page hot, and will only have do the fairly expensive P2P topology
lookup once per bio down in the DMA mapping path, and only for already
marked bios.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
block/bio-integrity.c | 3 +++
block/bio.c | 20 +++++++++++++-------
include/linux/blk_types.h | 2 ++
3 files changed, 18 insertions(+), 7 deletions(-)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 10912988c8f5..6b077ca937f6 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -128,6 +128,9 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,
if (bip->bip_vcnt > 0) {
struct bio_vec *bv = &bip->bip_vec[bip->bip_vcnt - 1];
+ if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
+ return 0;
+
if (bvec_try_merge_hw_page(q, bv, page, len, offset)) {
bip->bip_iter.bi_size += len;
return len;
diff --git a/block/bio.c b/block/bio.c
index 3c0a558c90f5..92c512e876c8 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -930,8 +930,6 @@ static bool bvec_try_merge_page(struct bio_vec *bv, struct page *page,
return false;
if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
return false;
- if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
- return false;
if ((vec_end_addr & PAGE_MASK) != ((page_addr + off) & PAGE_MASK)) {
if (IS_ENABLED(CONFIG_KMSAN))
@@ -982,6 +980,9 @@ void __bio_add_page(struct bio *bio, struct page *page,
WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
WARN_ON_ONCE(bio_full(bio, len));
+ if (is_pci_p2pdma_page(page))
+ bio->bi_opf |= REQ_P2PDMA | REQ_NOMERGE;
+
bvec_set_page(&bio->bi_io_vec[bio->bi_vcnt], page, len, off);
bio->bi_iter.bi_size += len;
bio->bi_vcnt++;
@@ -1022,11 +1023,16 @@ int bio_add_page(struct bio *bio, struct page *page,
if (bio->bi_iter.bi_size > UINT_MAX - len)
return 0;
- if (bio->bi_vcnt > 0 &&
- bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
- page, len, offset)) {
- bio->bi_iter.bi_size += len;
- return len;
+ if (bio->bi_vcnt > 0) {
+ struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+
+ if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
+ return 0;
+
+ if (bvec_try_merge_page(bv, page, len, offset)) {
+ bio->bi_iter.bi_size += len;
+ return len;
+ }
}
if (bio->bi_vcnt >= bio->bi_max_vecs)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 3d1577f07c1c..2a02972dc17c 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -386,6 +386,7 @@ enum req_flag_bits {
__REQ_DRV, /* for driver use */
__REQ_FS_PRIVATE, /* for file system (submitter) use */
__REQ_ATOMIC, /* for atomic write operations */
+ __REQ_P2PDMA, /* contains P2P DMA pages */
/*
* Command specific flags, keep last:
*/
@@ -418,6 +419,7 @@ enum req_flag_bits {
#define REQ_DRV (__force blk_opf_t)(1ULL << __REQ_DRV)
#define REQ_FS_PRIVATE (__force blk_opf_t)(1ULL << __REQ_FS_PRIVATE)
#define REQ_ATOMIC (__force blk_opf_t)(1ULL << __REQ_ATOMIC)
+#define REQ_P2PDMA (__force blk_opf_t)(1ULL << __REQ_P2PDMA)
#define REQ_NOUNMAP (__force blk_opf_t)(1ULL << __REQ_NOUNMAP)
--
2.47.2
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH 1/9] block: don't merge different kinds of P2P transfers in a single bio
2025-06-10 5:06 ` [PATCH 1/9] block: don't merge different kinds of P2P transfers in a single bio Christoph Hellwig
@ 2025-06-10 12:44 ` Leon Romanovsky
2025-06-10 15:37 ` Keith Busch
` (2 subsequent siblings)
3 siblings, 0 replies; 65+ messages in thread
From: Leon Romanovsky @ 2025-06-10 12:44 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On Tue, Jun 10, 2025 at 07:06:39AM +0200, Christoph Hellwig wrote:
> To get out of the DMA mapping helpers having to check every segment for
> it's P2P status, ensure that bios either contain P2P transfers or non-P2P
> transfers, and that a P2P bio only contains ranges from a single device.
>
> This means we do the page zone access in the bio add path where it should
> be still page hot, and will only have do the fairly expensive P2P topology
> lookup once per bio down in the DMA mapping path, and only for already
> marked bios.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> block/bio-integrity.c | 3 +++
> block/bio.c | 20 +++++++++++++-------
> include/linux/blk_types.h | 2 ++
> 3 files changed, 18 insertions(+), 7 deletions(-)
>
Thanks,
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 1/9] block: don't merge different kinds of P2P transfers in a single bio
2025-06-10 5:06 ` [PATCH 1/9] block: don't merge different kinds of P2P transfers in a single bio Christoph Hellwig
2025-06-10 12:44 ` Leon Romanovsky
@ 2025-06-10 15:37 ` Keith Busch
2025-06-11 3:43 ` Christoph Hellwig
2025-06-12 6:24 ` Kanchan Joshi
2025-06-12 15:22 ` Logan Gunthorpe
3 siblings, 1 reply; 65+ messages in thread
From: Keith Busch @ 2025-06-10 15:37 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On Tue, Jun 10, 2025 at 07:06:39AM +0200, Christoph Hellwig wrote:
> To get out of the DMA mapping helpers having to check every segment for
> it's P2P status, ensure that bios either contain P2P transfers or non-P2P
> transfers, and that a P2P bio only contains ranges from a single device.
I may be out of the loop here. Is this an optimization to make something
easier for the DMA layer? I don't think there's any fundamental reason
why devices like nvme couldn't handle a command that uses memory mixed
among multiple devices and/or host memory, at least.
> This means we do the page zone access in the bio add path where it should
> be still page hot, and will only have do the fairly expensive P2P topology
> lookup once per bio down in the DMA mapping path, and only for already
> marked bios.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 1/9] block: don't merge different kinds of P2P transfers in a single bio
2025-06-10 15:37 ` Keith Busch
@ 2025-06-11 3:43 ` Christoph Hellwig
2025-06-11 16:26 ` Keith Busch
0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-11 3:43 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe,
linux-block, linux-nvme
On Tue, Jun 10, 2025 at 09:37:30AM -0600, Keith Busch wrote:
> On Tue, Jun 10, 2025 at 07:06:39AM +0200, Christoph Hellwig wrote:
> > To get out of the DMA mapping helpers having to check every segment for
> > it's P2P status, ensure that bios either contain P2P transfers or non-P2P
> > transfers, and that a P2P bio only contains ranges from a single device.
>
> I may be out of the loop here. Is this an optimization to make something
> easier for the DMA layer?
Yes. P2P that is based on a bus address (i.e. using a switch) uses
a completely different way to DMA MAP than the normal IOMMU or
direct mapping. So the optimization of collapsing all host physical
addresses into an iova can't work once it is present.
> I don't think there's any fundamental reason
> why devices like nvme couldn't handle a command that uses memory mixed
> among multiple devices and/or host memory, at least.
Sure, devices don't even see if an IOVA is P2P or not, this is all
host side.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 1/9] block: don't merge different kinds of P2P transfers in a single bio
2025-06-11 3:43 ` Christoph Hellwig
@ 2025-06-11 16:26 ` Keith Busch
2025-06-11 16:39 ` Logan Gunthorpe
0 siblings, 1 reply; 65+ messages in thread
From: Keith Busch @ 2025-06-11 16:26 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On Wed, Jun 11, 2025 at 05:43:16AM +0200, Christoph Hellwig wrote:
> On Tue, Jun 10, 2025 at 09:37:30AM -0600, Keith Busch wrote:
> > I may be out of the loop here. Is this an optimization to make something
> > easier for the DMA layer?
>
> Yes. P2P that is based on a bus address (i.e. using a switch) uses
> a completely different way to DMA MAP than the normal IOMMU or
> direct mapping. So the optimization of collapsing all host physical
> addresses into an iova can't work once it is present.
>
> > I don't think there's any fundamental reason
> > why devices like nvme couldn't handle a command that uses memory mixed
> > among multiple devices and/or host memory, at least.
>
> Sure, devices don't even see if an IOVA is P2P or not, this is all
> host side.
Sorry for my ignorant questions here, but I'm not sure how this setup
(P2P transactions with switches and IOMMU enabled) actually works and
would like to understand better.
If I recall correctly, the PCIe ACS features will default redirect
everything up to the root-complex when you have the IOMMU on. A device
can set its memory request TLP's Address Type field to have the switch
direct the transaction directly to a peer device instead, but how does
the nvme device know how to set the it memory request's AT field?
There's nothing that says a command's addresses are untranslated IOVAs
vs translated peer addresses, right? Lacking some mechanism to specify
what kind of address the nvme controller is dealing with, wouldn't you
be forced to map peer addresses with the IOMMU, having P2P transactions
make a round trip through it only using mapped IOVAs?
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 1/9] block: don't merge different kinds of P2P transfers in a single bio
2025-06-11 16:26 ` Keith Busch
@ 2025-06-11 16:39 ` Logan Gunthorpe
2025-06-11 16:41 ` Keith Busch
0 siblings, 1 reply; 65+ messages in thread
From: Logan Gunthorpe @ 2025-06-11 16:39 UTC (permalink / raw)
To: Keith Busch, Christoph Hellwig
Cc: Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, linux-block, linux-nvme
On 2025-06-11 10:26, Keith Busch wrote:
> If I recall correctly, the PCIe ACS features will default redirect
> everything up to the root-complex when you have the IOMMU on. A device
> can set its memory request TLP's Address Type field to have the switch
> direct the transaction directly to a peer device instead, but how does
> the nvme device know how to set the it memory request's AT field?
> There's nothing that says a command's addresses are untranslated IOVAs
> vs translated peer addresses, right? Lacking some mechanism to specify
> what kind of address the nvme controller is dealing with, wouldn't you
> be forced to map peer addresses with the IOMMU, having P2P transactions
> make a round trip through it only using mapped IOVAs?
That is all correct. In order to use P2P on a switch, with the IOMMU
enabled, it is currently required to disable ACS for the devices in
question. This is done with the command line parameter disable_acs_redir
or config_acs.
Logan
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 1/9] block: don't merge different kinds of P2P transfers in a single bio
2025-06-11 16:39 ` Logan Gunthorpe
@ 2025-06-11 16:41 ` Keith Busch
2025-06-11 19:41 ` Logan Gunthorpe
0 siblings, 1 reply; 65+ messages in thread
From: Keith Busch @ 2025-06-11 16:41 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Leon Romanovsky, Nitesh Shetty, linux-block,
linux-nvme
On Wed, Jun 11, 2025 at 10:39:17AM -0600, Logan Gunthorpe wrote:
>
>
> On 2025-06-11 10:26, Keith Busch wrote:
> > If I recall correctly, the PCIe ACS features will default redirect
> > everything up to the root-complex when you have the IOMMU on. A device
> > can set its memory request TLP's Address Type field to have the switch
> > direct the transaction directly to a peer device instead, but how does
> > the nvme device know how to set the it memory request's AT field?
> > There's nothing that says a command's addresses are untranslated IOVAs
> > vs translated peer addresses, right? Lacking some mechanism to specify
> > what kind of address the nvme controller is dealing with, wouldn't you
> > be forced to map peer addresses with the IOMMU, having P2P transactions
> > make a round trip through it only using mapped IOVAs?
>
> That is all correct. In order to use P2P on a switch, with the IOMMU
> enabled, it is currently required to disable ACS for the devices in
> question. This is done with the command line parameter disable_acs_redir
> or config_acs.
Is there some other mechansim that ensures a host memory mapped IOVA
doesn't collide with a PCI bus address then?
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 1/9] block: don't merge different kinds of P2P transfers in a single bio
2025-06-11 16:41 ` Keith Busch
@ 2025-06-11 19:41 ` Logan Gunthorpe
2025-06-11 20:00 ` Keith Busch
2025-06-12 4:57 ` Christoph Hellwig
0 siblings, 2 replies; 65+ messages in thread
From: Logan Gunthorpe @ 2025-06-11 19:41 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Leon Romanovsky, Nitesh Shetty, linux-block,
linux-nvme
On 2025-06-11 10:41, Keith Busch wrote:
> On Wed, Jun 11, 2025 at 10:39:17AM -0600, Logan Gunthorpe wrote:
>> On 2025-06-11 10:26, Keith Busch wrote:
>> That is all correct. In order to use P2P on a switch, with the IOMMU
>> enabled, it is currently required to disable ACS for the devices in
>> question. This is done with the command line parameter disable_acs_redir
>> or config_acs.
>
> Is there some other mechansim that ensures a host memory mapped IOVA
> doesn't collide with a PCI bus address then?
Yes, in the absence of a switch with ACS protection this can be a problem.
I haven't looked at this in a long time, but the iommu drivers reserve
regions where the PCI addresses are valid so no iova will be allocated
with a similar bus address. After a quick search, I believe today, this
is handled by iova_reserve_pci_windows().
Logan
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 1/9] block: don't merge different kinds of P2P transfers in a single bio
2025-06-11 19:41 ` Logan Gunthorpe
@ 2025-06-11 20:00 ` Keith Busch
2025-06-12 4:57 ` Christoph Hellwig
1 sibling, 0 replies; 65+ messages in thread
From: Keith Busch @ 2025-06-11 20:00 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Leon Romanovsky, Nitesh Shetty, linux-block,
linux-nvme
On Wed, Jun 11, 2025 at 01:41:54PM -0600, Logan Gunthorpe wrote:
> On 2025-06-11 10:41, Keith Busch wrote:
> > Is there some other mechansim that ensures a host memory mapped IOVA
> > doesn't collide with a PCI bus address then?
>
> Yes, in the absence of a switch with ACS protection this can be a problem.
>
> I haven't looked at this in a long time, but the iommu drivers reserve
> regions where the PCI addresses are valid so no iova will be allocated
> with a similar bus address. After a quick search, I believe today, this
> is handled by iova_reserve_pci_windows().
Excellent, I think that was the piece I was missing. Thanks!
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 1/9] block: don't merge different kinds of P2P transfers in a single bio
2025-06-11 19:41 ` Logan Gunthorpe
2025-06-11 20:00 ` Keith Busch
@ 2025-06-12 4:57 ` Christoph Hellwig
1 sibling, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-12 4:57 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Keith Busch, Christoph Hellwig, Jens Axboe, Sagi Grimberg,
Chaitanya Kulkarni, Kanchan Joshi, Leon Romanovsky, Nitesh Shetty,
linux-block, linux-nvme
On Wed, Jun 11, 2025 at 01:41:54PM -0600, Logan Gunthorpe wrote:
> > Is there some other mechansim that ensures a host memory mapped IOVA
> > doesn't collide with a PCI bus address then?
>
> Yes, in the absence of a switch with ACS protection this can be a problem.
>
> I haven't looked at this in a long time, but the iommu drivers reserve
> regions where the PCI addresses are valid so no iova will be allocated
> with a similar bus address. After a quick search, I believe today, this
> is handled by iova_reserve_pci_windows().
Exactly.
Fun side story: the CMB decoding for commands in the NVMe spec relies on
this to not corrupt data as it tries to match an IOVA against a PCI bus
addresses. A certain very big hypervisor vendor did not reserve the
space like this and it caused data corruption due to this broken nvme
feature.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 1/9] block: don't merge different kinds of P2P transfers in a single bio
2025-06-10 5:06 ` [PATCH 1/9] block: don't merge different kinds of P2P transfers in a single bio Christoph Hellwig
2025-06-10 12:44 ` Leon Romanovsky
2025-06-10 15:37 ` Keith Busch
@ 2025-06-12 6:24 ` Kanchan Joshi
2025-06-13 6:19 ` Christoph Hellwig
2025-06-12 15:22 ` Logan Gunthorpe
3 siblings, 1 reply; 65+ messages in thread
From: Kanchan Joshi @ 2025-06-12 6:24 UTC (permalink / raw)
To: Christoph Hellwig, Jens Axboe
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Leon Romanovsky,
Nitesh Shetty, Logan Gunthorpe, linux-block, linux-nvme
On 6/10/2025 10:36 AM, Christoph Hellwig wrote:
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -930,8 +930,6 @@ static bool bvec_try_merge_page(struct bio_vec *bv, struct page *page,
> return false;
> if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
> return false;
> - if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
> - return false;
I did not understand the value of moving this out to its two callers
(bio_add_page and bio_integrity_add_page).
Since this check existed, I am a bit confused. The thing that patch
title says - is not a new addition and used to happen earlier too?
Or is this about setting REQ_NOMERGE in bio?
Flagging P2P bio with the new flag REQ_P2PDMA seems preparatory work to
optimize stuff in subsequent patches.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 1/9] block: don't merge different kinds of P2P transfers in a single bio
2025-06-12 6:24 ` Kanchan Joshi
@ 2025-06-13 6:19 ` Christoph Hellwig
0 siblings, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-13 6:19 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
Chaitanya Kulkarni, Leon Romanovsky, Nitesh Shetty,
Logan Gunthorpe, linux-block, linux-nvme
On Thu, Jun 12, 2025 at 11:54:21AM +0530, Kanchan Joshi wrote:
> On 6/10/2025 10:36 AM, Christoph Hellwig wrote:
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -930,8 +930,6 @@ static bool bvec_try_merge_page(struct bio_vec *bv, struct page *page,
> > return false;
> > if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
> > return false;
> > - if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
> > - return false;
>
> I did not understand the value of moving this out to its two callers
> (bio_add_page and bio_integrity_add_page).
>
> Since this check existed, I am a bit confused. The thing that patch
> title says - is not a new addition and used to happen earlier too?
> Or is this about setting REQ_NOMERGE in bio?
It is about not merging mismatch pgmaps into a bio, for which it needs
to be in the caller. The current code only prevents merges into the
same bio_vec.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 1/9] block: don't merge different kinds of P2P transfers in a single bio
2025-06-10 5:06 ` [PATCH 1/9] block: don't merge different kinds of P2P transfers in a single bio Christoph Hellwig
` (2 preceding siblings ...)
2025-06-12 6:24 ` Kanchan Joshi
@ 2025-06-12 15:22 ` Logan Gunthorpe
3 siblings, 0 replies; 65+ messages in thread
From: Logan Gunthorpe @ 2025-06-12 15:22 UTC (permalink / raw)
To: Christoph Hellwig, Jens Axboe
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, linux-block, linux-nvme
On 2025-06-09 23:06, Christoph Hellwig wrote:
> To get out of the DMA mapping helpers having to check every segment for
> it's P2P status, ensure that bios either contain P2P transfers or non-P2P
> transfers, and that a P2P bio only contains ranges from a single device.
>
> This means we do the page zone access in the bio add path where it should
> be still page hot, and will only have do the fairly expensive P2P topology
> lookup once per bio down in the DMA mapping path, and only for already
> marked bios.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Looks good to me:
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH 2/9] block: add scatterlist-less DMA mapping helpers
2025-06-10 5:06 new DMA API conversion for nvme-pci Christoph Hellwig
2025-06-10 5:06 ` [PATCH 1/9] block: don't merge different kinds of P2P transfers in a single bio Christoph Hellwig
@ 2025-06-10 5:06 ` Christoph Hellwig
2025-06-10 12:51 ` Leon Romanovsky
` (2 more replies)
2025-06-10 5:06 ` [PATCH 3/9] nvme-pci: simplify nvme_pci_metadata_use_sgls Christoph Hellwig
` (6 subsequent siblings)
8 siblings, 3 replies; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-10 5:06 UTC (permalink / raw)
To: Jens Axboe
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
Add a new blk_rq_dma_map / blk_rq_dma_unmap pair that does away with
the wasteful scatterlist structure. Instead it uses the mapping iterator
to either add segments to the IOVA for IOMMU operations, or just maps
them one by one for the direct mapping. For the IOMMU case instead of
a scatterlist with an entry for each segment, only a single [dma_addr,len]
pair needs to be stored for processing a request, and for the direct
mapping the per-segment allocation shrinks from
[page,offset,len,dma_addr,dma_len] to just [dma_addr,len].
One big difference to the scatterlist API, which could be considered
downside, is that the IOVA collapsing only works when the driver sets
a virt_boundary that matches the IOMMU granule. For NVMe this is done
already so it works perfectly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
block/blk-mq-dma.c | 162 +++++++++++++++++++++++++++++++++++++
include/linux/blk-mq-dma.h | 63 +++++++++++++++
2 files changed, 225 insertions(+)
create mode 100644 include/linux/blk-mq-dma.h
diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
index 82bae475dfa4..37f8fba077e6 100644
--- a/block/blk-mq-dma.c
+++ b/block/blk-mq-dma.c
@@ -2,6 +2,7 @@
/*
* Copyright (C) 2025 Christoph Hellwig
*/
+#include <linux/blk-mq-dma.h>
#include "blk.h"
struct phys_vec {
@@ -61,6 +62,167 @@ static bool blk_map_iter_next(struct request *req, struct req_iterator *iter,
return true;
}
+/*
+ * The IOVA-based DMA API wants to be able to coalesce at the minimal IOMMU page
+ * size granularity (which is guaranteed to be <= PAGE_SIZE and usually 4k), so
+ * we need to ensure our segments are aligned to this as well.
+ *
+ * Note that there is no point in using the slightly more complicated IOVA based
+ * path for single segment mappings.
+ */
+static inline bool blk_can_dma_map_iova(struct request *req,
+ struct device *dma_dev)
+{
+ return !((queue_virt_boundary(req->q) + 1) &
+ dma_get_merge_boundary(dma_dev));
+}
+
+static bool blk_dma_map_bus(struct request *req, struct device *dma_dev,
+ struct blk_dma_iter *iter, struct phys_vec *vec)
+{
+ iter->addr = pci_p2pdma_bus_addr_map(&iter->p2pdma, vec->paddr);
+ iter->len = vec->len;
+ return true;
+}
+
+static bool blk_dma_map_direct(struct request *req, struct device *dma_dev,
+ struct blk_dma_iter *iter, struct phys_vec *vec)
+{
+ iter->addr = dma_map_page(dma_dev, phys_to_page(vec->paddr),
+ offset_in_page(vec->paddr), vec->len, rq_dma_dir(req));
+ if (dma_mapping_error(dma_dev, iter->addr)) {
+ iter->status = BLK_STS_RESOURCE;
+ return false;
+ }
+ iter->len = vec->len;
+ return true;
+}
+
+static bool blk_rq_dma_map_iova(struct request *req, struct device *dma_dev,
+ struct dma_iova_state *state, struct blk_dma_iter *iter,
+ struct phys_vec *vec)
+{
+ enum dma_data_direction dir = rq_dma_dir(req);
+ unsigned int mapped = 0;
+ int error = 0;
+
+ iter->addr = state->addr;
+ iter->len = dma_iova_size(state);
+
+ do {
+ error = dma_iova_link(dma_dev, state, vec->paddr, mapped,
+ vec->len, dir, 0);
+ if (error)
+ break;
+ mapped += vec->len;
+ } while (blk_map_iter_next(req, &iter->iter, vec));
+
+ error = dma_iova_sync(dma_dev, state, 0, mapped);
+ if (error) {
+ iter->status = errno_to_blk_status(error);
+ return false;
+ }
+
+ return true;
+}
+
+/**
+ * blk_rq_dma_map_iter_start - map the first DMA segment for a request
+ * @req: request to map
+ * @dma_dev: device to map to
+ * @state: DMA IOVA state
+ * @iter: block layer DMA iterator
+ *
+ * Start DMA mapping @req to @dma_dev. @state and @iter are provided by the
+ * caller and don't need to be initialized. @state needs to be stored for use
+ * at unmap time, @iter is only needed at map time.
+ *
+ * Returns %false if there is no segment to map, including due to an error, or
+ * %true ft it did map a segment.
+ *
+ * If a segment was mapped, the DMA address for it is returned in @iter.addr and
+ * the length in @iter.len. If no segment was mapped the status code is
+ * returned in @iter.status.
+ *
+ * The caller can call blk_rq_dma_map_coalesce() to check if further segments
+ * need to be mapped after this, or go straight to blk_rq_dma_map_iter_next()
+ * to try to map the following segments.
+ */
+bool blk_rq_dma_map_iter_start(struct request *req, struct device *dma_dev,
+ struct dma_iova_state *state, struct blk_dma_iter *iter)
+{
+ unsigned int total_len = blk_rq_payload_bytes(req);
+ struct phys_vec vec;
+
+ iter->iter.bio = req->bio;
+ iter->iter.iter = req->bio->bi_iter;
+ memset(&iter->p2pdma, 0, sizeof(iter->p2pdma));
+ iter->status = BLK_STS_OK;
+
+ /*
+ * Grab the first segment ASAP because we'll need it to check for P2P
+ * transfers.
+ */
+ if (!blk_map_iter_next(req, &iter->iter, &vec))
+ return false;
+
+ if (IS_ENABLED(CONFIG_PCI_P2PDMA) && (req->cmd_flags & REQ_P2PDMA)) {
+ switch (pci_p2pdma_state(&iter->p2pdma, dma_dev,
+ phys_to_page(vec.paddr))) {
+ case PCI_P2PDMA_MAP_BUS_ADDR:
+ return blk_dma_map_bus(req, dma_dev, iter, &vec);
+ case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+ /*
+ * P2P transfers through the host bridge are treated the
+ * same as non-P2P transfers below and during unmap.
+ */
+ req->cmd_flags &= ~REQ_P2PDMA;
+ break;
+ default:
+ iter->status = BLK_STS_INVAL;
+ return false;
+ }
+ }
+
+ if (blk_can_dma_map_iova(req, dma_dev) &&
+ dma_iova_try_alloc(dma_dev, state, vec.paddr, total_len))
+ return blk_rq_dma_map_iova(req, dma_dev, state, iter, &vec);
+ return blk_dma_map_direct(req, dma_dev, iter, &vec);
+}
+EXPORT_SYMBOL_GPL(blk_rq_dma_map_iter_start);
+
+/**
+ * blk_rq_dma_map_iter_next - map the next DMA segment for a request
+ * @req: request to map
+ * @dma_dev: device to map to
+ * @state: DMA IOVA state
+ * @iter: block layer DMA iterator
+ *
+ * Iterate to the next mapping after a previous call to
+ * blk_rq_dma_map_iter_start(). See there for a detailed description of the
+ * arguments.
+ *
+ * Returns %false if there is no segment to map, including due to an error, or
+ * %true ft it did map a segment.
+ *
+ * If a segment was mapped, the DMA address for it is returned in @iter.addr and
+ * the length in @iter.len. If no segment was mapped the status code is
+ * returned in @iter.status.
+ */
+bool blk_rq_dma_map_iter_next(struct request *req, struct device *dma_dev,
+ struct dma_iova_state *state, struct blk_dma_iter *iter)
+{
+ struct phys_vec vec;
+
+ if (!blk_map_iter_next(req, &iter->iter, &vec))
+ return false;
+
+ if (iter->p2pdma.map == PCI_P2PDMA_MAP_BUS_ADDR)
+ return blk_dma_map_bus(req, dma_dev, iter, &vec);
+ return blk_dma_map_direct(req, dma_dev, iter, &vec);
+}
+EXPORT_SYMBOL_GPL(blk_rq_dma_map_iter_next);
+
static inline struct scatterlist *
blk_next_sg(struct scatterlist **sg, struct scatterlist *sglist)
{
diff --git a/include/linux/blk-mq-dma.h b/include/linux/blk-mq-dma.h
new file mode 100644
index 000000000000..c26a01aeae00
--- /dev/null
+++ b/include/linux/blk-mq-dma.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BLK_MQ_DMA_H
+#define BLK_MQ_DMA_H
+
+#include <linux/blk-mq.h>
+#include <linux/pci-p2pdma.h>
+
+struct blk_dma_iter {
+ /* Output address range for this iteration */
+ dma_addr_t addr;
+ u32 len;
+
+ /* Status code. Only valid when blk_rq_dma_map_iter_* returned false */
+ blk_status_t status;
+
+ /* Internal to blk_rq_dma_map_iter_* */
+ struct req_iterator iter;
+ struct pci_p2pdma_map_state p2pdma;
+};
+
+bool blk_rq_dma_map_iter_start(struct request *req, struct device *dma_dev,
+ struct dma_iova_state *state, struct blk_dma_iter *iter);
+bool blk_rq_dma_map_iter_next(struct request *req, struct device *dma_dev,
+ struct dma_iova_state *state, struct blk_dma_iter *iter);
+
+/**
+ * blk_rq_dma_map_coalesce - were all segments coalesced?
+ * @state: DMA state to check
+ *
+ * Returns true if blk_rq_dma_map_iter_start coalesced all segments into a
+ * single DMA range.
+ */
+static inline bool blk_rq_dma_map_coalesce(struct dma_iova_state *state)
+{
+ return dma_use_iova(state);
+}
+
+/**
+ * blk_rq_dma_unmap - try to DMA unmap a request
+ * @req: request to unmap
+ * @dma_dev: device to unmap from
+ * @state: DMA IOVA state
+ * @mapped_len: number of bytes to unmap
+ *
+ * Returns %false if the callers need to manually unmap every DMA segment
+ * mapped using @iter or %true if no work is left to be done.
+ */
+static inline bool blk_rq_dma_unmap(struct request *req, struct device *dma_dev,
+ struct dma_iova_state *state, size_t mapped_len)
+{
+ if (req->cmd_flags & REQ_P2PDMA)
+ return true;
+
+ if (dma_use_iova(state)) {
+ dma_iova_destroy(dma_dev, state, mapped_len, rq_dma_dir(req),
+ 0);
+ return true;
+ }
+
+ return !dma_need_unmap(dma_dev);
+}
+
+#endif /* BLK_MQ_DMA_H */
--
2.47.2
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH 2/9] block: add scatterlist-less DMA mapping helpers
2025-06-10 5:06 ` [PATCH 2/9] block: add scatterlist-less DMA mapping helpers Christoph Hellwig
@ 2025-06-10 12:51 ` Leon Romanovsky
2025-06-11 13:43 ` Daniel Gomez
2025-06-12 6:35 ` Kanchan Joshi
2 siblings, 0 replies; 65+ messages in thread
From: Leon Romanovsky @ 2025-06-10 12:51 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On Tue, Jun 10, 2025 at 07:06:40AM +0200, Christoph Hellwig wrote:
> Add a new blk_rq_dma_map / blk_rq_dma_unmap pair that does away with
> the wasteful scatterlist structure. Instead it uses the mapping iterator
> to either add segments to the IOVA for IOMMU operations, or just maps
> them one by one for the direct mapping. For the IOMMU case instead of
> a scatterlist with an entry for each segment, only a single [dma_addr,len]
> pair needs to be stored for processing a request, and for the direct
> mapping the per-segment allocation shrinks from
> [page,offset,len,dma_addr,dma_len] to just [dma_addr,len].
>
> One big difference to the scatterlist API, which could be considered
> downside, is that the IOVA collapsing only works when the driver sets
> a virt_boundary that matches the IOMMU granule. For NVMe this is done
> already so it works perfectly.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> block/blk-mq-dma.c | 162 +++++++++++++++++++++++++++++++++++++
> include/linux/blk-mq-dma.h | 63 +++++++++++++++
> 2 files changed, 225 insertions(+)
> create mode 100644 include/linux/blk-mq-dma.h
>
Thanks,
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 2/9] block: add scatterlist-less DMA mapping helpers
2025-06-10 5:06 ` [PATCH 2/9] block: add scatterlist-less DMA mapping helpers Christoph Hellwig
2025-06-10 12:51 ` Leon Romanovsky
@ 2025-06-11 13:43 ` Daniel Gomez
2025-06-16 5:02 ` Christoph Hellwig
2025-06-12 6:35 ` Kanchan Joshi
2 siblings, 1 reply; 65+ messages in thread
From: Daniel Gomez @ 2025-06-11 13:43 UTC (permalink / raw)
To: Christoph Hellwig, Jens Axboe
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On 10/06/2025 07.06, Christoph Hellwig wrote:
> Add a new blk_rq_dma_map / blk_rq_dma_unmap pair that does away with
> the wasteful scatterlist structure. Instead it uses the mapping iterator
> to either add segments to the IOVA for IOMMU operations, or just maps
> them one by one for the direct mapping. For the IOMMU case instead of
> a scatterlist with an entry for each segment, only a single [dma_addr,len]
> pair needs to be stored for processing a request, and for the direct
> mapping the per-segment allocation shrinks from
> [page,offset,len,dma_addr,dma_len] to just [dma_addr,len].
>
> One big difference to the scatterlist API, which could be considered
> downside, is that the IOVA collapsing only works when the driver sets
> a virt_boundary that matches the IOMMU granule. For NVMe this is done
> already so it works perfectly.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> block/blk-mq-dma.c | 162 +++++++++++++++++++++++++++++++++++++
> include/linux/blk-mq-dma.h | 63 +++++++++++++++
> 2 files changed, 225 insertions(+)
> create mode 100644 include/linux/blk-mq-dma.h
>
> diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
> index 82bae475dfa4..37f8fba077e6 100644
> --- a/block/blk-mq-dma.c
> +++ b/block/blk-mq-dma.c
> +static bool blk_rq_dma_map_iova(struct request *req, struct device *dma_dev,
> + struct dma_iova_state *state, struct blk_dma_iter *iter,
> + struct phys_vec *vec)
> +{
> + enum dma_data_direction dir = rq_dma_dir(req);
> + unsigned int mapped = 0;
> + int error = 0;
error does not need to be initialized.
> +/**
> + * blk_rq_dma_map_iter_start - map the first DMA segment for a request
> + * @req: request to map
> + * @dma_dev: device to map to
> + * @state: DMA IOVA state
> + * @iter: block layer DMA iterator
> + *
> + * Start DMA mapping @req to @dma_dev. @state and @iter are provided by the
> + * caller and don't need to be initialized. @state needs to be stored for use
> + * at unmap time, @iter is only needed at map time.
> + *
> + * Returns %false if there is no segment to map, including due to an error, or
> + * %true ft it did map a segment.
> + *
> + * If a segment was mapped, the DMA address for it is returned in @iter.addr and
> + * the length in @iter.len. If no segment was mapped the status code is
> + * returned in @iter.status.
> + *
> + * The caller can call blk_rq_dma_map_coalesce() to check if further segments
> + * need to be mapped after this, or go straight to blk_rq_dma_map_iter_next()
> + * to try to map the following segments.
> + */
> +bool blk_rq_dma_map_iter_start(struct request *req, struct device *dma_dev,
> + struct dma_iova_state *state, struct blk_dma_iter *iter)
> +{
> + unsigned int total_len = blk_rq_payload_bytes(req);
> + struct phys_vec vec;
> +
> + iter->iter.bio = req->bio;
> + iter->iter.iter = req->bio->bi_iter;
> + memset(&iter->p2pdma, 0, sizeof(iter->p2pdma));
> + iter->status = BLK_STS_OK;
> +
> + /*
> + * Grab the first segment ASAP because we'll need it to check for P2P
> + * transfers.
> + */
> + if (!blk_map_iter_next(req, &iter->iter, &vec))
> + return false;
> +
> + if (IS_ENABLED(CONFIG_PCI_P2PDMA) && (req->cmd_flags & REQ_P2PDMA)) {
> + switch (pci_p2pdma_state(&iter->p2pdma, dma_dev,
> + phys_to_page(vec.paddr))) {
> + case PCI_P2PDMA_MAP_BUS_ADDR:
> + return blk_dma_map_bus(req, dma_dev, iter, &vec);
> + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> + /*
> + * P2P transfers through the host bridge are treated the
> + * same as non-P2P transfers below and during unmap.
> + */
> + req->cmd_flags &= ~REQ_P2PDMA;
> + break;
> + default:
> + iter->status = BLK_STS_INVAL;
> + return false;
> + }
> + }
> +
> + if (blk_can_dma_map_iova(req, dma_dev) &&
> + dma_iova_try_alloc(dma_dev, state, vec.paddr, total_len))
> + return blk_rq_dma_map_iova(req, dma_dev, state, iter, &vec);
> + return blk_dma_map_direct(req, dma_dev, iter, &vec);
> +}
> +EXPORT_SYMBOL_GPL(blk_rq_dma_map_iter_start);
> +
...
> diff --git a/include/linux/blk-mq-dma.h b/include/linux/blk-mq-dma.h
> new file mode 100644
> index 000000000000..c26a01aeae00
> --- /dev/null
> +++ b/include/linux/blk-mq-dma.h
> @@ -0,0 +1,63 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef BLK_MQ_DMA_H
> +#define BLK_MQ_DMA_H
> +
> +#include <linux/blk-mq.h>
> +#include <linux/pci-p2pdma.h>
> +
> +struct blk_dma_iter {
> + /* Output address range for this iteration */
> + dma_addr_t addr;
> + u32 len;
> +
> + /* Status code. Only valid when blk_rq_dma_map_iter_* returned false */
> + blk_status_t status;
This comment does not match with blk_rq_dma_map_iter_start(). It returns false
and status is BLK_STS_INVAL.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 2/9] block: add scatterlist-less DMA mapping helpers
2025-06-11 13:43 ` Daniel Gomez
@ 2025-06-16 5:02 ` Christoph Hellwig
2025-06-16 6:43 ` Daniel Gomez
0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-16 5:02 UTC (permalink / raw)
To: Daniel Gomez
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
Chaitanya Kulkarni, Kanchan Joshi, Leon Romanovsky, Nitesh Shetty,
Logan Gunthorpe, linux-block, linux-nvme
On Wed, Jun 11, 2025 at 03:43:07PM +0200, Daniel Gomez wrote:
> > +struct blk_dma_iter {
> > + /* Output address range for this iteration */
> > + dma_addr_t addr;
> > + u32 len;
> > +
> > + /* Status code. Only valid when blk_rq_dma_map_iter_* returned false */
> > + blk_status_t status;
>
> This comment does not match with blk_rq_dma_map_iter_start(). It returns false
> and status is BLK_STS_INVAL.
I went over you comment a few times and still don't understand it.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 2/9] block: add scatterlist-less DMA mapping helpers
2025-06-16 5:02 ` Christoph Hellwig
@ 2025-06-16 6:43 ` Daniel Gomez
2025-06-16 11:31 ` Christoph Hellwig
0 siblings, 1 reply; 65+ messages in thread
From: Daniel Gomez @ 2025-06-16 6:43 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe,
linux-block, linux-nvme
On 16/06/2025 07.02, Christoph Hellwig wrote:
> On Wed, Jun 11, 2025 at 03:43:07PM +0200, Daniel Gomez wrote:
>>> +struct blk_dma_iter {
>>> + /* Output address range for this iteration */
>>> + dma_addr_t addr;
>>> + u32 len;
>>> +
>>> + /* Status code. Only valid when blk_rq_dma_map_iter_* returned false */
>>> + blk_status_t status;
>>
>> This comment does not match with blk_rq_dma_map_iter_start(). It returns false
>> and status is BLK_STS_INVAL.
>
> I went over you comment a few times and still don't understand it.
The way I read the comment is that status is only valid when
blk_rq_dma_map_iter_* returns false.
But blk_rq_dma_map_iter_start() can return false and an invalid status (in the
default switch case).
Assuming the comment is the right thing, I'd expect a valid status for that
case:
--- a/block/blk-mq-dma.c
+++ b/block/blk-mq-dma.c
@@ -179,7 +179,6 @@ bool blk_rq_dma_map_iter_start(struct request *req, struct device *dma_dev,
req->cmd_flags &= ~REQ_P2PDMA;
break;
default:
- iter->status = BLK_STS_INVAL;
return false;
}
}
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 2/9] block: add scatterlist-less DMA mapping helpers
2025-06-16 6:43 ` Daniel Gomez
@ 2025-06-16 11:31 ` Christoph Hellwig
2025-06-16 12:37 ` Daniel Gomez
0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-16 11:31 UTC (permalink / raw)
To: Daniel Gomez
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
Chaitanya Kulkarni, Kanchan Joshi, Leon Romanovsky, Nitesh Shetty,
Logan Gunthorpe, linux-block, linux-nvme
On Mon, Jun 16, 2025 at 08:43:22AM +0200, Daniel Gomez wrote:
> >> This comment does not match with blk_rq_dma_map_iter_start(). It returns false
> >> and status is BLK_STS_INVAL.
> >
> > I went over you comment a few times and still don't understand it.
>
> The way I read the comment is that status is only valid when
> blk_rq_dma_map_iter_* returns false.
Yes.
> But blk_rq_dma_map_iter_start() can return false and an invalid status (in the
> default switch case).
The status field is valid. Your patch below leaves it uninitialized
instead, which leaves me even more confused than the previous comment.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 2/9] block: add scatterlist-less DMA mapping helpers
2025-06-16 11:31 ` Christoph Hellwig
@ 2025-06-16 12:37 ` Daniel Gomez
2025-06-16 12:42 ` Christoph Hellwig
0 siblings, 1 reply; 65+ messages in thread
From: Daniel Gomez @ 2025-06-16 12:37 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe,
linux-block, linux-nvme
On 16/06/2025 13.31, Christoph Hellwig wrote:
> On Mon, Jun 16, 2025 at 08:43:22AM +0200, Daniel Gomez wrote:
>>>> This comment does not match with blk_rq_dma_map_iter_start(). It returns false
>>>> and status is BLK_STS_INVAL.
>>>
>>> I went over you comment a few times and still don't understand it.
>>
>> The way I read the comment is that status is only valid when
>> blk_rq_dma_map_iter_* returns false.
>
> Yes.
>
>> But blk_rq_dma_map_iter_start() can return false and an invalid status (in the
>> default switch case).
>
> The status field is valid. Your patch below leaves it uninitialized
> instead, which leaves me even more confused than the previous comment.
>
status is initialized at line 160 after memset():
iter->status = BLK_STS_OK;
So, what the patch does is (for blk_rq_dma_map_iter_start()):
1. Initialize status to valid (line 160)
iter->status = BLK_STS_OK;
2. In the if/switch case default case (line 181-183):
It sets the status to invalid and *iter_start() returns false:
default:
iter->status = BLK_STS_INVAL;
return false;
Removing the invalid assignment "makes it" valid (because of the initialization)
and also matches with the blk_dma_iter data struct comment (lines 13-14 in
blk-mq-dma.h):
/* Status code. Only valid when blk_rq_dma_map_iter_* returned false */
blk_status_t status;
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 2/9] block: add scatterlist-less DMA mapping helpers
2025-06-16 12:37 ` Daniel Gomez
@ 2025-06-16 12:42 ` Christoph Hellwig
2025-06-16 12:52 ` Daniel Gomez
0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-16 12:42 UTC (permalink / raw)
To: Daniel Gomez
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
Chaitanya Kulkarni, Kanchan Joshi, Leon Romanovsky, Nitesh Shetty,
Logan Gunthorpe, linux-block, linux-nvme
On Mon, Jun 16, 2025 at 02:37:43PM +0200, Daniel Gomez wrote:
>
> 2. In the if/switch case default case (line 181-183):
> It sets the status to invalid and *iter_start() returns false:
>
> default:
> iter->status = BLK_STS_INVAL;
> return false;
>
>
> Removing the invalid assignment "makes it" valid (because of the
> initialization)
Huhhh? How do you make a field access valid by dropping an assignment?
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 2/9] block: add scatterlist-less DMA mapping helpers
2025-06-16 12:42 ` Christoph Hellwig
@ 2025-06-16 12:52 ` Daniel Gomez
2025-06-16 13:01 ` Christoph Hellwig
0 siblings, 1 reply; 65+ messages in thread
From: Daniel Gomez @ 2025-06-16 12:52 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe,
linux-block, linux-nvme
On 16/06/2025 14.42, Christoph Hellwig wrote:
> On Mon, Jun 16, 2025 at 02:37:43PM +0200, Daniel Gomez wrote:
>>
>> 2. In the if/switch case default case (line 181-183):
>> It sets the status to invalid and *iter_start() returns false:
>>
>> default:
>> iter->status = BLK_STS_INVAL;
>> return false;
>>
>>
>> Removing the invalid assignment "makes it" valid (because of the
>> initialization)
>
> Huhhh? How do you make a field access valid by dropping an assignment?
>
It should be valid already. But if the initialized value at line 160 can be
changed through the previous code then, this should suffice:
diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
index 37f8fba077e6..7d5ebe1a3f49 100644
--- a/block/blk-mq-dma.c
+++ b/block/blk-mq-dma.c
@@ -179,7 +179,7 @@ bool blk_rq_dma_map_iter_start(struct request *req, struct device *dma_dev,
req->cmd_flags &= ~REQ_P2PDMA;
break;
default:
- iter->status = BLK_STS_INVAL;
+ iter->status = BLK_STS_OK;
return false;
}
}
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH 2/9] block: add scatterlist-less DMA mapping helpers
2025-06-16 12:52 ` Daniel Gomez
@ 2025-06-16 13:01 ` Christoph Hellwig
0 siblings, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-16 13:01 UTC (permalink / raw)
To: Daniel Gomez
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
Chaitanya Kulkarni, Kanchan Joshi, Leon Romanovsky, Nitesh Shetty,
Logan Gunthorpe, linux-block, linux-nvme
On Mon, Jun 16, 2025 at 02:52:15PM +0200, Daniel Gomez wrote:
> It should be valid already. But if the initialized value at line 160 can be
> changed through the previous code then, this should suffice:
Why? If pci_p2pdma_state returns something other than
PCI_P2PDMA_MAP_BUS_ADDR or PCI_P2PDMA_MAP_THRU_HOST_BRIDGE, BLK_STS_INVAL
should be propagated up the callchain.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 2/9] block: add scatterlist-less DMA mapping helpers
2025-06-10 5:06 ` [PATCH 2/9] block: add scatterlist-less DMA mapping helpers Christoph Hellwig
2025-06-10 12:51 ` Leon Romanovsky
2025-06-11 13:43 ` Daniel Gomez
@ 2025-06-12 6:35 ` Kanchan Joshi
2025-06-13 6:17 ` Christoph Hellwig
2 siblings, 1 reply; 65+ messages in thread
From: Kanchan Joshi @ 2025-06-12 6:35 UTC (permalink / raw)
To: Christoph Hellwig, Jens Axboe
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Leon Romanovsky,
Nitesh Shetty, Logan Gunthorpe, linux-block, linux-nvme
On 6/10/2025 10:36 AM, Christoph Hellwig wrote:
> +static bool blk_dma_map_bus(struct request *req, struct device *dma_dev,
Both are not used in the function body.
> + */
> +bool blk_rq_dma_map_iter_start(struct request *req, struct device *dma_dev,
> + struct dma_iova_state *state, struct blk_dma_iter *iter)
> +{
> + unsigned int total_len = blk_rq_payload_bytes(req);
> + struct phys_vec vec;
> +
> + iter->iter.bio = req->bio;
> + iter->iter.iter = req->bio->bi_iter;
> + memset(&iter->p2pdma, 0, sizeof(iter->p2pdma));
Should this (or maybe p2pdma field itself) be compiled out using
CONFIG_PCI_P2PDMA.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 2/9] block: add scatterlist-less DMA mapping helpers
2025-06-12 6:35 ` Kanchan Joshi
@ 2025-06-13 6:17 ` Christoph Hellwig
0 siblings, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-13 6:17 UTC (permalink / raw)
To: Kanchan Joshi
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
Chaitanya Kulkarni, Leon Romanovsky, Nitesh Shetty,
Logan Gunthorpe, linux-block, linux-nvme
On Thu, Jun 12, 2025 at 12:05:18PM +0530, Kanchan Joshi wrote:
> On 6/10/2025 10:36 AM, Christoph Hellwig wrote:
> > +static bool blk_dma_map_bus(struct request *req, struct device *dma_dev,
>
> Both are not used in the function body.
I'll fix it.
> > + */
> > +bool blk_rq_dma_map_iter_start(struct request *req, struct device *dma_dev,
> > + struct dma_iova_state *state, struct blk_dma_iter *iter)
> > +{
> > + unsigned int total_len = blk_rq_payload_bytes(req);
> > + struct phys_vec vec;
> > +
> > + iter->iter.bio = req->bio;
> > + iter->iter.iter = req->bio->bi_iter;
> > + memset(&iter->p2pdma, 0, sizeof(iter->p2pdma));
> Should this (or maybe p2pdma field itself) be compiled out using
> CONFIG_PCI_P2PDMA.
We could, but is it really worth the effort?
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH 3/9] nvme-pci: simplify nvme_pci_metadata_use_sgls
2025-06-10 5:06 new DMA API conversion for nvme-pci Christoph Hellwig
2025-06-10 5:06 ` [PATCH 1/9] block: don't merge different kinds of P2P transfers in a single bio Christoph Hellwig
2025-06-10 5:06 ` [PATCH 2/9] block: add scatterlist-less DMA mapping helpers Christoph Hellwig
@ 2025-06-10 5:06 ` Christoph Hellwig
2025-06-10 12:52 ` Leon Romanovsky
2025-06-11 21:38 ` Keith Busch
2025-06-10 5:06 ` [PATCH 4/9] nvme-pci: refactor nvme_pci_use_sgls Christoph Hellwig
` (5 subsequent siblings)
8 siblings, 2 replies; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-10 5:06 UTC (permalink / raw)
To: Jens Axboe
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
Move the nvme_ctrl_meta_sgl_supported into the callers. 2 out of
three already have it, and the third will be simplified in follow on
patches.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/nvme/host/pci.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8ff12e415cb5..449a9950b46c 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -578,11 +578,8 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
spin_unlock(&nvmeq->sq_lock);
}
-static inline bool nvme_pci_metadata_use_sgls(struct nvme_dev *dev,
- struct request *req)
+static inline bool nvme_pci_metadata_use_sgls(struct request *req)
{
- if (!nvme_ctrl_meta_sgl_supported(&dev->ctrl))
- return false;
return req->nr_integrity_segments > 1 ||
nvme_req(req)->flags & NVME_REQ_USERCMD;
}
@@ -599,7 +596,7 @@ static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
return false;
if (!nvmeq->qid)
return false;
- if (nvme_pci_metadata_use_sgls(dev, req))
+ if (nvme_pci_metadata_use_sgls(req))
return true;
if (!sgl_threshold || avg_seg_size < sgl_threshold)
return nvme_req(req)->flags & NVME_REQ_USERCMD;
@@ -858,7 +855,8 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
struct bio_vec bv = req_bvec(req);
if (!is_pci_p2pdma_page(bv.bv_page)) {
- if (!nvme_pci_metadata_use_sgls(dev, req) &&
+ if ((!nvme_ctrl_meta_sgl_supported(&dev->ctrl) ||
+ !nvme_pci_metadata_use_sgls(req)) &&
(bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1)) +
bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)
return nvme_setup_prp_simple(dev, req,
@@ -981,7 +979,7 @@ static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req)
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
if ((iod->cmd.common.flags & NVME_CMD_SGL_METABUF) &&
- nvme_pci_metadata_use_sgls(dev, req))
+ nvme_pci_metadata_use_sgls(req))
return nvme_pci_setup_meta_sgls(dev, req);
return nvme_pci_setup_meta_mptr(dev, req);
}
--
2.47.2
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH 3/9] nvme-pci: simplify nvme_pci_metadata_use_sgls
2025-06-10 5:06 ` [PATCH 3/9] nvme-pci: simplify nvme_pci_metadata_use_sgls Christoph Hellwig
@ 2025-06-10 12:52 ` Leon Romanovsky
2025-06-11 21:38 ` Keith Busch
1 sibling, 0 replies; 65+ messages in thread
From: Leon Romanovsky @ 2025-06-10 12:52 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On Tue, Jun 10, 2025 at 07:06:41AM +0200, Christoph Hellwig wrote:
> Move the nvme_ctrl_meta_sgl_supported into the callers. 2 out of
> three already have it, and the third will be simplified in follow on
> patches.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> drivers/nvme/host/pci.c | 12 +++++-------
> 1 file changed, 5 insertions(+), 7 deletions(-)
>
Thanks,
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 3/9] nvme-pci: simplify nvme_pci_metadata_use_sgls
2025-06-10 5:06 ` [PATCH 3/9] nvme-pci: simplify nvme_pci_metadata_use_sgls Christoph Hellwig
2025-06-10 12:52 ` Leon Romanovsky
@ 2025-06-11 21:38 ` Keith Busch
2025-06-12 4:59 ` Christoph Hellwig
1 sibling, 1 reply; 65+ messages in thread
From: Keith Busch @ 2025-06-11 21:38 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On Tue, Jun 10, 2025 at 07:06:41AM +0200, Christoph Hellwig wrote:
> +static inline bool nvme_pci_metadata_use_sgls(struct request *req)
> {
> - if (!nvme_ctrl_meta_sgl_supported(&dev->ctrl))
> - return false;
> return req->nr_integrity_segments > 1 ||
> nvme_req(req)->flags & NVME_REQ_USERCMD;
> }
...
> @@ -981,7 +979,7 @@ static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req)
> struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
>
> if ((iod->cmd.common.flags & NVME_CMD_SGL_METABUF) &&
> - nvme_pci_metadata_use_sgls(dev, req))
> + nvme_pci_metadata_use_sgls(req))
> return nvme_pci_setup_meta_sgls(dev, req);
> return nvme_pci_setup_meta_mptr(dev, req);
> }
Am I missing something here? This looks like it forces user commands to
use metadata SGLs even if the controller doesn't support it.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 3/9] nvme-pci: simplify nvme_pci_metadata_use_sgls
2025-06-11 21:38 ` Keith Busch
@ 2025-06-12 4:59 ` Christoph Hellwig
0 siblings, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-12 4:59 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe,
linux-block, linux-nvme
On Wed, Jun 11, 2025 at 03:38:22PM -0600, Keith Busch wrote:
> On Tue, Jun 10, 2025 at 07:06:41AM +0200, Christoph Hellwig wrote:
> > +static inline bool nvme_pci_metadata_use_sgls(struct request *req)
> > {
> > - if (!nvme_ctrl_meta_sgl_supported(&dev->ctrl))
> > - return false;
> > return req->nr_integrity_segments > 1 ||
> > nvme_req(req)->flags & NVME_REQ_USERCMD;
> > }
>
> ...
>
>
> > @@ -981,7 +979,7 @@ static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req)
> > struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> >
> > if ((iod->cmd.common.flags & NVME_CMD_SGL_METABUF) &&
> > - nvme_pci_metadata_use_sgls(dev, req))
> > + nvme_pci_metadata_use_sgls(req))
> > return nvme_pci_setup_meta_sgls(dev, req);
> > return nvme_pci_setup_meta_mptr(dev, req);
> > }
>
> Am I missing something here? This looks like it forces user commands to
> use metadata SGLs even if the controller doesn't support it.
Yeah, looks like I messed up something here.
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH 4/9] nvme-pci: refactor nvme_pci_use_sgls
2025-06-10 5:06 new DMA API conversion for nvme-pci Christoph Hellwig
` (2 preceding siblings ...)
2025-06-10 5:06 ` [PATCH 3/9] nvme-pci: simplify nvme_pci_metadata_use_sgls Christoph Hellwig
@ 2025-06-10 5:06 ` Christoph Hellwig
2025-06-10 13:10 ` Leon Romanovsky
` (2 more replies)
2025-06-10 5:06 ` [PATCH 5/9] nvme-pci: merge the simple PRP and SGL setup into a common helper Christoph Hellwig
` (4 subsequent siblings)
8 siblings, 3 replies; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-10 5:06 UTC (permalink / raw)
To: Jens Axboe
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
Move the average segment size into a separate helper, and return a
tristate to distinguish the case where can use SGL vs where we have to
use SGLs. This will allow the simplify the code and make more efficient
decisions in follow on changes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/nvme/host/pci.c | 41 +++++++++++++++++++++++++++--------------
1 file changed, 27 insertions(+), 14 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 449a9950b46c..0b85c11d3c96 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -578,29 +578,39 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
spin_unlock(&nvmeq->sq_lock);
}
+enum nvme_use_sgl {
+ SGL_UNSUPPORTED,
+ SGL_SUPPORTED,
+ SGL_FORCED,
+};
+
static inline bool nvme_pci_metadata_use_sgls(struct request *req)
{
return req->nr_integrity_segments > 1 ||
nvme_req(req)->flags & NVME_REQ_USERCMD;
}
-static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
- int nseg)
+static inline enum nvme_use_sgl nvme_pci_use_sgls(struct nvme_dev *dev,
+ struct request *req)
{
struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
- unsigned int avg_seg_size;
- avg_seg_size = DIV_ROUND_UP(blk_rq_payload_bytes(req), nseg);
+ if (nvmeq->qid && nvme_ctrl_sgl_supported(&dev->ctrl)) {
+ if (nvme_req(req)->flags & NVME_REQ_USERCMD)
+ return SGL_FORCED;
+ if (nvme_pci_metadata_use_sgls(req))
+ return SGL_FORCED;
+ return SGL_SUPPORTED;
+ }
- if (!nvme_ctrl_sgl_supported(&dev->ctrl))
- return false;
- if (!nvmeq->qid)
- return false;
- if (nvme_pci_metadata_use_sgls(req))
- return true;
- if (!sgl_threshold || avg_seg_size < sgl_threshold)
- return nvme_req(req)->flags & NVME_REQ_USERCMD;
- return true;
+ return SGL_UNSUPPORTED;
+}
+
+static unsigned int nvme_pci_avg_seg_size(struct request *req)
+{
+ struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+
+ return DIV_ROUND_UP(blk_rq_payload_bytes(req), iod->sgt.nents);
}
static inline struct dma_pool *nvme_dma_pool(struct nvme_queue *nvmeq,
@@ -848,6 +858,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
{
struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ enum nvme_use_sgl use_sgl = nvme_pci_use_sgls(dev, req);
blk_status_t ret = BLK_STS_RESOURCE;
int rc;
@@ -886,7 +897,9 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
goto out_free_sg;
}
- if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
+ if (use_sgl == SGL_FORCED ||
+ (use_sgl == SGL_SUPPORTED &&
+ (!sgl_threshold || nvme_pci_avg_seg_size(req) < sgl_threshold)))
ret = nvme_pci_setup_sgls(nvmeq, req, &cmnd->rw);
else
ret = nvme_pci_setup_prps(nvmeq, req, &cmnd->rw);
--
2.47.2
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH 4/9] nvme-pci: refactor nvme_pci_use_sgls
2025-06-10 5:06 ` [PATCH 4/9] nvme-pci: refactor nvme_pci_use_sgls Christoph Hellwig
@ 2025-06-10 13:10 ` Leon Romanovsky
2025-06-11 13:43 ` Daniel Gomez
2025-06-11 20:50 ` Keith Busch
2 siblings, 0 replies; 65+ messages in thread
From: Leon Romanovsky @ 2025-06-10 13:10 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On Tue, Jun 10, 2025 at 07:06:42AM +0200, Christoph Hellwig wrote:
> Move the average segment size into a separate helper, and return a
> tristate to distinguish the case where can use SGL vs where we have to
> use SGLs. This will allow the simplify the code and make more efficient
> decisions in follow on changes.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> drivers/nvme/host/pci.c | 41 +++++++++++++++++++++++++++--------------
> 1 file changed, 27 insertions(+), 14 deletions(-)
>
Thanks,
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 4/9] nvme-pci: refactor nvme_pci_use_sgls
2025-06-10 5:06 ` [PATCH 4/9] nvme-pci: refactor nvme_pci_use_sgls Christoph Hellwig
2025-06-10 13:10 ` Leon Romanovsky
@ 2025-06-11 13:43 ` Daniel Gomez
2025-06-12 5:00 ` Christoph Hellwig
2025-06-11 20:50 ` Keith Busch
2 siblings, 1 reply; 65+ messages in thread
From: Daniel Gomez @ 2025-06-11 13:43 UTC (permalink / raw)
To: Christoph Hellwig, Jens Axboe
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On 10/06/2025 07.06, Christoph Hellwig wrote:
> Move the average segment size into a separate helper, and return a
> tristate to distinguish the case where can use SGL vs where we have to
> use SGLs. This will allow the simplify the code and make more efficient
> decisions in follow on changes.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> drivers/nvme/host/pci.c | 41 +++++++++++++++++++++++++++--------------
> 1 file changed, 27 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 449a9950b46c..0b85c11d3c96 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
...
> @@ -886,7 +897,9 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
> goto out_free_sg;
> }
>
> - if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
> + if (use_sgl == SGL_FORCED ||
> + (use_sgl == SGL_SUPPORTED &&
FORCED implies SUPPORTED, what about simplifying to:
if (use_sgl >= SGL_SUPPORTED)
or just do:
if (use_sgl)
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 4/9] nvme-pci: refactor nvme_pci_use_sgls
2025-06-11 13:43 ` Daniel Gomez
@ 2025-06-12 5:00 ` Christoph Hellwig
0 siblings, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-12 5:00 UTC (permalink / raw)
To: Daniel Gomez
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
Chaitanya Kulkarni, Kanchan Joshi, Leon Romanovsky, Nitesh Shetty,
Logan Gunthorpe, linux-block, linux-nvme
On Wed, Jun 11, 2025 at 03:43:36PM +0200, Daniel Gomez wrote:
> > - if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
> > + if (use_sgl == SGL_FORCED ||
> > + (use_sgl == SGL_SUPPORTED &&
>
> FORCED implies SUPPORTED, what about simplifying to:
>
> if (use_sgl >= SGL_SUPPORTED)
>
> or just do:
>
> if (use_sgl)
For forced we unconditionally enter the sgl branch, for supported
we also need an additional condition.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 4/9] nvme-pci: refactor nvme_pci_use_sgls
2025-06-10 5:06 ` [PATCH 4/9] nvme-pci: refactor nvme_pci_use_sgls Christoph Hellwig
2025-06-10 13:10 ` Leon Romanovsky
2025-06-11 13:43 ` Daniel Gomez
@ 2025-06-11 20:50 ` Keith Busch
2025-06-12 5:00 ` Christoph Hellwig
2 siblings, 1 reply; 65+ messages in thread
From: Keith Busch @ 2025-06-11 20:50 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On Tue, Jun 10, 2025 at 07:06:42AM +0200, Christoph Hellwig wrote:
> static inline bool nvme_pci_metadata_use_sgls(struct request *req)
> {
> return req->nr_integrity_segments > 1 ||
> nvme_req(req)->flags & NVME_REQ_USERCMD;
> }
>
> -static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
> - int nseg)
> +static inline enum nvme_use_sgl nvme_pci_use_sgls(struct nvme_dev *dev,
> + struct request *req)
> {
> struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
> - unsigned int avg_seg_size;
>
> - avg_seg_size = DIV_ROUND_UP(blk_rq_payload_bytes(req), nseg);
> + if (nvmeq->qid && nvme_ctrl_sgl_supported(&dev->ctrl)) {
> + if (nvme_req(req)->flags & NVME_REQ_USERCMD)
> + return SGL_FORCED;
> + if (nvme_pci_metadata_use_sgls(req))
> + return SGL_FORCED;
nvme_pci_metadata_use_sgls() already handles checking for
NVME_REQ_USERCMD flagged commands, so I don't think you need both of
these conditions to return FORCED.
> @@ -886,7 +897,9 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
> goto out_free_sg;
> }
>
> - if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
> + if (use_sgl == SGL_FORCED ||
> + (use_sgl == SGL_SUPPORTED &&
> + (!sgl_threshold || nvme_pci_avg_seg_size(req) < sgl_threshold)))
This looks backwards for deciding to use sgls in the non-forced case.
Shouldn't it be:
(sgl_threshold && nvme_pci_avg_seg_size(req) >= sgl_threshold)))
?
> ret = nvme_pci_setup_sgls(nvmeq, req, &cmnd->rw);
> else
> ret = nvme_pci_setup_prps(nvmeq, req, &cmnd->rw);
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 4/9] nvme-pci: refactor nvme_pci_use_sgls
2025-06-11 20:50 ` Keith Busch
@ 2025-06-12 5:00 ` Christoph Hellwig
0 siblings, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-12 5:00 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe,
linux-block, linux-nvme
On Wed, Jun 11, 2025 at 02:50:53PM -0600, Keith Busch wrote:
> > + if (use_sgl == SGL_FORCED ||
> > + (use_sgl == SGL_SUPPORTED &&
> > + (!sgl_threshold || nvme_pci_avg_seg_size(req) < sgl_threshold)))
>
> This looks backwards for deciding to use sgls in the non-forced case.
> Shouldn't it be:
>
> (sgl_threshold && nvme_pci_avg_seg_size(req) >= sgl_threshold)))
>
> ?
Yes.
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH 5/9] nvme-pci: merge the simple PRP and SGL setup into a common helper
2025-06-10 5:06 new DMA API conversion for nvme-pci Christoph Hellwig
` (3 preceding siblings ...)
2025-06-10 5:06 ` [PATCH 4/9] nvme-pci: refactor nvme_pci_use_sgls Christoph Hellwig
@ 2025-06-10 5:06 ` Christoph Hellwig
2025-06-10 13:13 ` Leon Romanovsky
` (2 more replies)
2025-06-10 5:06 ` [PATCH 6/9] nvme-pci: remove superfluous arguments Christoph Hellwig
` (3 subsequent siblings)
8 siblings, 3 replies; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-10 5:06 UTC (permalink / raw)
To: Jens Axboe
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
nvme_setup_prp_simple and nvme_setup_sgl_simple share a lot of logic.
Merge them into a single helper that makes use of the previously added
use_sgl tristate.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/nvme/host/pci.c | 77 +++++++++++++++++------------------------
1 file changed, 32 insertions(+), 45 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 0b85c11d3c96..50bb1ebe6810 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -814,42 +814,41 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_queue *nvmeq,
return BLK_STS_OK;
}
-static blk_status_t nvme_setup_prp_simple(struct nvme_dev *dev,
- struct request *req, struct nvme_rw_command *cmnd,
- struct bio_vec *bv)
+static blk_status_t nvme_pci_setup_data_simple(struct request *req,
+ enum nvme_use_sgl use_sgl)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
- unsigned int offset = bv->bv_offset & (NVME_CTRL_PAGE_SIZE - 1);
- unsigned int first_prp_len = NVME_CTRL_PAGE_SIZE - offset;
-
- iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
- if (dma_mapping_error(dev->dev, iod->first_dma))
+ struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+ struct bio_vec bv = req_bvec(req);
+ unsigned int prp1_offset = bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1);
+ bool prp_possible = prp1_offset + bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2;
+ dma_addr_t dma_addr;
+
+ if (!use_sgl && !prp_possible)
+ return BLK_STS_AGAIN;
+ if (is_pci_p2pdma_page(bv.bv_page))
+ return BLK_STS_AGAIN;
+
+ dma_addr = dma_map_bvec(nvmeq->dev->dev, &bv, rq_dma_dir(req), 0);
+ if (dma_mapping_error(nvmeq->dev->dev, dma_addr))
return BLK_STS_RESOURCE;
- iod->dma_len = bv->bv_len;
-
- cmnd->dptr.prp1 = cpu_to_le64(iod->first_dma);
- if (bv->bv_len > first_prp_len)
- cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma + first_prp_len);
- else
- cmnd->dptr.prp2 = 0;
- return BLK_STS_OK;
-}
+ iod->dma_len = bv.bv_len;
-static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
- struct request *req, struct nvme_rw_command *cmnd,
- struct bio_vec *bv)
-{
- struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ if (use_sgl == SGL_FORCED || !prp_possible) {
+ iod->cmd.common.flags = NVME_CMD_SGL_METABUF;
+ iod->cmd.common.dptr.sgl.addr = cpu_to_le64(dma_addr);
+ iod->cmd.common.dptr.sgl.length = cpu_to_le32(bv.bv_len);
+ iod->cmd.common.dptr.sgl.type = NVME_SGL_FMT_DATA_DESC << 4;
+ } else {
+ unsigned int first_prp_len = NVME_CTRL_PAGE_SIZE - prp1_offset;
- iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
- if (dma_mapping_error(dev->dev, iod->first_dma))
- return BLK_STS_RESOURCE;
- iod->dma_len = bv->bv_len;
+ iod->cmd.common.dptr.prp1 = cpu_to_le64(dma_addr);
+ iod->cmd.common.dptr.prp2 = 0;
+ if (bv.bv_len > first_prp_len)
+ iod->cmd.common.dptr.prp2 =
+ cpu_to_le64(dma_addr + first_prp_len);
+ }
- cmnd->flags = NVME_CMD_SGL_METABUF;
- cmnd->dptr.sgl.addr = cpu_to_le64(iod->first_dma);
- cmnd->dptr.sgl.length = cpu_to_le32(iod->dma_len);
- cmnd->dptr.sgl.type = NVME_SGL_FMT_DATA_DESC << 4;
return BLK_STS_OK;
}
@@ -863,21 +862,9 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
int rc;
if (blk_rq_nr_phys_segments(req) == 1) {
- struct bio_vec bv = req_bvec(req);
-
- if (!is_pci_p2pdma_page(bv.bv_page)) {
- if ((!nvme_ctrl_meta_sgl_supported(&dev->ctrl) ||
- !nvme_pci_metadata_use_sgls(req)) &&
- (bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1)) +
- bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)
- return nvme_setup_prp_simple(dev, req,
- &cmnd->rw, &bv);
-
- if (nvmeq->qid && sgl_threshold &&
- nvme_ctrl_sgl_supported(&dev->ctrl))
- return nvme_setup_sgl_simple(dev, req,
- &cmnd->rw, &bv);
- }
+ ret = nvme_pci_setup_data_simple(req, use_sgl);
+ if (ret != BLK_STS_AGAIN)
+ return ret;
}
iod->dma_len = 0;
--
2.47.2
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH 5/9] nvme-pci: merge the simple PRP and SGL setup into a common helper
2025-06-10 5:06 ` [PATCH 5/9] nvme-pci: merge the simple PRP and SGL setup into a common helper Christoph Hellwig
@ 2025-06-10 13:13 ` Leon Romanovsky
2025-06-11 13:44 ` Daniel Gomez
2025-06-11 21:03 ` Keith Busch
2 siblings, 0 replies; 65+ messages in thread
From: Leon Romanovsky @ 2025-06-10 13:13 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On Tue, Jun 10, 2025 at 07:06:43AM +0200, Christoph Hellwig wrote:
> nvme_setup_prp_simple and nvme_setup_sgl_simple share a lot of logic.
> Merge them into a single helper that makes use of the previously added
> use_sgl tristate.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> drivers/nvme/host/pci.c | 77 +++++++++++++++++------------------------
> 1 file changed, 32 insertions(+), 45 deletions(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 0b85c11d3c96..50bb1ebe6810 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -814,42 +814,41 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_queue *nvmeq,
> return BLK_STS_OK;
> }
<...>
> -static blk_status_t nvme_setup_prp_simple(struct nvme_dev *dev,
> - struct request *req, struct nvme_rw_command *cmnd,
> - struct bio_vec *bv)
> +static blk_status_t nvme_pci_setup_data_simple(struct request *req,
> + enum nvme_use_sgl use_sgl)
<...>
> + if (!use_sgl && !prp_possible)
use_sgl is tristate, the better check will be use_sgl == SGL_UNSUPPORTED.
Thanks
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 5/9] nvme-pci: merge the simple PRP and SGL setup into a common helper
2025-06-10 5:06 ` [PATCH 5/9] nvme-pci: merge the simple PRP and SGL setup into a common helper Christoph Hellwig
2025-06-10 13:13 ` Leon Romanovsky
@ 2025-06-11 13:44 ` Daniel Gomez
2025-06-12 5:01 ` Christoph Hellwig
2025-06-11 21:03 ` Keith Busch
2 siblings, 1 reply; 65+ messages in thread
From: Daniel Gomez @ 2025-06-11 13:44 UTC (permalink / raw)
To: Christoph Hellwig, Jens Axboe
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On 10/06/2025 07.06, Christoph Hellwig wrote:
> nvme_setup_prp_simple and nvme_setup_sgl_simple share a lot of logic.
> Merge them into a single helper that makes use of the previously added
> use_sgl tristate.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> drivers/nvme/host/pci.c | 77 +++++++++++++++++------------------------
> 1 file changed, 32 insertions(+), 45 deletions(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 0b85c11d3c96..50bb1ebe6810 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -814,42 +814,41 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_queue *nvmeq,
> return BLK_STS_OK;
> }
>
> -static blk_status_t nvme_setup_prp_simple(struct nvme_dev *dev,
> - struct request *req, struct nvme_rw_command *cmnd,
> - struct bio_vec *bv)
> +static blk_status_t nvme_pci_setup_data_simple(struct request *req,
> + enum nvme_use_sgl use_sgl)
> {
> struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> - unsigned int offset = bv->bv_offset & (NVME_CTRL_PAGE_SIZE - 1);
> - unsigned int first_prp_len = NVME_CTRL_PAGE_SIZE - offset;
> -
> - iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
> - if (dma_mapping_error(dev->dev, iod->first_dma))
> + struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
> + struct bio_vec bv = req_bvec(req);
> + unsigned int prp1_offset = bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1);
> + bool prp_possible = prp1_offset + bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2;
> + dma_addr_t dma_addr;
> +
> + if (!use_sgl && !prp_possible)
> + return BLK_STS_AGAIN;
> + if (is_pci_p2pdma_page(bv.bv_page))
> + return BLK_STS_AGAIN;
> +
> + dma_addr = dma_map_bvec(nvmeq->dev->dev, &bv, rq_dma_dir(req), 0);
> + if (dma_mapping_error(nvmeq->dev->dev, dma_addr))
> return BLK_STS_RESOURCE;
> - iod->dma_len = bv->bv_len;
> -
> - cmnd->dptr.prp1 = cpu_to_le64(iod->first_dma);
> - if (bv->bv_len > first_prp_len)
> - cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma + first_prp_len);
> - else
> - cmnd->dptr.prp2 = 0;
> - return BLK_STS_OK;
> -}
> + iod->dma_len = bv.bv_len;
>
> -static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
> - struct request *req, struct nvme_rw_command *cmnd,
> - struct bio_vec *bv)
> -{
> - struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> + if (use_sgl == SGL_FORCED || !prp_possible) {
I couldn't find any place other than this where the new FORCED tristate actually
matters. So instead of passing the use_sgl tristate around, why not just check
here whether SGL is forced?
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 5/9] nvme-pci: merge the simple PRP and SGL setup into a common helper
2025-06-11 13:44 ` Daniel Gomez
@ 2025-06-12 5:01 ` Christoph Hellwig
0 siblings, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-12 5:01 UTC (permalink / raw)
To: Daniel Gomez
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
Chaitanya Kulkarni, Kanchan Joshi, Leon Romanovsky, Nitesh Shetty,
Logan Gunthorpe, linux-block, linux-nvme
On Wed, Jun 11, 2025 at 03:44:45PM +0200, Daniel Gomez wrote:
> > -static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
> > - struct request *req, struct nvme_rw_command *cmnd,
> > - struct bio_vec *bv)
> > -{
> > - struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> > + if (use_sgl == SGL_FORCED || !prp_possible) {
>
> I couldn't find any place other than this where the new FORCED tristate actually
> matters. So instead of passing the use_sgl tristate around, why not just check
> here whether SGL is forced?
See the check in nvme_map_data.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 5/9] nvme-pci: merge the simple PRP and SGL setup into a common helper
2025-06-10 5:06 ` [PATCH 5/9] nvme-pci: merge the simple PRP and SGL setup into a common helper Christoph Hellwig
2025-06-10 13:13 ` Leon Romanovsky
2025-06-11 13:44 ` Daniel Gomez
@ 2025-06-11 21:03 ` Keith Busch
2025-06-12 5:01 ` Christoph Hellwig
2 siblings, 1 reply; 65+ messages in thread
From: Keith Busch @ 2025-06-11 21:03 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On Tue, Jun 10, 2025 at 07:06:43AM +0200, Christoph Hellwig wrote:
> - iod->dma_len = bv->bv_len;
> + iod->cmd.common.dptr.prp1 = cpu_to_le64(dma_addr);
> + iod->cmd.common.dptr.prp2 = 0;
> + if (bv.bv_len > first_prp_len)
> + iod->cmd.common.dptr.prp2 =
> + cpu_to_le64(dma_addr + first_prp_len);
Nit, set prp2 to 0 in an 'else' case instead of unconditionally before
overwriting it?
Otherwise looks good.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 5/9] nvme-pci: merge the simple PRP and SGL setup into a common helper
2025-06-11 21:03 ` Keith Busch
@ 2025-06-12 5:01 ` Christoph Hellwig
0 siblings, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-12 5:01 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe,
linux-block, linux-nvme
On Wed, Jun 11, 2025 at 03:03:35PM -0600, Keith Busch wrote:
> On Tue, Jun 10, 2025 at 07:06:43AM +0200, Christoph Hellwig wrote:
> > - iod->dma_len = bv->bv_len;
> > + iod->cmd.common.dptr.prp1 = cpu_to_le64(dma_addr);
> > + iod->cmd.common.dptr.prp2 = 0;
> > + if (bv.bv_len > first_prp_len)
> > + iod->cmd.common.dptr.prp2 =
> > + cpu_to_le64(dma_addr + first_prp_len);
>
> Nit, set prp2 to 0 in an 'else' case instead of unconditionally before
> overwriting it?
unconditionally setting it actually generates slightly better code,
assuming the compiler won't just optimize the other form to it anyway.
But if you think it is more readable I can change it around.
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH 6/9] nvme-pci: remove superfluous arguments
2025-06-10 5:06 new DMA API conversion for nvme-pci Christoph Hellwig
` (4 preceding siblings ...)
2025-06-10 5:06 ` [PATCH 5/9] nvme-pci: merge the simple PRP and SGL setup into a common helper Christoph Hellwig
@ 2025-06-10 5:06 ` Christoph Hellwig
2025-06-10 13:15 ` Leon Romanovsky
2025-06-11 21:05 ` Keith Busch
2025-06-10 5:06 ` [PATCH 7/9] nvme-pci: convert the data mapping blk_rq_dma_map Christoph Hellwig
` (2 subsequent siblings)
8 siblings, 2 replies; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-10 5:06 UTC (permalink / raw)
To: Jens Axboe
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
The call chain in the prep_rq and completion paths passes around a lot
of nvme_dev, nvme_queue and nvme_command arguments that can be trivially
derived from the passed in struct request. Remove them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/nvme/host/pci.c | 97 +++++++++++++++++++----------------------
1 file changed, 46 insertions(+), 51 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 50bb1ebe6810..04461efb6d27 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -621,8 +621,9 @@ static inline struct dma_pool *nvme_dma_pool(struct nvme_queue *nvmeq,
return nvmeq->descriptor_pools.large;
}
-static void nvme_free_descriptors(struct nvme_queue *nvmeq, struct request *req)
+static void nvme_free_descriptors(struct request *req)
{
+ struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
dma_addr_t dma_addr = iod->first_dma;
@@ -644,22 +645,22 @@ static void nvme_free_descriptors(struct nvme_queue *nvmeq, struct request *req)
}
}
-static void nvme_unmap_data(struct nvme_dev *dev, struct nvme_queue *nvmeq,
- struct request *req)
+static void nvme_unmap_data(struct request *req)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
if (iod->dma_len) {
- dma_unmap_page(dev->dev, iod->first_dma, iod->dma_len,
+ dma_unmap_page(nvmeq->dev->dev, iod->first_dma, iod->dma_len,
rq_dma_dir(req));
return;
}
WARN_ON_ONCE(!iod->sgt.nents);
- dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
- nvme_free_descriptors(nvmeq, req);
- mempool_free(iod->sgt.sgl, dev->iod_mempool);
+ dma_unmap_sgtable(nvmeq->dev->dev, &iod->sgt, rq_dma_dir(req), 0);
+ nvme_free_descriptors(req);
+ mempool_free(iod->sgt.sgl, nvmeq->dev->iod_mempool);
}
static void nvme_print_sgl(struct scatterlist *sgl, int nents)
@@ -676,10 +677,10 @@ static void nvme_print_sgl(struct scatterlist *sgl, int nents)
}
}
-static blk_status_t nvme_pci_setup_prps(struct nvme_queue *nvmeq,
- struct request *req, struct nvme_rw_command *cmnd)
+static blk_status_t nvme_pci_setup_prps(struct request *req)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
int length = blk_rq_payload_bytes(req);
struct scatterlist *sg = iod->sgt.sgl;
int dma_len = sg_dma_len(sg);
@@ -748,11 +749,11 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_queue *nvmeq,
dma_len = sg_dma_len(sg);
}
done:
- cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
- cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
+ iod->cmd.common.dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
+ iod->cmd.common.dptr.prp2 = cpu_to_le64(iod->first_dma);
return BLK_STS_OK;
free_prps:
- nvme_free_descriptors(nvmeq, req);
+ nvme_free_descriptors(req);
return BLK_STS_RESOURCE;
bad_sgl:
WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
@@ -777,10 +778,10 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc *sge,
sge->type = NVME_SGL_FMT_LAST_SEG_DESC << 4;
}
-static blk_status_t nvme_pci_setup_sgls(struct nvme_queue *nvmeq,
- struct request *req, struct nvme_rw_command *cmd)
+static blk_status_t nvme_pci_setup_sgls(struct request *req)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
struct nvme_sgl_desc *sg_list;
struct scatterlist *sg = iod->sgt.sgl;
unsigned int entries = iod->sgt.nents;
@@ -788,10 +789,10 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_queue *nvmeq,
int i = 0;
/* setting the transfer type as SGL */
- cmd->flags = NVME_CMD_SGL_METABUF;
+ iod->cmd.common.flags = NVME_CMD_SGL_METABUF;
if (entries == 1) {
- nvme_pci_sgl_set_data(&cmd->dptr.sgl, sg);
+ nvme_pci_sgl_set_data(&iod->cmd.common.dptr.sgl, sg);
return BLK_STS_OK;
}
@@ -805,7 +806,7 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_queue *nvmeq,
iod->descriptors[iod->nr_descriptors++] = sg_list;
iod->first_dma = sgl_dma;
- nvme_pci_sgl_set_seg(&cmd->dptr.sgl, sgl_dma, entries);
+ nvme_pci_sgl_set_seg(&iod->cmd.common.dptr.sgl, sgl_dma, entries);
do {
nvme_pci_sgl_set_data(&sg_list[i++], sg);
sg = sg_next(sg);
@@ -852,11 +853,11 @@ static blk_status_t nvme_pci_setup_data_simple(struct request *req,
return BLK_STS_OK;
}
-static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
- struct nvme_command *cmnd)
+static blk_status_t nvme_map_data(struct request *req)
{
- struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+ struct nvme_dev *dev = nvmeq->dev;
enum nvme_use_sgl use_sgl = nvme_pci_use_sgls(dev, req);
blk_status_t ret = BLK_STS_RESOURCE;
int rc;
@@ -887,9 +888,9 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
if (use_sgl == SGL_FORCED ||
(use_sgl == SGL_SUPPORTED &&
(!sgl_threshold || nvme_pci_avg_seg_size(req) < sgl_threshold)))
- ret = nvme_pci_setup_sgls(nvmeq, req, &cmnd->rw);
+ ret = nvme_pci_setup_sgls(req);
else
- ret = nvme_pci_setup_prps(nvmeq, req, &cmnd->rw);
+ ret = nvme_pci_setup_prps(req);
if (ret != BLK_STS_OK)
goto out_unmap_sg;
return BLK_STS_OK;
@@ -901,12 +902,11 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
return ret;
}
-static blk_status_t nvme_pci_setup_meta_sgls(struct nvme_dev *dev,
- struct request *req)
+static blk_status_t nvme_pci_setup_meta_sgls(struct request *req)
{
struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+ struct nvme_dev *dev = nvmeq->dev;
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
- struct nvme_rw_command *cmnd = &iod->cmd.rw;
struct nvme_sgl_desc *sg_list;
struct scatterlist *sgl, *sg;
unsigned int entries;
@@ -937,8 +937,8 @@ static blk_status_t nvme_pci_setup_meta_sgls(struct nvme_dev *dev,
iod->meta_descriptor = sg_list;
iod->meta_dma = sgl_dma;
- cmnd->flags = NVME_CMD_SGL_METASEG;
- cmnd->metadata = cpu_to_le64(sgl_dma);
+ iod->cmd.common.flags = NVME_CMD_SGL_METASEG;
+ iod->cmd.common.metadata = cpu_to_le64(sgl_dma);
sgl = iod->meta_sgt.sgl;
if (entries == 1) {
@@ -960,31 +960,30 @@ static blk_status_t nvme_pci_setup_meta_sgls(struct nvme_dev *dev,
return BLK_STS_RESOURCE;
}
-static blk_status_t nvme_pci_setup_meta_mptr(struct nvme_dev *dev,
- struct request *req)
+static blk_status_t nvme_pci_setup_meta_mptr(struct request *req)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
struct bio_vec bv = rq_integrity_vec(req);
- struct nvme_command *cmnd = &iod->cmd;
- iod->meta_dma = dma_map_bvec(dev->dev, &bv, rq_dma_dir(req), 0);
- if (dma_mapping_error(dev->dev, iod->meta_dma))
+ iod->meta_dma = dma_map_bvec(nvmeq->dev->dev, &bv, rq_dma_dir(req), 0);
+ if (dma_mapping_error(nvmeq->dev->dev, iod->meta_dma))
return BLK_STS_IOERR;
- cmnd->rw.metadata = cpu_to_le64(iod->meta_dma);
+ iod->cmd.common.metadata = cpu_to_le64(iod->meta_dma);
return BLK_STS_OK;
}
-static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req)
+static blk_status_t nvme_map_metadata(struct request *req)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
if ((iod->cmd.common.flags & NVME_CMD_SGL_METABUF) &&
nvme_pci_metadata_use_sgls(req))
- return nvme_pci_setup_meta_sgls(dev, req);
- return nvme_pci_setup_meta_mptr(dev, req);
+ return nvme_pci_setup_meta_sgls(req);
+ return nvme_pci_setup_meta_mptr(req);
}
-static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
+static blk_status_t nvme_prep_rq(struct request *req)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
blk_status_t ret;
@@ -999,13 +998,13 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
return ret;
if (blk_rq_nr_phys_segments(req)) {
- ret = nvme_map_data(dev, req, &iod->cmd);
+ ret = nvme_map_data(req);
if (ret)
goto out_free_cmd;
}
if (blk_integrity_rq(req)) {
- ret = nvme_map_metadata(dev, req);
+ ret = nvme_map_metadata(req);
if (ret)
goto out_unmap_data;
}
@@ -1014,7 +1013,7 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
return BLK_STS_OK;
out_unmap_data:
if (blk_rq_nr_phys_segments(req))
- nvme_unmap_data(dev, req->mq_hctx->driver_data, req);
+ nvme_unmap_data(req);
out_free_cmd:
nvme_cleanup_cmd(req);
return ret;
@@ -1039,7 +1038,7 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
if (unlikely(!nvme_check_ready(&dev->ctrl, req, true)))
return nvme_fail_nonready_command(&dev->ctrl, req);
- ret = nvme_prep_rq(dev, req);
+ ret = nvme_prep_rq(req);
if (unlikely(ret))
return ret;
spin_lock(&nvmeq->sq_lock);
@@ -1077,7 +1076,7 @@ static bool nvme_prep_rq_batch(struct nvme_queue *nvmeq, struct request *req)
if (unlikely(!nvme_check_ready(&nvmeq->dev->ctrl, req, true)))
return false;
- return nvme_prep_rq(nvmeq->dev, req) == BLK_STS_OK;
+ return nvme_prep_rq(req) == BLK_STS_OK;
}
static void nvme_queue_rqs(struct rq_list *rqlist)
@@ -1103,11 +1102,11 @@ static void nvme_queue_rqs(struct rq_list *rqlist)
*rqlist = requeue_list;
}
-static __always_inline void nvme_unmap_metadata(struct nvme_dev *dev,
- struct nvme_queue *nvmeq,
- struct request *req)
+static __always_inline void nvme_unmap_metadata(struct request *req)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+ struct nvme_dev *dev = nvmeq->dev;
if (!iod->meta_sgt.nents) {
dma_unmap_page(dev->dev, iod->meta_dma,
@@ -1124,14 +1123,10 @@ static __always_inline void nvme_unmap_metadata(struct nvme_dev *dev,
static __always_inline void nvme_pci_unmap_rq(struct request *req)
{
- struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
- struct nvme_dev *dev = nvmeq->dev;
-
if (blk_integrity_rq(req))
- nvme_unmap_metadata(dev, nvmeq, req);
-
+ nvme_unmap_metadata(req);
if (blk_rq_nr_phys_segments(req))
- nvme_unmap_data(dev, nvmeq, req);
+ nvme_unmap_data(req);
}
static void nvme_pci_complete_rq(struct request *req)
--
2.47.2
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH 6/9] nvme-pci: remove superfluous arguments
2025-06-10 5:06 ` [PATCH 6/9] nvme-pci: remove superfluous arguments Christoph Hellwig
@ 2025-06-10 13:15 ` Leon Romanovsky
2025-06-11 21:05 ` Keith Busch
1 sibling, 0 replies; 65+ messages in thread
From: Leon Romanovsky @ 2025-06-10 13:15 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On Tue, Jun 10, 2025 at 07:06:44AM +0200, Christoph Hellwig wrote:
> The call chain in the prep_rq and completion paths passes around a lot
> of nvme_dev, nvme_queue and nvme_command arguments that can be trivially
> derived from the passed in struct request. Remove them.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> drivers/nvme/host/pci.c | 97 +++++++++++++++++++----------------------
> 1 file changed, 46 insertions(+), 51 deletions(-)
>
Thanks,
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 6/9] nvme-pci: remove superfluous arguments
2025-06-10 5:06 ` [PATCH 6/9] nvme-pci: remove superfluous arguments Christoph Hellwig
2025-06-10 13:15 ` Leon Romanovsky
@ 2025-06-11 21:05 ` Keith Busch
1 sibling, 0 replies; 65+ messages in thread
From: Keith Busch @ 2025-06-11 21:05 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On Tue, Jun 10, 2025 at 07:06:44AM +0200, Christoph Hellwig wrote:
> The call chain in the prep_rq and completion paths passes around a lot
> of nvme_dev, nvme_queue and nvme_command arguments that can be trivially
> derived from the passed in struct request. Remove them.
Looks good
Reviewed-by: Keith Busch <kbusch@kernel.org>
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH 7/9] nvme-pci: convert the data mapping blk_rq_dma_map
2025-06-10 5:06 new DMA API conversion for nvme-pci Christoph Hellwig
` (5 preceding siblings ...)
2025-06-10 5:06 ` [PATCH 6/9] nvme-pci: remove superfluous arguments Christoph Hellwig
@ 2025-06-10 5:06 ` Christoph Hellwig
2025-06-10 13:19 ` Leon Romanovsky
` (3 more replies)
2025-06-10 5:06 ` [PATCH 8/9] nvme-pci: replace NVME_MAX_KB_SZ with NVME_MAX_BYTE Christoph Hellwig
2025-06-10 5:06 ` [PATCH 9/9] nvme-pci: rework the build time assert for NVME_MAX_NR_DESCRIPTORS Christoph Hellwig
8 siblings, 4 replies; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-10 5:06 UTC (permalink / raw)
To: Jens Axboe
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
Use the blk_rq_dma_map API to DMA map requests instead of scatterlists.
This removes the need to allocate a scatterlist covering every segment,
and thus the overall transfer length limit based on the scatterlist
allocation.
Instead the DMA mapping is done by iterating the bio_vec chain in the
request directly. The unmap is handled differently depending on how
we mapped:
- when using an IOMMU only a single IOVA is used, and it is stored in
iova_state
- for direct mappings that don't use swiotlb and are cache coherent no
unmap is needed at all
- for direct mappings that are not cache coherent or use swiotlb, the
physical addresses are rebuild from the PRPs or SGL segments
The latter unfortunately adds a fair amount of code to the driver, but
it is code not used in the fast path.
The conversion only covers the data mapping path, and still uses a
scatterlist for the multi-segment metadata case. I plan to convert that
as soon as we have good test coverage for the multi-segment metadata
path.
Thanks to Chaitanya Kulkarni for an initial attempt at a new DMA API
conversion for nvme-pci, Kanchan Joshi for bringing back the single
segment optimization, Leon Romanovsky for shepherding this through a
gazillion rebases and Nitesh Shetty for various improvements.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/nvme/host/pci.c | 388 +++++++++++++++++++++++++---------------
1 file changed, 242 insertions(+), 146 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 04461efb6d27..2d3573293d0c 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -7,7 +7,7 @@
#include <linux/acpi.h>
#include <linux/async.h>
#include <linux/blkdev.h>
-#include <linux/blk-mq.h>
+#include <linux/blk-mq-dma.h>
#include <linux/blk-integrity.h>
#include <linux/dmi.h>
#include <linux/init.h>
@@ -27,7 +27,6 @@
#include <linux/io-64-nonatomic-lo-hi.h>
#include <linux/io-64-nonatomic-hi-lo.h>
#include <linux/sed-opal.h>
-#include <linux/pci-p2pdma.h>
#include "trace.h"
#include "nvme.h"
@@ -46,13 +45,11 @@
#define NVME_MAX_NR_DESCRIPTORS 5
/*
- * For data SGLs we support a single descriptors worth of SGL entries, but for
- * now we also limit it to avoid an allocation larger than PAGE_SIZE for the
- * scatterlist.
+ * For data SGLs we support a single descriptors worth of SGL entries.
+ * For PRPs, segments don't matter at all.
*/
#define NVME_MAX_SEGS \
- min(NVME_CTRL_PAGE_SIZE / sizeof(struct nvme_sgl_desc), \
- (PAGE_SIZE / sizeof(struct scatterlist)))
+ (NVME_CTRL_PAGE_SIZE / sizeof(struct nvme_sgl_desc))
/*
* For metadata SGLs, only the small descriptor is supported, and the first
@@ -162,7 +159,6 @@ struct nvme_dev {
bool hmb;
struct sg_table *hmb_sgt;
- mempool_t *iod_mempool;
mempool_t *iod_meta_mempool;
/* shadow doorbell buffer support: */
@@ -246,7 +242,10 @@ enum nvme_iod_flags {
IOD_ABORTED = 1U << 0,
/* uses the small descriptor pool */
- IOD_SMALL_DESCRIPTOR = 1U << 1,
+ IOD_SMALL_DESCRIPTOR = 1U << 1,
+
+ /* single segment dma mapping */
+ IOD_SINGLE_SEGMENT = 1U << 2,
};
/*
@@ -257,13 +256,14 @@ struct nvme_iod {
struct nvme_command cmd;
u8 flags;
u8 nr_descriptors;
- unsigned int dma_len; /* length of single DMA segment mapping */
- dma_addr_t first_dma;
+
+ unsigned int total_len;
+ struct dma_iova_state dma_state;
+ void *descriptors[NVME_MAX_NR_DESCRIPTORS];
+
dma_addr_t meta_dma;
- struct sg_table sgt;
struct sg_table meta_sgt;
struct nvme_sgl_desc *meta_descriptor;
- void *descriptors[NVME_MAX_NR_DESCRIPTORS];
};
static inline unsigned int nvme_dbbuf_size(struct nvme_dev *dev)
@@ -609,8 +609,13 @@ static inline enum nvme_use_sgl nvme_pci_use_sgls(struct nvme_dev *dev,
static unsigned int nvme_pci_avg_seg_size(struct request *req)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ unsigned int nseg;
- return DIV_ROUND_UP(blk_rq_payload_bytes(req), iod->sgt.nents);
+ if (blk_rq_dma_map_coalesce(&iod->dma_state))
+ nseg = 1;
+ else
+ nseg = blk_rq_nr_phys_segments(req);
+ return DIV_ROUND_UP(blk_rq_payload_bytes(req), nseg);
}
static inline struct dma_pool *nvme_dma_pool(struct nvme_queue *nvmeq,
@@ -621,12 +626,25 @@ static inline struct dma_pool *nvme_dma_pool(struct nvme_queue *nvmeq,
return nvmeq->descriptor_pools.large;
}
+static inline bool nvme_pci_cmd_use_sgl(struct nvme_command *cmd)
+{
+ return cmd->common.flags &
+ (NVME_CMD_SGL_METABUF | NVME_CMD_SGL_METASEG);
+}
+
+static inline dma_addr_t nvme_pci_first_desc_dma_addr(struct nvme_command *cmd)
+{
+ if (nvme_pci_cmd_use_sgl(cmd))
+ return le64_to_cpu(cmd->common.dptr.sgl.addr);
+ return le64_to_cpu(cmd->common.dptr.prp2);
+}
+
static void nvme_free_descriptors(struct request *req)
{
struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
- dma_addr_t dma_addr = iod->first_dma;
+ dma_addr_t dma_addr = nvme_pci_first_desc_dma_addr(&iod->cmd);
int i;
if (iod->nr_descriptors == 1) {
@@ -645,68 +663,138 @@ static void nvme_free_descriptors(struct request *req)
}
}
-static void nvme_unmap_data(struct request *req)
+static void nvme_free_prps(struct request *req)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+ struct device *dma_dev = nvmeq->dev->dev;
+ enum dma_data_direction dir = rq_dma_dir(req);
+ int length = iod->total_len;
+ dma_addr_t dma_addr;
+ int i, desc;
+ __le64 *prp_list;
+ u32 dma_len;
+
+ dma_addr = le64_to_cpu(iod->cmd.common.dptr.prp1);
+ dma_len = min_t(u32, length,
+ NVME_CTRL_PAGE_SIZE - (dma_addr & (NVME_CTRL_PAGE_SIZE - 1)));
+ length -= dma_len;
+ if (!length) {
+ dma_unmap_page(dma_dev, dma_addr, dma_len, dir);
+ return;
+ }
- if (iod->dma_len) {
- dma_unmap_page(nvmeq->dev->dev, iod->first_dma, iod->dma_len,
- rq_dma_dir(req));
+ if (length <= NVME_CTRL_PAGE_SIZE) {
+ dma_unmap_page(dma_dev, dma_addr, dma_len, dir);
+ dma_addr = le64_to_cpu(iod->cmd.common.dptr.prp2);
+ dma_unmap_page(dma_dev, dma_addr, length, dir);
return;
}
- WARN_ON_ONCE(!iod->sgt.nents);
+ i = 0;
+ desc = 0;
+ prp_list = iod->descriptors[desc];
+ do {
+ dma_unmap_page(dma_dev, dma_addr, dma_len, dir);
+ if (i == NVME_CTRL_PAGE_SIZE >> 3) {
+ prp_list = iod->descriptors[++desc];
+ i = 0;
+ }
- dma_unmap_sgtable(nvmeq->dev->dev, &iod->sgt, rq_dma_dir(req), 0);
- nvme_free_descriptors(req);
- mempool_free(iod->sgt.sgl, nvmeq->dev->iod_mempool);
+ dma_addr = le64_to_cpu(prp_list[i++]);
+ dma_len = min(length, NVME_CTRL_PAGE_SIZE);
+ length -= dma_len;
+ } while (length);
}
-static void nvme_print_sgl(struct scatterlist *sgl, int nents)
+static void nvme_free_sgls(struct request *req)
{
- int i;
- struct scatterlist *sg;
+ struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+ struct device *dma_dev = nvmeq->dev->dev;
+ dma_addr_t sqe_dma_addr = le64_to_cpu(iod->cmd.common.dptr.sgl.addr);
+ unsigned int sqe_dma_len = le32_to_cpu(iod->cmd.common.dptr.sgl.length);
+ struct nvme_sgl_desc *sg_list = iod->descriptors[0];
+ enum dma_data_direction dir = rq_dma_dir(req);
+
+ if (iod->nr_descriptors) {
+ unsigned int nr_entries = sqe_dma_len / sizeof(*sg_list), i;
+
+ for (i = 0; i < nr_entries; i++)
+ dma_unmap_page(dma_dev, le64_to_cpu(sg_list[i].addr),
+ le32_to_cpu(sg_list[i].length), dir);
+ } else {
+ dma_unmap_page(dma_dev, sqe_dma_addr, sqe_dma_len, dir);
+ }
+}
+
+static void nvme_unmap_data(struct request *req)
+{
+ struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+ struct device *dma_dev = nvmeq->dev->dev;
+
+ if (iod->flags & IOD_SINGLE_SEGMENT) {
+ static_assert(offsetof(union nvme_data_ptr, prp1) ==
+ offsetof(union nvme_data_ptr, sgl.addr));
+ dma_unmap_page(dma_dev, le64_to_cpu(iod->cmd.common.dptr.prp1),
+ iod->total_len, rq_dma_dir(req));
+ return;
+ }
- for_each_sg(sgl, sg, nents, i) {
- dma_addr_t phys = sg_phys(sg);
- pr_warn("sg[%d] phys_addr:%pad offset:%d length:%d "
- "dma_address:%pad dma_length:%d\n",
- i, &phys, sg->offset, sg->length, &sg_dma_address(sg),
- sg_dma_len(sg));
+ if (!blk_rq_dma_unmap(req, dma_dev, &iod->dma_state, iod->total_len)) {
+ if (nvme_pci_cmd_use_sgl(&iod->cmd))
+ nvme_free_sgls(req);
+ else
+ nvme_free_prps(req);
}
+
+ if (iod->nr_descriptors)
+ nvme_free_descriptors(req);
}
-static blk_status_t nvme_pci_setup_prps(struct request *req)
+static blk_status_t nvme_pci_setup_data_prp(struct request *req,
+ struct blk_dma_iter *iter)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
- int length = blk_rq_payload_bytes(req);
- struct scatterlist *sg = iod->sgt.sgl;
- int dma_len = sg_dma_len(sg);
- u64 dma_addr = sg_dma_address(sg);
- int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
+ unsigned int length = blk_rq_payload_bytes(req);
+ dma_addr_t prp1_dma, prp2_dma = 0;
+ unsigned int prp_len, i;
__le64 *prp_list;
- dma_addr_t prp_dma;
- int i;
- length -= (NVME_CTRL_PAGE_SIZE - offset);
- if (length <= 0) {
- iod->first_dma = 0;
+ /*
+ * PRP1 always points to the start of the DMA transfers.
+ *
+ * This is the only PRP (except for the list entries) that could be
+ * non-aligned.
+ */
+ prp1_dma = iter->addr;
+ prp_len = min(length, NVME_CTRL_PAGE_SIZE -
+ (iter->addr & (NVME_CTRL_PAGE_SIZE - 1)));
+ iod->total_len += prp_len;
+ iter->addr += prp_len;
+ iter->len -= prp_len;
+ length -= prp_len;
+ if (!length)
goto done;
- }
- dma_len -= (NVME_CTRL_PAGE_SIZE - offset);
- if (dma_len) {
- dma_addr += (NVME_CTRL_PAGE_SIZE - offset);
- } else {
- sg = sg_next(sg);
- dma_addr = sg_dma_address(sg);
- dma_len = sg_dma_len(sg);
+ if (!iter->len) {
+ if (!blk_rq_dma_map_iter_next(req, nvmeq->dev->dev,
+ &iod->dma_state, iter)) {
+ if (WARN_ON_ONCE(!iter->status))
+ goto bad_sgl;
+ goto done;
+ }
}
+ /*
+ * PRP2 is usually a list, but can point to data if all data to be
+ * transferred fits into PRP1 + PRP2:
+ */
if (length <= NVME_CTRL_PAGE_SIZE) {
- iod->first_dma = dma_addr;
+ prp2_dma = iter->addr;
+ iod->total_len += length;
goto done;
}
@@ -715,58 +803,83 @@ static blk_status_t nvme_pci_setup_prps(struct request *req)
iod->flags |= IOD_SMALL_DESCRIPTOR;
prp_list = dma_pool_alloc(nvme_dma_pool(nvmeq, iod), GFP_ATOMIC,
- &prp_dma);
- if (!prp_list)
- return BLK_STS_RESOURCE;
+ &prp2_dma);
+ if (!prp_list) {
+ iter->status = BLK_STS_RESOURCE;
+ goto done;
+ }
iod->descriptors[iod->nr_descriptors++] = prp_list;
- iod->first_dma = prp_dma;
+
i = 0;
for (;;) {
+ prp_list[i++] = cpu_to_le64(iter->addr);
+ prp_len = min(length, NVME_CTRL_PAGE_SIZE);
+ if (WARN_ON_ONCE(iter->len < prp_len))
+ goto bad_sgl;
+
+ iod->total_len += prp_len;
+ iter->addr += prp_len;
+ iter->len -= prp_len;
+ length -= prp_len;
+ if (!length)
+ break;
+
+ if (iter->len == 0) {
+ if (!blk_rq_dma_map_iter_next(req, nvmeq->dev->dev,
+ &iod->dma_state, iter)) {
+ if (WARN_ON_ONCE(!iter->status))
+ goto bad_sgl;
+ goto done;
+ }
+ }
+
+ /*
+ * If we've filled the entire descriptor, allocate a new that is
+ * pointed to be the last entry in the previous PRP list. To
+ * accommodate for that move the last actual entry to the new
+ * descriptor.
+ */
if (i == NVME_CTRL_PAGE_SIZE >> 3) {
__le64 *old_prp_list = prp_list;
+ dma_addr_t prp_list_dma;
prp_list = dma_pool_alloc(nvmeq->descriptor_pools.large,
- GFP_ATOMIC, &prp_dma);
- if (!prp_list)
- goto free_prps;
+ GFP_ATOMIC, &prp_list_dma);
+ if (!prp_list) {
+ iter->status = BLK_STS_RESOURCE;
+ goto done;
+ }
iod->descriptors[iod->nr_descriptors++] = prp_list;
+
prp_list[0] = old_prp_list[i - 1];
- old_prp_list[i - 1] = cpu_to_le64(prp_dma);
+ old_prp_list[i - 1] = cpu_to_le64(prp_list_dma);
i = 1;
}
- prp_list[i++] = cpu_to_le64(dma_addr);
- dma_len -= NVME_CTRL_PAGE_SIZE;
- dma_addr += NVME_CTRL_PAGE_SIZE;
- length -= NVME_CTRL_PAGE_SIZE;
- if (length <= 0)
- break;
- if (dma_len > 0)
- continue;
- if (unlikely(dma_len < 0))
- goto bad_sgl;
- sg = sg_next(sg);
- dma_addr = sg_dma_address(sg);
- dma_len = sg_dma_len(sg);
}
+
done:
- iod->cmd.common.dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
- iod->cmd.common.dptr.prp2 = cpu_to_le64(iod->first_dma);
- return BLK_STS_OK;
-free_prps:
- nvme_free_descriptors(req);
- return BLK_STS_RESOURCE;
+ /*
+ * nvme_unmap_data uses the DPT field in the SQE to tear down the
+ * mapping, so initialize it even for failures.
+ */
+ iod->cmd.common.dptr.prp1 = cpu_to_le64(prp1_dma);
+ iod->cmd.common.dptr.prp2 = cpu_to_le64(prp2_dma);
+ if (unlikely(iter->status))
+ nvme_unmap_data(req);
+ return iter->status;
+
bad_sgl:
- WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
- "Invalid SGL for payload:%d nents:%d\n",
- blk_rq_payload_bytes(req), iod->sgt.nents);
+ dev_err_once(nvmeq->dev->dev,
+ "Incorrectly formed request for payload:%d nents:%d\n",
+ blk_rq_payload_bytes(req), blk_rq_nr_phys_segments(req));
return BLK_STS_IOERR;
}
static void nvme_pci_sgl_set_data(struct nvme_sgl_desc *sge,
- struct scatterlist *sg)
+ struct blk_dma_iter *iter)
{
- sge->addr = cpu_to_le64(sg_dma_address(sg));
- sge->length = cpu_to_le32(sg_dma_len(sg));
+ sge->addr = cpu_to_le64(iter->addr);
+ sge->length = cpu_to_le32(iter->len);
sge->type = NVME_SGL_FMT_DATA_DESC << 4;
}
@@ -778,21 +891,22 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc *sge,
sge->type = NVME_SGL_FMT_LAST_SEG_DESC << 4;
}
-static blk_status_t nvme_pci_setup_sgls(struct request *req)
+static blk_status_t nvme_pci_setup_data_sgl(struct request *req,
+ struct blk_dma_iter *iter)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+ unsigned int entries = blk_rq_nr_phys_segments(req);
struct nvme_sgl_desc *sg_list;
- struct scatterlist *sg = iod->sgt.sgl;
- unsigned int entries = iod->sgt.nents;
dma_addr_t sgl_dma;
- int i = 0;
+ unsigned int mapped = 0;
- /* setting the transfer type as SGL */
+ /* set the transfer type as SGL */
iod->cmd.common.flags = NVME_CMD_SGL_METABUF;
- if (entries == 1) {
- nvme_pci_sgl_set_data(&iod->cmd.common.dptr.sgl, sg);
+ if (entries == 1 || blk_rq_dma_map_coalesce(&iod->dma_state)) {
+ nvme_pci_sgl_set_data(&iod->cmd.common.dptr.sgl, iter);
+ iod->total_len += iter->len;
return BLK_STS_OK;
}
@@ -804,15 +918,21 @@ static blk_status_t nvme_pci_setup_sgls(struct request *req)
if (!sg_list)
return BLK_STS_RESOURCE;
iod->descriptors[iod->nr_descriptors++] = sg_list;
- iod->first_dma = sgl_dma;
- nvme_pci_sgl_set_seg(&iod->cmd.common.dptr.sgl, sgl_dma, entries);
do {
- nvme_pci_sgl_set_data(&sg_list[i++], sg);
- sg = sg_next(sg);
- } while (--entries > 0);
+ if (WARN_ON_ONCE(mapped == entries)) {
+ iter->status = BLK_STS_IOERR;
+ break;
+ }
+ nvme_pci_sgl_set_data(&sg_list[mapped++], iter);
+ iod->total_len += iter->len;
+ } while (blk_rq_dma_map_iter_next(req, nvmeq->dev->dev, &iod->dma_state,
+ iter));
- return BLK_STS_OK;
+ nvme_pci_sgl_set_seg(&iod->cmd.common.dptr.sgl, sgl_dma, mapped);
+ if (unlikely(iter->status))
+ nvme_free_sgls(req);
+ return iter->status;
}
static blk_status_t nvme_pci_setup_data_simple(struct request *req,
@@ -833,7 +953,8 @@ static blk_status_t nvme_pci_setup_data_simple(struct request *req,
dma_addr = dma_map_bvec(nvmeq->dev->dev, &bv, rq_dma_dir(req), 0);
if (dma_mapping_error(nvmeq->dev->dev, dma_addr))
return BLK_STS_RESOURCE;
- iod->dma_len = bv.bv_len;
+ iod->total_len = bv.bv_len;
+ iod->flags |= IOD_SINGLE_SEGMENT;
if (use_sgl == SGL_FORCED || !prp_possible) {
iod->cmd.common.flags = NVME_CMD_SGL_METABUF;
@@ -859,47 +980,36 @@ static blk_status_t nvme_map_data(struct request *req)
struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
struct nvme_dev *dev = nvmeq->dev;
enum nvme_use_sgl use_sgl = nvme_pci_use_sgls(dev, req);
- blk_status_t ret = BLK_STS_RESOURCE;
- int rc;
+ struct blk_dma_iter iter;
+ blk_status_t ret;
if (blk_rq_nr_phys_segments(req) == 1) {
+ /*
+ * Try to skip the DMA iterator for single segment requests, as
+ * that significantly improves performances for workloads with
+ * small I/O sizes.
+ */
ret = nvme_pci_setup_data_simple(req, use_sgl);
if (ret != BLK_STS_AGAIN)
return ret;
}
- iod->dma_len = 0;
- iod->sgt.sgl = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
- if (!iod->sgt.sgl)
- return BLK_STS_RESOURCE;
- sg_init_table(iod->sgt.sgl, blk_rq_nr_phys_segments(req));
- iod->sgt.orig_nents = blk_rq_map_sg(req, iod->sgt.sgl);
- if (!iod->sgt.orig_nents)
- goto out_free_sg;
-
- rc = dma_map_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req),
- DMA_ATTR_NO_WARN);
- if (rc) {
- if (rc == -EREMOTEIO)
- ret = BLK_STS_TARGET;
- goto out_free_sg;
- }
+ if (!blk_rq_dma_map_iter_start(req, dev->dev, &iod->dma_state, &iter))
+ return iter.status;
if (use_sgl == SGL_FORCED ||
(use_sgl == SGL_SUPPORTED &&
(!sgl_threshold || nvme_pci_avg_seg_size(req) < sgl_threshold)))
- ret = nvme_pci_setup_sgls(req);
- else
- ret = nvme_pci_setup_prps(req);
- if (ret != BLK_STS_OK)
- goto out_unmap_sg;
- return BLK_STS_OK;
+ return nvme_pci_setup_data_sgl(req, &iter);
+ return nvme_pci_setup_data_prp(req, &iter);
+}
-out_unmap_sg:
- dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
-out_free_sg:
- mempool_free(iod->sgt.sgl, dev->iod_mempool);
- return ret;
+static void nvme_pci_sgl_set_data_sg(struct nvme_sgl_desc *sge,
+ struct scatterlist *sg)
+{
+ sge->addr = cpu_to_le64(sg_dma_address(sg));
+ sge->length = cpu_to_le32(sg_dma_len(sg));
+ sge->type = NVME_SGL_FMT_DATA_DESC << 4;
}
static blk_status_t nvme_pci_setup_meta_sgls(struct request *req)
@@ -942,14 +1052,14 @@ static blk_status_t nvme_pci_setup_meta_sgls(struct request *req)
sgl = iod->meta_sgt.sgl;
if (entries == 1) {
- nvme_pci_sgl_set_data(sg_list, sgl);
+ nvme_pci_sgl_set_data_sg(sg_list, sgl);
return BLK_STS_OK;
}
sgl_dma += sizeof(*sg_list);
nvme_pci_sgl_set_seg(sg_list, sgl_dma, entries);
for_each_sg(sgl, sg, entries, i)
- nvme_pci_sgl_set_data(&sg_list[i + 1], sg);
+ nvme_pci_sgl_set_data_sg(&sg_list[i + 1], sg);
return BLK_STS_OK;
@@ -990,7 +1100,7 @@ static blk_status_t nvme_prep_rq(struct request *req)
iod->flags = 0;
iod->nr_descriptors = 0;
- iod->sgt.nents = 0;
+ iod->total_len = 0;
iod->meta_sgt.nents = 0;
ret = nvme_setup_cmd(req->q->queuedata, req);
@@ -2908,26 +3018,14 @@ static int nvme_disable_prepare_reset(struct nvme_dev *dev, bool shutdown)
static int nvme_pci_alloc_iod_mempool(struct nvme_dev *dev)
{
size_t meta_size = sizeof(struct scatterlist) * (NVME_MAX_META_SEGS + 1);
- size_t alloc_size = sizeof(struct scatterlist) * NVME_MAX_SEGS;
-
- dev->iod_mempool = mempool_create_node(1,
- mempool_kmalloc, mempool_kfree,
- (void *)alloc_size, GFP_KERNEL,
- dev_to_node(dev->dev));
- if (!dev->iod_mempool)
- return -ENOMEM;
dev->iod_meta_mempool = mempool_create_node(1,
mempool_kmalloc, mempool_kfree,
(void *)meta_size, GFP_KERNEL,
dev_to_node(dev->dev));
if (!dev->iod_meta_mempool)
- goto free;
-
+ return -ENOMEM;
return 0;
-free:
- mempool_destroy(dev->iod_mempool);
- return -ENOMEM;
}
static void nvme_free_tagset(struct nvme_dev *dev)
@@ -3371,7 +3469,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
nvme_dbbuf_dma_free(dev);
nvme_free_queues(dev, 0);
out_release_iod_mempool:
- mempool_destroy(dev->iod_mempool);
mempool_destroy(dev->iod_meta_mempool);
out_dev_unmap:
nvme_dev_unmap(dev);
@@ -3435,7 +3532,6 @@ static void nvme_remove(struct pci_dev *pdev)
nvme_dev_remove_admin(dev);
nvme_dbbuf_dma_free(dev);
nvme_free_queues(dev, 0);
- mempool_destroy(dev->iod_mempool);
mempool_destroy(dev->iod_meta_mempool);
nvme_release_descriptor_pools(dev);
nvme_dev_unmap(dev);
--
2.47.2
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH 7/9] nvme-pci: convert the data mapping blk_rq_dma_map
2025-06-10 5:06 ` [PATCH 7/9] nvme-pci: convert the data mapping blk_rq_dma_map Christoph Hellwig
@ 2025-06-10 13:19 ` Leon Romanovsky
2025-06-11 12:15 ` Daniel Gomez
` (2 subsequent siblings)
3 siblings, 0 replies; 65+ messages in thread
From: Leon Romanovsky @ 2025-06-10 13:19 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On Tue, Jun 10, 2025 at 07:06:45AM +0200, Christoph Hellwig wrote:
> Use the blk_rq_dma_map API to DMA map requests instead of scatterlists.
> This removes the need to allocate a scatterlist covering every segment,
> and thus the overall transfer length limit based on the scatterlist
> allocation.
>
> Instead the DMA mapping is done by iterating the bio_vec chain in the
> request directly. The unmap is handled differently depending on how
> we mapped:
>
> - when using an IOMMU only a single IOVA is used, and it is stored in
> iova_state
> - for direct mappings that don't use swiotlb and are cache coherent no
> unmap is needed at al
s/unmap is needed/unmap is not needed
> - for direct mappings that are not cache coherent or use swiotlb, the
> physical addresses are rebuild from the PRPs or SGL segments
>
> The latter unfortunately adds a fair amount of code to the driver, but
> it is code not used in the fast path.
>
> The conversion only covers the data mapping path, and still uses a
> scatterlist for the multi-segment metadata case. I plan to convert that
> as soon as we have good test coverage for the multi-segment metadata
> path.
>
> Thanks to Chaitanya Kulkarni for an initial attempt at a new DMA API
> conversion for nvme-pci, Kanchan Joshi for bringing back the single
> segment optimization, Leon Romanovsky for shepherding this through a
> gazillion rebases and Nitesh Shetty for various improvements.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> drivers/nvme/host/pci.c | 388 +++++++++++++++++++++++++---------------
> 1 file changed, 242 insertions(+), 146 deletions(-)
Thanks,
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 7/9] nvme-pci: convert the data mapping blk_rq_dma_map
2025-06-10 5:06 ` [PATCH 7/9] nvme-pci: convert the data mapping blk_rq_dma_map Christoph Hellwig
2025-06-10 13:19 ` Leon Romanovsky
@ 2025-06-11 12:15 ` Daniel Gomez
2025-06-12 5:02 ` Christoph Hellwig
2025-06-11 14:13 ` Daniel Gomez
2025-06-16 7:49 ` Daniel Gomez
3 siblings, 1 reply; 65+ messages in thread
From: Daniel Gomez @ 2025-06-11 12:15 UTC (permalink / raw)
To: Christoph Hellwig, Jens Axboe
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On 10/06/2025 07.06, Christoph Hellwig wrote:
> Use the blk_rq_dma_map API to DMA map requests instead of scatterlists.
> This removes the need to allocate a scatterlist covering every segment,
> and thus the overall transfer length limit based on the scatterlist
> allocation.
>
> Instead the DMA mapping is done by iterating the bio_vec chain in the
> request directly. The unmap is handled differently depending on how
> we mapped:
>
> - when using an IOMMU only a single IOVA is used, and it is stored in
> iova_state
> - for direct mappings that don't use swiotlb and are cache coherent no
> unmap is needed at all
> - for direct mappings that are not cache coherent or use swiotlb, the
> physical addresses are rebuild from the PRPs or SGL segments
>
> The latter unfortunately adds a fair amount of code to the driver, but
> it is code not used in the fast path.
>
> The conversion only covers the data mapping path, and still uses a
> scatterlist for the multi-segment metadata case. I plan to convert that
> as soon as we have good test coverage for the multi-segment metadata
> path.
>
> Thanks to Chaitanya Kulkarni for an initial attempt at a new DMA API
> conversion for nvme-pci, Kanchan Joshi for bringing back the single
> segment optimization, Leon Romanovsky for shepherding this through a
> gazillion rebases and Nitesh Shetty for various improvements.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> drivers/nvme/host/pci.c | 388 +++++++++++++++++++++++++---------------
> 1 file changed, 242 insertions(+), 146 deletions(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 04461efb6d27..2d3573293d0c 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -7,7 +7,7 @@
> #include <linux/acpi.h>
> #include <linux/async.h>
> #include <linux/blkdev.h>
> -#include <linux/blk-mq.h>
> +#include <linux/blk-mq-dma.h>
> #include <linux/blk-integrity.h>
> #include <linux/dmi.h>
> #include <linux/init.h>
> @@ -27,7 +27,6 @@
> #include <linux/io-64-nonatomic-lo-hi.h>
> #include <linux/io-64-nonatomic-hi-lo.h>
> #include <linux/sed-opal.h>
> -#include <linux/pci-p2pdma.h>
>
> #include "trace.h"
> #include "nvme.h"
> @@ -46,13 +45,11 @@
> #define NVME_MAX_NR_DESCRIPTORS 5
>
> /*
> - * For data SGLs we support a single descriptors worth of SGL entries, but for
> - * now we also limit it to avoid an allocation larger than PAGE_SIZE for the
> - * scatterlist.
> + * For data SGLs we support a single descriptors worth of SGL entries.
> + * For PRPs, segments don't matter at all.
> */
> #define NVME_MAX_SEGS \
> - min(NVME_CTRL_PAGE_SIZE / sizeof(struct nvme_sgl_desc), \
> - (PAGE_SIZE / sizeof(struct scatterlist)))
> + (NVME_CTRL_PAGE_SIZE / sizeof(struct nvme_sgl_desc))
The 8 MiB max transfer size is only reachable if host segments are at least 32k.
But I think this limitation is only on the SGL side, right? Adding support to
multiple SGL segments should allow us to increase this limit 256 -> 2048.
Is this correct?
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 7/9] nvme-pci: convert the data mapping blk_rq_dma_map
2025-06-11 12:15 ` Daniel Gomez
@ 2025-06-12 5:02 ` Christoph Hellwig
2025-06-16 7:41 ` Daniel Gomez
2025-06-17 17:43 ` Daniel Gomez
0 siblings, 2 replies; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-12 5:02 UTC (permalink / raw)
To: Daniel Gomez
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
Chaitanya Kulkarni, Kanchan Joshi, Leon Romanovsky, Nitesh Shetty,
Logan Gunthorpe, linux-block, linux-nvme
On Wed, Jun 11, 2025 at 02:15:10PM +0200, Daniel Gomez wrote:
> > #define NVME_MAX_SEGS \
> > - min(NVME_CTRL_PAGE_SIZE / sizeof(struct nvme_sgl_desc), \
> > - (PAGE_SIZE / sizeof(struct scatterlist)))
> > + (NVME_CTRL_PAGE_SIZE / sizeof(struct nvme_sgl_desc))
>
> The 8 MiB max transfer size is only reachable if host segments are at least 32k.
> But I think this limitation is only on the SGL side, right?
Yes, PRPs don't really have the concept of segments to start with.
> Adding support to
> multiple SGL segments should allow us to increase this limit 256 -> 2048.
>
> Is this correct?
Yes. Note that plenty of hardware doesn't really like chained SGLs too
much and you might get performance degradation.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 7/9] nvme-pci: convert the data mapping blk_rq_dma_map
2025-06-12 5:02 ` Christoph Hellwig
@ 2025-06-16 7:41 ` Daniel Gomez
2025-06-16 11:33 ` Christoph Hellwig
2025-06-17 17:43 ` Daniel Gomez
1 sibling, 1 reply; 65+ messages in thread
From: Daniel Gomez @ 2025-06-16 7:41 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe,
linux-block, linux-nvme
On 12/06/2025 07.02, Christoph Hellwig wrote:
> On Wed, Jun 11, 2025 at 02:15:10PM +0200, Daniel Gomez wrote:
>>> #define NVME_MAX_SEGS \
>>> - min(NVME_CTRL_PAGE_SIZE / sizeof(struct nvme_sgl_desc), \
>>> - (PAGE_SIZE / sizeof(struct scatterlist)))
>>> + (NVME_CTRL_PAGE_SIZE / sizeof(struct nvme_sgl_desc))
>>
>> The 8 MiB max transfer size is only reachable if host segments are at least 32k.
>> But I think this limitation is only on the SGL side, right?
>
> Yes, PRPs don't really have the concept of segments to start with.
>
>> Adding support to
>> multiple SGL segments should allow us to increase this limit 256 -> 2048.
>>
>> Is this correct?
>
> Yes. Note that plenty of hardware doesn't really like chained SGLs too
> much and you might get performance degradation.
>
I see the driver assumes better performance on SGLs over PRPs when I/Os are
greater than 32k (this is the default sgl threshold). But what if chaining SGL
is needed, i.e. my host segments are between 4k and 16k, would PRPs perform
better than chaining SGLs?
Also, if host segments are between 4k and 16k, PRPs would be able to support it
but this limit prevents that use case. I guess the question is if you see any
blocker to enable this path?
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 7/9] nvme-pci: convert the data mapping blk_rq_dma_map
2025-06-16 7:41 ` Daniel Gomez
@ 2025-06-16 11:33 ` Christoph Hellwig
2025-06-17 17:33 ` Daniel Gomez
0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-16 11:33 UTC (permalink / raw)
To: Daniel Gomez
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
Chaitanya Kulkarni, Kanchan Joshi, Leon Romanovsky, Nitesh Shetty,
Logan Gunthorpe, linux-block, linux-nvme
On Mon, Jun 16, 2025 at 09:41:15AM +0200, Daniel Gomez wrote:
> >> Adding support to
> >> multiple SGL segments should allow us to increase this limit 256 -> 2048.
> >>
> >> Is this correct?
> >
> > Yes. Note that plenty of hardware doesn't really like chained SGLs too
> > much and you might get performance degradation.
> >
>
> I see the driver assumes better performance on SGLs over PRPs when I/Os are
> greater than 32k (this is the default sgl threshold).
Yes. Although that might be worth revisiting - the threshold was we
benchmarked on the first hardware we got hold off that supports SGLs.
> But what if chaining SGL
> is needed, i.e. my host segments are between 4k and 16k, would PRPs perform
> better than chaining SGLs?
Right now we don't support chaining of SGL descriptors, so that's a moot
point.
> Also, if host segments are between 4k and 16k, PRPs would be able to support it
> but this limit prevents that use case. I guess the question is if you see any
> blocker to enable this path?
Well, if you think it's worth it give it a spin on a wide variety of
hardware.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 7/9] nvme-pci: convert the data mapping blk_rq_dma_map
2025-06-16 11:33 ` Christoph Hellwig
@ 2025-06-17 17:33 ` Daniel Gomez
2025-06-17 23:25 ` Keith Busch
0 siblings, 1 reply; 65+ messages in thread
From: Daniel Gomez @ 2025-06-17 17:33 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe,
linux-block, linux-nvme
On 16/06/2025 13.33, Christoph Hellwig wrote:
> On Mon, Jun 16, 2025 at 09:41:15AM +0200, Daniel Gomez wrote:
>> Also, if host segments are between 4k and 16k, PRPs would be able to support it
>> but this limit prevents that use case. I guess the question is if you see any
>> blocker to enable this path?
>
> Well, if you think it's worth it give it a spin on a wide variety of
> hardware.
I'm not sure if I understand this. Can you clarify why hardware evaluation would
be required? What exactly?
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 7/9] nvme-pci: convert the data mapping blk_rq_dma_map
2025-06-17 17:33 ` Daniel Gomez
@ 2025-06-17 23:25 ` Keith Busch
0 siblings, 0 replies; 65+ messages in thread
From: Keith Busch @ 2025-06-17 23:25 UTC (permalink / raw)
To: Daniel Gomez
Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe,
linux-block, linux-nvme
On Tue, Jun 17, 2025 at 07:33:46PM +0200, Daniel Gomez wrote:
> On 16/06/2025 13.33, Christoph Hellwig wrote:
> > On Mon, Jun 16, 2025 at 09:41:15AM +0200, Daniel Gomez wrote:
> >> Also, if host segments are between 4k and 16k, PRPs would be able to support it
> >> but this limit prevents that use case. I guess the question is if you see any
> >> blocker to enable this path?
> >
> > Well, if you think it's worth it give it a spin on a wide variety of
> > hardware.
>
> I'm not sure if I understand this. Can you clarify why hardware evaluation would
> be required? What exactly?
This is about chaining SGL's so I think the request is benchmarking if
that's faster than splitting commands. Splitting hand been quicker for
much hardware because they could process SQE's in parallel easier than
walking a single command's SG List.
On a slightly related topic, NVMe SGL's don't need the
"virt_boundary_mask". So for devices are optimized for SGL, then that
queue limit could go away, and I've recently heard use cases for the
passthrough interface where that would be useful on avoiding kernel copy
bounce buffers (sorry for the digression).
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 7/9] nvme-pci: convert the data mapping blk_rq_dma_map
2025-06-12 5:02 ` Christoph Hellwig
2025-06-16 7:41 ` Daniel Gomez
@ 2025-06-17 17:43 ` Daniel Gomez
2025-06-17 17:45 ` Daniel Gomez
1 sibling, 1 reply; 65+ messages in thread
From: Daniel Gomez @ 2025-06-17 17:43 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe,
linux-block, linux-nvme
On 12/06/2025 07.02, Christoph Hellwig wrote:
> On Wed, Jun 11, 2025 at 02:15:10PM +0200, Daniel Gomez wrote:
>>> #define NVME_MAX_SEGS \
>>> - min(NVME_CTRL_PAGE_SIZE / sizeof(struct nvme_sgl_desc), \
>>> - (PAGE_SIZE / sizeof(struct scatterlist)))
>>> + (NVME_CTRL_PAGE_SIZE / sizeof(struct nvme_sgl_desc))
>>
>> The 8 MiB max transfer size is only reachable if host segments are at least 32k.
>> But I think this limitation is only on the SGL side, right?
>
> Yes, PRPs don't really have the concept of segments to start with.
SGLs don't have the same MPS limitation we have in PRPs. So I think the correct
calculation for NVME_MAX_SEGS is PAGE_SIZE / sizeof(struct nvme_sgl_desc).
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 7/9] nvme-pci: convert the data mapping blk_rq_dma_map
2025-06-17 17:43 ` Daniel Gomez
@ 2025-06-17 17:45 ` Daniel Gomez
0 siblings, 0 replies; 65+ messages in thread
From: Daniel Gomez @ 2025-06-17 17:45 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe,
linux-block, linux-nvme
On 17/06/2025 19.43, Daniel Gomez wrote:
> On 12/06/2025 07.02, Christoph Hellwig wrote:
>> On Wed, Jun 11, 2025 at 02:15:10PM +0200, Daniel Gomez wrote:
>>>> #define NVME_MAX_SEGS \
>>>> - min(NVME_CTRL_PAGE_SIZE / sizeof(struct nvme_sgl_desc), \
>>>> - (PAGE_SIZE / sizeof(struct scatterlist)))
>>>> + (NVME_CTRL_PAGE_SIZE / sizeof(struct nvme_sgl_desc))
>>>
>>> The 8 MiB max transfer size is only reachable if host segments are at least 32k.
>>> But I think this limitation is only on the SGL side, right?
>>
>> Yes, PRPs don't really have the concept of segments to start with.
>
> SGLs don't have the same MPS limitation we have in PRPs. So I think the correct
> calculation for NVME_MAX_SEGS is PAGE_SIZE / sizeof(struct nvme_sgl_desc).
>
Please, ignore. I forgot I already made this point.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 7/9] nvme-pci: convert the data mapping blk_rq_dma_map
2025-06-10 5:06 ` [PATCH 7/9] nvme-pci: convert the data mapping blk_rq_dma_map Christoph Hellwig
2025-06-10 13:19 ` Leon Romanovsky
2025-06-11 12:15 ` Daniel Gomez
@ 2025-06-11 14:13 ` Daniel Gomez
2025-06-12 5:03 ` Christoph Hellwig
2025-06-16 7:49 ` Daniel Gomez
3 siblings, 1 reply; 65+ messages in thread
From: Daniel Gomez @ 2025-06-11 14:13 UTC (permalink / raw)
To: Christoph Hellwig, Jens Axboe
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On 10/06/2025 07.06, Christoph Hellwig wrote:
> Use the blk_rq_dma_map API to DMA map requests instead of scatterlists.
> This removes the need to allocate a scatterlist covering every segment,
> and thus the overall transfer length limit based on the scatterlist
> allocation.
>
> Instead the DMA mapping is done by iterating the bio_vec chain in the
> request directly. The unmap is handled differently depending on how
> we mapped:
>
> - when using an IOMMU only a single IOVA is used, and it is stored in
> iova_state
> - for direct mappings that don't use swiotlb and are cache coherent no
> unmap is needed at all
> - for direct mappings that are not cache coherent or use swiotlb, the
> physical addresses are rebuild from the PRPs or SGL segments
>
> The latter unfortunately adds a fair amount of code to the driver, but
> it is code not used in the fast path.
>
> The conversion only covers the data mapping path, and still uses a
> scatterlist for the multi-segment metadata case. I plan to convert that
> as soon as we have good test coverage for the multi-segment metadata
> path.
>
> Thanks to Chaitanya Kulkarni for an initial attempt at a new DMA API
> conversion for nvme-pci, Kanchan Joshi for bringing back the single
> segment optimization, Leon Romanovsky for shepherding this through a
> gazillion rebases and Nitesh Shetty for various improvements.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> drivers/nvme/host/pci.c | 388 +++++++++++++++++++++++++---------------
> 1 file changed, 242 insertions(+), 146 deletions(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 04461efb6d27..2d3573293d0c 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
...
> @@ -2908,26 +3018,14 @@ static int nvme_disable_prepare_reset(struct nvme_dev *dev, bool shutdown)
> static int nvme_pci_alloc_iod_mempool(struct nvme_dev *dev)
Since this pool is now used exclusively for metadata, it makes sense to update
the function name accordingly:
static int nvme_pci_alloc_iod_meta_mempool(struct nvme_dev *dev)
> {
> size_t meta_size = sizeof(struct scatterlist) * (NVME_MAX_META_SEGS + 1);
> - size_t alloc_size = sizeof(struct scatterlist) * NVME_MAX_SEGS;
> -
> - dev->iod_mempool = mempool_create_node(1,
> - mempool_kmalloc, mempool_kfree,
> - (void *)alloc_size, GFP_KERNEL,
> - dev_to_node(dev->dev));
> - if (!dev->iod_mempool)
> - return -ENOMEM;
>
> dev->iod_meta_mempool = mempool_create_node(1,
> mempool_kmalloc, mempool_kfree,
> (void *)meta_size, GFP_KERNEL,
> dev_to_node(dev->dev));
> if (!dev->iod_meta_mempool)
> - goto free;
> -
> + return -ENOMEM;
> return 0;
> -free:
> - mempool_destroy(dev->iod_mempool);
> - return -ENOMEM;
> }
>
> static void nvme_free_tagset(struct nvme_dev *dev)
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 7/9] nvme-pci: convert the data mapping blk_rq_dma_map
2025-06-11 14:13 ` Daniel Gomez
@ 2025-06-12 5:03 ` Christoph Hellwig
0 siblings, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-12 5:03 UTC (permalink / raw)
To: Daniel Gomez
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
Chaitanya Kulkarni, Kanchan Joshi, Leon Romanovsky, Nitesh Shetty,
Logan Gunthorpe, linux-block, linux-nvme
On Wed, Jun 11, 2025 at 04:13:22PM +0200, Daniel Gomez wrote:
> > @@ -2908,26 +3018,14 @@ static int nvme_disable_prepare_reset(struct nvme_dev *dev, bool shutdown)
> > static int nvme_pci_alloc_iod_mempool(struct nvme_dev *dev)
>
> Since this pool is now used exclusively for metadata, it makes sense to update
> the function name accordingly:
Nah. It should go away entirely pretty soon, so let's keep the churn
down.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 7/9] nvme-pci: convert the data mapping blk_rq_dma_map
2025-06-10 5:06 ` [PATCH 7/9] nvme-pci: convert the data mapping blk_rq_dma_map Christoph Hellwig
` (2 preceding siblings ...)
2025-06-11 14:13 ` Daniel Gomez
@ 2025-06-16 7:49 ` Daniel Gomez
2025-06-16 11:35 ` Christoph Hellwig
3 siblings, 1 reply; 65+ messages in thread
From: Daniel Gomez @ 2025-06-16 7:49 UTC (permalink / raw)
To: Christoph Hellwig, Jens Axboe
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On 10/06/2025 07.06, Christoph Hellwig wrote:
> @@ -46,13 +45,11 @@
> #define NVME_MAX_NR_DESCRIPTORS 5
>
> /*
> - * For data SGLs we support a single descriptors worth of SGL entries, but for
> - * now we also limit it to avoid an allocation larger than PAGE_SIZE for the
> - * scatterlist.
> + * For data SGLs we support a single descriptors worth of SGL entries.
> + * For PRPs, segments don't matter at all.
> */
> #define NVME_MAX_SEGS \
> - min(NVME_CTRL_PAGE_SIZE / sizeof(struct nvme_sgl_desc), \
> - (PAGE_SIZE / sizeof(struct scatterlist)))
> + (NVME_CTRL_PAGE_SIZE / sizeof(struct nvme_sgl_desc))
IIRC, I've seen in the commit history going from PAGE_SIZE to CC.MPS for
different cases in the driver. PRPs requires contiguous regions to be CC.MPS,
i.e use NVME_CTRL_PAGE_SIZE for PRP lists and entries. But I think that is not a
limit for SGLs. Can we use PAGE_SIZE here?
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 7/9] nvme-pci: convert the data mapping blk_rq_dma_map
2025-06-16 7:49 ` Daniel Gomez
@ 2025-06-16 11:35 ` Christoph Hellwig
0 siblings, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-16 11:35 UTC (permalink / raw)
To: Daniel Gomez
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
Chaitanya Kulkarni, Kanchan Joshi, Leon Romanovsky, Nitesh Shetty,
Logan Gunthorpe, linux-block, linux-nvme
On Mon, Jun 16, 2025 at 09:49:47AM +0200, Daniel Gomez wrote:
> > #define NVME_MAX_SEGS \
> > - min(NVME_CTRL_PAGE_SIZE / sizeof(struct nvme_sgl_desc), \
> > - (PAGE_SIZE / sizeof(struct scatterlist)))
> > + (NVME_CTRL_PAGE_SIZE / sizeof(struct nvme_sgl_desc))
>
> IIRC, I've seen in the commit history going from PAGE_SIZE to CC.MPS for
> different cases in the driver. PRPs requires contiguous regions to be CC.MPS,
> i.e use NVME_CTRL_PAGE_SIZE for PRP lists and entries. But I think that is not a
> limit for SGLs. Can we use PAGE_SIZE here?
Sure. Separate unrelated patch, though.
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH 8/9] nvme-pci: replace NVME_MAX_KB_SZ with NVME_MAX_BYTE
2025-06-10 5:06 new DMA API conversion for nvme-pci Christoph Hellwig
` (6 preceding siblings ...)
2025-06-10 5:06 ` [PATCH 7/9] nvme-pci: convert the data mapping blk_rq_dma_map Christoph Hellwig
@ 2025-06-10 5:06 ` Christoph Hellwig
2025-06-10 13:20 ` Leon Romanovsky
2025-06-11 14:00 ` Daniel Gomez
2025-06-10 5:06 ` [PATCH 9/9] nvme-pci: rework the build time assert for NVME_MAX_NR_DESCRIPTORS Christoph Hellwig
8 siblings, 2 replies; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-10 5:06 UTC (permalink / raw)
To: Jens Axboe
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
Having a define in kiB units is a bit weird. Also update the
comment now that there is not scatterlist limit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/nvme/host/pci.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 2d3573293d0c..735f448d8db2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -38,10 +38,9 @@
#define NVME_SMALL_POOL_SIZE 256
/*
- * These can be higher, but we need to ensure that any command doesn't
- * require an sg allocation that needs more than a page of data.
+ * Arbitrary upper bound.
*/
-#define NVME_MAX_KB_SZ 8192
+#define NVME_MAX_BYTES SZ_8M
#define NVME_MAX_NR_DESCRIPTORS 5
/*
@@ -413,7 +412,7 @@ static bool nvme_dbbuf_update_and_check_event(u16 value, __le32 *dbbuf_db,
*/
static __always_inline int nvme_pci_npages_prp(void)
{
- unsigned max_bytes = (NVME_MAX_KB_SZ * 1024) + NVME_CTRL_PAGE_SIZE;
+ unsigned max_bytes = NVME_MAX_BYTES + NVME_CTRL_PAGE_SIZE;
unsigned nprps = DIV_ROUND_UP(max_bytes, NVME_CTRL_PAGE_SIZE);
return DIV_ROUND_UP(8 * nprps, NVME_CTRL_PAGE_SIZE - 8);
}
@@ -3363,7 +3362,8 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
* over a single page.
*/
dev->ctrl.max_hw_sectors = min_t(u32,
- NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
+ NVME_MAX_BYTES >> SECTOR_SHIFT,
+ dma_opt_mapping_size(&pdev->dev) >> 9);
dev->ctrl.max_segments = NVME_MAX_SEGS;
dev->ctrl.max_integrity_segments = 1;
return dev;
--
2.47.2
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH 8/9] nvme-pci: replace NVME_MAX_KB_SZ with NVME_MAX_BYTE
2025-06-10 5:06 ` [PATCH 8/9] nvme-pci: replace NVME_MAX_KB_SZ with NVME_MAX_BYTE Christoph Hellwig
@ 2025-06-10 13:20 ` Leon Romanovsky
2025-06-11 14:00 ` Daniel Gomez
1 sibling, 0 replies; 65+ messages in thread
From: Leon Romanovsky @ 2025-06-10 13:20 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On Tue, Jun 10, 2025 at 07:06:46AM +0200, Christoph Hellwig wrote:
> Having a define in kiB units is a bit weird. Also update the
> comment now that there is not scatterlist limit.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> drivers/nvme/host/pci.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
Thanks,
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 8/9] nvme-pci: replace NVME_MAX_KB_SZ with NVME_MAX_BYTE
2025-06-10 5:06 ` [PATCH 8/9] nvme-pci: replace NVME_MAX_KB_SZ with NVME_MAX_BYTE Christoph Hellwig
2025-06-10 13:20 ` Leon Romanovsky
@ 2025-06-11 14:00 ` Daniel Gomez
1 sibling, 0 replies; 65+ messages in thread
From: Daniel Gomez @ 2025-06-11 14:00 UTC (permalink / raw)
To: Christoph Hellwig, Jens Axboe
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On 10/06/2025 07.06, Christoph Hellwig wrote:
> Having a define in kiB units is a bit weird. Also update the
> comment now that there is not scatterlist limit.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> drivers/nvme/host/pci.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 2d3573293d0c..735f448d8db2 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
...
> @@ -3363,7 +3362,8 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
> * over a single page.
> */
> dev->ctrl.max_hw_sectors = min_t(u32,
> - NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
> + NVME_MAX_BYTES >> SECTOR_SHIFT,
> + dma_opt_mapping_size(&pdev->dev) >> 9);
We use SECTOR_SHIFT for MAX_BYTES but not for the bytes returned from
dma_opt_mapping_size(). Yes, I know this is not part of the change, but still is
the same line. I think it make sense to convert it too.
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH 9/9] nvme-pci: rework the build time assert for NVME_MAX_NR_DESCRIPTORS
2025-06-10 5:06 new DMA API conversion for nvme-pci Christoph Hellwig
` (7 preceding siblings ...)
2025-06-10 5:06 ` [PATCH 8/9] nvme-pci: replace NVME_MAX_KB_SZ with NVME_MAX_BYTE Christoph Hellwig
@ 2025-06-10 5:06 ` Christoph Hellwig
2025-06-10 13:21 ` Leon Romanovsky
2025-06-11 13:51 ` Daniel Gomez
8 siblings, 2 replies; 65+ messages in thread
From: Christoph Hellwig @ 2025-06-10 5:06 UTC (permalink / raw)
To: Jens Axboe
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
The current use of an always_inline helper is a bit convoluted.
Instead use macros that represent the arithmerics used for building
up the PRP chain.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/nvme/host/pci.c | 28 +++++++++++++++-------------
1 file changed, 15 insertions(+), 13 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 735f448d8db2..0f07599e0980 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -57,6 +57,21 @@
#define NVME_MAX_META_SEGS \
((NVME_SMALL_POOL_SIZE / sizeof(struct nvme_sgl_desc)) - 1)
+/*
+ * The last entry is used to link to the next descriptor.
+ */
+#define PRPS_PER_PAGE \
+ (((NVME_CTRL_PAGE_SIZE / sizeof(__le64))) - 1)
+
+/*
+ * I/O could be non-aligned both at the beginning and end.
+ */
+#define MAX_PRP_RANGE \
+ (NVME_MAX_BYTES + 2 * (NVME_CTRL_PAGE_SIZE - 1))
+
+static_assert(MAX_PRP_RANGE / NVME_CTRL_PAGE_SIZE <=
+ (1 /* prp1 */ + NVME_MAX_NR_DESCRIPTORS * PRPS_PER_PAGE));
+
static int use_threaded_interrupts;
module_param(use_threaded_interrupts, int, 0444);
@@ -405,18 +420,6 @@ static bool nvme_dbbuf_update_and_check_event(u16 value, __le32 *dbbuf_db,
return true;
}
-/*
- * Will slightly overestimate the number of pages needed. This is OK
- * as it only leads to a small amount of wasted memory for the lifetime of
- * the I/O.
- */
-static __always_inline int nvme_pci_npages_prp(void)
-{
- unsigned max_bytes = NVME_MAX_BYTES + NVME_CTRL_PAGE_SIZE;
- unsigned nprps = DIV_ROUND_UP(max_bytes, NVME_CTRL_PAGE_SIZE);
- return DIV_ROUND_UP(8 * nprps, NVME_CTRL_PAGE_SIZE - 8);
-}
-
static struct nvme_descriptor_pools *
nvme_setup_descriptor_pools(struct nvme_dev *dev, unsigned numa_node)
{
@@ -3934,7 +3937,6 @@ static int __init nvme_init(void)
BUILD_BUG_ON(sizeof(struct nvme_create_sq) != 64);
BUILD_BUG_ON(sizeof(struct nvme_delete_queue) != 64);
BUILD_BUG_ON(IRQ_AFFINITY_MAX_SETS < 2);
- BUILD_BUG_ON(nvme_pci_npages_prp() > NVME_MAX_NR_DESCRIPTORS);
return pci_register_driver(&nvme_driver);
}
--
2.47.2
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH 9/9] nvme-pci: rework the build time assert for NVME_MAX_NR_DESCRIPTORS
2025-06-10 5:06 ` [PATCH 9/9] nvme-pci: rework the build time assert for NVME_MAX_NR_DESCRIPTORS Christoph Hellwig
@ 2025-06-10 13:21 ` Leon Romanovsky
2025-06-11 13:51 ` Daniel Gomez
1 sibling, 0 replies; 65+ messages in thread
From: Leon Romanovsky @ 2025-06-10 13:21 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni,
Kanchan Joshi, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On Tue, Jun 10, 2025 at 07:06:47AM +0200, Christoph Hellwig wrote:
> The current use of an always_inline helper is a bit convoluted.
> Instead use macros that represent the arithmerics used for building
> up the PRP chain.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> drivers/nvme/host/pci.c | 28 +++++++++++++++-------------
> 1 file changed, 15 insertions(+), 13 deletions(-)
>
Thanks,
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH 9/9] nvme-pci: rework the build time assert for NVME_MAX_NR_DESCRIPTORS
2025-06-10 5:06 ` [PATCH 9/9] nvme-pci: rework the build time assert for NVME_MAX_NR_DESCRIPTORS Christoph Hellwig
2025-06-10 13:21 ` Leon Romanovsky
@ 2025-06-11 13:51 ` Daniel Gomez
1 sibling, 0 replies; 65+ messages in thread
From: Daniel Gomez @ 2025-06-11 13:51 UTC (permalink / raw)
To: Christoph Hellwig, Jens Axboe
Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
linux-nvme
On 10/06/2025 07.06, Christoph Hellwig wrote:
> The current use of an always_inline helper is a bit convoluted.
> Instead use macros that represent the arithmerics used for building
Nit, typo here:
s/arithmerics/arithmetics/g
> up the PRP chain.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
These are my checks:
NVME_MAX_BYTES=8388608
NVME_CTRL_PAGE_SIZE=4096
MAX_PRP_RANGE=8396798
NVME_MAX_NR_DESCRIPTORS=5
PRPS_PER_PAGE=511.0
MAX_PRP_RANGE/NVME_CTRL_PAGE_SIZE=2049.99951171875
1+NVME_MAX_NR_DESCRIPTORS*PRPS_PER_PAGE=2556.0
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
^ permalink raw reply [flat|nested] 65+ messages in thread