linux-nvme.lists.infradead.org archive mirror
* new DMA API conversion for nvme-pci v2
@ 2025-06-23 14:12 Christoph Hellwig
  2025-06-23 14:12 ` [PATCH 1/8] block: don't merge different kinds of P2P transfers in a single bio Christoph Hellwig
                   ` (8 more replies)
  0 siblings, 9 replies; 23+ messages in thread
From: Christoph Hellwig @ 2025-06-23 14:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
	Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
	linux-nvme

Hi all,

this series converts the nvme-pci driver to the new IOVA-based DMA API
for the data path.

Changes since v1:
 - minor cleanups to the block dma mapping helpers
 - fix the metadata SGL supported check for bisectability
 - fix SGL threshold check
 - fix/simplify metadata SGL force checks

Diffstat:
 block/bio-integrity.c      |    3 
 block/bio.c                |   20 -
 block/blk-mq-dma.c         |  161 +++++++++++
 drivers/nvme/host/pci.c    |  615 +++++++++++++++++++++++++--------------------
 include/linux/blk-mq-dma.h |   63 ++++
 include/linux/blk_types.h  |    2 
 6 files changed, 597 insertions(+), 267 deletions(-)



* [PATCH 1/8] block: don't merge different kinds of P2P transfers in a single bio
  2025-06-23 14:12 new DMA API conversion for nvme-pci v2 Christoph Hellwig
@ 2025-06-23 14:12 ` Christoph Hellwig
  2025-06-23 15:43   ` Keith Busch
  2025-06-23 14:12 ` [PATCH 2/8] block: add scatterlist-less DMA mapping helpers Christoph Hellwig
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2025-06-23 14:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
	Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
	linux-nvme, Leon Romanovsky

To avoid the DMA mapping helpers having to check every segment for its
P2P status, ensure that a bio contains either only P2P transfers or only
non-P2P transfers, and that a P2P bio only contains ranges from a single
device.

This means the page zone check is done in the bio add path, where the
struct page should still be cache hot, and the fairly expensive P2P
topology lookup only has to be done once per bio down in the DMA mapping
path, and only for bios that are already marked.
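
For illustration only (not part of this patch), a submitter that may mix
P2P and host memory simply treats the short return from bio_add_page() as
"start a new bio"; bio, bdev, page, len, offset, nr_vecs and opf below are
placeholder names:

  if (bio_add_page(bio, page, len, offset) != len) {
          /* bio is full, or the new page comes from a different pgmap */
          submit_bio(bio);
          bio = bio_alloc(bdev, nr_vecs, opf, GFP_KERNEL);
          if (bio_add_page(bio, page, len, offset) != len)
                  return -EIO;
  }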

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
---
 block/bio-integrity.c     |  3 +++
 block/bio.c               | 20 +++++++++++++-------
 include/linux/blk_types.h |  2 ++
 3 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 10912988c8f5..6b077ca937f6 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -128,6 +128,9 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,
 	if (bip->bip_vcnt > 0) {
 		struct bio_vec *bv = &bip->bip_vec[bip->bip_vcnt - 1];
 
+		if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
+			return 0;
+
 		if (bvec_try_merge_hw_page(q, bv, page, len, offset)) {
 			bip->bip_iter.bi_size += len;
 			return len;
diff --git a/block/bio.c b/block/bio.c
index 3c0a558c90f5..92c512e876c8 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -930,8 +930,6 @@ static bool bvec_try_merge_page(struct bio_vec *bv, struct page *page,
 		return false;
 	if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
 		return false;
-	if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
-		return false;
 
 	if ((vec_end_addr & PAGE_MASK) != ((page_addr + off) & PAGE_MASK)) {
 		if (IS_ENABLED(CONFIG_KMSAN))
@@ -982,6 +980,9 @@ void __bio_add_page(struct bio *bio, struct page *page,
 	WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
 	WARN_ON_ONCE(bio_full(bio, len));
 
+	if (is_pci_p2pdma_page(page))
+		bio->bi_opf |= REQ_P2PDMA | REQ_NOMERGE;
+
 	bvec_set_page(&bio->bi_io_vec[bio->bi_vcnt], page, len, off);
 	bio->bi_iter.bi_size += len;
 	bio->bi_vcnt++;
@@ -1022,11 +1023,16 @@ int bio_add_page(struct bio *bio, struct page *page,
 	if (bio->bi_iter.bi_size > UINT_MAX - len)
 		return 0;
 
-	if (bio->bi_vcnt > 0 &&
-	    bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
-				page, len, offset)) {
-		bio->bi_iter.bi_size += len;
-		return len;
+	if (bio->bi_vcnt > 0) {
+		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+
+		if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
+			return 0;
+
+		if (bvec_try_merge_page(bv, page, len, offset)) {
+			bio->bi_iter.bi_size += len;
+			return len;
+		}
 	}
 
 	if (bio->bi_vcnt >= bio->bi_max_vecs)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 3d1577f07c1c..2a02972dc17c 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -386,6 +386,7 @@ enum req_flag_bits {
 	__REQ_DRV,		/* for driver use */
 	__REQ_FS_PRIVATE,	/* for file system (submitter) use */
 	__REQ_ATOMIC,		/* for atomic write operations */
+	__REQ_P2PDMA,		/* contains P2P DMA pages */
 	/*
 	 * Command specific flags, keep last:
 	 */
@@ -418,6 +419,7 @@ enum req_flag_bits {
 #define REQ_DRV		(__force blk_opf_t)(1ULL << __REQ_DRV)
 #define REQ_FS_PRIVATE	(__force blk_opf_t)(1ULL << __REQ_FS_PRIVATE)
 #define REQ_ATOMIC	(__force blk_opf_t)(1ULL << __REQ_ATOMIC)
+#define REQ_P2PDMA	(__force blk_opf_t)(1ULL << __REQ_P2PDMA)
 
 #define REQ_NOUNMAP	(__force blk_opf_t)(1ULL << __REQ_NOUNMAP)
 
-- 
2.47.2




* [PATCH 2/8] block: add scatterlist-less DMA mapping helpers
  2025-06-23 14:12 new DMA API conversion for nvme-pci v2 Christoph Hellwig
  2025-06-23 14:12 ` [PATCH 1/8] block: don't merge different kinds of P2P transfers in a single bio Christoph Hellwig
@ 2025-06-23 14:12 ` Christoph Hellwig
  2025-06-23 15:43   ` Keith Busch
  2025-06-23 14:12 ` [PATCH 3/8] nvme-pci: refactor nvme_pci_use_sgls Christoph Hellwig
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2025-06-23 14:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
	Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
	linux-nvme, Leon Romanovsky

Add a new blk_rq_dma_map / blk_rq_dma_unmap pair that does away with
the wasteful scatterlist structure.  Instead it uses the mapping iterator
to either add segments to the IOVA for IOMMU operations, or to map them
one by one for the direct mapping.  For the IOMMU case, instead of
a scatterlist with an entry for each segment, only a single [dma_addr,len]
pair needs to be stored for processing a request, and for the direct
mapping the per-segment allocation shrinks from
[page,offset,len,dma_addr,dma_len] to just [dma_addr,len].

One big difference from the scatterlist API, which could be considered a
downside, is that the IOVA collapsing only works when the driver sets a
virt_boundary that matches the IOMMU granule.  NVMe already does this, so
it works perfectly.
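
A minimal sketch of how a driver consumes the new pair (dma_dev, iod and
program_segment() are placeholders here; the real nvme-pci conversion
follows later in the series):

  struct blk_dma_iter iter;

  if (!blk_rq_dma_map_iter_start(req, dma_dev, &iod->dma_state, &iter))
          return iter.status;
  do {
          /* each iteration yields one mapped [addr, len] range */
          program_segment(iod, iter.addr, iter.len);
  } while (blk_rq_dma_map_iter_next(req, dma_dev, &iod->dma_state, &iter));

  /* BLK_STS_OK if everything was mapped, otherwise the error */
  return iter.status;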

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
---
 block/blk-mq-dma.c         | 161 +++++++++++++++++++++++++++++++++++++
 include/linux/blk-mq-dma.h |  63 +++++++++++++++
 2 files changed, 224 insertions(+)
 create mode 100644 include/linux/blk-mq-dma.h

diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
index 82bae475dfa4..ad283017caef 100644
--- a/block/blk-mq-dma.c
+++ b/block/blk-mq-dma.c
@@ -2,6 +2,7 @@
 /*
  * Copyright (C) 2025 Christoph Hellwig
  */
+#include <linux/blk-mq-dma.h>
 #include "blk.h"
 
 struct phys_vec {
@@ -61,6 +62,166 @@ static bool blk_map_iter_next(struct request *req, struct req_iterator *iter,
 	return true;
 }
 
+/*
+ * The IOVA-based DMA API wants to be able to coalesce at the minimal IOMMU page
+ * size granularity (which is guaranteed to be <= PAGE_SIZE and usually 4k), so
+ * we need to ensure our segments are aligned to this as well.
+ *
+ * Note that there is no point in using the slightly more complicated IOVA based
+ * path for single segment mappings.
+ */
+static inline bool blk_can_dma_map_iova(struct request *req,
+		struct device *dma_dev)
+{
+	return !((queue_virt_boundary(req->q) + 1) &
+		dma_get_merge_boundary(dma_dev));
+}
+
+static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec)
+{
+	iter->addr = pci_p2pdma_bus_addr_map(&iter->p2pdma, vec->paddr);
+	iter->len = vec->len;
+	return true;
+}
+
+static bool blk_dma_map_direct(struct request *req, struct device *dma_dev,
+		struct blk_dma_iter *iter, struct phys_vec *vec)
+{
+	iter->addr = dma_map_page(dma_dev, phys_to_page(vec->paddr),
+			offset_in_page(vec->paddr), vec->len, rq_dma_dir(req));
+	if (dma_mapping_error(dma_dev, iter->addr)) {
+		iter->status = BLK_STS_RESOURCE;
+		return false;
+	}
+	iter->len = vec->len;
+	return true;
+}
+
+static bool blk_rq_dma_map_iova(struct request *req, struct device *dma_dev,
+		struct dma_iova_state *state, struct blk_dma_iter *iter,
+		struct phys_vec *vec)
+{
+	enum dma_data_direction dir = rq_dma_dir(req);
+	unsigned int mapped = 0;
+	int error;
+
+	iter->addr = state->addr;
+	iter->len = dma_iova_size(state);
+
+	do {
+		error = dma_iova_link(dma_dev, state, vec->paddr, mapped,
+				vec->len, dir, 0);
+		if (error)
+			break;
+		mapped += vec->len;
+	} while (blk_map_iter_next(req, &iter->iter, vec));
+
+	error = dma_iova_sync(dma_dev, state, 0, mapped);
+	if (error) {
+		iter->status = errno_to_blk_status(error);
+		return false;
+	}
+
+	return true;
+}
+
+/**
+ * blk_rq_dma_map_iter_start - map the first DMA segment for a request
+ * @req:	request to map
+ * @dma_dev:	device to map to
+ * @state:	DMA IOVA state
+ * @iter:	block layer DMA iterator
+ *
+ * Start DMA mapping @req to @dma_dev.  @state and @iter are provided by the
+ * caller and don't need to be initialized.  @state needs to be stored for use
+ * at unmap time, @iter is only needed at map time.
+ *
+ * Returns %false if there is no segment to map, including due to an error, or
+ * %true if it did map a segment.
+ *
+ * If a segment was mapped, the DMA address for it is returned in @iter.addr and
+ * the length in @iter.len.  If no segment was mapped the status code is
+ * returned in @iter.status.
+ *
+ * The caller can call blk_rq_dma_map_coalesce() to check if further segments
+ * need to be mapped after this, or go straight to blk_rq_dma_map_iter_next()
+ * to try to map the following segments.
+ */
+bool blk_rq_dma_map_iter_start(struct request *req, struct device *dma_dev,
+		struct dma_iova_state *state, struct blk_dma_iter *iter)
+{
+	unsigned int total_len = blk_rq_payload_bytes(req);
+	struct phys_vec vec;
+
+	iter->iter.bio = req->bio;
+	iter->iter.iter = req->bio->bi_iter;
+	memset(&iter->p2pdma, 0, sizeof(iter->p2pdma));
+	iter->status = BLK_STS_OK;
+
+	/*
+	 * Grab the first segment ASAP because we'll need it to check for P2P
+	 * transfers.
+	 */
+	if (!blk_map_iter_next(req, &iter->iter, &vec))
+		return false;
+
+	if (IS_ENABLED(CONFIG_PCI_P2PDMA) && (req->cmd_flags & REQ_P2PDMA)) {
+		switch (pci_p2pdma_state(&iter->p2pdma, dma_dev,
+					 phys_to_page(vec.paddr))) {
+		case PCI_P2PDMA_MAP_BUS_ADDR:
+			return blk_dma_map_bus(iter, &vec);
+		case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+			/*
+			 * P2P transfers through the host bridge are treated the
+			 * same as non-P2P transfers below and during unmap.
+			 */
+			req->cmd_flags &= ~REQ_P2PDMA;
+			break;
+		default:
+			iter->status = BLK_STS_INVAL;
+			return false;
+		}
+	}
+
+	if (blk_can_dma_map_iova(req, dma_dev) &&
+	    dma_iova_try_alloc(dma_dev, state, vec.paddr, total_len))
+		return blk_rq_dma_map_iova(req, dma_dev, state, iter, &vec);
+	return blk_dma_map_direct(req, dma_dev, iter, &vec);
+}
+EXPORT_SYMBOL_GPL(blk_rq_dma_map_iter_start);
+
+/**
+ * blk_rq_dma_map_iter_next - map the next DMA segment for a request
+ * @req:	request to map
+ * @dma_dev:	device to map to
+ * @state:	DMA IOVA state
+ * @iter:	block layer DMA iterator
+ *
+ * Iterate to the next mapping after a previous call to
+ * blk_rq_dma_map_iter_start().  See there for a detailed description of the
+ * arguments.
+ *
+ * Returns %false if there is no segment to map, including due to an error, or
+ * %true if it did map a segment.
+ *
+ * If a segment was mapped, the DMA address for it is returned in @iter.addr and
+ * the length in @iter.len.  If no segment was mapped the status code is
+ * returned in @iter.status.
+ */
+bool blk_rq_dma_map_iter_next(struct request *req, struct device *dma_dev,
+		struct dma_iova_state *state, struct blk_dma_iter *iter)
+{
+	struct phys_vec vec;
+
+	if (!blk_map_iter_next(req, &iter->iter, &vec))
+		return false;
+
+	if (iter->p2pdma.map == PCI_P2PDMA_MAP_BUS_ADDR)
+		return blk_dma_map_bus(iter, &vec);
+	return blk_dma_map_direct(req, dma_dev, iter, &vec);
+}
+EXPORT_SYMBOL_GPL(blk_rq_dma_map_iter_next);
+
 static inline struct scatterlist *
 blk_next_sg(struct scatterlist **sg, struct scatterlist *sglist)
 {
diff --git a/include/linux/blk-mq-dma.h b/include/linux/blk-mq-dma.h
new file mode 100644
index 000000000000..c26a01aeae00
--- /dev/null
+++ b/include/linux/blk-mq-dma.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BLK_MQ_DMA_H
+#define BLK_MQ_DMA_H
+
+#include <linux/blk-mq.h>
+#include <linux/pci-p2pdma.h>
+
+struct blk_dma_iter {
+	/* Output address range for this iteration */
+	dma_addr_t			addr;
+	u32				len;
+
+	/* Status code. Only valid when blk_rq_dma_map_iter_* returned false */
+	blk_status_t			status;
+
+	/* Internal to blk_rq_dma_map_iter_* */
+	struct req_iterator		iter;
+	struct pci_p2pdma_map_state	p2pdma;
+};
+
+bool blk_rq_dma_map_iter_start(struct request *req, struct device *dma_dev,
+		struct dma_iova_state *state, struct blk_dma_iter *iter);
+bool blk_rq_dma_map_iter_next(struct request *req, struct device *dma_dev,
+		struct dma_iova_state *state, struct blk_dma_iter *iter);
+
+/**
+ * blk_rq_dma_map_coalesce - were all segments coalesced?
+ * @state: DMA state to check
+ *
+ * Returns true if blk_rq_dma_map_iter_start coalesced all segments into a
+ * single DMA range.
+ */
+static inline bool blk_rq_dma_map_coalesce(struct dma_iova_state *state)
+{
+	return dma_use_iova(state);
+}
+
+/**
+ * blk_rq_dma_unmap - try to DMA unmap a request
+ * @req:	request to unmap
+ * @dma_dev:	device to unmap from
+ * @state:	DMA IOVA state
+ * @mapped_len: number of bytes to unmap
+ *
+ * Returns %false if the caller needs to manually unmap every DMA segment
+ * mapped using @iter or %true if no work is left to be done.
+ */
+static inline bool blk_rq_dma_unmap(struct request *req, struct device *dma_dev,
+		struct dma_iova_state *state, size_t mapped_len)
+{
+	if (req->cmd_flags & REQ_P2PDMA)
+		return true;
+
+	if (dma_use_iova(state)) {
+		dma_iova_destroy(dma_dev, state, mapped_len, rq_dma_dir(req),
+				 0);
+		return true;
+	}
+
+	return !dma_need_unmap(dma_dev);
+}
+
+#endif /* BLK_MQ_DMA_H */
-- 
2.47.2




* [PATCH 3/8] nvme-pci: refactor nvme_pci_use_sgls
  2025-06-23 14:12 new DMA API conversion for nvme-pci v2 Christoph Hellwig
  2025-06-23 14:12 ` [PATCH 1/8] block: don't merge different kinds of P2P transfers in a single bio Christoph Hellwig
  2025-06-23 14:12 ` [PATCH 2/8] block: add scatterlist-less DMA mapping helpers Christoph Hellwig
@ 2025-06-23 14:12 ` Christoph Hellwig
  2025-06-23 15:27   ` Keith Busch
  2025-06-23 14:12 ` [PATCH 4/8] nvme-pci: merge the simple PRP and SGL setup into a common helper Christoph Hellwig
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2025-06-23 14:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
	Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
	linux-nvme

Move the average segment size calculation into a separate helper, and
return a tristate to distinguish the cases where SGLs can be used from
those where they must be used.  This will allow simplifying the code and
making more efficient decisions in follow-on changes.
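
Spelled out, the three cases of the tristate added below are (the comments
here are explanatory only and not part of the patch):

  enum nvme_use_sgl {
          SGL_UNSUPPORTED,  /* admin queue or no controller SGL support: PRPs only */
          SGL_SUPPORTED,    /* SGLs allowed; pick SGL vs PRP by average segment size */
          SGL_FORCED,       /* user commands or >1 integrity segment: must use SGLs */
  };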

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/pci.c | 41 +++++++++++++++++++++++++++--------------
 1 file changed, 27 insertions(+), 14 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8ff12e415cb5..bc84438f0523 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -578,6 +578,12 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
 	spin_unlock(&nvmeq->sq_lock);
 }
 
+enum nvme_use_sgl {
+	SGL_UNSUPPORTED,
+	SGL_SUPPORTED,
+	SGL_FORCED,
+};
+
 static inline bool nvme_pci_metadata_use_sgls(struct nvme_dev *dev,
 					      struct request *req)
 {
@@ -587,23 +593,27 @@ static inline bool nvme_pci_metadata_use_sgls(struct nvme_dev *dev,
 		nvme_req(req)->flags & NVME_REQ_USERCMD;
 }
 
-static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
-				     int nseg)
+static inline enum nvme_use_sgl nvme_pci_use_sgls(struct nvme_dev *dev,
+		struct request *req)
 {
 	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
-	unsigned int avg_seg_size;
 
-	avg_seg_size = DIV_ROUND_UP(blk_rq_payload_bytes(req), nseg);
+	if (nvmeq->qid && nvme_ctrl_sgl_supported(&dev->ctrl)) {
+		if (nvme_req(req)->flags & NVME_REQ_USERCMD)
+			return SGL_FORCED;
+		if (req->nr_integrity_segments > 1)
+			return SGL_FORCED;
+		return SGL_SUPPORTED;
+	}
 
-	if (!nvme_ctrl_sgl_supported(&dev->ctrl))
-		return false;
-	if (!nvmeq->qid)
-		return false;
-	if (nvme_pci_metadata_use_sgls(dev, req))
-		return true;
-	if (!sgl_threshold || avg_seg_size < sgl_threshold)
-		return nvme_req(req)->flags & NVME_REQ_USERCMD;
-	return true;
+	return SGL_UNSUPPORTED;
+}
+
+static unsigned int nvme_pci_avg_seg_size(struct request *req)
+{
+	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+
+	return DIV_ROUND_UP(blk_rq_payload_bytes(req), iod->sgt.nents);
 }
 
 static inline struct dma_pool *nvme_dma_pool(struct nvme_queue *nvmeq,
@@ -851,6 +861,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 {
 	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	enum nvme_use_sgl use_sgl = nvme_pci_use_sgls(dev, req);
 	blk_status_t ret = BLK_STS_RESOURCE;
 	int rc;
 
@@ -888,7 +899,9 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 		goto out_free_sg;
 	}
 
-	if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
+	if (use_sgl == SGL_FORCED ||
+	    (use_sgl == SGL_SUPPORTED &&
+	     (!sgl_threshold || nvme_pci_avg_seg_size(req) >= sgl_threshold)))
 		ret = nvme_pci_setup_sgls(nvmeq, req, &cmnd->rw);
 	else
 		ret = nvme_pci_setup_prps(nvmeq, req, &cmnd->rw);
-- 
2.47.2




* [PATCH 4/8] nvme-pci: merge the simple PRP and SGL setup into a common helper
  2025-06-23 14:12 new DMA API conversion for nvme-pci v2 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2025-06-23 14:12 ` [PATCH 3/8] nvme-pci: refactor nvme_pci_use_sgls Christoph Hellwig
@ 2025-06-23 14:12 ` Christoph Hellwig
  2025-06-23 15:43   ` Keith Busch
  2025-06-23 14:12 ` [PATCH 5/8] nvme-pci: remove superfluous arguments Christoph Hellwig
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2025-06-23 14:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
	Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
	linux-nvme

nvme_setup_prp_simple and nvme_setup_sgl_simple share a lot of logic.
Merge them into a single helper that makes use of the previously added
use_sgl tristate.
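
As a worked example of the PRP1/PRP2 split in the new helper (assuming
NVME_CTRL_PAGE_SIZE == 4096 and a 6 KiB bvec at in-page offset 1024; the
names mirror the code below):

  unsigned int prp1_offset   = 1024;
  unsigned int first_prp_len = 4096 - 1024;        /* 3072 bytes via PRP1 */
  /* 1024 + 6144 <= 2 * 4096, so the two-PRP fast path applies */
  dma_addr_t prp2 = dma_addr + first_prp_len;      /* remaining 3072 bytes */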

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/pci.c | 76 +++++++++++++++++------------------------
 1 file changed, 32 insertions(+), 44 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index bc84438f0523..f7c43eeefb26 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -817,42 +817,41 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_queue *nvmeq,
 	return BLK_STS_OK;
 }
 
-static blk_status_t nvme_setup_prp_simple(struct nvme_dev *dev,
-		struct request *req, struct nvme_rw_command *cmnd,
-		struct bio_vec *bv)
+static blk_status_t nvme_pci_setup_data_simple(struct request *req,
+		enum nvme_use_sgl use_sgl)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-	unsigned int offset = bv->bv_offset & (NVME_CTRL_PAGE_SIZE - 1);
-	unsigned int first_prp_len = NVME_CTRL_PAGE_SIZE - offset;
-
-	iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
-	if (dma_mapping_error(dev->dev, iod->first_dma))
+	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+	struct bio_vec bv = req_bvec(req);
+	unsigned int prp1_offset = bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1);
+	bool prp_possible = prp1_offset + bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2;
+	dma_addr_t dma_addr;
+
+	if (!use_sgl && !prp_possible)
+		return BLK_STS_AGAIN;
+	if (is_pci_p2pdma_page(bv.bv_page))
+		return BLK_STS_AGAIN;
+
+	dma_addr = dma_map_bvec(nvmeq->dev->dev, &bv, rq_dma_dir(req), 0);
+	if (dma_mapping_error(nvmeq->dev->dev, dma_addr))
 		return BLK_STS_RESOURCE;
-	iod->dma_len = bv->bv_len;
-
-	cmnd->dptr.prp1 = cpu_to_le64(iod->first_dma);
-	if (bv->bv_len > first_prp_len)
-		cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma + first_prp_len);
-	else
-		cmnd->dptr.prp2 = 0;
-	return BLK_STS_OK;
-}
+	iod->dma_len = bv.bv_len;
 
-static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
-		struct request *req, struct nvme_rw_command *cmnd,
-		struct bio_vec *bv)
-{
-	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	if (use_sgl == SGL_FORCED || !prp_possible) {
+		iod->cmd.common.flags = NVME_CMD_SGL_METABUF;
+		iod->cmd.common.dptr.sgl.addr = cpu_to_le64(dma_addr);
+		iod->cmd.common.dptr.sgl.length = cpu_to_le32(bv.bv_len);
+		iod->cmd.common.dptr.sgl.type = NVME_SGL_FMT_DATA_DESC << 4;
+	} else {
+		unsigned int first_prp_len = NVME_CTRL_PAGE_SIZE - prp1_offset;
 
-	iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
-	if (dma_mapping_error(dev->dev, iod->first_dma))
-		return BLK_STS_RESOURCE;
-	iod->dma_len = bv->bv_len;
+		iod->cmd.common.dptr.prp1 = cpu_to_le64(dma_addr);
+		iod->cmd.common.dptr.prp2 = 0;
+		if (bv.bv_len > first_prp_len)
+			iod->cmd.common.dptr.prp2 =
+				cpu_to_le64(dma_addr + first_prp_len);
+	}
 
-	cmnd->flags = NVME_CMD_SGL_METABUF;
-	cmnd->dptr.sgl.addr = cpu_to_le64(iod->first_dma);
-	cmnd->dptr.sgl.length = cpu_to_le32(iod->dma_len);
-	cmnd->dptr.sgl.type = NVME_SGL_FMT_DATA_DESC << 4;
 	return BLK_STS_OK;
 }
 
@@ -866,20 +865,9 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 	int rc;
 
 	if (blk_rq_nr_phys_segments(req) == 1) {
-		struct bio_vec bv = req_bvec(req);
-
-		if (!is_pci_p2pdma_page(bv.bv_page)) {
-			if (!nvme_pci_metadata_use_sgls(dev, req) &&
-			    (bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1)) +
-			     bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)
-				return nvme_setup_prp_simple(dev, req,
-							     &cmnd->rw, &bv);
-
-			if (nvmeq->qid && sgl_threshold &&
-			    nvme_ctrl_sgl_supported(&dev->ctrl))
-				return nvme_setup_sgl_simple(dev, req,
-							     &cmnd->rw, &bv);
-		}
+		ret = nvme_pci_setup_data_simple(req, use_sgl);
+		if (ret != BLK_STS_AGAIN)
+			return ret;
 	}
 
 	iod->dma_len = 0;
-- 
2.47.2




* [PATCH 5/8] nvme-pci: remove superfluous arguments
  2025-06-23 14:12 new DMA API conversion for nvme-pci v2 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2025-06-23 14:12 ` [PATCH 4/8] nvme-pci: merge the simple PRP and SGL setup into a common helper Christoph Hellwig
@ 2025-06-23 14:12 ` Christoph Hellwig
  2025-06-23 14:12 ` [PATCH 6/8] nvme-pci: convert the data mapping to blk_rq_dma_map Christoph Hellwig
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2025-06-23 14:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
	Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
	linux-nvme, Leon Romanovsky

The call chain in the prep_rq and completion paths passes around a lot
of nvme_dev, nvme_queue and nvme_command arguments that can be trivially
derived from the passed in struct request.  Remove them.
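
All of the removed arguments can be derived on demand, e.g.:

  struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
  struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
  struct nvme_dev *dev = nvmeq->dev;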

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/nvme/host/pci.c | 105 +++++++++++++++++++---------------------
 1 file changed, 51 insertions(+), 54 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f7c43eeefb26..79114fa034d0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -584,9 +584,11 @@ enum nvme_use_sgl {
 	SGL_FORCED,
 };
 
-static inline bool nvme_pci_metadata_use_sgls(struct nvme_dev *dev,
-					      struct request *req)
+static inline bool nvme_pci_metadata_use_sgls(struct request *req)
 {
+	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+	struct nvme_dev *dev = nvmeq->dev;
+
 	if (!nvme_ctrl_meta_sgl_supported(&dev->ctrl))
 		return false;
 	return req->nr_integrity_segments > 1 ||
@@ -624,8 +626,9 @@ static inline struct dma_pool *nvme_dma_pool(struct nvme_queue *nvmeq,
 	return nvmeq->descriptor_pools.large;
 }
 
-static void nvme_free_descriptors(struct nvme_queue *nvmeq, struct request *req)
+static void nvme_free_descriptors(struct request *req)
 {
+	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 	const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	dma_addr_t dma_addr = iod->first_dma;
@@ -647,22 +650,22 @@ static void nvme_free_descriptors(struct nvme_queue *nvmeq, struct request *req)
 	}
 }
 
-static void nvme_unmap_data(struct nvme_dev *dev, struct nvme_queue *nvmeq,
-			    struct request *req)
+static void nvme_unmap_data(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 
 	if (iod->dma_len) {
-		dma_unmap_page(dev->dev, iod->first_dma, iod->dma_len,
+		dma_unmap_page(nvmeq->dev->dev, iod->first_dma, iod->dma_len,
 			       rq_dma_dir(req));
 		return;
 	}
 
 	WARN_ON_ONCE(!iod->sgt.nents);
 
-	dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
-	nvme_free_descriptors(nvmeq, req);
-	mempool_free(iod->sgt.sgl, dev->iod_mempool);
+	dma_unmap_sgtable(nvmeq->dev->dev, &iod->sgt, rq_dma_dir(req), 0);
+	nvme_free_descriptors(req);
+	mempool_free(iod->sgt.sgl, nvmeq->dev->iod_mempool);
 }
 
 static void nvme_print_sgl(struct scatterlist *sgl, int nents)
@@ -679,10 +682,10 @@ static void nvme_print_sgl(struct scatterlist *sgl, int nents)
 	}
 }
 
-static blk_status_t nvme_pci_setup_prps(struct nvme_queue *nvmeq,
-		struct request *req, struct nvme_rw_command *cmnd)
+static blk_status_t nvme_pci_setup_prps(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 	int length = blk_rq_payload_bytes(req);
 	struct scatterlist *sg = iod->sgt.sgl;
 	int dma_len = sg_dma_len(sg);
@@ -751,11 +754,11 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_queue *nvmeq,
 		dma_len = sg_dma_len(sg);
 	}
 done:
-	cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
-	cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
+	iod->cmd.common.dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
+	iod->cmd.common.dptr.prp2 = cpu_to_le64(iod->first_dma);
 	return BLK_STS_OK;
 free_prps:
-	nvme_free_descriptors(nvmeq, req);
+	nvme_free_descriptors(req);
 	return BLK_STS_RESOURCE;
 bad_sgl:
 	WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
@@ -780,10 +783,10 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc *sge,
 	sge->type = NVME_SGL_FMT_LAST_SEG_DESC << 4;
 }
 
-static blk_status_t nvme_pci_setup_sgls(struct nvme_queue *nvmeq,
-		struct request *req, struct nvme_rw_command *cmd)
+static blk_status_t nvme_pci_setup_sgls(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 	struct nvme_sgl_desc *sg_list;
 	struct scatterlist *sg = iod->sgt.sgl;
 	unsigned int entries = iod->sgt.nents;
@@ -791,10 +794,10 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_queue *nvmeq,
 	int i = 0;
 
 	/* setting the transfer type as SGL */
-	cmd->flags = NVME_CMD_SGL_METABUF;
+	iod->cmd.common.flags = NVME_CMD_SGL_METABUF;
 
 	if (entries == 1) {
-		nvme_pci_sgl_set_data(&cmd->dptr.sgl, sg);
+		nvme_pci_sgl_set_data(&iod->cmd.common.dptr.sgl, sg);
 		return BLK_STS_OK;
 	}
 
@@ -808,7 +811,7 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_queue *nvmeq,
 	iod->descriptors[iod->nr_descriptors++] = sg_list;
 	iod->first_dma = sgl_dma;
 
-	nvme_pci_sgl_set_seg(&cmd->dptr.sgl, sgl_dma, entries);
+	nvme_pci_sgl_set_seg(&iod->cmd.common.dptr.sgl, sgl_dma, entries);
 	do {
 		nvme_pci_sgl_set_data(&sg_list[i++], sg);
 		sg = sg_next(sg);
@@ -855,11 +858,11 @@ static blk_status_t nvme_pci_setup_data_simple(struct request *req,
 	return BLK_STS_OK;
 }
 
-static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
-		struct nvme_command *cmnd)
+static blk_status_t nvme_map_data(struct request *req)
 {
-	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+	struct nvme_dev *dev = nvmeq->dev;
 	enum nvme_use_sgl use_sgl = nvme_pci_use_sgls(dev, req);
 	blk_status_t ret = BLK_STS_RESOURCE;
 	int rc;
@@ -890,9 +893,9 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 	if (use_sgl == SGL_FORCED ||
 	    (use_sgl == SGL_SUPPORTED &&
 	     (!sgl_threshold || nvme_pci_avg_seg_size(req) >= sgl_threshold)))
-		ret = nvme_pci_setup_sgls(nvmeq, req, &cmnd->rw);
+		ret = nvme_pci_setup_sgls(req);
 	else
-		ret = nvme_pci_setup_prps(nvmeq, req, &cmnd->rw);
+		ret = nvme_pci_setup_prps(req);
 	if (ret != BLK_STS_OK)
 		goto out_unmap_sg;
 	return BLK_STS_OK;
@@ -904,12 +907,11 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 	return ret;
 }
 
-static blk_status_t nvme_pci_setup_meta_sgls(struct nvme_dev *dev,
-					     struct request *req)
+static blk_status_t nvme_pci_setup_meta_sgls(struct request *req)
 {
 	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+	struct nvme_dev *dev = nvmeq->dev;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-	struct nvme_rw_command *cmnd = &iod->cmd.rw;
 	struct nvme_sgl_desc *sg_list;
 	struct scatterlist *sgl, *sg;
 	unsigned int entries;
@@ -940,8 +942,8 @@ static blk_status_t nvme_pci_setup_meta_sgls(struct nvme_dev *dev,
 	iod->meta_descriptor = sg_list;
 	iod->meta_dma = sgl_dma;
 
-	cmnd->flags = NVME_CMD_SGL_METASEG;
-	cmnd->metadata = cpu_to_le64(sgl_dma);
+	iod->cmd.common.flags = NVME_CMD_SGL_METASEG;
+	iod->cmd.common.metadata = cpu_to_le64(sgl_dma);
 
 	sgl = iod->meta_sgt.sgl;
 	if (entries == 1) {
@@ -963,31 +965,30 @@ static blk_status_t nvme_pci_setup_meta_sgls(struct nvme_dev *dev,
 	return BLK_STS_RESOURCE;
 }
 
-static blk_status_t nvme_pci_setup_meta_mptr(struct nvme_dev *dev,
-					     struct request *req)
+static blk_status_t nvme_pci_setup_meta_mptr(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 	struct bio_vec bv = rq_integrity_vec(req);
-	struct nvme_command *cmnd = &iod->cmd;
 
-	iod->meta_dma = dma_map_bvec(dev->dev, &bv, rq_dma_dir(req), 0);
-	if (dma_mapping_error(dev->dev, iod->meta_dma))
+	iod->meta_dma = dma_map_bvec(nvmeq->dev->dev, &bv, rq_dma_dir(req), 0);
+	if (dma_mapping_error(nvmeq->dev->dev, iod->meta_dma))
 		return BLK_STS_IOERR;
-	cmnd->rw.metadata = cpu_to_le64(iod->meta_dma);
+	iod->cmd.common.metadata = cpu_to_le64(iod->meta_dma);
 	return BLK_STS_OK;
 }
 
-static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req)
+static blk_status_t nvme_map_metadata(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 
 	if ((iod->cmd.common.flags & NVME_CMD_SGL_METABUF) &&
-	    nvme_pci_metadata_use_sgls(dev, req))
-		return nvme_pci_setup_meta_sgls(dev, req);
-	return nvme_pci_setup_meta_mptr(dev, req);
+	    nvme_pci_metadata_use_sgls(req))
+		return nvme_pci_setup_meta_sgls(req);
+	return nvme_pci_setup_meta_mptr(req);
 }
 
-static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
+static blk_status_t nvme_prep_rq(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	blk_status_t ret;
@@ -1002,13 +1003,13 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 		return ret;
 
 	if (blk_rq_nr_phys_segments(req)) {
-		ret = nvme_map_data(dev, req, &iod->cmd);
+		ret = nvme_map_data(req);
 		if (ret)
 			goto out_free_cmd;
 	}
 
 	if (blk_integrity_rq(req)) {
-		ret = nvme_map_metadata(dev, req);
+		ret = nvme_map_metadata(req);
 		if (ret)
 			goto out_unmap_data;
 	}
@@ -1017,7 +1018,7 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 	return BLK_STS_OK;
 out_unmap_data:
 	if (blk_rq_nr_phys_segments(req))
-		nvme_unmap_data(dev, req->mq_hctx->driver_data, req);
+		nvme_unmap_data(req);
 out_free_cmd:
 	nvme_cleanup_cmd(req);
 	return ret;
@@ -1042,7 +1043,7 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (unlikely(!nvme_check_ready(&dev->ctrl, req, true)))
 		return nvme_fail_nonready_command(&dev->ctrl, req);
 
-	ret = nvme_prep_rq(dev, req);
+	ret = nvme_prep_rq(req);
 	if (unlikely(ret))
 		return ret;
 	spin_lock(&nvmeq->sq_lock);
@@ -1080,7 +1081,7 @@ static bool nvme_prep_rq_batch(struct nvme_queue *nvmeq, struct request *req)
 	if (unlikely(!nvme_check_ready(&nvmeq->dev->ctrl, req, true)))
 		return false;
 
-	return nvme_prep_rq(nvmeq->dev, req) == BLK_STS_OK;
+	return nvme_prep_rq(req) == BLK_STS_OK;
 }
 
 static void nvme_queue_rqs(struct rq_list *rqlist)
@@ -1106,11 +1107,11 @@ static void nvme_queue_rqs(struct rq_list *rqlist)
 	*rqlist = requeue_list;
 }
 
-static __always_inline void nvme_unmap_metadata(struct nvme_dev *dev,
-						struct nvme_queue *nvmeq,
-						struct request *req)
+static __always_inline void nvme_unmap_metadata(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+	struct nvme_dev *dev = nvmeq->dev;
 
 	if (!iod->meta_sgt.nents) {
 		dma_unmap_page(dev->dev, iod->meta_dma,
@@ -1127,14 +1128,10 @@ static __always_inline void nvme_unmap_metadata(struct nvme_dev *dev,
 
 static __always_inline void nvme_pci_unmap_rq(struct request *req)
 {
-	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
-	struct nvme_dev *dev = nvmeq->dev;
-
 	if (blk_integrity_rq(req))
-		nvme_unmap_metadata(dev, nvmeq, req);
-
+		nvme_unmap_metadata(req);
 	if (blk_rq_nr_phys_segments(req))
-		nvme_unmap_data(dev, nvmeq, req);
+		nvme_unmap_data(req);
 }
 
 static void nvme_pci_complete_rq(struct request *req)
-- 
2.47.2




* [PATCH 6/8] nvme-pci: convert the data mapping to blk_rq_dma_map
  2025-06-23 14:12 new DMA API conversion for nvme-pci v2 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2025-06-23 14:12 ` [PATCH 5/8] nvme-pci: remove superfluous arguments Christoph Hellwig
@ 2025-06-23 14:12 ` Christoph Hellwig
  2025-06-23 15:44   ` Keith Busch
  2025-06-23 14:12 ` [PATCH 7/8] nvme-pci: replace NVME_MAX_KB_SZ with NVME_MAX_BYTE Christoph Hellwig
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2025-06-23 14:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
	Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
	linux-nvme, Leon Romanovsky

Use the blk_rq_dma_map API to DMA map requests instead of scatterlists.
This removes the need to allocate a scatterlist covering every segment,
and thus the overall transfer length limit based on the scatterlist
allocation.

Instead the DMA mapping is done by iterating the bio_vec chain in the
request directly.  The unmap is handled differently depending on how
we mapped:

 - when using an IOMMU only a single IOVA is used, and it is stored in
   iova_state
 - for direct mappings that don't use swiotlb and are cache coherent,
   unmap is not needed at all
 - for direct mappings that are not cache coherent or use swiotlb, the
   physical addresses are rebuilt from the PRPs or SGL segments

The latter unfortunately adds a fair amount of code to the driver, but
it is code not used in the fast path.
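
In code, those three cases collapse into the blk_rq_dma_unmap() check from
the earlier patch; a sketch of the slow-path part (the full version is in
nvme_unmap_data() below, descriptor pages are freed separately):

  if (!blk_rq_dma_unmap(req, dma_dev, &iod->dma_state, iod->total_len)) {
          /* non-coherent or swiotlb direct mapping: recover the addresses
           * from the PRPs / SGL descriptors written into the SQE */
          if (nvme_pci_cmd_use_sgl(&iod->cmd))
                  nvme_free_sgls(req);
          else
                  nvme_free_prps(req);
  }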

The conversion only covers the data mapping path, and still uses a
scatterlist for the multi-segment metadata case.  I plan to convert that
as soon as we have good test coverage for the multi-segment metadata
path.

Thanks to Chaitanya Kulkarni for an initial attempt at a new DMA API
conversion for nvme-pci, Kanchan Joshi for bringing back the single
segment optimization, Leon Romanovsky for shepherding this through a
gazillion rebases and Nitesh Shetty for various improvements.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/nvme/host/pci.c | 387 +++++++++++++++++++++++++---------------
 1 file changed, 241 insertions(+), 146 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 79114fa034d0..040ed906c580 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -7,7 +7,7 @@
 #include <linux/acpi.h>
 #include <linux/async.h>
 #include <linux/blkdev.h>
-#include <linux/blk-mq.h>
+#include <linux/blk-mq-dma.h>
 #include <linux/blk-integrity.h>
 #include <linux/dmi.h>
 #include <linux/init.h>
@@ -27,7 +27,6 @@
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <linux/io-64-nonatomic-hi-lo.h>
 #include <linux/sed-opal.h>
-#include <linux/pci-p2pdma.h>
 
 #include "trace.h"
 #include "nvme.h"
@@ -46,13 +45,11 @@
 #define NVME_MAX_NR_DESCRIPTORS	5
 
 /*
- * For data SGLs we support a single descriptors worth of SGL entries, but for
- * now we also limit it to avoid an allocation larger than PAGE_SIZE for the
- * scatterlist.
+ * For data SGLs we support a single descriptor's worth of SGL entries.
+ * For PRPs, segments don't matter at all.
  */
 #define NVME_MAX_SEGS \
-	min(NVME_CTRL_PAGE_SIZE / sizeof(struct nvme_sgl_desc), \
-	    (PAGE_SIZE / sizeof(struct scatterlist)))
+	(NVME_CTRL_PAGE_SIZE / sizeof(struct nvme_sgl_desc))
 
 /*
  * For metadata SGLs, only the small descriptor is supported, and the first
@@ -162,7 +159,6 @@ struct nvme_dev {
 	bool hmb;
 	struct sg_table *hmb_sgt;
 
-	mempool_t *iod_mempool;
 	mempool_t *iod_meta_mempool;
 
 	/* shadow doorbell buffer support: */
@@ -246,7 +242,10 @@ enum nvme_iod_flags {
 	IOD_ABORTED		= 1U << 0,
 
 	/* uses the small descriptor pool */
-	IOD_SMALL_DESCRIPTOR		= 1U << 1,
+	IOD_SMALL_DESCRIPTOR	= 1U << 1,
+
+	/* single segment dma mapping */
+	IOD_SINGLE_SEGMENT	= 1U << 2,
 };
 
 /*
@@ -257,13 +256,14 @@ struct nvme_iod {
 	struct nvme_command cmd;
 	u8 flags;
 	u8 nr_descriptors;
-	unsigned int dma_len;	/* length of single DMA segment mapping */
-	dma_addr_t first_dma;
+
+	unsigned int total_len;
+	struct dma_iova_state dma_state;
+	void *descriptors[NVME_MAX_NR_DESCRIPTORS];
+
 	dma_addr_t meta_dma;
-	struct sg_table sgt;
 	struct sg_table meta_sgt;
 	struct nvme_sgl_desc *meta_descriptor;
-	void *descriptors[NVME_MAX_NR_DESCRIPTORS];
 };
 
 static inline unsigned int nvme_dbbuf_size(struct nvme_dev *dev)
@@ -614,8 +614,13 @@ static inline enum nvme_use_sgl nvme_pci_use_sgls(struct nvme_dev *dev,
 static unsigned int nvme_pci_avg_seg_size(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	unsigned int nseg;
 
-	return DIV_ROUND_UP(blk_rq_payload_bytes(req), iod->sgt.nents);
+	if (blk_rq_dma_map_coalesce(&iod->dma_state))
+		nseg = 1;
+	else
+		nseg = blk_rq_nr_phys_segments(req);
+	return DIV_ROUND_UP(blk_rq_payload_bytes(req), nseg);
 }
 
 static inline struct dma_pool *nvme_dma_pool(struct nvme_queue *nvmeq,
@@ -626,12 +631,25 @@ static inline struct dma_pool *nvme_dma_pool(struct nvme_queue *nvmeq,
 	return nvmeq->descriptor_pools.large;
 }
 
+static inline bool nvme_pci_cmd_use_sgl(struct nvme_command *cmd)
+{
+	return cmd->common.flags &
+		(NVME_CMD_SGL_METABUF | NVME_CMD_SGL_METASEG);
+}
+
+static inline dma_addr_t nvme_pci_first_desc_dma_addr(struct nvme_command *cmd)
+{
+	if (nvme_pci_cmd_use_sgl(cmd))
+		return le64_to_cpu(cmd->common.dptr.sgl.addr);
+	return le64_to_cpu(cmd->common.dptr.prp2);
+}
+
 static void nvme_free_descriptors(struct request *req)
 {
 	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 	const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-	dma_addr_t dma_addr = iod->first_dma;
+	dma_addr_t dma_addr = nvme_pci_first_desc_dma_addr(&iod->cmd);
 	int i;
 
 	if (iod->nr_descriptors == 1) {
@@ -650,68 +668,138 @@ static void nvme_free_descriptors(struct request *req)
 	}
 }
 
-static void nvme_unmap_data(struct request *req)
+static void nvme_free_prps(struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+	struct device *dma_dev = nvmeq->dev->dev;
+	enum dma_data_direction dir = rq_dma_dir(req);
+	int length = iod->total_len;
+	dma_addr_t dma_addr;
+	int i, desc;
+	__le64 *prp_list;
+	u32 dma_len;
+
+	dma_addr = le64_to_cpu(iod->cmd.common.dptr.prp1);
+	dma_len = min_t(u32, length,
+		NVME_CTRL_PAGE_SIZE - (dma_addr & (NVME_CTRL_PAGE_SIZE - 1)));
+	length -= dma_len;
+	if (!length) {
+		dma_unmap_page(dma_dev, dma_addr, dma_len, dir);
+		return;
+	}
 
-	if (iod->dma_len) {
-		dma_unmap_page(nvmeq->dev->dev, iod->first_dma, iod->dma_len,
-			       rq_dma_dir(req));
+	if (length <= NVME_CTRL_PAGE_SIZE) {
+		dma_unmap_page(dma_dev, dma_addr, dma_len, dir);
+		dma_addr = le64_to_cpu(iod->cmd.common.dptr.prp2);
+		dma_unmap_page(dma_dev, dma_addr, length, dir);
 		return;
 	}
 
-	WARN_ON_ONCE(!iod->sgt.nents);
+	i = 0;
+	desc = 0;
+	prp_list = iod->descriptors[desc];
+	do {
+		dma_unmap_page(dma_dev, dma_addr, dma_len, dir);
+		if (i == NVME_CTRL_PAGE_SIZE >> 3) {
+			prp_list = iod->descriptors[++desc];
+			i = 0;
+		}
 
-	dma_unmap_sgtable(nvmeq->dev->dev, &iod->sgt, rq_dma_dir(req), 0);
-	nvme_free_descriptors(req);
-	mempool_free(iod->sgt.sgl, nvmeq->dev->iod_mempool);
+		dma_addr = le64_to_cpu(prp_list[i++]);
+		dma_len = min(length, NVME_CTRL_PAGE_SIZE);
+		length -= dma_len;
+	} while (length);
 }
 
-static void nvme_print_sgl(struct scatterlist *sgl, int nents)
+static void nvme_free_sgls(struct request *req)
 {
-	int i;
-	struct scatterlist *sg;
+	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+	struct device *dma_dev = nvmeq->dev->dev;
+	dma_addr_t sqe_dma_addr = le64_to_cpu(iod->cmd.common.dptr.sgl.addr);
+	unsigned int sqe_dma_len = le32_to_cpu(iod->cmd.common.dptr.sgl.length);
+	struct nvme_sgl_desc *sg_list = iod->descriptors[0];
+	enum dma_data_direction dir = rq_dma_dir(req);
+
+	if (iod->nr_descriptors) {
+		unsigned int nr_entries = sqe_dma_len / sizeof(*sg_list), i;
+
+		for (i = 0; i < nr_entries; i++)
+			dma_unmap_page(dma_dev, le64_to_cpu(sg_list[i].addr),
+				le32_to_cpu(sg_list[i].length), dir);
+	} else {
+		dma_unmap_page(dma_dev, sqe_dma_addr, sqe_dma_len, dir);
+	}
+}
+
+static void nvme_unmap_data(struct request *req)
+{
+	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+	struct device *dma_dev = nvmeq->dev->dev;
+
+	if (iod->flags & IOD_SINGLE_SEGMENT) {
+		static_assert(offsetof(union nvme_data_ptr, prp1) ==
+				offsetof(union nvme_data_ptr, sgl.addr));
+		dma_unmap_page(dma_dev, le64_to_cpu(iod->cmd.common.dptr.prp1),
+				iod->total_len, rq_dma_dir(req));
+		return;
+	}
 
-	for_each_sg(sgl, sg, nents, i) {
-		dma_addr_t phys = sg_phys(sg);
-		pr_warn("sg[%d] phys_addr:%pad offset:%d length:%d "
-			"dma_address:%pad dma_length:%d\n",
-			i, &phys, sg->offset, sg->length, &sg_dma_address(sg),
-			sg_dma_len(sg));
+	if (!blk_rq_dma_unmap(req, dma_dev, &iod->dma_state, iod->total_len)) {
+		if (nvme_pci_cmd_use_sgl(&iod->cmd))
+			nvme_free_sgls(req);
+		else
+			nvme_free_prps(req);
 	}
+
+	if (iod->nr_descriptors)
+		nvme_free_descriptors(req);
 }
 
-static blk_status_t nvme_pci_setup_prps(struct request *req)
+static blk_status_t nvme_pci_setup_data_prp(struct request *req,
+		struct blk_dma_iter *iter)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
-	int length = blk_rq_payload_bytes(req);
-	struct scatterlist *sg = iod->sgt.sgl;
-	int dma_len = sg_dma_len(sg);
-	u64 dma_addr = sg_dma_address(sg);
-	int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
+	unsigned int length = blk_rq_payload_bytes(req);
+	dma_addr_t prp1_dma, prp2_dma = 0;
+	unsigned int prp_len, i;
 	__le64 *prp_list;
-	dma_addr_t prp_dma;
-	int i;
 
-	length -= (NVME_CTRL_PAGE_SIZE - offset);
-	if (length <= 0) {
-		iod->first_dma = 0;
+	/*
+	 * PRP1 always points to the start of the DMA transfers.
+	 *
+	 * This is the only PRP (except for the list entries) that could be
+	 * non-aligned.
+	 */
+	prp1_dma = iter->addr;
+	prp_len = min(length, NVME_CTRL_PAGE_SIZE -
+			(iter->addr & (NVME_CTRL_PAGE_SIZE - 1)));
+	iod->total_len += prp_len;
+	iter->addr += prp_len;
+	iter->len -= prp_len;
+	length -= prp_len;
+	if (!length)
 		goto done;
-	}
 
-	dma_len -= (NVME_CTRL_PAGE_SIZE - offset);
-	if (dma_len) {
-		dma_addr += (NVME_CTRL_PAGE_SIZE - offset);
-	} else {
-		sg = sg_next(sg);
-		dma_addr = sg_dma_address(sg);
-		dma_len = sg_dma_len(sg);
+	if (!iter->len) {
+		if (!blk_rq_dma_map_iter_next(req, nvmeq->dev->dev,
+				&iod->dma_state, iter)) {
+			if (WARN_ON_ONCE(!iter->status))
+				goto bad_sgl;
+			goto done;
+		}
 	}
 
+	/*
+	 * PRP2 is usually a list, but can point to data if all data to be
+	 * transferred fits into PRP1 + PRP2:
+	 */
 	if (length <= NVME_CTRL_PAGE_SIZE) {
-		iod->first_dma = dma_addr;
+		prp2_dma = iter->addr;
+		iod->total_len += length;
 		goto done;
 	}
 
@@ -720,58 +808,83 @@ static blk_status_t nvme_pci_setup_prps(struct request *req)
 		iod->flags |= IOD_SMALL_DESCRIPTOR;
 
 	prp_list = dma_pool_alloc(nvme_dma_pool(nvmeq, iod), GFP_ATOMIC,
-			&prp_dma);
-	if (!prp_list)
-		return BLK_STS_RESOURCE;
+			&prp2_dma);
+	if (!prp_list) {
+		iter->status = BLK_STS_RESOURCE;
+		goto done;
+	}
 	iod->descriptors[iod->nr_descriptors++] = prp_list;
-	iod->first_dma = prp_dma;
+
 	i = 0;
 	for (;;) {
+		prp_list[i++] = cpu_to_le64(iter->addr);
+		prp_len = min(length, NVME_CTRL_PAGE_SIZE);
+		if (WARN_ON_ONCE(iter->len < prp_len))
+			goto bad_sgl;
+
+		iod->total_len += prp_len;
+		iter->addr += prp_len;
+		iter->len -= prp_len;
+		length -= prp_len;
+		if (!length)
+			break;
+
+		if (iter->len == 0) {
+			if (!blk_rq_dma_map_iter_next(req, nvmeq->dev->dev,
+					&iod->dma_state, iter)) {
+				if (WARN_ON_ONCE(!iter->status))
+					goto bad_sgl;
+				goto done;
+			}
+		}
+
+		/*
+		 * If we've filled the entire descriptor, allocate a new one that is
+		 * pointed to by the last entry in the previous PRP list.  To
+		 * accommodate that, move the last actual entry to the new
+		 * descriptor.
+		 */
 		if (i == NVME_CTRL_PAGE_SIZE >> 3) {
 			__le64 *old_prp_list = prp_list;
+			dma_addr_t prp_list_dma;
 
 			prp_list = dma_pool_alloc(nvmeq->descriptor_pools.large,
-					GFP_ATOMIC, &prp_dma);
-			if (!prp_list)
-				goto free_prps;
+					GFP_ATOMIC, &prp_list_dma);
+			if (!prp_list) {
+				iter->status = BLK_STS_RESOURCE;
+				goto done;
+			}
 			iod->descriptors[iod->nr_descriptors++] = prp_list;
+
 			prp_list[0] = old_prp_list[i - 1];
-			old_prp_list[i - 1] = cpu_to_le64(prp_dma);
+			old_prp_list[i - 1] = cpu_to_le64(prp_list_dma);
 			i = 1;
 		}
-		prp_list[i++] = cpu_to_le64(dma_addr);
-		dma_len -= NVME_CTRL_PAGE_SIZE;
-		dma_addr += NVME_CTRL_PAGE_SIZE;
-		length -= NVME_CTRL_PAGE_SIZE;
-		if (length <= 0)
-			break;
-		if (dma_len > 0)
-			continue;
-		if (unlikely(dma_len < 0))
-			goto bad_sgl;
-		sg = sg_next(sg);
-		dma_addr = sg_dma_address(sg);
-		dma_len = sg_dma_len(sg);
 	}
+
 done:
-	iod->cmd.common.dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
-	iod->cmd.common.dptr.prp2 = cpu_to_le64(iod->first_dma);
-	return BLK_STS_OK;
-free_prps:
-	nvme_free_descriptors(req);
-	return BLK_STS_RESOURCE;
+	/*
+	 * nvme_unmap_data uses the DPTR field in the SQE to tear down the
+	 * mapping, so initialize it even for failures.
+	 */
+	iod->cmd.common.dptr.prp1 = cpu_to_le64(prp1_dma);
+	iod->cmd.common.dptr.prp2 = cpu_to_le64(prp2_dma);
+	if (unlikely(iter->status))
+		nvme_unmap_data(req);
+	return iter->status;
+
 bad_sgl:
-	WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
-			"Invalid SGL for payload:%d nents:%d\n",
-			blk_rq_payload_bytes(req), iod->sgt.nents);
+	dev_err_once(nvmeq->dev->dev,
+		"Incorrectly formed request for payload:%d nents:%d\n",
+		blk_rq_payload_bytes(req), blk_rq_nr_phys_segments(req));
 	return BLK_STS_IOERR;
 }
 
 static void nvme_pci_sgl_set_data(struct nvme_sgl_desc *sge,
-		struct scatterlist *sg)
+		struct blk_dma_iter *iter)
 {
-	sge->addr = cpu_to_le64(sg_dma_address(sg));
-	sge->length = cpu_to_le32(sg_dma_len(sg));
+	sge->addr = cpu_to_le64(iter->addr);
+	sge->length = cpu_to_le32(iter->len);
 	sge->type = NVME_SGL_FMT_DATA_DESC << 4;
 }
 
@@ -783,21 +896,22 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc *sge,
 	sge->type = NVME_SGL_FMT_LAST_SEG_DESC << 4;
 }
 
-static blk_status_t nvme_pci_setup_sgls(struct request *req)
+static blk_status_t nvme_pci_setup_data_sgl(struct request *req,
+		struct blk_dma_iter *iter)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
+	unsigned int entries = blk_rq_nr_phys_segments(req);
 	struct nvme_sgl_desc *sg_list;
-	struct scatterlist *sg = iod->sgt.sgl;
-	unsigned int entries = iod->sgt.nents;
 	dma_addr_t sgl_dma;
-	int i = 0;
+	unsigned int mapped = 0;
 
-	/* setting the transfer type as SGL */
+	/* set the transfer type as SGL */
 	iod->cmd.common.flags = NVME_CMD_SGL_METABUF;
 
-	if (entries == 1) {
-		nvme_pci_sgl_set_data(&iod->cmd.common.dptr.sgl, sg);
+	if (entries == 1 || blk_rq_dma_map_coalesce(&iod->dma_state)) {
+		nvme_pci_sgl_set_data(&iod->cmd.common.dptr.sgl, iter);
+		iod->total_len += iter->len;
 		return BLK_STS_OK;
 	}
 
@@ -809,15 +923,21 @@ static blk_status_t nvme_pci_setup_sgls(struct request *req)
 	if (!sg_list)
 		return BLK_STS_RESOURCE;
 	iod->descriptors[iod->nr_descriptors++] = sg_list;
-	iod->first_dma = sgl_dma;
 
-	nvme_pci_sgl_set_seg(&iod->cmd.common.dptr.sgl, sgl_dma, entries);
 	do {
-		nvme_pci_sgl_set_data(&sg_list[i++], sg);
-		sg = sg_next(sg);
-	} while (--entries > 0);
+		if (WARN_ON_ONCE(mapped == entries)) {
+			iter->status = BLK_STS_IOERR;
+			break;
+		}
+		nvme_pci_sgl_set_data(&sg_list[mapped++], iter);
+		iod->total_len += iter->len;
+	} while (blk_rq_dma_map_iter_next(req, nvmeq->dev->dev, &iod->dma_state,
+				iter));
 
-	return BLK_STS_OK;
+	nvme_pci_sgl_set_seg(&iod->cmd.common.dptr.sgl, sgl_dma, mapped);
+	if (unlikely(iter->status))
+		nvme_free_sgls(req);
+	return iter->status;
 }
 
 static blk_status_t nvme_pci_setup_data_simple(struct request *req,
@@ -838,7 +958,8 @@ static blk_status_t nvme_pci_setup_data_simple(struct request *req,
 	dma_addr = dma_map_bvec(nvmeq->dev->dev, &bv, rq_dma_dir(req), 0);
 	if (dma_mapping_error(nvmeq->dev->dev, dma_addr))
 		return BLK_STS_RESOURCE;
-	iod->dma_len = bv.bv_len;
+	iod->total_len = bv.bv_len;
+	iod->flags |= IOD_SINGLE_SEGMENT;
 
 	if (use_sgl == SGL_FORCED || !prp_possible) {
 		iod->cmd.common.flags = NVME_CMD_SGL_METABUF;
@@ -864,47 +985,35 @@ static blk_status_t nvme_map_data(struct request *req)
 	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 	struct nvme_dev *dev = nvmeq->dev;
 	enum nvme_use_sgl use_sgl = nvme_pci_use_sgls(dev, req);
-	blk_status_t ret = BLK_STS_RESOURCE;
-	int rc;
+	struct blk_dma_iter iter;
+	blk_status_t ret;
 
+	/*
+	 * Try to skip the DMA iterator for single segment requests, as that
+	 * significantly improves performance for small I/O sizes.
+	 */
 	if (blk_rq_nr_phys_segments(req) == 1) {
 		ret = nvme_pci_setup_data_simple(req, use_sgl);
 		if (ret != BLK_STS_AGAIN)
 			return ret;
 	}
 
-	iod->dma_len = 0;
-	iod->sgt.sgl = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
-	if (!iod->sgt.sgl)
-		return BLK_STS_RESOURCE;
-	sg_init_table(iod->sgt.sgl, blk_rq_nr_phys_segments(req));
-	iod->sgt.orig_nents = blk_rq_map_sg(req, iod->sgt.sgl);
-	if (!iod->sgt.orig_nents)
-		goto out_free_sg;
-
-	rc = dma_map_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req),
-			     DMA_ATTR_NO_WARN);
-	if (rc) {
-		if (rc == -EREMOTEIO)
-			ret = BLK_STS_TARGET;
-		goto out_free_sg;
-	}
+	if (!blk_rq_dma_map_iter_start(req, dev->dev, &iod->dma_state, &iter))
+		return iter.status;
 
 	if (use_sgl == SGL_FORCED ||
 	    (use_sgl == SGL_SUPPORTED &&
 	     (!sgl_threshold || nvme_pci_avg_seg_size(req) >= sgl_threshold)))
-		ret = nvme_pci_setup_sgls(req);
-	else
-		ret = nvme_pci_setup_prps(req);
-	if (ret != BLK_STS_OK)
-		goto out_unmap_sg;
-	return BLK_STS_OK;
+		return nvme_pci_setup_data_sgl(req, &iter);
+	return nvme_pci_setup_data_prp(req, &iter);
+}
 
-out_unmap_sg:
-	dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
-out_free_sg:
-	mempool_free(iod->sgt.sgl, dev->iod_mempool);
-	return ret;
+static void nvme_pci_sgl_set_data_sg(struct nvme_sgl_desc *sge,
+		struct scatterlist *sg)
+{
+	sge->addr = cpu_to_le64(sg_dma_address(sg));
+	sge->length = cpu_to_le32(sg_dma_len(sg));
+	sge->type = NVME_SGL_FMT_DATA_DESC << 4;
 }
 
 static blk_status_t nvme_pci_setup_meta_sgls(struct request *req)
@@ -947,14 +1056,14 @@ static blk_status_t nvme_pci_setup_meta_sgls(struct request *req)
 
 	sgl = iod->meta_sgt.sgl;
 	if (entries == 1) {
-		nvme_pci_sgl_set_data(sg_list, sgl);
+		nvme_pci_sgl_set_data_sg(sg_list, sgl);
 		return BLK_STS_OK;
 	}
 
 	sgl_dma += sizeof(*sg_list);
 	nvme_pci_sgl_set_seg(sg_list, sgl_dma, entries);
 	for_each_sg(sgl, sg, entries, i)
-		nvme_pci_sgl_set_data(&sg_list[i + 1], sg);
+		nvme_pci_sgl_set_data_sg(&sg_list[i + 1], sg);
 
 	return BLK_STS_OK;
 
@@ -995,7 +1104,7 @@ static blk_status_t nvme_prep_rq(struct request *req)
 
 	iod->flags = 0;
 	iod->nr_descriptors = 0;
-	iod->sgt.nents = 0;
+	iod->total_len = 0;
 	iod->meta_sgt.nents = 0;
 
 	ret = nvme_setup_cmd(req->q->queuedata, req);
@@ -2913,26 +3022,14 @@ static int nvme_disable_prepare_reset(struct nvme_dev *dev, bool shutdown)
 static int nvme_pci_alloc_iod_mempool(struct nvme_dev *dev)
 {
 	size_t meta_size = sizeof(struct scatterlist) * (NVME_MAX_META_SEGS + 1);
-	size_t alloc_size = sizeof(struct scatterlist) * NVME_MAX_SEGS;
-
-	dev->iod_mempool = mempool_create_node(1,
-			mempool_kmalloc, mempool_kfree,
-			(void *)alloc_size, GFP_KERNEL,
-			dev_to_node(dev->dev));
-	if (!dev->iod_mempool)
-		return -ENOMEM;
 
 	dev->iod_meta_mempool = mempool_create_node(1,
 			mempool_kmalloc, mempool_kfree,
 			(void *)meta_size, GFP_KERNEL,
 			dev_to_node(dev->dev));
 	if (!dev->iod_meta_mempool)
-		goto free;
-
+		return -ENOMEM;
 	return 0;
-free:
-	mempool_destroy(dev->iod_mempool);
-	return -ENOMEM;
 }
 
 static void nvme_free_tagset(struct nvme_dev *dev)
@@ -3376,7 +3473,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	nvme_dbbuf_dma_free(dev);
 	nvme_free_queues(dev, 0);
 out_release_iod_mempool:
-	mempool_destroy(dev->iod_mempool);
 	mempool_destroy(dev->iod_meta_mempool);
 out_dev_unmap:
 	nvme_dev_unmap(dev);
@@ -3440,7 +3536,6 @@ static void nvme_remove(struct pci_dev *pdev)
 	nvme_dev_remove_admin(dev);
 	nvme_dbbuf_dma_free(dev);
 	nvme_free_queues(dev, 0);
-	mempool_destroy(dev->iod_mempool);
 	mempool_destroy(dev->iod_meta_mempool);
 	nvme_release_descriptor_pools(dev);
 	nvme_dev_unmap(dev);
-- 
2.47.2




* [PATCH 7/8] nvme-pci: replace NVME_MAX_KB_SZ with NVME_MAX_BYTE
  2025-06-23 14:12 new DMA API conversion for nvme-pci v2 Christoph Hellwig
                   ` (5 preceding siblings ...)
  2025-06-23 14:12 ` [PATCH 6/8] nvme-pci: convert the data mapping to blk_rq_dma_map Christoph Hellwig
@ 2025-06-23 14:12 ` Christoph Hellwig
  2025-06-23 15:44   ` Keith Busch
  2025-06-23 14:12 ` [PATCH 8/8] nvme-pci: rework the build time assert for NVME_MAX_NR_DESCRIPTORS Christoph Hellwig
  2025-06-23 21:45 ` new DMA API conversion for nvme-pci v2 Chaitanya Kulkarni
  8 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2025-06-23 14:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
	Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
	linux-nvme, Daniel Gomez, Leon Romanovsky

Having a define in kiB units is a bit weird.  Also update the
comment now that there is no scatterlist limit.
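
For reference, the limit itself is unchanged: NVME_MAX_BYTES >> SECTOR_SHIFT
is SZ_8M / 512 = 16384 sectors, the same value the old NVME_MAX_KB_SZ << 1
produced.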

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/nvme/host/pci.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 040ed906c580..4bcd1fdf9a71 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -38,10 +38,9 @@
 #define NVME_SMALL_POOL_SIZE	256
 
 /*
- * These can be higher, but we need to ensure that any command doesn't
- * require an sg allocation that needs more than a page of data.
+ * Arbitrary upper bound.
  */
-#define NVME_MAX_KB_SZ	8192
+#define NVME_MAX_BYTES		SZ_8M
 #define NVME_MAX_NR_DESCRIPTORS	5
 
 /*
@@ -413,7 +412,7 @@ static bool nvme_dbbuf_update_and_check_event(u16 value, __le32 *dbbuf_db,
  */
 static __always_inline int nvme_pci_npages_prp(void)
 {
-	unsigned max_bytes = (NVME_MAX_KB_SZ * 1024) + NVME_CTRL_PAGE_SIZE;
+	unsigned max_bytes = NVME_MAX_BYTES + NVME_CTRL_PAGE_SIZE;
 	unsigned nprps = DIV_ROUND_UP(max_bytes, NVME_CTRL_PAGE_SIZE);
 	return DIV_ROUND_UP(8 * nprps, NVME_CTRL_PAGE_SIZE - 8);
 }
@@ -3367,7 +3366,8 @@ static struct nvme_dev *nvme_pci_alloc_dev(struct pci_dev *pdev,
 	 * over a single page.
 	 */
 	dev->ctrl.max_hw_sectors = min_t(u32,
-		NVME_MAX_KB_SZ << 1, dma_opt_mapping_size(&pdev->dev) >> 9);
+			NVME_MAX_BYTES >> SECTOR_SHIFT,
+			dma_opt_mapping_size(&pdev->dev) >> 9);
 	dev->ctrl.max_segments = NVME_MAX_SEGS;
 	dev->ctrl.max_integrity_segments = 1;
 	return dev;
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 8/8] nvme-pci: rework the build time assert for NVME_MAX_NR_DESCRIPTORS
  2025-06-23 14:12 new DMA API conversion for nvme-pci v2 Christoph Hellwig
                   ` (6 preceding siblings ...)
  2025-06-23 14:12 ` [PATCH 7/8] nvme-pci: replace NVME_MAX_KB_SZ with NVME_MAX_BYTE Christoph Hellwig
@ 2025-06-23 14:12 ` Christoph Hellwig
  2025-06-23 15:44   ` Keith Busch
  2025-06-23 21:45 ` new DMA API conversion for nvme-pci v2 Chaitanya Kulkarni
  8 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2025-06-23 14:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
	Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
	linux-nvme, Daniel Gomez, Leon Romanovsky

The current use of an always_inline helper is a bit convoluted.
Instead use macros that represent the arithmetic used for building
up the PRP chain.
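
Worked through with the usual 4 KiB NVME_CTRL_PAGE_SIZE (stated here only
for illustration): PRPS_PER_PAGE is 4096 / 8 - 1 = 511, MAX_PRP_RANGE is
SZ_8M + 2 * 4095 = 8396798 bytes, so MAX_PRP_RANGE / NVME_CTRL_PAGE_SIZE is
2049, which the assert compares against 1 /* prp1 */ + 5 * 511 = 2556 and
which holds with room to spare.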

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/nvme/host/pci.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 4bcd1fdf9a71..d9d115b14a88 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -57,6 +57,21 @@
 #define NVME_MAX_META_SEGS \
 	((NVME_SMALL_POOL_SIZE / sizeof(struct nvme_sgl_desc)) - 1)
 
+/*
+ * The last entry is used to link to the next descriptor.
+ */
+#define PRPS_PER_PAGE \
+	(((NVME_CTRL_PAGE_SIZE / sizeof(__le64))) - 1)
+
+/*
+ * I/O could be non-aligned both at the beginning and end.
+ */
+#define MAX_PRP_RANGE \
+	(NVME_MAX_BYTES + 2 * (NVME_CTRL_PAGE_SIZE - 1))
+
+static_assert(MAX_PRP_RANGE / NVME_CTRL_PAGE_SIZE <=
+	(1 /* prp1 */ + NVME_MAX_NR_DESCRIPTORS * PRPS_PER_PAGE));
+
 static int use_threaded_interrupts;
 module_param(use_threaded_interrupts, int, 0444);
 
@@ -405,18 +420,6 @@ static bool nvme_dbbuf_update_and_check_event(u16 value, __le32 *dbbuf_db,
 	return true;
 }
 
-/*
- * Will slightly overestimate the number of pages needed.  This is OK
- * as it only leads to a small amount of wasted memory for the lifetime of
- * the I/O.
- */
-static __always_inline int nvme_pci_npages_prp(void)
-{
-	unsigned max_bytes = NVME_MAX_BYTES + NVME_CTRL_PAGE_SIZE;
-	unsigned nprps = DIV_ROUND_UP(max_bytes, NVME_CTRL_PAGE_SIZE);
-	return DIV_ROUND_UP(8 * nprps, NVME_CTRL_PAGE_SIZE - 8);
-}
-
 static struct nvme_descriptor_pools *
 nvme_setup_descriptor_pools(struct nvme_dev *dev, unsigned numa_node)
 {
@@ -3938,7 +3941,6 @@ static int __init nvme_init(void)
 	BUILD_BUG_ON(sizeof(struct nvme_create_sq) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_delete_queue) != 64);
 	BUILD_BUG_ON(IRQ_AFFINITY_MAX_SETS < 2);
-	BUILD_BUG_ON(nvme_pci_npages_prp() > NVME_MAX_NR_DESCRIPTORS);
 
 	return pci_register_driver(&nvme_driver);
 }
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/8] nvme-pci: refactor nvme_pci_use_sgls
  2025-06-23 14:12 ` [PATCH 3/8] nvme-pci: refactor nvme_pci_use_sgls Christoph Hellwig
@ 2025-06-23 15:27   ` Keith Busch
  2025-06-24 12:46     ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Keith Busch @ 2025-06-23 15:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
	Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
	linux-nvme

On Mon, Jun 23, 2025 at 04:12:25PM +0200, Christoph Hellwig wrote:
> @@ -888,7 +899,9 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
>  		goto out_free_sg;
>  	}
>  
> -	if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
> +	if (use_sgl == SGL_FORCED ||
> +	    (use_sgl == SGL_SUPPORTED &&
> +	     (!sgl_threshold || nvme_pci_avg_seg_size(req) >= sgl_threshold)))
>  		ret = nvme_pci_setup_sgls(nvmeq, req, &cmnd->rw);

We historically interpreted sgl_threshold set to 0 to mean disable SGL
usage, maybe because the controller is broken or something. It might be
okay to have 0 mean to not consider segment sizes, but I just wanted to
point out this is a different interpretation of the user parameter.
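
Purely for illustration, the two readings differ only in how a zero
threshold is treated (conditions shown in isolation, not as a patch;
avg_seg_size stands for nvme_pci_avg_seg_size(req)):

	/* historical reading: sgl_threshold == 0 disables SGLs entirely */
	use_sgl = sgl_threshold && avg_seg_size >= sgl_threshold;

	/* reading in the hunk above: sgl_threshold == 0 always uses SGLs */
	use_sgl = !sgl_threshold || avg_seg_size >= sgl_threshold;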


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/8] block: don't merge different kinds of P2P transfers in a single bio
  2025-06-23 14:12 ` [PATCH 1/8] block: don't merge different kinds of P2P transfers in a single bio Christoph Hellwig
@ 2025-06-23 15:43   ` Keith Busch
  0 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2025-06-23 15:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
	Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
	linux-nvme, Leon Romanovsky

On Mon, Jun 23, 2025 at 04:12:23PM +0200, Christoph Hellwig wrote:
> To get out of the DMA mapping helpers having to check every segment for
> its P2P status, ensure that bios either contain P2P transfers or non-P2P
> transfers, and that a P2P bio only contains ranges from a single device.
> 
> This means we do the page zone access in the bio add path where it should
> still be page hot, and will only have to do the fairly expensive P2P
> topology lookup once per bio down in the DMA mapping path, and only for
> already marked bios.

Looks good.

Reviewed-by: Keith Busch <kbusch@kernel.org>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/8] block: add scatterlist-less DMA mapping helpers
  2025-06-23 14:12 ` [PATCH 2/8] block: add scatterlist-less DMA mapping helpers Christoph Hellwig
@ 2025-06-23 15:43   ` Keith Busch
  0 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2025-06-23 15:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
	Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
	linux-nvme, Leon Romanovsky

On Mon, Jun 23, 2025 at 04:12:24PM +0200, Christoph Hellwig wrote:
> Add a new blk_rq_dma_map / blk_rq_dma_unmap pair that does away with
> the wasteful scatterlist structure.  Instead it uses the mapping iterator
> to either add segments to the IOVA for IOMMU operations, or just maps
> them one by one for the direct mapping.  For the IOMMU case instead of
> a scatterlist with an entry for each segment, only a single [dma_addr,len]
> pair needs to be stored for processing a request, and for the direct
> mapping the per-segment allocation shrinks from
> [page,offset,len,dma_addr,dma_len] to just [dma_addr,len].
> 
> One big difference to the scatterlist API, which could be considered a
> downside, is that the IOVA collapsing only works when the driver sets
> a virt_boundary that matches the IOMMU granule.  For NVMe this is done
> already so it works perfectly.
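
For anyone who has not looked at the new helpers yet, a minimal consumer
might look like the sketch below.  The iter_next name, the addr/len fields
and the wrapper function are my reading of the series, not quoted from it:

	static blk_status_t nvme_map_data_sketch(struct nvme_dev *dev,
			struct request *req)
	{
		struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
		struct blk_dma_iter iter;

		if (!blk_rq_dma_map_iter_start(req, dev->dev, &iod->dma_state,
					       &iter))
			return iter.status;
		do {
			/* iter.addr / iter.len describe one mapped DMA range;
			 * translate it into PRP or SGL entries here */
		} while (blk_rq_dma_map_iter_next(req, dev->dev,
						  &iod->dma_state, &iter));
		return BLK_STS_OK;
	}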

Looks good.

Reviewed-by: Keith Busch <kbusch@kernel.org>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/8] nvme-pci: merge the simple PRP and SGL setup into a common helper
  2025-06-23 14:12 ` [PATCH 4/8] nvme-pci: merge the simple PRP and SGL setup into a common helper Christoph Hellwig
@ 2025-06-23 15:43   ` Keith Busch
  0 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2025-06-23 15:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
	Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
	linux-nvme

On Mon, Jun 23, 2025 at 04:12:26PM +0200, Christoph Hellwig wrote:
> nvme_setup_prp_simple and nvme_setup_sgl_simple share a lot of logic.
> Merge them into a single helper that makes use of the previously added
> use_sgl tristate.

Looks good.

Reviewed-by: Keith Busch <kbusch@kernel.org>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 6/8] nvme-pci: convert the data mapping to blk_rq_dma_map
  2025-06-23 14:12 ` [PATCH 6/8] nvme-pci: convert the data mapping to blk_rq_dma_map Christoph Hellwig
@ 2025-06-23 15:44   ` Keith Busch
  0 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2025-06-23 15:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
	Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
	linux-nvme, Leon Romanovsky

On Mon, Jun 23, 2025 at 04:12:28PM +0200, Christoph Hellwig wrote:
> Use the blk_rq_dma_map API to DMA map requests instead of scatterlists.
> This removes the need to allocate a scatterlist covering every segment,
> and thus the overall transfer length limit based on the scatterlist
> allocation.
> 
> Instead the DMA mapping is done by iterating the bio_vec chain in the
> request directly.  The unmap is handled differently depending on how
> we mapped:
> 
>  - when using an IOMMU only a single IOVA is used, and it is stored in
>    iova_state
>  - for direct mappings that don't use swiotlb and are cache coherent,
>    unmap is not needed at all
>  - for direct mappings that are not cache coherent or use swiotlb, the
>    physical addresses are rebuilt from the PRPs or SGL segments
> 
> The latter unfortunately adds a fair amount of code to the driver, but
> it is code not used in the fast path.
> 
> The conversion only covers the data mapping path, and still uses a
> scatterlist for the multi-segment metadata case.  I plan to convert that
> as soon as we have good test coverage for the multi-segment metadata
> path.
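
A rough sketch of that unmap decision, with the helper names taken from the
new DMA API as I understand it (illustrative only, not the patch itself):

	static void nvme_unmap_data_sketch(struct nvme_dev *dev,
			struct request *req)
	{
		struct nvme_iod *iod = blk_mq_rq_to_pdu(req);

		if (dma_use_iova(&iod->dma_state)) {
			/* IOMMU path: a single IOVA covers the whole request */
			dma_iova_destroy(dev->dev, &iod->dma_state,
					 iod->total_len, rq_dma_dir(req), 0);
		} else if (dma_need_unmap(dev->dev)) {
			/* swiotlb or non-coherent: rebuild the physical
			 * addresses from the PRPs/SGL and unmap each range */
		}
		/* cache-coherent direct mapping: nothing to undo */
	}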

Looks good.

Reviewed-by: Keith Busch <kbusch@kernel.org>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 7/8] nvme-pci: replace NVME_MAX_KB_SZ with NVME_MAX_BYTE
  2025-06-23 14:12 ` [PATCH 7/8] nvme-pci: replace NVME_MAX_KB_SZ with NVME_MAX_BYTE Christoph Hellwig
@ 2025-06-23 15:44   ` Keith Busch
  0 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2025-06-23 15:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
	Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
	linux-nvme, Daniel Gomez, Leon Romanovsky

On Mon, Jun 23, 2025 at 04:12:29PM +0200, Christoph Hellwig wrote:
> Having a define in kiB units is a bit weird.  Also update the
> comment now that there is no scatterlist limit.

Looks good.

Reviewed-by: Keith Busch <kbusch@kernel.org>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 8/8] nvme-pci: rework the build time assert for NVME_MAX_NR_DESCRIPTORS
  2025-06-23 14:12 ` [PATCH 8/8] nvme-pci: rework the build time assert for NVME_MAX_NR_DESCRIPTORS Christoph Hellwig
@ 2025-06-23 15:44   ` Keith Busch
  0 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2025-06-23 15:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
	Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
	linux-nvme, Daniel Gomez, Leon Romanovsky

On Mon, Jun 23, 2025 at 04:12:30PM +0200, Christoph Hellwig wrote:
> The current use of an always_inline helper is a bit convoluted.
> Instead use macros that represent the arithmetic used for building
> up the PRP chain.

Looks good.

Reviewed-by: Keith Busch <kbusch@kernel.org>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new DMA API conversion for nvme-pci v2
  2025-06-23 14:12 new DMA API conversion for nvme-pci v2 Christoph Hellwig
                   ` (7 preceding siblings ...)
  2025-06-23 14:12 ` [PATCH 8/8] nvme-pci: rework the build time assert for NVME_MAX_NR_DESCRIPTORS Christoph Hellwig
@ 2025-06-23 21:45 ` Chaitanya Kulkarni
  2025-06-25  4:32   ` Chaitanya Kulkarni
  8 siblings, 1 reply; 23+ messages in thread
From: Chaitanya Kulkarni @ 2025-06-23 21:45 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
	Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe,
	linux-block@vger.kernel.org, linux-nvme@lists.infradead.org

On 6/23/25 07:12, Christoph Hellwig wrote:
> Hi all,
>
> this series converts the nvme-pci driver to the new IOVA-based DMA API
> for the data path.
>
> Changes since v1:
>   - minor cleanups to the block dma mapping helpers
>   - fix the metadata SGL supported check for bisectability
>   - fix SGL threshold check
>   - fix/simplify metadata SGL force checks
>
> Diffstat:
>   block/bio-integrity.c      |    3
>   block/bio.c                |   20 -
>   block/blk-mq-dma.c         |  161 +++++++++++
>   drivers/nvme/host/pci.c    |  615 +++++++++++++++++++++++++--------------------
>   include/linux/blk-mq-dma.h |   63 ++++
>   include/linux/blk_types.h  |    2
>   6 files changed, 597 insertions(+), 267 deletions(-)

I'll have this tested and reviewed on my H/W setup by the end of my day
tomorrow.
Gentle request: please wait till the end of the day tomorrow before applying.

-ck



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/8] nvme-pci: refactor nvme_pci_use_sgls
  2025-06-23 15:27   ` Keith Busch
@ 2025-06-24 12:46     ` Christoph Hellwig
  2025-06-25  4:28       ` Chaitanya Kulkarni
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2025-06-24 12:46 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni,
	Kanchan Joshi, Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe,
	linux-block, linux-nvme

On Mon, Jun 23, 2025 at 09:27:30AM -0600, Keith Busch wrote:
> > -	if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
> > +	if (use_sgl == SGL_FORCED ||
> > +	    (use_sgl == SGL_SUPPORTED &&
> > +	     (!sgl_threshold || nvme_pci_avg_seg_size(req) >= sgl_threshold)))
> >  		ret = nvme_pci_setup_sgls(nvmeq, req, &cmnd->rw);
> 
> We historically interpreted sgl_threshold set to 0 to mean disable SGL
> usage, maybe because the controller is broken or something. It might be
> okay to have 0 mean to not consider segment sizes, but I just wanted to
> point out this is a different interpretation of the user parameter.

This is still a leftover from the threshold mess in the first version.
I'll fix it for the next one.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/8] nvme-pci: refactor nvme_pci_use_sgls
  2025-06-24 12:46     ` Christoph Hellwig
@ 2025-06-25  4:28       ` Chaitanya Kulkarni
  2025-06-25  6:07         ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Chaitanya Kulkarni @ 2025-06-25  4:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
	Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe,
	linux-block@vger.kernel.org, Keith Busch,
	linux-nvme@lists.infradead.org

Christoph,

On 6/24/25 05:46, Christoph Hellwig wrote:
> On Mon, Jun 23, 2025 at 09:27:30AM -0600, Keith Busch wrote:
>>> -	if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
>>> +	if (use_sgl == SGL_FORCED ||
>>> +	    (use_sgl == SGL_SUPPORTED &&
>>> +	     (!sgl_threshold || nvme_pci_avg_seg_size(req) >= sgl_threshold)))
>>>   		ret = nvme_pci_setup_sgls(nvmeq, req, &cmnd->rw);
>> We historically interpreted sgl_threshold set to 0 to mean disable SGL
>> usage, maybe because the controller is broken or something. It might be
>> okay to have 0 mean to not consider segment sizes, but I just wanted to
>> point out this is a different interpretation of the user parameter.
> This is still a leftover from the threshold mess in the first version.
> I'll fix it for the next one.

Are you planning to send a V3 with a fix for the sgl_threshold parameter,
or are you planning to send out a separate patch to fix this change in
behavior?

-ck



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new DMA API conversion for nvme-pci v2
  2025-06-23 21:45 ` new DMA API conversion for nvme-pci v2 Chaitanya Kulkarni
@ 2025-06-25  4:32   ` Chaitanya Kulkarni
  0 siblings, 0 replies; 23+ messages in thread
From: Chaitanya Kulkarni @ 2025-06-25  4:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Sagi Grimberg, Kanchan Joshi, Leon Romanovsky,
	Nitesh Shetty, Logan Gunthorpe, linux-block@vger.kernel.org,
	linux-nvme@lists.infradead.org, Jens Axboe

On 6/23/25 14:45, Chaitanya Kulkarni wrote:
> On 6/23/25 07:12, Christoph Hellwig wrote:
>> Hi all,
>>
>> this series converts the nvme-pci driver to the new IOVA-based DMA API
>> for the data path.
>>
>> Changes since v1:
>>    - minor cleanups to the block dma mapping helpers
>>    - fix the metadata SGL supported check for bisectability
>>    - fix SGL threshold check
>>    - fix/simplify metadata SGL force checks
>>
>> Diffstat:
>>    block/bio-integrity.c      |    3
>>    block/bio.c                |   20 -
>>    block/blk-mq-dma.c         |  161 +++++++++++
>>    drivers/nvme/host/pci.c    |  615 +++++++++++++++++++++++++--------------------
>>    include/linux/blk-mq-dma.h |   63 ++++
>>    include/linux/blk_types.h  |    2
>>    6 files changed, 597 insertions(+), 267 deletions(-)
> I'll have this tested and reviewed by tomorrow end of my day on my H/W
> setup.
> Gentle request, please wait till tomorrow end of the day before applying.
>
> -ck
>
>

The patch series looks good to me (except for the SGL parameter fix).
I ran out of time before I could finish the performance tests today.

I'll definitely have them finished by the end of the day tomorrow and
provide the necessary tags.

-ck



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/8] nvme-pci: refactor nvme_pci_use_sgls
  2025-06-25  4:28       ` Chaitanya Kulkarni
@ 2025-06-25  6:07         ` Christoph Hellwig
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2025-06-25  6:07 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, Kanchan Joshi,
	Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe,
	linux-block@vger.kernel.org, Keith Busch,
	linux-nvme@lists.infradead.org

On Wed, Jun 25, 2025 at 04:28:09AM +0000, Chaitanya Kulkarni wrote:
> Are you planning to send a V3 with a fix for the sgl_threshold parameter,

I'll send out a new version in a bit.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 3/8] nvme-pci: refactor nvme_pci_use_sgls
  2025-06-25 11:34 new DMA API conversion for nvme-pci v3 Christoph Hellwig
@ 2025-06-25 11:35 ` Christoph Hellwig
  2025-06-25 16:03   ` Keith Busch
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2025-06-25 11:35 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
	Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
	linux-nvme

Move the average segment size calculation into a separate helper, and
return a tristate to distinguish the case where we can use SGLs from the
case where we have to use SGLs.  This will allow simplifying the code and
making more efficient decisions in follow-on changes.
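
Spelled out, the decision the tristate drives (together with the threshold
check in the hunk below) is roughly:

	SGL_FORCED       use an SGL unconditionally
	SGL_SUPPORTED    use an SGL only if sgl_threshold is non-zero and the
	                 average segment size is at least sgl_threshold
	SGL_UNSUPPORTED  always fall back to PRPs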

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/pci.c | 41 +++++++++++++++++++++++++++--------------
 1 file changed, 27 insertions(+), 14 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8ff12e415cb5..16ff87fe3dd9 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -578,6 +578,12 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
 	spin_unlock(&nvmeq->sq_lock);
 }
 
+enum nvme_use_sgl {
+	SGL_UNSUPPORTED,
+	SGL_SUPPORTED,
+	SGL_FORCED,
+};
+
 static inline bool nvme_pci_metadata_use_sgls(struct nvme_dev *dev,
 					      struct request *req)
 {
@@ -587,23 +593,27 @@ static inline bool nvme_pci_metadata_use_sgls(struct nvme_dev *dev,
 		nvme_req(req)->flags & NVME_REQ_USERCMD;
 }
 
-static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
-				     int nseg)
+static inline enum nvme_use_sgl nvme_pci_use_sgls(struct nvme_dev *dev,
+		struct request *req)
 {
 	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
-	unsigned int avg_seg_size;
 
-	avg_seg_size = DIV_ROUND_UP(blk_rq_payload_bytes(req), nseg);
+	if (nvmeq->qid && nvme_ctrl_sgl_supported(&dev->ctrl)) {
+		if (nvme_req(req)->flags & NVME_REQ_USERCMD)
+			return SGL_FORCED;
+		if (req->nr_integrity_segments > 1)
+			return SGL_FORCED;
+		return SGL_SUPPORTED;
+	}
 
-	if (!nvme_ctrl_sgl_supported(&dev->ctrl))
-		return false;
-	if (!nvmeq->qid)
-		return false;
-	if (nvme_pci_metadata_use_sgls(dev, req))
-		return true;
-	if (!sgl_threshold || avg_seg_size < sgl_threshold)
-		return nvme_req(req)->flags & NVME_REQ_USERCMD;
-	return true;
+	return SGL_UNSUPPORTED;
+}
+
+static unsigned int nvme_pci_avg_seg_size(struct request *req)
+{
+	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+
+	return DIV_ROUND_UP(blk_rq_payload_bytes(req), iod->sgt.nents);
 }
 
 static inline struct dma_pool *nvme_dma_pool(struct nvme_queue *nvmeq,
@@ -851,6 +861,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 {
 	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	enum nvme_use_sgl use_sgl = nvme_pci_use_sgls(dev, req);
 	blk_status_t ret = BLK_STS_RESOURCE;
 	int rc;
 
@@ -888,7 +899,9 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 		goto out_free_sg;
 	}
 
-	if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
+	if (use_sgl == SGL_FORCED ||
+	    (use_sgl == SGL_SUPPORTED &&
+	     (sgl_threshold && nvme_pci_avg_seg_size(req) >= sgl_threshold)))
 		ret = nvme_pci_setup_sgls(nvmeq, req, &cmnd->rw);
 	else
 		ret = nvme_pci_setup_prps(nvmeq, req, &cmnd->rw);
-- 
2.47.2



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/8] nvme-pci: refactor nvme_pci_use_sgls
  2025-06-25 11:35 ` [PATCH 3/8] nvme-pci: refactor nvme_pci_use_sgls Christoph Hellwig
@ 2025-06-25 16:03   ` Keith Busch
  0 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2025-06-25 16:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Sagi Grimberg, Chaitanya Kulkarni, Kanchan Joshi,
	Leon Romanovsky, Nitesh Shetty, Logan Gunthorpe, linux-block,
	linux-nvme

On Wed, Jun 25, 2025 at 01:35:00PM +0200, Christoph Hellwig wrote:
> Move the average segment size calculation into a separate helper, and
> return a tristate to distinguish the case where we can use SGLs from the
> case where we have to use SGLs.  This will allow simplifying the code and
> making more efficient decisions in follow-on changes.

Looks good.

Reviewed-by: Keith Busch <kbusch@kernel.org>


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2025-06-25 18:29 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-23 14:12 new DMA API conversion for nvme-pci v2 Christoph Hellwig
2025-06-23 14:12 ` [PATCH 1/8] block: don't merge different kinds of P2P transfers in a single bio Christoph Hellwig
2025-06-23 15:43   ` Keith Busch
2025-06-23 14:12 ` [PATCH 2/8] block: add scatterlist-less DMA mapping helpers Christoph Hellwig
2025-06-23 15:43   ` Keith Busch
2025-06-23 14:12 ` [PATCH 3/8] nvme-pci: refactor nvme_pci_use_sgls Christoph Hellwig
2025-06-23 15:27   ` Keith Busch
2025-06-24 12:46     ` Christoph Hellwig
2025-06-25  4:28       ` Chaitanya Kulkarni
2025-06-25  6:07         ` Christoph Hellwig
2025-06-23 14:12 ` [PATCH 4/8] nvme-pci: merge the simple PRP and SGL setup into a common helper Christoph Hellwig
2025-06-23 15:43   ` Keith Busch
2025-06-23 14:12 ` [PATCH 5/8] nvme-pci: remove superfluous arguments Christoph Hellwig
2025-06-23 14:12 ` [PATCH 6/8] nvme-pci: convert the data mapping to blk_rq_dma_map Christoph Hellwig
2025-06-23 15:44   ` Keith Busch
2025-06-23 14:12 ` [PATCH 7/8] nvme-pci: replace NVME_MAX_KB_SZ with NVME_MAX_BYTE Christoph Hellwig
2025-06-23 15:44   ` Keith Busch
2025-06-23 14:12 ` [PATCH 8/8] nvme-pci: rework the build time assert for NVME_MAX_NR_DESCRIPTORS Christoph Hellwig
2025-06-23 15:44   ` Keith Busch
2025-06-23 21:45 ` new DMA API conversion for nvme-pci v2 Chaitanya Kulkarni
2025-06-25  4:32   ` Chaitanya Kulkarni
  -- strict thread matches above, loose matches on Subject: below --
2025-06-25 11:34 new DMA API conversion for nvme-pci v3 Christoph Hellwig
2025-06-25 11:35 ` [PATCH 3/8] nvme-pci: refactor nvme_pci_use_sgls Christoph Hellwig
2025-06-25 16:03   ` Keith Busch
