* [RFC PATCH 0/7] Block and NVMe PCI use of new DMA mapping API
@ 2024-10-27 14:21 Leon Romanovsky
2024-10-27 14:21 ` [RFC PATCH 1/7] block: share more code for bio addition helpers Leon Romanovsky
` (6 more replies)
0 siblings, 7 replies; 17+ messages in thread
From: Leon Romanovsky @ 2024-10-27 14:21 UTC (permalink / raw)
To: Jonathan Corbet, Jens Axboe, Jason Gunthorpe, Robin Murphy,
Joerg Roedel, Will Deacon, Christoph Hellwig, Sagi Grimberg
Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
linux-mm
This is a complementary patchset to the series that adds the new DMA mapping API [1].
In this series, Christoph converts the existing nvme-pci driver to the new API, which
is wrapped in blk-mq helpers so that future block drivers can reuse them through
block-layer-specific structures.
This is posted as an RFC because it is still under heavy testing; it is presented
here to get feedback from the community and to show another advanced use
case of the new API.
Thanks
[1] https://lore.kernel.org/all/cover.1730037276.git.leon@kernel.org
Christoph Hellwig (7):
block: share more code for bio addition helpers
block: don't merge different kinds of P2P transfers in a single bio
blk-mq: add a dma mapping iterator
blk-mq: add scatterlist-less DMA mapping helpers
nvme-pci: remove struct nvme_descriptor
nvme-pci: use a better encoding for small prp pool allocations
nvme-pci: convert to blk_rq_dma_map
block/bio.c | 148 ++++++------
block/blk-map.c | 32 ++-
block/blk-merge.c | 313 +++++++++++++++++-------
drivers/nvme/host/pci.c | 470 +++++++++++++++++++------------------
include/linux/blk-mq-dma.h | 64 +++++
include/linux/blk_types.h | 2 +
6 files changed, 636 insertions(+), 393 deletions(-)
create mode 100644 include/linux/blk-mq-dma.h
--
2.46.2
* [RFC PATCH 1/7] block: share more code for bio addition helpers
2024-10-27 14:21 [RFC PATCH 0/7] Block and NVMe PCI use of new DMA mapping API Leon Romanovsky
@ 2024-10-27 14:21 ` Leon Romanovsky
2024-10-31 20:55 ` Bart Van Assche
2024-10-27 14:21 ` [RFC PATCH 2/7] block: don't merge different kinds of P2P transfers in a single bio Leon Romanovsky
` (5 subsequent siblings)
6 siblings, 1 reply; 17+ messages in thread
From: Leon Romanovsky @ 2024-10-27 14:21 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Christoph Hellwig, Sagi Grimberg
Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, Jonathan Corbet,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm
From: Christoph Hellwig <hch@lst.de>
__bio_iov_iter_get_pages currently open codes adding pages to the bio,
which duplicates a lot of code from bio_add_page and
bio_add_zone_append_page. Add bio_add_page_int and
bio_add_zone_append_page_int helpers that pass down the same_page
output argument so that __bio_iov_iter_get_pages can reuse the main
helpers for adding pages to a bio.
Note that I'd normally call these helpers __bio_add_page and
__bio_add_zone_append_page, but the former is already taken for an
exported API.
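A condensed sketch of the resulting caller pattern in
__bio_iov_iter_get_pages (error handling simplified, names as in the
hunk below) shows why the helpers need to pass the same_page result
back out:

	bool same_page = false;

	if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
		if (bio_add_zone_append_page_int(bio, first_page, len,
				folio_offset, &same_page) != len)
			return -EINVAL;
	} else {
		if (bio_add_page_int(bio, first_page, len,
				folio_offset, &same_page) != len)
			return -EINVAL;
	}
	/* drop the extra folio reference if the page was merged */
	if (same_page && bio_flagged(bio, BIO_PAGE_PINNED))
		unpin_user_folio(folio, 1);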
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
block/bio.c | 114 +++++++++++++++++++++++-----------------------------
1 file changed, 51 insertions(+), 63 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index ac4d77c88932..2d3bc8bfb071 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1064,6 +1064,19 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio,
}
EXPORT_SYMBOL(bio_add_pc_page);
+static int bio_add_zone_append_page_int(struct bio *bio, struct page *page,
+ unsigned int len, unsigned int offset, bool *same_page)
+{
+ struct block_device *bdev = bio->bi_bdev;
+
+ if (WARN_ON_ONCE(bio_op(bio) != REQ_OP_ZONE_APPEND))
+ return 0;
+ if (WARN_ON_ONCE(!bdev_is_zoned(bdev)))
+ return 0;
+ return bio_add_hw_page(bdev_get_queue(bdev), bio, page, len, offset,
+ bdev_max_zone_append_sectors(bdev), same_page);
+}
+
/**
* bio_add_zone_append_page - attempt to add page to zone-append bio
* @bio: destination bio
@@ -1083,17 +1096,9 @@ EXPORT_SYMBOL(bio_add_pc_page);
int bio_add_zone_append_page(struct bio *bio, struct page *page,
unsigned int len, unsigned int offset)
{
- struct request_queue *q = bdev_get_queue(bio->bi_bdev);
bool same_page = false;
- if (WARN_ON_ONCE(bio_op(bio) != REQ_OP_ZONE_APPEND))
- return 0;
-
- if (WARN_ON_ONCE(!bdev_is_zoned(bio->bi_bdev)))
- return 0;
-
- return bio_add_hw_page(q, bio, page, len, offset,
- queue_max_zone_append_sectors(q), &same_page);
+ return bio_add_zone_append_page_int(bio, page, len, offset, &same_page);
}
EXPORT_SYMBOL_GPL(bio_add_zone_append_page);
@@ -1119,20 +1124,9 @@ void __bio_add_page(struct bio *bio, struct page *page,
}
EXPORT_SYMBOL_GPL(__bio_add_page);
-/**
- * bio_add_page - attempt to add page(s) to bio
- * @bio: destination bio
- * @page: start page to add
- * @len: vec entry length, may cross pages
- * @offset: vec entry offset relative to @page, may cross pages
- *
- * Attempt to add page(s) to the bio_vec maplist. This will only fail
- * if either bio->bi_vcnt == bio->bi_max_vecs or it's a cloned bio.
- */
-int bio_add_page(struct bio *bio, struct page *page,
- unsigned int len, unsigned int offset)
+static int bio_add_page_int(struct bio *bio, struct page *page,
+ unsigned int len, unsigned int offset, bool *same_page)
{
- bool same_page = false;
if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
return 0;
@@ -1141,7 +1135,7 @@ int bio_add_page(struct bio *bio, struct page *page,
if (bio->bi_vcnt > 0 &&
bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
- page, len, offset, &same_page)) {
+ page, len, offset, same_page)) {
bio->bi_iter.bi_size += len;
return len;
}
@@ -1151,6 +1145,24 @@ int bio_add_page(struct bio *bio, struct page *page,
__bio_add_page(bio, page, len, offset);
return len;
}
+
+/**
+ * bio_add_page - attempt to add page(s) to bio
+ * @bio: destination bio
+ * @page: start page to add
+ * @len: vec entry length, may cross pages
+ * @offset: vec entry offset relative to @page, may cross pages
+ *
+ * Attempt to add page(s) to the bio_vec maplist. Will only fail if the
+ * bio is full, or it is incorrectly used on a cloned bio.
+ */
+int bio_add_page(struct bio *bio, struct page *page,
+ unsigned int len, unsigned int offset)
+{
+ bool same_page = false;
+
+ return bio_add_page_int(bio, page, len, offset, &same_page);
+}
EXPORT_SYMBOL(bio_add_page);
void bio_add_folio_nofail(struct bio *bio, struct folio *folio, size_t len,
@@ -1224,41 +1236,6 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
bio_set_flag(bio, BIO_CLONED);
}
-static int bio_iov_add_folio(struct bio *bio, struct folio *folio, size_t len,
- size_t offset)
-{
- bool same_page = false;
-
- if (WARN_ON_ONCE(bio->bi_iter.bi_size > UINT_MAX - len))
- return -EIO;
-
- if (bio->bi_vcnt > 0 &&
- bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
- folio_page(folio, 0), len, offset,
- &same_page)) {
- bio->bi_iter.bi_size += len;
- if (same_page && bio_flagged(bio, BIO_PAGE_PINNED))
- unpin_user_folio(folio, 1);
- return 0;
- }
- bio_add_folio_nofail(bio, folio, len, offset);
- return 0;
-}
-
-static int bio_iov_add_zone_append_folio(struct bio *bio, struct folio *folio,
- size_t len, size_t offset)
-{
- struct request_queue *q = bdev_get_queue(bio->bi_bdev);
- bool same_page = false;
-
- if (bio_add_hw_folio(q, bio, folio, len, offset,
- queue_max_zone_append_sectors(q), &same_page) != len)
- return -EINVAL;
- if (same_page && bio_flagged(bio, BIO_PAGE_PINNED))
- unpin_user_folio(folio, 1);
- return 0;
-}
-
static unsigned int get_contig_folio_len(unsigned int *num_pages,
struct page **pages, unsigned int i,
struct folio *folio, size_t left,
@@ -1353,6 +1330,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
for (left = size, i = 0; left > 0; left -= len, i += num_pages) {
struct page *page = pages[i];
struct folio *folio = page_folio(page);
+ struct page *first_page = folio_page(folio, 0);
+ bool same_page = false;
folio_offset = ((size_t)folio_page_idx(folio, page) <<
PAGE_SHIFT) + offset;
@@ -1366,12 +1345,21 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
folio, left, offset);
if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
- ret = bio_iov_add_zone_append_folio(bio, folio, len,
- folio_offset);
- if (ret)
+ if (bio_add_zone_append_page_int(bio, first_page, len,
+ folio_offset, &same_page) != len) {
+ ret = -EINVAL;
+ break;
+ }
+ } else {
+ if (bio_add_page_int(bio, folio_page(folio, 0), len,
+ folio_offset, &same_page) != len) {
+ ret = -EINVAL;
break;
- } else
- bio_iov_add_folio(bio, folio, len, folio_offset);
+ }
+ }
+
+ if (same_page && bio_flagged(bio, BIO_PAGE_PINNED))
+ unpin_user_folio(folio, 1);
offset = 0;
}
--
2.46.2
* [RFC PATCH 2/7] block: don't merge different kinds of P2P transfers in a single bio
2024-10-27 14:21 [RFC PATCH 0/7] Block and NVMe PCI use of new DMA mapping API Leon Romanovsky
2024-10-27 14:21 ` [RFC PATCH 1/7] block: share more code for bio addition helpers Leon Romanovsky
@ 2024-10-27 14:21 ` Leon Romanovsky
2024-10-28 18:27 ` Logan Gunthorpe
` (2 more replies)
2024-10-27 14:21 ` [RFC PATCH 3/7] blk-mq: add a dma mapping iterator Leon Romanovsky
` (4 subsequent siblings)
6 siblings, 3 replies; 17+ messages in thread
From: Leon Romanovsky @ 2024-10-27 14:21 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Christoph Hellwig, Sagi Grimberg
Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, Jonathan Corbet,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm
From: Christoph Hellwig <hch@lst.de>
To spare the DMA mapping helpers from having to check every segment for
its P2P status, ensure that a bio contains either only P2P transfers or
only non-P2P transfers, and that a P2P bio only contains ranges from a
single device.
This means the page zone access is done in the bio add path, where the
struct page should still be cache hot, and the fairly expensive P2P
topology lookup is only done once per bio down in the DMA mapping path,
and only for bios already marked as P2P.
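The rule boils down to a small check at bio add time; a condensed
sketch of the logic added to bio_add_hw_page and bio_add_page_int in
the diff below:

	if (bio->bi_vcnt > 0) {
		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];

		/* a P2P bio may only grow with pages from the same pgmap */
		if ((bio->bi_opf & REQ_P2PDMA) &&
		    !zone_device_pages_have_same_pgmap(bv->bv_page, page))
			return 0;
	} else if (is_pci_p2pdma_page(page)) {
		/* the first page decides whether this is a P2P-only bio */
		bio->bi_opf |= REQ_P2PDMA | REQ_NOMERGE;
	}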
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
block/bio.c | 36 +++++++++++++++++++++++++++++-------
block/blk-map.c | 32 ++++++++++++++++++++++++--------
include/linux/blk_types.h | 2 ++
3 files changed, 55 insertions(+), 15 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 2d3bc8bfb071..943a6d78cb3e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -928,8 +928,6 @@ static bool bvec_try_merge_page(struct bio_vec *bv, struct page *page,
return false;
if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
return false;
- if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
- return false;
*same_page = ((vec_end_addr & PAGE_MASK) == ((page_addr + off) &
PAGE_MASK));
@@ -993,6 +991,14 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
if (bio->bi_vcnt > 0) {
struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+ /*
+ * When doing ZONE_DEVICE-based P2P transfers, all pages in a
+ * bio must be P2P pages from the same device.
+ */
+ if ((bio->bi_opf & REQ_P2PDMA) &&
+ !zone_device_pages_have_same_pgmap(bv->bv_page, page))
+ return 0;
+
if (bvec_try_merge_hw_page(q, bv, page, len, offset,
same_page)) {
bio->bi_iter.bi_size += len;
@@ -1009,6 +1015,9 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
*/
if (bvec_gap_to_prev(&q->limits, bv, offset))
return 0;
+ } else {
+ if (is_pci_p2pdma_page(page))
+ bio->bi_opf |= REQ_P2PDMA | REQ_NOMERGE;
}
bvec_set_page(&bio->bi_io_vec[bio->bi_vcnt], page, len, offset);
@@ -1133,11 +1142,24 @@ static int bio_add_page_int(struct bio *bio, struct page *page,
if (bio->bi_iter.bi_size > UINT_MAX - len)
return 0;
- if (bio->bi_vcnt > 0 &&
- bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
- page, len, offset, same_page)) {
- bio->bi_iter.bi_size += len;
- return len;
+ if (bio->bi_vcnt > 0) {
+ struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+
+ /*
+ * When doing ZONE_DEVICE-based P2P transfers, all pages in a
+ * bio must be P2P pages from the same device.
+ */
+ if ((bio->bi_opf & REQ_P2PDMA) &&
+ !zone_device_pages_have_same_pgmap(bv->bv_page, page))
+ return 0;
+
+ if (bvec_try_merge_page(bv, page, len, offset, same_page)) {
+ bio->bi_iter.bi_size += len;
+ return len;
+ }
+ } else {
+ if (is_pci_p2pdma_page(page))
+ bio->bi_opf |= REQ_P2PDMA | REQ_NOMERGE;
}
if (bio->bi_vcnt >= bio->bi_max_vecs)
diff --git a/block/blk-map.c b/block/blk-map.c
index 0e1167b23934..03192b1ca6ea 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -568,6 +568,7 @@ static int blk_rq_map_user_bvec(struct request *rq, const struct iov_iter *iter)
const struct queue_limits *lim = &q->limits;
unsigned int nsegs = 0, bytes = 0;
struct bio *bio;
+ int error;
size_t i;
if (!nr_iter || (nr_iter >> SECTOR_SHIFT) > queue_max_hw_sectors(q))
@@ -588,15 +589,30 @@ static int blk_rq_map_user_bvec(struct request *rq, const struct iov_iter *iter)
for (i = 0; i < nr_segs; i++) {
struct bio_vec *bv = &bvecs[i];
- /*
- * If the queue doesn't support SG gaps and adding this
- * offset would create a gap, fallback to copy.
- */
- if (bvprvp && bvec_gap_to_prev(lim, bvprvp, bv->bv_offset)) {
- blk_mq_map_bio_put(bio);
- return -EREMOTEIO;
+ error = -EREMOTEIO;
+ if (bvprvp) {
+ /*
+ * If the queue doesn't support SG gaps and adding this
+ * offset would create a gap, fallback to copy.
+ */
+ if (bvec_gap_to_prev(lim, bvprvp, bv->bv_offset))
+ goto put_bio;
+
+ /*
+ * When doing ZONE_DEVICE-based P2P transfers, all pages
+ * in a bio must be P2P pages, and from the same device.
+ */
+ if ((bio->bi_opf & REQ_P2PDMA) &&
+ !zone_device_pages_have_same_pgmap(bvprvp->bv_page,
+ bv->bv_page))
+ goto put_bio;
+ } else {
+ if (is_pci_p2pdma_page(bv->bv_page))
+ bio->bi_opf |= REQ_P2PDMA | REQ_NOMERGE;
}
+
/* check full condition */
+ error = -EINVAL;
if (nsegs >= nr_segs || bytes > UINT_MAX - bv->bv_len)
goto put_bio;
if (bytes + bv->bv_len > nr_iter)
@@ -611,7 +627,7 @@ static int blk_rq_map_user_bvec(struct request *rq, const struct iov_iter *iter)
return 0;
put_bio:
blk_mq_map_bio_put(bio);
- return -EINVAL;
+ return error;
}
/**
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index dce7615c35e7..94cf146e8ce6 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -378,6 +378,7 @@ enum req_flag_bits {
__REQ_DRV, /* for driver use */
__REQ_FS_PRIVATE, /* for file system (submitter) use */
__REQ_ATOMIC, /* for atomic write operations */
+ __REQ_P2PDMA, /* contains P2P DMA pages */
/*
* Command specific flags, keep last:
*/
@@ -410,6 +411,7 @@ enum req_flag_bits {
#define REQ_DRV (__force blk_opf_t)(1ULL << __REQ_DRV)
#define REQ_FS_PRIVATE (__force blk_opf_t)(1ULL << __REQ_FS_PRIVATE)
#define REQ_ATOMIC (__force blk_opf_t)(1ULL << __REQ_ATOMIC)
+#define REQ_P2PDMA (__force blk_opf_t)(1ULL << __REQ_P2PDMA)
#define REQ_NOUNMAP (__force blk_opf_t)(1ULL << __REQ_NOUNMAP)
--
2.46.2
* [RFC PATCH 3/7] blk-mq: add a dma mapping iterator
2024-10-27 14:21 [RFC PATCH 0/7] Block and NVMe PCI use of new DMA mapping API Leon Romanovsky
2024-10-27 14:21 ` [RFC PATCH 1/7] block: share more code for bio addition helpers Leon Romanovsky
2024-10-27 14:21 ` [RFC PATCH 2/7] block: don't merge different kinds of P2P transfers in a single bio Leon Romanovsky
@ 2024-10-27 14:21 ` Leon Romanovsky
2024-10-27 14:21 ` [RFC PATCH 4/7] blk-mq: add scatterlist-less DMA mapping helpers Leon Romanovsky
` (3 subsequent siblings)
6 siblings, 0 replies; 17+ messages in thread
From: Leon Romanovsky @ 2024-10-27 14:21 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Christoph Hellwig, Sagi Grimberg
Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, Jonathan Corbet,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm
From: Christoph Hellwig <hch@lst.de>
blk_rq_map_sg is a maze of nested loops. Untangle it by creating an
iterator that returns [paddr,len] tuples for DMA mapping, and then
implement the DMA mapping logic on top of it. This not only removes
code at the source level, but also generates nicer binary code:
$ size block/blk-merge.o.*
text data bss dec hex filename
10001 432 0 10433 28c1 block/blk-merge.o.new
10317 468 0 10785 2a21 block/blk-merge.o.old
Last but not least, it will be used as a building block for a new
DMA mapping helper that doesn't rely on struct scatterlist.
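With the iterator in place, __blk_rq_map_sg collapses into a single
flat loop over [paddr,len] tuples, condensed here from the hunk below:

	struct req_iterator iter = {
		.bio	= rq->bio,
		.iter	= rq->bio->bi_iter,
	};
	struct phys_vec vec;
	int nsegs = 0;

	while (blk_map_iter_next(rq, &iter, &vec)) {
		*last_sg = blk_next_sg(last_sg, sglist);
		sg_set_page(*last_sg, blk_phys_to_page(vec.paddr), vec.len,
			    offset_in_page(vec.paddr));
		nsegs++;
	}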
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
block/blk-merge.c | 182 ++++++++++++++++++++--------------------------
1 file changed, 77 insertions(+), 105 deletions(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index ad763ec313b6..b63fd754a5de 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -451,137 +451,109 @@ unsigned int blk_recalc_rq_segments(struct request *rq)
return nr_phys_segs;
}
-static inline struct scatterlist *blk_next_sg(struct scatterlist **sg,
- struct scatterlist *sglist)
+struct phys_vec {
+ phys_addr_t paddr;
+ u32 len;
+};
+
+static bool blk_map_iter_next(struct request *req,
+ struct req_iterator *iter, struct phys_vec *vec)
{
- if (!*sg)
- return sglist;
+ unsigned int max_size;
+ struct bio_vec bv;
/*
- * If the driver previously mapped a shorter list, we could see a
- * termination bit prematurely unless it fully inits the sg table
- * on each mapping. We KNOW that there must be more entries here
- * or the driver would be buggy, so force clear the termination bit
- * to avoid doing a full sg_init_table() in drivers for each command.
+ * For special payload requests there only is a single segment. Return
+ * it now and make sure blk_map_iter_next stops iterating.
*/
- sg_unmark_end(*sg);
- return sg_next(*sg);
-}
+ if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
+ if (!iter->bio)
+ return false;
+ vec->paddr = bvec_phys(&req->special_vec);
+ vec->len = req->special_vec.bv_len;
+ iter->bio = NULL;
+ return true;
+ }
-static unsigned blk_bvec_map_sg(struct request_queue *q,
- struct bio_vec *bvec, struct scatterlist *sglist,
- struct scatterlist **sg)
-{
- unsigned nbytes = bvec->bv_len;
- unsigned nsegs = 0, total = 0;
+ if (!iter->iter.bi_size)
+ return false;
- while (nbytes > 0) {
- unsigned offset = bvec->bv_offset + total;
- unsigned len = get_max_segment_size(&q->limits,
- bvec_phys(bvec) + total, nbytes);
- struct page *page = bvec->bv_page;
+ bv = mp_bvec_iter_bvec(iter->bio->bi_io_vec, iter->iter);
+ vec->paddr = bvec_phys(&bv);
+ max_size = get_max_segment_size(&req->q->limits, vec->paddr, UINT_MAX);
+ bv.bv_len = min(bv.bv_len, max_size);
+ bio_advance_iter_single(iter->bio, &iter->iter, bv.bv_len);
- /*
- * Unfortunately a fair number of drivers barf on scatterlists
- * that have an offset larger than PAGE_SIZE, despite other
- * subsystems dealing with that invariant just fine. For now
- * stick to the legacy format where we never present those from
- * the block layer, but the code below should be removed once
- * these offenders (mostly MMC/SD drivers) are fixed.
- */
- page += (offset >> PAGE_SHIFT);
- offset &= ~PAGE_MASK;
+ /*
+ * If we are entirely done with this bi_io_vec entry, check if the next
+ * one could be merged into it. This typically happens when moving to
+ * the next bio, but some callers also don't pack bvecs tight.
+ */
+ while (!iter->iter.bi_size || !iter->iter.bi_bvec_done) {
+ struct bio_vec next;
+
+ if (!iter->iter.bi_size) {
+ if (!iter->bio->bi_next)
+ break;
+ iter->bio = iter->bio->bi_next;
+ iter->iter = iter->bio->bi_iter;
+ }
- *sg = blk_next_sg(sg, sglist);
- sg_set_page(*sg, page, len, offset);
+ next = mp_bvec_iter_bvec(iter->bio->bi_io_vec, iter->iter);
+ if (bv.bv_len + next.bv_len > max_size ||
+ !biovec_phys_mergeable(req->q, &bv, &next))
+ break;
- total += len;
- nbytes -= len;
- nsegs++;
+ bv.bv_len += next.bv_len;
+ bio_advance_iter_single(iter->bio, &iter->iter, next.bv_len);
}
- return nsegs;
+ vec->len = bv.bv_len;
+ return true;
}
-static inline int __blk_bvec_map_sg(struct bio_vec bv,
- struct scatterlist *sglist, struct scatterlist **sg)
-{
- *sg = blk_next_sg(sg, sglist);
- sg_set_page(*sg, bv.bv_page, bv.bv_len, bv.bv_offset);
- return 1;
-}
+#define blk_phys_to_page(_paddr) \
+ (pfn_to_page(__phys_to_pfn(_paddr)))
-/* only try to merge bvecs into one sg if they are from two bios */
-static inline bool
-__blk_segment_map_sg_merge(struct request_queue *q, struct bio_vec *bvec,
- struct bio_vec *bvprv, struct scatterlist **sg)
+static inline struct scatterlist *blk_next_sg(struct scatterlist **sg,
+ struct scatterlist *sglist)
{
-
- int nbytes = bvec->bv_len;
-
if (!*sg)
- return false;
-
- if ((*sg)->length + nbytes > queue_max_segment_size(q))
- return false;
-
- if (!biovec_phys_mergeable(q, bvprv, bvec))
- return false;
-
- (*sg)->length += nbytes;
-
- return true;
-}
-
-static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
- struct scatterlist *sglist,
- struct scatterlist **sg)
-{
- struct bio_vec bvec, bvprv = { NULL };
- struct bvec_iter iter;
- int nsegs = 0;
- bool new_bio = false;
-
- for_each_bio(bio) {
- bio_for_each_bvec(bvec, bio, iter) {
- /*
- * Only try to merge bvecs from two bios given we
- * have done bio internal merge when adding pages
- * to bio
- */
- if (new_bio &&
- __blk_segment_map_sg_merge(q, &bvec, &bvprv, sg))
- goto next_bvec;
-
- if (bvec.bv_offset + bvec.bv_len <= PAGE_SIZE)
- nsegs += __blk_bvec_map_sg(bvec, sglist, sg);
- else
- nsegs += blk_bvec_map_sg(q, &bvec, sglist, sg);
- next_bvec:
- new_bio = false;
- }
- if (likely(bio->bi_iter.bi_size)) {
- bvprv = bvec;
- new_bio = true;
- }
- }
+ return sglist;
- return nsegs;
+ /*
+ * If the driver previously mapped a shorter list, we could see a
+ * termination bit prematurely unless it fully inits the sg table
+ * on each mapping. We KNOW that there must be more entries here
+ * or the driver would be buggy, so force clear the termination bit
+ * to avoid doing a full sg_init_table() in drivers for each command.
+ */
+ sg_unmark_end(*sg);
+ return sg_next(*sg);
}
/*
- * map a request to scatterlist, return number of sg entries setup. Caller
- * must make sure sg can hold rq->nr_phys_segments entries
+ * Map a request to scatterlist, return number of sg entries setup. Caller
+ * must make sure sg can hold rq->nr_phys_segments entries.
*/
int __blk_rq_map_sg(struct request_queue *q, struct request *rq,
struct scatterlist *sglist, struct scatterlist **last_sg)
{
+ struct req_iterator iter = {
+ .bio = rq->bio,
+ .iter = rq->bio->bi_iter,
+ };
+ struct phys_vec vec;
int nsegs = 0;
- if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
- nsegs = __blk_bvec_map_sg(rq->special_vec, sglist, last_sg);
- else if (rq->bio)
- nsegs = __blk_bios_map_sg(q, rq->bio, sglist, last_sg);
+ while (blk_map_iter_next(rq, &iter, &vec)) {
+ struct page *page = blk_phys_to_page(vec.paddr);
+ unsigned int offset = offset_in_page(vec.paddr);
+
+ *last_sg = blk_next_sg(last_sg, sglist);
+ sg_set_page(*last_sg, page, vec.len, offset);
+ nsegs++;
+ }
if (*last_sg)
sg_mark_end(*last_sg);
--
2.46.2
* [RFC PATCH 4/7] blk-mq: add scatterlist-less DMA mapping helpers
2024-10-27 14:21 [RFC PATCH 0/7] Block and NVMe PCI use of new DMA mapping API Leon Romanovsky
` (2 preceding siblings ...)
2024-10-27 14:21 ` [RFC PATCH 3/7] blk-mq: add a dma mapping iterator Leon Romanovsky
@ 2024-10-27 14:21 ` Leon Romanovsky
2024-10-27 14:21 ` [RFC PATCH 5/7] nvme-pci: remove struct nvme_descriptor Leon Romanovsky
` (2 subsequent siblings)
6 siblings, 0 replies; 17+ messages in thread
From: Leon Romanovsky @ 2024-10-27 14:21 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Christoph Hellwig, Sagi Grimberg
Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, Jonathan Corbet,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm
From: Christoph Hellwig <hch@lst.de>
Add a new blk_rq_dma_map / blk_rq_dma_unmap pair that does away with
the wasteful scatterlist structure. Instead it uses the mapping iterator
to either add segments to the IOVA for IOMMU operations, or map them
one by one for the direct mapping. For the IOMMU case, instead of
a scatterlist with an entry for each segment, only a single [dma_addr,len]
pair needs to be stored for processing a request; for the direct
mapping, the per-segment allocation shrinks from
[page,offset,len,dma_addr,dma_len] to just [dma_addr,len].
The major downside of this API is that the IOVA collapsing only works
when the driver sets a virt_boundary that matches the IOMMU granule.
Note that struct blk_dma_vec, struct blk_dma_mapping and blk_rq_dma_unmap
aren't really block specific, but for now they are kept with the only
mapping routine to keep things simple.
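A rough sketch of how a driver is expected to consume the new helpers,
based on the kernel-doc below and the nvme-pci conversion later in this
series (error handling trimmed):

	struct blk_dma_iter iter;

	if (!blk_rq_dma_map_iter_start(req, dma_dev, state, &iter))
		return iter.status;
	do {
		/* program iter.addr / iter.len into a hardware descriptor */
	} while (blk_rq_dma_map_iter_next(req, dma_dev, state, &iter));

	/* ... and at completion time: */
	if (!blk_rq_dma_unmap(req, dma_dev, state)) {
		/* walk the hardware descriptors and unmap each segment */
	}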
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
block/blk-merge.c | 163 +++++++++++++++++++++++++++++++++++++
include/linux/blk-mq-dma.h | 64 +++++++++++++++
2 files changed, 227 insertions(+)
create mode 100644 include/linux/blk-mq-dma.h
diff --git a/block/blk-merge.c b/block/blk-merge.c
index b63fd754a5de..77e5a3d208fc 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -7,6 +7,8 @@
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/blk-integrity.h>
+#include <linux/blk-mq-dma.h>
+#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>
#include <linux/part_stat.h>
#include <linux/blk-cgroup.h>
@@ -515,6 +517,167 @@ static bool blk_map_iter_next(struct request *req,
#define blk_phys_to_page(_paddr) \
(pfn_to_page(__phys_to_pfn(_paddr)))
+/*
+ * The IOVA-based DMA API wants to be able to coalesce at the minimal IOMMU page
+ * size granularity (which is guaranteed to be <= PAGE_SIZE and usually 4k), so
+ * we need to ensure our segments are aligned to this as well.
+ *
+ * Note that there is no point in using the slightly more complicated IOVA based
+ * path for single segment mappings.
+ */
+static inline bool blk_can_dma_map_iova(struct request *req,
+ struct device *dma_dev)
+{
+ return !((queue_virt_boundary(req->q) + 1) &
+ dma_get_merge_boundary(dma_dev));
+}
+
+static bool blk_dma_map_bus(struct request *req, struct device *dma_dev,
+ struct blk_dma_iter *iter, struct phys_vec *vec)
+{
+ iter->addr = pci_p2pdma_bus_addr_map(&iter->p2pdma, vec->paddr);
+ iter->len = vec->len;
+ return true;
+}
+
+static bool blk_dma_map_direct(struct request *req, struct device *dma_dev,
+ struct blk_dma_iter *iter, struct phys_vec *vec)
+{
+ iter->addr = dma_map_page(dma_dev, blk_phys_to_page(vec->paddr),
+ offset_in_page(vec->paddr), vec->len, rq_dma_dir(req));
+ if (dma_mapping_error(dma_dev, iter->addr)) {
+ iter->status = BLK_STS_RESOURCE;
+ return false;
+ }
+ iter->len = vec->len;
+ return true;
+}
+
+static bool blk_rq_dma_map_iova(struct request *req, struct device *dma_dev,
+ struct dma_iova_state *state, struct blk_dma_iter *iter,
+ struct phys_vec *vec)
+{
+ enum dma_data_direction dir = rq_dma_dir(req);
+ unsigned int mapped = 0;
+ int error = 0;
+
+ iter->addr = state->addr;
+ iter->len = dma_iova_size(state);
+
+ do {
+ error = dma_iova_link(dma_dev, state, vec->paddr, mapped,
+ vec->len, dir, 0);
+ if (error)
+ break;
+ mapped += vec->len;
+ } while (blk_map_iter_next(req, &iter->iter, vec));
+
+ error = dma_iova_sync(dma_dev, state, 0, mapped, error);
+ if (error) {
+ iter->status = errno_to_blk_status(error);
+ return false;
+ }
+
+ return true;
+}
+
+/**
+ * blk_rq_dma_map_iter_start - map the first DMA segment for a request
+ * @req: request to map
+ * @dma_dev: device to map to
+ * @state: DMA IOVA state
+ * @iter: block layer DMA iterator
+ *
+ * Start DMA mapping @req to @dma_dev. @state and @iter are provided by the
+ * caller and don't need to be initialized. @state needs to be stored for use
+ * at unmap time, @iter is only needed at map time.
+ *
+ * Returns %false if there is no segment to map, including due to an error, or
+ * %true if it did map a segment.
+ *
+ * If a segment was mapped, the DMA address for it is returned in @iter.addr and
+ * the length in @iter.len. If no segment was mapped the status code is
+ * returned in @iter.status.
+ *
+ * The caller can call blk_rq_dma_map_coalesce() to check if further segments
+ * need to be mapped after this, or go straight to blk_rq_dma_map_iter_next()
+ * to try to map the following segments.
+ */
+bool blk_rq_dma_map_iter_start(struct request *req, struct device *dma_dev,
+ struct dma_iova_state *state, struct blk_dma_iter *iter)
+{
+ unsigned int total_len = blk_rq_payload_bytes(req);
+ struct phys_vec vec;
+
+ iter->iter.bio = req->bio;
+ iter->iter.iter = req->bio->bi_iter;
+ memset(&iter->p2pdma, 0, sizeof(iter->p2pdma));
+ iter->status = BLK_STS_OK;
+
+ /*
+ * Grab the first segment ASAP because we'll need it to check for P2P
+ * transfers.
+ */
+ if (!blk_map_iter_next(req, &iter->iter, &vec))
+ return false;
+
+ if (IS_ENABLED(CONFIG_PCI_P2PDMA) && (req->cmd_flags & REQ_P2PDMA)) {
+ switch (pci_p2pdma_state(&iter->p2pdma, dma_dev,
+ blk_phys_to_page(vec.paddr))) {
+ case PCI_P2PDMA_MAP_BUS_ADDR:
+ return blk_dma_map_bus(req, dma_dev, iter, &vec);
+ case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+ /*
+ * P2P transfers through the host bridge are treated the
+ * same as non-P2P transfers below and during unmap.
+ */
+ req->cmd_flags &= ~REQ_P2PDMA;
+ break;
+ default:
+ iter->status = BLK_STS_INVAL;
+ return false;
+ }
+ }
+
+ if (blk_can_dma_map_iova(req, dma_dev) &&
+ dma_iova_try_alloc(dma_dev, state, vec.paddr, total_len))
+ return blk_rq_dma_map_iova(req, dma_dev, state, iter, &vec);
+ return blk_dma_map_direct(req, dma_dev, iter, &vec);
+}
+EXPORT_SYMBOL_GPL(blk_rq_dma_map_iter_start);
+
+/**
+ * blk_rq_dma_map_iter_next - map the next DMA segment for a request
+ * @req: request to map
+ * @dma_dev: device to map to
+ * @state: DMA IOVA state
+ * @iter: block layer DMA iterator
+ *
+ * Iterate to the next mapping after a previous call to
+ * blk_rq_dma_map_iter_start(). See there for a detailed description of the
+ * arguments.
+ *
+ * Returns %false if there is no segment to map, including due to an error, or
+ * %true if it did map a segment.
+ *
+ * If a segment was mapped, the DMA address for it is returned in @iter.addr and
+ * the length in @iter.len. If no segment was mapped the status code is
+ * returned in @iter.status.
+ */
+bool blk_rq_dma_map_iter_next(struct request *req, struct device *dma_dev,
+ struct dma_iova_state *state, struct blk_dma_iter *iter)
+{
+ struct phys_vec vec;
+
+ if (!blk_map_iter_next(req, &iter->iter, &vec))
+ return false;
+
+ if (iter->p2pdma.map == PCI_P2PDMA_MAP_BUS_ADDR)
+ return blk_dma_map_bus(req, dma_dev, iter, &vec);
+ return blk_dma_map_direct(req, dma_dev, iter, &vec);
+}
+EXPORT_SYMBOL_GPL(blk_rq_dma_map_iter_next);
+
static inline struct scatterlist *blk_next_sg(struct scatterlist **sg,
struct scatterlist *sglist)
{
diff --git a/include/linux/blk-mq-dma.h b/include/linux/blk-mq-dma.h
new file mode 100644
index 000000000000..f1dae14e8203
--- /dev/null
+++ b/include/linux/blk-mq-dma.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef BLK_MQ_DMA_H
+#define BLK_MQ_DMA_H
+
+#include <linux/blk-mq.h>
+#include <linux/pci-p2pdma.h>
+
+struct blk_dma_iter {
+ /* Output address range for this iteration */
+ dma_addr_t addr;
+ u32 len;
+
+ /* Status code. Only valid when blk_rq_dma_map_iter_* returned false */
+ blk_status_t status;
+
+ /* Internal to blk_rq_dma_map_iter_* */
+ struct req_iterator iter;
+ struct pci_p2pdma_map_state p2pdma;
+};
+
+bool blk_rq_dma_map_iter_start(struct request *req, struct device *dma_dev,
+ struct dma_iova_state *state, struct blk_dma_iter *iter);
+bool blk_rq_dma_map_iter_next(struct request *req, struct device *dma_dev,
+ struct dma_iova_state *state, struct blk_dma_iter *iter);
+
+/**
+ * blk_rq_dma_map_coalesce - were all segments coalesced?
+ * @state: DMA state to check
+ *
+ * Returns true if blk_rq_dma_map_iter_start coalesced all segments into a
+ * single DMA range.
+ */
+static inline bool blk_rq_dma_map_coalesce(struct dma_iova_state *state)
+{
+ return dma_use_iova(state);
+}
+
+/**
+ * blk_rq_dma_unmap - try to DMA unmap a request
+ * @req: request to unmap
+ * @dma_dev: device to unmap from
+ * @state: DMA IOVA state
+ *
+ * Returns %false if the caller needs to manually unmap every DMA segment
+ * mapped using @iter or %true if no work is left to be done.
+ */
+static inline bool blk_rq_dma_unmap(struct request *req, struct device *dma_dev,
+ struct dma_iova_state *state)
+{
+ if (req->cmd_flags & REQ_P2PDMA)
+ return true;
+
+ if (dma_use_iova(state)) {
+ dma_iova_destroy(dma_dev, state, rq_dma_dir(req), 0);
+ return true;
+ }
+
+ if (!dma_need_unmap(dma_dev))
+ return true;
+
+ return false;
+}
+
+#endif /* BLK_MQ_DMA_H */
--
2.46.2
* [RFC PATCH 5/7] nvme-pci: remove struct nvme_descriptor
2024-10-27 14:21 [RFC PATCH 0/7] Block and NVMe PCI use of new DMA mapping API Leon Romanovsky
` (3 preceding siblings ...)
2024-10-27 14:21 ` [RFC PATCH 4/7] blk-mq: add scatterlist-less DMA mapping helpers Leon Romanovsky
@ 2024-10-27 14:21 ` Leon Romanovsky
2024-10-27 14:21 ` [RFC PATCH 6/7] nvme-pci: use a better encoding for small prp pool allocations Leon Romanovsky
2024-10-27 14:22 ` [RFC PATCH 7/7] nvme-pci: convert to blk_rq_dma_map Leon Romanovsky
6 siblings, 0 replies; 17+ messages in thread
From: Leon Romanovsky @ 2024-10-27 14:21 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Christoph Hellwig, Sagi Grimberg
Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, Jonathan Corbet,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm
From: Christoph Hellwig <hch@lst.de>
There is no real point in having a union of two pointer types here; just
use a void pointer, as we already mix and match the two types between the
allocation and freeing sides anyway.
Also rename the nr_allocations field to nr_descriptors to better describe
what it does.
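After the conversion the users simply assign the void pointer to
whatever type they need; e.g. the descriptor freeing loop (condensed
from the hunk below) treats every entry as a PRP list:

	for (i = 0; i < iod->nr_descriptors; i++) {
		__le64 *prp_list = iod->descriptors[i];
		dma_addr_t next_dma_addr = le64_to_cpu(prp_list[last_prp]);

		dma_pool_free(dev->prp_page_pool, prp_list, dma_addr);
		dma_addr = next_dma_addr;
	}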
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/nvme/host/pci.c | 51 ++++++++++++++++++-----------------------
1 file changed, 22 insertions(+), 29 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 4b9fda0b1d9a..ba077a42cbba 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -43,7 +43,8 @@
*/
#define NVME_MAX_KB_SZ 8192
#define NVME_MAX_SEGS 128
-#define NVME_MAX_NR_ALLOCATIONS 5
+
+#define NVME_MAX_NR_DESCRIPTORS 5
static int use_threaded_interrupts;
module_param(use_threaded_interrupts, int, 0444);
@@ -216,28 +217,20 @@ struct nvme_queue {
struct completion delete_done;
};
-union nvme_descriptor {
- struct nvme_sgl_desc *sg_list;
- __le64 *prp_list;
-};
-
/*
* The nvme_iod describes the data in an I/O.
- *
- * The sg pointer contains the list of PRP/SGL chunk allocations in addition
- * to the actual struct scatterlist.
*/
struct nvme_iod {
struct nvme_request req;
struct nvme_command cmd;
bool aborted;
- s8 nr_allocations; /* PRP list pool allocations. 0 means small
- pool in use */
+ /* # of PRP/SGL descriptors: (0 for small pool) */
+ s8 nr_descriptors;
unsigned int dma_len; /* length of single DMA segment mapping */
dma_addr_t first_dma;
dma_addr_t meta_dma;
struct sg_table sgt;
- union nvme_descriptor list[NVME_MAX_NR_ALLOCATIONS];
+ void *descriptors[NVME_MAX_NR_DESCRIPTORS];
};
static inline unsigned int nvme_dbbuf_size(struct nvme_dev *dev)
@@ -528,8 +521,8 @@ static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
dma_addr_t dma_addr = iod->first_dma;
int i;
- for (i = 0; i < iod->nr_allocations; i++) {
- __le64 *prp_list = iod->list[i].prp_list;
+ for (i = 0; i < iod->nr_descriptors; i++) {
+ __le64 *prp_list = iod->descriptors[i];
dma_addr_t next_dma_addr = le64_to_cpu(prp_list[last_prp]);
dma_pool_free(dev->prp_page_pool, prp_list, dma_addr);
@@ -551,11 +544,11 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
- if (iod->nr_allocations == 0)
- dma_pool_free(dev->prp_small_pool, iod->list[0].sg_list,
+ if (iod->nr_descriptors == 0)
+ dma_pool_free(dev->prp_small_pool, iod->descriptors[0],
iod->first_dma);
- else if (iod->nr_allocations == 1)
- dma_pool_free(dev->prp_page_pool, iod->list[0].sg_list,
+ else if (iod->nr_descriptors == 1)
+ dma_pool_free(dev->prp_page_pool, iod->descriptors[0],
iod->first_dma);
else
nvme_free_prps(dev, req);
@@ -613,18 +606,18 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
nprps = DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE);
if (nprps <= (256 / 8)) {
pool = dev->prp_small_pool;
- iod->nr_allocations = 0;
+ iod->nr_descriptors = 0;
} else {
pool = dev->prp_page_pool;
- iod->nr_allocations = 1;
+ iod->nr_descriptors = 1;
}
prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
if (!prp_list) {
- iod->nr_allocations = -1;
+ iod->nr_descriptors = -1;
return BLK_STS_RESOURCE;
}
- iod->list[0].prp_list = prp_list;
+ iod->descriptors[0] = prp_list;
iod->first_dma = prp_dma;
i = 0;
for (;;) {
@@ -633,7 +626,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
if (!prp_list)
goto free_prps;
- iod->list[iod->nr_allocations++].prp_list = prp_list;
+ iod->descriptors[iod->nr_descriptors++] = prp_list;
prp_list[0] = old_prp_list[i - 1];
old_prp_list[i - 1] = cpu_to_le64(prp_dma);
i = 1;
@@ -703,19 +696,19 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
if (entries <= (256 / sizeof(struct nvme_sgl_desc))) {
pool = dev->prp_small_pool;
- iod->nr_allocations = 0;
+ iod->nr_descriptors = 0;
} else {
pool = dev->prp_page_pool;
- iod->nr_allocations = 1;
+ iod->nr_descriptors = 1;
}
sg_list = dma_pool_alloc(pool, GFP_ATOMIC, &sgl_dma);
if (!sg_list) {
- iod->nr_allocations = -1;
+ iod->nr_descriptors = -1;
return BLK_STS_RESOURCE;
}
- iod->list[0].sg_list = sg_list;
+ iod->descriptors[0] = sg_list;
iod->first_dma = sgl_dma;
nvme_pci_sgl_set_seg(&cmd->dptr.sgl, sgl_dma, entries);
@@ -841,7 +834,7 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
blk_status_t ret;
iod->aborted = false;
- iod->nr_allocations = -1;
+ iod->nr_descriptors = -1;
iod->sgt.nents = 0;
ret = nvme_setup_cmd(req->q->queuedata, req);
@@ -3626,7 +3619,7 @@ static int __init nvme_init(void)
BUILD_BUG_ON(IRQ_AFFINITY_MAX_SETS < 2);
BUILD_BUG_ON(NVME_MAX_SEGS > SGES_PER_PAGE);
BUILD_BUG_ON(sizeof(struct scatterlist) * NVME_MAX_SEGS > PAGE_SIZE);
- BUILD_BUG_ON(nvme_pci_npages_prp() > NVME_MAX_NR_ALLOCATIONS);
+ BUILD_BUG_ON(nvme_pci_npages_prp() > NVME_MAX_NR_DESCRIPTORS);
return pci_register_driver(&nvme_driver);
}
--
2.46.2
* [RFC PATCH 6/7] nvme-pci: use a better encoding for small prp pool allocations
2024-10-27 14:21 [RFC PATCH 0/7] Block and NVMe PCI use of new DMA mapping API Leon Romanovsky
` (4 preceding siblings ...)
2024-10-27 14:21 ` [RFC PATCH 5/7] nvme-pci: remove struct nvme_descriptor Leon Romanovsky
@ 2024-10-27 14:21 ` Leon Romanovsky
2024-10-27 14:22 ` [RFC PATCH 7/7] nvme-pci: convert to blk_rq_dma_map Leon Romanovsky
6 siblings, 0 replies; 17+ messages in thread
From: Leon Romanovsky @ 2024-10-27 14:21 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Christoph Hellwig, Sagi Grimberg
Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, Jonathan Corbet,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm
From: Christoph Hellwig <hch@lst.de>
There is plenty of unused space in the iod next to nr_descriptors.
Add a separate bool (which could be condensed to a single bit once we
start running out of space) to encode that the transfer is using
the full page sized pool, and use a normal 0..n count for the number of
descriptors.
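With the new encoding, picking the pool becomes a simple predicate on
the iod, condensed here from the hunks below:

	static inline struct dma_pool *nvme_dma_pool(struct nvme_dev *dev,
			struct nvme_iod *iod)
	{
		if (iod->large_descriptors)
			return dev->prp_page_pool;
		return dev->prp_small_pool;
	}

	/* at setup time, e.g. for PRPs: */
	if (DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE) >
	    NVME_SMALL_DESCRIPTOR_SIZE / sizeof(__le64))
		iod->large_descriptors = true;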
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/nvme/host/pci.c | 85 +++++++++++++++++++----------------------
1 file changed, 40 insertions(+), 45 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ba077a42cbba..79cd65a5f311 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -45,6 +45,7 @@
#define NVME_MAX_SEGS 128
#define NVME_MAX_NR_DESCRIPTORS 5
+#define NVME_SMALL_DESCRIPTOR_SIZE 256
static int use_threaded_interrupts;
module_param(use_threaded_interrupts, int, 0444);
@@ -224,8 +225,8 @@ struct nvme_iod {
struct nvme_request req;
struct nvme_command cmd;
bool aborted;
- /* # of PRP/SGL descriptors: (0 for small pool) */
- s8 nr_descriptors;
+ u8 nr_descriptors; /* # of PRP/SGL descriptors */
+ bool large_descriptors; /* uses the full page sized descriptor pool */
unsigned int dma_len; /* length of single DMA segment mapping */
dma_addr_t first_dma;
dma_addr_t meta_dma;
@@ -514,13 +515,27 @@ static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
return true;
}
-static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
+static inline struct dma_pool *nvme_dma_pool(struct nvme_dev *dev,
+ struct nvme_iod *iod)
+{
+ if (iod->large_descriptors)
+ return dev->prp_page_pool;
+ return dev->prp_small_pool;
+}
+
+static void nvme_free_descriptors(struct nvme_dev *dev, struct request *req)
{
const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
dma_addr_t dma_addr = iod->first_dma;
int i;
+ if (iod->nr_descriptors == 1) {
+ dma_pool_free(nvme_dma_pool(dev, iod), iod->descriptors[0],
+ dma_addr);
+ return;
+ }
+
for (i = 0; i < iod->nr_descriptors; i++) {
__le64 *prp_list = iod->descriptors[i];
dma_addr_t next_dma_addr = le64_to_cpu(prp_list[last_prp]);
@@ -543,15 +558,7 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
WARN_ON_ONCE(!iod->sgt.nents);
dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
-
- if (iod->nr_descriptors == 0)
- dma_pool_free(dev->prp_small_pool, iod->descriptors[0],
- iod->first_dma);
- else if (iod->nr_descriptors == 1)
- dma_pool_free(dev->prp_page_pool, iod->descriptors[0],
- iod->first_dma);
- else
- nvme_free_prps(dev, req);
+ nvme_free_descriptors(dev, req);
mempool_free(iod->sgt.sgl, dev->iod_mempool);
}
@@ -573,7 +580,6 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
struct request *req, struct nvme_rw_command *cmnd)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
- struct dma_pool *pool;
int length = blk_rq_payload_bytes(req);
struct scatterlist *sg = iod->sgt.sgl;
int dma_len = sg_dma_len(sg);
@@ -581,7 +587,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
__le64 *prp_list;
dma_addr_t prp_dma;
- int nprps, i;
+ int i;
length -= (NVME_CTRL_PAGE_SIZE - offset);
if (length <= 0) {
@@ -603,27 +609,23 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
goto done;
}
- nprps = DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE);
- if (nprps <= (256 / 8)) {
- pool = dev->prp_small_pool;
- iod->nr_descriptors = 0;
- } else {
- pool = dev->prp_page_pool;
- iod->nr_descriptors = 1;
- }
+ if (DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE) >
+ NVME_SMALL_DESCRIPTOR_SIZE / sizeof(__le64))
+ iod->large_descriptors = true;
- prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
- if (!prp_list) {
- iod->nr_descriptors = -1;
+ prp_list = dma_pool_alloc(nvme_dma_pool(dev, iod), GFP_ATOMIC,
+ &prp_dma);
+ if (!prp_list)
return BLK_STS_RESOURCE;
- }
- iod->descriptors[0] = prp_list;
+ iod->descriptors[iod->nr_descriptors++] = prp_list;
iod->first_dma = prp_dma;
i = 0;
for (;;) {
if (i == NVME_CTRL_PAGE_SIZE >> 3) {
__le64 *old_prp_list = prp_list;
- prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
+
+ prp_list = dma_pool_alloc(dev->prp_page_pool,
+ GFP_ATOMIC, &prp_dma);
if (!prp_list)
goto free_prps;
iod->descriptors[iod->nr_descriptors++] = prp_list;
@@ -650,7 +652,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
return BLK_STS_OK;
free_prps:
- nvme_free_prps(dev, req);
+ nvme_free_descriptors(dev, req);
return BLK_STS_RESOURCE;
bad_sgl:
WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
@@ -679,7 +681,6 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
struct request *req, struct nvme_rw_command *cmd)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
- struct dma_pool *pool;
struct nvme_sgl_desc *sg_list;
struct scatterlist *sg = iod->sgt.sgl;
unsigned int entries = iod->sgt.nents;
@@ -694,21 +695,13 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
return BLK_STS_OK;
}
- if (entries <= (256 / sizeof(struct nvme_sgl_desc))) {
- pool = dev->prp_small_pool;
- iod->nr_descriptors = 0;
- } else {
- pool = dev->prp_page_pool;
- iod->nr_descriptors = 1;
- }
+ if (entries > NVME_SMALL_DESCRIPTOR_SIZE / sizeof(*sg_list))
+ iod->large_descriptors = true;
- sg_list = dma_pool_alloc(pool, GFP_ATOMIC, &sgl_dma);
- if (!sg_list) {
- iod->nr_descriptors = -1;
+ sg_list = dma_pool_alloc(nvme_dma_pool(dev, iod), GFP_ATOMIC, &sgl_dma);
+ if (!sg_list)
return BLK_STS_RESOURCE;
- }
-
- iod->descriptors[0] = sg_list;
+ iod->descriptors[iod->nr_descriptors++] = sg_list;
iod->first_dma = sgl_dma;
nvme_pci_sgl_set_seg(&cmd->dptr.sgl, sgl_dma, entries);
@@ -834,7 +827,8 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
blk_status_t ret;
iod->aborted = false;
- iod->nr_descriptors = -1;
+ iod->nr_descriptors = 0;
+ iod->large_descriptors = false;
iod->sgt.nents = 0;
ret = nvme_setup_cmd(req->q->queuedata, req);
@@ -2694,7 +2688,8 @@ static int nvme_setup_prp_pools(struct nvme_dev *dev)
/* Optimisation for I/Os between 4k and 128k */
dev->prp_small_pool = dma_pool_create("prp list 256", dev->dev,
- 256, 256, 0);
+ NVME_SMALL_DESCRIPTOR_SIZE,
+ NVME_SMALL_DESCRIPTOR_SIZE, 0);
if (!dev->prp_small_pool) {
dma_pool_destroy(dev->prp_page_pool);
return -ENOMEM;
--
2.46.2
* [RFC PATCH 7/7] nvme-pci: convert to blk_rq_dma_map
2024-10-27 14:21 [RFC PATCH 0/7] Block and NVMe PCI use of new DMA mapping API Leon Romanovsky
` (5 preceding siblings ...)
2024-10-27 14:21 ` [RFC PATCH 6/7] nvme-pci: use a better encoding for small prp pool allocations Leon Romanovsky
@ 2024-10-27 14:22 ` Leon Romanovsky
6 siblings, 0 replies; 17+ messages in thread
From: Leon Romanovsky @ 2024-10-27 14:22 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Christoph Hellwig, Sagi Grimberg
Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, Jonathan Corbet,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm
From: Christoph Hellwig <hch@lst.de>
Use the blk_rq_dma_map API to DMA map requests instead of
scatterlists. This also removes the fast-path single-segment
code, as blk_rq_dma_map naturally inlines single IOVA
segment mappings into the preallocated structure.
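Since the request no longer carries an sg_table, the teardown path
recovers the DMA addresses from the SQE itself. The new
nvme_unmap_data (taken from the hunk below, with an explanatory
comment added) shows the split between the generic helper and the
manual walk:

	static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
	{
		struct nvme_iod *iod = blk_mq_rq_to_pdu(req);

		/*
		 * blk_rq_dma_unmap handles the P2P and IOVA cases; only if
		 * it returns false do we walk the PRPs/SGLs recorded in the
		 * SQE and unmap each segment by hand.
		 */
		if (!blk_rq_dma_unmap(req, dev->dev, &iod->dma_state)) {
			if (iod->cmd.common.flags & NVME_CMD_SGL_METABUF)
				nvme_free_sgls(dev, req);
			else
				nvme_free_prps(dev, req);
		}

		if (iod->nr_descriptors)
			nvme_free_descriptors(dev, req);
	}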
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/nvme/host/pci.c | 382 +++++++++++++++++++++-------------------
1 file changed, 205 insertions(+), 177 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 79cd65a5f311..f41db1efecb1 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -7,7 +7,7 @@
#include <linux/acpi.h>
#include <linux/async.h>
#include <linux/blkdev.h>
-#include <linux/blk-mq.h>
+#include <linux/blk-mq-dma.h>
#include <linux/blk-mq-pci.h>
#include <linux/blk-integrity.h>
#include <linux/dmi.h>
@@ -27,7 +27,6 @@
#include <linux/io-64-nonatomic-lo-hi.h>
#include <linux/io-64-nonatomic-hi-lo.h>
#include <linux/sed-opal.h>
-#include <linux/pci-p2pdma.h>
#include "trace.h"
#include "nvme.h"
@@ -227,10 +226,9 @@ struct nvme_iod {
bool aborted;
u8 nr_descriptors; /* # of PRP/SGL descriptors */
bool large_descriptors; /* uses the full page sized descriptor pool */
- unsigned int dma_len; /* length of single DMA segment mapping */
- dma_addr_t first_dma;
+ unsigned int total_len; /* length of the entire transfer */
dma_addr_t meta_dma;
- struct sg_table sgt;
+ struct dma_iova_state dma_state;
void *descriptors[NVME_MAX_NR_DESCRIPTORS];
};
@@ -527,9 +525,14 @@ static void nvme_free_descriptors(struct nvme_dev *dev, struct request *req)
{
const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
- dma_addr_t dma_addr = iod->first_dma;
+ dma_addr_t dma_addr;
int i;
+ if (iod->cmd.common.flags & NVME_CMD_SGL_METABUF)
+ dma_addr = le64_to_cpu(iod->cmd.common.dptr.sgl.addr);
+ else
+ dma_addr = le64_to_cpu(iod->cmd.common.dptr.prp2);
+
if (iod->nr_descriptors == 1) {
dma_pool_free(nvme_dma_pool(dev, iod), iod->descriptors[0],
dma_addr);
@@ -545,67 +548,143 @@ static void nvme_free_descriptors(struct nvme_dev *dev, struct request *req)
}
}
-static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
+static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-
- if (iod->dma_len) {
- dma_unmap_page(dev->dev, iod->first_dma, iod->dma_len,
- rq_dma_dir(req));
+ enum dma_data_direction dir = rq_dma_dir(req);
+ int length = iod->total_len;
+ dma_addr_t dma_addr;
+ int prp_len, nprps, i, desc;
+ __le64 *prp_list;
+ dma_addr_t dma_start;
+ u32 dma_len;
+
+ dma_addr = le64_to_cpu(iod->cmd.common.dptr.prp1);
+ prp_len = NVME_CTRL_PAGE_SIZE - (dma_addr & (NVME_CTRL_PAGE_SIZE - 1));
+ prp_len = min(length, prp_len);
+ length -= prp_len;
+ if (!length) {
+ dma_unmap_page(dev->dev, dma_addr, prp_len, dir);
return;
}
- WARN_ON_ONCE(!iod->sgt.nents);
+ dma_start = dma_addr;
+ dma_len = prp_len;
- dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
- nvme_free_descriptors(dev, req);
- mempool_free(iod->sgt.sgl, dev->iod_mempool);
+ dma_addr = le64_to_cpu(iod->cmd.common.dptr.prp2);
+ if (length <= NVME_CTRL_PAGE_SIZE) {
+ if (dma_addr != dma_start + dma_len) {
+ dma_unmap_page(dev->dev, dma_start, dma_len, dir);
+ dma_start = dma_addr;
+ dma_len = 0;
+ }
+ dma_len += length;
+ goto done;
+ }
+
+ nprps = DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE);
+ i = 0;
+ desc = 0;
+ prp_list = iod->descriptors[desc];
+ do {
+ if (i == NVME_CTRL_PAGE_SIZE >> 3) {
+ prp_list = iod->descriptors[++desc];
+ i = 0;
+ }
+
+ dma_addr = le64_to_cpu(prp_list[i++]);
+ if (dma_addr != dma_start + dma_len) {
+ dma_unmap_page(dev->dev, dma_start, dma_len, dir);
+ dma_start = dma_addr;
+ dma_len = 0;
+ }
+ prp_len = min(length, NVME_CTRL_PAGE_SIZE);
+ dma_len += prp_len;
+ length -= prp_len;
+ } while (length);
+done:
+ dma_unmap_page(dev->dev, dma_start, dma_len, dir);
}
-static void nvme_print_sgl(struct scatterlist *sgl, int nents)
+static void nvme_free_sgls(struct nvme_dev *dev, struct request *req)
{
- int i;
- struct scatterlist *sg;
+ struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ dma_addr_t sqe_dma_addr = le64_to_cpu(iod->cmd.common.dptr.sgl.addr);
+ unsigned int sqe_dma_len = le32_to_cpu(iod->cmd.common.dptr.sgl.length);
+ struct nvme_sgl_desc *sg_list = iod->descriptors[0];
+ enum dma_data_direction dir = rq_dma_dir(req);
+
+ if (iod->nr_descriptors) {
+ unsigned int nr_entries = sqe_dma_len / sizeof(*sg_list), i;
+
+ for (i = 0; i < nr_entries; i++)
+ dma_unmap_page(dev->dev, le64_to_cpu(sg_list[i].addr),
+ le32_to_cpu(sg_list[i].length), dir);
+ } else {
+ dma_unmap_page(dev->dev, sqe_dma_addr, sqe_dma_len, dir);
+ }
+}
+
+static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
+{
+ struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
- for_each_sg(sgl, sg, nents, i) {
- dma_addr_t phys = sg_phys(sg);
- pr_warn("sg[%d] phys_addr:%pad offset:%d length:%d "
- "dma_address:%pad dma_length:%d\n",
- i, &phys, sg->offset, sg->length, &sg_dma_address(sg),
- sg_dma_len(sg));
+ if (!blk_rq_dma_unmap(req, dev->dev, &iod->dma_state)) {
+ if (iod->cmd.common.flags & NVME_CMD_SGL_METABUF)
+ nvme_free_sgls(dev, req);
+ else
+ nvme_free_prps(dev, req);
}
+
+ if (iod->nr_descriptors)
+ nvme_free_descriptors(dev, req);
}
static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
struct request *req, struct nvme_rw_command *cmnd)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
- int length = blk_rq_payload_bytes(req);
- struct scatterlist *sg = iod->sgt.sgl;
- int dma_len = sg_dma_len(sg);
- u64 dma_addr = sg_dma_address(sg);
- int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
+ unsigned int length = blk_rq_payload_bytes(req);
+ struct blk_dma_iter iter;
+ dma_addr_t prp1_dma, prp2_dma = 0;
+ unsigned int prp_len, i;
__le64 *prp_list;
- dma_addr_t prp_dma;
- int i;
- length -= (NVME_CTRL_PAGE_SIZE - offset);
- if (length <= 0) {
- iod->first_dma = 0;
+ if (!blk_rq_dma_map_iter_start(req, dev->dev, &iod->dma_state, &iter))
+ return iter.status;
+
+ /*
+ * PRP1 always points to the start of the DMA transfers.
+ *
+ * This is the only PRP (except for the list entries) that could be
+ * non-aligned.
+ */
+ prp1_dma = iter.addr;
+ prp_len = min(length, NVME_CTRL_PAGE_SIZE -
+ (iter.addr & (NVME_CTRL_PAGE_SIZE - 1)));
+ iod->total_len += prp_len;
+ iter.addr += prp_len;
+ iter.len -= prp_len;
+ length -= prp_len;
+ if (!length)
goto done;
- }
- dma_len -= (NVME_CTRL_PAGE_SIZE - offset);
- if (dma_len) {
- dma_addr += (NVME_CTRL_PAGE_SIZE - offset);
- } else {
- sg = sg_next(sg);
- dma_addr = sg_dma_address(sg);
- dma_len = sg_dma_len(sg);
+ if (!iter.len) {
+ if (!blk_rq_dma_map_iter_next(req, dev->dev, &iod->dma_state,
+ &iter)) {
+ if (WARN_ON_ONCE(!iter.status))
+ goto bad_sgl;
+ goto done;
+ }
}
+ /*
+ * PRP2 is usually a list, but can point to data if all data to be
+ * transferred fits into PRP1 + PRP2:
+ */
if (length <= NVME_CTRL_PAGE_SIZE) {
- iod->first_dma = dma_addr;
+ prp2_dma = iter.addr;
+ iod->total_len += length;
goto done;
}
@@ -614,58 +693,83 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
iod->large_descriptors = true;
prp_list = dma_pool_alloc(nvme_dma_pool(dev, iod), GFP_ATOMIC,
- &prp_dma);
- if (!prp_list)
- return BLK_STS_RESOURCE;
+ &prp2_dma);
+ if (!prp_list) {
+ iter.status = BLK_STS_RESOURCE;
+ goto done;
+ }
iod->descriptors[iod->nr_descriptors++] = prp_list;
- iod->first_dma = prp_dma;
+
i = 0;
for (;;) {
+ prp_list[i++] = cpu_to_le64(iter.addr);
+ prp_len = min(length, NVME_CTRL_PAGE_SIZE);
+ if (WARN_ON_ONCE(iter.len < prp_len))
+ goto bad_sgl;
+
+ iod->total_len += prp_len;
+ iter.addr += prp_len;
+ iter.len -= prp_len;
+ length -= prp_len;
+ if (!length)
+ break;
+
+ if (iter.len == 0) {
+ if (!blk_rq_dma_map_iter_next(req, dev->dev,
+ &iod->dma_state, &iter)) {
+ if (WARN_ON_ONCE(!iter.status))
+ goto bad_sgl;
+ goto done;
+ }
+ }
+
+ /*
+ * If we've filled the entire descriptor, allocate a new one that is
+ * pointed to by the last entry in the previous PRP list. To
+ * accommodate that, move the last actual entry to the new
+ * descriptor.
+ */
if (i == NVME_CTRL_PAGE_SIZE >> 3) {
__le64 *old_prp_list = prp_list;
+ dma_addr_t prp_list_dma;
prp_list = dma_pool_alloc(dev->prp_page_pool,
- GFP_ATOMIC, &prp_dma);
- if (!prp_list)
- goto free_prps;
+ GFP_ATOMIC, &prp_list_dma);
+ if (!prp_list) {
+ iter.status = BLK_STS_RESOURCE;
+ goto done;
+ }
iod->descriptors[iod->nr_descriptors++] = prp_list;
+
prp_list[0] = old_prp_list[i - 1];
- old_prp_list[i - 1] = cpu_to_le64(prp_dma);
+ old_prp_list[i - 1] = cpu_to_le64(prp_list_dma);
i = 1;
}
- prp_list[i++] = cpu_to_le64(dma_addr);
- dma_len -= NVME_CTRL_PAGE_SIZE;
- dma_addr += NVME_CTRL_PAGE_SIZE;
- length -= NVME_CTRL_PAGE_SIZE;
- if (length <= 0)
- break;
- if (dma_len > 0)
- continue;
- if (unlikely(dma_len < 0))
- goto bad_sgl;
- sg = sg_next(sg);
- dma_addr = sg_dma_address(sg);
- dma_len = sg_dma_len(sg);
}
+
done:
- cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
- cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
- return BLK_STS_OK;
-free_prps:
- nvme_free_descriptors(dev, req);
- return BLK_STS_RESOURCE;
+ /*
+ * nvme_unmap_data uses the DPTR field in the SQE to tear down the
+ * mapping, so initialize it even for failures.
+ */
+ cmnd->dptr.prp1 = cpu_to_le64(prp1_dma);
+ cmnd->dptr.prp2 = cpu_to_le64(prp2_dma);
+ if (unlikely(iter.status))
+ nvme_unmap_data(dev, req);
+ return iter.status;
+
bad_sgl:
- WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
- "Invalid SGL for payload:%d nents:%d\n",
- blk_rq_payload_bytes(req), iod->sgt.nents);
+ dev_err_once(dev->dev,
+ "Incorrectly formed request for payload:%d nents:%d\n",
+ blk_rq_payload_bytes(req), blk_rq_nr_phys_segments(req));
return BLK_STS_IOERR;
}
static void nvme_pci_sgl_set_data(struct nvme_sgl_desc *sge,
- struct scatterlist *sg)
+ struct blk_dma_iter *iter)
{
- sge->addr = cpu_to_le64(sg_dma_address(sg));
- sge->length = cpu_to_le32(sg_dma_len(sg));
+ sge->addr = cpu_to_le64(iter->addr);
+ sge->length = cpu_to_le32(iter->len);
sge->type = NVME_SGL_FMT_DATA_DESC << 4;
}
@@ -681,17 +785,21 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
struct request *req, struct nvme_rw_command *cmd)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ unsigned int entries = blk_rq_nr_phys_segments(req);
struct nvme_sgl_desc *sg_list;
- struct scatterlist *sg = iod->sgt.sgl;
- unsigned int entries = iod->sgt.nents;
+ struct blk_dma_iter iter;
dma_addr_t sgl_dma;
- int i = 0;
+ unsigned int mapped = 0;
/* setting the transfer type as SGL */
cmd->flags = NVME_CMD_SGL_METABUF;
- if (entries == 1) {
- nvme_pci_sgl_set_data(&cmd->dptr.sgl, sg);
+ if (!blk_rq_dma_map_iter_start(req, dev->dev, &iod->dma_state, &iter))
+ return iter.status;
+
+ if (entries == 1 || blk_rq_dma_map_coalesce(&iod->dma_state)) {
+ nvme_pci_sgl_set_data(&cmd->dptr.sgl, &iter);
+ iod->total_len += iter.len;
return BLK_STS_OK;
}
@@ -702,110 +810,30 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
if (!sg_list)
return BLK_STS_RESOURCE;
iod->descriptors[iod->nr_descriptors++] = sg_list;
- iod->first_dma = sgl_dma;
- nvme_pci_sgl_set_seg(&cmd->dptr.sgl, sgl_dma, entries);
do {
- nvme_pci_sgl_set_data(&sg_list[i++], sg);
- sg = sg_next(sg);
- } while (--entries > 0);
-
- return BLK_STS_OK;
-}
-
-static blk_status_t nvme_setup_prp_simple(struct nvme_dev *dev,
- struct request *req, struct nvme_rw_command *cmnd,
- struct bio_vec *bv)
-{
- struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
- unsigned int offset = bv->bv_offset & (NVME_CTRL_PAGE_SIZE - 1);
- unsigned int first_prp_len = NVME_CTRL_PAGE_SIZE - offset;
-
- iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
- if (dma_mapping_error(dev->dev, iod->first_dma))
- return BLK_STS_RESOURCE;
- iod->dma_len = bv->bv_len;
-
- cmnd->dptr.prp1 = cpu_to_le64(iod->first_dma);
- if (bv->bv_len > first_prp_len)
- cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma + first_prp_len);
- else
- cmnd->dptr.prp2 = 0;
- return BLK_STS_OK;
-}
-
-static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
- struct request *req, struct nvme_rw_command *cmnd,
- struct bio_vec *bv)
-{
- struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ if (WARN_ON_ONCE(mapped == entries)) {
+ iter.status = BLK_STS_IOERR;
+ break;
+ }
+ nvme_pci_sgl_set_data(&sg_list[mapped++], &iter);
+ iod->total_len += iter.len;
+ } while (blk_rq_dma_map_iter_next(req, dev->dev, &iod->dma_state,
+ &iter));
- iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
- if (dma_mapping_error(dev->dev, iod->first_dma))
- return BLK_STS_RESOURCE;
- iod->dma_len = bv->bv_len;
+ nvme_pci_sgl_set_seg(&cmd->dptr.sgl, sgl_dma, mapped);
- cmnd->flags = NVME_CMD_SGL_METABUF;
- cmnd->dptr.sgl.addr = cpu_to_le64(iod->first_dma);
- cmnd->dptr.sgl.length = cpu_to_le32(iod->dma_len);
- cmnd->dptr.sgl.type = NVME_SGL_FMT_DATA_DESC << 4;
- return BLK_STS_OK;
+ if (unlikely(iter.status))
+ nvme_free_sgls(dev, req);
+ return iter.status;
}
static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
struct nvme_command *cmnd)
{
- struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
- blk_status_t ret = BLK_STS_RESOURCE;
- int rc;
-
- if (blk_rq_nr_phys_segments(req) == 1) {
- struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
- struct bio_vec bv = req_bvec(req);
-
- if (!is_pci_p2pdma_page(bv.bv_page)) {
- if ((bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1)) +
- bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)
- return nvme_setup_prp_simple(dev, req,
- &cmnd->rw, &bv);
-
- if (nvmeq->qid && sgl_threshold &&
- nvme_ctrl_sgl_supported(&dev->ctrl))
- return nvme_setup_sgl_simple(dev, req,
- &cmnd->rw, &bv);
- }
- }
-
- iod->dma_len = 0;
- iod->sgt.sgl = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
- if (!iod->sgt.sgl)
- return BLK_STS_RESOURCE;
- sg_init_table(iod->sgt.sgl, blk_rq_nr_phys_segments(req));
- iod->sgt.orig_nents = blk_rq_map_sg(req->q, req, iod->sgt.sgl);
- if (!iod->sgt.orig_nents)
- goto out_free_sg;
-
- rc = dma_map_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req),
- DMA_ATTR_NO_WARN);
- if (rc) {
- if (rc == -EREMOTEIO)
- ret = BLK_STS_TARGET;
- goto out_free_sg;
- }
-
- if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
- ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
- else
- ret = nvme_pci_setup_prps(dev, req, &cmnd->rw);
- if (ret != BLK_STS_OK)
- goto out_unmap_sg;
- return BLK_STS_OK;
-
-out_unmap_sg:
- dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
-out_free_sg:
- mempool_free(iod->sgt.sgl, dev->iod_mempool);
- return ret;
+ if (nvme_pci_use_sgls(dev, req, blk_rq_nr_phys_segments(req)))
+ return nvme_pci_setup_sgls(dev, req, &cmnd->rw);
+ return nvme_pci_setup_prps(dev, req, &cmnd->rw);
}
static blk_status_t nvme_map_metadata(struct nvme_dev *dev, struct request *req,
@@ -829,7 +857,7 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
iod->aborted = false;
iod->nr_descriptors = 0;
iod->large_descriptors = false;
- iod->sgt.nents = 0;
+ iod->total_len = 0;
ret = nvme_setup_cmd(req->q->queuedata, req);
if (ret)
--
2.46.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
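For orientation, the conversion above follows one consumption pattern for the
new iterator. The following is only a minimal sketch, assuming the
blk_rq_dma_map_iter_start()/blk_rq_dma_map_iter_next()/blk_rq_dma_map_coalesce()
signatures and the struct blk_dma_iter / struct dma_iova_state types as they
appear in the hunks above (from include/linux/blk-mq-dma.h added earlier in the
series); demo_consume_range() is a hypothetical placeholder for building PRP or
SGL entries and is not part of the series.

/*
 * Minimal consumer sketch of the scatterlist-less mapping walk used by
 * the converted PRP/SGL paths above.  demo_consume_range() is a
 * hypothetical placeholder; everything else is assumed from this series.
 */
static blk_status_t demo_dma_map(struct request *req, struct device *dma_dev,
		struct dma_iova_state *state)
{
	struct blk_dma_iter iter;

	/* Start the walk; on failure iter.status carries the blk_status_t. */
	if (!blk_rq_dma_map_iter_start(req, dma_dev, state, &iter))
		return iter.status;

	/*
	 * If the IOMMU coalesced the request into a single IOVA range,
	 * one descriptor covering iter.addr / iter.len is enough.
	 */
	if (blk_rq_dma_map_coalesce(state)) {
		demo_consume_range(iter.addr, iter.len);
		return BLK_STS_OK;
	}

	/* Otherwise walk the mapped ranges one at a time. */
	do {
		demo_consume_range(iter.addr, iter.len);
	} while (blk_rq_dma_map_iter_next(req, dma_dev, state, &iter));

	/* A false return with a set status means the walk failed mid-way. */
	return iter.status;
}

In the actual patch, nvme_pci_setup_prps() additionally splits each mapped
range into NVME_CTRL_PAGE_SIZE-sized PRP entries, while nvme_pci_setup_sgls()
emits one SGL descriptor per range.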
* Re: [RFC PATCH 2/7] block: don't merge different kinds of P2P transfers in a single bio
2024-10-27 14:21 ` [RFC PATCH 2/7] block: don't merge different kinds of P2P transfers in a single bio Leon Romanovsky
@ 2024-10-28 18:27 ` Logan Gunthorpe
2024-10-31 20:58 ` Bart Van Assche
2024-11-02 7:39 ` Zhu Yanjun
2 siblings, 0 replies; 17+ messages in thread
From: Logan Gunthorpe @ 2024-10-28 18:27 UTC (permalink / raw)
To: Leon Romanovsky, Jens Axboe, Jason Gunthorpe, Robin Murphy,
Joerg Roedel, Will Deacon, Christoph Hellwig, Sagi Grimberg
Cc: Keith Busch, Bjorn Helgaas, Yishai Hadas, Shameer Kolothum,
Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, Jonathan Corbet,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm
On 2024-10-27 08:21, Leon Romanovsky wrote:
> From: Christoph Hellwig <hch@lst.de>
>
> To get out of the dma mapping helpers having to check every segment for
> its P2P status, ensure that bios either contain P2P transfers or non-P2P
> transfers, and that a P2P bio only contains ranges from a single device.
>
> This means we do the page zone access in the bio add path, where the page
> should still be cache hot, and will only have to do the fairly expensive
> P2P topology lookup once per bio down in the dma mapping path, and only
> for already marked bios.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Looks good to me.
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
^ permalink raw reply [flat|nested] 17+ messages in thread
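In code terms, the rule described in the commit message amounts to a check
like the following at bio add time. This is only a condensed, illustrative
sketch: the helper names (is_pci_p2pdma_page(), zone_device_pages_have_same_pgmap(),
REQ_P2PDMA, REQ_NOMERGE) are the ones used by the patch, while the function
itself is hypothetical; the real hunks in bio_add_hw_page() and
bio_add_page_int() are what actually apply.

/*
 * Condensed sketch of the per-page rule: the first P2P page marks the
 * whole bio, and any later page must come from the same pgmap.  This is
 * illustrative only, not the actual hunk from the patch.
 */
static bool demo_bio_accepts_page(struct bio *bio, struct page *page)
{
	if (bio->bi_vcnt == 0) {
		/* The first page decides whether this is a P2P bio at all. */
		if (is_pci_p2pdma_page(page))
			bio->bi_opf |= REQ_P2PDMA | REQ_NOMERGE;
		return true;
	}

	/* A P2P bio may only grow with pages from the same device. */
	if ((bio->bi_opf & REQ_P2PDMA) &&
	    !zone_device_pages_have_same_pgmap(
			bio->bi_io_vec[bio->bi_vcnt - 1].bv_page, page))
		return false;

	return true;
}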
* Re: [RFC PATCH 1/7] block: share more code for bio addition helpers
2024-10-27 14:21 ` [RFC PATCH 1/7] block: share more code for bio addition helpers Leon Romanovsky
@ 2024-10-31 20:55 ` Bart Van Assche
2024-11-04 8:55 ` Christoph Hellwig
0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2024-10-31 20:55 UTC (permalink / raw)
To: Leon Romanovsky, Jens Axboe, Jason Gunthorpe, Robin Murphy,
Joerg Roedel, Will Deacon, Christoph Hellwig, Sagi Grimberg
Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, Jonathan Corbet,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm
On 10/27/24 7:21 AM, Leon Romanovsky wrote:
> +static int bio_add_zone_append_page_int(struct bio *bio, struct page *page,
> + unsigned int len, unsigned int offset, bool *same_page)
> +{
> + struct block_device *bdev = bio->bi_bdev;
> +
> + if (WARN_ON_ONCE(bio_op(bio) != REQ_OP_ZONE_APPEND))
> + return 0;
> + if (WARN_ON_ONCE(!bdev_is_zoned(bdev)))
> + return 0;
> + return bio_add_hw_page(bdev_get_queue(bdev), bio, page, len, offset,
> + bdev_max_zone_append_sectors(bdev), same_page);
> +}
Does "_int" stand for "_internal"? If so, please consider changing it
into "_impl". I think that will prevent that anyone confuses this suffix
with the "int" data type.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 2/7] block: don't merge different kinds of P2P transfers in a single bio
2024-10-27 14:21 ` [RFC PATCH 2/7] block: don't merge different kinds of P2P transfers in a single bio Leon Romanovsky
2024-10-28 18:27 ` Logan Gunthorpe
@ 2024-10-31 20:58 ` Bart Van Assche
2024-11-01 6:11 ` Leon Romanovsky
2024-11-04 8:55 ` Christoph Hellwig
2024-11-02 7:39 ` Zhu Yanjun
2 siblings, 2 replies; 17+ messages in thread
From: Bart Van Assche @ 2024-10-31 20:58 UTC (permalink / raw)
To: Leon Romanovsky, Jens Axboe, Jason Gunthorpe, Robin Murphy,
Joerg Roedel, Will Deacon, Christoph Hellwig, Sagi Grimberg
Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, Jonathan Corbet,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm
On 10/27/24 7:21 AM, Leon Romanovsky wrote:
> + /*
> + * When doing ZONE_DEVICE-based P2P transfers, all pages in a
> + * bio must be P2P pages from the same device.
> + */
> + if ((bio->bi_opf & REQ_P2PDMA) &&
> + !zone_device_pages_have_same_pgmap(bv->bv_page, page))
> + return 0;
It's probably too late to change the "zone_device_" prefix into
something that cannot be confused with a reference to zoned block
devices?
Thanks,
Bart.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 2/7] block: don't merge different kinds of P2P transfers in a single bio
2024-10-31 20:58 ` Bart Van Assche
@ 2024-11-01 6:11 ` Leon Romanovsky
2024-11-04 8:55 ` Christoph Hellwig
1 sibling, 0 replies; 17+ messages in thread
From: Leon Romanovsky @ 2024-11-01 6:11 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Christoph Hellwig, Sagi Grimberg, Keith Busch,
Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, Jonathan Corbet,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm
On Thu, Oct 31, 2024 at 01:58:37PM -0700, Bart Van Assche wrote:
> On 10/27/24 7:21 AM, Leon Romanovsky wrote:
> > + /*
> > + * When doing ZONE_DEVICE-based P2P transfers, all pages in a
> > + * bio must be P2P pages from the same device.
> > + */
> > + if ((bio->bi_opf & REQ_P2PDMA) &&
> > + !zone_device_pages_have_same_pgmap(bv->bv_page, page))
> > + return 0;
>
> It's probably too late to change the "zone_device_" prefix into
> something that cannot be confused with a reference to zoned block
> devices?
It is never too late to send a patch that renames things, but it needs
to be worth it. However, it is hard to see a global benefit from a
zone_device_* rename: ZONE_DEVICE is a well-known term and is used in
many other places in the kernel.
Thanks
>
> Thanks,
>
> Bart.
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 2/7] block: don't merge different kinds of P2P transfers in a single bio
2024-10-27 14:21 ` [RFC PATCH 2/7] block: don't merge different kinds of P2P transfers in a single bio Leon Romanovsky
2024-10-28 18:27 ` Logan Gunthorpe
2024-10-31 20:58 ` Bart Van Assche
@ 2024-11-02 7:39 ` Zhu Yanjun
2024-11-03 15:19 ` Leon Romanovsky
2024-11-04 8:56 ` Christoph Hellwig
2 siblings, 2 replies; 17+ messages in thread
From: Zhu Yanjun @ 2024-11-02 7:39 UTC (permalink / raw)
To: Leon Romanovsky, Jens Axboe, Jason Gunthorpe, Robin Murphy,
Joerg Roedel, Will Deacon, Christoph Hellwig, Sagi Grimberg
Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, Jonathan Corbet,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm
On 2024/10/27 15:21, Leon Romanovsky wrote:
> From: Christoph Hellwig <hch@lst.de>
>
> To get out of the dma mapping helpers having to check every segment for
> its P2P status, ensure that bios either contain P2P transfers or non-P2P
> transfers, and that a P2P bio only contains ranges from a single device.
>
> This means we do the page zone access in the bio add path, where the page
> should still be cache hot, and will only have to do the fairly expensive
> P2P topology lookup once per bio down in the dma mapping path, and only
> for already marked bios.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
> block/bio.c | 36 +++++++++++++++++++++++++++++-------
> block/blk-map.c | 32 ++++++++++++++++++++++++--------
> include/linux/blk_types.h | 2 ++
> 3 files changed, 55 insertions(+), 15 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 2d3bc8bfb071..943a6d78cb3e 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -928,8 +928,6 @@ static bool bvec_try_merge_page(struct bio_vec *bv, struct page *page,
> return false;
> if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
> return false;
> - if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
> - return false;
>
> *same_page = ((vec_end_addr & PAGE_MASK) == ((page_addr + off) &
> PAGE_MASK));
> @@ -993,6 +991,14 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
> if (bio->bi_vcnt > 0) {
> struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>
> + /*
> + * When doing ZONE_DEVICE-based P2P transfers, all pages in a
> + * bio must be P2P pages from the same device.
> + */
> + if ((bio->bi_opf & REQ_P2PDMA) &&
> + !zone_device_pages_have_same_pgmap(bv->bv_page, page))
> + return 0;
> +
> if (bvec_try_merge_hw_page(q, bv, page, len, offset,
> same_page)) {
> bio->bi_iter.bi_size += len;
> @@ -1009,6 +1015,9 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
> */
> if (bvec_gap_to_prev(&q->limits, bv, offset))
> return 0;
> + } else {
> + if (is_pci_p2pdma_page(page))
> + bio->bi_opf |= REQ_P2PDMA | REQ_NOMERGE;
> }
>
> bvec_set_page(&bio->bi_io_vec[bio->bi_vcnt], page, len, offset);
> @@ -1133,11 +1142,24 @@ static int bio_add_page_int(struct bio *bio, struct page *page,
> if (bio->bi_iter.bi_size > UINT_MAX - len)
> return 0;
>
> - if (bio->bi_vcnt > 0 &&
> - bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
> - page, len, offset, same_page)) {
> - bio->bi_iter.bi_size += len;
> - return len;
> + if (bio->bi_vcnt > 0) {
> + struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
> +
> + /*
> + * When doing ZONE_DEVICE-based P2P transfers, all pages in a
> + * bio must be P2P pages from the same device.
> + */
> + if ((bio->bi_opf & REQ_P2PDMA) &&
> + !zone_device_pages_have_same_pgmap(bv->bv_page, page))
> + return 0;
> +
> + if (bvec_try_merge_page(bv, page, len, offset, same_page)) {
> + bio->bi_iter.bi_size += len;
> + return len;
> + }
> + } else {
> + if (is_pci_p2pdma_page(page))
> + bio->bi_opf |= REQ_P2PDMA | REQ_NOMERGE;
> }
>
> if (bio->bi_vcnt >= bio->bi_max_vecs)
> diff --git a/block/blk-map.c b/block/blk-map.c
> index 0e1167b23934..03192b1ca6ea 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -568,6 +568,7 @@ static int blk_rq_map_user_bvec(struct request *rq, const struct iov_iter *iter)
> const struct queue_limits *lim = &q->limits;
> unsigned int nsegs = 0, bytes = 0;
> struct bio *bio;
> + int error;
> size_t i;
>
> if (!nr_iter || (nr_iter >> SECTOR_SHIFT) > queue_max_hw_sectors(q))
> @@ -588,15 +589,30 @@ static int blk_rq_map_user_bvec(struct request *rq, const struct iov_iter *iter)
> for (i = 0; i < nr_segs; i++) {
> struct bio_vec *bv = &bvecs[i];
>
> - /*
> - * If the queue doesn't support SG gaps and adding this
> - * offset would create a gap, fallback to copy.
> - */
> - if (bvprvp && bvec_gap_to_prev(lim, bvprvp, bv->bv_offset)) {
> - blk_mq_map_bio_put(bio);
> - return -EREMOTEIO;
> + error = -EREMOTEIO;
> + if (bvprvp) {
> + /*
> + * If the queue doesn't support SG gaps and adding this
> + * offset would create a gap, fallback to copy.
> + */
> + if (bvec_gap_to_prev(lim, bvprvp, bv->bv_offset))
> + goto put_bio;
> +
> + /*
> + * When doing ZONE_DEVICE-based P2P transfers, all pages
> + * in a bio must be P2P pages, and from the same device.
> + */
> + if ((bio->bi_opf & REQ_P2PDMA) &&
> + !zone_device_pages_have_same_pgmap(bvprvp->bv_page,
> + bv->bv_page))
> + goto put_bio;
> + } else {
> + if (is_pci_p2pdma_page(bv->bv_page))
> + bio->bi_opf |= REQ_P2PDMA | REQ_NOMERGE;
> }
> +
> /* check full condition */
> + error = -EINVAL;
> if (nsegs >= nr_segs || bytes > UINT_MAX - bv->bv_len)
> goto put_bio;
> if (bytes + bv->bv_len > nr_iter)
> @@ -611,7 +627,7 @@ static int blk_rq_map_user_bvec(struct request *rq, const struct iov_iter *iter)
> return 0;
> put_bio:
> blk_mq_map_bio_put(bio);
> - return -EINVAL;
> + return error;
> }
>
> /**
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index dce7615c35e7..94cf146e8ce6 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -378,6 +378,7 @@ enum req_flag_bits {
> __REQ_DRV, /* for driver use */
> __REQ_FS_PRIVATE, /* for file system (submitter) use */
> __REQ_ATOMIC, /* for atomic write operations */
> + __REQ_P2PDMA, /* contains P2P DMA pages */
> /*
> * Command specific flags, keep last:
> */
> @@ -410,6 +411,7 @@ enum req_flag_bits {
> #define REQ_DRV (__force blk_opf_t)(1ULL << __REQ_DRV)
> #define REQ_FS_PRIVATE (__force blk_opf_t)(1ULL << __REQ_FS_PRIVATE)
> #define REQ_ATOMIC (__force blk_opf_t)(1ULL << __REQ_ATOMIC)
> +#define REQ_P2PDMA (__force blk_opf_t)(1ULL << __REQ_P2PDMA)
#define REQ_P2PDMA (__force blk_opf_t)BIT_ULL(__REQ_P2PDMA)
Use BIT_ULL() instead of a direct left shift.
Zhu Yanjun
>
> #define REQ_NOUNMAP (__force blk_opf_t)(1ULL << __REQ_NOUNMAP)
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 2/7] block: don't merge different kinds of P2P transfers in a single bio
2024-11-02 7:39 ` Zhu Yanjun
@ 2024-11-03 15:19 ` Leon Romanovsky
2024-11-04 8:56 ` Christoph Hellwig
1 sibling, 0 replies; 17+ messages in thread
From: Leon Romanovsky @ 2024-11-03 15:19 UTC (permalink / raw)
To: Zhu Yanjun
Cc: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Christoph Hellwig, Sagi Grimberg, Keith Busch,
Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, Jonathan Corbet,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm
On Sat, Nov 02, 2024 at 08:39:35AM +0100, Zhu Yanjun wrote:
> On 2024/10/27 15:21, Leon Romanovsky wrote:
> > From: Christoph Hellwig <hch@lst.de>
> >
> > To get out of the dma mapping helpers having to check every segment for
> > its P2P status, ensure that bios either contain P2P transfers or non-P2P
> > transfers, and that a P2P bio only contains ranges from a single device.
> >
> > This means we do the page zone access in the bio add path, where the page
> > should still be cache hot, and will only have to do the fairly expensive
> > P2P topology lookup once per bio down in the dma mapping path, and only
> > for already marked bios.
> >
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> > block/bio.c | 36 +++++++++++++++++++++++++++++-------
> > block/blk-map.c | 32 ++++++++++++++++++++++++--------
> > include/linux/blk_types.h | 2 ++
> > 3 files changed, 55 insertions(+), 15 deletions(-)
<...>
> > @@ -410,6 +411,7 @@ enum req_flag_bits {
> > #define REQ_DRV (__force blk_opf_t)(1ULL << __REQ_DRV)
> > #define REQ_FS_PRIVATE (__force blk_opf_t)(1ULL << __REQ_FS_PRIVATE)
> > #define REQ_ATOMIC (__force blk_opf_t)(1ULL << __REQ_ATOMIC)
> > +#define REQ_P2PDMA (__force blk_opf_t)(1ULL << __REQ_P2PDMA)
>
> #define REQ_P2PDMA (__force blk_opf_t)BIT_ULL(__REQ_P2PDMA)
>
> Use BIT_ULL instead of direct left shit.
We keep the coding style consistent, and none of the defines above are
implemented with BIT_ULL().
Thanks
>
> Zhu Yanjun
>
> > #define REQ_NOUNMAP (__force blk_opf_t)(1ULL << __REQ_NOUNMAP)
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 1/7] block: share more code for bio addition helpers
2024-10-31 20:55 ` Bart Van Assche
@ 2024-11-04 8:55 ` Christoph Hellwig
0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2024-11-04 8:55 UTC (permalink / raw)
To: Bart Van Assche
Cc: Leon Romanovsky, Jens Axboe, Jason Gunthorpe, Robin Murphy,
Joerg Roedel, Will Deacon, Christoph Hellwig, Sagi Grimberg,
Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, Jonathan Corbet,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm
On Thu, Oct 31, 2024 at 01:55:11PM -0700, Bart Van Assche wrote:
> Does "_int" stand for "_internal"? If so, please consider changing it
> into "_impl". I think that will prevent that anyone confuses this suffix
> with the "int" data type.
_impl is even more unusual than _int, so no.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 2/7] block: don't merge different kinds of P2P transfers in a single bio
2024-10-31 20:58 ` Bart Van Assche
2024-11-01 6:11 ` Leon Romanovsky
@ 2024-11-04 8:55 ` Christoph Hellwig
1 sibling, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2024-11-04 8:55 UTC (permalink / raw)
To: Bart Van Assche
Cc: Leon Romanovsky, Jens Axboe, Jason Gunthorpe, Robin Murphy,
Joerg Roedel, Will Deacon, Christoph Hellwig, Sagi Grimberg,
Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, Jonathan Corbet,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm
On Thu, Oct 31, 2024 at 01:58:37PM -0700, Bart Van Assche wrote:
> On 10/27/24 7:21 AM, Leon Romanovsky wrote:
>> + /*
>> + * When doing ZONE_DEVICE-based P2P transfers, all pages in a
>> + * bio must be P2P pages from the same device.
>> + */
>> + if ((bio->bi_opf & REQ_P2PDMA) &&
>> + !zone_device_pages_have_same_pgmap(bv->bv_page, page))
>> + return 0;
>
> It's probably too late to change the "zone_device_" prefix into
> something that cannot be confused with a reference to zoned block
> devices?
It's too late and also wrong and also entirely out of scope for this
series.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC PATCH 2/7] block: don't merge different kinds of P2P transfers in a single bio
2024-11-02 7:39 ` Zhu Yanjun
2024-11-03 15:19 ` Leon Romanovsky
@ 2024-11-04 8:56 ` Christoph Hellwig
1 sibling, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2024-11-04 8:56 UTC (permalink / raw)
To: Zhu Yanjun
Cc: Leon Romanovsky, Jens Axboe, Jason Gunthorpe, Robin Murphy,
Joerg Roedel, Will Deacon, Christoph Hellwig, Sagi Grimberg,
Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, Jonathan Corbet,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm
On Sat, Nov 02, 2024 at 08:39:35AM +0100, Zhu Yanjun wrote:
<full quote delete>
If you have something useful to say, please quote the mail you are
replying to down to the context required to make sense.
Ignored for now.
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2024-11-04 8:56 UTC | newest]
Thread overview: 17+ messages
2024-10-27 14:21 [RFC PATCH 0/7] Block and NVMe PCI use of new DMA mapping API Leon Romanovsky
2024-10-27 14:21 ` [RFC PATCH 1/7] block: share more code for bio addition helpers Leon Romanovsky
2024-10-31 20:55 ` Bart Van Assche
2024-11-04 8:55 ` Christoph Hellwig
2024-10-27 14:21 ` [RFC PATCH 2/7] block: don't merge different kinds of P2P transfers in a single bio Leon Romanovsky
2024-10-28 18:27 ` Logan Gunthorpe
2024-10-31 20:58 ` Bart Van Assche
2024-11-01 6:11 ` Leon Romanovsky
2024-11-04 8:55 ` Christoph Hellwig
2024-11-02 7:39 ` Zhu Yanjun
2024-11-03 15:19 ` Leon Romanovsky
2024-11-04 8:56 ` Christoph Hellwig
2024-10-27 14:21 ` [RFC PATCH 3/7] blk-mq: add a dma mapping iterator Leon Romanovsky
2024-10-27 14:21 ` [RFC PATCH 4/7] blk-mq: add scatterlist-less DMA mapping helpers Leon Romanovsky
2024-10-27 14:21 ` [RFC PATCH 5/7] nvme-pci: remove struct nvme_descriptor Leon Romanovsky
2024-10-27 14:21 ` [RFC PATCH 6/7] nvme-pci: use a better encoding for small prp pool allocations Leon Romanovsky
2024-10-27 14:22 ` [RFC PATCH 7/7] nvme-pci: convert to blk_rq_dma_map Leon Romanovsky