linux-block.vger.kernel.org archive mirror
* [PATCH v9 00/24] Provide a new two step DMA mapping API
@ 2025-04-23  8:12 Leon Romanovsky
  2025-04-23  8:12 ` [PATCH v9 01/24] PCI/P2PDMA: Refactor the p2pdma mapping helpers Leon Romanovsky
                   ` (23 more replies)
  0 siblings, 24 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:12 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni

Following the recent on-site LSF/MM 2025 [1] discussion, the overall
response was extremely positive, with many people expressing their
desire to see this series merged so they can base their work on it.

This includes, but is not limited to:
 * Luis's "nvme-pci: breaking the 512 KiB max IO boundary":
   https://lore.kernel.org/all/20250320111328.2841690-1-mcgrof@kernel.org/
 * Chuck's NFS conversion to use one structure (bio_vec) for all types
   of RPC transports:
   https://lore.kernel.org/all/913df4b4-fc4a-409d-9007-088a3e2c8291@oracle.com
 * Matthew's vision for the world without struct page:
   https://lore.kernel.org/all/20250320111328.2841690-1-mcgrof@kernel.org/
 * Confidential computing roadmap from Dan:
   https://lore.kernel.org/all/6801a8e3968da_71fe29411@dwillia2-xfh.jf.intel.com.notmuch

This series is the combined effort of many people who contributed ideas,
code and testing, and I'm grateful to all of them.

[1] https://lore.kernel.org/linux-rdma/20250122071600.GC10702@unreal/
-----------------------------------------------------------------------
Changelog:
v9:
 * Added tested-by from Jens.
 * Replaced the is_pci_p2pdma_page(bv.bv_page) check with
   "if (IS_ENABLED(CONFIG_PCI_P2PDMA) && (req->cmd_flags & REQ_P2PDMA))",
   which is more aligned with the goal (do not access struct page) and
   more efficient. This is the only line that changed in Jens's
   performance testing flow, so I kept his tags as is.
 * Restored single-segment optimization for SGL path.
 * Added forgotten unmap of metadata SGL in the multi-segment flow.
 * Split and squashed optimization patch from Kanchan.
 * Converted "bool aborted" flag to use newly introduced flag variable.
v8:
 * Rebased to v6.15-rc1
 * Added NVMe patches, which are now regular patches and not RFC. They
   were in the RFC stage because the block iterator caused a performance
   regression in a very extreme scenario (~100M IOPS), but after Kanchan
   fixed it, the code became ready for merging.
 * @Niklas, I didn't change the naming in this series as it follows the
   iommu naming format.
v7:
 * Rebased to v6.14-rc1
v6: https://lore.kernel.org/all/cover.1737106761.git.leon@kernel.org
 * Changed the internal __size variable to u64 to properly set the private
   flag in the most significant bit.
 * Added a comment about why we check DMA_IOVA_USE_SWIOTLB.
 * Break the unlink loop if phys is NULL, a condition we shouldn't hit.
v5: https://lore.kernel.org/all/cover.1734436840.git.leon@kernel.org
 * Trimmed long lines in all patches.
 * Squashed "dma-mapping: Add check if IOVA can be used" into
   "dma: Provide an interface to allow allocate IOVA" patch.
 * Added tags from Christoph and Will.
 * Fixed spelling/grammar errors.
 * Changed the title from "dma: Provide an  ..." to "dma-mapping: Provide
   an ...".
 * Slightly changed hmm patch to set sticky flags in one place.
v4: https://lore.kernel.org/all/cover.1733398913.git.leon@kernel.org
 * Added extra patch to add kernel-doc for iommu_unmap and iommu_unmap_fast
 * Rebased to v6.13-rc1
 * Added Will's tags
v3: https://lore.kernel.org/all/cover.1731244445.git.leon@kernel.org
 * Added DMA_ATTR_SKIP_CPU_SYNC to p2p pages in HMM.
 * Fixed error unwind if dma_iova_sync fails in HMM.
 * Clear all PFN flags which were set in map to make the code cleaner;
   the callers cleaned them anyway.
 * Generalize sticky PFN flags logic in HMM.
 * Removed not-needed #ifdef-#endif section.
v2: https://lore.kernel.org/all/cover.1730892663.git.leon@kernel.org
 * Fixed docs file as Randy suggested
 * Fixed release of memory in the HMM path. It was allocated with the kv..
   variants but released with kfree instead of kvfree.
 * Slightly changed commit message in VFIO patch.
v1: https://lore.kernel.org/all/cover.1730298502.git.leon@kernel.org
 * Squashed two VFIO patches into one
 * Added Acked-by/Reviewed-by tags
 * Fix docs spelling errors
 * Simplified dma_iova_sync() API
 * Added an extra check in dma_iova_destroy() on the mapped size to make the code clearer
 * Fixed checkpatch warnings in p2p patch
 * Changed implementation of VFIO mlx5 mlx5vf_add_migration_pages() to
   be more general
 * Reduced the number of changes in VFIO patch
v0: https://lore.kernel.org/all/cover.1730037276.git.leon@kernel.org

----------------------------------------------------------------------------
 LWN coverage:
Dancing the DMA two-step - https://lwn.net/Articles/997563/
----------------------------------------------------------------------------

Currently the only efficient way to map a complex memory description through
the DMA API is by using the scatterlist APIs. The SG APIs are unique in that
they efficiently combine the two fundamental operations of sizing and allocating
a large IOVA window from the IOMMU and processing all the per-address
swiotlb/flushing/p2p/map details.

This uniqueness has been a long-standing pain point as the scatterlist API
is mandatory, but expensive to use. It prevents any kind of optimization or
feature improvement (such as avoiding struct page for P2P) because the
scatterlist itself cannot be improved.

Several approaches have been explored to expand the DMA API with additional
scatterlist-like structures (BIO, rlist); instead, this series splits up the
DMA API to allow callers to bring their own data structure.

The API is split up into two parts:
 - Allocate IOVA space:
    Do any pre-allocation required. This is done based on the caller
    supplying some details about how much IOMMU address space it would
    need in the worst case.
 - Map and unmap relevant structures to pre-allocated IOVA space:
    Perform the actual mapping into the pre-allocated IOVA. This is very
    similar to dma_map_page() (see the sketch below).
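
The following sketch is hypothetical and not lifted from any patch in the
series; dev, pages, nr_pages, total_len and dir are placeholders, and the
helpers used (dma_iova_try_alloc(), dma_iova_link(), dma_iova_sync(),
dma_iova_unlink(), dma_iova_free()) are the ones introduced later in the
series:

  struct dma_iova_state state;
  size_t off = 0;
  int i, ret;

  /* Step 1: size and allocate the IOVA window once, up front. */
  if (!dma_iova_try_alloc(dev, &state, 0, total_len)) {
      /* No IOVA-based path; fall back to per-page dma_map_page(). */
      return -EOPNOTSUPP;
  }

  /* Step 2: link each page into the pre-allocated IOVA window. */
  for (i = 0; i < nr_pages; i++) {
      ret = dma_iova_link(dev, &state, page_to_phys(pages[i]),
                          off, PAGE_SIZE, dir, 0);
      if (ret)
          goto err;
      off += PAGE_SIZE;
  }

  /* One IOTLB sync covers everything linked above. */
  ret = dma_iova_sync(dev, &state, 0, off);
  if (ret)
      goto err;
  return 0;

  err:
      /* Teardown mirrors the map path: unlink the range, free the IOVA. */
      if (off)
          dma_iova_unlink(dev, &state, 0, off, dir, 0);
      dma_iova_free(dev, &state);
      return ret;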

In this series, examples of three different users are converted to the new API
to show the benefits and its versatility. Each user has a unique
flow:
 1. RDMA ODP is an example of "SVA mirroring" using HMM that needs to
    dynamically map/unmap large numbers of single pages. This becomes
    significantly faster in the IOMMU case as the map/unmap is now just
    a page table walk, and the IOVA allocation is pre-computed once.
    Significant amounts of memory are saved as there is no longer a need
    to store the dma_addr_t of each page.
 2. VFIO PCI live migration code builds a very large "page list" for the
    device. Instead of allocating a scatterlist entry per allocated page,
    it can just allocate an array of 'struct page *', saving a large
    amount of memory.
 3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
    list without having to allocate and then populate an intermediate
    SG table.

To make the new API easier to use, the HMM and block subsystems are extended
to hide the optimization details from the caller. Among these optimizations:
 * Memory reduction, as in most real use cases there is no need to store
   mapped DMA addresses and unmap them.
 * Reduced function call overhead by removing the need to call function
   pointers, using direct calls instead.

This is the first step along a path to provide alternatives to scatterlist
and to solve some of its abuses and design mistakes.

The whole series can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git dma-split-Apr-23

Thanks

Christoph Hellwig (12):
  PCI/P2PDMA: Refactor the p2pdma mapping helpers
  dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
  iommu: generalize the batched sync after map interface
  iommu/dma: Factor out a iommu_dma_map_swiotlb helper
  dma-mapping: add a dma_need_unmap helper
  docs: core-api: document the IOVA-based API
  block: share more code for bio addition helper
  block: don't merge different kinds of P2P transfers in a single bio
  blk-mq: add scatterlist-less DMA mapping helpers
  nvme-pci: remove struct nvme_descriptor
  nvme-pci: use a better encoding for small prp pool allocations
  nvme-pci: convert to blk_rq_dma_map

Leon Romanovsky (12):
  iommu: add kernel-doc for iommu_unmap_fast
  dma-mapping: Provide an interface to allow allocate IOVA
  dma-mapping: Implement link/unlink ranges API
  mm/hmm: let users to tag specific PFN with DMA mapped bit
  mm/hmm: provide generic DMA managing logic
  RDMA/umem: Store ODP access mask information in PFN
  RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page
    linkage
  RDMA/umem: Separate implicit ODP initialization from explicit ODP
  vfio/mlx5: Explicitly use number of pages instead of allocated length
  vfio/mlx5: Rewrite create mkey flow to allow better code reuse
  vfio/mlx5: Enable the DMA link API
  nvme-pci: store aborted state in flags variable

 Documentation/core-api/dma-api.rst   |  71 +++
 block/bio.c                          |  83 ++--
 block/blk-merge.c                    | 180 ++++++-
 drivers/infiniband/core/umem_odp.c   | 250 ++++------
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  12 +-
 drivers/infiniband/hw/mlx5/odp.c     |  65 ++-
 drivers/infiniband/hw/mlx5/umr.c     |  12 +-
 drivers/infiniband/sw/rxe/rxe_odp.c  |  18 +-
 drivers/iommu/dma-iommu.c            | 468 +++++++++++++++---
 drivers/iommu/iommu.c                |  84 ++--
 drivers/nvme/host/pci.c              | 702 +++++++++++++++------------
 drivers/pci/p2pdma.c                 |  38 +-
 drivers/vfio/pci/mlx5/cmd.c          | 375 +++++++-------
 drivers/vfio/pci/mlx5/cmd.h          |  35 +-
 drivers/vfio/pci/mlx5/main.c         |  87 ++--
 include/linux/blk-mq-dma.h           |  63 +++
 include/linux/blk_types.h            |   2 +
 include/linux/dma-map-ops.h          |  54 ---
 include/linux/dma-mapping.h          |  85 ++++
 include/linux/hmm-dma.h              |  33 ++
 include/linux/hmm.h                  |  21 +
 include/linux/iommu.h                |   4 +
 include/linux/pci-p2pdma.h           |  84 ++++
 include/rdma/ib_umem_odp.h           |  25 +-
 kernel/dma/direct.c                  |  44 +-
 kernel/dma/mapping.c                 |  18 +
 mm/hmm.c                             | 264 +++++++++-
 27 files changed, 2103 insertions(+), 1074 deletions(-)
 create mode 100644 include/linux/blk-mq-dma.h
 create mode 100644 include/linux/hmm-dma.h

-- 
2.49.0


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH v9 01/24] PCI/P2PDMA: Refactor the p2pdma mapping helpers
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
@ 2025-04-23  8:12 ` Leon Romanovsky
  2025-04-26  0:21   ` Luis Chamberlain
  2025-04-23  8:12 ` [PATCH v9 02/24] dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h Leon Romanovsky
                   ` (22 subsequent siblings)
  23 siblings, 1 reply; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:12 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Leon Romanovsky

From: Christoph Hellwig <hch@lst.de>

The current scheme with a single helper to determine the P2P status
and map a scatterlist segment forces users to always use the map_sg
helper to DMA map, which we're trying to get away from because it is
very cache inefficient.

Refactor the code so that there is a single helper that checks the P2P
state for a page, including the result that it is not a P2P page, to
simplify the callers, and a second one to perform the address translation
for a bus mapped P2P transfer that does not depend on the scatterlist
structure.
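
For illustration only (the hunks below are the authoritative change), a
caller without a scatterlist is expected to combine the two new helpers
roughly like this; dev, page, offset, len and dir are placeholders, and
error checking of dma_map_page() is omitted:

  struct pci_p2pdma_map_state p2pdma_state = {};
  dma_addr_t dma;

  switch (pci_p2pdma_state(&p2pdma_state, dev, page)) {
  case PCI_P2PDMA_MAP_NONE:
  case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
      /* Normal mapping through CPU physical addresses / IOVA. */
      dma = dma_map_page(dev, page, offset, len, dir);
      break;
  case PCI_P2PDMA_MAP_BUS_ADDR:
      /* Program the device with the PCI bus address directly. */
      dma = pci_p2pdma_bus_addr_map(&p2pdma_state,
                                    page_to_phys(page) + offset);
      break;
  default:
      return -EREMOTEIO;
  }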

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/iommu/dma-iommu.c   | 47 +++++++++++++++++-----------------
 drivers/pci/p2pdma.c        | 38 ++++-----------------------
 include/linux/dma-map-ops.h | 51 +++++++++++++++++++++++++++++--------
 kernel/dma/direct.c         | 43 +++++++++++++++----------------
 4 files changed, 91 insertions(+), 88 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index cb7e29dcac15..a8f9fd93e150 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1359,7 +1359,6 @@ int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
 	struct scatterlist *s, *prev = NULL;
 	int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
 	struct pci_p2pdma_map_state p2pdma_state = {};
-	enum pci_p2pdma_map_type map;
 	dma_addr_t iova;
 	size_t iova_len = 0;
 	unsigned long mask = dma_get_seg_boundary(dev);
@@ -1389,28 +1388,30 @@ int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
 		size_t s_length = s->length;
 		size_t pad_len = (mask - iova_len + 1) & mask;
 
-		if (is_pci_p2pdma_page(sg_page(s))) {
-			map = pci_p2pdma_map_segment(&p2pdma_state, dev, s);
-			switch (map) {
-			case PCI_P2PDMA_MAP_BUS_ADDR:
-				/*
-				 * iommu_map_sg() will skip this segment as
-				 * it is marked as a bus address,
-				 * __finalise_sg() will copy the dma address
-				 * into the output segment.
-				 */
-				continue;
-			case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
-				/*
-				 * Mapping through host bridge should be
-				 * mapped with regular IOVAs, thus we
-				 * do nothing here and continue below.
-				 */
-				break;
-			default:
-				ret = -EREMOTEIO;
-				goto out_restore_sg;
-			}
+		switch (pci_p2pdma_state(&p2pdma_state, dev, sg_page(s))) {
+		case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+			/*
+			 * Mapping through host bridge should be mapped with
+			 * regular IOVAs, thus we do nothing here and continue
+			 * below.
+			 */
+			break;
+		case PCI_P2PDMA_MAP_NONE:
+			break;
+		case PCI_P2PDMA_MAP_BUS_ADDR:
+			/*
+			 * iommu_map_sg() will skip this segment as it is marked
+			 * as a bus address, __finalise_sg() will copy the dma
+			 * address into the output segment.
+			 */
+			s->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
+						sg_phys(s));
+			sg_dma_len(s) = sg->length;
+			sg_dma_mark_bus_address(s);
+			continue;
+		default:
+			ret = -EREMOTEIO;
+			goto out_restore_sg;
 		}
 
 		sg_dma_address(s) = s_iova_off;
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 19214ec81fbb..8d955c25aed3 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -1004,40 +1004,12 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 	return type;
 }
 
-/**
- * pci_p2pdma_map_segment - map an sg segment determining the mapping type
- * @state: State structure that should be declared outside of the for_each_sg()
- *	loop and initialized to zero.
- * @dev: DMA device that's doing the mapping operation
- * @sg: scatterlist segment to map
- *
- * This is a helper to be used by non-IOMMU dma_map_sg() implementations where
- * the sg segment is the same for the page_link and the dma_address.
- *
- * Attempt to map a single segment in an SGL with the PCI bus address.
- * The segment must point to a PCI P2PDMA page and thus must be
- * wrapped in a is_pci_p2pdma_page(sg_page(sg)) check.
- *
- * Returns the type of mapping used and maps the page if the type is
- * PCI_P2PDMA_MAP_BUS_ADDR.
- */
-enum pci_p2pdma_map_type
-pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
-		       struct scatterlist *sg)
+void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
+		struct device *dev, struct page *page)
 {
-	if (state->pgmap != page_pgmap(sg_page(sg))) {
-		state->pgmap = page_pgmap(sg_page(sg));
-		state->map = pci_p2pdma_map_type(state->pgmap, dev);
-		state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
-	}
-
-	if (state->map == PCI_P2PDMA_MAP_BUS_ADDR) {
-		sg->dma_address = sg_phys(sg) + state->bus_off;
-		sg_dma_len(sg) = sg->length;
-		sg_dma_mark_bus_address(sg);
-	}
-
-	return state->map;
+	state->pgmap = page_pgmap(page);
+	state->map = pci_p2pdma_map_type(state->pgmap, dev);
+	state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
 }
 
 /**
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index e172522cd936..c3086edeccc6 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -443,6 +443,11 @@ enum pci_p2pdma_map_type {
 	 */
 	PCI_P2PDMA_MAP_UNKNOWN = 0,
 
+	/*
+	 * Not a PCI P2PDMA transfer.
+	 */
+	PCI_P2PDMA_MAP_NONE,
+
 	/*
 	 * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
 	 * traverse the host bridge and the host bridge is not in the
@@ -471,21 +476,47 @@ enum pci_p2pdma_map_type {
 
 struct pci_p2pdma_map_state {
 	struct dev_pagemap *pgmap;
-	int map;
+	enum pci_p2pdma_map_type map;
 	u64 bus_off;
 };
 
-#ifdef CONFIG_PCI_P2PDMA
-enum pci_p2pdma_map_type
-pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
-		       struct scatterlist *sg);
-#else /* CONFIG_PCI_P2PDMA */
+/* helper for pci_p2pdma_state(), do not use directly */
+void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
+		struct device *dev, struct page *page);
+
+/**
+ * pci_p2pdma_state - check the P2P transfer state of a page
+ * @state:	P2P state structure
+ * @dev:	device to transfer to/from
+ * @page:	page to map
+ *
+ * Check if @page is a PCI P2PDMA page, and if yes of what kind.  Returns the
+ * map type, and updates @state with all information needed for a P2P transfer.
+ */
 static inline enum pci_p2pdma_map_type
-pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
-		       struct scatterlist *sg)
+pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
+		struct page *page)
+{
+	if (IS_ENABLED(CONFIG_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
+		if (state->pgmap != page_pgmap(page))
+			__pci_p2pdma_update_state(state, dev, page);
+		return state->map;
+	}
+	return PCI_P2PDMA_MAP_NONE;
+}
+
+/**
+ * pci_p2pdma_bus_addr_map - map a PCI_P2PDMA_MAP_BUS_ADDR P2P transfer
+ * @state:	P2P state structure
+ * @paddr:	physical address to map
+ *
+ * Map a physically contiguous PCI_P2PDMA_MAP_BUS_ADDR transfer.
+ */
+static inline dma_addr_t
+pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
 {
-	return PCI_P2PDMA_MAP_NOT_SUPPORTED;
+	WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
+	return paddr + state->bus_off;
 }
-#endif /* CONFIG_PCI_P2PDMA */
 
 #endif /* _LINUX_DMA_MAP_OPS_H */
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index b8fe0b3d0ffb..cec43cd5ed62 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -462,34 +462,33 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
 		enum dma_data_direction dir, unsigned long attrs)
 {
 	struct pci_p2pdma_map_state p2pdma_state = {};
-	enum pci_p2pdma_map_type map;
 	struct scatterlist *sg;
 	int i, ret;
 
 	for_each_sg(sgl, sg, nents, i) {
-		if (is_pci_p2pdma_page(sg_page(sg))) {
-			map = pci_p2pdma_map_segment(&p2pdma_state, dev, sg);
-			switch (map) {
-			case PCI_P2PDMA_MAP_BUS_ADDR:
-				continue;
-			case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
-				/*
-				 * Any P2P mapping that traverses the PCI
-				 * host bridge must be mapped with CPU physical
-				 * address and not PCI bus addresses. This is
-				 * done with dma_direct_map_page() below.
-				 */
-				break;
-			default:
-				ret = -EREMOTEIO;
+		switch (pci_p2pdma_state(&p2pdma_state, dev, sg_page(sg))) {
+		case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+			/*
+			 * Any P2P mapping that traverses the PCI host bridge
+			 * must be mapped with CPU physical address and not PCI
+			 * bus addresses.
+			 */
+			break;
+		case PCI_P2PDMA_MAP_NONE:
+			sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
+					sg->offset, sg->length, dir, attrs);
+			if (sg->dma_address == DMA_MAPPING_ERROR) {
+				ret = -EIO;
 				goto out_unmap;
 			}
-		}
-
-		sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
-				sg->offset, sg->length, dir, attrs);
-		if (sg->dma_address == DMA_MAPPING_ERROR) {
-			ret = -EIO;
+			break;
+		case PCI_P2PDMA_MAP_BUS_ADDR:
+			sg->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
+					sg_phys(sg));
+			sg_dma_mark_bus_address(sg);
+			continue;
+		default:
+			ret = -EREMOTEIO;
 			goto out_unmap;
 		}
 		sg_dma_len(sg) = sg->length;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 02/24] dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
  2025-04-23  8:12 ` [PATCH v9 01/24] PCI/P2PDMA: Refactor the p2pdma mapping helpers Leon Romanovsky
@ 2025-04-23  8:12 ` Leon Romanovsky
  2025-04-26  0:34   ` Luis Chamberlain
  2025-04-23  8:12 ` [PATCH v9 03/24] iommu: generalize the batched sync after map interface Leon Romanovsky
                   ` (21 subsequent siblings)
  23 siblings, 1 reply; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:12 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Leon Romanovsky

From: Christoph Hellwig <hch@lst.de>

To support the upcoming non-scatterlist mapping helpers, we need to go
back to having them called outside of the DMA API.  Thus move them out of
dma-map-ops.h, which is only for DMA API implementations, to pci-p2pdma.h,
which is for driver use.

Note that the core helper is still not exported, as the mapping is
expected to be done only by very high-level subsystem code, at least for
now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/iommu/dma-iommu.c   |  1 +
 include/linux/dma-map-ops.h | 85 -------------------------------------
 include/linux/pci-p2pdma.h  | 84 ++++++++++++++++++++++++++++++++++++
 kernel/dma/direct.c         |  1 +
 4 files changed, 86 insertions(+), 85 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index a8f9fd93e150..145606498b4c 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -27,6 +27,7 @@
 #include <linux/msi.h>
 #include <linux/of_iommu.h>
 #include <linux/pci.h>
+#include <linux/pci-p2pdma.h>
 #include <linux/scatterlist.h>
 #include <linux/spinlock.h>
 #include <linux/swiotlb.h>
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index c3086edeccc6..f48e5fb88bd5 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -434,89 +434,4 @@ static inline void debug_dma_dump_mappings(struct device *dev)
 #endif /* CONFIG_DMA_API_DEBUG */
 
 extern const struct dma_map_ops dma_dummy_ops;
-
-enum pci_p2pdma_map_type {
-	/*
-	 * PCI_P2PDMA_MAP_UNKNOWN: Used internally for indicating the mapping
-	 * type hasn't been calculated yet. Functions that return this enum
-	 * never return this value.
-	 */
-	PCI_P2PDMA_MAP_UNKNOWN = 0,
-
-	/*
-	 * Not a PCI P2PDMA transfer.
-	 */
-	PCI_P2PDMA_MAP_NONE,
-
-	/*
-	 * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
-	 * traverse the host bridge and the host bridge is not in the
-	 * allowlist. DMA Mapping routines should return an error when
-	 * this is returned.
-	 */
-	PCI_P2PDMA_MAP_NOT_SUPPORTED,
-
-	/*
-	 * PCI_P2PDMA_BUS_ADDR: Indicates that two devices can talk to
-	 * each other directly through a PCI switch and the transaction will
-	 * not traverse the host bridge. Such a mapping should program
-	 * the DMA engine with PCI bus addresses.
-	 */
-	PCI_P2PDMA_MAP_BUS_ADDR,
-
-	/*
-	 * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
-	 * to each other, but the transaction traverses a host bridge on the
-	 * allowlist. In this case, a normal mapping either with CPU physical
-	 * addresses (in the case of dma-direct) or IOVA addresses (in the
-	 * case of IOMMUs) should be used to program the DMA engine.
-	 */
-	PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
-};
-
-struct pci_p2pdma_map_state {
-	struct dev_pagemap *pgmap;
-	enum pci_p2pdma_map_type map;
-	u64 bus_off;
-};
-
-/* helper for pci_p2pdma_state(), do not use directly */
-void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
-		struct device *dev, struct page *page);
-
-/**
- * pci_p2pdma_state - check the P2P transfer state of a page
- * @state:	P2P state structure
- * @dev:	device to transfer to/from
- * @page:	page to map
- *
- * Check if @page is a PCI P2PDMA page, and if yes of what kind.  Returns the
- * map type, and updates @state with all information needed for a P2P transfer.
- */
-static inline enum pci_p2pdma_map_type
-pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
-		struct page *page)
-{
-	if (IS_ENABLED(CONFIG_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
-		if (state->pgmap != page_pgmap(page))
-			__pci_p2pdma_update_state(state, dev, page);
-		return state->map;
-	}
-	return PCI_P2PDMA_MAP_NONE;
-}
-
-/**
- * pci_p2pdma_bus_addr_map - map a PCI_P2PDMA_MAP_BUS_ADDR P2P transfer
- * @state:	P2P state structure
- * @paddr:	physical address to map
- *
- * Map a physically contiguous PCI_P2PDMA_MAP_BUS_ADDR transfer.
- */
-static inline dma_addr_t
-pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
-{
-	WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
-	return paddr + state->bus_off;
-}
-
 #endif /* _LINUX_DMA_MAP_OPS_H */
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 2c07aa6b7665..e85b3ae08a11 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -104,4 +104,88 @@ static inline struct pci_dev *pci_p2pmem_find(struct device *client)
 	return pci_p2pmem_find_many(&client, 1);
 }
 
+enum pci_p2pdma_map_type {
+	/*
+	 * PCI_P2PDMA_MAP_UNKNOWN: Used internally for indicating the mapping
+	 * type hasn't been calculated yet. Functions that return this enum
+	 * never return this value.
+	 */
+	PCI_P2PDMA_MAP_UNKNOWN = 0,
+
+	/*
+	 * Not a PCI P2PDMA transfer.
+	 */
+	PCI_P2PDMA_MAP_NONE,
+
+	/*
+	 * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
+	 * traverse the host bridge and the host bridge is not in the
+	 * allowlist. DMA Mapping routines should return an error when
+	 * this is returned.
+	 */
+	PCI_P2PDMA_MAP_NOT_SUPPORTED,
+
+	/*
+	 * PCI_P2PDMA_BUS_ADDR: Indicates that two devices can talk to
+	 * each other directly through a PCI switch and the transaction will
+	 * not traverse the host bridge. Such a mapping should program
+	 * the DMA engine with PCI bus addresses.
+	 */
+	PCI_P2PDMA_MAP_BUS_ADDR,
+
+	/*
+	 * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
+	 * to each other, but the transaction traverses a host bridge on the
+	 * allowlist. In this case, a normal mapping either with CPU physical
+	 * addresses (in the case of dma-direct) or IOVA addresses (in the
+	 * case of IOMMUs) should be used to program the DMA engine.
+	 */
+	PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
+};
+
+struct pci_p2pdma_map_state {
+	struct dev_pagemap *pgmap;
+	enum pci_p2pdma_map_type map;
+	u64 bus_off;
+};
+
+/* helper for pci_p2pdma_state(), do not use directly */
+void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
+		struct device *dev, struct page *page);
+
+/**
+ * pci_p2pdma_state - check the P2P transfer state of a page
+ * @state:	P2P state structure
+ * @dev:	device to transfer to/from
+ * @page:	page to map
+ *
+ * Check if @page is a PCI P2PDMA page, and if yes of what kind.  Returns the
+ * map type, and updates @state with all information needed for a P2P transfer.
+ */
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
+		struct page *page)
+{
+	if (IS_ENABLED(CONFIG_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
+		if (state->pgmap != page_pgmap(page))
+			__pci_p2pdma_update_state(state, dev, page);
+		return state->map;
+	}
+	return PCI_P2PDMA_MAP_NONE;
+}
+
+/**
+ * pci_p2pdma_bus_addr_map - map a PCI_P2PDMA_MAP_BUS_ADDR P2P transfer
+ * @state:	P2P state structure
+ * @paddr:	physical address to map
+ *
+ * Map a physically contiguous PCI_P2PDMA_MAP_BUS_ADDR transfer.
+ */
+static inline dma_addr_t
+pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
+{
+	WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
+	return paddr + state->bus_off;
+}
+
 #endif /* _LINUX_PCI_P2P_H */
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index cec43cd5ed62..24c359d9c879 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -13,6 +13,7 @@
 #include <linux/vmalloc.h>
 #include <linux/set_memory.h>
 #include <linux/slab.h>
+#include <linux/pci-p2pdma.h>
 #include "direct.h"
 
 /*
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 03/24] iommu: generalize the batched sync after map interface
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
  2025-04-23  8:12 ` [PATCH v9 01/24] PCI/P2PDMA: Refactor the p2pdma mapping helpers Leon Romanovsky
  2025-04-23  8:12 ` [PATCH v9 02/24] dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h Leon Romanovsky
@ 2025-04-23  8:12 ` Leon Romanovsky
  2025-04-23 17:15   ` Jason Gunthorpe
  2025-04-26  0:52   ` Luis Chamberlain
  2025-04-23  8:12 ` [PATCH v9 04/24] iommu: add kernel-doc for iommu_unmap_fast Leon Romanovsky
                   ` (20 subsequent siblings)
  23 siblings, 2 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:12 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Leon Romanovsky

From: Christoph Hellwig <hch@lst.de>

For the upcoming IOVA-based DMA API we want to use the interface to batch
the sync after mapping multiple entries from dma-iommu without having a
scatterlist.

For that, move more sanity checks from the callers into __iommu_map and
make that function available outside of iommu.c as iommu_map_nosync.

Add a wrapper for the iotlb_sync_map callback as iommu_sync_map so that
callers don't need to poke into the methods directly.
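
As a rough usage sketch of the resulting batching pattern (mirroring what
iommu_map_sg() does below; domain, iova, phys[], len[], nr and prot are
placeholders):

  size_t mapped = 0;
  int i, ret;

  /* Map any number of chunks without an IOTLB sync per call... */
  for (i = 0; i < nr; i++) {
      ret = iommu_map_nosync(domain, iova + mapped, phys[i],
                             len[i], prot, GFP_KERNEL);
      if (ret)
          goto out_unmap;
      mapped += len[i];
  }

  /* ...and pay for a single batched sync at the end. */
  ret = iommu_sync_map(domain, iova, mapped);
  if (ret)
      goto out_unmap;
  return 0;

  out_unmap:
      iommu_unmap(domain, iova, mapped);
      return ret;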

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/iommu/iommu.c | 65 +++++++++++++++++++------------------------
 include/linux/iommu.h |  4 +++
 2 files changed, 33 insertions(+), 36 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index c8033ca66377..3dc47f62d9ff 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2440,8 +2440,8 @@ static size_t iommu_pgsize(struct iommu_domain *domain, unsigned long iova,
 	return pgsize;
 }
 
-static int __iommu_map(struct iommu_domain *domain, unsigned long iova,
-		       phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
+int iommu_map_nosync(struct iommu_domain *domain, unsigned long iova,
+		phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
 {
 	const struct iommu_domain_ops *ops = domain->ops;
 	unsigned long orig_iova = iova;
@@ -2450,12 +2450,19 @@ static int __iommu_map(struct iommu_domain *domain, unsigned long iova,
 	phys_addr_t orig_paddr = paddr;
 	int ret = 0;
 
+	might_sleep_if(gfpflags_allow_blocking(gfp));
+
 	if (unlikely(!(domain->type & __IOMMU_DOMAIN_PAGING)))
 		return -EINVAL;
 
 	if (WARN_ON(!ops->map_pages || domain->pgsize_bitmap == 0UL))
 		return -ENODEV;
 
+	/* Discourage passing strange GFP flags */
+	if (WARN_ON_ONCE(gfp & (__GFP_COMP | __GFP_DMA | __GFP_DMA32 |
+				__GFP_HIGHMEM)))
+		return -EINVAL;
+
 	/* find out the minimum page size supported */
 	min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
 
@@ -2503,31 +2510,27 @@ static int __iommu_map(struct iommu_domain *domain, unsigned long iova,
 	return ret;
 }
 
-int iommu_map(struct iommu_domain *domain, unsigned long iova,
-	      phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
+int iommu_sync_map(struct iommu_domain *domain, unsigned long iova, size_t size)
 {
 	const struct iommu_domain_ops *ops = domain->ops;
-	int ret;
-
-	might_sleep_if(gfpflags_allow_blocking(gfp));
 
-	/* Discourage passing strange GFP flags */
-	if (WARN_ON_ONCE(gfp & (__GFP_COMP | __GFP_DMA | __GFP_DMA32 |
-				__GFP_HIGHMEM)))
-		return -EINVAL;
+	if (!ops->iotlb_sync_map)
+		return 0;
+	return ops->iotlb_sync_map(domain, iova, size);
+}
 
-	ret = __iommu_map(domain, iova, paddr, size, prot, gfp);
-	if (ret == 0 && ops->iotlb_sync_map) {
-		ret = ops->iotlb_sync_map(domain, iova, size);
-		if (ret)
-			goto out_err;
-	}
+int iommu_map(struct iommu_domain *domain, unsigned long iova,
+	      phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
+{
+	int ret;
 
-	return ret;
+	ret = iommu_map_nosync(domain, iova, paddr, size, prot, gfp);
+	if (ret)
+		return ret;
 
-out_err:
-	/* undo mappings already done */
-	iommu_unmap(domain, iova, size);
+	ret = iommu_sync_map(domain, iova, size);
+	if (ret)
+		iommu_unmap(domain, iova, size);
 
 	return ret;
 }
@@ -2627,26 +2630,17 @@ ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 		     struct scatterlist *sg, unsigned int nents, int prot,
 		     gfp_t gfp)
 {
-	const struct iommu_domain_ops *ops = domain->ops;
 	size_t len = 0, mapped = 0;
 	phys_addr_t start;
 	unsigned int i = 0;
 	int ret;
 
-	might_sleep_if(gfpflags_allow_blocking(gfp));
-
-	/* Discourage passing strange GFP flags */
-	if (WARN_ON_ONCE(gfp & (__GFP_COMP | __GFP_DMA | __GFP_DMA32 |
-				__GFP_HIGHMEM)))
-		return -EINVAL;
-
 	while (i <= nents) {
 		phys_addr_t s_phys = sg_phys(sg);
 
 		if (len && s_phys != start + len) {
-			ret = __iommu_map(domain, iova + mapped, start,
+			ret = iommu_map_nosync(domain, iova + mapped, start,
 					len, prot, gfp);
-
 			if (ret)
 				goto out_err;
 
@@ -2669,11 +2663,10 @@ ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 			sg = sg_next(sg);
 	}
 
-	if (ops->iotlb_sync_map) {
-		ret = ops->iotlb_sync_map(domain, iova, mapped);
-		if (ret)
-			goto out_err;
-	}
+	ret = iommu_sync_map(domain, iova, mapped);
+	if (ret)
+		goto out_err;
+
 	return mapped;
 
 out_err:
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index ccce8a751e2a..ce472af8e9c3 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -872,6 +872,10 @@ extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
 extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
 		     phys_addr_t paddr, size_t size, int prot, gfp_t gfp);
+int iommu_map_nosync(struct iommu_domain *domain, unsigned long iova,
+		phys_addr_t paddr, size_t size, int prot, gfp_t gfp);
+int iommu_sync_map(struct iommu_domain *domain, unsigned long iova,
+		size_t size);
 extern size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova,
 			  size_t size);
 extern size_t iommu_unmap_fast(struct iommu_domain *domain,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 04/24] iommu: add kernel-doc for iommu_unmap_fast
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (2 preceding siblings ...)
  2025-04-23  8:12 ` [PATCH v9 03/24] iommu: generalize the batched sync after map interface Leon Romanovsky
@ 2025-04-23  8:12 ` Leon Romanovsky
  2025-04-23 17:15   ` Jason Gunthorpe
  2025-04-26  0:55   ` Luis Chamberlain
  2025-04-23  8:12 ` [PATCH v9 05/24] dma-mapping: Provide an interface to allow allocate IOVA Leon Romanovsky
                   ` (19 subsequent siblings)
  23 siblings, 2 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:12 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Leon Romanovsky, Jake Edge, Jonathan Corbet, Jason Gunthorpe,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Luis Chamberlain,
	Matthew Wilcox, Dan Williams, Kanchan Joshi, Chaitanya Kulkarni,
	Jason Gunthorpe

From: Leon Romanovsky <leonro@nvidia.com>

Add a kernel-doc section for iommu_unmap_fast to document the existing
limitation of the underlying functions, which can't split individual ranges.
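
For context, the batched flushing the kernel-doc refers to looks roughly
like this at a call site (a sketch only; iommu_iotlb_gather_init() and
iommu_iotlb_sync() are the existing gather helpers from <linux/iommu.h>,
and domain, iova and size are placeholders):

  struct iommu_iotlb_gather gather;
  size_t unmapped;

  iommu_iotlb_gather_init(&gather);
  /* Each range must cover whole prior iommu_map() calls, never a part. */
  unmapped = iommu_unmap_fast(domain, iova, size, &gather);
  /* One IOTLB flush for everything gathered above. */
  iommu_iotlb_sync(domain, &gather);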

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/iommu/iommu.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3dc47f62d9ff..66b0bf6418ef 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2618,6 +2618,25 @@ size_t iommu_unmap(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_unmap);
 
+/**
+ * iommu_unmap_fast() - Remove mappings from a range of IOVA without IOTLB sync
+ * @domain: Domain to manipulate
+ * @iova: IO virtual address to start
+ * @size: Length of the range starting from @iova
+ * @iotlb_gather: range information for a pending IOTLB flush
+ *
+ * iommu_unmap_fast() will remove a translation created by iommu_map().
+ * It can't subdivide a mapping created by iommu_map(), so it should be
+ * called with IOVA ranges that match what was passed to iommu_map(). The
+ * range can aggregate contiguous iommu_map() calls so long as no individual
+ * range is split.
+ *
+ * Basically iommu_unmap_fast() is the same as iommu_unmap() but for callers
+ * which manage the IOTLB flushing externally to perform a batched sync.
+ *
+ * Returns: Number of bytes of IOVA unmapped. iova + res will be the point
+ * unmapping stopped.
+ */
 size_t iommu_unmap_fast(struct iommu_domain *domain,
 			unsigned long iova, size_t size,
 			struct iommu_iotlb_gather *iotlb_gather)
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 05/24] dma-mapping: Provide an interface to allow allocate IOVA
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (3 preceding siblings ...)
  2025-04-23  8:12 ` [PATCH v9 04/24] iommu: add kernel-doc for iommu_unmap_fast Leon Romanovsky
@ 2025-04-23  8:12 ` Leon Romanovsky
  2025-04-26  1:10   ` Luis Chamberlain
  2025-04-23  8:12 ` [PATCH v9 06/24] iommu/dma: Factor out a iommu_dma_map_swiotlb helper Leon Romanovsky
                   ` (18 subsequent siblings)
  23 siblings, 1 reply; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:12 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Leon Romanovsky, Jake Edge, Jonathan Corbet, Jason Gunthorpe,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Luis Chamberlain,
	Matthew Wilcox, Dan Williams, Kanchan Joshi, Chaitanya Kulkarni

From: Leon Romanovsky <leonro@nvidia.com>

The existing .map_page() callback provides both allocation of an IOVA
and linking of DMA pages. That combination works great for most of the
callers who use it in control paths, but is less effective in fast
paths where there may be multiple calls to map_page().

These advanced callers already manage their data in some sort of
database and can perform IOVA allocation in advance, leaving the range
linkage operation for the fast path.

Provide an interface to allocate/deallocate IOVA; the next patch will
link/unlink DMA ranges to that specific IOVA.

In the new API a DMA mapping transaction is identified by a
struct dma_iova_state, which holds some precomputed information
for the transaction which does not change for each page being
mapped; also add a check whether IOVA can be used for the specific
transaction.

The API is exported from dma-iommu as it is the only supported
implementation; the namespace is clearly different from the iommu_*
functions, which are not allowed to be used. This code layout allows us
to save a function call per API call used in the datapath, as well as a
lot of boilerplate code.
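
A hypothetical caller-side sketch of the allocation half added here
(total_len stands for the caller's precomputed worst case; the
link/unlink half arrives in a later patch):

  struct dma_iova_state state;

  if (dma_iova_try_alloc(dev, &state, 0, total_len)) {
      /*
       * IOVA space is reserved and dma_use_iova(&state) now returns
       * true; ranges are linked with the helpers added in the
       * following patches.
       */
      /* ... transfer happens here ... */
      dma_iova_free(dev, &state);
  } else {
      /* Regular DMA API: fall back to dma_map_page() per page. */
  }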

Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/iommu/dma-iommu.c   | 86 +++++++++++++++++++++++++++++++++++++
 include/linux/dma-mapping.h | 48 +++++++++++++++++++++
 2 files changed, 134 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 145606498b4c..6ca9305a26cc 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1723,6 +1723,92 @@ size_t iommu_dma_max_mapping_size(struct device *dev)
 	return SIZE_MAX;
 }
 
+/**
+ * dma_iova_try_alloc - Try to allocate an IOVA space
+ * @dev: Device to allocate the IOVA space for
+ * @state: IOVA state
+ * @phys: physical address
+ * @size: IOVA size
+ *
+ * Check if @dev supports the IOVA-based DMA API, and if yes allocate IOVA space
+ * for the given base address and size.
+ *
+ * Note: @phys is only used to calculate the IOVA alignment. Callers that always
+ * do PAGE_SIZE aligned transfers can safely pass 0 here.
+ *
+ * Returns %true if the IOVA-based DMA API can be used and IOVA space has been
+ * allocated, or %false if the regular DMA API should be used.
+ */
+bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
+		phys_addr_t phys, size_t size)
+{
+	struct iommu_dma_cookie *cookie;
+	struct iommu_domain *domain;
+	struct iova_domain *iovad;
+	size_t iova_off;
+	dma_addr_t addr;
+
+	memset(state, 0, sizeof(*state));
+	if (!use_dma_iommu(dev))
+		return false;
+
+	domain = iommu_get_dma_domain(dev);
+	cookie = domain->iova_cookie;
+	iovad = &cookie->iovad;
+	iova_off = iova_offset(iovad, phys);
+
+	if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&
+	    iommu_deferred_attach(dev, iommu_get_domain_for_dev(dev)))
+		return false;
+
+	if (WARN_ON_ONCE(!size))
+		return false;
+
+	/*
+	 * DMA_IOVA_USE_SWIOTLB is flag which is set by dma-iommu
+	 * internals, make sure that caller didn't set it and/or
+	 * didn't use this interface to map SIZE_MAX.
+	 */
+	if (WARN_ON_ONCE((u64)size & DMA_IOVA_USE_SWIOTLB))
+		return false;
+
+	addr = iommu_dma_alloc_iova(domain,
+			iova_align(iovad, size + iova_off),
+			dma_get_mask(dev), dev);
+	if (!addr)
+		return false;
+
+	state->addr = addr + iova_off;
+	state->__size = size;
+	return true;
+}
+EXPORT_SYMBOL_GPL(dma_iova_try_alloc);
+
+/**
+ * dma_iova_free - Free an IOVA space
+ * @dev: Device to free the IOVA space for
+ * @state: IOVA state
+ *
+ * Undoes a successful dma_try_iova_alloc().
+ *
+ * Note that all dma_iova_link() calls need to be undone first.  For callers
+ * that never call dma_iova_unlink(), dma_iova_destroy() can be used instead
+ * which unlinks all ranges and frees the IOVA space in a single efficient
+ * operation.
+ */
+void dma_iova_free(struct device *dev, struct dma_iova_state *state)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	size_t iova_start_pad = iova_offset(iovad, state->addr);
+	size_t size = dma_iova_size(state);
+
+	iommu_dma_free_iova(domain, state->addr - iova_start_pad,
+			iova_align(iovad, size + iova_start_pad), NULL);
+}
+EXPORT_SYMBOL_GPL(dma_iova_free);
+
 void iommu_setup_dma_ops(struct device *dev)
 {
 	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index b79925b1c433..de7f73810d54 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -72,6 +72,22 @@
 
 #define DMA_BIT_MASK(n)	(((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
 
+struct dma_iova_state {
+	dma_addr_t addr;
+	u64 __size;
+};
+
+/*
+ * Use the high bit to mark if we used swiotlb for one or more ranges.
+ */
+#define DMA_IOVA_USE_SWIOTLB		(1ULL << 63)
+
+static inline size_t dma_iova_size(struct dma_iova_state *state)
+{
+	/* Casting is needed for 32-bits systems */
+	return (size_t)(state->__size & ~DMA_IOVA_USE_SWIOTLB);
+}
+
 #ifdef CONFIG_DMA_API_DEBUG
 void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr);
 void debug_dma_map_single(struct device *dev, const void *addr,
@@ -277,6 +293,38 @@ static inline int dma_mmap_noncontiguous(struct device *dev,
 }
 #endif /* CONFIG_HAS_DMA */
 
+#ifdef CONFIG_IOMMU_DMA
+/**
+ * dma_use_iova - check if the IOVA API is used for this state
+ * @state: IOVA state
+ *
+ * Return %true if the DMA transfers uses the dma_iova_*() calls or %false if
+ * they can't be used.
+ */
+static inline bool dma_use_iova(struct dma_iova_state *state)
+{
+	return state->__size != 0;
+}
+
+bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
+		phys_addr_t phys, size_t size);
+void dma_iova_free(struct device *dev, struct dma_iova_state *state);
+#else /* CONFIG_IOMMU_DMA */
+static inline bool dma_use_iova(struct dma_iova_state *state)
+{
+	return false;
+}
+static inline bool dma_iova_try_alloc(struct device *dev,
+		struct dma_iova_state *state, phys_addr_t phys, size_t size)
+{
+	return false;
+}
+static inline void dma_iova_free(struct device *dev,
+		struct dma_iova_state *state)
+{
+}
+#endif /* CONFIG_IOMMU_DMA */
+
 #if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
 void __dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, size_t size,
 		enum dma_data_direction dir);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 06/24] iommu/dma: Factor out a iommu_dma_map_swiotlb helper
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (4 preceding siblings ...)
  2025-04-23  8:12 ` [PATCH v9 05/24] dma-mapping: Provide an interface to allow allocate IOVA Leon Romanovsky
@ 2025-04-23  8:12 ` Leon Romanovsky
  2025-04-26  1:14   ` Luis Chamberlain
  2025-04-23  8:12 ` [PATCH v9 07/24] dma-mapping: Implement link/unlink ranges API Leon Romanovsky
                   ` (17 subsequent siblings)
  23 siblings, 1 reply; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:12 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Leon Romanovsky

From: Christoph Hellwig <hch@lst.de>

Split the swiotlb bounce buffering logic out of iommu_dma_map_page into a
separate helper. This not only keeps the code neatly separated, but will
also allow for reuse in another caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/iommu/dma-iommu.c | 73 ++++++++++++++++++++++-----------------
 1 file changed, 41 insertions(+), 32 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 6ca9305a26cc..d2c298083e0a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1138,6 +1138,43 @@ void iommu_dma_sync_sg_for_device(struct device *dev, struct scatterlist *sgl,
 			arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);
 }
 
+static phys_addr_t iommu_dma_map_swiotlb(struct device *dev, phys_addr_t phys,
+		size_t size, enum dma_data_direction dir, unsigned long attrs)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iova_domain *iovad = &domain->iova_cookie->iovad;
+
+	if (!is_swiotlb_active(dev)) {
+		dev_warn_once(dev, "DMA bounce buffers are inactive, unable to map unaligned transaction.\n");
+		return (phys_addr_t)DMA_MAPPING_ERROR;
+	}
+
+	trace_swiotlb_bounced(dev, phys, size);
+
+	phys = swiotlb_tbl_map_single(dev, phys, size, iova_mask(iovad), dir,
+			attrs);
+
+	/*
+	 * Untrusted devices should not see padding areas with random leftover
+	 * kernel data, so zero the pre- and post-padding.
+	 * swiotlb_tbl_map_single() has initialized the bounce buffer proper to
+	 * the contents of the original memory buffer.
+	 */
+	if (phys != (phys_addr_t)DMA_MAPPING_ERROR && dev_is_untrusted(dev)) {
+		size_t start, virt = (size_t)phys_to_virt(phys);
+
+		/* Pre-padding */
+		start = iova_align_down(iovad, virt);
+		memset((void *)start, 0, virt - start);
+
+		/* Post-padding */
+		start = virt + size;
+		memset((void *)start, 0, iova_align(iovad, start) - start);
+	}
+
+	return phys;
+}
+
 dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
 	      unsigned long offset, size_t size, enum dma_data_direction dir,
 	      unsigned long attrs)
@@ -1151,42 +1188,14 @@ dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
 	dma_addr_t iova, dma_mask = dma_get_mask(dev);
 
 	/*
-	 * If both the physical buffer start address and size are
-	 * page aligned, we don't need to use a bounce page.
+	 * If both the physical buffer start address and size are page aligned,
+	 * we don't need to use a bounce page.
 	 */
 	if (dev_use_swiotlb(dev, size, dir) &&
 	    iova_offset(iovad, phys | size)) {
-		if (!is_swiotlb_active(dev)) {
-			dev_warn_once(dev, "DMA bounce buffers are inactive, unable to map unaligned transaction.\n");
-			return DMA_MAPPING_ERROR;
-		}
-
-		trace_swiotlb_bounced(dev, phys, size);
-
-		phys = swiotlb_tbl_map_single(dev, phys, size,
-					      iova_mask(iovad), dir, attrs);
-
-		if (phys == DMA_MAPPING_ERROR)
+		phys = iommu_dma_map_swiotlb(dev, phys, size, dir, attrs);
+		if (phys == (phys_addr_t)DMA_MAPPING_ERROR)
 			return DMA_MAPPING_ERROR;
-
-		/*
-		 * Untrusted devices should not see padding areas with random
-		 * leftover kernel data, so zero the pre- and post-padding.
-		 * swiotlb_tbl_map_single() has initialized the bounce buffer
-		 * proper to the contents of the original memory buffer.
-		 */
-		if (dev_is_untrusted(dev)) {
-			size_t start, virt = (size_t)phys_to_virt(phys);
-
-			/* Pre-padding */
-			start = iova_align_down(iovad, virt);
-			memset((void *)start, 0, virt - start);
-
-			/* Post-padding */
-			start = virt + size;
-			memset((void *)start, 0,
-			       iova_align(iovad, start) - start);
-		}
 	}
 
 	if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 07/24] dma-mapping: Implement link/unlink ranges API
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (5 preceding siblings ...)
  2025-04-23  8:12 ` [PATCH v9 06/24] iommu/dma: Factor out a iommu_dma_map_swiotlb helper Leon Romanovsky
@ 2025-04-23  8:12 ` Leon Romanovsky
  2025-04-26 22:46   ` Luis Chamberlain
  2025-04-23  8:12 ` [PATCH v9 08/24] dma-mapping: add a dma_need_unmap helper Leon Romanovsky
                   ` (16 subsequent siblings)
  23 siblings, 1 reply; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:12 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Leon Romanovsky, Jake Edge, Jonathan Corbet, Jason Gunthorpe,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Luis Chamberlain,
	Matthew Wilcox, Dan Williams, Kanchan Joshi, Chaitanya Kulkarni

From: Leon Romanovsky <leonro@nvidia.com>

Introduce new DMA APIs to perform DMA linkage of buffers
in layers higher than DMA.

In the proposed API, callers will perform the following steps:
In map path:
	if (dma_can_use_iova(...))
	    dma_iova_alloc()
	    for (page in range)
	       dma_iova_link_next(...)
	    dma_iova_sync(...)
	else
	     /* Fallback to legacy map pages */
             for (all pages)
	       dma_map_page(...)

In unmap path:
	if (dma_can_use_iova(...))
	     dma_iova_destroy()
	else
	     for (all pages)
		dma_unmap_page(...)

Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/iommu/dma-iommu.c   | 261 ++++++++++++++++++++++++++++++++++++
 include/linux/dma-mapping.h |  32 +++++
 2 files changed, 293 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d2c298083e0a..2e014db5a244 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1818,6 +1818,267 @@ void dma_iova_free(struct device *dev, struct dma_iova_state *state)
 }
 EXPORT_SYMBOL_GPL(dma_iova_free);
 
+static int __dma_iova_link(struct device *dev, dma_addr_t addr,
+		phys_addr_t phys, size_t size, enum dma_data_direction dir,
+		unsigned long attrs)
+{
+	bool coherent = dev_is_dma_coherent(dev);
+
+	if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+		arch_sync_dma_for_device(phys, size, dir);
+
+	return iommu_map_nosync(iommu_get_dma_domain(dev), addr, phys, size,
+			dma_info_to_prot(dir, coherent, attrs), GFP_ATOMIC);
+}
+
+static int iommu_dma_iova_bounce_and_link(struct device *dev, dma_addr_t addr,
+		phys_addr_t phys, size_t bounce_len,
+		enum dma_data_direction dir, unsigned long attrs,
+		size_t iova_start_pad)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iova_domain *iovad = &domain->iova_cookie->iovad;
+	phys_addr_t bounce_phys;
+	int error;
+
+	bounce_phys = iommu_dma_map_swiotlb(dev, phys, bounce_len, dir, attrs);
+	if (bounce_phys == DMA_MAPPING_ERROR)
+		return -ENOMEM;
+
+	error = __dma_iova_link(dev, addr - iova_start_pad,
+			bounce_phys - iova_start_pad,
+			iova_align(iovad, bounce_len), dir, attrs);
+	if (error)
+		swiotlb_tbl_unmap_single(dev, bounce_phys, bounce_len, dir,
+				attrs);
+	return error;
+}
+
+static int iommu_dma_iova_link_swiotlb(struct device *dev,
+		struct dma_iova_state *state, phys_addr_t phys, size_t offset,
+		size_t size, enum dma_data_direction dir, unsigned long attrs)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	size_t iova_start_pad = iova_offset(iovad, phys);
+	size_t iova_end_pad = iova_offset(iovad, phys + size);
+	dma_addr_t addr = state->addr + offset;
+	size_t mapped = 0;
+	int error;
+
+	if (iova_start_pad) {
+		size_t bounce_len = min(size, iovad->granule - iova_start_pad);
+
+		error = iommu_dma_iova_bounce_and_link(dev, addr, phys,
+				bounce_len, dir, attrs, iova_start_pad);
+		if (error)
+			return error;
+		state->__size |= DMA_IOVA_USE_SWIOTLB;
+
+		mapped += bounce_len;
+		size -= bounce_len;
+		if (!size)
+			return 0;
+	}
+
+	size -= iova_end_pad;
+	error = __dma_iova_link(dev, addr + mapped, phys + mapped, size, dir,
+			attrs);
+	if (error)
+		goto out_unmap;
+	mapped += size;
+
+	if (iova_end_pad) {
+		error = iommu_dma_iova_bounce_and_link(dev, addr + mapped,
+				phys + mapped, iova_end_pad, dir, attrs, 0);
+		if (error)
+			goto out_unmap;
+		state->__size |= DMA_IOVA_USE_SWIOTLB;
+	}
+
+	return 0;
+
+out_unmap:
+	dma_iova_unlink(dev, state, 0, mapped, dir, attrs);
+	return error;
+}
+
+/**
+ * dma_iova_link - Link a range of IOVA space
+ * @dev: DMA device
+ * @state: IOVA state
+ * @phys: physical address to link
+ * @offset: offset into the IOVA state to map into
+ * @size: size of the buffer
+ * @dir: DMA direction
+ * @attrs: attributes of mapping properties
+ *
+ * Link a range of IOVA space for the given IOVA state without IOTLB sync.
+ * This function is used to link multiple physical addresses in contiguous
+ * IOVA space without performing costly IOTLB sync.
+ *
+ * The caller is responsible for calling dma_iova_sync() to sync the IOTLB at
+ * the end of linkage.
+ */
+int dma_iova_link(struct device *dev, struct dma_iova_state *state,
+		phys_addr_t phys, size_t offset, size_t size,
+		enum dma_data_direction dir, unsigned long attrs)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	size_t iova_start_pad = iova_offset(iovad, phys);
+
+	if (WARN_ON_ONCE(iova_start_pad && offset > 0))
+		return -EIO;
+
+	if (dev_use_swiotlb(dev, size, dir) && iova_offset(iovad, phys | size))
+		return iommu_dma_iova_link_swiotlb(dev, state, phys, offset,
+				size, dir, attrs);
+
+	return __dma_iova_link(dev, state->addr + offset - iova_start_pad,
+			phys - iova_start_pad,
+			iova_align(iovad, size + iova_start_pad), dir, attrs);
+}
+EXPORT_SYMBOL_GPL(dma_iova_link);
+
+/**
+ * dma_iova_sync - Sync IOTLB
+ * @dev: DMA device
+ * @state: IOVA state
+ * @offset: offset into the IOVA state to sync
+ * @size: size of the buffer
+ *
+ * Sync IOTLB for the given IOVA state. This function should be called on
+ * the IOVA-contiguous range created by one or more dma_iova_link() calls
+ * to sync the IOTLB.
+ */
+int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
+		size_t offset, size_t size)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	dma_addr_t addr = state->addr + offset;
+	size_t iova_start_pad = iova_offset(iovad, addr);
+
+	return iommu_sync_map(domain, addr - iova_start_pad,
+		      iova_align(iovad, size + iova_start_pad));
+}
+EXPORT_SYMBOL_GPL(dma_iova_sync);
+
+static void iommu_dma_iova_unlink_range_slow(struct device *dev,
+		dma_addr_t addr, size_t size, enum dma_data_direction dir,
+		unsigned long attrs)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	size_t iova_start_pad = iova_offset(iovad, addr);
+	dma_addr_t end = addr + size;
+
+	do {
+		phys_addr_t phys;
+		size_t len;
+
+		phys = iommu_iova_to_phys(domain, addr);
+		if (WARN_ON(!phys))
+			/* Something very horrible happened here */
+			return;
+
+		len = min_t(size_t,
+			end - addr, iovad->granule - iova_start_pad);
+
+		if (!dev_is_dma_coherent(dev) &&
+		    !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+			arch_sync_dma_for_cpu(phys, len, dir);
+
+		swiotlb_tbl_unmap_single(dev, phys, len, dir, attrs);
+
+		addr += len;
+		iova_start_pad = 0;
+	} while (addr < end);
+}
+
+static void __iommu_dma_iova_unlink(struct device *dev,
+		struct dma_iova_state *state, size_t offset, size_t size,
+		enum dma_data_direction dir, unsigned long attrs,
+		bool free_iova)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	dma_addr_t addr = state->addr + offset;
+	size_t iova_start_pad = iova_offset(iovad, addr);
+	struct iommu_iotlb_gather iotlb_gather;
+	size_t unmapped;
+
+	if ((state->__size & DMA_IOVA_USE_SWIOTLB) ||
+	    (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)))
+		iommu_dma_iova_unlink_range_slow(dev, addr, size, dir, attrs);
+
+	iommu_iotlb_gather_init(&iotlb_gather);
+	iotlb_gather.queued = free_iova && READ_ONCE(cookie->fq_domain);
+
+	size = iova_align(iovad, size + iova_start_pad);
+	addr -= iova_start_pad;
+	unmapped = iommu_unmap_fast(domain, addr, size, &iotlb_gather);
+	WARN_ON(unmapped != size);
+
+	if (!iotlb_gather.queued)
+		iommu_iotlb_sync(domain, &iotlb_gather);
+	if (free_iova)
+		iommu_dma_free_iova(domain, addr, size, &iotlb_gather);
+}
+
+/**
+ * dma_iova_unlink - Unlink a range of IOVA space
+ * @dev: DMA device
+ * @state: IOVA state
+ * @offset: offset into the IOVA state to unlink
+ * @size: size of the buffer
+ * @dir: DMA direction
+ * @attrs: attributes of mapping properties
+ *
+ * Unlink a range of IOVA space for the given IOVA state.
+ */
+void dma_iova_unlink(struct device *dev, struct dma_iova_state *state,
+		size_t offset, size_t size, enum dma_data_direction dir,
+		unsigned long attrs)
+{
+	 __iommu_dma_iova_unlink(dev, state, offset, size, dir, attrs, false);
+}
+EXPORT_SYMBOL_GPL(dma_iova_unlink);
+
+/**
+ * dma_iova_destroy - Finish a DMA mapping transaction
+ * @dev: DMA device
+ * @state: IOVA state
+ * @mapped_len: number of bytes to unmap
+ * @dir: DMA direction
+ * @attrs: attributes of mapping properties
+ *
+ * Unlink the IOVA range up to @mapped_len and free the entire IOVA space. The
+ * IOVA range from the start of @state up to @mapped_len must all be linked,
+ * and must be the only linked range in @state.
+ */
+void dma_iova_destroy(struct device *dev, struct dma_iova_state *state,
+		size_t mapped_len, enum dma_data_direction dir,
+		unsigned long attrs)
+{
+	if (mapped_len)
+		__iommu_dma_iova_unlink(dev, state, 0, mapped_len, dir, attrs,
+				true);
+	else
+		/*
+		 * We can be here if first call to dma_iova_link() failed and
+		 * there is nothing to unlink, so let's be more clear.
+		 */
+		dma_iova_free(dev, state);
+}
+EXPORT_SYMBOL_GPL(dma_iova_destroy);
+
 void iommu_setup_dma_ops(struct device *dev)
 {
 	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index de7f73810d54..a71e110f1e9d 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -309,6 +309,17 @@ static inline bool dma_use_iova(struct dma_iova_state *state)
 bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
 		phys_addr_t phys, size_t size);
 void dma_iova_free(struct device *dev, struct dma_iova_state *state);
+void dma_iova_destroy(struct device *dev, struct dma_iova_state *state,
+		size_t mapped_len, enum dma_data_direction dir,
+		unsigned long attrs);
+int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
+		size_t offset, size_t size);
+int dma_iova_link(struct device *dev, struct dma_iova_state *state,
+		phys_addr_t phys, size_t offset, size_t size,
+		enum dma_data_direction dir, unsigned long attrs);
+void dma_iova_unlink(struct device *dev, struct dma_iova_state *state,
+		size_t offset, size_t size, enum dma_data_direction dir,
+		unsigned long attrs);
 #else /* CONFIG_IOMMU_DMA */
 static inline bool dma_use_iova(struct dma_iova_state *state)
 {
@@ -323,6 +334,27 @@ static inline void dma_iova_free(struct device *dev,
 		struct dma_iova_state *state)
 {
 }
+static inline void dma_iova_destroy(struct device *dev,
+		struct dma_iova_state *state, size_t mapped_len,
+		enum dma_data_direction dir, unsigned long attrs)
+{
+}
+static inline int dma_iova_sync(struct device *dev,
+		struct dma_iova_state *state, size_t offset, size_t size)
+{
+	return -EOPNOTSUPP;
+}
+static inline int dma_iova_link(struct device *dev,
+		struct dma_iova_state *state, phys_addr_t phys, size_t offset,
+		size_t size, enum dma_data_direction dir, unsigned long attrs)
+{
+	return -EOPNOTSUPP;
+}
+static inline void dma_iova_unlink(struct device *dev,
+		struct dma_iova_state *state, size_t offset, size_t size,
+		enum dma_data_direction dir, unsigned long attrs)
+{
+}
 #endif /* CONFIG_IOMMU_DMA */
 
 #if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 08/24] dma-mapping: add a dma_need_unmap helper
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (6 preceding siblings ...)
  2025-04-23  8:12 ` [PATCH v9 07/24] dma-mapping: Implement link/unlink ranges API Leon Romanovsky
@ 2025-04-23  8:12 ` Leon Romanovsky
  2025-04-26 22:49   ` Luis Chamberlain
  2025-04-23  8:13 ` [PATCH v9 09/24] docs: core-api: document the IOVA-based API Leon Romanovsky
                   ` (15 subsequent siblings)
  23 siblings, 1 reply; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:12 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Leon Romanovsky

From: Christoph Hellwig <hch@lst.de>

Add a helper that allows a driver to skip calling dma_unmap_*
if the DMA layer can guarantee that they are no-ops.
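
For illustration only, a toy completion path that skips both the unmap calls
and the storage of DMA addresses when the helper reports that unmap is a
no-op ("toy_" names are invented):

struct toy_req {
	dma_addr_t *dma;	/* only allocated when dma_need_unmap() */
	unsigned int nr;
};

static void toy_complete(struct device *dev, struct toy_req *req)
{
	unsigned int i;

	/* req->dma only needs to be allocated and filled when this is true */
	if (!dma_need_unmap(dev))
		return;

	for (i = 0; i < req->nr; i++)
		dma_unmap_page(dev, req->dma[i], PAGE_SIZE, DMA_FROM_DEVICE);
}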

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/dma-mapping.h |  5 +++++
 kernel/dma/mapping.c        | 18 ++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index a71e110f1e9d..d2f358c5a25d 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -406,6 +406,7 @@ static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
 	return dma_dev_need_sync(dev) ? __dma_need_sync(dev, dma_addr) : false;
 }
+bool dma_need_unmap(struct device *dev);
 #else /* !CONFIG_HAS_DMA || !CONFIG_DMA_NEED_SYNC */
 static inline bool dma_dev_need_sync(const struct device *dev)
 {
@@ -431,6 +432,10 @@ static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
 	return false;
 }
+static inline bool dma_need_unmap(struct device *dev)
+{
+	return false;
+}
 #endif /* !CONFIG_HAS_DMA || !CONFIG_DMA_NEED_SYNC */
 
 struct page *dma_alloc_pages(struct device *dev, size_t size,
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index cda127027e48..3c3204ad2839 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -443,6 +443,24 @@ bool __dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 }
 EXPORT_SYMBOL_GPL(__dma_need_sync);
 
+/**
+ * dma_need_unmap - does this device need dma_unmap_* operations
+ * @dev: device to check
+ *
+ * If this function returns %false, drivers can skip calling dma_unmap_* after
+ * finishing an I/O.  This function must be called after all mappings that might
+ * need to be unmapped have been performed.
+ */
+bool dma_need_unmap(struct device *dev)
+{
+	if (!dma_map_direct(dev, get_dma_ops(dev)))
+		return true;
+	if (!dev->dma_skip_sync)
+		return true;
+	return IS_ENABLED(CONFIG_DMA_API_DEBUG);
+}
+EXPORT_SYMBOL_GPL(dma_need_unmap);
+
 static void dma_setup_need_sync(struct device *dev)
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 09/24] docs: core-api: document the IOVA-based API
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (7 preceding siblings ...)
  2025-04-23  8:12 ` [PATCH v9 08/24] dma-mapping: add a dma_need_unmap helper Leon Romanovsky
@ 2025-04-23  8:13 ` Leon Romanovsky
  2025-04-23  8:13 ` [PATCH v9 10/24] mm/hmm: let users to tag specific PFN with DMA mapped bit Leon Romanovsky
                   ` (14 subsequent siblings)
  23 siblings, 0 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:13 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Leon Romanovsky

From: Christoph Hellwig <hch@lst.de>

Add an explanation of the newly added IOVA-based mapping API.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 Documentation/core-api/dma-api.rst | 71 ++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst
index 8e3cce3d0a23..2ad08517e626 100644
--- a/Documentation/core-api/dma-api.rst
+++ b/Documentation/core-api/dma-api.rst
@@ -530,6 +530,77 @@ routines, e.g.:::
 		....
 	}
 
+Part Ie - IOVA-based DMA mappings
+---------------------------------
+
+These APIs allow a very efficient mapping when using an IOMMU.  They are an
+optional path that requires extra code and are only recommended for drivers
+where DMA mapping performance, or the space used for storing the DMA addresses,
+matters.  All the considerations from the previous section apply here as well.
+
+::
+
+    bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
+		phys_addr_t phys, size_t size);
+
+Is used to try to allocate IOVA space for the mapping operation.  If it returns
+false this API can't be used for the given device and the normal streaming
+DMA mapping API should be used.  The ``struct dma_iova_state`` is allocated
+by the driver and must be kept around until unmap time.
+
+::
+
+    static inline bool dma_use_iova(struct dma_iova_state *state)
+
+Can be used by the driver to check if the IOVA-based API is used after a
+call to dma_iova_try_alloc.  This can be useful in the unmap path.
+
+::
+
+    int dma_iova_link(struct device *dev, struct dma_iova_state *state,
+		phys_addr_t phys, size_t offset, size_t size,
+		enum dma_data_direction dir, unsigned long attrs);
+
+Is used to link ranges to the IOVA previously allocated.  The start of all
+but the first call to dma_iova_link for a given state must be aligned
+to the DMA merge boundary returned by ``dma_get_merge_boundary()``, and
+the size of all but the last range must be aligned to the DMA merge boundary
+as well.
+
+::
+
+    int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
+		size_t offset, size_t size);
+
+Must be called to sync the IOMMU page tables for the IOVA range mapped by one or
+more calls to ``dma_iova_link()``.
+
+For drivers that use a one-shot mapping, all ranges can be unmapped and the
+IOVA freed by calling:
+
+::
+
+   void dma_iova_destroy(struct device *dev, struct dma_iova_state *state,
+		size_t mapped_len, enum dma_data_direction dir,
+                unsigned long attrs);
+
+Alternatively drivers can dynamically manage the IOVA space by unmapping
+and mapping individual regions.  In that case
+
+::
+
+    void dma_iova_unlink(struct device *dev, struct dma_iova_state *state,
+		size_t offset, size_t size, enum dma_data_direction dir,
+		unsigned long attrs);
+
+is used to unmap a range previously mapped, and
+
+::
+
+   void dma_iova_free(struct device *dev, struct dma_iova_state *state);
+
+is used to free the IOVA space.  All regions must have been unmapped using
+``dma_iova_unlink()`` before calling ``dma_iova_free()``.
 
 Part II - Non-coherent DMA allocations
 --------------------------------------
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 10/24] mm/hmm: let users to tag specific PFN with DMA mapped bit
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (8 preceding siblings ...)
  2025-04-23  8:13 ` [PATCH v9 09/24] docs: core-api: document the IOVA-based API Leon Romanovsky
@ 2025-04-23  8:13 ` Leon Romanovsky
  2025-04-23 17:17   ` Jason Gunthorpe
  2025-04-23 17:54   ` Mika Penttilä
  2025-04-23  8:13 ` [PATCH v9 11/24] mm/hmm: provide generic DMA managing logic Leon Romanovsky
                   ` (13 subsequent siblings)
  23 siblings, 2 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:13 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Leon Romanovsky, Jake Edge, Jonathan Corbet, Jason Gunthorpe,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Luis Chamberlain,
	Matthew Wilcox, Dan Williams, Kanchan Joshi, Chaitanya Kulkarni

From: Leon Romanovsky <leonro@nvidia.com>

Introduce a new sticky flag (HMM_PFN_DMA_MAPPED) which isn't overwritten
by hmm_range_fault(). This flag allows users to tag specific PFNs with the
information that the PFN has already been DMA mapped.
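
For illustration only, a sketch of how a caller benefits from the sticky flag
when re-faulting a range; locking, notifier setup and -EBUSY retry are
omitted, and the "toy_" name is invented:

static void toy_refault(struct hmm_range *range)
{
	unsigned long i, npfns = (range->end - range->start) >> PAGE_SHIFT;

	/* hmm_pfns[] may already carry HMM_PFN_DMA_MAPPED from a prior pass */
	if (hmm_range_fault(range))
		return;

	for (i = 0; i < npfns; i++) {
		if (range->hmm_pfns[i] & HMM_PFN_DMA_MAPPED)
			continue;	/* preserved across the fault */
		/* ... DMA map the page here, then tag it ... */
		range->hmm_pfns[i] |= HMM_PFN_DMA_MAPPED;
	}
}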

Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/hmm.h | 17 +++++++++++++++
 mm/hmm.c            | 51 ++++++++++++++++++++++++++++-----------------
 2 files changed, 49 insertions(+), 19 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 126a36571667..a1ddbedc19c0 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -23,6 +23,8 @@ struct mmu_interval_notifier;
  * HMM_PFN_WRITE - if the page memory can be written to (requires HMM_PFN_VALID)
  * HMM_PFN_ERROR - accessing the pfn is impossible and the device should
  *                 fail. ie poisoned memory, special pages, no vma, etc
+ * HMM_PFN_DMA_MAPPED - Flag preserved on input-to-output transformation
+ *                      to mark that page is already DMA mapped
  *
  * On input:
  * 0                 - Return the current state of the page, do not fault it.
@@ -36,6 +38,13 @@ enum hmm_pfn_flags {
 	HMM_PFN_VALID = 1UL << (BITS_PER_LONG - 1),
 	HMM_PFN_WRITE = 1UL << (BITS_PER_LONG - 2),
 	HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
+
+	/*
+	 * Sticky flags, carried from input to output,
+	 * don't forget to update HMM_PFN_INOUT_FLAGS
+	 */
+	HMM_PFN_DMA_MAPPED = 1UL << (BITS_PER_LONG - 7),
+
 	HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 8),
 
 	/* Input flags */
@@ -57,6 +66,14 @@ static inline struct page *hmm_pfn_to_page(unsigned long hmm_pfn)
 	return pfn_to_page(hmm_pfn & ~HMM_PFN_FLAGS);
 }
 
+/*
+ * hmm_pfn_to_phys() - return physical address pointed to by a device entry
+ */
+static inline phys_addr_t hmm_pfn_to_phys(unsigned long hmm_pfn)
+{
+	return __pfn_to_phys(hmm_pfn & ~HMM_PFN_FLAGS);
+}
+
 /*
  * hmm_pfn_to_map_order() - return the CPU mapping size order
  *
diff --git a/mm/hmm.c b/mm/hmm.c
index 082f7b7c0b9e..51fe8b011cc7 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -39,13 +39,20 @@ enum {
 	HMM_NEED_ALL_BITS = HMM_NEED_FAULT | HMM_NEED_WRITE_FAULT,
 };
 
+enum {
+	/* These flags are carried from input-to-output */
+	HMM_PFN_INOUT_FLAGS = HMM_PFN_DMA_MAPPED,
+};
+
 static int hmm_pfns_fill(unsigned long addr, unsigned long end,
 			 struct hmm_range *range, unsigned long cpu_flags)
 {
 	unsigned long i = (addr - range->start) >> PAGE_SHIFT;
 
-	for (; addr < end; addr += PAGE_SIZE, i++)
-		range->hmm_pfns[i] = cpu_flags;
+	for (; addr < end; addr += PAGE_SIZE, i++) {
+		range->hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
+		range->hmm_pfns[i] |= cpu_flags;
+	}
 	return 0;
 }
 
@@ -202,8 +209,10 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr,
 		return hmm_vma_fault(addr, end, required_fault, walk);
 
 	pfn = pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
-		hmm_pfns[i] = pfn | cpu_flags;
+	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+		hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
+		hmm_pfns[i] |= pfn | cpu_flags;
+	}
 	return 0;
 }
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -230,14 +239,14 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 	unsigned long cpu_flags;
 	pte_t pte = ptep_get(ptep);
 	uint64_t pfn_req_flags = *hmm_pfn;
+	uint64_t new_pfn_flags = 0;
 
 	if (pte_none_mostly(pte)) {
 		required_fault =
 			hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
 		if (required_fault)
 			goto fault;
-		*hmm_pfn = 0;
-		return 0;
+		goto out;
 	}
 
 	if (!pte_present(pte)) {
@@ -253,16 +262,14 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			cpu_flags = HMM_PFN_VALID;
 			if (is_writable_device_private_entry(entry))
 				cpu_flags |= HMM_PFN_WRITE;
-			*hmm_pfn = swp_offset_pfn(entry) | cpu_flags;
-			return 0;
+			new_pfn_flags = swp_offset_pfn(entry) | cpu_flags;
+			goto out;
 		}
 
 		required_fault =
 			hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
-		if (!required_fault) {
-			*hmm_pfn = 0;
-			return 0;
-		}
+		if (!required_fault)
+			goto out;
 
 		if (!non_swap_entry(entry))
 			goto fault;
@@ -304,11 +311,13 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			pte_unmap(ptep);
 			return -EFAULT;
 		}
-		*hmm_pfn = HMM_PFN_ERROR;
-		return 0;
+		new_pfn_flags = HMM_PFN_ERROR;
+		goto out;
 	}
 
-	*hmm_pfn = pte_pfn(pte) | cpu_flags;
+	new_pfn_flags = pte_pfn(pte) | cpu_flags;
+out:
+	*hmm_pfn = (*hmm_pfn & HMM_PFN_INOUT_FLAGS) | new_pfn_flags;
 	return 0;
 
 fault:
@@ -448,8 +457,10 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
 		}
 
 		pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-		for (i = 0; i < npages; ++i, ++pfn)
-			hmm_pfns[i] = pfn | cpu_flags;
+		for (i = 0; i < npages; ++i, ++pfn) {
+			hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
+			hmm_pfns[i] |= pfn | cpu_flags;
+		}
 		goto out_unlock;
 	}
 
@@ -507,8 +518,10 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	}
 
 	pfn = pte_pfn(entry) + ((start & ~hmask) >> PAGE_SHIFT);
-	for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
-		range->hmm_pfns[i] = pfn | cpu_flags;
+	for (; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+		range->hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
+		range->hmm_pfns[i] |= pfn | cpu_flags;
+	}
 
 	spin_unlock(ptl);
 	return 0;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 11/24] mm/hmm: provide generic DMA managing logic
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (9 preceding siblings ...)
  2025-04-23  8:13 ` [PATCH v9 10/24] mm/hmm: let users to tag specific PFN with DMA mapped bit Leon Romanovsky
@ 2025-04-23  8:13 ` Leon Romanovsky
  2025-04-23 17:28   ` Jason Gunthorpe
  2025-04-23  8:13 ` [PATCH v9 12/24] RDMA/umem: Store ODP access mask information in PFN Leon Romanovsky
                   ` (12 subsequent siblings)
  23 siblings, 1 reply; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:13 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Leon Romanovsky, Jake Edge, Jonathan Corbet, Jason Gunthorpe,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Luis Chamberlain,
	Matthew Wilcox, Dan Williams, Kanchan Joshi, Chaitanya Kulkarni

From: Leon Romanovsky <leonro@nvidia.com>

HMM callers use a PFN list to populate the range while calling
hmm_range_fault(); the conversion from PFN to DMA address is done by the
callers with the help of another DMA list. However, this is wasteful on any
modern platform and, with the right logic, that DMA list can be avoided.

Provide generic logic to manage these lists and an interface to map/unmap
PFNs to DMA addresses, without requiring the callers to be experts in the
DMA core API.
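
For illustration only, a rough sketch of the intended calling pattern;
notifier setup, locking, -EBUSY retry and the device programming step are
omitted, and the "toy_" name is invented:

static int toy_map_range(struct device *dev, struct hmm_range *range,
			 struct hmm_dma_map *map, size_t npages)
{
	struct pci_p2pdma_map_state p2pdma_state = {};
	size_t i;
	int ret;

	ret = hmm_dma_map_alloc(dev, map, npages, PAGE_SIZE);
	if (ret)
		return ret;

	range->hmm_pfns = map->pfn_list;	/* fault directly into the map */
	ret = hmm_range_fault(range);
	if (ret)
		goto out_free;

	for (i = 0; i < npages; i++) {
		dma_addr_t dma = hmm_dma_map_pfn(dev, map, i, &p2pdma_state);

		if (dma == DMA_MAPPING_ERROR) {
			ret = -EFAULT;
			goto out_unmap;
		}
		/* program "dma" into the device page table here */
	}
	return 0;

out_unmap:
	while (i--)
		hmm_dma_unmap_pfn(dev, map, i);
out_free:
	hmm_dma_map_free(dev, map);
	return ret;
}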

Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/hmm-dma.h |  33 ++++++
 include/linux/hmm.h     |   4 +
 mm/hmm.c                | 215 +++++++++++++++++++++++++++++++++++++++-
 3 files changed, 251 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/hmm-dma.h

diff --git a/include/linux/hmm-dma.h b/include/linux/hmm-dma.h
new file mode 100644
index 000000000000..f58b9fc71999
--- /dev/null
+++ b/include/linux/hmm-dma.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* Copyright (c) 2024 NVIDIA Corporation & Affiliates */
+#ifndef LINUX_HMM_DMA_H
+#define LINUX_HMM_DMA_H
+
+#include <linux/dma-mapping.h>
+
+struct dma_iova_state;
+struct pci_p2pdma_map_state;
+
+/*
+ * struct hmm_dma_map - array of PFNs and DMA addresses
+ *
+ * @state: DMA IOVA state
+ * @pfns: array of PFNs
+ * @dma_list: array of DMA addresses
+ * @dma_entry_size: size of each DMA entry in the array
+ */
+struct hmm_dma_map {
+	struct dma_iova_state state;
+	unsigned long *pfn_list;
+	dma_addr_t *dma_list;
+	size_t dma_entry_size;
+};
+
+int hmm_dma_map_alloc(struct device *dev, struct hmm_dma_map *map,
+		      size_t nr_entries, size_t dma_entry_size);
+void hmm_dma_map_free(struct device *dev, struct hmm_dma_map *map);
+dma_addr_t hmm_dma_map_pfn(struct device *dev, struct hmm_dma_map *map,
+			   size_t idx,
+			   struct pci_p2pdma_map_state *p2pdma_state);
+bool hmm_dma_unmap_pfn(struct device *dev, struct hmm_dma_map *map, size_t idx);
+#endif /* LINUX_HMM_DMA_H */
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index a1ddbedc19c0..1bc33e4c20ea 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -23,6 +23,8 @@ struct mmu_interval_notifier;
  * HMM_PFN_WRITE - if the page memory can be written to (requires HMM_PFN_VALID)
  * HMM_PFN_ERROR - accessing the pfn is impossible and the device should
  *                 fail. ie poisoned memory, special pages, no vma, etc
+ * HMM_PFN_P2PDMA - P2P page
+ * HMM_PFN_P2PDMA_BUS - Bus mapped P2P transfer
  * HMM_PFN_DMA_MAPPED - Flag preserved on input-to-output transformation
  *                      to mark that page is already DMA mapped
  *
@@ -43,6 +45,8 @@ enum hmm_pfn_flags {
 	 * Sticky flags, carried from input to output,
 	 * don't forget to update HMM_PFN_INOUT_FLAGS
 	 */
+	HMM_PFN_P2PDMA     = 1UL << (BITS_PER_LONG - 5),
+	HMM_PFN_P2PDMA_BUS = 1UL << (BITS_PER_LONG - 6),
 	HMM_PFN_DMA_MAPPED = 1UL << (BITS_PER_LONG - 7),
 
 	HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 8),
diff --git a/mm/hmm.c b/mm/hmm.c
index 51fe8b011cc7..c0bee5aa00fc 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -10,6 +10,7 @@
  */
 #include <linux/pagewalk.h>
 #include <linux/hmm.h>
+#include <linux/hmm-dma.h>
 #include <linux/init.h>
 #include <linux/rmap.h>
 #include <linux/swap.h>
@@ -23,6 +24,7 @@
 #include <linux/sched/mm.h>
 #include <linux/jump_label.h>
 #include <linux/dma-mapping.h>
+#include <linux/pci-p2pdma.h>
 #include <linux/mmu_notifier.h>
 #include <linux/memory_hotplug.h>
 
@@ -41,7 +43,8 @@ enum {
 
 enum {
 	/* These flags are carried from input-to-output */
-	HMM_PFN_INOUT_FLAGS = HMM_PFN_DMA_MAPPED,
+	HMM_PFN_INOUT_FLAGS = HMM_PFN_DMA_MAPPED | HMM_PFN_P2PDMA |
+			      HMM_PFN_P2PDMA_BUS,
 };
 
 static int hmm_pfns_fill(unsigned long addr, unsigned long end,
@@ -620,3 +623,213 @@ int hmm_range_fault(struct hmm_range *range)
 	return ret;
 }
 EXPORT_SYMBOL(hmm_range_fault);
+
+/**
+ * hmm_dma_map_alloc - Allocate HMM map structure
+ * @dev: device to allocate structure for
+ * @map: HMM map to allocate
+ * @nr_entries: number of entries in the map
+ * @dma_entry_size: size of the DMA entry in the map
+ *
+ * Allocate the HMM map structure and all the lists it contains.
+ * Return 0 on success, -ENOMEM on failure.
+ */
+int hmm_dma_map_alloc(struct device *dev, struct hmm_dma_map *map,
+		      size_t nr_entries, size_t dma_entry_size)
+{
+	bool dma_need_sync = false;
+	bool use_iova;
+
+	if (!(nr_entries * PAGE_SIZE / dma_entry_size))
+		return -EINVAL;
+
+	/*
+	 * The HMM API violates our normal DMA buffer ownership rules and can't
+	 * transfer buffer ownership.  The dma_addressing_limited() check is a
+	 * best approximation to ensure no swiotlb buffering happens.
+	 */
+#ifdef CONFIG_DMA_NEED_SYNC
+	dma_need_sync = !dev->dma_skip_sync;
+#endif /* CONFIG_DMA_NEED_SYNC */
+	if (dma_need_sync || dma_addressing_limited(dev))
+		return -EOPNOTSUPP;
+
+	map->dma_entry_size = dma_entry_size;
+	map->pfn_list = kvcalloc(nr_entries, sizeof(*map->pfn_list),
+				 GFP_KERNEL | __GFP_NOWARN);
+	if (!map->pfn_list)
+		return -ENOMEM;
+
+	use_iova = dma_iova_try_alloc(dev, &map->state, 0,
+			nr_entries * PAGE_SIZE);
+	if (!use_iova && dma_need_unmap(dev)) {
+		map->dma_list = kvcalloc(nr_entries, sizeof(*map->dma_list),
+					 GFP_KERNEL | __GFP_NOWARN);
+		if (!map->dma_list)
+			goto err_dma;
+	}
+	return 0;
+
+err_dma:
+	kvfree(map->pfn_list);
+	return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(hmm_dma_map_alloc);
+
+/**
+ * hmm_dma_map_free - Free HMM map structure
+ * @dev: device to free structure from
+ * @map: HMM map containing the various lists and state
+ *
+ * Free the HMM map structure and all the lists it contains.
+ */
+void hmm_dma_map_free(struct device *dev, struct hmm_dma_map *map)
+{
+	if (dma_use_iova(&map->state))
+		dma_iova_free(dev, &map->state);
+	kvfree(map->pfn_list);
+	kvfree(map->dma_list);
+}
+EXPORT_SYMBOL_GPL(hmm_dma_map_free);
+
+/**
+ * hmm_dma_map_pfn - Map a physical HMM page to DMA address
+ * @dev: Device to map the page for
+ * @map: HMM map
+ * @idx: Index into the PFN and dma address arrays
+ * @p2pdma_state: PCI P2P state.
+ *
+ * Map the HMM PFN stored at index @idx in @map and return its DMA address.
+ * When the IOVA-based path is in use, the page is linked at the offset in
+ * the IOVA space that corresponds to @idx; otherwise it is mapped with
+ * dma_map_page().  Returns DMA_MAPPING_ERROR on failure.
+ */
+dma_addr_t hmm_dma_map_pfn(struct device *dev, struct hmm_dma_map *map,
+			   size_t idx,
+			   struct pci_p2pdma_map_state *p2pdma_state)
+{
+	struct dma_iova_state *state = &map->state;
+	dma_addr_t *dma_addrs = map->dma_list;
+	unsigned long *pfns = map->pfn_list;
+	struct page *page = hmm_pfn_to_page(pfns[idx]);
+	phys_addr_t paddr = hmm_pfn_to_phys(pfns[idx]);
+	size_t offset = idx * map->dma_entry_size;
+	unsigned long attrs = 0;
+	dma_addr_t dma_addr;
+	int ret;
+
+	if ((pfns[idx] & HMM_PFN_DMA_MAPPED) &&
+	    !(pfns[idx] & HMM_PFN_P2PDMA_BUS)) {
+		/*
+		 * We are in this flow when there is a need to resync flags,
+		 * for example when page was already linked in prefetch call
+		 * with READ flag and now we need to add WRITE flag
+		 *
+		 * This page was already programmed to HW and we don't want/need
+		 * to unlink and link it again just to resync flags.
+		 */
+		if (dma_use_iova(state))
+			return state->addr + offset;
+
+		/*
+		 * Without dma_need_unmap, the dma_addrs array is NULL, thus we
+		 * need to regenerate the address below even if there already
+		 * was a mapping. But !dma_need_unmap implies that the
+		 * mapping is stateless, so this is fine.
+		 */
+		if (dma_need_unmap(dev))
+			return dma_addrs[idx];
+
+		/* Continue to remapping */
+	}
+
+	switch (pci_p2pdma_state(p2pdma_state, dev, page)) {
+	case PCI_P2PDMA_MAP_NONE:
+		break;
+	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+		attrs |= DMA_ATTR_SKIP_CPU_SYNC;
+		pfns[idx] |= HMM_PFN_P2PDMA;
+		break;
+	case PCI_P2PDMA_MAP_BUS_ADDR:
+		pfns[idx] |= HMM_PFN_P2PDMA_BUS | HMM_PFN_DMA_MAPPED;
+		return pci_p2pdma_bus_addr_map(p2pdma_state, paddr);
+	default:
+		return DMA_MAPPING_ERROR;
+	}
+
+	if (dma_use_iova(state)) {
+		ret = dma_iova_link(dev, state, paddr, offset,
+				    map->dma_entry_size, DMA_BIDIRECTIONAL,
+				    attrs);
+		if (ret)
+			goto error;
+
+		ret = dma_iova_sync(dev, state, offset, map->dma_entry_size);
+		if (ret) {
+			dma_iova_unlink(dev, state, offset, map->dma_entry_size,
+					DMA_BIDIRECTIONAL, attrs);
+			goto error;
+		}
+
+		dma_addr = state->addr + offset;
+	} else {
+		if (WARN_ON_ONCE(dma_need_unmap(dev) && !dma_addrs))
+			goto error;
+
+		dma_addr = dma_map_page(dev, page, 0, map->dma_entry_size,
+					DMA_BIDIRECTIONAL);
+		if (dma_mapping_error(dev, dma_addr))
+			goto error;
+
+		if (dma_need_unmap(dev))
+			dma_addrs[idx] = dma_addr;
+	}
+	pfns[idx] |= HMM_PFN_DMA_MAPPED;
+	return dma_addr;
+error:
+	pfns[idx] &= ~HMM_PFN_P2PDMA;
+	return DMA_MAPPING_ERROR;
+
+}
+EXPORT_SYMBOL_GPL(hmm_dma_map_pfn);
+
+/**
+ * hmm_dma_unmap_pfn - Unmap a physical HMM page from DMA address
+ * @dev: Device to unmap the page from
+ * @map: HMM map
+ * @idx: Index of the PFN to unmap
+ *
+ * Returns true if the PFN was mapped and has been unmapped, false otherwise.
+ */
+bool hmm_dma_unmap_pfn(struct device *dev, struct hmm_dma_map *map, size_t idx)
+{
+	struct dma_iova_state *state = &map->state;
+	dma_addr_t *dma_addrs = map->dma_list;
+	unsigned long *pfns = map->pfn_list;
+	unsigned long attrs = 0;
+
+#define HMM_PFN_VALID_DMA (HMM_PFN_VALID | HMM_PFN_DMA_MAPPED)
+	if ((pfns[idx] & HMM_PFN_VALID_DMA) != HMM_PFN_VALID_DMA)
+		return false;
+#undef HMM_PFN_VALID_DMA
+
+	if (pfns[idx] & HMM_PFN_P2PDMA_BUS)
+		; /* no need to unmap bus address P2P mappings */
+	else if (dma_use_iova(state)) {
+		if (pfns[idx] & HMM_PFN_P2PDMA)
+			attrs |= DMA_ATTR_SKIP_CPU_SYNC;
+		dma_iova_unlink(dev, state, idx * map->dma_entry_size,
+				map->dma_entry_size, DMA_BIDIRECTIONAL, attrs);
+	} else if (dma_need_unmap(dev))
+		dma_unmap_page(dev, dma_addrs[idx], map->dma_entry_size,
+			       DMA_BIDIRECTIONAL);
+
+	pfns[idx] &=
+		~(HMM_PFN_DMA_MAPPED | HMM_PFN_P2PDMA | HMM_PFN_P2PDMA_BUS);
+	return true;
+}
+EXPORT_SYMBOL_GPL(hmm_dma_unmap_pfn);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 12/24] RDMA/umem: Store ODP access mask information in PFN
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (10 preceding siblings ...)
  2025-04-23  8:13 ` [PATCH v9 11/24] mm/hmm: provide generic DMA managing logic Leon Romanovsky
@ 2025-04-23  8:13 ` Leon Romanovsky
  2025-04-23 17:34   ` Jason Gunthorpe
  2025-04-23  8:13 ` [PATCH v9 13/24] RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage Leon Romanovsky
                   ` (11 subsequent siblings)
  23 siblings, 1 reply; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:13 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Leon Romanovsky, Jake Edge, Jonathan Corbet, Jason Gunthorpe,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Luis Chamberlain,
	Matthew Wilcox, Dan Williams, Kanchan Joshi, Chaitanya Kulkarni

From: Leon Romanovsky <leonro@nvidia.com>

As a preparation for removing the dma_list, store the access mask in the
PFN list entries and not in the dma_addr_t values.
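
In short, the permission bits move from the low bits of the DMA address into
the HMM PFN entry; a condensed before/after illustration (array names
shortened relative to the diff):

	/* before: permissions lived in the low bits of the DMA address */
	dma_list[idx] = dma_addr | ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT;

	/* after: permissions are HMM PFN flags, the DMA address stays clean */
	pfn_list[idx] |= HMM_PFN_WRITE | HMM_PFN_DMA_MAPPED;
	dma_list[idx] = dma_addr;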

Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/umem_odp.c   | 103 +++++++++++----------------
 drivers/infiniband/hw/mlx5/mlx5_ib.h |   1 +
 drivers/infiniband/hw/mlx5/odp.c     |  37 +++++-----
 drivers/infiniband/sw/rxe/rxe_odp.c  |  14 ++--
 include/rdma/ib_umem_odp.h           |  14 +---
 5 files changed, 70 insertions(+), 99 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index e9fa22d31c23..e1a5a567efb3 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -296,22 +296,11 @@ EXPORT_SYMBOL(ib_umem_odp_release);
 static int ib_umem_odp_map_dma_single_page(
 		struct ib_umem_odp *umem_odp,
 		unsigned int dma_index,
-		struct page *page,
-		u64 access_mask)
+		struct page *page)
 {
 	struct ib_device *dev = umem_odp->umem.ibdev;
 	dma_addr_t *dma_addr = &umem_odp->dma_list[dma_index];
 
-	if (*dma_addr) {
-		/*
-		 * If the page is already dma mapped it means it went through
-		 * a non-invalidating trasition, like read-only to writable.
-		 * Resync the flags.
-		 */
-		*dma_addr = (*dma_addr & ODP_DMA_ADDR_MASK) | access_mask;
-		return 0;
-	}
-
 	*dma_addr = ib_dma_map_page(dev, page, 0, 1 << umem_odp->page_shift,
 				    DMA_BIDIRECTIONAL);
 	if (ib_dma_mapping_error(dev, *dma_addr)) {
@@ -319,7 +308,6 @@ static int ib_umem_odp_map_dma_single_page(
 		return -EFAULT;
 	}
 	umem_odp->npages++;
-	*dma_addr |= access_mask;
 	return 0;
 }
 
@@ -355,9 +343,6 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
 	struct hmm_range range = {};
 	unsigned long timeout;
 
-	if (access_mask == 0)
-		return -EINVAL;
-
 	if (user_virt < ib_umem_start(umem_odp) ||
 	    user_virt + bcnt > ib_umem_end(umem_odp))
 		return -EFAULT;
@@ -383,7 +368,7 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
 	if (fault) {
 		range.default_flags = HMM_PFN_REQ_FAULT;
 
-		if (access_mask & ODP_WRITE_ALLOWED_BIT)
+		if (access_mask & HMM_PFN_WRITE)
 			range.default_flags |= HMM_PFN_REQ_WRITE;
 	}
 
@@ -415,22 +400,17 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
 	for (pfn_index = 0; pfn_index < num_pfns;
 		pfn_index += 1 << (page_shift - PAGE_SHIFT), dma_index++) {
 
-		if (fault) {
-			/*
-			 * Since we asked for hmm_range_fault() to populate
-			 * pages it shouldn't return an error entry on success.
-			 */
-			WARN_ON(range.hmm_pfns[pfn_index] & HMM_PFN_ERROR);
-			WARN_ON(!(range.hmm_pfns[pfn_index] & HMM_PFN_VALID));
-		} else {
-			if (!(range.hmm_pfns[pfn_index] & HMM_PFN_VALID)) {
-				WARN_ON(umem_odp->dma_list[dma_index]);
-				continue;
-			}
-			access_mask = ODP_READ_ALLOWED_BIT;
-			if (range.hmm_pfns[pfn_index] & HMM_PFN_WRITE)
-				access_mask |= ODP_WRITE_ALLOWED_BIT;
-		}
+		/*
+		 * Since we asked for hmm_range_fault() to populate
+		 * pages it shouldn't return an error entry on success.
+		 */
+		WARN_ON(fault && range.hmm_pfns[pfn_index] & HMM_PFN_ERROR);
+		WARN_ON(fault && !(range.hmm_pfns[pfn_index] & HMM_PFN_VALID));
+		if (!(range.hmm_pfns[pfn_index] & HMM_PFN_VALID))
+			continue;
+
+		if (range.hmm_pfns[pfn_index] & HMM_PFN_DMA_MAPPED)
+			continue;
 
 		hmm_order = hmm_pfn_to_map_order(range.hmm_pfns[pfn_index]);
 		/* If a hugepage was detected and ODP wasn't set for, the umem
@@ -445,13 +425,14 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
 		}
 
 		ret = ib_umem_odp_map_dma_single_page(
-				umem_odp, dma_index, hmm_pfn_to_page(range.hmm_pfns[pfn_index]),
-				access_mask);
+			umem_odp, dma_index,
+			hmm_pfn_to_page(range.hmm_pfns[pfn_index]));
 		if (ret < 0) {
 			ibdev_dbg(umem_odp->umem.ibdev,
 				  "ib_umem_odp_map_dma_single_page failed with error %d\n", ret);
 			break;
 		}
+		range.hmm_pfns[pfn_index] |= HMM_PFN_DMA_MAPPED;
 	}
 	/* upon success lock should stay on hold for the callee */
 	if (!ret)
@@ -471,7 +452,6 @@ EXPORT_SYMBOL(ib_umem_odp_map_dma_and_lock);
 void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
 				 u64 bound)
 {
-	dma_addr_t dma_addr;
 	dma_addr_t dma;
 	int idx;
 	u64 addr;
@@ -482,34 +462,37 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
 	virt = max_t(u64, virt, ib_umem_start(umem_odp));
 	bound = min_t(u64, bound, ib_umem_end(umem_odp));
 	for (addr = virt; addr < bound; addr += BIT(umem_odp->page_shift)) {
+		unsigned long pfn_idx = (addr - ib_umem_start(umem_odp)) >>
+					PAGE_SHIFT;
+		struct page *page =
+			hmm_pfn_to_page(umem_odp->pfn_list[pfn_idx]);
+
 		idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
 		dma = umem_odp->dma_list[idx];
 
-		/* The access flags guaranteed a valid DMA address in case was NULL */
-		if (dma) {
-			unsigned long pfn_idx = (addr - ib_umem_start(umem_odp)) >> PAGE_SHIFT;
-			struct page *page = hmm_pfn_to_page(umem_odp->pfn_list[pfn_idx]);
-
-			dma_addr = dma & ODP_DMA_ADDR_MASK;
-			ib_dma_unmap_page(dev, dma_addr,
-					  BIT(umem_odp->page_shift),
-					  DMA_BIDIRECTIONAL);
-			if (dma & ODP_WRITE_ALLOWED_BIT) {
-				struct page *head_page = compound_head(page);
-				/*
-				 * set_page_dirty prefers being called with
-				 * the page lock. However, MMU notifiers are
-				 * called sometimes with and sometimes without
-				 * the lock. We rely on the umem_mutex instead
-				 * to prevent other mmu notifiers from
-				 * continuing and allowing the page mapping to
-				 * be removed.
-				 */
-				set_page_dirty(head_page);
-			}
-			umem_odp->dma_list[idx] = 0;
-			umem_odp->npages--;
+		if (!(umem_odp->pfn_list[pfn_idx] & HMM_PFN_VALID))
+			goto clear;
+		if (!(umem_odp->pfn_list[pfn_idx] & HMM_PFN_DMA_MAPPED))
+			goto clear;
+
+		ib_dma_unmap_page(dev, dma, BIT(umem_odp->page_shift),
+				  DMA_BIDIRECTIONAL);
+		if (umem_odp->pfn_list[pfn_idx] & HMM_PFN_WRITE) {
+			struct page *head_page = compound_head(page);
+			/*
+			 * set_page_dirty prefers being called with
+			 * the page lock. However, MMU notifiers are
+			 * called sometimes with and sometimes without
+			 * the lock. We rely on the umem_mutex instead
+			 * to prevent other mmu notifiers from
+			 * continuing and allowing the page mapping to
+			 * be removed.
+			 */
+			set_page_dirty(head_page);
 		}
+		umem_odp->npages--;
+clear:
+		umem_odp->pfn_list[pfn_idx] &= ~HMM_PFN_FLAGS;
 	}
 }
 EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index ace2df3e1d9f..7424b23bb0d9 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -351,6 +351,7 @@ struct mlx5_ib_flow_db {
 #define MLX5_IB_UPD_XLT_PD	      BIT(4)
 #define MLX5_IB_UPD_XLT_ACCESS	      BIT(5)
 #define MLX5_IB_UPD_XLT_INDIRECT      BIT(6)
+#define MLX5_IB_UPD_XLT_DOWNGRADE     BIT(7)
 
 /* Private QP creation flags to be passed in ib_qp_init_attr.create_flags.
  *
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 86d8fa63bf69..6074c541885c 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -34,6 +34,7 @@
 #include <linux/kernel.h>
 #include <linux/dma-buf.h>
 #include <linux/dma-resv.h>
+#include <linux/hmm.h>
 
 #include "mlx5_ib.h"
 #include "cmd.h"
@@ -158,22 +159,12 @@ static void populate_klm(struct mlx5_klm *pklm, size_t idx, size_t nentries,
 	}
 }
 
-static u64 umem_dma_to_mtt(dma_addr_t umem_dma)
-{
-	u64 mtt_entry = umem_dma & ODP_DMA_ADDR_MASK;
-
-	if (umem_dma & ODP_READ_ALLOWED_BIT)
-		mtt_entry |= MLX5_IB_MTT_READ;
-	if (umem_dma & ODP_WRITE_ALLOWED_BIT)
-		mtt_entry |= MLX5_IB_MTT_WRITE;
-
-	return mtt_entry;
-}
-
 static void populate_mtt(__be64 *pas, size_t idx, size_t nentries,
 			 struct mlx5_ib_mr *mr, int flags)
 {
 	struct ib_umem_odp *odp = to_ib_umem_odp(mr->umem);
+	bool downgrade = flags & MLX5_IB_UPD_XLT_DOWNGRADE;
+	unsigned long pfn;
 	dma_addr_t pa;
 	size_t i;
 
@@ -181,8 +172,17 @@ static void populate_mtt(__be64 *pas, size_t idx, size_t nentries,
 		return;
 
 	for (i = 0; i < nentries; i++) {
+		pfn = odp->pfn_list[idx + i];
+		if (!(pfn & HMM_PFN_VALID))
+			/* ODP initialization */
+			continue;
+
 		pa = odp->dma_list[idx + i];
-		pas[i] = cpu_to_be64(umem_dma_to_mtt(pa));
+		pa |= MLX5_IB_MTT_READ;
+		if ((pfn & HMM_PFN_WRITE) && !downgrade)
+			pa |= MLX5_IB_MTT_WRITE;
+
+		pas[i] = cpu_to_be64(pa);
 	}
 }
 
@@ -303,8 +303,7 @@ static bool mlx5_ib_invalidate_range(struct mmu_interval_notifier *mni,
 		 * estimate the cost of another UMR vs. the cost of bigger
 		 * UMR.
 		 */
-		if (umem_odp->dma_list[idx] &
-		    (ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT)) {
+		if (umem_odp->pfn_list[idx] & HMM_PFN_VALID) {
 			if (!in_block) {
 				blk_start_idx = idx;
 				in_block = 1;
@@ -687,7 +686,7 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
 {
 	int page_shift, ret, np;
 	bool downgrade = flags & MLX5_PF_FLAGS_DOWNGRADE;
-	u64 access_mask;
+	u64 access_mask = 0;
 	u64 start_idx;
 	bool fault = !(flags & MLX5_PF_FLAGS_SNAPSHOT);
 	u32 xlt_flags = MLX5_IB_UPD_XLT_ATOMIC;
@@ -695,12 +694,14 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
 	if (flags & MLX5_PF_FLAGS_ENABLE)
 		xlt_flags |= MLX5_IB_UPD_XLT_ENABLE;
 
+	if (flags & MLX5_PF_FLAGS_DOWNGRADE)
+		xlt_flags |= MLX5_IB_UPD_XLT_DOWNGRADE;
+
 	page_shift = odp->page_shift;
 	start_idx = (user_va - ib_umem_start(odp)) >> page_shift;
-	access_mask = ODP_READ_ALLOWED_BIT;
 
 	if (odp->umem.writable && !downgrade)
-		access_mask |= ODP_WRITE_ALLOWED_BIT;
+		access_mask |= HMM_PFN_WRITE;
 
 	np = ib_umem_odp_map_dma_and_lock(odp, user_va, bcnt, access_mask, fault);
 	if (np < 0)
diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
index 9f6e2bb2a269..caedd9ef2fe3 100644
--- a/drivers/infiniband/sw/rxe/rxe_odp.c
+++ b/drivers/infiniband/sw/rxe/rxe_odp.c
@@ -26,7 +26,7 @@ static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
 	start = max_t(u64, ib_umem_start(umem_odp), range->start);
 	end = min_t(u64, ib_umem_end(umem_odp), range->end);
 
-	/* update umem_odp->dma_list */
+	/* update umem_odp->map.pfn_list */
 	ib_umem_odp_unmap_dma_pages(umem_odp, start, end);
 
 	mutex_unlock(&umem_odp->umem_mutex);
@@ -44,12 +44,11 @@ static int rxe_odp_do_pagefault_and_lock(struct rxe_mr *mr, u64 user_va, int bcn
 {
 	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
 	bool fault = !(flags & RXE_PAGEFAULT_SNAPSHOT);
-	u64 access_mask;
+	u64 access_mask = 0;
 	int np;
 
-	access_mask = ODP_READ_ALLOWED_BIT;
 	if (umem_odp->umem.writable && !(flags & RXE_PAGEFAULT_RDONLY))
-		access_mask |= ODP_WRITE_ALLOWED_BIT;
+		access_mask |= HMM_PFN_WRITE;
 
 	/*
 	 * ib_umem_odp_map_dma_and_lock() locks umem_mutex on success.
@@ -137,7 +136,7 @@ static inline bool rxe_check_pagefault(struct ib_umem_odp *umem_odp,
 	while (addr < iova + length) {
 		idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
 
-		if (!(umem_odp->dma_list[idx] & perm)) {
+		if (!(umem_odp->map.pfn_list[idx] & perm)) {
 			need_fault = true;
 			break;
 		}
@@ -151,15 +150,14 @@ static int rxe_odp_map_range_and_lock(struct rxe_mr *mr, u64 iova, int length, u
 {
 	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
 	bool need_fault;
-	u64 perm;
+	u64 perm = 0;
 	int err;
 
 	if (unlikely(length < 1))
 		return -EINVAL;
 
-	perm = ODP_READ_ALLOWED_BIT;
 	if (!(flags & RXE_PAGEFAULT_RDONLY))
-		perm |= ODP_WRITE_ALLOWED_BIT;
+		perm |= HMM_PFN_WRITE;
 
 	mutex_lock(&umem_odp->umem_mutex);
 
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index 0844c1d05ac6..a345c26a745d 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -8,6 +8,7 @@
 
 #include <rdma/ib_umem.h>
 #include <rdma/ib_verbs.h>
+#include <linux/hmm.h>
 
 struct ib_umem_odp {
 	struct ib_umem umem;
@@ -67,19 +68,6 @@ static inline size_t ib_umem_odp_num_pages(struct ib_umem_odp *umem_odp)
 	       umem_odp->page_shift;
 }
 
-/*
- * The lower 2 bits of the DMA address signal the R/W permissions for
- * the entry. To upgrade the permissions, provide the appropriate
- * bitmask to the map_dma_pages function.
- *
- * Be aware that upgrading a mapped address might result in change of
- * the DMA address for the page.
- */
-#define ODP_READ_ALLOWED_BIT  (1<<0ULL)
-#define ODP_WRITE_ALLOWED_BIT (1<<1ULL)
-
-#define ODP_DMA_ADDR_MASK (~(ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT))
-
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
 
 struct ib_umem_odp *
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 13/24] RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (11 preceding siblings ...)
  2025-04-23  8:13 ` [PATCH v9 12/24] RDMA/umem: Store ODP access mask information in PFN Leon Romanovsky
@ 2025-04-23  8:13 ` Leon Romanovsky
  2025-04-23 17:36   ` Jason Gunthorpe
  2025-04-23  8:13 ` [PATCH v9 14/24] RDMA/umem: Separate implicit ODP initialization from explicit ODP Leon Romanovsky
                   ` (10 subsequent siblings)
  23 siblings, 1 reply; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:13 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Leon Romanovsky, Jake Edge, Jonathan Corbet, Jason Gunthorpe,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Luis Chamberlain,
	Matthew Wilcox, Dan Williams, Kanchan Joshi, Chaitanya Kulkarni

From: Leon Romanovsky <leonro@nvidia.com>

Reuse the newly added DMA API to cache the IOVA and only link/unlink pages
in the fast path of the UMEM ODP flow.

Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/umem_odp.c   | 104 ++++++---------------------
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  11 +--
 drivers/infiniband/hw/mlx5/odp.c     |  40 +++++++----
 drivers/infiniband/hw/mlx5/umr.c     |  12 +++-
 drivers/infiniband/sw/rxe/rxe_odp.c  |   4 +-
 include/rdma/ib_umem_odp.h           |  13 +---
 6 files changed, 71 insertions(+), 113 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index e1a5a567efb3..30cd8f353476 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -41,6 +41,7 @@
 #include <linux/hugetlb.h>
 #include <linux/interval_tree.h>
 #include <linux/hmm.h>
+#include <linux/hmm-dma.h>
 #include <linux/pagemap.h>
 
 #include <rdma/ib_umem_odp.h>
@@ -50,6 +51,7 @@
 static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
 				   const struct mmu_interval_notifier_ops *ops)
 {
+	struct ib_device *dev = umem_odp->umem.ibdev;
 	int ret;
 
 	umem_odp->umem.is_odp = 1;
@@ -59,7 +61,6 @@ static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
 		size_t page_size = 1UL << umem_odp->page_shift;
 		unsigned long start;
 		unsigned long end;
-		size_t ndmas, npfns;
 
 		start = ALIGN_DOWN(umem_odp->umem.address, page_size);
 		if (check_add_overflow(umem_odp->umem.address,
@@ -70,36 +71,23 @@ static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
 		if (unlikely(end < page_size))
 			return -EOVERFLOW;
 
-		ndmas = (end - start) >> umem_odp->page_shift;
-		if (!ndmas)
-			return -EINVAL;
-
-		npfns = (end - start) >> PAGE_SHIFT;
-		umem_odp->pfn_list = kvcalloc(
-			npfns, sizeof(*umem_odp->pfn_list), GFP_KERNEL);
-		if (!umem_odp->pfn_list)
-			return -ENOMEM;
-
-		umem_odp->dma_list = kvcalloc(
-			ndmas, sizeof(*umem_odp->dma_list), GFP_KERNEL);
-		if (!umem_odp->dma_list) {
-			ret = -ENOMEM;
-			goto out_pfn_list;
-		}
+		ret = hmm_dma_map_alloc(dev->dma_device, &umem_odp->map,
+					(end - start) >> PAGE_SHIFT,
+					1 << umem_odp->page_shift);
+		if (ret)
+			return ret;
 
 		ret = mmu_interval_notifier_insert(&umem_odp->notifier,
 						   umem_odp->umem.owning_mm,
 						   start, end - start, ops);
 		if (ret)
-			goto out_dma_list;
+			goto out_free_map;
 	}
 
 	return 0;
 
-out_dma_list:
-	kvfree(umem_odp->dma_list);
-out_pfn_list:
-	kvfree(umem_odp->pfn_list);
+out_free_map:
+	hmm_dma_map_free(dev->dma_device, &umem_odp->map);
 	return ret;
 }
 
@@ -262,6 +250,8 @@ EXPORT_SYMBOL(ib_umem_odp_get);
 
 void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
 {
+	struct ib_device *dev = umem_odp->umem.ibdev;
+
 	/*
 	 * Ensure that no more pages are mapped in the umem.
 	 *
@@ -274,48 +264,17 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
 					    ib_umem_end(umem_odp));
 		mutex_unlock(&umem_odp->umem_mutex);
 		mmu_interval_notifier_remove(&umem_odp->notifier);
-		kvfree(umem_odp->dma_list);
-		kvfree(umem_odp->pfn_list);
+		hmm_dma_map_free(dev->dma_device, &umem_odp->map);
 	}
 	put_pid(umem_odp->tgid);
 	kfree(umem_odp);
 }
 EXPORT_SYMBOL(ib_umem_odp_release);
 
-/*
- * Map for DMA and insert a single page into the on-demand paging page tables.
- *
- * @umem: the umem to insert the page to.
- * @dma_index: index in the umem to add the dma to.
- * @page: the page struct to map and add.
- * @access_mask: access permissions needed for this page.
- *
- * The function returns -EFAULT if the DMA mapping operation fails.
- *
- */
-static int ib_umem_odp_map_dma_single_page(
-		struct ib_umem_odp *umem_odp,
-		unsigned int dma_index,
-		struct page *page)
-{
-	struct ib_device *dev = umem_odp->umem.ibdev;
-	dma_addr_t *dma_addr = &umem_odp->dma_list[dma_index];
-
-	*dma_addr = ib_dma_map_page(dev, page, 0, 1 << umem_odp->page_shift,
-				    DMA_BIDIRECTIONAL);
-	if (ib_dma_mapping_error(dev, *dma_addr)) {
-		*dma_addr = 0;
-		return -EFAULT;
-	}
-	umem_odp->npages++;
-	return 0;
-}
-
 /**
  * ib_umem_odp_map_dma_and_lock - DMA map userspace memory in an ODP MR and lock it.
  *
  * Maps the range passed in the argument to DMA addresses.
- * The DMA addresses of the mapped pages is updated in umem_odp->dma_list.
  * Upon success the ODP MR will be locked to let caller complete its device
  * page table update.
  *
@@ -372,7 +331,7 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
 			range.default_flags |= HMM_PFN_REQ_WRITE;
 	}
 
-	range.hmm_pfns = &(umem_odp->pfn_list[pfn_start_idx]);
+	range.hmm_pfns = &(umem_odp->map.pfn_list[pfn_start_idx]);
 	timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
 
 retry:
@@ -423,16 +382,6 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
 				  __func__, hmm_order, page_shift);
 			break;
 		}
-
-		ret = ib_umem_odp_map_dma_single_page(
-			umem_odp, dma_index,
-			hmm_pfn_to_page(range.hmm_pfns[pfn_index]));
-		if (ret < 0) {
-			ibdev_dbg(umem_odp->umem.ibdev,
-				  "ib_umem_odp_map_dma_single_page failed with error %d\n", ret);
-			break;
-		}
-		range.hmm_pfns[pfn_index] |= HMM_PFN_DMA_MAPPED;
 	}
 	/* upon success lock should stay on hold for the callee */
 	if (!ret)
@@ -452,32 +401,23 @@ EXPORT_SYMBOL(ib_umem_odp_map_dma_and_lock);
 void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
 				 u64 bound)
 {
-	dma_addr_t dma;
-	int idx;
-	u64 addr;
 	struct ib_device *dev = umem_odp->umem.ibdev;
+	u64 addr;
 
 	lockdep_assert_held(&umem_odp->umem_mutex);
 
 	virt = max_t(u64, virt, ib_umem_start(umem_odp));
 	bound = min_t(u64, bound, ib_umem_end(umem_odp));
 	for (addr = virt; addr < bound; addr += BIT(umem_odp->page_shift)) {
-		unsigned long pfn_idx = (addr - ib_umem_start(umem_odp)) >>
-					PAGE_SHIFT;
-		struct page *page =
-			hmm_pfn_to_page(umem_odp->pfn_list[pfn_idx]);
-
-		idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
-		dma = umem_odp->dma_list[idx];
+		u64 offset = addr - ib_umem_start(umem_odp);
+		size_t idx = offset >> umem_odp->page_shift;
+		unsigned long pfn = umem_odp->map.pfn_list[idx];
 
-		if (!(umem_odp->pfn_list[pfn_idx] & HMM_PFN_VALID))
-			goto clear;
-		if (!(umem_odp->pfn_list[pfn_idx] & HMM_PFN_DMA_MAPPED))
+		if (!hmm_dma_unmap_pfn(dev->dma_device, &umem_odp->map, idx))
 			goto clear;
 
-		ib_dma_unmap_page(dev, dma, BIT(umem_odp->page_shift),
-				  DMA_BIDIRECTIONAL);
-		if (umem_odp->pfn_list[pfn_idx] & HMM_PFN_WRITE) {
+		if (pfn & HMM_PFN_WRITE) {
+			struct page *page = hmm_pfn_to_page(pfn);
 			struct page *head_page = compound_head(page);
 			/*
 			 * set_page_dirty prefers being called with
@@ -492,7 +432,7 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
 		}
 		umem_odp->npages--;
 clear:
-		umem_odp->pfn_list[pfn_idx] &= ~HMM_PFN_FLAGS;
+		umem_odp->map.pfn_list[idx] &= ~HMM_PFN_FLAGS;
 	}
 }
 EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 7424b23bb0d9..27fbf5f3a2d6 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -1474,8 +1474,8 @@ void mlx5_ib_odp_cleanup_one(struct mlx5_ib_dev *ibdev);
 int __init mlx5_ib_odp_init(void);
 void mlx5_ib_odp_cleanup(void);
 int mlx5_odp_init_mkey_cache(struct mlx5_ib_dev *dev);
-void mlx5_odp_populate_xlt(void *xlt, size_t idx, size_t nentries,
-			   struct mlx5_ib_mr *mr, int flags);
+int mlx5_odp_populate_xlt(void *xlt, size_t idx, size_t nentries,
+			  struct mlx5_ib_mr *mr, int flags);
 
 int mlx5_ib_advise_mr_prefetch(struct ib_pd *pd,
 			       enum ib_uverbs_advise_mr_advice advice,
@@ -1496,8 +1496,11 @@ static inline int mlx5_odp_init_mkey_cache(struct mlx5_ib_dev *dev)
 {
 	return 0;
 }
-static inline void mlx5_odp_populate_xlt(void *xlt, size_t idx, size_t nentries,
-					 struct mlx5_ib_mr *mr, int flags) {}
+static inline int mlx5_odp_populate_xlt(void *xlt, size_t idx, size_t nentries,
+					struct mlx5_ib_mr *mr, int flags)
+{
+	return -EOPNOTSUPP;
+}
 
 static inline int
 mlx5_ib_advise_mr_prefetch(struct ib_pd *pd,
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 6074c541885c..eaa2f9f5f3a9 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -35,6 +35,8 @@
 #include <linux/dma-buf.h>
 #include <linux/dma-resv.h>
 #include <linux/hmm.h>
+#include <linux/hmm-dma.h>
+#include <linux/pci-p2pdma.h>
 
 #include "mlx5_ib.h"
 #include "cmd.h"
@@ -159,40 +161,50 @@ static void populate_klm(struct mlx5_klm *pklm, size_t idx, size_t nentries,
 	}
 }
 
-static void populate_mtt(__be64 *pas, size_t idx, size_t nentries,
-			 struct mlx5_ib_mr *mr, int flags)
+static int populate_mtt(__be64 *pas, size_t start, size_t nentries,
+			struct mlx5_ib_mr *mr, int flags)
 {
 	struct ib_umem_odp *odp = to_ib_umem_odp(mr->umem);
 	bool downgrade = flags & MLX5_IB_UPD_XLT_DOWNGRADE;
-	unsigned long pfn;
-	dma_addr_t pa;
+	struct pci_p2pdma_map_state p2pdma_state = {};
+	struct ib_device *dev = odp->umem.ibdev;
 	size_t i;
 
 	if (flags & MLX5_IB_UPD_XLT_ZAP)
-		return;
+		return 0;
 
 	for (i = 0; i < nentries; i++) {
-		pfn = odp->pfn_list[idx + i];
+		unsigned long pfn = odp->map.pfn_list[start + i];
+		dma_addr_t dma_addr;
+
+		pfn = odp->map.pfn_list[start + i];
 		if (!(pfn & HMM_PFN_VALID))
 			/* ODP initialization */
 			continue;
 
-		pa = odp->dma_list[idx + i];
-		pa |= MLX5_IB_MTT_READ;
+		dma_addr = hmm_dma_map_pfn(dev->dma_device, &odp->map,
+					   start + i, &p2pdma_state);
+		if (ib_dma_mapping_error(dev, dma_addr))
+			return -EFAULT;
+
+		dma_addr |= MLX5_IB_MTT_READ;
 		if ((pfn & HMM_PFN_WRITE) && !downgrade)
-			pa |= MLX5_IB_MTT_WRITE;
+			dma_addr |= MLX5_IB_MTT_WRITE;
 
-		pas[i] = cpu_to_be64(pa);
+		pas[i] = cpu_to_be64(dma_addr);
+		odp->npages++;
 	}
+	return 0;
 }
 
-void mlx5_odp_populate_xlt(void *xlt, size_t idx, size_t nentries,
-			   struct mlx5_ib_mr *mr, int flags)
+int mlx5_odp_populate_xlt(void *xlt, size_t idx, size_t nentries,
+			  struct mlx5_ib_mr *mr, int flags)
 {
 	if (flags & MLX5_IB_UPD_XLT_INDIRECT) {
 		populate_klm(xlt, idx, nentries, mr, flags);
+		return 0;
 	} else {
-		populate_mtt(xlt, idx, nentries, mr, flags);
+		return populate_mtt(xlt, idx, nentries, mr, flags);
 	}
 }
 
@@ -303,7 +315,7 @@ static bool mlx5_ib_invalidate_range(struct mmu_interval_notifier *mni,
 		 * estimate the cost of another UMR vs. the cost of bigger
 		 * UMR.
 		 */
-		if (umem_odp->pfn_list[idx] & HMM_PFN_VALID) {
+		if (umem_odp->map.pfn_list[idx] & HMM_PFN_VALID) {
 			if (!in_block) {
 				blk_start_idx = idx;
 				in_block = 1;
diff --git a/drivers/infiniband/hw/mlx5/umr.c b/drivers/infiniband/hw/mlx5/umr.c
index 793f3c5c4d01..5be4426a2884 100644
--- a/drivers/infiniband/hw/mlx5/umr.c
+++ b/drivers/infiniband/hw/mlx5/umr.c
@@ -840,7 +840,17 @@ int mlx5r_umr_update_xlt(struct mlx5_ib_mr *mr, u64 idx, int npages,
 		size_to_map = npages * desc_size;
 		dma_sync_single_for_cpu(ddev, sg.addr, sg.length,
 					DMA_TO_DEVICE);
-		mlx5_odp_populate_xlt(xlt, idx, npages, mr, flags);
+		/*
+		 * npages is the maximum number of pages to map, but we
+		 * can't guarantee that all pages are actually mapped.
+		 *
+		 * For example, if page is p2p of type which is not supported
+		 * for mapping, the number of pages mapped will be less than
+		 * requested.
+		 */
+		err = mlx5_odp_populate_xlt(xlt, idx, npages, mr, flags);
+		if (err)
+			return err;
 		dma_sync_single_for_device(ddev, sg.addr, sg.length,
 					   DMA_TO_DEVICE);
 		sg.length = ALIGN(size_to_map, MLX5_UMR_FLEX_ALIGNMENT);
diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
index caedd9ef2fe3..51b28f5d5027 100644
--- a/drivers/infiniband/sw/rxe/rxe_odp.c
+++ b/drivers/infiniband/sw/rxe/rxe_odp.c
@@ -194,7 +194,7 @@ static int __rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
 	while (length > 0) {
 		u8 *src, *dest;
 
-		page = hmm_pfn_to_page(umem_odp->pfn_list[idx]);
+		page = hmm_pfn_to_page(umem_odp->map.pfn_list[idx]);
 		user_va = kmap_local_page(page);
 		if (!user_va)
 			return -EFAULT;
@@ -277,7 +277,7 @@ static int rxe_odp_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
 
 	idx = (iova - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
 	page_offset = iova & (BIT(umem_odp->page_shift) - 1);
-	page = hmm_pfn_to_page(umem_odp->pfn_list[idx]);
+	page = hmm_pfn_to_page(umem_odp->map.pfn_list[idx]);
 	if (!page)
 		return RESPST_ERR_RKEY_VIOLATION;
 
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index a345c26a745d..2a24bf791c10 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -8,24 +8,17 @@
 
 #include <rdma/ib_umem.h>
 #include <rdma/ib_verbs.h>
-#include <linux/hmm.h>
+#include <linux/hmm-dma.h>
 
 struct ib_umem_odp {
 	struct ib_umem umem;
 	struct mmu_interval_notifier notifier;
 	struct pid *tgid;
 
-	/* An array of the pfns included in the on-demand paging umem. */
-	unsigned long *pfn_list;
+	struct hmm_dma_map map;
 
 	/*
-	 * An array with DMA addresses mapped for pfns in pfn_list.
-	 * The lower two bits designate access permissions.
-	 * See ODP_READ_ALLOWED_BIT and ODP_WRITE_ALLOWED_BIT.
-	 */
-	dma_addr_t		*dma_list;
-	/*
-	 * The umem_mutex protects the page_list and dma_list fields of an ODP
+	 * The umem_mutex protects the page_list field of an ODP
 	 * umem, allowing only a single thread to map/unmap pages. The mutex
 	 * also protects access to the mmu notifier counters.
 	 */
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 14/24] RDMA/umem: Separate implicit ODP initialization from explicit ODP
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (12 preceding siblings ...)
  2025-04-23  8:13 ` [PATCH v9 13/24] RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage Leon Romanovsky
@ 2025-04-23  8:13 ` Leon Romanovsky
  2025-04-23 17:38   ` Jason Gunthorpe
  2025-04-23  8:13 ` [PATCH v9 15/24] vfio/mlx5: Explicitly use number of pages instead of allocated length Leon Romanovsky
                   ` (9 subsequent siblings)
  23 siblings, 1 reply; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:13 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Leon Romanovsky, Jake Edge, Jonathan Corbet, Jason Gunthorpe,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Luis Chamberlain,
	Matthew Wilcox, Dan Williams, Kanchan Joshi, Chaitanya Kulkarni

From: Leon Romanovsky <leonro@nvidia.com>

Create separate functions for the implicit ODP initialization, which
differs from the explicit ODP initialization.
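
For illustration, a condensed view of the resulting split, taken from
the diff below: the implicit path no longer allocates any DMA state, so
its initialization cannot fail and ib_umem_odp_alloc_implicit() loses
its error unwind.

	static void ib_init_umem_implicit_odp(struct ib_umem_odp *umem_odp)
	{
		umem_odp->is_implicit_odp = 1;
		umem_odp->umem.is_odp = 1;
		mutex_init(&umem_odp->umem_mutex);
	}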

Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/umem_odp.c | 91 +++++++++++++++---------------
 1 file changed, 46 insertions(+), 45 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 30cd8f353476..51d518989914 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -48,41 +48,44 @@
 
 #include "uverbs.h"
 
-static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
-				   const struct mmu_interval_notifier_ops *ops)
+static void ib_init_umem_implicit_odp(struct ib_umem_odp *umem_odp)
+{
+	umem_odp->is_implicit_odp = 1;
+	umem_odp->umem.is_odp = 1;
+	mutex_init(&umem_odp->umem_mutex);
+}
+
+static int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
+			    const struct mmu_interval_notifier_ops *ops)
 {
 	struct ib_device *dev = umem_odp->umem.ibdev;
+	size_t page_size = 1UL << umem_odp->page_shift;
+	unsigned long start;
+	unsigned long end;
 	int ret;
 
 	umem_odp->umem.is_odp = 1;
 	mutex_init(&umem_odp->umem_mutex);
 
-	if (!umem_odp->is_implicit_odp) {
-		size_t page_size = 1UL << umem_odp->page_shift;
-		unsigned long start;
-		unsigned long end;
-
-		start = ALIGN_DOWN(umem_odp->umem.address, page_size);
-		if (check_add_overflow(umem_odp->umem.address,
-				       (unsigned long)umem_odp->umem.length,
-				       &end))
-			return -EOVERFLOW;
-		end = ALIGN(end, page_size);
-		if (unlikely(end < page_size))
-			return -EOVERFLOW;
-
-		ret = hmm_dma_map_alloc(dev->dma_device, &umem_odp->map,
-					(end - start) >> PAGE_SHIFT,
-					1 << umem_odp->page_shift);
-		if (ret)
-			return ret;
-
-		ret = mmu_interval_notifier_insert(&umem_odp->notifier,
-						   umem_odp->umem.owning_mm,
-						   start, end - start, ops);
-		if (ret)
-			goto out_free_map;
-	}
+	start = ALIGN_DOWN(umem_odp->umem.address, page_size);
+	if (check_add_overflow(umem_odp->umem.address,
+			       (unsigned long)umem_odp->umem.length, &end))
+		return -EOVERFLOW;
+	end = ALIGN(end, page_size);
+	if (unlikely(end < page_size))
+		return -EOVERFLOW;
+
+	ret = hmm_dma_map_alloc(dev->dma_device, &umem_odp->map,
+				(end - start) >> PAGE_SHIFT,
+				1 << umem_odp->page_shift);
+	if (ret)
+		return ret;
+
+	ret = mmu_interval_notifier_insert(&umem_odp->notifier,
+					   umem_odp->umem.owning_mm, start,
+					   end - start, ops);
+	if (ret)
+		goto out_free_map;
 
 	return 0;
 
@@ -106,7 +109,6 @@ struct ib_umem_odp *ib_umem_odp_alloc_implicit(struct ib_device *device,
 {
 	struct ib_umem *umem;
 	struct ib_umem_odp *umem_odp;
-	int ret;
 
 	if (access & IB_ACCESS_HUGETLB)
 		return ERR_PTR(-EINVAL);
@@ -118,16 +120,10 @@ struct ib_umem_odp *ib_umem_odp_alloc_implicit(struct ib_device *device,
 	umem->ibdev = device;
 	umem->writable = ib_access_writable(access);
 	umem->owning_mm = current->mm;
-	umem_odp->is_implicit_odp = 1;
 	umem_odp->page_shift = PAGE_SHIFT;
 
 	umem_odp->tgid = get_task_pid(current->group_leader, PIDTYPE_PID);
-	ret = ib_init_umem_odp(umem_odp, NULL);
-	if (ret) {
-		put_pid(umem_odp->tgid);
-		kfree(umem_odp);
-		return ERR_PTR(ret);
-	}
+	ib_init_umem_implicit_odp(umem_odp);
 	return umem_odp;
 }
 EXPORT_SYMBOL(ib_umem_odp_alloc_implicit);
@@ -248,7 +244,7 @@ struct ib_umem_odp *ib_umem_odp_get(struct ib_device *device,
 }
 EXPORT_SYMBOL(ib_umem_odp_get);
 
-void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
+static void ib_umem_odp_free(struct ib_umem_odp *umem_odp)
 {
 	struct ib_device *dev = umem_odp->umem.ibdev;
 
@@ -258,14 +254,19 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
 	 * It is the driver's responsibility to ensure, before calling us,
 	 * that the hardware will not attempt to access the MR any more.
 	 */
-	if (!umem_odp->is_implicit_odp) {
-		mutex_lock(&umem_odp->umem_mutex);
-		ib_umem_odp_unmap_dma_pages(umem_odp, ib_umem_start(umem_odp),
-					    ib_umem_end(umem_odp));
-		mutex_unlock(&umem_odp->umem_mutex);
-		mmu_interval_notifier_remove(&umem_odp->notifier);
-		hmm_dma_map_free(dev->dma_device, &umem_odp->map);
-	}
+	mutex_lock(&umem_odp->umem_mutex);
+	ib_umem_odp_unmap_dma_pages(umem_odp, ib_umem_start(umem_odp),
+				    ib_umem_end(umem_odp));
+	mutex_unlock(&umem_odp->umem_mutex);
+	mmu_interval_notifier_remove(&umem_odp->notifier);
+	hmm_dma_map_free(dev->dma_device, &umem_odp->map);
+}
+
+void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
+{
+	if (!umem_odp->is_implicit_odp)
+		ib_umem_odp_free(umem_odp);
+
 	put_pid(umem_odp->tgid);
 	kfree(umem_odp);
 }
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 15/24] vfio/mlx5: Explicitly use number of pages instead of allocated length
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (13 preceding siblings ...)
  2025-04-23  8:13 ` [PATCH v9 14/24] RDMA/umem: Separate implicit ODP initialization from explicit ODP Leon Romanovsky
@ 2025-04-23  8:13 ` Leon Romanovsky
  2025-04-23 17:39   ` Jason Gunthorpe
  2025-04-23  8:13 ` [PATCH v9 16/24] vfio/mlx5: Rewrite create mkey flow to allow better code reuse Leon Romanovsky
                   ` (8 subsequent siblings)
  23 siblings, 1 reply; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:13 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Leon Romanovsky, Jake Edge, Jonathan Corbet, Jason Gunthorpe,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Luis Chamberlain,
	Matthew Wilcox, Dan Williams, Kanchan Joshi, Chaitanya Kulkarni

From: Leon Romanovsky <leonro@nvidia.com>

allocated_length is always the product of the page size and the number
of pages, so change the functions to accept the number of pages
directly. This opens the way to combining the receive and send paths
and improves code readability.
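
As a minimal illustration of the caller-side change (the wrapper below
is hypothetical; only the DIV_ROUND_UP() conversion is what the patch
actually adds at each call site):

	static struct mlx5_vhca_data_buffer *
	get_buffer_for_length(struct mlx5_vf_migration_file *migf,
			      size_t length, enum dma_data_direction dir)
	{
		u32 npages = DIV_ROUND_UP(length, PAGE_SIZE);

		return mlx5vf_get_data_buffer(migf, npages, dir);
	}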

Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c  | 32 ++++++++++-----------
 drivers/vfio/pci/mlx5/cmd.h  | 10 +++----
 drivers/vfio/pci/mlx5/main.c | 56 +++++++++++++++++++++++-------------
 3 files changed, 57 insertions(+), 41 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index 11eda6b207f1..377dee7765fb 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -318,8 +318,7 @@ static int _create_mkey(struct mlx5_core_dev *mdev, u32 pdn,
 			struct mlx5_vhca_recv_buf *recv_buf,
 			u32 *mkey)
 {
-	size_t npages = buf ? DIV_ROUND_UP(buf->allocated_length, PAGE_SIZE) :
-				recv_buf->npages;
+	size_t npages = buf ? buf->npages : recv_buf->npages;
 	int err = 0, inlen;
 	__be64 *mtt;
 	void *mkc;
@@ -375,7 +374,7 @@ static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
 	if (mvdev->mdev_detach)
 		return -ENOTCONN;
 
-	if (buf->dmaed || !buf->allocated_length)
+	if (buf->dmaed || !buf->npages)
 		return -EINVAL;
 
 	ret = dma_map_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
@@ -445,7 +444,7 @@ static int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
 
 		if (ret)
 			goto err_append;
-		buf->allocated_length += filled * PAGE_SIZE;
+		buf->npages += filled;
 		/* clean input for another bulk allocation */
 		memset(page_list, 0, filled * sizeof(*page_list));
 		to_fill = min_t(unsigned int, to_alloc,
@@ -464,8 +463,7 @@ static int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
 }
 
 struct mlx5_vhca_data_buffer *
-mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf,
-			 size_t length,
+mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
 			 enum dma_data_direction dma_dir)
 {
 	struct mlx5_vhca_data_buffer *buf;
@@ -477,9 +475,8 @@ mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf,
 
 	buf->dma_dir = dma_dir;
 	buf->migf = migf;
-	if (length) {
-		ret = mlx5vf_add_migration_pages(buf,
-				DIV_ROUND_UP_ULL(length, PAGE_SIZE));
+	if (npages) {
+		ret = mlx5vf_add_migration_pages(buf, npages);
 		if (ret)
 			goto end;
 
@@ -505,8 +502,8 @@ void mlx5vf_put_data_buffer(struct mlx5_vhca_data_buffer *buf)
 }
 
 struct mlx5_vhca_data_buffer *
-mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf,
-		       size_t length, enum dma_data_direction dma_dir)
+mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
+		       enum dma_data_direction dma_dir)
 {
 	struct mlx5_vhca_data_buffer *buf, *temp_buf;
 	struct list_head free_list;
@@ -521,7 +518,7 @@ mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf,
 	list_for_each_entry_safe(buf, temp_buf, &migf->avail_list, buf_elm) {
 		if (buf->dma_dir == dma_dir) {
 			list_del_init(&buf->buf_elm);
-			if (buf->allocated_length >= length) {
+			if (buf->npages >= npages) {
 				spin_unlock_irq(&migf->list_lock);
 				goto found;
 			}
@@ -535,7 +532,7 @@ mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf,
 		}
 	}
 	spin_unlock_irq(&migf->list_lock);
-	buf = mlx5vf_alloc_data_buffer(migf, length, dma_dir);
+	buf = mlx5vf_alloc_data_buffer(migf, npages, dma_dir);
 
 found:
 	while ((temp_buf = list_first_entry_or_null(&free_list,
@@ -716,7 +713,7 @@ int mlx5vf_cmd_save_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 	MLX5_SET(save_vhca_state_in, in, op_mod, 0);
 	MLX5_SET(save_vhca_state_in, in, vhca_id, mvdev->vhca_id);
 	MLX5_SET(save_vhca_state_in, in, mkey, buf->mkey);
-	MLX5_SET(save_vhca_state_in, in, size, buf->allocated_length);
+	MLX5_SET(save_vhca_state_in, in, size, buf->npages * PAGE_SIZE);
 	MLX5_SET(save_vhca_state_in, in, incremental, inc);
 	MLX5_SET(save_vhca_state_in, in, set_track, track);
 
@@ -738,8 +735,11 @@ int mlx5vf_cmd_save_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 	}
 
 	if (!header_buf) {
-		header_buf = mlx5vf_get_data_buffer(migf,
-			sizeof(struct mlx5_vf_migration_header), DMA_NONE);
+		header_buf = mlx5vf_get_data_buffer(
+			migf,
+			DIV_ROUND_UP(sizeof(struct mlx5_vf_migration_header),
+				     PAGE_SIZE),
+			DMA_NONE);
 		if (IS_ERR(header_buf)) {
 			err = PTR_ERR(header_buf);
 			goto err_free;
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index df421dc6de04..7d4a833b6900 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -56,7 +56,7 @@ struct mlx5_vhca_data_buffer {
 	struct sg_append_table table;
 	loff_t start_pos;
 	u64 length;
-	u64 allocated_length;
+	u32 npages;
 	u32 mkey;
 	enum dma_data_direction dma_dir;
 	u8 dmaed:1;
@@ -217,12 +217,12 @@ int mlx5vf_cmd_alloc_pd(struct mlx5_vf_migration_file *migf);
 void mlx5vf_cmd_dealloc_pd(struct mlx5_vf_migration_file *migf);
 void mlx5fv_cmd_clean_migf_resources(struct mlx5_vf_migration_file *migf);
 struct mlx5_vhca_data_buffer *
-mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf,
-			 size_t length, enum dma_data_direction dma_dir);
+mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
+			 enum dma_data_direction dma_dir);
 void mlx5vf_free_data_buffer(struct mlx5_vhca_data_buffer *buf);
 struct mlx5_vhca_data_buffer *
-mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf,
-		       size_t length, enum dma_data_direction dma_dir);
+mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
+		       enum dma_data_direction dma_dir);
 void mlx5vf_put_data_buffer(struct mlx5_vhca_data_buffer *buf);
 struct page *mlx5vf_get_migration_page(struct mlx5_vhca_data_buffer *buf,
 				       unsigned long offset);
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index 709543e7eb04..bc0f468f741b 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -308,6 +308,7 @@ static struct mlx5_vhca_data_buffer *
 mlx5vf_mig_file_get_stop_copy_buf(struct mlx5_vf_migration_file *migf,
 				  u8 index, size_t required_length)
 {
+	u32 npages = DIV_ROUND_UP(required_length, PAGE_SIZE);
 	struct mlx5_vhca_data_buffer *buf = migf->buf[index];
 	u8 chunk_num;
 
@@ -315,12 +316,11 @@ mlx5vf_mig_file_get_stop_copy_buf(struct mlx5_vf_migration_file *migf,
 	chunk_num = buf->stop_copy_chunk_num;
 	buf->migf->buf[index] = NULL;
 	/* Checking whether the pre-allocated buffer can fit */
-	if (buf->allocated_length >= required_length)
+	if (buf->npages >= npages)
 		return buf;
 
 	mlx5vf_put_data_buffer(buf);
-	buf = mlx5vf_get_data_buffer(buf->migf, required_length,
-				     DMA_FROM_DEVICE);
+	buf = mlx5vf_get_data_buffer(buf->migf, npages, DMA_FROM_DEVICE);
 	if (IS_ERR(buf))
 		return buf;
 
@@ -373,7 +373,8 @@ static int mlx5vf_add_stop_copy_header(struct mlx5_vf_migration_file *migf,
 	u8 *to_buff;
 	int ret;
 
-	header_buf = mlx5vf_get_data_buffer(migf, size, DMA_NONE);
+	header_buf = mlx5vf_get_data_buffer(migf, DIV_ROUND_UP(size, PAGE_SIZE),
+					    DMA_NONE);
 	if (IS_ERR(header_buf))
 		return PTR_ERR(header_buf);
 
@@ -388,7 +389,7 @@ static int mlx5vf_add_stop_copy_header(struct mlx5_vf_migration_file *migf,
 	to_buff = kmap_local_page(page);
 	memcpy(to_buff, &header, sizeof(header));
 	header_buf->length = sizeof(header);
-	data.stop_copy_size = cpu_to_le64(migf->buf[0]->allocated_length);
+	data.stop_copy_size = cpu_to_le64(migf->buf[0]->npages * PAGE_SIZE);
 	memcpy(to_buff + sizeof(header), &data, sizeof(data));
 	header_buf->length += sizeof(data);
 	kunmap_local(to_buff);
@@ -437,15 +438,20 @@ static int mlx5vf_prep_stop_copy(struct mlx5vf_pci_core_device *mvdev,
 
 	num_chunks = mvdev->chunk_mode ? MAX_NUM_CHUNKS : 1;
 	for (i = 0; i < num_chunks; i++) {
-		buf = mlx5vf_get_data_buffer(migf, inc_state_size, DMA_FROM_DEVICE);
+		buf = mlx5vf_get_data_buffer(
+			migf, DIV_ROUND_UP(inc_state_size, PAGE_SIZE),
+			DMA_FROM_DEVICE);
 		if (IS_ERR(buf)) {
 			ret = PTR_ERR(buf);
 			goto err;
 		}
 
 		migf->buf[i] = buf;
-		buf = mlx5vf_get_data_buffer(migf,
-				sizeof(struct mlx5_vf_migration_header), DMA_NONE);
+		buf = mlx5vf_get_data_buffer(
+			migf,
+			DIV_ROUND_UP(sizeof(struct mlx5_vf_migration_header),
+				     PAGE_SIZE),
+			DMA_NONE);
 		if (IS_ERR(buf)) {
 			ret = PTR_ERR(buf);
 			goto err;
@@ -553,7 +559,8 @@ static long mlx5vf_precopy_ioctl(struct file *filp, unsigned int cmd,
 	 * We finished transferring the current state and the device has a
 	 * dirty state, save a new state to be ready for.
 	 */
-	buf = mlx5vf_get_data_buffer(migf, inc_length, DMA_FROM_DEVICE);
+	buf = mlx5vf_get_data_buffer(migf, DIV_ROUND_UP(inc_length, PAGE_SIZE),
+				     DMA_FROM_DEVICE);
 	if (IS_ERR(buf)) {
 		ret = PTR_ERR(buf);
 		mlx5vf_mark_err(migf);
@@ -675,8 +682,8 @@ mlx5vf_pci_save_device_data(struct mlx5vf_pci_core_device *mvdev, bool track)
 
 	if (track) {
 		/* leave the allocated buffer ready for the stop-copy phase */
-		buf = mlx5vf_alloc_data_buffer(migf,
-			migf->buf[0]->allocated_length, DMA_FROM_DEVICE);
+		buf = mlx5vf_alloc_data_buffer(migf, migf->buf[0]->npages,
+					       DMA_FROM_DEVICE);
 		if (IS_ERR(buf)) {
 			ret = PTR_ERR(buf);
 			goto out_pd;
@@ -917,11 +924,14 @@ static ssize_t mlx5vf_resume_write(struct file *filp, const char __user *buf,
 				goto out_unlock;
 			break;
 		case MLX5_VF_LOAD_STATE_PREP_HEADER_DATA:
-			if (vhca_buf_header->allocated_length < migf->record_size) {
+		{
+			u32 npages = DIV_ROUND_UP(migf->record_size, PAGE_SIZE);
+
+			if (vhca_buf_header->npages < npages) {
 				mlx5vf_free_data_buffer(vhca_buf_header);
 
-				migf->buf_header[0] = mlx5vf_alloc_data_buffer(migf,
-						migf->record_size, DMA_NONE);
+				migf->buf_header[0] = mlx5vf_alloc_data_buffer(
+					migf, npages, DMA_NONE);
 				if (IS_ERR(migf->buf_header[0])) {
 					ret = PTR_ERR(migf->buf_header[0]);
 					migf->buf_header[0] = NULL;
@@ -934,6 +944,7 @@ static ssize_t mlx5vf_resume_write(struct file *filp, const char __user *buf,
 			vhca_buf_header->start_pos = migf->max_pos;
 			migf->load_state = MLX5_VF_LOAD_STATE_READ_HEADER_DATA;
 			break;
+		}
 		case MLX5_VF_LOAD_STATE_READ_HEADER_DATA:
 			ret = mlx5vf_resume_read_header_data(migf, vhca_buf_header,
 							&buf, &len, pos, &done);
@@ -944,12 +955,13 @@ static ssize_t mlx5vf_resume_write(struct file *filp, const char __user *buf,
 		{
 			u64 size = max(migf->record_size,
 				       migf->stop_copy_prep_size);
+			u32 npages = DIV_ROUND_UP(size, PAGE_SIZE);
 
-			if (vhca_buf->allocated_length < size) {
+			if (vhca_buf->npages < npages) {
 				mlx5vf_free_data_buffer(vhca_buf);
 
-				migf->buf[0] = mlx5vf_alloc_data_buffer(migf,
-							size, DMA_TO_DEVICE);
+				migf->buf[0] = mlx5vf_alloc_data_buffer(
+					migf, npages, DMA_TO_DEVICE);
 				if (IS_ERR(migf->buf[0])) {
 					ret = PTR_ERR(migf->buf[0]);
 					migf->buf[0] = NULL;
@@ -1037,8 +1049,11 @@ mlx5vf_pci_resume_device_data(struct mlx5vf_pci_core_device *mvdev)
 	}
 
 	migf->buf[0] = buf;
-	buf = mlx5vf_alloc_data_buffer(migf,
-		sizeof(struct mlx5_vf_migration_header), DMA_NONE);
+	buf = mlx5vf_alloc_data_buffer(
+		migf,
+		DIV_ROUND_UP(sizeof(struct mlx5_vf_migration_header),
+			     PAGE_SIZE),
+		DMA_NONE);
 	if (IS_ERR(buf)) {
 		ret = PTR_ERR(buf);
 		goto out_buf;
@@ -1148,7 +1163,8 @@ mlx5vf_pci_step_device_state_locked(struct mlx5vf_pci_core_device *mvdev,
 					MLX5VF_QUERY_INC | MLX5VF_QUERY_CLEANUP);
 		if (ret)
 			return ERR_PTR(ret);
-		buf = mlx5vf_get_data_buffer(migf, size, DMA_FROM_DEVICE);
+		buf = mlx5vf_get_data_buffer(migf,
+				DIV_ROUND_UP(size, PAGE_SIZE), DMA_FROM_DEVICE);
 		if (IS_ERR(buf))
 			return ERR_CAST(buf);
 		/* pre_copy cleanup */
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 16/24] vfio/mlx5: Rewrite create mkey flow to allow better code reuse
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (14 preceding siblings ...)
  2025-04-23  8:13 ` [PATCH v9 15/24] vfio/mlx5: Explicitly use number of pages instead of allocated length Leon Romanovsky
@ 2025-04-23  8:13 ` Leon Romanovsky
  2025-04-23 18:02   ` Jason Gunthorpe
  2025-04-23  8:13 ` [PATCH v9 17/24] vfio/mlx5: Enable the DMA link API Leon Romanovsky
                   ` (7 subsequent siblings)
  23 siblings, 1 reply; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:13 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Leon Romanovsky, Jake Edge, Jonathan Corbet, Jason Gunthorpe,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Luis Chamberlain,
	Matthew Wilcox, Dan Williams, Kanchan Joshi, Chaitanya Kulkarni

From: Leon Romanovsky <leonro@nvidia.com>

Change the mkey creation to be performed in multiple steps:
data allocation, DMA setup and the actual call to HW to create the mkey.

In this new flow, the whole input to the MKEY command is kept, which
eliminates the need to maintain a separate array of DMA address
pointers for the receive list, and in future patches for the send list
too.

In addition to reducing memory usage and eliminating unnecessary data
movements when building the MKEY input, the code is prepared for future
reuse.
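
Condensed from the diff below, the receive-buffer path now reads
roughly as follows (error unwinding trimmed):

	recv_buf->mkey_in = alloc_mkey_in(npages, pdn);
	if (!recv_buf->mkey_in)
		return -ENOMEM;

	/* DMA mapping writes the MTT entries directly into mkey_in */
	err = register_dma_pages(mdev, npages, recv_buf->page_list,
				 recv_buf->mkey_in);
	if (err)
		goto err_register_dma;

	err = create_mkey(mdev, npages, NULL, recv_buf->mkey_in,
			  &recv_buf->mkey);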

Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c | 157 ++++++++++++++++++++----------------
 drivers/vfio/pci/mlx5/cmd.h |   4 +-
 2 files changed, 91 insertions(+), 70 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index 377dee7765fb..84dc3bc128c6 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -313,39 +313,21 @@ static int mlx5vf_cmd_get_vhca_id(struct mlx5_core_dev *mdev, u16 function_id,
 	return ret;
 }
 
-static int _create_mkey(struct mlx5_core_dev *mdev, u32 pdn,
-			struct mlx5_vhca_data_buffer *buf,
-			struct mlx5_vhca_recv_buf *recv_buf,
-			u32 *mkey)
+static u32 *alloc_mkey_in(u32 npages, u32 pdn)
 {
-	size_t npages = buf ? buf->npages : recv_buf->npages;
-	int err = 0, inlen;
-	__be64 *mtt;
+	int inlen;
 	void *mkc;
 	u32 *in;
 
 	inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
-		sizeof(*mtt) * round_up(npages, 2);
+		sizeof(__be64) * round_up(npages, 2);
 
-	in = kvzalloc(inlen, GFP_KERNEL);
+	in = kvzalloc(inlen, GFP_KERNEL_ACCOUNT);
 	if (!in)
-		return -ENOMEM;
+		return NULL;
 
 	MLX5_SET(create_mkey_in, in, translations_octword_actual_size,
 		 DIV_ROUND_UP(npages, 2));
-	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, in, klm_pas_mtt);
-
-	if (buf) {
-		struct sg_dma_page_iter dma_iter;
-
-		for_each_sgtable_dma_page(&buf->table.sgt, &dma_iter, 0)
-			*mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter));
-	} else {
-		int i;
-
-		for (i = 0; i < npages; i++)
-			*mtt++ = cpu_to_be64(recv_buf->dma_addrs[i]);
-	}
 
 	mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
 	MLX5_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_MTT);
@@ -359,9 +341,30 @@ static int _create_mkey(struct mlx5_core_dev *mdev, u32 pdn,
 	MLX5_SET(mkc, mkc, log_page_size, PAGE_SHIFT);
 	MLX5_SET(mkc, mkc, translations_octword_size, DIV_ROUND_UP(npages, 2));
 	MLX5_SET64(mkc, mkc, len, npages * PAGE_SIZE);
-	err = mlx5_core_create_mkey(mdev, mkey, in, inlen);
-	kvfree(in);
-	return err;
+
+	return in;
+}
+
+static int create_mkey(struct mlx5_core_dev *mdev, u32 npages,
+		       struct mlx5_vhca_data_buffer *buf, u32 *mkey_in,
+		       u32 *mkey)
+{
+	__be64 *mtt;
+	int inlen;
+
+	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
+	if (buf) {
+		struct sg_dma_page_iter dma_iter;
+
+		for_each_sgtable_dma_page(&buf->table.sgt, &dma_iter, 0)
+			*mtt++ = cpu_to_be64(
+				sg_page_iter_dma_address(&dma_iter));
+	}
+
+	inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
+		sizeof(__be64) * round_up(npages, 2);
+
+	return mlx5_core_create_mkey(mdev, mkey, mkey_in, inlen);
 }
 
 static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
@@ -374,20 +377,28 @@ static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
 	if (mvdev->mdev_detach)
 		return -ENOTCONN;
 
-	if (buf->dmaed || !buf->npages)
+	if (buf->mkey_in || !buf->npages)
 		return -EINVAL;
 
 	ret = dma_map_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
 	if (ret)
 		return ret;
 
-	ret = _create_mkey(mdev, buf->migf->pdn, buf, NULL, &buf->mkey);
-	if (ret)
+	buf->mkey_in = alloc_mkey_in(buf->npages, buf->migf->pdn);
+	if (!buf->mkey_in) {
+		ret = -ENOMEM;
 		goto err;
+	}
 
-	buf->dmaed = true;
+	ret = create_mkey(mdev, buf->npages, buf, buf->mkey_in, &buf->mkey);
+	if (ret)
+		goto err_create_mkey;
 
 	return 0;
+
+err_create_mkey:
+	kvfree(buf->mkey_in);
+	buf->mkey_in = NULL;
 err:
 	dma_unmap_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
 	return ret;
@@ -401,8 +412,9 @@ void mlx5vf_free_data_buffer(struct mlx5_vhca_data_buffer *buf)
 	lockdep_assert_held(&migf->mvdev->state_mutex);
 	WARN_ON(migf->mvdev->mdev_detach);
 
-	if (buf->dmaed) {
+	if (buf->mkey_in) {
 		mlx5_core_destroy_mkey(migf->mvdev->mdev, buf->mkey);
+		kvfree(buf->mkey_in);
 		dma_unmap_sgtable(migf->mvdev->mdev->device, &buf->table.sgt,
 				  buf->dma_dir, 0);
 	}
@@ -783,7 +795,7 @@ int mlx5vf_cmd_load_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 	if (mvdev->mdev_detach)
 		return -ENOTCONN;
 
-	if (!buf->dmaed) {
+	if (!buf->mkey_in) {
 		err = mlx5vf_dma_data_buffer(buf);
 		if (err)
 			return err;
@@ -1384,56 +1396,54 @@ static int alloc_recv_pages(struct mlx5_vhca_recv_buf *recv_buf,
 	kvfree(recv_buf->page_list);
 	return -ENOMEM;
 }
+static void unregister_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
+				 u32 *mkey_in)
+{
+	dma_addr_t addr;
+	__be64 *mtt;
+	int i;
+
+	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
+	for (i = npages - 1; i >= 0; i--) {
+		addr = be64_to_cpu(mtt[i]);
+		dma_unmap_single(mdev->device, addr, PAGE_SIZE,
+				DMA_FROM_DEVICE);
+	}
+}
 
-static int register_dma_recv_pages(struct mlx5_core_dev *mdev,
-				   struct mlx5_vhca_recv_buf *recv_buf)
+static int register_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
+			      struct page **page_list, u32 *mkey_in)
 {
-	int i, j;
+	dma_addr_t addr;
+	__be64 *mtt;
+	int i;
 
-	recv_buf->dma_addrs = kvcalloc(recv_buf->npages,
-				       sizeof(*recv_buf->dma_addrs),
-				       GFP_KERNEL_ACCOUNT);
-	if (!recv_buf->dma_addrs)
-		return -ENOMEM;
+	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
 
-	for (i = 0; i < recv_buf->npages; i++) {
-		recv_buf->dma_addrs[i] = dma_map_page(mdev->device,
-						      recv_buf->page_list[i],
-						      0, PAGE_SIZE,
-						      DMA_FROM_DEVICE);
-		if (dma_mapping_error(mdev->device, recv_buf->dma_addrs[i]))
+	for (i = 0; i < npages; i++) {
+		addr = dma_map_page(mdev->device, page_list[i], 0, PAGE_SIZE,
+				    DMA_FROM_DEVICE);
+		if (dma_mapping_error(mdev->device, addr))
 			goto error;
+
+		*mtt++ = cpu_to_be64(addr);
 	}
+
 	return 0;
 
 error:
-	for (j = 0; j < i; j++)
-		dma_unmap_single(mdev->device, recv_buf->dma_addrs[j],
-				 PAGE_SIZE, DMA_FROM_DEVICE);
-
-	kvfree(recv_buf->dma_addrs);
+	unregister_dma_pages(mdev, i, mkey_in);
 	return -ENOMEM;
 }
 
-static void unregister_dma_recv_pages(struct mlx5_core_dev *mdev,
-				      struct mlx5_vhca_recv_buf *recv_buf)
-{
-	int i;
-
-	for (i = 0; i < recv_buf->npages; i++)
-		dma_unmap_single(mdev->device, recv_buf->dma_addrs[i],
-				 PAGE_SIZE, DMA_FROM_DEVICE);
-
-	kvfree(recv_buf->dma_addrs);
-}
-
 static void mlx5vf_free_qp_recv_resources(struct mlx5_core_dev *mdev,
 					  struct mlx5_vhca_qp *qp)
 {
 	struct mlx5_vhca_recv_buf *recv_buf = &qp->recv_buf;
 
 	mlx5_core_destroy_mkey(mdev, recv_buf->mkey);
-	unregister_dma_recv_pages(mdev, recv_buf);
+	unregister_dma_pages(mdev, recv_buf->npages, recv_buf->mkey_in);
+	kvfree(recv_buf->mkey_in);
 	free_recv_pages(&qp->recv_buf);
 }
 
@@ -1449,18 +1459,29 @@ static int mlx5vf_alloc_qp_recv_resources(struct mlx5_core_dev *mdev,
 	if (err < 0)
 		return err;
 
-	err = register_dma_recv_pages(mdev, recv_buf);
-	if (err)
+	recv_buf->mkey_in = alloc_mkey_in(npages, pdn);
+	if (!recv_buf->mkey_in) {
+		err = -ENOMEM;
 		goto end;
+	}
+
+	err = register_dma_pages(mdev, npages, recv_buf->page_list,
+				 recv_buf->mkey_in);
+	if (err)
+		goto err_register_dma;
 
-	err = _create_mkey(mdev, pdn, NULL, recv_buf, &recv_buf->mkey);
+	err = create_mkey(mdev, npages, NULL, recv_buf->mkey_in,
+			  &recv_buf->mkey);
 	if (err)
 		goto err_create_mkey;
 
 	return 0;
 
 err_create_mkey:
-	unregister_dma_recv_pages(mdev, recv_buf);
+	unregister_dma_pages(mdev, npages, recv_buf->mkey_in);
+err_register_dma:
+	kvfree(recv_buf->mkey_in);
+	recv_buf->mkey_in = NULL;
 end:
 	free_recv_pages(recv_buf);
 	return err;
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 7d4a833b6900..25dd6ff54591 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -58,8 +58,8 @@ struct mlx5_vhca_data_buffer {
 	u64 length;
 	u32 npages;
 	u32 mkey;
+	u32 *mkey_in;
 	enum dma_data_direction dma_dir;
-	u8 dmaed:1;
 	u8 stop_copy_chunk_num;
 	struct list_head buf_elm;
 	struct mlx5_vf_migration_file *migf;
@@ -133,8 +133,8 @@ struct mlx5_vhca_cq {
 struct mlx5_vhca_recv_buf {
 	u32 npages;
 	struct page **page_list;
-	dma_addr_t *dma_addrs;
 	u32 next_rq_offset;
+	u32 *mkey_in;
 	u32 mkey;
 };
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 17/24] vfio/mlx5: Enable the DMA link API
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (15 preceding siblings ...)
  2025-04-23  8:13 ` [PATCH v9 16/24] vfio/mlx5: Rewrite create mkey flow to allow better code reuse Leon Romanovsky
@ 2025-04-23  8:13 ` Leon Romanovsky
  2025-04-23 18:09   ` Jason Gunthorpe
  2025-04-23  8:13 ` [PATCH v9 18/24] block: share more code for bio addition helper Leon Romanovsky
                   ` (6 subsequent siblings)
  23 siblings, 1 reply; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:13 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Leon Romanovsky, Jake Edge, Jonathan Corbet, Jason Gunthorpe,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Luis Chamberlain,
	Matthew Wilcox, Dan Williams, Kanchan Joshi, Chaitanya Kulkarni

From: Leon Romanovsky <leonro@nvidia.com>

Remove the intermediate scatter-gather table completely and
enable the new DMA link API.
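
Condensed from register_dma_pages() in the diff below, the mapping now
has two paths (MTT filling and error unwinding trimmed; all names and
arguments are as in the diff):

	if (dma_iova_try_alloc(mdev->device, state, 0, npages * PAGE_SIZE)) {
		/* One contiguous IOVA range, linked page by page. */
		for (i = 0; i < npages; i++)
			err = dma_iova_link(mdev->device, state,
					    page_to_phys(page_list[i]),
					    i * PAGE_SIZE, PAGE_SIZE, dir, 0);
		err = dma_iova_sync(mdev->device, state, 0,
				    npages * PAGE_SIZE);
	} else {
		/* Fallback to the classic per-page dma_map_page(). */
		for (i = 0; i < npages; i++)
			addr = dma_map_page(mdev->device, page_list[i], 0,
					    PAGE_SIZE, dir);
	}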

Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c  | 298 ++++++++++++++++-------------------
 drivers/vfio/pci/mlx5/cmd.h  |  21 ++-
 drivers/vfio/pci/mlx5/main.c |  31 ----
 3 files changed, 147 insertions(+), 203 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index 84dc3bc128c6..b162e44112fb 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -345,26 +345,82 @@ static u32 *alloc_mkey_in(u32 npages, u32 pdn)
 	return in;
 }
 
-static int create_mkey(struct mlx5_core_dev *mdev, u32 npages,
-		       struct mlx5_vhca_data_buffer *buf, u32 *mkey_in,
+static int create_mkey(struct mlx5_core_dev *mdev, u32 npages, u32 *mkey_in,
 		       u32 *mkey)
 {
+	int inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
+		sizeof(__be64) * round_up(npages, 2);
+
+	return mlx5_core_create_mkey(mdev, mkey, mkey_in, inlen);
+}
+
+static void unregister_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
+				 u32 *mkey_in, struct dma_iova_state *state,
+				 enum dma_data_direction dir)
+{
+	dma_addr_t addr;
 	__be64 *mtt;
-	int inlen;
+	int i;
 
-	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
-	if (buf) {
-		struct sg_dma_page_iter dma_iter;
+	WARN_ON_ONCE(dir == DMA_NONE);
 
-		for_each_sgtable_dma_page(&buf->table.sgt, &dma_iter, 0)
-			*mtt++ = cpu_to_be64(
-				sg_page_iter_dma_address(&dma_iter));
+	if (dma_use_iova(state)) {
+		dma_iova_destroy(mdev->device, state, npages * PAGE_SIZE, dir,
+				 0);
+	} else {
+		mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in,
+					     klm_pas_mtt);
+		for (i = npages - 1; i >= 0; i--) {
+			addr = be64_to_cpu(mtt[i]);
+			dma_unmap_page(mdev->device, addr, PAGE_SIZE, dir);
+		}
 	}
+}
 
-	inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
-		sizeof(__be64) * round_up(npages, 2);
+static int register_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
+			      struct page **page_list, u32 *mkey_in,
+			      struct dma_iova_state *state,
+			      enum dma_data_direction dir)
+{
+	dma_addr_t addr;
+	size_t mapped = 0;
+	__be64 *mtt;
+	int i, err;
 
-	return mlx5_core_create_mkey(mdev, mkey, mkey_in, inlen);
+	WARN_ON_ONCE(dir == DMA_NONE);
+
+	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
+
+	if (dma_iova_try_alloc(mdev->device, state, 0, npages * PAGE_SIZE)) {
+		addr = state->addr;
+		for (i = 0; i < npages; i++) {
+			err = dma_iova_link(mdev->device, state,
+					    page_to_phys(page_list[i]), mapped,
+					    PAGE_SIZE, dir, 0);
+			if (err)
+				goto error;
+			*mtt++ = cpu_to_be64(addr);
+			addr += PAGE_SIZE;
+			mapped += PAGE_SIZE;
+		}
+		err = dma_iova_sync(mdev->device, state, 0, mapped);
+		if (err)
+			goto error;
+	} else {
+		for (i = 0; i < npages; i++) {
+			addr = dma_map_page(mdev->device, page_list[i], 0,
+					    PAGE_SIZE, dir);
+			err = dma_mapping_error(mdev->device, addr);
+			if (err)
+				goto error;
+			*mtt++ = cpu_to_be64(addr);
+		}
+	}
+	return 0;
+
+error:
+	unregister_dma_pages(mdev, i, mkey_in, state, dir);
+	return err;
 }
 
 static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
@@ -380,98 +436,90 @@ static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
 	if (buf->mkey_in || !buf->npages)
 		return -EINVAL;
 
-	ret = dma_map_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
-	if (ret)
-		return ret;
-
 	buf->mkey_in = alloc_mkey_in(buf->npages, buf->migf->pdn);
-	if (!buf->mkey_in) {
-		ret = -ENOMEM;
-		goto err;
-	}
+	if (!buf->mkey_in)
+		return -ENOMEM;
 
-	ret = create_mkey(mdev, buf->npages, buf, buf->mkey_in, &buf->mkey);
+	ret = register_dma_pages(mdev, buf->npages, buf->page_list,
+				 buf->mkey_in, &buf->state, buf->dma_dir);
+	if (ret)
+		goto err_register_dma;
+
+	ret = create_mkey(mdev, buf->npages, buf->mkey_in, &buf->mkey);
 	if (ret)
 		goto err_create_mkey;
 
 	return 0;
 
 err_create_mkey:
+	unregister_dma_pages(mdev, buf->npages, buf->mkey_in, &buf->state,
+			     buf->dma_dir);
+err_register_dma:
 	kvfree(buf->mkey_in);
 	buf->mkey_in = NULL;
-err:
-	dma_unmap_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
 	return ret;
 }
 
+static void free_page_list(u32 npages, struct page **page_list)
+{
+	int i;
+
+	/* Undo alloc_pages_bulk() */
+	for (i = npages - 1; i >= 0; i--)
+		__free_page(page_list[i]);
+
+	kvfree(page_list);
+}
+
 void mlx5vf_free_data_buffer(struct mlx5_vhca_data_buffer *buf)
 {
-	struct mlx5_vf_migration_file *migf = buf->migf;
-	struct sg_page_iter sg_iter;
+	struct mlx5vf_pci_core_device *mvdev = buf->migf->mvdev;
+	struct mlx5_core_dev *mdev = mvdev->mdev;
 
-	lockdep_assert_held(&migf->mvdev->state_mutex);
-	WARN_ON(migf->mvdev->mdev_detach);
+	lockdep_assert_held(&mvdev->state_mutex);
+	WARN_ON(mvdev->mdev_detach);
 
 	if (buf->mkey_in) {
-		mlx5_core_destroy_mkey(migf->mvdev->mdev, buf->mkey);
+		mlx5_core_destroy_mkey(mdev, buf->mkey);
+		unregister_dma_pages(mdev, buf->npages, buf->mkey_in,
+				     &buf->state, buf->dma_dir);
 		kvfree(buf->mkey_in);
-		dma_unmap_sgtable(migf->mvdev->mdev->device, &buf->table.sgt,
-				  buf->dma_dir, 0);
 	}
 
-	/* Undo alloc_pages_bulk() */
-	for_each_sgtable_page(&buf->table.sgt, &sg_iter, 0)
-		__free_page(sg_page_iter_page(&sg_iter));
-	sg_free_append_table(&buf->table);
+	free_page_list(buf->npages, buf->page_list);
 	kfree(buf);
 }
 
-static int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
-				      unsigned int npages)
+static int mlx5vf_add_pages(struct page ***page_list, unsigned int npages)
 {
-	unsigned int to_alloc = npages;
-	struct page **page_list;
-	unsigned long filled;
-	unsigned int to_fill;
-	int ret;
+	unsigned int filled, done = 0;
 	int i;
 
-	to_fill = min_t(unsigned int, npages, PAGE_SIZE / sizeof(*page_list));
-	page_list = kvzalloc(to_fill * sizeof(*page_list), GFP_KERNEL_ACCOUNT);
-	if (!page_list)
+	*page_list =
+		kvcalloc(npages, sizeof(struct page *), GFP_KERNEL_ACCOUNT);
+	if (!*page_list)
 		return -ENOMEM;
 
-	do {
-		filled = alloc_pages_bulk(GFP_KERNEL_ACCOUNT, to_fill,
-					  page_list);
-		if (!filled) {
-			ret = -ENOMEM;
+	for (;;) {
+		filled = alloc_pages_bulk(GFP_KERNEL_ACCOUNT, npages - done,
+					  *page_list + done);
+		if (!filled)
 			goto err;
-		}
-		to_alloc -= filled;
-		ret = sg_alloc_append_table_from_pages(
-			&buf->table, page_list, filled, 0,
-			filled << PAGE_SHIFT, UINT_MAX, SG_MAX_SINGLE_ALLOC,
-			GFP_KERNEL_ACCOUNT);
 
-		if (ret)
-			goto err_append;
-		buf->npages += filled;
-		/* clean input for another bulk allocation */
-		memset(page_list, 0, filled * sizeof(*page_list));
-		to_fill = min_t(unsigned int, to_alloc,
-				PAGE_SIZE / sizeof(*page_list));
-	} while (to_alloc > 0);
+		done += filled;
+		if (done == npages)
+			break;
+	}
 
-	kvfree(page_list);
 	return 0;
 
-err_append:
-	for (i = filled - 1; i >= 0; i--)
-		__free_page(page_list[i]);
 err:
-	kvfree(page_list);
-	return ret;
+	for (i = 0; i < done; i++)
+		__free_page(*page_list[i]);
+
+	kvfree(*page_list);
+	*page_list = NULL;
+	return -ENOMEM;
 }
 
 struct mlx5_vhca_data_buffer *
@@ -488,10 +536,12 @@ mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
 	buf->dma_dir = dma_dir;
 	buf->migf = migf;
 	if (npages) {
-		ret = mlx5vf_add_migration_pages(buf, npages);
+		ret = mlx5vf_add_pages(&buf->page_list, npages);
 		if (ret)
 			goto end;
 
+		buf->npages = npages;
+
 		if (dma_dir != DMA_NONE) {
 			ret = mlx5vf_dma_data_buffer(buf);
 			if (ret)
@@ -1350,101 +1400,16 @@ static void mlx5vf_destroy_qp(struct mlx5_core_dev *mdev,
 	kfree(qp);
 }
 
-static void free_recv_pages(struct mlx5_vhca_recv_buf *recv_buf)
-{
-	int i;
-
-	/* Undo alloc_pages_bulk() */
-	for (i = 0; i < recv_buf->npages; i++)
-		__free_page(recv_buf->page_list[i]);
-
-	kvfree(recv_buf->page_list);
-}
-
-static int alloc_recv_pages(struct mlx5_vhca_recv_buf *recv_buf,
-			    unsigned int npages)
-{
-	unsigned int filled = 0, done = 0;
-	int i;
-
-	recv_buf->page_list = kvcalloc(npages, sizeof(*recv_buf->page_list),
-				       GFP_KERNEL_ACCOUNT);
-	if (!recv_buf->page_list)
-		return -ENOMEM;
-
-	for (;;) {
-		filled = alloc_pages_bulk(GFP_KERNEL_ACCOUNT,
-					  npages - done,
-					  recv_buf->page_list + done);
-		if (!filled)
-			goto err;
-
-		done += filled;
-		if (done == npages)
-			break;
-	}
-
-	recv_buf->npages = npages;
-	return 0;
-
-err:
-	for (i = 0; i < npages; i++) {
-		if (recv_buf->page_list[i])
-			__free_page(recv_buf->page_list[i]);
-	}
-
-	kvfree(recv_buf->page_list);
-	return -ENOMEM;
-}
-static void unregister_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
-				 u32 *mkey_in)
-{
-	dma_addr_t addr;
-	__be64 *mtt;
-	int i;
-
-	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
-	for (i = npages - 1; i >= 0; i--) {
-		addr = be64_to_cpu(mtt[i]);
-		dma_unmap_single(mdev->device, addr, PAGE_SIZE,
-				DMA_FROM_DEVICE);
-	}
-}
-
-static int register_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
-			      struct page **page_list, u32 *mkey_in)
-{
-	dma_addr_t addr;
-	__be64 *mtt;
-	int i;
-
-	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
-
-	for (i = 0; i < npages; i++) {
-		addr = dma_map_page(mdev->device, page_list[i], 0, PAGE_SIZE,
-				    DMA_FROM_DEVICE);
-		if (dma_mapping_error(mdev->device, addr))
-			goto error;
-
-		*mtt++ = cpu_to_be64(addr);
-	}
-
-	return 0;
-
-error:
-	unregister_dma_pages(mdev, i, mkey_in);
-	return -ENOMEM;
-}
-
 static void mlx5vf_free_qp_recv_resources(struct mlx5_core_dev *mdev,
 					  struct mlx5_vhca_qp *qp)
 {
 	struct mlx5_vhca_recv_buf *recv_buf = &qp->recv_buf;
 
 	mlx5_core_destroy_mkey(mdev, recv_buf->mkey);
-	unregister_dma_pages(mdev, recv_buf->npages, recv_buf->mkey_in);
+	unregister_dma_pages(mdev, recv_buf->npages, recv_buf->mkey_in,
+			     &recv_buf->state, DMA_FROM_DEVICE);
 	kvfree(recv_buf->mkey_in);
-	free_recv_pages(&qp->recv_buf);
+	free_page_list(recv_buf->npages, recv_buf->page_list);
 }
 
 static int mlx5vf_alloc_qp_recv_resources(struct mlx5_core_dev *mdev,
@@ -1455,10 +1420,12 @@ static int mlx5vf_alloc_qp_recv_resources(struct mlx5_core_dev *mdev,
 	struct mlx5_vhca_recv_buf *recv_buf = &qp->recv_buf;
 	int err;
 
-	err = alloc_recv_pages(recv_buf, npages);
-	if (err < 0)
+	err = mlx5vf_add_pages(&recv_buf->page_list, npages);
+	if (err)
 		return err;
 
+	recv_buf->npages = npages;
+
 	recv_buf->mkey_in = alloc_mkey_in(npages, pdn);
 	if (!recv_buf->mkey_in) {
 		err = -ENOMEM;
@@ -1466,24 +1433,25 @@ static int mlx5vf_alloc_qp_recv_resources(struct mlx5_core_dev *mdev,
 	}
 
 	err = register_dma_pages(mdev, npages, recv_buf->page_list,
-				 recv_buf->mkey_in);
+				 recv_buf->mkey_in, &recv_buf->state,
+				 DMA_FROM_DEVICE);
 	if (err)
 		goto err_register_dma;
 
-	err = create_mkey(mdev, npages, NULL, recv_buf->mkey_in,
-			  &recv_buf->mkey);
+	err = create_mkey(mdev, npages, recv_buf->mkey_in, &recv_buf->mkey);
 	if (err)
 		goto err_create_mkey;
 
 	return 0;
 
 err_create_mkey:
-	unregister_dma_pages(mdev, npages, recv_buf->mkey_in);
+	unregister_dma_pages(mdev, npages, recv_buf->mkey_in, &recv_buf->state,
+			     DMA_FROM_DEVICE);
 err_register_dma:
 	kvfree(recv_buf->mkey_in);
 	recv_buf->mkey_in = NULL;
 end:
-	free_recv_pages(recv_buf);
+	free_page_list(npages, recv_buf->page_list);
 	return err;
 }
 
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 25dd6ff54591..d7821b5ca772 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -53,7 +53,8 @@ struct mlx5_vf_migration_header {
 };
 
 struct mlx5_vhca_data_buffer {
-	struct sg_append_table table;
+	struct page **page_list;
+	struct dma_iova_state state;
 	loff_t start_pos;
 	u64 length;
 	u32 npages;
@@ -63,10 +64,6 @@ struct mlx5_vhca_data_buffer {
 	u8 stop_copy_chunk_num;
 	struct list_head buf_elm;
 	struct mlx5_vf_migration_file *migf;
-	/* Optimize mlx5vf_get_migration_page() for sequential access */
-	struct scatterlist *last_offset_sg;
-	unsigned int sg_last_entry;
-	unsigned long last_offset;
 };
 
 struct mlx5vf_async_data {
@@ -133,6 +130,7 @@ struct mlx5_vhca_cq {
 struct mlx5_vhca_recv_buf {
 	u32 npages;
 	struct page **page_list;
+	struct dma_iova_state state;
 	u32 next_rq_offset;
 	u32 *mkey_in;
 	u32 mkey;
@@ -224,8 +222,17 @@ struct mlx5_vhca_data_buffer *
 mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
 		       enum dma_data_direction dma_dir);
 void mlx5vf_put_data_buffer(struct mlx5_vhca_data_buffer *buf);
-struct page *mlx5vf_get_migration_page(struct mlx5_vhca_data_buffer *buf,
-				       unsigned long offset);
+static inline struct page *
+mlx5vf_get_migration_page(struct mlx5_vhca_data_buffer *buf,
+			  unsigned long offset)
+{
+	int page_entry = offset / PAGE_SIZE;
+
+	if (page_entry >= buf->npages)
+		return NULL;
+
+	return buf->page_list[page_entry];
+}
 void mlx5vf_state_mutex_unlock(struct mlx5vf_pci_core_device *mvdev);
 void mlx5vf_disable_fds(struct mlx5vf_pci_core_device *mvdev,
 			enum mlx5_vf_migf_state *last_save_state);
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index bc0f468f741b..93f894fe60d2 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -34,37 +34,6 @@ static struct mlx5vf_pci_core_device *mlx5vf_drvdata(struct pci_dev *pdev)
 			    core_device);
 }
 
-struct page *
-mlx5vf_get_migration_page(struct mlx5_vhca_data_buffer *buf,
-			  unsigned long offset)
-{
-	unsigned long cur_offset = 0;
-	struct scatterlist *sg;
-	unsigned int i;
-
-	/* All accesses are sequential */
-	if (offset < buf->last_offset || !buf->last_offset_sg) {
-		buf->last_offset = 0;
-		buf->last_offset_sg = buf->table.sgt.sgl;
-		buf->sg_last_entry = 0;
-	}
-
-	cur_offset = buf->last_offset;
-
-	for_each_sg(buf->last_offset_sg, sg,
-			buf->table.sgt.orig_nents - buf->sg_last_entry, i) {
-		if (offset < sg->length + cur_offset) {
-			buf->last_offset_sg = sg;
-			buf->sg_last_entry += i;
-			buf->last_offset = cur_offset;
-			return nth_page(sg_page(sg),
-					(offset - cur_offset) / PAGE_SIZE);
-		}
-		cur_offset += sg->length;
-	}
-	return NULL;
-}
-
 static void mlx5vf_disable_fd(struct mlx5_vf_migration_file *migf)
 {
 	mutex_lock(&migf->lock);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 18/24] block: share more code for bio addition helper
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (16 preceding siblings ...)
  2025-04-23  8:13 ` [PATCH v9 17/24] vfio/mlx5: Enable the DMA link API Leon Romanovsky
@ 2025-04-23  8:13 ` Leon Romanovsky
  2025-04-23  8:13 ` [PATCH v9 19/24] block: don't merge different kinds of P2P transfers in a single bio Leon Romanovsky
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:13 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Leon Romanovsky

From: Christoph Hellwig <hch@lst.de>

__bio_iov_iter_get_pages currently open codes adding pages to the bio,
which duplicates a lot of code from bio_add_page. Add a bio_add_page_int
helper that passes down the same_page output argument so that
__bio_iov_iter_get_pages can reuse the main bio page addition helpers.

Note that I'd normally call this helper __bio_add_page, but that name
is already taken by an exported API.
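
With that, the exported function becomes a thin wrapper (as in the diff
below):

	int bio_add_page(struct bio *bio, struct page *page,
			 unsigned int len, unsigned int offset)
	{
		bool same_page = false;

		return bio_add_page_int(bio, page, len, offset, &same_page);
	}

and __bio_iov_iter_get_pages() reuses the same helper, handling the
same-page merge case itself:

	if (bio_add_page_int(bio, first_page, len, folio_offset,
			     &same_page) != len) {
		ret = -EINVAL;
		break;
	}
	if (same_page && bio_flagged(bio, BIO_PAGE_PINNED))
		unpin_user_folio(folio, 1);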

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 block/bio.c | 68 +++++++++++++++++++++++++----------------------------
 1 file changed, 32 insertions(+), 36 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 4e6c85a33d74..3047fa3f4b32 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -989,20 +989,9 @@ void __bio_add_page(struct bio *bio, struct page *page,
 }
 EXPORT_SYMBOL_GPL(__bio_add_page);
 
-/**
- *	bio_add_page	-	attempt to add page(s) to bio
- *	@bio: destination bio
- *	@page: start page to add
- *	@len: vec entry length, may cross pages
- *	@offset: vec entry offset relative to @page, may cross pages
- *
- *	Attempt to add page(s) to the bio_vec maplist. This will only fail
- *	if either bio->bi_vcnt == bio->bi_max_vecs or it's a cloned bio.
- */
-int bio_add_page(struct bio *bio, struct page *page,
-		 unsigned int len, unsigned int offset)
+static int bio_add_page_int(struct bio *bio, struct page *page,
+		 unsigned int len, unsigned int offset, bool *same_page)
 {
-	bool same_page = false;
 
 	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
 		return 0;
@@ -1011,7 +1000,7 @@ int bio_add_page(struct bio *bio, struct page *page,
 
 	if (bio->bi_vcnt > 0 &&
 	    bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
-				page, len, offset, &same_page)) {
+				page, len, offset, same_page)) {
 		bio->bi_iter.bi_size += len;
 		return len;
 	}
@@ -1021,6 +1010,24 @@ int bio_add_page(struct bio *bio, struct page *page,
 	__bio_add_page(bio, page, len, offset);
 	return len;
 }
+
+/**
+ * bio_add_page	- attempt to add page(s) to bio
+ * @bio: destination bio
+ * @page: start page to add
+ * @len: vec entry length, may cross pages
+ * @offset: vec entry offset relative to @page, may cross pages
+ *
+ * Attempt to add page(s) to the bio_vec maplist.  Will only fail if the
+ * bio is full, or it is incorrectly used on a cloned bio.
+ */
+int bio_add_page(struct bio *bio, struct page *page,
+		 unsigned int len, unsigned int offset)
+{
+	bool same_page = false;
+
+	return bio_add_page_int(bio, page, len, offset, &same_page);
+}
 EXPORT_SYMBOL(bio_add_page);
 
 void bio_add_folio_nofail(struct bio *bio, struct folio *folio, size_t len,
@@ -1088,27 +1095,6 @@ void bio_iov_bvec_set(struct bio *bio, const struct iov_iter *iter)
 	bio_set_flag(bio, BIO_CLONED);
 }
 
-static int bio_iov_add_folio(struct bio *bio, struct folio *folio, size_t len,
-			     size_t offset)
-{
-	bool same_page = false;
-
-	if (WARN_ON_ONCE(bio->bi_iter.bi_size > UINT_MAX - len))
-		return -EIO;
-
-	if (bio->bi_vcnt > 0 &&
-	    bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
-				folio_page(folio, 0), len, offset,
-				&same_page)) {
-		bio->bi_iter.bi_size += len;
-		if (same_page && bio_flagged(bio, BIO_PAGE_PINNED))
-			unpin_user_folio(folio, 1);
-		return 0;
-	}
-	bio_add_folio_nofail(bio, folio, len, offset);
-	return 0;
-}
-
 static unsigned int get_contig_folio_len(unsigned int *num_pages,
 					 struct page **pages, unsigned int i,
 					 struct folio *folio, size_t left,
@@ -1203,6 +1189,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	for (left = size, i = 0; left > 0; left -= len, i += num_pages) {
 		struct page *page = pages[i];
 		struct folio *folio = page_folio(page);
+		struct page *first_page = folio_page(folio, 0);
+		bool same_page = false;
 
 		folio_offset = ((size_t)folio_page_idx(folio, page) <<
 			       PAGE_SHIFT) + offset;
@@ -1215,7 +1203,15 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 			len = get_contig_folio_len(&num_pages, pages, i,
 						   folio, left, offset);
 
-		bio_iov_add_folio(bio, folio, len, folio_offset);
+		if (bio_add_page_int(bio, first_page, len, folio_offset,
+				     &same_page) != len) {
+			ret = -EINVAL;
+			break;
+		}
+
+		if (same_page && bio_flagged(bio, BIO_PAGE_PINNED))
+			unpin_user_folio(folio, 1);
+
 		offset = 0;
 	}
 
-- 
2.49.0



* [PATCH v9 19/24] block: don't merge different kinds of P2P transfers in a single bio
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (17 preceding siblings ...)
  2025-04-23  8:13 ` [PATCH v9 18/24] block: share more code for bio addition helper Leon Romanovsky
@ 2025-04-23  8:13 ` Leon Romanovsky
  2025-04-23  8:13 ` [PATCH v9 20/24] blk-mq: add scatterlist-less DMA mapping helpers Leon Romanovsky
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:13 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Leon Romanovsky

From: Christoph Hellwig <hch@lst.de>

To avoid having the dma mapping helpers check every segment for its P2P
status, ensure that a bio contains either only P2P transfers or only
non-P2P transfers, and that a P2P bio only contains ranges from a single
device.

This means the page zone access happens in the bio add path, where the
struct page should still be cache hot, and the fairly expensive P2P
topology lookup only has to be done once per bio down in the dma mapping
path, and only for bios that are already marked as P2P.
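
The mapping side then only needs a single per-request check.  The blk-mq
DMA mapping helpers added later in this series use the flag roughly like
this (sketch, shown in the context of their per-request iterator):

	if (IS_ENABLED(CONFIG_PCI_P2PDMA) && (req->cmd_flags & REQ_P2PDMA)) {
		/* one P2P topology lookup per request, not per segment */
		switch (pci_p2pdma_state(&iter->p2pdma, dma_dev,
					 phys_to_page(vec.paddr))) {
		case PCI_P2PDMA_MAP_BUS_ADDR:
			return blk_dma_map_bus(req, dma_dev, iter, &vec);
		case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
			/* from here on treated like a normal host mapping */
			req->cmd_flags &= ~REQ_P2PDMA;
			break;
		default:
			iter->status = BLK_STS_INVAL;
			return false;
		}
	}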

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 block/bio.c               | 17 ++++++++++-------
 block/blk-merge.c         | 17 +++++++++++------
 include/linux/blk_types.h |  2 ++
 3 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 3047fa3f4b32..279eac2396bf 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -928,8 +928,6 @@ static bool bvec_try_merge_page(struct bio_vec *bv, struct page *page,
 		return false;
 	if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
 		return false;
-	if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
-		return false;
 
 	*same_page = ((vec_end_addr & PAGE_MASK) == ((page_addr + off) &
 		     PAGE_MASK));
@@ -998,11 +996,16 @@ static int bio_add_page_int(struct bio *bio, struct page *page,
 	if (bio->bi_iter.bi_size > UINT_MAX - len)
 		return 0;
 
-	if (bio->bi_vcnt > 0 &&
-	    bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
-				page, len, offset, same_page)) {
-		bio->bi_iter.bi_size += len;
-		return len;
+	if (bio->bi_vcnt > 0) {
+		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+
+		if (bvec_try_merge_page(bv, page, len, offset, same_page)) {
+			bio->bi_iter.bi_size += len;
+			return len;
+		}
+	} else {
+		if (is_pci_p2pdma_page(page))
+			bio->bi_opf |= REQ_P2PDMA | REQ_NOMERGE;
 	}
 
 	if (bio->bi_vcnt >= bio->bi_max_vecs)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index fdd4efb54c6c..d9691e900cc6 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -320,12 +320,17 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
 	unsigned nsegs = 0, bytes = 0;
 
 	bio_for_each_bvec(bv, bio, iter) {
-		/*
-		 * If the queue doesn't support SG gaps and adding this
-		 * offset would create a gap, disallow it.
-		 */
-		if (bvprvp && bvec_gap_to_prev(lim, bvprvp, bv.bv_offset))
-			goto split;
+		if (bvprvp) {
+			/*
+			 * If the queue doesn't support SG gaps and adding this
+			 * offset would create a gap, disallow it.
+			 */
+			if (bvec_gap_to_prev(lim, bvprvp, bv.bv_offset))
+				goto split;
+		} else {
+			if (is_pci_p2pdma_page(bv.bv_page))
+				bio->bi_opf |= REQ_P2PDMA | REQ_NOMERGE;
+		}
 
 		if (nsegs < lim->max_segments &&
 		    bytes + bv.bv_len <= max_bytes &&
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index dce7615c35e7..94cf146e8ce6 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -378,6 +378,7 @@ enum req_flag_bits {
 	__REQ_DRV,		/* for driver use */
 	__REQ_FS_PRIVATE,	/* for file system (submitter) use */
 	__REQ_ATOMIC,		/* for atomic write operations */
+	__REQ_P2PDMA,		/* contains P2P DMA pages */
 	/*
 	 * Command specific flags, keep last:
 	 */
@@ -410,6 +411,7 @@ enum req_flag_bits {
 #define REQ_DRV		(__force blk_opf_t)(1ULL << __REQ_DRV)
 #define REQ_FS_PRIVATE	(__force blk_opf_t)(1ULL << __REQ_FS_PRIVATE)
 #define REQ_ATOMIC	(__force blk_opf_t)(1ULL << __REQ_ATOMIC)
+#define REQ_P2PDMA	(__force blk_opf_t)(1ULL << __REQ_P2PDMA)
 
 #define REQ_NOUNMAP	(__force blk_opf_t)(1ULL << __REQ_NOUNMAP)
 
-- 
2.49.0



* [PATCH v9 20/24] blk-mq: add scatterlist-less DMA mapping helpers
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (18 preceding siblings ...)
  2025-04-23  8:13 ` [PATCH v9 19/24] block: don't merge different kinds of P2P transfers in a single bio Leon Romanovsky
@ 2025-04-23  8:13 ` Leon Romanovsky
  2025-04-23  8:13 ` [PATCH v9 21/24] nvme-pci: remove struct nvme_descriptor Leon Romanovsky
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:13 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Leon Romanovsky

From: Christoph Hellwig <hch@lst.de>

Add a new blk_rq_dma_map / blk_rq_dma_unmap pair that does away with
the wasteful scatterlist structure.  Instead it uses the mapping iterator
to either add segments to the IOVA for IOMMU operations, or just map
them one by one for the direct mapping.  For the IOMMU case instead of
a scatterlist with an entry for each segment, only a single [dma_addr,len]
pair needs to be stored for processing a request, and for the direct
mapping the per-segment allocation shrinks from
[page,offset,len,dma_addr,dma_len] to just [dma_addr,len].

The major downside of this API is that the IOVA collapsing only works
when the driver sets a virt_boundary that matches the IOMMU granule.

Note that struct blk_dma_vec, struct blk_dma_mapping and blk_rq_dma_unmap
aren't really block specific, but for now they are kept next to the only
mapping routine to keep things simple.
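
A driver consumes the new pair roughly as follows (minimal sketch;
foo_add_segment() and foo_unmap_each_segment() are hypothetical
stand-ins for the driver's own per-segment bookkeeping):

	static blk_status_t foo_dma_map(struct request *req,
			struct device *dma_dev, struct dma_iova_state *state,
			size_t *mapped_len)
	{
		struct blk_dma_iter iter;

		*mapped_len = 0;
		if (!blk_rq_dma_map_iter_start(req, dma_dev, state, &iter))
			return iter.status;
		do {
			/* record or program one [dma_addr, len] pair */
			foo_add_segment(req, iter.addr, iter.len);
			*mapped_len += iter.len;
		} while (blk_rq_dma_map_iter_next(req, dma_dev, state, &iter));

		/* still BLK_STS_OK if the iteration simply ran out of segments */
		return iter.status;
	}

	static void foo_dma_unmap(struct request *req, struct device *dma_dev,
			struct dma_iova_state *state, size_t mapped_len)
	{
		if (!blk_rq_dma_unmap(req, dma_dev, state, mapped_len))
			foo_unmap_each_segment(req, dma_dev);
	}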

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 block/blk-merge.c          | 163 +++++++++++++++++++++++++++++++++++++
 include/linux/blk-mq-dma.h |  63 ++++++++++++++
 2 files changed, 226 insertions(+)
 create mode 100644 include/linux/blk-mq-dma.h

diff --git a/block/blk-merge.c b/block/blk-merge.c
index d9691e900cc6..5e2e6db60fda 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -7,6 +7,8 @@
 #include <linux/bio.h>
 #include <linux/blkdev.h>
 #include <linux/blk-integrity.h>
+#include <linux/blk-mq-dma.h>
+#include <linux/dma-mapping.h>
 #include <linux/scatterlist.h>
 #include <linux/part_stat.h>
 #include <linux/blk-cgroup.h>
@@ -535,6 +537,167 @@ static bool blk_map_iter_next(struct request *req,
 	return true;
 }
 
+/*
+ * The IOVA-based DMA API wants to be able to coalesce at the minimal IOMMU page
+ * size granularity (which is guaranteed to be <= PAGE_SIZE and usually 4k), so
+ * we need to ensure our segments are aligned to this as well.
+ *
+ * Note that there is no point in using the slightly more complicated IOVA based
+ * path for single segment mappings.
+ */
+static inline bool blk_can_dma_map_iova(struct request *req,
+		struct device *dma_dev)
+{
+	return !((queue_virt_boundary(req->q) + 1) &
+		dma_get_merge_boundary(dma_dev));
+}
+
+static bool blk_dma_map_bus(struct request *req, struct device *dma_dev,
+		struct blk_dma_iter *iter, struct phys_vec *vec)
+{
+	iter->addr = pci_p2pdma_bus_addr_map(&iter->p2pdma, vec->paddr);
+	iter->len = vec->len;
+	return true;
+}
+
+static bool blk_dma_map_direct(struct request *req, struct device *dma_dev,
+		struct blk_dma_iter *iter, struct phys_vec *vec)
+{
+	iter->addr = dma_map_page(dma_dev, phys_to_page(vec->paddr),
+			offset_in_page(vec->paddr), vec->len, rq_dma_dir(req));
+	if (dma_mapping_error(dma_dev, iter->addr)) {
+		iter->status = BLK_STS_RESOURCE;
+		return false;
+	}
+	iter->len = vec->len;
+	return true;
+}
+
+static bool blk_rq_dma_map_iova(struct request *req, struct device *dma_dev,
+		struct dma_iova_state *state, struct blk_dma_iter *iter,
+		struct phys_vec *vec)
+{
+	enum dma_data_direction dir = rq_dma_dir(req);
+	unsigned int mapped = 0;
+	int error = 0;
+
+	iter->addr = state->addr;
+	iter->len = dma_iova_size(state);
+
+	do {
+		error = dma_iova_link(dma_dev, state, vec->paddr, mapped,
+				vec->len, dir, 0);
+		if (error)
+			break;
+		mapped += vec->len;
+	} while (blk_map_iter_next(req, &iter->iter, vec));
+
+	error = dma_iova_sync(dma_dev, state, 0, mapped);
+	if (error) {
+		iter->status = errno_to_blk_status(error);
+		return false;
+	}
+
+	return true;
+}
+
+/**
+ * blk_rq_dma_map_iter_start - map the first DMA segment for a request
+ * @req:	request to map
+ * @dma_dev:	device to map to
+ * @state:	DMA IOVA state
+ * @iter:	block layer DMA iterator
+ *
+ * Start DMA mapping @req to @dma_dev.  @state and @iter are provided by the
+ * caller and don't need to be initialized.  @state needs to be stored for use
+ * at unmap time, @iter is only needed at map time.
+ *
+ * Returns %false if there is no segment to map, including due to an error, or
+ * %true if it did map a segment.
+ *
+ * If a segment was mapped, the DMA address for it is returned in @iter.addr and
+ * the length in @iter.len.  If no segment was mapped the status code is
+ * returned in @iter.status.
+ *
+ * The caller can call blk_rq_dma_map_coalesce() to check if further segments
+ * need to be mapped after this, or go straight to blk_rq_dma_map_iter_next()
+ * to try to map the following segments.
+ */
+bool blk_rq_dma_map_iter_start(struct request *req, struct device *dma_dev,
+		struct dma_iova_state *state, struct blk_dma_iter *iter)
+{
+	unsigned int total_len = blk_rq_payload_bytes(req);
+	struct phys_vec vec;
+
+	iter->iter.bio = req->bio;
+	iter->iter.iter = req->bio->bi_iter;
+	memset(&iter->p2pdma, 0, sizeof(iter->p2pdma));
+	iter->status = BLK_STS_OK;
+
+	/*
+	 * Grab the first segment ASAP because we'll need it to check for P2P
+	 * transfers.
+	 */
+	if (!blk_map_iter_next(req, &iter->iter, &vec))
+		return false;
+
+	if (IS_ENABLED(CONFIG_PCI_P2PDMA) && (req->cmd_flags & REQ_P2PDMA)) {
+		switch (pci_p2pdma_state(&iter->p2pdma, dma_dev,
+					 phys_to_page(vec.paddr))) {
+		case PCI_P2PDMA_MAP_BUS_ADDR:
+			return blk_dma_map_bus(req, dma_dev, iter, &vec);
+		case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+			/*
+			 * P2P transfers through the host bridge are treated the
+			 * same as non-P2P transfers below and during unmap.
+			 */
+			req->cmd_flags &= ~REQ_P2PDMA;
+			break;
+		default:
+			iter->status = BLK_STS_INVAL;
+			return false;
+		}
+	}
+
+	if (blk_can_dma_map_iova(req, dma_dev) &&
+	    dma_iova_try_alloc(dma_dev, state, vec.paddr, total_len))
+		return blk_rq_dma_map_iova(req, dma_dev, state, iter, &vec);
+	return blk_dma_map_direct(req, dma_dev, iter, &vec);
+}
+EXPORT_SYMBOL_GPL(blk_rq_dma_map_iter_start);
+
+/**
+ * blk_rq_dma_map_iter_next - map the next DMA segment for a request
+ * @req:	request to map
+ * @dma_dev:	device to map to
+ * @state:	DMA IOVA state
+ * @iter:	block layer DMA iterator
+ *
+ * Iterate to the next mapping after a previous call to
+ * blk_rq_dma_map_iter_start().  See there for a detailed description of the
+ * arguments.
+ *
+ * Returns %false if there is no segment to map, including due to an error, or
+ * %true if it did map a segment.
+ *
+ * If a segment was mapped, the DMA address for it is returned in @iter.addr and
+ * the length in @iter.len.  If no segment was mapped the status code is
+ * returned in @iter.status.
+ */
+bool blk_rq_dma_map_iter_next(struct request *req, struct device *dma_dev,
+		struct dma_iova_state *state, struct blk_dma_iter *iter)
+{
+	struct phys_vec vec;
+
+	if (!blk_map_iter_next(req, &iter->iter, &vec))
+		return false;
+
+	if (iter->p2pdma.map == PCI_P2PDMA_MAP_BUS_ADDR)
+		return blk_dma_map_bus(req, dma_dev, iter, &vec);
+	return blk_dma_map_direct(req, dma_dev, iter, &vec);
+}
+EXPORT_SYMBOL_GPL(blk_rq_dma_map_iter_next);
+
 static inline struct scatterlist *blk_next_sg(struct scatterlist **sg,
 		struct scatterlist *sglist)
 {
diff --git a/include/linux/blk-mq-dma.h b/include/linux/blk-mq-dma.h
new file mode 100644
index 000000000000..6d85b4bedcba
--- /dev/null
+++ b/include/linux/blk-mq-dma.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef BLK_MQ_DMA_H
+#define BLK_MQ_DMA_H
+
+#include <linux/blk-mq.h>
+#include <linux/pci-p2pdma.h>
+
+struct blk_dma_iter {
+	/* Output address range for this iteration */
+	dma_addr_t			addr;
+	u32				len;
+
+	/* Status code. Only valid when blk_rq_dma_map_iter_* returned false */
+	blk_status_t			status;
+
+	/* Internal to blk_rq_dma_map_iter_* */
+	struct req_iterator		iter;
+	struct pci_p2pdma_map_state	p2pdma;
+};
+
+bool blk_rq_dma_map_iter_start(struct request *req, struct device *dma_dev,
+		struct dma_iova_state *state, struct blk_dma_iter *iter);
+bool blk_rq_dma_map_iter_next(struct request *req, struct device *dma_dev,
+		struct dma_iova_state *state, struct blk_dma_iter *iter);
+
+/**
+ * blk_rq_dma_map_coalesce - were all segments coalesced?
+ * @state: DMA state to check
+ *
+ * Returns true if blk_rq_dma_map_iter_start coalesced all segments into a
+ * single DMA range.
+ */
+static inline bool blk_rq_dma_map_coalesce(struct dma_iova_state *state)
+{
+	return dma_use_iova(state);
+}
+
+/**
+ * blk_rq_dma_unmap - try to DMA unmap a request
+ * @req:	request to unmap
+ * @dma_dev:	device to unmap from
+ * @state:	DMA IOVA state
+ * @mapped_len: number of bytes to unmap
+ *
+ * Returns %false if the caller needs to manually unmap every DMA segment
+ * mapped using @iter, or %true if no work is left to be done.
+ */
+static inline bool blk_rq_dma_unmap(struct request *req, struct device *dma_dev,
+		struct dma_iova_state *state, size_t mapped_len)
+{
+	if (req->cmd_flags & REQ_P2PDMA)
+		return true;
+
+	if (dma_use_iova(state)) {
+		dma_iova_destroy(dma_dev, state, mapped_len, rq_dma_dir(req),
+				 0);
+		return true;
+	}
+
+	return !dma_need_unmap(dma_dev);
+}
+
+#endif /* BLK_MQ_DMA_H */
-- 
2.49.0



* [PATCH v9 21/24] nvme-pci: remove struct nvme_descriptor
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (19 preceding siblings ...)
  2025-04-23  8:13 ` [PATCH v9 20/24] blk-mq: add scatterlist-less DMA mapping helpers Leon Romanovsky
@ 2025-04-23  8:13 ` Leon Romanovsky
  2025-04-23  8:13 ` [PATCH v9 22/24] nvme-pci: use a better encoding for small prp pool allocations Leon Romanovsky
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:13 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Leon Romanovsky

From: Christoph Hellwig <hch@lst.de>

There is no real point in having a union of two pointer types here; just
use a void pointer, as we already mix and match types between the arms of
the union on the allocation and freeing sides.

Also rename the nr_allocations field to nr_descriptors to better describe
what it does.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/nvme/host/pci.c | 57 +++++++++++++++++------------------------
 1 file changed, 24 insertions(+), 33 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b178d52eac1b..638e759b29ad 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -43,7 +43,7 @@
 #define NVME_MAX_KB_SZ	8192
 #define NVME_MAX_SEGS	128
 #define NVME_MAX_META_SEGS 15
-#define NVME_MAX_NR_ALLOCATIONS	5
+#define NVME_MAX_NR_DESCRIPTORS	5
 
 static int use_threaded_interrupts;
 module_param(use_threaded_interrupts, int, 0444);
@@ -219,30 +219,22 @@ struct nvme_queue {
 	struct completion delete_done;
 };
 
-union nvme_descriptor {
-	struct nvme_sgl_desc	*sg_list;
-	__le64			*prp_list;
-};
-
 /*
  * The nvme_iod describes the data in an I/O.
- *
- * The sg pointer contains the list of PRP/SGL chunk allocations in addition
- * to the actual struct scatterlist.
  */
 struct nvme_iod {
 	struct nvme_request req;
 	struct nvme_command cmd;
 	bool aborted;
-	s8 nr_allocations;	/* PRP list pool allocations. 0 means small
-				   pool in use */
+	/* # of PRP/SGL descriptors: (0 for small pool) */
+	s8 nr_descriptors;
 	unsigned int dma_len;	/* length of single DMA segment mapping */
 	dma_addr_t first_dma;
 	dma_addr_t meta_dma;
 	struct sg_table sgt;
 	struct sg_table meta_sgt;
-	union nvme_descriptor meta_list;
-	union nvme_descriptor list[NVME_MAX_NR_ALLOCATIONS];
+	void *meta_list;
+	void *descriptors[NVME_MAX_NR_DESCRIPTORS];
 };
 
 static inline unsigned int nvme_dbbuf_size(struct nvme_dev *dev)
@@ -544,8 +536,8 @@ static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
 	dma_addr_t dma_addr = iod->first_dma;
 	int i;
 
-	for (i = 0; i < iod->nr_allocations; i++) {
-		__le64 *prp_list = iod->list[i].prp_list;
+	for (i = 0; i < iod->nr_descriptors; i++) {
+		__le64 *prp_list = iod->descriptors[i];
 		dma_addr_t next_dma_addr = le64_to_cpu(prp_list[last_prp]);
 
 		dma_pool_free(dev->prp_page_pool, prp_list, dma_addr);
@@ -567,11 +559,11 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 
 	dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
 
-	if (iod->nr_allocations == 0)
-		dma_pool_free(dev->prp_small_pool, iod->list[0].sg_list,
+	if (iod->nr_descriptors == 0)
+		dma_pool_free(dev->prp_small_pool, iod->descriptors[0],
 			      iod->first_dma);
-	else if (iod->nr_allocations == 1)
-		dma_pool_free(dev->prp_page_pool, iod->list[0].sg_list,
+	else if (iod->nr_descriptors == 1)
+		dma_pool_free(dev->prp_page_pool, iod->descriptors[0],
 			      iod->first_dma);
 	else
 		nvme_free_prps(dev, req);
@@ -629,18 +621,18 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
 	nprps = DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE);
 	if (nprps <= (256 / 8)) {
 		pool = dev->prp_small_pool;
-		iod->nr_allocations = 0;
+		iod->nr_descriptors = 0;
 	} else {
 		pool = dev->prp_page_pool;
-		iod->nr_allocations = 1;
+		iod->nr_descriptors = 1;
 	}
 
 	prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
 	if (!prp_list) {
-		iod->nr_allocations = -1;
+		iod->nr_descriptors = -1;
 		return BLK_STS_RESOURCE;
 	}
-	iod->list[0].prp_list = prp_list;
+	iod->descriptors[0] = prp_list;
 	iod->first_dma = prp_dma;
 	i = 0;
 	for (;;) {
@@ -649,7 +641,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
 			prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
 			if (!prp_list)
 				goto free_prps;
-			iod->list[iod->nr_allocations++].prp_list = prp_list;
+			iod->descriptors[iod->nr_descriptors++] = prp_list;
 			prp_list[0] = old_prp_list[i - 1];
 			old_prp_list[i - 1] = cpu_to_le64(prp_dma);
 			i = 1;
@@ -719,19 +711,19 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
 
 	if (entries <= (256 / sizeof(struct nvme_sgl_desc))) {
 		pool = dev->prp_small_pool;
-		iod->nr_allocations = 0;
+		iod->nr_descriptors = 0;
 	} else {
 		pool = dev->prp_page_pool;
-		iod->nr_allocations = 1;
+		iod->nr_descriptors = 1;
 	}
 
 	sg_list = dma_pool_alloc(pool, GFP_ATOMIC, &sgl_dma);
 	if (!sg_list) {
-		iod->nr_allocations = -1;
+		iod->nr_descriptors = -1;
 		return BLK_STS_RESOURCE;
 	}
 
-	iod->list[0].sg_list = sg_list;
+	iod->descriptors[0] = sg_list;
 	iod->first_dma = sgl_dma;
 
 	nvme_pci_sgl_set_seg(&cmd->dptr.sgl, sgl_dma, entries);
@@ -870,7 +862,7 @@ static blk_status_t nvme_pci_setup_meta_sgls(struct nvme_dev *dev,
 		goto out_unmap_sg;
 
 	entries = iod->meta_sgt.nents;
-	iod->meta_list.sg_list = sg_list;
+	iod->meta_list = sg_list;
 	iod->meta_dma = sgl_dma;
 
 	cmnd->flags = NVME_CMD_SGL_METASEG;
@@ -923,7 +915,7 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 	blk_status_t ret;
 
 	iod->aborted = false;
-	iod->nr_allocations = -1;
+	iod->nr_descriptors = -1;
 	iod->sgt.nents = 0;
 	iod->meta_sgt.nents = 0;
 
@@ -1048,8 +1040,7 @@ static __always_inline void nvme_unmap_metadata(struct nvme_dev *dev,
 		return;
 	}
 
-	dma_pool_free(dev->prp_small_pool, iod->meta_list.sg_list,
-		      iod->meta_dma);
+	dma_pool_free(dev->prp_small_pool, iod->meta_list, iod->meta_dma);
 	dma_unmap_sgtable(dev->dev, &iod->meta_sgt, rq_dma_dir(req), 0);
 	mempool_free(iod->meta_sgt.sgl, dev->iod_meta_mempool);
 }
@@ -3801,7 +3792,7 @@ static int __init nvme_init(void)
 	BUILD_BUG_ON(IRQ_AFFINITY_MAX_SETS < 2);
 	BUILD_BUG_ON(NVME_MAX_SEGS > SGES_PER_PAGE);
 	BUILD_BUG_ON(sizeof(struct scatterlist) * NVME_MAX_SEGS > PAGE_SIZE);
-	BUILD_BUG_ON(nvme_pci_npages_prp() > NVME_MAX_NR_ALLOCATIONS);
+	BUILD_BUG_ON(nvme_pci_npages_prp() > NVME_MAX_NR_DESCRIPTORS);
 
 	return pci_register_driver(&nvme_driver);
 }
-- 
2.49.0



* [PATCH v9 22/24] nvme-pci: use a better encoding for small prp pool allocations
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (20 preceding siblings ...)
  2025-04-23  8:13 ` [PATCH v9 21/24] nvme-pci: remove struct nvme_descriptor Leon Romanovsky
@ 2025-04-23  8:13 ` Leon Romanovsky
  2025-04-23  9:05   ` Christoph Hellwig
  2025-04-23  8:13 ` [PATCH v9 23/24] nvme-pci: convert to blk_rq_dma_map Leon Romanovsky
  2025-04-23  8:13 ` [PATCH v9 24/24] nvme-pci: store aborted state in flags variable Leon Romanovsky
  23 siblings, 1 reply; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:13 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Leon Romanovsky

From: Christoph Hellwig <hch@lst.de>

There is plenty of unused space in the iod next to nr_descriptors.
Add a separate flag to encode that the transfer is using the full
page sized pool, and use a normal 0..n count for the number of
descriptors.
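
Condensed from the hunks below, descriptor allocation then looks like
this for the PRP path:

	/* small (256 byte) pool unless the PRP list needs a full page */
	if (DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE) >
	    NVME_SMALL_DESCRIPTOR_SIZE / sizeof(__le64))
		iod->flags |= IOD_LARGE_DESCRIPTORS;

	prp_list = dma_pool_alloc(nvme_dma_pool(dev, iod), GFP_ATOMIC,
			&prp_dma);
	if (!prp_list)
		return BLK_STS_RESOURCE;
	/* plain 0..n count, no more -1/0/1 special cases */
	iod->descriptors[iod->nr_descriptors++] = prp_list;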

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
[ Leon: changed the original bool variable to a flag, as proposed by Kanchan ]
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/nvme/host/pci.c | 93 ++++++++++++++++++++---------------------
 1 file changed, 46 insertions(+), 47 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 638e759b29ad..7e93536d01cb 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -44,6 +44,7 @@
 #define NVME_MAX_SEGS	128
 #define NVME_MAX_META_SEGS 15
 #define NVME_MAX_NR_DESCRIPTORS	5
+#define NVME_SMALL_DESCRIPTOR_SIZE 256
 
 static int use_threaded_interrupts;
 module_param(use_threaded_interrupts, int, 0444);
@@ -219,6 +220,10 @@ struct nvme_queue {
 	struct completion delete_done;
 };
 
+enum {
+	IOD_LARGE_DESCRIPTORS = 1, /* uses the full page sized descriptor pool */
+};
+
 /*
  * The nvme_iod describes the data in an I/O.
  */
@@ -226,8 +231,8 @@ struct nvme_iod {
 	struct nvme_request req;
 	struct nvme_command cmd;
 	bool aborted;
-	/* # of PRP/SGL descriptors: (0 for small pool) */
-	s8 nr_descriptors;
+	u8 nr_descriptors;	/* # of PRP/SGL descriptors */
+	unsigned int flags;
 	unsigned int dma_len;	/* length of single DMA segment mapping */
 	dma_addr_t first_dma;
 	dma_addr_t meta_dma;
@@ -529,13 +534,27 @@ static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
 	return true;
 }
 
-static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
+static inline struct dma_pool *nvme_dma_pool(struct nvme_dev *dev,
+		struct nvme_iod *iod)
+{
+	if (iod->flags & IOD_LARGE_DESCRIPTORS)
+		return dev->prp_page_pool;
+	return dev->prp_small_pool;
+}
+
+static void nvme_free_descriptors(struct nvme_dev *dev, struct request *req)
 {
 	const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	dma_addr_t dma_addr = iod->first_dma;
 	int i;
 
+	if (iod->nr_descriptors == 1) {
+		dma_pool_free(nvme_dma_pool(dev, iod), iod->descriptors[0],
+				dma_addr);
+		return;
+	}
+
 	for (i = 0; i < iod->nr_descriptors; i++) {
 		__le64 *prp_list = iod->descriptors[i];
 		dma_addr_t next_dma_addr = le64_to_cpu(prp_list[last_prp]);
@@ -558,15 +577,7 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 	WARN_ON_ONCE(!iod->sgt.nents);
 
 	dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
-
-	if (iod->nr_descriptors == 0)
-		dma_pool_free(dev->prp_small_pool, iod->descriptors[0],
-			      iod->first_dma);
-	else if (iod->nr_descriptors == 1)
-		dma_pool_free(dev->prp_page_pool, iod->descriptors[0],
-			      iod->first_dma);
-	else
-		nvme_free_prps(dev, req);
+	nvme_free_descriptors(dev, req);
 	mempool_free(iod->sgt.sgl, dev->iod_mempool);
 }
 
@@ -588,7 +599,6 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
 		struct request *req, struct nvme_rw_command *cmnd)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-	struct dma_pool *pool;
 	int length = blk_rq_payload_bytes(req);
 	struct scatterlist *sg = iod->sgt.sgl;
 	int dma_len = sg_dma_len(sg);
@@ -596,7 +606,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
 	int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
 	__le64 *prp_list;
 	dma_addr_t prp_dma;
-	int nprps, i;
+	int i;
 
 	length -= (NVME_CTRL_PAGE_SIZE - offset);
 	if (length <= 0) {
@@ -618,27 +628,23 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
 		goto done;
 	}
 
-	nprps = DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE);
-	if (nprps <= (256 / 8)) {
-		pool = dev->prp_small_pool;
-		iod->nr_descriptors = 0;
-	} else {
-		pool = dev->prp_page_pool;
-		iod->nr_descriptors = 1;
-	}
+	if (DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE) >
+	    NVME_SMALL_DESCRIPTOR_SIZE / sizeof(__le64))
+		iod->flags |= IOD_LARGE_DESCRIPTORS;
 
-	prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
-	if (!prp_list) {
-		iod->nr_descriptors = -1;
+	prp_list = dma_pool_alloc(nvme_dma_pool(dev, iod), GFP_ATOMIC,
+			&prp_dma);
+	if (!prp_list)
 		return BLK_STS_RESOURCE;
-	}
-	iod->descriptors[0] = prp_list;
+	iod->descriptors[iod->nr_descriptors++] = prp_list;
 	iod->first_dma = prp_dma;
 	i = 0;
 	for (;;) {
 		if (i == NVME_CTRL_PAGE_SIZE >> 3) {
 			__le64 *old_prp_list = prp_list;
-			prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
+
+			prp_list = dma_pool_alloc(dev->prp_page_pool,
+					GFP_ATOMIC, &prp_dma);
 			if (!prp_list)
 				goto free_prps;
 			iod->descriptors[iod->nr_descriptors++] = prp_list;
@@ -665,7 +671,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
 	cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
 	return BLK_STS_OK;
 free_prps:
-	nvme_free_prps(dev, req);
+	nvme_free_descriptors(dev, req);
 	return BLK_STS_RESOURCE;
 bad_sgl:
 	WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
@@ -694,7 +700,6 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
 		struct request *req, struct nvme_rw_command *cmd)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-	struct dma_pool *pool;
 	struct nvme_sgl_desc *sg_list;
 	struct scatterlist *sg = iod->sgt.sgl;
 	unsigned int entries = iod->sgt.nents;
@@ -709,21 +714,13 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
 		return BLK_STS_OK;
 	}
 
-	if (entries <= (256 / sizeof(struct nvme_sgl_desc))) {
-		pool = dev->prp_small_pool;
-		iod->nr_descriptors = 0;
-	} else {
-		pool = dev->prp_page_pool;
-		iod->nr_descriptors = 1;
-	}
+	if (entries > NVME_SMALL_DESCRIPTOR_SIZE / sizeof(*sg_list))
+		iod->flags |= IOD_LARGE_DESCRIPTORS;
 
-	sg_list = dma_pool_alloc(pool, GFP_ATOMIC, &sgl_dma);
-	if (!sg_list) {
-		iod->nr_descriptors = -1;
+	sg_list = dma_pool_alloc(nvme_dma_pool(dev, iod), GFP_ATOMIC, &sgl_dma);
+	if (!sg_list)
 		return BLK_STS_RESOURCE;
-	}
-
-	iod->descriptors[0] = sg_list;
+	iod->descriptors[iod->nr_descriptors++] = sg_list;
 	iod->first_dma = sgl_dma;
 
 	nvme_pci_sgl_set_seg(&cmd->dptr.sgl, sgl_dma, entries);
@@ -915,7 +912,8 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 	blk_status_t ret;
 
 	iod->aborted = false;
-	iod->nr_descriptors = -1;
+	iod->nr_descriptors = 0;
+	iod->flags = 0;
 	iod->sgt.nents = 0;
 	iod->meta_sgt.nents = 0;
 
@@ -2833,7 +2831,7 @@ static int nvme_disable_prepare_reset(struct nvme_dev *dev, bool shutdown)
 
 static int nvme_setup_prp_pools(struct nvme_dev *dev)
 {
-	size_t small_align = 256;
+	size_t small_align = NVME_SMALL_DESCRIPTOR_SIZE;
 
 	dev->prp_page_pool = dma_pool_create("prp list page", dev->dev,
 						NVME_CTRL_PAGE_SIZE,
@@ -2841,12 +2839,13 @@ static int nvme_setup_prp_pools(struct nvme_dev *dev)
 	if (!dev->prp_page_pool)
 		return -ENOMEM;
 
+	BUILD_BUG_ON(NVME_SMALL_DESCRIPTOR_SIZE != 256);
 	if (dev->ctrl.quirks & NVME_QUIRK_DMAPOOL_ALIGN_512)
-		small_align = 512;
+		small_align *= 2;
 
 	/* Optimisation for I/Os between 4k and 128k */
 	dev->prp_small_pool = dma_pool_create("prp list 256", dev->dev,
-						256, small_align, 0);
+						NVME_SMALL_DESCRIPTOR_SIZE, small_align, 0);
 	if (!dev->prp_small_pool) {
 		dma_pool_destroy(dev->prp_page_pool);
 		return -ENOMEM;
-- 
2.49.0



* [PATCH v9 23/24] nvme-pci: convert to blk_rq_dma_map
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (21 preceding siblings ...)
  2025-04-23  8:13 ` [PATCH v9 22/24] nvme-pci: use a better encoding for small prp pool allocations Leon Romanovsky
@ 2025-04-23  8:13 ` Leon Romanovsky
  2025-04-23  9:24   ` Christoph Hellwig
  2025-04-23 14:58   ` Keith Busch
  2025-04-23  8:13 ` [PATCH v9 24/24] nvme-pci: store aborted state in flags variable Leon Romanovsky
  23 siblings, 2 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:13 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Nitesh Shetty, Leon Romanovsky

From: Christoph Hellwig <hch@lst.de>

Use the blk_rq_dma_map API to DMA map requests instead of
scatterlists.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
[ Leon: squashed optimization patch from Kanchan ]
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
[ Leon: rewrote original patch due to rebases and addition of metadata support ]
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/nvme/host/pci.c | 597 ++++++++++++++++++++++------------------
 1 file changed, 332 insertions(+), 265 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 7e93536d01cb..eb60a486331c 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -7,7 +7,7 @@
 #include <linux/acpi.h>
 #include <linux/async.h>
 #include <linux/blkdev.h>
-#include <linux/blk-mq.h>
+#include <linux/blk-mq-dma.h>
 #include <linux/blk-integrity.h>
 #include <linux/dmi.h>
 #include <linux/init.h>
@@ -26,7 +26,6 @@
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <linux/io-64-nonatomic-hi-lo.h>
 #include <linux/sed-opal.h>
-#include <linux/pci-p2pdma.h>
 
 #include "trace.h"
 #include "nvme.h"
@@ -144,9 +143,6 @@ struct nvme_dev {
 	bool hmb;
 	struct sg_table *hmb_sgt;
 
-	mempool_t *iod_mempool;
-	mempool_t *iod_meta_mempool;
-
 	/* shadow doorbell buffer support: */
 	__le32 *dbbuf_dbs;
 	dma_addr_t dbbuf_dbs_dma_addr;
@@ -222,6 +218,7 @@ struct nvme_queue {
 
 enum {
 	IOD_LARGE_DESCRIPTORS = 1, /* uses the full page sized descriptor pool */
+	IOD_SINGLE_SEGMENT = 2, /* single segment dma mapping */
 };
 
 /*
@@ -233,11 +230,11 @@ struct nvme_iod {
 	bool aborted;
 	u8 nr_descriptors;	/* # of PRP/SGL descriptors */
 	unsigned int flags;
-	unsigned int dma_len;	/* length of single DMA segment mapping */
-	dma_addr_t first_dma;
+	unsigned int total_len; /* length of the entire transfer */
+	unsigned int total_meta_len; /* length of the entire metadata transfer */
 	dma_addr_t meta_dma;
-	struct sg_table sgt;
-	struct sg_table meta_sgt;
+	struct dma_iova_state dma_state;
+	struct dma_iova_state dma_meta_state;
 	void *meta_list;
 	void *descriptors[NVME_MAX_NR_DESCRIPTORS];
 };
@@ -546,9 +543,14 @@ static void nvme_free_descriptors(struct nvme_dev *dev, struct request *req)
 {
 	const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-	dma_addr_t dma_addr = iod->first_dma;
+	dma_addr_t dma_addr;
 	int i;
 
+	if (iod->cmd.common.flags & NVME_CMD_SGL_METABUF)
+		dma_addr = le64_to_cpu(iod->cmd.common.dptr.sgl.addr);
+	else
+		dma_addr = le64_to_cpu(iod->cmd.common.dptr.prp2);
+
 	if (iod->nr_descriptors == 1) {
 		dma_pool_free(nvme_dma_pool(dev, iod), iod->descriptors[0],
 				dma_addr);
@@ -564,67 +566,178 @@ static void nvme_free_descriptors(struct nvme_dev *dev, struct request *req)
 	}
 }
 
+static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
+{
+	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	enum dma_data_direction dir = rq_dma_dir(req);
+	int length = iod->total_len;
+	dma_addr_t dma_addr;
+	int i, desc;
+	__le64 *prp_list;
+	u32 dma_len;
+
+	dma_addr = le64_to_cpu(iod->cmd.common.dptr.prp1);
+	dma_len = min_t(u32, length,
+		NVME_CTRL_PAGE_SIZE - (dma_addr & (NVME_CTRL_PAGE_SIZE - 1)));
+	length -= dma_len;
+	if (!length) {
+		dma_unmap_page(dev->dev, dma_addr, dma_len, dir);
+		return;
+	}
+
+	if (length <= NVME_CTRL_PAGE_SIZE) {
+		dma_unmap_page(dev->dev, dma_addr, dma_len, dir);
+		dma_addr = le64_to_cpu(iod->cmd.common.dptr.prp2);
+		dma_unmap_page(dev->dev, dma_addr, length, dir);
+		return;
+	}
+
+	i = 0;
+	desc = 0;
+	prp_list = iod->descriptors[desc];
+	do {
+		dma_unmap_page(dev->dev, dma_addr, dma_len, dir);
+		if (i == NVME_CTRL_PAGE_SIZE >> 3) {
+			prp_list = iod->descriptors[++desc];
+			i = 0;
+		}
+
+		dma_addr = le64_to_cpu(prp_list[i++]);
+		dma_len = min(length, NVME_CTRL_PAGE_SIZE);
+		length -= dma_len;
+	} while (length);
+}
+
+
+static void nvme_free_sgls(struct nvme_dev *dev, struct request *req)
+{
+	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	dma_addr_t sqe_dma_addr = le64_to_cpu(iod->cmd.common.dptr.sgl.addr);
+	unsigned int sqe_dma_len = le32_to_cpu(iod->cmd.common.dptr.sgl.length);
+	struct nvme_sgl_desc *sg_list = iod->descriptors[0];
+	enum dma_data_direction dir = rq_dma_dir(req);
+
+	if (iod->nr_descriptors) {
+		unsigned int nr_entries = sqe_dma_len / sizeof(*sg_list), i;
+
+		for (i = 0; i < nr_entries; i++)
+			dma_unmap_page(dev->dev, le64_to_cpu(sg_list[i].addr),
+				le32_to_cpu(sg_list[i].length), dir);
+	} else {
+		dma_unmap_page(dev->dev, sqe_dma_addr, sqe_dma_len, dir);
+	}
+}
+
 static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	dma_addr_t dma_addr;
 
-	if (iod->dma_len) {
-		dma_unmap_page(dev->dev, iod->first_dma, iod->dma_len,
-			       rq_dma_dir(req));
+	if (iod->flags & IOD_SINGLE_SEGMENT) {
+		dma_addr = le64_to_cpu(iod->cmd.common.dptr.prp1);
+		dma_unmap_page(dev->dev, dma_addr, iod->total_len,
+				rq_dma_dir(req));
 		return;
 	}
 
-	WARN_ON_ONCE(!iod->sgt.nents);
+	if (!blk_rq_dma_unmap(req, dev->dev, &iod->dma_state, iod->total_len)) {
+		if (iod->cmd.common.flags & NVME_CMD_SGL_METABUF)
+			nvme_free_sgls(dev, req);
+		else
+			nvme_free_prps(dev, req);
+	}
 
-	dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
-	nvme_free_descriptors(dev, req);
-	mempool_free(iod->sgt.sgl, dev->iod_mempool);
+	if (iod->nr_descriptors)
+		nvme_free_descriptors(dev, req);
 }
 
-static void nvme_print_sgl(struct scatterlist *sgl, int nents)
+static bool nvme_try_setup_prp_simple(struct nvme_dev *dev, struct request *req,
+				      struct nvme_rw_command *cmnd,
+				      struct blk_dma_iter *iter)
 {
-	int i;
-	struct scatterlist *sg;
+	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct bio_vec bv = req_bvec(req);
+	unsigned int first_prp_len;
+
+	if (IS_ENABLED(CONFIG_PCI_P2PDMA) && (req->cmd_flags & REQ_P2PDMA))
+		return false;
+
+	if ((bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1)) + bv.bv_len >
+	    NVME_CTRL_PAGE_SIZE * 2)
+		return false;
 
-	for_each_sg(sgl, sg, nents, i) {
-		dma_addr_t phys = sg_phys(sg);
-		pr_warn("sg[%d] phys_addr:%pad offset:%d length:%d "
-			"dma_address:%pad dma_length:%d\n",
-			i, &phys, sg->offset, sg->length, &sg_dma_address(sg),
-			sg_dma_len(sg));
+	iter->addr = dma_map_bvec(dev->dev, &bv, rq_dma_dir(req), 0);
+	if (dma_mapping_error(dev->dev, iter->addr)) {
+		iter->status = BLK_STS_RESOURCE;
+		return true;
 	}
+	iod->total_len = bv.bv_len;
+	cmnd->dptr.prp1 = cpu_to_le64(iter->addr);
+
+	first_prp_len = NVME_CTRL_PAGE_SIZE -
+			(bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1));
+	if (bv.bv_len > first_prp_len)
+		cmnd->dptr.prp2 = cpu_to_le64(iter->addr + first_prp_len);
+	else
+		cmnd->dptr.prp2 = 0;
+
+	iter->status = BLK_STS_OK;
+	iod->flags |= IOD_SINGLE_SEGMENT;
+	return true;
 }
 
 static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
-		struct request *req, struct nvme_rw_command *cmnd)
+					struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-	int length = blk_rq_payload_bytes(req);
-	struct scatterlist *sg = iod->sgt.sgl;
-	int dma_len = sg_dma_len(sg);
-	u64 dma_addr = sg_dma_address(sg);
-	int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
+	struct nvme_rw_command *cmnd = &iod->cmd.rw;
+	unsigned int length = blk_rq_payload_bytes(req);
+	struct blk_dma_iter iter;
+	dma_addr_t prp1_dma, prp2_dma = 0;
+	unsigned int prp_len, i;
 	__le64 *prp_list;
-	dma_addr_t prp_dma;
-	int i;
+	unsigned int nr_segments = blk_rq_nr_phys_segments(req);
 
-	length -= (NVME_CTRL_PAGE_SIZE - offset);
-	if (length <= 0) {
-		iod->first_dma = 0;
-		goto done;
+	if (nr_segments == 1) {
+		if (nvme_try_setup_prp_simple(dev, req, cmnd, &iter))
+			return iter.status;
 	}
 
-	dma_len -= (NVME_CTRL_PAGE_SIZE - offset);
-	if (dma_len) {
-		dma_addr += (NVME_CTRL_PAGE_SIZE - offset);
-	} else {
-		sg = sg_next(sg);
-		dma_addr = sg_dma_address(sg);
-		dma_len = sg_dma_len(sg);
+	if (!blk_rq_dma_map_iter_start(req, dev->dev, &iod->dma_state, &iter))
+		return iter.status;
+
+	/*
+	 * PRP1 always points to the start of the DMA transfers.
+	 *
+	 * This is the only PRP (except for the list entries) that could be
+	 * non-aligned.
+	 */
+	prp1_dma = iter.addr;
+	prp_len = min(length, NVME_CTRL_PAGE_SIZE -
+			(iter.addr & (NVME_CTRL_PAGE_SIZE - 1)));
+	iod->total_len += prp_len;
+	iter.addr += prp_len;
+	iter.len -= prp_len;
+	length -= prp_len;
+	if (!length)
+		goto done;
+
+	if (!iter.len) {
+		if (!blk_rq_dma_map_iter_next(req, dev->dev, &iod->dma_state,
+				&iter)) {
+			if (WARN_ON_ONCE(!iter.status))
+				goto bad_sgl;
+			goto done;
+		}
 	}
 
+	/*
+	 * PRP2 is usually a list, but can point to data if all data to be
+	 * transferred fits into PRP1 + PRP2:
+	 */
 	if (length <= NVME_CTRL_PAGE_SIZE) {
-		iod->first_dma = dma_addr;
+		prp2_dma = iter.addr;
+		iod->total_len += length;
 		goto done;
 	}
 
@@ -633,58 +746,83 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
 		iod->flags |= IOD_LARGE_DESCRIPTORS;
 
 	prp_list = dma_pool_alloc(nvme_dma_pool(dev, iod), GFP_ATOMIC,
-			&prp_dma);
-	if (!prp_list)
-		return BLK_STS_RESOURCE;
+			&prp2_dma);
+	if (!prp_list) {
+		iter.status = BLK_STS_RESOURCE;
+		goto done;
+	}
 	iod->descriptors[iod->nr_descriptors++] = prp_list;
-	iod->first_dma = prp_dma;
+
 	i = 0;
 	for (;;) {
+		prp_list[i++] = cpu_to_le64(iter.addr);
+		prp_len = min(length, NVME_CTRL_PAGE_SIZE);
+		if (WARN_ON_ONCE(iter.len < prp_len))
+			goto bad_sgl;
+
+		iod->total_len += prp_len;
+		iter.addr += prp_len;
+		iter.len -= prp_len;
+		length -= prp_len;
+		if (!length)
+			break;
+
+		if (iter.len == 0) {
+			if (!blk_rq_dma_map_iter_next(req, dev->dev,
+					&iod->dma_state, &iter)) {
+				if (WARN_ON_ONCE(!iter.status))
+					goto bad_sgl;
+				goto done;
+			}
+		}
+
+		/*
+		 * If we've filled the entire descriptor, allocate a new that is
+		 * pointed to be the last entry in the previous PRP list.  To
+		 * accommodate for that move the last actual entry to the new
+		 * descriptor.
+		 */
 		if (i == NVME_CTRL_PAGE_SIZE >> 3) {
 			__le64 *old_prp_list = prp_list;
+			dma_addr_t prp_list_dma;
 
 			prp_list = dma_pool_alloc(dev->prp_page_pool,
-					GFP_ATOMIC, &prp_dma);
-			if (!prp_list)
-				goto free_prps;
+					GFP_ATOMIC, &prp_list_dma);
+			if (!prp_list) {
+				iter.status = BLK_STS_RESOURCE;
+				goto done;
+			}
 			iod->descriptors[iod->nr_descriptors++] = prp_list;
+
 			prp_list[0] = old_prp_list[i - 1];
-			old_prp_list[i - 1] = cpu_to_le64(prp_dma);
+			old_prp_list[i - 1] = cpu_to_le64(prp_list_dma);
 			i = 1;
 		}
-		prp_list[i++] = cpu_to_le64(dma_addr);
-		dma_len -= NVME_CTRL_PAGE_SIZE;
-		dma_addr += NVME_CTRL_PAGE_SIZE;
-		length -= NVME_CTRL_PAGE_SIZE;
-		if (length <= 0)
-			break;
-		if (dma_len > 0)
-			continue;
-		if (unlikely(dma_len < 0))
-			goto bad_sgl;
-		sg = sg_next(sg);
-		dma_addr = sg_dma_address(sg);
-		dma_len = sg_dma_len(sg);
 	}
+
 done:
-	cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
-	cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
-	return BLK_STS_OK;
-free_prps:
-	nvme_free_descriptors(dev, req);
-	return BLK_STS_RESOURCE;
+	/*
+	 * nvme_unmap_data uses the DPTR field in the SQE to tear down the
+	 * mapping, so initialize it even for failures.
+	 */
+	cmnd->dptr.prp1 = cpu_to_le64(prp1_dma);
+	cmnd->dptr.prp2 = cpu_to_le64(prp2_dma);
+	if (unlikely(iter.status))
+		nvme_unmap_data(dev, req);
+	return iter.status;
+
 bad_sgl:
-	WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
-			"Invalid SGL for payload:%d nents:%d\n",
-			blk_rq_payload_bytes(req), iod->sgt.nents);
+	dev_err_once(dev->dev,
+		"Incorrectly formed request for payload:%d nents:%d\n",
+		blk_rq_payload_bytes(req), blk_rq_nr_phys_segments(req));
 	return BLK_STS_IOERR;
 }
 
 static void nvme_pci_sgl_set_data(struct nvme_sgl_desc *sge,
-		struct scatterlist *sg)
+		struct blk_dma_iter *iter)
 {
-	sge->addr = cpu_to_le64(sg_dma_address(sg));
-	sge->length = cpu_to_le32(sg_dma_len(sg));
+	sge->addr = cpu_to_le64(iter->addr);
+	sge->length = cpu_to_le32(iter->len);
 	sge->type = NVME_SGL_FMT_DATA_DESC << 4;
 }
 
@@ -696,21 +834,60 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc *sge,
 	sge->type = NVME_SGL_FMT_LAST_SEG_DESC << 4;
 }
 
+static bool nvme_try_setup_sgl_simple(struct nvme_dev *dev, struct request *req,
+				      struct nvme_rw_command *cmnd,
+				      struct blk_dma_iter *iter)
+{
+	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct bio_vec bv = req_bvec(req);
+
+	if (IS_ENABLED(CONFIG_PCI_P2PDMA) && (req->cmd_flags & REQ_P2PDMA))
+		return false;
+
+	if ((bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1)) + bv.bv_len >
+			NVME_CTRL_PAGE_SIZE * 2)
+		return false;
+
+	iter->addr = dma_map_bvec(dev->dev, &bv, rq_dma_dir(req), 0);
+	if (dma_mapping_error(dev->dev, iter->addr)) {
+		iter->status = BLK_STS_RESOURCE;
+		return true;
+	}
+	iod->total_len = bv.bv_len;
+	cmnd->dptr.sgl.addr = cpu_to_le64(iter->addr);
+	cmnd->dptr.sgl.length = cpu_to_le32(iod->total_len);
+	cmnd->dptr.sgl.type = NVME_SGL_FMT_DATA_DESC << 4;
+	iter->status = BLK_STS_OK;
+	iod->flags |= IOD_SINGLE_SEGMENT;
+	return true;
+}
+
 static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
-		struct request *req, struct nvme_rw_command *cmd)
+					struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct nvme_rw_command *cmd = &iod->cmd.rw;
+	unsigned int entries = blk_rq_nr_phys_segments(req);
 	struct nvme_sgl_desc *sg_list;
-	struct scatterlist *sg = iod->sgt.sgl;
-	unsigned int entries = iod->sgt.nents;
+	struct blk_dma_iter iter;
 	dma_addr_t sgl_dma;
-	int i = 0;
+	unsigned int mapped = 0;
+	unsigned int nr_segments = blk_rq_nr_phys_segments(req);
 
 	/* setting the transfer type as SGL */
 	cmd->flags = NVME_CMD_SGL_METABUF;
 
-	if (entries == 1) {
-		nvme_pci_sgl_set_data(&cmd->dptr.sgl, sg);
+	if (nr_segments == 1) {
+		if (nvme_try_setup_sgl_simple(dev, req, cmd, &iter))
+			return iter.status;
+	}
+
+	if (!blk_rq_dma_map_iter_start(req, dev->dev, &iod->dma_state, &iter))
+		return iter.status;
+
+	if (entries == 1 || blk_rq_dma_map_coalesce(&iod->dma_state)) {
+		nvme_pci_sgl_set_data(&cmd->dptr.sgl, &iter);
+		iod->total_len += iter.len;
 		return BLK_STS_OK;
 	}
 
@@ -721,168 +898,109 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
 	if (!sg_list)
 		return BLK_STS_RESOURCE;
 	iod->descriptors[iod->nr_descriptors++] = sg_list;
-	iod->first_dma = sgl_dma;
 
-	nvme_pci_sgl_set_seg(&cmd->dptr.sgl, sgl_dma, entries);
 	do {
-		nvme_pci_sgl_set_data(&sg_list[i++], sg);
-		sg = sg_next(sg);
-	} while (--entries > 0);
+		if (WARN_ON_ONCE(mapped == entries)) {
+			iter.status = BLK_STS_IOERR;
+			break;
+		}
+		nvme_pci_sgl_set_data(&sg_list[mapped++], &iter);
+		iod->total_len += iter.len;
+	} while (blk_rq_dma_map_iter_next(req, dev->dev, &iod->dma_state,
+				&iter));
 
-	return BLK_STS_OK;
+	nvme_pci_sgl_set_seg(&cmd->dptr.sgl, sgl_dma, mapped);
+	if (unlikely(iter.status))
+		nvme_free_sgls(dev, req);
+	return iter.status;
 }
 
-static blk_status_t nvme_setup_prp_simple(struct nvme_dev *dev,
-		struct request *req, struct nvme_rw_command *cmnd,
-		struct bio_vec *bv)
+static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req)
 {
-	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-	unsigned int offset = bv->bv_offset & (NVME_CTRL_PAGE_SIZE - 1);
-	unsigned int first_prp_len = NVME_CTRL_PAGE_SIZE - offset;
-
-	iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
-	if (dma_mapping_error(dev->dev, iod->first_dma))
-		return BLK_STS_RESOURCE;
-	iod->dma_len = bv->bv_len;
-
-	cmnd->dptr.prp1 = cpu_to_le64(iod->first_dma);
-	if (bv->bv_len > first_prp_len)
-		cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma + first_prp_len);
-	else
-		cmnd->dptr.prp2 = 0;
-	return BLK_STS_OK;
+	if (nvme_pci_use_sgls(dev, req, blk_rq_nr_phys_segments(req)))
+		return nvme_pci_setup_sgls(dev, req);
+	return nvme_pci_setup_prps(dev, req);
 }
 
-static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
-		struct request *req, struct nvme_rw_command *cmnd,
-		struct bio_vec *bv)
+static __always_inline void nvme_unmap_metadata(struct nvme_dev *dev,
+						struct request *req)
 {
+	unsigned int entries = req->nr_integrity_segments;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	struct nvme_sgl_desc *sg_list = iod->meta_list;
+	enum dma_data_direction dir = rq_dma_dir(req);
+	dma_addr_t dma_addr;
 
-	iod->first_dma = dma_map_bvec(dev->dev, bv, rq_dma_dir(req), 0);
-	if (dma_mapping_error(dev->dev, iod->first_dma))
-		return BLK_STS_RESOURCE;
-	iod->dma_len = bv->bv_len;
-
-	cmnd->flags = NVME_CMD_SGL_METABUF;
-	cmnd->dptr.sgl.addr = cpu_to_le64(iod->first_dma);
-	cmnd->dptr.sgl.length = cpu_to_le32(iod->dma_len);
-	cmnd->dptr.sgl.type = NVME_SGL_FMT_DATA_DESC << 4;
-	return BLK_STS_OK;
-}
-
-static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
-		struct nvme_command *cmnd)
-{
-	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-	blk_status_t ret = BLK_STS_RESOURCE;
-	int rc;
-
-	if (blk_rq_nr_phys_segments(req) == 1) {
-		struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
-		struct bio_vec bv = req_bvec(req);
-
-		if (!is_pci_p2pdma_page(bv.bv_page)) {
-			if (!nvme_pci_metadata_use_sgls(dev, req) &&
-			    (bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1)) +
-			     bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)
-				return nvme_setup_prp_simple(dev, req,
-							     &cmnd->rw, &bv);
-
-			if (nvmeq->qid && sgl_threshold &&
-			    nvme_ctrl_sgl_supported(&dev->ctrl))
-				return nvme_setup_sgl_simple(dev, req,
-							     &cmnd->rw, &bv);
-		}
+	if (iod->flags & IOD_SINGLE_SEGMENT) {
+		dma_addr = le64_to_cpu(iod->cmd.common.dptr.sgl.addr);
+		dma_unmap_page(dev->dev, dma_addr, iod->total_len, rq_dma_dir(req));
+		return;
 	}
 
-	iod->dma_len = 0;
-	iod->sgt.sgl = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
-	if (!iod->sgt.sgl)
-		return BLK_STS_RESOURCE;
-	sg_init_table(iod->sgt.sgl, blk_rq_nr_phys_segments(req));
-	iod->sgt.orig_nents = blk_rq_map_sg(req, iod->sgt.sgl);
-	if (!iod->sgt.orig_nents)
-		goto out_free_sg;
+	if (!blk_rq_dma_unmap(req, dev->dev, &iod->dma_meta_state,
+			      iod->total_meta_len)) {
+		if (iod->cmd.common.flags & NVME_CMD_SGL_METASEG) {
+			unsigned int i;
 
-	rc = dma_map_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req),
-			     DMA_ATTR_NO_WARN);
-	if (rc) {
-		if (rc == -EREMOTEIO)
-			ret = BLK_STS_TARGET;
-		goto out_free_sg;
+			for (i = 0; i < entries; i++)
+				dma_unmap_page(dev->dev,
+				       le64_to_cpu(sg_list[i].addr),
+				       le32_to_cpu(sg_list[i].length), dir);
+		} else {
+			dma_unmap_page(dev->dev, iod->meta_dma,
+				       rq_integrity_vec(req).bv_len, dir);
+			return;
+		}
 	}
 
-	if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
-		ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
-	else
-		ret = nvme_pci_setup_prps(dev, req, &cmnd->rw);
-	if (ret != BLK_STS_OK)
-		goto out_unmap_sg;
-	return BLK_STS_OK;
-
-out_unmap_sg:
-	dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
-out_free_sg:
-	mempool_free(iod->sgt.sgl, dev->iod_mempool);
-	return ret;
+	dma_pool_free(dev->prp_small_pool, iod->meta_list, iod->meta_dma);
 }
 
 static blk_status_t nvme_pci_setup_meta_sgls(struct nvme_dev *dev,
 					     struct request *req)
 {
+	unsigned int entries = req->nr_integrity_segments;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-	struct nvme_rw_command *cmnd = &iod->cmd.rw;
+	struct nvme_rw_command *cmd = &iod->cmd.rw;
 	struct nvme_sgl_desc *sg_list;
-	struct scatterlist *sgl, *sg;
-	unsigned int entries;
+	struct blk_dma_iter iter;
+	unsigned int mapped = 0;
 	dma_addr_t sgl_dma;
-	int rc, i;
-
-	iod->meta_sgt.sgl = mempool_alloc(dev->iod_meta_mempool, GFP_ATOMIC);
-	if (!iod->meta_sgt.sgl)
-		return BLK_STS_RESOURCE;
 
-	sg_init_table(iod->meta_sgt.sgl, req->nr_integrity_segments);
-	iod->meta_sgt.orig_nents = blk_rq_map_integrity_sg(req,
-							   iod->meta_sgt.sgl);
-	if (!iod->meta_sgt.orig_nents)
-		goto out_free_sg;
+	cmd->flags = NVME_CMD_SGL_METASEG;
 
-	rc = dma_map_sgtable(dev->dev, &iod->meta_sgt, rq_dma_dir(req),
-			     DMA_ATTR_NO_WARN);
-	if (rc)
-		goto out_free_sg;
+	if (!blk_rq_dma_map_iter_start(req, dev->dev, &iod->dma_meta_state,
+				       &iter))
+		return iter.status;
 
 	sg_list = dma_pool_alloc(dev->prp_small_pool, GFP_ATOMIC, &sgl_dma);
 	if (!sg_list)
-		goto out_unmap_sg;
+		return BLK_STS_RESOURCE;
 
-	entries = iod->meta_sgt.nents;
 	iod->meta_list = sg_list;
 	iod->meta_dma = sgl_dma;
+	cmd->metadata = cpu_to_le64(sgl_dma);
 
-	cmnd->flags = NVME_CMD_SGL_METASEG;
-	cmnd->metadata = cpu_to_le64(sgl_dma);
-
-	sgl = iod->meta_sgt.sgl;
-	if (entries == 1) {
-		nvme_pci_sgl_set_data(sg_list, sgl);
+	if (entries == 1 || blk_rq_dma_map_coalesce(&iod->dma_meta_state)) {
+		nvme_pci_sgl_set_data(sg_list, &iter);
+		iod->total_meta_len += iter.len;
 		return BLK_STS_OK;
 	}
 
-	sgl_dma += sizeof(*sg_list);
-	nvme_pci_sgl_set_seg(sg_list, sgl_dma, entries);
-	for_each_sg(sgl, sg, entries, i)
-		nvme_pci_sgl_set_data(&sg_list[i + 1], sg);
-
-	return BLK_STS_OK;
+	do {
+		if (WARN_ON_ONCE(mapped == entries)) {
+			iter.status = BLK_STS_IOERR;
+			break;
+		}
+		nvme_pci_sgl_set_data(&sg_list[mapped++], &iter);
+		iod->total_len += iter.len;
+	} while (blk_rq_dma_map_iter_next(req, dev->dev, &iod->dma_meta_state,
+				 &iter));
 
-out_unmap_sg:
-	dma_unmap_sgtable(dev->dev, &iod->meta_sgt, rq_dma_dir(req), 0);
-out_free_sg:
-	mempool_free(iod->meta_sgt.sgl, dev->iod_meta_mempool);
-	return BLK_STS_RESOURCE;
+	nvme_pci_sgl_set_seg(sg_list, sgl_dma, mapped);
+	if (unlikely(iter.status))
+		nvme_unmap_metadata(dev, req);
+	return iter.status;
 }
 
 static blk_status_t nvme_pci_setup_meta_mptr(struct nvme_dev *dev,
@@ -914,15 +1032,15 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 	iod->aborted = false;
 	iod->nr_descriptors = 0;
 	iod->flags = 0;
-	iod->sgt.nents = 0;
-	iod->meta_sgt.nents = 0;
+	iod->total_len = 0;
+	iod->total_meta_len = 0;
 
 	ret = nvme_setup_cmd(req->q->queuedata, req);
 	if (ret)
 		return ret;
 
 	if (blk_rq_nr_phys_segments(req)) {
-		ret = nvme_map_data(dev, req, &iod->cmd);
+		ret = nvme_map_data(dev, req);
 		if (ret)
 			goto out_free_cmd;
 	}
@@ -1026,23 +1144,6 @@ static void nvme_queue_rqs(struct rq_list *rqlist)
 	*rqlist = requeue_list;
 }
 
-static __always_inline void nvme_unmap_metadata(struct nvme_dev *dev,
-						struct request *req)
-{
-	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-
-	if (!iod->meta_sgt.nents) {
-		dma_unmap_page(dev->dev, iod->meta_dma,
-			       rq_integrity_vec(req).bv_len,
-			       rq_dma_dir(req));
-		return;
-	}
-
-	dma_pool_free(dev->prp_small_pool, iod->meta_list, iod->meta_dma);
-	dma_unmap_sgtable(dev->dev, &iod->meta_sgt, rq_dma_dir(req), 0);
-	mempool_free(iod->meta_sgt.sgl, dev->iod_meta_mempool);
-}
-
 static __always_inline void nvme_pci_unmap_rq(struct request *req)
 {
 	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
@@ -2859,31 +2960,6 @@ static void nvme_release_prp_pools(struct nvme_dev *dev)
 	dma_pool_destroy(dev->prp_small_pool);
 }
 
-static int nvme_pci_alloc_iod_mempool(struct nvme_dev *dev)
-{
-	size_t meta_size = sizeof(struct scatterlist) * (NVME_MAX_META_SEGS + 1);
-	size_t alloc_size = sizeof(struct scatterlist) * NVME_MAX_SEGS;
-
-	dev->iod_mempool = mempool_create_node(1,
-			mempool_kmalloc, mempool_kfree,
-			(void *)alloc_size, GFP_KERNEL,
-			dev_to_node(dev->dev));
-	if (!dev->iod_mempool)
-		return -ENOMEM;
-
-	dev->iod_meta_mempool = mempool_create_node(1,
-			mempool_kmalloc, mempool_kfree,
-			(void *)meta_size, GFP_KERNEL,
-			dev_to_node(dev->dev));
-	if (!dev->iod_meta_mempool)
-		goto free;
-
-	return 0;
-free:
-	mempool_destroy(dev->iod_mempool);
-	return -ENOMEM;
-}
-
 static void nvme_free_tagset(struct nvme_dev *dev)
 {
 	if (dev->tagset.tags)
@@ -3252,15 +3328,11 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (result)
 		goto out_dev_unmap;
 
-	result = nvme_pci_alloc_iod_mempool(dev);
-	if (result)
-		goto out_release_prp_pools;
-
 	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
 
 	result = nvme_pci_enable(dev);
 	if (result)
-		goto out_release_iod_mempool;
+		goto out_release_prp_pools;
 
 	result = nvme_alloc_admin_tag_set(&dev->ctrl, &dev->admin_tagset,
 				&nvme_mq_admin_ops, sizeof(struct nvme_iod));
@@ -3327,9 +3399,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	nvme_dev_remove_admin(dev);
 	nvme_dbbuf_dma_free(dev);
 	nvme_free_queues(dev, 0);
-out_release_iod_mempool:
-	mempool_destroy(dev->iod_mempool);
-	mempool_destroy(dev->iod_meta_mempool);
 out_release_prp_pools:
 	nvme_release_prp_pools(dev);
 out_dev_unmap:
@@ -3394,8 +3463,6 @@ static void nvme_remove(struct pci_dev *pdev)
 	nvme_dev_remove_admin(dev);
 	nvme_dbbuf_dma_free(dev);
 	nvme_free_queues(dev, 0);
-	mempool_destroy(dev->iod_mempool);
-	mempool_destroy(dev->iod_meta_mempool);
 	nvme_release_prp_pools(dev);
 	nvme_dev_unmap(dev);
 	nvme_uninit_ctrl(&dev->ctrl);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 24/24] nvme-pci: store aborted state in flags variable
  2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (22 preceding siblings ...)
  2025-04-23  8:13 ` [PATCH v9 23/24] nvme-pci: convert to blk_rq_dma_map Leon Romanovsky
@ 2025-04-23  8:13 ` Leon Romanovsky
  23 siblings, 0 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23  8:13 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Leon Romanovsky, Jake Edge, Jonathan Corbet, Jason Gunthorpe,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Luis Chamberlain,
	Matthew Wilcox, Dan Williams, Kanchan Joshi, Chaitanya Kulkarni

From: Leon Romanovsky <leonro@nvidia.com>

Instead of keeping a dedicated "bool aborted" variable, let's reuse the
newly introduced flags variable and save space.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/nvme/host/pci.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index eb60a486331c..f69f1eb4308e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -219,6 +219,7 @@ struct nvme_queue {
 enum {
 	IOD_LARGE_DESCRIPTORS = 1, /* uses the full page sized descriptor pool */
 	IOD_SINGLE_SEGMENT = 2, /* single segment dma mapping */
+	IOD_ABORTED = 3, /* abort timed out commands */
 };
 
 /*
@@ -227,7 +228,6 @@ enum {
 struct nvme_iod {
 	struct nvme_request req;
 	struct nvme_command cmd;
-	bool aborted;
 	u8 nr_descriptors;	/* # of PRP/SGL descriptors */
 	unsigned int flags;
 	unsigned int total_len; /* length of the entire transfer */
@@ -1029,7 +1029,6 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	blk_status_t ret;
 
-	iod->aborted = false;
 	iod->nr_descriptors = 0;
 	iod->flags = 0;
 	iod->total_len = 0;
@@ -1578,7 +1577,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
 	 * returned to the driver, or if this is the admin queue.
 	 */
 	opcode = nvme_req(req)->cmd->common.opcode;
-	if (!nvmeq->qid || iod->aborted) {
+	if (!nvmeq->qid || (iod->flags & IOD_ABORTED)) {
 		dev_warn(dev->ctrl.device,
 			 "I/O tag %d (%04x) opcode %#x (%s) QID %d timeout, reset controller\n",
 			 req->tag, nvme_cid(req), opcode,
@@ -1591,7 +1590,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
 		atomic_inc(&dev->ctrl.abort_limit);
 		return BLK_EH_RESET_TIMER;
 	}
-	iod->aborted = true;
+	iod->flags |= IOD_ABORTED;
 
 	cmd.abort.opcode = nvme_admin_abort_cmd;
 	cmd.abort.cid = nvme_cid(req);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 22/24] nvme-pci: use a better encoding for small prp pool allocations
  2025-04-23  8:13 ` [PATCH v9 22/24] nvme-pci: use a better encoding for small prp pool allocations Leon Romanovsky
@ 2025-04-23  9:05   ` Christoph Hellwig
  2025-04-23 13:39     ` Leon Romanovsky
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2025-04-23  9:05 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Leon Romanovsky

On Wed, Apr 23, 2025 at 11:13:13AM +0300, Leon Romanovsky wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> There is plenty of unused space in the iod next to nr_descriptors.
> Add a separate flag to encode that the transfer is using the full
> page sized pool, and use a normal 0..n count for the number of
> descriptors.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Tested-by: Jens Axboe <axboe@kernel.dk>
> [ Leon: changed original bool variable to be flag as was proposed by Kanchan ]
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  drivers/nvme/host/pci.c | 93 ++++++++++++++++++++---------------------
>  1 file changed, 46 insertions(+), 47 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 638e759b29ad..7e93536d01cb 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -44,6 +44,7 @@
>  #define NVME_MAX_SEGS	128
>  #define NVME_MAX_META_SEGS 15
>  #define NVME_MAX_NR_DESCRIPTORS	5
> +#define NVME_SMALL_DESCRIPTOR_SIZE 256
>  
>  static int use_threaded_interrupts;
>  module_param(use_threaded_interrupts, int, 0444);
> @@ -219,6 +220,10 @@ struct nvme_queue {
>  	struct completion delete_done;
>  };
>  
> +enum {
> +	IOD_LARGE_DESCRIPTORS = 1, /* uses the full page sized descriptor pool */

This is used as a ORable flag, I'd make that explicit:

	/* uses the full page sized descriptor pool */
	IOD_LARGE_DESCRIPTORS		= 1U << 0,

and similar for the next flag added in the next patch.

>  	struct nvme_request req;
>  	struct nvme_command cmd;
>  	bool aborted;
> -	/* # of PRP/SGL descriptors: (0 for small pool) */
> -	s8 nr_descriptors;
> +	u8 nr_descriptors;	/* # of PRP/SGL descriptors */
> +	unsigned int flags;

And this should be limited to a u16 to not bloat the structure.
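
Put together, the suggested encoding would look roughly like the sketch
below (illustrative only, not the code as merged; IOD_SINGLE_SEGMENT and
IOD_ABORTED are the flags added by the following patches):

	enum {
		/* uses the full page sized descriptor pool */
		IOD_LARGE_DESCRIPTORS	= 1U << 0,
		/* single segment dma mapping */
		IOD_SINGLE_SEGMENT	= 1U << 1,
		/* abort was issued for this timed out command */
		IOD_ABORTED		= 1U << 2,
	};

	struct nvme_iod {
		...
		u8 nr_descriptors;	/* # of PRP/SGL descriptors */
		u16 flags;		/* IOD_* bits; a later reply narrows this to u8 */
		...
	};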


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 23/24] nvme-pci: convert to blk_rq_dma_map
  2025-04-23  8:13 ` [PATCH v9 23/24] nvme-pci: convert to blk_rq_dma_map Leon Romanovsky
@ 2025-04-23  9:24   ` Christoph Hellwig
  2025-04-23 10:03     ` Leon Romanovsky
                       ` (2 more replies)
  2025-04-23 14:58   ` Keith Busch
  1 sibling, 3 replies; 73+ messages in thread
From: Christoph Hellwig @ 2025-04-23  9:24 UTC (permalink / raw)
  To: Leon Romanovsky, Keith Busch
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Jake Edge,
	Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun, Robin Murphy,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Jérôme Glisse, Andrew Morton,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Nitesh Shetty, Leon Romanovsky

I don't think the meta SGL handling is quite right yet, and the
single segment data handling also regressed.  Totally untested
patch below, I'll try to allocate some testing time later today.

Right now I don't have a test setup for metasgl, though.  Keith,
do you have a good qemu config for that?  Or anyone else?

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f69f1eb4308e..80c21082b0c6 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -634,7 +634,11 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 	dma_addr_t dma_addr;
 
 	if (iod->flags & IOD_SINGLE_SEGMENT) {
-		dma_addr = le64_to_cpu(iod->cmd.common.dptr.prp1);
+		if (iod->cmd.common.flags &
+		    (NVME_CMD_SGL_METABUF | NVME_CMD_SGL_METASEG))
+			dma_addr = le64_to_cpu(iod->cmd.common.dptr.sgl.addr);
+		else
+			dma_addr = le64_to_cpu(iod->cmd.common.dptr.prp1);
 		dma_unmap_page(dev->dev, dma_addr, iod->total_len,
 				rq_dma_dir(req));
 		return;
@@ -922,35 +926,37 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req)
 	return nvme_pci_setup_prps(dev, req);
 }
 
-static __always_inline void nvme_unmap_metadata(struct nvme_dev *dev,
-						struct request *req)
+static void nvme_unmap_metadata(struct nvme_dev *dev, struct request *req)
 {
 	unsigned int entries = req->nr_integrity_segments;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	struct nvme_sgl_desc *sg_list = iod->meta_list;
 	enum dma_data_direction dir = rq_dma_dir(req);
-	dma_addr_t dma_addr;
 
-	if (iod->flags & IOD_SINGLE_SEGMENT) {
-		dma_addr = le64_to_cpu(iod->cmd.common.dptr.sgl.addr);
-		dma_unmap_page(dev->dev, dma_addr, iod->total_len, rq_dma_dir(req));
+	/*
+	 * If the NVME_CMD_SGL_METASEG flag is not set and we're using the
+	 * non-SGL linear meta buffer we know that we have a single input
+	 * segment as well.
+	 *
+	 * Note that it would be nice to always use the linear buffer when
+	 * using IOVA mappings and kernel buffers to avoid the SGL
+	 * indirection, but that's left for a future optimization.
+	 */
+	if (!(iod->cmd.common.flags & NVME_CMD_SGL_METASEG)) {
+		dma_unmap_page(dev->dev,
+			le64_to_cpu(iod->cmd.common.dptr.prp1),
+			iod->total_len, rq_dma_dir(req));
 		return;
 	}
 
 	if (!blk_rq_dma_unmap(req, dev->dev, &iod->dma_meta_state,
 			      iod->total_meta_len)) {
-		if (iod->cmd.common.flags & NVME_CMD_SGL_METASEG) {
-			unsigned int i;
+		unsigned int i;
 
-			for (i = 0; i < entries; i++)
-				dma_unmap_page(dev->dev,
-				       le64_to_cpu(sg_list[i].addr),
-				       le32_to_cpu(sg_list[i].length), dir);
-		} else {
-			dma_unmap_page(dev->dev, iod->meta_dma,
-				       rq_integrity_vec(req).bv_len, dir);
-			return;
-		}
+		for (i = 0; i < entries; i++)
+			dma_unmap_page(dev->dev,
+			       le64_to_cpu(sg_list[i].addr),
+			       le32_to_cpu(sg_list[i].length), dir);
 	}
 
 	dma_pool_free(dev->prp_small_pool, iod->meta_list, iod->meta_dma);

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 23/24] nvme-pci: convert to blk_rq_dma_map
  2025-04-23  9:24   ` Christoph Hellwig
@ 2025-04-23 10:03     ` Leon Romanovsky
  2025-04-23 15:47       ` Christoph Hellwig
  2025-04-23 15:05     ` Keith Busch
  2025-04-27  7:10     ` Leon Romanovsky
  2 siblings, 1 reply; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23 10:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Marek Szyprowski, Jens Axboe, Jake Edge,
	Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun, Robin Murphy,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Jérôme Glisse, Andrew Morton,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Nitesh Shetty

On Wed, Apr 23, 2025 at 11:24:37AM +0200, Christoph Hellwig wrote:
> I don't think the meta SGL handling is quite right yet, and the
> single segment data handling also regressed.  Totally untested
> patch below, I'll try to allocate some testing time later today.

Christoph,

Can we please progress with the DMA patches and leave NVMe for later?
NVMe is one of the users of the new DMA API; let's merge the API first.

Thanks

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 22/24] nvme-pci: use a better encoding for small prp pool allocations
  2025-04-23  9:05   ` Christoph Hellwig
@ 2025-04-23 13:39     ` Leon Romanovsky
  0 siblings, 0 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23 13:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Marek Szyprowski, Jens Axboe, Keith Busch, Jake Edge,
	Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun, Robin Murphy,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Jérôme Glisse, Andrew Morton,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni

On Wed, Apr 23, 2025 at 11:05:52AM +0200, Christoph Hellwig wrote:
> On Wed, Apr 23, 2025 at 11:13:13AM +0300, Leon Romanovsky wrote:
> > From: Christoph Hellwig <hch@lst.de>
> > 
> > There is plenty of unused space in the iod next to nr_descriptors.
> > Add a separate flag to encode that the transfer is using the full
> > page sized pool, and use a normal 0..n count for the number of
> > descriptors.
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > Tested-by: Jens Axboe <axboe@kernel.dk>
> > [ Leon: changed original bool variable to be flag as was proposed by Kanchan ]
> > Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  drivers/nvme/host/pci.c | 93 ++++++++++++++++++++---------------------
> >  1 file changed, 46 insertions(+), 47 deletions(-)
> > 
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 638e759b29ad..7e93536d01cb 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -44,6 +44,7 @@
> >  #define NVME_MAX_SEGS	128
> >  #define NVME_MAX_META_SEGS 15
> >  #define NVME_MAX_NR_DESCRIPTORS	5
> > +#define NVME_SMALL_DESCRIPTOR_SIZE 256
> >  
> >  static int use_threaded_interrupts;
> >  module_param(use_threaded_interrupts, int, 0444);
> > @@ -219,6 +220,10 @@ struct nvme_queue {
> >  	struct completion delete_done;
> >  };
> >  
> > +enum {
> > +	IOD_LARGE_DESCRIPTORS = 1, /* uses the full page sized descriptor pool */
> 
> This is used as a ORable flag, I'd make that explicit:
> 
> 	/* uses the full page sized descriptor pool */
> 	IOD_LARGE_DESCRIPTORS		= 1U << 0,
> 
> and similar for the next flag added in the next patch.
> 
> >  	struct nvme_request req;
> >  	struct nvme_command cmd;
> >  	bool aborted;
> > -	/* # of PRP/SGL descriptors: (0 for small pool) */
> > -	s8 nr_descriptors;
> > +	u8 nr_descriptors;	/* # of PRP/SGL descriptors */
> > +	unsigned int flags;
> 
> And this should be limited to a u16 to not bloat the structure.

I'll limit it to u8.

Thanks

> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 23/24] nvme-pci: convert to blk_rq_dma_map
  2025-04-23  8:13 ` [PATCH v9 23/24] nvme-pci: convert to blk_rq_dma_map Leon Romanovsky
  2025-04-23  9:24   ` Christoph Hellwig
@ 2025-04-23 14:58   ` Keith Busch
  2025-04-23 17:11     ` Leon Romanovsky
  1 sibling, 1 reply; 73+ messages in thread
From: Keith Busch @ 2025-04-23 14:58 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Jake Edge,
	Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun, Robin Murphy,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Jérôme Glisse, Andrew Morton,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Nitesh Shetty, Leon Romanovsky

On Wed, Apr 23, 2025 at 11:13:14AM +0300, Leon Romanovsky wrote:
> +static bool nvme_try_setup_sgl_simple(struct nvme_dev *dev, struct request *req,
> +				      struct nvme_rw_command *cmnd,
> +				      struct blk_dma_iter *iter)
> +{
> +	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> +	struct bio_vec bv = req_bvec(req);
> +
> +	if (IS_ENABLED(CONFIG_PCI_P2PDMA) && (req->cmd_flags & REQ_P2PDMA))
> +		return false;
> +
> +	if ((bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1)) + bv.bv_len >
> +			NVME_CTRL_PAGE_SIZE * 2)
> +		return false;

We don't need this check for SGLs. If we have a single segment, we can
put it in a single SG element no matter how large it is.
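
To illustrate the point, the single-segment SGL setup with that check
dropped can be as small as the sketch below (the helper name and the
dma_addr/len parameters are illustrative, not taken from the patch):

	static void nvme_pci_sgl_set_single(struct nvme_rw_command *cmnd,
					    dma_addr_t dma_addr, unsigned int len)
	{
		/* one SGL data descriptor covers the whole segment, whatever its size */
		cmnd->flags = NVME_CMD_SGL_METABUF;
		cmnd->dptr.sgl.addr = cpu_to_le64(dma_addr);
		cmnd->dptr.sgl.length = cpu_to_le32(len);
		cmnd->dptr.sgl.type = NVME_SGL_FMT_DATA_DESC << 4;
	}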

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 23/24] nvme-pci: convert to blk_rq_dma_map
  2025-04-23  9:24   ` Christoph Hellwig
  2025-04-23 10:03     ` Leon Romanovsky
@ 2025-04-23 15:05     ` Keith Busch
  2025-04-27  7:10     ` Leon Romanovsky
  2 siblings, 0 replies; 73+ messages in thread
From: Keith Busch @ 2025-04-23 15:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Marek Szyprowski, Jens Axboe, Jake Edge,
	Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun, Robin Murphy,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Jérôme Glisse, Andrew Morton,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Nitesh Shetty, Leon Romanovsky

On Wed, Apr 23, 2025 at 11:24:37AM +0200, Christoph Hellwig wrote:
> Right now I don't have a test setup for metasgl, though.  Keith,
> do you have a good qemu config for that?  Or anyone else?

QEMU does support it, and reports support for it by default (you may
need to upgrade qemu if yours is more than a year old). You just need to
format your namespace with metadata, then you can send commands with
either SGL or MPTR.

QEMU supports 0, 8, 16, and 64 metadata bytes on either 512b or 4k block
sizes.

If you want 8 bytes of metadata at startup, attach the parameter "ms=8" to
the '-device nvme-ns' qemu setup.

Alternatively, you can use the 'nvme format' command after booting to
change it whenever you want.
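
For example, a minimal qemu fragment and the post-boot alternative could
look like this (image path, device ids and the LBA format index are
placeholders for your setup):

	-drive file=/path/to/nvm.img,if=none,id=nvm,format=raw \
	-device nvme,id=nvme0,serial=deadbeef \
	-device nvme-ns,drive=nvm,bus=nvme0,ms=8

	# or, after boot, pick an LBA format that carries 8 bytes of metadata:
	nvme id-ns /dev/nvme0n1			# list the supported lbaf entries
	nvme format /dev/nvme0n1 --lbaf=<index with ms:8> --force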

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 23/24] nvme-pci: convert to blk_rq_dma_map
  2025-04-23 10:03     ` Leon Romanovsky
@ 2025-04-23 15:47       ` Christoph Hellwig
  2025-04-23 17:00         ` Jason Gunthorpe
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2025-04-23 15:47 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Christoph Hellwig, Keith Busch, Marek Szyprowski, Jens Axboe,
	Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Nitesh Shetty

On Wed, Apr 23, 2025 at 01:03:14PM +0300, Leon Romanovsky wrote:
> On Wed, Apr 23, 2025 at 11:24:37AM +0200, Christoph Hellwig wrote:
> > I don't think the meta SGL handling is quite right yet, and the
> > single segment data handling also regressed.  Totally untested
> > patch below, I'll try to allocate some testing time later today.
> 
> Christoph,
> 
> Can we please progress with the DMA patches and leave NVMe for later?
> NVMe is one of the users of the new DMA API; let's merge the API first.

We'll need to merge the block/nvme patches through the block tree
anyway to avoid merges from hell, so yes.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 23/24] nvme-pci: convert to blk_rq_dma_map
  2025-04-23 15:47       ` Christoph Hellwig
@ 2025-04-23 17:00         ` Jason Gunthorpe
  0 siblings, 0 replies; 73+ messages in thread
From: Jason Gunthorpe @ 2025-04-23 17:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Keith Busch, Marek Szyprowski, Jens Axboe,
	Jake Edge, Jonathan Corbet, Zhu Yanjun, Robin Murphy,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Jérôme Glisse, Andrew Morton,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Nitesh Shetty

On Wed, Apr 23, 2025 at 05:47:12PM +0200, Christoph Hellwig wrote:
> On Wed, Apr 23, 2025 at 01:03:14PM +0300, Leon Romanovsky wrote:
> > On Wed, Apr 23, 2025 at 11:24:37AM +0200, Christoph Hellwig wrote:
> > > I don't think the meta SGL handling is quite right yet, and the
> > > single segment data handling also regressed.  Totally untested
> > > patch below, I'll try to allocate some testing time later today.
> > 
> > Christoph,
> > 
> > Can we please progress with the DMA patches and leave NVMe for later?
> > NVMe is one of the users of the new DMA API; let's merge the API first.
> 
> We'll need to merge the block/nvme patches through the block tree
> anyway to avoid merges from hell, so yes.

RDMA has been having conflicts on the ODP patches too, so yeah we need
a shared branch and to pull this into each tree. I'd rely on Marek to
make the shared branch and I'll take the RDMA parts on top.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 23/24] nvme-pci: convert to blk_rq_dma_map
  2025-04-23 14:58   ` Keith Busch
@ 2025-04-23 17:11     ` Leon Romanovsky
  0 siblings, 0 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-23 17:11 UTC (permalink / raw)
  To: Keith Busch
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Jake Edge,
	Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun, Robin Murphy,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Jérôme Glisse, Andrew Morton,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Nitesh Shetty

On Wed, Apr 23, 2025 at 08:58:51AM -0600, Keith Busch wrote:
> On Wed, Apr 23, 2025 at 11:13:14AM +0300, Leon Romanovsky wrote:
> > +static bool nvme_try_setup_sgl_simple(struct nvme_dev *dev, struct request *req,
> > +				      struct nvme_rw_command *cmnd,
> > +				      struct blk_dma_iter *iter)
> > +{
> > +	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> > +	struct bio_vec bv = req_bvec(req);
> > +
> > +	if (IS_ENABLED(CONFIG_PCI_P2PDMA) && (req->cmd_flags & REQ_P2PDMA))
> > +		return false;
> > +
> > +	if ((bv.bv_offset & (NVME_CTRL_PAGE_SIZE - 1)) + bv.bv_len >
> > +			NVME_CTRL_PAGE_SIZE * 2)
> > +		return false;
> 
> We don't need this check for SGLs. If we have a single segment, we can
> put it in a single SG element no matter how large it is.

Absolutely, removed it and updated my dma-split-wip branch.

Thanks

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 03/24] iommu: generalize the batched sync after map interface
  2025-04-23  8:12 ` [PATCH v9 03/24] iommu: generalize the batched sync after map interface Leon Romanovsky
@ 2025-04-23 17:15   ` Jason Gunthorpe
  2025-04-24  6:55     ` Leon Romanovsky
  2025-04-26  0:52   ` Luis Chamberlain
  1 sibling, 1 reply; 73+ messages in thread
From: Jason Gunthorpe @ 2025-04-23 17:15 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Jake Edge, Jonathan Corbet, Zhu Yanjun, Robin Murphy,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Jérôme Glisse, Andrew Morton,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Leon Romanovsky

On Wed, Apr 23, 2025 at 11:12:54AM +0300, Leon Romanovsky wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> For the upcoming IOVA-based DMA API we want to use the interface batch the
> sync after mapping multiple entries from dma-iommu without having a
> scatterlist.

Grammar:

 For the upcoming IOVA-based DMA API we want to batch the
 ops->iotlb_sync_map() call after mapping multiple IOVAs from
 dma-iommu without having a scatterlist. Improve the API.

 Add a wrapper for the map_sync as iommu_sync_map() so that callers don't
 need to poke into the methods directly.

 Formalize __iommu_map() into iommu_map_nosync() which requires the
 caller to call iommu_sync_map() after all maps are completed.

 Refactor the existing sanity checks from all the different layers
 into iommu_map_nosync().
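
As a usage sketch of the split described above (illustrative only; the
phys[]/len[] arrays, prot and gfp values are placeholders and error
unwinding is trimmed):

	size_t mapped = 0;
	int ret = 0, i;

	for (i = 0; i < nranges; i++) {
		ret = iommu_map_nosync(domain, iova + mapped, phys[i], len[i],
				       IOMMU_READ | IOMMU_WRITE, GFP_KERNEL);
		if (ret)
			break;	/* caller unmaps whatever was already mapped */
		mapped += len[i];
	}
	if (!ret)
		ret = iommu_sync_map(domain, iova, mapped);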

>  drivers/iommu/iommu.c | 65 +++++++++++++++++++------------------------
>  include/linux/iommu.h |  4 +++
>  2 files changed, 33 insertions(+), 36 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

> +	/* Discourage passing strange GFP flags */
> +	if (WARN_ON_ONCE(gfp & (__GFP_COMP | __GFP_DMA | __GFP_DMA32 |
> +				__GFP_HIGHMEM)))
> +		return -EINVAL;

There is some kind of overlap with the new iommu_alloc_pages_node()
here that does a similar check, nothing that can be addressed in this
series but maybe a TBD for later..

Jason

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 04/24] iommu: add kernel-doc for iommu_unmap_fast
  2025-04-23  8:12 ` [PATCH v9 04/24] iommu: add kernel-doc for iommu_unmap_fast Leon Romanovsky
@ 2025-04-23 17:15   ` Jason Gunthorpe
  2025-04-26  0:55   ` Luis Chamberlain
  1 sibling, 0 replies; 73+ messages in thread
From: Jason Gunthorpe @ 2025-04-23 17:15 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Leon Romanovsky, Jake Edge, Jonathan Corbet, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni

On Wed, Apr 23, 2025 at 11:12:55AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Add kernel-doc section for iommu_unmap_fast to document existing
> limitation of underlying functions which can't split individual ranges.
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Acked-by: Will Deacon <will@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Tested-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  drivers/iommu/iommu.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 10/24] mm/hmm: let users to tag specific PFN with DMA mapped bit
  2025-04-23  8:13 ` [PATCH v9 10/24] mm/hmm: let users to tag specific PFN with DMA mapped bit Leon Romanovsky
@ 2025-04-23 17:17   ` Jason Gunthorpe
  2025-04-23 17:54   ` Mika Penttilä
  1 sibling, 0 replies; 73+ messages in thread
From: Jason Gunthorpe @ 2025-04-23 17:17 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Leon Romanovsky, Jake Edge, Jonathan Corbet, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni

On Wed, Apr 23, 2025 at 11:13:01AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Introduce a new sticky flag (HMM_PFN_DMA_MAPPED), which isn't overwritten
> by HMM range fault. Such a flag allows users to tag specific PFNs with
> the information that this specific PFN was already DMA mapped.
> 
> Tested-by: Jens Axboe <axboe@kernel.dk>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  include/linux/hmm.h | 17 +++++++++++++++
>  mm/hmm.c            | 51 ++++++++++++++++++++++++++++-----------------
>  2 files changed, 49 insertions(+), 19 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

This would be part of the RDMA bits

Jason

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 11/24] mm/hmm: provide generic DMA managing logic
  2025-04-23  8:13 ` [PATCH v9 11/24] mm/hmm: provide generic DMA managing logic Leon Romanovsky
@ 2025-04-23 17:28   ` Jason Gunthorpe
  2025-04-24  7:15     ` Leon Romanovsky
  0 siblings, 1 reply; 73+ messages in thread
From: Jason Gunthorpe @ 2025-04-23 17:28 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Leon Romanovsky, Jake Edge, Jonathan Corbet, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni

On Wed, Apr 23, 2025 at 11:13:02AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> HMM callers use a PFN list to populate a range when calling
> hmm_range_fault(); the conversion from PFN to DMA address is done by
> the callers with the help of another DMA list. However, this is
> wasteful on any modern platform, and with the right logic that DMA
> list can be avoided.
> 
> Provide generic logic to manage these lists and give an interface to
> map/unmap PFNs to DMA addresses, without requiring the callers to be
> experts in the DMA core API.
> 
> Tested-by: Jens Axboe <axboe@kernel.dk>

I don't think Jens tested the RDMA and hmm parts :)

> +	/*
> +	 * The HMM API violates our normal DMA buffer ownership rules and can't
> +	 * transfer buffer ownership.  The dma_addressing_limited() check is a
> +	 * best approximation to ensure no swiotlb buffering happens.
> +	 */

This is a bit unclear: HMM inherently can't do cache flushing or
swiotlb bounce buffering because its entire purpose is to DMA directly
and coherently to a mm_struct's page tables. There are no sensible
points at which we could put the required flushing that wouldn't break
the entire model.

FWIW I view the fact that we now fail back to userspace in these
cases, instead of quietly malfunctioning, as a big improvement.
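
In code terms, the guard being discussed amounts to something like this
(a sketch, not the exact hunk from the patch):

	/* HMM cannot bounce through swiotlb, so refuse devices that would need it */
	if (dma_addressing_limited(dev))
		return -EOPNOTSUPP;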

> +bool hmm_dma_unmap_pfn(struct device *dev, struct hmm_dma_map *map, size_t idx)
> +{
> +	struct dma_iova_state *state = &map->state;
> +	dma_addr_t *dma_addrs = map->dma_list;
> +	unsigned long *pfns = map->pfn_list;
> +	unsigned long attrs = 0;
> +
> +#define HMM_PFN_VALID_DMA (HMM_PFN_VALID | HMM_PFN_DMA_MAPPED)
> +	if ((pfns[idx] & HMM_PFN_VALID_DMA) != HMM_PFN_VALID_DMA)
> +		return false;
> +#undef HMM_PFN_VALID_DMA

If a v10 comes I'd put this in a const function level variable:

          const unsigned int HMM_PFN_VALID_DMA = HMM_PFN_VALID | HMM_PFN_DMA_MAPPED;

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 12/24] RDMA/umem: Store ODP access mask information in PFN
  2025-04-23  8:13 ` [PATCH v9 12/24] RDMA/umem: Store ODP access mask information in PFN Leon Romanovsky
@ 2025-04-23 17:34   ` Jason Gunthorpe
  0 siblings, 0 replies; 73+ messages in thread
From: Jason Gunthorpe @ 2025-04-23 17:34 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Leon Romanovsky, Jake Edge, Jonathan Corbet, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni

On Wed, Apr 23, 2025 at 11:13:03AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> As a preparation to remove dma_list, store access mask in PFN pointer
> and not in dma_addr_t.
> 
> Tested-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  drivers/infiniband/core/umem_odp.c   | 103 +++++++++++----------------
>  drivers/infiniband/hw/mlx5/mlx5_ib.h |   1 +
>  drivers/infiniband/hw/mlx5/odp.c     |  37 +++++-----
>  drivers/infiniband/sw/rxe/rxe_odp.c  |  14 ++--
>  include/rdma/ib_umem_odp.h           |  14 +---
>  5 files changed, 70 insertions(+), 99 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 13/24] RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage
  2025-04-23  8:13 ` [PATCH v9 13/24] RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage Leon Romanovsky
@ 2025-04-23 17:36   ` Jason Gunthorpe
  0 siblings, 0 replies; 73+ messages in thread
From: Jason Gunthorpe @ 2025-04-23 17:36 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Leon Romanovsky, Jake Edge, Jonathan Corbet, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni

On Wed, Apr 23, 2025 at 11:13:04AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Reuse newly added DMA API to cache IOVA and only link/unlink pages
> in fast path for UMEM ODP flow.
> 
> Tested-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  drivers/infiniband/core/umem_odp.c   | 104 ++++++---------------------
>  drivers/infiniband/hw/mlx5/mlx5_ib.h |  11 +--
>  drivers/infiniband/hw/mlx5/odp.c     |  40 +++++++----
>  drivers/infiniband/hw/mlx5/umr.c     |  12 +++-
>  drivers/infiniband/sw/rxe/rxe_odp.c  |   4 +-
>  include/rdma/ib_umem_odp.h           |  13 +---
>  6 files changed, 71 insertions(+), 113 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 14/24] RDMA/umem: Separate implicit ODP initialization from explicit ODP
  2025-04-23  8:13 ` [PATCH v9 14/24] RDMA/umem: Separate implicit ODP initialization from explicit ODP Leon Romanovsky
@ 2025-04-23 17:38   ` Jason Gunthorpe
  0 siblings, 0 replies; 73+ messages in thread
From: Jason Gunthorpe @ 2025-04-23 17:38 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Leon Romanovsky, Jake Edge, Jonathan Corbet, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni

On Wed, Apr 23, 2025 at 11:13:05AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Create separate functions for the implicit ODP initialization
> which is different from the explicit ODP initialization.
> 
> Tested-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  drivers/infiniband/core/umem_odp.c | 91 +++++++++++++++---------------
>  1 file changed, 46 insertions(+), 45 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 15/24] vfio/mlx5: Explicitly use number of pages instead of allocated length
  2025-04-23  8:13 ` [PATCH v9 15/24] vfio/mlx5: Explicitly use number of pages instead of allocated length Leon Romanovsky
@ 2025-04-23 17:39   ` Jason Gunthorpe
  0 siblings, 0 replies; 73+ messages in thread
From: Jason Gunthorpe @ 2025-04-23 17:39 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Leon Romanovsky, Jake Edge, Jonathan Corbet, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni

On Wed, Apr 23, 2025 at 11:13:06AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> allocated_length is a multiple of the page size, i.e. a number of pages,
> so let's change the functions to accept the number of pages. This opens
> an avenue to combine the receive and send paths and improves code
> readability.
> 
> Tested-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  drivers/vfio/pci/mlx5/cmd.c  | 32 ++++++++++-----------
>  drivers/vfio/pci/mlx5/cmd.h  | 10 +++----
>  drivers/vfio/pci/mlx5/main.c | 56 +++++++++++++++++++++++-------------
>  3 files changed, 57 insertions(+), 41 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 10/24] mm/hmm: let users to tag specific PFN with DMA mapped bit
  2025-04-23  8:13 ` [PATCH v9 10/24] mm/hmm: let users to tag specific PFN with DMA mapped bit Leon Romanovsky
  2025-04-23 17:17   ` Jason Gunthorpe
@ 2025-04-23 17:54   ` Mika Penttilä
  2025-04-23 18:17     ` Jason Gunthorpe
  1 sibling, 1 reply; 73+ messages in thread
From: Mika Penttilä @ 2025-04-23 17:54 UTC (permalink / raw)
  To: Leon Romanovsky, Marek Szyprowski, Jens Axboe, Christoph Hellwig,
	Keith Busch
  Cc: Leon Romanovsky, Jake Edge, Jonathan Corbet, Jason Gunthorpe,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Luis Chamberlain,
	Matthew Wilcox, Dan Williams, Kanchan Joshi, Chaitanya Kulkarni

Hi,

On 4/23/25 11:13, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
>
> Introduce a new sticky flag (HMM_PFN_DMA_MAPPED), which isn't overwritten
> by HMM range fault. Such a flag allows users to tag specific PFNs with
> the information that this specific PFN was already DMA mapped.
>
> Tested-by: Jens Axboe <axboe@kernel.dk>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  include/linux/hmm.h | 17 +++++++++++++++
>  mm/hmm.c            | 51 ++++++++++++++++++++++++++++-----------------
>  2 files changed, 49 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 126a36571667..a1ddbedc19c0 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -23,6 +23,8 @@ struct mmu_interval_notifier;
>   * HMM_PFN_WRITE - if the page memory can be written to (requires HMM_PFN_VALID)
>   * HMM_PFN_ERROR - accessing the pfn is impossible and the device should
>   *                 fail. ie poisoned memory, special pages, no vma, etc
> + * HMM_PFN_DMA_MAPPED - Flag preserved on input-to-output transformation
> + *                      to mark that page is already DMA mapped
>   *
>   * On input:
>   * 0                 - Return the current state of the page, do not fault it.
> @@ -36,6 +38,13 @@ enum hmm_pfn_flags {
>  	HMM_PFN_VALID = 1UL << (BITS_PER_LONG - 1),
>  	HMM_PFN_WRITE = 1UL << (BITS_PER_LONG - 2),
>  	HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
> +
> +	/*
> +	 * Sticky flags, carried from input to output,
> +	 * don't forget to update HMM_PFN_INOUT_FLAGS
> +	 */
> +	HMM_PFN_DMA_MAPPED = 1UL << (BITS_PER_LONG - 7),
> +

How is this playing together with the mapped order usage?


> HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 8),
>  
>  	/* Input flags */
> @@ -57,6 +66,14 @@ static inline struct page *hmm_pfn_to_page(unsigned long hmm_pfn)
>  	return pfn_to_page(hmm_pfn & ~HMM_PFN_FLAGS);
>  }
>  
> +/*
> + * hmm_pfn_to_phys() - return physical address pointed to by a device entry
> + */
> +static inline phys_addr_t hmm_pfn_to_phys(unsigned long hmm_pfn)
> +{
> +	return __pfn_to_phys(hmm_pfn & ~HMM_PFN_FLAGS);
> +}
> +
>  /*
>   * hmm_pfn_to_map_order() - return the CPU mapping size order
>   *
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 082f7b7c0b9e..51fe8b011cc7 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -39,13 +39,20 @@ enum {
>  	HMM_NEED_ALL_BITS = HMM_NEED_FAULT | HMM_NEED_WRITE_FAULT,
>  };
>  
> +enum {
> +	/* These flags are carried from input-to-output */
> +	HMM_PFN_INOUT_FLAGS = HMM_PFN_DMA_MAPPED,
> +};
> +
>  static int hmm_pfns_fill(unsigned long addr, unsigned long end,
>  			 struct hmm_range *range, unsigned long cpu_flags)
>  {
>  	unsigned long i = (addr - range->start) >> PAGE_SHIFT;
>  
> -	for (; addr < end; addr += PAGE_SIZE, i++)
> -		range->hmm_pfns[i] = cpu_flags;
> +	for (; addr < end; addr += PAGE_SIZE, i++) {
> +		range->hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> +		range->hmm_pfns[i] |= cpu_flags;
> +	}
>  	return 0;
>  }
>  
> @@ -202,8 +209,10 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr,
>  		return hmm_vma_fault(addr, end, required_fault, walk);
>  
>  	pfn = pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> -	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
> -		hmm_pfns[i] = pfn | cpu_flags;
> +	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> +		hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> +		hmm_pfns[i] |= pfn | cpu_flags;
> +	}
>  	return 0;
>  }
>  #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> @@ -230,14 +239,14 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>  	unsigned long cpu_flags;
>  	pte_t pte = ptep_get(ptep);
>  	uint64_t pfn_req_flags = *hmm_pfn;
> +	uint64_t new_pfn_flags = 0;
>  
>  	if (pte_none_mostly(pte)) {
>  		required_fault =
>  			hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
>  		if (required_fault)
>  			goto fault;
> -		*hmm_pfn = 0;
> -		return 0;
> +		goto out;
>  	}
>  
>  	if (!pte_present(pte)) {
> @@ -253,16 +262,14 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>  			cpu_flags = HMM_PFN_VALID;
>  			if (is_writable_device_private_entry(entry))
>  				cpu_flags |= HMM_PFN_WRITE;
> -			*hmm_pfn = swp_offset_pfn(entry) | cpu_flags;
> -			return 0;
> +			new_pfn_flags = swp_offset_pfn(entry) | cpu_flags;
> +			goto out;
>  		}
>  
>  		required_fault =
>  			hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
> -		if (!required_fault) {
> -			*hmm_pfn = 0;
> -			return 0;
> -		}
> +		if (!required_fault)
> +			goto out;
>  
>  		if (!non_swap_entry(entry))
>  			goto fault;
> @@ -304,11 +311,13 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>  			pte_unmap(ptep);
>  			return -EFAULT;
>  		}
> -		*hmm_pfn = HMM_PFN_ERROR;
> -		return 0;
> +		new_pfn_flags = HMM_PFN_ERROR;
> +		goto out;
>  	}
>  
> -	*hmm_pfn = pte_pfn(pte) | cpu_flags;
> +	new_pfn_flags = pte_pfn(pte) | cpu_flags;
> +out:
> +	*hmm_pfn = (*hmm_pfn & HMM_PFN_INOUT_FLAGS) | new_pfn_flags;
>  	return 0;
>  
>  fault:
> @@ -448,8 +457,10 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
>  		}
>  
>  		pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> -		for (i = 0; i < npages; ++i, ++pfn)
> -			hmm_pfns[i] = pfn | cpu_flags;
> +		for (i = 0; i < npages; ++i, ++pfn) {
> +			hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> +			hmm_pfns[i] |= pfn | cpu_flags;
> +		}
>  		goto out_unlock;
>  	}
>  
> @@ -507,8 +518,10 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
>  	}
>  
>  	pfn = pte_pfn(entry) + ((start & ~hmask) >> PAGE_SHIFT);
> -	for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
> -		range->hmm_pfns[i] = pfn | cpu_flags;
> +	for (; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> +		range->hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> +		range->hmm_pfns[i] |= pfn | cpu_flags;
> +	}
>  
>  	spin_unlock(ptl);
>  	return 0;


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 16/24] vfio/mlx5: Rewrite create mkey flow to allow better code reuse
  2025-04-23  8:13 ` [PATCH v9 16/24] vfio/mlx5: Rewrite create mkey flow to allow better code reuse Leon Romanovsky
@ 2025-04-23 18:02   ` Jason Gunthorpe
  0 siblings, 0 replies; 73+ messages in thread
From: Jason Gunthorpe @ 2025-04-23 18:02 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Leon Romanovsky, Jake Edge, Jonathan Corbet, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni

On Wed, Apr 23, 2025 at 11:13:07AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Change the creation of the mkey to be performed in multiple steps:
> data allocation, DMA setup and the actual call to HW to create that mkey.
> 
> In this new flow, the whole input to the MKEY command is saved, eliminating
> the need to keep an array of pointers to DMA addresses for the receive list
> and, in future patches, for the send list too.
> 
> In addition to reducing memory size and eliminating unnecessary data
> movement when setting the MKEY input, the code is prepared for future reuse.
> 
> Tested-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  drivers/vfio/pci/mlx5/cmd.c | 157 ++++++++++++++++++++----------------
>  drivers/vfio/pci/mlx5/cmd.h |   4 +-
>  2 files changed, 91 insertions(+), 70 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 17/24] vfio/mlx5: Enable the DMA link API
  2025-04-23  8:13 ` [PATCH v9 17/24] vfio/mlx5: Enable the DMA link API Leon Romanovsky
@ 2025-04-23 18:09   ` Jason Gunthorpe
  2025-04-24  7:55     ` Leon Romanovsky
  0 siblings, 1 reply; 73+ messages in thread
From: Jason Gunthorpe @ 2025-04-23 18:09 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Leon Romanovsky, Jake Edge, Jonathan Corbet, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni

On Wed, Apr 23, 2025 at 11:13:08AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Remove intermediate scatter-gather table completely and
> enable new DMA link API.
> 
> Tested-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  drivers/vfio/pci/mlx5/cmd.c  | 298 ++++++++++++++++-------------------
>  drivers/vfio/pci/mlx5/cmd.h  |  21 ++-
>  drivers/vfio/pci/mlx5/main.c |  31 ----
>  3 files changed, 147 insertions(+), 203 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

> +static int register_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
> +			      struct page **page_list, u32 *mkey_in,
> +			      struct dma_iova_state *state,
> +			      enum dma_data_direction dir)
> +{
> +	dma_addr_t addr;
> +	size_t mapped = 0;
> +	__be64 *mtt;
> +	int i, err;
>  
> -	return mlx5_core_create_mkey(mdev, mkey, mkey_in, inlen);
> +	WARN_ON_ONCE(dir == DMA_NONE);
> +
> +	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
> +
> +	if (dma_iova_try_alloc(mdev->device, state, 0, npages * PAGE_SIZE)) {
> +		addr = state->addr;
> +		for (i = 0; i < npages; i++) {
> +			err = dma_iova_link(mdev->device, state,
> +					    page_to_phys(page_list[i]), mapped,
> +					    PAGE_SIZE, dir, 0);
> +			if (err)
> +				goto error;
> +			*mtt++ = cpu_to_be64(addr);
> +			addr += PAGE_SIZE;
> +			mapped += PAGE_SIZE;
> +		}

This is an area I'd like to see improvement on as a follow-up.

Given we know we are allocating a contiguous IOVA, we should be able to
request a certain alignment so we know it can be put into the mkey as a
single MTT. That would eliminate the double translation cost in the HW.

The RDMA mkey builder is able to do this from the scatterlist, but the
logic to do that was too complex to copy into vfio. This is close to
being simple enough; the alignment is the only remaining problem.

Jason

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 10/24] mm/hmm: let users to tag specific PFN with DMA mapped bit
  2025-04-23 17:54   ` Mika Penttilä
@ 2025-04-23 18:17     ` Jason Gunthorpe
  2025-04-23 18:37       ` Mika Penttilä
  0 siblings, 1 reply; 73+ messages in thread
From: Jason Gunthorpe @ 2025-04-23 18:17 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Leon Romanovsky, Marek Szyprowski, Jens Axboe, Christoph Hellwig,
	Keith Busch, Leon Romanovsky, Jake Edge, Jonathan Corbet,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Luis Chamberlain,
	Matthew Wilcox, Dan Williams, Kanchan Joshi, Chaitanya Kulkarni

On Wed, Apr 23, 2025 at 08:54:05PM +0300, Mika Penttilä wrote:
> > @@ -36,6 +38,13 @@ enum hmm_pfn_flags {
> >  	HMM_PFN_VALID = 1UL << (BITS_PER_LONG - 1),
> >  	HMM_PFN_WRITE = 1UL << (BITS_PER_LONG - 2),
> >  	HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
> > +
> > +	/*
> > +	 * Sticky flags, carried from input to output,
> > +	 * don't forget to update HMM_PFN_INOUT_FLAGS
> > +	 */
> > +	HMM_PFN_DMA_MAPPED = 1UL << (BITS_PER_LONG - 7),
> > +
> 
> How is this playing together with the mapped order usage?

Order shift starts at bit 8, DMA_MAPPED is at bit 7

The pfn array is linear and simply indexed. The order is intended for
page-table-like HW to be able to build larger entries from the hmm
data without having to scan for contiguity.

Even if order is present, the entry is still replicated across all the
pfns that fall inside the order.

At least this series should replicate the DMA_MAPPED flag as well,
since it doesn't pay attention to order.

I suspect a page table implementation may need to make some small
changes. Indeed, with guaranteed contiguous IOVA there may be a
significant optimization available to have the HW page table cover all
the contiguous present pages in the iommu, which would be a higher
order than the pages themselves. However, this would require being able
to punch non-present holes into contiguous mappings...

Jason
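
For reference, a consumer typically reads the hint roughly like this
(hmm_pfn_to_map_order() and the hmm_pfn_to_phys() helper added in this
series are real; build_large_entry() is a stand-in for the device page
table update):

	unsigned long hmm_pfn = range->hmm_pfns[i];
	unsigned int order = hmm_pfn_to_map_order(hmm_pfn);

	/* the same entry is replicated across all 1 << order slots */
	build_large_entry(hmm_pfn_to_phys(hmm_pfn), PAGE_SIZE << order);
	i += 1UL << order;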

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 10/24] mm/hmm: let users to tag specific PFN with DMA mapped bit
  2025-04-23 18:17     ` Jason Gunthorpe
@ 2025-04-23 18:37       ` Mika Penttilä
  2025-04-23 23:33         ` Jason Gunthorpe
  0 siblings, 1 reply; 73+ messages in thread
From: Mika Penttilä @ 2025-04-23 18:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Marek Szyprowski, Jens Axboe, Christoph Hellwig,
	Keith Busch, Leon Romanovsky, Jake Edge, Jonathan Corbet,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Luis Chamberlain,
	Matthew Wilcox, Dan Williams, Kanchan Joshi, Chaitanya Kulkarni


On 4/23/25 21:17, Jason Gunthorpe wrote:
> On Wed, Apr 23, 2025 at 08:54:05PM +0300, Mika Penttilä wrote:
>>> @@ -36,6 +38,13 @@ enum hmm_pfn_flags {
>>>  	HMM_PFN_VALID = 1UL << (BITS_PER_LONG - 1),
>>>  	HMM_PFN_WRITE = 1UL << (BITS_PER_LONG - 2),
>>>  	HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
>>> +
>>> +	/*
>>> +	 * Sticky flags, carried from input to output,
>>> +	 * don't forget to update HMM_PFN_INOUT_FLAGS
>>> +	 */
>>> +	HMM_PFN_DMA_MAPPED = 1UL << (BITS_PER_LONG - 7),
>>> +
>> How is this playing together with the mapped order usage?
> Order shift starts at bit 8, DMA_MAPPED is at bit 7

hmm bits are the high bits, and order is 5 bits starting from
(BITS_PER_LONG - 8)


> The pfn array is linear and simply indexed. The order is intended for
> page table like HW to be able to build larger entries from the hmm
> data without having to scan for contiguity.
>
> Even if order is present the entry is still replicated across all the
> pfns that are inside the order.
>
> At least this series should replicate the dma_mapped flag as well as
> it doesn't pay attention to order.
>
> I suspect a page table implementation may need to make some small
> changes. Indeed with guarenteed contiguous IOVA there may be a
> significant optimization available to have the HW page table cover all
> the contiguous present pages in the iommu, which would be a higher
> order than the pages themselves. However this would require being able
> to punch non-present holes into contiguous mappings...
>
> Jason
>
--Mika



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 10/24] mm/hmm: let users to tag specific PFN with DMA mapped bit
  2025-04-23 18:37       ` Mika Penttilä
@ 2025-04-23 23:33         ` Jason Gunthorpe
  2025-04-24  8:07           ` Leon Romanovsky
  0 siblings, 1 reply; 73+ messages in thread
From: Jason Gunthorpe @ 2025-04-23 23:33 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Leon Romanovsky, Marek Szyprowski, Jens Axboe, Christoph Hellwig,
	Keith Busch, Leon Romanovsky, Jake Edge, Jonathan Corbet,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Luis Chamberlain,
	Matthew Wilcox, Dan Williams, Kanchan Joshi, Chaitanya Kulkarni

On Wed, Apr 23, 2025 at 09:37:24PM +0300, Mika Penttilä wrote:
> 
> On 4/23/25 21:17, Jason Gunthorpe wrote:
> > On Wed, Apr 23, 2025 at 08:54:05PM +0300, Mika Penttilä wrote:
> >>> @@ -36,6 +38,13 @@ enum hmm_pfn_flags {
> >>>  	HMM_PFN_VALID = 1UL << (BITS_PER_LONG - 1),
> >>>  	HMM_PFN_WRITE = 1UL << (BITS_PER_LONG - 2),
> >>>  	HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
> >>> +
> >>> +	/*
> >>> +	 * Sticky flags, carried from input to output,
> >>> +	 * don't forget to update HMM_PFN_INOUT_FLAGS
> >>> +	 */
> >>> +	HMM_PFN_DMA_MAPPED = 1UL << (BITS_PER_LONG - 7),
> >>> +
> >> How is this playing together with the mapped order usage?
> > Order shift starts at bit 8, DMA_MAPPED is at bit 7
> 
> hmm bits are the high bits, and order is 5 bits starting from
> (BITS_PER_LONG - 8)

I see, so yes order occupies 5 bits [-4,-5,-6,-7,-8] and the
DMA_MAPPED overlaps, it should be 9 not 7 because of the backwardness.

Probably testing didn't hit this because the usual 2M order of 9 only
sets bits -4 and -8. The way the order works it doesn't clear the
0 bits, which makes me wonder if that is a little bug..

Jason

^ permalink raw reply	[flat|nested] 73+ messages in thread
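
For reference, the bit positions under discussion, worked out here for a
64-bit build; the defines below are illustrative only and are not part of
the patch:

/* Illustration only: v9 bit positions on BITS_PER_LONG == 64 */
#define ORDER_LOW_BIT	(64 - 8)	/* 56: bit 0 of the 5-bit order field */
#define ORDER_HIGH_BIT	(64 - 4)	/* 60: bit 4 of the order field       */
#define DMA_MAPPED_BIT	(64 - 7)	/* 57: lands inside the order field   */

/*
 * A 2M mapping has order 9 = 0b01001, so only bits 0 and 3 of the order
 * field are set and bit 57 above, where HMM_PFN_DMA_MAPPED sits, stays
 * clear. That is why the overlap never showed up in testing.
 */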

* Re: [PATCH v9 03/24] iommu: generalize the batched sync after map interface
  2025-04-23 17:15   ` Jason Gunthorpe
@ 2025-04-24  6:55     ` Leon Romanovsky
  0 siblings, 0 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-24  6:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Jake Edge, Jonathan Corbet, Zhu Yanjun, Robin Murphy,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Jérôme Glisse, Andrew Morton,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni

On Wed, Apr 23, 2025 at 02:15:37PM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 23, 2025 at 11:12:54AM +0300, Leon Romanovsky wrote:
> > From: Christoph Hellwig <hch@lst.de>
> > 
> > For the upcoming IOVA-based DMA API we want to use the interface batch the
> > sync after mapping multiple entries from dma-iommu without having a
> > scatterlist.
> 
> Grammar:
> 
>  For the upcoming IOVA-based DMA API we want to batch the
>  ops->iotlb_sync_map() call after mapping multiple IOVAs from
>  dma-iommu without having a scatterlist. Improve the API.
> 
>  Add a wrapper for the map_sync as iommu_sync_map() so that callers don't
>  need to poke into the methods directly.
> 
>  Formalize __iommu_map() into iommu_map_nosync() which requires the
>  caller to call iommu_sync_map() after all maps are completed.
> 
>  Refactor the existing sanity checks from all the different layers
>  into iommu_map_nosync().
> 
> >  drivers/iommu/iommu.c | 65 +++++++++++++++++++------------------------
> >  include/linux/iommu.h |  4 +++
> >  2 files changed, 33 insertions(+), 36 deletions(-)
> 
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> > +	/* Discourage passing strange GFP flags */
> > +	if (WARN_ON_ONCE(gfp & (__GFP_COMP | __GFP_DMA | __GFP_DMA32 |
> > +				__GFP_HIGHMEM)))
> > +		return -EINVAL;
> 
> There is some kind of overlap with the new iommu_alloc_pages_node()
> here that does a similar check, nothing that can be addressed in this
> series but maybe a TBD for later..

This series is based on pure -rc1 to allow creation of a shared branch,
while you removed iommu_alloc_pages_node() in the IOMMU tree. So we must
merge this first and tidy the code after that.

Thanks

> 
> Jason
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 11/24] mm/hmm: provide generic DMA managing logic
  2025-04-23 17:28   ` Jason Gunthorpe
@ 2025-04-24  7:15     ` Leon Romanovsky
  2025-04-24  7:22       ` Leon Romanovsky
  0 siblings, 1 reply; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-24  7:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Jake Edge, Jonathan Corbet, Zhu Yanjun, Robin Murphy,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Jérôme Glisse, Andrew Morton,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni

On Wed, Apr 23, 2025 at 02:28:56PM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 23, 2025 at 11:13:02AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > HMM callers use a PFN list to populate a range while calling
> > hmm_range_fault(); the conversion from PFN to DMA address is done by
> > the callers with the help of another DMA list. However, this is
> > wasteful on any modern platform and, with the right logic, that DMA
> > list can be avoided.
> > 
> > Provide generic logic to manage these lists and give an interface
> > to map/unmap PFNs to DMA addresses, without requiring the callers
> > to be experts in the DMA core API.
> > 
> > Tested-by: Jens Axboe <axboe@kernel.dk>
> 
> I don't think Jens tested the RDMA and hmm parts :)

I know, but he posted his Tested-by tag on the cover letter and b4 picked
it up for the whole series. I decided to keep it as is.

> 
> > +	/*
> > +	 * The HMM API violates our normal DMA buffer ownership rules and can't
> > +	 * transfer buffer ownership.  The dma_addressing_limited() check is a
> > +	 * best approximation to ensure no swiotlb buffering happens.
> > +	 */
> 
> This is a bit unclear, HMM inherently can't do cache flushing or
> swiotlb bounce buffering because its entire purpose is to DMA directly
> and coherently to a mm_struct's page tables. There are no sensible
> points we could put the required flushing that wouldn't break the
> entire model.
> 
> FWIW I view the fact that we now fail back to userspace in these
> cases instead of quietly malfunctioning as a big improvement.
> 
> > +bool hmm_dma_unmap_pfn(struct device *dev, struct hmm_dma_map *map, size_t idx)
> > +{
> > +	struct dma_iova_state *state = &map->state;
> > +	dma_addr_t *dma_addrs = map->dma_list;
> > +	unsigned long *pfns = map->pfn_list;
> > +	unsigned long attrs = 0;
> > +
> > +#define HMM_PFN_VALID_DMA (HMM_PFN_VALID | HMM_PFN_DMA_MAPPED)
> > +	if ((pfns[idx] & HMM_PFN_VALID_DMA) != HMM_PFN_VALID_DMA)
> > +		return false;
> > +#undef HMM_PFN_VALID_DMA
> 
> If a v10 comes I'd put this in a const function level variable:
> 
>           const unsigned int HMM_PFN_VALID_DMA = HMM_PFN_VALID | HMM_PFN_DMA_MAPPED;
> 
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

I have no idea if a v10 is needed. The DMA API part has been stable for a
long time and only the DMA part should go in the shared branch. Everything
else will need to go through the relevant subsystems anyway.

Thanks

> 
> Jason

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 11/24] mm/hmm: provide generic DMA managing logic
  2025-04-24  7:15     ` Leon Romanovsky
@ 2025-04-24  7:22       ` Leon Romanovsky
  0 siblings, 0 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-24  7:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Jake Edge, Jonathan Corbet, Zhu Yanjun, Robin Murphy,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Jérôme Glisse, Andrew Morton,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni

On Thu, Apr 24, 2025 at 10:15:45AM +0300, Leon Romanovsky wrote:
> On Wed, Apr 23, 2025 at 02:28:56PM -0300, Jason Gunthorpe wrote:
> > On Wed, Apr 23, 2025 at 11:13:02AM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>

<...>

> > > +bool hmm_dma_unmap_pfn(struct device *dev, struct hmm_dma_map *map, size_t idx)
> > > +{
> > > +	struct dma_iova_state *state = &map->state;
> > > +	dma_addr_t *dma_addrs = map->dma_list;
> > > +	unsigned long *pfns = map->pfn_list;
> > > +	unsigned long attrs = 0;
> > > +
> > > +#define HMM_PFN_VALID_DMA (HMM_PFN_VALID | HMM_PFN_DMA_MAPPED)
> > > +	if ((pfns[idx] & HMM_PFN_VALID_DMA) != HMM_PFN_VALID_DMA)
> > > +		return false;
> > > +#undef HMM_PFN_VALID_DMA
> > 
> > If a v10 comes I'd put this in a const function level variable:
> > 
> >           const unsigned int HMM_PFN_VALID_DMA = HMM_PFN_VALID | HMM_PFN_DMA_MAPPED;

diff --git a/mm/hmm.c b/mm/hmm.c
index c0bee5aa00fc..a8bf097677f3 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -807,15 +807,14 @@ EXPORT_SYMBOL_GPL(hmm_dma_map_pfn);
  */
 bool hmm_dma_unmap_pfn(struct device *dev, struct hmm_dma_map *map, size_t idx)
 {
+       const unsigned long valid_dma = HMM_PFN_VALID | HMM_PFN_DMA_MAPPED;
        struct dma_iova_state *state = &map->state;
        dma_addr_t *dma_addrs = map->dma_list;
        unsigned long *pfns = map->pfn_list;
        unsigned long attrs = 0;
 
-#define HMM_PFN_VALID_DMA (HMM_PFN_VALID | HMM_PFN_DMA_MAPPED)
-       if ((pfns[idx] & HMM_PFN_VALID_DMA) != HMM_PFN_VALID_DMA)
+       if ((pfns[idx] & valid_dma) != valid_dma)
                return false;
-#undef HMM_PFN_VALID_DMA
 
        if (pfns[idx] & HMM_PFN_P2PDMA_BUS)
                ; /* no need to unmap bus address P2P mappings */

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 17/24] vfio/mlx5: Enable the DMA link API
  2025-04-23 18:09   ` Jason Gunthorpe
@ 2025-04-24  7:55     ` Leon Romanovsky
  0 siblings, 0 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-24  7:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Jake Edge, Jonathan Corbet, Zhu Yanjun, Robin Murphy,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Jérôme Glisse, Andrew Morton,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni

On Wed, Apr 23, 2025 at 03:09:41PM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 23, 2025 at 11:13:08AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > Remove intermediate scatter-gather table completely and
> > enable new DMA link API.
> > 
> > Tested-by: Jens Axboe <axboe@kernel.dk>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  drivers/vfio/pci/mlx5/cmd.c  | 298 ++++++++++++++++-------------------
> >  drivers/vfio/pci/mlx5/cmd.h  |  21 ++-
> >  drivers/vfio/pci/mlx5/main.c |  31 ----
> >  3 files changed, 147 insertions(+), 203 deletions(-)
> 
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> > +static int register_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
> > +			      struct page **page_list, u32 *mkey_in,
> > +			      struct dma_iova_state *state,
> > +			      enum dma_data_direction dir)
> > +{
> > +	dma_addr_t addr;
> > +	size_t mapped = 0;
> > +	__be64 *mtt;
> > +	int i, err;
> >  
> > -	return mlx5_core_create_mkey(mdev, mkey, mkey_in, inlen);
> > +	WARN_ON_ONCE(dir == DMA_NONE);
> > +
> > +	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
> > +
> > +	if (dma_iova_try_alloc(mdev->device, state, 0, npages * PAGE_SIZE)) {
> > +		addr = state->addr;
> > +		for (i = 0; i < npages; i++) {
> > +			err = dma_iova_link(mdev->device, state,
> > +					    page_to_phys(page_list[i]), mapped,
> > +					    PAGE_SIZE, dir, 0);
> > +			if (err)
> > +				goto error;
> > +			*mtt++ = cpu_to_be64(addr);
> > +			addr += PAGE_SIZE;
> > +			mapped += PAGE_SIZE;
> > +		}
> 
> This is an area I'd like to see improvement on as a follow up.
> 
> Given we know we are allocating contiguous IOVA we should be able to
> request a certain alignment so we can know that it can be put into the
> mkey as single mtt. That would eliminate the double translation cost in
> the HW.
> 
> The RDMA mkey builder is able to do this from the scatterlist but the
> logic to do that was too complex to copy into vfio. This is close to
> being simple enough, just the alignment is the only problem.

I saw this improvement as well, but there is a need to generalize this 
"if (dma_iova_try_alloc) ... else ..." code first, as it will be used
by all vfio HW drivers.

So the plan is:
1. Merge the code as is.
2. Convert second vfio HW to the new API.
3. Propose something like dma_map_pages(..., struct page **page_list, ...)
   to map array of pages.
4. Optimize mlx5 vfio MTT creation.

Thanks

> 
> Jason
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread
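
A rough sketch of where step 3 of that plan could go: a helper that maps an
array of pages through the dma_iova_* path, mirroring the
register_dma_pages() loop quoted above. The dma_iova_* calls follow the API
added earlier in this series; the helper name, signature and error policy
are hypothetical.

/*
 * Hypothetical sketch of a "map an array of pages" helper (step 3 above).
 * The dma_iova_* calls follow the API added earlier in this series; the
 * helper name, signature and error policy are made up.
 */
static int sketch_dma_map_page_list(struct device *dev,
				    struct dma_iova_state *state,
				    struct page **pages, u32 npages,
				    enum dma_data_direction dir)
{
	size_t mapped = 0;
	u32 i;
	int err;

	if (!dma_iova_try_alloc(dev, state, 0, (size_t)npages * PAGE_SIZE))
		return -EOPNOTSUPP;	/* caller falls back to dma_map_page() */

	for (i = 0; i < npages; i++) {
		err = dma_iova_link(dev, state, page_to_phys(pages[i]),
				    mapped, PAGE_SIZE, dir, 0);
		if (err)
			goto err_destroy;
		mapped += PAGE_SIZE;
	}

	return dma_iova_sync(dev, state, 0, mapped);

err_destroy:
	dma_iova_destroy(dev, state, mapped, dir, 0);
	return err;
}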

* Re: [PATCH v9 10/24] mm/hmm: let users to tag specific PFN with DMA mapped bit
  2025-04-23 23:33         ` Jason Gunthorpe
@ 2025-04-24  8:07           ` Leon Romanovsky
  2025-04-24  8:11             ` Christoph Hellwig
  0 siblings, 1 reply; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-24  8:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mika Penttilä, Marek Szyprowski, Jens Axboe,
	Christoph Hellwig, Keith Busch, Jake Edge, Jonathan Corbet,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Luis Chamberlain,
	Matthew Wilcox, Dan Williams, Kanchan Joshi, Chaitanya Kulkarni

On Wed, Apr 23, 2025 at 08:33:35PM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 23, 2025 at 09:37:24PM +0300, Mika Penttilä wrote:
> > 
> > On 4/23/25 21:17, Jason Gunthorpe wrote:
> > > On Wed, Apr 23, 2025 at 08:54:05PM +0300, Mika Penttilä wrote:
> > >>> @@ -36,6 +38,13 @@ enum hmm_pfn_flags {
> > >>>  	HMM_PFN_VALID = 1UL << (BITS_PER_LONG - 1),
> > >>>  	HMM_PFN_WRITE = 1UL << (BITS_PER_LONG - 2),
> > >>>  	HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
> > >>> +
> > >>> +	/*
> > >>> +	 * Sticky flags, carried from input to output,
> > >>> +	 * don't forget to update HMM_PFN_INOUT_FLAGS
> > >>> +	 */
> > >>> +	HMM_PFN_DMA_MAPPED = 1UL << (BITS_PER_LONG - 7),
> > >>> +
> > >> How is this playing together with the mapped order usage?
> > > Order shift starts at bit 8, DMA_MAPPED is at bit 7
> > 
> > hmm bits are the high bits, and order is 5 bits starting from
> > (BITS_PER_LONG - 8)
> 
> I see, so yes order occupies 5 bits [-4,-5,-6,-7,-8] and the
> DMA_MAPPED overlaps, it should be 9 not 7 because of the backwardness.

Thanks for the fix.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 10/24] mm/hmm: let users to tag specific PFN with DMA mapped bit
  2025-04-24  8:07           ` Leon Romanovsky
@ 2025-04-24  8:11             ` Christoph Hellwig
  2025-04-24  8:46               ` Leon Romanovsky
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2025-04-24  8:11 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Mika Penttilä, Marek Szyprowski, Jens Axboe,
	Christoph Hellwig, Keith Busch, Jake Edge, Jonathan Corbet,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Luis Chamberlain,
	Matthew Wilcox, Dan Williams, Kanchan Joshi, Chaitanya Kulkarni

On Thu, Apr 24, 2025 at 11:07:44AM +0300, Leon Romanovsky wrote:
> > I see, so yes order occupies 5 bits [-4,-5,-6,-7,-8] and the
> > DMA_MAPPED overlaps, it should be 9 not 7 because of the backwardness.
> 
> Thanks for the fix.

Maybe we can use the chance to make the scheme less fragile?  i.e.
put flags in the high bits and derive the first valid bit from the
pfn order?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 10/24] mm/hmm: let users to tag specific PFN with DMA mapped bit
  2025-04-24  8:11             ` Christoph Hellwig
@ 2025-04-24  8:46               ` Leon Romanovsky
  2025-04-24 12:07                 ` Jason Gunthorpe
  0 siblings, 1 reply; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-24  8:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, Mika Penttilä, Marek Szyprowski, Jens Axboe,
	Keith Busch, Jake Edge, Jonathan Corbet, Zhu Yanjun, Robin Murphy,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Jérôme Glisse, Andrew Morton,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni

On Thu, Apr 24, 2025 at 10:11:01AM +0200, Christoph Hellwig wrote:
> On Thu, Apr 24, 2025 at 11:07:44AM +0300, Leon Romanovsky wrote:
> > > I see, so yes order occupies 5 bits [-4,-5,-6,-7,-8] and the
> > > DMA_MAPPED overlaps, it should be 9 not 7 because of the backwardness.
> > 
> > Thanks for the fix.
> 
> Maybe we can use the chance to make the scheme less fragile?  i.e.
> put flags in the high bits and derive the first valid bit from the
> pfn order?
> 

It can be done too. This is what I got:

enum hmm_pfn_flags {
	/* Output fields and flags */
	HMM_PFN_VALID = 1UL << (BITS_PER_LONG - 1),
	HMM_PFN_WRITE = 1UL << (BITS_PER_LONG - 2),
	HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
	/*
	 * Sticky flags, carried from input to output,
	 * don't forget to update HMM_PFN_INOUT_FLAGS
	 */
	HMM_PFN_DMA_MAPPED = 1UL << (BITS_PER_LONG - 4),
	HMM_PFN_P2PDMA     = 1UL << (BITS_PER_LONG - 5),
	HMM_PFN_P2PDMA_BUS = 1UL << (BITS_PER_LONG - 6),

	HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 11),

	/* Input flags */
	HMM_PFN_REQ_FAULT = HMM_PFN_VALID,
	HMM_PFN_REQ_WRITE = HMM_PFN_WRITE,

	HMM_PFN_FLAGS = ~((1UL << HMM_PFN_ORDER_SHIFT) - 1),
};

So now, we just need to move HMM_PFN_ORDER_SHIFT if we add new flags
and HMM_PFN_FLAGS will be updated automatically.

Thanks

^ permalink raw reply	[flat|nested] 73+ messages in thread
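
Under that proposed layout the accessors stay simple shift-and-mask
operations; a minimal sketch for illustration (the helper names are
assumptions, not code from the patch):

/* Sketch only: pulling the pieces back out of an hmm_pfn word */
static inline unsigned int hmm_pfn_to_map_order(unsigned long hmm_pfn)
{
	/* the 5-bit order field sits immediately below the sticky flags */
	return (hmm_pfn >> HMM_PFN_ORDER_SHIFT) & 0x1F;
}

static inline unsigned long hmm_pfn_to_pfn(unsigned long hmm_pfn)
{
	/* HMM_PFN_FLAGS covers everything from the order field upwards */
	return hmm_pfn & ~HMM_PFN_FLAGS;
}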

* Re: [PATCH v9 10/24] mm/hmm: let users to tag specific PFN with DMA mapped bit
  2025-04-24  8:46               ` Leon Romanovsky
@ 2025-04-24 12:07                 ` Jason Gunthorpe
  2025-04-24 12:50                   ` Leon Romanovsky
  0 siblings, 1 reply; 73+ messages in thread
From: Jason Gunthorpe @ 2025-04-24 12:07 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Christoph Hellwig, Mika Penttilä, Marek Szyprowski,
	Jens Axboe, Keith Busch, Jake Edge, Jonathan Corbet, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni

On Thu, Apr 24, 2025 at 11:46:26AM +0300, Leon Romanovsky wrote:
> On Thu, Apr 24, 2025 at 10:11:01AM +0200, Christoph Hellwig wrote:
> > On Thu, Apr 24, 2025 at 11:07:44AM +0300, Leon Romanovsky wrote:
> > > > I see, so yes order occupies 5 bits [-4,-5,-6,-7,-8] and the
> > > > DMA_MAPPED overlaps, it should be 9 not 7 because of the backwardness.
> > > 
> > > Thanks for the fix.
> > 
> > Maybe we can use the chance to make the scheme less fragile?  i.e.
> > put flags in the high bits and derive the first valid bit from the
> > pfn order?
>
> It can be done too. This is what I got:

Use genmask:

enum hmm_pfn_flags {
	HMM_FLAGS_START = BITS_PER_LONG - PAGE_SHIFT,
	HMM_PFN_FLAGS = GENMASK(BITS_PER_LONG - 1, HMM_FLAGS_START),

	/* Output fields and flags */
	HMM_PFN_VALID = 1UL << HMM_FLAGS_START + 0,
	HMM_PFN_WRITE = 1UL << HMM_FLAGS_START + 1,
	HMM_PFN_ERROR = 1UL << HMM_FLAGS_START + 2,
	HMM_PFN_ORDER_MASK = GENMASK(HMM_FLAGS_START + 7, HMM_FLAGS_START + 3),

	/* Input flags */
	HMM_PFN_REQ_FAULT = HMM_PFN_VALID,
	HMM_PFN_REQ_WRITE = HMM_PFN_WRITE,
};

Jason

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 10/24] mm/hmm: let users to tag specific PFN with DMA mapped bit
  2025-04-24 12:07                 ` Jason Gunthorpe
@ 2025-04-24 12:50                   ` Leon Romanovsky
  2025-04-24 16:01                     ` Leon Romanovsky
  0 siblings, 1 reply; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-24 12:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Mika Penttilä, Marek Szyprowski,
	Jens Axboe, Keith Busch, Jake Edge, Jonathan Corbet, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni

On Thu, Apr 24, 2025 at 09:07:03AM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 24, 2025 at 11:46:26AM +0300, Leon Romanovsky wrote:
> > On Thu, Apr 24, 2025 at 10:11:01AM +0200, Christoph Hellwig wrote:
> > > On Thu, Apr 24, 2025 at 11:07:44AM +0300, Leon Romanovsky wrote:
> > > > > I see, so yes order occupies 5 bits [-4,-5,-6,-7,-8] and the
> > > > > DMA_MAPPED overlaps, it should be 9 not 7 because of the backwardness.
> > > > 
> > > > Thanks for the fix.
> > > 
> > > Maybe we can use the chance to make the scheme less fragile?  i.e.
> > > put flags in the high bits and derive the first valid bit from the
> > > pfn order?
> >
> > It can be done too. This is what I got:
> 
> Use genmask:

I can do it too, will change.

> 
> enum hmm_pfn_flags {
> 	HMM_FLAGS_START = BITS_PER_LONG - PAGE_SHIFT,
> 	HMM_PFN_FLAGS = GENMASK(BITS_PER_LONG - 1, HMM_FLAGS_START),
> 
> 	/* Output fields and flags */
> 	HMM_PFN_VALID = 1UL << HMM_FLAGS_START + 0,
> 	HMM_PFN_WRITE = 1UL << HMM_FLAGS_START + 1,
> 	HMM_PFN_ERROR = 1UL << HMM_FLAGS_START + 2,
> 	HMM_PFN_ORDER_MASK = GENMASK(HMM_FLAGS_START + 7, HMM_FLAGS_START + 3),
> 
> 	/* Input flags */
> 	HMM_PFN_REQ_FAULT = HMM_PFN_VALID,
> 	HMM_PFN_REQ_WRITE = HMM_PFN_WRITE,
> };
> 
> Jason

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 10/24] mm/hmm: let users to tag specific PFN with DMA mapped bit
  2025-04-24 12:50                   ` Leon Romanovsky
@ 2025-04-24 16:01                     ` Leon Romanovsky
  0 siblings, 0 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-24 16:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Mika Penttilä, Marek Szyprowski,
	Jens Axboe, Keith Busch, Jake Edge, Jonathan Corbet, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni

On Thu, Apr 24, 2025 at 03:50:52PM +0300, Leon Romanovsky wrote:
> On Thu, Apr 24, 2025 at 09:07:03AM -0300, Jason Gunthorpe wrote:
> > On Thu, Apr 24, 2025 at 11:46:26AM +0300, Leon Romanovsky wrote:
> > > On Thu, Apr 24, 2025 at 10:11:01AM +0200, Christoph Hellwig wrote:
> > > > On Thu, Apr 24, 2025 at 11:07:44AM +0300, Leon Romanovsky wrote:
> > > > > > I see, so yes order occupies 5 bits [-4,-5,-6,-7,-8] and the
> > > > > > DMA_MAPPED overlaps, it should be 9 not 7 because of the backwardness.
> > > > > 
> > > > > Thanks for the fix.
> > > > 
> > > > Maybe we can use the chance to make the scheme less fragile?  i.e.
> > > > put flags in the high bits and derive the first valid bit from the
> > > > pfn order?
> > >
> > > It can be done too. This is what I got:
> > 
> > Use genmask:
> 
> I can do it too, will change.

If you don't mind, I'll stick with my previous proposal.

GENMASK() alone is not enough and the best solution would include use
of the FIELD_GET()/FIELD_PREP() macros. IMHO, that will make the code
unreadable. The simple, clean and reliable bitfield OR operations fit
much better here.

Thanks

> 
> > 
> > enum hmm_pfn_flags {
> > 	HMM_FLAGS_START = BITS_PER_LONG - PAGE_SHIFT,
> > 	HMM_PFN_FLAGS = GENMASK(BITS_PER_LONG - 1, HMM_FLAGS_START),
> > 
> > 	/* Output fields and flags */
> > 	HMM_PFN_VALID = 1UL << HMM_FLAGS_START + 0,
> > 	HMM_PFN_WRITE = 1UL << HMM_FLAGS_START + 1,
> > 	HMM_PFN_ERROR = 1UL << HMM_FLAGS_START + 2,
> > 	HMM_PFN_ORDER_MASK = GENMASK(HMM_FLAGS_START + 7, HMM_FLAGS_START + 3),
> > 
> > 	/* Input flags */
> > 	HMM_PFN_REQ_FAULT = HMM_PFN_VALID,
> > 	HMM_PFN_REQ_WRITE = HMM_PFN_WRITE,
> > };
> > 
> > Jason
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread
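
For contrast, a sketch of what the two options look like at a use site: the
GENMASK()/FIELD_GET() style being declined above versus the shift-and-mask
style kept in the series. Nothing below is from the thread; it only reuses
the constants defined in the two proposals quoted above:

#include <linux/bitfield.h>

/* GENMASK layout from the proposal above */
static inline unsigned int order_via_field_get(unsigned long hmm_pfn)
{
	return FIELD_GET(HMM_PFN_ORDER_MASK, hmm_pfn);
}

/* explicit-shift layout kept in the series */
static inline unsigned int order_via_shift(unsigned long hmm_pfn)
{
	return (hmm_pfn >> HMM_PFN_ORDER_SHIFT) & 0x1F;
}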

* Re: [PATCH v9 01/24] PCI/P2PDMA: Refactor the p2pdma mapping helpers
  2025-04-23  8:12 ` [PATCH v9 01/24] PCI/P2PDMA: Refactor the p2pdma mapping helpers Leon Romanovsky
@ 2025-04-26  0:21   ` Luis Chamberlain
  2025-04-27  7:25     ` Leon Romanovsky
  0 siblings, 1 reply; 73+ messages in thread
From: Luis Chamberlain @ 2025-04-26  0:21 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Matthew Wilcox, Dan Williams, Kanchan Joshi,
	Chaitanya Kulkarni, Leon Romanovsky

On Wed, Apr 23, 2025 at 11:12:52AM +0300, Leon Romanovsky wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> The current scheme with a single helper to determine the P2P status
> and map a scatterlist segment force users to always use the map_sg
> helper to DMA map, which we're trying to get away from because they
> are very cache inefficient.
> 
> Refactor the code so that there is a single helper that checks the P2P
> state for a page, including the result that it is not a P2P page to
> simplify the callers, and a second one to perform the address translation
> for a bus mapped P2P transfer that does not depend on the scatterlist
> structure.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> Tested-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

Might make it easier for patch review to split off adding
__pci_p2pdma_update_state() in a separate patch first. Other than that,
looks good.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>

  Luis

^ permalink raw reply	[flat|nested] 73+ messages in thread
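
For illustration, the calling pattern the refactor enables:
pci_p2pdma_state() and pci_p2pdma_bus_addr_map() are the helpers this patch
introduces, while the wrapper below, its parameters and its error handling
are made up.

/* Illustrative per-page mapping built on the refactored helpers */
static dma_addr_t sketch_map_one_page(struct device *dev,
				      struct pci_p2pdma_map_state *p2p_state,
				      struct page *page, size_t offset,
				      size_t len, enum dma_data_direction dir)
{
	switch (pci_p2pdma_state(p2p_state, dev, page)) {
	case PCI_P2PDMA_MAP_NONE:
	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
		/* ordinary (possibly IOMMU-translated) mapping path */
		return dma_map_page(dev, page, offset, len, dir);
	case PCI_P2PDMA_MAP_BUS_ADDR:
		/* peer-to-peer below the host bridge: bus-address translation only */
		return pci_p2pdma_bus_addr_map(p2p_state,
					       page_to_phys(page) + offset);
	default:	/* PCI_P2PDMA_MAP_NOT_SUPPORTED */
		return DMA_MAPPING_ERROR;
	}
}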

* Re: [PATCH v9 02/24] dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
  2025-04-23  8:12 ` [PATCH v9 02/24] dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h Leon Romanovsky
@ 2025-04-26  0:34   ` Luis Chamberlain
  2025-04-27  7:53     ` Leon Romanovsky
  0 siblings, 1 reply; 73+ messages in thread
From: Luis Chamberlain @ 2025-04-26  0:34 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Matthew Wilcox, Dan Williams, Kanchan Joshi,
	Chaitanya Kulkarni, Leon Romanovsky

On Wed, Apr 23, 2025 at 11:12:53AM +0300, Leon Romanovsky wrote:
> From: Christoph Hellwig <hch@lst.de>
> +enum pci_p2pdma_map_type {
> +	/*
> +	 * PCI_P2PDMA_MAP_UNKNOWN: Used internally for indicating the mapping
> +	 * type hasn't been calculated yet. Functions that return this enum
> +	 * never return this value.
> +	 */

This last sentence is confusing. How about:

* PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before
* the mapping type has been calculated. Exported routines for the API
* will never return this value.

> +	PCI_P2PDMA_MAP_UNKNOWN = 0,
> +
> +	/*
> +	 * Not a PCI P2PDMA transfer.
> +	 */
> +	PCI_P2PDMA_MAP_NONE,
> +
> +	/*
> +	 * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
> +	 * traverse the host bridge and the host bridge is not in the
> +	 * allowlist. DMA Mapping routines should return an error when
> +	 * this is returned.
> +	 */
> +	PCI_P2PDMA_MAP_NOT_SUPPORTED,
> +
> +	/*
> +	 * PCI_P2PDMA_BUS_ADDR: Indicates that two devices can talk to

You mean   PCI_P2PDMA_MAP_BUS_ADDR

> + * pci_p2pdma_bus_addr_map - map a PCI_P2PDMA_MAP_BUS_ADDR P2P transfer

Hrm, maybe with a bit more clarity:

Translate a physical address to a bus address for a PCI_P2PDMA_MAP_BUS_ADDR
transfer.


Other than that.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>

  Luis

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 03/24] iommu: generalize the batched sync after map interface
  2025-04-23  8:12 ` [PATCH v9 03/24] iommu: generalize the batched sync after map interface Leon Romanovsky
  2025-04-23 17:15   ` Jason Gunthorpe
@ 2025-04-26  0:52   ` Luis Chamberlain
  2025-04-27  7:54     ` Leon Romanovsky
  1 sibling, 1 reply; 73+ messages in thread
From: Luis Chamberlain @ 2025-04-26  0:52 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Matthew Wilcox, Dan Williams, Kanchan Joshi,
	Chaitanya Kulkarni, Leon Romanovsky

On Wed, Apr 23, 2025 at 11:12:54AM +0300, Leon Romanovsky wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> For the upcoming IOVA-based DMA API we want to use the interface batch the
> sync after mapping multiple entries from dma-iommu without having a
> scatterlist.

This reads odd, how about:

For the upcoming IOVA-based DMA API, we want to batch the sync operation
after mapping multiple entries from dma-iommu without requiring a
scatterlist.

Other than that:

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>

  Luis

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 04/24] iommu: add kernel-doc for iommu_unmap_fast
  2025-04-23  8:12 ` [PATCH v9 04/24] iommu: add kernel-doc for iommu_unmap_fast Leon Romanovsky
  2025-04-23 17:15   ` Jason Gunthorpe
@ 2025-04-26  0:55   ` Luis Chamberlain
  1 sibling, 0 replies; 73+ messages in thread
From: Luis Chamberlain @ 2025-04-26  0:55 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Leon Romanovsky, Jake Edge, Jonathan Corbet, Jason Gunthorpe,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Matthew Wilcox,
	Dan Williams, Kanchan Joshi, Chaitanya Kulkarni, Jason Gunthorpe

On Wed, Apr 23, 2025 at 11:12:55AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Add kernel-doc section for iommu_unmap_fast to document existing
> limitation of underlying functions which can't split individual ranges.
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Acked-by: Will Deacon <will@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Tested-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>

  Luis

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 05/24] dma-mapping: Provide an interface to allow allocate IOVA
  2025-04-23  8:12 ` [PATCH v9 05/24] dma-mapping: Provide an interface to allow allocate IOVA Leon Romanovsky
@ 2025-04-26  1:10   ` Luis Chamberlain
  0 siblings, 0 replies; 73+ messages in thread
From: Luis Chamberlain @ 2025-04-26  1:10 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Leon Romanovsky, Jake Edge, Jonathan Corbet, Jason Gunthorpe,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Matthew Wilcox,
	Dan Williams, Kanchan Joshi, Chaitanya Kulkarni

The subject reads odd, how about:

dma-mapping: Provide an interface to allocate IOVA

> +/**
> + * dma_iova_free - Free an IOVA space
> + * @dev: Device to free the IOVA space for
> + * @state: IOVA state
> + *
> + * Undoes a successful dma_try_iova_alloc().
> + *
> + * Note that all dma_iova_link() calls need to be undone first.  For callers
> + * that never call dma_iova_unlink(), dma_iova_destroy() can be used instead
> + * which unlinks all ranges and frees the IOVA space in a single efficient
> + * operation.
> + */

Probably doesn't matter but dma_iova_destroy() doesn't exist yet here.

Other than that:

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>

  Luis

^ permalink raw reply	[flat|nested] 73+ messages in thread
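
A minimal sketch of the alloc/free pairing the kernel-doc describes,
assuming the caller embeds the IOVA state in its own request structure; the
struct and function names are made up:

/* Sketch only: pairing dma_iova_try_alloc() with dma_iova_free() */
struct sketch_req {
	struct dma_iova_state state;
	size_t size;
};

static int sketch_req_alloc_iova(struct device *dev, struct sketch_req *req)
{
	if (!dma_iova_try_alloc(dev, &req->state, 0, req->size))
		return -EOPNOTSUPP;	/* fall back to per-page dma_map_page() */
	return 0;
}

static void sketch_req_free_iova(struct device *dev, struct sketch_req *req)
{
	/* only legal once every dma_iova_link() has been undone */
	dma_iova_free(dev, &req->state);
}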

* Re: [PATCH v9 06/24] iommu/dma: Factor out a iommu_dma_map_swiotlb helper
  2025-04-23  8:12 ` [PATCH v9 06/24] iommu/dma: Factor out a iommu_dma_map_swiotlb helper Leon Romanovsky
@ 2025-04-26  1:14   ` Luis Chamberlain
  0 siblings, 0 replies; 73+ messages in thread
From: Luis Chamberlain @ 2025-04-26  1:14 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Matthew Wilcox, Dan Williams, Kanchan Joshi,
	Chaitanya Kulkarni, Leon Romanovsky

On Wed, Apr 23, 2025 at 11:12:57AM +0300, Leon Romanovsky wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> Split the iommu logic from iommu_dma_map_page into a separate helper.
> This not only keeps the code neatly separated, but will also allow for
> reuse in another caller.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Tested-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>

  Luis

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 07/24] dma-mapping: Implement link/unlink ranges API
  2025-04-23  8:12 ` [PATCH v9 07/24] dma-mapping: Implement link/unlink ranges API Leon Romanovsky
@ 2025-04-26 22:46   ` Luis Chamberlain
  2025-04-27  8:13     ` Leon Romanovsky
  0 siblings, 1 reply; 73+ messages in thread
From: Luis Chamberlain @ 2025-04-26 22:46 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Leon Romanovsky, Jake Edge, Jonathan Corbet, Jason Gunthorpe,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Matthew Wilcox,
	Dan Williams, Kanchan Joshi, Chaitanya Kulkarni

On Wed, Apr 23, 2025 at 11:12:58AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Introduce new DMA APIs to perform DMA linkage of buffers
> in layers higher than DMA.
> 
> In proposed API, the callers will perform the following steps.
> In map path:
> 	if (dma_can_use_iova(...))
> 	    dma_iova_alloc()
> 	    for (page in range)
> 	       dma_iova_link_next(...)
> 	    dma_iova_sync(...)
> 	else
> 	     /* Fallback to legacy map pages */
>              for (all pages)
> 	       dma_map_page(...)
> 
> In unmap path:
> 	if (dma_can_use_iova(...))
> 	     dma_iova_destroy()
> 	else
> 	     for (all pages)
> 		dma_unmap_page(...)
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Tested-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  drivers/iommu/dma-iommu.c   | 261 ++++++++++++++++++++++++++++++++++++
>  include/linux/dma-mapping.h |  32 +++++
>  2 files changed, 293 insertions(+)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index d2c298083e0a..2e014db5a244 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -1818,6 +1818,267 @@ void dma_iova_free(struct device *dev, struct dma_iova_state *state)
>  }
>  EXPORT_SYMBOL_GPL(dma_iova_free);
>  
> +static int __dma_iova_link(struct device *dev, dma_addr_t addr,
> +		phys_addr_t phys, size_t size, enum dma_data_direction dir,
> +		unsigned long attrs)
> +{
> +	bool coherent = dev_is_dma_coherent(dev);
> +
> +	if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> +		arch_sync_dma_for_device(phys, size, dir);

So arch_sync_dma_for_device() is a no-op on some architectures, notably x86.
So since you're doing this work and given the above pattern is common on
the non iova case, we could save ourselves 2 branch checks on x86 on
__dma_iova_link() and also generalize savings for the non-iova case as
well. For the non-iova case we have two use cases, one with the attrs on
initial mapping, and one without on subsequent sync ops. For the iova
case the attr is always consistently used.

So we could just have something like this:

#ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE
static inline void arch_sync_dma_device(struct device *dev,
                                        phys_addr_t paddr, size_t size,
                                        enum dma_data_direction dir)
{
    if (!dev_is_dma_coherent(dev))
        arch_sync_dma_for_device(paddr, size, dir);
}

static inline void arch_sync_dma_device_attrs(struct device *dev,
                                              phys_addr_t paddr, size_t size,
                                              enum dma_data_direction dir,
                                              unsigned long attrs)
{
    if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
        arch_sync_dma_for_device(paddr, size, dir);
}
#else
static inline void arch_sync_dma_device(struct device *dev,
                                        phys_addr_t paddr, size_t size,
                                        enum dma_data_direction dir)
{
}

static inline void arch_sync_dma_device_attrs(struct device *dev,
                                              phys_addr_t paddr, size_t size,
                                              enum dma_data_direction dir,
                                              unsigned long attrs)
{
}
#endif

> +/**
> + * dma_iova_link - Link a range of IOVA space
> + * @dev: DMA device
> + * @state: IOVA state
> + * @phys: physical address to link
> + * @offset: offset into the IOVA state to map into
> + * @size: size of the buffer
> + * @dir: DMA direction
> + * @attrs: attributes of mapping properties
> + *
> + * Link a range of IOVA space for the given IOVA state without IOTLB sync.
> + * This function is used to link multiple physical addresses in contiguous
> + * IOVA space without performing costly IOTLB sync.
> + *
> + * The caller is responsible to call to dma_iova_sync() to sync IOTLB at
> + * the end of linkage.
> + */
> +int dma_iova_link(struct device *dev, struct dma_iova_state *state,
> +		phys_addr_t phys, size_t offset, size_t size,
> +		enum dma_data_direction dir, unsigned long attrs)
> +{
> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> +	struct iova_domain *iovad = &cookie->iovad;
> +	size_t iova_start_pad = iova_offset(iovad, phys);
> +
> +	if (WARN_ON_ONCE(iova_start_pad && offset > 0))
> +		return -EIO;
> +
> +	if (dev_use_swiotlb(dev, size, dir) && iova_offset(iovad, phys | size))

There is already a similar check for the non-iova case in
iommu_dma_map_page() and a nice comment about why this is checked;
this seems to be just screaming for a helper:

/*                                                                       
 * Checks if a physical buffer has unaligned boundaries with respect to
 * the IOMMU granule. Returns non-zero if either the start or end
 * address is not aligned to the granule boundary.
*/
static inline size_t iova_unaligned(struct iova_domain *iovad,
                                    phys_addr_t phys,
				    size_t size)
{                                                                                
	return iova_offset(iovad, phys | size);
}  

Other than that, looks good.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>

  Luis

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 08/24] dma-mapping: add a dma_need_unmap helper
  2025-04-23  8:12 ` [PATCH v9 08/24] dma-mapping: add a dma_need_unmap helper Leon Romanovsky
@ 2025-04-26 22:49   ` Luis Chamberlain
  0 siblings, 0 replies; 73+ messages in thread
From: Luis Chamberlain @ 2025-04-26 22:49 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Matthew Wilcox, Dan Williams, Kanchan Joshi,
	Chaitanya Kulkarni, Leon Romanovsky

On Wed, Apr 23, 2025 at 11:12:59AM +0300, Leon Romanovsky wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> Add helper that allows a driver to skip calling dma_unmap_*
> if the DMA layer can guarantee that they are no-nops.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Tested-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>

  Luis

^ permalink raw reply	[flat|nested] 73+ messages in thread
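
A minimal sketch of how a driver might use the helper, assuming
dma_need_unmap() keeps the bool dma_need_unmap(struct device *dev) form; the
request structure here is made up:

/* Sketch only: skip unmap bookkeeping when unmap is known to be a no-op */
struct sketch_io {
	dma_addr_t dma_addr;
	size_t len;
};

static void sketch_io_done(struct device *dev, struct sketch_io *io)
{
	if (!dma_need_unmap(dev))
		return;		/* dma_unmap_page() would be a no-op here */

	dma_unmap_page(dev, io->dma_addr, io->len, DMA_TO_DEVICE);
}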

* Re: [PATCH v9 23/24] nvme-pci: convert to blk_rq_dma_map
  2025-04-23  9:24   ` Christoph Hellwig
  2025-04-23 10:03     ` Leon Romanovsky
  2025-04-23 15:05     ` Keith Busch
@ 2025-04-27  7:10     ` Leon Romanovsky
  2 siblings, 0 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-27  7:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Marek Szyprowski, Jens Axboe, Jake Edge,
	Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun, Robin Murphy,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Jérôme Glisse, Andrew Morton,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni, Nitesh Shetty

On Wed, Apr 23, 2025 at 11:24:37AM +0200, Christoph Hellwig wrote:
> I don't think the meta SGL handling is quite right yet, and the
> single segment data handling also regressed.  

If my testing is correct, my dma-split-wip branch passes all tests,
including single segment.

Thanks

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 01/24] PCI/P2PDMA: Refactor the p2pdma mapping helpers
  2025-04-26  0:21   ` Luis Chamberlain
@ 2025-04-27  7:25     ` Leon Romanovsky
  0 siblings, 0 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-27  7:25 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Matthew Wilcox, Dan Williams, Kanchan Joshi,
	Chaitanya Kulkarni

On Fri, Apr 25, 2025 at 05:21:59PM -0700, Luis Chamberlain wrote:
> On Wed, Apr 23, 2025 at 11:12:52AM +0300, Leon Romanovsky wrote:
> > From: Christoph Hellwig <hch@lst.de>
> > 
> > The current scheme with a single helper to determine the P2P status
> > and map a scatterlist segment force users to always use the map_sg
> > helper to DMA map, which we're trying to get away from because they
> > are very cache inefficient.
> > 
> > Refactor the code so that there is a single helper that checks the P2P
> > state for a page, including the result that it is not a P2P page to
> > simplify the callers, and a second one to perform the address translation
> > for a bus mapped P2P transfer that does not depend on the scatterlist
> > structure.
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
> > Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> > Tested-by: Jens Axboe <axboe@kernel.dk>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> 
> Might make it easier for patch review to split off adding
> __pci_p2pdma_update_state() in a seprate patch first.

The original __pci_p2pdma_update_state() code had this and was
dependent on SG, which we are removing in this patch.

       if (state->map == PCI_P2PDMA_MAP_BUS_ADDR) {
               sg->dma_address = sg_phys(sg) + state->bus_off;
               sg_dma_len(sg) = sg->length;
               sg_dma_mark_bus_address(sg);
       }

So to split, we would need to introduce a new version of __pci_p2pdma_update_state(),
rename the existing one to something like __pci_p2pdma_update_state2() and
remove it in the next patch. Such a pattern of adding and immediately
deleting code is not welcome.

> Other than that, looks good.
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>

Thanks

> 
>   Luis

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 02/24] dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
  2025-04-26  0:34   ` Luis Chamberlain
@ 2025-04-27  7:53     ` Leon Romanovsky
  0 siblings, 0 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-27  7:53 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Matthew Wilcox, Dan Williams, Kanchan Joshi,
	Chaitanya Kulkarni

On Fri, Apr 25, 2025 at 05:34:14PM -0700, Luis Chamberlain wrote:
> On Wed, Apr 23, 2025 at 11:12:53AM +0300, Leon Romanovsky wrote:
> > From: Christoph Hellwig <hch@lst.de>
> > +enum pci_p2pdma_map_type {
> > +	/*
> > +	 * PCI_P2PDMA_MAP_UNKNOWN: Used internally for indicating the mapping
> > +	 * type hasn't been calculated yet. Functions that return this enum
> > +	 * never return this value.
> > +	 */
> 
> This last sentence is confusing. How about:
> 
> * PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before
> * the mapping type has been calculated. Exported routines for the API
> * will never return this value.

This patch moved the code as-is, but sure, let's update the comments.

> 
> > +	PCI_P2PDMA_MAP_UNKNOWN = 0,
> > +
> > +	/*
> > +	 * Not a PCI P2PDMA transfer.
> > +	 */
> > +	PCI_P2PDMA_MAP_NONE,
> > +
> > +	/*
> > +	 * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
> > +	 * traverse the host bridge and the host bridge is not in the
> > +	 * allowlist. DMA Mapping routines should return an error when
> > +	 * this is returned.
> > +	 */
> > +	PCI_P2PDMA_MAP_NOT_SUPPORTED,
> > +
> > +	/*
> > +	 * PCI_P2PDMA_BUS_ADDR: Indicates that two devices can talk to
> 
> You mean   PCI_P2PDMA_MAP_BUS_ADDR

done

> 
> > + * pci_p2pdma_bus_addr_map - map a PCI_P2PDMA_MAP_BUS_ADDR P2P transfer
> 
> Hrm, maybe with a bit more clarity:
> 
> Translate a physical address to a bus address for a  PCI_P2PDMA_MAP_BUS_ADDR
> transfer.
> 
> 
> Other than that.
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>

Thanks

> 
>   Luis
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 03/24] iommu: generalize the batched sync after map interface
  2025-04-26  0:52   ` Luis Chamberlain
@ 2025-04-27  7:54     ` Leon Romanovsky
  0 siblings, 0 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-27  7:54 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Matthew Wilcox, Dan Williams, Kanchan Joshi,
	Chaitanya Kulkarni

On Fri, Apr 25, 2025 at 05:52:02PM -0700, Luis Chamberlain wrote:
> On Wed, Apr 23, 2025 at 11:12:54AM +0300, Leon Romanovsky wrote:
> > From: Christoph Hellwig <hch@lst.de>
> > 
> > For the upcoming IOVA-based DMA API we want to use the interface batch the
> > sync after mapping multiple entries from dma-iommu without having a
> > scatterlist.
> 
> This reads odd, how about:
> 
> For the upcoming IOVA-based DMA API, we want to batch the sync operation
> after mapping multiple entries from dma-iommu without requiring a
> scatterlist.
> 
> Other than that:
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>

I used Jason's proposal
https://lore.kernel.org/all/20250423171537.GJ1213339@ziepe.ca

Thanks

> 
>   Luis

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 07/24] dma-mapping: Implement link/unlink ranges API
  2025-04-26 22:46   ` Luis Chamberlain
@ 2025-04-27  8:13     ` Leon Romanovsky
  2025-04-28 13:16       ` Jason Gunthorpe
  2025-04-28 13:20       ` Christoph Hellwig
  0 siblings, 2 replies; 73+ messages in thread
From: Leon Romanovsky @ 2025-04-27  8:13 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
	Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Matthew Wilcox, Dan Williams, Kanchan Joshi,
	Chaitanya Kulkarni

On Sat, Apr 26, 2025 at 03:46:30PM -0700, Luis Chamberlain wrote:
> On Wed, Apr 23, 2025 at 11:12:58AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > Introduce new DMA APIs to perform DMA linkage of buffers
> > in layers higher than DMA.
> > 
> > In proposed API, the callers will perform the following steps.
> > In map path:
> > 	if (dma_can_use_iova(...))
> > 	    dma_iova_alloc()
> > 	    for (page in range)
> > 	       dma_iova_link_next(...)
> > 	    dma_iova_sync(...)
> > 	else
> > 	     /* Fallback to legacy map pages */
> >              for (all pages)
> > 	       dma_map_page(...)
> > 
> > In unmap path:
> > 	if (dma_can_use_iova(...))
> > 	     dma_iova_destroy()
> > 	else
> > 	     for (all pages)
> > 		dma_unmap_page(...)
> > 
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Tested-by: Jens Axboe <axboe@kernel.dk>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  drivers/iommu/dma-iommu.c   | 261 ++++++++++++++++++++++++++++++++++++
> >  include/linux/dma-mapping.h |  32 +++++
> >  2 files changed, 293 insertions(+)
> > 
> > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> > index d2c298083e0a..2e014db5a244 100644
> > --- a/drivers/iommu/dma-iommu.c
> > +++ b/drivers/iommu/dma-iommu.c
> > @@ -1818,6 +1818,267 @@ void dma_iova_free(struct device *dev, struct dma_iova_state *state)
> >  }
> >  EXPORT_SYMBOL_GPL(dma_iova_free);
> >  
> > +static int __dma_iova_link(struct device *dev, dma_addr_t addr,
> > +		phys_addr_t phys, size_t size, enum dma_data_direction dir,
> > +		unsigned long attrs)
> > +{
> > +	bool coherent = dev_is_dma_coherent(dev);
> > +
> > +	if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> > +		arch_sync_dma_for_device(phys, size, dir);
> 
> So arch_sync_dma_for_device() is a no-op on some architectures, notably x86.
> So since you're doing this work and given the above pattern is common on
> the non iova case, we could save ourselves 2 branch checks on x86 on
> __dma_iova_link() and also generalize savings for the non-iova case as
> well. For the non-iova case we have two use cases, one with the attrs on
> initial mapping, and one without on subsequent sync ops. For the iova
> case the attr is always consistently used.

I want to believe that the compiler will discard this "if (!coherent &&
!(attrs & DMA_ATTR_SKIP_CPU_SYNC))" branch if the arch callback is empty.

> 
> So we could just have something like this:
> 
> #ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE
> static inline void arch_sync_dma_device(struct device *dev,
>                                         phys_addr_t paddr, size_t size,
>                                         enum dma_data_direction dir)
> {
>     if (!dev_is_dma_coherent(dev))
>         arch_sync_dma_for_device(paddr, size, dir);
> }
> 
> static inline void arch_sync_dma_device_attrs(struct device *dev,
>                                               phys_addr_t paddr, size_t size,
>                                               enum dma_data_direction dir,
>                                               unsigned long attrs)
> {
>     if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
>         arch_sync_dma_for_device(paddr, size, dir);
> }
> #else
> static inline void arch_sync_dma_device(struct device *dev,
>                                         phys_addr_t paddr, size_t size,
>                                         enum dma_data_direction dir)
> {
> }
> 
> static inline void arch_sync_dma_device_attrs(struct device *dev,
>                                               phys_addr_t paddr, size_t size,
>                                               enum dma_data_direction dir,
>                                               unsigned long attrs)
> {
> }
> #endif

The problem is that the dev_is_dma_coherent() and DMA_ATTR_SKIP_CPU_SYNC
checks are scattered all over the dma-iommu.c file in different
combinations. While we can add new static functions for a small number of
use cases, it would be a half-solution.

> 
> > +/**
> > + * dma_iova_link - Link a range of IOVA space
> > + * @dev: DMA device
> > + * @state: IOVA state
> > + * @phys: physical address to link
> > + * @offset: offset into the IOVA state to map into
> > + * @size: size of the buffer
> > + * @dir: DMA direction
> > + * @attrs: attributes of mapping properties
> > + *
> > + * Link a range of IOVA space for the given IOVA state without IOTLB sync.
> > + * This function is used to link multiple physical addresses in contiguous
> > + * IOVA space without performing costly IOTLB sync.
> > + *
> > + * The caller is responsible for calling dma_iova_sync() to sync the
> > + * IOTLB at the end of linkage.
> > + */
> > +int dma_iova_link(struct device *dev, struct dma_iova_state *state,
> > +		phys_addr_t phys, size_t offset, size_t size,
> > +		enum dma_data_direction dir, unsigned long attrs)
> > +{
> > +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> > +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> > +	struct iova_domain *iovad = &cookie->iovad;
> > +	size_t iova_start_pad = iova_offset(iovad, phys);
> > +
> > +	if (WARN_ON_ONCE(iova_start_pad && offset > 0))
> > +		return -EIO;
> > +
> > +	if (dev_use_swiotlb(dev, size, dir) && iova_offset(iovad, phys | size))
> 
> There is already a similar check for the non-iova case for this on
> iommu_dma_map_page() and a nice comment about what why this checked,
> this seems to be just screaming for a helper:
> 
> /*
>  * Checks if a physical buffer has unaligned boundaries with respect to
>  * the IOMMU granule. Returns non-zero if either the start or end
>  * address is not aligned to the granule boundary.
>  */
> static inline size_t iova_unaligned(struct iova_domain *iovad,
> 				    phys_addr_t phys, size_t size)
> {
> 	return iova_offset(iovad, phys | size);
> }

I added this function, thanks.
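
With that helper in place, the condition quoted above would presumably read:

	if (dev_use_swiotlb(dev, size, dir) &&
	    iova_unaligned(iovad, phys, size))

i.e. iova_unaligned() simply wraps the existing
iova_offset(iovad, phys | size) check, so behaviour is unchanged.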
 
> Other than that, looks good.
> 
> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
> 
>   Luis

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 07/24] dma-mapping: Implement link/unlink ranges API
  2025-04-27  8:13     ` Leon Romanovsky
@ 2025-04-28 13:16       ` Jason Gunthorpe
  2025-04-28 13:20       ` Christoph Hellwig
  1 sibling, 0 replies; 73+ messages in thread
From: Jason Gunthorpe @ 2025-04-28 13:16 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Luis Chamberlain, Marek Szyprowski, Jens Axboe, Christoph Hellwig,
	Keith Busch, Jake Edge, Jonathan Corbet, Zhu Yanjun, Robin Murphy,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Jérôme Glisse, Andrew Morton,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Matthew Wilcox, Dan Williams, Kanchan Joshi,
	Chaitanya Kulkarni

On Sun, Apr 27, 2025 at 11:13:12AM +0300, Leon Romanovsky wrote:
> > So arch_sync_dma_for_device() is a no-op on some architectures, notably x86.
> > Since you're doing this work, and given the above pattern is common in
> > the non-iova case, we could save ourselves two branch checks on x86 in
> > __dma_iova_link() and also generalize the savings for the non-iova case as
> > well. For the non-iova case we have two use cases: one with the attrs on
> > the initial mapping, and one without on subsequent sync ops. For the iova
> > case the attrs are always used consistently.
> 
> I want to believe that the compiler will discard this "if (!coherent &&
> !(attrs & DMA_ATTR_SKIP_CPU_SYNC))" branch when the body is empty.

Yeah, I'm pretty sure it will.

Jason

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 07/24] dma-mapping: Implement link/unlink ranges API
  2025-04-27  8:13     ` Leon Romanovsky
  2025-04-28 13:16       ` Jason Gunthorpe
@ 2025-04-28 13:20       ` Christoph Hellwig
  1 sibling, 0 replies; 73+ messages in thread
From: Christoph Hellwig @ 2025-04-28 13:20 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Luis Chamberlain, Marek Szyprowski, Jens Axboe, Christoph Hellwig,
	Keith Busch, Jake Edge, Jonathan Corbet, Jason Gunthorpe,
	Zhu Yanjun, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, Niklas Schnelle, Chuck Lever, Matthew Wilcox,
	Dan Williams, Kanchan Joshi, Chaitanya Kulkarni

On Sun, Apr 27, 2025 at 11:13:12AM +0300, Leon Romanovsky wrote:
> > So arch_sync_dma_for_device() is a no-op on some architectures, notably x86.
> > Since you're doing this work, and given the above pattern is common in
> > the non-iova case, we could save ourselves two branch checks on x86 in
> > __dma_iova_link() and also generalize the savings for the non-iova case as
> > well. For the non-iova case we have two use cases: one with the attrs on
> > the initial mapping, and one without on subsequent sync ops. For the iova
> > case the attrs are always used consistently.
> 
> I want to believe that the compiler will discard this "if (!coherent &&
> !(attrs & DMA_ATTR_SKIP_CPU_SYNC))" branch when the body is empty.

Yes, it is the poster child for dead code elimination using the
IS_ENABLED() helper.
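
For reference, a minimal illustration of that pattern (not code from the
patch): IS_ENABLED() expands to a compile-time constant, so when the option
is off the compiler drops the whole branch as dead code.

	if (IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) &&
	    !dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
		arch_sync_dma_for_device(phys, size, dir);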

> checks are scattered all over the dma-iommu.c file in different
> combinations. While we could add new static functions for a small number
> of use cases, it would only be a half-solution.

Don't bother.


^ permalink raw reply	[flat|nested] 73+ messages in thread

end of thread, other threads:[~2025-04-28 13:20 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-23  8:12 [PATCH v9 00/24] Provide a new two step DMA mapping API Leon Romanovsky
2025-04-23  8:12 ` [PATCH v9 01/24] PCI/P2PDMA: Refactor the p2pdma mapping helpers Leon Romanovsky
2025-04-26  0:21   ` Luis Chamberlain
2025-04-27  7:25     ` Leon Romanovsky
2025-04-23  8:12 ` [PATCH v9 02/24] dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h Leon Romanovsky
2025-04-26  0:34   ` Luis Chamberlain
2025-04-27  7:53     ` Leon Romanovsky
2025-04-23  8:12 ` [PATCH v9 03/24] iommu: generalize the batched sync after map interface Leon Romanovsky
2025-04-23 17:15   ` Jason Gunthorpe
2025-04-24  6:55     ` Leon Romanovsky
2025-04-26  0:52   ` Luis Chamberlain
2025-04-27  7:54     ` Leon Romanovsky
2025-04-23  8:12 ` [PATCH v9 04/24] iommu: add kernel-doc for iommu_unmap_fast Leon Romanovsky
2025-04-23 17:15   ` Jason Gunthorpe
2025-04-26  0:55   ` Luis Chamberlain
2025-04-23  8:12 ` [PATCH v9 05/24] dma-mapping: Provide an interface to allow allocate IOVA Leon Romanovsky
2025-04-26  1:10   ` Luis Chamberlain
2025-04-23  8:12 ` [PATCH v9 06/24] iommu/dma: Factor out a iommu_dma_map_swiotlb helper Leon Romanovsky
2025-04-26  1:14   ` Luis Chamberlain
2025-04-23  8:12 ` [PATCH v9 07/24] dma-mapping: Implement link/unlink ranges API Leon Romanovsky
2025-04-26 22:46   ` Luis Chamberlain
2025-04-27  8:13     ` Leon Romanovsky
2025-04-28 13:16       ` Jason Gunthorpe
2025-04-28 13:20       ` Christoph Hellwig
2025-04-23  8:12 ` [PATCH v9 08/24] dma-mapping: add a dma_need_unmap helper Leon Romanovsky
2025-04-26 22:49   ` Luis Chamberlain
2025-04-23  8:13 ` [PATCH v9 09/24] docs: core-api: document the IOVA-based API Leon Romanovsky
2025-04-23  8:13 ` [PATCH v9 10/24] mm/hmm: let users to tag specific PFN with DMA mapped bit Leon Romanovsky
2025-04-23 17:17   ` Jason Gunthorpe
2025-04-23 17:54   ` Mika Penttilä
2025-04-23 18:17     ` Jason Gunthorpe
2025-04-23 18:37       ` Mika Penttilä
2025-04-23 23:33         ` Jason Gunthorpe
2025-04-24  8:07           ` Leon Romanovsky
2025-04-24  8:11             ` Christoph Hellwig
2025-04-24  8:46               ` Leon Romanovsky
2025-04-24 12:07                 ` Jason Gunthorpe
2025-04-24 12:50                   ` Leon Romanovsky
2025-04-24 16:01                     ` Leon Romanovsky
2025-04-23  8:13 ` [PATCH v9 11/24] mm/hmm: provide generic DMA managing logic Leon Romanovsky
2025-04-23 17:28   ` Jason Gunthorpe
2025-04-24  7:15     ` Leon Romanovsky
2025-04-24  7:22       ` Leon Romanovsky
2025-04-23  8:13 ` [PATCH v9 12/24] RDMA/umem: Store ODP access mask information in PFN Leon Romanovsky
2025-04-23 17:34   ` Jason Gunthorpe
2025-04-23  8:13 ` [PATCH v9 13/24] RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage Leon Romanovsky
2025-04-23 17:36   ` Jason Gunthorpe
2025-04-23  8:13 ` [PATCH v9 14/24] RDMA/umem: Separate implicit ODP initialization from explicit ODP Leon Romanovsky
2025-04-23 17:38   ` Jason Gunthorpe
2025-04-23  8:13 ` [PATCH v9 15/24] vfio/mlx5: Explicitly use number of pages instead of allocated length Leon Romanovsky
2025-04-23 17:39   ` Jason Gunthorpe
2025-04-23  8:13 ` [PATCH v9 16/24] vfio/mlx5: Rewrite create mkey flow to allow better code reuse Leon Romanovsky
2025-04-23 18:02   ` Jason Gunthorpe
2025-04-23  8:13 ` [PATCH v9 17/24] vfio/mlx5: Enable the DMA link API Leon Romanovsky
2025-04-23 18:09   ` Jason Gunthorpe
2025-04-24  7:55     ` Leon Romanovsky
2025-04-23  8:13 ` [PATCH v9 18/24] block: share more code for bio addition helper Leon Romanovsky
2025-04-23  8:13 ` [PATCH v9 19/24] block: don't merge different kinds of P2P transfers in a single bio Leon Romanovsky
2025-04-23  8:13 ` [PATCH v9 20/24] blk-mq: add scatterlist-less DMA mapping helpers Leon Romanovsky
2025-04-23  8:13 ` [PATCH v9 21/24] nvme-pci: remove struct nvme_descriptor Leon Romanovsky
2025-04-23  8:13 ` [PATCH v9 22/24] nvme-pci: use a better encoding for small prp pool allocations Leon Romanovsky
2025-04-23  9:05   ` Christoph Hellwig
2025-04-23 13:39     ` Leon Romanovsky
2025-04-23  8:13 ` [PATCH v9 23/24] nvme-pci: convert to blk_rq_dma_map Leon Romanovsky
2025-04-23  9:24   ` Christoph Hellwig
2025-04-23 10:03     ` Leon Romanovsky
2025-04-23 15:47       ` Christoph Hellwig
2025-04-23 17:00         ` Jason Gunthorpe
2025-04-23 15:05     ` Keith Busch
2025-04-27  7:10     ` Leon Romanovsky
2025-04-23 14:58   ` Keith Busch
2025-04-23 17:11     ` Leon Romanovsky
2025-04-23  8:13 ` [PATCH v9 24/24] nvme-pci: store aborted state in flags variable Leon Romanovsky
