linux-pci.vger.kernel.org archive mirror
* [PATCH v1 00/17] Provide a new two step DMA mapping API
@ 2024-10-30 15:12 Leon Romanovsky
  2024-10-30 15:12 ` [PATCH v1 01/17] PCI/P2PDMA: Refactor the p2pdma mapping helpers Leon Romanovsky
                   ` (19 more replies)
  0 siblings, 20 replies; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-30 15:12 UTC (permalink / raw)
  To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm

Changelog:
v1: 
 * Squashed two VFIO patches into one
 * Added Acked-by/Reviewed-by tags
 * Fixed docs spelling errors
 * Simplified dma_iova_sync() API
 * Added an extra check of the mapped size in dma_iova_destroy() to make the
   code clearer
 * Fixed checkpatch warnings in p2p patch
 * Changed implementation of VFIO mlx5 mlx5vf_add_migration_pages() to
   be more general
 * Reduced the number of changes in VFIO patch
v0: https://lore.kernel.org/all/cover.1730037276.git.leon@kernel.org

----------------------------------------------------------------------------
The code can be downloaded from:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git tag:dma-split-oct-30

----------------------------------------------------------------------------
Currently the only efficient way to map a complex memory description through
the DMA API is by using the scatterlist APIs. The SG APIs are unique in that
they efficiently combine the two fundamental operations of sizing and allocating
a large IOVA window from the IOMMU and processing all the per-address
swiotlb/flushing/p2p/map details.

This uniqueness has been a long-standing pain point: the scatterlist API is
mandatory, but expensive to use. It blocks any kind of optimization or feature
improvement (such as avoiding struct page for P2P) because the scatterlist
itself cannot be improved.

Several approaches have been explored to expand the DMA API with additional
scatterlist-like structures (BIO, rlist); this series instead splits up the
DMA API to allow callers to bring their own data structure.

The API is split into two parts (a rough usage sketch follows the list):
 - Allocate IOVA space:
    Do any pre-allocation required. This is done based on the caller
    supplying some details about how much IOMMU address space it would
    need in the worst case.
 - Map and unmap relevant structures to pre-allocated IOVA space:
    Perform the actual mapping into the pre-allocated IOVA. This is very
    similar to dma_map_page().
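
For illustration, a map path built on these two steps might look roughly like
the sketch below. The function, its page array, the fallback policy and the
DMA_TO_DEVICE direction are made up for this example; only the dma_iova_*()
calls are the ones added later in this series:

static int two_step_map_example(struct device *dev, struct page **pages,
                unsigned int nr_pages, struct dma_iova_state *state)
{
        size_t off = 0;
        unsigned int i;
        int ret;

        /* Step 1: size and allocate the IOVA window once, up front. */
        if (!dma_iova_try_alloc(dev, state, 0, (size_t)nr_pages * PAGE_SIZE))
                return -EOPNOTSUPP;     /* caller falls back to dma_map_page() */

        /* Step 2: link each page into the pre-allocated IOVA window. */
        for (i = 0; i < nr_pages; i++) {
                ret = dma_iova_link(dev, state, page_to_phys(pages[i]), off,
                                PAGE_SIZE, DMA_TO_DEVICE, 0);
                if (ret)
                        goto err_destroy;
                off += PAGE_SIZE;
        }

        /* One IOTLB sync for the whole transaction. */
        ret = dma_iova_sync(dev, state, 0, off);
        if (ret)
                goto err_destroy;
        return 0;

err_destroy:
        /* Unlinks everything linked so far and frees the IOVA space. */
        dma_iova_destroy(dev, state, off, DMA_TO_DEVICE, 0);
        return ret;
}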

In this and the next series [1], examples of three different users are converted
to the new API to show the benefits and its versatility. Each user has a unique
flow:
 1. RDMA ODP is an example of "SVA mirroring" using HMM that needs to
    dynamically map/unmap large numbers of single pages. This becomes
    significantly faster in the IOMMU case as each map/unmap is now just
    a page table walk; the IOVA allocation is computed once up front.
    Significant amounts of memory are saved as there is no longer a need
    to store the dma_addr_t of each page.
 2. VFIO PCI live migration code builds a very large "page list"
    for the device. Instead of allocating a scatterlist entry per allocated
    page it can just allocate an array of 'struct page *', saving a large
    amount of memory.
 3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
    list without having to allocate and then populate an intermediate SG
    table.

To make the new API easier to use, the HMM and block subsystems are extended
to hide the optimization details from the caller. Among these optimizations:
 * Reduced memory usage: in most real use cases there is no need to store the
   mapped DMA addresses just to unmap them later.
 * Reduced function call overhead: indirect calls through function pointers
   are replaced with direct calls.

This is the first step along a path to provide alternatives to scatterlist
and to solve some of its abuses and design mistakes, for instance in DMABUF's
P2P support.

Thanks

[1] This still points to v0, as the change is just around handling dma_iova_sync():
https://lore.kernel.org/all/cover.1730037261.git.leon@kernel.org

Christoph Hellwig (6):
  PCI/P2PDMA: Refactor the p2pdma mapping helpers
  dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
  iommu: generalize the batched sync after map interface
  iommu/dma: Factor out a iommu_dma_map_swiotlb helper
  dma-mapping: add a dma_need_unmap helper
  docs: core-api: document the IOVA-based API

Leon Romanovsky (11):
  dma-mapping: Add check if IOVA can be used
  dma: Provide an interface to allow allocate IOVA
  dma-mapping: Implement link/unlink ranges API
  mm/hmm: let users to tag specific PFN with DMA mapped bit
  mm/hmm: provide generic DMA managing logic
  RDMA/umem: Store ODP access mask information in PFN
  RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page
    linkage
  RDMA/umem: Separate implicit ODP initialization from explicit ODP
  vfio/mlx5: Explicitly use number of pages instead of allocated length
  vfio/mlx5: Rewrite create mkey flow to allow better code reuse
  vfio/mlx5: Convert vfio to use DMA link API

 Documentation/core-api/dma-api.rst   |  70 ++++
 drivers/infiniband/core/umem_odp.c   | 250 +++++----------
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  12 +-
 drivers/infiniband/hw/mlx5/odp.c     |  65 ++--
 drivers/infiniband/hw/mlx5/umr.c     |  12 +-
 drivers/iommu/dma-iommu.c            | 459 +++++++++++++++++++++++----
 drivers/iommu/iommu.c                |  65 ++--
 drivers/pci/p2pdma.c                 |  38 +--
 drivers/vfio/pci/mlx5/cmd.c          | 373 +++++++++++-----------
 drivers/vfio/pci/mlx5/cmd.h          |  35 +-
 drivers/vfio/pci/mlx5/main.c         |  87 +++--
 include/linux/dma-map-ops.h          |  54 ----
 include/linux/dma-mapping.h          |  85 +++++
 include/linux/hmm-dma.h              |  32 ++
 include/linux/hmm.h                  |  16 +
 include/linux/iommu.h                |   4 +
 include/linux/pci-p2pdma.h           |  84 +++++
 include/rdma/ib_umem_odp.h           |  25 +-
 kernel/dma/direct.c                  |  44 +--
 kernel/dma/mapping.c                 |  20 ++
 mm/hmm.c                             | 231 +++++++++++++-
 21 files changed, 1377 insertions(+), 684 deletions(-)
 create mode 100644 include/linux/hmm-dma.h

-- 
2.46.2



* [PATCH v1 01/17] PCI/P2PDMA: Refactor the p2pdma mapping helpers
  2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
@ 2024-10-30 15:12 ` Leon Romanovsky
  2024-10-30 15:12 ` [PATCH v1 02/17] dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h Leon Romanovsky
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-30 15:12 UTC (permalink / raw)
  To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm

From: Christoph Hellwig <hch@lst.de>

The current scheme, with a single helper that both determines the P2P status
and maps a scatterlist segment, forces users to always use the map_sg
helper to DMA map, which we're trying to get away from because scatterlists
are very cache inefficient.

Refactor the code so that there is one helper that checks the P2P state
for a page (including the case where it is not a P2P page at all, to
simplify the callers), and a second helper that performs the address
translation for a bus-mapped P2P transfer without depending on the
scatterlist structure.
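
As an illustrative sketch only (map_one_page() is a hypothetical caller
mapping a full page, not part of this patch), a non-scatterlist user of the
two new helpers could look like this, with the state zero-initialized once
and reused across the caller's loop:

static dma_addr_t map_one_page(struct device *dev, struct page *page,
                struct pci_p2pdma_map_state *state,
                enum dma_data_direction dir)
{
        switch (pci_p2pdma_state(state, dev, page)) {
        case PCI_P2PDMA_MAP_NONE:
        case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
                /* System memory, or P2P routed through an allowed host
                 * bridge: use a regular DMA mapping. */
                return dma_map_page(dev, page, 0, PAGE_SIZE, dir);
        case PCI_P2PDMA_MAP_BUS_ADDR:
                /* Direct P2P through a switch: program the device with the
                 * PCI bus address, no IOMMU mapping needed. */
                return pci_p2pdma_bus_addr_map(state, page_to_phys(page));
        default:
                return DMA_MAPPING_ERROR;
        }
}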

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/iommu/dma-iommu.c   | 47 +++++++++++++++++-----------------
 drivers/pci/p2pdma.c        | 38 ++++-----------------------
 include/linux/dma-map-ops.h | 51 +++++++++++++++++++++++++++++--------
 kernel/dma/direct.c         | 43 +++++++++++++++----------------
 4 files changed, 91 insertions(+), 88 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 2a9fa0c8cc00..5746ffaf0061 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1382,7 +1382,6 @@ int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
 	struct scatterlist *s, *prev = NULL;
 	int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
 	struct pci_p2pdma_map_state p2pdma_state = {};
-	enum pci_p2pdma_map_type map;
 	dma_addr_t iova;
 	size_t iova_len = 0;
 	unsigned long mask = dma_get_seg_boundary(dev);
@@ -1412,28 +1411,30 @@ int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
 		size_t s_length = s->length;
 		size_t pad_len = (mask - iova_len + 1) & mask;
 
-		if (is_pci_p2pdma_page(sg_page(s))) {
-			map = pci_p2pdma_map_segment(&p2pdma_state, dev, s);
-			switch (map) {
-			case PCI_P2PDMA_MAP_BUS_ADDR:
-				/*
-				 * iommu_map_sg() will skip this segment as
-				 * it is marked as a bus address,
-				 * __finalise_sg() will copy the dma address
-				 * into the output segment.
-				 */
-				continue;
-			case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
-				/*
-				 * Mapping through host bridge should be
-				 * mapped with regular IOVAs, thus we
-				 * do nothing here and continue below.
-				 */
-				break;
-			default:
-				ret = -EREMOTEIO;
-				goto out_restore_sg;
-			}
+		switch (pci_p2pdma_state(&p2pdma_state, dev, sg_page(s))) {
+		case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+			/*
+			 * Mapping through host bridge should be mapped with
+			 * regular IOVAs, thus we do nothing here and continue
+			 * below.
+			 */
+			break;
+		case PCI_P2PDMA_MAP_NONE:
+			break;
+		case PCI_P2PDMA_MAP_BUS_ADDR:
+			/*
+			 * iommu_map_sg() will skip this segment as it is marked
+			 * as a bus address, __finalise_sg() will copy the dma
+			 * address into the output segment.
+			 */
+			s->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
+						sg_phys(s));
+			sg_dma_len(s) = s->length;
+			sg_dma_mark_bus_address(s);
+			continue;
+		default:
+			ret = -EREMOTEIO;
+			goto out_restore_sg;
 		}
 
 		sg_dma_address(s) = s_iova_off;
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4f47a13cb500..f38d16d71dd5 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -995,40 +995,12 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 	return type;
 }
 
-/**
- * pci_p2pdma_map_segment - map an sg segment determining the mapping type
- * @state: State structure that should be declared outside of the for_each_sg()
- *	loop and initialized to zero.
- * @dev: DMA device that's doing the mapping operation
- * @sg: scatterlist segment to map
- *
- * This is a helper to be used by non-IOMMU dma_map_sg() implementations where
- * the sg segment is the same for the page_link and the dma_address.
- *
- * Attempt to map a single segment in an SGL with the PCI bus address.
- * The segment must point to a PCI P2PDMA page and thus must be
- * wrapped in a is_pci_p2pdma_page(sg_page(sg)) check.
- *
- * Returns the type of mapping used and maps the page if the type is
- * PCI_P2PDMA_MAP_BUS_ADDR.
- */
-enum pci_p2pdma_map_type
-pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
-		       struct scatterlist *sg)
+void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
+		struct device *dev, struct page *page)
 {
-	if (state->pgmap != sg_page(sg)->pgmap) {
-		state->pgmap = sg_page(sg)->pgmap;
-		state->map = pci_p2pdma_map_type(state->pgmap, dev);
-		state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
-	}
-
-	if (state->map == PCI_P2PDMA_MAP_BUS_ADDR) {
-		sg->dma_address = sg_phys(sg) + state->bus_off;
-		sg_dma_len(sg) = sg->length;
-		sg_dma_mark_bus_address(sg);
-	}
-
-	return state->map;
+	state->pgmap = page->pgmap;
+	state->map = pci_p2pdma_map_type(state->pgmap, dev);
+	state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
 }
 
 /**
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index b7773201414c..3480a28d1b9f 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -443,6 +443,11 @@ enum pci_p2pdma_map_type {
 	 */
 	PCI_P2PDMA_MAP_UNKNOWN = 0,
 
+	/*
+	 * Not a PCI P2PDMA transfer.
+	 */
+	PCI_P2PDMA_MAP_NONE,
+
 	/*
 	 * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
 	 * traverse the host bridge and the host bridge is not in the
@@ -471,21 +476,47 @@ enum pci_p2pdma_map_type {
 
 struct pci_p2pdma_map_state {
 	struct dev_pagemap *pgmap;
-	int map;
+	enum pci_p2pdma_map_type map;
 	u64 bus_off;
 };
 
-#ifdef CONFIG_PCI_P2PDMA
-enum pci_p2pdma_map_type
-pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
-		       struct scatterlist *sg);
-#else /* CONFIG_PCI_P2PDMA */
+/* helper for pci_p2pdma_state(), do not use directly */
+void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
+		struct device *dev, struct page *page);
+
+/**
+ * pci_p2pdma_state - check the P2P transfer state of a page
+ * @state:	P2P state structure
+ * @dev:	device to transfer to/from
+ * @page:	page to map
+ *
+ * Check if @page is a PCI P2PDMA page, and if yes of what kind.  Returns the
+ * map type, and updates @state with all information needed for a P2P transfer.
+ */
 static inline enum pci_p2pdma_map_type
-pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
-		       struct scatterlist *sg)
+pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
+		struct page *page)
+{
+	if (IS_ENABLED(CONFIG_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
+		if (state->pgmap != page->pgmap)
+			__pci_p2pdma_update_state(state, dev, page);
+		return state->map;
+	}
+	return PCI_P2PDMA_MAP_NONE;
+}
+
+/**
+ * pci_p2pdma_bus_addr_map - map a PCI_P2PDMA_MAP_BUS_ADDR P2P transfer
+ * @state:	P2P state structure
+ * @paddr:	physical address to map
+ *
+ * Map a physically contiguous PCI_P2PDMA_MAP_BUS_ADDR transfer.
+ */
+static inline dma_addr_t
+pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
 {
-	return PCI_P2PDMA_MAP_NOT_SUPPORTED;
+	WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
+	return paddr + state->bus_off;
 }
-#endif /* CONFIG_PCI_P2PDMA */
 
 #endif /* _LINUX_DMA_MAP_OPS_H */
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 5b4e6d3bf7bc..e289ad27d1b5 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -462,34 +462,33 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
 		enum dma_data_direction dir, unsigned long attrs)
 {
 	struct pci_p2pdma_map_state p2pdma_state = {};
-	enum pci_p2pdma_map_type map;
 	struct scatterlist *sg;
 	int i, ret;
 
 	for_each_sg(sgl, sg, nents, i) {
-		if (is_pci_p2pdma_page(sg_page(sg))) {
-			map = pci_p2pdma_map_segment(&p2pdma_state, dev, sg);
-			switch (map) {
-			case PCI_P2PDMA_MAP_BUS_ADDR:
-				continue;
-			case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
-				/*
-				 * Any P2P mapping that traverses the PCI
-				 * host bridge must be mapped with CPU physical
-				 * address and not PCI bus addresses. This is
-				 * done with dma_direct_map_page() below.
-				 */
-				break;
-			default:
-				ret = -EREMOTEIO;
+		switch (pci_p2pdma_state(&p2pdma_state, dev, sg_page(sg))) {
+		case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+			/*
+			 * Any P2P mapping that traverses the PCI host bridge
+			 * must be mapped with CPU physical address and not PCI
+			 * bus addresses.
+			 */
+			break;
+		case PCI_P2PDMA_MAP_NONE:
+			sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
+					sg->offset, sg->length, dir, attrs);
+			if (sg->dma_address == DMA_MAPPING_ERROR) {
+				ret = -EIO;
 				goto out_unmap;
 			}
-		}
-
-		sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
-				sg->offset, sg->length, dir, attrs);
-		if (sg->dma_address == DMA_MAPPING_ERROR) {
-			ret = -EIO;
+			break;
+		case PCI_P2PDMA_MAP_BUS_ADDR:
+			sg->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
+					sg_phys(sg));
+			sg_dma_mark_bus_address(sg);
+			continue;
+		default:
+			ret = -EREMOTEIO;
 			goto out_unmap;
 		}
 		sg_dma_len(sg) = sg->length;
-- 
2.46.2



* [PATCH v1 02/17] dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
  2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
  2024-10-30 15:12 ` [PATCH v1 01/17] PCI/P2PDMA: Refactor the p2pdma mapping helpers Leon Romanovsky
@ 2024-10-30 15:12 ` Leon Romanovsky
  2024-10-30 15:12 ` [PATCH v1 03/17] iommu: generalize the batched sync after map interface Leon Romanovsky
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-30 15:12 UTC (permalink / raw)
  To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm

From: Christoph Hellwig <hch@lst.de>

To support the upcoming non-scatterlist mapping helpers, we need these
helpers to be callable from outside of the DMA API core.  Thus move them out
of dma-map-ops.h, which is only for DMA API implementations, to pci-p2pdma.h,
which is for driver use.

Note that the core helper is still not exported, as the mapping is expected
to be done only by very high-level subsystem code, at least for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/iommu/dma-iommu.c   |  1 +
 include/linux/dma-map-ops.h | 85 -------------------------------------
 include/linux/pci-p2pdma.h  | 84 ++++++++++++++++++++++++++++++++++++
 kernel/dma/direct.c         |  1 +
 4 files changed, 86 insertions(+), 85 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 5746ffaf0061..853247c42f7d 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -26,6 +26,7 @@
 #include <linux/mutex.h>
 #include <linux/of_iommu.h>
 #include <linux/pci.h>
+#include <linux/pci-p2pdma.h>
 #include <linux/scatterlist.h>
 #include <linux/spinlock.h>
 #include <linux/swiotlb.h>
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 3480a28d1b9f..dced37816ede 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -434,89 +434,4 @@ static inline void debug_dma_dump_mappings(struct device *dev)
 #endif /* CONFIG_DMA_API_DEBUG */
 
 extern const struct dma_map_ops dma_dummy_ops;
-
-enum pci_p2pdma_map_type {
-	/*
-	 * PCI_P2PDMA_MAP_UNKNOWN: Used internally for indicating the mapping
-	 * type hasn't been calculated yet. Functions that return this enum
-	 * never return this value.
-	 */
-	PCI_P2PDMA_MAP_UNKNOWN = 0,
-
-	/*
-	 * Not a PCI P2PDMA transfer.
-	 */
-	PCI_P2PDMA_MAP_NONE,
-
-	/*
-	 * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
-	 * traverse the host bridge and the host bridge is not in the
-	 * allowlist. DMA Mapping routines should return an error when
-	 * this is returned.
-	 */
-	PCI_P2PDMA_MAP_NOT_SUPPORTED,
-
-	/*
-	 * PCI_P2PDMA_BUS_ADDR: Indicates that two devices can talk to
-	 * each other directly through a PCI switch and the transaction will
-	 * not traverse the host bridge. Such a mapping should program
-	 * the DMA engine with PCI bus addresses.
-	 */
-	PCI_P2PDMA_MAP_BUS_ADDR,
-
-	/*
-	 * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
-	 * to each other, but the transaction traverses a host bridge on the
-	 * allowlist. In this case, a normal mapping either with CPU physical
-	 * addresses (in the case of dma-direct) or IOVA addresses (in the
-	 * case of IOMMUs) should be used to program the DMA engine.
-	 */
-	PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
-};
-
-struct pci_p2pdma_map_state {
-	struct dev_pagemap *pgmap;
-	enum pci_p2pdma_map_type map;
-	u64 bus_off;
-};
-
-/* helper for pci_p2pdma_state(), do not use directly */
-void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
-		struct device *dev, struct page *page);
-
-/**
- * pci_p2pdma_state - check the P2P transfer state of a page
- * @state:	P2P state structure
- * @dev:	device to transfer to/from
- * @page:	page to map
- *
- * Check if @page is a PCI P2PDMA page, and if yes of what kind.  Returns the
- * map type, and updates @state with all information needed for a P2P transfer.
- */
-static inline enum pci_p2pdma_map_type
-pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
-		struct page *page)
-{
-	if (IS_ENABLED(CONFIG_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
-		if (state->pgmap != page->pgmap)
-			__pci_p2pdma_update_state(state, dev, page);
-		return state->map;
-	}
-	return PCI_P2PDMA_MAP_NONE;
-}
-
-/**
- * pci_p2pdma_bus_addr_map - map a PCI_P2PDMA_MAP_BUS_ADDR P2P transfer
- * @state:	P2P state structure
- * @paddr:	physical address to map
- *
- * Map a physically contiguous PCI_P2PDMA_MAP_BUS_ADDR transfer.
- */
-static inline dma_addr_t
-pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
-{
-	WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
-	return paddr + state->bus_off;
-}
-
 #endif /* _LINUX_DMA_MAP_OPS_H */
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 2c07aa6b7665..e839f52b512b 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -104,4 +104,88 @@ static inline struct pci_dev *pci_p2pmem_find(struct device *client)
 	return pci_p2pmem_find_many(&client, 1);
 }
 
+enum pci_p2pdma_map_type {
+	/*
+	 * PCI_P2PDMA_MAP_UNKNOWN: Used internally for indicating the mapping
+	 * type hasn't been calculated yet. Functions that return this enum
+	 * never return this value.
+	 */
+	PCI_P2PDMA_MAP_UNKNOWN = 0,
+
+	/*
+	 * Not a PCI P2PDMA transfer.
+	 */
+	PCI_P2PDMA_MAP_NONE,
+
+	/*
+	 * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
+	 * traverse the host bridge and the host bridge is not in the
+	 * allowlist. DMA Mapping routines should return an error when
+	 * this is returned.
+	 */
+	PCI_P2PDMA_MAP_NOT_SUPPORTED,
+
+	/*
+	 * PCI_P2PDMA_BUS_ADDR: Indicates that two devices can talk to
+	 * each other directly through a PCI switch and the transaction will
+	 * not traverse the host bridge. Such a mapping should program
+	 * the DMA engine with PCI bus addresses.
+	 */
+	PCI_P2PDMA_MAP_BUS_ADDR,
+
+	/*
+	 * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
+	 * to each other, but the transaction traverses a host bridge on the
+	 * allowlist. In this case, a normal mapping either with CPU physical
+	 * addresses (in the case of dma-direct) or IOVA addresses (in the
+	 * case of IOMMUs) should be used to program the DMA engine.
+	 */
+	PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
+};
+
+struct pci_p2pdma_map_state {
+	struct dev_pagemap *pgmap;
+	enum pci_p2pdma_map_type map;
+	u64 bus_off;
+};
+
+/* helper for pci_p2pdma_state(), do not use directly */
+void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
+		struct device *dev, struct page *page);
+
+/**
+ * pci_p2pdma_state - check the P2P transfer state of a page
+ * @state:	P2P state structure
+ * @dev:	device to transfer to/from
+ * @page:	page to map
+ *
+ * Check if @page is a PCI P2PDMA page, and if yes of what kind.  Returns the
+ * map type, and updates @state with all information needed for a P2P transfer.
+ */
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
+		struct page *page)
+{
+	if (IS_ENABLED(CONFIG_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
+		if (state->pgmap != page->pgmap)
+			__pci_p2pdma_update_state(state, dev, page);
+		return state->map;
+	}
+	return PCI_P2PDMA_MAP_NONE;
+}
+
+/**
+ * pci_p2pdma_bus_addr_map - map a PCI_P2PDMA_MAP_BUS_ADDR P2P transfer
+ * @state:	P2P state structure
+ * @paddr:	physical address to map
+ *
+ * Map a physically contiguous PCI_P2PDMA_MAP_BUS_ADDR transfer.
+ */
+static inline dma_addr_t
+pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
+{
+	WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
+	return paddr + state->bus_off;
+}
+
 #endif /* _LINUX_PCI_P2P_H */
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index e289ad27d1b5..c9b3893257d4 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -13,6 +13,7 @@
 #include <linux/vmalloc.h>
 #include <linux/set_memory.h>
 #include <linux/slab.h>
+#include <linux/pci-p2pdma.h>
 #include "direct.h"
 
 /*
-- 
2.46.2



* [PATCH v1 03/17] iommu: generalize the batched sync after map interface
  2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
  2024-10-30 15:12 ` [PATCH v1 01/17] PCI/P2PDMA: Refactor the p2pdma mapping helpers Leon Romanovsky
  2024-10-30 15:12 ` [PATCH v1 02/17] dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h Leon Romanovsky
@ 2024-10-30 15:12 ` Leon Romanovsky
  2024-10-30 15:12 ` [PATCH v1 04/17] dma-mapping: Add check if IOVA can be used Leon Romanovsky
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-30 15:12 UTC (permalink / raw)
  To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm

From: Christoph Hellwig <hch@lst.de>

For the upcoming IOVA-based DMA API we want an interface to batch the sync
after mapping multiple entries from dma-iommu without having a scatterlist.

For that, move more sanity checks from the callers into __iommu_map and
make that function available outside of iommu.c as iommu_map_nosync.

Add a wrapper around the iotlb_sync_map method as iommu_sync_map so that
callers don't need to poke into the methods directly.
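
As a sketch of the intended usage (illustrative only; map_phys_batch() and
its arguments are made up and not part of this patch), a batched caller maps
all entries without syncing and issues a single sync at the end:

static int map_phys_batch(struct iommu_domain *domain, unsigned long iova,
                phys_addr_t *phys, unsigned int n, size_t len, int prot)
{
        size_t mapped = 0;
        unsigned int i;
        int ret;

        for (i = 0; i < n; i++) {
                ret = iommu_map_nosync(domain, iova + mapped, phys[i], len,
                                prot, GFP_KERNEL);
                if (ret)
                        goto out_unmap;
                mapped += len;
        }

        /* One IOTLB sync for the whole batch instead of one per mapping. */
        ret = iommu_sync_map(domain, iova, mapped);
        if (ret)
                goto out_unmap;
        return 0;

out_unmap:
        if (mapped)
                iommu_unmap(domain, iova, mapped);
        return ret;
}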

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/iommu/iommu.c | 65 +++++++++++++++++++------------------------
 include/linux/iommu.h |  4 +++
 2 files changed, 33 insertions(+), 36 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 83c8e617a2c5..6b0943397e1e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2439,8 +2439,8 @@ static size_t iommu_pgsize(struct iommu_domain *domain, unsigned long iova,
 	return pgsize;
 }
 
-static int __iommu_map(struct iommu_domain *domain, unsigned long iova,
-		       phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
+int iommu_map_nosync(struct iommu_domain *domain, unsigned long iova,
+		phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
 {
 	const struct iommu_domain_ops *ops = domain->ops;
 	unsigned long orig_iova = iova;
@@ -2449,12 +2449,19 @@ static int __iommu_map(struct iommu_domain *domain, unsigned long iova,
 	phys_addr_t orig_paddr = paddr;
 	int ret = 0;
 
+	might_sleep_if(gfpflags_allow_blocking(gfp));
+
 	if (unlikely(!(domain->type & __IOMMU_DOMAIN_PAGING)))
 		return -EINVAL;
 
 	if (WARN_ON(!ops->map_pages || domain->pgsize_bitmap == 0UL))
 		return -ENODEV;
 
+	/* Discourage passing strange GFP flags */
+	if (WARN_ON_ONCE(gfp & (__GFP_COMP | __GFP_DMA | __GFP_DMA32 |
+				__GFP_HIGHMEM)))
+		return -EINVAL;
+
 	/* find out the minimum page size supported */
 	min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
 
@@ -2502,31 +2509,27 @@ static int __iommu_map(struct iommu_domain *domain, unsigned long iova,
 	return ret;
 }
 
-int iommu_map(struct iommu_domain *domain, unsigned long iova,
-	      phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
+int iommu_sync_map(struct iommu_domain *domain, unsigned long iova, size_t size)
 {
 	const struct iommu_domain_ops *ops = domain->ops;
-	int ret;
-
-	might_sleep_if(gfpflags_allow_blocking(gfp));
 
-	/* Discourage passing strange GFP flags */
-	if (WARN_ON_ONCE(gfp & (__GFP_COMP | __GFP_DMA | __GFP_DMA32 |
-				__GFP_HIGHMEM)))
-		return -EINVAL;
+	if (!ops->iotlb_sync_map)
+		return 0;
+	return ops->iotlb_sync_map(domain, iova, size);
+}
 
-	ret = __iommu_map(domain, iova, paddr, size, prot, gfp);
-	if (ret == 0 && ops->iotlb_sync_map) {
-		ret = ops->iotlb_sync_map(domain, iova, size);
-		if (ret)
-			goto out_err;
-	}
+int iommu_map(struct iommu_domain *domain, unsigned long iova,
+	      phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
+{
+	int ret;
 
-	return ret;
+	ret = iommu_map_nosync(domain, iova, paddr, size, prot, gfp);
+	if (ret)
+		return ret;
 
-out_err:
-	/* undo mappings already done */
-	iommu_unmap(domain, iova, size);
+	ret = iommu_sync_map(domain, iova, size);
+	if (ret)
+		iommu_unmap(domain, iova, size);
 
 	return ret;
 }
@@ -2612,26 +2615,17 @@ ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 		     struct scatterlist *sg, unsigned int nents, int prot,
 		     gfp_t gfp)
 {
-	const struct iommu_domain_ops *ops = domain->ops;
 	size_t len = 0, mapped = 0;
 	phys_addr_t start;
 	unsigned int i = 0;
 	int ret;
 
-	might_sleep_if(gfpflags_allow_blocking(gfp));
-
-	/* Discourage passing strange GFP flags */
-	if (WARN_ON_ONCE(gfp & (__GFP_COMP | __GFP_DMA | __GFP_DMA32 |
-				__GFP_HIGHMEM)))
-		return -EINVAL;
-
 	while (i <= nents) {
 		phys_addr_t s_phys = sg_phys(sg);
 
 		if (len && s_phys != start + len) {
-			ret = __iommu_map(domain, iova + mapped, start,
+			ret = iommu_map_nosync(domain, iova + mapped, start,
 					len, prot, gfp);
-
 			if (ret)
 				goto out_err;
 
@@ -2654,11 +2648,10 @@ ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
 			sg = sg_next(sg);
 	}
 
-	if (ops->iotlb_sync_map) {
-		ret = ops->iotlb_sync_map(domain, iova, mapped);
-		if (ret)
-			goto out_err;
-	}
+	ret = iommu_sync_map(domain, iova, mapped);
+	if (ret)
+		goto out_err;
+
 	return mapped;
 
 out_err:
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index bd722f473635..8927e5f996c2 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -799,6 +799,10 @@ extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
 extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
 		     phys_addr_t paddr, size_t size, int prot, gfp_t gfp);
+int iommu_map_nosync(struct iommu_domain *domain, unsigned long iova,
+		phys_addr_t paddr, size_t size, int prot, gfp_t gfp);
+int iommu_sync_map(struct iommu_domain *domain, unsigned long iova,
+		size_t size);
 extern size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova,
 			  size_t size);
 extern size_t iommu_unmap_fast(struct iommu_domain *domain,
-- 
2.46.2



* [PATCH v1 04/17] dma-mapping: Add check if IOVA can be used
  2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (2 preceding siblings ...)
  2024-10-30 15:12 ` [PATCH v1 03/17] iommu: generalize the batched sync after map interface Leon Romanovsky
@ 2024-10-30 15:12 ` Leon Romanovsky
  2024-11-10 15:09   ` Zhu Yanjun
  2024-10-30 15:12 ` [PATCH v1 05/17] dma: Provide an interface to allow allocate IOVA Leon Romanovsky
                   ` (15 subsequent siblings)
  19 siblings, 1 reply; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-30 15:12 UTC (permalink / raw)
  To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Leon Romanovsky, Keith Busch, Bjorn Helgaas, Logan Gunthorpe,
	Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
	Marek Szyprowski, Jérôme Glisse, Andrew Morton,
	Jonathan Corbet, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm

From: Leon Romanovsky <leonro@nvidia.com>

This patch adds a check whether IOVA can be used for a specific
transaction.

In the new API a DMA mapping transaction is identified by a
struct dma_iova_state, which holds some precomputed information
for the transaction that does not change for each page being
mapped.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/dma-mapping.h | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 1524da363734..6075e0708deb 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -76,6 +76,20 @@
 
 #define DMA_BIT_MASK(n)	(((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
 
+struct dma_iova_state {
+	size_t __size;
+};
+
+/*
+ * Use the high bit to mark if we used swiotlb for one or more ranges.
+ */
+#define DMA_IOVA_USE_SWIOTLB		(1ULL << 63)
+
+static inline size_t dma_iova_size(struct dma_iova_state *state)
+{
+	return state->__size & ~DMA_IOVA_USE_SWIOTLB;
+}
+
 #ifdef CONFIG_DMA_API_DEBUG
 void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr);
 void debug_dma_map_single(struct device *dev, const void *addr,
@@ -281,6 +295,25 @@ static inline int dma_mmap_noncontiguous(struct device *dev,
 }
 #endif /* CONFIG_HAS_DMA */
 
+#ifdef CONFIG_IOMMU_DMA
+/**
+ * dma_use_iova - check if the IOVA API is used for this state
+ * @state: IOVA state
+ *
+ * Return %true if the DMA transfer uses the dma_iova_*() calls or %false if
+ * they can't be used.
+ */
+static inline bool dma_use_iova(struct dma_iova_state *state)
+{
+	return state->__size != 0;
+}
+#else /* CONFIG_IOMMU_DMA */
+static inline bool dma_use_iova(struct dma_iova_state *state)
+{
+	return false;
+}
+#endif /* CONFIG_IOMMU_DMA */
+
 #if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
 void __dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, size_t size,
 		enum dma_data_direction dir);
-- 
2.46.2



* [PATCH v1 05/17] dma: Provide an interface to allow allocate IOVA
  2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (3 preceding siblings ...)
  2024-10-30 15:12 ` [PATCH v1 04/17] dma-mapping: Add check if IOVA can be used Leon Romanovsky
@ 2024-10-30 15:12 ` Leon Romanovsky
  2024-10-30 15:12 ` [PATCH v1 06/17] iommu/dma: Factor out a iommu_dma_map_swiotlb helper Leon Romanovsky
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-30 15:12 UTC (permalink / raw)
  To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Leon Romanovsky, Keith Busch, Bjorn Helgaas, Logan Gunthorpe,
	Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
	Marek Szyprowski, Jérôme Glisse, Andrew Morton,
	Jonathan Corbet, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm

From: Leon Romanovsky <leonro@nvidia.com>

The existing .map_page() callback provides both allocation of IOVA
and linking of DMA pages. That combination works great for most of the
callers who use it in control paths, but is less effective in fast
paths where there may be multiple calls to map_page().

These advanced callers already manage their data in some sort of
database and can perform IOVA allocation in advance, leaving the range
linkage operation for the fast path.

Provide an interface to allocate/deallocate IOVA; the next patch adds
link/unlink of DMA ranges to that specific IOVA.

The API is exported from dma-iommu as it is the only implementation
supported, and the namespace is clearly different from the iommu_*
functions, which drivers are not allowed to use directly. This code
layout allows us to save a function call per API call used in the
datapath, as well as a lot of boilerplate code.
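
A minimal sketch of the allocate/free lifetime (illustrative only; the two
helpers and 'total_len' are hypothetical, and the link/sync calls appear only
in later patches of this series):

static bool my_iova_setup(struct device *dev, struct dma_iova_state *state,
                size_t total_len)
{
        /* @phys is only used for alignment; 0 is fine for callers that
         * only do PAGE_SIZE aligned transfers. */
        if (!dma_iova_try_alloc(dev, state, 0, total_len))
                return false;   /* fall back to the regular DMA API */
        return true;
}

static void my_iova_teardown(struct device *dev, struct dma_iova_state *state)
{
        /* All linked ranges must have been unlinked before this point. */
        dma_iova_free(dev, state);
}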

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/iommu/dma-iommu.c   | 79 +++++++++++++++++++++++++++++++++++++
 include/linux/dma-mapping.h | 15 +++++++
 2 files changed, 94 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 853247c42f7d..127150f63c95 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1746,6 +1746,85 @@ size_t iommu_dma_max_mapping_size(struct device *dev)
 	return SIZE_MAX;
 }
 
+static bool iommu_dma_iova_alloc(struct device *dev,
+		struct dma_iova_state *state, phys_addr_t phys, size_t size)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	size_t iova_off = iova_offset(iovad, phys);
+	dma_addr_t addr;
+
+	if (WARN_ON_ONCE(!size))
+		return false;
+	if (WARN_ON_ONCE(size & DMA_IOVA_USE_SWIOTLB))
+		return false;
+
+	addr = iommu_dma_alloc_iova(domain,
+			iova_align(iovad, size + iova_off),
+			dma_get_mask(dev), dev);
+	if (!addr)
+		return false;
+
+	state->addr = addr + iova_off;
+	state->__size = size;
+	return true;
+}
+
+/**
+ * dma_iova_try_alloc - Try to allocate an IOVA space
+ * @dev: Device to allocate the IOVA space for
+ * @state: IOVA state
+ * @phys: physical address
+ * @size: IOVA size
+ *
+ * Check if @dev supports the IOVA-based DMA API, and if yes allocate IOVA space
+ * for the given base address and size.
+ *
+ * Note: @phys is only used to calculate the IOVA alignment. Callers that always
+ * do PAGE_SIZE aligned transfers can safely pass 0 here.
+ *
+ * Returns %true if the IOVA-based DMA API can be used and IOVA space has been
+ * allocated, or %false if the regular DMA API should be used.
+ */
+bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
+		phys_addr_t phys, size_t size)
+{
+	memset(state, 0, sizeof(*state));
+	if (!use_dma_iommu(dev))
+		return false;
+	if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&
+	    iommu_deferred_attach(dev, iommu_get_domain_for_dev(dev)))
+		return false;
+	return iommu_dma_iova_alloc(dev, state, phys, size);
+}
+EXPORT_SYMBOL_GPL(dma_iova_try_alloc);
+
+/**
+ * dma_iova_free - Free an IOVA space
+ * @dev: Device to free the IOVA space for
+ * @state: IOVA state
+ *
+ * Undoes a successful dma_iova_try_alloc().
+ *
+ * Note that all dma_iova_link() calls need to be undone first.  For callers
+ * that never call dma_iova_unlink(), dma_iova_destroy() can be used instead
+ * which unlinks all ranges and frees the IOVA space in a single efficient
+ * operation.
+ */
+void dma_iova_free(struct device *dev, struct dma_iova_state *state)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	size_t iova_start_pad = iova_offset(iovad, state->addr);
+	size_t size = dma_iova_size(state);
+
+	iommu_dma_free_iova(cookie, state->addr - iova_start_pad,
+			iova_align(iovad, size + iova_start_pad), NULL);
+}
+EXPORT_SYMBOL_GPL(dma_iova_free);
+
 void iommu_setup_dma_ops(struct device *dev)
 {
 	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 6075e0708deb..817f11bce7bc 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -11,6 +11,7 @@
 #include <linux/scatterlist.h>
 #include <linux/bug.h>
 #include <linux/mem_encrypt.h>
+#include <linux/iommu.h>
 
 /**
  * List of possible attributes associated with a DMA mapping. The semantics
@@ -77,6 +78,7 @@
 #define DMA_BIT_MASK(n)	(((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
 
 struct dma_iova_state {
+	dma_addr_t addr;
 	size_t __size;
 };
 
@@ -307,11 +309,24 @@ static inline bool dma_use_iova(struct dma_iova_state *state)
 {
 	return state->__size != 0;
 }
+
+bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
+		phys_addr_t phys, size_t size);
+void dma_iova_free(struct device *dev, struct dma_iova_state *state);
 #else /* CONFIG_IOMMU_DMA */
 static inline bool dma_use_iova(struct dma_iova_state *state)
 {
 	return false;
 }
+static inline bool dma_iova_try_alloc(struct device *dev,
+		struct dma_iova_state *state, phys_addr_t phys, size_t size)
+{
+	return false;
+}
+static inline void dma_iova_free(struct device *dev,
+		struct dma_iova_state *state)
+{
+}
 #endif /* CONFIG_IOMMU_DMA */
 
 #if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
-- 
2.46.2



* [PATCH v1 06/17] iommu/dma: Factor out a iommu_dma_map_swiotlb helper
  2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (4 preceding siblings ...)
  2024-10-30 15:12 ` [PATCH v1 05/17] dma: Provide an interface to allow allocate IOVA Leon Romanovsky
@ 2024-10-30 15:12 ` Leon Romanovsky
  2024-10-30 15:12 ` [PATCH v1 07/17] dma-mapping: Implement link/unlink ranges API Leon Romanovsky
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-30 15:12 UTC (permalink / raw)
  To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm

From: Christoph Hellwig <hch@lst.de>

Split the swiotlb bounce buffering logic out of iommu_dma_map_page into a
separate helper. This not only keeps the code neatly separated, but will
also allow for reuse in another caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/iommu/dma-iommu.c | 73 ++++++++++++++++++++++-----------------
 1 file changed, 41 insertions(+), 32 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 127150f63c95..e1eaad500d27 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1161,6 +1161,43 @@ void iommu_dma_sync_sg_for_device(struct device *dev, struct scatterlist *sgl,
 			arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);
 }
 
+static phys_addr_t iommu_dma_map_swiotlb(struct device *dev, phys_addr_t phys,
+		size_t size, enum dma_data_direction dir, unsigned long attrs)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iova_domain *iovad = &domain->iova_cookie->iovad;
+
+	if (!is_swiotlb_active(dev)) {
+		dev_warn_once(dev, "DMA bounce buffers are inactive, unable to map unaligned transaction.\n");
+		return DMA_MAPPING_ERROR;
+	}
+
+	trace_swiotlb_bounced(dev, phys, size);
+
+	phys = swiotlb_tbl_map_single(dev, phys, size, iova_mask(iovad), dir,
+			attrs);
+
+	/*
+	 * Untrusted devices should not see padding areas with random leftover
+	 * kernel data, so zero the pre- and post-padding.
+	 * swiotlb_tbl_map_single() has initialized the bounce buffer proper to
+	 * the contents of the original memory buffer.
+	 */
+	if (phys != DMA_MAPPING_ERROR && dev_is_untrusted(dev)) {
+		size_t start, virt = (size_t)phys_to_virt(phys);
+
+		/* Pre-padding */
+		start = iova_align_down(iovad, virt);
+		memset((void *)start, 0, virt - start);
+
+		/* Post-padding */
+		start = virt + size;
+		memset((void *)start, 0, iova_align(iovad, start) - start);
+	}
+
+	return phys;
+}
+
 dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
 	      unsigned long offset, size_t size, enum dma_data_direction dir,
 	      unsigned long attrs)
@@ -1174,42 +1211,14 @@ dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
 	dma_addr_t iova, dma_mask = dma_get_mask(dev);
 
 	/*
-	 * If both the physical buffer start address and size are
-	 * page aligned, we don't need to use a bounce page.
+	 * If both the physical buffer start address and size are page aligned,
+	 * we don't need to use a bounce page.
 	 */
 	if (dev_use_swiotlb(dev, size, dir) &&
 	    iova_offset(iovad, phys | size)) {
-		if (!is_swiotlb_active(dev)) {
-			dev_warn_once(dev, "DMA bounce buffers are inactive, unable to map unaligned transaction.\n");
-			return DMA_MAPPING_ERROR;
-		}
-
-		trace_swiotlb_bounced(dev, phys, size);
-
-		phys = swiotlb_tbl_map_single(dev, phys, size,
-					      iova_mask(iovad), dir, attrs);
-
+		phys = iommu_dma_map_swiotlb(dev, phys, size, dir, attrs);
 		if (phys == DMA_MAPPING_ERROR)
-			return DMA_MAPPING_ERROR;
-
-		/*
-		 * Untrusted devices should not see padding areas with random
-		 * leftover kernel data, so zero the pre- and post-padding.
-		 * swiotlb_tbl_map_single() has initialized the bounce buffer
-		 * proper to the contents of the original memory buffer.
-		 */
-		if (dev_is_untrusted(dev)) {
-			size_t start, virt = (size_t)phys_to_virt(phys);
-
-			/* Pre-padding */
-			start = iova_align_down(iovad, virt);
-			memset((void *)start, 0, virt - start);
-
-			/* Post-padding */
-			start = virt + size;
-			memset((void *)start, 0,
-			       iova_align(iovad, start) - start);
-		}
+			return phys;
 	}
 
 	if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
-- 
2.46.2



* [PATCH v1 07/17] dma-mapping: Implement link/unlink ranges API
  2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (5 preceding siblings ...)
  2024-10-30 15:12 ` [PATCH v1 06/17] iommu/dma: Factor out a iommu_dma_map_swiotlb helper Leon Romanovsky
@ 2024-10-30 15:12 ` Leon Romanovsky
  2024-10-31 21:18   ` Robin Murphy
  2024-10-30 15:12 ` [PATCH v1 08/17] dma-mapping: add a dma_need_unmap helper Leon Romanovsky
                   ` (12 subsequent siblings)
  19 siblings, 1 reply; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-30 15:12 UTC (permalink / raw)
  To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Leon Romanovsky, Keith Busch, Bjorn Helgaas, Logan Gunthorpe,
	Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
	Marek Szyprowski, Jérôme Glisse, Andrew Morton,
	Jonathan Corbet, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm

From: Leon Romanovsky <leonro@nvidia.com>

Introduce new DMA APIs to perform DMA linkage of buffers
in layers above the DMA core.

In the proposed API, the callers will perform the following steps.
In the map path:
	if (dma_iova_try_alloc(...))
	    for (page in range)
	       dma_iova_link(...)
	    dma_iova_sync(...)
	else
	     /* Fallback to legacy map pages */
             for (all pages)
	       dma_map_page(...)

In the unmap path:
	if (dma_use_iova(...))
	     dma_iova_destroy()
	else
	     for (all pages)
		dma_unmap_page(...)

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/iommu/dma-iommu.c   | 259 ++++++++++++++++++++++++++++++++++++
 include/linux/dma-mapping.h |  32 +++++
 2 files changed, 291 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index e1eaad500d27..4a504a879cc0 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1834,6 +1834,265 @@ void dma_iova_free(struct device *dev, struct dma_iova_state *state)
 }
 EXPORT_SYMBOL_GPL(dma_iova_free);
 
+static int __dma_iova_link(struct device *dev, dma_addr_t addr,
+		phys_addr_t phys, size_t size, enum dma_data_direction dir,
+		unsigned long attrs)
+{
+	bool coherent = dev_is_dma_coherent(dev);
+
+	if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+		arch_sync_dma_for_device(phys, size, dir);
+
+	return iommu_map_nosync(iommu_get_dma_domain(dev), addr, phys, size,
+			dma_info_to_prot(dir, coherent, attrs), GFP_ATOMIC);
+}
+
+static int iommu_dma_iova_bounce_and_link(struct device *dev, dma_addr_t addr,
+		phys_addr_t phys, size_t bounce_len,
+		enum dma_data_direction dir, unsigned long attrs,
+		size_t iova_start_pad)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iova_domain *iovad = &domain->iova_cookie->iovad;
+	phys_addr_t bounce_phys;
+	int error;
+
+	bounce_phys = iommu_dma_map_swiotlb(dev, phys, bounce_len, dir, attrs);
+	if (bounce_phys == DMA_MAPPING_ERROR)
+		return -ENOMEM;
+
+	error = __dma_iova_link(dev, addr - iova_start_pad,
+			bounce_phys - iova_start_pad,
+			iova_align(iovad, bounce_len), dir, attrs);
+	if (error)
+		swiotlb_tbl_unmap_single(dev, bounce_phys, bounce_len, dir,
+				attrs);
+	return error;
+}
+
+static int iommu_dma_iova_link_swiotlb(struct device *dev,
+		struct dma_iova_state *state, phys_addr_t phys, size_t offset,
+		size_t size, enum dma_data_direction dir, unsigned long attrs)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	size_t iova_start_pad = iova_offset(iovad, phys);
+	size_t iova_end_pad = iova_offset(iovad, phys + size);
+	dma_addr_t addr = state->addr + offset;
+	size_t mapped = 0;
+	int error;
+
+	if (iova_start_pad) {
+		size_t bounce_len = min(size, iovad->granule - iova_start_pad);
+
+		error = iommu_dma_iova_bounce_and_link(dev, addr, phys,
+				bounce_len, dir, attrs, iova_start_pad);
+		if (error)
+			return error;
+		state->__size |= DMA_IOVA_USE_SWIOTLB;
+
+		mapped += bounce_len;
+		size -= bounce_len;
+		if (!size)
+			return 0;
+	}
+
+	size -= iova_end_pad;
+	error = __dma_iova_link(dev, addr + mapped, phys + mapped, size, dir,
+			attrs);
+	if (error)
+		goto out_unmap;
+	mapped += size;
+
+	if (iova_end_pad) {
+		error = iommu_dma_iova_bounce_and_link(dev, addr + mapped,
+				phys + mapped, iova_end_pad, dir, attrs, 0);
+		if (error)
+			goto out_unmap;
+		state->__size |= DMA_IOVA_USE_SWIOTLB;
+	}
+
+	return 0;
+
+out_unmap:
+	dma_iova_unlink(dev, state, 0, mapped, dir, attrs);
+	return error;
+}
+
+/**
+ * dma_iova_link - Link a range of IOVA space
+ * @dev: DMA device
+ * @state: IOVA state
+ * @phys: physical address to link
+ * @offset: offset into the IOVA state to map into
+ * @size: size of the buffer
+ * @dir: DMA direction
+ * @attrs: attributes of mapping properties
+ *
+ * Link a range of IOVA space for the given IOVA state without IOTLB sync.
+ * This function is used to link multiple physical addresses in contiguous
+ * IOVA space without performing a costly IOTLB sync.
+ *
+ * The caller is responsible for calling dma_iova_sync() to sync the IOTLB at
+ * the end of linkage.
+ */
+int dma_iova_link(struct device *dev, struct dma_iova_state *state,
+		phys_addr_t phys, size_t offset, size_t size,
+		enum dma_data_direction dir, unsigned long attrs)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	size_t iova_start_pad = iova_offset(iovad, phys);
+
+	if (WARN_ON_ONCE(iova_start_pad && offset > 0))
+		return -EIO;
+
+	if (dev_use_swiotlb(dev, size, dir) && iova_offset(iovad, phys | size))
+		return iommu_dma_iova_link_swiotlb(dev, state, phys, offset,
+				size, dir, attrs);
+
+	return __dma_iova_link(dev, state->addr + offset - iova_start_pad,
+			phys - iova_start_pad,
+			iova_align(iovad, size + iova_start_pad), dir, attrs);
+}
+EXPORT_SYMBOL_GPL(dma_iova_link);
+
+/**
+ * dma_iova_sync - Sync IOTLB
+ * @dev: DMA device
+ * @state: IOVA state
+ * @offset: offset into the IOVA state to sync
+ * @size: size of the buffer
+ *
+ * Sync IOTLB for the given IOVA state. This function should be called on
+ * the IOVA-contiguous range created by one or more dma_iova_link() calls
+ * to sync the IOTLB.
+ */
+int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
+		size_t offset, size_t size)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	dma_addr_t addr = state->addr + offset;
+	size_t iova_start_pad = iova_offset(iovad, addr);
+
+	return iommu_sync_map(domain, addr - iova_start_pad,
+		      iova_align(iovad, size + iova_start_pad));
+}
+EXPORT_SYMBOL_GPL(dma_iova_sync);
+
+static void iommu_dma_iova_unlink_range_slow(struct device *dev,
+		dma_addr_t addr, size_t size, enum dma_data_direction dir,
+		unsigned long attrs)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	size_t iova_start_pad = iova_offset(iovad, addr);
+	dma_addr_t end = addr + size;
+
+	do {
+		phys_addr_t phys;
+		size_t len;
+
+		phys = iommu_iova_to_phys(domain, addr);
+		if (WARN_ON(!phys))
+			continue;
+		len = min_t(size_t,
+			end - addr, iovad->granule - iova_start_pad);
+
+		if (!dev_is_dma_coherent(dev) &&
+		    !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+			arch_sync_dma_for_cpu(phys, len, dir);
+
+		swiotlb_tbl_unmap_single(dev, phys, len, dir, attrs);
+
+		addr += len;
+		iova_start_pad = 0;
+	} while (addr < end);
+}
+
+static void __iommu_dma_iova_unlink(struct device *dev,
+		struct dma_iova_state *state, size_t offset, size_t size,
+		enum dma_data_direction dir, unsigned long attrs,
+		bool free_iova)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	dma_addr_t addr = state->addr + offset;
+	size_t iova_start_pad = iova_offset(iovad, addr);
+	struct iommu_iotlb_gather iotlb_gather;
+	size_t unmapped;
+
+	if ((state->__size & DMA_IOVA_USE_SWIOTLB) ||
+	    (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)))
+		iommu_dma_iova_unlink_range_slow(dev, addr, size, dir, attrs);
+
+	iommu_iotlb_gather_init(&iotlb_gather);
+	iotlb_gather.queued = free_iova && READ_ONCE(cookie->fq_domain);
+
+	size = iova_align(iovad, size + iova_start_pad);
+	addr -= iova_start_pad;
+	unmapped = iommu_unmap_fast(domain, addr, size, &iotlb_gather);
+	WARN_ON(unmapped != size);
+
+	if (!iotlb_gather.queued)
+		iommu_iotlb_sync(domain, &iotlb_gather);
+	if (free_iova)
+		iommu_dma_free_iova(cookie, addr, size, &iotlb_gather);
+}
+
+/**
+ * dma_iova_unlink - Unlink a range of IOVA space
+ * @dev: DMA device
+ * @state: IOVA state
+ * @offset: offset into the IOVA state to unlink
+ * @size: size of the buffer
+ * @dir: DMA direction
+ * @attrs: attributes of mapping properties
+ *
+ * Unlink a range of IOVA space for the given IOVA state.
+ */
+void dma_iova_unlink(struct device *dev, struct dma_iova_state *state,
+		size_t offset, size_t size, enum dma_data_direction dir,
+		unsigned long attrs)
+{
+	 __iommu_dma_iova_unlink(dev, state, offset, size, dir, attrs, false);
+}
+EXPORT_SYMBOL_GPL(dma_iova_unlink);
+
+/**
+ * dma_iova_destroy - Finish a DMA mapping transaction
+ * @dev: DMA device
+ * @state: IOVA state
+ * @mapped_len: number of bytes to unmap
+ * @dir: DMA direction
+ * @attrs: attributes of mapping properties
+ *
+ * Unlink the IOVA range up to @mapped_len and free the entire IOVA space. The
+ * range of IOVA from dma_addr to @mapped_len must all be linked, and be the
+ * only linked IOVA in state.
+ */
+void dma_iova_destroy(struct device *dev, struct dma_iova_state *state,
+		size_t mapped_len, enum dma_data_direction dir,
+		unsigned long attrs)
+{
+	if (mapped_len)
+		__iommu_dma_iova_unlink(dev, state, 0, mapped_len, dir, attrs,
+				true);
+	else
+		/*
+		 * We can be here if first call to dma_iova_link() failed and
+		 * there is nothing to unlink, so let's be more clear.
+		 */
+		dma_iova_free(dev, state);
+}
+EXPORT_SYMBOL_GPL(dma_iova_destroy);
+
 void iommu_setup_dma_ops(struct device *dev)
 {
 	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 817f11bce7bc..8074a3b5c807 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -313,6 +313,17 @@ static inline bool dma_use_iova(struct dma_iova_state *state)
 bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
 		phys_addr_t phys, size_t size);
 void dma_iova_free(struct device *dev, struct dma_iova_state *state);
+void dma_iova_destroy(struct device *dev, struct dma_iova_state *state,
+		size_t mapped_len, enum dma_data_direction dir,
+		unsigned long attrs);
+int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
+		size_t offset, size_t size);
+int dma_iova_link(struct device *dev, struct dma_iova_state *state,
+		phys_addr_t phys, size_t offset, size_t size,
+		enum dma_data_direction dir, unsigned long attrs);
+void dma_iova_unlink(struct device *dev, struct dma_iova_state *state,
+		size_t offset, size_t size, enum dma_data_direction dir,
+		unsigned long attrs);
 #else /* CONFIG_IOMMU_DMA */
 static inline bool dma_use_iova(struct dma_iova_state *state)
 {
@@ -327,6 +338,27 @@ static inline void dma_iova_free(struct device *dev,
 		struct dma_iova_state *state)
 {
 }
+static inline void dma_iova_destroy(struct device *dev,
+		struct dma_iova_state *state, size_t mapped_len,
+		enum dma_data_direction dir, unsigned long attrs)
+{
+}
+static inline int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
+		size_t offset, size_t size)
+{
+	return -EOPNOTSUPP;
+}
+static inline int dma_iova_link(struct device *dev,
+		struct dma_iova_state *state, phys_addr_t phys, size_t offset,
+		size_t size, enum dma_data_direction dir, unsigned long attrs)
+{
+	return -EOPNOTSUPP;
+}
+static inline void dma_iova_unlink(struct device *dev,
+		struct dma_iova_state *state, size_t offset, size_t size,
+		enum dma_data_direction dir, unsigned long attrs)
+{
+}
 #endif /* CONFIG_IOMMU_DMA */
 
 #if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v1 08/17] dma-mapping: add a dma_need_unmap helper
  2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (6 preceding siblings ...)
  2024-10-30 15:12 ` [PATCH v1 07/17] dma-mapping: Implement link/unlink ranges API Leon Romanovsky
@ 2024-10-30 15:12 ` Leon Romanovsky
  2024-10-31 21:18   ` Robin Murphy
  2024-10-30 15:12 ` [PATCH v1 09/17] docs: core-api: document the IOVA-based API Leon Romanovsky
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-30 15:12 UTC (permalink / raw)
  To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm

From: Christoph Hellwig <hch@lst.de>

Add a helper that allows a driver to skip calling dma_unmap_* after
finishing an I/O if the DMA layer can guarantee that those calls are no-ops.
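
For illustration only (not from this patch, names are placeholders): a
driver teardown path can use the helper to skip the per-page unmaps when
the DMA layer guarantees they are no-ops:

	/* all mappings for this I/O have been performed at this point */
	if (dma_need_unmap(dev)) {
		unsigned int i;

		for (i = 0; i < nr; i++)
			dma_unmap_page(dev, dma[i], PAGE_SIZE, DMA_TO_DEVICE);
	}

The same check can also be used to avoid storing the dma_addr_t array at
all, which is what the hmm-dma code later in this series does.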

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/dma-mapping.h |  5 +++++
 kernel/dma/mapping.c        | 20 ++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 8074a3b5c807..6906edde505d 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -410,6 +410,7 @@ static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
 	return dma_dev_need_sync(dev) ? __dma_need_sync(dev, dma_addr) : false;
 }
+bool dma_need_unmap(struct device *dev);
 #else /* !CONFIG_HAS_DMA || !CONFIG_DMA_NEED_SYNC */
 static inline bool dma_dev_need_sync(const struct device *dev)
 {
@@ -435,6 +436,10 @@ static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
 	return false;
 }
+static inline bool dma_need_unmap(struct device *dev)
+{
+	return false;
+}
 #endif /* !CONFIG_HAS_DMA || !CONFIG_DMA_NEED_SYNC */
 
 struct page *dma_alloc_pages(struct device *dev, size_t size,
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 864a1121bf08..daa97a650778 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -442,6 +442,26 @@ bool __dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 }
 EXPORT_SYMBOL_GPL(__dma_need_sync);
 
+/**
+ * dma_need_unmap - does this device need dma_unmap_* operations
+ * @dev: device to check
+ *
+ * If this function returns %false, drivers can skip calling dma_unmap_* after
+ * finishing an I/O.  This function must be called after all mappings that might
+ * need to be unmapped have been performed.
+ */
+bool dma_need_unmap(struct device *dev)
+{
+	if (!dma_map_direct(dev, get_dma_ops(dev)))
+		return true;
+#ifdef CONFIG_DMA_NEED_SYNC
+	if (!dev->dma_skip_sync)
+		return true;
+#endif
+	return IS_ENABLED(CONFIG_DMA_API_DEBUG);
+}
+EXPORT_SYMBOL_GPL(dma_need_unmap);
+
 static void dma_setup_need_sync(struct device *dev)
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v1 09/17] docs: core-api: document the IOVA-based API
  2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (7 preceding siblings ...)
  2024-10-30 15:12 ` [PATCH v1 08/17] dma-mapping: add a dma_need_unmap helper Leon Romanovsky
@ 2024-10-30 15:12 ` Leon Romanovsky
  2024-10-31  1:41   ` Randy Dunlap
  2024-11-08 19:34   ` Jonathan Corbet
  2024-10-30 15:12 ` [PATCH v1 10/17] mm/hmm: let users to tag specific PFN with DMA mapped bit Leon Romanovsky
                   ` (10 subsequent siblings)
  19 siblings, 2 replies; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-30 15:12 UTC (permalink / raw)
  To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm

From: Christoph Hellwig <hch@lst.de>

Add an explanation of the newly added IOVA-based mapping API.
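
For orientation, a minimal usage sketch of the flow documented below
(illustrative only, assuming an IOMMU-backed device and a single physically
contiguous buffer; "dev", "phys" and "size" are placeholders):

	struct dma_iova_state state = {};
	int ret;

	if (!dma_iova_try_alloc(dev, &state, phys, size))
		return -EOPNOTSUPP;	/* fall back to dma_map_page() and friends */

	ret = dma_iova_link(dev, &state, phys, 0, size, DMA_TO_DEVICE, 0);
	if (ret) {
		dma_iova_free(dev, &state);	/* nothing linked yet */
		return ret;
	}
	ret = dma_iova_sync(dev, &state, 0, size);
	if (ret) {
		dma_iova_destroy(dev, &state, size, DMA_TO_DEVICE, 0);
		return ret;
	}

	/* ... program state.addr into the device and run the I/O ... */

	dma_iova_destroy(dev, &state, size, DMA_TO_DEVICE, 0);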

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 Documentation/core-api/dma-api.rst | 70 ++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)

diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst
index 8e3cce3d0a23..6095696a65a7 100644
--- a/Documentation/core-api/dma-api.rst
+++ b/Documentation/core-api/dma-api.rst
@@ -530,6 +530,76 @@ routines, e.g.:::
 		....
 	}
 
+Part Ie - IOVA-based DMA mappings
+---------------------------------
+
+These APIs allow a very efficient mapping when using an IOMMU.  They are an
+optional path that requires extra code and are only recommended for drivers
+where DMA mapping performance, or the space usage for storing the DMA
+addresses, matters.  All the considerations from the previous section apply
+here as well.
+
+::
+
+    bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
+		phys_addr_t phys, size_t size);
+
+Is used to try to allocate IOVA space for the mapping operation.  If it
+returns false this API can't be used for the given device and the normal
+streaming DMA mapping API should be used instead.  The
+``struct dma_iova_state`` is allocated by the driver and must be kept
+around until unmap time.
+
+::
+
+    static inline bool dma_use_iova(struct dma_iova_state *state)
+
+Can be used by the driver to check whether the IOVA-based API is used after
+a call to dma_iova_try_alloc().  This can be useful in the unmap path.
+
+::
+
+    int dma_iova_link(struct device *dev, struct dma_iova_state *state,
+		phys_addr_t phys, size_t offset, size_t size,
+		enum dma_data_direction dir, unsigned long attrs);
+
+Is used to link ranges to the IOVA previously allocated.  The start address
+passed to all but the first call to dma_iova_link() for a given state must
+be aligned to the DMA merge boundary returned by
+``dma_get_merge_boundary()``, and the size of all but the last range must
+be aligned to the DMA merge boundary as well.
+
+::
+
+    int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
+		size_t offset, size_t size);
+
+Must be called to sync the IOMMU page tables for the IOVA range mapped by one
+or more calls to ``dma_iova_link()``.
+
+For drivers that use a one-shot mapping, all ranges can be unmapped and the
+IOVA freed by calling:
+
+::
+
+   void dma_iova_destroy(struct device *dev, struct dma_iova_state *state,
+		size_t mapped_len, enum dma_data_direction dir,
+		unsigned long attrs);
+
+Alternatively drivers can dynamically manage the IOVA space by unmapping
+and mapping individual regions.  In that case
+
+::
+
+    void dma_iova_unlink(struct device *dev, struct dma_iova_state *state,
+		size_t offset, size_t size, enum dma_data_direction dir,
+		unsigned long attrs);
+
+is used to unmap a range previously mapped, and
+
+::
+
+   void dma_iova_free(struct device *dev, struct dma_iova_state *state);
+
+is used to free the IOVA space.  All regions must have been unmapped using
+``dma_iova_unlink()`` before calling ``dma_iova_free()``.
 
 Part II - Non-coherent DMA allocations
 --------------------------------------
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v1 10/17] mm/hmm: let users to tag specific PFN with DMA mapped bit
  2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (8 preceding siblings ...)
  2024-10-30 15:12 ` [PATCH v1 09/17] docs: core-api: document the IOVA-based API Leon Romanovsky
@ 2024-10-30 15:12 ` Leon Romanovsky
  2024-10-30 15:12 ` [PATCH v1 11/17] mm/hmm: provide generic DMA managing logic Leon Romanovsky
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-30 15:12 UTC (permalink / raw)
  To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Leon Romanovsky, Keith Busch, Bjorn Helgaas, Logan Gunthorpe,
	Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
	Marek Szyprowski, Jérôme Glisse, Andrew Morton,
	Jonathan Corbet, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm

From: Leon Romanovsky <leonro@nvidia.com>

Introduce a new sticky flag (HMM_PFN_DMA_MAPPED) which isn't overwritten
by hmm_range_fault(). This flag allows users to tag specific PFNs with the
information that the PFN has already been DMA mapped.
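
As a hedged sketch of the intended use (placeholder names, not taken from
this patch): after hmm_range_fault() repopulates the PFN array, the sticky
bit lets a caller skip entries it has already mapped:

	for (i = 0; i < npfns; i++) {
		if (!(pfns[i] & HMM_PFN_VALID))
			continue;
		if (pfns[i] & HMM_PFN_DMA_MAPPED)
			continue;	/* survived the refault, already mapped */
		/* ... DMA map pfns[i] and program the device ... */
		pfns[i] |= HMM_PFN_DMA_MAPPED;
	}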

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/hmm.h | 14 ++++++++++++++
 mm/hmm.c            | 34 +++++++++++++++++++++-------------
 2 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 126a36571667..5dd655f6766b 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -23,6 +23,8 @@ struct mmu_interval_notifier;
  * HMM_PFN_WRITE - if the page memory can be written to (requires HMM_PFN_VALID)
  * HMM_PFN_ERROR - accessing the pfn is impossible and the device should
  *                 fail. ie poisoned memory, special pages, no vma, etc
+ * HMM_PFN_DMA_MAPPED - Flag preserved on input-to-output transformation
+ *                      to mark that the page is already DMA mapped
  *
  * On input:
  * 0                 - Return the current state of the page, do not fault it.
@@ -36,6 +38,10 @@ enum hmm_pfn_flags {
 	HMM_PFN_VALID = 1UL << (BITS_PER_LONG - 1),
 	HMM_PFN_WRITE = 1UL << (BITS_PER_LONG - 2),
 	HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
+
+	/* Sticky flag, carried from Input to Output */
+	HMM_PFN_DMA_MAPPED = 1UL << (BITS_PER_LONG - 7),
+
 	HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 8),
 
 	/* Input flags */
@@ -57,6 +63,14 @@ static inline struct page *hmm_pfn_to_page(unsigned long hmm_pfn)
 	return pfn_to_page(hmm_pfn & ~HMM_PFN_FLAGS);
 }
 
+/*
+ * hmm_pfn_to_phys() - return physical address pointed to by a device entry
+ */
+static inline phys_addr_t hmm_pfn_to_phys(unsigned long hmm_pfn)
+{
+	return __pfn_to_phys(hmm_pfn & ~HMM_PFN_FLAGS);
+}
+
 /*
  * hmm_pfn_to_map_order() - return the CPU mapping size order
  *
diff --git a/mm/hmm.c b/mm/hmm.c
index 7e0229ae4a5a..2a0c34d7cb2b 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -44,8 +44,10 @@ static int hmm_pfns_fill(unsigned long addr, unsigned long end,
 {
 	unsigned long i = (addr - range->start) >> PAGE_SHIFT;
 
-	for (; addr < end; addr += PAGE_SIZE, i++)
-		range->hmm_pfns[i] = cpu_flags;
+	for (; addr < end; addr += PAGE_SIZE, i++) {
+		range->hmm_pfns[i] &= HMM_PFN_DMA_MAPPED;
+		range->hmm_pfns[i] |= cpu_flags;
+	}
 	return 0;
 }
 
@@ -202,8 +204,10 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr,
 		return hmm_vma_fault(addr, end, required_fault, walk);
 
 	pfn = pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
-		hmm_pfns[i] = pfn | cpu_flags;
+	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+		hmm_pfns[i] &= HMM_PFN_DMA_MAPPED;
+		hmm_pfns[i] |= pfn | cpu_flags;
+	}
 	return 0;
 }
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -236,7 +240,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
 		if (required_fault)
 			goto fault;
-		*hmm_pfn = 0;
+		*hmm_pfn = *hmm_pfn & HMM_PFN_DMA_MAPPED;
 		return 0;
 	}
 
@@ -253,14 +257,14 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			cpu_flags = HMM_PFN_VALID;
 			if (is_writable_device_private_entry(entry))
 				cpu_flags |= HMM_PFN_WRITE;
-			*hmm_pfn = swp_offset_pfn(entry) | cpu_flags;
+			*hmm_pfn = (*hmm_pfn & HMM_PFN_DMA_MAPPED) | swp_offset_pfn(entry) | cpu_flags;
 			return 0;
 		}
 
 		required_fault =
 			hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
 		if (!required_fault) {
-			*hmm_pfn = 0;
+			*hmm_pfn = *hmm_pfn & HMM_PFN_DMA_MAPPED;
 			return 0;
 		}
 
@@ -304,11 +308,11 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			pte_unmap(ptep);
 			return -EFAULT;
 		}
-		*hmm_pfn = HMM_PFN_ERROR;
+		*hmm_pfn = (*hmm_pfn & HMM_PFN_DMA_MAPPED) | HMM_PFN_ERROR;
 		return 0;
 	}
 
-	*hmm_pfn = pte_pfn(pte) | cpu_flags;
+	*hmm_pfn = (*hmm_pfn & HMM_PFN_DMA_MAPPED) | pte_pfn(pte) | cpu_flags;
 	return 0;
 
 fault:
@@ -448,8 +452,10 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
 		}
 
 		pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-		for (i = 0; i < npages; ++i, ++pfn)
-			hmm_pfns[i] = pfn | cpu_flags;
+		for (i = 0; i < npages; ++i, ++pfn) {
+			hmm_pfns[i] &= HMM_PFN_DMA_MAPPED;
+			hmm_pfns[i] |= pfn | cpu_flags;
+		}
 		goto out_unlock;
 	}
 
@@ -507,8 +513,10 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	}
 
 	pfn = pte_pfn(entry) + ((start & ~hmask) >> PAGE_SHIFT);
-	for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
-		range->hmm_pfns[i] = pfn | cpu_flags;
+	for (; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+		range->hmm_pfns[i] &= HMM_PFN_DMA_MAPPED;
+		range->hmm_pfns[i] |= pfn | cpu_flags;
+	}
 
 	spin_unlock(ptl);
 	return 0;
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v1 11/17] mm/hmm: provide generic DMA managing logic
  2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (9 preceding siblings ...)
  2024-10-30 15:12 ` [PATCH v1 10/17] mm/hmm: let users to tag specific PFN with DMA mapped bit Leon Romanovsky
@ 2024-10-30 15:12 ` Leon Romanovsky
  2024-10-30 15:12 ` [PATCH v1 12/17] RDMA/umem: Store ODP access mask information in PFN Leon Romanovsky
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-30 15:12 UTC (permalink / raw)
  To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Leon Romanovsky, Keith Busch, Bjorn Helgaas, Logan Gunthorpe,
	Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
	Marek Szyprowski, Jérôme Glisse, Andrew Morton,
	Jonathan Corbet, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm

From: Leon Romanovsky <leonro@nvidia.com>

HMM callers use a PFN list to populate the range when calling
hmm_range_fault(); the conversion from PFN to DMA address is then done by
the callers with the help of another DMA list. However, this is wasteful
on any modern platform and, with the right logic, that DMA list can be
avoided.

Provide generic logic to manage these lists and give an interface to
map/unmap PFNs to DMA addresses, without requiring the callers to be
experts in the DMA core API.
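
A rough end-to-end sketch of how a caller is expected to use this interface
(illustrative only; "dev", "npages" and the device-programming step are
placeholders):

	struct hmm_dma_map map = {};
	struct pci_p2pdma_map_state p2pdma = {};
	size_t i;
	int ret;

	ret = hmm_dma_map_alloc(dev, &map, npages, PAGE_SIZE);
	if (ret)
		return ret;

	/* fill map.pfn_list[] for the range via hmm_range_fault(), then: */
	for (i = 0; i < npages; i++) {
		dma_addr_t dma;

		if (!(map.pfn_list[i] & HMM_PFN_VALID))
			continue;
		dma = hmm_dma_map_pfn(dev, &map, i, &p2pdma);
		if (dma_mapping_error(dev, dma))
			break;
		/* ... program dma into the device page table ... */
	}

	/* teardown */
	for (i = 0; i < npages; i++)
		hmm_dma_unmap_pfn(dev, &map, i);
	hmm_dma_map_free(dev, &map);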

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/hmm-dma.h |  32 +++++++
 include/linux/hmm.h     |   2 +
 mm/hmm.c                | 197 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 231 insertions(+)
 create mode 100644 include/linux/hmm-dma.h

diff --git a/include/linux/hmm-dma.h b/include/linux/hmm-dma.h
new file mode 100644
index 000000000000..f6ce2a00d74d
--- /dev/null
+++ b/include/linux/hmm-dma.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* Copyright (c) 2024 NVIDIA Corporation & Affiliates */
+#ifndef LINUX_HMM_DMA_H
+#define LINUX_HMM_DMA_H
+
+#include <linux/dma-mapping.h>
+
+struct dma_iova_state;
+struct pci_p2pdma_map_state;
+
+/*
+ * struct hmm_dma_map - array of PFNs and DMA addresses
+ *
+ * @state: DMA IOVA state
+ * @pfns: array of PFNs
+ * @dma_list: array of DMA addresses
+ * @dma_entry_size: size of each DMA entry in the array
+ */
+struct hmm_dma_map {
+	struct dma_iova_state state;
+	unsigned long *pfn_list;
+	dma_addr_t *dma_list;
+	size_t dma_entry_size;
+};
+
+int hmm_dma_map_alloc(struct device *dev, struct hmm_dma_map *map,
+		      size_t nr_entries, size_t dma_entry_size);
+void hmm_dma_map_free(struct device *dev, struct hmm_dma_map *map);
+dma_addr_t hmm_dma_map_pfn(struct device *dev, struct hmm_dma_map *map,
+			   size_t idx, struct pci_p2pdma_map_state *p2pdma_state);
+bool hmm_dma_unmap_pfn(struct device *dev, struct hmm_dma_map *map, size_t idx);
+#endif /* LINUX_HMM_DMA_H */
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 5dd655f6766b..62980ca8f3c5 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -23,6 +23,7 @@ struct mmu_interval_notifier;
  * HMM_PFN_WRITE - if the page memory can be written to (requires HMM_PFN_VALID)
  * HMM_PFN_ERROR - accessing the pfn is impossible and the device should
  *                 fail. ie poisoned memory, special pages, no vma, etc
+ * HMM_PFN_P2PDMA_BUS - Bus mapped P2P transfer
  * HMM_PFN_DMA_MAPPED - Flag preserved on input-to-output transformation
  *                      to mark that page is already DMA mapped
  *
@@ -40,6 +41,7 @@ enum hmm_pfn_flags {
 	HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
 
 	/* Sticky flag, carried from Input to Output */
+	HMM_PFN_P2PDMA_BUS = 1UL << (BITS_PER_LONG - 6),
 	HMM_PFN_DMA_MAPPED = 1UL << (BITS_PER_LONG - 7),
 
 	HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 8),
diff --git a/mm/hmm.c b/mm/hmm.c
index 2a0c34d7cb2b..a852d8337c73 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -10,6 +10,7 @@
  */
 #include <linux/pagewalk.h>
 #include <linux/hmm.h>
+#include <linux/hmm-dma.h>
 #include <linux/init.h>
 #include <linux/rmap.h>
 #include <linux/swap.h>
@@ -23,6 +24,7 @@
 #include <linux/sched/mm.h>
 #include <linux/jump_label.h>
 #include <linux/dma-mapping.h>
+#include <linux/pci-p2pdma.h>
 #include <linux/mmu_notifier.h>
 #include <linux/memory_hotplug.h>
 
@@ -615,3 +617,198 @@ int hmm_range_fault(struct hmm_range *range)
 	return ret;
 }
 EXPORT_SYMBOL(hmm_range_fault);
+
+/**
+ * hmm_dma_map_alloc - Allocate HMM map structure
+ * @dev: device to allocate structure for
+ * @map: HMM map to allocate
+ * @nr_entries: number of entries in the map
+ * @dma_entry_size: size of the DMA entry in the map
+ *
+ * Allocate the HMM map structure and all the lists it contains.
+ * Return 0 on success, -ENOMEM on failure.
+ */
+int hmm_dma_map_alloc(struct device *dev, struct hmm_dma_map *map,
+		      size_t nr_entries, size_t dma_entry_size)
+{
+	bool dma_need_sync = false;
+	bool use_iova;
+
+	if (!(nr_entries * PAGE_SIZE / dma_entry_size))
+		return -EINVAL;
+
+	/*
+	 * The HMM API violates our normal DMA buffer ownership rules and can't
+	 * transfer buffer ownership.  The dma_addressing_limited() check is a
+	 * best approximation to ensure no swiotlb buffering happens.
+	 */
+	if (IS_ENABLED(CONFIG_DMA_NEED_SYNC))
+		dma_need_sync = !dev->dma_skip_sync;
+	if (dma_need_sync || dma_addressing_limited(dev))
+		return -EOPNOTSUPP;
+
+	map->dma_entry_size = dma_entry_size;
+	map->pfn_list =
+		kvcalloc(nr_entries, sizeof(*map->pfn_list), GFP_KERNEL);
+	if (!map->pfn_list)
+		return -ENOMEM;
+
+	use_iova = dma_iova_try_alloc(dev, &map->state, 0,
+			nr_entries * PAGE_SIZE);
+	if (!use_iova && dma_need_unmap(dev)) {
+		map->dma_list = kvcalloc(nr_entries, sizeof(*map->dma_list),
+					 GFP_KERNEL);
+		if (!map->dma_list)
+			goto err_dma;
+	}
+	return 0;
+
+err_dma:
+	kvfree(map->pfn_list);
+	return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(hmm_dma_map_alloc);
+
+/**
+ * hmm_dma_map_free - Free HMM map structure
+ * @dev: device to free structure from
+ * @map: HMM map containing the various lists and state
+ *
+ * Free the HMM map structure and all the lists it contains.
+ */
+void hmm_dma_map_free(struct device *dev, struct hmm_dma_map *map)
+{
+	if (dma_use_iova(&map->state))
+		dma_iova_free(dev, &map->state);
+	kvfree(map->pfn_list);
+	kvfree(map->dma_list);
+}
+EXPORT_SYMBOL_GPL(hmm_dma_map_free);
+
+/**
+ * hmm_dma_map_pfn - Map a physical HMM page to DMA address
+ * @dev: Device to map the page for
+ * @map: HMM map
+ * @idx: Index into the PFN and dma address arrays
+ * @p2pdma_state: PCI P2P state.
+ *
+ * Map the page backing the PFN at index @idx in @map->pfn_list and return
+ * its DMA address.  When the IOVA path is in use the page is linked at the
+ * offset in the IOVA space that corresponds to @idx; otherwise it is mapped
+ * with dma_map_page().  Returns the DMA address to program into the device,
+ * or DMA_MAPPING_ERROR on failure.
+ */
+dma_addr_t hmm_dma_map_pfn(struct device *dev, struct hmm_dma_map *map,
+			   size_t idx, struct pci_p2pdma_map_state *p2pdma_state)
+{
+	struct dma_iova_state *state = &map->state;
+	dma_addr_t *dma_addrs = map->dma_list;
+	unsigned long *pfns = map->pfn_list;
+	struct page *page = hmm_pfn_to_page(pfns[idx]);
+	phys_addr_t paddr = hmm_pfn_to_phys(pfns[idx]);
+	size_t offset = idx * map->dma_entry_size;
+	dma_addr_t dma_addr;
+	int ret;
+
+	if ((pfns[idx] & HMM_PFN_DMA_MAPPED) &&
+	    !(pfns[idx] & HMM_PFN_P2PDMA_BUS)) {
+		/*
+		 * We are in this flow when there is a need to resync flags,
+		 * for example when page was already linked in prefetch call
+		 * with READ flag and now we need to add WRITE flag
+		 *
+		 * This page was already programmed to HW and we don't want/need
+		 * to unlink and link it again just to resync flags.
+		 */
+		if (dma_use_iova(state))
+			return state->addr + offset;
+
+		/*
+		 * Without dma_need_unmap, the dma_addrs array is NULL, thus we
+		 * need to regenerate the address below even if there already
+		 * was a mapping. But !dma_need_unmap implies that the
+		 * mapping is stateless, so this is fine.
+		 */
+		if (dma_need_unmap(dev))
+			return dma_addrs[idx];
+
+		/* Continue to remapping */
+	}
+
+	switch (pci_p2pdma_state(p2pdma_state, dev, page)) {
+	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+	case PCI_P2PDMA_MAP_NONE:
+		break;
+	case PCI_P2PDMA_MAP_BUS_ADDR:
+		dma_addr = pci_p2pdma_bus_addr_map(p2pdma_state, paddr);
+		pfns[idx] |= HMM_PFN_P2PDMA_BUS;
+		goto done;
+	default:
+		return DMA_MAPPING_ERROR;
+	}
+
+	if (dma_use_iova(state)) {
+		ret = dma_iova_link(dev, state, paddr, offset,
+				    map->dma_entry_size, DMA_BIDIRECTIONAL, 0);
+		if (ret)
+			return DMA_MAPPING_ERROR;
+
+		ret = dma_iova_sync(dev, state, offset, map->dma_entry_size);
+		if (ret)
+			return DMA_MAPPING_ERROR;
+
+		dma_addr = state->addr + offset;
+	} else {
+		if (WARN_ON_ONCE(dma_need_unmap(dev) && !dma_addrs))
+			return DMA_MAPPING_ERROR;
+
+		dma_addr = dma_map_page(dev, page, 0, map->dma_entry_size,
+					DMA_BIDIRECTIONAL);
+		if (dma_mapping_error(dev, dma_addr))
+			return DMA_MAPPING_ERROR;
+
+		if (dma_need_unmap(dev))
+			dma_addrs[idx] = dma_addr;
+	}
+
+done:
+	pfns[idx] |= HMM_PFN_DMA_MAPPED;
+	return dma_addr;
+}
+EXPORT_SYMBOL_GPL(hmm_dma_map_pfn);
+
+/**
+ * hmm_dma_unmap_pfn - Unmap a physical HMM page from DMA address
+ * @dev: Device to unmap the page from
+ * @map: HMM map
+ * @idx: Index of the PFN to unmap
+ *
+ * Returns true if the PFN was mapped and has been unmapped, false otherwise.
+ */
+bool hmm_dma_unmap_pfn(struct device *dev, struct hmm_dma_map *map, size_t idx)
+{
+	struct dma_iova_state *state = &map->state;
+	dma_addr_t *dma_addrs = map->dma_list;
+	unsigned long *pfns = map->pfn_list;
+
+#define HMM_PFN_VALID_DMA (HMM_PFN_VALID | HMM_PFN_DMA_MAPPED)
+	if ((pfns[idx] & HMM_PFN_VALID_DMA) != HMM_PFN_VALID_DMA)
+		return false;
+#undef HMM_PFN_VALID_DMA
+
+	if (pfns[idx] & HMM_PFN_P2PDMA_BUS)
+		; /* no need to unmap bus address P2P mappings */
+	else if (dma_use_iova(state))
+		dma_iova_unlink(dev, state, idx * map->dma_entry_size,
+				map->dma_entry_size, DMA_BIDIRECTIONAL, 0);
+	else if (dma_need_unmap(dev))
+		dma_unmap_page(dev, dma_addrs[idx], map->dma_entry_size,
+			       DMA_BIDIRECTIONAL);
+
+	pfns[idx] &= ~(HMM_PFN_DMA_MAPPED | HMM_PFN_P2PDMA_BUS);
+	return true;
+}
+EXPORT_SYMBOL_GPL(hmm_dma_unmap_pfn);
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v1 12/17] RDMA/umem: Store ODP access mask information in PFN
  2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (10 preceding siblings ...)
  2024-10-30 15:12 ` [PATCH v1 11/17] mm/hmm: provide generic DMA managing logic Leon Romanovsky
@ 2024-10-30 15:12 ` Leon Romanovsky
  2024-10-30 15:12 ` [PATCH v1 13/17] RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage Leon Romanovsky
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-30 15:12 UTC (permalink / raw)
  To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Leon Romanovsky, Keith Busch, Bjorn Helgaas, Logan Gunthorpe,
	Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
	Marek Szyprowski, Jérôme Glisse, Andrew Morton,
	Jonathan Corbet, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm

From: Leon Romanovsky <leonro@nvidia.com>

As a preparation for removing dma_list, store the access mask in the PFN
entry and not in the dma_addr_t.
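
Condensed from the hunks below, the encoding changes roughly as follows
(the "before" lines are the removed umem_dma_to_mtt() logic, the "after"
lines the new populate_mtt() logic):

	/* before: permissions live in the low bits of the DMA address */
	mtt_entry = umem_dma & ODP_DMA_ADDR_MASK;
	if (umem_dma & ODP_WRITE_ALLOWED_BIT)
		mtt_entry |= MLX5_IB_MTT_WRITE;

	/* after: permissions are taken from the HMM PFN entry instead */
	pa = odp->dma_list[idx + i] | MLX5_IB_MTT_READ;
	if ((pfn & HMM_PFN_WRITE) && !downgrade)
		pa |= MLX5_IB_MTT_WRITE;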

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/umem_odp.c   | 100 +++++++++++----------------
 drivers/infiniband/hw/mlx5/mlx5_ib.h |   1 +
 drivers/infiniband/hw/mlx5/odp.c     |  37 +++++-----
 include/rdma/ib_umem_odp.h           |  14 +---
 4 files changed, 61 insertions(+), 91 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index e9fa22d31c23..9dba369365af 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -296,22 +296,11 @@ EXPORT_SYMBOL(ib_umem_odp_release);
 static int ib_umem_odp_map_dma_single_page(
 		struct ib_umem_odp *umem_odp,
 		unsigned int dma_index,
-		struct page *page,
-		u64 access_mask)
+		struct page *page)
 {
 	struct ib_device *dev = umem_odp->umem.ibdev;
 	dma_addr_t *dma_addr = &umem_odp->dma_list[dma_index];
 
-	if (*dma_addr) {
-		/*
-		 * If the page is already dma mapped it means it went through
-		 * a non-invalidating trasition, like read-only to writable.
-		 * Resync the flags.
-		 */
-		*dma_addr = (*dma_addr & ODP_DMA_ADDR_MASK) | access_mask;
-		return 0;
-	}
-
 	*dma_addr = ib_dma_map_page(dev, page, 0, 1 << umem_odp->page_shift,
 				    DMA_BIDIRECTIONAL);
 	if (ib_dma_mapping_error(dev, *dma_addr)) {
@@ -319,7 +308,6 @@ static int ib_umem_odp_map_dma_single_page(
 		return -EFAULT;
 	}
 	umem_odp->npages++;
-	*dma_addr |= access_mask;
 	return 0;
 }
 
@@ -355,9 +343,6 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
 	struct hmm_range range = {};
 	unsigned long timeout;
 
-	if (access_mask == 0)
-		return -EINVAL;
-
 	if (user_virt < ib_umem_start(umem_odp) ||
 	    user_virt + bcnt > ib_umem_end(umem_odp))
 		return -EFAULT;
@@ -383,7 +368,7 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
 	if (fault) {
 		range.default_flags = HMM_PFN_REQ_FAULT;
 
-		if (access_mask & ODP_WRITE_ALLOWED_BIT)
+		if (access_mask & HMM_PFN_WRITE)
 			range.default_flags |= HMM_PFN_REQ_WRITE;
 	}
 
@@ -415,22 +400,17 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
 	for (pfn_index = 0; pfn_index < num_pfns;
 		pfn_index += 1 << (page_shift - PAGE_SHIFT), dma_index++) {
 
-		if (fault) {
-			/*
-			 * Since we asked for hmm_range_fault() to populate
-			 * pages it shouldn't return an error entry on success.
-			 */
-			WARN_ON(range.hmm_pfns[pfn_index] & HMM_PFN_ERROR);
-			WARN_ON(!(range.hmm_pfns[pfn_index] & HMM_PFN_VALID));
-		} else {
-			if (!(range.hmm_pfns[pfn_index] & HMM_PFN_VALID)) {
-				WARN_ON(umem_odp->dma_list[dma_index]);
-				continue;
-			}
-			access_mask = ODP_READ_ALLOWED_BIT;
-			if (range.hmm_pfns[pfn_index] & HMM_PFN_WRITE)
-				access_mask |= ODP_WRITE_ALLOWED_BIT;
-		}
+		/*
+		 * Since we asked for hmm_range_fault() to populate
+		 * pages it shouldn't return an error entry on success.
+		 */
+		WARN_ON(fault && range.hmm_pfns[pfn_index] & HMM_PFN_ERROR);
+		WARN_ON(fault && !(range.hmm_pfns[pfn_index] & HMM_PFN_VALID));
+		if (!(range.hmm_pfns[pfn_index] & HMM_PFN_VALID))
+			continue;
+
+		if (range.hmm_pfns[pfn_index] & HMM_PFN_DMA_MAPPED)
+			continue;
 
 		hmm_order = hmm_pfn_to_map_order(range.hmm_pfns[pfn_index]);
 		/* If a hugepage was detected and ODP wasn't set for, the umem
@@ -445,13 +425,13 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
 		}
 
 		ret = ib_umem_odp_map_dma_single_page(
-				umem_odp, dma_index, hmm_pfn_to_page(range.hmm_pfns[pfn_index]),
-				access_mask);
+				umem_odp, dma_index, hmm_pfn_to_page(range.hmm_pfns[pfn_index]));
 		if (ret < 0) {
 			ibdev_dbg(umem_odp->umem.ibdev,
 				  "ib_umem_odp_map_dma_single_page failed with error %d\n", ret);
 			break;
 		}
+		range.hmm_pfns[pfn_index] |= HMM_PFN_DMA_MAPPED;
 	}
 	/* upon success lock should stay on hold for the callee */
 	if (!ret)
@@ -471,7 +451,6 @@ EXPORT_SYMBOL(ib_umem_odp_map_dma_and_lock);
 void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
 				 u64 bound)
 {
-	dma_addr_t dma_addr;
 	dma_addr_t dma;
 	int idx;
 	u64 addr;
@@ -482,34 +461,35 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
 	virt = max_t(u64, virt, ib_umem_start(umem_odp));
 	bound = min_t(u64, bound, ib_umem_end(umem_odp));
 	for (addr = virt; addr < bound; addr += BIT(umem_odp->page_shift)) {
+		unsigned long pfn_idx = (addr - ib_umem_start(umem_odp)) >> PAGE_SHIFT;
+		struct page *page = hmm_pfn_to_page(umem_odp->pfn_list[pfn_idx]);
+
 		idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
 		dma = umem_odp->dma_list[idx];
 
-		/* The access flags guaranteed a valid DMA address in case was NULL */
-		if (dma) {
-			unsigned long pfn_idx = (addr - ib_umem_start(umem_odp)) >> PAGE_SHIFT;
-			struct page *page = hmm_pfn_to_page(umem_odp->pfn_list[pfn_idx]);
-
-			dma_addr = dma & ODP_DMA_ADDR_MASK;
-			ib_dma_unmap_page(dev, dma_addr,
-					  BIT(umem_odp->page_shift),
-					  DMA_BIDIRECTIONAL);
-			if (dma & ODP_WRITE_ALLOWED_BIT) {
-				struct page *head_page = compound_head(page);
-				/*
-				 * set_page_dirty prefers being called with
-				 * the page lock. However, MMU notifiers are
-				 * called sometimes with and sometimes without
-				 * the lock. We rely on the umem_mutex instead
-				 * to prevent other mmu notifiers from
-				 * continuing and allowing the page mapping to
-				 * be removed.
-				 */
-				set_page_dirty(head_page);
-			}
-			umem_odp->dma_list[idx] = 0;
-			umem_odp->npages--;
+		if (!(umem_odp->pfn_list[pfn_idx] & HMM_PFN_VALID))
+			goto clear;
+		if (!(umem_odp->pfn_list[pfn_idx] & HMM_PFN_DMA_MAPPED))
+			goto clear;
+
+		ib_dma_unmap_page(dev, dma, BIT(umem_odp->page_shift),
+				  DMA_BIDIRECTIONAL);
+		if (umem_odp->pfn_list[pfn_idx] & HMM_PFN_WRITE) {
+			struct page *head_page = compound_head(page);
+			/*
+			 * set_page_dirty prefers being called with
+			 * the page lock. However, MMU notifiers are
+			 * called sometimes with and sometimes without
+			 * the lock. We rely on the umem_mutex instead
+			 * to prevent other mmu notifiers from
+			 * continuing and allowing the page mapping to
+			 * be removed.
+			 */
+			set_page_dirty(head_page);
 		}
+		umem_odp->npages--;
+clear:
+		umem_odp->pfn_list[pfn_idx] &= ~HMM_PFN_FLAGS;
 	}
 }
 EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 23fd72f7f63d..3e4aaa6319db 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -336,6 +336,7 @@ struct mlx5_ib_flow_db {
 #define MLX5_IB_UPD_XLT_PD	      BIT(4)
 #define MLX5_IB_UPD_XLT_ACCESS	      BIT(5)
 #define MLX5_IB_UPD_XLT_INDIRECT      BIT(6)
+#define MLX5_IB_UPD_XLT_DOWNGRADE     BIT(7)
 
 /* Private QP creation flags to be passed in ib_qp_init_attr.create_flags.
  *
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 4b37446758fd..78887500ce15 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -34,6 +34,7 @@
 #include <linux/kernel.h>
 #include <linux/dma-buf.h>
 #include <linux/dma-resv.h>
+#include <linux/hmm.h>
 
 #include "mlx5_ib.h"
 #include "cmd.h"
@@ -158,22 +159,12 @@ static void populate_klm(struct mlx5_klm *pklm, size_t idx, size_t nentries,
 	}
 }
 
-static u64 umem_dma_to_mtt(dma_addr_t umem_dma)
-{
-	u64 mtt_entry = umem_dma & ODP_DMA_ADDR_MASK;
-
-	if (umem_dma & ODP_READ_ALLOWED_BIT)
-		mtt_entry |= MLX5_IB_MTT_READ;
-	if (umem_dma & ODP_WRITE_ALLOWED_BIT)
-		mtt_entry |= MLX5_IB_MTT_WRITE;
-
-	return mtt_entry;
-}
-
 static void populate_mtt(__be64 *pas, size_t idx, size_t nentries,
 			 struct mlx5_ib_mr *mr, int flags)
 {
 	struct ib_umem_odp *odp = to_ib_umem_odp(mr->umem);
+	bool downgrade = flags & MLX5_IB_UPD_XLT_DOWNGRADE;
+	unsigned long pfn;
 	dma_addr_t pa;
 	size_t i;
 
@@ -181,8 +172,17 @@ static void populate_mtt(__be64 *pas, size_t idx, size_t nentries,
 		return;
 
 	for (i = 0; i < nentries; i++) {
+		pfn = odp->pfn_list[idx + i];
+		if (!(pfn & HMM_PFN_VALID))
+			/* ODP initialization */
+			continue;
+
 		pa = odp->dma_list[idx + i];
-		pas[i] = cpu_to_be64(umem_dma_to_mtt(pa));
+		pa |= MLX5_IB_MTT_READ;
+		if ((pfn & HMM_PFN_WRITE) && !downgrade)
+			pa |= MLX5_IB_MTT_WRITE;
+
+		pas[i] = cpu_to_be64(pa);
 	}
 }
 
@@ -286,8 +286,7 @@ static bool mlx5_ib_invalidate_range(struct mmu_interval_notifier *mni,
 		 * estimate the cost of another UMR vs. the cost of bigger
 		 * UMR.
 		 */
-		if (umem_odp->dma_list[idx] &
-		    (ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT)) {
+		if (umem_odp->pfn_list[idx] & HMM_PFN_VALID) {
 			if (!in_block) {
 				blk_start_idx = idx;
 				in_block = 1;
@@ -668,7 +667,7 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
 {
 	int page_shift, ret, np;
 	bool downgrade = flags & MLX5_PF_FLAGS_DOWNGRADE;
-	u64 access_mask;
+	u64 access_mask = 0;
 	u64 start_idx;
 	bool fault = !(flags & MLX5_PF_FLAGS_SNAPSHOT);
 	u32 xlt_flags = MLX5_IB_UPD_XLT_ATOMIC;
@@ -676,12 +675,14 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
 	if (flags & MLX5_PF_FLAGS_ENABLE)
 		xlt_flags |= MLX5_IB_UPD_XLT_ENABLE;
 
+	if (flags & MLX5_PF_FLAGS_DOWNGRADE)
+		xlt_flags |= MLX5_IB_UPD_XLT_DOWNGRADE;
+
 	page_shift = odp->page_shift;
 	start_idx = (user_va - ib_umem_start(odp)) >> page_shift;
-	access_mask = ODP_READ_ALLOWED_BIT;
 
 	if (odp->umem.writable && !downgrade)
-		access_mask |= ODP_WRITE_ALLOWED_BIT;
+		access_mask |= HMM_PFN_WRITE;
 
 	np = ib_umem_odp_map_dma_and_lock(odp, user_va, bcnt, access_mask, fault);
 	if (np < 0)
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index 0844c1d05ac6..a345c26a745d 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -8,6 +8,7 @@
 
 #include <rdma/ib_umem.h>
 #include <rdma/ib_verbs.h>
+#include <linux/hmm.h>
 
 struct ib_umem_odp {
 	struct ib_umem umem;
@@ -67,19 +68,6 @@ static inline size_t ib_umem_odp_num_pages(struct ib_umem_odp *umem_odp)
 	       umem_odp->page_shift;
 }
 
-/*
- * The lower 2 bits of the DMA address signal the R/W permissions for
- * the entry. To upgrade the permissions, provide the appropriate
- * bitmask to the map_dma_pages function.
- *
- * Be aware that upgrading a mapped address might result in change of
- * the DMA address for the page.
- */
-#define ODP_READ_ALLOWED_BIT  (1<<0ULL)
-#define ODP_WRITE_ALLOWED_BIT (1<<1ULL)
-
-#define ODP_DMA_ADDR_MASK (~(ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT))
-
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
 
 struct ib_umem_odp *
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v1 13/17] RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage
  2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (11 preceding siblings ...)
  2024-10-30 15:12 ` [PATCH v1 12/17] RDMA/umem: Store ODP access mask information in PFN Leon Romanovsky
@ 2024-10-30 15:12 ` Leon Romanovsky
  2024-10-30 15:13 ` [PATCH v1 14/17] RDMA/umem: Separate implicit ODP initialization from explicit ODP Leon Romanovsky
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-30 15:12 UTC (permalink / raw)
  To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Leon Romanovsky, Keith Busch, Bjorn Helgaas, Logan Gunthorpe,
	Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
	Marek Szyprowski, Jérôme Glisse, Andrew Morton,
	Jonathan Corbet, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm

From: Leon Romanovsky <leonro@nvidia.com>

Reuse the newly added DMA API to cache the IOVA and only link/unlink pages
in the fast path of the UMEM ODP flow.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/umem_odp.c   | 101 ++++++---------------------
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  11 +--
 drivers/infiniband/hw/mlx5/odp.c     |  40 +++++++----
 drivers/infiniband/hw/mlx5/umr.c     |  12 +++-
 include/rdma/ib_umem_odp.h           |  13 +---
 5 files changed, 69 insertions(+), 108 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 9dba369365af..30cd8f353476 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -41,6 +41,7 @@
 #include <linux/hugetlb.h>
 #include <linux/interval_tree.h>
 #include <linux/hmm.h>
+#include <linux/hmm-dma.h>
 #include <linux/pagemap.h>
 
 #include <rdma/ib_umem_odp.h>
@@ -50,6 +51,7 @@
 static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
 				   const struct mmu_interval_notifier_ops *ops)
 {
+	struct ib_device *dev = umem_odp->umem.ibdev;
 	int ret;
 
 	umem_odp->umem.is_odp = 1;
@@ -59,7 +61,6 @@ static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
 		size_t page_size = 1UL << umem_odp->page_shift;
 		unsigned long start;
 		unsigned long end;
-		size_t ndmas, npfns;
 
 		start = ALIGN_DOWN(umem_odp->umem.address, page_size);
 		if (check_add_overflow(umem_odp->umem.address,
@@ -70,36 +71,23 @@ static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
 		if (unlikely(end < page_size))
 			return -EOVERFLOW;
 
-		ndmas = (end - start) >> umem_odp->page_shift;
-		if (!ndmas)
-			return -EINVAL;
-
-		npfns = (end - start) >> PAGE_SHIFT;
-		umem_odp->pfn_list = kvcalloc(
-			npfns, sizeof(*umem_odp->pfn_list), GFP_KERNEL);
-		if (!umem_odp->pfn_list)
-			return -ENOMEM;
-
-		umem_odp->dma_list = kvcalloc(
-			ndmas, sizeof(*umem_odp->dma_list), GFP_KERNEL);
-		if (!umem_odp->dma_list) {
-			ret = -ENOMEM;
-			goto out_pfn_list;
-		}
+		ret = hmm_dma_map_alloc(dev->dma_device, &umem_odp->map,
+					(end - start) >> PAGE_SHIFT,
+					1 << umem_odp->page_shift);
+		if (ret)
+			return ret;
 
 		ret = mmu_interval_notifier_insert(&umem_odp->notifier,
 						   umem_odp->umem.owning_mm,
 						   start, end - start, ops);
 		if (ret)
-			goto out_dma_list;
+			goto out_free_map;
 	}
 
 	return 0;
 
-out_dma_list:
-	kvfree(umem_odp->dma_list);
-out_pfn_list:
-	kvfree(umem_odp->pfn_list);
+out_free_map:
+	hmm_dma_map_free(dev->dma_device, &umem_odp->map);
 	return ret;
 }
 
@@ -262,6 +250,8 @@ EXPORT_SYMBOL(ib_umem_odp_get);
 
 void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
 {
+	struct ib_device *dev = umem_odp->umem.ibdev;
+
 	/*
 	 * Ensure that no more pages are mapped in the umem.
 	 *
@@ -274,48 +264,17 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
 					    ib_umem_end(umem_odp));
 		mutex_unlock(&umem_odp->umem_mutex);
 		mmu_interval_notifier_remove(&umem_odp->notifier);
-		kvfree(umem_odp->dma_list);
-		kvfree(umem_odp->pfn_list);
+		hmm_dma_map_free(dev->dma_device, &umem_odp->map);
 	}
 	put_pid(umem_odp->tgid);
 	kfree(umem_odp);
 }
 EXPORT_SYMBOL(ib_umem_odp_release);
 
-/*
- * Map for DMA and insert a single page into the on-demand paging page tables.
- *
- * @umem: the umem to insert the page to.
- * @dma_index: index in the umem to add the dma to.
- * @page: the page struct to map and add.
- * @access_mask: access permissions needed for this page.
- *
- * The function returns -EFAULT if the DMA mapping operation fails.
- *
- */
-static int ib_umem_odp_map_dma_single_page(
-		struct ib_umem_odp *umem_odp,
-		unsigned int dma_index,
-		struct page *page)
-{
-	struct ib_device *dev = umem_odp->umem.ibdev;
-	dma_addr_t *dma_addr = &umem_odp->dma_list[dma_index];
-
-	*dma_addr = ib_dma_map_page(dev, page, 0, 1 << umem_odp->page_shift,
-				    DMA_BIDIRECTIONAL);
-	if (ib_dma_mapping_error(dev, *dma_addr)) {
-		*dma_addr = 0;
-		return -EFAULT;
-	}
-	umem_odp->npages++;
-	return 0;
-}
-
 /**
  * ib_umem_odp_map_dma_and_lock - DMA map userspace memory in an ODP MR and lock it.
  *
  * Maps the range passed in the argument to DMA addresses.
- * The DMA addresses of the mapped pages is updated in umem_odp->dma_list.
  * Upon success the ODP MR will be locked to let caller complete its device
  * page table update.
  *
@@ -372,7 +331,7 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
 			range.default_flags |= HMM_PFN_REQ_WRITE;
 	}
 
-	range.hmm_pfns = &(umem_odp->pfn_list[pfn_start_idx]);
+	range.hmm_pfns = &(umem_odp->map.pfn_list[pfn_start_idx]);
 	timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
 
 retry:
@@ -423,15 +382,6 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
 				  __func__, hmm_order, page_shift);
 			break;
 		}
-
-		ret = ib_umem_odp_map_dma_single_page(
-				umem_odp, dma_index, hmm_pfn_to_page(range.hmm_pfns[pfn_index]));
-		if (ret < 0) {
-			ibdev_dbg(umem_odp->umem.ibdev,
-				  "ib_umem_odp_map_dma_single_page failed with error %d\n", ret);
-			break;
-		}
-		range.hmm_pfns[pfn_index] |= HMM_PFN_DMA_MAPPED;
 	}
 	/* upon success lock should stay on hold for the callee */
 	if (!ret)
@@ -451,30 +401,23 @@ EXPORT_SYMBOL(ib_umem_odp_map_dma_and_lock);
 void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
 				 u64 bound)
 {
-	dma_addr_t dma;
-	int idx;
-	u64 addr;
 	struct ib_device *dev = umem_odp->umem.ibdev;
+	u64 addr;
 
 	lockdep_assert_held(&umem_odp->umem_mutex);
 
 	virt = max_t(u64, virt, ib_umem_start(umem_odp));
 	bound = min_t(u64, bound, ib_umem_end(umem_odp));
 	for (addr = virt; addr < bound; addr += BIT(umem_odp->page_shift)) {
-		unsigned long pfn_idx = (addr - ib_umem_start(umem_odp)) >> PAGE_SHIFT;
-		struct page *page = hmm_pfn_to_page(umem_odp->pfn_list[pfn_idx]);
-
-		idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
-		dma = umem_odp->dma_list[idx];
+		u64 offset = addr - ib_umem_start(umem_odp);
+		size_t idx = offset >> umem_odp->page_shift;
+		unsigned long pfn = umem_odp->map.pfn_list[idx];
 
-		if (!(umem_odp->pfn_list[pfn_idx] & HMM_PFN_VALID))
-			goto clear;
-		if (!(umem_odp->pfn_list[pfn_idx] & HMM_PFN_DMA_MAPPED))
+		if (!hmm_dma_unmap_pfn(dev->dma_device, &umem_odp->map, idx))
 			goto clear;
 
-		ib_dma_unmap_page(dev, dma, BIT(umem_odp->page_shift),
-				  DMA_BIDIRECTIONAL);
-		if (umem_odp->pfn_list[pfn_idx] & HMM_PFN_WRITE) {
+		if (pfn & HMM_PFN_WRITE) {
+			struct page *page = hmm_pfn_to_page(pfn);
 			struct page *head_page = compound_head(page);
 			/*
 			 * set_page_dirty prefers being called with
@@ -489,7 +432,7 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
 		}
 		umem_odp->npages--;
 clear:
-		umem_odp->pfn_list[pfn_idx] &= ~HMM_PFN_FLAGS;
+		umem_odp->map.pfn_list[idx] &= ~HMM_PFN_FLAGS;
 	}
 }
 EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 3e4aaa6319db..1bae5595c729 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -1444,8 +1444,8 @@ void mlx5_ib_odp_cleanup_one(struct mlx5_ib_dev *ibdev);
 int __init mlx5_ib_odp_init(void);
 void mlx5_ib_odp_cleanup(void);
 int mlx5_odp_init_mkey_cache(struct mlx5_ib_dev *dev);
-void mlx5_odp_populate_xlt(void *xlt, size_t idx, size_t nentries,
-			   struct mlx5_ib_mr *mr, int flags);
+int mlx5_odp_populate_xlt(void *xlt, size_t idx, size_t nentries,
+			  struct mlx5_ib_mr *mr, int flags);
 
 int mlx5_ib_advise_mr_prefetch(struct ib_pd *pd,
 			       enum ib_uverbs_advise_mr_advice advice,
@@ -1466,8 +1466,11 @@ static inline int mlx5_odp_init_mkey_cache(struct mlx5_ib_dev *dev)
 {
 	return 0;
 }
-static inline void mlx5_odp_populate_xlt(void *xlt, size_t idx, size_t nentries,
-					 struct mlx5_ib_mr *mr, int flags) {}
+static inline int mlx5_odp_populate_xlt(void *xlt, size_t idx, size_t nentries,
+					struct mlx5_ib_mr *mr, int flags)
+{
+	return -EOPNOTSUPP;
+}
 
 static inline int
 mlx5_ib_advise_mr_prefetch(struct ib_pd *pd,
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 78887500ce15..fbb2a5670c32 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -35,6 +35,8 @@
 #include <linux/dma-buf.h>
 #include <linux/dma-resv.h>
 #include <linux/hmm.h>
+#include <linux/hmm-dma.h>
+#include <linux/pci-p2pdma.h>
 
 #include "mlx5_ib.h"
 #include "cmd.h"
@@ -159,40 +161,50 @@ static void populate_klm(struct mlx5_klm *pklm, size_t idx, size_t nentries,
 	}
 }
 
-static void populate_mtt(__be64 *pas, size_t idx, size_t nentries,
-			 struct mlx5_ib_mr *mr, int flags)
+static int populate_mtt(__be64 *pas, size_t start, size_t nentries,
+			struct mlx5_ib_mr *mr, int flags)
 {
 	struct ib_umem_odp *odp = to_ib_umem_odp(mr->umem);
 	bool downgrade = flags & MLX5_IB_UPD_XLT_DOWNGRADE;
-	unsigned long pfn;
-	dma_addr_t pa;
+	struct pci_p2pdma_map_state p2pdma_state = {};
+	struct ib_device *dev = odp->umem.ibdev;
 	size_t i;
 
 	if (flags & MLX5_IB_UPD_XLT_ZAP)
-		return;
+		return 0;
 
 	for (i = 0; i < nentries; i++) {
-		pfn = odp->pfn_list[idx + i];
+		unsigned long pfn = odp->map.pfn_list[start + i];
+		dma_addr_t dma_addr;
+
 		if (!(pfn & HMM_PFN_VALID))
 			/* ODP initialization */
 			continue;
 
-		pa = odp->dma_list[idx + i];
-		pa |= MLX5_IB_MTT_READ;
+		dma_addr = hmm_dma_map_pfn(dev->dma_device, &odp->map,
+					   start + i, &p2pdma_state);
+		if (ib_dma_mapping_error(dev, dma_addr))
+			return -EFAULT;
+
+		dma_addr |= MLX5_IB_MTT_READ;
 		if ((pfn & HMM_PFN_WRITE) && !downgrade)
-			pa |= MLX5_IB_MTT_WRITE;
+			dma_addr |= MLX5_IB_MTT_WRITE;
 
-		pas[i] = cpu_to_be64(pa);
+		pas[i] = cpu_to_be64(dma_addr);
+		odp->npages++;
 	}
+	return 0;
 }
 
-void mlx5_odp_populate_xlt(void *xlt, size_t idx, size_t nentries,
-			   struct mlx5_ib_mr *mr, int flags)
+int mlx5_odp_populate_xlt(void *xlt, size_t idx, size_t nentries,
+			  struct mlx5_ib_mr *mr, int flags)
 {
 	if (flags & MLX5_IB_UPD_XLT_INDIRECT) {
 		populate_klm(xlt, idx, nentries, mr, flags);
+		return 0;
 	} else {
-		populate_mtt(xlt, idx, nentries, mr, flags);
+		return populate_mtt(xlt, idx, nentries, mr, flags);
 	}
 }
 
@@ -286,7 +298,7 @@ static bool mlx5_ib_invalidate_range(struct mmu_interval_notifier *mni,
 		 * estimate the cost of another UMR vs. the cost of bigger
 		 * UMR.
 		 */
-		if (umem_odp->pfn_list[idx] & HMM_PFN_VALID) {
+		if (umem_odp->map.pfn_list[idx] & HMM_PFN_VALID) {
 			if (!in_block) {
 				blk_start_idx = idx;
 				in_block = 1;
diff --git a/drivers/infiniband/hw/mlx5/umr.c b/drivers/infiniband/hw/mlx5/umr.c
index 887fd6fa3ba9..d7fa94ab23cf 100644
--- a/drivers/infiniband/hw/mlx5/umr.c
+++ b/drivers/infiniband/hw/mlx5/umr.c
@@ -811,7 +811,17 @@ int mlx5r_umr_update_xlt(struct mlx5_ib_mr *mr, u64 idx, int npages,
 		size_to_map = npages * desc_size;
 		dma_sync_single_for_cpu(ddev, sg.addr, sg.length,
 					DMA_TO_DEVICE);
-		mlx5_odp_populate_xlt(xlt, idx, npages, mr, flags);
+		/*
+		 * npages is the maximum number of pages to map, but we
+		 * can't guarantee that all pages are actually mapped.
+		 *
+		 * For example, if a page is P2P of a type that is not
+		 * supported for mapping, the number of pages mapped will be
+		 * less than requested.
+		 */
+		err = mlx5_odp_populate_xlt(xlt, idx, npages, mr, flags);
+		if (err)
+			return err;
 		dma_sync_single_for_device(ddev, sg.addr, sg.length,
 					   DMA_TO_DEVICE);
 		sg.length = ALIGN(size_to_map, MLX5_UMR_FLEX_ALIGNMENT);
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index a345c26a745d..2a24bf791c10 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -8,24 +8,17 @@
 
 #include <rdma/ib_umem.h>
 #include <rdma/ib_verbs.h>
-#include <linux/hmm.h>
+#include <linux/hmm-dma.h>
 
 struct ib_umem_odp {
 	struct ib_umem umem;
 	struct mmu_interval_notifier notifier;
 	struct pid *tgid;
 
-	/* An array of the pfns included in the on-demand paging umem. */
-	unsigned long *pfn_list;
+	struct hmm_dma_map map;
 
 	/*
-	 * An array with DMA addresses mapped for pfns in pfn_list.
-	 * The lower two bits designate access permissions.
-	 * See ODP_READ_ALLOWED_BIT and ODP_WRITE_ALLOWED_BIT.
-	 */
-	dma_addr_t		*dma_list;
-	/*
-	 * The umem_mutex protects the page_list and dma_list fields of an ODP
+	 * The umem_mutex protects the page_list field of an ODP
 	 * umem, allowing only a single thread to map/unmap pages. The mutex
 	 * also protects access to the mmu notifier counters.
 	 */
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v1 14/17] RDMA/umem: Separate implicit ODP initialization from explicit ODP
  2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (12 preceding siblings ...)
  2024-10-30 15:12 ` [PATCH v1 13/17] RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage Leon Romanovsky
@ 2024-10-30 15:13 ` Leon Romanovsky
  2024-10-30 15:13 ` [PATCH v1 15/17] vfio/mlx5: Explicitly use number of pages instead of allocated length Leon Romanovsky
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-30 15:13 UTC (permalink / raw)
  To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Leon Romanovsky, Keith Busch, Bjorn Helgaas, Logan Gunthorpe,
	Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
	Marek Szyprowski, Jérôme Glisse, Andrew Morton,
	Jonathan Corbet, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm

From: Leon Romanovsky <leonro@nvidia.com>

Create separate functions for implicit ODP initialization, which is
different from explicit ODP initialization.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/umem_odp.c | 91 +++++++++++++++---------------
 1 file changed, 46 insertions(+), 45 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 30cd8f353476..51d518989914 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -48,41 +48,44 @@
 
 #include "uverbs.h"
 
-static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
-				   const struct mmu_interval_notifier_ops *ops)
+static void ib_init_umem_implicit_odp(struct ib_umem_odp *umem_odp)
+{
+	umem_odp->is_implicit_odp = 1;
+	umem_odp->umem.is_odp = 1;
+	mutex_init(&umem_odp->umem_mutex);
+}
+
+static int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
+			    const struct mmu_interval_notifier_ops *ops)
 {
 	struct ib_device *dev = umem_odp->umem.ibdev;
+	size_t page_size = 1UL << umem_odp->page_shift;
+	unsigned long start;
+	unsigned long end;
 	int ret;
 
 	umem_odp->umem.is_odp = 1;
 	mutex_init(&umem_odp->umem_mutex);
 
-	if (!umem_odp->is_implicit_odp) {
-		size_t page_size = 1UL << umem_odp->page_shift;
-		unsigned long start;
-		unsigned long end;
-
-		start = ALIGN_DOWN(umem_odp->umem.address, page_size);
-		if (check_add_overflow(umem_odp->umem.address,
-				       (unsigned long)umem_odp->umem.length,
-				       &end))
-			return -EOVERFLOW;
-		end = ALIGN(end, page_size);
-		if (unlikely(end < page_size))
-			return -EOVERFLOW;
-
-		ret = hmm_dma_map_alloc(dev->dma_device, &umem_odp->map,
-					(end - start) >> PAGE_SHIFT,
-					1 << umem_odp->page_shift);
-		if (ret)
-			return ret;
-
-		ret = mmu_interval_notifier_insert(&umem_odp->notifier,
-						   umem_odp->umem.owning_mm,
-						   start, end - start, ops);
-		if (ret)
-			goto out_free_map;
-	}
+	start = ALIGN_DOWN(umem_odp->umem.address, page_size);
+	if (check_add_overflow(umem_odp->umem.address,
+			       (unsigned long)umem_odp->umem.length, &end))
+		return -EOVERFLOW;
+	end = ALIGN(end, page_size);
+	if (unlikely(end < page_size))
+		return -EOVERFLOW;
+
+	ret = hmm_dma_map_alloc(dev->dma_device, &umem_odp->map,
+				(end - start) >> PAGE_SHIFT,
+				1 << umem_odp->page_shift);
+	if (ret)
+		return ret;
+
+	ret = mmu_interval_notifier_insert(&umem_odp->notifier,
+					   umem_odp->umem.owning_mm, start,
+					   end - start, ops);
+	if (ret)
+		goto out_free_map;
 
 	return 0;
 
@@ -106,7 +109,6 @@ struct ib_umem_odp *ib_umem_odp_alloc_implicit(struct ib_device *device,
 {
 	struct ib_umem *umem;
 	struct ib_umem_odp *umem_odp;
-	int ret;
 
 	if (access & IB_ACCESS_HUGETLB)
 		return ERR_PTR(-EINVAL);
@@ -118,16 +120,10 @@ struct ib_umem_odp *ib_umem_odp_alloc_implicit(struct ib_device *device,
 	umem->ibdev = device;
 	umem->writable = ib_access_writable(access);
 	umem->owning_mm = current->mm;
-	umem_odp->is_implicit_odp = 1;
 	umem_odp->page_shift = PAGE_SHIFT;
 
 	umem_odp->tgid = get_task_pid(current->group_leader, PIDTYPE_PID);
-	ret = ib_init_umem_odp(umem_odp, NULL);
-	if (ret) {
-		put_pid(umem_odp->tgid);
-		kfree(umem_odp);
-		return ERR_PTR(ret);
-	}
+	ib_init_umem_implicit_odp(umem_odp);
 	return umem_odp;
 }
 EXPORT_SYMBOL(ib_umem_odp_alloc_implicit);
@@ -248,7 +244,7 @@ struct ib_umem_odp *ib_umem_odp_get(struct ib_device *device,
 }
 EXPORT_SYMBOL(ib_umem_odp_get);
 
-void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
+static void ib_umem_odp_free(struct ib_umem_odp *umem_odp)
 {
 	struct ib_device *dev = umem_odp->umem.ibdev;
 
@@ -258,14 +254,19 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
 	 * It is the driver's responsibility to ensure, before calling us,
 	 * that the hardware will not attempt to access the MR any more.
 	 */
-	if (!umem_odp->is_implicit_odp) {
-		mutex_lock(&umem_odp->umem_mutex);
-		ib_umem_odp_unmap_dma_pages(umem_odp, ib_umem_start(umem_odp),
-					    ib_umem_end(umem_odp));
-		mutex_unlock(&umem_odp->umem_mutex);
-		mmu_interval_notifier_remove(&umem_odp->notifier);
-		hmm_dma_map_free(dev->dma_device, &umem_odp->map);
-	}
+	mutex_lock(&umem_odp->umem_mutex);
+	ib_umem_odp_unmap_dma_pages(umem_odp, ib_umem_start(umem_odp),
+				    ib_umem_end(umem_odp));
+	mutex_unlock(&umem_odp->umem_mutex);
+	mmu_interval_notifier_remove(&umem_odp->notifier);
+	hmm_dma_map_free(dev->dma_device, &umem_odp->map);
+}
+
+void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
+{
+	if (!umem_odp->is_implicit_odp)
+		ib_umem_odp_free(umem_odp);
+
 	put_pid(umem_odp->tgid);
 	kfree(umem_odp);
 }
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v1 15/17] vfio/mlx5: Explicitly use number of pages instead of allocated length
  2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (13 preceding siblings ...)
  2024-10-30 15:13 ` [PATCH v1 14/17] RDMA/umem: Separate implicit ODP initialization from explicit ODP Leon Romanovsky
@ 2024-10-30 15:13 ` Leon Romanovsky
  2024-10-30 15:13 ` [PATCH v1 16/17] vfio/mlx5: Rewrite create mkey flow to allow better code reuse Leon Romanovsky
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-30 15:13 UTC (permalink / raw)
  To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Leon Romanovsky, Keith Busch, Bjorn Helgaas, Logan Gunthorpe,
	Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
	Marek Szyprowski, Jérôme Glisse, Andrew Morton,
	Jonathan Corbet, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm

From: Leon Romanovsky <leonro@nvidia.com>

allocated_length is always the number of pages multiplied by the page
size, so change the functions to accept the number of pages instead.
This opens the way to combining the receive and send paths later and
improves code readability.
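
For illustration, a caller that used to pass a byte length now converts
it to pages up front; a minimal sketch based on the call sites changed
below (all names are taken from this patch):

        /* before: buf = mlx5vf_get_data_buffer(migf, length, dma_dir); */
        u32 npages = DIV_ROUND_UP(length, PAGE_SIZE);

        buf = mlx5vf_get_data_buffer(migf, npages, dma_dir);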

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c  | 32 ++++++++++-----------
 drivers/vfio/pci/mlx5/cmd.h  | 10 +++----
 drivers/vfio/pci/mlx5/main.c | 56 +++++++++++++++++++++++-------------
 3 files changed, 57 insertions(+), 41 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index 41a4b0cf4297..fdc3e515741f 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -318,8 +318,7 @@ static int _create_mkey(struct mlx5_core_dev *mdev, u32 pdn,
 			struct mlx5_vhca_recv_buf *recv_buf,
 			u32 *mkey)
 {
-	size_t npages = buf ? DIV_ROUND_UP(buf->allocated_length, PAGE_SIZE) :
-				recv_buf->npages;
+	size_t npages = buf ? buf->npages : recv_buf->npages;
 	int err = 0, inlen;
 	__be64 *mtt;
 	void *mkc;
@@ -375,7 +374,7 @@ static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
 	if (mvdev->mdev_detach)
 		return -ENOTCONN;
 
-	if (buf->dmaed || !buf->allocated_length)
+	if (buf->dmaed || !buf->npages)
 		return -EINVAL;
 
 	ret = dma_map_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
@@ -444,7 +443,7 @@ static int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
 
 		if (ret)
 			goto err;
-		buf->allocated_length += filled * PAGE_SIZE;
+		buf->npages += filled;
 		/* clean input for another bulk allocation */
 		memset(page_list, 0, filled * sizeof(*page_list));
 		to_fill = min_t(unsigned int, to_alloc,
@@ -460,8 +459,7 @@ static int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
 }
 
 struct mlx5_vhca_data_buffer *
-mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf,
-			 size_t length,
+mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
 			 enum dma_data_direction dma_dir)
 {
 	struct mlx5_vhca_data_buffer *buf;
@@ -473,9 +471,8 @@ mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf,
 
 	buf->dma_dir = dma_dir;
 	buf->migf = migf;
-	if (length) {
-		ret = mlx5vf_add_migration_pages(buf,
-				DIV_ROUND_UP_ULL(length, PAGE_SIZE));
+	if (npages) {
+		ret = mlx5vf_add_migration_pages(buf, npages);
 		if (ret)
 			goto end;
 
@@ -501,8 +498,8 @@ void mlx5vf_put_data_buffer(struct mlx5_vhca_data_buffer *buf)
 }
 
 struct mlx5_vhca_data_buffer *
-mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf,
-		       size_t length, enum dma_data_direction dma_dir)
+mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
+		       enum dma_data_direction dma_dir)
 {
 	struct mlx5_vhca_data_buffer *buf, *temp_buf;
 	struct list_head free_list;
@@ -517,7 +514,7 @@ mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf,
 	list_for_each_entry_safe(buf, temp_buf, &migf->avail_list, buf_elm) {
 		if (buf->dma_dir == dma_dir) {
 			list_del_init(&buf->buf_elm);
-			if (buf->allocated_length >= length) {
+			if (buf->npages >= npages) {
 				spin_unlock_irq(&migf->list_lock);
 				goto found;
 			}
@@ -531,7 +528,7 @@ mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf,
 		}
 	}
 	spin_unlock_irq(&migf->list_lock);
-	buf = mlx5vf_alloc_data_buffer(migf, length, dma_dir);
+	buf = mlx5vf_alloc_data_buffer(migf, npages, dma_dir);
 
 found:
 	while ((temp_buf = list_first_entry_or_null(&free_list,
@@ -712,7 +709,7 @@ int mlx5vf_cmd_save_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 	MLX5_SET(save_vhca_state_in, in, op_mod, 0);
 	MLX5_SET(save_vhca_state_in, in, vhca_id, mvdev->vhca_id);
 	MLX5_SET(save_vhca_state_in, in, mkey, buf->mkey);
-	MLX5_SET(save_vhca_state_in, in, size, buf->allocated_length);
+	MLX5_SET(save_vhca_state_in, in, size, buf->npages * PAGE_SIZE);
 	MLX5_SET(save_vhca_state_in, in, incremental, inc);
 	MLX5_SET(save_vhca_state_in, in, set_track, track);
 
@@ -734,8 +731,11 @@ int mlx5vf_cmd_save_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 	}
 
 	if (!header_buf) {
-		header_buf = mlx5vf_get_data_buffer(migf,
-			sizeof(struct mlx5_vf_migration_header), DMA_NONE);
+		header_buf = mlx5vf_get_data_buffer(
+			migf,
+			DIV_ROUND_UP(sizeof(struct mlx5_vf_migration_header),
+				     PAGE_SIZE),
+			DMA_NONE);
 		if (IS_ERR(header_buf)) {
 			err = PTR_ERR(header_buf);
 			goto err_free;
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index df421dc6de04..7d4a833b6900 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -56,7 +56,7 @@ struct mlx5_vhca_data_buffer {
 	struct sg_append_table table;
 	loff_t start_pos;
 	u64 length;
-	u64 allocated_length;
+	u32 npages;
 	u32 mkey;
 	enum dma_data_direction dma_dir;
 	u8 dmaed:1;
@@ -217,12 +217,12 @@ int mlx5vf_cmd_alloc_pd(struct mlx5_vf_migration_file *migf);
 void mlx5vf_cmd_dealloc_pd(struct mlx5_vf_migration_file *migf);
 void mlx5fv_cmd_clean_migf_resources(struct mlx5_vf_migration_file *migf);
 struct mlx5_vhca_data_buffer *
-mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf,
-			 size_t length, enum dma_data_direction dma_dir);
+mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
+			 enum dma_data_direction dma_dir);
 void mlx5vf_free_data_buffer(struct mlx5_vhca_data_buffer *buf);
 struct mlx5_vhca_data_buffer *
-mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf,
-		       size_t length, enum dma_data_direction dma_dir);
+mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
+		       enum dma_data_direction dma_dir);
 void mlx5vf_put_data_buffer(struct mlx5_vhca_data_buffer *buf);
 struct page *mlx5vf_get_migration_page(struct mlx5_vhca_data_buffer *buf,
 				       unsigned long offset);
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index 242c23eef452..a1dbee3be1e0 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -308,6 +308,7 @@ static struct mlx5_vhca_data_buffer *
 mlx5vf_mig_file_get_stop_copy_buf(struct mlx5_vf_migration_file *migf,
 				  u8 index, size_t required_length)
 {
+	u32 npages = DIV_ROUND_UP(required_length, PAGE_SIZE);
 	struct mlx5_vhca_data_buffer *buf = migf->buf[index];
 	u8 chunk_num;
 
@@ -315,12 +316,11 @@ mlx5vf_mig_file_get_stop_copy_buf(struct mlx5_vf_migration_file *migf,
 	chunk_num = buf->stop_copy_chunk_num;
 	buf->migf->buf[index] = NULL;
 	/* Checking whether the pre-allocated buffer can fit */
-	if (buf->allocated_length >= required_length)
+	if (buf->npages >= npages)
 		return buf;
 
 	mlx5vf_put_data_buffer(buf);
-	buf = mlx5vf_get_data_buffer(buf->migf, required_length,
-				     DMA_FROM_DEVICE);
+	buf = mlx5vf_get_data_buffer(buf->migf, npages, DMA_FROM_DEVICE);
 	if (IS_ERR(buf))
 		return buf;
 
@@ -373,7 +373,8 @@ static int mlx5vf_add_stop_copy_header(struct mlx5_vf_migration_file *migf,
 	u8 *to_buff;
 	int ret;
 
-	header_buf = mlx5vf_get_data_buffer(migf, size, DMA_NONE);
+	header_buf = mlx5vf_get_data_buffer(migf, DIV_ROUND_UP(size, PAGE_SIZE),
+					    DMA_NONE);
 	if (IS_ERR(header_buf))
 		return PTR_ERR(header_buf);
 
@@ -388,7 +389,7 @@ static int mlx5vf_add_stop_copy_header(struct mlx5_vf_migration_file *migf,
 	to_buff = kmap_local_page(page);
 	memcpy(to_buff, &header, sizeof(header));
 	header_buf->length = sizeof(header);
-	data.stop_copy_size = cpu_to_le64(migf->buf[0]->allocated_length);
+	data.stop_copy_size = cpu_to_le64(migf->buf[0]->npages * PAGE_SIZE);
 	memcpy(to_buff + sizeof(header), &data, sizeof(data));
 	header_buf->length += sizeof(data);
 	kunmap_local(to_buff);
@@ -437,15 +438,20 @@ static int mlx5vf_prep_stop_copy(struct mlx5vf_pci_core_device *mvdev,
 
 	num_chunks = mvdev->chunk_mode ? MAX_NUM_CHUNKS : 1;
 	for (i = 0; i < num_chunks; i++) {
-		buf = mlx5vf_get_data_buffer(migf, inc_state_size, DMA_FROM_DEVICE);
+		buf = mlx5vf_get_data_buffer(
+			migf, DIV_ROUND_UP(inc_state_size, PAGE_SIZE),
+			DMA_FROM_DEVICE);
 		if (IS_ERR(buf)) {
 			ret = PTR_ERR(buf);
 			goto err;
 		}
 
 		migf->buf[i] = buf;
-		buf = mlx5vf_get_data_buffer(migf,
-				sizeof(struct mlx5_vf_migration_header), DMA_NONE);
+		buf = mlx5vf_get_data_buffer(
+			migf,
+			DIV_ROUND_UP(sizeof(struct mlx5_vf_migration_header),
+				     PAGE_SIZE),
+			DMA_NONE);
 		if (IS_ERR(buf)) {
 			ret = PTR_ERR(buf);
 			goto err;
@@ -553,7 +559,8 @@ static long mlx5vf_precopy_ioctl(struct file *filp, unsigned int cmd,
 	 * We finished transferring the current state and the device has a
 	 * dirty state, save a new state to be ready for.
 	 */
-	buf = mlx5vf_get_data_buffer(migf, inc_length, DMA_FROM_DEVICE);
+	buf = mlx5vf_get_data_buffer(migf, DIV_ROUND_UP(inc_length, PAGE_SIZE),
+				     DMA_FROM_DEVICE);
 	if (IS_ERR(buf)) {
 		ret = PTR_ERR(buf);
 		mlx5vf_mark_err(migf);
@@ -673,8 +680,8 @@ mlx5vf_pci_save_device_data(struct mlx5vf_pci_core_device *mvdev, bool track)
 
 	if (track) {
 		/* leave the allocated buffer ready for the stop-copy phase */
-		buf = mlx5vf_alloc_data_buffer(migf,
-			migf->buf[0]->allocated_length, DMA_FROM_DEVICE);
+		buf = mlx5vf_alloc_data_buffer(migf, migf->buf[0]->npages,
+					       DMA_FROM_DEVICE);
 		if (IS_ERR(buf)) {
 			ret = PTR_ERR(buf);
 			goto out_pd;
@@ -917,11 +924,14 @@ static ssize_t mlx5vf_resume_write(struct file *filp, const char __user *buf,
 				goto out_unlock;
 			break;
 		case MLX5_VF_LOAD_STATE_PREP_HEADER_DATA:
-			if (vhca_buf_header->allocated_length < migf->record_size) {
+		{
+			u32 npages = DIV_ROUND_UP(migf->record_size, PAGE_SIZE);
+
+			if (vhca_buf_header->npages < npages) {
 				mlx5vf_free_data_buffer(vhca_buf_header);
 
-				migf->buf_header[0] = mlx5vf_alloc_data_buffer(migf,
-						migf->record_size, DMA_NONE);
+				migf->buf_header[0] = mlx5vf_alloc_data_buffer(
+					migf, npages, DMA_NONE);
 				if (IS_ERR(migf->buf_header[0])) {
 					ret = PTR_ERR(migf->buf_header[0]);
 					migf->buf_header[0] = NULL;
@@ -934,6 +944,7 @@ static ssize_t mlx5vf_resume_write(struct file *filp, const char __user *buf,
 			vhca_buf_header->start_pos = migf->max_pos;
 			migf->load_state = MLX5_VF_LOAD_STATE_READ_HEADER_DATA;
 			break;
+		}
 		case MLX5_VF_LOAD_STATE_READ_HEADER_DATA:
 			ret = mlx5vf_resume_read_header_data(migf, vhca_buf_header,
 							&buf, &len, pos, &done);
@@ -944,12 +955,13 @@ static ssize_t mlx5vf_resume_write(struct file *filp, const char __user *buf,
 		{
 			u64 size = max(migf->record_size,
 				       migf->stop_copy_prep_size);
+			u32 npages = DIV_ROUND_UP(size, PAGE_SIZE);
 
-			if (vhca_buf->allocated_length < size) {
+			if (vhca_buf->npages < npages) {
 				mlx5vf_free_data_buffer(vhca_buf);
 
-				migf->buf[0] = mlx5vf_alloc_data_buffer(migf,
-							size, DMA_TO_DEVICE);
+				migf->buf[0] = mlx5vf_alloc_data_buffer(
+					migf, npages, DMA_TO_DEVICE);
 				if (IS_ERR(migf->buf[0])) {
 					ret = PTR_ERR(migf->buf[0]);
 					migf->buf[0] = NULL;
@@ -1031,8 +1043,11 @@ mlx5vf_pci_resume_device_data(struct mlx5vf_pci_core_device *mvdev)
 	}
 
 	migf->buf[0] = buf;
-	buf = mlx5vf_alloc_data_buffer(migf,
-		sizeof(struct mlx5_vf_migration_header), DMA_NONE);
+	buf = mlx5vf_alloc_data_buffer(
+		migf,
+		DIV_ROUND_UP(sizeof(struct mlx5_vf_migration_header),
+			     PAGE_SIZE),
+		DMA_NONE);
 	if (IS_ERR(buf)) {
 		ret = PTR_ERR(buf);
 		goto out_buf;
@@ -1149,7 +1164,8 @@ mlx5vf_pci_step_device_state_locked(struct mlx5vf_pci_core_device *mvdev,
 					MLX5VF_QUERY_INC | MLX5VF_QUERY_CLEANUP);
 		if (ret)
 			return ERR_PTR(ret);
-		buf = mlx5vf_get_data_buffer(migf, size, DMA_FROM_DEVICE);
+		buf = mlx5vf_get_data_buffer(migf,
+				DIV_ROUND_UP(size, PAGE_SIZE), DMA_FROM_DEVICE);
 		if (IS_ERR(buf))
 			return ERR_CAST(buf);
 		/* pre_copy cleanup */
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v1 16/17] vfio/mlx5: Rewrite create mkey flow to allow better code reuse
  2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (14 preceding siblings ...)
  2024-10-30 15:13 ` [PATCH v1 15/17] vfio/mlx5: Explicitly use number of pages instead of allocated length Leon Romanovsky
@ 2024-10-30 15:13 ` Leon Romanovsky
  2024-10-30 15:13 ` [PATCH v1 17/17] vfio/mlx5: Convert vfio to use DMA link API Leon Romanovsky
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-30 15:13 UTC (permalink / raw)
  To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Leon Romanovsky, Keith Busch, Bjorn Helgaas, Logan Gunthorpe,
	Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
	Marek Szyprowski, Jérôme Glisse, Andrew Morton,
	Jonathan Corbet, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm

From: Leon Romanovsky <leonro@nvidia.com>

Change mkey creation to be performed in multiple steps: data
allocation, DMA setup and the actual HW call that creates the mkey.

In this new flow, the whole MKEY command input is kept around, which
eliminates the need to keep a separate array of pointers to the DMA
addresses of the receive list, and in future patches of the send list
too.

In addition to reducing memory usage and eliminating unnecessary data
movement when building the MKEY input, the code is prepared for future
reuse.
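
As an illustration, the receive-buffer path ends up looking roughly
like this (a sketch using only the helpers introduced in this patch;
error unwinding is omitted):

        /* 1. allocate the full MKEY command input, including MTT space */
        recv_buf->mkey_in = alloc_mkey_in(npages, pdn);

        /* 2. DMA map the pages and write their addresses into the MTT */
        err = register_dma_pages(mdev, npages, recv_buf->page_list,
                                 recv_buf->mkey_in);

        /* 3. ask the HW to create the mkey from the prepared input */
        err = create_mkey(mdev, npages, NULL, recv_buf->mkey_in,
                          &recv_buf->mkey);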

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c | 156 ++++++++++++++++++++----------------
 drivers/vfio/pci/mlx5/cmd.h |   4 +-
 2 files changed, 90 insertions(+), 70 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index fdc3e515741f..1832a6c1f35d 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -313,39 +313,21 @@ static int mlx5vf_cmd_get_vhca_id(struct mlx5_core_dev *mdev, u16 function_id,
 	return ret;
 }
 
-static int _create_mkey(struct mlx5_core_dev *mdev, u32 pdn,
-			struct mlx5_vhca_data_buffer *buf,
-			struct mlx5_vhca_recv_buf *recv_buf,
-			u32 *mkey)
+static u32 *alloc_mkey_in(u32 npages, u32 pdn)
 {
-	size_t npages = buf ? buf->npages : recv_buf->npages;
-	int err = 0, inlen;
-	__be64 *mtt;
+	int inlen;
 	void *mkc;
 	u32 *in;
 
 	inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
-		sizeof(*mtt) * round_up(npages, 2);
+		sizeof(__be64) * round_up(npages, 2);
 
-	in = kvzalloc(inlen, GFP_KERNEL);
+	in = kvzalloc(inlen, GFP_KERNEL_ACCOUNT);
 	if (!in)
-		return -ENOMEM;
+		return NULL;
 
 	MLX5_SET(create_mkey_in, in, translations_octword_actual_size,
 		 DIV_ROUND_UP(npages, 2));
-	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, in, klm_pas_mtt);
-
-	if (buf) {
-		struct sg_dma_page_iter dma_iter;
-
-		for_each_sgtable_dma_page(&buf->table.sgt, &dma_iter, 0)
-			*mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter));
-	} else {
-		int i;
-
-		for (i = 0; i < npages; i++)
-			*mtt++ = cpu_to_be64(recv_buf->dma_addrs[i]);
-	}
 
 	mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
 	MLX5_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_MTT);
@@ -359,9 +341,29 @@ static int _create_mkey(struct mlx5_core_dev *mdev, u32 pdn,
 	MLX5_SET(mkc, mkc, log_page_size, PAGE_SHIFT);
 	MLX5_SET(mkc, mkc, translations_octword_size, DIV_ROUND_UP(npages, 2));
 	MLX5_SET64(mkc, mkc, len, npages * PAGE_SIZE);
-	err = mlx5_core_create_mkey(mdev, mkey, in, inlen);
-	kvfree(in);
-	return err;
+
+	return in;
+}
+
+static int create_mkey(struct mlx5_core_dev *mdev, u32 npages,
+		       struct mlx5_vhca_data_buffer *buf, u32 *mkey_in,
+		       u32 *mkey)
+{
+	__be64 *mtt;
+	int inlen;
+
+	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
+	if (buf) {
+		struct sg_dma_page_iter dma_iter;
+
+		for_each_sgtable_dma_page(&buf->table.sgt, &dma_iter, 0)
+			*mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter));
+	}
+
+	inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
+		sizeof(__be64) * round_up(npages, 2);
+
+	return mlx5_core_create_mkey(mdev, mkey, mkey_in, inlen);
 }
 
 static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
@@ -374,20 +376,28 @@ static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
 	if (mvdev->mdev_detach)
 		return -ENOTCONN;
 
-	if (buf->dmaed || !buf->npages)
+	if (buf->mkey_in || !buf->npages)
 		return -EINVAL;
 
 	ret = dma_map_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
 	if (ret)
 		return ret;
 
-	ret = _create_mkey(mdev, buf->migf->pdn, buf, NULL, &buf->mkey);
-	if (ret)
+	buf->mkey_in = alloc_mkey_in(buf->npages, buf->migf->pdn);
+	if (!buf->mkey_in) {
+		ret = -ENOMEM;
 		goto err;
+	}
 
-	buf->dmaed = true;
+	ret = create_mkey(mdev, buf->npages, buf, buf->mkey_in, &buf->mkey);
+	if (ret)
+		goto err_create_mkey;
 
 	return 0;
+
+err_create_mkey:
+	kvfree(buf->mkey_in);
+	buf->mkey_in = NULL;
 err:
 	dma_unmap_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
 	return ret;
@@ -401,8 +411,9 @@ void mlx5vf_free_data_buffer(struct mlx5_vhca_data_buffer *buf)
 	lockdep_assert_held(&migf->mvdev->state_mutex);
 	WARN_ON(migf->mvdev->mdev_detach);
 
-	if (buf->dmaed) {
+	if (buf->mkey_in) {
 		mlx5_core_destroy_mkey(migf->mvdev->mdev, buf->mkey);
+		kvfree(buf->mkey_in);
 		dma_unmap_sgtable(migf->mvdev->mdev->device, &buf->table.sgt,
 				  buf->dma_dir, 0);
 	}
@@ -779,7 +790,7 @@ int mlx5vf_cmd_load_vhca_state(struct mlx5vf_pci_core_device *mvdev,
 	if (mvdev->mdev_detach)
 		return -ENOTCONN;
 
-	if (!buf->dmaed) {
+	if (!buf->mkey_in) {
 		err = mlx5vf_dma_data_buffer(buf);
 		if (err)
 			return err;
@@ -1380,56 +1391,54 @@ static int alloc_recv_pages(struct mlx5_vhca_recv_buf *recv_buf,
 	kvfree(recv_buf->page_list);
 	return -ENOMEM;
 }
+static void unregister_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
+				 u32 *mkey_in)
+{
+	dma_addr_t addr;
+	__be64 *mtt;
+	int i;
+
+	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
+	for (i = npages - 1; i >= 0; i--) {
+		addr = be64_to_cpu(mtt[i]);
+		dma_unmap_single(mdev->device, addr, PAGE_SIZE,
+				DMA_FROM_DEVICE);
+	}
+}
 
-static int register_dma_recv_pages(struct mlx5_core_dev *mdev,
-				   struct mlx5_vhca_recv_buf *recv_buf)
+static int register_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
+			      struct page **page_list, u32 *mkey_in)
 {
-	int i, j;
+	dma_addr_t addr;
+	__be64 *mtt;
+	int i;
 
-	recv_buf->dma_addrs = kvcalloc(recv_buf->npages,
-				       sizeof(*recv_buf->dma_addrs),
-				       GFP_KERNEL_ACCOUNT);
-	if (!recv_buf->dma_addrs)
-		return -ENOMEM;
+	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
 
-	for (i = 0; i < recv_buf->npages; i++) {
-		recv_buf->dma_addrs[i] = dma_map_page(mdev->device,
-						      recv_buf->page_list[i],
-						      0, PAGE_SIZE,
-						      DMA_FROM_DEVICE);
-		if (dma_mapping_error(mdev->device, recv_buf->dma_addrs[i]))
+	for (i = 0; i < npages; i++) {
+		addr = dma_map_page(mdev->device, page_list[i], 0, PAGE_SIZE,
+				    DMA_FROM_DEVICE);
+		if (dma_mapping_error(mdev->device, addr))
 			goto error;
+
+		*mtt++ = cpu_to_be64(addr);
 	}
+
 	return 0;
 
 error:
-	for (j = 0; j < i; j++)
-		dma_unmap_single(mdev->device, recv_buf->dma_addrs[j],
-				 PAGE_SIZE, DMA_FROM_DEVICE);
-
-	kvfree(recv_buf->dma_addrs);
+	unregister_dma_pages(mdev, i, mkey_in);
 	return -ENOMEM;
 }
 
-static void unregister_dma_recv_pages(struct mlx5_core_dev *mdev,
-				      struct mlx5_vhca_recv_buf *recv_buf)
-{
-	int i;
-
-	for (i = 0; i < recv_buf->npages; i++)
-		dma_unmap_single(mdev->device, recv_buf->dma_addrs[i],
-				 PAGE_SIZE, DMA_FROM_DEVICE);
-
-	kvfree(recv_buf->dma_addrs);
-}
-
 static void mlx5vf_free_qp_recv_resources(struct mlx5_core_dev *mdev,
 					  struct mlx5_vhca_qp *qp)
 {
 	struct mlx5_vhca_recv_buf *recv_buf = &qp->recv_buf;
 
 	mlx5_core_destroy_mkey(mdev, recv_buf->mkey);
-	unregister_dma_recv_pages(mdev, recv_buf);
+	unregister_dma_pages(mdev, recv_buf->npages, recv_buf->mkey_in);
+	kvfree(recv_buf->mkey_in);
 	free_recv_pages(&qp->recv_buf);
 }
 
@@ -1445,18 +1454,29 @@ static int mlx5vf_alloc_qp_recv_resources(struct mlx5_core_dev *mdev,
 	if (err < 0)
 		return err;
 
-	err = register_dma_recv_pages(mdev, recv_buf);
-	if (err)
+	recv_buf->mkey_in = alloc_mkey_in(npages, pdn);
+	if (!recv_buf->mkey_in) {
+		err = -ENOMEM;
 		goto end;
+	}
+
+	err = register_dma_pages(mdev, npages, recv_buf->page_list,
+				 recv_buf->mkey_in);
+	if (err)
+		goto err_register_dma;
 
-	err = _create_mkey(mdev, pdn, NULL, recv_buf, &recv_buf->mkey);
+	err = create_mkey(mdev, npages, NULL, recv_buf->mkey_in,
+			  &recv_buf->mkey);
 	if (err)
 		goto err_create_mkey;
 
 	return 0;
 
 err_create_mkey:
-	unregister_dma_recv_pages(mdev, recv_buf);
+	unregister_dma_pages(mdev, npages, recv_buf->mkey_in);
+err_register_dma:
+	kvfree(recv_buf->mkey_in);
+	recv_buf->mkey_in = NULL;
 end:
 	free_recv_pages(recv_buf);
 	return err;
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 7d4a833b6900..25dd6ff54591 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -58,8 +58,8 @@ struct mlx5_vhca_data_buffer {
 	u64 length;
 	u32 npages;
 	u32 mkey;
+	u32 *mkey_in;
 	enum dma_data_direction dma_dir;
-	u8 dmaed:1;
 	u8 stop_copy_chunk_num;
 	struct list_head buf_elm;
 	struct mlx5_vf_migration_file *migf;
@@ -133,8 +133,8 @@ struct mlx5_vhca_cq {
 struct mlx5_vhca_recv_buf {
 	u32 npages;
 	struct page **page_list;
-	dma_addr_t *dma_addrs;
 	u32 next_rq_offset;
+	u32 *mkey_in;
 	u32 mkey;
 };
 
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v1 17/17] vfio/mlx5: Convert vfio to use DMA link API
  2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (15 preceding siblings ...)
  2024-10-30 15:13 ` [PATCH v1 16/17] vfio/mlx5: Rewrite create mkey flow to allow better code reuse Leon Romanovsky
@ 2024-10-30 15:13 ` Leon Romanovsky
  2024-10-31  1:44 ` [PATCH v1 00/17] Provide a new two step DMA mapping API Jens Axboe
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-30 15:13 UTC (permalink / raw)
  To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Leon Romanovsky, Keith Busch, Bjorn Helgaas, Logan Gunthorpe,
	Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
	Marek Szyprowski, Jérôme Glisse, Andrew Morton,
	Jonathan Corbet, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm

From: Leon Romanovsky <leonro@nvidia.com>

Remove the intermediate scatter-gather table, as it is not needed once
the DMA link API is used. This conversion drastically reduces the
memory needed to manage that table.
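
For reference, per-page mapping now follows the IOVA-based flow; a
simplified sketch of the register_dma_pages() change in this patch
(filling the MTT entries, the body of the dma_map_page() fallback and
the unwind behind the error label, unregister_dma_pages(), are left
out):

        if (dma_iova_try_alloc(mdev->device, state, 0, npages * PAGE_SIZE)) {
                /* one IOVA range for the whole buffer, linked page by page */
                for (i = 0; i < npages; i++) {
                        err = dma_iova_link(mdev->device, state,
                                            page_to_phys(page_list[i]),
                                            i * PAGE_SIZE, PAGE_SIZE, dir, 0);
                        if (err)
                                goto error;
                }
                err = dma_iova_sync(mdev->device, state, 0, npages * PAGE_SIZE);
        } else {
                /* no IOVA space available: dma_map_page() each page instead */
        }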

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/mlx5/cmd.c  | 295 ++++++++++++++++-------------------
 drivers/vfio/pci/mlx5/cmd.h  |  21 ++-
 drivers/vfio/pci/mlx5/main.c |  31 ----
 3 files changed, 148 insertions(+), 199 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index 1832a6c1f35d..cde1481ed23c 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -345,25 +345,81 @@ static u32 *alloc_mkey_in(u32 npages, u32 pdn)
 	return in;
 }
 
-static int create_mkey(struct mlx5_core_dev *mdev, u32 npages,
-		       struct mlx5_vhca_data_buffer *buf, u32 *mkey_in,
+static int create_mkey(struct mlx5_core_dev *mdev, u32 npages, u32 *mkey_in,
 		       u32 *mkey)
 {
+	int inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
+		sizeof(__be64) * round_up(npages, 2);
+
+	return mlx5_core_create_mkey(mdev, mkey, mkey_in, inlen);
+}
+
+static void unregister_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
+				 u32 *mkey_in, struct dma_iova_state *state,
+				 enum dma_data_direction dir)
+{
+	dma_addr_t addr;
 	__be64 *mtt;
-	int inlen;
+	int i;
 
-	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
-	if (buf) {
-		struct sg_dma_page_iter dma_iter;
+	WARN_ON_ONCE(dir == DMA_NONE);
 
-		for_each_sgtable_dma_page(&buf->table.sgt, &dma_iter, 0)
-			*mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter));
+	if (dma_use_iova(state)) {
+		dma_iova_destroy(mdev->device, state, npages * PAGE_SIZE, dir, 0);
+	} else {
+		mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in,
+					     klm_pas_mtt);
+		for (i = npages - 1; i >= 0; i--) {
+			addr = be64_to_cpu(mtt[i]);
+			dma_unmap_page(mdev->device, addr, PAGE_SIZE, dir);
+		}
 	}
+}
 
-	inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
-		sizeof(__be64) * round_up(npages, 2);
+static int register_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
+			      struct page **page_list, u32 *mkey_in,
+			      struct dma_iova_state *state,
+			      enum dma_data_direction dir)
+{
+	dma_addr_t addr;
+	size_t mapped = 0;
+	__be64 *mtt;
+	int i, err;
 
-	return mlx5_core_create_mkey(mdev, mkey, mkey_in, inlen);
+	WARN_ON_ONCE(dir == DMA_NONE);
+
+	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
+
+	if (dma_iova_try_alloc(mdev->device, state, 0, npages * PAGE_SIZE)) {
+		addr = state->addr;
+		for (i = 0; i < npages; i++) {
+			err = dma_iova_link(mdev->device, state,
+					    page_to_phys(page_list[i]), mapped,
+					    PAGE_SIZE, dir, 0);
+			if (err)
+				goto error;
+			*mtt++ = cpu_to_be64(addr);
+			addr += PAGE_SIZE;
+			mapped += PAGE_SIZE;
+		}
+		err = dma_iova_sync(mdev->device, state, 0, mapped);
+		if (err)
+			goto error;
+	} else {
+		for (i = 0; i < npages; i++) {
+			addr = dma_map_page(mdev->device, page_list[i], 0,
+					    PAGE_SIZE, dir);
+			err = dma_mapping_error(mdev->device, addr);
+			if (err)
+				goto error;
+			*mtt++ = cpu_to_be64(addr);
+		}
+	}
+	return 0;
+
+error:
+	unregister_dma_pages(mdev, i, mkey_in, state, dir);
+	return err;
 }
 
 static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
@@ -379,96 +435,93 @@ static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
 	if (buf->mkey_in || !buf->npages)
 		return -EINVAL;
 
-	ret = dma_map_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
-	if (ret)
-		return ret;
-
 	buf->mkey_in = alloc_mkey_in(buf->npages, buf->migf->pdn);
-	if (!buf->mkey_in) {
-		ret = -ENOMEM;
-		goto err;
-	}
+	if (!buf->mkey_in)
+		return -ENOMEM;
 
-	ret = create_mkey(mdev, buf->npages, buf, buf->mkey_in, &buf->mkey);
+	ret = register_dma_pages(mdev, buf->npages, buf->page_list,
+				 buf->mkey_in, &buf->state, buf->dma_dir);
+	if (ret)
+		goto err_register_dma;
+
+	ret = create_mkey(mdev, buf->npages, buf->mkey_in, &buf->mkey);
 	if (ret)
 		goto err_create_mkey;
 
 	return 0;
 
 err_create_mkey:
+	unregister_dma_pages(mdev, buf->npages, buf->mkey_in, &buf->state,
+			     buf->dma_dir);
+err_register_dma:
 	kvfree(buf->mkey_in);
 	buf->mkey_in = NULL;
-err:
-	dma_unmap_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
 	return ret;
 }
 
+static void free_page_list(u32 npages, struct page **page_list)
+{
+	int i;
+
+	/* Undo alloc_pages_bulk_array() */
+	for (i = npages - 1; i >= 0; i--)
+		__free_page(page_list[i]);
+
+	kvfree(page_list);
+}
+
 void mlx5vf_free_data_buffer(struct mlx5_vhca_data_buffer *buf)
 {
-	struct mlx5_vf_migration_file *migf = buf->migf;
-	struct sg_page_iter sg_iter;
+	struct mlx5vf_pci_core_device *mvdev = buf->migf->mvdev;
+	struct mlx5_core_dev *mdev = mvdev->mdev;
 
-	lockdep_assert_held(&migf->mvdev->state_mutex);
-	WARN_ON(migf->mvdev->mdev_detach);
+	lockdep_assert_held(&mvdev->state_mutex);
+	WARN_ON(mvdev->mdev_detach);
 
 	if (buf->mkey_in) {
-		mlx5_core_destroy_mkey(migf->mvdev->mdev, buf->mkey);
+		mlx5_core_destroy_mkey(mdev, buf->mkey);
+		unregister_dma_pages(mdev, buf->npages, buf->mkey_in,
+				     &buf->state, buf->dma_dir);
 		kvfree(buf->mkey_in);
-		dma_unmap_sgtable(migf->mvdev->mdev->device, &buf->table.sgt,
-				  buf->dma_dir, 0);
 	}
 
-	/* Undo alloc_pages_bulk_array() */
-	for_each_sgtable_page(&buf->table.sgt, &sg_iter, 0)
-		__free_page(sg_page_iter_page(&sg_iter));
-	sg_free_append_table(&buf->table);
+	free_page_list(buf->npages, buf->page_list);
 	kfree(buf);
 }
 
-static int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
-				      unsigned int npages)
+static int mlx5vf_add_pages(struct page ***page_list, unsigned int npages)
 {
-	unsigned int to_alloc = npages;
-	struct page **page_list;
-	unsigned long filled;
-	unsigned int to_fill;
-	int ret;
+	unsigned int filled = 0, done = 0;
+	int i;
 
-	to_fill = min_t(unsigned int, npages, PAGE_SIZE / sizeof(*page_list));
-	page_list = kvzalloc(to_fill * sizeof(*page_list), GFP_KERNEL_ACCOUNT);
-	if (!page_list)
+	*page_list = kvcalloc(npages, sizeof(struct page *), GFP_KERNEL_ACCOUNT);
+	if (!*page_list)
 		return -ENOMEM;
 
-	do {
-		filled = alloc_pages_bulk_array(GFP_KERNEL_ACCOUNT, to_fill,
-						page_list);
-		if (!filled) {
-			ret = -ENOMEM;
+	for (;;) {
+		filled = alloc_pages_bulk_array(GFP_KERNEL_ACCOUNT,
+						npages - done,
+						*page_list + done);
+		if (!filled)
 			goto err;
-		}
-		to_alloc -= filled;
-		ret = sg_alloc_append_table_from_pages(
-			&buf->table, page_list, filled, 0,
-			filled << PAGE_SHIFT, UINT_MAX, SG_MAX_SINGLE_ALLOC,
-			GFP_KERNEL_ACCOUNT);
 
-		if (ret)
-			goto err;
-		buf->npages += filled;
-		/* clean input for another bulk allocation */
-		memset(page_list, 0, filled * sizeof(*page_list));
-		to_fill = min_t(unsigned int, to_alloc,
-				PAGE_SIZE / sizeof(*page_list));
-	} while (to_alloc > 0);
+		done += filled;
+		if (done == npages)
+			break;
+	}
 
-	kvfree(page_list);
 	return 0;
 
 err:
-	kvfree(page_list);
-	return ret;
+	for (i = 0; i < done; i++)
+		__free_page(*page_list[i]);
+
+	kvfree(*page_list);
+	*page_list = NULL;
+	return -ENOMEM;
 }
 
+
 struct mlx5_vhca_data_buffer *
 mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
 			 enum dma_data_direction dma_dir)
@@ -483,10 +536,12 @@ mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
 	buf->dma_dir = dma_dir;
 	buf->migf = migf;
 	if (npages) {
-		ret = mlx5vf_add_migration_pages(buf, npages);
+		ret = mlx5vf_add_pages(&buf->page_list, npages);
 		if (ret)
 			goto end;
 
+		buf->npages = npages;
+
 		if (dma_dir != DMA_NONE) {
 			ret = mlx5vf_dma_data_buffer(buf);
 			if (ret)
@@ -1345,101 +1400,16 @@ static void mlx5vf_destroy_qp(struct mlx5_core_dev *mdev,
 	kfree(qp);
 }
 
-static void free_recv_pages(struct mlx5_vhca_recv_buf *recv_buf)
-{
-	int i;
-
-	/* Undo alloc_pages_bulk_array() */
-	for (i = 0; i < recv_buf->npages; i++)
-		__free_page(recv_buf->page_list[i]);
-
-	kvfree(recv_buf->page_list);
-}
-
-static int alloc_recv_pages(struct mlx5_vhca_recv_buf *recv_buf,
-			    unsigned int npages)
-{
-	unsigned int filled = 0, done = 0;
-	int i;
-
-	recv_buf->page_list = kvcalloc(npages, sizeof(*recv_buf->page_list),
-				       GFP_KERNEL_ACCOUNT);
-	if (!recv_buf->page_list)
-		return -ENOMEM;
-
-	for (;;) {
-		filled = alloc_pages_bulk_array(GFP_KERNEL_ACCOUNT,
-						npages - done,
-						recv_buf->page_list + done);
-		if (!filled)
-			goto err;
-
-		done += filled;
-		if (done == npages)
-			break;
-	}
-
-	recv_buf->npages = npages;
-	return 0;
-
-err:
-	for (i = 0; i < npages; i++) {
-		if (recv_buf->page_list[i])
-			__free_page(recv_buf->page_list[i]);
-	}
-
-	kvfree(recv_buf->page_list);
-	return -ENOMEM;
-}
-static void unregister_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
-				 u32 *mkey_in)
-{
-	dma_addr_t addr;
-	__be64 *mtt;
-	int i;
-
-	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
-	for (i = npages - 1; i >= 0; i--) {
-		addr = be64_to_cpu(mtt[i]);
-		dma_unmap_single(mdev->device, addr, PAGE_SIZE,
-				DMA_FROM_DEVICE);
-	}
-}
-
-static int register_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
-			      struct page **page_list, u32 *mkey_in)
-{
-	dma_addr_t addr;
-	__be64 *mtt;
-	int i;
-
-	mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
-
-	for (i = 0; i < npages; i++) {
-		addr = dma_map_page(mdev->device, page_list[i], 0, PAGE_SIZE,
-				    DMA_FROM_DEVICE);
-		if (dma_mapping_error(mdev->device, addr))
-			goto error;
-
-		*mtt++ = cpu_to_be64(addr);
-	}
-
-	return 0;
-
-error:
-	unregister_dma_pages(mdev, i, mkey_in);
-	return -ENOMEM;
-}
-
 static void mlx5vf_free_qp_recv_resources(struct mlx5_core_dev *mdev,
 					  struct mlx5_vhca_qp *qp)
 {
 	struct mlx5_vhca_recv_buf *recv_buf = &qp->recv_buf;
 
 	mlx5_core_destroy_mkey(mdev, recv_buf->mkey);
-	unregister_dma_pages(mdev, recv_buf->npages, recv_buf->mkey_in);
+	unregister_dma_pages(mdev, recv_buf->npages, recv_buf->mkey_in,
+			     &recv_buf->state, DMA_FROM_DEVICE);
 	kvfree(recv_buf->mkey_in);
-	free_recv_pages(&qp->recv_buf);
+	free_page_list(recv_buf->npages, recv_buf->page_list);
 }
 
 static int mlx5vf_alloc_qp_recv_resources(struct mlx5_core_dev *mdev,
@@ -1450,10 +1420,12 @@ static int mlx5vf_alloc_qp_recv_resources(struct mlx5_core_dev *mdev,
 	struct mlx5_vhca_recv_buf *recv_buf = &qp->recv_buf;
 	int err;
 
-	err = alloc_recv_pages(recv_buf, npages);
-	if (err < 0)
+	err = mlx5vf_add_pages(&recv_buf->page_list, npages);
+	if (err)
 		return err;
 
+	recv_buf->npages = npages;
+
 	recv_buf->mkey_in = alloc_mkey_in(npages, pdn);
 	if (!recv_buf->mkey_in) {
 		err = -ENOMEM;
@@ -1461,24 +1433,25 @@ static int mlx5vf_alloc_qp_recv_resources(struct mlx5_core_dev *mdev,
 	}
 
 	err = register_dma_pages(mdev, npages, recv_buf->page_list,
-				 recv_buf->mkey_in);
+				 recv_buf->mkey_in, &recv_buf->state,
+				 DMA_FROM_DEVICE);
 	if (err)
 		goto err_register_dma;
 
-	err = create_mkey(mdev, npages, NULL, recv_buf->mkey_in,
-			  &recv_buf->mkey);
+	err = create_mkey(mdev, npages, recv_buf->mkey_in, &recv_buf->mkey);
 	if (err)
 		goto err_create_mkey;
 
 	return 0;
 
 err_create_mkey:
-	unregister_dma_pages(mdev, npages, recv_buf->mkey_in);
+	unregister_dma_pages(mdev, npages, recv_buf->mkey_in, &recv_buf->state,
+			     DMA_FROM_DEVICE);
 err_register_dma:
 	kvfree(recv_buf->mkey_in);
 	recv_buf->mkey_in = NULL;
 end:
-	free_recv_pages(recv_buf);
+	free_page_list(npages, recv_buf->page_list);
 	return err;
 }
 
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 25dd6ff54591..d7821b5ca772 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -53,7 +53,8 @@ struct mlx5_vf_migration_header {
 };
 
 struct mlx5_vhca_data_buffer {
-	struct sg_append_table table;
+	struct page **page_list;
+	struct dma_iova_state state;
 	loff_t start_pos;
 	u64 length;
 	u32 npages;
@@ -63,10 +64,6 @@ struct mlx5_vhca_data_buffer {
 	u8 stop_copy_chunk_num;
 	struct list_head buf_elm;
 	struct mlx5_vf_migration_file *migf;
-	/* Optimize mlx5vf_get_migration_page() for sequential access */
-	struct scatterlist *last_offset_sg;
-	unsigned int sg_last_entry;
-	unsigned long last_offset;
 };
 
 struct mlx5vf_async_data {
@@ -133,6 +130,7 @@ struct mlx5_vhca_cq {
 struct mlx5_vhca_recv_buf {
 	u32 npages;
 	struct page **page_list;
+	struct dma_iova_state state;
 	u32 next_rq_offset;
 	u32 *mkey_in;
 	u32 mkey;
@@ -224,8 +222,17 @@ struct mlx5_vhca_data_buffer *
 mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
 		       enum dma_data_direction dma_dir);
 void mlx5vf_put_data_buffer(struct mlx5_vhca_data_buffer *buf);
-struct page *mlx5vf_get_migration_page(struct mlx5_vhca_data_buffer *buf,
-				       unsigned long offset);
+static inline struct page *
+mlx5vf_get_migration_page(struct mlx5_vhca_data_buffer *buf,
+			  unsigned long offset)
+{
+	int page_entry = offset / PAGE_SIZE;
+
+	if (page_entry >= buf->npages)
+		return NULL;
+
+	return buf->page_list[page_entry];
+}
 void mlx5vf_state_mutex_unlock(struct mlx5vf_pci_core_device *mvdev);
 void mlx5vf_disable_fds(struct mlx5vf_pci_core_device *mvdev,
 			enum mlx5_vf_migf_state *last_save_state);
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index a1dbee3be1e0..d6cf97101c41 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -34,37 +34,6 @@ static struct mlx5vf_pci_core_device *mlx5vf_drvdata(struct pci_dev *pdev)
 			    core_device);
 }
 
-struct page *
-mlx5vf_get_migration_page(struct mlx5_vhca_data_buffer *buf,
-			  unsigned long offset)
-{
-	unsigned long cur_offset = 0;
-	struct scatterlist *sg;
-	unsigned int i;
-
-	/* All accesses are sequential */
-	if (offset < buf->last_offset || !buf->last_offset_sg) {
-		buf->last_offset = 0;
-		buf->last_offset_sg = buf->table.sgt.sgl;
-		buf->sg_last_entry = 0;
-	}
-
-	cur_offset = buf->last_offset;
-
-	for_each_sg(buf->last_offset_sg, sg,
-			buf->table.sgt.orig_nents - buf->sg_last_entry, i) {
-		if (offset < sg->length + cur_offset) {
-			buf->last_offset_sg = sg;
-			buf->sg_last_entry += i;
-			buf->last_offset = cur_offset;
-			return nth_page(sg_page(sg),
-					(offset - cur_offset) / PAGE_SIZE);
-		}
-		cur_offset += sg->length;
-	}
-	return NULL;
-}
-
 static void mlx5vf_disable_fd(struct mlx5_vf_migration_file *migf)
 {
 	mutex_lock(&migf->lock);
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 09/17] docs: core-api: document the IOVA-based API
  2024-10-30 15:12 ` [PATCH v1 09/17] docs: core-api: document the IOVA-based API Leon Romanovsky
@ 2024-10-31  1:41   ` Randy Dunlap
  2024-10-31  7:59     ` Leon Romanovsky
  2024-11-08 19:34   ` Jonathan Corbet
  1 sibling, 1 reply; 63+ messages in thread
From: Randy Dunlap @ 2024-10-31  1:41 UTC (permalink / raw)
  To: Leon Romanovsky, Jens Axboe, Jason Gunthorpe, Robin Murphy,
	Joerg Roedel, Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm

(nits)

On 10/30/24 8:12 AM, Leon Romanovsky wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> Add an explanation of the newly added IOVA-based mapping API.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  Documentation/core-api/dma-api.rst | 70 ++++++++++++++++++++++++++++++
>  1 file changed, 70 insertions(+)
> 
> diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst
> index 8e3cce3d0a23..6095696a65a7 100644
> --- a/Documentation/core-api/dma-api.rst
> +++ b/Documentation/core-api/dma-api.rst
> @@ -530,6 +530,76 @@ routines, e.g.:::
>  		....
>  	}
>  
> +Part Ie - IOVA-based DMA mappings
> +---------------------------------
> +
> +These APIs allow a very efficient mapping when using an IOMMU.  They are an
> +optional path that requires extra code and are only recommended for drivers
> +where DMA mapping performance, or the space usage for storing the DMA addresses
> +matter.  All the consideration from the previous section apply here as well.

                    considerations

> +
> +::
> +
> +    bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
> +		phys_addr_t phys, size_t size);
> +
> +Is used to try to allocate IOVA space for mapping operation.  If it returns
> +false this API can't be used for the given device and the normal streaming
> +DMA mapping API should be used.  The ``struct dma_iova_state`` is allocated
> +by the driver and must be kept around until unmap time.
> +
> +::
> +
> +    static inline bool dma_use_iova(struct dma_iova_state *state)
> +
> +Can be used by the driver to check if the IOVA-based API is used after a
> +call to dma_iova_try_alloc.  This can be useful in the unmap path.
> +
> +::
> +
> +    int dma_iova_link(struct device *dev, struct dma_iova_state *state,
> +		phys_addr_t phys, size_t offset, size_t size,
> +		enum dma_data_direction dir, unsigned long attrs);
> +
> +Is used to link ranges to the IOVA previously allocated.  The start of all
> +but the first call to dma_iova_link for a given state must be aligned
> +to the DMA merge boundary returned by ``dma_get_merge_boundary())``, and
> +the size of all but the last range must be aligned to the DMA merge boundary
> +as well.
> +
> +::
> +
> +    int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
> +		size_t offset, size_t size);
> +
> +Must be called to sync the IOMMU page tables for IOVA-range mapped by one or
> +more calls to ``dma_iova_link()``.
> +
> +For drivers that use a one-shot mapping, all ranges can be unmapped and the
> +IOVA freed by calling:
> +
> +::
> +
> +   void dma_iova_destroy(struct device *dev, struct dma_iova_state *state,
> +		enum dma_data_direction dir, unsigned long attrs);
> +
> +Alternatively drivers can dynamically manage the IOVA space by unmapping
> +and mapping individual regions.  In that case
> +
> +::
> +
> +    void dma_iova_unlink(struct device *dev, struct dma_iova_state *state,
> +		size_t offset, size_t size, enum dma_data_direction dir,
> +		unsigned long attrs);
> +
> +is used to unmap a range previous mapped, and

                            previously

> +
> +::
> +
> +   void dma_iova_free(struct device *dev, struct dma_iova_state *state);
> +
> +is used to free the IOVA space.  All regions must have been unmapped using
> +``dma_iova_unlink()`` before calling ``dma_iova_free()``.
>  
>  Part II - Non-coherent DMA allocations
>  --------------------------------------

-- 
~Randy


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (16 preceding siblings ...)
  2024-10-30 15:13 ` [PATCH v1 17/17] vfio/mlx5: Convert vfio to use DMA link API Leon Romanovsky
@ 2024-10-31  1:44 ` Jens Axboe
  2024-10-31  8:34   ` Christoph Hellwig
  2024-10-31 21:17 ` Robin Murphy
  2024-11-05 18:51 ` Jason Gunthorpe
  19 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2024-10-31  1:44 UTC (permalink / raw)
  To: Leon Romanovsky, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm

On 10/30/24 9:12 AM, Leon Romanovsky wrote:
> Changelog:
> v1: 
>  * Squashed two VFIO patches into one
>  * Added Acked-by/Reviewed-by tags
>  * Fix docs spelling errors
>  * Simplified dma_iova_sync() API
>  * Added extra check in dma_iova_destroy() if mapped size to make code more clear
>  * Fixed checkpatch warnings in p2p patch
>  * Changed implementation of VFIO mlx5 mlx5vf_add_migration_pages() to
>    be more general
>  * Reduced the number of changes in VFIO patch
> v0: https://lore.kernel.org/all/cover.1730037276.git.leon@kernel.org
> 
> ----------------------------------------------------------------------------
> The code can be downloaded from:
> https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git tag:dma-split-oct-30

On Christoph's request, I tested this series last week and saw some
pretty significant performance regressions on my box. I don't know what
the status is in terms of that, just want to make sure something like
this doesn't get merged until that is both fully understood and sorted
out.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 09/17] docs: core-api: document the IOVA-based API
  2024-10-31  1:41   ` Randy Dunlap
@ 2024-10-31  7:59     ` Leon Romanovsky
  0 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-31  7:59 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg, Keith Busch,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm

On Wed, Oct 30, 2024 at 06:41:21PM -0700, Randy Dunlap wrote:
> (nits)
> 
> On 10/30/24 8:12 AM, Leon Romanovsky wrote:
> > From: Christoph Hellwig <hch@lst.de>
> > 
> > Add an explanation of the newly added IOVA-based mapping API.
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  Documentation/core-api/dma-api.rst | 70 ++++++++++++++++++++++++++++++
> >  1 file changed, 70 insertions(+)

<...>

> > +These APIs allow a very efficient mapping when using an IOMMU.  They are an
> > +optional path that requires extra code and are only recommended for drivers
> > +where DMA mapping performance, or the space usage for storing the DMA addresses
> > +matter.  All the consideration from the previous section apply here as well.
> 
>                     considerations

<...>

> > +is used to unmap a range previous mapped, and
> 
>                             previously

Thanks

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-10-31  1:44 ` [PATCH v1 00/17] Provide a new two step DMA mapping API Jens Axboe
@ 2024-10-31  8:34   ` Christoph Hellwig
  2024-10-31  9:05     ` Leon Romanovsky
  2024-10-31 17:42     ` Jens Axboe
  0 siblings, 2 replies; 63+ messages in thread
From: Christoph Hellwig @ 2024-10-31  8:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Leon Romanovsky, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg, Keith Busch,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm

On Wed, Oct 30, 2024 at 07:44:13PM -0600, Jens Axboe wrote:
> On Christoph's request, I tested this series last week and saw some
> pretty significant performance regressions on my box. I don't know what
> the status is in terms of that, just want to make sure something like
> this doesn't get merged until that is both fully understood and sorted
> out.

Working on it, but I have way too many things going on at once.  Note
that the weird thing about your setup was that we apparently dropped into
the slow path, which still puzzles me.  But I should probably also look
into making that path a little less slow.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-10-31  8:34   ` Christoph Hellwig
@ 2024-10-31  9:05     ` Leon Romanovsky
  2024-10-31  9:21       ` Christoph Hellwig
  2024-10-31 17:42     ` Jens Axboe
  1 sibling, 1 reply; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-31  9:05 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Jason Gunthorpe, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Keith Busch, Bjorn Helgaas, Logan Gunthorpe,
	Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
	Marek Szyprowski, Jérôme Glisse, Andrew Morton,
	Jonathan Corbet, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm

On Thu, Oct 31, 2024 at 09:34:50AM +0100, Christoph Hellwig wrote:
> On Wed, Oct 30, 2024 at 07:44:13PM -0600, Jens Axboe wrote:
> > On Christoph's request, I tested this series last week and saw some
> > pretty significant performance regressions on my box. I don't know what
> > the status is in terms of that, just want to make sure something like
> > this doesn't get merged until that is both fully understood and sorted
> > out.

This series is a subset of the series you tested and doesn't include the
block layer changes which most likely were the cause of the performance
regression.

This is why I separated the block layer changes from the rest of the series
and marked them as RFC.

The current patch set is viable for HMM and VFIO. Can you please retest
only this series and leave the block layer changes for later till Christoph
finds the answer for the performance regression?

Thanks

> 
> Working on it, but I have way too many things going on at once.  Note
> that the weird thing about your setup was that we apparently dropped into
> the slow path, which still puzzles me.  But I should probably also look
> into making that path a little less slow.
> 
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-10-31  9:05     ` Leon Romanovsky
@ 2024-10-31  9:21       ` Christoph Hellwig
  2024-10-31  9:37         ` Leon Romanovsky
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2024-10-31  9:21 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Christoph Hellwig, Jens Axboe, Jason Gunthorpe, Robin Murphy,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Keith Busch,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm

On Thu, Oct 31, 2024 at 11:05:30AM +0200, Leon Romanovsky wrote:
> This series is a subset of the series you tested and doesn't include the
> block layer changes which most likely were the cause of the performance
> regression.
> 
> This is why I separated the block layer changes from the rest of the series
> and marked them as RFC.
> 
> The current patch set is viable for HMM and VFIO. Can you please retest
> only this series and leave the block layer changes for later till Christoph
> finds the answer for the performance regression?

As the subset doesn't touch block code or code called by block I don't
think we need Jens to benchmark it, unless he really wants to.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-10-31  9:21       ` Christoph Hellwig
@ 2024-10-31  9:37         ` Leon Romanovsky
  2024-10-31 17:43           ` Jens Axboe
  0 siblings, 1 reply; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-31  9:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Sagi Grimberg, Keith Busch, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Marek Szyprowski, Jérôme Glisse,
	Andrew Morton, Jonathan Corbet, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm

On Thu, Oct 31, 2024 at 10:21:13AM +0100, Christoph Hellwig wrote:
> On Thu, Oct 31, 2024 at 11:05:30AM +0200, Leon Romanovsky wrote:
> > This series is a subset of the series you tested and doesn't include the
> > block layer changes which most likely were the cause of the performance
> > regression.
> > 
> > This is why I separated the block layer changes from the rest of the series
> > and marked them as RFC.
> > 
> > The current patch set is viable for HMM and VFIO. Can you please retest
> > only this series and leave the block layer changes for later till Christoph
> > finds the answer for the performance regression?
> 
> As the subset doesn't touch block code or code called by block I don't
> think we need Jens to benchmark it, unless he really wants to.

He wrote this sentence in his email, while responding on subset which doesn't change
anything in block layer: "just want to make sure something like this doesn't get merged
until that is both fully understood and sorted out."

This series works like a charm for RDMA (HMM) and VFIO.

Thanks

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-10-31  8:34   ` Christoph Hellwig
  2024-10-31  9:05     ` Leon Romanovsky
@ 2024-10-31 17:42     ` Jens Axboe
  1 sibling, 0 replies; 63+ messages in thread
From: Jens Axboe @ 2024-10-31 17:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Sagi Grimberg, Keith Busch, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Marek Szyprowski, Jérôme Glisse,
	Andrew Morton, Jonathan Corbet, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm

On 10/31/24 2:34 AM, Christoph Hellwig wrote:
> On Wed, Oct 30, 2024 at 07:44:13PM -0600, Jens Axboe wrote:
>> On Christoph's request, I tested this series last week and saw some
>> pretty significant performance regressions on my box. I don't know what
>> the status is in terms of that, just want to make sure something like
>> this doesn't get merged until that is both fully understood and sorted
>> out.
> 
> Working on it, but I have way too many things going on at once.  Note
> that the weird thing about your setup was that we apparently dropped into
> the slow path, which still puzzles me.  But I should probably also look
> into making that path a little less slow.

That's fine, just wanted to ensure that no push was being done on this
before that was resolved.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-10-31  9:37         ` Leon Romanovsky
@ 2024-10-31 17:43           ` Jens Axboe
  2024-10-31 20:43             ` Leon Romanovsky
  0 siblings, 1 reply; 63+ messages in thread
From: Jens Axboe @ 2024-10-31 17:43 UTC (permalink / raw)
  To: Leon Romanovsky, Christoph Hellwig
  Cc: Jason Gunthorpe, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Keith Busch, Bjorn Helgaas, Logan Gunthorpe,
	Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
	Marek Szyprowski, Jérôme Glisse, Andrew Morton,
	Jonathan Corbet, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm

On 10/31/24 3:37 AM, Leon Romanovsky wrote:
> On Thu, Oct 31, 2024 at 10:21:13AM +0100, Christoph Hellwig wrote:
>> On Thu, Oct 31, 2024 at 11:05:30AM +0200, Leon Romanovsky wrote:
>>> This series is a subset of the series you tested and doesn't include the
>>> block layer changes which most likely were the cause of the performance
>>> regression.
>>>
>>> This is why I separated the block layer changes from the rest of the series
>>> and marked them as RFC.
>>>
>>> The current patch set is viable for HMM and VFIO. Can you please retest
>>> only this series and leave the block layer changes for later till Christoph
>>> finds the answer for the performance regression?
>>
>> As the subset doesn't touch block code or code called by block I don't
>> think we need Jens to benchmark it, unless he really wants to.
> 
> He wrote this sentence in his email while responding to a subset which
> doesn't change anything in the block layer: "just want to make sure
> something like this doesn't get merged until that is both fully
> understood and sorted out."
> 
> This series works like a charm for RDMA (HMM) and VFIO.

I don't care about rdma/vfio, nor do I test it, so you guys can do
whatever you want there, as long as it doesn't regress the iommu side.
The block series is separate, so we'll deal with that when we get there.

I don't know why you CC'ed linux-block on the series.

-- 
Jens Axboe


* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-10-31 17:43           ` Jens Axboe
@ 2024-10-31 20:43             ` Leon Romanovsky
  0 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2024-10-31 20:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Sagi Grimberg, Keith Busch, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Marek Szyprowski, Jérôme Glisse,
	Andrew Morton, Jonathan Corbet, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm

On Thu, Oct 31, 2024 at 11:43:50AM -0600, Jens Axboe wrote:
> On 10/31/24 3:37 AM, Leon Romanovsky wrote:
> > On Thu, Oct 31, 2024 at 10:21:13AM +0100, Christoph Hellwig wrote:
> >> On Thu, Oct 31, 2024 at 11:05:30AM +0200, Leon Romanovsky wrote:
> >>> This series is a subset of the series you tested and doesn't include the
> >>> block layer changes which most likely were the cause of the performance
> >>> regression.
> >>>
> >>> This is why I separated the block layer changes from the rest of the series
> >>> and marked them as RFC.
> >>>
> >>> The current patch set is viable for HMM and VFIO. Can you please retest
> >>> only this series and leave the block layer changes for later till Christoph
> >>> finds the answer for the performance regression?
> >>
> >> As the subset doesn't touch block code or code called by block I don't
> >> think we need Jens to benchmark it, unless he really wants to.
> > 
> > He wrote this sentence in his email while responding to a subset which
> > doesn't change anything in the block layer: "just want to make sure
> > something like this doesn't get merged until that is both fully
> > understood and sorted out."
> > 
> > This series works like a charm for RDMA (HMM) and VFIO.
> 
> I don't care about rdma/vfio, nor do I test it, so you guys can do
> whatever you want there, as long as it doesn't regress the iommu side.
> The block series is separate, so we'll deal with that when we get there.
> 
> I don't know why you CC'ed linux-block on the series.

Because of the second part, which is marked as RFC and based on this
one. I think that it is better to present the whole picture to everyone
interested in the discussion.

Thanks

> 
> -- 
> Jens Axboe
> 


* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (17 preceding siblings ...)
  2024-10-31  1:44 ` [PATCH v1 00/17] Provide a new two step DMA mapping API Jens Axboe
@ 2024-10-31 21:17 ` Robin Murphy
  2024-11-04  9:58   ` Christoph Hellwig
  2024-11-05 18:51 ` Jason Gunthorpe
  19 siblings, 1 reply; 63+ messages in thread
From: Robin Murphy @ 2024-10-31 21:17 UTC (permalink / raw)
  To: Leon Romanovsky, Jens Axboe, Jason Gunthorpe, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm

On 30/10/2024 3:12 pm, Leon Romanovsky wrote:
> Changelog:
> v1:
>   * Squashed two VFIO patches into one
>   * Added Acked-by/Reviewed-by tags
>   * Fix docs spelling errors
>   * Simplified dma_iova_sync() API
>   * Added extra check in dma_iova_destroy() if mapped size to make code more clear
>   * Fixed checkpatch warnings in p2p patch
>   * Changed implementation of VFIO mlx5 mlx5vf_add_migration_pages() to
>     be more general
>   * Reduced the number of changes in VFIO patch
> v0: https://lore.kernel.org/all/cover.1730037276.git.leon@kernel.org
> 
> ----------------------------------------------------------------------------
> The code can be downloaded from:
> https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git tag:dma-split-oct-30
> 
> ----------------------------------------------------------------------------
> Currently the only efficient way to map a complex memory description through
> the DMA API is by using the scatterlist APIs.

It's really not efficient... In most cases they're just wrappers for a 
bunch of dma_map_page() etc. calls for the convenience of callers who 
are using a scatterlist for their own reasons anyway. Even with 
iommu-dma, I expect that approach would likely perform better for most 
users as well, given that typical individual segment sizes are much more 
likely to be in scope of the IOVA caches.

The hilarious amount of work that iommu_dma_map_sg() does is pretty much 
entirely for the benefit of v4l2 and dma-buf importers who *depend* on 
being able to linearise a scatterlist in DMA address space. TBH I doubt 
there are many actual scatter-gather-capable devices with significant 
enough limitations to meaningfully benefit from DMA segment combining 
these days - I've often thought that by now it might be a good idea to 
turn that behaviour off by default and add an attribute for callers to 
explicitly request it.

> The SG APIs are unique in that
> they efficiently combine the two fundamental operations of sizing and allocating
> a large IOVA window from the IOMMU and processing all the per-address
> swiotlb/flushing/p2p/map details.

Except that's obviously not unique when the page APIs also combine the 
exact same operations? :/

> This uniqueness has been a long standing pain point as the scatterlist API
> is mandatory, but expensive to use.

Huh? When and where has anything ever called it mandatory? Nobody's 
getting sent to DMA jail for open-coding:

	for_each_sg(...)
		my_dma_addr = dma_map_page(..., sg_page());

if they do know the map_sg operation is unnecessarily expensive for 
their needs.
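
Spelled out a little more fully (a rough sketch with made-up names, not
lifted from any real driver), that per-entry path is simply:

	static int map_each_entry(struct device *dev, struct scatterlist *sgl,
				  int nents, enum dma_data_direction dir)
	{
		struct scatterlist *sg;
		int i;

		for_each_sg(sgl, sg, nents, i) {
			dma_addr_t addr = dma_map_page(dev, sg_page(sg),
						       sg->offset, sg->length,
						       dir);

			if (dma_mapping_error(dev, addr))
				return -ENOMEM; /* unwind of earlier entries elided */
			sg->dma_address = addr;
		}
		return 0;
	}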

> It prevents any kind of optimization or
> feature improvement (such as avoiding struct page for P2P) due to the impossibility
> of improving the scatterlist.
> 
> Several approaches have been explored to expand the DMA API with additional
> scatterlist-like structures (BIO, rlist), instead split up the DMA API
> to allow callers to bring their own data structure.

And this line of reasoning is still "2 + 2 = Thursday" - what is to say
those two notions are in any way related? We literally already have one
generic DMA operation which doesn't operate on struct page, yet needed 
nothing "split up" to be possible. Fair enough if callers want some 
alternative interfaces for mapping memory as well, but to be a common 
DMA API it has to be usable everywhere and cover all the DMA operations 
that the current page-based APIs provide, otherwise those callers 
obviously can't stop using struct pages. What precludes a 
straightforward dma_map_phys() etc. to parallel the existing API? What's 
the justification for an IOMMU-specific design when surely if anyone can 
benefit from more memory-efficient structures across drivers and 
subsystems it's the little embedded platforms, not the big servers 
already happy to spend tens to hundreds of megabytes on IOMMU pagetables?

> The API is split up into parts:
>   - Allocate IOVA space:
>      To do any pre-allocation required. This is done based on the caller
>      supplying some details about how much IOMMU address space it would need
>      in worst case.
>   - Map and unmap relevant structures to pre-allocated IOVA space:
>      Perform the actual mapping into the pre-allocated IOVA. This is very
>      similar to dma_map_page().
 >
> In this and the next series [1], examples of three different users are converted
> to the new API to show the benefits and its versatility. Each user has a unique
> flow:
>   1. RDMA ODP is an example of "SVA mirroring" using HMM that needs to
>      dynamically map/unmap large numbers of single pages. This becomes
>      significantly faster in the IOMMU case as the map/unmap is now just
>      a page table walk, the IOVA allocation is pre-computed once. Significant
>      amounts of memory are saved as there is no longer a need to store the
>      dma_addr_t of each page.

I particularly enjoy the comment in patch #11 calling out how this 
"unique flow" is fundamentally incompatible with the API it's supposed 
to show off and has to rely on a sketchy hack to abuse its 
"versatility". Great stuff.

>   2. VFIO PCI live migration code is building a very large "page list"
>      for the device. Instead of allocating a scatter list entry per allocated
>      page it can just allocate an array of 'struct page *', saving a large
>      amount of memory.

VFIO already assumes a coherent device with (realistically) an IOMMU 
which it explicitly manages - why is it even pretending to need a 
generic DMA API?

>   3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
>      list without having to allocate then populate an intermediate SG table.

As above, given that a bio_vec still deals in struct pages, that could 
seemingly already be done by just mapping the pages, so how is it 
proving any benefit of a fragile new interface?

Heck, not that I really want to encourage it, but we also already have 
network drivers who don't have the space to stash both a DMA address 
*and* a page address in their descriptors, and economise on shadow 
storage by instead grovelling into the default IOMMU domain with 
iova_to_phys(). I mean, I'd _kinda_ like to send them to DMA jail, but 
it's not an absolutely unreasonable trick to play... also DMA jail 
doesn't exist.
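
For the record, the trick in question is essentially this (a sketch only,
and very much not an endorsement):

	/* recover the physical address from the default domain instead of
	 * storing it alongside the DMA address */
	phys_addr_t phys = iommu_iova_to_phys(iommu_get_domain_for_dev(dev),
					      dma_addr);
	struct page *page = pfn_to_page(PHYS_PFN(phys));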

> To make the use of the new API easier, HMM and block subsystems are extended
> to hide the optimization details from the caller. Among these optimizations:
>   * Memory reduction as in most real use cases there is no need to store mapped
>     DMA addresses and unmap them.
>   * Reducing the function call overhead by removing the need to call function
>     pointers and use direct calls instead.
> 
> This step is first along a path to provide alternatives to scatterlist and
> solve some of the abuses and design mistakes, for instance in DMABUF's P2P
> support.

My big concern here is that a thin and vaguely-defined wrapper around 
the IOMMU API is itself a step which smells strongly of "abuse and 
design mistake", given that the basic notion of allocating DMA addresses 
in advance clearly cannot generalise. Thus it really demands some 
considered justification beyond "We must do something; This is 
something; Therefore we must do this." to be convincing.

Thanks,
Robin.

> 
> Thanks
> 
> [1] This still points to v0, as the change is just around handling dma_iova_sync():
> https://lore.kernel.org/all/cover.1730037261.git.leon@kernel.org
> 
> Christoph Hellwig (6):
>    PCI/P2PDMA: Refactor the p2pdma mapping helpers
>    dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
>    iommu: generalize the batched sync after map interface
>    iommu/dma: Factor out a iommu_dma_map_swiotlb helper
>    dma-mapping: add a dma_need_unmap helper
>    docs: core-api: document the IOVA-based API
> 
> Leon Romanovsky (11):
>    dma-mapping: Add check if IOVA can be used
>    dma: Provide an interface to allow allocate IOVA
>    dma-mapping: Implement link/unlink ranges API
>    mm/hmm: let users to tag specific PFN with DMA mapped bit
>    mm/hmm: provide generic DMA managing logic
>    RDMA/umem: Store ODP access mask information in PFN
>    RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page
>      linkage
>    RDMA/umem: Separate implicit ODP initialization from explicit ODP
>    vfio/mlx5: Explicitly use number of pages instead of allocated length
>    vfio/mlx5: Rewrite create mkey flow to allow better code reuse
>    vfio/mlx5: Convert vfio to use DMA link API
> 
>   Documentation/core-api/dma-api.rst   |  70 ++++
>   drivers/infiniband/core/umem_odp.c   | 250 +++++----------
>   drivers/infiniband/hw/mlx5/mlx5_ib.h |  12 +-
>   drivers/infiniband/hw/mlx5/odp.c     |  65 ++--
>   drivers/infiniband/hw/mlx5/umr.c     |  12 +-
>   drivers/iommu/dma-iommu.c            | 459 +++++++++++++++++++++++----
>   drivers/iommu/iommu.c                |  65 ++--
>   drivers/pci/p2pdma.c                 |  38 +--
>   drivers/vfio/pci/mlx5/cmd.c          | 373 +++++++++++-----------
>   drivers/vfio/pci/mlx5/cmd.h          |  35 +-
>   drivers/vfio/pci/mlx5/main.c         |  87 +++--
>   include/linux/dma-map-ops.h          |  54 ----
>   include/linux/dma-mapping.h          |  85 +++++
>   include/linux/hmm-dma.h              |  32 ++
>   include/linux/hmm.h                  |  16 +
>   include/linux/iommu.h                |   4 +
>   include/linux/pci-p2pdma.h           |  84 +++++
>   include/rdma/ib_umem_odp.h           |  25 +-
>   kernel/dma/direct.c                  |  44 +--
>   kernel/dma/mapping.c                 |  20 ++
>   mm/hmm.c                             | 231 +++++++++++++-
>   21 files changed, 1377 insertions(+), 684 deletions(-)
>   create mode 100644 include/linux/hmm-dma.h
> 


* Re: [PATCH v1 07/17] dma-mapping: Implement link/unlink ranges API
  2024-10-30 15:12 ` [PATCH v1 07/17] dma-mapping: Implement link/unlink ranges API Leon Romanovsky
@ 2024-10-31 21:18   ` Robin Murphy
  2024-11-04  9:10     ` Christoph Hellwig
  0 siblings, 1 reply; 63+ messages in thread
From: Robin Murphy @ 2024-10-31 21:18 UTC (permalink / raw)
  To: Leon Romanovsky, Jens Axboe, Jason Gunthorpe, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Leon Romanovsky, Keith Busch, Bjorn Helgaas, Logan Gunthorpe,
	Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
	Marek Szyprowski, Jérôme Glisse, Andrew Morton,
	Jonathan Corbet, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm

On 30/10/2024 3:12 pm, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Introduce new DMA APIs to perform DMA linkage of buffers
> in layers higher than DMA.
> 
> In proposed API, the callers will perform the following steps.
> In map path:
> 	if (dma_can_use_iova(...))
> 	    dma_iova_alloc()
> 	    for (page in range)
> 	       dma_iova_link_next(...)
> 	    dma_iova_sync(...)
> 	else
> 	     /* Fallback to legacy map pages */
>               for (all pages)
> 	       dma_map_page(...)
> 
> In unmap path:
> 	if (dma_can_use_iova(...))
> 	     dma_iova_destroy()
> 	else
> 	     for (all pages)
> 		dma_unmap_page(...)
> 
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>   drivers/iommu/dma-iommu.c   | 259 ++++++++++++++++++++++++++++++++++++
>   include/linux/dma-mapping.h |  32 +++++
>   2 files changed, 291 insertions(+)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index e1eaad500d27..4a504a879cc0 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -1834,6 +1834,265 @@ void dma_iova_free(struct device *dev, struct dma_iova_state *state)
>   }
>   EXPORT_SYMBOL_GPL(dma_iova_free);
>   
> +static int __dma_iova_link(struct device *dev, dma_addr_t addr,
> +		phys_addr_t phys, size_t size, enum dma_data_direction dir,
> +		unsigned long attrs)
> +{
> +	bool coherent = dev_is_dma_coherent(dev);
> +
> +	if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))

If you really imagine this can support non-coherent operation and 
DMA_ATTR_SKIP_CPU_SYNC, where are the corresponding explicit sync 
operations? dma_sync_single_*() sure as heck aren't going to work...

In fact, same goes for SWIOTLB bouncing even in the coherent case.

> +		arch_sync_dma_for_device(phys, size, dir);

Plus if the aim is to pass P2P and whatever arbitrary physical addresses 
through here as well, how can we be sure this isn't going to explode?

> +
> +	return iommu_map_nosync(iommu_get_dma_domain(dev), addr, phys, size,
> +			dma_info_to_prot(dir, coherent, attrs), GFP_ATOMIC);
> +}
> +
> +static int iommu_dma_iova_bounce_and_link(struct device *dev, dma_addr_t addr,
> +		phys_addr_t phys, size_t bounce_len,
> +		enum dma_data_direction dir, unsigned long attrs,
> +		size_t iova_start_pad)
> +{
> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> +	struct iova_domain *iovad = &domain->iova_cookie->iovad;
> +	phys_addr_t bounce_phys;
> +	int error;
> +
> +	bounce_phys = iommu_dma_map_swiotlb(dev, phys, bounce_len, dir, attrs);
> +	if (bounce_phys == DMA_MAPPING_ERROR)
> +		return -ENOMEM;
> +
> +	error = __dma_iova_link(dev, addr - iova_start_pad,
> +			bounce_phys - iova_start_pad,
> +			iova_align(iovad, bounce_len), dir, attrs);
> +	if (error)
> +		swiotlb_tbl_unmap_single(dev, bounce_phys, bounce_len, dir,
> +				attrs);
> +	return error;
> +}
> +
> +static int iommu_dma_iova_link_swiotlb(struct device *dev,
> +		struct dma_iova_state *state, phys_addr_t phys, size_t offset,
> +		size_t size, enum dma_data_direction dir, unsigned long attrs)
> +{
> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> +	struct iova_domain *iovad = &cookie->iovad;
> +	size_t iova_start_pad = iova_offset(iovad, phys);
> +	size_t iova_end_pad = iova_offset(iovad, phys + size);

I thought the code below was wrong until I double-checked and realised 
that this is not what its name implies it to be...

> +	dma_addr_t addr = state->addr + offset;
> +	size_t mapped = 0;
> +	int error;
> +
> +	if (iova_start_pad) {
> +		size_t bounce_len = min(size, iovad->granule - iova_start_pad);
> +
> +		error = iommu_dma_iova_bounce_and_link(dev, addr, phys,
> +				bounce_len, dir, attrs, iova_start_pad);
> +		if (error)
> +			return error;
> +		state->__size |= DMA_IOVA_USE_SWIOTLB;
> +
> +		mapped += bounce_len;
> +		size -= bounce_len;
> +		if (!size)
> +			return 0;
> +	}
> +
> +	size -= iova_end_pad;
> +	error = __dma_iova_link(dev, addr + mapped, phys + mapped, size, dir,
> +			attrs);
> +	if (error)
> +		goto out_unmap;
> +	mapped += size;
> +
> +	if (iova_end_pad) {
> +		error = iommu_dma_iova_bounce_and_link(dev, addr + mapped,
> +				phys + mapped, iova_end_pad, dir, attrs, 0);
> +		if (error)
> +			goto out_unmap;
> +		state->__size |= DMA_IOVA_USE_SWIOTLB;
> +	}
> +
> +	return 0;
> +
> +out_unmap:
> +	dma_iova_unlink(dev, state, 0, mapped, dir, attrs);
> +	return error;
> +}
> +
> +/**
> + * dma_iova_link - Link a range of IOVA space
> + * @dev: DMA device
> + * @state: IOVA state
> + * @phys: physical address to link
> + * @offset: offset into the IOVA state to map into
> + * @size: size of the buffer
> + * @dir: DMA direction
> + * @attrs: attributes of mapping properties
> + *
> + * Link a range of IOVA space for the given IOVA state without IOTLB sync.
> + * This function is used to link multiple physical addresses in contiguous
> + * IOVA space without performing costly IOTLB sync.
> + *
> + * The caller is responsible to call to dma_iova_sync() to sync IOTLB at
> + * the end of linkage.
> + */
> +int dma_iova_link(struct device *dev, struct dma_iova_state *state,
> +		phys_addr_t phys, size_t offset, size_t size,
> +		enum dma_data_direction dir, unsigned long attrs)
> +{
> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> +	struct iova_domain *iovad = &cookie->iovad;
> +	size_t iova_start_pad = iova_offset(iovad, phys);
> +
> +	if (WARN_ON_ONCE(iova_start_pad && offset > 0))
> +		return -EIO;
> +
> +	if (dev_use_swiotlb(dev, size, dir) && iova_offset(iovad, phys | size))
> +		return iommu_dma_iova_link_swiotlb(dev, state, phys, offset,
> +				size, dir, attrs);
> +
> +	return __dma_iova_link(dev, state->addr + offset - iova_start_pad,
> +			phys - iova_start_pad,
> +			iova_align(iovad, size + iova_start_pad), dir, attrs);
> +}
> +EXPORT_SYMBOL_GPL(dma_iova_link);
> +
> +/**
> + * dma_iova_sync - Sync IOTLB
> + * @dev: DMA device
> + * @state: IOVA state
> + * @offset: offset into the IOVA state to sync
> + * @size: size of the buffer
> + *
> + * Sync IOTLB for the given IOVA state. This function should be called on
> + * the IOVA-contiguous range created by one or more dma_iova_link() calls
> + * to sync the IOTLB.
> + */
> +int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
> +		size_t offset, size_t size)
> +{
> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> +	struct iova_domain *iovad = &cookie->iovad;
> +	dma_addr_t addr = state->addr + offset;
> +	size_t iova_start_pad = iova_offset(iovad, addr);
> +
> +	return iommu_sync_map(domain, addr - iova_start_pad,
> +		      iova_align(iovad, size + iova_start_pad));
> +}
> +EXPORT_SYMBOL_GPL(dma_iova_sync);
> +
> +static void iommu_dma_iova_unlink_range_slow(struct device *dev,
> +		dma_addr_t addr, size_t size, enum dma_data_direction dir,
> +		unsigned long attrs)
> +{
> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> +	struct iova_domain *iovad = &cookie->iovad;
> +	size_t iova_start_pad = iova_offset(iovad, addr);
> +	dma_addr_t end = addr + size;
> +
> +	do {
> +		phys_addr_t phys;
> +		size_t len;
> +
> +		phys = iommu_iova_to_phys(domain, addr);
> +		if (WARN_ON(!phys))
> +			continue;
> +		len = min_t(size_t,
> +			end - addr, iovad->granule - iova_start_pad);
> +
> +		if (!dev_is_dma_coherent(dev) &&
> +		    !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> +			arch_sync_dma_for_cpu(phys, len, dir);
> +
> +		swiotlb_tbl_unmap_single(dev, phys, len, dir, attrs);

How do you know that "phys" and "len" match what was originally 
allocated and bounced in, and this isn't going to try to bounce out too 
much, free the wrong slot, or anything else nasty? If it's not supposed 
to be intentional that a sub-granule buffer can be linked to any offset 
in the middle of the IOVA range as long as its original physical address 
is aligned to the IOVA granule size(?), why try to bounce anywhere other 
than the ends of the range at all?

> +
> +		addr += len;
> +		iova_start_pad = 0;
> +	} while (addr < end);
> +}
> +
> +static void __iommu_dma_iova_unlink(struct device *dev,
> +		struct dma_iova_state *state, size_t offset, size_t size,
> +		enum dma_data_direction dir, unsigned long attrs,
> +		bool free_iova)
> +{
> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> +	struct iova_domain *iovad = &cookie->iovad;
> +	dma_addr_t addr = state->addr + offset;
> +	size_t iova_start_pad = iova_offset(iovad, addr);
> +	struct iommu_iotlb_gather iotlb_gather;
> +	size_t unmapped;
> +
> +	if ((state->__size & DMA_IOVA_USE_SWIOTLB) ||
> +	    (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)))
> +		iommu_dma_iova_unlink_range_slow(dev, addr, size, dir, attrs);
> +
> +	iommu_iotlb_gather_init(&iotlb_gather);
> +	iotlb_gather.queued = free_iova && READ_ONCE(cookie->fq_domain);

Is it really worth the bother?

> +	size = iova_align(iovad, size + iova_start_pad);
> +	addr -= iova_start_pad;
> +	unmapped = iommu_unmap_fast(domain, addr, size, &iotlb_gather);
> +	WARN_ON(unmapped != size);
> +
> +	if (!iotlb_gather.queued)
> +		iommu_iotlb_sync(domain, &iotlb_gather);
> +	if (free_iova)
> +		iommu_dma_free_iova(cookie, addr, size, &iotlb_gather);

There's no guarantee that "size" is the correct value here, so this has 
every chance of corrupting the IOVA domain.

> +}
> +
> +/**
> + * dma_iova_unlink - Unlink a range of IOVA space
> + * @dev: DMA device
> + * @state: IOVA state
> + * @offset: offset into the IOVA state to unlink
> + * @size: size of the buffer
> + * @dir: DMA direction
> + * @attrs: attributes of mapping properties
> + *
> + * Unlink a range of IOVA space for the given IOVA state.

If I initially link a large range in one go, then unlink a small part of 
it, what behaviour can I expect?

Thanks,
Robin.

> + */
> +void dma_iova_unlink(struct device *dev, struct dma_iova_state *state,
> +		size_t offset, size_t size, enum dma_data_direction dir,
> +		unsigned long attrs)
> +{
> +	 __iommu_dma_iova_unlink(dev, state, offset, size, dir, attrs, false);
> +}
> +EXPORT_SYMBOL_GPL(dma_iova_unlink);
> +
> +/**
> + * dma_iova_destroy - Finish a DMA mapping transaction
> + * @dev: DMA device
> + * @state: IOVA state
> + * @mapped_len: number of bytes to unmap
> + * @dir: DMA direction
> + * @attrs: attributes of mapping properties
> + *
> + * Unlink the IOVA range up to @mapped_len and free the entire IOVA space. The
> + * range of IOVA from dma_addr to @mapped_len must all be linked, and be the
> + * only linked IOVA in state.
> + */
> +void dma_iova_destroy(struct device *dev, struct dma_iova_state *state,
> +		size_t mapped_len, enum dma_data_direction dir,
> +		unsigned long attrs)
> +{
> +	if (mapped_len)
> +		__iommu_dma_iova_unlink(dev, state, 0, mapped_len, dir, attrs,
> +				true);
> +	else
> +		/*
> +		 * We can be here if first call to dma_iova_link() failed and
> +		 * there is nothing to unlink, so let's be more clear.
> +		 */
> +		dma_iova_free(dev, state);
> +}
> +EXPORT_SYMBOL_GPL(dma_iova_destroy);
> +
>   void iommu_setup_dma_ops(struct device *dev)
>   {
>   	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 817f11bce7bc..8074a3b5c807 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -313,6 +313,17 @@ static inline bool dma_use_iova(struct dma_iova_state *state)
>   bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
>   		phys_addr_t phys, size_t size);
>   void dma_iova_free(struct device *dev, struct dma_iova_state *state);
> +void dma_iova_destroy(struct device *dev, struct dma_iova_state *state,
> +		size_t mapped_len, enum dma_data_direction dir,
> +		unsigned long attrs);
> +int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
> +		size_t offset, size_t size);
> +int dma_iova_link(struct device *dev, struct dma_iova_state *state,
> +		phys_addr_t phys, size_t offset, size_t size,
> +		enum dma_data_direction dir, unsigned long attrs);
> +void dma_iova_unlink(struct device *dev, struct dma_iova_state *state,
> +		size_t offset, size_t size, enum dma_data_direction dir,
> +		unsigned long attrs);
>   #else /* CONFIG_IOMMU_DMA */
>   static inline bool dma_use_iova(struct dma_iova_state *state)
>   {
> @@ -327,6 +338,27 @@ static inline void dma_iova_free(struct device *dev,
>   		struct dma_iova_state *state)
>   {
>   }
> +static inline void dma_iova_destroy(struct device *dev,
> +		struct dma_iova_state *state, size_t mapped_len,
> +		enum dma_data_direction dir, unsigned long attrs)
> +{
> +}
> +static inline int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
> +		size_t offset, size_t size)
> +{
> +	return -EOPNOTSUPP;
> +}
> +static inline int dma_iova_link(struct device *dev,
> +		struct dma_iova_state *state, phys_addr_t phys, size_t offset,
> +		size_t size, enum dma_data_direction dir, unsigned long attrs)
> +{
> +	return -EOPNOTSUPP;
> +}
> +static inline void dma_iova_unlink(struct device *dev,
> +		struct dma_iova_state *state, size_t offset, size_t size,
> +		enum dma_data_direction dir, unsigned long attrs)
> +{
> +}
>   #endif /* CONFIG_IOMMU_DMA */
>   
>   #if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)


* Re: [PATCH v1 08/17] dma-mapping: add a dma_need_unmap helper
  2024-10-30 15:12 ` [PATCH v1 08/17] dma-mapping: add a dma_need_unmap helper Leon Romanovsky
@ 2024-10-31 21:18   ` Robin Murphy
  2024-11-01 11:06     ` Leon Romanovsky
  2024-11-04  9:15     ` Christoph Hellwig
  0 siblings, 2 replies; 63+ messages in thread
From: Robin Murphy @ 2024-10-31 21:18 UTC (permalink / raw)
  To: Leon Romanovsky, Jens Axboe, Jason Gunthorpe, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm

On 30/10/2024 3:12 pm, Leon Romanovsky wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> Add helper that allows a driver to skip calling dma_unmap_*
> if the DMA layer can guarantee that they are no-ops.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>   include/linux/dma-mapping.h |  5 +++++
>   kernel/dma/mapping.c        | 20 ++++++++++++++++++++
>   2 files changed, 25 insertions(+)
> 
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 8074a3b5c807..6906edde505d 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -410,6 +410,7 @@ static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
>   {
>   	return dma_dev_need_sync(dev) ? __dma_need_sync(dev, dma_addr) : false;
>   }
> +bool dma_need_unmap(struct device *dev);
>   #else /* !CONFIG_HAS_DMA || !CONFIG_DMA_NEED_SYNC */
>   static inline bool dma_dev_need_sync(const struct device *dev)
>   {
> @@ -435,6 +436,10 @@ static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
>   {
>   	return false;
>   }
> +static inline bool dma_need_unmap(struct device *dev)
> +{
> +	return false;
> +}
>   #endif /* !CONFIG_HAS_DMA || !CONFIG_DMA_NEED_SYNC */
>   
>   struct page *dma_alloc_pages(struct device *dev, size_t size,
> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> index 864a1121bf08..daa97a650778 100644
> --- a/kernel/dma/mapping.c
> +++ b/kernel/dma/mapping.c
> @@ -442,6 +442,26 @@ bool __dma_need_sync(struct device *dev, dma_addr_t dma_addr)
>   }
>   EXPORT_SYMBOL_GPL(__dma_need_sync);
>   
> +/**
> + * dma_need_unmap - does this device need dma_unmap_* operations
> + * @dev: device to check
> + *
> + * If this function returns %false, drivers can skip calling dma_unmap_* after
> + * finishing an I/O.  This function must be called after all mappings that might
> + * need to be unmapped have been performed.

In terms of the unmap call itself, why don't we just use dma_skip_sync 
to short-cut dma_direct_unmap_*() and make sure it's as cheap as possible?

In terms of not having to unmap implying not having to store addresses 
at all, it doesn't seem super-useful when you still have to store them 
for long enough to find out that you don't :/

Thanks,
Robin.

> + */
> +bool dma_need_unmap(struct device *dev)
> +{
> +	if (!dma_map_direct(dev, get_dma_ops(dev)))
> +		return true;
> +#ifdef CONFIG_DMA_NEED_SYNC
> +	if (!dev->dma_skip_sync)
> +		return true;
> +#endif
> +	return IS_ENABLED(CONFIG_DMA_API_DEBUG);
> +}
> +EXPORT_SYMBOL_GPL(dma_need_unmap);
> +
>   static void dma_setup_need_sync(struct device *dev)
>   {
>   	const struct dma_map_ops *ops = get_dma_ops(dev);


* Re: [PATCH v1 08/17] dma-mapping: add a dma_need_unmap helper
  2024-10-31 21:18   ` Robin Murphy
@ 2024-11-01 11:06     ` Leon Romanovsky
  2024-11-04  9:15     ` Christoph Hellwig
  1 sibling, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2024-11-01 11:06 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Jens Axboe, Jason Gunthorpe, Joerg Roedel, Will Deacon,
	Christoph Hellwig, Sagi Grimberg, Keith Busch, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Marek Szyprowski, Jérôme Glisse,
	Andrew Morton, Jonathan Corbet, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm

On Thu, Oct 31, 2024 at 09:18:11PM +0000, Robin Murphy wrote:
> On 30/10/2024 3:12 pm, Leon Romanovsky wrote:
> > From: Christoph Hellwig <hch@lst.de>
> > 
> > Add helper that allows a driver to skip calling dma_unmap_*
> > if the DMA layer can guarantee that they are no-ops.
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >   include/linux/dma-mapping.h |  5 +++++
> >   kernel/dma/mapping.c        | 20 ++++++++++++++++++++
> >   2 files changed, 25 insertions(+)
> > 
> > diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> > index 8074a3b5c807..6906edde505d 100644
> > --- a/include/linux/dma-mapping.h
> > +++ b/include/linux/dma-mapping.h
> > @@ -410,6 +410,7 @@ static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
> >   {
> >   	return dma_dev_need_sync(dev) ? __dma_need_sync(dev, dma_addr) : false;
> >   }
> > +bool dma_need_unmap(struct device *dev);
> >   #else /* !CONFIG_HAS_DMA || !CONFIG_DMA_NEED_SYNC */
> >   static inline bool dma_dev_need_sync(const struct device *dev)
> >   {
> > @@ -435,6 +436,10 @@ static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
> >   {
> >   	return false;
> >   }
> > +static inline bool dma_need_unmap(struct device *dev)
> > +{
> > +	return false;
> > +}
> >   #endif /* !CONFIG_HAS_DMA || !CONFIG_DMA_NEED_SYNC */
> >   struct page *dma_alloc_pages(struct device *dev, size_t size,
> > diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> > index 864a1121bf08..daa97a650778 100644
> > --- a/kernel/dma/mapping.c
> > +++ b/kernel/dma/mapping.c
> > @@ -442,6 +442,26 @@ bool __dma_need_sync(struct device *dev, dma_addr_t dma_addr)
> >   }
> >   EXPORT_SYMBOL_GPL(__dma_need_sync);
> > +/**
> > + * dma_need_unmap - does this device need dma_unmap_* operations
> > + * @dev: device to check
> > + *
> > + * If this function returns %false, drivers can skip calling dma_unmap_* after
> > + * finishing an I/O.  This function must be called after all mappings that might
> > + * need to be unmapped have been performed.
> 
> In terms of the unmap call itself, why don't we just use dma_skip_sync to
> short-cut dma_direct_unmap_*() and make sure it's as cheap as possible?

From what I see, dma_skip_sync is not available when the kernel is built
without CONFIG_DMA_NEED_SYNC.

> 
> In terms of not having to unmap implying not having to store addresses at
> all, it doesn't seem super-useful when you still have to store them for long
> enough to find out that you don't :/

Why? The decision whether DMA addresses are needed is taken when allocating
the relevant arrays, before we have any DMA address to store. If we know
that we don't need to unmap, we can skip the allocation of the array for
free. So what exactly do "you still have to store", and when?

Thanks

> 
> Thanks,
> Robin.
> 
> > + */
> > +bool dma_need_unmap(struct device *dev)
> > +{
> > +	if (!dma_map_direct(dev, get_dma_ops(dev)))
> > +		return true;
> > +#ifdef CONFIG_DMA_NEED_SYNC
> > +	if (!dev->dma_skip_sync)
> > +		return true;
> > +#endif
> > +	return IS_ENABLED(CONFIG_DMA_API_DEBUG);
> > +}
> > +EXPORT_SYMBOL_GPL(dma_need_unmap);
> > +
> >   static void dma_setup_need_sync(struct device *dev)
> >   {
> >   	const struct dma_map_ops *ops = get_dma_ops(dev);
> 


* Re: [PATCH v1 07/17] dma-mapping: Implement link/unlink ranges API
  2024-10-31 21:18   ` Robin Murphy
@ 2024-11-04  9:10     ` Christoph Hellwig
  2024-11-04 12:19       ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2024-11-04  9:10 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Leon Romanovsky, Jens Axboe, Jason Gunthorpe, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg, Leon Romanovsky,
	Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm

On Thu, Oct 31, 2024 at 09:18:07PM +0000, Robin Murphy wrote:
>>   +static int __dma_iova_link(struct device *dev, dma_addr_t addr,
>> +		phys_addr_t phys, size_t size, enum dma_data_direction dir,
>> +		unsigned long attrs)
>> +{
>> +	bool coherent = dev_is_dma_coherent(dev);
>> +
>> +	if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
>
> If you really imagine this can support non-coherent operation and 
> DMA_ATTR_SKIP_CPU_SYNC, where are the corresponding explicit sync 
> operations? dma_sync_single_*() sure as heck aren't going to work...
>
> In fact, same goes for SWIOTLB bouncing even in the coherent case.

Not with explicit sync operations.  But plain map/unmap works; I've
actually verified that with nvme.  And that's a pretty large use
case.

>> +		arch_sync_dma_for_device(phys, size, dir);
>
> Plus if the aim is to pass P2P and whatever arbitrary physical addresses 
> through here as well, how can we be sure this isn't going to explode?

That's a good point.  Only P2P mapped through the host bridge can even
end up here, so the address is a perfectly valid physical address
in the host.  But I'm not sure if all arch_sync_dma_for_device
implementations handle IOMMU memory fine.

>> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
>> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
>> +	struct iova_domain *iovad = &cookie->iovad;
>> +	size_t iova_start_pad = iova_offset(iovad, phys);
>> +	size_t iova_end_pad = iova_offset(iovad, phys + size);
>
> I thought the code below was wrong until I double-checked and realised that 
> this is not what its name implies it to be...

Which variable does this refer to, and what would be a better name?

>> +		phys = iommu_iova_to_phys(domain, addr);
>> +		if (WARN_ON(!phys))
>> +			continue;
>> +		len = min_t(size_t,
>> +			end - addr, iovad->granule - iova_start_pad);
>> +
>> +		if (!dev_is_dma_coherent(dev) &&
>> +		    !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
>> +			arch_sync_dma_for_cpu(phys, len, dir);
>> +
>> +		swiotlb_tbl_unmap_single(dev, phys, len, dir, attrs);
>
> How do you know that "phys" and "len" match what was originally allocated 
> and bounced in, and this isn't going to try to bounce out too much, free 
> the wrong slot, or anything else nasty? If it's not supposed to be 
> intentional that a sub-granule buffer can be linked to any offset in the 
> middle of the IOVA range as long as its original physical address is 
> aligned to the IOVA granule size(?), why try to bounce anywhere other than 
> the ends of the range at all?

Mostly because the code is simpler and unless misused it just works.
But it might be worth adding explicit checks for the start and end.

>> +static void __iommu_dma_iova_unlink(struct device *dev,
>> +		struct dma_iova_state *state, size_t offset, size_t size,
>> +		enum dma_data_direction dir, unsigned long attrs,
>> +		bool free_iova)
>> +{
>> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
>> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
>> +	struct iova_domain *iovad = &cookie->iovad;
>> +	dma_addr_t addr = state->addr + offset;
>> +	size_t iova_start_pad = iova_offset(iovad, addr);
>> +	struct iommu_iotlb_gather iotlb_gather;
>> +	size_t unmapped;
>> +
>> +	if ((state->__size & DMA_IOVA_USE_SWIOTLB) ||
>> +	    (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)))
>> +		iommu_dma_iova_unlink_range_slow(dev, addr, size, dir, attrs);
>> +
>> +	iommu_iotlb_gather_init(&iotlb_gather);
>> +	iotlb_gather.queued = free_iova && READ_ONCE(cookie->fq_domain);
>
> Is it really worth the bother?

Worth what?

>> +	size = iova_align(iovad, size + iova_start_pad);
>> +	addr -= iova_start_pad;
>> +	unmapped = iommu_unmap_fast(domain, addr, size, &iotlb_gather);
>> +	WARN_ON(unmapped != size);
>> +
>> +	if (!iotlb_gather.queued)
>> +		iommu_iotlb_sync(domain, &iotlb_gather);
>> +	if (free_iova)
>> +		iommu_dma_free_iova(cookie, addr, size, &iotlb_gather);
>
> There's no guarantee that "size" is the correct value here, so this has 
> every chance of corrupting the IOVA domain.

Yes, but the same is true for every user of the iommu_* API as well.

>> +/**
>> + * dma_iova_unlink - Unlink a range of IOVA space
>> + * @dev: DMA device
>> + * @state: IOVA state
>> + * @offset: offset into the IOVA state to unlink
>> + * @size: size of the buffer
>> + * @dir: DMA direction
>> + * @attrs: attributes of mapping properties
>> + *
>> + * Unlink a range of IOVA space for the given IOVA state.
>
> If I initially link a large range in one go, then unlink a small part of 
> it, what behaviour can I expect?

As in map say 128k and then unmap 4k?  It will just work, even if that
is not the intended use case, which is either map everything up front
and unmap everything together, or the HMM version of random constant
mapping and unmapping at page size granularity.
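
For the "map everything up front, unmap everything together" case the
intended flow is roughly this (a minimal sketch against the interfaces in
this patch; the caller-side names are made up and error unwinding is
trimmed):

	struct dma_iova_state state = {};
	size_t offset = 0, mapped = 0;
	int i, ret = 0;

	if (!dma_iova_try_alloc(dev, &state, page_to_phys(pages[0]),
				nr_pages * PAGE_SIZE))
		return -EOPNOTSUPP;	/* fall back to a dma_map_page() loop */

	for (i = 0; i < nr_pages; i++) {
		ret = dma_iova_link(dev, &state, page_to_phys(pages[i]),
				    offset, PAGE_SIZE, dir, 0);
		if (ret)
			break;
		offset += PAGE_SIZE;
		mapped += PAGE_SIZE;
	}
	if (!ret)
		ret = dma_iova_sync(dev, &state, 0, mapped);
	if (ret) {
		dma_iova_destroy(dev, &state, mapped, dir, 0);
		return ret;
	}

	/* ... do the I/O against the IOVA in state ... */

	dma_iova_destroy(dev, &state, mapped, dir, 0);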



* Re: [PATCH v1 08/17] dma-mapping: add a dma_need_unmap helper
  2024-10-31 21:18   ` Robin Murphy
  2024-11-01 11:06     ` Leon Romanovsky
@ 2024-11-04  9:15     ` Christoph Hellwig
  1 sibling, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2024-11-04  9:15 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Leon Romanovsky, Jens Axboe, Jason Gunthorpe, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg, Keith Busch,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm

On Thu, Oct 31, 2024 at 09:18:11PM +0000, Robin Murphy wrote:
>>   +/**
>> + * dma_need_unmap - does this device need dma_unmap_* operations
>> + * @dev: device to check
>> + *
>> + * If this function returns %false, drivers can skip calling dma_unmap_* after
>> + * finishing an I/O.  This function must be called after all mappings that might
>> + * need to be unmapped have been performed.
>
> In terms of the unmap call itself, why don't we just use dma_skip_sync to 
> short-cut dma_direct_unmap_*() and make sure it's as cheap as possible?
>
> In terms of not having to unmap implying not having to store addresses at 
> all, it doesn't seem super-useful when you still have to store them for 
> long enough to find out that you don't :/

I don't fully understand the comment, mostly because the two sentences
appear to contradict each other as I read them.

Bypassing dma_direct_unmap_ is not the important part, because it already
is pretty cheap.  Storing the addresses is not.

That being said, now that we never check need_unmap in the iova path,
it might make sense to not have a separate helper, but it needs to
be exposed and documented.



* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-10-31 21:17 ` Robin Murphy
@ 2024-11-04  9:58   ` Christoph Hellwig
  2024-11-04 11:39     ` Leon Romanovsky
  2024-11-05 19:53     ` Jason Gunthorpe
  0 siblings, 2 replies; 63+ messages in thread
From: Christoph Hellwig @ 2024-11-04  9:58 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Leon Romanovsky, Jens Axboe, Jason Gunthorpe, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg, Keith Busch,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm

On Thu, Oct 31, 2024 at 09:17:45PM +0000, Robin Murphy wrote:
> The hilarious amount of work that iommu_dma_map_sg() does is pretty much 
> entirely for the benefit of v4l2 and dma-buf importers who *depend* on 
> being able to linearise a scatterlist in DMA address space. TBH I doubt 
> there are many actual scatter-gather-capable devices with significant 
> enough limitations to meaningfully benefit from DMA segment combining these 
> days - I've often thought that by now it might be a good idea to turn that 
> behaviour off by default and add an attribute for callers to explicitly 
> request it.

Even when devices are not limited they often perform significantly better
when IOVA space is not completely fragmented.  While the dma_map_sg code
is a bit gross due to the fact that it has to deal with unaligned segments,
the coalescing itself often is a big win.

Note that dma_map_sg also has two other very useful features:  batching
of the iotlb flushing, and support for P2P, which to be efficient also
requires batching the lookups.

>> This uniqueness has been a long standing pain point as the scatterlist API
>> is mandatory, but expensive to use.
>
> Huh? When and where has anything ever called it mandatory? Nobody's getting 
> sent to DMA jail for open-coding:

You don't get sent to jail.  But you do not get batched iotlb sync, you
don't get properly working P2P, and you don't get IOVA coalescing.

>> Several approaches have been explored to expand the DMA API with additional
>> scatterlist-like structures (BIO, rlist), instead split up the DMA API
>> to allow callers to bring their own data structure.
>
> And this line of reasoning is still "2 + 2 = Thursday" - what is to say 
> those two notions in any way related? We literally already have one generic 
> DMA operation which doesn't operate on struct page, yet needed nothing 
> "split up" to be possible.

Yeah, I don't really get the struct page argument.  In fact if we look
at the nitty-gritty details of dma_map_page it doesn't really need a
page at all.  I've been looking at cleaning some of this up and providing
a dma_map_phys/paddr which would be quite handy in a few places.  Not
because we don't have a struct page for the memory, but because converting
to/from it all the time is not very efficient.
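
Purely as an illustration of the shape being discussed (nothing in this
series defines it, so treat it as hypothetical), that would be something
like:

	dma_addr_t dma_map_phys(struct device *dev, phys_addr_t phys,
			size_t size, enum dma_data_direction dir,
			unsigned long attrs);
	void dma_unmap_phys(struct device *dev, dma_addr_t dma_addr,
			size_t size, enum dma_data_direction dir,
			unsigned long attrs);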

>>   2. VFIO PCI live migration code is building a very large "page list"
>>      for the device. Instead of allocating a scatter list entry per allocated
>>      page it can just allocate an array of 'struct page *', saving a large
>>      amount of memory.
>
> VFIO already assumes a coherent device with (realistically) an IOMMU which 
> it explicitly manages - why is it even pretending to need a generic DMA 
> API?

AFAIK that isn't really vfio as we know it but the control device
for live migration.  But Leon or Jason might fill in more.

The point is that quite a few devices have these page list based APIs
(RDMA where mlx5 comes from, NVMe with PRPs, AHCI, GPUs).

>
>>   3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
>>      list without having to allocate then populate an intermediate SG table.
>
> As above, given that a bio_vec still deals in struct pages, that could 
> seemingly already be done by just mapping the pages, so how is it proving 
> any benefit of a fragile new interface?

Because we only need to preallocate the tiny constant sized dma_iova_state
as part of the request instead of an additional scatterlist that requires
sizeof(struct page *) + sizeof(dma_addr_t) + 3 * sizeof(unsigned int)
per segment, including a memory allocation per I/O for that.
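
(For scale: on a typical 64-bit configuration that is 8 + 8 + 3 * 4 = 28
bytes of bookkeeping per segment, 32 with structure padding, before even
counting the per-I/O allocation that holds the table.)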

> My big concern here is that a thin and vaguely-defined wrapper around the 
> IOMMU API is itself a step which smells strongly of "abuse and design 
> mistake", given that the basic notion of allocating DMA addresses in 
> advance clearly cannot generalise. Thus it really demands some considered 
> justification beyond "We must do something; This is something; Therefore we 
> must do this." to be convincing.

At least for the block code we have a nice little core wrapper that is
very easy to use, and provides a great reduction of memory use and
allocations.  The HMM use case I'll let others talk about.



* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-11-04  9:58   ` Christoph Hellwig
@ 2024-11-04 11:39     ` Leon Romanovsky
  2024-11-05 19:53     ` Jason Gunthorpe
  1 sibling, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2024-11-04 11:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Robin Murphy, Jens Axboe, Jason Gunthorpe, Joerg Roedel,
	Will Deacon, Sagi Grimberg, Keith Busch, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Marek Szyprowski, Jérôme Glisse,
	Andrew Morton, Jonathan Corbet, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm

On Mon, Nov 04, 2024 at 10:58:31AM +0100, Christoph Hellwig wrote:
> On Thu, Oct 31, 2024 at 09:17:45PM +0000, Robin Murphy wrote:

<...>

> >>   2. VFIO PCI live migration code is building a very large "page list"
> >>      for the device. Instead of allocating a scatter list entry per allocated
> >>      page it can just allocate an array of 'struct page *', saving a large
> >>      amount of memory.
> >
> > VFIO already assumes a coherent device with (realistically) an IOMMU which 
> > it explicitly manages - why is it even pretending to need a generic DMA 
> > API?
> 
> AFAIK that does isn't really vfio as we know it but the control device
> for live migration.  But Leon or Jason might fill in more.

Yes, you are right, as it is written above: "VFIO PCI live migration ...".
That piece of code is directly connected to the underlying real HW device
and uses the DMA API to provide live migration functionality to/from that
device.

> 
> The point is that quite a few devices have these page list based APIs
> (RDMA where mlx5 comes from, NVMe with PRPs, AHCI, GPUs).
> 
> >
> >>   3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
> >>      list without having to allocate then populate an intermediate SG table.
> >
> > As above, given that a bio_vec still deals in struct pages, that could 
> > seemingly already be done by just mapping the pages, so how is it proving 
> > any benefit of a fragile new interface?
> 
> Because we only need to preallocate the tiny constant sized dma_iova_state
> as part of the request instead of an additional scatterlist that requires
> sizeof(struct page *) + sizeof(dma_addr_t) + 3 * sizeof(unsigned int)
> per segment, including a memory allocation per I/O for that.
> 
> > My big concern here is that a thin and vaguely-defined wrapper around the 
> > IOMMU API is itself a step which smells strongly of "abuse and design 
> > mistake", given that the basic notion of allocating DMA addresses in 
> > advance clearly cannot generalise. Thus it really demands some considered 
> > justification beyond "We must do something; This is something; Therefore we 
> > must do this." to be convincing.
> 
> At least for the block code we have a nice little core wrapper that is
> very easy to use, and provides a great reduction of memory use and
> allocations.  The HMM use case I'll let others talk about.

I'm not sure which wrappers Robin is talking about, but if we are talking
about the HMM wrappers, they gave us a perfect combination of usability,
performance and maintainability. All HMM users follow the same pattern,
use the same structures and don't need to worry about internal DMA/IOMMU
details.

Thanks


* Re: [PATCH v1 07/17] dma-mapping: Implement link/unlink ranges API
  2024-11-04  9:10     ` Christoph Hellwig
@ 2024-11-04 12:19       ` Jason Gunthorpe
  2024-11-04 12:53         ` Christoph Hellwig
  0 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2024-11-04 12:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Robin Murphy, Leon Romanovsky, Jens Axboe, Joerg Roedel,
	Will Deacon, Sagi Grimberg, Leon Romanovsky, Keith Busch,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm

On Mon, Nov 04, 2024 at 10:10:48AM +0100, Christoph Hellwig wrote:
> >> +		arch_sync_dma_for_device(phys, size, dir);
> >
> > Plus if the aim is to pass P2P and whatever arbitrary physical addresses 
> > through here as well, how can we be sure this isn't going to explode?
> 
> That's a good point.  Only P2P mapped through the host bridge can even
> end up here, so the address is a perfectly valid physical address
> in the host.  But I'm not sure if all arch_sync_dma_for_device
> implementations handle IOMMU memory fine.

I was told on x86 if you do a cache flush operation on MMIO there is a
chance it will MCE. Recently had some similar discussions about ARM
where it was asserted some platforms may have similar.

It would be safest to only call arch flushing calls on memory that is
mapped cachable. We can assume that a P2P target is never CPU
mapped cachable, regardless of how the DMA is routed.

Jason


* Re: [PATCH v1 07/17] dma-mapping: Implement link/unlink ranges API
  2024-11-04 12:19       ` Jason Gunthorpe
@ 2024-11-04 12:53         ` Christoph Hellwig
  2024-11-07 14:50           ` Leon Romanovsky
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2024-11-04 12:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Robin Murphy, Leon Romanovsky, Jens Axboe,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Leon Romanovsky,
	Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm

On Mon, Nov 04, 2024 at 08:19:24AM -0400, Jason Gunthorpe wrote:
> > That's a good point.  Only P2P mapped through the host bridge can even
> > end up here, so the address is a perfectly valid physical address
> > in the host.  But I'm not sure if all arch_sync_dma_for_device
> > implementations handle IOMMU memory fine.
> 
> I was told on x86 if you do a cache flush operation on MMIO there is a
> chance it will MCE. Recently had some similar discussions about ARM
> where it was asserted some platforms may have similar.

On x86 we never flush caches for DMA operations anyway, so x86 isn't
really the concern here, but architectures that do cache incoherent DMA
to PCIe devices.  Which isn't a whole lot as most SOCs try to avoid that
for PCIe even if they lack DMA coherence for lesser peripherals, but I bet
there are some on arm/arm64 and maybe riscv or mips.

> It would be safest to only call arch flushing calls on memory that is
> mapped cachable. We can assume that a P2P target is never CPU
> mapped cachable, regardless of how the DMA is routed.

Yes.  I.e. force DMA_ATTR_SKIP_CPU_SYNC for P2P.
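
In caller terms that could be as simple as something like this (just a
sketch of the idea, not code from the series):

	unsigned long attrs = 0;

	if (is_pci_p2pdma_page(page))
		attrs |= DMA_ATTR_SKIP_CPU_SYNC;

	ret = dma_iova_link(dev, &state, page_to_phys(page), offset,
			    PAGE_SIZE, dir, attrs);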



* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
                   ` (18 preceding siblings ...)
  2024-10-31 21:17 ` Robin Murphy
@ 2024-11-05 18:51 ` Jason Gunthorpe
  19 siblings, 0 replies; 63+ messages in thread
From: Jason Gunthorpe @ 2024-11-05 18:51 UTC (permalink / raw)
  To: Leon Romanovsky, Christoph Hellwig
  Cc: Jens Axboe, Robin Murphy, Joerg Roedel, Will Deacon,
	Sagi Grimberg, Keith Busch, Bjorn Helgaas, Logan Gunthorpe,
	Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
	Marek Szyprowski, Jérôme Glisse, Andrew Morton,
	Jonathan Corbet, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm

On Wed, Oct 30, 2024 at 05:12:46PM +0200, Leon Romanovsky wrote:

>  Documentation/core-api/dma-api.rst   |  70 ++++
>  drivers/infiniband/core/umem_odp.c   | 250 +++++----------
>  drivers/infiniband/hw/mlx5/mlx5_ib.h |  12 +-
>  drivers/infiniband/hw/mlx5/odp.c     |  65 ++--
>  drivers/infiniband/hw/mlx5/umr.c     |  12 +-
>  drivers/iommu/dma-iommu.c            | 459 +++++++++++++++++++++++----
>  drivers/iommu/iommu.c                |  65 ++--
>  drivers/pci/p2pdma.c                 |  38 +--
>  drivers/vfio/pci/mlx5/cmd.c          | 373 +++++++++++-----------
>  drivers/vfio/pci/mlx5/cmd.h          |  35 +-
>  drivers/vfio/pci/mlx5/main.c         |  87 +++--
>  include/linux/dma-map-ops.h          |  54 ----
>  include/linux/dma-mapping.h          |  85 +++++
>  include/linux/hmm-dma.h              |  32 ++
>  include/linux/hmm.h                  |  16 +
>  include/linux/iommu.h                |   4 +
>  include/linux/pci-p2pdma.h           |  84 +++++
>  include/rdma/ib_umem_odp.h           |  25 +-
>  kernel/dma/direct.c                  |  44 +--
>  kernel/dma/mapping.c                 |  20 ++
>  mm/hmm.c                             | 231 +++++++++++++-

This is touching a lot of subsystems; at least two are mine :)

Who is in the hot seat to merge this? Are we expecting it this merge
window?

I've read through past versions and am happy with the general
concept. I would like to read it again in detail.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-11-04  9:58   ` Christoph Hellwig
  2024-11-04 11:39     ` Leon Romanovsky
@ 2024-11-05 19:53     ` Jason Gunthorpe
  2024-11-07  8:32       ` Christoph Hellwig
  1 sibling, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2024-11-05 19:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Robin Murphy, Leon Romanovsky, Jens Axboe, Joerg Roedel,
	Will Deacon, Sagi Grimberg, Keith Busch, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Marek Szyprowski, Jérôme Glisse,
	Andrew Morton, Jonathan Corbet, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm

On Mon, Nov 04, 2024 at 10:58:31AM +0100, Christoph Hellwig wrote:
> On Thu, Oct 31, 2024 at 09:17:45PM +0000, Robin Murphy wrote:
> > The hilarious amount of work that iommu_dma_map_sg() does is pretty much 
> > entirely for the benefit of v4l2 and dma-buf importers who *depend* on 
> > being able to linearise a scatterlist in DMA address space. TBH I doubt 
> > there are many actual scatter-gather-capable devices with significant 
> > enough limitations to meaningfully benefit from DMA segment combining these 
> > days - I've often thought that by now it might be a good idea to turn that 
> > behaviour off by default and add an attribute for callers to explicitly 
> > request it.
> 
> Even when devices are not limited they often perform significantly better
> when IOVA space is not completely fragmented.  While the dma_map_sg code
> is a bit gross due to the fact that it has to deal with unaligned segments,
> the coalescing itself often is a big win.

RDMA is like this too. Almost all the MR HW gets big wins if the
entire scatter list is IOVA contiguous. One of the future steps I'd
like to see on top of this is to fine tune the IOVA allocation backing
MRs to exactly match the HW needs. Having proper alignment and
contiguity can be a huge reduction in device overhead: a 100MB MR
may need to store 200K of mapping information on-device, but with a
properly aligned IOVA this can be reduced to only 16 bytes.
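
(Back of the envelope, assuming 4KiB pages and 8-byte entries: 100MB is
~25,600 pages, so ~200KB of per-page translation state on the device,
versus a single base-plus-length descriptor of roughly 16 bytes once the
IOVA range is contiguous and suitably aligned.)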

Avoiding a double translation tax when the iommu HW is enabled is
potentially significant. We have some RDMA workloads with VMs where
the NIC is holding ~1GB of memory just for translations, but the iommu
is active as the S2, i.e. we are paying a double tax on translation.

It could be a very interesting trade off to reduce the NIC side to
nothing and rely on the CPU IOMMU with nested translation instead.

> Note that dma_map_sg also has two other very useful features:  batching
> of the iotlb flushing, and support for P2P, which to be efficient also
> requires batching the lookups.

This is the main point and, I think, the uniqueness Leon is talking
about. We don't get those properties through any other API, and this
one series preserves them.

In fact I would say that is the entire point of this series: preserve
everything special about dma_map_sg() compared to dma_map_page() but
don't require a scatterlist.
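
As a rough sketch of how a consumer might drive that split, using only
the helpers quoted elsewhere in this thread (dma_iova_try_alloc,
dma_iova_link, dma_iova_destroy) and glossing over includes and the
sync step; illustrative only, not a verbatim copy of the series:

   struct phys_vec { phys_addr_t paddr; size_t len; };  /* made-up name */

   static int example_map(struct device *dev, struct dma_iova_state *state,
                          struct phys_vec *vecs, int nr, size_t total_len)
   {
           size_t mapped = 0;
           int i, ret;

           /* step 1: size and allocate the IOVA window up front */
           if (!dma_iova_try_alloc(dev, state, vecs[0].paddr, total_len))
                   return -EOPNOTSUPP;  /* fall back to dma_map_page() etc. */

           /* step 2: link each physically contiguous range into the window */
           for (i = 0; i < nr; i++) {
                   ret = dma_iova_link(dev, state, vecs[i].paddr, mapped,
                                       vecs[i].len, DMA_TO_DEVICE, 0);
                   if (ret)
                           goto err_destroy;
                   mapped += vecs[i].len;
           }
           return 0;

   err_destroy:
           dma_iova_destroy(dev, state, mapped, DMA_TO_DEVICE, 0);
           return ret;
   }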

> >> Several approaches have been explored to expand the DMA API with additional
> >> scatterlist-like structures (BIO, rlist), instead split up the DMA API
> >> to allow callers to bring their own data structure.
> >
> > And this line of reasoning is still "2 + 2 = Thursday" - what is to say 
> > those two notions in any way related? We literally already have one generic 
> > DMA operation which doesn't operate on struct page, yet needed nothing 
> > "split up" to be possible.
> 
> Yeah, I don't really get the struct page argument.  In fact if we look
> at the nitty-gritty details of dma_map_page it doesn't really need a
> page at all. 

Today, if you want to map a P2P address you must have a struct page,
because page->pgmap is the only source of information on the P2P
topology.

So the logic is, to get P2P without struct page we need a way to have
all the features of dma_map_sg() but without a mandatory scatterlist
because we cannot remove struct page from scatterlist.

This series gets to the first step - no scatterlist. There will need
to be another series to provide an alternative to page->pgmap to get
the P2P information. Then we really won't have struct page dependence
in the DMA API.

I actually once looked at how to enhance dma_map_resource() to support
P2P and it was not very nice; the unmap side became quite complex. I
think this is a more elegant solution than what I was sketching.

> >>      for the device. Instead of allocating a scatter list entry per allocated
> >>      page it can just allocate an array of 'struct page *', saving a large
> >>      amount of memory.
> >
> > VFIO already assumes a coherent device with (realistically) an IOMMU which 
> > it explicitly manages - why is it even pretending to need a generic DMA 
> > API?
> 
> AFAIK that isn't really vfio as we know it but the control device
> for live migration.  But Leon or Jason might fill in more.

Yes, this is the control side of the VFIO live migration driver that
needs rather a lot of memory to store the migration blob. There is
definitely an iommu, and the VF function is definitely translating,
but that doesn't mean the PF function is using dma-iommu.c; it is often
in iommu passthrough/identity and using DMA direct.

It was done as an alternative example of how to use the API. Again,
there are more improvements possible there; the driver does not take
advantage of contiguity or alignment when programming the HW.

> Because we only need to preallocate the tiny constant sized dma_iova_state
> as part of the request instead of an additional scatterlist that requires
> sizeof(struct page *) + sizeof(dma_addr_t) + 3 * sizeof(unsigned int)
> per segment, including a memory allocation per I/O for that.

Right, eliminating scatterlist entirely on fast paths is a big
point. I recall Chuck was keen on the same thing for NFSoRDMA as well.

> At least for the block code we have a nice little core wrapper that is
> very easy to use, and provides a great reduction of memory use and
> allocations.  The HMM use case I'll let others talk about.

I saw the Intel XE team make a complicated integration with the DMA
API that wasn't so good. They were looking at an earlier version of
this and I think the feedback was positive. It should make a big
difference, but we will need to see what they come up with and possibly
tweak things.

Jason

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-11-05 19:53     ` Jason Gunthorpe
@ 2024-11-07  8:32       ` Christoph Hellwig
  2024-11-07 13:28         ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2024-11-07  8:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Robin Murphy, Leon Romanovsky, Jens Axboe,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Keith Busch,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm

On Tue, Nov 05, 2024 at 03:53:57PM -0400, Jason Gunthorpe wrote:
> > Yeah, I don't really get the struct page argument.  In fact if we look
> > at the nitty-gritty details of dma_map_page it doesn't really need a
> > page at all. 
> 
> Today, if you want to map a P2P address you must have a struct page,
> because page->pgmap is the only source of information on the P2P
> topology.
> 
> So the logic is, to get P2P without struct page we need a way to have
> all the features of dma_map_sg() but without a mandatory scatterlist
> because we cannot remove struct page from scatterlist.

Well, that is true but also not the point.  The hard part is to
find the P2P routing information without the page.  After that
any physical address based interface will work, including a trivial
dma_map_phys.
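
A hypothetical signature, purely for illustration (mirroring
dma_map_page() but taking a raw physical address; not something this
series adds):

   dma_addr_t dma_map_phys(struct device *dev, phys_addr_t phys,
                           size_t size, enum dma_data_direction dir,
                           unsigned long attrs);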

> > At least for the block code we have a nice little core wrapper that is
> > very easy to use, and provides a great reduction of memory use and
> > allocations.  The HMM use case I'll let others talk about.
> 
> I saw the Intel XE team make a complicated integration with the DMA
> API that wasn't so good. They were looking at an earlier version of
> this and I think the feedback was positive. It should make a big
> difference, but we will need to see what they come up with and possibly
> tweak things.

Not even sure what XE is, but do you have a pointer to it?  It would
really be great if people having DMA problems talked to the dma-mapping
and iommu maintainers / list.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-11-07  8:32       ` Christoph Hellwig
@ 2024-11-07 13:28         ` Jason Gunthorpe
  2024-11-07 13:50           ` Christoph Hellwig
  0 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2024-11-07 13:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Robin Murphy, Leon Romanovsky, Jens Axboe, Joerg Roedel,
	Will Deacon, Sagi Grimberg, Keith Busch, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Marek Szyprowski, Jérôme Glisse,
	Andrew Morton, Jonathan Corbet, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm

On Thu, Nov 07, 2024 at 09:32:56AM +0100, Christoph Hellwig wrote:
> On Tue, Nov 05, 2024 at 03:53:57PM -0400, Jason Gunthorpe wrote:
> > > Yeah, I don't really get the struct page argument.  In fact if we look
> > > at the nitty-gritty details of dma_map_page it doesn't really need a
> > > page at all. 
> > 
> > Today, if you want to map a P2P address you must have a struct page,
> > because page->pgmap is the only source of information on the P2P
> > topology.
> > 
> > So the logic is, to get P2P without struct page we need a way to have
> > all the features of dma_map_sg() but without a mandatory scatterlist
> > because we cannot remove struct page from scatterlist.
> 
> Well, that is true but also not the point.  The hard part is to
> find the P2P routing information without the page.  After that
> any physical address based interface will work, including a trivial
> dma_map_phys.

Once we are freed from scatterlist we can explore a design that would
pass the P2P routing information directly. For instance imagine
something like:

   dma_map_p2p(dev, phys, p2p_provider);

Then dma_map_page(dev, page) could be something like

   if (is_pci_p2pdma_page(page))
      dma_map_p2p(dev, page_to_phys(page), page->pgmap->p2p_provider)

From there we could then go into DRM/VFIO/etc and give them
p2p_providers without pgmaps. p2p_provider is some light refactoring
of what is already in drivers/pci/p2pdma.c

For the dmabuf use cases it is not actually hard to find the P2P
routing information - the driver constructing the dmabuf has it. The
challenge is carrying that information from the originating driver,
through the dmabuf apis to the final place that does the dma mapping.

So I'm thinking of a datastructure for things like dmabuf/rdma MR
that is sort of like this:

   struct phys_list {
         enum type; // CPU, p2p, encrypted, whatever
         struct p2p_provider *p2p_provider;
         struct phys_list *next;
         struct phys_range frags[];
   }

Where each phys_list would be a single uniform dma operation and
easily carries the extra metadata. No struct page, no serious issue
transferring the P2P routing information.

> > I saw the Intel XE team make a complicated integration with the DMA
> > API that wasn't so good. They were looking at an earlier version of
> > this and I think the feedback was positive. It should make a big
> > difference, but we will need to see what they come up with and possibly
> > tweak things.
> 
> Not even sure what XE is, but do you have a pointer to it?  It would
> really be great if people having DMA problems talked to the dma-mapping
> and iommu maintainers / list.

GPU driver

https://lore.kernel.org/dri-devel/20240117221223.18540-7-oak.zeng@intel.com/

Jason

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-11-07 13:28         ` Jason Gunthorpe
@ 2024-11-07 13:50           ` Christoph Hellwig
  2024-11-08 15:02             ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2024-11-07 13:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Robin Murphy, Leon Romanovsky, Jens Axboe,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Keith Busch,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm, matthew.brost,
	Thomas.Hellstrom, brian.welty, himal.prasad.ghimiray,
	krishnaiah.bommu, niranjana.vishwanathapura

On Thu, Nov 07, 2024 at 09:28:08AM -0400, Jason Gunthorpe wrote:
> Once we are freed from scatterlist we can explore a design that would
> pass the P2P routing information directly. For instance imagine
> something like:
> 
>    dma_map_p2p(dev, phys, p2p_provider);
> 
> Then dma_map_page(dev, page) could be something like
> 
>    if (is_pci_p2pdma_page(page))
>       dev_map_p2p(dev, page_to_phys(page), page->pgmap->p2p_provider)

One thing that this series does is to move the P2P mapping decisions out
of the low-level dma mapping helpers and into the caller (again) for
the non-sg callers, and to move the special switch-based bus mapping into
a routine that can be called directly.

Take a look at blk_rq_dma_map_iter_start, which now literally uses
dma_map_page for the no-iommu, no-switch P2P case.  It also is a good
use case for the proposed dma_map_phys.
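
Roughly the shape of that caller-side dispatch, pieced together from the
pci_p2pdma_state() switch visible in the diffs later in this thread; a
simplified sketch, not the actual blk_rq_dma_map_iter_start():

   switch (pci_p2pdma_state(p2pdma_state, dma_dev, page)) {
   case PCI_P2PDMA_MAP_BUS_ADDR:
           /* switch-routed P2P: program the PCI bus address directly,
            * no IOMMU and no swiotlb involvement */
           dma_addr = paddr + bus_offset;  /* provider's bus_offset */
           break;
   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
   case PCI_P2PDMA_MAP_NONE:
           /* host-bridge-routed P2P is treated like host memory */
           dma_addr = dma_map_page(dma_dev, page, 0, len, dir);
           break;
   default:
           return DMA_MAPPING_ERROR;  /* peer not addressable by this device */
   }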

> GPU driver
> 
> https://lore.kernel.org/dri-devel/20240117221223.18540-7-oak.zeng@intel.com/

Eww, that's horrible.  Converting this to Leon's new hmm helpers
would be really nice (and would show that they are useful for more than
mlx5).


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 07/17] dma-mapping: Implement link/unlink ranges API
  2024-11-04 12:53         ` Christoph Hellwig
@ 2024-11-07 14:50           ` Leon Romanovsky
  0 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2024-11-07 14:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, Robin Murphy, Jens Axboe, Joerg Roedel,
	Will Deacon, Sagi Grimberg, Keith Busch, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Marek Szyprowski, Jérôme Glisse,
	Andrew Morton, Jonathan Corbet, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm

On Mon, Nov 04, 2024 at 01:53:02PM +0100, Christoph Hellwig wrote:
> On Mon, Nov 04, 2024 at 08:19:24AM -0400, Jason Gunthorpe wrote:
> > > That's a good point.  Only mapped through host bridge P2P can even
> > > end up here, so the address is a perfectly valid physical address
> > > in the host.  But I'm not sure if all arch_sync_dma_for_device
> > > implementations handle IOMMU memory fine.
> > 
> > I was told on x86 if you do a cache flush operation on MMIO there is a
> > chance it will MCE. Recently had some similar discussions about ARM
> > where it was asserted some platforms may have similar.
> 
> On x86 we never flush caches for DMA operations anyway, so x86 isn't
> really the concern here, but architectures that do cache-incoherent DMA
> to PCIe devices.  That isn't a whole lot, as most SOCs try to avoid it
> for PCIe even if they lack DMA coherence for lesser peripherals, but I
> bet there are some on arm/arm64 and maybe riscv or mips.
> 
> > It would be safest to only call arch flushing calls on memory that is
> > mapped cachable. We can assume that a P2P target is never CPU
> > mapped cachable, regardless of how the DMA is routed.
> 
> Yes.  I.e. force DMA_ATTR_SKIP_CPU_SYNC for P2P.

What do you think?

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 38bcb3ecceeb..065bdace3344 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -559,14 +559,19 @@ static bool blk_rq_dma_map_iova(struct request *req, struct device *dma_dev,
 {
 	enum dma_data_direction dir = rq_dma_dir(req);
 	unsigned int mapped = 0;
+	unsigned long attrs = 0;
 	int error = 0;
 
 	iter->addr = state->addr;
 	iter->len = dma_iova_size(state);
+	if (req->cmd_flags & REQ_P2PDMA) {
+		attrs |= DMA_ATTR_SKIP_CPU_SYNC;
+		req->cmd_flags &= ~REQ_P2PDMA;
+	}
 
 	do {
 		error = dma_iova_link(dma_dev, state, vec->paddr, mapped,
-				vec->len, dir, 0);
+				vec->len, dir, attrs);
 		if (error)
 			goto error_unmap;
 		mapped += vec->len;
@@ -578,7 +583,7 @@ static bool blk_rq_dma_map_iova(struct request *req, struct device *dma_dev,
 
 	return true;
 error_unmap:
-	dma_iova_destroy(dma_dev, state, mapped, rq_dma_dir(req), 0);
+	dma_iova_destroy(dma_dev, state, mapped, rq_dma_dir(req), attrs);
 	iter->status = errno_to_blk_status(error);
 	return false;
 }
@@ -633,7 +638,6 @@ bool blk_rq_dma_map_iter_start(struct request *req, struct device *dma_dev,
 			 * P2P transfers through the host bridge are treated the
 			 * same as non-P2P transfers below and during unmap.
 			 */
-			req->cmd_flags &= ~REQ_P2PDMA;
 			break;
 		default:
 			iter->status = BLK_STS_INVAL;
@@ -644,6 +648,8 @@ bool blk_rq_dma_map_iter_start(struct request *req, struct device *dma_dev,
 	if (blk_can_dma_map_iova(req, dma_dev) &&
 	    dma_iova_try_alloc(dma_dev, state, vec.paddr, total_len))
 		return blk_rq_dma_map_iova(req, dma_dev, state, iter, &vec);
+
+	req->cmd_flags &= ~REQ_P2PDMA;
 	return blk_dma_map_direct(req, dma_dev, iter, &vec);
 }
 EXPORT_SYMBOL_GPL(blk_rq_dma_map_iter_start);
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 62980ca8f3c5..5fe30fbc42b0 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -23,6 +23,7 @@ struct mmu_interval_notifier;
  * HMM_PFN_WRITE - if the page memory can be written to (requires HMM_PFN_VALID)
  * HMM_PFN_ERROR - accessing the pfn is impossible and the device should
  *                 fail. ie poisoned memory, special pages, no vma, etc
+ * HMM_PFN_P2PDMA - P2P page, not bus mapped
  * HMM_PFN_P2PDMA_BUS - Bus mapped P2P transfer
  * HMM_PFN_DMA_MAPPED - Flag preserved on input-to-output transformation
  *                      to mark that page is already DMA mapped
@@ -41,6 +42,7 @@ enum hmm_pfn_flags {
 	HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
 
 	/* Sticky flag, carried from Input to Output */
+	HMM_PFN_P2PDMA     = 1UL << (BITS_PER_LONG - 5),
 	HMM_PFN_P2PDMA_BUS = 1UL << (BITS_PER_LONG - 6),
 	HMM_PFN_DMA_MAPPED = 1UL << (BITS_PER_LONG - 7),
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 4ef2b3815212..b2ec199c2ea8 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -710,6 +710,7 @@ dma_addr_t hmm_dma_map_pfn(struct device *dev, struct hmm_dma_map *map,
 	struct page *page = hmm_pfn_to_page(pfns[idx]);
 	phys_addr_t paddr = hmm_pfn_to_phys(pfns[idx]);
 	size_t offset = idx * map->dma_entry_size;
+	unsigned long attrs = 0;
 	dma_addr_t dma_addr;
 	int ret;
 
@@ -740,6 +741,9 @@ dma_addr_t hmm_dma_map_pfn(struct device *dev, struct hmm_dma_map *map,
 
 	switch (pci_p2pdma_state(p2pdma_state, dev, page)) {
 	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+		attrs |= DMA_ATTR_SKIP_CPU_SYNC;
+		pfns[idx] |= HMM_PFN_P2PDMA;
+		fallthrough;
 	case PCI_P2PDMA_MAP_NONE:
 		break;
 	case PCI_P2PDMA_MAP_BUS_ADDR:
@@ -752,7 +756,8 @@ dma_addr_t hmm_dma_map_pfn(struct device *dev, struct hmm_dma_map *map,
 
 	if (dma_use_iova(state)) {
 		ret = dma_iova_link(dev, state, paddr, offset,
-				    map->dma_entry_size, DMA_BIDIRECTIONAL, 0);
+				    map->dma_entry_size, DMA_BIDIRECTIONAL,
+				    attrs);
 		if (ret)
 			return DMA_MAPPING_ERROR;
 
@@ -793,6 +798,7 @@ bool hmm_dma_unmap_pfn(struct device *dev, struct hmm_dma_map *map, size_t idx)
 	struct dma_iova_state *state = &map->state;
 	dma_addr_t *dma_addrs = map->dma_list;
 	unsigned long *pfns = map->pfn_list;
+	unsigned long attrs = 0;
 
 #define HMM_PFN_VALID_DMA (HMM_PFN_VALID | HMM_PFN_DMA_MAPPED)
 	if ((pfns[idx] & HMM_PFN_VALID_DMA) != HMM_PFN_VALID_DMA)
@@ -801,14 +807,16 @@ bool hmm_dma_unmap_pfn(struct device *dev, struct hmm_dma_map *map, size_t idx)
 
 	if (pfns[idx] & HMM_PFN_P2PDMA_BUS)
 		; /* no need to unmap bus address P2P mappings */
-	else if (dma_use_iova(state))
+	else if (dma_use_iova(state)) {
+		if (pfns[idx] & HMM_PFN_P2PDMA)
+			attrs |= DMA_ATTR_SKIP_CPU_SYNC;
 		dma_iova_unlink(dev, state, idx * map->dma_entry_size,
-				map->dma_entry_size, DMA_BIDIRECTIONAL, 0);
-	else if (dma_need_unmap(dev))
+				map->dma_entry_size, DMA_BIDIRECTIONAL, attrs);
+	} else if (dma_need_unmap(dev))
 		dma_unmap_page(dev, dma_addrs[idx], map->dma_entry_size,
 			       DMA_BIDIRECTIONAL);
 
-	pfns[idx] &= ~(HMM_PFN_DMA_MAPPED | HMM_PFN_P2PDMA_BUS);
+	pfns[idx] &= ~(HMM_PFN_DMA_MAPPED | HMM_PFN_P2PDMA | HMM_PFN_P2PDMA_BUS);
 	return true;
 }
 EXPORT_SYMBOL_GPL(hmm_dma_unmap_pfn);

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-11-07 13:50           ` Christoph Hellwig
@ 2024-11-08 15:02             ` Jason Gunthorpe
  2024-11-08 15:05               ` Christoph Hellwig
  0 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2024-11-08 15:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Robin Murphy, Leon Romanovsky, Jens Axboe, Joerg Roedel,
	Will Deacon, Sagi Grimberg, Keith Busch, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Marek Szyprowski, Jérôme Glisse,
	Andrew Morton, Jonathan Corbet, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

On Thu, Nov 07, 2024 at 02:50:25PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 07, 2024 at 09:28:08AM -0400, Jason Gunthorpe wrote:
> > Once we are freed from scatterlist we can explore a design that would
> > pass the P2P routing information directly. For instance imagine
> > something like:
> > 
> >    dma_map_p2p(dev, phys, p2p_provider);
> > 
> > Then dma_map_page(dev, page) could be something like
> > 
> >    if (is_pci_p2pdma_page(page))
> >       dma_map_p2p(dev, page_to_phys(page), page->pgmap->p2p_provider)
> 
> One thing that this series does is to move the P2P mapping decisions out
> of the low-level dma mapping helpers and into the caller (again) for
> the non-sg callers and moves the special switch based bus mapping into
> a routine that can be called directly.
> 
> Take a look at blk_rq_dma_map_iter_start, which now literally uses
> dma_map_page for the no-iommu, no-switch P2P case.  It also is a good
> use case for the proposed dma_map_phys.

Is it fully OK? Can't dma_map_page() trigger swiotlb? It must not do
that for P2P. How does it know the difference if it just gets a phys?

Jason

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-11-08 15:02             ` Jason Gunthorpe
@ 2024-11-08 15:05               ` Christoph Hellwig
  2024-11-08 15:25                 ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2024-11-08 15:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Robin Murphy, Leon Romanovsky, Jens Axboe,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Keith Busch,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm, matthew.brost,
	Thomas.Hellstrom, brian.welty, himal.prasad.ghimiray,
	krishnaiah.bommu, niranjana.vishwanathapura

On Fri, Nov 08, 2024 at 11:02:26AM -0400, Jason Gunthorpe wrote:
> Is it fully OK? Can't dma_map_page() trigger swiotlb? It must not do
> that for P2P. How does it know the difference if it just gets a phys?

dma_direct_map_page checks for p2p pages in the swiotlb bounce
path already in the current kernel, and dma_map_sg relies on exactly
that check to prevent bouncing for p2p.
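
(Roughly the check being referred to; a from-memory sketch of the
dma_direct_map_page() bounce path, not verbatim kernel source: the path
that would bounce through swiotlb refuses P2P pages instead.)

   if (is_swiotlb_force_bounce(dev) ||
       unlikely(!dma_capable(dev, dma_addr, size, true))) {
           if (is_pci_p2pdma_page(page))
                   return DMA_MAPPING_ERROR;  /* never bounce P2P */
           return swiotlb_map(dev, phys, size, dir, attrs);
   }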


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-11-08 15:05               ` Christoph Hellwig
@ 2024-11-08 15:25                 ` Jason Gunthorpe
  2024-11-08 15:29                   ` Christoph Hellwig
  0 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2024-11-08 15:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Robin Murphy, Leon Romanovsky, Jens Axboe, Joerg Roedel,
	Will Deacon, Sagi Grimberg, Keith Busch, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Marek Szyprowski, Jérôme Glisse,
	Andrew Morton, Jonathan Corbet, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

On Fri, Nov 08, 2024 at 04:05:00PM +0100, Christoph Hellwig wrote:
> On Fri, Nov 08, 2024 at 11:02:26AM -0400, Jason Gunthorpe wrote:
> > Is it fully OK? Can't dma_map_page() trigger swiotlb? It must not do
> > that for P2P. How does it know the difference if it just gets a phys?
> 
> dma_direct_map_page checks for p2p pages in the swiotlb bounce
> path already in the current kernel, and dma_map_sg relies on exactly
> that check to prevent bouncing for p2p.

I'm asking how it will work if you change the struct page argument to
physical, because today dma_direct_map_page() has:

		if (is_pci_p2pdma_page(page))
			return DMA_MAPPING_ERROR;

Which is exactly the sort of thing I'm looking at when I say to
get rid of struct page.

What I'm thinking about is replacing code like the above with something like:

		if (p2p_provider)
			return DMA_MAPPING_ERROR;

And the caller is the one that would have done is_pci_p2pdma_page()
and either passes p2p_provider=NULL or page->pgmap->p2p_provider.

Anyhow, I hope Leon will attempt this once this is settled and it will
make more sense in patches. I'm just brainstorming how I've been
thinking of it.

Another option would be some 'is_pci_p2pdma_page_phys(phys)', but I
think that is going to be worse performance than managing a
p2p_provider pointer in the mapping call chain explicitly.

Jason

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-11-08 15:25                 ` Jason Gunthorpe
@ 2024-11-08 15:29                   ` Christoph Hellwig
  2024-11-08 15:38                     ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2024-11-08 15:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Robin Murphy, Leon Romanovsky, Jens Axboe,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Keith Busch,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm, matthew.brost,
	Thomas.Hellstrom, brian.welty, himal.prasad.ghimiray,
	krishnaiah.bommu, niranjana.vishwanathapura

On Fri, Nov 08, 2024 at 11:25:37AM -0400, Jason Gunthorpe wrote:
> I'm asking how it will work if you change the struct page argument to
> physical, because today dma_direct_map_page() has:
> 
> 		if (is_pci_p2pdma_page(page))
> 			return DMA_MAPPING_ERROR;
> 
> Which is exactly the sort of thing I'm looking at when I say to
> get rid of struct page.

It will have to look up the page from the physical address obviously.
But at least only in the error path.

> What I'm thinking about is replacing code like the above with something like:
> 
> 		if (p2p_provider)
> 			return DMA_MAPPING_ERROR;
> 
> And the caller is the one that would have done is_pci_p2pdma_page()
> and either passes p2p_provider=NULL or page->pgmap->p2p_provider.

And where do you get that one from?


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-11-08 15:29                   ` Christoph Hellwig
@ 2024-11-08 15:38                     ` Jason Gunthorpe
  2024-11-12  6:01                       ` Christoph Hellwig
  0 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2024-11-08 15:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Robin Murphy, Leon Romanovsky, Jens Axboe, Joerg Roedel,
	Will Deacon, Sagi Grimberg, Keith Busch, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Marek Szyprowski, Jérôme Glisse,
	Andrew Morton, Jonathan Corbet, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

On Fri, Nov 08, 2024 at 04:29:56PM +0100, Christoph Hellwig wrote:
> On Fri, Nov 08, 2024 at 11:25:37AM -0400, Jason Gunthorpe wrote:
> > I'm asking how it will work if you change the struct page argument to
> > physical, because today dma_direct_map_page() has:
> > 
> > 		if (is_pci_p2pdma_page(page))
> > 			return DMA_MAPPING_ERROR;
> > 
> > Which is exactly the sort of thing I'm looking at when I say to
> > get rid of struct page.
> 
> It will have to look up the page from the physical address obviously.
> But at least only in the error path.

I'm thinking we can largely avoid searching on physical, or at least
we can optimize this so there is only one search on physical at the
start of the DMA mapping (since we are now saying all pages are the
same type).

> > What I'm thinking about is replacing code like the above with something like:
> > 
> > 		if (p2p_provider)
> > 			return DMA_MAPPING_ERROR;
> > 
> > And the caller is the one that would have done is_pci_p2pdma_page()
> > and either passes p2p_provider=NULL or page->pgmap->p2p_provider.
> 
> And where do you get that one from?

Which one?

The caller must know the p2p properties of what it is doing because it
is driving all the P2P logic around what APIs to call.

Either because it is already working with struct page and gets it out
of the pgmap.

Or it is working with non-struct page memory and has a (MMIO address,
p2p_provider) tuple that it got from the original driver that gave it
the MMIO address.

Or it really does have a naked phys_addr_t and it did the search on
physical, but only once.

Jason

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 09/17] docs: core-api: document the IOVA-based API
  2024-10-30 15:12 ` [PATCH v1 09/17] docs: core-api: document the IOVA-based API Leon Romanovsky
  2024-10-31  1:41   ` Randy Dunlap
@ 2024-11-08 19:34   ` Jonathan Corbet
  2024-11-08 20:03     ` Leon Romanovsky
  1 sibling, 1 reply; 63+ messages in thread
From: Jonathan Corbet @ 2024-11-08 19:34 UTC (permalink / raw)
  To: Leon Romanovsky, Jens Axboe, Jason Gunthorpe, Robin Murphy,
	Joerg Roedel, Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm

Leon Romanovsky <leon@kernel.org> writes:

> From: Christoph Hellwig <hch@lst.de>
>
> Add an explanation of the newly added IOVA-based mapping API.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  Documentation/core-api/dma-api.rst | 70 ++++++++++++++++++++++++++++++
>  1 file changed, 70 insertions(+)
>
> diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst
> index 8e3cce3d0a23..6095696a65a7 100644
> --- a/Documentation/core-api/dma-api.rst
> +++ b/Documentation/core-api/dma-api.rst
> @@ -530,6 +530,76 @@ routines, e.g.:::
>  		....
>  	}
>  
> +Part Ie - IOVA-based DMA mappings
> +---------------------------------
> +
> +These APIs allow a very efficient mapping when using an IOMMU.  They are an
> +optional path that requires extra code and are only recommended for drivers
> +where DMA mapping performance, or the space usage for storing the DMA addresses
> +matter.  All the considerations from the previous section apply here as well.
> +
> +::
> +
> +    bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
> +		phys_addr_t phys, size_t size);
> +
> +Is used to try to allocate IOVA space for a mapping operation.  If it returns
> +false this API can't be used for the given device and the normal streaming
> +DMA mapping API should be used.  The ``struct dma_iova_state`` is allocated
> +by the driver and must be kept around until unmap time.

So, I see that you have nice kernel-doc comments for these; why not just
pull them in here with a kernel-doc directive rather than duplicating
the information?

Thanks,

jon

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 09/17] docs: core-api: document the IOVA-based API
  2024-11-08 19:34   ` Jonathan Corbet
@ 2024-11-08 20:03     ` Leon Romanovsky
  2024-11-08 20:13       ` Jonathan Corbet
  0 siblings, 1 reply; 63+ messages in thread
From: Leon Romanovsky @ 2024-11-08 20:03 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg, Keith Busch,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm

On Fri, Nov 08, 2024 at 12:34:21PM -0700, Jonathan Corbet wrote:
> Leon Romanovsky <leon@kernel.org> writes:
> 
> > From: Christoph Hellwig <hch@lst.de>
> >
> > Add an explanation of the newly added IOVA-based mapping API.
> >
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  Documentation/core-api/dma-api.rst | 70 ++++++++++++++++++++++++++++++
> >  1 file changed, 70 insertions(+)
> >
> > diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst
> > index 8e3cce3d0a23..6095696a65a7 100644
> > --- a/Documentation/core-api/dma-api.rst
> > +++ b/Documentation/core-api/dma-api.rst
> > @@ -530,6 +530,76 @@ routines, e.g.:::
> >  		....
> >  	}
> >  
> > +Part Ie - IOVA-based DMA mappings
> > +---------------------------------
> > +
> > +These APIs allow a very efficient mapping when using an IOMMU.  They are an
> > +optional path that requires extra code and are only recommended for drivers
> > +where DMA mapping performance, or the space usage for storing the DMA addresses
> > +matter.  All the considerations from the previous section apply here as well.
> > +
> > +::
> > +
> > +    bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
> > +		phys_addr_t phys, size_t size);
> > +
> > +Is used to try to allocate IOVA space for a mapping operation.  If it returns
> > +false this API can't be used for the given device and the normal streaming
> > +DMA mapping API should be used.  The ``struct dma_iova_state`` is allocated
> > +by the driver and must be kept around until unmap time.
> 
> So, I see that you have nice kernel-doc comments for these; why not just
> pull them in here with a kernel-doc directive rather than duplicating
> the information?

Can you please point me to a commit/lore link/documentation with an
example of such a directive and I will do it?

Thanks

> 
> Thanks,
> 
> jon

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 09/17] docs: core-api: document the IOVA-based API
  2024-11-08 20:03     ` Leon Romanovsky
@ 2024-11-08 20:13       ` Jonathan Corbet
  2024-11-08 20:27         ` Leon Romanovsky
  0 siblings, 1 reply; 63+ messages in thread
From: Jonathan Corbet @ 2024-11-08 20:13 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg, Keith Busch,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm

Leon Romanovsky <leon@kernel.org> writes:

>> So, I see that you have nice kernel-doc comments for these; why not just
>> pull them in here with a kernel-doc directive rather than duplicating
>> the information?
>
> Can you please point me to a commit/lore link/documentation with an
> example of such a directive and I will do it?

Documentation/doc-guide/kernel-doc.rst has all the information you need.
It could be as simple as replacing your inline descriptions with:

  .. kernel-doc:: drivers/iommu/dma-iommu.c
     :export:

That will pull in documentation for other, unrelated functions, though;
assuming you don't want those, something like:

  .. kernel-doc:: drivers/iommu/dma-iommu.c
     :identifiers: dma_iova_try_alloc dma_iova_free ...

Then do a docs build and see the nice results you get :)
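
(For a quick check, building just this part of the documentation with
something like "make SPHINXDIRS=core-api htmldocs" should be enough,
assuming the usual Sphinx setup is in place.)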

Thanks,

jon

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 09/17] docs: core-api: document the IOVA-based API
  2024-11-08 20:13       ` Jonathan Corbet
@ 2024-11-08 20:27         ` Leon Romanovsky
  2024-11-10 10:41           ` Leon Romanovsky
  0 siblings, 1 reply; 63+ messages in thread
From: Leon Romanovsky @ 2024-11-08 20:27 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg, Keith Busch,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm

On Fri, Nov 08, 2024 at 01:13:27PM -0700, Jonathan Corbet wrote:
> Leon Romanovsky <leon@kernel.org> writes:
> 
> >> So, I see that you have nice kernel-doc comments for these; why not just
> >> pull them in here with a kernel-doc directive rather than duplicating
> >> the information?
> >
> > Can you please point me to a commit/lore link/documentation with an
> > example of such a directive and I will do it?
> 
> Documentation/doc-guide/kernel-doc.rst has all the information you need.
> It could be as simple as replacing your inline descriptions with:
> 
>   .. kernel-doc:: drivers/iommu/dma-iommu.c
>      :export:
> 
> That will pull in documentation for other, unrelated functions, though;
> assuming you don't want those, something like:
> 
>   .. kernel-doc:: drivers/iommu/dma-iommu.c
>      :identifiers: dma_iova_try_alloc dma_iova_free ...
> 
> Then do a docs build and see the nice results you get :)

Thanks for the explanation, will change it.

> 
> Thanks,
> 
> jon

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 09/17] docs: core-api: document the IOVA-based API
  2024-11-08 20:27         ` Leon Romanovsky
@ 2024-11-10 10:41           ` Leon Romanovsky
  2024-11-11  6:38             ` Christoph Hellwig
  0 siblings, 1 reply; 63+ messages in thread
From: Leon Romanovsky @ 2024-11-10 10:41 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg, Keith Busch,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm

On Fri, Nov 08, 2024 at 10:27:36PM +0200, Leon Romanovsky wrote:
> On Fri, Nov 08, 2024 at 01:13:27PM -0700, Jonathan Corbet wrote:
> > Leon Romanovsky <leon@kernel.org> writes:
> > 
> > >> So, I see that you have nice kernel-doc comments for these; why not just
> > >> pull them in here with a kernel-doc directive rather than duplicating
> > >> the information?
> > >
> > > Can you please point me to a commit/lore link/documentation with an
> > > example of such a directive and I will do it?
> > 
> > Documentation/doc-guide/kernel-doc.rst has all the information you need.
> > It could be as simple as replacing your inline descriptions with:
> > 
> >   .. kernel-doc:: drivers/iommu/dma-iommu.c
> >      :export:
> > 
> > That will pull in documentation for other, unrelated functions, though;
> > assuming you don't want those, something like:
> > 
> >   .. kernel-doc:: drivers/iommu/dma-iommu.c
> >      :identifiers: dma_iova_try_alloc dma_iova_free ...
> > 
> > Then do a docs build and see the nice results you get :)
> 
> Thanks for the explanation, will change it.

Jonathan,

I tried this today and the output (HTML) in the new section looks
so different from the rest of dma-api.rst that I lean towards leaving
the current doc implementation as is.

Thanks

> 
> > 
> > Thanks,
> > 
> > jon
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 04/17] dma-mapping: Add check if IOVA can be used
  2024-10-30 15:12 ` [PATCH v1 04/17] dma-mapping: Add check if IOVA can be used Leon Romanovsky
@ 2024-11-10 15:09   ` Zhu Yanjun
  2024-11-10 15:19     ` Leon Romanovsky
  2024-11-11  6:39     ` Christoph Hellwig
  0 siblings, 2 replies; 63+ messages in thread
From: Zhu Yanjun @ 2024-11-10 15:09 UTC (permalink / raw)
  To: Leon Romanovsky, Jens Axboe, Jason Gunthorpe, Robin Murphy,
	Joerg Roedel, Will Deacon, Christoph Hellwig, Sagi Grimberg
  Cc: Leon Romanovsky, Keith Busch, Bjorn Helgaas, Logan Gunthorpe,
	Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
	Marek Szyprowski, Jérôme Glisse, Andrew Morton,
	Jonathan Corbet, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm

On 2024/10/30 16:12, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> This patch adds a check if IOVA can be used for the specific
> transaction.
> 
> In the new API a DMA mapping transaction is identified by a
> struct dma_iova_state, which holds some recomputed information
> for the transaction which does not change for each page being
> mapped.
> 
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>   include/linux/dma-mapping.h | 33 +++++++++++++++++++++++++++++++++
>   1 file changed, 33 insertions(+)
> 
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 1524da363734..6075e0708deb 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -76,6 +76,20 @@
>   
>   #define DMA_BIT_MASK(n)	(((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
>   
> +struct dma_iova_state {
> +	size_t __size;
> +};
> +
> +/*
> + * Use the high bit to mark if we used swiotlb for one or more ranges.
> + */
> +#define DMA_IOVA_USE_SWIOTLB		(1ULL << 63)

A trivial problem.
In the above macro, using BIT_ULL(63) is better?

Zhu Yanjun

> +
> +static inline size_t dma_iova_size(struct dma_iova_state *state)
> +{
> +	return state->__size & ~DMA_IOVA_USE_SWIOTLB;
> +}
> +
>   #ifdef CONFIG_DMA_API_DEBUG
>   void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr);
>   void debug_dma_map_single(struct device *dev, const void *addr,
> @@ -281,6 +295,25 @@ static inline int dma_mmap_noncontiguous(struct device *dev,
>   }
>   #endif /* CONFIG_HAS_DMA */
>   
> +#ifdef CONFIG_IOMMU_DMA
> +/**
> + * dma_use_iova - check if the IOVA API is used for this state
> + * @state: IOVA state
> + *
> + * Return %true if the DMA transfers uses the dma_iova_*() calls or %false if
> + * they can't be used.
> + */
> +static inline bool dma_use_iova(struct dma_iova_state *state)
> +{
> +	return state->__size != 0;
> +}
> +#else /* CONFIG_IOMMU_DMA */
> +static inline bool dma_use_iova(struct dma_iova_state *state)
> +{
> +	return false;
> +}
> +#endif /* CONFIG_IOMMU_DMA */
> +
>   #if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
>   void __dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, size_t size,
>   		enum dma_data_direction dir);


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 04/17] dma-mapping: Add check if IOVA can be used
  2024-11-10 15:09   ` Zhu Yanjun
@ 2024-11-10 15:19     ` Leon Romanovsky
  2024-11-11  6:39     ` Christoph Hellwig
  1 sibling, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2024-11-10 15:19 UTC (permalink / raw)
  To: Zhu Yanjun
  Cc: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
	Will Deacon, Christoph Hellwig, Sagi Grimberg, Keith Busch,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm

On Sun, Nov 10, 2024 at 04:09:11PM +0100, Zhu Yanjun wrote:
> On 2024/10/30 16:12, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > This patch adds a check if IOVA can be used for the specific
> > transaction.
> > 
> > In the new API a DMA mapping transaction is identified by a
> > struct dma_iova_state, which holds some recomputed information
> > for the transaction which does not change for each page being
> > mapped.
> > 
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >   include/linux/dma-mapping.h | 33 +++++++++++++++++++++++++++++++++
> >   1 file changed, 33 insertions(+)
> > 
> > diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> > index 1524da363734..6075e0708deb 100644
> > --- a/include/linux/dma-mapping.h
> > +++ b/include/linux/dma-mapping.h
> > @@ -76,6 +76,20 @@
> >   #define DMA_BIT_MASK(n)	(((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
> > +struct dma_iova_state {
> > +	size_t __size;
> > +};
> > +
> > +/*
> > + * Use the high bit to mark if we used swiotlb for one or more ranges.
> > + */
> > +#define DMA_IOVA_USE_SWIOTLB		(1ULL << 63)
> 
> A trivial problem.
> In the above macro, using BIT_ULL(63) is better?

You already asked the same question and the answer is also the same.
https://lore.kernel.org/all/20241103151946.GA99170@unreal/

> 
> Zhu Yanjun

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 09/17] docs: core-api: document the IOVA-based API
  2024-11-10 10:41           ` Leon Romanovsky
@ 2024-11-11  6:38             ` Christoph Hellwig
  2024-11-11  6:43               ` anish kumar
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2024-11-11  6:38 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jonathan Corbet, Jens Axboe, Jason Gunthorpe, Robin Murphy,
	Joerg Roedel, Will Deacon, Christoph Hellwig, Sagi Grimberg,
	Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm

On Sun, Nov 10, 2024 at 12:41:30PM +0200, Leon Romanovsky wrote:
> I tried this today and the output (HTML) in the new section looks
> so different from the rest of dma-api.rst that I lean towards leaving
> the current doc implementation as is.

Yeah.  The whole DMA API documentation shows its age and could use
a major revamp, but for now I'd prefer to stick to the way it is done.

If we have any volunteers for bringing it up to standards I'd be glad
to help with input and review.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 04/17] dma-mapping: Add check if IOVA can be used
  2024-11-10 15:09   ` Zhu Yanjun
  2024-11-10 15:19     ` Leon Romanovsky
@ 2024-11-11  6:39     ` Christoph Hellwig
  2024-11-11  7:19       ` Greg Sword
  1 sibling, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2024-11-11  6:39 UTC (permalink / raw)
  To: Zhu Yanjun
  Cc: Leon Romanovsky, Jens Axboe, Jason Gunthorpe, Robin Murphy,
	Joerg Roedel, Will Deacon, Christoph Hellwig, Sagi Grimberg,
	Leon Romanovsky, Keith Busch, Bjorn Helgaas, Logan Gunthorpe,
	Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
	Marek Szyprowski, Jérôme Glisse, Andrew Morton,
	Jonathan Corbet, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm

On Sun, Nov 10, 2024 at 04:09:11PM +0100, Zhu Yanjun wrote:
>> +
>> +/*
>> + * Use the high bit to mark if we used swiotlb for one or more ranges.
>> + */
>> +#define DMA_IOVA_USE_SWIOTLB		(1ULL << 63)
>
> A trivial problem.
> In the above macro, using BIT_ULL(63) is better?

No, and can people please stop suggesting it?  That macro is so fucking
pointless that it's revolting.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 09/17] docs: core-api: document the IOVA-based API
  2024-11-11  6:38             ` Christoph Hellwig
@ 2024-11-11  6:43               ` anish kumar
  2024-11-11 14:59                 ` Jonathan Corbet
  0 siblings, 1 reply; 63+ messages in thread
From: anish kumar @ 2024-11-11  6:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Jonathan Corbet, Jens Axboe, Jason Gunthorpe,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm

On Sun, Nov 10, 2024 at 10:39 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Sun, Nov 10, 2024 at 12:41:30PM +0200, Leon Romanovsky wrote:
> > I tried this today and the output (HTML) in the new section looks
> > so different from the rest of dma-api.rst that I lean towards leaving
> > the current doc implementation as is.
>
> Yeah.  The whole DMA API documentation shows its age and could use
> a major revamp, but for now I'd prefer to stick to the way it is done.
>
> If we have any volunteers for bringing it up to standards I'd be glad
> to help with input and review.

Jonathan, if you agree, I can take this up?
>
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 04/17] dma-mapping: Add check if IOVA can be used
  2024-11-11  6:39     ` Christoph Hellwig
@ 2024-11-11  7:19       ` Greg Sword
  0 siblings, 0 replies; 63+ messages in thread
From: Greg Sword @ 2024-11-11  7:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Zhu Yanjun, Leon Romanovsky, Jens Axboe, Jason Gunthorpe,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Leon Romanovsky, Keith Busch, Bjorn Helgaas, Logan Gunthorpe,
	Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
	Marek Szyprowski, Jérôme Glisse, Andrew Morton,
	Jonathan Corbet, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm

On Mon, Nov 11, 2024 at 2:39 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Sun, Nov 10, 2024 at 04:09:11PM +0100, Zhu Yanjun wrote:
> >> +
> >> +/*
> >> + * Use the high bit to mark if we used swiotlb for one or more ranges.
> >> + */
> >> +#define DMA_IOVA_USE_SWIOTLB                (1ULL << 63)
> >
> > A trivial problem.
> > In the above macro, using BIT_ULL(63) is better?
>
> No, and can people please stop suggesting it?  That macro is so fucking
> pointless that it's revolting.

Why do you hate this macro so much? Have you considered the feelings
of the macro author?

>
>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 09/17] docs: core-api: document the IOVA-based API
  2024-11-11  6:43               ` anish kumar
@ 2024-11-11 14:59                 ` Jonathan Corbet
  0 siblings, 0 replies; 63+ messages in thread
From: Jonathan Corbet @ 2024-11-11 14:59 UTC (permalink / raw)
  To: anish kumar, Christoph Hellwig
  Cc: Leon Romanovsky, Jens Axboe, Jason Gunthorpe, Robin Murphy,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Keith Busch,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm

anish kumar <yesanishhere@gmail.com> writes:

> On Sun, Nov 10, 2024 at 10:39 PM Christoph Hellwig <hch@lst.de> wrote:
>>
>> On Sun, Nov 10, 2024 at 12:41:30PM +0200, Leon Romanovsky wrote:
>> > I tried this today and the output (HTML) in the new section looks
>> > so different from the rest of dma-api.rst that I lean towards leaving
>> > the current doc implementation as is.
>>
>> Yeah.  The whole DMA API documentation shows its age and could use
>> a major revamp, but for now I'd prefer to stick to the way it is done.
>>
>> If we have any volunteers for bringing it up to standards I'd be glad
>> to help with input and review.
>
> Jonathan, if you agree, I can take this up?

I am happy to see help with the documentation, but agreement from the
authors and maintainers of the DMA-mapping documentation is rather more
important than agreement from me.

jon

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-11-08 15:38                     ` Jason Gunthorpe
@ 2024-11-12  6:01                       ` Christoph Hellwig
  2024-11-13 18:41                         ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2024-11-12  6:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Robin Murphy, Leon Romanovsky, Jens Axboe,
	Joerg Roedel, Will Deacon, Sagi Grimberg, Keith Busch,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Marek Szyprowski,
	Jérôme Glisse, Andrew Morton, Jonathan Corbet,
	linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
	linux-nvme, linux-pci, kvm, linux-mm, matthew.brost,
	Thomas.Hellstrom, brian.welty, himal.prasad.ghimiray,
	krishnaiah.bommu, niranjana.vishwanathapura

On Fri, Nov 08, 2024 at 11:38:46AM -0400, Jason Gunthorpe wrote:
> > > What I'm thinking about is replacing code like the above with something like:
> > > 
> > > 		if (p2p_provider)
> > > 			return DMA_MAPPING_ERROR;
> > > 
> > > And the caller is the one that would have done is_pci_p2pdma_page()
> > > and either passes p2p_provider=NULL or page->pgmap->p2p_provider.
> > 
> > And where do you get that one from?
> 
> Which one?

The p2p_provider thing (whatever that will actually be).


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v1 00/17] Provide a new two step DMA mapping API
  2024-11-12  6:01                       ` Christoph Hellwig
@ 2024-11-13 18:41                         ` Jason Gunthorpe
  0 siblings, 0 replies; 63+ messages in thread
From: Jason Gunthorpe @ 2024-11-13 18:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Robin Murphy, Leon Romanovsky, Jens Axboe, Joerg Roedel,
	Will Deacon, Sagi Grimberg, Keith Busch, Bjorn Helgaas,
	Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Alex Williamson, Marek Szyprowski, Jérôme Glisse,
	Andrew Morton, Jonathan Corbet, linux-doc, linux-kernel,
	linux-block, linux-rdma, iommu, linux-nvme, linux-pci, kvm,
	linux-mm, matthew.brost, Thomas.Hellstrom, brian.welty,
	himal.prasad.ghimiray, krishnaiah.bommu,
	niranjana.vishwanathapura

On Tue, Nov 12, 2024 at 07:01:08AM +0100, Christoph Hellwig wrote:
> On Fri, Nov 08, 2024 at 11:38:46AM -0400, Jason Gunthorpe wrote:
> > > > What I'm thinking about is replacing code like the above with something like:
> > > > 
> > > > 		if (p2p_provider)
> > > > 			return DMA_MAPPING_ERROR;
> > > > 
> > > > And the caller is the one that would have done is_pci_p2pdma_page()
> > > > and either passes p2p_provider=NULL or page->pgmap->p2p_provider.
> > > 
> > > And where do you get that one from?
> > 
> > Which one?
> 
> The p2p_provider thing (whatever that will actually be).

p2p_provider would be splitting out the information in
pci_p2pdma_pagemap to its own type:

struct pci_p2pdma_pagemap {
	struct pci_dev *provider;
	u64 bus_offset;
	/* rest of pci_p2pdma_pagemap omitted */
};

That is the essential information to compute PCI_P2PDMA_MAP_*.
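
As a rough sketch of what "splitting out" could mean (the new type and
field names below are illustrative, not taken from the series), the
provider information would become its own structure that the pagemap
merely embeds:

/* Hypothetical split; names are illustrative only. */
struct p2pdma_provider {
	struct pci_dev	*owner;		/* exporting PCI device */
	u64		bus_offset;	/* bus address - physical address */
};

struct pci_p2pdma_pagemap {
	struct dev_pagemap	pgmap;
	struct p2pdma_provider	mem;	/* embedded provider info */
};

A non-struct-page user could then carry a struct p2pdma_provider
pointer around without having any pagemap at all.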

For example when blk_rq_dma_map_iter_start() calls pci_p2pdma_state(),
it has this information from page->pgmap. It would still have the
information via the pgmap when we split it out of the
pci_p2pdma_pagemap.

Since everything doing a dma map has to call pci_p2pdma_state() to
compute PCI_P2PDMA_MAP_*, every dma mapping operation has already got
the provider. And since everything is uniform within a mapping
operation, the provider is constant for the whole map.
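
Concretely, the caller-side pattern quoted above might look roughly
like this (the p2p_provider field on the pagemap, backend_dma_map()
and caller_map() are assumptions for illustration, not existing API):

#include <linux/dma-mapping.h>
#include <linux/memremap.h>

/* A hypothetical backend that cannot handle P2P at all. */
static dma_addr_t backend_dma_map(struct device *dev, struct page *page,
				  struct p2pdma_provider *provider)
{
	if (provider)
		return DMA_MAPPING_ERROR;

	return dma_map_page(dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
}

static dma_addr_t caller_map(struct device *dev, struct page *page)
{
	/* Resolved once per uniform group; NULL means plain host memory. */
	struct p2pdma_provider *provider = NULL;

	if (is_pci_p2pdma_page(page))
		provider = page->pgmap->p2p_provider; /* hypothetical field */

	return backend_dma_map(dev, page, provider);
}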

For future non-struct page cases the provider comes along with the
address list from whatever created the address list in the first
place.

Looking at dmabuf for example, I expect dmabuf to provide a new data
structure which is a list of lists:

 [ [provider GPU:  [mmio_addr1, mmio_addr2, mmio_addr3]],
   [provider NULL: [cpu_addr1, cpu_addr2, ...]],
   ...
 ]

And each uniform group would be dma map'd on its own using the
embedded provider instead of page->pgmap.
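
One possible shape for such a dmabuf-supplied structure, purely as a
sketch (none of these types exist today):

/* One uniform group: a provider plus the addresses it exported. */
struct dma_range_group {
	struct p2pdma_provider	*provider;	/* NULL => plain host memory */
	phys_addr_t		*addrs;
	unsigned int		nr_addrs;
};

/* The "list of lists": each group gets dma mapped on its own. */
struct dma_range_list {
	struct dma_range_group	*groups;
	unsigned int		nr_groups;
};

The importer would walk the list group by group and pick the P2P or
the normal mapping path based on group->provider.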

Jason

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2024-11-13 18:41 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-30 15:12 [PATCH v1 00/17] Provide a new two step DMA mapping API Leon Romanovsky
2024-10-30 15:12 ` [PATCH v1 01/17] PCI/P2PDMA: Refactor the p2pdma mapping helpers Leon Romanovsky
2024-10-30 15:12 ` [PATCH v1 02/17] dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h Leon Romanovsky
2024-10-30 15:12 ` [PATCH v1 03/17] iommu: generalize the batched sync after map interface Leon Romanovsky
2024-10-30 15:12 ` [PATCH v1 04/17] dma-mapping: Add check if IOVA can be used Leon Romanovsky
2024-11-10 15:09   ` Zhu Yanjun
2024-11-10 15:19     ` Leon Romanovsky
2024-11-11  6:39     ` Christoph Hellwig
2024-11-11  7:19       ` Greg Sword
2024-10-30 15:12 ` [PATCH v1 05/17] dma: Provide an interface to allow allocate IOVA Leon Romanovsky
2024-10-30 15:12 ` [PATCH v1 06/17] iommu/dma: Factor out a iommu_dma_map_swiotlb helper Leon Romanovsky
2024-10-30 15:12 ` [PATCH v1 07/17] dma-mapping: Implement link/unlink ranges API Leon Romanovsky
2024-10-31 21:18   ` Robin Murphy
2024-11-04  9:10     ` Christoph Hellwig
2024-11-04 12:19       ` Jason Gunthorpe
2024-11-04 12:53         ` Christoph Hellwig
2024-11-07 14:50           ` Leon Romanovsky
2024-10-30 15:12 ` [PATCH v1 08/17] dma-mapping: add a dma_need_unmap helper Leon Romanovsky
2024-10-31 21:18   ` Robin Murphy
2024-11-01 11:06     ` Leon Romanovsky
2024-11-04  9:15     ` Christoph Hellwig
2024-10-30 15:12 ` [PATCH v1 09/17] docs: core-api: document the IOVA-based API Leon Romanovsky
2024-10-31  1:41   ` Randy Dunlap
2024-10-31  7:59     ` Leon Romanovsky
2024-11-08 19:34   ` Jonathan Corbet
2024-11-08 20:03     ` Leon Romanovsky
2024-11-08 20:13       ` Jonathan Corbet
2024-11-08 20:27         ` Leon Romanovsky
2024-11-10 10:41           ` Leon Romanovsky
2024-11-11  6:38             ` Christoph Hellwig
2024-11-11  6:43               ` anish kumar
2024-11-11 14:59                 ` Jonathan Corbet
2024-10-30 15:12 ` [PATCH v1 10/17] mm/hmm: let users to tag specific PFN with DMA mapped bit Leon Romanovsky
2024-10-30 15:12 ` [PATCH v1 11/17] mm/hmm: provide generic DMA managing logic Leon Romanovsky
2024-10-30 15:12 ` [PATCH v1 12/17] RDMA/umem: Store ODP access mask information in PFN Leon Romanovsky
2024-10-30 15:12 ` [PATCH v1 13/17] RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage Leon Romanovsky
2024-10-30 15:13 ` [PATCH v1 14/17] RDMA/umem: Separate implicit ODP initialization from explicit ODP Leon Romanovsky
2024-10-30 15:13 ` [PATCH v1 15/17] vfio/mlx5: Explicitly use number of pages instead of allocated length Leon Romanovsky
2024-10-30 15:13 ` [PATCH v1 16/17] vfio/mlx5: Rewrite create mkey flow to allow better code reuse Leon Romanovsky
2024-10-30 15:13 ` [PATCH v1 17/17] vfio/mlx5: Convert vfio to use DMA link API Leon Romanovsky
2024-10-31  1:44 ` [PATCH v1 00/17] Provide a new two step DMA mapping API Jens Axboe
2024-10-31  8:34   ` Christoph Hellwig
2024-10-31  9:05     ` Leon Romanovsky
2024-10-31  9:21       ` Christoph Hellwig
2024-10-31  9:37         ` Leon Romanovsky
2024-10-31 17:43           ` Jens Axboe
2024-10-31 20:43             ` Leon Romanovsky
2024-10-31 17:42     ` Jens Axboe
2024-10-31 21:17 ` Robin Murphy
2024-11-04  9:58   ` Christoph Hellwig
2024-11-04 11:39     ` Leon Romanovsky
2024-11-05 19:53     ` Jason Gunthorpe
2024-11-07  8:32       ` Christoph Hellwig
2024-11-07 13:28         ` Jason Gunthorpe
2024-11-07 13:50           ` Christoph Hellwig
2024-11-08 15:02             ` Jason Gunthorpe
2024-11-08 15:05               ` Christoph Hellwig
2024-11-08 15:25                 ` Jason Gunthorpe
2024-11-08 15:29                   ` Christoph Hellwig
2024-11-08 15:38                     ` Jason Gunthorpe
2024-11-12  6:01                       ` Christoph Hellwig
2024-11-13 18:41                         ` Jason Gunthorpe
2024-11-05 18:51 ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).