* [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
@ 2024-07-02 9:09 Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 01/18] dma-mapping: query DMA memory type Leon Romanovsky
` (19 more replies)
0 siblings, 20 replies; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-02 9:09 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni
Cc: Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, linux-block, linux-kernel,
linux-rdma, iommu, linux-nvme, linux-pci, kvm, linux-mm
Changelog:
v1:
* Rewrote cover letter
* Changed to API as proposed
https://lore.kernel.org/linux-rdma/20240322184330.GL66976@ziepe.ca/
* Removed IB DMA wrappers and use DMA API directly
v0: https://lore.kernel.org/all/cover.1709635535.git.leon@kernel.org
-------------------------------------------------------------------------
Currently the only efficient way to map a complex memory description through
the DMA API is by using the scatterlist APIs. The SG APIs are unique in that
they efficiently combine the two fundamental operations of sizing and allocating
a large IOVA window from the IOMMU and processing all the per-address
swiotlb/flushing/p2p/map details.
This uniqueness has been a long standing pain point as the scatterlist API
is mandatory, but expensive to use. It prevents any kind of optimization or
feature improvement (such as avoiding struct page for P2P) due to the impossibility
of improving the scatterlist.
Several approaches have been explored to expand the DMA API with additional
scatterlist-like structures (BIO [1], rlist [2]); this series instead splits up
the DMA API to allow callers to bring their own data structure.
The API is split up into parts:
- dma_alloc_iova() / dma_free_iova()
To do any pre-allocation required. This is done based on the caller
supplying some details about how much IOMMU address space it would need
in the worst case.
- dma_link_range() / dma_unlink_range()
Perform the actual mapping into the pre-allocated IOVA. This is very
similar to dma_map_page().
A driver will determine its mapping size using its own data structure, such
as a BIO, to request the required IOVA. Then it will iterate directly over
its data structure to DMA map each range. The result can then be stored
directly into the HW-specific DMA list; no intermediate scatterlist is
required.
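As a rough, untested sketch of the intended calling convention (the
map_pages_sketch() helper, its arguments, and the fallback handling below are
illustrative only, not part of this series; it assumes the declarations added
by the following patches), a caller would look roughly like:

    /* Illustrative sketch of the proposed two-step flow */
    static int map_pages_sketch(struct device *dev, struct page **pages,
                                unsigned int npages,
                                struct dma_iova_attrs *iova,
                                struct dma_iova_state *state,
                                struct dma_memory_type *type)
    {
            unsigned int i;
            int ret;

            iova->dev = dev;
            iova->size = (size_t)npages << PAGE_SHIFT;      /* worst case */
            iova->dir = DMA_BIDIRECTIONAL;
            ret = dma_alloc_iova(iova);             /* step 1: reserve IOVA */
            if (ret)
                    return ret;

            dma_get_memory_type(pages[0], type);    /* one type per range */
            state->iova = iova;
            state->type = type;
            if (!dma_can_use_iova(state, PAGE_SIZE)) {
                    dma_free_iova(iova);
                    return -EOPNOTSUPP;             /* fall back to dma_map_page() */
            }

            ret = dma_start_range(state);
            if (ret)
                    goto err_free;
            for (i = 0; i < npages; i++) {          /* step 2: link each page */
                    ret = dma_link_range(state, page_to_phys(pages[i]),
                                         PAGE_SIZE);
                    if (ret)
                            break;
            }
            dma_end_range(state);
            if (ret)
                    goto err_free;
            /* HW DMA addresses are iova->addr + (i << PAGE_SHIFT) */
            return 0;

    err_free:
            /* a real caller would also dma_unlink_range() what was linked */
            dma_free_iova(iova);
            return ret;
    }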
In this series, examples of three users are converted to the new API to show
the benefits. Each user has a unique flow:
1. RDMA ODP is an example of "SVA mirroring" using HMM that needs to
dynamically map/unmap large numbers of single pages. This becomes
significantly faster in the IOMMU case as the map/unmap is now just
a page table walk; the IOVA allocation is pre-computed once. Significant
amounts of memory are saved as there is no longer a need to store the
dma_addr_t of each page.
2. VFIO PCI live migration code is building a very large "page list"
for the device. Instead of allocating a scatter list entry per allocated
page it can just allocate an array of 'struct page *', saving a large
amount of memory.
3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
list without having to allocate then populate an intermediate SG table.
This step is the first along a path to provide alternatives to scatterlist and
solve some of the abuses and design mistakes, for instance in DMABUF's P2P
support.
The ODP and VFIO conversions are complete and fully tested; they can be the
initial users of the new API to merge it. The NVMe conversion requires more work.
[1] https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@klimt.1015granger.net/
[2] https://lore.kernel.org/all/ZD2lMvprVxu23BXZ@ziepe.ca/
Chaitanya Kulkarni (2):
block: export helper to get segment max size
nvme-pci: use new dma API
Leon Romanovsky (16):
dma-mapping: query DMA memory type
dma-mapping: provide an interface to allocate IOVA
dma-mapping: check if IOVA can be used
dma-mapping: implement link range API
mm/hmm: let users tag specific PFN with DMA mapped bit
dma-mapping: provide callbacks to link/unlink HMM PFNs to specific
IOVA
iommu/dma: Provide an interface to allow preallocation of IOVA
iommu/dma: Implement link/unlink ranges callbacks
RDMA/umem: Preallocate and cache IOVA for UMEM ODP
RDMA/umem: Store ODP access mask information in PFN
RDMA/core: Separate DMA mapping to caching IOVA and page linkage
RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
vfio/mlx5: Explicitly use number of pages instead of allocated length
vfio/mlx5: Rewrite create mkey flow to allow better code reuse
vfio/mlx5: Explicitly store page list
vfio/mlx5: Convert vfio to use DMA link API
block/blk-merge.c | 3 +-
drivers/infiniband/core/umem_odp.c | 218 ++++++-----------
drivers/infiniband/hw/mlx5/mlx5_ib.h | 1 +
drivers/infiniband/hw/mlx5/odp.c | 44 ++--
drivers/iommu/dma-iommu.c | 142 +++++++++--
drivers/nvme/host/pci.c | 283 ++++++++++++++++------
drivers/pci/p2pdma.c | 4 +-
drivers/vfio/pci/mlx5/cmd.c | 344 +++++++++++++++------------
drivers/vfio/pci/mlx5/cmd.h | 25 +-
drivers/vfio/pci/mlx5/main.c | 86 +++----
include/linux/blk-mq.h | 3 +
include/linux/dma-map-ops.h | 30 +++
include/linux/dma-mapping.h | 87 +++++++
include/linux/hmm.h | 4 +
include/rdma/ib_umem_odp.h | 23 +-
kernel/dma/mapping.c | 290 ++++++++++++++++++++++
mm/hmm.c | 34 ++-
17 files changed, 1122 insertions(+), 499 deletions(-)
--
2.45.2
* [RFC PATCH v1 01/18] dma-mapping: query DMA memory type
2024-07-02 9:09 [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Leon Romanovsky
@ 2024-07-02 9:09 ` Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 02/18] dma-mapping: provide an interface to allocate IOVA Leon Romanovsky
` (18 subsequent siblings)
19 siblings, 0 replies; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-02 9:09 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni
Cc: Leon Romanovsky, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
From: Leon Romanovsky <leonro@nvidia.com>
Provide an option to query the DMA memory type and store it, so that callers
who supply a range of pages can perform the query only once, as the whole
range is supposed to have the same memory type.
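To illustrate (a hedged sketch; the pages[] array and p2p_pgmap variable here
are hypothetical, not part of this patch):

    struct dma_memory_type type;

    /* Query once; every page in the range is expected to share this type */
    dma_get_memory_type(pages[0], &type);
    if (type.type == DMA_MEMORY_TYPE_P2P)
            p2p_pgmap = type.p2p_pgmap;     /* provider pgmap for the whole range */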
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
include/linux/dma-mapping.h | 20 ++++++++++++++++++++
kernel/dma/mapping.c | 30 ++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+)
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index f693aafe221f..49b99c6e7ec5 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -76,6 +76,20 @@
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
+enum dma_memory_types {
+ /* Normal memory without any extra properties like P2P, etc. */
+ DMA_MEMORY_TYPE_NORMAL,
+ /* Memory which is p2p capable */
+ DMA_MEMORY_TYPE_P2P,
+ /* Encrypted memory (TDX) */
+ DMA_MEMORY_TYPE_ENCRYPTED,
+};
+
+struct dma_memory_type {
+ enum dma_memory_types type;
+ struct dev_pagemap *p2p_pgmap;
+};
+
#ifdef CONFIG_DMA_API_DEBUG
void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr);
void debug_dma_map_single(struct device *dev, const void *addr,
@@ -149,6 +163,8 @@ void *dma_vmap_noncontiguous(struct device *dev, size_t size,
void dma_vunmap_noncontiguous(struct device *dev, void *vaddr);
int dma_mmap_noncontiguous(struct device *dev, struct vm_area_struct *vma,
size_t size, struct sg_table *sgt);
+
+void dma_get_memory_type(struct page *page, struct dma_memory_type *type);
#else /* CONFIG_HAS_DMA */
static inline dma_addr_t dma_map_page_attrs(struct device *dev,
struct page *page, size_t offset, size_t size,
@@ -279,6 +295,10 @@ static inline int dma_mmap_noncontiguous(struct device *dev,
{
return -EINVAL;
}
+static inline void dma_get_memory_type(struct page *page,
+ struct dma_memory_type *type)
+{
+}
#endif /* CONFIG_HAS_DMA */
#if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 81de84318ccc..877e43b39c06 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -6,6 +6,7 @@
* Copyright (c) 2006 Tejun Heo <teheo@suse.de>
*/
#include <linux/memblock.h> /* for max_pfn */
+#include <linux/memremap.h>
#include <linux/acpi.h>
#include <linux/dma-map-ops.h>
#include <linux/export.h>
@@ -14,6 +15,7 @@
#include <linux/of_device.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
+#include <linux/cc_platform.h>
#include "debug.h"
#include "direct.h"
@@ -894,3 +896,31 @@ unsigned long dma_get_merge_boundary(struct device *dev)
return ops->get_merge_boundary(dev);
}
EXPORT_SYMBOL_GPL(dma_get_merge_boundary);
+
+/**
+ * dma_get_memory_type - get the DMA memory type of the page supplied
+ * @page: page to check
+ * @type: memory type of that page
+ *
+ * Return the DMA memory type for the struct page. Pages with the same
+ * memory type can be combined into the same IOVA mapping. Users of the
+ * dma_iova family of functions must separate the memory they want to map
+ * into same-memory type ranges.
+ */
+void dma_get_memory_type(struct page *page, struct dma_memory_type *type)
+{
+ /* TODO: Rewrite this check to rely on specific struct page flags */
+ if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
+ type->type = DMA_MEMORY_TYPE_ENCRYPTED;
+ return;
+ }
+
+ if (is_pci_p2pdma_page(page)) {
+ type->type = DMA_MEMORY_TYPE_P2P;
+ type->p2p_pgmap = page->pgmap;
+ return;
+ }
+
+ type->type = DMA_MEMORY_TYPE_NORMAL;
+}
+EXPORT_SYMBOL_GPL(dma_get_memory_type);
--
2.45.2
* [RFC PATCH v1 02/18] dma-mapping: provide an interface to allocate IOVA
2024-07-02 9:09 [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 01/18] dma-mapping: query DMA memory type Leon Romanovsky
@ 2024-07-02 9:09 ` Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 03/18] dma-mapping: check if IOVA can be used Leon Romanovsky
` (17 subsequent siblings)
19 siblings, 0 replies; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-02 9:09 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni
Cc: Leon Romanovsky, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
From: Leon Romanovsky <leonro@nvidia.com>
The existing .map_page() callback provides two things at the same time:
it allocates IOVA and links DMA pages. That combination works great for
most of the callers who use it in control paths, but is less effective
in fast paths.
These advanced callers already manage their data in some sort of
database and can perform IOVA allocation in advance, leaving the range
linkage operation in the fast path.
Provide an interface to allocate/deallocate IOVA now; the next patch adds
link/unlink of DMA ranges to that specific IOVA.
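A minimal sketch of the intended use, assuming the caller knows its worst-case
size up front (the surrounding variables are illustrative only):

    struct dma_iova_attrs iova = {
            .dev  = dev,
            .size = (size_t)npages << PAGE_SHIFT,   /* worst case */
            .dir  = DMA_BIDIRECTIONAL,
    };
    int ret;

    ret = dma_alloc_iova(&iova);        /* control path, done once */
    if (ret)
            return ret;
    /* ... fast path links/unlinks ranges into iova.addr (next patches) ... */
    dma_free_iova(&iova);               /* control path, on teardown */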
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
include/linux/dma-map-ops.h | 3 +++
include/linux/dma-mapping.h | 20 +++++++++++++++++
kernel/dma/mapping.c | 44 +++++++++++++++++++++++++++++++++++++
3 files changed, 67 insertions(+)
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 02a1c825896b..23e5e2f63a1c 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -86,6 +86,9 @@ struct dma_map_ops {
size_t (*max_mapping_size)(struct device *dev);
size_t (*opt_mapping_size)(void);
unsigned long (*get_merge_boundary)(struct device *dev);
+
+ dma_addr_t (*alloc_iova)(struct device *dev, size_t size);
+ void (*free_iova)(struct device *dev, dma_addr_t dma_addr, size_t size);
};
#ifdef CONFIG_DMA_OPS
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 49b99c6e7ec5..673ddcf140ff 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -90,6 +90,16 @@ struct dma_memory_type {
struct dev_pagemap *p2p_pgmap;
};
+struct dma_iova_attrs {
+ /* OUT field */
+ dma_addr_t addr;
+ /* IN fields */
+ struct device *dev;
+ size_t size;
+ enum dma_data_direction dir;
+ unsigned long attrs;
+};
+
#ifdef CONFIG_DMA_API_DEBUG
void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr);
void debug_dma_map_single(struct device *dev, const void *addr,
@@ -115,6 +125,9 @@ static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
return 0;
}
+int dma_alloc_iova(struct dma_iova_attrs *iova);
+void dma_free_iova(struct dma_iova_attrs *iova);
+
dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
size_t offset, size_t size, enum dma_data_direction dir,
unsigned long attrs);
@@ -166,6 +179,13 @@ int dma_mmap_noncontiguous(struct device *dev, struct vm_area_struct *vma,
void dma_get_memory_type(struct page *page, struct dma_memory_type *type);
#else /* CONFIG_HAS_DMA */
+static inline int dma_alloc_iova(struct dma_iova_attrs *iova)
+{
+ return -EOPNOTSUPP;
+}
+static inline void dma_free_iova(struct dma_iova_attrs *iova)
+{
+}
static inline dma_addr_t dma_map_page_attrs(struct device *dev,
struct page *page, size_t offset, size_t size,
enum dma_data_direction dir, unsigned long attrs)
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 877e43b39c06..0c8f51010d08 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -924,3 +924,47 @@ void dma_get_memory_type(struct page *page, struct dma_memory_type *type)
type->type = DMA_MEMORY_TYPE_NORMAL;
}
EXPORT_SYMBOL_GPL(dma_get_memory_type);
+
+/**
+ * dma_alloc_iova - Allocate an IOVA space
+ * @iova: IOVA attributes
+ *
+ * Allocate an IOVA space for the given IOVA attributes. The IOVA space
+ * is allocated for the worst case, i.e. when the whole range is going to be used.
+ */
+int dma_alloc_iova(struct dma_iova_attrs *iova)
+{
+ struct device *dev = iova->dev;
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+
+ if (dma_map_direct(dev, ops) || !ops->alloc_iova) {
+ /* dma_map_direct(..) check is for HMM range fault callers */
+ iova->addr = 0;
+ return 0;
+ }
+
+ iova->addr = ops->alloc_iova(dev, iova->size);
+ if (dma_mapping_error(dev, iova->addr))
+ return -ENOMEM;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(dma_alloc_iova);
+
+/**
+ * dma_free_iova - Free an IOVA space
+ * @iova: IOVA attributes
+ *
+ * Free an IOVA space for the given IOVA attributes.
+ */
+void dma_free_iova(struct dma_iova_attrs *iova)
+{
+ struct device *dev = iova->dev;
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+
+ if (dma_map_direct(dev, ops) || !ops->free_iova || !iova->addr)
+ return;
+
+ ops->free_iova(dev, iova->addr, iova->size);
+}
+EXPORT_SYMBOL_GPL(dma_free_iova);
--
2.45.2
* [RFC PATCH v1 03/18] dma-mapping: check if IOVA can be used
2024-07-02 9:09 [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 01/18] dma-mapping: query DMA memory type Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 02/18] dma-mapping: provide an interface to allocate IOVA Leon Romanovsky
@ 2024-07-02 9:09 ` Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 04/18] dma-mapping: implement link range API Leon Romanovsky
` (16 subsequent siblings)
19 siblings, 0 replies; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-02 9:09 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni
Cc: Leon Romanovsky, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
From: Leon Romanovsky <leonro@nvidia.com>
Provide a way for callers to check whether IOVA can be used for a specific
DMA memory type.
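A hedged sketch of the intended check, assuming the IOVA and memory type were
set up as in the previous patches (the two helpers in the branches are
hypothetical placeholders for the caller's own paths):

    struct dma_iova_state state = {
            .iova = &iova,
            .type = &type,
    };

    if (dma_can_use_iova(&state, size))
            ret = link_into_iova(&state);           /* IOVA link path */
    else
            ret = map_with_dma_map_page(dev);       /* legacy per-page path */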
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/iommu/dma-iommu.c | 13 -------------
drivers/pci/p2pdma.c | 4 ++--
include/linux/dma-map-ops.h | 21 +++++++++++++++++++++
include/linux/dma-mapping.h | 10 ++++++++++
kernel/dma/mapping.c | 32 ++++++++++++++++++++++++++++++++
5 files changed, 65 insertions(+), 15 deletions(-)
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 43520e7275cc..89e34503e0bb 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -597,19 +597,6 @@ static int iova_reserve_iommu_regions(struct device *dev,
return ret;
}
-static bool dev_is_untrusted(struct device *dev)
-{
- return dev_is_pci(dev) && to_pci_dev(dev)->untrusted;
-}
-
-static bool dev_use_swiotlb(struct device *dev, size_t size,
- enum dma_data_direction dir)
-{
- return IS_ENABLED(CONFIG_SWIOTLB) &&
- (dev_is_untrusted(dev) ||
- dma_kmalloc_needs_bounce(dev, size, dir));
-}
-
static bool dev_use_sg_swiotlb(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir)
{
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4f47a13cb500..6ceea32bb041 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -964,8 +964,8 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
}
EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
-static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
- struct device *dev)
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+ struct device *dev)
{
enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED;
struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 23e5e2f63a1c..b52e9c8db241 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -9,6 +9,7 @@
#include <linux/dma-mapping.h>
#include <linux/pgtable.h>
#include <linux/slab.h>
+#include <linux/pci.h>
struct cma;
struct iommu_ops;
@@ -348,6 +349,19 @@ static inline bool dma_kmalloc_needs_bounce(struct device *dev, size_t size,
return !dma_kmalloc_safe(dev, dir) && !dma_kmalloc_size_aligned(size);
}
+static inline bool dev_is_untrusted(struct device *dev)
+{
+ return dev_is_pci(dev) && to_pci_dev(dev)->untrusted;
+}
+
+static inline bool dev_use_swiotlb(struct device *dev, size_t size,
+ enum dma_data_direction dir)
+{
+ return IS_ENABLED(CONFIG_SWIOTLB) &&
+ (dev_is_untrusted(dev) ||
+ dma_kmalloc_needs_bounce(dev, size, dir));
+}
+
void *arch_dma_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle,
gfp_t gfp, unsigned long attrs);
void arch_dma_free(struct device *dev, size_t size, void *cpu_addr,
@@ -514,6 +528,8 @@ struct pci_p2pdma_map_state {
enum pci_p2pdma_map_type
pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
struct scatterlist *sg);
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+ struct device *dev);
#else /* CONFIG_PCI_P2PDMA */
static inline enum pci_p2pdma_map_type
pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
@@ -521,6 +537,11 @@ pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
{
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
}
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_map_type(struct dev_pagemap *pgmap, struct device *dev)
+{
+ return PCI_P2PDMA_MAP_NOT_SUPPORTED;
+}
#endif /* CONFIG_PCI_P2PDMA */
#endif /* _LINUX_DMA_MAP_OPS_H */
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 673ddcf140ff..9d1e020869a6 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -100,6 +100,11 @@ struct dma_iova_attrs {
unsigned long attrs;
};
+struct dma_iova_state {
+ struct dma_iova_attrs *iova;
+ struct dma_memory_type *type;
+};
+
#ifdef CONFIG_DMA_API_DEBUG
void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr);
void debug_dma_map_single(struct device *dev, const void *addr,
@@ -178,6 +183,7 @@ int dma_mmap_noncontiguous(struct device *dev, struct vm_area_struct *vma,
size_t size, struct sg_table *sgt);
void dma_get_memory_type(struct page *page, struct dma_memory_type *type);
+bool dma_can_use_iova(struct dma_iova_state *state, size_t size);
#else /* CONFIG_HAS_DMA */
static inline int dma_alloc_iova(struct dma_iova_attrs *iova)
{
@@ -319,6 +325,10 @@ static inline void dma_get_memory_type(struct page *page,
struct dma_memory_type *type)
{
}
+static inline bool dma_can_use_iova(struct dma_iova_state *state, size_t size)
+{
+ return false;
+}
#endif /* CONFIG_HAS_DMA */
#if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 0c8f51010d08..9044ee525fdb 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -968,3 +968,35 @@ void dma_free_iova(struct dma_iova_attrs *iova)
ops->free_iova(dev, iova->addr, iova->size);
}
EXPORT_SYMBOL_GPL(dma_free_iova);
+
+/**
+ * dma_can_use_iova - check if the device type is valid
+ * and won't take SWIOTLB path
+ * @state: IOVA state
+ * @size: size of the buffer
+ *
+ * Return %true if the IOVA path can be used for the given buffer (i.e. the
+ * device won't take the swiotlb path), else %false.
+ */
+bool dma_can_use_iova(struct dma_iova_state *state, size_t size)
+{
+ struct device *dev = state->iova->dev;
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+ struct dma_memory_type *type = state->type;
+ enum pci_p2pdma_map_type map;
+
+ if (is_swiotlb_force_bounce(dev) ||
+ dev_use_swiotlb(dev, size, state->iova->dir))
+ return false;
+
+ if (dma_map_direct(dev, ops) || !ops->alloc_iova)
+ return false;
+
+ if (type->type == DMA_MEMORY_TYPE_P2P) {
+ map = pci_p2pdma_map_type(type->p2p_pgmap, dev);
+ return map == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE;
+ }
+
+ return type->type == DMA_MEMORY_TYPE_NORMAL;
+}
+EXPORT_SYMBOL_GPL(dma_can_use_iova);
--
2.45.2
* [RFC PATCH v1 04/18] dma-mapping: implement link range API
2024-07-02 9:09 [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Leon Romanovsky
` (2 preceding siblings ...)
2024-07-02 9:09 ` [RFC PATCH v1 03/18] dma-mapping: check if IOVA can be used Leon Romanovsky
@ 2024-07-02 9:09 ` Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 05/18] mm/hmm: let users tag specific PFN with DMA mapped bit Leon Romanovsky
` (15 subsequent siblings)
19 siblings, 0 replies; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-02 9:09 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni
Cc: Leon Romanovsky, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
From: Leon Romanovsky <leonro@nvidia.com>
Introduce new DMA APIs to perform DMA linkage of buffers in layers higher
than the DMA layer.
In the proposed API, callers will perform the following steps:
dma_alloc_iova()
if (dma_can_use_iova(...))
dma_start_range(...)
for (page in range)
dma_link_range(...)
dma_end_range(...)
else
/* Fallback to legacy map pages */
dma_map_page(...)
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
include/linux/dma-map-ops.h | 6 +++
include/linux/dma-mapping.h | 22 +++++++++++
kernel/dma/mapping.c | 78 ++++++++++++++++++++++++++++++++++++-
3 files changed, 105 insertions(+), 1 deletion(-)
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index b52e9c8db241..4868586b015e 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -90,6 +90,12 @@ struct dma_map_ops {
dma_addr_t (*alloc_iova)(struct device *dev, size_t size);
void (*free_iova)(struct device *dev, dma_addr_t dma_addr, size_t size);
+ int (*link_range)(struct dma_iova_state *state, phys_addr_t phys,
+ dma_addr_t addr, size_t size);
+ void (*unlink_range)(struct dma_iova_state *state,
+ dma_addr_t dma_handle, size_t size);
+ int (*start_range)(struct dma_iova_state *state);
+ void (*end_range)(struct dma_iova_state *state);
};
#ifdef CONFIG_DMA_OPS
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 9d1e020869a6..c530095ff232 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -11,6 +11,7 @@
#include <linux/scatterlist.h>
#include <linux/bug.h>
#include <linux/mem_encrypt.h>
+#include <linux/iommu.h>
/**
* List of possible attributes associated with a DMA mapping. The semantics
@@ -103,6 +104,8 @@ struct dma_iova_attrs {
struct dma_iova_state {
struct dma_iova_attrs *iova;
struct dma_memory_type *type;
+ struct iommu_domain *domain;
+ size_t range_size;
};
#ifdef CONFIG_DMA_API_DEBUG
@@ -184,6 +187,10 @@ int dma_mmap_noncontiguous(struct device *dev, struct vm_area_struct *vma,
void dma_get_memory_type(struct page *page, struct dma_memory_type *type);
bool dma_can_use_iova(struct dma_iova_state *state, size_t size);
+int dma_start_range(struct dma_iova_state *state);
+void dma_end_range(struct dma_iova_state *state);
+int dma_link_range(struct dma_iova_state *state, phys_addr_t phys, size_t size);
+void dma_unlink_range(struct dma_iova_state *state);
#else /* CONFIG_HAS_DMA */
static inline int dma_alloc_iova(struct dma_iova_attrs *iova)
{
@@ -329,6 +336,21 @@ static inline bool dma_can_use_iova(struct dma_iova_state *state, size_t size)
{
return false;
}
+static inline int dma_start_range(struct dma_iova_state *state)
+{
+ return -EOPNOTSUPP;
+}
+static inline void dma_end_range(struct dma_iova_state *state)
+{
+}
+static inline int dma_link_range(struct dma_iova_state *state, phys_addr_t phys,
+ size_t size)
+{
+ return -EOPNOTSUPP;
+}
+static inline void dma_unlink_range(struct dma_iova_state *state)
+{
+}
#endif /* CONFIG_HAS_DMA */
#if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 9044ee525fdb..089b4a977bab 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -989,7 +989,8 @@ bool dma_can_use_iova(struct dma_iova_state *state, size_t size)
dev_use_swiotlb(dev, size, state->iova->dir))
return false;
- if (dma_map_direct(dev, ops) || !ops->alloc_iova)
+ if (dma_map_direct(dev, ops) || !ops->alloc_iova || !ops->link_range ||
+ !ops->start_range)
return false;
if (type->type == DMA_MEMORY_TYPE_P2P) {
@@ -1000,3 +1001,78 @@ bool dma_can_use_iova(struct dma_iova_state *state, size_t size)
return type->type == DMA_MEMORY_TYPE_NORMAL;
}
EXPORT_SYMBOL_GPL(dma_can_use_iova);
+
+/**
+ * dma_start_range - Start a range of IOVA space
+ * @state: IOVA state
+ *
+ * Start a range of IOVA space for the given IOVA state.
+ */
+int dma_start_range(struct dma_iova_state *state)
+{
+ struct device *dev = state->iova->dev;
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+
+ if (!ops->start_range)
+ return 0;
+
+ return ops->start_range(state);
+}
+EXPORT_SYMBOL_GPL(dma_start_range);
+
+/**
+ * dma_end_range - End a range of IOVA space
+ * @state: IOVA state
+ *
+ * End a range of IOVA space for the given IOVA state.
+ */
+void dma_end_range(struct dma_iova_state *state)
+{
+ struct device *dev = state->iova->dev;
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+
+ if (!ops->end_range)
+ return;
+
+ ops->end_range(state);
+}
+EXPORT_SYMBOL_GPL(dma_end_range);
+
+/**
+ * dma_link_range - Link a range of IOVA space
+ * @state: IOVA state
+ * @phys: physical address to link
+ * @size: size of the buffer
+ *
+ * Link a range of IOVA space for the given IOVA state.
+ */
+int dma_link_range(struct dma_iova_state *state, phys_addr_t phys, size_t size)
+{
+ struct device *dev = state->iova->dev;
+ dma_addr_t addr = state->iova->addr + state->range_size;
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+ int ret;
+
+ ret = ops->link_range(state, phys, addr, size);
+ if (ret)
+ return ret;
+
+ state->range_size += size;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(dma_link_range);
+
+/**
+ * dma_unlink_range - Unlink a range of IOVA space
+ * @state: IOVA state
+ *
+ * Unlink a range of IOVA space for the given IOVA state.
+ */
+void dma_unlink_range(struct dma_iova_state *state)
+{
+ struct device *dev = state->iova->dev;
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+
+ ops->unlink_range(state, state->iova->addr, state->range_size);
+}
+EXPORT_SYMBOL_GPL(dma_unlink_range);
--
2.45.2
* [RFC PATCH v1 05/18] mm/hmm: let users tag specific PFN with DMA mapped bit
2024-07-02 9:09 [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Leon Romanovsky
` (3 preceding siblings ...)
2024-07-02 9:09 ` [RFC PATCH v1 04/18] dma-mapping: implement link range API Leon Romanovsky
@ 2024-07-02 9:09 ` Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 06/18] dma-mapping: provide callbacks to link/unlink HMM PFNs to specific IOVA Leon Romanovsky
` (14 subsequent siblings)
19 siblings, 0 replies; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-02 9:09 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni
Cc: Leon Romanovsky, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
From: Leon Romanovsky <leonro@nvidia.com>
Introduce a new sticky flag (HMM_PFN_DMA_MAPPED) which isn't overwritten by
an HMM range fault. Such a flag allows users to tag specific PFNs with the
information that this specific PFN was already DMA mapped.
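A short sketch of how a caller might rely on the sticky bit after
hmm_range_fault() (loop variables here are illustrative; the real consumer is
the ODP rework later in this series):

    for (i = 0; i < npfns; i++) {
            if (!(range.hmm_pfns[i] & HMM_PFN_VALID))
                    continue;
            if (range.hmm_pfns[i] & HMM_PFN_DMA_MAPPED)
                    continue;       /* survived the fault, already DMA mapped */
            /* ... DMA map the page and set HMM_PFN_DMA_MAPPED ... */
    }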
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
include/linux/hmm.h | 4 ++++
mm/hmm.c | 34 +++++++++++++++++++++-------------
2 files changed, 25 insertions(+), 13 deletions(-)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 126a36571667..2999697db83a 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -23,6 +23,8 @@ struct mmu_interval_notifier;
* HMM_PFN_WRITE - if the page memory can be written to (requires HMM_PFN_VALID)
* HMM_PFN_ERROR - accessing the pfn is impossible and the device should
* fail. ie poisoned memory, special pages, no vma, etc
+ * HMM_PFN_DMA_MAPPED - Flag preserved on input-to-output transformation
+ * to mark that page is already DMA mapped
*
* On input:
* 0 - Return the current state of the page, do not fault it.
@@ -36,6 +38,8 @@ enum hmm_pfn_flags {
HMM_PFN_VALID = 1UL << (BITS_PER_LONG - 1),
HMM_PFN_WRITE = 1UL << (BITS_PER_LONG - 2),
HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
+ /* Sticky flag, carried from Input to Output */
+ HMM_PFN_DMA_MAPPED = 1UL << (BITS_PER_LONG - 7),
HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 8),
/* Input flags */
diff --git a/mm/hmm.c b/mm/hmm.c
index 93aebd9cc130..03aeb9929d9e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -44,8 +44,10 @@ static int hmm_pfns_fill(unsigned long addr, unsigned long end,
{
unsigned long i = (addr - range->start) >> PAGE_SHIFT;
- for (; addr < end; addr += PAGE_SIZE, i++)
- range->hmm_pfns[i] = cpu_flags;
+ for (; addr < end; addr += PAGE_SIZE, i++) {
+ range->hmm_pfns[i] &= HMM_PFN_DMA_MAPPED;
+ range->hmm_pfns[i] |= cpu_flags;
+ }
return 0;
}
@@ -202,8 +204,10 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr,
return hmm_vma_fault(addr, end, required_fault, walk);
pfn = pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
- for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
- hmm_pfns[i] = pfn | cpu_flags;
+ for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+ hmm_pfns[i] &= HMM_PFN_DMA_MAPPED;
+ hmm_pfns[i] |= pfn | cpu_flags;
+ }
return 0;
}
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -236,7 +240,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
if (required_fault)
goto fault;
- *hmm_pfn = 0;
+ *hmm_pfn = *hmm_pfn & HMM_PFN_DMA_MAPPED;
return 0;
}
@@ -253,14 +257,14 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
cpu_flags = HMM_PFN_VALID;
if (is_writable_device_private_entry(entry))
cpu_flags |= HMM_PFN_WRITE;
- *hmm_pfn = swp_offset_pfn(entry) | cpu_flags;
+ *hmm_pfn = (*hmm_pfn & HMM_PFN_DMA_MAPPED) | swp_offset_pfn(entry) | cpu_flags;
return 0;
}
required_fault =
hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
if (!required_fault) {
- *hmm_pfn = 0;
+ *hmm_pfn = *hmm_pfn & HMM_PFN_DMA_MAPPED;
return 0;
}
@@ -304,11 +308,11 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
pte_unmap(ptep);
return -EFAULT;
}
- *hmm_pfn = HMM_PFN_ERROR;
+ *hmm_pfn = (*hmm_pfn & HMM_PFN_DMA_MAPPED) | HMM_PFN_ERROR;
return 0;
}
- *hmm_pfn = pte_pfn(pte) | cpu_flags;
+ *hmm_pfn = (*hmm_pfn & HMM_PFN_DMA_MAPPED) | pte_pfn(pte) | cpu_flags;
return 0;
fault:
@@ -448,8 +452,10 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
}
pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
- for (i = 0; i < npages; ++i, ++pfn)
- hmm_pfns[i] = pfn | cpu_flags;
+ for (i = 0; i < npages; ++i, ++pfn) {
+ hmm_pfns[i] &= HMM_PFN_DMA_MAPPED;
+ hmm_pfns[i] |= pfn | cpu_flags;
+ }
goto out_unlock;
}
@@ -507,8 +513,10 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
}
pfn = pte_pfn(entry) + ((start & ~hmask) >> PAGE_SHIFT);
- for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
- range->hmm_pfns[i] = pfn | cpu_flags;
+ for (; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+ range->hmm_pfns[i] &= HMM_PFN_DMA_MAPPED;
+ range->hmm_pfns[i] |= pfn | cpu_flags;
+ }
spin_unlock(ptl);
return 0;
--
2.45.2
* [RFC PATCH v1 06/18] dma-mapping: provide callbacks to link/unlink HMM PFNs to specific IOVA
2024-07-02 9:09 [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Leon Romanovsky
` (4 preceding siblings ...)
2024-07-02 9:09 ` [RFC PATCH v1 05/18] mm/hmm: let users tag specific PFN with DMA mapped bit Leon Romanovsky
@ 2024-07-02 9:09 ` Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 07/18] iommu/dma: Provide an interface to allow preallocation of IOVA Leon Romanovsky
` (13 subsequent siblings)
19 siblings, 0 replies; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-02 9:09 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni
Cc: Leon Romanovsky, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
From: Leon Romanovsky <leonro@nvidia.com>
Introduce a new DMA link/unlink API to provide a way for HMM users to link
pages into an already preallocated IOVA.
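A hedged sketch of the expected call pattern (the pfn array, the IOVA variable
and the page-sized offsets mirror what the ODP patches later in this series
do, but the names here are illustrative):

    dma_addr_t dma;

    /* fault path: link one HMM PFN at a page-aligned offset in the IOVA */
    dma = dma_hmm_link_page(&range.hmm_pfns[i], &umem_iova,
                            (dma_addr_t)i << PAGE_SHIFT);
    if (dma == DMA_MAPPING_ERROR)
            return -EFAULT;

    /* invalidation path: unlink the same PFN at the same offset */
    dma_hmm_unlink_page(&range.hmm_pfns[i], &umem_iova,
                        (dma_addr_t)i << PAGE_SHIFT);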
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
include/linux/dma-mapping.h | 15 +++++
kernel/dma/mapping.c | 108 ++++++++++++++++++++++++++++++++++++
2 files changed, 123 insertions(+)
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index c530095ff232..2578b6615a2f 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -135,6 +135,10 @@ static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
int dma_alloc_iova(struct dma_iova_attrs *iova);
void dma_free_iova(struct dma_iova_attrs *iova);
+dma_addr_t dma_hmm_link_page(unsigned long *pfn, struct dma_iova_attrs *iova,
+ dma_addr_t dma_offset);
+void dma_hmm_unlink_page(unsigned long *pfn, struct dma_iova_attrs *iova,
+ dma_addr_t dma_offset);
dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
size_t offset, size_t size, enum dma_data_direction dir,
@@ -199,6 +203,17 @@ static inline int dma_alloc_iova(struct dma_iova_attrs *iova)
static inline void dma_free_iova(struct dma_iova_attrs *iova)
{
}
+static inline dma_addr_t dma_hmm_link_page(unsigned long *pfn,
+ struct dma_iova_attrs *iova,
+ dma_addr_t dma_offset)
+{
+ return DMA_MAPPING_ERROR;
+}
+static inline void dma_hmm_unlink_page(unsigned long *pfn,
+ struct dma_iova_attrs *iova,
+ dma_addr_t dma_offset)
+{
+}
static inline dma_addr_t dma_map_page_attrs(struct device *dev,
struct page *page, size_t offset, size_t size,
enum dma_data_direction dir, unsigned long attrs)
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 089b4a977bab..69c431bd89e6 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -16,6 +16,7 @@
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/cc_platform.h>
+#include <linux/hmm.h>
#include "debug.h"
#include "direct.h"
@@ -1076,3 +1077,110 @@ void dma_unlink_range(struct dma_iova_state *state)
ops->unlink_range(state, state->iova->addr, state->range_size);
}
EXPORT_SYMBOL_GPL(dma_unlink_range);
+
+/**
+ * dma_hmm_link_page - Link a physical HMM page to DMA address
+ * @pfn: HMM PFN
+ * @iova: Preallocated IOVA space
+ * @dma_offset: DMA offset from which this page needs to be linked
+ *
+ * dma_alloc_iova() allocates IOVA based on the size specified in iova->size.
+ * Call this function after the IOVA allocation to link the whole @page and
+ * get its DMA address. Note that the very first call to this function will
+ * have @dma_offset set to 0 in the IOVA space allocated by dma_alloc_iova().
+ * For subsequent calls to this function on the same @iova, the caller needs
+ * to advance @dma_offset by the size of the previously linked page.
+ */
+dma_addr_t dma_hmm_link_page(unsigned long *pfn, struct dma_iova_attrs *iova,
+ dma_addr_t dma_offset)
+{
+ struct device *dev = iova->dev;
+ struct page *page = hmm_pfn_to_page(*pfn);
+ phys_addr_t phys = page_to_phys(page);
+ bool coherent = dev_is_dma_coherent(dev);
+ struct dma_memory_type type = {};
+ struct dma_iova_state state = {};
+ dma_addr_t addr;
+ int ret;
+
+ if (*pfn & HMM_PFN_DMA_MAPPED)
+ /*
+ * We are in this flow when there is a need to resync flags,
+ * for example when page was already linked in prefetch call
+ * with READ flag and now we need to add WRITE flag
+ *
+ * This page was already programmed to HW and we don't want/need
+ * to unlink and link it again just to resync flags.
+ *
+ * The DMA address calculation below is based on the fact that
+ * HMM doesn't work with swiotlb.
+ */
+ return (iova->addr) ? iova->addr + dma_offset :
+ phys_to_dma(dev, phys);
+
+ dma_get_memory_type(page, &type);
+
+ state.iova = iova;
+ state.type = &type;
+ state.range_size = dma_offset;
+
+ if (!dma_can_use_iova(&state, PAGE_SIZE)) {
+ if (!coherent && !(iova->attrs & DMA_ATTR_SKIP_CPU_SYNC))
+ arch_sync_dma_for_device(phys, PAGE_SIZE, iova->dir);
+
+ addr = phys_to_dma(dev, phys);
+ goto done;
+ }
+
+ ret = dma_start_range(&state);
+ if (ret)
+ return DMA_MAPPING_ERROR;
+
+ ret = dma_link_range(&state, page_to_phys(page), PAGE_SIZE);
+ dma_end_range(&state);
+ if (ret)
+ return DMA_MAPPING_ERROR;
+
+ addr = iova->addr + dma_offset;
+done:
+ kmsan_handle_dma(page, 0, PAGE_SIZE, iova->dir);
+ *pfn |= HMM_PFN_DMA_MAPPED;
+ return addr;
+}
+EXPORT_SYMBOL_GPL(dma_hmm_link_page);
+
+/**
+ * dma_hmm_unlink_page - Unlink a physical HMM page from DMA address
+ * @pfn: HMM PFN
+ * @iova: Preallocated IOVA space
+ * @dma_offset: DMA offset from which this page needs to be unlinked
+ * from the IOVA space
+ */
+void dma_hmm_unlink_page(unsigned long *pfn, struct dma_iova_attrs *iova,
+ dma_addr_t dma_offset)
+{
+ struct device *dev = iova->dev;
+ struct page *page = hmm_pfn_to_page(*pfn);
+ struct dma_memory_type type = {};
+ struct dma_iova_state state = {};
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+
+ dma_get_memory_type(page, &type);
+
+ state.iova = iova;
+ state.type = &type;
+
+ *pfn &= ~HMM_PFN_DMA_MAPPED;
+
+ if (!dma_can_use_iova(&state, PAGE_SIZE)) {
+ if (!(iova->attrs & DMA_ATTR_SKIP_CPU_SYNC))
+ dma_direct_sync_single_for_cpu(dev, dma_offset,
+ PAGE_SIZE, iova->dir);
+ return;
+ }
+
+ ops->unlink_range(&state, state.iova->addr + dma_offset, PAGE_SIZE);
+}
+EXPORT_SYMBOL_GPL(dma_hmm_unlink_page);
--
2.45.2
* [RFC PATCH v1 07/18] iommu/dma: Provide an interface to allow preallocation of IOVA
2024-07-02 9:09 [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Leon Romanovsky
` (5 preceding siblings ...)
2024-07-02 9:09 ` [RFC PATCH v1 06/18] dma-mapping: provide callbacks to link/unlink HMM PFNs to specific IOVA Leon Romanovsky
@ 2024-07-02 9:09 ` Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 08/18] iommu/dma: Implement link/unlink ranges callbacks Leon Romanovsky
` (12 subsequent siblings)
19 siblings, 0 replies; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-02 9:09 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni
Cc: Leon Romanovsky, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
From: Leon Romanovsky <leonro@nvidia.com>
Separate IOVA allocation into a dedicated callback, which will allow caching
the IOVA and reusing it in fast paths for devices which support the ODP
(on-demand paging) mechanism.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/iommu/dma-iommu.c | 50 +++++++++++++++++++++++++++++----------
1 file changed, 38 insertions(+), 12 deletions(-)
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 89e34503e0bb..0b5ca6961940 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -357,7 +357,7 @@ int iommu_dma_init_fq(struct iommu_domain *domain)
atomic_set(&cookie->fq_timer_on, 0);
/*
* Prevent incomplete fq state being observable. Pairs with path from
- * __iommu_dma_unmap() through iommu_dma_free_iova() to queue_iova()
+ * __iommu_dma_unmap() through __iommu_dma_free_iova() to queue_iova()
*/
smp_wmb();
WRITE_ONCE(cookie->fq_domain, domain);
@@ -745,7 +745,7 @@ static int dma_info_to_prot(enum dma_data_direction dir, bool coherent,
}
}
-static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
+static dma_addr_t __iommu_dma_alloc_iova(struct iommu_domain *domain,
size_t size, u64 dma_limit, struct device *dev)
{
struct iommu_dma_cookie *cookie = domain->iova_cookie;
@@ -791,7 +791,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
return (dma_addr_t)iova << shift;
}
-static void iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
+static void __iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
dma_addr_t iova, size_t size, struct iommu_iotlb_gather *gather)
{
struct iova_domain *iovad = &cookie->iovad;
@@ -828,7 +828,7 @@ static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr,
if (!iotlb_gather.queued)
iommu_iotlb_sync(domain, &iotlb_gather);
- iommu_dma_free_iova(cookie, dma_addr, size, &iotlb_gather);
+ __iommu_dma_free_iova(cookie, dma_addr, size, &iotlb_gather);
}
static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
@@ -851,12 +851,12 @@ static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
size = iova_align(iovad, size + iova_off);
- iova = iommu_dma_alloc_iova(domain, size, dma_mask, dev);
+ iova = __iommu_dma_alloc_iova(domain, size, dma_mask, dev);
if (!iova)
return DMA_MAPPING_ERROR;
if (iommu_map(domain, iova, phys - iova_off, size, prot, GFP_ATOMIC)) {
- iommu_dma_free_iova(cookie, iova, size, NULL);
+ __iommu_dma_free_iova(cookie, iova, size, NULL);
return DMA_MAPPING_ERROR;
}
return iova + iova_off;
@@ -960,7 +960,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
return NULL;
size = iova_align(iovad, size);
- iova = iommu_dma_alloc_iova(domain, size, dev->coherent_dma_mask, dev);
+ iova = __iommu_dma_alloc_iova(domain, size, dev->coherent_dma_mask, dev);
if (!iova)
goto out_free_pages;
@@ -994,7 +994,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
out_free_sg:
sg_free_table(sgt);
out_free_iova:
- iommu_dma_free_iova(cookie, iova, size, NULL);
+ __iommu_dma_free_iova(cookie, iova, size, NULL);
out_free_pages:
__iommu_dma_free_pages(pages, count);
return NULL;
@@ -1429,7 +1429,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
if (!iova_len)
return __finalise_sg(dev, sg, nents, 0);
- iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
+ iova = __iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
if (!iova) {
ret = -ENOMEM;
goto out_restore_sg;
@@ -1446,7 +1446,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
return __finalise_sg(dev, sg, nents, iova);
out_free_iova:
- iommu_dma_free_iova(cookie, iova, iova_len, NULL);
+ __iommu_dma_free_iova(cookie, iova, iova_len, NULL);
out_restore_sg:
__invalidate_sg(sg, nents);
out:
@@ -1707,6 +1707,30 @@ static size_t iommu_dma_max_mapping_size(struct device *dev)
return SIZE_MAX;
}
+static dma_addr_t iommu_dma_alloc_iova(struct device *dev, size_t size)
+{
+ struct iommu_domain *domain = iommu_get_dma_domain(dev);
+ struct iommu_dma_cookie *cookie = domain->iova_cookie;
+ struct iova_domain *iovad = &cookie->iovad;
+ dma_addr_t dma_mask = dma_get_mask(dev);
+
+ size = iova_align(iovad, size);
+ return __iommu_dma_alloc_iova(domain, size, dma_mask, dev);
+}
+
+static void iommu_dma_free_iova(struct device *dev, dma_addr_t iova,
+ size_t size)
+{
+ struct iommu_domain *domain = iommu_get_dma_domain(dev);
+ struct iommu_dma_cookie *cookie = domain->iova_cookie;
+ struct iova_domain *iovad = &cookie->iovad;
+ struct iommu_iotlb_gather iotlb_gather;
+
+ size = iova_align(iovad, size);
+ iommu_iotlb_gather_init(&iotlb_gather);
+ __iommu_dma_free_iova(cookie, iova, size, &iotlb_gather);
+}
+
static const struct dma_map_ops iommu_dma_ops = {
.flags = DMA_F_PCI_P2PDMA_SUPPORTED |
DMA_F_CAN_SKIP_SYNC,
@@ -1731,6 +1755,8 @@ static const struct dma_map_ops iommu_dma_ops = {
.get_merge_boundary = iommu_dma_get_merge_boundary,
.opt_mapping_size = iommu_dma_opt_mapping_size,
.max_mapping_size = iommu_dma_max_mapping_size,
+ .alloc_iova = iommu_dma_alloc_iova,
+ .free_iova = iommu_dma_free_iova,
};
void iommu_setup_dma_ops(struct device *dev)
@@ -1773,7 +1799,7 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
if (!msi_page)
return NULL;
- iova = iommu_dma_alloc_iova(domain, size, dma_get_mask(dev), dev);
+ iova = __iommu_dma_alloc_iova(domain, size, dma_get_mask(dev), dev);
if (!iova)
goto out_free_page;
@@ -1787,7 +1813,7 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
return msi_page;
out_free_iova:
- iommu_dma_free_iova(cookie, iova, size, NULL);
+ __iommu_dma_free_iova(cookie, iova, size, NULL);
out_free_page:
kfree(msi_page);
return NULL;
--
2.45.2
* [RFC PATCH v1 08/18] iommu/dma: Implement link/unlink ranges callbacks
2024-07-02 9:09 [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Leon Romanovsky
` (6 preceding siblings ...)
2024-07-02 9:09 ` [RFC PATCH v1 07/18] iommu/dma: Provide an interface to allow preallocation of IOVA Leon Romanovsky
@ 2024-07-02 9:09 ` Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 09/18] RDMA/umem: Preallocate and cache IOVA for UMEM ODP Leon Romanovsky
` (11 subsequent siblings)
19 siblings, 0 replies; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-02 9:09 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni
Cc: Leon Romanovsky, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
From: Leon Romanovsky <leonro@nvidia.com>
Add an implementation of the link/unlink interface to map/unmap pages in the
fast path for pre-allocated IOVA.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/iommu/dma-iommu.c | 79 +++++++++++++++++++++++++++++++++++++++
1 file changed, 79 insertions(+)
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 0b5ca6961940..7425d155a14e 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1731,6 +1731,82 @@ static void iommu_dma_free_iova(struct device *dev, dma_addr_t iova,
__iommu_dma_free_iova(cookie, iova, size, &iotlb_gather);
}
+static int iommu_dma_start_range(struct dma_iova_state *state)
+{
+ struct device *dev = state->iova->dev;
+
+ state->domain = iommu_get_dma_domain(dev);
+
+ if (static_branch_unlikely(&iommu_deferred_attach_enabled))
+ return iommu_deferred_attach(dev, state->domain);
+
+ return 0;
+}
+
+static int iommu_dma_link_range(struct dma_iova_state *state, phys_addr_t phys,
+ dma_addr_t addr, size_t size)
+{
+ struct iommu_domain *domain = state->domain;
+ struct iommu_dma_cookie *cookie = domain->iova_cookie;
+ struct iova_domain *iovad = &cookie->iovad;
+ struct device *dev = state->iova->dev;
+ enum dma_data_direction dir = state->iova->dir;
+ bool coherent = dev_is_dma_coherent(dev);
+ unsigned long attrs = state->iova->attrs;
+ int prot = dma_info_to_prot(dir, coherent, attrs);
+
+ if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+ arch_sync_dma_for_device(phys, size, dir);
+
+ size = iova_align(iovad, size);
+ return iommu_map(domain, addr, phys, size, prot, GFP_ATOMIC);
+}
+
+static void iommu_sync_dma_for_cpu(struct iommu_domain *domain,
+ dma_addr_t start, size_t size,
+ enum dma_data_direction dir)
+{
+ size_t sync_size, unmapped = 0;
+ phys_addr_t phys;
+
+ do {
+ phys = iommu_iova_to_phys(domain, start + unmapped);
+ if (WARN_ON(!phys))
+ continue;
+
+ sync_size = (unmapped + PAGE_SIZE > size) ? size % PAGE_SIZE :
+ PAGE_SIZE;
+ arch_sync_dma_for_cpu(phys, sync_size, dir);
+ unmapped += sync_size;
+ } while (unmapped < size);
+}
+
+static void iommu_dma_unlink_range(struct dma_iova_state *state,
+ dma_addr_t start, size_t size)
+{
+ struct device *dev = state->iova->dev;
+ struct iommu_domain *domain = iommu_get_dma_domain(dev);
+ struct iommu_dma_cookie *cookie = domain->iova_cookie;
+ struct iova_domain *iovad = &cookie->iovad;
+ struct iommu_iotlb_gather iotlb_gather;
+ bool coherent = dev_is_dma_coherent(dev);
+ unsigned long attrs = state->iova->attrs;
+ size_t unmapped;
+
+ iommu_iotlb_gather_init(&iotlb_gather);
+ iotlb_gather.queued = READ_ONCE(cookie->fq_domain);
+
+ if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) && !coherent)
+ iommu_sync_dma_for_cpu(domain, start, size, state->iova->dir);
+
+ size = iova_align(iovad, size);
+ unmapped = iommu_unmap_fast(domain, start, size, &iotlb_gather);
+ WARN_ON(unmapped != size);
+
+ if (!iotlb_gather.queued)
+ iommu_iotlb_sync(domain, &iotlb_gather);
+}
+
static const struct dma_map_ops iommu_dma_ops = {
.flags = DMA_F_PCI_P2PDMA_SUPPORTED |
DMA_F_CAN_SKIP_SYNC,
@@ -1757,6 +1833,9 @@ static const struct dma_map_ops iommu_dma_ops = {
.max_mapping_size = iommu_dma_max_mapping_size,
.alloc_iova = iommu_dma_alloc_iova,
.free_iova = iommu_dma_free_iova,
+ .link_range = iommu_dma_link_range,
+ .unlink_range = iommu_dma_unlink_range,
+ .start_range = iommu_dma_start_range,
};
void iommu_setup_dma_ops(struct device *dev)
--
2.45.2
* [RFC PATCH v1 09/18] RDMA/umem: Preallocate and cache IOVA for UMEM ODP
2024-07-02 9:09 [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Leon Romanovsky
` (7 preceding siblings ...)
2024-07-02 9:09 ` [RFC PATCH v1 08/18] iommu/dma: Implement link/unlink ranges callbacks Leon Romanovsky
@ 2024-07-02 9:09 ` Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 10/18] RDMA/umem: Store ODP access mask information in PFN Leon Romanovsky
` (10 subsequent siblings)
19 siblings, 0 replies; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-02 9:09 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni
Cc: Leon Romanovsky, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
From: Leon Romanovsky <leonro@nvidia.com>
As a preparation for providing a two-step interface to map pages, preallocate
the IOVA when the UMEM is initialized.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/core/umem_odp.c | 14 +++++++++++++-
include/rdma/ib_umem_odp.h | 1 +
2 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index e9fa22d31c23..955bf338b1bf 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -50,6 +50,7 @@
static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
const struct mmu_interval_notifier_ops *ops)
{
+ struct ib_device *dev = umem_odp->umem.ibdev;
int ret;
umem_odp->umem.is_odp = 1;
@@ -87,15 +88,25 @@ static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
goto out_pfn_list;
}
+ umem_odp->iova.dev = dev->dma_device;
+ umem_odp->iova.size = end - start;
+ umem_odp->iova.dir = DMA_BIDIRECTIONAL;
+ ret = dma_alloc_iova(&umem_odp->iova);
+ if (ret)
+ goto out_dma_list;
+
+
ret = mmu_interval_notifier_insert(&umem_odp->notifier,
umem_odp->umem.owning_mm,
start, end - start, ops);
if (ret)
- goto out_dma_list;
+ goto out_free_iova;
}
return 0;
+out_free_iova:
+ dma_free_iova(&umem_odp->iova);
out_dma_list:
kvfree(umem_odp->dma_list);
out_pfn_list:
@@ -274,6 +285,7 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
ib_umem_end(umem_odp));
mutex_unlock(&umem_odp->umem_mutex);
mmu_interval_notifier_remove(&umem_odp->notifier);
+ dma_free_iova(&umem_odp->iova);
kvfree(umem_odp->dma_list);
kvfree(umem_odp->pfn_list);
}
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index 0844c1d05ac6..bb2d7f2a5b04 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -23,6 +23,7 @@ struct ib_umem_odp {
* See ODP_READ_ALLOWED_BIT and ODP_WRITE_ALLOWED_BIT.
*/
dma_addr_t *dma_list;
+ struct dma_iova_attrs iova;
/*
* The umem_mutex protects the page_list and dma_list fields of an ODP
* umem, allowing only a single thread to map/unmap pages. The mutex
--
2.45.2
* [RFC PATCH v1 10/18] RDMA/umem: Store ODP access mask information in PFN
2024-07-02 9:09 [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Leon Romanovsky
` (8 preceding siblings ...)
2024-07-02 9:09 ` [RFC PATCH v1 09/18] RDMA/umem: Preallocate and cache IOVA for UMEM ODP Leon Romanovsky
@ 2024-07-02 9:09 ` Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 11/18] RDMA/core: Separate DMA mapping to caching IOVA and page linkage Leon Romanovsky
` (9 subsequent siblings)
19 siblings, 0 replies; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-02 9:09 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni
Cc: Leon Romanovsky, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
From: Leon Romanovsky <leonro@nvidia.com>
As a preparation for the removal of dma_list, store the access mask in the
PFN pointer and not in dma_addr_t.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/core/umem_odp.c | 98 +++++++++++-----------------
drivers/infiniband/hw/mlx5/mlx5_ib.h | 1 +
drivers/infiniband/hw/mlx5/odp.c | 37 ++++++-----
include/rdma/ib_umem_odp.h | 14 +---
4 files changed, 59 insertions(+), 91 deletions(-)
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 955bf338b1bf..c628a98c41b7 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -308,22 +308,11 @@ EXPORT_SYMBOL(ib_umem_odp_release);
static int ib_umem_odp_map_dma_single_page(
struct ib_umem_odp *umem_odp,
unsigned int dma_index,
- struct page *page,
- u64 access_mask)
+ struct page *page)
{
struct ib_device *dev = umem_odp->umem.ibdev;
dma_addr_t *dma_addr = &umem_odp->dma_list[dma_index];
- if (*dma_addr) {
- /*
- * If the page is already dma mapped it means it went through
- * a non-invalidating trasition, like read-only to writable.
- * Resync the flags.
- */
- *dma_addr = (*dma_addr & ODP_DMA_ADDR_MASK) | access_mask;
- return 0;
- }
-
*dma_addr = ib_dma_map_page(dev, page, 0, 1 << umem_odp->page_shift,
DMA_BIDIRECTIONAL);
if (ib_dma_mapping_error(dev, *dma_addr)) {
@@ -331,7 +320,6 @@ static int ib_umem_odp_map_dma_single_page(
return -EFAULT;
}
umem_odp->npages++;
- *dma_addr |= access_mask;
return 0;
}
@@ -367,9 +355,6 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
struct hmm_range range = {};
unsigned long timeout;
- if (access_mask == 0)
- return -EINVAL;
-
if (user_virt < ib_umem_start(umem_odp) ||
user_virt + bcnt > ib_umem_end(umem_odp))
return -EFAULT;
@@ -395,7 +380,7 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
if (fault) {
range.default_flags = HMM_PFN_REQ_FAULT;
- if (access_mask & ODP_WRITE_ALLOWED_BIT)
+ if (access_mask & HMM_PFN_WRITE)
range.default_flags |= HMM_PFN_REQ_WRITE;
}
@@ -427,22 +412,17 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
for (pfn_index = 0; pfn_index < num_pfns;
pfn_index += 1 << (page_shift - PAGE_SHIFT), dma_index++) {
- if (fault) {
- /*
- * Since we asked for hmm_range_fault() to populate
- * pages it shouldn't return an error entry on success.
- */
- WARN_ON(range.hmm_pfns[pfn_index] & HMM_PFN_ERROR);
- WARN_ON(!(range.hmm_pfns[pfn_index] & HMM_PFN_VALID));
- } else {
- if (!(range.hmm_pfns[pfn_index] & HMM_PFN_VALID)) {
- WARN_ON(umem_odp->dma_list[dma_index]);
- continue;
- }
- access_mask = ODP_READ_ALLOWED_BIT;
- if (range.hmm_pfns[pfn_index] & HMM_PFN_WRITE)
- access_mask |= ODP_WRITE_ALLOWED_BIT;
- }
+ /*
+ * Since we asked for hmm_range_fault() to populate
+ * pages it shouldn't return an error entry on success.
+ */
+ WARN_ON(fault && range.hmm_pfns[pfn_index] & HMM_PFN_ERROR);
+ WARN_ON(fault && !(range.hmm_pfns[pfn_index] & HMM_PFN_VALID));
+ if (!(range.hmm_pfns[pfn_index] & HMM_PFN_VALID))
+ continue;
+
+ if (range.hmm_pfns[pfn_index] & HMM_PFN_DMA_MAPPED)
+ continue;
hmm_order = hmm_pfn_to_map_order(range.hmm_pfns[pfn_index]);
/* If a hugepage was detected and ODP wasn't set for, the umem
@@ -457,13 +437,13 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
}
ret = ib_umem_odp_map_dma_single_page(
- umem_odp, dma_index, hmm_pfn_to_page(range.hmm_pfns[pfn_index]),
- access_mask);
+ umem_odp, dma_index, hmm_pfn_to_page(range.hmm_pfns[pfn_index]));
if (ret < 0) {
ibdev_dbg(umem_odp->umem.ibdev,
"ib_umem_odp_map_dma_single_page failed with error %d\n", ret);
break;
}
+ range.hmm_pfns[pfn_index] |= HMM_PFN_DMA_MAPPED;
}
/* upon success lock should stay on hold for the callee */
if (!ret)
@@ -483,7 +463,6 @@ EXPORT_SYMBOL(ib_umem_odp_map_dma_and_lock);
void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
u64 bound)
{
- dma_addr_t dma_addr;
dma_addr_t dma;
int idx;
u64 addr;
@@ -494,34 +473,33 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
virt = max_t(u64, virt, ib_umem_start(umem_odp));
bound = min_t(u64, bound, ib_umem_end(umem_odp));
for (addr = virt; addr < bound; addr += BIT(umem_odp->page_shift)) {
+ unsigned long pfn_idx = (addr - ib_umem_start(umem_odp)) >> PAGE_SHIFT;
+ struct page *page = hmm_pfn_to_page(umem_odp->pfn_list[pfn_idx]);
+
idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
dma = umem_odp->dma_list[idx];
- /* The access flags guaranteed a valid DMA address in case was NULL */
- if (dma) {
- unsigned long pfn_idx = (addr - ib_umem_start(umem_odp)) >> PAGE_SHIFT;
- struct page *page = hmm_pfn_to_page(umem_odp->pfn_list[pfn_idx]);
-
- dma_addr = dma & ODP_DMA_ADDR_MASK;
- ib_dma_unmap_page(dev, dma_addr,
- BIT(umem_odp->page_shift),
- DMA_BIDIRECTIONAL);
- if (dma & ODP_WRITE_ALLOWED_BIT) {
- struct page *head_page = compound_head(page);
- /*
- * set_page_dirty prefers being called with
- * the page lock. However, MMU notifiers are
- * called sometimes with and sometimes without
- * the lock. We rely on the umem_mutex instead
- * to prevent other mmu notifiers from
- * continuing and allowing the page mapping to
- * be removed.
- */
- set_page_dirty(head_page);
- }
- umem_odp->dma_list[idx] = 0;
- umem_odp->npages--;
+ if (!(umem_odp->pfn_list[pfn_idx] & HMM_PFN_VALID))
+ continue;
+ if (!(umem_odp->pfn_list[pfn_idx] & HMM_PFN_DMA_MAPPED))
+ continue;
+
+ ib_dma_unmap_page(dev, dma, BIT(umem_odp->page_shift),
+ DMA_BIDIRECTIONAL);
+ if (umem_odp->pfn_list[pfn_idx] & HMM_PFN_WRITE) {
+ struct page *head_page = compound_head(page);
+ /*
+ * set_page_dirty prefers being called with
+ * the page lock. However, MMU notifiers are
+ * called sometimes with and sometimes without
+ * the lock. We rely on the umem_mutex instead
+ * to prevent other mmu notifiers from
+ * continuing and allowing the page mapping to
+ * be removed.
+ */
+ set_page_dirty(head_page);
}
+ umem_odp->npages--;
}
}
EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index f255a12e26a0..e8494a803a58 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -334,6 +334,7 @@ struct mlx5_ib_flow_db {
#define MLX5_IB_UPD_XLT_PD BIT(4)
#define MLX5_IB_UPD_XLT_ACCESS BIT(5)
#define MLX5_IB_UPD_XLT_INDIRECT BIT(6)
+#define MLX5_IB_UPD_XLT_DOWNGRADE BIT(7)
/* Private QP creation flags to be passed in ib_qp_init_attr.create_flags.
*
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 4a04cbc5b78a..5713fe25f4de 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -34,6 +34,7 @@
#include <linux/kernel.h>
#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
+#include <linux/hmm.h>
#include "mlx5_ib.h"
#include "cmd.h"
@@ -143,22 +144,12 @@ static void populate_klm(struct mlx5_klm *pklm, size_t idx, size_t nentries,
}
}
-static u64 umem_dma_to_mtt(dma_addr_t umem_dma)
-{
- u64 mtt_entry = umem_dma & ODP_DMA_ADDR_MASK;
-
- if (umem_dma & ODP_READ_ALLOWED_BIT)
- mtt_entry |= MLX5_IB_MTT_READ;
- if (umem_dma & ODP_WRITE_ALLOWED_BIT)
- mtt_entry |= MLX5_IB_MTT_WRITE;
-
- return mtt_entry;
-}
-
static void populate_mtt(__be64 *pas, size_t idx, size_t nentries,
struct mlx5_ib_mr *mr, int flags)
{
struct ib_umem_odp *odp = to_ib_umem_odp(mr->umem);
+ bool downgrade = flags & MLX5_IB_UPD_XLT_DOWNGRADE;
+ unsigned long pfn;
dma_addr_t pa;
size_t i;
@@ -166,8 +157,17 @@ static void populate_mtt(__be64 *pas, size_t idx, size_t nentries,
return;
for (i = 0; i < nentries; i++) {
+ pfn = odp->pfn_list[idx + i];
+ if (!(pfn & HMM_PFN_VALID))
+ /* Initial ODP init */
+ continue;
+
pa = odp->dma_list[idx + i];
- pas[i] = cpu_to_be64(umem_dma_to_mtt(pa));
+ pa |= MLX5_IB_MTT_READ;
+ if ((pfn & HMM_PFN_WRITE) && !downgrade)
+ pa |= MLX5_IB_MTT_WRITE;
+
+ pas[i] = cpu_to_be64(pa);
}
}
@@ -268,8 +268,7 @@ static bool mlx5_ib_invalidate_range(struct mmu_interval_notifier *mni,
* estimate the cost of another UMR vs. the cost of bigger
* UMR.
*/
- if (umem_odp->dma_list[idx] &
- (ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT)) {
+ if (umem_odp->pfn_list[idx] & HMM_PFN_VALID) {
if (!in_block) {
blk_start_idx = idx;
in_block = 1;
@@ -555,7 +554,7 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
{
int page_shift, ret, np;
bool downgrade = flags & MLX5_PF_FLAGS_DOWNGRADE;
- u64 access_mask;
+ u64 access_mask = 0;
u64 start_idx;
bool fault = !(flags & MLX5_PF_FLAGS_SNAPSHOT);
u32 xlt_flags = MLX5_IB_UPD_XLT_ATOMIC;
@@ -563,12 +562,14 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
if (flags & MLX5_PF_FLAGS_ENABLE)
xlt_flags |= MLX5_IB_UPD_XLT_ENABLE;
+ if (flags & MLX5_PF_FLAGS_DOWNGRADE)
+ xlt_flags |= MLX5_IB_UPD_XLT_DOWNGRADE;
+
page_shift = odp->page_shift;
start_idx = (user_va - ib_umem_start(odp)) >> page_shift;
- access_mask = ODP_READ_ALLOWED_BIT;
if (odp->umem.writable && !downgrade)
- access_mask |= ODP_WRITE_ALLOWED_BIT;
+ access_mask |= HMM_PFN_WRITE;
np = ib_umem_odp_map_dma_and_lock(odp, user_va, bcnt, access_mask, fault);
if (np < 0)
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index bb2d7f2a5b04..a3f4a5c03bf8 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -8,6 +8,7 @@
#include <rdma/ib_umem.h>
#include <rdma/ib_verbs.h>
+#include <linux/hmm.h>
struct ib_umem_odp {
struct ib_umem umem;
@@ -68,19 +69,6 @@ static inline size_t ib_umem_odp_num_pages(struct ib_umem_odp *umem_odp)
umem_odp->page_shift;
}
-/*
- * The lower 2 bits of the DMA address signal the R/W permissions for
- * the entry. To upgrade the permissions, provide the appropriate
- * bitmask to the map_dma_pages function.
- *
- * Be aware that upgrading a mapped address might result in change of
- * the DMA address for the page.
- */
-#define ODP_READ_ALLOWED_BIT (1<<0ULL)
-#define ODP_WRITE_ALLOWED_BIT (1<<1ULL)
-
-#define ODP_DMA_ADDR_MASK (~(ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT))
-
#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
struct ib_umem_odp *
--
2.45.2
* [RFC PATCH v1 11/18] RDMA/core: Separate DMA mapping to caching IOVA and page linkage
2024-07-02 9:09 [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Leon Romanovsky
` (9 preceding siblings ...)
2024-07-02 9:09 ` [RFC PATCH v1 10/18] RDMA/umem: Store ODP access mask information in PFN Leon Romanovsky
@ 2024-07-02 9:09 ` Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 12/18] RDMA/umem: Prevent UMEM ODP creation with SWIOTLB Leon Romanovsky
` (8 subsequent siblings)
19 siblings, 0 replies; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-02 9:09 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni
Cc: Leon Romanovsky, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
From: Leon Romanovsky <leonro@nvidia.com>
Reuse the newly added DMA API to cache the IOVA and only link/unlink
pages in the fast path.
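For orientation, a condensed sketch of the resulting split (using the
dma_alloc_iova()/dma_hmm_link_page()/dma_hmm_unlink_page() calls
proposed earlier in this series; error handling and locking trimmed):
/* Slow path, once at umem_odp creation: reserve the IOVA range. */
umem_odp->iova.dev  = dev->dma_device;
umem_odp->iova.size = end - start;
umem_odp->iova.dir  = DMA_BIDIRECTIONAL;
ret = dma_alloc_iova(&umem_odp->iova);

/* Fast path, per page fault: link one page into the cached IOVA. */
pa = dma_hmm_link_page(&odp->pfn_list[idx], &odp->iova,
		       idx * (1 << odp->page_shift));

/* Fast path, on invalidation: unlink without releasing the IOVA. */
dma_hmm_unlink_page(&odp->pfn_list[idx], &odp->iova,
		    idx * (1 << odp->page_shift));

/* Slow path, at release time: return the IOVA range. */
dma_free_iova(&umem_odp->iova);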
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/core/umem_odp.c | 61 +++---------------------------
drivers/infiniband/hw/mlx5/odp.c | 7 +++-
include/rdma/ib_umem_odp.h | 8 +---
3 files changed, 12 insertions(+), 64 deletions(-)
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index c628a98c41b7..6e170cb5110c 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -81,20 +81,13 @@ static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
if (!umem_odp->pfn_list)
return -ENOMEM;
- umem_odp->dma_list = kvcalloc(
- ndmas, sizeof(*umem_odp->dma_list), GFP_KERNEL);
- if (!umem_odp->dma_list) {
- ret = -ENOMEM;
- goto out_pfn_list;
- }
umem_odp->iova.dev = dev->dma_device;
umem_odp->iova.size = end - start;
umem_odp->iova.dir = DMA_BIDIRECTIONAL;
ret = dma_alloc_iova(&umem_odp->iova);
if (ret)
- goto out_dma_list;
-
+ goto out_pfn_list;
ret = mmu_interval_notifier_insert(&umem_odp->notifier,
umem_odp->umem.owning_mm,
@@ -107,8 +100,6 @@ static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
out_free_iova:
dma_free_iova(&umem_odp->iova);
-out_dma_list:
- kvfree(umem_odp->dma_list);
out_pfn_list:
kvfree(umem_odp->pfn_list);
return ret;
@@ -286,7 +277,6 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
mutex_unlock(&umem_odp->umem_mutex);
mmu_interval_notifier_remove(&umem_odp->notifier);
dma_free_iova(&umem_odp->iova);
- kvfree(umem_odp->dma_list);
kvfree(umem_odp->pfn_list);
}
put_pid(umem_odp->tgid);
@@ -294,40 +284,10 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
}
EXPORT_SYMBOL(ib_umem_odp_release);
-/*
- * Map for DMA and insert a single page into the on-demand paging page tables.
- *
- * @umem: the umem to insert the page to.
- * @dma_index: index in the umem to add the dma to.
- * @page: the page struct to map and add.
- * @access_mask: access permissions needed for this page.
- *
- * The function returns -EFAULT if the DMA mapping operation fails.
- *
- */
-static int ib_umem_odp_map_dma_single_page(
- struct ib_umem_odp *umem_odp,
- unsigned int dma_index,
- struct page *page)
-{
- struct ib_device *dev = umem_odp->umem.ibdev;
- dma_addr_t *dma_addr = &umem_odp->dma_list[dma_index];
-
- *dma_addr = ib_dma_map_page(dev, page, 0, 1 << umem_odp->page_shift,
- DMA_BIDIRECTIONAL);
- if (ib_dma_mapping_error(dev, *dma_addr)) {
- *dma_addr = 0;
- return -EFAULT;
- }
- umem_odp->npages++;
- return 0;
-}
-
/**
* ib_umem_odp_map_dma_and_lock - DMA map userspace memory in an ODP MR and lock it.
*
* Maps the range passed in the argument to DMA addresses.
- * The DMA addresses of the mapped pages is updated in umem_odp->dma_list.
* Upon success the ODP MR will be locked to let caller complete its device
* page table update.
*
@@ -435,15 +395,6 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
__func__, hmm_order, page_shift);
break;
}
-
- ret = ib_umem_odp_map_dma_single_page(
- umem_odp, dma_index, hmm_pfn_to_page(range.hmm_pfns[pfn_index]));
- if (ret < 0) {
- ibdev_dbg(umem_odp->umem.ibdev,
- "ib_umem_odp_map_dma_single_page failed with error %d\n", ret);
- break;
- }
- range.hmm_pfns[pfn_index] |= HMM_PFN_DMA_MAPPED;
}
/* upon success lock should stay on hold for the callee */
if (!ret)
@@ -463,10 +414,8 @@ EXPORT_SYMBOL(ib_umem_odp_map_dma_and_lock);
void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
u64 bound)
{
- dma_addr_t dma;
int idx;
u64 addr;
- struct ib_device *dev = umem_odp->umem.ibdev;
lockdep_assert_held(&umem_odp->umem_mutex);
@@ -474,19 +423,19 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
bound = min_t(u64, bound, ib_umem_end(umem_odp));
for (addr = virt; addr < bound; addr += BIT(umem_odp->page_shift)) {
unsigned long pfn_idx = (addr - ib_umem_start(umem_odp)) >> PAGE_SHIFT;
- struct page *page = hmm_pfn_to_page(umem_odp->pfn_list[pfn_idx]);
idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
- dma = umem_odp->dma_list[idx];
if (!(umem_odp->pfn_list[pfn_idx] & HMM_PFN_VALID))
continue;
if (!(umem_odp->pfn_list[pfn_idx] & HMM_PFN_DMA_MAPPED))
continue;
- ib_dma_unmap_page(dev, dma, BIT(umem_odp->page_shift),
- DMA_BIDIRECTIONAL);
+ dma_hmm_unlink_page(&umem_odp->pfn_list[pfn_idx],
+ &umem_odp->iova,
+ idx * (1 << umem_odp->page_shift));
if (umem_odp->pfn_list[pfn_idx] & HMM_PFN_WRITE) {
+ struct page *page = hmm_pfn_to_page(umem_odp->pfn_list[pfn_idx]);
struct page *head_page = compound_head(page);
/*
* set_page_dirty prefers being called with
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 5713fe25f4de..b2aeaef9d0e1 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -149,6 +149,7 @@ static void populate_mtt(__be64 *pas, size_t idx, size_t nentries,
{
struct ib_umem_odp *odp = to_ib_umem_odp(mr->umem);
bool downgrade = flags & MLX5_IB_UPD_XLT_DOWNGRADE;
+ struct ib_device *dev = odp->umem.ibdev;
unsigned long pfn;
dma_addr_t pa;
size_t i;
@@ -162,12 +163,16 @@ static void populate_mtt(__be64 *pas, size_t idx, size_t nentries,
/* Initial ODP init */
continue;
- pa = odp->dma_list[idx + i];
+ pa = dma_hmm_link_page(&odp->pfn_list[idx + i], &odp->iova,
+ (idx + i) * (1 << odp->page_shift));
+ WARN_ON_ONCE(ib_dma_mapping_error(dev, pa));
+
pa |= MLX5_IB_MTT_READ;
if ((pfn & HMM_PFN_WRITE) && !downgrade)
pa |= MLX5_IB_MTT_WRITE;
pas[i] = cpu_to_be64(pa);
+ odp->npages++;
}
}
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index a3f4a5c03bf8..653fc076b6ee 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -18,15 +18,9 @@ struct ib_umem_odp {
/* An array of the pfns included in the on-demand paging umem. */
unsigned long *pfn_list;
- /*
- * An array with DMA addresses mapped for pfns in pfn_list.
- * The lower two bits designate access permissions.
- * See ODP_READ_ALLOWED_BIT and ODP_WRITE_ALLOWED_BIT.
- */
- dma_addr_t *dma_list;
struct dma_iova_attrs iova;
/*
- * The umem_mutex protects the page_list and dma_list fields of an ODP
+ * The umem_mutex protects the page_list field of an ODP
* umem, allowing only a single thread to map/unmap pages. The mutex
* also protects access to the mmu notifier counters.
*/
--
2.45.2
* [RFC PATCH v1 12/18] RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
2024-07-02 9:09 [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Leon Romanovsky
` (10 preceding siblings ...)
2024-07-02 9:09 ` [RFC PATCH v1 11/18] RDMA/core: Separate DMA mapping to caching IOVA and page linkage Leon Romanovsky
@ 2024-07-02 9:09 ` Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 13/18] vfio/mlx5: Explicitly use number of pages instead of allocated length Leon Romanovsky
` (7 subsequent siblings)
19 siblings, 0 replies; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-02 9:09 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni
Cc: Leon Romanovsky, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
From: Leon Romanovsky <leonro@nvidia.com>
RDMA UMEM has never supported DMA addresses returned from SWIOTLB, as
these addresses must be programmed into the hardware, which is not aware
that they are bounce buffers and not real ones.
Instead of silently leaving a broken system for users who are not aware
of this, be explicit and return an error to them.
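The rejection is a simple check at umem_odp creation time; a minimal
sketch, assuming the dev_use_swiotlb() and is_swiotlb_force_bounce()
helpers from <linux/dma-map-ops.h> and <linux/swiotlb.h> used in the
hunk below:
/* ODP DMA addresses are programmed straight into device page tables,
 * so a bounce buffer would never be synced back. Refuse to create the
 * umem instead of leaving the system silently broken.
 */
if (dev_use_swiotlb(dev->dma_device, page_size, DMA_BIDIRECTIONAL) ||
    is_swiotlb_force_bounce(dev->dma_device))
	return -EOPNOTSUPP;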
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/infiniband/core/umem_odp.c | 81 +++++++++++++++---------------
1 file changed, 41 insertions(+), 40 deletions(-)
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 6e170cb5110c..12186717a892 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -42,7 +42,8 @@
#include <linux/interval_tree.h>
#include <linux/hmm.h>
#include <linux/pagemap.h>
-
+#include <linux/dma-map-ops.h>
+#include <linux/swiotlb.h>
#include <rdma/ib_umem_odp.h>
#include "uverbs.h"
@@ -51,50 +52,50 @@ static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
const struct mmu_interval_notifier_ops *ops)
{
struct ib_device *dev = umem_odp->umem.ibdev;
+ size_t page_size = 1UL << umem_odp->page_shift;
+ unsigned long start, end;
+ size_t ndmas, npfns;
int ret;
umem_odp->umem.is_odp = 1;
mutex_init(&umem_odp->umem_mutex);
+ if (umem_odp->is_implicit_odp)
+ return 0;
+
+ if (dev_use_swiotlb(dev->dma_device, page_size, DMA_BIDIRECTIONAL) ||
+ is_swiotlb_force_bounce(dev->dma_device))
+ return -EOPNOTSUPP;
+
+ start = ALIGN_DOWN(umem_odp->umem.address, page_size);
+ if (check_add_overflow(umem_odp->umem.address,
+ (unsigned long)umem_odp->umem.length, &end))
+ return -EOVERFLOW;
+ end = ALIGN(end, page_size);
+ if (unlikely(end < page_size))
+ return -EOVERFLOW;
+
+ ndmas = (end - start) >> umem_odp->page_shift;
+ if (!ndmas)
+ return -EINVAL;
+
+ npfns = (end - start) >> PAGE_SHIFT;
+ umem_odp->pfn_list =
+ kvcalloc(npfns, sizeof(*umem_odp->pfn_list), GFP_KERNEL);
+ if (!umem_odp->pfn_list)
+ return -ENOMEM;
+
+ umem_odp->iova.dev = dev->dma_device;
+ umem_odp->iova.size = end - start;
+ umem_odp->iova.dir = DMA_BIDIRECTIONAL;
+ ret = dma_alloc_iova(&umem_odp->iova);
+ if (ret)
+ goto out_pfn_list;
- if (!umem_odp->is_implicit_odp) {
- size_t page_size = 1UL << umem_odp->page_shift;
- unsigned long start;
- unsigned long end;
- size_t ndmas, npfns;
-
- start = ALIGN_DOWN(umem_odp->umem.address, page_size);
- if (check_add_overflow(umem_odp->umem.address,
- (unsigned long)umem_odp->umem.length,
- &end))
- return -EOVERFLOW;
- end = ALIGN(end, page_size);
- if (unlikely(end < page_size))
- return -EOVERFLOW;
-
- ndmas = (end - start) >> umem_odp->page_shift;
- if (!ndmas)
- return -EINVAL;
-
- npfns = (end - start) >> PAGE_SHIFT;
- umem_odp->pfn_list = kvcalloc(
- npfns, sizeof(*umem_odp->pfn_list), GFP_KERNEL);
- if (!umem_odp->pfn_list)
- return -ENOMEM;
-
-
- umem_odp->iova.dev = dev->dma_device;
- umem_odp->iova.size = end - start;
- umem_odp->iova.dir = DMA_BIDIRECTIONAL;
- ret = dma_alloc_iova(&umem_odp->iova);
- if (ret)
- goto out_pfn_list;
-
- ret = mmu_interval_notifier_insert(&umem_odp->notifier,
- umem_odp->umem.owning_mm,
- start, end - start, ops);
- if (ret)
- goto out_free_iova;
- }
+ ret = mmu_interval_notifier_insert(&umem_odp->notifier,
+ umem_odp->umem.owning_mm, start,
+ end - start, ops);
+ if (ret)
+ goto out_free_iova;
return 0;
--
2.45.2
* [RFC PATCH v1 13/18] vfio/mlx5: Explicitly use number of pages instead of allocated length
2024-07-02 9:09 [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Leon Romanovsky
` (11 preceding siblings ...)
2024-07-02 9:09 ` [RFC PATCH v1 12/18] RDMA/umem: Prevent UMEM ODP creation with SWIOTLB Leon Romanovsky
@ 2024-07-02 9:09 ` Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 14/18] vfio/mlx5: Rewrite create mkey flow to allow better code reuse Leon Romanovsky
` (6 subsequent siblings)
19 siblings, 0 replies; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-02 9:09 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni
Cc: Leon Romanovsky, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
From: Leon Romanovsky <leonro@nvidia.com>
allocated_length is always a multiple of the page size, so change the
functions to accept a number of pages instead. This opens the way to
combining the receive and send paths and improves code readability.
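As a small illustration of the calling-convention change (call sites
adapted from the diff below), callers now convert a byte length to
pages once, at the boundary:
/* Before: byte length, rounded up to pages inside the helper. */
buf = mlx5vf_get_data_buffer(migf, inc_state_size, DMA_FROM_DEVICE);

/* After: the caller passes the number of PAGE_SIZE pages explicitly. */
buf = mlx5vf_get_data_buffer(migf,
			     DIV_ROUND_UP(inc_state_size, PAGE_SIZE),
			     DMA_FROM_DEVICE);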
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/vfio/pci/mlx5/cmd.c | 32 +++++++++++-----------
drivers/vfio/pci/mlx5/cmd.h | 10 +++----
drivers/vfio/pci/mlx5/main.c | 53 +++++++++++++++++++++++-------------
3 files changed, 55 insertions(+), 40 deletions(-)
diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index 41a4b0cf4297..fdc3e515741f 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -318,8 +318,7 @@ static int _create_mkey(struct mlx5_core_dev *mdev, u32 pdn,
struct mlx5_vhca_recv_buf *recv_buf,
u32 *mkey)
{
- size_t npages = buf ? DIV_ROUND_UP(buf->allocated_length, PAGE_SIZE) :
- recv_buf->npages;
+ size_t npages = buf ? buf->npages : recv_buf->npages;
int err = 0, inlen;
__be64 *mtt;
void *mkc;
@@ -375,7 +374,7 @@ static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
if (mvdev->mdev_detach)
return -ENOTCONN;
- if (buf->dmaed || !buf->allocated_length)
+ if (buf->dmaed || !buf->npages)
return -EINVAL;
ret = dma_map_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
@@ -444,7 +443,7 @@ static int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
if (ret)
goto err;
- buf->allocated_length += filled * PAGE_SIZE;
+ buf->npages += filled;
/* clean input for another bulk allocation */
memset(page_list, 0, filled * sizeof(*page_list));
to_fill = min_t(unsigned int, to_alloc,
@@ -460,8 +459,7 @@ static int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
}
struct mlx5_vhca_data_buffer *
-mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf,
- size_t length,
+mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
enum dma_data_direction dma_dir)
{
struct mlx5_vhca_data_buffer *buf;
@@ -473,9 +471,8 @@ mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf,
buf->dma_dir = dma_dir;
buf->migf = migf;
- if (length) {
- ret = mlx5vf_add_migration_pages(buf,
- DIV_ROUND_UP_ULL(length, PAGE_SIZE));
+ if (npages) {
+ ret = mlx5vf_add_migration_pages(buf, npages);
if (ret)
goto end;
@@ -501,8 +498,8 @@ void mlx5vf_put_data_buffer(struct mlx5_vhca_data_buffer *buf)
}
struct mlx5_vhca_data_buffer *
-mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf,
- size_t length, enum dma_data_direction dma_dir)
+mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
+ enum dma_data_direction dma_dir)
{
struct mlx5_vhca_data_buffer *buf, *temp_buf;
struct list_head free_list;
@@ -517,7 +514,7 @@ mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf,
list_for_each_entry_safe(buf, temp_buf, &migf->avail_list, buf_elm) {
if (buf->dma_dir == dma_dir) {
list_del_init(&buf->buf_elm);
- if (buf->allocated_length >= length) {
+ if (buf->npages >= npages) {
spin_unlock_irq(&migf->list_lock);
goto found;
}
@@ -531,7 +528,7 @@ mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf,
}
}
spin_unlock_irq(&migf->list_lock);
- buf = mlx5vf_alloc_data_buffer(migf, length, dma_dir);
+ buf = mlx5vf_alloc_data_buffer(migf, npages, dma_dir);
found:
while ((temp_buf = list_first_entry_or_null(&free_list,
@@ -712,7 +709,7 @@ int mlx5vf_cmd_save_vhca_state(struct mlx5vf_pci_core_device *mvdev,
MLX5_SET(save_vhca_state_in, in, op_mod, 0);
MLX5_SET(save_vhca_state_in, in, vhca_id, mvdev->vhca_id);
MLX5_SET(save_vhca_state_in, in, mkey, buf->mkey);
- MLX5_SET(save_vhca_state_in, in, size, buf->allocated_length);
+ MLX5_SET(save_vhca_state_in, in, size, buf->npages * PAGE_SIZE);
MLX5_SET(save_vhca_state_in, in, incremental, inc);
MLX5_SET(save_vhca_state_in, in, set_track, track);
@@ -734,8 +731,11 @@ int mlx5vf_cmd_save_vhca_state(struct mlx5vf_pci_core_device *mvdev,
}
if (!header_buf) {
- header_buf = mlx5vf_get_data_buffer(migf,
- sizeof(struct mlx5_vf_migration_header), DMA_NONE);
+ header_buf = mlx5vf_get_data_buffer(
+ migf,
+ DIV_ROUND_UP(sizeof(struct mlx5_vf_migration_header),
+ PAGE_SIZE),
+ DMA_NONE);
if (IS_ERR(header_buf)) {
err = PTR_ERR(header_buf);
goto err_free;
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index df421dc6de04..7d4a833b6900 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -56,7 +56,7 @@ struct mlx5_vhca_data_buffer {
struct sg_append_table table;
loff_t start_pos;
u64 length;
- u64 allocated_length;
+ u32 npages;
u32 mkey;
enum dma_data_direction dma_dir;
u8 dmaed:1;
@@ -217,12 +217,12 @@ int mlx5vf_cmd_alloc_pd(struct mlx5_vf_migration_file *migf);
void mlx5vf_cmd_dealloc_pd(struct mlx5_vf_migration_file *migf);
void mlx5fv_cmd_clean_migf_resources(struct mlx5_vf_migration_file *migf);
struct mlx5_vhca_data_buffer *
-mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf,
- size_t length, enum dma_data_direction dma_dir);
+mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
+ enum dma_data_direction dma_dir);
void mlx5vf_free_data_buffer(struct mlx5_vhca_data_buffer *buf);
struct mlx5_vhca_data_buffer *
-mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf,
- size_t length, enum dma_data_direction dma_dir);
+mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
+ enum dma_data_direction dma_dir);
void mlx5vf_put_data_buffer(struct mlx5_vhca_data_buffer *buf);
struct page *mlx5vf_get_migration_page(struct mlx5_vhca_data_buffer *buf,
unsigned long offset);
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index 61d9b0f9146d..0925cd7d2f17 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -308,6 +308,7 @@ static struct mlx5_vhca_data_buffer *
mlx5vf_mig_file_get_stop_copy_buf(struct mlx5_vf_migration_file *migf,
u8 index, size_t required_length)
{
+ u32 npages = DIV_ROUND_UP(required_length, PAGE_SIZE);
struct mlx5_vhca_data_buffer *buf = migf->buf[index];
u8 chunk_num;
@@ -315,12 +316,11 @@ mlx5vf_mig_file_get_stop_copy_buf(struct mlx5_vf_migration_file *migf,
chunk_num = buf->stop_copy_chunk_num;
buf->migf->buf[index] = NULL;
/* Checking whether the pre-allocated buffer can fit */
- if (buf->allocated_length >= required_length)
+ if (buf->npages >= npages)
return buf;
mlx5vf_put_data_buffer(buf);
- buf = mlx5vf_get_data_buffer(buf->migf, required_length,
- DMA_FROM_DEVICE);
+ buf = mlx5vf_get_data_buffer(buf->migf, npages, DMA_FROM_DEVICE);
if (IS_ERR(buf))
return buf;
@@ -373,7 +373,8 @@ static int mlx5vf_add_stop_copy_header(struct mlx5_vf_migration_file *migf,
u8 *to_buff;
int ret;
- header_buf = mlx5vf_get_data_buffer(migf, size, DMA_NONE);
+ header_buf = mlx5vf_get_data_buffer(migf, DIV_ROUND_UP(size, PAGE_SIZE),
+ DMA_NONE);
if (IS_ERR(header_buf))
return PTR_ERR(header_buf);
@@ -388,7 +389,7 @@ static int mlx5vf_add_stop_copy_header(struct mlx5_vf_migration_file *migf,
to_buff = kmap_local_page(page);
memcpy(to_buff, &header, sizeof(header));
header_buf->length = sizeof(header);
- data.stop_copy_size = cpu_to_le64(migf->buf[0]->allocated_length);
+ data.stop_copy_size = cpu_to_le64(migf->buf[0]->npages * PAGE_SIZE);
memcpy(to_buff + sizeof(header), &data, sizeof(data));
header_buf->length += sizeof(data);
kunmap_local(to_buff);
@@ -437,15 +438,20 @@ static int mlx5vf_prep_stop_copy(struct mlx5vf_pci_core_device *mvdev,
num_chunks = mvdev->chunk_mode ? MAX_NUM_CHUNKS : 1;
for (i = 0; i < num_chunks; i++) {
- buf = mlx5vf_get_data_buffer(migf, inc_state_size, DMA_FROM_DEVICE);
+ buf = mlx5vf_get_data_buffer(
+ migf, DIV_ROUND_UP(inc_state_size, PAGE_SIZE),
+ DMA_FROM_DEVICE);
if (IS_ERR(buf)) {
ret = PTR_ERR(buf);
goto err;
}
migf->buf[i] = buf;
- buf = mlx5vf_get_data_buffer(migf,
- sizeof(struct mlx5_vf_migration_header), DMA_NONE);
+ buf = mlx5vf_get_data_buffer(
+ migf,
+ DIV_ROUND_UP(sizeof(struct mlx5_vf_migration_header),
+ PAGE_SIZE),
+ DMA_NONE);
if (IS_ERR(buf)) {
ret = PTR_ERR(buf);
goto err;
@@ -553,7 +559,8 @@ static long mlx5vf_precopy_ioctl(struct file *filp, unsigned int cmd,
* We finished transferring the current state and the device has a
* dirty state, save a new state to be ready for.
*/
- buf = mlx5vf_get_data_buffer(migf, inc_length, DMA_FROM_DEVICE);
+ buf = mlx5vf_get_data_buffer(migf, DIV_ROUND_UP(inc_length, PAGE_SIZE),
+ DMA_FROM_DEVICE);
if (IS_ERR(buf)) {
ret = PTR_ERR(buf);
mlx5vf_mark_err(migf);
@@ -674,8 +681,8 @@ mlx5vf_pci_save_device_data(struct mlx5vf_pci_core_device *mvdev, bool track)
if (track) {
/* leave the allocated buffer ready for the stop-copy phase */
- buf = mlx5vf_alloc_data_buffer(migf,
- migf->buf[0]->allocated_length, DMA_FROM_DEVICE);
+ buf = mlx5vf_alloc_data_buffer(migf, migf->buf[0]->npages,
+ DMA_FROM_DEVICE);
if (IS_ERR(buf)) {
ret = PTR_ERR(buf);
goto out_pd;
@@ -918,11 +925,14 @@ static ssize_t mlx5vf_resume_write(struct file *filp, const char __user *buf,
goto out_unlock;
break;
case MLX5_VF_LOAD_STATE_PREP_HEADER_DATA:
- if (vhca_buf_header->allocated_length < migf->record_size) {
+ {
+ u32 npages = DIV_ROUND_UP(migf->record_size, PAGE_SIZE);
+
+ if (vhca_buf_header->npages < npages) {
mlx5vf_free_data_buffer(vhca_buf_header);
- migf->buf_header[0] = mlx5vf_alloc_data_buffer(migf,
- migf->record_size, DMA_NONE);
+ migf->buf_header[0] = mlx5vf_alloc_data_buffer(
+ migf, npages, DMA_NONE);
if (IS_ERR(migf->buf_header[0])) {
ret = PTR_ERR(migf->buf_header[0]);
migf->buf_header[0] = NULL;
@@ -935,6 +945,7 @@ static ssize_t mlx5vf_resume_write(struct file *filp, const char __user *buf,
vhca_buf_header->start_pos = migf->max_pos;
migf->load_state = MLX5_VF_LOAD_STATE_READ_HEADER_DATA;
break;
+ }
case MLX5_VF_LOAD_STATE_READ_HEADER_DATA:
ret = mlx5vf_resume_read_header_data(migf, vhca_buf_header,
&buf, &len, pos, &done);
@@ -945,12 +956,13 @@ static ssize_t mlx5vf_resume_write(struct file *filp, const char __user *buf,
{
u64 size = max(migf->record_size,
migf->stop_copy_prep_size);
+ u32 npages = DIV_ROUND_UP(size, PAGE_SIZE);
- if (vhca_buf->allocated_length < size) {
+ if (vhca_buf->npages < npages) {
mlx5vf_free_data_buffer(vhca_buf);
- migf->buf[0] = mlx5vf_alloc_data_buffer(migf,
- size, DMA_TO_DEVICE);
+ migf->buf[0] = mlx5vf_alloc_data_buffer(
+ migf, npages, DMA_TO_DEVICE);
if (IS_ERR(migf->buf[0])) {
ret = PTR_ERR(migf->buf[0]);
migf->buf[0] = NULL;
@@ -1033,8 +1045,11 @@ mlx5vf_pci_resume_device_data(struct mlx5vf_pci_core_device *mvdev)
}
migf->buf[0] = buf;
- buf = mlx5vf_alloc_data_buffer(migf,
- sizeof(struct mlx5_vf_migration_header), DMA_NONE);
+ buf = mlx5vf_alloc_data_buffer(
+ migf,
+ DIV_ROUND_UP(sizeof(struct mlx5_vf_migration_header),
+ PAGE_SIZE),
+ DMA_NONE);
if (IS_ERR(buf)) {
ret = PTR_ERR(buf);
goto out_buf;
--
2.45.2
* [RFC PATCH v1 14/18] vfio/mlx5: Rewrite create mkey flow to allow better code reuse
2024-07-02 9:09 [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Leon Romanovsky
` (12 preceding siblings ...)
2024-07-02 9:09 ` [RFC PATCH v1 13/18] vfio/mlx5: Explicitly use number of pages instead of allocated length Leon Romanovsky
@ 2024-07-02 9:09 ` Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 15/18] vfio/mlx5: Explicitly store page list Leon Romanovsky
` (5 subsequent siblings)
19 siblings, 0 replies; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-02 9:09 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni
Cc: Leon Romanovsky, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
From: Leon Romanovsky <leonro@nvidia.com>
Change mkey creation to be performed in multiple steps: data
allocation, DMA setup and the actual call to HW to create that mkey.
In this new flow, the whole input to the MKEY command is kept around,
which eliminates the need to maintain an array of pointers to DMA
addresses for the receive list and, in future patches, for the send
list too.
In addition to reducing memory usage and eliminating unnecessary data
movement when building the MKEY input, the code is prepared for future
reuse.
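A compressed sketch of the new three-step flow for the receive buffer
(function names as introduced in the diff below; error handling
omitted):
/* 1. Allocate and pre-fill the full MKEY command input once. */
recv_buf->mkey_in = alloc_mkey_in(npages, pdn);

/* 2. DMA-map the pages and write the MTT entries directly into the
 *    saved command input, so no separate dma_addrs[] array is needed.
 */
err = register_dma_pages(mdev, npages, recv_buf->page_list,
			 recv_buf->mkey_in);

/* 3. Hand the already-populated input to HW. */
err = create_mkey(mdev, npages, NULL, recv_buf->mkey_in,
		  &recv_buf->mkey);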
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/vfio/pci/mlx5/cmd.c | 154 ++++++++++++++++++++----------------
drivers/vfio/pci/mlx5/cmd.h | 4 +-
2 files changed, 88 insertions(+), 70 deletions(-)
diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index fdc3e515741f..adf57104555a 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -313,39 +313,21 @@ static int mlx5vf_cmd_get_vhca_id(struct mlx5_core_dev *mdev, u16 function_id,
return ret;
}
-static int _create_mkey(struct mlx5_core_dev *mdev, u32 pdn,
- struct mlx5_vhca_data_buffer *buf,
- struct mlx5_vhca_recv_buf *recv_buf,
- u32 *mkey)
+static u32 *alloc_mkey_in(u32 npages, u32 pdn)
{
- size_t npages = buf ? buf->npages : recv_buf->npages;
- int err = 0, inlen;
- __be64 *mtt;
+ int inlen;
void *mkc;
u32 *in;
inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
- sizeof(*mtt) * round_up(npages, 2);
+ sizeof(__be64) * round_up(npages, 2);
- in = kvzalloc(inlen, GFP_KERNEL);
+ in = kvzalloc(inlen, GFP_KERNEL_ACCOUNT);
if (!in)
- return -ENOMEM;
+ return NULL;
MLX5_SET(create_mkey_in, in, translations_octword_actual_size,
DIV_ROUND_UP(npages, 2));
- mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, in, klm_pas_mtt);
-
- if (buf) {
- struct sg_dma_page_iter dma_iter;
-
- for_each_sgtable_dma_page(&buf->table.sgt, &dma_iter, 0)
- *mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter));
- } else {
- int i;
-
- for (i = 0; i < npages; i++)
- *mtt++ = cpu_to_be64(recv_buf->dma_addrs[i]);
- }
mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
MLX5_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_MTT);
@@ -359,9 +341,29 @@ static int _create_mkey(struct mlx5_core_dev *mdev, u32 pdn,
MLX5_SET(mkc, mkc, log_page_size, PAGE_SHIFT);
MLX5_SET(mkc, mkc, translations_octword_size, DIV_ROUND_UP(npages, 2));
MLX5_SET64(mkc, mkc, len, npages * PAGE_SIZE);
- err = mlx5_core_create_mkey(mdev, mkey, in, inlen);
- kvfree(in);
- return err;
+
+ return in;
+}
+
+static int create_mkey(struct mlx5_core_dev *mdev, u32 npages,
+ struct mlx5_vhca_data_buffer *buf, u32 *mkey_in,
+ u32 *mkey)
+{
+ __be64 *mtt;
+ int inlen;
+
+ mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
+ if (buf) {
+ struct sg_dma_page_iter dma_iter;
+
+ for_each_sgtable_dma_page(&buf->table.sgt, &dma_iter, 0)
+ *mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter));
+ }
+
+ inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
+ sizeof(__be64) * round_up(npages, 2);
+
+ return mlx5_core_create_mkey(mdev, mkey, mkey_in, inlen);
}
static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
@@ -374,20 +376,27 @@ static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
if (mvdev->mdev_detach)
return -ENOTCONN;
- if (buf->dmaed || !buf->npages)
+ if (buf->mkey_in || !buf->npages)
return -EINVAL;
ret = dma_map_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
if (ret)
return ret;
- ret = _create_mkey(mdev, buf->migf->pdn, buf, NULL, &buf->mkey);
- if (ret)
+ buf->mkey_in = alloc_mkey_in(buf->npages, buf->migf->pdn);
+ if (!buf->mkey_in) {
+ ret = -ENOMEM;
goto err;
+ }
- buf->dmaed = true;
+ ret = create_mkey(mdev, buf->npages, buf, buf->mkey_in, &buf->mkey);
+ if (ret)
+ goto err_create_mkey;
return 0;
+
+err_create_mkey:
+ kvfree(buf->mkey_in);
err:
dma_unmap_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
return ret;
@@ -401,8 +410,9 @@ void mlx5vf_free_data_buffer(struct mlx5_vhca_data_buffer *buf)
lockdep_assert_held(&migf->mvdev->state_mutex);
WARN_ON(migf->mvdev->mdev_detach);
- if (buf->dmaed) {
+ if (buf->mkey_in) {
mlx5_core_destroy_mkey(migf->mvdev->mdev, buf->mkey);
+ kvfree(buf->mkey_in);
dma_unmap_sgtable(migf->mvdev->mdev->device, &buf->table.sgt,
buf->dma_dir, 0);
}
@@ -779,7 +789,7 @@ int mlx5vf_cmd_load_vhca_state(struct mlx5vf_pci_core_device *mvdev,
if (mvdev->mdev_detach)
return -ENOTCONN;
- if (!buf->dmaed) {
+ if (!buf->mkey_in) {
err = mlx5vf_dma_data_buffer(buf);
if (err)
return err;
@@ -1380,56 +1390,54 @@ static int alloc_recv_pages(struct mlx5_vhca_recv_buf *recv_buf,
kvfree(recv_buf->page_list);
return -ENOMEM;
}
+static void unregister_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
+ u32 *mkey_in)
+{
+ dma_addr_t addr;
+ __be64 *mtt;
+ int i;
+
+ mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
+ for (i = npages - 1; i >= 0; i--) {
+ addr = be64_to_cpu(mtt[i]);
+ dma_unmap_single(mdev->device, addr, PAGE_SIZE,
+ DMA_FROM_DEVICE);
+ }
+}
-static int register_dma_recv_pages(struct mlx5_core_dev *mdev,
- struct mlx5_vhca_recv_buf *recv_buf)
+static int register_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
+ struct page **page_list, u32 *mkey_in)
{
- int i, j;
+ dma_addr_t addr;
+ __be64 *mtt;
+ int i;
- recv_buf->dma_addrs = kvcalloc(recv_buf->npages,
- sizeof(*recv_buf->dma_addrs),
- GFP_KERNEL_ACCOUNT);
- if (!recv_buf->dma_addrs)
- return -ENOMEM;
+ mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
- for (i = 0; i < recv_buf->npages; i++) {
- recv_buf->dma_addrs[i] = dma_map_page(mdev->device,
- recv_buf->page_list[i],
- 0, PAGE_SIZE,
- DMA_FROM_DEVICE);
- if (dma_mapping_error(mdev->device, recv_buf->dma_addrs[i]))
+ for (i = 0; i < npages; i++) {
+ addr = dma_map_page(mdev->device, page_list[i], 0, PAGE_SIZE,
+ DMA_FROM_DEVICE);
+ if (dma_mapping_error(mdev->device, addr))
goto error;
+
+ *mtt++ = cpu_to_be64(addr);
}
+
return 0;
error:
- for (j = 0; j < i; j++)
- dma_unmap_single(mdev->device, recv_buf->dma_addrs[j],
- PAGE_SIZE, DMA_FROM_DEVICE);
-
- kvfree(recv_buf->dma_addrs);
+ unregister_dma_pages(mdev, i, mkey_in);
return -ENOMEM;
}
-static void unregister_dma_recv_pages(struct mlx5_core_dev *mdev,
- struct mlx5_vhca_recv_buf *recv_buf)
-{
- int i;
-
- for (i = 0; i < recv_buf->npages; i++)
- dma_unmap_single(mdev->device, recv_buf->dma_addrs[i],
- PAGE_SIZE, DMA_FROM_DEVICE);
-
- kvfree(recv_buf->dma_addrs);
-}
-
static void mlx5vf_free_qp_recv_resources(struct mlx5_core_dev *mdev,
struct mlx5_vhca_qp *qp)
{
struct mlx5_vhca_recv_buf *recv_buf = &qp->recv_buf;
mlx5_core_destroy_mkey(mdev, recv_buf->mkey);
- unregister_dma_recv_pages(mdev, recv_buf);
+ unregister_dma_pages(mdev, recv_buf->npages, recv_buf->mkey_in);
+ kvfree(recv_buf->mkey_in);
free_recv_pages(&qp->recv_buf);
}
@@ -1445,18 +1453,28 @@ static int mlx5vf_alloc_qp_recv_resources(struct mlx5_core_dev *mdev,
if (err < 0)
return err;
- err = register_dma_recv_pages(mdev, recv_buf);
- if (err)
+ recv_buf->mkey_in = alloc_mkey_in(npages, pdn);
+ if (!recv_buf->mkey_in) {
+ err = -ENOMEM;
goto end;
+ }
+
+ err = register_dma_pages(mdev, npages, recv_buf->page_list,
+ recv_buf->mkey_in);
+ if (err)
+ goto err_register_dma;
- err = _create_mkey(mdev, pdn, NULL, recv_buf, &recv_buf->mkey);
+ err = create_mkey(mdev, npages, NULL, recv_buf->mkey_in,
+ &recv_buf->mkey);
if (err)
goto err_create_mkey;
return 0;
err_create_mkey:
- unregister_dma_recv_pages(mdev, recv_buf);
+ unregister_dma_pages(mdev, npages, recv_buf->mkey_in);
+err_register_dma:
+ kvfree(recv_buf->mkey_in);
end:
free_recv_pages(recv_buf);
return err;
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 7d4a833b6900..25dd6ff54591 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -58,8 +58,8 @@ struct mlx5_vhca_data_buffer {
u64 length;
u32 npages;
u32 mkey;
+ u32 *mkey_in;
enum dma_data_direction dma_dir;
- u8 dmaed:1;
u8 stop_copy_chunk_num;
struct list_head buf_elm;
struct mlx5_vf_migration_file *migf;
@@ -133,8 +133,8 @@ struct mlx5_vhca_cq {
struct mlx5_vhca_recv_buf {
u32 npages;
struct page **page_list;
- dma_addr_t *dma_addrs;
u32 next_rq_offset;
+ u32 *mkey_in;
u32 mkey;
};
--
2.45.2
* [RFC PATCH v1 15/18] vfio/mlx5: Explicitly store page list
2024-07-02 9:09 [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Leon Romanovsky
` (13 preceding siblings ...)
2024-07-02 9:09 ` [RFC PATCH v1 14/18] vfio/mlx5: Rewrite create mkey flow to allow better code reuse Leon Romanovsky
@ 2024-07-02 9:09 ` Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 16/18] vfio/mlx5: Convert vfio to use DMA link API Leon Romanovsky
` (4 subsequent siblings)
19 siblings, 0 replies; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-02 9:09 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni
Cc: Leon Romanovsky, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
From: Leon Romanovsky <leonro@nvidia.com>
In preparation for removing the scatter-gather table and unifying the
receive and send lists, explicitly store the page list.
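A short sketch of the idea, mirroring the hunk below: the buffer keeps
one flat page array that is grown in place with kvrealloc(), instead of
a temporary list that is folded into the scatter-gather table and then
discarded:
/* Grow the buffer's own page array; no temporary page_list needed. */
old_size = buf->npages * sizeof(*buf->page_list);
new_size = old_size + to_alloc * sizeof(*buf->page_list);
page_list = kvrealloc(buf->page_list, old_size, new_size,
		      GFP_KERNEL_ACCOUNT | __GFP_ZERO);
if (!page_list)
	return -ENOMEM;
buf->page_list = page_list;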
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/vfio/pci/mlx5/cmd.c | 33 ++++++++++++++++-----------------
drivers/vfio/pci/mlx5/cmd.h | 1 +
2 files changed, 17 insertions(+), 17 deletions(-)
diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index adf57104555a..cb23f03d58f4 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -421,6 +421,7 @@ void mlx5vf_free_data_buffer(struct mlx5_vhca_data_buffer *buf)
for_each_sgtable_page(&buf->table.sgt, &sg_iter, 0)
__free_page(sg_page_iter_page(&sg_iter));
sg_free_append_table(&buf->table);
+ kvfree(buf->page_list);
kfree(buf);
}
@@ -428,44 +429,42 @@ static int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
unsigned int npages)
{
unsigned int to_alloc = npages;
+ size_t old_size, new_size;
struct page **page_list;
unsigned long filled;
unsigned int to_fill;
int ret;
- to_fill = min_t(unsigned int, npages, PAGE_SIZE / sizeof(*page_list));
- page_list = kvzalloc(to_fill * sizeof(*page_list), GFP_KERNEL_ACCOUNT);
+ to_fill = min_t(unsigned int, npages, PAGE_SIZE / sizeof(*buf->page_list));
+ old_size = buf->npages * sizeof(*buf->page_list);
+ new_size = old_size + to_alloc * sizeof(*buf->page_list);
+ page_list = kvrealloc(buf->page_list, old_size, new_size,
+ GFP_KERNEL_ACCOUNT | __GFP_ZERO);
if (!page_list)
return -ENOMEM;
+ buf->page_list = page_list;
+
do {
filled = alloc_pages_bulk_array(GFP_KERNEL_ACCOUNT, to_fill,
- page_list);
- if (!filled) {
- ret = -ENOMEM;
- goto err;
- }
+ buf->page_list + buf->npages);
+ if (!filled)
+ return -ENOMEM;
+
to_alloc -= filled;
ret = sg_alloc_append_table_from_pages(
- &buf->table, page_list, filled, 0,
+ &buf->table, buf->page_list + buf->npages, filled, 0,
filled << PAGE_SHIFT, UINT_MAX, SG_MAX_SINGLE_ALLOC,
GFP_KERNEL_ACCOUNT);
if (ret)
- goto err;
+ return ret;
buf->npages += filled;
- /* clean input for another bulk allocation */
- memset(page_list, 0, filled * sizeof(*page_list));
to_fill = min_t(unsigned int, to_alloc,
- PAGE_SIZE / sizeof(*page_list));
+ PAGE_SIZE / sizeof(*buf->page_list));
} while (to_alloc > 0);
- kvfree(page_list);
return 0;
-
-err:
- kvfree(page_list);
- return ret;
}
struct mlx5_vhca_data_buffer *
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 25dd6ff54591..5b764199db53 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -53,6 +53,7 @@ struct mlx5_vf_migration_header {
};
struct mlx5_vhca_data_buffer {
+ struct page **page_list;
struct sg_append_table table;
loff_t start_pos;
u64 length;
--
2.45.2
* [RFC PATCH v1 16/18] vfio/mlx5: Convert vfio to use DMA link API
2024-07-02 9:09 [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Leon Romanovsky
` (14 preceding siblings ...)
2024-07-02 9:09 ` [RFC PATCH v1 15/18] vfio/mlx5: Explicitly store page list Leon Romanovsky
@ 2024-07-02 9:09 ` Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 17/18] block: export helper to get segment max size Leon Romanovsky
` (3 subsequent siblings)
19 siblings, 0 replies; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-02 9:09 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni
Cc: Leon Romanovsky, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
From: Leon Romanovsky <leonro@nvidia.com>
Remove the intermediate scatter-gather table as it is not needed when
the DMA link API is used. This conversion drastically reduces the
memory used to manage that table.
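A condensed sketch of the mapping decision introduced below (API names
from this series; error handling trimmed). When the IOVA path is
usable, pages are linked into one contiguous IOVA and a single MTT
entry covers the whole buffer; otherwise each page is mapped
individually, as before:
state.iova = iova;
state.type = type;
use_iova = dma_can_use_iova(&state, PAGE_SIZE);

if (use_iova)
	err = dma_start_range(&state);

for (i = 0; i < npages; i++) {
	if (use_iova) {
		err = dma_link_range(&state, page_to_phys(page_list[i]),
				     PAGE_SIZE);
		addr = iova->addr;	/* one contiguous device address */
	} else {
		addr = dma_map_page_attrs(iova->dev, page_list[i], 0,
					  PAGE_SIZE, iova->dir, iova->attrs);
	}
	/* In the IOVA case a single MTT entry covers the whole buffer. */
	if (i == 0 || !use_iova)
		*mtt++ = cpu_to_be64(addr);
}

if (use_iova)
	dma_end_range(&state);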
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/vfio/pci/mlx5/cmd.c | 241 ++++++++++++++++++++---------------
drivers/vfio/pci/mlx5/cmd.h | 10 +-
drivers/vfio/pci/mlx5/main.c | 33 +----
3 files changed, 143 insertions(+), 141 deletions(-)
diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index cb23f03d58f4..4520eaf78767 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -345,25 +345,106 @@ static u32 *alloc_mkey_in(u32 npages, u32 pdn)
return in;
}
-static int create_mkey(struct mlx5_core_dev *mdev, u32 npages,
- struct mlx5_vhca_data_buffer *buf, u32 *mkey_in,
+static int create_mkey(struct mlx5_core_dev *mdev, u32 npages, u32 *mkey_in,
u32 *mkey)
{
+ int inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
+ sizeof(__be64) * round_up(npages, 2);
+
+ return mlx5_core_create_mkey(mdev, mkey, mkey_in, inlen);
+}
+
+static void unregister_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
+ u32 *mkey_in, struct dma_iova_attrs *iova,
+ struct dma_memory_type *type)
+{
+ struct dma_iova_state state = {};
+ dma_addr_t addr;
__be64 *mtt;
- int inlen;
+ int i;
+
+ WARN_ON_ONCE(iova->dir == DMA_NONE);
+
+ state.iova = iova;
+ state.type = type;
+ state.range_size = PAGE_SIZE * npages;
+
+ if (dma_can_use_iova(&state, PAGE_SIZE)) {
+ dma_unlink_range(&state);
+ } else {
+ mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in,
+ klm_pas_mtt);
+ for (i = npages - 1; i >= 0; i--) {
+ addr = be64_to_cpu(mtt[i]);
+ dma_unmap_page_attrs(iova->dev, addr, PAGE_SIZE,
+ iova->dir, iova->attrs);
+ }
+ }
+ dma_free_iova(iova);
+}
+
+static int register_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
+ struct page **page_list, u32 *mkey_in,
+ struct dma_iova_attrs *iova,
+ struct dma_memory_type *type)
+{
+ struct dma_iova_state state = {};
+ dma_addr_t addr;
+ bool use_iova;
+ __be64 *mtt;
+ int i, err;
+
+ WARN_ON_ONCE(iova->dir == DMA_NONE);
+
+ iova->dev = mdev->device;
+ iova->size = npages * PAGE_SIZE;
+ err = dma_alloc_iova(iova);
+ if (err)
+ return err;
+
+ /*
+ * All VFIO pages are of the same type, and it is enough
+ * to check one page only
+ */
+ dma_get_memory_type(page_list[0], type);
+ state.iova = iova;
+ state.type = type;
+
+ use_iova = dma_can_use_iova(&state, PAGE_SIZE);
mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
- if (buf) {
- struct sg_dma_page_iter dma_iter;
+ if (use_iova)
+ err = dma_start_range(&state);
+ if (err) {
+ dma_free_iova(iova);
+ return err;
+ }
+ for (i = 0; i < npages; i++) {
+ if (use_iova) {
+ err = dma_link_range(&state, page_to_phys(page_list[i]),
+ PAGE_SIZE);
+ addr = iova->addr;
+ } else {
+ addr = dma_map_page_attrs(iova->dev, page_list[i], 0,
+ PAGE_SIZE, iova->dir,
+ iova->attrs);
+ err = dma_mapping_error(mdev->device, addr);
+ }
+ if (err)
+ goto error;
- for_each_sgtable_dma_page(&buf->table.sgt, &dma_iter, 0)
- *mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter));
+ /* In IOVA case, we can use one MTT entry for whole buffer */
+ if (i == 0 || !use_iova)
+ *mtt++ = cpu_to_be64(addr);
}
+ if (use_iova)
+ dma_end_range(&state);
- inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
- sizeof(__be64) * round_up(npages, 2);
+ return 0;
- return mlx5_core_create_mkey(mdev, mkey, mkey_in, inlen);
+error:
+ unregister_dma_pages(mdev, i, mkey_in, iova, type);
+ return err;
}
static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
@@ -379,49 +460,56 @@ static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
if (buf->mkey_in || !buf->npages)
return -EINVAL;
- ret = dma_map_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
- if (ret)
- return ret;
-
buf->mkey_in = alloc_mkey_in(buf->npages, buf->migf->pdn);
- if (!buf->mkey_in) {
- ret = -ENOMEM;
- goto err;
- }
+ if (!buf->mkey_in)
+ return -ENOMEM;
- ret = create_mkey(mdev, buf->npages, buf, buf->mkey_in, &buf->mkey);
+ ret = register_dma_pages(mdev, buf->npages, buf->page_list,
+ buf->mkey_in, &buf->iova, &buf->type);
+ if (ret)
+ goto err_register_dma;
+
+ ret = create_mkey(mdev, buf->npages, buf->mkey_in, &buf->mkey);
if (ret)
goto err_create_mkey;
return 0;
err_create_mkey:
+ unregister_dma_pages(mdev, buf->npages, buf->mkey_in, &buf->iova,
+ &buf->type);
+err_register_dma:
kvfree(buf->mkey_in);
-err:
- dma_unmap_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
return ret;
}
+static void free_page_list(u32 npages, struct page **page_list)
+{
+ int i;
+
+ /* Undo alloc_pages_bulk_array() */
+ for (i = npages - 1; i >= 0; i--)
+ __free_page(page_list[i]);
+
+ kvfree(page_list);
+}
+
void mlx5vf_free_data_buffer(struct mlx5_vhca_data_buffer *buf)
{
- struct mlx5_vf_migration_file *migf = buf->migf;
- struct sg_page_iter sg_iter;
+ struct mlx5vf_pci_core_device *mvdev = buf->migf->mvdev;
+ struct mlx5_core_dev *mdev = mvdev->mdev;
- lockdep_assert_held(&migf->mvdev->state_mutex);
- WARN_ON(migf->mvdev->mdev_detach);
+ lockdep_assert_held(&mvdev->state_mutex);
+ WARN_ON(mvdev->mdev_detach);
if (buf->mkey_in) {
- mlx5_core_destroy_mkey(migf->mvdev->mdev, buf->mkey);
+ mlx5_core_destroy_mkey(mdev, buf->mkey);
+ unregister_dma_pages(mdev, buf->npages, buf->mkey_in,
+ &buf->iova, &buf->type);
kvfree(buf->mkey_in);
- dma_unmap_sgtable(migf->mvdev->mdev->device, &buf->table.sgt,
- buf->dma_dir, 0);
}
- /* Undo alloc_pages_bulk_array() */
- for_each_sgtable_page(&buf->table.sgt, &sg_iter, 0)
- __free_page(sg_page_iter_page(&sg_iter));
- sg_free_append_table(&buf->table);
- kvfree(buf->page_list);
+ free_page_list(buf->npages, buf->page_list);
kfree(buf);
}
@@ -432,10 +520,7 @@ static int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
size_t old_size, new_size;
struct page **page_list;
unsigned long filled;
- unsigned int to_fill;
- int ret;
- to_fill = min_t(unsigned int, npages, PAGE_SIZE / sizeof(*buf->page_list));
old_size = buf->npages * sizeof(*buf->page_list);
new_size = old_size + to_alloc * sizeof(*buf->page_list);
page_list = kvrealloc(buf->page_list, old_size, new_size,
@@ -446,22 +531,13 @@ static int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
buf->page_list = page_list;
do {
- filled = alloc_pages_bulk_array(GFP_KERNEL_ACCOUNT, to_fill,
- buf->page_list + buf->npages);
+ filled = alloc_pages_bulk_array(GFP_KERNEL_ACCOUNT, to_alloc,
+ buf->page_list + buf->npages);
if (!filled)
return -ENOMEM;
to_alloc -= filled;
- ret = sg_alloc_append_table_from_pages(
- &buf->table, buf->page_list + buf->npages, filled, 0,
- filled << PAGE_SHIFT, UINT_MAX, SG_MAX_SINGLE_ALLOC,
- GFP_KERNEL_ACCOUNT);
-
- if (ret)
- return ret;
buf->npages += filled;
- to_fill = min_t(unsigned int, to_alloc,
- PAGE_SIZE / sizeof(*buf->page_list));
} while (to_alloc > 0);
return 0;
@@ -478,7 +554,7 @@ mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
if (!buf)
return ERR_PTR(-ENOMEM);
- buf->dma_dir = dma_dir;
+ buf->iova.dir = dma_dir;
buf->migf = migf;
if (npages) {
ret = mlx5vf_add_migration_pages(buf, npages);
@@ -521,7 +597,7 @@ mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
spin_lock_irq(&migf->list_lock);
list_for_each_entry_safe(buf, temp_buf, &migf->avail_list, buf_elm) {
- if (buf->dma_dir == dma_dir) {
+ if (buf->iova.dir == dma_dir) {
list_del_init(&buf->buf_elm);
if (buf->npages >= npages) {
spin_unlock_irq(&migf->list_lock);
@@ -1343,17 +1419,6 @@ static void mlx5vf_destroy_qp(struct mlx5_core_dev *mdev,
kfree(qp);
}
-static void free_recv_pages(struct mlx5_vhca_recv_buf *recv_buf)
-{
- int i;
-
- /* Undo alloc_pages_bulk_array() */
- for (i = 0; i < recv_buf->npages; i++)
- __free_page(recv_buf->page_list[i]);
-
- kvfree(recv_buf->page_list);
-}
-
static int alloc_recv_pages(struct mlx5_vhca_recv_buf *recv_buf,
unsigned int npages)
{
@@ -1389,45 +1454,6 @@ static int alloc_recv_pages(struct mlx5_vhca_recv_buf *recv_buf,
kvfree(recv_buf->page_list);
return -ENOMEM;
}
-static void unregister_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
- u32 *mkey_in)
-{
- dma_addr_t addr;
- __be64 *mtt;
- int i;
-
- mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
- for (i = npages - 1; i >= 0; i--) {
- addr = be64_to_cpu(mtt[i]);
- dma_unmap_single(mdev->device, addr, PAGE_SIZE,
- DMA_FROM_DEVICE);
- }
-}
-
-static int register_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
- struct page **page_list, u32 *mkey_in)
-{
- dma_addr_t addr;
- __be64 *mtt;
- int i;
-
- mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
-
- for (i = 0; i < npages; i++) {
- addr = dma_map_page(mdev->device, page_list[i], 0, PAGE_SIZE,
- DMA_FROM_DEVICE);
- if (dma_mapping_error(mdev->device, addr))
- goto error;
-
- *mtt++ = cpu_to_be64(addr);
- }
-
- return 0;
-
-error:
- unregister_dma_pages(mdev, i, mkey_in);
- return -ENOMEM;
-}
static void mlx5vf_free_qp_recv_resources(struct mlx5_core_dev *mdev,
struct mlx5_vhca_qp *qp)
@@ -1435,9 +1461,10 @@ static void mlx5vf_free_qp_recv_resources(struct mlx5_core_dev *mdev,
struct mlx5_vhca_recv_buf *recv_buf = &qp->recv_buf;
mlx5_core_destroy_mkey(mdev, recv_buf->mkey);
- unregister_dma_pages(mdev, recv_buf->npages, recv_buf->mkey_in);
+ unregister_dma_pages(mdev, recv_buf->npages, recv_buf->mkey_in,
+ &recv_buf->iova, &recv_buf->type);
kvfree(recv_buf->mkey_in);
- free_recv_pages(&qp->recv_buf);
+ free_page_list(recv_buf->npages, recv_buf->page_list);
}
static int mlx5vf_alloc_qp_recv_resources(struct mlx5_core_dev *mdev,
@@ -1458,24 +1485,26 @@ static int mlx5vf_alloc_qp_recv_resources(struct mlx5_core_dev *mdev,
goto end;
}
+ recv_buf->iova.dir = DMA_FROM_DEVICE;
err = register_dma_pages(mdev, npages, recv_buf->page_list,
- recv_buf->mkey_in);
+ recv_buf->mkey_in, &recv_buf->iova,
+ &recv_buf->type);
if (err)
goto err_register_dma;
- err = create_mkey(mdev, npages, NULL, recv_buf->mkey_in,
- &recv_buf->mkey);
+ err = create_mkey(mdev, npages, recv_buf->mkey_in, &recv_buf->mkey);
if (err)
goto err_create_mkey;
return 0;
err_create_mkey:
- unregister_dma_pages(mdev, npages, recv_buf->mkey_in);
+ unregister_dma_pages(mdev, npages, recv_buf->mkey_in, &recv_buf->iova,
+ &recv_buf->type);
err_register_dma:
kvfree(recv_buf->mkey_in);
end:
- free_recv_pages(recv_buf);
+ free_page_list(npages, recv_buf->page_list);
return err;
}
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 5b764199db53..1b2552c238d8 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -53,21 +53,17 @@ struct mlx5_vf_migration_header {
};
struct mlx5_vhca_data_buffer {
+ struct dma_iova_attrs iova;
struct page **page_list;
- struct sg_append_table table;
+ struct dma_memory_type type;
loff_t start_pos;
u64 length;
u32 npages;
u32 mkey;
u32 *mkey_in;
- enum dma_data_direction dma_dir;
u8 stop_copy_chunk_num;
struct list_head buf_elm;
struct mlx5_vf_migration_file *migf;
- /* Optimize mlx5vf_get_migration_page() for sequential access */
- struct scatterlist *last_offset_sg;
- unsigned int sg_last_entry;
- unsigned long last_offset;
};
struct mlx5vf_async_data {
@@ -132,8 +128,10 @@ struct mlx5_vhca_cq {
};
struct mlx5_vhca_recv_buf {
+ struct dma_iova_attrs iova;
u32 npages;
struct page **page_list;
+ struct dma_memory_type type;
u32 next_rq_offset;
u32 *mkey_in;
u32 mkey;
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index 0925cd7d2f17..ddadf8ccae87 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -34,35 +34,10 @@ static struct mlx5vf_pci_core_device *mlx5vf_drvdata(struct pci_dev *pdev)
core_device);
}
-struct page *
-mlx5vf_get_migration_page(struct mlx5_vhca_data_buffer *buf,
- unsigned long offset)
+struct page *mlx5vf_get_migration_page(struct mlx5_vhca_data_buffer *buf,
+ unsigned long offset)
{
- unsigned long cur_offset = 0;
- struct scatterlist *sg;
- unsigned int i;
-
- /* All accesses are sequential */
- if (offset < buf->last_offset || !buf->last_offset_sg) {
- buf->last_offset = 0;
- buf->last_offset_sg = buf->table.sgt.sgl;
- buf->sg_last_entry = 0;
- }
-
- cur_offset = buf->last_offset;
-
- for_each_sg(buf->last_offset_sg, sg,
- buf->table.sgt.orig_nents - buf->sg_last_entry, i) {
- if (offset < sg->length + cur_offset) {
- buf->last_offset_sg = sg;
- buf->sg_last_entry += i;
- buf->last_offset = cur_offset;
- return nth_page(sg_page(sg),
- (offset - cur_offset) / PAGE_SIZE);
- }
- cur_offset += sg->length;
- }
- return NULL;
+ return buf->page_list[offset / PAGE_SIZE];
}
static void mlx5vf_disable_fd(struct mlx5_vf_migration_file *migf)
@@ -121,7 +96,7 @@ static void mlx5vf_buf_read_done(struct mlx5_vhca_data_buffer *vhca_buf)
struct mlx5_vf_migration_file *migf = vhca_buf->migf;
if (vhca_buf->stop_copy_chunk_num) {
- bool is_header = vhca_buf->dma_dir == DMA_NONE;
+ bool is_header = vhca_buf->iova.dir == DMA_NONE;
u8 chunk_num = vhca_buf->stop_copy_chunk_num;
size_t next_required_umem_size = 0;
--
2.45.2
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [RFC PATCH v1 17/18] block: export helper to get segment max size
2024-07-02 9:09 [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Leon Romanovsky
` (15 preceding siblings ...)
2024-07-02 9:09 ` [RFC PATCH v1 16/18] vfio/mlx5: Convert vfio to use DMA link API Leon Romanovsky
@ 2024-07-02 9:09 ` Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 18/18] nvme-pci: use new dma API Leon Romanovsky
` (2 subsequent siblings)
19 siblings, 0 replies; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-02 9:09 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni
Cc: Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, linux-block, linux-kernel,
linux-rdma, iommu, linux-nvme, linux-pci, kvm, linux-mm
From: Chaitanya Kulkarni <kch@nvidia.com>
Export get_max_segment_size() so drivers can use it to create DMA
mappings when they receive a request.
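For illustration, a driver walking bio_vecs could clamp each fragment with
the exported helper roughly like this (a sketch mirroring the nvme usage
later in this series; the helper name here is hypothetical):
	/* Sketch: limit one fragment to what the queue's segment limits
	 * (boundary mask and max_segment_size) allow from this offset. */
	static unsigned int nvme_seg_len(struct request *req, struct bio_vec *bv)
	{
		return min(get_max_segment_size(&req->q->limits, bv->bv_page,
						bv->bv_offset),
			   bv->bv_len);
	}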
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
block/blk-merge.c | 3 ++-
include/linux/blk-mq.h | 3 +++
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 8534c35e0497..0561e728ef95 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -190,7 +190,7 @@ static inline unsigned get_max_io_size(struct bio *bio,
*
* Returns the maximum number of bytes that can be added as a single segment.
*/
-static inline unsigned get_max_segment_size(const struct queue_limits *lim,
+inline unsigned get_max_segment_size(const struct queue_limits *lim,
struct page *start_page, unsigned long offset)
{
unsigned long mask = lim->seg_boundary_mask;
@@ -203,6 +203,7 @@ static inline unsigned get_max_segment_size(const struct queue_limits *lim,
*/
return min(mask - offset, (unsigned long)lim->max_segment_size - 1) + 1;
}
+EXPORT_SYMBOL_GPL(get_max_segment_size);
/**
* bvec_split_segs - verify whether or not a bvec should be split in the middle
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 89ba6b16fe8b..008c77c9b518 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1150,4 +1150,7 @@ static inline int blk_rq_map_sg(struct request_queue *q, struct request *rq,
}
void blk_dump_rq_flags(struct request *, char *);
+unsigned get_max_segment_size(const struct queue_limits *lim,
+ struct page *start_page, unsigned long offset);
+
#endif /* BLK_MQ_H */
--
2.45.2
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [RFC PATCH v1 18/18] nvme-pci: use new dma API
2024-07-02 9:09 [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Leon Romanovsky
` (16 preceding siblings ...)
2024-07-02 9:09 ` [RFC PATCH v1 17/18] block: export helper to get segment max size Leon Romanovsky
@ 2024-07-02 9:09 ` Leon Romanovsky
2024-07-04 15:23 ` Robin Murphy
2024-07-03 5:42 ` [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Christoph Hellwig
2024-07-05 6:39 ` Christoph Hellwig
19 siblings, 1 reply; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-02 9:09 UTC (permalink / raw)
To: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni
Cc: Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, linux-block, linux-kernel,
linux-rdma, iommu, linux-nvme, linux-pci, kvm, linux-mm
From: Chaitanya Kulkarni <kch@nvidia.com>
Introduce a new structure, iod_dma_map, to hold the DMA mapping for each
I/O. This includes the iova state and mapped addresses from
dma_link_range() or dma_map_page_attrs(). Replace the existing sg_table
in nvme_iod with struct dma_map. The size difference between the two is:
struct nvme_iod with struct sg_table :- 184
struct nvme_iod with struct dma_map :- 176
In nvme_map_data(), allocate dma_map from mempool and iova using
dma_alloc_iova(). Obtain the memory type from the first bvec of the
first bio of the request and use that to decide whether we want to use
iova or not. In the newly added function nvme_rq_dma_map(), perform DMA
mapping for the bvec pages using nvme_dma_link_page(). Additionally,
if an NVMe SGL is used, build the SGL entries inline while creating this
mapping to avoid an extra traversal.
Call nvme_rq_dma_map() from nvme_pci_setup_prps() and
nvme_pci_setup_sgls(). For the NVMe SGL case, nvme_rq_dma_map() handles
building the SGL inline. To build PRPs, use iod->dma_map->dma_link_address
in nvme_pci_setup_prps() and increment the counter appropriately to
retrieve the next set of DMA addresses.
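For orientation, here is a condensed sketch of the per-request mapping flow
introduced by the patch below; error/cleanup handling and the state/type
wiring are trimmed, and all names are taken from the diff:
static blk_status_t nvme_map_data_sketch(struct nvme_dev *dev,
					 struct request *req,
					 struct nvme_rw_command *cmnd)
{
	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);

	iod->dma_map = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
	if (!iod->dma_map)
		return BLK_STS_RESOURCE;

	/* Step 1: reserve an IOVA window sized for the whole request. */
	iod->dma_map->iova.dev = dev->dev;
	iod->dma_map->iova.dir = rq_dma_dir(req);
	iod->dma_map->iova.size = blk_rq_payload_bytes(req);
	if (dma_alloc_iova(&iod->dma_map->iova))
		return BLK_STS_RESOURCE;

	/* Step 2: decide between the IOVA link path and per-page mapping. */
	dma_get_memory_type(req->bio->bi_io_vec[0].bv_page,
			    &iod->dma_map->type);
	iod->dma_map->use_iova =
		dma_can_use_iova(&iod->dma_map->state,
				 req->bio->bi_io_vec[0].bv_len);

	/*
	 * Step 3: nvme_rq_dma_map() walks the bvecs and either calls
	 * dma_link_range() into the reserved IOVA or falls back to
	 * dma_map_page_attrs() per fragment, recording addr/len pairs
	 * that are then used to build the PRP list or SGL.
	 */
	if (nvme_pci_use_sgls(dev, req, blk_rq_nr_phys_segments(req)))
		return nvme_pci_setup_sgls(dev, req, cmnd);
	return nvme_pci_setup_prps(dev, req, cmnd);
}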
This demonstrates how the new DMA API can fit into the NVMe driver and
replace the old DMA APIs.
As this is an RFC, I expect more robust error handling, optimizations,
and in-depth testing for the final version once we agree on DMA API
architecture.
Following is the performance comparison for the existing DMA API case
with sg_table and with dma_map; once we have agreement on the new DMA
API design I intend to get similar profiling numbers for the new DMA API.
sgl (sg_table + old dma API ) vs no_sgl (iod_dma_map + new DMA API) :-
block size IOPS (k) Average of 3
4K
--------------------------------------------------------------
sg-list-fio-perf.bs-4k-1.fio: 68.6
sg-list-fio-perf.bs-4k-2.fio: 68 68.36
sg-list-fio-perf.bs-4k-3.fio: 68.5
no-sg-list-fio-perf.bs-4k-1.fio: 68.7
no-sg-list-fio-perf.bs-4k-2.fio: 68.5 68.43
no-sg-list-fio-perf.bs-4k-3.fio: 68.1
% Change default vs new DMA API = +0.0975%
8K
--------------------------------------------------------------
sg-list-fio-perf.bs-8k-1.fio: 67
sg-list-fio-perf.bs-8k-2.fio: 67.1 67.03
sg-list-fio-perf.bs-8k-3.fio: 67
no-sg-list-fio-perf.bs-8k-1.fio: 66.7
no-sg-list-fio-perf.bs-8k-2.fio: 66.7 66.7
no-sg-list-fio-perf.bs-8k-3.fio: 66.7
% Change default vs new DMA API = +0.4993%
16K
--------------------------------------------------------------
sg-list-fio-perf.bs-16k-1.fio: 63.8
sg-list-fio-perf.bs-16k-2.fio: 63.4 63.5
sg-list-fio-perf.bs-16k-3.fio: 63.3
no-sg-list-fio-perf.bs-16k-1.fio: 63.5
no-sg-list-fio-perf.bs-16k-2.fio: 63.4 63.33
no-sg-list-fio-perf.bs-16k-3.fio: 63.1
% Change default vs new DMA API = -0.2632%
32K
--------------------------------------------------------------
sg-list-fio-perf.bs-32k-1.fio: 59.3
sg-list-fio-perf.bs-32k-2.fio: 59.3 59.36
sg-list-fio-perf.bs-32k-3.fio: 59.5
no-sg-list-fio-perf.bs-32k-1.fio: 59.5
no-sg-list-fio-perf.bs-32k-2.fio: 59.6 59.43
no-sg-list-fio-perf.bs-32k-3.fio: 59.2
% Change default vs new DMA API = +0.1122%
64K
--------------------------------------------------------------
sg-list-fio-perf.bs-64k-1.fio: 53.7
sg-list-fio-perf.bs-64k-2.fio: 53.4 53.56
sg-list-fio-perf.bs-64k-3.fio: 53.6
no-sg-list-fio-perf.bs-64k-1.fio: 53.5
no-sg-list-fio-perf.bs-64k-2.fio: 53.8 53.63
no-sg-list-fio-perf.bs-64k-3.fio: 53.6
% Change default vs new DMA API = +0.1246%
128K
--------------------------------------------------------------
sg-list-fio-perf/bs-128k-1.fio: 48
sg-list-fio-perf/bs-128k-2.fio: 46.4 47.13
sg-list-fio-perf/bs-128k-3.fio: 47
no-sg-list-fio-perf/bs-128k-1.fio: 46.6
no-sg-list-fio-perf/bs-128k-2.fio: 47 46.9
no-sg-list-fio-perf/bs-128k-3.fio: 47.1
% Change default vs new DMA API = -0.495%
256K
--------------------------------------------------------------
sg-list-fio-perf/bs-256k-1.fio: 37
sg-list-fio-perf/bs-256k-2.fio: 41 39.93
sg-list-fio-perf/bs-256k-3.fio: 41.8
no-sg-list-fio-perf/bs-256k-1.fio: 37.5
no-sg-list-fio-perf/bs-256k-2.fio: 41.4 40.5
no-sg-list-fio-perf/bs-256k-3.fio: 42.6
% Change default vs new DMA API = +1.42%
512K
--------------------------------------------------------------
sg-list-fio-perf/bs-512k-1.fio: 28.5
sg-list-fio-perf/bs-512k-2.fio: 28.2 28.4
sg-list-fio-perf/bs-512k-3.fio: 28.5
no-sg-list-fio-perf/bs-512k-1.fio: 28.7
no-sg-list-fio-perf/bs-512k-2.fio: 28.6 28.7
no-sg-list-fio-perf/bs-512k-3.fio: 28.8
% Change default vs new DMA API = +1.06%
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/nvme/host/pci.c | 283 ++++++++++++++++++++++++++++++----------
1 file changed, 213 insertions(+), 70 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 102a9fb0c65f..53a71b03c794 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -221,6 +221,16 @@ union nvme_descriptor {
__le64 *prp_list;
};
+struct iod_dma_map {
+ bool use_iova;
+ struct dma_iova_state state;
+ struct dma_memory_type type;
+ struct dma_iova_attrs iova;
+ dma_addr_t dma_link_address[NVME_MAX_SEGS];
+ u32 len[NVME_MAX_SEGS];
+ u16 nr_dma_link_address;
+};
+
/*
* The nvme_iod describes the data in an I/O.
*
@@ -236,7 +246,7 @@ struct nvme_iod {
unsigned int dma_len; /* length of single DMA segment mapping */
dma_addr_t first_dma;
dma_addr_t meta_dma;
- struct sg_table sgt;
+ struct iod_dma_map *dma_map;
union nvme_descriptor list[NVME_MAX_NR_ALLOCATIONS];
};
@@ -521,6 +531,26 @@ static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
return true;
}
+static inline void nvme_dma_unlink_range(struct nvme_iod *iod)
+{
+ struct dma_iova_attrs *iova = &iod->dma_map->iova;
+ dma_addr_t addr;
+ u16 len;
+ u32 i;
+
+ if (iod->dma_map->use_iova) {
+ dma_unlink_range(&iod->dma_map->state);
+ return;
+ }
+
+ for (i = 0; i < iod->dma_map->nr_dma_link_address; i++) {
+ addr = iod->dma_map->dma_link_address[i];
+ len = iod->dma_map->len[i];
+ dma_unmap_page_attrs(iova->dev, addr, len,
+ iova->dir, iova->attrs);
+ }
+}
+
static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
{
const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
@@ -547,9 +577,7 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
return;
}
- WARN_ON_ONCE(!iod->sgt.nents);
-
- dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
+ nvme_dma_unlink_range(iod);
if (iod->nr_allocations == 0)
dma_pool_free(dev->prp_small_pool, iod->list[0].sg_list,
@@ -559,21 +587,123 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
iod->first_dma);
else
nvme_free_prps(dev, req);
- mempool_free(iod->sgt.sgl, dev->iod_mempool);
+
+ dma_free_iova(&iod->dma_map->iova);
+ mempool_free(iod->dma_map, dev->iod_mempool);
}
-static void nvme_print_sgl(struct scatterlist *sgl, int nents)
+static inline dma_addr_t nvme_dma_link_page(struct page *page,
+ unsigned int poffset,
+ unsigned int len,
+ struct nvme_iod *iod)
{
- int i;
- struct scatterlist *sg;
+ struct dma_iova_attrs *iova = &iod->dma_map->iova;
+ struct dma_iova_state *state = &iod->dma_map->state;
+ dma_addr_t dma_addr;
+ int ret;
+
+ if (iod->dma_map->use_iova) {
+ phys_addr_t phys = page_to_phys(page) + poffset;
+
+ dma_addr = state->iova->addr + state->range_size;
+ ret = dma_link_range(&iod->dma_map->state, phys, len);
+ if (ret)
+ return DMA_MAPPING_ERROR;
+ } else {
+ dma_addr = dma_map_page_attrs(iova->dev, page, poffset, len,
+ iova->dir, iova->attrs);
+ }
+ return dma_addr;
+}
+
+static void nvme_pci_sgl_set_data(struct nvme_sgl_desc *sge,
+ dma_addr_t dma_addr,
+ unsigned int dma_len);
+
+static int __nvme_rq_dma_map(struct request *req, struct nvme_iod *iod,
+ struct nvme_sgl_desc *sgl_list)
+{
+ struct dma_iova_attrs *iova = &iod->dma_map->iova;
+ struct req_iterator iter;
+ struct bio_vec bv;
+ int cnt = 0;
+ dma_addr_t addr;
+
+ iod->dma_map->nr_dma_link_address = 0;
+ rq_for_each_bvec(bv, req, iter) {
+ unsigned nbytes = bv.bv_len;
+ unsigned total = 0;
+ unsigned offset, len;
+
+ if (bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
+ addr = nvme_dma_link_page(bv.bv_page, bv.bv_offset,
+ bv.bv_len, iod);
+ if (dma_mapping_error(iova->dev, addr)) {
+ pr_err("dma_mapping_error %d\n",
+ dma_mapping_error(iova->dev, addr));
+ return -ENOMEM;
+ }
+
+ iod->dma_map->dma_link_address[cnt] = addr;
+ iod->dma_map->len[cnt] = bv.bv_len;
+ iod->dma_map->nr_dma_link_address++;
+
+ if (sgl_list)
+ nvme_pci_sgl_set_data(&sgl_list[cnt], addr,
+ bv.bv_len);
+ cnt++;
+ continue;
+ }
+ while (nbytes > 0) {
+ struct page *page = bv.bv_page;
+
+ offset = bv.bv_offset + total;
+ len = min(get_max_segment_size(&req->q->limits, page,
+ offset), nbytes);
+
+ page += (offset >> PAGE_SHIFT);
+ offset &= ~PAGE_MASK;
+
+ addr = nvme_dma_link_page(page, offset, len, iod);
+ if (dma_mapping_error(iova->dev, addr)) {
+ pr_err("dma_mapping_error2 %d\n",
+ dma_mapping_error(iova->dev, addr));
+ return -ENOMEM;
+ }
+
+ iod->dma_map->dma_link_address[cnt] = addr;
+ iod->dma_map->len[cnt] = len;
+ iod->dma_map->nr_dma_link_address++;
- for_each_sg(sgl, sg, nents, i) {
- dma_addr_t phys = sg_phys(sg);
- pr_warn("sg[%d] phys_addr:%pad offset:%d length:%d "
- "dma_address:%pad dma_length:%d\n",
- i, &phys, sg->offset, sg->length, &sg_dma_address(sg),
- sg_dma_len(sg));
+ if (sgl_list)
+ nvme_pci_sgl_set_data(&sgl_list[cnt], addr, len);
+
+ total += len;
+ nbytes -= len;
+ cnt++;
+ }
+ }
+ return cnt;
+}
+
+static int nvme_rq_dma_map(struct request *req, struct nvme_iod *iod,
+ struct nvme_sgl_desc *sgl_list)
+{
+ int ret;
+
+ if (iod->dma_map->use_iova) {
+ ret = dma_start_range(&iod->dma_map->state);
+ if (ret) {
+ pr_err("dma_start_dange_failed %d", ret);
+ return ret;
+ }
+
+ ret = __nvme_rq_dma_map(req, iod, sgl_list);
+ dma_end_range(&iod->dma_map->state);
+ return ret;
}
+
+ return __nvme_rq_dma_map(req, iod, sgl_list);
}
static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
@@ -582,13 +712,23 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct dma_pool *pool;
int length = blk_rq_payload_bytes(req);
- struct scatterlist *sg = iod->sgt.sgl;
- int dma_len = sg_dma_len(sg);
- u64 dma_addr = sg_dma_address(sg);
- int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
+ u16 dma_addr_cnt = 0;
+ int dma_len;
+ u64 dma_addr;
+ int offset;
__le64 *prp_list;
dma_addr_t prp_dma;
int nprps, i;
+ int ret;
+
+ ret = nvme_rq_dma_map(req, iod, NULL);
+ if (ret < 0)
+ return errno_to_blk_status(ret);
+
+ dma_len = iod->dma_map->len[dma_addr_cnt];
+ dma_addr = iod->dma_map->dma_link_address[dma_addr_cnt];
+ offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
+ dma_addr_cnt++;
length -= (NVME_CTRL_PAGE_SIZE - offset);
if (length <= 0) {
@@ -600,9 +740,9 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
if (dma_len) {
dma_addr += (NVME_CTRL_PAGE_SIZE - offset);
} else {
- sg = sg_next(sg);
- dma_addr = sg_dma_address(sg);
- dma_len = sg_dma_len(sg);
+ dma_addr = iod->dma_map->dma_link_address[dma_addr_cnt];
+ dma_len = iod->dma_map->len[dma_addr_cnt];
+ dma_addr_cnt++;
}
if (length <= NVME_CTRL_PAGE_SIZE) {
@@ -646,31 +786,29 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
break;
if (dma_len > 0)
continue;
- if (unlikely(dma_len < 0))
- goto bad_sgl;
- sg = sg_next(sg);
- dma_addr = sg_dma_address(sg);
- dma_len = sg_dma_len(sg);
+ if (dma_addr_cnt >= iod->dma_map->nr_dma_link_address)
+ pr_err_ratelimited("dma_addr_cnt exceeded %u and %u\n",
+ dma_addr_cnt,
+ iod->dma_map->nr_dma_link_address);
+ dma_addr = iod->dma_map->dma_link_address[dma_addr_cnt];
+ dma_len = iod->dma_map->len[dma_addr_cnt];
+ dma_addr_cnt++;
}
done:
- cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
+ cmnd->dptr.prp1 = cpu_to_le64(iod->dma_map->dma_link_address[0]);
cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
+
return BLK_STS_OK;
free_prps:
nvme_free_prps(dev, req);
return BLK_STS_RESOURCE;
-bad_sgl:
- WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
- "Invalid SGL for payload:%d nents:%d\n",
- blk_rq_payload_bytes(req), iod->sgt.nents);
- return BLK_STS_IOERR;
}
static void nvme_pci_sgl_set_data(struct nvme_sgl_desc *sge,
- struct scatterlist *sg)
+ dma_addr_t dma_addr, unsigned int dma_len)
{
- sge->addr = cpu_to_le64(sg_dma_address(sg));
- sge->length = cpu_to_le32(sg_dma_len(sg));
+ sge->addr = cpu_to_le64(dma_addr);
+ sge->length = cpu_to_le32(dma_len);
sge->type = NVME_SGL_FMT_DATA_DESC << 4;
}
@@ -685,22 +823,16 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc *sge,
static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
struct request *req, struct nvme_rw_command *cmd)
{
+ unsigned int entries = blk_rq_nr_phys_segments(req);
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
- struct dma_pool *pool;
struct nvme_sgl_desc *sg_list;
- struct scatterlist *sg = iod->sgt.sgl;
- unsigned int entries = iod->sgt.nents;
+ struct dma_pool *pool;
dma_addr_t sgl_dma;
- int i = 0;
+ int ret;
/* setting the transfer type as SGL */
cmd->flags = NVME_CMD_SGL_METABUF;
- if (entries == 1) {
- nvme_pci_sgl_set_data(&cmd->dptr.sgl, sg);
- return BLK_STS_OK;
- }
-
if (entries <= (256 / sizeof(struct nvme_sgl_desc))) {
pool = dev->prp_small_pool;
iod->nr_allocations = 0;
@@ -718,12 +850,11 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
iod->list[0].sg_list = sg_list;
iod->first_dma = sgl_dma;
- nvme_pci_sgl_set_seg(&cmd->dptr.sgl, sgl_dma, entries);
- do {
- nvme_pci_sgl_set_data(&sg_list[i++], sg);
- sg = sg_next(sg);
- } while (--entries > 0);
+ ret = nvme_rq_dma_map(req, iod, sg_list);
+ if (ret < 0)
+ return errno_to_blk_status(ret);
+ nvme_pci_sgl_set_seg(&cmd->dptr.sgl, sgl_dma, ret);
return BLK_STS_OK;
}
@@ -791,34 +922,47 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
}
iod->dma_len = 0;
- iod->sgt.sgl = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
- if (!iod->sgt.sgl)
+ iod->dma_map = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
+ if (!iod->dma_map)
return BLK_STS_RESOURCE;
- sg_init_table(iod->sgt.sgl, blk_rq_nr_phys_segments(req));
- iod->sgt.orig_nents = blk_rq_map_sg(req->q, req, iod->sgt.sgl);
- if (!iod->sgt.orig_nents)
- goto out_free_sg;
- rc = dma_map_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req),
- DMA_ATTR_NO_WARN);
- if (rc) {
- if (rc == -EREMOTEIO)
- ret = BLK_STS_TARGET;
- goto out_free_sg;
- }
+ iod->dma_map->state.range_size = 0;
+ iod->dma_map->iova.dev = dev->dev;
+ iod->dma_map->iova.dir = rq_dma_dir(req);
+ iod->dma_map->iova.attrs = DMA_ATTR_NO_WARN;
+ iod->dma_map->iova.size = blk_rq_payload_bytes(req);
+ if (!iod->dma_map->iova.size)
+ goto free_iod_map;
+
+ rc = dma_alloc_iova(&iod->dma_map->iova);
+ if (rc)
+ goto free_iod_map;
+
+ /*
+ * Following call assumes that all the biovecs belongs to this request
+ * are of the same type.
+ */
+ dma_get_memory_type(req->bio->bi_io_vec[0].bv_page,
+ &iod->dma_map->type);
+ iod->dma_map->state.iova = &iod->dma_map->iova;
+ iod->dma_map->state.type = &iod->dma_map->type;
+
+ iod->dma_map->use_iova =
+ dma_can_use_iova(&iod->dma_map->state,
+ req->bio->bi_io_vec[0].bv_len);
- if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
+ if (nvme_pci_use_sgls(dev, req, blk_rq_nr_phys_segments(req)))
ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
else
ret = nvme_pci_setup_prps(dev, req, &cmnd->rw);
if (ret != BLK_STS_OK)
- goto out_unmap_sg;
+ goto free_iova;
return BLK_STS_OK;
-out_unmap_sg:
- dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
-out_free_sg:
- mempool_free(iod->sgt.sgl, dev->iod_mempool);
+free_iova:
+ dma_free_iova(&iod->dma_map->iova);
+free_iod_map:
+ mempool_free(iod->dma_map, dev->iod_mempool);
return ret;
}
@@ -842,7 +986,6 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
iod->aborted = false;
iod->nr_allocations = -1;
- iod->sgt.nents = 0;
ret = nvme_setup_cmd(req->q->queuedata, req);
if (ret)
@@ -2670,7 +2813,7 @@ static void nvme_release_prp_pools(struct nvme_dev *dev)
static int nvme_pci_alloc_iod_mempool(struct nvme_dev *dev)
{
- size_t alloc_size = sizeof(struct scatterlist) * NVME_MAX_SEGS;
+ size_t alloc_size = sizeof(struct iod_dma_map);
dev->iod_mempool = mempool_create_node(1,
mempool_kmalloc, mempool_kfree,
--
2.45.2
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-02 9:09 [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Leon Romanovsky
` (17 preceding siblings ...)
2024-07-02 9:09 ` [RFC PATCH v1 18/18] nvme-pci: use new dma API Leon Romanovsky
@ 2024-07-03 5:42 ` Christoph Hellwig
2024-07-03 10:42 ` Zhu Yanjun
2024-07-03 10:52 ` Leon Romanovsky
2024-07-05 6:39 ` Christoph Hellwig
19 siblings, 2 replies; 51+ messages in thread
From: Christoph Hellwig @ 2024-07-03 5:42 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
I just tried to boot this on my usual qemu test setup with emulated
nvme devices, and it dead-loops with messages like this fairly late
in the boot cycle:
[ 43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
[ 43.826982] dma_mapping_error -12
passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
it go away.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-03 5:42 ` [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Christoph Hellwig
@ 2024-07-03 10:42 ` Zhu Yanjun
2024-07-03 10:52 ` Leon Romanovsky
1 sibling, 0 replies; 51+ messages in thread
From: Zhu Yanjun @ 2024-07-03 10:42 UTC (permalink / raw)
To: Christoph Hellwig, Leon Romanovsky
Cc: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Zeng, Oak, Chaitanya Kulkarni,
Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, linux-block, linux-kernel,
linux-rdma, iommu, linux-nvme, linux-pci, kvm, linux-mm
On 2024/7/3 13:42, Christoph Hellwig wrote:
> I just tried to boot this on my usual qemu test setup with emulated
> nvme devices, and it dead-loops with messages like this fairly late
> in the boot cycle:
>
> [ 43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
> [ 43.826982] dma_mapping_error -12
>
> passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
I also ran into this problem. IMO, if intel_iommu=off, the drivers under
drivers/iommu cannot be used.
If I remember correctly, some folks fixed this problem in the first
version. I will check the mails.
For my part, I just revert this commit because I do not use the nvme
part of it.
Zhu Yanjun
> it go away.
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-03 5:42 ` [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Christoph Hellwig
2024-07-03 10:42 ` Zhu Yanjun
@ 2024-07-03 10:52 ` Leon Romanovsky
2024-07-03 14:35 ` Christoph Hellwig
1 sibling, 1 reply; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-03 10:52 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Zeng, Oak, Chaitanya Kulkarni,
Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, linux-block, linux-kernel,
linux-rdma, iommu, linux-nvme, linux-pci, kvm, linux-mm
On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote:
> I just tried to boot this on my usual qemu test setup with emulated
> nvme devices, and it dead-loops with messages like this fairly late
> in the boot cycle:
>
> [ 43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
> [ 43.826982] dma_mapping_error -12
>
> passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
> it go away.
Can you please share your kernel command line and qemu?
On my and Chaitanya's setups it works fine.
Thanks
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-03 10:52 ` Leon Romanovsky
@ 2024-07-03 14:35 ` Christoph Hellwig
2024-07-03 15:51 ` Leon Romanovsky
2024-07-05 22:53 ` Chaitanya Kulkarni
0 siblings, 2 replies; 51+ messages in thread
From: Christoph Hellwig @ 2024-07-03 14:35 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Christoph Hellwig, Jens Axboe, Jason Gunthorpe, Robin Murphy,
Joerg Roedel, Will Deacon, Keith Busch, Zeng, Oak,
Chaitanya Kulkarni, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
On Wed, Jul 03, 2024 at 01:52:53PM +0300, Leon Romanovsky wrote:
> On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote:
> > I just tried to boot this on my usual qemu test setup with emulated
> > nvme devices, and it dead-loops with messages like this fairly late
> > in the boot cycle:
> >
> > [ 43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
> > [ 43.826982] dma_mapping_error -12
> >
> > passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
> > it go away.
>
> Can you please share your kernel command line and qemu?
> On my and Chaitanya's setups it works fine.
qemu-system-x86_64 \
-nographic \
-enable-kvm \
-m 6g \
-smp 4 \
-cpu host \
-M q35,kernel-irqchip=split \
-kernel arch/x86/boot/bzImage \
-append "root=/dev/vda console=ttyS0,115200n8 intel_iommu=on" \
-device intel-iommu,intremap=on \
-device ioh3420,multifunction=on,bus=pcie.0,id=port9-0,addr=9.0,chassis=0 \
-blockdev driver=file,cache.direct=on,node-name=root,filename=/home/hch/images/bookworm.img \
-blockdev driver=host_device,cache.direct=on,node-name=test,filename=/dev/nvme0n1p4 \
-device virtio-blk,drive=root \
-device nvme,drive=test,serial=1234
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-03 14:35 ` Christoph Hellwig
@ 2024-07-03 15:51 ` Leon Romanovsky
2024-07-04 7:48 ` Christoph Hellwig
2024-07-05 22:53 ` Chaitanya Kulkarni
1 sibling, 1 reply; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-03 15:51 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Zeng, Oak, Chaitanya Kulkarni,
Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, linux-block, linux-kernel,
linux-rdma, iommu, linux-nvme, linux-pci, kvm, linux-mm
On Wed, Jul 03, 2024 at 04:35:30PM +0200, Christoph Hellwig wrote:
> On Wed, Jul 03, 2024 at 01:52:53PM +0300, Leon Romanovsky wrote:
> > On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote:
> > > I just tried to boot this on my usual qemu test setup with emulated
> > > nvme devices, and it dead-loops with messages like this fairly late
> > > in the boot cycle:
> > >
> > > [ 43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
> > > [ 43.826982] dma_mapping_error -12
> > >
> > > passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
> > > it go away.
> >
> > Can you please share your kernel command line and qemu?
> > On my and Chaitanya's setups it works fine.
>
> qemu-system-x86_64 \
> -nographic \
> -enable-kvm \
> -m 6g \
> -smp 4 \
> -cpu host \
> -M q35,kernel-irqchip=split \
> -kernel arch/x86/boot/bzImage \
> -append "root=/dev/vda console=ttyS0,115200n8 intel_iommu=on" \
> -device intel-iommu,intremap=on \
> -device ioh3420,multifunction=on,bus=pcie.0,id=port9-0,addr=9.0,chassis=0 \
> -blockdev driver=file,cache.direct=on,node-name=root,filename=/home/hch/images/bookworm.img \
> -blockdev driver=host_device,cache.direct=on,node-name=test,filename=/dev/nvme0n1p4 \
> -device virtio-blk,drive=root \
> -device nvme,drive=test,serial=1234
Thanks, Chaitanya will take a look.
If we put aside this issue, do you think that the proposed API is the right one?
BTW, I have a fancier command line; it is probably the root cause of working vs. not working:
/opt/simx/bin/qemu-system-x86_64
-append root=/dev/root rw ignore_loglevel rootfstype=9p
rootflags="cache=loose,trans=virtio" earlyprintk=serial,ttyS0,115200
console=hvc0 noibrs noibpb nopti nospectre_v2 nospectre_v1
l1tf=off nospec_store_bypass_disable no_stf_barrier mds=off
mitigations=off panic_on_warn=1
intel_iommu=on iommu=nopt iommu.forcedac=true
vfio_iommu_type1.allow_unsafe_interrupts=1
systemd.hostname=mtl-leonro-l-vm
-chardev stdio,id=stdio,mux=on,signal=off
-cpu host
-device virtio-rng-pci
-device virtio-balloon-pci
-device isa-serial,chardev=stdio
-device virtio-serial-pci
-device virtconsole,chardev=stdio
-device virtio-9p-pci,fsdev=host_fs,mount_tag=/dev/root
-device virtio-9p-pci,fsdev=host_bind_fs0,mount_tag=bind0
-device virtio-9p-pci,fsdev=host_bind_fs1,mount_tag=bind1
-device virtio-9p-pci,fsdev=host_bind_fs2,mount_tag=bind2
-device intel-iommu,intremap=on
-device connectx7
-device nvme,drive=drv0,serial=foo
-drive file=/home/leonro/.cache/mellanox/mkt/nvme-1g.raw,if=none,id=drv0,format=raw
-enable-kvm
-fsdev local,id=host_bind_fs1,security_model=none,path=/logs
-fsdev local,id=host_fs,security_model=none,path=/mnt/self
-fsdev local,id=host_bind_fs0,security_model=none,path=/plugins
-fsdev local,id=host_bind_fs2,security_model=none,path=/home/leonro
-fw_cfg etc/sercon-port,string=2
-kernel /home/leonro/src/kernel/arch/x86/boot/bzImage
-m 5G -machine q35,kernel-irqchip=split
-mon chardev=stdio
-net nic,model=virtio,macaddr=52:54:01:d8:e5:f9
-net user,hostfwd=tcp:127.0.0.1:46645-:22
-no-reboot -nodefaults -nographic -smp 16 -vga none
>
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-03 15:51 ` Leon Romanovsky
@ 2024-07-04 7:48 ` Christoph Hellwig
2024-07-04 13:18 ` Leon Romanovsky
2024-07-08 16:52 ` Jason Gunthorpe
0 siblings, 2 replies; 51+ messages in thread
From: Christoph Hellwig @ 2024-07-04 7:48 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Christoph Hellwig, Jens Axboe, Jason Gunthorpe, Robin Murphy,
Joerg Roedel, Will Deacon, Keith Busch, Zeng, Oak,
Chaitanya Kulkarni, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
On Wed, Jul 03, 2024 at 06:51:14PM +0300, Leon Romanovsky wrote:
> If we put aside this issue, do you think that the proposed API is the right one?
I haven't look at it in detail yet, but from a quick look there is a
few things to note:
1) The amount of code needed in nvme worries me a bit. Now NVMe is a messy
driver due to the stupid PRPs vs just using SGLs, but needing a fair
amount of extra boilerplate code in drivers is a bit of a warning sign.
I plan to look into this to see if I can help on improving it, but for
that I need a working version first.
2) The amount of seemingly unrelated global headers pulled into other
global headers. Some of this might just be sloppiness, e.g. I can't
see why dma-mapping.h would actually need iommu.h to start with,
but pci.h in dma-map-ops.h is a no-go.
3) which brings me to real layering violations. dev_is_untrusted and
dev_use_swiotlb are DMA API internals, no way I'd ever want to expose
them. dma-map-ops.h is a semi-internal header only for implementations
of the dma ops (as very clearly documented at the top of that file),
it must not be included by drivers. Same for swiotlb.h.
Not quite as concerning, but doing an indirect call for each map
through dma_map_ops in addition to the iommu ops is not very efficient.
We've thought for a while about allowing direct calls to dma-iommu, similar
to how we do direct calls to dma-direct from the core mapping.c code.
This might be a good time to do that as a prep step for this work.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-04 7:48 ` Christoph Hellwig
@ 2024-07-04 13:18 ` Leon Romanovsky
2024-07-05 6:00 ` Christoph Hellwig
2024-07-08 16:52 ` Jason Gunthorpe
1 sibling, 1 reply; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-04 13:18 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Zeng, Oak, Chaitanya Kulkarni,
Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, linux-block, linux-kernel,
linux-rdma, iommu, linux-nvme, linux-pci, kvm, linux-mm
On Thu, Jul 04, 2024 at 09:48:56AM +0200, Christoph Hellwig wrote:
> On Wed, Jul 03, 2024 at 06:51:14PM +0300, Leon Romanovsky wrote:
> > If we put aside this issue, do you think that the proposed API is the right one?
>
> I haven't look at it in detail yet, but from a quick look there is a
> few things to note:
>
>
> 1) The amount of code needed in nvme worries me a bit. Now NVMe a messy
> driver due to the stupid PRPs vs just using SGLs, but needing a fair
> amount of extra boilerplate code in drivers is a bit of a warning sign.
> I plan to look into this to see if I can help on improving it, but for
> that I need a working version first.
Chaitanya is working on this and I'll join him to help next Sunday,
after I return to the office from my sick leave.
>
>
> 2) The amount of seemingly unrelated global headers pulled into other
> global headers. Some of this might just be sloppiness, e.g. I can't
> see why dma-mapping.h would actually need iommu.h to start with,
> but pci.h in dma-map-ops.h is a no-go.
pci.h was pulled in because I needed to call pci_p2pdma_map_type()
in dma_can_use_iova().
>
> 3) which brings me to real layering violations. dev_is_untrusted and
> dev_use_swiotlb are DMA API internals, no way I'd ever want to expose
> them. dma-map-ops.h is a semi-internal header only for implementations
> of the dma ops (as very clearly documented at the top of that file),
> it must not be included by drivers. Same for swiotlb.h.
These items shouldn't worry you and will be changed in the final version.
They are an outcome of the patch "RDMA/umem: Prevent UMEM ODP creation with SWIOTLB".
https://lore.kernel.org/all/d18c454636bf3cfdba9b66b7cc794d713eadc4a5.1719909395.git.leon@kernel.org/
All HMM users need such "prevention" so it will be moved to a common place.
>
> Not quite as concerning, but doing an indirect call for each map
> through dma_map_ops in addition to the iommu ops is not every efficient.
> We've through for a while to allow direct calls to dma-iommu similar
> how we do direct calls to dma-direct from the core mapping.c code.
> This might be a good time to do that as a prep step for this work.
Sure, no problem, will start in parallel to work on this.
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 18/18] nvme-pci: use new dma API
2024-07-02 9:09 ` [RFC PATCH v1 18/18] nvme-pci: use new dma API Leon Romanovsky
@ 2024-07-04 15:23 ` Robin Murphy
2024-07-04 17:16 ` Leon Romanovsky
0 siblings, 1 reply; 51+ messages in thread
From: Robin Murphy @ 2024-07-04 15:23 UTC (permalink / raw)
To: Leon Romanovsky, Jens Axboe, Jason Gunthorpe, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni
Cc: Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, linux-block, linux-kernel,
linux-rdma, iommu, linux-nvme, linux-pci, kvm, linux-mm
On 02/07/2024 10:09 am, Leon Romanovsky wrote:
[...]
> +static inline dma_addr_t nvme_dma_link_page(struct page *page,
> + unsigned int poffset,
> + unsigned int len,
> + struct nvme_iod *iod)
> {
> - int i;
> - struct scatterlist *sg;
> + struct dma_iova_attrs *iova = &iod->dma_map->iova;
> + struct dma_iova_state *state = &iod->dma_map->state;
> + dma_addr_t dma_addr;
> + int ret;
> +
> + if (iod->dma_map->use_iova) {
> + phys_addr_t phys = page_to_phys(page) + poffset;
Yeah, there's no way this can possibly work. You can't do the
dev_use_swiotlb() check up-front based on some overall DMA operation
size, but then build that operation out of arbitrarily small fragments
of different physical pages that *could* individually need bouncing to
not break coherency.
Thanks,
Robin.
> +
> + dma_addr = state->iova->addr + state->range_size;
> + ret = dma_link_range(&iod->dma_map->state, phys, len);
> + if (ret)
> + return DMA_MAPPING_ERROR;
> + } else {
> + dma_addr = dma_map_page_attrs(iova->dev, page, poffset, len,
> + iova->dir, iova->attrs);
> + }
> + return dma_addr;
> +}
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 18/18] nvme-pci: use new dma API
2024-07-04 15:23 ` Robin Murphy
@ 2024-07-04 17:16 ` Leon Romanovsky
2024-07-05 5:58 ` Christoph Hellwig
0 siblings, 1 reply; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-04 17:16 UTC (permalink / raw)
To: Robin Murphy
Cc: Jens Axboe, Jason Gunthorpe, Joerg Roedel, Will Deacon,
Keith Busch, Christoph Hellwig, Zeng, Oak, Chaitanya Kulkarni,
Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, linux-block, linux-kernel,
linux-rdma, iommu, linux-nvme, linux-pci, kvm, linux-mm
On Thu, Jul 04, 2024 at 04:23:47PM +0100, Robin Murphy wrote:
> On 02/07/2024 10:09 am, Leon Romanovsky wrote:
> [...]
> > +static inline dma_addr_t nvme_dma_link_page(struct page *page,
> > + unsigned int poffset,
> > + unsigned int len,
> > + struct nvme_iod *iod)
> > {
> > - int i;
> > - struct scatterlist *sg;
> > + struct dma_iova_attrs *iova = &iod->dma_map->iova;
> > + struct dma_iova_state *state = &iod->dma_map->state;
> > + dma_addr_t dma_addr;
> > + int ret;
> > +
> > + if (iod->dma_map->use_iova) {
> > + phys_addr_t phys = page_to_phys(page) + poffset;
>
> Yeah, there's no way this can possibly work. You can't do the
> dev_use_swiotlb() check up-front based on some overall DMA operation size,
> but then build that operation out of arbitrarily small fragments of
> different physical pages that *could* individually need bouncing to not
> break coherency.
This is exactly how dma_map_sg() works. It checks all SG entries in
advance and proceeds with a bounce buffer if needed. In our case all the
checks in dev_use_sg_swiotlb() will return "false". In v0, Christoph said
that NVMe guarantees alignment, which is the only "dynamic" check in
that function.
600 static bool dev_use_sg_swiotlb(struct device *dev, struct scatterlist *sg,
601 int nents, enum dma_data_direction dir)
602 {
603 struct scatterlist *s;
604 int i;
605
606 if (!IS_ENABLED(CONFIG_SWIOTLB))
607 return false;
608
609 if (dev_is_untrusted(dev))
610 return true;
611
612 /*
613 * If kmalloc() buffers are not DMA-safe for this device and
614 * direction, check the individual lengths in the sg list. If any
615 * element is deemed unsafe, use the swiotlb for bouncing.
616 */
617 if (!dma_kmalloc_safe(dev, dir)) {
618 for_each_sg(sg, s, nents, i)
619 if (!dma_kmalloc_size_aligned(s->length))
620 return true;
621 }
622
623 return false;
624 }
...
1338 static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
1339 int nents, enum dma_data_direction dir, unsigned long attrs)
...
1360 if (dev_use_sg_swiotlb(dev, sg, nents, dir))
1361 return iommu_dma_map_sg_swiotlb(dev, sg, nents, dir, attrs);
Thanks
>
> Thanks,
> Robin.
>
> > +
> > + dma_addr = state->iova->addr + state->range_size;
> > + ret = dma_link_range(&iod->dma_map->state, phys, len);
> > + if (ret)
> > + return DMA_MAPPING_ERROR;
> > + } else {
> > + dma_addr = dma_map_page_attrs(iova->dev, page, poffset, len,
> > + iova->dir, iova->attrs);
> > + }
> > + return dma_addr;
> > +}
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 18/18] nvme-pci: use new dma API
2024-07-04 17:16 ` Leon Romanovsky
@ 2024-07-05 5:58 ` Christoph Hellwig
2024-07-05 18:48 ` Leon Romanovsky
0 siblings, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2024-07-05 5:58 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Robin Murphy, Jens Axboe, Jason Gunthorpe, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
> This is exactly how dma_map_sg() works.
Which dma_map_sg? swiotlb handling is implemented in the underlying
ops, dma-direct and dma-iommu specifically.
dma-direct just iterates over the entries and calls dma_direct_map_page,
which does a per-entry decision to bounce based on
is_swiotlb_force_bounce, dma_capable and dma_kmalloc_needs_bounce.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-04 13:18 ` Leon Romanovsky
@ 2024-07-05 6:00 ` Christoph Hellwig
0 siblings, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2024-07-05 6:00 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Christoph Hellwig, Jens Axboe, Jason Gunthorpe, Robin Murphy,
Joerg Roedel, Will Deacon, Keith Busch, Zeng, Oak,
Chaitanya Kulkarni, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
On Thu, Jul 04, 2024 at 04:18:39PM +0300, Leon Romanovsky wrote:
> > 2) The amount of seemingly unrelated global headers pulled into other
> > global headers. Some of this might just be sloppiness, e.g. I can't
> > see why dma-mapping.h would actually need iommu.h to start with,
> > but pci.h in dma-map-ops.h is a no-go.
>
> pci.h was pulled because I needed to call to pci_p2pdma_map_type()
> in dma_can_use_iova().
No, that's not the reason. The reason is actually that whole
dev_use_swiotlb mess which shouldn't exist in this form.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-02 9:09 [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Leon Romanovsky
` (18 preceding siblings ...)
2024-07-03 5:42 ` [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Christoph Hellwig
@ 2024-07-05 6:39 ` Christoph Hellwig
2024-07-07 9:45 ` Leon Romanovsky
2024-07-08 23:57 ` Jason Gunthorpe
19 siblings, 2 replies; 51+ messages in thread
From: Christoph Hellwig @ 2024-07-05 6:39 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Christoph Hellwig, Zeng, Oak,
Chaitanya Kulkarni, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
Review from the NVMe driver consumer perspective. I think if all these
were implement we'd probably end up with less code than before the
conversion.
The split between dma_iova_attrs, dma_memory_type and dma_iova_state is
odd. I would have expected them to just be a single object. While
talking about this I think the domain field in dma_iova_state should
probably be a private pointer instead of being tied to the iommu.
Also do we need the attrs member in the iova_attrs structure? The
"attrs" really are flags passed to the mapping routines that are
per-operation and not persistent, so I'd expect them to be passed
per-call and not stored in a structure.
I'd also expect that the use_iova field to be in the mapping state
and not separately provided by the driver.
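Purely as an illustration of that suggestion (field names borrowed from the
current split structures, the private pointer standing in for the iommu
domain, and nothing here is what the series actually posts), a merged
object might look roughly like:
/* Hypothetical merged layout, not the posted structures. */
struct dma_iova_state {
	struct device *dev;
	enum dma_data_direction dir;
	size_t size;		/* worst-case IOVA space needed */
	dma_addr_t addr;	/* base of the allocated IOVA window */
	size_t range_size;	/* bytes linked so far */
	void *private;		/* implementation detail, e.g. the domain */
	bool use_iova;		/* filled in by the DMA core, not the driver */
};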
For nvme-specific data structures I would have expected a dma_addr/len
pair in struct iod_dma_map, maybe even using a common type.
Also the data structure split seems odd - I'd expect the actual
mapping state and a small number (at least one) dma_addr/len pair
to be inside the nvme_iod structure, and then only do the dynamic
allocation if we need more of them because there are more segments
and we are not using the iommu.
If we had a common data structure for the dma_addr/len pairs
dma_unlink_range could just take care of the unmap for the non-iommu
case as well, which would be neat. I'd also expect that
dma_free_iova would be covered by it.
I would have expected dma_link_range to return the dma_addr_t instead
of poking into the iova structure in the callers.
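For instance, a signature along these lines (a hypothetical variant, not
what the series posts):
/* Hypothetical variant: hand the mapped address (or DMA_MAPPING_ERROR)
 * back to the caller instead of having it read
 * state->iova->addr + state->range_size itself. */
dma_addr_t dma_link_range(struct dma_iova_state *state, phys_addr_t phys,
			  size_t len);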
In __nvme_rq_dma_map the <= PAGE_SIZE case is pointless. In the
existing code the reason for it is to avoid allocating and mapping the
sg_table, but that code is still left before we even get to this code.
My suggestion above to only allocate the dma_addr/len pairs when there
is more than 1 or a few of it would allow to trivially implement that
suggestion using the normal API without having to keep that special
case and the dma_len parameter around.
If this added a version of dma_map_page_attrs that directly took
the physical address as a prep patch, the callers would not have to
bother with page pointer manipulations and could just work on physical
addresses for both the iommu and no-iommu cases. It would also help
a little bit with the eventual switch to store the physical address
instead of page+offset in the bio_vec. Talking about that, I've
been wanting to add a bvec_phys helper to convert the
page_to_phys(bv.bv_page) + bv.bv_offset calculations. This is becoming
more urgent with more callers needing to do that; I'll try to get it out
to Jens ASAP so that it can make the 6.11 merge window.
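A sketch of such a helper (assuming page_to_phys(); the name and placement
would be decided by the actual patch):
static inline phys_addr_t bvec_phys(const struct bio_vec *bv)
{
	/* physical address of the data this bio_vec describes */
	return page_to_phys(bv->bv_page) + bv->bv_offset;
}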
Can we make dma_start_range / dma_end_range simple no-ops for the
non-iommu code, to avoid boilerplate code in the callers to deal with
the two cases?
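If that were the case, the caller-side wrapper from the nvme patch could
collapse to something like this (a sketch under that assumption, reusing
the names from the diff):
static int nvme_rq_dma_map(struct request *req, struct nvme_iod *iod,
			   struct nvme_sgl_desc *sgl_list)
{
	int ret;

	ret = dma_start_range(&iod->dma_map->state);	/* no-op without IOVA */
	if (ret)
		return ret;
	ret = __nvme_rq_dma_map(req, iod, sgl_list);
	dma_end_range(&iod->dma_map->state);		/* no-op without IOVA */
	return ret;
}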
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 18/18] nvme-pci: use new dma API
2024-07-05 5:58 ` Christoph Hellwig
@ 2024-07-05 18:48 ` Leon Romanovsky
2024-07-06 6:08 ` Christoph Hellwig
0 siblings, 1 reply; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-05 18:48 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Robin Murphy, Jens Axboe, Jason Gunthorpe, Joerg Roedel,
Will Deacon, Keith Busch, Zeng, Oak, Chaitanya Kulkarni,
Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, linux-block, linux-kernel,
linux-rdma, iommu, linux-nvme, linux-pci, kvm, linux-mm
On Fri, Jul 05, 2024 at 07:58:06AM +0200, Christoph Hellwig wrote:
> > This is exactly how dma_map_sg() works.
>
> Which dma_map_sg? swiotlb handling is implemented in the underlying
> ops, dma-direct and dma-iommu specifically.
>
> dma-direct just iterates over the entries and calls dma_direct_map_page,
> which does a per-entry decision to bounce based on
> is_swiotlb_force_bounce, dma_capable and dma_kmalloc_needs_bounce.
dma-direct is not going to have a "use_iova" flag. Robin pointed to the
dma-iommu path.
In that case the flow is dma_map_sg()->iommu_dma_map_sg()->dev_use_sg_swiotlb().
Thanks
>
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-03 14:35 ` Christoph Hellwig
2024-07-03 15:51 ` Leon Romanovsky
@ 2024-07-05 22:53 ` Chaitanya Kulkarni
2024-07-06 6:26 ` Christoph Hellwig
2024-07-07 12:45 ` Leon Romanovsky
1 sibling, 2 replies; 51+ messages in thread
From: Chaitanya Kulkarni @ 2024-07-05 22:53 UTC (permalink / raw)
To: Christoph Hellwig, Leon Romanovsky
Cc: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Zeng, Oak, Chaitanya Kulkarni,
Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton,
linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, iommu@lists.linux.dev,
linux-nvme@lists.infradead.org, linux-pci@vger.kernel.org,
kvm@vger.kernel.org, linux-mm@kvack.org
On 7/3/24 07:35, Christoph Hellwig wrote:
> On Wed, Jul 03, 2024 at 01:52:53PM +0300, Leon Romanovsky wrote:
>> On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote:
>>> I just tried to boot this on my usual qemu test setup with emulated
>>> nvme devices, and it dead-loops with messages like this fairly late
>>> in the boot cycle:
>>>
>>> [ 43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
>>> [ 43.826982] dma_mapping_error -12
>>>
>>> passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
>>> it go away.
>> Can you please share your kernel command line and qemu?
>> On my and Chaitanya's setups it works fine.
> qemu-system-x86_64 \
> -nographic \
> -enable-kvm \
> -m 6g \
> -smp 4 \
> -cpu host \
> -M q35,kernel-irqchip=split \
> -kernel arch/x86/boot/bzImage \
> -append "root=/dev/vda console=ttyS0,115200n8 intel_iommu=on" \
> -device intel-iommu,intremap=on \
> -device ioh3420,multifunction=on,bus=pcie.0,id=port9-0,addr=9.0,chassis=0 \
> -blockdev driver=file,cache.direct=on,node-name=root,filename=/home/hch/images/bookworm.img \
> -blockdev driver=host_device,cache.direct=on,node-name=test,filename=/dev/nvme0n1p4 \
> -device virtio-blk,drive=root \
> -device nvme,drive=test,serial=1234
>
I tried to reproduce this issue, but somehow it is not reproducible.
I'll try again on Leon's setup on Saturday night to fix that case.
-ck
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 18/18] nvme-pci: use new dma API
2024-07-05 18:48 ` Leon Romanovsky
@ 2024-07-06 6:08 ` Christoph Hellwig
0 siblings, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2024-07-06 6:08 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Christoph Hellwig, Robin Murphy, Jens Axboe, Jason Gunthorpe,
Joerg Roedel, Will Deacon, Keith Busch, Zeng, Oak,
Chaitanya Kulkarni, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
On Fri, Jul 05, 2024 at 09:48:46PM +0300, Leon Romanovsky wrote:
> In that case the flow is dma_map_sg()->iommu_dma_map_sg()->dev_use_sg_swiotlb().
Even for that you'll still need to check the first and last entry
for being kmalloc misaligned if we assume that all middle entries are
aligned (which for NVMe they have to be, but we're probably better off
figuring out a way to enforce that).
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-05 22:53 ` Chaitanya Kulkarni
@ 2024-07-06 6:26 ` Christoph Hellwig
2024-07-07 9:16 ` Leon Romanovsky
2024-07-07 12:45 ` Leon Romanovsky
1 sibling, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2024-07-06 6:26 UTC (permalink / raw)
To: Chaitanya Kulkarni
Cc: Christoph Hellwig, Leon Romanovsky, Jens Axboe, Jason Gunthorpe,
Robin Murphy, Joerg Roedel, Will Deacon, Keith Busch, Zeng, Oak,
Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton,
linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, iommu@lists.linux.dev,
linux-nvme@lists.infradead.org, linux-pci@vger.kernel.org,
kvm@vger.kernel.org, linux-mm@kvack.org
On Fri, Jul 05, 2024 at 10:53:06PM +0000, Chaitanya Kulkarni wrote:
> I tried to reproduce this issue somehow it is not reproducible.
>
> I'll try again on Leon's setup on my Saturday night, to fix that
> case.
It is passthrough I/O from userspace. The address is not page aligned
as seen in the printk. Forcing bounce buffering of all passthrough
I/O makes it go away.
The problem is the first mapped segment does not have to be aligned
and we're missing the code to places it at the aligned offset into
the IOVA space.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-06 6:26 ` Christoph Hellwig
@ 2024-07-07 9:16 ` Leon Romanovsky
0 siblings, 0 replies; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-07 9:16 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chaitanya Kulkarni, Jens Axboe, Jason Gunthorpe, Robin Murphy,
Joerg Roedel, Will Deacon, Keith Busch, Zeng, Oak, Sagi Grimberg,
Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton,
linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, iommu@lists.linux.dev,
linux-nvme@lists.infradead.org, linux-pci@vger.kernel.org,
kvm@vger.kernel.org, linux-mm@kvack.org
On Sat, Jul 06, 2024 at 08:26:04AM +0200, Christoph Hellwig wrote:
> On Fri, Jul 05, 2024 at 10:53:06PM +0000, Chaitanya Kulkarni wrote:
> > I tried to reproduce this issue somehow it is not reproducible.
> >
> > I'll try again on Leon's setup on my Saturday night, to fix that
> > case.
>
> It is passthrough I/O from userspace. The address is not page aligned
> as seen in the printk. Forcing bounce buffering of all passthrough
> I/O makes it go away.
>
> The problem is the first mapped segment does not have to be aligned
> and we're missing the code to place it at the aligned offset into
> the IOVA space.
Thanks for the explanation.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-05 6:39 ` Christoph Hellwig
@ 2024-07-07 9:45 ` Leon Romanovsky
2024-07-08 23:57 ` Jason Gunthorpe
1 sibling, 0 replies; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-07 9:45 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, Jason Gunthorpe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Zeng, Oak, Chaitanya Kulkarni,
Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, linux-block, linux-kernel,
linux-rdma, iommu, linux-nvme, linux-pci, kvm, linux-mm
On Fri, Jul 05, 2024 at 08:39:10AM +0200, Christoph Hellwig wrote:
> Review from the NVMe driver consumer perspective. I think if all these
> were implemented we'd probably end up with less code than before the
> conversion.
Thanks for the review, I will try to address all the comments in the next version.
>
> The split between dma_iova_attrs, dma_memory_type and dma_iova_state is
> odd. I would have expected them to just be a single object. While
> talking about this I think the domain field in dma_iova_state should
> probably be a private pointer instead of being tied to the iommu.
>
> Also do we need the attrs member in the iova_attrs structure? The
> "attrs" really are flags passed to the mapping routines that are
> per-operation and not persistent, so I'd expect them to be passed
> per-call and not stored in a structure.
It is a left-over from my not-sent version where I added a new attribute
to indicate that dma_alloc_iova() can't support SWIOTLB, to avoid the
dev_use_swiotlb() mess. I will remove it.
>
> I'd also expect the use_iova field to be in the mapping state
> and not separately provided by the driver.
>
> For nvme specific data structures I would have expected a dma_addr/
> len pair in struct iod_dma_map, maybe even using a common type.
>
> Also the data structure split seems odd - I'd expect the actual
> mapping state and a small number (at least one) of dma_addr/len pairs
> to be inside the nvme_iod structure, and then only do the dynamic
> allocation if we need more of them because there are more segments
> and we are not using the iommu.
>
> If we had a common data structure for the dma_addr/len pairs
> dma_unlink_range could just take care of the unmap for the non-iommu
> case as well, which would be neat. I'd also expect that
> dma_free_iova would be covered by it.
Internally Jason asked for the same thing, but I didn't want to produce
an asymmetric API where drivers have a call to dma_alloc_iova() but don't
have a call to dma_free_iova(). However, now it is 2 versus 1, so I will
change it.
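Roughly along these lines, I assume (sketch only; the names and the
dma_unlink_range()/dma_free_iova() signatures here are placeholders, not
what the next version will necessarily use):

	/* common dma_addr/len pair shared by both paths */
	struct dma_range {
		dma_addr_t addr;
		u32 len;
	};

	static void dma_unlink_and_free(struct device *dev,
					struct dma_iova_state *state,
					struct dma_range *ranges, int nr,
					enum dma_data_direction dir)
	{
		int i;

		if (!state->use_iova) {
			/* non-IOMMU case: the pairs came from dma_map_page() */
			for (i = 0; i < nr; i++)
				dma_unmap_page(dev, ranges[i].addr,
					       ranges[i].len, dir);
			return;
		}

		/* IOMMU case: unlink each range, then drop the IOVA window */
		for (i = 0; i < nr; i++)
			dma_unlink_range(state, ranges[i].addr, ranges[i].len);
		dma_free_iova(state);
	}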
>
> I would have expected dma_link_range to return the dma_addr_t instead
> of poking into the iova structure in the callers.
>
> In __nvme_rq_dma_map the <= PAGE_SIZE case is pointless. In the
> existing code the reason for it is to avoid allocating and mapping the
> sg_table, but that code is still left before we even get to this code.
>
> My suggestion above to only allocate the dma_addr/len pairs when there
> is more than one or a few of them would allow us to trivially implement that
> suggestion using the normal API without having to keep that special
> case and the dma_len parameter around.
>
> If this added a version of dma_map_page_attrs that directly took
> the physical address as a prep patch, the callers would not have to
> bother with page pointer manipulations and could just work on physical
> addresses for both the iommu and no-iommu cases. It would also help
> a little bit with the eventual switch to store the physical address
> instead of page+offset in the bio_vec. Talking about that, I've
> been wanting to add a bvec_phys helper to convert the
> page_to_phys(bv.bv_page) + bv.bv_offset calculations. This is becoming
> more urgent with more callers needing to do that, I'll try to get it out
> to Jens ASAP so that it can make the 6.11 merge window.
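Presumably that helper is just this (sketch):

	/* needs <linux/bvec.h>; physical address of the data a bio_vec points at */
	static inline phys_addr_t bvec_phys(const struct bio_vec *bvec)
	{
		return page_to_phys(bvec->bv_page) + bvec->bv_offset;
	}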
>
> Can we make dma_start_range / dma_end_range simple no-ops for the
> non-iommu code to avoid boilerplate code in the callers to deal
> with the two cases?
Yes, sure.
Thanks
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-05 22:53 ` Chaitanya Kulkarni
2024-07-06 6:26 ` Christoph Hellwig
@ 2024-07-07 12:45 ` Leon Romanovsky
1 sibling, 0 replies; 51+ messages in thread
From: Leon Romanovsky @ 2024-07-07 12:45 UTC (permalink / raw)
To: Chaitanya Kulkarni
Cc: Christoph Hellwig, Jens Axboe, Jason Gunthorpe, Robin Murphy,
Joerg Roedel, Will Deacon, Keith Busch, Zeng, Oak, Sagi Grimberg,
Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton,
linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, iommu@lists.linux.dev,
linux-nvme@lists.infradead.org, linux-pci@vger.kernel.org,
kvm@vger.kernel.org, linux-mm@kvack.org
On Fri, Jul 05, 2024 at 10:53:06PM +0000, Chaitanya Kulkarni wrote:
> On 7/3/24 07:35, Christoph Hellwig wrote:
> > On Wed, Jul 03, 2024 at 01:52:53PM +0300, Leon Romanovsky wrote:
> >> On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote:
> >>> I just tried to boot this on my usual qemu test setup with emulated
> >>> nvme devices, and it dead-loops with messages like this fairly late
> >>> in the boot cycle:
> >>>
> >>> [ 43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
> >>> [ 43.826982] dma_mapping_error -12
> >>>
> >>> passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
> >>> it go away.
> >> Can you please share your kernel command line and qemu?
> >> On my and Chaitanya setups it works fine.
> > qemu-system-x86_64 \
> > -nographic \
> > -enable-kvm \
> > -m 6g \
> > -smp 4 \
> > -cpu host \
> > -M q35,kernel-irqchip=split \
> > -kernel arch/x86/boot/bzImage \
> > -append "root=/dev/vda console=ttyS0,115200n8 intel_iommu=on" \
> > -device intel-iommu,intremap=on \
> > -device ioh3420,multifunction=on,bus=pcie.0,id=port9-0,addr=9.0,chassis=0 \
> > -blockdev driver=file,cache.direct=on,node-name=root,filename=/home/hch/images/bookworm.img \
> > -blockdev driver=host_device,cache.direct=on,node-name=test,filename=/dev/nvme0n1p4 \
> > -device virtio-blk,drive=root \
> > -device nvme,drive=test,serial=1234
> >
>
> I tried to reproduce this issue, but somehow it is not reproducible.
>
> I'll try again on Leon's setup on Saturday night to fix that
> case.
Chaitanya,
I added a "mem_align=120" line to the fio configuration file and the issue reproduced.
Thanks
>
> -ck
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-04 7:48 ` Christoph Hellwig
2024-07-04 13:18 ` Leon Romanovsky
@ 2024-07-08 16:52 ` Jason Gunthorpe
2024-07-09 6:17 ` Christoph Hellwig
1 sibling, 1 reply; 51+ messages in thread
From: Jason Gunthorpe @ 2024-07-08 16:52 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Leon Romanovsky, Jens Axboe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Zeng, Oak, Chaitanya Kulkarni,
Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, linux-block, linux-kernel,
linux-rdma, iommu, linux-nvme, linux-pci, kvm, linux-mm
On Thu, Jul 04, 2024 at 09:48:56AM +0200, Christoph Hellwig wrote:
> 1) The amount of code needed in nvme worries me a bit. Now NVMe is a messy
> driver due to the stupid PRPs vs just using SGLs, but needing a fair
> amount of extra boilerplate code in drivers is a bit of a warning sign.
> I plan to look into this to see if I can help on improving it, but for
> that I need a working version first.
It would be nice to have less. So much now depends on the caller to
provide both the input and output data structures.
Ideally we'd have some template code that consolidates these loops to
common code with driver provided hooks - there are a few ways to get
that efficiently in C.
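For example, something shaped like this (sketch only, all names are made up,
and it assumes dma_link_range() returns the mapped address as suggested
elsewhere in this thread):

	/* driver hook: write one HW descriptor for a mapped range */
	typedef void (*dma_fill_desc_t)(void *hw, int idx, dma_addr_t addr,
					u32 len);

	/*
	 * Keeping this static inline and passing a compile-time constant
	 * hook means the compiler can flatten the indirect call away.
	 */
	static inline int dma_link_bvecs(struct device *dev,
					 struct dma_iova_state *state,
					 struct bio_vec *bv, int nr,
					 void *hw, dma_fill_desc_t fill)
	{
		int i;

		for (i = 0; i < nr; i++) {
			dma_addr_t addr = dma_link_range(state,
					page_to_phys(bv[i].bv_page) +
					bv[i].bv_offset, bv[i].bv_len);

			if (dma_mapping_error(dev, addr))
				return -EIO;
			fill(hw, i, addr, bv[i].bv_len);
		}
		return 0;
	}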
I think it will be clearer when we get to RDMA and there we have the
same SGL/PRP kind of split up and we can see what is sharable.
> Not quite as concerning, but doing an indirect call for each map
> through dma_map_ops in addition to the iommu ops is not very
> efficient.
Yeah, there is no reason to support anything other than dma-iommu.c
for the iommu path, so the dma_map_ops indirection for this could just
be removed.
I'm also cooking something that should let us build a way to iommu map
a bio_vec very efficiently, which should transform this into a single
indirect call into the iommu driver per bio_vec, and a single radix
walk/etc.
> We've thought for a while about allowing direct calls to dma-iommu similar
> to how we do direct calls to dma-direct from the core mapping.c code.
> This might be a good time to do that as a prep step for this work.
I think there is room to benchmark and further improve these
paths. Even the fast direct map path is not compiling down to a single
load/store instruction per bio_vec entry as would be ideal.
Jason
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-05 6:39 ` Christoph Hellwig
2024-07-07 9:45 ` Leon Romanovsky
@ 2024-07-08 23:57 ` Jason Gunthorpe
2024-07-09 6:20 ` Christoph Hellwig
1 sibling, 1 reply; 51+ messages in thread
From: Jason Gunthorpe @ 2024-07-08 23:57 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Leon Romanovsky, Jens Axboe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Zeng, Oak, Chaitanya Kulkarni,
Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, linux-block, linux-kernel,
linux-rdma, iommu, linux-nvme, linux-pci, kvm, linux-mm
On Fri, Jul 05, 2024 at 08:39:10AM +0200, Christoph Hellwig wrote:
> Review from the NVMe driver consumer perspective. I think if all these
> were implemented we'd probably end up with less code than before the
> conversion.
One of the topics that came up at LSF is what to do with the
dma_memory_type data.
The concept here was that the new DMA API would work on same-type
memory only, meaning the caller would have to split up by type.
I understand the block stack already does this using P2P and !P2P, but
that isn't quite enough here as we want to split principally based on
IOMMU or !IOMMU.
Today that boils down to splitting based on a few things, like
grouping encrypted memory, and grouping by P2P provider.
So the idea was the "struct dma_memory_type" would encapsulate
whatever grouping was needed and the block layer would drive its
splitting on this; memory with the same struct can be merged.
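i.e. the block layer would end up with a check along these lines (sketch;
the fields here are illustrative, whatever the dma_memory_type query ends
up exposing is what would really be compared):

	/* two ranges can share one mapping only if their types match */
	static inline bool dma_memory_type_same(const struct dma_memory_type *a,
						const struct dma_memory_type *b)
	{
		/* same class (plain/encrypted/P2P) and same P2P provider */
		return a->type == b->type && a->p2p_pgmap == b->p2p_pgmap;
	}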
I didn't notice this got included in this RFC.
The trivial option is to store the dma_memory_type per-BIO and drive
the de-duplication like that. The other option would be to derive it from
the first entry in the bio_vec every time a new page is added.
This same-type design is one of the key elements of this API
arrangement - for RDMA I expect we will need a data structure storing
a list of types with a list of pages, and we will DMA map type by
type.
Jason
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-08 16:52 ` Jason Gunthorpe
@ 2024-07-09 6:17 ` Christoph Hellwig
2024-07-09 18:53 ` Jason Gunthorpe
0 siblings, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2024-07-09 6:17 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Christoph Hellwig, Leon Romanovsky, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Keith Busch, Zeng, Oak,
Chaitanya Kulkarni, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
On Mon, Jul 08, 2024 at 01:52:38PM -0300, Jason Gunthorpe wrote:
> Ideally we'd have some template code that consolidates these loops to
> common code with driver provided hooks - there are a few ways to get
> that efficiently in C.
>
> I think it will be clearer when we get to RDMA and there we have the
> same SGL/PRP kind of split up and we can see what is sharable.
I really would not want to build common code for PRPs - this is a concept
very specific to RDMA and NVMe. OTOH more common code for SGLs would be
nice. If you look at e.g. SCSI drivers most of them have a simple loop of
mapping the SG table and then copying the fields into the hardware SGL.
This would be a very common case for a helper.
That whole thing of course opens the question if we want a pure
in-memory version of the dma_addr_t/len tuple. IMHO that is the best
way to migrate and allows sharing code easily. We can look into ways
to avoid that more for drivers that care, but most drivers are
probably best served with it to keep the code simple and make the
conversion easier.
> I'm also cooking something that should let us build a way to iommu map
> a bio_vec very efficiently, which should transform this into a single
> indirect call into the iommu driver per bio_vec, and a single radix
> walk/etc.
I assume you mean an array of bio_vecs here. That would indeed be nice.
We'd still potentially need a few calls for block drivers as
requests can have multiple bios and thus bio_vec arrays, but it would
still be a nice reduction of calls.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-08 23:57 ` Jason Gunthorpe
@ 2024-07-09 6:20 ` Christoph Hellwig
2024-07-09 19:03 ` Jason Gunthorpe
0 siblings, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2024-07-09 6:20 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Christoph Hellwig, Leon Romanovsky, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Keith Busch, Zeng, Oak,
Chaitanya Kulkarni, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
On Mon, Jul 08, 2024 at 08:57:21PM -0300, Jason Gunthorpe wrote:
> I understand the block stack already does this using P2P and !P2P, but
> that isn't quite enough here as we want to split principally based on
> IOMMU or !IOMMU.
Except for the powerpc bypass, IOMMU or not is a global decision,
and the bypass is per I/O. So I'm not sure what else you want there?
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-09 6:17 ` Christoph Hellwig
@ 2024-07-09 18:53 ` Jason Gunthorpe
2024-07-10 6:27 ` Christoph Hellwig
0 siblings, 1 reply; 51+ messages in thread
From: Jason Gunthorpe @ 2024-07-09 18:53 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Leon Romanovsky, Jens Axboe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Zeng, Oak, Chaitanya Kulkarni,
Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, linux-block, linux-kernel,
linux-rdma, iommu, linux-nvme, linux-pci, kvm, linux-mm
On Tue, Jul 09, 2024 at 08:17:21AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 08, 2024 at 01:52:38PM -0300, Jason Gunthorpe wrote:
> > Ideally we'd have some template code that consolidates these loops to
> > common code with driver provided hooks - there are a few ways to get
> > that efficiently in C.
> >
> > I think it will be clearer when we get to RDMA and there we have the
> > same SGL/PRP kind of split up and we can see what is sharable.
>
> I really would not want to build common code for PRPs - this is a concept
> very specific to RDMA and NVMe.
I think DRM has it too. If you are populating a GPU page table then it
is basically a convoluted PRP. Probably requires different splitting
logic than what RDMA does, but I've never looked.
> OTOH more common code for SGLs would be nice. If you look at e.g. SCSI
> drivers most of them have a simple loop of mapping the SG table and
> then copying the fields into the hardware SGL. This would be a very
> common case for a helper.
Yes, I believe this is very common.
> That whole thing of course opens the question if we want a pure
> in-memory version of the dma_addr_t/len tuple. IMHO that is the best
> way to migrate and allows sharing code easily. We can look into ways
> to avoid that more for drivers that care, but most drivers are
> probably best served with it to keep the code simple and make the
> conversion easier.
My feeling has been that this RFC is the low level interface and we
can bring our own data structure on top.
It would probably make sense to build a scatterlist v2 on top of this
that has an in-memory dma_addr_t/len list close to what we have today. Yes it costs
a memory allocation, or a larger initial allocation, but many places
may not really care. Block drivers have always allocated a SGL, for
instance.
Then the verbosity of this API is less important as we may only use it
in a few places.
My main takeaway was that we should make the dma_ops interface
simpler and more general so we can have this choice instead of welding
a single data structure through everything.
> > I'm also cooking something that should let us build a way to iommu map
> > a bio_vec very efficiently, which should transform this into a single
> > indirect call into the iommu driver per bio_vec, and a single radix
> > walk/etc.
>
> I assume you mean an array of bio_vecs here. That would indeed be nice.
> We'd still potentially need a few calls for block drivers as
> requests can have multiple bios and thus bio_vec arrays, but it would
> still be a nice reduction of calls.
Yes. iommufd has performance needs here; not sure what it will turn
into, but bio_vec[] direct to optimized radix manipulation is
something I'd be keen to see.
Jason
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-09 6:20 ` Christoph Hellwig
@ 2024-07-09 19:03 ` Jason Gunthorpe
2024-07-10 6:22 ` Christoph Hellwig
0 siblings, 1 reply; 51+ messages in thread
From: Jason Gunthorpe @ 2024-07-09 19:03 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Leon Romanovsky, Jens Axboe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Zeng, Oak, Chaitanya Kulkarni,
Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, linux-block, linux-kernel,
linux-rdma, iommu, linux-nvme, linux-pci, kvm, linux-mm
On Tue, Jul 09, 2024 at 08:20:15AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 08, 2024 at 08:57:21PM -0300, Jason Gunthorpe wrote:
> > I understand the block stack already does this using P2P and !P2P, but
> > that isn't quite enough here as we want to split principally based on
> > IOMMU or !IOMMU.
>
> Except for the powerpc bypass, IOMMU or not is a global decision,
> and the bypass is per I/O. So I'm not sure what else you want there?
For P2P we know if the DMA will go through the IOMMU or not based on
the PCIe fabric path between the initiator (the one doing the DMA) and
the target (the one providing the MMIO memory).
Depending on PCIe topology and ACS flags this path may use the IOMMU
or may skip the IOMMU.
To put it in code, the 'enum pci_p2pdma_map_type' can only be
determined once we know the initiator and target struct device.
PCI_P2PDMA_MAP_BUS_ADDR means we don't use the iommu.
PCI_P2PDMA_MAP_THRU_HOST_BRIDGE means we do.
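So the decision per mapping boils down to something like this (sketch;
the lookup of the type from the page/provider is elided, and the
function name is made up):

	static int p2p_map_type_to_path(enum pci_p2pdma_map_type type,
					bool *use_iova)
	{
		switch (type) {
		case PCI_P2PDMA_MAP_BUS_ADDR:
			/* stays below the host bridge, the IOMMU never sees it */
			*use_iova = false;
			return 0;
		case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
			/* goes through the host bridge, map like normal memory */
			*use_iova = true;
			return 0;
		default:
			/* unsupported or unknown topology */
			return -EREMOTEIO;
		}
	}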
With this API it is important that a single request always has the
same PCI_P2PDMA_MAP_* outcome, and the simplest way to do that is to
split requests if the MMIO memory changes target struct devices.
Jason
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-09 19:03 ` Jason Gunthorpe
@ 2024-07-10 6:22 ` Christoph Hellwig
2024-07-11 23:29 ` Jason Gunthorpe
0 siblings, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2024-07-10 6:22 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Christoph Hellwig, Leon Romanovsky, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Keith Busch, Zeng, Oak,
Chaitanya Kulkarni, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
On Tue, Jul 09, 2024 at 04:03:20PM -0300, Jason Gunthorpe wrote:
> > Except for the powerpc bypass, IOMMU or not is a global decision,
> > and the bypass is per I/O. So I'm not sure what else you want there?
>
> For P2P we know if the DMA will go through the IOMMU or not based on
> the PCIe fabric path between the initiator (the one doing the DMA) and
> the target (the one providing the MMIO memory).
Oh, yes. So effectively you are asking if we can arbitrarily mix
P2P sources in a single map request. I think the only sane answer
from the iommu/dma subsystem perspective is: hell no.
But that means the upper layer needs to split at such a boundary.
E.g. get_user_pages needs to look at this and stop at the boundary,
leaving the rest to the next call.
For the block layer just having one kind per BIO is fine right now,
although I could see use cases where people would want to combine
them. We can probably defer that until it is needed, though.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-09 18:53 ` Jason Gunthorpe
@ 2024-07-10 6:27 ` Christoph Hellwig
2024-07-11 23:21 ` Jason Gunthorpe
0 siblings, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2024-07-10 6:27 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Christoph Hellwig, Leon Romanovsky, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Keith Busch, Zeng, Oak,
Chaitanya Kulkarni, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
On Tue, Jul 09, 2024 at 03:53:15PM -0300, Jason Gunthorpe wrote:
> > That whole thing of course opens the question if we want a pure
> > in-memory version of the dma_addr_t/len tuple. IMHO that is the best
> > way to migrate and allows sharing code easily. We can look into ways
> > to avoid that more for drivers that care, but most drivers are
> > probably best served with it to keep the code simple and make the
> > conversion easier.
>
> My feeling has been that this RFC is the low level interface and we
> can bring our own data structure on top.
>
> It would probably make sense to build a scatterlist v2 on top of this
> > that has an in-memory dma_addr_t/len list close to what we have today
Yes, the usage of the dma_vec would be in a higher layer. But I'd
really like to see it from the beginning.
> . Yes it costs
> a memory allocation, or a larger initial allocation, but many places
> may not really care. Block drivers have always allocated a SGL, for
> instance.
Except for those optimizing for small transfers of a single segment
(like nvme).
> My main takeaway was that we should make the dma_ops interface
> simpler and more general so we can have this choice instead of welding
> a single data structure through everything.
Yes, I don't think the dma_vec should be the low-level interface.
I think a low-level interface based on physical address is the right
one. I'll see what I can do to move the single segment map interface
to be physical address based instead of page based so that we can
unify them.
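i.e. something like this instead of the page based entry point (sketch;
the dma_map_phys_attrs name and exact signature are just a guess):

	dma_addr_t dma_map_phys_attrs(struct device *dev, phys_addr_t phys,
				      size_t size, enum dma_data_direction dir,
				      unsigned long attrs);

	/* the existing page based interface then reduces to a wrapper */
	static inline dma_addr_t dma_map_page_attrs(struct device *dev,
			struct page *page, size_t offset, size_t size,
			enum dma_data_direction dir, unsigned long attrs)
	{
		return dma_map_phys_attrs(dev, page_to_phys(page) + offset,
					  size, dir, attrs);
	}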
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-10 6:27 ` Christoph Hellwig
@ 2024-07-11 23:21 ` Jason Gunthorpe
0 siblings, 0 replies; 51+ messages in thread
From: Jason Gunthorpe @ 2024-07-11 23:21 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Leon Romanovsky, Jens Axboe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Zeng, Oak, Chaitanya Kulkarni,
Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, linux-block, linux-kernel,
linux-rdma, iommu, linux-nvme, linux-pci, kvm, linux-mm
On Wed, Jul 10, 2024 at 08:27:04AM +0200, Christoph Hellwig wrote:
> On Tue, Jul 09, 2024 at 03:53:15PM -0300, Jason Gunthorpe wrote:
> > > That whole thing of course opens the question if we want a pure
> > > in-memory version of the dma_addr_t/len tuple. IMHO that is the best
> > > way to migrate and allows sharing code easily. We can look into ways
> > > to avoid that more for drivers that care, but most drivers are
> > > probably best served with it to keep the code simple and make the
> > > conversion easier.
> >
> > My feeling has been that this RFC is the low level interface and we
> > can bring our own data structure on top.
> >
> > It would probably make sense to build a scatterlist v2 on top of this
> > that has an in-memory dma_addr_t/len list close to what we have today
>
> Yes, the usage of the dma_vec would be in a higher layer. But I'd
> really like to see it from the beginning.
Well, let's start with agreeing on this layer's API and be confident it
can succeed.
Then I'd say to look at RDMA as it is a logical place to build such a
data structure and we can build something that at least does what RDMA
needs.
I need something anyhow to plumb through to DMABUF and over to iommufd
and VFIO, can't skip out on it :)
> Yes, I don't think the dma_vec should be the low-level interface.
> I think a low-level interface based on physical address is the right
> one. I'll see what I can do to move the single segment map interface
> to be physical address based instead of page based so that we can
> unify them.
Yeah, I've been talking to Matthew explaining that starting at the DMA
API makes the most sense and lets us remove mandatory struct page
entanglements there. Then we can start to examine other layers. Having
a consistent option in the DMA API to be physically based with a
memory type fits with that plan.
Jason
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-10 6:22 ` Christoph Hellwig
@ 2024-07-11 23:29 ` Jason Gunthorpe
2024-07-12 4:54 ` Christoph Hellwig
0 siblings, 1 reply; 51+ messages in thread
From: Jason Gunthorpe @ 2024-07-11 23:29 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Leon Romanovsky, Jens Axboe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Zeng, Oak, Chaitanya Kulkarni,
Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, linux-block, linux-kernel,
linux-rdma, iommu, linux-nvme, linux-pci, kvm, linux-mm
On Wed, Jul 10, 2024 at 08:22:12AM +0200, Christoph Hellwig wrote:
> On Tue, Jul 09, 2024 at 04:03:20PM -0300, Jason Gunthorpe wrote:
> > > Except for the powerpc bypass, IOMMU or not is a global decision,
> > > and the bypass is per I/O. So I'm not sure what else you want there?
> >
> > For P2P we know if the DMA will go through the IOMMU or not based on
> > the PCIe fabric path between the initiator (the one doing the DMA) and
> > the target (the one providing the MMIO memory).
>
> Oh, yes. So effectively you are asking if we can arbitrarily mix
> P2P sources in a single map request. I think the only sane answer
> from the iommu/dma subsystem perspective is: hell no.
Well, today we can mix them and the dma_map_sg will sort it out. With
this new API we can't anymore.
So this little detail needs to be taken care of somehow as well, and I
didn't see it in this RFC.
> For the block layer just having one kind per BIO is fine right now,
> although I could see use cases where people would want to combine
> them. We can probably defer that until it is needed, though.
Do you have an application in mind that would want multi-kind per BIO?
Jason
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-11 23:29 ` Jason Gunthorpe
@ 2024-07-12 4:54 ` Christoph Hellwig
2024-07-12 12:42 ` Jason Gunthorpe
0 siblings, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2024-07-12 4:54 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Christoph Hellwig, Leon Romanovsky, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Keith Busch, Zeng, Oak,
Chaitanya Kulkarni, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
On Thu, Jul 11, 2024 at 08:29:17PM -0300, Jason Gunthorpe wrote:
> So this little detail needs to be taken care of somehow as well, and I
> didn't see it in this RFC.
Yes. The same goes for generally not mixing P2P and non-P2P.
>
> > For the block layer just having one kind per BIO is fine right now,
> > although I could see use cases where people would want to combine
> > them. We can probably defer that until it is needed, though.
>
> Do you have an application in mind that would want multi-kind per BIO?
If you are doing, say, garbage collection in a file system, and do a write
that sources data from multiple devices, where some sit directly on the
root port and others behind a switch.
This is all purely hypothetical, and I'm happy to just check for it
and reject it for now.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-12 4:54 ` Christoph Hellwig
@ 2024-07-12 12:42 ` Jason Gunthorpe
2024-07-13 5:24 ` Christoph Hellwig
0 siblings, 1 reply; 51+ messages in thread
From: Jason Gunthorpe @ 2024-07-12 12:42 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Leon Romanovsky, Jens Axboe, Robin Murphy, Joerg Roedel,
Will Deacon, Keith Busch, Zeng, Oak, Chaitanya Kulkarni,
Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Marek Szyprowski,
Jérôme Glisse, Andrew Morton, linux-block, linux-kernel,
linux-rdma, iommu, linux-nvme, linux-pci, kvm, linux-mm
On Fri, Jul 12, 2024 at 06:54:22AM +0200, Christoph Hellwig wrote:
> This is all purely hypothetical, and I'm happy to just check for it
> and reject it for now.
I do know a patch set is cooking to allow mixing ZONE_DEVICE P2P and
anon memory in the same VMA ala HMM with transparent migration of
ZONE_DEVICE to anon.
In this situation userspace will be generating IO with no idea about
any P2P/!P2P boundaries.
Jason
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
2024-07-12 12:42 ` Jason Gunthorpe
@ 2024-07-13 5:24 ` Christoph Hellwig
0 siblings, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2024-07-13 5:24 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Christoph Hellwig, Leon Romanovsky, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Keith Busch, Zeng, Oak,
Chaitanya Kulkarni, Sagi Grimberg, Bjorn Helgaas, Logan Gunthorpe,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Marek Szyprowski, Jérôme Glisse, Andrew Morton,
linux-block, linux-kernel, linux-rdma, iommu, linux-nvme,
linux-pci, kvm, linux-mm
On Fri, Jul 12, 2024 at 09:42:37AM -0300, Jason Gunthorpe wrote:
> On Fri, Jul 12, 2024 at 06:54:22AM +0200, Christoph Hellwig wrote:
>
> > This is all purely hypothetical, and I'm happy to just check for it
> > and reject it for now.
>
> I do know a patch set is cooking to allow mixing ZONE_DEVICE P2P and
> anon memory in the same VMA ala HMM with transparent migration of
> ZONE_DEVICE to anon.
>
> In this situation userspace will be generating IO with no idea about
> any P2P/!P2P boundaries.
Yes. And as said from the beginning of these discussions, I think the
right way is to change the gup code so that a single call to
get/pin_user_pages always returns either only P2P pages or only non-P2P
pages, with FOLL_PCI_P2PDMA just controlling whether P2P pages are
allowed at all. Similarly it could easily return only one kind of P2P
pages per call.
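The caller-visible effect would be roughly this (sketch; the real change
would live inside gup itself, and the caller would unpin the trimmed tail):

	/* number of leading pages of a single P2P kind; assumes npages >= 1 */
	static long pages_of_one_kind(struct page **pages, long npages)
	{
		struct dev_pagemap *pgmap = is_pci_p2pdma_page(pages[0]) ?
					    pages[0]->pgmap : NULL;
		long i;

		for (i = 1; i < npages; i++) {
			struct dev_pagemap *cur = is_pci_p2pdma_page(pages[i]) ?
						  pages[i]->pgmap : NULL;

			if (cur != pgmap)
				break;
		}
		return i;
	}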
^ permalink raw reply [flat|nested] 51+ messages in thread
end of thread, other threads:[~2024-07-13 5:24 UTC | newest]
Thread overview: 51+ messages
2024-07-02 9:09 [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 01/18] dma-mapping: query DMA memory type Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 02/18] dma-mapping: provide an interface to allocate IOVA Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 03/18] dma-mapping: check if IOVA can be used Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 04/18] dma-mapping: implement link range API Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 05/18] mm/hmm: let users to tag specific PFN with DMA mapped bit Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 06/18] dma-mapping: provide callbacks to link/unlink HMM PFNs to specific IOVA Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 07/18] iommu/dma: Provide an interface to allow preallocate IOVA Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 08/18] iommu/dma: Implement link/unlink ranges callbacks Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 09/18] RDMA/umem: Preallocate and cache IOVA for UMEM ODP Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 10/18] RDMA/umem: Store ODP access mask information in PFN Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 11/18] RDMA/core: Separate DMA mapping to caching IOVA and page linkage Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 12/18] RDMA/umem: Prevent UMEM ODP creation with SWIOTLB Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 13/18] vfio/mlx5: Explicitly use number of pages instead of allocated length Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 14/18] vfio/mlx5: Rewrite create mkey flow to allow better code reuse Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 15/18] vfio/mlx5: Explicitly store page list Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 16/18] vfio/mlx5: Convert vfio to use DMA link API Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 17/18] block: export helper to get segment max size Leon Romanovsky
2024-07-02 9:09 ` [RFC PATCH v1 18/18] nvme-pci: use new dma API Leon Romanovsky
2024-07-04 15:23 ` Robin Murphy
2024-07-04 17:16 ` Leon Romanovsky
2024-07-05 5:58 ` Christoph Hellwig
2024-07-05 18:48 ` Leon Romanovsky
2024-07-06 6:08 ` Christoph Hellwig
2024-07-03 5:42 ` [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API Christoph Hellwig
2024-07-03 10:42 ` Zhu Yanjun
2024-07-03 10:52 ` Leon Romanovsky
2024-07-03 14:35 ` Christoph Hellwig
2024-07-03 15:51 ` Leon Romanovsky
2024-07-04 7:48 ` Christoph Hellwig
2024-07-04 13:18 ` Leon Romanovsky
2024-07-05 6:00 ` Christoph Hellwig
2024-07-08 16:52 ` Jason Gunthorpe
2024-07-09 6:17 ` Christoph Hellwig
2024-07-09 18:53 ` Jason Gunthorpe
2024-07-10 6:27 ` Christoph Hellwig
2024-07-11 23:21 ` Jason Gunthorpe
2024-07-05 22:53 ` Chaitanya Kulkarni
2024-07-06 6:26 ` Christoph Hellwig
2024-07-07 9:16 ` Leon Romanovsky
2024-07-07 12:45 ` Leon Romanovsky
2024-07-05 6:39 ` Christoph Hellwig
2024-07-07 9:45 ` Leon Romanovsky
2024-07-08 23:57 ` Jason Gunthorpe
2024-07-09 6:20 ` Christoph Hellwig
2024-07-09 19:03 ` Jason Gunthorpe
2024-07-10 6:22 ` Christoph Hellwig
2024-07-11 23:29 ` Jason Gunthorpe
2024-07-12 4:54 ` Christoph Hellwig
2024-07-12 12:42 ` Jason Gunthorpe
2024-07-13 5:24 ` Christoph Hellwig