* [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages ***
@ 2025-07-18 11:51 Yonatan Maman
  2025-07-18 11:51 ` [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages Yonatan Maman
                   ` (6 more replies)
  0 siblings, 7 replies; 37+ messages in thread
From: Yonatan Maman @ 2025-07-18 11:51 UTC (permalink / raw)
  To: Jérôme Glisse, Andrew Morton, Jason Gunthorpe,
	Leon Romanovsky
  Cc: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Alistair Popple, Ben Skeggs, Michael Guralnik, Or Har-Toov,
	Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma, dri-devel,
	nouveau, linux-kernel, Yonatan Maman

From: Yonatan Maman <Ymaman@Nvidia.com>

This patch series aims to enable Peer-to-Peer (P2P) DMA access in
GPU-centric applications that utilize RDMA and device private pages. This
enhancement reduces data transfer overhead by allowing the GPU to directly
expose device private page data to devices such as NICs, eliminating the
need to bounce the data through system RAM, which is otherwise the only
way device private page data can be exposed.

To fully support Peer-to-Peer for device private pages, the following
changes are proposed:

`Memory Management (MM)`
 * Leverage struct dev_pagemap_ops to support P2P page operations: this
modification ensures that the GPU can directly expose device private pages
for P2P DMA.
 * Extend hmm_range_fault to support P2P mappings for device private
pages (instead of faulting them back to system RAM).

`IB Drivers`
Add the HMM_PFN_ALLOW_P2P flag to the hmm_range_fault call: this flag
indicates the need for P2P mapping, enabling IB drivers to efficiently
handle P2P DMA requests.

`Nouveau driver`
Implement the get_dma_pfn_for_device callback in Nouveau: this update
integrates P2P DMA support into the Nouveau driver, allowing it to handle
P2P page operations seamlessly.

`MLX5 Driver`
Utilize the NIC's Address Translation Service (ATS) for ODP memory to
optimize P2P DMA for device private pages. Also, when P2P DMA mapping
fails due to inaccessible bridges, fall back to standard DMA, which uses
host memory, for the affected PFNs.
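
A minimal sketch of the intended caller flow (illustrative only, based on
the flags added in this series; mmap locking, notifier retry loops and the
actual DMA mapping are elided, and the function/variable names are
placeholders):

#include <linux/hmm.h>
#include <linux/mmu_notifier.h>

/*
 * Fault a user VA range and allow HMM to return a P2P-capable PFN for
 * device private pages (HMM_PFN_ALLOW_P2P is added in patch 1).
 */
static int p2p_fault_range(struct mmu_interval_notifier *notifier,
			   unsigned long start, unsigned long npages,
			   unsigned long *pfns)
{
	struct hmm_range range = {
		.notifier	= notifier,
		.start		= start,
		.end		= start + (npages << PAGE_SHIFT),
		.hmm_pfns	= pfns,
		/* fault pages in and allow a P2P alias when available */
		.default_flags	= HMM_PFN_REQ_FAULT | HMM_PFN_ALLOW_P2P,
		.pfn_flags_mask	= 0,
	};

	range.notifier_seq = mmu_interval_read_begin(notifier);
	return hmm_range_fault(&range);
}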

Previous version:
https://lore.kernel.org/linux-mm/20241201103659.420677-1-ymaman@nvidia.com/
https://lore.kernel.org/linux-mm/20241015152348.3055360-1-ymaman@nvidia.com/

Yonatan Maman (5):
  mm/hmm: HMM API to enable P2P DMA for device private pages
  nouveau/dmem: HMM P2P DMA for private dev pages
  IB/core: P2P DMA for device private pages
  RDMA/mlx5: Enable P2P DMA with fallback mechanism
  RDMA/mlx5: Enabling ATS for ODP memory

 drivers/gpu/drm/nouveau/nouveau_dmem.c | 110 +++++++++++++++++++++++++
 drivers/infiniband/core/umem_odp.c     |   4 +
 drivers/infiniband/hw/mlx5/mlx5_ib.h   |   6 +-
 drivers/infiniband/hw/mlx5/odp.c       |  24 +++++-
 include/linux/hmm.h                    |   3 +-
 include/linux/memremap.h               |   8 ++
 mm/hmm.c                               |  57 ++++++++++---
 7 files changed, 195 insertions(+), 17 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
  2025-07-18 11:51 [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages *** Yonatan Maman
@ 2025-07-18 11:51 ` Yonatan Maman
  2025-07-18 14:17   ` Matthew Wilcox
  2025-07-21  6:59   ` Christoph Hellwig
  2025-07-18 11:51 ` [PATCH v2 2/5] nouveau/dmem: HMM P2P DMA for private dev pages Yonatan Maman
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 37+ messages in thread
From: Yonatan Maman @ 2025-07-18 11:51 UTC (permalink / raw)
  To: Jérôme Glisse, Andrew Morton, Jason Gunthorpe,
	Leon Romanovsky
  Cc: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Alistair Popple, Ben Skeggs, Michael Guralnik, Or Har-Toov,
	Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma, dri-devel,
	nouveau, linux-kernel, Yonatan Maman, Gal Shalom

From: Yonatan Maman <Ymaman@Nvidia.com>

hmm_range_fault() by default triggers a page fault on device private
pages when the HMM_PFN_REQ_FAULT flag is set, migrating them to RAM. In
some cases, such as with RDMA devices, the migration overhead between the
device (e.g., GPU) and the CPU, and vice versa, significantly degrades
performance. Thus, enabling Peer-to-Peer (P2P) DMA access for device
private pages can be crucial for minimizing data transfer overhead.

Introduce an API to support P2P DMA for device private pages, which
includes:
 - Leveraging struct dev_pagemap_ops for a P2P page callback
   (get_dma_pfn_for_device). The callback maps the page for P2P DMA and
   returns the PFN of the corresponding PCI P2PDMA page.

 - Utilizing hmm_range_fault for initializing P2P DMA. The API adds
   the HMM_PFN_ALLOW_P2P flag so that the hmm_range_fault caller can
   request P2P. If set, and the owner device supports P2P,
   hmm_range_fault first attempts to establish the P2P connection via
   get_dma_pfn_for_device. In case of failure or lack of support,
   hmm_range_fault continues with the regular flow of migrating the
   page to RAM.

This change does not affect existing users of hmm_range_fault, because
both the caller and the page owner must explicitly opt in and support it
before a P2P connection is attempted.
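
For reference, a provider-side implementation is expected to look roughly
like this (hypothetical sketch only; the my_dev_* names are placeholders,
and the real Nouveau implementation follows in patch 2):

#include <linux/memremap.h>

/* made-up helper: translate a private page to the PFN of the matching
 * BAR-backed (PCI P2PDMA) page */
static unsigned long my_dev_private_page_to_bar_pfn(struct page *page);

static int my_dev_get_dma_pfn_for_device(struct page *private_page,
					 unsigned long *dma_pfn)
{
	*dma_pfn = my_dev_private_page_to_bar_pfn(private_page);
	if (!*dma_pfn)
		return -ENOMEM;
	return 0;
}

static const struct dev_pagemap_ops my_dev_pagemap_ops = {
	/* .page_free and .migrate_to_ram as before ... */
	.get_dma_pfn_for_device	= my_dev_get_dma_pfn_for_device,
};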

Signed-off-by: Yonatan Maman <Ymaman@Nvidia.com>
Signed-off-by: Gal Shalom <GalShalom@Nvidia.com>
---
 include/linux/hmm.h      |  2 ++
 include/linux/memremap.h |  8 ++++++
 mm/hmm.c                 | 57 +++++++++++++++++++++++++++++++---------
 3 files changed, 55 insertions(+), 12 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index db75ffc949a7..988c98c0edcc 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -27,6 +27,7 @@ struct mmu_interval_notifier;
  * HMM_PFN_P2PDMA_BUS - Bus mapped P2P transfer
  * HMM_PFN_DMA_MAPPED - Flag preserved on input-to-output transformation
  *                      to mark that page is already DMA mapped
+ * HMM_PFN_ALLOW_P2P - Allow returning PCI P2PDMA page
  *
  * On input:
  * 0                 - Return the current state of the page, do not fault it.
@@ -47,6 +48,7 @@ enum hmm_pfn_flags {
 	HMM_PFN_DMA_MAPPED = 1UL << (BITS_PER_LONG - 4),
 	HMM_PFN_P2PDMA     = 1UL << (BITS_PER_LONG - 5),
 	HMM_PFN_P2PDMA_BUS = 1UL << (BITS_PER_LONG - 6),
+	HMM_PFN_ALLOW_P2P = 1UL << (BITS_PER_LONG - 7),
 
 	HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 11),
 
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 4aa151914eab..79becc37df00 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -89,6 +89,14 @@ struct dev_pagemap_ops {
 	 */
 	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
 
+	/*
+	 * Used for private (un-addressable) device memory only. Return a
+	 * corresponding PFN for a page that can be mapped to device
+	 * (e.g using dma_map_page)
+	 */
+	int (*get_dma_pfn_for_device)(struct page *private_page,
+				      unsigned long *dma_pfn);
+
 	/*
 	 * Handle the memory failure happens on a range of pfns.  Notify the
 	 * processes who are using these pfns, and try to recover the data on
diff --git a/mm/hmm.c b/mm/hmm.c
index feac86196a65..089e522b346b 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -232,6 +232,49 @@ static inline unsigned long pte_to_hmm_pfn_flags(struct hmm_range *range,
 	return pte_write(pte) ? (HMM_PFN_VALID | HMM_PFN_WRITE) : HMM_PFN_VALID;
 }
 
+static bool hmm_handle_device_private(struct hmm_range *range,
+				      unsigned long pfn_req_flags,
+				      swp_entry_t entry,
+				      unsigned long *hmm_pfn)
+{
+	struct page *page = pfn_swap_entry_to_page(entry);
+	struct dev_pagemap *pgmap = page_pgmap(page);
+	int ret;
+
+	pfn_req_flags &= range->pfn_flags_mask;
+	pfn_req_flags |= range->default_flags;
+
+	/*
+	 * Don't fault in device private pages owned by the caller,
+	 * just report the PFN.
+	 */
+	if (pgmap->owner == range->dev_private_owner) {
+		*hmm_pfn = swp_offset_pfn(entry);
+		goto found;
+	}
+
+	/*
+	 * P2P for supported pages, and according to caller request
+	 * translate the private page to the match P2P page if it fails
+	 * continue with the regular flow
+	 */
+	if (pfn_req_flags & HMM_PFN_ALLOW_P2P &&
+	    pgmap->ops->get_dma_pfn_for_device) {
+		ret = pgmap->ops->get_dma_pfn_for_device(page, hmm_pfn);
+		if (!ret)
+			goto found;
+
+	}
+
+	return false;
+
+found:
+	*hmm_pfn |= HMM_PFN_VALID;
+	if (is_writable_device_private_entry(entry))
+		*hmm_pfn |= HMM_PFN_WRITE;
+	return true;
+}
+
 static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 			      unsigned long end, pmd_t *pmdp, pte_t *ptep,
 			      unsigned long *hmm_pfn)
@@ -255,19 +298,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 	if (!pte_present(pte)) {
 		swp_entry_t entry = pte_to_swp_entry(pte);
 
-		/*
-		 * Don't fault in device private pages owned by the caller,
-		 * just report the PFN.
-		 */
 		if (is_device_private_entry(entry) &&
-		    page_pgmap(pfn_swap_entry_to_page(entry))->owner ==
-		    range->dev_private_owner) {
-			cpu_flags = HMM_PFN_VALID;
-			if (is_writable_device_private_entry(entry))
-				cpu_flags |= HMM_PFN_WRITE;
-			new_pfn_flags = swp_offset_pfn(entry) | cpu_flags;
-			goto out;
-		}
+		    hmm_handle_device_private(range, pfn_req_flags, entry, hmm_pfn))
+			return 0;
 
 		required_fault =
 			hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 2/5] nouveau/dmem: HMM P2P DMA for private dev pages
  2025-07-18 11:51 [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages *** Yonatan Maman
  2025-07-18 11:51 ` [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages Yonatan Maman
@ 2025-07-18 11:51 ` Yonatan Maman
  2025-07-21  7:00   ` Christoph Hellwig
  2025-07-18 11:51 ` [PATCH v2 3/5] IB/core: P2P DMA for device private pages Yonatan Maman
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 37+ messages in thread
From: Yonatan Maman @ 2025-07-18 11:51 UTC (permalink / raw)
  To: Jérôme Glisse, Andrew Morton, Jason Gunthorpe,
	Leon Romanovsky
  Cc: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Alistair Popple, Ben Skeggs, Michael Guralnik, Or Har-Toov,
	Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma, dri-devel,
	nouveau, linux-kernel, Yonatan Maman, Gal Shalom

From: Yonatan Maman <Ymaman@Nvidia.com>

Enabling Peer-to-Peer DMA (P2P DMA) access in GPU-centric applications
is crucial for minimizing data transfer overhead (e.g., for RDMA use
cases).

This change aims to enable that capability for Nouveau over HMM device
private pages. P2P DMA for private device pages allows the GPU to
directly exchange data with other devices (e.g., NICs) without needing
to traverse system RAM.

To fully support Peer-to-Peer for device private pages, the following
changes are made:

 - Introduce struct nouveau_dmem_hmm_p2p within struct nouveau_dmem
   to manage BAR1 PCI P2P memory. p2p_start_addr holds the virtual
   address allocated with pci_alloc_p2pmem(), and p2p_size represents
   the allocated size of the PCI P2P memory.

 - nouveau_dmem_init - Ensure BAR1 accessibility and assign struct
   pages (PCI_P2P_PAGE) for all BAR1 pages. Introduce
   nouveau_alloc_bar1_pci_p2p_mem in nouveau_dmem to expose BAR1 for
   use as P2P memory via pci_p2pdma_add_resource and implement static
   allocation and assignment of struct pages using pci_alloc_p2pmem.
   This function will be called from nouveau_dmem_init, and failure
   triggers a warning message instead of driver failure.

 - nouveau_dmem_fini - Ensure BAR1 PCI P2P memory is properly
   destroyed during driver cleanup. Introduce
   nouveau_destroy_bar1_pci_p2p_mem to handle freeing of PCI P2P
   memory associated with Nouveau BAR1. Modify nouveau_dmem_fini to
   call nouveau_destroy_bar1_pci_p2p_mem.

 - Implement the Nouveau `get_dma_pfn_for_device` callback - map the
   chunk through BAR1 using `io_mem_reserve` if no mapping exists.
   Retrieve the pre-allocated P2P virtual address and size from
   `hmm_p2p`. Calculate the page offset within BAR1 and return the
   corresponding P2P PFN.

Signed-off-by: Yonatan Maman <Ymaman@Nvidia.com>
Reviewed-by: Gal Shalom <GalShalom@Nvidia.com>
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 110 +++++++++++++++++++++++++
 1 file changed, 110 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index ca4932a150e3..acac1449d8cb 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -40,6 +40,9 @@
 #include <linux/hmm.h>
 #include <linux/memremap.h>
 #include <linux/migrate.h>
+#include <linux/pci-p2pdma.h>
+#include <nvkm/core/pci.h>
+
 
 /*
  * FIXME: this is ugly right now we are using TTM to allocate vram and we pin
@@ -77,9 +80,15 @@ struct nouveau_dmem_migrate {
 	struct nouveau_channel *chan;
 };
 
+struct nouveau_dmem_hmm_p2p {
+	size_t p2p_size;
+	void *p2p_start_addr;
+};
+
 struct nouveau_dmem {
 	struct nouveau_drm *drm;
 	struct nouveau_dmem_migrate migrate;
+	struct nouveau_dmem_hmm_p2p hmm_p2p;
 	struct list_head chunks;
 	struct mutex mutex;
 	struct page *free_pages;
@@ -159,6 +168,60 @@ static int nouveau_dmem_copy_one(struct nouveau_drm *drm, struct page *spage,
 	return 0;
 }
 
+static int nouveau_dmem_bar1_mapping(struct nouveau_bo *nvbo,
+				     unsigned long long *bus_addr)
+{
+	int ret;
+	struct ttm_resource *mem = nvbo->bo.resource;
+
+	if (mem->bus.offset) {
+		*bus_addr = mem->bus.offset;
+		return 0;
+	}
+
+	if (PFN_UP(nvbo->bo.base.size) > PFN_UP(nvbo->bo.resource->size))
+		return -EINVAL;
+
+	ret = ttm_bo_reserve(&nvbo->bo, false, false, NULL);
+	if (ret)
+		return ret;
+
+	ret = nvbo->bo.bdev->funcs->io_mem_reserve(nvbo->bo.bdev, mem);
+	*bus_addr = mem->bus.offset;
+
+	ttm_bo_unreserve(&nvbo->bo);
+	return ret;
+}
+
+static int nouveau_dmem_get_dma_pfn(struct page *private_page,
+				    unsigned long *dma_pfn)
+{
+	int ret;
+	unsigned long long offset_in_chunk;
+	unsigned long long chunk_bus_addr;
+	unsigned long long bar1_base_addr;
+	struct nouveau_drm *drm = page_to_drm(private_page);
+	struct nouveau_bo *nvbo = nouveau_page_to_chunk(private_page)->bo;
+	struct nvkm_device *nv_device = nvxx_device(drm);
+	size_t p2p_size = drm->dmem->hmm_p2p.p2p_size;
+
+	bar1_base_addr = nv_device->func->resource_addr(nv_device, 1);
+	offset_in_chunk =
+		(page_to_pfn(private_page) << PAGE_SHIFT) -
+		nouveau_page_to_chunk(private_page)->pagemap.range.start;
+
+	ret = nouveau_dmem_bar1_mapping(nvbo, &chunk_bus_addr);
+	if (ret)
+		return ret;
+
+	*dma_pfn = chunk_bus_addr + offset_in_chunk;
+	if (!p2p_size || *dma_pfn > bar1_base_addr + p2p_size ||
+	    *dma_pfn < bar1_base_addr)
+		return -ENOMEM;
+
+	return 0;
+}
+
 static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
 {
 	struct nouveau_drm *drm = page_to_drm(vmf->page);
@@ -222,6 +285,7 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
 static const struct dev_pagemap_ops nouveau_dmem_pagemap_ops = {
 	.page_free		= nouveau_dmem_page_free,
 	.migrate_to_ram		= nouveau_dmem_migrate_to_ram,
+	.get_dma_pfn_for_device = nouveau_dmem_get_dma_pfn,
 };
 
 static int
@@ -407,14 +471,31 @@ nouveau_dmem_evict_chunk(struct nouveau_dmem_chunk *chunk)
 	kvfree(dma_addrs);
 }
 
+static void nouveau_destroy_bar1_pci_p2p_mem(struct nouveau_drm *drm,
+					     struct pci_dev *pdev,
+					     void *p2p_start_addr,
+					     size_t p2p_size)
+{
+	if (p2p_size)
+		pci_free_p2pmem(pdev, p2p_start_addr, p2p_size);
+
+	NV_INFO(drm, "PCI P2P memory freed(%p)\n", p2p_start_addr);
+}
+
 void
 nouveau_dmem_fini(struct nouveau_drm *drm)
 {
 	struct nouveau_dmem_chunk *chunk, *tmp;
+	struct nvkm_device *nv_device = nvxx_device(drm);
 
 	if (drm->dmem == NULL)
 		return;
 
+	nouveau_destroy_bar1_pci_p2p_mem(drm,
+					 nv_device->func->pci(nv_device)->pdev,
+					 drm->dmem->hmm_p2p.p2p_start_addr,
+					 drm->dmem->hmm_p2p.p2p_size);
+
 	mutex_lock(&drm->dmem->mutex);
 
 	list_for_each_entry_safe(chunk, tmp, &drm->dmem->chunks, list) {
@@ -579,10 +660,28 @@ nouveau_dmem_migrate_init(struct nouveau_drm *drm)
 	return -ENODEV;
 }
 
+static int nouveau_alloc_bar1_pci_p2p_mem(struct nouveau_drm *drm,
+					  struct pci_dev *pdev, size_t size,
+					  void **pp2p_start_addr)
+{
+	int ret;
+
+	ret = pci_p2pdma_add_resource(pdev, 1, size, 0);
+	if (ret)
+		return ret;
+
+	*pp2p_start_addr = pci_alloc_p2pmem(pdev, size);
+
+	NV_INFO(drm, "PCI P2P memory allocated(%p)\n", *pp2p_start_addr);
+	return 0;
+}
+
 void
 nouveau_dmem_init(struct nouveau_drm *drm)
 {
 	int ret;
+	struct nvkm_device *nv_device = nvxx_device(drm);
+	size_t bar1_size;
 
 	/* This only make sense on PASCAL or newer */
 	if (drm->client.device.info.family < NV_DEVICE_INFO_V0_PASCAL)
@@ -603,6 +702,17 @@ nouveau_dmem_init(struct nouveau_drm *drm)
 		kfree(drm->dmem);
 		drm->dmem = NULL;
 	}
+
+	/* Expose BAR1 for HMM P2P Memory */
+	bar1_size = nv_device->func->resource_size(nv_device, 1);
+	ret = nouveau_alloc_bar1_pci_p2p_mem(drm,
+					     nv_device->func->pci(nv_device)->pdev,
+					     bar1_size,
+					     &drm->dmem->hmm_p2p.p2p_start_addr);
+	drm->dmem->hmm_p2p.p2p_size = (ret) ? 0 : bar1_size;
+	if (ret)
+		NV_WARN(drm,
+			"PCI P2P memory allocation failed, HMM P2P won't be supported\n");
 }
 
 static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 3/5] IB/core: P2P DMA for device private pages
  2025-07-18 11:51 [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages *** Yonatan Maman
  2025-07-18 11:51 ` [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages Yonatan Maman
  2025-07-18 11:51 ` [PATCH v2 2/5] nouveau/dmem: HMM P2P DMA for private dev pages Yonatan Maman
@ 2025-07-18 11:51 ` Yonatan Maman
  2025-07-18 11:51 ` [PATCH v2 4/5] RDMA/mlx5: Enable P2P DMA with fallback mechanism Yonatan Maman
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 37+ messages in thread
From: Yonatan Maman @ 2025-07-18 11:51 UTC (permalink / raw)
  To: Jérôme Glisse, Andrew Morton, Jason Gunthorpe,
	Leon Romanovsky
  Cc: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Alistair Popple, Ben Skeggs, Michael Guralnik, Or Har-Toov,
	Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma, dri-devel,
	nouveau, linux-kernel, Yonatan Maman, Gal Shalom

From: Yonatan Maman <Ymaman@Nvidia.com>

Request Peer-to-Peer (P2P) DMA in the hmm_range_fault call, using the
capabilities introduced in mm/hmm. By setting range.default_flags to
HMM_PFN_REQ_FAULT | HMM_PFN_ALLOW_P2P, HMM attempts to set up P2P DMA
for device private pages (instead of faulting and migrating them).

This uses P2P DMA to avoid the overhead of migrating data between the
device (e.g., a GPU) and system memory, providing performance benefits
for GPU-centric applications that use RDMA and device private pages.

Signed-off-by: Yonatan Maman <Ymaman@Nvidia.com>
Signed-off-by: Gal Shalom <GalShalom@Nvidia.com>
---
 drivers/infiniband/core/umem_odp.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index b1c44ec1a3f3..7ba80ed4977c 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -362,6 +362,10 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
 			range.default_flags |= HMM_PFN_REQ_WRITE;
 	}
 
+	if (access_mask & HMM_PFN_ALLOW_P2P)
+		range.default_flags |= HMM_PFN_ALLOW_P2P;
+
+	range.pfn_flags_mask = HMM_PFN_ALLOW_P2P;
 	range.hmm_pfns = &(umem_odp->map.pfn_list[pfn_start_idx]);
 	timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 4/5] RDMA/mlx5: Enable P2P DMA with fallback mechanism
  2025-07-18 11:51 [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages *** Yonatan Maman
                   ` (2 preceding siblings ...)
  2025-07-18 11:51 ` [PATCH v2 3/5] IB/core: P2P DMA for device private pages Yonatan Maman
@ 2025-07-18 11:51 ` Yonatan Maman
  2025-07-21  7:03   ` Christoph Hellwig
  2025-07-18 11:51 ` [PATCH v2 5/5] RDMA/mlx5: Enabling ATS for ODP memory Yonatan Maman
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 37+ messages in thread
From: Yonatan Maman @ 2025-07-18 11:51 UTC (permalink / raw)
  To: Jérôme Glisse, Andrew Morton, Jason Gunthorpe,
	Leon Romanovsky
  Cc: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Alistair Popple, Ben Skeggs, Michael Guralnik, Or Har-Toov,
	Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma, dri-devel,
	nouveau, linux-kernel, Yonatan Maman, Gal Shalom

From: Yonatan Maman <Ymaman@Nvidia.com>

Add support for P2P DMA for mlx5 NIC devices with an automatic fallback
to standard DMA when P2P mapping fails.

The change requests P2P DMA by default by setting the HMM_PFN_ALLOW_P2P
flag. If the P2P mapping fails with an -EFAULT error, the operation is
retried without the P2P flag, falling back to the standard DMA flow
(using host memory).

Signed-off-by: Yonatan Maman <Ymaman@Nvidia.com>
Signed-off-by: Gal Shalom <GalShalom@Nvidia.com>
---
 drivers/infiniband/hw/mlx5/odp.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index f6abd64f07f7..6a0171117f48 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -715,6 +715,10 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
 	if (odp->umem.writable && !downgrade)
 		access_mask |= HMM_PFN_WRITE;
 
+	/*
+	 * try fault with HMM_PFN_ALLOW_P2P flag
+	 */
+	access_mask |= HMM_PFN_ALLOW_P2P;
 	np = ib_umem_odp_map_dma_and_lock(odp, user_va, bcnt, access_mask, fault);
 	if (np < 0)
 		return np;
@@ -724,6 +728,18 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
 	 * ib_umem_odp_map_dma_and_lock already checks this.
 	 */
 	ret = mlx5r_umr_update_xlt(mr, start_idx, np, page_shift, xlt_flags);
+	if (ret == -EFAULT) {
+		/*
+		 * Indicate P2P Mapping Error, retry with no HMM_PFN_ALLOW_P2P
+		 */
+		mutex_unlock(&odp->umem_mutex);
+		access_mask &= ~(HMM_PFN_ALLOW_P2P);
+		np = ib_umem_odp_map_dma_and_lock(odp, user_va, bcnt, access_mask, fault);
+		if (np < 0)
+			return np;
+		ret = mlx5r_umr_update_xlt(mr, start_idx, np, page_shift, xlt_flags);
+	}
+
 	mutex_unlock(&odp->umem_mutex);
 
 	if (ret < 0) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 5/5] RDMA/mlx5: Enabling ATS for ODP memory
  2025-07-18 11:51 [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages *** Yonatan Maman
                   ` (3 preceding siblings ...)
  2025-07-18 11:51 ` [PATCH v2 4/5] RDMA/mlx5: Enable P2P DMA with fallback mechanism Yonatan Maman
@ 2025-07-18 11:51 ` Yonatan Maman
  2025-07-20 10:30 ` [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages *** Leon Romanovsky
  2025-07-21  6:54 ` Christoph Hellwig
  6 siblings, 0 replies; 37+ messages in thread
From: Yonatan Maman @ 2025-07-18 11:51 UTC (permalink / raw)
  To: Jérôme Glisse, Andrew Morton, Jason Gunthorpe,
	Leon Romanovsky
  Cc: Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Alistair Popple, Ben Skeggs, Michael Guralnik, Or Har-Toov,
	Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma, dri-devel,
	nouveau, linux-kernel, Yonatan Maman, Gal Shalom

From: Yonatan Maman <Ymaman@Nvidia.com>

ATS (Address Translation Services) is mainly used to optimize PCI
Peer-to-Peer transfers and prevent bus failures. This change enables
ATS for ODP memory, optimizing P2P DMA for ODP-backed regions (e.g.,
P2P DMA to device private pages exposed through ODP).

Signed-off-by: Yonatan Maman <Ymaman@Nvidia.com>
Signed-off-by: Gal Shalom <GalShalom@Nvidia.com>
---
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index fde859d207ae..a7b7a565b7e8 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -1734,9 +1734,9 @@ static inline bool rt_supported(int ts_cap)
 static inline bool mlx5_umem_needs_ats(struct mlx5_ib_dev *dev,
 				       struct ib_umem *umem, int access_flags)
 {
-	if (!MLX5_CAP_GEN(dev->mdev, ats) || !umem->is_dmabuf)
-		return false;
-	return access_flags & IB_ACCESS_RELAXED_ORDERING;
+	if (MLX5_CAP_GEN(dev->mdev, ats) && (umem->is_dmabuf || umem->is_odp))
+		return access_flags & IB_ACCESS_RELAXED_ORDERING;
+	return false;
 }
 
 int set_roce_addr(struct mlx5_ib_dev *dev, u32 port_num,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
  2025-07-18 11:51 ` [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages Yonatan Maman
@ 2025-07-18 14:17   ` Matthew Wilcox
  2025-07-18 14:44     ` Jason Gunthorpe
  2025-07-21  6:59   ` Christoph Hellwig
  1 sibling, 1 reply; 37+ messages in thread
From: Matthew Wilcox @ 2025-07-18 14:17 UTC (permalink / raw)
  To: Yonatan Maman
  Cc: Jérôme Glisse, Andrew Morton, Jason Gunthorpe,
	Leon Romanovsky, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Alistair Popple, Ben Skeggs, Michael Guralnik,
	Or Har-Toov, Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma,
	dri-devel, nouveau, linux-kernel, Gal Shalom

On Fri, Jul 18, 2025 at 02:51:08PM +0300, Yonatan Maman wrote:
> +++ b/include/linux/memremap.h
> @@ -89,6 +89,14 @@ struct dev_pagemap_ops {
>  	 */
>  	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
>  
> +	/*
> +	 * Used for private (un-addressable) device memory only. Return a
> +	 * corresponding PFN for a page that can be mapped to device
> +	 * (e.g using dma_map_page)
> +	 */
> +	int (*get_dma_pfn_for_device)(struct page *private_page,
> +				      unsigned long *dma_pfn);

This makes no sense.  If a page is addressable then it has a PFN.
If a page is not addressable then it doesn't have a PFN.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
  2025-07-18 14:17   ` Matthew Wilcox
@ 2025-07-18 14:44     ` Jason Gunthorpe
  2025-07-21  0:11       ` Alistair Popple
  2025-07-21 13:23       ` Matthew Wilcox
  0 siblings, 2 replies; 37+ messages in thread
From: Jason Gunthorpe @ 2025-07-18 14:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Yonatan Maman, Jérôme Glisse, Andrew Morton,
	Leon Romanovsky, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Alistair Popple, Ben Skeggs, Michael Guralnik,
	Or Har-Toov, Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma,
	dri-devel, nouveau, linux-kernel, Gal Shalom

On Fri, Jul 18, 2025 at 03:17:00PM +0100, Matthew Wilcox wrote:
> On Fri, Jul 18, 2025 at 02:51:08PM +0300, Yonatan Maman wrote:
> > +++ b/include/linux/memremap.h
> > @@ -89,6 +89,14 @@ struct dev_pagemap_ops {
> >  	 */
> >  	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> >  
> > +	/*
> > +	 * Used for private (un-addressable) device memory only. Return a
> > +	 * corresponding PFN for a page that can be mapped to device
> > +	 * (e.g using dma_map_page)
> > +	 */
> > +	int (*get_dma_pfn_for_device)(struct page *private_page,
> > +				      unsigned long *dma_pfn);
> 
> This makes no sense.  If a page is addressable then it has a PFN.
> If a page is not addressable then it doesn't have a PFN.

The DEVICE_PRIVATE pages have a PFN, but it is not usable for
anything.

This is effectively converting from a DEVICE_PRIVATE page to an actual
DMA'able address of some kind. The DEVICE_PRIVATE is just a non-usable
proxy, like a swap entry, for where the real data is sitting.
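
In the CPU page tables it only ever shows up as a non-present swap-style
entry, roughly like this (sketch using the same helpers hmm.c already
relies on):

	if (!pte_present(pte)) {
		swp_entry_t entry = pte_to_swp_entry(pte);

		if (is_device_private_entry(entry)) {
			/* proxy struct page; the data lives on the device */
			struct page *page = pfn_swap_entry_to_page(entry);

			/* page_pgmap(page)->owner identifies the owning driver */
		}
	}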

Jason


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages ***
  2025-07-18 11:51 [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages *** Yonatan Maman
                   ` (4 preceding siblings ...)
  2025-07-18 11:51 ` [PATCH v2 5/5] RDMA/mlx5: Enabling ATS for ODP memory Yonatan Maman
@ 2025-07-20 10:30 ` Leon Romanovsky
  2025-07-20 21:03   ` Yonatan Maman
  2025-07-21  6:54 ` Christoph Hellwig
  6 siblings, 1 reply; 37+ messages in thread
From: Leon Romanovsky @ 2025-07-20 10:30 UTC (permalink / raw)
  To: Yonatan Maman
  Cc: Jérôme Glisse, Andrew Morton, Jason Gunthorpe,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Alistair Popple, Ben Skeggs, Michael Guralnik, Or Har-Toov,
	Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma, dri-devel,
	nouveau, linux-kernel

On Fri, Jul 18, 2025 at 02:51:07PM +0300, Yonatan Maman wrote:
> From: Yonatan Maman <Ymaman@Nvidia.com>
> 
> This patch series aims to enable Peer-to-Peer (P2P) DMA access in
> GPU-centric applications that utilize RDMA and private device pages. This
> enhancement reduces data transfer overhead by allowing the GPU to directly
> expose device private page data to devices such as NICs, eliminating the
> need to traverse system RAM, which is the native method for exposing
> device private page data.
> 
> To fully support Peer-to-Peer for device private pages, the following
> changes are proposed:
> 
> `Memory Management (MM)`
>  * Leverage struct pagemap_ops to support P2P page operations: This
> modification ensures that the GPU can directly map device private pages
> for P2P DMA.
>  * Utilize hmm_range_fault to support P2P connections for device private
> pages (instead of Page fault)
> 
> `IB Drivers`
> Add TRY_P2P_REQ flag for the hmm_range_fault call: This flag indicates the
> need for P2P mapping, enabling IB drivers to efficiently handle P2P DMA
> requests.
> 
> `Nouveau driver`
> Add support for the Nouveau p2p_page callback function: This update
> integrates P2P DMA support into the Nouveau driver, allowing it to handle
> P2P page operations seamlessly.
> 
> `MLX5 Driver`
> Utilize NIC Address Translation Service (ATS) for ODP memory, to optimize
> DMA P2P for private device pages. Also, when P2P DMA mapping fails due to
> inaccessible bridges, the system falls back to standard DMA, which uses host
> memory, for the affected PFNs

I'm probably missing something very important, but why can't you always
perform p2p if two devices support it? It is strange that IB and not HMM
has a fallback mode.

Thanks

> 
> Previous version:
> https://lore.kernel.org/linux-mm/20241201103659.420677-1-ymaman@nvidia.com/
> https://lore.kernel.org/linux-mm/20241015152348.3055360-1-ymaman@nvidia.com/
> 
> Yonatan Maman (5):
>   mm/hmm: HMM API to enable P2P DMA for device private pages
>   nouveau/dmem: HMM P2P DMA for private dev pages
>   IB/core: P2P DMA for device private pages
>   RDMA/mlx5: Enable P2P DMA with fallback mechanism
>   RDMA/mlx5: Enabling ATS for ODP memory
> 
>  drivers/gpu/drm/nouveau/nouveau_dmem.c | 110 +++++++++++++++++++++++++
>  drivers/infiniband/core/umem_odp.c     |   4 +
>  drivers/infiniband/hw/mlx5/mlx5_ib.h   |   6 +-
>  drivers/infiniband/hw/mlx5/odp.c       |  24 +++++-
>  include/linux/hmm.h                    |   3 +-
>  include/linux/memremap.h               |   8 ++
>  mm/hmm.c                               |  57 ++++++++++---
>  7 files changed, 195 insertions(+), 17 deletions(-)
> 
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages ***
  2025-07-20 10:30 ` [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages *** Leon Romanovsky
@ 2025-07-20 21:03   ` Yonatan Maman
  2025-07-21  6:49     ` Leon Romanovsky
  0 siblings, 1 reply; 37+ messages in thread
From: Yonatan Maman @ 2025-07-20 21:03 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jérôme Glisse, Andrew Morton, Jason Gunthorpe,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Alistair Popple, Ben Skeggs, Michael Guralnik, Or Har-Toov,
	Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma, dri-devel,
	nouveau, linux-kernel



On 20/07/2025 13:30, Leon Romanovsky wrote:
> 
> 
> On Fri, Jul 18, 2025 at 02:51:07PM +0300, Yonatan Maman wrote:
>> From: Yonatan Maman <Ymaman@Nvidia.com>
>>
>> This patch series aims to enable Peer-to-Peer (P2P) DMA access in
>> GPU-centric applications that utilize RDMA and private device pages. This
>> enhancement reduces data transfer overhead by allowing the GPU to directly
>> expose device private page data to devices such as NICs, eliminating the
>> need to traverse system RAM, which is the native method for exposing
>> device private page data.
>>
>> To fully support Peer-to-Peer for device private pages, the following
>> changes are proposed:
>>
>> `Memory Management (MM)`
>>   * Leverage struct pagemap_ops to support P2P page operations: This
>> modification ensures that the GPU can directly map device private pages
>> for P2P DMA.
>>   * Utilize hmm_range_fault to support P2P connections for device private
>> pages (instead of Page fault)
>>
>> `IB Drivers`
>> Add TRY_P2P_REQ flag for the hmm_range_fault call: This flag indicates the
>> need for P2P mapping, enabling IB drivers to efficiently handle P2P DMA
>> requests.
>>
>> `Nouveau driver`
>> Add support for the Nouveau p2p_page callback function: This update
>> integrates P2P DMA support into the Nouveau driver, allowing it to handle
>> P2P page operations seamlessly.
>>
>> `MLX5 Driver`
>> Utilize NIC Address Translation Service (ATS) for ODP memory, to optimize
>> DMA P2P for private device pages. Also, when P2P DMA mapping fails due to
>> inaccessible bridges, the system falls back to standard DMA, which uses host
>> memory, for the affected PFNs
> 
> I'm probably missing something very important, but why can't you always
> perform p2p if two devices support it? It is strange that IB and not HMM
> has a fallback mode.
> 
> Thanks
>

P2P mapping can fail even when both devices support it, due to PCIe 
bridge limitations or IOMMU restrictions that block direct P2P access. 
The fallback is in IB rather than HMM because HMM only manages memory 
pages - it doesn't do DMA mapping. The IB driver does the actual DMA 
operations, so it knows when P2P mapping fails and can fall back to 
copying through system memory.
In fact, hmm_range_fault doesn't have information about the destination 
device that will perform the DMA mapping.
>>
>> Previous version:
>> https://lore.kernel.org/linux-mm/20241201103659.420677-1-ymaman@nvidia.com/
>> https://lore.kernel.org/linux-mm/20241015152348.3055360-1-ymaman@nvidia.com/
>>
>> Yonatan Maman (5):
>>    mm/hmm: HMM API to enable P2P DMA for device private pages
>>    nouveau/dmem: HMM P2P DMA for private dev pages
>>    IB/core: P2P DMA for device private pages
>>    RDMA/mlx5: Enable P2P DMA with fallback mechanism
>>    RDMA/mlx5: Enabling ATS for ODP memory
>>
>>   drivers/gpu/drm/nouveau/nouveau_dmem.c | 110 +++++++++++++++++++++++++
>>   drivers/infiniband/core/umem_odp.c     |   4 +
>>   drivers/infiniband/hw/mlx5/mlx5_ib.h   |   6 +-
>>   drivers/infiniband/hw/mlx5/odp.c       |  24 +++++-
>>   include/linux/hmm.h                    |   3 +-
>>   include/linux/memremap.h               |   8 ++
>>   mm/hmm.c                               |  57 ++++++++++---
>>   7 files changed, 195 insertions(+), 17 deletions(-)
>>
>> --
>> 2.34.1
>>



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
  2025-07-18 14:44     ` Jason Gunthorpe
@ 2025-07-21  0:11       ` Alistair Popple
  2025-07-21 13:23       ` Matthew Wilcox
  1 sibling, 0 replies; 37+ messages in thread
From: Alistair Popple @ 2025-07-21  0:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Wilcox, Yonatan Maman, Jérôme Glisse,
	Andrew Morton, Leon Romanovsky, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Ben Skeggs, Michael Guralnik,
	Or Har-Toov, Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma,
	dri-devel, nouveau, linux-kernel, Gal Shalom

On Fri, Jul 18, 2025 at 11:44:42AM -0300, Jason Gunthorpe wrote:
> On Fri, Jul 18, 2025 at 03:17:00PM +0100, Matthew Wilcox wrote:
> > On Fri, Jul 18, 2025 at 02:51:08PM +0300, Yonatan Maman wrote:
> > > +++ b/include/linux/memremap.h
> > > @@ -89,6 +89,14 @@ struct dev_pagemap_ops {
> > >  	 */
> > >  	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> > >  
> > > +	/*
> > > +	 * Used for private (un-addressable) device memory only. Return a
> > > +	 * corresponding PFN for a page that can be mapped to device
> > > +	 * (e.g using dma_map_page)
> > > +	 */
> > > +	int (*get_dma_pfn_for_device)(struct page *private_page,
> > > +				      unsigned long *dma_pfn);
> > 
> > This makes no sense.  If a page is addressable then it has a PFN.
> > If a page is not addressable then it doesn't have a PFN.
> 
> The DEVICE_PRIVATE pages have a PFN, but it is not usable for
> anything.
> 
> This is effectively converting from a DEVICE_PRIVATE page to an actual
> DMA'able address of some kind. The DEVICE_PRIVATE is just a non-usable
> proxy, like a swap entry, for where the real data is sitting.

Yes, it's on my backlog to start looking at using something other than a real
PFN for this proxy. Because having it as an actual PFN has caused us all sorts
of random issues as it still needs to reserve a real physical address range
which may or may not be available on a given machine.

> 
> Jason
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages ***
  2025-07-20 21:03   ` Yonatan Maman
@ 2025-07-21  6:49     ` Leon Romanovsky
  2025-07-23  4:03       ` Jason Gunthorpe
  0 siblings, 1 reply; 37+ messages in thread
From: Leon Romanovsky @ 2025-07-21  6:49 UTC (permalink / raw)
  To: Yonatan Maman
  Cc: Jérôme Glisse, Andrew Morton, Jason Gunthorpe,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Alistair Popple, Ben Skeggs, Michael Guralnik, Or Har-Toov,
	Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma, dri-devel,
	nouveau, linux-kernel

On Mon, Jul 21, 2025 at 12:03:51AM +0300, Yonatan Maman wrote:
> 
> 
> On 20/07/2025 13:30, Leon Romanovsky wrote:
> > 
> > 
> > On Fri, Jul 18, 2025 at 02:51:07PM +0300, Yonatan Maman wrote:
> > > From: Yonatan Maman <Ymaman@Nvidia.com>
> > > 
> > > This patch series aims to enable Peer-to-Peer (P2P) DMA access in
> > > GPU-centric applications that utilize RDMA and private device pages. This
> > > enhancement reduces data transfer overhead by allowing the GPU to directly
> > > expose device private page data to devices such as NICs, eliminating the
> > > need to traverse system RAM, which is the native method for exposing
> > > device private page data.
> > > 
> > > To fully support Peer-to-Peer for device private pages, the following
> > > changes are proposed:
> > > 
> > > `Memory Management (MM)`
> > >   * Leverage struct pagemap_ops to support P2P page operations: This
> > > modification ensures that the GPU can directly map device private pages
> > > for P2P DMA.
> > >   * Utilize hmm_range_fault to support P2P connections for device private
> > > pages (instead of Page fault)
> > > 
> > > `IB Drivers`
> > > Add TRY_P2P_REQ flag for the hmm_range_fault call: This flag indicates the
> > > need for P2P mapping, enabling IB drivers to efficiently handle P2P DMA
> > > requests.
> > > 
> > > `Nouveau driver`
> > > Add support for the Nouveau p2p_page callback function: This update
> > > integrates P2P DMA support into the Nouveau driver, allowing it to handle
> > > P2P page operations seamlessly.
> > > 
> > > `MLX5 Driver`
> > > Utilize NIC Address Translation Service (ATS) for ODP memory, to optimize
> > > DMA P2P for private device pages. Also, when P2P DMA mapping fails due to
> > > inaccessible bridges, the system falls back to standard DMA, which uses host
> > > memory, for the affected PFNs
> > 
> > I'm probably missing something very important, but why can't you always
> > perform p2p if two devices support it? It is strange that IB and not HMM
> > has a fallback mode.
> > 
> > Thanks
> > 
> 
> P2P mapping can fail even when both devices support it, due to PCIe bridge
> limitations or IOMMU restrictions that block direct P2P access.

Yes, that is how p2p works. The decision whether p2p is supported or not
is made by pci_p2pdma_map_type(). That function needs to know which two
devices will be connected.

With the proposed HMM_PFN_ALLOW_P2P flag, you don't provide device
information, and on a system with more than two p2p devices you will get
a completely random result.


> The fallback is in IB rather than HMM because HMM only manages memory pages - it doesn't
> do DMA mapping. The IB driver does the actual DMA operations, so it knows
> when P2P mapping fails and can fall back to copying through system memory.

The thing is that in the proposed patch, IB doesn't check that p2p is
established with the right device.
https://lore.kernel.org/all/20250718115112.3881129-5-ymaman@nvidia.com/

> In fact, hmm_range_fault doesn't have information about the destination
> device that will perform the DMA mapping.

So probably you need to teach HMM to perform page faults for a specific device.

Thanks

> > > 
> > > Previous version:
> > > https://lore.kernel.org/linux-mm/20241201103659.420677-1-ymaman@nvidia.com/
> > > https://lore.kernel.org/linux-mm/20241015152348.3055360-1-ymaman@nvidia.com/
> > > 
> > > Yonatan Maman (5):
> > >    mm/hmm: HMM API to enable P2P DMA for device private pages
> > >    nouveau/dmem: HMM P2P DMA for private dev pages
> > >    IB/core: P2P DMA for device private pages
> > >    RDMA/mlx5: Enable P2P DMA with fallback mechanism
> > >    RDMA/mlx5: Enabling ATS for ODP memory
> > > 
> > >   drivers/gpu/drm/nouveau/nouveau_dmem.c | 110 +++++++++++++++++++++++++
> > >   drivers/infiniband/core/umem_odp.c     |   4 +
> > >   drivers/infiniband/hw/mlx5/mlx5_ib.h   |   6 +-
> > >   drivers/infiniband/hw/mlx5/odp.c       |  24 +++++-
> > >   include/linux/hmm.h                    |   3 +-
> > >   include/linux/memremap.h               |   8 ++
> > >   mm/hmm.c                               |  57 ++++++++++---
> > >   7 files changed, 195 insertions(+), 17 deletions(-)
> > > 
> > > --
> > > 2.34.1
> > > 
> 
> 
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages ***
  2025-07-18 11:51 [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages *** Yonatan Maman
                   ` (5 preceding siblings ...)
  2025-07-20 10:30 ` [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages *** Leon Romanovsky
@ 2025-07-21  6:54 ` Christoph Hellwig
  6 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2025-07-21  6:54 UTC (permalink / raw)
  To: Yonatan Maman
  Cc: Jérôme Glisse, Andrew Morton, Jason Gunthorpe,
	Leon Romanovsky, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Alistair Popple, Ben Skeggs, Michael Guralnik,
	Or Har-Toov, Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma,
	dri-devel, nouveau, linux-kernel

Please use a more suitable name for your series.  There's absolutely
nothing GPU-specific here, and reusing the name from a complete
trainwreck that your company pushed over the last few years doesn't
help either.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
  2025-07-18 11:51 ` [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages Yonatan Maman
  2025-07-18 14:17   ` Matthew Wilcox
@ 2025-07-21  6:59   ` Christoph Hellwig
  2025-07-22  5:42     ` Yonatan Maman
  2025-08-01 16:52     ` Jason Gunthorpe
  1 sibling, 2 replies; 37+ messages in thread
From: Christoph Hellwig @ 2025-07-21  6:59 UTC (permalink / raw)
  To: Yonatan Maman
  Cc: Jérôme Glisse, Andrew Morton, Jason Gunthorpe,
	Leon Romanovsky, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Alistair Popple, Ben Skeggs, Michael Guralnik,
	Or Har-Toov, Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma,
	dri-devel, nouveau, linux-kernel, Gal Shalom

On Fri, Jul 18, 2025 at 02:51:08PM +0300, Yonatan Maman wrote:
> From: Yonatan Maman <Ymaman@Nvidia.com>
> 
> hmm_range_fault() by default triggers a page fault on device private
> pages when the HMM_PFN_REQ_FAULT flag is set, migrating them to RAM. In
> some cases, such as with RDMA devices, the migration overhead between the
> device (e.g., GPU) and the CPU, and vice versa, significantly degrades
> performance. Thus, enabling Peer-to-Peer (P2P) DMA access for device
> private pages can be crucial for minimizing data transfer overhead.

You don't enable DMA for device private pages.  You allow discovering
a DMAable alias for device private pages.

Also absolutely nothing GPU specific here.

> +	/*
> +	 * Don't fault in device private pages owned by the caller,
> +	 * just report the PFN.
> +	 */
> +	if (pgmap->owner == range->dev_private_owner) {
> +		*hmm_pfn = swp_offset_pfn(entry);
> +		goto found;

This is dangerous because it mixes actual DMAable alias PFNs with the
device private fake PFNs.  Maybe your hardware / driver can handle
it, but just leaking this out is not a good idea.

> +		    hmm_handle_device_private(range, pfn_req_flags, entry, hmm_pfn))

Overly long line here.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 2/5] nouveau/dmem: HMM P2P DMA for private dev pages
  2025-07-18 11:51 ` [PATCH v2 2/5] nouveau/dmem: HMM P2P DMA for private dev pages Yonatan Maman
@ 2025-07-21  7:00   ` Christoph Hellwig
  2025-07-22  5:23     ` Yonatan Maman
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2025-07-21  7:00 UTC (permalink / raw)
  To: Yonatan Maman
  Cc: Jérôme Glisse, Andrew Morton, Jason Gunthorpe,
	Leon Romanovsky, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Alistair Popple, Ben Skeggs, Michael Guralnik,
	Or Har-Toov, Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma,
	dri-devel, nouveau, linux-kernel, Gal Shalom

On Fri, Jul 18, 2025 at 02:51:09PM +0300, Yonatan Maman wrote:
> +	.get_dma_pfn_for_device = nouveau_dmem_get_dma_pfn,

Please don't shorten the method name prefix in the implementation
symbol name, as that makes reading / refactoring the code a pain.

This might also be a hint that your method name is too long.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 4/5] RDMA/mlx5: Enable P2P DMA with fallback mechanism
  2025-07-18 11:51 ` [PATCH v2 4/5] RDMA/mlx5: Enable P2P DMA with fallback mechanism Yonatan Maman
@ 2025-07-21  7:03   ` Christoph Hellwig
  2025-07-23  3:55     ` Jason Gunthorpe
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2025-07-21  7:03 UTC (permalink / raw)
  To: Yonatan Maman
  Cc: Jérôme Glisse, Andrew Morton, Jason Gunthorpe,
	Leon Romanovsky, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Alistair Popple, Ben Skeggs, Michael Guralnik,
	Or Har-Toov, Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma,
	dri-devel, nouveau, linux-kernel, Gal Shalom

On Fri, Jul 18, 2025 at 02:51:11PM +0300, Yonatan Maman wrote:
> From: Yonatan Maman <Ymaman@Nvidia.com>
> 
> Add support for P2P for MLX5 NIC devices with automatic fallback to
> standard DMA when P2P mapping fails.

That's not how the P2P API works.  You need to check the P2P availability
higher up.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
  2025-07-18 14:44     ` Jason Gunthorpe
  2025-07-21  0:11       ` Alistair Popple
@ 2025-07-21 13:23       ` Matthew Wilcox
  2025-07-22  0:49         ` Alistair Popple
  1 sibling, 1 reply; 37+ messages in thread
From: Matthew Wilcox @ 2025-07-21 13:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yonatan Maman, Jérôme Glisse, Andrew Morton,
	Leon Romanovsky, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Alistair Popple, Ben Skeggs, Michael Guralnik,
	Or Har-Toov, Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma,
	dri-devel, nouveau, linux-kernel, Gal Shalom

On Fri, Jul 18, 2025 at 11:44:42AM -0300, Jason Gunthorpe wrote:
> On Fri, Jul 18, 2025 at 03:17:00PM +0100, Matthew Wilcox wrote:
> > On Fri, Jul 18, 2025 at 02:51:08PM +0300, Yonatan Maman wrote:
> > > +++ b/include/linux/memremap.h
> > > @@ -89,6 +89,14 @@ struct dev_pagemap_ops {
> > >  	 */
> > >  	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> > >  
> > > +	/*
> > > +	 * Used for private (un-addressable) device memory only. Return a
> > > +	 * corresponding PFN for a page that can be mapped to device
> > > +	 * (e.g using dma_map_page)
> > > +	 */
> > > +	int (*get_dma_pfn_for_device)(struct page *private_page,
> > > +				      unsigned long *dma_pfn);
> > 
> > This makes no sense.  If a page is addressable then it has a PFN.
> > If a page is not addressable then it doesn't have a PFN.
> 
> The DEVICE_PRIVATE pages have a PFN, but it is not usable for
> anything.

OK, then I don't understand what DEVICE PRIVATE means.

I thought it was for memory on a PCIe device that isn't even visible
through a BAR and so the CPU has no way of addressing it directly.
But now you say that it has a PFN, which means it has a physical
address, which means it's accessible to the CPU.

So what is it?

> This is effectively converting from a DEVICE_PRIVATE page to an actual
> DMA'able address of some kind. The DEVICE_PRIVATE is just a non-usable
> proxy, like a swap entry, for where the real data is sitting.
> 
> Jason
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
  2025-07-21 13:23       ` Matthew Wilcox
@ 2025-07-22  0:49         ` Alistair Popple
  2025-07-23  3:51           ` Jason Gunthorpe
  0 siblings, 1 reply; 37+ messages in thread
From: Alistair Popple @ 2025-07-22  0:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jason Gunthorpe, Yonatan Maman, Jérôme Glisse,
	Andrew Morton, Leon Romanovsky, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Ben Skeggs, Michael Guralnik,
	Or Har-Toov, Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma,
	dri-devel, nouveau, linux-kernel, Gal Shalom

On Mon, Jul 21, 2025 at 02:23:13PM +0100, Matthew Wilcox wrote:
> On Fri, Jul 18, 2025 at 11:44:42AM -0300, Jason Gunthorpe wrote:
> > On Fri, Jul 18, 2025 at 03:17:00PM +0100, Matthew Wilcox wrote:
> > > On Fri, Jul 18, 2025 at 02:51:08PM +0300, Yonatan Maman wrote:
> > > > +++ b/include/linux/memremap.h
> > > > @@ -89,6 +89,14 @@ struct dev_pagemap_ops {
> > > >  	 */
> > > >  	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> > > >  
> > > > +	/*
> > > > +	 * Used for private (un-addressable) device memory only. Return a
> > > > +	 * corresponding PFN for a page that can be mapped to device
> > > > +	 * (e.g using dma_map_page)
> > > > +	 */
> > > > +	int (*get_dma_pfn_for_device)(struct page *private_page,
> > > > +				      unsigned long *dma_pfn);
> > > 
> > > This makes no sense.  If a page is addressable then it has a PFN.
> > > If a page is not addressable then it doesn't have a PFN.
> > 
> > The DEVICE_PRIVATE pages have a PFN, but it is not usable for
> > anything.
> 
> OK, then I don't understand what DEVICE PRIVATE means.
> 
> I thought it was for memory on a PCIe device that isn't even visible
> through a BAR and so the CPU has no way of addressing it directly.

Correct.

> But now you say that it has a PFN, which means it has a physical
> address, which means it's accessible to the CPU.

Having a PFN doesn't mean it's actually accessible to the CPU. It is a real
physical address in the CPU address space, but it is a completely bogus/invalid
address - if the CPU actually tries to access it, it will cause a machine check
or whatever other exception gets generated when accessing an invalid physical
address.

Obviously we're careful to avoid that. The PFN is used solely to get to/from a
struct page (via pfn_to_page() or page_to_pfn()).

> So what is it?

IMHO a hack, because obviously we shouldn't require real physical addresses for
something the CPU can't actually address anyway and this causes real problems
(eg. it doesn't actually work on anything other than x86_64). There's no reason
the "PFN" we store in device-private entries couldn't instead just be an index
into some data structure holding pointers to the struct pages. So instead of
using pfn_to_page()/page_to_pfn() we would use device_private_index_to_page()
and page_to_device_private_index().

We discussed this briefly at LSFMM, I think your suggestion for a data structure
was to use a maple tree. I'm yet to look at this more deeply but I'd like to
figure out where memdescs fit in this picture too.

 - Alistair

> > This is effectively converting from a DEVICE_PRIVATE page to an actual
> > DMA'able address of some kind. The DEVICE_PRIVATE is just a non-usable
> > proxy, like a swap entry, for where the real data is sitting.
> > 
> > Jason
> > 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 2/5] nouveau/dmem: HMM P2P DMA for private dev pages
  2025-07-21  7:00   ` Christoph Hellwig
@ 2025-07-22  5:23     ` Yonatan Maman
  0 siblings, 0 replies; 37+ messages in thread
From: Yonatan Maman @ 2025-07-22  5:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jérôme Glisse, Andrew Morton, Jason Gunthorpe,
	Leon Romanovsky, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Alistair Popple, Ben Skeggs, Michael Guralnik,
	Or Har-Toov, Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma,
	dri-devel, nouveau, linux-kernel, Gal Shalom



On 21/07/2025 10:00, Christoph Hellwig wrote:
> On Fri, Jul 18, 2025 at 02:51:09PM +0300, Yonatan Maman wrote:
>> +	.get_dma_pfn_for_device = nouveau_dmem_get_dma_pfn,
> 
> Please don't shorten the method name prefix in the implementation
> symbol name, as that makes reading / refactoring the code a pain.
> 
> This might also be a hint that your method name is too long.
> 

got it, will be fixed in V3, thanks.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
  2025-07-21  6:59   ` Christoph Hellwig
@ 2025-07-22  5:42     ` Yonatan Maman
  2025-08-01 16:52     ` Jason Gunthorpe
  1 sibling, 0 replies; 37+ messages in thread
From: Yonatan Maman @ 2025-07-22  5:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jérôme Glisse, Andrew Morton, Jason Gunthorpe,
	Leon Romanovsky, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Alistair Popple, Ben Skeggs, Michael Guralnik,
	Or Har-Toov, Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma,
	dri-devel, nouveau, linux-kernel, Gal Shalom



On 21/07/2025 9:59, Christoph Hellwig wrote:
> On Fri, Jul 18, 2025 at 02:51:08PM +0300, Yonatan Maman wrote:
>> From: Yonatan Maman <Ymaman@Nvidia.com>
>>
>> hmm_range_fault() by default triggers a page fault on device private
>> pages when the HMM_PFN_REQ_FAULT flag is set, migrating them to RAM. In
>> some cases, such as with RDMA devices, the migration overhead between the
>> device (e.g., GPU) and the CPU, and vice versa, significantly degrades
>> performance. Thus, enabling Peer-to-Peer (P2P) DMA access for device
>> private pages can be crucial for minimizing data transfer overhead.
> 
> You don't enable DMA for device private pages.  You allow discovering
> a DMAable alias for device private pages.
>
> Also absolutely nothing GPU specific here.
>
Ok, understood, I will change it (v3).
  >> +	/*
>> +	 * Don't fault in device private pages owned by the caller,
>> +	 * just report the PFN.
>> +	 */
>> +	if (pgmap->owner == range->dev_private_owner) {
>> +		*hmm_pfn = swp_offset_pfn(entry);
>> +		goto found;
> 
> This is dangerous because it mixes actual DMAable alias PFNs with the
> device private fake PFNs.  Maybe your hardware / driver can handle
> it, but just leaking this out is not a good idea.
>

In the current implementation, regular pci_p2p pages are returned as-is
from hmm_range_fault() - for a virtual address backed by a pci_p2p page,
it will return the corresponding PFN.
That said, we can mark these via the hmm_pfn output flags so the caller
can handle them appropriately.
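
Something along these lines (illustrative only; whether to reuse
HMM_PFN_P2PDMA or add a dedicated output bit, and the exact placement,
are still open for v3):

	/* in hmm_vma_handle_pte(), for present PTEs */
	if (is_pci_p2pdma_page(pfn_to_page(pte_pfn(pte))))
		cpu_flags |= HMM_PFN_P2PDMA;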

>> +		    hmm_handle_device_private(range, pfn_req_flags, entry, hmm_pfn))
> 
> Overly long line here.
> 

will be fixed (v3)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
  2025-07-22  0:49         ` Alistair Popple
@ 2025-07-23  3:51           ` Jason Gunthorpe
  2025-07-23  4:10             ` Alistair Popple
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Gunthorpe @ 2025-07-23  3:51 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Matthew Wilcox, Yonatan Maman, Jérôme Glisse,
	Andrew Morton, Leon Romanovsky, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Ben Skeggs, Michael Guralnik,
	Or Har-Toov, Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma,
	dri-devel, nouveau, linux-kernel, Gal Shalom

On Tue, Jul 22, 2025 at 10:49:10AM +1000, Alistair Popple wrote:
> > So what is it?
> 
> IMHO a hack, because obviously we shouldn't require real physical addresses for
> something the CPU can't actually address anyway and this causes real
> problems

IMHO what DEVICE PRIVATE really boils down to is a way to have swap
entries that point to some kind of opaque driver managed memory.

We have a lot of assumptions all over about pfn/phys to page
relationships, so anything that has a struct page also has to come with
a fake PFN today..

> (eg. it doesn't actually work on anything other than x86_64). There's no reason
> the "PFN" we store in device-private entries couldn't instead just be an index
> into some data structure holding pointers to the struct pages. So instead of
> using pfn_to_page()/page_to_pfn() we would use device_private_index_to_page()
> and page_to_device_private_index().

It could work, but any of the pfn conversions would have to be tracked
down.. Could be troublesome.

Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 4/5] RDMA/mlx5: Enable P2P DMA with fallback mechanism
  2025-07-21  7:03   ` Christoph Hellwig
@ 2025-07-23  3:55     ` Jason Gunthorpe
  2025-07-24  7:30       ` Christoph Hellwig
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Gunthorpe @ 2025-07-23  3:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Yonatan Maman, Jérôme Glisse, Andrew Morton,
	Leon Romanovsky, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Alistair Popple, Ben Skeggs, Michael Guralnik,
	Or Har-Toov, Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma,
	dri-devel, nouveau, linux-kernel, Gal Shalom

On Mon, Jul 21, 2025 at 12:03:41AM -0700, Christoph Hellwig wrote:
> On Fri, Jul 18, 2025 at 02:51:11PM +0300, Yonatan Maman wrote:
> > From: Yonatan Maman <Ymaman@Nvidia.com>
> > 
> > Add support for P2P for MLX5 NIC devices with automatic fallback to
> > standard DMA when P2P mapping fails.
> 
> That's now how the P2P API works.  You need to check the P2P availability
> higher up.

How do you mean?

This looks OKish to me, for ODP and HMM it has to check the P2P
availability on a page by page basis because every single page can be
a different origin device.

There isn't really a higher up here...

Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages ***
  2025-07-21  6:49     ` Leon Romanovsky
@ 2025-07-23  4:03       ` Jason Gunthorpe
  2025-07-23  8:44         ` Leon Romanovsky
  0 siblings, 1 reply; 37+ messages in thread
From: Jason Gunthorpe @ 2025-07-23  4:03 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Yonatan Maman, Jérôme Glisse, Andrew Morton, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter, Alistair Popple,
	Ben Skeggs, Michael Guralnik, Or Har-Toov, Daisuke Matsuda,
	Shay Drory, linux-mm, linux-rdma, dri-devel, nouveau,
	linux-kernel

On Mon, Jul 21, 2025 at 09:49:04AM +0300, Leon Romanovsky wrote:
> > In fact, hmm_range_fault doesn't have information about the destination
> > device that will perform the DMA mapping.
> 
> So probably you need to teach HMM to perform page_faults on specific device.

That isn't how the HMM side is supposed to work, this API is just
giving the one and only P2P page that is backing the device private.

The providing driver shouldn't be doing any p2pdma operations to check
feasibility.

Otherwise we would be doing p2p operations twice on every page, which
doesn't make sense.

We've consistently been saying that P2P is done on the DMA mapping
side only, and I think we should stick with that. Failing P2P is an
exception case, and the fix is to trigger page migration, which the
generic hmm code knows how to do. So calling hmm_range_fault() again
makes sense to me. I wouldn't want drivers open-coding the migration
logic in the new callback.
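
Very roughly, on the caller side (my_owner and p2p_map_failed() are
placeholders, not real kernel symbols):

range.dev_private_owner = my_owner;
ret = hmm_range_fault(&range);
if (!ret && p2p_map_failed(&range)) {
	/* drop the ownership claim so the core migrates to system RAM */
	range.dev_private_owner = NULL;
	range.default_flags |= HMM_PFN_REQ_FAULT;
	ret = hmm_range_fault(&range);
}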

Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
  2025-07-23  3:51           ` Jason Gunthorpe
@ 2025-07-23  4:10             ` Alistair Popple
  2025-07-24  8:52               ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Alistair Popple @ 2025-07-23  4:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Wilcox, Yonatan Maman, Jérôme Glisse,
	Andrew Morton, Leon Romanovsky, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Ben Skeggs, Michael Guralnik,
	Or Har-Toov, Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma,
	dri-devel, nouveau, linux-kernel, Gal Shalom

On Wed, Jul 23, 2025 at 12:51:42AM -0300, Jason Gunthorpe wrote:
> On Tue, Jul 22, 2025 at 10:49:10AM +1000, Alistair Popple wrote:
> > > So what is it?
> > 
> > IMHO a hack, because obviously we shouldn't require real physical addresses for
> > something the CPU can't actually address anyway and this causes real
> > problems
> 
> IMHO what DEVICE PRIVATE really boils down to is a way to have swap
> entries that point to some kind of opaque driver managed memory.
> 
> We have alot of assumptions all over about pfn/phys to page
> relationships so anything that has a struct page also has to come with
> a fake PFN today..

Hmm ... maybe. To get that PFN though we have to come from either a special
swap entry which we already have special cases for, or a struct page (which is
a device private page) which we mostly have to handle specially anyway. I'm not
sure there's too many places that can sensibly handle a fake PFN without somehow
already knowing it is device-private PFN.

> > (eg. it doesn't actually work on anything other than x86_64). There's no reason
> > the "PFN" we store in device-private entries couldn't instead just be an index
> > into some data structure holding pointers to the struct pages. So instead of
> > using pfn_to_page()/page_to_pfn() we would use device_private_index_to_page()
> > and page_to_device_private_index().
> 
> It could work, but any of the pfn conversions would have to be tracked
> down.. Could be troublesome.

I looked at this a while back and I'm reasonably optimistic that this is doable
because we already have to treat these specially everywhere anyway. The proof
will be writing the patches of course.

 - Alistair

> Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages ***
  2025-07-23  4:03       ` Jason Gunthorpe
@ 2025-07-23  8:44         ` Leon Romanovsky
  0 siblings, 0 replies; 37+ messages in thread
From: Leon Romanovsky @ 2025-07-23  8:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yonatan Maman, Jérôme Glisse, Andrew Morton, Lyude Paul,
	Danilo Krummrich, David Airlie, Simona Vetter, Alistair Popple,
	Ben Skeggs, Michael Guralnik, Or Har-Toov, Daisuke Matsuda,
	Shay Drory, linux-mm, linux-rdma, dri-devel, nouveau,
	linux-kernel

On Wed, Jul 23, 2025 at 01:03:47AM -0300, Jason Gunthorpe wrote:
> On Mon, Jul 21, 2025 at 09:49:04AM +0300, Leon Romanovsky wrote:
> > > In fact, hmm_range_fault doesn't have information about the destination
> > > device that will perform the DMA mapping.
> > 
> > So probably you need to teach HMM to perform page_faults on specific device.
> 
> That isn't how the HMM side is supposed to work, this API is just
> giving the one and only P2P page that is backing the device private.

I know, but somehow you need to say: "please give me p2p pages for a
specific device, and not a random device in the system as it is now".
This is what is missing from my PoV.

Thanks

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 4/5] RDMA/mlx5: Enable P2P DMA with fallback mechanism
  2025-07-23  3:55     ` Jason Gunthorpe
@ 2025-07-24  7:30       ` Christoph Hellwig
  2025-08-01 16:46         ` Jason Gunthorpe
  0 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2025-07-24  7:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Yonatan Maman, Jérôme Glisse,
	Andrew Morton, Leon Romanovsky, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Alistair Popple, Ben Skeggs,
	Michael Guralnik, Or Har-Toov, Daisuke Matsuda, Shay Drory,
	linux-mm, linux-rdma, dri-devel, nouveau, linux-kernel,
	Gal Shalom

On Wed, Jul 23, 2025 at 12:55:22AM -0300, Jason Gunthorpe wrote:
> On Mon, Jul 21, 2025 at 12:03:41AM -0700, Christoph Hellwig wrote:
> > On Fri, Jul 18, 2025 at 02:51:11PM +0300, Yonatan Maman wrote:
> > > From: Yonatan Maman <Ymaman@Nvidia.com>
> > > 
> > > Add support for P2P for MLX5 NIC devices with automatic fallback to
> > > standard DMA when P2P mapping fails.
> > 
> > That's now how the P2P API works.  You need to check the P2P availability
> > higher up.
> 
> How do you mean?
> 
> This looks OKish to me, for ODP and HMM it has to check the P2P
> availability on a page by page basis because every single page can be
> a different origin device.
> 
> There isn't really a higher up here...

The DMA API expects the caller to already check for connectability,
why can't HMM do that like everyone else?


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
  2025-07-23  4:10             ` Alistair Popple
@ 2025-07-24  8:52               ` David Hildenbrand
  2025-07-25  0:31                 ` Alistair Popple
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2025-07-24  8:52 UTC (permalink / raw)
  To: Alistair Popple, Jason Gunthorpe
  Cc: Matthew Wilcox, Yonatan Maman, Jérôme Glisse,
	Andrew Morton, Leon Romanovsky, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Ben Skeggs, Michael Guralnik,
	Or Har-Toov, Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma,
	dri-devel, nouveau, linux-kernel, Gal Shalom

On 23.07.25 06:10, Alistair Popple wrote:
> On Wed, Jul 23, 2025 at 12:51:42AM -0300, Jason Gunthorpe wrote:
>> On Tue, Jul 22, 2025 at 10:49:10AM +1000, Alistair Popple wrote:
>>>> So what is it?
>>>
>>> IMHO a hack, because obviously we shouldn't require real physical addresses for
>>> something the CPU can't actually address anyway and this causes real
>>> problems
>>
>> IMHO what DEVICE PRIVATE really boils down to is a way to have swap
>> entries that point to some kind of opaque driver managed memory.
>>
>> We have alot of assumptions all over about pfn/phys to page
>> relationships so anything that has a struct page also has to come with
>> a fake PFN today..
> 
> Hmm ... maybe. To get that PFN though we have to come from either a special
> swap entry which we already have special cases for, or a struct page (which is
> a device private page) which we mostly have to handle specially anyway. I'm not
> sure there's too many places that can sensibly handle a fake PFN without somehow
> already knowing it is device-private PFN.
> 
>>> (eg. it doesn't actually work on anything other than x86_64). There's no reason
>>> the "PFN" we store in device-private entries couldn't instead just be an index
>>> into some data structure holding pointers to the struct pages. So instead of
>>> using pfn_to_page()/page_to_pfn() we would use device_private_index_to_page()
>>> and page_to_device_private_index().
>>
>> It could work, but any of the pfn conversions would have to be tracked
>> down.. Could be troublesome.
> 
> I looked at this a while back and I'm reasonably optimistic that this is doable
> because we already have to treat these specially everywhere anyway.

What would that look like?

E.g., we have code like

if (is_device_private_entry(entry)) {
	page = pfn_swap_entry_to_page(entry);
	folio = page_folio(page);

	...
	folio_get(folio);
	...
}

We could easily stop allowing pfn_swap_entry_to_page(), turning these 
into non-pfn swap entries.

Would it then be something like

if (is_device_private_entry(entry)) {
	page = device_private_entry_to_page(entry);
	
	...
}

Whereby device_private_entry_to_page() obtains the "struct page" not via 
the PFN but some other magical (index) value?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
  2025-07-24  8:52               ` David Hildenbrand
@ 2025-07-25  0:31                 ` Alistair Popple
  2025-07-25  9:51                   ` David Hildenbrand
  2025-08-01 16:40                   ` Jason Gunthorpe
  0 siblings, 2 replies; 37+ messages in thread
From: Alistair Popple @ 2025-07-25  0:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jason Gunthorpe, Matthew Wilcox, Yonatan Maman,
	Jérôme Glisse, Andrew Morton, Leon Romanovsky,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ben Skeggs, Michael Guralnik, Or Har-Toov, Daisuke Matsuda,
	Shay Drory, linux-mm, linux-rdma, dri-devel, nouveau,
	linux-kernel, Gal Shalom

On Thu, Jul 24, 2025 at 10:52:54AM +0200, David Hildenbrand wrote:
> On 23.07.25 06:10, Alistair Popple wrote:
> > On Wed, Jul 23, 2025 at 12:51:42AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Jul 22, 2025 at 10:49:10AM +1000, Alistair Popple wrote:
> > > > > So what is it?
> > > > 
> > > > IMHO a hack, because obviously we shouldn't require real physical addresses for
> > > > something the CPU can't actually address anyway and this causes real
> > > > problems
> > > 
> > > IMHO what DEVICE PRIVATE really boils down to is a way to have swap
> > > entries that point to some kind of opaque driver managed memory.
> > > 
> > > We have alot of assumptions all over about pfn/phys to page
> > > relationships so anything that has a struct page also has to come with
> > > a fake PFN today..
> > 
> > Hmm ... maybe. To get that PFN though we have to come from either a special
> > swap entry which we already have special cases for, or a struct page (which is
> > a device private page) which we mostly have to handle specially anyway. I'm not
> > sure there's too many places that can sensibly handle a fake PFN without somehow
> > already knowing it is device-private PFN.
> > 
> > > > (eg. it doesn't actually work on anything other than x86_64). There's no reason
> > > > the "PFN" we store in device-private entries couldn't instead just be an index
> > > > into some data structure holding pointers to the struct pages. So instead of
> > > > using pfn_to_page()/page_to_pfn() we would use device_private_index_to_page()
> > > > and page_to_device_private_index().
> > > 
> > > It could work, but any of the pfn conversions would have to be tracked
> > > down.. Could be troublesome.
> > 
> > I looked at this a while back and I'm reasonably optimistic that this is doable
> > because we already have to treat these specially everywhere anyway.
> How would that look like?
> 
> E.g., we have code like
> 
> if (is_device_private_entry(entry)) {
> 	page = pfn_swap_entry_to_page(entry);
> 	folio = page_folio(page);
> 
> 	...
> 	folio_get(folio);
> 	...
> }
> 
> We could easily stop allowing pfn_swap_entry_to_page(), turning these into
> non-pfn swap entries.
> 
> Would it then be something like
> 
> if (is_device_private_entry(entry)) {
> 	page = device_private_entry_to_page(entry);
> 	
> 	...
> }
> 
> Whereby device_private_entry_to_page() obtains the "struct page" not via the
> PFN but some other magical (index) value?

Exactly. The observation being that when you convert a PTE from a swap entry
to a page we already know it is a device private entry, so can go look up the
struct page with special magic (eg. an index into some other array or data
structure).

And if you have a struct page you already know it's a device private page so if
you need to create the swap entry you can look up the magic index using some
alternate function.

The only issue would be if there were generic code paths that somehow have a
raw pfn obtained from neither a page-table walk or struct page. My assumption
(yet to be proven/tested) is that these paths don't exist.
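
Something like this, as a very rough sketch (all names made up, and an
xarray is just one possible backing structure):

static DEFINE_XARRAY_ALLOC(device_private_pages);

static struct page *device_private_entry_to_page(swp_entry_t entry)
{
	/* the swap "offset" is now an index, not a PFN */
	return xa_load(&device_private_pages, swp_offset(entry));
}

static unsigned long page_to_device_private_index(struct page *page)
{
	/* recorded at allocation time, e.g. in page->private */
	return page_private(page);
}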

 - Alistair

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
  2025-07-25  0:31                 ` Alistair Popple
@ 2025-07-25  9:51                   ` David Hildenbrand
  2025-08-01 16:40                   ` Jason Gunthorpe
  1 sibling, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2025-07-25  9:51 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Jason Gunthorpe, Matthew Wilcox, Yonatan Maman,
	Jérôme Glisse, Andrew Morton, Leon Romanovsky,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ben Skeggs, Michael Guralnik, Or Har-Toov, Daisuke Matsuda,
	Shay Drory, linux-mm, linux-rdma, dri-devel, nouveau,
	linux-kernel, Gal Shalom

On 25.07.25 02:31, Alistair Popple wrote:
> On Thu, Jul 24, 2025 at 10:52:54AM +0200, David Hildenbrand wrote:
>> On 23.07.25 06:10, Alistair Popple wrote:
>>> On Wed, Jul 23, 2025 at 12:51:42AM -0300, Jason Gunthorpe wrote:
>>>> On Tue, Jul 22, 2025 at 10:49:10AM +1000, Alistair Popple wrote:
>>>>>> So what is it?
>>>>>
>>>>> IMHO a hack, because obviously we shouldn't require real physical addresses for
>>>>> something the CPU can't actually address anyway and this causes real
>>>>> problems
>>>>
>>>> IMHO what DEVICE PRIVATE really boils down to is a way to have swap
>>>> entries that point to some kind of opaque driver managed memory.
>>>>
>>>> We have alot of assumptions all over about pfn/phys to page
>>>> relationships so anything that has a struct page also has to come with
>>>> a fake PFN today..
>>>
>>> Hmm ... maybe. To get that PFN though we have to come from either a special
>>> swap entry which we already have special cases for, or a struct page (which is
>>> a device private page) which we mostly have to handle specially anyway. I'm not
>>> sure there's too many places that can sensibly handle a fake PFN without somehow
>>> already knowing it is device-private PFN.
>>>
>>>>> (eg. it doesn't actually work on anything other than x86_64). There's no reason
>>>>> the "PFN" we store in device-private entries couldn't instead just be an index
>>>>> into some data structure holding pointers to the struct pages. So instead of
>>>>> using pfn_to_page()/page_to_pfn() we would use device_private_index_to_page()
>>>>> and page_to_device_private_index().
>>>>
>>>> It could work, but any of the pfn conversions would have to be tracked
>>>> down.. Could be troublesome.
>>>
>>> I looked at this a while back and I'm reasonably optimistic that this is doable
>>> because we already have to treat these specially everywhere anyway.
>> How would that look like?
>>
>> E.g., we have code like
>>
>> if (is_device_private_entry(entry)) {
>> 	page = pfn_swap_entry_to_page(entry);
>> 	folio = page_folio(page);
>>
>> 	...
>> 	folio_get(folio);
>> 	...
>> }
>>
>> We could easily stop allowing pfn_swap_entry_to_page(), turning these into
>> non-pfn swap entries.
>>
>> Would it then be something like
>>
>> if (is_device_private_entry(entry)) {
>> 	page = device_private_entry_to_page(entry);
>> 	
>> 	...
>> }
>>
>> Whereby device_private_entry_to_page() obtains the "struct page" not via the
>> PFN but some other magical (index) value?
> 
> Exactly. The observation being that when you convert a PTE from a swap entry
> to a page we already know it is a device private entry, so can go look up the
> struct page with special magic (eg. an index into some other array or data
> structure).
> 
> And if you have a struct page you already know it's a device private page so if
> you need to create the swap entry you can look up the magic index using some
> alternate function.
> 
> The only issue would be if there were generic code paths that somehow have a
> raw pfn obtained from neither a page-table walk or struct page. My assumption
> (yet to be proven/tested) is that these paths don't exist.

I guess memory compaction and friends don't apply to ZONE_DEVICE, and 
even memory_failure() handling goes a separate path.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
  2025-07-25  0:31                 ` Alistair Popple
  2025-07-25  9:51                   ` David Hildenbrand
@ 2025-08-01 16:40                   ` Jason Gunthorpe
  2025-08-01 16:50                     ` David Hildenbrand
  1 sibling, 1 reply; 37+ messages in thread
From: Jason Gunthorpe @ 2025-08-01 16:40 UTC (permalink / raw)
  To: Alistair Popple
  Cc: David Hildenbrand, Matthew Wilcox, Yonatan Maman,
	Jérôme Glisse, Andrew Morton, Leon Romanovsky,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ben Skeggs, Michael Guralnik, Or Har-Toov, Daisuke Matsuda,
	Shay Drory, linux-mm, linux-rdma, dri-devel, nouveau,
	linux-kernel, Gal Shalom

On Fri, Jul 25, 2025 at 10:31:25AM +1000, Alistair Popple wrote:

> The only issue would be if there were generic code paths that somehow have a
> raw pfn obtained from neither a page-table walk or struct page. My assumption
> (yet to be proven/tested) is that these paths don't exist.

hmm does it, it encodes the device private into a pfn and expects the
caller to do pfn to page.

This isn't set in stone and could be changed..

But broadly, you'd want to entirely eliminate the ability to go from
pfn to device private or from device private to pfn.

Instead you'd want to work on some (space #, space index) tuple, maybe
encoded in a pfn_t, but absolutely and typesafely distinct. Each
driver gets its own 0 based space for device private information, the
space is effectively the pgmap.
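
Something like (purely illustrative):

/* typesafe handle, deliberately not convertible to a physical PFN */
struct dev_private_ref {
	u32 space;	/* identifies the owning pgmap / driver */
	u64 index;	/* 0-based slot within that driver's space */
};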

And if you do this, maybe we don't need struct page (I mean the type!)
backing device memory at all.... Which would be a very worthwhile
project.

Do we ever even use anything in the device private struct page? Do we
refcount it?

Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 4/5] RDMA/mlx5: Enable P2P DMA with fallback mechanism
  2025-07-24  7:30       ` Christoph Hellwig
@ 2025-08-01 16:46         ` Jason Gunthorpe
  0 siblings, 0 replies; 37+ messages in thread
From: Jason Gunthorpe @ 2025-08-01 16:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Yonatan Maman, Jérôme Glisse, Andrew Morton,
	Leon Romanovsky, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Alistair Popple, Ben Skeggs, Michael Guralnik,
	Or Har-Toov, Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma,
	dri-devel, nouveau, linux-kernel, Gal Shalom

On Thu, Jul 24, 2025 at 12:30:34AM -0700, Christoph Hellwig wrote:
> On Wed, Jul 23, 2025 at 12:55:22AM -0300, Jason Gunthorpe wrote:
> > On Mon, Jul 21, 2025 at 12:03:41AM -0700, Christoph Hellwig wrote:
> > > On Fri, Jul 18, 2025 at 02:51:11PM +0300, Yonatan Maman wrote:
> > > > From: Yonatan Maman <Ymaman@Nvidia.com>
> > > > 
> > > > Add support for P2P for MLX5 NIC devices with automatic fallback to
> > > > standard DMA when P2P mapping fails.
> > > 
> > > That's now how the P2P API works.  You need to check the P2P availability
> > > higher up.
> > 
> > How do you mean?
> > 
> > This looks OKish to me, for ODP and HMM it has to check the P2P
> > availability on a page by page basis because every single page can be
> > a different origin device.
> > 
> > There isn't really a higher up here...
> 
> The DMA API expects the caller to already check for connectability,
> why can't HMM do that like everyone else?

It does, this doesn't change anything about how the DMA API works.

All this series does, and you stated it perfectly, is to allow HMM to
return the single PCI P2P alias of the device private page.

HMM already blindly returns normal P2P pages in a VMA; it should also
blindly return the P2P alias pages.

Once the P2P page is returned, the existing code in hmm_dma_map_pfn()
calls pci_p2pdma_state() to find out whether it is compatible or not.

Lifting the pci_p2pdma_state() call out of hmm_dma_map_pfn() and into
hmm_range_fault() is perhaps possible and may be reasonable, but it is
not really related to this series.
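
Conceptually the per-page decision ends up looking like this (the
wrapper name is a placeholder; the real check is the pci_p2pdma_state()
call inside hmm_dma_map_pfn()):

switch (p2pdma_map_type_of(page, attached_dev)) {	/* placeholder */
case PCI_P2PDMA_MAP_BUS_ADDR:
case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
	/* P2P works, map the alias */
	break;
case PCI_P2PDMA_MAP_NOT_SUPPORTED:
	/* fall back: migrate to system RAM and map host memory */
	break;
}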

Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
  2025-08-01 16:40                   ` Jason Gunthorpe
@ 2025-08-01 16:50                     ` David Hildenbrand
  2025-08-01 16:57                       ` Jason Gunthorpe
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2025-08-01 16:50 UTC (permalink / raw)
  To: Jason Gunthorpe, Alistair Popple
  Cc: Matthew Wilcox, Yonatan Maman, Jérôme Glisse,
	Andrew Morton, Leon Romanovsky, Lyude Paul, Danilo Krummrich,
	David Airlie, Simona Vetter, Ben Skeggs, Michael Guralnik,
	Or Har-Toov, Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma,
	dri-devel, nouveau, linux-kernel, Gal Shalom

On 01.08.25 18:40, Jason Gunthorpe wrote:
> On Fri, Jul 25, 2025 at 10:31:25AM +1000, Alistair Popple wrote:
> 
>> The only issue would be if there were generic code paths that somehow have a
>> raw pfn obtained from neither a page-table walk or struct page. My assumption
>> (yet to be proven/tested) is that these paths don't exist.
> 
> hmm does it, it encodes the device private into a pfn and expects the
> caller to do pfn to page.
> 
> This isn't set in stone and could be changed..
> 
> But broadly, you'd want to entirely eliminate the ability to go from
> pfn to device private or from device private to pfn.
> 
> Instead you'd want to work on some (space #, space index) tuple, maybe
> encoded in a pfn_t, but absolutely and typesafely distinct. Each
> driver gets its own 0 based space for device private information, the
> space is effectively the pgmap.
> 
> And if you do this, maybe we don't need struct page (I mean the type!)
> backing device memory at all.... Which would be a very worthwhile
> project.
> 
> Do we ever even use anything in the device private struct page? Do we
> refcount it?

ref-counted and map-counted ...

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
  2025-07-21  6:59   ` Christoph Hellwig
  2025-07-22  5:42     ` Yonatan Maman
@ 2025-08-01 16:52     ` Jason Gunthorpe
  1 sibling, 0 replies; 37+ messages in thread
From: Jason Gunthorpe @ 2025-08-01 16:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Yonatan Maman, Jérôme Glisse, Andrew Morton,
	Leon Romanovsky, Lyude Paul, Danilo Krummrich, David Airlie,
	Simona Vetter, Alistair Popple, Ben Skeggs, Michael Guralnik,
	Or Har-Toov, Daisuke Matsuda, Shay Drory, linux-mm, linux-rdma,
	dri-devel, nouveau, linux-kernel, Gal Shalom

On Sun, Jul 20, 2025 at 11:59:10PM -0700, Christoph Hellwig wrote:
> > +	/*
> > +	 * Don't fault in device private pages owned by the caller,
> > +	 * just report the PFN.
> > +	 */
> > +	if (pgmap->owner == range->dev_private_owner) {
> > +		*hmm_pfn = swp_offset_pfn(entry);
> > +		goto found;
> 
> This is dangerous because it mixes actual DMAable alias PFNs with the
> device private fake PFNs.  Maybe your hardware / driver can handle
> it, but just leaking this out is not a good idea.

For better or worse that is how the hmm API works today.

Recall the result is an array of unsigned long with a pfn and flags:

enum hmm_pfn_flags {
	/* Output fields and flags */
	HMM_PFN_VALID = 1UL << (BITS_PER_LONG - 1),
	HMM_PFN_WRITE = 1UL << (BITS_PER_LONG - 2),
	HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),

The only promise is that every pfn has a struct page behind it.

If the caller specifies dev_private_owner then it must also look into
the struct page of every returned pfn to see if it is device private
or not.
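
I.e. today the caller-side check is roughly:

/* assumes HMM_PFN_VALID was already checked for this entry */
page = hmm_pfn_to_page(range->hmm_pfns[i]);
if (is_device_private_page(page)) {
	/* our fake PFN, handle via the driver's own bookkeeping */
} else if (is_pci_p2pdma_page(page)) {
	/* P2P page, feed into the p2pdma mapping state */
} else {
	/* plain host memory */
}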

hmm_dma_map_pfn() already unconditionally calls pci_p2pdma_state()
which checks for P2P struct pages.

It does sound like a good improvement to return the type of the pfn
(normal, p2p, private) in the flags bits as well to optimize away
these extra struct page lookups.

But this is a different project..

Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
  2025-08-01 16:50                     ` David Hildenbrand
@ 2025-08-01 16:57                       ` Jason Gunthorpe
  2025-08-04  1:51                         ` Alistair Popple
  2025-08-04  7:48                         ` David Hildenbrand
  0 siblings, 2 replies; 37+ messages in thread
From: Jason Gunthorpe @ 2025-08-01 16:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alistair Popple, Matthew Wilcox, Yonatan Maman,
	Jérôme Glisse, Andrew Morton, Leon Romanovsky,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ben Skeggs, Michael Guralnik, Or Har-Toov, Daisuke Matsuda,
	Shay Drory, linux-mm, linux-rdma, dri-devel, nouveau,
	linux-kernel, Gal Shalom

On Fri, Aug 01, 2025 at 06:50:18PM +0200, David Hildenbrand wrote:
> On 01.08.25 18:40, Jason Gunthorpe wrote:
> > On Fri, Jul 25, 2025 at 10:31:25AM +1000, Alistair Popple wrote:
> > 
> > > The only issue would be if there were generic code paths that somehow have a
> > > raw pfn obtained from neither a page-table walk or struct page. My assumption
> > > (yet to be proven/tested) is that these paths don't exist.
> > 
> > hmm does it, it encodes the device private into a pfn and expects the
> > caller to do pfn to page.
> > 
> > This isn't set in stone and could be changed..
> > 
> > But broadly, you'd want to entirely eliminate the ability to go from
> > pfn to device private or from device private to pfn.
> > 
> > Instead you'd want to work on some (space #, space index) tuple, maybe
> > encoded in a pfn_t, but absolutely and typesafely distinct. Each
> > driver gets its own 0 based space for device private information, the
> > space is effectively the pgmap.
> > 
> > And if you do this, maybe we don't need struct page (I mean the type!)
> > backing device memory at all.... Which would be a very worthwhile
> > project.
> > 
> > Do we ever even use anything in the device private struct page? Do we
> > refcount it?
> 
> ref-counted and map-counted ...

Hm, so it would turn into another struct page split up where we get
ourselves a struct device_private and change all the places touching
its refcount and mapcount to use the new type.

If we could use some index scheme we could then divorce from struct
page and shrink the struct size sooner.

Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
  2025-08-01 16:57                       ` Jason Gunthorpe
@ 2025-08-04  1:51                         ` Alistair Popple
  2025-08-05 14:09                           ` Jason Gunthorpe
  2025-08-04  7:48                         ` David Hildenbrand
  1 sibling, 1 reply; 37+ messages in thread
From: Alistair Popple @ 2025-08-04  1:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Hildenbrand, Matthew Wilcox, Yonatan Maman,
	Jérôme Glisse, Andrew Morton, Leon Romanovsky,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ben Skeggs, Michael Guralnik, Or Har-Toov, Daisuke Matsuda,
	Shay Drory, linux-mm, linux-rdma, dri-devel, nouveau,
	linux-kernel, Gal Shalom

On Fri, Aug 01, 2025 at 01:57:49PM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 01, 2025 at 06:50:18PM +0200, David Hildenbrand wrote:
> > On 01.08.25 18:40, Jason Gunthorpe wrote:
> > > On Fri, Jul 25, 2025 at 10:31:25AM +1000, Alistair Popple wrote:
> > > 
> > > > The only issue would be if there were generic code paths that somehow have a
> > > > raw pfn obtained from neither a page-table walk or struct page. My assumption
> > > > (yet to be proven/tested) is that these paths don't exist.
> > > 
> > > hmm does it, it encodes the device private into a pfn and expects the
> > > caller to do pfn to page.

What callers need to do pfn to page when finding a device private pfn via
hmm_range_fault()? GPU drivers don't, they tend just to use the pfn as an offset
from the start of the pgmap to find whatever data structure they are using to
track device memory allocations.
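
E.g. a rough sketch of the usual pattern (structure and field names are
made up):

static struct my_chunk *chunk_from_pfn(struct my_devmem *mem, unsigned long pfn)
{
	unsigned long off = pfn - (mem->pagemap.range.start >> PAGE_SHIFT);

	return &mem->chunks[off / MY_CHUNK_NPAGES];
}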

The migrate_vma_*() calls do, but they could easily be changed to whatever
index scheme we use so long as we can encode that this is a device entry in the
MIGRATE_PFN flags.

So other than adding a HMM_PFN flag to say this is really a device index I don't
see too many issues here.

> > > This isn't set in stone and could be changed..
> > > 
> > > But broadly, you'd want to entirely eliminate the ability to go from
> > > pfn to device private or from device private to pfn.
> > > 
> > > Instead you'd want to work on some (space #, space index) tuple, maybe
> > > encoded in a pfn_t, but absolutely and typesafely distinct. Each
> > > driver gets its own 0 based space for device private information, the
> > > space is effectively the pgmap.
> > > 
> > > And if you do this, maybe we don't need struct page (I mean the type!)
> > > backing device memory at all.... Which would be a very worthwhile
> > > project.

Exactly! Although we still need enough of a struct page, or something
else, to be able to map them in normal anonymous VMAs. Short term, the
motivation for this project is that the current scheme of "stealing"
pfns for the device doesn't actually work in a lot of cases.

> > > Do we ever even use anything in the device private struct page? Do we
> > > refcount it?
> > 
> > ref-counted and map-counted ...
> 
> Hm, so it would turn into another struct page split up where we get
> ourselves a struct device_private and change all the places touching
> its refcount and mapcount to use the new type.
> 
> If we could use some index scheme we could then divorce from struct
> page and shrink the struct size sooner.

Right, that is roughly along the lines of what I was thinking.

> Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
  2025-08-01 16:57                       ` Jason Gunthorpe
  2025-08-04  1:51                         ` Alistair Popple
@ 2025-08-04  7:48                         ` David Hildenbrand
  1 sibling, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2025-08-04  7:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alistair Popple, Matthew Wilcox, Yonatan Maman,
	Jérôme Glisse, Andrew Morton, Leon Romanovsky,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ben Skeggs, Michael Guralnik, Or Har-Toov, Daisuke Matsuda,
	Shay Drory, linux-mm, linux-rdma, dri-devel, nouveau,
	linux-kernel, Gal Shalom

On 01.08.25 18:57, Jason Gunthorpe wrote:
> On Fri, Aug 01, 2025 at 06:50:18PM +0200, David Hildenbrand wrote:
>> On 01.08.25 18:40, Jason Gunthorpe wrote:
>>> On Fri, Jul 25, 2025 at 10:31:25AM +1000, Alistair Popple wrote:
>>>
>>>> The only issue would be if there were generic code paths that somehow have a
>>>> raw pfn obtained from neither a page-table walk or struct page. My assumption
>>>> (yet to be proven/tested) is that these paths don't exist.
>>>
>>> hmm does it, it encodes the device private into a pfn and expects the
>>> caller to do pfn to page.
>>>
>>> This isn't set in stone and could be changed..
>>>
>>> But broadly, you'd want to entirely eliminate the ability to go from
>>> pfn to device private or from device private to pfn.
>>>
>>> Instead you'd want to work on some (space #, space index) tuple, maybe
>>> encoded in a pfn_t, but absolutely and typesafely distinct. Each
>>> driver gets its own 0 based space for device private information, the
>>> space is effectively the pgmap.
>>>
>>> And if you do this, maybe we don't need struct page (I mean the type!)
>>> backing device memory at all.... Which would be a very worthwhile
>>> project.
>>>
>>> Do we ever even use anything in the device private struct page? Do we
>>> refcount it?
>>
>> ref-counted and map-counted ...
> 
> Hm, so it would turn into another struct page split up where we get
> ourselves a struct device_private and change all the places touching
> its refcount and mapcount to use the new type.

We're already working with folios in all cases where we modify either 
refcount or mapcount IIUC.

The rmap handling (try to migrate, soon folio splitting) currently 
depends on the mapcount.

Not sure how that will all look like without a ... struct folio / struct 
page.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages
  2025-08-04  1:51                         ` Alistair Popple
@ 2025-08-05 14:09                           ` Jason Gunthorpe
  0 siblings, 0 replies; 37+ messages in thread
From: Jason Gunthorpe @ 2025-08-05 14:09 UTC (permalink / raw)
  To: Alistair Popple
  Cc: David Hildenbrand, Matthew Wilcox, Yonatan Maman,
	Jérôme Glisse, Andrew Morton, Leon Romanovsky,
	Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
	Ben Skeggs, Michael Guralnik, Or Har-Toov, Daisuke Matsuda,
	Shay Drory, linux-mm, linux-rdma, dri-devel, nouveau,
	linux-kernel, Gal Shalom

On Mon, Aug 04, 2025 at 11:51:38AM +1000, Alistair Popple wrote:
> On Fri, Aug 01, 2025 at 01:57:49PM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 01, 2025 at 06:50:18PM +0200, David Hildenbrand wrote:
> > > On 01.08.25 18:40, Jason Gunthorpe wrote:
> > > > On Fri, Jul 25, 2025 at 10:31:25AM +1000, Alistair Popple wrote:
> > > > 
> > > > > The only issue would be if there were generic code paths that somehow have a
> > > > > raw pfn obtained from neither a page-table walk or struct page. My assumption
> > > > > (yet to be proven/tested) is that these paths don't exist.
> > > > 
> > > > hmm does it, it encodes the device private into a pfn and expects the
> > > > caller to do pfn to page.
> 
> What callers need to do pfn to page when finding a device private pfn via
> hmm_range_fault()? GPU drivers don't, they tend just to use the pfn as an offset
> from the start of the pgmap to find whatever data structure they are using to
> track device memory allocations.

All drivers today must. You have no idea if the PFN returned is a
private or CPU page. The only way to know is to check the struct page
type, by looking inside the struct page.

> So other than adding a HMM_PFN flag to say this is really a device index I don't
> see too many issues here.

Christoph suggested exactly this, and it would solve the issue. Seems
quite easy too. Let's do it.

Jason

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2025-08-05 14:09 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-18 11:51 [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages *** Yonatan Maman
2025-07-18 11:51 ` [PATCH v2 1/5] mm/hmm: HMM API to enable P2P DMA for device private pages Yonatan Maman
2025-07-18 14:17   ` Matthew Wilcox
2025-07-18 14:44     ` Jason Gunthorpe
2025-07-21  0:11       ` Alistair Popple
2025-07-21 13:23       ` Matthew Wilcox
2025-07-22  0:49         ` Alistair Popple
2025-07-23  3:51           ` Jason Gunthorpe
2025-07-23  4:10             ` Alistair Popple
2025-07-24  8:52               ` David Hildenbrand
2025-07-25  0:31                 ` Alistair Popple
2025-07-25  9:51                   ` David Hildenbrand
2025-08-01 16:40                   ` Jason Gunthorpe
2025-08-01 16:50                     ` David Hildenbrand
2025-08-01 16:57                       ` Jason Gunthorpe
2025-08-04  1:51                         ` Alistair Popple
2025-08-05 14:09                           ` Jason Gunthorpe
2025-08-04  7:48                         ` David Hildenbrand
2025-07-21  6:59   ` Christoph Hellwig
2025-07-22  5:42     ` Yonatan Maman
2025-08-01 16:52     ` Jason Gunthorpe
2025-07-18 11:51 ` [PATCH v2 2/5] nouveau/dmem: HMM P2P DMA for private dev pages Yonatan Maman
2025-07-21  7:00   ` Christoph Hellwig
2025-07-22  5:23     ` Yonatan Maman
2025-07-18 11:51 ` [PATCH v2 3/5] IB/core: P2P DMA for device private pages Yonatan Maman
2025-07-18 11:51 ` [PATCH v2 4/5] RDMA/mlx5: Enable P2P DMA with fallback mechanism Yonatan Maman
2025-07-21  7:03   ` Christoph Hellwig
2025-07-23  3:55     ` Jason Gunthorpe
2025-07-24  7:30       ` Christoph Hellwig
2025-08-01 16:46         ` Jason Gunthorpe
2025-07-18 11:51 ` [PATCH v2 5/5] RDMA/mlx5: Enabling ATS for ODP memory Yonatan Maman
2025-07-20 10:30 ` [PATCH v2 0/5] *** GPU Direct RDMA (P2P DMA) for Device Private Pages *** Leon Romanovsky
2025-07-20 21:03   ` Yonatan Maman
2025-07-21  6:49     ` Leon Romanovsky
2025-07-23  4:03       ` Jason Gunthorpe
2025-07-23  8:44         ` Leon Romanovsky
2025-07-21  6:54 ` Christoph Hellwig
