linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v8 00/24] Provide a new two step DMA mapping API
@ 2025-04-18  6:47 Leon Romanovsky
  2025-04-18  6:47 ` [PATCH v8 01/24] PCI/P2PDMA: Refactor the p2pdma mapping helpers Leon Romanovsky
                   ` (25 more replies)
  0 siblings, 26 replies; 48+ messages in thread
From: Leon Romanovsky @ 2025-04-18  6:47 UTC (permalink / raw)
  To: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch
  Cc: Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
	Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
	Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Alex Williamson, Jérôme Glisse,
	Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
	iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
	Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
	Kanchan Joshi, Chaitanya Kulkarni

Following recent on site LSF/MM 2025 [1] discussion, the overall
response was extremely positive with many people expressed their
desire to see this series merged, so they can base their work on it.

It includes, but not limited:
 * Luis's "nvme-pci: breaking the 512 KiB max IO boundary":
   https://lore.kernel.org/all/20250320111328.2841690-1-mcgrof@kernel.org/
 * Chuck's NFS conversion to use one structure (bio_vec) for all types
   of RPC transports:
   https://lore.kernel.org/all/913df4b4-fc4a-409d-9007-088a3e2c8291@oracle.com
 * Matthew's vision for the world without struct page:
   https://lore.kernel.org/all/20250320111328.2841690-1-mcgrof@kernel.org/
 * Confidential computing roadmap from Dan:
   https://lore.kernel.org/all/6801a8e3968da_71fe29411@dwillia2-xfh.jf.intel.com.notmuch

This series is combination of effort of many people who contributed ideas,
code and testing and I'm gratefully thankful for them.

[1] https://lore.kernel.org/linux-rdma/20250122071600.GC10702@unreal/
-----------------------------------------------------------------------
Changelog:
v8:
 * Rebased to v6.15-rc1
 * Added NVMe patches which are now patches and not RFC. They were in
   RFC stage because block iterator caused to performance regression
   for very extreme case scenario (~100M IOPS), but after Kanchan fixed
   it, the code started to be ready for merging.
 * @Niklas, i didn't change naming in this series as it follows iommu
   naming format.
v7:
 * Rebased to v6.14-rc1
v6: https://lore.kernel.org/all/cover.1737106761.git.leon@kernel.org
 * Changed internal __size variable to u64 to properly set private flag
   in most significant bit.
 * Added comment about why we check DMA_IOVA_USE_SWIOTLB
 * Break unlink loop if phys is NULL, condition which we shouldn't get.
v5: https://lore.kernel.org/all/cover.1734436840.git.leon@kernel.org
 * Trimmed long lines in all patches.
 * Squashed "dma-mapping: Add check if IOVA can be used" into
   "dma: Provide an interface to allow allocate IOVA" patch.
 * Added tags from Christoph and Will.
 * Fixed spelling/grammar errors.
 * Change title from "dma: Provide an  ..." to be "dma-mapping: Provide
   an ...".
 * Slightly changed hmm patch to set sticky flags in one place.
v4: https://lore.kernel.org/all/cover.1733398913.git.leon@kernel.org
 * Added extra patch to add kernel-doc for iommu_unmap and iommu_unmap_fast
 * Rebased to v6.13-rc1
 * Added Will's tags
v3: https://lore.kernel.org/all/cover.1731244445.git.leon@kernel.org
 * Added DMA_ATTR_SKIP_CPU_SYNC to p2p pages in HMM.
 * Fixed error unwind if dma_iova_sync fails in HMM.
 * Clear all PFN flags which were set in map to make code.
   more clean, the callers anyway cleaned them.
 * Generalize sticky PFN flags logic in HMM.
 * Removed not-needed #ifdef-#endif section.
v2: https://lore.kernel.org/all/cover.1730892663.git.leon@kernel.org
 * Fixed docs file as Randy suggested
 * Fixed releases of memory in HMM path. It was allocated with kv..
   variants but released with kfree instead of kvfree.
 * Slightly changed commit message in VFIO patch.
v1: https://lore.kernel.org/all/cover.1730298502.git.leon@kernel.org
 * Squashed two VFIO patches into one
 * Added Acked-by/Reviewed-by tags
 * Fix docs spelling errors
 * Simplified dma_iova_sync() API
 * Added extra check in dma_iova_destroy() if mapped size to make code more clear
 * Fixed checkpatch warnings in p2p patch
 * Changed implementation of VFIO mlx5 mlx5vf_add_migration_pages() to
   be more general
 * Reduced the number of changes in VFIO patch
v0: https://lore.kernel.org/all/cover.1730037276.git.leon@kernel.org

----------------------------------------------------------------------------
 LWN coverage:
Dancing the DMA two-step - https://lwn.net/Articles/997563/
----------------------------------------------------------------------------

Currently the only efficient way to map a complex memory description through
the DMA API is by using the scatterlist APIs. The SG APIs are unique in that
they efficiently combine the two fundamental operations of sizing and allocating
a large IOVA window from the IOMMU and processing all the per-address
swiotlb/flushing/p2p/map details.

This uniqueness has been a long standing pain point as the scatterlist API
is mandatory, but expensive to use. It prevents any kind of optimization or
feature improvement (such as avoiding struct page for P2P) due to the
impossibility of improving the scatterlist.

Several approaches have been explored to expand the DMA API with additional
scatterlist-like structures (BIO, rlist), instead split up the DMA API
to allow callers to bring their own data structure.

The API is split up into parts:
 - Allocate IOVA space:
    To do any pre-allocation required. This is done based on the caller
    supplying some details about how much IOMMU address space it would need
    in worst case.
 - Map and unmap relevant structures to pre-allocated IOVA space:
    Perform the actual mapping into the pre-allocated IOVA. This is very
    similar to dma_map_page().

In this series, examples of three different users are converted to the new API
to show the benefits and its versatility. Each user has a unique
flow:
 1. RDMA ODP is an example of "SVA mirroring" using HMM that needs to
    dynamically map/unmap large numbers of single pages. This becomes
    significantly faster in the IOMMU case as the map/unmap is now just
    a page table walk, the IOVA allocation is pre-computed once. Significant
    amounts of memory are saved as there is no longer a need to store the
    dma_addr_t of each page.
 2. VFIO PCI live migration code is building a very large "page list"
    for the device. Instead of allocating a scatter list entry per allocated
    page it can just allocate an array of 'struct page *', saving a large
    amount of memory.
 3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
    list without having to allocate then populate an intermediate SG table.

To make the use of the new API easier, HMM and block subsystems are extended
to hide the optimization details from the caller. Among these optimizations:
 * Memory reduction as in most real use cases there is no need to store mapped
   DMA addresses and unmap them.
 * Reducing the function call overhead by removing the need to call function
   pointers and use direct calls instead.

This step is first along a path to provide alternatives to scatterlist and
solve some of the abuses and design mistakes.

The whole series together with two extra patches (to fix broken -rc1) is
here https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git dma-split-Apr-8

Thanks

Christoph Hellwig (12):
  PCI/P2PDMA: Refactor the p2pdma mapping helpers
  dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
  iommu: generalize the batched sync after map interface
  iommu/dma: Factor out a iommu_dma_map_swiotlb helper
  dma-mapping: add a dma_need_unmap helper
  docs: core-api: document the IOVA-based API
  block: share more code for bio addition helper
  block: don't merge different kinds of P2P transfers in a single bio
  blk-mq: add scatterlist-less DMA mapping helpers
  nvme-pci: remove struct nvme_descriptor
  nvme-pci: use a better encoding for small prp pool allocations
  nvme-pci: convert to blk_rq_dma_map

Kanchan Joshi (1):
  nvme-pci: optimize single-segment handling

Leon Romanovsky (11):
  iommu: add kernel-doc for iommu_unmap and iommu_unmap_fast
  dma-mapping: Provide an interface to allow allocate IOVA
  dma-mapping: Implement link/unlink ranges API
  mm/hmm: let users to tag specific PFN with DMA mapped bit
  mm/hmm: provide generic DMA managing logic
  RDMA/umem: Store ODP access mask information in PFN
  RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page
    linkage
  RDMA/umem: Separate implicit ODP initialization from explicit ODP
  vfio/mlx5: Explicitly use number of pages instead of allocated length
  vfio/mlx5: Rewrite create mkey flow to allow better code reuse
  vfio/mlx5: Enable the DMA link API

 Documentation/core-api/dma-api.rst   |  71 +++
 block/bio.c                          |  83 ++--
 block/blk-merge.c                    | 183 +++++++-
 drivers/infiniband/core/umem_odp.c   | 250 ++++------
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  12 +-
 drivers/infiniband/hw/mlx5/odp.c     |  65 +--
 drivers/infiniband/hw/mlx5/umr.c     |  12 +-
 drivers/infiniband/sw/rxe/rxe_odp.c  |  18 +-
 drivers/iommu/dma-iommu.c            | 468 ++++++++++++++++---
 drivers/iommu/iommu.c                |  84 ++--
 drivers/nvme/host/pci.c              | 655 ++++++++++++++-------------
 drivers/pci/p2pdma.c                 |  38 +-
 drivers/vfio/pci/mlx5/cmd.c          | 375 ++++++++-------
 drivers/vfio/pci/mlx5/cmd.h          |  35 +-
 drivers/vfio/pci/mlx5/main.c         |  87 ++--
 include/linux/blk-mq-dma.h           |  62 +++
 include/linux/blk_types.h            |   2 +
 include/linux/dma-map-ops.h          |  54 ---
 include/linux/dma-mapping.h          |  85 ++++
 include/linux/hmm-dma.h              |  33 ++
 include/linux/hmm.h                  |  21 +
 include/linux/iommu.h                |   4 +
 include/linux/pci-p2pdma.h           |  84 ++++
 include/rdma/ib_umem_odp.h           |  25 +-
 kernel/dma/direct.c                  |  44 +-
 kernel/dma/mapping.c                 |  18 +
 mm/hmm.c                             | 264 ++++++++++-
 27 files changed, 2060 insertions(+), 1072 deletions(-)
 create mode 100644 include/linux/blk-mq-dma.h
 create mode 100644 include/linux/hmm-dma.h

-- 
2.49.0


^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2025-04-22 11:36 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-18  6:47 [PATCH v8 00/24] Provide a new two step DMA mapping API Leon Romanovsky
2025-04-18  6:47 ` [PATCH v8 01/24] PCI/P2PDMA: Refactor the p2pdma mapping helpers Leon Romanovsky
2025-04-20 20:05   ` ALOK TIWARI
2025-04-18  6:47 ` [PATCH v8 02/24] dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h Leon Romanovsky
2025-04-20 20:09   ` ALOK TIWARI
2025-04-18  6:47 ` [PATCH v8 03/24] iommu: generalize the batched sync after map interface Leon Romanovsky
2025-04-18  6:47 ` [PATCH v8 04/24] iommu: add kernel-doc for iommu_unmap and iommu_unmap_fast Leon Romanovsky
2025-04-22  4:23   ` Christoph Hellwig
2025-04-22  6:27     ` Leon Romanovsky
2025-04-18  6:47 ` [PATCH v8 05/24] dma-mapping: Provide an interface to allow allocate IOVA Leon Romanovsky
2025-04-18  6:47 ` [PATCH v8 06/24] iommu/dma: Factor out a iommu_dma_map_swiotlb helper Leon Romanovsky
2025-04-18  6:47 ` [PATCH v8 07/24] dma-mapping: Implement link/unlink ranges API Leon Romanovsky
2025-04-20 20:23   ` ALOK TIWARI
2025-04-18  6:47 ` [PATCH v8 08/24] dma-mapping: add a dma_need_unmap helper Leon Romanovsky
2025-04-18  6:47 ` [PATCH v8 09/24] docs: core-api: document the IOVA-based API Leon Romanovsky
2025-04-18  6:47 ` [PATCH v8 10/24] mm/hmm: let users to tag specific PFN with DMA mapped bit Leon Romanovsky
2025-04-22  4:24   ` Christoph Hellwig
2025-04-18  6:47 ` [PATCH v8 11/24] mm/hmm: provide generic DMA managing logic Leon Romanovsky
2025-04-18  6:47 ` [PATCH v8 12/24] RDMA/umem: Store ODP access mask information in PFN Leon Romanovsky
2025-04-18  6:47 ` [PATCH v8 13/24] RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage Leon Romanovsky
2025-04-18  6:47 ` [PATCH v8 14/24] RDMA/umem: Separate implicit ODP initialization from explicit ODP Leon Romanovsky
2025-04-18  6:47 ` [PATCH v8 15/24] vfio/mlx5: Explicitly use number of pages instead of allocated length Leon Romanovsky
2025-04-18  6:47 ` [PATCH v8 16/24] vfio/mlx5: Rewrite create mkey flow to allow better code reuse Leon Romanovsky
2025-04-18  6:47 ` [PATCH v8 17/24] vfio/mlx5: Enable the DMA link API Leon Romanovsky
2025-04-18  6:47 ` [PATCH v8 18/24] block: share more code for bio addition helper Leon Romanovsky
2025-04-18  6:47 ` [PATCH v8 19/24] block: don't merge different kinds of P2P transfers in a single bio Leon Romanovsky
2025-04-18  6:47 ` [PATCH v8 20/24] blk-mq: add scatterlist-less DMA mapping helpers Leon Romanovsky
2025-04-18 18:03   ` ALOK TIWARI
2025-04-20  7:09     ` Leon Romanovsky
2025-04-22  4:27   ` Christoph Hellwig
2025-04-22  6:36     ` Leon Romanovsky
2025-04-18  6:47 ` [PATCH v8 21/24] nvme-pci: remove struct nvme_descriptor Leon Romanovsky
2025-04-18  6:47 ` [PATCH v8 22/24] nvme-pci: use a better encoding for small prp pool allocations Leon Romanovsky
2025-04-18  6:47 ` [PATCH v8 23/24] nvme-pci: convert to blk_rq_dma_map Leon Romanovsky
2025-04-18 18:29   ` ALOK TIWARI
2025-04-22  5:00   ` Christoph Hellwig
2025-04-22  7:26     ` Leon Romanovsky
2025-04-22  7:32       ` Christoph Hellwig
2025-04-18  6:47 ` [PATCH v8 24/24] nvme-pci: optimize single-segment handling Leon Romanovsky
2025-04-18  8:02   ` Damien Le Moal
2025-04-18 11:19     ` Leon Romanovsky
2025-04-18 12:32     ` Kanchan Joshi
2025-04-22  4:39   ` Christoph Hellwig
2025-04-22  7:44     ` Leon Romanovsky
2025-04-22 11:36     ` Leon Romanovsky
2025-04-18 11:16 ` [PATCH v8 00/24] Provide a new two step DMA mapping API Leon Romanovsky
2025-04-18 12:18 ` Jens Axboe
2025-04-20  7:14   ` Leon Romanovsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).