linux-mm.kvack.org archive mirror
* [PATCH 0/5] vfio: Improve DMA mapping performance for huge pfnmaps
@ 2025-02-05 23:17 Alex Williamson
  2025-02-05 23:17 ` [PATCH 4/5] mm: Provide page mask in struct follow_pfnmap_args Alex Williamson
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Alex Williamson @ 2025-02-05 23:17 UTC (permalink / raw)
  To: alex.williamson
  Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, akpm, linux-mm

As GPU BAR sizes increase, the overhead of DMA mapping pfnmap ranges has
become significant for VMs making use of device assignment.  Not only
does each mapping require upwards of a few seconds, but BARs are mapped
in and out of the VM address space multiple times during guest boot.
Factor in also that multi-GPU configurations are increasingly
commonplace and that BAR sizes continue to grow, and configurations
today can already be delayed by minutes during guest boot.

We've taken steps to make Linux a better guest by batching PCI BAR
sizing operations[1], but this only provides an incremental improvement.

This series attempts to fully address the issue by leveraging the huge
pfnmap support added in v6.12.  When we insert pfnmaps using pud and pmd
mappings, we can later take advantage of the knowledge of the mapping
level page mask to iterate at the relevant mapping stride.  In the
commonly achieved optimal case, this results in a reduction of pfn
lookups by a factor of 256K.  On a local test system, the ~1s overhead
of DMA mapping a 32GB PCI BAR is reduced to sub-millisecond (8M
page-sized operations reduced to 32 pud-sized operations).
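
For reference, the arithmetic behind those numbers (assuming 4KB base
pages and 1GB pud mappings): 1GB / 4KB = 262144, i.e. a 256K reduction
in pfn lookups per pud, and a 32GB BAR spans 32GB / 4KB = 8M base pages
but only 32GB / 1GB = 32 pud-sized mappings.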

Please review, test, and provide feedback.  I hope that mm folks can
ack the trivial follow_pfnmap_args update to provide the mapping level
page mask.  Naming is hard, so any preference other than pgmask is
welcome.  Thanks,

Alex

[1]https://lore.kernel.org/all/20250120182202.1878581-1-alex.williamson@redhat.com/


Alex Williamson (5):
  vfio/type1: Catch zero from pin_user_pages_remote()
  vfio/type1: Convert all vaddr_get_pfns() callers to use vfio_batch
  vfio/type1: Use vfio_batch for vaddr_get_pfns()
  mm: Provide page mask in struct follow_pfnmap_args
  vfio/type1: Use mapping page mask for pfnmaps

 drivers/vfio/vfio_iommu_type1.c | 107 ++++++++++++++++++++------------
 include/linux/mm.h              |   2 +
 mm/memory.c                     |   1 +
 3 files changed, 72 insertions(+), 38 deletions(-)

-- 
2.47.1




* [PATCH 4/5] mm: Provide page mask in struct follow_pfnmap_args
  2025-02-05 23:17 [PATCH 0/5] vfio: Improve DMA mapping performance for huge pfnmaps Alex Williamson
@ 2025-02-05 23:17 ` Alex Williamson
  2025-02-07  1:38   ` Mitchell Augustin
                     ` (2 more replies)
  2025-02-05 23:17 ` [PATCH 5/5] vfio/type1: Use mapping page mask for pfnmaps Alex Williamson
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 15+ messages in thread
From: Alex Williamson @ 2025-02-05 23:17 UTC (permalink / raw)
  To: alex.williamson
  Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, akpm, linux-mm

follow_pfnmap_start() walks the page table for a given address and
fills out the struct follow_pfnmap_args in pfnmap_args_setup().
The page mask of the page table level is already provided to this
latter function for calculating the pfn.  This page mask can also be
useful for the caller to determine the extent of the contiguous
mapping.

For example, vfio-pci now supports huge_fault for pfnmaps and is able
to insert pud and pmd mappings.  When we DMA map these pfnmaps, e.g.
PCI MMIO BARs, we iterate follow_pfnmap_start() to get each pfn to test
for a contiguous pfn range.  Providing the mapping page mask allows us
to skip ahead by the extent of the mapping level.  Assuming a 1GB pud
level and 4KB page size, iterations are reduced by a factor of 256K.
In wall clock time, mapping a 32GB PCI BAR is reduced from ~1s to <1ms.
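
Roughly, a caller might then advance by the mapping stride rather than
by single pages, for example (an illustrative sketch only, assuming a
vma/vaddr pair as in any follow_pfnmap_start() user; the actual vfio
conversion is in the next patch):

	struct follow_pfnmap_args args = { .vma = vma, .address = vaddr };

	if (!follow_pfnmap_start(&args)) {
		unsigned long pfn = args.pfn;
		/* first pfn past the mapping level containing vaddr */
		unsigned long epfn = ((args.address & args.pgmask) +
				      ~args.pgmask + 1) >> PAGE_SHIFT;

		follow_pfnmap_end(&args);
		/* pfn .. epfn - 1 are contiguous within this mapping */
	}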

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 include/linux/mm.h | 2 ++
 mm/memory.c        | 1 +
 2 files changed, 3 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b1c3db9cf355..0ef7e7a0b4eb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2416,11 +2416,13 @@ struct follow_pfnmap_args {
 	 * Outputs:
 	 *
 	 * @pfn: the PFN of the address
+	 * @pgmask: page mask covering pfn
 	 * @pgprot: the pgprot_t of the mapping
 	 * @writable: whether the mapping is writable
 	 * @special: whether the mapping is a special mapping (real PFN maps)
 	 */
 	unsigned long pfn;
+	unsigned long pgmask;
 	pgprot_t pgprot;
 	bool writable;
 	bool special;
diff --git a/mm/memory.c b/mm/memory.c
index 398c031be9ba..97ccd43761b2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6388,6 +6388,7 @@ static inline void pfnmap_args_setup(struct follow_pfnmap_args *args,
 	args->lock = lock;
 	args->ptep = ptep;
 	args->pfn = pfn_base + ((args->address & ~addr_mask) >> PAGE_SHIFT);
+	args->pgmask = addr_mask;
 	args->pgprot = pgprot;
 	args->writable = writable;
 	args->special = special;
-- 
2.47.1




* [PATCH 5/5] vfio/type1: Use mapping page mask for pfnmaps
  2025-02-05 23:17 [PATCH 0/5] vfio: Improve DMA mapping performance for huge pfnmaps Alex Williamson
  2025-02-05 23:17 ` [PATCH 4/5] mm: Provide page mask in struct follow_pfnmap_args Alex Williamson
@ 2025-02-05 23:17 ` Alex Williamson
  2025-02-07  1:39   ` Mitchell Augustin
                     ` (2 more replies)
  2025-02-06 19:14 ` [PATCH 0/5] vfio: Improve DMA mapping performance for huge pfnmaps Peter Xu
  2025-02-07  1:39 ` Mitchell Augustin
  3 siblings, 3 replies; 15+ messages in thread
From: Alex Williamson @ 2025-02-05 23:17 UTC (permalink / raw)
  To: alex.williamson
  Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, akpm, linux-mm

vfio-pci supports huge_fault for PCI MMIO BARs and will insert pud and
pmd mappings where alignment allows.  follow_pfnmap_start() walks the
page table and therefore knows the page mask of the level where the
address is found, returning it through follow_pfnmap_args.pgmask.
Subsequent pfns from this address to the end of the mapping at that
level are necessarily consecutive.  Use this information to retrieve a
range of pfnmap pfns in a single pass.

With optimal mappings and alignment on systems with a 1GB pud and 4KB
page size, this reduces iterations for DMA mapping PCI BARs by a
factor of 256K.  In real-world testing, the overhead of iterating
pfns while DMA mapping a 32GB PCI BAR for a VM is reduced from ~1s to
sub-millisecond.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 939920454da7..6f3e8d981311 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -520,7 +520,7 @@ static void vfio_batch_fini(struct vfio_batch *batch)
 
 static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
 			    unsigned long vaddr, unsigned long *pfn,
-			    bool write_fault)
+			    unsigned long *pgmask, bool write_fault)
 {
 	struct follow_pfnmap_args args = { .vma = vma, .address = vaddr };
 	int ret;
@@ -544,10 +544,12 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
 			return ret;
 	}
 
-	if (write_fault && !args.writable)
+	if (write_fault && !args.writable) {
 		ret = -EFAULT;
-	else
+	} else {
 		*pfn = args.pfn;
+		*pgmask = args.pgmask;
+	}
 
 	follow_pfnmap_end(&args);
 	return ret;
@@ -590,15 +592,23 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
 	vma = vma_lookup(mm, vaddr);
 
 	if (vma && vma->vm_flags & VM_PFNMAP) {
-		ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
+		unsigned long pgmask;
+
+		ret = follow_fault_pfn(vma, mm, vaddr, pfn, &pgmask,
+				       prot & IOMMU_WRITE);
 		if (ret == -EAGAIN)
 			goto retry;
 
 		if (!ret) {
-			if (is_invalid_reserved_pfn(*pfn))
-				ret = 1;
-			else
+			if (is_invalid_reserved_pfn(*pfn)) {
+				unsigned long epfn;
+
+				epfn = (((*pfn << PAGE_SHIFT) + ~pgmask + 1)
+					& pgmask) >> PAGE_SHIFT;
+				ret = min_t(int, npages, epfn - *pfn);
+			} else {
 				ret = -EFAULT;
+			}
 		}
 	}
 done:
-- 
2.47.1




* Re: [PATCH 0/5] vfio: Improve DMA mapping performance for huge pfnmaps
  2025-02-05 23:17 [PATCH 0/5] vfio: Improve DMA mapping performance for huge pfnmaps Alex Williamson
  2025-02-05 23:17 ` [PATCH 4/5] mm: Provide page mask in struct follow_pfnmap_args Alex Williamson
  2025-02-05 23:17 ` [PATCH 5/5] vfio/type1: Use mapping page mask for pfnmaps Alex Williamson
@ 2025-02-06 19:14 ` Peter Xu
  2025-02-07  1:39 ` Mitchell Augustin
  3 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2025-02-06 19:14 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, mitchell.augustin, clg, akpm, linux-mm

On Wed, Feb 05, 2025 at 04:17:16PM -0700, Alex Williamson wrote:
> As GPU BAR sizes increase, the overhead of DMA mapping pfnmap ranges has
> become a significant overhead for VMs making use of device assignment.
> Not only does each mapping require upwards of a few seconds, but BARs
> are mapped in and out of the VM address space multiple times during
> guest boot.  Also factor in that multi-GPU configurations are
> increasingly commonplace and BAR sizes are continuing to increase.
> Configurations today can already be delayed minutes during guest boot.
> 
> We've taken steps to make Linux a better guest by batching PCI BAR
> sizing operations[1], but it only provides an incremental improvement.
> 
> This series attempts to fully address the issue by leveraging the huge
> pfnmap support added in v6.12.  When we insert pfnmaps using pud and pmd
> mappings, we can later take advantage of the knowledge of the mapping
> level page mask to iterate on the relevant mapping stride.  In the
> commonly achieved optimal case, this results in a reduction of pfn
> lookups by a factor of 256k.  For a local test system, an overhead of
> ~1s for DMA mapping a 32GB PCI BAR is reduced to sub-millisecond (8M
> page sized operations reduced to 32 pud sized operations).
> 
> Please review, test, and provide feedback.  I hope that mm folks can
> ack the trivial follow_pfnmap_args update to provide the mapping level
> page mask.  Naming is hard, so any preference other than pgmask is
> welcome.  Thanks,
> 
> Alex
> 
> [1]https://lore.kernel.org/all/20250120182202.1878581-1-alex.williamson@redhat.com/
> 
> 
> Alex Williamson (5):
>   vfio/type1: Catch zero from pin_user_pages_remote()
>   vfio/type1: Convert all vaddr_get_pfns() callers to use vfio_batch
>   vfio/type1: Use vfio_batch for vaddr_get_pfns()
>   mm: Provide page mask in struct follow_pfnmap_args
>   vfio/type1: Use mapping page mask for pfnmaps

FWIW:

Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks,

-- 
Peter Xu




* Re: [PATCH 4/5] mm: Provide page mask in struct follow_pfnmap_args
  2025-02-05 23:17 ` [PATCH 4/5] mm: Provide page mask in struct follow_pfnmap_args Alex Williamson
@ 2025-02-07  1:38   ` Mitchell Augustin
  2025-02-14 17:17   ` Alex Williamson
  2025-02-14 19:14   ` Jason Gunthorpe
  2 siblings, 0 replies; 15+ messages in thread
From: Mitchell Augustin @ 2025-02-07  1:38 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, peterx, clg, akpm, linux-mm

Reviewed-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
Tested-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>


On Wed, Feb 5, 2025 at 5:18 PM Alex Williamson
<alex.williamson@redhat.com> wrote:
>
> follow_pfnmap_start() walks the page table for a given address and
> fills out the struct follow_pfnmap_args in pfnmap_args_setup().
> The page mask of the page table level is already provided to this
> latter function for calculating the pfn.  This page mask can also be
> useful for the caller to determine the extent of the contiguous
> mapping.
>
> For example, vfio-pci now supports huge_fault for pfnmaps and is able
> to insert pud and pmd mappings.  When we DMA map these pfnmaps, ex.
> PCI MMIO BARs, we iterate follow_pfnmap_start() to get each pfn to test
> for a contiguous pfn range.  Providing the mapping page mask allows us
> to skip the extent of the mapping level.  Assuming a 1GB pud level and
> 4KB page size, iterations are reduced by a factor of 256K.  In wall
> clock time, mapping a 32GB PCI BAR is reduced from ~1s to <1ms.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  include/linux/mm.h | 2 ++
>  mm/memory.c        | 1 +
>  2 files changed, 3 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b1c3db9cf355..0ef7e7a0b4eb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2416,11 +2416,13 @@ struct follow_pfnmap_args {
>          * Outputs:
>          *
>          * @pfn: the PFN of the address
> +        * @pgmask: page mask covering pfn
>          * @pgprot: the pgprot_t of the mapping
>          * @writable: whether the mapping is writable
>          * @special: whether the mapping is a special mapping (real PFN maps)
>          */
>         unsigned long pfn;
> +       unsigned long pgmask;
>         pgprot_t pgprot;
>         bool writable;
>         bool special;
> diff --git a/mm/memory.c b/mm/memory.c
> index 398c031be9ba..97ccd43761b2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6388,6 +6388,7 @@ static inline void pfnmap_args_setup(struct follow_pfnmap_args *args,
>         args->lock = lock;
>         args->ptep = ptep;
>         args->pfn = pfn_base + ((args->address & ~addr_mask) >> PAGE_SHIFT);
> +       args->pgmask = addr_mask;
>         args->pgprot = pgprot;
>         args->writable = writable;
>         args->special = special;
> --
> 2.47.1
>


--
Mitchell Augustin
Software Engineer - Ubuntu Partner Engineering



* Re: [PATCH 5/5] vfio/type1: Use mapping page mask for pfnmaps
  2025-02-05 23:17 ` [PATCH 5/5] vfio/type1: Use mapping page mask for pfnmaps Alex Williamson
@ 2025-02-07  1:39   ` Mitchell Augustin
  2025-02-14 19:27   ` Jason Gunthorpe
  2025-02-14 19:46   ` Matthew Wilcox
  2 siblings, 0 replies; 15+ messages in thread
From: Mitchell Augustin @ 2025-02-07  1:39 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, peterx, clg, akpm, linux-mm

LGTM and completely eliminates guest VM PCI initialization slowdowns
on H100 and A100.
Also not seeing any obvious regressions on my side.

Reported-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
Reviewed-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
Tested-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>


On Wed, Feb 5, 2025 at 5:18 PM Alex Williamson
<alex.williamson@redhat.com> wrote:
>
> vfio-pci supports huge_fault for PCI MMIO BARs and will insert pud and
> pmd mappings for well aligned mappings.  follow_pfnmap_start() walks the
> page table and therefore knows the page mask of the level where the
> address is found and returns this through follow_pfnmap_args.pgmask.
> Subsequent pfns from this address until the end of the mapping page are
> necessarily consecutive.  Use this information to retrieve a range of
> pfnmap pfns in a single pass.
>
> With optimal mappings and alignment on systems with 1GB pud and 4KB
> page size, this reduces iterations for DMA mapping PCI BARs by a
> factor of 256K.  In real world testing, the overhead of iterating
> pfns for a VM DMA mapping a 32GB PCI BAR is reduced from ~1s to
> sub-millisecond overhead.
>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 24 +++++++++++++++++-------
>  1 file changed, 17 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 939920454da7..6f3e8d981311 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -520,7 +520,7 @@ static void vfio_batch_fini(struct vfio_batch *batch)
>
>  static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
>                             unsigned long vaddr, unsigned long *pfn,
> -                           bool write_fault)
> +                           unsigned long *pgmask, bool write_fault)
>  {
>         struct follow_pfnmap_args args = { .vma = vma, .address = vaddr };
>         int ret;
> @@ -544,10 +544,12 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
>                         return ret;
>         }
>
> -       if (write_fault && !args.writable)
> +       if (write_fault && !args.writable) {
>                 ret = -EFAULT;
> -       else
> +       } else {
>                 *pfn = args.pfn;
> +               *pgmask = args.pgmask;
> +       }
>
>         follow_pfnmap_end(&args);
>         return ret;
> @@ -590,15 +592,23 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
>         vma = vma_lookup(mm, vaddr);
>
>         if (vma && vma->vm_flags & VM_PFNMAP) {
> -               ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
> +               unsigned long pgmask;
> +
> +               ret = follow_fault_pfn(vma, mm, vaddr, pfn, &pgmask,
> +                                      prot & IOMMU_WRITE);
>                 if (ret == -EAGAIN)
>                         goto retry;
>
>                 if (!ret) {
> -                       if (is_invalid_reserved_pfn(*pfn))
> -                               ret = 1;
> -                       else
> +                       if (is_invalid_reserved_pfn(*pfn)) {
> +                               unsigned long epfn;
> +
> +                               epfn = (((*pfn << PAGE_SHIFT) + ~pgmask + 1)
> +                                       & pgmask) >> PAGE_SHIFT;
> +                               ret = min_t(int, npages, epfn - *pfn);
> +                       } else {
>                                 ret = -EFAULT;
> +                       }
>                 }
>         }
>  done:
> --
> 2.47.1
>


--
Mitchell Augustin
Software Engineer - Ubuntu Partner Engineering



* Re: [PATCH 0/5] vfio: Improve DMA mapping performance for huge pfnmaps
  2025-02-05 23:17 [PATCH 0/5] vfio: Improve DMA mapping performance for huge pfnmaps Alex Williamson
                   ` (2 preceding siblings ...)
  2025-02-06 19:14 ` [PATCH 0/5] vfio: Improve DMA mapping performance for huge pfnmaps Peter Xu
@ 2025-02-07  1:39 ` Mitchell Augustin
  3 siblings, 0 replies; 15+ messages in thread
From: Mitchell Augustin @ 2025-02-07  1:39 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, peterx, clg, akpm, linux-mm

Thanks Alex, this all looks great to me and completely eliminates the
boot time slowdown I was seeing in my tests on our DGX H100 and A100.
I also double-checked the memory mappings reported in /proc/iomem, and
everything looks consistent with how it was prior to this series on
both devices.

Reported-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
Reviewed-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
Tested-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>


On Wed, Feb 5, 2025 at 5:18 PM Alex Williamson
<alex.williamson@redhat.com> wrote:
>
> As GPU BAR sizes increase, the overhead of DMA mapping pfnmap ranges has
> become a significant overhead for VMs making use of device assignment.
> Not only does each mapping require upwards of a few seconds, but BARs
> are mapped in and out of the VM address space multiple times during
> guest boot.  Also factor in that multi-GPU configurations are
> increasingly commonplace and BAR sizes are continuing to increase.
> Configurations today can already be delayed minutes during guest boot.
>
> We've taken steps to make Linux a better guest by batching PCI BAR
> sizing operations[1], but it only provides an incremental improvement.
>
> This series attempts to fully address the issue by leveraging the huge
> pfnmap support added in v6.12.  When we insert pfnmaps using pud and pmd
> mappings, we can later take advantage of the knowledge of the mapping
> level page mask to iterate on the relevant mapping stride.  In the
> commonly achieved optimal case, this results in a reduction of pfn
> lookups by a factor of 256k.  For a local test system, an overhead of
> ~1s for DMA mapping a 32GB PCI BAR is reduced to sub-millisecond (8M
> page sized operations reduced to 32 pud sized operations).
>
> Please review, test, and provide feedback.  I hope that mm folks can
> ack the trivial follow_pfnmap_args update to provide the mapping level
> page mask.  Naming is hard, so any preference other than pgmask is
> welcome.  Thanks,
>
> Alex
>
> [1]https://lore.kernel.org/all/20250120182202.1878581-1-alex.williamson@redhat.com/
>
>
> Alex Williamson (5):
>   vfio/type1: Catch zero from pin_user_pages_remote()
>   vfio/type1: Convert all vaddr_get_pfns() callers to use vfio_batch
>   vfio/type1: Use vfio_batch for vaddr_get_pfns()
>   mm: Provide page mask in struct follow_pfnmap_args
>   vfio/type1: Use mapping page mask for pfnmaps
>
>  drivers/vfio/vfio_iommu_type1.c | 107 ++++++++++++++++++++------------
>  include/linux/mm.h              |   2 +
>  mm/memory.c                     |   1 +
>  3 files changed, 72 insertions(+), 38 deletions(-)
>
> --
> 2.47.1
>


--
Mitchell Augustin
Software Engineer - Ubuntu Partner Engineering



* Re: [PATCH 4/5] mm: Provide page mask in struct follow_pfnmap_args
  2025-02-05 23:17 ` [PATCH 4/5] mm: Provide page mask in struct follow_pfnmap_args Alex Williamson
  2025-02-07  1:38   ` Mitchell Augustin
@ 2025-02-14 17:17   ` Alex Williamson
  2025-02-14 21:39     ` David Hildenbrand
  2025-02-14 19:14   ` Jason Gunthorpe
  2 siblings, 1 reply; 15+ messages in thread
From: Alex Williamson @ 2025-02-14 17:17 UTC (permalink / raw)
  To: alex.williamson
  Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, akpm, linux-mm,
	David Hildenbrand


Nudge.  Peter Xu provided an R-b for the series.  Would any other mm
folks like to chime in here with an objection or approval for this
change, and for merging it through the vfio tree?  Series[1].  Thanks!

Alex

[1]https://lore.kernel.org/all/20250205231728.2527186-1-alex.williamson@redhat.com/

On Wed,  5 Feb 2025 16:17:20 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> follow_pfnmap_start() walks the page table for a given address and
> fills out the struct follow_pfnmap_args in pfnmap_args_setup().
> The page mask of the page table level is already provided to this
> latter function for calculating the pfn.  This page mask can also be
> useful for the caller to determine the extent of the contiguous
> mapping.
> 
> For example, vfio-pci now supports huge_fault for pfnmaps and is able
> to insert pud and pmd mappings.  When we DMA map these pfnmaps, ex.
> PCI MMIO BARs, we iterate follow_pfnmap_start() to get each pfn to test
> for a contiguous pfn range.  Providing the mapping page mask allows us
> to skip the extent of the mapping level.  Assuming a 1GB pud level and
> 4KB page size, iterations are reduced by a factor of 256K.  In wall
> clock time, mapping a 32GB PCI BAR is reduced from ~1s to <1ms.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  include/linux/mm.h | 2 ++
>  mm/memory.c        | 1 +
>  2 files changed, 3 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b1c3db9cf355..0ef7e7a0b4eb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2416,11 +2416,13 @@ struct follow_pfnmap_args {
>  	 * Outputs:
>  	 *
>  	 * @pfn: the PFN of the address
> +	 * @pgmask: page mask covering pfn
>  	 * @pgprot: the pgprot_t of the mapping
>  	 * @writable: whether the mapping is writable
>  	 * @special: whether the mapping is a special mapping (real PFN maps)
>  	 */
>  	unsigned long pfn;
> +	unsigned long pgmask;
>  	pgprot_t pgprot;
>  	bool writable;
>  	bool special;
> diff --git a/mm/memory.c b/mm/memory.c
> index 398c031be9ba..97ccd43761b2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6388,6 +6388,7 @@ static inline void pfnmap_args_setup(struct follow_pfnmap_args *args,
>  	args->lock = lock;
>  	args->ptep = ptep;
>  	args->pfn = pfn_base + ((args->address & ~addr_mask) >> PAGE_SHIFT);
> +	args->pgmask = addr_mask;
>  	args->pgprot = pgprot;
>  	args->writable = writable;
>  	args->special = special;




* Re: [PATCH 4/5] mm: Provide page mask in struct follow_pfnmap_args
  2025-02-05 23:17 ` [PATCH 4/5] mm: Provide page mask in struct follow_pfnmap_args Alex Williamson
  2025-02-07  1:38   ` Mitchell Augustin
  2025-02-14 17:17   ` Alex Williamson
@ 2025-02-14 19:14   ` Jason Gunthorpe
  2 siblings, 0 replies; 15+ messages in thread
From: Jason Gunthorpe @ 2025-02-14 19:14 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, akpm, linux-mm

On Wed, Feb 05, 2025 at 04:17:20PM -0700, Alex Williamson wrote:
> follow_pfnmap_start() walks the page table for a given address and
> fills out the struct follow_pfnmap_args in pfnmap_args_setup().
> The page mask of the page table level is already provided to this
> latter function for calculating the pfn.  This page mask can also be
> useful for the caller to determine the extent of the contiguous
> mapping.
> 
> For example, vfio-pci now supports huge_fault for pfnmaps and is able
> to insert pud and pmd mappings.  When we DMA map these pfnmaps, ex.
> PCI MMIO BARs, we iterate follow_pfnmap_start() to get each pfn to test
> for a contiguous pfn range.  Providing the mapping page mask allows us
> to skip the extent of the mapping level.  Assuming a 1GB pud level and
> 4KB page size, iterations are reduced by a factor of 256K.  In wall
> clock time, mapping a 32GB PCI BAR is reduced from ~1s to <1ms.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  include/linux/mm.h | 2 ++
>  mm/memory.c        | 1 +
>  2 files changed, 3 insertions(+)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason



* Re: [PATCH 5/5] vfio/type1: Use mapping page mask for pfnmaps
  2025-02-05 23:17 ` [PATCH 5/5] vfio/type1: Use mapping page mask for pfnmaps Alex Williamson
  2025-02-07  1:39   ` Mitchell Augustin
@ 2025-02-14 19:27   ` Jason Gunthorpe
  2025-02-17 21:52     ` Alex Williamson
  2025-02-14 19:46   ` Matthew Wilcox
  2 siblings, 1 reply; 15+ messages in thread
From: Jason Gunthorpe @ 2025-02-14 19:27 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, akpm, linux-mm

On Wed, Feb 05, 2025 at 04:17:21PM -0700, Alex Williamson wrote:
> @@ -590,15 +592,23 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
>  	vma = vma_lookup(mm, vaddr);
>  
>  	if (vma && vma->vm_flags & VM_PFNMAP) {
> -		ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
> +		unsigned long pgmask;
> +
> +		ret = follow_fault_pfn(vma, mm, vaddr, pfn, &pgmask,
> +				       prot & IOMMU_WRITE);
>  		if (ret == -EAGAIN)
>  			goto retry;
>  
>  		if (!ret) {
> -			if (is_invalid_reserved_pfn(*pfn))
> -				ret = 1;
> -			else
> +			if (is_invalid_reserved_pfn(*pfn)) {
> +				unsigned long epfn;
> +
> +				epfn = (((*pfn << PAGE_SHIFT) + ~pgmask + 1)
> +					& pgmask) >> PAGE_SHIFT;

That seems a bit indirect

 epfn = ((*pfn) | (~pgmask >> PAGE_SHIFT)) + 1;

?
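
(Quick sanity check with made-up numbers: PAGE_SHIFT = 12 and a 1GB
pud, so pgmask = ~0x3fffffff and ~pgmask >> PAGE_SHIFT = 0x3ffff.
For *pfn = 0x100200, the original gives ((0x100200000 + 0x40000000) &
~0x3fffffff) >> 12 = 0x140000, and the OR form gives (0x100200 |
0x3ffff) + 1 = 0x140000 -- the same next-pud-boundary pfn.)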

> +				ret = min_t(int, npages, epfn - *pfn);

It is nitty but the int's here should be long, and npages should be
unsigned long..

Jason



* Re: [PATCH 5/5] vfio/type1: Use mapping page mask for pfnmaps
  2025-02-05 23:17 ` [PATCH 5/5] vfio/type1: Use mapping page mask for pfnmaps Alex Williamson
  2025-02-07  1:39   ` Mitchell Augustin
  2025-02-14 19:27   ` Jason Gunthorpe
@ 2025-02-14 19:46   ` Matthew Wilcox
  2025-02-17 19:33     ` Alex Williamson
  2 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2025-02-14 19:46 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, akpm, linux-mm

On Wed, Feb 05, 2025 at 04:17:21PM -0700, Alex Williamson wrote:
> +			if (is_invalid_reserved_pfn(*pfn)) {
> +				unsigned long epfn;
> +
> +				epfn = (((*pfn << PAGE_SHIFT) + ~pgmask + 1)
> +					& pgmask) >> PAGE_SHIFT;
> +				ret = min_t(int, npages, epfn - *pfn);

You've really made life hard for yourself by passing around a page mask
instead of an order (ie 0/PMD_ORDER/PUD_ORDER).  Why not:

				epfn = round_up(*pfn + 1, 1 << order);

Although if you insist on passing around a mask, this could be:

				unsigned long sz = (~pgmask >> PAGE_SHIFT) + 1;
				unsigned long epfn = round_up(*pfn + 1, sz)



* Re: [PATCH 4/5] mm: Provide page mask in struct follow_pfnmap_args
  2025-02-14 17:17   ` Alex Williamson
@ 2025-02-14 21:39     ` David Hildenbrand
  2025-02-17 21:56       ` Alex Williamson
  0 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2025-02-14 21:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, akpm, linux-mm

On 14.02.25 18:17, Alex Williamson wrote:
> 
> Nudge.  Peter Xu provided an R-b for the series.  Would any other mm
> folks like to chime in here to provide objection or approval for this
> change and merging it through the vfio tree?  Series[1].  Thanks!
> 

Only skimmed over it, nothing jumped out at me except ...

Nitpicking:

I was wondering if "page mask" is really the right term here. I know that
we use it in some contexts (gup, hugetlb, zeropage) to express "mask this
off and you get the start of the aligned huge page".

For something that walks PFNMAPs (page frames without any real "huge
page" logical metadata etc. grouping) it was unintuitive for me at first.

addr_mask or pfn_mask (shifted addr_mask) would have been clearer for me.

No strong opinion, just what came to mind while reading this ...

-- 
Cheers,

David / dhildenb




* Re: [PATCH 5/5] vfio/type1: Use mapping page mask for pfnmaps
  2025-02-14 19:46   ` Matthew Wilcox
@ 2025-02-17 19:33     ` Alex Williamson
  0 siblings, 0 replies; 15+ messages in thread
From: Alex Williamson @ 2025-02-17 19:33 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, akpm, linux-mm

On Fri, 14 Feb 2025 19:46:22 +0000
Matthew Wilcox <willy@infradead.org> wrote:

> On Wed, Feb 05, 2025 at 04:17:21PM -0700, Alex Williamson wrote:
> > +			if (is_invalid_reserved_pfn(*pfn)) {
> > +				unsigned long epfn;
> > +
> > +				epfn = (((*pfn << PAGE_SHIFT) + ~pgmask + 1)
> > +					& pgmask) >> PAGE_SHIFT;
> > +				ret = min_t(int, npages, epfn - *pfn);  
> 
> You've really made life hard for yourself by passing around a page mask
> instead of an order (ie 0/PMD_ORDER/PUD_ORDER).  Why not:
> 
> 				epfn = round_up(*pfn + 1, 1 << order);
> 
> Although if you insist on passing around a mask, this could be:
> 
> 				unsigned long sz = (~pgmask >> PAGE_SHIFT) + 1;
> 				unsigned long epfn = round_up(*pfn + 1, sz)
> 

Hey Willy!

I was wishing I had an order, but I didn't want to mangle
follow_pfnmap_start() and follow_pfnmap_setup() too much.  Currently
the latter is doing:

	args->pfn = pfn_base + ((args->address & ~addr_mask) >> PAGE_SHIFT);

If follow_pfnmap_start() passed an order, this would need to change to
something equivalent to:

	args->pfn = pfn_base + ((args->address >> PAGE_SHIFT) & ((1UL << order) - 1));

Looks pretty ugly as well, so maybe I'm just shifting the ugliness
around, or maybe someone can spot a more elegant representation.
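
(Both forms do compute the same offset, to be clear.  With a 1GB pud,
order = 18 for 4KB pages, and for args->address = 0x100200000:
(0x100200000 & ~PUD_MASK) >> 12 = 0x200 and
(0x100200000 >> 12) & ((1UL << 18) - 1) = 0x200.  Numbers are made up,
purely illustrative.)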
Thanks,

Alex




* Re: [PATCH 5/5] vfio/type1: Use mapping page mask for pfnmaps
  2025-02-14 19:27   ` Jason Gunthorpe
@ 2025-02-17 21:52     ` Alex Williamson
  0 siblings, 0 replies; 15+ messages in thread
From: Alex Williamson @ 2025-02-17 21:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, akpm, linux-mm

On Fri, 14 Feb 2025 15:27:04 -0400
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Wed, Feb 05, 2025 at 04:17:21PM -0700, Alex Williamson wrote:
> > @@ -590,15 +592,23 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
> >  	vma = vma_lookup(mm, vaddr);
> >  
> >  	if (vma && vma->vm_flags & VM_PFNMAP) {
> > -		ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
> > +		unsigned long pgmask;
> > +
> > +		ret = follow_fault_pfn(vma, mm, vaddr, pfn, &pgmask,
> > +				       prot & IOMMU_WRITE);
> >  		if (ret == -EAGAIN)
> >  			goto retry;
> >  
> >  		if (!ret) {
> > -			if (is_invalid_reserved_pfn(*pfn))
> > -				ret = 1;
> > -			else
> > +			if (is_invalid_reserved_pfn(*pfn)) {
> > +				unsigned long epfn;
> > +
> > +				epfn = (((*pfn << PAGE_SHIFT) + ~pgmask + 1)
> > +					& pgmask) >> PAGE_SHIFT;  
> 
> That seems a bit indirect
> 
>  epfn = ((*pfn) | (~pgmask >> PAGE_SHIFT)) + 1;
> 
> ?

That is simpler, for sure.  Thanks!

> > +				ret = min_t(int, npages, epfn - *pfn);  
> 
> It is nitty but the int's here should be long, and npages should be
> unsigned long..

Added a new patch that uses unsigned long consistently for passed page
counts and long for returns.  Now we just need a system with a 16TiB
huge page size.  Thanks,

Alex




* Re: [PATCH 4/5] mm: Provide page mask in struct follow_pfnmap_args
  2025-02-14 21:39     ` David Hildenbrand
@ 2025-02-17 21:56       ` Alex Williamson
  0 siblings, 0 replies; 15+ messages in thread
From: Alex Williamson @ 2025-02-17 21:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, akpm, linux-mm

On Fri, 14 Feb 2025 22:39:30 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 14.02.25 18:17, Alex Williamson wrote:
> > 
> > Nudge.  Peter Xu provided an R-b for the series.  Would any other mm
> > folks like to chime in here to provide objection or approval for this
> > change and merging it through the vfio tree?  Series[1].  Thanks!
> >   
> 
> Only skimmed over it, nothing jumped at me except ...
> 
> Nitpicking:
> 
> I was wondering if "page mask" is really the right term here. I know that
> we use it in some context (gup, hugetlb, zeropage) to express "mask this 
> off and you get the start of the aligned huge page".
> 
> For something that walks PFNMAPs (page frames without any real "huge 
> page" logical metadata etc. grouping) it was unintuitive for me at first.
> 
> addr_mask or pfn_mask (shifted addr_mask) would have been clearer for me.
> 
> No strong opinion, just what came to mind while reading this ...

It's called addr_mask in pfnmap_args_setup(), so I'm happy to keep that
naming if pgmask is less intuitive.  Thanks,

Alex



