* [PATCH] mm/hmm: Populate PFNs from PMD swap entry
@ 2025-08-29  8:05 Francois Dugast
  0 siblings, 0 replies; 11+ messages in thread
From: Francois Dugast @ 2025-08-29  8:05 UTC
  To: linux-mm, intel-xe, dri-devel
  Cc: Francois Dugast, Andrew Morton, Jason Gunthorpe, Leon Romanovsky,
	Zi Yan, Alistair Popple, Balbir Singh, David Airlie,
	Christian König, Mika Penttilä, Thomas Hellstrom,
	Matthew Brost

Once support for THP migration of zone device pages is enabled, device
private swap entries will be found during the walk not only for PTEs
but also for PMDs.

Therefore, it is necessary to extend to PMDs the special handling which
is already in place for PTEs when device private pages are owned by the
caller: instead of faulting or skipping the range, the correct behavior
is to use the swap entry to populate HMM PFNs.

Even though subsequent PFNs can be inferred when handling large order
PFNs, the PFN list is still fully populated because this is currently
expected by HMM users.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
---
 mm/hmm.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/mm/hmm.c b/mm/hmm.c
index d545e2494994..d449fc4647d7 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -355,6 +355,29 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	}
 
 	if (!pmd_present(pmd)) {
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+		swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+		if (is_device_private_entry(entry) &&
+		    pfn_swap_entry_folio(entry)->pgmap->owner ==
+		    range->dev_private_owner) {
+			unsigned long cpu_flags = HMM_PFN_VALID |
+				hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
+			unsigned long pfn = swp_offset_pfn(entry);
+			unsigned long i;
+
+			if (is_writable_device_private_entry(entry))
+				cpu_flags |= HMM_PFN_WRITE;
+
+			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
+				hmm_pfns[i] |= pfn | cpu_flags;
+			}
+
+			return 0;
+		}
+#endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
+
 		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
 			return -EFAULT;
 		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
-- 
2.43.0
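
The population loop in the hunk above can be modelled in user space. What
follows is only a minimal sketch under stated assumptions: the MOCK_*
constants and mock_* helpers are hypothetical stand-ins invented for
illustration, not the kernel's HMM_PFN_* values or functions, and the PMD
order of 9 assumes 4 KiB pages with 2 MiB PMDs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-ins, NOT the kernel's HMM_PFN_* definitions. They only
 * mimic the idea that flag bits and the mapping order live in the high bits
 * of each entry while the PFN occupies the low bits.
 */
#define MOCK_PFN_VALID       (UINT64_C(1) << 63)
#define MOCK_PFN_WRITE       (UINT64_C(1) << 62)
#define MOCK_PFN_REQ_FAULT   (UINT64_C(1) << 61) /* caller-supplied input flag */
#define MOCK_PFN_INOUT_FLAGS MOCK_PFN_REQ_FAULT  /* bits preserved by the walk */
#define MOCK_ORDER_SHIFT     52
#define MOCK_PMD_ORDER       9 /* assumes PMD_SHIFT - PAGE_SHIFT == 9 */

/* Encode a mapping order into the flag bits of an entry. */
static uint64_t mock_flags_order(unsigned int order)
{
	return (uint64_t)order << MOCK_ORDER_SHIFT;
}

/*
 * Write one entry per page of the PMD range: keep the caller's input flags,
 * then store a consecutive PFN together with the output flags.
 */
static void mock_fill_pmd_pfns(uint64_t *pfns, size_t npages,
			       uint64_t first_pfn, int writable)
{
	uint64_t cpu_flags = MOCK_PFN_VALID | mock_flags_order(MOCK_PMD_ORDER);
	size_t i;

	assert(pfns != NULL);

	if (writable)
		cpu_flags |= MOCK_PFN_WRITE;

	for (i = 0; i < npages; i++) {
		pfns[i] &= MOCK_PFN_INOUT_FLAGS;
		pfns[i] |= (first_pfn + i) | cpu_flags;
	}
}
```

Calling mock_fill_pmd_pfns() on an array whose entries carry stale output
bits shows the masking step at work: only the bits in MOCK_PFN_INOUT_FLAGS
survive, and entry i ends up holding first_pfn + i plus the output flags.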



* + mm-hmm-populate-pfns-from-pmd-swap-entry.patch added to mm-new branch
@ 2025-08-30  3:56 Andrew Morton
  2025-09-02 11:17 ` [PATCH] mm/hmm: populate PFNs from PMD swap entry Francois Dugast
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2025-08-30  3:56 UTC
  To: mm-commits, ziy, thomas.hellstrom, mpenttil, matthew.brost,
	leonro, jgg, christian.koenig, balbirs, apopple, airlied,
	francois.dugast, akpm



The patch titled
     Subject: mm/hmm: populate PFNs from PMD swap entry
has been added to the -mm mm-new branch.  Its filename is
     mm-hmm-populate-pfns-from-pmd-swap-entry.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-hmm-populate-pfns-from-pmd-swap-entry.patch

This patch will later appear in the mm-new branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Note, mm-new is a provisional staging ground for work-in-progress
patches, and acceptance into mm-new is a notification for others to take
notice and to finish up reviews.  Please do not hesitate to respond to
review feedback and post updated versions to replace or incrementally
fixup patches in mm-new.

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days.

------------------------------------------------------
From: Francois Dugast <francois.dugast@intel.com>
Subject: mm/hmm: populate PFNs from PMD swap entry
Date: Fri, 29 Aug 2025 10:05:05 +0200

Once support for THP migration of zone device pages is enabled, device
private swap entries will be found during the walk not only for PTEs but
also for PMDs.

Therefore, it is necessary to extend to PMDs the special handling which is
already in place for PTEs when device private pages are owned by the
caller: instead of faulting or skipping the range, the correct behavior is
to use the swap entry to populate HMM PFNs.

Even though subsequent PFNs can be inferred when handling large order
PFNs, the PFN list is still fully populated because this is currently
expected by HMM users.

Link: https://lkml.kernel.org/r/20250829080505.1020155-1-francois.dugast@intel.com
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hmm.c |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

--- a/mm/hmm.c~mm-hmm-populate-pfns-from-pmd-swap-entry
+++ a/mm/hmm.c
@@ -355,6 +355,29 @@ again:
 	}
 
 	if (!pmd_present(pmd)) {
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+		swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+		if (is_device_private_entry(entry) &&
+		    pfn_swap_entry_folio(entry)->pgmap->owner ==
+		    range->dev_private_owner) {
+			unsigned long cpu_flags = HMM_PFN_VALID |
+				hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
+			unsigned long pfn = swp_offset_pfn(entry);
+			unsigned long i;
+
+			if (is_writable_device_private_entry(entry))
+				cpu_flags |= HMM_PFN_WRITE;
+
+			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
+				hmm_pfns[i] |= pfn | cpu_flags;
+			}
+
+			return 0;
+		}
+#endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
+
 		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
 			return -EFAULT;
 		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
_

Patches currently in -mm which might be from francois.dugast@intel.com are

mm-hmm-populate-pfns-from-pmd-swap-entry.patch



* [PATCH] mm/hmm: populate PFNs from PMD swap entry
  2025-08-30  3:56 + mm-hmm-populate-pfns-from-pmd-swap-entry.patch added to mm-new branch Andrew Morton
@ 2025-09-02 11:17 ` Francois Dugast
  2025-09-02 11:30   ` Balbir Singh
  0 siblings, 1 reply; 11+ messages in thread
From: Francois Dugast @ 2025-09-02 11:17 UTC
  To: akpm
  Cc: airlied, apopple, balbirs, christian.koenig, francois.dugast, jgg,
	leonro, matthew.brost, mm-commits, mpenttil, thomas.hellstrom,
	ziy

Once support for THP migration of zone device pages is enabled, device
private swap entries will be found during the walk not only for PTEs but
also for PMDs.

Therefore, it is necessary to extend to PMDs the special handling which is
already in place for PTEs when device private pages are owned by the
caller: instead of faulting or skipping the range, the correct behavior is
to use the swap entry to populate HMM PFNs.

This change is a prerequisite to make use of device-private THP in drivers
using drivers/gpu/drm/drm_pagemap, such as xe.

Even though subsequent PFNs can be inferred when handling large order
PFNs, the PFN list is still fully populated because this is currently
expected by HMM users. If this changes in the future, that is, once all HMM
users support a sparsely populated PFN list, the for() loop can be made to
skip the remaining PFNs for the current order. A quick test shows that the
loop then takes about 10 ns, roughly 20 times faster than without this
optimization.

Link: https://lkml.kernel.org/r/20250829080505.1020155-1-francois.dugast@intel.com
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 mm/hmm.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/mm/hmm.c b/mm/hmm.c
index d545e2494994..d449fc4647d7 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -355,6 +355,29 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	}
 
 	if (!pmd_present(pmd)) {
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+		swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+		if (is_device_private_entry(entry) &&
+		    pfn_swap_entry_folio(entry)->pgmap->owner ==
+		    range->dev_private_owner) {
+			unsigned long cpu_flags = HMM_PFN_VALID |
+				hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
+			unsigned long pfn = swp_offset_pfn(entry);
+			unsigned long i;
+
+			if (is_writable_device_private_entry(entry))
+				cpu_flags |= HMM_PFN_WRITE;
+
+			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
+				hmm_pfns[i] |= pfn | cpu_flags;
+			}
+
+			return 0;
+		}
+#endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
+
 		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
 			return -EFAULT;
 		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
-- 
2.43.0
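
The commit message notes that subsequent PFNs can be inferred from a large
order entry. One way a consumer might do that is sketched below. This is an
assumption about possible driver-side code, not something taken from any
existing HMM user: the MOCK_* layout and the mock_* helpers are hypothetical
and do not match the kernel's actual bit positions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry layout: order in bits 52..56, PFN in the low 52 bits. */
#define MOCK_ORDER_SHIFT 52
#define MOCK_ORDER_MASK  UINT64_C(0x1f)
#define MOCK_PFN_MASK    ((UINT64_C(1) << MOCK_ORDER_SHIFT) - 1)

/* Extract the mapping order stored in an entry. */
static unsigned int mock_entry_order(uint64_t entry)
{
	return (unsigned int)((entry >> MOCK_ORDER_SHIFT) & MOCK_ORDER_MASK);
}

/*
 * Infer the PFN of page i inside the large mapping described by the head
 * entry, without reading entry i from the list.
 */
static uint64_t mock_infer_pfn(uint64_t head_entry, size_t i)
{
	unsigned int order = mock_entry_order(head_entry);

	assert(i < ((size_t)1 << order));
	return (head_entry & MOCK_PFN_MASK) + i;
}

/* Index of the first entry after the large mapping that starts at idx. */
static size_t mock_next_index(uint64_t head_entry, size_t idx)
{
	return idx + ((size_t)1 << mock_entry_order(head_entry));
}
```

With a sparsely populated list, a consumer written this way would read only
the head entry of each large mapping and jump ahead with mock_next_index(),
which is the kind of skipping the commit message alludes to.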



* Re: [PATCH] mm/hmm: populate PFNs from PMD swap entry
  2025-09-02 11:17 ` [PATCH] mm/hmm: populate PFNs from PMD swap entry Francois Dugast
@ 2025-09-02 11:30   ` Balbir Singh
  2025-09-02 12:53     ` Francois Dugast
  0 siblings, 1 reply; 11+ messages in thread
From: Balbir Singh @ 2025-09-02 11:30 UTC
  To: Francois Dugast, akpm
  Cc: airlied, apopple, christian.koenig, jgg, leonro, matthew.brost,
	mm-commits, mpenttil, thomas.hellstrom, ziy

On 9/2/25 21:17, Francois Dugast wrote:
> Once support for THP migration of zone device pages is enabled, device
> private swap entries will be found during the walk not only for PTEs but
> also for PMDs.
> 
> Therefore, it is necessary to extend to PMDs the special handling which is
> already in place for PTEs when device private pages are owned by the
> caller: instead of faulting or skipping the range, the correct behavior is
> to use the swap entry to populate HMM PFNs.
> 
> This change is a prerequisite to make use of device-private THP in drivers
> using drivers/gpu/drm/drm_pagemap, such as xe.
> 
> Even though subsequent PFNs can be inferred when handling large order
> PFNs, the PFN list is still fully populated because this is currently
> expected by HMM users. If this changes in the future, that is, once all HMM
> users support a sparsely populated PFN list, the for() loop can be made to
> skip the remaining PFNs for the current order. A quick test shows that the
> loop then takes about 10 ns, roughly 20 times faster than without this
> optimization.
> 
> Link: https://lkml.kernel.org/r/20250829080505.1020155-1-francois.dugast@intel.com
> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Leon Romanovsky <leonro@nvidia.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Balbir Singh <balbirs@nvidia.com>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Mika Penttilä <mpenttil@redhat.com>
> Cc: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>  mm/hmm.c | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index d545e2494994..d449fc4647d7 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -355,6 +355,29 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  	}
>  
>  	if (!pmd_present(pmd)) {
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +		swp_entry_t entry = pmd_to_swp_entry(pmd);
> +
> +		if (is_device_private_entry(entry) &&
> +		    pfn_swap_entry_folio(entry)->pgmap->owner ==
> +		    range->dev_private_owner) {
> +			unsigned long cpu_flags = HMM_PFN_VALID |
> +				hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
> +			unsigned long pfn = swp_offset_pfn(entry);
> +			unsigned long i;
> +
> +			if (is_writable_device_private_entry(entry))
> +				cpu_flags |= HMM_PFN_WRITE;
> +
> +			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> +				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> +				hmm_pfns[i] |= pfn | cpu_flags;
> +			}
> +

Can you add a comment here about why this is added? Why would there be a disconnect
between HMM users and the API? I assume you are referring to drivers that are
not yet aware of large folios.

> +			return 0;
> +		}
> +#endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
> +
>  		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
>  			return -EFAULT;
>  		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);

Other than that, based on the assumption that my patches are not a pre-requisite
for this

Acked-by: Balbir Singh <balbirs@nvidia.com>

Thanks,
Balbir



* Re: [PATCH] mm/hmm: populate PFNs from PMD swap entry
  2025-09-02 11:30   ` Balbir Singh
@ 2025-09-02 12:53     ` Francois Dugast
  2025-09-02 13:07       ` Francois Dugast
  0 siblings, 1 reply; 11+ messages in thread
From: Francois Dugast @ 2025-09-02 12:53 UTC
  To: Balbir Singh
  Cc: akpm, airlied, apopple, christian.koenig, jgg, leonro,
	matthew.brost, mm-commits, mpenttil, thomas.hellstrom, ziy

On Tue, Sep 02, 2025 at 09:30:13PM +1000, Balbir Singh wrote:
> On 9/2/25 21:17, Francois Dugast wrote:
> > Once support for THP migration of zone device pages is enabled, device
> > private swap entries will be found during the walk not only for PTEs but
> > also for PMDs.
> > 
> > Therefore, it is necessary to extend to PMDs the special handling which is
> > already in place for PTEs when device private pages are owned by the
> > caller: instead of faulting or skipping the range, the correct behavior is
> > to use the swap entry to populate HMM PFNs.
> > 
> > This change is a prerequisite to make use of device-private THP in drivers
> > using drivers/gpu/drm/drm_pagemap, such as xe.
> > 
> > Even though subsequent PFNs can be inferred when handling large order
> > PFNs, the PFN list is still fully populated because this is currently
> > expected by HMM users. If this changes in the future, that is, once all HMM
> > users support a sparsely populated PFN list, the for() loop can be made to
> > skip the remaining PFNs for the current order. A quick test shows that the
> > loop then takes about 10 ns, roughly 20 times faster than without this
> > optimization.
> > 
> > Link: https://lkml.kernel.org/r/20250829080505.1020155-1-francois.dugast@intel.com
> > Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> > Cc: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: Leon Romanovsky <leonro@nvidia.com>
> > Cc: Zi Yan <ziy@nvidia.com>
> > Cc: Alistair Popple <apopple@nvidia.com>
> > Cc: Balbir Singh <balbirs@nvidia.com>
> > Cc: David Airlie <airlied@gmail.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: Mika Penttilä <mpenttil@redhat.com>
> > Cc: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > ---
> >  mm/hmm.c | 23 +++++++++++++++++++++++
> >  1 file changed, 23 insertions(+)
> > 
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index d545e2494994..d449fc4647d7 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -355,6 +355,29 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> >  	}
> >  
> >  	if (!pmd_present(pmd)) {
> > +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> > +		swp_entry_t entry = pmd_to_swp_entry(pmd);
> > +
> > +		if (is_device_private_entry(entry) &&
> > +		    pfn_swap_entry_folio(entry)->pgmap->owner ==
> > +		    range->dev_private_owner) {
> > +			unsigned long cpu_flags = HMM_PFN_VALID |
> > +				hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
> > +			unsigned long pfn = swp_offset_pfn(entry);
> > +			unsigned long i;
> > +
> > +			if (is_writable_device_private_entry(entry))
> > +				cpu_flags |= HMM_PFN_WRITE;
> > +
> > +			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> > +				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> > +				hmm_pfns[i] |= pfn | cpu_flags;
> > +			}
> > +
> 
> Can you add a comment here about why this is added? Why would there be a disconnect
> between HMM users and the API? I assume you are referring to drivers that are
> not yet aware of large folios.

Yes I am, will do.

> 
> > +			return 0;
> > +		}
> > +#endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
> > +
> >  		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
> >  			return -EFAULT;
> >  		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
> 
> Other than that, based on the assumption that my patches are not a pre-requisite
> for this
> 
> Acked-by: Balbir Singh <balbirs@nvidia.com>

Thanks.

> 
> Thanks,
> Balbir
> 


* [PATCH] mm/hmm: populate PFNs from PMD swap entry
  2025-09-02 12:53     ` Francois Dugast
@ 2025-09-02 13:07       ` Francois Dugast
  2025-09-03  5:47         ` Matthew Brost
  0 siblings, 1 reply; 11+ messages in thread
From: Francois Dugast @ 2025-09-02 13:07 UTC
  To: francois.dugast
  Cc: airlied, akpm, apopple, balbirs, christian.koenig, jgg, leonro,
	matthew.brost, mm-commits, mpenttil, thomas.hellstrom, ziy

Once support for THP migration of zone device pages is enabled, device
private swap entries will be found during the walk not only for PTEs but
also for PMDs.

Therefore, it is necessary to extend to PMDs the special handling which is
already in place for PTEs when device private pages are owned by the
caller: instead of faulting or skipping the range, the correct behavior is
to use the swap entry to populate HMM PFNs.

This change is a prerequisite to make use of device-private THP in drivers
using drivers/gpu/drm/drm_pagemap, such as xe.

Even though subsequent PFNs can be inferred when handling large order
PFNs, the PFN list is still fully populated because this is currently
expected by HMM users. If this changes in the future, that is, once all HMM
users support a sparsely populated PFN list, the for() loop can be made to
skip the remaining PFNs for the current order. A quick test shows that the
loop then takes about 10 ns, roughly 20 times faster than without this
optimization.

Link: https://lkml.kernel.org/r/20250829080505.1020155-1-francois.dugast@intel.com
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Acked-by: Balbir Singh <balbirs@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 mm/hmm.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/mm/hmm.c b/mm/hmm.c
index d545e2494994..a8ac8d830e39 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -355,6 +355,35 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	}
 
 	if (!pmd_present(pmd)) {
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+		swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+		if (is_device_private_entry(entry) &&
+		    pfn_swap_entry_folio(entry)->pgmap->owner ==
+		    range->dev_private_owner) {
+			unsigned long cpu_flags = HMM_PFN_VALID |
+				hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
+			unsigned long pfn = swp_offset_pfn(entry);
+			unsigned long i;
+
+			if (is_writable_device_private_entry(entry))
+				cpu_flags |= HMM_PFN_WRITE;
+
+			/*
+			 * Fully populate the PFN list though subsequent
+			 * PFNs could be inferred, because drivers which
+			 * are not yet aware of large folios probably do
+			 * not support sparsely populated PFN lists.
+			 */
+			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
+				hmm_pfns[i] |= pfn | cpu_flags;
+			}
+
+			return 0;
+		}
+#endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
+
 		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
 			return -EFAULT;
 		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
-- 
2.43.0



* Re: [PATCH] mm/hmm: populate PFNs from PMD swap entry
  2025-09-02 13:07       ` Francois Dugast
@ 2025-09-03  5:47         ` Matthew Brost
  2025-09-04 13:25           ` Francois Dugast
  0 siblings, 1 reply; 11+ messages in thread
From: Matthew Brost @ 2025-09-03  5:47 UTC
  To: Francois Dugast
  Cc: airlied, akpm, apopple, balbirs, christian.koenig, jgg, leonro,
	mm-commits, mpenttil, thomas.hellstrom, ziy

On Tue, Sep 02, 2025 at 03:07:13PM +0200, Francois Dugast wrote:
> Once support for THP migration of zone device pages is enabled, device
> private swap entries will be found during the walk not only for PTEs but
> also for PMDs.
> 
> Therefore, it is necessary to extend to PMDs the special handling which is
> already in place for PTEs when device private pages are owned by the
> caller: instead of faulting or skipping the range, the correct behavior is
> to use the swap entry to populate HMM PFNs.
> 
> This change is a prerequisite to make use of device-private THP in drivers
> using drivers/gpu/drm/drm_pagemap, such as xe.
> 
> Even though subsequent PFNs can be inferred when handling large order
> PFNs, the PFN list is still fully populated because this is currently
> expected by HMM users. If this changes in the future, that is, once all HMM
> users support a sparsely populated PFN list, the for() loop can be made to
> skip the remaining PFNs for the current order. A quick test shows that the
> loop then takes about 10 ns, roughly 20 times faster than without this
> optimization.
> 
> Link: https://lkml.kernel.org/r/20250829080505.1020155-1-francois.dugast@intel.com
> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> Acked-by: Balbir Singh <balbirs@nvidia.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Leon Romanovsky <leonro@nvidia.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Balbir Singh <balbirs@nvidia.com>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Mika Penttilä <mpenttil@redhat.com>
> Cc: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>  mm/hmm.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index d545e2494994..a8ac8d830e39 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -355,6 +355,35 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  	}
>  
>  	if (!pmd_present(pmd)) {
> +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> +		swp_entry_t entry = pmd_to_swp_entry(pmd);
> +
> +		if (is_device_private_entry(entry) &&
> +		    pfn_swap_entry_folio(entry)->pgmap->owner ==
> +		    range->dev_private_owner) {
> +			unsigned long cpu_flags = HMM_PFN_VALID |
> +				hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
> +			unsigned long pfn = swp_offset_pfn(entry);
> +			unsigned long i;
> +
> +			if (is_writable_device_private_entry(entry))
> +				cpu_flags |= HMM_PFN_WRITE;
> +
> +			/*
> +			 * Fully populate the PFN list though subsequent
> +			 * PFNs could be inferred, because drivers which
> +			 * are not yet aware of large folios probably do
> +			 * not support sparsely populated PFN lists.
> +			 */
> +			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> +				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> +				hmm_pfns[i] |= pfn | cpu_flags;
> +			}
> +
> +			return 0;
> +		}
> +#endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
> +


I believe you missed my comment in a previous rev. I think you need
something like this to force a fault on a dev_private_owner / pgmap->owner
mismatch.

               required_fault = hmm_range_need_fault(hmm_vma_walk, hmm_pfns,
                                                     npages, 0);
               if (required_fault) {
                       if (is_device_private_entry(entry))
                               return hmm_vma_fault(addr, end, required_fault, walk);
                       else
                               return -EFAULT;
               }

Matt

>  		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
>  			return -EFAULT;
>  		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
> -- 
> 2.43.0
> 
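
The control flow requested in this review can be summarised as a small
decision function. This is a user-space sketch under stated assumptions, not
the code that was eventually posted: the enum, the function name and the
three boolean inputs are hypothetical, and the no-fault outcome is taken
from the existing hmm_pfns_fill(..., HMM_PFN_ERROR) fallback in the quoted
hunk.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes for a non-present PMD seen during the walk. */
enum mock_action {
	MOCK_POPULATE,   /* owned device-private entry: fill PFNs from it */
	MOCK_FAULT,      /* foreign device-private entry, fault requested */
	MOCK_EFAULT,     /* any other non-present entry, fault requested */
	MOCK_FILL_ERROR, /* no fault requested: mark the range as error */
};

static enum mock_action mock_absent_pmd_action(bool device_private,
						bool owner_matches,
						bool need_fault)
{
	/* An owner can only match when the entry is device-private. */
	assert(device_private || !owner_matches);

	if (device_private && owner_matches)
		return MOCK_POPULATE;

	if (need_fault)
		return device_private ? MOCK_FAULT : MOCK_EFAULT;

	return MOCK_FILL_ERROR;
}
```

The case the review singles out is a device-private entry whose pgmap owner
differs from range->dev_private_owner: in this model it maps to MOCK_FAULT
instead of MOCK_EFAULT whenever a fault was requested.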


* Re: [PATCH] mm/hmm: populate PFNs from PMD swap entry
  2025-09-03  5:47         ` Matthew Brost
@ 2025-09-04 13:25           ` Francois Dugast
  2025-09-08  9:10             ` Francois Dugast
  0 siblings, 1 reply; 11+ messages in thread
From: Francois Dugast @ 2025-09-04 13:25 UTC
  To: Matthew Brost
  Cc: airlied, akpm, apopple, balbirs, christian.koenig, jgg, leonro,
	mm-commits, mpenttil, thomas.hellstrom, ziy

On Tue, Sep 02, 2025 at 10:47:40PM -0700, Matthew Brost wrote:
> On Tue, Sep 02, 2025 at 03:07:13PM +0200, Francois Dugast wrote:
> > Once support for THP migration of zone device pages is enabled, device
> > private swap entries will be found during the walk not only for PTEs but
> > also for PMDs.
> > 
> > Therefore, it is necessary to extend to PMDs the special handling which is
> > already in place for PTEs when device private pages are owned by the
> > caller: instead of faulting or skipping the range, the correct behavior is
> > to use the swap entry to populate HMM PFNs.
> > 
> > This change is a prerequisite to make use of device-private THP in drivers
> > using drivers/gpu/drm/drm_pagemap, such as xe.
> > 
> > Even though subsequent PFNs can be inferred when handling large order
> > PFNs, the PFN list is still fully populated because this is currently
> > expected by HMM users. If this changes in the future, that is, once all HMM
> > users support a sparsely populated PFN list, the for() loop can be made to
> > skip the remaining PFNs for the current order. A quick test shows that the
> > loop then takes about 10 ns, roughly 20 times faster than without this
> > optimization.
> > 
> > Link: https://lkml.kernel.org/r/20250829080505.1020155-1-francois.dugast@intel.com
> > Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> > Acked-by: Balbir Singh <balbirs@nvidia.com>
> > Cc: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: Leon Romanovsky <leonro@nvidia.com>
> > Cc: Zi Yan <ziy@nvidia.com>
> > Cc: Alistair Popple <apopple@nvidia.com>
> > Cc: Balbir Singh <balbirs@nvidia.com>
> > Cc: David Airlie <airlied@gmail.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: Mika Penttilä <mpenttil@redhat.com>
> > Cc: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > ---
> >  mm/hmm.c | 29 +++++++++++++++++++++++++++++
> >  1 file changed, 29 insertions(+)
> > 
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index d545e2494994..a8ac8d830e39 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -355,6 +355,35 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> >  	}
> >  
> >  	if (!pmd_present(pmd)) {
> > +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> > +		swp_entry_t entry = pmd_to_swp_entry(pmd);
> > +
> > +		if (is_device_private_entry(entry) &&
> > +		    pfn_swap_entry_folio(entry)->pgmap->owner ==
> > +		    range->dev_private_owner) {
> > +			unsigned long cpu_flags = HMM_PFN_VALID |
> > +				hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
> > +			unsigned long pfn = swp_offset_pfn(entry);
> > +			unsigned long i;
> > +
> > +			if (is_writable_device_private_entry(entry))
> > +				cpu_flags |= HMM_PFN_WRITE;
> > +
> > +			/*
> > +			 * Fully populate the PFN list though subsequent
> > +			 * PFNs could be inferred, because drivers which
> > +			 * are not yet aware of large folios probably do
> > +			 * not support sparsely populated PFN lists.
> > +			 */
> > +			for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> > +				hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
> > +				hmm_pfns[i] |= pfn | cpu_flags;
> > +			}
> > +
> > +			return 0;
> > +		}
> > +#endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
> > +
> 
> 
> I believe you missed my comment in a previous rev. I think you need
> something like this to force a fault on a dev_private_owner / pgmap->owner
> mismatch.
> 
>                required_fault = hmm_range_need_fault(hmm_vma_walk, hmm_pfns,
>                                                      npages, 0);
>                if (required_fault) {
>                        if (is_device_private_entry(entry))
>                                return hmm_vma_fault(addr, end, required_fault, walk);
>                        else
>                                return -EFAULT;
>                }
> 

Yes you are right, we need the same logic as what is already in place
for PTEs. I could confirm that during prefetch by running the IGT test
xe_exec_system_allocator/prefetch-sys-benchmark.

I will send a new revision.

Thanks,
Francois

> Matt
> 
> >  		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
> >  			return -EFAULT;
> >  		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
> > -- 
> > 2.43.0
> > 


* [PATCH] mm/hmm: populate PFNs from PMD swap entry
  2025-09-04 13:25           ` Francois Dugast
@ 2025-09-08  9:10             ` Francois Dugast
  2025-09-08 13:46               ` Jason Gunthorpe
  0 siblings, 1 reply; 11+ messages in thread
From: Francois Dugast @ 2025-09-08  9:10 UTC
  To: francois.dugast
  Cc: airlied, akpm, apopple, balbirs, christian.koenig, jgg, leonro,
	matthew.brost, mm-commits, mpenttil, thomas.hellstrom, ziy

Once support for THP migration of zone device pages is enabled, device
private swap entries will be found during the walk not only for PTEs but
also for PMDs.

Therefore, it is necessary to extend to PMDs the special handling which is
already in place for PTEs when device private pages are owned by the
caller: instead of faulting or skipping the range, the correct behavior is
to use the swap entry to populate HMM PFNs.

This change is a prerequisite to make use of device-private THP in drivers
using drivers/gpu/drm/drm_pagemap, such as xe.

Even though subsequent PFNs can be inferred when handling large order
PFNs, the PFN list is still fully populated because this is currently
expected by HMM users. In case this changes in the future, that is all HMM
users support a sparsely populated PFN list, the for() loop can be made to
skip remaining PFNs for the current order. A quick test shows the loop
takes about 10 ns, roughly 20 times faster than without this optimization.
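
As a rough userspace illustration of the paragraph above (a sketch only,
not the kernel code): the first helper fills every slot of a PFN list for
one large mapping, as this patch does, and the second shows how a consumer
aware of the order field could step over a whole mapping by reading only
its first slot. All SKETCH_* constants and sketch_* helpers are
hypothetical stand-ins; the bit positions are chosen for readability and
do not match include/linux/hmm.h. Unlike the patch, the fill helper does
not preserve caller-provided input flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag layout for illustration; NOT the kernel's values. */
#define SKETCH_PFN_VALID       (UINT64_C(1) << 63)
#define SKETCH_PFN_WRITE       (UINT64_C(1) << 62)
#define SKETCH_PFN_ORDER_SHIFT 56
#define SKETCH_PFN_ORDER_MASK  (UINT64_C(0x1f) << SKETCH_PFN_ORDER_SHIFT)
#define SKETCH_PFN_MASK        ((UINT64_C(1) << SKETCH_PFN_ORDER_SHIFT) - 1)

/* Encode a mapping order into the flag bits of an entry. */
static uint64_t sketch_order_flag(unsigned int order)
{
	return (uint64_t)order << SKETCH_PFN_ORDER_SHIFT;
}

/* Decode the mapping order stored in an entry. */
static unsigned int sketch_entry_order(uint64_t entry)
{
	return (unsigned int)((entry & SKETCH_PFN_ORDER_MASK) >>
			      SKETCH_PFN_ORDER_SHIFT);
}

/*
 * Fully populate the list: every slot gets its own PFN plus the same
 * flags, even though slots after the first could be inferred.
 */
static void sketch_fill(uint64_t *pfns, size_t npages, uint64_t base_pfn,
			unsigned int order, int writable)
{
	uint64_t flags = SKETCH_PFN_VALID | sketch_order_flag(order);
	size_t i;

	if (writable)
		flags |= SKETCH_PFN_WRITE;

	for (i = 0; i < npages; i++)
		pfns[i] = (base_pfn + i) | flags;
}

/*
 * Order-aware consumer: read the first slot of each mapping, then jump
 * ahead by 2^order slots. Returns how many slots were actually read.
 */
static size_t sketch_walk_sparse(const uint64_t *pfns, size_t npages)
{
	size_t visited = 0;
	size_t i = 0;

	while (i < npages) {
		visited++;
		i += (size_t)1 << sketch_entry_order(pfns[i]);
	}

	return visited;
}
```

With a 512-entry list describing a single order-9 mapping, the fill helper
writes all 512 slots while the order-aware walk reads only one of them,
which is the gap the paragraph above refers to.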

Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Francois Dugast <francois.dugast@intel.com>
---
 mm/hmm.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 65 insertions(+), 5 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index d545e2494994..3e00f08722d5 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -326,6 +326,68 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 	return hmm_vma_fault(addr, end, required_fault, walk);
 }
 
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+static int hmm_vma_handle_absent_pmd(struct mm_walk *walk, unsigned long start,
+				     unsigned long end, unsigned long *hmm_pfns,
+				     pmd_t pmd)
+{
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
+	unsigned long npages = (end - start) >> PAGE_SHIFT;
+	unsigned long addr = start;
+	swp_entry_t entry = pmd_to_swp_entry(pmd);
+	unsigned int required_fault;
+
+	if (is_device_private_entry(entry) &&
+	    pfn_swap_entry_folio(entry)->pgmap->owner ==
+	    range->dev_private_owner) {
+		unsigned long cpu_flags = HMM_PFN_VALID |
+			hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
+		unsigned long pfn = swp_offset_pfn(entry);
+		unsigned long i;
+
+		if (is_writable_device_private_entry(entry))
+			cpu_flags |= HMM_PFN_WRITE;
+
+		/*
+		 * Fully populate the PFN list though subsequent PFNs could be
+		 * inferred, because drivers which are not yet aware of large
+		 * folios probably do not support sparsely populated PFN lists.
+		 */
+		for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+			hmm_pfns[i] &= HMM_PFN_INOUT_FLAGS;
+			hmm_pfns[i] |= pfn | cpu_flags;
+		}
+
+		return 0;
+	}
+
+	required_fault = hmm_range_need_fault(hmm_vma_walk, hmm_pfns,
+					      npages, 0);
+	if (required_fault) {
+		if (is_device_private_entry(entry))
+			return hmm_vma_fault(addr, end, required_fault, walk);
+		else
+			return -EFAULT;
+	}
+
+	return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
+}
+#else
+static int hmm_vma_handle_absent_pmd(struct mm_walk *walk, unsigned long start,
+				     unsigned long end, unsigned long *hmm_pfns,
+				     pmd_t pmd)
+{
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
+	unsigned long npages = (end - start) >> PAGE_SHIFT;
+
+	if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
+		return -EFAULT;
+	return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
+}
+#endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
+
 static int hmm_vma_walk_pmd(pmd_t *pmdp,
 			    unsigned long start,
 			    unsigned long end,
@@ -354,11 +416,9 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		return hmm_pfns_fill(start, end, range, 0);
 	}
 
-	if (!pmd_present(pmd)) {
-		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0))
-			return -EFAULT;
-		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
-	}
+	if (!pmd_present(pmd))
+		return hmm_vma_handle_absent_pmd(walk, start, end, hmm_pfns,
+						 pmd);
 
 	if (pmd_trans_huge(pmd)) {
 		/*
-- 
2.43.0


* Re: [PATCH] mm/hmm: populate PFNs from PMD swap entry
  2025-09-08  9:10             ` Francois Dugast
@ 2025-09-08 13:46               ` Jason Gunthorpe
  2025-09-09  1:50                 ` Andrew Morton
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2025-09-08 13:46 UTC (permalink / raw)
  To: Francois Dugast
  Cc: airlied, akpm, apopple, balbirs, christian.koenig, leonro,
	matthew.brost, mm-commits, mpenttil, thomas.hellstrom, ziy

On Mon, Sep 08, 2025 at 11:10:52AM +0200, Francois Dugast wrote:
> Once support for THP migration of zone device pages is enabled, device
> private swap entries will be found during the walk not only for PTEs but
> also for PMDs.
> 
> Therefore, it is necessary to extend to PMDs the special handling which is
> already in place for PTEs when device private pages are owned by the
> caller: instead of faulting or skipping the range, the correct behavior is
> to use the swap entry to populate HMM PFNs.
> 
> This change is a prerequisite to make use of device-private THP in drivers
> using drivers/gpu/drm/drm_pagemap, such as xe.
> 
> Even though subsequent PFNs can be inferred when handling large order
> PFNs, the PFN list is still fully populated because this is currently
> expected by HMM users. In case this changes in the future, that is all HMM
> users support a sparsely populated PFN list, the for() loop can be made to
> skip remaining PFNs for the current order. A quick test shows the loop
> takes about 10 ns, roughly 20 times faster than without this optimization.
> 
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Leon Romanovsky <leonro@nvidia.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Balbir Singh <balbirs@nvidia.com>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Mika Penttilä <mpenttil@redhat.com>
> Cc: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> ---
>  mm/hmm.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 65 insertions(+), 5 deletions(-)

Please put version numbers on your patches and include a change log.

Jason

* Re: [PATCH] mm/hmm: populate PFNs from PMD swap entry
  2025-09-08 13:46               ` Jason Gunthorpe
@ 2025-09-09  1:50                 ` Andrew Morton
  0 siblings, 0 replies; 11+ messages in thread
From: Andrew Morton @ 2025-09-09  1:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Francois Dugast, airlied, apopple, balbirs, christian.koenig,
	leonro, matthew.brost, mm-commits, mpenttil, thomas.hellstrom,
	ziy

On Mon, 8 Sep 2025 10:46:19 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote:

> > Even though subsequent PFNs can be inferred when handling large order
> > PFNs, the PFN list is still fully populated because this is currently
> > expected by HMM users. In case this changes in the future, that is all HMM
> > users support a sparsely populated PFN list, the for() loop can be made to
> > skip remaining PFNs for the current order. A quick test shows the loop
> > takes about 10 ns, roughly 20 times faster than without this optimization.
> > 
> > Cc: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: Leon Romanovsky <leonro@nvidia.com>
> > Cc: Zi Yan <ziy@nvidia.com>
> > Cc: Alistair Popple <apopple@nvidia.com>
> > Cc: Balbir Singh <balbirs@nvidia.com>
> > Cc: David Airlie <airlied@gmail.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: Mika Penttilä <mpenttil@redhat.com>
> > Cc: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> > ---
> >  mm/hmm.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 65 insertions(+), 5 deletions(-)
> 
> Please put version numbers on your patches and include a change log.

yep.

But the diff between this and the previous version is larger than the
patch itself.  So it's basically a whole new patch.


Thread overview: 11+ messages
2025-08-30  3:56 + mm-hmm-populate-pfns-from-pmd-swap-entry.patch added to mm-new branch Andrew Morton
2025-09-02 11:17 ` [PATCH] mm/hmm: populate PFNs from PMD swap entry Francois Dugast
2025-09-02 11:30   ` Balbir Singh
2025-09-02 12:53     ` Francois Dugast
2025-09-02 13:07       ` Francois Dugast
2025-09-03  5:47         ` Matthew Brost
2025-09-04 13:25           ` Francois Dugast
2025-09-08  9:10             ` Francois Dugast
2025-09-08 13:46               ` Jason Gunthorpe
2025-09-09  1:50                 ` Andrew Morton
  -- strict thread matches above, loose matches on Subject: below --
2025-08-29  8:05 [PATCH] mm/hmm: Populate " Francois Dugast
