* [RFC 0/1] Avoid pagewalk hugepage-split race with VFIO DMA set
@ 2026-03-09 17:49 Max Boone
2026-03-09 17:49 ` [RFC 1/1] mm/pagewalk: don't split device-backed huge pfnmaps Max Boone
0 siblings, 1 reply; 14+ messages in thread
From: Max Boone @ 2026-03-09 17:49 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Alex Williamson, linux-mm, kvm,
linux-kernel, Max Tottenham, Josh Hunt, Matt Pelland, Max Boone
A kernel BUG can be triggered when /proc/$PID/numa_maps reads are
hammered by one process while said $PID is setting up DMA for large,
1G-aligned, memory-mapped BARs.
The 1G-aligned memory-mapped BARs are set up as PUD-order PFNMAPs. When
the generic page walker (mm/pagewalk.c) reaches the PUD table entry of
the memory-mapped BAR in the walk_pud_range function, it tries to
split it by:

1. deleting the PUD entry by calling split_huge_pud
2. checking whether `pud_none` is true to go to `again`

	if (walk->vma)
		split_huge_pud(walk->vma, pud, addr);
	...
	if (pud_none(*pud))
		goto again;

3. if has_install is set, calling __pmd_alloc and descending further
   into walk_pmd_range

	again:
		next = pud_addr_end(addr, end);
		if (pud_none(*pud)) {
			if (has_install)
				err = __pmd_alloc(walk->mm, pud, addr);
When VFIO is setting up DMA, the PUD entry can get reinstalled between
the split_huge_pud call and the pud_none check that leads to `goto again`.
In such a case the walk continues to the PMD level and an illegal read happens.
As a mitigation, I propose skipping the split of PMD and PUD entries
that are marked as special in the walker; these are mappings that do not
wish to be associated with a "struct page". The only occurrences of such
entries I found were the vfio pci and nvgrace pfnmap mappings, which do
not behave like regular memory.
For a reproduction, the `vfio-mmap-bar.py` script repeatedly DMA-maps a
1G-aligned BAR and can be used to reproduce this bug:
- https://github.com/akamaxb/repro-vfio-page-walk-race.git
Run the `vfio-mmap-bar.py` script with the device you want to pass through,
and in the meantime, cat the `/proc/$PID/numa_maps` of that process repeatedly
in a while loop. This caused the `numa_maps` read to crash on an illegal read
when testing against the 128GB second BAR of an NVIDIA Blackwell 6000 GPU.
Signed-off-by: Max Boone <mboone@akamai.com>
Signed-off-by: Max Tottenham <mtottenh@akamai.com>
Max Boone (1):
mm/pagewalk: don't split device-backed huge pfnmaps
mm/pagewalk.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 14+ messages in thread
* [RFC 1/1] mm/pagewalk: don't split device-backed huge pfnmaps
2026-03-09 17:49 [RFC 0/1] Avoid pagewalk hugepage-split race with VFIO DMA set Max Boone
@ 2026-03-09 17:49 ` Max Boone
2026-03-09 20:19 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 14+ messages in thread
From: Max Boone @ 2026-03-09 17:49 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Alex Williamson, linux-mm, kvm,
linux-kernel, Max Tottenham, Josh Hunt, Matt Pelland, Max Boone
Don't split and descend on special PMD/PUDs, which are generally
device-backed huge pfnmaps as used by vfio for BAR mapping. These
can be faulted back in after splitting and before descending, which
can race to an illegal read.
Signed-off-by: Max Boone <mboone@akamai.com>
Signed-off-by: Max Tottenham <mtottenh@akamai.com>
---
mm/pagewalk.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index a94c401ab..d1460dd84 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -147,10 +147,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 			continue;
 		}
 
-		if (walk->vma)
+		if (walk->vma) {
+			/*
+			 * Don't descend into device-backed pfnmaps,
+			 * they might refault the PMD entry.
+			 */
+			if (unlikely(pmd_special(*pmd)))
+				continue;
+
 			split_huge_pmd(walk->vma, pmd, addr);
-		else if (pmd_leaf(*pmd) || !pmd_present(*pmd))
+		} else if (pmd_leaf(*pmd) || !pmd_present(*pmd)) {
 			continue; /* Nothing to do. */
+		}
 
 		err = walk_pte_range(pmd, addr, next, walk);
 		if (err)
@@ -213,10 +221,18 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 			continue;
 		}
 
-		if (walk->vma)
+		if (walk->vma) {
+			/*
+			 * Don't descend into device-backed pfnmaps,
+			 * they might refault the PUD entry.
+			 */
+			if (unlikely(pud_special(*pud)))
+				continue;
+
 			split_huge_pud(walk->vma, pud, addr);
-		else if (pud_leaf(*pud) || !pud_present(*pud))
+		} else if (pud_leaf(*pud) || !pud_present(*pud)) {
 			continue; /* Nothing to do. */
+		}
 
 		if (pud_none(*pud))
 			goto again;
--
2.34.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [RFC 1/1] mm/pagewalk: don't split device-backed huge pfnmaps
2026-03-09 17:49 ` [RFC 1/1] mm/pagewalk: don't split device-backed huge pfnmaps Max Boone
@ 2026-03-09 20:19 ` David Hildenbrand (Arm)
2026-03-09 22:47 ` Boone, Max
2026-03-09 23:02 ` Boone, Max
0 siblings, 2 replies; 14+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-09 20:19 UTC (permalink / raw)
To: Max Boone, Andrew Morton
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Alex Williamson, linux-mm, kvm,
linux-kernel, Max Tottenham, Josh Hunt, Matt Pelland
On 3/9/26 18:49, Max Boone wrote:
> Don't split and descend on special PMD/PUDs, which are generally
> device-backed huge pfnmaps as used by vfio for BAR mapping. These
> can be faulted back in after splitting and before descending, which
> can race to an illegal read.
>
> Signed-off-by: Max Boone <mboone@akamai.com>
> Signed-off-by: Max Tottenham <mtottenh@akamai.com>
>
> ---
> mm/pagewalk.c | 24 ++++++++++++++++++++----
> 1 file changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index a94c401ab..d1460dd84 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -147,10 +147,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> continue;
> }
>
> - if (walk->vma)
> + if (walk->vma) {
> + /*
> + * Don't descend into device-backed pfnmaps,
> + * they might refault the PMD entry.
> + */
> + if (unlikely(pmd_special(*pmd)))
> + continue;
In general, if you're using pmd_special()/pud_special() and friends in
ordinary page table walking code, you are doing something wrong. We
don't want to leak these details in such page table walkers.
We do have vm_normal_page_pmd() to identify special mappings, but I
first have to understand what exactly you are trying to solve here.
(You would also be affecting the remapping of the huge zero folio.)
A lot more details from the cover letter belong into the patch
description. In fact, you don't even need a cover letter :)
IIUC, this is rather serious and would require a Fixes: and even Cc: stable?
I'll spend some time tomorrow trying to understand what the real problem
here is.
But for now: can this only be reproduced with PUDs (which you mention in
the cover letter) or also with PMDs?
For the PMD case I would assume that pte_offset_map_lock() performs
proper checks. And for the PUD case we are missing a re-check under the PTL.
--
Cheers,
David
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC 1/1] mm/pagewalk: don't split device-backed huge pfnmaps
2026-03-09 20:19 ` David Hildenbrand (Arm)
2026-03-09 22:47 ` Boone, Max
@ 2026-03-09 23:02 ` Boone, Max
2026-03-10 9:11 ` David Hildenbrand (Arm)
1 sibling, 1 reply; 14+ messages in thread
From: Boone, Max @ 2026-03-09 23:02 UTC (permalink / raw)
To: David Hildenbrand (Arm), Andrew Morton
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Alex Williamson,
linux-mm@kvack.org, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Tottenham, Max, Hunt, Joshua,
Pelland, Matt
On Mar 9, 2026 9:19 PM, "David Hildenbrand (Arm)" <david@kernel.org> wrote:
>
> On 3/9/26 18:49, Max Boone wrote:
> > Don't split and descend on special PMD/PUDs, which are generally
> > device-backed huge pfnmaps as used by vfio for BAR mapping. These
> > can be faulted back in after splitting and before descending, which
> > can race to an illegal read.
> >
> > Signed-off-by: Max Boone <mboone@akamai.com>
> > Signed-off-by: Max Tottenham <mtottenh@akamai.com>
> >
> > ---
> > mm/pagewalk.c | 24 ++++++++++++++++++++----
> > 1 file changed, 20 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > index a94c401ab..d1460dd84 100644
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -147,10 +147,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> > continue;
> > }
> >
> > - if (walk->vma)
> > + if (walk->vma) {
> > + /*
> > + * Don't descend into device-backed pfnmaps,
> > + * they might refault the PMD entry.
> > + */
> > + if (unlikely(pmd_special(*pmd)))
> > + continue;
>
> In general, if you're using pmd_special()/pud_special() and friends in
> ordinary page table walking code, you are doing something wrong. We
> don't want to leak these details in such page table walkers.
That sounds sensible; there is a check in the split_huge_pud macro, which previously also covered DAX memory.
Related to handling that macro, I see another proposed patch for lazy provisioning of PTEs for PMD-order THPs [1]. Possibly adding a return code to the split functions would allow a better solution here as well?
I'm not sure whether making the split (or rather unmap; calling it a split has been a bit confusing to me, as it doesn't allocate PMDs) a no-op will improve things, as to my understanding the walker will still try to descend.
> We do have vm_normal_page_pmd() to identify special mappings, but I
> first have to understand what exactly you are trying to solve here.
Specifically for the page walker, avoid splitting and descending into the PUD-order pfnmaps that VFIO creates for the BAR mappings - as these (can) represent hardware control registers rather than regular memory. I haven't been able to reproduce it with PMD-level pfnmaps, but I'll build a kernel with PUD-level pfnmaps disabled tomorrow.
But of course I'm mainly concerned with fixing the race such that reading numa_maps does not cause an illegal read, resulting in the reading process crashing while holding the mmap lock of the target process (and subsequent reads of proc freezing, waiting for the mmap lock they'll never get).
> (You would also be affecting the remapping of the huge zero folio.)
Ah, good one. I do think that this race can occur with PMD-level mappings, looking at the walking & splitting logic - but given that the PUD-level mapping triggered the (rare) occurrence, I'm fine focusing there first. I guess it helps that we don't have 1G THPs, but it would be good to treat 2M and 1G similarly?
> A lot more details from the cover letter belong into the patch
> description. In fact, you don't even need a cover letter :)
Hehe, first timer, still figuring out the process.
> IIUC, this is rather serious and would require a Fixes: and even Cc: stable?
>
> I'll spend some time tomorrow trying to understand what the real problem
> here is.
I think so; the bug can be easily triggered by repeatedly booting up a VM that passes through a PCI device with large BARs while continuously reading the numa_maps of the main VM process. The reproducer script mainly serves to narrow things down to the specific step where the race occurs, the VFIO DMA set ioctl.
Should I raise a bug email to refer to, and resubmit a new RFC v2 (without the cover letter), or keep discussion in this thread for now?
> But for now: can this only be reproduced with PUDs (which you mention in
> the cover letter) or also with PMDs?
>
> For the PMD case I would assume that pte_offset_map_lock() performs
> proper checks. And for the PUD case we are missing a re-check under the PTL.
Have only seen it with PUDs, will try forcing the mapping to happen with PMDs tomorrow.
> --
> Cheers,
>
> David
[1] https://lore.kernel.org/lkml/20260226113233.3987674-1-usama.arif@linux.dev/
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC 1/1] mm/pagewalk: don't split device-backed huge pfnmaps
2026-03-09 23:02 ` Boone, Max
@ 2026-03-10 9:11 ` David Hildenbrand (Arm)
2026-03-10 11:38 ` Boone, Max
0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-10 9:11 UTC (permalink / raw)
To: Boone, Max, Andrew Morton
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Alex Williamson,
linux-mm@kvack.org, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Tottenham, Max, Hunt, Joshua,
Pelland, Matt
On 3/10/26 00:02, Boone, Max wrote:
> On Mar 9, 2026 9:19 PM, "David Hildenbrand (Arm)" <david@kernel.org> wrote:
>>
>> On 3/9/26 18:49, Max Boone wrote:
>>> Don't split and descend on special PMD/PUDs, which are generally
>>> device-backed huge pfnmaps as used by vfio for BAR mapping. These
>>> can be faulted back in after splitting and before descending, which
>>> can race to an illegal read.
>>>
>>> Signed-off-by: Max Boone <mboone@akamai.com>
>>> Signed-off-by: Max Tottenham <mtottenh@akamai.com>
>>>
>>> ---
>>> mm/pagewalk.c | 24 ++++++++++++++++++++----
>>> 1 file changed, 20 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>>> index a94c401ab..d1460dd84 100644
>>> --- a/mm/pagewalk.c
>>> +++ b/mm/pagewalk.c
>>> @@ -147,10 +147,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>>> continue;
>>> }
>>>
>>> - if (walk->vma)
>>> + if (walk->vma) {
>>> + /*
>>> + * Don't descend into device-backed pfnmaps,
>>> + * they might refault the PMD entry.
>>> + */
>>> + if (unlikely(pmd_special(*pmd)))
>>> + continue;
>>
>> In general, if you're using pmd_special()/pud_special() and friends in
>> ordinary page table walking code, you are doing something wrong. We
>> don't want to leak these details in such page table walkers.
>
> That sounds sensible, there is a check in the split_huge_pud macro, which previously included DAX memory.
>
> Related to handling that macro, I see another proposed patch for lazy provisioning of PTEs for PMD order THPs [1]. Possibly adding a return code to the split functions allows a better solution here as well?
>
Maybe. I think the behavior of trying to split is ok. We just have
to teach code to deal with races.
Because the very same problem can likely be triggered by having the
splitting/unmapping be triggered from another thread in some other
code path concurrently.
> I'm not sure if making the split (or rather unmap, calling it a split has been a bit confusing to me as it doesn't allocate PMDs) a noop will improve things - as to my understanding it will still try to descend.
>
>> We do have vm_normal_page_pmd() to identify special mappings, but I
>> first have to understand what exactly you are trying to solve here.
>
> Specifically for the page walker, avoid splitting and descending into the PUD-order pfnmaps that VFIO creates for the BAR mappings - as these (can) represent hardware control registers rather than regular memory. I haven't been able to reproduce it with PMD-level pfnmaps, but I'll build a kernel with PUD-level pfnmaps disabled tomorrow.
>
> But of course I'm mainly concerned with fixing the race such that reading numa_maps does not cause an illegal read, resulting in the reading process crashing while holding the mmap lock of the target process (and subsequent reads of proc freezing, waiting for the mmap lock they'll never get).
Right, that's what we should focus on.
>
>> (You would also be affecting the remapping of the huge zero folio.)
>
> Ah, good one, I do think that this race can occur with PMD-level mappings, looking at the walking & splitting logic - but given the PUD-level mapping triggered the (rare) occurrence I'm fine to focus there first. I guess it helps we don't have 1G THPs, but it would be good to treat 2M and 1G similarly?
I don't think it can happen for PMDs, as pte_offset_map_lock() double-checks
that we really have a page table there. See __pte_offset_map() where we do a
	pmdval = pmdp_get_lockless(pmd);
	...
	if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
		goto nomap;
	if (unlikely(pmd_trans_huge(pmdval)))
		goto unmap;
	...
	return __pte_map(&pmdval, addr);
If someone re-faulted the PMD, this function will detect it and reject
walking it as a PMD table.
PMD handling code has to deal with page table removal, so it needs
some extra steps.
For PUD handling we don't need that. Once we spot a PUD table, it's
not going to get yanked underneath our feet.
>
>> A lot more details from the cover letter belong into the patch
>> description. In fact, you don't even need a cover letter :)
>
> Hehe, first timer, still figuring out the process.
:)
>
>> IIUC, this is rather serious and would require a Fixes: and even Cc: stable?
>>
>> I'll spend some time tomorrow trying to understand what the real problem
>> here is.
>
> I think so, the bug can be easily triggered by repeatedly booting up a VM that passes through a PCI device with large BARs while continuously reading the numa_maps of the main VM process. The reproducer script is mainly to narrow down to the specific part where the race occurs, the VFIO DMA set ioctl.
>
> Should I raise a bug email to refer to, and resubmit a new RFC v2 (without the cover letter), or keep discussion in this thread for now?
No, it's okay. Let's first discuss the proper fix.
>
> But for now: can this only be reproduced with PUDs (which you mention in
> the cover letter) or also with PMDs?
>
> For the PMD case I would assume that pte_offset_map_lock() performs
> proper checks. And for the PUD case we are missing a re-check under the PTL.
>
> Have only seen it with PUDs, will try forcing the mapping to happen with PMDs tomorrow.
Can you try the following:
From b3f0a85b9f071e338097147f997f20d1ac796155 Mon Sep 17 00:00:00 2001
From: "David Hildenbrand (Arm)" <david@kernel.org>
Date: Tue, 10 Mar 2026 10:09:39 +0100
Subject: [PATCH] tmp
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
---
mm/pagewalk.c | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index cb358558807c..779f6fa00ab7 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -96,6 +96,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 			  struct mm_walk *walk)
 {
+	pud_t pudval = pudp_get(pud);
 	pmd_t *pmd;
 	unsigned long next;
 	const struct mm_walk_ops *ops = walk->ops;
@@ -104,6 +105,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	int err = 0;
 	int depth = real_depth(3);
 
+	/*
+	 * For PTE handling, pte_offset_map_lock() takes care of checking
+	 * whether there actually is a page table. But it also has to be
+	 * very careful about concurrent page table reclaim. If we spot a PMD
+	 * table, it cannot go away, so we can just walk it. However, if we find
+	 * something else, we have to retry.
+	 */
+	if (!pud_present(pudval) || pud_leaf(pudval)) {
+		walk->action = ACTION_AGAIN;
+		return 0;
+	}
+
 	pmd = pmd_offset(pud, addr);
 	do {
again:
@@ -176,7 +189,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 
 	pud = pud_offset(p4d, addr);
 	do {
- again:
+again:
 		next = pud_addr_end(addr, end);
 		if (pud_none(*pud)) {
 			if (has_install)
@@ -217,12 +230,13 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 		else if (pud_leaf(*pud) || !pud_present(*pud))
 			continue; /* Nothing to do. */
 
-		if (pud_none(*pud))
-			goto again;
-
 		err = walk_pmd_range(pud, addr, next, walk);
 		if (err)
 			break;
+
+		if (walk->action == ACTION_AGAIN)
+			goto again;
+
 	} while (pud++, addr = next, addr != end);
 
 	return err;
--
2.43.0
--
Cheers,
David
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [RFC 1/1] mm/pagewalk: don't split device-backed huge pfnmaps
2026-03-10 9:11 ` David Hildenbrand (Arm)
@ 2026-03-10 11:38 ` Boone, Max
2026-03-10 15:19 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 14+ messages in thread
From: Boone, Max @ 2026-03-10 11:38 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alex Williamson,
linux-mm@kvack.org, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Tottenham, Max, Hunt, Joshua,
Pelland, Matt
> On Mar 10, 2026, at 10:11 AM, David Hildenbrand (Arm) <david@kernel.org> wrote:
>
> On 3/10/26 00:02, Boone, Max wrote:
>> On Mar 9, 2026 9:19 PM, "David Hildenbrand (Arm)" <david@kernel.org> wrote:
>>>
>>> On 3/9/26 18:49, Max Boone wrote:
>>>> Don't split and descend on special PMD/PUDs, which are generally
>>>> device-backed huge pfnmaps as used by vfio for BAR mapping. These
>>>> can be faulted back in after splitting and before descending, which
>>>> can race to an illegal read.
>>>>
>>>> Signed-off-by: Max Boone <mboone@akamai.com>
>>>> Signed-off-by: Max Tottenham <mtottenh@akamai.com>
>>>>
>>>> ---
>>>> mm/pagewalk.c | 24 ++++++++++++++++++++----
>>>> 1 file changed, 20 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>>>> index a94c401ab..d1460dd84 100644
>>>> --- a/mm/pagewalk.c
>>>> +++ b/mm/pagewalk.c
>>>> @@ -147,10 +147,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>>>> continue;
>>>> }
>>>>
>>>> - if (walk->vma)
>>>> + if (walk->vma) {
>>>> + /*
>>>> + * Don't descend into device-backed pfnmaps,
>>>> + * they might refault the PMD entry.
>>>> + */
>>>> + if (unlikely(pmd_special(*pmd)))
>>>> + continue;
>>>
>>> In general, if you're using pmd_special()/pud_special() and friends in
>>> ordinary page table walking code, you are doing something wrong. We
>>> don't want to leak these details in such page table walkers.
>>
>> That sounds sensible, there is a check in the split_huge_pud macro, which previously included DAX memory.
>>
>> Related to handling that macro, I see another proposed patch for lazy provisioning of PTEs for PMD order THPs [1]. Possibly adding a return code to the split functions allows a better solution here as well?
>>
>
> Maybe. I think the behavior of trying to split is ok. We just have
> to teach code to deal with races.
>
> Because the very same problem can likely be triggered by having the
> splitting/unmapping be triggered from another thread in some other
> code path concurrently.
I was previously testing on 6.12 and didn’t see any changes to vfio-pci or
pagewalk.c, which prompted me to check whether I could reproduce the
bug on a more recent kernel.
However, when I tried to reproduce the bug on 7.0-rc2 (after adding some
tracing to get a clearer picture of the sequence of events), it doesn’t happen.
The VFIO DMA set operation is much faster on 7.0, so possibly the race
window is too small for it to occur in a reasonable time.
>
>> I'm not sure if making the split (or rather unmap, calling it a split has been a bit confusing to me as it doesn't allocate PMDs) a noop will improve things - as to my understanding it will still try to descend.
>>
>>> We do have vm_normal_page_pmd() to identify special mappings, but I
>>> first have to understand what exactly you are trying to solve here.
>>
>> Specifically for the page walker, avoid splitting and descending into the PUD-order pfnmaps that VFIO creates for the BAR mappings - as these (can) represent hardware control registers rather than regular memory. I haven't been able to reproduce it with PMD-level pfnmaps, but I'll build a kernel with PUD-level pfnmaps disabled tomorrow.
>>
>> But of course I'm mainly concerned with fixing the race such that reading numa_maps does not cause an illegal read, resulting in the reading process crashing while holding the mmap lock of the target process (and subsequent reads of proc freezing, waiting for the mmap lock they'll never get).
>
> Right, that's what we should focus on.
>
>>
>>> (You would also be affecting the remapping of the huge zero folio.)
>>
>> Ah, good one, I do think that this race can occur with PMD-level mappings, looking at the walking & splitting logic - but given the PUD-level mapping triggered the (rare) occurrence I'm fine to focus there first. I guess it helps we don't have 1G THPs, but it would be good to treat 2M and 1G similarly?
>
> I don't think it can happen for PMDs, as pte_offset_map_lock() double-checks
> that we really have a page table there. See __pte_offset_map() where we do a
>
> pmdval = pmdp_get_lockless(pmd);
> ...
> if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
> goto nomap;
> if (unlikely(pmd_trans_huge(pmdval)))
> goto unmap;
> ...
> return __pte_map(&pmdval, addr);
>
>
> If someone re-faulted the PMD, this function will detect it and reject
> walking it as a PMD table.
>
> PMD handling code has to deal with page table removal, so it needs
> some extra steps.
>
> For PUD handling we don't need that. Once we spot a PUD table, it's
> not going to get yanked underneath our feet.
I have indeed not been able to reproduce the bug with PMDs, sounds like
that’s not an issue.
>
>>
>>> A lot more details from the cover letter belong into the patch
>>> description. In fact, you don't even need a cover letter :)
>>
>> Hehe, first timer, still figuring out the process.
>
> :)
>
>>
>>> IIUC, this is rather serious and would require a Fixes: and even Cc: stable?
>>>
>>> I'll spend some time tomorrow trying to understand what the real problem
>>> here is.
>>
>> I think so, the bug can be easily triggered by repeatedly booting up a VM that passes through a PCI device with large BARs while continuously reading the numa_maps of the main VM process. The reproducer script is mainly to narrow down to the specific part where the race occurs, the VFIO DMA set ioctl.
>>
>> Should I raise a bug email to refer to, and resubmit a new RFC v2 (without the cover letter), or keep discussion in this thread for now?
>
> No, it's okay. Let's first discuss the proper fix.
>
>>
>>> But for now: can this only be reproduced with PUDs (which you mention in
>>> the cover letter) or also with PMDs?
>>>
>>> For the PMD case I would assume that pte_offset_map_lock() performs
>>> proper checks. And for the PUD case we are missing a re-check under the PTL.
>>
>> Have only seen it with PUDs, will try forcing the mapping to happen with PMDs tomorrow.
>
> Can you try the following:
>
>
> From b3f0a85b9f071e338097147f997f20d1ac796155 Mon Sep 17 00:00:00 2001
> From: "David Hildenbrand (Arm)" <david@kernel.org>
> Date: Tue, 10 Mar 2026 10:09:39 +0100
> Subject: [PATCH] tmp
>
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
> ---
> mm/pagewalk.c | 22 ++++++++++++++++++----
> 1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index cb358558807c..779f6fa00ab7 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -96,6 +96,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> struct mm_walk *walk)
> {
> + pud_t pudval = pudp_get(pud);
> pmd_t *pmd;
> unsigned long next;
> const struct mm_walk_ops *ops = walk->ops;
> @@ -104,6 +105,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> int err = 0;
> int depth = real_depth(3);
>
> + /*
> + * For PTE handling, pte_offset_map_lock() takes care of checking
> + * whether there actually is a page table. But it also has to be
> + * very careful about concurrent page table reclaim. If we spot a PMD
> + * table, it cannot go away, so we can just walk it. However, if we find
> + * something else, we have to retry.
> + */
> + if (!pud_present(pudval) || pud_leaf(pudval)) {
> + walk->action = ACTION_AGAIN;
> + return 0;
> + }
> +
> pmd = pmd_offset(pud, addr);
> do {
> again:
> @@ -176,7 +189,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>
> pud = pud_offset(p4d, addr);
> do {
> - again:
> +again:
> next = pud_addr_end(addr, end);
> if (pud_none(*pud)) {
> if (has_install)
> @@ -217,12 +230,13 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> else if (pud_leaf(*pud) || !pud_present(*pud))
> continue; /* Nothing to do. */
>
> - if (pud_none(*pud))
> - goto again;
> -
> err = walk_pmd_range(pud, addr, next, walk);
> if (err)
> break;
> +
> + if (walk->action == ACTION_AGAIN)
> + goto again;
> +
> } while (pud++, addr = next, addr != end);
>
> return err;
> --
> 2.43.0
That works, awesome!
Interestingly enough, the VFIO ioctl now also returns “[Errno 22] Invalid argument” where
I would previously see the process reading numa_maps crash.
[dma_map]
dma_map iova=0x000000000000 size=0x000004000000 vaddr=0x00007f7800000000
dma_map FAILED iova=0x020000000000: [Errno 22] Invalid argument
dma_map iova=0x040000000000 size=0x000002000000 vaddr=0x00007f5780000000
For my own understanding, why is this patch preferred over:
- if (pud_none(*pud))
+ if (pud_none(*pud) || pud_leaf(*pud))
in the walk_pud_range function?
I do think moving the check to walk_pmd_range makes the code’s intent clearer, and I
personally prefer the code there, but I don’t see how this check removes the possibility
of a race after the (!pud_present(pudval) || pud_leaf(pudval)) check, as to me it looks
like the PMD entry could still disappear between the split and this check?
Regardless, this patch resolves the bug and looks good to me. What’s the course of
action, given that we probably want to backport this to earlier kernels as well? Shall I
send a new PATCH without a cover letter and take it from there?
>
>
> --
> Cheers,
>
> David
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC 1/1] mm/pagewalk: don't split device-backed huge pfnmaps
2026-03-10 11:38 ` Boone, Max
@ 2026-03-10 15:19 ` David Hildenbrand (Arm)
2026-03-11 9:42 ` Boone, Max
0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-10 15:19 UTC (permalink / raw)
To: Boone, Max
Cc: Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alex Williamson,
linux-mm@kvack.org, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Tottenham, Max, Hunt, Joshua,
Pelland, Matt
>> Because the very same problem can likely be triggered by having the
>> splitting/unmapping be triggered from another thread in some other
>> code path concurrently.
>
> I was previously testing on 6.12 and didn’t see any changes to vfio-pci or
> pagewalk.c which prompted me to check whether I could reproduce the
> bug in a more recent kernel.
>
> However, when I tried to reproduce the bug on 7.0-rc2 (after adding some
> tracing to get a clearer picture of the sequence of events) it doesn’t happen.
> The VFIO DMA set operation is much faster on 7.0, so possibly the race
> window is too small for it to occur in reasonable time.
Interesting. You could try adding a delay to a test kernel to see if you
can still provoke it.
There is the slight possibility that something else fixed the race for
your reproducer by "accident".
[...]
>>>
>>>
>>> Hehe, first timer, still figuring out the process.
>>
>> :)
>>
>>>
>>>
>>> I think so, the bug can be easily triggered by repeatedly booting up a VM that passes through a PCI device with large BARs while continuously reading the numa_maps of the main VM process. The reproducer script is mainly to narrow down to the specific part where the race occurs, the VFIO DMA set ioctl.
>>>
>>> Should I raise a bug email to refer to, and resubmit a new RFC v2 (without the cover letter), or keep discussion in this thread for now?
>>
>> No, it's okay. Let's first discuss the proper fix.
>>
>>>
>>>
>>> Have only seen it with PUDs, will try forcing the mapping to happen with PMDs tomorrow.
>>
>> Can you try the following:
>>
>>
>> From b3f0a85b9f071e338097147f997f20d1ac796155 Mon Sep 17 00:00:00 2001
>> From: "David Hildenbrand (Arm)" <david@kernel.org>
>> Date: Tue, 10 Mar 2026 10:09:39 +0100
>> Subject: [PATCH] tmp
>>
>> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
>> ---
>> mm/pagewalk.c | 22 ++++++++++++++++++----
>> 1 file changed, 18 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>> index cb358558807c..779f6fa00ab7 100644
>> --- a/mm/pagewalk.c
>> +++ b/mm/pagewalk.c
>> @@ -96,6 +96,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>> static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>> struct mm_walk *walk)
>> {
>> + pud_t pudval = pudp_get(pud);
>> pmd_t *pmd;
>> unsigned long next;
>> const struct mm_walk_ops *ops = walk->ops;
>> @@ -104,6 +105,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>> int err = 0;
>> int depth = real_depth(3);
>>
>> + /*
>> + * For PTE handling, pte_offset_map_lock() takes care of checking
>> + * whether there actually is a page table. But it also has to be
>> + * very careful about concurrent page table reclaim. If we spot a PMD
>> + * table, it cannot go away, so we can just walk it. However, if we find
>> + * something else, we have to retry.
>> + */
>> + if (!pud_present(pudval) || pud_leaf(pudval)) {
>> + walk->action = ACTION_AGAIN;
>> + return 0;
>> + }
>> +
>> pmd = pmd_offset(pud, addr);
>> do {
>> again:
>> @@ -176,7 +189,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>>
>> pud = pud_offset(p4d, addr);
>> do {
>> - again:
>> +again:
>> next = pud_addr_end(addr, end);
>> if (pud_none(*pud)) {
>> if (has_install)
>> @@ -217,12 +230,13 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>> else if (pud_leaf(*pud) || !pud_present(*pud))
>> continue; /* Nothing to do. */
>>
>> - if (pud_none(*pud))
>> - goto again;
>> -
>> err = walk_pmd_range(pud, addr, next, walk);
>> if (err)
>> break;
>> +
>> + if (walk->action == ACTION_AGAIN)
>> + goto again;
>> +
>> } while (pud++, addr = next, addr != end);
>>
>> return err;
>> --
>> 2.43.0
>
> That works, awesome!
>
> Interestingly enough, the VFIO ioctl now also returns “[Errno 22] Invalid argument” where
> I would previously see the process reading numa_maps crash.
>
> [dma_map]
> dma_map iova=0x000000000000 size=0x000004000000 vaddr=0x00007f7800000000
> dma_map FAILED iova=0x020000000000: [Errno 22] Invalid argument
> dma_map iova=0x040000000000 size=0x000002000000 vaddr=0x00007f5780000000
Just to double-check: is that expected?
I wonder why "-EINVAL" would be returned here. Do you know?
>
> For my own understanding, why is this patch preferred over:
> - if (pud_none(*pud))
> + if (pud_none(*pud) || pud_leaf(*pud))
> in the walk_pud_range function?
It might currently work for PUDs, but as soon as we have non-present PUD
entries (like migration entries) the code could become shaky: pud_leaf()
is only guaranteed to yield the right result if pud_present() is true.
So I decided to instead make walk_pud_range() look more similar to
walk_pmd_range(), which is quite helpful for spotting actual differences
in the logic.
>
> I do think moving the check to walk_pmd_range makes the code’s intent clearer, and I
> personally prefer the code there, but I don’t see how this check removes the possibility
> of a race after the (!pud_present(pudval) || pud_leaf(pudval)) check, as to me it looks
> like the PMD entry could still disappear between the split and this check?
I distilled that in the comment: PMD page tables cannot/are not
reclaimed. So once you see a PMD page table, it's not going anywhere
while you hold relevant locks (mmap_lock or VMA lock).
Only PMD leaf entries can get zapped any time and PMD none entries can
get populated any time. But not PMD page tables.
>
> Regardless, this patch resolves the bug and looks good to me. What’s the course of
> action, given that we probably want to backport this to earlier kernels as well? Shall I
> send a new PATCH without a cover letter and take it from there?
Right, I think you should:
(1) rework the patch description to incorporate the essential stuff from
the cover letter
(2) Identify and add Fixes: tag and Cc: stable
(3) Document that we are reworking the code to mimic what we do in
walk_pmd_range(), to have less inconsistency on the core logic
(4) Document why you think the reproducer fails on newer kernels. (or
best try to get it reproduced by adding some delays in the code)
(5) Clarify that only PUD handling is prone to the race and that PMDs
are fine (and point out why)
(6) Use a patch subject like "mm/pagewalk: fix race between unmapping
and refaulting in walk_pud_range()"
Once you resend, best to add
Co-developed-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Above your SOB.
To get something like:
Co-developed-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Max Boone <mboone@akamai.com>
Note that the existing
Signed-off-by: Max Tottenham <mtottenh@akamai.com>
Is weird, as Max Tottenham did not send out this patch. If he was
involved in the development, you should either make him
Suggested-by:
Or
Debugged-by:
Or
Co-developed-by: + Signed-off-by:
See Documentation/process/submitting-patches.rst
Let me know if you have any questions :)
--
Cheers,
David
* Re: [RFC 1/1] mm/pagewalk: don't split device-backed huge pfnmaps
2026-03-10 15:19 ` David Hildenbrand (Arm)
@ 2026-03-11 9:42 ` Boone, Max
2026-03-11 9:59 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 14+ messages in thread
From: Boone, Max @ 2026-03-11 9:42 UTC (permalink / raw)
To: David Hildenbrand (Arm), Alex Williamson
Cc: Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Alex Williamson,
linux-mm@kvack.org, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Tottenham, Max, Hunt, Joshua,
Pelland, Matt
> On Mar 10, 2026, at 4:19 PM, David Hildenbrand (Arm) <david@kernel.org> wrote:
>
>>
>> I was previously testing on 6.12 and didn’t see any changes to vfio-pci or
>> pagewalk.c which prompted me to check whether I could reproduce the
>> bug in a more recent kernel.
>>
>> However, when I tried to reproduce the bug on 7.0-rc2 (after adding some
>> tracing to get a clearer picture of the sequence of events) it doesn’t happen.
>> The VFIO DMA set operation is much faster on 7.0, so possibly the race
>> window is too small for it to occur in reasonable time.
>
> Interesting. You could try adding a delay to a test kernel to see if you
> can still provoke it.
>
> There is the slight possibility that something else fixed the race for
> your reproducer by "accident".
>
> [...]
It could be that the race window was made a lot smaller by [1]; I already see the occurrence
of the race drop significantly when I just add some extra trace logs in the VFIO DMA set
functions.
[…]
>
> Just to double-check: is that expected?
>
> I wonder why "-EINVAL" would be returned here. Do you know?
>
I don’t think it’s expected, but at least it’s an error that can be caught in userspace.
I’ll do a bit more digging into why and where it originates, but I think that will end up
as a patch to vfio.
The -EINVAL originates from:
vfio_dma_do_map -> vfio_pin_map_dma -> vfio_pin_pages_remote
-> vaddr_get_pfns -> pin_user_pages_remote (mm/gup.c)
Possibly that’s also the origin of the concurrent PUD modification that requires
the retry in the walker in this patch.
>>
>> For my own understanding, why is this patch preferred over:
>> - if (pud_none(*pud))
>> + if (pud_none(*pud) || pud_leaf(*pud))
>> in the walk_pud_range function?
>
> It might currently work for PUDs, but as soon as we have non-present PUD
> entries (like migration entries) the code could become shaky: pud_leaf()
> is only guaranteed to yield the right result if pud_present() is true.
>
> So I decided to instead make walk_pud_range() look more similar to
> walk_pmd_range(), which is quite helpful for spotting actual differences
> in the logic.
>
>>
>> I do think moving the check to walk_pmd_range is a more clear on the code’s intent and
>> personally prefer the code there, but I don’t see why this check is removing the possibility
>> of a race after the (!pud_present(pudval) || pud_leaf(pudval)) check, as to me it looks
>> like the PMD entry was possible to disappear between the splitting and this check?
>
> I distilled that in the comment: PMD page tables cannot/are not
> reclaimed. So once you see a PMD page table, it's not going anywhere
> while you hold relevant locks (mmap_lock or VMA lock).
>
> Only PMD leaf entries can get zapped any time and PMD none entries can
> get populated any time. But not PMD page tables.
Gotcha, thanks!
>
>>
>> Anyways, regardless, this patch resolves the bug and looks good to me - what’s the
>> course of action as we probably want to backport this to earlier kernels as well. Shall
>> I send in a new PATCH without cover letter and take it from there?
>
> Right, I think you should:
>
> (1) rework the patch description to incorporate the essential stuff from
> the cover letter
> (2) Identify and add Fixes: tag and Cc: stable
> (3) Document that we are reworking the code to mimic what we do in
> walk_pmd_range(), to have less inconsistency on the core logic
> (4) Document why you think the reproducer fails on newer kernels. (or
> best try to get it reproduced by adding some delays in the code)
> (5) Clarify that only PUD handling is prone to the race and that PMDs
> are fine (and point out why)
> (6) Use a patch subject like "mm/pagewalk: fix race between unmapping
> and refaulting in walk_pud_range()"
>
> Once you resend, best to add
>
> Co-developed-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
>
> Above your SOB.
>
> To get something like:
>
> Co-developed-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: Max Boone <mboone@akamai.com>
>
> Note that the existing
>
> Signed-off-by: Max Tottenham <mtottenh@akamai.com>
>
> Is weird, as Max Tottenham did not send out this patch. If he was
> involved in the development, you should either make him
>
> Suggested-by:
>
> Or
> Debugged-by:
>
> Or
> Co-developed-by: + Signed-off-by:
>
> See Documentation/process/submitting-patches.rst
>
>
> Let me know if you have any questions :)
>
Will do, thanks a lot!
> --
> Cheers,
>
> David
[1] https://lore.kernel.org/all/20250814064714.56485-1-lizhe.67@bytedance.com/
* Re: [RFC 1/1] mm/pagewalk: don't split device-backed huge pfnmaps
2026-03-11 9:42 ` Boone, Max
@ 2026-03-11 9:59 ` David Hildenbrand (Arm)
2026-03-11 10:34 ` Boone, Max
0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-11 9:59 UTC (permalink / raw)
To: Boone, Max, Alex Williamson
Cc: Andrew Morton, Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm@kvack.org, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Tottenham, Max, Hunt, Joshua,
Pelland, Matt
> The -EINVAL originates from:
>
> vfio_dma_do_map -> vfio_pin_map_dma -> vfio_pin_pages_remote
> -> vaddr_get_pfns -> pin_user_pages_remote (mm/gup.c)
>
> Possibly that’s also the origin of the concurrent PUD modification that requires
> the retry in the walker in this patch.
We'd have to find out why we manage to trigger a -EINVAL here. I don't
see how anything that this patch does could trigger that. So maybe a
problem in user space? (calling it on unsupported VMAs?).
--
Cheers,
David
* Re: [RFC 1/1] mm/pagewalk: don't split device-backed huge pfnmaps
2026-03-11 9:59 ` David Hildenbrand (Arm)
@ 2026-03-11 10:34 ` Boone, Max
2026-03-11 10:45 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 14+ messages in thread
From: Boone, Max @ 2026-03-11 10:34 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Alex Williamson, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm@kvack.org, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Tottenham, Max, Hunt, Joshua,
Pelland, Matt
> On Mar 11, 2026, at 10:59 AM, David Hildenbrand (Arm) <david@kernel.org> wrote:
>
>
>> The -EINVAL originates from:
>>
>> vfio_dma_do_map -> vfio_pin_map_dma -> vfio_pin_pages_remote
>> -> vaddr_get_pfns -> pin_user_pages_remote (mm/gup.c)
>>
>> Possibly that’s also the origin of the concurrent PUD modification that requires
>> the retry in the walker in this patch.
>
> We'd have to find out why we manage to trigger a -EINVAL here. I don't
> see how anything that this patch does could trigger that. So maybe a
> problem in user space? (calling it on unsupported VMAs?).
>
It looks like I was mistaken about the -EINVAL coming from pin_user_pages_remote;
rather, it originates from:
vfio_dma_do_map
-> vfio_pin_map_dma
-> vfio_pin_pages_remote
-> vaddr_get_pfns
-> follow_fault_pfn
-> follow_pfnmap_start (mm/memory.c)
In vfio_iommu_type1.c, follow_fault_pfn first checks whether follow_pfnmap_start
returns an error; if it does, it calls fixup_user_fault to fault the mapping in and then
retries follow_pfnmap_start to obtain the PFN.
It sounds to me like the walker is likely re-splitting the PUD entry between
the fixup_user_fault and follow_pfnmap_start calls?
* Re: [RFC 1/1] mm/pagewalk: don't split device-backed huge pfnmaps
2026-03-11 10:34 ` Boone, Max
@ 2026-03-11 10:45 ` David Hildenbrand (Arm)
2026-03-11 11:14 ` Boone, Max
0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-11 10:45 UTC (permalink / raw)
To: Boone, Max
Cc: Alex Williamson, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm@kvack.org, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Tottenham, Max, Hunt, Joshua,
Pelland, Matt
On 3/11/26 11:34, Boone, Max wrote:
>
>
>> On Mar 11, 2026, at 10:59 AM, David Hildenbrand (Arm) <david@kernel.org> wrote:
>>
>>
>>> The -EINVAL originates from:
>>>
>>> vfio_dma_do_map -> vfio_pin_map_dma -> vfio_pin_pages_remote
>>> -> vaddr_get_pfns -> pin_user_pages_remote (mm/gup.c)
>>>
>>> Possibly that’s also the origin of the concurrent PUD modification that requires
>>> the retry in the walker in this patch.
>>
>> We'd have to find out why we manage to trigger a -EINVAL here. I don't
>> see how anything that this patch does could trigger that. So maybe a
>> problem in user space? (calling it on unsupported VMAs?).
>>
>
> It looks like I was mistaken about the -EINVAL coming from pin_user_pages_remote;
> rather, it originates from:
>
> vfio_dma_do_map
> -> vfio_pin_map_dma
> -> vfio_pin_pages_remote
> -> vaddr_get_pfns
> -> follow_fault_pfn
> -> follow_pfnmap_start (mm/memory.c)
>
> In vfio_iommu_type1.c, follow_fault_pfn first checks whether follow_pfnmap_start
> returns an error; if it does, it calls fixup_user_fault to fault the mapping in and then
> retries follow_pfnmap_start to obtain the PFN.
>
> It sounds to me like the walker is likely re-splitting the PUD entry between
> the fixup_user_fault and follow_pfnmap_start calls?
Could be. IIRC, there are also scenarios where handle_mm_fault() just
returns success even though nothing was faulted in (e.g., when it
detects some races).
The code in follow_fault_pfn() should likely be updated to handle more
than one attempt. That's also what GUP does.
Likely, follow_fault_pfn() was never taught about PFNMAP mappings that
can be faulted+zapped (in the past they were always static).
If you turn that into a (possibly) endless loop, does the problem go away?
--
Cheers,
David
* Re: [RFC 1/1] mm/pagewalk: don't split device-backed huge pfnmaps
2026-03-11 10:45 ` David Hildenbrand (Arm)
@ 2026-03-11 11:14 ` Boone, Max
2026-03-11 11:59 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 14+ messages in thread
From: Boone, Max @ 2026-03-11 11:14 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Alex Williamson, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm@kvack.org, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Tottenham, Max, Hunt, Joshua,
Pelland, Matt
> On Mar 11, 2026, at 11:45 AM, David Hildenbrand (Arm) <david@kernel.org> wrote:
>
> The code in follow_fault_pfn() should likely be updated to handle more
> than one attempt. That's also what GUP does.
>
> Likely, follow_fault_pfn() was never taught about PFNMAP mappings that
> can be faulted+zapped (in the past they were always static).
>
> If you turn that into a (possibly) endless loop, does the problem go away?
Yep, was just trying that - with this change the problem goes away:
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -532,7 +532,7 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
ret = follow_pfnmap_start(&args);
if (ret)
- return ret;
+ return -EAGAIN;
}
if (write_fault && !args.writable)
—
I’ll propose that to the VFIO folks when I get the patch for mm/pagewalk.c ready and will refer
to that patch. Or would it be better to propose two commits under the same cover letter?
I can have a look at follow_fault_pfn, but this is my first time diving into Linux mm,
so that will probably take a while (and some reading up on my end).
>
> --
> Cheers,
>
> David
* Re: [RFC 1/1] mm/pagewalk: don't split device-backed huge pfnmaps
2026-03-11 11:14 ` Boone, Max
@ 2026-03-11 11:59 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 14+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-11 11:59 UTC (permalink / raw)
To: Boone, Max
Cc: Alex Williamson, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
linux-mm@kvack.org, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Tottenham, Max, Hunt, Joshua,
Pelland, Matt
On 3/11/26 12:14, Boone, Max wrote:
>
>
>> On Mar 11, 2026, at 11:45 AM, David Hildenbrand (Arm) <david@kernel.org> wrote:
>>
>> The code in follow_fault_pfn() should likely be updated to handle more
>> than one attempt. That's also what GUP does.
>>
>> Likely, follow_fault_pfn() was never taught about PFNMAP mappings that
>> can be faulted+zapped (in the past they were always static).
>>
>> If you turn that into a (possibly) endless loop, does the problem go away?
>
> Yep, was just trying that - with this change the problem goes away:
>
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -532,7 +532,7 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
>
> ret = follow_pfnmap_start(&args);
> if (ret)
> - return ret;
> + return -EAGAIN;
> }
Looks like follow_pfnmap_start() only ever returns -EINVAL: for invalid
parameters (unexpected) and non-present entries.
Returning -ENOENT when there is no actual entry would be clearer. Then
you could just translate -ENOENT to -EAGAIN.
But just always using -EAGAIN should be fine here I guess.
>
> if (write_fault && !args.writable)
>
> —
>
> I’ll propose that to the VFIO folks when I get the patch for mm/pagewalk.c ready and will refer
> to that patch. Or would it be better to propose two commits under the same cover letter?
It's probably best to send two separate patches: one MM fix and one vfio
fix. :)
>
> I can have a look at follow_fault_pfn, but this is my first time diving into Linux mm,
> so that will probably take a while (and some reading up on my end).
No worries, just let me know if you have any questions.
--
Cheers,
David
end of thread, other threads:[~2026-03-11 11:59 UTC | newest]
Thread overview: 14+ messages
2026-03-09 17:49 [RFC 0/1] Avoid pagewalk hugepage-split race with VFIO DMA set Max Boone
2026-03-09 17:49 ` [RFC 1/1] mm/pagewalk: don't split device-backed huge pfnmaps Max Boone
2026-03-09 20:19 ` David Hildenbrand (Arm)
2026-03-09 22:47 ` Boone, Max
2026-03-09 23:02 ` Boone, Max
2026-03-10 9:11 ` David Hildenbrand (Arm)
2026-03-10 11:38 ` Boone, Max
2026-03-10 15:19 ` David Hildenbrand (Arm)
2026-03-11 9:42 ` Boone, Max
2026-03-11 9:59 ` David Hildenbrand (Arm)
2026-03-11 10:34 ` Boone, Max
2026-03-11 10:45 ` David Hildenbrand (Arm)
2026-03-11 11:14 ` Boone, Max
2026-03-11 11:59 ` David Hildenbrand (Arm)