Re: [RFC 1/1] mm/pagewalk: don't split device-backed huge pfnmaps

public inbox for linux-mm@kvack.org

From: "David Hildenbrand (Arm)" <david@kernel.org>
To: "Boone, Max" <mboone@akamai.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>, Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	Alex Williamson <alex@shazbot.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Tottenham, Max" <mtottenh@akamai.com>,
	"Hunt, Joshua" <johunt@akamai.com>,
	"Pelland, Matt" <mpelland@akamai.com>
Subject: Re: [RFC 1/1] mm/pagewalk: don't split device-backed huge pfnmaps
Date: Tue, 10 Mar 2026 16:19:47 +0100
Message-ID: <0a652e7e-339e-4f98-b591-7fe5680e2006@kernel.org>
In-Reply-To: <83842620-AD01-4619-845F-8DE7DF1F8F31@akamai.com>

>> Because the very same problem can likely be triggered by having the
>> splitting/unmapping be triggered from another thread in some other
>> code path concurrently.
> 
> I was previously testing on 6.12 and didn’t see any changes to vfio-pci or
> pagewalk.c which prompted me to check whether I could reproduce the
> bug in a more recent kernel.  
> 
> However, when I tried to reproduce the bug on 7.0-rc2 (after adding some
> tracing to get a clearer picture of the sequence of events) it doesn’t happen.
> The VFIO DMA set operation is much faster on 7.0, so possibly the race 
> window is too small for it to occur in reasonable time.

Interesting. You could try adding a delay to a test kernel to see if you
can still provoke it.

There is the slight possibility that something else fixed the race for
your reproducer by "accident".

[...]

>>>
>>>
>>> Hehe, first timer, still figuring out the process.
>>
>> :)
>>
>>>
>>>
>>> I think so, the bug can be easily triggered by repeatedly booting up a VM that passes through a PCI device with large BARs while continuously reading the numa_maps of the main VM process. The reproducer script is mainly to narrow down to the specific part where the race occurs, the VFIO DMA set ioctl.
>>>
>>> Should I raise a bug email to refer to, and resubmit a new RFC v2 (without the cover letter), or keep discussion in this thread for now?
>>
>> No, it's okay. Let's first discuss the proper fix.
>>
>>>
>>>
>>> Have only seen it with PUDs, will try forcing the mapping to happen with PMDs tomorrow.
>>
>> Can you try the following:
>>
>>
>> From b3f0a85b9f071e338097147f997f20d1ac796155 Mon Sep 17 00:00:00 2001
>> From: "David Hildenbrand (Arm)" <david@kernel.org>
>> Date: Tue, 10 Mar 2026 10:09:39 +0100
>> Subject: [PATCH] tmp
>>
>> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
>> ---
>> mm/pagewalk.c | 22 ++++++++++++++++++----
>> 1 file changed, 18 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>> index cb358558807c..779f6fa00ab7 100644
>> --- a/mm/pagewalk.c
>> +++ b/mm/pagewalk.c
>> @@ -96,6 +96,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>>  static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>>  			  struct mm_walk *walk)
>>  {
>> +	pud_t pudval = pudp_get(pud);
>>  	pmd_t *pmd;
>>  	unsigned long next;
>>  	const struct mm_walk_ops *ops = walk->ops;
>> @@ -104,6 +105,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>>  	int err = 0;
>>  	int depth = real_depth(3);
>>
>> +	/*
>> +	 * For PTE handling, pte_offset_map_lock() takes care of checking
>> +	 * whether there actually is a page table. But it also has to be
>> +	 * very careful about concurrent page table reclaim. If we spot a PMD
>> +	 * table, it cannot go away, so we can just walk it. However, if we find
>> +	 * something else, we have to retry.
>> +	 */
>> +	if (!pud_present(pudval) || pud_leaf(pudval)) {
>> +		walk->action = ACTION_AGAIN;
>> +		return 0;
>> +	}
>> +
>>  	pmd = pmd_offset(pud, addr);
>>  	do {
>>  again:
>> @@ -176,7 +189,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>>
>>  	pud = pud_offset(p4d, addr);
>>  	do {
>> - again:
>> +again:
>>  		next = pud_addr_end(addr, end);
>>  		if (pud_none(*pud)) {
>>  			if (has_install)
>> @@ -217,12 +230,13 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>>  		else if (pud_leaf(*pud) || !pud_present(*pud))
>>  			continue; /* Nothing to do. */
>>
>> -		if (pud_none(*pud))
>> -			goto again;
>> -
>>  		err = walk_pmd_range(pud, addr, next, walk);
>>  		if (err)
>>  			break;
>> +
>> +		if (walk->action == ACTION_AGAIN)
>> +			goto again;
>> +
>>  	} while (pud++, addr = next, addr != end);
>>
>>  	return err;
>> -- 
>> 2.43.0
> 
> That works, awesome!
> 
> Interestingly enough, the VFIO ioctl now also returns “[Errno 22] Invalid argument” where
> I would previously see the process reading numa_maps crash.
> 
> [dma_map]
> dma_map iova=0x000000000000 size=0x000004000000 vaddr=0x00007f7800000000
> dma_map FAILED iova=0x020000000000: [Errno 22] Invalid argument
> dma_map iova=0x040000000000 size=0x000002000000 vaddr=0x00007f5780000000 

Just to double-check: is that expected?

I wonder why "-EINVAL" would be returned here. Do you know?

> 
> For my own understanding, why is this patch preferred over:
> - if (pud_none(*pud))
> + if (pud_none(*pud) || pud_leaf(*pud))
> in the walk_pud_range function?

It might currently work for PUDs, but as soon as we have non-present PUD
entries (like migration entries) the code could become shaky: pud_leaf()
is only guaranteed to yield the right result if pud_present() is true.

So I decided to instead make walk_pud_range() look more similar to
walk_pmd_range(), which is quite helpful for spotting actual differences
in the logic.
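
To illustrate the point about pud_leaf() on non-present entries, here is
a small userspace sketch in C. It is not kernel code: the toy_* names and
the bit layout (bit 0 = present, bit 7 = leaf, non-present payload shifted
left by one) are assumptions made up for this example and do not match any
real architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy 64-bit entry; the layout is hypothetical, not a real encoding. */
typedef uint64_t toy_pud_t;

#define TOY_PRESENT	(UINT64_C(1) << 0)	/* entry maps something */
#define TOY_LEAF	(UINT64_C(1) << 7)	/* present entry is a huge mapping */
#define TOY_SWP_SHIFT	1			/* non-present payload starts at bit 1 */

static bool toy_pud_none(toy_pud_t pud)    { return pud == 0; }
static bool toy_pud_present(toy_pud_t pud) { return (pud & TOY_PRESENT) != 0; }

/* Raw bit test: only meaningful when the entry is present. */
static bool toy_pud_leaf(toy_pud_t pud)    { return (pud & TOY_LEAF) != 0; }

/* Hypothetical non-present (migration-style) entry carrying a payload. */
static toy_pud_t toy_mk_nonpresent(uint64_t payload)
{
	assert((payload >> 63) == 0);		/* payload must survive the shift */
	return payload << TOY_SWP_SHIFT;	/* bit 0 stays clear */
}

/* Alternative check from the question: trusts the leaf bit unconditionally. */
static bool toy_alt_retry(toy_pud_t pud)
{
	return toy_pud_none(pud) || toy_pud_leaf(pud);
}

/* Check as in the patch: never interprets the leaf bit of a non-present entry. */
static bool toy_must_retry(toy_pud_t pud)
{
	return !toy_pud_present(pud) || toy_pud_leaf(pud);
}
```

With this layout, toy_mk_nonpresent(0x40) happens to have the leaf bit set
while toy_mk_nonpresent(0x1) does not, so toy_alt_retry() gives a
payload-dependent answer for non-present entries, whereas toy_must_retry()
is true for both.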

> 
> I do think moving the check to walk_pmd_range is clearer about the code’s intent, and I
> personally prefer the code there. But I don’t see why this check removes the possibility
> of a race after the (!pud_present(pudval) || pud_leaf(pudval)) check: to me it looks
> like the PMD entry could still disappear between the splitting and this check?

I distilled that in the comment: PMD page tables cannot/are not
reclaimed. So once you see a PMD page table, it's not going anywhere
while you hold relevant locks (mmap_lock or VMA lock).

Only PMD leaf entries can get zapped any time and PMD none entries can
get populated any time. But not PMD page tables.
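
The retry flow can be modelled in userspace C. This is only a sketch under
stated assumptions: all toy_* and KIND_* names are hypothetical, a PUD entry
is reduced to three states, "splitting" a huge pfnmap leaf is assumed to
simply zap it, and the concurrent refault is simulated by a hook with a
fixed budget instead of a real second thread.

```c
#include <assert.h>

enum toy_kind { KIND_NONE, KIND_LEAF, KIND_TABLE };
enum toy_action { TOY_SUBTREE, TOY_AGAIN };

struct toy_pud { enum toy_kind kind; };

struct toy_walk;
typedef void (*toy_race_fn)(struct toy_pud *pud, struct toy_walk *walk);

struct toy_walk {
	enum toy_action action;
	int retries;		/* how often the PUD entry was re-examined */
	int tables_walked;	/* how often a PMD table was actually walked */
	int race_budget;	/* how many simulated refaults may still happen */
	toy_race_fn race;	/* stand-in for a concurrent thread, may be unset */
};

/* Simulated concurrent refault: re-installs a huge leaf while budget lasts. */
static void toy_refault(struct toy_pud *pud, struct toy_walk *walk)
{
	if (walk->race_budget > 0) {
		walk->race_budget--;
		pud->kind = KIND_LEAF;
	}
}

/* Assumption: splitting a huge pfnmap leaf just zaps the entry. */
static void toy_split_huge_pud(struct toy_pud *pud)
{
	if (pud->kind == KIND_LEAF)
		pud->kind = KIND_NONE;
}

static int toy_walk_pmd_range(struct toy_pud *pud, struct toy_walk *walk)
{
	enum toy_kind snap = pud->kind;	/* read once, then decide on the copy */

	if (snap != KIND_TABLE) {
		walk->action = TOY_AGAIN;	/* not a table: let the caller retry */
		return 0;
	}
	walk->tables_walked++;		/* a table cannot go away under us */
	return 0;
}

static int toy_walk_pud_entry(struct toy_pud *pud, struct toy_walk *walk)
{
	int err;

	assert(pud && walk);
again:
	walk->action = TOY_SUBTREE;
	if (pud->kind == KIND_NONE)
		return 0;		/* hole: nothing to walk */

	toy_split_huge_pud(pud);
	if (walk->race)
		walk->race(pud, walk);	/* window where a refault may sneak in */

	err = toy_walk_pmd_range(pud, walk);
	if (err)
		return err;
	if (walk->action == TOY_AGAIN) {
		walk->retries++;
		goto again;
	}
	return 0;
}
```

Starting from a leaf with a refault budget of 1, toy_walk_pud_entry()
retries twice and ends on an empty entry without ever treating the leaf as
a PMD table; starting from a table, it walks it once with no retries.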

> 
> Anyway, this patch resolves the bug and looks good to me. What’s the course of
> action, given that we probably want to backport this to earlier kernels as well? Shall
> I send in a new PATCH without cover letter and take it from there?

Right, I think you should:

(1) rework the patch description to incorporate the essential stuff from
    the cover letter
(2) Identify and add Fixes: tag and Cc: stable
(3) Document that we are reworking the code to mimic what we do in
    walk_pmd_range(), to have less inconsistency on the core logic
(4) Document why you think the reproducer fails on newer kernels. (or
    best try to get it reproduced by adding some delays in the code)
(5) Clarify that only PUD handling is prone to the race and that PMDs
    are fine (and point out why)
(6) Use a patch subject like "mm/pagewalk: fix race between unmapping
    and refaulting in walk_pud_range()"

Once you resend, best to add

	Co-developed-by: David Hildenbrand (Arm) <david@kernel.org>
	Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>

Above your SOB.

To get something like:

	Co-developed-by: David Hildenbrand (Arm) <david@kernel.org>
	Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
	Signed-off-by: Max Boone <mboone@akamai.com>

Note that the existing

	Signed-off-by: Max Tottenham <mtottenh@akamai.com>

Is weird, as Max Tottenham did not send out this patch. If he was
involved in the development, you should either make him

	Suggested-by:

Or
	Debugged-by:

Or
	Co-developed-by: + Signed-off-by:

See Documentation/process/submitting-patches.rst


Let me know if you have any questions :)

-- 
Cheers,

David

