From: Harry Yoo <harry.yoo@oracle.com>
To: David Hildenbrand <david@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Zi Yan <ziy@nvidia.com>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
Nico Pache <npache@redhat.com>,
Ryan Roberts <ryan.roberts@arm.com>, Dev Jain <dev.jain@arm.com>,
Barry Song <baohua@kernel.org>, Lance Yang <lance.yang@linux.dev>,
Kiryl Shutsemau <kas@kernel.org>, Hugh Dickins <hughd@google.com>,
Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
Pedro Falcato <pfalcato@suse.de>, Jane Chu <jane.chu@oracle.com>,
linux-mm@kvack.org, stable@vger.kernel.org
Subject: Re: [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels.
Date: Wed, 24 Sep 2025 20:54:06 +0900 [thread overview]
Message-ID: <aNPb3qVCZTf2xMkN@hyeyoo> (raw)
In-Reply-To: <6e4f6a37-2449-4089-8b3d-234ba86878e2@redhat.com>
On Tue, Sep 23, 2025 at 04:09:06PM +0200, David Hildenbrand wrote:
> On 23.09.25 13:46, Harry Yoo wrote:
> > On Tue, Sep 23, 2025 at 11:00:57AM +0200, David Hildenbrand wrote:
> > > On 22.09.25 01:27, Harry Yoo wrote:
> > > This is all because we are trying to be smart and walking page tables
> > > without the page table lock held. This is just absolutely nasty.
> >
> > commit 175ad4f1e7a2 ("mm: mprotect: use pmd_trans_unstable instead of
> > taking the pmd_lock") did this :(
>
> Right. I can understand why we would not want to grab the lock when there is
> a leaf page table. But everything else is just asking for trouble (as we saw
> :) ).
> > > What about the following check:
> > >
> > > if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
> > >
> > > Couldn't we have a similar race there when we are concurrently migrating?
> >
> > An excellent point! I agree that there could be a similar race,
> > but with something other than "bad pmd" error.
>
> Right, instead we'd go into change_pte_range() where we check
> pmd_trans_unstable().
Uh, my brain hurts again... :)
In case is_swap_pmd() or pmd_trans_huge() returned true, but another
kernel thread splits THP after we checked it, __split_huge_pmd() or
change_huge_pmd() will just return without actually splitting or changing
pmd entry, if it turns out that evaluating
(is_swap_pmd() || pmd_trans_huge() || pmd_devmap()) as true
was false positive due to race condition, because they both double check
after acquiring pmd lock:
1) __split_huge_pmd() checks if it's either pmd_trans_huge(), pmd_devmap()
or is_pmd_migration_entry() under pmd lock.
2) change_huge_pmd() checks if it's either is_swap_pmd(),
pmd_trans_huge(), or pmd_devmap() under pmd lock.
And if either function simply returns because it was not a THP,
pmd migration entry, or pmd devmap, khugepaged cannot colleapse
huge page because we're holding mmap_lock in read mode.
And then we call change_pte_range() and that's safe.
> After that, I'm not sure ... maybe we'll just retry
Or as you mentioned, if we are misled into thinking it is not a THP,
PMD devmap, or swap PMD due to race condition, we'd end up going into
change_pte_range().
> or we'll accidentally try treating it as a PTE table.
But then pmd_trans_unstable() check should prevent us from treating
it as PTE table (and we're still holding mmap_lock here).
In such case we don't retry but skip it instead.
> Looks like
> pmd_trans_unstable()->pud_none_or_trans_huge_or_dev_or_clear_bad() would
I think you mean
pmd_trans_unstable()->pmd_none_or_trans_huge_or_clear_bad()?
> return "0"
> in case we hit migration entry? :/
pmd_none_or_trans_huge_or_clear_bad() open-coded is_swap_pmd(), as it
eventually checks !pmd_none() && !pmd_present() case.
> > It'd be more robust to do something like:
>
> That's also what I had in mind. But all this lockless stuff makes me a bit
> nervous :)
Yeah the code is not very straightforward... :/
But technically the diff that I pasted here should be enough to fix
this... or do you have any alternative approach in mind?
--
Cheers,
Harry / Hyeonggon
next prev parent reply other threads:[~2025-09-24 11:54 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-21 23:27 [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels Harry Yoo
2025-09-23 9:00 ` David Hildenbrand
2025-09-23 11:46 ` Harry Yoo
2025-09-23 14:09 ` David Hildenbrand
2025-09-24 11:54 ` Harry Yoo [this message]
2025-09-24 15:52 ` David Hildenbrand
2025-10-02 14:07 ` Harry Yoo
2025-10-06 8:18 ` David Hildenbrand
2025-10-20 13:25 ` Harry Yoo
2025-10-20 14:07 ` David Hildenbrand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aNPb3qVCZTf2xMkN@hyeyoo \
--to=harry.yoo@oracle.com \
--cc=Liam.Howlett@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=baohua@kernel.org \
--cc=baolin.wang@linux.alibaba.com \
--cc=david@redhat.com \
--cc=dev.jain@arm.com \
--cc=hughd@google.com \
--cc=jane.chu@oracle.com \
--cc=jannh@google.com \
--cc=kas@kernel.org \
--cc=lance.yang@linux.dev \
--cc=linux-mm@kvack.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=npache@redhat.com \
--cc=pfalcato@suse.de \
--cc=ryan.roberts@arm.com \
--cc=stable@vger.kernel.org \
--cc=vbabka@suse.cz \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.