All of lore.kernel.org
 help / color / mirror / Atom feed
From: Harry Yoo <harry.yoo@oracle.com>
To: David Hildenbrand <david@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Zi Yan <ziy@nvidia.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Nico Pache <npache@redhat.com>,
	Ryan Roberts <ryan.roberts@arm.com>, Dev Jain <dev.jain@arm.com>,
	Barry Song <baohua@kernel.org>, Lance Yang <lance.yang@linux.dev>,
	Kiryl Shutsemau <kas@kernel.org>, Hugh Dickins <hughd@google.com>,
	Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
	Pedro Falcato <pfalcato@suse.de>, Jane Chu <jane.chu@oracle.com>,
	linux-mm@kvack.org, stable@vger.kernel.org
Subject: Re: [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels.
Date: Thu, 2 Oct 2025 23:07:47 +0900	[thread overview]
Message-ID: <aN6HMzXM4cL6Yf4A@hyeyoo> (raw)
In-Reply-To: <9b05b974-7478-4c99-9c4f-6593e0fd4f93@redhat.com>

On Wed, Sep 24, 2025 at 05:52:14PM +0200, David Hildenbrand wrote:
> On 24.09.25 13:54, Harry Yoo wrote:
> > On Tue, Sep 23, 2025 at 04:09:06PM +0200, David Hildenbrand wrote:
> > > On 23.09.25 13:46, Harry Yoo wrote:
> > > > On Tue, Sep 23, 2025 at 11:00:57AM +0200, David Hildenbrand wrote:
> > > > > On 22.09.25 01:27, Harry Yoo wrote:
> > In case is_swap_pmd() or pmd_trans_huge() returned true, but another
> > kernel thread splits THP after we checked it, __split_huge_pmd() or
> > change_huge_pmd() will just return without actually splitting or changing
> > pmd entry, if it turns out that evaluating
> > (is_swap_pmd() || pmd_trans_huge() || pmd_devmap()) as true
> > was false positive due to race condition, because they both double check
> > after acquiring pmd lock:
> > 
> > 1) __split_huge_pmd() checks if it's either pmd_trans_huge(), pmd_devmap()
> > or is_pmd_migration_entry() under pmd lock.
> > 
> > 2) change_huge_pmd() checks if it's either is_swap_pmd(),
> > pmd_trans_huge(), or pmd_devmap() under pmd lock.
> > 
> > And if either function simply returns because it was not a THP,
> > pmd migration entry, or pmd devmap, khugepaged cannot colleapse
> > huge page because we're holding mmap_lock in read mode.
> > 
> > And then we call change_pte_range() and that's safe.
> > 
> > > After that, I'm not sure ... maybe we'll just retry
> > 
> > Or as you mentioned, if we are misled into thinking it is not a THP,
> > PMD devmap, or swap PMD due to race condition, we'd end up going into
> > change_pte_range().
> > 
> > > or we'll accidentally try treating it as a PTE table.
> > 
> > But then pmd_trans_unstable() check should prevent us from treating
> > it as PTE table (and we're still holding mmap_lock here).
> > In such case we don't retry but skip it instead.
> > 
> > > Looks like
> > > pmd_trans_unstable()->pud_none_or_trans_huge_or_dev_or_clear_bad() would
> > 
> > I think you mean
> > pmd_trans_unstable()->pmd_none_or_trans_huge_or_clear_bad()?
> 
> Yes!
> 
> > 
> > > return "0"
> > > in case we hit migration entry? :/
> > 
> > pmd_none_or_trans_huge_or_clear_bad() open-coded is_swap_pmd(), as it
> > eventually checks !pmd_none() && !pmd_present() case.
 

Apologies for the late reply.

> Ah, right, I missed the pmd_present() while skimming over this extremely
> horrible function.
> 
> So pmd_trans_unstable()->pmd_none_or_trans_huge_or_clear_bad() would return
> "1" and make us retry.

We don't retry in pre-6.5 kernels because retrying is a new behavior
after commit 670ddd8cdcbd1.

> > > > It'd be more robust to do something like:
> > > 
> > > That's also what I had in mind. But all this lockless stuff makes me a bit
> > > nervous :)
> > 
> > Yeah the code is not very straightforward... :/
> > 
> > But technically the diff that I pasted here should be enough to fix
> > this... or do you have any alternative approach in mind?
> 
> Hopefully, I'm not convinced this code is not buggy, but at least regarding
> concurrent migration it should be fine with that.

I've been thinking about this...

Actually, it'll make more sense to open-code what pte_map_offset_lock()
does in the mainline:

1. do not remove the "bad pte" checks, because pte_offset_map() in pre-6.5
   kernels doesn't do the check for us unlike the mainline.
2. check is_swap_pmd(), pmd_trans_huge(), pmd_devmap() without ptl, but
   atomically.
3. after acquiring ptl in change_pte_range(), check if pmd has changed
   since step 1 and 2. if yes, retry (like mainline). if no, we're all good.

What do you think?

-- 
Cheers,
Harry / Hyeonggon


  reply	other threads:[~2025-10-02 14:08 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-21 23:27 [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels Harry Yoo
2025-09-23  9:00 ` David Hildenbrand
2025-09-23 11:46   ` Harry Yoo
2025-09-23 14:09     ` David Hildenbrand
2025-09-24 11:54       ` Harry Yoo
2025-09-24 15:52         ` David Hildenbrand
2025-10-02 14:07           ` Harry Yoo [this message]
2025-10-06  8:18             ` David Hildenbrand
2025-10-20 13:25               ` Harry Yoo
2025-10-20 14:07                 ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aN6HMzXM4cL6Yf4A@hyeyoo \
    --to=harry.yoo@oracle.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=david@redhat.com \
    --cc=dev.jain@arm.com \
    --cc=hughd@google.com \
    --cc=jane.chu@oracle.com \
    --cc=jannh@google.com \
    --cc=kas@kernel.org \
    --cc=lance.yang@linux.dev \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=npache@redhat.com \
    --cc=pfalcato@suse.de \
    --cc=ryan.roberts@arm.com \
    --cc=stable@vger.kernel.org \
    --cc=vbabka@suse.cz \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.