From: Peter Xu <peterx@redhat.com>
To: Barry Song <21cnbao@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>,
Lokesh Gidra <lokeshgidra@google.com>,
linux-mm@kvack.org, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, zhengtangquan@oppo.com,
Barry Song <v-songbaohua@oppo.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Al Viro <viro@zeniv.linux.org.uk>,
Axel Rasmussen <axelrasmussen@google.com>,
Brian Geffon <bgeffon@google.com>,
Christian Brauner <brauner@kernel.org>,
David Hildenbrand <david@redhat.com>,
Hugh Dickins <hughd@google.com>, Jann Horn <jannh@google.com>,
Kalesh Singh <kaleshsingh@google.com>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
Matthew Wilcox <willy@infradead.org>,
Michal Hocko <mhocko@suse.com>, Mike Rapoport <rppt@kernel.org>,
Nicolas Geoffray <ngeoffray@google.com>,
Ryan Roberts <ryan.roberts@arm.com>,
Shuah Khan <shuah@kernel.org>,
ZhangPeng <zhangpeng362@huawei.com>, Yu Zhao <yuzhao@google.com>
Subject: Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
Date: Thu, 20 Feb 2025 17:59:37 -0500
Message-ID: <Z7ez2Vl8Sa_bRb4e@x1.local>
In-Reply-To: <CAGsJ_4wptMn8HX6Uam7AQpWeE=nOUDHE-Vr81SQJq_oSjmTFHg@mail.gmail.com>
On Thu, Feb 20, 2025 at 12:04:40PM +1300, Barry Song wrote:
> On Thu, Feb 20, 2025 at 11:15 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote:
> > > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > >
> > > > > userfaultfd_move() checks whether the PTE entry is present or a
> > > > > swap entry.
> > > > >
> > > > > - If the PTE entry is present, move_present_pte() handles folio
> > > > > migration by setting:
> > > > >
> > > > > src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > >
> > > > > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > > > > the PTE to the new dst_addr.
> > > > >
> > > > > This approach is incorrect because even if the PTE is a swap
> > > > > entry, it can still reference a folio that remains in the swap
> > > > > cache.
> > > > >
> > > > > If do_swap_page() is triggered, it may locate the folio in the
> > > > > swap cache. However, during add_rmap operations, a kernel panic
> > > > > can occur due to:
> > > > > page_pgoff(folio, page) != linear_page_index(vma, address)
> > > >
> > > > Thanks for the report and reproducer!
> > > >
> > > > >
> > > > > $./a.out > /dev/null
> > > > > [ 13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > > > > [ 13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > > > > [ 13.337716] memcg:ffff00000405f000
> > > > > [ 13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > > > > [ 13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > [ 13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > [ 13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > [ 13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > [ 13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > > > > [ 13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > > > > [ 13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > > > > [ 13.340190] ------------[ cut here ]------------
> > > > > [ 13.340316] kernel BUG at mm/rmap.c:1380!
> > > > > [ 13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > > > [ 13.340969] Modules linked in:
> > > > > [ 13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > > > > [ 13.341470] Hardware name: linux,dummy-virt (DT)
> > > > > [ 13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > [ 13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > > > > [ 13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > > > > [ 13.342018] sp : ffff80008752bb20
> > > > > [ 13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > > > > [ 13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > > > > [ 13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > > > > [ 13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > > > > [ 13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > > > > [ 13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > > > > [ 13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > > > > [ 13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > > > > [ 13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > > > > [ 13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > > > > [ 13.343876] Call trace:
> > > > > [ 13.344045] __page_check_anon_rmap+0xa0/0xb0 (P)
> > > > > [ 13.344234] folio_add_anon_rmap_ptes+0x22c/0x320
> > > > > [ 13.344333] do_swap_page+0x1060/0x1400
> > > > > [ 13.344417] __handle_mm_fault+0x61c/0xbc8
> > > > > [ 13.344504] handle_mm_fault+0xd8/0x2e8
> > > > > [ 13.344586] do_page_fault+0x20c/0x770
> > > > > [ 13.344673] do_translation_fault+0xb4/0xf0
> > > > > [ 13.344759] do_mem_abort+0x48/0xa0
> > > > > [ 13.344842] el0_da+0x58/0x130
> > > > > [ 13.344914] el0t_64_sync_handler+0xc4/0x138
> > > > > [ 13.345002] el0t_64_sync+0x1ac/0x1b0
> > > > > [ 13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > > > > [ 13.345504] ---[ end trace 0000000000000000 ]---
> > > > > [ 13.345715] note: a.out[107] exited with irqs disabled
> > > > > [ 13.345954] note: a.out[107] exited with preempt_count 2
> > > > >
> > > > > Fully fixing it would be quite complex, requiring similar handling
> > > > > of folios as done in move_present_pte.
> > > >
> > > > How complex would that be? Is it a matter of adding
> > > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > > folio->index = linear_page_index like in move_present_pte() or
> > > > something more?
> > >
> > > My main concern is still with large folios that require a split_folio()
> > > during move_pages(), as the entire folio shares the same index and
> > > anon_vma. However, userfaultfd_move() moves pages individually,
> > > making a split necessary.
> > >
> > > However, in split_huge_page_to_list_to_order(), there is a:
> > >
> > > if (folio_test_writeback(folio))
> > > return -EBUSY;
> > >
> > > This is likely true for swapcache, right? However, even for move_present_pte(),
> > > it simply returns -EBUSY:
> > >
> > > move_pages_pte()
> > > {
> > > /* at this point we have src_folio locked */
> > > if (folio_test_large(src_folio)) {
> > > /* split_folio() can block */
> > > pte_unmap(&orig_src_pte);
> > > pte_unmap(&orig_dst_pte);
> > > src_pte = dst_pte = NULL;
> > > err = split_folio(src_folio);
> > > if (err)
> > > goto out;
> > >
> > > /* have to reacquire the folio after it got split */
> > > folio_unlock(src_folio);
> > > folio_put(src_folio);
> > > src_folio = NULL;
> > > goto retry;
> > > }
> > > }
> > >
> > > Do we need a folio_wait_writeback() before calling split_folio()?
> >
> > Maybe no need in the first version to fix the immediate bug?
> >
> > It's also not always the case to hit writeback here. IIUC, writeback only
> > happens for a short window when the folio was just added into swapcache.
> > MOVE can happen much later after that anytime before a swapin. My
> > understanding is that's also what Matthew wanted to point out. It may be
> > better justified of that in a separate change with some performance
> > measurements.
>
> The bug we’re discussing occurs precisely within the short window you
> mentioned.
>
> 1. add_to_swap: The folio is added to swapcache.
> 2. try_to_unmap: PTEs are converted to swap entries.
> 3. pageout
> 4. Swapcache is cleared.

Hmm, I see.  I was expecting step 4 to be "writeback is cleared", or at
least for that to be step 3.5, since IIUC the "writeback" bit needs to be
cleared before the "swapcache" bit is cleared.

>
> The issue happens between steps 2 and 4, where the PTE is not present, but
> the folio is still in swapcache - the current code does move_swap_pte() but does
> not fixup folio->index within swapcache.

One thing I'm still not clear on is why this is a race condition rather
than something more severe.  I mean, folio->index is definitely wrong, so
as long as the page is still in swapcache, we should be able to move the
swap entry over to the dest addr of UFFDIO_MOVE, read from the dest addr,
and then the fault will find the page in swapcache with the wrong
folio->index already and trigger the BUG.

I wrote a quick test like that, and it actually doesn't trigger.

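To make the expected mismatch concrete, here is a minimal userspace sketch of the arithmetic behind linear_page_index() for a regular (non-hugetlb) VMA. It is not kernel code: struct model_vma, MODEL_PAGE_SHIFT and the function name are hypothetical stand-ins, and 4 KiB pages are assumed. It only shows that an index computed for the source address does not equal the index expected at the destination address once the swap entry has been moved.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical, stripped-down stand-in for struct vm_area_struct. */
struct model_vma {
	uint64_t vm_start;	/* first address covered by the mapping */
	uint64_t vm_end;	/* first address past the mapping */
	uint64_t vm_pgoff;	/* page offset that corresponds to vm_start */
};

/*
 * Same arithmetic as linear_page_index() for a non-hugetlb VMA:
 * page distance from vm_start, plus the VMA's starting page offset.
 */
static uint64_t model_linear_page_index(const struct model_vma *vma,
					uint64_t addr)
{
	assert(addr >= vma->vm_start && addr < vma->vm_end);
	return ((addr - vma->vm_start) >> MODEL_PAGE_SHIFT) + vma->vm_pgoff;
}
```

With a source VMA at 0x10000000 and a destination VMA at 0x20000000 (made-up values, each with vm_pgoff equal to vm_start >> 12), a folio indexed for 0x10003000 carries index 0x10003, while a fault at 0x20003000 expects 0x20003.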
I had a closer look at the code, and I think it's because do_swap_page()
first checks whether folio->index matches, via ksm_might_need_to_copy(),
and allocates a new folio if it doesn't.  IIUC that was for
ksm, but it looks like it also works here.

ksm_might_need_to_copy:

        if (folio_test_ksm(folio)) {
                if (folio_stable_node(folio) &&
                    !(ksm_run & KSM_RUN_UNMERGE))
                        return folio;   /* no need to copy it */
        } else if (!anon_vma) {
                return folio;           /* no need to copy it */
        } else if (folio->index == linear_page_index(vma, addr) && <---------- [1]
                        anon_vma->root == vma->anon_vma->root) {
                return folio;           /* still no need to copy it */
        }
        ...
        new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr); <---- [2]
        ...

So I believe what I hit is that at [1] it sees the index doesn't match and
decides to allocate a new folio at [2].  In this case it won't hit your BUG,
because later on "folio != swapcache", so it sets up folio->index for the
new folio rather than running the sanity check.

Do you know how your case got triggered, i.e. how it was able to bypass the
check at [1], which should already compare folio->index?

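The decision at [1] can be sketched as a small userspace model. This is only an illustration under stated assumptions: it covers non-KSM folios only (the folio_test_ksm() branch is left out), and struct model_folio, struct model_anon_vma and the function name are hypothetical, not kernel types. It returns true where the quoted code would return the swapcache folio unchanged, and false where it would fall through to allocating a new folio at [2].

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct anon_vma: only the root pointer matters. */
struct model_anon_vma {
	struct model_anon_vma *root;
};

/* Hypothetical stand-in for a non-KSM folio found in the swap cache. */
struct model_folio {
	struct model_anon_vma *anon_vma;	/* NULL if no anon_vma is attached */
	uint64_t index;				/* index the folio was set up with */
};

/*
 * true:  the swapcache folio can be reused as-is (no copy needed)
 * false: the caller would allocate a fresh folio and copy into it
 *
 * expected_index plays the role of linear_page_index(vma, addr) for the
 * faulting address; vma_anon_vma plays the role of vma->anon_vma.
 */
static bool model_can_reuse_swapcache_folio(const struct model_folio *folio,
					    const struct model_anon_vma *vma_anon_vma,
					    uint64_t expected_index)
{
	assert(vma_anon_vma != NULL);

	if (folio->anon_vma == NULL)
		return true;	/* nothing to compare against: reuse */

	if (folio->index == expected_index &&
	    folio->anon_vma->root == vma_anon_vma->root)
		return true;	/* index and anon_vma root both match: reuse */

	return false;		/* mismatch: take the copy path */
}
```

In this model a folio with a stale index is reused only when it has no anon_vma attached; whether that, or some other path, corresponds to the reported BUG is exactly the open question above.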
>
> My point is that if we want a proper fix for mTHP, we'd better handle writeback.
> Otherwise, this isn’t much different from directly returning -EBUSY as proposed
> in this RFC.
>
> For small folios, there’s no split_folio issue, making it relatively
> simpler. Lokesh
> mentioned plans to madvise NOHUGEPAGE in ART, so fixing small folios is likely
> the first priority.
Agreed.
--
Peter Xu