From: "David Hildenbrand (Arm)" <david@kernel.org>
To: mboone@akamai.com, Andrew Morton <akpm@linux-foundation.org>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
kvm@vger.kernel.org, stable@vger.kernel.org
Subject: Re: [PATCH] mm/pagewalk: fix race between concurrent split and refault
Date: Tue, 17 Mar 2026 15:05:37 +0100
Message-ID: <12804801-d4f9-42c9-8580-e3897efd15ef@kernel.org>
In-Reply-To: <20260317-pagewalk-check-pmd-refault-v1-1-f699a010f2b3@akamai.com>
On 3/17/26 15:03, Max Boone via B4 Relay wrote:
> From: Max Boone <mboone@akamai.com>
>
> The splitting of a PUD entry in walk_pud_range() can race with
> a concurrent thread refaulting the PUD leaf entry causing it to
> try walking a PMD range that has disappeared.
>
> An example and reproduction of this is to try reading numa_maps of
> a process while VFIO-PCI is setting up DMA (specifically the
> vfio_pin_pages_remote call) on a large BAR for that process.
>
> This will trigger a kernel BUG:
> vfio-pci 0000:03:00.0: enabling device (0000 -> 0002)
> BUG: unable to handle page fault for address: ffffa23980000000
> PGD 0 P4D 0
> Oops: Oops: 0000 [#1] SMP NOPTI
> ...
> RIP: 0010:walk_pgd_range+0x3b5/0x7a0
> Code: 8d 43 ff 48 89 44 24 28 4d 89 ce 4d 8d a7 00 00 20 00 48 8b 4c 24
> 28 49 81 e4 00 00 e0 ff 49 8d 44 24 ff 48 39 c8 4c 0f 43 e3 <49> f7 06
> 9f ff ff ff 75 3b 48 8b 44 24 20 48 8b 40 28 48 85 c0 74
> RSP: 0018:ffffac23e1ecf808 EFLAGS: 00010287
> RAX: 00007f44c01fffff RBX: 00007f4500000000 RCX: 00007f44ffffffff
> RDX: 0000000000000000 RSI: 000ffffffffff000 RDI: ffffffff93378fe0
> RBP: ffffac23e1ecf918 R08: 0000000000000004 R09: ffffa23980000000
> R10: 0000000000000020 R11: 0000000000000004 R12: 00007f44c0200000
> R13: 00007f44c0000000 R14: ffffa23980000000 R15: 00007f44c0000000
> FS: 00007fe884739580(0000) GS:ffff9b7d7a9c0000(0000)
> knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffa23980000000 CR3: 000000c0650e2005 CR4: 0000000000770ef0
> PKRU: 55555554
> Call Trace:
> <TASK>
> __walk_page_range+0x195/0x1b0
> walk_page_vma+0x62/0xc0
> show_numa_map+0x12b/0x3b0
> seq_read_iter+0x297/0x440
> seq_read+0x11d/0x140
> vfs_read+0xc2/0x340
> ksys_read+0x5f/0xe0
> do_syscall_64+0x68/0x130
> ? get_page_from_freelist+0x5c2/0x17e0
> ? mas_store_prealloc+0x17e/0x360
> ? vma_set_page_prot+0x4c/0xa0
> ? __alloc_pages_noprof+0x14e/0x2d0
> ? __mod_memcg_lruvec_state+0x8d/0x140
> ? __lruvec_stat_mod_folio+0x76/0xb0
> ? __folio_mod_stat+0x26/0x80
> ? do_anonymous_page+0x705/0x900
> ? __handle_mm_fault+0xa8d/0x1000
> ? __count_memcg_events+0x53/0xf0
> ? handle_mm_fault+0xa5/0x360
> ? do_user_addr_fault+0x342/0x640
> ? arch_exit_to_user_mode_prepare.constprop.0+0x16/0xa0
> ? irqentry_exit_to_user_mode+0x24/0x100
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> RIP: 0033:0x7fe88464f47e
> Code: c0 e9 b6 fe ff ff 50 48 8d 3d be 07 0b 00 e8 69 01 02 00 66 0f 1f
> 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00
> f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
> RSP: 002b:00007ffe6cd9a9b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fe88464f47e
> RDX: 0000000000020000 RSI: 00007fe884543000 RDI: 0000000000000003
> RBP: 00007fe884543000 R08: 00007fe884542010 R09: 0000000000000000
> R10: fffffffffffffbc5 R11: 0000000000000246 R12: 0000000000000000
> R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
> </TASK>
>
> Fix this by validating the PUD entry in walk_pmd_range() using a stable
> snapshot (pudp_get()). If the PUD is not present or is a leaf, retry the
> walk via ACTION_AGAIN instead of descending further. This mirrors the
> retry logic in walk_pmd_range().
>
> Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages")
> Cc: stable@vger.kernel.org
> Co-developed-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: Max Boone <mboone@akamai.com>
> ---
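
For readers following along, the retry described above can be sketched
as a small user-space toy model. This is not the patch and not kernel
code: every toy_* name is hypothetical, the "refault" is simulated by a
hook that fires a fixed number of times, and the real walker has
callbacks, locking and address ranges that are left out here. It only
illustrates the idea of taking one snapshot of the PUD entry before
descending and asking the caller to retry if that snapshot is not a
page table.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Toy model only: one "PUD" slot that is empty, a huge leaf, or a table. */
enum toy_kind { TOY_NONE = 0, TOY_LEAF = 1, TOY_TABLE = 2 };
enum toy_action { TOY_SUBTREE, TOY_AGAIN };

typedef struct { _Atomic uint64_t val; } toy_pud_t;

/* How many more times the simulated concurrent refault will fire. */
static int toy_race_budget;

/* Stand-in for pudp_get(): a single load that yields a stable snapshot. */
static enum toy_kind toy_pudp_get(toy_pud_t *pudp)
{
    return (enum toy_kind)atomic_load(&pudp->val);
}

/* Simulates another thread re-installing a leaf right after the split. */
static void toy_concurrent_refault(toy_pud_t *pudp)
{
    if (toy_race_budget > 0) {
        toy_race_budget--;
        atomic_store(&pudp->val, TOY_LEAF);
    }
}

/*
 * Lower-level walk: validate the snapshot first. Anything other than a
 * table means there is no PMD range to walk, so ask the caller to retry.
 */
static enum toy_action toy_walk_pmd_range(toy_pud_t *pudp, int *pmd_walks)
{
    enum toy_kind pud = toy_pudp_get(pudp);

    if (pud != TOY_TABLE)
        return TOY_AGAIN;

    (*pmd_walks)++;        /* here the real code would walk the PMDs */
    return TOY_SUBTREE;
}

/* Upper-level walk of one entry; returns how many retries were needed. */
static int toy_walk_pud_entry(toy_pud_t *pudp, int *pmd_walks)
{
    int retries = 0;

again:
    if (toy_pudp_get(pudp) == TOY_NONE)
        return retries;                       /* hole, nothing to walk */

    if (toy_pudp_get(pudp) == TOY_LEAF)
        atomic_store(&pudp->val, TOY_TABLE);  /* "split" the leaf */

    toy_concurrent_refault(pudp);             /* race window */

    if (toy_walk_pmd_range(pudp, pmd_walks) == TOY_AGAIN) {
        retries++;
        goto again;
    }
    return retries;
}
```

Starting from a leaf entry with toy_race_budget set to 1, the first
pass splits the entry, the simulated refault puts a leaf back, the
lower level refuses to descend, and the second pass succeeds. Without
the snapshot check the lower level would have walked a PMD range that
no longer exists, which is the oops quoted above.
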
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
--
Cheers,
David