public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mm/pagewalk: fix race between concurrent split and refault
@ 2026-03-17 14:03 Max Boone via B4 Relay
  2026-03-17 14:05 ` David Hildenbrand (Arm)
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Max Boone via B4 Relay @ 2026-03-17 14:03 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko
  Cc: linux-mm, linux-kernel, kvm, stable, Max Boone

From: Max Boone <mboone@akamai.com>

The splitting of a PUD entry in walk_pud_range() can race with
a concurrent thread refaulting the PUD leaf entry, causing the walker
to try walking a PMD range that has disappeared.

An example reproduction of this is to read the numa_maps of a
process while VFIO-PCI is setting up DMA (specifically in the
vfio_pin_pages_remote() call) on a large BAR of that process.

This will trigger a kernel BUG:
vfio-pci 0000:03:00.0: enabling device (0000 -> 0002)
BUG: unable to handle page fault for address: ffffa23980000000
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP NOPTI
...
RIP: 0010:walk_pgd_range+0x3b5/0x7a0
Code: 8d 43 ff 48 89 44 24 28 4d 89 ce 4d 8d a7 00 00 20 00 48 8b 4c 24
28 49 81 e4 00 00 e0 ff 49 8d 44 24 ff 48 39 c8 4c 0f 43 e3 <49> f7 06
   9f ff ff ff 75 3b 48 8b 44 24 20 48 8b 40 28 48 85 c0 74
RSP: 0018:ffffac23e1ecf808 EFLAGS: 00010287
RAX: 00007f44c01fffff RBX: 00007f4500000000 RCX: 00007f44ffffffff
RDX: 0000000000000000 RSI: 000ffffffffff000 RDI: ffffffff93378fe0
RBP: ffffac23e1ecf918 R08: 0000000000000004 R09: ffffa23980000000
R10: 0000000000000020 R11: 0000000000000004 R12: 00007f44c0200000
R13: 00007f44c0000000 R14: ffffa23980000000 R15: 00007f44c0000000
FS:  00007fe884739580(0000) GS:ffff9b7d7a9c0000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffa23980000000 CR3: 000000c0650e2005 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 <TASK>
 __walk_page_range+0x195/0x1b0
 walk_page_vma+0x62/0xc0
 show_numa_map+0x12b/0x3b0
 seq_read_iter+0x297/0x440
 seq_read+0x11d/0x140
 vfs_read+0xc2/0x340
 ksys_read+0x5f/0xe0
 do_syscall_64+0x68/0x130
 ? get_page_from_freelist+0x5c2/0x17e0
 ? mas_store_prealloc+0x17e/0x360
 ? vma_set_page_prot+0x4c/0xa0
 ? __alloc_pages_noprof+0x14e/0x2d0
 ? __mod_memcg_lruvec_state+0x8d/0x140
 ? __lruvec_stat_mod_folio+0x76/0xb0
 ? __folio_mod_stat+0x26/0x80
 ? do_anonymous_page+0x705/0x900
 ? __handle_mm_fault+0xa8d/0x1000
 ? __count_memcg_events+0x53/0xf0
 ? handle_mm_fault+0xa5/0x360
 ? do_user_addr_fault+0x342/0x640
 ? arch_exit_to_user_mode_prepare.constprop.0+0x16/0xa0
 ? irqentry_exit_to_user_mode+0x24/0x100
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fe88464f47e
Code: c0 e9 b6 fe ff ff 50 48 8d 3d be 07 0b 00 e8 69 01 02 00 66 0f 1f
84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00
   f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
RSP: 002b:00007ffe6cd9a9b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fe88464f47e
RDX: 0000000000020000 RSI: 00007fe884543000 RDI: 0000000000000003
RBP: 00007fe884543000 R08: 00007fe884542010 R09: 0000000000000000
R10: fffffffffffffbc5 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
 </TASK>

Fix this by validating the PUD entry in walk_pmd_range() using a stable
snapshot (pudp_get()). If the PUD is not present or is a leaf, retry the
walk via ACTION_AGAIN instead of descending further. This mirrors the
retry logic in walk_pmd_range().

Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages")
Cc: stable@vger.kernel.org
Co-developed-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Max Boone <mboone@akamai.com>
---
 mm/pagewalk.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index a94c401ab..c74b4d800 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -97,6 +97,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 			  struct mm_walk *walk)
 {
+	pud_t pudval = pudp_get(pud);
 	pmd_t *pmd;
 	unsigned long next;
 	const struct mm_walk_ops *ops = walk->ops;
@@ -105,6 +106,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	int err = 0;
 	int depth = real_depth(3);
 
+	/*
+	 * For PTE handling, pte_offset_map_lock() takes care of checking
+	 * whether there actually is a page table. But it also has to be
+	 * very careful about concurrent page table reclaim. If we spot a PMD
+	 * table, it cannot go away, so we can just walk it. However, if we find
+	 * something else, we have to retry.
+	 */
+	if (!pud_present(pudval) || pud_leaf(pudval)) {
+		walk->action = ACTION_AGAIN;
+		return 0;
+	}
+
 	pmd = pmd_offset(pud, addr);
 	do {
 again:
@@ -218,12 +231,13 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 		else if (pud_leaf(*pud) || !pud_present(*pud))
 			continue; /* Nothing to do. */
 
-		if (pud_none(*pud))
-			goto again;
-
 		err = walk_pmd_range(pud, addr, next, walk);
 		if (err)
 			break;
+
+		if (walk->action == ACTION_AGAIN)
+			goto again;
+
 	} while (pud++, addr = next, addr != end);
 
 	return err;

---
base-commit: b4f0dd314b39ea154f62f3bd3115ed0470f9f71e
change-id: 20260317-pagewalk-check-pmd-refault-de8f14fbe6a5

Best regards,
-- 
Max Boone <mboone@akamai.com>



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/pagewalk: fix race between concurrent split and refault
  2026-03-17 14:03 [PATCH] mm/pagewalk: fix race between concurrent split and refault Max Boone via B4 Relay
@ 2026-03-17 14:05 ` David Hildenbrand (Arm)
  2026-03-18  6:16 ` Qi Zheng
  2026-03-18 12:55 ` Lorenzo Stoakes (Oracle)
  2 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-17 14:05 UTC (permalink / raw)
  To: mboone, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko
  Cc: linux-mm, linux-kernel, kvm, stable

On 3/17/26 15:03, Max Boone via B4 Relay wrote:
> From: Max Boone <mboone@akamai.com>
> 
> The splitting of a PUD entry in walk_pud_range() can race with
> a concurrent thread refaulting the PUD leaf entry causing it to
> try walking a PMD range that has disappeared.
> 
> An example and reproduction of this is to try reading numa_maps of
> a process while VFIO-PCI is setting up DMA (specifically the
> vfio_pin_pages_remote call) on a large BAR for that process.
> 
> This will trigger a kernel BUG:
> vfio-pci 0000:03:00.0: enabling device (0000 -> 0002)
> BUG: unable to handle page fault for address: ffffa23980000000
> PGD 0 P4D 0
> Oops: Oops: 0000 [#1] SMP NOPTI
> ...
> RIP: 0010:walk_pgd_range+0x3b5/0x7a0
> Code: 8d 43 ff 48 89 44 24 28 4d 89 ce 4d 8d a7 00 00 20 00 48 8b 4c 24
> 28 49 81 e4 00 00 e0 ff 49 8d 44 24 ff 48 39 c8 4c 0f 43 e3 <49> f7 06
>    9f ff ff ff 75 3b 48 8b 44 24 20 48 8b 40 28 48 85 c0 74
> RSP: 0018:ffffac23e1ecf808 EFLAGS: 00010287
> RAX: 00007f44c01fffff RBX: 00007f4500000000 RCX: 00007f44ffffffff
> RDX: 0000000000000000 RSI: 000ffffffffff000 RDI: ffffffff93378fe0
> RBP: ffffac23e1ecf918 R08: 0000000000000004 R09: ffffa23980000000
> R10: 0000000000000020 R11: 0000000000000004 R12: 00007f44c0200000
> R13: 00007f44c0000000 R14: ffffa23980000000 R15: 00007f44c0000000
> FS:  00007fe884739580(0000) GS:ffff9b7d7a9c0000(0000)
> knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffa23980000000 CR3: 000000c0650e2005 CR4: 0000000000770ef0
> PKRU: 55555554
> Call Trace:
>  <TASK>
>  __walk_page_range+0x195/0x1b0
>  walk_page_vma+0x62/0xc0
>  show_numa_map+0x12b/0x3b0
>  seq_read_iter+0x297/0x440
>  seq_read+0x11d/0x140
>  vfs_read+0xc2/0x340
>  ksys_read+0x5f/0xe0
>  do_syscall_64+0x68/0x130
>  ? get_page_from_freelist+0x5c2/0x17e0
>  ? mas_store_prealloc+0x17e/0x360
>  ? vma_set_page_prot+0x4c/0xa0
>  ? __alloc_pages_noprof+0x14e/0x2d0
>  ? __mod_memcg_lruvec_state+0x8d/0x140
>  ? __lruvec_stat_mod_folio+0x76/0xb0
>  ? __folio_mod_stat+0x26/0x80
>  ? do_anonymous_page+0x705/0x900
>  ? __handle_mm_fault+0xa8d/0x1000
>  ? __count_memcg_events+0x53/0xf0
>  ? handle_mm_fault+0xa5/0x360
>  ? do_user_addr_fault+0x342/0x640
>  ? arch_exit_to_user_mode_prepare.constprop.0+0x16/0xa0
>  ? irqentry_exit_to_user_mode+0x24/0x100
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> RIP: 0033:0x7fe88464f47e
> Code: c0 e9 b6 fe ff ff 50 48 8d 3d be 07 0b 00 e8 69 01 02 00 66 0f 1f
> 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00
>    f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
> RSP: 002b:00007ffe6cd9a9b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fe88464f47e
> RDX: 0000000000020000 RSI: 00007fe884543000 RDI: 0000000000000003
> RBP: 00007fe884543000 R08: 00007fe884542010 R09: 0000000000000000
> R10: fffffffffffffbc5 R11: 0000000000000246 R12: 0000000000000000
> R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
>  </TASK>
> 
> Fix this by validating the PUD entry in walk_pmd_range() using a stable
> snapshot (pudp_get()). If the PUD is not present or is a leaf, retry the
> walk via ACTION_AGAIN instead of descending further. This mirrors the
> retry logic in walk_pmd_range().
> 
> Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages")
> Cc: stable@vger.kernel.org
> Co-developed-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: Max Boone <mboone@akamai.com>
> ---

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/pagewalk: fix race between concurrent split and refault
  2026-03-17 14:03 [PATCH] mm/pagewalk: fix race between concurrent split and refault Max Boone via B4 Relay
  2026-03-17 14:05 ` David Hildenbrand (Arm)
@ 2026-03-18  6:16 ` Qi Zheng
  2026-03-18  7:37   ` Boone, Max
  2026-03-18  7:38   ` David Hildenbrand (Arm)
  2026-03-18 12:55 ` Lorenzo Stoakes (Oracle)
  2 siblings, 2 replies; 11+ messages in thread
From: Qi Zheng @ 2026-03-18  6:16 UTC (permalink / raw)
  To: mboone, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko
  Cc: linux-mm, linux-kernel, kvm, stable

Hi Max,

On 3/17/26 10:03 PM, Max Boone via B4 Relay wrote:
> From: Max Boone <mboone@akamai.com>
> 
> The splitting of a PUD entry in walk_pud_range() can race with
> a concurrent thread refaulting the PUD leaf entry causing it to
> try walking a PMD range that has disappeared.
> 
> An example and reproduction of this is to try reading numa_maps of
> a process while VFIO-PCI is setting up DMA (specifically the
> vfio_pin_pages_remote call) on a large BAR for that process.
> 
> This will trigger a kernel BUG:
> vfio-pci 0000:03:00.0: enabling device (0000 -> 0002)
> BUG: unable to handle page fault for address: ffffa23980000000
> PGD 0 P4D 0
> Oops: Oops: 0000 [#1] SMP NOPTI
> ...
> RIP: 0010:walk_pgd_range+0x3b5/0x7a0
> Code: 8d 43 ff 48 89 44 24 28 4d 89 ce 4d 8d a7 00 00 20 00 48 8b 4c 24
> 28 49 81 e4 00 00 e0 ff 49 8d 44 24 ff 48 39 c8 4c 0f 43 e3 <49> f7 06
>     9f ff ff ff 75 3b 48 8b 44 24 20 48 8b 40 28 48 85 c0 74
> RSP: 0018:ffffac23e1ecf808 EFLAGS: 00010287
> RAX: 00007f44c01fffff RBX: 00007f4500000000 RCX: 00007f44ffffffff
> RDX: 0000000000000000 RSI: 000ffffffffff000 RDI: ffffffff93378fe0
> RBP: ffffac23e1ecf918 R08: 0000000000000004 R09: ffffa23980000000
> R10: 0000000000000020 R11: 0000000000000004 R12: 00007f44c0200000
> R13: 00007f44c0000000 R14: ffffa23980000000 R15: 00007f44c0000000
> FS:  00007fe884739580(0000) GS:ffff9b7d7a9c0000(0000)
> knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffa23980000000 CR3: 000000c0650e2005 CR4: 0000000000770ef0
> PKRU: 55555554
> Call Trace:
>   <TASK>
>   __walk_page_range+0x195/0x1b0
>   walk_page_vma+0x62/0xc0
>   show_numa_map+0x12b/0x3b0
>   seq_read_iter+0x297/0x440
>   seq_read+0x11d/0x140
>   vfs_read+0xc2/0x340
>   ksys_read+0x5f/0xe0
>   do_syscall_64+0x68/0x130
>   ? get_page_from_freelist+0x5c2/0x17e0
>   ? mas_store_prealloc+0x17e/0x360
>   ? vma_set_page_prot+0x4c/0xa0
>   ? __alloc_pages_noprof+0x14e/0x2d0
>   ? __mod_memcg_lruvec_state+0x8d/0x140
>   ? __lruvec_stat_mod_folio+0x76/0xb0
>   ? __folio_mod_stat+0x26/0x80
>   ? do_anonymous_page+0x705/0x900
>   ? __handle_mm_fault+0xa8d/0x1000
>   ? __count_memcg_events+0x53/0xf0
>   ? handle_mm_fault+0xa5/0x360
>   ? do_user_addr_fault+0x342/0x640
>   ? arch_exit_to_user_mode_prepare.constprop.0+0x16/0xa0
>   ? irqentry_exit_to_user_mode+0x24/0x100
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> RIP: 0033:0x7fe88464f47e
> Code: c0 e9 b6 fe ff ff 50 48 8d 3d be 07 0b 00 e8 69 01 02 00 66 0f 1f
> 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00
>     f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
> RSP: 002b:00007ffe6cd9a9b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fe88464f47e
> RDX: 0000000000020000 RSI: 00007fe884543000 RDI: 0000000000000003
> RBP: 00007fe884543000 R08: 00007fe884542010 R09: 0000000000000000
> R10: fffffffffffffbc5 R11: 0000000000000246 R12: 0000000000000000
> R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
>   </TASK>
> 
> Fix this by validating the PUD entry in walk_pmd_range() using a stable
> snapshot (pudp_get()). If the PUD is not present or is a leaf, retry the
> walk via ACTION_AGAIN instead of descending further. This mirrors the
> retry logic in walk_pmd_range().
> 
> Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages")
> Cc: stable@vger.kernel.org
> Co-developed-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: Max Boone <mboone@akamai.com>
> ---
>   mm/pagewalk.c | 20 +++++++++++++++++---
>   1 file changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index a94c401ab..c74b4d800 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -97,6 +97,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>   static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>   			  struct mm_walk *walk)
>   {
> +	pud_t pudval = pudp_get(pud);
>   	pmd_t *pmd;
>   	unsigned long next;
>   	const struct mm_walk_ops *ops = walk->ops;
> @@ -105,6 +106,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>   	int err = 0;
>   	int depth = real_depth(3);
>   
> +	/*
> +	 * For PTE handling, pte_offset_map_lock() takes care of checking
> +	 * whether there actually is a page table. But it also has to be
> +	 * very careful about concurrent page table reclaim. If we spot a PMD
> +	 * table, it cannot go away, so we can just walk it. However, if we find
> +	 * something else, we have to retry.
> +	 */
> +	if (!pud_present(pudval) || pud_leaf(pudval)) {
> +		walk->action = ACTION_AGAIN;
> +		return 0;
> +	}
> +
>   	pmd = pmd_offset(pud, addr);
>   	do {
>   again:
> @@ -218,12 +231,13 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>   		else if (pud_leaf(*pud) || !pud_present(*pud))
>   			continue; /* Nothing to do. */

Why not check pudval directly here? Like the following:

		if (pud_leaf(*pud) || !pud_present(*pud))
			goto again;

>   
> -		if (pud_none(*pud))
> -			goto again;
> -
>   		err = walk_pmd_range(pud, addr, next, walk);
>   		if (err)
>   			break;
> +
> +		if (walk->action == ACTION_AGAIN)
> +			goto again;
> +
>   	} while (pud++, addr = next, addr != end);
>   
>   	return err;
> 
> ---
> base-commit: b4f0dd314b39ea154f62f3bd3115ed0470f9f71e
> change-id: 20260317-pagewalk-check-pmd-refault-de8f14fbe6a5
> 
> Best regards,


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/pagewalk: fix race between concurrent split and refault
  2026-03-18  6:16 ` Qi Zheng
@ 2026-03-18  7:37   ` Boone, Max
  2026-03-18  7:38   ` David Hildenbrand (Arm)
  1 sibling, 0 replies; 11+ messages in thread
From: Boone, Max @ 2026-03-18  7:37 UTC (permalink / raw)
  To: Qi Zheng
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	stable@vger.kernel.org


Hey Qi,

[…]

> 
> Why not check pudval directly here? Like the following:
> 
> 	if (pud_leaf(*pud) || !pud_present(*pud))
> 		goto again;
> 

Good point; my initial idea [1] was also to put it there (although I
checked pud_special() instead, and continued rather than retrying).
I wasn't sure whether I could link to a thread in a patch message,
but there's some discussion between David and me there.

Checking that the parent PUD is present and not a leaf, to make sure
the passed-in PMD range can be walked at all, feels better suited as a
guard inside walk_pmd_range() itself. After all, the failure originates
from inside that function, and potential other callers then won't need
to duplicate a check that has to be done for safety anyway.

It also makes the logic of walk_pud_range() more similar to that of
walk_pmd_range(), which likewise retries when it gets ACTION_AGAIN back
from its walk_pte_range() call.

Finally, it doesn't feel very natural to me to end up with:

if (walk->vma)
    split_huge_pud(walk->vma, pud, addr);
else if (pud_leaf(*pud) || !pud_present(*pud))
    continue; /* Nothing to do. */
if (pud_leaf(*pud) || !pud_present(*pud))
    goto again; /* Retry on concurrent refault as leaf */

[1] https://lore.kernel.org/all/20260309174949.2514565-1-mboone@akamai.com/


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/pagewalk: fix race between concurrent split and refault
  2026-03-18  6:16 ` Qi Zheng
  2026-03-18  7:37   ` Boone, Max
@ 2026-03-18  7:38   ` David Hildenbrand (Arm)
  1 sibling, 0 replies; 11+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-18  7:38 UTC (permalink / raw)
  To: Qi Zheng, mboone, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko
  Cc: linux-mm, linux-kernel, kvm, stable

>> +
>>       pmd = pmd_offset(pud, addr);
>>       do {
>>   again:
>> @@ -218,12 +231,13 @@ static int walk_pud_range(p4d_t *p4d, unsigned
>> long addr, unsigned long end,
>>           else if (pud_leaf(*pud) || !pud_present(*pud))
>>               continue; /* Nothing to do. */
> 
> Why not check pudval directly here? Like the following:
> 

As the patch description states: "This mirrors the
retry logic in walk_pmd_range()."

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/pagewalk: fix race between concurrent split and refault
  2026-03-17 14:03 [PATCH] mm/pagewalk: fix race between concurrent split and refault Max Boone via B4 Relay
  2026-03-17 14:05 ` David Hildenbrand (Arm)
  2026-03-18  6:16 ` Qi Zheng
@ 2026-03-18 12:55 ` Lorenzo Stoakes (Oracle)
  2026-03-18 13:08   ` Boone, Max
  2 siblings, 1 reply; 11+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-18 12:55 UTC (permalink / raw)
  To: mboone
  Cc: Andrew Morton, David Hildenbrand, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel, kvm, stable

On Tue, Mar 17, 2026 at 03:03:04PM +0100, Max Boone via B4 Relay wrote:
> From: Max Boone <mboone@akamai.com>
>
> The splitting of a PUD entry in walk_pud_range() can race with
> a concurrent thread refaulting the PUD leaf entry causing it to
> try walking a PMD range that has disappeared.

So IOW, the PUD entry is split, then refaulted back to a PUD leaf entry
again?

>
> An example and reproduction of this is to try reading numa_maps of
> a process while VFIO-PCI is setting up DMA (specifically the
> vfio_pin_pages_remote call) on a large BAR for that process.
>
> This will trigger a kernel BUG:
> vfio-pci 0000:03:00.0: enabling device (0000 -> 0002)
> BUG: unable to handle page fault for address: ffffa23980000000
> PGD 0 P4D 0
> Oops: Oops: 0000 [#1] SMP NOPTI
> ...
> RIP: 0010:walk_pgd_range+0x3b5/0x7a0
> Code: 8d 43 ff 48 89 44 24 28 4d 89 ce 4d 8d a7 00 00 20 00 48 8b 4c 24
> 28 49 81 e4 00 00 e0 ff 49 8d 44 24 ff 48 39 c8 4c 0f 43 e3 <49> f7 06
>    9f ff ff ff 75 3b 48 8b 44 24 20 48 8b 40 28 48 85 c0 74
> RSP: 0018:ffffac23e1ecf808 EFLAGS: 00010287
> RAX: 00007f44c01fffff RBX: 00007f4500000000 RCX: 00007f44ffffffff
> RDX: 0000000000000000 RSI: 000ffffffffff000 RDI: ffffffff93378fe0
> RBP: ffffac23e1ecf918 R08: 0000000000000004 R09: ffffa23980000000
> R10: 0000000000000020 R11: 0000000000000004 R12: 00007f44c0200000
> R13: 00007f44c0000000 R14: ffffa23980000000 R15: 00007f44c0000000
> FS:  00007fe884739580(0000) GS:ffff9b7d7a9c0000(0000)
> knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffa23980000000 CR3: 000000c0650e2005 CR4: 0000000000770ef0
> PKRU: 55555554
> Call Trace:
>  <TASK>
>  __walk_page_range+0x195/0x1b0
>  walk_page_vma+0x62/0xc0
>  show_numa_map+0x12b/0x3b0
>  seq_read_iter+0x297/0x440
>  seq_read+0x11d/0x140
>  vfs_read+0xc2/0x340
>  ksys_read+0x5f/0xe0
>  do_syscall_64+0x68/0x130
>  ? get_page_from_freelist+0x5c2/0x17e0
>  ? mas_store_prealloc+0x17e/0x360
>  ? vma_set_page_prot+0x4c/0xa0
>  ? __alloc_pages_noprof+0x14e/0x2d0
>  ? __mod_memcg_lruvec_state+0x8d/0x140
>  ? __lruvec_stat_mod_folio+0x76/0xb0
>  ? __folio_mod_stat+0x26/0x80
>  ? do_anonymous_page+0x705/0x900
>  ? __handle_mm_fault+0xa8d/0x1000
>  ? __count_memcg_events+0x53/0xf0
>  ? handle_mm_fault+0xa5/0x360
>  ? do_user_addr_fault+0x342/0x640
>  ? arch_exit_to_user_mode_prepare.constprop.0+0x16/0xa0
>  ? irqentry_exit_to_user_mode+0x24/0x100
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> RIP: 0033:0x7fe88464f47e
> Code: c0 e9 b6 fe ff ff 50 48 8d 3d be 07 0b 00 e8 69 01 02 00 66 0f 1f
> 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00
>    f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
> RSP: 002b:00007ffe6cd9a9b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fe88464f47e
> RDX: 0000000000020000 RSI: 00007fe884543000 RDI: 0000000000000003
> RBP: 00007fe884543000 R08: 00007fe884542010 R09: 0000000000000000
> R10: fffffffffffffbc5 R11: 0000000000000246 R12: 0000000000000000
> R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
>  </TASK>
>
> Fix this by validating the PUD entry in walk_pmd_range() using a stable
> snapshot (pudp_get()). If the PUD is not present or is a leaf, retry the
> walk via ACTION_AGAIN instead of descending further. This mirrors the
> retry logic in walk_pmd_range().

I think it mirrors the retry logic in walk_pte_range() more closely, right?
Because there it's:

	if (!pte)
		walk->action = ACTION_AGAIN;
	return err;

I.e., let the parent handle the case where the PTE table was not obtained
by pte_offset_map_lock(); you draw a comparison to this in the comment in
walk_pmd_range().

>
> Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages")

Yikes, really? :) This is from 2017; I'm a little surprised we didn't hit
this bug until now.

Has something changed more recently that made it more likely to hit? Or is
it one of those 'needed people to have more RAM first' things, or bigger
PCI BARs?

> Cc: stable@vger.kernel.org
> Co-developed-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: Max Boone <mboone@akamai.com>

Only nits here, the logic LGTM, so:

Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

> ---
>  mm/pagewalk.c | 20 +++++++++++++++++---
>  1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index a94c401ab..c74b4d800 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -97,6 +97,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  			  struct mm_walk *walk)
>  {
> +	pud_t pudval = pudp_get(pud);
>  	pmd_t *pmd;
>  	unsigned long next;
>  	const struct mm_walk_ops *ops = walk->ops;
> @@ -105,6 +106,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>  	int err = 0;
>  	int depth = real_depth(3);
>
> +	/*
> +	 * For PTE handling, pte_offset_map_lock() takes care of checking
> +	 * whether there actually is a page table. But it also has to be
> +	 * very careful about concurrent page table reclaim. If we spot a PMD
> +	 * table, it cannot go away, so we can just walk it. However, if we find
> +	 * something else, we have to retry.

Nitty, but I think we can be clearer here, something like:

	/*
	 * For PTE handling, pte_offset_map_lock() takes care of checking
	 * whether there actually is a page table. But it also has to be
	 * very careful about concurrent page table reclaim.
	 *
	 * Similarly, we have to be careful here - a PUD entry that points
	 * to a PMD table cannot go away, so we can just walk it. But if
	 * it's something else, we need to ensure we didn't race with
	 * something, so we need to retry.
	 *
	 * A pertinent example of this is a PUD refault after a PUD split -
	 * we will need to split again or risk accessing invalid memory.
	 */

> +	 */
> +	if (!pud_present(pudval) || pud_leaf(pudval)) {
> +		walk->action = ACTION_AGAIN;
> +		return 0;
> +	}
> +
>  	pmd = pmd_offset(pud, addr);
>  	do {
>  again:
> @@ -218,12 +231,13 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>  		else if (pud_leaf(*pud) || !pud_present(*pud))
>  			continue; /* Nothing to do. */
>
> -		if (pud_none(*pud))
> -			goto again;
> -
>  		err = walk_pmd_range(pud, addr, next, walk);
>  		if (err)
>  			break;
> +
> +		if (walk->action == ACTION_AGAIN)
> +			goto again;
> +

NIT: trailing newline.

>  	} while (pud++, addr = next, addr != end);
>
>  	return err;
>
> ---
> base-commit: b4f0dd314b39ea154f62f3bd3115ed0470f9f71e
> change-id: 20260317-pagewalk-check-pmd-refault-de8f14fbe6a5
>
> Best regards,
> --
> Max Boone <mboone@akamai.com>
>
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/pagewalk: fix race between concurrent split and refault
  2026-03-18 12:55 ` Lorenzo Stoakes (Oracle)
@ 2026-03-18 13:08   ` Boone, Max
  2026-03-18 13:27     ` Boone, Max
  2026-03-18 14:10     ` Lorenzo Stoakes (Oracle)
  0 siblings, 2 replies; 11+ messages in thread
From: Boone, Max @ 2026-03-18 13:08 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: Andrew Morton, David Hildenbrand, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, stable@vger.kernel.org



> On Mar 18, 2026, at 1:55 PM, Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> 
>> […]
> 
> So IOW, the PUD entry is split, then refaulted back to a PUD leaf entry
> again?

As far as I understand, yes, although the usage and faulting of huge
pfnmaps does not feel intuitive to me yet. Empirically, yes: I'm observing
this when follow_fault_pfn() in drivers/vfio/vfio_iommu_type1.c runs
concurrently with walk_pud_range(). I have another patch sent to that
list, because this fix causes follow_fault_pfn() to return -EINVAL [1].

>> […] 
> 
> I think it mirrors the retry logic in walk_pte_range() more closely right?
> Because there it's:
> 
> 	if (!pte)
> 		walk->action = ACTION_AGAIN;
> 	return err;
> 
> I.e. let the parent handle the PTE not being got by pte_offset_map_lock(),
> and you draw a comparison to this in the comment in walk_pmd_range().

I'd personally say that the main logic introduced is walk_pud_range()
retrying when walk_pmd_range() fails; we're also splitting the PUD in
walk_pud_range() and then descending. But yes, the retry logic mirrors
walk_pmd_range(), while deciding that we need to retry mirrors
walk_pte_range().

> 
>> 
>> Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages")
> 
> Yikes, really? :) This is from 2017, I'm a little surprised we didn't hit
> this bug until now.
> 
> Has something changed more recently that made it more likely to hit? Or is
> it one of those 'needed people to have more RAM first' or bigger PCI BAR's?

Yeah, frankly, this is the earliest patch I could find that introduces
the splitting. It might be more correct to refer to the introduction of
1G huge pfnmaps?

> 
>> Cc: stable@vger.kernel.org
>> Co-developed-by: David Hildenbrand (Arm) <david@kernel.org>
>> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
>> Signed-off-by: Max Boone <mboone@akamai.com>
> 
> Only nits here, the logic LGTM, so:

I’ll write up a PATCH v2 later today.

> 
> […]




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/pagewalk: fix race between concurrent split and refault
  2026-03-18 13:08   ` Boone, Max
@ 2026-03-18 13:27     ` Boone, Max
  2026-03-18 14:07       ` Lorenzo Stoakes (Oracle)
  2026-03-18 14:10     ` Lorenzo Stoakes (Oracle)
  1 sibling, 1 reply; 11+ messages in thread
From: Boone, Max @ 2026-03-18 13:27 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: Andrew Morton, David Hildenbrand, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, stable@vger.kernel.org




> On Mar 18, 2026, at 2:08 PM, Max Boone <mboone@akamai.com> wrote:
>> 
>> Yikes, really? :) This is from 2017, I'm a little surprised we didn't hit
>> this bug until now.
>> 
>> Has something changed more recently that made it more likely to hit? Or is
>> it one of those 'needed people to have more RAM first' or bigger PCI BAR's?

Forgot to mention, but yeah, we're seeing this on Blackwell cards, which
have very large BARs, so we're probably hitting it first because of that.
The window was already pretty small, though; it's not a very common thing
to poll numa_maps or smaps while the firmware of a VM is remapping the
BARs of a GPU. With regard to that specific case, there's a Proxmox forum
thread and a mail, presumably from the same person [1, 2], that mention
the same bug.

[1] https://forum.proxmox.com/threads/walk_pgd_range-crash-pve9-1-on-6-18.179895/
[2] https://lore.kernel.org/all/5948f3a6-8f30-4c45-9b86-2af9a6b37405@kernel.org/


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/pagewalk: fix race between concurrent split and refault
  2026-03-18 13:27     ` Boone, Max
@ 2026-03-18 14:07       ` Lorenzo Stoakes (Oracle)
  0 siblings, 0 replies; 11+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-18 14:07 UTC (permalink / raw)
  To: Boone, Max
  Cc: Andrew Morton, David Hildenbrand, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, stable@vger.kernel.org

On Wed, Mar 18, 2026 at 01:27:33PM +0000, Boone, Max wrote:
>
>
> > On Mar 18, 2026, at 2:08 PM, Max Boone <mboone@akamai.com> wrote:
> >>
> >> Yikes, really? :) This is from 2017, I'm a little surprised we didn't hit
> >> this bug until now.
> >>
> >> Has something changed more recently that made it more likely to hit? Or is
> >> it one of those 'needed people to have more RAM first' or bigger PCI BAR's?
>
> Forgot to mention, but yeah, we’re seeing this on Blackwell cards, which have very
> large BARs, so we’re probably seeing it first because of that. But the window was
> already pretty small; it’s not a very common thing to poll numa_maps or smaps while
> the firmware of a VM is remapping the BARs of a GPU. With regard to that specific
> case, there’s a Proxmox thread and a mail, presumably from the same person [1, 2],
> that mention the same bug.

No question we should take this fix; the page walk code is the right place to
check for this, as it's not safe to assume the PUD entry can't change.

>
> [1] https://forum.proxmox.com/threads/walk_pgd_range-crash-pve9-1-on-6-18.179895/
> [2] https://lore.kernel.org/all/5948f3a6-8f30-4c45-9b86-2af9a6b37405@kernel.org/

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/pagewalk: fix race between concurrent split and refault
  2026-03-18 13:08   ` Boone, Max
  2026-03-18 13:27     ` Boone, Max
@ 2026-03-18 14:10     ` Lorenzo Stoakes (Oracle)
  2026-03-18 14:30       ` David Hildenbrand (Arm)
  1 sibling, 1 reply; 11+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-18 14:10 UTC (permalink / raw)
  To: Boone, Max
  Cc: Andrew Morton, David Hildenbrand, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, stable@vger.kernel.org

On Wed, Mar 18, 2026 at 01:08:33PM +0000, Boone, Max wrote:
>
> > On Mar 18, 2026, at 1:55 PM, Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> >> […]
> >
> > So IOW, the PUD entry is split, then refaulted back to a PUD leaf entry
> > again?
>
> As far as I understand, yes, although the usage and faulting of huge
> pfnmaps doesn’t feel intuitive to me yet. Empirically, yes: we’re observing
> this when follow_fault_pfn() in drivers/vfio/vfio_iommu_type1.c is running
> concurrently with walk_pud_range(). I have another patch sent up to that
> list, because this fix causes follow_fault_pfn() to return -EINVAL [1].

Ack

>
> >> […]
> >
> > I think it mirrors the retry logic in walk_pte_range() more closely right?
> > Because there it's:
> >
> > if (!pte)
> > 	walk->action = ACTION_AGAIN;
> > return err;
> >
> > I.e. let the parent handle the PTE not being got by pte_offset_map_lock(),
> > and you draw a comparison to this in the comment in walk_pmd_range().
>
> I’d personally say that the main logic introduced is walk_pud_range() retrying when
> walk_pmd_range() fails. We’re also splitting the PUD in walk_pud_range() and
> descending. But yeah, the retry logic mirrors walk_pmd_range(), while deciding that
> we need to retry mirrors walk_pte_range().

It's not a big deal; we can leave that as is.

>
> >
> >>
> >> Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages")
> >
> > Yikes, really? :) This is from 2017, I'm a little surprised we didn't hit
> > this bug until now.
> >
> > Has something changed more recently that made it more likely to hit? Or is
> > it one of those 'needed people to have more RAM first' or bigger PCI BAR's?
>
> Yeah, frankly, this is the first patch where I could find the splitting being introduced. It might
> be more correct to refer to the introduction of 1G huge_pfnmaps?

Yeah maybe that makes more sense? David - what do you think?

>
> >
> >> Cc: stable@vger.kernel.org
> >> Co-developed-by: David Hildenbrand (Arm) <david@kernel.org>
> >> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
> >> Signed-off-by: Max Boone <mboone@akamai.com>
> >
> > Only nits here, the logic LGTM, so:
>
> I’ll write up a PATCH v2 later today.

Cheers!

>
> >
> > […]
>
>

Thanks, Lorenzo

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/pagewalk: fix race between concurrent split and refault
  2026-03-18 14:10     ` Lorenzo Stoakes (Oracle)
@ 2026-03-18 14:30       ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-18 14:30 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle), Boone, Max
  Cc: Andrew Morton, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	stable@vger.kernel.org

>>>
>>> Yikes, really? :) This is from 2017, I'm a little surprised we didn't hit
>>> this bug until now.
>>>
>>> Has something changed more recently that made it more likely to hit? Or is
>>> it one of those 'needed people to have more RAM first' or bigger PCI BAR's?
>>
>> Yeah, frankly, this is the first patch where I could find the splitting being introduced. It might
>> be more correct to refer to the introduction of 1G huge_pfnmaps?
> 
> Yeah maybe that makes more sense? David - what do you think?

I'm not sure whether DAX with PUDs could trigger something similar?

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2026-03-18 14:30 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-17 14:03 [PATCH] mm/pagewalk: fix race between concurrent split and refault Max Boone via B4 Relay
2026-03-17 14:05 ` David Hildenbrand (Arm)
2026-03-18  6:16 ` Qi Zheng
2026-03-18  7:37   ` Boone, Max
2026-03-18  7:38   ` David Hildenbrand (Arm)
2026-03-18 12:55 ` Lorenzo Stoakes (Oracle)
2026-03-18 13:08   ` Boone, Max
2026-03-18 13:27     ` Boone, Max
2026-03-18 14:07       ` Lorenzo Stoakes (Oracle)
2026-03-18 14:10     ` Lorenzo Stoakes (Oracle)
2026-03-18 14:30       ` David Hildenbrand (Arm)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox