* [PATCH] x86/mm: Handle alloc failure in phys_*_init()
@ 2025-06-04 18:59 Em Sharnoff
  2025-06-05  6:36 ` Ingo Molnar
  0 siblings, 1 reply; 5+ messages in thread
From: Em Sharnoff @ 2025-06-04 18:59 UTC (permalink / raw)
  To: linux-kernel, x86, linux-mm
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Edgecombe, Rick P,
	Oleg Vasilev, Arthur Petukhovsky, Stefan Radig, Misha Sakhnov

tl;dr:

* When setting up page table mappings for physical addresses after boot,
  alloc_low_page() uses GFP_ATOMIC, which is allowed to fail.
* This isn't currently handled, and results in a null pointer
  dereference when it occurs.
* This allocation failure can happen during memory hotplug.

To handle failure, change phys_pud_init() and similar functions to
return zero if allocation failed (either directly or transitively), and
convert that to -ENOMEM in arch_add_memory().

=== Background ===

We recently started observing these null pointer dereferences happening
in practice (albeit quite rarely), triggered by allocation failures
during virtio-mem hotplug.

We use virtio-mem quite heavily - adding/removing memory based on
resource usage of customer workloads across a fleet of VMs - so it's
somewhat expected that we have occasional allocation failures here, if
we run out of memory before hotplug takes place.

We started seeing this bug after upgrading from 6.6.64 to 6.12.26, but
there didn't appear to be relevant changes in the codepaths involved, so
we figured the upgrade was triggering a latent issue.

The possibility for this issue was also pointed out a while back:

> For alloc_low_pages(), I noticed the callers don’t check for allocation
> failure. I'm a little surprised that there haven't been reports of the
> allocation failing, because these operations could result in a lot more
> pages getting allocated way past boot, and failure causes a NULL
> pointer dereference.

https://lore.kernel.org/all/5aee7bcdf49b1c6b8ee902dd2abd9220169c694b.camel@intel.com/

For completeness, here's an example stack trace we saw (on 6.12.26):

  BUG: kernel NULL pointer dereference, address: 0000000000000000
  ....
  Call Trace:
   <TASK>
   phys_pud_init+0xa0/0x390
   phys_p4d_init+0x93/0x330
   __kernel_physical_mapping_init+0xa1/0x370
   kernel_physical_mapping_init+0xf/0x20
   init_memory_mapping+0x1fa/0x430
   arch_add_memory+0x2b/0x50
   add_memory_resource+0xe6/0x260
   add_memory_driver_managed+0x78/0xc0
   virtio_mem_add_memory+0x46/0xc0
   virtio_mem_sbm_plug_and_add_mb+0xa3/0x160
   virtio_mem_run_wq+0x1035/0x16c0
   process_one_work+0x17a/0x3c0
   worker_thread+0x2c5/0x3f0
   ? _raw_spin_unlock_irqrestore+0x9/0x30
   ? __pfx_worker_thread+0x10/0x10
   kthread+0xdc/0x110
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x35/0x60
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>

and the allocation failure preceding it:

  kworker/0:2: page allocation failure: order:0, mode:0x920(GFP_ATOMIC|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
  ...
  Call Trace:
   <TASK>
   dump_stack_lvl+0x5b/0x70
   dump_stack+0x10/0x20
   warn_alloc+0x103/0x180
   __alloc_pages_slowpath.constprop.0+0x738/0xf30
   __alloc_pages_noprof+0x1e9/0x340
   alloc_pages_mpol_noprof+0x47/0x100
   alloc_pages_noprof+0x4b/0x80
   get_free_pages_noprof+0xc/0x40
   alloc_low_pages+0xc2/0x150
   phys_pud_init+0x82/0x390
  ...

(everything from phys_pud_init and below was the same)

There's some additional context in a github issue we opened on our side:
https://github.com/neondatabase/autoscaling/issues/1391

=== Reproducing / Testing ===

I was able to partially reproduce the original issue we saw by
modifying phys_pud_init() to simulate alloc_low_page() returning null
after boot, and then doing memory hotplug to trigger the "failure".
Something roughly like:

  - pmd = alloc_low_page();
  + if (!after_bootmem)
  + 	pmd = alloc_low_page();
  + else
  + 	pmd = NULL;

To test recovery, I also tried simulating just one alloc_low_page()
failure after boot. This change seemed to handle it at a basic level
(virtio-mem hotplug succeeded with the right amount of memory, after
retrying), but I didn't dig further.
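
The one-shot simulation looked roughly like this (again just a sketch;
the static flag is a throwaway test hack, not something to merge):

  - pmd = alloc_low_page();
  + static bool failed_once;
  + if (after_bootmem && !failed_once) {
  + 	failed_once = true;
  + 	pmd = NULL;
  + } else {
  + 	pmd = alloc_low_page();
  + }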

We also plan to test this in our production environment (where we should
see the difference after a few days); as of 2025-06-04, we haven't yet
rolled that out.

=== Rationale ===

Note: This is the first time I'm looking at this code; please review
extra critically.

As far as I can tell:

1. phys_*_init() should not currently return zero.
2. If phys_*_init() gives up partway through, subsequent retries will be
   able to continue from the progress so far, because the existing
   !p4d_none()/!pud_none() paths reuse already-populated entries instead
   of reallocating them (see the excerpt below).

So it seems ok to give a zero return special meaning, and this looks
like something the existing code can handle gracefully.
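
For reference, the reuse path in phys_pud_init() (existing code,
abridged from memory of 6.12 - the exact predicate names may differ
across versions):

  	if (!pud_none(*pud)) {
  		if (!pud_leaf(*pud)) {
  			/* Entry already present: descend and fill in any
  			 * holes instead of allocating a new PMD page. */
  			pmd = pmd_offset(pud, 0);
  			paddr_last = phys_pmd_init(pmd, paddr, paddr_end,
  						   page_size_mask, prot, init);
  			continue;
  		}
  		...
  	}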

Signed-off-by: Em Sharnoff <sharnoff@neon.tech>
---
 arch/x86/mm/init.c    |  6 +++++-
 arch/x86/mm/init_64.c | 50 +++++++++++++++++++++++++++++++++++++++----
 2 files changed, 51 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index bfa444a7dbb0..b90fe52a7d67 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -533,6 +533,7 @@ bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
  * Setup the direct mapping of the physical memory at PAGE_OFFSET.
  * This runs before bootmem is initialized and gets pages directly from
  * the physical memory. To access them they are temporarily mapped.
+ * Returns zero if allocation fails at any point.
  */
 unsigned long __ref init_memory_mapping(unsigned long start,
 					unsigned long end, pgprot_t prot)
@@ -547,10 +548,13 @@ unsigned long __ref init_memory_mapping(unsigned long start,
 	memset(mr, 0, sizeof(mr));
 	nr_range = split_mem_range(mr, 0, start, end);
 
-	for (i = 0; i < nr_range; i++)
+	for (i = 0; i < nr_range; i++) {
 		ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
 						   mr[i].page_size_mask,
 						   prot);
+		if (!ret)
+			return 0;
+	}
 
 	add_pfn_range_mapped(start >> PAGE_SHIFT, ret >> PAGE_SHIFT);
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 7c4f6f591f2b..1b0140b49371 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -502,7 +502,7 @@ phys_pte_init(pte_t *pte_page, unsigned long paddr, unsigned long paddr_end,
 /*
  * Create PMD level page table mapping for physical addresses. The virtual
  * and physical address have to be aligned at this level.
- * It returns the last physical address mapped.
+ * It returns the last physical address mapped, or zero if allocation failed.
  */
 static unsigned long __meminit
 phys_pmd_init(pmd_t *pmd_page, unsigned long paddr, unsigned long paddr_end,
@@ -572,7 +572,14 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long paddr, unsigned long paddr_end,
 		}
 
 		pte = alloc_low_page();
+		if (!pte)
+			return 0;
 		paddr_last = phys_pte_init(pte, paddr, paddr_end, new_prot, init);
+		/*
+		 * phys_{pmd,pud,p4d}_init() return zero if allocation failed.
+		 * phys_pte_init() makes no allocations, so it should not return zero.
+		 */
+		BUG_ON(!paddr_last);
 
 		spin_lock(&init_mm.page_table_lock);
 		pmd_populate_kernel_init(&init_mm, pmd, pte, init);
@@ -586,7 +593,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long paddr, unsigned long paddr_end,
  * Create PUD level page table mapping for physical addresses. The virtual
  * and physical address do not have to be aligned at this level. KASLR can
  * randomize virtual addresses up to this level.
- * It returns the last physical address mapped.
+ * It returns the last physical address mapped, or zero if allocation failed.
  */
 static unsigned long __meminit
 phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
@@ -623,6 +630,8 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
 							   paddr_end,
 							   page_size_mask,
 							   prot, init);
+				if (!paddr_last)
+					return 0;
 				continue;
 			}
 			/*
@@ -658,12 +667,22 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
 		}
 
 		pmd = alloc_low_page();
+		if (!pmd)
+			return 0;
 		paddr_last = phys_pmd_init(pmd, paddr, paddr_end,
 					   page_size_mask, prot, init);
 
+		/*
+		 * We might have !paddr_last if allocation failed, but we should still
+		 * update pud before bailing, so that subsequent retries can pick up on
+		 * progress (here and in phys_pmd_init) without leaking pmd.
+		 */
 		spin_lock(&init_mm.page_table_lock);
 		pud_populate_init(&init_mm, pud, pmd, init);
 		spin_unlock(&init_mm.page_table_lock);
+
+		if (!paddr_last)
+			return 0;
 	}
 
 	update_page_count(PG_LEVEL_1G, pages);
@@ -707,16 +726,26 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
 			pud = pud_offset(p4d, 0);
 			paddr_last = phys_pud_init(pud, paddr, __pa(vaddr_end),
 					page_size_mask, prot, init);
+			if (!paddr_last)
+				return 0;
 			continue;
 		}
 
 		pud = alloc_low_page();
+		if (!pud)
+			return 0;
 		paddr_last = phys_pud_init(pud, paddr, __pa(vaddr_end),
 					   page_size_mask, prot, init);
 
 		spin_lock(&init_mm.page_table_lock);
 		p4d_populate_init(&init_mm, p4d, pud, init);
 		spin_unlock(&init_mm.page_table_lock);
+
+		/*
+		 * Bail only after updating p4d to keep progress from pud across retries.
+		 */
+		if (!paddr_last)
+			return 0;
 	}
 
 	return paddr_last;
@@ -748,10 +777,14 @@ __kernel_physical_mapping_init(unsigned long paddr_start,
 						   __pa(vaddr_end),
 						   page_size_mask,
 						   prot, init);
+			if (!paddr_last)
+				return 0;
 			continue;
 		}
 
 		p4d = alloc_low_page();
+		if (!p4d)
+			return 0;
 		paddr_last = phys_p4d_init(p4d, __pa(vaddr), __pa(vaddr_end),
 					   page_size_mask, prot, init);
 
@@ -763,6 +796,13 @@ __kernel_physical_mapping_init(unsigned long paddr_start,
 					  (pud_t *) p4d, init);
 
 		spin_unlock(&init_mm.page_table_lock);
+
+		/*
+		 * Bail only after updating pgd/p4d to keep progress from p4d across retries.
+		 */
+		if (!paddr_last)
+			return 0;
+
 		pgd_changed = true;
 	}
 
@@ -777,7 +817,8 @@ __kernel_physical_mapping_init(unsigned long paddr_start,
  * Create page table mapping for the physical memory for specific physical
  * addresses. Note that it can only be used to populate non-present entries.
  * The virtual and physical addresses have to be aligned on PMD level
- * down. It returns the last physical address mapped.
+ * down. It returns the last physical address mapped, or zero if allocation
+ * failed at any point.
  */
 unsigned long __meminit
 kernel_physical_mapping_init(unsigned long paddr_start,
@@ -981,7 +1022,8 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
-	init_memory_mapping(start, start + size, params->pgprot);
+	if (!init_memory_mapping(start, start + size, params->pgprot))
+		return -ENOMEM;
 
 	return add_pages(nid, start_pfn, nr_pages, params);
 }
-- 
2.39.5


* Re: [PATCH] x86/mm: Handle alloc failure in phys_*_init()
  2025-06-04 18:59 [PATCH] x86/mm: Handle alloc failure in phys_*_init() Em Sharnoff
@ 2025-06-05  6:36 ` Ingo Molnar
  2025-06-05  6:41   ` H. Peter Anvin
                     ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Ingo Molnar @ 2025-06-05  6:36 UTC (permalink / raw)
  To: Em Sharnoff
  Cc: linux-kernel, x86, linux-mm, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Edgecombe, Rick P, Oleg Vasilev,
	Arthur Petukhovsky, Stefan Radig, Misha Sakhnov


* Em Sharnoff <sharnoff@neon.tech> wrote:

> tl;dr:
> 
> * When setting up page table mappings for physical addresses after boot,
>   alloc_low_page() uses GFP_ATOMIC, which is allowed to fail.
> * This isn't currently handled, and results in a null pointer
>   dereference when it occurs.
> * This allocation failure can happen during memory hotplug.
> 
> To handle failure, change phys_pud_init() and similar functions to
> return zero if allocation failed (either directly or transitively), and
> convert that to -ENOMEM in arch_add_memory().

> +		/*
> +		 * Bail only after updating pgd/p4d to keep progress from p4d across retries.
> +		 */
> +		if (!paddr_last)
> +			return 0;
> +
>  		pgd_changed = true;

> -	init_memory_mapping(start, start + size, params->pgprot);
> +	if (!init_memory_mapping(start, start + size, params->pgprot))
> +		return -ENOMEM;

I agree that it makes total sense to fix all this (especially since you 
are actively triggering it), but have you tried also changing it away 
from GFP_ATOMIC? There's no real reason why it should be GFP_ATOMIC 
AFAICS, other than some historic inertia that nobody bothered to fix.
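
Something like this, I mean (untested; assuming the after_bootmem
branch of alloc_low_pages() is the only post-boot allocation site):

  -	return (void *)__get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
  +	return (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);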

Plus, could you please change the return flow from this zero 
special-case over to something like ERR_PTR(-ENOMEM) and IS_ERR()?

*Technically* zero is a valid physical address, although we 
intentionally never use it in the kernel AFAIK and wouldn't ever put a 
page table there either. ERR_PTR()/IS_ERR() is much easier on the eyes 
than the zero special-case.
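
E.g. something like this (untested sketch - since these functions
return an unsigned long physical address rather than a pointer,
IS_ERR_VALUE() would be the matching check on the consumer side):

  	pmd = alloc_low_page();
  	if (!pmd)
  		return -ENOMEM;

  	paddr_last = phys_pmd_init(pmd, paddr, paddr_end,
  				   page_size_mask, prot, init);
  	if (IS_ERR_VALUE(paddr_last))
  		return paddr_last;

and at the top level:

  	ret = init_memory_mapping(start, start + size, params->pgprot);
  	if (IS_ERR_VALUE(ret))
  		return (int)ret;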

Finally, could you make this a 2-patch fix series: first one to fix the 
error return path to not crash, and the second one to change it away 
from GFP_ATOMIC?

Thanks,

	Ingo


* Re: [PATCH] x86/mm: Handle alloc failure in phys_*_init()
  2025-06-05  6:36 ` Ingo Molnar
@ 2025-06-05  6:41   ` H. Peter Anvin
  2025-06-05  6:47   ` Ingo Molnar
  2025-06-09 10:36   ` Em Sharnoff
  2 siblings, 0 replies; 5+ messages in thread
From: H. Peter Anvin @ 2025-06-05  6:41 UTC (permalink / raw)
  To: Ingo Molnar, Em Sharnoff
  Cc: linux-kernel, x86, linux-mm, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Edgecombe, Rick P, Oleg Vasilev, Arthur Petukhovsky, Stefan Radig,
	Misha Sakhnov

On June 4, 2025 11:36:45 PM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>
>* Em Sharnoff <sharnoff@neon.tech> wrote:
>
>> tl;dr:
>> 
>> * When setting up page table mappings for physical addresses after boot,
>>   alloc_low_page() uses GFP_ATOMIC, which is allowed to fail.
>> * This isn't currently handled, and results in a null pointer
>>   dereference when it occurs.
>> * This allocation failure can happen during memory hotplug.
>> 
>> To handle failure, change phys_pud_init() and similar functions to
>> return zero if allocation failed (either directly or transitively), and
>> convert that to -ENOMEM in arch_add_memory().
>
>> +		/*
>> +		 * Bail only after updating pgd/p4d to keep progress from p4d across retries.
>> +		 */
>> +		if (!paddr_last)
>> +			return 0;
>> +
>>  		pgd_changed = true;
>
>> -	init_memory_mapping(start, start + size, params->pgprot);
>> +	if (!init_memory_mapping(start, start + size, params->pgprot))
>> +		return -ENOMEM;
>
>I agree that it makes total sense to fix all this (especially since you 
>are actively triggering it), but have you tried also changing it away 
>from GFP_ATOMIC? There's no real reason why it should be GFP_ATOMIC 
>AFAICS, other than some historic inertia that nobody bothered to fix.
>
>Plus, could you please change the return flow from this zero 
>special-case over to something like ERR_PTR(-ENOMEM) and IS_ERR()?
>
>*Technically* zero is a valid physical address, although we 
>intentionally never use it in the kernel AFAIK and wouldn't ever put a 
>page table there either. ERR_PTR()/IS_ERR() is much easier on the eyes 
>than the zero special-case.
>
>Finally, could you make this a 2-patch fix series: first one to fix the 
>error return path to not crash, and the second one to change it away 
>from GFP_ATOMIC?
>
>Thanks,
>
>	Ingo

Specifically, zero is a valid *user space* address.


* Re: [PATCH] x86/mm: Handle alloc failure in phys_*_init()
  2025-06-05  6:36 ` Ingo Molnar
  2025-06-05  6:41   ` H. Peter Anvin
@ 2025-06-05  6:47   ` Ingo Molnar
  2025-06-09 10:36   ` Em Sharnoff
  2 siblings, 0 replies; 5+ messages in thread
From: Ingo Molnar @ 2025-06-05  6:47 UTC (permalink / raw)
  To: Em Sharnoff
  Cc: linux-kernel, x86, linux-mm, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Edgecombe, Rick P, Oleg Vasilev,
	Arthur Petukhovsky, Stefan Radig, Misha Sakhnov


* Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Em Sharnoff <sharnoff@neon.tech> wrote:
> 
> > tl;dr:
> > 
> > * When setting up page table mappings for physical addresses after boot,
> >   alloc_low_page() uses GFP_ATOMIC, which is allowed to fail.
> > * This isn't currently handled, and results in a null pointer
> >   dereference when it occurs.
> > * This allocation failure can happen during memory hotplug.
> > 
> > To handle failure, change phys_pud_init() and similar functions to
> > return zero if allocation failed (either directly or transitively), and
> > convert that to -ENOMEM in arch_add_memory().
> 
> > +		/*
> > +		 * Bail only after updating pgd/p4d to keep progress from p4d across retries.
> > +		 */
> > +		if (!paddr_last)
> > +			return 0;
> > +
> >  		pgd_changed = true;
> 
> > -	init_memory_mapping(start, start + size, params->pgprot);
> > +	if (!init_memory_mapping(start, start + size, params->pgprot))
> > +		return -ENOMEM;
> 
> I agree that it makes total sense to fix all this (especially since you 
> are actively triggering it), but have you tried also changing it away 
> from GFP_ATOMIC? There's no real reason why it should be GFP_ATOMIC 
> AFAICS, other than some historic inertia that nobody bothered to fix.
> 
> Plus, could you please change the return flow from this zero 
> special-case over to something like ERR_PTR(-ENOMEM) and IS_ERR()?
> 
> *Technically* zero is a valid physical address, although we 
> intentionally never use it in the kernel AFAIK and wouldn't ever put a 
> page table there either. ERR_PTR()/IS_ERR() is much easier on the eyes 
> than the zero special-case.

Small correction: since this is a PTE pointer, this is a kernel virtual 
address, which can never be zero - only physical addresses can be zero 
(but even they aren't in practice).

My main point wrt. using ERR_PTR()/IS_ERR() instead of 0 remains.

Thanks,

	Ingo


* Re: [PATCH] x86/mm: Handle alloc failure in phys_*_init()
  2025-06-05  6:36 ` Ingo Molnar
  2025-06-05  6:41   ` H. Peter Anvin
  2025-06-05  6:47   ` Ingo Molnar
@ 2025-06-09 10:36   ` Em Sharnoff
  2 siblings, 0 replies; 5+ messages in thread
From: Em Sharnoff @ 2025-06-09 10:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, linux-mm, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Edgecombe, Rick P, Oleg Vasilev,
	Arthur Petukhovsky, Stefan Radig, Misha Sakhnov

On 2025-06-05 07:36, Ingo Molnar wrote:
 
> I agree that it makes total sense to fix all this (especially since you 
> are actively triggering it), but have you tried also changing it away 
> from GFP_ATOMIC? There's no real reason why it should be GFP_ATOMIC 
> AFAICS, other than some historic inertia that nobody bothered to fix.

Fair enough, yeah. We hadn't tried that, no.

> Finally, could you make this a 2-patch fix series: first one to fix the 
> error return path to not crash, and the second one to change it away 
> from GFP_ATOMIC?

Sounds good -- thanks for the feedback!

Sent a new patch set with those changes. For posterity, v2 is here:
https://lore.kernel.org/all/0ce5e150-19e0-457f-bec3-ee031c0be7e7@neon.tech/

Thanks,
Em

