linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] arm64/boot: Zero-initialize idmap PGDs before use
@ 2025-08-22  4:15 Sam Edwards
  2025-08-23 22:25 ` Ard Biesheuvel
  0 siblings, 1 reply; 12+ messages in thread
From: Sam Edwards @ 2025-08-22  4:15 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Marc Zyngier
  Cc: Andrew Morton, Anshuman Khandual, Ard Biesheuvel, Ryan Roberts,
	Baruch Siach, Kevin Brodsky, Joey Gouly, linux-arm-kernel,
	linux-kernel, Sam Edwards, stable

In early boot, Linux creates identity virtual->physical address mappings
so that it can enable the MMU before full memory management is ready.
To ensure some available physical memory to back these structures,
vmlinux.lds reserves some space (and defines marker symbols) in the
middle of the kernel image. However, because they are defined outside of
PROGBITS sections, they aren't pre-initialized -- at least as far as ELF
is concerned.

In the typical case, this isn't actually a problem: the boot image is
prepared with objcopy, which zero-fills the gaps, so these structures
are incidentally zero-initialized (an all-zeroes entry is considered
absent, so zero-initialization is appropriate).

However, that is just a happy accident: the `vmlinux` ELF output
authoritatively represents the state of memory at entry. If the ELF
says a region of memory isn't initialized, we must treat it as
uninitialized. Indeed, certain bootloaders (e.g. Broadcom CFE) ingest
the ELF directly -- sidestepping the objcopy-produced image entirely --
and therefore do not initialize the gaps. This results in the early boot
code crashing when it attempts to create identity mappings.

Therefore, add boot-time zero-initialization for the following:
- __pi_init_idmap_pg_dir..__pi_init_idmap_pg_end
- idmap_pg_dir
- reserved_pg_dir
- tramp_pg_dir # Already done, but this patch corrects the size

Note, swapper_pg_dir is already initialized (by copy from idmap_pg_dir)
before use, so this patch does not need to address it.

Cc: stable@vger.kernel.org
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
---
 arch/arm64/kernel/head.S | 12 ++++++++++++
 arch/arm64/mm/mmu.c      |  3 ++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index ca04b338cb0d..0c3be11d0006 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -86,6 +86,18 @@ SYM_CODE_START(primary_entry)
 	bl	record_mmu_state
 	bl	preserve_boot_args
 
+	adrp	x0, reserved_pg_dir
+	add	x1, x0, #PAGE_SIZE
+0:	str	xzr, [x0], 8
+	cmp	x0, x1
+	b.lo	0b
+
+	adrp	x0, __pi_init_idmap_pg_dir
+	adrp	x1, __pi_init_idmap_pg_end
+1:	str	xzr, [x0], 8
+	cmp	x0, x1
+	b.lo	1b
+
 	adrp	x1, early_init_stack
 	mov	sp, x1
 	mov	x29, xzr
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 34e5d78af076..aaf823565a65 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -761,7 +761,7 @@ static int __init map_entry_trampoline(void)
 	pgprot_val(prot) &= ~PTE_NG;
 
 	/* Map only the text into the trampoline page table */
-	memset(tramp_pg_dir, 0, PGD_SIZE);
+	memset(tramp_pg_dir, 0, PAGE_SIZE);
 	__create_pgd_mapping(tramp_pg_dir, pa_start, TRAMP_VALIAS,
 			     entry_tramp_text_size(), prot,
 			     pgd_pgtable_alloc_init_mm, NO_BLOCK_MAPPINGS);
@@ -806,6 +806,7 @@ static void __init create_idmap(void)
 	u64 end   = __pa_symbol(__idmap_text_end);
 	u64 ptep  = __pa_symbol(idmap_ptes);
 
+	memset(idmap_pg_dir, 0, PAGE_SIZE);
 	__pi_map_range(&ptep, start, end, start, PAGE_KERNEL_ROX,
 		       IDMAP_ROOT_LEVEL, (pte_t *)idmap_pg_dir, false,
 		       __phys_to_virt(ptep) - ptep);
-- 
2.49.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] arm64/boot: Zero-initialize idmap PGDs before use
  2025-08-22  4:15 [PATCH] arm64/boot: Zero-initialize idmap PGDs before use Sam Edwards
@ 2025-08-23 22:25 ` Ard Biesheuvel
  2025-08-23 23:55   ` Sam Edwards
  0 siblings, 1 reply; 12+ messages in thread
From: Ard Biesheuvel @ 2025-08-23 22:25 UTC (permalink / raw)
  To: Sam Edwards
  Cc: Catalin Marinas, Will Deacon, Marc Zyngier, Andrew Morton,
	Anshuman Khandual, Ryan Roberts, Baruch Siach, Kevin Brodsky,
	Joey Gouly, linux-arm-kernel, linux-kernel, stable

Hi Sam,

On Fri, 22 Aug 2025 at 14:15, Sam Edwards <cfsworks@gmail.com> wrote:
>
> In early boot, Linux creates identity virtual->physical address mappings
> so that it can enable the MMU before full memory management is ready.
> To ensure some available physical memory to back these structures,
> vmlinux.lds reserves some space (and defines marker symbols) in the
> middle of the kernel image. However, because they are defined outside of
> PROGBITS sections, they aren't pre-initialized -- at least as far as ELF
> is concerned.
>
> In the typical case, this isn't actually a problem: the boot image is
> prepared with objcopy, which zero-fills the gaps, so these structures
> are incidentally zero-initialized (an all-zeroes entry is considered
> absent, so zero-initialization is appropriate).
>
> However, that is just a happy accident: the `vmlinux` ELF output
> authoritatively represents the state of memory at entry. If the ELF
> says a region of memory isn't initialized, we must treat it as
> uninitialized. Indeed, certain bootloaders (e.g. Broadcom CFE) ingest
> the ELF directly -- sidestepping the objcopy-produced image entirely --
> and therefore do not initialize the gaps. This results in the early boot
> code crashing when it attempts to create identity mappings.
>
> Therefore, add boot-time zero-initialization for the following:
> - __pi_init_idmap_pg_dir..__pi_init_idmap_pg_end
> - idmap_pg_dir
> - reserved_pg_dir

I don't think this is the right approach.

If the ELF representation is inaccurate, it should be fixed, and this
should be achievable without impacting the binary image at all.

> - tramp_pg_dir # Already done, but this patch corrects the size
>

What is wrong with the size?

> Note, swapper_pg_dir is already initialized (by copy from idmap_pg_dir)
> before use, so this patch does not need to address it.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> ---
>  arch/arm64/kernel/head.S | 12 ++++++++++++
>  arch/arm64/mm/mmu.c      |  3 ++-
>  2 files changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index ca04b338cb0d..0c3be11d0006 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -86,6 +86,18 @@ SYM_CODE_START(primary_entry)
>         bl      record_mmu_state
>         bl      preserve_boot_args
>
> +       adrp    x0, reserved_pg_dir
> +       add     x1, x0, #PAGE_SIZE
> +0:     str     xzr, [x0], 8
> +       cmp     x0, x1
> +       b.lo    0b
> +
> +       adrp    x0, __pi_init_idmap_pg_dir
> +       adrp    x1, __pi_init_idmap_pg_end
> +1:     str     xzr, [x0], 8
> +       cmp     x0, x1
> +       b.lo    1b
> +
>         adrp    x1, early_init_stack
>         mov     sp, x1
>         mov     x29, xzr
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 34e5d78af076..aaf823565a65 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -761,7 +761,7 @@ static int __init map_entry_trampoline(void)
>         pgprot_val(prot) &= ~PTE_NG;
>
>         /* Map only the text into the trampoline page table */
> -       memset(tramp_pg_dir, 0, PGD_SIZE);
> +       memset(tramp_pg_dir, 0, PAGE_SIZE);
>         __create_pgd_mapping(tramp_pg_dir, pa_start, TRAMP_VALIAS,
>                              entry_tramp_text_size(), prot,
>                              pgd_pgtable_alloc_init_mm, NO_BLOCK_MAPPINGS);
> @@ -806,6 +806,7 @@ static void __init create_idmap(void)
>         u64 end   = __pa_symbol(__idmap_text_end);
>         u64 ptep  = __pa_symbol(idmap_ptes);
>
> +       memset(idmap_pg_dir, 0, PAGE_SIZE);
>         __pi_map_range(&ptep, start, end, start, PAGE_KERNEL_ROX,
>                        IDMAP_ROOT_LEVEL, (pte_t *)idmap_pg_dir, false,
>                        __phys_to_virt(ptep) - ptep);
> --
> 2.49.1
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] arm64/boot: Zero-initialize idmap PGDs before use
  2025-08-23 22:25 ` Ard Biesheuvel
@ 2025-08-23 23:55   ` Sam Edwards
  2025-08-24  0:29     ` Ard Biesheuvel
  2025-08-26  9:50     ` Mark Rutland
  0 siblings, 2 replies; 12+ messages in thread
From: Sam Edwards @ 2025-08-23 23:55 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Catalin Marinas, Will Deacon, Marc Zyngier, Andrew Morton,
	Anshuman Khandual, Ryan Roberts, Baruch Siach, Kevin Brodsky,
	Joey Gouly, linux-arm-kernel, linux-kernel, stable

On Sat, Aug 23, 2025 at 3:25 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> Hi Sam,
>
> On Fri, 22 Aug 2025 at 14:15, Sam Edwards <cfsworks@gmail.com> wrote:
> >
> > In early boot, Linux creates identity virtual->physical address mappings
> > so that it can enable the MMU before full memory management is ready.
> > To ensure some available physical memory to back these structures,
> > vmlinux.lds reserves some space (and defines marker symbols) in the
> > middle of the kernel image. However, because they are defined outside of
> > PROGBITS sections, they aren't pre-initialized -- at least as far as ELF
> > is concerned.
> >
> > In the typical case, this isn't actually a problem: the boot image is
> > prepared with objcopy, which zero-fills the gaps, so these structures
> > are incidentally zero-initialized (an all-zeroes entry is considered
> > absent, so zero-initialization is appropriate).
> >
> > However, that is just a happy accident: the `vmlinux` ELF output
> > authoritatively represents the state of memory at entry. If the ELF
> > says a region of memory isn't initialized, we must treat it as
> > uninitialized. Indeed, certain bootloaders (e.g. Broadcom CFE) ingest
> > the ELF directly -- sidestepping the objcopy-produced image entirely --
> > and therefore do not initialize the gaps. This results in the early boot
> > code crashing when it attempts to create identity mappings.
> >
> > Therefore, add boot-time zero-initialization for the following:
> > - __pi_init_idmap_pg_dir..__pi_init_idmap_pg_end
> > - idmap_pg_dir
> > - reserved_pg_dir
>
> I don't think this is the right approach.
>
> If the ELF representation is inaccurate, it should be fixed, and this
> should be achievable without impacting the binary image at all.

Hi Ard,

I don't believe I can declare the ELF output "inaccurate" per se,
since it's the linker's final determination about the state of memory
at kernel entry -- including which regions are not the loader's
responsibility to initialize (and should therefore be initialized at
runtime, e.g. .bss). But, I think I understand your meaning: you would
prefer consistent load-time zero-initialization over run-time. I'm
open to that approach if that's the consensus here, but it will make
`vmlinux` dozens of KBs larger (even though it keeps `Image` the same
size).

>
> > - tramp_pg_dir # Already done, but this patch corrects the size
> >
>
> What is wrong with the size?

On higher-VABIT targets, that memset is overflowing by writing
PGD_SIZE bytes despite tramp_pg_dir being only PAGE_SIZE bytes in
size. My understanding is that only userspace (TTBR0) PGDs are
PGD_SIZE and kernelspace (TTBR1) PGDs like the trampoline mapping are
always PAGE_SIZE. Please correct me if I'm wrong; I might be misled by
how vmlinux.lds.S is making space for those PGDs. :)

(If you'd like, I can break that one-line change out as a separate
patch to apply immediately? It seems like a more critical concern than
everything else here.)

Best,
Sam

>
> > Note, swapper_pg_dir is already initialized (by copy from idmap_pg_dir)
> > before use, so this patch does not need to address it.
> >
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Sam Edwards <CFSworks@gmail.com>
> > ---
> >  arch/arm64/kernel/head.S | 12 ++++++++++++
> >  arch/arm64/mm/mmu.c      |  3 ++-
> >  2 files changed, 14 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> > index ca04b338cb0d..0c3be11d0006 100644
> > --- a/arch/arm64/kernel/head.S
> > +++ b/arch/arm64/kernel/head.S
> > @@ -86,6 +86,18 @@ SYM_CODE_START(primary_entry)
> >         bl      record_mmu_state
> >         bl      preserve_boot_args
> >
> > +       adrp    x0, reserved_pg_dir
> > +       add     x1, x0, #PAGE_SIZE
> > +0:     str     xzr, [x0], 8
> > +       cmp     x0, x1
> > +       b.lo    0b
> > +
> > +       adrp    x0, __pi_init_idmap_pg_dir
> > +       adrp    x1, __pi_init_idmap_pg_end
> > +1:     str     xzr, [x0], 8
> > +       cmp     x0, x1
> > +       b.lo    1b
> > +
> >         adrp    x1, early_init_stack
> >         mov     sp, x1
> >         mov     x29, xzr
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 34e5d78af076..aaf823565a65 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -761,7 +761,7 @@ static int __init map_entry_trampoline(void)
> >         pgprot_val(prot) &= ~PTE_NG;
> >
> >         /* Map only the text into the trampoline page table */
> > -       memset(tramp_pg_dir, 0, PGD_SIZE);
> > +       memset(tramp_pg_dir, 0, PAGE_SIZE);
> >         __create_pgd_mapping(tramp_pg_dir, pa_start, TRAMP_VALIAS,
> >                              entry_tramp_text_size(), prot,
> >                              pgd_pgtable_alloc_init_mm, NO_BLOCK_MAPPINGS);
> > @@ -806,6 +806,7 @@ static void __init create_idmap(void)
> >         u64 end   = __pa_symbol(__idmap_text_end);
> >         u64 ptep  = __pa_symbol(idmap_ptes);
> >
> > +       memset(idmap_pg_dir, 0, PAGE_SIZE);
> >         __pi_map_range(&ptep, start, end, start, PAGE_KERNEL_ROX,
> >                        IDMAP_ROOT_LEVEL, (pte_t *)idmap_pg_dir, false,
> >                        __phys_to_virt(ptep) - ptep);
> > --
> > 2.49.1
> >

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] arm64/boot: Zero-initialize idmap PGDs before use
  2025-08-23 23:55   ` Sam Edwards
@ 2025-08-24  0:29     ` Ard Biesheuvel
  2025-08-24  3:05       ` Sam Edwards
  2025-08-26  9:50     ` Mark Rutland
  1 sibling, 1 reply; 12+ messages in thread
From: Ard Biesheuvel @ 2025-08-24  0:29 UTC (permalink / raw)
  To: Sam Edwards
  Cc: Catalin Marinas, Will Deacon, Marc Zyngier, Andrew Morton,
	Anshuman Khandual, Ryan Roberts, Baruch Siach, Kevin Brodsky,
	Joey Gouly, linux-arm-kernel, linux-kernel, stable

On Sun, 24 Aug 2025 at 09:56, Sam Edwards <cfsworks@gmail.com> wrote:
>
> On Sat, Aug 23, 2025 at 3:25 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > Hi Sam,
> >
> > On Fri, 22 Aug 2025 at 14:15, Sam Edwards <cfsworks@gmail.com> wrote:
> > >
> > > In early boot, Linux creates identity virtual->physical address mappings
> > > so that it can enable the MMU before full memory management is ready.
> > > To ensure some available physical memory to back these structures,
> > > vmlinux.lds reserves some space (and defines marker symbols) in the
> > > middle of the kernel image. However, because they are defined outside of
> > > PROGBITS sections, they aren't pre-initialized -- at least as far as ELF
> > > is concerned.
> > >
> > > In the typical case, this isn't actually a problem: the boot image is
> > > prepared with objcopy, which zero-fills the gaps, so these structures
> > > are incidentally zero-initialized (an all-zeroes entry is considered
> > > absent, so zero-initialization is appropriate).
> > >
> > > However, that is just a happy accident: the `vmlinux` ELF output
> > > authoritatively represents the state of memory at entry. If the ELF
> > > says a region of memory isn't initialized, we must treat it as
> > > uninitialized. Indeed, certain bootloaders (e.g. Broadcom CFE) ingest
> > > the ELF directly -- sidestepping the objcopy-produced image entirely --
> > > and therefore do not initialize the gaps. This results in the early boot
> > > code crashing when it attempts to create identity mappings.
> > >
> > > Therefore, add boot-time zero-initialization for the following:
> > > - __pi_init_idmap_pg_dir..__pi_init_idmap_pg_end
> > > - idmap_pg_dir
> > > - reserved_pg_dir
> >
> > I don't think this is the right approach.
> >
> > If the ELF representation is inaccurate, it should be fixed, and this
> > should be achievable without impacting the binary image at all.
>
> Hi Ard,
>
> I don't believe I can declare the ELF output "inaccurate" per se,
> since it's the linker's final determination about the state of memory
> at kernel entry -- including which regions are not the loader's
> responsibility to initialize (and should therefore be initialized at
> runtime, e.g. .bss). But, I think I understand your meaning: you would
> prefer consistent load-time zero-initialization over run-time. I'm
> open to that approach if that's the consensus here, but it will make
> `vmlinux` dozens of KBs larger (even though it keeps `Image` the same
> size).
>

Indeed, I'd like the ELF representation to be such that only the tail
end of the image needs explicit clearing. A bit of bloat of vmlinux is
tolerable IMO.

Note that your fix is not complete: stores to memory done with the MMU
and caches disabled need to be invalidated from the D-caches too, or
they could carry stale clean lines. This is precisely the reason why
manipulation of memory should be limited to the bare minimum until the
ID map is enabled in the MMU.


> >
> > > - tramp_pg_dir # Already done, but this patch corrects the size
> > >
> >
> > What is wrong with the size?
>
> On higher-VABIT targets, that memset is overflowing by writing
> PGD_SIZE bytes despite tramp_pg_dir being only PAGE_SIZE bytes in
> size.

Under which conditions would PGD_SIZE assume a value greater than PAGE_SIZE?

Note that at stage 1, arm64 does not support page table concatenation,
and so the root page table is never larger than a page.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] arm64/boot: Zero-initialize idmap PGDs before use
  2025-08-24  0:29     ` Ard Biesheuvel
@ 2025-08-24  3:05       ` Sam Edwards
  2025-08-24  8:55         ` Marc Zyngier
  0 siblings, 1 reply; 12+ messages in thread
From: Sam Edwards @ 2025-08-24  3:05 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Catalin Marinas, Will Deacon, Marc Zyngier, Andrew Morton,
	Anshuman Khandual, Ryan Roberts, Baruch Siach, Kevin Brodsky,
	Joey Gouly, linux-arm-kernel, linux-kernel, stable

On Sat, Aug 23, 2025 at 5:29 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Sun, 24 Aug 2025 at 09:56, Sam Edwards <cfsworks@gmail.com> wrote:
> >
> > On Sat, Aug 23, 2025 at 3:25 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
> > > Hi Sam,
> > >
> > > On Fri, 22 Aug 2025 at 14:15, Sam Edwards <cfsworks@gmail.com> wrote:
> > > >
> > > > In early boot, Linux creates identity virtual->physical address mappings
> > > > so that it can enable the MMU before full memory management is ready.
> > > > To ensure some available physical memory to back these structures,
> > > > vmlinux.lds reserves some space (and defines marker symbols) in the
> > > > middle of the kernel image. However, because they are defined outside of
> > > > PROGBITS sections, they aren't pre-initialized -- at least as far as ELF
> > > > is concerned.
> > > >
> > > > In the typical case, this isn't actually a problem: the boot image is
> > > > prepared with objcopy, which zero-fills the gaps, so these structures
> > > > are incidentally zero-initialized (an all-zeroes entry is considered
> > > > absent, so zero-initialization is appropriate).
> > > >
> > > > However, that is just a happy accident: the `vmlinux` ELF output
> > > > authoritatively represents the state of memory at entry. If the ELF
> > > > says a region of memory isn't initialized, we must treat it as
> > > > uninitialized. Indeed, certain bootloaders (e.g. Broadcom CFE) ingest
> > > > the ELF directly -- sidestepping the objcopy-produced image entirely --
> > > > and therefore do not initialize the gaps. This results in the early boot
> > > > code crashing when it attempts to create identity mappings.
> > > >
> > > > Therefore, add boot-time zero-initialization for the following:
> > > > - __pi_init_idmap_pg_dir..__pi_init_idmap_pg_end
> > > > - idmap_pg_dir
> > > > - reserved_pg_dir
> > >
> > > I don't think this is the right approach.
> > >
> > > If the ELF representation is inaccurate, it should be fixed, and this
> > > should be achievable without impacting the binary image at all.
> >
> > Hi Ard,
> >
> > I don't believe I can declare the ELF output "inaccurate" per se,
> > since it's the linker's final determination about the state of memory
> > at kernel entry -- including which regions are not the loader's
> > responsibility to initialize (and should therefore be initialized at
> > runtime, e.g. .bss). But, I think I understand your meaning: you would
> > prefer consistent load-time zero-initialization over run-time. I'm
> > open to that approach if that's the consensus here, but it will make
> > `vmlinux` dozens of KBs larger (even though it keeps `Image` the same
> > size).
> >
>
> Indeed, I'd like the ELF representation to be such that only the tail
> end of the image needs explicit clearing. A bit of bloat of vmlinux is
> tolerable IMO.

Since the explicit clearing region already includes the entirety of
__pi_init_pg_dir, would it make sense if I instead move the other
pg_dir items (except __pi_init_idmap_pg_dir) inside that region too,
both to keep them all grouped and to ensure that they're all cleared
in the same go? I'd still need to handle __pi_init_idmap_pg_dir, and
it would mean that reserved_pg_dir is first installed in TTBR1_EL1 a
few cycles before being zeroed, but beyond those two drawbacks it
sounds simpler to me, reduces the image size by a few pages, and meets
the "only clear the tail end" goal.

> Note that your fix is not complete: stores to memory done with the MMU
> and caches disabled need to be invalidated from the D-caches too, or
> they could carry stale clean lines. This is precisely the reason why
> manipulation of memory should be limited to the bare minimum until the
> ID map is enabled in the MMU.

ACK. ARM64 caches are one of those things that I understand in
principle but I'm still learning all of the gotchas. I appreciate that
you shared this insight despite rejecting the overall approach!

> > >
> > > > - tramp_pg_dir # Already done, but this patch corrects the size
> > > >
> > >
> > > What is wrong with the size?
> >
> > On higher-VABIT targets, that memset is overflowing by writing
> > PGD_SIZE bytes despite tramp_pg_dir being only PAGE_SIZE bytes in
> > size.
>
> Under which conditions would PGD_SIZE assume a value greater than PAGE_SIZE?

I might be doing my math wrong, but wouldn't 52-bit VA with 4K
granules and 5 levels result in this?

Each PTE represents 4K of virtual memory, so covers VA bits [11:0]
(this is level 3)
Each PMD has 512 PTEs, the index of which covers VA bits [20:12] (this
is level 2)
Each PUD references 512 PMDs, the index covering VA [29:21] (this is level 1)
Each P4D references 512 PUDs, indexed by VA [38:30] (this is level 0)
The PGD, at level -1, therefore has to cover VA bits [51:39], which
means it has a 13-bit index: 8192 entries of 8 bytes each would make
it 16 pages in size.

> Note that at stage 1, arm64 does not support page table concatenation,
> and so the root page table is never larger than a page.

Doesn't PGD_SIZE refer to the size used for userspace PGDs after the
boot progresses beyond stage 1? (What do you mean by "never" here?
"Under no circumstances is it larger than a page at stage 1"? Or
"during the entire lifecycle of the system, there is no time at which
it's larger than a page"?)

Thanks for your time and attention to this,
Sam

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] arm64/boot: Zero-initialize idmap PGDs before use
  2025-08-24  3:05       ` Sam Edwards
@ 2025-08-24  8:55         ` Marc Zyngier
  2025-08-24 23:43           ` Sam Edwards
  0 siblings, 1 reply; 12+ messages in thread
From: Marc Zyngier @ 2025-08-24  8:55 UTC (permalink / raw)
  To: Sam Edwards
  Cc: Ard Biesheuvel, Catalin Marinas, Will Deacon, Andrew Morton,
	Anshuman Khandual, Ryan Roberts, Baruch Siach, Kevin Brodsky,
	Joey Gouly, linux-arm-kernel, linux-kernel, stable

Hi Sam,

On Sun, 24 Aug 2025 04:05:05 +0100,
Sam Edwards <cfsworks@gmail.com> wrote:
> 
> On Sat, Aug 23, 2025 at 5:29 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> >

[...]

> > Under which conditions would PGD_SIZE assume a value greater than PAGE_SIZE?
> 
> I might be doing my math wrong, but wouldn't 52-bit VA with 4K
> granules and 5 levels result in this?

No. 52bit VA at 4kB granule results in levels 0-3 each resolving 9
bits, and level -1 resolving 4 bits. That's a total of 40 bits, plus
the 12 bits coming directly from the VA making for the expected 52.

> Each PTE represents 4K of virtual memory, so covers VA bits [11:0]
> (this is level 3)

That's where you got it wrong. The architecture is pretty clear that
each level resolves PAGE_SHIFT-3 bits, hence the computation
above. The bottom PAGE_SHIFT bits are directly extracted from the VA,
without any translation.

> Each PMD has 512 PTEs, the index of which covers VA bits [20:12] (this
> is level 2)
> Each PUD references 512 PMDs, the index covering VA [29:21] (this is level 1)
> Each P4D references 512 PUDs, indexed by VA [38:30] (this is level 0)
> The PGD, at level -1, therefore has to cover VA bits [51:39], which
> means it has a 13-bit index: 8192 entries of 8 bytes each would make
> it 16 pages in size.
>
> > Note that at stage 1, arm64 does not support page table concatenation,
> > and so the root page table is never larger than a page.
> 
> Doesn't PGD_SIZE refer to the size used for userspace PGDs after the
> boot progresses beyond stage 1? (What do you mean by "never" here?
> "Under no circumstances is it larger than a page at stage 1"? Or
> "during the entire lifecycle of the system, there is no time at which
> it's larger than a page"?)

Never, ever, is a S1 table bigger than a page. This concept doesn't
exist in the architecture. Only S2 tables can use concatenation at the
top-most level, for up to 16 pages (in order to skip a level when
possible).

The top-level can be smaller than a page, with some alignment
constraints, but that's about the only degree of freedom you have for
S1 page tables.

Thanks,

	M.

-- 
Jazz isn't dead. It just smells funny.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] arm64/boot: Zero-initialize idmap PGDs before use
  2025-08-24  8:55         ` Marc Zyngier
@ 2025-08-24 23:43           ` Sam Edwards
  2025-08-25  9:12             ` Marc Zyngier
  0 siblings, 1 reply; 12+ messages in thread
From: Sam Edwards @ 2025-08-24 23:43 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Ard Biesheuvel, Catalin Marinas, Will Deacon, Andrew Morton,
	Anshuman Khandual, Ryan Roberts, Baruch Siach, Kevin Brodsky,
	Joey Gouly, linux-arm-kernel, linux-kernel, stable

Hi, Marc! It's been a while; hope you're well.

On Sun, Aug 24, 2025 at 1:55 AM Marc Zyngier <maz@kernel.org> wrote:
>
> Hi Sam,
>
> On Sun, 24 Aug 2025 04:05:05 +0100,
> Sam Edwards <cfsworks@gmail.com> wrote:
> >
> > On Sat, Aug 23, 2025 at 5:29 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
>
> [...]
>
> > > Under which conditions would PGD_SIZE assume a value greater than PAGE_SIZE?
> >
> > I might be doing my math wrong, but wouldn't 52-bit VA with 4K
> > granules and 5 levels result in this?
>
> No. 52bit VA at 4kB granule results in levels 0-3 each resolving 9
> bits, and level -1 resolving 4 bits. That's a total of 40 bits, plus
> the 12 bits coming directly from the VA making for the expected 52.

Thank you, that makes it clear: I made an off-by-one mistake in my
counting of the levels.

> > Each PTE represents 4K of virtual memory, so covers VA bits [11:0]
> > (this is level 3)
>
> That's where you got it wrong. The architecture is pretty clear that
> each level resolves PAGE_SHIFT-3 bits, hence the computation
> above. The bottom PAGE_SHIFT bits are directly extracted from the VA,
> without any translation.

Bear with me a moment while I unpack which part of that I got wrong:
A PTE is the terminal entry of the MMU walk, so I believe I'm correct
(in this example, and assuming no hugepages) that each PTE represents
4K of virtual memory: that means the final step of computing a PA
takes a (valid) PTE and the low 12 bits of the VA, then just adds
those bits to the physical frame address.
It sounds like what you're saying is "That isn't a *level* though:
that's just concatenation. A 'level' always takes a bitslice of the VA
and uses it as an index into a table of word-sized entries. PTEs don't
point to a further table: they have all of the final information
encoded directly."
That makes a lot more sense to me, but contradicts how I read this
comment from pgtable-hwdef.h:
 * Level 3 descriptor (PTE).
I took this as, "a PTE describes how to perform level 3 of the
translation." But because in fact there are no "levels" after a PTE,
it must actually be saying "Level 3 of the translation is a lookup
into an array of PTEs."? The problem with that latter reading is that
this comment...
 * Level -1 descriptor (PGD).
...when read the same way, is saying "Level -1 of the translation is a
lookup into an array of PGDs." An "array of PGDs" is nonsense, so I
reverted back to my earlier readings: "PGD describes how to do level
-1." and "PTE describes how to do level 3."

This smells like a classic "fencepost problem": The "PXX" Linuxisms
refer to the *nodes* along the MMU walk, while the "levels" in ARM
parlance are the actual steps of the walk taken by hardware -- edges,
not nodes, getting us from fencepost to fencepost. A fence with five
segments needs six posts, but we only have five currently.

So: where do the terms P4D, PUD, and PMD fit in here? And which one's
our missing fencepost?
PGD ----> ??? ----> ??? ----> ??? ----> ??? ----> PTE (|| low VA bits
= final PA)

> > > Note that at stage 1, arm64 does not support page table concatenation,
> > > and so the root page table is never larger than a page.
> >
> > Doesn't PGD_SIZE refer to the size used for userspace PGDs after the
> > boot progresses beyond stage 1? (What do you mean by "never" here?
> > "Under no circumstances is it larger than a page at stage 1"? Or
> > "during the entire lifecycle of the system, there is no time at which
> > it's larger than a page"?)
>
> Never, ever, is a S1 table bigger than a page. This concept doesn't
> exist in the architecture. Only S2 tables can use concatenation at the
> top-most level, for up to 16 pages (in order to skip a level when
> possible).
>
> The top-level can be smaller than a page, with some alignment
> constraints, but that's about the only degree of freedom you have for
> S1 page tables.

Okay, that clicked for me: I was reading "stage" in the context of the
boot process. These explanations make a lot more sense when reading
"stage" in the context of the MMU.

So PGD_SIZE <= PAGE_SIZE, the PAGE_SIZE spacing in vmlinux.lds.S is
for alignment, and I should be looking at cases where PGDs are assumed
to be PAGE_SIZE to make those consistent instead. Thanks!

Cheers,
Sam

>
> Thanks,
>
>         M.
>
> --
> Jazz isn't dead. It just smells funny.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] arm64/boot: Zero-initialize idmap PGDs before use
  2025-08-24 23:43           ` Sam Edwards
@ 2025-08-25  9:12             ` Marc Zyngier
  2025-08-25 23:26               ` Sam Edwards
  0 siblings, 1 reply; 12+ messages in thread
From: Marc Zyngier @ 2025-08-25  9:12 UTC (permalink / raw)
  To: Sam Edwards
  Cc: Ard Biesheuvel, Catalin Marinas, Will Deacon, Andrew Morton,
	Anshuman Khandual, Ryan Roberts, Baruch Siach, Kevin Brodsky,
	Joey Gouly, linux-arm-kernel, linux-kernel, stable

On Mon, 25 Aug 2025 00:43:08 +0100,
Sam Edwards <cfsworks@gmail.com> wrote:
> 
> Hi, Marc! It's been a while; hope you're well.
> 
> On Sun, Aug 24, 2025 at 1:55 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > Hi Sam,
> >
> > On Sun, 24 Aug 2025 04:05:05 +0100,
> > Sam Edwards <cfsworks@gmail.com> wrote:
> > >
> > > On Sat, Aug 23, 2025 at 5:29 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> > > >
> >
> > [...]
> >
> > > > Under which conditions would PGD_SIZE assume a value greater than PAGE_SIZE?
> > >
> > > I might be doing my math wrong, but wouldn't 52-bit VA with 4K
> > > granules and 5 levels result in this?
> >
> > No. 52bit VA at 4kB granule results in levels 0-3 each resolving 9
> > bits, and level -1 resolving 4 bits. That's a total of 40 bits, plus
> > the 12 bits coming directly from the VA making for the expected 52.
> 
> Thank you, that makes it clear: I made an off-by-one mistake in my
> counting of the levels.
> 
> > > Each PTE represents 4K of virtual memory, so covers VA bits [11:0]
> > > (this is level 3)
> >
> > That's where you got it wrong. The architecture is pretty clear that
> > each level resolves PAGE_SHIFT-3 bits, hence the computation
> > above. The bottom PAGE_SHIFT bits are directly extracted from the VA,
> > without any translation.
> 
> Bear with me a moment while I unpack which part of that I got wrong:
> A PTE is the terminal entry of the MMU walk, so I believe I'm correct
> (in this example, and assuming no hugepages) that each PTE represents
> 4K of virtual memory: that means the final step of computing a PA
> takes a (valid) PTE and the low 12 bits of the VA, then just adds
> those bits to the physical frame address.
> It sounds like what you're saying is "That isn't a *level* though:
> that's just concatenation. A 'level' always takes a bitslice of the VA
> and uses it as an index into a table of word-sized entries. PTEs don't
> point to a further table: they have all of the final information
> encoded directly."

That's mostly it, yes. Each valid descriptor has an output address,
which either points to another table or to actual memory, further to
be indexed by the remaining bits of the VA (for 4kB pages: 12 bits for
a level-3, 21 bits for a level-2...). Level-3 (aka PTEs in x86
parlance) are always final.

> That makes a lot more sense to me, but contradicts how I read this
> comment from pgtable-hwdef.h:
>  * Level 3 descriptor (PTE).
> I took this as, "a PTE describes how to perform level 3 of the
> translation." But because in fact there are no "levels" after a PTE,
> it must actually be saying "Level 3 of the translation is a lookup
> into an array of PTEs."? The problem with that latter reading is that
> this comment...
>  * Level -1 descriptor (PGD).
> ...when read the same way, is saying "Level -1 of the translation is a
> lookup into an array of PGDs." An "array of PGDs" is nonsense, so I
> reverted back to my earlier readings: "PGD describes how to do level
> -1." and "PTE describes how to do level 3."

The initial level of lookup *is* an array: you take the base address
from TTBR, index it with the correct slice of bits from the VA, read
the value at that address, and you have the information needed for the
next level. The only difference is that you obtain that initial
address from a register instead of getting it from memory.

> 
> This smells like a classic "fencepost problem": The "PXX" Linuxisms
> refer to the *nodes* along the MMU walk, while the "levels" in ARM
> parlance are the actual steps of the walk taken by hardware -- edges,
> not nodes, getting us from fencepost to fencepost. A fence with five
> segments needs six posts, but we only have five currently.
> 
> So: where do the terms P4D, PUD, and PMD fit in here? And which one's
> our missing fencepost?
> PGD ----> ??? ----> ??? ----> ??? ----> ??? ----> PTE (|| low VA bits
> = final PA)

I'm struggling to see what you consider a problem, really. For me, the
original mistake is that you seem to have started off the LSBs of the
VA, instead of the MSBs.

As for the naming, the comments in pgtable-hwdef.h do apply. Except
that they only match a full 5-level walk, while the kernel can be
configured for as little as 2 levels. Hence the macro hell of folding
levels to hide the fact that we don't have 5 levels in most cases.

I find it much easier to reason about a start level (anywhere from -1
to 2, depending on the page size and the number of VA bits), and the
walk to always finish at level 3. The x86 naming is just compatibility
cruft that I tend to ignore.

Thanks,

	M.

-- 
Jazz isn't dead. It just smells funny.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] arm64/boot: Zero-initialize idmap PGDs before use
  2025-08-25  9:12             ` Marc Zyngier
@ 2025-08-25 23:26               ` Sam Edwards
  0 siblings, 0 replies; 12+ messages in thread
From: Sam Edwards @ 2025-08-25 23:26 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Ard Biesheuvel, Catalin Marinas, Will Deacon, Andrew Morton,
	Anshuman Khandual, Ryan Roberts, Baruch Siach, Kevin Brodsky,
	Joey Gouly, linux-arm-kernel, linux-kernel, stable

On Mon, Aug 25, 2025 at 2:12 AM Marc Zyngier <maz@kernel.org> wrote:
> > So: where do the terms P4D, PUD, and PMD fit in here? And which one's
> > our missing fencepost?
> > PGD ----> ??? ----> ??? ----> ??? ----> ??? ----> PTE (|| low VA bits
> > = final PA)
>
> I'm struggling to see what you consider a problem, really. For me, the
> original mistake is that you seem to have started off the LSBs of the
> VA, instead of the MSBs.

Right; I did my example MMU walk in reverse to start with the levels
that are never folded and work my way up to the PGD level, since I was
trying to deduce PTRS_PER_PGD, essentially.

As for the ultimate source of my confusion, it's best explained in
this paragraph from the paging documentation [0] (emphasis my own):

The entry part of the name is a bit confusing because while in Linux 1.0
this did refer to a single page table entry in the single top level page table,
it was retrofitted to be an array of mapping elements when two-level page
tables were first introduced, ***so the pte is the lowermost page table, not
a page table entry.***

As I see it, this creates a habit of using the names for the table
entries and the tables themselves interchangeably. Kernel folk who
don't work on the MM subsystem much (like myself) don't share this
habit. So when I say something like "an 'array of PGDs' is nonsense,"
I mean "an 'array of top-level page tables' is an oxymoron: if there's
an array of them, they aren't top-level." But someone more seasoned
like you thinks I'm saying "array of PGD entries" and asks "What's the
problem? That's, tautologically, what the PGD is..."

It's absolutely on me for not thinking to RTFM earlier. Thank you for
your patience. In my defense, I don't think there's any way I could
have known that "PTE" was a misnomer. The RISC-V team added a few
helpful comments in their pgtable*.h that probably helps people
encountering these MM terms for the first time through those files.
I'm considering sending a patch to clarify the comments on the ARM64
#defines similarly.

I've included a glossary of terms [1] in case this confusion comes up
again with somebody else.

> I find it much easier to reason about a start level (anywhere from -1
> to 2, depending on the page size and the number of VA bits), and the
> walk to always finish at level 3. The x86 naming is just compatibility
> cruft that I tend to ignore.

Indeed, I think this is probably what most seasoned ARM people do.
People who are first mentally visualizing the tree -- and specifically
Linux's intent with the tree -- will probably still be relying on the
PxD terms to bootstrap their understanding, however.

Thank you once again for your patience,
Sam

[0]: https://docs.kernel.org/mm/page_tables.html
[1]:
"PTE"
A newcomer probably thinks: "Page Table Entry, the final descriptor at
the end of an MMU walk."
Actual meaning: Bottom-most translation table *OR* an entry of that
table. It was once an initialism, but it's taken on a meaning of its
own.

"PGD"
A newcomer probably thinks: "Page Global Directory, or the pointer
thereto -- the thing you find in TTBRn_ELx"
Actual meaning: Page Global Directory or the pointer thereto *OR* Page
Global Descriptor: an entry of the Page Global Directory, which
encodes a pointer to the P4DIR if 5-level translation is enabled, or a
lower PxDIR if not.

#define PTRS_PER_PTE
A newcomer probably thinks: "Number of frames pointed from a single
Page Table Entry (so... er... 1???)"
Actual meaning: Number of entries in the "PTE," the final translation table.

#define PGDIR_SIZE
A newcomer probably thinks: "Number of bytes of VM represented by one
PGDIR." (or even "Number of bytes the top-level page table occupies in
memory.")
Actual meaning: Number of bytes of VM represented by *one entry* of
the PGDIR. The 'DIR' is a misnomer here.

#define PGD_SIZE
Actual meaning: Number of bytes the top-level page table occupies in
memory. Defined separately from PAGE_SIZE because it may be smaller
(never larger).

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] arm64/boot: Zero-initialize idmap PGDs before use
  2025-08-23 23:55   ` Sam Edwards
  2025-08-24  0:29     ` Ard Biesheuvel
@ 2025-08-26  9:50     ` Mark Rutland
  2025-08-26 10:18       ` Ard Biesheuvel
  1 sibling, 1 reply; 12+ messages in thread
From: Mark Rutland @ 2025-08-26  9:50 UTC (permalink / raw)
  To: Sam Edwards
  Cc: Ard Biesheuvel, Catalin Marinas, Will Deacon, Marc Zyngier,
	Andrew Morton, Anshuman Khandual, Ryan Roberts, Baruch Siach,
	Kevin Brodsky, Joey Gouly, linux-arm-kernel, linux-kernel, stable

On Sat, Aug 23, 2025 at 04:55:44PM -0700, Sam Edwards wrote:
> On Sat, Aug 23, 2025 at 3:25 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > Hi Sam,
> >
> > On Fri, 22 Aug 2025 at 14:15, Sam Edwards <cfsworks@gmail.com> wrote:
> > >
> > > In early boot, Linux creates identity virtual->physical address mappings
> > > so that it can enable the MMU before full memory management is ready.
> > > To ensure some available physical memory to back these structures,
> > > vmlinux.lds reserves some space (and defines marker symbols) in the
> > > middle of the kernel image. However, because they are defined outside of
> > > PROGBITS sections, they aren't pre-initialized -- at least as far as ELF
> > > is concerned.
> > >
> > > In the typical case, this isn't actually a problem: the boot image is
> > > prepared with objcopy, which zero-fills the gaps, so these structures
> > > are incidentally zero-initialized (an all-zeroes entry is considered
> > > absent, so zero-initialization is appropriate).
> > >
> > > However, that is just a happy accident: the `vmlinux` ELF output
> > > authoritatively represents the state of memory at entry. If the ELF
> > > says a region of memory isn't initialized, we must treat it as
> > > uninitialized. Indeed, certain bootloaders (e.g. Broadcom CFE) ingest
> > > the ELF directly -- sidestepping the objcopy-produced image entirely --
> > > and therefore do not initialize the gaps. This results in the early boot
> > > code crashing when it attempts to create identity mappings.
> > >
> > > Therefore, add boot-time zero-initialization for the following:
> > > - __pi_init_idmap_pg_dir..__pi_init_idmap_pg_end
> > > - idmap_pg_dir
> > > - reserved_pg_dir
> >
> > I don't think this is the right approach.
> >
> > If the ELF representation is inaccurate, it should be fixed, and this
> > should be achievable without impacting the binary image at all.
> 
> Hi Ard,
> 
> I don't believe I can declare the ELF output "inaccurate" per se,
> since it's the linker's final determination about the state of memory
> at kernel entry -- including which regions are not the loader's
> responsibility to initialize (and should therefore be initialized at
> runtime, e.g. .bss). But, I think I understand your meaning: you would
> prefer consistent load-time zero-initialization over run-time. I'm
> open to that approach if that's the consensus here, but it will make
> `vmlinux` dozens of KBs larger (even though it keeps `Image` the same
> size).

Our intent was that these are zeroed at build time in the Image. If the
vmlinux isn't consistent with that, that's a problem with the way we
generate the vmlinux, and hence "the ELF representation is inaccurate".

I agree with Ard that it's better to bring the vmlinux into line with
that (if we need to handlr this at all), even if that means making the
vmlinux a few KB bigger.

Mark.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] arm64/boot: Zero-initialize idmap PGDs before use
  2025-08-26  9:50     ` Mark Rutland
@ 2025-08-26 10:18       ` Ard Biesheuvel
  2025-08-26 19:33         ` Sam Edwards
  0 siblings, 1 reply; 12+ messages in thread
From: Ard Biesheuvel @ 2025-08-26 10:18 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Sam Edwards, Catalin Marinas, Will Deacon, Marc Zyngier,
	Andrew Morton, Anshuman Khandual, Ryan Roberts, Baruch Siach,
	Kevin Brodsky, Joey Gouly, linux-arm-kernel, linux-kernel, stable

On Tue, 26 Aug 2025 at 11:50, Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Sat, Aug 23, 2025 at 04:55:44PM -0700, Sam Edwards wrote:
> > On Sat, Aug 23, 2025 at 3:25 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
> > > Hi Sam,
> > >
> > > On Fri, 22 Aug 2025 at 14:15, Sam Edwards <cfsworks@gmail.com> wrote:
> > > >
> > > > In early boot, Linux creates identity virtual->physical address mappings
> > > > so that it can enable the MMU before full memory management is ready.
> > > > To ensure some available physical memory to back these structures,
> > > > vmlinux.lds reserves some space (and defines marker symbols) in the
> > > > middle of the kernel image. However, because they are defined outside of
> > > > PROGBITS sections, they aren't pre-initialized -- at least as far as ELF
> > > > is concerned.
> > > >
> > > > In the typical case, this isn't actually a problem: the boot image is
> > > > prepared with objcopy, which zero-fills the gaps, so these structures
> > > > are incidentally zero-initialized (an all-zeroes entry is considered
> > > > absent, so zero-initialization is appropriate).
> > > >
> > > > However, that is just a happy accident: the `vmlinux` ELF output
> > > > authoritatively represents the state of memory at entry. If the ELF
> > > > says a region of memory isn't initialized, we must treat it as
> > > > uninitialized. Indeed, certain bootloaders (e.g. Broadcom CFE) ingest
> > > > the ELF directly -- sidestepping the objcopy-produced image entirely --
> > > > and therefore do not initialize the gaps. This results in the early boot
> > > > code crashing when it attempts to create identity mappings.
> > > >
> > > > Therefore, add boot-time zero-initialization for the following:
> > > > - __pi_init_idmap_pg_dir..__pi_init_idmap_pg_end
> > > > - idmap_pg_dir
> > > > - reserved_pg_dir
> > >
> > > I don't think this is the right approach.
> > >
> > > If the ELF representation is inaccurate, it should be fixed, and this
> > > should be achievable without impacting the binary image at all.
> >
> > Hi Ard,
> >
> > I don't believe I can declare the ELF output "inaccurate" per se,
> > since it's the linker's final determination about the state of memory
> > at kernel entry -- including which regions are not the loader's
> > responsibility to initialize (and should therefore be initialized at
> > runtime, e.g. .bss). But, I think I understand your meaning: you would
> > prefer consistent load-time zero-initialization over run-time. I'm
> > open to that approach if that's the consensus here, but it will make
> > `vmlinux` dozens of KBs larger (even though it keeps `Image` the same
> > size).
>
> Our intent was that these are zeroed at build time in the Image. If the
> vmlinux isn't consistent with that, that's a problem with the way we
> generate the vmlinux, and hence "the ELF representation is inaccurate".
>
> I agree with Ard that it's better to bring the vmlinux into line with
> that (if we need to handlr this at all), even if that means making the
> vmlinux a few KB bigger.
>

Indeed. And actually, it should still be the ELF loader's job to
zero-initialize NOBITS sections, so ideally, we'd make these NOBITS
rather than PROGBITS, and the bloat issue should go away.

If the ELF loader in question relies on the executable's startup code
to clear NOBITS sections, it needs to be fixed in any case. Clearing
BSS like we do at startup time is really only appropriate for
bare-metal images such as arm64's Image, but a platform that elects to
use an ELF loader instead (even though that is not a supported
bootable format for arm64 Linux) should at least adhere to the ELF
spec.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] arm64/boot: Zero-initialize idmap PGDs before use
  2025-08-26 10:18       ` Ard Biesheuvel
@ 2025-08-26 19:33         ` Sam Edwards
  0 siblings, 0 replies; 12+ messages in thread
From: Sam Edwards @ 2025-08-26 19:33 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Mark Rutland, Catalin Marinas, Will Deacon, Marc Zyngier,
	Andrew Morton, Anshuman Khandual, Ryan Roberts, Baruch Siach,
	Kevin Brodsky, Joey Gouly, linux-arm-kernel, linux-kernel, stable

On Tue, Aug 26, 2025 at 3:18 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> Indeed. And actually, it should still be the ELF loader's job to
> zero-initialize NOBITS sections, so ideally, we'd make these NOBITS
> rather than PROGBITS, and the bloat issue should go away.

I completely agree. NOBITS seems like the best approach:
- It doesn't meaningfully increase the size of vmlinux
- It has no runtime cost (and indeed shouldn't change the binary image at all)
- Yet it still memorializes in ELF our expectation that these tables
are pre-zeroed (and addresses some of my other "what ifs" like "What
if the user wants to use objcopy --gap-fill?")

> If the ELF loader in question relies on the executable's startup code
> to clear NOBITS sections, it needs to be fixed in any case. Clearing
> BSS like we do at startup time is really only appropriate for
> bare-metal images such as arm64's Image, but a platform that elects to
> use an ELF loader instead (even though that is not a supported
> bootable format for arm64 Linux) should at least adhere to the ELF
> spec.

Here's hoping -- I'm afraid I can't substantially change anything
about this bootloader, so I've been looking to replace it instead. But
we are in agreement that if the ELF loader isn't following the spec
and NOBITS doesn't solve my problem, then interim workarounds are
solely my responsibility.

Best,
Sam

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-08-26 19:34 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-22  4:15 [PATCH] arm64/boot: Zero-initialize idmap PGDs before use Sam Edwards
2025-08-23 22:25 ` Ard Biesheuvel
2025-08-23 23:55   ` Sam Edwards
2025-08-24  0:29     ` Ard Biesheuvel
2025-08-24  3:05       ` Sam Edwards
2025-08-24  8:55         ` Marc Zyngier
2025-08-24 23:43           ` Sam Edwards
2025-08-25  9:12             ` Marc Zyngier
2025-08-25 23:26               ` Sam Edwards
2025-08-26  9:50     ` Mark Rutland
2025-08-26 10:18       ` Ard Biesheuvel
2025-08-26 19:33         ` Sam Edwards

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).