linux-arm-kernel.lists.infradead.org archive mirror
* Re: [PATCH v3 0/3] Speed up boot with faster linear map creation
  2024-04-12 16:06 ` Will Deacon
@ 2024-04-10 12:53   ` Itaru Kitayama
  0 siblings, 0 replies; 9+ messages in thread
From: Itaru Kitayama @ 2024-04-10 12:53 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Mark Rutland, Ard Biesheuvel, David Hildenbrand,
	Donald Dutile, Eric Chanudet, Ryan Roberts, kernel-team,
	linux-arm-kernel, linux-kernel

On Fri, Apr 12, 2024 at 05:06:41PM +0100, Will Deacon wrote:
> On Fri, 12 Apr 2024 14:19:05 +0100, Ryan Roberts wrote:
> > It turns out that creating the linear map can take a significant proportion of
> > the total boot time, especially when rodata=full. And most of the time is spent
> > waiting on superfluous tlb invalidation and memory barriers. This series reworks
> > the kernel pgtable generation code to significantly reduce the number of those
> > TLBIs, ISBs and DSBs. See each patch for details.
> > 
> > The below shows the execution time of map_mem() across a couple of different
> > systems with different RAM configurations. We measure after applying each patch
> > and show the improvement relative to base (v6.9-rc2):
> > 
> > [...]
> 
> Applied to arm64 (for-next/mm), thanks!
> 
> [1/3] arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>       https://git.kernel.org/arm64/c/5c63db59c5f8
> [2/3] arm64: mm: Batch dsb and isb when populating pgtables
>       https://git.kernel.org/arm64/c/1fcb7cea8a5f
> [3/3] arm64: mm: Don't remap pgtables for allocate vs populate
>       https://git.kernel.org/arm64/c/0e9df1c905d8

I confirm this series boots the system on FVP (with my .config and my
buildroot rootfs using Shrinkwrap).

Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com>

Thanks,
Itaru.

> 
> Cheers,
> -- 
> Will
> 
> https://fixes.arm64.dev
> https://next.arm64.dev
> https://will.arm64.dev


* Re: [PATCH v3 1/3] arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
  2024-04-12 13:19 ` [PATCH v3 1/3] arm64: mm: Don't remap pgtables per-cont(pte|pmd) block Ryan Roberts
@ 2024-04-11 14:03   ` Itaru Kitayama
  0 siblings, 0 replies; 9+ messages in thread
From: Itaru Kitayama @ 2024-04-11 14:03 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Mark Rutland, Ard Biesheuvel,
	David Hildenbrand, Donald Dutile, Eric Chanudet, linux-arm-kernel,
	linux-kernel, Itaru Kitayama

Hi Ryan,

On Fri, Apr 12, 2024 at 02:19:06PM +0100, Ryan Roberts wrote:
> A large part of the kernel boot time is creating the kernel linear map
> page tables. When rodata=full, all memory is mapped by pte. And when
> there is lots of physical ram, there are lots of pte tables to populate.
> The primary cost associated with this is mapping and unmapping the pte
> table memory in the fixmap; at unmap time, the TLB entry must be
> invalidated and this is expensive.
> 
> Previously, each pmd and pte table was fixmapped/fixunmapped for each
> cont(pte|pmd) block of mappings (16 entries with 4K granule). This means
> we ended up issuing 32 TLBIs per (pmd|pte) table during the population
> phase.
> 
> Let's fix that, and fixmap/fixunmap each page once per population, for a
> saving of 31 TLBIs per (pmd|pte) table. This gives a significant boot
> speedup.
> 
> Execution time of map_mem(), which creates the kernel linear map page
> tables, was measured on different machines with different RAM configs:
> 
>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
> ---------------|-------------|-------------|-------------|-------------
>                |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
> ---------------|-------------|-------------|-------------|-------------
> before         |  168   (0%) | 2198   (0%) | 8644   (0%) | 17447   (0%)
> after          |   78 (-53%) |  435 (-80%) | 1723 (-80%) |  3779 (-78%)
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com>
> Tested-by: Eric Chanudet <echanude@redhat.com>
> ---
>  arch/arm64/mm/mmu.c | 27 ++++++++++++++-------------
>  1 file changed, 14 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 495b732d5af3..9f1d69b7b494 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -172,12 +172,9 @@ bool pgattr_change_is_safe(u64 old, u64 new)
>  	return ((old ^ new) & ~mask) == 0;
>  }
>  
> -static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
> +static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
>  		     phys_addr_t phys, pgprot_t prot)
>  {
> -	pte_t *ptep;
> -
> -	ptep = pte_set_fixmap_offset(pmdp, addr);
>  	do {
>  		pte_t old_pte = __ptep_get(ptep);
>  
> @@ -192,8 +189,6 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
>  
>  		phys += PAGE_SIZE;
>  	} while (ptep++, addr += PAGE_SIZE, addr != end);
> -
> -	pte_clear_fixmap();
>  }
>  
>  static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> @@ -204,6 +199,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>  {
>  	unsigned long next;
>  	pmd_t pmd = READ_ONCE(*pmdp);
> +	pte_t *ptep;
>  
>  	BUG_ON(pmd_sect(pmd));
>  	if (pmd_none(pmd)) {
> @@ -219,6 +215,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>  	}
>  	BUG_ON(pmd_bad(pmd));
>  
> +	ptep = pte_set_fixmap_offset(pmdp, addr);
>  	do {
>  		pgprot_t __prot = prot;
>  
> @@ -229,20 +226,21 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>  		    (flags & NO_CONT_MAPPINGS) == 0)
>  			__prot = __pgprot(pgprot_val(prot) | PTE_CONT);
>  
> -		init_pte(pmdp, addr, next, phys, __prot);
> +		init_pte(ptep, addr, next, phys, __prot);
>  
> +		ptep += pte_index(next) - pte_index(addr);
>  		phys += next - addr;
>  	} while (addr = next, addr != end);
> +
> +	pte_clear_fixmap();
>  }
>  
> -static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
> +static void init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
>  		     phys_addr_t phys, pgprot_t prot,
>  		     phys_addr_t (*pgtable_alloc)(int), int flags)
>  {
>  	unsigned long next;
> -	pmd_t *pmdp;
>  
> -	pmdp = pmd_set_fixmap_offset(pudp, addr);
>  	do {
>  		pmd_t old_pmd = READ_ONCE(*pmdp);
>  
> @@ -268,8 +266,6 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
>  		}
>  		phys += next - addr;
>  	} while (pmdp++, addr = next, addr != end);
> -
> -	pmd_clear_fixmap();
>  }
>  
>  static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
> @@ -279,6 +275,7 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
>  {
>  	unsigned long next;
>  	pud_t pud = READ_ONCE(*pudp);
> +	pmd_t *pmdp;
>  
>  	/*
>  	 * Check for initial section mappings in the pgd/pud.
> @@ -297,6 +294,7 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
>  	}
>  	BUG_ON(pud_bad(pud));
>  
> +	pmdp = pmd_set_fixmap_offset(pudp, addr);
>  	do {
>  		pgprot_t __prot = prot;
>  
> @@ -307,10 +305,13 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
>  		    (flags & NO_CONT_MAPPINGS) == 0)
>  			__prot = __pgprot(pgprot_val(prot) | PTE_CONT);
>  
> -		init_pmd(pudp, addr, next, phys, __prot, pgtable_alloc, flags);
> +		init_pmd(pmdp, addr, next, phys, __prot, pgtable_alloc, flags);
>  
> +		pmdp += pmd_index(next) - pmd_index(addr);
>  		phys += next - addr;
>  	} while (addr = next, addr != end);
> +
> +	pmd_clear_fixmap();
>  }
>  
>  static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,

I looked at this specific patch 1/3 for a while, and the code changes now
make sense to me: they reduce the number of TLBIs while respecting the
contiguous bit where available at both the PMD and PTE levels. I can't get
through the other two patches (2/3 and 3/3) in a timely manner, but I'd
like to give my

Reviewed-by: Itaru Kitayama <itaru.kitayama@fujitsu.com>

on 1/3.

Thanks,
Itaru.

> -- 
> 2.25.1
> 


* [PATCH v3 0/3] Speed up boot with faster linear map creation
@ 2024-04-12 13:19 Ryan Roberts
  2024-04-12 13:19 ` [PATCH v3 1/3] arm64: mm: Don't remap pgtables per-cont(pte|pmd) block Ryan Roberts
                   ` (5 more replies)
  0 siblings, 6 replies; 9+ messages in thread
From: Ryan Roberts @ 2024-04-12 13:19 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Mark Rutland, Ard Biesheuvel,
	David Hildenbrand, Donald Dutile, Eric Chanudet
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel

Hi All,

It turns out that creating the linear map can take a significant proportion of
the total boot time, especially when rodata=full. And most of the time is spent
waiting on superfluous tlb invalidation and memory barriers. This series reworks
the kernel pgtable generation code to significantly reduce the number of those
TLBIs, ISBs and DSBs. See each patch for details.

The below shows the execution time of map_mem() across a couple of different
systems with different RAM configurations. We measure after applying each patch
and show the improvement relative to base (v6.9-rc2):

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
---------------|-------------|-------------|-------------|-------------
base           |  168   (0%) | 2198   (0%) | 8644   (0%) | 17447   (0%)
no-cont-remap  |   78 (-53%) |  435 (-80%) | 1723 (-80%) |  3779 (-78%)
batch-barriers |   11 (-93%) |  161 (-93%) |  656 (-92%) |  1654 (-91%)
no-alloc-remap |   10 (-94%) |  104 (-95%) |  438 (-95%) |  1223 (-93%)

This series applies on top of v6.9-rc2. All mm selftests pass. I've compile and
boot tested various PAGE_SIZE and VA size configs.
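
For reference, the map_mem() timings above could be reproduced with a small
piece of throwaway instrumentation along the lines of the sketch below. It
is not part of the series: the placement in paging_init(), the use of the
raw generic timer and the pr_info() are illustrative assumptions only.

  /*
   * Hypothetical timing sketch: bracket the map_mem() call in
   * arch/arm64/mm/mmu.c:paging_init() with raw counter reads.
   * Assumes <asm/arch_timer.h> is pulled into that file.
   */
  u64 start, end, ms;

  start = __arch_counter_get_cntvct();
  map_mem(pgdp);
  end = __arch_counter_get_cntvct();

  /* Convert raw counter ticks to milliseconds. */
  ms = (end - start) * 1000 / arch_timer_get_cntfrq();
  pr_info("map_mem() took %llu ms\n", ms);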

---

Changes since v2 [2]
====================

  - Independently increment ptep/pmdp in alloc_init_cont_[pte|pmd]() rather than
    return the incremented value from init_[pte|pmd]() (per Mark)
  - Removed explicit barriers from alloc_init_cont_pte() and instead rely on the
    barriers in pte_clear_fixmap() (per Mark)
  - Significantly simplify approach to avoiding fixmap during alloc (patch 3
    reworked) (per Mark)
  - Dropped patch 4 - not possible with simplified patch 3 and improvement (~2%)
    didn't warrant complexity (per Mark)


Changes since v1 [1]
====================

  - Added Tested-by tags (thanks to Eric and Itaru)
  - Renamed ___set_pte() -> __set_pte_nosync() (per Ard)
  - Reordered patches (biggest impact & least controversial first)
  - Reordered alloc/map/unmap functions in mmu.c to aid reader
  - pte_clear() -> __pte_clear() in clear_fixmap_nosync()
  - Reverted generic p4d_index() which caused x86 build error. Replaced with
    unconditional p4d_index() define under arm64.


[1] https://lore.kernel.org/linux-arm-kernel/20240326101448.3453626-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-arm-kernel/20240404143308.2224141-1-ryan.roberts@arm.com/

Thanks,
Ryan

Ryan Roberts (3):
  arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
  arm64: mm: Batch dsb and isb when populating pgtables
  arm64: mm: Don't remap pgtables for allocate vs populate

 arch/arm64/include/asm/pgtable.h |   9 ++-
 arch/arm64/mm/mmu.c              | 101 +++++++++++++++++--------------
 2 files changed, 65 insertions(+), 45 deletions(-)

--
2.25.1



* [PATCH v3 1/3] arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
  2024-04-12 13:19 [PATCH v3 0/3] Speed up boot with faster linear map creation Ryan Roberts
@ 2024-04-12 13:19 ` Ryan Roberts
  2024-04-11 14:03   ` Itaru Kitayama
  2024-04-12 13:19 ` [PATCH v3 2/3] arm64: mm: Batch dsb and isb when populating pgtables Ryan Roberts
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 9+ messages in thread
From: Ryan Roberts @ 2024-04-12 13:19 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Mark Rutland, Ard Biesheuvel,
	David Hildenbrand, Donald Dutile, Eric Chanudet
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, Itaru Kitayama

A large part of the kernel boot time is creating the kernel linear map
page tables. When rodata=full, all memory is mapped by pte. And when
there is lots of physical ram, there are lots of pte tables to populate.
The primary cost associated with this is mapping and unmapping the pte
table memory in the fixmap; at unmap time, the TLB entry must be
invalidated and this is expensive.

Previously, each pmd and pte table was fixmapped/fixunmapped for each
cont(pte|pmd) block of mappings (16 entries with 4K granule). This means
we ended up issuing 32 TLBIs per (pmd|pte) table during the population
phase.

Let's fix that, and fixmap/fixunmap each page once per population, for a
saving of 31 TLBIs per (pmd|pte) table. This gives a significant boot
speedup.
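
To make the arithmetic concrete (using the usual 4K-granule constants,
PTRS_PER_PTE = 512 and CONT_PTES = 16, which are assumptions not spelled
out above):

  /* Illustrative arithmetic only; constants assumed from the 4K granule config. */
  unsigned int blocks_per_table = PTRS_PER_PTE / CONT_PTES;    /* 512 / 16 = 32 */
  unsigned int tlbis_before = blocks_per_table;                /* one fixunmap TLBI per contpte block */
  unsigned int tlbis_after = 1;                                /* one fixunmap TLBI per table */
  unsigned int tlbis_saved = tlbis_before - tlbis_after;       /* 31 TLBIs saved per table */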

Execution time of map_mem(), which creates the kernel linear map page
tables, was measured on different machines with different RAM configs:

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
---------------|-------------|-------------|-------------|-------------
before         |  168   (0%) | 2198   (0%) | 8644   (0%) | 17447   (0%)
after          |   78 (-53%) |  435 (-80%) | 1723 (-80%) |  3779 (-78%)

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com>
Tested-by: Eric Chanudet <echanude@redhat.com>
---
 arch/arm64/mm/mmu.c | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 495b732d5af3..9f1d69b7b494 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -172,12 +172,9 @@ bool pgattr_change_is_safe(u64 old, u64 new)
 	return ((old ^ new) & ~mask) == 0;
 }
 
-static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
+static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
 		     phys_addr_t phys, pgprot_t prot)
 {
-	pte_t *ptep;
-
-	ptep = pte_set_fixmap_offset(pmdp, addr);
 	do {
 		pte_t old_pte = __ptep_get(ptep);
 
@@ -192,8 +189,6 @@ static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
 
 		phys += PAGE_SIZE;
 	} while (ptep++, addr += PAGE_SIZE, addr != end);
-
-	pte_clear_fixmap();
 }
 
 static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
@@ -204,6 +199,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 {
 	unsigned long next;
 	pmd_t pmd = READ_ONCE(*pmdp);
+	pte_t *ptep;
 
 	BUG_ON(pmd_sect(pmd));
 	if (pmd_none(pmd)) {
@@ -219,6 +215,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 	}
 	BUG_ON(pmd_bad(pmd));
 
+	ptep = pte_set_fixmap_offset(pmdp, addr);
 	do {
 		pgprot_t __prot = prot;
 
@@ -229,20 +226,21 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 		    (flags & NO_CONT_MAPPINGS) == 0)
 			__prot = __pgprot(pgprot_val(prot) | PTE_CONT);
 
-		init_pte(pmdp, addr, next, phys, __prot);
+		init_pte(ptep, addr, next, phys, __prot);
 
+		ptep += pte_index(next) - pte_index(addr);
 		phys += next - addr;
 	} while (addr = next, addr != end);
+
+	pte_clear_fixmap();
 }
 
-static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
+static void init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
 		     phys_addr_t phys, pgprot_t prot,
 		     phys_addr_t (*pgtable_alloc)(int), int flags)
 {
 	unsigned long next;
-	pmd_t *pmdp;
 
-	pmdp = pmd_set_fixmap_offset(pudp, addr);
 	do {
 		pmd_t old_pmd = READ_ONCE(*pmdp);
 
@@ -268,8 +266,6 @@ static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
 		}
 		phys += next - addr;
 	} while (pmdp++, addr = next, addr != end);
-
-	pmd_clear_fixmap();
 }
 
 static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
@@ -279,6 +275,7 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 {
 	unsigned long next;
 	pud_t pud = READ_ONCE(*pudp);
+	pmd_t *pmdp;
 
 	/*
 	 * Check for initial section mappings in the pgd/pud.
@@ -297,6 +294,7 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 	}
 	BUG_ON(pud_bad(pud));
 
+	pmdp = pmd_set_fixmap_offset(pudp, addr);
 	do {
 		pgprot_t __prot = prot;
 
@@ -307,10 +305,13 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 		    (flags & NO_CONT_MAPPINGS) == 0)
 			__prot = __pgprot(pgprot_val(prot) | PTE_CONT);
 
-		init_pmd(pudp, addr, next, phys, __prot, pgtable_alloc, flags);
+		init_pmd(pmdp, addr, next, phys, __prot, pgtable_alloc, flags);
 
+		pmdp += pmd_index(next) - pmd_index(addr);
 		phys += next - addr;
 	} while (addr = next, addr != end);
+
+	pmd_clear_fixmap();
 }
 
 static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
-- 
2.25.1



* [PATCH v3 2/3] arm64: mm: Batch dsb and isb when populating pgtables
  2024-04-12 13:19 [PATCH v3 0/3] Speed up boot with faster linear map creation Ryan Roberts
  2024-04-12 13:19 ` [PATCH v3 1/3] arm64: mm: Don't remap pgtables per-cont(pte|pmd) block Ryan Roberts
@ 2024-04-12 13:19 ` Ryan Roberts
  2024-04-12 13:19 ` [PATCH v3 3/3] arm64: mm: Don't remap pgtables for allocate vs populate Ryan Roberts
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Ryan Roberts @ 2024-04-12 13:19 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Mark Rutland, Ard Biesheuvel,
	David Hildenbrand, Donald Dutile, Eric Chanudet
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, Itaru Kitayama

After removing unnecessary TLBIs, the next bottleneck when creating the
page tables for the linear map is DSB and ISB, which were previously
issued per-pte in __set_pte(). Since we are writing multiple ptes in a
given pte table, we can elide these barriers and insert them once we
have finished writing to the table.
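
A minimal sketch of the idea (illustrative only: ptep, nr, pfn and prot are
assumed to be set up by the caller, and in the series itself the trailing
barriers are provided by the maintenance in pte_clear_fixmap() at the end
of alloc_init_cont_pte() rather than open-coded like this):

  unsigned int i;

  /* Write a run of ptes with no per-entry synchronisation... */
  for (i = 0; i < nr; i++)
          __set_pte_nosync(&ptep[i], pfn_pte(pfn + i, prot));

  /* ...then publish the whole table to the walker with one dsb/isb pair. */
  dsb(ishst);
  isb();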

Execution time of map_mem(), which creates the kernel linear map page
tables, was measured on different machines with different RAM configs:

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
---------------|-------------|-------------|-------------|-------------
before         |   78   (0%) |  435   (0%) | 1723   (0%) |  3779   (0%)
after          |   11 (-86%) |  161 (-63%) |  656 (-62%) |  1654 (-56%)

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com>
Tested-by: Eric Chanudet <echanude@redhat.com>
---
 arch/arm64/include/asm/pgtable.h |  7 ++++++-
 arch/arm64/mm/mmu.c              | 11 ++++++++++-
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index afdd56d26ad7..105a95a8845c 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -271,9 +271,14 @@ static inline pte_t pte_mkdevmap(pte_t pte)
 	return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
 }
 
-static inline void __set_pte(pte_t *ptep, pte_t pte)
+static inline void __set_pte_nosync(pte_t *ptep, pte_t pte)
 {
 	WRITE_ONCE(*ptep, pte);
+}
+
+static inline void __set_pte(pte_t *ptep, pte_t pte)
+{
+	__set_pte_nosync(ptep, pte);
 
 	/*
 	 * Only if the new pte is valid and kernel, otherwise TLB maintenance
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 9f1d69b7b494..ac88b89770a6 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -178,7 +178,11 @@ static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
 	do {
 		pte_t old_pte = __ptep_get(ptep);
 
-		__set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
+		/*
+		 * Required barriers to make this visible to the table walker
+		 * are deferred to the end of alloc_init_cont_pte().
+		 */
+		__set_pte_nosync(ptep, pfn_pte(__phys_to_pfn(phys), prot));
 
 		/*
 		 * After the PTE entry has been populated once, we
@@ -232,6 +236,11 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 		phys += next - addr;
 	} while (addr = next, addr != end);
 
+	/*
+	 * Note: barriers and maintenance necessary to clear the fixmap slot
+	 * ensure that all previous pgtable writes are visible to the table
+	 * walker.
+	 */
 	pte_clear_fixmap();
 }
 
-- 
2.25.1



* [PATCH v3 3/3] arm64: mm: Don't remap pgtables for allocate vs populate
  2024-04-12 13:19 [PATCH v3 0/3] Speed up boot with faster linear map creation Ryan Roberts
  2024-04-12 13:19 ` [PATCH v3 1/3] arm64: mm: Don't remap pgtables per-cont(pte|pmd) block Ryan Roberts
  2024-04-12 13:19 ` [PATCH v3 2/3] arm64: mm: Batch dsb and isb when populating pgtables Ryan Roberts
@ 2024-04-12 13:19 ` Ryan Roberts
  2024-04-12 14:56 ` [PATCH v3 0/3] Speed up boot with faster linear map creation Mark Rutland
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Ryan Roberts @ 2024-04-12 13:19 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Mark Rutland, Ard Biesheuvel,
	David Hildenbrand, Donald Dutile, Eric Chanudet
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, Itaru Kitayama

During linear map pgtable creation, each pgtable is fixmapped /
fixunmapped twice; once during allocation to zero the memory, and again
during population to write the entries. This means each table has
2 TLB invalidations issued against it. Let's fix this so that each table
is only fixmapped/fixunmapped once, halving the number of TLBIs, and
improving performance.

Achieve this by separating allocation and initialization (zeroing) of
the page. The allocated page is now fixmapped directly by the walker and
initialized, before being populated and finally fixunmapped.
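
A simplified sketch of the resulting flow at the pte level (condensed from
the alloc_init_cont_pte() hunk below; the PXN handling and BUG_ON checks
are omitted):

  pte_phys = pgtable_alloc(PAGE_SHIFT);    /* allocation no longer zeroes  */
  ptep = pte_set_fixmap(pte_phys);         /* map the new table once       */
  init_clear_pgtable(ptep);                /* clear_page() + dsb(ishst)    */
  ptep += pte_index(addr);
  __pmd_populate(pmdp, pte_phys, pmdval);  /* install the table entry      */
  /* ... ptes written via init_pte() ... */
  pte_clear_fixmap();                      /* single unmap -> single TLBI  */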

This approach keeps the change small, but has the side effect that late
allocations (using __get_free_page()) must also go through the generic
memory clearing routine. So let's tell __get_free_page() not to zero the
memory to avoid duplication.

Additionally this approach means that fixmap/fixunmap is still used for
late pgtable modifications. That's not technically needed since the
memory is all mapped in the linear map by that point. That's left as a
possible future optimization if found to be needed.

Execution time of map_mem(), which creates the kernel linear map page
tables, was measured on different machines with different RAM configs:

               | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
               | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
---------------|-------------|-------------|-------------|-------------
               |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
---------------|-------------|-------------|-------------|-------------
before         |   11   (0%) |  161   (0%) |  656   (0%) |  1654   (0%)
after          |   10 (-11%) |  104 (-35%) |  438 (-33%) |  1223 (-26%)

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com>
Tested-by: Eric Chanudet <echanude@redhat.com>
---
 arch/arm64/include/asm/pgtable.h |  2 +
 arch/arm64/mm/mmu.c              | 67 +++++++++++++++++---------------
 2 files changed, 37 insertions(+), 32 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 105a95a8845c..92c9aed5e7af 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1010,6 +1010,8 @@ static inline p4d_t *p4d_offset_kimg(pgd_t *pgdp, u64 addr)
 
 static inline bool pgtable_l5_enabled(void) { return false; }
 
+#define p4d_index(addr)		(((addr) >> P4D_SHIFT) & (PTRS_PER_P4D - 1))
+
 /* Match p4d_offset folding in <asm/generic/pgtable-nop4d.h> */
 #define p4d_set_fixmap(addr)		NULL
 #define p4d_set_fixmap_offset(p4dp, addr)	((p4d_t *)p4dp)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ac88b89770a6..c927e9312f10 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -109,28 +109,12 @@ EXPORT_SYMBOL(phys_mem_access_prot);
 static phys_addr_t __init early_pgtable_alloc(int shift)
 {
 	phys_addr_t phys;
-	void *ptr;
 
 	phys = memblock_phys_alloc_range(PAGE_SIZE, PAGE_SIZE, 0,
 					 MEMBLOCK_ALLOC_NOLEAKTRACE);
 	if (!phys)
 		panic("Failed to allocate page table page\n");
 
-	/*
-	 * The FIX_{PGD,PUD,PMD} slots may be in active use, but the FIX_PTE
-	 * slot will be free, so we can (ab)use the FIX_PTE slot to initialise
-	 * any level of table.
-	 */
-	ptr = pte_set_fixmap(phys);
-
-	memset(ptr, 0, PAGE_SIZE);
-
-	/*
-	 * Implicit barriers also ensure the zeroed page is visible to the page
-	 * table walker
-	 */
-	pte_clear_fixmap();
-
 	return phys;
 }
 
@@ -172,6 +156,14 @@ bool pgattr_change_is_safe(u64 old, u64 new)
 	return ((old ^ new) & ~mask) == 0;
 }
 
+static void init_clear_pgtable(void *table)
+{
+	clear_page(table);
+
+	/* Ensure the zeroing is observed by page table walks. */
+	dsb(ishst);
+}
+
 static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
 		     phys_addr_t phys, pgprot_t prot)
 {
@@ -214,12 +206,15 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 			pmdval |= PMD_TABLE_PXN;
 		BUG_ON(!pgtable_alloc);
 		pte_phys = pgtable_alloc(PAGE_SHIFT);
+		ptep = pte_set_fixmap(pte_phys);
+		init_clear_pgtable(ptep);
+		ptep += pte_index(addr);
 		__pmd_populate(pmdp, pte_phys, pmdval);
-		pmd = READ_ONCE(*pmdp);
+	} else {
+		BUG_ON(pmd_bad(pmd));
+		ptep = pte_set_fixmap_offset(pmdp, addr);
 	}
-	BUG_ON(pmd_bad(pmd));
 
-	ptep = pte_set_fixmap_offset(pmdp, addr);
 	do {
 		pgprot_t __prot = prot;
 
@@ -298,12 +293,15 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 			pudval |= PUD_TABLE_PXN;
 		BUG_ON(!pgtable_alloc);
 		pmd_phys = pgtable_alloc(PMD_SHIFT);
+		pmdp = pmd_set_fixmap(pmd_phys);
+		init_clear_pgtable(pmdp);
+		pmdp += pmd_index(addr);
 		__pud_populate(pudp, pmd_phys, pudval);
-		pud = READ_ONCE(*pudp);
+	} else {
+		BUG_ON(pud_bad(pud));
+		pmdp = pmd_set_fixmap_offset(pudp, addr);
 	}
-	BUG_ON(pud_bad(pud));
 
-	pmdp = pmd_set_fixmap_offset(pudp, addr);
 	do {
 		pgprot_t __prot = prot;
 
@@ -340,12 +338,15 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
 			p4dval |= P4D_TABLE_PXN;
 		BUG_ON(!pgtable_alloc);
 		pud_phys = pgtable_alloc(PUD_SHIFT);
+		pudp = pud_set_fixmap(pud_phys);
+		init_clear_pgtable(pudp);
+		pudp += pud_index(addr);
 		__p4d_populate(p4dp, pud_phys, p4dval);
-		p4d = READ_ONCE(*p4dp);
+	} else {
+		BUG_ON(p4d_bad(p4d));
+		pudp = pud_set_fixmap_offset(p4dp, addr);
 	}
-	BUG_ON(p4d_bad(p4d));
 
-	pudp = pud_set_fixmap_offset(p4dp, addr);
 	do {
 		pud_t old_pud = READ_ONCE(*pudp);
 
@@ -395,12 +396,15 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
 			pgdval |= PGD_TABLE_PXN;
 		BUG_ON(!pgtable_alloc);
 		p4d_phys = pgtable_alloc(P4D_SHIFT);
+		p4dp = p4d_set_fixmap(p4d_phys);
+		init_clear_pgtable(p4dp);
+		p4dp += p4d_index(addr);
 		__pgd_populate(pgdp, p4d_phys, pgdval);
-		pgd = READ_ONCE(*pgdp);
+	} else {
+		BUG_ON(pgd_bad(pgd));
+		p4dp = p4d_set_fixmap_offset(pgdp, addr);
 	}
-	BUG_ON(pgd_bad(pgd));
 
-	p4dp = p4d_set_fixmap_offset(pgdp, addr);
 	do {
 		p4d_t old_p4d = READ_ONCE(*p4dp);
 
@@ -467,11 +471,10 @@ void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
 
 static phys_addr_t __pgd_pgtable_alloc(int shift)
 {
-	void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL);
-	BUG_ON(!ptr);
+	/* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
+	void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL & ~__GFP_ZERO);
 
-	/* Ensure the zeroed page is visible to the page table walker */
-	dsb(ishst);
+	BUG_ON(!ptr);
 	return __pa(ptr);
 }
 
-- 
2.25.1



* Re: [PATCH v3 0/3] Speed up boot with faster linear map creation
  2024-04-12 13:19 [PATCH v3 0/3] Speed up boot with faster linear map creation Ryan Roberts
                   ` (2 preceding siblings ...)
  2024-04-12 13:19 ` [PATCH v3 3/3] arm64: mm: Don't remap pgtables for allocate vs populate Ryan Roberts
@ 2024-04-12 14:56 ` Mark Rutland
  2024-04-12 15:00 ` Ard Biesheuvel
  2024-04-12 16:06 ` Will Deacon
  5 siblings, 0 replies; 9+ messages in thread
From: Mark Rutland @ 2024-04-12 14:56 UTC (permalink / raw)
  To: Ryan Roberts, Catalin Marinas, Will Deacon
  Cc: Ard Biesheuvel, David Hildenbrand, Donald Dutile, Eric Chanudet,
	linux-arm-kernel, linux-kernel

On Fri, Apr 12, 2024 at 02:19:05PM +0100, Ryan Roberts wrote:
> Hi All,
> 
> It turns out that creating the linear map can take a significant proportion of
> the total boot time, especially when rodata=full. And most of the time is spent
> waiting on superfluous tlb invalidation and memory barriers. This series reworks
> the kernel pgtable generation code to significantly reduce the number of those
> TLBIs, ISBs and DSBs. See each patch for details.
> 
> The below shows the execution time of map_mem() across a couple of different
> systems with different RAM configurations. We measure after applying each patch
> and show the improvement relative to base (v6.9-rc2):
> 
>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
> ---------------|-------------|-------------|-------------|-------------
>                |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
> ---------------|-------------|-------------|-------------|-------------
> base           |  168   (0%) | 2198   (0%) | 8644   (0%) | 17447   (0%)
> no-cont-remap  |   78 (-53%) |  435 (-80%) | 1723 (-80%) |  3779 (-78%)
> batch-barriers |   11 (-93%) |  161 (-93%) |  656 (-92%) |  1654 (-91%)
> no-alloc-remap |   10 (-94%) |  104 (-95%) |  438 (-95%) |  1223 (-93%)
> 
> This series applies on top of v6.9-rc2. All mm selftests pass. I've compile and
> boot tested various PAGE_SIZE and VA size configs.

Nice!

> Ryan Roberts (3):
>   arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>   arm64: mm: Batch dsb and isb when populating pgtables
>   arm64: mm: Don't remap pgtables for allocate vs populate

For the series:

Reviewed-by: Mark Rutland <mark.rutland@arm.com>

Catalin, Will, are you happy to pick this up?

Mark.


* Re: [PATCH v3 0/3] Speed up boot with faster linear map creation
  2024-04-12 13:19 [PATCH v3 0/3] Speed up boot with faster linear map creation Ryan Roberts
                   ` (3 preceding siblings ...)
  2024-04-12 14:56 ` [PATCH v3 0/3] Speed up boot with faster linear map creation Mark Rutland
@ 2024-04-12 15:00 ` Ard Biesheuvel
  2024-04-12 16:06 ` Will Deacon
  5 siblings, 0 replies; 9+ messages in thread
From: Ard Biesheuvel @ 2024-04-12 15:00 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Catalin Marinas, Will Deacon, Mark Rutland, David Hildenbrand,
	Donald Dutile, Eric Chanudet, linux-arm-kernel, linux-kernel

On Fri, 12 Apr 2024 at 15:19, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi All,
>
> It turns out that creating the linear map can take a significant proportion of
> the total boot time, especially when rodata=full. And most of the time is spent
> waiting on superfluous tlb invalidation and memory barriers. This series reworks
> the kernel pgtable generation code to significantly reduce the number of those
> TLBIs, ISBs and DSBs. See each patch for details.
>
> The below shows the execution time of map_mem() across a couple of different
> systems with different RAM configurations. We measure after applying each patch
> and show the improvement relative to base (v6.9-rc2):
>
>                | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra
>                | VM, 16G     | VM, 64G     | VM, 256G    | Metal, 512G
> ---------------|-------------|-------------|-------------|-------------
>                |   ms    (%) |   ms    (%) |   ms    (%) |    ms    (%)
> ---------------|-------------|-------------|-------------|-------------
> base           |  168   (0%) | 2198   (0%) | 8644   (0%) | 17447   (0%)
> no-cont-remap  |   78 (-53%) |  435 (-80%) | 1723 (-80%) |  3779 (-78%)
> batch-barriers |   11 (-93%) |  161 (-93%) |  656 (-92%) |  1654 (-91%)
> no-alloc-remap |   10 (-94%) |  104 (-95%) |  438 (-95%) |  1223 (-93%)
>
> This series applies on top of v6.9-rc2. All mm selftests pass. I've compile and
> boot tested various PAGE_SIZE and VA size configs.
...
>
> Ryan Roberts (3):
>   arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
>   arm64: mm: Batch dsb and isb when populating pgtables
>   arm64: mm: Don't remap pgtables for allocate vs populate
>

For the series,

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>


* Re: [PATCH v3 0/3] Speed up boot with faster linear map creation
  2024-04-12 13:19 [PATCH v3 0/3] Speed up boot with faster linear map creation Ryan Roberts
                   ` (4 preceding siblings ...)
  2024-04-12 15:00 ` Ard Biesheuvel
@ 2024-04-12 16:06 ` Will Deacon
  2024-04-10 12:53   ` Itaru Kitayama
  5 siblings, 1 reply; 9+ messages in thread
From: Will Deacon @ 2024-04-12 16:06 UTC (permalink / raw)
  To: Catalin Marinas, Mark Rutland, Ard Biesheuvel, David Hildenbrand,
	Donald Dutile, Eric Chanudet, Ryan Roberts
  Cc: kernel-team, Will Deacon, linux-arm-kernel, linux-kernel

On Fri, 12 Apr 2024 14:19:05 +0100, Ryan Roberts wrote:
> It turns out that creating the linear map can take a significant proportion of
> the total boot time, especially when rodata=full. And most of the time is spent
> waiting on superfluous tlb invalidation and memory barriers. This series reworks
> the kernel pgtable generation code to significantly reduce the number of those
> TLBIs, ISBs and DSBs. See each patch for details.
> 
> The below shows the execution time of map_mem() across a couple of different
> systems with different RAM configurations. We measure after applying each patch
> and show the improvement relative to base (v6.9-rc2):
> 
> [...]

Applied to arm64 (for-next/mm), thanks!

[1/3] arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
      https://git.kernel.org/arm64/c/5c63db59c5f8
[2/3] arm64: mm: Batch dsb and isb when populating pgtables
      https://git.kernel.org/arm64/c/1fcb7cea8a5f
[3/3] arm64: mm: Don't remap pgtables for allocate vs populate
      https://git.kernel.org/arm64/c/0e9df1c905d8

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


end of thread

Thread overview: 9+ messages
2024-04-12 13:19 [PATCH v3 0/3] Speed up boot with faster linear map creation Ryan Roberts
2024-04-12 13:19 ` [PATCH v3 1/3] arm64: mm: Don't remap pgtables per-cont(pte|pmd) block Ryan Roberts
2024-04-11 14:03   ` Itaru Kitayama
2024-04-12 13:19 ` [PATCH v3 2/3] arm64: mm: Batch dsb and isb when populating pgtables Ryan Roberts
2024-04-12 13:19 ` [PATCH v3 3/3] arm64: mm: Don't remap pgtables for allocate vs populate Ryan Roberts
2024-04-12 14:56 ` [PATCH v3 0/3] Speed up boot with faster linear map creation Mark Rutland
2024-04-12 15:00 ` Ard Biesheuvel
2024-04-12 16:06 ` Will Deacon
2024-04-10 12:53   ` Itaru Kitayama
