linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH RFC 00/17] arm64 kernel text replication
@ 2023-05-30 14:04 Russell King (Oracle)
  2023-05-30 14:04 ` [PATCH RFC 01/17] arm64: consolidate rox page protection logic Russell King (Oracle)
                   ` (17 more replies)
  0 siblings, 18 replies; 27+ messages in thread
From: Russell King (Oracle) @ 2023-05-30 14:04 UTC (permalink / raw)
  To: Catalin Marinas, Jonathan Corbet, Will Deacon; +Cc: linux-arm-kernel, linux-doc

Problem
-------

NUMA systems have greater latency when accessing data and instructions
across nodes, which can lead to a reduction in performance on CPU cores
that mainly perform accesses beyond their local node.

Normally when an ARM64 system boots, the kernel will end up placed in
memory, and each CPU core will have to fetch instructions and data from
whichever NUMA node the kernel has been placed in. This means that while
executing kernel code, CPUs local to that node will run faster than
CPUs in remote nodes.

The higher the latency to access remote NUMA node memory, the more the
kernel performance suffers on those nodes.

If there is a local copy of the kernel text in each node's RAM, and
each node runs the kernel using its local copy of the kernel text,
then it stands to reason that the kernel will run faster due to fewer
stalls while instructions are fetched from remote memory.

The question then arises how to achieve this.

Background
----------

An important issue to contend with is what happens when a thread
migrates between nodes. Essentially, the thread's state (including
instruction pointer) is saved to memory, and the scheduler on that CPU
loads some other thread's state and that CPU resumes executing that
new thread.

The CPU gaining the migrating thread loads the saved state, again
including the instruction pointer, and the gaining CPU resumes fetching
instructions at the virtual address where the original CPU left off.

The key point is that the virtual address is what matters here, and
this gives us a way to implement kernel text replication fairly easily.
At a practical level, all we need to do is to ensure that the virtual
addresses which contain the kernel text point to a local copy of
that text.

This is exactly how this proposal of kernel text replication achieves
the replication. We can go a little bit further and include most of
the read-only data in this replication, as that will never be written
to by the kernel (and thus remains constant.)

Solution
--------

So, what we need to achieve is:

1. multiple identical copies of the kernel text (and read-only data)
2. point the virtual mappings to the appropriate copy of kernel text
   for the NUMA node.

(1) is fairly easy to achieve - we just need to allocate some memory
in the appropriate node and copy the parts of the kernel we want to
replicate. However, we also need to deal with ARM64's kernel patching.
There are two functions that patch the kernel text,
__apply_alternatives() and aarch64_insn_patch_text_nosync(). Both of
these need to be modified to update all copies of the kernel text.
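
As a rough userspace illustration of point (1), the sketch below keeps
one copy of a small "text" buffer per node and applies every patch to
the original and to each copy at the same offset. The names NR_NODES,
text_copy, replicate_text() and patch_text() are hypothetical and exist
only for this sketch; the kernel additionally needs cache maintenance
on each copy, which is omitted here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NR_NODES 4	/* hypothetical node count for this sketch */

/* Per-node copies; slot 0 stays NULL because node 0 runs the original. */
static uint32_t *text_copy[NR_NODES];

/* Give every node other than node 0 its own copy of the original text. */
static int replicate_text(const uint32_t *text, size_t nr_insns)
{
	for (int nid = 1; nid < NR_NODES; nid++) {
		text_copy[nid] = malloc(nr_insns * sizeof(*text));
		if (!text_copy[nid])
			return -1;
		memcpy(text_copy[nid], text, nr_insns * sizeof(*text));
	}
	return 0;
}

/* Patch one instruction in the original and at the same index in each copy. */
static void patch_text(uint32_t *text, size_t nr_insns, size_t idx,
		       uint32_t insn)
{
	assert(idx < nr_insns);

	text[idx] = insn;
	for (int nid = 1; nid < NR_NODES; nid++)
		if (text_copy[nid])
			text_copy[nid][idx] = insn;
}
```

Calling replicate_text() once and then routing every modification
through patch_text() keeps all copies byte-for-byte identical, which is
the property the replicated kernel text relies on.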

(2) is slightly harder.

Firstly, the aarch64 architecture has a very useful feature here - the
kernel page tables are entirely separate from the user page tables.
The hardware contains two page table pointers, one is used for user
mappings, the other is used for kernel mappings.

Therefore, we only have one page table to be concerned with: the table
which maps kernel space. We do not need to be concerned with each
user process's page table.

The approach taken here is to ensure that the kernel is located in an
area of kernel virtual address space covered by a level-0 page table
entry which is not shared with any other user. We can then maintain
separate per-node level-0 page tables for kernel space where the only
difference between them is this level-0 page table entry.

This gives a couple of benefits. Firstly, when updates to the level-0
page table happen (e.g. when establishing new mappings) these updates
can simply be copied to the other level-0 page tables provided it isn't
for the kernel image. Secondly, we don't need complexity at lower
levels of the page table code to figure out whether a level-1 or lower
update needs to be propagated to other nodes.

The level-0 page table entry for the kernel can then be used to point
at a node-unique set of level 1..N page tables to map the appropriate
copy of the kernel text (and read-only data) into kernel space, while
keeping the kernel read-write data shared between nodes.
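
The level-0 scheme described above can be modelled in a few lines. This
is only a sketch under assumed values: l0_table, kernel_idx,
set_kernel_l0() and set_shared_l0() are hypothetical names, and the
table entries are plain integers rather than real descriptors. It shows
that shared updates are copied to every node's level-0 table, while the
kernel's slot is set per node and never propagated.

```c
#include <assert.h>
#include <stdint.h>

#define NR_NODES	2	/* hypothetical node count */
#define PTRS_PER_L0	512	/* level-0 entries with 4K pages */

/* One level-0 table per node. */
static uint64_t l0_table[NR_NODES][PTRS_PER_L0];

/* Hypothetical level-0 slot reserved for the kernel image. */
static const unsigned int kernel_idx = 257;

/* Point one node's kernel slot at its node-unique lower-level tables. */
static void set_kernel_l0(int nid, uint64_t entry)
{
	assert(nid >= 0 && nid < NR_NODES);
	l0_table[nid][kernel_idx] = entry;
}

/*
 * Update a shared level-0 entry by copying it to every node's table.
 * The kernel slot is refused because it differs between nodes.
 */
static int set_shared_l0(unsigned int idx, uint64_t entry)
{
	if (idx >= PTRS_PER_L0 || idx == kernel_idx)
		return -1;

	for (int nid = 0; nid < NR_NODES; nid++)
		l0_table[nid][idx] = entry;
	return 0;
}
```

Because the only per-node difference lives in a single level-0 slot,
nothing below level 0 needs to know about replication for the shared
mappings.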

Performance Analysis
--------------------

Needless to say, the performance results from kernel text replication
are workload specific, but appear to show a gain of between 6% and
17% for database-centric workloads. When combined with userspace
awareness of NUMA, this can result in a gain of over 50%.

Problems
--------

There are a few areas that are a problem for kernel text replication:
1) As this series changes the kernel space virtual address space
   layout, it breaks KASAN - and I've zero knowledge of KASAN so I
   have no idea how to fix it. I would be grateful for suggestions
   from KASAN folk on how to fix this.

2) KASLR cannot be used with kernel text replication, since we need
   to place the kernel in its own L0 page table entry, not in vmalloc
   space. KASLR is disabled when support for kernel text replication
   is enabled.

3) Changing the kernel virtual address space layout also means that
   kaslr_offset() and kaslr_enabled() need to become macros rather
   than inline functions due to the use of PGDIR_SIZE in the
   calculation of KIMAGE_VADDR. Since asm/pgtable.h defines this
   constant, but asm/memory.h is included by asm/pgtable.h, having
   this symbol available would produce a circular include
   dependency, so I don't think there is any choice here.

4) read-only protection for replicated kernel images is not yet
   implemented.

Patch overview:

Patch 1 cleans up the rox page protection logic.
Patch 2 reorganises the kernel virtual address space layout (causing
  problems 1 and 3).
Patch 3 provides a version of cpu_replace_ttbr1 that takes physical
  addresses.
Patch 4 makes a needed cache flushing function visible.
Patches 5 through 16 are the guts of kernel text replication.
Patch 17 adds the Kconfig entry for it.

Further patches not included in this set add a Kconfig for the default
state, a test module, and add code to verify the replicated kernel
text matches the node 0 text after the kernel has completed most of
its boot.

 Documentation/admin-guide/kernel-parameters.txt |   5 +
 arch/arm64/Kconfig                              |  10 +-
 arch/arm64/include/asm/cacheflush.h             |   2 +
 arch/arm64/include/asm/ktext.h                  |  45 ++++++
 arch/arm64/include/asm/memory.h                 |  26 ++--
 arch/arm64/include/asm/mmu_context.h            |  12 +-
 arch/arm64/include/asm/pgtable.h                |  35 ++++-
 arch/arm64/include/asm/smp.h                    |   1 +
 arch/arm64/kernel/alternative.c                 |   4 +-
 arch/arm64/kernel/asm-offsets.c                 |   1 +
 arch/arm64/kernel/cpufeature.c                  |   2 +-
 arch/arm64/kernel/head.S                        |   3 +-
 arch/arm64/kernel/hibernate.c                   |   2 +-
 arch/arm64/kernel/patching.c                    |   7 +-
 arch/arm64/kernel/smp.c                         |   3 +
 arch/arm64/kernel/suspend.c                     |   3 +-
 arch/arm64/kernel/vmlinux.lds.S                 |   3 +
 arch/arm64/mm/Makefile                          |   2 +
 arch/arm64/mm/init.c                            |   3 +
 arch/arm64/mm/ktext.c                           | 198 ++++++++++++++++++++++++
 arch/arm64/mm/mmu.c                             |  85 ++++++++--
 21 files changed, 413 insertions(+), 39 deletions(-)
 create mode 100644 arch/arm64/include/asm/ktext.h
 create mode 100644 arch/arm64/mm/ktext.c


-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


* [PATCH RFC 01/17] arm64: consolidate rox page protection logic
  2023-05-30 14:04 [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
@ 2023-05-30 14:04 ` Russell King (Oracle)
  2023-05-30 14:04 ` [PATCH RFC 02/17] arm64: place kernel in its own L0 page table entry Russell King (Oracle)
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 27+ messages in thread
From: Russell King (Oracle) @ 2023-05-30 14:04 UTC (permalink / raw)
  To: Catalin Marinas, Jonathan Corbet, Will Deacon; +Cc: linux-arm-kernel, linux-doc

Consolidate the arm64 decision making for the page protections used
for executable pages, used by both the trampoline code and the kernel
text mapping code.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm64/mm/mmu.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index af6bc8403ee4..4829abe017e9 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -663,12 +663,17 @@ static void __init map_kernel_segment(pgd_t *pgdp, void *va_start, void *va_end,
 	vm_area_add_early(vma);
 }
 
+static pgprot_t kernel_exec_prot(void)
+{
+	return rodata_enabled ? PAGE_KERNEL_ROX : PAGE_KERNEL_EXEC;
+}
+
 #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
 static int __init map_entry_trampoline(void)
 {
 	int i;
 
-	pgprot_t prot = rodata_enabled ? PAGE_KERNEL_ROX : PAGE_KERNEL_EXEC;
+	pgprot_t prot = kernel_exec_prot();
 	phys_addr_t pa_start = __pa_symbol(__entry_tramp_text_start);
 
 	/* The trampoline is always mapped and can therefore be global */
@@ -723,7 +728,7 @@ static void __init map_kernel(pgd_t *pgdp)
 	 * mapping to install SW breakpoints. Allow this (only) when
 	 * explicitly requested with rodata=off.
 	 */
-	pgprot_t text_prot = rodata_enabled ? PAGE_KERNEL_ROX : PAGE_KERNEL_EXEC;
+	pgprot_t text_prot = kernel_exec_prot();
 
 	/*
 	 * If we have a CPU that supports BTI and a kernel built for
-- 
2.30.2



* [PATCH RFC 02/17] arm64: place kernel in its own L0 page table entry
  2023-05-30 14:04 [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
  2023-05-30 14:04 ` [PATCH RFC 01/17] arm64: consolidate rox page protection logic Russell King (Oracle)
@ 2023-05-30 14:04 ` Russell King (Oracle)
       [not found]   ` <ZIb+Lg9F9b4ay90p@FVFF77S0Q05N>
  2023-05-30 14:04 ` [PATCH RFC 03/17] arm64: provide cpu_replace_ttbr1_phys() Russell King (Oracle)
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 27+ messages in thread
From: Russell King (Oracle) @ 2023-05-30 14:04 UTC (permalink / raw)
  To: Catalin Marinas, Jonathan Corbet, Will Deacon; +Cc: linux-arm-kernel, linux-doc

Kernel text replication needs to maintain separate per-node page
tables for the kernel text. In order to do this without affecting
other kernel memory mappings, placing the kernel such that it does
not share a L0 page table entry with any other mapping is desirable.

Prior to this commit, the layout without KASLR was:

+----------+
|  vmalloc |
+----------+
|  Kernel  |
+----------+ MODULES_END, VMALLOC_START, KIMAGE_VADDR =
|  Modules |                 MODULES_VADDR + MODULES_VSIZE
+----------+ MODULES_VADDR = _PAGE_END(VA_BITS_MIN)
| VA space |
+----------+ 0

This becomes:

+----------+
|  vmalloc |
+----------+ VMALLOC_START = MODULES_END + PGDIR_SIZE
|  Kernel  |
+----------+ MODULES_END, KIMAGE_VADDR = _PAGE_END(VA_BITS_MIN) + PGDIR_SIZE
|  Modules |
+----------+ MODULES_VADDR = MODULES_END - MODULES_VSIZE
| VA space |
+----------+ 0

This assumes MODULES_VSIZE (128M) <= PGDIR_SIZE.
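
To make the new layout concrete, here is a worked calculation for one
assumed configuration: 48-bit virtual addresses with 4K pages, giving a
PGDIR_SHIFT of 39 and 512 level-0 entries. Those configuration values,
the PAGE_END_FOR() helper (standing in for _PAGE_END()) and
kernel_has_own_l0() are assumptions of this sketch and not part of the
patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed configuration: 48-bit VA, 4K pages, four levels of tables. */
#define VA_BITS_MIN	48
#define PGDIR_SHIFT	39
#define PTRS_PER_PGD	512
#define PGDIR_SIZE	(UINT64_C(1) << PGDIR_SHIFT)
#define SZ_128M		UINT64_C(0x08000000)

/* Stand-in for _PAGE_END(): start of the upper half of kernel VA space. */
#define PAGE_END_FOR(va)	(-(UINT64_C(1) << ((va) - 1)))

/* The layout from the second diagram. */
#define KIMAGE_VADDR	(PAGE_END_FOR(VA_BITS_MIN) + PGDIR_SIZE)
#define MODULES_END	KIMAGE_VADDR
#define MODULES_VSIZE	SZ_128M
#define MODULES_VADDR	(MODULES_END - MODULES_VSIZE)
#define VMALLOC_START	(MODULES_END + PGDIR_SIZE)

/* The layout only works if the modules fit below the kernel's entry. */
static_assert(MODULES_VSIZE <= PGDIR_SIZE, "modules must fit in one L0 entry");

/* Index of the level-0 entry covering a virtual address. */
static unsigned int pgd_index(uint64_t addr)
{
	return (unsigned int)((addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1));
}

/* Returns 1 if the kernel shares its L0 entry with neither neighbour. */
static int kernel_has_own_l0(void)
{
	unsigned int kidx = pgd_index(KIMAGE_VADDR);

	return pgd_index(MODULES_VADDR) != kidx &&
	       pgd_index(VMALLOC_START) != kidx;
}
```

Under these assumptions the modules land in level-0 index 256, the
kernel image in index 257 and vmalloc starts at index 258, so the
kernel has an entry to itself.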

One side effect of this change is that KIMAGE_VADDR's definition now
includes PGDIR_SIZE (to leave room for the modules) but this is not
defined when asm/memory.h is included. This means KIMAGE_VADDR cannot
be used in inline functions within this file, so we convert
kaslr_offset() and kaslr_enabled() to be macros instead.
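
The reason a macro works where an inline function does not can be shown
with a single-file sketch. All names here (IMAGE_BASE, DIR_SIZE,
image_addr, image_offset(), get_image_offset()) are hypothetical
stand-ins; the point is only that a macro body is expanded where it is
used, whereas an inline function body is parsed where it is defined.

```c
#include <assert.h>

/* Plays the role of asm/memory.h: DIR_SIZE is not defined yet. */
#define IMAGE_BASE	(0x1000UL + DIR_SIZE)

static unsigned long image_addr = 0x5000UL;	/* hypothetical value */

/* Accepted here: nothing is expanded until the macro is used. */
#define image_offset()	((unsigned long)(image_addr - IMAGE_BASE))

/*
 * An inline function at this point would fail to compile, because its
 * body is parsed immediately and DIR_SIZE is still undeclared:
 *
 *	static inline unsigned long image_offset_fn(void)
 *	{
 *		return image_addr - IMAGE_BASE;
 *	}
 */

/* Plays the role of asm/pgtable.h, which is processed later. */
#define DIR_SIZE	0x2000UL

/* By now DIR_SIZE is visible, so IMAGE_BASE expands to a constant. */
static_assert(IMAGE_BASE == 0x3000UL, "expanded after DIR_SIZE was defined");

static unsigned long get_image_offset(void)
{
	return image_offset();
}
```

The cost of this approach is that the macros lose the type checking an
inline function would give, which is why the explicit cast is kept in
the macro body.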

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm64/include/asm/memory.h  | 26 ++++++++++----------------
 arch/arm64/include/asm/pgtable.h |  2 +-
 arch/arm64/mm/mmu.c              |  3 ++-
 3 files changed, 13 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index c735afdf639b..089f556b7387 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -43,9 +43,9 @@
 #define VA_BITS			(CONFIG_ARM64_VA_BITS)
 #define _PAGE_OFFSET(va)	(-(UL(1) << (va)))
 #define PAGE_OFFSET		(_PAGE_OFFSET(VA_BITS))
-#define KIMAGE_VADDR		(MODULES_END)
-#define MODULES_END		(MODULES_VADDR + MODULES_VSIZE)
-#define MODULES_VADDR		(_PAGE_END(VA_BITS_MIN))
+#define KIMAGE_VADDR		(_PAGE_END(VA_BITS_MIN) + PGDIR_SIZE)
+#define MODULES_END		(KIMAGE_VADDR)
+#define MODULES_VADDR		(MODULES_END - MODULES_VSIZE)
 #define MODULES_VSIZE		(SZ_128M)
 #define VMEMMAP_START		(-(UL(1) << (VA_BITS - VMEMMAP_SHIFT)))
 #define VMEMMAP_END		(VMEMMAP_START + VMEMMAP_SIZE)
@@ -199,20 +199,14 @@ extern u64			kimage_vaddr;
 /* the offset between the kernel virtual and physical mappings */
 extern u64			kimage_voffset;
 
-static inline unsigned long kaslr_offset(void)
-{
-	return kimage_vaddr - KIMAGE_VADDR;
-}
+#define kaslr_offset()	((unsigned long)(kimage_vaddr - KIMAGE_VADDR))
 
-static inline bool kaslr_enabled(void)
-{
-	/*
-	 * The KASLR offset modulo MIN_KIMG_ALIGN is taken from the physical
-	 * placement of the image rather than from the seed, so a displacement
-	 * of less than MIN_KIMG_ALIGN means that no seed was provided.
-	 */
-	return kaslr_offset() >= MIN_KIMG_ALIGN;
-}
+/*
+ * The KASLR offset modulo MIN_KIMG_ALIGN is taken from the physical
+ * placement of the image rather than from the seed, so a displacement
+ * of less than MIN_KIMG_ALIGN means that no seed was provided.
+ */
+#define kaslr_enabled()	(kaslr_offset() >= MIN_KIMG_ALIGN)
 
 /*
  * Allow all memory at the discovery stage. We will clip it later.
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0bd18de9fd97..cb526e69299d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -21,7 +21,7 @@
  * VMALLOC_END: extends to the available space below vmemmap, PCI I/O space
  *	and fixed mappings
  */
-#define VMALLOC_START		(MODULES_END)
+#define VMALLOC_START		(MODULES_END + PGDIR_SIZE)
 #define VMALLOC_END		(VMEMMAP_START - SZ_256M)
 
 #define vmemmap			((struct page *)VMEMMAP_START - (memstart_addr >> PAGE_SHIFT))
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 4829abe017e9..baf74d0c43c9 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -478,7 +478,8 @@ void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
 static void update_mapping_prot(phys_addr_t phys, unsigned long virt,
 				phys_addr_t size, pgprot_t prot)
 {
-	if ((virt >= PAGE_END) && (virt < VMALLOC_START)) {
+	if ((virt >= PAGE_END) && (virt < VMALLOC_START) &&
+	    !is_kernel(virt)) {
 		pr_warn("BUG: not updating mapping for %pa at 0x%016lx - outside kernel range\n",
 			&phys, virt);
 		return;
-- 
2.30.2



* [PATCH RFC 03/17] arm64: provide cpu_replace_ttbr1_phys()
  2023-05-30 14:04 [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
  2023-05-30 14:04 ` [PATCH RFC 01/17] arm64: consolidate rox page protection logic Russell King (Oracle)
  2023-05-30 14:04 ` [PATCH RFC 02/17] arm64: place kernel in its own L0 page table entry Russell King (Oracle)
@ 2023-05-30 14:04 ` Russell King (Oracle)
  2023-05-30 14:04 ` [PATCH RFC 04/17] arm64: make clean_dcache_range_nopatch() visible Russell King (Oracle)
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 27+ messages in thread
From: Russell King (Oracle) @ 2023-05-30 14:04 UTC (permalink / raw)
  To: Catalin Marinas, Jonathan Corbet, Will Deacon; +Cc: linux-arm-kernel, linux-doc

Provide cpu_replace_ttbr1_phys(), a version of cpu_replace_ttbr1() which
operates using a physical address rather than the virtual address of the
page tables.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm64/include/asm/mmu_context.h | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index 56911691bef0..4aa6afc6f935 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -148,7 +148,7 @@ static inline void cpu_install_ttbr0(phys_addr_t ttbr0, unsigned long t0sz)
  * Atomically replaces the active TTBR1_EL1 PGD with a new VA-compatible PGD,
  * avoiding the possibility of conflicting TLB entries being allocated.
  */
-static inline void cpu_replace_ttbr1(pgd_t *pgdp, pgd_t *idmap)
+static inline void cpu_replace_ttbr1_phys(phys_addr_t pgd_phys, pgd_t *idmap)
 {
 	typedef void (ttbr_replace_func)(phys_addr_t);
 	extern ttbr_replace_func idmap_cpu_replace_ttbr1;
@@ -156,9 +156,10 @@ static inline void cpu_replace_ttbr1(pgd_t *pgdp, pgd_t *idmap)
 	unsigned long daif;
 
 	/* phys_to_ttbr() zeros lower 2 bits of ttbr with 52-bit PA */
-	phys_addr_t ttbr1 = phys_to_ttbr(virt_to_phys(pgdp));
+	phys_addr_t ttbr1 = phys_to_ttbr(pgd_phys);
 
-	if (system_supports_cnp() && !WARN_ON(pgdp != lm_alias(swapper_pg_dir))) {
+	if (system_supports_cnp() &&
+	    !WARN_ON(pgd_phys != virt_to_phys(lm_alias(swapper_pg_dir)))) {
 		/*
 		 * cpu_replace_ttbr1() is used when there's a boot CPU
 		 * up (i.e. cpufeature framework is not up yet) and
@@ -185,6 +186,11 @@ static inline void cpu_replace_ttbr1(pgd_t *pgdp, pgd_t *idmap)
 	cpu_uninstall_idmap();
 }
 
+static inline void __nocfi cpu_replace_ttbr1(pgd_t *pgdp, pgd_t *idmap)
+{
+	cpu_replace_ttbr1_phys(virt_to_phys(pgdp), idmap);
+}
+
 /*
  * It would be nice to return ASIDs back to the allocator, but unfortunately
  * that introduces a race with a generation rollover where we could erroneously
-- 
2.30.2



* [PATCH RFC 04/17] arm64: make clean_dcache_range_nopatch() visible
  2023-05-30 14:04 [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
                   ` (2 preceding siblings ...)
  2023-05-30 14:04 ` [PATCH RFC 03/17] arm64: provide cpu_replace_ttbr1_phys() Russell King (Oracle)
@ 2023-05-30 14:04 ` Russell King (Oracle)
  2023-05-30 14:04 ` [PATCH RFC 05/17] arm64: text replication: add init function Russell King (Oracle)
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 27+ messages in thread
From: Russell King (Oracle) @ 2023-05-30 14:04 UTC (permalink / raw)
  To: Catalin Marinas, Jonathan Corbet, Will Deacon; +Cc: linux-arm-kernel, linux-doc

When we hook into the kernel text patching code, we will need to call
clean_dcache_range_nopatch() to ensure that the patching of the
replicated kernel text is properly visible to other CPUs. Make this
function available to the replication code.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm64/include/asm/cacheflush.h | 2 ++
 arch/arm64/kernel/alternative.c     | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index 37185e978aeb..ac9ad56d5212 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -104,6 +104,8 @@ static inline void flush_icache_range(unsigned long start, unsigned long end)
 }
 #define flush_icache_range flush_icache_range
 
+void clean_dcache_range_nopatch(u64 start, u64 end);
+
 /*
  * Copy user data from/to a page which is mapped into a different
  * processes address space.  Really, we want to allow our "user
diff --git a/arch/arm64/kernel/alternative.c b/arch/arm64/kernel/alternative.c
index d32d4ed5519b..df9a73458a85 100644
--- a/arch/arm64/kernel/alternative.c
+++ b/arch/arm64/kernel/alternative.c
@@ -121,7 +121,7 @@ static noinstr void patch_alternative(struct alt_instr *alt,
  * accidentally call into the cache.S code, which is patched by us at
  * runtime.
  */
-static void clean_dcache_range_nopatch(u64 start, u64 end)
+void clean_dcache_range_nopatch(u64 start, u64 end)
 {
 	u64 cur, d_size, ctr_el0;
 
-- 
2.30.2



* [PATCH RFC 05/17] arm64: text replication: add init function
  2023-05-30 14:04 [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
                   ` (3 preceding siblings ...)
  2023-05-30 14:04 ` [PATCH RFC 04/17] arm64: make clean_dcache_range_nopatch() visible Russell King (Oracle)
@ 2023-05-30 14:04 ` Russell King (Oracle)
  2023-05-30 14:05 ` [PATCH RFC 06/17] arm64: text replication: add sanity checks Russell King (Oracle)
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 27+ messages in thread
From: Russell King (Oracle) @ 2023-05-30 14:04 UTC (permalink / raw)
  To: Catalin Marinas, Jonathan Corbet, Will Deacon; +Cc: linux-arm-kernel, linux-doc

A simple patch that adds an empty function for kernel text replication
initialisation and hooks it into the initialisation path.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm64/include/asm/ktext.h | 20 ++++++++++++++++++++
 arch/arm64/mm/Makefile         |  2 ++
 arch/arm64/mm/init.c           |  3 +++
 arch/arm64/mm/ktext.c          |  8 ++++++++
 4 files changed, 33 insertions(+)
 create mode 100644 arch/arm64/include/asm/ktext.h
 create mode 100644 arch/arm64/mm/ktext.c

diff --git a/arch/arm64/include/asm/ktext.h b/arch/arm64/include/asm/ktext.h
new file mode 100644
index 000000000000..1a5f7452a3bf
--- /dev/null
+++ b/arch/arm64/include/asm/ktext.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2022, Oracle and/or its affiliates.
+ */
+#ifndef ASM_KTEXT_H
+#define ASM_KTEXT_H
+
+#ifdef CONFIG_REPLICATE_KTEXT
+
+void ktext_replication_init(void);
+
+#else
+
+static inline void ktext_replication_init(void)
+{
+}
+
+#endif
+
+#endif
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index dbd1bc95967d..41e705027c57 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -14,3 +14,5 @@ KASAN_SANITIZE_physaddr.o	+= n
 
 obj-$(CONFIG_KASAN)		+= kasan_init.o
 KASAN_SANITIZE_kasan_init.o	:= n
+
+obj-$(CONFIG_REPLICATE_KTEXT)	+= ktext.o
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 66e70ca47680..a0e4f2d93ee6 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -36,6 +36,7 @@
 #include <asm/fixmap.h>
 #include <asm/kasan.h>
 #include <asm/kernel-pgtable.h>
+#include <asm/ktext.h>
 #include <asm/kvm_host.h>
 #include <asm/memory.h>
 #include <asm/numa.h>
@@ -401,6 +402,8 @@ void __init bootmem_init(void)
 
 	arch_numa_init();
 
+	ktext_replication_init();
+
 	/*
 	 * must be done after arch_numa_init() which calls numa_init() to
 	 * initialize node_online_map that gets used in hugetlb_cma_reserve()
diff --git a/arch/arm64/mm/ktext.c b/arch/arm64/mm/ktext.c
new file mode 100644
index 000000000000..3a8d37c9abc4
--- /dev/null
+++ b/arch/arm64/mm/ktext.c
@@ -0,0 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2022, Oracle and/or its affiliates.
+ */
+
+void __init ktext_replication_init(void)
+{
+}
-- 
2.30.2



* [PATCH RFC 06/17] arm64: text replication: add sanity checks
  2023-05-30 14:04 [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
                   ` (4 preceding siblings ...)
  2023-05-30 14:04 ` [PATCH RFC 05/17] arm64: text replication: add init function Russell King (Oracle)
@ 2023-05-30 14:05 ` Russell King (Oracle)
  2023-05-30 14:05 ` [PATCH RFC 07/17] arm64: text replication: copy initial kernel text Russell King (Oracle)
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 27+ messages in thread
From: Russell King (Oracle) @ 2023-05-30 14:05 UTC (permalink / raw)
  To: Catalin Marinas, Jonathan Corbet, Will Deacon; +Cc: linux-arm-kernel, linux-doc

The kernel text and modules must be in separate L0 page table entries.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm64/mm/ktext.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/arm64/mm/ktext.c b/arch/arm64/mm/ktext.c
index 3a8d37c9abc4..901f159c65e6 100644
--- a/arch/arm64/mm/ktext.c
+++ b/arch/arm64/mm/ktext.c
@@ -3,6 +3,27 @@
  * Copyright (C) 2022, Oracle and/or its affiliates.
  */
 
+#include <linux/kernel.h>
+#include <linux/pgtable.h>
+
+#include <asm/ktext.h>
+#include <asm/memory.h>
+
 void __init ktext_replication_init(void)
 {
+	int kidx = pgd_index((phys_addr_t)KERNEL_START);
+
+	/*
+	 * If we've messed up and the kernel shares a L0 entry with the
+	 * module or vmalloc area, then don't even attempt to use text
+	 * replication.
+	 */
+	if (pgd_index(MODULES_VADDR) == kidx) {
+		pr_warn("Kernel is located in the same L0 index as modules - text replication disabled\n");
+		return;
+	}
+	if (pgd_index(VMALLOC_START) == kidx) {
+		pr_warn("Kernel is located in the same L0 index as vmalloc - text replication disabled\n");
+		return;
+	}
 }
-- 
2.30.2



* [PATCH RFC 07/17] arm64: text replication: copy initial kernel text
  2023-05-30 14:04 [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
                   ` (5 preceding siblings ...)
  2023-05-30 14:05 ` [PATCH RFC 06/17] arm64: text replication: add sanity checks Russell King (Oracle)
@ 2023-05-30 14:05 ` Russell King (Oracle)
  2023-05-30 14:05 ` [PATCH RFC 08/17] arm64: text replication: add node text patching Russell King (Oracle)
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 27+ messages in thread
From: Russell King (Oracle) @ 2023-05-30 14:05 UTC (permalink / raw)
  To: Catalin Marinas, Jonathan Corbet, Will Deacon; +Cc: linux-arm-kernel, linux-doc

Allocate memory on the appropriate node for the per-node copies of the
kernel text, and copy the kernel text to that memory. Clean and
invalidate the caches to the point of unification so that the copied
text is correctly visible to the target node.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm64/mm/ktext.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/arm64/mm/ktext.c b/arch/arm64/mm/ktext.c
index 901f159c65e6..4c803b89fcfe 100644
--- a/arch/arm64/mm/ktext.c
+++ b/arch/arm64/mm/ktext.c
@@ -4,14 +4,23 @@
  */
 
 #include <linux/kernel.h>
+#include <linux/memblock.h>
+#include <linux/numa.h>
 #include <linux/pgtable.h>
+#include <linux/string.h>
 
+#include <asm/cacheflush.h>
 #include <asm/ktext.h>
 #include <asm/memory.h>
 
+static void *kernel_texts[MAX_NUMNODES];
+
+/* Allocate memory for the replicated kernel texts. */
 void __init ktext_replication_init(void)
 {
+	size_t size = _etext - _stext;
 	int kidx = pgd_index((phys_addr_t)KERNEL_START);
+	int nid;
 
 	/*
 	 * If we've messed up and the kernel shares a L0 entry with the
@@ -26,4 +35,16 @@ void __init ktext_replication_init(void)
 		pr_warn("Kernel is located in the same L0 index as vmalloc - text replication disabled\n");
 		return;
 	}
+
+	for_each_node(nid) {
+		/* Nothing to do for node 0 */
+		if (!nid)
+			continue;
+
+		/* Allocate and copy initial kernel text for this node */
+		kernel_texts[nid] = memblock_alloc_node(size, PAGE_SIZE, nid);
+		memcpy(kernel_texts[nid], _stext, size);
+		caches_clean_inval_pou((u64)kernel_texts[nid],
+				       (u64)kernel_texts[nid] + size);
+	}
 }
-- 
2.30.2



* [PATCH RFC 08/17] arm64: text replication: add node text patching
  2023-05-30 14:04 [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
                   ` (6 preceding siblings ...)
  2023-05-30 14:05 ` [PATCH RFC 07/17] arm64: text replication: copy initial kernel text Russell King (Oracle)
@ 2023-05-30 14:05 ` Russell King (Oracle)
  2023-05-30 14:05 ` [PATCH RFC 09/17] arm64: text replication: add node 0 page table definitions Russell King (Oracle)
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 27+ messages in thread
From: Russell King (Oracle) @ 2023-05-30 14:05 UTC (permalink / raw)
  To: Catalin Marinas, Jonathan Corbet, Will Deacon; +Cc: linux-arm-kernel, linux-doc

Add support for text patching on our replicated texts.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm64/include/asm/ktext.h  | 12 +++++++
 arch/arm64/kernel/alternative.c |  2 ++
 arch/arm64/kernel/patching.c    |  7 +++-
 arch/arm64/mm/ktext.c           | 58 +++++++++++++++++++++++++++++++++
 4 files changed, 78 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/ktext.h b/arch/arm64/include/asm/ktext.h
index 1a5f7452a3bf..289e11289c06 100644
--- a/arch/arm64/include/asm/ktext.h
+++ b/arch/arm64/include/asm/ktext.h
@@ -5,9 +5,13 @@
 #ifndef ASM_KTEXT_H
 #define ASM_KTEXT_H
 
+#include <linux/kprobes.h>
+
 #ifdef CONFIG_REPLICATE_KTEXT
 
 void ktext_replication_init(void);
+void __kprobes ktext_replication_patch(u32 *tp,  __le32 insn);
+void ktext_replication_patch_alternative(__le32 *src, int nr_inst);
 
 #else
 
@@ -15,6 +19,14 @@ static inline void ktext_replication_init(void)
 {
 }
 
+static inline void __kprobes ktext_replication_patch(u32 *tp,  __le32 insn)
+{
+}
+
+static inline void ktext_replication_patch_alternative(__le32 *src, int nr_inst)
+{
+}
+
 #endif
 
 #endif
diff --git a/arch/arm64/kernel/alternative.c b/arch/arm64/kernel/alternative.c
index df9a73458a85..6a897d1dda76 100644
--- a/arch/arm64/kernel/alternative.c
+++ b/arch/arm64/kernel/alternative.c
@@ -15,6 +15,7 @@
 #include <asm/alternative.h>
 #include <asm/cpufeature.h>
 #include <asm/insn.h>
+#include <asm/ktext.h>
 #include <asm/module.h>
 #include <asm/sections.h>
 #include <asm/vdso.h>
@@ -174,6 +175,7 @@ static void __apply_alternatives(const struct alt_region *region,
 		alt_cb(alt, origptr, updptr, nr_inst);
 
 		if (!is_module) {
+			ktext_replication_patch_alternative(updptr, nr_inst);
 			clean_dcache_range_nopatch((u64)origptr,
 						   (u64)(origptr + nr_inst));
 		}
diff --git a/arch/arm64/kernel/patching.c b/arch/arm64/kernel/patching.c
index b4835f6d594b..627fff6ddda2 100644
--- a/arch/arm64/kernel/patching.c
+++ b/arch/arm64/kernel/patching.c
@@ -10,6 +10,7 @@
 #include <asm/fixmap.h>
 #include <asm/insn.h>
 #include <asm/kprobes.h>
+#include <asm/ktext.h>
 #include <asm/patching.h>
 #include <asm/sections.h>
 
@@ -115,9 +116,13 @@ int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn)
 		return -EINVAL;
 
 	ret = aarch64_insn_write(tp, insn);
-	if (ret == 0)
+	if (ret == 0) {
+		/* Also patch the other nodes */
+		ktext_replication_patch(tp, cpu_to_le32(insn));
+
 		caches_clean_inval_pou((uintptr_t)tp,
 				     (uintptr_t)tp + AARCH64_INSN_SIZE);
+	}
 
 	return ret;
 }
diff --git a/arch/arm64/mm/ktext.c b/arch/arm64/mm/ktext.c
index 4c803b89fcfe..80120b5fd29f 100644
--- a/arch/arm64/mm/ktext.c
+++ b/arch/arm64/mm/ktext.c
@@ -3,8 +3,10 @@
  * Copyright (C) 2022, Oracle and/or its affiliates.
  */
 
+#include <linux/kallsyms.h>
 #include <linux/kernel.h>
 #include <linux/memblock.h>
+#include <linux/mm.h>
 #include <linux/numa.h>
 #include <linux/pgtable.h>
 #include <linux/string.h>
@@ -15,6 +17,62 @@
 
 static void *kernel_texts[MAX_NUMNODES];
 
+void __kprobes ktext_replication_patch(u32 *tp, __le32 insn)
+{
+	unsigned long offset;
+	int nid, this_nid;
+	__le32 *p;
+
+	if (!is_kernel_text((unsigned long)tp))
+		return;
+
+	offset = (unsigned long)tp - (unsigned long)_stext;
+
+	this_nid = numa_node_id();
+	if (this_nid) {
+		/* The cache maintenance by aarch64_insn_patch_text_nosync()
+		 * will occur on this node. We need it to occur on node 0.
+		 */
+		p = (void *)lm_alias(_stext) + offset;
+		caches_clean_inval_pou((u64)p, (u64)p + AARCH64_INSN_SIZE);
+	}
+
+	for_each_node(nid) {
+		if (!kernel_texts[nid])
+			continue;
+
+		p = kernel_texts[nid] + offset;
+		WRITE_ONCE(*p, insn);
+		caches_clean_inval_pou((u64)p, (u64)p + AARCH64_INSN_SIZE);
+	}
+}
+
+/* Copy the patched alternative from the node0 image to the other
+ * nodes. src is the node 0 linear-mapping address.
+ */
+void ktext_replication_patch_alternative(__le32 *src, int nr_inst)
+{
+	unsigned long offset;
+	size_t size;
+	int nid;
+	__le32 *p;
+
+	offset = (unsigned long)src - (unsigned long)lm_alias(_stext);
+	if (WARN_ON_ONCE(offset >= _etext - _stext))
+		return;
+
+	size = AARCH64_INSN_SIZE * nr_inst;
+
+	for_each_node(nid) {
+		if (!kernel_texts[nid])
+			continue;
+
+		p = kernel_texts[nid] + offset;
+		memcpy(p, src, size);
+		clean_dcache_range_nopatch((u64)p, (u64)p + size);
+	}
+}
+
 /* Allocate memory for the replicated kernel texts. */
 void __init ktext_replication_init(void)
 {
-- 
2.30.2


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

* [PATCH RFC 09/17] arm64: text replication: add node 0 page table definitions
  2023-05-30 14:04 [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
                   ` (7 preceding siblings ...)
  2023-05-30 14:05 ` [PATCH RFC 08/17] arm64: text replication: add node text patching Russell King (Oracle)
@ 2023-05-30 14:05 ` Russell King (Oracle)
  2023-05-30 14:05 ` [PATCH RFC 10/17] arm64: text replication: add swapper page directory helpers Russell King (Oracle)
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 27+ messages in thread
From: Russell King (Oracle) @ 2023-05-30 14:05 UTC (permalink / raw)
  To: Catalin Marinas, Jonathan Corbet, Will Deacon; +Cc: linux-arm-kernel, linux-doc

Add a struct definition for the level zero page table group (the
optional trampoline page tables, reserved page tables, and swapper page
tables).

Add a symbol and extern declaration for the node 0 page table group.

Add an array of pointers to per-node page tables, which will default to
using the node 0 page table group.
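
A minimal sketch of the "default to node 0" pattern, assuming a compiler that
supports the GNU C range designator extension (gcc and clang do). The names
struct tables, node0_tables, node_tables and uses_node0_tables are
hypothetical stand-ins, not the identifiers used by the patch.

```c
#include <assert.h>

#define NR_NODES 4	/* hypothetical stand-in for MAX_NUMNODES */

struct tables {		/* hypothetical stand-in for struct pgtables */
	int id;
};

static struct tables node0_tables = { .id = 0 };

/* GNU C range designator: every slot initially points at node 0's tables. */
static struct tables *node_tables[NR_NODES] = {
	[0 ... NR_NODES - 1] = &node0_tables,
};

/* Returns nonzero while a node still shares node 0's tables. */
static int uses_node0_tables(int nid)
{
	assert(nid >= 0 && nid < NR_NODES);
	return node_tables[nid] == &node0_tables;
}
```

With this layout, code that indexes the array by node never has to check for
a NULL entry; a node without its own tables simply resolves to node 0.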

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm64/include/asm/pgtable.h | 14 ++++++++++++++
 arch/arm64/kernel/vmlinux.lds.S  |  3 +++
 arch/arm64/mm/ktext.c            |  4 ++++
 3 files changed, 21 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index cb526e69299d..1e72067d1e9e 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -615,6 +615,20 @@ extern pgd_t idmap_pg_dir[PTRS_PER_PGD];
 extern pgd_t tramp_pg_dir[PTRS_PER_PGD];
 extern pgd_t reserved_pg_dir[PTRS_PER_PGD];
 
+struct pgtables {
+#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
+	pgd_t tramp_pg_dir[PTRS_PER_PGD];
+#endif
+	pgd_t reserved_pg_dir[PTRS_PER_PGD];
+	pgd_t swapper_pg_dir[PTRS_PER_PGD];
+};
+
+extern struct pgtables pgtable_node0;
+
+#ifdef CONFIG_REPLICATE_KTEXT
+extern struct pgtables *pgtables[MAX_NUMNODES];
+#endif
+
 extern void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd);
 
 static inline bool in_swapper_pgdir(void *addr)
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 3cd7e76cc562..d3c7ed76adbf 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -212,6 +212,9 @@ SECTIONS
 	idmap_pg_dir = .;
 	. += PAGE_SIZE;
 
+	/* pgtable struct - covers the tramp, reserved and swapper pgdirs */
+	pgtable_node0 = .;
+
 #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
 	tramp_pg_dir = .;
 	. += PAGE_SIZE;
diff --git a/arch/arm64/mm/ktext.c b/arch/arm64/mm/ktext.c
index 80120b5fd29f..85fc97877d75 100644
--- a/arch/arm64/mm/ktext.c
+++ b/arch/arm64/mm/ktext.c
@@ -15,6 +15,10 @@
 #include <asm/ktext.h>
 #include <asm/memory.h>
 
+struct pgtables *pgtables[MAX_NUMNODES] = {
+	[0 ... MAX_NUMNODES - 1] = &pgtable_node0,
+};
+
 static void *kernel_texts[MAX_NUMNODES];
 
 void __kprobes ktext_replication_patch(u32 *tp, __le32 insn)
-- 
2.30.2



* [PATCH RFC 10/17] arm64: text replication: add swapper page directory helpers
  2023-05-30 14:04 [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
                   ` (8 preceding siblings ...)
  2023-05-30 14:05 ` [PATCH RFC 09/17] arm64: text replication: add node 0 page table definitions Russell King (Oracle)
@ 2023-05-30 14:05 ` Russell King (Oracle)
  2023-05-30 14:05 ` [PATCH RFC 11/17] arm64: text replication: create per-node kernel page tables Russell King (Oracle)
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 27+ messages in thread
From: Russell King (Oracle) @ 2023-05-30 14:05 UTC (permalink / raw)
  To: Catalin Marinas, Jonathan Corbet, Will Deacon; +Cc: linux-arm-kernel, linux-doc

Add a series of helpers for the swapper page directories: one set
returns the directory for the calling CPU's NUMA node, and the other
takes the NUMA node number explicitly.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm64/include/asm/pgtable.h | 19 +++++++++++++++++++
 arch/arm64/kernel/hibernate.c    |  2 +-
 arch/arm64/mm/ktext.c            | 20 ++++++++++++++++++++
 3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 1e72067d1e9e..5cfff64e4944 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -627,6 +627,25 @@ extern struct pgtables pgtable_node0;
 
 #ifdef CONFIG_REPLICATE_KTEXT
 extern struct pgtables *pgtables[MAX_NUMNODES];
+
+pgd_t *swapper_pg_dir_node(void);
+phys_addr_t __swapper_pg_dir_node_phys(int nid);
+phys_addr_t swapper_pg_dir_node_phys(void);
+#else
+static inline pgd_t *swapper_pg_dir_node(void)
+{
+	return swapper_pg_dir;
+}
+
+static inline phys_addr_t __swapper_pg_dir_node_phys(int nid)
+{
+	return __pa_symbol(swapper_pg_dir);
+}
+
+static inline phys_addr_t swapper_pg_dir_node_phys(void)
+{
+	return __pa_symbol(swapper_pg_dir);
+}
 #endif
 
 extern void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd);
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 788597a6b6a2..2236119bf16d 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -114,7 +114,7 @@ int arch_hibernation_header_save(void *addr, unsigned int max_size)
 		return -EOVERFLOW;
 
 	arch_hdr_invariants(&hdr->invariants);
-	hdr->ttbr1_el1		= __pa_symbol(swapper_pg_dir);
+	hdr->ttbr1_el1		= swapper_pg_dir_node_phys();
 	hdr->reenter_kernel	= _cpu_resume;
 
 	/* We can't use __hyp_get_vectors() because kvm may still be loaded */
diff --git a/arch/arm64/mm/ktext.c b/arch/arm64/mm/ktext.c
index 85fc97877d75..ac5754972a09 100644
--- a/arch/arm64/mm/ktext.c
+++ b/arch/arm64/mm/ktext.c
@@ -21,6 +21,26 @@ struct pgtables *pgtables[MAX_NUMNODES] = {
 
 static void *kernel_texts[MAX_NUMNODES];
 
+static pgd_t *__swapper_pg_dir_node(int nid)
+{
+	return pgtables[nid]->swapper_pg_dir;
+}
+
+pgd_t *swapper_pg_dir_node(void)
+{
+	return __swapper_pg_dir_node(numa_node_id());
+}
+
+phys_addr_t __swapper_pg_dir_node_phys(int nid)
+{
+	return __pa(__swapper_pg_dir_node(nid));
+}
+
+phys_addr_t swapper_pg_dir_node_phys(void)
+{
+	return __swapper_pg_dir_node_phys(numa_node_id());
+}
+
 void __kprobes ktext_replication_patch(u32 *tp, __le32 insn)
 {
 	unsigned long offset;
-- 
2.30.2



* [PATCH RFC 11/17] arm64: text replication: create per-node kernel page tables
  2023-05-30 14:04 [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
                   ` (9 preceding siblings ...)
  2023-05-30 14:05 ` [PATCH RFC 10/17] arm64: text replication: add swapper page directory helpers Russell King (Oracle)
@ 2023-05-30 14:05 ` Russell King (Oracle)
  2023-05-30 14:05 ` [PATCH RFC 12/17] arm64: text replication: boot secondary CPUs with appropriate TTBR1 Russell King (Oracle)
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 27+ messages in thread
From: Russell King (Oracle) @ 2023-05-30 14:05 UTC (permalink / raw)
  To: Catalin Marinas, Jonathan Corbet, Will Deacon; +Cc: linux-arm-kernel, linux-doc

Allocate the level 0 page tables for the per-node kernel text
replication, but copy all level 0 table entries from the NUMA node 0
table. Therefore, for the time being, each node's level 0 page tables
will contain identical entries, and thus other nodes will continue
to use the node 0 kernel text.

Since the level 0 page tables can be updated at runtime to add entries
for vmalloc and module space, propagate these updates to the other
swapper page tables. The exception is if we see an update for the
level 0 entry which points to the kernel mapping.
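
The propagation rule can be sketched in userspace as below. This is only an
illustration under assumed sizes: NR_NODES, NR_ENTRIES, KERNEL_IDX, top_level
and propagate_entry are hypothetical names, and unlike the patch (where
set_swapper_pgd() writes node 0 and a helper mirrors the update) the sketch
does both writes in one function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_NODES   3	/* hypothetical node count */
#define NR_ENTRIES 8	/* hypothetical stand-in for PTRS_PER_PGD */
#define KERNEL_IDX 5	/* hypothetical index of the kernel image entry */

/* Row 0 plays the role of the node 0 swapper table. */
static uint64_t top_level[NR_NODES][NR_ENTRIES];

/*
 * Apply an update to node 0's entry and mirror it into the other nodes.
 * The entry that maps the kernel image differs per node, so updates to
 * it are refused. Returns 0 on success, -1 if the update is refused.
 */
static int propagate_entry(size_t idx, uint64_t val)
{
	int nid;

	if (idx >= NR_ENTRIES || idx == KERNEL_IDX)
		return -1;

	top_level[0][idx] = val;
	for (nid = 1; nid < NR_NODES; nid++)
		top_level[nid][idx] = val;
	return 0;
}
```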

We also need to set up a copy of the trampoline page tables, as the
assembly code relies on the two page tables being a fixed offset
apart.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm64/include/asm/ktext.h | 12 ++++++++++
 arch/arm64/mm/ktext.c          | 42 +++++++++++++++++++++++++++++++++-
 arch/arm64/mm/mmu.c            |  5 ++++
 3 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/ktext.h b/arch/arm64/include/asm/ktext.h
index 289e11289c06..386f9812d3c1 100644
--- a/arch/arm64/include/asm/ktext.h
+++ b/arch/arm64/include/asm/ktext.h
@@ -7,11 +7,15 @@
 
 #include <linux/kprobes.h>
 
+#include <asm/pgtable-types.h>
+
 #ifdef CONFIG_REPLICATE_KTEXT
 
 void ktext_replication_init(void);
 void __kprobes ktext_replication_patch(u32 *tp,  __le32 insn);
 void ktext_replication_patch_alternative(__le32 *src, int nr_inst);
+void ktext_replication_set_swapper_pgd(pgd_t *pgdp, pgd_t pgd);
+void ktext_replication_init_tramp(void);
 
 #else
 
@@ -27,6 +31,14 @@ static inline void ktext_replication_patch_alternative(__le32 *src, int nr_inst)
 {
 }
 
+static inline void ktext_replication_set_swapper_pgd(pgd_t *pgdp, pgd_t pgd)
+{
+}
+
+static inline void ktext_replication_init_tramp(void)
+{
+}
+
 #endif
 
 #endif
diff --git a/arch/arm64/mm/ktext.c b/arch/arm64/mm/ktext.c
index ac5754972a09..290012d2bd03 100644
--- a/arch/arm64/mm/ktext.c
+++ b/arch/arm64/mm/ktext.c
@@ -14,6 +14,7 @@
 #include <asm/cacheflush.h>
 #include <asm/ktext.h>
 #include <asm/memory.h>
+#include <asm/pgalloc.h>
 
 struct pgtables *pgtables[MAX_NUMNODES] = {
 	[0 ... MAX_NUMNODES - 1] = &pgtable_node0,
@@ -97,7 +98,7 @@ void ktext_replication_patch_alternative(__le32 *src, int nr_inst)
 	}
 }
 
-/* Allocate memory for the replicated kernel texts. */
+/* Allocate page tables and memory for the replicated kernel texts. */
 void __init ktext_replication_init(void)
 {
 	size_t size = _etext - _stext;
@@ -128,5 +129,44 @@ void __init ktext_replication_init(void)
 		memcpy(kernel_texts[nid], _stext, size);
 		caches_clean_inval_pou((u64)kernel_texts[nid],
 				       (u64)kernel_texts[nid] + size);
+
+		/* Allocate the pagetables for this node */
+		pgtables[nid] = memblock_alloc_node(sizeof(*pgtables[0]),
+						    PGD_SIZE, nid);
+
+		/* Copy initial swapper page directory */
+		memcpy(pgtables[nid]->swapper_pg_dir, swapper_pg_dir, PGD_SIZE);
+	}
+}
+
+void ktext_replication_set_swapper_pgd(pgd_t *pgdp, pgd_t pgd)
+{
+	unsigned long idx = pgdp - swapper_pg_dir;
+	int nid;
+
+	if (WARN_ON_ONCE(idx >= PTRS_PER_PGD) ||
+	    WARN_ON_ONCE(idx == pgd_index((phys_addr_t)KERNEL_START)))
+		return;
+
+	for_each_node(nid) {
+		if (pgtables[nid]->swapper_pg_dir == swapper_pg_dir)
+			continue;
+
+		WRITE_ONCE(pgtables[nid]->swapper_pg_dir[idx], pgd);
+	}
+}
+
+#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
+void __init ktext_replication_init_tramp(void)
+{
+	int nid;
+
+	for_each_node(nid) {
+		/* Nothing to do for node 0 */
+		if (pgtables[nid]->tramp_pg_dir == tramp_pg_dir)
+			continue;
+
+		memcpy(pgtables[nid]->tramp_pg_dir, tramp_pg_dir, PGD_SIZE);
 	}
 }
+#endif
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index baf74d0c43c9..12fc3b1116e6 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -31,6 +31,7 @@
 #include <asm/fixmap.h>
 #include <asm/kasan.h>
 #include <asm/kernel-pgtable.h>
+#include <asm/ktext.h>
 #include <asm/sections.h>
 #include <asm/setup.h>
 #include <linux/sizes.h>
@@ -81,6 +82,7 @@ void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd)
 	pgd_t *fixmap_pgdp;
 
 	spin_lock(&swapper_pgdir_lock);
+	ktext_replication_set_swapper_pgd(pgdp, pgd);
 	fixmap_pgdp = pgd_set_fixmap(__pa_symbol(pgdp));
 	WRITE_ONCE(*fixmap_pgdp, pgd);
 	/*
@@ -695,6 +697,9 @@ static int __init map_entry_trampoline(void)
 		__set_fixmap(FIX_ENTRY_TRAMP_TEXT1 - i,
 			     pa_start + i * PAGE_SIZE, PAGE_KERNEL_RO);
 
+	/* Copy trampoline page tables to other numa nodes */
+	ktext_replication_init_tramp();
+
 	return 0;
 }
 core_initcall(map_entry_trampoline);
-- 
2.30.2



* [PATCH RFC 12/17] arm64: text replication: boot secondary CPUs with appropriate TTBR1
  2023-05-30 14:04 [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
                   ` (10 preceding siblings ...)
  2023-05-30 14:05 ` [PATCH RFC 11/17] arm64: text replication: create per-node kernel page tables Russell King (Oracle)
@ 2023-05-30 14:05 ` Russell King (Oracle)
  2023-05-30 14:05 ` [PATCH RFC 13/17] arm64: text replication: update cnp support Russell King (Oracle)
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 27+ messages in thread
From: Russell King (Oracle) @ 2023-05-30 14:05 UTC (permalink / raw)
  To: Catalin Marinas, Jonathan Corbet, Will Deacon; +Cc: linux-arm-kernel, linux-doc

Arrange for secondary CPUs to boot with TTBR1 pointing at the
appropriate per-node copy of the kernel page tables for the CPU's NUMA
node.
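
A small userspace sketch of the hand-off, assuming a fixed CPU-to-node map:
the boot CPU looks up the node of the CPU it is about to start and records
that node's table address in a shared structure for the early boot code to
read. struct boot_data, cpu_node, node_table_phys and prepare_boot are
hypothetical, and the addresses are made up.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS  4	/* hypothetical CPU count */
#define NR_NODES 2	/* hypothetical node count */

/* Hypothetical stand-in for the data handed to a starting CPU. */
struct boot_data {
	void *task;
	long status;
	uint64_t ttbr1;	/* physical address of the table to boot with */
};

/* Assumed topology: CPUs 0-1 on node 0, CPUs 2-3 on node 1. */
static const int cpu_node[NR_CPUS] = { 0, 0, 1, 1 };

/* Made-up physical addresses of each node's top-level table. */
static const uint64_t node_table_phys[NR_NODES] = {
	0x80000000ULL, 0x880000000ULL,
};

static struct boot_data boot_data;

/* Record the table for the node that owns 'cpu' and return its address. */
static uint64_t prepare_boot(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	boot_data.ttbr1 = node_table_phys[cpu_node[cpu]];
	return boot_data.ttbr1;
}
```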

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm64/include/asm/smp.h    | 1 +
 arch/arm64/kernel/asm-offsets.c | 1 +
 arch/arm64/kernel/head.S        | 3 ++-
 arch/arm64/kernel/smp.c         | 3 +++
 4 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
index f2d26235bfb4..9a4246141599 100644
--- a/arch/arm64/include/asm/smp.h
+++ b/arch/arm64/include/asm/smp.h
@@ -79,6 +79,7 @@ asmlinkage void secondary_start_kernel(void);
 struct secondary_data {
 	struct task_struct *task;
 	long status;
+	phys_addr_t ttbr1;
 };
 
 extern struct secondary_data secondary_data;
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 0996094b0d22..f3b8cf661de2 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -121,6 +121,7 @@ int main(void)
   DEFINE(IRQ_CPUSTAT_SOFTIRQ_PENDING, offsetof(irq_cpustat_t, __softirq_pending));
   BLANK();
   DEFINE(CPU_BOOT_TASK,		offsetof(struct secondary_data, task));
+  DEFINE(CPU_BOOT_TTBR1,	offsetof(struct secondary_data, ttbr1));
   BLANK();
   DEFINE(FTR_OVR_VAL_OFFSET,	offsetof(struct arm64_ftr_override, val));
   DEFINE(FTR_OVR_MASK_OFFSET,	offsetof(struct arm64_ftr_override, mask));
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index e92caebff46a..e66ee578f755 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -646,7 +646,8 @@ SYM_FUNC_START_LOCAL(secondary_startup)
 	ldr_l	x0, vabits_actual
 #endif
 	bl	__cpu_setup			// initialise processor
-	adrp	x1, swapper_pg_dir
+	adr_l	x1, secondary_data
+	ldr	x1, [x1, #CPU_BOOT_TTBR1]
 	adrp	x2, idmap_pg_dir
 	bl	__enable_mmu
 	ldr	x8, =__secondary_switched
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index d00d4cbb31b1..8b4cd4924abf 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -119,6 +119,9 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
 	 * page tables.
 	 */
 	secondary_data.task = idle;
+	secondary_data.ttbr1 = __swapper_pg_dir_node_phys(cpu_to_node(cpu));
+	dcache_clean_poc((uintptr_t)&secondary_data,
+			 (uintptr_t)&secondary_data + sizeof(secondary_data));
 	update_cpu_boot_status(CPU_MMU_OFF);
 
 	/* Now bring the CPU into our world */
-- 
2.30.2



* [PATCH RFC 13/17] arm64: text replication: update cnp support
  2023-05-30 14:04 [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
                   ` (11 preceding siblings ...)
  2023-05-30 14:05 ` [PATCH RFC 12/17] arm64: text replication: boot secondary CPUs with appropriate TTBR1 Russell King (Oracle)
@ 2023-05-30 14:05 ` Russell King (Oracle)
  2023-05-30 14:05 ` [PATCH RFC 14/17] arm64: text replication: setup page tables for copied kernel Russell King (Oracle)
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 27+ messages in thread
From: Russell King (Oracle) @ 2023-05-30 14:05 UTC (permalink / raw)
  To: Catalin Marinas, Jonathan Corbet, Will Deacon; +Cc: linux-arm-kernel, linux-doc

Add changes for CNP (Common Not Private) support of kernel text
replication. Although text replication has only been tested on
dual-socket Ampere A1 systems, provided the different NUMA nodes
are not part of the same inner shareable domain, CNP should not
be a problem.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm64/include/asm/mmu_context.h | 2 +-
 arch/arm64/kernel/cpufeature.c       | 2 +-
 arch/arm64/kernel/suspend.c          | 3 ++-
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index 4aa6afc6f935..c0e13dc73fd7 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -159,7 +159,7 @@ static inline void cpu_replace_ttbr1_phys(phys_addr_t pgd_phys, pgd_t *idmap)
 	phys_addr_t ttbr1 = phys_to_ttbr(pgd_phys);
 
 	if (system_supports_cnp() &&
-	    !WARN_ON(pgd_phys != virt_to_phys(lm_alias(swapper_pg_dir)))) {
+	    !WARN_ON(pgd_phys != swapper_pg_dir_node_phys())) {
 		/*
 		 * cpu_replace_ttbr1() is used when there's a boot CPU
 		 * up (i.e. cpufeature framework is not up yet) and
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 7d7128c65161..fdb93b5b7d8e 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -3341,7 +3341,7 @@ subsys_initcall_sync(init_32bit_el0_mask);
 
 static void __maybe_unused cpu_enable_cnp(struct arm64_cpu_capabilities const *cap)
 {
-	cpu_replace_ttbr1(lm_alias(swapper_pg_dir), idmap_pg_dir);
+	cpu_replace_ttbr1_phys(swapper_pg_dir_node_phys(), idmap_pg_dir);
 }
 
 /*
diff --git a/arch/arm64/kernel/suspend.c b/arch/arm64/kernel/suspend.c
index 0fbdf5fe64d8..49fa80bafd6d 100644
--- a/arch/arm64/kernel/suspend.c
+++ b/arch/arm64/kernel/suspend.c
@@ -55,7 +55,8 @@ void notrace __cpu_suspend_exit(void)
 
 	/* Restore CnP bit in TTBR1_EL1 */
 	if (system_supports_cnp())
-		cpu_replace_ttbr1(lm_alias(swapper_pg_dir), idmap_pg_dir);
+		cpu_replace_ttbr1_phys(swapper_pg_dir_node_phys(),
+				       idmap_pg_dir);
 
 	/*
 	 * PSTATE was not saved over suspend/resume, re-enable any detected
-- 
2.30.2



* [PATCH RFC 14/17] arm64: text replication: setup page tables for copied kernel
  2023-05-30 14:04 [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
                   ` (12 preceding siblings ...)
  2023-05-30 14:05 ` [PATCH RFC 13/17] arm64: text replication: update cnp support Russell King (Oracle)
@ 2023-05-30 14:05 ` Russell King (Oracle)
  2023-05-30 14:05 ` [PATCH RFC 15/17] arm64: text replication: include most of read-only data as well Russell King (Oracle)
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 27+ messages in thread
From: Russell King (Oracle) @ 2023-05-30 14:05 UTC (permalink / raw)
  To: Catalin Marinas, Jonathan Corbet, Will Deacon; +Cc: linux-arm-kernel, linux-doc

Set up page table entries in each non-boot NUMA node page table to
point at each node's own copy of the kernel text. This switches
each node to use its own unique copy of the kernel text.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm64/include/asm/ktext.h |  1 +
 arch/arm64/mm/ktext.c          |  8 +++++
 arch/arm64/mm/mmu.c            | 53 ++++++++++++++++++++++++++++------
 3 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/ktext.h b/arch/arm64/include/asm/ktext.h
index 386f9812d3c1..6ece59ca90a2 100644
--- a/arch/arm64/include/asm/ktext.h
+++ b/arch/arm64/include/asm/ktext.h
@@ -16,6 +16,7 @@ void __kprobes ktext_replication_patch(u32 *tp,  __le32 insn);
 void ktext_replication_patch_alternative(__le32 *src, int nr_inst);
 void ktext_replication_set_swapper_pgd(pgd_t *pgdp, pgd_t pgd);
 void ktext_replication_init_tramp(void);
+void create_kernel_nid_map(pgd_t *pgdp, void *ktext);
 
 #else
 
diff --git a/arch/arm64/mm/ktext.c b/arch/arm64/mm/ktext.c
index 290012d2bd03..11eba88fdd49 100644
--- a/arch/arm64/mm/ktext.c
+++ b/arch/arm64/mm/ktext.c
@@ -136,6 +136,14 @@ void __init ktext_replication_init(void)
 
 		/* Copy initial swapper page directory */
 		memcpy(pgtables[nid]->swapper_pg_dir, swapper_pg_dir, PGD_SIZE);
+
+		/* Clear the kernel mapping */
+		memset(&pgtables[nid]->swapper_pg_dir[kidx], 0,
+		       sizeof(pgtables[nid]->swapper_pg_dir[kidx]));
+
+		/* Create kernel mapping pointing at our local copy */
+		create_kernel_nid_map(pgtables[nid]->swapper_pg_dir,
+				      kernel_texts[nid]);
 	}
 }
 
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 12fc3b1116e6..2ba5cdfa28ce 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -641,6 +641,16 @@ void mark_rodata_ro(void)
 	debug_checkwx();
 }
 
+static void __init create_kernel_mapping(pgd_t *pgdp, phys_addr_t pa_start,
+					 void *va_start, void *va_end,
+					 pgprot_t prot, int flags)
+{
+	size_t size = va_end - va_start;
+
+	__create_pgd_mapping(pgdp, pa_start, (unsigned long)va_start, size,
+			     prot, early_pgtable_alloc, flags);
+}
+
 static void __init map_kernel_segment(pgd_t *pgdp, void *va_start, void *va_end,
 				      pgprot_t prot, struct vm_struct *vma,
 				      int flags, unsigned long vm_flags)
@@ -651,8 +661,7 @@ static void __init map_kernel_segment(pgd_t *pgdp, void *va_start, void *va_end,
 	BUG_ON(!PAGE_ALIGNED(pa_start));
 	BUG_ON(!PAGE_ALIGNED(size));
 
-	__create_pgd_mapping(pgdp, pa_start, (unsigned long)va_start, size, prot,
-			     early_pgtable_alloc, flags);
+	create_kernel_mapping(pgdp, pa_start, va_start, va_end, prot, flags);
 
 	if (!(vm_flags & VM_NO_GUARD))
 		size += PAGE_SIZE;
@@ -721,14 +730,8 @@ static bool arm64_early_this_cpu_has_bti(void)
 						    ID_AA64PFR1_EL1_BT_SHIFT);
 }
 
-/*
- * Create fine-grained mappings for the kernel.
- */
-static void __init map_kernel(pgd_t *pgdp)
+static pgprot_t __init kernel_text_pgprot(void)
 {
-	static struct vm_struct vmlinux_text, vmlinux_rodata, vmlinux_inittext,
-				vmlinux_initdata, vmlinux_data;
-
 	/*
 	 * External debuggers may need to write directly to the text
 	 * mapping to install SW breakpoints. Allow this (only) when
@@ -744,6 +747,38 @@ static void __init map_kernel(pgd_t *pgdp)
 	if (arm64_early_this_cpu_has_bti())
 		text_prot = __pgprot_modify(text_prot, PTE_GP, PTE_GP);
 
+	return text_prot;
+}
+
+#ifdef CONFIG_REPLICATE_KTEXT
+void __init create_kernel_nid_map(pgd_t *pgdp, void *ktext)
+{
+	pgprot_t text_prot = kernel_text_pgprot();
+
+	create_kernel_mapping(pgdp, __pa(ktext), _stext, _etext, text_prot, 0);
+	create_kernel_mapping(pgdp, __pa_symbol(__start_rodata),
+			      __start_rodata, __inittext_begin,
+			      PAGE_KERNEL, NO_CONT_MAPPINGS);
+	create_kernel_mapping(pgdp, __pa_symbol(__inittext_begin),
+			      __inittext_begin, __inittext_end,
+			      text_prot, 0);
+	create_kernel_mapping(pgdp, __pa_symbol(__initdata_begin),
+			      __initdata_begin, __initdata_end,
+			      PAGE_KERNEL, 0);
+	create_kernel_mapping(pgdp, __pa_symbol(_data), _data, _end,
+			      PAGE_KERNEL, 0);
+}
+#endif
+
+/*
+ * Create fine-grained mappings for the kernel.
+ */
+static void __init map_kernel(pgd_t *pgdp)
+{
+	static struct vm_struct vmlinux_text, vmlinux_rodata, vmlinux_inittext,
+				vmlinux_initdata, vmlinux_data;
+	pgprot_t text_prot = kernel_text_pgprot();
+
 	/*
 	 * Only rodata will be remapped with different permissions later on,
 	 * all other segments are allowed to use contiguous mappings.
-- 
2.30.2



* [PATCH RFC 15/17] arm64: text replication: include most of read-only data as well
  2023-05-30 14:04 [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
                   ` (13 preceding siblings ...)
  2023-05-30 14:05 ` [PATCH RFC 14/17] arm64: text replication: setup page tables for copied kernel Russell King (Oracle)
@ 2023-05-30 14:05 ` Russell King (Oracle)
  2023-05-30 14:05 ` [PATCH RFC 16/17] arm64: text replication: early kernel option to enable replication Russell King (Oracle)
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 27+ messages in thread
From: Russell King (Oracle) @ 2023-05-30 14:05 UTC (permalink / raw)
  To: Catalin Marinas, Jonathan Corbet, Will Deacon; +Cc: linux-arm-kernel, linux-doc

Include as much of the read-only data in the replication as we can
without needing to move away from the generic RO_DATA() macro in
the linker script.

Unfortunately, the read-only data section is immediately followed
by the read-only after init data with no page alignment, which
means we can't have separate mappings for the read-only data
section and everything else. Changing that would mean replacing
the generic RO_DATA() macro which increases the maintenance burden.

However, this is likely not worth the effort as the majority of
read-only data will be covered.
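
To make the alignment arithmetic concrete, here is a userspace sketch that
rounds an address down to a page boundary and reports how many bytes of
ordinary read-only data end up sharing a page with the after-init data, and
so stay unreplicated. PAGE_SZ is an assumed 4 KiB page size, and align_down
and unreplicated_tail are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SZ 4096UL	/* assumed page size */

/* Round addr down to a multiple of align, which must be a power of two. */
static uintptr_t align_down(uintptr_t addr, uintptr_t align)
{
	assert(align && (align & (align - 1)) == 0);
	return addr & ~(align - 1);
}

/*
 * Bytes of ordinary read-only data that sit in the same page as the
 * start of the after-init data, and therefore cannot be replicated.
 */
static uintptr_t unreplicated_tail(uintptr_t ro_after_init_start)
{
	return ro_after_init_start - align_down(ro_after_init_start, PAGE_SZ);
}
```

With a 4 KiB page, at most 4095 bytes of read-only data fall into this shared
tail, which is why the loss is small relative to the whole section.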

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm64/mm/ktext.c |  2 +-
 arch/arm64/mm/mmu.c   | 21 ++++++++++++++++++---
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/mm/ktext.c b/arch/arm64/mm/ktext.c
index 11eba88fdd49..f64a649f06a4 100644
--- a/arch/arm64/mm/ktext.c
+++ b/arch/arm64/mm/ktext.c
@@ -101,7 +101,7 @@ void ktext_replication_patch_alternative(__le32 *src, int nr_inst)
 /* Allocate page tables and memory for the replicated kernel texts. */
 void __init ktext_replication_init(void)
 {
-	size_t size = _etext - _stext;
+	size_t size = __end_rodata - _stext;
 	int kidx = pgd_index((phys_addr_t)KERNEL_START);
 	int nid;
 
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 2ba5cdfa28ce..5e4cb28f3e5f 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -753,11 +753,26 @@ static pgprot_t __init kernel_text_pgprot(void)
 #ifdef CONFIG_REPLICATE_KTEXT
 void __init create_kernel_nid_map(pgd_t *pgdp, void *ktext)
 {
+	phys_addr_t pa_ktext;
+	size_t ro_offset;
+	void *ro_end;
 	pgprot_t text_prot = kernel_text_pgprot();
 
-	create_kernel_mapping(pgdp, __pa(ktext), _stext, _etext, text_prot, 0);
-	create_kernel_mapping(pgdp, __pa_symbol(__start_rodata),
-			      __start_rodata, __inittext_begin,
+	pa_ktext = __pa(ktext);
+	ro_offset = __pa_symbol(__start_rodata) - __pa_symbol(_stext);
+	/*
+	 * We must not cover the read-only data after init, since this
+	 * is written to during boot, and thus must be shared between
+	 * the NUMA nodes.
+	 */
+	ro_end = PTR_ALIGN_DOWN((void *)__start_ro_after_init, PAGE_SIZE);
+
+	create_kernel_mapping(pgdp, pa_ktext, _stext, _etext, text_prot, 0);
+	create_kernel_mapping(pgdp, pa_ktext + ro_offset,
+			      __start_rodata, ro_end,
+			      PAGE_KERNEL, NO_CONT_MAPPINGS);
+	create_kernel_mapping(pgdp, __pa_symbol(ro_end),
+			      ro_end, __inittext_begin,
 			      PAGE_KERNEL, NO_CONT_MAPPINGS);
 	create_kernel_mapping(pgdp, __pa_symbol(__inittext_begin),
 			      __inittext_begin, __inittext_end,
-- 
2.30.2



* [PATCH RFC 16/17] arm64: text replication: early kernel option to enable replication
  2023-05-30 14:04 [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
                   ` (14 preceding siblings ...)
  2023-05-30 14:05 ` [PATCH RFC 15/17] arm64: text replication: include most of read-only data as well Russell King (Oracle)
@ 2023-05-30 14:05 ` Russell King (Oracle)
  2023-05-30 14:05 ` [PATCH RFC 17/17] arm64: text replication: add Kconfig Russell King (Oracle)
  2023-06-05  9:05 ` [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
  17 siblings, 0 replies; 27+ messages in thread
From: Russell King (Oracle) @ 2023-05-30 14:05 UTC (permalink / raw)
  To: Catalin Marinas, Jonathan Corbet, Will Deacon; +Cc: linux-arm-kernel, linux-doc

Provide an early kernel option "ktext=" which allows the kernel text
replication to be enabled. This takes a boolean argument.

The way this has been implemented means that we take all the same paths
through the kernel at runtime whether kernel text replication has been
enabled or not; this allows the performance effects of the code changes
to be evaluated separately from the effects of actually replicating the
kernel text.
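
As a rough illustration of the behaviour described above, here is a
userspace sketch of the "ktext=" handling. It is only a model: the
names parse_bool_model(), parse_ktext_model(), ktext_enabled_model and
replicas_to_create() are made up for this example, and the boolean
parser is a loose imitation of the kernel helper, not its actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's boolean string parser. */
static int parse_bool_model(const char *s, bool *res)
{
	if (!s)
		return -1;

	switch (s[0]) {
	case 'y': case 'Y': case '1':
		*res = true;
		return 0;
	case 'n': case 'N': case '0':
		*res = false;
		return 0;
	case 'o': case 'O':
		if (s[1] == 'n' || s[1] == 'N') {
			*res = true;
			return 0;
		}
		if (s[1] == 'f' || s[1] == 'F') {
			*res = false;
			return 0;
		}
		break;
	}
	return -1;
}

/* Defaults to false, matching "Default: disabled" in the documentation. */
static bool ktext_enabled_model;

/* Model of the early parameter handler: bad input leaves the state alone. */
static int parse_ktext_model(const char *str)
{
	bool enabled;
	int ret = parse_bool_model(str, &enabled);

	if (ret)
		return ret;

	ktext_enabled_model = enabled;
	return 0;
}

/* Number of nodes that would get a replica; node 0 keeps the original. */
static int replicas_to_create(int nr_nodes)
{
	assert(nr_nodes >= 0);

	if (!ktext_enabled_model)
		return 0;

	return nr_nodes > 1 ? nr_nodes - 1 : 0;
}
```

With the option left at its default, replicas_to_create() returns 0 in
this model, which mirrors the early return added to
ktext_replication_init().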

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 .../admin-guide/kernel-parameters.txt          |  5 +++++
 arch/arm64/mm/ktext.c                          | 18 ++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9e5bab29685f..684bd004816e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2488,6 +2488,11 @@
 			0: force disabled
 			1: force enabled
 
+	ktext=		[ARM64] Control kernel text replication on NUMA
+			machines. Default: disabled.
+			0: disable kernel text replication
+			1: enable kernel text replication
+
 	kunit.enable=	[KUNIT] Enable executing KUnit tests. Requires
 			CONFIG_KUNIT to be set to be fully enabled. The
 			default value can be overridden via
diff --git a/arch/arm64/mm/ktext.c b/arch/arm64/mm/ktext.c
index f64a649f06a4..ec2d474bbd2c 100644
--- a/arch/arm64/mm/ktext.c
+++ b/arch/arm64/mm/ktext.c
@@ -98,6 +98,21 @@ void ktext_replication_patch_alternative(__le32 *src, int nr_inst)
 	}
 }
 
+static bool ktext_enabled;
+
+static int __init parse_ktext(char *str)
+{
+	bool enabled;
+	int ret = strtobool(str, &enabled);
+
+	if (ret)
+		return ret;
+
+	ktext_enabled = enabled;
+	return 0;
+}
+early_param("ktext", parse_ktext);
+
 /* Allocate page tables and memory for the replicated kernel texts. */
 void __init ktext_replication_init(void)
 {
@@ -119,6 +134,9 @@ void __init ktext_replication_init(void)
 		return;
 	}
 
+	if (!ktext_enabled)
+		return;
+
 	for_each_node(nid) {
 		/* Nothing to do for node 0 */
 		if (!nid)
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH RFC 17/17] arm64: text replication: add Kconfig
  2023-05-30 14:04 [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
                   ` (15 preceding siblings ...)
  2023-05-30 14:05 ` [PATCH RFC 16/17] arm64: text replication: early kernel option to enable replication Russell King (Oracle)
@ 2023-05-30 14:05 ` Russell King (Oracle)
  2023-06-05  9:05 ` [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
  17 siblings, 0 replies; 27+ messages in thread
From: Russell King (Oracle) @ 2023-05-30 14:05 UTC (permalink / raw)
  To: Catalin Marinas, Jonathan Corbet, Will Deacon; +Cc: linux-arm-kernel, linux-doc

Add the Kconfig symbol for kernel text replication. This unfortunately
requires KASAN and kernel text randomisation options to be disabled at
the moment.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm64/Kconfig | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index b1201d25a8a4..e1120841e26e 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -159,7 +159,7 @@ config ARM64
 	select HAVE_ARCH_HUGE_VMAP
 	select HAVE_ARCH_JUMP_LABEL
 	select HAVE_ARCH_JUMP_LABEL_RELATIVE
-	select HAVE_ARCH_KASAN if !(ARM64_16K_PAGES && ARM64_VA_BITS_48)
+	select HAVE_ARCH_KASAN if !(ARM64_16K_PAGES && ARM64_VA_BITS_48) && !REPLICATE_KTEXT
 	select HAVE_ARCH_KASAN_VMALLOC if HAVE_ARCH_KASAN
 	select HAVE_ARCH_KASAN_SW_TAGS if HAVE_ARCH_KASAN
 	select HAVE_ARCH_KASAN_HW_TAGS if (HAVE_ARCH_KASAN && ARM64_MTE)
@@ -1400,6 +1400,13 @@ config NODES_SHIFT
 	  Specify the maximum number of NUMA Nodes available on the target
 	  system.  Increases memory reserved to accommodate various tables.
 
+config REPLICATE_KTEXT
+	bool "Replicate kernel text across NUMA nodes"
+	depends on NUMA
+	help
+	  Say Y here to enable replicating the kernel text across multiple
+	  nodes in a NUMA cluster.  This trades memory for speed.
+
 source "kernel/Kconfig.hz"
 
 config ARCH_SPARSEMEM_ENABLE
@@ -2167,6 +2174,7 @@ config RELOCATABLE
 
 config RANDOMIZE_BASE
 	bool "Randomize the address of the kernel image"
+	depends on !REPLICATE_KTEXT
 	select ARM64_MODULE_PLTS if MODULES
 	select RELOCATABLE
 	help
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 00/17] arm64 kernel text replication
  2023-05-30 14:04 [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
                   ` (16 preceding siblings ...)
  2023-05-30 14:05 ` [PATCH RFC 17/17] arm64: text replication: add Kconfig Russell King (Oracle)
@ 2023-06-05  9:05 ` Russell King (Oracle)
  2023-06-05 13:46   ` Mark Rutland
  2023-06-23 15:24   ` Ard Biesheuvel
  17 siblings, 2 replies; 27+ messages in thread
From: Russell King (Oracle) @ 2023-06-05  9:05 UTC (permalink / raw)
  To: Catalin Marinas, Jonathan Corbet, Will Deacon; +Cc: linux-arm-kernel, linux-doc

Hi,

Are there any comments on this?

Thanks.

On Tue, May 30, 2023 at 03:04:01PM +0100, Russell King (Oracle) wrote:
> Problem
> -------
> 
> NUMA systems have greater latency when accessing data and instructions
> across nodes, which can lead to a reduction in performance on CPU cores
> that mainly perform accesses beyond their local node.
> 
> Normally when an ARM64 system boots, the kernel will end up placed in
> memory, and each CPU core will have to fetch instructions and data from
> whichever NUMA node the kernel has been placed. This means that while
> executing kernel code, CPUs local to that node will run faster than
> CPUs in remote nodes.
> 
> The higher the latency to access remote NUMA node memory, the more the
> kernel performance suffers on those nodes.
> 
> If there is a local copy of the kernel text in each node's RAM, and
> each node runs the kernel using its local copy of the kernel text,
> then it stands to reason that the kernel will run faster due to fewer
> stalls while instructions are fetched from remote memory.
> 
> The question then arises how to achieve this.
> 
> Background
> ----------
> 
> An important issue to contend with is what happens when a thread
> migrates between nodes. Essentially, the thread's state (including
> instruction pointer) is saved to memory, and the scheduler on that CPU
> loads some other thread's state and that CPU resumes executing that
> new thread.
> 
> The CPU gaining the migrating thread loads the saved state, again
> including the instruction pointer, and the gaining CPU resumes fetching
> instructions at the virtual address where the original CPU left off.
> 
> The key point is that the virtual address is what matters here, and
> this gives us a way to implement kernel text replication fairly easily.
> At a practical level, all we need to do is to ensure that the virtual
> addresses which contain the kernel text point to a local copy of
> that text.
> 
> This is exactly how this proposal of kernel text replication achieves
> the replication. We can go a little bit further and include most of
> the read-only data in this replication, as that will never be written
> to by the kernel (and thus remains constant.)
> 
> Solution
> --------
> 
> So, what we need to achieve is:
> 
> 1. multiple identical copies of the kernel text (and read-only data)
> 2. point the virtual mappings to the appropriate copy of kernel text
>    for the NUMA node.
> 
> (1) is fairly easy to achieve - we just need to allocate some memory
> in the appropriate node and copy the parts of the kernel we want to
> replicate. However, we also need to deal with ARM64's kernel patching.
> There are two functions that patch the kernel text,
> __apply_alternatives() and aarch64_insn_patch_text_nosync(). Both of
> these need to be modified to update all copies of the kernel text.
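
To illustrate the requirement in (1), here is a small userspace model
of "patch every copy". It is a sketch under assumptions: the array
ktext_model and the helpers model_replicate_text(), model_patch_text()
and model_text_identical() are invented for this example and are not
the functions touched by the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_NR_NODES   4	/* assumed node count */
#define MODEL_TEXT_INSNS 8	/* assumed size of the toy "kernel text" */

/* ktext_model[0] stands for the original image, the rest are replicas. */
static uint32_t ktext_model[MODEL_NR_NODES][MODEL_TEXT_INSNS];

/* Copy the node 0 text into every other node's buffer. */
static void model_replicate_text(void)
{
	for (int nid = 1; nid < MODEL_NR_NODES; nid++)
		memcpy(ktext_model[nid], ktext_model[0],
		       sizeof(ktext_model[0]));
}

/* Patch one instruction slot in all copies, so they never diverge. */
static int model_patch_text(size_t idx, uint32_t insn)
{
	if (idx >= MODEL_TEXT_INSNS)
		return -1;

	for (int nid = 0; nid < MODEL_NR_NODES; nid++)
		ktext_model[nid][idx] = insn;

	return 0;
}

/* Return 1 if every replica still matches the node 0 text. */
static int model_text_identical(void)
{
	for (int nid = 1; nid < MODEL_NR_NODES; nid++)
		if (memcmp(ktext_model[nid], ktext_model[0],
			   sizeof(ktext_model[0])))
			return 0;

	return 1;
}
```

The point of the model is only that a patch routine which wrote to a
single copy would leave model_text_identical() returning 0.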
> 
> (2) is slightly harder.
> 
> Firstly, the aarch64 architecture has a very useful feature here - the
> kernel page tables are entirely separate from the user page tables.
> The hardware contains two page table pointers, one is used for user
> mappings, the other is used for kernel mappings.
> 
> Therefore, we only have one page table to be concerned with: the table
> which maps kernel space. We do not need to be concerned with each
> user process's page table.
> 
> The approach taken here is to ensure that the kernel is located in an
> area of kernel virtual address space covered by a level-0 page table
> entry which is not shared with any other user. We can then maintain
> separate per-node level-0 page tables for kernel space where the only
> difference between them is this level-0 page table entry.
> 
> This gives a couple of benefits. Firstly, when updates to the level-0
> page table happen (e.g. when establishing new mappings) these updates
> can simply be copied to the other level-0 page tables provided it isn't
> for the kernel image. Secondly, we don't need complexity at lower
> levels of the page table code to figure out whether a level-1 or lower
> update needs to be propagated to other nodes.
> 
> The level-0 page table entry for the kernel can then be used to point
> at a node-unique set of level 1..N page tables to map the appropriate
> copy of the kernel text (and read-only data) into kernel space, while
> keeping the kernel read-write data shared between nodes.
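
The level-0 handling described above can be modelled in a few lines.
This is a sketch only: the table size, the slot number used for the
kernel image and the helper names model_set_pgd() and
model_set_kimage_pgd() are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NR_NODES       2		/* assumed node count */
#define MODEL_PTRS_PER_PGD   512	/* assumed: 4K pages, 48-bit VA */
#define MODEL_KIMAGE_PGD_IDX 257	/* assumed slot holding the kernel */

/* One level-0 table per node; index 0 is the primary (node 0) table. */
static uint64_t model_pgd[MODEL_NR_NODES][MODEL_PTRS_PER_PGD];

/*
 * Update a level-0 entry in the primary table. Every entry except the
 * kernel image slot is simply copied to the other nodes' tables.
 */
static void model_set_pgd(size_t idx, uint64_t val)
{
	assert(idx < MODEL_PTRS_PER_PGD);

	model_pgd[0][idx] = val;
	if (idx == MODEL_KIMAGE_PGD_IDX)
		return;

	for (int nid = 1; nid < MODEL_NR_NODES; nid++)
		model_pgd[nid][idx] = val;
}

/* Point one node's kernel image slot at its own level 1..N tables. */
static void model_set_kimage_pgd(int nid, uint64_t val)
{
	assert(nid >= 0 && nid < MODEL_NR_NODES);

	model_pgd[nid][MODEL_KIMAGE_PGD_IDX] = val;
}
```

Because only one slot differs between the tables, nothing below level
0 needs to know about replication in this model.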
> 
> Performance Analysis
> --------------------
> 
> Needless to say, the performance results from kernel text replication
> are workload specific, but appear to show a gain of between 6% and
> 17% for database-centric workloads. When combined with userspace
> awareness of NUMA, this can result in a gain of over 50%.
> 
> Problems
> --------
> 
> There are a few areas that are a problem for kernel text replication:
> 1) As this series changes the kernel space virtual address space
>    layout, it breaks KASAN - and I've zero knowledge of KASAN so I
>    have no idea how to fix it. I would be grateful for input from
>    KASAN folk for suggestions how to fix this.
> 
> 2) KASLR can not be used with kernel text replication, since we need
>    to place the kernel in its own L0 page table entry, not in vmalloc
>    space. KASLR is disabled when support for kernel text replication
>    is enabled.
> 
> 3) Changing the kernel virtual address space layout also means that
>    kaslr_offset() and kaslr_enabled() need to become macros rather
>    than inline functions due to the use of PGDIR_SIZE in the
>    calculation of KIMAGE_VADDR. Since asm/pgtable.h defines this
>    constant, but asm/memory.h is included by asm/pgtable.h, having
>    this symbol available would produce a circular include
>    dependency, so I don't think there is any choice here.
> 
> 4) read-only protection for replicated kernel images is not yet
>    implemented.
> 
> Patch overview:
> 
> Patch 1 cleans up the rox page protection logic.
> Patch 2 reorganises the kernel virtual address space layout (causing
>   problems 1 and 3).
> Patch 3 provides a version of cpu_replace_ttbr1 that takes physical
>   addresses.
> Patch 4 makes a needed cache flushing function visible.
> Patch 5 through 16 are the guts of kernel text replication.
> Patch 17 adds the Kconfig entry for it.
> 
> Further patches not included in this set add a Kconfig for the default
> state, a test module, and add code to verify the replicated kernel
> text matches the node 0 text after the kernel has completed most of
> its boot.
> 
>  Documentation/admin-guide/kernel-parameters.txt |   5 +
>  arch/arm64/Kconfig                              |  10 +-
>  arch/arm64/include/asm/cacheflush.h             |   2 +
>  arch/arm64/include/asm/ktext.h                  |  45 ++++++
>  arch/arm64/include/asm/memory.h                 |  26 ++--
>  arch/arm64/include/asm/mmu_context.h            |  12 +-
>  arch/arm64/include/asm/pgtable.h                |  35 ++++-
>  arch/arm64/include/asm/smp.h                    |   1 +
>  arch/arm64/kernel/alternative.c                 |   4 +-
>  arch/arm64/kernel/asm-offsets.c                 |   1 +
>  arch/arm64/kernel/cpufeature.c                  |   2 +-
>  arch/arm64/kernel/head.S                        |   3 +-
>  arch/arm64/kernel/hibernate.c                   |   2 +-
>  arch/arm64/kernel/patching.c                    |   7 +-
>  arch/arm64/kernel/smp.c                         |   3 +
>  arch/arm64/kernel/suspend.c                     |   3 +-
>  arch/arm64/kernel/vmlinux.lds.S                 |   3 +
>  arch/arm64/mm/Makefile                          |   2 +
>  arch/arm64/mm/init.c                            |   3 +
>  arch/arm64/mm/ktext.c                           | 198 ++++++++++++++++++++++++
>  arch/arm64/mm/mmu.c                             |  85 ++++++++--
>  21 files changed, 413 insertions(+), 39 deletions(-)
>  create mode 100644 arch/arm64/include/asm/ktext.h
>  create mode 100644 arch/arm64/mm/ktext.c
> 
> 
> -- 
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
> 

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 00/17] arm64 kernel text replication
  2023-06-05  9:05 ` [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
@ 2023-06-05 13:46   ` Mark Rutland
  2023-06-23 15:24   ` Ard Biesheuvel
  1 sibling, 0 replies; 27+ messages in thread
From: Mark Rutland @ 2023-06-05 13:46 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Catalin Marinas, Jonathan Corbet, Will Deacon, linux-arm-kernel,
	linux-doc

On Mon, Jun 05, 2023 at 10:05:22AM +0100, Russell King (Oracle) wrote:
> Hi,
> 
> Are there any comments on this?

This is on my queue of things to review, but I haven't had the chance to give
more than a cursory look so far. I'm hoping to get to it in the next few days.

Thanks,
Mark.

> 
> Thanks.
> 
> On Tue, May 30, 2023 at 03:04:01PM +0100, Russell King (Oracle) wrote:
> > Problem
> > -------
> > 
> > NUMA systems have greater latency when accessing data and instructions
> > across nodes, which can lead to a reduction in performance on CPU cores
> > that mainly perform accesses beyond their local node.
> > 
> > Normally when an ARM64 system boots, the kernel will end up placed in
> > memory, and each CPU core will have to fetch instructions and data from
> > whichever NUMA node the kernel has been placed. This means that while
> > executing kernel code, CPUs local to that node will run faster than
> > CPUs in remote nodes.
> > 
> > The higher the latency to access remote NUMA node memory, the more the
> > kernel performance suffers on those nodes.
> > 
> > If there is a local copy of the kernel text in each node's RAM, and
> > each node runs the kernel using its local copy of the kernel text,
> > then it stands to reason that the kernel will run faster due to fewer
> > stalls while instructions are fetched from remote memory.
> > 
> > The question then arises how to achieve this.
> > 
> > Background
> > ----------
> > 
> > An important issue to contend with is what happens when a thread
> > migrates between nodes. Essentially, the thread's state (including
> > instruction pointer) is saved to memory, and the scheduler on that CPU
> > loads some other thread's state and that CPU resumes executing that
> > new thread.
> > 
> > The CPU gaining the migrating thread loads the saved state, again
> > including the instruction pointer, and the gaining CPU resumes fetching
> > instructions at the virtual address where the original CPU left off.
> > 
> > The key point is that the virtual address is what matters here, and
> > this gives us a way to implement kernel text replication fairly easily.
> > At a practical level, all we need to do is to ensure that the virtual
> > addresses which contain the kernel text point to a local copy of
> > that text.
> > 
> > This is exactly how this proposal of kernel text replication achieves
> > the replication. We can go a little bit further and include most of
> > the read-only data in this replication, as that will never be written
> > to by the kernel (and thus remains constant.)
> > 
> > Solution
> > --------
> > 
> > So, what we need to achieve is:
> > 
> > 1. multiple identical copies of the kernel text (and read-only data)
> > 2. point the virtual mappings to the appropriate copy of kernel text
> >    for the NUMA node.
> > 
> > (1) is fairly easy to achieve - we just need to allocate some memory
> > in the appropriate node and copy the parts of the kernel we want to
> > replicate. However, we also need to deal with ARM64's kernel patching.
> > There are two functions that patch the kernel text,
> > __apply_alternatives() and aarch64_insn_patch_text_nosync(). Both of
> > these need to be modified to update all copies of the kernel text.
> > 
> > (2) is slightly harder.
> > 
> > Firstly, the aarch64 architecture has a very useful feature here - the
> > kernel page tables are entirely separate from the user page tables.
> > The hardware contains two page table pointers, one is used for user
> > mappings, the other is used for kernel mappings.
> > 
> > Therefore, we only have one page table to be concerned with: the table
> > which maps kernel space. We do not need to be concerned with each
> > user process's page table.
> > 
> > The approach taken here is to ensure that the kernel is located in an
> > area of kernel virtual address space covered by a level-0 page table
> > entry which is not shared with any other user. We can then maintain
> > separate per-node level-0 page tables for kernel space where the only
> > difference between them is this level-0 page table entry.
> > 
> > This gives a couple of benefits. Firstly, when updates to the level-0
> > page table happen (e.g. when establishing new mappings) these updates
> > can simply be copied to the other level-0 page tables provided it isn't
> > for the kernel image. Secondly, we don't need complexity at lower
> > levels of the page table code to figure out whether a level-1 or lower
> > update needs to be propagated to other nodes.
> > 
> > The level-0 page table entry for the kernel can then be used to point
> > at a node-unique set of level 1..N page tables to map the appropriate
> > copy of the kernel text (and read-only data) into kernel space, while
> > keeping the kernel read-write data shared between nodes.
> > 
> > Performance Analysis
> > --------------------
> > 
> > Needless to say, the performance results from kernel text replication
> > are workload specific, but appear to show a gain of between 6% and
> > 17% for database-centric workloads. When combined with userspace
> > awareness of NUMA, this can result in a gain of over 50%.
> > 
> > Problems
> > --------
> > 
> > There are a few areas that are a problem for kernel text replication:
> > 1) As this series changes the kernel space virtual address space
> >    layout, it breaks KASAN - and I've zero knowledge of KASAN so I
> >    have no idea how to fix it. I would be grateful for input from
> >    KASAN folk for suggestions how to fix this.
> > 
> > 2) KASLR can not be used with kernel text replication, since we need
> >    to place the kernel in its own L0 page table entry, not in vmalloc
> >    space. KASLR is disabled when support for kernel text replication
> >    is enabled.
> > 
> > 3) Changing the kernel virtual address space layout also means that
> >    kaslr_offset() and kaslr_enabled() need to become macros rather
> >    than inline functions due to the use of PGDIR_SIZE in the
> >    calculation of KIMAGE_VADDR. Since asm/pgtable.h defines this
> >    constant, but asm/memory.h is included by asm/pgtable.h, having
> >    this symbol available would produce a circular include
> >    dependency, so I don't think there is any choice here.
> > 
> > 4) read-only protection for replicated kernel images is not yet
> >    implemented.
> > 
> > Patch overview:
> > 
> > Patch 1 cleans up the rox page protection logic.
> > Patch 2 reorganises the kernel virtual address space layout (causing
> >   problems 1 and 3).
> > Patch 3 provides a version of cpu_replace_ttbr1 that takes physical
> >   addresses.
> > Patch 4 makes a needed cache flushing function visible.
> > Patch 5 through 16 are the guts of kernel text replication.
> > Patch 17 adds the Kconfig entry for it.
> > 
> > Further patches not included in this set add a Kconfig for the default
> > state, a test module, and add code to verify the replicated kernel
> > text matches the node 0 text after the kernel has completed most of
> > its boot.
> > 
> >  Documentation/admin-guide/kernel-parameters.txt |   5 +
> >  arch/arm64/Kconfig                              |  10 +-
> >  arch/arm64/include/asm/cacheflush.h             |   2 +
> >  arch/arm64/include/asm/ktext.h                  |  45 ++++++
> >  arch/arm64/include/asm/memory.h                 |  26 ++--
> >  arch/arm64/include/asm/mmu_context.h            |  12 +-
> >  arch/arm64/include/asm/pgtable.h                |  35 ++++-
> >  arch/arm64/include/asm/smp.h                    |   1 +
> >  arch/arm64/kernel/alternative.c                 |   4 +-
> >  arch/arm64/kernel/asm-offsets.c                 |   1 +
> >  arch/arm64/kernel/cpufeature.c                  |   2 +-
> >  arch/arm64/kernel/head.S                        |   3 +-
> >  arch/arm64/kernel/hibernate.c                   |   2 +-
> >  arch/arm64/kernel/patching.c                    |   7 +-
> >  arch/arm64/kernel/smp.c                         |   3 +
> >  arch/arm64/kernel/suspend.c                     |   3 +-
> >  arch/arm64/kernel/vmlinux.lds.S                 |   3 +
> >  arch/arm64/mm/Makefile                          |   2 +
> >  arch/arm64/mm/init.c                            |   3 +
> >  arch/arm64/mm/ktext.c                           | 198 ++++++++++++++++++++++++
> >  arch/arm64/mm/mmu.c                             |  85 ++++++++--
> >  21 files changed, 413 insertions(+), 39 deletions(-)
> >  create mode 100644 arch/arm64/include/asm/ktext.h
> >  create mode 100644 arch/arm64/mm/ktext.c
> > 
> > 
> > -- 
> > RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> > FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
> > 
> 
> -- 
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 02/17] arm64: place kernel in its own L0 page table entry
       [not found]   ` <ZIb+Lg9F9b4ay90p@FVFF77S0Q05N>
@ 2023-06-12 15:04     ` Russell King (Oracle)
  0 siblings, 0 replies; 27+ messages in thread
From: Russell King (Oracle) @ 2023-06-12 15:04 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Catalin Marinas, Jonathan Corbet, Will Deacon, linux-arm-kernel,
	linux-doc, Ard Biesheuvel

On Mon, Jun 12, 2023 at 12:14:54PM +0100, Mark Rutland wrote:
> Hi Russell,
> 
> On Tue, May 30, 2023 at 03:04:40PM +0100, Russell King (Oracle) wrote:
> > Kernel text replication needs to maintain separate per-node page
> > tables for the kernel text. In order to do this without affecting
> > other kernel memory mappings, placing the kernel such that it does
> > not share a L0 page table entry with any other mapping is desirable.
> > 
> > Prior to this commit, the layout without KASLR was:
> > 
> > +----------+
> > |  vmalloc |
> > +----------+
> > |  Kernel  |
> > +----------+ MODULES_END, VMALLOC_START, KIMAGE_VADDR =
> > |  Modules |                 MODULES_VADDR + MODULES_VSIZE
> > +----------+ MODULES_VADDR = _PAGE_END(VA_BITS_MIN)
> > | VA space |
> > +----------+ 0
> > 
> > This becomes:
> > 
> > +----------+
> > |  vmalloc |
> > +----------+ VMALLOC_START = MODULES_END + PGDIR_SIZE
> > |  Kernel  |
> > +----------+ MODULES_END, KIMAGE_VADDR = _PAGE_END(VA_BITS_MIN) + PGDIR_SIZE
> > |  Modules |
> > +----------+ MODULES_VADDR = MODULES_END - MODULES_VSIZE
> > | VA space |
> > +----------+ 0
> 
> With KASLR we may randomize the kernel and module space over a substantial
> portion of the vmalloc range. Are you expecting that text replication is going
> to restrict that range, or that we'd make it mutually exclusive with KASLR?

In the patch that adds the REPLICATE_KTEXT config option, I've made it
exclusive with RANDOMIZE_BASE, but this change in layout isn't dependent
on REPLICATE_KTEXT.

I've tested it with RANDOMIZE_BASE=y, and nothing seems to get upset,
so I believe that this patch doesn't cause any negative issues.
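
For reference, the arithmetic behind the new layout quoted above can
be checked with a short userspace sketch. The values are assumptions
chosen for illustration (4K pages, 4 levels, VA_BITS_MIN of 48 and the
128M MODULES_VSIZE mentioned in the commit message); the MODEL_ names
and model_pgd_index() exist only in this example.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_VA_BITS_MIN   48		/* assumed */
#define MODEL_PGDIR_SHIFT   39		/* assumed: 4K pages, 4 levels */
#define MODEL_PTRS_PER_PGD  512
#define MODEL_PGDIR_SIZE    (UINT64_C(1) << MODEL_PGDIR_SHIFT)
#define MODEL_MODULES_VSIZE (UINT64_C(128) << 20)

/* Same shape as _PAGE_END(VA_BITS_MIN): the start of the upper half. */
#define MODEL_PAGE_END      (-(UINT64_C(1) << (MODEL_VA_BITS_MIN - 1)))

/* The layout from the "This becomes:" diagram. */
#define MODEL_KIMAGE_VADDR  (MODEL_PAGE_END + MODEL_PGDIR_SIZE)
#define MODEL_MODULES_END   MODEL_KIMAGE_VADDR
#define MODEL_MODULES_VADDR (MODEL_MODULES_END - MODEL_MODULES_VSIZE)
#define MODEL_VMALLOC_START (MODEL_MODULES_END + MODEL_PGDIR_SIZE)

/* Which level-0 entry covers a given kernel virtual address. */
static unsigned int model_pgd_index(uint64_t va)
{
	return (unsigned int)((va >> MODEL_PGDIR_SHIFT) &
			      (MODEL_PTRS_PER_PGD - 1));
}
```

Under these assumptions the modules, the kernel image and the vmalloc
area each start in a different level-0 entry, which is the property
the replication code relies on.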

> I also note that the L0 table could have as few as two entries (with 16K pages
> and 4 levels). So either we'd need to also mess with an L1 table, or make text
> replication mutually exclusive with such configurations.

Ah, thanks for pointing that out - I was hoping to avoid needing
to touch anything but L0 tables.

However, it brings up a question: are there any NUMA systems that would
have just two entries in the L0 table? I suspect NUMA systems have lots
of RAM, and so would want a page table layout that results in multiple
L0 entries.
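
To put numbers on the point about small L0 tables, here is a sketch
that computes how many top-level entries a configuration has. The
shift formula follows the usual arm64 relationship between page size
and table levels; the function names model_pgdir_shift() and
model_ptrs_per_pgd() are made up for this example.

```c
#include <assert.h>

/*
 * Each table level resolves (page_shift - 3) bits, since a table is
 * one page of 8-byte descriptors. The top-level shift is therefore
 * (page_shift - 3) * levels + 3.
 */
static unsigned int model_pgdir_shift(unsigned int page_shift,
				      unsigned int levels)
{
	assert(page_shift > 3 && levels >= 1);

	return (page_shift - 3) * levels + 3;
}

/* Number of entries in the top-level table for a given VA size. */
static unsigned long model_ptrs_per_pgd(unsigned int va_bits,
					unsigned int page_shift,
					unsigned int levels)
{
	unsigned int shift = model_pgdir_shift(page_shift, levels);

	assert(va_bits > shift);

	return 1UL << (va_bits - shift);
}
```

For 4K pages with 4 levels and a 48-bit VA this gives 512 entries,
while 16K pages with 4 levels and a 48-bit VA gives only 2, which is
the case Mark describes.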

> > This assumes MODULES_VSIZE (128M) <= PGDIR_SIZE.
> 
> As a heads-up, we've just changed MODULES_VSIZE to be 2G in
> 
>   https://lore.kernel.org/linux-arm-kernel/20230530110328.2213762-1-mark.rutland@arm.com/
> 
> .. which is queued in the arm64 for-next/module-alloc branch:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/module-alloc

Ok - so I need to get a bit more clever about calculating MODULES_END
and KIMAGE_VADDR.

> > One side effect of this change is that KIMAGE_VADDR's definition now
> > includes PGDIR_SIZE (to leave room for the modules) but this is not
> > defined when asm/memory.h is included. This means KIMAGE_VADDR can
> > not be used in inline functions within this file, so we convert
> > kaslr_offset() and kaslr_enabled() to be macros instead.
> 
> That series above also decoupled kaslr_enabled() from kaslr_offset(), 
> so we'd only need to change kaslr_offset().

Ok, I'll take a look to see how my changes are impacted.

> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 4829abe017e9..baf74d0c43c9 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -478,7 +478,8 @@ void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
> >  static void update_mapping_prot(phys_addr_t phys, unsigned long virt,
> >  				phys_addr_t size, pgprot_t prot)
> >  {
> > -	if ((virt >= PAGE_END) && (virt < VMALLOC_START)) {
> > +	if ((virt >= PAGE_END) && (virt < VMALLOC_START) &&
> > +	    !is_kernel(virt)) {
> >  		pr_warn("BUG: not updating mapping for %pa at 0x%016lx - outside kernel range\n",
> >  			&phys, virt);
> >  		return;
> 
> I think the existing conditions here aren't quite right, and have become bogus
> over time, and I don't think that the is_kernel() check is necessary here.
> 
> Originally, back in commit:
> 
>   c1cc1552616d0f35 ("arm64: MMU initialisation")
> 
> We had:
> 
> 	if (virt < VMALLOC_START) {
> 		pr_warning("BUG: not creating mapping for 0x%016llx at 0x%016lx - outside kernel range\n",
> 			   phys, virt);
> 		return;
> 	}
> 
> ... which checked that the VA range we were manipulating was in the TTBR1 VA
> range, as at the time, VMALLOC_START happened to be the lowest TTBR1 address.
> 
> That didn't substantially change until commit:
> 
>   14c127c957c1c607 ("arm64: mm: Flip kernel VA space")
> 
> ... when the test was changed to:
> 
> 	if ((virt >= VA_START) && (virt < VMALLOC_START)) {
> 		pr_warn("BUG: not creating mapping for %pa at 0x%016lx - outside kernel range\n",
> 			&phys, virt);
> 		return;
> 	}
> 
> Note: in that commit, VA_START was actually the end of the linear map (which
> was itself at the start of the TTBR1 address space), so this is just checking if
> we're poking a small portion of the TTBR1 address space, rather than if we're
> poking *outside* of the TTBR1 address space.
> 
> That doesn't make much sense, and I'm pretty sure that was a thinko rather than
> an intentional change of semantic.
> 
> I "fixed" that without thinking in commit:
> 
>   77ad4ce69321abbe ("arm64: memory: rename VA_START to PAGE_END")
> 
> ... making that:
> 
> 	if ((virt >= PAGE_END) && (virt < VMALLOC_START)) {
> 		pr_warn("BUG: not creating mapping for %pa at 0x%016lx - outside kernel range\n",
> 			&phys, virt);
> 		return;
> 	}
> 
> ... but clearly it has lost the original semantic and doesn't make much sense.
> 
> I think the test should actually be something like:
> 
> 	/* Must be a TTBR1 address */
> 	if (virt < PAGE_OFFSET) {
> 		...
> 	}
> 
> ... and then we won't randomly trip for kernel mappings if those fall between
> the linear map and vmalloc range.

Okay, so that sounds like if this is fixed, then I won't need to patch
it! Yay!
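
To make the difference between the two tests concrete, here is a
userspace comparison. It is a sketch under assumptions: VA_BITS and
VA_BITS_MIN are both taken as 48, PGDIR_SIZE as 2^39, and the kernel
placement follows the new layout from patch 2. All CHK_ names and the
two helper functions are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CHK_VA_BITS       48	/* assumed, equal to VA_BITS_MIN here */
#define CHK_PGDIR_SIZE    (UINT64_C(1) << 39)

#define CHK_PAGE_OFFSET   (-(UINT64_C(1) << CHK_VA_BITS))
#define CHK_PAGE_END      (-(UINT64_C(1) << (CHK_VA_BITS - 1)))
#define CHK_KIMAGE_VADDR  (CHK_PAGE_END + CHK_PGDIR_SIZE)
#define CHK_VMALLOC_START (CHK_KIMAGE_VADDR + CHK_PGDIR_SIZE)

/* The existing test: warn for anything in [PAGE_END, VMALLOC_START). */
static bool existing_check_rejects(uint64_t virt)
{
	return virt >= CHK_PAGE_END && virt < CHK_VMALLOC_START;
}

/* The suggested test: warn only for addresses below the TTBR1 range. */
static bool suggested_check_rejects(uint64_t virt)
{
	return virt < CHK_PAGE_OFFSET;
}
```

In this model the existing test rejects the relocated kernel image
but lets a low (TTBR0) address through, whereas the suggested test
does the opposite.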

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 00/17] arm64 kernel text replication
  2023-06-05  9:05 ` [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
  2023-06-05 13:46   ` Mark Rutland
@ 2023-06-23 15:24   ` Ard Biesheuvel
  2023-06-23 15:34     ` Russell King (Oracle)
  2023-06-23 16:37     ` Marc Zyngier
  1 sibling, 2 replies; 27+ messages in thread
From: Ard Biesheuvel @ 2023-06-23 15:24 UTC (permalink / raw)
  To: Russell King (Oracle), Marc Zyngier, Quentin Perret, Mark Rutland
  Cc: Catalin Marinas, Jonathan Corbet, Will Deacon, linux-arm-kernel,
	linux-doc

(cc Marc and Quentin)

On Mon, 5 Jun 2023 at 11:05, Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> Hi,
>
> Are there any comments on this?
>

Hi Russell,

I think the proposed approach is sound, but it is rather intrusive, as
you've pointed out already (wrt KASLR and KASAN etc). And once my LPA2
work gets merged (which uses root level -1 when booted on LPA2 capable
hardware, and level 0 otherwise), we'll have yet another combination
that is either fully incompatible, or cumbersome to support at the
very least.

I wonder if it would be worthwhile to explore an alternative approach,
using pKVM and the host stage2:

- all stage1 kernel mappings remain as they are, and the kernel code
running at EL1 has no awareness of the replication beyond being
involved in allocating the memory;
- host is booted in protected KVM mode, which means that the host
kernel executes under a stage 2 mapping;
- each NUMA node has its own set of stage 2 page tables, and maps the
kernel's code/rodata IPA range to a NUMA local PA range;
- the kernel's code and rodata are mapped read-only in the primary
stage-2 mapping so updates trap to EL2, permitting the hypervisor to
replicate those updates to all clones.

Note that pKVM retains the capabilities of ordinary KVM, so as long as
you boot at EL2, the only downside compared to your approach would be
the increased TLB footprint due to the stage 2 mappings for the host
kernel.

Marc, Quentin, Will: any thoughts?



>
> On Tue, May 30, 2023 at 03:04:01PM +0100, Russell King (Oracle) wrote:
> > Problem
> > -------
> >
> > NUMA systems have greater latency when accessing data and instructions
> > across nodes, which can lead to a reduction in performance on CPU cores
> > that mainly perform accesses beyond their local node.
> >
> > Normally when an ARM64 system boots, the kernel will end up placed in
> > memory, and each CPU core will have to fetch instructions and data from
> > whichever NUMA node the kernel has been placed. This means that while
> > executing kernel code, CPUs local to that node will run faster than
> > CPUs in remote nodes.
> >
> > The higher the latency to access remote NUMA node memory, the more the
> > kernel performance suffers on those nodes.
> >
> > If there is a local copy of the kernel text in each node's RAM, and
> > each node runs the kernel using its local copy of the kernel text,
> > then it stands to reason that the kernel will run faster due to fewer
> > stalls while instructions are fetched from remote memory.
> >
> > The question then arises how to achieve this.
> >
> > Background
> > ----------
> >
> > An important issue to contend with is what happens when a thread
> > migrates between nodes. Essentially, the thread's state (including
> > instruction pointer) is saved to memory, and the scheduler on that CPU
> > loads some other thread's state and that CPU resumes executing that
> > new thread.
> >
> > The CPU gaining the migrating thread loads the saved state, again
> > including the instruction pointer, and the gaining CPU resumes fetching
> > instructions at the virtual address where the original CPU left off.
> >
> > The key point is that the virtual address is what matters here, and
> > this gives us a way to implement kernel text replication fairly easily.
> > At a practical level, all we need to do is to ensure that the virtual
> > addresses which contain the kernel text point to a local copy of
> > that text.
> >
> > This is exactly how this proposal of kernel text replication achieves
> > the replication. We can go a little bit further and include most of
> > the read-only data in this replication, as that will never be written
> > to by the kernel (and thus remains constant.)
> >
> > Solution
> > --------
> >
> > So, what we need to achieve is:
> >
> > 1. multiple identical copies of the kernel text (and read-only data)
> > 2. point the virtual mappings to the appropriate copy of kernel text
> >    for the NUMA node.
> >
> > (1) is fairly easy to achieve - we just need to allocate some memory
> > in the appropriate node and copy the parts of the kernel we want to
> > replicate. However, we also need to deal with ARM64's kernel patching.
> > There are two functions that patch the kernel text,
> > __apply_alternatives() and aarch64_insn_patch_text_nosync(). Both of
> > these need to be modified to update all copies of the kernel text.
> >
> > (2) is slightly harder.
> >
> > Firstly, the aarch64 architecture has a very useful feature here - the
> > kernel page tables are entirely separate from the user page tables.
> > The hardware contains two page table pointers, one is used for user
> > mappings, the other is used for kernel mappings.
> >
> > Therefore, we only have one page table to be concerned with: the table
> > which maps kernel space. We do not need to be concerned with each
> > user process's page table.
> >
> > The approach taken here is to ensure that the kernel is located in an
> > area of kernel virtual address space covered by a level-0 page table
> > entry which is not shared with any other user. We can then maintain
> > separate per-node level-0 page tables for kernel space where the only
> > difference between them is this level-0 page table entry.
> >
> > This gives a couple of benefits. Firstly, when updates to the level-0
> > page table happen (e.g. when establishing new mappings) these updates
> > can simply be copied to the other level-0 page tables provided it isn't
> > for the kernel image. Secondly, we don't need complexity at lower
> > levels of the page table code to figure out whether a level-1 or lower
> > update needs to be propagated to other nodes.
> >
> > The level-0 page table entry for the kernel can then be used to point
> > at a node-unique set of level 1..N page tables to make the appropriate
> > copy of the kernel text (and read-only data) into kernel space, while
> > keeping the kernel read-write data shared between nodes.
> >
> > Performance Analysis
> > --------------------
> >
> > Needless to say, the performance results from kernel text replication
> > are workload specific, but appear to show a gain of between 6% and
> > 17% for database-centric workloads. When combined with userspace
> > awareness of NUMA, this can result in a gain of over 50%.
> >
> > Problems
> > --------
> >
> > There are a few areas that are a problem for kernel text replication:
> > 1) As this series changes the kernel space virtual address space
> >    layout, it breaks KASAN - and I've zero knowledge of KASAN so I
> >    have no idea how to fix it. I would be grateful for input from
> >    KASAN folk for suggestions how to fix this.
> >
> > 2) KASLR can not be used with kernel text replication, since we need
> >    to place the kernel in its own L0 page table entry, not in vmalloc
> >    space. KASLR is disabled when support for kernel text replication
> >    is enabled.
> >
> > 3) Changing the kernel virtual address space layout also means that
> >    kaslr_offset() and kaslr_enabled() need to become macros rather
> >    than inline functions due to the use of PGDIR_SIZE in the
> >    calculation of KIMAGE_VADDR. Since asm/pgtable.h defines this
> >    constant, but asm/memory.h is included by asm/pgtable.h, having
> >    this symbol available would produce a circular include
> >    dependency, so I don't think there is any choice here.
> >
> > 4) read-only protection for replicated kernel images is not yet
> >    implemented.
> >
> > Patch overview:
> >
> > Patch 1 cleans up the rox page protection logic.
> > Patch 2 reorganises the kernel virtual address space layout (causing
> >   problems 1 and 3).
> > Patch 3 provides a version of cpu_replace_ttbr1 that takes physical
> >   addresses.
> > Patch 4 makes a needed cache flushing function visible.
> > Patch 5 through 16 are the guts of kernel text replication.
> > Patch 17 adds the Kconfig entry for it.
> >
> > Further patches not included in this set add a Kconfig for the default
> > state, a test module, and add code to verify the replicated kernel
> > text matches the node 0 text after the kernel has completed most of
> > its boot.
> >
> >  Documentation/admin-guide/kernel-parameters.txt |   5 +
> >  arch/arm64/Kconfig                              |  10 +-
> >  arch/arm64/include/asm/cacheflush.h             |   2 +
> >  arch/arm64/include/asm/ktext.h                  |  45 ++++++
> >  arch/arm64/include/asm/memory.h                 |  26 ++--
> >  arch/arm64/include/asm/mmu_context.h            |  12 +-
> >  arch/arm64/include/asm/pgtable.h                |  35 ++++-
> >  arch/arm64/include/asm/smp.h                    |   1 +
> >  arch/arm64/kernel/alternative.c                 |   4 +-
> >  arch/arm64/kernel/asm-offsets.c                 |   1 +
> >  arch/arm64/kernel/cpufeature.c                  |   2 +-
> >  arch/arm64/kernel/head.S                        |   3 +-
> >  arch/arm64/kernel/hibernate.c                   |   2 +-
> >  arch/arm64/kernel/patching.c                    |   7 +-
> >  arch/arm64/kernel/smp.c                         |   3 +
> >  arch/arm64/kernel/suspend.c                     |   3 +-
> >  arch/arm64/kernel/vmlinux.lds.S                 |   3 +
> >  arch/arm64/mm/Makefile                          |   2 +
> >  arch/arm64/mm/init.c                            |   3 +
> >  arch/arm64/mm/ktext.c                           | 198 ++++++++++++++++++++++++
> >  arch/arm64/mm/mmu.c                             |  85 ++++++++--
> >  21 files changed, 413 insertions(+), 39 deletions(-)
> >  create mode 100644 arch/arm64/include/asm/ktext.h
> >  create mode 100644 arch/arm64/mm/ktext.c
> >
> >
> > --
> > RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> > FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
> >
>
> --
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 00/17] arm64 kernel text replication
  2023-06-23 15:24   ` Ard Biesheuvel
@ 2023-06-23 15:34     ` Russell King (Oracle)
  2023-06-23 15:54       ` Marc Zyngier
  2023-06-23 16:37     ` Marc Zyngier
  1 sibling, 1 reply; 27+ messages in thread
From: Russell King (Oracle) @ 2023-06-23 15:34 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Marc Zyngier, Quentin Perret, Mark Rutland, Catalin Marinas,
	Jonathan Corbet, Will Deacon, linux-arm-kernel, linux-doc

On Fri, Jun 23, 2023 at 05:24:20PM +0200, Ard Biesheuvel wrote:
> (cc Marc and Quentin)
> 
> On Mon, 5 Jun 2023 at 11:05, Russell King (Oracle)
> <linux@armlinux.org.uk> wrote:
> >
> > Hi,
> >
> > Are there any comments on this?
> >
> 
> Hi Russell,
> 
> I think the proposed approach is sound, but it is rather intrusive, as
> you've pointed out already (wrt KASLR and KASAN etc). And once my LPA2
> work gets merged (which uses root level -1 when booted on LPA2 capable
> hardware, and level 0 otherwise), we'll have yet another combination
> that is either fully incompatible, or cumbersome to support at the
> very least.
> 
> I wonder if it would be worthwhile to explore an alternative approach,
> using pKVM and the host stage2:
> 
> - all stage1 kernel mappings remain as they are, and the kernel code
> running at EL1 has no awareness of the replication beyond being
> involved in allocating the memory;
> - host is booted in protected KVM mode, which means that the host
> kernel executes under a stage 2 mapping;
> - each NUMA node has its own set of stage 2 page tables, and maps the
> kernel's code/rodata IPA range to a NUMA local PA range
> - the kernel's code and rodata are mapped read-only in the primary
> stage-2 mapping so updates trap to EL2, permitting the hypervisor to
> replicate those updates to all clones.
> 
> Note that pKVM retains the capabilities of ordinary KVM, so as long as
> you boot at EL2, the only downside compared to your approach would be
> the increased TLB footprint due to the stage 2 mappings for the host
> kernel.
> 
> Marc, Quentin, Will: any thoughts?

Thanks for taking a look.

That sounds great, but my initial question would be whether, with such a
setup, one could then run VMs under such a kernel without hardware that
supports nested virtualisation? I suspect the answer would be no.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 00/17] arm64 kernel text replication
  2023-06-23 15:34     ` Russell King (Oracle)
@ 2023-06-23 15:54       ` Marc Zyngier
  2023-06-26 23:42         ` Lameter, Christopher
  0 siblings, 1 reply; 27+ messages in thread
From: Marc Zyngier @ 2023-06-23 15:54 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Ard Biesheuvel, Quentin Perret, Mark Rutland, Catalin Marinas,
	Jonathan Corbet, Will Deacon, linux-arm-kernel, linux-doc

On 2023-06-23 16:34, Russell King (Oracle) wrote:
> On Fri, Jun 23, 2023 at 05:24:20PM +0200, Ard Biesheuvel wrote:
>> (cc Marc and Quentin)
>> 
>> On Mon, 5 Jun 2023 at 11:05, Russell King (Oracle)
>> <linux@armlinux.org.uk> wrote:
>> >
>> > Hi,
>> >
>> > Are there any comments on this?
>> >
>> 
>> Hi Russell,
>> 
>> I think the proposed approach is sound, but it is rather intrusive, as
>> you've pointed out already (wrt KASLR and KASAN etc). And once my LPA2
>> work gets merged (which uses root level -1 when booted on LPA2 capable
>> hardware, and level 0 otherwise), we'll have yet another combination
>> that is either fully incompatible, or cumbersome to support at the
>> very least.
>> 
>> I wonder if it would be worthwhile to explore an alternative approach,
>> using pKVM and the host stage2:
>> 
>> - all stage1 kernel mappings remain as they are, and the kernel code
>> running at EL1 has no awareness of the replication beyond being
>> involved in allocating the memory;
>> - host is booted in protected KVM mode, which means that the host
>> kernel executes under a stage 2 mapping;
>> - each NUMA node has its own set of stage 2 page tables, and maps the
>> kernel's code/rodata IPA range to a NUMA local PA range
>> - the kernel's code and rodata are mapped read-only in the primary
>> stage-2 mapping so updates trap to EL2, permitting the hypervisor to
>> replicate those updates to all clones.
>> 
>> Note that pKVM retains the capabilities of ordinary KVM, so as long as
>> you boot at EL2, the only downside compared to your approach would be
>> the increased TLB footprint due to the stage 2 mappings for the host
>> kernel.
>> 
>> Marc, Quentin, Will: any thoughts?
> 
> Thanks for taking a look.
> 
> That sounds great, but my initial question would be whether, with such a
> setup, one could then run VMs under such a kernel without hardware that
> supports nested virtualisation? I suspect the answer would be no.

The answer is yes. All you need to do is to switch between the host
and guest stage-2s in the hypervisor, which is what KVM running in
protected mode does.
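
The switch described above can be modelled in a few lines of user-space
C. This is only a sketch under stated assumptions: struct s2_mmu,
cpu_vttbr, load_stage2() and run_guest_once() are hypothetical names
for illustration and are not KVM's actual interfaces. A plain variable
stands in for the VTTBR_EL2 register of one CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one stage-2 address space. */
struct s2_mmu {
	uint64_t vttbr;	/* value that would be programmed into VTTBR_EL2 */
};

/* Stand-in for the VTTBR_EL2 system register of one CPU. */
static uint64_t cpu_vttbr;

static void load_stage2(const struct s2_mmu *mmu)
{
	assert(mmu);
	cpu_vttbr = mmu->vttbr;
}

/*
 * Model of one guest entry/exit as seen from the hypervisor: install the
 * guest's stage-2, "run" the guest, then put the host's stage-2 back.
 * Returns the stage-2 base that was live while the guest ran.
 */
static uint64_t run_guest_once(const struct s2_mmu *host,
			       const struct s2_mmu *guest)
{
	uint64_t live;

	load_stage2(guest);
	live = cpu_vttbr;	/* the guest executes under its own stage-2 */
	load_stage2(host);	/* back to the host's stage-2 on exit */
	return live;
}
```

Only one stage-2 is live at any time, so the host and each guest get
their own translation without the hardware having to nest them.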

         M.

-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 00/17] arm64 kernel text replication
  2023-06-23 15:24   ` Ard Biesheuvel
  2023-06-23 15:34     ` Russell King (Oracle)
@ 2023-06-23 16:37     ` Marc Zyngier
  1 sibling, 0 replies; 27+ messages in thread
From: Marc Zyngier @ 2023-06-23 16:37 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Russell King (Oracle), Quentin Perret, Mark Rutland,
	Catalin Marinas, Jonathan Corbet, Will Deacon, linux-arm-kernel,
	linux-doc

On Fri, 23 Jun 2023 16:24:20 +0100,
Ard Biesheuvel <ardb@kernel.org> wrote:
> 
> (cc Marc and Quentin)
> 
> On Mon, 5 Jun 2023 at 11:05, Russell King (Oracle)
> <linux@armlinux.org.uk> wrote:
> >
> > Hi,
> >
> > Are there any comments on this?
> >
> 
> Hi Russell,
> 
> I think the proposed approach is sound, but it is rather intrusive, as
> you've pointed out already (wrt KASLR and KASAN etc). And once my LPA2
> work gets merged (which uses root level -1 when booted on LPA2 capable
> hardware, and level 0 otherwise), we'll have yet another combination
> that is either fully incompatible, or cumbersome to support at the
> very least.
> 
> I wonder if it would be worthwhile to explore an alternative approach,
> using pKVM and the host stage2:
> 
> - all stage1 kernel mappings remain as they are, and the kernel code
> running at EL1 has no awareness of the replication beyond being
> involved in allocating the memory;
> - host is booted in protected KVM mode, which means that the host
> kernel executes under a stage 2 mapping;
> - each NUMA node has its own set of stage 2 page tables, and maps the
> kernel's code/rodata IPA range to a NUMA local PA range
> - the kernel's code and rodata are mapped read-only in the primary
> stage-2 mapping so updates trap to EL2, permitting the hypervisor to
> replicate those updates to all clones.
> 
> Note that pKVM retains the capabilities of ordinary KVM, so as long as
> you boot at EL2, the only downside compared to your approach would be
> the increased TLB footprint due to the stage 2 mappings for the host
> kernel.
> 
> Marc, Quentin, Will: any thoughts?

I like the idea, though there are a couple of 'interesting' corner
cases:

- you have to give up VHE, which means that if your workload is to
  mainly run VMs, you pay an extra cost on each guest entry/exit

- the EL2 code doesn't have the luxury of a stage-2, meaning that
  either you accept the fact that this code is going to suffer from
  uneven performance, or you keep the complexity of the kernel-visible
  replication for the EL2 code only

- memory allocation for the stage-2 is tricky (Quentin can talk about
  that), and relies on being able to steal enough memory to cover the
  whole of the host's memory-map, including I/O. Having a set of S2
  PTs per node is going to increase that pressure/complexity

- I'm not too worried about the TLB aspect. Cores tend to cache VA/PA,
  not VA/IPA+IPA/PA. What is going to cost is the walk itself. This
  could be mitigated if S2 uses large mappings (possibly using 64k
  pages).
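
To put a rough number on the walk cost in the last point: assuming no
TLB or walk-cache hits at all, each stage-1 table fetch first needs its
own stage-2 walk, and the final output address needs one more. The
helper below, nested_walk_accesses(), is a hypothetical name for this
sketch, and the level counts fed to it are illustrative inputs rather
than measurements.

```c
#include <assert.h>

/*
 * Worst-case memory accesses for one two-stage translation with no TLB
 * or walk-cache hits:
 *   - each of the s1_levels table fetches costs s2_levels accesses to
 *     translate the table's IPA, plus the fetch itself;
 *   - the final output IPA costs another s2_levels accesses.
 */
static unsigned int nested_walk_accesses(unsigned int s1_levels,
					 unsigned int s2_levels)
{
	assert(s1_levels > 0 && s2_levels > 0);
	return s1_levels * (s2_levels + 1) + s2_levels;
}
```

With four levels at both stages this gives 24 accesses; if large
stage-2 mappings cut the stage-2 walk to two levels, it drops to 14.
Real cores cache partial walks, so the observed cost is normally lower
than this bound.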

The last point makes me think that this approach may not be pKVM
itself, but something that builds on top of what pKVM has (host S2)
and the nVHE/hVHE behaviour.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 00/17] arm64 kernel text replication
  2023-06-23 15:54       ` Marc Zyngier
@ 2023-06-26 23:42         ` Lameter, Christopher
  2023-06-27  8:02           ` Marc Zyngier
  0 siblings, 1 reply; 27+ messages in thread
From: Lameter, Christopher @ 2023-06-26 23:42 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Russell King (Oracle), Ard Biesheuvel, Quentin Perret,
	Mark Rutland, Catalin Marinas, Jonathan Corbet, Will Deacon,
	linux-arm-kernel, linux-doc

On Fri, 23 Jun 2023, Marc Zyngier wrote:

>> That sounds great, but my initial question would be whether, with such a
>> setup, one could then run VMs under such a kernel without hardware that
>> supports nested virtualisation? I suspect the answer would be no.
>
> The answer is yes. All you need to do is to switch between the host
> and guest stage-2s in the hypervisor, which is what KVM running in
> protected mode does.

Well I think his point was that there are machines running without a 
hypervisor and kernel replication needs to work on that. We certainly 
benefit a lot from kernel replication and our customers may elect to run 
ARM64 kernels without hypervisors on bare metal.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH RFC 00/17] arm64 kernel text replication
  2023-06-26 23:42         ` Lameter, Christopher
@ 2023-06-27  8:02           ` Marc Zyngier
  0 siblings, 0 replies; 27+ messages in thread
From: Marc Zyngier @ 2023-06-27  8:02 UTC (permalink / raw)
  To: Lameter, Christopher
  Cc: Russell King (Oracle), Ard Biesheuvel, Quentin Perret,
	Mark Rutland, Catalin Marinas, Jonathan Corbet, Will Deacon,
	linux-arm-kernel, linux-doc

On Tue, 27 Jun 2023 00:42:53 +0100,
"Lameter, Christopher" <cl@os.amperecomputing.com> wrote:
> 
> On Fri, 23 Jun 2023, Marc Zyngier wrote:
> 
> >> That sounds great, but my initial question would be whether, with such a
> >> setup, one could then run VMs under such a kernel without hardware that
> >> supports nested virtualisation? I suspect the answer would be no.
> > 
> > The answer is yes. All you need to do is to switch between the host
> > and guest stage-2s in the hypervisor, which is what KVM running in
> > protected mode does.
> 
> Well I think his point was that there are machines running without a
> hypervisor and kernel replication needs to work on that. We certainly
> benefit a lot from kernel replication and our customers may elect to
> run ARM64 kernels without hypervisors on bare metal.

These are not incompatible goals.

The hypervisor is a function that the user may want to enable or not.
Irrespective of that, the HW that underpins the virtualisation
functionality is available and allows you to solve this particular
problem in a different way. This doesn't preclude running
bare-metal at all.

There is even precedent in using stage-2 to work around critical bugs
(the Socionext PCIe fiasco springs to mind).

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2023-06-27  8:03 UTC | newest]

Thread overview: 27+ messages
2023-05-30 14:04 [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
2023-05-30 14:04 ` [PATCH RFC 01/17] arm64: consolidate rox page protection logic Russell King (Oracle)
2023-05-30 14:04 ` [PATCH RFC 02/17] arm64: place kernel in its own L0 page table entry Russell King (Oracle)
     [not found]   ` <ZIb+Lg9F9b4ay90p@FVFF77S0Q05N>
2023-06-12 15:04     ` Russell King (Oracle)
2023-05-30 14:04 ` [PATCH RFC 03/17] arm64: provide cpu_replace_ttbr1_phys() Russell King (Oracle)
2023-05-30 14:04 ` [PATCH RFC 04/17] arm64: make clean_dcache_range_nopatch() visible Russell King (Oracle)
2023-05-30 14:04 ` [PATCH RFC 05/17] arm64: text replication: add init function Russell King (Oracle)
2023-05-30 14:05 ` [PATCH RFC 06/17] arm64: text replication: add sanity checks Russell King (Oracle)
2023-05-30 14:05 ` [PATCH RFC 07/17] arm64: text replication: copy initial kernel text Russell King (Oracle)
2023-05-30 14:05 ` [PATCH RFC 08/17] arm64: text replication: add node text patching Russell King (Oracle)
2023-05-30 14:05 ` [PATCH RFC 09/17] arm64: text replication: add node 0 page table definitions Russell King (Oracle)
2023-05-30 14:05 ` [PATCH RFC 10/17] arm64: text replication: add swapper page directory helpers Russell King (Oracle)
2023-05-30 14:05 ` [PATCH RFC 11/17] arm64: text replication: create per-node kernel page tables Russell King (Oracle)
2023-05-30 14:05 ` [PATCH RFC 12/17] arm64: text replication: boot secondary CPUs with appropriate TTBR1 Russell King (Oracle)
2023-05-30 14:05 ` [PATCH RFC 13/17] arm64: text replication: update cnp support Russell King (Oracle)
2023-05-30 14:05 ` [PATCH RFC 14/17] arm64: text replication: setup page tables for copied kernel Russell King (Oracle)
2023-05-30 14:05 ` [PATCH RFC 15/17] arm64: text replication: include most of read-only data as well Russell King (Oracle)
2023-05-30 14:05 ` [PATCH RFC 16/17] arm64: text replication: early kernel option to enable replication Russell King (Oracle)
2023-05-30 14:05 ` [PATCH RFC 17/17] arm64: text replication: add Kconfig Russell King (Oracle)
2023-06-05  9:05 ` [PATCH RFC 00/17] arm64 kernel text replication Russell King (Oracle)
2023-06-05 13:46   ` Mark Rutland
2023-06-23 15:24   ` Ard Biesheuvel
2023-06-23 15:34     ` Russell King (Oracle)
2023-06-23 15:54       ` Marc Zyngier
2023-06-26 23:42         ` Lameter, Christopher
2023-06-27  8:02           ` Marc Zyngier
2023-06-23 16:37     ` Marc Zyngier
