linux-trace-kernel.vger.kernel.org archive mirror
* [PATCH 00/13] mm: jit/text allocator
@ 2023-06-01 10:12 Mike Rapoport
  2023-06-01 10:12 ` [PATCH 01/13] nios2: define virtual address space for modules Mike Rapoport
                   ` (14 more replies)
  0 siblings, 15 replies; 55+ messages in thread
From: Mike Rapoport @ 2023-06-01 10:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy, David S. Miller,
	Dinh Nguyen, Heiko Carstens, Helge Deller, Huacai Chen,
	Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Mike Rapoport, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

Hi,

module_alloc() is used everywhere as a means of allocating memory for code.

Besides being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF, to modules
and puts the burden of code allocation on the modules code.

Several architectures override module_alloc() because of various
constraints on where executable memory can be located, and this creates
additional obstacles to improving code allocation.

This set splits code allocation from modules by introducing
jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
sites of module_alloc() and module_memfree() with the new APIs and
implements core text and related allocation in a central place.

Instead of architecture-specific overrides for module_alloc(), the
architectures that require non-default behaviour for text allocation must
fill in a jit_alloc_params structure and implement jit_alloc_arch_params(),
which returns a pointer to that structure. If an architecture does not implement
jit_alloc_arch_params(), the defaults compatible with the current
modules::module_alloc() are used.
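
To make the selection between architecture parameters and defaults concrete,
here is a minimal user-space sketch. It is not the code from this series: the
field names (.alignment, .text.pgprot, .text.start, .text.end) follow the
patches, but the struct layout, the DEMO_* constants, the numeric values and
the demo_* functions are assumptions made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins: the kernel would use pgprot_t and real address macros. */
struct jit_address_space {
	unsigned long pgprot;
	unsigned long start;
	unsigned long end;
};

struct jit_alloc_params {
	struct jit_address_space text;
	unsigned int alignment;
};

/* Hypothetical values standing in for MODULES_VADDR, VMALLOC_START and so on. */
#define DEMO_MODULES_VADDR	0xbe000000UL
#define DEMO_MODULES_END	0xbfffffffUL
#define DEMO_VMALLOC_START	0x80000000UL
#define DEMO_VMALLOC_END	0xbdffffffUL
#define DEMO_PAGE_KERNEL_EXEC	0x3UL

/* What an architecture with placement constraints would fill in. */
static struct jit_alloc_params demo_arch = {
	.alignment	= 1,
	.text.pgprot	= DEMO_PAGE_KERNEL_EXEC,
	.text.start	= DEMO_MODULES_VADDR,
	.text.end	= DEMO_MODULES_END,
};

/* Assumed defaults: the whole vmalloc range, mirroring a generic module_alloc(). */
static struct jit_alloc_params demo_defaults = {
	.alignment	= 1,
	.text.pgprot	= DEMO_PAGE_KERNEL_EXEC,
	.text.start	= DEMO_VMALLOC_START,
	.text.end	= DEMO_VMALLOC_END,
};

/* Models an architecture's jit_alloc_arch_params() implementation. */
struct jit_alloc_params *demo_arch_params(void)
{
	return &demo_arch;
}

/*
 * Models the central side: use the architecture's parameters when a hook
 * exists and returns something, otherwise fall back to the defaults.
 */
const struct jit_alloc_params *
demo_jit_alloc_select(struct jit_alloc_params *(*arch_hook)(void))
{
	struct jit_alloc_params *p = arch_hook ? arch_hook() : NULL;

	return p ? p : &demo_defaults;
}
```

Calling demo_jit_alloc_select(demo_arch_params) yields the module range, while
demo_jit_alloc_select(NULL) yields the vmalloc-wide defaults.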

The new jitalloc infrastructure allows decoupling of kprobes and ftrace
from modules, and most importantly it enables ROX allocations for
executable memory.

A centralized infrastructure for code allocation allows future
optimizations for allocations of executable memory, such as caching large
pages for better iTLB performance and providing sub-page allocations for
users that only need small jit code snippets.

patches 1-5: split out the code allocation from modules and arch
patch 6: add dedicated API for data allocations with constraints similar to
code allocations
patches 7-9: decouple dynamic ftrace and kprobes from CONFIG_MODULES
patches 10-13: enable ROX allocations for executable memory on x86

Mike Rapoport (IBM) (11):
  nios2: define virtual address space for modules
  mm: introduce jit_text_alloc() and use it instead of module_alloc()
  mm/jitalloc, arch: convert simple overrides of module_alloc to jitalloc
  mm/jitalloc, arch: convert remaining overrides of module_alloc to jitalloc
  module, jitalloc: drop module_alloc
  mm/jitalloc: introduce jit_data_alloc()
  x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
  arch: make jitalloc setup available regardless of CONFIG_MODULES
  kprobes: remove dependency on CONFIG_MODULES
  modules, jitalloc: prepare to allocate executable memory as ROX
  x86/jitalloc: make memory allocated for code ROX

Song Liu (2):
  ftrace: Add swap_func to ftrace_process_locs()
  x86/jitalloc: prepare to allocate executable memory as ROX

 arch/Kconfig                     |   5 +-
 arch/arm/kernel/module.c         |  32 ------
 arch/arm/mm/init.c               |  35 ++++++
 arch/arm64/kernel/module.c       |  47 --------
 arch/arm64/mm/init.c             |  42 +++++++
 arch/loongarch/kernel/module.c   |   6 -
 arch/loongarch/mm/init.c         |  16 +++
 arch/mips/kernel/module.c        |   9 --
 arch/mips/mm/init.c              |  19 ++++
 arch/nios2/include/asm/pgtable.h |   5 +-
 arch/nios2/kernel/module.c       |  24 ++--
 arch/parisc/kernel/module.c      |  11 --
 arch/parisc/mm/init.c            |  21 +++-
 arch/powerpc/kernel/kprobes.c    |   4 +-
 arch/powerpc/kernel/module.c     |  37 -------
 arch/powerpc/mm/mem.c            |  41 +++++++
 arch/riscv/kernel/module.c       |  10 --
 arch/riscv/mm/init.c             |  18 +++
 arch/s390/kernel/ftrace.c        |   4 +-
 arch/s390/kernel/kprobes.c       |   4 +-
 arch/s390/kernel/module.c        |  46 +-------
 arch/s390/mm/init.c              |  35 ++++++
 arch/sparc/kernel/module.c       |  34 +-----
 arch/sparc/mm/Makefile           |   2 +
 arch/sparc/mm/jitalloc.c         |  21 ++++
 arch/sparc/net/bpf_jit_comp_32.c |   8 +-
 arch/x86/Kconfig                 |   2 +
 arch/x86/kernel/alternative.c    |  43 ++++---
 arch/x86/kernel/ftrace.c         |  59 +++++-----
 arch/x86/kernel/kprobes/core.c   |   4 +-
 arch/x86/kernel/module.c         |  75 +------------
 arch/x86/kernel/static_call.c    |  10 +-
 arch/x86/kernel/unwind_orc.c     |  13 ++-
 arch/x86/mm/init.c               |  52 +++++++++
 arch/x86/net/bpf_jit_comp.c      |  22 +++-
 include/linux/ftrace.h           |   2 +
 include/linux/jitalloc.h         |  69 ++++++++++++
 include/linux/moduleloader.h     |  15 ---
 kernel/bpf/core.c                |  14 +--
 kernel/kprobes.c                 |  51 +++++----
 kernel/module/Kconfig            |   1 +
 kernel/module/main.c             |  56 ++++------
 kernel/trace/ftrace.c            |  13 ++-
 kernel/trace/trace_kprobe.c      |  11 ++
 mm/Kconfig                       |   3 +
 mm/Makefile                      |   1 +
 mm/jitalloc.c                    | 185 +++++++++++++++++++++++++++++++
 mm/mm_init.c                     |   2 +
 48 files changed, 777 insertions(+), 462 deletions(-)
 create mode 100644 arch/sparc/mm/jitalloc.c
 create mode 100644 include/linux/jitalloc.h
 create mode 100644 mm/jitalloc.c


base-commit: 44c026a73be8038f03dbdeef028b642880cf1511
-- 
2.35.1



* [PATCH 01/13] nios2: define virtual address space for modules
  2023-06-01 10:12 [PATCH 00/13] mm: jit/text allocator Mike Rapoport
@ 2023-06-01 10:12 ` Mike Rapoport
  2023-06-13 22:16   ` Dinh Nguyen
  2023-06-01 10:12 ` [PATCH 02/13] mm: introduce jit_text_alloc() and use it instead of module_alloc() Mike Rapoport
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 55+ messages in thread
From: Mike Rapoport @ 2023-06-01 10:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy, David S. Miller,
	Dinh Nguyen, Heiko Carstens, Helge Deller, Huacai Chen,
	Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Mike Rapoport, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

nios2 uses kmalloc() to implement module_alloc() because CALL26/PCREL26
relocations cannot reach all of the vmalloc address space.

Define module space as 32MiB below the kernel base and switch nios2 to
use vmalloc for module allocations.
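
The resulting layout can be checked with a small user-space sketch. The base
address below is an assumption (0xc0000000, the value mentioned in the comment
this patch removes; the real one comes from
CONFIG_NIOS2_KERNEL_REGION_BASE), and the DEMO_* names and demo_* helpers are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define SZ_32M			0x02000000UL

/* Assumed kernel region base; the kernel takes this from Kconfig. */
#define DEMO_KERNEL_REGION_BASE	0xc0000000UL

/* Same arithmetic as the new pgtable.h definitions in this patch. */
#define DEMO_VMALLOC_END	(DEMO_KERNEL_REGION_BASE - SZ_32M - 1)
#define DEMO_MODULES_VADDR	(DEMO_KERNEL_REGION_BASE - SZ_32M)
#define DEMO_MODULES_END	(DEMO_KERNEL_REGION_BASE - 1)

/* True if addr lies in the 32MiB window carved out for modules. */
bool demo_in_module_space(unsigned long addr)
{
	return addr >= DEMO_MODULES_VADDR && addr <= DEMO_MODULES_END;
}

/* Size of the module window in bytes (both bounds are inclusive). */
unsigned long demo_module_space_size(void)
{
	return DEMO_MODULES_END - DEMO_MODULES_VADDR + 1;
}
```

With the assumed base, the module window is 0xbe000000..0xbfffffff and the
vmalloc area now ends one byte below it, so the two ranges do not overlap.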

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 arch/nios2/include/asm/pgtable.h |  5 ++++-
 arch/nios2/kernel/module.c       | 19 ++++---------------
 2 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
index 0f5c2564e9f5..0073b289c6a4 100644
--- a/arch/nios2/include/asm/pgtable.h
+++ b/arch/nios2/include/asm/pgtable.h
@@ -25,7 +25,10 @@
 #include <asm-generic/pgtable-nopmd.h>
 
 #define VMALLOC_START		CONFIG_NIOS2_KERNEL_MMU_REGION_BASE
-#define VMALLOC_END		(CONFIG_NIOS2_KERNEL_REGION_BASE - 1)
+#define VMALLOC_END		(CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M - 1)
+
+#define MODULES_VADDR		(CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M)
+#define MODULES_END		(CONFIG_NIOS2_KERNEL_REGION_BASE - 1)
 
 struct mm_struct;
 
diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
index 76e0a42d6e36..9c97b7513853 100644
--- a/arch/nios2/kernel/module.c
+++ b/arch/nios2/kernel/module.c
@@ -21,23 +21,12 @@
 
 #include <asm/cacheflush.h>
 
-/*
- * Modules should NOT be allocated with kmalloc for (obvious) reasons.
- * But we do it for now to avoid relocation issues. CALL26/PCREL26 cannot reach
- * from 0x80000000 (vmalloc area) to 0xc00000000 (kernel) (kmalloc returns
- * addresses in 0xc0000000)
- */
 void *module_alloc(unsigned long size)
 {
-	if (size == 0)
-		return NULL;
-	return kmalloc(size, GFP_KERNEL);
-}
-
-/* Free memory returned from module_alloc */
-void module_memfree(void *module_region)
-{
-	kfree(module_region);
+	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
+				    GFP_KERNEL, PAGE_KERNEL_EXEC,
+				    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
+				    __builtin_return_address(0));
 }
 
 int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab,
-- 
2.35.1



* [PATCH 02/13] mm: introduce jit_text_alloc() and use it instead of module_alloc()
  2023-06-01 10:12 [PATCH 00/13] mm: jit/text allocator Mike Rapoport
  2023-06-01 10:12 ` [PATCH 01/13] nios2: define virtual address space for modules Mike Rapoport
@ 2023-06-01 10:12 ` Mike Rapoport
  2023-06-01 10:12 ` [PATCH 03/13] mm/jitalloc, arch: convert simple overrides of module_alloc to jitalloc Mike Rapoport
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 55+ messages in thread
From: Mike Rapoport @ 2023-06-01 10:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy, David S. Miller,
	Dinh Nguyen, Heiko Carstens, Helge Deller, Huacai Chen,
	Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Mike Rapoport, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

module_alloc() is used everywhere as a means of allocating memory for code.

Besides being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF, to modules
and puts the burden of code allocation on the modules code.

Several architectures override module_alloc() because of various
constraints on where executable memory can be located, and this creates
additional obstacles to improving code allocation.

Start splitting code allocation from modules by introducing
jit_text_alloc() and jit_free() APIs.

Start with making jit_text_alloc() a wrapper for module_alloc() and
jit_free() a replacement of module_memfree() to allow updating all call
sites to use the new APIs.

The name jit_text_alloc() emphasizes that the allocated memory is for
executable code; allocations of the associated data, such as the data
sections of a module, will use the jit_data_alloc() interface that will be
added later.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 arch/powerpc/kernel/kprobes.c    |  4 ++--
 arch/s390/kernel/ftrace.c        |  4 ++--
 arch/s390/kernel/kprobes.c       |  4 ++--
 arch/s390/kernel/module.c        |  5 +++--
 arch/sparc/net/bpf_jit_comp_32.c |  8 ++++----
 arch/x86/kernel/ftrace.c         |  6 +++---
 arch/x86/kernel/kprobes/core.c   |  4 ++--
 include/linux/jitalloc.h         | 10 ++++++++++
 include/linux/moduleloader.h     |  3 ---
 kernel/bpf/core.c                | 14 +++++++-------
 kernel/kprobes.c                 |  8 ++++----
 kernel/module/Kconfig            |  1 +
 kernel/module/main.c             | 23 +++++++----------------
 mm/Kconfig                       |  3 +++
 mm/Makefile                      |  1 +
 mm/jitalloc.c                    | 20 ++++++++++++++++++++
 16 files changed, 71 insertions(+), 47 deletions(-)
 create mode 100644 include/linux/jitalloc.h
 create mode 100644 mm/jitalloc.c

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index b20ee72e873a..e5835b148ec4 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -19,8 +19,8 @@
 #include <linux/extable.h>
 #include <linux/kdebug.h>
 #include <linux/slab.h>
-#include <linux/moduleloader.h>
 #include <linux/set_memory.h>
+#include <linux/jitalloc.h>
 #include <asm/code-patching.h>
 #include <asm/cacheflush.h>
 #include <asm/sstep.h>
@@ -130,7 +130,7 @@ void *alloc_insn_page(void)
 {
 	void *page;
 
-	page = module_alloc(PAGE_SIZE);
+	page = jit_text_alloc(PAGE_SIZE);
 	if (!page)
 		return NULL;
 
diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
index c46381ea04ec..6e50a88b9b5d 100644
--- a/arch/s390/kernel/ftrace.c
+++ b/arch/s390/kernel/ftrace.c
@@ -7,13 +7,13 @@
  *   Author(s): Martin Schwidefsky <schwidefsky@de.ibm.com>
  */
 
-#include <linux/moduleloader.h>
 #include <linux/hardirq.h>
 #include <linux/uaccess.h>
 #include <linux/ftrace.h>
 #include <linux/kernel.h>
 #include <linux/types.h>
 #include <linux/kprobes.h>
+#include <linux/jitalloc.h>
 #include <trace/syscall.h>
 #include <asm/asm-offsets.h>
 #include <asm/text-patching.h>
@@ -220,7 +220,7 @@ static int __init ftrace_plt_init(void)
 {
 	const char *start, *end;
 
-	ftrace_plt = module_alloc(PAGE_SIZE);
+	ftrace_plt = jit_text_alloc(PAGE_SIZE);
 	if (!ftrace_plt)
 		panic("cannot allocate ftrace plt\n");
 
diff --git a/arch/s390/kernel/kprobes.c b/arch/s390/kernel/kprobes.c
index d4b863ed0aa7..3804945f212f 100644
--- a/arch/s390/kernel/kprobes.c
+++ b/arch/s390/kernel/kprobes.c
@@ -9,7 +9,6 @@
 
 #define pr_fmt(fmt) "kprobes: " fmt
 
-#include <linux/moduleloader.h>
 #include <linux/kprobes.h>
 #include <linux/ptrace.h>
 #include <linux/preempt.h>
@@ -21,6 +20,7 @@
 #include <linux/slab.h>
 #include <linux/hardirq.h>
 #include <linux/ftrace.h>
+#include <linux/jitalloc.h>
 #include <asm/set_memory.h>
 #include <asm/sections.h>
 #include <asm/dis.h>
@@ -38,7 +38,7 @@ void *alloc_insn_page(void)
 {
 	void *page;
 
-	page = module_alloc(PAGE_SIZE);
+	page = jit_text_alloc(PAGE_SIZE);
 	if (!page)
 		return NULL;
 	set_memory_rox((unsigned long)page, 1);
diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
index f1b35dcdf3eb..d4844cfe3d7e 100644
--- a/arch/s390/kernel/module.c
+++ b/arch/s390/kernel/module.c
@@ -21,6 +21,7 @@
 #include <linux/moduleloader.h>
 #include <linux/bug.h>
 #include <linux/memory.h>
+#include <linux/jitalloc.h>
 #include <asm/alternative.h>
 #include <asm/nospec-branch.h>
 #include <asm/facility.h>
@@ -76,7 +77,7 @@ void *module_alloc(unsigned long size)
 #ifdef CONFIG_FUNCTION_TRACER
 void module_arch_cleanup(struct module *mod)
 {
-	module_memfree(mod->arch.trampolines_start);
+	jit_free(mod->arch.trampolines_start);
 }
 #endif
 
@@ -509,7 +510,7 @@ static int module_alloc_ftrace_hotpatch_trampolines(struct module *me,
 
 	size = FTRACE_HOTPATCH_TRAMPOLINES_SIZE(s->sh_size);
 	numpages = DIV_ROUND_UP(size, PAGE_SIZE);
-	start = module_alloc(numpages * PAGE_SIZE);
+	start = jit_text_alloc(numpages * PAGE_SIZE);
 	if (!start)
 		return -ENOMEM;
 	set_memory_rox((unsigned long)start, numpages);
diff --git a/arch/sparc/net/bpf_jit_comp_32.c b/arch/sparc/net/bpf_jit_comp_32.c
index a74e5004c6c8..068be1097d1a 100644
--- a/arch/sparc/net/bpf_jit_comp_32.c
+++ b/arch/sparc/net/bpf_jit_comp_32.c
@@ -1,10 +1,10 @@
 // SPDX-License-Identifier: GPL-2.0
-#include <linux/moduleloader.h>
 #include <linux/workqueue.h>
 #include <linux/netdevice.h>
 #include <linux/filter.h>
 #include <linux/cache.h>
 #include <linux/if_vlan.h>
+#include <linux/jitalloc.h>
 
 #include <asm/cacheflush.h>
 #include <asm/ptrace.h>
@@ -713,7 +713,7 @@ cond_branch:			f_offset = addrs[i + filter[i].jf];
 				if (unlikely(proglen + ilen > oldproglen)) {
 					pr_err("bpb_jit_compile fatal error\n");
 					kfree(addrs);
-					module_memfree(image);
+					jit_free(image);
 					return;
 				}
 				memcpy(image + proglen, temp, ilen);
@@ -736,7 +736,7 @@ cond_branch:			f_offset = addrs[i + filter[i].jf];
 			break;
 		}
 		if (proglen == oldproglen) {
-			image = module_alloc(proglen);
+			image = jit_text_alloc(proglen);
 			if (!image)
 				goto out;
 		}
@@ -758,7 +758,7 @@ cond_branch:			f_offset = addrs[i + filter[i].jf];
 void bpf_jit_free(struct bpf_prog *fp)
 {
 	if (fp->jited)
-		module_memfree(fp->bpf_func);
+		jit_free(fp->bpf_func);
 
 	bpf_prog_unlock_free(fp);
 }
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 5e7ead52cfdb..157c8a799704 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -25,6 +25,7 @@
 #include <linux/memory.h>
 #include <linux/vmalloc.h>
 #include <linux/set_memory.h>
+#include <linux/jitalloc.h>
 
 #include <trace/syscall.h>
 
@@ -261,15 +262,14 @@ void arch_ftrace_update_code(int command)
 #ifdef CONFIG_X86_64
 
 #ifdef CONFIG_MODULES
-#include <linux/moduleloader.h>
 /* Module allocation simplifies allocating memory for code */
 static inline void *alloc_tramp(unsigned long size)
 {
-	return module_alloc(size);
+	return jit_text_alloc(size);
 }
 static inline void tramp_free(void *tramp)
 {
-	module_memfree(tramp);
+	jit_free(tramp);
 }
 #else
 /* Trampolines can only be created if modules are supported */
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index f7f6042eb7e6..48bbf97de5a0 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -40,11 +40,11 @@
 #include <linux/kgdb.h>
 #include <linux/ftrace.h>
 #include <linux/kasan.h>
-#include <linux/moduleloader.h>
 #include <linux/objtool.h>
 #include <linux/vmalloc.h>
 #include <linux/pgtable.h>
 #include <linux/set_memory.h>
+#include <linux/jitalloc.h>
 
 #include <asm/text-patching.h>
 #include <asm/cacheflush.h>
@@ -414,7 +414,7 @@ void *alloc_insn_page(void)
 {
 	void *page;
 
-	page = module_alloc(PAGE_SIZE);
+	page = jit_text_alloc(PAGE_SIZE);
 	if (!page)
 		return NULL;
 
diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h
new file mode 100644
index 000000000000..9517e64e474d
--- /dev/null
+++ b/include/linux/jitalloc.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_JITALLOC_H
+#define _LINUX_JITALLOC_H
+
+#include <linux/types.h>
+
+void jit_free(void *buf);
+void *jit_text_alloc(size_t len);
+
+#endif /* _LINUX_JITALLOC_H */
diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
index 03be088fb439..b3374342f7af 100644
--- a/include/linux/moduleloader.h
+++ b/include/linux/moduleloader.h
@@ -29,9 +29,6 @@ unsigned int arch_mod_section_prepend(struct module *mod, unsigned int section);
    sections.  Returns NULL on failure. */
 void *module_alloc(unsigned long size);
 
-/* Free memory returned from module_alloc. */
-void module_memfree(void *module_region);
-
 /* Determines if the section name is an init section (that is only used during
  * module loading).
  */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 7421487422d4..bf954d2721c1 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -22,7 +22,6 @@
 #include <linux/skbuff.h>
 #include <linux/vmalloc.h>
 #include <linux/random.h>
-#include <linux/moduleloader.h>
 #include <linux/bpf.h>
 #include <linux/btf.h>
 #include <linux/objtool.h>
@@ -37,6 +36,7 @@
 #include <linux/nospec.h>
 #include <linux/bpf_mem_alloc.h>
 #include <linux/memcontrol.h>
+#include <linux/jitalloc.h>
 
 #include <asm/barrier.h>
 #include <asm/unaligned.h>
@@ -860,7 +860,7 @@ static struct bpf_prog_pack *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins
 		       GFP_KERNEL);
 	if (!pack)
 		return NULL;
-	pack->ptr = module_alloc(BPF_PROG_PACK_SIZE);
+	pack->ptr = jit_text_alloc(BPF_PROG_PACK_SIZE);
 	if (!pack->ptr) {
 		kfree(pack);
 		return NULL;
@@ -884,7 +884,7 @@ void *bpf_prog_pack_alloc(u32 size, bpf_jit_fill_hole_t bpf_fill_ill_insns)
 	mutex_lock(&pack_mutex);
 	if (size > BPF_PROG_PACK_SIZE) {
 		size = round_up(size, PAGE_SIZE);
-		ptr = module_alloc(size);
+		ptr = jit_text_alloc(size);
 		if (ptr) {
 			bpf_fill_ill_insns(ptr, size);
 			set_vm_flush_reset_perms(ptr);
@@ -922,7 +922,7 @@ void bpf_prog_pack_free(struct bpf_binary_header *hdr)
 
 	mutex_lock(&pack_mutex);
 	if (hdr->size > BPF_PROG_PACK_SIZE) {
-		module_memfree(hdr);
+		jit_free(hdr);
 		goto out;
 	}
 
@@ -946,7 +946,7 @@ void bpf_prog_pack_free(struct bpf_binary_header *hdr)
 	if (bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
 				       BPF_PROG_CHUNK_COUNT, 0) == 0) {
 		list_del(&pack->list);
-		module_memfree(pack->ptr);
+		jit_free(pack->ptr);
 		kfree(pack);
 	}
 out:
@@ -997,12 +997,12 @@ void bpf_jit_uncharge_modmem(u32 size)
 
 void *__weak bpf_jit_alloc_exec(unsigned long size)
 {
-	return module_alloc(size);
+	return jit_text_alloc(size);
 }
 
 void __weak bpf_jit_free_exec(void *addr)
 {
-	module_memfree(addr);
+	jit_free(addr);
 }
 
 struct bpf_binary_header *
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 00e177de91cc..3caf3561c048 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -26,7 +26,6 @@
 #include <linux/slab.h>
 #include <linux/stddef.h>
 #include <linux/export.h>
-#include <linux/moduleloader.h>
 #include <linux/kallsyms.h>
 #include <linux/freezer.h>
 #include <linux/seq_file.h>
@@ -39,6 +38,7 @@
 #include <linux/jump_label.h>
 #include <linux/static_call.h>
 #include <linux/perf_event.h>
+#include <linux/jitalloc.h>
 
 #include <asm/sections.h>
 #include <asm/cacheflush.h>
@@ -113,17 +113,17 @@ enum kprobe_slot_state {
 void __weak *alloc_insn_page(void)
 {
 	/*
-	 * Use module_alloc() so this page is within +/- 2GB of where the
+	 * Use jit_text_alloc() so this page is within +/- 2GB of where the
 	 * kernel image and loaded module images reside. This is required
 	 * for most of the architectures.
 	 * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
 	 */
-	return module_alloc(PAGE_SIZE);
+	return jit_text_alloc(PAGE_SIZE);
 }
 
 static void free_insn_page(void *page)
 {
-	module_memfree(page);
+	jit_free(page);
 }
 
 struct kprobe_insn_cache kprobe_insn_slots = {
diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig
index 33a2e991f608..a228b6aafc8f 100644
--- a/kernel/module/Kconfig
+++ b/kernel/module/Kconfig
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 menuconfig MODULES
 	bool "Enable loadable module support"
+	select JIT_ALLOC
 	modules
 	help
 	  Kernel modules are small pieces of compiled code which can
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 044aa2c9e3cb..51278c571bcb 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -57,6 +57,7 @@
 #include <linux/audit.h>
 #include <linux/cfi.h>
 #include <linux/debugfs.h>
+#include <linux/jitalloc.h>
 #include <uapi/linux/module.h>
 #include "internal.h"
 
@@ -1186,16 +1187,6 @@ resolve_symbol_wait(struct module *mod,
 	return ksym;
 }
 
-void __weak module_memfree(void *module_region)
-{
-	/*
-	 * This memory may be RO, and freeing RO memory in an interrupt is not
-	 * supported by vmalloc.
-	 */
-	WARN_ON(in_interrupt());
-	vfree(module_region);
-}
-
 void __weak module_arch_cleanup(struct module *mod)
 {
 }
@@ -1214,7 +1205,7 @@ static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
 {
 	if (mod_mem_use_vmalloc(type))
 		return vzalloc(size);
-	return module_alloc(size);
+	return jit_text_alloc(size);
 }
 
 static void module_memory_free(void *ptr, enum mod_mem_type type)
@@ -1222,7 +1213,7 @@ static void module_memory_free(void *ptr, enum mod_mem_type type)
 	if (mod_mem_use_vmalloc(type))
 		vfree(ptr);
 	else
-		module_memfree(ptr);
+		jit_free(ptr);
 }
 
 static void free_mod_mem(struct module *mod)
@@ -2478,9 +2469,9 @@ static void do_free_init(struct work_struct *w)
 
 	llist_for_each_safe(pos, n, list) {
 		initfree = container_of(pos, struct mod_initfree, node);
-		module_memfree(initfree->init_text);
-		module_memfree(initfree->init_data);
-		module_memfree(initfree->init_rodata);
+		jit_free(initfree->init_text);
+		jit_free(initfree->init_data);
+		jit_free(initfree->init_rodata);
 		kfree(initfree);
 	}
 }
@@ -2583,7 +2574,7 @@ static noinline int do_init_module(struct module *mod)
 	 * We want to free module_init, but be aware that kallsyms may be
 	 * walking this with preempt disabled.  In all the failure paths, we
 	 * call synchronize_rcu(), but we don't want to slow down the success
-	 * path. module_memfree() cannot be called in an interrupt, so do the
+	 * path. jit_free() cannot be called in an interrupt, so do the
 	 * work and call synchronize_rcu() in a work queue.
 	 *
 	 * Note that module_alloc() on most architectures creates W+X page
diff --git a/mm/Kconfig b/mm/Kconfig
index 7672a22647b4..2dea61dade13 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1206,6 +1206,9 @@ config PER_VMA_LOCK
 	  This feature allows locking each virtual memory area separately when
 	  handling page faults instead of taking mmap_lock.
 
+config JIT_ALLOC
+       bool
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index e29afc890cde..18d45cd60a11 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -137,3 +137,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
 obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
+obj-$(CONFIG_JIT_ALLOC) +=  jitalloc.o
diff --git a/mm/jitalloc.c b/mm/jitalloc.c
new file mode 100644
index 000000000000..f15262202a1a
--- /dev/null
+++ b/mm/jitalloc.c
@@ -0,0 +1,20 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/moduleloader.h>
+#include <linux/vmalloc.h>
+#include <linux/jitalloc.h>
+
+void jit_free(void *buf)
+{
+	/*
+	 * This memory may be RO, and freeing RO memory in an interrupt is not
+	 * supported by vmalloc.
+	 */
+	WARN_ON(in_interrupt());
+	vfree(buf);
+}
+
+void *jit_text_alloc(size_t len)
+{
+	return module_alloc(len);
+}
-- 
2.35.1



* [PATCH 03/13] mm/jitalloc, arch: convert simple overrides of module_alloc to jitalloc
  2023-06-01 10:12 [PATCH 00/13] mm: jit/text allocator Mike Rapoport
  2023-06-01 10:12 ` [PATCH 01/13] nios2: define virtual address space for modules Mike Rapoport
  2023-06-01 10:12 ` [PATCH 02/13] mm: introduce jit_text_alloc() and use it instead of module_alloc() Mike Rapoport
@ 2023-06-01 10:12 ` Mike Rapoport
  2023-06-01 10:12 ` [PATCH 04/13] mm/jitalloc, arch: convert remaining " Mike Rapoport
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 55+ messages in thread
From: Mike Rapoport @ 2023-06-01 10:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy, David S. Miller,
	Dinh Nguyen, Heiko Carstens, Helge Deller, Huacai Chen,
	Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Mike Rapoport, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

Several architectures override module_alloc() only to define an address
range for code allocations that differs from the VMALLOC address space.

Provide a generic implementation in jitalloc that uses the parameters
for address space ranges, required alignment and page protections
provided by architectures.

The architectures must fill in a jit_alloc_params structure and implement
jit_alloc_arch_params(), which returns a pointer to that structure. This
way the jitalloc initialization won't be called from every architecture,
but rather from a central place, namely initialization of the core
memory management.
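
As a rough illustration of "initialize once centrally, then allocate within
the architecture's range", here is a toy user-space model. It is a plain bump
allocator over an address range, not the vmalloc-based implementation in
mm/jitalloc.c; the struct layout is simplified, the demo_* names are
hypothetical, and page protections are carried in the parameters but not
applied.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the structures used by the patches. */
struct jit_address_space {
	unsigned long pgprot;
	unsigned long start;
	unsigned long end;
};

struct jit_alloc_params {
	struct jit_address_space text;
	unsigned int alignment;
};

/* State owned by the (hypothetical) central allocator. */
static const struct jit_alloc_params *demo_params;
static unsigned long demo_next;	/* next free address in the toy range */

/* Called once from core init with whatever the architecture hook returned. */
void demo_jit_alloc_init(const struct jit_alloc_params *arch)
{
	assert(arch && arch->text.start < arch->text.end);
	/* The rounding below assumes a power-of-two alignment. */
	assert(arch->alignment && !(arch->alignment & (arch->alignment - 1)));

	demo_params = arch;
	demo_next = arch->text.start;
}

/* Returns an aligned address inside [start, end), or 0 on failure. */
unsigned long demo_jit_text_alloc(size_t len)
{
	unsigned long align = demo_params->alignment;
	unsigned long end = demo_params->text.end;
	unsigned long addr = (demo_next + align - 1) & ~(align - 1);

	/* Reject empty requests, wrap-around, and anything past the range. */
	if (!len || addr < demo_next || addr > end || len > end - addr)
		return 0;

	demo_next = addr + len;
	return addr;
}
```

The point of the model is only that callers never see the range or the
alignment: they are fixed at init time by the architecture's parameters.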

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 arch/loongarch/kernel/module.c | 14 ++++++++--
 arch/mips/kernel/module.c      | 16 ++++++++---
 arch/nios2/kernel/module.c     | 15 ++++++----
 arch/parisc/kernel/module.c    | 18 ++++++++----
 arch/riscv/kernel/module.c     | 16 +++++++----
 arch/sparc/kernel/module.c     | 39 +++++++++++---------------
 include/linux/jitalloc.h       | 31 +++++++++++++++++++++
 mm/jitalloc.c                  | 51 ++++++++++++++++++++++++++++++++++
 mm/mm_init.c                   |  2 ++
 9 files changed, 156 insertions(+), 46 deletions(-)

diff --git a/arch/loongarch/kernel/module.c b/arch/loongarch/kernel/module.c
index b8b86088b2dd..1d5e00874ae7 100644
--- a/arch/loongarch/kernel/module.c
+++ b/arch/loongarch/kernel/module.c
@@ -18,6 +18,7 @@
 #include <linux/ftrace.h>
 #include <linux/string.h>
 #include <linux/kernel.h>
+#include <linux/jitalloc.h>
 #include <asm/alternative.h>
 #include <asm/inst.h>
 
@@ -469,10 +470,17 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
 	return 0;
 }
 
-void *module_alloc(unsigned long size)
+static struct jit_alloc_params jit_alloc_params = {
+	.alignment	= 1,
+	.text.pgprot	= PAGE_KERNEL,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
 {
-	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-			GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, __builtin_return_address(0));
+	jit_alloc_params.text.start = MODULES_VADDR;
+	jit_alloc_params.text.end = MODULES_END;
+
+	return &jit_alloc_params;
 }
 
 static void module_init_ftrace_plt(const Elf_Ehdr *hdr,
diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c
index 0c936cbf20c5..f762c697ab9c 100644
--- a/arch/mips/kernel/module.c
+++ b/arch/mips/kernel/module.c
@@ -20,6 +20,7 @@
 #include <linux/kernel.h>
 #include <linux/spinlock.h>
 #include <linux/jump_label.h>
+#include <linux/jitalloc.h>
 
 extern void jump_label_apply_nops(struct module *mod);
 
@@ -33,11 +34,18 @@ static LIST_HEAD(dbe_list);
 static DEFINE_SPINLOCK(dbe_lock);
 
 #ifdef MODULE_START
-void *module_alloc(unsigned long size)
+
+static struct jit_alloc_params jit_alloc_params = {
+	.alignment	= 1,
+	.text.start	= MODULE_START,
+	.text.end	= MODULE_END,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
 {
-	return __vmalloc_node_range(size, 1, MODULE_START, MODULE_END,
-				GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
-				__builtin_return_address(0));
+	jit_alloc_params.text.pgprot = PAGE_KERNEL;
+
+	return &jit_alloc_params;
 }
 #endif
 
diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
index 9c97b7513853..b41d52775ec2 100644
--- a/arch/nios2/kernel/module.c
+++ b/arch/nios2/kernel/module.c
@@ -18,15 +18,20 @@
 #include <linux/fs.h>
 #include <linux/string.h>
 #include <linux/kernel.h>
+#include <linux/jitalloc.h>
 
 #include <asm/cacheflush.h>
 
-void *module_alloc(unsigned long size)
+static struct jit_alloc_params jit_alloc_params = {
+	.alignment	= 1,
+	.text.pgprot	= PAGE_KERNEL_EXEC,
+	.text.start	= MODULES_VADDR,
+	.text.end	= MODULES_END,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
 {
-	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-				    GFP_KERNEL, PAGE_KERNEL_EXEC,
-				    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
-				    __builtin_return_address(0));
+	return &jit_alloc_params;
 }
 
 int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab,
diff --git a/arch/parisc/kernel/module.c b/arch/parisc/kernel/module.c
index f6e38c4d3904..49fdf741fd24 100644
--- a/arch/parisc/kernel/module.c
+++ b/arch/parisc/kernel/module.c
@@ -49,6 +49,7 @@
 #include <linux/bug.h>
 #include <linux/mm.h>
 #include <linux/slab.h>
+#include <linux/jitalloc.h>
 
 #include <asm/unwind.h>
 #include <asm/sections.h>
@@ -173,15 +174,20 @@ static inline int reassemble_22(int as22)
 		((as22 & 0x0003ff) << 3));
 }
 
-void *module_alloc(unsigned long size)
-{
+static struct jit_alloc_params jit_alloc_params = {
+	.alignment	= 1,
 	/* using RWX means less protection for modules, but it's
 	 * easier than trying to map the text, data, init_text and
 	 * init_data correctly */
-	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
-				    GFP_KERNEL,
-				    PAGE_KERNEL_RWX, 0, NUMA_NO_NODE,
-				    __builtin_return_address(0));
+	.text.pgprot	= PAGE_KERNEL_RWX,
+	.text.end	= VMALLOC_END,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
+{
+	jit_alloc_params.text.start = VMALLOC_START;
+
+	return &jit_alloc_params;
 }
 
 #ifndef CONFIG_64BIT
diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
index 7c651d55fcbd..731255654c94 100644
--- a/arch/riscv/kernel/module.c
+++ b/arch/riscv/kernel/module.c
@@ -11,6 +11,7 @@
 #include <linux/vmalloc.h>
 #include <linux/sizes.h>
 #include <linux/pgtable.h>
+#include <linux/jitalloc.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
 
@@ -436,12 +437,17 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
 }
 
 #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
-void *module_alloc(unsigned long size)
+static struct jit_alloc_params jit_alloc_params = {
+	.alignment	= 1,
+	.text.pgprot	= PAGE_KERNEL,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
 {
-	return __vmalloc_node_range(size, 1, MODULES_VADDR,
-				    MODULES_END, GFP_KERNEL,
-				    PAGE_KERNEL, 0, NUMA_NO_NODE,
-				    __builtin_return_address(0));
+	jit_alloc_params.text.start = MODULES_VADDR;
+	jit_alloc_params.text.end = MODULES_END;
+
+	return &jit_alloc_params;
 }
 #endif
 
diff --git a/arch/sparc/kernel/module.c b/arch/sparc/kernel/module.c
index 66c45a2764bc..03f0de693b4d 100644
--- a/arch/sparc/kernel/module.c
+++ b/arch/sparc/kernel/module.c
@@ -14,6 +14,11 @@
 #include <linux/string.h>
 #include <linux/ctype.h>
 #include <linux/mm.h>
+#include <linux/jitalloc.h>
+
+#ifdef CONFIG_SPARC64
+#include <linux/jump_label.h>
+#endif
 
 #include <asm/processor.h>
 #include <asm/spitfire.h>
@@ -21,34 +26,22 @@
 
 #include "entry.h"
 
+static struct jit_alloc_params jit_alloc_params = {
+	.alignment	= 1,
 #ifdef CONFIG_SPARC64
-
-#include <linux/jump_label.h>
-
-static void *module_map(unsigned long size)
-{
-	if (PAGE_ALIGN(size) > MODULES_LEN)
-		return NULL;
-	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-				GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
-				__builtin_return_address(0));
-}
+	.text.start	= MODULES_VADDR,
+	.text.end	= MODULES_END,
 #else
-static void *module_map(unsigned long size)
-{
-	return vmalloc(size);
-}
-#endif /* CONFIG_SPARC64 */
+	.text.start	= VMALLOC_START,
+	.text.end	= VMALLOC_END,
+#endif
+};
 
-void *module_alloc(unsigned long size)
+struct jit_alloc_params *jit_alloc_arch_params(void)
 {
-	void *ret;
-
-	ret = module_map(size);
-	if (ret)
-		memset(ret, 0, size);
+	jit_alloc_params.text.pgprot = PAGE_KERNEL;
 
-	return ret;
+	return &jit_alloc_params;
 }
 
 /* Make generic code ignore STT_REGISTER dummy undefined symbols.  */
diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h
index 9517e64e474d..34fddef23dea 100644
--- a/include/linux/jitalloc.h
+++ b/include/linux/jitalloc.h
@@ -4,7 +4,38 @@
 
 #include <linux/types.h>
 
+/**
+ * struct jit_address_space -	address space definition for code and
+ *				related data allocations
+ * @pgprot:	permissions for memory in this address space
+ * @start:	address space start
+ * @end:	address space end (inclusive)
+ */
+struct jit_address_space {
+	pgprot_t        pgprot;
+	unsigned long   start;
+	unsigned long   end;
+};
+
+/**
+ * struct jit_alloc_params -	architecture parameters for code allocations
+ * @text:	address space range for text allocations
+ * @alignment:	alignment required for text allocations
+ */
+struct jit_alloc_params {
+	struct jit_address_space	text;
+	unsigned int			alignment;
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void);
+
 void jit_free(void *buf);
 void *jit_text_alloc(size_t len);
 
+#ifdef CONFIG_JIT_ALLOC
+void jit_alloc_init(void);
+#else
+static inline void jit_alloc_init(void) {}
+#endif
+
 #endif /* _LINUX_JITALLOC_H */
diff --git a/mm/jitalloc.c b/mm/jitalloc.c
index f15262202a1a..3e63eeb8bf4b 100644
--- a/mm/jitalloc.c
+++ b/mm/jitalloc.c
@@ -2,8 +2,22 @@
 
 #include <linux/moduleloader.h>
 #include <linux/vmalloc.h>
+#include <linux/mm.h>
 #include <linux/jitalloc.h>
 
+static struct jit_alloc_params jit_alloc_params;
+
+static void *jit_alloc(size_t len, unsigned int alignment, pgprot_t pgprot,
+		       unsigned long start, unsigned long end)
+{
+	if (PAGE_ALIGN(len) > (end - start))
+		return NULL;
+
+	return __vmalloc_node_range(len, alignment, start, end, GFP_KERNEL,
+				    pgprot, VM_FLUSH_RESET_PERMS,
+				    NUMA_NO_NODE, __builtin_return_address(0));
+}
+
 void jit_free(void *buf)
 {
 	/*
@@ -16,5 +30,42 @@ void jit_free(void *buf)
 
 void *jit_text_alloc(size_t len)
 {
+	if (jit_alloc_params.text.start) {
+		unsigned int align = jit_alloc_params.alignment;
+		pgprot_t pgprot = jit_alloc_params.text.pgprot;
+		unsigned long start = jit_alloc_params.text.start;
+		unsigned long end = jit_alloc_params.text.end;
+
+		return jit_alloc(len, align, pgprot, start, end);
+	}
+
 	return module_alloc(len);
 }
+
+struct jit_alloc_params * __weak jit_alloc_arch_params(void)
+{
+	return NULL;
+}
+
+static bool jit_alloc_validate_params(struct jit_alloc_params *p)
+{
+	if (!p->alignment || !p->text.start || !p->text.end ||
+	    !pgprot_val(p->text.pgprot)) {
+		pr_crit("Invalid parameters for jit allocator, module loading will fail\n");
+		return false;
+	}
+
+	return true;
+}
+
+void jit_alloc_init(void)
+{
+	struct jit_alloc_params *p = jit_alloc_arch_params();
+
+	if (p) {
+		if (!jit_alloc_validate_params(p))
+			return;
+
+		jit_alloc_params = *p;
+	}
+}
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7f7f9c677854..5f50e75bbc5f 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -26,6 +26,7 @@
 #include <linux/pgtable.h>
 #include <linux/swap.h>
 #include <linux/cma.h>
+#include <linux/jitalloc.h>
 #include "internal.h"
 #include "slab.h"
 #include "shuffle.h"
@@ -2747,4 +2748,5 @@ void __init mm_core_init(void)
 	pti_init();
 	kmsan_init_runtime();
 	mm_cache_init();
+	jit_alloc_init();
 }
-- 
2.35.1



* [PATCH 04/13] mm/jitalloc, arch: convert remaining overrides of module_alloc to jitalloc
  2023-06-01 10:12 [PATCH 00/13] mm: jit/text allocator Mike Rapoport
                   ` (2 preceding siblings ...)
  2023-06-01 10:12 ` [PATCH 03/13] mm/jitalloc, arch: convert simple overrides of module_alloc to jitalloc Mike Rapoport
@ 2023-06-01 10:12 ` Mike Rapoport
  2023-06-01 22:35   ` Song Liu
  2023-06-01 10:12 ` [PATCH 05/13] module, jitalloc: drop module_alloc Mike Rapoport
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 55+ messages in thread
From: Mike Rapoport @ 2023-06-01 10:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy, David S. Miller,
	Dinh Nguyen, Heiko Carstens, Helge Deller, Huacai Chen,
	Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Mike Rapoport, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

Extend jitalloc parameters to accommodate more complex overrides of
module_alloc() by architectures.

This includes a fallback range, required by arm, arm64 and powerpc, and
support for allocating KASAN shadow, required by arm64, s390 and x86.

The core implementation of jit_alloc() suppresses the allocation failure
warning for the initial attempt when a fallback range is defined, and
retries in the fallback range with warnings enabled.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
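The try-then-fall-back flow described above can be sketched in userspace
as follows. This is only an illustrative model, not the kernel code:
struct model_range, model_range_alloc() and model_jit_alloc() are
hypothetical names, a bump allocator stands in for
__vmalloc_node_range(), and the nowarn argument plays the role of
__GFP_NOWARN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one address range, handed out bump-style. */
struct model_range {
	unsigned long start;
	unsigned long end;
	unsigned long next;	/* next free address, initially 'start' */
};

/*
 * Toy replacement for __vmalloc_node_range(): returns an address, or 0
 * on failure. 'nowarn' models __GFP_NOWARN and '*warnings' counts the
 * failures that would have been reported.
 */
static unsigned long model_range_alloc(struct model_range *r, size_t len,
				       bool nowarn, int *warnings)
{
	unsigned long addr = r->next;

	assert(addr >= r->start && addr <= r->end);

	if (len > r->end - addr) {
		if (!nowarn)
			(*warnings)++;
		return 0;
	}

	r->next += len;
	return addr;
}

/*
 * Model of the flow in jit_alloc(): the first attempt is silent when a
 * fallback exists, the second attempt (if any) may warn.
 */
static unsigned long model_jit_alloc(struct model_range *primary,
				     struct model_range *fallback,
				     size_t len, int *warnings)
{
	bool has_fallback = fallback != NULL;
	unsigned long p;

	p = model_range_alloc(primary, len, has_fallback, warnings);
	if (!p && has_fallback)
		p = model_range_alloc(fallback, len, false, warnings);

	return p;
}
```

In this model a failure in the primary range is only counted as a
warning when no fallback range exists, which mirrors the intent of
clearing __GFP_NOWARN for the second attempt.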
 arch/arm/kernel/module.c     | 32 ++++++++++----------
 arch/arm64/kernel/module.c   | 57 ++++++++++++++++--------------------
 arch/powerpc/kernel/module.c | 46 +++++++++++++----------------
 arch/s390/kernel/module.c    | 31 ++++++++------------
 arch/x86/kernel/module.c     | 29 +++++++-----------
 include/linux/jitalloc.h     | 14 +++++++++
 mm/jitalloc.c                | 44 ++++++++++++++++++++++++----
 7 files changed, 138 insertions(+), 115 deletions(-)

diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index d59c36dc0494..83ccbf98164f 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -16,6 +16,7 @@
 #include <linux/fs.h>
 #include <linux/string.h>
 #include <linux/gfp.h>
+#include <linux/jitalloc.h>
 
 #include <asm/sections.h>
 #include <asm/smp_plat.h>
@@ -34,23 +35,22 @@
 #endif
 
 #ifdef CONFIG_MMU
-void *module_alloc(unsigned long size)
+static struct jit_alloc_params jit_alloc_params = {
+	.alignment	= 1,
+	.text.start	= MODULES_VADDR,
+	.text.end	= MODULES_END,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
 {
-	gfp_t gfp_mask = GFP_KERNEL;
-	void *p;
-
-	/* Silence the initial allocation */
-	if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS))
-		gfp_mask |= __GFP_NOWARN;
-
-	p = __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-				gfp_mask, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
-				__builtin_return_address(0));
-	if (!IS_ENABLED(CONFIG_ARM_MODULE_PLTS) || p)
-		return p;
-	return __vmalloc_node_range(size, 1,  VMALLOC_START, VMALLOC_END,
-				GFP_KERNEL, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
-				__builtin_return_address(0));
+	jit_alloc_params.text.pgprot = PAGE_KERNEL_EXEC;
+
+	if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
+		jit_alloc_params.text.fallback_start = VMALLOC_START;
+		jit_alloc_params.text.fallback_end = VMALLOC_END;
+	}
+
+	return &jit_alloc_params;
 }
 #endif
 
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index 5af4975caeb5..ecf1f4030317 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -17,56 +17,49 @@
 #include <linux/moduleloader.h>
 #include <linux/scs.h>
 #include <linux/vmalloc.h>
+#include <linux/jitalloc.h>
 #include <asm/alternative.h>
 #include <asm/insn.h>
 #include <asm/scs.h>
 #include <asm/sections.h>
 
-void *module_alloc(unsigned long size)
+static struct jit_alloc_params jit_alloc_params = {
+	.alignment	= MODULE_ALIGN,
+	.flags		= JIT_ALLOC_KASAN_SHADOW,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
 {
 	u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;
-	gfp_t gfp_mask = GFP_KERNEL;
-	void *p;
-
-	/* Silence the initial allocation */
-	if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
-		gfp_mask |= __GFP_NOWARN;
 
 	if (IS_ENABLED(CONFIG_KASAN_GENERIC) ||
 	    IS_ENABLED(CONFIG_KASAN_SW_TAGS))
 		/* don't exceed the static module region - see below */
 		module_alloc_end = MODULES_END;
 
-	p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
-				module_alloc_end, gfp_mask, PAGE_KERNEL, VM_DEFER_KMEMLEAK,
-				NUMA_NO_NODE, __builtin_return_address(0));
+	jit_alloc_params.text.pgprot = PAGE_KERNEL;
+	jit_alloc_params.text.start = module_alloc_base;
+	jit_alloc_params.text.end = module_alloc_end;
 
-	if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
+	/*
+	 * KASAN without KASAN_VMALLOC can only deal with module
+	 * allocations being served from the reserved module region,
+	 * since the remainder of the vmalloc region is already
+	 * backed by zero shadow pages, and punching holes into it
+	 * is non-trivial. Since the module region is not randomized
+	 * when KASAN is enabled without KASAN_VMALLOC, it is even
+	 * less likely that the module region gets exhausted, so we
+	 * can simply omit this fallback in that case.
+	 */
+	if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
 	    (IS_ENABLED(CONFIG_KASAN_VMALLOC) ||
 	     (!IS_ENABLED(CONFIG_KASAN_GENERIC) &&
-	      !IS_ENABLED(CONFIG_KASAN_SW_TAGS))))
-		/*
-		 * KASAN without KASAN_VMALLOC can only deal with module
-		 * allocations being served from the reserved module region,
-		 * since the remainder of the vmalloc region is already
-		 * backed by zero shadow pages, and punching holes into it
-		 * is non-trivial. Since the module region is not randomized
-		 * when KASAN is enabled without KASAN_VMALLOC, it is even
-		 * less likely that the module region gets exhausted, so we
-		 * can simply omit this fallback in that case.
-		 */
-		p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
-				module_alloc_base + SZ_2G, GFP_KERNEL,
-				PAGE_KERNEL, 0, NUMA_NO_NODE,
-				__builtin_return_address(0));
-
-	if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
-		vfree(p);
-		return NULL;
+	      !IS_ENABLED(CONFIG_KASAN_SW_TAGS)))) {
+		jit_alloc_params.text.fallback_start = module_alloc_base;
+		jit_alloc_params.text.fallback_end = module_alloc_base + SZ_2G;
 	}
 
-	/* Memory is intended to be executable, reset the pointer tag. */
-	return kasan_reset_tag(p);
+	return &jit_alloc_params;
 }
 
 enum aarch64_reloc_op {
diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index f6d6ae0a1692..83bdedc7eba0 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -12,6 +12,7 @@
 #include <linux/bug.h>
 #include <asm/module.h>
 #include <linux/uaccess.h>
+#include <linux/jitalloc.h>
 #include <asm/firmware.h>
 #include <linux/sort.h>
 #include <asm/setup.h>
@@ -89,39 +90,32 @@ int module_finalize(const Elf_Ehdr *hdr,
 	return 0;
 }
 
-static __always_inline void *
-__module_alloc(unsigned long size, unsigned long start, unsigned long end, bool nowarn)
-{
-	pgprot_t prot = strict_module_rwx_enabled() ? PAGE_KERNEL : PAGE_KERNEL_EXEC;
-	gfp_t gfp = GFP_KERNEL | (nowarn ? __GFP_NOWARN : 0);
-
-	/*
-	 * Don't do huge page allocations for modules yet until more testing
-	 * is done. STRICT_MODULE_RWX may require extra work to support this
-	 * too.
-	 */
-	return __vmalloc_node_range(size, 1, start, end, gfp, prot,
-				    VM_FLUSH_RESET_PERMS,
-				    NUMA_NO_NODE, __builtin_return_address(0));
-}
+static struct jit_alloc_params jit_alloc_params = {
+	.alignment	= 1,
+};
 
-void *module_alloc(unsigned long size)
+struct jit_alloc_params *jit_alloc_arch_params(void)
 {
 #ifdef MODULES_VADDR
+	pgprot_t prot = strict_module_rwx_enabled() ? PAGE_KERNEL : PAGE_KERNEL_EXEC;
 	unsigned long limit = (unsigned long)_etext - SZ_32M;
-	void *ptr = NULL;
 
-	BUILD_BUG_ON(TASK_SIZE > MODULES_VADDR);
+	jit_alloc_params.text.pgprot = prot;
 
 	/* First try within 32M limit from _etext to avoid branch trampolines */
-	if (MODULES_VADDR < PAGE_OFFSET && MODULES_END > limit)
-		ptr = __module_alloc(size, limit, MODULES_END, true);
-
-	if (!ptr)
-		ptr = __module_alloc(size, MODULES_VADDR, MODULES_END, false);
-
-	return ptr;
+	if (MODULES_VADDR < PAGE_OFFSET && MODULES_END > limit) {
+		jit_alloc_params.text.start = limit;
+		jit_alloc_params.text.end = MODULES_END;
+		jit_alloc_params.text.fallback_start = MODULES_VADDR;
+		jit_alloc_params.text.fallback_end = MODULES_END;
+	} else {
+		jit_alloc_params.text.start = MODULES_VADDR;
+		jit_alloc_params.text.end = MODULES_END;
+	}
 #else
-	return __module_alloc(size, VMALLOC_START, VMALLOC_END, false);
+	jit_alloc_params.text.start = VMALLOC_START;
+	jit_alloc_params.text.end = VMALLOC_END;
 #endif
+
+	return &jit_alloc_params;
 }
diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
index d4844cfe3d7e..0986a1a1b261 100644
--- a/arch/s390/kernel/module.c
+++ b/arch/s390/kernel/module.c
@@ -55,23 +55,18 @@ static unsigned long get_module_load_offset(void)
 	return module_load_offset;
 }
 
-void *module_alloc(unsigned long size)
+static struct jit_alloc_params jit_alloc_params = {
+	.alignment	= MODULE_ALIGN,
+	.flags		= JIT_ALLOC_KASAN_SHADOW,
+	.text.pgprot	= PAGE_KERNEL,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
 {
-	gfp_t gfp_mask = GFP_KERNEL;
-	void *p;
-
-	if (PAGE_ALIGN(size) > MODULES_LEN)
-		return NULL;
-	p = __vmalloc_node_range(size, MODULE_ALIGN,
-				 MODULES_VADDR + get_module_load_offset(),
-				 MODULES_END, gfp_mask, PAGE_KERNEL,
-				 VM_FLUSH_RESET_PERMS | VM_DEFER_KMEMLEAK,
-				 NUMA_NO_NODE, __builtin_return_address(0));
-	if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
-		vfree(p);
-		return NULL;
-	}
-	return p;
+	jit_alloc_params.text.start = MODULES_VADDR + get_module_load_offset();
+	jit_alloc_params.text.end = MODULES_END;
+
+	return &jit_alloc_params;
 }
 
 #ifdef CONFIG_FUNCTION_TRACER
@@ -130,7 +125,7 @@ static void check_rela(Elf_Rela *rela, struct module *me)
 	case R_390_GLOB_DAT:
 	case R_390_JMP_SLOT:
 	case R_390_RELATIVE:
-		/* Only needed if we want to support loading of 
+		/* Only needed if we want to support loading of
 		   modules linked with -shared. */
 		break;
 	}
@@ -442,7 +437,7 @@ static int apply_rela(Elf_Rela *rela, Elf_Addr base, Elf_Sym *symtab,
 	case R_390_GLOB_DAT:	/* Create GOT entry.  */
 	case R_390_JMP_SLOT:	/* Create PLT entry.  */
 	case R_390_RELATIVE:	/* Adjust by program base.  */
-		/* Only needed if we want to support loading of 
+		/* Only needed if we want to support loading of
 		   modules linked with -shared. */
 		return -ENOEXEC;
 	default:
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index b05f62ee2344..cce84b61a036 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -19,6 +19,7 @@
 #include <linux/jump_label.h>
 #include <linux/random.h>
 #include <linux/memory.h>
+#include <linux/jitalloc.h>
 
 #include <asm/text-patching.h>
 #include <asm/page.h>
@@ -65,26 +66,18 @@ static unsigned long int get_module_load_offset(void)
 }
 #endif
 
-void *module_alloc(unsigned long size)
-{
-	gfp_t gfp_mask = GFP_KERNEL;
-	void *p;
-
-	if (PAGE_ALIGN(size) > MODULES_LEN)
-		return NULL;
-
-	p = __vmalloc_node_range(size, MODULE_ALIGN,
-				 MODULES_VADDR + get_module_load_offset(),
-				 MODULES_END, gfp_mask, PAGE_KERNEL,
-				 VM_FLUSH_RESET_PERMS | VM_DEFER_KMEMLEAK,
-				 NUMA_NO_NODE, __builtin_return_address(0));
+static struct jit_alloc_params jit_alloc_params = {
+	.alignment	= MODULE_ALIGN,
+	.flags		= JIT_ALLOC_KASAN_SHADOW,
+};
 
-	if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
-		vfree(p);
-		return NULL;
-	}
+struct jit_alloc_params *jit_alloc_arch_params(void)
+{
+	jit_alloc_params.text.pgprot = PAGE_KERNEL;
+	jit_alloc_params.text.start = MODULES_VADDR + get_module_load_offset();
+	jit_alloc_params.text.end = MODULES_END;
 
-	return p;
+	return &jit_alloc_params;
 }
 
 #ifdef CONFIG_X86_32
diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h
index 34fddef23dea..34ee57795a18 100644
--- a/include/linux/jitalloc.h
+++ b/include/linux/jitalloc.h
@@ -4,26 +4,40 @@
 
 #include <linux/types.h>
 
+/**
+ * enum jit_alloc_flags - options for executable memory allocations
+ * @JIT_ALLOC_KASAN_SHADOW:	allocate kasan shadow
+ */
+enum jit_alloc_flags {
+	JIT_ALLOC_KASAN_SHADOW	= (1 << 0),
+};
+
 /**
  * struct jit_address_space -	address space definition for code and
  *				related data allocations
  * @pgprot:	permissions for memory in this address space
  * @start:	address space start
  * @end:	address space end (inclusive)
+ * @fallback_start:	start of the range for fallback allocations
+ * @fallback_end:	end of the range for fallback allocations (inclusive)
  */
 struct jit_address_space {
 	pgprot_t        pgprot;
 	unsigned long   start;
 	unsigned long   end;
+	unsigned long	fallback_start;
+	unsigned long	fallback_end;
 };
 
 /**
  * struct jit_alloc_params -	architecture parameters for code allocations
  * @text:	address space range for text allocations
+ * @flags:	options for executable memory allocations
  * @alignment:	alignment required for text allocations
  */
 struct jit_alloc_params {
 	struct jit_address_space	text;
+	enum jit_alloc_flags		flags;
 	unsigned int			alignment;
 };
 
diff --git a/mm/jitalloc.c b/mm/jitalloc.c
index 3e63eeb8bf4b..4e10af7803f7 100644
--- a/mm/jitalloc.c
+++ b/mm/jitalloc.c
@@ -8,14 +8,44 @@
 static struct jit_alloc_params jit_alloc_params;
 
 static void *jit_alloc(size_t len, unsigned int alignment, pgprot_t pgprot,
-		       unsigned long start, unsigned long end)
+		       unsigned long start, unsigned long end,
+		       unsigned long fallback_start, unsigned long fallback_end,
+		       bool kasan)
 {
+	unsigned long vm_flags  = VM_FLUSH_RESET_PERMS;
+	bool fallback  = !!fallback_start;
+	gfp_t gfp_flags = GFP_KERNEL;
+	void *p;
+
 	if (PAGE_ALIGN(len) > (end - start))
 		return NULL;
 
-	return __vmalloc_node_range(len, alignment, start, end, GFP_KERNEL,
-				    pgprot, VM_FLUSH_RESET_PERMS,
-				    NUMA_NO_NODE, __builtin_return_address(0));
+	if (kasan)
+		vm_flags |= VM_DEFER_KMEMLEAK;
+
+	if (fallback)
+		gfp_flags |= __GFP_NOWARN;
+
+	p = __vmalloc_node_range(len, alignment, start, end, gfp_flags,
+				 pgprot, vm_flags, NUMA_NO_NODE,
+				 __builtin_return_address(0));
+
+	if (!p && fallback) {
+		start = fallback_start;
+		end = fallback_end;
+		gfp_flags = GFP_KERNEL;
+
+		p = __vmalloc_node_range(len, alignment, start, end, gfp_flags,
+					 pgprot, vm_flags, NUMA_NO_NODE,
+					 __builtin_return_address(0));
+	}
+
+	if (p && kasan && (kasan_alloc_module_shadow(p, len, GFP_KERNEL) < 0)) {
+		vfree(p);
+		return NULL;
+	}
+
+	return kasan_reset_tag(p);
 }
 
 void jit_free(void *buf)
@@ -35,8 +65,12 @@ void *jit_text_alloc(size_t len)
 		pgprot_t pgprot = jit_alloc_params.text.pgprot;
 		unsigned long start = jit_alloc_params.text.start;
 		unsigned long end = jit_alloc_params.text.end;
+		unsigned long fallback_start = jit_alloc_params.text.fallback_start;
+		unsigned long fallback_end = jit_alloc_params.text.fallback_end;
+		bool kasan = jit_alloc_params.flags & JIT_ALLOC_KASAN_SHADOW;
 
-		return jit_alloc(len, align, pgprot, start, end);
+		return jit_alloc(len, align, pgprot, start, end,
+				 fallback_start, fallback_end, kasan);
 	}
 
 	return module_alloc(len);
-- 
2.35.1



* [PATCH 05/13] module, jitalloc: drop module_alloc
  2023-06-01 10:12 [PATCH 00/13] mm: jit/text allocator Mike Rapoport
                   ` (3 preceding siblings ...)
  2023-06-01 10:12 ` [PATCH 04/13] mm/jitalloc, arch: convert remaining " Mike Rapoport
@ 2023-06-01 10:12 ` Mike Rapoport
  2023-06-01 10:12 ` [PATCH 06/13] mm/jitalloc: introduce jit_data_alloc() Mike Rapoport
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 55+ messages in thread
From: Mike Rapoport @ 2023-06-01 10:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy, David S. Miller,
	Dinh Nguyen, Heiko Carstens, Helge Deller, Huacai Chen,
	Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Mike Rapoport, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

Define default parameters for the address range used for code
allocations, based on the current values in module_alloc(), and make
jit_text_alloc() use these defaults when an architecture does not supply
its own parameters.

With this, jit_text_alloc() implements memory allocation in a way
compatible with module_alloc() and can be used as a replacement for
module_alloc().

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
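The selection between architecture-supplied parameters and the defaults
can be modelled in userspace roughly as below. This is a sketch under
stated assumptions: struct model_params is a flattened, hypothetical
version of struct jit_alloc_params, the MODEL_* constants are
placeholders for the architecture-defined VMALLOC_START, VMALLOC_END and
PAGE_KERNEL_EXEC, and a NULL 'arch' argument stands for an architecture
that does not implement jit_alloc_arch_params().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder values; the real ones are architecture-defined. */
#define MODEL_VMALLOC_START	0x100000UL
#define MODEL_VMALLOC_END	0x200000UL
#define MODEL_PAGE_KERNEL_EXEC	0x7UL

/* Hypothetical, flattened view of struct jit_alloc_params. */
struct model_params {
	unsigned int alignment;
	unsigned long pgprot;
	unsigned long start;
	unsigned long end;
};

/* Every field must be non-zero, as in jit_alloc_validate_params(). */
static bool model_validate(const struct model_params *p)
{
	return p->alignment && p->pgprot && p->start && p->end;
}

/* arch == NULL models the weak jit_alloc_arch_params() returning NULL. */
static void model_jit_alloc_init(struct model_params *active,
				 const struct model_params *arch)
{
	assert(active);

	if (arch) {
		if (!model_validate(arch))
			return;		/* 'active' is left untouched */

		*active = *arch;
		return;
	}

	/* defaults for architectures that don't need special handling */
	active->alignment = 1;
	active->pgprot = MODEL_PAGE_KERNEL_EXEC;
	active->start = MODEL_VMALLOC_START;
	active->end = MODEL_VMALLOC_END;
}
```

Note that in this model, as in the patch, invalid architecture
parameters do not fall back to the defaults; the active parameters stay
zeroed and later allocations are expected to fail.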
 arch/arm64/kernel/module.c   |  2 +-
 arch/s390/kernel/module.c    |  2 +-
 arch/x86/kernel/module.c     |  2 +-
 include/linux/jitalloc.h     |  8 ++++++++
 include/linux/moduleloader.h | 12 ------------
 kernel/module/main.c         |  7 -------
 mm/jitalloc.c                | 31 +++++++++++++++++--------------
 7 files changed, 28 insertions(+), 36 deletions(-)

diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index ecf1f4030317..91ffcff5a44c 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -24,7 +24,7 @@
 #include <asm/sections.h>
 
 static struct jit_alloc_params jit_alloc_params = {
-	.alignment	= MODULE_ALIGN,
+	.alignment	= JIT_ALLOC_ALIGN,
 	.flags		= JIT_ALLOC_KASAN_SHADOW,
 };
 
diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
index 0986a1a1b261..3f85cf1e7c4e 100644
--- a/arch/s390/kernel/module.c
+++ b/arch/s390/kernel/module.c
@@ -56,7 +56,7 @@ static unsigned long get_module_load_offset(void)
 }
 
 static struct jit_alloc_params jit_alloc_params = {
-	.alignment	= MODULE_ALIGN,
+	.alignment	= JIT_ALLOC_ALIGN,
 	.flags		= JIT_ALLOC_KASAN_SHADOW,
 	.text.pgprot	= PAGE_KERNEL,
 };
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index cce84b61a036..cacca613b8bd 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -67,7 +67,7 @@ static unsigned long int get_module_load_offset(void)
 #endif
 
 static struct jit_alloc_params jit_alloc_params = {
-	.alignment	= MODULE_ALIGN,
+	.alignment	= JIT_ALLOC_ALIGN,
 	.flags		= JIT_ALLOC_KASAN_SHADOW,
 };
 
diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h
index 34ee57795a18..823b13706a90 100644
--- a/include/linux/jitalloc.h
+++ b/include/linux/jitalloc.h
@@ -4,6 +4,14 @@
 
 #include <linux/types.h>
 
+#if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
+		!defined(CONFIG_KASAN_VMALLOC)
+#include <linux/kasan.h>
+#define JIT_ALLOC_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT)
+#else
+#define JIT_ALLOC_ALIGN PAGE_SIZE
+#endif
+
 /**
  * enum jit_alloc_flags - options for executable memory allocations
  * @JIT_ALLOC_KASAN_SHADOW:	allocate kasan shadow
diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
index b3374342f7af..4321682fe849 100644
--- a/include/linux/moduleloader.h
+++ b/include/linux/moduleloader.h
@@ -25,10 +25,6 @@ int module_frob_arch_sections(Elf_Ehdr *hdr,
 /* Additional bytes needed by arch in front of individual sections */
 unsigned int arch_mod_section_prepend(struct module *mod, unsigned int section);
 
-/* Allocator used for allocating struct module, core sections and init
-   sections.  Returns NULL on failure. */
-void *module_alloc(unsigned long size);
-
 /* Determines if the section name is an init section (that is only used during
  * module loading).
  */
@@ -113,12 +109,4 @@ void module_arch_cleanup(struct module *mod);
 /* Any cleanup before freeing mod->module_init */
 void module_arch_freeing_init(struct module *mod);
 
-#if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
-		!defined(CONFIG_KASAN_VMALLOC)
-#include <linux/kasan.h>
-#define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT)
-#else
-#define MODULE_ALIGN PAGE_SIZE
-#endif
-
 #endif
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 51278c571bcb..dfb7fa109f1a 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1600,13 +1600,6 @@ static void free_modinfo(struct module *mod)
 	}
 }
 
-void * __weak module_alloc(unsigned long size)
-{
-	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
-			GFP_KERNEL, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
-			NUMA_NO_NODE, __builtin_return_address(0));
-}
-
 bool __weak module_init_section(const char *name)
 {
 	return strstarts(name, ".init");
diff --git a/mm/jitalloc.c b/mm/jitalloc.c
index 4e10af7803f7..221940e36b46 100644
--- a/mm/jitalloc.c
+++ b/mm/jitalloc.c
@@ -60,20 +60,16 @@ void jit_free(void *buf)
 
 void *jit_text_alloc(size_t len)
 {
-	if (jit_alloc_params.text.start) {
-		unsigned int align = jit_alloc_params.alignment;
-		pgprot_t pgprot = jit_alloc_params.text.pgprot;
-		unsigned long start = jit_alloc_params.text.start;
-		unsigned long end = jit_alloc_params.text.end;
-		unsigned long fallback_start = jit_alloc_params.text.fallback_start;
-		unsigned long fallback_end = jit_alloc_params.text.fallback_end;
-		bool kasan = jit_alloc_params.flags & JIT_ALLOC_KASAN_SHADOW;
-
-		return jit_alloc(len, align, pgprot, start, end,
-				 fallback_start, fallback_end, kasan);
-	}
-
-	return module_alloc(len);
+	unsigned int align = jit_alloc_params.alignment;
+	pgprot_t pgprot = jit_alloc_params.text.pgprot;
+	unsigned long start = jit_alloc_params.text.start;
+	unsigned long end = jit_alloc_params.text.end;
+	unsigned long fallback_start = jit_alloc_params.text.fallback_start;
+	unsigned long fallback_end = jit_alloc_params.text.fallback_end;
+	bool kasan = jit_alloc_params.flags & JIT_ALLOC_KASAN_SHADOW;
+
+	return jit_alloc(len, align, pgprot, start, end,
+			 fallback_start, fallback_end, kasan);
 }
 
 struct jit_alloc_params * __weak jit_alloc_arch_params(void)
@@ -101,5 +97,12 @@ void jit_alloc_init(void)
 			return;
 
 		jit_alloc_params = *p;
+		return;
 	}
+
+	/* defaults for architectures that don't need special handling */
+	jit_alloc_params.alignment	= 1;
+	jit_alloc_params.text.pgprot	= PAGE_KERNEL_EXEC;
+	jit_alloc_params.text.start	= VMALLOC_START;
+	jit_alloc_params.text.end	= VMALLOC_END;
 }
-- 
2.35.1



* [PATCH 06/13] mm/jitalloc: introduce jit_data_alloc()
  2023-06-01 10:12 [PATCH 00/13] mm: jit/text allocator Mike Rapoport
                   ` (4 preceding siblings ...)
  2023-06-01 10:12 ` [PATCH 05/13] module, jitalloc: drop module_alloc Mike Rapoport
@ 2023-06-01 10:12 ` Mike Rapoport
  2023-06-01 10:12 ` [PATCH 07/13] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES Mike Rapoport
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 55+ messages in thread
From: Mike Rapoport @ 2023-06-01 10:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy, David S. Miller,
	Dinh Nguyen, Heiko Carstens, Helge Deller, Huacai Chen,
	Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Mike Rapoport, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

Data related to code allocations, such as module data sections, needs to
comply with architecture constraints on its placement, and until now it
was allocated using jit_text_alloc().

Create a dedicated API for allocating data related to code allocations
and allow architectures to define address ranges for data allocations.

Since this is currently only relevant for powerpc variants that use the
VMALLOC address space for module data allocations, automatically reuse
the address ranges defined for text unless an architecture explicitly
defines an address range for data.

With the separation of code and data allocations, data sections of
modules are now mapped as PAGE_KERNEL rather than PAGE_KERNEL_EXEC, which
was the default on many architectures.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
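The "reuse text ranges unless data is explicitly defined" rule could be
expressed roughly as in the userspace sketch below. It is an assumption
about the shape of the logic, not a copy of the mm/jitalloc.c hunk:
struct model_space, struct model_params, model_fill_data_defaults() and
the MODEL_PAGE_KERNEL placeholder are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder for the non-executable PAGE_KERNEL protection. */
#define MODEL_PAGE_KERNEL	0x3UL

/* Hypothetical mirror of struct jit_address_space (no fallback range). */
struct model_space {
	unsigned long pgprot;
	unsigned long start;
	unsigned long end;
};

/* Hypothetical mirror of the text/data part of struct jit_alloc_params. */
struct model_params {
	struct model_space text;
	struct model_space data;
};

/*
 * If the architecture left the data range empty, reuse the text range.
 * Data is mapped non-executable unless a protection was given.
 */
static void model_fill_data_defaults(struct model_params *p)
{
	assert(p);

	if (!p->data.start && !p->data.end) {
		p->data.start = p->text.start;
		p->data.end = p->text.end;
	}

	if (!p->data.pgprot)
		p->data.pgprot = MODEL_PAGE_KERNEL;
}
```

With this shape, an architecture such as the powerpc variants mentioned
above only has to fill in the data range to get data placed outside the
text range; everyone else gets the text range with a non-executable
mapping.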
 arch/powerpc/kernel/module.c |  8 ++++++++
 include/linux/jitalloc.h     |  2 ++
 kernel/module/main.c         | 15 +++------------
 mm/jitalloc.c                | 36 ++++++++++++++++++++++++++++++++++++
 4 files changed, 49 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index 83bdedc7eba0..b58af61e90c0 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -96,6 +96,10 @@ static struct jit_alloc_params jit_alloc_params = {
 
 struct jit_alloc_params *jit_alloc_arch_params(void)
 {
+	/*
+	 * BOOK3S_32 and 8xx define MODULES_VADDR for text allocations and
+	 * allow allocating data in the entire vmalloc space
+	 */
 #ifdef MODULES_VADDR
 	pgprot_t prot = strict_module_rwx_enabled() ? PAGE_KERNEL : PAGE_KERNEL_EXEC;
 	unsigned long limit = (unsigned long)_etext - SZ_32M;
@@ -112,6 +116,10 @@ struct jit_alloc_params *jit_alloc_arch_params(void)
 		jit_alloc_params.text.start = MODULES_VADDR;
 		jit_alloc_params.text.end = MODULES_END;
 	}
+
+	jit_alloc_params.data.pgprot	= PAGE_KERNEL;
+	jit_alloc_params.data.start	= VMALLOC_START;
+	jit_alloc_params.data.end	= VMALLOC_END;
 #else
 	jit_alloc_params.text.start = VMALLOC_START;
 	jit_alloc_params.text.end = VMALLOC_END;
diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h
index 823b13706a90..7f8cafb3cfe9 100644
--- a/include/linux/jitalloc.h
+++ b/include/linux/jitalloc.h
@@ -45,6 +45,7 @@ struct jit_address_space {
  */
 struct jit_alloc_params {
 	struct jit_address_space	text;
+	struct jit_address_space	data;
 	enum jit_alloc_flags		flags;
 	unsigned int			alignment;
 };
@@ -53,6 +54,7 @@ struct jit_alloc_params *jit_alloc_arch_params(void);
 
 void jit_free(void *buf);
 void *jit_text_alloc(size_t len);
+void *jit_data_alloc(size_t len);
 
 #ifdef CONFIG_JIT_ALLOC
 void jit_alloc_init(void);
diff --git a/kernel/module/main.c b/kernel/module/main.c
index dfb7fa109f1a..91477aa5f671 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1195,25 +1195,16 @@ void __weak module_arch_freeing_init(struct module *mod)
 {
 }
 
-static bool mod_mem_use_vmalloc(enum mod_mem_type type)
-{
-	return IS_ENABLED(CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC) &&
-		mod_mem_type_is_core_data(type);
-}
-
 static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
 {
-	if (mod_mem_use_vmalloc(type))
-		return vzalloc(size);
+	if (mod_mem_type_is_data(type))
+		return jit_data_alloc(size);
 	return jit_text_alloc(size);
 }
 
 static void module_memory_free(void *ptr, enum mod_mem_type type)
 {
-	if (mod_mem_use_vmalloc(type))
-		vfree(ptr);
-	else
-		jit_free(ptr);
+	jit_free(ptr);
 }
 
 static void free_mod_mem(struct module *mod)
diff --git a/mm/jitalloc.c b/mm/jitalloc.c
index 221940e36b46..16fd715d501a 100644
--- a/mm/jitalloc.c
+++ b/mm/jitalloc.c
@@ -72,6 +72,20 @@ void *jit_text_alloc(size_t len)
 			 fallback_start, fallback_end, kasan);
 }
 
+void *jit_data_alloc(size_t len)
+{
+	unsigned int align = jit_alloc_params.alignment;
+	pgprot_t pgprot = jit_alloc_params.data.pgprot;
+	unsigned long start = jit_alloc_params.data.start;
+	unsigned long end = jit_alloc_params.data.end;
+	unsigned long fallback_start = jit_alloc_params.data.fallback_start;
+	unsigned long fallback_end = jit_alloc_params.data.fallback_end;
+	bool kasan = jit_alloc_params.flags & JIT_ALLOC_KASAN_SHADOW;
+
+	return jit_alloc(len, align, pgprot, start, end,
+			 fallback_start, fallback_end, kasan);
+}
+
 struct jit_alloc_params * __weak jit_alloc_arch_params(void)
 {
 	return NULL;
@@ -88,6 +102,23 @@ static bool jit_alloc_validate_params(struct jit_alloc_params *p)
 	return true;
 }
 
+static void jit_alloc_init_missing(struct jit_alloc_params *p)
+{
+	if (!pgprot_val(jit_alloc_params.data.pgprot))
+		jit_alloc_params.data.pgprot = PAGE_KERNEL;
+
+	if (!jit_alloc_params.data.start) {
+		jit_alloc_params.data.start = p->text.start;
+		jit_alloc_params.data.end = p->text.end;
+	}
+
+	if (!jit_alloc_params.data.fallback_start &&
+	    jit_alloc_params.text.fallback_start) {
+		jit_alloc_params.data.fallback_start = p->text.fallback_start;
+		jit_alloc_params.data.fallback_end = p->text.fallback_end;
+	}
+}
+
 void jit_alloc_init(void)
 {
 	struct jit_alloc_params *p = jit_alloc_arch_params();
@@ -97,6 +128,8 @@ void jit_alloc_init(void)
 			return;
 
 		jit_alloc_params = *p;
+		jit_alloc_init_missing(p);
+
 		return;
 	}
 
@@ -105,4 +138,7 @@ void jit_alloc_init(void)
 	jit_alloc_params.text.pgprot	= PAGE_KERNEL_EXEC;
 	jit_alloc_params.text.start	= VMALLOC_START;
 	jit_alloc_params.text.end	= VMALLOC_END;
+	jit_alloc_params.data.pgprot	= PAGE_KERNEL;
+	jit_alloc_params.data.start	= VMALLOC_START;
+	jit_alloc_params.data.end	= VMALLOC_END;
 }
-- 
2.35.1



* [PATCH 07/13] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
  2023-06-01 10:12 [PATCH 00/13] mm: jit/text allocator Mike Rapoport
                   ` (5 preceding siblings ...)
  2023-06-01 10:12 ` [PATCH 06/13] mm/jitalloc: introduce jit_data_alloc() Mike Rapoport
@ 2023-06-01 10:12 ` Mike Rapoport
  2023-06-01 10:12 ` [PATCH 08/13] arch: make jitalloc setup available regardless of CONFIG_MODULES Mike Rapoport
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 55+ messages in thread
From: Mike Rapoport @ 2023-06-01 10:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy, David S. Miller,
	Dinh Nguyen, Heiko Carstens, Helge Deller, Huacai Chen,
	Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Mike Rapoport, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

Dynamic ftrace must allocate memory for code and this was impossible
without CONFIG_MODULES.

With jitalloc separated from the modules code, jit_text_alloc() is
available regardless of CONFIG_MODULES.

Move jitalloc initialization to x86/mm/init.c so that it won't get
compiled away when CONFIG_MODULES=n and enable dynamic ftrace
unconditionally.
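
The initialization being moved includes a load offset that is picked once
and then reused until reboot. A userspace model of that idea might look like
the following; the locking present in the kernel code is omitted, rand()
stands in for the kernel's random helper, and MODEL_PAGE_SIZE and
model_load_offset() are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumed page size for the model */

/* Return a page-aligned offset chosen on first use and kept afterwards. */
static unsigned long model_load_offset(int kaslr_enabled)
{
	static unsigned long offset;

	if (!kaslr_enabled)
		return 0;

	/* Between 1 and 1024 pages, computed only on the first call. */
	if (offset == 0)
		offset = (unsigned long)(1 + rand() % 1024) * MODEL_PAGE_SIZE;

	assert(offset % MODEL_PAGE_SIZE == 0);
	return offset;
}
```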

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 arch/x86/Kconfig         |  1 +
 arch/x86/kernel/ftrace.c |  9 --------
 arch/x86/kernel/module.c | 44 --------------------------------------
 arch/x86/mm/init.c       | 46 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 47 insertions(+), 53 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 53bab123a8ee..fac4add6ce16 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -35,6 +35,7 @@ config X86_64
 	select SWIOTLB
 	select ARCH_HAS_ELFCORE_COMPAT
 	select ZONE_DMA32
+	select JIT_ALLOC if DYNAMIC_FTRACE
 
 config FORCE_DYNAMIC_FTRACE
 	def_bool y
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 157c8a799704..aa99536b824c 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -261,7 +261,6 @@ void arch_ftrace_update_code(int command)
 /* Currently only x86_64 supports dynamic trampolines */
 #ifdef CONFIG_X86_64
 
-#ifdef CONFIG_MODULES
 /* Module allocation simplifies allocating memory for code */
 static inline void *alloc_tramp(unsigned long size)
 {
@@ -271,14 +270,6 @@ static inline void tramp_free(void *tramp)
 {
 	jit_free(tramp);
 }
-#else
-/* Trampolines can only be created if modules are supported */
-static inline void *alloc_tramp(unsigned long size)
-{
-	return NULL;
-}
-static inline void tramp_free(void *tramp) { }
-#endif
 
 /* Defined as markers to the end of the ftrace default trampolines */
 extern void ftrace_regs_caller_end(void);
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index cacca613b8bd..94a00dc103cd 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -19,7 +19,6 @@
 #include <linux/jump_label.h>
 #include <linux/random.h>
 #include <linux/memory.h>
-#include <linux/jitalloc.h>
 
 #include <asm/text-patching.h>
 #include <asm/page.h>
@@ -37,49 +36,6 @@ do {							\
 } while (0)
 #endif
 
-#ifdef CONFIG_RANDOMIZE_BASE
-static unsigned long module_load_offset;
-
-/* Mutex protects the module_load_offset. */
-static DEFINE_MUTEX(module_kaslr_mutex);
-
-static unsigned long int get_module_load_offset(void)
-{
-	if (kaslr_enabled()) {
-		mutex_lock(&module_kaslr_mutex);
-		/*
-		 * Calculate the module_load_offset the first time this
-		 * code is called. Once calculated it stays the same until
-		 * reboot.
-		 */
-		if (module_load_offset == 0)
-			module_load_offset =
-				get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
-		mutex_unlock(&module_kaslr_mutex);
-	}
-	return module_load_offset;
-}
-#else
-static unsigned long int get_module_load_offset(void)
-{
-	return 0;
-}
-#endif
-
-static struct jit_alloc_params jit_alloc_params = {
-	.alignment	= JIT_ALLOC_ALIGN,
-	.flags		= JIT_ALLOC_KASAN_SHADOW,
-};
-
-struct jit_alloc_params *jit_alloc_arch_params(void)
-{
-	jit_alloc_params.text.pgprot = PAGE_KERNEL;
-	jit_alloc_params.text.start = MODULES_VADDR + get_module_load_offset();
-	jit_alloc_params.text.end = MODULES_END;
-
-	return &jit_alloc_params;
-}
-
 #ifdef CONFIG_X86_32
 int apply_relocate(Elf32_Shdr *sechdrs,
 		   const char *strtab,
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 3cdac0f0055d..ffaf9a3840ce 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -7,6 +7,7 @@
 #include <linux/swapops.h>
 #include <linux/kmemleak.h>
 #include <linux/sched/task.h>
+#include <linux/jitalloc.h>
 
 #include <asm/set_memory.h>
 #include <asm/e820/api.h>
@@ -1084,3 +1085,48 @@ unsigned long arch_max_swapfile_size(void)
 	return pages;
 }
 #endif
+
+#ifdef CONFIG_JIT_ALLOC
+#ifdef CONFIG_RANDOMIZE_BASE
+static unsigned long jit_load_offset;
+
+/* Mutex protects the jit_load_offset. */
+static DEFINE_MUTEX(jit_kaslr_mutex);
+
+static unsigned long int get_jit_load_offset(void)
+{
+	if (kaslr_enabled()) {
+		mutex_lock(&jit_kaslr_mutex);
+		/*
+		 * Calculate the jit_load_offset the first time this
+		 * code is called. Once calculated it stays the same until
+		 * reboot.
+		 */
+		if (jit_load_offset == 0)
+			jit_load_offset =
+				get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
+		mutex_unlock(&jit_kaslr_mutex);
+	}
+	return jit_load_offset;
+}
+#else
+static unsigned long int get_jit_load_offset(void)
+{
+	return 0;
+}
+#endif
+
+static struct jit_alloc_params jit_alloc_params = {
+	.alignment	= JIT_ALLOC_ALIGN,
+	.flags		= JIT_ALLOC_KASAN_SHADOW,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
+{
+	jit_alloc_params.text.pgprot = PAGE_KERNEL;
+	jit_alloc_params.text.start = MODULES_VADDR + get_jit_load_offset();
+	jit_alloc_params.text.end = MODULES_END;
+
+	return &jit_alloc_params;
+}
+#endif /* CONFIG_JIT_ALLOC */
-- 
2.35.1



* [PATCH 08/13] arch: make jitalloc setup available regardless of CONFIG_MODULES
  2023-06-01 10:12 [PATCH 00/13] mm: jit/text allocator Mike Rapoport
                   ` (6 preceding siblings ...)
  2023-06-01 10:12 ` [PATCH 07/13] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES Mike Rapoport
@ 2023-06-01 10:12 ` Mike Rapoport
  2023-06-01 10:12 ` [PATCH 09/13] kprobes: remove dependcy on CONFIG_MODULES Mike Rapoport
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 55+ messages in thread
From: Mike Rapoport @ 2023-06-01 10:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy, David S. Miller,
	Dinh Nguyen, Heiko Carstens, Helge Deller, Huacai Chen,
	Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Mike Rapoport, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

jitalloc does not depend on modules; on the contrary, modules use
jitalloc.

To make jitalloc available when CONFIG_MODULES=n, for instance for
kprobes, split jit_alloc_params initialization out from
arch/*/kernel/module.c and compile it when CONFIG_JIT_ALLOC=y.
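
The override mechanism this relies on can be sketched in userspace with a
weak symbol: a weak default returns NULL, and the init path falls back to
defaults unless a strong definition supplies parameters. All model_* names
and addresses below are hypothetical, and the weak attribute is the
GCC/Clang spelling.

```c
#include <assert.h>
#include <stddef.h>

struct model_params {
	unsigned long text_start;
	unsigned long text_end;
};

#define MODEL_VMALLOC_START 0x1000UL	/* hypothetical addresses */
#define MODEL_VMALLOC_END   0x9000UL

/* Weak default: used only when no other file provides a strong definition. */
__attribute__((weak)) struct model_params *model_arch_params(void)
{
	return NULL;
}

static struct model_params active_params;

/* Copy the architecture parameters if present, else fall back to defaults. */
static void model_init(void)
{
	struct model_params *p = model_arch_params();

	if (p) {
		assert(p->text_start < p->text_end);
		active_params = *p;
		return;
	}

	active_params.text_start = MODEL_VMALLOC_START;
	active_params.text_end = MODEL_VMALLOC_END;
}
```

Where the strong definition lives does not matter to the init path, which is
why it can sit in an mm/ file that is built without module support.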

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 arch/arm/kernel/module.c       | 32 --------------------------
 arch/arm/mm/init.c             | 35 ++++++++++++++++++++++++++++
 arch/arm64/kernel/module.c     | 40 --------------------------------
 arch/arm64/mm/init.c           | 42 ++++++++++++++++++++++++++++++++++
 arch/loongarch/kernel/module.c | 14 ------------
 arch/loongarch/mm/init.c       | 16 +++++++++++++
 arch/mips/kernel/module.c      | 17 --------------
 arch/mips/mm/init.c            | 19 +++++++++++++++
 arch/parisc/kernel/module.c    | 17 --------------
 arch/parisc/mm/init.c          | 21 ++++++++++++++++-
 arch/powerpc/kernel/module.c   | 39 -------------------------------
 arch/powerpc/mm/mem.c          | 41 +++++++++++++++++++++++++++++++++
 arch/riscv/kernel/module.c     | 16 -------------
 arch/riscv/mm/init.c           | 18 +++++++++++++++
 arch/s390/kernel/module.c      | 32 --------------------------
 arch/s390/mm/init.c            | 35 ++++++++++++++++++++++++++++
 arch/sparc/kernel/module.c     | 19 ---------------
 arch/sparc/mm/Makefile         |  2 ++
 arch/sparc/mm/jitalloc.c       | 21 +++++++++++++++++
 19 files changed, 249 insertions(+), 227 deletions(-)
 create mode 100644 arch/sparc/mm/jitalloc.c

diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index 83ccbf98164f..054e799e7091 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -16,44 +16,12 @@
 #include <linux/fs.h>
 #include <linux/string.h>
 #include <linux/gfp.h>
-#include <linux/jitalloc.h>
 
 #include <asm/sections.h>
 #include <asm/smp_plat.h>
 #include <asm/unwind.h>
 #include <asm/opcodes.h>
 
-#ifdef CONFIG_XIP_KERNEL
-/*
- * The XIP kernel text is mapped in the module area for modules and
- * some other stuff to work without any indirect relocations.
- * MODULES_VADDR is redefined here and not in asm/memory.h to avoid
- * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off.
- */
-#undef MODULES_VADDR
-#define MODULES_VADDR	(((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK)
-#endif
-
-#ifdef CONFIG_MMU
-static struct jit_alloc_params jit_alloc_params = {
-	.alignment	= 1,
-	.text.start	= MODULES_VADDR,
-	.text.end	= MODULES_END,
-};
-
-struct jit_alloc_params *jit_alloc_arch_params(void)
-{
-	jit_alloc_params.text.pgprot = PAGE_KERNEL_EXEC;
-
-	if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
-		jit_alloc_params.text.fallback_start = VMALLOC_START;
-		jit_alloc_params.text.fallback_end = VMALLOC_END;
-	}
-
-	return &jit_alloc_params;
-}
-#endif
-
 bool module_init_section(const char *name)
 {
 	return strstarts(name, ".init") ||
diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index ce64bdb55a16..e492625b7f3d 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -22,6 +22,7 @@
 #include <linux/sizes.h>
 #include <linux/stop_machine.h>
 #include <linux/swiotlb.h>
+#include <linux/jitalloc.h>
 
 #include <asm/cp15.h>
 #include <asm/mach-types.h>
@@ -486,3 +487,37 @@ void free_initrd_mem(unsigned long start, unsigned long end)
 	free_reserved_area((void *)start, (void *)end, -1, "initrd");
 }
 #endif
+
+#ifdef CONFIG_JIT_ALLOC
+#ifdef CONFIG_XIP_KERNEL
+/*
+ * The XIP kernel text is mapped in the module area for modules and
+ * some other stuff to work without any indirect relocations.
+ * MODULES_VADDR is redefined here and not in asm/memory.h to avoid
+ * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off.
+ */
+#undef MODULES_VADDR
+#define MODULES_VADDR	(((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK)
+#endif
+
+#ifdef CONFIG_MMU
+static struct jit_alloc_params jit_alloc_params = {
+	.alignment	= 1,
+	.text.start	= MODULES_VADDR,
+	.text.end	= MODULES_END,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
+{
+	jit_alloc_params.text.pgprot = PAGE_KERNEL_EXEC;
+
+	if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
+		jit_alloc_params.text.fallback_start = VMALLOC_START;
+		jit_alloc_params.text.fallback_end = VMALLOC_END;
+	}
+
+	return &jit_alloc_params;
+}
+#endif
+
+#endif /* CONFIG_JIT_ALLOC */
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index 91ffcff5a44c..6d09b29fe9db 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -17,51 +17,11 @@
 #include <linux/moduleloader.h>
 #include <linux/scs.h>
 #include <linux/vmalloc.h>
-#include <linux/jitalloc.h>
 #include <asm/alternative.h>
 #include <asm/insn.h>
 #include <asm/scs.h>
 #include <asm/sections.h>
 
-static struct jit_alloc_params jit_alloc_params = {
-	.alignment	= JIT_ALLOC_ALIGN,
-	.flags		= JIT_ALLOC_KASAN_SHADOW,
-};
-
-struct jit_alloc_params *jit_alloc_arch_params(void)
-{
-	u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;
-
-	if (IS_ENABLED(CONFIG_KASAN_GENERIC) ||
-	    IS_ENABLED(CONFIG_KASAN_SW_TAGS))
-		/* don't exceed the static module region - see below */
-		module_alloc_end = MODULES_END;
-
-	jit_alloc_params.text.pgprot = PAGE_KERNEL;
-	jit_alloc_params.text.start = module_alloc_base;
-	jit_alloc_params.text.end = module_alloc_end;
-
-	/*
-	 * KASAN without KASAN_VMALLOC can only deal with module
-	 * allocations being served from the reserved module region,
-	 * since the remainder of the vmalloc region is already
-	 * backed by zero shadow pages, and punching holes into it
-	 * is non-trivial. Since the module region is not randomized
-	 * when KASAN is enabled without KASAN_VMALLOC, it is even
-	 * less likely that the module region gets exhausted, so we
-	 * can simply omit this fallback in that case.
-	 */
-	if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
-	    (IS_ENABLED(CONFIG_KASAN_VMALLOC) ||
-	     (!IS_ENABLED(CONFIG_KASAN_GENERIC) &&
-	      !IS_ENABLED(CONFIG_KASAN_SW_TAGS)))) {
-		jit_alloc_params.text.fallback_start = module_alloc_base;
-		jit_alloc_params.text.fallback_end = module_alloc_base + SZ_2G;
-	}
-
-	return &jit_alloc_params;
-}
-
 enum aarch64_reloc_op {
 	RELOC_OP_NONE,
 	RELOC_OP_ABS,
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 66e70ca47680..a4463a35b3c5 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -31,6 +31,7 @@
 #include <linux/hugetlb.h>
 #include <linux/acpi_iort.h>
 #include <linux/kmemleak.h>
+#include <linux/jitalloc.h>
 
 #include <asm/boot.h>
 #include <asm/fixmap.h>
@@ -493,3 +494,44 @@ void dump_mem_limit(void)
 		pr_emerg("Memory Limit: none\n");
 	}
 }
+
+#ifdef CONFIG_JIT_ALLOC
+static struct jit_alloc_params jit_alloc_params = {
+	.alignment	= JIT_ALLOC_ALIGN,
+	.flags		= JIT_ALLOC_KASAN_SHADOW,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
+{
+	u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;
+
+	if (IS_ENABLED(CONFIG_KASAN_GENERIC) ||
+	    IS_ENABLED(CONFIG_KASAN_SW_TAGS))
+		/* don't exceed the static module region - see below */
+		module_alloc_end = MODULES_END;
+
+	jit_alloc_params.text.pgprot = PAGE_KERNEL;
+	jit_alloc_params.text.start = module_alloc_base;
+	jit_alloc_params.text.end = module_alloc_end;
+
+	/*
+	 * KASAN without KASAN_VMALLOC can only deal with module
+	 * allocations being served from the reserved module region,
+	 * since the remainder of the vmalloc region is already
+	 * backed by zero shadow pages, and punching holes into it
+	 * is non-trivial. Since the module region is not randomized
+	 * when KASAN is enabled without KASAN_VMALLOC, it is even
+	 * less likely that the module region gets exhausted, so we
+	 * can simply omit this fallback in that case.
+	 */
+	if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
+	    (IS_ENABLED(CONFIG_KASAN_VMALLOC) ||
+	     (!IS_ENABLED(CONFIG_KASAN_GENERIC) &&
+	      !IS_ENABLED(CONFIG_KASAN_SW_TAGS)))) {
+		jit_alloc_params.text.fallback_start = module_alloc_base;
+		jit_alloc_params.text.fallback_end = module_alloc_base + SZ_2G;
+	}
+
+	return &jit_alloc_params;
+}
+#endif
diff --git a/arch/loongarch/kernel/module.c b/arch/loongarch/kernel/module.c
index 1d5e00874ae7..181b5f8b09f1 100644
--- a/arch/loongarch/kernel/module.c
+++ b/arch/loongarch/kernel/module.c
@@ -18,7 +18,6 @@
 #include <linux/ftrace.h>
 #include <linux/string.h>
 #include <linux/kernel.h>
-#include <linux/jitalloc.h>
 #include <asm/alternative.h>
 #include <asm/inst.h>
 
@@ -470,19 +469,6 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
 	return 0;
 }
 
-static struct jit_alloc_params jit_alloc_params = {
-	.alignment	= 1,
-	.text.pgprot	= PAGE_KERNEL,
-};
-
-struct jit_alloc_params *jit_alloc_arch_params(void)
-{
-	jit_alloc_params.text.start = MODULES_VADDR;
-	jit_alloc_params.text.end = MODULES_END;
-
-	return &jit_alloc_params;
-}
-
 static void module_init_ftrace_plt(const Elf_Ehdr *hdr,
 				   const Elf_Shdr *sechdrs, struct module *mod)
 {
diff --git a/arch/loongarch/mm/init.c b/arch/loongarch/mm/init.c
index 3b7d8129570b..30ca8e497377 100644
--- a/arch/loongarch/mm/init.c
+++ b/arch/loongarch/mm/init.c
@@ -24,6 +24,7 @@
 #include <linux/gfp.h>
 #include <linux/hugetlb.h>
 #include <linux/mmzone.h>
+#include <linux/jitalloc.h>
 
 #include <asm/asm-offsets.h>
 #include <asm/bootinfo.h>
@@ -274,3 +275,18 @@ EXPORT_SYMBOL(invalid_pmd_table);
 #endif
 pte_t invalid_pte_table[PTRS_PER_PTE] __page_aligned_bss;
 EXPORT_SYMBOL(invalid_pte_table);
+
+#ifdef CONFIG_JIT_ALLOC
+static struct jit_alloc_params jit_alloc_params = {
+	.alignment	= 1,
+	.text.pgprot	= PAGE_KERNEL,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
+{
+	jit_alloc_params.text.start = MODULES_VADDR;
+	jit_alloc_params.text.end = MODULES_END;
+
+	return &jit_alloc_params;
+}
+#endif
diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c
index f762c697ab9c..dba78c7a4a88 100644
--- a/arch/mips/kernel/module.c
+++ b/arch/mips/kernel/module.c
@@ -20,7 +20,6 @@
 #include <linux/kernel.h>
 #include <linux/spinlock.h>
 #include <linux/jump_label.h>
-#include <linux/jitalloc.h>
 
 extern void jump_label_apply_nops(struct module *mod);
 
@@ -33,22 +32,6 @@ struct mips_hi16 {
 static LIST_HEAD(dbe_list);
 static DEFINE_SPINLOCK(dbe_lock);
 
-#ifdef MODULE_START
-
-static struct jit_alloc_params jit_alloc_params = {
-	.alignment	= 1,
-	.text.start	= MODULE_START,
-	.text.end	= MODULE_END,
-};
-
-struct jit_alloc_params *jit_alloc_arch_params(void)
-{
-	jit_alloc_params.text.pgprot = PAGE_KERNEL;
-
-	return &jit_alloc_params;
-}
-#endif
-
 static void apply_r_mips_32(u32 *location, u32 base, Elf_Addr v)
 {
 	*location = base + v;
diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c
index 5a8002839550..1fd1bea78fdc 100644
--- a/arch/mips/mm/init.c
+++ b/arch/mips/mm/init.c
@@ -31,6 +31,7 @@
 #include <linux/gfp.h>
 #include <linux/kcore.h>
 #include <linux/initrd.h>
+#include <linux/jitalloc.h>
 
 #include <asm/bootinfo.h>
 #include <asm/cachectl.h>
@@ -568,3 +569,21 @@ EXPORT_SYMBOL_GPL(invalid_pmd_table);
 #endif
 pte_t invalid_pte_table[PTRS_PER_PTE] __page_aligned_bss;
 EXPORT_SYMBOL(invalid_pte_table);
+
+#ifdef CONFIG_JIT_ALLOC
+#ifdef MODULE_START
+
+static struct jit_alloc_params jit_alloc_params = {
+	.alignment	= 1,
+	.text.start	= MODULE_START,
+	.text.end	= MODULE_END,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
+{
+	jit_alloc_params.text.pgprot = PAGE_KERNEL;
+
+	return &jit_alloc_params;
+}
+#endif
+#endif
diff --git a/arch/parisc/kernel/module.c b/arch/parisc/kernel/module.c
index 49fdf741fd24..3cb0b2c72d85 100644
--- a/arch/parisc/kernel/module.c
+++ b/arch/parisc/kernel/module.c
@@ -49,7 +49,6 @@
 #include <linux/bug.h>
 #include <linux/mm.h>
 #include <linux/slab.h>
-#include <linux/jitalloc.h>
 
 #include <asm/unwind.h>
 #include <asm/sections.h>
@@ -174,22 +173,6 @@ static inline int reassemble_22(int as22)
 		((as22 & 0x0003ff) << 3));
 }
 
-static struct jit_alloc_params jit_alloc_params = {
-	.alignment	= 1,
-	/* using RWX means less protection for modules, but it's
-	 * easier than trying to map the text, data, init_text and
-	 * init_data correctly */
-	.text.pgprot	= PAGE_KERNEL_RWX,
-	.text.end	= VMALLOC_END,
-};
-
-struct jit_alloc_params *jit_alloc_arch_params(void)
-{
-	jit_alloc_params.text.start = VMALLOC_START;
-
-	return &jit_alloc_params;
-}
-
 #ifndef CONFIG_64BIT
 static inline unsigned long count_gots(const Elf_Rela *rela, unsigned long n)
 {
diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
index b0c43f3b0a5f..1601519486fa 100644
--- a/arch/parisc/mm/init.c
+++ b/arch/parisc/mm/init.c
@@ -24,6 +24,7 @@
 #include <linux/nodemask.h>	/* for node_online_map */
 #include <linux/pagemap.h>	/* for release_pages */
 #include <linux/compat.h>
+#include <linux/jitalloc.h>
 
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
@@ -479,7 +480,7 @@ void free_initmem(void)
 	/* finally dump all the instructions which were cached, since the
 	 * pages are no-longer executable */
 	flush_icache_range(init_begin, init_end);
-	
+
 	free_initmem_default(POISON_FREE_INITMEM);
 
 	/* set up a new led state on systems shipped LED State panel */
@@ -891,3 +892,21 @@ static const pgprot_t protection_map[16] = {
 	[VM_SHARED | VM_EXEC | VM_WRITE | VM_READ]	= PAGE_RWX
 };
 DECLARE_VM_GET_PAGE_PROT
+
+#ifdef CONFIG_JIT_ALLOC
+static struct jit_alloc_params jit_alloc_params = {
+	.alignment	= 1,
+	/* using RWX means less protection for modules, but it's
+	 * easier than trying to map the text, data, init_text and
+	 * init_data correctly */
+	.text.pgprot	= PAGE_KERNEL_RWX,
+	.text.end	= VMALLOC_END,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
+{
+	jit_alloc_params.text.start = VMALLOC_START;
+
+	return &jit_alloc_params;
+}
+#endif
diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index b58af61e90c0..b30e00964a60 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -12,7 +12,6 @@
 #include <linux/bug.h>
 #include <asm/module.h>
 #include <linux/uaccess.h>
-#include <linux/jitalloc.h>
 #include <asm/firmware.h>
 #include <linux/sort.h>
 #include <asm/setup.h>
@@ -89,41 +88,3 @@ int module_finalize(const Elf_Ehdr *hdr,
 
 	return 0;
 }
-
-static struct jit_alloc_params jit_alloc_params = {
-	.alignment	= 1,
-};
-
-struct jit_alloc_params *jit_alloc_arch_params(void)
-{
-	/*
-	 * BOOK3S_32 and 8xx define MODULES_VADDR for text allocations and
-	 * allow allocating data in the entire vmalloc space
-	 */
-#ifdef MODULES_VADDR
-	pgprot_t prot = strict_module_rwx_enabled() ? PAGE_KERNEL : PAGE_KERNEL_EXEC;
-	unsigned long limit = (unsigned long)_etext - SZ_32M;
-
-	jit_alloc_params.text.pgprot = prot;
-
-	/* First try within 32M limit from _etext to avoid branch trampolines */
-	if (MODULES_VADDR < PAGE_OFFSET && MODULES_END > limit) {
-		jit_alloc_params.text.start = limit;
-		jit_alloc_params.text.end = MODULES_END;
-		jit_alloc_params.text.fallback_start = MODULES_VADDR;
-		jit_alloc_params.text.fallback_end = MODULES_END;
-	} else {
-		jit_alloc_params.text.start = MODULES_VADDR;
-		jit_alloc_params.text.end = MODULES_END;
-	}
-
-	jit_alloc_params.data.pgprot	= PAGE_KERNEL;
-	jit_alloc_params.data.start	= VMALLOC_START;
-	jit_alloc_params.data.end	= VMALLOC_END;
-#else
-	jit_alloc_params.text.start = VMALLOC_START;
-	jit_alloc_params.text.end = VMALLOC_END;
-#endif
-
-	return &jit_alloc_params;
-}
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 8b121df7b08f..de970988119f 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -16,6 +16,7 @@
 #include <linux/highmem.h>
 #include <linux/suspend.h>
 #include <linux/dma-direct.h>
+#include <linux/jitalloc.h>
 
 #include <asm/swiotlb.h>
 #include <asm/machdep.h>
@@ -406,3 +407,43 @@ int devmem_is_allowed(unsigned long pfn)
  * the EHEA driver. Drop this when drivers/net/ethernet/ibm/ehea is removed.
  */
 EXPORT_SYMBOL_GPL(walk_system_ram_range);
+
+#ifdef CONFIG_JIT_ALLOC
+static struct jit_alloc_params jit_alloc_params = {
+	.alignment	= 1,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
+{
+	/*
+	 * BOOK3S_32 and 8xx define MODULES_VADDR for text allocations and
+	 * allow allocating data in the entire vmalloc space
+	 */
+#ifdef MODULES_VADDR
+	pgprot_t prot = strict_module_rwx_enabled() ? PAGE_KERNEL : PAGE_KERNEL_EXEC;
+	unsigned long limit = (unsigned long)_etext - SZ_32M;
+
+	jit_alloc_params.text.pgprot = prot;
+
+	/* First try within 32M limit from _etext to avoid branch trampolines */
+	if (MODULES_VADDR < PAGE_OFFSET && MODULES_END > limit) {
+		jit_alloc_params.text.start = limit;
+		jit_alloc_params.text.end = MODULES_END;
+		jit_alloc_params.text.fallback_start = MODULES_VADDR;
+		jit_alloc_params.text.fallback_end = MODULES_END;
+	} else {
+		jit_alloc_params.text.start = MODULES_VADDR;
+		jit_alloc_params.text.end = MODULES_END;
+	}
+
+	jit_alloc_params.data.pgprot	= PAGE_KERNEL;
+	jit_alloc_params.data.start	= VMALLOC_START;
+	jit_alloc_params.data.end	= VMALLOC_END;
+#else
+	jit_alloc_params.text.start = VMALLOC_START;
+	jit_alloc_params.text.end = VMALLOC_END;
+#endif
+
+	return &jit_alloc_params;
+}
+#endif
diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
index 731255654c94..8af08d5449bf 100644
--- a/arch/riscv/kernel/module.c
+++ b/arch/riscv/kernel/module.c
@@ -11,7 +11,6 @@
 #include <linux/vmalloc.h>
 #include <linux/sizes.h>
 #include <linux/pgtable.h>
-#include <linux/jitalloc.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
 
@@ -436,21 +435,6 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
 	return 0;
 }
 
-#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
-static struct jit_alloc_params jit_alloc_params = {
-	.alignment	= 1,
-	.text.pgprot	= PAGE_KERNEL,
-};
-
-struct jit_alloc_params *jit_alloc_arch_params(void)
-{
-	jit_alloc_params.text.start = MODULES_VADDR;
-	jit_alloc_params.text.end = MODULES_END;
-
-	return &jit_alloc_params;
-}
-#endif
-
 int module_finalize(const Elf_Ehdr *hdr,
 		    const Elf_Shdr *sechdrs,
 		    struct module *me)
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 747e5b1ef02d..5b87f83ef810 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -23,6 +23,7 @@
 #ifdef CONFIG_RELOCATABLE
 #include <linux/elf.h>
 #endif
+#include <linux/jitalloc.h>
 
 #include <asm/fixmap.h>
 #include <asm/tlbflush.h>
@@ -1363,3 +1364,20 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	return vmemmap_populate_basepages(start, end, node, NULL);
 }
 #endif
+
+#ifdef CONFIG_JIT_ALLOC
+#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
+static struct jit_alloc_params jit_alloc_params = {
+	.alignment	= 1,
+	.text.pgprot	= PAGE_KERNEL,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
+{
+	jit_alloc_params.text.start = MODULES_VADDR;
+	jit_alloc_params.text.end = MODULES_END;
+
+	return &jit_alloc_params;
+}
+#endif
+#endif
diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
index 3f85cf1e7c4e..0a4f4f32ef49 100644
--- a/arch/s390/kernel/module.c
+++ b/arch/s390/kernel/module.c
@@ -37,38 +37,6 @@
 
 #define PLT_ENTRY_SIZE 22
 
-static unsigned long get_module_load_offset(void)
-{
-	static DEFINE_MUTEX(module_kaslr_mutex);
-	static unsigned long module_load_offset;
-
-	if (!kaslr_enabled())
-		return 0;
-	/*
-	 * Calculate the module_load_offset the first time this code
-	 * is called. Once calculated it stays the same until reboot.
-	 */
-	mutex_lock(&module_kaslr_mutex);
-	if (!module_load_offset)
-		module_load_offset = get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
-	mutex_unlock(&module_kaslr_mutex);
-	return module_load_offset;
-}
-
-static struct jit_alloc_params jit_alloc_params = {
-	.alignment	= JIT_ALLOC_ALIGN,
-	.flags		= JIT_ALLOC_KASAN_SHADOW,
-	.text.pgprot	= PAGE_KERNEL,
-};
-
-struct jit_alloc_params *jit_alloc_arch_params(void)
-{
-	jit_alloc_params.text.start = MODULES_VADDR + get_module_load_offset();
-	jit_alloc_params.text.end = MODULES_END;
-
-	return &jit_alloc_params;
-}
-
 #ifdef CONFIG_FUNCTION_TRACER
 void module_arch_cleanup(struct module *mod)
 {
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 8d94e29adcdb..6e428e0f3215 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -50,6 +50,7 @@
 #include <asm/uv.h>
 #include <linux/virtio_anchor.h>
 #include <linux/virtio_config.h>
+#include <linux/jitalloc.h>
 
 pgd_t swapper_pg_dir[PTRS_PER_PGD] __section(".bss..swapper_pg_dir");
 pgd_t invalid_pg_dir[PTRS_PER_PGD] __section(".bss..invalid_pg_dir");
@@ -311,3 +312,37 @@ void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
 	vmem_remove_mapping(start, size);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
+
+#ifdef CONFIG_JIT_ALLOC
+static unsigned long get_module_load_offset(void)
+{
+	static DEFINE_MUTEX(module_kaslr_mutex);
+	static unsigned long module_load_offset;
+
+	if (!kaslr_enabled())
+		return 0;
+	/*
+	 * Calculate the module_load_offset the first time this code
+	 * is called. Once calculated it stays the same until reboot.
+	 */
+	mutex_lock(&module_kaslr_mutex);
+	if (!module_load_offset)
+		module_load_offset = get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
+	mutex_unlock(&module_kaslr_mutex);
+	return module_load_offset;
+}
+
+static struct jit_alloc_params jit_alloc_params = {
+	.alignment	= JIT_ALLOC_ALIGN,
+	.flags		= JIT_ALLOC_KASAN_SHADOW,
+	.text.pgprot	= PAGE_KERNEL,
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
+{
+	jit_alloc_params.text.start = MODULES_VADDR + get_module_load_offset();
+	jit_alloc_params.text.end = MODULES_END;
+
+	return &jit_alloc_params;
+}
+#endif
diff --git a/arch/sparc/kernel/module.c b/arch/sparc/kernel/module.c
index 03f0de693b4d..9edbd0372add 100644
--- a/arch/sparc/kernel/module.c
+++ b/arch/sparc/kernel/module.c
@@ -14,7 +14,6 @@
 #include <linux/string.h>
 #include <linux/ctype.h>
 #include <linux/mm.h>
-#include <linux/jitalloc.h>
 
 #ifdef CONFIG_SPARC64
 #include <linux/jump_label.h>
@@ -26,24 +25,6 @@
 
 #include "entry.h"
 
-static struct jit_alloc_params jit_alloc_params = {
-	.alignment	= 1,
-#ifdef CONFIG_SPARC64
-	.text.start	= MODULES_VADDR,
-	.text.end	= MODULES_END,
-#else
-	.text.start	= VMALLOC_START,
-	.text.end	= VMALLOC_END,
-#endif
-};
-
-struct jit_alloc_params *jit_alloc_arch_params(void)
-{
-	jit_alloc_params.text.pgprot = PAGE_KERNEL;
-
-	return &jit_alloc_params;
-}
-
 /* Make generic code ignore STT_REGISTER dummy undefined symbols.  */
 int module_frob_arch_sections(Elf_Ehdr *hdr,
 			      Elf_Shdr *sechdrs,
diff --git a/arch/sparc/mm/Makefile b/arch/sparc/mm/Makefile
index 871354aa3c00..95ede0fd851a 100644
--- a/arch/sparc/mm/Makefile
+++ b/arch/sparc/mm/Makefile
@@ -15,3 +15,5 @@ obj-$(CONFIG_SPARC32)   += leon_mm.o
 
 # Only used by sparc64
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
+
+obj-$(CONFIG_JIT_ALLOC) += jitalloc.o
diff --git a/arch/sparc/mm/jitalloc.c b/arch/sparc/mm/jitalloc.c
new file mode 100644
index 000000000000..6b407a8e85ef
--- /dev/null
+++ b/arch/sparc/mm/jitalloc.c
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/jitalloc.h>
+
+static struct jit_alloc_params jit_alloc_params = {
+	.alignment	= 1,
+#ifdef CONFIG_SPARC64
+	.text.start	= MODULES_VADDR,
+	.text.end	= MODULES_END,
+#else
+	.text.start	= VMALLOC_START,
+	.text.end	= VMALLOC_END,
+#endif
+};
+
+struct jit_alloc_params *jit_alloc_arch_params(void)
+{
+	jit_alloc_params.text.pgprot = PAGE_KERNEL;
+
+	return &jit_alloc_params;
+}
-- 
2.35.1


* [PATCH 09/13] kprobes: remove dependency on CONFIG_MODULES
  2023-06-01 10:12 [PATCH 00/13] mm: jit/text allocator Mike Rapoport
                   ` (7 preceding siblings ...)
  2023-06-01 10:12 ` [PATCH 08/13] arch: make jitalloc setup available regardless of CONFIG_MODULES Mike Rapoport
@ 2023-06-01 10:12 ` Mike Rapoport
  2023-06-01 10:12 ` [PATCH 10/13] modules, jitalloc: prepare to allocate executable memory as ROX Mike Rapoport
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 55+ messages in thread
From: Mike Rapoport @ 2023-06-01 10:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy, David S. Miller,
	Dinh Nguyen, Heiko Carstens, Helge Deller, Huacai Chen,
	Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Mike Rapoport, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

kprobes depends on CONFIG_MODULES because it has to allocate memory for
code.

Since code allocations are now implemented with jitalloc, kprobes can be
enabled in non-modular kernels.

Add #ifdef CONFIG_MODULES guards for the code dealing with kprobes inside
modules, make CONFIG_KPROBES select CONFIG_JIT_ALLOC and drop the
dependency of CONFIG_KPROBES on CONFIG_MODULES.
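
The guard-plus-stub pattern that the trace_kprobe.c hunk below applies to
trace_kprobe_module_exist() can be tried outside the kernel. What follows is
only a minimal userspace sketch: DEMO_MODULES is a hypothetical stand-in for
CONFIG_MODULES, and demo_module_exist(), demo_loaded[] and
demo_probe_target_ok() are illustrative names that do not appear in the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for CONFIG_MODULES. Leave it undefined to model a
 * non-modular kernel; define it to build the real lookup instead.
 */
/* #define DEMO_MODULES */

#ifdef DEMO_MODULES
/* Illustrative list of "loaded modules"; not taken from the patch. */
static const char *demo_loaded[] = { "ext4", "nf_tables" };

static bool demo_module_exist(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(demo_loaded) / sizeof(demo_loaded[0]); i++)
		if (strcmp(demo_loaded[i], name) == 0)
			return true;
	return false;
}
#else
/* Stub: without module support no module can ever be found. */
static inline bool demo_module_exist(const char *name)
{
	(void)name;
	return false;
}
#endif

/* The caller needs no #ifdef; it compiles the same in both configurations. */
static int demo_probe_target_ok(const char *module_name)
{
	if (!module_name)
		return 1;	/* target is in the core image */
	return demo_module_exist(module_name);
}
```

Keeping the #ifdef around the helper and its stub, instead of around every
call site, is what lets the callers stay unchanged in both configurations.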

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 arch/Kconfig                |  2 +-
 kernel/kprobes.c            | 43 +++++++++++++++++++++----------------
 kernel/trace/trace_kprobe.c | 11 ++++++++++
 3 files changed, 37 insertions(+), 19 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 205fd23e0cad..479a7b8be191 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -39,9 +39,9 @@ config GENERIC_ENTRY
 
 config KPROBES
 	bool "Kprobes"
-	depends on MODULES
 	depends on HAVE_KPROBES
 	select KALLSYMS
+	select JIT_ALLOC
 	select TASKS_RCU if PREEMPTION
 	help
 	  Kprobes allows you to trap at almost any kernel address and
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 3caf3561c048..11c1cfbb11ae 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1568,6 +1568,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
 		goto out;
 	}
 
+#ifdef CONFIG_MODULES
 	/* Check if 'p' is probing a module. */
 	*probed_mod = __module_text_address((unsigned long) p->addr);
 	if (*probed_mod) {
@@ -1591,6 +1592,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
 			ret = -ENOENT;
 		}
 	}
+#endif
+
 out:
 	preempt_enable();
 	jump_label_unlock();
@@ -2484,24 +2487,6 @@ int kprobe_add_area_blacklist(unsigned long start, unsigned long end)
 	return 0;
 }
 
-/* Remove all symbols in given area from kprobe blacklist */
-static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
-{
-	struct kprobe_blacklist_entry *ent, *n;
-
-	list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
-		if (ent->start_addr < start || ent->start_addr >= end)
-			continue;
-		list_del(&ent->list);
-		kfree(ent);
-	}
-}
-
-static void kprobe_remove_ksym_blacklist(unsigned long entry)
-{
-	kprobe_remove_area_blacklist(entry, entry + 1);
-}
-
 int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value,
 				   char *type, char *sym)
 {
@@ -2566,6 +2551,25 @@ static int __init populate_kprobe_blacklist(unsigned long *start,
 	return ret ? : arch_populate_kprobe_blacklist();
 }
 
+#ifdef CONFIG_MODULES
+/* Remove all symbols in given area from kprobe blacklist */
+static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end)
+{
+	struct kprobe_blacklist_entry *ent, *n;
+
+	list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
+		if (ent->start_addr < start || ent->start_addr >= end)
+			continue;
+		list_del(&ent->list);
+		kfree(ent);
+	}
+}
+
+static void kprobe_remove_ksym_blacklist(unsigned long entry)
+{
+	kprobe_remove_area_blacklist(entry, entry + 1);
+}
+
 static void add_module_kprobe_blacklist(struct module *mod)
 {
 	unsigned long start, end;
@@ -2667,6 +2671,7 @@ static struct notifier_block kprobe_module_nb = {
 	.notifier_call = kprobes_module_callback,
 	.priority = 0
 };
+#endif
 
 void kprobe_free_init_mem(void)
 {
@@ -2726,8 +2731,10 @@ static int __init init_kprobes(void)
 	err = arch_init_kprobes();
 	if (!err)
 		err = register_die_notifier(&kprobe_exceptions_nb);
+#ifdef CONFIG_MODULES
 	if (!err)
 		err = register_module_notifier(&kprobe_module_nb);
+#endif
 
 	kprobes_initialized = (err == 0);
 	kprobe_sysctls_init();
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 59cda19a9033..cf804e372554 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -111,6 +111,7 @@ static nokprobe_inline bool trace_kprobe_within_module(struct trace_kprobe *tk,
 	return strncmp(module_name(mod), name, len) == 0 && name[len] == ':';
 }
 
+#ifdef CONFIG_MODULES
 static nokprobe_inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
 {
 	char *p;
@@ -129,6 +130,12 @@ static nokprobe_inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
 
 	return ret;
 }
+#else
+static inline bool trace_kprobe_module_exist(struct trace_kprobe *tk)
+{
+	return false;
+}
+#endif
 
 static bool trace_kprobe_is_busy(struct dyn_event *ev)
 {
@@ -670,6 +677,7 @@ static int register_trace_kprobe(struct trace_kprobe *tk)
 	return ret;
 }
 
+#ifdef CONFIG_MODULES
 /* Module notifier call back, checking event on the module */
 static int trace_kprobe_module_callback(struct notifier_block *nb,
 				       unsigned long val, void *data)
@@ -704,6 +712,7 @@ static struct notifier_block trace_kprobe_module_nb = {
 	.notifier_call = trace_kprobe_module_callback,
 	.priority = 1	/* Invoked after kprobe module callback */
 };
+#endif
 
 static int __trace_kprobe_create(int argc, const char *argv[])
 {
@@ -1797,8 +1806,10 @@ static __init int init_kprobe_trace_early(void)
 	if (ret)
 		return ret;
 
+#ifdef CONFIG_MODULES
 	if (register_module_notifier(&trace_kprobe_module_nb))
 		return -EINVAL;
+#endif
 
 	return 0;
 }
-- 
2.35.1


* [PATCH 10/13] modules, jitalloc: prepare to allocate executable memory as ROX
  2023-06-01 10:12 [PATCH 00/13] mm: jit/text allocator Mike Rapoport
                   ` (8 preceding siblings ...)
  2023-06-01 10:12 ` [PATCH 09/13] kprobes: remove dependency on CONFIG_MODULES Mike Rapoport
@ 2023-06-01 10:12 ` Mike Rapoport
  2023-06-01 10:12 ` [PATCH 11/13] ftrace: Add swap_func to ftrace_process_locs() Mike Rapoport
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 55+ messages in thread
From: Mike Rapoport @ 2023-06-01 10:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy, David S. Miller,
	Dinh Nguyen, Heiko Carstens, Helge Deller, Huacai Chen,
	Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Mike Rapoport, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

Once executable memory is allocated as ROX, it will no longer be possible
to update it using memset() and memcpy().

Introduce jit_update_copy() and jit_update_set() APIs and use them in
the module loading code instead of memcpy() and memset().
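
To make the intended call pattern concrete, here is a minimal userspace
sketch. It assumes the same fallback this patch uses in mm/jitalloc.c, where
the update helpers simply forward to memcpy() and memset(). All demo_* names
are hypothetical, and malloc() only stands in for the real text allocator.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Backend hooks. Here they are plain libc calls, as in the generic fallback
 * of this patch; an architecture with ROX text would poke memory instead.
 */
static inline void demo_text_poke_copy(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
}

static inline void demo_text_poke_set(void *addr, int c, size_t len)
{
	memset(addr, c, len);
}

/* Public update API: callers never touch the text buffer directly. */
static void demo_update_copy(void *buf, const void *new_buf, size_t len)
{
	demo_text_poke_copy(buf, new_buf, len);
}

static void demo_update_set(void *buf, int c, size_t len)
{
	demo_text_poke_set(buf, c, len);
}

/* Allocate "text" and zero it through the update API. */
static void *demo_text_alloc(size_t len)
{
	void *p = malloc(len);

	if (p)
		demo_update_set(p, 0, len);
	return p;
}

/* Copy a section image into fresh text; returns 1 if the copy matches. */
static int demo_load_section(const unsigned char *image, size_t len)
{
	unsigned char *text = demo_text_alloc(len);
	int ok;

	if (!text)
		return 0;
	demo_update_copy(text, image, len);
	ok = memcmp(text, image, len) == 0;
	free(text);
	return ok;
}
```

Because every write goes through the two update helpers, only their backend
has to change once the buffer is no longer writable.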

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 include/linux/jitalloc.h |  2 ++
 kernel/module/main.c     | 19 ++++++++++++++-----
 mm/jitalloc.c            | 20 ++++++++++++++++++++
 3 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h
index 7f8cafb3cfe9..0ba5ef785a85 100644
--- a/include/linux/jitalloc.h
+++ b/include/linux/jitalloc.h
@@ -55,6 +55,8 @@ struct jit_alloc_params *jit_alloc_arch_params(void);
 void jit_free(void *buf);
 void *jit_text_alloc(size_t len);
 void *jit_data_alloc(size_t len);
+void jit_update_copy(void *buf, void *new_buf, size_t len);
+void jit_update_set(void *buf, int c, size_t len);
 
 #ifdef CONFIG_JIT_ALLOC
 void jit_alloc_init(void);
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 91477aa5f671..9f0711c42aa2 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1197,9 +1197,19 @@ void __weak module_arch_freeing_init(struct module *mod)
 
 static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
 {
-	if (mod_mem_type_is_data(type))
-		return jit_data_alloc(size);
-	return jit_text_alloc(size);
+	void *p;
+
+	if (mod_mem_type_is_data(type)) {
+		p = jit_data_alloc(size);
+		if (p)
+			memset(p, 0, size);
+	} else {
+		p = jit_text_alloc(size);
+		if (p)
+			jit_update_set(p, 0, size);
+	}
+
+	return p;
 }
 
 static void module_memory_free(void *ptr, enum mod_mem_type type)
@@ -2223,7 +2233,6 @@ static int move_module(struct module *mod, struct load_info *info)
 			t = type;
 			goto out_enomem;
 		}
-		memset(ptr, 0, mod->mem[type].size);
 		mod->mem[type].base = ptr;
 	}
 
@@ -2251,7 +2260,7 @@ static int move_module(struct module *mod, struct load_info *info)
 				ret = -ENOEXEC;
 				goto out_enomem;
 			}
-			memcpy(dest, (void *)shdr->sh_addr, shdr->sh_size);
+			jit_update_copy(dest, (void *)shdr->sh_addr, shdr->sh_size);
 		}
 		/*
 		 * Update the userspace copy's ELF section address to point to
diff --git a/mm/jitalloc.c b/mm/jitalloc.c
index 16fd715d501a..a8ae64364d56 100644
--- a/mm/jitalloc.c
+++ b/mm/jitalloc.c
@@ -7,6 +7,16 @@
 
 static struct jit_alloc_params jit_alloc_params;
 
+static inline void jit_text_poke_copy(void *dst, const void *src, size_t len)
+{
+	memcpy(dst, src, len);
+}
+
+static inline void jit_text_poke_set(void *addr, int c, size_t len)
+{
+	memset(addr, c, len);
+}
+
 static void *jit_alloc(size_t len, unsigned int alignment, pgprot_t pgprot,
 		       unsigned long start, unsigned long end,
 		       unsigned long fallback_start, unsigned long fallback_end,
@@ -86,6 +96,16 @@ void *jit_data_alloc(size_t len)
 			 fallback_start, fallback_end, kasan);
 }
 
+void jit_update_copy(void *buf, void *new_buf, size_t len)
+{
+	jit_text_poke_copy(buf, new_buf, len);
+}
+
+void jit_update_set(void *addr, int c, size_t len)
+{
+	jit_text_poke_set(addr, c, len);
+}
+
 struct jit_alloc_params * __weak jit_alloc_arch_params(void)
 {
 	return NULL;
-- 
2.35.1


* [PATCH 11/13] ftrace: Add swap_func to ftrace_process_locs()
  2023-06-01 10:12 [PATCH 00/13] mm: jit/text allocator Mike Rapoport
                   ` (9 preceding siblings ...)
  2023-06-01 10:12 ` [PATCH 10/13] modules, jitalloc: prepare to allocate executable memory as ROX Mike Rapoport
@ 2023-06-01 10:12 ` Mike Rapoport
  2023-06-01 10:12 ` [PATCH 12/13] x86/jitalloc: prepare to allocate executable memory as ROX Mike Rapoport
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 55+ messages in thread
From: Mike Rapoport @ 2023-06-01 10:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy, David S. Miller,
	Dinh Nguyen, Heiko Carstens, Helge Deller, Huacai Chen,
	Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Mike Rapoport, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: Song Liu <song@kernel.org>

ftrace_process_locs() sorts the mcount locations of a module, and these
reside in RO memory. Add ftrace_swap_func() so that architectures can use
a function that pokes RO memory to do the sorting.
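
The idea of routing every write of a sort through a swap callback can be
sketched in userspace. This is an illustration under assumptions, not the
kernel code: demo_sort() is a simple bubble sort rather than the kernel's
sort(), and demo_poke_ulong() is a hypothetical stand-in for an architecture
text-poking primitive that here only counts its calls.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*demo_cmp_fn)(const void *a, const void *b);
typedef void (*demo_swap_fn)(void *a, void *b, int size);

/* Number of writes that went through the poke stand-in. */
static unsigned long demo_poke_count;

/* Hypothetical stand-in for an arch primitive that writes to RO memory. */
static void demo_poke_ulong(unsigned long *dst, unsigned long val)
{
	*dst = val;
	demo_poke_count++;
}

/* Swap two entries; every store goes through demo_poke_ulong(). */
static void demo_swap_func(void *a, void *b, int n)
{
	unsigned long t;

	assert(n == (int)sizeof(t));

	t = *(unsigned long *)a;
	demo_poke_ulong(a, *(unsigned long *)b);
	demo_poke_ulong(b, t);
}

static int demo_cmp_ips(const void *a, const void *b)
{
	unsigned long x = *(const unsigned long *)a;
	unsigned long y = *(const unsigned long *)b;

	return (x > y) - (x < y);
}

/* Bubble sort that never stores to the array itself, only via swap(). */
static void demo_sort(void *base, size_t num, size_t size,
		      demo_cmp_fn cmp, demo_swap_fn swap)
{
	char *p = base;
	size_t i, j;

	for (i = 0; i + 1 < num; i++)
		for (j = 0; j + 1 < num - i; j++)
			if (cmp(p + j * size, p + (j + 1) * size) > 0)
				swap(p + j * size, p + (j + 1) * size,
				     (int)size);
}
```

Since the sort routine itself performs no stores, swapping in a different
callback is enough to change how the array is written.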

Signed-off-by: Song Liu <song@kernel.org>
---
 include/linux/ftrace.h |  2 ++
 kernel/trace/ftrace.c  | 13 ++++++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index b23bdd414394..fe443b8ed32c 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -1166,4 +1166,6 @@ unsigned long arch_syscall_addr(int nr);
 
 #endif /* CONFIG_FTRACE_SYSCALLS */
 
+void ftrace_swap_func(void *a, void *b, int n);
+
 #endif /* _LINUX_FTRACE_H */
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 764668467155..f5ddc9d4cfb6 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -6430,6 +6430,17 @@ static void test_is_sorted(unsigned long *start, unsigned long count)
 }
 #endif
 
+void __weak ftrace_swap_func(void *a, void *b, int n)
+{
+	unsigned long t;
+
+	WARN_ON_ONCE(n != sizeof(t));
+
+	t = *((unsigned long *)a);
+	*(unsigned long *)a = *(unsigned long *)b;
+	*(unsigned long *)b = t;
+}
+
 static int ftrace_process_locs(struct module *mod,
 			       unsigned long *start,
 			       unsigned long *end)
@@ -6455,7 +6466,7 @@ static int ftrace_process_locs(struct module *mod,
 	 */
 	if (!IS_ENABLED(CONFIG_BUILDTIME_MCOUNT_SORT) || mod) {
 		sort(start, count, sizeof(*start),
-		     ftrace_cmp_ips, NULL);
+		     ftrace_cmp_ips, ftrace_swap_func);
 	} else {
 		test_is_sorted(start, count);
 	}
-- 
2.35.1


* [PATCH 12/13] x86/jitalloc: prepare to allocate executable memory as ROX
  2023-06-01 10:12 [PATCH 00/13] mm: jit/text allocator Mike Rapoport
                   ` (10 preceding siblings ...)
  2023-06-01 10:12 ` [PATCH 11/13] ftrace: Add swap_func to ftrace_process_locs() Mike Rapoport
@ 2023-06-01 10:12 ` Mike Rapoport
  2023-06-01 10:30   ` Peter Zijlstra
                     ` (2 more replies)
  2023-06-01 10:12 ` [PATCH 13/13] x86/jitalloc: make memory allocated for code ROX Mike Rapoport
                   ` (2 subsequent siblings)
  14 siblings, 3 replies; 55+ messages in thread
From: Mike Rapoport @ 2023-06-01 10:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy, David S. Miller,
	Dinh Nguyen, Heiko Carstens, Helge Deller, Huacai Chen,
	Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Mike Rapoport, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: Song Liu <song@kernel.org>

Replace direct memory writes to memory allocated for code with text poking
to allow allocation of executable memory as ROX.

The only exception is arch_prepare_bpf_trampoline(), which cannot JIT
directly into module memory yet, so it uses set_memory calls to
unprotect the memory before writing to it and to protect the memory again
at the end.
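
The unprotect, write, reprotect sequence only works if every exit path passes
through the reprotect step, which is why the bpf_jit_comp.c hunk below turns
early returns into jumps to a common label. A minimal userspace sketch of
that control flow follows. Page protection is modelled by a plain enum field,
and all demo_* names are hypothetical and not part of the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

enum demo_prot { DEMO_ROX, DEMO_RW };

/* Model of one image page: a protection state plus its bytes. */
struct demo_image {
	enum demo_prot prot;
	unsigned char bytes[64];
};

static void demo_set_memory_rw(struct demo_image *im)
{
	im->prot = DEMO_RW;
}

static void demo_set_memory_rox(struct demo_image *im)
{
	im->prot = DEMO_ROX;
}

/* Write into the image; fails unless it is writable and the range fits. */
static int demo_emit(struct demo_image *im, size_t off,
		     const void *insn, size_t len)
{
	if (im->prot != DEMO_RW)
		return -EFAULT;
	if (off > sizeof(im->bytes) || len > sizeof(im->bytes) - off)
		return -EINVAL;
	memcpy(im->bytes + off, insn, len);
	return 0;
}

static int demo_prepare_trampoline(struct demo_image *im,
				   const void *body, size_t len)
{
	static const unsigned char endbr[] = { 0xf3, 0x0f, 0x1e, 0xfa };
	int ret;

	demo_set_memory_rw(im);

	ret = demo_emit(im, 0, endbr, sizeof(endbr));
	if (ret)
		goto reprotect_memory;	/* no early return while unprotected */

	ret = demo_emit(im, sizeof(endbr), body, len);

reprotect_memory:
	/* Reached on success and on every failure path. */
	demo_set_memory_rox(im);
	return ret;
}
```

With a single exit label the image is read-only again whether or not the
emit steps succeed, so a failure cannot leave the page writable.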

Signed-off-by: Song Liu <song@kernel.org>
Co-developed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 arch/x86/kernel/alternative.c | 43 +++++++++++++++++++++++------------
 arch/x86/kernel/ftrace.c      | 41 +++++++++++++++++++++------------
 arch/x86/kernel/module.c      | 24 +++++--------------
 arch/x86/kernel/static_call.c | 10 ++++----
 arch/x86/kernel/unwind_orc.c  | 13 +++++++----
 arch/x86/net/bpf_jit_comp.c   | 22 +++++++++++++-----
 6 files changed, 91 insertions(+), 62 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index f615e0cb6d93..91057de8e6bc 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -18,6 +18,7 @@
 #include <linux/mmu_context.h>
 #include <linux/bsearch.h>
 #include <linux/sync_core.h>
+#include <linux/set_memory.h>
 #include <asm/text-patching.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
@@ -76,6 +77,19 @@ do {									\
 	}								\
 } while (0)
 
+void text_poke_early(void *addr, const void *opcode, size_t len);
+
+static void __init_or_module do_text_poke(void *addr, const void *opcode, size_t len)
+{
+	if (system_state < SYSTEM_RUNNING) {
+		text_poke_early(addr, opcode, len);
+	} else {
+		mutex_lock(&text_mutex);
+		text_poke(addr, opcode, len);
+		mutex_unlock(&text_mutex);
+	}
+}
+
 static const unsigned char x86nops[] =
 {
 	BYTES_NOP1,
@@ -108,7 +122,7 @@ static void __init_or_module add_nops(void *insns, unsigned int len)
 		unsigned int noplen = len;
 		if (noplen > ASM_NOP_MAX)
 			noplen = ASM_NOP_MAX;
-		memcpy(insns, x86_nops[noplen], noplen);
+		do_text_poke(insns, x86_nops[noplen], noplen);
 		insns += noplen;
 		len -= noplen;
 	}
@@ -120,7 +134,6 @@ extern s32 __cfi_sites[], __cfi_sites_end[];
 extern s32 __ibt_endbr_seal[], __ibt_endbr_seal_end[];
 extern struct alt_instr __alt_instructions[], __alt_instructions_end[];
 extern s32 __smp_locks[], __smp_locks_end[];
-void text_poke_early(void *addr, const void *opcode, size_t len);
 
 /*
  * Are we looking at a near JMP with a 1 or 4-byte displacement.
@@ -331,7 +344,7 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
 
 		DUMP_BYTES(insn_buff, insn_buff_sz, "%px: final_insn: ", instr);
 
-		text_poke_early(instr, insn_buff, insn_buff_sz);
+		do_text_poke(instr, insn_buff, insn_buff_sz);
 
 next:
 		optimize_nops(instr, a->instrlen);
@@ -564,7 +577,7 @@ void __init_or_module noinline apply_retpolines(s32 *start, s32 *end)
 			optimize_nops(bytes, len);
 			DUMP_BYTES(((u8*)addr),  len, "%px: orig: ", addr);
 			DUMP_BYTES(((u8*)bytes), len, "%px: repl: ", addr);
-			text_poke_early(addr, bytes, len);
+			do_text_poke(addr, bytes, len);
 		}
 	}
 }
@@ -638,7 +651,7 @@ void __init_or_module noinline apply_returns(s32 *start, s32 *end)
 		if (len == insn.length) {
 			DUMP_BYTES(((u8*)addr),  len, "%px: orig: ", addr);
 			DUMP_BYTES(((u8*)bytes), len, "%px: repl: ", addr);
-			text_poke_early(addr, bytes, len);
+			do_text_poke(addr, bytes, len);
 		}
 	}
 }
@@ -674,7 +687,7 @@ static void poison_endbr(void *addr, bool warn)
 	 */
 	DUMP_BYTES(((u8*)addr), 4, "%px: orig: ", addr);
 	DUMP_BYTES(((u8*)&poison), 4, "%px: repl: ", addr);
-	text_poke_early(addr, &poison, 4);
+	do_text_poke(addr, &poison, 4);
 }
 
 /*
@@ -869,7 +882,7 @@ static int cfi_disable_callers(s32 *start, s32 *end)
 		if (!hash) /* nocfi callers */
 			continue;
 
-		text_poke_early(addr, jmp, 2);
+		do_text_poke(addr, jmp, 2);
 	}
 
 	return 0;
@@ -892,7 +905,7 @@ static int cfi_enable_callers(s32 *start, s32 *end)
 		if (!hash) /* nocfi callers */
 			continue;
 
-		text_poke_early(addr, mov, 2);
+		do_text_poke(addr, mov, 2);
 	}
 
 	return 0;
@@ -913,7 +926,7 @@ static int cfi_rand_preamble(s32 *start, s32 *end)
 			return -EINVAL;
 
 		hash = cfi_rehash(hash);
-		text_poke_early(addr + 1, &hash, 4);
+		do_text_poke(addr + 1, &hash, 4);
 	}
 
 	return 0;
@@ -932,9 +945,9 @@ static int cfi_rewrite_preamble(s32 *start, s32 *end)
 			 addr, addr, 5, addr))
 			return -EINVAL;
 
-		text_poke_early(addr, fineibt_preamble_start, fineibt_preamble_size);
+		do_text_poke(addr, fineibt_preamble_start, fineibt_preamble_size);
 		WARN_ON(*(u32 *)(addr + fineibt_preamble_hash) != 0x12345678);
-		text_poke_early(addr + fineibt_preamble_hash, &hash, 4);
+		do_text_poke(addr + fineibt_preamble_hash, &hash, 4);
 	}
 
 	return 0;
@@ -953,7 +966,7 @@ static int cfi_rand_callers(s32 *start, s32 *end)
 		hash = decode_caller_hash(addr);
 		if (hash) {
 			hash = -cfi_rehash(hash);
-			text_poke_early(addr + 2, &hash, 4);
+			do_text_poke(addr + 2, &hash, 4);
 		}
 	}
 
@@ -971,9 +984,9 @@ static int cfi_rewrite_callers(s32 *start, s32 *end)
 		addr -= fineibt_caller_size;
 		hash = decode_caller_hash(addr);
 		if (hash) {
-			text_poke_early(addr, fineibt_caller_start, fineibt_caller_size);
+			do_text_poke(addr, fineibt_caller_start, fineibt_caller_size);
 			WARN_ON(*(u32 *)(addr + fineibt_caller_hash) != 0x12345678);
-			text_poke_early(addr + fineibt_caller_hash, &hash, 4);
+			do_text_poke(addr + fineibt_caller_hash, &hash, 4);
 		}
 		/* rely on apply_retpolines() */
 	}
@@ -1243,7 +1256,7 @@ void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
 
 		/* Pad the rest with nops */
 		add_nops(insn_buff + used, p->len - used);
-		text_poke_early(p->instr, insn_buff, p->len);
+		do_text_poke(p->instr, insn_buff, p->len);
 	}
 }
 extern struct paravirt_patch_site __start_parainstructions[],
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index aa99536b824c..d50595f2c1a6 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -118,10 +118,13 @@ ftrace_modify_code_direct(unsigned long ip, const char *old_code,
 		return ret;
 
 	/* replace the text with the new text */
-	if (ftrace_poke_late)
+	if (ftrace_poke_late) {
 		text_poke_queue((void *)ip, new_code, MCOUNT_INSN_SIZE, NULL);
-	else
-		text_poke_early((void *)ip, new_code, MCOUNT_INSN_SIZE);
+	} else {
+		mutex_lock(&text_mutex);
+		text_poke((void *)ip, new_code, MCOUNT_INSN_SIZE);
+		mutex_unlock(&text_mutex);
+	}
 	return 0;
 }
 
@@ -319,7 +322,7 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 	unsigned const char op_ref[] = { 0x48, 0x8b, 0x15 };
 	unsigned const char retq[] = { RET_INSN_OPCODE, INT3_INSN_OPCODE };
 	union ftrace_op_code_union op_ptr;
-	int ret;
+	void *ret;
 
 	if (ops->flags & FTRACE_OPS_FL_SAVE_REGS) {
 		start_offset = (unsigned long)ftrace_regs_caller;
@@ -350,15 +353,15 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 	npages = DIV_ROUND_UP(*tramp_size, PAGE_SIZE);
 
 	/* Copy ftrace_caller onto the trampoline memory */
-	ret = copy_from_kernel_nofault(trampoline, (void *)start_offset, size);
-	if (WARN_ON(ret < 0))
+	ret = text_poke_copy(trampoline, (void *)start_offset, size);
+	if (WARN_ON(!ret))
 		goto fail;
 
 	ip = trampoline + size;
 	if (cpu_feature_enabled(X86_FEATURE_RETHUNK))
 		__text_gen_insn(ip, JMP32_INSN_OPCODE, ip, x86_return_thunk, JMP32_INSN_SIZE);
 	else
-		memcpy(ip, retq, sizeof(retq));
+		text_poke_copy(ip, retq, sizeof(retq));
 
 	/* No need to test direct calls on created trampolines */
 	if (ops->flags & FTRACE_OPS_FL_SAVE_REGS) {
@@ -366,8 +369,7 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 		ip = trampoline + (jmp_offset - start_offset);
 		if (WARN_ON(*(char *)ip != 0x75))
 			goto fail;
-		ret = copy_from_kernel_nofault(ip, x86_nops[2], 2);
-		if (ret < 0)
+		if (!text_poke_copy(ip, x86_nops[2], 2))
 			goto fail;
 	}
 
@@ -380,7 +382,7 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 	 */
 
 	ptr = (unsigned long *)(trampoline + size + RET_SIZE);
-	*ptr = (unsigned long)ops;
+	text_poke_copy(ptr, &ops, sizeof(unsigned long));
 
 	op_offset -= start_offset;
 	memcpy(&op_ptr, trampoline + op_offset, OP_REF_SIZE);
@@ -396,7 +398,7 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 	op_ptr.offset = offset;
 
 	/* put in the new offset to the ftrace_ops */
-	memcpy(trampoline + op_offset, &op_ptr, OP_REF_SIZE);
+	text_poke_copy(trampoline + op_offset, &op_ptr, OP_REF_SIZE);
 
 	/* put in the call to the function */
 	mutex_lock(&text_mutex);
@@ -406,9 +408,9 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 	 * the depth accounting before the call already.
 	 */
 	dest = ftrace_ops_get_func(ops);
-	memcpy(trampoline + call_offset,
-	       text_gen_insn(CALL_INSN_OPCODE, trampoline + call_offset, dest),
-	       CALL_INSN_SIZE);
+	text_poke_copy_locked(trampoline + call_offset,
+	      text_gen_insn(CALL_INSN_OPCODE, trampoline + call_offset, dest),
+	      CALL_INSN_SIZE, false);
 	mutex_unlock(&text_mutex);
 
 	/* ALLOC_TRAMP flags lets us know we created it */
@@ -658,4 +660,15 @@ void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
 }
 #endif
 
+void ftrace_swap_func(void *a, void *b, int n)
+{
+	unsigned long t;
+
+	WARN_ON_ONCE(n != sizeof(t));
+
+	t = *((unsigned long *)a);
+	text_poke_copy(a, b, sizeof(t));
+	text_poke_copy(b, &t, sizeof(t));
+}
+
 #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index 94a00dc103cd..444bc76574b9 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -83,7 +83,6 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
 		   unsigned int symindex,
 		   unsigned int relsec,
 		   struct module *me,
-		   void *(*write)(void *dest, const void *src, size_t len),
 		   bool apply)
 {
 	unsigned int i;
@@ -151,14 +150,14 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
 				       (int)ELF64_R_TYPE(rel[i].r_info), loc, val);
 				return -ENOEXEC;
 			}
-			write(loc, &val, size);
+			text_poke(loc, &val, size);
 		} else {
 			if (memcmp(loc, &val, size)) {
 				pr_warn("x86/modules: Invalid relocation target, existing value does not match expected value for type %d, loc %p, val %Lx\n",
 					(int)ELF64_R_TYPE(rel[i].r_info), loc, val);
 				return -ENOEXEC;
 			}
-			write(loc, &zero, size);
+			text_poke(loc, &zero, size);
 		}
 	}
 	return 0;
@@ -179,22 +178,11 @@ static int write_relocate_add(Elf64_Shdr *sechdrs,
 			      bool apply)
 {
 	int ret;
-	bool early = me->state == MODULE_STATE_UNFORMED;
-	void *(*write)(void *, const void *, size_t) = memcpy;
-
-	if (!early) {
-		write = text_poke;
-		mutex_lock(&text_mutex);
-	}
-
-	ret = __write_relocate_add(sechdrs, strtab, symindex, relsec, me,
-				   write, apply);
-
-	if (!early) {
-		text_poke_sync();
-		mutex_unlock(&text_mutex);
-	}
 
+	mutex_lock(&text_mutex);
+	ret = __write_relocate_add(sechdrs, strtab, symindex, relsec, me, apply);
+	text_poke_sync();
+	mutex_unlock(&text_mutex);
 	return ret;
 }
 
diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
index b70670a98597..90aacef21dfa 100644
--- a/arch/x86/kernel/static_call.c
+++ b/arch/x86/kernel/static_call.c
@@ -51,7 +51,7 @@ asm (".global __static_call_return\n\t"
      ".size __static_call_return, . - __static_call_return \n\t");
 
 static void __ref __static_call_transform(void *insn, enum insn_type type,
-					  void *func, bool modinit)
+					  void *func)
 {
 	const void *emulate = NULL;
 	int size = CALL_INSN_SIZE;
@@ -105,7 +105,7 @@ static void __ref __static_call_transform(void *insn, enum insn_type type,
 	if (memcmp(insn, code, size) == 0)
 		return;
 
-	if (system_state == SYSTEM_BOOTING || modinit)
+	if (system_state == SYSTEM_BOOTING)
 		return text_poke_early(insn, code, size);
 
 	text_poke_bp(insn, code, size, emulate);
@@ -160,12 +160,12 @@ void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
 
 	if (tramp) {
 		__static_call_validate(tramp, true, true);
-		__static_call_transform(tramp, __sc_insn(!func, true), func, false);
+		__static_call_transform(tramp, __sc_insn(!func, true), func);
 	}
 
 	if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE) && site) {
 		__static_call_validate(site, tail, false);
-		__static_call_transform(site, __sc_insn(!func, tail), func, false);
+		__static_call_transform(site, __sc_insn(!func, tail), func);
 	}
 
 	mutex_unlock(&text_mutex);
@@ -193,7 +193,7 @@ bool __static_call_fixup(void *tramp, u8 op, void *dest)
 
 	mutex_lock(&text_mutex);
 	if (op == RET_INSN_OPCODE || dest == &__x86_return_thunk)
-		__static_call_transform(tramp, RET, NULL, true);
+		__static_call_transform(tramp, RET, NULL);
 	mutex_unlock(&text_mutex);
 
 	return true;
diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index 3ac50b7298d1..264188ec50c9 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -7,6 +7,7 @@
 #include <asm/unwind.h>
 #include <asm/orc_types.h>
 #include <asm/orc_lookup.h>
+#include <asm/text-patching.h>
 
 #define orc_warn(fmt, ...) \
 	printk_deferred_once(KERN_WARNING "WARNING: " fmt, ##__VA_ARGS__)
@@ -222,18 +223,22 @@ static void orc_sort_swap(void *_a, void *_b, int size)
 	struct orc_entry orc_tmp;
 	int *a = _a, *b = _b, tmp;
 	int delta = _b - _a;
+	int val;
 
 	/* Swap the .orc_unwind_ip entries: */
 	tmp = *a;
-	*a = *b + delta;
-	*b = tmp - delta;
+	val = *b + delta;
+	text_poke_copy(a, &val, sizeof(val));
+	val = tmp - delta;
+	text_poke_copy(b, &val, sizeof(val));
 
 	/* Swap the corresponding .orc_unwind entries: */
 	orc_a = cur_orc_table + (a - cur_orc_ip_table);
 	orc_b = cur_orc_table + (b - cur_orc_ip_table);
 	orc_tmp = *orc_a;
-	*orc_a = *orc_b;
-	*orc_b = orc_tmp;
+
+	text_poke_copy(orc_a, orc_b, sizeof(*orc_b));
+	text_poke_copy(orc_b, &orc_tmp, sizeof(orc_tmp));
 }
 
 static int orc_sort_cmp(const void *_a, const void *_b)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 1056bbf55b17..bae267f0a257 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -226,7 +226,7 @@ static u8 simple_alu_opcodes[] = {
 static void jit_fill_hole(void *area, unsigned int size)
 {
 	/* Fill whole space with INT3 instructions */
-	memset(area, 0xcc, size);
+	text_poke_set(area, 0xcc, size);
 }
 
 int bpf_arch_text_invalidate(void *dst, size_t len)
@@ -2202,6 +2202,9 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i
 		orig_call += X86_PATCH_SIZE;
 	}
 
+	set_memory_nx((unsigned long)image & PAGE_MASK, 1);
+	set_memory_rw((unsigned long)image & PAGE_MASK, 1);
+
 	prog = image;
 
 	EMIT_ENDBR();
@@ -2238,20 +2241,24 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i
 		emit_mov_imm64(&prog, BPF_REG_1, (long) im >> 32, (u32) (long) im);
 		if (emit_rsb_call(&prog, __bpf_tramp_enter, prog)) {
 			ret = -EINVAL;
-			goto cleanup;
+			goto reprotect_memory;
 		}
 	}
 
 	if (fentry->nr_links)
 		if (invoke_bpf(m, &prog, fentry, regs_off, run_ctx_off,
-			       flags & BPF_TRAMP_F_RET_FENTRY_RET))
-			return -EINVAL;
+			       flags & BPF_TRAMP_F_RET_FENTRY_RET)) {
+			ret = -EINVAL;
+			goto reprotect_memory;
+		}
 
 	if (fmod_ret->nr_links) {
 		branches = kcalloc(fmod_ret->nr_links, sizeof(u8 *),
 				   GFP_KERNEL);
-		if (!branches)
-			return -ENOMEM;
+		if (!branches) {
+			ret =  -ENOMEM;
+			goto reprotect_memory;
+		}
 
 		if (invoke_bpf_mod_ret(m, &prog, fmod_ret, regs_off,
 				       run_ctx_off, branches)) {
@@ -2336,6 +2343,9 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i
 
 cleanup:
 	kfree(branches);
+reprotect_memory:
+	set_memory_rox((unsigned long)image & PAGE_MASK, 1);
+
 	return ret;
 }
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 13/13] x86/jitalloc: make memory allocated for code ROX
  2023-06-01 10:12 [PATCH 00/13] mm: jit/text allocator Mike Rapoport
                   ` (11 preceding siblings ...)
  2023-06-01 10:12 ` [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX Mike Rapoport
@ 2023-06-01 10:12 ` Mike Rapoport
  2023-06-01 16:12 ` [PATCH 00/13] mm: jit/text allocator Mark Rutland
  2023-06-02  0:36 ` Song Liu
  14 siblings, 0 replies; 55+ messages in thread
From: Mike Rapoport @ 2023-06-01 10:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy, David S. Miller,
	Dinh Nguyen, Heiko Carstens, Helge Deller, Huacai Chen,
	Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Mike Rapoport, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Song Liu, Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner,
	Will Deacon, bpf, linux-arm-kernel, linux-mips, linux-mm,
	linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

From: "Mike Rapoport (IBM)" <rppt@kernel.org>

When STRICT_KERNEL_RWX or STRICT_MODULE_RWX is enabled, force text
allocations to use PAGE_KERNEL_ROX.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 arch/Kconfig             |  3 +++
 arch/x86/Kconfig         |  1 +
 arch/x86/kernel/ftrace.c |  3 ---
 arch/x86/mm/init.c       |  6 ++++++
 include/linux/jitalloc.h |  2 ++
 mm/jitalloc.c            | 21 +++++++++++++++++++++
 6 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 479a7b8be191..e7c4b01307d7 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1307,6 +1307,9 @@ config STRICT_MODULE_RWX
 	  and non-text memory will be made non-executable. This provides
 	  protection against certain security exploits (e.g. writing to text)
 
+config ARCH_HAS_TEXT_POKE
+	def_bool n
+
 # select if the architecture provides an asm/dma-direct.h header
 config ARCH_HAS_PHYS_TO_DMA
 	bool
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fac4add6ce16..e1a512f557de 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -96,6 +96,7 @@ config X86
 	select ARCH_HAS_SET_DIRECT_MAP
 	select ARCH_HAS_STRICT_KERNEL_RWX
 	select ARCH_HAS_STRICT_MODULE_RWX
+	select ARCH_HAS_TEXT_POKE
 	select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
 	select ARCH_HAS_SYSCALL_WRAPPER
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index d50595f2c1a6..bd4dd8974ee6 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -313,7 +313,6 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 	unsigned long call_offset;
 	unsigned long jmp_offset;
 	unsigned long offset;
-	unsigned long npages;
 	unsigned long size;
 	unsigned long *ptr;
 	void *trampoline;
@@ -350,7 +349,6 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 		return 0;
 
 	*tramp_size = size + RET_SIZE + sizeof(void *);
-	npages = DIV_ROUND_UP(*tramp_size, PAGE_SIZE);
 
 	/* Copy ftrace_caller onto the trampoline memory */
 	ret = text_poke_copy(trampoline, (void *)start_offset, size);
@@ -416,7 +414,6 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 	/* ALLOC_TRAMP flags lets us know we created it */
 	ops->flags |= FTRACE_OPS_FL_ALLOC_TRAMP;
 
-	set_memory_rox((unsigned long)trampoline, npages);
 	return (unsigned long)trampoline;
 fail:
 	tramp_free(trampoline);
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index ffaf9a3840ce..c314738991fa 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -1127,6 +1127,12 @@ struct jit_alloc_params *jit_alloc_arch_params(void)
 	jit_alloc_params.text.start = MODULES_VADDR + get_jit_load_offset();
 	jit_alloc_params.text.end = MODULES_END;
 
+	if (IS_ENABLED(CONFIG_STRICT_KERNEL_RWX) ||
+	    IS_ENABLED(CONFIG_STRICT_MODULE_RWX)) {
+		jit_alloc_params.text.pgprot = PAGE_KERNEL_ROX;
+		jit_alloc_params.flags |= JIT_ALLOC_USE_TEXT_POKE;
+	}
+
 	return &jit_alloc_params;
 }
 #endif /* CONFIG_JIT_ALLOC */
diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h
index 0ba5ef785a85..0e29e87acefe 100644
--- a/include/linux/jitalloc.h
+++ b/include/linux/jitalloc.h
@@ -15,9 +15,11 @@
 /**
  * enum jit_alloc_flags - options for executable memory allocations
  * @JIT_ALLOC_KASAN_SHADOW:	allocate kasan shadow
+ * @JIT_ALLOC_USE_TEXT_POKE:	use text poking APIs to update memory
  */
 enum jit_alloc_flags {
 	JIT_ALLOC_KASAN_SHADOW	= (1 << 0),
+	JIT_ALLOC_USE_TEXT_POKE	= (1 << 1),
 };
 
 /**
diff --git a/mm/jitalloc.c b/mm/jitalloc.c
index a8ae64364d56..15d1067faf3f 100644
--- a/mm/jitalloc.c
+++ b/mm/jitalloc.c
@@ -7,6 +7,26 @@
 
 static struct jit_alloc_params jit_alloc_params;
 
+#ifdef CONFIG_ARCH_HAS_TEXT_POKE
+#include <asm/text-patching.h>
+
+static inline void jit_text_poke_copy(void *dst, const void *src, size_t len)
+{
+	if (jit_alloc_params.flags & JIT_ALLOC_USE_TEXT_POKE)
+		text_poke_copy(dst, src, len);
+	else
+		memcpy(dst, src, len);
+}
+
+static inline void jit_text_poke_set(void *addr, int c, size_t len)
+{
+	if (jit_alloc_params.flags & JIT_ALLOC_USE_TEXT_POKE)
+		text_poke_set(addr, c, len);
+	else
+		memset(addr, c, len);
+}
+
+#else
 static inline void jit_text_poke_copy(void *dst, const void *src, size_t len)
 {
 	memcpy(dst, src, len);
@@ -16,6 +36,7 @@ static inline void jit_text_poke_set(void *addr, int c, size_t len)
 {
 	memset(addr, c, len);
 }
+#endif
 
 static void *jit_alloc(size_t len, unsigned int alignment, pgprot_t pgprot,
 		       unsigned long start, unsigned long end,
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX
  2023-06-01 10:12 ` [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX Mike Rapoport
@ 2023-06-01 10:30   ` Peter Zijlstra
  2023-06-01 11:07     ` Mike Rapoport
  2023-06-01 17:52     ` Kent Overstreet
  2023-06-01 16:54   ` Edgecombe, Rick P
  2023-06-01 22:49   ` Song Liu
  2 siblings, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2023-06-01 10:30 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Naveen N. Rao, Palmer Dabbelt, Russell King, Song Liu,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Thu, Jun 01, 2023 at 01:12:56PM +0300, Mike Rapoport wrote:

> +static void __init_or_module do_text_poke(void *addr, const void *opcode, size_t len)
> +{
> +	if (system_state < SYSTEM_RUNNING) {
> +		text_poke_early(addr, opcode, len);
> +	} else {
> +		mutex_lock(&text_mutex);
> +		text_poke(addr, opcode, len);
> +		mutex_unlock(&text_mutex);
> +	}
> +}

So I don't much like do_text_poke(); why?

> diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> index aa99536b824c..d50595f2c1a6 100644
> --- a/arch/x86/kernel/ftrace.c
> +++ b/arch/x86/kernel/ftrace.c
> @@ -118,10 +118,13 @@ ftrace_modify_code_direct(unsigned long ip, const char *old_code,
>  		return ret;
>  
>  	/* replace the text with the new text */
> -	if (ftrace_poke_late)
> +	if (ftrace_poke_late) {
>  		text_poke_queue((void *)ip, new_code, MCOUNT_INSN_SIZE, NULL);
> -	else
> -		text_poke_early((void *)ip, new_code, MCOUNT_INSN_SIZE);
> +	} else {
> +		mutex_lock(&text_mutex);
> +		text_poke((void *)ip, new_code, MCOUNT_INSN_SIZE);
> +		mutex_unlock(&text_mutex);
> +	}
>  	return 0;
>  }

And in the above case it's actively wrong for losing the _queue()
thing.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX
  2023-06-01 10:30   ` Peter Zijlstra
@ 2023-06-01 11:07     ` Mike Rapoport
  2023-06-02  0:02       ` Song Liu
  2023-06-01 17:52     ` Kent Overstreet
  1 sibling, 1 reply; 55+ messages in thread
From: Mike Rapoport @ 2023-06-01 11:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Naveen N. Rao, Palmer Dabbelt, Russell King, Song Liu,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Thu, Jun 01, 2023 at 12:30:50PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 01, 2023 at 01:12:56PM +0300, Mike Rapoport wrote:
> 
> > +static void __init_or_module do_text_poke(void *addr, const void *opcode, size_t len)
> > +{
> > +	if (system_state < SYSTEM_RUNNING) {
> > +		text_poke_early(addr, opcode, len);
> > +	} else {
> > +		mutex_lock(&text_mutex);
> > +		text_poke(addr, opcode, len);
> > +		mutex_unlock(&text_mutex);
> > +	}
> > +}
> 
> So I don't much like do_text_poke(); why?

I believe the idea was to keep memcpy for early boot before the kernel
image is protected without going and adding if (is_module_text_address())
all over the place.

I think this can be used instead without updating all the call sites of
text_poke_early():

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 91057de8e6bc..f994e63e9903 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1458,7 +1458,7 @@ void __init_or_module text_poke_early(void *addr, const void *opcode,
 		 * code cannot be running and speculative code-fetches are
 		 * prevented. Just change the code.
 		 */
-		memcpy(addr, opcode, len);
+		text_poke_copy(addr, opcode, len);
 	} else {
 		local_irq_save(flags);
 		memcpy(addr, opcode, len);
 
> > diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> > index aa99536b824c..d50595f2c1a6 100644
> > --- a/arch/x86/kernel/ftrace.c
> > +++ b/arch/x86/kernel/ftrace.c
> > @@ -118,10 +118,13 @@ ftrace_modify_code_direct(unsigned long ip, const char *old_code,
> >  		return ret;
> >  
> >  	/* replace the text with the new text */
> > -	if (ftrace_poke_late)
> > +	if (ftrace_poke_late) {
> >  		text_poke_queue((void *)ip, new_code, MCOUNT_INSN_SIZE, NULL);
> > -	else
> > -		text_poke_early((void *)ip, new_code, MCOUNT_INSN_SIZE);
> > +	} else {
> > +		mutex_lock(&text_mutex);
> > +		text_poke((void *)ip, new_code, MCOUNT_INSN_SIZE);
> > +		mutex_unlock(&text_mutex);
> > +	}
> >  	return 0;
> >  }
> 
> And in the above case it's actively wrong for losing the _queue()
> thing.

-- 
Sincerely yours,
Mike.

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/13] mm: jit/text allocator
  2023-06-01 10:12 [PATCH 00/13] mm: jit/text allocator Mike Rapoport
                   ` (12 preceding siblings ...)
  2023-06-01 10:12 ` [PATCH 13/13] x86/jitalloc: make memory allocated for code ROX Mike Rapoport
@ 2023-06-01 16:12 ` Mark Rutland
  2023-06-01 18:14   ` Kent Overstreet
  2023-06-02  0:36 ` Song Liu
  14 siblings, 1 reply; 55+ messages in thread
From: Mark Rutland @ 2023-06-01 16:12 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Naveen N. Rao, Palmer Dabbelt, Russell King, Song Liu,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

Hi Mike,

On Thu, Jun 01, 2023 at 01:12:44PM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> 
> Hi,
> 
> module_alloc() is used everywhere as a means to allocate memory for code.
> 
> Besides being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF, to modules
> and puts the burden of code allocation on the modules code.

I agree this is a problem, and one key issue here is that these can have
different requirements. For example, on arm64 we need modules to be placed
within a 128M or 2G window containing the kernel, whereas it would be safe for
the kprobes XOL area to be placed arbitrarily far from the kernel image (since
we don't allow PC-relative insns to be stepped out-of-line). Likewise arm64
doesn't have ftrace trampolines, and DIRECT_CALL trampolines can safely be
placed arbitrarily far from the kernel image.

For a while I have wanted to give kprobes its own allocator so that it can work
even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in
the modules area.

Given that, I think these should have their own allocator functions that can be
provided independently, even if those happen to use common infrastructure.

> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
> 
> This set splits code allocation from modules by introducing
> jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> sites of module_alloc() and module_memfree() with the new APIs and
> implements core text and related allocation in a central place.
> 
> Instead of architecture specific overrides for module_alloc(), the
> architectures that require non-default behaviour for text allocation must
> fill jit_alloc_params structure and implement jit_alloc_arch_params() that
> returns a pointer to that structure. If an architecture does not implement
> jit_alloc_arch_params(), the defaults compatible with the current
> modules::module_alloc() are used.

As above, I suspect that each of the callsites should probably be using common
infrastructure, but I don't think that a single jit_alloc_arch_params() makes
sense, since the parameters for each case may need to be distinct.
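
To illustrate what "common infrastructure, distinct parameters" could look
like, here is a minimal userspace sketch. It is not taken from the series:
struct text_range, range_alloc(), module_text_alloc() and
kprobes_insn_alloc() are hypothetical names, the address windows are made-up
numbers, and a trivial bump allocator stands in for the real allocation
logic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-user constraints; names and values are illustrative. */
struct text_range {
	unsigned long start;	/* first usable address (inclusive) */
	unsigned long end;	/* end of the window (exclusive) */
	unsigned long next;	/* bump pointer; 0 means "nothing allocated yet" */
	unsigned long align;	/* required alignment, a power of two */
};

/* Common infrastructure: one allocator, parameterised per caller. */
static unsigned long range_alloc(struct text_range *r, unsigned long len)
{
	unsigned long addr;

	if (!r->next)
		r->next = r->start;

	/* Round the bump pointer up to the requested alignment. */
	addr = (r->next + r->align - 1) & ~(r->align - 1);
	if (addr < r->next || addr > r->end || len > r->end - addr)
		return 0;	/* does not fit in this user's window */

	r->next = addr + len;
	return addr;
}

/* Modules: a 128M window assumed to sit close to the kernel image. */
static struct text_range module_range = {
	0x10000000UL, 0x18000000UL, 0, 0x1000UL
};

/* kprobes XOL slots: a separate window with no proximity requirement. */
static struct text_range kprobes_range = {
	0x40000000UL, 0x80000000UL, 0, 0x1000UL
};

/* Each user gets its own entry point on top of the shared helper. */
static unsigned long module_text_alloc(unsigned long len)
{
	return range_alloc(&module_range, len);
}

static unsigned long kprobes_insn_alloc(unsigned long len)
{
	return range_alloc(&kprobes_range, len);
}
```

The point of the sketch is only that each caller owns its constraints while
the allocation logic is shared; a real implementation would sit on top of
the vmalloc machinery rather than a bump pointer.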

> The new jitalloc infrastructure allows decoupling of kprobes and ftrace
> from modules, and most importantly it enables ROX allocations for
> executable memory.
> 
> A centralized infrastructure for code allocation allows future
> optimizations for allocations of executable memory, caching large pages for
> better iTLB performance and providing sub-page allocations for users that
> only need small jit code snippets.

This sounds interesting, but I think this can be achieved without requiring a
single jit_alloc_arch_params() shared by all users?

Thanks,
Mark.

> 
> patches 1-5: split out the code allocation from modules and arch
> patch 6: add dedicated API for data allocations with constraints similar to
> code allocations
> patches 7-9: decouple dynamic ftrace and kprobes from CONFIG_MODULES
> patches 10-13: enable ROX allocations for executable memory on x86
> 
> Mike Rapoport (IBM) (11):
>   nios2: define virtual address space for modules
>   mm: introduce jit_text_alloc() and use it instead of module_alloc()
>   mm/jitalloc, arch: convert simple overrides of module_alloc to jitalloc
>   mm/jitalloc, arch: convert remaining overrides of module_alloc to jitalloc
>   module, jitalloc: drop module_alloc
>   mm/jitalloc: introduce jit_data_alloc()
>   x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
>   arch: make jitalloc setup available regardless of CONFIG_MODULES
>   kprobes: remove dependcy on CONFIG_MODULES
>   modules, jitalloc: prepare to allocate executable memory as ROX
>   x86/jitalloc: make memory allocated for code ROX
> 
> Song Liu (2):
>   ftrace: Add swap_func to ftrace_process_locs()
>   x86/jitalloc: prepare to allocate exectuatble memory as ROX
> 
>  arch/Kconfig                     |   5 +-
>  arch/arm/kernel/module.c         |  32 ------
>  arch/arm/mm/init.c               |  35 ++++++
>  arch/arm64/kernel/module.c       |  47 --------
>  arch/arm64/mm/init.c             |  42 +++++++
>  arch/loongarch/kernel/module.c   |   6 -
>  arch/loongarch/mm/init.c         |  16 +++
>  arch/mips/kernel/module.c        |   9 --
>  arch/mips/mm/init.c              |  19 ++++
>  arch/nios2/include/asm/pgtable.h |   5 +-
>  arch/nios2/kernel/module.c       |  24 ++--
>  arch/parisc/kernel/module.c      |  11 --
>  arch/parisc/mm/init.c            |  21 +++-
>  arch/powerpc/kernel/kprobes.c    |   4 +-
>  arch/powerpc/kernel/module.c     |  37 -------
>  arch/powerpc/mm/mem.c            |  41 +++++++
>  arch/riscv/kernel/module.c       |  10 --
>  arch/riscv/mm/init.c             |  18 +++
>  arch/s390/kernel/ftrace.c        |   4 +-
>  arch/s390/kernel/kprobes.c       |   4 +-
>  arch/s390/kernel/module.c        |  46 +-------
>  arch/s390/mm/init.c              |  35 ++++++
>  arch/sparc/kernel/module.c       |  34 +-----
>  arch/sparc/mm/Makefile           |   2 +
>  arch/sparc/mm/jitalloc.c         |  21 ++++
>  arch/sparc/net/bpf_jit_comp_32.c |   8 +-
>  arch/x86/Kconfig                 |   2 +
>  arch/x86/kernel/alternative.c    |  43 ++++---
>  arch/x86/kernel/ftrace.c         |  59 +++++-----
>  arch/x86/kernel/kprobes/core.c   |   4 +-
>  arch/x86/kernel/module.c         |  75 +------------
>  arch/x86/kernel/static_call.c    |  10 +-
>  arch/x86/kernel/unwind_orc.c     |  13 ++-
>  arch/x86/mm/init.c               |  52 +++++++++
>  arch/x86/net/bpf_jit_comp.c      |  22 +++-
>  include/linux/ftrace.h           |   2 +
>  include/linux/jitalloc.h         |  69 ++++++++++++
>  include/linux/moduleloader.h     |  15 ---
>  kernel/bpf/core.c                |  14 +--
>  kernel/kprobes.c                 |  51 +++++----
>  kernel/module/Kconfig            |   1 +
>  kernel/module/main.c             |  56 ++++------
>  kernel/trace/ftrace.c            |  13 ++-
>  kernel/trace/trace_kprobe.c      |  11 ++
>  mm/Kconfig                       |   3 +
>  mm/Makefile                      |   1 +
>  mm/jitalloc.c                    | 185 +++++++++++++++++++++++++++++++
>  mm/mm_init.c                     |   2 +
>  48 files changed, 777 insertions(+), 462 deletions(-)
>  create mode 100644 arch/sparc/mm/jitalloc.c
>  create mode 100644 include/linux/jitalloc.h
>  create mode 100644 mm/jitalloc.c
> 
> 
> base-commit: 44c026a73be8038f03dbdeef028b642880cf1511
> -- 
> 2.35.1
> 
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX
  2023-06-01 10:12 ` [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX Mike Rapoport
  2023-06-01 10:30   ` Peter Zijlstra
@ 2023-06-01 16:54   ` Edgecombe, Rick P
  2023-06-01 18:00     ` Kent Overstreet
  2023-06-01 22:49   ` Song Liu
  2 siblings, 1 reply; 55+ messages in thread
From: Edgecombe, Rick P @ 2023-06-01 16:54 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org, rppt@kernel.org
  Cc: tglx@linutronix.de, mcgrof@kernel.org, deller@gmx.de,
	davem@davemloft.net, netdev@vger.kernel.org,
	linux@armlinux.org.uk, linux-mips@vger.kernel.org,
	linux-riscv@lists.infradead.org, linuxppc-dev@lists.ozlabs.org,
	hca@linux.ibm.com, catalin.marinas@arm.com,
	kent.overstreet@linux.dev, linux-s390@vger.kernel.org,
	christophe.leroy@csgroup.eu, chenhuacai@kernel.org,
	mpe@ellerman.id.au, linux-trace-kernel@vger.kernel.org,
	tsbogend@alpha.franken.de, palmer@dabbelt.com, x86@kernel.org,
	linux-parisc@vger.kernel.org, rostedt@goodmis.org,
	will@kernel.org, dinguyen@kernel.org, naveen.n.rao@linux.ibm.com,
	sparclinux@vger.kernel.org, linux-modules@vger.kernel.org,
	bpf@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	song@kernel.org, linux-mm@kvack.org, loongarch@lists.linux.dev,
	akpm@linux-foundation.org

On Thu, 2023-06-01 at 13:12 +0300, Mike Rapoport wrote:
>  /*
>   * Are we looking at a near JMP with a 1 or 4-byte displacement.
> @@ -331,7 +344,7 @@ void __init_or_module noinline
> apply_alternatives(struct alt_instr *start,
>  
>                 DUMP_BYTES(insn_buff, insn_buff_sz, "%px: final_insn:
> ", instr);
>  
> -               text_poke_early(instr, insn_buff, insn_buff_sz);
> +               do_text_poke(instr, insn_buff, insn_buff_sz);
>  
>  next:
>                 optimize_nops(instr, a->instrlen);
> @@ -564,7 +577,7 @@ void __init_or_module noinline
> apply_retpolines(s32 *start, s32 *end)
>                         optimize_nops(bytes, len);
>                         DUMP_BYTES(((u8*)addr),  len, "%px: orig: ",
> addr);
>                         DUMP_BYTES(((u8*)bytes), len, "%px: repl: ",
> addr);
> -                       text_poke_early(addr, bytes, len);
> +                       do_text_poke(addr, bytes, len);
>                 }
>         }
>  }
> @@ -638,7 +651,7 @@ void __init_or_module noinline apply_returns(s32
> *start, s32 *end)
>                 if (len == insn.length) {
>                         DUMP_BYTES(((u8*)addr),  len, "%px: orig: ",
> addr);
>                         DUMP_BYTES(((u8*)bytes), len, "%px: repl: ",
> addr);
> -                       text_poke_early(addr, bytes, len);
> +                       do_text_poke(addr, bytes, len);
>                 }
>         }
>  }
> @@ -674,7 +687,7 @@ static void poison_endbr(void *addr, bool warn)
>          */
>         DUMP_BYTES(((u8*)addr), 4, "%px: orig: ", addr);
>         DUMP_BYTES(((u8*)&poison), 4, "%px: repl: ", addr);
> -       text_poke_early(addr, &poison, 4);
> +       do_text_poke(addr, &poison, 4);
>  }
>  
>  /*
> @@ -869,7 +882,7 @@ static int cfi_disable_callers(s32 *start, s32
> *end)
>                 if (!hash) /* nocfi callers */
>                         continue;
>  
> -               text_poke_early(addr, jmp, 2);
> +               do_text_poke(addr, jmp, 2);
>         }
>  
>         return 0;
> @@ -892,7 +905,7 @@ static int cfi_enable_callers(s32 *start, s32
> *end)
>                 if (!hash) /* nocfi callers */
>                         continue;
>  
> -               text_poke_early(addr, mov, 2);
> +               do_text_poke(addr, mov, 2);
>         }
>  
>         return 0;
> @@ -913,7 +926,7 @@ static int cfi_rand_preamble(s32 *start, s32
> *end)
>                         return -EINVAL;
>  
>                 hash = cfi_rehash(hash);
> -               text_poke_early(addr + 1, &hash, 4);
> +               do_text_poke(addr + 1, &hash, 4);
>         }
>  
>         return 0;
> @@ -932,9 +945,9 @@ static int cfi_rewrite_preamble(s32 *start, s32
> *end)
>                          addr, addr, 5, addr))
>                         return -EINVAL;
>  
> -               text_poke_early(addr, fineibt_preamble_start,
> fineibt_preamble_size);
> +               do_text_poke(addr, fineibt_preamble_start,
> fineibt_preamble_size);
>                 WARN_ON(*(u32 *)(addr + fineibt_preamble_hash) !=
> 0x12345678);
> -               text_poke_early(addr + fineibt_preamble_hash, &hash,
> 4);
> +               do_text_poke(addr + fineibt_preamble_hash, &hash, 4);
>         }

It is just a local flush, but I wonder how much text_poke()ing is too
much. A lot of them are even inside loops. Can't it do the batch version
at least?

The other thing, and maybe this is in paranoia category, but it's
probably at least worth noting. Before the modules were not made
executable until all of the code was finalized. Now they are made
executable in an intermediate state and then patched later. It might
weaken the CFI stuff, but also it just kind of seems a bit unbounded
for dealing with executable code.

Preparing the modules in a separate RW mapping, and then text_poke()ing
the whole thing in when you are done would resolve both of these.
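
A rough userspace sketch of that idea, for illustration only. Nothing here
is kernel code: mock_text_poke_copy() is a hypothetical stand-in for
text_poke_copy() that just counts calls, and struct fixup is an invented,
simplified model of an alternative or relocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Counts how many times the "expensive" poke primitive is invoked. */
static unsigned long poke_calls;

/* Hypothetical stand-in for text_poke_copy(); here it is a plain memcpy. */
static void *mock_text_poke_copy(void *dst, const void *src, size_t len)
{
	poke_calls++;
	return memcpy(dst, src, len);
}

/* Invented, simplified patch record: one byte written at one offset. */
struct fixup {
	size_t offset;
	unsigned char byte;
};

/*
 * Apply every fixup to a private RW staging copy, then publish the
 * finished image into the (conceptually read-only) text with one poke.
 */
static int stage_and_publish(void *text, const void *image, size_t len,
			     const struct fixup *fixups, size_t nr_fixups)
{
	unsigned char *staging;
	size_t i;

	if (!len)
		return 0;

	staging = malloc(len);
	if (!staging)
		return -1;

	memcpy(staging, image, len);
	for (i = 0; i < nr_fixups; i++) {
		if (fixups[i].offset >= len) {
			free(staging);
			return -1;
		}
		staging[fixups[i].offset] = fixups[i].byte;
	}

	/* The target never holds a half-patched image. */
	mock_text_poke_copy(text, staging, len);
	free(staging);
	return 0;
}
```

With this shape the number of pokes no longer scales with the number of
patch sites, and the executable mapping only ever sees the finalized bytes.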

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX
  2023-06-01 10:30   ` Peter Zijlstra
  2023-06-01 11:07     ` Mike Rapoport
@ 2023-06-01 17:52     ` Kent Overstreet
  1 sibling, 0 replies; 55+ messages in thread
From: Kent Overstreet @ 2023-06-01 17:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Rapoport, linux-kernel, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Luis Chamberlain, Michael Ellerman,
	Naveen N. Rao, Palmer Dabbelt, Russell King, Song Liu,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Thu, Jun 01, 2023 at 12:30:50PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 01, 2023 at 01:12:56PM +0300, Mike Rapoport wrote:
> 
> > +static void __init_or_module do_text_poke(void *addr, const void *opcode, size_t len)
> > +{
> > +	if (system_state < SYSTEM_RUNNING) {
> > +		text_poke_early(addr, opcode, len);
> > +	} else {
> > +		mutex_lock(&text_mutex);
> > +		text_poke(addr, opcode, len);
> > +		mutex_unlock(&text_mutex);
> > +	}
> > +}
> 
> So I don't much like do_text_poke(); why?

Could you share why?

I think the implementation sucks but conceptually it's the right idea -
create a new temporary mapping to avoid the need for RWX mappings.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX
  2023-06-01 16:54   ` Edgecombe, Rick P
@ 2023-06-01 18:00     ` Kent Overstreet
  2023-06-01 18:13       ` Edgecombe, Rick P
  0 siblings, 1 reply; 55+ messages in thread
From: Kent Overstreet @ 2023-06-01 18:00 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel@vger.kernel.org, rppt@kernel.org, tglx@linutronix.de,
	mcgrof@kernel.org, deller@gmx.de, davem@davemloft.net,
	netdev@vger.kernel.org, linux@armlinux.org.uk,
	linux-mips@vger.kernel.org, linux-riscv@lists.infradead.org,
	linuxppc-dev@lists.ozlabs.org, hca@linux.ibm.com,
	catalin.marinas@arm.com, linux-s390@vger.kernel.org,
	christophe.leroy@csgroup.eu, chenhuacai@kernel.org,
	mpe@ellerman.id.au, linux-trace-kernel@vger.kernel.org,
	tsbogend@alpha.franken.de, palmer@dabbelt.com, x86@kernel.org,
	linux-parisc@vger.kernel.org, rostedt@goodmis.org,
	will@kernel.org, dinguyen@kernel.org, naveen.n.rao@linux.ibm.com,
	sparclinux@vger.kernel.org, linux-modules@vger.kernel.org,
	bpf@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	song@kernel.org, linux-mm@kvack.org, loongarch@lists.linux.dev,
	akpm@linux-foundation.org

On Thu, Jun 01, 2023 at 04:54:27PM +0000, Edgecombe, Rick P wrote:
> It is just a local flush, but I wonder how much text_poke()ing is too
> much. A lot of them are even inside loops. Can't it do the batch version
> at least?
> 
> The other thing, and maybe this is in paranoia category, but it's
> probably at least worth noting. Before the modules were not made
> executable until all of the code was finalized. Now they are made
> executable in an intermediate state and then patched later. It might
> weaken the CFI stuff, but also it just kind of seems a bit unbounded
> for dealing with executable code.

I believe bpf starts out by initializing new executable memory with
illegal opcodes, maybe we should steal that and make it standard.

> Preparing the modules in a separate RW mapping, and then text_poke()ing
> the whole thing in when you are done would resolve both of these.

text_poke() _does_ create a separate RW mapping.

The thing that sucks about text_poke() is that it always does a full TLB
flush, and AFAICT that's not remotely needed. What it really wants to be
doing is conceptually just

kmap_local()
memcpy()
kunmap_local()
flush_icache();

...except that kmap_local() won't actually create a new mapping on
non-highmem architectures, so text_poke() open codes it.
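
A userspace analogue of the "temporary writable alias" concept, as a sketch
only. It assumes Linux with memfd_create() available; struct text_region,
region_init() and region_poke() are hypothetical names, and the long-lived
view is mapped read-only rather than executable to keep the example simple.
It says nothing about how the kernel handles TLB or icache maintenance.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical model of a text area with a long-lived read-only view. */
struct text_region {
	int fd;				/* backing memfd */
	size_t size;			/* rounded up to whole pages */
	const unsigned char *ro;	/* read-only view, never writable */
};

static int region_init(struct text_region *r, size_t size)
{
	long page = sysconf(_SC_PAGESIZE);
	void *ro;

	if (page <= 0 || !size)
		return -1;

	r->size = (size + (size_t)page - 1) & ~((size_t)page - 1);
	r->fd = memfd_create("text-sketch", 0);
	if (r->fd < 0)
		return -1;

	if (ftruncate(r->fd, (off_t)r->size) < 0) {
		close(r->fd);
		return -1;
	}

	ro = mmap(NULL, r->size, PROT_READ, MAP_SHARED, r->fd, 0);
	if (ro == MAP_FAILED) {
		close(r->fd);
		return -1;
	}

	r->ro = ro;
	return 0;
}

/*
 * Write through a short-lived RW alias of the same pages, then drop the
 * alias. The read-only view observes the new bytes without ever having
 * been writable itself.
 */
static int region_poke(struct text_region *r, size_t off,
		       const void *src, size_t len)
{
	unsigned char *rw;

	if (off > r->size || len > r->size - off)
		return -1;

	rw = mmap(NULL, r->size, PROT_READ | PROT_WRITE, MAP_SHARED,
		  r->fd, 0);
	if (rw == MAP_FAILED)
		return -1;

	memcpy(rw + off, src, len);
	munmap(rw, r->size);
	return 0;
}
```

The alias exists only for the duration of the copy, which is the property
the kmap_local()-style sequence above is after.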

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX
  2023-06-01 18:00     ` Kent Overstreet
@ 2023-06-01 18:13       ` Edgecombe, Rick P
  2023-06-01 18:38         ` Kent Overstreet
  0 siblings, 1 reply; 55+ messages in thread
From: Edgecombe, Rick P @ 2023-06-01 18:13 UTC (permalink / raw)
  To: kent.overstreet@linux.dev
  Cc: tglx@linutronix.de, mcgrof@kernel.org, deller@gmx.de,
	netdev@vger.kernel.org, davem@davemloft.net,
	linux@armlinux.org.uk, linux-mips@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, hca@linux.ibm.com,
	catalin.marinas@arm.com, linux-kernel@vger.kernel.org,
	linux-riscv@lists.infradead.org, palmer@dabbelt.com,
	x86@kernel.org, chenhuacai@kernel.org, tsbogend@alpha.franken.de,
	linux-trace-kernel@vger.kernel.org, linux-parisc@vger.kernel.org,
	rppt@kernel.org, mpe@ellerman.id.au, linux-s390@vger.kernel.org,
	christophe.leroy@csgroup.eu, rostedt@goodmis.org, will@kernel.org,
	dinguyen@kernel.org, naveen.n.rao@linux.ibm.com,
	sparclinux@vger.kernel.org, linux-modules@vger.kernel.org,
	bpf@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	song@kernel.org, linux-mm@kvack.org, loongarch@lists.linux.dev,
	akpm@linux-foundation.org

On Thu, 2023-06-01 at 14:00 -0400, Kent Overstreet wrote:
> On Thu, Jun 01, 2023 at 04:54:27PM +0000, Edgecombe, Rick P wrote:
> > It is just a local flush, but I wonder how much text_poke()ing is
> > too
> > much. A lot of them are even inside loops. Can't it do the batch
> > version
> > at least?
> > 
> > The other thing, and maybe this is in paranoia category, but it's
> > probably at least worth noting. Before the modules were not made
> > executable until all of the code was finalized. Now they are made
> > executable in an intermediate state and then patched later. It
> > might
> > weaken the CFI stuff, but also it just kind of seems a bit
> > unbounded
> > for dealing with executable code.
> 
> I believe bpf starts out by initializing new executable memory with
> illegal opcodes, maybe we should steal that and make it standard.

I was thinking of modules which have a ton of alternatives, errata
fixes, etc applied to them after the initial sections are written to
the to-be-executable mapping. I thought this had zeroed pages to start,
which seems ok.

> 
> > Preparing the modules in a separate RW mapping, and then
> > text_poke()ing
> > the whole thing in when you are done would resolve both of these.
> 
> text_poke() _does_ create a separate RW mapping.

Sorry, I meant a separate RW allocation.

> 
> The thing that sucks about text_poke() is that it always does a full
> TLB
> flush, and AFAICT that's not remotely needed. What it really wants to
> be
> doing is conceptually just
> 
> kmap_local()
> memcpy()
> kunmap_local()
> flush_icache();
> 
> ...except that kmap_local() won't actually create a new mapping on
> non-highmem architectures, so text_poke() open codes it.

Text poke creates only a local CPU RW mapping. It's more secure because
other threads can't write to it. It also only needs to flush the local
core when it's done since it's not using a shared MM. It used to use
the fixmap, which is similar to what you are describing I think.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/13] mm: jit/text allocator
  2023-06-01 16:12 ` [PATCH 00/13] mm: jit/text allocator Mark Rutland
@ 2023-06-01 18:14   ` Kent Overstreet
  2023-06-02  9:35     ` Mark Rutland
  0 siblings, 1 reply; 55+ messages in thread
From: Kent Overstreet @ 2023-06-01 18:14 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Mike Rapoport, linux-kernel, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Luis Chamberlain, Michael Ellerman,
	Naveen N. Rao, Palmer Dabbelt, Russell King, Song Liu,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote:
> For a while I have wanted to give kprobes its own allocator so that it can work
> even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in
> the modules area.
> 
> Given that, I think these should have their own allocator functions that can be
> provided independently, even if those happen to use common infrastructure.

How much memory can kprobes conceivably use? I think we also want to try
to push back on combinatorial new allocators, if we can.

> > Several architectures override module_alloc() because of various
> > constraints where the executable memory can be located and this causes
> > additional obstacles for improvements of code allocation.
> > 
> > This set splits code allocation from modules by introducing
> > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> > sites of module_alloc() and module_memfree() with the new APIs and
> > implements core text and related allocation in a central place.
> > 
> > Instead of architecture specific overrides for module_alloc(), the
> > architectures that require non-default behaviour for text allocation must
> > fill jit_alloc_params structure and implement jit_alloc_arch_params() that
> > returns a pointer to that structure. If an architecture does not implement
> > jit_alloc_arch_params(), the defaults compatible with the current
> > modules::module_alloc() are used.
> 
> As above, I suspect that each of the callsites should probably be using common
> infrastructure, but I don't think that a single jit_alloc_arch_params() makes
> sense, since the parameters for each case may need to be distinct.

I don't see how that follows. The whole point of function parameters is
that they may be different :)

Can you give more detail on what parameters you need? If the only extra
parameter is just "does this allocation need to live close to kernel
text", that's not that big of a deal.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX
  2023-06-01 18:13       ` Edgecombe, Rick P
@ 2023-06-01 18:38         ` Kent Overstreet
  2023-06-01 20:50           ` Edgecombe, Rick P
  0 siblings, 1 reply; 55+ messages in thread
From: Kent Overstreet @ 2023-06-01 18:38 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: tglx@linutronix.de, mcgrof@kernel.org, deller@gmx.de,
	netdev@vger.kernel.org, davem@davemloft.net,
	linux@armlinux.org.uk, linux-mips@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, hca@linux.ibm.com,
	catalin.marinas@arm.com, linux-kernel@vger.kernel.org,
	linux-riscv@lists.infradead.org, palmer@dabbelt.com,
	x86@kernel.org, chenhuacai@kernel.org, tsbogend@alpha.franken.de,
	linux-trace-kernel@vger.kernel.org, linux-parisc@vger.kernel.org,
	rppt@kernel.org, mpe@ellerman.id.au, linux-s390@vger.kernel.org,
	christophe.leroy@csgroup.eu, rostedt@goodmis.org, will@kernel.org,
	dinguyen@kernel.org, naveen.n.rao@linux.ibm.com,
	sparclinux@vger.kernel.org, linux-modules@vger.kernel.org,
	bpf@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	song@kernel.org, linux-mm@kvack.org, loongarch@lists.linux.dev,
	akpm@linux-foundation.org

On Thu, Jun 01, 2023 at 06:13:44PM +0000, Edgecombe, Rick P wrote:
> > text_poke() _does_ create a separate RW mapping.
> 
> Sorry, I meant a separate RW allocation.

Ah yes, that makes sense


> 
> > 
> > The thing that sucks about text_poke() is that it always does a full
> > TLB
> > flush, and AFAICT that's not remotely needed. What it really wants to
> > be
> > doing is conceptually just
> > 
> > kmap_local()
> > memcpy()
> > kunmap_local()
> > flush_icache();
> > 
> > ...except that kmap_local() won't actually create a new mapping on
> > non-highmem architectures, so text_poke() open codes it.
> 
> Text poke creates only a local CPU RW mapping. It's more secure because
> other threads can't write to it.

*nod*, same as kmap_local

> It also only needs to flush the local core when it's done since it's
> not using a shared MM.
 
Ahh! Thanks for that; perhaps the comment in text_poke() about IPIs
could be a bit clearer.

What is it (if anything) you don't like about text_poke() then? It looks
like it's doing broadly similar things to kmap_local(), so should be
in the same ballpark from a performance POV?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX
  2023-06-01 18:38         ` Kent Overstreet
@ 2023-06-01 20:50           ` Edgecombe, Rick P
  2023-06-01 23:54             ` Nadav Amit
  2023-06-04 21:47             ` Kent Overstreet
  0 siblings, 2 replies; 55+ messages in thread
From: Edgecombe, Rick P @ 2023-06-01 20:50 UTC (permalink / raw)
  To: kent.overstreet@linux.dev
  Cc: tglx@linutronix.de, mcgrof@kernel.org, deller@gmx.de,
	netdev@vger.kernel.org, davem@davemloft.net,
	linux@armlinux.org.uk, linux-mips@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, hca@linux.ibm.com,
	catalin.marinas@arm.com, linux-kernel@vger.kernel.org,
	linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org,
	palmer@dabbelt.com, chenhuacai@kernel.org, mpe@ellerman.id.au,
	x86@kernel.org, tsbogend@alpha.franken.de, rppt@kernel.org,
	linux-trace-kernel@vger.kernel.org, linux-parisc@vger.kernel.org,
	christophe.leroy@csgroup.eu, rostedt@goodmis.org, will@kernel.org,
	dinguyen@kernel.org, naveen.n.rao@linux.ibm.com,
	sparclinux@vger.kernel.org, linux-modules@vger.kernel.org,
	bpf@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	song@kernel.org, linux-mm@kvack.org, loongarch@lists.linux.dev,
	akpm@linux-foundation.org

On Thu, 2023-06-01 at 14:38 -0400, Kent Overstreet wrote:
> On Thu, Jun 01, 2023 at 06:13:44PM +0000, Edgecombe, Rick P wrote:
> > > text_poke() _does_ create a separate RW mapping.
> > 
> > Sorry, I meant a separate RW allocation.
> 
> Ah yes, that makes sense
> 
> 
> > 
> > > 
> > > The thing that sucks about text_poke() is that it always does a
> > > full
> > > TLB
> > > flush, and AFAICT that's not remotely needed. What it really
> > > wants to
> > > be
> > > doing is conceptually just
> > > 
> > > kmap_local()
> > > memcpy()
> > > kunmap_local()
> > > flush_icache();
> > > 
> > > ...except that kmap_local() won't actually create a new mapping
> > > on
> > > non-highmem architectures, so text_poke() open codes it.
> > 
> > Text poke creates only a local CPU RW mapping. It's more secure
> > because
> > other threads can't write to it.
> 
> *nod*, same as kmap_local

It's only used and flushed locally, but it is accessible to all CPUs,
right?

> 
> > It also only needs to flush the local core when it's done since
> > it's
> > not using a shared MM.
>  
> Ahh! Thanks for that; perhaps the comment in text_poke() about IPIs
> could be a bit clearer.
> 
> What is it (if anything) you don't like about text_poke() then? It
> looks
> like it's doing broadly similar things to kmap_local(), so should be
> in the same ballpark from a performance POV?

The way text_poke() is used here, it is creating a new writable alias
and flushing it for *each* write to the module (like for each write of
an individual relocation, etc). I was just thinking it might warrant
some batching or something.
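
As a rough illustration of the batching point above, here is a userspace
sketch, not kernel code. sim_text_poke() is a hypothetical stand-in that only
counts how many "create alias, copy, flush" cycles would occur; the struct and
function names are assumptions made for this example and do not exist in the
kernel. It compares poking once per relocation with patching a separate RW
staging buffer and poking the finished image once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Models a read-only/executable region. "pokes" counts how many
 * map + copy + flush cycles a text_poke()-like primitive would perform. */
struct text_region {
	uint8_t *base;
	size_t size;
	unsigned long pokes;
};

/* Hypothetical stand-in for text_poke(): one call == one alias/flush cycle. */
static void sim_text_poke(struct text_region *r, size_t off,
			  const void *src, size_t len)
{
	assert(off <= r->size && len <= r->size - off);
	memcpy(r->base + off, src, len);
	r->pokes++;
}

struct reloc {
	size_t off;	/* offset of the patched word inside the region */
	uint32_t val;	/* value to write there */
};

/* Variant 1: every relocation is poked individually. */
static void apply_relocs_direct(struct text_region *r,
				const struct reloc *rel, size_t n)
{
	for (size_t i = 0; i < n; i++)
		sim_text_poke(r, rel[i].off, &rel[i].val, sizeof(rel[i].val));
}

/* Variant 2: relocations are applied to a separate RW staging buffer of
 * r->size bytes (which the caller has filled with the unpatched image),
 * and the finished image is poked in with a single call. */
static void apply_relocs_staged(struct text_region *r, uint8_t *staging,
				const struct reloc *rel, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		assert(rel[i].off <= r->size &&
		       sizeof(rel[i].val) <= r->size - rel[i].off);
		memcpy(staging + rel[i].off, &rel[i].val, sizeof(rel[i].val));
	}
	sim_text_poke(r, 0, staging, r->size);
}
```

With N relocations the first variant performs N simulated poke cycles and the
second performs one, while the resulting bytes are identical. Whether the real
cost difference matters depends on the actual text_poke() implementation and
workload, which this model does not capture.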


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 04/13] mm/jitalloc, arch: convert remaining overrides of module_alloc to jitalloc
  2023-06-01 10:12 ` [PATCH 04/13] mm/jitalloc, arch: convert remaining " Mike Rapoport
@ 2023-06-01 22:35   ` Song Liu
  0 siblings, 0 replies; 55+ messages in thread
From: Song Liu @ 2023-06-01 22:35 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Naveen N. Rao, Palmer Dabbelt, Russell King, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Thu, Jun 1, 2023 at 3:13 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>
> Extend jitalloc parameters to accommodate more complex overrides of
> module_alloc() by architectures.
>
> This includes specification of a fallback range required by arm, arm64
> and powerpc and support for allocation of KASAN shadow required by
> arm64, s390 and x86.
>
> The core implementation of jit_alloc() takes care of suppressing warnings
> when the initial allocation fails but there is a fallback range defined.
>
> Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>

[...]

>
> diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> index 5af4975caeb5..ecf1f4030317 100644
> --- a/arch/arm64/kernel/module.c
> +++ b/arch/arm64/kernel/module.c
> @@ -17,56 +17,49 @@
>  #include <linux/moduleloader.h>
>  #include <linux/scs.h>
>  #include <linux/vmalloc.h>
> +#include <linux/jitalloc.h>
>  #include <asm/alternative.h>
>  #include <asm/insn.h>
>  #include <asm/scs.h>
>  #include <asm/sections.h>
>
> -void *module_alloc(unsigned long size)
> +static struct jit_alloc_params jit_alloc_params = {
> +       .alignment      = MODULE_ALIGN,
> +       .flags          = JIT_ALLOC_KASAN_SHADOW,
> +};
> +
> +struct jit_alloc_params *jit_alloc_arch_params(void)
>  {
>         u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;

module_alloc_base is initialized in kaslr_init(), which is called after
mm_core_init(). We will need some special logic for this.

Thanks,
Song

> -       gfp_t gfp_mask = GFP_KERNEL;
> -       void *p;
> -
> -       /* Silence the initial allocation */
> -       if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
> -               gfp_mask |= __GFP_NOWARN;
>

[...]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX
  2023-06-01 10:12 ` [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX Mike Rapoport
  2023-06-01 10:30   ` Peter Zijlstra
  2023-06-01 16:54   ` Edgecombe, Rick P
@ 2023-06-01 22:49   ` Song Liu
  2 siblings, 0 replies; 55+ messages in thread
From: Song Liu @ 2023-06-01 22:49 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Naveen N. Rao, Palmer Dabbelt, Russell King, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Thu, Jun 1, 2023 at 3:15 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: Song Liu <song@kernel.org>
>
> Replace direct memory writes to memory allocated for code with text poking
> to allow allocation of executable memory as ROX.
>
> The only exception is arch_prepare_bpf_trampoline() that cannot jit
> directly into module memory yet, so it uses set_memory calls to
> unprotect the memory before writing to it and to protect memory in the
> end.
>
> Signed-off-by: Song Liu <song@kernel.org>
> Co-developed-by: Mike Rapoport (IBM) <rppt@kernel.org>
> Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
> ---
>  arch/x86/kernel/alternative.c | 43 +++++++++++++++++++++++------------
>  arch/x86/kernel/ftrace.c      | 41 +++++++++++++++++++++------------
>  arch/x86/kernel/module.c      | 24 +++++--------------
>  arch/x86/kernel/static_call.c | 10 ++++----
>  arch/x86/kernel/unwind_orc.c  | 13 +++++++----
>  arch/x86/net/bpf_jit_comp.c   | 22 +++++++++++++-----

We need the following in this patch (or before this patch).
Otherwise, the system will crash at the VIRTUAL_BUG_ON()
in vmalloc_to_page().

Thanks,
Song

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index bf954d2721c1..4efa8a795ebc 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1084,7 +1084,7 @@ bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **image_ptr,
                return NULL;
        }

-       *rw_header = kvmalloc(size, GFP_KERNEL);
+       *rw_header = kvzalloc(size, GFP_KERNEL);
        if (!*rw_header) {
                bpf_arch_text_copy(&ro_header->size, &size, sizeof(size));
                bpf_prog_pack_free(ro_header);
@@ -1092,8 +1092,6 @@ bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **image_ptr,
                return NULL;
        }

-       /* Fill space with illegal/arch-dep instructions. */
-       bpf_fill_ill_insns(*rw_header, size);
        (*rw_header)->size = size;

        hole = min_t(unsigned int, size - (proglen + sizeof(*ro_header)),

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX
  2023-06-01 20:50           ` Edgecombe, Rick P
@ 2023-06-01 23:54             ` Nadav Amit
  2023-06-05  2:52               ` Steven Rostedt
  2023-06-04 21:47             ` Kent Overstreet
  1 sibling, 1 reply; 55+ messages in thread
From: Nadav Amit @ 2023-06-01 23:54 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kent.overstreet@linux.dev, Thomas Gleixner, mcgrof@kernel.org,
	deller@gmx.de, netdev@vger.kernel.org, davem@davemloft.net,
	linux@armlinux.org.uk, linux-mips@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, hca@linux.ibm.com,
	catalin.marinas@arm.com, linux-kernel@vger.kernel.org,
	linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org,
	palmer@dabbelt.com, chenhuacai@kernel.org, mpe@ellerman.id.au,
	x86@kernel.org, tsbogend@alpha.franken.de, rppt@kernel.org,
	linux-trace-kernel@vger.kernel.org, linux-parisc@vger.kernel.org,
	christophe.leroy@csgroup.eu, rostedt@goodmis.org, Will Deacon,
	dinguyen@kernel.org, naveen.n.rao@linux.ibm.com,
	sparclinux@vger.kernel.org, linux-modules@vger.kernel.org,
	bpf@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	song@kernel.org, linux-mm@kvack.org, loongarch@lists.linux.dev,
	Andrew Morton



> On Jun 1, 2023, at 1:50 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Thu, 2023-06-01 at 14:38 -0400, Kent Overstreet wrote:
>> On Thu, Jun 01, 2023 at 06:13:44PM +0000, Edgecombe, Rick P wrote:
>>>> text_poke() _does_ create a separate RW mapping.
>>> 
>>> Sorry, I meant a separate RW allocation.
>> 
>> Ah yes, that makes sense
>> 
>> 
>>> 
>>>> 
>>>> The thing that sucks about text_poke() is that it always does a
>>>> full
>>>> TLB
>>>> flush, and AFAICT that's not remotely needed. What it really
>>>> wants to
>>>> be
>>>> doing is conceptually just
>>>> 
>>>> kmap_local()
>>>> mempcy()
>>>> kunmap_loca()
>>>> flush_icache();
>>>> 
>>>> ...except that kmap_local() won't actually create a new mapping
>>>> on
>>>> non-highmem architectures, so text_poke() open codes it.
>>> 
>>> Text poke creates only a local CPU RW mapping. It's more secure
>>> because
>>> other threads can't write to it.
>> 
>> *nod*, same as kmap_local
> 
> It's only used and flushed locally, but it is accessible to all CPUs,
> right?
> 
>> 
>>> It also only needs to flush the local core when it's done since
>>> it's
>>> not using a shared MM.
>>  
>> Ahh! Thanks for that; perhaps the comment in text_poke() about IPIs
>> could be a bit clearer.
>> 
>> What is it (if anything) you don't like about text_poke() then? It
>> looks
>> like it's doing broadly similar things to kmap_local(), so should be
>> in the same ballpark from a performance POV?
> 
> The way text_poke() is used here, it is creating a new writable alias
> and flushing it for *each* write to the module (like for each write of
> an individual relocation, etc). I was just thinking it might warrant
> some batching or something.

I am not advocating to do so, but if you want to have many efficient
writes, perhaps you can just disable CR0.WP. Just saying that if you
are about to write all over the memory, text_poke() does not provide
too much security for the poking thread.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX
  2023-06-01 11:07     ` Mike Rapoport
@ 2023-06-02  0:02       ` Song Liu
  0 siblings, 0 replies; 55+ messages in thread
From: Song Liu @ 2023-06-02  0:02 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Peter Zijlstra, linux-kernel, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Kent Overstreet, Luis Chamberlain,
	Michael Ellerman, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Thu, Jun 1, 2023 at 4:07 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Thu, Jun 01, 2023 at 12:30:50PM +0200, Peter Zijlstra wrote:
> > On Thu, Jun 01, 2023 at 01:12:56PM +0300, Mike Rapoport wrote:
> >
> > > +static void __init_or_module do_text_poke(void *addr, const void *opcode, size_t len)
> > > +{
> > > +   if (system_state < SYSTEM_RUNNING) {
> > > +           text_poke_early(addr, opcode, len);
> > > +   } else {
> > > +           mutex_lock(&text_mutex);
> > > +           text_poke(addr, opcode, len);
> > > +           mutex_unlock(&text_mutex);
> > > +   }
> > > +}
> >
> > So I don't much like do_text_poke(); why?
>
> I believe the idea was to keep memcpy for early boot before the kernel
> image is protected without going and adding if (is_module_text_address())
> all over the place.
>
> I think this can be used instead without updating all the call sites of
> text_poke_early():
>
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 91057de8e6bc..f994e63e9903 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -1458,7 +1458,7 @@ void __init_or_module text_poke_early(void *addr, const void *opcode,
>                  * code cannot be running and speculative code-fetches are
>                  * prevented. Just change the code.
>                  */
> -               memcpy(addr, opcode, len);
> +               text_poke_copy(addr, opcode, len);
>         } else {
>                 local_irq_save(flags);
>                 memcpy(addr, opcode, len);
>

This alone doesn't work, as text_poke_early() is called
before addr is added to the list of module texts. So we
still use memcpy() here.

Thanks,
Song

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/13] mm: jit/text allocator
  2023-06-01 10:12 [PATCH 00/13] mm: jit/text allocator Mike Rapoport
                   ` (13 preceding siblings ...)
  2023-06-01 16:12 ` [PATCH 00/13] mm: jit/text allocator Mark Rutland
@ 2023-06-02  0:36 ` Song Liu
  14 siblings, 0 replies; 55+ messages in thread
From: Song Liu @ 2023-06-02  0:36 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Andrew Morton, Catalin Marinas, Christophe Leroy,
	David S. Miller, Dinh Nguyen, Heiko Carstens, Helge Deller,
	Huacai Chen, Kent Overstreet, Luis Chamberlain, Michael Ellerman,
	Naveen N. Rao, Palmer Dabbelt, Russell King, Steven Rostedt,
	Thomas Bogendoerfer, Thomas Gleixner, Will Deacon, bpf,
	linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Thu, Jun 1, 2023 at 3:13 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>
> Hi,
>
> module_alloc() is used everywhere as a means to allocate memory for code.
>
> Besides being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF to modules
> and puts the burden of code allocation to the modules code.
>
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
>
> This set splits code allocation from modules by introducing
> jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> sites of module_alloc() and module_memfree() with the new APIs and
> implements core text and related allocation in a central place.
>
> Instead of architecture specific overrides for module_alloc(), the
> architectures that require non-default behaviour for text allocation must
> fill jit_alloc_params structure and implement jit_alloc_arch_params() that
> returns a pointer to that structure. If an architecture does not implement
> jit_alloc_arch_params(), the defaults compatible with the current
> modules::module_alloc() are used.
>
> The new jitalloc infrastructure allows decoupling of kprobes and ftrace
> from modules, and most importantly it enables ROX allocations for
> executable memory.

This set does look cleaner than my version [1]. However, this is
partially because this set only separates text and data; while [1]
also separates rw data, ro data, and ro_after_init data. We need
such separation to fully cover module usage, and to remove
VM_FLUSH_RESET_PERMS. Once we add this logic to this
set, the two versions will look similar.

OTOH, I do like the fact this version enables kprobes (and
potentially ftrace and bpf) without CONFIG_MODULES. And
mm/ seems a better home for the logic.

That being said, besides comments in a few patches, this
version looks good to me. With the fix I suggested for patch
12/13, it passed my tests on x86_64 with modules, kprobes,
ftrace, and BPF.

If we decided to ship this version, I would appreciate it if I
could get more credit for my work in [1] and research work
before that.

Thanks,
Song

[1] https://lore.kernel.org/lkml/20230526051529.3387103-1-song@kernel.org/

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/13] mm: jit/text allocator
  2023-06-01 18:14   ` Kent Overstreet
@ 2023-06-02  9:35     ` Mark Rutland
  2023-06-02 18:20       ` Song Liu
  2023-06-05  9:20       ` Mike Rapoport
  0 siblings, 2 replies; 55+ messages in thread
From: Mark Rutland @ 2023-06-02  9:35 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Mike Rapoport, linux-kernel, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Luis Chamberlain, Michael Ellerman,
	Naveen N. Rao, Palmer Dabbelt, Russell King, Song Liu,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote:
> On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote:
> > For a while I have wanted to give kprobes its own allocator so that it can work
> > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in
> > the modules area.
> > 
> > Given that, I think these should have their own allocator functions that can be
> > provided independently, even if those happen to use common infrastructure.
> 
> How much memory can kprobes conceivably use? I think we also want to try
> to push back on combinatorial new allocators, if we can.

That depends on who's using it, and how (e.g. via BPF).

To be clear, I'm not necessarily asking for entirely different allocators, but
I do think that we want wrappers that can at least pass distinct start+end
parameters to a common allocator, and for arm64's modules code I'd expect that
we'd keep the range fallback logic out of the common allocator, and just call
it twice.

> > > Several architectures override module_alloc() because of various
> > > constraints where the executable memory can be located and this causes
> > > additional obstacles for improvements of code allocation.
> > > 
> > > This set splits code allocation from modules by introducing
> > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> > > sites of module_alloc() and module_memfree() with the new APIs and
> > > implements core text and related allocation in a central place.
> > > 
> > > Instead of architecture specific overrides for module_alloc(), the
> > > architectures that require non-default behaviour for text allocation must
> > > fill jit_alloc_params structure and implement jit_alloc_arch_params() that
> > > returns a pointer to that structure. If an architecture does not implement
> > > jit_alloc_arch_params(), the defaults compatible with the current
> > > modules::module_alloc() are used.
> > 
> > As above, I suspect that each of the callsites should probably be using common
> > infrastructure, but I don't think that a single jit_alloc_arch_params() makes
> > sense, since the parameters for each case may need to be distinct.
> 
> I don't see how that follows. The whole point of function parameters is
> that they may be different :)

What I mean is that jit_alloc_arch_params() tries to aggregate common
parameters, but they aren't actually common (e.g. the actual start+end range
for allocation).

> Can you give more detail on what parameters you need? If the only extra
> parameter is just "does this allocation need to live close to kernel
> text", that's not that big of a deal.

My thinking was that we at least need the start + end for each caller. That
might be it, tbh.

Thanks,
Mark.
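
To make the "wrappers passing distinct start+end" idea above concrete, here is
a minimal userspace sketch. It is not the jitalloc API from this series:
struct va_range, range_alloc() and both wrappers are hypothetical names, and a
simple bump allocator stands in for the real vmalloc-based allocation. The
common allocator knows nothing about fallback; the module wrapper just calls
it twice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical address range with a bump pointer.
 * Assumes start > 0 so that 0 can signal failure. */
struct va_range {
	uintptr_t start;
	uintptr_t end;	/* exclusive */
	uintptr_t next;	/* next free address, start <= next <= end */
};

/* Common allocator: carve "size" bytes, aligned to "align" (a power of
 * two), out of the given range. Returns 0 if the range is exhausted. */
static uintptr_t range_alloc(struct va_range *r, size_t size, size_t align)
{
	uintptr_t addr;

	assert(align && (align & (align - 1)) == 0);
	addr = (r->next + align - 1) & ~(uintptr_t)(align - 1);
	if (addr < r->next || addr > r->end || size > r->end - addr)
		return 0;
	r->next = addr + size;
	return addr;
}

/* kprobes wrapper: passes its own range, no fallback. */
static uintptr_t kprobes_text_alloc(struct va_range *kprobes, size_t size)
{
	return range_alloc(kprobes, size, 16);
}

/* Module wrapper: the fallback policy lives here, outside the common
 * allocator, which is simply called a second time with another range. */
static uintptr_t module_text_alloc(struct va_range *preferred,
				   struct va_range *fallback, size_t size)
{
	uintptr_t addr = range_alloc(preferred, size, 4096);

	if (!addr)
		addr = range_alloc(fallback, size, 4096);
	return addr;
}
```

In this shape each caller owns its start and end, and only the callers that
need a second range pay for the fallback logic.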

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/13] mm: jit/text allocator
  2023-06-02  9:35     ` Mark Rutland
@ 2023-06-02 18:20       ` Song Liu
  2023-06-03 21:11         ` Puranjay Mohan
  2023-06-04 18:02         ` Kent Overstreet
  2023-06-05  9:20       ` Mike Rapoport
  1 sibling, 2 replies; 55+ messages in thread
From: Song Liu @ 2023-06-02 18:20 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Kent Overstreet, Mike Rapoport, linux-kernel, Andrew Morton,
	Catalin Marinas, Christophe Leroy, David S. Miller, Dinh Nguyen,
	Heiko Carstens, Helge Deller, Huacai Chen, Luis Chamberlain,
	Michael Ellerman, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86, Puranjay Mohan

On Fri, Jun 2, 2023 at 2:35 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote:
> > On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote:
> > > For a while I have wanted to give kprobes its own allocator so that it can work
> > > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in
> > > the modules area.
> > >
> > > Given that, I think these should have their own allocator functions that can be
> > > provided independently, even if those happen to use common infrastructure.
> >
> > How much memory can kprobes conceivably use? I think we also want to try
> > to push back on combinatorial new allocators, if we can.
>
> That depends on who's using it, and how (e.g. via BPF).
>
> To be clear, I'm not necessarily asking for entirely different allocators, but
> I do think that we want wrappers that can at least pass distinct start+end
> parameters to a common allocator, and for arm64's modules code I'd expect that
> we'd keep the range fallback logic out of the common allocator, and just call
> it twice.
>
> > > > Several architectures override module_alloc() because of various
> > > > constraints where the executable memory can be located and this causes
> > > > additional obstacles for improvements of code allocation.
> > > >
> > > > This set splits code allocation from modules by introducing
> > > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> > > > sites of module_alloc() and module_memfree() with the new APIs and
> > > > implements core text and related allocation in a central place.
> > > >
> > > > Instead of architecture specific overrides for module_alloc(), the
> > > > architectures that require non-default behaviour for text allocation must
> > > > fill jit_alloc_params structure and implement jit_alloc_arch_params() that
> > > > returns a pointer to that structure. If an architecture does not implement
> > > > jit_alloc_arch_params(), the defaults compatible with the current
> > > > modules::module_alloc() are used.
> > >
> > > As above, I suspect that each of the callsites should probably be using common
> > > infrastructure, but I don't think that a single jit_alloc_arch_params() makes
> > > sense, since the parameters for each case may need to be distinct.
> >
> > I don't see how that follows. The whole point of function parameters is
> > that they may be different :)
>
> What I mean is that jit_alloc_arch_params() tries to aggregate common
> parameters, but they aren't actually common (e.g. the actual start+end range
> for allocation).
>
> > Can you give more detail on what parameters you need? If the only extra
> > parameter is just "does this allocation need to live close to kernel
> > text", that's not that big of a deal.
>
> My thinking was that we at least need the start + end for each caller. That
> might be it, tbh.

IIUC, arm64 uses VMALLOC address space for BPF programs. The reason
is each BPF program uses at least 64kB (one page) out of the 128MB
address space. Puranjay Mohan (CC'ed) is working on enabling
bpf_prog_pack for arm64. Once this work is done, multiple BPF programs
will be able to share a page. Will this improvement remove the need to
specify a different address range for BPF programs?

Thanks,
Song
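
The address space pressure described above can be put into numbers with a
small sketch. The 128MB and 64kB figures come from the message; the helper
name is hypothetical, and the 4kB per-program size used in the note below is
purely an assumed value for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on how many programs fit in a region when each one
 * occupies at least "min_alloc" bytes of it. */
static size_t max_programs(size_t region_size, size_t min_alloc)
{
	assert(min_alloc > 0);
	return region_size / min_alloc;
}
```

With one 64kB page per program, max_programs(128UL << 20, 64UL << 10) gives
2048 programs for the whole 128MB space. If packing let programs share pages
and a program averaged 4kB (an assumption, not a measured figure), the same
space would hold up to 32768.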

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/13] mm: jit/text allocator
  2023-06-02 18:20       ` Song Liu
@ 2023-06-03 21:11         ` Puranjay Mohan
  2023-06-04 18:02         ` Kent Overstreet
  1 sibling, 0 replies; 55+ messages in thread
From: Puranjay Mohan @ 2023-06-03 21:11 UTC (permalink / raw)
  To: Song Liu
  Cc: Mark Rutland, Kent Overstreet, Mike Rapoport, linux-kernel,
	Andrew Morton, Catalin Marinas, Christophe Leroy, David S. Miller,
	Dinh Nguyen, Heiko Carstens, Helge Deller, Huacai Chen,
	Luis Chamberlain, Michael Ellerman, Naveen N. Rao, Palmer Dabbelt,
	Russell King, Steven Rostedt, Thomas Bogendoerfer,
	Thomas Gleixner, Will Deacon, bpf, linux-arm-kernel, linux-mips,
	linux-mm, linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86

On Fri, Jun 2, 2023 at 8:21 PM Song Liu <song@kernel.org> wrote:
>
> On Fri, Jun 2, 2023 at 2:35 AM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote:
> > > On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote:
> > > > For a while I have wanted to give kprobes its own allocator so that it can work
> > > > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in
> > > > the modules area.
> > > >
> > > > Given that, I think these should have their own allocator functions that can be
> > > > provided independently, even if those happen to use common infrastructure.
> > >
> > > How much memory can kprobes conceivably use? I think we also want to try
> > > to push back on combinatorial new allocators, if we can.
> >
> > That depends on who's using it, and how (e.g. via BPF).
> >
> > To be clear, I'm not necessarily asking for entirely different allocators, but
> > I do think that we want wrappers that can at least pass distinct start+end
> > parameters to a common allocator, and for arm64's modules code I'd expect that
> > we'd keep the range fallback logic out of the common allocator, and just call
> > it twice.
> >
> > > > > Several architectures override module_alloc() because of various
> > > > > constraints where the executable memory can be located and this causes
> > > > > additional obstacles for improvements of code allocation.
> > > > >
> > > > > This set splits code allocation from modules by introducing
> > > > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> > > > > sites of module_alloc() and module_memfree() with the new APIs and
> > > > > implements core text and related allocation in a central place.
> > > > >
> > > > > Instead of architecture specific overrides for module_alloc(), the
> > > > > architectures that require non-default behaviour for text allocation must
> > > > > fill jit_alloc_params structure and implement jit_alloc_arch_params() that
> > > > > returns a pointer to that structure. If an architecture does not implement
> > > > > jit_alloc_arch_params(), the defaults compatible with the current
> > > > > modules::module_alloc() are used.
> > > >
> > > > As above, I suspect that each of the callsites should probably be using common
> > > > infrastructure, but I don't think that a single jit_alloc_arch_params() makes
> > > > sense, since the parameters for each case may need to be distinct.
> > >
> > > I don't see how that follows. The whole point of function parameters is
> > > that they may be different :)
> >
> > What I mean is that jit_alloc_arch_params() tries to aggregate common
> > parameters, but they aren't actually common (e.g. the actual start+end range
> > for allocation).
> >
> > > Can you give more detail on what parameters you need? If the only extra
> > > parameter is just "does this allocation need to live close to kernel
> > > text", that's not that big of a deal.
> >
> > My thinking was that we at least need the start + end for each caller. That
> > might be it, tbh.
>
> IIUC, arm64 uses VMALLOC address space for BPF programs. The reason
> is each BPF program uses at least 64kB (one page) out of the 128MB
> address space. Puranjay Mohan (CC'ed) is working on enabling
> bpf_prog_pack for arm64. Once this work is done, multiple BPF programs
> will be able to share a page. Will this improvement remove the need to
> specify a different address range for BPF programs?

Hi,
Thanks for adding me to the conversation.

The ARM64 BPF JIT used to allocate the memory using module_alloc but it
was not optimal because BPF programs and modules were sharing the 128 MB
module region. This was fixed by
91fc957c9b1d ("arm64/bpf: don't allocate BPF JIT programs in module memory")
It created a dedicated 128 MB region set aside for BPF programs.

But 128MB could get exhausted especially where PAGE_SIZE is 64KB - one
page is needed per program. This restriction was removed by
b89ddf4cca43 ("arm64/bpf: Remove 128MB limit for BPF JIT programs")

So, currently BPF programs are using a full page from vmalloc (4 KB,
16 KB, or 64 KB). This wastes memory and also causes iTLB pressure.
Enabling bpf_prog_pack for ARM64 would fix it. I am doing some final
tests and will send the patches in 1-2 days.
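
To put rough numbers on the page-per-program cost described above, here is
a small sketch. The helper names are made up for illustration; only the
128 MB region size and the 4 KB / 64 KB page sizes come from this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Round a request up to whole pages, as a page-granular allocator does. */
static size_t page_rounded_bytes(size_t size, size_t page_size)
{
	assert(page_size != 0);
	return (size + page_size - 1) / page_size * page_size;
}

/* Upper bound on programs in a region when each one takes at least a page. */
static size_t max_programs(size_t region_bytes, size_t page_size)
{
	assert(page_size != 0);
	return region_bytes / page_size;
}
```

With 64 KB pages, max_programs(128 MB, 64 KB) is 2048, and a 300-byte
program still occupies a full page.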

Thanks,
Puranjay

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/13] mm: jit/text allocator
  2023-06-02 18:20       ` Song Liu
  2023-06-03 21:11         ` Puranjay Mohan
@ 2023-06-04 18:02         ` Kent Overstreet
  2023-06-04 21:22           ` Song Liu
  1 sibling, 1 reply; 55+ messages in thread
From: Kent Overstreet @ 2023-06-04 18:02 UTC (permalink / raw)
  To: Song Liu
  Cc: Mark Rutland, Mike Rapoport, linux-kernel, Andrew Morton,
	Catalin Marinas, Christophe Leroy, David S. Miller, Dinh Nguyen,
	Heiko Carstens, Helge Deller, Huacai Chen, Luis Chamberlain,
	Michael Ellerman, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86, Puranjay Mohan

On Fri, Jun 02, 2023 at 11:20:58AM -0700, Song Liu wrote:
> IIUC, arm64 uses VMALLOC address space for BPF programs. The reason
> is each BPF program uses at least 64kB (one page) out of the 128MB
> address space. Puranjay Mohan (CC'ed) is working on enabling
> bpf_prog_pack for arm64. Once this work is done, multiple BPF programs
> will be able to share a page. Will this improvement remove the need to
> specify a different address range for BPF programs?

Can we please stop working on BPF specific sub page allocation and focus
on doing this in mm/? This never should have been in BPF in the first
place.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/13] mm: jit/text allocator
  2023-06-04 18:02         ` Kent Overstreet
@ 2023-06-04 21:22           ` Song Liu
  2023-06-04 21:40             ` Kent Overstreet
  0 siblings, 1 reply; 55+ messages in thread
From: Song Liu @ 2023-06-04 21:22 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Mark Rutland, Mike Rapoport, linux-kernel, Andrew Morton,
	Catalin Marinas, Christophe Leroy, David S. Miller, Dinh Nguyen,
	Heiko Carstens, Helge Deller, Huacai Chen, Luis Chamberlain,
	Michael Ellerman, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86, Puranjay Mohan

On Sun, Jun 4, 2023 at 11:02 AM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Fri, Jun 02, 2023 at 11:20:58AM -0700, Song Liu wrote:
> > IIUC, arm64 uses VMALLOC address space for BPF programs. The reason
> > is each BPF program uses at least 64kB (one page) out of the 128MB
> > address space. Puranjay Mohan (CC'ed) is working on enabling
> > bpf_prog_pack for arm64. Once this work is done, multiple BPF programs
> > will be able to share a page. Will this improvement remove the need to
> > specify a different address range for BPF programs?
>
> Can we please stop working on BPF specific sub page allocation and focus
> on doing this in mm/? This never should have been in BPF in the first
> place.

That work is mostly independent of the allocator work we are discussing here.
The goal of Puranjay's work is to enable the arm64 BPF JIT engine to use a
ROX allocator. The allocator could be the bpf_prog_pack allocator, or jitalloc,
or module_alloc_type. Puranjay is using bpf_prog_alloc for now. But once
jitalloc or module_alloc_type (either one) is merged, we will migrate BPF
JIT engines (x86_64 and arm64) to the new allocator and then tear down
bpf_prog_pack.

Does this make sense?

Thanks,
Song

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/13] mm: jit/text allocator
  2023-06-04 21:22           ` Song Liu
@ 2023-06-04 21:40             ` Kent Overstreet
  2023-06-05  4:05               ` Song Liu
  0 siblings, 1 reply; 55+ messages in thread
From: Kent Overstreet @ 2023-06-04 21:40 UTC (permalink / raw)
  To: Song Liu
  Cc: Mark Rutland, Mike Rapoport, linux-kernel, Andrew Morton,
	Catalin Marinas, Christophe Leroy, David S. Miller, Dinh Nguyen,
	Heiko Carstens, Helge Deller, Huacai Chen, Luis Chamberlain,
	Michael Ellerman, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86, Puranjay Mohan

On Sun, Jun 04, 2023 at 02:22:30PM -0700, Song Liu wrote:
> On Sun, Jun 4, 2023 at 11:02 AM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Fri, Jun 02, 2023 at 11:20:58AM -0700, Song Liu wrote:
> > > IIUC, arm64 uses VMALLOC address space for BPF programs. The reason
> > > is each BPF program uses at least 64kB (one page) out of the 128MB
> > > address space. Puranjay Mohan (CC'ed) is working on enabling
> > > bpf_prog_pack for arm64. Once this work is done, multiple BPF programs
> > > will be able to share a page. Will this improvement remove the need to
> > > specify a different address range for BPF programs?
> >
> > Can we please stop working on BPF specific sub page allocation and focus
> > on doing this in mm/? This never should have been in BPF in the first
> > place.
> 
> That work is mostly independent of the allocator work we are discussing here.
> The goal of Puranjay's work is to enable the arm64 BPF JIT engine to use a
> ROX allocator. The allocator could be the bpf_prog_pack allocator, or jitalloc,
> or module_alloc_type. Puranjay is using bpf_prog_alloc for now. But once
> jitalloc or module_alloc_type (either one) is merged, we will migrate BPF
> JIT engines (x86_64 and arm64) to the new allocator and then tear down
> bpf_prog_pack.
> 
> Does this make sense?

Yeah, as long as that's the plan. Maybe one of you could tell us what
issues were preventing prog_pack from being used in the first place, it
might be relevant - this is the time to get the new allocator API right.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX
  2023-06-01 20:50           ` Edgecombe, Rick P
  2023-06-01 23:54             ` Nadav Amit
@ 2023-06-04 21:47             ` Kent Overstreet
  1 sibling, 0 replies; 55+ messages in thread
From: Kent Overstreet @ 2023-06-04 21:47 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: tglx@linutronix.de, mcgrof@kernel.org, deller@gmx.de,
	netdev@vger.kernel.org, davem@davemloft.net,
	linux@armlinux.org.uk, linux-mips@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, hca@linux.ibm.com,
	catalin.marinas@arm.com, linux-kernel@vger.kernel.org,
	linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org,
	palmer@dabbelt.com, chenhuacai@kernel.org, mpe@ellerman.id.au,
	x86@kernel.org, tsbogend@alpha.franken.de, rppt@kernel.org,
	linux-trace-kernel@vger.kernel.org, linux-parisc@vger.kernel.org,
	christophe.leroy@csgroup.eu, rostedt@goodmis.org, will@kernel.org,
	dinguyen@kernel.org, naveen.n.rao@linux.ibm.com,
	sparclinux@vger.kernel.org, linux-modules@vger.kernel.org,
	bpf@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	song@kernel.org, linux-mm@kvack.org, loongarch@lists.linux.dev,
	akpm@linux-foundation.org

On Thu, Jun 01, 2023 at 08:50:39PM +0000, Edgecombe, Rick P wrote:
> > Ahh! Thanks for that; perhaps the comment in text_poke() about IPIs
> > could be a bit clearer.
> > 
> > What is it (if anything) you don't like about text_poke() then? It looks
> > like it's doing broadly similar things to kmap_local(), so should be
> > in the same ballpark from a performance POV?
> 
> The way text_poke() is used here, it is creating a new writable alias
> and flushing it for *each* write to the module (like for each write of
> an individual relocation, etc). I was just thinking it might warrant
> some batching or something.

Ah, I see. A kmap_local type interface might get us that kind of
batching, if it supported mapping compound pages - currently kmap_local
still only maps single pages, but with folios getting plumbed around I
assume someone will make it handle compound pages eventually.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX
  2023-06-01 23:54             ` Nadav Amit
@ 2023-06-05  2:52               ` Steven Rostedt
  2023-06-05  8:11                 ` Mike Rapoport
  0 siblings, 1 reply; 55+ messages in thread
From: Steven Rostedt @ 2023-06-05  2:52 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Edgecombe, Rick P, kent.overstreet@linux.dev, Thomas Gleixner,
	mcgrof@kernel.org, deller@gmx.de, netdev@vger.kernel.org,
	davem@davemloft.net, linux@armlinux.org.uk,
	linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	hca@linux.ibm.com, catalin.marinas@arm.com,
	linux-kernel@vger.kernel.org, linux-riscv@lists.infradead.org,
	linux-s390@vger.kernel.org, palmer@dabbelt.com,
	chenhuacai@kernel.org, mpe@ellerman.id.au, x86@kernel.org,
	tsbogend@alpha.franken.de, rppt@kernel.org,
	linux-trace-kernel@vger.kernel.org, linux-parisc@vger.kernel.org,
	christophe.leroy@csgroup.eu, Will Deacon, dinguyen@kernel.org,
	naveen.n.rao@linux.ibm.com, sparclinux@vger.kernel.org,
	linux-modules@vger.kernel.org, bpf@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, song@kernel.org,
	linux-mm@kvack.org, loongarch@lists.linux.dev, Andrew Morton

On Thu, 1 Jun 2023 16:54:36 -0700
Nadav Amit <nadav.amit@gmail.com> wrote:

> > The way text_poke() is used here, it is creating a new writable alias
> > and flushing it for *each* write to the module (like for each write of
> > an individual relocation, etc). I was just thinking it might warrant
> > some batching or something.  

Batching does exist, which is what the text_poke_queue() thing does.

-- Steve


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/13] mm: jit/text allocator
  2023-06-04 21:40             ` Kent Overstreet
@ 2023-06-05  4:05               ` Song Liu
  0 siblings, 0 replies; 55+ messages in thread
From: Song Liu @ 2023-06-05  4:05 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Mark Rutland, Mike Rapoport, linux-kernel, Andrew Morton,
	Catalin Marinas, Christophe Leroy, David S. Miller, Dinh Nguyen,
	Heiko Carstens, Helge Deller, Huacai Chen, Luis Chamberlain,
	Michael Ellerman, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86, Puranjay Mohan

On Sun, Jun 4, 2023 at 2:40 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Sun, Jun 04, 2023 at 02:22:30PM -0700, Song Liu wrote:
> > On Sun, Jun 4, 2023 at 11:02 AM Kent Overstreet
> > <kent.overstreet@linux.dev> wrote:
> > >
> > > On Fri, Jun 02, 2023 at 11:20:58AM -0700, Song Liu wrote:
> > > > IIUC, arm64 uses VMALLOC address space for BPF programs. The reason
> > > > is each BPF program uses at least 64kB (one page) out of the 128MB
> > > > address space. Puranjay Mohan (CC'ed) is working on enabling
> > > > bpf_prog_pack for arm64. Once this work is done, multiple BPF programs
> > > > will be able to share a page. Will this improvement remove the need to
> > > > specify a different address range for BPF programs?
> > >
> > > Can we please stop working on BPF specific sub page allocation and focus
> > > on doing this in mm/? This never should have been in BPF in the first
> > > place.
> >
> > That work is mostly independent of the allocator work we are discussing here.
> > The goal of Puranjay's work is to enable the arm64 BPF JIT engine to use a
> > ROX allocator. The allocator could be the bpf_prog_pack allocator, or jitalloc,
> > or module_alloc_type. Puranjay is using bpf_prog_alloc for now. But once
> > jitalloc or module_alloc_type (either one) is merged, we will migrate BPF
> > JIT engines (x86_64 and arm64) to the new allocator and then tear down
> > bpf_prog_pack.
> >
> > Does this make sense?
>
> Yeah, as long as that's the plan. Maybe one of you could tell us what
> issues were preventing prog_pack from being used in the first place, it
> might be relevant - this is the time to get the new allocator API right.

The JIT engine does a lot of writes. Instead of doing many text_poke()
calls, we write to a temporary RW buffer and then do text_poke_copy()
at the end. To make this work, the JIT engine needs to be able to
handle an RW temporary buffer and an RO final memory region. Nothing
prevents prog_pack from working; we just need to do the work.
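
A minimal user-space sketch of that pattern, under stated assumptions:
final_copy() is a hypothetical stand-in for text_poke_copy() and is just
memcpy() here, and jit_emit() is an invented name, not the BPF JIT's
actual entry point.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for text_poke_copy(): one bulk copy into the
 * final region. In user space this is plain memcpy(). */
static void final_copy(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
}

/* Emit len bytes one at a time into an RW staging buffer, then publish
 * them to final_image with a single copy. Returns 0 on success, -1 if
 * the staging buffer cannot be allocated. */
static int jit_emit(uint8_t *final_image, size_t len)
{
	uint8_t *rw;
	size_t i;

	assert(final_image != NULL);
	rw = malloc(len);
	if (!rw)
		return -1;
	for (i = 0; i < len; i++)
		rw[i] = (uint8_t)(i & 0xff);	/* many small writes, all to RW memory */
	final_copy(final_image, rw, len);	/* one write to the final region */
	free(rw);
	return 0;
}
```

The point of the split is that the final region is only ever written
once per program, no matter how many passes the JIT makes over the
staging buffer.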

Thanks,
Song

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX
  2023-06-05  2:52               ` Steven Rostedt
@ 2023-06-05  8:11                 ` Mike Rapoport
  2023-06-05 16:10                   ` Edgecombe, Rick P
  0 siblings, 1 reply; 55+ messages in thread
From: Mike Rapoport @ 2023-06-05  8:11 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Nadav Amit, Edgecombe, Rick P, kent.overstreet@linux.dev,
	Thomas Gleixner, mcgrof@kernel.org, deller@gmx.de,
	netdev@vger.kernel.org, davem@davemloft.net,
	linux@armlinux.org.uk, linux-mips@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, hca@linux.ibm.com,
	catalin.marinas@arm.com, linux-kernel@vger.kernel.org,
	linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org,
	palmer@dabbelt.com, chenhuacai@kernel.org, mpe@ellerman.id.au,
	x86@kernel.org, tsbogend@alpha.franken.de,
	linux-trace-kernel@vger.kernel.org, linux-parisc@vger.kernel.org,
	christophe.leroy@csgroup.eu, Will Deacon, dinguyen@kernel.org,
	naveen.n.rao@linux.ibm.com, sparclinux@vger.kernel.org,
	linux-modules@vger.kernel.org, bpf@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, song@kernel.org,
	linux-mm@kvack.org, loongarch@lists.linux.dev, Andrew Morton

On Sun, Jun 04, 2023 at 10:52:44PM -0400, Steven Rostedt wrote:
> On Thu, 1 Jun 2023 16:54:36 -0700
> Nadav Amit <nadav.amit@gmail.com> wrote:
> 
> > > The way text_poke() is used here, it is creating a new writable alias
> > > and flushing it for *each* write to the module (like for each write of
> > > an individual relocation, etc). I was just thinking it might warrant
> > > some batching or something.  

> > I am not advocating to do so, but if you want to have many efficient
> > writes, perhaps you can just disable CR0.WP. Just saying that if you
> > are about to write all over the memory, text_poke() does not provide
> > too much security for the poking thread.

Heh, this is definitely an easier hack to implement :)

> Batching does exist, which is what the text_poke_queue() thing does.

For module loading, text_poke_queue() will still be much slower than a bunch
of memset()s, and for no good reason: we don't need all the complexity of
text_poke_bp_batch() for module initialization because we are sure we are
not patching live code.

What we'd need here is a new batching mode that will create a writable
alias mapping at the beginning of apply_relocate_*() and module_finalize(),
then use memcpy() to write through that alias, and tear the mapping down
at the end.
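
As a rough user-space analogy only (not kernel code), the begin / write /
end shape of such a batching mode could look like the sketch below. Two
shared mappings of one temporary file stand in for the read-only text and
its writable alias; all struct and function names are made up.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

struct alias_batch {
	void *ro;	/* long-lived read-only view, stands in for ROX text */
	void *rw;	/* temporary writable alias of the same pages */
	size_t len;
	int fd;
};

/* Create the backing pages and the read-only view. */
static int alias_init(struct alias_batch *b, size_t len)
{
	char path[] = "/tmp/alias-XXXXXX";

	b->fd = mkstemp(path);
	if (b->fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(b->fd, (off_t)len) < 0) {
		close(b->fd);
		return -1;
	}
	b->ro = mmap(NULL, len, PROT_READ, MAP_SHARED, b->fd, 0);
	if (b->ro == MAP_FAILED) {
		close(b->fd);
		return -1;
	}
	b->rw = NULL;
	b->len = len;
	return 0;
}

/* Start of the batch: map the writable alias once. */
static int alias_begin(struct alias_batch *b)
{
	b->rw = mmap(NULL, b->len, PROT_READ | PROT_WRITE, MAP_SHARED, b->fd, 0);
	if (b->rw == MAP_FAILED) {
		b->rw = NULL;
		return -1;
	}
	return 0;
}

/* Any number of plain memcpy() writes through the alias. */
static void alias_write(struct alias_batch *b, size_t off, const void *src, size_t n)
{
	assert(b->rw != NULL && off + n <= b->len);
	memcpy((char *)b->rw + off, src, n);
}

/* End of the batch: tear the alias down; only the read-only view remains. */
static void alias_end(struct alias_batch *b)
{
	munmap(b->rw, b->len);
	b->rw = NULL;
}
```

The alias is created and destroyed once per batch instead of once per
write, which is the saving being discussed.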

Another option is to teach alternatives to update a writable copy rather
than make in-place changes, as Song suggested. My feeling is that it would
be a more intrusive change, though.

> -- Steve
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/13] mm: jit/text allocator
  2023-06-02  9:35     ` Mark Rutland
  2023-06-02 18:20       ` Song Liu
@ 2023-06-05  9:20       ` Mike Rapoport
  2023-06-05 10:09         ` Mark Rutland
  2023-06-05 21:13         ` Kent Overstreet
  1 sibling, 2 replies; 55+ messages in thread
From: Mike Rapoport @ 2023-06-05  9:20 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Kent Overstreet, linux-kernel, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Luis Chamberlain, Michael Ellerman,
	Naveen N. Rao, Palmer Dabbelt, Russell King, Song Liu,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Fri, Jun 02, 2023 at 10:35:09AM +0100, Mark Rutland wrote:
> On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote:
> > On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote:
> > > For a while I have wanted to give kprobes its own allocator so that it can work
> > > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in
> > > the modules area.
> > > 
> > > Given that, I think these should have their own allocator functions that can be
> > > provided independently, even if those happen to use common infrastructure.
> > 
> > How much memory can kprobes conceivably use? I think we also want to try
> > to push back on combinatorial new allocators, if we can.
> 
> That depends on who's using it, and how (e.g. via BPF).
> 
> To be clear, I'm not necessarily asking for entirely different allocators, but
> > I do think that we want wrappers that can at least pass distinct start+end
> > parameters to a common allocator, and for arm64's modules code I'd expect that
> > we'd keep the range fallback logic out of the common allocator, and just call
> it twice.
> 
> > > > Several architectures override module_alloc() because of various
> > > > constraints where the executable memory can be located and this causes
> > > > additional obstacles for improvements of code allocation.
> > > > 
> > > > This set splits code allocation from modules by introducing
> > > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> > > > sites of module_alloc() and module_memfree() with the new APIs and
> > > > implements core text and related allocation in a central place.
> > > > 
> > > > Instead of architecture specific overrides for module_alloc(), the
> > > > architectures that require non-default behaviour for text allocation must
> > > > fill jit_alloc_params structure and implement jit_alloc_arch_params() that
> > > > returns a pointer to that structure. If an architecture does not implement
> > > > jit_alloc_arch_params(), the defaults compatible with the current
> > > > modules::module_alloc() are used.
> > > 
> > > As above, I suspect that each of the callsites should probably be using common
> > > infrastructure, but I don't think that a single jit_alloc_arch_params() makes
> > > sense, since the parameters for each case may need to be distinct.
> > 
> > I don't see how that follows. The whole point of function parameters is
> > that they may be different :)
> 
> What I mean is that jit_alloc_arch_params() tries to aggregate common
> parameters, but they aren't actually common (e.g. the actual start+end range
> for allocation).

jit_alloc_arch_params() tries to aggregate architecture constraints and
requirements for allocations of executable memory and this is exactly what
the first 6 patches of this set do.

A while ago Thomas suggested to use a structure that parametrizes
architecture constraints by the memory type used in modules [1] and Song
implemented the infrastructure for it and x86 part [2].

I liked the idea of defining parameters in a single structure, but I
thought that approaching the problem from the arch side rather than from
the modules perspective would be a better starting point, hence these patches.

I don't see a fundamental reason why a single structure cannot describe
what is needed for different code allocation cases, be it modules, kprobes
or bpf. There is of course an assumption that the core allocations will be
the same for all the users, and it seems to me that something like 

* allocate physical memory if allocator caches are empty
* map it in vmalloc or modules address space
* return memory from the allocator cache to the caller

will work for all usecases.

We might need separate caches for different cases on different
architectures, and a way to specify what cache should be used in the
allocator API, but that does not contradict a single structure for arch
specific parameters, but only makes it more elaborate, e.g. something like

enum jit_type {
	JIT_MODULES_TEXT,
	JIT_MODULES_DATA,
	JIT_KPROBES,
	JIT_FTRACE,
	JIT_BPF,
	JIT_TYPE_MAX,
};

struct jit_alloc_params {
	struct jit_range	ranges[JIT_TYPE_MAX];
	/* ... */
};

> > Can you give more detail on what parameters you need? If the only extra
> > parameter is just "does this allocation need to live close to kernel
> > text", that's not that big of a deal.
> 
> My thinking was that we at least need the start + end for each caller. That
> might be it, tbh.

Do you mean that modules will have something like

	jit_text_alloc(size, MODULES_START, MODULES_END);

and kprobes will have

	jit_text_alloc(size, KPROBES_START, KPROBES_END);
?

It can still be achieved with a single jit_alloc_arch_params(), just by
adding an enum jit_type parameter to jit_text_alloc().
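
A minimal sketch of that lookup, assuming struct jit_range is simply a
start/end pair (its fields are not spelled out above). The addresses are
made up, and jit_range_for() is a hypothetical helper, not part of the
posted patches.

```c
#include <assert.h>
#include <stddef.h>

enum jit_type {
	JIT_MODULES_TEXT,
	JIT_MODULES_DATA,
	JIT_KPROBES,
	JIT_FTRACE,
	JIT_BPF,
	JIT_TYPE_MAX,
};

/* Field names are an assumption for this sketch. */
struct jit_range {
	unsigned long start;
	unsigned long end;
};

struct jit_alloc_params {
	struct jit_range ranges[JIT_TYPE_MAX];
};

/* Made-up ranges standing in for what an architecture would fill in. */
static const struct jit_alloc_params example_params = {
	.ranges = {
		[JIT_MODULES_TEXT] = { 0x10000, 0x20000 },
		[JIT_MODULES_DATA] = { 0x20000, 0x30000 },
		[JIT_KPROBES]      = { 0x40000, 0x48000 },
		[JIT_FTRACE]       = { 0x48000, 0x50000 },
		[JIT_BPF]          = { 0x80000, 0xc0000 },
	},
};

static const struct jit_alloc_params *jit_alloc_arch_params(void)
{
	return &example_params;
}

/* Pick the range for one user; NULL if the type is out of bounds. */
static const struct jit_range *jit_range_for(enum jit_type type)
{
	const struct jit_alloc_params *p = jit_alloc_arch_params();

	assert(p != NULL);
	if ((int)type < 0 || type >= JIT_TYPE_MAX)
		return NULL;
	return &p->ranges[type];
}
```

A jit_text_alloc(size, type) built on this would look up the range by
type and pass its start and end to the common allocator.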

[1] https://lore.kernel.org/linux-mm/87v8mndy3y.ffs@tglx/ 
[2] https://lore.kernel.org/all/20230526051529.3387103-1-song@kernel.org

> Thanks,
> Mark.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 00/13] mm: jit/text allocator
  2023-06-05  9:20       ` Mike Rapoport
@ 2023-06-05 10:09         ` Mark Rutland
  2023-06-06 10:16           ` Mike Rapoport
                             ` (2 more replies)
  2023-06-05 21:13         ` Kent Overstreet
  1 sibling, 3 replies; 55+ messages in thread
From: Mark Rutland @ 2023-06-05 10:09 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Kent Overstreet, linux-kernel, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Luis Chamberlain, Michael Ellerman,
	Naveen N. Rao, Palmer Dabbelt, Russell King, Song Liu,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Mon, Jun 05, 2023 at 12:20:40PM +0300, Mike Rapoport wrote:
> On Fri, Jun 02, 2023 at 10:35:09AM +0100, Mark Rutland wrote:
> > On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote:
> > > On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote:
> > > > For a while I have wanted to give kprobes its own allocator so that it can work
> > > > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in
> > > > the modules area.
> > > > 
> > > > Given that, I think these should have their own allocator functions that can be
> > > > provided independently, even if those happen to use common infrastructure.
> > > 
> > > How much memory can kprobes conceivably use? I think we also want to try
> > > to push back on combinatorial new allocators, if we can.
> > 
> > That depends on who's using it, and how (e.g. via BPF).
> > 
> > To be clear, I'm not necessarily asking for entirely different allocators, but
> > > I do think that we want wrappers that can at least pass distinct start+end
> > > parameters to a common allocator, and for arm64's modules code I'd expect that
> > > we'd keep the range fallback logic out of the common allocator, and just call
> > it twice.
> > 
> > > > > Several architectures override module_alloc() because of various
> > > > > constraints where the executable memory can be located and this causes
> > > > > additional obstacles for improvements of code allocation.
> > > > > 
> > > > > This set splits code allocation from modules by introducing
> > > > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> > > > > sites of module_alloc() and module_memfree() with the new APIs and
> > > > > implements core text and related allocation in a central place.
> > > > > 
> > > > > Instead of architecture specific overrides for module_alloc(), the
> > > > > architectures that require non-default behaviour for text allocation must
> > > > > fill jit_alloc_params structure and implement jit_alloc_arch_params() that
> > > > > returns a pointer to that structure. If an architecture does not implement
> > > > > jit_alloc_arch_params(), the defaults compatible with the current
> > > > > modules::module_alloc() are used.
> > > > 
> > > > As above, I suspect that each of the callsites should probably be using common
> > > > infrastructure, but I don't think that a single jit_alloc_arch_params() makes
> > > > sense, since the parameters for each case may need to be distinct.
> > > 
> > > I don't see how that follows. The whole point of function parameters is
> > > that they may be different :)
> > 
> > What I mean is that jit_alloc_arch_params() tries to aggregate common
> > parameters, but they aren't actually common (e.g. the actual start+end range
> > for allocation).
> 
> jit_alloc_arch_params() tries to aggregate architecture constraints and
> requirements for allocations of executable memory and this is exactly what
> the first 6 patches of this set do.
> 
> A while ago Thomas suggested to use a structure that parametrizes
> architecture constraints by the memory type used in modules [1] and Song
> implemented the infrastructure for it and x86 part [2].
> 
> I liked the idea of defining parameters in a single structure, but I
> thought that approaching the problem from the arch side rather than from
> the modules perspective would be a better starting point, hence these patches.
> 
> I don't see a fundamental reason why a single structure cannot describe
> what is needed for different code allocation cases, be it modules, kprobes
> or bpf. There is of course an assumption that the core allocations will be
> the same for all the users, and it seems to me that something like 
> 
> * allocate physical memory if allocator caches are empty
> * map it in vmalloc or modules address space
> * return memory from the allocator cache to the caller
> 
> will work for all usecases.
> 
> We might need separate caches for different cases on different
> architectures, and a way to specify what cache should be used in the
> allocator API, but that does not contradict a single structure for arch
> specific parameters, but only makes it more elaborate, e.g. something like
> 
> enum jit_type {
> 	JIT_MODULES_TEXT,
> 	JIT_MODULES_DATA,
> 	JIT_KPROBES,
> 	JIT_FTRACE,
> 	JIT_BPF,
> 	JIT_TYPE_MAX,
> };
> 
> struct jit_alloc_params {
> 	struct jit_range	ranges[JIT_TYPE_MAX];
> 	/* ... */
> };
> 
> > > Can you give more detail on what parameters you need? If the only extra
> > > parameter is just "does this allocation need to live close to kernel
> > > text", that's not that big of a deal.
> > 
> > My thinking was that we at least need the start + end for each caller. That
> > might be it, tbh.
> 
> Do you mean that modules will have something like
> 
> 	jit_text_alloc(size, MODULES_START, MODULES_END);
> 
> and kprobes will have
> 
> 	jit_text_alloc(size, KPROBES_START, KPROBES_END);
> ?

Yes.

> It can still be achieved with a single jit_alloc_arch_params(), just by
> adding an enum jit_type parameter to jit_text_alloc().

That feels backwards to me; it centralizes a bunch of information about
distinct users to be able to shove that into a static array, when the callsites
can pass that information. 

What's *actually* common after separating out the ranges? Is it just the
permissions?

If we want this to be able to share allocations and so on, why can't we do this
like a kmem_cache, and have the callsite pass a pointer to the allocator data?
That would make it easy for callsites to share an allocator or use a distinct
one.
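
A minimal userspace sketch of that kmem_cache-like shape, for illustration
only. Every name below (struct text_allocator, text_allocator_init(),
text_alloc()) is hypothetical and does not exist in the kernel, and the bump
pointer merely stands in for whatever the real allocator would do. The point
is that the range lives in the allocator object, and call sites choose
between sharing one object or owning a distinct one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical allocator handle; not an existing kernel structure. */
struct text_allocator {
	uintptr_t start;	/* first usable address */
	uintptr_t end;		/* one past the last usable address */
	uintptr_t next;		/* bump pointer, start <= next <= end */
};

void text_allocator_init(struct text_allocator *ta,
			 uintptr_t start, uintptr_t end)
{
	assert(start <= end);
	ta->start = start;
	ta->end = end;
	ta->next = start;
}

/* Returns the allocated address, or 0 if the request does not fit. */
uintptr_t text_alloc(struct text_allocator *ta, size_t size)
{
	uintptr_t addr = ta->next;

	if (size == 0 || size > ta->end - addr)
		return 0;
	ta->next = addr + size;
	return addr;
}
```

Under this assumption, modules and kprobes would each initialise their own
struct text_allocator, while two users that want to share allocations would
simply pass the same pointer.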

Thanks,
Mark.

> [1] https://lore.kernel.org/linux-mm/87v8mndy3y.ffs@tglx/ 
> [2] https://lore.kernel.org/all/20230526051529.3387103-1-song@kernel.org
> 
> > Thanks,
> > Mark.
> 
> -- 
> Sincerely yours,
> Mike.


* Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX
  2023-06-05  8:11                 ` Mike Rapoport
@ 2023-06-05 16:10                   ` Edgecombe, Rick P
  2023-06-05 20:42                     ` Mike Rapoport
  2023-06-05 21:11                     ` Nadav Amit
  0 siblings, 2 replies; 55+ messages in thread
From: Edgecombe, Rick P @ 2023-06-05 16:10 UTC (permalink / raw)
  To: rostedt@goodmis.org, rppt@kernel.org
  Cc: tglx@linutronix.de, deller@gmx.de, mcgrof@kernel.org,
	netdev@vger.kernel.org, nadav.amit@gmail.com,
	linux@armlinux.org.uk, davem@davemloft.net,
	linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	hca@linux.ibm.com, catalin.marinas@arm.com,
	linux-kernel@vger.kernel.org, kent.overstreet@linux.dev,
	linux-s390@vger.kernel.org, palmer@dabbelt.com,
	chenhuacai@kernel.org, tsbogend@alpha.franken.de,
	linux-trace-kernel@vger.kernel.org, mpe@ellerman.id.au,
	linux-parisc@vger.kernel.org, x86@kernel.org,
	christophe.leroy@csgroup.eu, linux-riscv@lists.infradead.org,
	will@kernel.org, dinguyen@kernel.org, naveen.n.rao@linux.ibm.com,
	sparclinux@vger.kernel.org, linux-modules@vger.kernel.org,
	bpf@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	song@kernel.org, linux-mm@kvack.org, loongarch@lists.linux.dev,
	akpm@linux-foundation.org

On Mon, 2023-06-05 at 11:11 +0300, Mike Rapoport wrote:
> On Sun, Jun 04, 2023 at 10:52:44PM -0400, Steven Rostedt wrote:
> > On Thu, 1 Jun 2023 16:54:36 -0700
> > Nadav Amit <nadav.amit@gmail.com> wrote:
> > 
> > > > The way text_poke() is used here, it is creating a new writable
> > > > alias
> > > > and flushing it for *each* write to the module (like for each
> > > > write of
> > > > an individual relocation, etc). I was just thinking it might
> > > > warrant
> > > > some batching or something.  
> 
> > > I am not advocating to do so, but if you want to have many
> > > efficient
> > > writes, perhaps you can just disable CR0.WP. Just saying that if
> > > you
> > > are about to write all over the memory, text_poke() does not
> > > provide
> > > too much security for the poking thread.
> 
> Heh, this is definitely an easier hack to implement :)

I don't know the details, but previously there was some strong dislike
of CR0.WP toggling. And now there is also the problem of CET. Setting
CR0.WP=0 will #GP if CR4.CET is 1 (as it currently is for kernel IBT).
I guess you might get away with toggling them both in some controlled
situation, but it might be a lot easier to hack up than to be made
fully acceptable. It does sound much more efficient though.

> 
> > Batching does exist, which is what the text_poke_queue() thing
> > does.
> 
> For module loading text_poke_queue() will still be much slower than a
> bunch
> of memset()s for no good reason because we don't need all the
> complexity of
> text_poke_bp_batch() for module initialization because we are sure we
> are
> not patching live code.
> 
> What we'd need here is a new batching mode that will create a
> writable
> alias mapping at the beginning of apply_relocate_*() and
> module_finalize(),
> then it will use memcpy() to that writable alias and will tear the
> mapping
> down in the end.

It's probably only a tiny bit faster than keeping a separate writable
allocation and text_poking it in at the end.

> 
> Another option is to teach alternatives to update a writable copy
> rather
> than do in place changes like Song suggested. My feeling is that it
> will be
> more intrusive change though.

You mean keeping a separate RW allocation and then text_poking() the
whole thing in when you are done? That is what I was trying to say at
the beginning of this thread. The other benefit is you don't make the
intermediate loading states of the module executable.
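
A rough userspace sketch of that staging idea, under the assumption that the
final region is written exactly once. The names (struct staged_text,
staged_write(), staged_commit()) are hypothetical, and the final memcpy()
only stands in for a text_poke()-style copy into the ROX mapping; this is
not the code from [0].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical pairing of a text region with its writable staging copy. */
struct staged_text {
	uint8_t *final;		/* stands in for the ROX mapping */
	uint8_t *staging;	/* ordinary writable buffer */
	size_t size;
	unsigned int pokes;	/* how many times "final" was written */
};

/* Relocations and similar fixups only ever touch the staging copy. */
void staged_write(struct staged_text *st, size_t off,
		  const void *src, size_t len)
{
	assert(off <= st->size && len <= st->size - off);
	memcpy(st->staging + off, src, len);
}

/* One bulk copy at the end, instead of one poke per relocation. */
void staged_commit(struct staged_text *st)
{
	memcpy(st->final, st->staging, st->size);
	st->pokes++;
}
```

However many fixups are applied, the executable side sees a single write,
and nothing half-relocated is ever reachable through it.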

I tried this technique previously [0], and I thought it was not too
bad. In most of the callers it looks similar to what you have in
do_text_poke(). Sometimes less, sometimes more. It might need
enlightening of some of the stuff currently using text_poke() during
module loading, like jump labels. So that bit is more intrusive, yea.
But it sounds so much cleaner and well controlled. Did you have a
particular trouble spot in mind?


[0]
https://lore.kernel.org/lkml/20201120202426.18009-5-rick.p.edgecombe@intel.com/


* Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX
  2023-06-05 16:10                   ` Edgecombe, Rick P
@ 2023-06-05 20:42                     ` Mike Rapoport
  2023-06-05 21:01                       ` Edgecombe, Rick P
  2023-06-05 21:11                     ` Nadav Amit
  1 sibling, 1 reply; 55+ messages in thread
From: Mike Rapoport @ 2023-06-05 20:42 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: rostedt@goodmis.org, tglx@linutronix.de, deller@gmx.de,
	mcgrof@kernel.org, netdev@vger.kernel.org, nadav.amit@gmail.com,
	linux@armlinux.org.uk, davem@davemloft.net,
	linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	hca@linux.ibm.com, catalin.marinas@arm.com,
	linux-kernel@vger.kernel.org, kent.overstreet@linux.dev,
	linux-s390@vger.kernel.org, palmer@dabbelt.com,
	chenhuacai@kernel.org, tsbogend@alpha.franken.de,
	linux-trace-kernel@vger.kernel.org, mpe@ellerman.id.au,
	linux-parisc@vger.kernel.org, x86@kernel.org,
	christophe.leroy@csgroup.eu, linux-riscv@lists.infradead.org,
	will@kernel.org, dinguyen@kernel.org, naveen.n.rao@linux.ibm.com,
	sparclinux@vger.kernel.org, linux-modules@vger.kernel.org,
	bpf@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	song@kernel.org, linux-mm@kvack.org, loongarch@lists.linux.dev,
	akpm@linux-foundation.org

On Mon, Jun 05, 2023 at 04:10:21PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2023-06-05 at 11:11 +0300, Mike Rapoport wrote:
> > On Sun, Jun 04, 2023 at 10:52:44PM -0400, Steven Rostedt wrote:
> > > On Thu, 1 Jun 2023 16:54:36 -0700
> > > Nadav Amit <nadav.amit@gmail.com> wrote:
> > > 
> > > > > The way text_poke() is used here, it is creating a new writable
> > > > > alias
> > > > > and flushing it for *each* write to the module (like for each
> > > > > write of
> > > > > an individual relocation, etc). I was just thinking it might
> > > > > warrant
> > > > > some batching or something.  
> > 
> > > > I am not advocating to do so, but if you want to have many
> > > > efficient
> > > > writes, perhaps you can just disable CR0.WP. Just saying that if
> > > > you
> > > > are about to write all over the memory, text_poke() does not
> > > > provide
> > > > too much security for the poking thread.
> > 
> > Heh, this is definitely an easier hack to implement :)
> 
> I don't know the details, but previously there was some strong dislike
> of CR0.WP toggling. And now there is also the problem of CET. Setting
> CR0.WP=0 will #GP if CR4.CET is 1 (as it currently is for kernel IBT).
> I guess you might get away with toggling them both in some controlled
> situation, but it might be a lot easier to hack up than to be made
> fully acceptable. It does sound much more efficient though.
 
I don't think we'd really want that, especially looking at 

		WARN_ONCE(bits_missing, "CR0 WP bit went missing!?\n");

at native_write_cr0().
 
> > > Batching does exist, which is what the text_poke_queue() thing
> > > does.
> > 
> > For module loading text_poke_queue() will still be much slower than a
> > bunch
> > of memset()s for no good reason because we don't need all the
> > complexity of
> > text_poke_bp_batch() for module initialization because we are sure we
> > are
> > not patching live code.
> > 
> > What we'd need here is a new batching mode that will create a
> > writable
> > alias mapping at the beginning of apply_relocate_*() and
> > module_finalize(),
> > then it will use memcpy() to that writable alias and will tear the
> > mapping
> > down in the end.
> 
> It's probably only a tiny bit faster than keeping a separate writable
> allocation and text_poking it in at the end.

Right, but it still will be faster than text_poking every relocation.
 
> > Another option is to teach alternatives to update a writable copy
> > rather
> > than do in place changes like Song suggested. My feeling is that it
> > will be
> > more intrusive change though.
> 
> You mean keeping a separate RW allocation and then text_poking() the
> whole thing in when you are done? That is what I was trying to say at
> the beginning of this thread. The other benefit is you don't make the
> intermediate loading states of the module, executable.
> 
> I tried this technique previously [0], and I thought it was not too
> bad. In most of the callers it looks similar to what you have in
> do_text_poke(). Sometimes less, sometimes more. It might need
> enlightening of some of the stuff currently using text_poke() during
> module loading, like jump labels. So that bit is more intrusive, yea.
> But it sounds so much cleaner and well controlled. Did you have a
> particular trouble spot in mind?

Nothing in particular, except the intrusive part. Apart from the changes in
modules.c, we'd need to teach alternatives to deal with a writable copy.
 
> [0]
> https://lore.kernel.org/lkml/20201120202426.18009-5-rick.p.edgecombe@intel.com/

-- 
Sincerely yours,
Mike.


* Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX
  2023-06-05 20:42                     ` Mike Rapoport
@ 2023-06-05 21:01                       ` Edgecombe, Rick P
  0 siblings, 0 replies; 55+ messages in thread
From: Edgecombe, Rick P @ 2023-06-05 21:01 UTC (permalink / raw)
  To: rppt@kernel.org
  Cc: tglx@linutronix.de, deller@gmx.de, mcgrof@kernel.org,
	netdev@vger.kernel.org, nadav.amit@gmail.com,
	linux@armlinux.org.uk, davem@davemloft.net,
	linux-mips@vger.kernel.org, linux-riscv@lists.infradead.org,
	hca@linux.ibm.com, linuxppc-dev@lists.ozlabs.org,
	linux-kernel@vger.kernel.org, catalin.marinas@arm.com,
	kent.overstreet@linux.dev, linux-s390@vger.kernel.org,
	palmer@dabbelt.com, chenhuacai@kernel.org,
	tsbogend@alpha.franken.de, linux-trace-kernel@vger.kernel.org,
	linux-parisc@vger.kernel.org, christophe.leroy@csgroup.eu,
	x86@kernel.org, mpe@ellerman.id.au, rostedt@goodmis.org,
	will@kernel.org, dinguyen@kernel.org, naveen.n.rao@linux.ibm.com,
	sparclinux@vger.kernel.org, linux-modules@vger.kernel.org,
	bpf@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	song@kernel.org, linux-mm@kvack.org, loongarch@lists.linux.dev,
	akpm@linux-foundation.org

On Mon, 2023-06-05 at 23:42 +0300, Mike Rapoport wrote:
> > I tried this technique previously [0], and I thought it was not too
> > bad. In most of the callers it looks similar to what you have in
> > do_text_poke(). Sometimes less, sometimes more. It might need
> > enlightening of some of the stuff currently using text_poke()
> > during
> > module loading, like jump labels. So that bit is more intrusive,
> > yea.
> > But it sounds so much cleaner and well controlled. Did you have a
> > particular trouble spot in mind?
> 
> Nothing in particular, except the intrusive part. Except the changes
> in
> modules.c we'd need to teach alternatives to deal with a writable
> copy.

I didn't think the alternatives piece looked too bad on the caller side (if
that's what you meant):
https://lore.kernel.org/lkml/20201120202426.18009-7-rick.p.edgecombe@intel.com/

The ugly part was in the (poorly named) module_adjust_writable_addr():

+static inline void *module_adjust_writable_addr(void *addr)
+{
+	unsigned long laddr = (unsigned long)addr;
+	struct module *mod;
+
+	mutex_lock(&module_mutex);
+	mod = __module_address(laddr);
+	if (!mod) {
+		mutex_unlock(&module_mutex);
+		return addr;
+	}
+	mutex_unlock(&module_mutex);
+	/* The module shouldn't be going away if someone is trying to write to it */
+
+	return (void *)perm_writable_addr(module_get_allocation(mod, laddr), laddr);
+}
+

It took module_mutex and looked up the module in order to find the
writable buffer from just the executable address. Basically all the
loading code external to modules had to go through that interface. But
now I'm wondering what I was thinking, it seems this could just be an
RCU read lock. That doesn't seem too bad...
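
Stripped of the locking and the module lookup, the translation itself is
just an offset calculation. A minimal userspace sketch, assuming each
executable region keeps a record of its writable twin; struct shadow_region
and adjust_writable_addr() are hypothetical names, and this is not the code
from the series linked above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record pairing an executable region with its writable copy. */
struct shadow_region {
	uintptr_t exec_base;
	uintptr_t rw_base;
	size_t size;
};

/*
 * Translate an executable address into the matching writable address.
 * Addresses that fall outside every region are returned unchanged, which
 * mirrors the "no module found" path in the quoted helper.
 */
uintptr_t adjust_writable_addr(const struct shadow_region *regions,
			       size_t nr, uintptr_t addr)
{
	assert(regions != NULL || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		const struct shadow_region *r = &regions[i];

		if (addr >= r->exec_base && addr - r->exec_base < r->size)
			return r->rw_base + (addr - r->exec_base);
	}
	return addr;
}
```

In the kernel the linear scan would be replaced by the existing module
address lookup, with whatever locking (mutex or RCU) turns out to be right.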


* Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX
  2023-06-05 16:10                   ` Edgecombe, Rick P
  2023-06-05 20:42                     ` Mike Rapoport
@ 2023-06-05 21:11                     ` Nadav Amit
  1 sibling, 0 replies; 55+ messages in thread
From: Nadav Amit @ 2023-06-05 21:11 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: rostedt@goodmis.org, rppt@kernel.org, Thomas Gleixner,
	deller@gmx.de, mcgrof@kernel.org, netdev@vger.kernel.org,
	linux@armlinux.org.uk, davem@davemloft.net,
	linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	hca@linux.ibm.com, catalin.marinas@arm.com,
	linux-kernel@vger.kernel.org, kent.overstreet@linux.dev,
	linux-s390@vger.kernel.org, palmer@dabbelt.com,
	chenhuacai@kernel.org, tsbogend@alpha.franken.de,
	linux-trace-kernel@vger.kernel.org, mpe@ellerman.id.au,
	linux-parisc@vger.kernel.org, x86@kernel.org,
	christophe.leroy@csgroup.eu, linux-riscv@lists.infradead.org,
	Will Deacon, dinguyen@kernel.org, naveen.n.rao@linux.ibm.com,
	sparclinux@vger.kernel.org, linux-modules@vger.kernel.org,
	bpf@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	song@kernel.org, linux-mm@kvack.org, loongarch@lists.linux.dev,
	Andrew Morton



> On Jun 5, 2023, at 9:10 AM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Mon, 2023-06-05 at 11:11 +0300, Mike Rapoport wrote:
>> On Sun, Jun 04, 2023 at 10:52:44PM -0400, Steven Rostedt wrote:
>>> On Thu, 1 Jun 2023 16:54:36 -0700
>>> Nadav Amit <nadav.amit@gmail.com> wrote:
>>> 
>>>>> The way text_poke() is used here, it is creating a new writable
>>>>> alias
>>>>> and flushing it for *each* write to the module (like for each
>>>>> write of
>>>>> an individual relocation, etc). I was just thinking it might
>>>>> warrant
>>>>> some batching or something.  
>> 
>>>> I am not advocating to do so, but if you want to have many
>>>> efficient
>>>> writes, perhaps you can just disable CR0.WP. Just saying that if
>>>> you
>>>> are about to write all over the memory, text_poke() does not
>>>> provide
>>>> too much security for the poking thread.
>> 
>> Heh, this is definitely an easier hack to implement :)
> 
> I don't know the details, but previously there was some strong dislike
> of CR0.WP toggling. And now there is also the problem of CET. Setting
> CR0.WP=0 will #GP if CR4.CET is 1 (as it currently is for kernel IBT).
> I guess you might get away with toggling them both in some controlled
> situation, but it might be a lot easier to hack up than to be made
> fully acceptable. It does sound much more efficient though.

Thanks for highlighting this issue. I understand the limitations of
CR0.WP. There is also always the concern that without CET or another
control flow integrity mechanism, someone would abuse (using ROP/JOP)
functions that clear CR0.WP…



* Re: [PATCH 00/13] mm: jit/text allocator
  2023-06-05  9:20       ` Mike Rapoport
  2023-06-05 10:09         ` Mark Rutland
@ 2023-06-05 21:13         ` Kent Overstreet
  1 sibling, 0 replies; 55+ messages in thread
From: Kent Overstreet @ 2023-06-05 21:13 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Mark Rutland, linux-kernel, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Luis Chamberlain, Michael Ellerman,
	Naveen N. Rao, Palmer Dabbelt, Russell King, Song Liu,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Mon, Jun 05, 2023 at 12:20:40PM +0300, Mike Rapoport wrote:
> On Fri, Jun 02, 2023 at 10:35:09AM +0100, Mark Rutland wrote:
> > On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote:
> > > On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote:
> > > > For a while I have wanted to give kprobes its own allocator so that it can work
> > > > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in
> > > > the modules area.
> > > > 
> > > > Given that, I think these should have their own allocator functions that can be
> > > > provided independently, even if those happen to use common infrastructure.
> > > 
> > > How much memory can kprobes conceivably use? I think we also want to try
> > > to push back on combinatorial new allocators, if we can.
> > 
> > That depends on who's using it, and how (e.g. via BPF).
> > 
> > To be clear, I'm not necessarily asking for entirely different allocators, but
> > > I do think that we want wrappers that can at least pass distinct start+end
> > > parameters to a common allocator, and for arm64's modules code I'd expect that
> > > we'd keep the range fallback logic out of the common allocator, and just call
> > it twice.
> > 
> > > > > Several architectures override module_alloc() because of various
> > > > > constraints where the executable memory can be located and this causes
> > > > > additional obstacles for improvements of code allocation.
> > > > > 
> > > > > This set splits code allocation from modules by introducing
> > > > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> > > > > sites of module_alloc() and module_memfree() with the new APIs and
> > > > > implements core text and related allocation in a central place.
> > > > > 
> > > > > Instead of architecture specific overrides for module_alloc(), the
> > > > > architectures that require non-default behaviour for text allocation must
> > > > > fill jit_alloc_params structure and implement jit_alloc_arch_params() that
> > > > > returns a pointer to that structure. If an architecture does not implement
> > > > > jit_alloc_arch_params(), the defaults compatible with the current
> > > > > modules::module_alloc() are used.
> > > > 
> > > > As above, I suspect that each of the callsites should probably be using common
> > > > infrastructure, but I don't think that a single jit_alloc_arch_params() makes
> > > > sense, since the parameters for each case may need to be distinct.
> > > 
> > > I don't see how that follows. The whole point of function parameters is
> > > that they may be different :)
> > 
> > What I mean is that jit_alloc_arch_params() tries to aggregate common
> > parameters, but they aren't actually common (e.g. the actual start+end range
> > for allocation).
> 
> jit_alloc_arch_params() tries to aggregate architecture constraints and
> requirements for allocations of executable memory and this exactly what
> the first 6 patches of this set do.
> 
> A while ago Thomas suggested to use a structure that parametrizes
> architecture constraints by the memory type used in modules [1] and Song
> implemented the infrastructure for it and x86 part [2].
> 
> I liked the idea of defining parameters in a single structure, but I
> thought that approaching the problem from the arch side rather than from
> modules perspective will be better starting point, hence these patches.
> 
> I don't see a fundamental reason why a single structure cannot describe
> what is needed for different code allocation cases, be it modules, kprobes
> or bpf. There is of course an assumption that the core allocations will be
> the same for all the users, and it seems to me that something like 
> 
> * allocate physical memory if allocator caches are empty
> * map it in vmalloc or modules address space
> * return memory from the allocator cache to the caller
> 
> will work for all usecases.
> 
> We might need separate caches for different cases on different
> architectures, and a way to specify what cache should be used in the
> allocator API, but that does not contradict a single structure for arch
> specific parameters, but only makes it more elaborate, e.g. something like
> 
> enum jit_type {
> 	JIT_MODULES_TEXT,
> 	JIT_MODULES_DATA,
> 	JIT_KPROBES,
> 	JIT_FTRACE,
> 	JIT_BPF,
> 	JIT_TYPE_MAX,
> };

Why would we actually need different enums for modules_text, kprobes,
ftrace and bpf? Why can't we treat all text allocations the same?

The reason we can't do that currently is because modules need to go in a
128MB region on some archs, and without sub page allocation
bpf/kprobes/etc. burn a full page for each allocation. But we're doing
sub page allocation - right?

That leaves module data - which really needs to be split out into rw,
ro, ro_after_init - but I'm not sure we'd even want the same API for
those, they need fairly different page permissions handling.


* Re: [PATCH 00/13] mm: jit/text allocator
  2023-06-05 10:09         ` Mark Rutland
@ 2023-06-06 10:16           ` Mike Rapoport
  2023-06-06 18:21           ` Song Liu
  2023-07-20  8:53           ` Mike Rapoport
  2 siblings, 0 replies; 55+ messages in thread
From: Mike Rapoport @ 2023-06-06 10:16 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Kent Overstreet, linux-kernel, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Luis Chamberlain, Michael Ellerman,
	Naveen N. Rao, Palmer Dabbelt, Russell King, Song Liu,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Mon, Jun 05, 2023 at 11:09:34AM +0100, Mark Rutland wrote:
> On Mon, Jun 05, 2023 at 12:20:40PM +0300, Mike Rapoport wrote:
> > On Fri, Jun 02, 2023 at 10:35:09AM +0100, Mark Rutland wrote:
> >
> > It still can be achieved with a single jit_alloc_arch_params(), just by
> > adding enum jit_type parameter to jit_text_alloc().
> 
> That feels backwards to me; it centralizes a bunch of information about
> distinct users to be able to shove that into a static array, when the callsites
> can pass that information. 

The goal was not to shove everything into an array, but centralize
architecture requirements for code allocations. The callsites don't have
that information per se, they get it from the arch code, so having this
information in a single place per arch is better than spreading
MODULE_START, KPROBES_START etc all over.

I'd agree though that having types for jit_text_alloc is ugly and this
should be handled differently.
 
> What's *actually* common after separating out the ranges? Is it just the
> permissions?

On x86 everything, on arm64 apparently just the permissions.

I've started to summarize the restrictions on code placement for modules,
kprobes and bpf on different architectures; this is roughly what I've got
so far:

* x86 and s390 need everything within the modules address space because of
PC-relative addressing
* arm, arm64, loongarch, sparc64, riscv64, and some mips and powerpc32
configurations require a dedicated modules address space; the rest just
use the vmalloc address space
* all architectures that support kprobes, except x86 and s390, don't use
relative jumps, so they don't care where the kprobes insn_page lives
* not sure yet about BPF. It looks like on arm and arm64 it does not use
relative jumps, so it can be anywhere; I didn't dig enough into the others.

> If we want this to be able to share allocations and so on, why can't we do this
> like a kmem_cache, and have the callsite pass a pointer to the allocator data?
> That would make it easy for callsites to share an allocator or use a distinct
> one.

This may be something worth exploring.
 
> Thanks,
> Mark.

-- 
Sincerely yours,
Mike.


* Re: [PATCH 00/13] mm: jit/text allocator
  2023-06-05 10:09         ` Mark Rutland
  2023-06-06 10:16           ` Mike Rapoport
@ 2023-06-06 18:21           ` Song Liu
  2023-06-08 18:41             ` Mike Rapoport
  2023-07-20  8:53           ` Mike Rapoport
  2 siblings, 1 reply; 55+ messages in thread
From: Song Liu @ 2023-06-06 18:21 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Mike Rapoport, Kent Overstreet, linux-kernel, Andrew Morton,
	Catalin Marinas, Christophe Leroy, David S. Miller, Dinh Nguyen,
	Heiko Carstens, Helge Deller, Huacai Chen, Luis Chamberlain,
	Michael Ellerman, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Mon, Jun 5, 2023 at 3:09 AM Mark Rutland <mark.rutland@arm.com> wrote:

[...]

> > > > Can you give more detail on what parameters you need? If the only extra
> > > > parameter is just "does this allocation need to live close to kernel
> > > > text", that's not that big of a deal.
> > >
> > > My thinking was that we at least need the start + end for each caller. That
> > > might be it, tbh.
> >
> > Do you mean that modules will have something like
> >
> >       jit_text_alloc(size, MODULES_START, MODULES_END);
> >
> > and kprobes will have
> >
> >       jit_text_alloc(size, KPROBES_START, KPROBES_END);
> > ?
>
> Yes.

How about we start with two APIs:
     jit_text_alloc(size);
     jit_text_alloc_range(size, start, end);

AFAICT, arm64 is the only arch that requires the latter API. And TBH, I am
not quite convinced it is needed.

>
> > It still can be achieved with a single jit_alloc_arch_params(), just by
> > adding enum jit_type parameter to jit_text_alloc().
>
> That feels backwards to me; it centralizes a bunch of information about
> distinct users to be able to shove that into a static array, when the callsites
> can pass that information.

I think we only have two types of users: modules and everything else (ftrace,
kprobes, bpf stuff). The key differences are:

  1. modules use text and data, while everything else only uses text.
  2. module code is generated by the compiler, and thus has stronger
  requirements on address ranges; everything else is generated via some
  JIT or manually written assembly, so it is more flexible with address
  ranges (in a JIT, we can avoid using instructions that require a specific
  address range).

The next question is, can we have the two types of users share the same
address ranges? If not, we can reserve the preferred range for modules,
and let everything else use the other range. I don't see reasons to further
separate users in the "everything else" group.

>
> What's *actually* common after separating out the ranges? Is it just the
> permissions?

I believe permission is the key, as we need the hardware to enforce
permission.

>
> If we want this to be able to share allocations and so on, why can't we do this
> like a kmem_cache, and have the callsite pass a pointer to the allocator data?
> That would make it easy for callsites to share an allocator or use a distinct
> one.

Sharing among different call sites will give us more benefit (in TLB miss
rate, etc.). For example, a 2MB page may host the text of two kernel modules,
4 kprobes, 6 ftrace trampolines, and 10 BPF programs. All of these only
require one entry in the iTLB.
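
To put rough numbers on that, here is a small sketch that counts pages with
and without sharing. The helper names are hypothetical, and page count is
only a crude proxy for iTLB pressure (it ignores alignment and metadata).

```c
#include <assert.h>
#include <stddef.h>

#define SMALL_PAGE	4096UL
#define HUGE_PAGE	(2UL * 1024 * 1024)

/* Pages consumed when every allocation is rounded up to whole pages. */
unsigned long pages_unshared(const size_t *sizes, size_t nr,
			     unsigned long page)
{
	unsigned long pages = 0;

	assert(page != 0);
	for (size_t i = 0; i < nr; i++)
		pages += (sizes[i] + page - 1) / page;
	return pages;
}

/* Pages consumed when allocations are packed back to back. */
unsigned long pages_packed(const size_t *sizes, size_t nr,
			   unsigned long page)
{
	unsigned long total = 0;

	assert(page != 0);
	for (size_t i = 0; i < nr; i++)
		total += sizes[i];
	return (total + page - 1) / page;
}
```

With made-up sizes of 512KB per module, 64 bytes per kprobe, 256 bytes per
ftrace trampoline and 2KB per BPF program, the 22 allocations above take 276
separate 4KB pages when each gets its own pages, and fit in a single 2MB page
when packed.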

Thanks,
Song


* Re: [PATCH 00/13] mm: jit/text allocator
  2023-06-06 18:21           ` Song Liu
@ 2023-06-08 18:41             ` Mike Rapoport
  2023-06-09 17:02               ` Song Liu
  2023-06-13 18:56               ` Kent Overstreet
  0 siblings, 2 replies; 55+ messages in thread
From: Mike Rapoport @ 2023-06-08 18:41 UTC (permalink / raw)
  To: Song Liu
  Cc: Mark Rutland, Kent Overstreet, linux-kernel, Andrew Morton,
	Catalin Marinas, Christophe Leroy, David S. Miller, Dinh Nguyen,
	Heiko Carstens, Helge Deller, Huacai Chen, Luis Chamberlain,
	Michael Ellerman, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Tue, Jun 06, 2023 at 11:21:59AM -0700, Song Liu wrote:
> On Mon, Jun 5, 2023 at 3:09 AM Mark Rutland <mark.rutland@arm.com> wrote:
> 
> [...]
> 
> > > > > Can you give more detail on what parameters you need? If the only extra
> > > > > parameter is just "does this allocation need to live close to kernel
> > > > > text", that's not that big of a deal.
> > > >
> > > > My thinking was that we at least need the start + end for each caller. That
> > > > might be it, tbh.
> > >
> > > Do you mean that modules will have something like
> > >
> > >       jit_text_alloc(size, MODULES_START, MODULES_END);
> > >
> > > and kprobes will have
> > >
> > >       jit_text_alloc(size, KPROBES_START, KPROBES_END);
> > > ?
> >
> > Yes.
> 
> How about we start with two APIs:
>      jit_text_alloc(size);
>      jit_text_alloc_range(size, start, end);
> 
> AFAICT, arm64 is the only arch that requires the latter API. And TBH, I am
> not quite convinced it is needed.
 
Right now arm64 and riscv override bpf and kprobes allocations to use the
entire vmalloc address space, but having the ability to allocate generated
code outside of modules area may be useful for other architectures.

Still, passing start + end from the callers feels backwards to me because it
is the architectures, not the callers, that define the ranges, so we still
need a way for architectures to define how they want to allocate memory for
the generated code.

> > > It still can be achieved with a single jit_alloc_arch_params(), just by
> > > adding enum jit_type parameter to jit_text_alloc().
> >
> > That feels backwards to me; it centralizes a bunch of information about
> > distinct users to be able to shove that into a static array, when the callsites
> > can pass that information.
> 
> I think we only two type of users: module and everything else (ftrace, kprobe,
> bpf stuff). The key differences are:
> 
>   1. module uses text and data; while everything else only uses text.
>   2. module code is generated by the compiler, and thus has stronger
>   requirements in address ranges; everything else are generated via some
>   JIT or manual written assembly, so they are more flexible with address
>   ranges (in JIT, we can avoid using instructions that requires a specific
>   address range).
> 
> The next question is, can we have the two types of users share the same
> address ranges? If not, we can reserve the preferred range for modules,
> and let everything else use the other range. I don't see reasons to further
> separate users in the "everything else" group.
 
I agree that we can define only two types: modules and everything else and
let the architectures define if they need different ranges for these two
types, or want the same range for everything.

With only two types we can have two API calls for alloc, and a single
structure that defines the ranges etc from the architecture side rather
than spread all over.

Something along these lines:

	struct execmem_range {
		unsigned long   start;
		unsigned long   end;
		unsigned long   fallback_start;
		unsigned long   fallback_end;
		pgprot_t        pgprot;
		unsigned int	alignment;
	};

	struct execmem_modules_range {
		enum execmem_module_flags flags;
		struct execmem_range text;
		struct execmem_range data;
	};

	struct execmem_jit_range {
		struct execmem_range text;
	};

	struct execmem_params {
		struct execmem_modules_range	modules;
		struct execmem_jit_range	jit;
	};

	struct execmem_params *execmem_arch_params(void);

	void *execmem_text_alloc(size_t size);
	void *execmem_data_alloc(size_t size);
	void execmem_free(void *ptr);

	void *jit_text_alloc(size_t size);
	void jit_free(void *ptr);

Modules or anything that must live close to the kernel image can use
execmem_*_alloc(), and callers that don't generally care about relative
addressing will use jit_text_alloc(), presuming that the arch will restrict
the jit range if necessary. E.g. below, for arm64 jit can be anywhere in
vmalloc, and for x86 and s390 it will share the modules range.


	struct execmem_params arm64_execmem = {
		.modules = {
			.flags = KASAN,
			.text = {
				.start = MODULES_VADDR,
				.end = MODULES_END,
				.pgprot = PAGE_KERNEL_ROX,
				.fallback_start = VMALLOC_START,
				.fallback_end = VMALLOC_END,
			},
		},
		.jit = {
			.text = {
				.start = VMALLOC_START,
				.end = VMALLOC_END,
				.pgprot = PAGE_KERNEL_ROX,
			},
		},
	};

	/* x86 and s390 */
	struct execmem_params cisc_execmem = {
		.modules = {
			.flags = KASAN,
			.text = {
				.start = MODULES_VADDR,
				.end = MODULES_END,
				.pgprot = PAGE_KERNEL_ROX,
			},
		},
		.jit = {},	/* implies reusing .modules */
	};

	struct execmem_params default_execmem = {
		.modules = {
			.flags = KASAN,
			.text = {
				.start = VMALLOC_START,
				.end = VMALLOC_END,
				.pgprot = PAGE_KERNEL_EXEC,
			},
		},
	};
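
The "empty .jit implies reusing .modules" rule above can be modelled in
userspace. This is only a sketch under stated assumptions: the structures
are cut-down copies of the proposal, jit_range() and range_is_empty() are
hypothetical helpers that are not part of this series, and an all-zero
range is assumed to mean "not specified by the architecture".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Cut-down userspace model of the proposed structures (illustration only). */
struct execmem_range {
	unsigned long start;
	unsigned long end;
	unsigned long fallback_start;
	unsigned long fallback_end;
};

struct execmem_params {
	struct execmem_range modules_text;
	struct execmem_range jit_text;
};

/* Assumption: an all-zero range means the architecture did not set it. */
static bool range_is_empty(const struct execmem_range *r)
{
	return r->start == 0 && r->end == 0;
}

/*
 * Hypothetical helper: pick the range for generated code. When the
 * architecture left the jit range empty, fall back to the modules range.
 */
static const struct execmem_range *jit_range(const struct execmem_params *p)
{
	assert(p != NULL);
	return range_is_empty(&p->jit_text) ? &p->modules_text : &p->jit_text;
}
```

With a zeroed jit_text the helper returns the modules range (the x86/s390
case); with a populated one it returns the separate range (the arm64 case).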

-- 
Sincerely yours,
Mike.


* Re: [PATCH 00/13] mm: jit/text allocator
  2023-06-08 18:41             ` Mike Rapoport
@ 2023-06-09 17:02               ` Song Liu
  2023-06-12 21:34                 ` Mike Rapoport
  2023-06-13 18:56               ` Kent Overstreet
  1 sibling, 1 reply; 55+ messages in thread
From: Song Liu @ 2023-06-09 17:02 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Mark Rutland, Kent Overstreet, linux-kernel, Andrew Morton,
	Catalin Marinas, Christophe Leroy, David S. Miller, Dinh Nguyen,
	Heiko Carstens, Helge Deller, Huacai Chen, Luis Chamberlain,
	Michael Ellerman, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Thu, Jun 8, 2023 at 11:41 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Tue, Jun 06, 2023 at 11:21:59AM -0700, Song Liu wrote:
> > On Mon, Jun 5, 2023 at 3:09 AM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > [...]
> >
> > > > > > Can you give more detail on what parameters you need? If the only extra
> > > > > > parameter is just "does this allocation need to live close to kernel
> > > > > > text", that's not that big of a deal.
> > > > >
> > > > > My thinking was that we at least need the start + end for each caller. That
> > > > > might be it, tbh.
> > > >
> > > > Do you mean that modules will have something like
> > > >
> > > >       jit_text_alloc(size, MODULES_START, MODULES_END);
> > > >
> > > > and kprobes will have
> > > >
> > > >       jit_text_alloc(size, KPROBES_START, KPROBES_END);
> > > > ?
> > >
> > > Yes.
> >
> > How about we start with two APIs:
> >      jit_text_alloc(size);
> >      jit_text_alloc_range(size, start, end);
> >
> > AFAICT, arm64 is the only arch that requires the latter API. And TBH, I am
> > not quite convinced it is needed.
>
> Right now arm64 and riscv override bpf and kprobes allocations to use the
> entire vmalloc address space, but having the ability to allocate generated
> code outside of modules area may be useful for other architectures.
>
> Still the start + end for the callers feels backwards to me because the
> callers do not define the ranges, but rather the architectures, so we still
> need a way for architectures to define how they want to allocate memory for
> the generated code.

Yeah, this makes sense.

>
> > > > It still can be achieved with a single jit_alloc_arch_params(), just by
> > > > adding enum jit_type parameter to jit_text_alloc().
> > >
> > > That feels backwards to me; it centralizes a bunch of information about
> > > distinct users to be able to shove that into a static array, when the callsites
> > > can pass that information.
> >
> > I think we only have two types of users: module and everything else (ftrace, kprobe,
> > bpf stuff). The key differences are:
> >
> >   1. module uses text and data; while everything else only uses text.
> >   2. module code is generated by the compiler, and thus has stronger
> >   requirements in address ranges; everything else are generated via some
> >   JIT or manual written assembly, so they are more flexible with address
> >   ranges (in JIT, we can avoid using instructions that requires a specific
> >   address range).
> >
> > The next question is, can we have the two types of users share the same
> > address ranges? If not, we can reserve the preferred range for modules,
> > and let everything else use the other range. I don't see reasons to further
> > separate users in the "everything else" group.
>
> I agree that we can define only two types: modules and everything else and
> let the architectures define if they need different ranges for these two
> types, or want the same range for everything.
>
> With only two types we can have two API calls for alloc, and a single
> structure that defines the ranges etc from the architecture side rather
> than spread all over.
>
> Like something along these lines:
>
>         struct execmem_range {
>                 unsigned long   start;
>                 unsigned long   end;
>                 unsigned long   fallback_start;
>                 unsigned long   fallback_end;
>                 pgprot_t        pgprot;
>                 unsigned int    alignment;
>         };
>
>         struct execmem_modules_range {
>                 enum execmem_module_flags flags;
>                 struct execmem_range text;
>                 struct execmem_range data;
>         };
>
>         struct execmem_jit_range {
>                 struct execmem_range text;
>         };
>
>         struct execmem_params {
>                 struct execmem_modules_range    modules;
>                 struct execmem_jit_range        jit;
>         };
>
>         struct execmem_params *execmem_arch_params(void);
>
>         void *execmem_text_alloc(size_t size);
>         void *execmem_data_alloc(size_t size);
>         void execmem_free(void *ptr);

With the jit variation, maybe we can just call these
module_[text|data]_alloc()?

btw: Depending on the implementation of the allocator, we may also
need separate free()s for text and data.

>
>         void *jit_text_alloc(size_t size);
>         void jit_free(void *ptr);
>

[...]

How should we move ahead from here?

AFAICT, all these changes can be easily extended and refactored
in the future, so we don't have to make it perfect the first time.
OTOH, having the interface committed (either this set or my
module_alloc_type version) can unblock work on the binpack
allocator and on the users' side. Therefore, I think we can move
relatively fast here?

Thanks,
Song


* Re: [PATCH 00/13] mm: jit/text allocator
  2023-06-09 17:02               ` Song Liu
@ 2023-06-12 21:34                 ` Mike Rapoport
  0 siblings, 0 replies; 55+ messages in thread
From: Mike Rapoport @ 2023-06-12 21:34 UTC (permalink / raw)
  To: Song Liu
  Cc: Mark Rutland, Kent Overstreet, linux-kernel, Andrew Morton,
	Catalin Marinas, Christophe Leroy, David S. Miller, Dinh Nguyen,
	Heiko Carstens, Helge Deller, Huacai Chen, Luis Chamberlain,
	Michael Ellerman, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Fri, Jun 09, 2023 at 10:02:16AM -0700, Song Liu wrote:
> On Thu, Jun 8, 2023 at 11:41 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Tue, Jun 06, 2023 at 11:21:59AM -0700, Song Liu wrote:
> > > On Mon, Jun 5, 2023 at 3:09 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > >
> > > [...]
> > >
> > > > > > > Can you give more detail on what parameters you need? If the only extra
> > > > > > > parameter is just "does this allocation need to live close to kernel
> > > > > > > text", that's not that big of a deal.
> > > > > >
> > > > > > My thinking was that we at least need the start + end for each caller. That
> > > > > > might be it, tbh.
> > > > >
> > > > > Do you mean that modules will have something like
> > > > >
> > > > >       jit_text_alloc(size, MODULES_START, MODULES_END);
> > > > >
> > > > > and kprobes will have
> > > > >
> > > > >       jit_text_alloc(size, KPROBES_START, KPROBES_END);
> > > > > ?
> > > >
> > > > Yes.
> > >
> > > How about we start with two APIs:
> > >      jit_text_alloc(size);
> > >      jit_text_alloc_range(size, start, end);
> > >
> > > AFAICT, arm64 is the only arch that requires the latter API. And TBH, I am
> > > not quite convinced it is needed.
> >
> > Right now arm64 and riscv override bpf and kprobes allocations to use the
> > entire vmalloc address space, but having the ability to allocate generated
> > code outside of modules area may be useful for other architectures.
> >
> > Still the start + end for the callers feels backwards to me because the
> > callers do not define the ranges, but rather the architectures, so we still
> > need a way for architectures to define how they want to allocate memory for
> > the generated code.
> 
> Yeah, this makes sense.
> 
> >
> > > > > It still can be achieved with a single jit_alloc_arch_params(), just by
> > > > > adding enum jit_type parameter to jit_text_alloc().
> > > >
> > > > That feels backwards to me; it centralizes a bunch of information about
> > > > distinct users to be able to shove that into a static array, when the callsites
> > > > can pass that information.
> > >
> > > I think we only have two types of users: module and everything else (ftrace, kprobe,
> > > bpf stuff). The key differences are:
> > >
> > >   1. module uses text and data; while everything else only uses text.
> > >   2. module code is generated by the compiler, and thus has stronger
> > >   requirements in address ranges; everything else are generated via some
> > >   JIT or manual written assembly, so they are more flexible with address
> > >   ranges (in JIT, we can avoid using instructions that requires a specific
> > >   address range).
> > >
> > > The next question is, can we have the two types of users share the same
> > > address ranges? If not, we can reserve the preferred range for modules,
> > > and let everything else use the other range. I don't see reasons to further
> > > separate users in the "everything else" group.
> >
> > I agree that we can define only two types: modules and everything else and
> > let the architectures define if they need different ranges for these two
> > types, or want the same range for everything.
> >
> > With only two types we can have two API calls for alloc, and a single
> > structure that defines the ranges etc from the architecture side rather
> > than spread all over.
> >
> > Like something along these lines:
> >
> >         struct execmem_range {
> >                 unsigned long   start;
> >                 unsigned long   end;
> >                 unsigned long   fallback_start;
> >                 unsigned long   fallback_end;
> >                 pgprot_t        pgprot;
> >                 unsigned int    alignment;
> >         };
> >
> >         struct execmem_modules_range {
> >                 enum execmem_module_flags flags;
> >                 struct execmem_range text;
> >                 struct execmem_range data;
> >         };
> >
> >         struct execmem_jit_range {
> >                 struct execmem_range text;
> >         };
> >
> >         struct execmem_params {
> >                 struct execmem_modules_range    modules;
> >                 struct execmem_jit_range        jit;
> >         };
> >
> >         struct execmem_params *execmem_arch_params(void);
> >
> >         void *execmem_text_alloc(size_t size);
> >         void *execmem_data_alloc(size_t size);
> >         void execmem_free(void *ptr);
> 
> With the jit variation, maybe we can just call these
> module_[text|data]_alloc()?

I was thinking about execmem_*_alloc() for allocations that must be close
to the kernel image, like modules, ftrace on x86 and s390, and maybe
something else in the future.

And jit_text_alloc() for allocations that can reside anywhere.

I tried to find a different name for 'struct execmem_modules_range' but
couldn't think of anything better than 'struct execmem_close_to_kernel', so
I've left modules in the name.
 
> btw: Depending on the implementation of the allocator, we may also
> need separate free()s for text and data.
> 
> >
> >         void *jit_text_alloc(size_t size);
> >         void jit_free(void *ptr);
> >

Let's just add jit_free() for completeness even if it will be the same as
execmem_free() for now.
 
> [...]
> 
> How should we move ahead from here?
> 
> AFAICT, all these changes can be easily extended and refactored
> in the future, so we don't have to make it perfect the first time.
> OTOH, having the interface committed (either this set or my
> module_alloc_type version) can unblock works in the binpack
> allocator and the users side. Therefore, I think we can move
> relatively fast here?

Once the interface and architecture abstraction are ready we can work on the
allocator and the users. We also need to update text_poking/alternatives on
architectures that would allocate executable memory as ROX. I did some
quick tests and with these patches 'modprobe xfs' takes tens of times longer
than before.
 
> Thanks,
> Song

-- 
Sincerely yours,
Mike.


* Re: [PATCH 00/13] mm: jit/text allocator
  2023-06-08 18:41             ` Mike Rapoport
  2023-06-09 17:02               ` Song Liu
@ 2023-06-13 18:56               ` Kent Overstreet
  2023-06-13 21:09                 ` Mike Rapoport
  1 sibling, 1 reply; 55+ messages in thread
From: Kent Overstreet @ 2023-06-13 18:56 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Song Liu, Mark Rutland, linux-kernel, Andrew Morton,
	Catalin Marinas, Christophe Leroy, David S. Miller, Dinh Nguyen,
	Heiko Carstens, Helge Deller, Huacai Chen, Luis Chamberlain,
	Michael Ellerman, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Thu, Jun 08, 2023 at 09:41:16PM +0300, Mike Rapoport wrote:
> On Tue, Jun 06, 2023 at 11:21:59AM -0700, Song Liu wrote:
> > On Mon, Jun 5, 2023 at 3:09 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > 
> > [...]
> > 
> > > > > > Can you give more detail on what parameters you need? If the only extra
> > > > > > parameter is just "does this allocation need to live close to kernel
> > > > > > text", that's not that big of a deal.
> > > > >
> > > > > My thinking was that we at least need the start + end for each caller. That
> > > > > might be it, tbh.
> > > >
> > > > Do you mean that modules will have something like
> > > >
> > > >       jit_text_alloc(size, MODULES_START, MODULES_END);
> > > >
> > > > and kprobes will have
> > > >
> > > >       jit_text_alloc(size, KPROBES_START, KPROBES_END);
> > > > ?
> > >
> > > Yes.
> > 
> > How about we start with two APIs:
> >      jit_text_alloc(size);
> >      jit_text_alloc_range(size, start, end);
> > 
> > AFAICT, arm64 is the only arch that requires the latter API. And TBH, I am
> > not quite convinced it is needed.
>  
> Right now arm64 and riscv override bpf and kprobes allocations to use the
> entire vmalloc address space, but having the ability to allocate generated
> code outside of modules area may be useful for other architectures.
> 
> Still the start + end for the callers feels backwards to me because the
> callers do not define the ranges, but rather the architectures, so we still
> need a way for architectures to define how they want to allocate memory for
> the generated code.

So, the start + end just comes from the need to keep relative pointers
under a certain size. I think this could be just a flag, I see no reason
to expose actual addresses here.


* Re: [PATCH 00/13] mm: jit/text allocator
  2023-06-13 18:56               ` Kent Overstreet
@ 2023-06-13 21:09                 ` Mike Rapoport
  0 siblings, 0 replies; 55+ messages in thread
From: Mike Rapoport @ 2023-06-13 21:09 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Song Liu, Mark Rutland, linux-kernel, Andrew Morton,
	Catalin Marinas, Christophe Leroy, David S. Miller, Dinh Nguyen,
	Heiko Carstens, Helge Deller, Huacai Chen, Luis Chamberlain,
	Michael Ellerman, Naveen N. Rao, Palmer Dabbelt, Russell King,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Tue, Jun 13, 2023 at 02:56:14PM -0400, Kent Overstreet wrote:
> On Thu, Jun 08, 2023 at 09:41:16PM +0300, Mike Rapoport wrote:
> > On Tue, Jun 06, 2023 at 11:21:59AM -0700, Song Liu wrote:
> > > On Mon, Jun 5, 2023 at 3:09 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > > 
> > > [...]
> > > 
> > > > > > > Can you give more detail on what parameters you need? If the only extra
> > > > > > > parameter is just "does this allocation need to live close to kernel
> > > > > > > text", that's not that big of a deal.
> > > > > >
> > > > > > My thinking was that we at least need the start + end for each caller. That
> > > > > > might be it, tbh.
> > > > >
> > > > > Do you mean that modules will have something like
> > > > >
> > > > >       jit_text_alloc(size, MODULES_START, MODULES_END);
> > > > >
> > > > > and kprobes will have
> > > > >
> > > > >       jit_text_alloc(size, KPROBES_START, KPROBES_END);
> > > > > ?
> > > >
> > > > Yes.
> > > 
> > > How about we start with two APIs:
> > >      jit_text_alloc(size);
> > >      jit_text_alloc_range(size, start, end);
> > > 
> > > AFAICT, arm64 is the only arch that requires the latter API. And TBH, I am
> > > not quite convinced it is needed.
> >  
> > Right now arm64 and riscv override bpf and kprobes allocations to use the
> > entire vmalloc address space, but having the ability to allocate generated
> > code outside of modules area may be useful for other architectures.
> > 
> > Still the start + end for the callers feels backwards to me because the
> > callers do not define the ranges, but rather the architectures, so we still
> > need a way for architectures to define how they want to allocate memory for
> > the generated code.
> 
> So, the start + end just comes from the need to keep relative pointers
> under a certain size. I think this could be just a flag, I see no reason
> to expose actual addresses here.

It's the other way around. The start + end come from the need to restrict
allocations to a certain range because of the relative addressing. I don't
see how a flag can help here.
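
To illustrate how a relative-addressing limit turns into a start + end
pair, here is a minimal userspace sketch. It assumes a +/-128 MiB
PC-relative branch reach (the reach of arm64 B/BL); branch_in_range() and
alloc_window() are hypothetical helpers, not code from this series, and the
sketch ignores the size of the allocation itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed reach of a PC-relative branch: +/-128 MiB, as for arm64 B/BL. */
#define BRANCH_RANGE	(UINT64_C(128) << 20)

/* True if a branch located at 'from' can directly encode a jump to 'to'. */
static bool branch_in_range(uint64_t from, uint64_t to)
{
	int64_t offset = (int64_t)(to - from);

	return offset >= -(int64_t)BRANCH_RANGE &&
	       offset < (int64_t)BRANCH_RANGE;
}

/*
 * Hypothetical helper: compute the window [*start, *end) from which every
 * byte of the kernel text [text_start, text_end) is reachable. This is the
 * kind of start + end an architecture would hand to the allocator.
 */
static void alloc_window(uint64_t text_start, uint64_t text_end,
			 uint64_t *start, uint64_t *end)
{
	assert(text_start < text_end);

	/* Lowest address that can still branch forward to the end of text. */
	*start = text_end > BRANCH_RANGE ? text_end - BRANCH_RANGE : 0;
	/* One past the highest address that can branch back to text_start. */
	*end = text_start + BRANCH_RANGE;
}
```

The window depends on where the kernel text sits and on the branch
encoding, both of which are known to the architecture rather than to the
callers.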

-- 
Sincerely yours,
Mike.


* Re: [PATCH 01/13] nios2: define virtual address space for modules
  2023-06-01 10:12 ` [PATCH 01/13] nios2: define virtual address space for modules Mike Rapoport
@ 2023-06-13 22:16   ` Dinh Nguyen
  0 siblings, 0 replies; 55+ messages in thread
From: Dinh Nguyen @ 2023-06-13 22:16 UTC (permalink / raw)
  To: Mike Rapoport, linux-kernel
  Cc: Andrew Morton, Catalin Marinas, Christophe Leroy, David S. Miller,
	Heiko Carstens, Helge Deller, Huacai Chen, Kent Overstreet,
	Luis Chamberlain, Michael Ellerman, Naveen N. Rao, Palmer Dabbelt,
	Russell King, Song Liu, Steven Rostedt, Thomas Bogendoerfer,
	Thomas Gleixner, Will Deacon, bpf, linux-arm-kernel, linux-mips,
	linux-mm, linux-modules, linux-parisc, linux-riscv, linux-s390,
	linux-trace-kernel, linuxppc-dev, loongarch, netdev, sparclinux,
	x86



On 6/1/23 05:12, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> 
> nios2 uses kmalloc() to implement module_alloc() because CALL26/PCREL26
> cannot reach all of vmalloc address space.
> 
> Define module space as 32MiB below the kernel base and switch nios2 to
> use vmalloc for module allocations.
> 
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
> ---
>   arch/nios2/include/asm/pgtable.h |  5 ++++-
>   arch/nios2/kernel/module.c       | 19 ++++---------------
>   2 files changed, 8 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
> index 0f5c2564e9f5..0073b289c6a4 100644
> --- a/arch/nios2/include/asm/pgtable.h
> +++ b/arch/nios2/include/asm/pgtable.h
> @@ -25,7 +25,10 @@
>   #include <asm-generic/pgtable-nopmd.h>
>   
>   #define VMALLOC_START		CONFIG_NIOS2_KERNEL_MMU_REGION_BASE
> -#define VMALLOC_END		(CONFIG_NIOS2_KERNEL_REGION_BASE - 1)
> +#define VMALLOC_END		(CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M - 1)
> +
> +#define MODULES_VADDR		(CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M)
> +#define MODULES_END		(CONFIG_NIOS2_KERNEL_REGION_BASE - 1)
>   
>   struct mm_struct;
>   
> diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
> index 76e0a42d6e36..9c97b7513853 100644
> --- a/arch/nios2/kernel/module.c
> +++ b/arch/nios2/kernel/module.c
> @@ -21,23 +21,12 @@
>   
>   #include <asm/cacheflush.h>
>   
> -/*
> - * Modules should NOT be allocated with kmalloc for (obvious) reasons.
> - * But we do it for now to avoid relocation issues. CALL26/PCREL26 cannot reach
> - * from 0x80000000 (vmalloc area) to 0xc00000000 (kernel) (kmalloc returns
> - * addresses in 0xc0000000)
> - */
>   void *module_alloc(unsigned long size)
>   {
> -	if (size == 0)
> -		return NULL;
> -	return kmalloc(size, GFP_KERNEL);
> -}
> -
> -/* Free memory returned from module_alloc */
> -void module_memfree(void *module_region)
> -{
> -	kfree(module_region);
> +	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
> +				    GFP_KERNEL, PAGE_KERNEL_EXEC,
> +				    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
> +				    __builtin_return_address(0));
>   }
>   
>   int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab,

Acked-by: Dinh Nguyen <dinguyen@kernel.org>


* Re: [PATCH 00/13] mm: jit/text allocator
  2023-06-05 10:09         ` Mark Rutland
  2023-06-06 10:16           ` Mike Rapoport
  2023-06-06 18:21           ` Song Liu
@ 2023-07-20  8:53           ` Mike Rapoport
  2 siblings, 0 replies; 55+ messages in thread
From: Mike Rapoport @ 2023-07-20  8:53 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Kent Overstreet, linux-kernel, Andrew Morton, Catalin Marinas,
	Christophe Leroy, David S. Miller, Dinh Nguyen, Heiko Carstens,
	Helge Deller, Huacai Chen, Luis Chamberlain, Michael Ellerman,
	Naveen N. Rao, Palmer Dabbelt, Russell King, Song Liu,
	Steven Rostedt, Thomas Bogendoerfer, Thomas Gleixner, Will Deacon,
	bpf, linux-arm-kernel, linux-mips, linux-mm, linux-modules,
	linux-parisc, linux-riscv, linux-s390, linux-trace-kernel,
	linuxppc-dev, loongarch, netdev, sparclinux, x86

On Mon, Jun 05, 2023 at 11:09:34AM +0100, Mark Rutland wrote:
> On Mon, Jun 05, 2023 at 12:20:40PM +0300, Mike Rapoport wrote:
> > On Fri, Jun 02, 2023 at 10:35:09AM +0100, Mark Rutland wrote:
> > > On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote:
> > > > On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote:
> > > > > For a while I have wanted to give kprobes its own allocator so that it can work
> > > > > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in
> > > > > the modules area.
> > > > > 
> > > > > Given that, I think these should have their own allocator functions that can be
> > > > > provided independently, even if those happen to use common infrastructure.
> > > > 
> > > > How much memory can kprobes conceivably use? I think we also want to try
> > > > to push back on combinatorial new allocators, if we can.
> > > 
> > > That depends on who's using it, and how (e.g. via BPF).
> > > 
> > > To be clear, I'm not necessarily asking for entirely different allocators, but
> > > I do think that we want wrappers that can at least pass distinct start+end
> > > parameters to a common allocator, and for arm64's modules code I'd expect that
> > > we'd keep the range fallback logic out of the common allocator, and just call
> > > it twice.
> > > 
> > > > > > Several architectures override module_alloc() because of various
> > > > > > constraints where the executable memory can be located and this causes
> > > > > > additional obstacles for improvements of code allocation.
> > > > > > 
> > > > > > This set splits code allocation from modules by introducing
> > > > > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> > > > > > sites of module_alloc() and module_memfree() with the new APIs and
> > > > > > implements core text and related allocation in a central place.
> > > > > > 
> > > > > > Instead of architecture specific overrides for module_alloc(), the
> > > > > > architectures that require non-default behaviour for text allocation must
> > > > > > fill jit_alloc_params structure and implement jit_alloc_arch_params() that
> > > > > > returns a pointer to that structure. If an architecture does not implement
> > > > > > jit_alloc_arch_params(), the defaults compatible with the current
> > > > > > modules::module_alloc() are used.
> > > > > 
> > > > > As above, I suspect that each of the callsites should probably be using common
> > > > > infrastructure, but I don't think that a single jit_alloc_arch_params() makes
> > > > > sense, since the parameters for each case may need to be distinct.
> > > > 
> > > > I don't see how that follows. The whole point of function parameters is
> > > > that they may be different :)
> > > 
> > > What I mean is that jit_alloc_arch_params() tries to aggregate common
> > > parameters, but they aren't actually common (e.g. the actual start+end range
> > > for allocation).
> > 
> > jit_alloc_arch_params() tries to aggregate architecture constraints and
> > requirements for allocations of executable memory and this exactly what
> > the first 6 patches of this set do.
> > 
> > A while ago Thomas suggested to use a structure that parametrizes
> > architecture constraints by the memory type used in modules [1] and Song
> > implemented the infrastructure for it and x86 part [2].
> > 
> > I liked the idea of defining parameters in a single structure, but I
> > thought that approaching the problem from the arch side rather than from
> > modules perspective will be better starting point, hence these patches.
> > 
> > I don't see a fundamental reason why a single structure cannot describe
> > what is needed for different code allocation cases, be it modules, kprobes
> > or bpf. There is of course an assumption that the core allocations will be
> > the same for all the users, and it seems to me that something like 
> > 
> > * allocate physical memory if allocator caches are empty
> > * map it in vmalloc or modules address space
> > * return memory from the allocator cache to the caller
> > 
> > will work for all usecases.
> > 
> > We might need separate caches for different cases on different
> > architectures, and a way to specify what cache should be used in the
> > allocator API, but that does not contradict a single structure for arch
> > specific parameters, but only makes it more elaborate, e.g. something like
> > 
> > enum jit_type {
> > 	JIT_MODULES_TEXT,
> > 	JIT_MODULES_DATA,
> > 	JIT_KPROBES,
> > 	JIT_FTRACE,
> > 	JIT_BPF,
> > 	JIT_TYPE_MAX,
> > };
> > 
> > struct jit_alloc_params {
> > 	struct jit_range	ranges[JIT_TYPE_MAX];
> > 	/* ... */
> > };
> > 
> > > > Can you give more detail on what parameters you need? If the only extra
> > > > parameter is just "does this allocation need to live close to kernel
> > > > text", that's not that big of a deal.
> > > 
> > > My thinking was that we at least need the start + end for each caller. That
> > > might be it, tbh.
> > 
> > Do you mean that modules will have something like
> > 
> > 	jit_text_alloc(size, MODULES_START, MODULES_END);
> > 
> > and kprobes will have
> > 
> > 	jit_text_alloc(size, KPROBES_START, KPROBES_END);
> > ?
> 
> Yes.
> 
> > It still can be achieved with a single jit_alloc_arch_params(), just by
> > adding enum jit_type parameter to jit_text_alloc().
> 
> That feels backwards to me; it centralizes a bunch of information about
> distinct users to be able to shove that into a static array, when the callsites
> can pass that information. 
> 
> What's *actually* common after separating out the ranges? Is it just the
> permissions?

Even if for some architecture the only common thing is the permissions,
having a definition for code allocations in a single place is an improvement.
The diffstat of the patches is indeed positive (even without comments), but
having a single structure that specifies how the code should be allocated
would IMHO actually reduce the maintenance burden.

And features like caching of large pages and sub-page size allocations
will surely be easier to opt into this way.
 
> If we want this to be able to share allocations and so on, why can't we do this
> like a kmem_cache, and have the callsite pass a pointer to the allocator data?
> That would make it easy for callsites to share an allocator or use a distinct
> one.

I've looked into doing this like a kmem_cache with call sites passing the
allocator data, and this gets really hairy. For each user we need to pass
the arch specific parameters to that user, create a cache there, and only
then can the cache be used. Since we don't have hooks to set up any of the
users in the arch code, the initialization gets more complex than shoving
everything into an array.

I think that jit_alloc(type, size) is the best way to move forward to let
different users choose their ranges and potentially caches. Differentiation
by the API name will explode even now, and it'll get worse if/when new users
show up. We also can't force users to avoid PC-relative addressing because,
e.g., RISC-V explicitly switched their BPF JIT to use that.
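
A minimal userspace sketch of the lookup behind a jit_alloc(type, size)
style interface, assuming the enum jit_type and the ranges[] array quoted
above. jit_range_for() is a hypothetical helper, struct jit_range is
reduced to start/end, the addresses are made up, and falling back to the
modules text range for unset entries is an assumption of the sketch, not
something this thread has settled.

```c
#include <assert.h>
#include <stddef.h>

enum jit_type {
	JIT_MODULES_TEXT,
	JIT_MODULES_DATA,
	JIT_KPROBES,
	JIT_FTRACE,
	JIT_BPF,
	JIT_TYPE_MAX,
};

/* Reduced to start/end for illustration. */
struct jit_range {
	unsigned long start;
	unsigned long end;
};

struct jit_alloc_params {
	struct jit_range ranges[JIT_TYPE_MAX];
};

/* Made-up addresses: only modules text and BPF get explicit ranges. */
static const struct jit_alloc_params example_params = {
	.ranges = {
		[JIT_MODULES_TEXT] = { 0xa0000000UL, 0xa8000000UL },
		[JIT_BPF]          = { 0x10000000UL, 0x90000000UL },
	},
};

/*
 * Hypothetical helper: the caller names its type and the arch-provided
 * table supplies the range. Unset entries share the modules text range
 * (an assumption of this sketch).
 */
static const struct jit_range *
jit_range_for(const struct jit_alloc_params *p, enum jit_type type)
{
	const struct jit_range *r;

	assert(p != NULL);
	if ((unsigned int)type >= JIT_TYPE_MAX)
		return NULL;

	r = &p->ranges[type];
	if (r->start == 0 && r->end == 0)
		r = &p->ranges[JIT_MODULES_TEXT];
	return r;
}
```

Adding a new user then means adding an enum value and, only where an
architecture needs it, one more entry in the table.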
 
> Thanks,
> Mark.

-- 
Sincerely yours,
Mike.


end of thread, other threads:[~2023-07-20  8:58 UTC | newest]

Thread overview: 55+ messages
2023-06-01 10:12 [PATCH 00/13] mm: jit/text allocator Mike Rapoport
2023-06-01 10:12 ` [PATCH 01/13] nios2: define virtual address space for modules Mike Rapoport
2023-06-13 22:16   ` Dinh Nguyen
2023-06-01 10:12 ` [PATCH 02/13] mm: introduce jit_text_alloc() and use it instead of module_alloc() Mike Rapoport
2023-06-01 10:12 ` [PATCH 03/13] mm/jitalloc, arch: convert simple overrides of module_alloc to jitalloc Mike Rapoport
2023-06-01 10:12 ` [PATCH 04/13] mm/jitalloc, arch: convert remaining " Mike Rapoport
2023-06-01 22:35   ` Song Liu
2023-06-01 10:12 ` [PATCH 05/13] module, jitalloc: drop module_alloc Mike Rapoport
2023-06-01 10:12 ` [PATCH 06/13] mm/jitalloc: introduce jit_data_alloc() Mike Rapoport
2023-06-01 10:12 ` [PATCH 07/13] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES Mike Rapoport
2023-06-01 10:12 ` [PATCH 08/13] arch: make jitalloc setup available regardless of CONFIG_MODULES Mike Rapoport
2023-06-01 10:12 ` [PATCH 09/13] kprobes: remove dependcy on CONFIG_MODULES Mike Rapoport
2023-06-01 10:12 ` [PATCH 10/13] modules, jitalloc: prepare to allocate executable memory as ROX Mike Rapoport
2023-06-01 10:12 ` [PATCH 11/13] ftrace: Add swap_func to ftrace_process_locs() Mike Rapoport
2023-06-01 10:12 ` [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX Mike Rapoport
2023-06-01 10:30   ` Peter Zijlstra
2023-06-01 11:07     ` Mike Rapoport
2023-06-02  0:02       ` Song Liu
2023-06-01 17:52     ` Kent Overstreet
2023-06-01 16:54   ` Edgecombe, Rick P
2023-06-01 18:00     ` Kent Overstreet
2023-06-01 18:13       ` Edgecombe, Rick P
2023-06-01 18:38         ` Kent Overstreet
2023-06-01 20:50           ` Edgecombe, Rick P
2023-06-01 23:54             ` Nadav Amit
2023-06-05  2:52               ` Steven Rostedt
2023-06-05  8:11                 ` Mike Rapoport
2023-06-05 16:10                   ` Edgecombe, Rick P
2023-06-05 20:42                     ` Mike Rapoport
2023-06-05 21:01                       ` Edgecombe, Rick P
2023-06-05 21:11                     ` Nadav Amit
2023-06-04 21:47             ` Kent Overstreet
2023-06-01 22:49   ` Song Liu
2023-06-01 10:12 ` [PATCH 13/13] x86/jitalloc: make memory allocated for code ROX Mike Rapoport
2023-06-01 16:12 ` [PATCH 00/13] mm: jit/text allocator Mark Rutland
2023-06-01 18:14   ` Kent Overstreet
2023-06-02  9:35     ` Mark Rutland
2023-06-02 18:20       ` Song Liu
2023-06-03 21:11         ` Puranjay Mohan
2023-06-04 18:02         ` Kent Overstreet
2023-06-04 21:22           ` Song Liu
2023-06-04 21:40             ` Kent Overstreet
2023-06-05  4:05               ` Song Liu
2023-06-05  9:20       ` Mike Rapoport
2023-06-05 10:09         ` Mark Rutland
2023-06-06 10:16           ` Mike Rapoport
2023-06-06 18:21           ` Song Liu
2023-06-08 18:41             ` Mike Rapoport
2023-06-09 17:02               ` Song Liu
2023-06-12 21:34                 ` Mike Rapoport
2023-06-13 18:56               ` Kent Overstreet
2023-06-13 21:09                 ` Mike Rapoport
2023-07-20  8:53           ` Mike Rapoport
2023-06-05 21:13         ` Kent Overstreet
2023-06-02  0:36 ` Song Liu
