* [PATCH 0/8] x86: enable EXECMEM_ROX_CACHE for ftrace and kprobes
@ 2025-07-04 13:49 Mike Rapoport
2025-07-04 13:49 ` [PATCH 1/8] execmem: drop unused execmem_update_copy() Mike Rapoport
` (7 more replies)
0 siblings, 8 replies; 22+ messages in thread
From: Mike Rapoport @ 2025-07-04 13:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Andy Lutomirski, Borislav Petkov, Daniel Gomez, Dave Hansen,
Ingo Molnar, Luis Chamberlain, Mark Rutland, Masami Hiramatsu,
Mike Rapoport, H. Peter Anvin, Peter Zijlstra, Petr Pavlu,
Sami Tolvanen, Steven Rostedt, Thomas Gleixner, linux-kernel,
linux-mm, linux-modules, linux-trace-kernel, x86
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Hi,
These patches enable the use of EXECMEM_ROX_CACHE for ftrace and kprobes
allocations on x86.
They also include some groundwork in execmem.
Since the execmem model for caching large ROX pages has changed from the
initial assumption that memory allocated from the ROX cache is always ROX
to the current state where memory can be temporarily made RW and then
restored to ROX, we can stop using text poking to update it. This also
saves the hassle of trying to lock text_mutex in execmem_cache_free() when
kprobes already holds that mutex.
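For illustration, the update sequence the series converges on is roughly
the following (a minimal sketch built from the helpers touched by these
patches; update_rox_area() is a hypothetical wrapper, not part of the API):

	/* make the range writable, update it with plain memcpy(),
	 * then seal it back to read-only-execute */
	static int update_rox_area(void *dst, const void *src, size_t size)
	{
		int err;

		err = execmem_force_rw(dst, size);
		if (err)
			return err;

		memcpy(dst, src, size);

		return execmem_restore_rox(dst, size);
	}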
Patches 1-6 update and clean up execmem ROX cache management,
patch 7 enables EXECMEM_ROX_CACHE for kprobes and
patch 8 enables EXECMEM_ROX_CACHE for ftrace.
The patches are also available at git:
https://git.kernel.org/rppt/h/execmem/x86-rox/ftrace%2bkprobes
Mike Rapoport (Microsoft) (8):
execmem: drop unused execmem_update_copy()
execmem: introduce execmem_alloc_rw()
execmem: rework execmem_cache_free()
execmem: move execmem_force_rw() and execmem_restore_rox() before use
execmem: add fallback for failures in vmalloc(VM_ALLOW_HUGE_VMAP)
execmem: drop writable parameter from execmem_fill_trapping_insns()
x86/kprobes: enable EXECMEM_ROX_CACHE for kprobes allocations
x86/ftrace: enable EXECMEM_ROX_CACHE for ftrace allocations
arch/x86/kernel/alternative.c | 3 +-
arch/x86/kernel/ftrace.c | 2 +-
arch/x86/kernel/kprobes/core.c | 18 ---
arch/x86/mm/init.c | 24 ++--
include/linux/execmem.h | 54 ++++-----
kernel/module/main.c | 13 +--
mm/execmem.c | 193 +++++++++++++++++++++++++--------
7 files changed, 189 insertions(+), 118 deletions(-)
base-commit: 86731a2a651e58953fc949573895f2fa6d456841
--
2.47.2
* [PATCH 1/8] execmem: drop unused execmem_update_copy()
2025-07-04 13:49 [PATCH 0/8] x86: enable EXECMEM_ROX_CACHE for ftrace and kprobes Mike Rapoport
@ 2025-07-04 13:49 ` Mike Rapoport
2025-07-07 10:10 ` Christophe Leroy
2025-07-04 13:49 ` [PATCH 2/8] execmem: introduce execmem_alloc_rw() Mike Rapoport
` (6 subsequent siblings)
7 siblings, 1 reply; 22+ messages in thread
From: Mike Rapoport @ 2025-07-04 13:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Andy Lutomirski, Borislav Petkov, Daniel Gomez, Dave Hansen,
Ingo Molnar, Luis Chamberlain, Mark Rutland, Masami Hiramatsu,
Mike Rapoport, H. Peter Anvin, Peter Zijlstra, Petr Pavlu,
Sami Tolvanen, Steven Rostedt, Thomas Gleixner, linux-kernel,
linux-mm, linux-modules, linux-trace-kernel, x86
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
The execmem_update_copy() that used text poking was required when memory
allocated from the ROX cache was always read-only. Now that its permissions
can be switched to read-write, there is no need for a function that updates
memory with text poking.
Remove it.
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
include/linux/execmem.h | 13 -------------
mm/execmem.c | 5 -----
2 files changed, 18 deletions(-)
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
index 3be35680a54f..734fbe83d98e 100644
--- a/include/linux/execmem.h
+++ b/include/linux/execmem.h
@@ -185,19 +185,6 @@ DEFINE_FREE(execmem, void *, if (_T) execmem_free(_T));
struct vm_struct *execmem_vmap(size_t size);
#endif
-/**
- * execmem_update_copy - copy an update to executable memory
- * @dst: destination address to update
- * @src: source address containing the data
- * @size: how many bytes of memory shold be copied
- *
- * Copy @size bytes from @src to @dst using text poking if the memory at
- * @dst is read-only.
- *
- * Return: a pointer to @dst or NULL on error
- */
-void *execmem_update_copy(void *dst, const void *src, size_t size);
-
/**
* execmem_is_rox - check if execmem is read-only
* @type - the execmem type to check
diff --git a/mm/execmem.c b/mm/execmem.c
index 2b683e7d864d..0712ebb4eb77 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -399,11 +399,6 @@ void execmem_free(void *ptr)
vfree(ptr);
}
-void *execmem_update_copy(void *dst, const void *src, size_t size)
-{
- return text_poke_copy(dst, src, size);
-}
-
bool execmem_is_rox(enum execmem_type type)
{
return !!(execmem_info->ranges[type].flags & EXECMEM_ROX_CACHE);
--
2.47.2
* [PATCH 2/8] execmem: introduce execmem_alloc_rw()
2025-07-04 13:49 [PATCH 0/8] x86: enable EXECMEM_ROX_CACHE for ftrace and kprobes Mike Rapoport
2025-07-04 13:49 ` [PATCH 1/8] execmem: drop unused execmem_update_copy() Mike Rapoport
@ 2025-07-04 13:49 ` Mike Rapoport
2025-07-04 13:49 ` [PATCH 3/8] execmem: rework execmem_cache_free() Mike Rapoport
` (5 subsequent siblings)
7 siblings, 0 replies; 22+ messages in thread
From: Mike Rapoport @ 2025-07-04 13:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Andy Lutomirski, Borislav Petkov, Daniel Gomez, Dave Hansen,
Ingo Molnar, Luis Chamberlain, Mark Rutland, Masami Hiramatsu,
Mike Rapoport, H. Peter Anvin, Peter Zijlstra, Petr Pavlu,
Sami Tolvanen, Steven Rostedt, Thomas Gleixner, linux-kernel,
linux-mm, linux-modules, linux-trace-kernel, x86
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Some callers of execmem_alloc() require the memory to be temporarily
writable even when it is allocated from the ROX cache. These callers use
execmem_make_temp_rw() right after the call to execmem_alloc().
Wrap this sequence in a new execmem_alloc_rw() API.
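In other words, the new helper wraps what callers used to open-code,
roughly (a sketch of the old calling sequence; see the kernel/module/main.c
hunk below for the real before/after):

	/* before: open-coded RW switch after the allocation */
	ptr = execmem_alloc(execmem_type, size);
	if (ptr && execmem_is_rox(execmem_type)) {
		if (execmem_make_temp_rw(ptr, size)) {
			execmem_free(ptr);
			ptr = NULL;
		}
	}

	/* after: one call does both */
	ptr = execmem_alloc_rw(execmem_type, size);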
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
arch/x86/kernel/alternative.c | 3 +--
include/linux/execmem.h | 38 ++++++++++++++++++++---------------
kernel/module/main.c | 13 ++----------
mm/execmem.c | 27 ++++++++++++++++++++++++-
4 files changed, 51 insertions(+), 30 deletions(-)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ea1d984166cd..526a5fef93ab 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -120,7 +120,7 @@ struct its_array its_pages;
static void *__its_alloc(struct its_array *pages)
{
- void *page __free(execmem) = execmem_alloc(EXECMEM_MODULE_TEXT, PAGE_SIZE);
+ void *page __free(execmem) = execmem_alloc_rw(EXECMEM_MODULE_TEXT, PAGE_SIZE);
if (!page)
return NULL;
@@ -237,7 +237,6 @@ static void *its_alloc(void)
if (!page)
return NULL;
- execmem_make_temp_rw(page, PAGE_SIZE);
if (pages == &its_pages)
set_memory_x((unsigned long)page, 1);
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
index 734fbe83d98e..4e510d1c609c 100644
--- a/include/linux/execmem.h
+++ b/include/linux/execmem.h
@@ -67,21 +67,6 @@ enum execmem_range_flags {
*/
void execmem_fill_trapping_insns(void *ptr, size_t size, bool writable);
-/**
- * execmem_make_temp_rw - temporarily remap region with read-write
- * permissions
- * @ptr: address of the region to remap
- * @size: size of the region to remap
- *
- * Remaps a part of the cached large page in the ROX cache in the range
- * [@ptr, @ptr + @size) as writable and not executable. The caller must
- * have exclusive ownership of this range and ensure nothing will try to
- * execute code in this range.
- *
- * Return: 0 on success or negative error code on failure.
- */
-int execmem_make_temp_rw(void *ptr, size_t size);
-
/**
* execmem_restore_rox - restore read-only-execute permissions
* @ptr: address of the region to remap
@@ -95,7 +80,6 @@ int execmem_make_temp_rw(void *ptr, size_t size);
*/
int execmem_restore_rox(void *ptr, size_t size);
#else
-static inline int execmem_make_temp_rw(void *ptr, size_t size) { return 0; }
static inline int execmem_restore_rox(void *ptr, size_t size) { return 0; }
#endif
@@ -165,6 +149,28 @@ struct execmem_info *execmem_arch_setup(void);
*/
void *execmem_alloc(enum execmem_type type, size_t size);
+/**
+ * execmem_alloc_rw - allocate writable executable memory
+ * @type: type of the allocation
+ * @size: how many bytes of memory are required
+ *
+ * Allocates memory that will contain executable code, either generated or
+ * loaded from kernel modules.
+ *
+ * Allocates memory that will contain data coupled with executable code,
+ * like data sections in kernel modules.
+ *
+ * Forces writable permissions on the allocated memory and the caller is
+ * responsible for managing the permissions afterwards.
+ *
+ * For architectures that use ROX cache the permissions will be set to R+W.
+ * For architectures that don't use ROX cache the default permissions for @type
+ * will be used as they must be writable.
+ *
+ * Return: a pointer to the allocated memory or %NULL
+ */
+void *execmem_alloc_rw(enum execmem_type type, size_t size);
+
/**
* execmem_free - free executable memory
* @ptr: pointer to the memory that should be freed
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 413ac6ea3702..d009326ef7bb 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1292,20 +1292,11 @@ static int module_memory_alloc(struct module *mod, enum mod_mem_type type)
else
execmem_type = EXECMEM_MODULE_TEXT;
- ptr = execmem_alloc(execmem_type, size);
+ ptr = execmem_alloc_rw(execmem_type, size);
if (!ptr)
return -ENOMEM;
- if (execmem_is_rox(execmem_type)) {
- int err = execmem_make_temp_rw(ptr, size);
-
- if (err) {
- execmem_free(ptr);
- return -ENOMEM;
- }
-
- mod->mem[type].is_rox = true;
- }
+ mod->mem[type].is_rox = execmem_is_rox(execmem_type);
/*
* The pointer to these blocks of memory are stored on the module
diff --git a/mm/execmem.c b/mm/execmem.c
index 0712ebb4eb77..6b040fbc5f4f 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -336,7 +336,7 @@ static bool execmem_cache_free(void *ptr)
return true;
}
-int execmem_make_temp_rw(void *ptr, size_t size)
+static int execmem_force_rw(void *ptr, size_t size)
{
unsigned int nr = PAGE_ALIGN(size) >> PAGE_SHIFT;
unsigned long addr = (unsigned long)ptr;
@@ -358,6 +358,16 @@ int execmem_restore_rox(void *ptr, size_t size)
}
#else /* CONFIG_ARCH_HAS_EXECMEM_ROX */
+/*
+ * when ROX cache is not used the permissions defined by architectures for
+ * execmem ranges that are updated before use (e.g. EXECMEM_MODULE_TEXT) must
+ * be writable anyway
+ */
+static inline int execmem_force_rw(void *ptr, size_t size)
+{
+ return 0;
+}
+
static void *execmem_cache_alloc(struct execmem_range *range, size_t size)
{
return NULL;
@@ -387,6 +397,21 @@ void *execmem_alloc(enum execmem_type type, size_t size)
return kasan_reset_tag(p);
}
+void *execmem_alloc_rw(enum execmem_type type, size_t size)
+{
+ void *p __free(execmem) = execmem_alloc(type, size);
+ int err;
+
+ if (!p)
+ return NULL;
+
+ err = execmem_force_rw(p, size);
+ if (err)
+ return NULL;
+
+ return no_free_ptr(p);
+}
+
void execmem_free(void *ptr)
{
/*
--
2.47.2
* [PATCH 3/8] execmem: rework execmem_cache_free()
2025-07-04 13:49 [PATCH 0/8] x86: enable EXECMEM_ROX_CACHE for ftrace and kprobes Mike Rapoport
2025-07-04 13:49 ` [PATCH 1/8] execmem: drop unused execmem_update_copy() Mike Rapoport
2025-07-04 13:49 ` [PATCH 2/8] execmem: introduce execmem_alloc_rw() Mike Rapoport
@ 2025-07-04 13:49 ` Mike Rapoport
2025-07-07 11:11 ` Peter Zijlstra
2025-07-07 15:32 ` Yann Ylavic
2025-07-04 13:49 ` [PATCH 4/8] execmem: move execmem_force_rw() and execmem_restore_rox() before use Mike Rapoport
` (4 subsequent siblings)
7 siblings, 2 replies; 22+ messages in thread
From: Mike Rapoport @ 2025-07-04 13:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Andy Lutomirski, Borislav Petkov, Daniel Gomez, Dave Hansen,
Ingo Molnar, Luis Chamberlain, Mark Rutland, Masami Hiramatsu,
Mike Rapoport, H. Peter Anvin, Peter Zijlstra, Petr Pavlu,
Sami Tolvanen, Steven Rostedt, Thomas Gleixner, linux-kernel,
linux-mm, linux-modules, linux-trace-kernel, x86
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Currently execmem_cache_free() ignores potential allocation failures that
may happen in execmem_cache_add(). Besides, it uses text poking to fill the
memory with trapping instructions before returning it to the cache, although
it would be more efficient to make that memory writable, update it using
memcpy() and then restore the ROX protection.
Rework execmem_cache_free() so that in case of an error it will defer
freeing of the memory to a delayed work.
With this the happy fast path will now change permissions to RW, fill the
memory with trapping instructions using memcpy, restore ROX permissions,
add the memory back to the free cache and clear the relevant entry in
busy_areas.
If any step in the fast path fails, the entry in busy_areas will be marked
as pending_free. These entries will be handled by a delayed work and freed
asynchronously.
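The pending_free mark is a tag bit in the area pointer stored in
busy_areas: the areas the cache hands out are page aligned, so a bit
inside the page offset of the pointer is always zero and can carry the
flag. A sketch of the encoding (mirroring the helpers in the patch below):

	/* page-aligned pointers never have this bit set */
	#define PENDING_FREE_MASK	(1 << (PAGE_SHIFT - 1))

	void *tagged  = (void *)((unsigned long)ptr | PENDING_FREE_MASK);
	bool  pending = (unsigned long)tagged & PENDING_FREE_MASK;
	void *orig    = (void *)((unsigned long)tagged & ~PENDING_FREE_MASK);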
To make the fast path faster, use __GFP_NORETRY for memory allocations and
let the asynchronous handler try harder with GFP_KERNEL.
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
mm/execmem.c | 120 +++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 97 insertions(+), 23 deletions(-)
diff --git a/mm/execmem.c b/mm/execmem.c
index 6b040fbc5f4f..1cc781244593 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -93,8 +93,15 @@ struct execmem_cache {
struct mutex mutex;
struct maple_tree busy_areas;
struct maple_tree free_areas;
+ unsigned int pending_free_cnt; /* protected by mutex */
};
+/* delay to schedule asynchronous free if fast path free fails */
+#define FREE_DELAY (msecs_to_jiffies(10))
+
+/* mark entries in busy_areas that should be freed asynchronously */
+#define PENDING_FREE_MASK (1 << (PAGE_SHIFT - 1))
+
static struct execmem_cache execmem_cache = {
.mutex = __MUTEX_INITIALIZER(execmem_cache.mutex),
.busy_areas = MTREE_INIT_EXT(busy_areas, MT_FLAGS_LOCK_EXTERN,
@@ -155,20 +162,17 @@ static void execmem_cache_clean(struct work_struct *work)
static DECLARE_WORK(execmem_cache_clean_work, execmem_cache_clean);
-static int execmem_cache_add(void *ptr, size_t size)
+static int execmem_cache_add_locked(void *ptr, size_t size, gfp_t gfp_mask)
{
struct maple_tree *free_areas = &execmem_cache.free_areas;
- struct mutex *mutex = &execmem_cache.mutex;
unsigned long addr = (unsigned long)ptr;
MA_STATE(mas, free_areas, addr - 1, addr + 1);
unsigned long lower, upper;
void *area = NULL;
- int err;
lower = addr;
upper = addr + size - 1;
- mutex_lock(mutex);
area = mas_walk(&mas);
if (area && mas.last == addr - 1)
lower = mas.index;
@@ -178,12 +182,14 @@ static int execmem_cache_add(void *ptr, size_t size)
upper = mas.last;
mas_set_range(&mas, lower, upper);
- err = mas_store_gfp(&mas, (void *)lower, GFP_KERNEL);
- mutex_unlock(mutex);
- if (err)
- return err;
+ return mas_store_gfp(&mas, (void *)lower, gfp_mask);
+}
- return 0;
+static int execmem_cache_add(void *ptr, size_t size, gfp_t gfp_mask)
+{
+ guard(mutex)(&execmem_cache.mutex);
+
+ return execmem_cache_add_locked(ptr, size, gfp_mask);
}
static bool within_range(struct execmem_range *range, struct ma_state *mas,
@@ -278,7 +284,7 @@ static int execmem_cache_populate(struct execmem_range *range, size_t size)
if (err)
goto err_free_mem;
- err = execmem_cache_add(p, alloc_size);
+ err = execmem_cache_add(p, alloc_size, GFP_KERNEL);
if (err)
goto err_reset_direct_map;
@@ -307,33 +313,101 @@ static void *execmem_cache_alloc(struct execmem_range *range, size_t size)
return __execmem_cache_alloc(range, size);
}
+static inline bool is_pending_free(void *ptr)
+{
+ return ((unsigned long)ptr & PENDING_FREE_MASK);
+}
+
+static inline void *pending_free_set(void *ptr)
+{
+ return (void *)((unsigned long)ptr | PENDING_FREE_MASK);
+}
+
+static inline void *pending_free_clear(void *ptr)
+{
+ return (void *)((unsigned long)ptr & ~PENDING_FREE_MASK);
+}
+
+static int execmem_force_rw(void *ptr, size_t size);
+
+static int __execmem_cache_free(struct ma_state *mas, void *ptr, gfp_t gfp_mask)
+{
+ size_t size = mas_range_len(mas);
+ int err;
+
+ err = execmem_force_rw(ptr, size);
+ if (err)
+ return err;
+
+ execmem_fill_trapping_insns(ptr, size, /* writable = */ true);
+ execmem_restore_rox(ptr, size);
+
+ err = execmem_cache_add_locked(ptr, size, gfp_mask);
+ if (err)
+ return err;
+
+ mas_store_gfp(mas, NULL, gfp_mask);
+ return 0;
+}
+
+static void execmem_cache_free_slow(struct work_struct *work);
+static DECLARE_DELAYED_WORK(execmem_cache_free_work, execmem_cache_free_slow);
+
+static void execmem_cache_free_slow(struct work_struct *work)
+{
+ struct maple_tree *busy_areas = &execmem_cache.busy_areas;
+ MA_STATE(mas, busy_areas, 0, ULONG_MAX);
+ void *area;
+
+ guard(mutex)(&execmem_cache.mutex);
+
+ if (!execmem_cache.pending_free_cnt)
+ return;
+
+ mas_for_each(&mas, area, ULONG_MAX) {
+ if (!is_pending_free(area))
+ continue;
+
+ pending_free_clear(area);
+ if (__execmem_cache_free(&mas, area, GFP_KERNEL))
+ continue;
+
+ execmem_cache.pending_free_cnt--;
+ }
+
+ if (execmem_cache.pending_free_cnt)
+ schedule_delayed_work(&execmem_cache_free_work, FREE_DELAY);
+ else
+ schedule_work(&execmem_cache_clean_work);
+}
+
static bool execmem_cache_free(void *ptr)
{
struct maple_tree *busy_areas = &execmem_cache.busy_areas;
- struct mutex *mutex = &execmem_cache.mutex;
unsigned long addr = (unsigned long)ptr;
MA_STATE(mas, busy_areas, addr, addr);
- size_t size;
void *area;
+ int err;
+
+ guard(mutex)(&execmem_cache.mutex);
- mutex_lock(mutex);
area = mas_walk(&mas);
- if (!area) {
- mutex_unlock(mutex);
+ if (!area)
return false;
- }
- size = mas_range_len(&mas);
-
- mas_store_gfp(&mas, NULL, GFP_KERNEL);
- mutex_unlock(mutex);
-
- execmem_fill_trapping_insns(ptr, size, /* writable = */ false);
- execmem_cache_add(ptr, size);
+ err = __execmem_cache_free(&mas, ptr, GFP_KERNEL | __GFP_NORETRY);
+ if (err)
+ goto err_slowpath;
schedule_work(&execmem_cache_clean_work);
return true;
+
+err_slowpath:
+ mas_store_gfp(&mas, pending_free_set(ptr), GFP_KERNEL);
+ execmem_cache.pending_free_cnt++;
+ schedule_delayed_work(&execmem_cache_free_work, FREE_DELAY);
+ return true;
}
static int execmem_force_rw(void *ptr, size_t size)
--
2.47.2
* [PATCH 4/8] execmem: move execmem_force_rw() and execmem_restore_rox() before use
2025-07-04 13:49 [PATCH 0/8] x86: enable EXECMEM_ROX_CACHE for ftrace and kprobes Mike Rapoport
` (2 preceding siblings ...)
2025-07-04 13:49 ` [PATCH 3/8] execmem: rework execmem_cache_free() Mike Rapoport
@ 2025-07-04 13:49 ` Mike Rapoport
2025-07-04 13:49 ` [PATCH 5/8] execmem: add fallback for failures in vmalloc(VM_ALLOW_HUGE_VMAP) Mike Rapoport
` (3 subsequent siblings)
7 siblings, 0 replies; 22+ messages in thread
From: Mike Rapoport @ 2025-07-04 13:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Andy Lutomirski, Borislav Petkov, Daniel Gomez, Dave Hansen,
Ingo Molnar, Luis Chamberlain, Mark Rutland, Masami Hiramatsu,
Mike Rapoport, H. Peter Anvin, Peter Zijlstra, Petr Pavlu,
Sami Tolvanen, Steven Rostedt, Thomas Gleixner, linux-kernel,
linux-mm, linux-modules, linux-trace-kernel, x86
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
to avoid forward declarations.
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
mm/execmem.c | 44 +++++++++++++++++++++-----------------------
1 file changed, 21 insertions(+), 23 deletions(-)
diff --git a/mm/execmem.c b/mm/execmem.c
index 1cc781244593..3cb3a9d1c93f 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -137,6 +137,27 @@ static int execmem_set_direct_map_valid(struct vm_struct *vm, bool valid)
return err;
}
+static int execmem_force_rw(void *ptr, size_t size)
+{
+ unsigned int nr = PAGE_ALIGN(size) >> PAGE_SHIFT;
+ unsigned long addr = (unsigned long)ptr;
+ int ret;
+
+ ret = set_memory_nx(addr, nr);
+ if (ret)
+ return ret;
+
+ return set_memory_rw(addr, nr);
+}
+
+int execmem_restore_rox(void *ptr, size_t size)
+{
+ unsigned int nr = PAGE_ALIGN(size) >> PAGE_SHIFT;
+ unsigned long addr = (unsigned long)ptr;
+
+ return set_memory_rox(addr, nr);
+}
+
static void execmem_cache_clean(struct work_struct *work)
{
struct maple_tree *free_areas = &execmem_cache.free_areas;
@@ -328,8 +349,6 @@ static inline void *pending_free_clear(void *ptr)
return (void *)((unsigned long)ptr & ~PENDING_FREE_MASK);
}
-static int execmem_force_rw(void *ptr, size_t size);
-
static int __execmem_cache_free(struct ma_state *mas, void *ptr, gfp_t gfp_mask)
{
size_t size = mas_range_len(mas);
@@ -410,27 +429,6 @@ static bool execmem_cache_free(void *ptr)
return true;
}
-static int execmem_force_rw(void *ptr, size_t size)
-{
- unsigned int nr = PAGE_ALIGN(size) >> PAGE_SHIFT;
- unsigned long addr = (unsigned long)ptr;
- int ret;
-
- ret = set_memory_nx(addr, nr);
- if (ret)
- return ret;
-
- return set_memory_rw(addr, nr);
-}
-
-int execmem_restore_rox(void *ptr, size_t size)
-{
- unsigned int nr = PAGE_ALIGN(size) >> PAGE_SHIFT;
- unsigned long addr = (unsigned long)ptr;
-
- return set_memory_rox(addr, nr);
-}
-
#else /* CONFIG_ARCH_HAS_EXECMEM_ROX */
/*
* when ROX cache is not used the permissions defined by architectures for
--
2.47.2
* [PATCH 5/8] execmem: add fallback for failures in vmalloc(VM_ALLOW_HUGE_VMAP)
2025-07-04 13:49 [PATCH 0/8] x86: enable EXECMEM_ROX_CACHE for ftrace and kprobes Mike Rapoport
` (3 preceding siblings ...)
2025-07-04 13:49 ` [PATCH 4/8] execmem: move execmem_force_rw() and execmem_restore_rox() before use Mike Rapoport
@ 2025-07-04 13:49 ` Mike Rapoport
2025-07-04 13:49 ` [PATCH 6/8] execmem: drop writable parameter from execmem_fill_trapping_insns() Mike Rapoport
` (2 subsequent siblings)
7 siblings, 0 replies; 22+ messages in thread
From: Mike Rapoport @ 2025-07-04 13:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Andy Lutomirski, Borislav Petkov, Daniel Gomez, Dave Hansen,
Ingo Molnar, Luis Chamberlain, Mark Rutland, Masami Hiramatsu,
Mike Rapoport, H. Peter Anvin, Peter Zijlstra, Petr Pavlu,
Sami Tolvanen, Steven Rostedt, Thomas Gleixner, linux-kernel,
linux-mm, linux-modules, linux-trace-kernel, x86
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
When execmem populates the ROX cache it uses vmalloc(VM_ALLOW_HUGE_VMAP).
Although vmalloc falls back to allocating base pages if high order
allocation fails, it may happen that it still cannot allocate enough
memory.
Right now the ROX cache is only used by modules and in the majority of
cases the allocations happen at boot time when there's plenty of free
memory, but the upcoming enabling of the ROX cache for ftrace and kprobes
means that execmem allocations can happen when the system is under memory
pressure, and a failure to allocate a large page worth of memory becomes
more likely.
Fall back to regular vmalloc() if vmalloc(VM_ALLOW_HUGE_VMAP) fails.
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
mm/execmem.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/execmem.c b/mm/execmem.c
index 3cb3a9d1c93f..ec2a6aab143b 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -291,6 +291,11 @@ static int execmem_cache_populate(struct execmem_range *range, size_t size)
alloc_size = round_up(size, PMD_SIZE);
p = execmem_vmalloc(range, alloc_size, PAGE_KERNEL, vm_flags);
+ if (!p) {
+ alloc_size = size;
+ p = execmem_vmalloc(range, alloc_size, PAGE_KERNEL, vm_flags);
+ }
+
if (!p)
return err;
@@ -457,7 +462,7 @@ void *execmem_alloc(enum execmem_type type, size_t size)
bool use_cache = range->flags & EXECMEM_ROX_CACHE;
unsigned long vm_flags = VM_FLUSH_RESET_PERMS;
pgprot_t pgprot = range->pgprot;
- void *p;
+ void *p = NULL;
size = PAGE_ALIGN(size);
--
2.47.2
* [PATCH 6/8] execmem: drop writable parameter from execmem_fill_trapping_insns()
2025-07-04 13:49 [PATCH 0/8] x86: enable EXECMEM_ROX_CACHE for ftrace and kprobes Mike Rapoport
` (4 preceding siblings ...)
2025-07-04 13:49 ` [PATCH 5/8] execmem: add fallback for failures in vmalloc(VM_ALLOW_HUGE_VMAP) Mike Rapoport
@ 2025-07-04 13:49 ` Mike Rapoport
2025-07-04 13:49 ` [PATCH 7/8] x86/kprobes: enable EXECMEM_ROX_CACHE for kprobes allocations Mike Rapoport
2025-07-04 13:49 ` [PATCH 8/8] x86/ftrace: enable EXECMEM_ROX_CACHE for ftrace allocations Mike Rapoport
7 siblings, 0 replies; 22+ messages in thread
From: Mike Rapoport @ 2025-07-04 13:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Andy Lutomirski, Borislav Petkov, Daniel Gomez, Dave Hansen,
Ingo Molnar, Luis Chamberlain, Mark Rutland, Masami Hiramatsu,
Mike Rapoport, H. Peter Anvin, Peter Zijlstra, Petr Pavlu,
Sami Tolvanen, Steven Rostedt, Thomas Gleixner, linux-kernel,
linux-mm, linux-modules, linux-trace-kernel, x86
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
After the update of execmem_cache_free() that makes memory writable before
updating it, there is no need to update read-only memory, so the writable
parameter of execmem_fill_trapping_insns() is not needed. Drop it.
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
arch/x86/mm/init.c | 8 ++------
include/linux/execmem.h | 3 +--
mm/execmem.c | 4 ++--
3 files changed, 5 insertions(+), 10 deletions(-)
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 7456df985d96..dbc63f0d538f 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -1063,13 +1063,9 @@ unsigned long arch_max_swapfile_size(void)
static struct execmem_info execmem_info __ro_after_init;
#ifdef CONFIG_ARCH_HAS_EXECMEM_ROX
-void execmem_fill_trapping_insns(void *ptr, size_t size, bool writeable)
+void execmem_fill_trapping_insns(void *ptr, size_t size)
{
- /* fill memory with INT3 instructions */
- if (writeable)
- memset(ptr, INT3_INSN_OPCODE, size);
- else
- text_poke_set(ptr, INT3_INSN_OPCODE, size);
+ memset(ptr, INT3_INSN_OPCODE, size);
}
#endif
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
index 4e510d1c609c..fe367bdadc3e 100644
--- a/include/linux/execmem.h
+++ b/include/linux/execmem.h
@@ -60,12 +60,11 @@ enum execmem_range_flags {
* will trap
* @ptr: pointer to memory to fill
* @size: size of the range to fill
- * @writable: is the memory poited by @ptr is writable or ROX
*
* A hook for architecures to fill execmem ranges with invalid instructions.
* Architectures that use EXECMEM_ROX_CACHE must implement this.
*/
-void execmem_fill_trapping_insns(void *ptr, size_t size, bool writable);
+void execmem_fill_trapping_insns(void *ptr, size_t size);
/**
* execmem_restore_rox - restore read-only-execute permissions
diff --git a/mm/execmem.c b/mm/execmem.c
index ec2a6aab143b..398e60c1002f 100644
--- a/mm/execmem.c
+++ b/mm/execmem.c
@@ -304,7 +304,7 @@ static int execmem_cache_populate(struct execmem_range *range, size_t size)
goto err_free_mem;
/* fill memory with instructions that will trap */
- execmem_fill_trapping_insns(p, alloc_size, /* writable = */ true);
+ execmem_fill_trapping_insns(p, alloc_size);
err = set_memory_rox((unsigned long)p, vm->nr_pages);
if (err)
@@ -363,7 +363,7 @@ static int __execmem_cache_free(struct ma_state *mas, void *ptr, gfp_t gfp_mask)
if (err)
return err;
- execmem_fill_trapping_insns(ptr, size, /* writable = */ true);
+ execmem_fill_trapping_insns(ptr, size);
execmem_restore_rox(ptr, size);
err = execmem_cache_add_locked(ptr, size, gfp_mask);
--
2.47.2
* [PATCH 7/8] x86/kprobes: enable EXECMEM_ROX_CACHE for kprobes allocations
2025-07-04 13:49 [PATCH 0/8] x86: enable EXECMEM_ROX_CACHE for ftrace and kprobes Mike Rapoport
` (5 preceding siblings ...)
2025-07-04 13:49 ` [PATCH 6/8] execmem: drop writable parameter from execmem_fill_trapping_insns() Mike Rapoport
@ 2025-07-04 13:49 ` Mike Rapoport
2025-07-04 13:49 ` [PATCH 8/8] x86/ftrace: enable EXECMEM_ROX_CACHE for ftrace allocations Mike Rapoport
7 siblings, 0 replies; 22+ messages in thread
From: Mike Rapoport @ 2025-07-04 13:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Andy Lutomirski, Borislav Petkov, Daniel Gomez, Dave Hansen,
Ingo Molnar, Luis Chamberlain, Mark Rutland, Masami Hiramatsu,
Mike Rapoport, H. Peter Anvin, Peter Zijlstra, Petr Pavlu,
Sami Tolvanen, Steven Rostedt, Thomas Gleixner, linux-kernel,
linux-mm, linux-modules, linux-trace-kernel, x86
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
x86::alloc_insn_page() always allocates ROX memory.
Instead of overriding this method, add an EXECMEM_KPROBES entry in
execmem_info with pgprot set to PAGE_KERNEL_ROX and use the ROX cache when
configuration and CPU features allow it.
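With the override gone, kprobes falls back to the generic weak
alloc_insn_page() in kernel/kprobes.c, which boils down to (a sketch
of the generic helper):

	void __weak *alloc_insn_page(void)
	{
		return execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
	}

and now hands out ROX memory directly because the EXECMEM_KPROBES range
is mapped with PAGE_KERNEL_ROX.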
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
arch/x86/kernel/kprobes/core.c | 18 ------------------
arch/x86/mm/init.c | 9 ++++++++-
2 files changed, 8 insertions(+), 19 deletions(-)
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 47cb8eb138ba..6079d15dab8c 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -481,24 +481,6 @@ static int prepare_singlestep(kprobe_opcode_t *buf, struct kprobe *p,
return len;
}
-/* Make page to RO mode when allocate it */
-void *alloc_insn_page(void)
-{
- void *page;
-
- page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
- if (!page)
- return NULL;
-
- /*
- * TODO: Once additional kernel code protection mechanisms are set, ensure
- * that the page was not maliciously altered and it is still zeroed.
- */
- set_memory_rox((unsigned long)page, 1);
-
- return page;
-}
-
/* Kprobe x86 instruction emulation - only regs->ip or IF flag modifiers */
static void kprobe_emulate_ifmodifiers(struct kprobe *p, struct pt_regs *regs)
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index dbc63f0d538f..442fafd8ff52 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -1098,7 +1098,14 @@ struct execmem_info __init *execmem_arch_setup(void)
.pgprot = pgprot,
.alignment = MODULE_ALIGN,
},
- [EXECMEM_KPROBES ... EXECMEM_BPF] = {
+ [EXECMEM_KPROBES] = {
+ .flags = flags,
+ .start = start,
+ .end = MODULES_END,
+ .pgprot = PAGE_KERNEL_ROX,
+ .alignment = MODULE_ALIGN,
+ },
+ [EXECMEM_FTRACE ... EXECMEM_BPF] = {
.flags = EXECMEM_KASAN_SHADOW,
.start = start,
.end = MODULES_END,
--
2.47.2
* [PATCH 8/8] x86/ftrace: enable EXECMEM_ROX_CACHE for ftrace allocations
2025-07-04 13:49 [PATCH 0/8] x86: enable EXECMEM_ROX_CACHE for ftrace and kprobes Mike Rapoport
` (6 preceding siblings ...)
2025-07-04 13:49 ` [PATCH 7/8] x86/kprobes: enable EXECMEM_ROX_CACHE for kprobes allocations Mike Rapoport
@ 2025-07-04 13:49 ` Mike Rapoport
7 siblings, 0 replies; 22+ messages in thread
From: Mike Rapoport @ 2025-07-04 13:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Andy Lutomirski, Borislav Petkov, Daniel Gomez, Dave Hansen,
Ingo Molnar, Luis Chamberlain, Mark Rutland, Masami Hiramatsu,
Mike Rapoport, H. Peter Anvin, Peter Zijlstra, Petr Pavlu,
Sami Tolvanen, Steven Rostedt, Thomas Gleixner, linux-kernel,
linux-mm, linux-modules, linux-trace-kernel, x86
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
For the most part ftrace uses text poking and can handle ROX memory.
The only place that requires writable memory is create_trampoline(), which
updates the allocated memory and in the end makes it ROX.
Use execmem_alloc_rw() in x86::ftrace::alloc_tramp() and enable the ROX
cache for EXECMEM_FTRACE when configuration and CPU features allow it.
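The resulting flow in create_trampoline() is then roughly (a simplified
sketch; offsets, fixups and error handling omitted, names as in
arch/x86/kernel/ftrace.c):

	void *tramp = alloc_tramp(size);	/* RW now, via execmem_alloc_rw() */

	/* plain memory writes work while the trampoline is RW */
	copy_from_kernel_nofault(tramp, (void *)start_offset, size);

	/* ... patch in the ops pointer and fixups ... */

	set_memory_rox((unsigned long)tramp, npages);	/* seal it as ROX */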
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
arch/x86/kernel/ftrace.c | 2 +-
arch/x86/mm/init.c | 9 ++++++++-
2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 252e82bcfd2f..4450acec9390 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -263,7 +263,7 @@ void arch_ftrace_update_code(int command)
static inline void *alloc_tramp(unsigned long size)
{
- return execmem_alloc(EXECMEM_FTRACE, size);
+ return execmem_alloc_rw(EXECMEM_FTRACE, size);
}
static inline void tramp_free(void *tramp)
{
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 442fafd8ff52..bb57e93b4caf 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -1105,7 +1105,14 @@ struct execmem_info __init *execmem_arch_setup(void)
.pgprot = PAGE_KERNEL_ROX,
.alignment = MODULE_ALIGN,
},
- [EXECMEM_FTRACE ... EXECMEM_BPF] = {
+ [EXECMEM_FTRACE] = {
+ .flags = flags,
+ .start = start,
+ .end = MODULES_END,
+ .pgprot = pgprot,
+ .alignment = MODULE_ALIGN,
+ },
+ [EXECMEM_BPF] = {
.flags = EXECMEM_KASAN_SHADOW,
.start = start,
.end = MODULES_END,
--
2.47.2
* Re: [PATCH 1/8] execmem: drop unused execmem_update_copy()
2025-07-04 13:49 ` [PATCH 1/8] execmem: drop unused execmem_update_copy() Mike Rapoport
@ 2025-07-07 10:10 ` Christophe Leroy
2025-07-07 11:49 ` Mike Rapoport
0 siblings, 1 reply; 22+ messages in thread
From: Christophe Leroy @ 2025-07-07 10:10 UTC (permalink / raw)
To: Mike Rapoport, Andrew Morton
Cc: Andy Lutomirski, Borislav Petkov, Daniel Gomez, Dave Hansen,
Ingo Molnar, Luis Chamberlain, Mark Rutland, Masami Hiramatsu,
H. Peter Anvin, Peter Zijlstra, Petr Pavlu, Sami Tolvanen,
Steven Rostedt, Thomas Gleixner, linux-kernel, linux-mm,
linux-modules, linux-trace-kernel, x86
On 04/07/2025 at 15:49, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> The execmem_update_copy() that used text poking was required when memory
> allocated from the ROX cache was always read-only. Now that its permissions
> can be switched to read-write, there is no need for a function that updates
> memory with text poking.
Erm. Looks like I missed the patch that introduced this change.
On some variant of powerpc, namely book3s/32, this is not feasible. The
granularity for setting the NX (non-exec) bit is 256 Mbyte sections.
So the area dedicated to execmem [MODULES_VADDR; MODULES_END[ always
has the NX bit unset.
You can change any page within this area from ROX to RWX but you can't
make it RW without X. If you want RW without X you must map it in the
VMALLOC area, as the VMALLOC area always has the NX bit set.
Christophe
>
> Remove it.
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> include/linux/execmem.h | 13 -------------
> mm/execmem.c | 5 -----
> 2 files changed, 18 deletions(-)
>
> diff --git a/include/linux/execmem.h b/include/linux/execmem.h
> index 3be35680a54f..734fbe83d98e 100644
> --- a/include/linux/execmem.h
> +++ b/include/linux/execmem.h
> @@ -185,19 +185,6 @@ DEFINE_FREE(execmem, void *, if (_T) execmem_free(_T));
> struct vm_struct *execmem_vmap(size_t size);
> #endif
>
> -/**
> - * execmem_update_copy - copy an update to executable memory
> - * @dst: destination address to update
> - * @src: source address containing the data
> - * @size: how many bytes of memory shold be copied
> - *
> - * Copy @size bytes from @src to @dst using text poking if the memory at
> - * @dst is read-only.
> - *
> - * Return: a pointer to @dst or NULL on error
> - */
> -void *execmem_update_copy(void *dst, const void *src, size_t size);
> -
> /**
> * execmem_is_rox - check if execmem is read-only
> * @type - the execmem type to check
> diff --git a/mm/execmem.c b/mm/execmem.c
> index 2b683e7d864d..0712ebb4eb77 100644
> --- a/mm/execmem.c
> +++ b/mm/execmem.c
> @@ -399,11 +399,6 @@ void execmem_free(void *ptr)
> vfree(ptr);
> }
>
> -void *execmem_update_copy(void *dst, const void *src, size_t size)
> -{
> - return text_poke_copy(dst, src, size);
> -}
> -
> bool execmem_is_rox(enum execmem_type type)
> {
> return !!(execmem_info->ranges[type].flags & EXECMEM_ROX_CACHE);
* Re: [PATCH 3/8] execmem: rework execmem_cache_free()
2025-07-04 13:49 ` [PATCH 3/8] execmem: rework execmem_cache_free() Mike Rapoport
@ 2025-07-07 11:11 ` Peter Zijlstra
2025-07-07 11:32 ` Mike Rapoport
2025-07-07 15:32 ` Yann Ylavic
1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2025-07-07 11:11 UTC (permalink / raw)
To: Mike Rapoport
Cc: Andrew Morton, Andy Lutomirski, Borislav Petkov, Daniel Gomez,
Dave Hansen, Ingo Molnar, Luis Chamberlain, Mark Rutland,
Masami Hiramatsu, H. Peter Anvin, Petr Pavlu, Sami Tolvanen,
Steven Rostedt, Thomas Gleixner, linux-kernel, linux-mm,
linux-modules, linux-trace-kernel, x86
On Fri, Jul 04, 2025 at 04:49:38PM +0300, Mike Rapoport wrote:
> static bool execmem_cache_free(void *ptr)
> {
> struct maple_tree *busy_areas = &execmem_cache.busy_areas;
> unsigned long addr = (unsigned long)ptr;
> MA_STATE(mas, busy_areas, addr, addr);
> void *area;
> + int err;
> +
> + guard(mutex)(&execmem_cache.mutex);
>
> area = mas_walk(&mas);
> + if (!area)
> return false;
>
> + err = __execmem_cache_free(&mas, ptr, GFP_KERNEL | __GFP_NORETRY);
> + if (err)
> + goto err_slowpath;
>
> schedule_work(&execmem_cache_clean_work);
>
> return true;
> +
> +err_slowpath:
> + mas_store_gfp(&mas, pending_free_set(ptr), GFP_KERNEL);
> + execmem_cache.pending_free_cnt++;
> + schedule_delayed_work(&execmem_cache_free_work, FREE_DELAY);
> + return true;
> }
This is a bit of an anti-pattern, using guard() and error goto. Since
there is only the one site, it's best to write it like so:
static bool execmem_cache_free(void *ptr)
{
struct maple_tree *busy_areas = &execmem_cache.busy_areas;
unsigned long addr = (unsigned long)ptr;
MA_STATE(mas, busy_areas, addr, addr);
void *area;
int err;
guard(mutex)(&execmem_cache.mutex);
area = mas_walk(&mas);
if (!area)
return false;
err = __execmem_cache_free(&mas, ptr, GFP_KERNEL | __GFP_NORETRY);
if (err) {
mas_store_gfp(&mas, pending_free_set(ptr), GFP_KERNEL);
execmem_cache.pending_free_cnt++;
schedule_delayed_work(&execmem_cache_free_work, FREE_DELAY);
return true;
}
schedule_work(&execmem_cache_clean_work);
return true;
}
And now I have to ask what happens if mas_store_gfp() returns an error?
* Re: [PATCH 3/8] execmem: rework execmem_cache_free()
2025-07-07 11:11 ` Peter Zijlstra
@ 2025-07-07 11:32 ` Mike Rapoport
2025-07-07 15:06 ` Liam R. Howlett
0 siblings, 1 reply; 22+ messages in thread
From: Mike Rapoport @ 2025-07-07 11:32 UTC (permalink / raw)
To: Peter Zijlstra, Liam R. Howlett
Cc: Andrew Morton, Andy Lutomirski, Borislav Petkov, Daniel Gomez,
Dave Hansen, Ingo Molnar, Luis Chamberlain, Mark Rutland,
Masami Hiramatsu, H. Peter Anvin, Petr Pavlu, Sami Tolvanen,
Steven Rostedt, Thomas Gleixner, linux-kernel, linux-mm,
linux-modules, linux-trace-kernel, x86
On Mon, Jul 07, 2025 at 01:11:02PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 04, 2025 at 04:49:38PM +0300, Mike Rapoport wrote:
> > static bool execmem_cache_free(void *ptr)
> > {
> > struct maple_tree *busy_areas = &execmem_cache.busy_areas;
> > unsigned long addr = (unsigned long)ptr;
> > MA_STATE(mas, busy_areas, addr, addr);
> > void *area;
> > + int err;
> > +
> > + guard(mutex)(&execmem_cache.mutex);
> >
> > area = mas_walk(&mas);
> > + if (!area)
> > return false;
> >
> > + err = __execmem_cache_free(&mas, ptr, GFP_KERNEL | __GFP_NORETRY);
> > + if (err)
> > + goto err_slowpath;
> >
> > schedule_work(&execmem_cache_clean_work);
> >
> > return true;
> > +
> > +err_slowpath:
> > + mas_store_gfp(&mas, pending_free_set(ptr), GFP_KERNEL);
> > + execmem_cache.pending_free_cnt++;
> > + schedule_delayed_work(&execmem_cache_free_work, FREE_DELAY);
> > + return true;
> > }
>
> This is a bit of an anti-pattern, using guard() and error goto. Since
Good to know :)
> there is only the one site, it's best to write it like so:
>
> static bool execmem_cache_free(void *ptr)
> {
> struct maple_tree *busy_areas = &execmem_cache.busy_areas;
> unsigned long addr = (unsigned long)ptr;
> MA_STATE(mas, busy_areas, addr, addr);
> void *area;
> int err;
>
> guard(mutex)(&execmem_cache.mutex);
>
> area = mas_walk(&mas);
> if (!area)
> return false;
>
> err = __execmem_cache_free(&mas, ptr, GFP_KERNEL | __GFP_NORETRY);
> if (err) {
> mas_store_gfp(&mas, pending_free_set(ptr), GFP_KERNEL);
> execmem_cache.pending_free_cnt++;
> schedule_delayed_work(&execmem_cache_free_work, FREE_DELAY);
> return true;
> }
>
> schedule_work(&execmem_cache_clean_work);
> return true;
> }
>
> And now I have to ask what happens if mas_store_gfp() returns an error?
AFAIU it won't. mas points to exact slot we've got the area from, nothing else
can modify the tree because of the mutex, so that mas_store_gfp()
essentially updates the value at an existing entry.
I'll add a comment about it.
Added @Liam to make sure I'm not saying nonsense :)
--
Sincerely yours,
Mike.
* Re: [PATCH 1/8] execmem: drop unused execmem_update_copy()
2025-07-07 10:10 ` Christophe Leroy
@ 2025-07-07 11:49 ` Mike Rapoport
2025-07-07 13:02 ` Christophe Leroy
0 siblings, 1 reply; 22+ messages in thread
From: Mike Rapoport @ 2025-07-07 11:49 UTC (permalink / raw)
To: Christophe Leroy
Cc: Andrew Morton, Andy Lutomirski, Borislav Petkov, Daniel Gomez,
Dave Hansen, Ingo Molnar, Luis Chamberlain, Mark Rutland,
Masami Hiramatsu, H. Peter Anvin, Peter Zijlstra, Petr Pavlu,
Sami Tolvanen, Steven Rostedt, Thomas Gleixner, linux-kernel,
linux-mm, linux-modules, linux-trace-kernel, x86
On Mon, Jul 07, 2025 at 12:10:43PM +0200, Christophe Leroy wrote:
>
> On 04/07/2025 at 15:49, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> > The execmem_update_copy() that used text poking was required when memory
> > allocated from the ROX cache was always read-only. Now that its permissions
> > can be switched to read-write, there is no need for a function that updates
> > memory with text poking.
>
> Erm. Looks like I missed the patch that introduced this change.
>
> On some variant of powerpc, namely book3s/32, this is not feasible.
The only user of EXECMEM_ROX_CACHE for now is x86-64; we can always revisit
when powerpc book3s/32 wants to opt in to cache usage.
And it seems that [MODULES_VADDR, MODULES_END] is already mapped with
"large pages", isn't it?
> The granularity for setting the NX (non exec) bit is 256 Mbytes sections.
> So the area dedicated to execmem [MODULES_VADDR; MODULES_END[ always have
> the NX bit unset.
>
> You can change any page within this area from ROX to RWX but you can't make
> it RW without X. If you want RW without X you must map it in the VMALLOC
> area, as VMALLOC area have NX bit always set.
So what will happen when one calls
set_memory_nx()
set_memory_rw()
in such areas?
> Christophe
--
Sincerely yours,
Mike.
* Re: [PATCH 1/8] execmem: drop unused execmem_update_copy()
2025-07-07 11:49 ` Mike Rapoport
@ 2025-07-07 13:02 ` Christophe Leroy
2025-07-08 8:22 ` Mike Rapoport
0 siblings, 1 reply; 22+ messages in thread
From: Christophe Leroy @ 2025-07-07 13:02 UTC (permalink / raw)
To: Mike Rapoport
Cc: Andrew Morton, Andy Lutomirski, Borislav Petkov, Daniel Gomez,
Dave Hansen, Ingo Molnar, Luis Chamberlain, Mark Rutland,
Masami Hiramatsu, H. Peter Anvin, Peter Zijlstra, Petr Pavlu,
Sami Tolvanen, Steven Rostedt, Thomas Gleixner, linux-kernel,
linux-mm, linux-modules, linux-trace-kernel, x86
On 07/07/2025 at 13:49, Mike Rapoport wrote:
> On Mon, Jul 07, 2025 at 12:10:43PM +0200, Christophe Leroy wrote:
>>
>> On 04/07/2025 at 15:49, Mike Rapoport wrote:
>>> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>>>
>>> The execmem_update_copy() that used text poking was required when memory
>>> allocated from the ROX cache was always read-only. Now that its permissions
>>> can be switched to read-write, there is no need for a function that updates
>>> memory with text poking.
>>
>> Erm. Looks like I missed the patch that introduced this change.
>>
>> On some variant of powerpc, namely book3s/32, this is not feasible.
>
> The only user of EXECMEM_ROX_CACHE for now is x86-64; we can always revisit
> when powerpc book3s/32 wants to opt in to cache usage.
>
> And it seems that [MODULES_VADDR, MODULES_END] is already mapped with
> "large pages", isn't it?
I don't think so. It uses execmem_alloc(), which sets VM_ALLOW_HUGE_VMAP
only when using EXECMEM_ROX_CACHE. And book3s/32 doesn't have large pages.
Only 8xx has large pages, but they are not PMD aligned (PMD_SIZE is 4M
while large pages are 512k and 8M), so it wouldn't work well with the
existing execmem_vmalloc().
>
>> The granularity for setting the NX (non exec) bit is 256 Mbytes sections.
>> So the area dedicated to execmem [MODULES_VADDR; MODULES_END[ always have
>> the NX bit unset.
>>
>> You can change any page within this area from ROX to RWX but you can't make
>> it RW without X. If you want RW without X you must map it in the VMALLOC
>> area, as VMALLOC area have NX bit always set.
>
> So what will happen when one calls
>
> set_memory_nx()
> set_memory_rw()
>
> in such areas?
Nothing will happen. It will successfully unset the X bit on the PTE, but
that will be ignored by the HW, which only relies on the segment's NX bit,
which is set for the entire VMALLOC area and unset for the entire MODULE
area.
That's one of the reasons why it has
CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC, to make sure text is
allocated in the exec area and data in the no-exec area.
Christophe
* Re: [PATCH 3/8] execmem: rework execmem_cache_free()
2025-07-07 11:32 ` Mike Rapoport
@ 2025-07-07 15:06 ` Liam R. Howlett
2025-07-07 15:12 ` Mike Rapoport
0 siblings, 1 reply; 22+ messages in thread
From: Liam R. Howlett @ 2025-07-07 15:06 UTC (permalink / raw)
To: Mike Rapoport
Cc: Peter Zijlstra, Andrew Morton, Andy Lutomirski, Borislav Petkov,
Daniel Gomez, Dave Hansen, Ingo Molnar, Luis Chamberlain,
Mark Rutland, Masami Hiramatsu, H. Peter Anvin, Petr Pavlu,
Sami Tolvanen, Steven Rostedt, Thomas Gleixner, linux-kernel,
linux-mm, linux-modules, linux-trace-kernel, x86
* Mike Rapoport <rppt@kernel.org> [250707 07:32]:
> On Mon, Jul 07, 2025 at 01:11:02PM +0200, Peter Zijlstra wrote:
> > On Fri, Jul 04, 2025 at 04:49:38PM +0300, Mike Rapoport wrote:
> > > static bool execmem_cache_free(void *ptr)
> > > {
> > > struct maple_tree *busy_areas = &execmem_cache.busy_areas;
> > > unsigned long addr = (unsigned long)ptr;
> > > MA_STATE(mas, busy_areas, addr, addr);
> > > void *area;
> > > + int err;
> > > +
> > > + guard(mutex)(&execmem_cache.mutex);
> > >
> > > area = mas_walk(&mas);
> > > + if (!area)
> > > return false;
> > >
> > > + err = __execmem_cache_free(&mas, ptr, GFP_KERNEL | __GFP_NORETRY);
> > > + if (err)
> > > + goto err_slowpath;
> > >
> > > schedule_work(&execmem_cache_clean_work);
> > >
> > > return true;
> > > +
> > > +err_slowpath:
> > > + mas_store_gfp(&mas, pending_free_set(ptr), GFP_KERNEL);
> > > + execmem_cache.pending_free_cnt++;
> > > + schedule_delayed_work(&execmem_cache_free_work, FREE_DELAY);
> > > + return true;
> > > }
> >
> > This is a bit of an anti-pattern, using guard() and error goto. Since
>
> Good to know :)
>
> > there is only the one site, it's best to write it like so:
> >
> > static bool execmem_cache_free(void *ptr)
> > {
> > struct maple_tree *busy_areas = &execmem_cache.busy_areas;
> > unsigned long addr = (unsigned long)ptr;
> > MA_STATE(mas, busy_areas, addr, addr);
> > void *area;
> > int err;
> >
> > guard(mutex)(&execmem_cache.mutex);
> >
> > area = mas_walk(&mas);
> > if (!area)
> > return false;
> >
> > err = __execmem_cache_free(&mas, ptr, GFP_KERNEL | __GFP_NORETRY);
> > if (err) {
> > mas_store_gfp(&mas, pending_free_set(ptr), GFP_KERNEL);
> > execmem_cache.pending_free_cnt++;
> > schedule_delayed_work(&execmem_cache_free_work, FREE_DELAY);
> > return true;
> > }
> >
> > schedule_work(&execmem_cache_clean_work);
> > return true;
> > }
> >
> > And now I have to ask what happens if mas_store_gfp() returns an error?
>
> AFAIU it won't. mas points to exact slot we've got the area from, nothing else
> can modify the tree because of the mutex, so that mas_store_gfp()
> essentially updates the value at an existing entry.
>
> I'll add a comment about it.
>
> Added @Liam to make sure I'm not saying nonsense :)
>
Yes, if there is already a node with a value with the same range, there
will be no allocations that will happen, so it'll just change the
pointer for you. This is a slot store operation.
But, if it's possible to have no entries (an empty tree, or a single
value at 0), you will most likely allocate a node to store it, which is
256B.
I don't think this is a concern in this particular case though as you
are searching for an entry and storing, so it needs to exist. So
really, the only scenario here is if you store 1 - ULONG_MAX (without
having expanded a root node) or 0 - ULONG_MAX, and that seems invalid.
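For reference, the pattern under discussion boils down to this (a minimal
sketch; the tree, addr and new_value are placeholders, and the execmem
mutex that serializes the real code is omitted):

	#include <linux/maple_tree.h>

	static DEFINE_MTREE(tree);	/* placeholder tree */

	static int replace_in_place(unsigned long addr, void *new_value)
	{
		MA_STATE(mas, &tree, addr, addr);
		void *area;

		area = mas_walk(&mas);	/* positions mas on the existing entry */
		if (!area)
			return -ENOENT;

		/*
		 * Storing a new value over the exact same range is a
		 * "slot store" and does not allocate nodes.
		 */
		return mas_store_gfp(&mas, new_value, GFP_KERNEL);
	}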
Thanks,
Liam
* Re: [PATCH 3/8] execmem: rework execmem_cache_free()
2025-07-07 15:06 ` Liam R. Howlett
@ 2025-07-07 15:12 ` Mike Rapoport
2025-07-08 7:26 ` Peter Zijlstra
0 siblings, 1 reply; 22+ messages in thread
From: Mike Rapoport @ 2025-07-07 15:12 UTC (permalink / raw)
To: Liam R. Howlett, Peter Zijlstra, Andrew Morton, Andy Lutomirski,
Borislav Petkov, Daniel Gomez, Dave Hansen, Ingo Molnar,
Luis Chamberlain, Mark Rutland, Masami Hiramatsu, H. Peter Anvin,
Petr Pavlu, Sami Tolvanen, Steven Rostedt, Thomas Gleixner,
linux-kernel, linux-mm, linux-modules, linux-trace-kernel, x86
On Mon, Jul 07, 2025 at 11:06:25AM -0400, Liam R. Howlett wrote:
> * Mike Rapoport <rppt@kernel.org> [250707 07:32]:
> > On Mon, Jul 07, 2025 at 01:11:02PM +0200, Peter Zijlstra wrote:
> > >
> > > err = __execmem_cache_free(&mas, ptr, GFP_KERNEL | __GFP_NORETRY);
> > > if (err) {
> > > mas_store_gfp(&mas, pending_free_set(ptr), GFP_KERNEL);
> > > execmem_cache.pending_free_cnt++;
> > > schedule_delayed_work(&execmem_cache_free_work, FREE_DELAY);
> > > return true;
> > > }
> > >
> > > schedule_work(&execmem_cache_clean_work);
> > > return true;
> > > }
> > >
> > > And now I have to ask what happens if mas_store_gfp() returns an error?
> >
> > AFAIU it won't. mas points to exact slot we've got the area from, nothing else
> > can modify the tree because of the mutex, so that mas_store_gfp()
> > essentially updates the value at an existing entry.
> >
> > I'll add a comment about it.
> >
> > Added @Liam to make sure I'm not saying nonsense :)
> >
>
> Yes, if there is already a node with a value with the same range, there
> will be no allocations that will happen, so it'll just change the
> pointer for you. This is a slot store operation.
>
> But, if it's possible to have no entries (an empty tree, or a single
> value at 0), you will most likely allocate a node to store it, which is
> 256B.
>
> I don't think this is a concern in this particular case though as you
> are searching for an entry and storing, so it needs to exist. So
> really, the only scenario here is if you store 1 - ULONG_MAX (without
> having expanded a root node) or 0 - ULONG_MAX, and that seems invalid.
Thanks for the clarification, Liam!
The tree cannot be empty at that point, and if it has a single value it
won't be at 0; I'm quite sure no architecture has execmem areas at 0.
> Thanks,
> Liam
--
Sincerely yours,
Mike.
* Re: [PATCH 3/8] execmem: rework execmem_cache_free()
2025-07-04 13:49 ` [PATCH 3/8] execmem: rework execmem_cache_free() Mike Rapoport
2025-07-07 11:11 ` Peter Zijlstra
@ 2025-07-07 15:32 ` Yann Ylavic
2025-07-07 15:43 ` Yann Ylavic
2025-07-08 7:10 ` Mike Rapoport
1 sibling, 2 replies; 22+ messages in thread
From: Yann Ylavic @ 2025-07-07 15:32 UTC (permalink / raw)
To: Mike Rapoport
Cc: Andrew Morton, Andy Lutomirski, Borislav Petkov, Daniel Gomez,
Dave Hansen, Ingo Molnar, Luis Chamberlain, Mark Rutland,
Masami Hiramatsu, H. Peter Anvin, Peter Zijlstra, Petr Pavlu,
Sami Tolvanen, Steven Rostedt, Thomas Gleixner, linux-kernel,
linux-mm, linux-modules, linux-trace-kernel, x86
On Fri, Jul 4, 2025 at 3:54 PM Mike Rapoport <rppt@kernel.org> wrote:
> +
> +static void execmem_cache_free_slow(struct work_struct *work)
> +{
> + struct maple_tree *busy_areas = &execmem_cache.busy_areas;
> + MA_STATE(mas, busy_areas, 0, ULONG_MAX);
> + void *area;
> +
> + guard(mutex)(&execmem_cache.mutex);
> +
> + if (!execmem_cache.pending_free_cnt)
> + return;
> +
> + mas_for_each(&mas, area, ULONG_MAX) {
> + if (!is_pending_free(area))
> + continue;
> +
> + pending_free_clear(area);
Probably:
area = pending_free_clear(area);
?
> + if (__execmem_cache_free(&mas, area, GFP_KERNEL))
> + continue;
> +
> + execmem_cache.pending_free_cnt--;
> + }
> +
> + if (execmem_cache.pending_free_cnt)
> + schedule_delayed_work(&execmem_cache_free_work, FREE_DELAY);
> + else
> + schedule_work(&execmem_cache_clean_work);
> +}
Regards;
Yann.
* Re: [PATCH 3/8] execmem: rework execmem_cache_free()
2025-07-07 15:32 ` Yann Ylavic
@ 2025-07-07 15:43 ` Yann Ylavic
2025-07-08 7:10 ` Mike Rapoport
1 sibling, 0 replies; 22+ messages in thread
From: Yann Ylavic @ 2025-07-07 15:43 UTC (permalink / raw)
To: Mike Rapoport
Cc: Andrew Morton, Andy Lutomirski, Borislav Petkov, Daniel Gomez,
Dave Hansen, Ingo Molnar, Luis Chamberlain, Mark Rutland,
Masami Hiramatsu, H. Peter Anvin, Peter Zijlstra, Petr Pavlu,
Sami Tolvanen, Steven Rostedt, Thomas Gleixner, linux-kernel,
linux-mm, linux-modules, linux-trace-kernel, x86
On Mon, Jul 7, 2025 at 5:32 PM Yann Ylavic <ylavic.dev@gmail.com> wrote:
>
> On Fri, Jul 4, 2025 at 3:54 PM Mike Rapoport <rppt@kernel.org> wrote:
> > +
> > +static void execmem_cache_free_slow(struct work_struct *work)
> > +{
> > + struct maple_tree *busy_areas = &execmem_cache.busy_areas;
> > + MA_STATE(mas, busy_areas, 0, ULONG_MAX);
> > + void *area;
> > +
> > + guard(mutex)(&execmem_cache.mutex);
> > +
> > + if (!execmem_cache.pending_free_cnt)
> > + return;
> > +
> > + mas_for_each(&mas, area, ULONG_MAX) {
> > + if (!is_pending_free(area))
> > + continue;
> > +
> > + pending_free_clear(area);
>
> Probably:
> area = pending_free_clear(area);
> ?
Likewise in execmem_cache_free_slow() btw.
>
> > + if (__execmem_cache_free(&mas, area, GFP_KERNEL))
> > + continue;
> > +
> > + execmem_cache.pending_free_cnt--;
> > + }
> > +
> > + if (execmem_cache.pending_free_cnt)
> > + schedule_delayed_work(&execmem_cache_free_work, FREE_DELAY);
> > + else
> > + schedule_work(&execmem_cache_clean_work);
> > +}
>
>
> Regards;
> Yann.
* Re: [PATCH 3/8] execmem: rework execmem_cache_free()
2025-07-07 15:32 ` Yann Ylavic
2025-07-07 15:43 ` Yann Ylavic
@ 2025-07-08 7:10 ` Mike Rapoport
1 sibling, 0 replies; 22+ messages in thread
From: Mike Rapoport @ 2025-07-08 7:10 UTC (permalink / raw)
To: Yann Ylavic
Cc: Andrew Morton, Andy Lutomirski, Borislav Petkov, Daniel Gomez,
Dave Hansen, Ingo Molnar, Luis Chamberlain, Mark Rutland,
Masami Hiramatsu, H. Peter Anvin, Peter Zijlstra, Petr Pavlu,
Sami Tolvanen, Steven Rostedt, Thomas Gleixner, linux-kernel,
linux-mm, linux-modules, linux-trace-kernel, x86
On Mon, Jul 07, 2025 at 05:32:11PM +0200, Yann Ylavic wrote:
> On Fri, Jul 4, 2025 at 3:54 PM Mike Rapoport <rppt@kernel.org> wrote:
> > +
> > +static void execmem_cache_free_slow(struct work_struct *work)
> > +{
> > + struct maple_tree *busy_areas = &execmem_cache.busy_areas;
> > + MA_STATE(mas, busy_areas, 0, ULONG_MAX);
> > + void *area;
> > +
> > + guard(mutex)(&execmem_cache.mutex);
> > +
> > + if (!execmem_cache.pending_free_cnt)
> > + return;
> > +
> > + mas_for_each(&mas, area, ULONG_MAX) {
> > + if (!is_pending_free(area))
> > + continue;
> > +
> > + pending_free_clear(area);
>
> Probably:
> area = pending_free_clear(area);
> ?
Right, thanks!
> > + if (__execmem_cache_free(&mas, area, GFP_KERNEL))
> > + continue;
> > +
> > + execmem_cache.pending_free_cnt--;
> > + }
> > +
> > + if (execmem_cache.pending_free_cnt)
> > + schedule_delayed_work(&execmem_cache_free_work, FREE_DELAY);
> > + else
> > + schedule_work(&execmem_cache_clean_work);
> > +}
>
>
> Regards;
> Yann.
--
Sincerely yours,
Mike.
* Re: [PATCH 3/8] execmem: rework execmem_cache_free()
2025-07-07 15:12 ` Mike Rapoport
@ 2025-07-08 7:26 ` Peter Zijlstra
2025-07-08 8:13 ` Mike Rapoport
0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2025-07-08 7:26 UTC (permalink / raw)
To: Mike Rapoport
Cc: Liam R. Howlett, Andrew Morton, Andy Lutomirski, Borislav Petkov,
Daniel Gomez, Dave Hansen, Ingo Molnar, Luis Chamberlain,
Mark Rutland, Masami Hiramatsu, H. Peter Anvin, Petr Pavlu,
Sami Tolvanen, Steven Rostedt, Thomas Gleixner, linux-kernel,
linux-mm, linux-modules, linux-trace-kernel, x86
On Mon, Jul 07, 2025 at 06:12:26PM +0300, Mike Rapoport wrote:
> On Mon, Jul 07, 2025 at 11:06:25AM -0400, Liam R. Howlett wrote:
> > * Mike Rapoport <rppt@kernel.org> [250707 07:32]:
> > > On Mon, Jul 07, 2025 at 01:11:02PM +0200, Peter Zijlstra wrote:
> > > >
> > > > err = __execmem_cache_free(&mas, ptr, GFP_KERNEL | __GFP_NORETRY);
> > > > if (err) {
> > > > mas_store_gfp(&mas, pending_free_set(ptr), GFP_KERNEL);
> > > > execmem_cache.pending_free_cnt++;
> > > > schedule_delayed_work(&execmem_cache_free_work, FREE_DELAY);
> > > > return true;
> > > > }
> > > >
> > > > schedule_work(&execmem_cache_clean_work);
> > > > return true;
> > > > }
> > > >
> > > > And now I have to ask what happens if mas_store_gfp() returns an error?
> > >
> > > AFAIU it won't. mas points to the exact slot we got the area from, and
> > > nothing else can modify the tree because of the mutex, so mas_store_gfp()
> > > essentially updates the value at an existing entry.
> > >
> > > I'll add a comment about it.
> > >
> > > Added @Liam to make sure I'm not saying nonsense :)
> > >
> >
> > Yes, if there is already a node with a value for the same range, no
> > allocation will happen, so it'll just change the pointer for you. This
> > is a slot store operation.
> >
> > But if it's possible to have no entries (an empty tree, or a single
> > value at 0), you will most likely need to allocate a node to store it,
> > which is 256B.
> >
> > I don't think this is a concern in this particular case, though, as you
> > are searching for an entry and then storing, so it needs to exist. So
> > really, the only scenario here is if you store 1 - ULONG_MAX (without
> > having expanded a root node) or 0 - ULONG_MAX, and that seems invalid.
>
> Thanks for the clarification, Liam!
> The tree cannot be empty at that point, and if it has a single value it
> won't be at 0; I'm quite sure no architecture has execmem areas at 0.
Would it make sense to have something like GFP_NO_ALLOC to pass to
functions like this where we know no allocation will actually happen,
and which, if it does reach the allocator, generates a WARN and returns
NULL?
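
Completely untested sketch of what I mean (the flag, the bit and the
wrapper are all made up; nothing like this exists today):

	/* caller asserts this call must never reach the page allocator */
	#define ___GFP_NO_ALLOC		0x8000000u
	#define GFP_NO_ALLOC		((__force gfp_t)___GFP_NO_ALLOC)

	static inline struct page *alloc_pages_checked(gfp_t gfp, unsigned int order)
	{
		/* getting here at all means the caller's assumption broke */
		if (WARN_ON_ONCE(gfp & GFP_NO_ALLOC))
			return NULL;
		return alloc_pages(gfp, order);
	}
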
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/8] execmem: rework execmem_cache_free()
2025-07-08 7:26 ` Peter Zijlstra
@ 2025-07-08 8:13 ` Mike Rapoport
0 siblings, 0 replies; 22+ messages in thread
From: Mike Rapoport @ 2025-07-08 8:13 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Liam R. Howlett, Andrew Morton, Andy Lutomirski, Borislav Petkov,
Daniel Gomez, Dave Hansen, Ingo Molnar, Luis Chamberlain,
Mark Rutland, Masami Hiramatsu, H. Peter Anvin, Petr Pavlu,
Sami Tolvanen, Steven Rostedt, Thomas Gleixner, linux-kernel,
linux-mm, linux-modules, linux-trace-kernel, x86
On Tue, Jul 08, 2025 at 09:26:49AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 07, 2025 at 06:12:26PM +0300, Mike Rapoport wrote:
> > On Mon, Jul 07, 2025 at 11:06:25AM -0400, Liam R. Howlett wrote:
> > > * Mike Rapoport <rppt@kernel.org> [250707 07:32]:
> > > > On Mon, Jul 07, 2025 at 01:11:02PM +0200, Peter Zijlstra wrote:
> > > > >
> > > > > err = __execmem_cache_free(&mas, ptr, GFP_KERNEL | __GFP_NORETRY);
> > > > > if (err) {
> > > > > mas_store_gfp(&mas, pending_free_set(ptr), GFP_KERNEL);
> > > > > execmem_cache.pending_free_cnt++;
> > > > > schedule_delayed_work(&execmem_cache_free_work, FREE_DELAY);
> > > > > return true;
> > > > > }
> > > > >
> > > > > schedule_work(&execmem_cache_clean_work);
> > > > > return true;
> > > > > }
> > > > >
> > > > > And now I have to ask what happens if mas_store_gfp() returns an error?
> > > >
> > > > AFAIU it won't. mas points to the exact slot we got the area from, and
> > > > nothing else can modify the tree because of the mutex, so mas_store_gfp()
> > > > essentially updates the value at an existing entry.
> > > >
> > > > I'll add a comment about it.
> > > >
> > > > Added @Liam to make sure I'm not saying nonsense :)
> > > >
> > >
> > > Yes, if there is already a node with a value for the same range, no
> > > allocation will happen, so it'll just change the pointer for you. This
> > > is a slot store operation.
> > >
> > > But if it's possible to have no entries (an empty tree, or a single
> > > value at 0), you will most likely need to allocate a node to store it,
> > > which is 256B.
> > >
> > > I don't think this is a concern in this particular case, though, as you
> > > are searching for an entry and then storing, so it needs to exist. So
> > > really, the only scenario here is if you store 1 - ULONG_MAX (without
> > > having expanded a root node) or 0 - ULONG_MAX, and that seems invalid.
> >
> > Thanks for the clarification, Liam!
> > The tree cannot be empty at that point, and if it has a single value it
> > won't be at 0; I'm quite sure no architecture has execmem areas at 0.
>
> Would it make sense to have something like GFP_NO_ALLOC to pass to
> functions like this where we know no allocation will actually happen,
> and which, if it does reach the allocator, generates a WARN and returns
> NULL?
We can add a WARN at the caller as well; that won't require a new gfp flag.
The question is how to recover if such a thing happens: I don't really see
what execmem can do here if mas_store_gfp() returns an error :/
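
Something like this, perhaps (sketch only; the bail-out is hypothetical,
since there is no obvious recovery):

	/*
	 * The slot already exists, so this store is not expected to
	 * allocate and thus not expected to fail; make it loud if it
	 * ever does.
	 */
	if (WARN_ON_ONCE(mas_store_gfp(&mas, pending_free_set(ptr),
				       GFP_KERNEL)))
		return false;
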
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 1/8] execmem: drop unused execmem_update_copy()
2025-07-07 13:02 ` Christophe Leroy
@ 2025-07-08 8:22 ` Mike Rapoport
0 siblings, 0 replies; 22+ messages in thread
From: Mike Rapoport @ 2025-07-08 8:22 UTC (permalink / raw)
To: Christophe Leroy
Cc: Andrew Morton, Andy Lutomirski, Borislav Petkov, Daniel Gomez,
Dave Hansen, Ingo Molnar, Luis Chamberlain, Mark Rutland,
Masami Hiramatsu, H. Peter Anvin, Peter Zijlstra, Petr Pavlu,
Sami Tolvanen, Steven Rostedt, Thomas Gleixner, linux-kernel,
linux-mm, linux-modules, linux-trace-kernel, x86
On Mon, Jul 07, 2025 at 03:02:15PM +0200, Christophe Leroy wrote:
>
>
> On 07/07/2025 at 13:49, Mike Rapoport wrote:
> > On Mon, Jul 07, 2025 at 12:10:43PM +0200, Christophe Leroy wrote:
> > >
> > > On 04/07/2025 at 15:49, Mike Rapoport wrote:
> > > > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > > >
> > > > The execmem_update_copy() that used text poking was required when memory
> > > > allocated from ROX cache was always read-only. Since now its permissions
> > > > can be switched to read-write there is no need in a function that updates
> > > > memory with text poking.
> > >
> > > Erm. Looks like I missed the patch that introduced this change.
> > >
> > > On some variant of powerpc, namely book3s/32, this is not feasible.
> >
> > The only user of EXECMEM_ROX_CACHE for now is x86-64; we can always
> > revisit when powerpc book3s/32 wants to opt in to cache usage.
> >
> > And it seems that [MODULES_VADDR, MODULES_END] is already mapped with
> > "large pages", isn't it?
>
> I don't think so. It uses execmem_alloc(), which sets VM_ALLOW_HUGE_VMAP
> only when using EXECMEM_ROX_CACHE. And book3s/32 doesn't have large pages.
>
> Only 8xx has large pages, but they are not PMD-aligned (PMD_SIZE is 4M
> while large pages are 512k and 8M), so it wouldn't work well with the
> existing execmem_vmalloc().
The PMD_SIZE can be replaced with one of the arch_vmap size helpers if
needed, or even parametrized in execmem_info.
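
Something along these lines, as a hypothetical sketch (the huge_size
field is made up and doesn't exist in execmem_range today):

	/* per-range huge mapping size instead of hard-coded PMD_SIZE */
	static unsigned long execmem_huge_size(struct execmem_range *range)
	{
		/* e.g. SZ_512K on 8xx, PMD_SIZE on x86-64 */
		return range->huge_size ? : PMD_SIZE;
	}
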
> > > The granularity for setting the NX (non-exec) bit is 256 MB segments.
> > > So the area dedicated to execmem, [MODULES_VADDR; MODULES_END[, always
> > > has the NX bit unset.
> > > You can change any page within this area from ROX to RWX, but you can't
> > > make it RW without X. If you want RW without X you must map it in the
> > > VMALLOC area, as the VMALLOC area always has the NX bit set.
> > > area, as VMALLOC area have NX bit always set.
> >
> > So what will happen when one calls
> >
> > set_memory_nx()
> > set_memory_rw()
> >
> > in such areas?
>
> Nothing will happen. It will successfully unset the X bit on the PTE, but
> that will be ignored by the HW, which relies only on the segment's NX bit,
> which is set for the entire VMALLOC area and unset for the entire MODULE
> area.
And set_memory_rw() will essentially make the mapping RWX if it's in the
MODULE area?
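
If I read the segment model right, the effective rule is (hypothetical
sketch, the helper name is made up):

	/* book3s/32: the 256M segment's NX bit alone decides executability */
	static bool effectively_executable(unsigned long addr)
	{
		/* the PTE's X bit is ignored by the hardware */
		return !segment_has_nx(addr);
	}
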
> Christophe
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread
Thread overview: 22+ messages
2025-07-04 13:49 [PATCH 0/8] x86: enable EXECMEM_ROX_CACHE for ftrace and kprobes Mike Rapoport
2025-07-04 13:49 ` [PATCH 1/8] execmem: drop unused execmem_update_copy() Mike Rapoport
2025-07-07 10:10 ` Christophe Leroy
2025-07-07 11:49 ` Mike Rapoport
2025-07-07 13:02 ` Christophe Leroy
2025-07-08 8:22 ` Mike Rapoport
2025-07-04 13:49 ` [PATCH 2/8] execmem: introduce execmem_alloc_rw() Mike Rapoport
2025-07-04 13:49 ` [PATCH 3/8] execmem: rework execmem_cache_free() Mike Rapoport
2025-07-07 11:11 ` Peter Zijlstra
2025-07-07 11:32 ` Mike Rapoport
2025-07-07 15:06 ` Liam R. Howlett
2025-07-07 15:12 ` Mike Rapoport
2025-07-08 7:26 ` Peter Zijlstra
2025-07-08 8:13 ` Mike Rapoport
2025-07-07 15:32 ` Yann Ylavic
2025-07-07 15:43 ` Yann Ylavic
2025-07-08 7:10 ` Mike Rapoport
2025-07-04 13:49 ` [PATCH 4/8] execmem: move execmem_force_rw() and execmem_restore_rox() before use Mike Rapoport
2025-07-04 13:49 ` [PATCH 5/8] execmem: add fallback for failures in vmalloc(VM_ALLOW_HUGE_VMAP) Mike Rapoport
2025-07-04 13:49 ` [PATCH 6/8] execmem: drop writable parameter from execmem_fill_trapping_insns() Mike Rapoport
2025-07-04 13:49 ` [PATCH 7/8] x86/kprobes: enable EXECMEM_ROX_CACHE for kprobes allocations Mike Rapoport
2025-07-04 13:49 ` [PATCH 8/8] x86/ftrace: enable EXECMEM_ROX_CACHE for ftrace allocations Mike Rapoport