Linux Kernel Selftest development
* [PATCH v7 2/5] bug/kunit: Reduce runtime impact of warning backtrace suppression
From: Albert Esteve @ 2026-04-20 12:28 UTC
  To: Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Jonathan Corbet, Shuah Khan, Andrew Morton
  Cc: linux-kernel, linux-arch, linux-kselftest, kunit-dev, dri-devel,
	workflows, linux-doc, peterz, Alessandro Carminati, Albert Esteve
In-Reply-To: <20260420-kunit_add_support-v7-0-e8bc6e0f70de@redhat.com>

From: Alessandro Carminati <acarmina@redhat.com>

KUnit support is not consistently present across distributions: some
include it in their stock kernels, while others do not.
While both KUNIT and KUNIT_SUPPRESS_BACKTRACE can be considered debug
features, the fact that some distros ship with KUnit enabled means it's
important to minimize the runtime impact of this patch.

To that end, this patch adds an atomic counter that tracks the number
of active suppressions. __kunit_is_suppressed_warning() checks this
counter first and returns immediately when no suppressions are active,
avoiding RCU-protected list traversal in the common case.
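
As a rough illustration of the fast path described above, the following
user-space sketch models the idea with C11 atomics. It is not the kernel
code: struct suppression, start_suppression(), end_suppression(),
is_suppressed() and the list_walks instrumentation counter are
hypothetical names, the list is a plain singly linked list with no RCU or
locking, and a task is reduced to an integer id.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; these are not the kernel's types. */
struct suppression {
	struct suppression *next;
	int task_id;
};

static struct suppression *active_list;	/* walked only on the slow path */
static atomic_int active_cnt;		/* number of active suppressions */
static int list_walks;			/* demo instrumentation only */

static void start_suppression(struct suppression *s, int task_id)
{
	s->task_id = task_id;
	/* Count first, then publish: the same order the patch uses. */
	atomic_fetch_add(&active_cnt, 1);
	s->next = active_list;
	active_list = s;
}

static void end_suppression(struct suppression *s)
{
	struct suppression **pp;

	/* Unlink first ... */
	for (pp = &active_list; *pp; pp = &(*pp)->next) {
		if (*pp == s) {
			*pp = s->next;
			break;
		}
	}
	/* ... then drop the count, again matching the patch's order. */
	atomic_fetch_sub(&active_cnt, 1);
}

static bool is_suppressed(int task_id)
{
	struct suppression *s;

	/* Fast path: no suppression is active, so skip the list walk. */
	if (atomic_load(&active_cnt) == 0)
		return false;

	list_walks++;
	for (s = active_list; s; s = s->next)
		if (s->task_id == task_id)
			return true;
	return false;
}
```

In this model a query made while no suppression is active costs one
atomic load and leaves list_walks untouched; the list is walked only
between a start and the matching end.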

Signed-off-by: Alessandro Carminati <acarmina@redhat.com>
Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
 lib/kunit/bug.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/lib/kunit/bug.c b/lib/kunit/bug.c
index 356c8a5928828..a7a88f0670d44 100644
--- a/lib/kunit/bug.c
+++ b/lib/kunit/bug.c
@@ -8,6 +8,7 @@
 
 #include <kunit/bug.h>
 #include <kunit/resource.h>
+#include <linux/atomic.h>
 #include <linux/export.h>
 #include <linux/rculist.h>
 #include <linux/sched.h>
@@ -15,11 +16,13 @@
 #ifdef CONFIG_KUNIT_SUPPRESS_BACKTRACE
 
 static LIST_HEAD(suppressed_warnings);
+static atomic_t suppressed_warnings_cnt = ATOMIC_INIT(0);
 
 static void __kunit_suppress_warning_remove(struct __suppressed_warning *warning)
 {
 	list_del_rcu(&warning->node);
 	synchronize_rcu(); /* Wait for readers to finish */
+	atomic_dec(&suppressed_warnings_cnt);
 }
 
 KUNIT_DEFINE_ACTION_WRAPPER(__kunit_suppress_warning_cleanup,
@@ -37,6 +40,7 @@ __kunit_start_suppress_warning(struct kunit *test)
 		return NULL;
 
 	warning->task = current;
+	atomic_inc(&suppressed_warnings_cnt);
 	list_add_rcu(&warning->node, &suppressed_warnings);
 
 	ret = kunit_add_action_or_reset(test,
@@ -68,6 +72,9 @@ bool __kunit_is_suppressed_warning(void)
 {
 	struct __suppressed_warning *warning;
 
+	if (!atomic_read(&suppressed_warnings_cnt))
+		return false;
+
 	rcu_read_lock();
 	list_for_each_entry_rcu(warning, &suppressed_warnings, node) {
 		if (warning->task == current) {

-- 
2.52.0



* [PATCH v7 1/5] bug/kunit: Core support for suppressing warning backtraces
From: Albert Esteve @ 2026-04-20 12:28 UTC
  To: Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Jonathan Corbet, Shuah Khan, Andrew Morton
  Cc: linux-kernel, linux-arch, linux-kselftest, kunit-dev, dri-devel,
	workflows, linux-doc, peterz, Alessandro Carminati, Guenter Roeck,
	Kees Cook, Albert Esteve
In-Reply-To: <20260420-kunit_add_support-v7-0-e8bc6e0f70de@redhat.com>

From: Alessandro Carminati <acarmina@redhat.com>

Some unit tests intentionally trigger warning backtraces by passing bad
parameters to kernel API functions. Such unit tests typically check the
return value from such calls, not the existence of the warning backtrace.

Such intentionally generated warning backtraces are neither desirable
nor useful for a number of reasons:
- They can result in overlooked real problems.
- A warning that suddenly starts to show up in unit tests needs to be
  investigated and has to be marked to be ignored, for example by
  adjusting filter scripts. Such filters are ad hoc because there is
  no real standard format for warnings. On top of that, such filter
  scripts would require constant maintenance.

Solve the problem by providing a means to identify and suppress specific
warning backtraces while executing test code. Support suppressing multiple
backtraces while at the same time limiting changes to generic code to the
absolute minimum.

Implementation details:
Suppression is checked at two points in the warning path:
- In warn_slowpath_fmt(), the check runs before any output, fully
  suppressing both message and backtrace.
- In __report_bug(), the check runs before __warn() is called,
  suppressing the backtrace and stack dump. Note that on this path,
  the WARN() format message may still appear in the kernel log since
  __warn_printk() runs before the trap that enters __report_bug().

A helper function, `__kunit_is_suppressed_warning()`, walks an
RCU-protected list of active suppressions, matching by current task.
The suppression state is tied to the KUnit test lifecycle via
kunit_add_action(), ensuring automatic cleanup at test exit.

The list of suppressed warnings is protected with RCU to allow
concurrent read access without locks.

The implementation is deliberately simple and avoids architecture-specific
optimizations to preserve portability.
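
The task-scoped matching described above can be modelled in user space
roughly as follows. This is a sketch under stated assumptions, not the
kernel implementation: current_task is a plain integer standing in for
"current", model_warn() stands in for a WARN*() call site, reported
counts the warnings that would have reached the log, and the list has no
RCU protection. All of these names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for "current"; the kernel uses the running task. */
static int current_task;

struct suppressed_warning {
	struct suppressed_warning *next;
	int task;	/* task that opened the suppression window */
	int counter;	/* warnings swallowed while the window was open */
};

static struct suppressed_warning *suppressed_warnings;
static int reported;	/* warnings that reached the simulated log */

static bool is_suppressed_warning(void)
{
	struct suppressed_warning *w;

	for (w = suppressed_warnings; w; w = w->next) {
		if (w->task == current_task) {
			w->counter++;	/* remember that a warning fired */
			return true;
		}
	}
	return false;
}

/* Simulated warning site: check suppression before any output. */
static void model_warn(void)
{
	if (is_suppressed_warning())
		return;
	reported++;
}

static void start_suppress(struct suppressed_warning *w)
{
	w->task = current_task;
	w->counter = 0;
	w->next = suppressed_warnings;
	suppressed_warnings = w;
}

static void end_suppress(struct suppressed_warning *w)
{
	struct suppressed_warning **pp;

	for (pp = &suppressed_warnings; *pp; pp = &(*pp)->next) {
		if (*pp == w) {
			*pp = w->next;
			return;
		}
	}
}
```

A warning raised by the task that opened the window is swallowed and
counted, while a warning raised by any other task is still reported. The
per-entry counter is what lets a test assert that the expected warning
actually fired.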

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Alessandro Carminati <acarmina@redhat.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Albert Esteve <aesteve@redhat.com>
---
 include/kunit/bug.h  | 56 +++++++++++++++++++++++++++++++++++
 include/kunit/test.h |  1 +
 kernel/panic.c       |  8 ++++-
 lib/bug.c            |  8 +++++
 lib/kunit/Kconfig    |  9 ++++++
 lib/kunit/Makefile   |  6 ++--
 lib/kunit/bug.c      | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 169 insertions(+), 3 deletions(-)

diff --git a/include/kunit/bug.h b/include/kunit/bug.h
new file mode 100644
index 0000000000000..e52c9d21d9fe6
--- /dev/null
+++ b/include/kunit/bug.h
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * KUnit helpers for backtrace suppression
+ *
+ * Copyright (C) 2025 Alessandro Carminati <acarmina@redhat.com>
+ * Copyright (C) 2024 Guenter Roeck <linux@roeck-us.net>
+ */
+
+#ifndef _KUNIT_BUG_H
+#define _KUNIT_BUG_H
+
+#ifndef __ASSEMBLY__
+
+#include <linux/kconfig.h>
+
+struct kunit;
+
+#ifdef CONFIG_KUNIT_SUPPRESS_BACKTRACE
+
+#include <linux/types.h>
+
+struct task_struct;
+
+struct __suppressed_warning {
+	struct list_head node;
+	struct task_struct *task;
+	int counter;
+};
+
+struct __suppressed_warning *
+__kunit_start_suppress_warning(struct kunit *test);
+void __kunit_end_suppress_warning(struct kunit *test,
+				  struct __suppressed_warning *warning);
+int __kunit_suppressed_warning_count(struct __suppressed_warning *warning);
+bool __kunit_is_suppressed_warning(void);
+
+#define KUNIT_START_SUPPRESSED_WARNING(test) \
+	struct __suppressed_warning *__kunit_suppress =	\
+		__kunit_start_suppress_warning(test)
+
+#define KUNIT_END_SUPPRESSED_WARNING(test) \
+	__kunit_end_suppress_warning(test, __kunit_suppress)
+
+#define KUNIT_SUPPRESSED_WARNING_COUNT() \
+	__kunit_suppressed_warning_count(__kunit_suppress)
+
+#else /* CONFIG_KUNIT_SUPPRESS_BACKTRACE */
+
+#define KUNIT_START_SUPPRESSED_WARNING(test)
+#define KUNIT_END_SUPPRESSED_WARNING(test)
+#define KUNIT_SUPPRESSED_WARNING_COUNT() 0
+static inline bool __kunit_is_suppressed_warning(void) { return false; }
+
+#endif /* CONFIG_KUNIT_SUPPRESS_BACKTRACE */
+#endif /* __ASSEMBLY__ */
+#endif /* _KUNIT_BUG_H */
diff --git a/include/kunit/test.h b/include/kunit/test.h
index 9cd1594ab697d..4ec07b3fa0204 100644
--- a/include/kunit/test.h
+++ b/include/kunit/test.h
@@ -10,6 +10,7 @@
 #define _KUNIT_TEST_H
 
 #include <kunit/assert.h>
+#include <kunit/bug.h>
 #include <kunit/try-catch.h>
 
 #include <linux/args.h>
diff --git a/kernel/panic.c b/kernel/panic.c
index c78600212b6c1..d7a7a679f56c4 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -39,6 +39,7 @@
 #include <linux/sys_info.h>
 #include <trace/events/error_report.h>
 #include <asm/sections.h>
+#include <kunit/bug.h>
 
 #define PANIC_TIMER_STEP 100
 #define PANIC_BLINK_SPD 18
@@ -1080,9 +1081,14 @@ void __warn(const char *file, int line, void *caller, unsigned taint,
 void warn_slowpath_fmt(const char *file, int line, unsigned taint,
 		       const char *fmt, ...)
 {
-	bool rcu = warn_rcu_enter();
+	bool rcu;
 	struct warn_args args;
 
+	if (__kunit_is_suppressed_warning())
+		return;
+
+	rcu = warn_rcu_enter();
+
 	pr_warn(CUT_HERE);
 
 	if (!fmt) {
diff --git a/lib/bug.c b/lib/bug.c
index 623c467a8b76c..606205c8c302f 100644
--- a/lib/bug.c
+++ b/lib/bug.c
@@ -48,6 +48,7 @@
 #include <linux/rculist.h>
 #include <linux/ftrace.h>
 #include <linux/context_tracking.h>
+#include <kunit/bug.h>
 
 extern struct bug_entry __start___bug_table[], __stop___bug_table[];
 
@@ -223,6 +224,13 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
 	no_cut   = bug->flags & BUGFLAG_NO_CUT_HERE;
 	has_args = bug->flags & BUGFLAG_ARGS;
 
+	/*
+	 * Before the once logic so suppressed warnings do not consume
+	 * the single-fire budget of WARN_ON_ONCE().
+	 */
+	if (warning && __kunit_is_suppressed_warning())
+		return BUG_TRAP_TYPE_WARN;
+
 	if (warning && once) {
 		if (done)
 			return BUG_TRAP_TYPE_WARN;
diff --git a/lib/kunit/Kconfig b/lib/kunit/Kconfig
index 498cc51e493dc..57527418fcf09 100644
--- a/lib/kunit/Kconfig
+++ b/lib/kunit/Kconfig
@@ -15,6 +15,15 @@ menuconfig KUNIT
 
 if KUNIT
 
+config KUNIT_SUPPRESS_BACKTRACE
+	bool "KUnit - Enable backtrace suppression"
+	default y
+	help
+	  Enable backtrace suppression for KUnit. If enabled, backtraces
+	  generated intentionally by KUnit tests are suppressed. Disable
+	  to reduce kernel image size if image size is more important than
+	  suppression of backtraces generated by KUnit tests.
+
 config KUNIT_DEBUGFS
 	bool "KUnit - Enable /sys/kernel/debug/kunit debugfs representation" if !KUNIT_ALL_TESTS
 	default KUNIT_ALL_TESTS
diff --git a/lib/kunit/Makefile b/lib/kunit/Makefile
index 656f1fa35abcc..fe177ff3ebdef 100644
--- a/lib/kunit/Makefile
+++ b/lib/kunit/Makefile
@@ -16,8 +16,10 @@ ifeq ($(CONFIG_KUNIT_DEBUGFS),y)
 kunit-objs +=				debugfs.o
 endif
 
-# KUnit 'hooks' are built-in even when KUnit is built as a module.
-obj-$(if $(CONFIG_KUNIT),y) +=		hooks.o
+# KUnit 'hooks' and bug handling are built-in even when KUnit is built
+# as a module.
+obj-$(if $(CONFIG_KUNIT),y) +=		hooks.o \
+					bug.o
 
 obj-$(CONFIG_KUNIT_TEST) +=		kunit-test.o
 obj-$(CONFIG_KUNIT_TEST) +=		platform-test.o
diff --git a/lib/kunit/bug.c b/lib/kunit/bug.c
new file mode 100644
index 0000000000000..356c8a5928828
--- /dev/null
+++ b/lib/kunit/bug.c
@@ -0,0 +1,84 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KUnit helpers for backtrace suppression
+ *
+ * Copyright (C) 2025 Alessandro Carminati <acarmina@redhat.com>
+ * Copyright (C) 2024 Guenter Roeck <linux@roeck-us.net>
+ */
+
+#include <kunit/bug.h>
+#include <kunit/resource.h>
+#include <linux/export.h>
+#include <linux/rculist.h>
+#include <linux/sched.h>
+
+#ifdef CONFIG_KUNIT_SUPPRESS_BACKTRACE
+
+static LIST_HEAD(suppressed_warnings);
+
+static void __kunit_suppress_warning_remove(struct __suppressed_warning *warning)
+{
+	list_del_rcu(&warning->node);
+	synchronize_rcu(); /* Wait for readers to finish */
+}
+
+KUNIT_DEFINE_ACTION_WRAPPER(__kunit_suppress_warning_cleanup,
+			    __kunit_suppress_warning_remove,
+			    struct __suppressed_warning *);
+
+struct __suppressed_warning *
+__kunit_start_suppress_warning(struct kunit *test)
+{
+	struct __suppressed_warning *warning;
+	int ret;
+
+	warning = kunit_kzalloc(test, sizeof(*warning), GFP_KERNEL);
+	if (!warning)
+		return NULL;
+
+	warning->task = current;
+	list_add_rcu(&warning->node, &suppressed_warnings);
+
+	ret = kunit_add_action_or_reset(test,
+					__kunit_suppress_warning_cleanup,
+					warning);
+	if (ret)
+		return NULL;
+
+	return warning;
+}
+EXPORT_SYMBOL_GPL(__kunit_start_suppress_warning);
+
+void __kunit_end_suppress_warning(struct kunit *test,
+				  struct __suppressed_warning *warning)
+{
+	if (!warning)
+		return;
+	kunit_release_action(test, __kunit_suppress_warning_cleanup, warning);
+}
+EXPORT_SYMBOL_GPL(__kunit_end_suppress_warning);
+
+int __kunit_suppressed_warning_count(struct __suppressed_warning *warning)
+{
+	return warning ? warning->counter : 0;
+}
+EXPORT_SYMBOL_GPL(__kunit_suppressed_warning_count);
+
+bool __kunit_is_suppressed_warning(void)
+{
+	struct __suppressed_warning *warning;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(warning, &suppressed_warnings, node) {
+		if (warning->task == current) {
+			warning->counter++;
+			rcu_read_unlock();
+			return true;
+		}
+	}
+	rcu_read_unlock();
+
+	return false;
+}
+
+#endif /* CONFIG_KUNIT_SUPPRESS_BACKTRACE */

-- 
2.52.0



* [PATCH v7 0/5] kunit: Add support for suppressing warning backtraces
From: Albert Esteve @ 2026-04-20 12:28 UTC
  To: Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Jonathan Corbet, Shuah Khan, Andrew Morton
  Cc: linux-kernel, linux-arch, linux-kselftest, kunit-dev, dri-devel,
	workflows, linux-doc, peterz, Alessandro Carminati, Guenter Roeck,
	Kees Cook, Albert Esteve, Linux Kernel Functional Testing,
	Dan Carpenter, Maíra Canal, Kees Cook, Simona Vetter,
	David Gow

Some unit tests intentionally trigger warning backtraces by passing bad
parameters to kernel API functions. Such unit tests typically check the
return value from such calls, not the existence of the warning backtrace.

Such intentionally generated warning backtraces are neither desirable
nor useful for a number of reasons:
- They can result in overlooked real problems.
- A warning that suddenly starts to show up in unit tests needs to be
  investigated and has to be marked to be ignored, for example by
  adjusting filter scripts. Such filters are ad hoc because there is
  no real standard format for warnings. On top of that, such filter
  scripts would require constant maintenance.

One option to address the problem would be to add messages such as
"expected warning backtraces start/end here" to the kernel log.
However, that would again require filter scripts, might result in
missing real problematic warning backtraces triggered while the test
is running, and the irrelevant backtrace(s) would still clog the
kernel log.

Solve the problem by providing a means to identify and suppress specific
warning backtraces while executing test code. Support suppressing multiple
backtraces while at the same time limiting changes to generic code to the
absolute minimum.

Overview:
Patch#1 Introduces the suppression infrastructure.
Patch#2 Reduces the runtime impact of suppression when none is active.
Patch#3 Adds selftests to validate the functionality.
Patch#4 Demonstrates real-world usage in the DRM subsystem.
Patch#5 Documents the new API and usage guidelines.

Design Notes:
The objective is to suppress unwanted WARN*() generated messages.

Although most major architectures share common bug handling via `lib/bug.c`
and `report_bug()`, some minor or legacy architectures still rely on their
own platform-specific handling. This divergence must be considered in any
such feature. Additionally, a key challenge in implementing this feature is
the fragmentation of `WARN*()` message emission: part of the output is
produced in the macro itself (via __warn_printk()), and part in the exception
handler.

Lessons from the Previous Attempt:
In earlier iterations, suppression logic was added inside the
`__report_bug()` function to intercept WARN*() output. To implement the
check in the bug handler code, two strategies were considered:

* Strategy #1: Use `kallsyms` to infer the originating function. This
  approach proved unreliable due to compiler-induced transformations
  such as inlining, cloning, and code fragmentation.

* Strategy #2: Store function name `__func__` in `struct bug_entry` in
  the `__bug_table`. However, `__func__` is a compiler-generated symbol,
  which complicates relocation and linking in position-independent code.
  Additionally, architectures not using the unified `BUG()` path would
  still require ad-hoc handling.

A per-macro solution was also attempted (v5-v6), injecting checks
directly into the `WARN*()` macros in `include/asm-generic/bug.h`.
While this offered full control, it required modifying the generic
bug header and was considered too invasive and harmful to the critical
path, and was therefore judged the wrong approach [1].

Current Proposal: Check in `warn_slowpath_fmt()` and `__report_bug()`.
Suppression is checked at two points in the warning path:
- In `warn_slowpath_fmt()` (kernel/panic.c), for architectures without
  __WARN_FLAGS. The check runs before any output, fully suppressing
  both message and backtrace.
- In `__report_bug()` (lib/bug.c), for architectures that define
  __WARN_FLAGS. The check runs before `__warn()` is called, suppressing
  the backtrace and stack dump. On this path, the `WARN()` format message
  may still appear in the kernel log since `__warn_printk()` executes
  before the trap.
This approach avoids modifying include/asm-generic/bug.h entirely,
requires no architecture-specific code, and limits changes to generic
code to the absolute minimum.

A helper function, `__kunit_is_suppressed_warning()`, walks an RCU-
protected list of active suppressions, matching by current task. The
suppression state is dynamically allocated via kunit_kzalloc() and
tied to the KUnit test lifecycle via kunit_add_action(), ensuring
automatic cleanup at test exit.

To minimize runtime impact when no suppressions are active, an atomic
counter tracks the number of active suppressions.
`__kunit_is_suppressed_warning()` checks this counter first and returns
immediately when it is zero, avoiding the RCU-protected list traversal
in the common case.

This series is based on the RFC patch and subsequent discussion at
https://patchwork.kernel.org/project/linux-kselftest/patch/02546e59-1afe-4b08-ba81-d94f3b691c9a@moroto.mountain/
and offers a more comprehensive solution to the problem discussed there.

[1] https://lore.kernel.org/all/CAGegRW76X8Fk_5qqOBw_aqBwAkQTsc8kXKHEuu9ECeXzdJwMSw@mail.gmail.com/

Changes since RFC:
- Introduced CONFIG_KUNIT_SUPPRESS_BACKTRACE
- Minor cleanups and bug fixes
- Added support for all affected architectures
- Added support for counting suppressed warnings
- Added unit tests using those counters
- Added patch to suppress warning backtraces in dev_addr_lists tests

Changes since v1:
- Rebased to v6.9-rc1
- Added Tested-by:, Acked-by:, and Reviewed-by: tags
  [I retained those tags since there have been no functional changes]
- Introduced KUNIT_SUPPRESS_BACKTRACE configuration option, enabled by
  default.

Changes since v2:
- Rebased to v6.9-rc2
- Added comments to drm warning suppression explaining why it is needed.
- Added patch to move conditional code in arch/sh/include/asm/bug.h
  to avoid kerneldoc warning
- Added architecture maintainers to Cc: for architecture specific patches
- No functional changes

Changes since v3:
- Rebased to v6.14-rc6
- Dropped net: "kunit: Suppress lock warning noise at end of dev_addr_lists tests"
  since 3db3b62955cd6d73afde05a17d7e8e106695c3b9
- Added __kunit_ and KUNIT_ prefixes.
- Tested on interested architectures.

Changes since v4:
- Rebased to v6.15-rc7
- Dropped all code in __report_bug()
- Moved all checks into WARN*() macros.
- Dropped all architecture specific code.
- Made __kunit_is_suppressed_warning nice to noinstr functions.

Changes since v5:
- Rebased to v7.0-rc3
- Added RCU protection for the suppressed warnings list.
- Added static key and branching optimization.
- Removed custom `strcmp` implementation and reworked
  __kunit_is_suppressed_warning() entrypoint function.

Changes since v6:
- Moved suppression checks from WARN*() macros to warn_slowpath_fmt()
  and __report_bug().
- Replaced stack-allocated suppression struct with kunit_kzalloc() heap
  allocation tied to the KUnit test lifecycle.
- Changed suppression strategy from function-name matching to task-scoped:
  all warnings on the current task are suppressed between START and END,
  rather than only warnings originating from a specific named function.
- Simplified macro API: removed KUNIT_DECLARE_SUPPRESSED_WARNING(),
  the START macro now takes (test) and handles allocation internally.
- Removed static key and branching optimization because, by the time
  the check runs, callers are already in the warn slow paths.
- Link to v6: https://lore.kernel.org/r/20260317-kunit_add_support-v6-0-dd22aeb3fe5d@redhat.com

Alessandro Carminati (2):
  bug/kunit: Core support for suppressing warning backtraces
  bug/kunit: Suppressing warning backtraces reduced impact on WARN*()
    sites

Guenter Roeck (3):
  Add unit tests to verify that warning backtrace suppression works.
  drm: Suppress intentional warning backtraces in scaling unit tests
  kunit: Add documentation for warning backtrace suppression API

 Documentation/dev-tools/kunit/usage.rst |  30 ++++++-
 drivers/gpu/drm/tests/drm_rect_test.c   |  16 ++++
 include/asm-generic/bug.h               |  48 +++++++----
 include/kunit/bug.h                     |  62 ++++++++++++++
 include/kunit/test.h                    |   1 +
 lib/kunit/Kconfig                       |   9 ++
 lib/kunit/Makefile                      |   9 +-
 lib/kunit/backtrace-suppression-test.c  | 105 ++++++++++++++++++++++++
 lib/kunit/bug.c                         |  54 ++++++++++++
 9 files changed, 316 insertions(+), 18 deletions(-)
 create mode 100644 include/kunit/bug.h
 create mode 100644 lib/kunit/backtrace-suppression-test.c
 create mode 100644 lib/kunit/bug.c

--
2.34.1

---
Alessandro Carminati (2):
      bug/kunit: Core support for suppressing warning backtraces
      bug/kunit: Reduce runtime impact of warning backtrace suppression

Guenter Roeck (3):
      kunit: Add backtrace suppression self-tests
      drm: Suppress intentional warning backtraces in scaling unit tests
      kunit: Add documentation for warning backtrace suppression API

 Documentation/dev-tools/kunit/usage.rst | 30 ++++++++++-
 drivers/gpu/drm/tests/drm_rect_test.c   | 14 +++++
 include/kunit/bug.h                     | 56 ++++++++++++++++++++
 include/kunit/test.h                    |  1 +
 kernel/panic.c                          |  8 ++-
 lib/bug.c                               |  8 +++
 lib/kunit/Kconfig                       |  9 ++++
 lib/kunit/Makefile                      |  9 +++-
 lib/kunit/backtrace-suppression-test.c  | 90 ++++++++++++++++++++++++++++++++
 lib/kunit/bug.c                         | 91 +++++++++++++++++++++++++++++++++
 10 files changed, 312 insertions(+), 4 deletions(-)
---
base-commit: 80234b5ab240f52fa45d201e899e207b9265ef91
change-id: 20260312-kunit_add_support-2f35806b19dd

Best regards,
-- 
Albert Esteve <aesteve@redhat.com>



* Re: [PATCH v7 3/4] KVM: arm64: PMU: Introduce FIXED_COUNTERS_ONLY
From: Akihiko Odaki @ 2026-04-20 12:07 UTC
  To: Marc Zyngier
  Cc: Oliver Upton, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
	Catalin Marinas, Will Deacon, Kees Cook, Gustavo A. R. Silva,
	Paolo Bonzini, Jonathan Corbet, Shuah Khan, linux-arm-kernel,
	kvmarm, linux-kernel, linux-hardening, devel, kvm, linux-doc,
	linux-kselftest
In-Reply-To: <86qzoa0xj6.wl-maz@kernel.org>

On 2026/04/20 18:51, Marc Zyngier wrote:
> On Mon, 20 Apr 2026 09:36:16 +0100,
> Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> wrote:
>>
>> On 2026/04/20 2:19, Marc Zyngier wrote:
>>> On Sat, 18 Apr 2026 09:14:25 +0100,
>>> Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> wrote:
>>>>
>>>> On a heterogeneous arm64 system, KVM's PMU emulation is based on the
>>>> features of a single host PMU instance. When a vCPU is migrated to a
>>>> pCPU with an incompatible PMU, counters such as PMCCNTR_EL0 stop
>>>> incrementing.
>>>>
>>>> Although this behavior is permitted by the architecture, Windows does
>>>> not handle it gracefully and may crash with a division-by-zero error.
>>>>
>>>> The current workaround requires VMMs to pin vCPUs to a set of pCPUs
>>>> that share a compatible PMU. This is difficult to implement correctly in
>>>> QEMU/libvirt, where pinning occurs after vCPU initialization, and it
>>>> also restricts the guest to a subset of available pCPUs.
>>>>
>>>> Introduce the KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY attribute to
>>>> create a "fixed-counters-only" PMU. When set, KVM exposes a PMU that is
>>>> compatible with all pCPUs but that does not support programmable
>>>> event counters which may have different feature sets on different PMUs.
>>>>
>>>> This allows Windows guests to run reliably on heterogeneous systems
>>>> without crashing, even without vCPU pinning, and enables VMMs to
>>>> schedule vCPUs across all available pCPUs, making full use of the host
>>>> hardware.
>>>>
>>>> Much like KVM_ARM_VCPU_PMU_V3_IRQ and other read-write attributes, this
>>>> attribute provides a getter that facilitates kernel and userspace
>>>> debugging/testing.
>>>
>>> OK, so that's the sales pitch. But how is it implemented? I would like
>>> to be able to read a high-level description of the implementation
>>> trade-offs.
>>
>> Implementation-wise it is very trivial. Essentially the following
>> addition in kvm_arm_pmu_v3_get_attr() is the entire implementation:
>> +	case KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY:
>> +		if (test_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY,
>> +			     &vcpu->kvm->arch.flags))
>> +			return 0;
>>
>> Both its functionality and code complexity are trivial. So we can argue that:
>> - the functionality is too trivial to be useful or
>> - the interface/implementation complexity is so trivial that it does not
>>    incur maintenance burden
>>
>> In this case the selftest uses the getter so I was more inclined to
>> have it, but adding one just for the selftest sounds too ad-hoc, so
>> here I looked into other attributes to ensure that it was not
>> introducing inconsistency with existing interfaces.
>>
>> As a result, I found there are other read-write attributes; in fact
>> there are more read-write attributes than write-only ones.
> 
> You're completely missing the point. I'm referring to the whole of the
> commit message, which is more of a marketing slide than a technical
> description.

In terms of implementation, the obvious tradeoff is that it adds more 
code to implement the feature. One thing to note is that 
kvm_vcpu_load_pmu() is added and is called each time a vCPU migrates 
across pCPUs. The heavy part, making the KVM_REQ_RELOAD_PMU request, 
only happens when the feature is enabled.

> 
> I really don't care about the getter at this stage, which while
> pointless, does not make things more awful than they already are.
> 
>>
>>>
>>>>
>>>> Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
>>>> ---
>>>>    Documentation/virt/kvm/devices/vcpu.rst |  29 ++++++
>>>>    arch/arm64/include/asm/kvm_host.h       |   2 +
>>>>    arch/arm64/include/uapi/asm/kvm.h       |   1 +
>>>>    arch/arm64/kvm/arm.c                    |   1 +
>>>>    arch/arm64/kvm/pmu-emul.c               | 155 +++++++++++++++++++++++---------
>>>>    include/kvm/arm_pmu.h                   |   2 +
>>>>    6 files changed, 147 insertions(+), 43 deletions(-)
>>>>
>>>> diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
>>>> index 60bf205cb373..e0aeb1897d77 100644
>>>> --- a/Documentation/virt/kvm/devices/vcpu.rst
>>>> +++ b/Documentation/virt/kvm/devices/vcpu.rst
>>>> @@ -161,6 +161,35 @@ explicitly selected, or the number of counters is out of range for the
>>>>    selected PMU. Selecting a new PMU cancels the effect of setting this
>>>>    attribute.
>>>>    +1.6 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY
>>>> +------------------------------------------------------
>>>> +
>>>> +:Parameters: no additional parameter in kvm_device_attr.addr
>>>> +
>>>> +:Returns:
>>>> +
>>>> +	 =======  =====================================================
>>>> +	 -EBUSY   Attempted to set after initializing PMUv3 or running
>>>> +		  VCPU, or attempted to set for the first time after
>>>> +		  setting an event filter
>>>> +	 -ENXIO   Attempted to get before setting
>>>> +	 -ENODEV  Attempted to set while PMUv3 not supported
>>>> +	 =======  =====================================================
>>>> +
>>>> +If set, PMUv3 will be emulated without programmable event counters. The VCPU
>>>> +will use any compatible hardware PMU. This attribute is particularly useful on
>>>
>>> Not quite "any PMU". It will use *the* PMU of the physical CPU,
>>> irrespective of the implementation.
>>
>> I think:
>>
>> - this comment
>> - one on the KVM_EXIT_FAIL_ENTRY_CPU_UNSUPPORTED note
>> - one on kvm_pmu_create_perf_event()
>> - and one on kvm_arm_pmu_v3_set_pmu_fixed_counters_only()
>>
>> all boil down to one question: will it support all possible CPUs, or
>> will it support a subset? Let me answer here:
>>
>> This patch is written to support a subset instead of all possible
>> CPUs. If a pCPU does not have a compatible PMU, the pCPU will not be
>> supported and cause KVM_EXIT_FAIL_ENTRY_CPU_UNSUPPORTED.
> 
> This is not a thing. Either *all* the CPUs have a PMU that can be used
> for KVM, or PMU support is not offered to guests. That's a hard line
> in the sand. And the code already upholds this by checking the
> sanitised PMUVer field.
> 
>>
>> This patch does not enforce all possible CPUs are covered by the
>> compatible PMUs. Theoretically speaking,
>> kvm_arm_pmu_get_pmuver_limit() enables the PMU emulation when real
>> PMUv3 hardware covers all possible CPUs *or* the relevant registers
>> can be trapped with IMPDEF, so some pCPU may not have a compatible PMU
>> and only provide the IMPDEF trapping.
> 
> How is that possible? Please describe the case where that can happen,
> and I will make sure that such a system stops booting. The intent is
> definitely that:
> 
> - for early CPUs, we take the minimal capability of all CPUs
> 
> - for late CPUs, either they match at least the capability recorded by
>    early CPUs, or they don't boot.

All CPUs may trap the relevant registers with IMPDEF but some of them 
may not have compatible PMUs. As I wrote in the previous email, I don't 
think it will happen in practice.

> 
>> Practically, I don't think any sane configuration will ever have such
>> a subset support, so we can explicitly enforce all possible CPUs are
>> covered by the compatible PMUs if desired.
> 
> That's not just desired. This is a requirement. And it is already
> enforced AFAICS.
> 
>>
>>>
>>>> +heterogeneous systems where different hardware PMUs cover different physical
>>>> +CPUs. The compatibility of hardware PMUs can be checked with
>>>> +KVM_ARM_VCPU_PMU_V3_SET_PMU. All VCPUs in a VM share this attribute. It isn't
>>>> +possible to set it for the first time if a PMU event filter is already present.
>>>
>>> "for the first time" gives the impression that it will work if you try
>>> again. I'd rather we say that "This feature is incompatible with the
>>> existence of a PMU event filter".
>>
>> The following sequence will work:
>> 1. Set KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY
>> 2. Set KVM_ARM_VCPU_PMU_V3_FILTER
>> 3. Set KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY
>>
>> This is to make the behavior consistent with KVM_ARM_VCPU_PMU_V3_SET_PMU.
> 
> I don't think this is correct. Filtering is completely at odds with
> this patch, and I don't want to have to reason about the combination.

kvm_arm_pmu_v3_set_pmu() has the following condition:

if (kvm_vm_has_ran_once(kvm) ||
     (kvm->arch.pmu_filter && kvm->arch.arm_pmu != arm_pmu)) {
	ret = -EBUSY;
	break;
}

kvm_arm_pmu_v3_set_pmu_fixed_counters_only() has the corresponding 
condition for consistency:

if (kvm_vm_has_ran_once(kvm) ||
     (kvm->arch.pmu_filter &&
      !test_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY,
	       &kvm->arch.flags)))
	return -EBUSY;

We can of course kill the PMU event filter for FIXED_COUNTERS_ONLY. The 
filter is effectively a no-op with FIXED_COUNTERS_ONLY and I don't think 
that consistency matters much.
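
To make the resulting behaviour concrete, here is a minimal stand-alone model
of the second condition. It is only a sketch under stated assumptions:
struct vm_model and the model_*() helpers are hypothetical user-space
stand-ins, not KVM code, and the checks of the real filter handler are not
modelled.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the relevant VM state. */
struct vm_model {
	bool has_ran_once;        /* models kvm_vm_has_ran_once(kvm) */
	bool pmu_filter;          /* models kvm->arch.pmu_filter != NULL */
	bool fixed_counters_only; /* models the FIXED_COUNTERS_ONLY flag */
};

/*
 * Mirrors the quoted condition: refuse once the VM has run, or when a
 * filter exists and the flag has not been set before.
 */
static int model_set_fixed_counters_only(struct vm_model *vm)
{
	if (vm->has_ran_once ||
	    (vm->pmu_filter && !vm->fixed_counters_only))
		return -EBUSY;

	vm->fixed_counters_only = true;
	return 0;
}

/* Installing a filter; the real handler's own checks are left out. */
static int model_set_filter(struct vm_model *vm)
{
	vm->pmu_filter = true;
	return 0;
}
```

With this model, the sequence FIXED_COUNTERS_ONLY, FILTER, FIXED_COUNTERS_ONLY
succeeds, while FILTER followed by a first FIXED_COUNTERS_ONLY returns -EBUSY.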

> 
> [...]
> 
>>>> +	int i;
>>>> +
>>>> +	for_each_set_bit(i, &mask, 32) {
>>>> +		pmc = kvm_vcpu_idx_to_pmc(vcpu, i);
>>>> +		if (!pmc->perf_event)
>>>> +			continue;
>>>> +
>>>> +		cpu_pmu = to_arm_pmu(pmc->perf_event->pmu);
>>>> +		if (!cpumask_test_cpu(vcpu->cpu, &cpu_pmu->supported_cpus)) {
>>>> +			kvm_make_request(KVM_REQ_RELOAD_PMU, vcpu);
>>>> +			break;
>>>> +		}
>>>> +	}
>>>> +}
>>>> +
>>>
>>> Why do we need to inflict this on VMs that do not have the fixed
>>> counter restriction?
>>
>> This function is to re-create the perf_event in case the current
>> perf_event does not support the pCPU because e.g., the pCPU is an
>> E-core while the perf_event only covers the P-cores.
> 
> That's not what I meant. This code is only here to support the
> fixed-function feature. It makes no sense outside of it, because *we
> don't support counter migration across implementations*.
> 
> So what's the purpose of this stuff for the normal KVM setup?

None. It's only for this feature. We can add a check of the feature flag 
at the beginning of the function to avoid that loop.
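
Roughly, the early return could look like the following. This is a
stand-alone sketch, not the patch: struct vcpu_model, struct pmc_model and
model_vcpu_load_pmu() are made-up names, a 64-bit mask stands in for
supported_cpus, and only the cycle counter is checked, as suggested in the
review.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a counter and the CPUs its perf_event covers. */
struct pmc_model {
	bool has_event;          /* models pmc->perf_event != NULL */
	uint64_t supported_cpus; /* bit n set: pCPU n is covered */
};

/* Hypothetical stand-in for the vCPU state used by the load path. */
struct vcpu_model {
	bool fixed_counters_only; /* the feature flag of the VM */
	int cpu;                  /* pCPU the vCPU is being loaded on */
	struct pmc_model cycle_counter;
	bool reload_requested;    /* models KVM_REQ_RELOAD_PMU */
};

static void model_vcpu_load_pmu(struct vcpu_model *vcpu)
{
	/* Normal setups never migrate counters: nothing to do. */
	if (!vcpu->fixed_counters_only)
		return;

	/* The model only tracks 64 pCPUs. */
	if (vcpu->cpu < 0 || vcpu->cpu >= 64)
		return;

	if (!vcpu->cycle_counter.has_event)
		return;

	/* Request a reload if the event's PMU does not cover this pCPU. */
	if (!(vcpu->cycle_counter.supported_cpus &
	      (UINT64_C(1) << vcpu->cpu)))
		vcpu->reload_requested = true;
}
```

A VM without the flag returns on the first check, so the normal KVM setup
would not pay for the lookup.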

> 
>>
>>>
>>> And even then, all you have to reconfigure is the cycle counter. So
>>> why the loop? All we want to find out is whether the cycle counter is
>>> instantiated on the PMU that matches the current CPU.
>>
>> I just wanted to avoid hardcoding assumptions on the fixed
>> counter(s). FEAT_PMUv3_ICNTR will be naturally handled with a loop, for
>> example.
> 
> Well, not that loop, since ICNTR is counter 32. So please let's stop
> the nonsense and only add what is required?
> 
> [...]
> 
>>>>    +
>>>> clear_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY,
>>>> &kvm->arch.flags);
>>>
>>> Why does this need to be cleared? I'd rather we make sure it is never
>>> set in the first place.
>>
>> KVM_ARM_VCPU_PMU_V3_SET_PMU and
>> KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY can be set on the same
>> VCPU. The last KVM_ARM_VCPU_PMU_V3_SET_PMU or
>> KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY setting will be effective.
>>
>> A VMM may try to set these attributes to check if the setting is
>> supported. For example, the RFC QEMU patch first uses
>> KVM_ARM_VCPU_PMU_V3_SET_PMU to find a compatible PMU that covers all
>> pCPUs, and then falls back to
>> KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY. The order of such probing is
>> up to the VMM.
> 
> KVM_ARM_VCPU_PMU_V3_SET_PMU is not a probing mechanism. You must probe
> the PMUs by looking in /sys/bus/event_source/devices/, like kvmtool
> does.
> 
> So there is no reason to support this stuff, and the two flags should
> be made mutually exclusive.

Thanks for the pointer. I'll make a change to make the flags mutually 
exclusive and test it with an amended QEMU patch that follows what 
kvmtool does.

> 
> [...]
> 
>>>>
>>>
>>> In conclusion, I find this patch to be rather messy. For a start, it
>>> needs to be split in at least 5 patches:
>>>
>>> - at least two for the refactoring
>>> - one for the PMU core changes
>>> - one for the UAPI
>>> - one for documentation
>>
>> That clarifies the expected granularity of patches. The next version
>> will be in that layout, perhaps with more patches if additional
>> changes are needed. Thanks for the guidance.
>>
>>>
>>> I'd also like some clarification on how this is intended to work if we
>>> enable FEAT_PMUv3_ICNTR, because the definition seems to be designed
>>> to encompass all fixed-function counters, and I expect this to grow
>>> over time.
>>
>> Indeed the UAPI was designed to encompass all fixed-function counters
>> as suggested by Oliver.
>>
>> To support the UAPI, the implementation avoids hardcoding the
>> assumption on the fixed counter(s). FEAT_PMUv3_ICNTR will be naturally
>> supported once the common code is properly updated (i.e., the size of
>> the event counter bitmask is grown and the corresponding registers are
>> wired up with a proper check of the feature.)
>>
>> I expect migration will be handled with the conventional register
>> getters and setters, but please share if you have a concern.
> 
> At the very least I want to see some documentation explaining that.

What kind of documentation do you expect? If we change 
kvm_vcpu_load_pmu() to avoid for_each_set_bit(), there would be a good 
chance to forget updating it when mechanically updating existing 
for_each_set_bit() instances, so it is a candidate for documentation. 
But I don't have a good idea where to place it either.

Regards,
Akihiko Odaki

* Re: [PATCH v5 6/6] selftests/bpf: tc_tunnel validate decap GSO state
From: bot+bpf-ci @ 2026-04-20 11:19 UTC (permalink / raw)
  To: nhudson, bpf, netdev, willemb, martin.lau
  Cc: nhudson, ast, daniel, andrii, eddyz87, memxor, shuah,
	linux-kselftest, linux-kernel, ast, andrii, daniel, martin.lau,
	eddyz87, yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260420104051.1528843-7-nhudson@akamai.com>

> diff --git a/tools/testing/selftests/bpf/progs/test_tc_tunnel.c b/tools/testing/selftests/bpf/progs/test_tc_tunnel.c
> index 7376df405a6b..511022630bb1 100644
> --- a/tools/testing/selftests/bpf/progs/test_tc_tunnel.c
> +++ b/tools/testing/selftests/bpf/progs/test_tc_tunnel.c

[ ... ]

> @@ -621,6 +640,11 @@ static int decap_internal(struct __sk_buff *skb, int off, int len, char proto)
>  		break;
>  	case IPPROTO_GRE:
>  		olen += sizeof(struct gre_hdr);
> +		if (!bpf_core_enum_value_exists(enum bpf_adj_room_flags,
> +						BPF_F_ADJ_ROOM_DECAP_L4_GRE))
> +			return TC_ACT_SHOT;
> +		flags |= BPF_F_ADJ_ROOM_DECAP_L4_GRE;
> +

[ ... ]

> @@ -634,6 +658,10 @@ static int decap_internal(struct __sk_buff *skb, int off, int len, char proto)
>  		break;
>  	case IPPROTO_UDP:
>  		olen += sizeof(struct udphdr);
> +		if (!bpf_core_enum_value_exists(enum bpf_adj_room_flags,
> +						BPF_F_ADJ_ROOM_DECAP_L4_UDP))
> +			return TC_ACT_SHOT;
> +		flags |= BPF_F_ADJ_ROOM_DECAP_L4_UDP;

[ ... ]

> @@ -655,6 +683,35 @@ static int decap_internal(struct __sk_buff *skb, int off, int len, char proto)
>  	if (bpf_skb_adjust_room(skb, -olen, BPF_ADJ_ROOM_MAC, flags))
>  		return TC_ACT_SHOT;
>
> +	kskb = bpf_cast_to_kern_ctx(skb);
> +	shinfo = bpf_core_cast(kskb->head + kskb->end, struct skb_shared_info);

Could the pointer arithmetic here cause compilation issues on 32-bit
architectures?

On 32-bit architectures without NET_SKBUFF_DATA_USES_OFFSET, sk_buff::end
is defined as a pointer (unsigned char *) rather than an offset. Adding
two pointers is invalid in C and would fail to compile.

Martin KaFai Lau noted that other tests use this pattern and suggested it
could be addressed with bpf_core_field_size(kskb->end) if 32-bit support
becomes necessary in the future.

Ref: https://lore.kernel.org/bpf/20260416124324.B0CE6C2BCAF@smtp.kernel.org/
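
For illustration, the two layouts can be modelled in plain C as follows. The
struct and function names are hypothetical miniatures made up for this
sketch, not the kernel definitions; they only show why head + end is valid
when end is an offset and why the pointer variant has to use end directly.

```c
/* Layout with NET_SKBUFF_DATA_USES_OFFSET: end is an offset from head. */
struct skb_offset_model {
	unsigned char *head;
	unsigned int end;
};

/* Layout without it: end is already a pointer into the buffer. */
struct skb_pointer_model {
	unsigned char *head;
	unsigned char *end;
};

/* Pointer plus integer: valid C, yields the end of the data area. */
static unsigned char *end_ptr_offset(const struct skb_offset_model *skb)
{
	return skb->head + skb->end;
}

/*
 * Writing skb->head + skb->end here would add two pointers, which the
 * compiler rejects; the end pointer is used as-is instead.
 */
static unsigned char *end_ptr_pointer(const struct skb_pointer_model *skb)
{
	return skb->end;
}
```

Both helpers return the same address for the same buffer; only the stored
representation of end differs.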

> +	if (!shinfo->gso_size)
> +		return TC_ACT_OK;
> +
> +	if ((flags & BPF_F_ADJ_ROOM_DECAP_L4_UDP) &&
> +	    (shinfo->gso_type & SKB_GSO_UDP_TUNNEL_MASK))
> +		return TC_ACT_SHOT;
> +
> +	if ((flags & BPF_F_ADJ_ROOM_DECAP_L4_GRE) &&
> +	    (shinfo->gso_type & (SKB_GSO_GRE | SKB_GSO_GRE_CSUM)))
> +		return TC_ACT_SHOT;
> +
> +	if ((flags & BPF_F_ADJ_ROOM_DECAP_IPXIP4) &&
> +	    (shinfo->gso_type & SKB_GSO_IPXIP4))
> +		return TC_ACT_SHOT;
> +
> +	if ((flags & BPF_F_ADJ_ROOM_DECAP_IPXIP6) &&
> +	    (shinfo->gso_type & SKB_GSO_IPXIP6))
> +		return TC_ACT_SHOT;

Are these IPXIP4 and IPXIP6 checks reachable?

Looking earlier in decap_internal(), for IPPROTO_IPIP and IPPROTO_IPV6
cases, the function only sets BPF_F_ADJ_ROOM_DECAP_L3_IPV4 and
BPF_F_ADJ_ROOM_DECAP_L3_IPV6 on the flags variable (around the protocol
switch statement).

Because BPF_F_ADJ_ROOM_DECAP_IPXIP4 and BPF_F_ADJ_ROOM_DECAP_IPXIP6 are
never added to flags, wouldn't these validation checks always be skipped?

Ref: https://lore.kernel.org/bpf/20260416124324.B0CE6C2BCAF@smtp.kernel.org/

> +
> +	if (flags & (BPF_F_ADJ_ROOM_DECAP_L4_MASK |
> +		     BPF_F_ADJ_ROOM_DECAP_IPXIP_MASK)) {
> +		if ((shinfo->gso_type & SKB_GSO_TUNNEL_MASK) && !kskb->encapsulation)
> +			return TC_ACT_SHOT;
> +		if (!(shinfo->gso_type & SKB_GSO_TUNNEL_MASK) && kskb->encapsulation)
> +			return TC_ACT_SHOT;
> +	}
> +
>  	return TC_ACT_OK;
>  }


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24662733093

* [PATCH v5 6/6] selftests/bpf: tc_tunnel validate decap GSO state
From: Nick Hudson @ 2026-04-20 10:40 UTC (permalink / raw)
  To: bpf, netdev, Willem de Bruijn, Martin KaFai Lau
  Cc: Nick Hudson, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Shuah Khan,
	linux-kselftest, linux-kernel
In-Reply-To: <20260420104051.1528843-1-nhudson@akamai.com>

Require BPF_F_ADJ_ROOM_DECAP_L4_UDP and BPF_F_ADJ_ROOM_DECAP_L4_GRE enum
values at runtime using CO-RE enum existence checks so missing kernel
support fails fast instead of silently proceeding.

After bpf_skb_adjust_room() decapsulation, inspect skb_shared_info and
sk_buff state for GSO packets and assert that the expected tunnel GSO
bits are cleared and encapsulation matches the remaining tunnel state.

Signed-off-by: Nick Hudson <nhudson@akamai.com>
---
 .../selftests/bpf/progs/test_tc_tunnel.c      | 57 +++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/tools/testing/selftests/bpf/progs/test_tc_tunnel.c b/tools/testing/selftests/bpf/progs/test_tc_tunnel.c
index 7376df405a6b..511022630bb1 100644
--- a/tools/testing/selftests/bpf/progs/test_tc_tunnel.c
+++ b/tools/testing/selftests/bpf/progs/test_tc_tunnel.c
@@ -6,6 +6,7 @@
 
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_endian.h>
+#include <bpf/bpf_core_read.h>
 #include "bpf_tracing_net.h"
 #include "bpf_compiler.h"
 
@@ -37,6 +38,22 @@ struct vxlanhdr___local {
 
 #define	EXTPROTO_VXLAN	0x1
 
+#define SKB_GSO_UDP_TUNNEL_MASK	(SKB_GSO_UDP_TUNNEL |			\
+				 SKB_GSO_UDP_TUNNEL_CSUM)
+
+#define SKB_GSO_TUNNEL_MASK	(SKB_GSO_UDP_TUNNEL_MASK |		\
+				 SKB_GSO_GRE |				\
+				 SKB_GSO_GRE_CSUM |			\
+				 SKB_GSO_IPXIP4 |			\
+				 SKB_GSO_IPXIP6 |			\
+				 SKB_GSO_ESP)
+
+#define BPF_F_ADJ_ROOM_DECAP_L4_MASK	(BPF_F_ADJ_ROOM_DECAP_L4_UDP |	\
+				 BPF_F_ADJ_ROOM_DECAP_L4_GRE)
+
+#define BPF_F_ADJ_ROOM_DECAP_IPXIP_MASK	(BPF_F_ADJ_ROOM_DECAP_IPXIP4 |	\
+					 BPF_F_ADJ_ROOM_DECAP_IPXIP6)
+
 #define	VXLAN_FLAGS     bpf_htonl(1<<27)
 #define	VNI_ID		1
 #define	VXLAN_VNI	bpf_htonl(VNI_ID << 8)
@@ -592,6 +609,8 @@ int __encap_ip6vxlan_eth(struct __sk_buff *skb)
 static int decap_internal(struct __sk_buff *skb, int off, int len, char proto)
 {
 	__u64 flags = BPF_F_ADJ_ROOM_FIXED_GSO;
+	struct sk_buff *kskb;
+	struct skb_shared_info *shinfo;
 	struct ipv6_opt_hdr ip6_opt_hdr;
 	struct gre_hdr greh;
 	struct udphdr udph;
@@ -621,6 +640,11 @@ static int decap_internal(struct __sk_buff *skb, int off, int len, char proto)
 		break;
 	case IPPROTO_GRE:
 		olen += sizeof(struct gre_hdr);
+		if (!bpf_core_enum_value_exists(enum bpf_adj_room_flags,
+						BPF_F_ADJ_ROOM_DECAP_L4_GRE))
+			return TC_ACT_SHOT;
+		flags |= BPF_F_ADJ_ROOM_DECAP_L4_GRE;
+
 		if (bpf_skb_load_bytes(skb, off + len, &greh, sizeof(greh)) < 0)
 			return TC_ACT_OK;
 		switch (bpf_ntohs(greh.protocol)) {
@@ -634,6 +658,10 @@ static int decap_internal(struct __sk_buff *skb, int off, int len, char proto)
 		break;
 	case IPPROTO_UDP:
 		olen += sizeof(struct udphdr);
+		if (!bpf_core_enum_value_exists(enum bpf_adj_room_flags,
+						BPF_F_ADJ_ROOM_DECAP_L4_UDP))
+			return TC_ACT_SHOT;
+		flags |= BPF_F_ADJ_ROOM_DECAP_L4_UDP;
 		if (bpf_skb_load_bytes(skb, off + len, &udph, sizeof(udph)) < 0)
 			return TC_ACT_OK;
 		switch (bpf_ntohs(udph.dest)) {
@@ -655,6 +683,35 @@ static int decap_internal(struct __sk_buff *skb, int off, int len, char proto)
 	if (bpf_skb_adjust_room(skb, -olen, BPF_ADJ_ROOM_MAC, flags))
 		return TC_ACT_SHOT;
 
+	kskb = bpf_cast_to_kern_ctx(skb);
+	shinfo = bpf_core_cast(kskb->head + kskb->end, struct skb_shared_info);
+	if (!shinfo->gso_size)
+		return TC_ACT_OK;
+
+	if ((flags & BPF_F_ADJ_ROOM_DECAP_L4_UDP) &&
+	    (shinfo->gso_type & SKB_GSO_UDP_TUNNEL_MASK))
+		return TC_ACT_SHOT;
+
+	if ((flags & BPF_F_ADJ_ROOM_DECAP_L4_GRE) &&
+	    (shinfo->gso_type & (SKB_GSO_GRE | SKB_GSO_GRE_CSUM)))
+		return TC_ACT_SHOT;
+
+	if ((flags & BPF_F_ADJ_ROOM_DECAP_IPXIP4) &&
+	    (shinfo->gso_type & SKB_GSO_IPXIP4))
+		return TC_ACT_SHOT;
+
+	if ((flags & BPF_F_ADJ_ROOM_DECAP_IPXIP6) &&
+	    (shinfo->gso_type & SKB_GSO_IPXIP6))
+		return TC_ACT_SHOT;
+
+	if (flags & (BPF_F_ADJ_ROOM_DECAP_L4_MASK |
+		     BPF_F_ADJ_ROOM_DECAP_IPXIP_MASK)) {
+		if ((shinfo->gso_type & SKB_GSO_TUNNEL_MASK) && !kskb->encapsulation)
+			return TC_ACT_SHOT;
+		if (!(shinfo->gso_type & SKB_GSO_TUNNEL_MASK) && kskb->encapsulation)
+			return TC_ACT_SHOT;
+	}
+
 	return TC_ACT_OK;
 }
 
-- 
2.34.1


* Re: [PATCH v2] selftests: harness: fix pidfd leak in __wait_for_test
From: Thomas Weißschuh @ 2026-04-20 10:15 UTC (permalink / raw)
  To: Geliang Tang
  Cc: Kees Cook, Andy Lutomirski, Will Drewry, Shuah Khan,
	Christian Brauner, Geliang Tang, linux-kselftest, mptcp
In-Reply-To: <a82e275ccfb2609a1984d90ab559fa3af78f1e81.1776678050.git.tanggeliang@kylinos.cn>

On Mon, Apr 20, 2026 at 05:45:28PM +0800, Geliang Tang wrote:
> From: Geliang Tang <tanggeliang@kylinos.cn>
> 
> Fix the pidfd leak in kselftest_harness.h's __wait_for_test() where
> childfd = syscall(__NR_pidfd_open, t->pid, 0) is never closed.
> 
> Fixes: 73a3cde97677 ("selftests: harness: Implement test timeouts through pidfd")
> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>

Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>

(...)

* Re: [PATCH v7 3/4] KVM: arm64: PMU: Introduce FIXED_COUNTERS_ONLY
From: Marc Zyngier @ 2026-04-20  9:51 UTC (permalink / raw)
  To: Akihiko Odaki
  Cc: Oliver Upton, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
	Catalin Marinas, Will Deacon, Kees Cook, Gustavo A. R. Silva,
	Paolo Bonzini, Jonathan Corbet, Shuah Khan, linux-arm-kernel,
	kvmarm, linux-kernel, linux-hardening, devel, kvm, linux-doc,
	linux-kselftest
In-Reply-To: <06c6664c-7f0c-47b2-babf-ba2a541fd9f2@rsg.ci.i.u-tokyo.ac.jp>

On Mon, 20 Apr 2026 09:36:16 +0100,
Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> wrote:
> 
> On 2026/04/20 2:19, Marc Zyngier wrote:
> > On Sat, 18 Apr 2026 09:14:25 +0100,
> > Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> wrote:
> >> 
> >> On a heterogeneous arm64 system, KVM's PMU emulation is based on the
> >> features of a single host PMU instance. When a vCPU is migrated to a
> >> pCPU with an incompatible PMU, counters such as PMCCNTR_EL0 stop
> >> incrementing.
> >> 
> >> Although this behavior is permitted by the architecture, Windows does
> >> not handle it gracefully and may crash with a division-by-zero error.
> >> 
> >> The current workaround requires VMMs to pin vCPUs to a set of pCPUs
> >> that share a compatible PMU. This is difficult to implement correctly in
> >> QEMU/libvirt, where pinning occurs after vCPU initialization, and it
> >> also restricts the guest to a subset of available pCPUs.
> >> 
> >> Introduce the KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY attribute to
> >> create a "fixed-counters-only" PMU. When set, KVM exposes a PMU that is
> >> compatible with all pCPUs but that does not support programmable
> >> event counters which may have different feature sets on different PMUs.
> >> 
> >> This allows Windows guests to run reliably on heterogeneous systems
> >> without crashing, even without vCPU pinning, and enables VMMs to
> >> schedule vCPUs across all available pCPUs, making full use of the host
> >> hardware.
> >> 
> >> Much like KVM_ARM_VCPU_PMU_V3_IRQ and other read-write attributes, this
> >> attribute provides a getter that facilitates kernel and userspace
> >> debugging/testing.
> > 
> > OK, so that's the sales pitch. But how is it implemented? I would like
> > to be able to read a high-level description of the implementation
> > trade-offs.
> 
> Implementation-wise it is very trivial. Essentially the following
> addition in kvm_arm_pmu_v3_get_attr() is the entire implementation:
> +	case KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY:
> +		if (test_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY,
> &vcpu->kvm->arch.flags))
> +			return 0;
> 
> Both its functionality and code complexity are trivial. So we can argue that:
> - the functionality is too trivial to be useful or
> - the interface/implementation complexity is so trivial that it does not
>   incur maintenance burden
> 
> In this case the selftest uses the getter so I was more inclined to
> have it, but adding one just for the selftest sounds too ad-hoc, so
> here I looked into other attributes to ensure that it was not
> introducing inconsistency with existing interfaces.
> 
> As a result, I found there are other read-write attributes; in fact
> there are more read-write attributes than write-only ones.

You're completely missing the point. I'm referring to the whole of the
commit message, which is more of a marketing slide than a technical
description.

I really don't care about the getter at this stage, which while
pointless, does not make things more awful than they already are.

> 
> > 
> >> 
> >> Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
> >> ---
> >>   Documentation/virt/kvm/devices/vcpu.rst |  29 ++++++
> >>   arch/arm64/include/asm/kvm_host.h       |   2 +
> >>   arch/arm64/include/uapi/asm/kvm.h       |   1 +
> >>   arch/arm64/kvm/arm.c                    |   1 +
> >>   arch/arm64/kvm/pmu-emul.c               | 155 +++++++++++++++++++++++---------
> >>   include/kvm/arm_pmu.h                   |   2 +
> >>   6 files changed, 147 insertions(+), 43 deletions(-)
> >> 
> >> diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
> >> index 60bf205cb373..e0aeb1897d77 100644
> >> --- a/Documentation/virt/kvm/devices/vcpu.rst
> >> +++ b/Documentation/virt/kvm/devices/vcpu.rst
> >> @@ -161,6 +161,35 @@ explicitly selected, or the number of counters is out of range for the
> >>   selected PMU. Selecting a new PMU cancels the effect of setting this
> >>   attribute.
> >>   +1.6 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY
> >> +------------------------------------------------------
> >> +
> >> +:Parameters: no additional parameter in kvm_device_attr.addr
> >> +
> >> +:Returns:
> >> +
> >> +	 =======  =====================================================
> >> +	 -EBUSY   Attempted to set after initializing PMUv3 or running
> >> +		  VCPU, or attempted to set for the first time after
> >> +		  setting an event filter
> >> +	 -ENXIO   Attempted to get before setting
> >> +	 -ENODEV  Attempted to set while PMUv3 not supported
> >> +	 =======  =====================================================
> >> +
> >> +If set, PMUv3 will be emulated without programmable event counters. The VCPU
> >> +will use any compatible hardware PMU. This attribute is particularly useful on
> > 
> > Not quite "any PMU". It will use *the* PMU of the physical CPU,
> > irrespective of the implementation.
> 
> I think:
> 
> - this comment
> - one on the KVM_EXIT_FAIL_ENTRY_CPU_UNSUPPORTED note
> - one on kvm_pmu_create_perf_event()
> - and one on kvm_arm_pmu_v3_set_pmu_fixed_counters_only()
> 
> All boil down to one question: will it support all possible CPUs, or
> will it support a subset? Let me answer here:
> 
> This patch is written to support a subset instead of all possible
> CPUs. If a pCPU does not have a compatible PMU, the pCPU will not be
> supported and cause KVM_EXIT_FAIL_ENTRY_CPU_UNSUPPORTED.

This is not a thing. Either *all* the CPUs have a PMU that can be used
for KVM, or PMU support is not offered to guests. That's a hard line
in the sand. And the code already upholds this by checking the
sanitised PMUVer field.

>
> This patch does not enforce all possible CPUs are covered by the
> compatible PMUs. Theoretically speaking,
> kvm_arm_pmu_get_pmuver_limit() enables the PMU emulation when real
> PMUv3 hardware covers all possible CPUs *or* the relevant registers
> can be trapped with IMPDEF, so some pCPU may not have a compatible PMU
> and only provide the IMPDEF trapping.

How is that possible? Please describe the case where that can happen,
and I will make sure that such a system stops booting. The intent is
definitely that:

- for early CPUs, we take the minimal capability of all CPUs

- for late CPUs, either they match at least the capability recorded by
  early CPUs, or they don't boot.
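
As a toy model of that policy (a sketch only: sanitise_early() and
late_cpu_may_boot() are hypothetical names, and a plain integer stands in
for a capability field such as PMUVer):

```c
#include <stdbool.h>
#include <stddef.h>

/* System-wide value: the minimum over all CPUs seen during early boot. */
static unsigned int sanitise_early(const unsigned int *caps, size_t n)
{
	unsigned int min;
	size_t i;

	if (n == 0)
		return 0;

	min = caps[0];
	for (i = 1; i < n; i++)
		if (caps[i] < min)
			min = caps[i];
	return min;
}

/* A late CPU must offer at least the recorded system-wide capability. */
static bool late_cpu_may_boot(unsigned int system_cap, unsigned int cpu_cap)
{
	return cpu_cap >= system_cap;
}
```

With early CPUs reporting 5, 4 and 6, the recorded value is 4; a late CPU
reporting 3 is rejected, while one reporting 4 or more is accepted.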

> Practically, I don't think any sane configuration will ever have such
> a subset support, so we can explicitly enforce all possible CPUs are
> covered by the compatible PMUs if desired.

That's not just desired. This is a requirement. And it is already
enforced AFAICS.

> 
> > 
> >> +heterogeneous systems where different hardware PMUs cover different physical
> >> +CPUs. The compatibility of hardware PMUs can be checked with
> >> +KVM_ARM_VCPU_PMU_V3_SET_PMU. All VCPUs in a VM share this attribute. It isn't
> >> +possible to set it for the first time if a PMU event filter is already present.
> > 
> > "for the first time" gives the impression that it will work if you try
> > again. I'd rather we say that "This feature is incompatible with the
> > existence of a PMU event filter".
> 
> The following sequence will work:
> 1. Set KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY
> 2. Set KVM_ARM_VCPU_PMU_V3_FILTER
> 3. Set KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY
> 
> This is to make the behavior consistent with KVM_ARM_VCPU_PMU_V3_SET_PMU.

I don't think this is correct. Filtering is completely at odds with
this patch, and I don't want to have to reason about the combination.

[...]

> >> +	int i;
> >> +
> >> +	for_each_set_bit(i, &mask, 32) {
> >> +		pmc = kvm_vcpu_idx_to_pmc(vcpu, i);
> >> +		if (!pmc->perf_event)
> >> +			continue;
> >> +
> >> +		cpu_pmu = to_arm_pmu(pmc->perf_event->pmu);
> >> +		if (!cpumask_test_cpu(vcpu->cpu, &cpu_pmu->supported_cpus)) {
> >> +			kvm_make_request(KVM_REQ_RELOAD_PMU, vcpu);
> >> +			break;
> >> +		}
> >> +	}
> >> +}
> >> +
> > 
> > Why do we need to inflict this on VMs that do not have the fixed
> > counter restriction?
> 
> This function is to re-create the perf_event in case the current
> perf_event does not support the pCPU because e.g., the pCPU is an
> E-core while the perf_event only covers the P-cores.

That's not what I meant. This code is only here to support the
fixed-function feature. It makes no sense outside of it, because *we
don't support counter migration across implementations*.

So what's the purpose of this stuff for the normal KVM setup?

> 
> > 
> > And even then, all you have to reconfigure is the cycle counter. So
> > why the loop? All we want to find out is whether the cycle counter is
> > instantiated on the PMU that matches the current CPU.
> 
> I just wanted to avoid hardcoding assumptions on the fixed
> counter(s). FEAT_PMUv3_ICNTR will be naturally handled with a loop, for
> example.

Well, not that loop, since ICNTR is counter 32. So please let's stop
the nonsense and only add what is required?
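
To illustrate (a stand-alone sketch with a hypothetical helper name, assuming
the cycle counter sits at bit 31 and the instruction counter at bit 32 of the
counter mask): a walk limited to the first 32 bits never reaches bit 32.

```c
#include <stdint.h>

/*
 * Count the set bits that a walk over the first nbits bits of mask
 * would visit, in the spirit of for_each_set_bit(i, &mask, nbits).
 */
static unsigned int visited_bits(uint64_t mask, unsigned int nbits)
{
	unsigned int i, n = 0;

	for (i = 0; i < nbits && i < 64; i++)
		if (mask & (UINT64_C(1) << i))
			n++;
	return n;
}
```

For a mask with bits 31 and 32 set, a bound of 32 visits one counter and a
bound of 33 visits both.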

[...]

> >>   +
> >> clear_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY,
> >> &kvm->arch.flags);
> > 
> > Why does this need to be cleared? I'd rather we make sure it is never
> > set in the first place.
> 
> KVM_ARM_VCPU_PMU_V3_SET_PMU and
> KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY can be set on the same
> VCPU. The last KVM_ARM_VCPU_PMU_V3_SET_PMU or
> KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY setting will be effective.
> 
> A VMM may try to set these attributes to check if the setting is
> supported. For example, the RFC QEMU patch first uses
> KVM_ARM_VCPU_PMU_V3_SET_PMU to find a compatible PMU that covers all
> pCPUs, and then falls back to
> KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY. The order of such probing is
> up to the VMM.

KVM_ARM_VCPU_PMU_V3_SET_PMU is not a probing mechanism. You must probe
the PMUs by looking in /sys/bus/event_source/devices/, like kvmtool
does.

So there is no reason to support this stuff, and the two flags should
be made mutually exclusive.

[...]

> >>
> > 
> > In conclusion, I find this patch to be rather messy. For a start, it
> > needs to be split in at least 5 patches:
> > 
> > - at least two for the refactoring
> > - one for the PMU core changes
> > - one for the UAPI
> > - one for documentation
> 
> That clarifies the expected granularity of patches. The next version
> will be in that layout, perhaps with more patches if additional
> changes are needed. Thanks for the guidance.
> 
> > 
> > I'd also like some clarification on how this is intended to work if we
> > enable FEAT_PMUv3_ICNTR, because the definition seems to be designed
> > to encompass all fixed-function counters, and I expect this to grow
> > over time.
> 
> Indeed the UAPI was designed to encompass all fixed-function counters
> as suggested by Oliver.
> 
> To support the UAPI, the implementation avoids hardcoding the
> assumption on the fixed counter(s). FEAT_PMUv3_ICNTR will be naturally
> supported once the common code is properly updated (i.e., the size of
> the event counter bitmask is grown and the corresponding registers are
> wired up with a proper check of the feature.)
> 
> I expect migration will be handled with the conventional register
> getters and setters, but please share if you have a concern.

At the very least I want to see some documentation explaining that.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

* Re: [PATCH v2 0/6] kselftests: livepatch: Adapt tests to be executed on 4.12 kernels
From: Miroslav Benes @ 2026-04-20  9:46 UTC (permalink / raw)
  To: Joe Lawrence
  Cc: Marcos Paulo de Souza, Josh Poimboeuf, Jiri Kosina, Petr Mladek,
	Shuah Khan, live-patching, linux-kselftest, linux-kernel
In-Reply-To: <aeJ9pn6v5sGq5nln@redhat.com>

On Fri, 17 Apr 2026, Joe Lawrence wrote:

> On Thu, Apr 16, 2026 at 03:18:33PM -0300, Marcos Paulo de Souza wrote:
> > On Thu, 2026-04-16 at 10:07 -0700, Josh Poimboeuf wrote:
> > > On Mon, Apr 13, 2026 at 02:26:11PM -0300, Marcos Paulo de Souza
> > > wrote:
> > > > A new version of the patchset, with fewer patches now. Please take
> > > > a look!
> > > > 
> > > > Original cover-letter:
> > > > These patches don't really change how the tests are run, just skip
> > > > some tests on kernels that don't support a feature (like kprobes and
> > > > livepatches living together) or when a livepatch sysfs attribute is
> > > > missing.
> > > > 
> > > > The last patch slightly adjusts check_result function to skip dmesg
> > > > messages on SLE kernels when a livepatch is removed.
> > > 
> > > Why are we adding complexity to support Linux 4.12 in mainline? 
> > > Isn't
> > > that what enterprise distros are for?
> > 
> > These changes do not add any new complex code, just checks to enable
> > the tests to run on older kernels. I believe that it would be good for
> > all enterprises distros if they could run more tests in maintenance
> > updates of their kernels using the upstream tests.
> > 
> > The changes are not really that big. Some patches were removed from v1
> > because they were adding checks for out-of-tree messages (like the
> > last paragraph of the v2 erroneously shows), and another one was to
> > check if kprobes could live alongside livepatches, which fails for 4.12
> > kernels.
> > 
> > The patches in this version only introduce checks to avoid testing
> > sysfs attributes on kernels that don't support them.
> > 
> 
> IMHO when the changes are reasonably small, I think we should consider
> accommodating older kernels for the selftest suite.  If we reach the
> point of having to introduce version #ifdef-erry, that opinion would
> flip pretty quickly.  It's pretty amazing that modern tests still run on
> older kernels (with this patchset) -- not an explicit kselftest goal
> AFAIK, but nice to have.
> 
> If we do merge this patchset, it should update the doc
> tools/testing/selftests/livepatch/README to note the oldest
> expected/tested upstream kernel.  (So new selftest authors may have some
> idea of what API / sysfs features to use.)  And that this compatibility
> was only an incidental "feature" that came for nearly free.  It's not a
> promise to never add backwards-incompatible tests in the future.

I agree with Joe on both points.

Miroslav

* Re: [PATCH 00/53] selftests/mm: make MM selftests more CI friendly
From: Ryan Roberts @ 2026-04-20  9:46 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, David Hildenbrand, Baolin Wang, Barry Song,
	Dev Jain, Jason Gunthorpe, John Hubbard, Liam R. Howlett,
	Lance Yang, Leon Romanovsky, Lorenzo Stoakes, Mark Brown,
	Michal Hocko, Nico Pache, Peter Xu, Shuah Khan,
	Suren Baghdasaryan, Vlastimil Babka, Zi Yan, linux-kernel,
	linux-kselftest, linux-mm
In-Reply-To: <aeXvty_LE4v5XLq3@kernel.org>

On 20/04/2026 10:19, Mike Rapoport wrote:
> Hi Ryan,
> 
> On Mon, Apr 20, 2026 at 09:37:03AM +0100, Ryan Roberts wrote:
>> On 06/04/2026 15:16, Mike Rapoport wrote:
>>> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>>>
>>> Hi,
>>>
>>> There's a lot of dancing around HugeTLB settings in run_vmtests.sh.
>>> Some tests need just a few default huge pages, some require at least 256 MB, and
>>> some just skip lots of tests if huge pages of all supported sizes are not
>>> available.
>>>
>>> The goal of this set is to make tests deal with HugeTLB setup and teardown.
>>
>> Hi Mike,
>>
>> I haven't had a chance to review this series properly, but the intent certainly
>> seems extremely valuable!
>>
>> I thought I'd share some configuration magic that I always use when running on
>> arm64. Apologies if I'm teaching you to suck eggs...
>>
>> arm64 supports multiple hugetlb sizes and (at least in the past) the magic in
>> run_vmtests.sh only reserves for the default size. As a consequence, whenever I
>> run these tests on arm64, I always boot with:
>>
>> hugepagesz=1G hugepages=0:2,1:2
>> hugepagesz=32M hugepages=0:2,1:2
>> default_hugepagesz=2M hugepages=0:64,1:64
>> hugepagesz=64K hugepages=0:2,1:2
>>
>> Which reserves 2 pages of each supported non-default size in each of 2 NUMA
>> nodes, and 64 of the default size in each NUMA node. (This would need adjusting
>> if using a different base page size).
>>
>> My recollection is that this effectively overrides what the script was doing and
>> is sufficient to make all hugetlb tests run for all hugepage sizes.
> 
> My goal is to let the tests themselves set up the right hugetlb configuration
> without forcing it either on the command line or in the wrapper scripts.
> 
> On x86 I can run all the tests in a virtme-ng VM with two nodes and no
> kernel command line overrides. I suppose that should work on arm64 too.
> 
> There are some additional settings that such a VM would need to avoid
> skipping tests that presume swap or a real filesystem, but that's more of
> a virtme-ng limitation.
> 
>> If it's possible to get this non-default hugepage size reservation logic into
>> the tests themselves, this will make the mm selftests much easier to run on
>> arm64 with full coverage.
> 
> That's what the second half of the series does. E.g. for cow tests:
> 
> https://lore.kernel.org/linux-mm/ee6bbac9-b375-4413-a771-6d32c7afda67@arm.com/T/#m62f23b835061449bc6249afacf993bb32ea11234

Ahh, excellent; you're already considering the non-default sizes. I'll get back
in my box :)

>  
>> Another observation is that "secretmem.enable" is currently needed on the
>> cmdline to enable secretmem so that the associated tests run. Not sure what can
>> be done to make that simpler?
> 
> Looks like you're testing really old kernels :)
> The secretmem default changed to "enabled" from 6.5 ;-)

Good to know; I created my scripts/environment pre-6.5, so that's just
historical baggage on my part, I guess.

>  
>> And there are tests that depend on having more than 1 NUMA node; I always run
>> under QEMU with 2 emulated NUMA nodes. I guess that's really just a property of
>> the HW, so nothing to be done from the test harness.
> 
> Right, the test harness can't do much about it. Either run in a virtual
> machine with 2 (or more) nodes or enable NUMA emulation in the kernel
> configuration and on the kernel command line.

ACK

Thanks again for this improvement!

> 
>> Thanks,
>> Ryan
> 



* [PATCH v2] selftests: harness: fix pidfd leak in __wait_for_test
From: Geliang Tang @ 2026-04-20  9:45 UTC (permalink / raw)
  To: Kees Cook, Andy Lutomirski, Will Drewry, Shuah Khan,
	Christian Brauner, Thomas Weißschuh
  Cc: Geliang Tang, linux-kselftest, mptcp

From: Geliang Tang <tanggeliang@kylinos.cn>

Fix the pidfd leak in kselftest_harness.h's __wait_for_test() where
childfd = syscall(__NR_pidfd_open, t->pid, 0) is never closed.

Fixes: 73a3cde97677 ("selftests: harness: Implement test timeouts through pidfd")
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
---
v2:
 - move close() directly after poll() as Thomas suggested.

Hi,

While adding more TLS selftests for MPTCP KTLS development, a segmentation
fault occurred. Debugging revealed that the accept() failure was due to MPTCP
tests requiring over 1024 file descriptors simultaneously. I initially raised
the limit to 4096 in [1], but sashiko noted that the real issue was a pidfd
leak in kselftest_harness.h's __wait_for_test(). Hence, this fix addresses that.

Thanks,
-Geliang

[1]
https://patchwork.kernel.org/project/mptcp/patch/ced184831757eaae9e690f65d799809ed22cae28.1776469069.git.tanggeliang@kylinos.cn/
---
 tools/testing/selftests/kselftest_harness.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/kselftest_harness.h b/tools/testing/selftests/kselftest_harness.h
index 75fb016cd190..53535c188b33 100644
--- a/tools/testing/selftests/kselftest_harness.h
+++ b/tools/testing/selftests/kselftest_harness.h
@@ -996,6 +996,7 @@ static void __wait_for_test(struct __test_metadata *t)
 	poll_child.fd = childfd;
 	poll_child.events = POLLIN;
 	ret = poll(&poll_child, 1, t->timeout * 1000);
+	close(childfd);
 	if (ret == -1) {
 		t->exit_code = KSFT_FAIL;
 		fprintf(TH_LOG_STREAM,
-- 
2.51.0
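
The pattern the fix restores can be sketched outside the harness as follows. This is only a minimal illustration, not the code in kselftest_harness.h: the helper name wait_child_timeout() and its return convention are assumptions, and the SYS_pidfd_open fallback is only there for older libc headers. The point it shows is that the pidfd is closed on every path once poll() has returned.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallback for older libc headers; 434 is the generic syscall number. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif

/*
 * Hypothetical helper (not part of kselftest_harness.h): wait up to
 * timeout_ms for child `pid` to exit and reap it.
 * Returns 1 if the child exited (*status is filled in), 0 on timeout,
 * -1 on error with errno set.
 */
static int wait_child_timeout(pid_t pid, int timeout_ms, int *status)
{
	struct pollfd pfd;
	int pidfd, ret, saved_errno;

	assert(status != NULL);

	pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -1;

	pfd.fd = pidfd;
	pfd.events = POLLIN;	/* a pidfd becomes readable when the process exits */
	ret = poll(&pfd, 1, timeout_ms);

	/* Close right after poll() so no return path can leak the descriptor. */
	saved_errno = errno;
	close(pidfd);
	errno = saved_errno;

	if (ret == 1 && waitpid(pid, status, 0) < 0)
		return -1;

	return ret;
}
```

A caller would fork() a child and then call wait_child_timeout(pid, 5000, &status). Closing immediately after poll(), rather than at the end of the function, keeps the cleanup independent of how many early returns follow.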



* Re: [PATCH v2 05/53] selftests/mm: merge map_hugetlb into hugepage-mmap
From: Donet Tom @ 2026-04-20  9:20 UTC (permalink / raw)
  To: Mike Rapoport, Andrew Morton, David Hildenbrand
  Cc: Baolin Wang, Barry Song, Dev Jain, Jason Gunthorpe, John Hubbard,
	Liam R. Howlett, Lance Yang, Leon Romanovsky, Lorenzo Stoakes,
	Mark Brown, Michal Hocko, Nico Pache, Peter Xu, Ryan Roberts,
	Sarthak Sharma, Shuah Khan, Suren Baghdasaryan, Vlastimil Babka,
	Zi Yan, linux-kernel, linux-kselftest, linux-mm
In-Reply-To: <20260418105539.1261536-6-rppt@kernel.org>

Hi Mike

On 4/18/26 4:24 PM, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> Both tests create a hugetlb mapping, fill it with data and verify the
> data; the only difference is that one uses file-backed memory and the
> other uses anonymous memory.
>
> Merge both tests into a single file.
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>   tools/testing/selftests/mm/Makefile        |   1 -
>   tools/testing/selftests/mm/hugepage-mmap.c | 112 ++++++++++++++++-----
>   tools/testing/selftests/mm/map_hugetlb.c   |  88 ----------------
>   tools/testing/selftests/mm/run_vmtests.sh  |   1 -
>   4 files changed, 85 insertions(+), 117 deletions(-)
>   delete mode 100644 tools/testing/selftests/mm/map_hugetlb.c
>
> diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
> index cd24596cdd27..cbda989f6b6a 100644
> --- a/tools/testing/selftests/mm/Makefile
> +++ b/tools/testing/selftests/mm/Makefile
> @@ -70,7 +70,6 @@ TEST_GEN_FILES += hugepage-vmemmap
>   TEST_GEN_FILES += khugepaged
>   TEST_GEN_FILES += madv_populate
>   TEST_GEN_FILES += map_fixed_noreplace
> -TEST_GEN_FILES += map_hugetlb
>   TEST_GEN_FILES += map_populate
>   ifneq (,$(filter $(ARCH),arm64 riscv riscv64 x86 x86_64 loongarch32 loongarch64))
>   TEST_GEN_FILES += memfd_secret
> diff --git a/tools/testing/selftests/mm/hugepage-mmap.c b/tools/testing/selftests/mm/hugepage-mmap.c
> index d543419de040..f4fcc7c45875 100644
> --- a/tools/testing/selftests/mm/hugepage-mmap.c
> +++ b/tools/testing/selftests/mm/hugepage-mmap.c
> @@ -15,6 +15,7 @@
>   #include <unistd.h>
>   #include <sys/mman.h>
>   #include <fcntl.h>
> +#include "vm_util.h"
>   #include "kselftest.h"
>   
>   #define LENGTH (256UL*1024*1024)
> @@ -25,54 +26,111 @@ static void check_bytes(char *addr)
>   	ksft_print_msg("First hex is %x\n", *((unsigned int *)addr));
>   }
>   
> -static void write_bytes(char *addr)
> +static void write_bytes(char *addr, size_t length)
>   {
>   	unsigned long i;
>   
> -	for (i = 0; i < LENGTH; i++)
> +	for (i = 0; i < length; i++)
>   		*(addr + i) = (char)i;
>   }
>   
> -static int read_bytes(char *addr)
> +static bool verify_bytes(char *addr, size_t length)
>   {
>   	unsigned long i;
>   
>   	check_bytes(addr);
> -	for (i = 0; i < LENGTH; i++)
> -		if (*(addr + i) != (char)i) {
> -			ksft_print_msg("Error: Mismatch at %lu\n", i);
> -			return 1;
> -		}
> -	return 0;
> +	for (i = 0; i < length; i++)
> +		if (*(addr + i) != (char)i)
> +			return false;
> +
> +	return true;
>   }
>   
> -int main(void)
> +static bool test_mmap(size_t length, int mmap_flags, int fd,
> +		      const char *test_name)
>   {
>   	void *addr;
> -	int fd, ret;
>   
> -	ksft_print_header();
> -	ksft_set_plan(1);
> +	addr = mmap(NULL, length, PROTECTION, mmap_flags, fd, 0);
> +	if (addr == MAP_FAILED)
> +		ksft_exit_fail_perror("mmap");
> +
> +	ksft_print_msg("Returned address is %p\n", addr);
> +	check_bytes(addr);
> +	write_bytes(addr, length);
> +	if (!verify_bytes(addr, length))
> +		ksft_exit_fail_msg("%s\n", test_name);
> +
> +	/* munmap() length of MAP_HUGETLB memory must be hugepage aligned */
> +	if (munmap(addr, length))
> +		ksft_exit_fail_perror("munmap");
> +
> +	return true;
> +}
> +
> +static void test_anon_mmap(size_t length, int shift)
> +{
> +	const char *test_name = "hugetlb anonymous mmap";
> +	int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB;
> +	bool passed;
> +
> +	if (shift)
> +		mmap_flags |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
> +
> +	passed = test_mmap(length, mmap_flags, -1, test_name);
> +	ksft_test_result(passed, "%s\n", test_name);
> +}
> +
> +static void test_file_mmap(size_t length, int shift)
> +{
> +	const char *test_name = "hugetlb file mmap";
> +	int mmap_flags = MAP_SHARED;
> +	bool passed;
> +	int fd;
> +
> +	if (shift)
> +		mmap_flags |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;

From what I understand, selecting an alternative huge page size using
mmap flags is typically done together with MAP_HUGETLB. Since we are not
using MAP_HUGETLB here, is this code still necessary? Please let me
know if I am missing something.



>   
> -	fd = memfd_create("hugepage-mmap", MFD_HUGETLB);
> +	fd = memfd_create("hugetlb-mmap", MFD_HUGETLB);
>
In my test, the default huge page size is 2MB. I passed a shift value of
30 (for 1GB huge pages), but the mmap call failed. To support 1GB huge
pages on a system with a 2MB default huge page size, I used
MFD_HUGETLB | MFD_HUGE_1GB. With this change, file-backed mmap using 1GB
huge pages started working. Do you think this change would be needed?

-Donet
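
To make the suggestion above concrete, here is a small sketch of how the memfd_create() flags could be derived from the same shift argument. It is an illustration only, not part of the posted patch: hugetlb_memfd_flags() is a hypothetical helper, and the #ifndef fallbacks just mirror the kernel UAPI values for libc headers that lack them. It relies on the MFD_HUGE_* size field using the same 6-bit encoding at bit 26 as MAP_HUGE_*.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>

/* Fallbacks matching the kernel UAPI values, for older libc headers. */
#ifndef MFD_HUGETLB
#define MFD_HUGETLB 0x0004U
#endif
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_MASK
#define MAP_HUGE_MASK 0x3f
#endif

/*
 * Hypothetical helper: build memfd_create() flags for a hugetlb memfd.
 * shift == 0 selects the default huge page size; otherwise shift is
 * log2 of the requested size (21 for 2MB, 30 for 1GB).
 */
static unsigned int hugetlb_memfd_flags(unsigned int shift)
{
	unsigned int flags = MFD_HUGETLB;

	assert(shift <= MAP_HUGE_MASK);	/* the size field is only 6 bits wide */
	if (shift)
		flags |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;

	return flags;
}
```

With shift = 30 the result equals MFD_HUGETLB | MFD_HUGE_1GB, so a call such as memfd_create("hugetlb-mmap", hugetlb_memfd_flags(shift)) would select the size on the file itself rather than in the mmap() flags.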

>   	if (fd < 0)
> -		ksft_exit_fail_msg("memfd_create() failed: %s\n", strerror(errno));
> +		ksft_exit_fail_perror("memfd_create");
> +
> +	passed = test_mmap(length, mmap_flags, fd, test_name);
> +
> +	close(fd);
> +	ksft_test_result(passed, "%s\n", test_name);
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	size_t hugepage_size;
> +	size_t length = LENGTH;
> +	int shift = 0;
> +
> +	ksft_print_header();
> +	ksft_set_plan(2);
> +
> +	if (argc > 1)
> +		length = atol(argv[1]) << 20;
> +	if (argc > 2)
> +		shift = atoi(argv[2]);
>   
> -	addr = mmap(NULL, LENGTH, PROTECTION, MAP_SHARED, fd, 0);
> -	if (addr == MAP_FAILED) {
> -		close(fd);
> -		ksft_exit_fail_msg("mmap(): %s\n", strerror(errno));
> +	if (shift) {
> +		hugepage_size = (1 << shift);
> +		ksft_print_msg("%u kB hugepages\n", 1 << (shift - 10));
> +	} else {
> +		hugepage_size = default_huge_page_size();
> +		ksft_print_msg("Default size hugepages (%lu Kb)\n", hugepage_size >> 10);
>   	}
>   
> -	ksft_print_msg("Returned address is %p\n", addr);
> -	check_bytes(addr);
> -	write_bytes(addr);
> -	ret = read_bytes(addr);
> +	/* munmap will fail if the length is not page aligned */
> +	if (hugepage_size > length)
> +		length = hugepage_size;
>   
> -	munmap(addr, LENGTH);
> -	close(fd);
> +	ksft_print_msg("Mapping %lu Mbytes\n", (unsigned long)length >> 20);
>   
> -	ksft_test_result(!ret, "Read same data\n");
> +	test_anon_mmap(length, shift);
> +	test_file_mmap(length, shift);
>   
> -	ksft_exit(!ret);
> +	ksft_finished();
>   }
> diff --git a/tools/testing/selftests/mm/map_hugetlb.c b/tools/testing/selftests/mm/map_hugetlb.c
> deleted file mode 100644
> index aa409107611b..000000000000
> --- a/tools/testing/selftests/mm/map_hugetlb.c
> +++ /dev/null
> @@ -1,88 +0,0 @@
> -// SPDX-License-Identifier: GPL-2.0
> -/*
> - * Example of using hugepage memory in a user application using the mmap
> - * system call with MAP_HUGETLB flag.  Before running this program make
> - * sure the administrator has allocated enough default sized huge pages
> - * to cover the 256 MB allocation.
> - */
> -#include <stdlib.h>
> -#include <stdio.h>
> -#include <unistd.h>
> -#include <sys/mman.h>
> -#include <fcntl.h>
> -#include "vm_util.h"
> -#include "kselftest.h"
> -
> -#define LENGTH (256UL*1024*1024)
> -#define PROTECTION (PROT_READ | PROT_WRITE)
> -
> -static void check_bytes(char *addr)
> -{
> -	ksft_print_msg("First hex is %x\n", *((unsigned int *)addr));
> -}
> -
> -static void write_bytes(char *addr, size_t length)
> -{
> -	unsigned long i;
> -
> -	for (i = 0; i < length; i++)
> -		*(addr + i) = (char)i;
> -}
> -
> -static void read_bytes(char *addr, size_t length)
> -{
> -	unsigned long i;
> -
> -	check_bytes(addr);
> -	for (i = 0; i < length; i++)
> -		if (*(addr + i) != (char)i)
> -			ksft_exit_fail_msg("Mismatch at %lu\n", i);
> -
> -	ksft_test_result_pass("Read correct data\n");
> -}
> -
> -int main(int argc, char **argv)
> -{
> -	void *addr;
> -	size_t hugepage_size;
> -	size_t length = LENGTH;
> -	int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB;
> -	int shift = 0;
> -
> -	hugepage_size = default_huge_page_size();
> -	/* munmap with fail if the length is not page aligned */
> -	if (hugepage_size > length)
> -		length = hugepage_size;
> -
> -	ksft_print_header();
> -	ksft_set_plan(1);
> -
> -	if (argc > 1)
> -		length = atol(argv[1]) << 20;
> -	if (argc > 2) {
> -		shift = atoi(argv[2]);
> -		if (shift)
> -			flags |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
> -	}
> -
> -	if (shift)
> -		ksft_print_msg("%u kB hugepages\n", 1 << (shift - 10));
> -	else
> -		ksft_print_msg("Default size hugepages\n");
> -	ksft_print_msg("Mapping %lu Mbytes\n", (unsigned long)length >> 20);
> -
> -	addr = mmap(NULL, length, PROTECTION, flags, -1, 0);
> -	if (addr == MAP_FAILED)
> -		ksft_exit_fail_msg("mmap: %s\n", strerror(errno));
> -
> -	ksft_print_msg("Returned address is %p\n", addr);
> -	check_bytes(addr);
> -	write_bytes(addr, length);
> -	read_bytes(addr, length);
> -
> -	/* munmap() length of MAP_HUGETLB memory must be hugepage aligned */
> -	if (munmap(addr, length))
> -		ksft_exit_fail_msg("munmap: %s\n", strerror(errno));
> -
> -	ksft_finished();
> -}
> diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
> index e2dc9ac87bfc..61b450032af8 100755
> --- a/tools/testing/selftests/mm/run_vmtests.sh
> +++ b/tools/testing/selftests/mm/run_vmtests.sh
> @@ -292,7 +292,6 @@ CATEGORY="hugetlb" run_test ./hugepage-shm
>   echo "$shmmax" > /proc/sys/kernel/shmmax
>   echo "$shmall" > /proc/sys/kernel/shmall
>   
> -CATEGORY="hugetlb" run_test ./map_hugetlb
>   CATEGORY="hugetlb" run_test ./hugepage-mremap
>   CATEGORY="hugetlb" run_test ./hugepage-vmemmap
>   CATEGORY="hugetlb" run_test ./hugetlb-madvise


* Re: [PATCH 00/53] selftests/mm: make MM selftests more CI friendly
From: Mike Rapoport @ 2026-04-20  9:19 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David Hildenbrand, Baolin Wang, Barry Song,
	Dev Jain, Jason Gunthorpe, John Hubbard, Liam R. Howlett,
	Lance Yang, Leon Romanovsky, Lorenzo Stoakes, Mark Brown,
	Michal Hocko, Nico Pache, Peter Xu, Shuah Khan,
	Suren Baghdasaryan, Vlastimil Babka, Zi Yan, linux-kernel,
	linux-kselftest, linux-mm
In-Reply-To: <ee6bbac9-b375-4413-a771-6d32c7afda67@arm.com>

Hi Ryan,

On Mon, Apr 20, 2026 at 09:37:03AM +0100, Ryan Roberts wrote:
> On 06/04/2026 15:16, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > Hi,
> > 
> > There's a lot of dancing around HugeTLB settings in run_vmtests.sh.
> > Some tests need just a few default huge pages, some require at least 256 MB, and
> > some just skip lots of tests if huge pages of all supported sizes are not
> > available.
> > 
> > The goal of this set is to make tests deal with HugeTLB setup and teardown.
> 
> Hi Mike,
> 
> I haven't had a chance to review this series properly, but the intent certainly
> seems extremely valuable!
> 
> I thought I'd share some configuration magic that I always use when running on
> arm64. Apologies if I'm teaching you to suck eggs...
> 
> arm64 supports multiple hugetlb sizes and (at least in the past) the magic in
> run_vmtests.sh only reserves for the default size. As a consequence, whenever I
> run these tests on arm64, I always boot with:
> 
> hugepagesz=1G hugepages=0:2,1:2
> hugepagesz=32M hugepages=0:2,1:2
> default_hugepagesz=2M hugepages=0:64,1:64
> hugepagesz=64K hugepages=0:2,1:2
> 
> Which reserves 2 pages of each supported non-default size in each of 2 NUMA
> nodes, and 64 of the default size in each NUMA node. (This would need adjusting
> if using a different base page size).
> 
> My recollection is that this effectively overrides what the script was doing and
> is sufficient to make all hugetlb tests run for all hugepage sizes.

My goal is to let the tests themselves set up the right hugetlb configuration
without forcing it either on the command line or in the wrapper scripts.

On x86 I can run all the tests in a virtme-ng VM with two nodes and no
kernel command line overrides. I suppose that should work on arm64 too.

There are some additional settings that such a VM would need to avoid
skipping tests that presume swap or a real filesystem, but that's more of
a virtme-ng limitation.

> If it's possible to get this non-default hugepage size reservation logic into
> the tests themselves, this will make the mm selftests much easier to run on
> arm64 with full coverage.

That's what the second half of the series does. E.g. for the cow tests:

https://lore.kernel.org/linux-mm/ee6bbac9-b375-4413-a771-6d32c7afda67@arm.com/T/#m62f23b835061449bc6249afacf993bb32ea11234
 
> Another observation is that "secretmem.enable" is currently needed on the
> cmdline to enable secretmem so that the associated tests run. Not sure what can
> be done to make that simpler?

Looks like you're testing really old kernels :)
The secretmem default changed to "enabled" from 6.5 ;-)
 
> And there are tests that depend on having more than 1 NUMA node; I always run
> under QEMU with 2 emulated NUMA nodes. I guess that's really just a property of
> the HW, so nothing to be done from the test harness.

Right, the test harness can't do much about it. Either run in a virtual
machine with 2 (or more) nodes or enable NUMA emulation in the kernel
configuration and on the kernel command line.

> Thanks,
> Ryan

-- 
Sincerely yours,
Mike.


* Re: [PATCH 02/53] selftests/mm: khugepaged: enable collapse_single_pte_entry_compound for shmem
From: Mike Rapoport @ 2026-04-20  9:04 UTC (permalink / raw)
  To: Sarthak Sharma
  Cc: Andrew Morton, David Hildenbrand, Baolin Wang, Barry Song,
	Dev Jain, Jason Gunthorpe, John Hubbard, Liam R. Howlett,
	Lance Yang, Leon Romanovsky, Lorenzo Stoakes, Mark Brown,
	Michal Hocko, Nico Pache, Peter Xu, Ryan Roberts, Shuah Khan,
	Suren Baghdasaryan, Vlastimil Babka, Zi Yan, linux-kernel,
	linux-kselftest, linux-mm
In-Reply-To: <a44aa2d7-5eb9-468a-a0ee-ee10b2b73502@arm.com>

On Mon, Apr 20, 2026 at 12:20:43PM +0530, Sarthak Sharma wrote:
> Hi Mike
> 
> On 4/6/26 7:46 PM, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > A comment in collapse_single_pte_entry_compound() says it can't run on
> > shmem because "MADV_DONTNEED can't evict tmpfs pages".
> > But MADV_REMOVE can!
> > 
> > Use MADV_REMOVE for tmpfs to evict pages and enable
> > collapse_single_pte_entry_compound() test for shmem.
> > 
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> >  tools/testing/selftests/mm/khugepaged.c | 14 ++++++--------
> >  1 file changed, 6 insertions(+), 8 deletions(-)
> > 
> > diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
> > index 3fe7ef04ac62..e6fb01ca44ed 100644
> > --- a/tools/testing/selftests/mm/khugepaged.c
> > +++ b/tools/testing/selftests/mm/khugepaged.c
> > @@ -783,20 +783,17 @@ static void collapse_max_ptes_swap(struct collapse_context *c, struct mem_ops *o
> >  
> >  static void collapse_single_pte_entry_compound(struct collapse_context *c, struct mem_ops *ops)
> >  {
> > +	int advise = MADV_DONTNEED;
> >  	void *p;
> >  
> >  	p = alloc_hpage(ops);
> >  
> > -	if (is_tmpfs(ops)) {
> > -		/* MADV_DONTNEED won't evict tmpfs pages */
> > -		printf("tmpfs...");
> > -		skip("Skip");
> > -		goto skip;
> > -	}
> > +	if (is_tmpfs(ops))
> > +		advise = MADV_REMOVE;
> 
> is_tmpfs(ops) will always return false for shmem_ops, since the function
> definition does not handle the shmem_ops case. Therefore, this advise
> will always remain as MADV_DONTNEED.

is_tmpfs() will be true for file ops when the file is on tmpfs.
 
> Also, I am able to run the shmem tests using MADV_DONTNEED and the tests
> succeed. So, perhaps we don't need to use MADV_REMOVE in this case?

I read the comment and changed the advice to the one that will actually
evict the pages from the page cache.
Looking closely at this and the collapse_max_ptes_none() test, I don't think
the presence of the pages in the page cache would affect splitting and
collapsing, so apparently the skips for the is_tmpfs() cases can just be
dropped.
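
The difference between the two advice values discussed above can be observed with a small standalone sketch. It is not taken from khugepaged.c: byte_after_madvise() is a hypothetical helper that uses a memfd as a stand-in for a tmpfs/shmem file. It writes a marker byte through a shared mapping, applies the advice and reads the byte back.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical demo: write a marker byte to a shared memfd (shmem) mapping,
 * apply `advice` to the page and return the byte read back afterwards,
 * or -1 if any step fails.
 */
static int byte_after_madvise(int advice)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	int fd, val = -1;

	assert(page > 0);

	fd = memfd_create("madv-demo", 0);
	if (fd < 0)
		return -1;

	if (ftruncate(fd, page) == 0) {
		p = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p != MAP_FAILED) {
			p[0] = 0x5a;
			/* Read back only if the kernel accepted the advice. */
			if (madvise(p, page, advice) == 0)
				val = p[0];
			munmap(p, page);
		}
	}
	close(fd);
	return val;
}
```

On a shared shmem mapping, MADV_DONTNEED only drops the page table entries, so the next access finds the page in the page cache and the marker is still there. MADV_REMOVE punches a hole in the backing file, so the byte reads back as zero.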
 
> >  
> >  	madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
> >  	printf("Split huge page leaving single PTE mapping compound page...");
> > -	madvise(p + page_size, hpage_pmd_size - page_size, MADV_DONTNEED);
> > +	madvise(p + page_size, hpage_pmd_size - page_size, advise);
> >  	if (ops->check_huge(p, 0))
> >  		success("OK");
> >  	else
> > @@ -805,7 +802,6 @@ static void collapse_single_pte_entry_compound(struct collapse_context *c, struc
> >  	c->collapse("Collapse PTE table with single PTE mapping compound page",
> >  		    p, 1, ops, true);
> >  	validate_memory(p, 0, page_size);
> > -skip:
> >  	ops->cleanup_area(p, hpage_pmd_size);
> >  }
> >  
> > @@ -1251,8 +1247,10 @@ int main(int argc, char **argv)
> >  
> >  	TEST(collapse_single_pte_entry_compound, khugepaged_context, anon_ops);
> >  	TEST(collapse_single_pte_entry_compound, khugepaged_context, file_ops);
> > +	TEST(collapse_single_pte_entry_compound, khugepaged_context, shmem_ops);
> >  	TEST(collapse_single_pte_entry_compound, madvise_context, anon_ops);
> >  	TEST(collapse_single_pte_entry_compound, madvise_context, file_ops);
> > +	TEST(collapse_single_pte_entry_compound, madvise_context, shmem_ops);
> >  
> >  	TEST(collapse_full_of_compound, khugepaged_context, anon_ops);
> >  	TEST(collapse_full_of_compound, khugepaged_context, file_ops);
> 

-- 
Sincerely yours,
Mike.


* [PATCH] selftests: harness: Restore order of test functions
From: Thomas Weißschuh (Schneider Electric) @ 2026-04-20  8:44 UTC (permalink / raw)
  To: Kees Cook, Andy Lutomirski, Will Drewry, Shuah Khan, Sun Jian,
	Jakub Kicinski
  Cc: linux-kselftest, linux-kernel, Thomas Weißschuh

The recent addition of explicit constructor orders for fixture tests
broke the ordering of those relative to non-fixture tests and the
reverse-constructor-order detection.

Restore the ordering of the test functions relative to each other by
using the same explicit test order for all test registrations and
__constructor_order_first().

Rename the constant, as it is not specific to TEST_F() anymore.

Fixes: 6be268151426 ("selftests/harness: order TEST_F and XFAIL_ADD constructors")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
The harness selftest flags this issue, but apparently that was not used.
---
 tools/testing/selftests/kselftest_harness.h | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/kselftest_harness.h b/tools/testing/selftests/kselftest_harness.h
index 75fb016cd190..eaa659acb3b4 100644
--- a/tools/testing/selftests/kselftest_harness.h
+++ b/tools/testing/selftests/kselftest_harness.h
@@ -76,7 +76,7 @@ static inline void __kselftest_memset_safe(void *s, int c, size_t n)
 		memset(s, c, n);
 }
 
-#define KSELFTEST_PRIO_TEST_F  20000
+#define KSELFTEST_PRIO_TEST    20000
 #define KSELFTEST_PRIO_XFAIL   20001
 
 #define TEST_TIMEOUT_DEFAULT 30
@@ -194,7 +194,7 @@ static inline void __kselftest_memset_safe(void *s, int c, size_t n)
 		  .fixture = &_fixture_global, \
 		  .termsig = _signal, \
 		  .timeout = TEST_TIMEOUT_DEFAULT, }; \
-	static void __attribute__((constructor)) _register_##test_name(void) \
+	static void __attribute__((constructor(KSELFTEST_PRIO_TEST))) _register_##test_name(void) \
 	{ \
 		__register_test(&_##test_name##_object); \
 	} \
@@ -238,7 +238,7 @@ static inline void __kselftest_memset_safe(void *s, int c, size_t n)
 	FIXTURE_VARIANT(fixture_name); \
 	static struct __fixture_metadata _##fixture_name##_fixture_object = \
 		{ .name =  #fixture_name, }; \
-	static void __attribute__((constructor)) \
+	static void __attribute__((constructor(KSELFTEST_PRIO_TEST))) \
 	_register_##fixture_name##_data(void) \
 	{ \
 		__register_fixture(&_##fixture_name##_fixture_object); \
@@ -364,7 +364,7 @@ static inline void __kselftest_memset_safe(void *s, int c, size_t n)
 		_##fixture_name##_##variant_name##_object = \
 		{ .name = #variant_name, \
 		  .data = &_##fixture_name##_##variant_name##_variant}; \
-	static void __attribute__((constructor)) \
+	static void __attribute__((constructor(KSELFTEST_PRIO_TEST))) \
 		_register_##fixture_name##_##variant_name(void) \
 	{ \
 		__register_fixture_variant(&_##fixture_name##_fixture_object, \
@@ -468,7 +468,7 @@ static inline void __kselftest_memset_safe(void *s, int c, size_t n)
 			fixture_name##_teardown(_metadata, self, variant); \
 	} \
 	static struct __test_metadata *_##fixture_name##_##test_name##_object; \
-	static void __attribute__((constructor(KSELFTEST_PRIO_TEST_F))) \
+	static void __attribute__((constructor(KSELFTEST_PRIO_TEST))) \
 			_register_##fixture_name##_##test_name(void) \
 	{ \
 		struct __test_metadata *object = mmap(NULL, sizeof(*object), \
@@ -1323,7 +1323,7 @@ static int test_harness_run(int argc, char **argv)
 	return KSFT_FAIL;
 }
 
-static void __attribute__((constructor)) __constructor_order_first(void)
+static void __attribute__((constructor(KSELFTEST_PRIO_TEST))) __constructor_order_first(void)
 {
 	__constructor_order_forward = true;
 }

---
base-commit: c1f49dea2b8f335813d3b348fd39117fb8efb428
change-id: 20260420-kselftests-harness-order-9f641f204bde

Best regards,
--  
Thomas Weißschuh <thomas.weissschuh@linutronix.de>
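
The ordering problem described in the commit message can be reproduced with a few lines of standalone C. This is a sketch rather than harness code: the PRIO_* constants and the record() helper are made up for illustration. With GCC or Clang on Linux, constructors that carry an explicit priority run in ascending priority order, and all of them run before constructors declared without a priority, regardless of the order in which they appear in the source.

```c
#include <assert.h>

/* Illustrative priorities; values up to 100 are reserved for the toolchain. */
#define PRIO_FIRST  20000
#define PRIO_SECOND 20001

static int ctor_order[3];
static int ctor_count;

/* Remember which constructor ran in which position. */
static void record(int tag)
{
	assert(ctor_count < 3);
	ctor_order[ctor_count++] = tag;
}

/* No explicit priority: runs after every prioritized constructor. */
static void __attribute__((constructor)) ctor_default(void)
{
	record(0);
}

/* Declared before ctor_first, but its larger priority makes it run later. */
static void __attribute__((constructor(PRIO_SECOND))) ctor_second(void)
{
	record(PRIO_SECOND);
}

static void __attribute__((constructor(PRIO_FIRST))) ctor_first(void)
{
	record(PRIO_FIRST);
}
```

By the time main() starts, ctor_order holds PRIO_FIRST, PRIO_SECOND and then 0. This is why giving a priority to only some of the registration constructors moved them ahead of the unprioritized ones, and why using one shared priority for all registrations restores their relative order.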



* Re: [PATCH 00/53] selftests/mm: make MM selftests more CI friendly
From: Ryan Roberts @ 2026-04-20  8:37 UTC (permalink / raw)
  To: Mike Rapoport, Andrew Morton, David Hildenbrand
  Cc: Baolin Wang, Barry Song, Dev Jain, Jason Gunthorpe, John Hubbard,
	Liam R. Howlett, Lance Yang, Leon Romanovsky, Lorenzo Stoakes,
	Mark Brown, Michal Hocko, Nico Pache, Peter Xu, Shuah Khan,
	Suren Baghdasaryan, Vlastimil Babka, Zi Yan, linux-kernel,
	linux-kselftest, linux-mm
In-Reply-To: <20260406141735.2179309-1-rppt@kernel.org>

On 06/04/2026 15:16, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> Hi,
> 
> There's a lot of dancing around HugeTLB settings in run_vmtests.sh.
> Some tests need just a few default huge pages, some require at least 256 MB, and
> some just skip lots of tests if huge pages of all supported sizes are not
> available.
> 
> The goal of this set is to make tests deal with HugeTLB setup and teardown.

Hi Mike,

I haven't had a chance to review this series properly, but the intent certainly
seems extremely valuable!

I thought I'd share some configuration magic that I always use when running on
arm64. Apologies if I'm teaching you to suck eggs...

arm64 supports multiple hugetlb sizes and (at least in the past) the magic in
run_vmtests.sh only reserves for the default size. As a consequence, whenever I
run these tests on arm64, I always boot with:

hugepagesz=1G hugepages=0:2,1:2
hugepagesz=32M hugepages=0:2,1:2
default_hugepagesz=2M hugepages=0:64,1:64
hugepagesz=64K hugepages=0:2,1:2

Which reserves 2 pages of each supported non-default size in each of 2 NUMA
nodes, and 64 of the default size in each NUMA node. (This would need adjusting
if using a different base page size).

My recollection is that this effectively overrides what the script was doing and
is sufficient to make all hugetlb tests run for all hugepage sizes.
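
For reference, the per-node, per-size pools that this command line reserves are also visible at run time in sysfs, which is what a test would have to look at to set them up itself. The sketch below only illustrates that lookup: node_hugepages_path() and read_node_hugepages() are hypothetical helpers, not the APIs added by the series. Sizes are given in kB, so with a 4K base page size the four sizes above are 1048576, 32768, 2048 and 64.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: format the sysfs path of the per-node pool for one
 * huge page size. Returns 0 on success, -1 if buf is too small.
 */
static int node_hugepages_path(char *buf, size_t len, int node,
			       unsigned long size_kb)
{
	int n;

	assert(buf && len > 0);
	n = snprintf(buf, len,
		     "/sys/devices/system/node/node%d/hugepages/hugepages-%lukB/nr_hugepages",
		     node, size_kb);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read the current pool size; returns -1 if the node or size is absent. */
static long read_node_hugepages(int node, unsigned long size_kb)
{
	char path[128];
	long val = -1;
	FILE *f;

	if (node_hugepages_path(path, sizeof(path), node, size_kb))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);

	return val;
}
```

For example, read_node_hugepages(1, 32768) would report how many 32M pages are currently reserved on node 1, or -1 on a system without that node or size.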

If it's possible to get this non-default hugepage size reservation logic into
the tests themselves, this will make the mm selftests much easier to run on
arm64 with full coverage.

Another observation is that "secretmem.enable" is currently needed on the
cmdline to enable secretmem so that the associated tests run. Not sure what can
be done to make that simpler?

And there are tests that depend on having more than 1 NUMA node; I always run
under QEMU with 2 emulated NUMA nodes. I guess that's really just a property of
the HW, so nothing to be done from the test harness.

Thanks,
Ryan


> 
> There are already convenient helpers that allow easy reading and writing of
> /proc and sysfs files, so adding a few APIs that will detect and update HugeTLB
> settings shouldn't be a big deal. But these nice helpers use the kselftest
> framework, and many of the HugeTLB (and even THP) tests don't, so as a result
> this patchset also includes a lot of churn for conversion of those tests to
> the kselftest framework (patches 7-19).
> 
> And there were a few small things I fixed on the way.
> 
> I don't mean for this set to land in 7.1-rc1, except perhaps the small fixes in
> the beginning of the series (patches 1-4).
> 
> But after staring at this code for some time I realized that I won't spot any
> new issues in these patches and an extra pair(s) of eyes would be helpful.
> 
> The extension of thp_settings to hugepage_settings, which also includes the
> HugeTLB helpers, is implemented by patches 20-26.
> 
> Patches 27-28 are there to allow setting up SHM limits in hugetlb-shm and
> thuge-gen tests.
> 
> Patches 29-51 integrate the new APIs in all the tests that use HugeTLB.
> 
> And finally, patches 52-53 drop the HugeTLB setup from run_vmtests.sh.
> 
> Happy Easter to those who celebrate!
> 
> Mike Rapoport (Microsoft) (53):
>   selftests/mm: hugetlb-read-hwpoison: add SIGBUS handler
>   selftests/mm: khugepaged: enable collapse_single_pte_entry_compound for shmem
>   selftests/mm: migration: don't assume hupe page is TWOMEG
>   selftests/mm: run_vmtests.sh: don't gate THP and KSM tests on HAVE_HUGEPAGES
>   selftests/mm: merge map_hugetlb into hugepage-mmap
>   selftests/mm: rename hugepage-* tests to hugetlb-*
>   selftests/mm: hugetlb-shm: use kselftest framework
>   selftests/mm: hugetlb-vmemmap: use kselftest framework
>   selftests/mm: hugetlb-madvise: use kselftest framework
>   selftests/mm: hugetlb_madv_vs_map: use kselftest framework
>   selftests/mm: hugetlb-read-hwpoison: use kselftest framework
>   selftests/mm: khugepaged: group tests in an array
>   selftests/mm: khugepaged: use ksefltest framework
>   selftests/mm: ksm_tests: use kselftest framework
>   selftests/mm: protection_keys: use descriptive test names in TAP output
>   selftests/mm: protection_keys: use kselftest framework
>   selftests/mm: uffd-stress: use kselftest framework
>   selftests/mm: uffd-unit-tests: use kselftest framework
>   selftests/mm: va_high_addr_switch: use kselftest framework
>   selftests/mm: add atexit() and signal handlers to thp_settings
>   selftests/mm: rename thp_settings.[ch] to  hugepage_settings.[ch]
>   selftests/mm: move HugeTLB helpers to hugepage_settings
>   selftests/mm: hugepage_settings: use unsigned long in detect_hugetlb_page_size
>   selftests/mm: hugepage_settings: add APIs to get and set nr_hugepages
>   selftests/mm: hugepage_settings: rename get_free_hugepages()
>   selftests/mm: hugepage_settings: add APIs for HugeTLB setup and teardown
>   selftests/mm: move read_file(), read_num() and write_num() to vm_util
>   selftests/mm: vm_util: add helpers to set and restore shm limits
>   selftests/mm: compaction_test: use HugeTLB helpers ...
>   selftests/mm: cow: add setup of HugeTLB pages
>   selftests/mm: gup_longterm: add setup of HugeTLB pages
>   selftests/mm: gup_test: add setup of HugeTLB pages
>   selftests/mm: hmm-tests: add setup of HugeTLB pages
>   selftests/mm: hugepage_dio: add setup of HugeTLB pages
>   selftests/mm: hugetlb_fault_after_madv: add setup of HugeTLB pages
>   selftests/mm: hugetlb-madvise: add setup of HugeTLB pages
>   selftests/mm: hugetlb_madv_vs_map: add setup of HugeTLB pages
>   selftests/mm: hugetlb-mmap: add setup of HugeTLB pages
>   selftests/mm: hugetlb-mremap: add setup of HugeTLB pages
>   selftests/mm: hugetlb-shm: add setup of HugeTLB pages
>   selftests/mm: hugetlb-soft-online: add setup of HugeTLB pages
>   selftests/mm: hugetlb-vmemmap: add setup of HugeTLB pages
>   selftests/mm: migration: add setup of HugeTLB pages
>   selftests/mm: pagemap_ioctl: add setup of HugeTLB pages
>   selftests/mm: protection_keys: use library code for HugeTLB setup
>   selftests/mm: thuge-gen: add setup of HugeTLB pages
>   selftests/mm: uffd-stress: use hugetlb_save and alloc huge pages
>   selftests/mm: uffd-unit-tests: add setup of HugeTLB pages
>   selftests/mm: uffd-wp-mremap: add setup of HugeTLB pages
>   selftests/mm: va_high_addr_switch: add setup of HugeTLB pages
>   selftests/mm: va_high_addr_switch.sh: drop huge pages setup
>   selftests/mm: run_vmtests.sh: free memory if available memory is low
>   selftests/mm: run_vmtests.sh: drop detection and setup of HugeTLB
> 
>  tools/testing/selftests/mm/.gitignore         |   4 +
>  tools/testing/selftests/mm/Makefile           |  13 +-
>  tools/testing/selftests/mm/compaction_test.c  | 113 +----
>  tools/testing/selftests/mm/cow.c              |  29 +-
>  .../selftests/mm/folio_split_race_test.c      |   2 +-
>  tools/testing/selftests/mm/guard-regions.c    |   2 +-
>  tools/testing/selftests/mm/gup_longterm.c     |   3 +-
>  tools/testing/selftests/mm/gup_test.c         |  12 +
>  tools/testing/selftests/mm/hmm-tests.c        |  24 +-
>  tools/testing/selftests/mm/hugepage-mmap.c    |  78 ----
>  .../{thp_settings.c => hugepage_settings.c}   | 284 +++++++++++--
>  .../{thp_settings.h => hugepage_settings.h}   |  75 +++-
>  tools/testing/selftests/mm/hugetlb-madvise.c  | 209 ++++------
>  tools/testing/selftests/mm/hugetlb-mmap.c     | 141 +++++++
>  .../{hugepage-mremap.c => hugetlb-mremap.c}   |  13 +-
>  .../selftests/mm/hugetlb-read-hwpoison.c      | 123 +++---
>  .../mm/{hugepage-shm.c => hugetlb-shm.c}      |  65 ++-
>  .../selftests/mm/hugetlb-soft-offline.c       |  45 +-
>  .../{hugepage-vmemmap.c => hugetlb-vmemmap.c} |  46 +-
>  tools/testing/selftests/mm/hugetlb_dio.c      |  15 +-
>  .../selftests/mm/hugetlb_fault_after_madv.c   |   7 +-
>  .../selftests/mm/hugetlb_madv_vs_map.c        |  22 +-
>  tools/testing/selftests/mm/khugepaged.c       | 394 ++++++++----------
>  tools/testing/selftests/mm/ksm_tests.c        | 182 ++++----
>  tools/testing/selftests/mm/map_hugetlb.c      |  88 ----
>  tools/testing/selftests/mm/migration.c        |  54 ++-
>  tools/testing/selftests/mm/pagemap_ioctl.c    |  13 +-
>  tools/testing/selftests/mm/pkey-helpers.h     |   6 +-
>  .../testing/selftests/mm/prctl_thp_disable.c  |   2 +-
>  tools/testing/selftests/mm/protection_keys.c  | 131 +++---
>  tools/testing/selftests/mm/run_vmtests.sh     | 177 ++------
>  tools/testing/selftests/mm/soft-dirty.c       |   2 +-
>  .../selftests/mm/split_huge_page_test.c       |   2 +-
>  tools/testing/selftests/mm/thuge-gen.c        |  80 +---
>  tools/testing/selftests/mm/transhuge-stress.c |   2 +-
>  tools/testing/selftests/mm/uffd-common.h      |   1 +
>  tools/testing/selftests/mm/uffd-stress.c      |  44 +-
>  tools/testing/selftests/mm/uffd-unit-tests.c  | 110 +++--
>  tools/testing/selftests/mm/uffd-wp-mremap.c   |  12 +-
>  .../selftests/mm/va_high_addr_switch.c        |  40 +-
>  .../selftests/mm/va_high_addr_switch.sh       |  39 +-
>  tools/testing/selftests/mm/vm_util.c          | 133 +++---
>  tools/testing/selftests/mm/vm_util.h          |  15 +-
>  43 files changed, 1377 insertions(+), 1475 deletions(-)
>  delete mode 100644 tools/testing/selftests/mm/hugepage-mmap.c
>  rename tools/testing/selftests/mm/{thp_settings.c => hugepage_settings.c} (60%)
>  rename tools/testing/selftests/mm/{thp_settings.h => hugepage_settings.h} (55%)
>  create mode 100644 tools/testing/selftests/mm/hugetlb-mmap.c
>  rename tools/testing/selftests/mm/{hugepage-mremap.c => hugetlb-mremap.c} (94%)
>  rename tools/testing/selftests/mm/{hugepage-shm.c => hugetlb-shm.c} (56%)
>  rename tools/testing/selftests/mm/{hugepage-vmemmap.c => hugetlb-vmemmap.c} (76%)
>  delete mode 100644 tools/testing/selftests/mm/map_hugetlb.c
> 
> 
> base-commit: 9a5c21a0791faf7967feea87f8f345419330bd2f
> --
> 2.53.0


^ permalink raw reply

* Re: [PATCH v7 3/4] KVM: arm64: PMU: Introduce FIXED_COUNTERS_ONLY
From: Akihiko Odaki @ 2026-04-20  8:36 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Oliver Upton, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
	Catalin Marinas, Will Deacon, Kees Cook, Gustavo A. R. Silva,
	Paolo Bonzini, Jonathan Corbet, Shuah Khan, linux-arm-kernel,
	kvmarm, linux-kernel, linux-hardening, devel, kvm, linux-doc,
	linux-kselftest
In-Reply-To: <87ldeic1gk.wl-maz@kernel.org>

On 2026/04/20 2:19, Marc Zyngier wrote:
> On Sat, 18 Apr 2026 09:14:25 +0100,
> Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> wrote:
>>
>> On a heterogeneous arm64 system, KVM's PMU emulation is based on the
>> features of a single host PMU instance. When a vCPU is migrated to a
>> pCPU with an incompatible PMU, counters such as PMCCNTR_EL0 stop
>> incrementing.
>>
>> Although this behavior is permitted by the architecture, Windows does
>> not handle it gracefully and may crash with a division-by-zero error.
>>
>> The current workaround requires VMMs to pin vCPUs to a set of pCPUs
>> that share a compatible PMU. This is difficult to implement correctly in
>> QEMU/libvirt, where pinning occurs after vCPU initialization, and it
>> also restricts the guest to a subset of available pCPUs.
>>
>> Introduce the KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY attribute to
>> create a "fixed-counters-only" PMU. When set, KVM exposes a PMU that is
>> compatible with all pCPUs but that does not support programmable
>> event counters which may have different feature sets on different PMUs.
>>
>> This allows Windows guests to run reliably on heterogeneous systems
>> without crashing, even without vCPU pinning, and enables VMMs to
>> schedule vCPUs across all available pCPUs, making full use of the host
>> hardware.
>>
>> Much like KVM_ARM_VCPU_PMU_V3_IRQ and other read-write attributes, this
>> attribute provides a getter that facilitates kernel and userspace
>> debugging/testing.
> 
> OK, so that's the sales pitch. But how is it implemented? I would like
> to be able to read a high-level description of the implementation
> trade-offs.

The implementation is trivial. Essentially, the following addition to
kvm_arm_pmu_v3_get_attr() is the entire implementation:
+	case KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY:
+		if (test_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY, &vcpu->kvm->arch.flags))
+			return 0;

Both its functionality and its code complexity are trivial, so one can
argue either that:
- the functionality is too trivial to be useful, or
- the interface/implementation complexity is so trivial that it does not
   incur a maintenance burden

In this case the selftest uses the getter, so I was inclined to keep it.
However, adding a getter just for the selftest sounds too ad hoc, so I
looked into the other attributes to make sure that it does not introduce
an inconsistency with the existing interfaces.

As a result, I found that there are other read-write attributes; in fact,
there are more read-write attributes than write-only ones.

> 
>>
>> Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
>> ---
>>   Documentation/virt/kvm/devices/vcpu.rst |  29 ++++++
>>   arch/arm64/include/asm/kvm_host.h       |   2 +
>>   arch/arm64/include/uapi/asm/kvm.h       |   1 +
>>   arch/arm64/kvm/arm.c                    |   1 +
>>   arch/arm64/kvm/pmu-emul.c               | 155 +++++++++++++++++++++++---------
>>   include/kvm/arm_pmu.h                   |   2 +
>>   6 files changed, 147 insertions(+), 43 deletions(-)
>>
>> diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
>> index 60bf205cb373..e0aeb1897d77 100644
>> --- a/Documentation/virt/kvm/devices/vcpu.rst
>> +++ b/Documentation/virt/kvm/devices/vcpu.rst
>> @@ -161,6 +161,35 @@ explicitly selected, or the number of counters is out of range for the
>>   selected PMU. Selecting a new PMU cancels the effect of setting this
>>   attribute.
>>   
>> +1.6 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY
>> +------------------------------------------------------
>> +
>> +:Parameters: no additional parameter in kvm_device_attr.addr
>> +
>> +:Returns:
>> +
>> +	 =======  =====================================================
>> +	 -EBUSY   Attempted to set after initializing PMUv3 or running
>> +		  VCPU, or attempted to set for the first time after
>> +		  setting an event filter
>> +	 -ENXIO   Attempted to get before setting
>> +	 -ENODEV  Attempted to set while PMUv3 not supported
>> +	 =======  =====================================================
>> +
>> +If set, PMUv3 will be emulated without programmable event counters. The VCPU
>> +will use any compatible hardware PMU. This attribute is particularly useful on
> 
> Not quite "any PMU". It will use *the* PMU of the physical CPU,
> irrespective of the implementation.

I think the following comments:

- this comment
- the one on the KVM_EXIT_FAIL_ENTRY_CPU_UNSUPPORTED note
- the one on kvm_pmu_create_perf_event()
- and the one on kvm_arm_pmu_v3_set_pmu_fixed_counters_only()

all boil down to one question: will it support all possible CPUs, or
only a subset? Let me answer here:

This patch is written to support a subset instead of all possible CPUs.
If a pCPU does not have a compatible PMU, the pCPU is not supported, and
running a vCPU on it causes KVM_EXIT_FAIL_ENTRY_CPU_UNSUPPORTED.

This patch does not enforce that all possible CPUs are covered by the
compatible PMUs. Theoretically speaking, kvm_arm_pmu_get_pmuver_limit()
enables the PMU emulation when real PMUv3 hardware covers all possible
CPUs *or* the relevant registers can be trapped with IMPDEF, so some
pCPUs may not have a compatible PMU and only provide the IMPDEF trapping.

Practically, I don't think any sane configuration will ever have such
subset-only support, so we can explicitly enforce that all possible CPUs
are covered by the compatible PMUs if desired.
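
To illustrate the subset case, here is a minimal userspace sketch. It is
not kernel code: the 64-bit mask type and both helper names are made up
for illustration, and it only models the idea that the supported set is
the union of each PMU's supported_cpus, which may be smaller than the set
of possible CPUs:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: one bit per pCPU, at most 64 pCPUs. */
typedef uint64_t cpumask_model_t;

/* Union of the CPUs covered by every registered PMU. */
static cpumask_model_t union_supported_cpus(const cpumask_model_t *pmu_masks,
					    size_t nr_pmus)
{
	cpumask_model_t supported = 0;

	for (size_t i = 0; i < nr_pmus; i++)
		supported |= pmu_masks[i];

	return supported;
}

/* True if every possible pCPU is covered by some compatible PMU. */
static bool covers_all_possible(cpumask_model_t supported,
				cpumask_model_t possible)
{
	return (possible & ~supported) == 0;
}
```

If enforcement is desired, a check along the lines of
covers_all_possible() could reject the attribute when it returns false.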

> 
>> +heterogeneous systems where different hardware PMUs cover different physical
>> +CPUs. The compatibility of hardware PMUs can be checked with
>> +KVM_ARM_VCPU_PMU_V3_SET_PMU. All VCPUs in a VM share this attribute. It isn't
>> +possible to set it for the first time if a PMU event filter is already present.
> 
> "for the first time" gives the impression that it will work if you try
> again. I'd rather we say that "This feature is incompatible with the
> existence of a PMU event filter".

The following sequence will work:
1. Set KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY
2. Set KVM_ARM_VCPU_PMU_V3_FILTER
3. Set KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY

This is to make the behavior consistent with KVM_ARM_VCPU_PMU_V3_SET_PMU.
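
As a rough sketch of why that sequence works, the following userspace
model mirrors the -EBUSY condition from the patch. The struct and the
function names are hypothetical stand-ins, and set_filter() is
deliberately simplified (the real filter attribute has its own checks):

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the VM-wide state consulted by the check. */
struct vm_model {
	bool ran_once;            /* a vCPU has already run */
	bool has_filter;          /* a PMU event filter has been set */
	bool fixed_counters_only; /* the new attribute has been set */
};

/* Mirrors the -EBUSY condition of the setter in the patch. */
static int set_fixed_counters_only(struct vm_model *vm)
{
	if (vm->ran_once || (vm->has_filter && !vm->fixed_counters_only))
		return -EBUSY;

	vm->fixed_counters_only = true;
	return 0;
}

/* Simplified: only records that a filter now exists. */
static void set_filter(struct vm_model *vm)
{
	vm->has_filter = true;
}
```

Setting the attribute, then a filter, then the attribute again succeeds
in this model, while setting the filter first makes the attribute fail
with -EBUSY.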

> 
> Furthermore, the architecture currently describes *two* fixed-function
> counters (cycles and instructions), while KVM only expose the cycle
> counter. I'm all for the extra abstraction, but what does it mean for
> migration if we enable FEAT_PMUv3_ICNTR?

I'll answer this at the end of this email.

> 
>> +
>> +Note that KVM will not make any attempts to run the VCPU on the physical CPUs
>> +with compatible hardware PMUs. This is entirely left to userspace. However,
>> +attempting to run the VCPU on an unsupported CPU will fail and KVM_RUN will
>> +return with exit_reason = KVM_EXIT_FAIL_ENTRY and populate the fail_entry struct
>> +by setting hardware_entry_failure_reason field to
>> +KVM_EXIT_FAIL_ENTRY_CPU_UNSUPPORTED and the cpu field to the processor id.
>> +
> 
> This is mostly a copy-paste of the previous section. How relevant is
> this to the fixed-counters-only feature? If the whole point of this
> stuff is to ensure compatibility across CPUs with different PMU
> implementations, surely what you describe here is the opposite of what
> you want.

Please see the earlier discussion of supported pCPUs.

> 
> My preference would be to move this to a separate patch in any case,
> more on that below.

I will do so with the next version.

> 
>>   2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
>>   =================================
>>   
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 59f25b85be2b..b59e0182472c 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -353,6 +353,8 @@ struct kvm_arch {
>>   #define KVM_ARCH_FLAG_WRITABLE_IMP_ID_REGS		10
>>   	/* Unhandled SEAs are taken to userspace */
>>   #define KVM_ARCH_FLAG_EXIT_SEA				11
>> +	/* PMUv3 is emulated without progammable event counters */
>> +#define KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY	12
>>   	unsigned long flags;
>>   
>>   	/* VM-wide vCPU feature set */
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>> index a792a599b9d6..474c84fa757f 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -436,6 +436,7 @@ enum {
>>   #define   KVM_ARM_VCPU_PMU_V3_FILTER		2
>>   #define   KVM_ARM_VCPU_PMU_V3_SET_PMU		3
>>   #define   KVM_ARM_VCPU_PMU_V3_SET_NR_COUNTERS	4
>> +#define   KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY	5
>>   #define KVM_ARM_VCPU_TIMER_CTRL		1
>>   #define   KVM_ARM_VCPU_TIMER_IRQ_VTIMER		0
>>   #define   KVM_ARM_VCPU_TIMER_IRQ_PTIMER		1
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 620a465248d1..dca16ca26d32 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -634,6 +634,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>   	if (has_vhe())
>>   		kvm_vcpu_load_vhe(vcpu);
>>   	kvm_arch_vcpu_load_fp(vcpu);
>> +	kvm_vcpu_load_pmu(vcpu);
>>   	kvm_vcpu_pmu_restore_guest(vcpu);
>>   	if (kvm_arm_is_pvtime_enabled(&vcpu->arch))
>>   		kvm_make_request(KVM_REQ_RECORD_STEAL, vcpu);
>> diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
>> index ef5140bbfe28..d1009c144581 100644
>> --- a/arch/arm64/kvm/pmu-emul.c
>> +++ b/arch/arm64/kvm/pmu-emul.c
>> @@ -326,7 +326,10 @@ u64 kvm_pmu_implemented_counter_mask(struct kvm_vcpu *vcpu)
>>   
>>   static void kvm_pmc_enable_perf_event(struct kvm_pmc *pmc)
>>   {
>> -	if (!pmc->perf_event) {
>> +	struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc);
>> +
>> +	if (!pmc->perf_event ||
>> +	    !cpumask_test_cpu(vcpu->cpu, &to_arm_pmu(pmc->perf_event->pmu)->supported_cpus)) {
>>   		kvm_pmu_create_perf_event(pmc);
>>   		return;
>>   	}
>> @@ -667,10 +670,8 @@ static bool kvm_pmc_counts_at_el2(struct kvm_pmc *pmc)
>>   	return kvm_pmc_read_evtreg(pmc) & ARMV8_PMU_INCLUDE_EL2;
>>   }
>>   
>> -static int kvm_map_pmu_event(struct kvm *kvm, unsigned int eventsel)
>> +static int kvm_map_pmu_event(struct arm_pmu *pmu, unsigned int eventsel)
>>   {
>> -	struct arm_pmu *pmu = kvm->arch.arm_pmu;
>> -
>>   	/*
>>   	 * The CPU PMU likely isn't PMUv3; let the driver provide a mapping
>>   	 * for the guest's PMUv3 event ID.
> 
> This refactor should be in its own patch. This sort of minor change is
> adding noise to the mean of the patch, for no good reason.

I'll make that change with the next version too.

> 
>> @@ -681,6 +682,23 @@ static int kvm_map_pmu_event(struct kvm *kvm, unsigned int eventsel)
>>   	return eventsel;
>>   }
>>   
>> +static struct arm_pmu *kvm_pmu_probe_armpmu(int cpu)
>> +{
>> +	struct arm_pmu_entry *entry;
>> +	struct arm_pmu *pmu;
>> +
>> +	guard(rcu)();
>> +
>> +	list_for_each_entry_rcu(entry, &arm_pmus, entry) {
>> +		pmu = entry->arm_pmu;
>> +
>> +		if (cpumask_test_cpu(cpu, &pmu->supported_cpus))
>> +			return pmu;
>> +	}
>> +
>> +	return NULL;
>> +}
>> +
>>   /**
>>    * kvm_pmu_create_perf_event - create a perf event for a counter
>>    * @pmc: Counter context
>> @@ -694,6 +712,12 @@ static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc)
>>   	int eventsel;
>>   	u64 evtreg;
>>   
>> +	if (test_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY, &vcpu->kvm->arch.flags)) {
>> +		arm_pmu = kvm_pmu_probe_armpmu(vcpu->cpu);
>> +		if (!arm_pmu)
>> +			return;
> 
> How is it possible to not get a PMU here? We don't expose the PMU to a
> guest at all if there are CPUs without PMUs, see the comment in
> kvm_host_pmu_init(). So I'd expect this to never fail.

Please see the earlier comment.

> 
>> +	}
>> +
>>   	evtreg = kvm_pmc_read_evtreg(pmc);
>>   
>>   	kvm_pmu_stop_counter(pmc);
>> @@ -722,7 +746,7 @@ static void kvm_pmu_create_perf_event(struct kvm_pmc *pmc)
>>   	 * Don't create an event if we're running on hardware that requires
>>   	 * PMUv3 event translation and we couldn't find a valid mapping.
>>   	 */
>> -	eventsel = kvm_map_pmu_event(vcpu->kvm, eventsel);
>> +	eventsel = kvm_map_pmu_event(arm_pmu, eventsel);
>>   	if (eventsel < 0)
>>   		return;
>>   
>> @@ -810,42 +834,6 @@ void kvm_host_pmu_init(struct arm_pmu *pmu)
>>   	list_add_tail_rcu(&entry->entry, &arm_pmus);
>>   }
>>   
>> -static struct arm_pmu *kvm_pmu_probe_armpmu(void)
>> -{
>> -	struct arm_pmu_entry *entry;
>> -	struct arm_pmu *pmu;
>> -	int cpu;
>> -
>> -	guard(rcu)();
>> -
>> -	/*
>> -	 * It is safe to use a stale cpu to iterate the list of PMUs so long as
>> -	 * the same value is used for the entirety of the loop. Given this, and
>> -	 * the fact that no percpu data is used for the lookup there is no need
>> -	 * to disable preemption.
>> -	 *
>> -	 * It is still necessary to get a valid cpu, though, to probe for the
>> -	 * default PMU instance as userspace is not required to specify a PMU
>> -	 * type. In order to uphold the preexisting behavior KVM selects the
>> -	 * PMU instance for the core during vcpu init. A dependent use
>> -	 * case would be a user with disdain of all things big.LITTLE that
>> -	 * affines the VMM to a particular cluster of cores.
>> -	 *
>> -	 * In any case, userspace should just do the sane thing and use the UAPI
>> -	 * to select a PMU type directly. But, be wary of the baggage being
>> -	 * carried here.
>> -	 */
>> -	cpu = raw_smp_processor_id();
>> -	list_for_each_entry_rcu(entry, &arm_pmus, entry) {
>> -		pmu = entry->arm_pmu;
>> -
>> -		if (cpumask_test_cpu(cpu, &pmu->supported_cpus))
>> -			return pmu;
>> -	}
>> -
>> -	return NULL;
>> -}
>> -
> 
> Same thing for the refactoring of this function. Moving it, changing
> the signature and moving the comment somewhere else would be better
> placed on its own.

This will be in a separate patch with the next version.

> 
>>   static u64 __compute_pmceid(struct arm_pmu *pmu, bool pmceid1)
>>   {
>>   	u32 hi[2], lo[2];
>> @@ -888,6 +876,9 @@ u64 kvm_pmu_get_pmceid(struct kvm_vcpu *vcpu, bool pmceid1)
>>   	u64 val, mask = 0;
>>   	int base, i, nr_events;
>>   
>> +	if (test_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY, &vcpu->kvm->arch.flags))
>> +		return 0;
>> +
>>   	if (!pmceid1) {
>>   		val = compute_pmceid0(cpu_pmu);
>>   		base = 0;
>> @@ -915,6 +906,26 @@ u64 kvm_pmu_get_pmceid(struct kvm_vcpu *vcpu, bool pmceid1)
>>   	return val & mask;
>>   }
>>   
>> +void kvm_vcpu_load_pmu(struct kvm_vcpu *vcpu)
>> +{
>> +	unsigned long mask = kvm_pmu_enabled_counter_mask(vcpu);
>> +	struct kvm_pmc *pmc;
>> +	struct arm_pmu *cpu_pmu;
> 
> Move these to be inside the loop.

I followed the pattern of other functions, but I agree this new code can 
follow a more modern style. It will be done with the next version.

> 
>> +	int i;
>> +
>> +	for_each_set_bit(i, &mask, 32) {
>> +		pmc = kvm_vcpu_idx_to_pmc(vcpu, i);
>> +		if (!pmc->perf_event)
>> +			continue;
>> +
>> +		cpu_pmu = to_arm_pmu(pmc->perf_event->pmu);
>> +		if (!cpumask_test_cpu(vcpu->cpu, &cpu_pmu->supported_cpus)) {
>> +			kvm_make_request(KVM_REQ_RELOAD_PMU, vcpu);
>> +			break;
>> +		}
>> +	}
>> +}
>> +
> 
> Why do we need to inflict this on VMs that do not have the fixed
> counter restriction?

This function is there to re-create the perf_event when the current
perf_event does not support the pCPU, e.g. because the pCPU is an E-core
while the perf_event only covers the P-cores.

> 
> And even then, all you have to reconfigure is the cycle counter. So
> why the loop? All we want to find out is whether the cycle counter is
> instantiated on the PMU that matches the current CPU.

I just wanted to avoid hardcoding assumptions about the fixed counter(s).
FEAT_PMUv3_ICNTR will be naturally handled with a loop, for example.
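
For reference, the check that the loop performs can be modelled in
userspace as below. This is only a sketch: struct pmc_model,
needs_pmu_reload() and the 64-bit CPU mask are hypothetical, and the
real code raises KVM_REQ_RELOAD_PMU instead of returning a boolean:

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_COUNTERS_MODEL 32

/* Hypothetical per-counter state. */
struct pmc_model {
	bool has_event;    /* a perf_event has been created */
	uint64_t pmu_cpus; /* pCPUs covered by that event's PMU */
};

/*
 * Returns true if any enabled counter has an event bound to a PMU that
 * does not cover @cpu, i.e. the events have to be re-created.
 */
static bool needs_pmu_reload(const struct pmc_model *pmcs,
			     uint32_t enabled_mask, unsigned int cpu)
{
	if (cpu >= 64)
		return true; /* outside the model; treat as not covered */

	for (unsigned int i = 0; i < NR_COUNTERS_MODEL; i++) {
		if (!(enabled_mask & (UINT32_C(1) << i)))
			continue;
		if (!pmcs[i].has_event)
			continue;
		if (!(pmcs[i].pmu_cpus & (UINT64_C(1) << cpu)))
			return true;
	}

	return false;
}
```

Nothing in the loop depends on which index holds a fixed counter, so an
additional fixed counter would be picked up through the enabled mask.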

> 
>>   void kvm_vcpu_reload_pmu(struct kvm_vcpu *vcpu)
>>   {
>>   	u64 mask = kvm_pmu_implemented_counter_mask(vcpu);
>> @@ -1016,6 +1027,9 @@ u8 kvm_arm_pmu_get_max_counters(struct kvm *kvm)
>>   {
>>   	struct arm_pmu *arm_pmu = kvm->arch.arm_pmu;
>>   
>> +	if (test_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY, &kvm->arch.flags))
>> +		return 0;
>> +
>>   	/*
>>   	 * PMUv3 requires that all event counters are capable of counting any
>>   	 * event, though the same may not be true of non-PMUv3 hardware.
>> @@ -1070,7 +1084,24 @@ static void kvm_arm_set_pmu(struct kvm *kvm, struct arm_pmu *arm_pmu)
>>    */
>>   int kvm_arm_set_default_pmu(struct kvm *kvm)
>>   {
>> -	struct arm_pmu *arm_pmu = kvm_pmu_probe_armpmu();
>> +	/*
>> +	 * It is safe to use a stale cpu to iterate the list of PMUs so long as
>> +	 * the same value is used for the entirety of the loop. Given this, and
>> +	 * the fact that no percpu data is used for the lookup there is no need
>> +	 * to disable preemption.
>> +	 *
>> +	 * It is still necessary to get a valid cpu, though, to probe for the
>> +	 * default PMU instance as userspace is not required to specify a PMU
>> +	 * type. In order to uphold the preexisting behavior KVM selects the
>> +	 * PMU instance for the core during vcpu init. A dependent use
>> +	 * case would be a user with disdain of all things big.LITTLE that
>> +	 * affines the VMM to a particular cluster of cores.
>> +	 *
>> +	 * In any case, userspace should just do the sane thing and use the UAPI
>> +	 * to select a PMU type directly. But, be wary of the baggage being
>> +	 * carried here.
>> +	 */
>> +	struct arm_pmu *arm_pmu = kvm_pmu_probe_armpmu(raw_smp_processor_id());
>>   
>>   	if (!arm_pmu)
>>   		return -ENODEV;
>> @@ -1098,6 +1129,7 @@ static int kvm_arm_pmu_v3_set_pmu(struct kvm_vcpu *vcpu, int pmu_id)
>>   				break;
>>   			}
>>   
>> +			clear_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY, &kvm->arch.flags);
> 
> Why does this need to be cleared? I'd rather we make sure it is never
> set the first place.

KVM_ARM_VCPU_PMU_V3_SET_PMU and KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY
can both be set on the same VCPU. Whichever of the two is set last takes
effect.

A VMM may try to set these attributes to check if the setting is
supported. For example, the RFC QEMU patch first uses
KVM_ARM_VCPU_PMU_V3_SET_PMU to find a compatible PMU that covers all
pCPUs, and then falls back to KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY.
The order of such probing is up to the VMM.
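
A minimal sketch of that "last setting wins" behavior, loosely modelled
on the probing order described above. It is not QEMU's or KVM's actual
code: every name here is hypothetical, and the model leaves pmu_index
untouched when the fixed-counters-only mode is selected because it makes
no claim about what the kernel does with the previously selected PMU:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical VM-wide PMU configuration. */
struct vm_pmu_model {
	bool fixed_counters_only;
	int pmu_index; /* -1 while no specific PMU has been selected */
};

/* Selecting a PMU cancels fixed-counters-only (clear_bit() in the patch). */
static void model_set_pmu(struct vm_pmu_model *vm, int idx)
{
	vm->fixed_counters_only = false;
	vm->pmu_index = idx;
}

static void model_set_fixed_counters_only(struct vm_pmu_model *vm)
{
	vm->fixed_counters_only = true;
}

/* VMM side: prefer one PMU covering all wanted pCPUs, else fall back. */
static void vmm_configure_pmu(struct vm_pmu_model *vm,
			      const uint64_t *pmu_cpus, size_t nr_pmus,
			      uint64_t wanted_cpus)
{
	for (size_t i = 0; i < nr_pmus; i++) {
		if ((wanted_cpus & ~pmu_cpus[i]) == 0) {
			model_set_pmu(vm, (int)i);
			return;
		}
	}

	model_set_fixed_counters_only(vm);
}
```

A VMM that probes in the opposite order ends up in the same place, since
only the final call determines the mode.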

This rationale also applies to the next comment.

> 
>>   			kvm_arm_set_pmu(kvm, arm_pmu);
>>   			cpumask_copy(kvm->arch.supported_cpus, &arm_pmu->supported_cpus);
>>   			ret = 0;
>> @@ -1108,11 +1140,42 @@ static int kvm_arm_pmu_v3_set_pmu(struct kvm_vcpu *vcpu, int pmu_id)
>>   	return ret;
>>   }
>>   
>> +static int kvm_arm_pmu_v3_set_pmu_fixed_counters_only(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm *kvm = vcpu->kvm;
>> +	struct arm_pmu_entry *entry;
>> +	struct arm_pmu *arm_pmu;
>> +	struct cpumask *supported_cpus = kvm->arch.supported_cpus;
>> +
>> +	lockdep_assert_held(&kvm->arch.config_lock);
>> +
>> +	if (kvm_vm_has_ran_once(kvm) ||
>> +	    (kvm->arch.pmu_filter &&
>> +	     !test_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY, &kvm->arch.flags)))
>> +		return -EBUSY;
>> +
>> +	set_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY, &kvm->arch.flags);
>> +	kvm_arm_set_nr_counters(kvm, 0);
>> +	cpumask_clear(supported_cpus);
> 
> What is the purpose of this cpumask_clear()? Under what conditions can
> you have something else?
> 
>> +
>> +	guard(rcu)();
>> +
>> +	list_for_each_entry_rcu(entry, &arm_pmus, entry) {
>> +		arm_pmu = entry->arm_pmu;
>> +		cpumask_or(supported_cpus, supported_cpus, &arm_pmu->supported_cpus);
> 
> Why isn't supported_cpus directly set to possible_cpus? Isn't that the
> base requirement that you can run on any CPU at all?

Please see the earlier discussion of supported pCPUs.

> 
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>   static int kvm_arm_pmu_v3_set_nr_counters(struct kvm_vcpu *vcpu, unsigned int n)
>>   {
>>   	struct kvm *kvm = vcpu->kvm;
>>   
>> -	if (!kvm->arch.arm_pmu)
>> +	lockdep_assert_held(&kvm->arch.config_lock);
>> +
>> +	if (!kvm->arch.arm_pmu &&
>> +	    !test_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY, &kvm->arch.flags))
>>   		return -EINVAL;
>>   
>>   	if (n > kvm_arm_pmu_get_max_counters(kvm))
>> @@ -1227,6 +1290,8 @@ int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>>   
>>   		return kvm_arm_pmu_v3_set_nr_counters(vcpu, n);
>>   	}
>> +	case KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY:
>> +		return kvm_arm_pmu_v3_set_pmu_fixed_counters_only(vcpu);
>>   	case KVM_ARM_VCPU_PMU_V3_INIT:
>>   		return kvm_arm_pmu_v3_init(vcpu);
>>   	}
>> @@ -1253,6 +1318,9 @@ int kvm_arm_pmu_v3_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>>   		irq = vcpu->arch.pmu.irq_num;
>>   		return put_user(irq, uaddr);
>>   	}
>> +	case KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY:
>> +		if (test_bit(KVM_ARCH_FLAG_PMU_V3_FIXED_COUNTERS_ONLY, &vcpu->kvm->arch.flags))
> 
> With 6 occurrences of this test_bit(), it feels like it'd be valuable
> to have a dedicate predicate to help with readability.

I'll add one with the next version.

> 
>> +			return 0;
>>   	}
>>   
>>   	return -ENXIO;
>> @@ -1266,6 +1334,7 @@ int kvm_arm_pmu_v3_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>>   	case KVM_ARM_VCPU_PMU_V3_FILTER:
>>   	case KVM_ARM_VCPU_PMU_V3_SET_PMU:
>>   	case KVM_ARM_VCPU_PMU_V3_SET_NR_COUNTERS:
>> +	case KVM_ARM_VCPU_PMU_V3_FIXED_COUNTERS_ONLY:
>>   		if (kvm_vcpu_has_pmu(vcpu))
>>   			return 0;
>>   	}
>> diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
>> index 96754b51b411..1375cbaf97b2 100644
>> --- a/include/kvm/arm_pmu.h
>> +++ b/include/kvm/arm_pmu.h
>> @@ -56,6 +56,7 @@ void kvm_pmu_software_increment(struct kvm_vcpu *vcpu, u64 val);
>>   void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u64 val);
>>   void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u64 data,
>>   				    u64 select_idx);
>> +void kvm_vcpu_load_pmu(struct kvm_vcpu *vcpu);
>>   void kvm_vcpu_reload_pmu(struct kvm_vcpu *vcpu);
>>   int kvm_arm_pmu_v3_set_attr(struct kvm_vcpu *vcpu,
>>   			    struct kvm_device_attr *attr);
>> @@ -161,6 +162,7 @@ static inline u64 kvm_pmu_get_pmceid(struct kvm_vcpu *vcpu, bool pmceid1)
>>   static inline void kvm_pmu_update_vcpu_events(struct kvm_vcpu *vcpu) {}
>>   static inline void kvm_vcpu_pmu_restore_guest(struct kvm_vcpu *vcpu) {}
>>   static inline void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu) {}
>> +static inline void kvm_vcpu_load_pmu(struct kvm_vcpu *vcpu) {}
>>   static inline void kvm_vcpu_reload_pmu(struct kvm_vcpu *vcpu) {}
>>   static inline u8 kvm_arm_pmu_get_pmuver_limit(void)
>>   {
>>
> 
> In conclusion, I find this patch to be rather messy. For a start, it
> needs to be split in at least 5 patches:
> 
> - at least two for the refactoring
> - one for the PMU core changes
> - one for the UAPI
> - one for documentation

That clarifies the expected granularity of patches. The next version
will follow that layout, perhaps with more patches if additional changes
turn out to be needed. Thanks for the guidance.

> 
> I'd also like some clarification on how this is intended to work if we
> enable FEAT_PMUv3_ICNTR, because the definition seems to be designed
> to encompass all fixed-function counters, and I expect this to grow
> over time.

Indeed the UAPI was designed to encompass all fixed-function counters as 
suggested by Oliver.

To support the UAPI, the implementation avoids hardcoding assumptions
about the fixed counter(s). FEAT_PMUv3_ICNTR will be naturally supported
once the common code is properly updated (i.e., the size of the event
counter bitmask is grown and the corresponding registers are wired up
with a proper check of the feature).

I expect migration will be handled with the conventional register 
getters and setters, but please share if you have a concern.

> 
> I'm also not planning to look at the selftest at this stage.

That is completely understandable; I'll focus on refining the design and 
implementation for the next version first.

Regards,
Akihiko Odaki

^ permalink raw reply

* Re: [PATCH 1/2] cgroup/cpuset: record DL BW alloc CPU for attach rollback
From: Juri Lelli @ 2026-04-20  7:58 UTC (permalink / raw)
  To: Waiman Long
  Cc: Guopeng Zhang, tj, hannes, mkoutny, void, arighi, changwoo, shuah,
	chenridong, Juri Lelli, Valentin Schneider, Dietmar Eggemann,
	cgroups, sched-ext, linux-kselftest, linux-kernel
In-Reply-To: <e0fea6ec-397c-40a6-9300-a3529a3d1167@redhat.com>

Hi!

On 2026-04-19 22:31:29-04:00, Waiman Long wrote:
> On 4/19/26 10:21 PM, Guopeng Zhang wrote:
> 
> > On 2026/4/18 2:51, Waiman Long wrote:
> > ...
> > Hi Waiman,
> >
> > Thank you for the review and for the Reviewed-by.
> > I think you are right to call this out. Looking at the
> > current logic, !cpumask_intersects(oldcs->effective_cpus, cs->effective_cpus)
> > does not obviously guarantee that the migration is crossing into a different
> > root domain. If the old and new cpusets are disjoint but still belong to the
> > same root domain, it does look possible that we reserve bandwidth on the
> > destination side without a corresponding subtraction from the source side.
> > I will try to reproduce that configuration and follow up with results.
> > my current understanding is that the DL bandwidth
> > accounting is done at root-domain granularity, not at arbitrary cpuset-subset
> > granularity.
> 
> That is my understanding too.
> 
> > That also seems consistent with
> > Documentation/scheduler/sched-deadline.rst, which says that deadline tasks
> > cannot have a CPU affinity mask smaller than the root domain they are created
> > on, and that a restricted CPU set should be achieved by creating a restricted
> > root domain with cpuset.
> 
> A root domain should be created by creating cpuset root partition for v2 
> or using the cpuset.cpu_exclusive flag in v1.
> 
> What is listed in the documentation is the ideal case, but users may not 
> strictly follow the rule.

But if they don't, and try to create a DEADLINE task on cpusets that are
subsets of a root domain, that should fail, as the affinity mask won't
be covering the entire root domain. So no BW is allocated (and no
additional data structures are needed) for subsets either.
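
As a simplified illustration of that rule (a userspace model with a
made-up helper name and 64-bit masks, not the scheduler's actual code),
admission requires the task's affinity to cover the whole root-domain
span:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model: one bit per CPU. A DEADLINE task is only
 * acceptable if its affinity mask contains every CPU of the root
 * domain it would run in.
 */
static bool dl_affinity_covers_root_domain(uint64_t rd_span,
					   uint64_t task_affinity)
{
	return (rd_span & ~task_affinity) == 0;
}
```

With a root domain spanning CPUs 0-3 (0x0f), a cpuset restricted to
CPUs 0-1 (0x03) is rejected in this model, so there is nothing to
account for at sub-root-domain granularity.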

Thanks,
Juri


^ permalink raw reply

* Re: [PATCH 7.2 v3 11/12] selftests/mm: remove READ_ONLY_THP_FOR_FS in khugepaged
From: Baolin Wang @ 2026-04-20  7:56 UTC (permalink / raw)
  To: Zi Yan, Matthew Wilcox (Oracle), Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Shuah Khan, linux-btrfs, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest
In-Reply-To: <20260418024429.4055056-12-ziy@nvidia.com>



On 4/18/26 10:44 AM, Zi Yan wrote:
> Change the requirement to a file system with large folio support and the
> supported order needs to include PMD_ORDER.
> 
> Also add tests of opening a file with read write permission and populating
> folios with writes. Reuse the XFS image from split_huge_page_test.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---

Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>

^ permalink raw reply

* Re: [PATCH v5 13/14] selftests/mm: move hwpoison setup into run_test() and silence modprobe output for memory-failure category
From: Miaohe Lin @ 2026-04-20  7:26 UTC (permalink / raw)
  To: Sayali Patil
  Cc: David Hildenbrand, Zi Yan, Michal Hocko, Oscar Salvador,
	Lorenzo Stoakes, Dev Jain, Liam.Howlett, linuxppc-dev,
	Venkat Rao Bagalkote, Andrew Morton, Shuah Khan, linux-mm,
	linux-kernel, linux-kselftest, Ritesh Harjani
In-Reply-To: <ec5a0e5e98c4fe55b8571c408ca891bd02208cc3.1776150071.git.sayalip@linux.ibm.com>

On 2026/4/14 16:22, Sayali Patil wrote:
> run_vmtests.sh contains special handling to ensure the hwpoison_inject
> module is available for the memory-failure tests. This logic was
> implemented outside of run_test(), making the setup category-specific
> but managed globally.
> 
> Move the hwpoison_inject handling into run_test() and restrict it
> to the memory-failure category so that:
> 1. the module is checked and loaded only when memory-failure tests run,
> 2. the test is skipped if the module or the debugfs interface
> (/sys/kernel/debug/hwpoison/) is not available.
> 3. the module is unloaded after the test if it was loaded by the script.
> 
> This localizes category-specific setup and makes the test flow
> consistent with other per-category preparations.
> 
> While updating this logic, fix the module availability check.
> The script previously used:
> 
> 	modprobe -R hwpoison_inject
> 
> The -R option prints the resolved module name to stdout, causing every
> run to print:
> 
> 	hwpoison_inject
> 
> in the test output, even when no action is required, introducing
> unnecessary noise.
> 
> Replace this with:
> 
> 	modprobe -n hwpoison_inject
> 
> which verifies that the module is loadable without producing output,
> keeping the selftest logs clean and consistent.
> 
> Also, ensure that skipped tests do not override a previously recorded
> failure. A skipped test currently sets exitcode to ksft_skip even if a
> prior test has failed, which can mask failures in the final exit status.
> Update the logic to only set exitcode to ksft_skip when no failure has
> been recorded.
> 
> Fixes: ff4ef2fbd101 ("selftests/mm: add memory failure anonymous page test")
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
> Signed-off-by: Sayali Patil <sayalip@linux.ibm.com>

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks.


* Re: [PATCH net 2/2] selftests/net: packetdrill: cover challenge ACK on SEG.ACK > SND.NXT
From: Eric Dumazet @ 2026-04-20  7:22 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
	David Ahern, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Shuah Khan, linux-kernel, linux-kselftest
In-Reply-To: <20260420025428.101192-3-jiayuan.chen@linux.dev>

On Sun, Apr 19, 2026 at 7:55 PM Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
> Exercise the RFC 5961 Section 5.2 / RFC 793 Section 3.9 requirement
> on the upper edge of the acceptable ACK range, mirroring the existing
> coverage of the SEG.ACK < SND.UNA - MAX.SND.WND case.
>
> After the peer ACKs data the receiver has never sent, the receiver
> must respond with <SEQ = SND.NXT, ACK = RCV.NXT, CTL = ACK> and drop
> the offending segment.  The script validates this exact response.
>
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>

Reviewed-by: Eric Dumazet <edumazet@google.com>
Thanks!


* Re: [PATCH net 1/2] tcp: send a challenge ACK on SEG.ACK > SND.NXT
From: Eric Dumazet @ 2026-04-20  7:21 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: netdev, Neal Cardwell, Kuniyuki Iwashima, David S. Miller,
	David Ahern, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Shuah Khan, linux-kernel, linux-kselftest
In-Reply-To: <20260420025428.101192-2-jiayuan.chen@linux.dev>

On Sun, Apr 19, 2026 at 7:55 PM Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
> RFC 5961 Section 5.2 validates an incoming segment's ACK value
> against the range [SND.UNA - MAX.SND.WND, SND.NXT] and states:
>
>   "All incoming segments whose ACK value doesn't satisfy the above
>    condition MUST be discarded and an ACK sent back."
>
> Commit 354e4aa391ed ("tcp: RFC 5961 5.2 Blind Data Injection Attack
> Mitigation") opted Linux into this mitigation and implements the
> challenge ACK on the lower side (SEG.ACK < SND.UNA - MAX.SND.WND),
> but the symmetric upper side (SEG.ACK > SND.NXT) still takes the
> pre-RFC-5961 path and silently returns
> SKB_DROP_REASON_TCP_ACK_UNSENT_DATA, even though RFC 793 Section 3.9
> (now RFC 9293 Section 3.10.7.4) has always required:
>
>   "If the ACK acknowledges something not yet sent (SEG.ACK > SND.NXT)
>    then send an ACK, drop the segment, and return."
>
> Complete the mitigation by sending a challenge ACK on that branch,
> reusing the existing tcp_send_challenge_ack() path which already
> enforces the per-socket RFC 5961 Section 7 rate limit via
> __tcp_oow_rate_limited().  FLAG_NO_CHALLENGE_ACK is honoured for
> symmetry with the lower-edge case.
>
> Fixes: 354e4aa391ed ("tcp: RFC 5961 5.2 Blind Data Injection Attack Mitigation")
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
>
> ---
> I'm not sure if 'blamed commit' is appropriate, because I think
> it's due to missing parts of the implementation, or it might be
> directly targeted to net-next.

The Fixes: tag seems appropriate, and net tree LGTM.

Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks!
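
For readers following the thread, the acceptable-ACK range check from the
quoted commit message can be sketched in userspace C as below. This is only
an illustration under stated assumptions: classify_ack(), seq_before(),
seq_after() and the enum names are hypothetical and are not the kernel's
identifiers, and the sketch leaves out FLAG_NO_CHALLENGE_ACK and the
per-socket rate limit that the patch description mentions. The comparisons
use the usual signed 32-bit difference so that sequence-number wraparound is
handled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe comparison: true if sequence number a precedes b. */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* True if sequence number a follows b. */
static bool seq_after(uint32_t a, uint32_t b)
{
	return seq_before(b, a);
}

/* Hypothetical names; the kernel reports drop reasons instead. */
enum ack_verdict {
	ACK_ACCEPTABLE,
	ACK_TOO_OLD,		/* SEG.ACK < SND.UNA - MAX.SND.WND */
	ACK_UNSENT_DATA,	/* SEG.ACK > SND.NXT */
};

/*
 * Classify SEG.ACK against the range
 * [SND.UNA - MAX.SND.WND, SND.NXT] from RFC 5961 Section 5.2.
 */
static enum ack_verdict classify_ack(uint32_t seg_ack, uint32_t snd_una,
				     uint32_t snd_nxt, uint32_t max_snd_wnd)
{
	if (seq_before(seg_ack, snd_una - max_snd_wnd))
		return ACK_TOO_OLD;
	if (seq_after(seg_ack, snd_nxt))
		return ACK_UNSENT_DATA;
	return ACK_ACCEPTABLE;
}

/* Both out-of-range cases call for dropping the segment and sending an ACK. */
static bool wants_challenge_ack(enum ack_verdict v)
{
	return v != ACK_ACCEPTABLE;
}
```

With SND.UNA = 1000, SND.NXT = 2000 and MAX.SND.WND = 500, a segment
carrying SEG.ACK = 2001 falls on the upper edge that the patch addresses,
while SEG.ACK = 499 falls on the lower edge that was already covered.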


* Re: [PATCH] selftests: harness: fix pidfd leak in __wait_for_test
From: Thomas Weißschuh @ 2026-04-20  7:19 UTC (permalink / raw)
  To: Geliang Tang
  Cc: Kees Cook, Andy Lutomirski, Will Drewry, Shuah Khan,
	Christian Brauner, Geliang Tang, linux-kselftest, mptcp
In-Reply-To: <f6a4cf56c8c361d3e7373d83fd3930bd28600b4c.1776668161.git.tanggeliang@kylinos.cn>

On Mon, Apr 20, 2026 at 03:04:57PM +0800, Geliang Tang wrote:
> From: Geliang Tang <tanggeliang@kylinos.cn>

(...)

> diff --git a/tools/testing/selftests/kselftest_harness.h b/tools/testing/selftests/kselftest_harness.h
> index 75fb016cd190..b12bc4dc3230 100644
> --- a/tools/testing/selftests/kselftest_harness.h
> +++ b/tools/testing/selftests/kselftest_harness.h
> @@ -1001,7 +1001,7 @@ static void __wait_for_test(struct __test_metadata *t)
>  		fprintf(TH_LOG_STREAM,
>  			"# %s: unable to wait on child pidfd\n",
>  			t->name);
> -		return;
> +		goto out;
>  	} else if (ret == 0) {
>  		timed_out = true;
>  		/* signal process group */
> @@ -1013,7 +1013,7 @@ static void __wait_for_test(struct __test_metadata *t)
>  		fprintf(TH_LOG_STREAM,
>  			"# %s: Failed to wait for PID %d (errno: %d)\n",
>  			t->name, t->pid, errno);
> -		return;
> +		goto out;
>  	}
>  
>  	if (timed_out) {
> @@ -1066,6 +1066,8 @@ static void __wait_for_test(struct __test_metadata *t)
>  			t->name,
>  			status);
>  	}
> +out:
> +	close(childfd);

I think the close() could be directly after the poll().
That would make the code simpler.

>  }
>  
>  static void test_harness_list_tests(void)
> -- 
> 2.51.0
> 
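
The suggestion above (close the pidfd right after poll()) can be sketched
as a standalone helper. This is a simplified illustration, not the harness
code: wait_child_timeout() is a hypothetical name, the pidfd is obtained
with a raw pidfd_open syscall (an assumption about how childfd is created),
and timeout handling such as signalling the process group is left to the
caller.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for child `pid` to exit.
 * Returns 1 if the child was reaped (*status is filled in),
 * 0 on timeout, and -1 on error.
 */
static int wait_child_timeout(pid_t pid, int timeout_ms, int *status)
{
	struct pollfd pfd;
	int childfd, ret;

	childfd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (childfd < 0)
		return -1;

	pfd.fd = childfd;
	pfd.events = POLLIN;
	pfd.revents = 0;
	ret = poll(&pfd, 1, timeout_ms);

	/*
	 * The fd is only needed for poll(). Closing it here covers the
	 * error, timeout and success paths without a cleanup label.
	 */
	close(childfd);

	if (ret < 0)
		return -1;
	if (ret == 0)
		return 0;
	if (waitpid(pid, status, 0) != pid)
		return -1;
	return 1;
}
```

Because the descriptor is released before any branch is taken, none of the
early returns can leak it, which is the simplification being proposed.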


* Re: [PATCH v7 2/4] KVM: arm64: PMU: Protect the list of PMUs with RCU
From: Akihiko Odaki @ 2026-04-20  7:17 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Oliver Upton, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
	Catalin Marinas, Will Deacon, Kees Cook, Gustavo A. R. Silva,
	Paolo Bonzini, Jonathan Corbet, Shuah Khan, linux-arm-kernel,
	kvmarm, linux-kernel, linux-hardening, devel, kvm, linux-doc,
	linux-kselftest
In-Reply-To: <86se8q15eo.wl-maz@kernel.org>

On 2026/04/20 16:01, Marc Zyngier wrote:
> On Mon, 20 Apr 2026 07:21:45 +0100,
> Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> wrote:
>>
>> On 2026/04/19 23:34, Marc Zyngier wrote:
>>> On Sat, 18 Apr 2026 09:14:24 +0100,
>>> Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> wrote:
>>>>
>>>> Convert the list of PMUs to an RCU-protected list that has primitives to
>>>> avoid read-side contention.
>>>>
>>>> Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
>>>> ---
>>>>    arch/arm64/kvm/pmu-emul.c | 14 ++++++--------
>>>>    1 file changed, 6 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
>>>> index 59ec96e09321..ef5140bbfe28 100644
>>>> --- a/arch/arm64/kvm/pmu-emul.c
>>>> +++ b/arch/arm64/kvm/pmu-emul.c
>>>> @@ -7,9 +7,9 @@
>>>>    #include <linux/cpu.h>
>>>>    #include <linux/kvm.h>
>>>>    #include <linux/kvm_host.h>
>>>> -#include <linux/list.h>
>>>>    #include <linux/perf_event.h>
>>>>    #include <linux/perf/arm_pmu.h>
>>>> +#include <linux/rculist.h>
>>>>    #include <linux/uaccess.h>
>>>>    #include <asm/kvm_emulate.h>
>>>>    #include <kvm/arm_pmu.h>
>>>> @@ -26,7 +26,6 @@ static bool kvm_pmu_counter_is_enabled(struct kvm_pmc *pmc);
>>>>
>>>>    bool kvm_supports_guest_pmuv3(void)
>>>>    {
>>>> -	guard(mutex)(&arm_pmus_lock);
>>>>    	return !list_empty(&arm_pmus);
>>>
>>> Please read include/linux/rculist.h and the discussion about the
>>> interaction of list_empty() with RCU-protected lists. How about using
>>> list_first_or_null_rcu() for peace of mind?
>>
>> list_first_or_null_rcu() is useful to replace a sequence of
>> list_empty() and list_first_entry() that is protected by a lock, but
>> this function instead requires the invariant that nobody deletes an
>> element from the list, and list_first_or_null_rcu() does not allow
>> removing the requirement.
>>
>> The header file says:
>>> Where are list_empty_rcu() and list_first_entry_rcu()?
>>>
>>> They do not exist because they would lead to subtle race conditions:
>>>
>>> if (!list_empty_rcu(mylist)) {
>>> 	struct foo *bar = list_first_entry_rcu(mylist, struct foo,
>>> 					       list_member);
>>> 	do_something(bar);
>>> }
>>>
>>> The list might be non-empty when list_empty_rcu() checks it, but it
>>> might have become empty by the time that list_first_entry_rcu()
>>> rereads the ->next pointer, which would result in a SEGV.
>>>
>>> When not using RCU, it is OK for list_first_entry() to re-read that
>>> pointer because both functions should be protected by some lock that
>>> blocks writers.
>>>
>>> When using RCU, list_empty() uses READ_ONCE() to fetch the
>>> RCU-protected ->next pointer and then compares it to the address of
>>> the list head.  However, it neither dereferences this pointer nor
>>> provides this pointer to its caller.  Thus, READ_ONCE() suffices
>>> (that is, rcu_dereference() is not needed), which means that
>>> list_empty() can be used anywhere you would want to use
>>> list_empty_rcu().  Just don't expect anything useful to happen if you
>>> do a subsequent lockless call to list_first_entry_rcu()!!!
>>>
>>> See list_first_or_null_rcu for an alternative.
>>
>> However, kvm_supports_guest_pmuv3() locked a mutex when calling
>> list_empty() and unlocked it immediately after that, instead of
>> re-reading list_first_entry(). This construct inherently had a race
>> condition with code that deletes an element; when the caller of
>> kvm_supports_guest_pmuv3() decides to enable guest PMUv3, the host PMU
>> may already be gone. But it was still safe because no one deletes an
>> element.
>>
>> The same logic also applies when using RCU. As the comment says, we
>> can use list_empty() instead of the hypothetical list_empty_rcu()
>> macro because we don't expect it to magically enable something like
>> list_first_entry_rcu(). This function instead keeps relying on the fact
>> that no one deletes an element of the list.
> 
> And that's exactly the sort of thing I am trying to plan for. *Should*
> we introduce a way to remove PMUs from the list, this predicate
> becomes unsafe.

Perhaps so. With regard to this series, I'd rather keep it out of
scope, as the requirement is not new.

> 
> So I want at least a comment explaining this to the unsuspecting
> reader, as this is rather subtle.

I agree. It took me some effort to understand the previous
mutex-protected implementation and to design the new RCU-protected one.
I'll add a comment in the next version.

Regards,
Akihiko Odaki
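
To make the distinction discussed in this thread concrete, here is a small
userspace analogue. It is a sketch under loud assumptions: C11 atomics stand
in for READ_ONCE() and rcu_dereference(), all names (struct node, struct
pmu_entry, list_is_empty(), list_first_or_null()) are hypothetical, and
nothing here models grace periods or element removal. It only shows why an
emptiness check that never dereferences the pointer is safe on its own, and
why a lookup should use a single load for both the test and the result.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Minimal circular singly linked list; an empty head points to itself. */
struct node {
	_Atomic(struct node *) next;
};

struct pmu_entry {
	struct node link;
	int id;
};

static void list_init(struct node *head)
{
	atomic_store(&head->next, head);
}

/* Writer side: publish a fully initialised node at the front. */
static void list_add_head(struct node *head, struct node *n)
{
	struct node *first;

	first = atomic_load_explicit(&head->next, memory_order_relaxed);
	atomic_store_explicit(&n->next, first, memory_order_relaxed);
	atomic_store_explicit(&head->next, n, memory_order_release);
}

/*
 * Analogue of list_empty(): one load, compared against the head and
 * never dereferenced, so no stronger ordering is required.
 */
static int list_is_empty(struct node *head)
{
	return atomic_load_explicit(&head->next, memory_order_relaxed) == head;
}

/*
 * Analogue of list_first_or_null_rcu(): the same single load decides
 * emptiness and supplies the entry, so there is no window between a
 * check and a second read of ->next.
 */
static struct pmu_entry *list_first_or_null(struct node *head)
{
	struct node *first;

	first = atomic_load_explicit(&head->next, memory_order_acquire);
	if (first == head)
		return NULL;
	return (struct pmu_entry *)((char *)first -
				    offsetof(struct pmu_entry, link));
}
```

A predicate such as kvm_supports_guest_pmuv3() corresponds to
list_is_empty() here: its answer can go stale as soon as it returns, which
is harmless only as long as entries are never removed.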


* Re: [PATCH 1/5] liveupdate: Remove limit on the number of sessions
From: Mike Rapoport @ 2026-04-20  7:13 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kselftest, shuah, akpm, linux-mm, linux-kernel, dmatlack,
	kexec, pratyush, skhawaja, graf
In-Reply-To: <20260414200237.444170-2-pasha.tatashin@soleen.com>

On Tue, Apr 14, 2026 at 08:02:33PM +0000, Pasha Tatashin wrote:
> Currently, the number of LUO sessions is limited by a fixed number of
> pre-allocated pages for serialization (16 pages, allowing for ~819
> sessions).
> 
> This limitation is problematic if LUO is used to support things such as
> the systemd file descriptor store, where it would preserve not just VM
> memory but also other state on the machine.
> 
> Remove this limit by transitioning to a linked-block approach for
> session metadata serialization. Instead of a single contiguous block,
> session metadata is now stored in a chain of 16-page blocks. Each block
> starts with a header containing the physical address of the next block
> and the number of session entries in the current block.
> 
> - Bump session ABI version to v3.
> - Update struct luo_session_header_ser to include a 'next' pointer.
> - Implement dynamic block allocation in luo_session_insert().
> - Update setup, serialization, and deserialization logic to traverse
>   the block chain.
> - Remove LUO_SESSION_MAX limit.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  include/linux/kho/abi/luo.h      |  19 +--
>  kernel/liveupdate/luo_internal.h |  12 +-
>  kernel/liveupdate/luo_session.c  | 237 +++++++++++++++++++++++--------
>  3 files changed, 197 insertions(+), 71 deletions(-)

...

> +/**
> + * struct luo_session_block - Internal representation of a session serialization block.
> + * @list: List head for linking blocks in memory.
> + * @ser:  Pointer to the serialized header in preserved memory.
> + */
> +struct luo_session_block {
> +	struct list_head list;
> +	struct luo_session_header_ser *ser;
> +};
> +
>  /**
>   * struct luo_session_header - Header struct for managing LUO sessions.
>   * @count:      The number of sessions currently tracked in the @list.
> + * @nblocks:    The number of allocated serialization blocks.
>   * @list:       The head of the linked list of `struct luo_session` instances.
>   * @rwsem:      A read-write semaphore providing synchronized access to the
>   *              session list and other fields in this structure.
> - * @header_ser: The header data of serialization array.
> - * @ser:        The serialized session data (an array of
> - *              `struct luo_session_ser`).
> + * @blocks:     The list of serialization blocks (struct luo_session_block).
>   * @active:     Set to true when first initialized. If previous kernel did not
>   *              send session data, active stays false for incoming.
>   */
>  struct luo_session_header {
>  	long count;
> +	long nblocks;
>  	struct list_head list;
>  	struct rw_semaphore rwsem;
> -	struct luo_session_header_ser *header_ser;
> -	struct luo_session_ser *ser;
> +	struct list_head blocks;

Don't we need some sort of locking for blocks?

>  	bool active;
>  };
  
> @@ -147,15 +222,6 @@ static int luo_session_insert(struct luo_session_header *sh,
>  
>  	guard(rwsem_write)(&sh->rwsem);
>  
> -	/*
> -	 * For outgoing we should make sure there is room in serialization array
> -	 * for new session.
> -	 */
> -	if (sh == &luo_session_global.outgoing) {
> -		if (sh->count == LUO_SESSION_MAX)
> -			return -ENOMEM;
> -	}
> -
>  	/*
>  	 * For small number of sessions this loop won't hurt performance
>  	 * but if we ever start using a lot of sessions, this might

For ~8.1 million sessions this comment does not seem valid anymore ;-)

-- 
Sincerely yours,
Mike.
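
The linked-block layout described in the quoted commit message can be
pictured with the userspace sketch below. Everything in it is an assumption
made for illustration: the struct and function names are hypothetical, the
block holds four entries instead of 16 pages' worth, the 'next' field is an
ordinary pointer where the patch stores a physical address, and no locking
is shown (the review question about protecting the block list is left
open).

```c
#include <stdint.h>
#include <stdlib.h>

#define ENTRIES_PER_BLOCK 4	/* tiny on purpose, for illustration only */

struct session_entry {
	uint64_t id;
};

/* Each block starts with a header: link to the next block, entry count. */
struct block {
	struct block *next;
	uint64_t count;
	struct session_entry entries[ENTRIES_PER_BLOCK];
};

struct chain {
	struct block *head;
	struct block *tail;
	uint64_t nblocks;
};

/* Append an entry, allocating a new block when the tail block is full. */
static int chain_insert(struct chain *c, uint64_t id)
{
	struct block *b = c->tail;

	if (!b || b->count == ENTRIES_PER_BLOCK) {
		b = calloc(1, sizeof(*b));
		if (!b)
			return -1;
		if (c->tail)
			c->tail->next = b;
		else
			c->head = b;
		c->tail = b;
		c->nblocks++;
	}
	b->entries[b->count++].id = id;
	return 0;
}

/* Walk the chain, as deserialization would, summing per-block counts. */
static uint64_t chain_count(const struct chain *c)
{
	uint64_t total = 0;

	for (const struct block *b = c->head; b; b = b->next)
		total += b->count;
	return total;
}

static void chain_free(struct chain *c)
{
	struct block *b = c->head;

	while (b) {
		struct block *next = b->next;

		free(b);
		b = next;
	}
	c->head = NULL;
	c->tail = NULL;
	c->nblocks = 0;
}
```

Inserting ten entries with this block size yields three blocks holding 4, 4
and 2 entries, so the capacity grows with demand instead of being fixed up
front.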
