* [PATCH 4/4] rtla/tests: Add runtime tests for restoring continue flag
From: Tomas Glozar @ 2026-05-26 10:25 UTC (permalink / raw)
To: Steven Rostedt, Tomas Glozar
Cc: John Kacur, Luis Goncalves, Crystal Wood, Costa Shulyupin,
Wander Lairson Costa, LKML, linux-trace-kernel
In-Reply-To: <20260526102523.2662391-1-tglozar@redhat.com>
In case an action preceding the continue action fails, not only
the continue flag should not be set, it should be unset if it was set
from a previous run of actions_perform().
Add a runtime test to both osnoise and timerlat tools that checks that
this works properly by creating a temporary file.
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
---
Depends on "rtla/tests: Extend runtime test coverage" patchset
- https://lore.kernel.org/linux-trace-kernel/20260423130558.882022-1-tglozar@redhat.com/
tools/tracing/rtla/tests/osnoise.t | 2 ++
tools/tracing/rtla/tests/timerlat.t | 2 ++
2 files changed, 4 insertions(+)
diff --git a/tools/tracing/rtla/tests/osnoise.t b/tools/tracing/rtla/tests/osnoise.t
index 9c2f84a4187d..a7956ab605cd 100644
--- a/tools/tracing/rtla/tests/osnoise.t
+++ b/tools/tracing/rtla/tests/osnoise.t
@@ -65,6 +65,8 @@ check "top stop at failed action" \
"osnoise top -S 2 --on-threshold shell,command='echo -n abc; false' --on-threshold shell,command='echo -n defgh'" 2 "^abc" "defgh"
check_top_q_hist "with continue" \
"osnoise TOOL -S 2 -d 5s --on-threshold shell,command='echo TestOutput' --on-threshold continue" 0 "^TestOutput$"
+check_top_q_hist "with conditional continue" \
+ "osnoise TOOL -S 2 --on-threshold shell,command='if [ -f a ]; then echo 2; exit 1; else echo -n 1; touch a; fi' --on-threshold continue" 2 "^12$" "^2$"
check_top_hist "with trace output at end" \
"osnoise TOOL -d 1s --on-end trace" 0 "^ Saving trace to osnoise_trace.txt$"
diff --git a/tools/tracing/rtla/tests/timerlat.t b/tools/tracing/rtla/tests/timerlat.t
index f3e5f99e862b..19fd5af26ebb 100644
--- a/tools/tracing/rtla/tests/timerlat.t
+++ b/tools/tracing/rtla/tests/timerlat.t
@@ -94,6 +94,8 @@ check "top stop at failed action" \
"timerlat top -T 2 --on-threshold shell,command='echo -n abc; false' --on-threshold shell,command='echo -n defgh'" 2 "^abc" "defgh"
check_top_q_hist "with continue" \
"timerlat TOOL -T 2 -d 5s --on-threshold shell,command='echo TestOutput' --on-threshold continue" 0 "^TestOutput$"
+check_top_q_hist "with conditional continue" \
+ "timerlat TOOL -T 2 --on-threshold shell,command='if [ -f a ]; then echo 2; exit 1; else echo -n 1; touch a; fi' --on-threshold continue" 2 "^12$" "^2$"
check_top_hist "with trace output at end" \
"timerlat TOOL -d 1s --on-end trace" 0 "^ Saving trace to timerlat_trace.txt$"
--
2.54.0
^ permalink raw reply related
* Re: [PATCH v2 10/17] landlock: Set audit_net.sk for socket access checks
From: Mickaël Salaün @ 2026-05-26 10:42 UTC (permalink / raw)
To: Christian Brauner, Günther Noack, Steven Rostedt
Cc: Jann Horn, Jeff Xu, Justin Suess, Kees Cook, Masami Hiramatsu,
Mathieu Desnoyers, Matthieu Buffet, Mikhail Ivanov, Tingmao Wang,
kernel-team, linux-fsdevel, linux-security-module,
linux-trace-kernel, stable
In-Reply-To: <20260406143717.1815792-11-mic@digikod.net>
I merged this fix in the -next branch.
On Mon, Apr 06, 2026 at 04:37:08PM +0200, Mickaël Salaün wrote:
> Set audit_net.sk in current_check_access_socket() to provide the socket
> object to audit_log_lsm_data(). This makes Landlock consistent with
> AppArmor, which always sets .sk for socket operations, and with
> SELinux's generic socket permission checks.
>
> The socket's local and foreign address information (laddr, lport, faddr,
> fport) is logged by the shared lsm_audit.c infrastructure when the
> socket has bound or connected state. Fields with zero values are
> suppressed by print_ipv4_addr()/print_ipv6_addr(), so the audit output
> is unchanged for the common case of bind denials on unbound sockets.
> For connect denials after a prior bind, the bound local address (laddr,
> lport) appears before the existing sockaddr fields (daddr, dest).
>
> No existing fields are removed or reordered, and the new field names
> (laddr, lport, faddr, fport) are standard audit fields already emitted
> by other LSMs through the same lsm_audit.c code path.
>
> Add net_bind and net_connect audit tests. The net_bind test verifies
> basic net denial auditing. The net_connect test binds to an allowed
> port, then connects to a denied port, and verifies that the audit record
> includes laddr/lport from the socket state.
>
> Fixes: 9f74411a40ce ("landlock: Log TCP bind and connect denials")
> Cc: stable@vger.kernel.org
> Cc: Günther Noack <gnoack@google.com>
> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> ---
>
> Changes since v1:
> - New patch.
> ---
> security/landlock/net.c | 1 +
> tools/testing/selftests/landlock/audit_test.c | 187 ++++++++++++++++++
> 2 files changed, 188 insertions(+)
>
> diff --git a/security/landlock/net.c b/security/landlock/net.c
> index a2aefc7967a1..d8bc9e0d012a 100644
> --- a/security/landlock/net.c
> +++ b/security/landlock/net.c
> @@ -225,6 +225,7 @@ static int current_check_access_socket(struct socket *const sock,
> return 0;
>
> audit_net.family = address->sa_family;
> + audit_net.sk = sock->sk;
> landlock_log_denial(subject,
> &(struct landlock_request){
> .type = LANDLOCK_REQUEST_NET_ACCESS,
> diff --git a/tools/testing/selftests/landlock/audit_test.c b/tools/testing/selftests/landlock/audit_test.c
> index da0bfd06391e..65dfb272c825 100644
> --- a/tools/testing/selftests/landlock/audit_test.c
> +++ b/tools/testing/selftests/landlock/audit_test.c
> @@ -6,14 +6,17 @@
> */
>
> #define _GNU_SOURCE
> +#include <arpa/inet.h>
> #include <errno.h>
> #include <fcntl.h>
> #include <limits.h>
> #include <linux/landlock.h>
> +#include <netinet/in.h>
> #include <pthread.h>
> #include <stdlib.h>
> #include <sys/mount.h>
> #include <sys/prctl.h>
> +#include <sys/socket.h>
> #include <sys/types.h>
> #include <sys/wait.h>
> #include <unistd.h>
> @@ -160,6 +163,190 @@ TEST_F(audit, layers)
> EXPECT_EQ(0, close(ruleset_fd));
> }
>
> +static int matches_log_net_bind(struct __test_metadata *const _metadata,
> + int audit_fd, __u16 port, __u64 *domain_id)
> +{
> + /*
> + * The socket is unbound at bind() time, so laddr/lport/faddr/fport from
> + * the socket object are zero and not printed. Only the sockaddr fields
> + * (src) appear.
> + */
> + static const char log_template[] = REGEX_LANDLOCK_PREFIX
> + " blockers=net\\.bind_tcp src=%u$";
> + char log_match[sizeof(log_template) + 10];
> +
> + snprintf(log_match, sizeof(log_match), log_template, port);
> + return audit_match_record(audit_fd, AUDIT_LANDLOCK_ACCESS, log_match,
> + domain_id);
> +}
> +
> +/*
> + * Verifies that network denial audit records include enriched socket
> + * information (laddr/lport/faddr/fport) from the socket object.
> + */
> +TEST_F(audit, net_bind)
> +{
> + const struct landlock_ruleset_attr ruleset_attr = {
> + .handled_access_net = LANDLOCK_ACCESS_NET_BIND_TCP,
> + };
> + struct landlock_net_port_attr net_port = {
> + .allowed_access = LANDLOCK_ACCESS_NET_BIND_TCP,
> + .port = 1024,
> + };
> + int status, ruleset_fd;
> + pid_t child;
> + __u64 denial_dom = 1;
> +
> + ruleset_fd =
> + landlock_create_ruleset(&ruleset_attr, sizeof(ruleset_attr), 0);
> + ASSERT_LE(0, ruleset_fd);
> +
> + /* Allow port 1024 only. */
> + ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NET_PORT,
> + &net_port, 0));
> +
> + EXPECT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0));
> +
> + child = fork();
> + ASSERT_LE(0, child);
> + if (child == 0) {
> + struct sockaddr_in addr = {
> + .sin_family = AF_INET,
> + .sin_port = htons(1025),
> + .sin_addr.s_addr = htonl(INADDR_ANY),
> + };
> + int sock_fd;
> +
> + EXPECT_EQ(0, landlock_restrict_self(ruleset_fd, 0));
> + close(ruleset_fd);
> +
> + /* Bind to port 1025 (not allowed). */
> + sock_fd = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, 0);
> + ASSERT_LE(0, sock_fd);
> + EXPECT_EQ(-1, bind(sock_fd, (struct sockaddr *)&addr,
> + sizeof(addr)));
> + EXPECT_EQ(EACCES, errno);
> + close(sock_fd);
> +
> + /* Verify audit record with enriched socket info. */
> + EXPECT_EQ(0, matches_log_net_bind(_metadata, self->audit_fd,
> + 1025, &denial_dom));
> + EXPECT_NE(denial_dom, 1);
> + EXPECT_NE(denial_dom, 0);
> +
> + _exit(_metadata->exit_code);
> + return;
> + }
> +
> + ASSERT_EQ(child, waitpid(child, &status, 0));
> + if (WIFSIGNALED(status) || !WIFEXITED(status) ||
> + WEXITSTATUS(status) != EXIT_SUCCESS)
> + _metadata->exit_code = KSFT_FAIL;
> +
> + EXPECT_EQ(0, close(ruleset_fd));
> +}
> +
> +static int matches_log_net_connect(struct __test_metadata *const _metadata,
> + int audit_fd, __u16 denied_port,
> + __u16 bound_port, __u64 *domain_id)
> +{
> + /*
> + * After bind(), the socket has local address state. The audit record
> + * should include laddr/lport from the socket (via audit_net.sk) and
> + * daddr/dest from the connect sockaddr.
> + */
> + static const char log_template[] = REGEX_LANDLOCK_PREFIX
> + " blockers=net\\.connect_tcp"
> + " laddr=127\\.0\\.0\\.1 lport=%u"
> + " daddr=127\\.0\\.0\\.1 dest=%u$";
> + char log_match[sizeof(log_template) + 20];
> +
> + snprintf(log_match, sizeof(log_match), log_template, bound_port,
> + denied_port);
> + return audit_match_record(audit_fd, AUDIT_LANDLOCK_ACCESS, log_match,
> + domain_id);
> +}
> +
> +/*
> + * Verifies that network denial audit records for connect include enriched
> + * socket information (laddr/lport) from the socket object after a prior bind.
> + * This complements net_bind which tests the unbound case.
> + */
> +TEST_F(audit, net_connect)
> +{
> + const struct landlock_ruleset_attr ruleset_attr = {
> + .handled_access_net = LANDLOCK_ACCESS_NET_BIND_TCP |
> + LANDLOCK_ACCESS_NET_CONNECT_TCP,
> + };
> + struct landlock_net_port_attr net_port;
> + int status, ruleset_fd;
> + pid_t child;
> + __u64 denial_dom = 1;
> +
> + ruleset_fd =
> + landlock_create_ruleset(&ruleset_attr, sizeof(ruleset_attr), 0);
> + ASSERT_LE(0, ruleset_fd);
> +
> + /* Allow bind to port 1024 and connect to port 1024. */
> + net_port.allowed_access = LANDLOCK_ACCESS_NET_BIND_TCP |
> + LANDLOCK_ACCESS_NET_CONNECT_TCP;
> + net_port.port = 1024;
> + ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NET_PORT,
> + &net_port, 0));
> +
> + EXPECT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0));
> +
> + child = fork();
> + ASSERT_LE(0, child);
> + if (child == 0) {
> + struct sockaddr_in bind_addr = {
> + .sin_family = AF_INET,
> + .sin_port = htons(1024),
> + .sin_addr.s_addr = htonl(INADDR_LOOPBACK),
> + };
> + struct sockaddr_in conn_addr = {
> + .sin_family = AF_INET,
> + .sin_port = htons(1025),
> + .sin_addr.s_addr = htonl(INADDR_LOOPBACK),
> + };
> + int sock_fd, optval = 1;
> +
> + EXPECT_EQ(0, landlock_restrict_self(ruleset_fd, 0));
> + close(ruleset_fd);
> +
> + sock_fd = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, 0);
> + ASSERT_LE(0, sock_fd);
> + ASSERT_EQ(0, setsockopt(sock_fd, SOL_SOCKET, SO_REUSEADDR,
> + &optval, sizeof(optval)));
> +
> + /* Bind to allowed port 1024 (succeeds). */
> + ASSERT_EQ(0, bind(sock_fd, (struct sockaddr *)&bind_addr,
> + sizeof(bind_addr)));
> +
> + /* Connect to denied port 1025 (fails). */
> + EXPECT_EQ(-1, connect(sock_fd, (struct sockaddr *)&conn_addr,
> + sizeof(conn_addr)));
> + EXPECT_EQ(EACCES, errno);
> + close(sock_fd);
> +
> + /* Verify audit record with laddr/lport from bound socket. */
> + EXPECT_EQ(0, matches_log_net_connect(_metadata, self->audit_fd,
> + 1025, 1024, &denial_dom));
> + EXPECT_NE(denial_dom, 1);
> + EXPECT_NE(denial_dom, 0);
> +
> + _exit(_metadata->exit_code);
> + return;
> + }
> +
> + ASSERT_EQ(child, waitpid(child, &status, 0));
> + if (WIFSIGNALED(status) || !WIFEXITED(status) ||
> + WEXITSTATUS(status) != EXIT_SUCCESS)
> + _metadata->exit_code = KSFT_FAIL;
> +
> + EXPECT_EQ(0, close(ruleset_fd));
> +}
> +
> struct thread_data {
> pid_t parent_pid;
> int ruleset_fd, pipe_child, pipe_parent;
> --
> 2.53.0
>
>
^ permalink raw reply
* Re: [PATCHv3 07/12] selftests/bpf: Emit nop,nop10 instructions combo for x86_64 arch
From: Jakub Sitnicki @ 2026-05-26 10:58 UTC (permalink / raw)
To: Jiri Olsa
Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <20260521124411.31133-8-jolsa@kernel.org>
On Thu, May 21, 2026 at 02:44 PM +02, Jiri Olsa wrote:
> Syncing latest usdt.h change [1].
>
> Now that we have nop10 optimization support in kernel, let's emit
> nop,nop10 for usdt probe. We leave it up to the library to use
> desirable nop instruction.
>
> [1] TBD
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
^ permalink raw reply
* [PATCH RFC 0/3] Demote to lower tier using non-temporal stores
From: Yiannis Nikolakopoulos @ 2026-05-26 11:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Trond Myklebust, Anna Schumaker,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Brendan Jackman,
Johannes Weiner
Cc: David Rientjes, Davidlohr Bueso, Fan Ni, Frank van der Linden,
Jonathan Cameron, Raghavendra K T, Rao, Bharata Bhasker,
SeongJae Park, Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos,
dimitrios, Ryan Roberts, linux-kernel, linux-mm, linux-nfs,
linux-trace-kernel, Yiannis Nikolakopoulos, Alirad Malek
In most memory tiering scenarios, the memory to be demoted is expected
to be cold and most likely out of the node's last-level cache (as well
as target pages in the target node). Using non-temporal stores instead
of a standard memcpy path can reduce the cache pollution in the local
node and the bandwidth overhead to the target node. Furthermore, for
certain types of CXL devices that support in-line memory compression,
the last-level cache eviction patterns can negatively affect the
bandwidth of the device. Non-temporal stores can mitigate this.
This patch-set introduces a new migrate_mode flag for using non-temporal
stores that is used only in the demotion path. Patch 1 adds some helpers in
x86 and mm to bring non-temporal stores support to a respective folio_copy
function. Patch 2 adds the new flag and necessary changes for compatibility
with the existing behavior. Patch 3 uses the new flag for demotions and
guards this by a Kconfig option.
Experimental data: in a CXL system with 1 memory expander, a microbenchmark
that allocates N=64 GB memory in the local node and then triggers demotion
using memory.reclaim, shows a practically complete elimination of read
traffic on the device, i.e. write traffic is N GB with and without the
patch, while read traffic drops from N to almost 0 with the patch.
Opens:
1. There is some "duplication" in the x86 tree and a bit in mm. Can we do
something better there? As it is now in copy_mc_to_kernel_nt we
duplicate the machine check functionality, which if available will override
the non-temporal. We were not sure how to prioritize these two and what's
the best approach here. Can we completely skip the machine checked for this
path?
2. We've hidden the use of the new flag behind a Kconfig option. Is this
the right way to add this?
3. We have not carefully considered how this should be structured so that
it is easily adopted in other architecture trees (e.g. aarch64). Arm
support is currently out of our scope but any input is appreciated.
Signed-off-by: Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>
---
Alirad Malek (3):
mm, x86: support copying a folio using non-temporal stores
mm: new migrate_mode flag for async using non-temporal stores
mm: use non-temporal stores for demotion
arch/x86/include/asm/uaccess.h | 4 ++++
arch/x86/lib/copy_mc.c | 26 ++++++++++++++++++++++++++
fs/nfs/write.c | 2 +-
include/linux/highmem.h | 32 ++++++++++++++++++++++++++++++++
include/linux/migrate_mode.h | 9 +++++++++
include/linux/mm.h | 1 +
include/trace/events/migrate.h | 1 +
mm/Kconfig | 8 ++++++++
mm/compaction.c | 18 +++++++++---------
mm/migrate.c | 21 ++++++++++++++-------
mm/util.c | 17 +++++++++++++++++
11 files changed, 122 insertions(+), 17 deletions(-)
---
base-commit: 1f318b96cc84d7c2ab792fcc0bfd42a7ca890681
change-id: 20260526-rfc-nt-demote-0fafadfdd006
Best regards,
--
Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>
^ permalink raw reply
* [PATCH RFC 1/3] mm, x86: support copying a folio using non-temporal stores
From: Yiannis Nikolakopoulos @ 2026-05-26 11:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Trond Myklebust, Anna Schumaker,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Brendan Jackman,
Johannes Weiner
Cc: David Rientjes, Davidlohr Bueso, Fan Ni, Frank van der Linden,
Jonathan Cameron, Raghavendra K T, Rao, Bharata Bhasker,
SeongJae Park, Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos,
dimitrios, Ryan Roberts, linux-kernel, linux-mm, linux-nfs,
linux-trace-kernel, Yiannis Nikolakopoulos, Alirad Malek
In-Reply-To: <20260526-rfc-nt-demote-v1-0-eb9c9422daef@zptcorp.com>
From: Alirad Malek <alirad.malek@zptcorp.com>
In x86, use memcpy_flushcache (that uses non-temporal store
instructions) to copy a folio. To achieve that, starting from folio_mc_copy
down to copy_mc_to_kernel, create a series of helpers (named with an _nt
suffix) that have similar behavior to the original counterparts.
Signed-off-by: Alirad Malek <alirad.malek@zptcorp.com>
Co-developed-by: Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>
Signed-off-by: Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>
---
arch/x86/include/asm/uaccess.h | 4 ++++
arch/x86/lib/copy_mc.c | 26 ++++++++++++++++++++++++++
include/linux/highmem.h | 32 ++++++++++++++++++++++++++++++++
include/linux/mm.h | 1 +
mm/util.c | 17 +++++++++++++++++
5 files changed, 80 insertions(+)
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 367297b188c3..2d0938d3e372 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -494,6 +494,10 @@ unsigned long __must_check
copy_mc_to_kernel(void *to, const void *from, unsigned len);
#define copy_mc_to_kernel copy_mc_to_kernel
+unsigned long __must_check
+copy_mc_to_kernel_nt(void *to, const void *from, unsigned len);
+#define copy_mc_to_kernel_nt copy_mc_to_kernel_nt
+
unsigned long __must_check
copy_mc_to_user(void __user *to, const void *from, unsigned len);
#endif
diff --git a/arch/x86/lib/copy_mc.c b/arch/x86/lib/copy_mc.c
index 97e88e58567b..5a2ee5c2211e 100644
--- a/arch/x86/lib/copy_mc.c
+++ b/arch/x86/lib/copy_mc.c
@@ -81,6 +81,32 @@ unsigned long __must_check copy_mc_to_kernel(void *dst, const void *src, unsigne
}
EXPORT_SYMBOL_GPL(copy_mc_to_kernel);
+/**
+ * copy_mc_to_kernel_nt - memory copy that handles source exceptions
+ * if enabled, otherwise uses non-temporal stores
+ * @dst: destination address
+ * @src: source address
+ * @len: number of bytes to copy
+ *
+ * Return 0 for success, or number of bytes not copied if there was an
+ * exception.
+ */
+unsigned long __must_check copy_mc_to_kernel_nt(void *dst, const void *src, unsigned len)
+{
+ unsigned long ret;
+
+ if (copy_mc_fragile_enabled) {
+ instrument_memcpy_before(dst, src, len);
+ ret = copy_mc_fragile(dst, src, len);
+ instrument_memcpy_after(dst, src, len, ret);
+ return ret;
+ }
+
+ memcpy_flushcache(dst, src, len);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(copy_mc_to_kernel_nt);
+
unsigned long __must_check copy_mc_to_user(void __user *dst, const void *src, unsigned len)
{
unsigned long ret;
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index af03db851a1d..a5cb435b9ffe 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -468,6 +468,32 @@ static inline int copy_mc_highpage(struct page *to, struct page *from)
return ret;
}
+
+#ifdef copy_mc_to_kernel_nt
+static inline int copy_mc_highpage_nt(struct page *to, struct page *from)
+{
+ unsigned long ret;
+ char *vfrom, *vto;
+
+ vfrom = kmap_local_page(from);
+ vto = kmap_local_page(to);
+ ret = copy_mc_to_kernel_nt(vto, vfrom, PAGE_SIZE);
+ if (!ret)
+ kmsan_copy_page_meta(to, from);
+ kunmap_local(vto);
+ kunmap_local(vfrom);
+
+ if (ret)
+ memory_failure_queue(page_to_pfn(from), 0);
+
+ return ret;
+}
+#else
+static inline int copy_mc_highpage_nt(struct page *to, struct page *from)
+{
+ return copy_mc_highpage(to, from);
+}
+#endif
#else
static inline int copy_mc_user_highpage(struct page *to, struct page *from,
unsigned long vaddr, struct vm_area_struct *vma)
@@ -481,6 +507,12 @@ static inline int copy_mc_highpage(struct page *to, struct page *from)
copy_highpage(to, from);
return 0;
}
+
+static inline int copy_mc_highpage_nt(struct page *to, struct page *from)
+{
+ copy_highpage(to, from);
+ return 0;
+}
#endif
static inline void memcpy_page(struct page *dst_page, size_t dst_off,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5be3d8a8f806..d07ce478582d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1644,6 +1644,7 @@ void __folio_put(struct folio *folio);
void split_page(struct page *page, unsigned int order);
void folio_copy(struct folio *dst, struct folio *src);
int folio_mc_copy(struct folio *dst, struct folio *src);
+int folio_mc_copy_nt(struct folio *dst, struct folio *src);
unsigned long nr_free_buffer_pages(void);
diff --git a/mm/util.c b/mm/util.c
index b05ab6f97e11..e09e9b5f8eee 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -749,6 +749,23 @@ int folio_mc_copy(struct folio *dst, struct folio *src)
}
EXPORT_SYMBOL(folio_mc_copy);
+int folio_mc_copy_nt(struct folio *dst, struct folio *src)
+{
+ long nr = folio_nr_pages(src);
+ long i = 0;
+
+ for (;;) {
+ if (copy_mc_highpage_nt(folio_page(dst, i), folio_page(src, i)))
+ return -EHWPOISON;
+ if (++i == nr)
+ break;
+ cond_resched();
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(folio_mc_copy_nt);
+
int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;
static int sysctl_overcommit_ratio __read_mostly = 50;
static unsigned long sysctl_overcommit_kbytes __read_mostly;
--
2.43.0
^ permalink raw reply related
* [PATCH RFC 3/3] mm: use non-temporal stores for demotion
From: Yiannis Nikolakopoulos @ 2026-05-26 11:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Trond Myklebust, Anna Schumaker,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Brendan Jackman,
Johannes Weiner
Cc: David Rientjes, Davidlohr Bueso, Fan Ni, Frank van der Linden,
Jonathan Cameron, Raghavendra K T, Rao, Bharata Bhasker,
SeongJae Park, Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos,
dimitrios, Ryan Roberts, linux-kernel, linux-mm, linux-nfs,
linux-trace-kernel, Yiannis Nikolakopoulos, Alirad Malek
In-Reply-To: <20260526-rfc-nt-demote-v1-0-eb9c9422daef@zptcorp.com>
From: Alirad Malek <alirad.malek@zptcorp.com>
Memory demoted to a lower tier is assumed to be cold and most likely out of
the CPU's last level cache. Additionally, in certain demotion targets (e.g.
CXL devices with compressed memory) the bandwidth can be negatively
impacted by the eviction patterns of the last level cache when standard
memcpy is used. When the feature is enabled, use the
MIGRATE_ASYNC_NON_TEMPORAL_STORES flag in demotions to trigger the folio
copy path using non-temporal stores.
Signed-off-by: Alirad Malek <alirad.malek@zptcorp.com>
Co-developed-by: Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>
Signed-off-by: Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>
---
mm/Kconfig | 8 ++++++++
mm/migrate.c | 9 ++++++++-
2 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index ebd8ea353687..4b7a75b57f6e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -645,6 +645,14 @@ config MIGRATION
pages as migration can relocate pages to satisfy a huge page
allocation instead of reclaiming.
+config DEMOTION_WITH_NON_TEMPORAL_STORES
+ bool "Use non-temporal stores for demotion"
+ default n
+ depends on MIGRATION
+ help
+ Enable non-temporal stores when migrating pages due to demotion.
+ If disabled, demotion uses regular migration copy paths.
+
config DEVICE_MIGRATION
def_bool MIGRATION && ZONE_DEVICE
diff --git a/mm/migrate.c b/mm/migrate.c
index ff6cf50e7b0b..368d40dc8772 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -862,7 +862,10 @@ static int __migrate_folio(struct address_space *mapping, struct folio *dst,
if (folio_ref_count(src) != expected_count)
return -EAGAIN;
- rc = folio_mc_copy(dst, src);
+ if (mode == MIGRATE_ASYNC_NON_TEMPORAL_STORES)
+ rc = folio_mc_copy_nt(dst, src);
+ else
+ rc = folio_mc_copy(dst, src);
if (unlikely(rc))
return rc;
@@ -2081,6 +2084,10 @@ int migrate_pages(struct list_head *from, new_folio_t get_new_folio,
LIST_HEAD(split_folios);
struct migrate_pages_stats stats;
+ if (IS_ENABLED(CONFIG_DEMOTION_WITH_NON_TEMPORAL_STORES) &&
+ reason == MR_DEMOTION && mode == MIGRATE_ASYNC)
+ mode = MIGRATE_ASYNC_NON_TEMPORAL_STORES;
+
trace_mm_migrate_pages_start(mode, reason);
memset(&stats, 0, sizeof(stats));
--
2.43.0
^ permalink raw reply related
* [PATCH RFC 2/3] mm: new migrate_mode flag for async using non-temporal stores
From: Yiannis Nikolakopoulos @ 2026-05-26 11:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Trond Myklebust, Anna Schumaker,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Brendan Jackman,
Johannes Weiner
Cc: David Rientjes, Davidlohr Bueso, Fan Ni, Frank van der Linden,
Jonathan Cameron, Raghavendra K T, Rao, Bharata Bhasker,
SeongJae Park, Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos,
dimitrios, Ryan Roberts, linux-kernel, linux-mm, linux-nfs,
linux-trace-kernel, Yiannis Nikolakopoulos, Alirad Malek
In-Reply-To: <20260526-rfc-nt-demote-v1-0-eb9c9422daef@zptcorp.com>
From: Alirad Malek <alirad.malek@zptcorp.com>
In preparation for the following patch, add a new migrate_mode which is
still async but will use non-temporal stores. Add a helper function
that checks for both async modes and replace all plain checks of
MIGRATE_ASYNC.
Signed-off-by: Alirad Malek <alirad.malek@zptcorp.com>
Co-developed-by: Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>
Signed-off-by: Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>
---
fs/nfs/write.c | 2 +-
include/linux/migrate_mode.h | 9 +++++++++
include/trace/events/migrate.h | 1 +
mm/compaction.c | 18 +++++++++---------
mm/migrate.c | 12 ++++++------
5 files changed, 26 insertions(+), 16 deletions(-)
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 1ed4b3590b1a..beae4441e080 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -2119,7 +2119,7 @@ int nfs_migrate_folio(struct address_space *mapping, struct folio *dst,
}
if (folio_test_private_2(src)) { /* [DEPRECATED] */
- if (mode == MIGRATE_ASYNC)
+ if (migrate_mode_is_async(mode))
return -EBUSY;
folio_wait_private_2(src);
}
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index 265c4328b36a..f7186e705b48 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -3,6 +3,8 @@
#define MIGRATE_MODE_H_INCLUDED
/*
* MIGRATE_ASYNC means never block
+ * MIGRATE_ASYNC_NON_TEMPORAL_STORES means never block and use non-temporal
+ * stores if supported by the architecture
* MIGRATE_SYNC_LIGHT in the current implementation means to allow blocking
* on most operations but not ->writepage as the potential stall time
* is too significant
@@ -10,10 +12,17 @@
*/
enum migrate_mode {
MIGRATE_ASYNC,
+ MIGRATE_ASYNC_NON_TEMPORAL_STORES,
MIGRATE_SYNC_LIGHT,
MIGRATE_SYNC,
};
+static inline bool migrate_mode_is_async(enum migrate_mode mode)
+{
+ return mode == MIGRATE_ASYNC ||
+ mode == MIGRATE_ASYNC_NON_TEMPORAL_STORES;
+}
+
enum migrate_reason {
MR_COMPACTION,
MR_MEMORY_FAILURE,
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index cd01dd7b3640..e493207a3f46 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -9,6 +9,7 @@
#define MIGRATE_MODE \
EM( MIGRATE_ASYNC, "MIGRATE_ASYNC") \
+ EM(MIGRATE_ASYNC_NON_TEMPORAL_STORES, "MIGRATE_ASYNC_NON_TEMPORAL_STORES") \
EM( MIGRATE_SYNC_LIGHT, "MIGRATE_SYNC_LIGHT") \
EMe(MIGRATE_SYNC, "MIGRATE_SYNC")
diff --git a/mm/compaction.c b/mm/compaction.c
index 1e8f8eca318c..cd26781b7376 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -444,7 +444,7 @@ static void update_cached_migrate(struct compact_control *cc, unsigned long pfn)
/* Update where async and sync compaction should restart */
if (pfn > zone->compact_cached_migrate_pfn[0])
zone->compact_cached_migrate_pfn[0] = pfn;
- if (cc->mode != MIGRATE_ASYNC &&
+ if (!migrate_mode_is_async(cc->mode) &&
pfn > zone->compact_cached_migrate_pfn[1])
zone->compact_cached_migrate_pfn[1] = pfn;
}
@@ -507,7 +507,7 @@ static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags,
__acquires(lock)
{
/* Track if the lock is contended in async mode */
- if (cc->mode == MIGRATE_ASYNC && !cc->contended) {
+ if (migrate_mode_is_async(cc->mode) && !cc->contended) {
if (spin_trylock_irqsave(lock, *flags))
return true;
@@ -864,7 +864,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
return -EAGAIN;
/* async migration should just abort */
- if (cc->mode == MIGRATE_ASYNC)
+ if (migrate_mode_is_async(cc->mode))
return -EAGAIN;
reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED);
@@ -875,7 +875,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
cond_resched();
- if (cc->direct_compaction && (cc->mode == MIGRATE_ASYNC)) {
+ if (cc->direct_compaction && migrate_mode_is_async(cc->mode)) {
skip_on_failure = true;
next_skip_pfn = block_end_pfn(low_pfn, cc->order);
}
@@ -1364,7 +1364,7 @@ static bool suitable_migration_source(struct compact_control *cc,
if (pageblock_skip_persistent(page))
return false;
- if ((cc->mode != MIGRATE_ASYNC) || !cc->direct_compaction)
+ if (!migrate_mode_is_async(cc->mode) || !cc->direct_compaction)
return true;
block_mt = get_pageblock_migratetype(page);
@@ -1465,7 +1465,7 @@ fast_isolate_around(struct compact_control *cc, unsigned long pfn)
return;
/* Minimise scanning during async compaction */
- if (cc->direct_compaction && cc->mode == MIGRATE_ASYNC)
+ if (cc->direct_compaction && migrate_mode_is_async(cc->mode))
return;
/* Pageblock boundaries */
@@ -1705,7 +1705,7 @@ static void isolate_freepages(struct compact_control *cc)
block_end_pfn = min(block_start_pfn + pageblock_nr_pages,
zone_end_pfn(zone));
low_pfn = pageblock_end_pfn(cc->migrate_pfn);
- stride = cc->mode == MIGRATE_ASYNC ? COMPACT_CLUSTER_MAX : 1;
+ stride = migrate_mode_is_async(cc->mode) ? COMPACT_CLUSTER_MAX : 1;
/*
* Isolate free pages until enough are available to migrate the
@@ -2514,7 +2514,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
unsigned long start_pfn = cc->zone->zone_start_pfn;
unsigned long end_pfn = zone_end_pfn(cc->zone);
unsigned long last_migrated_pfn;
- const bool sync = cc->mode != MIGRATE_ASYNC;
+ const bool sync = !migrate_mode_is_async(cc->mode);
bool update_cached;
unsigned int nr_succeeded = 0, nr_migratepages;
int order;
@@ -2537,7 +2537,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
ret = compaction_suit_allocation_order(cc->zone, cc->order,
cc->highest_zoneidx,
cc->alloc_flags,
- cc->mode == MIGRATE_ASYNC,
+ migrate_mode_is_async(cc->mode),
!cc->direct_compaction);
if (ret != COMPACT_CONTINUE)
return ret;
diff --git a/mm/migrate.c b/mm/migrate.c
index 2c3d489ecf51..ff6cf50e7b0b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -907,7 +907,7 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head,
do {
if (!trylock_buffer(bh)) {
- if (mode == MIGRATE_ASYNC)
+ if (migrate_mode_is_async(mode))
goto unlock;
if (mode == MIGRATE_SYNC_LIGHT && !buffer_uptodate(bh))
goto unlock;
@@ -1220,7 +1220,7 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
dst->private = NULL;
if (!folio_trylock(src)) {
- if (mode == MIGRATE_ASYNC)
+ if (migrate_mode_is_async(mode))
goto out;
/*
@@ -1325,7 +1325,7 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
/* Establish migration ptes */
VM_BUG_ON_FOLIO(folio_test_anon(src) &&
!folio_test_ksm(src) && !anon_vma, src);
- try_to_migrate(src, mode == MIGRATE_ASYNC ? TTU_BATCH_FLUSH : 0);
+ try_to_migrate(src, migrate_mode_is_async(mode) ? TTU_BATCH_FLUSH : 0);
old_page_state |= PAGE_WAS_MAPPED;
}
@@ -1565,7 +1565,7 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
{
int rc;
- if (mode == MIGRATE_ASYNC) {
+ if (migrate_mode_is_async(mode)) {
if (!folio_trylock(folio))
return -EAGAIN;
} else {
@@ -1799,7 +1799,7 @@ static int migrate_pages_batch(struct list_head *from,
LIST_HEAD(dst_folios);
bool nosplit = (reason == MR_NUMA_MISPLACED);
- VM_WARN_ON_ONCE(mode != MIGRATE_ASYNC &&
+ VM_WARN_ON_ONCE(!migrate_mode_is_async(mode) &&
!list_empty(from) && !list_is_singular(from));
for (pass = 0; pass < nr_pass && retry; pass++) {
@@ -2107,7 +2107,7 @@ int migrate_pages(struct list_head *from, new_folio_t get_new_folio,
list_cut_before(&folios, from, &folio2->lru);
else
list_splice_init(from, &folios);
- if (mode == MIGRATE_ASYNC)
+ if (migrate_mode_is_async(mode))
rc = migrate_pages_batch(&folios, get_new_folio, put_new_folio,
private, mode, reason, &ret_folios,
&split_folios, &stats,
--
2.43.0
^ permalink raw reply related
* [RFC PATCH v3 0/3] trace: stack trace deduplication for ftrace ring buffer
From: Li Pengfei @ 2026-05-26 11:52 UTC (permalink / raw)
To: mhiramat, rostedt
Cc: linux-trace-kernel, linux-kernel, cmllamas, zhangbo56, Pengfei Li
In-Reply-To: <20260514034916.2162517-1-lipengfei28@xiaomi.com>
From: Pengfei Li <lipengfei28@xiaomi.com>
Hi Masami, Steven, all,
This is v3 of the ftrace stackmap series. It addresses the Sashiko
review on v2 [1] that Masami pointed out.
[1] https://sashiko.dev/#/patchset/20260522104017.1668638-1-lipengfei28%40xiaomi.com
The series adds stack trace deduplication to ftrace. When the
stacktrace option is enabled, the ring buffer stores a 4-byte
stack_id instead of a full kernel stack trace, while the full
stacks are exported via tracefs.
Rebased onto v7.1-rc5 (e8c2f9fdadee) before sending.
Changes since v2
================
Patch 1 (lock-free stackmap):
- Hot-path counters changed from atomic64_t to per-CPU local_t.
This avoids the raw_spinlock_t fallback that atomic64_t uses on
32-bit GENERIC_ATOMIC64, which would deadlock from NMI context.
- reset() now serializes against tracefs readers via an
rw_semaphore (held for write during the clearing memset, held
for read by seq_file iteration and bin snapshot construction).
synchronize_rcu() alone was insufficient because seq_file/bin
readers are in process context, not preempt-disabled.
- get_id() uses atomic_read_acquire() on smap->resetting so
subsequent loads of entry->key/val are properly ordered after
the check (LKMM control dependencies only order stores).
- All plain reads of entry->key now use READ_ONCE() to avoid
LKMM data races with the cmpxchg writer.
- val->nr in the hot path now uses READ_ONCE() to keep style
consistent with the seq_show / bin_open readers.
- stackmap_seq_next() now updates *pos past map_size on EOF so
seq_read() terminates instead of looping on the last element.
- Added a comment in the cmpxchg-claim path documenting that
two CPUs racing with the same key_hash may produce a small
number of duplicate entries; this is an accepted trade-off
for keeping the hot path lock-free.
- Removed BUG_ON in create path (the constraint is satisfied by
construction; no runtime check needed).
Patch 2 (integration):
- 'stackmap' is added to TOP_LEVEL_TRACE_FLAGS and
ZEROED_TRACE_FLAGS so the option is only exposed under the
top-level trace instance, matching the convention used for
other global-only options such as 'printk' and 'record-cmd'.
Secondary instances under tracing/instances/*/ no longer see
the option at all, instead of seeing it as a silent no-op.
- TRACE_STACK_ID added to trace_valid_entry() in trace_selftest.c
so ftrace startup selftests don't reject the entry type.
- Corrected a comment about how global_trace.stackmap is
zero-initialized (BSS, not kzalloc).
Patch 3 (docs / selftest / tooling):
- Selftest now reads trace contents BEFORE switching back to the
nop tracer (tracer_init() calls tracing_reset_online_cpus()
which would have emptied the ring buffer).
- Added 'function:tracer' to the selftest '# requires:' line so
ftracetest skips when CONFIG_FUNCTION_TRACER is disabled
instead of failing spuriously.
- Selftest grep tightened to '<stack_id' to avoid future
false-positives if any other tracepoint name contains
"stack_id".
- New stackmap-instance-gate.tc selftest asserts the option and
stack_map* nodes are present on the global instance and absent
on a freshly-created secondary instance, locking in the
TOP_LEVEL_TRACE_FLAGS gating behavior introduced in patch 2.
- Documentation Performance section made vendor-neutral
("aarch64 SMP system" instead of a specific device name) and
the term "Hit rate" replaced with "Dedup rate" to match the
actual stat field name (success_rate).
- Documentation Design section now states that deduplication is
best-effort under heavy contention (cmpxchg races may produce
a small number of duplicate entries for the same stack), so
users observing entries > unique-stacks have a documented
explanation.
Test results
============
Device: Xiaomi SM8850 (ARM64), Android 16, kernel 6.12 (OGKI)
Config: CONFIG_FTRACE_STACKMAP=y, bits=14 (16384 elts, 32768 slots)
Method: 5-second capture with stacktrace trigger
Functional tests (all PASS):
- tracefs nodes (stack_map / stack_map_stat / stack_map_bin) exist
- options/stackmap writable, trace shows <stack_id N>
- stack_map text export with correct symbols
- reset clears entries when tracing stopped
- reset rejected (-EBUSY) while tracing active
- per-event trigger: only specified events get stacks
Performance (sched_switch, 5s):
entries: 466 / 16384
successes: 9159
drops: 0
success_rate: 100%
dedup rate: 95.2% (466 unique stacks / 9625 total events)
Performance (kmem_cache_alloc, 5s):
entries: 1177 / 16384
successes: 60078
drops: 0
success_rate: 100%
dedup rate: 98.1% (1177 unique stacks / 61255 total events)
Ring buffer space savings:
Event Full stack Stackmap Saving
---------------- --------------- --------------- ------
sched_switch 9625 × 88B=847KB 12B×9625+88B×466=156KB 82%
kmem_cache_alloc 61255×88B=5.4MB 12B×61255+88B×1177=839KB 85%
QEMU validation (v3 base: v7.1-rc5)
===================================
The series boots cleanly on aarch64 QEMU. A post-init smoke test
(12/12 PASS) verified all functional behaviors including:
- tracefs nodes appear with correct file modes
- stack_id events emitted, kernel symbols resolve correctly
(e.g. __schedule+0x7cc/0x1138)
- reset rejected with -EBUSY while tracing is active
- reset clears the map when tracing is stopped
- per-CPU local_t counters aggregate correctly across CPUs
- stack_map_bin magic correct (0x464D5342 'FSMB')
- 'stackmap' option visible on the global instance, hidden on
secondary instances under tracing/instances/*/
Boot-time activation via 'trace_options=stackmap,stacktrace' works:
events that fire before stackmap initialization fall back to
recording full stack traces; later events are deduplicated. No
events are dropped due to the transition.
Known limitations
=================
- Per-instance stackmap support is not included in this series.
Following the convention used for other global-only options
(PRINTK, RECORD_CMD), the 'stackmap' option is gated to the
top-level trace instance via TOP_LEVEL_TRACE_FLAGS, so it is
not exposed under tracing/instances/*/options/. Per-instance
maps would be a follow-up.
- The element pool is allocated eagerly at fs_initcall when
CONFIG_FTRACE_STACKMAP=y, regardless of whether userspace will
ever enable the option. At the default bits=14 this is roughly
8 MB of vmalloc; at the maximum bits=18, ~135 MB. The eager
allocation keeps the hot path entirely allocation-free and
avoids any allocation-failure path under tracing pressure.
Lazy allocation on first 'echo 1 > options/stackmap' is a
reasonable follow-up if maintainers prefer that trade-off.
- Deduplication is best-effort, not strict: under heavy
concurrent contention two CPUs racing in the insert path with
the same stack hash may each succeed in claiming a different
slot, producing a small number of duplicate entries for the
same stack. ref_count is then split across the duplicates.
This is intentional: it keeps the hot path lock-free and
bounds memory by the element pool size.
- The stackmap currently covers kernel stacks only.
- stack_map_bin is a best-effort snapshot, not a fully atomic export.
- trace-cmd / libtraceevent integration is left for follow-up once
the binary format settles.
Usage
=====
echo 1 > /sys/kernel/debug/tracing/options/stackmap
echo 1 > /sys/kernel/debug/tracing/options/stacktrace
Pengfei Li (3):
trace: add lock-free stackmap for stack trace deduplication
trace: integrate stackmap into ftrace stack recording path
trace: add documentation, selftest and tooling for stackmap
Documentation/trace/ftrace-stackmap.rst | 162 ++++
Documentation/trace/index.rst | 1 +
kernel/trace/Kconfig | 22 +
kernel/trace/Makefile | 1 +
kernel/trace/trace.c | 78 +-
kernel/trace/trace.h | 16 +
kernel/trace/trace_entries.h | 15 +
kernel/trace/trace_output.c | 23 +
kernel/trace/trace_selftest.c | 1 +
kernel/trace/trace_stackmap.c | 780 ++++++++++++++++++
kernel/trace/trace_stackmap.h | 57 ++
.../ftrace/test.d/ftrace/stackmap-basic.tc | 103 +++
.../test.d/ftrace/stackmap-instance-gate.tc | 42 +
tools/tracing/stackmap_dump.py | 150 ++++
14 files changed, 1449 insertions(+), 2 deletions(-)
create mode 100644 Documentation/trace/ftrace-stackmap.rst
create mode 100644 kernel/trace/trace_stackmap.c
create mode 100644 kernel/trace/trace_stackmap.h
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
create mode 100755 tools/tracing/stackmap_dump.py
--
2.34.1
^ permalink raw reply
* [RFC PATCH v3 1/3] trace: add lock-free stackmap for stack trace deduplication
From: Li Pengfei @ 2026-05-26 11:52 UTC (permalink / raw)
To: mhiramat, rostedt
Cc: linux-trace-kernel, linux-kernel, cmllamas, zhangbo56, Pengfei Li
In-Reply-To: <cover.1779769138.git.lipengfei28@xiaomi.com>
From: Pengfei Li <lipengfei28@xiaomi.com>
Add a lock-free hash map (ftrace_stackmap) that deduplicates kernel
stack traces for the ftrace ring buffer. Instead of storing full
stack traces (80-160 bytes each) in the ring buffer for every event,
ftrace can store a 4-byte stack_id when the stackmap option is enabled.
The implementation is modeled after tracing_map.c (used by hist
triggers), using the same lock-free design based on Dr. Cliff Click's
non-blocking hash table algorithm:
- Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
- Pre-allocated element pool (zero allocation on hot path)
- Linear probing with 2x over-provisioned table; probe length is
bounded by FTRACE_STACKMAP_MAX_PROBE so worst-case insert/lookup
is O(1) even when the table is heavily loaded with claimed-but-
empty slots from pool exhaustion
- Single global instance (initialized for the global trace array)
The Kconfig depends on ARCH_HAVE_NMI_SAFE_CMPXCHG, matching the
existing tracing_map / hist_triggers requirement: the lock-free
hot path uses cmpxchg in a context that may be reached from NMI.
The stackmap is exported via three tracefs nodes:
- stack_map: text export with symbol resolution (mode 0640)
- stack_map_stat: counters (entries, successes, drops, success_rate)
- stack_map_bin: binary export (all fields native-endian)
Hot-path counters use per-CPU local_t (NMI-safe single-instruction
increments) instead of atomic64_t. atomic64_t falls back to
raw_spinlock_t-based emulation on 32-bit GENERIC_ATOMIC64 systems,
which would deadlock if an NMI hit while the spinlock was held.
local_t avoids this hazard.
Reset semantics:
- Reset is a control-path operation only allowed when tracing is
stopped on the owning trace_array. Online reset (with tracing
active) is intentionally not supported.
- Reset uses atomic_cmpxchg() to claim the resetting flag, then
verifies tracer_tracing_is_on() returns false.
- synchronize_rcu() drains in-flight get_id() callers from the
ftrace callback path (which runs preempt-disabled).
- A reader_sem (rw_semaphore) serializes the clearing memset
against tracefs readers (seq_file iteration and stack_map_bin
snapshot), which run in process context and aren't covered by
synchronize_rcu(). The hot path doesn't take this lock.
- Reset clears the resetting flag with atomic_set_release() so a
subsequent get_id() observes a fully cleared map.
- get_id() uses atomic_read_acquire() on resetting so subsequent
loads of entry->key/val are properly ordered after the check
(control dependencies only order stores per LKMM).
- Concurrent reset returns -EBUSY; reset while tracing is active
returns -EBUSY.
Concurrency notes:
- entry->val publication uses smp_store_release() paired with
smp_load_acquire() in all dereferencing readers.
- entry->key reads (in get_id, seq_start/next, bin_open) use
READ_ONCE() to avoid LKMM data races with the cmpxchg writer.
- elt->nr is read with READ_ONCE() and clamped to MAX_DEPTH before
use in seq_show and bin_open.
- Pool exhaustion: stackmap_get_elt() short-circuits via
atomic_read() before the contended atomic RMW, avoiding cacheline
contention once the pool is full. Slots that win cmpxchg but
cannot get an elt are left 'claimed but empty'; subsequent
lookups treat val==NULL as a miss and probe past them.
Hash key:
- Per-instance random seed stored in the stackmap struct (no
global state), seeded at create time.
- 32-bit jhash is forced to 1 if it lands on 0 (which is the
free-slot sentinel). Full memcmp confirms matches.
Memory:
- Single flat vmalloc for the element pool (no per-elt kzalloc).
- bits parameter clamped to [10, 18]: at the maximum bits=18, the
element pool is ~135 MB and a stack_map_bin snapshot may briefly
allocate another ~135 MB.
- struct stackmap_bin_snapshot uses u64 (not size_t) for its size
field so data[] is 8-byte aligned on both 32-bit and 64-bit
architectures, avoiding alignment faults when writing u64 IPs
on strict-alignment architectures.
Kernel command line parameter:
- ftrace_stackmap.bits=N: set map capacity (2^N unique stacks,
range 10-18, default 14)
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
kernel/trace/Kconfig | 22 +
kernel/trace/Makefile | 1 +
kernel/trace/trace_stackmap.c | 780 ++++++++++++++++++++++++++++++++++
kernel/trace/trace_stackmap.h | 57 +++
4 files changed, 860 insertions(+)
create mode 100644 kernel/trace/trace_stackmap.c
create mode 100644 kernel/trace/trace_stackmap.h
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e130da35808f..e49cae886ff0 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -412,6 +412,28 @@ config STACK_TRACER
Say N if unsure.
+config FTRACE_STACKMAP
+ bool "Ftrace stack map deduplication"
+ depends on TRACING
+ depends on STACKTRACE
+ depends on ARCH_HAVE_NMI_SAFE_CMPXCHG
+ select KALLSYMS
+ help
+ This enables a global stack trace hash table for ftrace, inspired
+ by eBPF's BPF_MAP_TYPE_STACK_TRACE. When enabled, ftrace can store
+ only a stack_id in the ring buffer instead of the full stack trace,
+ significantly reducing trace buffer usage when the same call stacks
+ appear repeatedly.
+
+ The deduplicated stacks are exported via:
+ /sys/kernel/debug/tracing/stack_map
+
+ Writing to this file resets the stack map. Reading shows all unique
+ stacks with their stack_id and reference count.
+
+ Say Y if you want to reduce ftrace buffer usage for stack traces.
+ Say N if unsure.
+
config TRACE_PREEMPT_TOGGLE
bool
help
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 8d3d96e847d8..c2d9b2bf895a 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -85,6 +85,7 @@ obj-$(CONFIG_HWLAT_TRACER) += trace_hwlat.o
obj-$(CONFIG_OSNOISE_TRACER) += trace_osnoise.o
obj-$(CONFIG_NOP_TRACER) += trace_nop.o
obj-$(CONFIG_STACK_TRACER) += trace_stack.o
+obj-$(CONFIG_FTRACE_STACKMAP) += trace_stackmap.o
obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o
obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += trace_functions_graph.o
obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o
diff --git a/kernel/trace/trace_stackmap.c b/kernel/trace/trace_stackmap.c
new file mode 100644
index 000000000000..c89f6d527c96
--- /dev/null
+++ b/kernel/trace/trace_stackmap.c
@@ -0,0 +1,780 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Ftrace Stack Map - Lock-free stack trace deduplication for ftrace
+ *
+ * Modeled after tracing_map.c (used by hist triggers), this provides
+ * a lock-free hash map optimized for the ftrace hot path. The design
+ * is based on Dr. Cliff Click's non-blocking hash table algorithm.
+ *
+ * Key properties:
+ * - Lock-free insert via cmpxchg, safe in NMI/IRQ/any context
+ * - Pre-allocated element pool (zero allocation on hot path)
+ * - Linear probing with 2x over-provisioned table; probe length
+ * bounded by FTRACE_STACKMAP_MAX_PROBE to keep worst-case lookup
+ * cost constant even when the table is heavily loaded
+ * - Single global instance (initialized for the global trace array)
+ *
+ * Reset is a control-path operation, only allowed when tracing is
+ * stopped on the owning trace_array. The protocol is:
+ *
+ * - atomic_cmpxchg(&resetting, 0, 1) atomically claims reset rights
+ * and blocks new get_id() callers (they observe resetting=1 and
+ * return -EINVAL).
+ * - tracer_tracing_is_on() is checked AFTER the cmpxchg, so the
+ * resetting flag itself prevents new insertions even if userspace
+ * re-enables tracing immediately after the check.
+ * - synchronize_rcu() drains in-flight get_id() callers from the
+ * ftrace callback path, which runs with preemption disabled.
+ *
+ * Online reset (with tracing active) is intentionally not supported
+ * to keep the design simple and the proof obligations small.
+ *
+ * The 32-bit jhash of the stack IPs is the hash table key. On hash
+ * collision, linear probing finds the next slot and full memcmp
+ * confirms the match.
+ *
+ * Concurrent userspace readers (cat stack_map / stack_map_bin) get
+ * a best-effort snapshot. They are coherent with the hot path
+ * (smp_load_acquire on entry->val), but they are not coherent with
+ * a concurrent reset; since reset requires tracing to be stopped,
+ * mid-iteration reset can produce truncated or partial output but
+ * never crashes.
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/jhash.h>
+#include <linux/seq_file.h>
+#include <linux/kallsyms.h>
+#include <linux/vmalloc.h>
+#include <linux/atomic.h>
+#include <linux/local_lock.h>
+#include <linux/percpu.h>
+#include <linux/random.h>
+#include <linux/rcupdate.h>
+#include <linux/log2.h>
+#include <asm/local.h>
+
+#include "trace.h"
+#include "trace_stackmap.h"
+
+/*
+ * Bound the linear-probe scan length. With a 2x over-provisioned table,
+ * a well-distributed hash gives very short probe chains. Capping at 64
+ * keeps worst-case lookup O(1) even when the table is heavily loaded
+ * with claimed-but-empty slots from pool exhaustion.
+ */
+#define FTRACE_STACKMAP_MAX_PROBE 64
+
+/*
+ * Memory ordering of entry->val: published with smp_store_release()
+ * by the inserter; consumed with smp_load_acquire() by every reader
+ * that dereferences the elt (get_id, seq_show, bin_open). This pairs
+ * the writes to elt->{nr,ips,ref_count} (initialized BEFORE the
+ * publish) with the reads of those fields (which happen AFTER the
+ * load). seq_start / seq_next only test val for NULL and use the
+ * acquire load purely to keep memory ordering symmetric.
+ */
+
+/*
+ * Each pre-allocated element holds one unique stack trace.
+ * Fixed size: MAX_DEPTH entries regardless of actual depth.
+ */
+struct stackmap_elt {
+ u32 nr; /* actual number of IPs */
+ atomic_t ref_count;
+ unsigned long ips[FTRACE_STACKMAP_MAX_DEPTH];
+};
+
+/*
+ * Hash table entry: a 32-bit key (jhash of stack) + pointer to elt.
+ * key == 0 means the slot is free.
+ */
+struct stackmap_entry {
+ u32 key; /* 0 = free, non-zero = jhash */
+ struct stackmap_elt *val; /* NULL until fully published */
+};
+
+struct ftrace_stackmap {
+ struct trace_array *tr; /* owning trace_array */
+ unsigned int map_bits;
+ unsigned int map_size; /* 1 << (map_bits + 1) */
+ unsigned int max_elts; /* 1 << map_bits */
+ u32 hash_seed; /* per-instance jhash seed */
+ atomic_t next_elt; /* index into elts pool */
+ struct stackmap_entry *entries; /* hash table */
+ struct stackmap_elt *elts; /* flat element pool */
+ atomic_t resetting;
+ /*
+ * Reader/reset serialization. Held in shared mode (read lock)
+ * across seq_file iteration and binary snapshot construction;
+ * held in exclusive mode (write lock) by reset's clearing
+ * phase. The hot path (get_id) does not take this lock — it
+ * uses smp_load_acquire/smp_store_release on entry->val and
+ * the resetting flag for the lock-free protocol.
+ */
+ struct rw_semaphore reader_sem;
+ /*
+ * Per-CPU counters using local_t. local_t increments are NMI-
+ * safe on all architectures (single-instruction or interrupt-
+ * masked) and avoid the raw_spinlock_t fallback that
+ * atomic64_t uses on 32-bit GENERIC_ATOMIC64 — which would
+ * deadlock if an NMI hit while the spinlock was held.
+ */
+ local_t __percpu *successes; /* events served (hits + new inserts) */
+ local_t __percpu *drops;
+};
+
+/*
+ * Cap the bits parameter to keep worst-case allocations bounded:
+ * bits=18 → 256K elts, 512K slots, ~130 MB elt pool, ~130 MB bin
+ * export.
+ * Smaller workloads should use the default (14) which gives 16K elts
+ * (~8 MB pool); bump bits via the ftrace_stackmap.bits= kernel
+ * parameter for higher unique-stack capacity.
+ */
+#define FTRACE_STACKMAP_BITS_MIN 10
+#define FTRACE_STACKMAP_BITS_MAX 18
+#define FTRACE_STACKMAP_BITS_DEFAULT 14
+
+static unsigned int stackmap_map_bits = FTRACE_STACKMAP_BITS_DEFAULT;
+static int __init stackmap_bits_setup(char *str)
+{
+ unsigned long val;
+
+ if (kstrtoul(str, 0, &val))
+ return -EINVAL;
+ val = clamp_val(val, FTRACE_STACKMAP_BITS_MIN, FTRACE_STACKMAP_BITS_MAX);
+ stackmap_map_bits = val;
+ return 0;
+}
+early_param("ftrace_stackmap.bits", stackmap_bits_setup);
+
+/* --- Element pool --- */
+
+static struct stackmap_elt *stackmap_get_elt(struct ftrace_stackmap *smap)
+{
+ int idx;
+
+ /*
+ * Fast-path early-out once the pool is fully consumed. Avoids
+ * the contended atomic RMW on next_elt for every traced event
+ * after the pool is exhausted.
+ */
+ if (atomic_read(&smap->next_elt) >= smap->max_elts)
+ return NULL;
+
+ idx = atomic_fetch_add_unless(&smap->next_elt, 1, smap->max_elts);
+ if (idx < smap->max_elts)
+ return &smap->elts[idx];
+ return NULL;
+}
+
+/* --- Create / Destroy / Reset --- */
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr)
+{
+ struct ftrace_stackmap *smap;
+ unsigned int bits;
+
+ smap = kzalloc(sizeof(*smap), GFP_KERNEL);
+ if (!smap)
+ return ERR_PTR(-ENOMEM);
+
+ /* Defensive clamp: reject bogus bits even if early_param is bypassed. */
+ bits = clamp_val(stackmap_map_bits,
+ FTRACE_STACKMAP_BITS_MIN,
+ FTRACE_STACKMAP_BITS_MAX);
+
+ smap->tr = tr;
+ smap->map_bits = bits;
+ smap->max_elts = 1U << bits;
+ smap->map_size = 1U << (bits + 1); /* 2x over-provision */
+
+ smap->entries = vzalloc(sizeof(*smap->entries) * smap->map_size);
+ if (!smap->entries) {
+ kfree(smap);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /*
+ * Single large vmalloc of the element pool, indexed flat.
+ * At bits=18 this is 256K * sizeof(struct stackmap_elt). The
+ * struct is ~520 B (8 + 4 + 4 + 64*8), so total ~135 MB.
+ */
+ smap->elts = vzalloc(sizeof(*smap->elts) * (size_t)smap->max_elts);
+ if (!smap->elts) {
+ vfree(smap->entries);
+ kfree(smap);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ smap->successes = alloc_percpu(local_t);
+ if (!smap->successes) {
+ vfree(smap->elts);
+ vfree(smap->entries);
+ kfree(smap);
+ return ERR_PTR(-ENOMEM);
+ }
+ smap->drops = alloc_percpu(local_t);
+ if (!smap->drops) {
+ free_percpu(smap->successes);
+ vfree(smap->elts);
+ vfree(smap->entries);
+ kfree(smap);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ smap->hash_seed = get_random_u32();
+ atomic_set(&smap->next_elt, 0);
+ atomic_set(&smap->resetting, 0);
+ init_rwsem(&smap->reader_sem);
+
+ return smap;
+}
+
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap)
+{
+ if (!smap || IS_ERR(smap))
+ return;
+ free_percpu(smap->drops);
+ free_percpu(smap->successes);
+ vfree(smap->elts);
+ vfree(smap->entries);
+ kfree(smap);
+}
+
+/**
+ * ftrace_stackmap_reset - clear all entries in the stackmap
+ * @smap: the stackmap to reset
+ *
+ * Returns 0 on success, -EBUSY if another reset is already in
+ * progress, or if tracing is currently active on the owning
+ * trace_array.
+ *
+ * Online reset (with tracing active) is not supported. Caller must
+ * stop tracing first (echo 0 > tracing_on).
+ *
+ * Caller is process context (typically sysfs write handler).
+ *
+ * Protocol:
+ * 1. Atomically claim reset rights via cmpxchg on @resetting.
+ * 2. Verify tracing is stopped on @smap->tr; if not, release the
+ * claim and return -EBUSY. The resetting flag itself blocks
+ * any subsequent get_id() callers.
+ * 3. synchronize_rcu() drains in-flight get_id() callers from the
+ * ftrace callback path (which runs preempt-disabled).
+ * 4. memset entries, elts, and counters.
+ * 5. Release the resetting flag with release semantics so any new
+ * get_id() observes a fully cleared map.
+ */
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap)
+{
+ if (!smap)
+ return 0;
+
+ if (atomic_cmpxchg(&smap->resetting, 0, 1) != 0)
+ return -EBUSY;
+
+ if (smap->tr && tracer_tracing_is_on(smap->tr)) {
+ atomic_set(&smap->resetting, 0);
+ return -EBUSY;
+ }
+
+ /*
+ * synchronize_rcu() itself is a full barrier; no extra smp_mb()
+ * is needed before it. It drains in-flight ftrace callbacks that
+ * may have already passed the resetting check with the old value.
+ */
+ synchronize_rcu();
+
+ /*
+ * Take the reader_sem in exclusive mode. This serializes the
+ * memset against any tracefs reader (seq_file iteration or
+ * stack_map_bin snapshot) that may currently hold the rwsem
+ * for read. synchronize_rcu() already drained the hot path;
+ * this rwsem covers process-context readers that aren't
+ * preempt-disabled.
+ */
+ down_write(&smap->reader_sem);
+
+ memset(smap->entries, 0, sizeof(*smap->entries) * smap->map_size);
+ memset(smap->elts, 0, sizeof(*smap->elts) * (size_t)smap->max_elts);
+
+ atomic_set(&smap->next_elt, 0);
+ {
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ local_set(per_cpu_ptr(smap->successes, cpu), 0);
+ local_set(per_cpu_ptr(smap->drops, cpu), 0);
+ }
+ }
+
+ up_write(&smap->reader_sem);
+
+ /* Release resetting=0 so new get_id() observes a cleared map. */
+ atomic_set_release(&smap->resetting, 0);
+ return 0;
+}
+
+/* --- Core: get_id (lock-free, NMI-safe) --- */
+
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+ unsigned long *ips, unsigned int nr_entries)
+{
+ u32 key_hash, idx, test_key, trace_len;
+ struct stackmap_entry *entry;
+ struct stackmap_elt *val;
+ int probes = 0;
+
+ /*
+ * atomic_read_acquire() pairs with atomic_set_release() in the
+ * reset path. This ensures that subsequent reads of entry->key
+ * and entry->val are ordered after this check; without acquire,
+ * the CPU would only have a control dependency, which orders
+ * subsequent stores but not loads (per LKMM).
+ */
+ if (!smap || !nr_entries || atomic_read_acquire(&smap->resetting))
+ return -EINVAL;
+ if (nr_entries > FTRACE_STACKMAP_MAX_DEPTH)
+ nr_entries = FTRACE_STACKMAP_MAX_DEPTH;
+
+ trace_len = nr_entries * sizeof(unsigned long);
+ /*
+ * jhash2() requires the length in u32 units and the data to be
+ * u32-aligned. On 64-bit kernels sizeof(unsigned long)==8, so
+ * trace_len is always a multiple of 8 (hence of 4). Use jhash2
+ * directly; the cast to u32* is safe because ips[] is naturally
+ * aligned to sizeof(unsigned long) >= 4.
+ */
+ key_hash = jhash2((const u32 *)ips, trace_len / sizeof(u32),
+ smap->hash_seed);
+ if (key_hash == 0)
+ key_hash = 1; /* 0 means free slot */
+
+ idx = key_hash >> (32 - (smap->map_bits + 1));
+
+ while (probes < FTRACE_STACKMAP_MAX_PROBE) {
+ idx &= (smap->map_size - 1);
+ entry = &smap->entries[idx];
+ /*
+ * READ_ONCE() to avoid LKMM data race with concurrent
+ * cmpxchg(&entry->key, 0, key_hash) on this slot.
+ */
+ test_key = READ_ONCE(entry->key);
+
+ if (test_key == key_hash) {
+ /*
+ * smp_load_acquire pairs with smp_store_release in
+ * the publisher below; ensures we see fully-formed
+ * elt fields (nr, ips, ref_count) before dereference.
+ */
+ val = smp_load_acquire(&entry->val);
+ /*
+ * READ_ONCE(val->nr) keeps style consistent with
+ * the seq_show / bin_open readers. nr is write-once
+ * (set before publish, never modified afterwards),
+ * so the load is data-race-free, but READ_ONCE
+ * silences any analysis tool that flags a plain
+ * read of a field that is also read under acquire
+ * elsewhere.
+ */
+ if (val && READ_ONCE(val->nr) == nr_entries &&
+ memcmp(val->ips, ips, trace_len) == 0) {
+ atomic_inc(&val->ref_count);
+ local_inc(this_cpu_ptr(smap->successes));
+ return (int)idx;
+ }
+ /*
+ * val == NULL: another CPU is mid-insert, or this
+ * slot is "claimed but empty" (pool exhausted).
+ * val != NULL but mismatch: 32-bit hash collision
+ * with a different stack. In both cases, advance.
+ */
+ } else if (!test_key) {
+ /*
+ * Free slot: try to claim it.
+ *
+ * If two CPUs race here with the same key_hash
+ * (same stack), one loses the cmpxchg, advances,
+ * and may insert the same stack at a later slot.
+ * This can produce a small number of duplicate
+ * entries under heavy contention. The trade-off
+ * is accepted to keep the hot path lock-free;
+ * ref_count is split across the duplicates and
+ * total memory cost is bounded by the element
+ * pool size.
+ */
+ if (cmpxchg(&entry->key, 0, key_hash) == 0) {
+ struct stackmap_elt *elt;
+
+ elt = stackmap_get_elt(smap);
+ if (!elt) {
+ /*
+ * Pool exhausted. We claimed this
+ * slot with cmpxchg but cannot fill
+ * it. Leave key set so the slot
+ * stays "claimed but empty" — future
+ * lookups treat val==NULL as a miss
+ * and probe past it. Cannot revert
+ * key=0 without racing other CPUs.
+ */
+ local_inc(this_cpu_ptr(smap->drops));
+ return -ENOSPC;
+ }
+
+ elt->nr = nr_entries;
+ atomic_set(&elt->ref_count, 1);
+ memcpy(elt->ips, ips, trace_len);
+
+ /*
+ * Publish elt with release semantics so the
+ * reader's smp_load_acquire can safely
+ * dereference val->nr / val->ips.
+ */
+ smp_store_release(&entry->val, elt);
+ local_inc(this_cpu_ptr(smap->successes));
+ return (int)idx;
+ }
+ /* cmpxchg failed; another CPU claimed this slot. */
+ }
+
+ idx++;
+ probes++;
+ }
+
+ local_inc(this_cpu_ptr(smap->drops));
+ return -ENOSPC;
+}
+
+/* --- Text export: /sys/kernel/debug/tracing/stack_map --- */
+
+struct stackmap_seq_private {
+ struct ftrace_stackmap *smap;
+};
+
+static void *stackmap_seq_start(struct seq_file *m, loff_t *pos)
+{
+ struct stackmap_seq_private *priv = m->private;
+ struct ftrace_stackmap *smap = priv->smap;
+ u32 i;
+
+ if (!smap)
+ return NULL;
+ /*
+ * Take the reader_sem to serialize against ftrace_stackmap_reset(),
+ * which holds it for write while clearing the table. Released in
+ * stackmap_seq_stop(), which seq_file calls regardless of whether
+ * start() returned an element or NULL (per Documentation/filesystems
+ * /seq_file.rst: "the iterator value returned by start() or next()
+ * is guaranteed to be passed to a subsequent next() or stop()").
+ */
+ down_read(&smap->reader_sem);
+ for (i = *pos; i < smap->map_size; i++) {
+ if (READ_ONCE(smap->entries[i].key) &&
+ smp_load_acquire(&smap->entries[i].val)) {
+ *pos = i;
+ return &smap->entries[i];
+ }
+ }
+ return NULL;
+}
+
+static void *stackmap_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct stackmap_seq_private *priv = m->private;
+ struct ftrace_stackmap *smap = priv->smap;
+ u32 i;
+
+ if (!smap)
+ return NULL;
+ for (i = *pos + 1; i < smap->map_size; i++) {
+ if (READ_ONCE(smap->entries[i].key) &&
+ smp_load_acquire(&smap->entries[i].val)) {
+ *pos = i;
+ return &smap->entries[i];
+ }
+ }
+ /*
+ * Advance *pos past the end so that on the next read() the
+ * subsequent stackmap_seq_start() call returns NULL and the
+ * iteration terminates. Without this, seq_read() would loop
+ * on the last element.
+ */
+ *pos = smap->map_size;
+ return NULL;
+}
+
+static void stackmap_seq_stop(struct seq_file *m, void *v)
+{
+ struct stackmap_seq_private *priv = m->private;
+ struct ftrace_stackmap *smap = priv->smap;
+
+ /*
+ * seq_file invokes stop() unconditionally after each iteration
+ * pass (see seq_read_iter / traverse), even when start() returned
+ * NULL. Always release here, balanced against the down_read in
+ * stackmap_seq_start().
+ */
+ if (smap)
+ up_read(&smap->reader_sem);
+}
+
+static int stackmap_seq_show(struct seq_file *m, void *v)
+{
+ struct stackmap_entry *entry = v;
+ struct stackmap_elt *elt = smp_load_acquire(&entry->val);
+ struct stackmap_seq_private *priv = m->private;
+ u32 idx = entry - priv->smap->entries;
+ u32 i, nr;
+
+ if (!elt)
+ return 0;
+
+ nr = READ_ONCE(elt->nr);
+ if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+ nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+ seq_printf(m, "stack_id %u [ref %u, depth %u]\n",
+ idx, atomic_read(&elt->ref_count), nr);
+ for (i = 0; i < nr; i++)
+ seq_printf(m, " [%u] %pS\n", i, (void *)elt->ips[i]);
+ seq_putc(m, '\n');
+ return 0;
+}
+
+static const struct seq_operations stackmap_seq_ops = {
+ .start = stackmap_seq_start,
+ .next = stackmap_seq_next,
+ .stop = stackmap_seq_stop,
+ .show = stackmap_seq_show,
+};
+
+static int stackmap_open(struct inode *inode, struct file *file)
+{
+ struct stackmap_seq_private *priv;
+ struct seq_file *m;
+ int ret;
+
+ ret = seq_open_private(file, &stackmap_seq_ops,
+ sizeof(struct stackmap_seq_private));
+ if (ret)
+ return ret;
+ m = file->private_data;
+ priv = m->private;
+ priv->smap = inode->i_private;
+ return 0;
+}
+
+/*
+ * Accept exactly "0" or "reset" (optionally followed by a single newline).
+ */
+static bool stackmap_write_is_reset(const char *buf, size_t n)
+{
+ if (n > 0 && buf[n - 1] == '\n')
+ n--;
+ return (n == 1 && buf[0] == '0') ||
+ (n == 5 && memcmp(buf, "reset", 5) == 0);
+}
+
+static ssize_t stackmap_write(struct file *file, const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct seq_file *m = file->private_data;
+ struct stackmap_seq_private *priv = m->private;
+ char buf[8];
+ size_t n = min(count, sizeof(buf) - 1);
+ int ret;
+
+ if (n == 0)
+ return -EINVAL;
+ if (copy_from_user(buf, ubuf, n))
+ return -EFAULT;
+ buf[n] = '\0';
+
+ if (!stackmap_write_is_reset(buf, n))
+ return -EINVAL;
+
+ /*
+ * ftrace_stackmap_reset() atomically claims reset rights via
+ * cmpxchg and returns -EBUSY if another reset is in progress
+ * or if tracing is active.
+ */
+ ret = ftrace_stackmap_reset(priv->smap);
+ if (ret)
+ return ret;
+ return count;
+}
+
+const struct file_operations ftrace_stackmap_fops = {
+ .open = stackmap_open,
+ .read = seq_read,
+ .write = stackmap_write,
+ .llseek = seq_lseek,
+ .release = seq_release_private,
+};
+
+/* --- Stats --- */
+
+static int stackmap_stat_show(struct seq_file *m, void *v)
+{
+ struct ftrace_stackmap *smap = m->private;
+ u64 successes = 0, drops = 0;
+ u32 entries;
+ int cpu;
+
+ if (!smap) {
+ seq_puts(m, "stackmap not initialized\n");
+ return 0;
+ }
+
+ entries = atomic_read(&smap->next_elt);
+ for_each_possible_cpu(cpu) {
+ successes += local_read(per_cpu_ptr(smap->successes, cpu));
+ drops += local_read(per_cpu_ptr(smap->drops, cpu));
+ }
+
+ seq_printf(m, "entries: %u / %u\n", entries, smap->max_elts);
+ seq_printf(m, "table_size: %u\n", smap->map_size);
+ seq_printf(m, "successes: %llu\n", successes);
+ seq_printf(m, "drops: %llu\n", drops);
+ if (successes + drops > 0)
+ seq_printf(m, "success_rate: %llu%%\n",
+ successes * 100 / (successes + drops));
+ return 0;
+}
+
+static int stackmap_stat_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, stackmap_stat_show, inode->i_private);
+}
+
+const struct file_operations ftrace_stackmap_stat_fops = {
+ .open = stackmap_stat_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+/* --- Binary export --- */
+
+struct stackmap_bin_snapshot {
+ /*
+ * Use u64 (not size_t) so data[] is 8-byte aligned on both
+ * 32-bit and 64-bit architectures. The IP array within data[]
+ * is accessed as u64*, which would alignment-fault on strict
+ * architectures (e.g. older ARM, SPARC) if data[] started at
+ * a 4-byte boundary.
+ */
+ u64 size;
+ char data[];
+};
+
+static int stackmap_bin_open(struct inode *inode, struct file *file)
+{
+ struct ftrace_stackmap *smap = inode->i_private;
+ struct stackmap_bin_snapshot *snap;
+ struct ftrace_stackmap_bin_header *hdr;
+ size_t alloc_size, off;
+ u32 nr_entries, i, nr_stacks;
+
+ if (!smap)
+ return -ENODEV;
+
+ /*
+ * Worst-case allocation size: every populated entry uses a
+ * full-depth stack. The (+1) gives one slack slot in case a
+ * concurrent insert lands between this snapshot and iteration.
+ * The loop below performs an explicit bounds check anyway.
+ *
+ * At bits=18 this caps at ~135 MB. The file is mode 0440
+ * (TRACE_MODE_READ), so only privileged users can open it.
+ */
+ nr_entries = atomic_read(&smap->next_elt);
+ alloc_size = sizeof(*hdr) + (nr_entries + 1) *
+ (sizeof(struct ftrace_stackmap_bin_entry) +
+ FTRACE_STACKMAP_MAX_DEPTH * sizeof(u64));
+
+ snap = vmalloc(sizeof(*snap) + alloc_size);
+ if (!snap)
+ return -ENOMEM;
+
+ hdr = (struct ftrace_stackmap_bin_header *)snap->data;
+ hdr->magic = FTRACE_STACKMAP_BIN_MAGIC;
+ hdr->version = FTRACE_STACKMAP_BIN_VERSION;
+ hdr->reserved = 0;
+ off = sizeof(*hdr);
+ nr_stacks = 0;
+
+ /*
+ * Take reader_sem to serialize against ftrace_stackmap_reset(),
+ * which clears the table and elt pool under the write lock.
+ */
+ down_read(&smap->reader_sem);
+
+ for (i = 0; i < smap->map_size; i++) {
+ struct stackmap_entry *entry = &smap->entries[i];
+ struct stackmap_elt *elt;
+ struct ftrace_stackmap_bin_entry *e;
+ u64 *ips_out;
+ u32 k, nr;
+
+ if (!READ_ONCE(entry->key))
+ continue;
+ elt = smp_load_acquire(&entry->val);
+ if (!elt)
+ continue;
+
+ nr = READ_ONCE(elt->nr);
+ if (nr > FTRACE_STACKMAP_MAX_DEPTH)
+ nr = FTRACE_STACKMAP_MAX_DEPTH;
+
+ /* Bounds check: stop if we would overflow the allocation. */
+ if (off + sizeof(*e) + nr * sizeof(u64) > alloc_size)
+ break;
+
+ e = (struct ftrace_stackmap_bin_entry *)(snap->data + off);
+ e->stack_id = i;
+ e->nr = nr;
+ e->ref_count = atomic_read(&elt->ref_count);
+ e->reserved = 0;
+ off += sizeof(*e);
+
+ ips_out = (u64 *)(snap->data + off);
+ for (k = 0; k < nr; k++)
+ ips_out[k] = (u64)elt->ips[k];
+ off += nr * sizeof(u64);
+ nr_stacks++;
+ }
+
+ up_read(&smap->reader_sem);
+
+ hdr->nr_stacks = nr_stacks;
+ snap->size = off;
+ file->private_data = snap;
+ return 0;
+}
+
+static ssize_t stackmap_bin_read(struct file *file, char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct stackmap_bin_snapshot *snap = file->private_data;
+
+ if (!snap)
+ return -EINVAL;
+ return simple_read_from_buffer(ubuf, count, ppos, snap->data, snap->size);
+}
+
+static int stackmap_bin_release(struct inode *inode, struct file *file)
+{
+ vfree(file->private_data);
+ return 0;
+}
+
+const struct file_operations ftrace_stackmap_bin_fops = {
+ .open = stackmap_bin_open,
+ .read = stackmap_bin_read,
+ .llseek = default_llseek,
+ .release = stackmap_bin_release,
+};
diff --git a/kernel/trace/trace_stackmap.h b/kernel/trace/trace_stackmap.h
new file mode 100644
index 000000000000..2e82bd6fb1c3
--- /dev/null
+++ b/kernel/trace/trace_stackmap.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _TRACE_STACKMAP_H
+#define _TRACE_STACKMAP_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+
+#define FTRACE_STACKMAP_MAX_DEPTH 64
+
+/* Binary export format */
+#define FTRACE_STACKMAP_BIN_MAGIC 0x464D5342 /* 'FSMB' */
+#define FTRACE_STACKMAP_BIN_VERSION 2
+
+struct ftrace_stackmap_bin_header {
+ u32 magic;
+ u32 version;
+ u32 nr_stacks;
+ u32 reserved;
+};
+
+struct ftrace_stackmap_bin_entry {
+ u32 stack_id;
+ u32 nr;
+ u32 ref_count;
+ u32 reserved;
+ /* followed by u64 ips[nr] */
+};
+
+struct trace_array;
+
+#ifdef CONFIG_FTRACE_STACKMAP
+
+struct ftrace_stackmap;
+
+struct ftrace_stackmap *ftrace_stackmap_create(struct trace_array *tr);
+void ftrace_stackmap_destroy(struct ftrace_stackmap *smap);
+int ftrace_stackmap_get_id(struct ftrace_stackmap *smap,
+ unsigned long *ips, unsigned int nr_entries);
+int ftrace_stackmap_reset(struct ftrace_stackmap *smap);
+
+extern const struct file_operations ftrace_stackmap_fops;
+extern const struct file_operations ftrace_stackmap_stat_fops;
+extern const struct file_operations ftrace_stackmap_bin_fops;
+
+#else
+
+struct ftrace_stackmap;
+static inline struct ftrace_stackmap *
+ftrace_stackmap_create(struct trace_array *tr) { return NULL; }
+static inline void ftrace_stackmap_destroy(struct ftrace_stackmap *s) { }
+static inline int ftrace_stackmap_get_id(struct ftrace_stackmap *s,
+ unsigned long *ips, unsigned int n)
+{ return -EOPNOTSUPP; }
+static inline int ftrace_stackmap_reset(struct ftrace_stackmap *s) { return 0; }
+
+#endif
+#endif /* _TRACE_STACKMAP_H */
--
2.34.1
^ permalink raw reply related
* [RFC PATCH v3 2/3] trace: integrate stackmap into ftrace stack recording path
From: Li Pengfei @ 2026-05-26 11:52 UTC (permalink / raw)
To: mhiramat, rostedt
Cc: linux-trace-kernel, linux-kernel, cmllamas, zhangbo56, Pengfei Li
In-Reply-To: <cover.1779769138.git.lipengfei28@xiaomi.com>
From: Pengfei Li <lipengfei28@xiaomi.com>
Add TRACE_STACK_ID event type and integrate ftrace_stackmap into
__ftrace_trace_stack(). When the 'stackmap' trace option is enabled,
the stack recording path stores a 4-byte stack_id in the ring buffer
instead of the full stack trace.
Changes:
- New TRACE_STACK_ID in trace_type enum
- New stack_id_entry in trace_entries.h
- New TRACE_ITER(STACKMAP) trace option flag; when CONFIG_FTRACE_STACKMAP
is disabled, TRACE_ITER_STACKMAP_BIT is defined as -1 so that
TRACE_ITER(STACKMAP) evaluates to 0 (following the existing pattern
used by TRACE_ITER_PROF_TEXT_OFFSET)
- 'stackmap' is added to TOP_LEVEL_TRACE_FLAGS and ZEROED_TRACE_FLAGS
so it is only exposed under the top-level trace instance, matching
the convention already used for global-only options such as 'printk'
and 'record-cmd'. Secondary instances under tracing/instances/*/
do not see the option at all, avoiding a confusing no-op.
- Modified __ftrace_trace_stack() to call ftrace_stackmap_get_id()
when the stackmap option is active. If reserving a TRACE_STACK_ID
ring-buffer slot fails after a successful get_id(), the path falls
through to the full-stack recording so the event still gets a stack
trace recorded.
- Stackmap pointer read with smp_load_acquire(), published with
smp_store_release() to ensure proper initialization ordering
- NULL check on tr->stackmap is retained as defense-in-depth: events
that fire before fs_initcall (when the map is created) or after a
failed ftrace_stackmap_create() observe a NULL pointer and fall back
to full stack recording without dereferencing it
- ftrace_stackmap_create() takes the owning trace_array so the
stackmap can later check tracing state during reset
- Added stack_id print handler in trace_output.c
- Added TRACE_STACK_ID to trace_valid_entry() in trace_selftest.c
so ftrace startup selftests don't reject the new entry type when
the stackmap option is enabled
Fallback behavior: if stackmap returns an error (pool exhausted,
resetting, or NULL pointer), the full stack trace is recorded as
before -- no new failure modes introduced.
Per-instance stackmap support is left as a follow-up; gating the
option via TOP_LEVEL_TRACE_FLAGS makes the global-only scope
explicit at the tracefs interface rather than relying on a silent
runtime fallback.
Usage:
echo 1 > /sys/kernel/debug/tracing/options/stackmap
echo 1 > /sys/kernel/debug/tracing/options/stacktrace
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
kernel/trace/trace.c | 78 ++++++++++++++++++++++++++++++++++-
kernel/trace/trace.h | 16 +++++++
kernel/trace/trace_entries.h | 15 +++++++
kernel/trace/trace_output.c | 23 +++++++++++
kernel/trace/trace_selftest.c | 1 +
5 files changed, 131 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6eb4d3097a4d..36120355e549 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -57,6 +57,7 @@
#include "trace.h"
#include "trace_output.h"
+#include "trace_stackmap.h"
#ifdef CONFIG_FTRACE_STARTUP_TEST
/*
@@ -509,12 +510,13 @@ EXPORT_SYMBOL_GPL(unregister_ftrace_export);
/* trace_options that are only supported by global_trace */
#define TOP_LEVEL_TRACE_FLAGS (TRACE_ITER(PRINTK) | \
TRACE_ITER(PRINTK_MSGONLY) | TRACE_ITER(RECORD_CMD) | \
- TRACE_ITER(PROF_TEXT_OFFSET) | FPROFILE_DEFAULT_FLAGS)
+ TRACE_ITER(PROF_TEXT_OFFSET) | TRACE_ITER(STACKMAP) | \
+ FPROFILE_DEFAULT_FLAGS)
/* trace_flags that are default zero for instances */
#define ZEROED_TRACE_FLAGS \
(TRACE_ITER(EVENT_FORK) | TRACE_ITER(FUNC_FORK) | TRACE_ITER(TRACE_PRINTK) | \
- TRACE_ITER(COPY_MARKER))
+ TRACE_ITER(COPY_MARKER) | TRACE_ITER(STACKMAP))
/*
* The global_trace is the descriptor that holds the top-level tracing
@@ -2184,6 +2186,49 @@ void __ftrace_trace_stack(struct trace_array *tr,
}
#endif
+#ifdef CONFIG_FTRACE_STACKMAP
+ /*
+ * If stackmap dedup is enabled, try to store only the stack_id
+ * in the ring buffer instead of the full stack trace.
+ */
+ if (tr->trace_flags & TRACE_ITER(STACKMAP)) {
+ struct ftrace_stackmap *smap;
+ struct stack_id_entry *sid_entry;
+ int sid;
+
+ smap = smp_load_acquire(&tr->stackmap);
+ if (!smap)
+ goto full_stack;
+
+ sid = ftrace_stackmap_get_id(smap, fstack->calls, nr_entries);
+ if (sid >= 0) {
+ event = __trace_buffer_lock_reserve(buffer,
+ TRACE_STACK_ID,
+ sizeof(*sid_entry), trace_ctx);
+ if (!event) {
+ /*
+ * Could not reserve a TRACE_STACK_ID slot;
+ * fall back to the full-stack path so the
+ * event still gets a stack trace recorded.
+ */
+ goto full_stack;
+ }
+ sid_entry = ring_buffer_event_data(event);
+ sid_entry->stack_id = sid;
+ /*
+ * stack_id is a synthetic side-event attached to a
+ * primary trace event that was already subject to
+ * filtering. No per-event filter is defined for
+ * TRACE_STACK_ID, so commit unconditionally.
+ */
+ __buffer_unlock_commit(buffer, event);
+ goto out;
+ }
+ /* On stackmap failure, record the full stack instead. */
+ }
+full_stack:
+#endif
+
event = __trace_buffer_lock_reserve(buffer, TRACE_STACK,
struct_size(entry, caller, nr_entries),
trace_ctx);
@@ -9222,6 +9267,35 @@ static __init void tracer_init_tracefs_work_func(struct work_struct *work)
NULL, &tracing_dyn_info_fops);
#endif
+#ifdef CONFIG_FTRACE_STACKMAP
+ {
+ struct ftrace_stackmap *smap;
+
+ smap = ftrace_stackmap_create(&global_trace);
+ if (!IS_ERR(smap)) {
+ /*
+ * Use smp_store_release to ensure the stackmap
+ * structure is fully initialized before publishing
+ * the pointer to concurrent trace event readers.
+ */
+ smp_store_release(&global_trace.stackmap, smap);
+ trace_create_file("stack_map", TRACE_MODE_WRITE, NULL,
+ smap, &ftrace_stackmap_fops);
+ trace_create_file("stack_map_stat", TRACE_MODE_READ, NULL,
+ smap, &ftrace_stackmap_stat_fops);
+ trace_create_file("stack_map_bin", TRACE_MODE_READ, NULL,
+ smap, &ftrace_stackmap_bin_fops);
+ } else {
+ pr_warn("ftrace stackmap init failed, dedup disabled\n");
+ /*
+ * global_trace is statically defined; its stackmap
+ * field is zero-initialized via BSS, so leaving it
+ * NULL ensures the smp_load_acquire() in
+ * __ftrace_trace_stack() falls back to full stack.
+ */
+ }
+ }
+#endif
create_trace_instances(NULL);
update_tracer_options();
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..7e7d5e5a35ff 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -57,6 +57,7 @@ enum trace_type {
TRACE_TIMERLAT,
TRACE_RAW_DATA,
TRACE_FUNC_REPEATS,
+ TRACE_STACK_ID,
__TRACE_LAST_TYPE,
};
@@ -453,6 +454,9 @@ struct trace_array {
struct cond_snapshot *cond_snapshot;
#endif
struct trace_func_repeats __percpu *last_func_repeats;
+#ifdef CONFIG_FTRACE_STACKMAP
+ struct ftrace_stackmap *stackmap;
+#endif
/*
* On boot up, the ring buffer is set to the minimum size, so that
* we do not waste memory on systems that are not using tracing.
@@ -579,6 +583,8 @@ extern void __ftrace_bad_type(void);
TRACE_GRAPH_RET); \
IF_ASSIGN(var, ent, struct func_repeats_entry, \
TRACE_FUNC_REPEATS); \
+ IF_ASSIGN(var, ent, struct stack_id_entry, \
+ TRACE_STACK_ID); \
__ftrace_bad_type(); \
} while (0)
@@ -1449,7 +1455,16 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
# define STACK_FLAGS
#endif
+#ifdef CONFIG_FTRACE_STACKMAP
+# define STACKMAP_FLAGS \
+ C(STACKMAP, "stackmap"),
+#else
+# define STACKMAP_FLAGS
+# define TRACE_ITER_STACKMAP_BIT -1
+#endif
+
#ifdef CONFIG_FUNCTION_PROFILER
+
# define PROFILER_FLAGS \
C(PROF_TEXT_OFFSET, "prof-text-offset"),
# ifdef CONFIG_FUNCTION_GRAPH_TRACER
@@ -1506,6 +1521,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
FUNCTION_FLAGS \
FGRAPH_FLAGS \
STACK_FLAGS \
+ STACKMAP_FLAGS \
BRANCH_FLAGS \
PROFILER_FLAGS \
FPROFILE_FLAGS
diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index 54417468fdeb..89ed14b7e5fd 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -250,6 +250,21 @@ FTRACE_ENTRY(user_stack, userstack_entry,
(void *)__entry->caller[6], (void *)__entry->caller[7])
);
+/*
+ * Stack ID entry - stores only a stack_id referencing the stackmap.
+ * Used when CONFIG_FTRACE_STACKMAP is enabled to deduplicate stacks.
+ */
+FTRACE_ENTRY(stack_id, stack_id_entry,
+
+ TRACE_STACK_ID,
+
+ F_STRUCT(
+ __field( int, stack_id )
+ ),
+
+ F_printk("<stack_id %d>", __entry->stack_id)
+);
+
/*
* trace_printk entry:
*/
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index a5ad76175d10..68678ea88159 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1517,6 +1517,28 @@ static struct trace_event trace_user_stack_event = {
.funcs = &trace_user_stack_funcs,
};
+/* TRACE_STACK_ID */
+static enum print_line_t trace_stack_id_print(struct trace_iterator *iter,
+ int flags, struct trace_event *event)
+{
+ struct stack_id_entry *field;
+ struct trace_seq *s = &iter->seq;
+
+ trace_assign_type(field, iter->ent);
+ trace_seq_printf(s, "<stack_id %d>\n", field->stack_id);
+
+ return trace_handle_return(s);
+}
+
+static struct trace_event_functions trace_stack_id_funcs = {
+ .trace = trace_stack_id_print,
+};
+
+static struct trace_event trace_stack_id_event = {
+ .type = TRACE_STACK_ID,
+ .funcs = &trace_stack_id_funcs,
+};
+
/* TRACE_HWLAT */
static enum print_line_t
trace_hwlat_print(struct trace_iterator *iter, int flags,
@@ -1908,6 +1930,7 @@ static struct trace_event *events[] __initdata = {
&trace_wake_event,
&trace_stack_event,
&trace_user_stack_event,
+ &trace_stack_id_event,
&trace_bputs_event,
&trace_bprint_event,
&trace_print_event,
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 929c84075315..0c97065b0d68 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -14,6 +14,7 @@ static inline int trace_valid_entry(struct trace_entry *entry)
case TRACE_CTX:
case TRACE_WAKE:
case TRACE_STACK:
+ case TRACE_STACK_ID:
case TRACE_PRINT:
case TRACE_BRANCH:
case TRACE_GRAPH_ENT:
--
2.34.1
^ permalink raw reply related
* [RFC PATCH v3 3/3] trace: add documentation, selftest and tooling for stackmap
From: Li Pengfei @ 2026-05-26 11:52 UTC (permalink / raw)
To: mhiramat, rostedt
Cc: linux-trace-kernel, linux-kernel, cmllamas, zhangbo56, Pengfei Li,
kernel test robot
In-Reply-To: <cover.1779769138.git.lipengfei28@xiaomi.com>
From: Pengfei Li <lipengfei28@xiaomi.com>
Add supporting files for the ftrace stackmap feature:
Documentation/trace/ftrace-stackmap.rst:
Documentation covering design, usage, tracefs interface, binary
format, and performance characteristics. Added to the 'Core Tracing
Frameworks' toctree in Documentation/trace/index.rst. Documents:
- Reset requires tracing to be stopped first
- Boot-time activation via trace_options=stackmap
- bits parameter range [10, 18] and worst-case memory usage
- tracefs file modes (0640 / 0440)
- Best-effort snapshot semantics for stack_map_bin
- Counter naming: successes (events served), drops, success_rate
- Gravestone amplification when the pool is exhausted
tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc:
Functional selftest verifying:
- stackmap tracefs nodes exist
- enabling stackmap + stacktrace produces stack_id events
- stack_map_stat shows non-zero successes and zero drops
- reset clears entries when tracing is stopped
- reset is rejected (-EBUSY) while tracing is active
Test reads trace contents BEFORE switching back to the nop tracer
(tracer_init() unconditionally calls tracing_reset_online_cpus(),
which would empty the ring buffer). The function:tracer dependency
is declared in '# requires:' so ftracetest skips on kernels without
CONFIG_FUNCTION_TRACER instead of failing spuriously. An EXIT trap
restores options/stackmap and options/stacktrace on any exit path.
tools/tracing/stackmap_dump.py:
Python script to parse the binary stack_map_bin export.
Features:
- Automatic endianness detection via magic number
- Batched addr2line via stdin (avoids ARG_MAX with large stacks)
- JSON output mode
- Top-N filtering by ref_count
Binary format: all fields are native-endian. The parser detects
byte order by reading the magic value (0x464D5342 = 'FSMB').
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202605160010.fakzGVVq-lkp@intel.com/
Signed-off-by: Pengfei Li <lipengfei28@xiaomi.com>
---
Documentation/trace/ftrace-stackmap.rst | 162 ++++++++++++++++++
Documentation/trace/index.rst | 1 +
.../ftrace/test.d/ftrace/stackmap-basic.tc | 103 +++++++++++
.../test.d/ftrace/stackmap-instance-gate.tc | 42 +++++
tools/tracing/stackmap_dump.py | 150 ++++++++++++++++
5 files changed, 458 insertions(+)
create mode 100644 Documentation/trace/ftrace-stackmap.rst
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
create mode 100644 tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
create mode 100755 tools/tracing/stackmap_dump.py
diff --git a/Documentation/trace/ftrace-stackmap.rst b/Documentation/trace/ftrace-stackmap.rst
new file mode 100644
index 000000000000..191347be3664
--- /dev/null
+++ b/Documentation/trace/ftrace-stackmap.rst
@@ -0,0 +1,162 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Ftrace Stack Map
+======================
+
+:Author: Pengfei Li <lipengfei28@xiaomi.com>
+
+Overview
+========
+
+The ftrace stack map provides stack trace deduplication for the ftrace
+ring buffer. When enabled, instead of storing full kernel stack traces
+(typically 80-160 bytes each) in the ring buffer for every event, ftrace
+stores only a 4-byte ``stack_id``. The full stacks are maintained in a
+separate hash table and exported via tracefs for userspace to resolve.
+
+This is inspired by eBPF's ``BPF_MAP_TYPE_STACK_TRACE`` but integrated
+into ftrace's infrastructure, requiring no userspace daemon.
+
+Configuration
+=============
+
+Enable ``CONFIG_FTRACE_STACKMAP=y`` in the kernel config.
+
+Kernel command line parameters:
+
+- ``ftrace_stackmap.bits=N`` - Set map capacity to 2^N unique stacks
+ (default: 14 → 16384 stacks; valid range: 10-18).
+
+ At ``bits=18`` the kernel reserves roughly 130 MB of vmalloc memory
+ for the element pool. Each ``open()`` of ``stack_map_bin`` may
+ briefly allocate a similar amount for a snapshot. The cap is set
+ intentionally to bound memory usage.
+
+Usage
+=====
+
+Enable stack deduplication::
+
+ echo 1 > /sys/kernel/debug/tracing/options/stackmap
+ echo 1 > /sys/kernel/debug/tracing/options/stacktrace
+ echo function > /sys/kernel/debug/tracing/current_tracer
+
+The trace output will show ``<stack_id N>`` instead of full stack traces::
+
+ sh-1234 [006] d.h.. 123.456789: <stack_id 42>
+
+To view the actual stacks::
+
+ cat /sys/kernel/debug/tracing/stack_map
+
+Output format::
+
+ stack_id 42 [ref 1337, depth 8]
+ [0] schedule+0x48/0xc0
+ [1] schedule_timeout+0x1c/0x30
+ ...
+
+To view statistics::
+
+ cat /sys/kernel/debug/tracing/stack_map_stat
+
+Output::
+
+ entries: 2500 / 16384
+ table_size: 32768
+ successes: 148923
+ drops: 0
+ success_rate: 100%
+
+To reset the stack map (tracing must be stopped first)::
+
+ echo 0 > /sys/kernel/debug/tracing/tracing_on
+ echo 0 > /sys/kernel/debug/tracing/stack_map
+
+Reset returns ``-EBUSY`` if tracing is currently active, or if another
+reset is already in progress.
+
+Boot-time activation
+====================
+
+The stackmap option can be enabled from the kernel command line::
+
+ trace_options=stackmap,stacktrace
+
+Trace events that fire before the tracefs filesystem is initialized
+(``fs_initcall`` time) fall back to recording full stack traces; once
+``ftrace_stackmap_create()`` runs, subsequent events are deduplicated.
+The crossover is automatic and lossless — no events are dropped, but
+early-boot stacks recorded before the crossover are not deduplicated.
+
+Tracefs Nodes
+=============
+
+The stack_map files are owned by root and not world-readable
+(``stack_map``: 0640; ``stack_map_stat`` and ``stack_map_bin``: 0440).
+
+``stack_map``
+ Text export of all deduplicated stacks with symbol resolution.
+ Writing ``0`` or ``reset`` clears all entries (only when tracing
+ is stopped).
+
+``stack_map_stat``
+ Statistics: entries (allocated unique stacks), table_size,
+ successes (events served), drops (events that fell back to
+ full-stack recording), and success_rate. Drops accumulate when
+ the element pool is exhausted; once that happens, slots that
+ won the cmpxchg but failed to allocate an element remain
+ "claimed but empty" and increase probe pressure for any future
+ insert hashing to the same bucket. Reset (when tracing is
+ stopped) clears these gravestones.
+
+``stack_map_bin``
+ Binary export for efficient userspace consumption. Format:
+
+ - Header (16 bytes): magic(u32) + version(u32) + nr_stacks(u32) + reserved(u32)
+ - Per stack: stack_id(u32) + nr(u32) + ref_count(u32) + reserved(u32) + ips(u64 × nr)
+
+ All fields are written in the kernel's native byte order.
+ Userspace tools detect endianness by reading the magic value.
+ Magic: ``0x464D5342`` ('FSMB'), Version: 2.
+
+ The export is a best-effort snapshot allocated at ``open()``;
+ concurrent inserts during the snapshot may be truncated. A
+ bounds check ensures no overflow.
+
+Design
+======
+
+The stack map is modeled after ``tracing_map.c`` (used by hist triggers),
+using a lock-free design based on Dr. Cliff Click's non-blocking hash table
+algorithm:
+
+- **Lookup/Insert**: Lock-free via ``cmpxchg``, safe in NMI/IRQ/any context
+- **Memory**: Pre-allocated element pool, zero allocation on the hot path
+ (no GFP_ATOMIC failures under memory pressure)
+- **Collision**: Linear probing with a 2x over-provisioned table; probe
+ length is bounded so worst-case insert/lookup is O(1)
+- **Scope**: Currently supports the global trace instance
+- **Hash**: 32-bit jhash with a per-instance random seed; full ``memcmp``
+ confirms matches
+
+Deduplication is best-effort, not strict: if two CPUs race in the
+insert path with the same ``key_hash`` (i.e. the same stack), the
+``cmpxchg`` loser advances by one slot and may insert the same stack
+again. Under heavy contention this can produce a small number of
+duplicate entries for the same stack; ``ref_count`` is then split
+across the duplicates. Total memory is still bounded by the element
+pool size, and lookup correctness is unaffected (each duplicate is
+a self-consistent entry with its own ``stack_id``). The trade-off is
+intentional and keeps the hot path lock-free.
+
+Performance
+===========
+
+Typical results on an aarch64 SMP system (function tracer, 2 seconds):
+
+- Unique stacks: ~3000
+- Dedup rate: 84-98% (depends on workload diversity)
+- Ring buffer savings: ~80% for stack data
+- Overhead per event: ~50ns (one jhash + hash table lookup)
diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5d9bf4694d5d..ac8b1141c23a 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -33,6 +33,7 @@ the Linux kernel.
ftrace
ftrace-design
ftrace-uses
+ ftrace-stackmap
kprobes
kprobetrace
fprobetrace
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
new file mode 100644
index 000000000000..18fa998ae460
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-basic.tc
@@ -0,0 +1,103 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap basic functionality
+# requires: stack_map options/stackmap function:tracer
+
+# Test that ftrace stackmap deduplication works:
+# 1. Enable stackmap + stacktrace options
+# 2. Run function tracer briefly
+# 3. Verify trace contains <stack_id> events (read BEFORE switching
+# tracer back to nop, since tracer_init() resets the ring buffer)
+# 4. Verify stack_map has entries and zero drops
+# 5. Verify reset is rejected (-EBUSY) while tracing is active
+# 6. Verify reset clears the map when tracing is stopped
+
+fail() {
+ echo "FAIL: $1"
+ exit_fail
+}
+
+# Restore state on any exit (success, fail, or interrupt) so a
+# half-finished test does not leave stacktrace/stackmap enabled.
+cleanup() {
+ disable_tracing 2>/dev/null
+ echo nop > current_tracer 2>/dev/null
+ echo 0 > options/stackmap 2>/dev/null
+ echo 0 > options/stacktrace 2>/dev/null
+}
+trap cleanup EXIT
+
+disable_tracing
+clear_trace
+
+# Verify stackmap files exist
+test -f stack_map || fail "stack_map file missing"
+test -f stack_map_stat || fail "stack_map_stat file missing"
+test -f stack_map_bin || fail "stack_map_bin file missing"
+
+# Enable stackmap dedup
+echo 1 > options/stackmap
+echo 1 > options/stacktrace
+
+# Run function tracer briefly
+echo function > current_tracer
+enable_tracing
+sleep 1
+disable_tracing
+
+# Read trace contents NOW, before switching tracer back to nop.
+# tracer_init() unconditionally calls tracing_reset_online_cpus(),
+# so the ring buffer would be empty after 'echo nop > current_tracer'.
+count=$(grep -c "<stack_id" trace || true)
+: "${count:=0}"
+if [ "$count" -eq 0 ]; then
+ fail "trace has no <stack_id> events"
+fi
+
+# Now safe to switch back and disable options
+echo nop > current_tracer
+echo 0 > options/stackmap
+
+# Check stack_map_stat
+entries=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+: "${entries:=0}"
+if [ "$entries" -eq 0 ]; then
+ fail "stackmap has zero entries after tracing"
+fi
+
+successes=$(cat stack_map_stat | grep "^successes:" | awk '{print $2}')
+: "${successes:=0}"
+if [ "$successes" -eq 0 ]; then
+ fail "stackmap has zero successes"
+fi
+
+drops=$(cat stack_map_stat | grep "^drops:" | awk '{print $2}')
+: "${drops:=0}"
+if [ "$drops" -ne 0 ]; then
+ fail "stackmap had $drops drops (pool exhausted?)"
+fi
+
+# Check stack_map text output is parseable
+first_id=$(cat stack_map | grep "^stack_id" | head -1 | awk '{print $2}')
+if [ -z "$first_id" ]; then
+ fail "stack_map output has no stack_id entries"
+fi
+
+# Test that reset is rejected while tracing is active
+enable_tracing
+if echo 0 > stack_map 2>/dev/null; then
+ disable_tracing
+ fail "stackmap reset should fail while tracing is active"
+fi
+disable_tracing
+
+# Test reset works when tracing is stopped
+echo 0 > stack_map
+entries_after=$(cat stack_map_stat | grep "^entries:" | awk '{print $2}')
+: "${entries_after:=-1}"
+if [ "$entries_after" -ne 0 ]; then
+ fail "stackmap reset did not clear entries (got $entries_after)"
+fi
+
+echo "stackmap basic test passed: $entries unique stacks, $successes successes, $drops drops"
+exit 0
diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
new file mode 100644
index 000000000000..49848eac2624
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/stackmap-instance-gate.tc
@@ -0,0 +1,42 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: ftrace - stackmap option is gated to the top-level trace instance
+# requires: stack_map options/stackmap instances
+
+# The 'stackmap' option is added to TOP_LEVEL_TRACE_FLAGS, matching the
+# convention used for global-only options like 'printk' and 'record-cmd'.
+# Verify that:
+# 1. The global instance exposes options/stackmap and the stack_map* nodes.
+# 2. A newly created secondary instance under instances/ does NOT expose
+# options/stackmap or stack_map* nodes.
+
+fail() {
+ echo "FAIL: $1"
+ rmdir instances/test_stackmap_gate 2>/dev/null
+ exit_fail
+}
+
+# 1. Global instance must expose the option and the nodes
+test -e options/stackmap || fail "options/stackmap missing on global instance"
+test -e stack_map || fail "stack_map missing on global instance"
+test -e stack_map_stat || fail "stack_map_stat missing on global instance"
+test -e stack_map_bin || fail "stack_map_bin missing on global instance"
+
+# 2. Create a secondary instance and verify it does NOT see the option
+# or the stack_map* nodes.
+mkdir instances/test_stackmap_gate || fail "could not create secondary instance"
+
+if [ -e instances/test_stackmap_gate/options/stackmap ]; then
+ fail "secondary instance unexpectedly exposes options/stackmap"
+fi
+
+for f in stack_map stack_map_stat stack_map_bin; do
+ if [ -e instances/test_stackmap_gate/$f ]; then
+ fail "secondary instance unexpectedly has $f"
+ fi
+done
+
+rmdir instances/test_stackmap_gate || fail "could not remove secondary instance"
+
+echo "stackmap option gating to top-level instance works"
+exit 0
diff --git a/tools/tracing/stackmap_dump.py b/tools/tracing/stackmap_dump.py
new file mode 100755
index 000000000000..fcd8ddcd97de
--- /dev/null
+++ b/tools/tracing/stackmap_dump.py
@@ -0,0 +1,150 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+"""
+stackmap_dump.py - Parse and display ftrace stack_map_bin binary export.
+
+Usage:
+ # Pull from device and parse
+ adb pull /sys/kernel/debug/tracing/stack_map_bin /tmp/stack_map.bin
+ python3 stackmap_dump.py /tmp/stack_map.bin
+
+ # With vmlinux for offline symbol resolution
+ python3 stackmap_dump.py /tmp/stack_map.bin --vmlinux vmlinux
+
+ # JSON output for tooling
+ python3 stackmap_dump.py /tmp/stack_map.bin --json
+"""
+
+import struct
+import sys
+import argparse
+import json
+import subprocess
+
+MAGIC = 0x464D5342 # 'FSMB'
+HEADER_SIZE = 16 # 4 x u32
+ENTRY_SIZE = 16 # 4 x u32
+
+
+def detect_endianness(data):
+ """Detect byte order from magic number in header."""
+ if len(data) < 4:
+ raise ValueError("File too small")
+ magic_le = struct.unpack_from('<I', data, 0)[0]
+ if magic_le == MAGIC:
+ return '<'
+ magic_be = struct.unpack_from('>I', data, 0)[0]
+ if magic_be == MAGIC:
+ return '>'
+ raise ValueError(f"Bad magic: 0x{magic_le:08x} (neither LE nor BE)")
+
+
+def batch_addr2line(vmlinux, addrs):
+ """Resolve multiple addresses in one addr2line invocation."""
+ if not addrs:
+ return {}
+ try:
+ # Feed addresses on stdin to avoid ARG_MAX limits with large
+ # numbers of addresses (one stack can have 30+ frames; a
+ # snapshot can have thousands of unique stacks).
+ stdin = '\n'.join(hex(a) for a in addrs) + '\n'
+ result = subprocess.run(
+ ['addr2line', '-f', '-e', vmlinux],
+ input=stdin, capture_output=True, text=True, timeout=60
+ )
+ lines = result.stdout.split('\n')
+ # addr2line outputs 2 lines per address: function name + source location
+ symbols = {}
+ for i, addr in enumerate(addrs):
+ idx = i * 2
+ if idx < len(lines) and lines[idx] and lines[idx] != '??':
+ symbols[addr] = lines[idx]
+ return symbols
+ except (subprocess.TimeoutExpired, FileNotFoundError) as e:
+ print(f"warning: addr2line failed: {e}", file=sys.stderr)
+ return {}
+
+
+def parse_stackmap_bin(data):
+ """Parse binary stackmap data, yield (stack_id, ref_count, [ips])."""
+ if len(data) < HEADER_SIZE:
+ raise ValueError("File too small for header")
+
+ endian = detect_endianness(data)
+ header_fmt = f'{endian}IIII'
+ entry_fmt = f'{endian}IIII'
+
+ magic, version, nr_stacks, _ = struct.unpack_from(header_fmt, data, 0)
+ if version != 2:
+ raise ValueError(f"Unsupported version: {version}")
+
+ offset = HEADER_SIZE
+ for _ in range(nr_stacks):
+ if offset + ENTRY_SIZE > len(data):
+ break
+ stack_id, nr, ref_count, _ = struct.unpack_from(entry_fmt, data, offset)
+ offset += ENTRY_SIZE
+
+ ips_size = nr * 8
+ if offset + ips_size > len(data):
+ break
+ ips = struct.unpack_from(f'{endian}{nr}Q', data, offset)
+ offset += ips_size
+
+ yield stack_id, ref_count, list(ips)
+
+
+def main():
+ parser = argparse.ArgumentParser(description='Parse ftrace stack_map_bin')
+ parser.add_argument('file', help='Path to stack_map_bin file')
+ parser.add_argument('--vmlinux', help='Path to vmlinux for symbol resolution')
+ parser.add_argument('--json', action='store_true', help='JSON output')
+ parser.add_argument('--top', type=int, default=0,
+ help='Show only top N stacks by ref_count')
+ args = parser.parse_args()
+
+ with open(args.file, 'rb') as f:
+ data = f.read()
+
+ stacks = list(parse_stackmap_bin(data))
+
+ if args.top > 0:
+ stacks.sort(key=lambda x: x[1], reverse=True)
+ stacks = stacks[:args.top]
+
+ # Batch symbol resolution
+ symbols = {}
+ if args.vmlinux:
+ all_addrs = set()
+ for _, _, ips in stacks:
+ all_addrs.update(ips)
+ symbols = batch_addr2line(args.vmlinux, list(all_addrs))
+
+ if args.json:
+ output = []
+ for stack_id, ref_count, ips in stacks:
+ entry = {
+ 'stack_id': stack_id,
+ 'ref_count': ref_count,
+ 'ips': [f'0x{ip:x}' for ip in ips]
+ }
+ if args.vmlinux:
+ entry['symbols'] = [symbols.get(ip, f'0x{ip:x}')
+ for ip in ips]
+ output.append(entry)
+ print(json.dumps(output, indent=2))
+ else:
+ for stack_id, ref_count, ips in stacks:
+ print(f"stack_id {stack_id} [ref {ref_count}, depth {len(ips)}]")
+ for i, ip in enumerate(ips):
+ sym = symbols.get(ip, '')
+ if sym:
+ sym = f' {sym}'
+ print(f" [{i}] 0x{ip:x}{sym}")
+ print()
+
+ print(f"Total: {len(stacks)} unique stacks", file=sys.stderr)
+
+
+if __name__ == '__main__':
+ main()
--
2.34.1
^ permalink raw reply related
* Re: [PATCH mm-unstable v18 14/14] Documentation: mm: update the admin guide for mTHP collapse
From: Nico Pache @ 2026-05-26 12:00 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
catalin.marinas, cl, corbet, dave.hansen, dev.jain, gourry,
hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
zokeefe, Bagas Sanjaya
In-Reply-To: <94f759f8-e2ed-4f22-b9e7-4693ad005509@kernel.org>
On Fri, May 22, 2026 at 3:59 PM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
>
> >
> > process THP controls
> > @@ -264,11 +265,6 @@ support the following arguments::
> > Khugepaged controls
> > -------------------
> >
> > -.. note::
> > - khugepaged currently only searches for opportunities to collapse to
> > - PMD-sized THP and no attempt is made to collapse to other THP
> > - sizes.
>
> Should we maybe leave this here and clarify that for file/shmem, it will still
> only collapse to PMD-sized THPs?
Ah yes that would be a good idea. Ill send a fixup!
Thank you :)
>
> --
> Cheers,
>
> David
>
^ permalink raw reply
* Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Nico Pache @ 2026-05-26 12:07 UTC (permalink / raw)
To: Wei Yang
Cc: Andrew Morton, linux-doc, linux-kernel, linux-mm,
linux-trace-kernel, aarcange, anshuman.khandual, apopple, baohua,
baolin.wang, byungchul, catalin.marinas, cl, corbet, dave.hansen,
david, dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh,
jglisse, joshua.hahnjy, kas, lance.yang, liam, ljs,
mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
pfalcato, rakie.kim, raquini, rdunlap, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260526065708.oyyddmt2zgfwu2q7@master>
On Tue, May 26, 2026 at 12:57 AM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Mon, May 25, 2026 at 12:10:41PM -0700, Andrew Morton wrote:
> >On Mon, 25 May 2026 08:15:53 -0600 Nico Pache <npache@redhat.com> wrote:
> >
> >> Can you please append the following fixup that reverts one of the
> >> changes requested in V17. The issue with the change is described
> >> below.
> >
> >OK. fyi, what I received was badly mangled: wordwrapping, tabs messed
> >up, etc.
> >
> >Here's my reconstruction:
> >
>
> Hi, Nico
>
> I tried to reply your mail, but found it has some encoding problem, so reply
> here.
Yeah sorry I didnt properly configure my email client after getting a
new laptop.
>
> >
> >Author: Nico Pache <npache@redhat.com>
> >Subject: fix potential use-after-free of vma in mthp_collapse()
> >Date: Mon May 25 07:38:59 2026 -0600
> >
> >Between V17 and v18, one reviewer (Wei) brought up that we are not doing
> >the uffd-armed check until deep in the collapse operation. While not
> >functionally incorrect, it can lead to unnecessary work.
>
> So we decide to tolerate the behavioral change?
Yes, I believe it is ok for now. Either way we needed to remove the
potential UAF. It only affects the behavior if mTHP is enabled, so the
legacy behavior is kept. And the uffd case is limited.
My future work involves further optimizing and cleaning up khugepaged.
I'll make this part of the goal too. My first thought is to do the
revalidation at every order (between the locks dropping); but that
essentially pays the same penalty... I can't think of a clean solution
at the moment.
Does that sound ok?
Cheers,
-- Nico
-- Nico
>
> >
> >We optimized this by passing the vma variable to mthp_collapse() and using
> >the collapse_max_ptes_none() function to check the state of uffd-armed
> >preventing the wasted work later in the collapse.
> >
> >mthp_collapse() is called after mmap_read_unlock(), so the vma pointer can
> >become stale. Remove the vma parameter and pass NULL to
> >collapse_max_ptes_none() instead.
> >
> >Link: https://lore.kernel.org/2b2cda8c-358a-4a5c-989c-ae42593ef2ea@redhat.com
> >Signed-off-by: Nico Pache <npache@redhat.com>
> >...
> >
> > mm/khugepaged.c | 10 +++++-----
> > 1 file changed, 5 insertions(+), 5 deletions(-)
> >
> >--- a/mm/khugepaged.c~mm-khugepaged-introduce-mthp-collapse-support-fix
> >+++ a/mm/khugepaged.c
> >@@ -1502,9 +1502,9 @@ static unsigned int collapse_mthp_count_
> > * If a collapse is permitted, we attempt to collapse the PTE range into a
> > * mTHP.
> > */
> >-static int mthp_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
> >- unsigned long address, int referenced, int unmapped,
> >- struct collapse_control *cc, unsigned long enabled_orders)
> >+static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> >+ int referenced, int unmapped, struct collapse_control *cc,
> >+ unsigned long enabled_orders)
> > {
> > unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> > int collapsed = 0, stack_size = 0;
> >@@ -1524,7 +1524,7 @@ static int mthp_collapse(struct mm_struc
> > if (!test_bit(order, &enabled_orders))
> > goto next_order;
> >
> >- max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> >+ max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> >
> > nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> > nr_ptes);
> >@@ -1749,7 +1749,7 @@ out_unmap:
> > if (result == SCAN_SUCCEED) {
> > /* collapse_huge_page expects the lock to be dropped before calling */
> > mmap_read_unlock(mm);
> >- nr_collapsed = mthp_collapse(mm, vma, start_addr, referenced,
> >+ nr_collapsed = mthp_collapse(mm, start_addr, referenced,
> > unmapped, cc, enabled_orders);
> > /* mmap_lock was released above, set lock_dropped */
> > *lock_dropped = true;
> >_
>
> --
> Wei Yang
> Help you, Help me
>
^ permalink raw reply
* Re: [PATCHv3 05/12] libbpf: Change has_nop_combo to work on top of nop10
From: Jiri Olsa @ 2026-05-26 14:26 UTC (permalink / raw)
To: Jiri Olsa
Cc: Andrii Nakryiko, Oleg Nesterov, Peter Zijlstra, Ingo Molnar,
Masami Hiramatsu, Andrii Nakryiko, Jakub Sitnicki, bpf,
linux-trace-kernel
In-Reply-To: <ahDKatbb_ixUY_XF@krava>
On Fri, May 22, 2026 at 11:28:10PM +0200, Jiri Olsa wrote:
> On Fri, May 22, 2026 at 11:52:56AM -0700, Andrii Nakryiko wrote:
> > On Thu, May 21, 2026 at 5:45 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > >
> > > We now expect nop combo with 10 bytes nop instead of 5 bytes nop,
> > > fixing has_nop_combo to reflect that.
> > >
> > > Fixes: 41a5c7df4466 ("libbpf: Add support to detect nop,nop5 instructions combo for usdt probe")
> > > Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
> > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > ---
> > > tools/lib/bpf/usdt.c | 16 ++++++++--------
> > > 1 file changed, 8 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/tools/lib/bpf/usdt.c b/tools/lib/bpf/usdt.c
> > > index e3710933fd52..484a4354e82b 100644
> > > --- a/tools/lib/bpf/usdt.c
> > > +++ b/tools/lib/bpf/usdt.c
> > > @@ -305,7 +305,7 @@ struct usdt_manager *usdt_manager_new(struct bpf_object *obj)
> > >
> > > /*
> > > * Detect kernel support for uprobe() syscall, it's presence means we can
> > > - * take advantage of faster nop5 uprobe handling.
> > > + * take advantage of faster nop10 uprobe handling.
> > > * Added in: 56101b69c919 ("uprobes/x86: Add uprobe syscall to speed up uprobe")
> >
> > Would be nice to add commit that switches nop5 to nop10 (but until it
> > lands hash is not stable, so, hmmm, maybe we'll land this patch
> > separately? send it a bit later to bpf-next?)
>
> hm, I think that would affect the subtest_optimized_attach usdt test
> which depend on this behaviour, will check
usdt/optimized_attach will fail without this change,
I'll make note to update it later when we have the hash
jirka
^ permalink raw reply
* Re: [PATCH mm-unstable v18 04/14] mm/khugepaged: generalize __collapse_huge_page_* for mTHP support
From: Nico Pache @ 2026-05-26 14:39 UTC (permalink / raw)
To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, akpm
Cc: aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260522150009.121603-5-npache@redhat.com>
On 5/22/26 8:59 AM, Nico Pache wrote:
> generalize the order of the __collapse_huge_page_* and collapse_max_*
> functions to support future mTHP collapse.
>
> The current mechanism for determining collapse with the
> khugepaged_max_ptes_none value is not designed with mTHP in mind. This
> raises a key design issue: if we support user defined max_pte_none values
> (even those scaled by order), a collapse of a lower order can introduces
> an feedback loop, or "creep", when max_ptes_none is set to a value greater
> than HPAGE_PMD_NR / 2. [1]
>
> With this configuration, a successful collapse to order N will populate
> enough pages to satisfy the collapse condition on order N+1 on the next
> scan. This leads to unnecessary work and memory churn.
>
> To fix this issue introduce a helper function that will limit mTHP
> collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1.
> This effectively supports two modes: [2]
>
> - max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE
> that maps the shared zeropage. Consequently, no memory bloat.
> - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest
> available mTHP order.
>
> This removes the possibility of "creep", and a warning will be emitted if
> any non-supported max_ptes_none value is configured with mTHP enabled.
> Any intermediate value will default mTHP collapse to max_ptes_none=0.
>
> mTHP collapse will not honor the khugepaged_max_ptes_shared or
> khugepaged_max_ptes_swap parameters, and will fail if it encounters a
> shared or swapped entry.
>
> No functional changes in this patch; however it defines future behavior
> for mTHP collapse.
Hi Andrew,
Can you please append the following fixup to this commit.
Changes are described below but are very minor nits!
Thank you!
commit e3985903daa4fa77a27a632c1c2fa4c23aac9ad5
Author: Nico Pache <npache@redhat.com>
Date: Tue May 26 05:40:03 2026 -0600
fixup: cleanup collapse_max_ptes_none
make max_ptes_none a const and cleanup the pr_warn_once
Acked-by: David Hildenbrand (arm) <david@kernel.org>
Signed-off-by: Nico Pache <npache@redhat.com>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e98ba5b15163..4c7e404b0f8d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -360,7 +360,7 @@ static bool pte_none_or_zero(pte_t pte)
static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
struct vm_area_struct *vma, unsigned int order)
{
- unsigned int max_ptes_none = khugepaged_max_ptes_none;
+ const unsigned int max_ptes_none = khugepaged_max_ptes_none;
if (vma && userfaultfd_armed(vma))
return 0;
@@ -376,14 +376,13 @@ static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
*/
if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
return (1 << order) - 1;
- if (!max_ptes_none)
- return 0;
/*
* For mTHP collapse of values other than 0 or KHUGEPAGED_MAX_PTES_LIMIT,
* emit a warning and return 0.
*/
- pr_warn_once("mTHP collapse does not support max_ptes_none values"
- " other than 0 or %u, defaulting to 0.\n",
+ if (max_ptes_none)
+ pr_warn_once("mTHP collapse does not support max_ptes_none"
+ " values other than 0 or %u, defaulting to 0.\n",
KHUGEPAGED_MAX_PTES_LIMIT);
return 0;
}
>
> [1] - https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com
> [2] - https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com
>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Co-developed-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
> mm/khugepaged.c | 121 +++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 88 insertions(+), 33 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 116f39518948..e98ba5b15163 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -353,30 +353,52 @@ static bool pte_none_or_zero(pte_t pte)
> * the shared zeropage for the given collapse operation.
> * @cc: The collapse control struct
> * @vma: The vma to check for userfaultfd
> + * @order: The folio order being collapsed to
> *
> * Return: Maximum number of empty/shared zeropage PTEs for the collapse operation
> */
> static unsigned int collapse_max_ptes_none(struct collapse_control *cc,
> - struct vm_area_struct *vma)
> + struct vm_area_struct *vma, unsigned int order)
> {
> + unsigned int max_ptes_none = khugepaged_max_ptes_none;
> +
> if (vma && userfaultfd_armed(vma))
> return 0;
> /* for MADV_COLLAPSE, allow any empty/shared zeropage PTEs */
> if (!cc->is_khugepaged)
> return HPAGE_PMD_NR;
> - /* For all other cases respect the user defined maximum */
> - return khugepaged_max_ptes_none;
> + /* for PMD collapse, respect the user defined maximum */
> + if (is_pmd_order(order))
> + return max_ptes_none;
> + /*
> + * for mTHP collapse with the sysctl value set to KHUGEPAGED_MAX_PTES_LIMIT,
> + * scale the maximum number of PTEs to the order of the collapse.
> + */
> + if (max_ptes_none == KHUGEPAGED_MAX_PTES_LIMIT)
> + return (1 << order) - 1;
> + if (!max_ptes_none)
> + return 0;
> + /*
> + * For mTHP collapse of values other than 0 or KHUGEPAGED_MAX_PTES_LIMIT,
> + * emit a warning and return 0.
> + */
> + pr_warn_once("mTHP collapse does not support max_ptes_none values"
> + " other than 0 or %u, defaulting to 0.\n",
> + KHUGEPAGED_MAX_PTES_LIMIT);
> + return 0;
> }
>
> /**
> * collapse_max_ptes_shared - Calculate maximum allowed PTEs that map shared
> * anonymous pages for the given collapse operation.
> * @cc: The collapse control struct
> + * @order: The folio order being collapsed to
> *
> * Return: Maximum number of PTEs that map shared anonymous pages for the
> * collapse operation
> */
> -static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
> +static unsigned int collapse_max_ptes_shared(struct collapse_control *cc,
> + unsigned int order)
> {
> /*
> * For MADV_COLLAPSE, do not restrict the number of PTEs that map shared
> @@ -384,6 +406,13 @@ static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
> */
> if (!cc->is_khugepaged)
> return HPAGE_PMD_NR;
> + /*
> + * for mTHP collapse do not allow collapsing anonymous memory pages that
> + * are shared between processes.
> + */
> + if (!is_pmd_order(order))
> + return 0;
> + /* for PMD collapse, respect the user defined maximum */
> return khugepaged_max_ptes_shared;
> }
>
> @@ -391,11 +420,13 @@ static unsigned int collapse_max_ptes_shared(struct collapse_control *cc)
> * collapse_max_ptes_swap - Calculate the maximum allowed non-present PTEs or the
> * maximum allowed non-present pagecache entries for the given collapse operation.
> * @cc: The collapse control struct
> + * @order: The folio order being collapsed to
> *
> * Return: Maximum number of non-present PTEs or the maximum allowed non-present
> * pagecache entries for the collapse operation.
> */
> -static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
> +static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
> + unsigned int order)
> {
> /*
> * For MADV_COLLAPSE, do not restrict the number PTEs entries or
> @@ -403,6 +434,10 @@ static unsigned int collapse_max_ptes_swap(struct collapse_control *cc)
> */
> if (!cc->is_khugepaged)
> return HPAGE_PMD_NR;
> + /* for mTHP collapse do not allow any non-present PTEs or pagecache entries */
> + if (!is_pmd_order(order))
> + return 0;
> + /* for PMD collapse, respect the user defined maximum */
> return khugepaged_max_ptes_swap;
> }
>
> @@ -596,10 +631,11 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>
> static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> unsigned long start_addr, pte_t *pte, struct collapse_control *cc,
> - struct list_head *compound_pagelist)
> + unsigned int order, struct list_head *compound_pagelist)
> {
> - const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> - const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
> + const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, order);
> + const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, order);
> + const unsigned long nr_pages = 1UL << order;
> struct page *page = NULL;
> struct folio *folio = NULL;
> unsigned long addr = start_addr;
> @@ -607,7 +643,7 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> int none_or_zero = 0, shared = 0, referenced = 0;
> enum scan_result result = SCAN_FAIL;
>
> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> + for (_pte = pte; _pte < pte + nr_pages;
> _pte++, addr += PAGE_SIZE) {
> pte_t pteval = ptep_get(_pte);
> if (pte_none_or_zero(pteval)) {
> @@ -740,18 +776,18 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> }
>
> static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> - struct vm_area_struct *vma,
> - unsigned long address,
> - spinlock_t *ptl,
> - struct list_head *compound_pagelist)
> + struct vm_area_struct *vma, unsigned long address,
> + spinlock_t *ptl, unsigned int order,
> + struct list_head *compound_pagelist)
> {
> - unsigned long end = address + HPAGE_PMD_SIZE;
> + const unsigned long nr_pages = 1UL << order;
> + unsigned long end = address + (PAGE_SIZE << order);
> struct folio *src, *tmp;
> pte_t pteval;
> pte_t *_pte;
> unsigned int nr_ptes;
>
> - for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte += nr_ptes,
> + for (_pte = pte; _pte < pte + nr_pages; _pte += nr_ptes,
> address += nr_ptes * PAGE_SIZE) {
> nr_ptes = 1;
> pteval = ptep_get(_pte);
> @@ -804,11 +840,10 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
> }
>
> static void __collapse_huge_page_copy_failed(pte_t *pte,
> - pmd_t *pmd,
> - pmd_t orig_pmd,
> - struct vm_area_struct *vma,
> - struct list_head *compound_pagelist)
> + pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> + unsigned int order, struct list_head *compound_pagelist)
> {
> + const unsigned long nr_pages = 1UL << order;
> spinlock_t *pmd_ptl;
>
> /*
> @@ -824,7 +859,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> * Release both raw and compound pages isolated
> * in __collapse_huge_page_isolate.
> */
> - release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> + release_pte_pages(pte, pte + nr_pages, compound_pagelist);
> }
>
> /*
> @@ -844,16 +879,17 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
> */
> static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
> pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
> - unsigned long address, spinlock_t *ptl,
> + unsigned long address, spinlock_t *ptl, unsigned int order,
> struct list_head *compound_pagelist)
> {
> + const unsigned long nr_pages = 1UL << order;
> unsigned int i;
> enum scan_result result = SCAN_SUCCEED;
>
> /*
> * Copying pages' contents is subject to memory poison at any iteration.
> */
> - for (i = 0; i < HPAGE_PMD_NR; i++) {
> + for (i = 0; i < nr_pages; i++) {
> pte_t pteval = ptep_get(pte + i);
> struct page *page = folio_page(folio, i);
> unsigned long src_addr = address + i * PAGE_SIZE;
> @@ -872,10 +908,10 @@ static enum scan_result __collapse_huge_page_copy(pte_t *pte, struct folio *foli
>
> if (likely(result == SCAN_SUCCEED))
> __collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> - compound_pagelist);
> + order, compound_pagelist);
> else
> __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> - compound_pagelist);
> + order, compound_pagelist);
>
> return result;
> }
> @@ -1042,16 +1078,20 @@ static enum scan_result check_pmd_still_valid(struct mm_struct *mm,
> * Bring missing pages in from swap, to complete THP collapse.
> * Only done if khugepaged_scan_pmd believes it is worthwhile.
> *
> + * For mTHP orders the function bails on the first swap entry, because
> + * faulting pages back in during collapse could re-populate PTEs that
> + * push a later scan over the threshold for a higher-order collapse.
> + *
> * Called and returns without pte mapped or spinlocks held.
> * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
> */
> static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> - struct vm_area_struct *vma, unsigned long start_addr, pmd_t *pmd,
> - int referenced)
> + struct vm_area_struct *vma, unsigned long start_addr,
> + pmd_t *pmd, int referenced, unsigned int order)
> {
> int swapped_in = 0;
> vm_fault_t ret = 0;
> - unsigned long addr, end = start_addr + (HPAGE_PMD_NR * PAGE_SIZE);
> + unsigned long addr, end = start_addr + (PAGE_SIZE << order);
> enum scan_result result;
> pte_t *pte = NULL;
> spinlock_t *ptl;
> @@ -1083,6 +1123,19 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> pte_present(vmf.orig_pte))
> continue;
>
> + /*
> + * TODO: Support swapin without leading to further mTHP
> + * collapses. Currently bringing in new pages via swapin may
> + * cause a future higher order collapse on a rescan of the same
> + * range.
> + */
> + if (!is_pmd_order(order)) {
> + pte_unmap(pte);
> + mmap_read_unlock(mm);
> + result = SCAN_EXCEED_SWAP_PTE;
> + goto out;
> + }
> +
> vmf.pte = pte;
> vmf.ptl = ptl;
> ret = do_swap_page(&vmf);
> @@ -1203,7 +1256,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> * that case. Continuing to collapse causes inconsistency.
> */
> result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> - referenced);
> + referenced, HPAGE_PMD_ORDER);
> if (result != SCAN_SUCCEED)
> goto out_nolock;
> }
> @@ -1251,6 +1304,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> if (pte) {
> result = __collapse_huge_page_isolate(vma, address, pte, cc,
> + HPAGE_PMD_ORDER,
> &compound_pagelist);
> spin_unlock(pte_ptl);
> } else {
> @@ -1281,6 +1335,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>
> result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> vma, address, pte_ptl,
> + HPAGE_PMD_ORDER,
> &compound_pagelist);
> pte_unmap(pte);
> if (unlikely(result != SCAN_SUCCEED))
> @@ -1316,9 +1371,9 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> struct vm_area_struct *vma, unsigned long start_addr,
> bool *lock_dropped, struct collapse_control *cc)
> {
> - const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma);
> - const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc);
> - const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
> + const unsigned int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> + const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
> + const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> pmd_t *pmd;
> pte_t *pte, *_pte;
> int none_or_zero = 0, shared = 0, referenced = 0;
> @@ -2372,8 +2427,8 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
> unsigned long addr, struct file *file, pgoff_t start,
> struct collapse_control *cc)
> {
> - const unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL);
> - const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc);
> + const unsigned int max_ptes_none = collapse_max_ptes_none(cc, NULL, HPAGE_PMD_ORDER);
> + const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> struct folio *folio = NULL;
> struct address_space *mapping = file->f_mapping;
> XA_STATE(xas, &mapping->i_pages, start);
^ permalink raw reply related
* Re: [PATCH mm-unstable v18 06/14] mm/khugepaged: generalize collapse_huge_page for mTHP collapse
From: Nico Pache @ 2026-05-26 14:42 UTC (permalink / raw)
To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, akpm
Cc: aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe, Usama Arif
In-Reply-To: <20260522150009.121603-7-npache@redhat.com>
On 5/22/26 9:00 AM, Nico Pache wrote:
> Pass an order and offset to collapse_huge_page to support collapsing anon
> memory to arbitrary orders within a PMD. order indicates what mTHP size we
> are attempting to collapse to, and offset indicates were in the PMD to
> start the collapse attempt.
>
> For non-PMD collapse we must leave the anon VMA write locked until after
> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> the mTHP case this is not true, and we must keep the lock to prevent
> access/changes to the page tables. This can happen if the rmap walkers hit
> a pmd_none while the PMD entry is currently unavailable due to being
> temporarily removed during the collapse phase.
>
> Acked-by: Usama Arif <usama.arif@linux.dev>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
Hi Andrew can you please append the following fixup to this commit!
Changes are described below.
Thank you :)
commit ed96f34ba40ffd2d6a5b54abe46fd6b480fc89af
Author: Nico Pache <npache@redhat.com>
Date: Tue May 26 05:54:04 2026 -0600
fixup: add a clarifying comment and change warn_on
Add a clarifying comment describing how the locking/mmu notifer is handled
and change the WARN_ON_ONCE to VM_WARN_ON_ONCE per davids suggestion.
Acked-by: David Hildenbrand (arm) <david@kernel.org>
Signed-off-by: Nico Pache <npache@redhat.com>
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 9ab54397ae08..61d9494293b9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1283,6 +1283,13 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
anon_vma_lock_write(vma->anon_vma);
anon_vma_locked = true;
+ /*
+ * Only notify about the PTE range we will actually modify. While we
+ * temporary unmap the whole PTE table for mTHP collapse, we'll remap
+ * it later, leaving other PTEs effectively unmodified. The locks we
+ * hold prevent anybody from stumbling over such temporarily unmapped
+ * PTE tables.
+ */
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
end_addr);
mmu_notifier_invalidate_range_start(&range);
@@ -1348,7 +1355,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
*/
__folio_mark_uptodate(folio);
spin_lock(pmd_ptl);
- WARN_ON_ONCE(!pmd_none(*pmd));
+ VM_WARN_ON_ONCE(!pmd_none(*pmd));
if (is_pmd_order(order)) {
pgtable = pmd_pgtable(_pmd);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
> mm/khugepaged.c | 93 +++++++++++++++++++++++++++++--------------------
> 1 file changed, 55 insertions(+), 38 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index fab35d318641..d64f42f66236 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1214,34 +1214,36 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> * while allocating a THP, as that could trigger direct reclaim/compaction.
> * Note that the VMA must be rechecked after grabbing the mmap_lock again.
> */
> -static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long address,
> - int referenced, int unmapped, struct collapse_control *cc)
> +static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long start_addr,
> + int referenced, int unmapped, struct collapse_control *cc,
> + unsigned int order)
> {
> + const unsigned long pmd_addr = start_addr & HPAGE_PMD_MASK;
> + const unsigned long end_addr = start_addr + (PAGE_SIZE << order);
> LIST_HEAD(compound_pagelist);
> pmd_t *pmd, _pmd;
> - pte_t *pte;
> + pte_t *pte = NULL;
> pgtable_t pgtable;
> struct folio *folio;
> spinlock_t *pmd_ptl, *pte_ptl;
> enum scan_result result = SCAN_FAIL;
> struct vm_area_struct *vma;
> struct mmu_notifier_range range;
> + bool anon_vma_locked = false;
>
> - VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -
> - result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> + result = alloc_charge_folio(&folio, mm, cc, order);
> if (result != SCAN_SUCCEED)
> goto out_nolock;
>
> mmap_read_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> - HPAGE_PMD_ORDER);
> + result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> + &vma, cc, order);
> if (result != SCAN_SUCCEED) {
> mmap_read_unlock(mm);
> goto out_nolock;
> }
>
> - result = find_pmd_or_thp_or_none(mm, address, &pmd);
> + result = find_pmd_or_thp_or_none(mm, pmd_addr, &pmd);
> if (result != SCAN_SUCCEED) {
> mmap_read_unlock(mm);
> goto out_nolock;
> @@ -1253,8 +1255,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> * released when it fails. So we jump out_nolock directly in
> * that case. Continuing to collapse causes inconsistency.
> */
> - result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> - referenced, HPAGE_PMD_ORDER);
> + result = __collapse_huge_page_swapin(mm, vma, start_addr, pmd,
> + referenced, order);
> if (result != SCAN_SUCCEED)
> goto out_nolock;
> }
> @@ -1269,20 +1271,21 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> * mmap_lock.
> */
> mmap_write_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> - HPAGE_PMD_ORDER);
> + result = hugepage_vma_revalidate(mm, pmd_addr, /*expect_anon=*/ true,
> + &vma, cc, order);
> if (result != SCAN_SUCCEED)
> goto out_up_write;
> /* check if the pmd is still valid */
> vma_start_write(vma);
> - result = check_pmd_still_valid(mm, address, pmd);
> + result = check_pmd_still_valid(mm, pmd_addr, pmd);
> if (result != SCAN_SUCCEED)
> goto out_up_write;
>
> anon_vma_lock_write(vma->anon_vma);
> + anon_vma_locked = true;
>
> - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> - address + HPAGE_PMD_SIZE);
> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, start_addr,
> + end_addr);
> mmu_notifier_invalidate_range_start(&range);
>
> pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> @@ -1294,26 +1297,23 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> * Parallel GUP-fast is fine since GUP-fast will back off when
> * it detects PMD is changed.
> */
> - _pmd = pmdp_collapse_flush(vma, address, pmd);
> + _pmd = pmdp_collapse_flush(vma, pmd_addr, pmd);
> spin_unlock(pmd_ptl);
> mmu_notifier_invalidate_range_end(&range);
> tlb_remove_table_sync_one();
>
> - pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> + pte = pte_offset_map_lock(mm, &_pmd, start_addr, &pte_ptl);
> if (pte) {
> - result = __collapse_huge_page_isolate(vma, address, pte, cc,
> - HPAGE_PMD_ORDER,
> - &compound_pagelist);
> + result = __collapse_huge_page_isolate(vma, start_addr, pte, cc,
> + order, &compound_pagelist);
> spin_unlock(pte_ptl);
> } else {
> result = SCAN_NO_PTE_TABLE;
> }
>
> if (unlikely(result != SCAN_SUCCEED)) {
> - if (pte)
> - pte_unmap(pte);
> spin_lock(pmd_ptl);
> - BUG_ON(!pmd_none(*pmd));
> + WARN_ON_ONCE(!pmd_none(*pmd));
> /*
> * We can only use set_pmd_at when establishing
> * hugepmds and never for establishing regular pmds that
> @@ -1321,21 +1321,24 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> */
> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> spin_unlock(pmd_ptl);
> - anon_vma_unlock_write(vma->anon_vma);
> goto out_up_write;
> }
>
> /*
> - * All pages are isolated and locked so anon_vma rmap
> - * can't run anymore.
> + * For PMD collapse all pages are isolated and locked so anon_vma
> + * rmap can't run anymore. For mTHP collapse the PMD entry has been
> + * removed and not all pages are isolated and locked, so we must hold
> + * the lock to prevent neighboring folios from attempting to access
> + * this PMD until its reinstalled.
> */
> - anon_vma_unlock_write(vma->anon_vma);
> + if (is_pmd_order(order)) {
> + anon_vma_unlock_write(vma->anon_vma);
> + anon_vma_locked = false;
> + }
>
> result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> - vma, address, pte_ptl,
> - HPAGE_PMD_ORDER,
> - &compound_pagelist);
> - pte_unmap(pte);
> + vma, start_addr, pte_ptl,
> + order, &compound_pagelist);
> if (unlikely(result != SCAN_SUCCEED))
> goto out_up_write;
>
> @@ -1345,18 +1348,32 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> * write.
> */
> __folio_mark_uptodate(folio);
> - pgtable = pmd_pgtable(_pmd);
> -
> spin_lock(pmd_ptl);
> - BUG_ON(!pmd_none(*pmd));
> - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> - map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> + WARN_ON_ONCE(!pmd_none(*pmd));
> + if (is_pmd_order(order)) {
> + pgtable = pmd_pgtable(_pmd);
> + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> + map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_addr);
> + } else {
> + /*
> + * set_ptes is called in map_anon_folio_pte_nopf with the
> + * pmd_ptl lock still held; this is safe as the PMD is expected
> + * to be none. The pmd entry is then repopulated below.
> + */
> + map_anon_folio_pte_nopf(folio, pte, vma, start_addr, /*uffd_wp=*/ false);
> + smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
> + pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> + }
> spin_unlock(pmd_ptl);
>
> folio = NULL;
>
> result = SCAN_SUCCEED;
> out_up_write:
> + if (anon_vma_locked)
> + anon_vma_unlock_write(vma->anon_vma);
> + if (pte)
> + pte_unmap(pte);
> mmap_write_unlock(mm);
> out_nolock:
> if (folio)
> @@ -1536,7 +1553,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
> /* collapse_huge_page expects the lock to be dropped before calling */
> mmap_read_unlock(mm);
> result = collapse_huge_page(mm, start_addr, referenced,
> - unmapped, cc);
> + unmapped, cc, HPAGE_PMD_ORDER);
> /* collapse_huge_page will return with the mmap_lock released */
> *lock_dropped = true;
> }
^ permalink raw reply related
* Re: [PATCH mm-unstable v18 14/14] Documentation: mm: update the admin guide for mTHP collapse
From: Nico Pache @ 2026-05-26 14:45 UTC (permalink / raw)
To: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, akpm
Cc: aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
willy, yang, ying.huang, ziy, zokeefe, Bagas Sanjaya
In-Reply-To: <20260522150009.121603-15-npache@redhat.com>
On 5/22/26 9:00 AM, Nico Pache wrote:
> Now that we can collapse to mTHPs lets update the admin guide to
> reflect these changes and provide proper guidance on how to utilize it.
>
> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
Hi Andrew,
Can you please append the following fixup to this commit.
The changes are simply undoing a deleted note i added and reworking it slightly to reflect the new khugepaged behavior.
Cheers!
--Nico
commit d81806992231ef920c731e62468a3a1b2ef6b869
Author: Nico Pache <npache@redhat.com>
Date: Tue May 26 07:47:42 2026 -0600
fixup: add back note and edit doc about khugepaged limits
Signed-off-by: Nico Pache <npache@redhat.com>
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 644869d3adfd..ebec1e6b0e6b 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -265,6 +265,11 @@ support the following arguments::
Khugepaged controls
-------------------
+.. note::
+ khugepaged currently only searches for opportunities to collapse file/shmem
+ to PMD-sized THP. Only anonymous memory will attempt to collapse to other THP
+ sizes.
+
khugepaged runs usually at low frequency so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However it's
> Documentation/admin-guide/mm/transhuge.rst | 50 +++++++++++++---------
> 1 file changed, 30 insertions(+), 20 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 80a4d0bed70b..644869d3adfd 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -63,7 +63,8 @@ often.
> THP can be enabled system wide or restricted to certain tasks or even
> memory ranges inside task's address space. Unless THP is completely
> disabled, there is ``khugepaged`` daemon that scans memory and
> -collapses sequences of basic pages into PMD-sized huge pages.
> +collapses sequences of basic pages into huge pages of either PMD size
> +or mTHP sizes, if the system is configured to do so.
>
> The THP behaviour is controlled via :ref:`sysfs <thp_sysfs>`
> interface and using madvise(2) and prctl(2) system calls.
> @@ -219,10 +220,10 @@ this behaviour by writing 0 to shrink_underused, and enable it by writing
> echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused
> echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused
>
> -khugepaged will be automatically started when PMD-sized THP is enabled
> +khugepaged will be automatically started when any THP size is enabled
> (either of the per-size anon control or the top-level control are set
> to "always" or "madvise"), and it'll be automatically shutdown when
> -PMD-sized THP is disabled (when both the per-size anon control and the
> +all THP sizes are disabled (when both the per-size anon control and the
> top-level control are "never")
>
> process THP controls
> @@ -264,11 +265,6 @@ support the following arguments::
> Khugepaged controls
> -------------------
>
> -.. note::
> - khugepaged currently only searches for opportunities to collapse to
> - PMD-sized THP and no attempt is made to collapse to other THP
> - sizes.
> -
> khugepaged runs usually at low frequency so while one may not want to
> invoke defrag algorithms synchronously during the page faults, it
> should be worth invoking defrag at least in khugepaged. However it's
> @@ -296,11 +292,11 @@ allocation failure to throttle the next allocation attempt::
> The khugepaged progress can be seen in the number of pages collapsed (note
> that this counter may not be an exact count of the number of pages
> collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
> -being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
> -one 2M hugepage. Each may happen independently, or together, depending on
> -the type of memory and the failures that occur. As such, this value should
> -be interpreted roughly as a sign of progress, and counters in /proc/vmstat
> -consulted for more accurate accounting)::
> +being replaced by a PMD mapping, or (2) physical pages replaced by one
> +hugepage of various sizes (PMD-sized or mTHP). Each may happen independently,
> +or together, depending on the type of memory and the failures that occur.
> +As such, this value should be interpreted roughly as a sign of progress,
> +and counters in /proc/vmstat consulted for more accurate accounting)::
>
> /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
>
> @@ -308,16 +304,21 @@ for each pass::
>
> /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
>
> -``max_ptes_none`` specifies how many extra small pages (that are
> -not already mapped) can be allocated when collapsing a group
> -of small pages into one large page::
> +``max_ptes_none`` specifies how many empty (none/zero) pages are allowed
> +when collapsing a group of small pages into one large page::
>
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>
> -A higher value leads to use additional memory for programs.
> -A lower value leads to gain less thp performance. Value of
> -max_ptes_none can waste cpu time very little, you can
> -ignore it.
> +For PMD-sized THP collapse, this directly limits the number of empty pages
> +allowed in the 2MB region.
> +
> +For mTHP collapse, only 0 or (HPAGE_PMD_NR - 1) are supported. At
> +HPAGE_PMD_NR - 1, we collapse to the highest possible order. Any intermediate
> +value will emit a warning and mTHP collapse will default to max_ptes_none=0.
> +
> +A higher value allows more empty pages, potentially leading to more memory
> +usage but better THP performance. A lower value is more conservative and
> +may result in fewer THP collapses.
>
> ``max_ptes_swap`` specifies how many pages can be brought in from
> swap when collapsing a group of pages into a transparent huge page::
> @@ -337,6 +338,15 @@ that THP is shared. Exceeding the number would block the collapse::
>
> A higher value may increase memory footprint for some workloads.
>
> +.. note::
> + For mTHP collapse, khugepaged does not support collapsing regions that
> + contain shared or swapped out pages, as this could lead to continuous
> + promotion to higher orders. The collapse will fail if any shared or
> + swapped PTEs are encountered during the scan.
> +
> + Currently, madvise_collapse only supports collapsing to PMD-sized THPs
> + and does not attempt mTHP collapses.
> +
> Boot parameters
> ===============
>
^ permalink raw reply related
* [PATCH 0/5] x86/xen: Get rid of Xen private lazy MMU mode tracking
From: Juergen Gross @ 2026-05-26 15:05 UTC (permalink / raw)
To: linux-kernel, x86, linux-trace-kernel, linux-mm, virtualization
Cc: Juergen Gross, Boris Ostrovsky, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, xen-devel, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Ajay Kaher, Alexey Makhalov, Broadcom internal kernel review list
With generic lazy MMU mode tracking being available, there is no real
need for having Xen PV specific code to track lazy MMU mode, too.
This makes it possible to drop two paravirt hooks.
Note that this series is based on my series "x86/xen: Do some Xen-PV
related cleanups" [1].
[1]: https://lore.kernel.org/lkml/20260522152114.77319-1-jgross@suse.com/
Juergen Gross (5):
x86/xen: Drop lazy mode from trace entries
x86/xen: Change interface of xen_mc_issue()
mm: Refactor lazy_mmu_mode_pause() and lazy_mmu_mode_resume()
x86/xen: Get rid of last XEN_LAZY_MMU uses
x86/xen: Replace generic lazy tracking with cpu specific one
arch/x86/include/asm/paravirt.h | 9 ++--
arch/x86/include/asm/paravirt_types.h | 11 +----
arch/x86/include/asm/xen/hypervisor.h | 25 +---------
arch/x86/kernel/paravirt.c | 6 +--
arch/x86/xen/enlighten_pv.c | 30 +++++-------
arch/x86/xen/mmu_pv.c | 66 +++++++++------------------
arch/x86/xen/xen-ops.h | 12 +++--
include/linux/pgtable.h | 56 ++++++++++++++++++-----
include/trace/events/xen.h | 33 ++++++++------
9 files changed, 112 insertions(+), 136 deletions(-)
--
2.54.0
^ permalink raw reply
* [PATCH 1/5] x86/xen: Drop lazy mode from trace entries
From: Juergen Gross @ 2026-05-26 15:05 UTC (permalink / raw)
To: linux-kernel, x86, linux-trace-kernel
Cc: Juergen Gross, Boris Ostrovsky, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, xen-devel
In-Reply-To: <20260526150514.129330-1-jgross@suse.com>
Drop the lazy mode (cpu or mmu) from the xen_mc_batch and xen_mc_issue
trace entries.
This is done in preparation of removing the xen_lazy_mode percpu
variable.
Signed-off-by: Juergen Gross <jgross@suse.com>
---
arch/x86/xen/xen-ops.h | 11 +++++++----
include/trace/events/xen.h | 33 +++++++++++++++++++--------------
2 files changed, 26 insertions(+), 18 deletions(-)
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 6808010ac379..dc892f421f25 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -98,7 +98,7 @@ static inline void xen_mc_batch(void)
/* need to disable interrupts until this entry is complete */
local_irq_save(flags);
- trace_xen_mc_batch(xen_get_lazy_mode());
+ trace_xen_mc_batch(flags);
__this_cpu_write(xen_mc_irq_flags, flags);
}
@@ -114,13 +114,16 @@ void xen_mc_flush(void);
/* Issue a multicall if we're not in a lazy mode */
static inline void xen_mc_issue(unsigned mode)
{
- trace_xen_mc_issue(mode);
+ bool flush = !(xen_get_lazy_mode() & mode);
+ unsigned long flags = this_cpu_read(xen_mc_irq_flags);
- if ((xen_get_lazy_mode() & mode) == 0)
+ trace_xen_mc_issue(flush, flags);
+
+ if (flush)
xen_mc_flush();
/* restore flags saved in xen_mc_batch */
- local_irq_restore(this_cpu_read(xen_mc_irq_flags));
+ local_irq_restore(flags);
}
/* Set up a callback to be called when the current batch is flushed */
diff --git a/include/trace/events/xen.h b/include/trace/events/xen.h
index e3f139f0bc78..ad384969e2cb 100644
--- a/include/trace/events/xen.h
+++ b/include/trace/events/xen.h
@@ -12,24 +12,29 @@
struct multicall_entry;
/* Multicalls */
-DECLARE_EVENT_CLASS(xen_mc__batch,
- TP_PROTO(enum xen_lazy_mode mode),
- TP_ARGS(mode),
+TRACE_EVENT(xen_mc_batch,
+ TP_PROTO(unsigned long flags),
+ TP_ARGS(flags),
TP_STRUCT__entry(
- __field(enum xen_lazy_mode, mode)
+ __field(unsigned long, flags)
),
- TP_fast_assign(__entry->mode = mode),
- TP_printk("start batch LAZY_%s",
- (__entry->mode == XEN_LAZY_MMU) ? "MMU" :
- (__entry->mode == XEN_LAZY_CPU) ? "CPU" : "NONE")
+ TP_fast_assign(__entry->flags = flags),
+ TP_printk("start batch lazy flags %lx", __entry->flags)
);
-#define DEFINE_XEN_MC_BATCH(name) \
- DEFINE_EVENT(xen_mc__batch, name, \
- TP_PROTO(enum xen_lazy_mode mode), \
- TP_ARGS(mode))
-DEFINE_XEN_MC_BATCH(xen_mc_batch);
-DEFINE_XEN_MC_BATCH(xen_mc_issue);
+TRACE_EVENT(xen_mc_issue,
+ TP_PROTO(bool flush, unsigned long flags),
+ TP_ARGS(flush, flags),
+ TP_STRUCT__entry(
+ __field(unsigned long, flags)
+ __field(bool, flush)
+ ),
+ TP_fast_assign(__entry->flush = flush;
+ __entry->flags = flags;
+ ),
+ TP_printk("flush: %s, flags %lx",
+ __entry->flush ? "yes" : "no", __entry->flags)
+ );
TRACE_DEFINE_SIZEOF(ulong);
--
2.54.0
^ permalink raw reply related
* Re: [PATCH RFC 3/3] mm: use non-temporal stores for demotion
From: Gregory Price @ 2026-05-26 15:25 UTC (permalink / raw)
To: Yiannis Nikolakopoulos
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Trond Myklebust, Anna Schumaker,
Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Ying Huang, Alistair Popple, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Brendan Jackman, Johannes Weiner,
David Rientjes, Davidlohr Bueso, Fan Ni, Frank van der Linden,
Jonathan Cameron, Raghavendra K T, Rao, Bharata Bhasker,
SeongJae Park, Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos,
dimitrios, Ryan Roberts, linux-kernel, linux-mm, linux-nfs,
linux-trace-kernel, Alirad Malek
In-Reply-To: <20260526-rfc-nt-demote-v1-3-eb9c9422daef@zptcorp.com>
On Tue, May 26, 2026 at 01:37:04PM +0200, Yiannis Nikolakopoulos wrote:
> From: Alirad Malek <alirad.malek@zptcorp.com>
>
> Memory demoted to a lower tier is assumed to be cold and most likely out of
> the CPU's last level cache. Additionally, in certain demotion targets (e.g.
> CXL devices with compressed memory) the bandwidth can be negatively
> impacted by the eviction patterns of the last level cache when standard
> memcpy is used. When the feature is enabled, use the
> MIGRATE_ASYNC_NON_TEMPORAL_STORES flag in demotions to trigger the folio
> copy path using non-temporal stores.
>
> Signed-off-by: Alirad Malek <alirad.malek@zptcorp.com>
> Co-developed-by: Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>
> Signed-off-by: Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com>
> ---
> mm/Kconfig | 8 ++++++++
> mm/migrate.c | 9 ++++++++-
> 2 files changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ebd8ea353687..4b7a75b57f6e 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -645,6 +645,14 @@ config MIGRATION
> pages as migration can relocate pages to satisfy a huge page
> allocation instead of reclaiming.
>
> +config DEMOTION_WITH_NON_TEMPORAL_STORES
> + bool "Use non-temporal stores for demotion"
> + default n
> + depends on MIGRATION
> + help
> + Enable non-temporal stores when migrating pages due to demotion.
> + If disabled, demotion uses regular migration copy paths.
> +
Do we actually need this config flag or should we just default to this
(if the arch supports NT stores)?
> config DEVICE_MIGRATION
> def_bool MIGRATION && ZONE_DEVICE
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index ff6cf50e7b0b..368d40dc8772 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -862,7 +862,10 @@ static int __migrate_folio(struct address_space *mapping, struct folio *dst,
> if (folio_ref_count(src) != expected_count)
> return -EAGAIN;
>
> - rc = folio_mc_copy(dst, src);
> + if (mode == MIGRATE_ASYNC_NON_TEMPORAL_STORES)
> + rc = folio_mc_copy_nt(dst, src);
> + else
> + rc = folio_mc_copy(dst, src);
> if (unlikely(rc))
> return rc;
>
> @@ -2081,6 +2084,10 @@ int migrate_pages(struct list_head *from, new_folio_t get_new_folio,
> LIST_HEAD(split_folios);
> struct migrate_pages_stats stats;
>
> + if (IS_ENABLED(CONFIG_DEMOTION_WITH_NON_TEMPORAL_STORES) &&
> + reason == MR_DEMOTION && mode == MIGRATE_ASYNC)
> + mode = MIGRATE_ASYNC_NON_TEMPORAL_STORES;
> +
> trace_mm_migrate_pages_start(mode, reason);
>
> memset(&stats, 0, sizeof(stats));
>
> --
> 2.43.0
>
^ permalink raw reply
* Re: [PATCH v6] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-26 15:33 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: LKML, Linux trace kernel, Mathieu Desnoyers, Mark Rutland,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Douglas Raillard,
Tom Zanussi, Andrew Morton, Thomas Gleixner, Ian Rogers,
Jiri Olsa
In-Reply-To: <20260525235507.a81c565023258c63fc9201f4@kernel.org>
On Mon, 25 May 2026 23:55:07 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> On Thu, 21 May 2026 22:50:33 -0400
> Steven Rostedt <rostedt@kernel.org> wrote:
>
> > +static int handle_typecast(char *arg, struct fetch_insn **pcode,
> > + struct fetch_insn *end,
> > + struct traceprobe_parse_context *ctx)
> > +{
> > + char *tmp;
> > + int ret;
> > +
> > + /* Currently this only works for eprobes */
> > + if (!(ctx->flags & TPARG_FL_TEVENT)) {
> > + trace_probe_log_err(ctx->offset, TYPECAST_NOT_EVENT);
> > + return -EINVAL;
> > + }
> > +
> > + tmp = strchr(arg, ')');
> > + if (!tmp) {
> > + trace_probe_log_err(ctx->offset + strlen(arg),
> > + DEREF_OPEN_BRACE);
> > + return -EINVAL;
> > + }
> > + *tmp = '\0';
> > + ret = query_btf_struct(arg + 1, ctx);
> > + *tmp = ')';
>
> BTW, is there any reason to recover this? The @arg is copied
> string, see traceprobe_parse_probe_arg_body().
Yeah I know. But it's just something I prefer to do to keep code more
robust. That is, don't leave side effects if you can help it. It likely
doesn't matter here, but I've always errored on the side of caution. ;-)
-- Steve
^ permalink raw reply
* Re: [PATCH v6] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-26 15:33 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: LKML, Linux trace kernel, Mathieu Desnoyers, Mark Rutland,
Peter Zijlstra, Namhyung Kim, Takaya Saeki, Douglas Raillard,
Tom Zanussi, Andrew Morton, Thomas Gleixner, Ian Rogers,
Jiri Olsa
In-Reply-To: <20260526090910.8aae5c9357e119bd0043b18b@kernel.org>
On Tue, 26 May 2026 09:09:10 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> On Thu, 21 May 2026 22:50:33 -0400
> Steven Rostedt <rostedt@kernel.org> wrote:
>
> > @@ -640,7 +673,7 @@ static int parse_btf_arg(char *varname,
> > int i, is_ptr, ret;
> > u32 tid;
> >
> > - if (WARN_ON_ONCE(!ctx->funcname))
> > + if (WARN_ON_ONCE(!ctx->funcname && !(ctx->flags & TPARG_FL_TEVENT)))
> > return -EINVAL;
> >
> > is_ptr = split_next_field(varname, &field, ctx);
> > @@ -653,6 +686,20 @@ static int parse_btf_arg(char *varname,
> > return -EOPNOTSUPP;
> > }
> >
> > + if (ctx->flags & TPARG_FL_TEVENT) {
> > + int ret;
>
> nit: parse_btf_arg already declared @ret. So we don't need this.
Ah, will fix.
Thanks,
-- Steve
^ permalink raw reply
* Re: [PATCH v2] perf/ftrace: Fix WARNING in __unregister_ftrace_function
From: Steven Rostedt @ 2026-05-26 15:38 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Rik van Riel, Mathieu Desnoyers, linux-kernel, linux-trace-kernel,
kernel-team, sashiko-bot, sashiko-reviews
In-Reply-To: <20260525143947.ea1c0a7b29146d22faa5feda@kernel.org>
On Mon, 25 May 2026 14:39:47 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > The error handling there all seems to "goto err_destroy"
> >
> > err_destroy:
> > if (event->destroy) {
> > event->destroy(event);
> > event->destroy = NULL;
> > }
> >
> >
> > > > However, it does not set event->destroy to NULL.
> >
> > ... but it does?
> >
> > I am not sure what code Sashiko is looking at,
> > but it does not look like the code I just pulled.
>
> Indeed.
>
> >
> > Is there a different tree I should be looking at
> > than upstream Linus?
>
> You can see the baseline info if you expand the collapsed triangle.
> Anyway, it said:
>
> linux-trace/HEAD (70575e77839f4c5337ce2653b39b86bb365a870e)
>
> So that is linux-trace/master.
>
> commit 70575e77839f4c5337ce2653b39b86bb365a870e (linux-trace/master)
> Merge: 7bc6e90d7aa4 a43ae8057cc1
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Fri Sep 30 09:41:34 2022 -0700
>
> Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
>
>
> Hmm, this is too old. And linux-trace/master is not used anymore.
Hmm, I probably should delete that branch.
Thanks,
-- Steve
^ permalink raw reply
* Re: [PATCH v9] blk-mq: add tracepoint block_rq_tag_wait
From: Jens Axboe @ 2026-05-26 16:37 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, Aaron Tomlin
Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
john.g.garry, loberman, neelx, sean, mproche, chjohnst,
linux-block, linux-kernel, linux-trace-kernel
In-Reply-To: <20260525005123.722277-1-atomlin@atomlin.com>
On Sun, 24 May 2026 20:51:23 -0400, Aaron Tomlin wrote:
> In high-performance storage environments, particularly when utilising
> RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
> latency spikes can occur when fast devices (SSDs) are starved of hardware
> tags when sharing the same blk_mq_tag_set.
>
> Currently, diagnosing this specific hardware queue contention is
> difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
> forces the current thread to block uninterruptible via io_schedule().
> While this can be inferred via sched:sched_switch or dynamically
> traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
> dedicated, out-of-the-box observability for this event.
>
> [...]
Applied, thanks!
[1/1] blk-mq: add tracepoint block_rq_tag_wait
commit: 9ece10778f8931630f86e802f94dc71115de0c8c
Best regards,
--
Jens Axboe
^ permalink raw reply
* Re: [PATCH] tracing: Do not call map->ops->elt_free() if elt_alloc() is not succeeded
From: Tom Zanussi @ 2026-05-26 16:48 UTC (permalink / raw)
To: Masami Hiramatsu (Google), Steven Rostedt, Tom Zanussi
Cc: linux-trace-kernel, linux-kernel, Mathieu Desnoyers, Rosen Penev
In-Reply-To: <177933895460.108746.5396070821443932634.stgit@devnote2>
Hi Masami,
On Thu, 2026-05-21 at 13:49 +0900, Masami Hiramatsu (Google) wrote:
> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
>
> In paths where tracing_map_elt_alloc() failed to allocate objects,
> the map->ops->elt_alloc() call was never successful. In this case,
> map->ops->elt_free() should not be called.
>
> This bug was found by Sashiko.
> Link: https://sashiko.dev/#/patchset/20260520223101.34710-1-rosenp%40gmail.com
>
> Fixes: 2734b629525a ("tracing: Add per-element variable support to tracing_map")
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> Cc: stable@vger.kernel.org
> ---
> kernel/trace/tracing_map.c | 17 +++++++++++++----
> 1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/trace/tracing_map.c b/kernel/trace/tracing_map.c
> index bf1a507695b6..0dd7927df22a 100644
> --- a/kernel/trace/tracing_map.c
> +++ b/kernel/trace/tracing_map.c
> @@ -386,13 +386,11 @@ static void tracing_map_elt_init_fields(struct tracing_map_elt *elt)
> }
> }
>
> -static void tracing_map_elt_free(struct tracing_map_elt *elt)
> +static void __tracing_map_elt_free(struct tracing_map_elt *elt)
> {
> if (!elt)
> return;
>
> - if (elt->map->ops && elt->map->ops->elt_free)
> - elt->map->ops->elt_free(elt);
> kfree(elt->fields);
> kfree(elt->vars);
> kfree(elt->var_set);
> @@ -400,6 +398,17 @@ static void tracing_map_elt_free(struct tracing_map_elt *elt)
> kfree(elt);
> }
>
> +static void tracing_map_elt_free(struct tracing_map_elt *elt)
> +{
> + if (!elt)
> + return;
> +
> + /* Only objects initialized with alloc_elt() should be passed to free_elt().*/
> + if (elt->map->ops && elt->map->ops->elt_free)
> + elt->map->ops->elt_free(elt);
> + __tracing_map_elt_free(elt);
> +}
> +
> static struct tracing_map_elt *tracing_map_elt_alloc(struct tracing_map *map)
> {
> struct tracing_map_elt *elt;
> @@ -444,7 +453,7 @@ static struct tracing_map_elt *tracing_map_elt_alloc(struct tracing_map *map)
> }
> return elt;
> free:
> - tracing_map_elt_free(elt);
> + __tracing_map_elt_free(elt);
>
> return ERR_PTR(err);
> }
>
>
Looks good to me. Thanks for fixing it!
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox