[PATCH] mm/vmscan: add balance_pgdat begin/end tracepoints

public inbox for linux-kernel@vger.kernel.org

* [PATCH] mm/vmscan: add balance_pgdat begin/end tracepoints
@ 2026-04-23 10:37 Bunyod Suvonov
  2026-04-23 17:46 ` Shakeel Butt
  2026-04-24  3:14 ` [PATCH v2] " Bunyod Suvonov
  0 siblings, 2 replies; 6+ messages in thread
From: Bunyod Suvonov @ 2026-04-23 10:37 UTC
  To: akpm, hannes, rostedt, mhiramat
  Cc: david, mhocko, zhengqi.arch, shakeel.butt, ljs, mathieu.desnoyers,
	linux-mm, linux-trace-kernel, linux-kernel, Bunyod Suvonov

Vmscan has six main reclaim entry points: try_to_free_pages() for
direct reclaim, try_to_free_mem_cgroup_pages() for memcg reclaim,
mem_cgroup_shrink_node() for memcg soft limit reclaim, node_reclaim()
for node reclaim, shrink_all_memory() for hibernation reclaim, and
balance_pgdat() for kswapd reclaim.

All of them, except for shrink_all_memory() and balance_pgdat(), already
have begin/end tracepoints. This makes it harder to trace which reclaim
path is responsible for memory reclaim activity, because kswapd reclaim
cannot be identified as cleanly as other reclaim entry points, even
though it is the main background reclaim path under memory pressure.
There may be no need to trace shrink_all_memory() as it is primarily
used during hibernation. So this patch adds the missing tracepoint pair
for balance_pgdat().

The begin tracepoint records the node id, requested reclaim order, and
highest_zoneidx. The end tracepoint records the node id, reclaim order
that balance_pgdat() finished with, highest_zoneidx, and nr_reclaimed.
Together, they show the requested reclaim order and zone bound, whether
reclaim fell back to a lower order, and how much reclaim work was done.

Signed-off-by: Bunyod Suvonov <b.suvonov@sjtu.edu.cn>
---
 include/trace/events/vmscan.h | 52 +++++++++++++++++++++++++++++++++++
 mm/vmscan.c                   |  5 ++++
 2 files changed, 57 insertions(+)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 4445a8d9218d..b4bf7b8def1f 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -96,6 +96,58 @@ TRACE_EVENT(mm_vmscan_kswapd_wake,
 		__entry->order)
 );
 
+TRACE_EVENT(mm_vmscan_balance_pgdat_begin,
+
+	TP_PROTO(int nid, int order, int highest_zoneidx),
+
+	TP_ARGS(nid, order, highest_zoneidx),
+
+	TP_STRUCT__entry(
+		__field(int, nid)
+		__field(int, order)
+		__field(int, highest_zoneidx)
+	),
+
+	TP_fast_assign(
+		__entry->nid = nid;
+		__entry->order = order;
+		__entry->highest_zoneidx = highest_zoneidx;
+	),
+
+	TP_printk("nid=%d order=%d highest_zoneidx=%-8s",
+		__entry->nid,
+		__entry->order,
+		__print_symbolic(__entry->highest_zoneidx, ZONE_TYPE))
+);
+
+TRACE_EVENT(mm_vmscan_balance_pgdat_end,
+
+	TP_PROTO(int nid, int order, int highest_zoneidx,
+		 unsigned long nr_reclaimed),
+
+	TP_ARGS(nid, order, highest_zoneidx, nr_reclaimed),
+
+	TP_STRUCT__entry(
+		__field(int, nid)
+		__field(int, order)
+		__field(int, highest_zoneidx)
+		__field(unsigned long, nr_reclaimed)
+	),
+
+	TP_fast_assign(
+		__entry->nid = nid;
+		__entry->order = order;
+		__entry->highest_zoneidx = highest_zoneidx;
+		__entry->nr_reclaimed = nr_reclaimed;
+	),
+
+	TP_printk("nid=%d order=%d highest_zoneidx=%-8s nr_reclaimed=%lu",
+		__entry->nid,
+		__entry->order,
+		__print_symbolic(__entry->highest_zoneidx, ZONE_TYPE),
+		__entry->nr_reclaimed)
+);
+
 TRACE_EVENT(mm_vmscan_wakeup_kswapd,
 
 	TP_PROTO(int nid, int zid, int order, gfp_t gfp_flags),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd1b1aa12581..b2d89ed69d22 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7121,6 +7121,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
 		.may_unmap = 1,
 	};
 
+	trace_mm_vmscan_balance_pgdat_begin(pgdat->node_id, order,
+					    highest_zoneidx);
 	set_task_reclaim_state(current, &sc.reclaim_state);
 	psi_memstall_enter(&pflags);
 	__fs_reclaim_acquire(_THIS_IP_);
@@ -7314,6 +7316,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
 	psi_memstall_leave(&pflags);
 	set_task_reclaim_state(current, NULL);
 
+	trace_mm_vmscan_balance_pgdat_end(pgdat->node_id, sc.order,
+					  highest_zoneidx, sc.nr_reclaimed);
+
 	/*
 	 * Return the order kswapd stopped reclaiming at as
 	 * prepare_kswapd_sleep() takes it into account. If another caller
-- 
2.53.0



* Re: [PATCH] mm/vmscan: add balance_pgdat begin/end tracepoints
  2026-04-23 10:37 [PATCH] mm/vmscan: add balance_pgdat begin/end tracepoints Bunyod Suvonov
@ 2026-04-23 17:46 ` Shakeel Butt
  2026-04-24  0:46   ` SUVONOV BUNYOD
  2026-04-24  3:14 ` [PATCH v2] " Bunyod Suvonov
  1 sibling, 1 reply; 6+ messages in thread
From: Shakeel Butt @ 2026-04-23 17:46 UTC
  To: Bunyod Suvonov
  Cc: akpm, hannes, rostedt, mhiramat, david, mhocko, zhengqi.arch, ljs,
	mathieu.desnoyers, linux-mm, linux-trace-kernel, linux-kernel

On Thu, Apr 23, 2026 at 06:37:53PM +0800, Bunyod Suvonov wrote:
> Vmscan has six main reclaim entry points: try_to_free_pages() for
> direct reclaim, try_to_free_mem_cgroup_pages() for memcg reclaim,
> mem_cgroup_shrink_node() for memcg soft limit reclaim, node_reclaim()
> for node reclaim, shrink_all_memory() for hibernation reclaim, and
> balance_pgdat() for kswapd reclaim.
> 
> All of them, except for shrink_all_memory() and balance_pgdat(), already
> have begin/end tracepoints. This makes it harder to trace which reclaim
> path is responsible for memory reclaim activity, because kswapd reclaim
> cannot be identified as cleanly as other reclaim entry points, even
> though it is the main background reclaim path under memory pressure.
> There may be no need to trace shrink_all_memory() as it is primarily
> used during hibernation. So this patch adds the missing tracepoint pair
> for balance_pgdat().
> 
> The begin tracepoint records the node id, requested reclaim order, and
> highest_zoneidx. The end tracepoint records the node id, reclaim order
> that balance_pgdat() finished with, highest_zoneidx, and nr_reclaimed.

Do we need to trace highest_zoneidx at the end? Can it change within
balance_pgdat()?

> Together, they show the requested reclaim order and zone bound, whether
> reclaim fell back to a lower order, and how much reclaim work was done.
> 
> Signed-off-by: Bunyod Suvonov <b.suvonov@sjtu.edu.cn>

Overall looks good. 



* Re: [PATCH] mm/vmscan: add balance_pgdat begin/end tracepoints
  2026-04-23 17:46 ` Shakeel Butt
@ 2026-04-24  0:46   ` SUVONOV BUNYOD
  2026-04-24  2:15     ` Shakeel Butt
  0 siblings, 1 reply; 6+ messages in thread
From: SUVONOV BUNYOD @ 2026-04-24  0:46 UTC
  To: Shakeel Butt
  Cc: akpm, hannes, rostedt, mhiramat, david, mhocko, zhengqi arch, ljs,
	mathieu desnoyers, linux-mm, linux-trace-kernel, linux-kernel

Thank you for reviewing, Shakeel.

> Do we need to trace highest_zoneidx at the end? Can it change within
> balance_pgdat()?

highest_zoneidx does not change within a balance_pgdat() invocation. It
is passed in as an argument and remains the classzone bound used for the
balancing checks throughout the function.

I kept highest_zoneidx in the end tracepoint to make the outcome event
self-contained. In principle, begin/end correlation is possible, but
under sustained memory pressure kswapd reclaim can be frequent enough
that consumers may prefer to analyze end events directly, and any
dependence on matching begin/end becomes less convenient and less robust
in the presence of filtering or dropped trace records.

Since nr_reclaimed and the final order are only known at the end, having
highest_zoneidx there allows end-only analysis without correlating with
the begin event.

For example, it lets users answer questions like:
- this pass reclaimed too much or too little memory; what highest_zoneidx
did that result correspond to?
- how much reclaim was done when balancing up to ZONE_NORMAL vs other
classzone bounds?
- when highest_zoneidx == ZONE_NORMAL, how often did reclaim finish at
order=0?

So it is there because it provides context for the end-of-reclaim result.
Do you think this is sufficient justification? If not, then I can drop it
from the end tracepoint in v2.
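
As an illustration of the end-only analysis described above, here is a
minimal userspace sketch in C. It is not part of the patch. It assumes
each trace line contains the event name followed by the fields in the
TP_printk() format from the patch, and that ZONE_TYPE prints symbolic
names such as "Normal" or "DMA32". The names struct zone_stats,
struct end_summary, account_end_event() and find_zone() are hypothetical
helpers invented for this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_ZONES	8
#define ZONE_NAME_LEN	16

/* Per-classzone totals built from end events only (hypothetical type). */
struct zone_stats {
	char zone[ZONE_NAME_LEN];	/* symbolic highest_zoneidx */
	unsigned long events;		/* end events seen for this bound */
	unsigned long nr_reclaimed;	/* sum of nr_reclaimed */
	unsigned long order0_finishes;	/* end events that finished at order=0 */
};

/* Must be zero-initialized by the caller before first use. */
struct end_summary {
	struct zone_stats zones[MAX_ZONES];
	int nr_zones;
};

/*
 * Account one trace line. Returns 1 if the line was a
 * mm_vmscan_balance_pgdat_end event and was added to the summary,
 * 0 for any other line, a malformed line, or a full table.
 */
static int account_end_event(struct end_summary *s, const char *line)
{
	static const char tag[] = "mm_vmscan_balance_pgdat_end:";
	char zone[ZONE_NAME_LEN];
	unsigned long nr_reclaimed;
	struct zone_stats *z;
	const char *p;
	int nid, order, i;

	assert(s != NULL && line != NULL);

	p = strstr(line, tag);
	if (p == NULL)
		return 0;
	p += sizeof(tag) - 1;

	/* The space before nr_reclaimed also absorbs the %-8s padding. */
	if (sscanf(p, " nid=%d order=%d highest_zoneidx=%15s nr_reclaimed=%lu",
		   &nid, &order, zone, &nr_reclaimed) != 4)
		return 0;

	/* Find the entry for this classzone bound, or create one. */
	for (i = 0; i < s->nr_zones; i++)
		if (strcmp(s->zones[i].zone, zone) == 0)
			break;

	if (i == s->nr_zones) {
		if (s->nr_zones == MAX_ZONES)
			return 0;
		memset(&s->zones[i], 0, sizeof(s->zones[i]));
		strcpy(s->zones[i].zone, zone);
		s->nr_zones++;
	}

	z = &s->zones[i];
	z->events++;
	z->nr_reclaimed += nr_reclaimed;
	if (order == 0)
		z->order0_finishes++;
	return 1;
}

/* Look up the totals for one classzone bound; NULL if never seen. */
static const struct zone_stats *find_zone(const struct end_summary *s,
					  const char *zone)
{
	int i;

	for (i = 0; i < s->nr_zones; i++)
		if (strcmp(s->zones[i].zone, zone) == 0)
			return &s->zones[i];
	return NULL;
}
```

A caller would zero-initialize a struct end_summary, pass each line of
trace output to account_end_event(), and then use find_zone() to read
the totals for a given bound. Begin events and unrelated lines are
ignored, so no begin/end matching is needed.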


* Re: [PATCH] mm/vmscan: add balance_pgdat begin/end tracepoints
  2026-04-24  0:46   ` SUVONOV BUNYOD
@ 2026-04-24  2:15     ` Shakeel Butt
  0 siblings, 0 replies; 6+ messages in thread
From: Shakeel Butt @ 2026-04-24  2:15 UTC
  To: SUVONOV BUNYOD
  Cc: akpm, hannes, rostedt, mhiramat, david, mhocko, zhengqi arch, ljs,
	mathieu desnoyers, linux-mm, linux-trace-kernel, linux-kernel

On Fri, Apr 24, 2026 at 08:46:24AM +0800, SUVONOV BUNYOD wrote:
> Thank you for reviewing, Shakeel.
> 
> > Do we need to trace highest_zoneidx at the end? Can it change within
> > balance_pgdat()?
> 
> highest_zoneidx does not change within a balance_pgdat() invocation. It
> is passed in as an argument and remains the classzone bound used for the
> balancing checks throughout the function.
> 
> I kept highest_zoneidx in the end tracepoint to make the outcome event
> self-contained. In principle, begin/end correlation is possible, but
> under sustained memory pressure kswapd reclaim can be frequent enough
> that consumers may prefer to analyze end events directly, and any
> dependence on matching begin/end becomes less convenient and less robust
> in the presence of filtering or dropped trace records.
> 
> Since nr_reclaimed and the final order are only known at the end, having
> highest_zoneidx there allows end-only analysis without correlating with
> the begin event.
> 
> For example, it lets users answer questions like:
> - this pass reclaimed too much or too little memory; what highest_zoneidx
> did that result correspond to?
> - how much reclaim was done when balancing up to ZONE_NORMAL vs other
> classzone bounds?
> - when highest_zoneidx == ZONE_NORMAL, how often did reclaim finish at
> order=0?
> 
> So it is there because it provides context for the end-of-reclaim result.
> Do you think this is sufficient justification? If not, then I can drop it
> from the end tracepoint in v2.

I think it is ok but let's add this reasoning in the commit message.



* [PATCH v2] mm/vmscan: add balance_pgdat begin/end tracepoints
  2026-04-23 10:37 [PATCH] mm/vmscan: add balance_pgdat begin/end tracepoints Bunyod Suvonov
  2026-04-23 17:46 ` Shakeel Butt
@ 2026-04-24  3:14 ` Bunyod Suvonov
  2026-04-24  3:16   ` Shakeel Butt
  1 sibling, 1 reply; 6+ messages in thread
From: Bunyod Suvonov @ 2026-04-24  3:14 UTC
  To: akpm, hannes, rostedt, mhiramat
  Cc: david, mhocko, zhengqi.arch, shakeel.butt, ljs, mathieu.desnoyers,
	linux-mm, linux-trace-kernel, linux-kernel, Bunyod Suvonov

Vmscan has six main reclaim entry points: try_to_free_pages() for
direct reclaim, try_to_free_mem_cgroup_pages() for memcg reclaim,
mem_cgroup_shrink_node() for memcg soft limit reclaim, node_reclaim()
for node reclaim, shrink_all_memory() for hibernation reclaim, and
balance_pgdat() for kswapd reclaim.

All of them, except for shrink_all_memory() and balance_pgdat(), already
have begin/end tracepoints. This makes it harder to trace which reclaim
path is responsible for memory reclaim activity, because kswapd reclaim
cannot be identified as cleanly as other reclaim entry points, even
though it is the main background reclaim path under memory pressure.
There may be no need to trace shrink_all_memory() as it is primarily
used during hibernation. So this patch adds the missing tracepoint pair
for balance_pgdat().

The begin tracepoint records the node id, requested reclaim order, and
the requested classzone bound (highest_zoneidx). The end tracepoint
records the node id, the reclaim order that balance_pgdat() finished
with, the requested classzone bound, and nr_reclaimed. Together, they
show the requested reclaim order and classzone bound, whether reclaim
fell back to a lower order, and how much reclaim work was done.

The end tracepoint also records highest_zoneidx even though it does not
change within a balance_pgdat() invocation. This keeps the end event
self-contained, so users can analyze reclaim results directly from end
events without depending on begin/end correlation, which is less
convenient when tracing is filtered or records are dropped. It also
makes it straightforward to relate nr_reclaimed and the final reclaim
order to the requested classzone bound.

Signed-off-by: Bunyod Suvonov <b.suvonov@sjtu.edu.cn>
---
v2:
- explain why highest_zoneidx is kept in the end tracepoint

 include/trace/events/vmscan.h | 52 +++++++++++++++++++++++++++++++++++
 mm/vmscan.c                   |  5 ++++
 2 files changed, 57 insertions(+)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 4445a8d9218d..b4bf7b8def1f 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -96,6 +96,58 @@ TRACE_EVENT(mm_vmscan_kswapd_wake,
 		__entry->order)
 );

+TRACE_EVENT(mm_vmscan_balance_pgdat_begin,
+
+	TP_PROTO(int nid, int order, int highest_zoneidx),
+
+	TP_ARGS(nid, order, highest_zoneidx),
+
+	TP_STRUCT__entry(
+		__field(int, nid)
+		__field(int, order)
+		__field(int, highest_zoneidx)
+	),
+
+	TP_fast_assign(
+		__entry->nid = nid;
+		__entry->order = order;
+		__entry->highest_zoneidx = highest_zoneidx;
+	),
+
+	TP_printk("nid=%d order=%d highest_zoneidx=%-8s",
+		__entry->nid,
+		__entry->order,
+		__print_symbolic(__entry->highest_zoneidx, ZONE_TYPE))
+);
+
+TRACE_EVENT(mm_vmscan_balance_pgdat_end,
+
+	TP_PROTO(int nid, int order, int highest_zoneidx,
+		 unsigned long nr_reclaimed),
+
+	TP_ARGS(nid, order, highest_zoneidx, nr_reclaimed),
+
+	TP_STRUCT__entry(
+		__field(int, nid)
+		__field(int, order)
+		__field(int, highest_zoneidx)
+		__field(unsigned long, nr_reclaimed)
+	),
+
+	TP_fast_assign(
+		__entry->nid = nid;
+		__entry->order = order;
+		__entry->highest_zoneidx = highest_zoneidx;
+		__entry->nr_reclaimed = nr_reclaimed;
+	),
+
+	TP_printk("nid=%d order=%d highest_zoneidx=%-8s nr_reclaimed=%lu",
+		__entry->nid,
+		__entry->order,
+		__print_symbolic(__entry->highest_zoneidx, ZONE_TYPE),
+		__entry->nr_reclaimed)
+);
+
 TRACE_EVENT(mm_vmscan_wakeup_kswapd,

 	TP_PROTO(int nid, int zid, int order, gfp_t gfp_flags),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd1b1aa12581..b2d89ed69d22 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7121,6 +7121,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
 		.may_unmap = 1,
 	};

+	trace_mm_vmscan_balance_pgdat_begin(pgdat->node_id, order,
+					    highest_zoneidx);
 	set_task_reclaim_state(current, &sc.reclaim_state);
 	psi_memstall_enter(&pflags);
 	__fs_reclaim_acquire(_THIS_IP_);
@@ -7314,6 +7316,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
 	psi_memstall_leave(&pflags);
 	set_task_reclaim_state(current, NULL);

+	trace_mm_vmscan_balance_pgdat_end(pgdat->node_id, sc.order,
+					  highest_zoneidx, sc.nr_reclaimed);
+
 	/*
 	 * Return the order kswapd stopped reclaiming at as
 	 * prepare_kswapd_sleep() takes it into account. If another caller
-- 
2.53.0


* Re: [PATCH v2] mm/vmscan: add balance_pgdat begin/end tracepoints
  2026-04-24  3:14 ` [PATCH v2] " Bunyod Suvonov
@ 2026-04-24  3:16   ` Shakeel Butt
  0 siblings, 0 replies; 6+ messages in thread
From: Shakeel Butt @ 2026-04-24  3:16 UTC
  To: Bunyod Suvonov, akpm, hannes, rostedt, mhiramat
  Cc: david, mhocko, zhengqi.arch, ljs, mathieu.desnoyers, linux-mm,
	linux-trace-kernel, linux-kernel, Bunyod Suvonov

April 23, 2026 at 8:14 PM, "Bunyod Suvonov" <b.suvonov@sjtu.edu.cn> wrote:


> 
> Vmscan has six main reclaim entry points: try_to_free_pages() for
> direct reclaim, try_to_free_mem_cgroup_pages() for memcg reclaim,
> mem_cgroup_shrink_node() for memcg soft limit reclaim, node_reclaim()
> for node reclaim, shrink_all_memory() for hibernation reclaim, and
> balance_pgdat() for kswapd reclaim.
> 
> All of them, except for shrink_all_memory() and balance_pgdat(), already
> have begin/end tracepoints. This makes it harder to trace which reclaim
> path is responsible for memory reclaim activity, because kswapd reclaim
> cannot be identified as cleanly as other reclaim entry points, even
> though it is the main background reclaim path under memory pressure.
> There may be no need to trace shrink_all_memory() as it is primarily
> used during hibernation. So this patch adds the missing tracepoint pair
> for balance_pgdat().
> 
> The begin tracepoint records the node id, requested reclaim order, and
> the requested classzone bound (highest_zoneidx). The end tracepoint
> records the node id, the reclaim order that balance_pgdat() finished
> with, the requested classzone bound, and nr_reclaimed. Together, they
> show the requested reclaim order and classzone bound, whether reclaim
> fell back to a lower order, and how much reclaim work was done.
> 
> The end tracepoint also records highest_zoneidx even though it does not
> change within a balance_pgdat() invocation. This keeps the end event
> self-contained, so users can analyze reclaim results directly from end
> events without depending on begin/end correlation, which is less
> convenient when tracing is filtered or records are dropped. It also
> makes it straightforward to relate nr_reclaimed and the final reclaim
> order to the requested classzone bound.
> 
> Signed-off-by: Bunyod Suvonov <b.suvonov@sjtu.edu.cn>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>

