* [PATCH 0/2] mm: vmscan: add PID and cgroup ID to vmscan tracepoints
@ 2025-12-08 18:14 Thomas Ballasi
2025-12-08 18:14 ` [PATCH 1/2] mm: vmscan: add cgroup IDs " Thomas Ballasi
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Thomas Ballasi @ 2025-12-08 18:14 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu, Andrew Morton
Cc: linux-mm, linux-trace-kernel
Attributing vmscan tracepoint events to specific processes or cgroups
can be challenging in some scenarios. Adding PIDs and cgroup IDs to
these tracepoints provides the extra context needed for that kind of
analysis.
Signed-off-by: Thomas Ballasi <tballasi@linux.microsoft.com>
Thomas Ballasi (2):
mm: vmscan: add cgroup IDs to vmscan tracepoints
mm: vmscan: add PIDs to vmscan tracepoints
include/trace/events/vmscan.h | 77 +++++++++++++++++++++++------------
mm/vmscan.c | 17 ++++----
2 files changed, 60 insertions(+), 34 deletions(-)
--
2.33.8
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 1/2] mm: vmscan: add cgroup IDs to vmscan tracepoints
2025-12-08 18:14 [PATCH 0/2] mm: vmscan: add PID and cgroup ID to vmscan tracepoints Thomas Ballasi
@ 2025-12-08 18:14 ` Thomas Ballasi
2025-12-08 18:14 ` [PATCH 2/2] mm: vmscan: add PIDs " Thomas Ballasi
2025-12-16 14:02 ` [PATCH v2 0/2] mm: vmscan: add PID and cgroup ID " Thomas Ballasi
2 siblings, 0 replies; 13+ messages in thread
From: Thomas Ballasi @ 2025-12-08 18:14 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu, Andrew Morton
Cc: linux-mm, linux-trace-kernel
Memory reclaim events are currently difficult to attribute to
specific cgroups, making debugging memory pressure issues
challenging. This patch adds memory cgroup ID (memcg_id) to key
vmscan tracepoints to enable better correlation and analysis.
For operations not associated with a specific cgroup, the field
defaults to 0.
Signed-off-by: Thomas Ballasi <tballasi@linux.microsoft.com>
---
include/trace/events/vmscan.h | 65 +++++++++++++++++++++--------------
mm/vmscan.c | 17 ++++-----
2 files changed, 48 insertions(+), 34 deletions(-)
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index d2123dd960d59..afc9f80d03f34 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -114,85 +114,92 @@ TRACE_EVENT(mm_vmscan_wakeup_kswapd,
DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_begin_template,
- TP_PROTO(int order, gfp_t gfp_flags),
+ TP_PROTO(int order, gfp_t gfp_flags, unsigned short memcg_id),
- TP_ARGS(order, gfp_flags),
+ TP_ARGS(order, gfp_flags, memcg_id),
TP_STRUCT__entry(
__field( int, order )
__field( unsigned long, gfp_flags )
+ __field( unsigned short, memcg_id )
),
TP_fast_assign(
__entry->order = order;
__entry->gfp_flags = (__force unsigned long)gfp_flags;
+ __entry->memcg_id = memcg_id;
),
- TP_printk("order=%d gfp_flags=%s",
+ TP_printk("order=%d gfp_flags=%s memcg_id=%u",
__entry->order,
- show_gfp_flags(__entry->gfp_flags))
+ show_gfp_flags(__entry->gfp_flags),
+ __entry->memcg_id)
);
DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_direct_reclaim_begin,
- TP_PROTO(int order, gfp_t gfp_flags),
+ TP_PROTO(int order, gfp_t gfp_flags, unsigned short memcg_id),
- TP_ARGS(order, gfp_flags)
+ TP_ARGS(order, gfp_flags, memcg_id)
);
#ifdef CONFIG_MEMCG
DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_memcg_reclaim_begin,
- TP_PROTO(int order, gfp_t gfp_flags),
+ TP_PROTO(int order, gfp_t gfp_flags, unsigned short memcg_id),
- TP_ARGS(order, gfp_flags)
+ TP_ARGS(order, gfp_flags, memcg_id)
);
DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_memcg_softlimit_reclaim_begin,
- TP_PROTO(int order, gfp_t gfp_flags),
+ TP_PROTO(int order, gfp_t gfp_flags, unsigned short memcg_id),
- TP_ARGS(order, gfp_flags)
+ TP_ARGS(order, gfp_flags, memcg_id)
);
#endif /* CONFIG_MEMCG */
DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_end_template,
- TP_PROTO(unsigned long nr_reclaimed),
+ TP_PROTO(unsigned long nr_reclaimed, unsigned short memcg_id),
- TP_ARGS(nr_reclaimed),
+ TP_ARGS(nr_reclaimed, memcg_id),
TP_STRUCT__entry(
__field( unsigned long, nr_reclaimed )
+ __field( unsigned short, memcg_id )
),
TP_fast_assign(
__entry->nr_reclaimed = nr_reclaimed;
+ __entry->memcg_id = memcg_id;
),
- TP_printk("nr_reclaimed=%lu", __entry->nr_reclaimed)
+ TP_printk("nr_reclaimed=%lu memcg_id=%u",
+ __entry->nr_reclaimed,
+ __entry->memcg_id)
);
DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_direct_reclaim_end,
- TP_PROTO(unsigned long nr_reclaimed),
+ TP_PROTO(unsigned long nr_reclaimed, unsigned short memcg_id),
- TP_ARGS(nr_reclaimed)
+ TP_ARGS(nr_reclaimed, memcg_id)
);
#ifdef CONFIG_MEMCG
DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_reclaim_end,
- TP_PROTO(unsigned long nr_reclaimed),
+ TP_PROTO(unsigned long nr_reclaimed, unsigned short memcg_id),
- TP_ARGS(nr_reclaimed)
+ TP_ARGS(nr_reclaimed, memcg_id)
);
DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_reclaim_end,
- TP_PROTO(unsigned long nr_reclaimed),
+ TP_PROTO(unsigned long nr_reclaimed, unsigned short memcg_id),
- TP_ARGS(nr_reclaimed)
+ TP_ARGS(nr_reclaimed, memcg_id)
);
#endif /* CONFIG_MEMCG */
@@ -209,6 +216,7 @@ TRACE_EVENT(mm_shrink_slab_start,
__field(struct shrinker *, shr)
__field(void *, shrink)
__field(int, nid)
+ __field(unsigned short, memcg_id)
__field(long, nr_objects_to_shrink)
__field(unsigned long, gfp_flags)
__field(unsigned long, cache_items)
@@ -221,6 +229,7 @@ TRACE_EVENT(mm_shrink_slab_start,
__entry->shr = shr;
__entry->shrink = shr->scan_objects;
__entry->nid = sc->nid;
+ __entry->memcg_id = sc->memcg ? mem_cgroup_id(sc->memcg) : 0;
__entry->nr_objects_to_shrink = nr_objects_to_shrink;
__entry->gfp_flags = (__force unsigned long)sc->gfp_mask;
__entry->cache_items = cache_items;
@@ -229,10 +238,11 @@ TRACE_EVENT(mm_shrink_slab_start,
__entry->priority = priority;
),
- TP_printk("%pS %p: nid: %d objects to shrink %ld gfp_flags %s cache items %ld delta %lld total_scan %ld priority %d",
+ TP_printk("%pS %p: nid: %d memcg_id: %u objects to shrink %ld gfp_flags %s cache items %ld delta %lld total_scan %ld priority %d",
__entry->shrink,
__entry->shr,
__entry->nid,
+ __entry->memcg_id,
__entry->nr_objects_to_shrink,
show_gfp_flags(__entry->gfp_flags),
__entry->cache_items,
@@ -242,15 +252,16 @@ TRACE_EVENT(mm_shrink_slab_start,
);
TRACE_EVENT(mm_shrink_slab_end,
- TP_PROTO(struct shrinker *shr, int nid, int shrinker_retval,
+ TP_PROTO(struct shrinker *shr, struct shrink_control *sc, int shrinker_retval,
long unused_scan_cnt, long new_scan_cnt, long total_scan),
- TP_ARGS(shr, nid, shrinker_retval, unused_scan_cnt, new_scan_cnt,
+ TP_ARGS(shr, sc, shrinker_retval, unused_scan_cnt, new_scan_cnt,
total_scan),
TP_STRUCT__entry(
__field(struct shrinker *, shr)
__field(int, nid)
+ __field(unsigned short, memcg_id)
__field(void *, shrink)
__field(long, unused_scan)
__field(long, new_scan)
@@ -260,7 +271,8 @@ TRACE_EVENT(mm_shrink_slab_end,
TP_fast_assign(
__entry->shr = shr;
- __entry->nid = nid;
+ __entry->nid = sc->nid;
+ __entry->memcg_id = sc->memcg ? mem_cgroup_id(sc->memcg) : 0;
__entry->shrink = shr->scan_objects;
__entry->unused_scan = unused_scan_cnt;
__entry->new_scan = new_scan_cnt;
@@ -268,10 +280,11 @@ TRACE_EVENT(mm_shrink_slab_end,
__entry->total_scan = total_scan;
),
- TP_printk("%pS %p: nid: %d unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d",
+ TP_printk("%pS %p: nid: %d memcg_id: %u unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d",
__entry->shrink,
__entry->shr,
__entry->nid,
+ __entry->memcg_id,
__entry->unused_scan,
__entry->new_scan,
__entry->total_scan,
@@ -463,9 +476,9 @@ TRACE_EVENT(mm_vmscan_node_reclaim_begin,
DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_node_reclaim_end,
- TP_PROTO(unsigned long nr_reclaimed),
+ TP_PROTO(unsigned long nr_reclaimed, unsigned short memcg_id),
- TP_ARGS(nr_reclaimed)
+ TP_ARGS(nr_reclaimed, memcg_id)
);
TRACE_EVENT(mm_vmscan_throttled,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 258f5472f1e90..0e65ec3a087a5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -931,7 +931,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
*/
new_nr = add_nr_deferred(next_deferred, shrinker, shrinkctl);
- trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan);
+ trace_mm_shrink_slab_end(shrinker, shrinkctl, freed, nr, new_nr, total_scan);
return freed;
}
@@ -7092,11 +7092,11 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
return 1;
set_task_reclaim_state(current, &sc.reclaim_state);
- trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask);
+ trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask, 0);
nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
- trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
+ trace_mm_vmscan_direct_reclaim_end(nr_reclaimed, 0);
set_task_reclaim_state(current, NULL);
return nr_reclaimed;
@@ -7126,7 +7126,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
trace_mm_vmscan_memcg_softlimit_reclaim_begin(sc.order,
- sc.gfp_mask);
+ sc.gfp_mask,
+ mem_cgroup_id(memcg));
/*
* NOTE: Although we can get the priority field, using it
@@ -7137,7 +7138,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
*/
shrink_lruvec(lruvec, &sc);
- trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
+ trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed, mem_cgroup_id(memcg));
*nr_scanned = sc.nr_scanned;
@@ -7171,13 +7172,13 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
set_task_reclaim_state(current, &sc.reclaim_state);
- trace_mm_vmscan_memcg_reclaim_begin(0, sc.gfp_mask);
+ trace_mm_vmscan_memcg_reclaim_begin(0, sc.gfp_mask, mem_cgroup_id(memcg));
noreclaim_flag = memalloc_noreclaim_save();
nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
memalloc_noreclaim_restore(noreclaim_flag);
- trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
+ trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed, mem_cgroup_id(memcg));
set_task_reclaim_state(current, NULL);
return nr_reclaimed;
@@ -8072,7 +8073,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
fs_reclaim_release(sc.gfp_mask);
psi_memstall_leave(&pflags);
- trace_mm_vmscan_node_reclaim_end(sc.nr_reclaimed);
+ trace_mm_vmscan_node_reclaim_end(sc.nr_reclaimed, 0);
return sc.nr_reclaimed >= nr_pages;
}
--
2.33.8
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 2/2] mm: vmscan: add PIDs to vmscan tracepoints
2025-12-08 18:14 [PATCH 0/2] mm: vmscan: add PID and cgroup ID to vmscan tracepoints Thomas Ballasi
2025-12-08 18:14 ` [PATCH 1/2] mm: vmscan: add cgroup IDs " Thomas Ballasi
@ 2025-12-08 18:14 ` Thomas Ballasi
2025-12-10 3:09 ` Steven Rostedt
2025-12-16 14:02 ` [PATCH v2 0/2] mm: vmscan: add PID and cgroup ID " Thomas Ballasi
2 siblings, 1 reply; 13+ messages in thread
From: Thomas Ballasi @ 2025-12-08 18:14 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu, Andrew Morton
Cc: linux-mm, linux-trace-kernel
This change adds additional tracepoint fields to help attribute
reclaim events to specific processes.
The PID field uses in_task() to detect when the tracepoint fires in
process context, where current->pid can be accessed safely. Outside
process context (such as in interrupt context or an asynchronous RCU
callback), the field is set to -1 as a sentinel value.
Signed-off-by: Thomas Ballasi <tballasi@linux.microsoft.com>
---
include/trace/events/vmscan.h | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index afc9f80d03f34..eddb4e75e2e23 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -121,18 +121,21 @@ DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_begin_template,
TP_STRUCT__entry(
__field( int, order )
__field( unsigned long, gfp_flags )
+ __field( int, pid )
__field( unsigned short, memcg_id )
),
TP_fast_assign(
__entry->order = order;
__entry->gfp_flags = (__force unsigned long)gfp_flags;
+ __entry->pid = in_task() ? current->pid : -1;
__entry->memcg_id = memcg_id;
),
- TP_printk("order=%d gfp_flags=%s memcg_id=%u",
+ TP_printk("order=%d gfp_flags=%s pid=%d memcg_id=%u",
__entry->order,
show_gfp_flags(__entry->gfp_flags),
+ __entry->pid,
__entry->memcg_id)
);
@@ -167,16 +170,19 @@ DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_end_template,
TP_STRUCT__entry(
__field( unsigned long, nr_reclaimed )
+ __field( int, pid )
__field( unsigned short, memcg_id )
),
TP_fast_assign(
__entry->nr_reclaimed = nr_reclaimed;
+ __entry->pid = in_task() ? current->pid : -1;
__entry->memcg_id = memcg_id;
),
- TP_printk("nr_reclaimed=%lu memcg_id=%u",
+ TP_printk("nr_reclaimed=%lu pid=%d memcg_id=%u",
__entry->nr_reclaimed,
+ __entry->pid,
__entry->memcg_id)
);
@@ -216,6 +222,7 @@ TRACE_EVENT(mm_shrink_slab_start,
__field(struct shrinker *, shr)
__field(void *, shrink)
__field(int, nid)
+ __field(int, pid)
__field(unsigned short, memcg_id)
__field(long, nr_objects_to_shrink)
__field(unsigned long, gfp_flags)
@@ -229,6 +236,7 @@ TRACE_EVENT(mm_shrink_slab_start,
__entry->shr = shr;
__entry->shrink = shr->scan_objects;
__entry->nid = sc->nid;
+ __entry->pid = in_task() ? current->pid : -1;
__entry->memcg_id = sc->memcg ? mem_cgroup_id(sc->memcg) : 0;
__entry->nr_objects_to_shrink = nr_objects_to_shrink;
__entry->gfp_flags = (__force unsigned long)sc->gfp_mask;
@@ -238,10 +246,11 @@ TRACE_EVENT(mm_shrink_slab_start,
__entry->priority = priority;
),
- TP_printk("%pS %p: nid: %d memcg_id: %u objects to shrink %ld gfp_flags %s cache items %ld delta %lld total_scan %ld priority %d",
+ TP_printk("%pS %p: nid: %d pid: %d memcg_id: %u objects to shrink %ld gfp_flags %s cache items %ld delta %lld total_scan %ld priority %d",
__entry->shrink,
__entry->shr,
__entry->nid,
+ __entry->pid,
__entry->memcg_id,
__entry->nr_objects_to_shrink,
show_gfp_flags(__entry->gfp_flags),
@@ -261,6 +270,7 @@ TRACE_EVENT(mm_shrink_slab_end,
TP_STRUCT__entry(
__field(struct shrinker *, shr)
__field(int, nid)
+ __field(int, pid)
__field(unsigned short, memcg_id)
__field(void *, shrink)
__field(long, unused_scan)
@@ -272,6 +282,7 @@ TRACE_EVENT(mm_shrink_slab_end,
TP_fast_assign(
__entry->shr = shr;
__entry->nid = sc->nid;
+ __entry->pid = in_task() ? current->pid : -1;
__entry->memcg_id = sc->memcg ? mem_cgroup_id(sc->memcg) : 0;
__entry->shrink = shr->scan_objects;
__entry->unused_scan = unused_scan_cnt;
@@ -280,10 +291,11 @@ TRACE_EVENT(mm_shrink_slab_end,
__entry->total_scan = total_scan;
),
- TP_printk("%pS %p: nid: %d memcg_id: %u unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d",
+ TP_printk("%pS %p: nid: %d pid: %d memcg_id: %u unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d",
__entry->shrink,
__entry->shr,
__entry->nid,
+ __entry->pid,
__entry->memcg_id,
__entry->unused_scan,
__entry->new_scan,
--
2.33.8
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] mm: vmscan: add PIDs to vmscan tracepoints
2025-12-08 18:14 ` [PATCH 2/2] mm: vmscan: add PIDs " Thomas Ballasi
@ 2025-12-10 3:09 ` Steven Rostedt
0 siblings, 0 replies; 13+ messages in thread
From: Steven Rostedt @ 2025-12-10 3:09 UTC (permalink / raw)
To: Thomas Ballasi
Cc: Masami Hiramatsu, Andrew Morton, linux-mm, linux-trace-kernel
On Mon, 8 Dec 2025 10:14:13 -0800
Thomas Ballasi <tballasi@linux.microsoft.com> wrote:
> ---
> include/trace/events/vmscan.h | 20 ++++++++++++++++----
> 1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> index afc9f80d03f34..eddb4e75e2e23 100644
> --- a/include/trace/events/vmscan.h
> +++ b/include/trace/events/vmscan.h
> @@ -121,18 +121,21 @@ DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_begin_template,
> TP_STRUCT__entry(
> __field( int, order )
> __field( unsigned long, gfp_flags )
> + __field( int, pid )
This puts a hole in the ring buffer on 64 bit machines. Please keep pid
next to order as they are both 'int' and not have an "unsigned long"
between the two.
> __field( unsigned short, memcg_id )
> ),
-- Steve
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v2 0/2] mm: vmscan: add PID and cgroup ID to vmscan tracepoints
2025-12-08 18:14 [PATCH 0/2] mm: vmscan: add PID and cgroup ID to vmscan tracepoints Thomas Ballasi
2025-12-08 18:14 ` [PATCH 1/2] mm: vmscan: add cgroup IDs " Thomas Ballasi
2025-12-08 18:14 ` [PATCH 2/2] mm: vmscan: add PIDs " Thomas Ballasi
@ 2025-12-16 14:02 ` Thomas Ballasi
2025-12-16 14:02 ` [PATCH v2 1/2] mm: vmscan: add cgroup IDs " Thomas Ballasi
2025-12-16 14:02 ` [PATCH v2 2/2] mm: vmscan: add PIDs " Thomas Ballasi
2 siblings, 2 replies; 13+ messages in thread
From: Thomas Ballasi @ 2025-12-16 14:02 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu, Andrew Morton
Cc: linux-mm, linux-trace-kernel
Changes in v2:
- Swapped field entries to prevent a hole in the ring buffer
Link to v1:
https://lore.kernel.org/linux-trace-kernel/20251208181413.4722-1-tballasi@linux.microsoft.com/
Signed-off-by: Thomas Ballasi <tballasi@linux.microsoft.com>
Thomas Ballasi (2):
mm: vmscan: add cgroup IDs to vmscan tracepoints
mm: vmscan: add PIDs to vmscan tracepoints
include/trace/events/vmscan.h | 77 +++++++++++++++++++++++------------
mm/vmscan.c | 17 ++++----
2 files changed, 60 insertions(+), 34 deletions(-)
--
2.33.8
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v2 1/2] mm: vmscan: add cgroup IDs to vmscan tracepoints
2025-12-16 14:02 ` [PATCH v2 0/2] mm: vmscan: add PID and cgroup ID " Thomas Ballasi
@ 2025-12-16 14:02 ` Thomas Ballasi
2025-12-16 18:50 ` Shakeel Butt
2025-12-17 22:21 ` Steven Rostedt
2025-12-16 14:02 ` [PATCH v2 2/2] mm: vmscan: add PIDs " Thomas Ballasi
1 sibling, 2 replies; 13+ messages in thread
From: Thomas Ballasi @ 2025-12-16 14:02 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu, Andrew Morton
Cc: linux-mm, linux-trace-kernel
Memory reclaim events are currently difficult to attribute to
specific cgroups, making debugging memory pressure issues
challenging. This patch adds memory cgroup ID (memcg_id) to key
vmscan tracepoints to enable better correlation and analysis.
For operations not associated with a specific cgroup, the field
defaults to 0.
Signed-off-by: Thomas Ballasi <tballasi@linux.microsoft.com>
---
include/trace/events/vmscan.h | 65 +++++++++++++++++++++--------------
mm/vmscan.c | 17 ++++-----
2 files changed, 48 insertions(+), 34 deletions(-)
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index d2123dd960d59..afc9f80d03f34 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -114,85 +114,92 @@ TRACE_EVENT(mm_vmscan_wakeup_kswapd,
DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_begin_template,
- TP_PROTO(int order, gfp_t gfp_flags),
+ TP_PROTO(int order, gfp_t gfp_flags, unsigned short memcg_id),
- TP_ARGS(order, gfp_flags),
+ TP_ARGS(order, gfp_flags, memcg_id),
TP_STRUCT__entry(
__field( int, order )
__field( unsigned long, gfp_flags )
+ __field( unsigned short, memcg_id )
),
TP_fast_assign(
__entry->order = order;
__entry->gfp_flags = (__force unsigned long)gfp_flags;
+ __entry->memcg_id = memcg_id;
),
- TP_printk("order=%d gfp_flags=%s",
+ TP_printk("order=%d gfp_flags=%s memcg_id=%u",
__entry->order,
- show_gfp_flags(__entry->gfp_flags))
+ show_gfp_flags(__entry->gfp_flags),
+ __entry->memcg_id)
);
DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_direct_reclaim_begin,
- TP_PROTO(int order, gfp_t gfp_flags),
+ TP_PROTO(int order, gfp_t gfp_flags, unsigned short memcg_id),
- TP_ARGS(order, gfp_flags)
+ TP_ARGS(order, gfp_flags, memcg_id)
);
#ifdef CONFIG_MEMCG
DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_memcg_reclaim_begin,
- TP_PROTO(int order, gfp_t gfp_flags),
+ TP_PROTO(int order, gfp_t gfp_flags, unsigned short memcg_id),
- TP_ARGS(order, gfp_flags)
+ TP_ARGS(order, gfp_flags, memcg_id)
);
DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_memcg_softlimit_reclaim_begin,
- TP_PROTO(int order, gfp_t gfp_flags),
+ TP_PROTO(int order, gfp_t gfp_flags, unsigned short memcg_id),
- TP_ARGS(order, gfp_flags)
+ TP_ARGS(order, gfp_flags, memcg_id)
);
#endif /* CONFIG_MEMCG */
DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_end_template,
- TP_PROTO(unsigned long nr_reclaimed),
+ TP_PROTO(unsigned long nr_reclaimed, unsigned short memcg_id),
- TP_ARGS(nr_reclaimed),
+ TP_ARGS(nr_reclaimed, memcg_id),
TP_STRUCT__entry(
__field( unsigned long, nr_reclaimed )
+ __field( unsigned short, memcg_id )
),
TP_fast_assign(
__entry->nr_reclaimed = nr_reclaimed;
+ __entry->memcg_id = memcg_id;
),
- TP_printk("nr_reclaimed=%lu", __entry->nr_reclaimed)
+ TP_printk("nr_reclaimed=%lu memcg_id=%u",
+ __entry->nr_reclaimed,
+ __entry->memcg_id)
);
DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_direct_reclaim_end,
- TP_PROTO(unsigned long nr_reclaimed),
+ TP_PROTO(unsigned long nr_reclaimed, unsigned short memcg_id),
- TP_ARGS(nr_reclaimed)
+ TP_ARGS(nr_reclaimed, memcg_id)
);
#ifdef CONFIG_MEMCG
DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_reclaim_end,
- TP_PROTO(unsigned long nr_reclaimed),
+ TP_PROTO(unsigned long nr_reclaimed, unsigned short memcg_id),
- TP_ARGS(nr_reclaimed)
+ TP_ARGS(nr_reclaimed, memcg_id)
);
DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_reclaim_end,
- TP_PROTO(unsigned long nr_reclaimed),
+ TP_PROTO(unsigned long nr_reclaimed, unsigned short memcg_id),
- TP_ARGS(nr_reclaimed)
+ TP_ARGS(nr_reclaimed, memcg_id)
);
#endif /* CONFIG_MEMCG */
@@ -209,6 +216,7 @@ TRACE_EVENT(mm_shrink_slab_start,
__field(struct shrinker *, shr)
__field(void *, shrink)
__field(int, nid)
+ __field(unsigned short, memcg_id)
__field(long, nr_objects_to_shrink)
__field(unsigned long, gfp_flags)
__field(unsigned long, cache_items)
@@ -221,6 +229,7 @@ TRACE_EVENT(mm_shrink_slab_start,
__entry->shr = shr;
__entry->shrink = shr->scan_objects;
__entry->nid = sc->nid;
+ __entry->memcg_id = sc->memcg ? mem_cgroup_id(sc->memcg) : 0;
__entry->nr_objects_to_shrink = nr_objects_to_shrink;
__entry->gfp_flags = (__force unsigned long)sc->gfp_mask;
__entry->cache_items = cache_items;
@@ -229,10 +238,11 @@ TRACE_EVENT(mm_shrink_slab_start,
__entry->priority = priority;
),
- TP_printk("%pS %p: nid: %d objects to shrink %ld gfp_flags %s cache items %ld delta %lld total_scan %ld priority %d",
+ TP_printk("%pS %p: nid: %d memcg_id: %u objects to shrink %ld gfp_flags %s cache items %ld delta %lld total_scan %ld priority %d",
__entry->shrink,
__entry->shr,
__entry->nid,
+ __entry->memcg_id,
__entry->nr_objects_to_shrink,
show_gfp_flags(__entry->gfp_flags),
__entry->cache_items,
@@ -242,15 +252,16 @@ TRACE_EVENT(mm_shrink_slab_start,
);
TRACE_EVENT(mm_shrink_slab_end,
- TP_PROTO(struct shrinker *shr, int nid, int shrinker_retval,
+ TP_PROTO(struct shrinker *shr, struct shrink_control *sc, int shrinker_retval,
long unused_scan_cnt, long new_scan_cnt, long total_scan),
- TP_ARGS(shr, nid, shrinker_retval, unused_scan_cnt, new_scan_cnt,
+ TP_ARGS(shr, sc, shrinker_retval, unused_scan_cnt, new_scan_cnt,
total_scan),
TP_STRUCT__entry(
__field(struct shrinker *, shr)
__field(int, nid)
+ __field(unsigned short, memcg_id)
__field(void *, shrink)
__field(long, unused_scan)
__field(long, new_scan)
@@ -260,7 +271,8 @@ TRACE_EVENT(mm_shrink_slab_end,
TP_fast_assign(
__entry->shr = shr;
- __entry->nid = nid;
+ __entry->nid = sc->nid;
+ __entry->memcg_id = sc->memcg ? mem_cgroup_id(sc->memcg) : 0;
__entry->shrink = shr->scan_objects;
__entry->unused_scan = unused_scan_cnt;
__entry->new_scan = new_scan_cnt;
@@ -268,10 +280,11 @@ TRACE_EVENT(mm_shrink_slab_end,
__entry->total_scan = total_scan;
),
- TP_printk("%pS %p: nid: %d unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d",
+ TP_printk("%pS %p: nid: %d memcg_id: %u unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d",
__entry->shrink,
__entry->shr,
__entry->nid,
+ __entry->memcg_id,
__entry->unused_scan,
__entry->new_scan,
__entry->total_scan,
@@ -463,9 +476,9 @@ TRACE_EVENT(mm_vmscan_node_reclaim_begin,
DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_node_reclaim_end,
- TP_PROTO(unsigned long nr_reclaimed),
+ TP_PROTO(unsigned long nr_reclaimed, unsigned short memcg_id),
- TP_ARGS(nr_reclaimed)
+ TP_ARGS(nr_reclaimed, memcg_id)
);
TRACE_EVENT(mm_vmscan_throttled,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 258f5472f1e90..0e65ec3a087a5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -931,7 +931,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
*/
new_nr = add_nr_deferred(next_deferred, shrinker, shrinkctl);
- trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan);
+ trace_mm_shrink_slab_end(shrinker, shrinkctl, freed, nr, new_nr, total_scan);
return freed;
}
@@ -7092,11 +7092,11 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
return 1;
set_task_reclaim_state(current, &sc.reclaim_state);
- trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask);
+ trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask, 0);
nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
- trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
+ trace_mm_vmscan_direct_reclaim_end(nr_reclaimed, 0);
set_task_reclaim_state(current, NULL);
return nr_reclaimed;
@@ -7126,7 +7126,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
trace_mm_vmscan_memcg_softlimit_reclaim_begin(sc.order,
- sc.gfp_mask);
+ sc.gfp_mask,
+ mem_cgroup_id(memcg));
/*
* NOTE: Although we can get the priority field, using it
@@ -7137,7 +7138,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
*/
shrink_lruvec(lruvec, &sc);
- trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
+ trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed, mem_cgroup_id(memcg));
*nr_scanned = sc.nr_scanned;
@@ -7171,13 +7172,13 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
set_task_reclaim_state(current, &sc.reclaim_state);
- trace_mm_vmscan_memcg_reclaim_begin(0, sc.gfp_mask);
+ trace_mm_vmscan_memcg_reclaim_begin(0, sc.gfp_mask, mem_cgroup_id(memcg));
noreclaim_flag = memalloc_noreclaim_save();
nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
memalloc_noreclaim_restore(noreclaim_flag);
- trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
+ trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed, mem_cgroup_id(memcg));
set_task_reclaim_state(current, NULL);
return nr_reclaimed;
@@ -8072,7 +8073,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
fs_reclaim_release(sc.gfp_mask);
psi_memstall_leave(&pflags);
- trace_mm_vmscan_node_reclaim_end(sc.nr_reclaimed);
+ trace_mm_vmscan_node_reclaim_end(sc.nr_reclaimed, 0);
return sc.nr_reclaimed >= nr_pages;
}
--
2.33.8
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v2 2/2] mm: vmscan: add PIDs to vmscan tracepoints
2025-12-16 14:02 ` [PATCH v2 0/2] mm: vmscan: add PID and cgroup ID " Thomas Ballasi
2025-12-16 14:02 ` [PATCH v2 1/2] mm: vmscan: add cgroup IDs " Thomas Ballasi
@ 2025-12-16 14:02 ` Thomas Ballasi
2025-12-16 18:03 ` Steven Rostedt
1 sibling, 1 reply; 13+ messages in thread
From: Thomas Ballasi @ 2025-12-16 14:02 UTC (permalink / raw)
To: Steven Rostedt, Masami Hiramatsu, Andrew Morton
Cc: linux-mm, linux-trace-kernel
This change adds additional tracepoint fields to help attribute
reclaim events to specific processes.
The PID field uses in_task() to detect when the tracepoint fires in
process context, where current->pid can be accessed safely. Outside
process context (such as in interrupt context or an asynchronous RCU
callback), the field is set to -1 as a sentinel value.
Signed-off-by: Thomas Ballasi <tballasi@linux.microsoft.com>
---
include/trace/events/vmscan.h | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index afc9f80d03f34..315725f30b504 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -120,19 +120,22 @@ DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_begin_template,
TP_STRUCT__entry(
__field( int, order )
+ __field( int, pid )
__field( unsigned long, gfp_flags )
__field( unsigned short, memcg_id )
),
TP_fast_assign(
__entry->order = order;
+ __entry->pid = in_task() ? current->pid : -1;
__entry->gfp_flags = (__force unsigned long)gfp_flags;
__entry->memcg_id = memcg_id;
),
- TP_printk("order=%d gfp_flags=%s memcg_id=%u",
+ TP_printk("order=%d gfp_flags=%s pid=%d memcg_id=%u",
__entry->order,
show_gfp_flags(__entry->gfp_flags),
+ __entry->pid,
__entry->memcg_id)
);
@@ -167,16 +170,19 @@ DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_end_template,
TP_STRUCT__entry(
__field( unsigned long, nr_reclaimed )
+ __field( int, pid )
__field( unsigned short, memcg_id )
),
TP_fast_assign(
__entry->nr_reclaimed = nr_reclaimed;
+ __entry->pid = in_task() ? current->pid : -1;
__entry->memcg_id = memcg_id;
),
- TP_printk("nr_reclaimed=%lu memcg_id=%u",
+ TP_printk("nr_reclaimed=%lu pid=%d memcg_id=%u",
__entry->nr_reclaimed,
+ __entry->pid,
__entry->memcg_id)
);
@@ -216,6 +222,7 @@ TRACE_EVENT(mm_shrink_slab_start,
__field(struct shrinker *, shr)
__field(void *, shrink)
__field(int, nid)
+ __field(int, pid)
__field(unsigned short, memcg_id)
__field(long, nr_objects_to_shrink)
__field(unsigned long, gfp_flags)
@@ -229,6 +236,7 @@ TRACE_EVENT(mm_shrink_slab_start,
__entry->shr = shr;
__entry->shrink = shr->scan_objects;
__entry->nid = sc->nid;
+ __entry->pid = in_task() ? current->pid : -1;
__entry->memcg_id = sc->memcg ? mem_cgroup_id(sc->memcg) : 0;
__entry->nr_objects_to_shrink = nr_objects_to_shrink;
__entry->gfp_flags = (__force unsigned long)sc->gfp_mask;
@@ -238,10 +246,11 @@ TRACE_EVENT(mm_shrink_slab_start,
__entry->priority = priority;
),
- TP_printk("%pS %p: nid: %d memcg_id: %u objects to shrink %ld gfp_flags %s cache items %ld delta %lld total_scan %ld priority %d",
+ TP_printk("%pS %p: nid: %d pid: %d memcg_id: %u objects to shrink %ld gfp_flags %s cache items %ld delta %lld total_scan %ld priority %d",
__entry->shrink,
__entry->shr,
__entry->nid,
+ __entry->pid,
__entry->memcg_id,
__entry->nr_objects_to_shrink,
show_gfp_flags(__entry->gfp_flags),
@@ -261,6 +270,7 @@ TRACE_EVENT(mm_shrink_slab_end,
TP_STRUCT__entry(
__field(struct shrinker *, shr)
__field(int, nid)
+ __field(int, pid)
__field(unsigned short, memcg_id)
__field(void *, shrink)
__field(long, unused_scan)
@@ -272,6 +282,7 @@ TRACE_EVENT(mm_shrink_slab_end,
TP_fast_assign(
__entry->shr = shr;
__entry->nid = sc->nid;
+ __entry->pid = in_task() ? current->pid : -1;
__entry->memcg_id = sc->memcg ? mem_cgroup_id(sc->memcg) : 0;
__entry->shrink = shr->scan_objects;
__entry->unused_scan = unused_scan_cnt;
@@ -280,10 +291,11 @@ TRACE_EVENT(mm_shrink_slab_end,
__entry->total_scan = total_scan;
),
- TP_printk("%pS %p: nid: %d memcg_id: %u unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d",
+ TP_printk("%pS %p: nid: %d pid: %d memcg_id: %u unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d",
__entry->shrink,
__entry->shr,
__entry->nid,
+ __entry->pid,
__entry->memcg_id,
__entry->unused_scan,
__entry->new_scan,
--
2.33.8
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v2 2/2] mm: vmscan: add PIDs to vmscan tracepoints
2025-12-16 14:02 ` [PATCH v2 2/2] mm: vmscan: add PIDs " Thomas Ballasi
@ 2025-12-16 18:03 ` Steven Rostedt
2025-12-29 10:54 ` Thomas Ballasi
0 siblings, 1 reply; 13+ messages in thread
From: Steven Rostedt @ 2025-12-16 18:03 UTC (permalink / raw)
To: Thomas Ballasi
Cc: Masami Hiramatsu, Andrew Morton, linux-mm, linux-trace-kernel
On Tue, 16 Dec 2025 06:02:52 -0800
Thomas Ballasi <tballasi@linux.microsoft.com> wrote:
> The changes aim at adding additional tracepoint variables to help
> debuggers attribute them to specific processes.
>
> The PID field uses in_task() to reliably detect when we're in process
> context and can safely access current->pid. When not in process
> context (such as in interrupt or in an asynchronous RCU context), the
> field is set to -1 as a sentinel value.
>
> Signed-off-by: Thomas Ballasi <tballasi@linux.microsoft.com>
Is this really needed? The trace events already show if you are in
interrupt context or not.
# tracer: nop
#
# entries-in-buffer/entries-written: 25817/25817 #P:8
#
# _-----=> irqs-off/BH-disabled
# / _----=> need-resched
# | / _---=> hardirq/softirq <<<<------ Shows irq context
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
<idle>-0 [002] d..1. 11429.293552: rcu_watching: Startirq 0 1 0x74c
<idle>-0 [000] d.H1. 11429.293564: rcu_utilization: Start scheduler-tick
<idle>-0 [000] d.H1. 11429.293566: rcu_utilization: End scheduler-tick
<idle>-0 [002] dN.1. 11429.293567: rcu_watching: Endirq 1 0 0x74c
<idle>-0 [002] dN.1. 11429.293568: rcu_watching: Start 0 1 0x754
<idle>-0 [000] d.s1. 11429.293577: rcu_watching: --= 3 1 0xdf4
<idle>-0 [002] dN.1. 11429.293579: rcu_utilization: Start context switch
<idle>-0 [002] dN.1. 11429.293580: rcu_utilization: End context switch
rcu_sched-15 [002] d..1. 11429.293589: rcu_grace_period: rcu_sched 132685 start
<idle>-0 [000] dN.1. 11429.293592: rcu_watching: Endirq 1 0 0xdf4
rcu_sched-15 [002] d..1. 11429.293592: rcu_grace_period: rcu_sched 132685 cpustart
rcu_sched-15 [002] d..1. 11429.293592: rcu_grace_period_init: rcu_sched 132685 0 0 7 ff
<idle>-0 [000] dN.1. 11429.293593: rcu_watching: Start 0 1 0xdfc
Thus, you can already tell if you are in interrupt context or not, and you
always get the current pid. The 'H', 'h' or 's' means you are in an
interrupt-type context ('H' for a hard interrupt interrupting a softirq, 'h'
for just a hard interrupt, and 's' for a softirq).
What's the point of adding another field to cover the same information
that's already available?
-- Steve
* Re: [PATCH v2 1/2] mm: vmscan: add cgroup IDs to vmscan tracepoints
2025-12-16 14:02 ` [PATCH v2 1/2] mm: vmscan: add cgroup IDs " Thomas Ballasi
@ 2025-12-16 18:50 ` Shakeel Butt
2025-12-17 22:21 ` Steven Rostedt
1 sibling, 0 replies; 13+ messages in thread
From: Shakeel Butt @ 2025-12-16 18:50 UTC (permalink / raw)
To: Thomas Ballasi
Cc: Steven Rostedt, Masami Hiramatsu, Andrew Morton, linux-mm,
linux-trace-kernel
On Tue, Dec 16, 2025 at 06:02:51AM -0800, Thomas Ballasi wrote:
> Memory reclaim events are currently difficult to attribute to
> specific cgroups, making debugging memory pressure issues
> challenging. This patch adds memory cgroup ID (memcg_id) to key
> vmscan tracepoints to enable better correlation and analysis.
>
> For operations not associated with a specific cgroup, the field
> is defaulted to 0.
>
> Signed-off-by: Thomas Ballasi <tballasi@linux.microsoft.com>
> ---
...
> + __entry->memcg_id = sc->memcg ? mem_cgroup_id(sc->memcg) : 0;
...
> + __entry->memcg_id = sc->memcg ? mem_cgroup_id(sc->memcg) : 0;
...
>
...
> trace_mm_vmscan_memcg_softlimit_reclaim_begin(sc.order,
> - sc.gfp_mask);
> + sc.gfp_mask,
> + mem_cgroup_id(memcg));
>
...
> + trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed, mem_cgroup_id(memcg));
...
> + trace_mm_vmscan_memcg_reclaim_begin(0, sc.gfp_mask, mem_cgroup_id(memcg));
...
> + trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed, mem_cgroup_id(memcg));
Please don't use mem_cgroup_id() here as it is an ID internal to memcg.
Use cgroup_id(memcg->css.cgroup) instead which is inode number and is
exposed to the userspace.
* Re: [PATCH v2 1/2] mm: vmscan: add cgroup IDs to vmscan tracepoints
2025-12-16 14:02 ` [PATCH v2 1/2] mm: vmscan: add cgroup IDs " Thomas Ballasi
2025-12-16 18:50 ` Shakeel Butt
@ 2025-12-17 22:21 ` Steven Rostedt
1 sibling, 0 replies; 13+ messages in thread
From: Steven Rostedt @ 2025-12-17 22:21 UTC (permalink / raw)
To: Thomas Ballasi
Cc: Masami Hiramatsu, Andrew Morton, linux-mm, linux-trace-kernel
On Tue, 16 Dec 2025 06:02:51 -0800
Thomas Ballasi <tballasi@linux.microsoft.com> wrote:
> ---
> include/trace/events/vmscan.h | 65 +++++++++++++++++++++--------------
> mm/vmscan.c | 17 ++++-----
> 2 files changed, 48 insertions(+), 34 deletions(-)
>
> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> index d2123dd960d59..afc9f80d03f34 100644
> --- a/include/trace/events/vmscan.h
> +++ b/include/trace/events/vmscan.h
> @@ -114,85 +114,92 @@ TRACE_EVENT(mm_vmscan_wakeup_kswapd,
>
> DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_begin_template,
>
> - TP_PROTO(int order, gfp_t gfp_flags),
> + TP_PROTO(int order, gfp_t gfp_flags, unsigned short memcg_id),
>
> - TP_ARGS(order, gfp_flags),
> + TP_ARGS(order, gfp_flags, memcg_id),
>
> TP_STRUCT__entry(
> __field( int, order )
> __field( unsigned long, gfp_flags )
> + __field( unsigned short, memcg_id )
> ),
Hmm, the above adds some holes. Note, events are at a minimum 4-byte
aligned. On 64-bit, they can be 8-byte aligned. Still, the above is the same as:
struct {
int order;
unsigned long gfp_flags;
unsigned short memcg_id;
};
See the issue? Perhaps it would be better to add the memcg_id between the
order and gfp_flags?
-- Steve
* Re: [PATCH v2 2/2] mm: vmscan: add PIDs to vmscan tracepoints
2025-12-16 18:03 ` Steven Rostedt
@ 2025-12-29 10:54 ` Thomas Ballasi
2025-12-29 18:29 ` Steven Rostedt
0 siblings, 1 reply; 13+ messages in thread
From: Thomas Ballasi @ 2025-12-29 10:54 UTC (permalink / raw)
To: rostedt; +Cc: akpm, linux-mm, linux-trace-kernel, mhiramat, tballasi
On Tue, Dec 16, 2025 at 01:03:02PM -0500, Steven Rostedt wrote:
> On Tue, 16 Dec 2025 06:02:52 -0800
> Thomas Ballasi <tballasi@linux.microsoft.com> wrote:
>
> > The changes aim at adding additional tracepoint variables to help
> > debuggers attribute them to specific processes.
> >
> > The PID field uses in_task() to reliably detect when we're in process
> > context and can safely access current->pid. When not in process
> > context (such as in interrupt or in an asynchronous RCU context), the
> > field is set to -1 as a sentinel value.
> >
> > Signed-off-by: Thomas Ballasi <tballasi@linux.microsoft.com>
>
> Is this really needed? The trace events already show if you are in
> interrupt context or not.
>
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 25817/25817 #P:8
> #
> # _-----=> irqs-off/BH-disabled
> # / _----=> need-resched
> # | / _---=> hardirq/softirq <<<<------ Shows irq context
> # || / _--=> preempt-depth
> # ||| / _-=> migrate-disable
> # |||| / delay
> # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
> # | | | ||||| | |
> <idle>-0 [002] d..1. 11429.293552: rcu_watching: Startirq 0 1 0x74c
> <idle>-0 [000] d.H1. 11429.293564: rcu_utilization: Start scheduler-tick
> <idle>-0 [000] d.H1. 11429.293566: rcu_utilization: End scheduler-tick
> <idle>-0 [002] dN.1. 11429.293567: rcu_watching: Endirq 1 0 0x74c
> <idle>-0 [002] dN.1. 11429.293568: rcu_watching: Start 0 1 0x754
> <idle>-0 [000] d.s1. 11429.293577: rcu_watching: --= 3 1 0xdf4
> <idle>-0 [002] dN.1. 11429.293579: rcu_utilization: Start context switch
> <idle>-0 [002] dN.1. 11429.293580: rcu_utilization: End context switch
> rcu_sched-15 [002] d..1. 11429.293589: rcu_grace_period: rcu_sched 132685 start
> <idle>-0 [000] dN.1. 11429.293592: rcu_watching: Endirq 1 0 0xdf4
> rcu_sched-15 [002] d..1. 11429.293592: rcu_grace_period: rcu_sched 132685 cpustart
> rcu_sched-15 [002] d..1. 11429.293592: rcu_grace_period_init: rcu_sched 132685 0 0 7 ff
> <idle>-0 [000] dN.1. 11429.293593: rcu_watching: Start 0 1 0xdfc
>
> Thus, you can already tell if you are in interrupt context or not, and you
> always get the current pid. The 'H', 'h' or 's' means you are in an
> interrupt-type context ('H' for a hard interrupt interrupting a softirq, 'h'
> for just a hard interrupt, and 's' for a softirq).
>
> What's the point of adding another field to cover the same information
> that's already available?
>
> -- Steve
(re-sending the reply as I believe I missed the reply all)
It indeed shows whether or not we're in an IRQ, but I believe the
kernel shouldn't show erroneous debugging values. Even though it can be
obvious that we're in an interrupt, some people might look directly at
the garbage PID value without a second thought and take it for
granted. On the other hand, it takes just a small check to mark the
debugging information as clearly invalid, which complements the IRQ
context flag.
If we shouldn't put that check there, I'd happily remove it, but I'd
tend to think it's a trivial addition that can only be for the best.
Thomas
* Re: [PATCH v2 2/2] mm: vmscan: add PIDs to vmscan tracepoints
2025-12-29 10:54 ` Thomas Ballasi
@ 2025-12-29 18:29 ` Steven Rostedt
2025-12-29 21:36 ` Steven Rostedt
0 siblings, 1 reply; 13+ messages in thread
From: Steven Rostedt @ 2025-12-29 18:29 UTC (permalink / raw)
To: Thomas Ballasi; +Cc: akpm, linux-mm, linux-trace-kernel, mhiramat
On Mon, 29 Dec 2025 02:54:27 -0800
Thomas Ballasi <tballasi@linux.microsoft.com> wrote:
> It indeed shows whether or not we're in an IRQ, but I believe the
> kernel shouldn't show erroneous debugging values. Even though it can be
> obvious that we're in an interrupt, some people might look directly at
> the garbage PID value without a second thought and take it for
> granted. On the other hand, it takes just a small check to mark the
> debugging information as clearly invalid, which complements the IRQ
> context flag.
>
> If we shouldn't put that check there, I'd happily remove it, but I'd
> tend to think it's a trivial addition that can only be for the best.
I just don't like wasting valuable ring buffer space for something that can
be easily determined without it.
How about this. I just wrote up this patch, and it could be something you
use. I tested it against the sched waking events, by adding:
__entry->target_cpu = task_cpu(p);
),
- TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d",
+ TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d %s",
__entry->comm, __entry->pid, __entry->prio,
- __entry->target_cpu)
+ __entry->target_cpu,
+ __event_in_irq() ? "(in-irq)" : "")
);
Which produces:
<idle>-0 [003] d.h4. 44.832126: sched_waking: comm=in:imklog pid=619 prio=120 target_cpu=006 (in-irq)
<idle>-0 [003] d.s3. 44.832180: sched_waking: comm=rcu_preempt pid=15 prio=120 target_cpu=001 (in-irq)
in:imklog-619 [006] d..2. 44.832393: sched_waking: comm=rs:main Q:Reg pid=620 prio=120 target_cpu=003
You can see it adds "(in-irq)" when the event is executed from IRQ context
(soft or hard irq). But I also added __event_in_hardirq() and
__event_in_softirq() if you wanted to distinguish them.
Now you don't need to update what goes into the ring buffer (and waste its
space), but only update the output format that makes it obvious that the
task was in interrupt context or not.
I also used trace-cmd to record the events, and it still parses properly
with no updates to libtraceevent needed.
Would this work for you?
Below is the patch that allows for this:
-- Steve
diff --git a/include/trace/stages/stage3_trace_output.h b/include/trace/stages/stage3_trace_output.h
index 1e7b0bef95f5..53a23988a3b8 100644
--- a/include/trace/stages/stage3_trace_output.h
+++ b/include/trace/stages/stage3_trace_output.h
@@ -150,3 +150,11 @@
#undef __get_buf
#define __get_buf(len) trace_seq_acquire(p, (len))
+
+#undef __event_in_hardirq
+#undef __event_in_softirq
+#undef __event_in_irq
+
+#define __event_in_hardirq() (__entry->ent.flags & TRACE_FLAG_HARDIRQ)
+#define __event_in_softirq() (__entry->ent.flags & TRACE_FLAG_SOFTIRQ)
+#define __event_in_irq() (__entry->ent.flags & (TRACE_FLAG_HARDIRQ | TRACE_FLAG_SOFTIRQ))
diff --git a/include/trace/stages/stage7_class_define.h b/include/trace/stages/stage7_class_define.h
index fcd564a590f4..47008897a795 100644
--- a/include/trace/stages/stage7_class_define.h
+++ b/include/trace/stages/stage7_class_define.h
@@ -26,6 +26,25 @@
#undef __print_hex_dump
#undef __get_buf
+#undef __event_in_hardirq
+#undef __event_in_softirq
+#undef __event_in_irq
+
+/*
+ * The TRACE_FLAG_* are enums. Instead of using TRACE_DEFINE_ENUM(),
+ * use their hardcoded values. These values are parsed by user space
+ * tooling elsewhere so they will never change.
+ *
+ * See "enum trace_flag_type" in linux/trace_events.h:
+ * TRACE_FLAG_HARDIRQ
+ * TRACE_FLAG_SOFTIRQ
+ */
+
+/* This is what is displayed in the format files */
+#define __event_in_hardirq() (REC->common_flags & 0x8)
+#define __event_in_softirq() (REC->common_flags & 0x10)
+#define __event_in_irq() (REC->common_flags & 0x18)
+
/*
* The below is not executed in the kernel. It is only what is
* displayed in the print format for userspace to parse.
* Re: [PATCH v2 2/2] mm: vmscan: add PIDs to vmscan tracepoints
2025-12-29 18:29 ` Steven Rostedt
@ 2025-12-29 21:36 ` Steven Rostedt
0 siblings, 0 replies; 13+ messages in thread
From: Steven Rostedt @ 2025-12-29 21:36 UTC (permalink / raw)
To: Thomas Ballasi; +Cc: akpm, linux-mm, linux-trace-kernel, mhiramat
On Mon, 29 Dec 2025 13:29:42 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> I just don't like wasting valuable ring buffer space for something that can
> be easily determined without it.
>
> How about this. I just wrote up this patch, and it could be something you
> use. I tested it against the sched waking events, by adding:
>
> __entry->target_cpu = task_cpu(p);
> ),
>
> - TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d",
> + TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d %s",
> __entry->comm, __entry->pid, __entry->prio,
> - __entry->target_cpu)
> + __entry->target_cpu,
> + __event_in_irq() ? "(in-irq)" : "")
> );
>
> Which produces:
>
> <idle>-0 [003] d.h4. 44.832126: sched_waking: comm=in:imklog pid=619 prio=120 target_cpu=006 (in-irq)
> <idle>-0 [003] d.s3. 44.832180: sched_waking: comm=rcu_preempt pid=15 prio=120 target_cpu=001 (in-irq)
> in:imklog-619 [006] d..2. 44.832393: sched_waking: comm=rs:main Q:Reg pid=620 prio=120 target_cpu=003
>
> You can see it adds "(in-irq)" when the event is executed from IRQ context
> (soft or hard irq). But I also added __event_in_hardirq() and
> __event_in_softirq() if you wanted to distinguish them.
>
> Now you don't need to update what goes into the ring buffer (and waste its
> space), but only update the output format that makes it obvious that the
> task was in interrupt context or not.
>
> I also used trace-cmd to record the events, and it still parses properly
> with no updates to libtraceevent needed.
>
> Would this work for you?
If this would work for you, feel free to take the patch I posted and use that:
https://lore.kernel.org/all/20251229163515.3d1b0bba@gandalf.local.home/
-- Steve
end of thread, other threads:[~2025-12-29 21:36 UTC | newest]
Thread overview: 13+ messages
2025-12-08 18:14 [PATCH 0/2] mm: vmscan: add PID and cgroup ID to vmscan tracepoints Thomas Ballasi
2025-12-08 18:14 ` [PATCH 1/2] mm: vmscan: add cgroup IDs " Thomas Ballasi
2025-12-08 18:14 ` [PATCH 2/2] mm: vmscan: add PIDs " Thomas Ballasi
2025-12-10 3:09 ` Steven Rostedt
2025-12-16 14:02 ` [PATCH v2 0/2] mm: vmscan: add PID and cgroup ID " Thomas Ballasi
2025-12-16 14:02 ` [PATCH v2 1/2] mm: vmscan: add cgroup IDs " Thomas Ballasi
2025-12-16 18:50 ` Shakeel Butt
2025-12-17 22:21 ` Steven Rostedt
2025-12-16 14:02 ` [PATCH v2 2/2] mm: vmscan: add PIDs " Thomas Ballasi
2025-12-16 18:03 ` Steven Rostedt
2025-12-29 10:54 ` Thomas Ballasi
2025-12-29 18:29 ` Steven Rostedt
2025-12-29 21:36 ` Steven Rostedt