linux-mm.kvack.org archive mirror
* [PATCH v2 0/2] mm: memcg: improve vmscan tracepoints
@ 2023-11-22 10:01 Dmitry Rokosov
  2023-11-22 10:01 ` [PATCH v2 1/2] mm: memcg: print out cgroup name in the memcg tracepoints Dmitry Rokosov
       [not found] ` <20231122100156.6568-3-ddrokosov@salutedevices.com>
  0 siblings, 2 replies; 13+ messages in thread
From: Dmitry Rokosov @ 2023-11-22 10:01 UTC (permalink / raw)
  To: rostedt, mhiramat, hannes, mhocko, roman.gushchin, shakeelb,
	muchun.song, akpm
  Cc: kernel, rockosov, cgroups, linux-mm, linux-kernel, bpf,
	Dmitry Rokosov

The motivation behind this patch series is to enhance the traceability
and understanding of memcg events. By integrating the function
cgroup_name() into the existing memcg tracepoints, this series
introduces a new tracepoint template for the begin() and end() events.
It uses a static __array() to store the cgroup name, enabling
developers to easily identify the cgroup associated with a specific
memcg tracepoint event.

Additionally, this patch series introduces new shrink_memcg tracepoints
to facilitate non-direct memcg reclaim tracing and debugging.
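
For illustration, a begin()/end() pair from the new template renders
with a trailing memcg= field. The format strings below are taken from
the TP_printk() definitions in patch 1; the concrete values (order,
flags, cgroup name, page count) are made-up samples, not real trace
output:

```python
# Sketch of how the reworked events render in the trace buffer.
# Format strings mirror the TP_printk() in patch 1; all values are
# invented samples.

def render_begin(order, gfp_flags, memcg_name):
    # mm_vmscan_memcg_reclaim_begin
    return "order=%d gfp_flags=%s memcg=%s" % (order, gfp_flags, memcg_name)

def render_end(nr_reclaimed, memcg_name):
    # mm_vmscan_memcg_reclaim_end
    return "nr_reclaimed=%d memcg=%s" % (nr_reclaimed, memcg_name)

print(render_begin(0, "GFP_KERNEL", "workload"))
# order=0 gfp_flags=GFP_KERNEL memcg=workload
print(render_end(128, "workload"))
# nr_reclaimed=128 memcg=workload
```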

Changes v2 since v1 at [1]:
    - change the position of the "memcg" parameter to ensure backward
      compatibility with userspace tools that use memcg tracepoints
    - add additional CONFIG_MEMCG ifdefs to prevent the use of memcg
      tracepoints when memcg is disabled

Links:
    [1] https://lore.kernel.org/all/20231101102837.25205-1-ddrokosov@salutedevices.com/

Dmitry Rokosov (2):
  mm: memcg: print out cgroup name in the memcg tracepoints
  mm: memcg: introduce new event to trace shrink_memcg

 include/trace/events/vmscan.h | 91 ++++++++++++++++++++++++++++++-----
 mm/vmscan.c                   | 21 ++++++--
 2 files changed, 95 insertions(+), 17 deletions(-)

-- 
2.36.0




* [PATCH v2 1/2] mm: memcg: print out cgroup name in the memcg tracepoints
  2023-11-22 10:01 [PATCH v2 0/2] mm: memcg: improve vmscan tracepoints Dmitry Rokosov
@ 2023-11-22 10:01 ` Dmitry Rokosov
  2023-11-23  7:21   ` Shakeel Butt
       [not found] ` <20231122100156.6568-3-ddrokosov@salutedevices.com>
  1 sibling, 1 reply; 13+ messages in thread
From: Dmitry Rokosov @ 2023-11-22 10:01 UTC (permalink / raw)
  To: rostedt, mhiramat, hannes, mhocko, roman.gushchin, shakeelb,
	muchun.song, akpm
  Cc: kernel, rockosov, cgroups, linux-mm, linux-kernel, bpf,
	Dmitry Rokosov

Sometimes it is necessary to understand in which memcg a tracepoint
event occurred. The function cgroup_name() is a useful tool for this
purpose.
To integrate cgroup_name() into the existing memcg tracepoints, this
patch introduces a new tracepoint template for the begin() and end()
events, utilizing static __array() to store the cgroup name.

Signed-off-by: Dmitry Rokosov <ddrokosov@salutedevices.com>
---
 include/trace/events/vmscan.h | 77 +++++++++++++++++++++++++++++------
 mm/vmscan.c                   | 10 ++---
 2 files changed, 70 insertions(+), 17 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index d2123dd960d5..9b49cd120ae9 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -141,19 +141,47 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_direct_reclaim_b
 );
 
 #ifdef CONFIG_MEMCG
-DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_memcg_reclaim_begin,
 
-	TP_PROTO(int order, gfp_t gfp_flags),
+DECLARE_EVENT_CLASS(mm_vmscan_memcg_reclaim_begin_template,
 
-	TP_ARGS(order, gfp_flags)
+	TP_PROTO(int order, gfp_t gfp_flags, const struct mem_cgroup *memcg),
+
+	TP_ARGS(order, gfp_flags, memcg),
+
+	TP_STRUCT__entry(
+		__field(int, order)
+		__field(unsigned long, gfp_flags)
+		__array(char, name, NAME_MAX + 1)
+	),
+
+	TP_fast_assign(
+		__entry->order = order;
+		__entry->gfp_flags = (__force unsigned long)gfp_flags;
+		cgroup_name(memcg->css.cgroup,
+			__entry->name,
+			sizeof(__entry->name));
+	),
+
+	TP_printk("order=%d gfp_flags=%s memcg=%s",
+		__entry->order,
+		show_gfp_flags(__entry->gfp_flags),
+		__entry->name)
 );
 
-DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_memcg_softlimit_reclaim_begin,
+DEFINE_EVENT(mm_vmscan_memcg_reclaim_begin_template, mm_vmscan_memcg_reclaim_begin,
 
-	TP_PROTO(int order, gfp_t gfp_flags),
+	TP_PROTO(int order, gfp_t gfp_flags, const struct mem_cgroup *memcg),
 
-	TP_ARGS(order, gfp_flags)
+	TP_ARGS(order, gfp_flags, memcg)
+);
+
+DEFINE_EVENT(mm_vmscan_memcg_reclaim_begin_template, mm_vmscan_memcg_softlimit_reclaim_begin,
+
+	TP_PROTO(int order, gfp_t gfp_flags, const struct mem_cgroup *memcg),
+
+	TP_ARGS(order, gfp_flags, memcg)
 );
+
 #endif /* CONFIG_MEMCG */
 
 DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_end_template,
@@ -181,19 +209,44 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_direct_reclaim_end
 );
 
 #ifdef CONFIG_MEMCG
-DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_reclaim_end,
 
-	TP_PROTO(unsigned long nr_reclaimed),
+DECLARE_EVENT_CLASS(mm_vmscan_memcg_reclaim_end_template,
 
-	TP_ARGS(nr_reclaimed)
+	TP_PROTO(unsigned long nr_reclaimed, const struct mem_cgroup *memcg),
+
+	TP_ARGS(nr_reclaimed, memcg),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, nr_reclaimed)
+		__array(char, name, NAME_MAX + 1)
+	),
+
+	TP_fast_assign(
+		__entry->nr_reclaimed = nr_reclaimed;
+		cgroup_name(memcg->css.cgroup,
+			__entry->name,
+			sizeof(__entry->name));
+	),
+
+	TP_printk("nr_reclaimed=%lu memcg=%s",
+		__entry->nr_reclaimed,
+		__entry->name)
 );
 
-DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_reclaim_end,
+DEFINE_EVENT(mm_vmscan_memcg_reclaim_end_template, mm_vmscan_memcg_reclaim_end,
 
-	TP_PROTO(unsigned long nr_reclaimed),
+	TP_PROTO(unsigned long nr_reclaimed, const struct mem_cgroup *memcg),
 
-	TP_ARGS(nr_reclaimed)
+	TP_ARGS(nr_reclaimed, memcg)
 );
+
+DEFINE_EVENT(mm_vmscan_memcg_reclaim_end_template, mm_vmscan_memcg_softlimit_reclaim_end,
+
+	TP_PROTO(unsigned long nr_reclaimed, const struct mem_cgroup *memcg),
+
+	TP_ARGS(nr_reclaimed, memcg)
+);
+
 #endif /* CONFIG_MEMCG */
 
 TRACE_EVENT(mm_shrink_slab_start,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1080209a568b..45780952f4b5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7088,8 +7088,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
 
-	trace_mm_vmscan_memcg_softlimit_reclaim_begin(sc.order,
-						      sc.gfp_mask);
+	trace_mm_vmscan_memcg_softlimit_reclaim_begin(sc.order, sc.gfp_mask,
+						      memcg);
 
 	/*
 	 * NOTE: Although we can get the priority field, using it
@@ -7100,7 +7100,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 	 */
 	shrink_lruvec(lruvec, &sc);
 
-	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
+	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed, memcg);
 
 	*nr_scanned = sc.nr_scanned;
 
@@ -7134,13 +7134,13 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
 
 	set_task_reclaim_state(current, &sc.reclaim_state);
-	trace_mm_vmscan_memcg_reclaim_begin(0, sc.gfp_mask);
+	trace_mm_vmscan_memcg_reclaim_begin(0, sc.gfp_mask, memcg);
 	noreclaim_flag = memalloc_noreclaim_save();
 
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
 	memalloc_noreclaim_restore(noreclaim_flag);
-	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
+	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed, memcg);
 	set_task_reclaim_state(current, NULL);
 
 	return nr_reclaimed;
-- 
2.36.0




* Re: [PATCH v2 2/2] mm: memcg: introduce new event to trace shrink_memcg
       [not found] ` <20231122100156.6568-3-ddrokosov@salutedevices.com>
@ 2023-11-22 10:23   ` Michal Hocko
  2023-11-22 10:58     ` Dmitry Rokosov
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2023-11-22 10:23 UTC (permalink / raw)
  To: Dmitry Rokosov
  Cc: rostedt, mhiramat, hannes, roman.gushchin, shakeelb, muchun.song,
	akpm, kernel, rockosov, cgroups, linux-mm, linux-kernel, bpf

On Wed 22-11-23 13:01:56, Dmitry Rokosov wrote:
> The shrink_memcg flow plays a crucial role in memcg reclamation.
> Currently, it is not possible to trace this point from non-direct
> reclaim paths.

Is this really true? AFAICS we have
mm_vmscan_lru_isolate
mm_vmscan_lru_shrink_active
mm_vmscan_lru_shrink_inactive

which are in the very core of the memory reclaim. Sure, post-processing
those is some work.

[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 45780952f4b5..6d89b39d9a91 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6461,6 +6461,12 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  		 */
>  		cond_resched();
>  
> +#ifdef CONFIG_MEMCG
> +		trace_mm_vmscan_memcg_shrink_begin(sc->order,
> +						   sc->gfp_mask,
> +						   memcg);
> +#endif

this is a common code path for node and direct reclaim which means that
we will have multiple begin/end tracepoints covering similar operations.
To me that sounds excessive. If you are missing a cumulative kswapd
alternative to 
mm_vmscan_direct_reclaim_begin
mm_vmscan_direct_reclaim_end
mm_vmscan_memcg_reclaim_begin
mm_vmscan_memcg_reclaim_end
mm_vmscan_memcg_softlimit_reclaim_begin
mm_vmscan_memcg_softlimit_reclaim_end
mm_vmscan_node_reclaim_begin
mm_vmscan_node_reclaim_end

then place it into kswapd path. But it would be really great to
elaborate some more why this is really needed. Cannot you simply
aggregate stats for kswapd from existing tracepoints?

-- 
Michal Hocko
SUSE Labs
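
The post-processing aggregation suggested above can be sketched as
follows. The sample lines are invented but follow the usual trace_pipe
layout; only the nr_reclaimed= field of the existing
mm_vmscan_lru_shrink_inactive event is parsed:

```python
import re
from collections import defaultdict

# Sketch: aggregate reclaim stats per task from existing tracepoints,
# e.g. to separate kswapd activity from direct reclaimers. The sample
# lines below are invented, not real trace output.
LINE_RE = re.compile(
    r"^\s*(?P<task>\S+)\s+\[\d+\].*?"
    r"mm_vmscan_lru_shrink_inactive:.*?nr_reclaimed=(?P<nr>\d+)")

def aggregate(lines):
    """Sum nr_reclaimed per task (trace comm-pid column)."""
    totals = defaultdict(int)
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            totals[m.group("task")] += int(m.group("nr"))
    return dict(totals)

sample = [
    " kswapd0-89   [000] ...1  100.0: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=32 nr_reclaimed=28 priority=10",
    " kswapd0-89   [000] ...1  100.1: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=32 nr_reclaimed=30 priority=10",
    " myapp-1234   [001] ...1  100.2: mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=32 nr_reclaimed=12 priority=12",
]
print(aggregate(sample))  # {'kswapd0-89': 58, 'myapp-1234': 12}
```

Note this kind of aggregation is only per task/pid; without a memcg
field in the event there is nothing per-cgroup to group by.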



* Re: [PATCH v2 2/2] mm: memcg: introduce new event to trace shrink_memcg
  2023-11-22 10:23   ` [PATCH v2 2/2] mm: memcg: introduce new event to trace shrink_memcg Michal Hocko
@ 2023-11-22 10:58     ` Dmitry Rokosov
  2023-11-22 13:24       ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Rokosov @ 2023-11-22 10:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: rostedt, mhiramat, hannes, roman.gushchin, shakeelb, muchun.song,
	akpm, kernel, rockosov, cgroups, linux-mm, linux-kernel, bpf

Hello Michal,

Thank you for the quick review!

On Wed, Nov 22, 2023 at 11:23:24AM +0100, Michal Hocko wrote:
> On Wed 22-11-23 13:01:56, Dmitry Rokosov wrote:
> > The shrink_memcg flow plays a crucial role in memcg reclamation.
> > Currently, it is not possible to trace this point from non-direct
> > reclaim paths.
> 
> Is this really true? AFAICS we have
> mm_vmscan_lru_isolate
> mm_vmscan_lru_shrink_active
> mm_vmscan_lru_shrink_inactive
> 
> which are in the very core of the memory reclaim. Sure, post-processing
> those is some work.

Sure, you are absolutely right. In the usual scenario, the memcg
shrinker utilizes two sub-shrinkers: slab and LRU. We can enable the
tracepoints you mentioned and analyze them. However, there is one
potential issue: enabling these tracepoints will show reclaim events
for all pages. Although we can filter them per pid, we cannot filter
them per cgroup. Nevertheless, there are times when it would be
extremely beneficial to understand the effectiveness of the reclaim
process within the relevant cgroup. For this reason, I am adding the
cgroup name to the memcg tracepoints and implementing a cumulative
tracepoint for memcg shrink (LRU + slab).
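
The per-cgroup filtering described above becomes trivial once each
event carries a memcg= field. A sketch in Python (the event name and
field layout follow this series; the sample lines themselves are
invented, not real trace output):

```python
import re

# Sketch: select only events belonging to one cgroup by matching the
# memcg= field this series adds. Sample lines are invented.
MEMCG_RE = re.compile(r"\bmemcg=(\S+)")

def filter_by_memcg(lines, name):
    return [l for l in lines
            if (m := MEMCG_RE.search(l)) and m.group(1) == name]

sample = [
    "kswapd0-89 ... mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=workload",
    "kswapd0-89 ... mm_vmscan_memcg_shrink_end: nr_reclaimed=57 memcg=workload",
    "kswapd0-89 ... mm_vmscan_memcg_shrink_begin: order=0 gfp_flags=GFP_KERNEL memcg=system",
]
print(len(filter_by_memcg(sample, "workload")))  # 2
```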

> 
> [...]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 45780952f4b5..6d89b39d9a91 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -6461,6 +6461,12 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> >  		 */
> >  		cond_resched();
> >  
> > +#ifdef CONFIG_MEMCG
> > +		trace_mm_vmscan_memcg_shrink_begin(sc->order,
> > +						   sc->gfp_mask,
> > +						   memcg);
> > +#endif
> 
> this is a common code path for node and direct reclaim which means that
> we will have multiple begin/end tracepoints covering similar operations.
> To me that sounds excessive. If you are missing a cumulative kswapd
> alternative to 
> mm_vmscan_direct_reclaim_begin
> mm_vmscan_direct_reclaim_end
> mm_vmscan_memcg_reclaim_begin
> mm_vmscan_memcg_reclaim_end
> mm_vmscan_memcg_softlimit_reclaim_begin
> mm_vmscan_memcg_softlimit_reclaim_end
> mm_vmscan_node_reclaim_begin
> mm_vmscan_node_reclaim_end
> 
> then place it into kswapd path. But it would be really great to
> elaborate some more why this is really needed. Cannot you simply
> aggregate stats for kswapd from existing tracepoints?
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Thank you,
Dmitry



* Re: [PATCH v2 2/2] mm: memcg: introduce new event to trace shrink_memcg
  2023-11-22 10:58     ` Dmitry Rokosov
@ 2023-11-22 13:24       ` Michal Hocko
  2023-11-22 18:57         ` Dmitry Rokosov
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2023-11-22 13:24 UTC (permalink / raw)
  To: Dmitry Rokosov
  Cc: rostedt, mhiramat, hannes, roman.gushchin, shakeelb, muchun.song,
	akpm, kernel, rockosov, cgroups, linux-mm, linux-kernel, bpf

On Wed 22-11-23 13:58:36, Dmitry Rokosov wrote:
> Hello Michal,
> 
> Thank you for the quick review!
> 
> On Wed, Nov 22, 2023 at 11:23:24AM +0100, Michal Hocko wrote:
> > On Wed 22-11-23 13:01:56, Dmitry Rokosov wrote:
> > > The shrink_memcg flow plays a crucial role in memcg reclamation.
> > > Currently, it is not possible to trace this point from non-direct
> > > reclaim paths.
> > 
> > Is this really true? AFAICS we have
> > mm_vmscan_lru_isolate
> > mm_vmscan_lru_shrink_active
> > mm_vmscan_lru_shrink_inactive
> > 
> > which are in the very core of the memory reclaim. Sure, post-processing
> > those is some work.
> 
> Sure, you are absolutely right. In the usual scenario, the memcg
> shrinker utilizes two sub-shrinkers: slab and LRU. We can enable the
> tracepoints you mentioned and analyze them. However, there is one
> potential issue: enabling these tracepoints will show reclaim events
> for all pages. Although we can filter them per pid, we cannot filter
> them per cgroup. Nevertheless, there are times when it would be
> extremely beneficial to understand the effectiveness of the reclaim
> process within the relevant cgroup. For this reason, I am adding the
> cgroup name to the memcg tracepoints and implementing a cumulative
> tracepoint for memcg shrink (LRU + slab).

I can see how printing memcg in mm_vmscan_memcg_reclaim_begin makes it
easier to postprocess per memcg reclaim. But you could do that just by
adding that to mm_vmscan_memcg_reclaim_{begin, end}, no? Why exactly
does this matter for kswapd and other global reclaim contexts? 
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v2 2/2] mm: memcg: introduce new event to trace shrink_memcg
  2023-11-22 13:24       ` Michal Hocko
@ 2023-11-22 18:57         ` Dmitry Rokosov
  2023-11-23 11:26           ` Dmitry Rokosov
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Rokosov @ 2023-11-22 18:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: rostedt, mhiramat, hannes, roman.gushchin, shakeelb, muchun.song,
	akpm, kernel, rockosov, cgroups, linux-mm, linux-kernel, bpf

On Wed, Nov 22, 2023 at 02:24:59PM +0100, Michal Hocko wrote:
> On Wed 22-11-23 13:58:36, Dmitry Rokosov wrote:
> > Hello Michal,
> > 
> > Thank you for the quick review!
> > 
> > On Wed, Nov 22, 2023 at 11:23:24AM +0100, Michal Hocko wrote:
> > > On Wed 22-11-23 13:01:56, Dmitry Rokosov wrote:
> > > > The shrink_memcg flow plays a crucial role in memcg reclamation.
> > > > Currently, it is not possible to trace this point from non-direct
> > > > reclaim paths.
> > > 
> > > Is this really true? AFAICS we have
> > > mm_vmscan_lru_isolate
> > > mm_vmscan_lru_shrink_active
> > > mm_vmscan_lru_shrink_inactive
> > > 
> > > which are in the very core of the memory reclaim. Sure, post-processing
> > > those is some work.
> > 
> > Sure, you are absolutely right. In the usual scenario, the memcg
> > shrinker utilizes two sub-shrinkers: slab and LRU. We can enable the
> > tracepoints you mentioned and analyze them. However, there is one
> > potential issue: enabling these tracepoints will show reclaim events
> > for all pages. Although we can filter them per pid, we cannot filter
> > them per cgroup. Nevertheless, there are times when it would be
> > extremely beneficial to understand the effectiveness of the reclaim
> > process within the relevant cgroup. For this reason, I am adding the
> > cgroup name to the memcg tracepoints and implementing a cumulative
> > tracepoint for memcg shrink (LRU + slab).
> 
> I can see how printing memcg in mm_vmscan_memcg_reclaim_begin makes it
> easier to postprocess per memcg reclaim. But you could do that just by
> adding that to mm_vmscan_memcg_reclaim_{begin, end}, no? Why exactly
> does this matter for kswapd and other global reclaim contexts? 

From my point of view, kswapd and other non-direct reclaim paths are
important for memcg analysis because they also influence the memcg
reclaim statistics.

The tracepoint mm_vmscan_memcg_reclaim_{begin, end} is called from the
direct memcg reclaim flow, such as:
    - a direct write to the 'reclaim' node
    - changing 'max' and 'high' thresholds
    - raising the 'force_empty' mechanism
    - the charge path
    - etc.

However, it doesn't cover global reclaim contexts, so it doesn't provide
us with the full memcg reclaim statistics.

-- 
Thank you,
Dmitry



* Re: [PATCH v2 1/2] mm: memcg: print out cgroup name in the memcg tracepoints
  2023-11-22 10:01 ` [PATCH v2 1/2] mm: memcg: print out cgroup name in the memcg tracepoints Dmitry Rokosov
@ 2023-11-23  7:21   ` Shakeel Butt
  2023-11-23  8:03     ` Dmitry Rokosov
  0 siblings, 1 reply; 13+ messages in thread
From: Shakeel Butt @ 2023-11-23  7:21 UTC (permalink / raw)
  To: Dmitry Rokosov
  Cc: rostedt, mhiramat, hannes, mhocko, roman.gushchin, muchun.song,
	akpm, kernel, rockosov, cgroups, linux-mm, linux-kernel, bpf

On Wed, Nov 22, 2023 at 01:01:55PM +0300, Dmitry Rokosov wrote:
> Sometimes it is necessary to understand in which memcg a tracepoint
> event occurred. The function cgroup_name() is a useful tool for this
> purpose.
> To integrate cgroup_name() into the existing memcg tracepoints, this
> patch introduces a new tracepoint template for the begin() and end()
> events, utilizing static __array() to store the cgroup name.
> 
> Signed-off-by: Dmitry Rokosov <ddrokosov@salutedevices.com>
> ---
>  include/trace/events/vmscan.h | 77 +++++++++++++++++++++++++++++------
>  mm/vmscan.c                   | 10 ++---
>  2 files changed, 70 insertions(+), 17 deletions(-)
> 
> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> index d2123dd960d5..9b49cd120ae9 100644
> --- a/include/trace/events/vmscan.h
> +++ b/include/trace/events/vmscan.h
> @@ -141,19 +141,47 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_direct_reclaim_b
>  );
>  
>  #ifdef CONFIG_MEMCG
> -DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_memcg_reclaim_begin,
>  
> -	TP_PROTO(int order, gfp_t gfp_flags),
> +DECLARE_EVENT_CLASS(mm_vmscan_memcg_reclaim_begin_template,
>  
> -	TP_ARGS(order, gfp_flags)
> +	TP_PROTO(int order, gfp_t gfp_flags, const struct mem_cgroup *memcg),
> +
> +	TP_ARGS(order, gfp_flags, memcg),
> +
> +	TP_STRUCT__entry(
> +		__field(int, order)
> +		__field(unsigned long, gfp_flags)
> +		__array(char, name, NAME_MAX + 1)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->order = order;
> +		__entry->gfp_flags = (__force unsigned long)gfp_flags;
> +		cgroup_name(memcg->css.cgroup,
> +			__entry->name,
> +			sizeof(__entry->name));

Any reason not to use cgroup_ino? cgroup_name may conflict and be
ambiguous.



* Re: [PATCH v2 1/2] mm: memcg: print out cgroup name in the memcg tracepoints
  2023-11-23  7:21   ` Shakeel Butt
@ 2023-11-23  8:03     ` Dmitry Rokosov
  2023-11-23  8:15       ` Shakeel Butt
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Rokosov @ 2023-11-23  8:03 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: rostedt, mhiramat, hannes, mhocko, roman.gushchin, muchun.song,
	akpm, kernel, rockosov, cgroups, linux-mm, linux-kernel, bpf

Hello Shakeel,

Thank you for quick review!
Please find my thoughts below.

On Thu, Nov 23, 2023 at 07:21:26AM +0000, Shakeel Butt wrote:
> On Wed, Nov 22, 2023 at 01:01:55PM +0300, Dmitry Rokosov wrote:
> > Sometimes it is necessary to understand in which memcg a tracepoint
> > event occurred. The function cgroup_name() is a useful tool for this
> > purpose.
> > To integrate cgroup_name() into the existing memcg tracepoints, this
> > patch introduces a new tracepoint template for the begin() and end()
> > events, utilizing static __array() to store the cgroup name.
> > 
> > Signed-off-by: Dmitry Rokosov <ddrokosov@salutedevices.com>
> > ---
> >  include/trace/events/vmscan.h | 77 +++++++++++++++++++++++++++++------
> >  mm/vmscan.c                   | 10 ++---
> >  2 files changed, 70 insertions(+), 17 deletions(-)
> > 
> > diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> > index d2123dd960d5..9b49cd120ae9 100644
> > --- a/include/trace/events/vmscan.h
> > +++ b/include/trace/events/vmscan.h
> > @@ -141,19 +141,47 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_direct_reclaim_b
> >  );
> >  
> >  #ifdef CONFIG_MEMCG
> > -DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_memcg_reclaim_begin,
> >  
> > -	TP_PROTO(int order, gfp_t gfp_flags),
> > +DECLARE_EVENT_CLASS(mm_vmscan_memcg_reclaim_begin_template,
> >  
> > -	TP_ARGS(order, gfp_flags)
> > +	TP_PROTO(int order, gfp_t gfp_flags, const struct mem_cgroup *memcg),
> > +
> > +	TP_ARGS(order, gfp_flags, memcg),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(int, order)
> > +		__field(unsigned long, gfp_flags)
> > +		__array(char, name, NAME_MAX + 1)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->order = order;
> > +		__entry->gfp_flags = (__force unsigned long)gfp_flags;
> > +		cgroup_name(memcg->css.cgroup,
> > +			__entry->name,
> > +			sizeof(__entry->name));
> 
> Any reason not to use cgroup_ino? cgroup_name may conflict and be
> ambiguous.

I actually didn't consider it, as the cgroup name serves as a clear tag
for filtering the appropriate cgroup in the entire trace file. However,
you are correct that there might be conflicts with cgroup names.
Therefore, it might be better to display both tags: ino and name. What
do you think of this?

-- 
Thank you,
Dmitry



* Re: [PATCH v2 1/2] mm: memcg: print out cgroup name in the memcg tracepoints
  2023-11-23  8:03     ` Dmitry Rokosov
@ 2023-11-23  8:15       ` Shakeel Butt
  2023-11-23  8:45         ` Dmitry Rokosov
  0 siblings, 1 reply; 13+ messages in thread
From: Shakeel Butt @ 2023-11-23  8:15 UTC (permalink / raw)
  To: Dmitry Rokosov
  Cc: rostedt, mhiramat, hannes, mhocko, roman.gushchin, muchun.song,
	akpm, kernel, rockosov, cgroups, linux-mm, linux-kernel, bpf

On Thu, Nov 23, 2023 at 11:03:34AM +0300, Dmitry Rokosov wrote:
[...]
> > > +		cgroup_name(memcg->css.cgroup,
> > > +			__entry->name,
> > > +			sizeof(__entry->name));
> > 
> > Any reason not to use cgroup_ino? cgroup_name may conflict and be
> > ambiguous.
> 
> I actually didn't consider it, as the cgroup name serves as a clear tag
> for filtering the appropriate cgroup in the entire trace file. However,
> you are correct that there might be conflicts with cgroup names.
> Therefore, it might be better to display both tags: ino and name. What
> do you think of this?
> 

I can see that putting the cgroup name in can avoid pre- or
post-processing, so putting both is fine. Though keep in mind that
cgroup_name() acquires a lock which may impact the applications running
on the system.



* Re: [PATCH v2 1/2] mm: memcg: print out cgroup name in the memcg tracepoints
  2023-11-23  8:15       ` Shakeel Butt
@ 2023-11-23  8:45         ` Dmitry Rokosov
  2023-11-23 11:21           ` Dmitry Rokosov
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Rokosov @ 2023-11-23  8:45 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: rostedt, mhiramat, hannes, mhocko, roman.gushchin, muchun.song,
	akpm, kernel, rockosov, cgroups, linux-mm, linux-kernel, bpf

On Thu, Nov 23, 2023 at 08:15:47AM +0000, Shakeel Butt wrote:
> On Thu, Nov 23, 2023 at 11:03:34AM +0300, Dmitry Rokosov wrote:
> [...]
> > > > +		cgroup_name(memcg->css.cgroup,
> > > > +			__entry->name,
> > > > +			sizeof(__entry->name));
> > > 
> > > Any reason not to use cgroup_ino? cgroup_name may conflict and be
> > > ambiguous.
> > 
> > I actually didn't consider it, as the cgroup name serves as a clear tag
> > for filtering the appropriate cgroup in the entire trace file. However,
> > you are correct that there might be conflicts with cgroup names.
> > Therefore, it might be better to display both tags: ino and name. What
> > do you think of this?
> > 
> 
> I can see that putting the cgroup name in can avoid pre- or
> post-processing, so putting both is fine. Though keep in mind that
> cgroup_name() acquires a lock which may impact the applications running
> on the system.

Are you talking about kernfs_rename_lock? Yes, it's acquired each
time... Unfortunately, I don't know of a way to save the cgroup name
just once...

-- 
Thank you,
Dmitry



* Re: [PATCH v2 1/2] mm: memcg: print out cgroup name in the memcg tracepoints
  2023-11-23  8:45         ` Dmitry Rokosov
@ 2023-11-23 11:21           ` Dmitry Rokosov
  0 siblings, 0 replies; 13+ messages in thread
From: Dmitry Rokosov @ 2023-11-23 11:21 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: rostedt, mhiramat, hannes, mhocko, roman.gushchin, muchun.song,
	akpm, kernel, rockosov, cgroups, linux-mm, linux-kernel, bpf

Shakeel,

On Thu, Nov 23, 2023 at 11:45:10AM +0300, Dmitry Rokosov wrote:
> On Thu, Nov 23, 2023 at 08:15:47AM +0000, Shakeel Butt wrote:
> > On Thu, Nov 23, 2023 at 11:03:34AM +0300, Dmitry Rokosov wrote:
> > [...]
> > > > > +		cgroup_name(memcg->css.cgroup,
> > > > > +			__entry->name,
> > > > > +			sizeof(__entry->name));
> > > > 
> > > > Any reason not to use cgroup_ino? cgroup_name may conflict and be
> > > > ambiguous.
> > > 
> > > I actually didn't consider it, as the cgroup name serves as a clear tag
> > > for filtering the appropriate cgroup in the entire trace file. However,
> > > you are correct that there might be conflicts with cgroup names.
> > > Therefore, it might be better to display both tags: ino and name. What
> > > do you think of this?
> > > 
> > 
> > I can see that putting the cgroup name in can avoid pre- or
> > post-processing, so putting both is fine. Though keep in mind that
> > cgroup_name() acquires a lock which may impact the applications running
> > on the system.
> 
> Are you talking about kernfs_rename_lock? Yes, it's acquired each
> time... Unfortunately, I don't know of a way to save the cgroup name
> just once...

I delved deeper and realized that kernfs_rename_lock is a read-write
lock, but it's a global one. While it's true that we only enable
tracepoints during specific periods of the host's lifetime, the trace
system is still a fast way to debug things. So, you're absolutely right,
we shouldn't slow down the system unnecessarily.

Therefore, today I will prepare a new version with only the cgroup ino.
Thank you for pointing that out to me!

-- 
Thank you,
Dmitry



* Re: [PATCH v2 2/2] mm: memcg: introduce new event to trace shrink_memcg
  2023-11-22 18:57         ` Dmitry Rokosov
@ 2023-11-23 11:26           ` Dmitry Rokosov
  2023-11-27  9:25             ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Dmitry Rokosov @ 2023-11-23 11:26 UTC (permalink / raw)
  To: Michal Hocko, shakeelb
  Cc: rostedt, mhiramat, hannes, roman.gushchin, shakeelb, muchun.song,
	akpm, kernel, rockosov, cgroups, linux-mm, linux-kernel, bpf

Michal, Shakeel,

Sorry for pinging you here, but I don't quite understand your decision
on this patchset.

Is it a NAK or not? If it's not, should I consider redesigning
something? For instance, introducing stub functions to
remove ifdefs from shrink_node_memcgs().

Thank you for taking the time to look into this!

On Wed, Nov 22, 2023 at 09:57:27PM +0300, Dmitry Rokosov wrote:
> On Wed, Nov 22, 2023 at 02:24:59PM +0100, Michal Hocko wrote:
> > On Wed 22-11-23 13:58:36, Dmitry Rokosov wrote:
> > > Hello Michal,
> > > 
> > > Thank you for the quick review!
> > > 
> > > On Wed, Nov 22, 2023 at 11:23:24AM +0100, Michal Hocko wrote:
> > > > On Wed 22-11-23 13:01:56, Dmitry Rokosov wrote:
> > > > > The shrink_memcg flow plays a crucial role in memcg reclamation.
> > > > > Currently, it is not possible to trace this point from non-direct
> > > > > reclaim paths.
> > > > 
> > > > Is this really true? AFAICS we have
> > > > mm_vmscan_lru_isolate
> > > > mm_vmscan_lru_shrink_active
> > > > mm_vmscan_lru_shrink_inactive
> > > > 
> > > > which are in the very core of the memory reclaim. Sure, post-processing
> > > > those is some work.
> > > 
> > > Sure, you are absolutely right. In the usual scenario, the memcg
> > > shrinker utilizes two sub-shrinkers: slab and LRU. We can enable the
> > > tracepoints you mentioned and analyze them. However, there is one
> > > potential issue: enabling these tracepoints will show reclaim events
> > > for all pages. Although we can filter them per pid, we cannot filter
> > > them per cgroup. Nevertheless, there are times when it would be
> > > extremely beneficial to understand the effectiveness of the reclaim
> > > process within the relevant cgroup. For this reason, I am adding the
> > > cgroup name to the memcg tracepoints and implementing a cumulative
> > > tracepoint for memcg shrink (LRU + slab).
> > 
> > I can see how printing memcg in mm_vmscan_memcg_reclaim_begin makes it
> > easier to postprocess per memcg reclaim. But you could do that just by
> > adding that to mm_vmscan_memcg_reclaim_{begin, end}, no? Why exactly
> > does this matter for kswapd and other global reclaim contexts? 
> 
> From my point of view, kswapd and other non-direct reclaim paths are
> important for memcg analysis because they also influence the memcg
> reclaim statistics.
> 
> The tracepoint mm_vmscan_memcg_reclaim_{begin, end} is called from the
> direct memcg reclaim flow, such as:
>     - a direct write to the 'reclaim' node
>     - changing 'max' and 'high' thresholds
>     - raising the 'force_empty' mechanism
>     - the charge path
>     - etc.
> 
> However, it doesn't cover global reclaim contexts, so it doesn't provide
> us with the full memcg reclaim statistics.
> 
> -- 
> Thank you,
> Dmitry

-- 
Thank you,
Dmitry



* Re: [PATCH v2 2/2] mm: memcg: introduce new event to trace shrink_memcg
  2023-11-23 11:26           ` Dmitry Rokosov
@ 2023-11-27  9:25             ` Michal Hocko
  0 siblings, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2023-11-27  9:25 UTC (permalink / raw)
  To: Dmitry Rokosov
  Cc: shakeelb, rostedt, mhiramat, hannes, roman.gushchin, muchun.song,
	akpm, kernel, rockosov, cgroups, linux-mm, linux-kernel, bpf

On Thu 23-11-23 14:26:29, Dmitry Rokosov wrote:
> Michal, Shakeel,
> 
> Sorry for pinging you here, but I don't quite understand your decision
> on this patchset.
> 
> Is it a NAK or not? If it's not, should I consider redesigning
> something? For instance, introducing stub functions to
> remove ifdefs from shrink_node_memcgs().
> 
> Thank you for taking the time to look into this!

Sorry for the late reply. I noticed you have posted a new version.
Let me have a look and comment there.

-- 
Michal Hocko
SUSE Labs


