[RFC 0/1] Try to add memory allocation info for cgroup oom kill

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC 0/1] Try to add memory allocation info for cgroup oom kill
@ 2025-08-14 17:11 Yueyang Pan
  2025-08-14 17:11 ` [RFC 1/1] Add memory allocation info for cgroup oom Yueyang Pan
  2025-08-21 18:35 ` [RFC 0/1] Try to add memory allocation info for cgroup oom kill Shakeel Butt
  0 siblings, 2 replies; 18+ messages in thread
From: Yueyang Pan @ 2025-08-14 17:11 UTC (permalink / raw)
  To: Suren Baghdasaryan, Kent Overstreet, Usama Arif; +Cc: linux-mm, linux-kernel

Right now in the oom_kill_process if the oom is because of the cgroup 
limit, we won't get memory allocation infomation. In some cases, we 
can have a large cgroup workload running which dominates the machine. 
The reason using cgroup is to leave some resource for system. When this 
cgroup is killed, we would also like to have some memory allocation 
information for the whole server as well. This is reason behind this 
mini change. Is it an acceptable thing to do? Will it be too much 
information for people? I am happy with any suggestions!

Yueyang Pan (1):
  Add memory allocation info for cgroup oom

 mm/oom_kill.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

-- 
2.47.3

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC 1/1] Add memory allocation info for cgroup oom
  2025-08-14 17:11 [RFC 0/1] Try to add memory allocation info for cgroup oom kill Yueyang Pan
@ 2025-08-14 17:11 ` Yueyang Pan
  2025-08-14 20:11   ` Joshua Hahn
  2025-08-21 18:35 ` [RFC 0/1] Try to add memory allocation info for cgroup oom kill Shakeel Butt
  1 sibling, 1 reply; 18+ messages in thread
From: Yueyang Pan @ 2025-08-14 17:11 UTC (permalink / raw)
  To: Suren Baghdasaryan, Kent Overstreet, Usama Arif; +Cc: linux-mm, linux-kernel

Enable show_mem for the cgroup oom case. We will have memory allocation 
information in such case for the machine.

Signed-off-by: Yueyang Pan <pyyjason@gmail.com>
---
 mm/oom_kill.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 17650f0b516e..3ca224028396 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -465,8 +465,10 @@ static void dump_header(struct oom_control *oc)
 		pr_warn("COMPACTION is disabled!!!\n");
 
 	dump_stack();
-	if (is_memcg_oom(oc))
+	if (is_memcg_oom(oc)) {
 		mem_cgroup_print_oom_meminfo(oc->memcg);
+		show_mem();
+	}
 	else {
 		__show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask, gfp_zone(oc->gfp_mask));
 		if (should_dump_unreclaim_slab())
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RFC 1/1] Add memory allocation info for cgroup oom
  2025-08-14 17:11 ` [RFC 1/1] Add memory allocation info for cgroup oom Yueyang Pan
@ 2025-08-14 20:11   ` Joshua Hahn
  2025-08-18 14:24     ` Yueyang Pan
  0 siblings, 1 reply; 18+ messages in thread
From: Joshua Hahn @ 2025-08-14 20:11 UTC (permalink / raw)
  To: Yueyang Pan
  Cc: Suren Baghdasaryan, Kent Overstreet, Usama Arif, linux-mm,
	linux-kernel, kernel-team

On Thu, 14 Aug 2025 10:11:57 -0700 Yueyang Pan <pyyjason@gmail.com> wrote:

> Enable show_mem for the cgroup oom case. We will have memory allocation 
> information in such case for the machine.

Hi Pan,

Thank you for your patch! This makes sense to me. As for your concerns from the
cover letter on whether this is too much information: personally I don't think
so, but perhaps other developers will have different opinions?

I just have a few comments / nits.

> Signed-off-by: Yueyang Pan <pyyjason@gmail.com>
> ---
>  mm/oom_kill.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 17650f0b516e..3ca224028396 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -465,8 +465,10 @@ static void dump_header(struct oom_control *oc)
>  		pr_warn("COMPACTION is disabled!!!\n");
>  
>  	dump_stack();
> -	if (is_memcg_oom(oc))
> +	if (is_memcg_oom(oc)) {
>  		mem_cgroup_print_oom_meminfo(oc->memcg);
> +		show_mem();

Below, there is a direct call to __show_mem, which limits node and zone
filtering. I am wondering whether it would make sense to also call __show_mem
with the same arguments? show_mem() is just a wrapper around __show_mem with
default parameters (i.e. not filtering out nodes, not filtering out
zones).

If you think this makes sense, we can even take it out of the if-else statement
and call it unconditionally. But this is just my opinion, please feel free to
keep the unfiltered call if you believe that fits better in here.

> +	}

NIT: Should this closing brace be on the same line as the following else
statement, as per the kernel style guide [1]

>  	else {
>  		__show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask, gfp_zone(oc->gfp_mask));
>  		if (should_dump_unreclaim_slab())
> -- 
> 2.47.3

Thanks again Pan, I hope you have a great day!
Joshua

[1] https://docs.kernel.org/process/coding-style.html

Sent using hkml (https://github.com/sjp38/hackermail)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 1/1] Add memory allocation info for cgroup oom
  2025-08-14 20:11   ` Joshua Hahn
@ 2025-08-18 14:24     ` Yueyang Pan
  2025-08-21  1:25       ` Suren Baghdasaryan
  0 siblings, 1 reply; 18+ messages in thread
From: Yueyang Pan @ 2025-08-18 14:24 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Suren Baghdasaryan, Kent Overstreet, Usama Arif, Michal Hocko,
	David Rientjes, Shakeel Butt, Andrew Morton, linux-mm,
	linux-kernel, kernel-team

On Thu, Aug 14, 2025 at 01:11:08PM -0700, Joshua Hahn wrote:
> On Thu, 14 Aug 2025 10:11:57 -0700 Yueyang Pan <pyyjason@gmail.com> wrote:
> 
> > Enable show_mem for the cgroup oom case. We will have memory allocation 
> > information in such case for the machine.
> 
> Hi Pan,
> 
> Thank you for your patch! This makes sense to me. As for your concerns from the
> cover letter on whether this is too much information: personally I don't think
> so, but perhaps other developers will have different opinions?
> 
> I just have a few comments / nits.

Thanks for your comment, Joshua.

> 
> > Signed-off-by: Yueyang Pan <pyyjason@gmail.com>
> > ---
> >  mm/oom_kill.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 17650f0b516e..3ca224028396 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -465,8 +465,10 @@ static void dump_header(struct oom_control *oc)
> >  		pr_warn("COMPACTION is disabled!!!\n");
> >  
> >  	dump_stack();
> > -	if (is_memcg_oom(oc))
> > +	if (is_memcg_oom(oc)) {
> >  		mem_cgroup_print_oom_meminfo(oc->memcg);
> > +		show_mem();
> 
> Below, there is a direct call to __show_mem, which limits node and zone
> filtering. I am wondering whether it would make sense to also call __show_mem
> with the same arguments? show_mem() is just a wrapper around __show_mem with
> default parameters (i.e. not filtering out nodes, not filtering out
> zones).

The reason why I call show_mem here directly is because cgroup is not bound to 
a specific zone or node (correctly me if I am wrong). Thus I simply invoke 
show_mem to show system-wide memory info.

> 
> If you think this makes sense, we can even take it out of the if-else statement
> and call it unconditionally. But this is just my opinion, please feel free to
> keep the unfiltered call if you believe that fits better in here.
> 
> > +	}
> 
> NIT: Should this closing brace be on the same line as the following else
> statement, as per the kernel style guide [1]

Sorry for this. I will run checkpatch for my formal patch definitely

> 
> >  	else {
> >  		__show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask, gfp_zone(oc->gfp_mask));
> >  		if (should_dump_unreclaim_slab())
> > -- 
> > 2.47.3
> 
> Thanks again Pan, I hope you have a great day!
> Joshua
> 
> [1] https://docs.kernel.org/process/coding-style.html
> 
> Sent using hkml (https://github.com/sjp38/hackermail)

Sorry that I forgot to cc some maintainers so I added them in this reply.
Pan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 1/1] Add memory allocation info for cgroup oom
  2025-08-18 14:24     ` Yueyang Pan
@ 2025-08-21  1:25       ` Suren Baghdasaryan
  2025-08-21 19:09         ` Yueyang Pan
  0 siblings, 1 reply; 18+ messages in thread
From: Suren Baghdasaryan @ 2025-08-21  1:25 UTC (permalink / raw)
  To: Yueyang Pan
  Cc: Joshua Hahn, Kent Overstreet, Usama Arif, Michal Hocko,
	David Rientjes, Shakeel Butt, Andrew Morton, linux-mm,
	linux-kernel, kernel-team

On Mon, Aug 18, 2025 at 7:24 AM Yueyang Pan <pyyjason@gmail.com> wrote:
>
> On Thu, Aug 14, 2025 at 01:11:08PM -0700, Joshua Hahn wrote:
> > On Thu, 14 Aug 2025 10:11:57 -0700 Yueyang Pan <pyyjason@gmail.com> wrote:
> >
> > > Enable show_mem for the cgroup oom case. We will have memory allocation
> > > information in such case for the machine.

Memory allocations are only a part of show_mem(), so I would not call
this change memory allocation profiling specific. The title and the
changelog should be corrected to reflect exactly what is being done
here - logging system in addition to cgroup memory state during cgroup
oom-kill.
As for whether it makes sense to report system memory during cgroup
oom-kill... I'm not too sure. Maybe people who use memcgs more
extensively than what I've seen (in Android) can chime in?


> >
> > Hi Pan,
> >
> > Thank you for your patch! This makes sense to me. As for your concerns from the
> > cover letter on whether this is too much information: personally I don't think
> > so, but perhaps other developers will have different opinions?
> >
> > I just have a few comments / nits.
>
> Thanks for your comment, Joshua.
>
> >
> > > Signed-off-by: Yueyang Pan <pyyjason@gmail.com>
> > > ---
> > >  mm/oom_kill.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 17650f0b516e..3ca224028396 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -465,8 +465,10 @@ static void dump_header(struct oom_control *oc)
> > >             pr_warn("COMPACTION is disabled!!!\n");
> > >
> > >     dump_stack();
> > > -   if (is_memcg_oom(oc))
> > > +   if (is_memcg_oom(oc)) {
> > >             mem_cgroup_print_oom_meminfo(oc->memcg);
> > > +           show_mem();
> >
> > Below, there is a direct call to __show_mem, which limits node and zone
> > filtering. I am wondering whether it would make sense to also call __show_mem
> > with the same arguments? show_mem() is just a wrapper around __show_mem with
> > default parameters (i.e. not filtering out nodes, not filtering out
> > zones).
>
> The reason why I call show_mem here directly is because cgroup is not bound to
> a specific zone or node (correctly me if I am wrong). Thus I simply invoke
> show_mem to show system-wide memory info.
>
> >
> > If you think this makes sense, we can even take it out of the if-else statement
> > and call it unconditionally. But this is just my opinion, please feel free to
> > keep the unfiltered call if you believe that fits better in here.
> >
> > > +   }
> >
> > NIT: Should this closing brace be on the same line as the following else
> > statement, as per the kernel style guide [1]
>
> Sorry for this. I will run checkpatch for my formal patch definitely
>
> >
> > >     else {
> > >             __show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask, gfp_zone(oc->gfp_mask));
> > >             if (should_dump_unreclaim_slab())
> > > --
> > > 2.47.3
> >
> > Thanks again Pan, I hope you have a great day!
> > Joshua
> >
> > [1] https://docs.kernel.org/process/coding-style.html
> >
> > Sent using hkml (https://github.com/sjp38/hackermail)
>
> Sorry that I forgot to cc some maintainers so I added them in this reply.
> Pan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 1/1] Add memory allocation info for cgroup oom
  2025-08-21  1:25       ` Suren Baghdasaryan
@ 2025-08-21 19:09         ` Yueyang Pan
  0 siblings, 0 replies; 18+ messages in thread
From: Yueyang Pan @ 2025-08-21 19:09 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Joshua Hahn, Kent Overstreet, Usama Arif, Michal Hocko,
	David Rientjes, Shakeel Butt, Andrew Morton, linux-mm,
	linux-kernel, kernel-team

On Wed, Aug 20, 2025 at 06:25:56PM -0700, Suren Baghdasaryan wrote:
> On Mon, Aug 18, 2025 at 7:24 AM Yueyang Pan <pyyjason@gmail.com> wrote:
> >
> > On Thu, Aug 14, 2025 at 01:11:08PM -0700, Joshua Hahn wrote:
> > > On Thu, 14 Aug 2025 10:11:57 -0700 Yueyang Pan <pyyjason@gmail.com> wrote:
> > >
> > > > Enable show_mem for the cgroup oom case. We will have memory allocation
> > > > information in such case for the machine.
> 
> Memory allocations are only a part of show_mem(), so I would not call
> this change memory allocation profiling specific. The title and the
> changelog should be corrected to reflect exactly what is being done
> here - logging system in addition to cgroup memory state during cgroup
> oom-kill.

Thanks for your feedback Suren! I will change the title to be precise in 
the next version.

> As for whether it makes sense to report system memory during cgroup
> oom-kill... I'm not too sure. Maybe people who use memcgs more
> extensively than what I've seen (in Android) can chime in?
> 

In my opinion, the show_free_areas and memory allocation profiling data 
can provide an entrypoint to understand what happens with cgroup oom. We 
can also compare them with historical data to see if some memory usage 
has a spike.

Feel free to critize me if I am not making sense.

> 
> > >
> > > Hi Pan,
> > >
> > > Thank you for your patch! This makes sense to me. As for your concerns from the
> > > cover letter on whether this is too much information: personally I don't think
> > > so, but perhaps other developers will have different opinions?
> > >
> > > I just have a few comments / nits.
> >
> > Thanks for your comment, Joshua.
> >
> > >
> > > > Signed-off-by: Yueyang Pan <pyyjason@gmail.com>
> > > > ---
> > > >  mm/oom_kill.c | 4 +++-
> > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > index 17650f0b516e..3ca224028396 100644
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > > @@ -465,8 +465,10 @@ static void dump_header(struct oom_control *oc)
> > > >             pr_warn("COMPACTION is disabled!!!\n");
> > > >
> > > >     dump_stack();
> > > > -   if (is_memcg_oom(oc))
> > > > +   if (is_memcg_oom(oc)) {
> > > >             mem_cgroup_print_oom_meminfo(oc->memcg);
> > > > +           show_mem();
> > >
> > > Below, there is a direct call to __show_mem, which limits node and zone
> > > filtering. I am wondering whether it would make sense to also call __show_mem
> > > with the same arguments? show_mem() is just a wrapper around __show_mem with
> > > default parameters (i.e. not filtering out nodes, not filtering out
> > > zones).
> >
> > The reason why I call show_mem here directly is because cgroup is not bound to
> > a specific zone or node (correctly me if I am wrong). Thus I simply invoke
> > show_mem to show system-wide memory info.
> >
> > >
> > > If you think this makes sense, we can even take it out of the if-else statement
> > > and call it unconditionally. But this is just my opinion, please feel free to
> > > keep the unfiltered call if you believe that fits better in here.
> > >
> > > > +   }
> > >
> > > NIT: Should this closing brace be on the same line as the following else
> > > statement, as per the kernel style guide [1]
> >
> > Sorry for this. I will run checkpatch for my formal patch definitely
> >
> > >
> > > >     else {
> > > >             __show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask, gfp_zone(oc->gfp_mask));
> > > >             if (should_dump_unreclaim_slab())
> > > > --
> > > > 2.47.3
> > >
> > > Thanks again Pan, I hope you have a great day!
> > > Joshua
> > >
> > > [1] https://docs.kernel.org/process/coding-style.html
> > >
> > > Sent using hkml (https://github.com/sjp38/hackermail)
> >
> > Sorry that I forgot to cc some maintainers so I added them in this reply.
> > Pan

Thanks,
Pan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
  2025-08-14 17:11 [RFC 0/1] Try to add memory allocation info for cgroup oom kill Yueyang Pan
  2025-08-14 17:11 ` [RFC 1/1] Add memory allocation info for cgroup oom Yueyang Pan
@ 2025-08-21 18:35 ` Shakeel Butt
  2025-08-21 19:18   ` Yueyang Pan
  1 sibling, 1 reply; 18+ messages in thread
From: Shakeel Butt @ 2025-08-21 18:35 UTC (permalink / raw)
  To: Yueyang Pan
  Cc: Suren Baghdasaryan, Kent Overstreet, Usama Arif, linux-mm,
	linux-kernel

On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> Right now in the oom_kill_process if the oom is because of the cgroup 
> limit, we won't get memory allocation infomation. In some cases, we 
> can have a large cgroup workload running which dominates the machine. 
> The reason using cgroup is to leave some resource for system. When this 
> cgroup is killed, we would also like to have some memory allocation 
> information for the whole server as well. This is reason behind this 
> mini change. Is it an acceptable thing to do? Will it be too much 
> information for people? I am happy with any suggestions!

For a single patch, it is better to have all the context in the patch
and there is no need for cover letter.

What exact information you want on the memcg oom that will be helpful
for the users in general? You mentioned memory allocation information,
can you please elaborate a bit more.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
  2025-08-21 18:35 ` [RFC 0/1] Try to add memory allocation info for cgroup oom kill Shakeel Butt
@ 2025-08-21 19:18   ` Yueyang Pan
  2025-08-21 19:53     ` Shakeel Butt
  0 siblings, 1 reply; 18+ messages in thread
From: Yueyang Pan @ 2025-08-21 19:18 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Suren Baghdasaryan, Kent Overstreet, Usama Arif, linux-mm,
	linux-kernel

On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > Right now in the oom_kill_process if the oom is because of the cgroup 
> > limit, we won't get memory allocation infomation. In some cases, we 
> > can have a large cgroup workload running which dominates the machine. 
> > The reason using cgroup is to leave some resource for system. When this 
> > cgroup is killed, we would also like to have some memory allocation 
> > information for the whole server as well. This is reason behind this 
> > mini change. Is it an acceptable thing to do? Will it be too much 
> > information for people? I am happy with any suggestions!
> 
> For a single patch, it is better to have all the context in the patch
> and there is no need for cover letter.

Thanks for your suggestion Shakeel! I will change this in the next version.

> 
> What exact information you want on the memcg oom that will be helpful
> for the users in general? You mentioned memory allocation information,
> can you please elaborate a bit more.
> 

As in my reply to Suren, I was thinking the system-wide memory usage info 
provided by show_free_pages and memory allocation profiling info can help 
us debug cgoom by comparing them with historical data. What is your take on 
this?

Thanks,
Pan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
  2025-08-21 19:18   ` Yueyang Pan
@ 2025-08-21 19:53     ` Shakeel Butt
  2025-08-21 20:00       ` Suren Baghdasaryan
                         ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Shakeel Butt @ 2025-08-21 19:53 UTC (permalink / raw)
  To: Yueyang Pan
  Cc: Suren Baghdasaryan, Kent Overstreet, Usama Arif, linux-mm,
	linux-kernel

On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > Right now in the oom_kill_process if the oom is because of the cgroup 
> > > limit, we won't get memory allocation infomation. In some cases, we 
> > > can have a large cgroup workload running which dominates the machine. 
> > > The reason using cgroup is to leave some resource for system. When this 
> > > cgroup is killed, we would also like to have some memory allocation 
> > > information for the whole server as well. This is reason behind this 
> > > mini change. Is it an acceptable thing to do? Will it be too much 
> > > information for people? I am happy with any suggestions!
> > 
> > For a single patch, it is better to have all the context in the patch
> > and there is no need for cover letter.
> 
> Thanks for your suggestion Shakeel! I will change this in the next version.
> 
> > 
> > What exact information you want on the memcg oom that will be helpful
> > for the users in general? You mentioned memory allocation information,
> > can you please elaborate a bit more.
> > 
> 
> As in my reply to Suren, I was thinking the system-wide memory usage info 
> provided by show_free_pages and memory allocation profiling info can help 
> us debug cgoom by comparing them with historical data. What is your take on 
> this?
> 

I am not really sure about show_free_areas(). More specifically how the
historical data diff will be useful for a memcg oom. If you have a
concrete example, please give one. For memory allocation profiling, is
it possible to filter for the given memcg? Do we save memcg information
in the memory allocation profiling?


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
  2025-08-21 19:53     ` Shakeel Butt
@ 2025-08-21 20:00       ` Suren Baghdasaryan
  2025-08-21 21:26         ` Shakeel Butt
  2025-08-26 14:06       ` Yueyang Pan
  2025-08-27  2:32       ` Suren Baghdasaryan
  2 siblings, 1 reply; 18+ messages in thread
From: Suren Baghdasaryan @ 2025-08-21 20:00 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Yueyang Pan, Kent Overstreet, Usama Arif, linux-mm, linux-kernel

On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > Right now in the oom_kill_process if the oom is because of the cgroup
> > > > limit, we won't get memory allocation infomation. In some cases, we
> > > > can have a large cgroup workload running which dominates the machine.
> > > > The reason using cgroup is to leave some resource for system. When this
> > > > cgroup is killed, we would also like to have some memory allocation
> > > > information for the whole server as well. This is reason behind this
> > > > mini change. Is it an acceptable thing to do? Will it be too much
> > > > information for people? I am happy with any suggestions!
> > >
> > > For a single patch, it is better to have all the context in the patch
> > > and there is no need for cover letter.
> >
> > Thanks for your suggestion Shakeel! I will change this in the next version.
> >
> > >
> > > What exact information you want on the memcg oom that will be helpful
> > > for the users in general? You mentioned memory allocation information,
> > > can you please elaborate a bit more.
> > >
> >
> > As in my reply to Suren, I was thinking the system-wide memory usage info
> > provided by show_free_pages and memory allocation profiling info can help
> > us debug cgoom by comparing them with historical data. What is your take on
> > this?
> >
>
> I am not really sure about show_free_areas(). More specifically how the
> historical data diff will be useful for a memcg oom. If you have a
> concrete example, please give one. For memory allocation profiling, is
> it possible to filter for the given memcg? Do we save memcg information
> in the memory allocation profiling?

No, memory allocation profiling is not cgroup-aware. It tracks
allocations and their code locations but no other context.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
  2025-08-21 20:00       ` Suren Baghdasaryan
@ 2025-08-21 21:26         ` Shakeel Butt
  2025-08-26 13:52           ` Yueyang Pan
  0 siblings, 1 reply; 18+ messages in thread
From: Shakeel Butt @ 2025-08-21 21:26 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Yueyang Pan, Kent Overstreet, Usama Arif, linux-mm, linux-kernel

On Thu, Aug 21, 2025 at 01:00:36PM -0700, Suren Baghdasaryan wrote:
> On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > > Right now in the oom_kill_process if the oom is because of the cgroup
> > > > > limit, we won't get memory allocation infomation. In some cases, we
> > > > > can have a large cgroup workload running which dominates the machine.
> > > > > The reason using cgroup is to leave some resource for system. When this
> > > > > cgroup is killed, we would also like to have some memory allocation
> > > > > information for the whole server as well. This is reason behind this
> > > > > mini change. Is it an acceptable thing to do? Will it be too much
> > > > > information for people? I am happy with any suggestions!
> > > >
> > > > For a single patch, it is better to have all the context in the patch
> > > > and there is no need for cover letter.
> > >
> > > Thanks for your suggestion Shakeel! I will change this in the next version.
> > >
> > > >
> > > > What exact information you want on the memcg oom that will be helpful
> > > > for the users in general? You mentioned memory allocation information,
> > > > can you please elaborate a bit more.
> > > >
> > >
> > > As in my reply to Suren, I was thinking the system-wide memory usage info
> > > provided by show_free_pages and memory allocation profiling info can help
> > > us debug cgoom by comparing them with historical data. What is your take on
> > > this?
> > >
> >
> > I am not really sure about show_free_areas(). More specifically how the
> > historical data diff will be useful for a memcg oom. If you have a
> > concrete example, please give one. For memory allocation profiling, is
> > it possible to filter for the given memcg? Do we save memcg information
> > in the memory allocation profiling?
> 
> No, memory allocation profiling is not cgroup-aware. It tracks
> allocations and their code locations but no other context.

Thanks for the info. Pan, will having memcg info along with allocation
profile help your use-case? (Though adding that might not be easy or
cheaper)


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
  2025-08-21 21:26         ` Shakeel Butt
@ 2025-08-26 13:52           ` Yueyang Pan
  0 siblings, 0 replies; 18+ messages in thread
From: Yueyang Pan @ 2025-08-26 13:52 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Suren Baghdasaryan, Kent Overstreet, Usama Arif, linux-mm,
	linux-kernel

On Thu, Aug 21, 2025 at 02:26:42PM -0700, Shakeel Butt wrote:
> On Thu, Aug 21, 2025 at 01:00:36PM -0700, Suren Baghdasaryan wrote:
> > On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > > > Right now in the oom_kill_process if the oom is because of the cgroup
> > > > > > limit, we won't get memory allocation infomation. In some cases, we
> > > > > > can have a large cgroup workload running which dominates the machine.
> > > > > > The reason using cgroup is to leave some resource for system. When this
> > > > > > cgroup is killed, we would also like to have some memory allocation
> > > > > > information for the whole server as well. This is reason behind this
> > > > > > mini change. Is it an acceptable thing to do? Will it be too much
> > > > > > information for people? I am happy with any suggestions!
> > > > >
> > > > > For a single patch, it is better to have all the context in the patch
> > > > > and there is no need for cover letter.
> > > >
> > > > Thanks for your suggestion Shakeel! I will change this in the next version.
> > > >
> > > > >
> > > > > What exact information you want on the memcg oom that will be helpful
> > > > > for the users in general? You mentioned memory allocation information,
> > > > > can you please elaborate a bit more.
> > > > >
> > > >
> > > > As in my reply to Suren, I was thinking the system-wide memory usage info
> > > > provided by show_free_pages and memory allocation profiling info can help
> > > > us debug cgoom by comparing them with historical data. What is your take on
> > > > this?
> > > >
> > >
> > > I am not really sure about show_free_areas(). More specifically how the
> > > historical data diff will be useful for a memcg oom. If you have a
> > > concrete example, please give one. For memory allocation profiling, is
> > > it possible to filter for the given memcg? Do we save memcg information
> > > in the memory allocation profiling?
> > 
> > No, memory allocation profiling is not cgroup-aware. It tracks
> > allocations and their code locations but no other context.
> 
> Thanks for the info. Pan, will having memcg info along with allocation
> profile help your use-case? (Though adding that might not be easy or
> cheaper)

Yeah I have been thinking about it with eBPF hooks but it is going to be a long 
term effort as we need to measure the overhead. Now the way memory profiling is 
implemented incur almost "zero" overhead.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
  2025-08-21 19:53     ` Shakeel Butt
  2025-08-21 20:00       ` Suren Baghdasaryan
@ 2025-08-26 14:06       ` Yueyang Pan
  2025-08-27  2:38         ` Suren Baghdasaryan
  2025-08-27  2:32       ` Suren Baghdasaryan
  2 siblings, 1 reply; 18+ messages in thread
From: Yueyang Pan @ 2025-08-26 14:06 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Suren Baghdasaryan, Kent Overstreet, Usama Arif, linux-mm,
	linux-kernel

On Thu, Aug 21, 2025 at 12:53:03PM -0700, Shakeel Butt wrote:
> On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > Right now in the oom_kill_process if the oom is because of the cgroup 
> > > > limit, we won't get memory allocation infomation. In some cases, we 
> > > > can have a large cgroup workload running which dominates the machine. 
> > > > The reason using cgroup is to leave some resource for system. When this 
> > > > cgroup is killed, we would also like to have some memory allocation 
> > > > information for the whole server as well. This is reason behind this 
> > > > mini change. Is it an acceptable thing to do? Will it be too much 
> > > > information for people? I am happy with any suggestions!
> > > 
> > > For a single patch, it is better to have all the context in the patch
> > > and there is no need for cover letter.
> > 
> > Thanks for your suggestion Shakeel! I will change this in the next version.
> > 
> > > 
> > > What exact information you want on the memcg oom that will be helpful
> > > for the users in general? You mentioned memory allocation information,
> > > can you please elaborate a bit more.
> > > 
> > 
> > As in my reply to Suren, I was thinking the system-wide memory usage info 
> > provided by show_free_pages and memory allocation profiling info can help 
> > us debug cgoom by comparing them with historical data. What is your take on 
> > this?
> > 
> 
> I am not really sure about show_free_areas(). More specifically how the
> historical data diff will be useful for a memcg oom. If you have a
> concrete example, please give one. For memory allocation profiling, is

Sorry for my late reply. I have been trying hard to think about a use case. 
One specific case I can think about is when there is no workload stacking, 
when one job is running solely on the machine. For example, memory allocation 
profiling can tell the memory usage of the network driver, which can make 
cg allocates memory harder and eventually leads to cgoom. Without this 
information, it would be hard to reason about what is happening in the kernel 
given increased oom number.

show_free_areas() will give a summary of different types of memory which 
can possibably lead to increased cgoom in my previous case. Then one looks 
deeper via the memory allocation profiling as an entrypoint to debug.

Does this make sense to you?

> it possible to filter for the given memcg? Do we save memcg information
> in the memory allocation profiling?

Thanks
Pan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
  2025-08-26 14:06       ` Yueyang Pan
@ 2025-08-27  2:38         ` Suren Baghdasaryan
  2025-08-29  6:35           ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Suren Baghdasaryan @ 2025-08-27  2:38 UTC (permalink / raw)
  To: Yueyang Pan
  Cc: Shakeel Butt, Kent Overstreet, Usama Arif, linux-mm, linux-kernel,
	Sourav Panda, Pasha Tatashin, Johannes Weiner

On Tue, Aug 26, 2025 at 7:06 AM Yueyang Pan <pyyjason@gmail.com> wrote:
>
> On Thu, Aug 21, 2025 at 12:53:03PM -0700, Shakeel Butt wrote:
> > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > > Right now in the oom_kill_process if the oom is because of the cgroup
> > > > > limit, we won't get memory allocation infomation. In some cases, we
> > > > > can have a large cgroup workload running which dominates the machine.
> > > > > The reason using cgroup is to leave some resource for system. When this
> > > > > cgroup is killed, we would also like to have some memory allocation
> > > > > information for the whole server as well. This is reason behind this
> > > > > mini change. Is it an acceptable thing to do? Will it be too much
> > > > > information for people? I am happy with any suggestions!
> > > >
> > > > For a single patch, it is better to have all the context in the patch
> > > > and there is no need for cover letter.
> > >
> > > Thanks for your suggestion Shakeel! I will change this in the next version.
> > >
> > > >
> > > > What exact information you want on the memcg oom that will be helpful
> > > > for the users in general? You mentioned memory allocation information,
> > > > can you please elaborate a bit more.
> > > >
> > >
> > > As in my reply to Suren, I was thinking the system-wide memory usage info
> > > provided by show_free_pages and memory allocation profiling info can help
> > > us debug cgoom by comparing them with historical data. What is your take on
> > > this?
> > >
> >
> > I am not really sure about show_free_areas(). More specifically how the
> > historical data diff will be useful for a memcg oom. If you have a
> > concrete example, please give one. For memory allocation profiling, is
>
> Sorry for my late reply. I have been trying hard to think about a use case.
> One specific case I can think about is when there is no workload stacking,
> when one job is running solely on the machine. For example, memory allocation
> profiling can tell the memory usage of the network driver, which can make
> cg allocates memory harder and eventually leads to cgoom. Without this
> information, it would be hard to reason about what is happening in the kernel
> given increased oom number.
>
> show_free_areas() will give a summary of different types of memory which
> can possibably lead to increased cgoom in my previous case. Then one looks
> deeper via the memory allocation profiling as an entrypoint to debug.
>
> Does this make sense to you?

I think if we had per-memcg memory profiling that would make sense.
Counters would reflect only allocations made by the processes from
that memcg and you could easily identify the allocation that caused
memcg to oom. But dumping system-wide profiling information at
memcg-oom time I think would not help you with this task. It will be
polluted with allocations from other memcgs, so likely won't help much
(unless there is some obvious leak or you know that a specific
allocation is done only by a process from your memcg and no other
process).

>
> > it possible to filter for the given memcg? Do we save memcg information
> > in the memory allocation profiling?
>
> Thanks
> Pan


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
  2025-08-27  2:38         ` Suren Baghdasaryan
@ 2025-08-29  6:35           ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2025-08-29  6:35 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Yueyang Pan, Shakeel Butt, Kent Overstreet, Usama Arif, linux-mm,
	linux-kernel, Sourav Panda, Pasha Tatashin, Johannes Weiner

On Tue 26-08-25 19:38:03, Suren Baghdasaryan wrote:
> On Tue, Aug 26, 2025 at 7:06 AM Yueyang Pan <pyyjason@gmail.com> wrote:
> >
> > On Thu, Aug 21, 2025 at 12:53:03PM -0700, Shakeel Butt wrote:
> > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > > > Right now in the oom_kill_process if the oom is because of the cgroup
> > > > > > limit, we won't get memory allocation infomation. In some cases, we
> > > > > > can have a large cgroup workload running which dominates the machine.
> > > > > > The reason using cgroup is to leave some resource for system. When this
> > > > > > cgroup is killed, we would also like to have some memory allocation
> > > > > > information for the whole server as well. This is reason behind this
> > > > > > mini change. Is it an acceptable thing to do? Will it be too much
> > > > > > information for people? I am happy with any suggestions!
> > > > >
> > > > > For a single patch, it is better to have all the context in the patch
> > > > > and there is no need for cover letter.
> > > >
> > > > Thanks for your suggestion Shakeel! I will change this in the next version.
> > > >
> > > > >
> > > > > What exact information you want on the memcg oom that will be helpful
> > > > > for the users in general? You mentioned memory allocation information,
> > > > > can you please elaborate a bit more.
> > > > >
> > > >
> > > > As in my reply to Suren, I was thinking the system-wide memory usage info
> > > > provided by show_free_pages and memory allocation profiling info can help
> > > > us debug cgoom by comparing them with historical data. What is your take on
> > > > this?
> > > >
> > >
> > > I am not really sure about show_free_areas(). More specifically how the
> > > historical data diff will be useful for a memcg oom. If you have a
> > > concrete example, please give one. For memory allocation profiling, is
> >
> > Sorry for my late reply. I have been trying hard to think about a use case.
> > One specific case I can think about is when there is no workload stacking,
> > when one job is running solely on the machine. For example, memory allocation
> > profiling can tell the memory usage of the network driver, which can make
> > cg allocates memory harder and eventually leads to cgoom. Without this
> > information, it would be hard to reason about what is happening in the kernel
> > given increased oom number.
> >
> > show_free_areas() will give a summary of different types of memory which
> > can possibably lead to increased cgoom in my previous case. Then one looks
> > deeper via the memory allocation profiling as an entrypoint to debug.
> >
> > Does this make sense to you?
> 
> I think if we had per-memcg memory profiling that would make sense.
> Counters would reflect only allocations made by the processes from
> that memcg and you could easily identify the allocation that caused
> memcg to oom. But dumping system-wide profiling information at
> memcg-oom time I think would not help you with this task. It will be
> polluted with allocations from other memcgs, so likely won't help much
> (unless there is some obvious leak or you know that a specific
> allocation is done only by a process from your memcg and no other
> process).

I agree with Suren. It makes very little sense and in many cases it
could be actively misleading to print global memory state on memcg OOMs.
Not to mention that those events, unlike global OOMs, could happen much
more often.
If you are interested in a more information on memcg oom occurance you
can detext OOM events and print whatever information you need.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
  2025-08-21 19:53     ` Shakeel Butt
  2025-08-21 20:00       ` Suren Baghdasaryan
  2025-08-26 14:06       ` Yueyang Pan
@ 2025-08-27  2:32       ` Suren Baghdasaryan
  2025-08-27  4:47         ` Usama Arif
  2025-08-27 21:15         ` Shakeel Butt
  2 siblings, 2 replies; 18+ messages in thread
From: Suren Baghdasaryan @ 2025-08-27  2:32 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Yueyang Pan, Kent Overstreet, Usama Arif, linux-mm, linux-kernel

On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > Right now in the oom_kill_process if the oom is because of the cgroup
> > > > limit, we won't get memory allocation infomation. In some cases, we
> > > > can have a large cgroup workload running which dominates the machine.
> > > > The reason using cgroup is to leave some resource for system. When this
> > > > cgroup is killed, we would also like to have some memory allocation
> > > > information for the whole server as well. This is reason behind this
> > > > mini change. Is it an acceptable thing to do? Will it be too much
> > > > information for people? I am happy with any suggestions!
> > >
> > > For a single patch, it is better to have all the context in the patch
> > > and there is no need for cover letter.
> >
> > Thanks for your suggestion Shakeel! I will change this in the next version.
> >
> > >
> > > What exact information you want on the memcg oom that will be helpful
> > > for the users in general? You mentioned memory allocation information,
> > > can you please elaborate a bit more.
> > >
> >
> > As in my reply to Suren, I was thinking the system-wide memory usage info
> > provided by show_free_pages and memory allocation profiling info can help
> > us debug cgoom by comparing them with historical data. What is your take on
> > this?
> >
>
> I am not really sure about show_free_areas(). More specifically how the
> historical data diff will be useful for a memcg oom. If you have a
> concrete example, please give one. For memory allocation profiling, is
> it possible to filter for the given memcg? Do we save memcg information
> in the memory allocation profiling?

Actually I was thinking about making memory profiling memcg-aware but
it would be quite costly both from memory and performance points of
view. Currently we have a per-cpu counter for each allocation in the
kernel codebase. To make it work for each memcg we would have to add
memcg dimension to the counters, so each counter becomes per-cpu plus
per-memcg. I'll be thinking about possible optimizations since many of
these counters will stay at 0 but any such optimization would come at
a performance cost, which we tried to keep at the absolute minimum.

I'm CC'ing Sourav and Pasha since they were also interested in making
memory allocation profiling memcg-aware. Would Meta folks (Usama,
Shakeel, Johannes) be interested in such enhancement as well? Would it
be preferable to have such accounting for a specific memcg which we
pre-select (less memory and performance overhead) or we need that for
all memcgs as a generic feature? We have some options here but I want
to understand what would be sufficient and add as little overhead as
possible.
Thanks,
Suren.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
  2025-08-27  2:32       ` Suren Baghdasaryan
@ 2025-08-27  4:47         ` Usama Arif
  2025-08-27 21:15         ` Shakeel Butt
  1 sibling, 0 replies; 18+ messages in thread
From: Usama Arif @ 2025-08-27  4:47 UTC (permalink / raw)
  To: Suren Baghdasaryan, Shakeel Butt
  Cc: Yueyang Pan, Kent Overstreet, linux-mm, linux-kernel, hannes



On 27/08/2025 03:32, Suren Baghdasaryan wrote:
> On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>
>> On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
>>> On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
>>>> On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
>>>>> Right now in the oom_kill_process if the oom is because of the cgroup
>>>>> limit, we won't get memory allocation infomation. In some cases, we
>>>>> can have a large cgroup workload running which dominates the machine.
>>>>> The reason using cgroup is to leave some resource for system. When this
>>>>> cgroup is killed, we would also like to have some memory allocation
>>>>> information for the whole server as well. This is reason behind this
>>>>> mini change. Is it an acceptable thing to do? Will it be too much
>>>>> information for people? I am happy with any suggestions!
>>>>
>>>> For a single patch, it is better to have all the context in the patch
>>>> and there is no need for cover letter.
>>>
>>> Thanks for your suggestion Shakeel! I will change this in the next version.
>>>
>>>>
>>>> What exact information you want on the memcg oom that will be helpful
>>>> for the users in general? You mentioned memory allocation information,
>>>> can you please elaborate a bit more.
>>>>
>>>
>>> As in my reply to Suren, I was thinking the system-wide memory usage info
>>> provided by show_free_pages and memory allocation profiling info can help
>>> us debug cgoom by comparing them with historical data. What is your take on
>>> this?
>>>
>>
>> I am not really sure about show_free_areas(). More specifically how the
>> historical data diff will be useful for a memcg oom. If you have a
>> concrete example, please give one. For memory allocation profiling, is
>> it possible to filter for the given memcg? Do we save memcg information
>> in the memory allocation profiling?
> 
> Actually I was thinking about making memory profiling memcg-aware but
> it would be quite costly both from memory and performance points of
> view. Currently we have a per-cpu counter for each allocation in the
> kernel codebase. To make it work for each memcg we would have to add
> memcg dimension to the counters, so each counter becomes per-cpu plus
> per-memcg. I'll be thinking about possible optimizations since many of
> these counters will stay at 0 but any such optimization would come at
> a performance cost, which we tried to keep at the absolute minimum.
> 
> I'm CC'ing Sourav and Pasha since they were also interested in making
> memory allocation profiling memcg-aware. Would Meta folks (Usama,
> Shakeel, Johannes) be interested in such enhancement as well? Would it
> be preferable to have such accounting for a specific memcg which we
> pre-select (less memory and performance overhead) or we need that for
> all memcgs as a generic feature? We have some options here but I want
> to understand what would be sufficient and add as little overhead as
> possible.

Yes, having per memcg counters is going to be extremely useful (we were
thinking of having this as a future project to work on). For meta fleet
in particular, we might have almost 100 memcgs running, but the number
of memcgs running workloads is particularly small (usually less than 10).
In the rest, you might have services that are responsible for telemetry,
monitoring, security, etc (for which we arent really interested in
the memory allocation profile). So yes, it would be ideal to have the
profile for just pre-select memcgs, especially if it leads to lower memory
and performance overhead.

Having memory allocation profile at memcg level is especially needed when
we have multiple workloads stacked on the same host. Having it at host
level in such a case makes the data less useful when we have OOMs and
for workload analysis as you dont know which workload is contributing
how much.

> Thanks,
> Suren.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
  2025-08-27  2:32       ` Suren Baghdasaryan
  2025-08-27  4:47         ` Usama Arif
@ 2025-08-27 21:15         ` Shakeel Butt
  1 sibling, 0 replies; 18+ messages in thread
From: Shakeel Butt @ 2025-08-27 21:15 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Yueyang Pan, Kent Overstreet, Usama Arif, linux-mm, linux-kernel

On Tue, Aug 26, 2025 at 07:32:17PM -0700, Suren Baghdasaryan wrote:
> On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > > Right now in the oom_kill_process if the oom is because of the cgroup
> > > > > limit, we won't get memory allocation infomation. In some cases, we
> > > > > can have a large cgroup workload running which dominates the machine.
> > > > > The reason using cgroup is to leave some resource for system. When this
> > > > > cgroup is killed, we would also like to have some memory allocation
> > > > > information for the whole server as well. This is reason behind this
> > > > > mini change. Is it an acceptable thing to do? Will it be too much
> > > > > information for people? I am happy with any suggestions!
> > > >
> > > > For a single patch, it is better to have all the context in the patch
> > > > and there is no need for cover letter.
> > >
> > > Thanks for your suggestion Shakeel! I will change this in the next version.
> > >
> > > >
> > > > What exact information you want on the memcg oom that will be helpful
> > > > for the users in general? You mentioned memory allocation information,
> > > > can you please elaborate a bit more.
> > > >
> > >
> > > As in my reply to Suren, I was thinking the system-wide memory usage info
> > > provided by show_free_pages and memory allocation profiling info can help
> > > us debug cgoom by comparing them with historical data. What is your take on
> > > this?
> > >
> >
> > I am not really sure about show_free_areas(). More specifically how the
> > historical data diff will be useful for a memcg oom. If you have a
> > concrete example, please give one. For memory allocation profiling, is
> > it possible to filter for the given memcg? Do we save memcg information
> > in the memory allocation profiling?
> 
> Actually I was thinking about making memory profiling memcg-aware but
> it would be quite costly both from memory and performance points of
> view. Currently we have a per-cpu counter for each allocation in the
> kernel codebase. To make it work for each memcg we would have to add
> memcg dimension to the counters, so each counter becomes per-cpu plus
> per-memcg. I'll be thinking about possible optimizations since many of
> these counters will stay at 0 but any such optimization would come at
> a performance cost, which we tried to keep at the absolute minimum.
> 
> I'm CC'ing Sourav and Pasha since they were also interested in making
> memory allocation profiling memcg-aware. Would Meta folks (Usama,
> Shakeel, Johannes) be interested in such enhancement as well? Would it
> be preferable to have such accounting for a specific memcg which we
> pre-select (less memory and performance overhead) or we need that for
> all memcgs as a generic feature? We have some options here but I want
> to understand what would be sufficient and add as little overhead as
> possible.

Thanks Suren, yes, as already mentioned by Usama, Meta will be
interested in memcg aware allocation profiling. I would say start simple
and as little overhead as possible. More functionality can be added
later when the need arises. Maybe the first useful addition is just
adding how many allocations for a specific allocation site are memcg
charged.



^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-08-29  6:35 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-14 17:11 [RFC 0/1] Try to add memory allocation info for cgroup oom kill Yueyang Pan
2025-08-14 17:11 ` [RFC 1/1] Add memory allocation info for cgroup oom Yueyang Pan
2025-08-14 20:11   ` Joshua Hahn
2025-08-18 14:24     ` Yueyang Pan
2025-08-21  1:25       ` Suren Baghdasaryan
2025-08-21 19:09         ` Yueyang Pan
2025-08-21 18:35 ` [RFC 0/1] Try to add memory allocation info for cgroup oom kill Shakeel Butt
2025-08-21 19:18   ` Yueyang Pan
2025-08-21 19:53     ` Shakeel Butt
2025-08-21 20:00       ` Suren Baghdasaryan
2025-08-21 21:26         ` Shakeel Butt
2025-08-26 13:52           ` Yueyang Pan
2025-08-26 14:06       ` Yueyang Pan
2025-08-27  2:38         ` Suren Baghdasaryan
2025-08-29  6:35           ` Michal Hocko
2025-08-27  2:32       ` Suren Baghdasaryan
2025-08-27  4:47         ` Usama Arif
2025-08-27 21:15         ` Shakeel Butt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).