* [RFC 0/1] Try to add memory allocation info for cgroup oom kill
From: Yueyang Pan @ 2025-08-14 17:11 UTC
To: Suren Baghdasaryan, Kent Overstreet, Usama Arif
Cc: linux-mm, linux-kernel

Right now in oom_kill_process(), if the OOM is because of the cgroup
limit, we won't get memory allocation information. In some cases, we
can have a large cgroup workload running which dominates the machine.
The reason for using a cgroup is to leave some resources for the
system. When this cgroup is killed, we would also like to have some
memory allocation information for the whole server as well. This is
the reason behind this mini change. Is it an acceptable thing to do?
Will it be too much information for people? I am happy with any
suggestions!

Yueyang Pan (1):
  Add memory allocation info for cgroup oom

 mm/oom_kill.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--
2.47.3
* [RFC 1/1] Add memory allocation info for cgroup oom
From: Yueyang Pan @ 2025-08-14 17:11 UTC
To: Suren Baghdasaryan, Kent Overstreet, Usama Arif
Cc: linux-mm, linux-kernel

Enable show_mem() for the cgroup OOM case. We will have memory
allocation information for the whole machine in such cases.

Signed-off-by: Yueyang Pan <pyyjason@gmail.com>
---
 mm/oom_kill.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 17650f0b516e..3ca224028396 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -465,8 +465,10 @@ static void dump_header(struct oom_control *oc)
 		pr_warn("COMPACTION is disabled!!!\n");
 
 	dump_stack();
-	if (is_memcg_oom(oc))
+	if (is_memcg_oom(oc)) {
 		mem_cgroup_print_oom_meminfo(oc->memcg);
+		show_mem();
+	}
 	else {
 		__show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask, gfp_zone(oc->gfp_mask));
 		if (should_dump_unreclaim_slab())
--
2.47.3
* Re: [RFC 1/1] Add memory allocation info for cgroup oom
From: Joshua Hahn @ 2025-08-14 20:11 UTC
To: Yueyang Pan
Cc: Suren Baghdasaryan, Kent Overstreet, Usama Arif, linux-mm, linux-kernel, kernel-team

On Thu, 14 Aug 2025 10:11:57 -0700 Yueyang Pan <pyyjason@gmail.com> wrote:

> Enable show_mem() for the cgroup OOM case. We will have memory
> allocation information for the whole machine in such cases.

Hi Pan,

Thank you for your patch! This makes sense to me. As for your concerns
from the cover letter on whether this is too much information:
personally I don't think so, but perhaps other developers will have
different opinions?

I just have a few comments / nits.

> Signed-off-by: Yueyang Pan <pyyjason@gmail.com>
> ---
>  mm/oom_kill.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 17650f0b516e..3ca224028396 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -465,8 +465,10 @@ static void dump_header(struct oom_control *oc)
>  		pr_warn("COMPACTION is disabled!!!\n");
>
>  	dump_stack();
> -	if (is_memcg_oom(oc))
> +	if (is_memcg_oom(oc)) {
>  		mem_cgroup_print_oom_meminfo(oc->memcg);
> +		show_mem();

Below, there is a direct call to __show_mem(), which limits node and
zone filtering. I am wondering whether it would make sense to also call
__show_mem() with the same arguments here. show_mem() is just a wrapper
around __show_mem() with default parameters (i.e. not filtering out
nodes, not filtering out zones).

If you think this makes sense, we can even take it out of the if-else
statement and call it unconditionally. But this is just my opinion,
please feel free to keep the unfiltered call if you believe that fits
better here.

> +	}

NIT: Should this closing brace be on the same line as the following
else statement, as per the kernel style guide [1]?

> 	else {
> 		__show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask, gfp_zone(oc->gfp_mask));
> 		if (should_dump_unreclaim_slab())
> --
> 2.47.3

Thanks again Pan, I hope you have a great day!
Joshua

[1] https://docs.kernel.org/process/coding-style.html

Sent using hkml (https://github.com/sjp38/hackermail)
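For reference, the wrapper relationship Joshua describes looks roughly
like this in recent kernels (a simplified sketch; the exact declaration
site and constants can differ across kernel versions):

	/* show_mem() is __show_mem() with no filter flags, no nodemask
	 * restriction and all zone indexes included. */
	extern void __show_mem(unsigned int flags, nodemask_t *nodemask,
			       int max_zone_idx);

	static inline void show_mem(void)
	{
		__show_mem(0, NULL, MAX_NR_ZONES - 1);
	}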
* Re: [RFC 1/1] Add memory allocation info for cgroup oom
From: Yueyang Pan @ 2025-08-18 14:24 UTC
To: Joshua Hahn
Cc: Suren Baghdasaryan, Kent Overstreet, Usama Arif, Michal Hocko, David Rientjes, Shakeel Butt, Andrew Morton, linux-mm, linux-kernel, kernel-team

On Thu, Aug 14, 2025 at 01:11:08PM -0700, Joshua Hahn wrote:
> On Thu, 14 Aug 2025 10:11:57 -0700 Yueyang Pan <pyyjason@gmail.com> wrote:
>
> > Enable show_mem() for the cgroup OOM case. We will have memory
> > allocation information for the whole machine in such cases.
>
> Hi Pan,
>
> Thank you for your patch! This makes sense to me. As for your concerns
> from the cover letter on whether this is too much information:
> personally I don't think so, but perhaps other developers will have
> different opinions?
>
> I just have a few comments / nits.

Thanks for your comment, Joshua.

> > Signed-off-by: Yueyang Pan <pyyjason@gmail.com>
> > ---
> >  mm/oom_kill.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 17650f0b516e..3ca224028396 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -465,8 +465,10 @@ static void dump_header(struct oom_control *oc)
> >  		pr_warn("COMPACTION is disabled!!!\n");
> >
> >  	dump_stack();
> > -	if (is_memcg_oom(oc))
> > +	if (is_memcg_oom(oc)) {
> >  		mem_cgroup_print_oom_meminfo(oc->memcg);
> > +		show_mem();
>
> Below, there is a direct call to __show_mem(), which limits node and
> zone filtering. I am wondering whether it would make sense to also call
> __show_mem() with the same arguments here. show_mem() is just a wrapper
> around __show_mem() with default parameters (i.e. not filtering out
> nodes, not filtering out zones).

The reason I call show_mem() here directly is that a cgroup is not
bound to a specific zone or node (correct me if I am wrong). Thus I
simply invoke show_mem() to show system-wide memory info.

> If you think this makes sense, we can even take it out of the if-else
> statement and call it unconditionally. But this is just my opinion,
> please feel free to keep the unfiltered call if you believe that fits
> better here.
>
> > +	}
>
> NIT: Should this closing brace be on the same line as the following
> else statement, as per the kernel style guide [1]?

Sorry about this. I will definitely run checkpatch on the formal patch.

> > 	else {
> > 		__show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask, gfp_zone(oc->gfp_mask));
> > 		if (should_dump_unreclaim_slab())
> > --
> > 2.47.3
>
> Thanks again Pan, I hope you have a great day!
> Joshua
>
> [1] https://docs.kernel.org/process/coding-style.html
>
> Sent using hkml (https://github.com/sjp38/hackermail)

Sorry, I forgot to cc some maintainers, so I have added them in this
reply.
Pan
* Re: [RFC 1/1] Add memory allocation info for cgroup oom
From: Suren Baghdasaryan @ 2025-08-21 1:25 UTC
To: Yueyang Pan
Cc: Joshua Hahn, Kent Overstreet, Usama Arif, Michal Hocko, David Rientjes, Shakeel Butt, Andrew Morton, linux-mm, linux-kernel, kernel-team

On Mon, Aug 18, 2025 at 7:24 AM Yueyang Pan <pyyjason@gmail.com> wrote:
>
> On Thu, Aug 14, 2025 at 01:11:08PM -0700, Joshua Hahn wrote:
> > On Thu, 14 Aug 2025 10:11:57 -0700 Yueyang Pan <pyyjason@gmail.com> wrote:
> >
> > > Enable show_mem() for the cgroup OOM case. We will have memory
> > > allocation information for the whole machine in such cases.

Memory allocations are only a part of show_mem(), so I would not call
this change memory allocation profiling specific. The title and the
changelog should be corrected to reflect exactly what is being done
here - logging system memory state in addition to cgroup memory state
during a cgroup oom-kill.

As for whether it makes sense to report system memory during a cgroup
oom-kill... I'm not too sure. Maybe people who use memcgs more
extensively than what I've seen (in Android) can chime in?

> > Hi Pan,
> >
> > Thank you for your patch! This makes sense to me. As for your
> > concerns from the cover letter on whether this is too much
> > information: personally I don't think so, but perhaps other
> > developers will have different opinions?
> >
> > I just have a few comments / nits.
>
> Thanks for your comment, Joshua.
>
> > > Signed-off-by: Yueyang Pan <pyyjason@gmail.com>
> > > ---
> > >  mm/oom_kill.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 17650f0b516e..3ca224028396 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -465,8 +465,10 @@ static void dump_header(struct oom_control *oc)
> > >  		pr_warn("COMPACTION is disabled!!!\n");
> > >
> > >  	dump_stack();
> > > -	if (is_memcg_oom(oc))
> > > +	if (is_memcg_oom(oc)) {
> > >  		mem_cgroup_print_oom_meminfo(oc->memcg);
> > > +		show_mem();
> >
> > Below, there is a direct call to __show_mem(), which limits node and
> > zone filtering. I am wondering whether it would make sense to also
> > call __show_mem() with the same arguments here. show_mem() is just a
> > wrapper around __show_mem() with default parameters (i.e. not
> > filtering out nodes, not filtering out zones).
>
> The reason I call show_mem() here directly is that a cgroup is not
> bound to a specific zone or node (correct me if I am wrong). Thus I
> simply invoke show_mem() to show system-wide memory info.
>
> > If you think this makes sense, we can even take it out of the
> > if-else statement and call it unconditionally. But this is just my
> > opinion, please feel free to keep the unfiltered call if you believe
> > that fits better here.
> >
> > > +	}
> >
> > NIT: Should this closing brace be on the same line as the following
> > else statement, as per the kernel style guide [1]?
>
> Sorry about this. I will definitely run checkpatch on the formal patch.
>
> > > 	else {
> > > 		__show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask, gfp_zone(oc->gfp_mask));
> > > 		if (should_dump_unreclaim_slab())
> > > --
> > > 2.47.3
> >
> > Thanks again Pan, I hope you have a great day!
> > Joshua
> >
> > [1] https://docs.kernel.org/process/coding-style.html
> >
> > Sent using hkml (https://github.com/sjp38/hackermail)
>
> Sorry, I forgot to cc some maintainers, so I have added them in this
> reply.
> Pan
* Re: [RFC 1/1] Add memory allocation info for cgroup oom
From: Yueyang Pan @ 2025-08-21 19:09 UTC
To: Suren Baghdasaryan
Cc: Joshua Hahn, Kent Overstreet, Usama Arif, Michal Hocko, David Rientjes, Shakeel Butt, Andrew Morton, linux-mm, linux-kernel, kernel-team

On Wed, Aug 20, 2025 at 06:25:56PM -0700, Suren Baghdasaryan wrote:
> On Mon, Aug 18, 2025 at 7:24 AM Yueyang Pan <pyyjason@gmail.com> wrote:
> >
> > On Thu, Aug 14, 2025 at 01:11:08PM -0700, Joshua Hahn wrote:
> > > On Thu, 14 Aug 2025 10:11:57 -0700 Yueyang Pan <pyyjason@gmail.com> wrote:
> > >
> > > > Enable show_mem() for the cgroup OOM case. We will have memory
> > > > allocation information for the whole machine in such cases.
>
> Memory allocations are only a part of show_mem(), so I would not call
> this change memory allocation profiling specific. The title and the
> changelog should be corrected to reflect exactly what is being done
> here - logging system memory state in addition to cgroup memory state
> during a cgroup oom-kill.

Thanks for your feedback Suren! I will change the title to be precise
in the next version.

> As for whether it makes sense to report system memory during a cgroup
> oom-kill... I'm not too sure. Maybe people who use memcgs more
> extensively than what I've seen (in Android) can chime in?

In my opinion, the show_free_areas() output and the memory allocation
profiling data can provide an entry point to understand what happens
with a cgroup OOM. We can also compare them with historical data to see
if some memory usage has a spike. Feel free to criticize me if I am not
making sense.

> > > Hi Pan,
> > >
> > > Thank you for your patch! This makes sense to me. As for your
> > > concerns from the cover letter on whether this is too much
> > > information: personally I don't think so, but perhaps other
> > > developers will have different opinions?
> > >
> > > I just have a few comments / nits.
> >
> > Thanks for your comment, Joshua.
> >
> > > > Signed-off-by: Yueyang Pan <pyyjason@gmail.com>
> > > > ---
> > > >  mm/oom_kill.c | 4 +++-
> > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > index 17650f0b516e..3ca224028396 100644
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > > @@ -465,8 +465,10 @@ static void dump_header(struct oom_control *oc)
> > > >  		pr_warn("COMPACTION is disabled!!!\n");
> > > >
> > > >  	dump_stack();
> > > > -	if (is_memcg_oom(oc))
> > > > +	if (is_memcg_oom(oc)) {
> > > >  		mem_cgroup_print_oom_meminfo(oc->memcg);
> > > > +		show_mem();
> > >
> > > Below, there is a direct call to __show_mem(), which limits node
> > > and zone filtering. I am wondering whether it would make sense to
> > > also call __show_mem() with the same arguments here. show_mem() is
> > > just a wrapper around __show_mem() with default parameters (i.e.
> > > not filtering out nodes, not filtering out zones).
> >
> > The reason I call show_mem() here directly is that a cgroup is not
> > bound to a specific zone or node (correct me if I am wrong). Thus I
> > simply invoke show_mem() to show system-wide memory info.
> >
> > > If you think this makes sense, we can even take it out of the
> > > if-else statement and call it unconditionally. But this is just my
> > > opinion, please feel free to keep the unfiltered call if you
> > > believe that fits better here.
> > >
> > > > +	}
> > >
> > > NIT: Should this closing brace be on the same line as the following
> > > else statement, as per the kernel style guide [1]?
> >
> > Sorry about this. I will definitely run checkpatch on the formal
> > patch.
> >
> > > > 	else {
> > > > 		__show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask, gfp_zone(oc->gfp_mask));
> > > > 		if (should_dump_unreclaim_slab())
> > > > --
> > > > 2.47.3
> > >
> > > Thanks again Pan, I hope you have a great day!
> > > Joshua
> > >
> > > [1] https://docs.kernel.org/process/coding-style.html
> > >
> > > Sent using hkml (https://github.com/sjp38/hackermail)
> >
> > Sorry, I forgot to cc some maintainers, so I have added them in this
> > reply.
> > Pan

Thanks,
Pan
* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
From: Shakeel Butt @ 2025-08-21 18:35 UTC
To: Yueyang Pan
Cc: Suren Baghdasaryan, Kent Overstreet, Usama Arif, linux-mm, linux-kernel

On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> Right now in oom_kill_process(), if the OOM is because of the cgroup
> limit, we won't get memory allocation information. In some cases, we
> can have a large cgroup workload running which dominates the machine.
> The reason for using a cgroup is to leave some resources for the
> system. When this cgroup is killed, we would also like to have some
> memory allocation information for the whole server as well. This is
> the reason behind this mini change. Is it an acceptable thing to do?
> Will it be too much information for people? I am happy with any
> suggestions!

For a single patch, it is better to have all the context in the patch
and there is no need for a cover letter.

What exact information do you want on the memcg oom that will be
helpful for the users in general? You mentioned memory allocation
information, can you please elaborate a bit more?
* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
From: Yueyang Pan @ 2025-08-21 19:18 UTC
To: Shakeel Butt
Cc: Suren Baghdasaryan, Kent Overstreet, Usama Arif, linux-mm, linux-kernel

On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > Right now in oom_kill_process(), if the OOM is because of the cgroup
> > limit, we won't get memory allocation information. In some cases, we
> > can have a large cgroup workload running which dominates the machine.
> > The reason for using a cgroup is to leave some resources for the
> > system. When this cgroup is killed, we would also like to have some
> > memory allocation information for the whole server as well. This is
> > the reason behind this mini change. Is it an acceptable thing to do?
> > Will it be too much information for people? I am happy with any
> > suggestions!
>
> For a single patch, it is better to have all the context in the patch
> and there is no need for a cover letter.

Thanks for your suggestion Shakeel! I will change this in the next
version.

> What exact information do you want on the memcg oom that will be
> helpful for the users in general? You mentioned memory allocation
> information, can you please elaborate a bit more?

As in my reply to Suren, I was thinking the system-wide memory usage
info provided by show_free_areas() and the memory allocation profiling
info can help us debug cgroup OOMs by comparing them with historical
data. What is your take on this?

Thanks,
Pan
* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
From: Shakeel Butt @ 2025-08-21 19:53 UTC
To: Yueyang Pan
Cc: Suren Baghdasaryan, Kent Overstreet, Usama Arif, linux-mm, linux-kernel

On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > Right now in oom_kill_process(), if the OOM is because of the
> > > cgroup limit, we won't get memory allocation information. In some
> > > cases, we can have a large cgroup workload running which dominates
> > > the machine. The reason for using a cgroup is to leave some
> > > resources for the system. When this cgroup is killed, we would also
> > > like to have some memory allocation information for the whole
> > > server as well. This is the reason behind this mini change. Is it
> > > an acceptable thing to do? Will it be too much information for
> > > people? I am happy with any suggestions!
> >
> > For a single patch, it is better to have all the context in the patch
> > and there is no need for a cover letter.
>
> Thanks for your suggestion Shakeel! I will change this in the next
> version.
>
> > What exact information do you want on the memcg oom that will be
> > helpful for the users in general? You mentioned memory allocation
> > information, can you please elaborate a bit more?
>
> As in my reply to Suren, I was thinking the system-wide memory usage
> info provided by show_free_areas() and the memory allocation profiling
> info can help us debug cgroup OOMs by comparing them with historical
> data. What is your take on this?

I am not really sure about show_free_areas(), more specifically how the
historical data diff will be useful for a memcg oom. If you have a
concrete example, please give one. For memory allocation profiling, is
it possible to filter for a given memcg? Do we save memcg information
in the memory allocation profiling?
* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
From: Suren Baghdasaryan @ 2025-08-21 20:00 UTC
To: Shakeel Butt
Cc: Yueyang Pan, Kent Overstreet, Usama Arif, linux-mm, linux-kernel

On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > Right now in oom_kill_process(), if the OOM is because of the
> > > > cgroup limit, we won't get memory allocation information. In some
> > > > cases, we can have a large cgroup workload running which
> > > > dominates the machine. The reason for using a cgroup is to leave
> > > > some resources for the system. When this cgroup is killed, we
> > > > would also like to have some memory allocation information for
> > > > the whole server as well. This is the reason behind this mini
> > > > change. Is it an acceptable thing to do? Will it be too much
> > > > information for people? I am happy with any suggestions!
> > >
> > > For a single patch, it is better to have all the context in the
> > > patch and there is no need for a cover letter.
> >
> > Thanks for your suggestion Shakeel! I will change this in the next
> > version.
> >
> > > What exact information do you want on the memcg oom that will be
> > > helpful for the users in general? You mentioned memory allocation
> > > information, can you please elaborate a bit more?
> >
> > As in my reply to Suren, I was thinking the system-wide memory usage
> > info provided by show_free_areas() and the memory allocation
> > profiling info can help us debug cgroup OOMs by comparing them with
> > historical data. What is your take on this?
>
> I am not really sure about show_free_areas(), more specifically how the
> historical data diff will be useful for a memcg oom. If you have a
> concrete example, please give one. For memory allocation profiling, is
> it possible to filter for a given memcg? Do we save memcg information
> in the memory allocation profiling?

No, memory allocation profiling is not cgroup-aware. It tracks
allocations and their code locations but no other context.
* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
From: Shakeel Butt @ 2025-08-21 21:26 UTC
To: Suren Baghdasaryan
Cc: Yueyang Pan, Kent Overstreet, Usama Arif, linux-mm, linux-kernel

On Thu, Aug 21, 2025 at 01:00:36PM -0700, Suren Baghdasaryan wrote:
> On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > > Right now in oom_kill_process(), if the OOM is because of the
> > > > > cgroup limit, we won't get memory allocation information. In
> > > > > some cases, we can have a large cgroup workload running which
> > > > > dominates the machine. The reason for using a cgroup is to
> > > > > leave some resources for the system. When this cgroup is
> > > > > killed, we would also like to have some memory allocation
> > > > > information for the whole server as well. This is the reason
> > > > > behind this mini change. Is it an acceptable thing to do? Will
> > > > > it be too much information for people? I am happy with any
> > > > > suggestions!
> > > >
> > > > For a single patch, it is better to have all the context in the
> > > > patch and there is no need for a cover letter.
> > >
> > > Thanks for your suggestion Shakeel! I will change this in the next
> > > version.
> > >
> > > > What exact information do you want on the memcg oom that will be
> > > > helpful for the users in general? You mentioned memory allocation
> > > > information, can you please elaborate a bit more?
> > >
> > > As in my reply to Suren, I was thinking the system-wide memory
> > > usage info provided by show_free_areas() and the memory allocation
> > > profiling info can help us debug cgroup OOMs by comparing them with
> > > historical data. What is your take on this?
> >
> > I am not really sure about show_free_areas(), more specifically how
> > the historical data diff will be useful for a memcg oom. If you have
> > a concrete example, please give one. For memory allocation profiling,
> > is it possible to filter for a given memcg? Do we save memcg
> > information in the memory allocation profiling?
>
> No, memory allocation profiling is not cgroup-aware. It tracks
> allocations and their code locations but no other context.

Thanks for the info. Pan, will having memcg info along with the
allocation profile help your use case? (Though adding that might not be
easy or cheap.)
* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
From: Yueyang Pan @ 2025-08-26 13:52 UTC
To: Shakeel Butt
Cc: Suren Baghdasaryan, Kent Overstreet, Usama Arif, linux-mm, linux-kernel

On Thu, Aug 21, 2025 at 02:26:42PM -0700, Shakeel Butt wrote:
> On Thu, Aug 21, 2025 at 01:00:36PM -0700, Suren Baghdasaryan wrote:
> > On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > > > Right now in oom_kill_process(), if the OOM is because of the
> > > > > > cgroup limit, we won't get memory allocation information. In
> > > > > > some cases, we can have a large cgroup workload running which
> > > > > > dominates the machine. The reason for using a cgroup is to
> > > > > > leave some resources for the system. When this cgroup is
> > > > > > killed, we would also like to have some memory allocation
> > > > > > information for the whole server as well. This is the reason
> > > > > > behind this mini change. Is it an acceptable thing to do?
> > > > > > Will it be too much information for people? I am happy with
> > > > > > any suggestions!
> > > > >
> > > > > For a single patch, it is better to have all the context in the
> > > > > patch and there is no need for a cover letter.
> > > >
> > > > Thanks for your suggestion Shakeel! I will change this in the
> > > > next version.
> > > >
> > > > > What exact information do you want on the memcg oom that will
> > > > > be helpful for the users in general? You mentioned memory
> > > > > allocation information, can you please elaborate a bit more?
> > > >
> > > > As in my reply to Suren, I was thinking the system-wide memory
> > > > usage info provided by show_free_areas() and the memory
> > > > allocation profiling info can help us debug cgroup OOMs by
> > > > comparing them with historical data. What is your take on this?
> > >
> > > I am not really sure about show_free_areas(), more specifically how
> > > the historical data diff will be useful for a memcg oom. If you
> > > have a concrete example, please give one. For memory allocation
> > > profiling, is it possible to filter for a given memcg? Do we save
> > > memcg information in the memory allocation profiling?
> >
> > No, memory allocation profiling is not cgroup-aware. It tracks
> > allocations and their code locations but no other context.
>
> Thanks for the info. Pan, will having memcg info along with the
> allocation profile help your use case? (Though adding that might not be
> easy or cheap.)

Yeah, I have been thinking about it with eBPF hooks, but it is going to
be a long-term effort as we need to measure the overhead. The way
memory profiling is implemented now incurs almost "zero" overhead.
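As a rough illustration of the eBPF route mentioned above (a
hypothetical bpftrace one-liner, not something from this thread;
tracepoint field names vary across kernel versions), one could
aggregate slab allocation sizes by the cgroup of the allocating task:

	bpftrace -e 'tracepoint:kmem:kmalloc { @bytes[cgroup] = sum(args->bytes_alloc); }'

This attributes each allocation to the current task's cgroup - exactly
the association the in-kernel profiling counters do not keep today -
and the probe overhead is what would need measuring.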
* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
From: Yueyang Pan @ 2025-08-26 14:06 UTC
To: Shakeel Butt
Cc: Suren Baghdasaryan, Kent Overstreet, Usama Arif, linux-mm, linux-kernel

On Thu, Aug 21, 2025 at 12:53:03PM -0700, Shakeel Butt wrote:
> On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > Right now in oom_kill_process(), if the OOM is because of the
> > > > cgroup limit, we won't get memory allocation information. In some
> > > > cases, we can have a large cgroup workload running which
> > > > dominates the machine. The reason for using a cgroup is to leave
> > > > some resources for the system. When this cgroup is killed, we
> > > > would also like to have some memory allocation information for
> > > > the whole server as well. This is the reason behind this mini
> > > > change. Is it an acceptable thing to do? Will it be too much
> > > > information for people? I am happy with any suggestions!
> > >
> > > For a single patch, it is better to have all the context in the
> > > patch and there is no need for a cover letter.
> >
> > Thanks for your suggestion Shakeel! I will change this in the next
> > version.
> >
> > > What exact information do you want on the memcg oom that will be
> > > helpful for the users in general? You mentioned memory allocation
> > > information, can you please elaborate a bit more?
> >
> > As in my reply to Suren, I was thinking the system-wide memory usage
> > info provided by show_free_areas() and the memory allocation
> > profiling info can help us debug cgroup OOMs by comparing them with
> > historical data. What is your take on this?
>
> I am not really sure about show_free_areas(), more specifically how the
> historical data diff will be useful for a memcg oom. If you have a
> concrete example, please give one. For memory allocation profiling, is
> it possible to filter for a given memcg? Do we save memcg information
> in the memory allocation profiling?

Sorry for my late reply. I have been trying hard to think of a use
case. One specific case I can think of is when there is no workload
stacking, i.e. when one job is running solely on the machine. For
example, memory allocation profiling can tell us the memory usage of
the network driver, which can make it harder for the cgroup to allocate
memory and eventually lead to a cgroup OOM. Without this information,
it would be hard to reason about what is happening in the kernel given
an increased OOM count.

show_free_areas() will give a summary of the different types of memory
which can possibly lead to increased cgroup OOMs in my previous case.
Then one can look deeper, using memory allocation profiling as an entry
point for debugging.

Does this make sense to you?

Thanks
Pan
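For context, an illustration of the entry point Pan describes: with
CONFIG_MEM_ALLOC_PROFILING enabled, the per-call-site counters are
readable from /proc/allocinfo, so the biggest live consumers (e.g. a
driver's allocation sites) surface with a simple sort:

	sort -g /proc/allocinfo | tail

The first column is live bytes per allocation site, so the tail of the
numeric sort shows where the memory actually sits at the moment of the
snapshot.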
* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
From: Suren Baghdasaryan @ 2025-08-27 2:38 UTC
To: Yueyang Pan
Cc: Shakeel Butt, Kent Overstreet, Usama Arif, linux-mm, linux-kernel, Sourav Panda, Pasha Tatashin, Johannes Weiner

On Tue, Aug 26, 2025 at 7:06 AM Yueyang Pan <pyyjason@gmail.com> wrote:
>
> On Thu, Aug 21, 2025 at 12:53:03PM -0700, Shakeel Butt wrote:
> > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > > Right now in oom_kill_process(), if the OOM is because of the
> > > > > cgroup limit, we won't get memory allocation information. In
> > > > > some cases, we can have a large cgroup workload running which
> > > > > dominates the machine. The reason for using a cgroup is to
> > > > > leave some resources for the system. When this cgroup is
> > > > > killed, we would also like to have some memory allocation
> > > > > information for the whole server as well. This is the reason
> > > > > behind this mini change. Is it an acceptable thing to do? Will
> > > > > it be too much information for people? I am happy with any
> > > > > suggestions!
> > > >
> > > > For a single patch, it is better to have all the context in the
> > > > patch and there is no need for a cover letter.
> > >
> > > Thanks for your suggestion Shakeel! I will change this in the next
> > > version.
> > >
> > > > What exact information do you want on the memcg oom that will be
> > > > helpful for the users in general? You mentioned memory allocation
> > > > information, can you please elaborate a bit more?
> > >
> > > As in my reply to Suren, I was thinking the system-wide memory
> > > usage info provided by show_free_areas() and the memory allocation
> > > profiling info can help us debug cgroup OOMs by comparing them with
> > > historical data. What is your take on this?
> >
> > I am not really sure about show_free_areas(), more specifically how
> > the historical data diff will be useful for a memcg oom. If you have
> > a concrete example, please give one. For memory allocation profiling,
> > is it possible to filter for a given memcg? Do we save memcg
> > information in the memory allocation profiling?
>
> Sorry for my late reply. I have been trying hard to think of a use
> case. One specific case I can think of is when there is no workload
> stacking, i.e. when one job is running solely on the machine. For
> example, memory allocation profiling can tell us the memory usage of
> the network driver, which can make it harder for the cgroup to
> allocate memory and eventually lead to a cgroup OOM. Without this
> information, it would be hard to reason about what is happening in the
> kernel given an increased OOM count.
>
> show_free_areas() will give a summary of the different types of memory
> which can possibly lead to increased cgroup OOMs in my previous case.
> Then one can look deeper, using memory allocation profiling as an
> entry point for debugging.
>
> Does this make sense to you?

I think if we had per-memcg memory profiling that would make sense.
Counters would reflect only allocations made by the processes from that
memcg and you could easily identify the allocation that caused the
memcg to oom. But dumping system-wide profiling information at
memcg-oom time I think would not help you with this task. It will be
polluted with allocations from other memcgs, so it likely won't help
much (unless there is some obvious leak, or you know that a specific
allocation is done only by a process from your memcg and no other
process).

> Thanks
> Pan
* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
From: Michal Hocko @ 2025-08-29 6:35 UTC
To: Suren Baghdasaryan
Cc: Yueyang Pan, Shakeel Butt, Kent Overstreet, Usama Arif, linux-mm, linux-kernel, Sourav Panda, Pasha Tatashin, Johannes Weiner

On Tue 26-08-25 19:38:03, Suren Baghdasaryan wrote:
> On Tue, Aug 26, 2025 at 7:06 AM Yueyang Pan <pyyjason@gmail.com> wrote:
> >
> > On Thu, Aug 21, 2025 at 12:53:03PM -0700, Shakeel Butt wrote:
> > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > > > Right now in oom_kill_process(), if the OOM is because of the
> > > > > > cgroup limit, we won't get memory allocation information. In
> > > > > > some cases, we can have a large cgroup workload running which
> > > > > > dominates the machine. The reason for using a cgroup is to
> > > > > > leave some resources for the system. When this cgroup is
> > > > > > killed, we would also like to have some memory allocation
> > > > > > information for the whole server as well. This is the reason
> > > > > > behind this mini change. Is it an acceptable thing to do?
> > > > > > Will it be too much information for people? I am happy with
> > > > > > any suggestions!
> > > > >
> > > > > For a single patch, it is better to have all the context in the
> > > > > patch and there is no need for a cover letter.
> > > >
> > > > Thanks for your suggestion Shakeel! I will change this in the
> > > > next version.
> > > >
> > > > > What exact information do you want on the memcg oom that will
> > > > > be helpful for the users in general? You mentioned memory
> > > > > allocation information, can you please elaborate a bit more?
> > > >
> > > > As in my reply to Suren, I was thinking the system-wide memory
> > > > usage info provided by show_free_areas() and the memory
> > > > allocation profiling info can help us debug cgroup OOMs by
> > > > comparing them with historical data. What is your take on this?
> > >
> > > I am not really sure about show_free_areas(), more specifically how
> > > the historical data diff will be useful for a memcg oom. If you
> > > have a concrete example, please give one. For memory allocation
> > > profiling, is it possible to filter for a given memcg? Do we save
> > > memcg information in the memory allocation profiling?
> >
> > Sorry for my late reply. I have been trying hard to think of a use
> > case. One specific case I can think of is when there is no workload
> > stacking, i.e. when one job is running solely on the machine. For
> > example, memory allocation profiling can tell us the memory usage of
> > the network driver, which can make it harder for the cgroup to
> > allocate memory and eventually lead to a cgroup OOM. Without this
> > information, it would be hard to reason about what is happening in
> > the kernel given an increased OOM count.
> >
> > show_free_areas() will give a summary of the different types of
> > memory which can possibly lead to increased cgroup OOMs in my
> > previous case. Then one can look deeper, using memory allocation
> > profiling as an entry point for debugging.
> >
> > Does this make sense to you?
>
> I think if we had per-memcg memory profiling that would make sense.
> Counters would reflect only allocations made by the processes from
> that memcg and you could easily identify the allocation that caused
> the memcg to oom. But dumping system-wide profiling information at
> memcg-oom time I think would not help you with this task. It will be
> polluted with allocations from other memcgs, so it likely won't help
> much (unless there is some obvious leak, or you know that a specific
> allocation is done only by a process from your memcg and no other
> process).

I agree with Suren. It makes very little sense, and in many cases it
could be actively misleading, to print the global memory state on memcg
OOMs. Not to mention that those events, unlike global OOMs, can happen
much more often. If you are interested in more information on memcg OOM
occurrences, you can detect the OOM events and print whatever
information you need.
--
Michal Hocko
SUSE Labs
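A minimal sketch of the userspace approach Michal suggests (assuming
cgroup v2 and a kernel where cgroup event files generate inotify modify
events; the cgroup path and snapshot destination are illustrative):
watch the cgroup's memory.events for oom_kill increments and capture
whatever state matters at that moment, e.g. a /proc/allocinfo snapshot:

	#!/bin/sh
	# Snapshot profiling data each time this cgroup records an oom_kill.
	CG=/sys/fs/cgroup/mygroup	# illustrative cgroup path
	prev=0
	while inotifywait -qq -e modify "$CG/memory.events"; do
		cur=$(awk '$1 == "oom_kill" { print $2 }' "$CG/memory.events")
		if [ "$cur" -gt "$prev" ]; then
			cat /proc/allocinfo > "/tmp/allocinfo.$(date +%s)"
			prev=$cur
		fi
	done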
* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
From: Suren Baghdasaryan @ 2025-08-27 2:32 UTC
To: Shakeel Butt
Cc: Yueyang Pan, Kent Overstreet, Usama Arif, linux-mm, linux-kernel

On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > Right now in oom_kill_process(), if the OOM is because of the
> > > > cgroup limit, we won't get memory allocation information. In some
> > > > cases, we can have a large cgroup workload running which
> > > > dominates the machine. The reason for using a cgroup is to leave
> > > > some resources for the system. When this cgroup is killed, we
> > > > would also like to have some memory allocation information for
> > > > the whole server as well. This is the reason behind this mini
> > > > change. Is it an acceptable thing to do? Will it be too much
> > > > information for people? I am happy with any suggestions!
> > >
> > > For a single patch, it is better to have all the context in the
> > > patch and there is no need for a cover letter.
> >
> > Thanks for your suggestion Shakeel! I will change this in the next
> > version.
> >
> > > What exact information do you want on the memcg oom that will be
> > > helpful for the users in general? You mentioned memory allocation
> > > information, can you please elaborate a bit more?
> >
> > As in my reply to Suren, I was thinking the system-wide memory usage
> > info provided by show_free_areas() and the memory allocation
> > profiling info can help us debug cgroup OOMs by comparing them with
> > historical data. What is your take on this?
>
> I am not really sure about show_free_areas(), more specifically how the
> historical data diff will be useful for a memcg oom. If you have a
> concrete example, please give one. For memory allocation profiling, is
> it possible to filter for a given memcg? Do we save memcg information
> in the memory allocation profiling?

Actually I was thinking about making memory profiling memcg-aware, but
it would be quite costly from both the memory and the performance point
of view. Currently we have a per-cpu counter for each allocation site
in the kernel codebase. To make it work for each memcg we would have to
add a memcg dimension to the counters, so each counter becomes per-cpu
plus per-memcg. I'll be thinking about possible optimizations, since
many of these counters will stay at 0, but any such optimization would
come at a performance cost, which we have tried to keep at the absolute
minimum.

I'm CC'ing Sourav and Pasha since they were also interested in making
memory allocation profiling memcg-aware. Would Meta folks (Usama,
Shakeel, Johannes) be interested in such an enhancement as well? Would
it be preferable to have such accounting for a specific memcg which we
pre-select (less memory and performance overhead), or do we need it for
all memcgs as a generic feature? We have some options here, but I want
to understand what would be sufficient and add as little overhead as
possible.

Thanks,
Suren.
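For context, the counters Suren refers to hang off each allocation
site's tag; simplified from include/linux/alloc_tag.h, and the
memcg-aware variant sketched below is purely hypothetical, not proposed
code:

	/* Today: one per-CPU {bytes, calls} pair per allocation site. */
	struct alloc_tag_counters {
		u64 bytes;
		u64 calls;
	};

	struct alloc_tag {
		struct codetag ct;
		struct alloc_tag_counters __percpu *counters;
	};

	/*
	 * Hypothetical memcg-aware variant: index the per-CPU counters by
	 * memcg ID as well, e.g.
	 *
	 *	struct alloc_tag_counters __percpu *counters[MEM_CGROUP_ID_MAX];
	 *
	 * The footprint multiplies by the number of possible memcgs, which
	 * is the memory cost being described here.
	 */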
* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
From: Usama Arif @ 2025-08-27 4:47 UTC
To: Suren Baghdasaryan, Shakeel Butt
Cc: Yueyang Pan, Kent Overstreet, linux-mm, linux-kernel, hannes

On 27/08/2025 03:32, Suren Baghdasaryan wrote:
> On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>
>> On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
>>> On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
>>>> On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
>>>>> Right now in oom_kill_process(), if the OOM is because of the
>>>>> cgroup limit, we won't get memory allocation information. In some
>>>>> cases, we can have a large cgroup workload running which dominates
>>>>> the machine. The reason for using a cgroup is to leave some
>>>>> resources for the system. When this cgroup is killed, we would also
>>>>> like to have some memory allocation information for the whole
>>>>> server as well. This is the reason behind this mini change. Is it
>>>>> an acceptable thing to do? Will it be too much information for
>>>>> people? I am happy with any suggestions!
>>>>
>>>> For a single patch, it is better to have all the context in the
>>>> patch and there is no need for a cover letter.
>>>
>>> Thanks for your suggestion Shakeel! I will change this in the next
>>> version.
>>>
>>>> What exact information do you want on the memcg oom that will be
>>>> helpful for the users in general? You mentioned memory allocation
>>>> information, can you please elaborate a bit more?
>>>
>>> As in my reply to Suren, I was thinking the system-wide memory usage
>>> info provided by show_free_areas() and the memory allocation
>>> profiling info can help us debug cgroup OOMs by comparing them with
>>> historical data. What is your take on this?
>>
>> I am not really sure about show_free_areas(), more specifically how
>> the historical data diff will be useful for a memcg oom. If you have a
>> concrete example, please give one. For memory allocation profiling, is
>> it possible to filter for a given memcg? Do we save memcg information
>> in the memory allocation profiling?
>
> Actually I was thinking about making memory profiling memcg-aware, but
> it would be quite costly from both the memory and the performance point
> of view. Currently we have a per-cpu counter for each allocation site
> in the kernel codebase. To make it work for each memcg we would have to
> add a memcg dimension to the counters, so each counter becomes per-cpu
> plus per-memcg. I'll be thinking about possible optimizations, since
> many of these counters will stay at 0, but any such optimization would
> come at a performance cost, which we have tried to keep at the absolute
> minimum.
>
> I'm CC'ing Sourav and Pasha since they were also interested in making
> memory allocation profiling memcg-aware. Would Meta folks (Usama,
> Shakeel, Johannes) be interested in such an enhancement as well? Would
> it be preferable to have such accounting for a specific memcg which we
> pre-select (less memory and performance overhead), or do we need it for
> all memcgs as a generic feature? We have some options here, but I want
> to understand what would be sufficient and add as little overhead as
> possible.

Yes, having per-memcg counters is going to be extremely useful (we were
thinking of having this as a future project to work on).

For the Meta fleet in particular, we might have almost 100 memcgs
running, but the number of memcgs running workloads is particularly
small (usually less than 10). The rest might be services responsible
for telemetry, monitoring, security, etc. (for which we aren't really
interested in the memory allocation profile). So yes, it would be ideal
to have the profile for just pre-selected memcgs, especially if it
leads to lower memory and performance overhead.

Having the memory allocation profile at memcg level is especially
needed when we have multiple workloads stacked on the same host. Having
it at host level in such a case makes the data less useful when we have
OOMs, and for workload analysis, as you don't know which workload is
contributing how much.

> Thanks,
> Suren.
* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
From: Shakeel Butt @ 2025-08-27 21:15 UTC
To: Suren Baghdasaryan
Cc: Yueyang Pan, Kent Overstreet, Usama Arif, linux-mm, linux-kernel

On Tue, Aug 26, 2025 at 07:32:17PM -0700, Suren Baghdasaryan wrote:
> On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > > Right now in oom_kill_process(), if the OOM is because of the
> > > > > cgroup limit, we won't get memory allocation information. In
> > > > > some cases, we can have a large cgroup workload running which
> > > > > dominates the machine. The reason for using a cgroup is to
> > > > > leave some resources for the system. When this cgroup is
> > > > > killed, we would also like to have some memory allocation
> > > > > information for the whole server as well. This is the reason
> > > > > behind this mini change. Is it an acceptable thing to do? Will
> > > > > it be too much information for people? I am happy with any
> > > > > suggestions!
> > > >
> > > > For a single patch, it is better to have all the context in the
> > > > patch and there is no need for a cover letter.
> > >
> > > Thanks for your suggestion Shakeel! I will change this in the next
> > > version.
> > >
> > > > What exact information do you want on the memcg oom that will be
> > > > helpful for the users in general? You mentioned memory allocation
> > > > information, can you please elaborate a bit more?
> > >
> > > As in my reply to Suren, I was thinking the system-wide memory
> > > usage info provided by show_free_areas() and the memory allocation
> > > profiling info can help us debug cgroup OOMs by comparing them with
> > > historical data. What is your take on this?
> >
> > I am not really sure about show_free_areas(), more specifically how
> > the historical data diff will be useful for a memcg oom. If you have
> > a concrete example, please give one. For memory allocation profiling,
> > is it possible to filter for a given memcg? Do we save memcg
> > information in the memory allocation profiling?
>
> Actually I was thinking about making memory profiling memcg-aware, but
> it would be quite costly from both the memory and the performance point
> of view. Currently we have a per-cpu counter for each allocation site
> in the kernel codebase. To make it work for each memcg we would have to
> add a memcg dimension to the counters, so each counter becomes per-cpu
> plus per-memcg. I'll be thinking about possible optimizations, since
> many of these counters will stay at 0, but any such optimization would
> come at a performance cost, which we have tried to keep at the absolute
> minimum.
>
> I'm CC'ing Sourav and Pasha since they were also interested in making
> memory allocation profiling memcg-aware. Would Meta folks (Usama,
> Shakeel, Johannes) be interested in such an enhancement as well? Would
> it be preferable to have such accounting for a specific memcg which we
> pre-select (less memory and performance overhead), or do we need it for
> all memcgs as a generic feature? We have some options here, but I want
> to understand what would be sufficient and add as little overhead as
> possible.

Thanks Suren, yes, as already mentioned by Usama, Meta would be
interested in memcg-aware allocation profiling. I would say start
simple and with as little overhead as possible. More functionality can
be added later when the need arises. Maybe the first useful addition is
just adding how many allocations for a specific allocation site are
memcg-charged.
* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
From: Suren Baghdasaryan @ 2025-09-07 5:16 UTC
To: Shakeel Butt
Cc: Yueyang Pan, Kent Overstreet, Usama Arif, linux-mm, linux-kernel, Sourav Panda, Pasha Tatashin, Johannes Weiner

On Wed, Aug 27, 2025 at 2:15 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Aug 26, 2025 at 07:32:17PM -0700, Suren Baghdasaryan wrote:
> > On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > > > Right now in oom_kill_process(), if the OOM is because of the
> > > > > > cgroup limit, we won't get memory allocation information. In
> > > > > > some cases, we can have a large cgroup workload running which
> > > > > > dominates the machine. The reason for using a cgroup is to
> > > > > > leave some resources for the system. When this cgroup is
> > > > > > killed, we would also like to have some memory allocation
> > > > > > information for the whole server as well. This is the reason
> > > > > > behind this mini change. Is it an acceptable thing to do?
> > > > > > Will it be too much information for people? I am happy with
> > > > > > any suggestions!
> > > > >
> > > > > For a single patch, it is better to have all the context in the
> > > > > patch and there is no need for a cover letter.
> > > >
> > > > Thanks for your suggestion Shakeel! I will change this in the
> > > > next version.
> > > >
> > > > > What exact information do you want on the memcg oom that will
> > > > > be helpful for the users in general? You mentioned memory
> > > > > allocation information, can you please elaborate a bit more?
> > > >
> > > > As in my reply to Suren, I was thinking the system-wide memory
> > > > usage info provided by show_free_areas() and the memory
> > > > allocation profiling info can help us debug cgroup OOMs by
> > > > comparing them with historical data. What is your take on this?
> > >
> > > I am not really sure about show_free_areas(), more specifically how
> > > the historical data diff will be useful for a memcg oom. If you
> > > have a concrete example, please give one. For memory allocation
> > > profiling, is it possible to filter for a given memcg? Do we save
> > > memcg information in the memory allocation profiling?
> >
> > Actually I was thinking about making memory profiling memcg-aware,
> > but it would be quite costly from both the memory and the performance
> > point of view. Currently we have a per-cpu counter for each
> > allocation site in the kernel codebase. To make it work for each
> > memcg we would have to add a memcg dimension to the counters, so each
> > counter becomes per-cpu plus per-memcg. I'll be thinking about
> > possible optimizations, since many of these counters will stay at 0,
> > but any such optimization would come at a performance cost, which we
> > have tried to keep at the absolute minimum.
> >
> > I'm CC'ing Sourav and Pasha since they were also interested in making
> > memory allocation profiling memcg-aware. Would Meta folks (Usama,
> > Shakeel, Johannes) be interested in such an enhancement as well?
> > Would it be preferable to have such accounting for a specific memcg
> > which we pre-select (less memory and performance overhead), or do we
> > need it for all memcgs as a generic feature? We have some options
> > here, but I want to understand what would be sufficient and add as
> > little overhead as possible.
>
> Thanks Suren, yes, as already mentioned by Usama, Meta would be
> interested in memcg-aware allocation profiling. I would say start
> simple and with as little overhead as possible. More functionality can
> be added later when the need arises. Maybe the first useful addition is
> just adding how many allocations for a specific allocation site are
> memcg-charged.

Adding back Sourav, Pasha and Johannes, who got accidentally dropped in
the replies.

I looked a bit into adding memcg-awareness to memory allocation
profiling and it's more complicated than I first thought (as usual).
The main complication is that we need to add a memcg_id or some other
memcg identifier into codetag_ref. That's needed so that we can
unaccount the correct memcg when we free an allocation - that's the
usual function of the codetag_ref. Now, extending codetag_ref is not a
problem by itself, but when we use mem_profiling_compressed mode, we
store an index of the codetag instead of a codetag_ref in the unused
page flag bits. This is a useful optimization to avoid using page_ext
and the overhead associated with it. So, full-blown memcg support seems
problematic.

What I think is easily doable is a filtering interface where we could
select a specific memcg to be profiled, IOW we profile only allocations
from a chosen memcg. Filtering can be done using an ioctl interface on
/proc/allocinfo, which can be used for other things as well, like
filtering non-zero allocations, returning per-NUMA-node information,
etc. I see that DAMON uses similar memcg filtering (see
damos_filter.memcg_id), so I can reuse some of that code for
implementing this facility.

At a high level, userspace will be able to select one memcg at a time
to be profiled. At some later time the profiling information is
gathered, and another memcg can be selected, or filtering can be reset
to profile all allocations from all memcgs. I expect the overhead for
this kind of memcg filtering to be quite low.

WDYT folks, would this be helpful and cover your use cases?
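A minimal sketch of the filter check such an interface implies (every
name here is hypothetical - this is not code from the thread or the
tree, and RCU/reference handling is omitted for brevity): the
accounting fast path would profile an allocation only when no memcg is
selected, or the current task belongs to the selected one:

	/* 0 means "no filter set, profile everything". Set via the
	 * proposed /proc/allocinfo ioctl. */
	static u16 profiled_memcg_id;

	static inline bool alloc_tag_memcg_match(void)
	{
		u16 id = READ_ONCE(profiled_memcg_id);

		if (!id)
			return true;
		return mem_cgroup_id(mem_cgroup_from_task(current)) == id;
	}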