* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
2025-08-21 19:53 ` Shakeel Butt
@ 2025-08-21 20:00 ` Suren Baghdasaryan
2025-08-21 21:26 ` Shakeel Butt
2025-08-26 14:06 ` Yueyang Pan
2025-08-27 2:32 ` Suren Baghdasaryan
2 siblings, 1 reply; 18+ messages in thread
From: Suren Baghdasaryan @ 2025-08-21 20:00 UTC (permalink / raw)
To: Shakeel Butt
Cc: Yueyang Pan, Kent Overstreet, Usama Arif, linux-mm, linux-kernel
On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > Right now in oom_kill_process(), if the OOM is because of a cgroup
> > > > limit, we won't get any memory allocation information. In some cases,
> > > > we can have a large cgroup workload running which dominates the machine.
> > > > The reason for using a cgroup is to leave some resources for the system.
> > > > When this cgroup is killed, we would also like to have some memory
> > > > allocation information for the whole server as well. That is the reason
> > > > behind this mini change. Is it an acceptable thing to do? Will it be too
> > > > much information for people? I am happy with any suggestions!
> > >
> > > For a single patch, it is better to have all the context in the patch
> > > and there is no need for a cover letter.
> >
> > Thanks for your suggestion, Shakeel! I will change this in the next version.
> >
> > >
> > > What exact information do you want on the memcg OOM that will be
> > > helpful for the users in general? You mentioned memory allocation
> > > information; can you please elaborate a bit more?
> > >
> >
> > As in my reply to Suren, I was thinking the system-wide memory usage
> > info provided by show_free_areas() and the memory allocation profiling
> > info could help us debug memcg OOMs by comparing them with historical
> > data. What is your take on this?
> >
>
> I am not really sure about show_free_areas(). More specifically, how
> would the historical data diff be useful for a memcg OOM? If you have a
> concrete example, please give one. For memory allocation profiling, is
> it possible to filter for the given memcg? Do we save memcg information
> in the memory allocation profiling?
No, memory allocation profiling is not cgroup-aware. It tracks
allocations and their code locations but no other context.
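For context, that data is exposed through /proc/allocinfo as per-callsite
totals (bytes and call counts keyed by source location), with no record of
which task or memcg did the allocating. A minimal sketch of a reader,
assuming each data line starts with "<bytes> <calls>" (error handling
elided):

#include <stdio.h>

/* Print allocation sites currently holding >= 1 MiB. Sketch only. */
int main(void)
{
        unsigned long long bytes, calls;
        char line[512];
        FILE *f = fopen("/proc/allocinfo", "r");

        if (!f)
                return 1;
        while (fgets(line, sizeof(line), f)) {
                /* skip the header and anything that doesn't parse */
                if (sscanf(line, "%llu %llu", &bytes, &calls) != 2)
                        continue;
                if (bytes >= (1ULL << 20) && calls > 0)
                        fputs(line, stdout);
        }
        fclose(f);
        return 0;
}

Nothing in those lines says which memcg the allocations were charged to,
which is the gap being discussed here.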
* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
2025-08-21 20:00 ` Suren Baghdasaryan
@ 2025-08-21 21:26 ` Shakeel Butt
2025-08-26 13:52 ` Yueyang Pan
0 siblings, 1 reply; 18+ messages in thread
From: Shakeel Butt @ 2025-08-21 21:26 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Yueyang Pan, Kent Overstreet, Usama Arif, linux-mm, linux-kernel
On Thu, Aug 21, 2025 at 01:00:36PM -0700, Suren Baghdasaryan wrote:
> On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > > Right now in oom_kill_process(), if the OOM is because of a cgroup
> > > > > limit, we won't get any memory allocation information. In some cases,
> > > > > we can have a large cgroup workload running which dominates the machine.
> > > > > The reason for using a cgroup is to leave some resources for the system.
> > > > > When this cgroup is killed, we would also like to have some memory
> > > > > allocation information for the whole server as well. That is the reason
> > > > > behind this mini change. Is it an acceptable thing to do? Will it be too
> > > > > much information for people? I am happy with any suggestions!
> > > >
> > > > For a single patch, it is better to have all the context in the patch
> > > > and there is no need for a cover letter.
> > >
> > > Thanks for your suggestion, Shakeel! I will change this in the next version.
> > >
> > > >
> > > > What exact information do you want on the memcg OOM that will be
> > > > helpful for the users in general? You mentioned memory allocation
> > > > information; can you please elaborate a bit more?
> > > >
> > >
> > > As in my reply to Suren, I was thinking the system-wide memory usage
> > > info provided by show_free_areas() and the memory allocation profiling
> > > info could help us debug memcg OOMs by comparing them with historical
> > > data. What is your take on this?
> > >
> >
> > I am not really sure about show_free_areas(). More specifically, how
> > would the historical data diff be useful for a memcg OOM? If you have a
> > concrete example, please give one. For memory allocation profiling, is
> > it possible to filter for the given memcg? Do we save memcg information
> > in the memory allocation profiling?
>
> No, memory allocation profiling is not cgroup-aware. It tracks
> allocations and their code locations but no other context.
Thanks for the info. Pan, will having memcg info along with the
allocation profile help your use case? (Though adding that might not be
easy or cheap.)
* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
2025-08-21 21:26 ` Shakeel Butt
@ 2025-08-26 13:52 ` Yueyang Pan
0 siblings, 0 replies; 18+ messages in thread
From: Yueyang Pan @ 2025-08-26 13:52 UTC (permalink / raw)
To: Shakeel Butt
Cc: Suren Baghdasaryan, Kent Overstreet, Usama Arif, linux-mm,
linux-kernel
On Thu, Aug 21, 2025 at 02:26:42PM -0700, Shakeel Butt wrote:
> On Thu, Aug 21, 2025 at 01:00:36PM -0700, Suren Baghdasaryan wrote:
> > On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > > > Right now in oom_kill_process(), if the OOM is because of a cgroup
> > > > > > limit, we won't get any memory allocation information. In some cases,
> > > > > > we can have a large cgroup workload running which dominates the machine.
> > > > > > The reason for using a cgroup is to leave some resources for the system.
> > > > > > When this cgroup is killed, we would also like to have some memory
> > > > > > allocation information for the whole server as well. That is the reason
> > > > > > behind this mini change. Is it an acceptable thing to do? Will it be too
> > > > > > much information for people? I am happy with any suggestions!
> > > > >
> > > > > For a single patch, it is better to have all the context in the patch
> > > > > and there is no need for a cover letter.
> > > >
> > > > Thanks for your suggestion, Shakeel! I will change this in the next version.
> > > >
> > > > >
> > > > > What exact information do you want on the memcg OOM that will be
> > > > > helpful for the users in general? You mentioned memory allocation
> > > > > information; can you please elaborate a bit more?
> > > > >
> > > >
> > > > As in my reply to Suren, I was thinking the system-wide memory usage
> > > > info provided by show_free_areas() and the memory allocation profiling
> > > > info could help us debug memcg OOMs by comparing them with historical
> > > > data. What is your take on this?
> > > >
> > >
> > > I am not really sure about show_free_areas(). More specifically, how
> > > would the historical data diff be useful for a memcg OOM? If you have a
> > > concrete example, please give one. For memory allocation profiling, is
> > > it possible to filter for the given memcg? Do we save memcg information
> > > in the memory allocation profiling?
> >
> > No, memory allocation profiling is not cgroup-aware. It tracks
> > allocations and their code locations but no other context.
>
> Thanks for the info. Pan, will having memcg info along with the
> allocation profile help your use case? (Though adding that might not be
> easy or cheap.)
Yeah, I have been thinking about doing it with eBPF hooks, but it is going
to be a long-term effort as we need to measure the overhead. Right now, the
way memory allocation profiling is implemented incurs almost "zero" overhead.
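To sketch the direction (untested; the tracepoint struct name follows
what vmlinux.h typically generates for kmem:kmalloc and would need
checking against the running kernel), a BPF program could aggregate
allocated bytes per cgroup id:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Sum kmalloc bytes per cgroup of the allocating task. Rough sketch. */
struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 10240);
        __type(key, u64);       /* cgroup id */
        __type(value, u64);     /* bytes allocated */
} bytes_by_cgroup SEC(".maps");

SEC("tracepoint/kmem/kmalloc")
int count_kmalloc(struct trace_event_raw_kmalloc *ctx)
{
        u64 cg = bpf_get_current_cgroup_id();
        u64 bytes = ctx->bytes_alloc, *val;

        val = bpf_map_lookup_elem(&bytes_by_cgroup, &cg);
        if (val)
                __sync_fetch_and_add(val, bytes);
        else
                bpf_map_update_elem(&bytes_by_cgroup, &cg, &bytes, BPF_ANY);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";

Even this simple version puts a map operation on every kmalloc(), which is
exactly the overhead we would have to measure before deploying it.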
* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
2025-08-21 19:53 ` Shakeel Butt
2025-08-21 20:00 ` Suren Baghdasaryan
@ 2025-08-26 14:06 ` Yueyang Pan
2025-08-27 2:38 ` Suren Baghdasaryan
2025-08-27 2:32 ` Suren Baghdasaryan
2 siblings, 1 reply; 18+ messages in thread
From: Yueyang Pan @ 2025-08-26 14:06 UTC (permalink / raw)
To: Shakeel Butt
Cc: Suren Baghdasaryan, Kent Overstreet, Usama Arif, linux-mm,
linux-kernel
On Thu, Aug 21, 2025 at 12:53:03PM -0700, Shakeel Butt wrote:
> On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > Right now in oom_kill_process(), if the OOM is because of a cgroup
> > > > limit, we won't get any memory allocation information. In some cases,
> > > > we can have a large cgroup workload running which dominates the machine.
> > > > The reason for using a cgroup is to leave some resources for the system.
> > > > When this cgroup is killed, we would also like to have some memory
> > > > allocation information for the whole server as well. That is the reason
> > > > behind this mini change. Is it an acceptable thing to do? Will it be too
> > > > much information for people? I am happy with any suggestions!
> > >
> > > For a single patch, it is better to have all the context in the patch
> > > and there is no need for a cover letter.
> >
> > Thanks for your suggestion, Shakeel! I will change this in the next version.
> >
> > >
> > > What exact information do you want on the memcg OOM that will be
> > > helpful for the users in general? You mentioned memory allocation
> > > information; can you please elaborate a bit more?
> > >
> >
> > As in my reply to Suren, I was thinking the system-wide memory usage
> > info provided by show_free_areas() and the memory allocation profiling
> > info could help us debug memcg OOMs by comparing them with historical
> > data. What is your take on this?
> >
>
> I am not really sure about show_free_areas(). More specifically, how
> would the historical data diff be useful for a memcg OOM? If you have a
> concrete example, please give one. For memory allocation profiling, is
Sorry for my late reply. I have been trying hard to think of a use case.
One specific case I can think of is when there is no workload stacking,
i.e. when a single job is running alone on the machine. For example,
memory allocation profiling can show the memory usage of the network
driver, which can make it harder for the cgroup to allocate memory and
eventually lead to a memcg OOM. Without this information, it would be
hard to reason about what is happening in the kernel given an increased
OOM count.
show_free_areas() will give a summary of the different types of memory
that could possibly lead to increased memcg OOMs in my previous case.
One can then dig deeper, using memory allocation profiling as an entry
point for debugging.
Does this make sense to you?
> it possible to filter for the given memcg? Do we save memcg information
> in the memory allocation profiling?
Thanks
Pan
* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
2025-08-26 14:06 ` Yueyang Pan
@ 2025-08-27 2:38 ` Suren Baghdasaryan
2025-08-29 6:35 ` Michal Hocko
0 siblings, 1 reply; 18+ messages in thread
From: Suren Baghdasaryan @ 2025-08-27 2:38 UTC (permalink / raw)
To: Yueyang Pan
Cc: Shakeel Butt, Kent Overstreet, Usama Arif, linux-mm, linux-kernel,
Sourav Panda, Pasha Tatashin, Johannes Weiner
On Tue, Aug 26, 2025 at 7:06 AM Yueyang Pan <pyyjason@gmail.com> wrote:
>
> On Thu, Aug 21, 2025 at 12:53:03PM -0700, Shakeel Butt wrote:
> > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > > Right now in oom_kill_process(), if the OOM is because of a cgroup
> > > > > limit, we won't get any memory allocation information. In some cases,
> > > > > we can have a large cgroup workload running which dominates the machine.
> > > > > The reason for using a cgroup is to leave some resources for the system.
> > > > > When this cgroup is killed, we would also like to have some memory
> > > > > allocation information for the whole server as well. That is the reason
> > > > > behind this mini change. Is it an acceptable thing to do? Will it be too
> > > > > much information for people? I am happy with any suggestions!
> > > >
> > > > For a single patch, it is better to have all the context in the patch
> > > > and there is no need for a cover letter.
> > >
> > > Thanks for your suggestion, Shakeel! I will change this in the next version.
> > >
> > > >
> > > > What exact information do you want on the memcg OOM that will be
> > > > helpful for the users in general? You mentioned memory allocation
> > > > information; can you please elaborate a bit more?
> > > >
> > >
> > > As in my reply to Suren, I was thinking the system-wide memory usage
> > > info provided by show_free_areas() and the memory allocation profiling
> > > info could help us debug memcg OOMs by comparing them with historical
> > > data. What is your take on this?
> > >
> >
> > I am not really sure about show_free_areas(). More specifically, how
> > would the historical data diff be useful for a memcg OOM? If you have a
> > concrete example, please give one. For memory allocation profiling, is
>
> Sorry for my late reply. I have been trying hard to think of a use case.
> One specific case I can think of is when there is no workload stacking,
> i.e. when a single job is running alone on the machine. For example,
> memory allocation profiling can show the memory usage of the network
> driver, which can make it harder for the cgroup to allocate memory and
> eventually lead to a memcg OOM. Without this information, it would be
> hard to reason about what is happening in the kernel given an increased
> OOM count.
>
> show_free_areas() will give a summary of the different types of memory
> that could possibly lead to increased memcg OOMs in my previous case.
> One can then dig deeper, using memory allocation profiling as an entry
> point for debugging.
>
> Does this make sense to you?
I think if we had per-memcg memory profiling that would make sense.
Counters would reflect only allocations made by the processes from
that memcg, and you could easily identify the allocation that caused
the memcg to OOM. But I think dumping system-wide profiling information
at memcg OOM time would not help you with this task. It will be
polluted with allocations from other memcgs, so it likely won't help
much (unless there is some obvious leak, or you know that a specific
allocation is done only by a process from your memcg and no other
process).
>
> > it possible to filter for the given memcg? Do we save memcg information
> > in the memory allocation profiling?
>
> Thanks
> Pan
* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
2025-08-27 2:38 ` Suren Baghdasaryan
@ 2025-08-29 6:35 ` Michal Hocko
0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2025-08-29 6:35 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Yueyang Pan, Shakeel Butt, Kent Overstreet, Usama Arif, linux-mm,
linux-kernel, Sourav Panda, Pasha Tatashin, Johannes Weiner
On Tue 26-08-25 19:38:03, Suren Baghdasaryan wrote:
> On Tue, Aug 26, 2025 at 7:06 AM Yueyang Pan <pyyjason@gmail.com> wrote:
> >
> > On Thu, Aug 21, 2025 at 12:53:03PM -0700, Shakeel Butt wrote:
> > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > > > Right now in oom_kill_process(), if the OOM is because of a cgroup
> > > > > > limit, we won't get any memory allocation information. In some cases,
> > > > > > we can have a large cgroup workload running which dominates the machine.
> > > > > > The reason for using a cgroup is to leave some resources for the system.
> > > > > > When this cgroup is killed, we would also like to have some memory
> > > > > > allocation information for the whole server as well. That is the reason
> > > > > > behind this mini change. Is it an acceptable thing to do? Will it be too
> > > > > > much information for people? I am happy with any suggestions!
> > > > >
> > > > > For a single patch, it is better to have all the context in the patch
> > > > > and there is no need for a cover letter.
> > > >
> > > > Thanks for your suggestion, Shakeel! I will change this in the next version.
> > > >
> > > > >
> > > > > What exact information do you want on the memcg OOM that will be
> > > > > helpful for the users in general? You mentioned memory allocation
> > > > > information; can you please elaborate a bit more?
> > > > >
> > > >
> > > > As in my reply to Suren, I was thinking the system-wide memory usage
> > > > info provided by show_free_areas() and the memory allocation profiling
> > > > info could help us debug memcg OOMs by comparing them with historical
> > > > data. What is your take on this?
> > > >
> > >
> > > I am not really sure about show_free_areas(). More specifically, how
> > > would the historical data diff be useful for a memcg OOM? If you have a
> > > concrete example, please give one. For memory allocation profiling, is
> >
> > Sorry for my late reply. I have been trying hard to think of a use case.
> > One specific case I can think of is when there is no workload stacking,
> > i.e. when a single job is running alone on the machine. For example,
> > memory allocation profiling can show the memory usage of the network
> > driver, which can make it harder for the cgroup to allocate memory and
> > eventually lead to a memcg OOM. Without this information, it would be
> > hard to reason about what is happening in the kernel given an increased
> > OOM count.
> >
> > show_free_areas() will give a summary of the different types of memory
> > that could possibly lead to increased memcg OOMs in my previous case.
> > One can then dig deeper, using memory allocation profiling as an entry
> > point for debugging.
> >
> > Does this make sense to you?
>
> I think if we had per-memcg memory profiling that would make sense.
> Counters would reflect only allocations made by the processes from
> that memcg, and you could easily identify the allocation that caused
> the memcg to OOM. But I think dumping system-wide profiling information
> at memcg OOM time would not help you with this task. It will be
> polluted with allocations from other memcgs, so it likely won't help
> much (unless there is some obvious leak, or you know that a specific
> allocation is done only by a process from your memcg and no other
> process).
I agree with Suren. It makes very little sense, and in many cases it
could be actively misleading, to print the global memory state on memcg
OOMs. Not to mention that those events, unlike global OOMs, can happen
much more often.
If you are interested in more information on memcg OOM occurrences, you
can detect the OOM events from userspace and print whatever information
you need.
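For illustration, a minimal watcher along these lines (a sketch; the
cgroup path is a placeholder and error handling is elided) can poll
memory.events, which cgroup v2 signals with POLLPRI on changes, and dump
whatever state you care about when oom_kill increments:

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        unsigned long kills, prev = 0;
        char buf[4096], *p;
        ssize_t n;
        int fd = open("/sys/fs/cgroup/workload/memory.events", O_RDONLY);
        struct pollfd pfd = { .fd = fd, .events = POLLPRI };

        if (fd < 0)
                return 1;
        while (poll(&pfd, 1, -1) > 0) {
                lseek(fd, 0, SEEK_SET);
                n = read(fd, buf, sizeof(buf) - 1);
                if (n <= 0)
                        break;
                buf[n] = '\0';
                p = strstr(buf, "oom_kill ");
                if (p && sscanf(p, "oom_kill %lu", &kills) == 1 &&
                    kills > prev) {
                        prev = kills;
                        /* snapshot /proc/allocinfo, memory.stat, ... */
                        printf("memcg oom_kill count now %lu\n", kills);
                }
        }
        return 0;
}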
--
Michal Hocko
SUSE Labs
* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
2025-08-21 19:53 ` Shakeel Butt
2025-08-21 20:00 ` Suren Baghdasaryan
2025-08-26 14:06 ` Yueyang Pan
@ 2025-08-27 2:32 ` Suren Baghdasaryan
2025-08-27 4:47 ` Usama Arif
2025-08-27 21:15 ` Shakeel Butt
2 siblings, 2 replies; 18+ messages in thread
From: Suren Baghdasaryan @ 2025-08-27 2:32 UTC (permalink / raw)
To: Shakeel Butt
Cc: Yueyang Pan, Kent Overstreet, Usama Arif, linux-mm, linux-kernel
On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > Right now in oom_kill_process(), if the OOM is because of a cgroup
> > > > limit, we won't get any memory allocation information. In some cases,
> > > > we can have a large cgroup workload running which dominates the machine.
> > > > The reason for using a cgroup is to leave some resources for the system.
> > > > When this cgroup is killed, we would also like to have some memory
> > > > allocation information for the whole server as well. That is the reason
> > > > behind this mini change. Is it an acceptable thing to do? Will it be too
> > > > much information for people? I am happy with any suggestions!
> > >
> > > For a single patch, it is better to have all the context in the patch
> > > and there is no need for a cover letter.
> >
> > Thanks for your suggestion, Shakeel! I will change this in the next version.
> >
> > >
> > > What exact information do you want on the memcg OOM that will be
> > > helpful for the users in general? You mentioned memory allocation
> > > information; can you please elaborate a bit more?
> > >
> >
> > As in my reply to Suren, I was thinking the system-wide memory usage
> > info provided by show_free_areas() and the memory allocation profiling
> > info could help us debug memcg OOMs by comparing them with historical
> > data. What is your take on this?
> >
>
> I am not really sure about show_free_areas(). More specifically, how
> would the historical data diff be useful for a memcg OOM? If you have a
> concrete example, please give one. For memory allocation profiling, is
> it possible to filter for the given memcg? Do we save memcg information
> in the memory allocation profiling?
Actually, I was thinking about making memory profiling memcg-aware, but
it would be quite costly from both the memory and performance points of
view. Currently we have a per-cpu counter for each allocation site in
the kernel codebase. To make it work for each memcg we would have to add
a memcg dimension to the counters, so each counter becomes per-cpu plus
per-memcg. I'll be thinking about possible optimizations, since many of
these counters will stay at 0, but any such optimization would come at a
performance cost, which we have tried to keep at an absolute minimum.
I'm CC'ing Sourav and Pasha since they were also interested in making
memory allocation profiling memcg-aware. Would Meta folks (Usama,
Shakeel, Johannes) be interested in such an enhancement as well? Would
it be preferable to have such accounting only for specific memcgs that
we pre-select (less memory and performance overhead), or do we need it
for all memcgs as a generic feature? We have some options here, but I
want to understand what would be sufficient and add as little overhead
as possible.
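To make the shape of the problem concrete, a rough sketch (hypothetical,
not a patch; MAX_MEMCGS and the struct names are made up):

#include <linux/types.h>

#define MAX_MEMCGS 128  /* made-up bound */

/* Today (simplified): each allocation site has a per-cpu counter pair. */
struct alloc_tag_counters {
        u64 bytes;
        u64 calls;
};

/*
 * Hypothetical memcg-aware variant: the same per-cpu counters gain a
 * memcg dimension, indexed by something like the memcg id. Most slots
 * would stay at zero, which is both the memory-cost problem and the
 * optimization opportunity mentioned above.
 */
struct alloc_tag_counters_memcg {
        struct alloc_tag_counters cnt[MAX_MEMCGS];
};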
Thanks,
Suren.
* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
2025-08-27 2:32 ` Suren Baghdasaryan
@ 2025-08-27 4:47 ` Usama Arif
2025-08-27 21:15 ` Shakeel Butt
1 sibling, 0 replies; 18+ messages in thread
From: Usama Arif @ 2025-08-27 4:47 UTC (permalink / raw)
To: Suren Baghdasaryan, Shakeel Butt
Cc: Yueyang Pan, Kent Overstreet, linux-mm, linux-kernel, hannes
On 27/08/2025 03:32, Suren Baghdasaryan wrote:
> On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>
>> On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
>>> On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
>>>> On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
>>>>> Right now in oom_kill_process(), if the OOM is because of a cgroup
>>>>> limit, we won't get any memory allocation information. In some cases,
>>>>> we can have a large cgroup workload running which dominates the machine.
>>>>> The reason for using a cgroup is to leave some resources for the system.
>>>>> When this cgroup is killed, we would also like to have some memory
>>>>> allocation information for the whole server as well. That is the reason
>>>>> behind this mini change. Is it an acceptable thing to do? Will it be too
>>>>> much information for people? I am happy with any suggestions!
>>>>
>>>> For a single patch, it is better to have all the context in the patch
>>>> and there is no need for a cover letter.
>>>
>>> Thanks for your suggestion, Shakeel! I will change this in the next version.
>>>
>>>>
>>>> What exact information do you want on the memcg OOM that will be
>>>> helpful for the users in general? You mentioned memory allocation
>>>> information; can you please elaborate a bit more?
>>>>
>>>
>>> As in my reply to Suren, I was thinking the system-wide memory usage
>>> info provided by show_free_areas() and the memory allocation profiling
>>> info could help us debug memcg OOMs by comparing them with historical
>>> data. What is your take on this?
>>>
>>
>> I am not really sure about show_free_areas(). More specifically, how
>> would the historical data diff be useful for a memcg OOM? If you have a
>> concrete example, please give one. For memory allocation profiling, is
>> it possible to filter for the given memcg? Do we save memcg information
>> in the memory allocation profiling?
>
> Actually, I was thinking about making memory profiling memcg-aware, but
> it would be quite costly from both the memory and performance points of
> view. Currently we have a per-cpu counter for each allocation site in
> the kernel codebase. To make it work for each memcg we would have to add
> a memcg dimension to the counters, so each counter becomes per-cpu plus
> per-memcg. I'll be thinking about possible optimizations, since many of
> these counters will stay at 0, but any such optimization would come at a
> performance cost, which we have tried to keep at an absolute minimum.
>
> I'm CC'ing Sourav and Pasha since they were also interested in making
> memory allocation profiling memcg-aware. Would Meta folks (Usama,
> Shakeel, Johannes) be interested in such an enhancement as well? Would
> it be preferable to have such accounting only for specific memcgs that
> we pre-select (less memory and performance overhead), or do we need it
> for all memcgs as a generic feature? We have some options here, but I
> want to understand what would be sufficient and add as little overhead
> as possible.
Yes, having per-memcg counters is going to be extremely useful (we were
thinking of this as a future project to work on). For the Meta fleet in
particular, we might have almost 100 memcgs running, but the number of
memcgs running workloads is quite small (usually fewer than 10).
In the rest, you might have services that are responsible for telemetry,
monitoring, security, etc. (for which we aren't really interested in
the memory allocation profile). So yes, it would be ideal to have the
profile for just pre-selected memcgs, especially if that leads to lower
memory and performance overhead.
Having the memory allocation profile at the memcg level is especially
needed when we have multiple workloads stacked on the same host. Having
it at the host level in such a case makes the data less useful for OOMs
and for workload analysis, as you don't know which workload is
contributing how much.
> Thanks,
> Suren.
* Re: [RFC 0/1] Try to add memory allocation info for cgroup oom kill
2025-08-27 2:32 ` Suren Baghdasaryan
2025-08-27 4:47 ` Usama Arif
@ 2025-08-27 21:15 ` Shakeel Butt
1 sibling, 0 replies; 18+ messages in thread
From: Shakeel Butt @ 2025-08-27 21:15 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Yueyang Pan, Kent Overstreet, Usama Arif, linux-mm, linux-kernel
On Tue, Aug 26, 2025 at 07:32:17PM -0700, Suren Baghdasaryan wrote:
> On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote:
> > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote:
> > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote:
> > > > > Right now in oom_kill_process(), if the OOM is because of a cgroup
> > > > > limit, we won't get any memory allocation information. In some cases,
> > > > > we can have a large cgroup workload running which dominates the machine.
> > > > > The reason for using a cgroup is to leave some resources for the system.
> > > > > When this cgroup is killed, we would also like to have some memory
> > > > > allocation information for the whole server as well. That is the reason
> > > > > behind this mini change. Is it an acceptable thing to do? Will it be too
> > > > > much information for people? I am happy with any suggestions!
> > > >
> > > > For a single patch, it is better to have all the context in the patch
> > > > and there is no need for a cover letter.
> > >
> > > Thanks for your suggestion, Shakeel! I will change this in the next version.
> > >
> > > >
> > > > What exact information do you want on the memcg OOM that will be
> > > > helpful for the users in general? You mentioned memory allocation
> > > > information; can you please elaborate a bit more?
> > > >
> > >
> > > As in my reply to Suren, I was thinking the system-wide memory usage
> > > info provided by show_free_areas() and the memory allocation profiling
> > > info could help us debug memcg OOMs by comparing them with historical
> > > data. What is your take on this?
> > >
> >
> > I am not really sure about show_free_areas(). More specifically, how
> > would the historical data diff be useful for a memcg OOM? If you have a
> > concrete example, please give one. For memory allocation profiling, is
> > it possible to filter for the given memcg? Do we save memcg information
> > in the memory allocation profiling?
>
> Actually, I was thinking about making memory profiling memcg-aware, but
> it would be quite costly from both the memory and performance points of
> view. Currently we have a per-cpu counter for each allocation site in
> the kernel codebase. To make it work for each memcg we would have to add
> a memcg dimension to the counters, so each counter becomes per-cpu plus
> per-memcg. I'll be thinking about possible optimizations, since many of
> these counters will stay at 0, but any such optimization would come at a
> performance cost, which we have tried to keep at an absolute minimum.
>
> I'm CC'ing Sourav and Pasha since they were also interested in making
> memory allocation profiling memcg-aware. Would Meta folks (Usama,
> Shakeel, Johannes) be interested in such an enhancement as well? Would
> it be preferable to have such accounting only for specific memcgs that
> we pre-select (less memory and performance overhead), or do we need it
> for all memcgs as a generic feature? We have some options here, but I
> want to understand what would be sufficient and add as little overhead
> as possible.
Thanks Suren. Yes, as already mentioned by Usama, Meta would be
interested in memcg-aware allocation profiling. I would say start simple
and with as little overhead as possible. More functionality can be added
later when the need arises. Maybe the first useful addition is just
tracking how many allocations for a specific allocation site are
memcg-charged.
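Something as small as one extra per-cpu counter bumped when the
allocation is charged might be enough, e.g. (a sketch; the helper name
and the "accounted_calls" field are invented, while __GFP_ACCOUNT is the
existing flag that marks memcg-charged allocations):

#include <linux/alloc_tag.h>
#include <linux/gfp.h>

/* Sketch: count how many calls at this site were memcg-charged. */
static inline void alloc_tag_account(struct alloc_tag *tag, size_t bytes,
                                     gfp_t flags)
{
        this_cpu_add(tag->counters->bytes, bytes);
        this_cpu_inc(tag->counters->calls);
        if (flags & __GFP_ACCOUNT)
                this_cpu_inc(tag->counters->accounted_calls);
}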