From: "Christian König" <christian.koenig@amd.com>
To: "T.J. Mercier" <tjmercier@google.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>,
Dave Airlie <airlied@gmail.com>,
dri-devel@lists.freedesktop.org, tj@kernel.org,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@kernel.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Muchun Song <muchun.song@linux.dev>,
cgroups@vger.kernel.org, Dave Chinner <david@fromorbit.com>,
Waiman Long <longman@redhat.com>,
simona@ffwll.ch, Suren Baghdasaryan <surenb@google.com>
Subject: Re: [PATCH 07/16] memcg: add support for GPU page counters. (v4)
Date: Tue, 3 Mar 2026 10:29:24 +0100
Message-ID: <614c3c39-1e11-4da4-b5ac-b8a6432dac7e@amd.com>
In-Reply-To: <CABdmKX0=xPiwXgOHskGkE9Umj5=NrC=7OtngJjrm=mtOZmyzvA@mail.gmail.com>
On 3/2/26 20:35, T.J. Mercier wrote:
> On Mon, Mar 2, 2026 at 7:51 AM Christian König <christian.koenig@amd.com> wrote:
>>
>> On 3/2/26 16:40, Shakeel Butt wrote:
>>> +TJ
>>>
>>> On Mon, Mar 02, 2026 at 03:37:37PM +0100, Christian König wrote:
>>>> On 3/2/26 15:15, Shakeel Butt wrote:
>>>>> On Wed, Feb 25, 2026 at 10:09:55AM +0100, Christian König wrote:
>>>>>> On 2/24/26 20:28, Dave Airlie wrote:
>>>>> [...]
>>>>>>
>>>>>>> This has been a pain in the ass for desktop for years, and I'd like to
>>>>>>> fix it; the HPC use case is purely a driver for me doing the work.
>>>>>>
>>>>>> Wait a second. How does accounting to cgroups help with that in any way?
>>>>>>
>>>>>> The last time I looked into this problem the OOM killer worked based on the per task_struct stats which couldn't be influenced this way.
>>>>>>
>>>>>
>>>>> It depends on the context of the oom-killer. If the oom-killer is triggered due
>>>>> to memcg limits then only the processes in the scope of the memcg will be
>>>>> targeted by the oom-killer. With a specific setting, the oom-killer can kill
>>>>> all the processes in the target memcg.
>>>>>
>>>>> However, nowadays the userspace oom-killer is preferred over the kernel
>>>>> oom-killer due to flexibility and configurability. Userspace oom-killers like
>>>>> systemd-oomd, Android's LMKD or fb-oomd are being used in containerized
>>>>> environments. Such oom-killers look at memcg stats, and hiding something
>>>>> from memcg (i.e. not charging it to memcg) will hide such usage from these
>>>>> oom-killers.
>>>>
>>>> Well, that's exactly the problem. Android's oom killer is *not* using memcg exactly because of this inflexibility.
>>>
>>> Are you sure Android's oom killer is not using memcg? From what I see in the
>>> documentation [1], it requires memcg.
>
> LMKD used to use memcg v1 for memory.pressure_level, but that has been
> replaced by PSI which is now the default configuration. I deprecated
> all configurations with memcg v1 dependencies in January. We plan to
> remove the memcg v1 support from LMKD when the 5.10 and 5.15 kernels
> reach EOL.
>
>> My bad, I should have been wording that better.
>>
>> The Android OOM killer is not using memcg for tracking GPU memory allocations, because memcg doesn't have proper support for tracking shared buffers.
>>
>> In other words, GPU memory allocations are shared by design, and it is the norm that the process using a buffer is not the process which allocated it.
>>
>> What we would need (as a start) to handle all of this with memcg would be to account the resources to the process which references them, not the one which allocated them.
>>
>> I can give a full list of requirements which would be needed by cgroups to cover all the different use cases, but it basically means tons of extra complexity.
>
> Yeah, this is right. We usually prioritize fast kills rather than
> picking the biggest offender though. Application state (foreground /
> background) is the primary selector; however, LMKD does have a mode
> (kill_heaviest_task) where it will pick the largest task within a
> group of apps sharing the same application state. For this it uses RSS
> from /proc/<pid>/statm, and (prepare to avert your eyes) a new and
> out-of-tree interface in procfs for accounting dmabufs used by a process.
> It tracks FD references and map references as they come and go, and
> only counts any buffer once per process, regardless of the number and
> type of references the process has to it. I dislike it greatly.
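
Just to make explicit what that RSS number is: it is literally the second field of /proc/<pid>/statm, in pages. A minimal sketch of that lookup (the helper name is made up and error handling is kept deliberately small):

	/* Sketch only: read a process' RSS the way LMKD's
	 * kill_heaviest_task mode does, i.e. the "resident" field of
	 * /proc/<pid>/statm (reported in pages).
	 */
	#include <stdio.h>
	#include <sys/types.h>
	#include <unistd.h>

	static long long read_rss_bytes(pid_t pid)
	{
		char path[64];
		unsigned long size, resident;
		FILE *f;

		snprintf(path, sizeof(path), "/proc/%d/statm", (int)pid);
		f = fopen(path, "r");
		if (!f)
			return -1;
		/* statm: size resident shared text lib data dt (all in pages) */
		if (fscanf(f, "%lu %lu", &size, &resident) != 2)
			resident = 0;
		fclose(f);

		return (long long)resident * sysconf(_SC_PAGESIZE);
	}

	int main(void)
	{
		printf("self RSS: %lld bytes\n", read_rss_bytes(getpid()));
		return 0;
	}
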
*sigh* I was really hoping that we had nailed it with the BPF support for DMA-buf and would no longer rely on out-of-tree stuff.

We should really stop re-inventing the wheel over and over again, fix the shortcomings cgroups has instead, and then use that.
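
To make the memcg side concrete: a userspace oom-killer of the kind Shakeel describes above only needs to read memory.current and memory.stat from cgroupfs, and whatever GPU counters this series adds would then show up as additional memory.stat keys. A rough sketch; the cgroup path and the "gpu" key below are placeholders, not the names from the patch:

	/* Sketch only: read memory.current and scan memory.stat for a
	 * cgroup v2 group, the way a userspace oom-killer would.
	 */
	#include <stdio.h>
	#include <string.h>

	static unsigned long long read_memcg_current(const char *cgrp)
	{
		char path[256];
		unsigned long long val = 0;
		FILE *f;

		snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/memory.current", cgrp);
		f = fopen(path, "r");
		if (!f)
			return 0;
		if (fscanf(f, "%llu", &val) != 1)
			val = 0;
		fclose(f);
		return val;
	}

	static unsigned long long read_memcg_stat(const char *cgrp, const char *key)
	{
		char path[256], name[64];
		unsigned long long val;
		FILE *f;

		snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/memory.stat", cgrp);
		f = fopen(path, "r");
		if (!f)
			return 0;
		/* memory.stat is a list of "key value" lines */
		while (fscanf(f, "%63s %llu", name, &val) == 2) {
			if (!strcmp(name, key)) {
				fclose(f);
				return val;
			}
		}
		fclose(f);
		return 0;
	}

	int main(void)
	{
		/* "user.slice" and "gpu" are illustrative placeholders */
		printf("current: %llu\n", read_memcg_current("user.slice"));
		printf("gpu:     %llu\n", read_memcg_stat("user.slice", "gpu"));
		return 0;
	}
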
> My original intention was to use the dmabuf BPF iterator we added to
> scan maps and FDs of a process for dmabufs on demand. Very simple and
> pretty fast in BPF. This wouldn't support high watermark tracking, so
> I was forced into doing something else for per-process accounting. To
> be fair, the HWM tracking has detected a few application bugs where
> 4GB of system memory was inadvertently consumed by dmabufs.
>
> The BPF iterator is currently used to support accounting of buffers
> not visible in userspace (dmabuf_dump / libdmabufinfo) and it's a nice
> improvement for that over the old sysfs interface. I hope to replace
> the slow scanning of procfs for dmabufs in libdmabufinfo with BPF
> programs that use the dmabuf iterator, but that's not a priority for
> this year.
>
> Independent of all of that, memcg doesn't really work well for this
> because it's shared memory that can only be attributed to a single
> memcg, and the most common allocator (gralloc) is in a separate
> process and memcg from the processes using the buffers (camera,
> YouTube, etc.). I had a few patches that transferred the ownership of
> buffers to a new memcg when they were sent via Binder, but this used
> the memcg v1 charge moving functionality which is now gone because it
> was so complicated. But that only works if there is one user that
> should be charged for the buffer anyway. What if it is shared by
> multiple applications and services?
Well the "usual" (e.g. what you find in the literature and what other operating systems do) approach is to use a proportional set size instead of the resident set size: https://en.wikipedia.org/wiki/Proportional_set_size
The problem is that a proportional set size is usually harder to come by. So it means additional overhead, more complex interfaces etc...
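
As a purely illustrative piece of arithmetic (not a proposal for an actual interface): with per-buffer reference counts, each referencing process would be charged size / nr_references, which is the PSS idea applied to shared GPU buffers:

	/* Sketch only: RSS-style vs PSS-style attribution of a shared buffer. */
	#include <stdio.h>
	#include <stddef.h>

	struct shared_buf {
		size_t size;           /* buffer size in bytes */
		unsigned int nr_refs;  /* processes currently referencing it */
	};

	/* RSS-style: every referencing process is charged the full size. */
	static size_t charge_rss(const struct shared_buf *b)
	{
		return b->size;
	}

	/* PSS-style: the size is split evenly across all references. */
	static size_t charge_pss(const struct shared_buf *b)
	{
		return b->nr_refs ? b->size / b->nr_refs : 0;
	}

	int main(void)
	{
		/* e.g. a 64 MiB gralloc buffer shared by camera, compositor, app */
		struct shared_buf buf = { .size = 64 << 20, .nr_refs = 3 };

		printf("per-process RSS charge: %zu bytes\n", charge_rss(&buf));
		printf("per-process PSS charge: %zu bytes\n", charge_pss(&buf));
		return 0;
	}
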
Regards,
Christian.
>
>> Regards,
>> Christian.
>>
>>>
>>> [1] https://source.android.com/docs/core/perf/lmkd
>>>
>>>>
>>>> See the multiple iterations we already had on that topic. Even including reverting already upstream uAPI.
>>>>
>>>> The latest incarnation is that BPF is used for this task on Android.
>>>>
>>>> Regards,
>>>> Christian.
>>
Thread overview: 35+ messages
2026-02-24 2:06 drm/ttm/memcg/lru: enable memcg tracking for ttm and amdgpu driver (complete series v5) Dave Airlie
2026-02-24 2:06 ` [PATCH 01/16] mm: add gpu active/reclaim per-node stat counters (v2) Dave Airlie
2026-02-24 2:06 ` [PATCH 02/16] drm/ttm: use gpu mm stats to track gpu memory allocations. (v4) Dave Airlie
2026-02-24 2:06 ` [PATCH 03/16] ttm/pool: port to list_lru. (v2) Dave Airlie
2026-02-24 2:06 ` [PATCH 04/16] ttm/pool: drop numa specific pools Dave Airlie
2026-02-24 2:06 ` [PATCH 05/16] ttm/pool: make pool shrinker NUMA aware (v2) Dave Airlie
2026-02-24 2:06 ` [PATCH 06/16] ttm/pool: track allocated_pages per numa node Dave Airlie
2026-02-24 2:06 ` [PATCH 07/16] memcg: add support for GPU page counters. (v4) Dave Airlie
2026-02-24 7:20 ` kernel test robot
2026-02-24 7:50 ` Christian König
2026-02-24 19:28 ` Dave Airlie
2026-02-25 9:09 ` Christian König
2026-03-02 14:15 ` Shakeel Butt
2026-03-02 14:37 ` Christian König
2026-03-02 15:40 ` Shakeel Butt
2026-03-02 15:51 ` Christian König
2026-03-02 17:16 ` Shakeel Butt
2026-03-02 19:36 ` Christian König
2026-03-05 3:23 ` Dave Airlie
2026-03-02 19:35 ` T.J. Mercier
2026-03-03 9:29 ` Christian König [this message]
2026-03-03 17:25 ` T.J. Mercier
2026-03-05 3:19 ` Dave Airlie
2026-03-05 9:25 ` Christian König
2026-03-10 1:27 ` T.J. Mercier
2026-02-24 2:06 ` [PATCH 08/16] ttm: add a memcg accounting flag to the alloc/populate APIs Dave Airlie
2026-02-24 8:42 ` kernel test robot
2026-02-24 2:06 ` [PATCH 09/16] ttm/pool: initialise the shrinker earlier Dave Airlie
2026-02-24 2:06 ` [PATCH 10/16] ttm: add objcg pointer to bo and tt (v2) Dave Airlie
2026-02-24 2:06 ` [PATCH 11/16] ttm/pool: enable memcg tracking and shrinker. (v3) Dave Airlie
2026-02-24 2:06 ` [PATCH 12/16] ttm: hook up memcg placement flags Dave Airlie
2026-02-24 2:06 ` [PATCH 13/16] memcontrol: allow objcg api when memcg is config off Dave Airlie
2026-02-24 2:06 ` [PATCH 14/16] amdgpu: add support for memory cgroups Dave Airlie
2026-02-24 2:06 ` [PATCH 15/16] ttm: add support for a module option to disable memcg integration Dave Airlie
2026-02-24 2:06 ` [PATCH 16/16] xe: create a flag to enable memcg accounting for XE as well Dave Airlie