linux-mm.kvack.org archive mirror
* list_lru isolate callback question?
@ 2025-06-05  2:16 Dave Airlie
  2025-06-05  7:55 ` Kairui Song
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Airlie @ 2025-06-05  2:16 UTC (permalink / raw)
  To: kasong, Dave Chinner, Johannes Weiner,
	Linux Memory Management List

I've hit a case where I think it might be valuable to have the nid +
struct mem_cgroup for the item being iterated available in the isolate
callback. I know in theory we should be able to retrieve them from the
item, but I'm also not convinced we should need to, since we already
have them in the outer function:

typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
                        struct list_lru_one *list,
                        int nid,
                        struct mem_cgroup *memcg,
                        void *cb_arg);

It's probably not essential (I think I can get the nid back easily,
not sure about the memcg yet), but I thought I'd ask if there would be
resistance against just adding them to the callback?

Dave.



* Re: list_lru isolate callback question?
  2025-06-05  2:16 list_lru isolate callback question? Dave Airlie
@ 2025-06-05  7:55 ` Kairui Song
  2025-06-05  9:22   ` Dave Airlie
  0 siblings, 1 reply; 13+ messages in thread
From: Kairui Song @ 2025-06-05  7:55 UTC (permalink / raw)
  To: Dave Airlie; +Cc: Dave Chinner, Johannes Weiner, Linux Memory Management List

On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie <airlied@gmail.com> wrote:
>
> I've hit a case where I think it might be valuable to have the nid +
> struct memcg for the item being iterated available in the isolate
> callback, I know in theory we should be able to retrieve it from the
> item, but I'm also not convinced we should need to since we have it
> already in the outer function?
>
> typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
>                         struct list_lru_one *list,
>                         int nid,
>                         struct mem_cgroup *memcg,
>                         void *cb_arg);
>

Hi Dave,

> It's probably not essential (I think I can get the nid back easily,
> not sure about the memcg yet), but I thought I'd ask if there would be

If it's a slab object you should be able to get it easily with:
memcg = mem_cgroup_from_slab_obj(item);
nid = page_to_nid(virt_to_page(item));
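
For example (a rough sketch against the current three-argument callback
signature; "my_isolate" is just an illustrative name), an isolate callback
could recover both directly from the item:

static enum lru_status my_isolate(struct list_head *item,
                                  struct list_lru_one *list,
                                  void *cb_arg)
{
        struct mem_cgroup *memcg = mem_cgroup_from_slab_obj(item);
        int nid = page_to_nid(virt_to_page(item));

        /* ... use nid/memcg to decide what to do with the item ... */
        return LRU_ROTATE;
}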

> resistance against just adding them to the callback?

I'm not sure about the context here, but I personally prefer to keep the
function signature minimal unless necessary, so that !CONFIG_MEMCG or
single-node builds won't carry two dummy parameters here. Most callers
won't need them, and IIUC the compiler can't optimize that out.

>
> Dave.
>



* Re: list_lru isolate callback question?
  2025-06-05  7:55 ` Kairui Song
@ 2025-06-05  9:22   ` Dave Airlie
  2025-06-05 13:53     ` Matthew Wilcox
  2025-06-05 22:39     ` Dave Chinner
  0 siblings, 2 replies; 13+ messages in thread
From: Dave Airlie @ 2025-06-05  9:22 UTC (permalink / raw)
  To: Kairui Song; +Cc: Dave Chinner, Johannes Weiner, Linux Memory Management List

On Thu, 5 Jun 2025 at 17:55, Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie <airlied@gmail.com> wrote:
> >
> > I've hit a case where I think it might be valuable to have the nid +
> > struct memcg for the item being iterated available in the isolate
> > callback, I know in theory we should be able to retrieve it from the
> > item, but I'm also not convinced we should need to since we have it
> > already in the outer function?
> >
> > typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
> >                         struct list_lru_one *list,
> >                         int nid,
> >                         struct mem_cgroup *memcg,
> >                         void *cb_arg);
> >
>
> Hi Dave,
>
> > It's probably not essential (I think I can get the nid back easily,
> > not sure about the memcg yet), but I thought I'd ask if there would be
>
> If it's a slab object you should be able to get it easily with:
> memcg = mem_cgroup_from_slab_obj(item));
> nid = page_to_nid(virt_to_page(item));
>

It's in relation to some work trying to tie GPU system memory
allocations into memcg properly.

They're not slab objects, but I do have pages, so I'm using page_to_nid
right now. However, these pages aren't currently setting p->memcg_data,
as I don't need that for this, but maybe this gives me a reason to go
down that road.

> > resistance against just adding them to the callback?
>
> I'm not sure about the context here, I personally prefer to keep the
> function minimized unless necessary, so things like !CONFIG_MEMCG or
> single node builds won't have two dummy parameters here, and
> most caller won't need them, the compiler can't optimize that
> out IIUC.

I reconsidered and was wondering if

struct lru_walk_args {
    struct list_lru_one *list;
    int nid;
    struct mem_cgroup *memcg;
};

would also be an option instead of adding two unused args.
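
The callback could then stay at three parameters, something like (just a
sketch to match the struct above):

typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
                        struct lru_walk_args *args,
                        void *cb_arg);

with args->list, args->nid and args->memcg filled in once by the walk code
before it starts calling back.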

But I'll see if I can make it work once I get the memcg pieces of the
puzzle sorted out.

Dave.
>
> >
> > Dave.
> >



* Re: list_lru isolate callback question?
  2025-06-05  9:22   ` Dave Airlie
@ 2025-06-05 13:53     ` Matthew Wilcox
  2025-06-05 20:59       ` Dave Airlie
  2025-06-05 22:39     ` Dave Chinner
  1 sibling, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2025-06-05 13:53 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Kairui Song, Dave Chinner, Johannes Weiner,
	Linux Memory Management List

On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote:
> Not slab objects, but I do have pages so I'm using page_to_nid right now,
> however these pages aren't currently setting p->memcg_data as I don't
> need that for this, but maybe
> this gives me a reason to go down that road.

Please don't.  ->memcg_data is moving from struct page to the containing
object (slab/folio/...)



* Re: list_lru isolate callback question?
  2025-06-05 13:53     ` Matthew Wilcox
@ 2025-06-05 20:59       ` Dave Airlie
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Airlie @ 2025-06-05 20:59 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kairui Song, Dave Chinner, Johannes Weiner,
	Linux Memory Management List

On Thu, 5 Jun 2025 at 23:53, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote:
> > Not slab objects, but I do have pages so I'm using page_to_nid right now,
> > however these pages aren't currently setting p->memcg_data as I don't
> > need that for this, but maybe
> > this gives me a reason to go down that road.
>
> Please don't.  ->memcg_data is moving from struct page to the containing
> object (slab/folio/...)

I think I'd like to move all the code in the TTM page pool handling
over to folios, but I started pulling on that thread one day and
realised I had no idea what I was doing.

I think porting all the x86 set_pages_* interfaces is where I'd need
to start.

Dave.



* Re: list_lru isolate callback question?
  2025-06-05  9:22   ` Dave Airlie
  2025-06-05 13:53     ` Matthew Wilcox
@ 2025-06-05 22:39     ` Dave Chinner
  2025-06-05 22:59       ` Dave Airlie
  1 sibling, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2025-06-05 22:39 UTC (permalink / raw)
  To: Dave Airlie; +Cc: Kairui Song, Johannes Weiner, Linux Memory Management List

On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote:
> On Thu, 5 Jun 2025 at 17:55, Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie <airlied@gmail.com> wrote:
> > >
> > > I've hit a case where I think it might be valuable to have the nid +
> > > struct memcg for the item being iterated available in the isolate
> > > callback, I know in theory we should be able to retrieve it from the
> > > item, but I'm also not convinced we should need to since we have it
> > > already in the outer function?
> > >
> > > typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
> > >                         struct list_lru_one *list,
> > >                         int nid,
> > >                         struct mem_cgroup *memcg,
> > >                         void *cb_arg);
> > >
> >
> > Hi Dave,
> >
> > > It's probably not essential (I think I can get the nid back easily,
> > > not sure about the memcg yet), but I thought I'd ask if there would be
> >
> > If it's a slab object you should be able to get it easily with:
> > memcg = mem_cgroup_from_slab_obj(item));
> > nid = page_to_nid(virt_to_page(item));
> >
> 
> It's in relation to some work trying to tie GPU system memory
> allocations into memcg properly,
> 
> Not slab objects, but I do have pages so I'm using page_to_nid right now,
> however these pages aren't currently setting p->memcg_data as I don't
> need that for this, but maybe
> this gives me a reason to go down that road.

How are you accounting the page to the memcg if the page is not
marked as owned by a specific memcg?

Are you relying on the page being indexed in a specific list_lru to
account for the page correctly in reclaim contexts, and is that why
you need this information in the walk context?

I'd actually like to know more details of the problem you are trying
to solve - all I've heard is "we're trying to do <something> with
GPUs and memcgs with list_lrus", but I don't know what it is so I
can't really give decent feedback on your questions....

> > > resistance against just adding them to the callback?
> >
> > I'm not sure about the context here, I personally prefer to keep the
> > function minimized unless necessary, so things like !CONFIG_MEMCG or
> > single node builds won't have two dummy parameters here, and
> > most caller won't need them, the compiler can't optimize that
> > out IIUC.
> 
> I reconsidered and was wondering if
> 
> struct lru_walk_args {
>     struct list_lru_one *list;
>     int nid;
>     struct mem_cgroup *memcg;
> }
> would also be an option instead of adding two unused args.

The walk function is passed a struct list_lru_one. If there is a
need to get the {nid,memcg} of the objects efficiently from walk
contexts, then we should encode them into the struct list_lru_one
at init time and retrieve them from there.
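
Roughly something like (a sketch; the existing fields here are abbreviated
and the two additions are hypothetical):

struct list_lru_one {
        struct list_head        list;
        long                    nr_items;
        /* hypothetical additions, filled in at list_lru init time: */
        int                     nid;
        struct mem_cgroup       *memcg;
};

The walk callback already gets the struct list_lru_one pointer, so it
could then just read list->nid / list->memcg without any new arguments.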

-Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: list_lru isolate callback question?
  2025-06-05 22:39     ` Dave Chinner
@ 2025-06-05 22:59       ` Dave Airlie
  2025-06-10 22:44         ` Dave Chinner
                           ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Dave Airlie @ 2025-06-05 22:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Kairui Song, Johannes Weiner, Linux Memory Management List

On Fri, 6 Jun 2025 at 08:39, Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote:
> > On Thu, 5 Jun 2025 at 17:55, Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie <airlied@gmail.com> wrote:
> > > >
> > > > I've hit a case where I think it might be valuable to have the nid +
> > > > struct memcg for the item being iterated available in the isolate
> > > > callback, I know in theory we should be able to retrieve it from the
> > > > item, but I'm also not convinced we should need to since we have it
> > > > already in the outer function?
> > > >
> > > > typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
> > > >                         struct list_lru_one *list,
> > > >                         int nid,
> > > >                         struct mem_cgroup *memcg,
> > > >                         void *cb_arg);
> > > >
> > >
> > > Hi Dave,
> > >
> > > > It's probably not essential (I think I can get the nid back easily,
> > > > not sure about the memcg yet), but I thought I'd ask if there would be
> > >
> > > If it's a slab object you should be able to get it easily with:
> > > memcg = mem_cgroup_from_slab_obj(item));
> > > nid = page_to_nid(virt_to_page(item));
> > >
> >
> > It's in relation to some work trying to tie GPU system memory
> > allocations into memcg properly,
> >
> > Not slab objects, but I do have pages so I'm using page_to_nid right now,
> > however these pages aren't currently setting p->memcg_data as I don't
> > need that for this, but maybe
> > this gives me a reason to go down that road.
>
> How are you accounting the page to the memcg if the page is not
> marked as owned by as specific memcg?
>
> Are you relying on the page being indexed in a specific list_lru to
> account for the page correcting in reclaim contexts, and that's why
> you need this information in the walk context?
>
> I'd actually like to know more details of the problem you are trying
> to solve - all I've heard is "we're trying to do <something> with
> GPUs and memcgs with list_lrus", but I don't know what it is so I
> can't really give decent feedback on your questions....
>

Big picture problem: GPU drivers do a lot of memory allocations for
userspace applications that have historically not gone via memcg
accounting. This has been pointed out as bad and something that should
be fixed.

As part of that problem, GPU drivers have the ability to hand out
uncached/writecombined pages to userspace. Creating these pages
requires changing page attributes, which is a heavyweight operation,
and that necessitates page pools. These page pools currently have only
a global shrinker and roll their own NUMA awareness. The
uncached/writecombined memory isn't a core feature of userspace usage
patterns, but since we want to do things right it seems like a good
idea to clean up that space first.

The goal is to get proper vmstat/memcg tracking for all allocations
done for the GPU. These can be very large, so I think we should add
core mm counters for them, and memcg ones as well, so userspace can see
them and make more educated decisions.
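
Charging a pool allocation might then look roughly like this (a sketch;
the counter names NR_GPU_PAGES and MEMCG_GPU are made up for illustration
and don't exist today):

/* account nr pages handed out for GPU use */
mod_node_page_state(page_pgdat(page), NR_GPU_PAGES, nr);
mod_memcg_state(memcg, MEMCG_GPU, nr);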

We don't need page-level memcg tracking, as the pages are all either
allocated to the process as part of a larger buffer object, or sitting
in the pool, which has the memcg info, so we aren't intending to use
__GFP_ACCOUNT at this stage. I also don't really like having this as
part of kmem; these are mostly userspace-only things, used by the GPU
on behalf of userspace.

My rough plan:
1. convert TTM page pools over to list_lru and use a NUMA aware shrinker
2. add global and memcg counters and tracking.
3. convert TTM page pools over to memcg aware shrinker so we get the
proper operation inside a memcg for some niche use cases.
4. Figure out how to deal with memory evictions from VRAM - this is
probably the hardest problem to solve as there is no great policy.

Also handwave shouldn't this all be folios at some point.

>
> The walk function is passed a struct list_lru_one. If there is a
> need to get the {nid,memcg} of the objects efficiently from walk
> contexts, then we should encode them into the struct list_lru_one
> at init time and retreive them from there.

Oh interesting, that might also be a decent option.

Dave.



* Re: list_lru isolate callback question?
  2025-06-05 22:59       ` Dave Airlie
@ 2025-06-10 22:44         ` Dave Chinner
  2025-06-11  1:40           ` Dave Airlie
  2025-06-10 23:07         ` Balbir Singh
  2025-06-11  3:36         ` Matthew Wilcox
  2 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2025-06-10 22:44 UTC (permalink / raw)
  To: Dave Airlie; +Cc: Kairui Song, Johannes Weiner, Linux Memory Management List

On Fri, Jun 06, 2025 at 08:59:16AM +1000, Dave Airlie wrote:
> On Fri, 6 Jun 2025 at 08:39, Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote:
> > > On Thu, 5 Jun 2025 at 17:55, Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie <airlied@gmail.com> wrote:
> > > > >
> > > > > I've hit a case where I think it might be valuable to have the nid +
> > > > > struct memcg for the item being iterated available in the isolate
> > > > > callback, I know in theory we should be able to retrieve it from the
> > > > > item, but I'm also not convinced we should need to since we have it
> > > > > already in the outer function?
> > > > >
> > > > > typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
> > > > >                         struct list_lru_one *list,
> > > > >                         int nid,
> > > > >                         struct mem_cgroup *memcg,
> > > > >                         void *cb_arg);
> > > > >
> > > >
> > > > Hi Dave,
> > > >
> > > > > It's probably not essential (I think I can get the nid back easily,
> > > > > not sure about the memcg yet), but I thought I'd ask if there would be
> > > >
> > > > If it's a slab object you should be able to get it easily with:
> > > > memcg = mem_cgroup_from_slab_obj(item));
> > > > nid = page_to_nid(virt_to_page(item));
> > > >
> > >
> > > It's in relation to some work trying to tie GPU system memory
> > > allocations into memcg properly,
> > >
> > > Not slab objects, but I do have pages so I'm using page_to_nid right now,
> > > however these pages aren't currently setting p->memcg_data as I don't
> > > need that for this, but maybe
> > > this gives me a reason to go down that road.
> >
> > How are you accounting the page to the memcg if the page is not
> > marked as owned by as specific memcg?
> >
> > Are you relying on the page being indexed in a specific list_lru to
> > account for the page correcting in reclaim contexts, and that's why
> > you need this information in the walk context?
> >
> > I'd actually like to know more details of the problem you are trying
> > to solve - all I've heard is "we're trying to do <something> with
> > GPUs and memcgs with list_lrus", but I don't know what it is so I
> > can't really give decent feedback on your questions....
> >
> 
> Big picture problem, GPU drivers do a lot of memory allocations for
> userspace applications that historically have not gone via memcg
> accounting. This has been pointed out to be bad and should be fixed.
> 
> As part of that problem, GPU drivers have the ability to hand out
> uncached/writecombined pages to userspace, creating these pages
> requires changing attributes and as such is a heavy weight operation
> which necessitates page pools. These page pools only currently have a
> global shrinker and roll their own NUMA awareness.

Ok, it looks to me like there's been a proliferation of these pools
and shrinkers in recent times? I was aware of the TTM + i915/gem
shrinkers, but now that I look I see XE, panfrost and MSM all have
their own custom shrinkers? Ah, panfrost and msm look simple, but
XE is a wrapper around ttm that does all sorts of weird runtime PM
stuff.

I don't see anything obviously NUMA aware in any of them, though....

> The
> uncached/writecombined memory isn't a core feature of userspace usage
> patterns, but since we want to do things right it seems like a good
> idea to clean up the space first.
> 
> Get proper vmstat/memcg tracking for all allocations done for the GPU,
> these can be very large, so I think we should add core mm counters for
> them and memcg ones as well, so userspace can see them and make more
> educated decisions.

That's not really related to shrinkers and LRUs, so I'll leave that
for you to take up with the core-mm people. :)

> We don't need page level memcg tracking as the pages are all either
> allocated to the process as part of a larger buffer object, or the
> pages are in the pool which has the memcg info, so we aren't intending
> on using __GFP_ACCOUNT at this stage. I also don't really like having
> this as part of kmem, these really are userspace only things mostly
> and they are mostly used by gpu and userspace.

Seems reasonable to me if you can manage the memcgs outside the LRU
contexts sanely.

> My rough plan:
> 1. convert TTM page pools over to list_lru and use a NUMA aware shrinker
> 2. add global and memcg counters and tracking.
> 3. convert TTM page pools over to memcg aware shrinker so we get the
> proper operation inside a memcg for some niche use cases.

Once you've converted to list_lru, this step should mainly be
adding a flag to the shrinker and changing the list_lru init
function.
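
i.e. roughly something like this (a sketch against the current shrinker
API; the shrinker name and callback names are illustrative):

pool->shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
                                "drm-ttm_pool");
if (!pool->shrinker)
        return -ENOMEM;
pool->shrinker->count_objects = ttm_pool_shrink_count;
pool->shrinker->scan_objects = ttm_pool_shrink_scan;
pool->shrinker->private_data = pool;

err = list_lru_init_memcg(&pool->lru, pool->shrinker);
if (err) {
        shrinker_free(pool->shrinker);
        return err;
}
shrinker_register(pool->shrinker);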

Just remember that list_lru_{add,del}() require external means of
pinning the memcg while the LRU operation is being done. i.e. the
indexing and management of memcg lifetimes for LRU operations is the
responsibility of the external code, not the list_lru
infrastructure.
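
For example (a sketch; assumes the list_lru_add() variant that takes
nid/memcg explicitly, and that the allocating mm is the right one to
charge):

struct mem_cgroup *memcg = get_mem_cgroup_from_mm(current->mm); /* takes a ref */
int nid = page_to_nid(page);

list_lru_add(&pool->lru, &page->lru, nid, memcg);

/* list_lru does not pin the memcg for us; drop our reference when done */
mem_cgroup_put(memcg);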

> 4. Figure out how to deal with memory evictions from VRAM - this is
> probably the hardest problem to solve as there is no great policy.

Buffer objects on the LRU can be unused by userspace but still
mapped into VRAM at the time the shrinker tries to reclaim them?
Hence the shrinker tries to evict them from VRAM to reclaim the
memory? Or are you talking about something else?

> Also handwave shouldn't this all be folios at some point.

Or maybe a new uncached/writecombined specific type tailored to the
exact needs of GPU resource management. </wave>

-Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: list_lru isolate callback question?
  2025-06-05 22:59       ` Dave Airlie
  2025-06-10 22:44         ` Dave Chinner
@ 2025-06-10 23:07         ` Balbir Singh
  2025-06-11  1:43           ` Dave Airlie
  2025-06-11  3:36         ` Matthew Wilcox
  2 siblings, 1 reply; 13+ messages in thread
From: Balbir Singh @ 2025-06-10 23:07 UTC (permalink / raw)
  To: Dave Airlie, Dave Chinner
  Cc: Kairui Song, Johannes Weiner, Linux Memory Management List

On 6/6/25 08:59, Dave Airlie wrote:
> On Fri, 6 Jun 2025 at 08:39, Dave Chinner <david@fromorbit.com> wrote:
>>
>> On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote:
>>> On Thu, 5 Jun 2025 at 17:55, Kairui Song <ryncsn@gmail.com> wrote:
>>>>
>>>> On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie <airlied@gmail.com> wrote:
>>>>>
>>>>> I've hit a case where I think it might be valuable to have the nid +
>>>>> struct memcg for the item being iterated available in the isolate
>>>>> callback, I know in theory we should be able to retrieve it from the
>>>>> item, but I'm also not convinced we should need to since we have it
>>>>> already in the outer function?
>>>>>
>>>>> typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
>>>>>                         struct list_lru_one *list,
>>>>>                         int nid,
>>>>>                         struct mem_cgroup *memcg,
>>>>>                         void *cb_arg);
>>>>>
>>>>
>>>> Hi Dave,
>>>>
>>>>> It's probably not essential (I think I can get the nid back easily,
>>>>> not sure about the memcg yet), but I thought I'd ask if there would be
>>>>
>>>> If it's a slab object you should be able to get it easily with:
>>>> memcg = mem_cgroup_from_slab_obj(item));
>>>> nid = page_to_nid(virt_to_page(item));
>>>>
>>>
>>> It's in relation to some work trying to tie GPU system memory
>>> allocations into memcg properly,
>>>
>>> Not slab objects, but I do have pages so I'm using page_to_nid right now,
>>> however these pages aren't currently setting p->memcg_data as I don't
>>> need that for this, but maybe
>>> this gives me a reason to go down that road.
>>
>> How are you accounting the page to the memcg if the page is not
>> marked as owned by as specific memcg?
>>
>> Are you relying on the page being indexed in a specific list_lru to
>> account for the page correcting in reclaim contexts, and that's why
>> you need this information in the walk context?
>>
>> I'd actually like to know more details of the problem you are trying
>> to solve - all I've heard is "we're trying to do <something> with
>> GPUs and memcgs with list_lrus", but I don't know what it is so I
>> can't really give decent feedback on your questions....
>>
> 
> Big picture problem, GPU drivers do a lot of memory allocations for
> userspace applications that historically have not gone via memcg
> accounting. This has been pointed out to be bad and should be fixed.
> 
> As part of that problem, GPU drivers have the ability to hand out
> uncached/writecombined pages to userspace, creating these pages
> requires changing attributes and as such is a heavy weight operation
> which necessitates page pools. These page pools only currently have a
> global shrinker and roll their own NUMA awareness. The
> uncached/writecombined memory isn't a core feature of userspace usage
> patterns, but since we want to do things right it seems like a good
> idea to clean up the space first.
> 
> Get proper vmstat/memcg tracking for all allocations done for the GPU,
> these can be very large, so I think we should add core mm counters for
> them and memcg ones as well, so userspace can see them and make more
> educated decisions.
> 
> We don't need page level memcg tracking as the pages are all either
> allocated to the process as part of a larger buffer object, or the
> pages are in the pool which has the memcg info, so we aren't intending
> on using __GFP_ACCOUNT at this stage. I also don't really like having
> this as part of kmem, these really are userspace only things mostly
> and they are mostly used by gpu and userspace.
> 
> My rough plan:
> 1. convert TTM page pools over to list_lru and use a NUMA aware shrinker
> 2. add global and memcg counters and tracking.
> 3. convert TTM page pools over to memcg aware shrinker so we get the
> proper operation inside a memcg for some niche use cases.
> 4. Figure out how to deal with memory evictions from VRAM - this is
> probably the hardest problem to solve as there is no great policy.
> 
> Also handwave shouldn't this all be folios at some point.
> 

The key requirement for memcg would be to track the mm on whose behalf
the allocation was made.

kmemcg (__GFP_ACCOUNT) tracks only kernel allocations (meant for kernel
overheads); we don't really need it, and you've already mentioned this.

For memcg evictions, reference counting and reclaim are used today; I
guess in #4 you are referring to getting that information for VRAM?

Is the overall goal to overcommit VRAM, to restrict the amount of VRAM
usage, or a combination of both?

Balbir Singh



* Re: list_lru isolate callback question?
  2025-06-10 22:44         ` Dave Chinner
@ 2025-06-11  1:40           ` Dave Airlie
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Airlie @ 2025-06-11  1:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Kairui Song, Johannes Weiner, Linux Memory Management List

On Wed, 11 Jun 2025 at 08:44, Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Jun 06, 2025 at 08:59:16AM +1000, Dave Airlie wrote:
> > On Fri, 6 Jun 2025 at 08:39, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote:
> > > > On Thu, 5 Jun 2025 at 17:55, Kairui Song <ryncsn@gmail.com> wrote:
> > > > >
> > > > > On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie <airlied@gmail.com> wrote:
> > > > > >
> > > > > > I've hit a case where I think it might be valuable to have the nid +
> > > > > > struct memcg for the item being iterated available in the isolate
> > > > > > callback, I know in theory we should be able to retrieve it from the
> > > > > > item, but I'm also not convinced we should need to since we have it
> > > > > > already in the outer function?
> > > > > >
> > > > > > typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
> > > > > >                         struct list_lru_one *list,
> > > > > >                         int nid,
> > > > > >                         struct mem_cgroup *memcg,
> > > > > >                         void *cb_arg);
> > > > > >
> > > > >
> > > > > Hi Dave,
> > > > >
> > > > > > It's probably not essential (I think I can get the nid back easily,
> > > > > > not sure about the memcg yet), but I thought I'd ask if there would be
> > > > >
> > > > > If it's a slab object you should be able to get it easily with:
> > > > > memcg = mem_cgroup_from_slab_obj(item));
> > > > > nid = page_to_nid(virt_to_page(item));
> > > > >
> > > >
> > > > It's in relation to some work trying to tie GPU system memory
> > > > allocations into memcg properly,
> > > >
> > > > Not slab objects, but I do have pages so I'm using page_to_nid right now,
> > > > however these pages aren't currently setting p->memcg_data as I don't
> > > > need that for this, but maybe
> > > > this gives me a reason to go down that road.
> > >
> > > How are you accounting the page to the memcg if the page is not
> > > marked as owned by as specific memcg?
> > >
> > > Are you relying on the page being indexed in a specific list_lru to
> > > account for the page correcting in reclaim contexts, and that's why
> > > you need this information in the walk context?
> > >
> > > I'd actually like to know more details of the problem you are trying
> > > to solve - all I've heard is "we're trying to do <something> with
> > > GPUs and memcgs with list_lrus", but I don't know what it is so I
> > > can't really give decent feedback on your questions....
> > >
> >
> > Big picture problem, GPU drivers do a lot of memory allocations for
> > userspace applications that historically have not gone via memcg
> > accounting. This has been pointed out to be bad and should be fixed.
> >
> > As part of that problem, GPU drivers have the ability to hand out
> > uncached/writecombined pages to userspace, creating these pages
> > requires changing attributes and as such is a heavy weight operation
> > which necessitates page pools. These page pools only currently have a
> > global shrinker and roll their own NUMA awareness.
>
> Ok, it looks to me like there's been a proliferation of these pools
> and shrinkers in recent times? I was aware of the TTM + i915/gem
> shrinkers, but now I look I see XE, panfrost and MSM all have there
> own custom shrinkers now? Ah, panfrost and msm look simple, but
> XE is a wrapper around ttm that does all sorts of weird runtime PM
> stuff.

There are a bunch of different reasons. TTM is for systems with
discrete device memory; it also manages the uncached pools and has a
shrinker of its own for them.
Xe recently added support for a proper shrinker for the device, which
needs to interact with TTM, so that is a second and different shrinker.
The PM interactions are there so it can power up the GPU to do certain
operations, or refuse to do them if powered down.
Then panfrost/msm are ARM-only with no discrete memory, so they have
their own simpler ones. I think at some point amdgpu will grow a
shrinker more like the Xe one.

>
> I don't see anything obviously NUMA aware in any of them, though....

The NUMA awareness is in the uncached pools code: the shrinker isn't
NUMA aware, but the pools can be created per NUMA node and currently
just shrink indiscriminately. Moving to list_lru should allow them to
shrink smarter.
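
With list_lru the per-node (and eventually per-memcg) targeting comes
from the shrink_control the shrinker core passes in, so the pool side
could end up roughly like this (a sketch; function and field names are
illustrative):

static unsigned long ttm_pool_shrink_count(struct shrinker *shrink,
                                           struct shrink_control *sc)
{
        struct ttm_pool *pool = shrink->private_data;

        /* only counts the node (and memcg, if enabled) described by sc */
        return list_lru_shrink_count(&pool->lru, sc);
}

static unsigned long ttm_pool_shrink_scan(struct shrinker *shrink,
                                          struct shrink_control *sc)
{
        struct ttm_pool *pool = shrink->private_data;

        return list_lru_shrink_walk(&pool->lru, sc, ttm_pool_isolate, pool);
}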

>
> > We don't need page level memcg tracking as the pages are all either
> > allocated to the process as part of a larger buffer object, or the
> > pages are in the pool which has the memcg info, so we aren't intending
> > on using __GFP_ACCOUNT at this stage. I also don't really like having
> > this as part of kmem, these really are userspace only things mostly
> > and they are mostly used by gpu and userspace.
>
> Seems reasonable to me if you can manage the memcgs outside the LRU
> contexts sanely.
>
> > My rough plan:
> > 1. convert TTM page pools over to list_lru and use a NUMA aware shrinker
> > 2. add global and memcg counters and tracking.
> > 3. convert TTM page pools over to memcg aware shrinker so we get the
> > proper operation inside a memcg for some niche use cases.
>
> Once you've converted to list-lru, this step should mainly be be
> adding a flag to the shrinker and changing the list_lru init
> function.
>
> Just remember that list_lru_{add,del}() require external means of
> pinning the memcg while the LRU operation is being done. i.e. the
> indexing and management of memcg lifetimes for LRU operations is the
> responsibility of the external code, not the list_lru
> infrastructure.
>
> > 4. Figure out how to deal with memory evictions from VRAM - this is
> > probably the hardest problem to solve as there is no great policy.
>
> Buffer objects on the LRU can be unused by userspace but still
> mapped into VRAM at the time the shrinker tries to reclaim them?
> Hence the shrinker tries to evict them from VRAM to reclaim the
> memory? Or are you talking about something else?

VRAM isn't currently tracked, but when we do have to evict something
from VRAM, we generally have to evict it to system RAM somewhere, and
whose memcg that gets accounted to is hard to determine at this point.
There is also code in the Xe/TTM shrinker to handle evictions to swap
under memory pressure more smartly:

say you have a 10MB VRAM object and it's getting evicted; normally we
allocate 10MB of main memory, copy the object into it, and then it can
get pushed to swap. The recent Xe shrinker allows the 10MB to get
evicted in ten 1MB chunks, which can get pushed to swap sooner so that
only 1MB of system RAM is in use at a time.

>
> > Also handwave shouldn't this all be folios at some point.
>
> Or maybe a new uncached/writecombined specific type tailored to the
> exact needs of GPU resource management. </wave>

Yes, I'm not saying a page flag is needed, but sometimes a page flag
would make it easier. A folio flag for uncached/wc, with the core mm
dealing with the pools etc., would probably be nice in the future,
especially if more devices start needing this stuff.

Dave.



* Re: list_lru isolate callback question?
  2025-06-10 23:07         ` Balbir Singh
@ 2025-06-11  1:43           ` Dave Airlie
  2025-06-11 22:34             ` Balbir Singh
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Airlie @ 2025-06-11  1:43 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Dave Chinner, Kairui Song, Johannes Weiner,
	Linux Memory Management List

On Wed, 11 Jun 2025 at 09:07, Balbir Singh <balbirs@nvidia.com> wrote:
>
> On 6/6/25 08:59, Dave Airlie wrote:
> > On Fri, 6 Jun 2025 at 08:39, Dave Chinner <david@fromorbit.com> wrote:
> >>
> >> On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote:
> >>> On Thu, 5 Jun 2025 at 17:55, Kairui Song <ryncsn@gmail.com> wrote:
> >>>>
> >>>> On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie <airlied@gmail.com> wrote:
> >>>>>
> >>>>> I've hit a case where I think it might be valuable to have the nid +
> >>>>> struct memcg for the item being iterated available in the isolate
> >>>>> callback, I know in theory we should be able to retrieve it from the
> >>>>> item, but I'm also not convinced we should need to since we have it
> >>>>> already in the outer function?
> >>>>>
> >>>>> typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
> >>>>>                         struct list_lru_one *list,
> >>>>>                         int nid,
> >>>>>                         struct mem_cgroup *memcg,
> >>>>>                         void *cb_arg);
> >>>>>
> >>>>
> >>>> Hi Dave,
> >>>>
> >>>>> It's probably not essential (I think I can get the nid back easily,
> >>>>> not sure about the memcg yet), but I thought I'd ask if there would be
> >>>>
> >>>> If it's a slab object you should be able to get it easily with:
> >>>> memcg = mem_cgroup_from_slab_obj(item));
> >>>> nid = page_to_nid(virt_to_page(item));
> >>>>
> >>>
> >>> It's in relation to some work trying to tie GPU system memory
> >>> allocations into memcg properly,
> >>>
> >>> Not slab objects, but I do have pages so I'm using page_to_nid right now,
> >>> however these pages aren't currently setting p->memcg_data as I don't
> >>> need that for this, but maybe
> >>> this gives me a reason to go down that road.
> >>
> >> How are you accounting the page to the memcg if the page is not
> >> marked as owned by as specific memcg?
> >>
> >> Are you relying on the page being indexed in a specific list_lru to
> >> account for the page correcting in reclaim contexts, and that's why
> >> you need this information in the walk context?
> >>
> >> I'd actually like to know more details of the problem you are trying
> >> to solve - all I've heard is "we're trying to do <something> with
> >> GPUs and memcgs with list_lrus", but I don't know what it is so I
> >> can't really give decent feedback on your questions....
> >>
> >
> > Big picture problem, GPU drivers do a lot of memory allocations for
> > userspace applications that historically have not gone via memcg
> > accounting. This has been pointed out to be bad and should be fixed.
> >
> > As part of that problem, GPU drivers have the ability to hand out
> > uncached/writecombined pages to userspace, creating these pages
> > requires changing attributes and as such is a heavy weight operation
> > which necessitates page pools. These page pools only currently have a
> > global shrinker and roll their own NUMA awareness. The
> > uncached/writecombined memory isn't a core feature of userspace usage
> > patterns, but since we want to do things right it seems like a good
> > idea to clean up the space first.
> >
> > Get proper vmstat/memcg tracking for all allocations done for the GPU,
> > these can be very large, so I think we should add core mm counters for
> > them and memcg ones as well, so userspace can see them and make more
> > educated decisions.
> >
> > We don't need page level memcg tracking as the pages are all either
> > allocated to the process as part of a larger buffer object, or the
> > pages are in the pool which has the memcg info, so we aren't intending
> > on using __GFP_ACCOUNT at this stage. I also don't really like having
> > this as part of kmem, these really are userspace only things mostly
> > and they are mostly used by gpu and userspace.
> >
> > My rough plan:
> > 1. convert TTM page pools over to list_lru and use a NUMA aware shrinker
> > 2. add global and memcg counters and tracking.
> > 3. convert TTM page pools over to memcg aware shrinker so we get the
> > proper operation inside a memcg for some niche use cases.
> > 4. Figure out how to deal with memory evictions from VRAM - this is
> > probably the hardest problem to solve as there is no great policy.
> >
> > Also handwave shouldn't this all be folios at some point.
> >
>
> The key requirements for memcg would be to track the mm on whose behalf
> the allocation was made.
>
> kmemcg (__GFP_ACCOUNT) tracks only kernel
> allocations (meant for kernel overheads), we don't really need it and
> you've already mentioned this.
>
> For memcg evictions reference count and reclaim is used today, I guess
> in #4, you are referring to getting that information for VRAM?
>
> Is the overall goal to overcommit VRAM or to restrict the amount of
> VRAM usage or a combination of bith?

This is kinda the crux of where we are getting to.

We don't track VRAM at all with memcg; that will be the dmem controller's job.

But in the corner case where we do overcommit VRAM, who pays for the
system RAM we evict stuff to?

I think ideally the system limits would give an amount of VRAM and
system RAM to a process, and it would live within that budget, and we'd
try not to evict VRAM from processes that have a cgroup-accounted right
to some of it. But that isn't great for average things like desktops or
games (where overcommit makes sense); it would be more for container
workloads on GPU clusters.

Dave.



* Re: list_lru isolate callback question?
  2025-06-05 22:59       ` Dave Airlie
  2025-06-10 22:44         ` Dave Chinner
  2025-06-10 23:07         ` Balbir Singh
@ 2025-06-11  3:36         ` Matthew Wilcox
  2 siblings, 0 replies; 13+ messages in thread
From: Matthew Wilcox @ 2025-06-11  3:36 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Dave Chinner, Kairui Song, Johannes Weiner,
	Linux Memory Management List

On Fri, Jun 06, 2025 at 08:59:16AM +1000, Dave Airlie wrote:
> Also handwave shouldn't this all be folios at some point.

Maybe?  The question is what per-allocation data do you need to track?

For filesystems, it's quite extensive, and that's what folios do.  They
track the filesystem object (address_space), location within that object
(index), per-folio filesystem state (something like buffer_head or iomap
folio state), a bunch of flags (dirty/uptodate/locked/writeback/...),
LRU list, mapcount, refcount, memcg, etc, etc.

If, for example, you need to go from page to "which pool does this page
belong to?", that would be a good thing you could keep in your memdesc.
You can certainly have some flags.  It sounds like you don't need
per-allocation memcg; you keep track of memcg per pool (I have a feeling
that filesystems would do just as well keeping track of memcgs per inode
rather than per folio, but that's a fight for the future).

What I would like is for you to do something like 'struct slab': keep
your metadata in the same bits as struct page, but without messing with
the contents of struct page.  So don't use page->index to store an
integer just because it has the right type; define your own type.
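
i.e. something in the spirit of (purely illustrative; a real memdesc
would need the same careful field overlay and compile-time asserts
against struct page that 'struct slab' uses):

struct gpu_pool_page {
        unsigned long flags;            /* overlays page->flags */
        struct list_head lru;           /* pool LRU linkage */
        struct ttm_pool *pool;          /* "which pool does this page belong to?" */
        unsigned int order;
};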

Happy to help with some of this.



* Re: list_lru isolate callback question?
  2025-06-11  1:43           ` Dave Airlie
@ 2025-06-11 22:34             ` Balbir Singh
  0 siblings, 0 replies; 13+ messages in thread
From: Balbir Singh @ 2025-06-11 22:34 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Dave Chinner, Kairui Song, Johannes Weiner,
	Linux Memory Management List

On 6/11/25 11:43, Dave Airlie wrote:
> On Wed, 11 Jun 2025 at 09:07, Balbir Singh <balbirs@nvidia.com> wrote:
>>
>> On 6/6/25 08:59, Dave Airlie wrote:
>>> On Fri, 6 Jun 2025 at 08:39, Dave Chinner <david@fromorbit.com> wrote:
>>>>
>>>> On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote:
>>>>> On Thu, 5 Jun 2025 at 17:55, Kairui Song <ryncsn@gmail.com> wrote:
>>>>>>
>>>>>> On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie <airlied@gmail.com> wrote:
>>>>>>>
>>>>>>> I've hit a case where I think it might be valuable to have the nid +
>>>>>>> struct memcg for the item being iterated available in the isolate
>>>>>>> callback, I know in theory we should be able to retrieve it from the
>>>>>>> item, but I'm also not convinced we should need to since we have it
>>>>>>> already in the outer function?
>>>>>>>
>>>>>>> typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
>>>>>>>                         struct list_lru_one *list,
>>>>>>>                         int nid,
>>>>>>>                         struct mem_cgroup *memcg,
>>>>>>>                         void *cb_arg);
>>>>>>>
>>>>>>
>>>>>> Hi Dave,
>>>>>>
>>>>>>> It's probably not essential (I think I can get the nid back easily,
>>>>>>> not sure about the memcg yet), but I thought I'd ask if there would be
>>>>>>
>>>>>> If it's a slab object you should be able to get it easily with:
>>>>>> memcg = mem_cgroup_from_slab_obj(item));
>>>>>> nid = page_to_nid(virt_to_page(item));
>>>>>>
>>>>>
>>>>> It's in relation to some work trying to tie GPU system memory
>>>>> allocations into memcg properly,
>>>>>
>>>>> Not slab objects, but I do have pages so I'm using page_to_nid right now,
>>>>> however these pages aren't currently setting p->memcg_data as I don't
>>>>> need that for this, but maybe
>>>>> this gives me a reason to go down that road.
>>>>
>>>> How are you accounting the page to the memcg if the page is not
>>>> marked as owned by as specific memcg?
>>>>
>>>> Are you relying on the page being indexed in a specific list_lru to
>>>> account for the page correcting in reclaim contexts, and that's why
>>>> you need this information in the walk context?
>>>>
>>>> I'd actually like to know more details of the problem you are trying
>>>> to solve - all I've heard is "we're trying to do <something> with
>>>> GPUs and memcgs with list_lrus", but I don't know what it is so I
>>>> can't really give decent feedback on your questions....
>>>>
>>>
>>> Big picture problem, GPU drivers do a lot of memory allocations for
>>> userspace applications that historically have not gone via memcg
>>> accounting. This has been pointed out to be bad and should be fixed.
>>>
>>> As part of that problem, GPU drivers have the ability to hand out
>>> uncached/writecombined pages to userspace, creating these pages
>>> requires changing attributes and as such is a heavy weight operation
>>> which necessitates page pools. These page pools only currently have a
>>> global shrinker and roll their own NUMA awareness. The
>>> uncached/writecombined memory isn't a core feature of userspace usage
>>> patterns, but since we want to do things right it seems like a good
>>> idea to clean up the space first.
>>>
>>> Get proper vmstat/memcg tracking for all allocations done for the GPU,
>>> these can be very large, so I think we should add core mm counters for
>>> them and memcg ones as well, so userspace can see them and make more
>>> educated decisions.
>>>
>>> We don't need page level memcg tracking as the pages are all either
>>> allocated to the process as part of a larger buffer object, or the
>>> pages are in the pool which has the memcg info, so we aren't intending
>>> on using __GFP_ACCOUNT at this stage. I also don't really like having
>>> this as part of kmem, these really are userspace only things mostly
>>> and they are mostly used by gpu and userspace.
>>>
>>> My rough plan:
>>> 1. convert TTM page pools over to list_lru and use a NUMA aware shrinker
>>> 2. add global and memcg counters and tracking.
>>> 3. convert TTM page pools over to memcg aware shrinker so we get the
>>> proper operation inside a memcg for some niche use cases.
>>> 4. Figure out how to deal with memory evictions from VRAM - this is
>>> probably the hardest problem to solve as there is no great policy.
>>>
>>> Also handwave shouldn't this all be folios at some point.
>>>
>>
>> The key requirements for memcg would be to track the mm on whose behalf
>> the allocation was made.
>>
>> kmemcg (__GFP_ACCOUNT) tracks only kernel
>> allocations (meant for kernel overheads), we don't really need it and
>> you've already mentioned this.
>>
>> For memcg evictions reference count and reclaim is used today, I guess
>> in #4, you are referring to getting that information for VRAM?
>>
>> Is the overall goal to overcommit VRAM or to restrict the amount of
>> VRAM usage or a combination of bith?
> 
> This is kinda the crux of where we are getting to.
> 
> We don't track VRAM at all with memcg that will be the dmem controllers jobs.
> 
> But in the corner case where we do overcommit VRAM, who pays for the
> system RAM where we evict stuff to.
> 
> I think ideally we would have system limits give an amount of VRAM and
> system RAM to a process, and it can live within that budget, and we'd
> try not to evict VRAM from processes that have a cgroup accounted
> right to some of it, but that isn't great for average things like
> desktops or games (where overcommit makes sense), it would be more for
> container workloads on GPU clusters.
> 


Makes sense! Thanks!

Balbir




end of thread

Thread overview: 13+ messages:
2025-06-05  2:16 list_lru isolate callback question? Dave Airlie
2025-06-05  7:55 ` Kairui Song
2025-06-05  9:22   ` Dave Airlie
2025-06-05 13:53     ` Matthew Wilcox
2025-06-05 20:59       ` Dave Airlie
2025-06-05 22:39     ` Dave Chinner
2025-06-05 22:59       ` Dave Airlie
2025-06-10 22:44         ` Dave Chinner
2025-06-11  1:40           ` Dave Airlie
2025-06-10 23:07         ` Balbir Singh
2025-06-11  1:43           ` Dave Airlie
2025-06-11 22:34             ` Balbir Singh
2025-06-11  3:36         ` Matthew Wilcox
