* list_lru isolate callback question? @ 2025-06-05 2:16 Dave Airlie 2025-06-05 7:55 ` Kairui Song 0 siblings, 1 reply; 13+ messages in thread From: Dave Airlie @ 2025-06-05 2:16 UTC (permalink / raw) To: kasong, Dave Chinner, Johannes Weiner, Linux Memory Management List I've hit a case where I think it might be valuable to have the nid + struct memcg for the item being iterated available in the isolate callback, I know in theory we should be able to retrieve it from the item, but I'm also not convinced we should need to since we have it already in the outer function? typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item, struct list_lru_one *list, int nid, struct mem_cgroup *memcg, void *cb_arg); It's probably not essential (I think I can get the nid back easily, not sure about the memcg yet), but I thought I'd ask if there would be resistance against just adding them to the callback? Dave. ^ permalink raw reply [flat|nested] 13+ messages in thread
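A minimal sketch of the caller-side workaround implied above: since the outer walk already knows which nid/memcg it is iterating, they can be threaded through cb_arg today. This assumes the current three-argument callback prototype; my_walk_args/my_isolate/my_walk are hypothetical names, not existing kernel code.

    struct my_walk_args {
            int nid;
            struct mem_cgroup *memcg;
            /* driver-private state ... */
    };

    static enum lru_status my_isolate(struct list_head *item,
                                      struct list_lru_one *list, void *cb_arg)
    {
            struct my_walk_args *args = cb_arg;

            /* args->nid / args->memcg identify the list being walked */
            list_lru_isolate(list, item);
            return LRU_REMOVED;
    }

    /* the outer function already has nid and memcg in hand */
    static unsigned long my_walk(struct list_lru *lru, int nid,
                                 struct mem_cgroup *memcg)
    {
            struct my_walk_args args = { .nid = nid, .memcg = memcg };
            unsigned long nr_to_walk = 32;

            return list_lru_walk_one(lru, nid, memcg, my_isolate, &args,
                                     &nr_to_walk);
    }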
* Re: list_lru isolate callback question? 2025-06-05 2:16 list_lru isolate callback question? Dave Airlie @ 2025-06-05 7:55 ` Kairui Song 2025-06-05 9:22 ` Dave Airlie 0 siblings, 1 reply; 13+ messages in thread From: Kairui Song @ 2025-06-05 7:55 UTC (permalink / raw) To: Dave Airlie; +Cc: Dave Chinner, Johannes Weiner, Linux Memory Management List On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie <airlied@gmail.com> wrote: > > I've hit a case where I think it might be valuable to have the nid + > struct memcg for the item being iterated available in the isolate > callback, I know in theory we should be able to retrieve it from the > item, but I'm also not convinced we should need to since we have it > already in the outer function? > > typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item, > struct list_lru_one *list, > int nid, > struct mem_cgroup *memcg, > void *cb_arg); > Hi Dave, > It's probably not essential (I think I can get the nid back easily, > not sure about the memcg yet), but I thought I'd ask if there would be If it's a slab object you should be able to get it easily with: memcg = mem_cgroup_from_slab_obj(item)); nid = page_to_nid(virt_to_page(item)); > resistance against just adding them to the callback? I'm not sure about the context here, I personally prefer to keep the function minimized unless necessary, so things like !CONFIG_MEMCG or single node builds won't have two dummy parameters here, and most caller won't need them, the compiler can't optimize that out IIUC. > > Dave. > ^ permalink raw reply [flat|nested] 13+ messages in thread
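Spelled out as a helper, and with the caveat that this only holds for slab-allocated, direct-mapped objects (mem_cgroup_from_slab_obj() needs CONFIG_MEMCG and can return NULL), the lookup Kairui describes is roughly:

    static void item_origin(void *item, int *nid, struct mem_cgroup **memcg)
    {
            *memcg = mem_cgroup_from_slab_obj(item);
            *nid = page_to_nid(virt_to_page(item));
    }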
* Re: list_lru isolate callback question? 2025-06-05 7:55 ` Kairui Song @ 2025-06-05 9:22 ` Dave Airlie 2025-06-05 13:53 ` Matthew Wilcox 2025-06-05 22:39 ` Dave Chinner 0 siblings, 2 replies; 13+ messages in thread From: Dave Airlie @ 2025-06-05 9:22 UTC (permalink / raw) To: Kairui Song; +Cc: Dave Chinner, Johannes Weiner, Linux Memory Management List On Thu, 5 Jun 2025 at 17:55, Kairui Song <ryncsn@gmail.com> wrote: > > On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie <airlied@gmail.com> wrote: > > > > I've hit a case where I think it might be valuable to have the nid + > > struct memcg for the item being iterated available in the isolate > > callback, I know in theory we should be able to retrieve it from the > > item, but I'm also not convinced we should need to since we have it > > already in the outer function? > > > > typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item, > > struct list_lru_one *list, > > int nid, > > struct mem_cgroup *memcg, > > void *cb_arg); > > > > Hi Dave, > > > It's probably not essential (I think I can get the nid back easily, > > not sure about the memcg yet), but I thought I'd ask if there would be > > If it's a slab object you should be able to get it easily with: > memcg = mem_cgroup_from_slab_obj(item)); > nid = page_to_nid(virt_to_page(item)); > It's in relation to some work trying to tie GPU system memory allocations into memcg properly, Not slab objects, but I do have pages so I'm using page_to_nid right now, however these pages aren't currently setting p->memcg_data as I don't need that for this, but maybe this gives me a reason to go down that road. > > resistance against just adding them to the callback? > > I'm not sure about the context here, I personally prefer to keep the > function minimized unless necessary, so things like !CONFIG_MEMCG or > single node builds won't have two dummy parameters here, and > most caller won't need them, the compiler can't optimize that > out IIUC. I reconsidered and was wondering if struct lru_walk_args { struct list_lru_one *list; int nid; struct mem_cgroup *memcg; } would also be an option instead of adding two unused args. But I'll see if I can make it work once I get the memcg pieces of the puzzle sorted out. Dave. > > > > > Dave. > > ^ permalink raw reply [flat|nested] 13+ messages in thread
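For illustration, the struct-argument variant floated above could look roughly like this; it is only a sketch of the shape, not existing kernel API:

    struct lru_walk_args {
            struct list_lru_one *list;
            int nid;
            struct mem_cgroup *memcg;
    };

    typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
                                                struct lru_walk_args *args,
                                                void *cb_arg);

Callbacks that don't care about nid/memcg simply ignore the extra fields, and !CONFIG_MEMCG or single-node builds could trim the struct rather than the callback's parameter list.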
* Re: list_lru isolate callback question? 2025-06-05 9:22 ` Dave Airlie @ 2025-06-05 13:53 ` Matthew Wilcox 2025-06-05 20:59 ` Dave Airlie 2025-06-05 22:39 ` Dave Chinner 1 sibling, 1 reply; 13+ messages in thread From: Matthew Wilcox @ 2025-06-05 13:53 UTC (permalink / raw) To: Dave Airlie Cc: Kairui Song, Dave Chinner, Johannes Weiner, Linux Memory Management List On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote: > Not slab objects, but I do have pages so I'm using page_to_nid right now, > however these pages aren't currently setting p->memcg_data as I don't > need that for this, but maybe > this gives me a reason to go down that road. Please don't. ->memcg_data is moving from struct page to the containing object (slab/folio/...) ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: list_lru isolate callback question? 2025-06-05 13:53 ` Matthew Wilcox @ 2025-06-05 20:59 ` Dave Airlie 0 siblings, 0 replies; 13+ messages in thread From: Dave Airlie @ 2025-06-05 20:59 UTC (permalink / raw) To: Matthew Wilcox Cc: Kairui Song, Dave Chinner, Johannes Weiner, Linux Memory Management List On Thu, 5 Jun 2025 at 23:53, Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote: > > Not slab objects, but I do have pages so I'm using page_to_nid right now, > > however these pages aren't currently setting p->memcg_data as I don't > > need that for this, but maybe > > this gives me a reason to go down that road. > > Please don't. ->memcg_data is moving from struct page to the containing > object (slab/folio/...) I think I'd like to move all the code in the TTM page pooling handling to folios, but I started pulling the thread one day and realised I had no idea what I was doing when it came to knowing what I was doing. I think porting all the x86 set_pages_* interfaces was where I think I needed to start. Dave. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: list_lru isolate callback question? 2025-06-05 9:22 ` Dave Airlie 2025-06-05 13:53 ` Matthew Wilcox @ 2025-06-05 22:39 ` Dave Chinner 2025-06-05 22:59 ` Dave Airlie 1 sibling, 1 reply; 13+ messages in thread From: Dave Chinner @ 2025-06-05 22:39 UTC (permalink / raw) To: Dave Airlie; +Cc: Kairui Song, Johannes Weiner, Linux Memory Management List On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote: > On Thu, 5 Jun 2025 at 17:55, Kairui Song <ryncsn@gmail.com> wrote: > > > > On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie <airlied@gmail.com> wrote: > > > > > > I've hit a case where I think it might be valuable to have the nid + > > > struct memcg for the item being iterated available in the isolate > > > callback, I know in theory we should be able to retrieve it from the > > > item, but I'm also not convinced we should need to since we have it > > > already in the outer function? > > > > > > typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item, > > > struct list_lru_one *list, > > > int nid, > > > struct mem_cgroup *memcg, > > > void *cb_arg); > > > > > > > Hi Dave, > > > > > It's probably not essential (I think I can get the nid back easily, > > > not sure about the memcg yet), but I thought I'd ask if there would be > > > > If it's a slab object you should be able to get it easily with: > > memcg = mem_cgroup_from_slab_obj(item)); > > nid = page_to_nid(virt_to_page(item)); > > > > It's in relation to some work trying to tie GPU system memory > allocations into memcg properly, > > Not slab objects, but I do have pages so I'm using page_to_nid right now, > however these pages aren't currently setting p->memcg_data as I don't > need that for this, but maybe > this gives me a reason to go down that road. How are you accounting the page to the memcg if the page is not marked as owned by as specific memcg? Are you relying on the page being indexed in a specific list_lru to account for the page correcting in reclaim contexts, and that's why you need this information in the walk context? I'd actually like to know more details of the problem you are trying to solve - all I've heard is "we're trying to do <something> with GPUs and memcgs with list_lrus", but I don't know what it is so I can't really give decent feedback on your questions.... > > > resistance against just adding them to the callback? > > > > I'm not sure about the context here, I personally prefer to keep the > > function minimized unless necessary, so things like !CONFIG_MEMCG or > > single node builds won't have two dummy parameters here, and > > most caller won't need them, the compiler can't optimize that > > out IIUC. > > I reconsidered and was wondering if > > struct lru_walk_args { > struct list_lru_one *list; > int nid; > struct mem_cgroup *memcg; > } > would also be an option instead of adding two unused args. The walk function is passed a struct list_lru_one. If there is a need to get the {nid,memcg} of the objects efficiently from walk contexts, then we should encode them into the struct list_lru_one at init time and retreive them from there. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 13+ messages in thread
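A rough sketch of the encoding Dave suggests (hypothetical additions only; the existing members of struct list_lru_one are elided): if each per-node/per-memcg list recorded its owner at init time, any isolate callback could recover both from the list it is handed.

    struct list_lru_one {
            /* ... existing fields (list head, item count, ...) ... */
            int                     nid;    /* hypothetical addition */
            struct mem_cgroup       *memcg; /* hypothetical addition */
    };

    static enum lru_status some_isolate(struct list_head *item,
                                        struct list_lru_one *list, void *cb_arg)
    {
            int nid = list->nid;
            struct mem_cgroup *memcg = list->memcg;

            /* ... use nid/memcg for accounting decisions ... */
            return LRU_SKIP;
    }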
* Re: list_lru isolate callback question? 2025-06-05 22:39 ` Dave Chinner @ 2025-06-05 22:59 ` Dave Airlie 2025-06-10 22:44 ` Dave Chinner ` (2 more replies) 0 siblings, 3 replies; 13+ messages in thread From: Dave Airlie @ 2025-06-05 22:59 UTC (permalink / raw) To: Dave Chinner; +Cc: Kairui Song, Johannes Weiner, Linux Memory Management List On Fri, 6 Jun 2025 at 08:39, Dave Chinner <david@fromorbit.com> wrote: > > On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote: > > On Thu, 5 Jun 2025 at 17:55, Kairui Song <ryncsn@gmail.com> wrote: > > > > > > On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie <airlied@gmail.com> wrote: > > > > > > > > I've hit a case where I think it might be valuable to have the nid + > > > > struct memcg for the item being iterated available in the isolate > > > > callback, I know in theory we should be able to retrieve it from the > > > > item, but I'm also not convinced we should need to since we have it > > > > already in the outer function? > > > > > > > > typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item, > > > > struct list_lru_one *list, > > > > int nid, > > > > struct mem_cgroup *memcg, > > > > void *cb_arg); > > > > > > > > > > Hi Dave, > > > > > > > It's probably not essential (I think I can get the nid back easily, > > > > not sure about the memcg yet), but I thought I'd ask if there would be > > > > > > If it's a slab object you should be able to get it easily with: > > > memcg = mem_cgroup_from_slab_obj(item)); > > > nid = page_to_nid(virt_to_page(item)); > > > > > > > It's in relation to some work trying to tie GPU system memory > > allocations into memcg properly, > > > > Not slab objects, but I do have pages so I'm using page_to_nid right now, > > however these pages aren't currently setting p->memcg_data as I don't > > need that for this, but maybe > > this gives me a reason to go down that road. > > How are you accounting the page to the memcg if the page is not > marked as owned by as specific memcg? > > Are you relying on the page being indexed in a specific list_lru to > account for the page correcting in reclaim contexts, and that's why > you need this information in the walk context? > > I'd actually like to know more details of the problem you are trying > to solve - all I've heard is "we're trying to do <something> with > GPUs and memcgs with list_lrus", but I don't know what it is so I > can't really give decent feedback on your questions.... > Big picture problem, GPU drivers do a lot of memory allocations for userspace applications that historically have not gone via memcg accounting. This has been pointed out to be bad and should be fixed. As part of that problem, GPU drivers have the ability to hand out uncached/writecombined pages to userspace, creating these pages requires changing attributes and as such is a heavy weight operation which necessitates page pools. These page pools only currently have a global shrinker and roll their own NUMA awareness. The uncached/writecombined memory isn't a core feature of userspace usage patterns, but since we want to do things right it seems like a good idea to clean up the space first. Get proper vmstat/memcg tracking for all allocations done for the GPU, these can be very large, so I think we should add core mm counters for them and memcg ones as well, so userspace can see them and make more educated decisions. 
We don't need page level memcg tracking as the pages are all either allocated to the process as part of a larger buffer object, or the pages are in the pool which has the memcg info, so we aren't intending on using __GFP_ACCOUNT at this stage. I also don't really like having this as part of kmem, these really are userspace only things mostly and they are mostly used by gpu and userspace. My rough plan: 1. convert TTM page pools over to list_lru and use a NUMA aware shrinker 2. add global and memcg counters and tracking. 3. convert TTM page pools over to memcg aware shrinker so we get the proper operation inside a memcg for some niche use cases. 4. Figure out how to deal with memory evictions from VRAM - this is probably the hardest problem to solve as there is no great policy. Also handwave shouldn't this all be folios at some point. > > The walk function is passed a struct list_lru_one. If there is a > need to get the {nid,memcg} of the objects efficiently from walk > contexts, then we should encode them into the struct list_lru_one > at init time and retreive them from there. Oh interesting, that might also be a decent option. Dave. ^ permalink raw reply [flat|nested] 13+ messages in thread
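For concreteness, step 1 of the plan might look something like the sketch below: pool entries live on a list_lru and get reclaimed by a NUMA-aware shrinker built on shrinker_alloc()/shrinker_register(). The pool_* names are placeholders, not the existing TTM code, and the actual freeing/attribute-restoring work is only hinted at.

    static struct list_lru pool_lru;
    static struct shrinker *pool_shrinker;

    static enum lru_status pool_isolate(struct list_head *item,
                                        struct list_lru_one *list, void *cb_arg)
    {
            struct list_head *dispose = cb_arg;

            list_lru_isolate_move(list, item, dispose);
            return LRU_REMOVED;
    }

    static unsigned long pool_count(struct shrinker *s, struct shrink_control *sc)
    {
            return list_lru_shrink_count(&pool_lru, sc);
    }

    static unsigned long pool_scan(struct shrinker *s, struct shrink_control *sc)
    {
            LIST_HEAD(dispose);
            unsigned long freed;

            freed = list_lru_shrink_walk(&pool_lru, sc, pool_isolate, &dispose);
            /* restore page attributes and free the pages on @dispose
             * here, outside the LRU lock */
            return freed;
    }

    static int pool_shrinker_init(void)
    {
            int err = list_lru_init(&pool_lru);

            if (err)
                    return err;

            pool_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "ttm-pool-sketch");
            if (!pool_shrinker) {
                    list_lru_destroy(&pool_lru);
                    return -ENOMEM;
            }
            pool_shrinker->count_objects = pool_count;
            pool_shrinker->scan_objects = pool_scan;
            shrinker_register(pool_shrinker);
            return 0;
    }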
* Re: list_lru isolate callback question? 2025-06-05 22:59 ` Dave Airlie @ 2025-06-10 22:44 ` Dave Chinner 2025-06-11 1:40 ` Dave Airlie 2025-06-10 23:07 ` Balbir Singh 2025-06-11 3:36 ` Matthew Wilcox 2 siblings, 1 reply; 13+ messages in thread From: Dave Chinner @ 2025-06-10 22:44 UTC (permalink / raw) To: Dave Airlie; +Cc: Kairui Song, Johannes Weiner, Linux Memory Management List On Fri, Jun 06, 2025 at 08:59:16AM +1000, Dave Airlie wrote: > On Fri, 6 Jun 2025 at 08:39, Dave Chinner <david@fromorbit.com> wrote: > > > > On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote: > > > On Thu, 5 Jun 2025 at 17:55, Kairui Song <ryncsn@gmail.com> wrote: > > > > > > > > On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie <airlied@gmail.com> wrote: > > > > > > > > > > I've hit a case where I think it might be valuable to have the nid + > > > > > struct memcg for the item being iterated available in the isolate > > > > > callback, I know in theory we should be able to retrieve it from the > > > > > item, but I'm also not convinced we should need to since we have it > > > > > already in the outer function? > > > > > > > > > > typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item, > > > > > struct list_lru_one *list, > > > > > int nid, > > > > > struct mem_cgroup *memcg, > > > > > void *cb_arg); > > > > > > > > > > > > > Hi Dave, > > > > > > > > > It's probably not essential (I think I can get the nid back easily, > > > > > not sure about the memcg yet), but I thought I'd ask if there would be > > > > > > > > If it's a slab object you should be able to get it easily with: > > > > memcg = mem_cgroup_from_slab_obj(item)); > > > > nid = page_to_nid(virt_to_page(item)); > > > > > > > > > > It's in relation to some work trying to tie GPU system memory > > > allocations into memcg properly, > > > > > > Not slab objects, but I do have pages so I'm using page_to_nid right now, > > > however these pages aren't currently setting p->memcg_data as I don't > > > need that for this, but maybe > > > this gives me a reason to go down that road. > > > > How are you accounting the page to the memcg if the page is not > > marked as owned by as specific memcg? > > > > Are you relying on the page being indexed in a specific list_lru to > > account for the page correcting in reclaim contexts, and that's why > > you need this information in the walk context? > > > > I'd actually like to know more details of the problem you are trying > > to solve - all I've heard is "we're trying to do <something> with > > GPUs and memcgs with list_lrus", but I don't know what it is so I > > can't really give decent feedback on your questions.... > > > > Big picture problem, GPU drivers do a lot of memory allocations for > userspace applications that historically have not gone via memcg > accounting. This has been pointed out to be bad and should be fixed. > > As part of that problem, GPU drivers have the ability to hand out > uncached/writecombined pages to userspace, creating these pages > requires changing attributes and as such is a heavy weight operation > which necessitates page pools. These page pools only currently have a > global shrinker and roll their own NUMA awareness. Ok, it looks to me like there's been a proliferation of these pools and shrinkers in recent times? I was aware of the TTM + i915/gem shrinkers, but now I look I see XE, panfrost and MSM all have there own custom shrinkers now? Ah, panfrost and msm look simple, but XE is a wrapper around ttm that does all sorts of weird runtime PM stuff. 
I don't see anything obviously NUMA aware in any of them, though.... > The > uncached/writecombined memory isn't a core feature of userspace usage > patterns, but since we want to do things right it seems like a good > idea to clean up the space first. > > Get proper vmstat/memcg tracking for all allocations done for the GPU, > these can be very large, so I think we should add core mm counters for > them and memcg ones as well, so userspace can see them and make more > educated decisions. That's not really related to shrinkers and LRUs, so I'll leave that for you to take up with the core-mm people. :) > We don't need page level memcg tracking as the pages are all either > allocated to the process as part of a larger buffer object, or the > pages are in the pool which has the memcg info, so we aren't intending > on using __GFP_ACCOUNT at this stage. I also don't really like having > this as part of kmem, these really are userspace only things mostly > and they are mostly used by gpu and userspace. Seems reasonable to me if you can manage the memcgs outside the LRU contexts sanely. > My rough plan: > 1. convert TTM page pools over to list_lru and use a NUMA aware shrinker > 2. add global and memcg counters and tracking. > 3. convert TTM page pools over to memcg aware shrinker so we get the > proper operation inside a memcg for some niche use cases. Once you've converted to list-lru, this step should mainly be be adding a flag to the shrinker and changing the list_lru init function. Just remember that list_lru_{add,del}() require external means of pinning the memcg while the LRU operation is being done. i.e. the indexing and management of memcg lifetimes for LRU operations is the responsibility of the external code, not the list_lru infrastructure. > 4. Figure out how to deal with memory evictions from VRAM - this is > probably the hardest problem to solve as there is no great policy. Buffer objects on the LRU can be unused by userspace but still mapped into VRAM at the time the shrinker tries to reclaim them? Hence the shrinker tries to evict them from VRAM to reclaim the memory? Or are you talking about something else? > Also handwave shouldn't this all be folios at some point. Or maybe a new uncached/writecombined specific type tailored to the exact needs of GPU resource management. </wave> -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 13+ messages in thread
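Building on the earlier sketch, the memcg-aware step is then mostly the flag plus the init pairing, with explicit nid/memcg on add/del. Again a sketch with a hypothetical pool_entry type; the caller is responsible for keeping the memcg pinned (e.g. css_get()/css_put()) across these calls.

    struct pool_entry {                     /* hypothetical pool item */
            struct list_head lru;
            /* pages, order, caching attributes, ... */
    };

    static int pool_shrinker_init_memcg(void)
    {
            pool_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE |
                                           SHRINKER_MEMCG_AWARE,
                                           "ttm-pool-sketch");
            if (!pool_shrinker)
                    return -ENOMEM;

            /* pairing the lru with the shrinker gives it per-memcg lists */
            if (list_lru_init_memcg(&pool_lru, pool_shrinker)) {
                    shrinker_free(pool_shrinker);
                    return -ENOMEM;
            }
            pool_shrinker->count_objects = pool_count;
            pool_shrinker->scan_objects = pool_scan;
            shrinker_register(pool_shrinker);
            return 0;
    }

    static void pool_stash(struct pool_entry *entry, int nid,
                           struct mem_cgroup *memcg)
    {
            list_lru_add(&pool_lru, &entry->lru, nid, memcg);
    }

    static void pool_unstash(struct pool_entry *entry, int nid,
                             struct mem_cgroup *memcg)
    {
            list_lru_del(&pool_lru, &entry->lru, nid, memcg);
    }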
* Re: list_lru isolate callback question? 2025-06-10 22:44 ` Dave Chinner @ 2025-06-11 1:40 ` Dave Airlie 0 siblings, 0 replies; 13+ messages in thread From: Dave Airlie @ 2025-06-11 1:40 UTC (permalink / raw) To: Dave Chinner; +Cc: Kairui Song, Johannes Weiner, Linux Memory Management List On Wed, 11 Jun 2025 at 08:44, Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Jun 06, 2025 at 08:59:16AM +1000, Dave Airlie wrote: > > On Fri, 6 Jun 2025 at 08:39, Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote: > > > > On Thu, 5 Jun 2025 at 17:55, Kairui Song <ryncsn@gmail.com> wrote: > > > > > > > > > > On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie <airlied@gmail.com> wrote: > > > > > > > > > > > > I've hit a case where I think it might be valuable to have the nid + > > > > > > struct memcg for the item being iterated available in the isolate > > > > > > callback, I know in theory we should be able to retrieve it from the > > > > > > item, but I'm also not convinced we should need to since we have it > > > > > > already in the outer function? > > > > > > > > > > > > typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item, > > > > > > struct list_lru_one *list, > > > > > > int nid, > > > > > > struct mem_cgroup *memcg, > > > > > > void *cb_arg); > > > > > > > > > > > > > > > > Hi Dave, > > > > > > > > > > > It's probably not essential (I think I can get the nid back easily, > > > > > > not sure about the memcg yet), but I thought I'd ask if there would be > > > > > > > > > > If it's a slab object you should be able to get it easily with: > > > > > memcg = mem_cgroup_from_slab_obj(item)); > > > > > nid = page_to_nid(virt_to_page(item)); > > > > > > > > > > > > > It's in relation to some work trying to tie GPU system memory > > > > allocations into memcg properly, > > > > > > > > Not slab objects, but I do have pages so I'm using page_to_nid right now, > > > > however these pages aren't currently setting p->memcg_data as I don't > > > > need that for this, but maybe > > > > this gives me a reason to go down that road. > > > > > > How are you accounting the page to the memcg if the page is not > > > marked as owned by as specific memcg? > > > > > > Are you relying on the page being indexed in a specific list_lru to > > > account for the page correcting in reclaim contexts, and that's why > > > you need this information in the walk context? > > > > > > I'd actually like to know more details of the problem you are trying > > > to solve - all I've heard is "we're trying to do <something> with > > > GPUs and memcgs with list_lrus", but I don't know what it is so I > > > can't really give decent feedback on your questions.... > > > > > > > Big picture problem, GPU drivers do a lot of memory allocations for > > userspace applications that historically have not gone via memcg > > accounting. This has been pointed out to be bad and should be fixed. > > > > As part of that problem, GPU drivers have the ability to hand out > > uncached/writecombined pages to userspace, creating these pages > > requires changing attributes and as such is a heavy weight operation > > which necessitates page pools. These page pools only currently have a > > global shrinker and roll their own NUMA awareness. > > Ok, it looks to me like there's been a proliferation of these pools > and shrinkers in recent times? I was aware of the TTM + i915/gem > shrinkers, but now I look I see XE, panfrost and MSM all have there > own custom shrinkers now? 
Ah, panfrost and msm look simple, but > XE is a wrapper around ttm that does all sorts of weird runtime PM > stuff. There is a bunch of different reasons, TTM is for systems with discrete device memory, it also manages uncached pools, it has a shrink itself for the uncached pools. Xe recently added support for a proper shrinker for the device which needs to interact with TTM, so that is second and different shrinker. The PM interactions are so it can power up the GPU to do certain operations or refuse to do them if powered down. Then panfrost/msm are arm only no discrete memory so have their own simpler ones. I think at some point amdgpu will grow a shrinker more like the Xe one. > > I don't see anything obviously NUMA aware in any of them, though.... The numa awareness is in the uncached pools code, the shrinker isn't numa aware, but the pools can be created per numa-node, and currently just shrink indiscriminately, moving to list_lru should allow them to shrink smarter. > > > We don't need page level memcg tracking as the pages are all either > > allocated to the process as part of a larger buffer object, or the > > pages are in the pool which has the memcg info, so we aren't intending > > on using __GFP_ACCOUNT at this stage. I also don't really like having > > this as part of kmem, these really are userspace only things mostly > > and they are mostly used by gpu and userspace. > > Seems reasonable to me if you can manage the memcgs outside the LRU > contexts sanely. > > > My rough plan: > > 1. convert TTM page pools over to list_lru and use a NUMA aware shrinker > > 2. add global and memcg counters and tracking. > > 3. convert TTM page pools over to memcg aware shrinker so we get the > > proper operation inside a memcg for some niche use cases. > > Once you've converted to list-lru, this step should mainly be be > adding a flag to the shrinker and changing the list_lru init > function. > > Just remember that list_lru_{add,del}() require external means of > pinning the memcg while the LRU operation is being done. i.e. the > indexing and management of memcg lifetimes for LRU operations is the > responsibility of the external code, not the list_lru > infrastructure. > > > 4. Figure out how to deal with memory evictions from VRAM - this is > > probably the hardest problem to solve as there is no great policy. > > Buffer objects on the LRU can be unused by userspace but still > mapped into VRAM at the time the shrinker tries to reclaim them? > Hence the shrinker tries to evict them from VRAM to reclaim the > memory? Or are you talking about something else? VRAM isn't currently tracked, but when we do have to evict something from VRAM, we generally have to evict to system RAM somewhere, and who's memcg that gets accounted to is hard to determine at this point. There is also code in the Xe/TTM shrinker to handle evictions to swap under memory pressure smarter, you have a 10MB VRAM object, it's getting evicted, we allocate 10MB main memory and copy it, then it can get pushed to swap, the recent xe shrinker allows the 10MB to get evicted into 10 1MB chunks which can get pushed to swap quicker to only use 1MB of system RAM. > > > Also handwave shouldn't this all be folios at some point. > > Or maybe a new uncached/writecombined specific type tailored to the > exact needs of GPU resource management. 
</wave> Yes not saying a page flag would make it easier, but sometimes a pageflag would make it easier, though a folio flag for uncached/wc with core mm dealing with pools etc would probably be nice in the future, esp if more devices start needing this stuff. Dave. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: list_lru isolate callback question? 2025-06-05 22:59 ` Dave Airlie 2025-06-10 22:44 ` Dave Chinner @ 2025-06-10 23:07 ` Balbir Singh 2025-06-11 1:43 ` Dave Airlie 2025-06-11 3:36 ` Matthew Wilcox 2 siblings, 1 reply; 13+ messages in thread From: Balbir Singh @ 2025-06-10 23:07 UTC (permalink / raw) To: Dave Airlie, Dave Chinner Cc: Kairui Song, Johannes Weiner, Linux Memory Management List On 6/6/25 08:59, Dave Airlie wrote: > On Fri, 6 Jun 2025 at 08:39, Dave Chinner <david@fromorbit.com> wrote: >> >> On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote: >>> On Thu, 5 Jun 2025 at 17:55, Kairui Song <ryncsn@gmail.com> wrote: >>>> >>>> On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie <airlied@gmail.com> wrote: >>>>> >>>>> I've hit a case where I think it might be valuable to have the nid + >>>>> struct memcg for the item being iterated available in the isolate >>>>> callback, I know in theory we should be able to retrieve it from the >>>>> item, but I'm also not convinced we should need to since we have it >>>>> already in the outer function? >>>>> >>>>> typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item, >>>>> struct list_lru_one *list, >>>>> int nid, >>>>> struct mem_cgroup *memcg, >>>>> void *cb_arg); >>>>> >>>> >>>> Hi Dave, >>>> >>>>> It's probably not essential (I think I can get the nid back easily, >>>>> not sure about the memcg yet), but I thought I'd ask if there would be >>>> >>>> If it's a slab object you should be able to get it easily with: >>>> memcg = mem_cgroup_from_slab_obj(item)); >>>> nid = page_to_nid(virt_to_page(item)); >>>> >>> >>> It's in relation to some work trying to tie GPU system memory >>> allocations into memcg properly, >>> >>> Not slab objects, but I do have pages so I'm using page_to_nid right now, >>> however these pages aren't currently setting p->memcg_data as I don't >>> need that for this, but maybe >>> this gives me a reason to go down that road. >> >> How are you accounting the page to the memcg if the page is not >> marked as owned by as specific memcg? >> >> Are you relying on the page being indexed in a specific list_lru to >> account for the page correcting in reclaim contexts, and that's why >> you need this information in the walk context? >> >> I'd actually like to know more details of the problem you are trying >> to solve - all I've heard is "we're trying to do <something> with >> GPUs and memcgs with list_lrus", but I don't know what it is so I >> can't really give decent feedback on your questions.... >> > > Big picture problem, GPU drivers do a lot of memory allocations for > userspace applications that historically have not gone via memcg > accounting. This has been pointed out to be bad and should be fixed. > > As part of that problem, GPU drivers have the ability to hand out > uncached/writecombined pages to userspace, creating these pages > requires changing attributes and as such is a heavy weight operation > which necessitates page pools. These page pools only currently have a > global shrinker and roll their own NUMA awareness. The > uncached/writecombined memory isn't a core feature of userspace usage > patterns, but since we want to do things right it seems like a good > idea to clean up the space first. > > Get proper vmstat/memcg tracking for all allocations done for the GPU, > these can be very large, so I think we should add core mm counters for > them and memcg ones as well, so userspace can see them and make more > educated decisions. 
> > We don't need page level memcg tracking as the pages are all either > allocated to the process as part of a larger buffer object, or the > pages are in the pool which has the memcg info, so we aren't intending > on using __GFP_ACCOUNT at this stage. I also don't really like having > this as part of kmem, these really are userspace only things mostly > and they are mostly used by gpu and userspace. > > My rough plan: > 1. convert TTM page pools over to list_lru and use a NUMA aware shrinker > 2. add global and memcg counters and tracking. > 3. convert TTM page pools over to memcg aware shrinker so we get the > proper operation inside a memcg for some niche use cases. > 4. Figure out how to deal with memory evictions from VRAM - this is > probably the hardest problem to solve as there is no great policy. > > Also handwave shouldn't this all be folios at some point. > The key requirements for memcg would be to track the mm on whose behalf the allocation was made. kmemcg (__GFP_ACCOUNT) tracks only kernel allocations (meant for kernel overheads), we don't really need it and you've already mentioned this. For memcg evictions reference count and reclaim is used today, I guess in #4, you are referring to getting that information for VRAM? Is the overall goal to overcommit VRAM or to restrict the amount of VRAM usage or a combination of bith? Balbir Singh ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: list_lru isolate callback question? 2025-06-10 23:07 ` Balbir Singh @ 2025-06-11 1:43 ` Dave Airlie 2025-06-11 22:34 ` Balbir Singh 0 siblings, 1 reply; 13+ messages in thread From: Dave Airlie @ 2025-06-11 1:43 UTC (permalink / raw) To: Balbir Singh Cc: Dave Chinner, Kairui Song, Johannes Weiner, Linux Memory Management List On Wed, 11 Jun 2025 at 09:07, Balbir Singh <balbirs@nvidia.com> wrote: > > On 6/6/25 08:59, Dave Airlie wrote: > > On Fri, 6 Jun 2025 at 08:39, Dave Chinner <david@fromorbit.com> wrote: > >> > >> On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote: > >>> On Thu, 5 Jun 2025 at 17:55, Kairui Song <ryncsn@gmail.com> wrote: > >>>> > >>>> On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie <airlied@gmail.com> wrote: > >>>>> > >>>>> I've hit a case where I think it might be valuable to have the nid + > >>>>> struct memcg for the item being iterated available in the isolate > >>>>> callback, I know in theory we should be able to retrieve it from the > >>>>> item, but I'm also not convinced we should need to since we have it > >>>>> already in the outer function? > >>>>> > >>>>> typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item, > >>>>> struct list_lru_one *list, > >>>>> int nid, > >>>>> struct mem_cgroup *memcg, > >>>>> void *cb_arg); > >>>>> > >>>> > >>>> Hi Dave, > >>>> > >>>>> It's probably not essential (I think I can get the nid back easily, > >>>>> not sure about the memcg yet), but I thought I'd ask if there would be > >>>> > >>>> If it's a slab object you should be able to get it easily with: > >>>> memcg = mem_cgroup_from_slab_obj(item)); > >>>> nid = page_to_nid(virt_to_page(item)); > >>>> > >>> > >>> It's in relation to some work trying to tie GPU system memory > >>> allocations into memcg properly, > >>> > >>> Not slab objects, but I do have pages so I'm using page_to_nid right now, > >>> however these pages aren't currently setting p->memcg_data as I don't > >>> need that for this, but maybe > >>> this gives me a reason to go down that road. > >> > >> How are you accounting the page to the memcg if the page is not > >> marked as owned by as specific memcg? > >> > >> Are you relying on the page being indexed in a specific list_lru to > >> account for the page correcting in reclaim contexts, and that's why > >> you need this information in the walk context? > >> > >> I'd actually like to know more details of the problem you are trying > >> to solve - all I've heard is "we're trying to do <something> with > >> GPUs and memcgs with list_lrus", but I don't know what it is so I > >> can't really give decent feedback on your questions.... > >> > > > > Big picture problem, GPU drivers do a lot of memory allocations for > > userspace applications that historically have not gone via memcg > > accounting. This has been pointed out to be bad and should be fixed. > > > > As part of that problem, GPU drivers have the ability to hand out > > uncached/writecombined pages to userspace, creating these pages > > requires changing attributes and as such is a heavy weight operation > > which necessitates page pools. These page pools only currently have a > > global shrinker and roll their own NUMA awareness. The > > uncached/writecombined memory isn't a core feature of userspace usage > > patterns, but since we want to do things right it seems like a good > > idea to clean up the space first. 
> > > > Get proper vmstat/memcg tracking for all allocations done for the GPU, > > these can be very large, so I think we should add core mm counters for > > them and memcg ones as well, so userspace can see them and make more > > educated decisions. > > > > We don't need page level memcg tracking as the pages are all either > > allocated to the process as part of a larger buffer object, or the > > pages are in the pool which has the memcg info, so we aren't intending > > on using __GFP_ACCOUNT at this stage. I also don't really like having > > this as part of kmem, these really are userspace only things mostly > > and they are mostly used by gpu and userspace. > > > > My rough plan: > > 1. convert TTM page pools over to list_lru and use a NUMA aware shrinker > > 2. add global and memcg counters and tracking. > > 3. convert TTM page pools over to memcg aware shrinker so we get the > > proper operation inside a memcg for some niche use cases. > > 4. Figure out how to deal with memory evictions from VRAM - this is > > probably the hardest problem to solve as there is no great policy. > > > > Also handwave shouldn't this all be folios at some point. > > > > The key requirements for memcg would be to track the mm on whose behalf > the allocation was made. > > kmemcg (__GFP_ACCOUNT) tracks only kernel > allocations (meant for kernel overheads), we don't really need it and > you've already mentioned this. > > For memcg evictions reference count and reclaim is used today, I guess > in #4, you are referring to getting that information for VRAM? > > Is the overall goal to overcommit VRAM or to restrict the amount of > VRAM usage or a combination of bith? This is kinda the crux of where we are getting to. We don't track VRAM at all with memcg that will be the dmem controllers jobs. But in the corner case where we do overcommit VRAM, who pays for the system RAM where we evict stuff to. I think ideally we would have system limits give an amount of VRAM and system RAM to a process, and it can live within that budget, and we'd try not to evict VRAM from processes that have a cgroup accounted right to some of it, but that isn't great for average things like desktops or games (where overcommit makes sense), it would be more for container workloads on GPU clusters. Dave. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: list_lru isolate callback question? 2025-06-11 1:43 ` Dave Airlie @ 2025-06-11 22:34 ` Balbir Singh 0 siblings, 0 replies; 13+ messages in thread From: Balbir Singh @ 2025-06-11 22:34 UTC (permalink / raw) To: Dave Airlie Cc: Dave Chinner, Kairui Song, Johannes Weiner, Linux Memory Management List On 6/11/25 11:43, Dave Airlie wrote: > On Wed, 11 Jun 2025 at 09:07, Balbir Singh <balbirs@nvidia.com> wrote: >> >> On 6/6/25 08:59, Dave Airlie wrote: >>> On Fri, 6 Jun 2025 at 08:39, Dave Chinner <david@fromorbit.com> wrote: >>>> >>>> On Thu, Jun 05, 2025 at 07:22:23PM +1000, Dave Airlie wrote: >>>>> On Thu, 5 Jun 2025 at 17:55, Kairui Song <ryncsn@gmail.com> wrote: >>>>>> >>>>>> On Thu, Jun 5, 2025 at 10:17 AM Dave Airlie <airlied@gmail.com> wrote: >>>>>>> >>>>>>> I've hit a case where I think it might be valuable to have the nid + >>>>>>> struct memcg for the item being iterated available in the isolate >>>>>>> callback, I know in theory we should be able to retrieve it from the >>>>>>> item, but I'm also not convinced we should need to since we have it >>>>>>> already in the outer function? >>>>>>> >>>>>>> typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item, >>>>>>> struct list_lru_one *list, >>>>>>> int nid, >>>>>>> struct mem_cgroup *memcg, >>>>>>> void *cb_arg); >>>>>>> >>>>>> >>>>>> Hi Dave, >>>>>> >>>>>>> It's probably not essential (I think I can get the nid back easily, >>>>>>> not sure about the memcg yet), but I thought I'd ask if there would be >>>>>> >>>>>> If it's a slab object you should be able to get it easily with: >>>>>> memcg = mem_cgroup_from_slab_obj(item)); >>>>>> nid = page_to_nid(virt_to_page(item)); >>>>>> >>>>> >>>>> It's in relation to some work trying to tie GPU system memory >>>>> allocations into memcg properly, >>>>> >>>>> Not slab objects, but I do have pages so I'm using page_to_nid right now, >>>>> however these pages aren't currently setting p->memcg_data as I don't >>>>> need that for this, but maybe >>>>> this gives me a reason to go down that road. >>>> >>>> How are you accounting the page to the memcg if the page is not >>>> marked as owned by as specific memcg? >>>> >>>> Are you relying on the page being indexed in a specific list_lru to >>>> account for the page correcting in reclaim contexts, and that's why >>>> you need this information in the walk context? >>>> >>>> I'd actually like to know more details of the problem you are trying >>>> to solve - all I've heard is "we're trying to do <something> with >>>> GPUs and memcgs with list_lrus", but I don't know what it is so I >>>> can't really give decent feedback on your questions.... >>>> >>> >>> Big picture problem, GPU drivers do a lot of memory allocations for >>> userspace applications that historically have not gone via memcg >>> accounting. This has been pointed out to be bad and should be fixed. >>> >>> As part of that problem, GPU drivers have the ability to hand out >>> uncached/writecombined pages to userspace, creating these pages >>> requires changing attributes and as such is a heavy weight operation >>> which necessitates page pools. These page pools only currently have a >>> global shrinker and roll their own NUMA awareness. The >>> uncached/writecombined memory isn't a core feature of userspace usage >>> patterns, but since we want to do things right it seems like a good >>> idea to clean up the space first. 
>>> >>> Get proper vmstat/memcg tracking for all allocations done for the GPU, >>> these can be very large, so I think we should add core mm counters for >>> them and memcg ones as well, so userspace can see them and make more >>> educated decisions. >>> >>> We don't need page level memcg tracking as the pages are all either >>> allocated to the process as part of a larger buffer object, or the >>> pages are in the pool which has the memcg info, so we aren't intending >>> on using __GFP_ACCOUNT at this stage. I also don't really like having >>> this as part of kmem, these really are userspace only things mostly >>> and they are mostly used by gpu and userspace. >>> >>> My rough plan: >>> 1. convert TTM page pools over to list_lru and use a NUMA aware shrinker >>> 2. add global and memcg counters and tracking. >>> 3. convert TTM page pools over to memcg aware shrinker so we get the >>> proper operation inside a memcg for some niche use cases. >>> 4. Figure out how to deal with memory evictions from VRAM - this is >>> probably the hardest problem to solve as there is no great policy. >>> >>> Also handwave shouldn't this all be folios at some point. >>> >> >> The key requirements for memcg would be to track the mm on whose behalf >> the allocation was made. >> >> kmemcg (__GFP_ACCOUNT) tracks only kernel >> allocations (meant for kernel overheads), we don't really need it and >> you've already mentioned this. >> >> For memcg evictions reference count and reclaim is used today, I guess >> in #4, you are referring to getting that information for VRAM? >> >> Is the overall goal to overcommit VRAM or to restrict the amount of >> VRAM usage or a combination of bith? > > This is kinda the crux of where we are getting to. > > We don't track VRAM at all with memcg that will be the dmem controllers jobs. > > But in the corner case where we do overcommit VRAM, who pays for the > system RAM where we evict stuff to. > > I think ideally we would have system limits give an amount of VRAM and > system RAM to a process, and it can live within that budget, and we'd > try not to evict VRAM from processes that have a cgroup accounted > right to some of it, but that isn't great for average things like > desktops or games (where overcommit makes sense), it would be more for > container workloads on GPU clusters. > Makes sense! Thanks! Balbir ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: list_lru isolate callback question? 2025-06-05 22:59 ` Dave Airlie 2025-06-10 22:44 ` Dave Chinner 2025-06-10 23:07 ` Balbir Singh @ 2025-06-11 3:36 ` Matthew Wilcox 2 siblings, 0 replies; 13+ messages in thread From: Matthew Wilcox @ 2025-06-11 3:36 UTC (permalink / raw) To: Dave Airlie Cc: Dave Chinner, Kairui Song, Johannes Weiner, Linux Memory Management List On Fri, Jun 06, 2025 at 08:59:16AM +1000, Dave Airlie wrote: > Also handwave shouldn't this all be folios at some point. Maybe? The question is what per-allocation data do you need to track? For filesystems, it's quite extensive, and that's what folios do. They track the filesystem object (address_space), location within that object (index), per-folio filesystem state (something like buffer_head or iomap folio state), a bunch of flags (dirty/uptodate/locked/writeback/...), LRU list, mapcount, refcount, memcg, etc, etc. If, for example, you need to go from page to "which pool does this page belong to?", that would be a good thing you could keep in your memdesc. You can certainly have some flags. It sounds like you don't need per-allocation memcg; you keep track of memcg per pool (I have a feeling that filesystems would do just as well keeping track of memcgs per inode rather than per folio, but that's a fight for the future). What I would like is for you to do something like 'struct slab' and keep your metadata in the same bits as struct page, but not messing with the contents of struct page. So don't use page->index to store an integer just because it has the right type, define your own type. Happy to help with some of this. ^ permalink raw reply [flat|nested] 13+ messages in thread
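Hand-waving a little further, the "do it like struct slab" suggestion amounts to a driver-private descriptor that aliases struct page instead of borrowing unrelated page fields. Entirely hypothetical sketch; a real version has to mirror struct page's layout exactly and assert the offsets, as mm/slab.h does for struct slab.

    struct ttm_pool_page {                  /* hypothetical name */
            unsigned long flags;            /* must line up with page->flags */
            struct list_head lru;           /* pool LRU linkage */
            struct ttm_pool *pool;          /* owning pool: nid, caching mode, ... */
            unsigned int order;
            /* remaining words must pad out to struct page's size; offsets
             * enforced with static_assert() in a real implementation */
    };

    static inline struct ttm_pool_page *page_tpp(struct page *page)
    {
            return (struct ttm_pool_page *)page;
    }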