From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
To: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
intel-xe@lists.freedesktop.org
Cc: "Natalie Vock" <natalie.vock@gmx.de>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Tejun Heo" <tj@kernel.org>, "Michal Koutný" <mkoutny@suse.com>,
cgroups@vger.kernel.org, "Huang Rui" <ray.huang@amd.com>,
"Matthew Brost" <matthew.brost@intel.com>,
"Matthew Auld" <matthew.auld@intel.com>,
"Maxime Ripard" <mripard@kernel.org>,
"Thomas Zimmermann" <tzimmermann@suse.de>,
"Simona Vetter" <simona@ffwll.ch>,
"David Airlie" <airlied@gmail.com>,
"Christian König" <christian.koenig@amd.com>,
"Alex Deucher" <alexander.deucher@amd.com>,
"Rodrigo Vivi" <rodrigo.vivi@intel.com>,
dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org,
linux-kernel@vger.kernel.org,
"Thadeu Lima de Souza Cascardo" <cascardo@igalia.com>
Subject: Re: [PATCH 2/5] cgroup/dmem: Add reclaim callback for lowering max below current usage
Date: Wed, 22 Apr 2026 12:29:35 +0200 [thread overview]
Message-ID: <4f74cacc-ff98-426f-ac31-c25e6cbec314@linux.intel.com> (raw)
In-Reply-To: <3b662522e17e380953d9b981d8c2febecf42455e.camel@linux.intel.com>
Hey,
Den 2026-04-22 kl. 12:20, skrev Thomas Hellström:
> On Wed, 2026-04-22 at 11:50 +0200, Maarten Lankhorst wrote:
>> Hey,
>>
>> On 2026-04-22 at 10:42, Thomas Hellström wrote:
>>> On Wed, 2026-04-22 at 10:31 +0200, Maarten Lankhorst wrote:
>>>> Hey,
>>>>
>>>> (Adding Thadeu to Cc since they've been working on the same issue)
>>>>
>>>> On 2026-03-27 at 09:15, Thomas Hellström wrote:
>>>>> Add an optional reclaim callback to struct dmem_cgroup_region. When
>>>>> dmem.max is set below current usage, invoke the callback to evict memory
>>>>> and retry setting the limit rather than failing immediately. Signal
>>>>> interruptions propagate back to the write() caller.
>>>>>
>>>>> RFC:
>>>>> Due to us updating the max limit _after_ the usage has been sufficiently
>>>>> lowered, this should be prone to failures if there are aggressive
>>>>> allocators running in parallel to the reclaim. So can we somehow enforce
>>>>> the new limit while the eviction is happening?
>>>>>
>>>>> Assisted-by: GitHub Copilot:claude-sonnet-4.6
>>>>> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>>> ---
>>>>>  include/linux/cgroup_dmem.h | 11 +++++
>>>>>  kernel/cgroup/dmem.c        | 94 +++++++++++++++++++++++++++++++++----
>>>>>  2 files changed, 96 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
>>>>> index dd4869f1d736..61520a431740 100644
>>>>> --- a/include/linux/cgroup_dmem.h
>>>>> +++ b/include/linux/cgroup_dmem.h
>>>>> @@ -26,6 +26,10 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
>>>>>  				      bool ignore_low, bool *ret_hit_low);
>>>>>
>>>>>  void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool);
>>>>> +void dmem_cgroup_region_set_reclaim(struct dmem_cgroup_region *region,
>>>>> +				    int (*reclaim)(struct dmem_cgroup_pool_state *pool,
>>>>> +						   u64 target_bytes, void *priv),
>>>>> +				    void *priv);
>>>>>  #else
>>>>>  static inline __printf(2,3) struct dmem_cgroup_region *
>>>>>  dmem_cgroup_register_region(u64 size, const char *name_fmt, ...)
>>>>> @@ -62,5 +66,12 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
>>>>>  static inline void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool)
>>>>>  { }
>>>>>
>>>>> +static inline void
>>>>> +dmem_cgroup_region_set_reclaim(struct dmem_cgroup_region *region,
>>>>> +			       int (*reclaim)(struct dmem_cgroup_pool_state *pool,
>>>>> +					      u64 target_bytes, void *priv),
>>>>> +			       void *priv)
>>>>> +{ }
>>>>> +
>>>>>  #endif
>>>>>  #endif /* _CGROUP_DMEM_H */
>>>>> diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
>>>>> index 3e6d4c0b26a1..f993fb058b74 100644
>>>>> --- a/kernel/cgroup/dmem.c
>>>>> +++ b/kernel/cgroup/dmem.c
>>>>> @@ -51,6 +51,18 @@ struct dmem_cgroup_region {
>>>>>  	 * No new pools should be added to the region afterwards.
>>>>>  	 */
>>>>>  	bool unregistered;
>>>>> +
>>>>> +	/**
>>>>> +	 * @reclaim: Optional callback invoked when dmem.max is set below the
>>>>> +	 * current usage of a pool. The driver should attempt to free at least
>>>>> +	 * @target_bytes from @pool. May be called multiple times if usage
>>>>> +	 * remains above the limit after returning.
>>>>> +	 */
>>>>> +	int (*reclaim)(struct dmem_cgroup_pool_state *pool, u64 target_bytes,
>>>>> +		       void *priv);
>>>>> +
>>>>> +	/** @reclaim_priv: Private data passed to @reclaim. */
>>>>> +	void *reclaim_priv;
>>>>>  };
>>>>>
>>>>>  struct dmemcg_state {
>>>>> @@ -145,23 +157,59 @@ static void free_cg_pool(struct dmem_cgroup_pool_state *pool)
>>>>>  }
>>>>>
>>>>>  static int
>>>>> -set_resource_min(struct dmem_cgroup_pool_state *pool, u64 val)
>>>>> +set_resource_min(struct dmem_cgroup_pool_state *pool, u64 val,
>>>>> +		 struct dmem_cgroup_region *region)
>>>>>  {
>>>>>  	page_counter_set_min(&pool->cnt, val);
>>>>>  	return 0;
>>>>>  }
>>>>>
>>>>>  static int
>>>>> -set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
>>>>> +set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val,
>>>>> +		 struct dmem_cgroup_region *region)
>>>>>  {
>>>>>  	page_counter_set_low(&pool->cnt, val);
>>>>>  	return 0;
>>>>>  }
>>>>>
>>>>>  static int
>>>>> -set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
>>>>> +set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val,
>>>>> +		 struct dmem_cgroup_region *region)
>>>>>  {
>>>>> -	return page_counter_set_max(&pool->cnt, val);
>>>>> +	int err = page_counter_set_max(&pool->cnt, val);
>>>>> +
>>>>> +	if (err != -EBUSY || !region || !region->reclaim)
>>>>> +		return err;
>>>>> +
>>>>> +	/*
>>>>> +	 * The new max is below current usage. Ask the driver to evict memory
>>>>> +	 * and retry, up to a bounded number of times. Signal interruptions are
>>>>> +	 * propagated back to the write() caller; other reclaim failures leave
>>>>> +	 * -EBUSY as the result.
>>>>> +	 */
>>>>> +	for (int retries = 5; retries > 0; retries--) {
>>>>> +		u64 usage = page_counter_read(&pool->cnt);
>>>>> +		u64 target = usage > val ? usage - val : 0;
>>>>> +		int reclaim_err;
>>>>> +
>>>>> +		if (!target) {
>>>>> +			err = page_counter_set_max(&pool->cnt, val);
>>>>> +			break;
>>>>> +		}
>>>>> +
>>>>> +		reclaim_err = region->reclaim(pool, target, region->reclaim_priv);
>>>>> +		if (reclaim_err) {
>>>>> +			if (reclaim_err == -EINTR || reclaim_err == -ERESTARTSYS)
>>>>> +				err = reclaim_err;
>>>>> +			break;
>>>>> +		}
>>>>> +
>>>>> +		err = page_counter_set_max(&pool->cnt, val);
>>>>> +		if (err != -EBUSY)
>>>>> +			break;
>>>>> +	}
>>>>> +
>>>>> +	return err;
>>>>>  }
>>>>
>>>> I mentioned this in chat, but I wanted to raise it on the mailing
>>>> list for others as well: can we reproduce the behavior of
>>>> memory_max_write() in mm/memcontrol.c?
>>>>
>>>> 1. First set the new limit through xchg().
>>>> 2. If O_NONBLOCK is set -> do nothing; the next allocation in the
>>>>    target region will fail and cause reclaim.
>>>> 3. If not set -> reclaim until below the new limit or interrupted by
>>>>    a signal; return success in all cases here, since we did set the
>>>>    new limit.
>>>>
>>>>
>>>
>>> Yup.
>>>
>>> For 3, we also need to consider the case where we fail to reclaim
>>> due to memory being pinned. If it's OK to (usually temporarily) have
>>> current usage above max, that would work.
>>>
>>> I have that coded up, and I also add a patch on top that defers
>>> reclaim to a thread if we bail due to a signal or O_NONBLOCK. Perhaps
>>> we could discuss whether that's a good or bad idea in that patch.
>>
>> That doesn't sound like a good idea. The semantics of O_NONBLOCK are
>> deliberately intended to allow changing the max without causing
>> reclaim.
>>
>> See the details in commit ("memcg: introduce non-blocking limit
>> setting option").
>
> From reading the docs that commit introduces, it sounds more like it
> avoids *synchronous* reclaim, which is also in line with O_NONBLOCK
> semantics.
>
> The analogy with launching a thread would be more that of kswapd doing
> the reclaim in the memcg case?
>
> But OTOH, if we were to introduce a thread-driven dmem reclaim that
> would perhaps be something that wasn't directly tied to the dmem
> controller but rather to the dmem provider itself. (TTM in this case).
From the docs:
+ If memory.max is opened with O_NONBLOCK, then the synchronous
+ reclaim and oom-kill are bypassed. This is useful for admin
+ processes that need to dynamically adjust the job's memory limits
+ without expending their own CPU resources on memory reclamation.
+ The job will trigger the reclaim and/or oom-kill on its next
+ charge request.
The task writing to max will not trigger reclaim; it only sets the new
max value. But when a process that is part of the affected cgroup tries
to allocate memory, it will be forced to reclaim until usage is below
max again. This is a workflow where, instead of the updater doing all
the evictions, the evictions are handled by a process in the cgroup
itself.
>>
>> I also believe it's ok not to continue reclaiming if aborted, the
>> caller can
>> always try again if necessary.
>>
>> If we want to deviate from the memcg controller, we need a very good
>> reason
>> to do so. I'd like to keep the semantics the same if possible.
>>
>>> Will send out when I've updated the IGT tests accordingly.
>>>
>>> Thanks,
>>> Thomas
>>
>> Kind regards,
>> ~Maarten Lankhorst
2026-03-27 8:15 [PATCH 0/5] Add reclaim to the dmem cgroup controller Thomas Hellström
2026-03-27 8:15 ` [PATCH 1/5] cgroup/dmem: Return error when setting max below current usage Thomas Hellström
2026-03-27 8:15 ` [PATCH 2/5] cgroup/dmem: Add reclaim callback for lowering " Thomas Hellström
2026-04-22 8:31 ` Maarten Lankhorst
2026-04-22 8:42 ` Thomas Hellström
2026-04-22 9:50 ` Maarten Lankhorst
2026-04-22 10:20 ` Thomas Hellström
2026-04-22 10:29 ` Maarten Lankhorst [this message]
2026-04-22 10:36 ` Thomas Hellström
2026-03-27 8:15 ` [PATCH 3/5] drm/ttm: Hook up a cgroup-aware reclaim callback for the dmem controller Thomas Hellström
2026-03-27 8:15 ` [PATCH 4/5] drm/xe: Wire up dmem cgroup reclaim for VRAM manager Thomas Hellström
2026-03-27 8:16 ` [PATCH 5/5] drm/amdgpu: " Thomas Hellström
2026-03-27 8:33 ` [PATCH 0/5] Add reclaim to the dmem cgroup controller Thomas Hellström