Re: [PATCH 2/5] cgroup/dmem: Add reclaim callback for lowering max below current usage


From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>,
	 intel-xe@lists.freedesktop.org
Cc: "Natalie Vock" <natalie.vock@gmx.de>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Tejun Heo" <tj@kernel.org>, "Michal Koutný" <mkoutny@suse.com>,
	cgroups@vger.kernel.org, "Huang Rui" <ray.huang@amd.com>,
	"Matthew Brost" <matthew.brost@intel.com>,
	"Matthew Auld" <matthew.auld@intel.com>,
	"Maxime Ripard" <mripard@kernel.org>,
	"Thomas Zimmermann" <tzimmermann@suse.de>,
	"Simona Vetter" <simona@ffwll.ch>,
	"David Airlie" <airlied@gmail.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Alex Deucher" <alexander.deucher@amd.com>,
	"Rodrigo Vivi" <rodrigo.vivi@intel.com>,
	dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org,
	linux-kernel@vger.kernel.org,
	"Thadeu Lima de Souza Cascardo" <cascardo@igalia.com>
Subject: Re: [PATCH 2/5] cgroup/dmem: Add reclaim callback for lowering max below current usage
Date: Wed, 22 Apr 2026 12:36:50 +0200
Message-ID: <fb353b64e9084bc8fff01f8d5cc45701a2a60a60.camel@linux.intel.com>
In-Reply-To: <4f74cacc-ff98-426f-ac31-c25e6cbec314@linux.intel.com>

On Wed, 2026-04-22 at 12:29 +0200, Maarten Lankhorst wrote:
> Hey,
> 
> On 2026-04-22 at 12:20, Thomas Hellström wrote:
> > On Wed, 2026-04-22 at 11:50 +0200, Maarten Lankhorst wrote:
> > > Hey,
> > > 
> > > On 2026-04-22 at 10:42, Thomas Hellström wrote:
> > > > On Wed, 2026-04-22 at 10:31 +0200, Maarten Lankhorst wrote:
> > > > > Hey,
> > > > > 
> > > > > (Adding Thadeu to cc since they've been working on the same
> > > > > issue)
> > > > > 
> > > > > On 2026-03-27 at 09:15, Thomas Hellström wrote:
> > > > > > Add an optional reclaim callback to struct dmem_cgroup_region.
> > > > > > When dmem.max is set below current usage, invoke the callback
> > > > > > to evict memory and retry setting the limit rather than failing
> > > > > > immediately. Signal interruptions propagate back to the write()
> > > > > > caller.
> > > > > > 
> > > > > > RFC:
> > > > > > Due to us updating the max limit _after_ the usage has been
> > > > > > sufficiently lowered, this should be prone to failures if there
> > > > > > are aggressive allocators running in parallel to the reclaim.
> > > > > > So can we somehow enforce the new limit while the eviction is
> > > > > > happening?
> > > > > > 
> > > > > > Assisted-by: GitHub Copilot:claude-sonnet-4.6
> > > > > > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > > ---
> > > > > >  include/linux/cgroup_dmem.h | 11 +++++
> > > > > >  kernel/cgroup/dmem.c        | 94 +++++++++++++++++++++++++++++++++----
> > > > > >  2 files changed, 96 insertions(+), 9 deletions(-)
> > > > > > 
> > > > > > diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
> > > > > > index dd4869f1d736..61520a431740 100644
> > > > > > --- a/include/linux/cgroup_dmem.h
> > > > > > +++ b/include/linux/cgroup_dmem.h
> > > > > > @@ -26,6 +26,10 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
> > > > > >  				      bool ignore_low, bool *ret_hit_low);
> > > > > >  
> > > > > >  void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool);
> > > > > > +void dmem_cgroup_region_set_reclaim(struct dmem_cgroup_region *region,
> > > > > > +				    int (*reclaim)(struct dmem_cgroup_pool_state *pool,
> > > > > > +						   u64 target_bytes, void *priv),
> > > > > > +				    void *priv);
> > > > > >  #else
> > > > > >  static inline __printf(2,3) struct dmem_cgroup_region *
> > > > > >  dmem_cgroup_register_region(u64 size, const char *name_fmt, ...)
> > > > > > @@ -62,5 +66,12 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
> > > > > >  static inline void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool)
> > > > > >  { }
> > > > > >  
> > > > > > +static inline void
> > > > > > +dmem_cgroup_region_set_reclaim(struct dmem_cgroup_region *region,
> > > > > > +			       int (*reclaim)(struct dmem_cgroup_pool_state *pool,
> > > > > > +					      u64 target_bytes, void *priv),
> > > > > > +			       void *priv)
> > > > > > +{ }
> > > > > > +
> > > > > >  #endif
> > > > > >  #endif	/* _CGROUP_DMEM_H */
> > > > > > diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
> > > > > > index 3e6d4c0b26a1..f993fb058b74 100644
> > > > > > --- a/kernel/cgroup/dmem.c
> > > > > > +++ b/kernel/cgroup/dmem.c
> > > > > > @@ -51,6 +51,18 @@ struct dmem_cgroup_region {
> > > > > >  	 * No new pools should be added to the region afterwards.
> > > > > >  	 */
> > > > > >  	bool unregistered;
> > > > > > +
> > > > > > +	/**
> > > > > > +	 * @reclaim: Optional callback invoked when dmem.max is set below the
> > > > > > +	 * current usage of a pool. The driver should attempt to free at least
> > > > > > +	 * @target_bytes from @pool. May be called multiple times if usage
> > > > > > +	 * remains above the limit after returning.
> > > > > > +	 */
> > > > > > +	int (*reclaim)(struct dmem_cgroup_pool_state *pool, u64 target_bytes,
> > > > > > +		       void *priv);
> > > > > > +
> > > > > > +	/** @reclaim_priv: Private data passed to @reclaim. */
> > > > > > +	void *reclaim_priv;
> > > > > >  };
> > > > > >  
> > > > > >  struct dmemcg_state {
> > > > > > @@ -145,23 +157,59 @@ static void free_cg_pool(struct dmem_cgroup_pool_state *pool)
> > > > > >  }
> > > > > >  
> > > > > >  static int
> > > > > > -set_resource_min(struct dmem_cgroup_pool_state *pool, u64 val)
> > > > > > +set_resource_min(struct dmem_cgroup_pool_state *pool, u64 val,
> > > > > > +		 struct dmem_cgroup_region *region)
> > > > > >  {
> > > > > >  	page_counter_set_min(&pool->cnt, val);
> > > > > >  	return 0;
> > > > > >  }
> > > > > >  
> > > > > >  static int
> > > > > > -set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
> > > > > > +set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val,
> > > > > > +		 struct dmem_cgroup_region *region)
> > > > > >  {
> > > > > >  	page_counter_set_low(&pool->cnt, val);
> > > > > >  	return 0;
> > > > > >  }
> > > > > >  
> > > > > >  static int
> > > > > > -set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
> > > > > > +set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val,
> > > > > > +		 struct dmem_cgroup_region *region)
> > > > > >  {
> > > > > > -	return page_counter_set_max(&pool->cnt, val);
> > > > > > +	int err = page_counter_set_max(&pool->cnt, val);
> > > > > > +
> > > > > > +	if (err != -EBUSY || !region || !region->reclaim)
> > > > > > +		return err;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * The new max is below current usage.  Ask the driver to evict memory
> > > > > > +	 * and retry, up to a bounded number of times.  Signal interruptions are
> > > > > > +	 * propagated back to the write() caller; other reclaim failures leave
> > > > > > +	 * -EBUSY as the result.
> > > > > > +	 */
> > > > > > +	for (int retries = 5; retries > 0; retries--) {
> > > > > > +		u64 usage = page_counter_read(&pool->cnt);
> > > > > > +		u64 target = usage > val ? usage - val : 0;
> > > > > > +		int reclaim_err;
> > > > > > +
> > > > > > +		if (!target) {
> > > > > > +			err = page_counter_set_max(&pool->cnt, val);
> > > > > > +			break;
> > > > > > +		}
> > > > > > +
> > > > > > +		reclaim_err = region->reclaim(pool, target, region->reclaim_priv);
> > > > > > +		if (reclaim_err) {
> > > > > > +			if (reclaim_err == -EINTR || reclaim_err == -ERESTARTSYS)
> > > > > > +				err = reclaim_err;
> > > > > > +			break;
> > > > > > +		}
> > > > > > +
> > > > > > +		err = page_counter_set_max(&pool->cnt, val);
> > > > > > +		if (err != -EBUSY)
> > > > > > +			break;
> > > > > > +	}
> > > > > > +
> > > > > > +	return err;
> > > > > >  }
> > > > > 
> > > > > I mentioned this in chat, but I wanted to raise it on the mailing
> > > > > list for others as well: can we reproduce the behavior of
> > > > > memory_max_write() in mm/memcontrol.c?
> > > > > 
> > > > > 1. First set the new limit through xchg.
> > > > > 2. If O_NONBLOCK is set -> do nothing; the next allocation in the
> > > > >    target region will fail and cause reclaim.
> > > > > 3. If not set -> reclaim until below the new limit or interrupted
> > > > >    by a signal; return success in all cases here, since we set the
> > > > >    new limit.
> > > > > 
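The three steps above can be illustrated with a minimal userspace sketch. It is not kernel code and not the proposed dmem implementation: `sketch_pool`, `sketch_reclaim()` and `sketch_max_write()` are hypothetical names, the `pinned` field and `signal_pending` flag are stand-ins invented for illustration, and the bounded retry count is an assumption borrowed from the patch rather than from memcg.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a pool; not the kernel structure. */
struct sketch_pool {
	_Atomic uint64_t max;   /* current limit in bytes */
	_Atomic uint64_t usage; /* bytes currently charged */
	uint64_t pinned;        /* bytes that reclaim cannot evict */
	bool signal_pending;    /* stands in for a pending signal */
};

/* Hypothetical reclaim: evict up to @target bytes, never pinned ones. */
static uint64_t sketch_reclaim(struct sketch_pool *pool, uint64_t target)
{
	uint64_t usage = atomic_load(&pool->usage);
	uint64_t evictable = usage > pool->pinned ? usage - pool->pinned : 0;
	uint64_t freed = target < evictable ? target : evictable;

	atomic_fetch_sub(&pool->usage, freed);
	return freed;
}

/* Always returns 0: the new limit is in place even if usage stays above it. */
static int sketch_max_write(struct sketch_pool *pool, uint64_t new_max,
			    bool nonblock)
{
	/* Step 1: publish the new limit before any reclaim happens. */
	atomic_exchange(&pool->max, new_max);

	/* Step 2: a non-blocking writer leaves reclaim to later charges. */
	if (nonblock)
		return 0;

	/* Step 3: reclaim until below the limit, a signal, or no progress. */
	for (int retries = 5; retries > 0; retries--) {
		uint64_t usage = atomic_load(&pool->usage);

		if (usage <= new_max || pool->signal_pending)
			break;
		if (!sketch_reclaim(pool, usage - new_max))
			break; /* nothing evictable, e.g. pinned memory */
	}

	return 0;
}
```

Because the limit is published first, parallel allocators are already held to the new max while eviction runs, which is the concern raised in the RFC note of the patch. The pinned case simply leaves usage above max with a success return.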
> > > > > 
> > > > 
> > > > Yup.
> > > > 
> > > > For 3, we also need to consider the case where we fail to reclaim
> > > > due to memory being pinned. If it's OK to have current usage above
> > > > max (usually temporarily), that would work.
> > > > 
> > > > I have that coded up, and have also added a patch on top to defer
> > > > reclaim to a thread if we bail due to a signal or O_NONBLOCK.
> > > > Perhaps we could discuss whether that's a good or bad idea in that
> > > > patch.
> > > 
> > > That doesn't sound like a good idea. The semantics of O_NONBLOCK
> > > are deliberately intended to be able to change the max without
> > > causing
> > > reclaim.
> > > 
> > > See the details in commit ("memcg: introduce non-blocking limit
> > > setting option")
> > 
> > From reading the docs that commit introduces, it sounds more like it
> > avoids *synchronous* reclaim, which is also in line with O_NONBLOCK
> > semantics.
> > 
> > The analogy with launching a thread would be more that of kswapd
> > doing the reclaim in the memcg case?
> > 
> > But OTOH, if we were to introduce a thread-driven dmem reclaim, that
> > would perhaps be something that wasn't directly tied to the dmem
> > controller but rather to the dmem provider itself (TTM in this
> > case).
> 
> From the docs:
> +        If memory.max is opened with O_NONBLOCK, then the synchronous
> +        reclaim and oom-kill are bypassed. This is useful for admin
> +        processes that need to dynamically adjust the job's memory limits
> +        without expending their own CPU resources on memory reclamation.
> +        The job will trigger the reclaim and/or oom-kill on its next
> +        charge request.
> 
> The task writing to max will not trigger a reclaim,
> only set the new max value.

I still read *synchronous* reclaim.

> 
> But when a process that is part of the affected cgroup tries to
> allocate memory, it will be forced to reclaim memory until usage is
> below max again.
> 
> This is a workflow where, instead of the updater doing all the
> evictions, the evictions are handled by a process in the cgroup
> itself.
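The workflow described above, where the charging process rather than the limit writer pays for eviction, can be sketched in plain C. This is a simplified single-threaded model under stated assumptions, not the memcg or dmem charge path: `sketch_cg`, `sketch_evict()` and `sketch_try_charge()` are hypothetical names, and newly charged bytes are not added to `evictable` to keep the model small.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one cgroup's accounting; not a kernel structure. */
struct sketch_cg {
	uint64_t max;       /* limit, possibly lowered by a non-blocking write */
	uint64_t usage;     /* bytes currently charged */
	uint64_t evictable; /* portion of usage that eviction can free */
};

/* Evict up to @target bytes and return how many were actually freed. */
static uint64_t sketch_evict(struct sketch_cg *cg, uint64_t target)
{
	uint64_t freed = target < cg->evictable ? target : cg->evictable;

	cg->evictable -= freed;
	cg->usage -= freed;
	return freed;
}

/*
 * Charge @bytes on behalf of a process in the cgroup. If the charge would
 * exceed max, the charging process itself evicts first; the charge fails
 * only if eviction could not make enough room.
 */
static bool sketch_try_charge(struct sketch_cg *cg, uint64_t bytes)
{
	if (cg->usage + bytes > cg->max)
		sketch_evict(cg, cg->usage + bytes - cg->max);

	if (cg->usage + bytes > cg->max)
		return false;

	cg->usage += bytes;
	return true;
}
```

With max lowered from 100 to 50 while usage is 90, the next 10-byte charge evicts 50 bytes before succeeding; the writer of the limit spent nothing on reclaim.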

But kswapd is still used to do background per-cgroup reclaim in this
case, right?

Thanks,
Thomas


> 
> > > 
> > > I also believe it's ok not to continue reclaiming if aborted; the
> > > caller can always try again if necessary.
> > > 
> > > If we want to deviate from the memcg controller, we need a very good
> > > reason to do so. I'd like to keep the semantics the same if possible.
> > > 
> > > > Will send out when I've updated the IGT tests accordingly.
> > > > 
> > > > Thanks,
> > > > Thomas
> > > 
> > > Kind regards,
> > > ~Maarten Lankhorst

Thread overview: 13+ messages
2026-03-27  8:15 [PATCH 0/5] Add reclaim to the dmem cgroup controller Thomas Hellström
2026-03-27  8:15 ` [PATCH 1/5] cgroup/dmem: Return error when setting max below current usage Thomas Hellström
2026-03-27  8:15 ` [PATCH 2/5] cgroup/dmem: Add reclaim callback for lowering " Thomas Hellström
2026-04-22  8:31   ` Maarten Lankhorst
2026-04-22  8:42     ` Thomas Hellström
2026-04-22  9:50       ` Maarten Lankhorst
2026-04-22 10:20         ` Thomas Hellström
2026-04-22 10:29           ` Maarten Lankhorst
2026-04-22 10:36             ` Thomas Hellström [this message]
2026-03-27  8:15 ` [PATCH 3/5] drm/ttm: Hook up a cgroup-aware reclaim callback for the dmem controller Thomas Hellström
2026-03-27  8:15 ` [PATCH 4/5] drm/xe: Wire up dmem cgroup reclaim for VRAM manager Thomas Hellström
2026-03-27  8:16 ` [PATCH 5/5] drm/amdgpu: " Thomas Hellström
2026-03-27  8:33 ` [PATCH 0/5] Add reclaim to the dmem cgroup controller Thomas Hellström
