From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: intel-xe@lists.freedesktop.org
Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
"Natalie Vock" <natalie.vock@gmx.de>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Tejun Heo" <tj@kernel.org>, "Michal Koutný" <mkoutny@suse.com>,
cgroups@vger.kernel.org, "Huang Rui" <ray.huang@amd.com>,
"Matthew Brost" <matthew.brost@intel.com>,
"Matthew Auld" <matthew.auld@intel.com>,
"Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
"Maxime Ripard" <mripard@kernel.org>,
"Thomas Zimmermann" <tzimmermann@suse.de>,
"Simona Vetter" <simona@ffwll.ch>,
"David Airlie" <airlied@gmail.com>,
"Christian König" <christian.koenig@amd.com>,
"Thadeu Lima de Souza Cascardo" <cascardo@igalia.com>,
"Alex Deucher" <alexander.deucher@amd.com>,
"Rodrigo Vivi" <rodrigo.vivi@intel.com>,
dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org,
linux-kernel@vger.kernel.org
Subject: [PATCH v7 3/6] cgroup/dmem: Add reclaim callback for lowering max below current usage
Date: Fri, 3 Jul 2026 15:05:38 +0200
Message-ID: <20260703130541.2686-4-thomas.hellstrom@linux.intel.com>
In-Reply-To: <20260703130541.2686-1-thomas.hellstrom@linux.intel.com>
Add an optional reclaim callback for dmem cgroup regions, supplied
through struct dmem_cgroup_ops at region registration. When dmem.max is
set below the current usage of a cgroup pool, the new limit is applied
immediately (so that concurrent allocations are throttled while reclaim
is in progress) and then the driver is asked to evict memory to bring
usage back to or below the limit.
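Reclaim is retried a bounded number of times: only attempts that make
no progress (-ENOSPC) count against the DMEM_MAX_RECLAIM_RETRIES (16)
budget, and any other error ends the loop immediately. No error is
returned to userspace if usage remains above the limit after reclaim,
and a pending signal aborts the reclaim loop early. This matches the
behavior of memory.max in the memory cgroup controller.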
Also honor O_NONBLOCK: if the file was opened with that flag when the
max value is written, the new limit is still applied but no reclaim is
initiated. The idea is to avoid charging the reclaim cost to the writer
of the max value.
v2:
- Write max before reclaim is attempted (Maarten)
- Let signals abort the reclaim without error (Maarten)
- If a new max value is written with the O_NONBLOCK flag,
  reclaim is not attempted (Maarten)
- Extract region from the pool parameter rather than
  passing it explicitly to set_resource_xxx().
v3:
- Use an rw_semaphore (unregister_sem) to protect reclaim callbacks
  against concurrent region unregistration: readers (reclaim) hold the
  read side; dmem_cgroup_unregister_region() takes the write side to
  drain in-flight callbacks before returning. (Sashiko-bot)
v5:
- Rebased on the introduction of struct dmem_cgroup_init.
- Use nonblock=true in reset_all_resource_limits() to avoid sleeping
  inside rcu_read_lock() in dmemcs_offline(). (Sashiko-bot)
- Compare usage against the truncated limit value stored in cnt.max,
  not the original u64. (Sashiko-bot)
- Use a DMEM_MAX_RECLAIM_RETRIES (16) retry budget instead of 5, matching
  the memcg controller's MAX_RECLAIM_RETRIES. Only -ENOSPC (no progress)
  counts against the retry budget; other errors terminate the loop
  immediately.
v6:
- Fix dmem_cgroup_ops->reclaim docstring: -ENOSPC does not stop reclaim
  immediately but is retried up to DMEM_MAX_RECLAIM_RETRIES times; only
  other negative errors terminate the loop. (Sashiko-bot)
v7:
- Replace the per-region rw_semaphore with a static SRCU domain
  (dmemcg_srcu). SRCU is a better fit than rwsem for this use: it
  avoids the per-region lock overhead on every reclaim call, and
  synchronize_srcu() at unregister time is a rare operation. (Maarten)
- Trim in-function comments to focus on what rather than how.
Assisted-by: GitHub_Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
include/linux/cgroup_dmem.h | 22 ++++++++++
kernel/cgroup/dmem.c | 81 +++++++++++++++++++++++++++++++------
2 files changed, 91 insertions(+), 12 deletions(-)
diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
index d9eab8a2c1ee..8664321fa9f7 100644
--- a/include/linux/cgroup_dmem.h
+++ b/include/linux/cgroup_dmem.h
@@ -14,12 +14,34 @@ struct dmem_cgroup_pool_state;
/* Opaque definition of a cgroup region, used internally */
struct dmem_cgroup_region;
+/**
+ * struct dmem_cgroup_ops - Operations for a dmem cgroup region.
+ * @reclaim: Optional callback invoked when dmem.max is set below the current
+ * usage of a pool. The driver should attempt to free at least
+ * @target_bytes from @pool. May be called multiple times if usage
+ * remains above the limit after returning.
+ *
+ * Return: 0 if some progress was made (even if less than
+ * @target_bytes was freed), -ENOSPC if no progress could be made
+ * (the caller will retry up to a bounded number of times), or
+ * another negative error code if a fatal error occurred (stops
+ * further reclaim attempts immediately).
+ */
+struct dmem_cgroup_ops {
+ int (*reclaim)(struct dmem_cgroup_pool_state *pool,
+ u64 target_bytes, void *priv);
+};
+
/**
* struct dmem_cgroup_init - Initialization parameters for a dmem cgroup region.
* @size: Size of the region in bytes.
+ * @ops: Optional operations for this region. May be NULL.
+ * @reclaim_priv: Opaque pointer passed to @ops->reclaim. May be NULL.
*/
struct dmem_cgroup_init {
u64 size;
+ const struct dmem_cgroup_ops *ops;
+ void *reclaim_priv;
};
#if IS_ENABLED(CONFIG_CGROUP_DMEM)
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index d12c8543f3fe..719d28dd1078 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -17,6 +17,13 @@
#include <linux/refcount.h>
#include <linux/rculist.h>
#include <linux/slab.h>
+#include <linux/srcu.h>
+
+/* Maximum reclaim attempts before giving up when lowering dmem.max. */
+#define DMEM_MAX_RECLAIM_RETRIES 16
+
+/* SRCU domain serialising reclaim callbacks against region unregistration. */
+DEFINE_STATIC_SRCU(dmemcg_srcu);
struct dmem_cgroup_region {
/**
@@ -48,9 +55,18 @@ struct dmem_cgroup_region {
/**
* @unregistered: Whether the region is unregistered by its caller.
- * No new pools should be added to the region afterwards.
+ * No new pools should be added to the region afterwards, and no new
+ * reclaim callbacks should be invoked.
*/
bool unregistered;
+
+ /**
+ * @ops: Optional driver operations for this region.
+ */
+ const struct dmem_cgroup_ops *ops;
+
+ /** @reclaim_priv: Private data passed to @ops->reclaim. */
+ void *reclaim_priv;
};
struct dmemcg_state {
@@ -145,21 +161,52 @@ static void free_cg_pool(struct dmem_cgroup_pool_state *pool)
}
static void
-set_resource_min(struct dmem_cgroup_pool_state *pool, u64 val)
+set_resource_min(struct dmem_cgroup_pool_state *pool, u64 val, bool nonblock)
{
page_counter_set_min(&pool->cnt, val);
}
static void
-set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
+set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val, bool nonblock)
{
page_counter_set_low(&pool->cnt, val);
}
static void
-set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
+set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val, bool nonblock)
{
- page_counter_set_max(&pool->cnt, val);
+ struct dmem_cgroup_region *region = pool->region;
+ unsigned long limit = (unsigned long)val;
+
+ /* Apply the new limit immediately so concurrent allocations are throttled. */
+ xchg(&pool->cnt.max, limit);
+
+ if (nonblock)
+ return;
+
+ int srcu_idx = srcu_read_lock(&dmemcg_srcu);
+
+ if (!READ_ONCE(region->unregistered) && region->ops && region->ops->reclaim) {
+ for (int retries = DMEM_MAX_RECLAIM_RETRIES; ; ) {
+ u64 usage = page_counter_read(&pool->cnt);
+ int ret;
+
+ if (usage <= limit)
+ break;
+
+ if (signal_pending(current))
+ break;
+
+ ret = region->ops->reclaim(pool, usage - limit, region->reclaim_priv);
+
+ /* -ENOSPC means no progress; other errors are fatal. */
+ if (ret && (ret != -ENOSPC || !retries--))
+ break;
+
+ cond_resched();
+ }
+ }
+ srcu_read_unlock(&dmemcg_srcu, srcu_idx);
}
static u64 get_resource_low(struct dmem_cgroup_pool_state *pool)
@@ -189,9 +236,10 @@ static u64 get_resource_peak(struct dmem_cgroup_pool_state *pool)
static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
{
- set_resource_min(rpool, 0);
- set_resource_low(rpool, 0);
- set_resource_max(rpool, PAGE_COUNTER_MAX);
+ set_resource_min(rpool, 0, false);
+ set_resource_low(rpool, 0, false);
+ /* nonblock: raising to max makes reclaim a no-op; sleeping is forbidden here. */
+ set_resource_max(rpool, PAGE_COUNTER_MAX, true);
}
static void dmemcs_offline(struct cgroup_subsys_state *css)
@@ -468,7 +516,10 @@ static void dmemcg_free_region(struct kref *ref)
* dmem_cgroup_unregister_region() - Unregister a previously registered region.
* @region: The region to unregister.
*
- * This function undoes dmem_cgroup_register_region.
+ * This function undoes dmem_cgroup_register_region. It drains any
+ * in-flight reclaim callbacks before returning, so the caller may safely
+ * free the resources pointed to by the @reclaim_priv that was passed at
+ * registration time.
*/
void dmem_cgroup_unregister_region(struct dmem_cgroup_region *region)
{
@@ -493,9 +544,11 @@ void dmem_cgroup_unregister_region(struct dmem_cgroup_region *region)
* no new pools should be added to the dead region
* by get_cg_pool_unlocked.
*/
- region->unregistered = true;
+ WRITE_ONCE(region->unregistered, true);
spin_unlock(&dmemcg_lock);
+ synchronize_srcu(&dmemcg_srcu);
+
kref_put(&region->ref, dmemcg_free_region);
}
EXPORT_SYMBOL_GPL(dmem_cgroup_unregister_region);
@@ -537,6 +590,8 @@ dmem_cgroup_register_region(const struct dmem_cgroup_init *init,
INIT_LIST_HEAD(&ret->pools);
ret->name = region_name;
ret->size = init->size;
+ ret->ops = init->ops;
+ ret->reclaim_priv = init->reclaim_priv;
kref_init(&ret->ref);
spin_lock(&dmemcg_lock);
@@ -733,9 +788,10 @@ static int dmemcg_parse_limit(char *options, u64 *new_limit)
static ssize_t dmemcg_limit_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off,
- void (*apply)(struct dmem_cgroup_pool_state *, u64))
+ void (*apply)(struct dmem_cgroup_pool_state *, u64, bool))
{
struct dmemcg_state *dmemcs = css_to_dmemcs(of_css(of));
+ bool nonblock = of->file->f_flags & O_NONBLOCK;
int err = 0;
while (buf && !err) {
@@ -780,7 +836,8 @@ static ssize_t dmemcg_limit_write(struct kernfs_open_file *of,
}
/* And commit */
- apply(pool, new_limit);
+ apply(pool, new_limit, nonblock);
+
dmemcg_pool_put(pool);
out_put:
--
2.54.0
Thread overview: 14+ messages
2026-07-03 13:05 [PATCH v7 0/6] Add reclaim to the dmem cgroup controller Thomas Hellström
2026-07-03 13:05 ` [PATCH v7 1/6] drm/amdgpu: Fix init ordering in amdgpu_vram_mgr_init() Thomas Hellström
2026-07-03 13:08 ` Christian König
2026-07-03 13:11 ` Thomas Hellström
2026-07-03 13:26 ` sashiko-bot
2026-07-03 13:05 ` [PATCH v7 2/6] cgroup/dmem: Introduce struct dmem_cgroup_init for region initialization Thomas Hellström
2026-07-03 13:05 ` Thomas Hellström [this message]
2026-07-03 13:24 ` [PATCH v7 3/6] cgroup/dmem: Add reclaim callback for lowering max below current usage sashiko-bot
2026-07-03 13:05 ` [PATCH v7 4/6] drm/ttm: Hook up a cgroup-aware reclaim callback for the dmem controller Thomas Hellström
2026-07-03 13:19 ` sashiko-bot
2026-07-03 13:05 ` [PATCH v7 5/6] drm/xe: Wire up dmem cgroup reclaim for VRAM manager Thomas Hellström
2026-07-03 13:05 ` [PATCH v7 6/6] drm/amdgpu: " Thomas Hellström
2026-07-03 13:25 ` sashiko-bot
2026-07-03 14:37 ` [PATCH v7 0/6] Add reclaim to the dmem cgroup controller Thadeu Lima de Souza Cascardo