From mboxrd@z Thu Jan 1 00:00:00 1970
From: Thomas Hellström
To: intel-xe@lists.freedesktop.org
Cc: Thomas Hellström, Natalie Vock, Johannes Weiner, Tejun Heo,
 Michal Koutný, cgroups@vger.kernel.org, Huang Rui, Matthew Brost,
 Matthew Auld, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
 Simona Vetter, David Airlie, Christian König, Alex Deucher,
 Rodrigo Vivi, dri-devel@lists.freedesktop.org,
 amd-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org
Subject: [PATCH v6 3/6] cgroup/dmem: Add reclaim callback for lowering max below current usage
Date: Thu, 11 Jun 2026 19:32:58 +0200
Message-ID: <20260611173301.17473-4-thomas.hellstrom@linux.intel.com>
X-Mailer: git-send-email 2.54.0
In-Reply-To: <20260611173301.17473-1-thomas.hellstrom@linux.intel.com>
References: <20260611173301.17473-1-thomas.hellstrom@linux.intel.com>
Precedence: bulk
X-Mailing-List: cgroups@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add an optional reclaim callback to struct dmem_cgroup_region. When
dmem.max is set below the current usage of a cgroup pool, the new limit
is applied immediately (so that concurrent allocations are throttled
while reclaim is in progress) and then the driver is asked to evict
memory to bring usage back below the limit.

Reclaim is attempted up to a bounded number of times. No error is
returned to userspace if usage remains above the limit after reclaim,
and a pending signal will abort the reclaim loop early. This matches
the behavior of memory.max in the memory cgroup controller.

Also honor O_NONBLOCK so that if that flag is set during the max value
write, no reclaim is initiated. The idea is to avoid charging the
reclaim cost to the writer of the max value.

v2:
- Write max before reclaim is attempted (Maarten)
- Let signals abort the reclaim without error (Maarten)
- If a new max value is written with the O_NONBLOCK flag,
  reclaim is not attempted (Maarten)
- Extract region from the pool parameter rather than passing it
  explicitly to set_resource_xxx().
v3:
- Use an rwsem to protect reclaim callback registration and region
  unregister against concurrent reclaim invocations, ensuring
  reclaim_priv is visible when the callback is invoked. (Sashiko-bot)
v5:
- Rebased on the introduction of struct dmem_cgroup_init.
- Use nonblock=true in reset_all_resource_limits() to avoid sleeping
  inside rcu_read_lock() in dmemcs_offline(). (Sashiko-bot)
- Compare usage against the truncated limit value stored in cnt.max,
  not the original u64. (Sashiko-bot)
- Use a DMEM_MAX_RECLAIM_RETRIES (16) retry budget instead of 5,
  matching the memcg controller's MAX_RECLAIM_RETRIES. Only -ENOSPC
  (no progress) counts against the retry budget; other errors
  terminate the loop immediately.
v6:
- Fix dmem_cgroup_ops->reclaim docstring: -ENOSPC does not stop reclaim
  immediately but is retried up to DMEM_MAX_RECLAIM_RETRIES times; only
  other negative errors terminate the loop. (Sashiko-bot)

Assisted-by: GitHub_Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström
---
 include/linux/cgroup_dmem.h |  22 +++++++
 kernel/cgroup/dmem.c        | 119 +++++++++++++++++++++++++++++++++---
 2 files changed, 131 insertions(+), 10 deletions(-)

diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
index d9eab8a2c1ee..8664321fa9f7 100644
--- a/include/linux/cgroup_dmem.h
+++ b/include/linux/cgroup_dmem.h
@@ -14,12 +14,34 @@ struct dmem_cgroup_pool_state;
 /* Opaque definition of a cgroup region, used internally */
 struct dmem_cgroup_region;
 
+/**
+ * struct dmem_cgroup_ops - Operations for a dmem cgroup region.
+ * @reclaim: Optional callback invoked when dmem.max is set below the current
+ *           usage of a pool. The driver should attempt to free at least
+ *           @target_bytes from @pool. May be called multiple times if usage
+ *           remains above the limit after returning.
+ *
+ *           Return: 0 if some progress was made (even if less than
+ *           @target_bytes was freed), -ENOSPC if no progress could be made
+ *           (the caller will retry up to a bounded number of times), or
+ *           another negative error code if a fatal error occurred (stops
+ *           further reclaim attempts immediately).
+ */
+struct dmem_cgroup_ops {
+	int (*reclaim)(struct dmem_cgroup_pool_state *pool,
+		       u64 target_bytes, void *priv);
+};
+
 /**
  * struct dmem_cgroup_init - Initialization parameters for a dmem cgroup region.
  * @size: Size of the region in bytes.
+ * @ops: Optional operations for this region. May be NULL.
+ * @reclaim_priv: Opaque pointer passed to @ops->reclaim. May be NULL.
  */
 struct dmem_cgroup_init {
 	u64 size;
+	const struct dmem_cgroup_ops *ops;
+	void *reclaim_priv;
 };
 
 #if IS_ENABLED(CONFIG_CGROUP_DMEM)
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index d12c8543f3fe..da99d133182c 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -18,6 +18,13 @@
 #include
 #include
 
+/*
+ * Number of reclaim attempts before giving up when lowering dmem.max
+ * below current usage. Mirrors memcg's MAX_RECLAIM_RETRIES; unify the
+ * two in a follow-up instead of duplicating the constant.
+ */
+#define DMEM_MAX_RECLAIM_RETRIES 16
+
 struct dmem_cgroup_region {
 	/**
 	 * @ref: References keeping the region alive.
@@ -51,6 +58,24 @@ struct dmem_cgroup_region {
 	 * No new pools should be added to the region afterwards.
 	 */
 	bool unregistered;
+
+	/**
+	 * @ops: Optional operations, set from dmem_cgroup_init at registration.
+	 */
+	const struct dmem_cgroup_ops *ops;
+
+	/** @reclaim_priv: Private data passed to @ops->reclaim. */
+	void *reclaim_priv;
+
+	/**
+	 * @unregister_sem: Serialises reclaim callbacks against unregistration.
+	 *
+	 * Readers (reclaim) hold the read side for the duration of a callback
+	 * invocation. dmem_cgroup_unregister_region() takes the write side to
+	 * drain any in-flight callbacks before returning, so callers may safely
+	 * free @reclaim_priv once unregister returns.
+	 */
+	struct rw_semaphore unregister_sem;
 };
 
 struct dmemcg_state {
@@ -145,21 +170,71 @@ static void free_cg_pool(struct dmem_cgroup_pool_state *pool)
 }
 
 static void
-set_resource_min(struct dmem_cgroup_pool_state *pool, u64 val)
+set_resource_min(struct dmem_cgroup_pool_state *pool, u64 val, bool nonblock)
 {
 	page_counter_set_min(&pool->cnt, val);
 }
 
 static void
-set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
+set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val, bool nonblock)
 {
 	page_counter_set_low(&pool->cnt, val);
 }
 
 static void
-set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
+set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val, bool nonblock)
 {
-	page_counter_set_max(&pool->cnt, val);
+	struct dmem_cgroup_region *region = pool->region;
+	unsigned long limit = (unsigned long)val;
+
+	/*
+	 * Always update the limit, even if usage currently exceeds it.
+	 * Concurrent allocations will be throttled against the new limit
+	 * while reclaim is in progress.
+	 */
+	xchg(&pool->cnt.max, limit);
+
+	if (nonblock)
+		return;
+
+	/*
+	 * Hold the read side for the duration of the reclaim loop so that
+	 * dmem_cgroup_unregister_region() cannot return (and the caller
+	 * cannot free reclaim_priv) while a callback is in progress.
+	 *
+	 * The ops check must happen inside the lock. A caller may have
+	 * observed ops != NULL before dmem_cgroup_unregister_region()
+	 * acquired the write side; rechecking under down_read() is safe
+	 * because region->unregistered is set while the write side is
+	 * held, so any down_read() that succeeds after up_write() will
+	 * see unregistered = true and skip the loop.
+	 */
+	down_read(&region->unregister_sem);
+	if (!region->unregistered && region->ops && region->ops->reclaim) {
+		for (int retries = DMEM_MAX_RECLAIM_RETRIES; ; ) {
+			u64 usage = page_counter_read(&pool->cnt);
+			int ret;
+
+			if (usage <= limit)
+				break;
+
+			if (signal_pending(current))
+				break;
+
+			ret = region->ops->reclaim(pool, usage - limit, region->reclaim_priv);
+
+			/*
+			 * Mirror memcg's retry strategy: only count -ENOSPC (no
+			 * progress) against the retry budget; any other error is
+			 * fatal and terminates the loop immediately.
+			 */
+			if (ret && (ret != -ENOSPC || !retries--))
+				break;
+
+			cond_resched();
+		}
+	}
+	up_read(&region->unregister_sem);
 }
 
 static u64 get_resource_low(struct dmem_cgroup_pool_state *pool)
@@ -189,9 +264,14 @@ static u64 get_resource_peak(struct dmem_cgroup_pool_state *pool)
 
 static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
 {
-	set_resource_min(rpool, 0);
-	set_resource_low(rpool, 0);
-	set_resource_max(rpool, PAGE_COUNTER_MAX);
+	set_resource_min(rpool, 0, false);
+	set_resource_low(rpool, 0, false);
+	/*
+	 * Use nonblock=true: we are raising the limit to PAGE_COUNTER_MAX so
+	 * reclaim is pointless, and dmemcs_offline() holds rcu_read_lock()
+	 * which forbids sleeping.
+	 */
+	set_resource_max(rpool, PAGE_COUNTER_MAX, true);
 }
 
 static void dmemcs_offline(struct cgroup_subsys_state *css)
@@ -468,7 +548,10 @@ static void dmemcg_free_region(struct kref *ref)
  * dmem_cgroup_unregister_region() - Unregister a previously registered region.
  * @region: The region to unregister.
  *
- * This function undoes dmem_cgroup_register_region.
+ * This function undoes dmem_cgroup_register_region. It drains any
+ * in-flight reclaim callbacks before returning, so the caller may safely
+ * free the resources pointed to by @init.reclaim_priv once this function
+ * returns.
  */
 void dmem_cgroup_unregister_region(struct dmem_cgroup_region *region)
 {
@@ -477,6 +560,15 @@ void dmem_cgroup_unregister_region(struct dmem_cgroup_region *region)
 	if (!region)
 		return;
 
+	/*
+	 * Acquire the write side to drain any in-flight reclaim callbacks.
+	 * After up_write() below, set_resource_max() will observe
+	 * region->unregistered = true under its own down_read() and skip
+	 * the reclaim loop, so reclaim_priv is safe to free once this
+	 * function returns.
+	 */
+	down_write(&region->unregister_sem);
+
 	spin_lock(&dmemcg_lock);
 
 	/* Remove from global region list */
@@ -496,6 +588,8 @@ void dmem_cgroup_unregister_region(struct dmem_cgroup_region *region)
 	region->unregistered = true;
 	spin_unlock(&dmemcg_lock);
 
+	up_write(&region->unregister_sem);
+
 	kref_put(&region->ref, dmemcg_free_region);
 }
 EXPORT_SYMBOL_GPL(dmem_cgroup_unregister_region);
@@ -537,7 +631,10 @@ dmem_cgroup_register_region(const struct dmem_cgroup_init *init,
 	INIT_LIST_HEAD(&ret->pools);
 	ret->name = region_name;
 	ret->size = init->size;
+	ret->ops = init->ops;
+	ret->reclaim_priv = init->reclaim_priv;
 	kref_init(&ret->ref);
+	init_rwsem(&ret->unregister_sem);
 
 	spin_lock(&dmemcg_lock);
 	list_add_tail_rcu(&ret->region_node, &dmem_cgroup_regions);
@@ -733,9 +830,10 @@ static int dmemcg_parse_limit(char *options, u64 *new_limit)
 
 static ssize_t dmemcg_limit_write(struct kernfs_open_file *of,
 				 char *buf, size_t nbytes, loff_t off,
-				 void (*apply)(struct dmem_cgroup_pool_state *, u64))
+				 void (*apply)(struct dmem_cgroup_pool_state *, u64, bool))
 {
 	struct dmemcg_state *dmemcs = css_to_dmemcs(of_css(of));
+	bool nonblock = of->file->f_flags & O_NONBLOCK;
 	int err = 0;
 
 	while (buf && !err) {
@@ -780,7 +878,8 @@ static ssize_t dmemcg_limit_write(struct kernfs_open_file *of,
 		}
 
 		/* And commit */
-		apply(pool, new_limit);
+		apply(pool, new_limit, nonblock);
+
 		dmemcg_pool_put(pool);
 
 out_put:
-- 
2.54.0
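
For readers who want to reason about the retry accounting without building a kernel, here is a minimal userspace sketch of the loop in set_resource_max(). It is only a model under stated assumptions: struct model_pool, model_reclaim(), model_reclaim_fatal(), model_set_max() and MODEL_MAX_RECLAIM_RETRIES are hypothetical names that are not part of the patch, and the locking, signal check and cond_resched() are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical constant mirroring DMEM_MAX_RECLAIM_RETRIES from the patch. */
#define MODEL_MAX_RECLAIM_RETRIES 16

/* Hypothetical stand-in for a pool: tracks byte counts only. */
struct model_pool {
	uint64_t usage;      /* bytes currently charged */
	uint64_t evictable;  /* bytes the mock driver can still free */
	unsigned int calls;  /* number of callback invocations */
};

typedef int (*model_reclaim_fn)(struct model_pool *pool,
				uint64_t target_bytes, void *priv);

/*
 * Mock callback: frees at most *priv bytes per call. Returns 0 on any
 * progress and -ENOSPC once nothing evictable is left.
 */
static int model_reclaim(struct model_pool *pool, uint64_t target_bytes,
			 void *priv)
{
	uint64_t chunk = *(uint64_t *)priv;
	uint64_t freed = target_bytes < chunk ? target_bytes : chunk;

	pool->calls++;
	if (!pool->evictable)
		return -ENOSPC;
	if (freed > pool->evictable)
		freed = pool->evictable;
	pool->usage -= freed;
	pool->evictable -= freed;
	return 0;
}

/* Mock callback that always reports a fatal (non -ENOSPC) error. */
static int model_reclaim_fatal(struct model_pool *pool, uint64_t target_bytes,
			       void *priv)
{
	(void)target_bytes;
	(void)priv;
	pool->calls++;
	return -EIO;
}

/* Same control flow as the reclaim loop in the patch, minus locking. */
static void model_set_max(struct model_pool *pool, uint64_t limit,
			  model_reclaim_fn reclaim, void *priv)
{
	int retries = MODEL_MAX_RECLAIM_RETRIES;

	assert(pool && reclaim);
	for (;;) {
		uint64_t usage = pool->usage;
		int ret;

		if (usage <= limit)
			break;

		ret = reclaim(pool, usage - limit, priv);

		/* Only -ENOSPC consumes the budget; other errors are fatal. */
		if (ret && (ret != -ENOSPC || !retries--))
			break;
	}
}
```

In this model, a pool with 1000 bytes in use, a mock that frees 100 bytes per call and a new limit of 400 reaches the limit after six callback invocations. If the mock can free only 200 bytes in total, the loop makes two successful calls followed by 17 calls that return -ENOSPC before it gives up, because the budget is tested before it is decremented. A callback that returns -EIO stops the loop after a single call.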