Message-ID: <4b647952-0038-4878-b67e-6c7fc7ab27a6@linux.intel.com>
Date: Wed, 22 Apr 2026 10:31:14 +0200
Subject: Re: [PATCH 2/5] cgroup/dmem: Add reclaim callback for lowering max below current usage
To: Thomas Hellström, intel-xe@lists.freedesktop.org
Cc: Natalie Vock, Johannes Weiner, Tejun Heo, Michal Koutný,
 cgroups@vger.kernel.org, Huang Rui, Matthew Brost, Matthew Auld,
 Maxime Ripard, Thomas Zimmermann, Simona Vetter, David Airlie,
 Christian König, Alex Deucher, Rodrigo Vivi,
 dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org,
 linux-kernel@vger.kernel.org, Thadeu Lima de Souza Cascardo
References: <20260327081600.4885-1-thomas.hellstrom@linux.intel.com>
 <20260327081600.4885-3-thomas.hellstrom@linux.intel.com>
From: Maarten Lankhorst
In-Reply-To: <20260327081600.4885-3-thomas.hellstrom@linux.intel.com>

Hey,

(Adding Thadeu to cc, since they've been working on the same issue.)

On 2026-03-27 09:15, Thomas Hellström wrote:
> Add an optional reclaim callback to struct dmem_cgroup_region. When
> dmem.max is set below current usage, invoke the callback to evict memory
> and retry setting the limit rather than failing immediately. Signal
> interruptions propagate back to the write() caller.
>
> RFC:
> Due to us updating the max limit _after_ the usage has been
> sufficiently lowered, this should be prone to failures if there are
> aggressive allocators running in parallel to the reclaim.
> So can we somehow enforce the new limit while the eviction is
> happening?
>
> Assisted-by: GitHub Copilot:claude-sonnet-4.6
> Signed-off-by: Thomas Hellström
> ---
>  include/linux/cgroup_dmem.h | 11 +++++
>  kernel/cgroup/dmem.c        | 94 +++++++++++++++++++++++++++++++++----
>  2 files changed, 96 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
> index dd4869f1d736..61520a431740 100644
> --- a/include/linux/cgroup_dmem.h
> +++ b/include/linux/cgroup_dmem.h
> @@ -26,6 +26,10 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
>  				      bool ignore_low, bool *ret_hit_low);
>
>  void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool);
> +void dmem_cgroup_region_set_reclaim(struct dmem_cgroup_region *region,
> +				    int (*reclaim)(struct dmem_cgroup_pool_state *pool,
> +						   u64 target_bytes, void *priv),
> +				    void *priv);
>  #else
>  static inline __printf(2,3) struct dmem_cgroup_region *
>  dmem_cgroup_register_region(u64 size, const char *name_fmt, ...)
> @@ -62,5 +66,12 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
>  static inline void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool)
>  { }
>
> +static inline void
> +dmem_cgroup_region_set_reclaim(struct dmem_cgroup_region *region,
> +			       int (*reclaim)(struct dmem_cgroup_pool_state *pool,
> +					      u64 target_bytes, void *priv),
> +			       void *priv)
> +{ }
> +
>  #endif
>  #endif /* _CGROUP_DMEM_H */
> diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
> index 3e6d4c0b26a1..f993fb058b74 100644
> --- a/kernel/cgroup/dmem.c
> +++ b/kernel/cgroup/dmem.c
> @@ -51,6 +51,18 @@ struct dmem_cgroup_region {
>  	 * No new pools should be added to the region afterwards.
>  	 */
>  	bool unregistered;
> +
> +	/**
> +	 * @reclaim: Optional callback invoked when dmem.max is set below the
> +	 * current usage of a pool. The driver should attempt to free at least
> +	 * @target_bytes from @pool. May be called multiple times if usage
> +	 * remains above the limit after returning.
> +	 */
> +	int (*reclaim)(struct dmem_cgroup_pool_state *pool, u64 target_bytes,
> +		       void *priv);
> +
> +	/** @reclaim_priv: Private data passed to @reclaim. */
> +	void *reclaim_priv;
>  };
>
>  struct dmemcg_state {
> @@ -145,23 +157,59 @@ static void free_cg_pool(struct dmem_cgroup_pool_state *pool)
>  }
>
>  static int
> -set_resource_min(struct dmem_cgroup_pool_state *pool, u64 val)
> +set_resource_min(struct dmem_cgroup_pool_state *pool, u64 val,
> +		 struct dmem_cgroup_region *region)
>  {
>  	page_counter_set_min(&pool->cnt, val);
>  	return 0;
>  }
>
>  static int
> -set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
> +set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val,
> +		 struct dmem_cgroup_region *region)
>  {
>  	page_counter_set_low(&pool->cnt, val);
>  	return 0;
>  }
>
>  static int
> -set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
> +set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val,
> +		 struct dmem_cgroup_region *region)
>  {
> -	return page_counter_set_max(&pool->cnt, val);
> +	int err = page_counter_set_max(&pool->cnt, val);
> +
> +	if (err != -EBUSY || !region || !region->reclaim)
> +		return err;
> +
> +	/*
> +	 * The new max is below current usage. Ask the driver to evict memory
> +	 * and retry, up to a bounded number of times. Signal interruptions are
> +	 * propagated back to the write() caller; other reclaim failures leave
> +	 * -EBUSY as the result.
> +	 */
> +	for (int retries = 5; retries > 0; retries--) {
> +		u64 usage = page_counter_read(&pool->cnt);
> +		u64 target = usage > val ? usage - val : 0;
> +		int reclaim_err;
> +
> +		if (!target) {
> +			err = page_counter_set_max(&pool->cnt, val);
> +			break;
> +		}
> +
> +		reclaim_err = region->reclaim(pool, target, region->reclaim_priv);
> +		if (reclaim_err) {
> +			if (reclaim_err == -EINTR || reclaim_err == -ERESTARTSYS)
> +				err = reclaim_err;
> +			break;
> +		}
> +
> +		err = page_counter_set_max(&pool->cnt, val);
> +		if (err != -EBUSY)
> +			break;
> +	}
> +
> +	return err;
>  }

I mentioned this in chat, but I wanted to raise it on the mailing list for others as well: can we reproduce the behavior of memory_max_write() in mm/memcontrol.c?

1. First set the new limit through xchg.
2. If O_NONBLOCK is set -> do nothing; the next allocation in the target region will fail and cause reclaim.
3. If not set -> reclaim until usage is below the new limit or a signal interrupts; return success in all cases here, since the new limit has been set.

>
>  static u64 get_resource_low(struct dmem_cgroup_pool_state *pool)
> @@ -186,9 +234,9 @@ static u64 get_resource_current(struct dmem_cgroup_pool_state *pool)
>
>  static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
>  {
> -	set_resource_min(rpool, 0);
> -	set_resource_low(rpool, 0);
> -	set_resource_max(rpool, PAGE_COUNTER_MAX);
> +	set_resource_min(rpool, 0, NULL);
> +	set_resource_low(rpool, 0, NULL);
> +	set_resource_max(rpool, PAGE_COUNTER_MAX, NULL);
>  }
>
>  static void dmemcs_offline(struct cgroup_subsys_state *css)
> @@ -570,6 +618,32 @@ void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool)
>  }
>  EXPORT_SYMBOL_GPL(dmem_cgroup_pool_state_put);
>
> +/**
> + * dmem_cgroup_region_set_reclaim - Register a reclaim callback on a region.
> + * @region: The region to register the callback for.
> + * @reclaim: Callback to invoke when dmem.max is set below current usage.
> + *           Called with the pool that needs reclaiming and the number of
> + *           bytes to free. Returns 0 on progress, negative on failure.
> + * @priv: Opaque pointer passed back to @reclaim.
> + *
> + * When dmem.max is lowered below the current usage of a cgroup pool, the
> + * dmem controller will call @reclaim with a target number of bytes to free.
> + * After @reclaim returns the controller retries setting the limit; if usage
> + * is still too high it calls @reclaim again, up to a bounded retry count.
> + */
> +void dmem_cgroup_region_set_reclaim(struct dmem_cgroup_region *region,
> +				    int (*reclaim)(struct dmem_cgroup_pool_state *pool,
> +						   u64 target_bytes, void *priv),
> +				    void *priv)
> +{
> +	if (!region)
> +		return;
> +
> +	region->reclaim = reclaim;
> +	region->reclaim_priv = priv;
> +}
> +EXPORT_SYMBOL_GPL(dmem_cgroup_region_set_reclaim);
> +
>  static struct dmem_cgroup_pool_state *
>  get_cg_pool_unlocked(struct dmemcg_state *cg, struct dmem_cgroup_region *region)
>  {
> @@ -728,7 +802,8 @@ static int dmemcg_parse_limit(char *options, struct dmem_cgroup_region *region,
>
>  static ssize_t dmemcg_limit_write(struct kernfs_open_file *of,
>  				  char *buf, size_t nbytes, loff_t off,
> -				  int (*apply)(struct dmem_cgroup_pool_state *, u64))
> +				  int (*apply)(struct dmem_cgroup_pool_state *, u64,
> +					       struct dmem_cgroup_region *))
>  {
>  	struct dmemcg_state *dmemcs = css_to_dmemcs(of_css(of));
>  	int err = 0;
> @@ -775,7 +850,8 @@ static ssize_t dmemcg_limit_write(struct kernfs_open_file *of,
>  	}
>
>  	/* And commit */
> -	err = apply(pool, new_limit);
> +	err = apply(pool, new_limit, region);
> +
>  	dmemcg_pool_put(pool);
>
>  out_put:

Kind regards,
~Maarten Lankhorst