Linux cgroups development


* Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: Yosry Ahmed @ 2026-05-30  1:40 UTC
  To: Nhat Pham
  Cc: Hao Jia, akpm, tj, hannes, shakeel.butt, mhocko, mkoutny,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia
In-Reply-To: <CAKEwX=MQe_KFZe2vBXQYh0aa-x+E8AzNwmyjJGJk4tDoS9ML3A@mail.gmail.com>

On Fri, May 29, 2026 at 12:58:09PM -0700, Nhat Pham wrote:
> On Tue, May 26, 2026 at 4:46 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
> >
> > From: Hao Jia <jiahao1@lixiang.com>
> >
> > Zswap currently writes back pages to backing swap reactively, triggered
> > either by the shrinker or when the pool reaches its size limit. There is
> > no mechanism to control the amount of writeback for a specific memory
> > cgroup. However, users may want to proactively write back zswap pages,
> > e.g., to free up memory for other applications or to prepare for
> > memory-intensive workloads.
> >
> > Introduce a "zswap_writeback_only" key to the memory.reclaim cgroup
> > interface. When specified, this key bypasses standard memory reclaim
> > and exclusively performs proactive zswap writeback up to the requested
> > budget. If omitted, the default reclaim behavior remains unchanged.
> >
> > Example usage:
> >   # Write back 100MB of pages from zswap to the backing swap
> >   echo "100M zswap_writeback_only" > memory.reclaim
> 
> Hmmm, so this 100MB is the pre-compression size? i.e if this 100 MB
> compresses to 25 MB, then you're only freeing 25 MB?
> 
> I'm ok-ish with this, but can you document it?

That's a good point. I think pre-compressed size doesn't make sense to
be honest. We should care about how much memory we are actually trying
to save by doing writeback here.

The pre-compressed size is only useful in determining the blast radius,
how many actual pages are going to have slower page faults now. But
then, I don't think there's a reasonable way for userspace to decide
that.

I understand passing in the compressed size is tricky because we need to
keep track of the size of the compressed pages we end up writing back,
but it should be doable.

If we really want pre-compressed size here, then yes we need to make it
very clear, and I vote that we use a separate interface in this case
because memory.reclaim having different meanings for the amount of
memory written to it is extremely counter-intuitive.
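
For illustration only, the compressed-size accounting discussed above can
be modeled in userspace as below. This is a sketch, not kernel code:
struct wb_result and writeback_by_compressed_size() are hypothetical
names, and each array element simply stands for the compressed length of
one zswap entry in LRU order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result of one writeback pass (userspace model only). */
struct wb_result {
	size_t entries_written;   /* pages that will fault from swap afterwards */
	size_t compressed_bytes;  /* pool memory actually released */
};

/*
 * Walk the entries in LRU order and stop once the compressed bytes
 * written back reach the budget, or the list is exhausted.
 */
static struct wb_result writeback_by_compressed_size(const size_t *entry_len,
						      size_t nr_entries,
						      size_t budget_bytes)
{
	struct wb_result res = { 0, 0 };

	assert(entry_len != NULL || nr_entries == 0);

	for (size_t i = 0;
	     i < nr_entries && res.compressed_bytes < budget_bytes; i++) {
		res.compressed_bytes += entry_len[i];
		res.entries_written++;
	}
	return res;
}
```

Under this model the budget bounds the memory released from the pool,
while entries_written reports the "blast radius". With Nhat's example
ratio (100 MB compressing to 25 MB), a 100M budget counted in compressed
bytes would touch roughly four times as many pages as the same budget
counted in pre-compressed bytes.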

> 
> The rest seems solid to me, FWIW. I'll defer to Johannes and Yosry for
> opinions on zswap-only proactive reclaim.


* Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: Yosry Ahmed @ 2026-05-30  1:37 UTC
  To: Hao Jia
  Cc: akpm, tj, hannes, shakeel.butt, mhocko, mkoutny, nphamcs,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia
In-Reply-To: <20260526114601.67041-3-jiahao.kernel@gmail.com>

On Tue, May 26, 2026 at 07:45:59PM +0800, Hao Jia wrote:
> From: Hao Jia <jiahao1@lixiang.com>
> 
> Zswap currently writes back pages to backing swap reactively, triggered
> either by the shrinker or when the pool reaches its size limit. There is
> no mechanism to control the amount of writeback for a specific memory
> cgroup. However, users may want to proactively write back zswap pages,
> e.g., to free up memory for other applications or to prepare for
> memory-intensive workloads.
> 
> Introduce a "zswap_writeback_only" key to the memory.reclaim cgroup
> interface. When specified, this key bypasses standard memory reclaim
> and exclusively performs proactive zswap writeback up to the requested
> budget. If omitted, the default reclaim behavior remains unchanged.
> 
> Example usage:
>   # Write back 100MB of pages from zswap to the backing swap
>   echo "100M zswap_writeback_only" > memory.reclaim
> 
> Note that the actual amount written back may be less than requested due
> to the zswap second-chance algorithm: referenced entries are rotated on
> the LRU on the first encounter and only written back on a second pass.
> If fewer bytes are written back than requested, -EAGAIN is returned,
> matching the existing memory.reclaim semantics.
> 
> Internally, extend user_proactive_reclaim() to parse the new
> "zswap_writeback_only" token and invoke the dedicated handler. Add
> zswap_proactive_writeback() to walk the target memcg subtree via the
> per-memcg writeback cursor, draining per-node zswap LRUs through
> list_lru_walk_one() with the shrink_memcg_cb() callback.
> 
> Suggested-by: Yosry Ahmed <yosry@kernel.org>
> Suggested-by: Nhat Pham <nphamcs@gmail.com>
> Signed-off-by: Hao Jia <jiahao1@lixiang.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  18 +++-
>  Documentation/admin-guide/mm/zswap.rst  |  11 +-
>  include/linux/zswap.h                   |   7 ++
>  mm/vmscan.c                             |  14 +++
>  mm/zswap.c                              | 138 ++++++++++++++++++++++++
>  5 files changed, 185 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 6efd0095ed99..6564abf0dec5 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1425,9 +1425,10 @@ PAGE_SIZE multiple when read back.
>  
>  The following nested keys are defined.
>  
> -	  ==========            ================================
> +	  ====================  ==================================================
>  	  swappiness            Swappiness value to reclaim with
> -	  ==========            ================================
> +	  zswap_writeback_only  Only perform proactive zswap writeback
> +	  ====================  ==================================================
>  
>  	Specifying a swappiness value instructs the kernel to perform
>  	the reclaim with that swappiness value. Note that this has the
> @@ -1437,6 +1438,19 @@ The following nested keys are defined.
>  	The valid range for swappiness is [0-200, max], setting
>  	swappiness=max exclusively reclaims anonymous memory.
>  
> +	The zswap_writeback_only key skips ordinary memory reclaim and
> +	writes back pages from zswap to the backing swap device until
> +	the requested amount has been written or no further candidates
> +	are found. This is useful to proactively offload cold pages from
> +	the zswap pool to the swap device. It is only available if
> +	zswap writeback is enabled. zswap_writeback_only cannot be combined
> +	with swappiness; specifying both returns -EINVAL.
> +
> +	Example::
> +
> +	  # Write back up to 100MB of pages from zswap to the backing swap
> +	  echo "100M zswap_writeback_only" > memory.reclaim


memcg folks need to chime in about the interface here. An alternative
would be a separate interface (e.g. memory.zswap.do_writeback or
memory.zswap.writeback.reclaim or sth).

> diff --git a/mm/zswap.c b/mm/zswap.c
> index 73e64a635690..7bcbf788f634 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1679,6 +1679,144 @@ int zswap_load(struct folio *folio)
>  	return 0;
>  }
>  
> +/*
> + * Maximum LRU scan limit:
> + * number of entries to scan per page of remaining budget.
> + */
> +#define ZSWAP_PROACTIVE_WB_SCAN_RATIO	16UL
> +/*
> + * Batch size for proactive writeback:
> + * - As the per-memcg writeback target in the outer memcg loop.
> + * - As the per-walk budget passed to list_lru_walk_one().
> + */
> +#define ZSWAP_PROACTIVE_WB_BATCH	128UL
> +
> +/*
> + * Walk the per-node LRUs of @memcg to write back up to @nr_to_write pages.
> + * Returns the number of pages written back, or -ENOENT if @memcg is a
> + * zombie or has writeback disabled.
> + */
> +static long zswap_proactive_shrink_memcg(struct mem_cgroup *memcg,
> +					 unsigned long nr_to_write)
> +{
> +	unsigned long nr_written = 0;
> +	int nid;
> +
> +	if (!mem_cgroup_zswap_writeback_enabled(memcg))
> +		return -ENOENT;
> +
> +	if (!mem_cgroup_online(memcg))
> +		return -ENOENT;
> +
> +	for_each_node_state(nid, N_NORMAL_MEMORY) {
> +		bool encountered_page_in_swapcache = false;
> +		unsigned long nr_to_scan, nr_scanned = 0;
> +
> +		/*
> +		 * Cap by LRU length: bounds rewalks when referenced
> +		 * entries keep rotating to the tail.
> +		 */
> +		nr_to_scan = list_lru_count_one(&zswap_list_lru, nid, memcg);
> +		if (!nr_to_scan)
> +			continue;
> +
> +		/*
> +		 * Cap by SCAN_RATIO * remaining budget: bounds scan cost
> +		 * to the remaining writeback budget.
> +		 */
> +		nr_to_scan = min(nr_to_scan,
> +				 (nr_to_write - nr_written) * ZSWAP_PROACTIVE_WB_SCAN_RATIO);
> +
> +		while (nr_scanned < nr_to_scan) {
> +			unsigned long nr_to_walk = min(ZSWAP_PROACTIVE_WB_BATCH,
> +						       nr_to_scan - nr_scanned);
> +
> +			if (signal_pending(current))
> +				return nr_written;
> +
> +			/*
> +			 * Account for the committed budget rather than the walker's
> +			 * actual delta. If the list is emptied concurrently, the
> +			 * walker visits nothing and nr_scanned would never advance.
> +			 */
> +			nr_scanned += nr_to_walk;
> +
> +			nr_written += list_lru_walk_one(&zswap_list_lru, nid, memcg,
> +							&shrink_memcg_cb,
> +							&encountered_page_in_swapcache,
> +							&nr_to_walk);
> +
> +			if (nr_written >= nr_to_write)
> +				return nr_written;
> +			if (encountered_page_in_swapcache)
> +				break;
> +
> +			cond_resched();
> +		}
> +	}
> +
> +	return nr_written;
> +}
> +
> +int zswap_proactive_writeback(struct mem_cgroup *memcg,
> +			      unsigned long nr_to_writeback)
> +{
> +	struct mem_cgroup *iter_memcg;
> +	unsigned long nr_written = 0;
> +	int failures = 0, attempts = 0;
> +
> +	if (!memcg)
> +		return -EINVAL;
> +	if (!nr_to_writeback)
> +		return 0;
> +
> +	/*
> +	 * Writeback will be aborted with -EAGAIN if we encounter
> +	 * the following MAX_RECLAIM_RETRIES times:
> +	 * - No writeback-candidate memcgs found in a subtree walk.
> +	 * - A writeback-candidate memcg wrote back zero pages.
> +	 */
> +	while (nr_written < nr_to_writeback) {
> +		unsigned long batch_size;
> +		long shrunk;
> +
> +		if (signal_pending(current))
> +			return -EINTR;
> +
> +		iter_memcg = zswap_mem_cgroup_iter(memcg);
> +
> +		if (!iter_memcg) {
> +			/*
> +			 * Continue without incrementing failures if we found
> +			 * candidate memcgs in the last subtree walk.
> +			 */
> +			if (!attempts && ++failures == MAX_RECLAIM_RETRIES)
> +				return -EAGAIN;
> +			attempts = 0;
> +			continue;
> +		}
> +
> +		batch_size = min(nr_to_writeback - nr_written,
> +				 ZSWAP_PROACTIVE_WB_BATCH);
> +		shrunk = zswap_proactive_shrink_memcg(iter_memcg, batch_size);
> +		mem_cgroup_put(iter_memcg);
> +
> +		/* Writeback-disabled or offline: skip without counting. */
> +		if (shrunk == -ENOENT)
> +			continue;
> +
> +		++attempts;
> +		if (shrunk > 0)
> +			nr_written += shrunk;
> +		else if (++failures == MAX_RECLAIM_RETRIES)
> +			return -EAGAIN;
> +
> +		cond_resched();
> +	}
> +
> +	return 0;
> +}
> +

There is a lot of copy+paste from shrink_worker() and shrink_memcg()
here. We really should be able to reuse shrink_memcg().

Is the main difference that we are scanning in batches here? I think we
can have shrink_memcg() do that too. If anything, it might make the
shrinker more efficient. Over-reclaim is ofc a concern, and especially
in the zswap_store() path as the overhead can be noticeable. Maybe we
can parameterize the batch size based on the code path.
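
For illustration only, parameterizing the batch size by code path could
look roughly like the userspace sketch below. The enum, the function
names and the values 1 and 32 are assumptions made for this example;
only 128 mirrors ZSWAP_PROACTIVE_WB_BATCH from the patch. None of this is
proposed code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callers of a shared shrink routine. */
enum wb_path {
	WB_PATH_STORE,      /* synchronous store path: keep batches tiny */
	WB_PATH_SHRINKER,   /* background worker */
	WB_PATH_PROACTIVE,  /* userspace-requested writeback */
};

/* Illustrative values only; the thread does not settle on numbers. */
static size_t wb_batch_size(enum wb_path path)
{
	switch (path) {
	case WB_PATH_STORE:
		return 1;
	case WB_PATH_SHRINKER:
		return 32;
	case WB_PATH_PROACTIVE:
		return 128;
	}
	return 1;
}

/*
 * Model of a batched scan: consume nr_to_scan entries in chunks of the
 * path's batch size and return how many list walks that takes.
 */
static size_t wb_count_walks(enum wb_path path, size_t nr_to_scan)
{
	size_t batch = wb_batch_size(path);
	size_t walks = 0;

	assert(batch > 0);
	while (nr_to_scan) {
		size_t chunk = nr_to_scan < batch ? nr_to_scan : batch;

		nr_to_scan -= chunk;
		walks++;
	}
	return walks;
}
```

A small batch on the store path limits over-reclaim at the cost of more
walks; a larger batch on the proactive path amortizes the walk overhead.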

Nhat, what do you think?


* Re: [PATCH v5 5/9] mm: list_lru: deduplicate lock_list_lru()
From: Wei Yang @ 2026-05-30  1:25 UTC
  To: Johannes Weiner
  Cc: Wei Yang, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Shakeel Butt, Michal Hocko, Dave Chinner, Roman Gushchin,
	Muchun Song, Qi Zheng, Yosry Ahmed, Zi Yan, Liam R . Howlett,
	Usama Arif, Kiryl Shutsemau, Vlastimil Babka, Kairui Song,
	Mikhail Zaslonko, Vasily Gorbik, Baolin Wang, Barry Song,
	Dev Jain, Lance Yang, Nico Pache, Ryan Roberts, cgroups, linux-mm,
	linux-kernel
In-Reply-To: <ahmXqjQ0Vz4pb4u1@cmpxchg.org>

On Fri, May 29, 2026 at 09:42:02AM -0400, Johannes Weiner wrote:
>On Fri, May 29, 2026 at 09:56:28AM +0000, Wei Yang wrote:
>> On Wed, May 27, 2026 at 04:45:12PM -0400, Johannes Weiner wrote:
>> >The MEMCG and !MEMCG paths have the same pattern. Share the code.
>> >
>> >Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
>> >Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
>> >Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
>> >Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>> >Reviewed-by: Liam R. Howlett (Oracle) <liam@infradead.org>
>> >---
>> > mm/list_lru.c | 21 +++++++++------------
>> > 1 file changed, 9 insertions(+), 12 deletions(-)
>> >
>> >diff --git a/mm/list_lru.c b/mm/list_lru.c
>> >index 7d0523e44010..fdb3fe2ea64f 100644
>> >--- a/mm/list_lru.c
>> >+++ b/mm/list_lru.c
>> >@@ -15,6 +15,14 @@
>> > #include "slab.h"
>> > #include "internal.h"
>> 
>> Hi, Johannes
>> 
>> One very tiny nit below.
>> 
>> > 
>> >+static inline void lock_list_lru(struct list_lru_one *l, bool irq)
>> 
>> Here we use @irq.
>> 
>> >+{
>> >+	if (irq)
>> >+		spin_lock_irq(&l->lock);
>> >+	else
>> >+		spin_lock(&l->lock);
>> >+}
>> >+
>> > static inline void unlock_list_lru(struct list_lru_one *l, bool irq_off)
>> 
>> Here we use @irq_off.
>> 
>> Do you think it would be nicer to unify the parameter name?
>
>Yes, I think it would be nicer.
>
>Note that I inherited this - we had irq on the lock and irq_off on the
>unlock before already. I didn't want to mix even more yak shaving prep
>patches into this series.

Reasonable.

>
>Mind sending a follow-up patch on top of mm-unstable?

Thanks, I'd be glad to.

Since this is trivial, I will wait until everything has settled down.

Does that sound good to you?

-- 
Wei Yang
Help you, Help me


* Re: [PATCH v3 1/4] mm/zswap: Make shrink_worker writeback cursor per-memcg
From: Yosry Ahmed @ 2026-05-30  1:24 UTC
  To: Hao Jia
  Cc: akpm, tj, hannes, shakeel.butt, mhocko, mkoutny, nphamcs,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia
In-Reply-To: <20260526114601.67041-2-jiahao.kernel@gmail.com>

On Tue, May 26, 2026 at 07:45:58PM +0800, Hao Jia wrote:
> From: Hao Jia <jiahao1@lixiang.com>
> 
> The zswap background writeback worker shrink_worker() uses a global
> cursor zswap_next_shrink, protected by zswap_shrink_lock, to round-robin
> across the online memcgs under root_mem_cgroup.
> 
> Proactive writeback also wants a similar per-memcg cursor that is
> scoped to the specified memcg, so that repeated invocations against
> the same memcg make forward progress across its descendant memcgs
> instead of restarting from the first child memcg each time.

Is this a problem in practice?

Is the concern the overhead of scanning memcgs repeatedly, or lack of
fairness? I wonder if we should just do writeback in batches from all
memcgs, similar to how reclaim does it, then evaluate at the end if we
need to start over?

> 
> Naturally, group the cursor and its protecting spinlock into a
> zswap_wb_iter struct, and make it a member of struct mem_cgroup to
> realize per-memcg cursor management. Accordingly, shrink_worker() now
> uses the lock and cursor in root_mem_cgroup->zswap_wb_iter.

If we really need to have per-memcg cursors (I am not a big fan), I
think we can minimize the overhead by making the cursor updates use
atomic cmpxchg instead of having a per-memcg lock.
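
For illustration only, a lock-free cursor advance with compare-and-swap
can be sketched in userspace with C11 atomics as below. This is an
assumption-laden model: struct wb_cursor and wb_cursor_claim() are
hypothetical, and the cursor is a plain index here. The real cursor is a
struct mem_cgroup pointer that holds a reference, so a kernel version
built on cmpxchg() would also have to handle reference counting, which
this sketch ignores.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-memcg cursor: an index into children. */
struct wb_cursor {
	_Atomic size_t next;
};

/*
 * Claim the current position and advance the cursor with wraparound,
 * without taking a lock. On failure the compare-exchange reloads `cur`
 * with the latest value, so the loop simply retries.
 */
static size_t wb_cursor_claim(struct wb_cursor *c, size_t nr_children)
{
	size_t cur = atomic_load(&c->next);

	assert(nr_children > 0);
	while (!atomic_compare_exchange_weak(&c->next, &cur,
					     (cur + 1) % nr_children))
		;
	return cur;
}
```

Each caller gets a distinct position even under contention, which is the
forward-progress property the per-memcg cursor is meant to provide.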

> 
> Because the cursor is now per-memcg, the offline cleanup must visit
> every ancestor that could be holding a reference to the dying memcg.
> Factor out __zswap_memcg_offline_cleanup() and walk from dead_memcg up
> to the root.

Another reason why I don't like per-memcg cursors. There is too much
complexity and I wonder if it's warranted. If we stick with per-memcg
cursors please do the refactoring in separate patches to make the
patches easier to review.

Thanks!


* [PATCH-next v4 6/6] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach()
From: Waiman Long @ 2026-05-29 21:21 UTC
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Peter Zijlstra
  Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Waiman Long
In-Reply-To: <20260529212108.120506-1-longman@redhat.com>

With cgroup v2, the cgroup_taskset structure passed into the cgroup
can_attach() and attach() methods can contain task migration data with
multiple destination or source cpusets when the cpuset controller is
enabled or disabled respectively.

Since cpuset is threaded in both v1 and v2, another possible way to
cause many-to-one migration is to move a whole process whose threads
are in different cpuset-enabled threaded cgroups into another
cpuset-enabled cgroup.

The current cpuset_can_attach() and cpuset_attach() functions still
expect task migration to be from one source cpuset to one destination
cpuset. This has been the case since cpuset was enabled for cgroup v2
in commit 4ec22e9c5a90 ("cpuset: Enable cpuset controller in default
hierarchy").

This is less of an issue when enabling the cpuset controller, as all
the newly created child cpusets have exactly the same set of CPUs and
memory nodes as the parent. The exception is when deadline tasks are
involved in the migration, as the deadline task accounting data can be
off.

It can be more problematic when the cpuset controller is disabled, as
the child cpusets' CPUs and memory nodes may differ from their parent's,
or when a multi-threaded process is moved out of different threaded
cgroups.

Fix that by tracking the set of source (old) and destination cpusets
in singly linked lists and iterating them all to properly update the
internal data. Also keep the current cs and oldcs variables up-to-date
with the css and task iterators.

To ensure proper DL task accounting, nr_migrate_dl_tasks is decremented
in the source cpusets and incremented in the destination cpusets, and
its value is added to nr_deadline_tasks when the migration succeeds.

Fixes: 4ec22e9c5a90 ("cpuset: Enable cpuset controller in default hierarchy")
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset-internal.h |   6 +
 kernel/cgroup/cpuset.c          | 204 ++++++++++++++++++++++++--------
 2 files changed, 158 insertions(+), 52 deletions(-)
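
For illustration only, the tracking scheme from the commit message (each
cpuset joins a singly linked list at most once, guarded by a flag, and
the list is torn down afterwards) can be modeled in userspace as below.
struct node, track_once() and track_clear() are hypothetical names; the
patch itself uses struct llist_node, llist_add() and the
attach_node_in_llist flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cpuset that can sit on a tracking list. */
struct node {
	struct node *next;
	bool in_list;   /* plays the role of attach_node_in_llist */
	int id;
};

/* Push n onto the list unless it is already tracked. */
static bool track_once(struct node **head, struct node *n)
{
	assert(head != NULL && n != NULL);
	if (n->in_list)
		return false;
	n->next = *head;
	*head = n;
	n->in_list = true;
	return true;
}

/* Unlink every node, reset its flag, and return how many were tracked. */
static size_t track_clear(struct node **head)
{
	struct node *n = *head;
	size_t nr = 0;

	while (n) {
		struct node *next = n->next;  /* save before unlinking */

		n->next = NULL;
		n->in_list = false;
		nr++;
		n = next;
	}
	*head = NULL;
	return nr;
}
```

The flag makes repeated src/dst pairs cheap to ignore, and clearing it
during teardown leaves every node ready for the next migration.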

diff --git a/kernel/cgroup/cpuset-internal.h b/kernel/cgroup/cpuset-internal.h
index f7aaf01f7cd5..4c2772a7fd5e 100644
--- a/kernel/cgroup/cpuset-internal.h
+++ b/kernel/cgroup/cpuset-internal.h
@@ -161,6 +161,12 @@ struct cpuset {
 	 */
 	bool remote_partition;
 
+	/*
+	 * cpuset_can_attach() and cpuset_attach() specific data
+	 */
+	bool			attach_node_in_llist;
+	struct llist_node	attach_node;
+
 	/*
 	 * number of SCHED_DEADLINE tasks attached to this cpuset, so that we
 	 * know when to rebuild associated root domain bandwidth information.
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a6506b94e60a..2add658eb288 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -37,6 +37,7 @@
 #include <linux/wait.h>
 #include <linux/workqueue.h>
 #include <linux/task_work.h>
+#include <linux/llist.h>
 
 DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
 DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
@@ -1127,6 +1128,8 @@ static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs,
  * matter which child cpuset is selected as cpuset_attach_old_cs.
  */
 static struct cpuset *cpuset_attach_old_cs;
+static LLIST_HEAD(src_cs_head);
+static LLIST_HEAD(dst_cs_head);
 static bool attach_cpus_updated;
 static bool attach_mems_updated;
 
@@ -3017,9 +3020,10 @@ static int update_prstate(struct cpuset *cs, int new_prs)
  * Also set the boolean flag passed in by @psetsched depending on if
  * security_task_setscheduler() call is needed and @oldcs is not NULL.
  */
-static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
-				   bool *psetsched)
+static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs, bool *psetsched)
 {
+	bool cpu_match, mem_match;
+
 	if (cpumask_empty(cs->effective_cpus) ||
 	   (!is_in_v2_mode() && nodes_empty(cs->mems_allowed)))
 		return -ENOSPC;
@@ -3030,15 +3034,34 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
 	/*
 	 * Update attach specific data
 	 */
-	attach_cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
-	attach_mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+	if (!cs->attach_node_in_llist) {
+		llist_add(&cs->attach_node, &dst_cs_head);
+		cs->attach_node_in_llist = true;
+	}
+	if (!oldcs->attach_node_in_llist) {
+		llist_add(&oldcs->attach_node, &src_cs_head);
+		oldcs->attach_node_in_llist = true;
+	}
+
+	cpu_match = cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
+	mem_match = nodes_equal(cs->effective_mems, oldcs->effective_mems);
+
+	/*
+	 * Set the updated flags whenever there is a mismatch in any of the
+	 * src/dst pairs.
+	 */
+	if (!attach_cpus_updated)
+		attach_cpus_updated = !cpu_match;
+
+	if (!attach_mems_updated)
+		attach_mems_updated = !mem_match;
 
 	/*
 	 * Skip rights over task setsched check in v2 when nothing changes,
 	 * migration permission derives from hierarchy ownership in
 	 * cgroup_procs_write_permission()).
 	 */
-	*psetsched = !cpuset_v2() || attach_cpus_updated || attach_mems_updated;
+	*psetsched = !cpuset_v2() || !cpu_match || !mem_match;
 
 	/*
 	 * A v1 cpuset with tasks will have no CPU left only when CPU hotplug
@@ -3053,33 +3076,103 @@ static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
 	return 0;
 }
 
-static int cpuset_reserve_dl_bw(struct cpuset *cs)
+/*
+ * If reset_dl_bw is set, reset the previous dl_bw_alloc() call. Otherwise,
+ * update nr_deadline_tasks according to nr_migrate_dl_tasks in both source
+ * and destination cpusets.
+ */
+static void clear_attach_data(bool reset_dl_bw)
+{
+	struct cpuset *cs, *next;
+
+	llist_for_each_entry_safe(cs, next, src_cs_head.first, attach_node) {
+		cs->attach_node.next = NULL;
+		cs->attach_node_in_llist = false;
+		if (cs->nr_migrate_dl_tasks && !reset_dl_bw)
+			cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
+		cs->nr_migrate_dl_tasks = 0;
+	}
+
+	llist_for_each_entry_safe(cs, next, dst_cs_head.first, attach_node) {
+		cs->attach_node.next = NULL;
+		cs->attach_node_in_llist = false;
+		if (reset_dl_bw && cs->dl_bw_cpu >= 0)
+			dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
+		if (cs->nr_migrate_dl_tasks && !reset_dl_bw)
+			cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
+		cs->nr_migrate_dl_tasks = 0;
+		cs->sum_migrate_dl_bw = 0;
+		cs->dl_bw_cpu = -1;
+	}
+
+	src_cs_head.first = NULL;
+	dst_cs_head.first = NULL;
+	attach_cpus_updated = false;
+	attach_mems_updated = false;
+}
+
+static int cpuset_reserve_dl_bw(void)
 {
+	struct cpuset *cs;
 	int cpu, ret;
 
-	if (!cs->sum_migrate_dl_bw)
-		return 0;
+	llist_for_each_entry(cs, dst_cs_head.first, attach_node) {
+		if (!cs->sum_migrate_dl_bw)
+			continue;
 
-	cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
-	if (unlikely(cpu >= nr_cpu_ids))
-		return -EINVAL;
+		cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
+		if (unlikely(cpu >= nr_cpu_ids))
+			return -EINVAL;
 
-	ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
-	if (ret)
-		return ret;
+		ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
+		if (ret)
+			return ret;
 
-	cs->dl_bw_cpu = cpu;
+		cs->dl_bw_cpu = cpu;
+	}
 	return 0;
 }
 
-static void reset_migrate_dl_data(struct cpuset *cs)
+static void set_attach_in_progress(void)
 {
-	cs->nr_migrate_dl_tasks = 0;
-	cs->sum_migrate_dl_bw = 0;
-	cs->dl_bw_cpu = -1;
+	struct cpuset *cs;
+
+	/*
+	 * Mark attach is in progress.  This makes validate_change() fail
+	 * changes which zero cpus/mems_allowed.
+	 */
+	llist_for_each_entry(cs, dst_cs_head.first, attach_node)
+		cs->attach_in_progress++;
+}
+
+static void reset_attach_in_progress(void)
+{
+	struct cpuset *cs;
+
+	llist_for_each_entry(cs, dst_cs_head.first, attach_node)
+		dec_attach_in_progress_locked(cs);
 }
 
-/* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
+/*
+ * Called by cgroups to determine if a cpuset is usable; cpuset_mutex held.
+ *
+ * With cgroup v2, enabling of cpuset controller in a cgroup subtree can
+ * cause @tset to contain task migration data from one parent cpuset to multiple
+ * child cpusets. Not much is needed to be done here other than tracking the
+ * number of DL tasks in each cpuset as the CPUs and memory nodes of the child
+ * cpusets are exactly the same as the parent.
+ *
+ * Conversely, disabling of cpuset controller can cause @tset to contain task
+ * migration data from multiple child cpusets to one parent cpuset. Here, the
+ * CPUs and memory nodes of the child cpusets may be different from the parent,
+ * but must be a subset of its parent.
+ *
+ * Another possible many-to-one migration is the moving of the whole
+ * multithreaded process with threads in different cpusets to another cpuset.
+ *
+ * For all other use cases, @tset task migration data should be from one source
+ * cpuset to one destination cpuset.
+ */
 static int cpuset_can_attach(struct cgroup_taskset *tset)
 {
 	struct cgroup_subsys_state *css;
@@ -3101,6 +3194,16 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 		goto out_unlock;
 
 	cgroup_taskset_for_each(task, css, tset) {
+		struct cpuset *newcs = css_cs(css);
+		struct cpuset *new_oldcs = task_cs(task);
+
+		if ((newcs != cs) || (new_oldcs != oldcs)) {
+			cs = newcs;
+			oldcs = new_oldcs;
+			ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
+			if (ret)
+				goto out_unlock;
+		}
 		ret = task_can_attach(task);
 		if (ret)
 			goto out_unlock;
@@ -3122,23 +3225,19 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 			 * contribute to sum_migrate_dl_bw.
 			 */
 			cs->nr_migrate_dl_tasks++;
+			oldcs->nr_migrate_dl_tasks--;
 			if (dl_task_needs_bw_move(task, cs->effective_cpus))
 				cs->sum_migrate_dl_bw += task->dl.dl_bw;
 		}
 	}
 
-	ret = cpuset_reserve_dl_bw(cs);
+	ret = cpuset_reserve_dl_bw();
 
 out_unlock:
-	if (ret) {
-		reset_migrate_dl_data(cs);
-	} else {
-		/*
-		 * Mark attach is in progress.  This makes validate_change() fail
-		 * changes which zero cpus/mems_allowed.
-		 */
-		cs->attach_in_progress++;
-	}
+	if (ret)
+		clear_attach_data(true);
+	else
+		set_attach_in_progress();
 
 	mutex_unlock(&cpuset_mutex);
 	return ret;
@@ -3153,14 +3252,8 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
 	cs = css_cs(css);
 
 	mutex_lock(&cpuset_mutex);
-	dec_attach_in_progress_locked(cs);
-
-	if (cs->dl_bw_cpu >= 0)
-		dl_bw_free(cs->dl_bw_cpu, cs->sum_migrate_dl_bw);
-
-	if (cs->nr_migrate_dl_tasks)
-		reset_migrate_dl_data(cs);
-
+	reset_attach_in_progress();
+	clear_attach_data(true);
 	mutex_unlock(&cpuset_mutex);
 }
 
@@ -3232,7 +3325,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	struct task_struct *task;
 	struct cgroup_subsys_state *css;
 	struct cpuset *cs;
-	struct cpuset *oldcs = cpuset_attach_old_cs;
 
 	cgroup_taskset_first(tset, &css);
 	cs = css_cs(css);
@@ -3245,32 +3337,40 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	 * In the default hierarchy, enabling cpuset in the child cgroups
 	 * will trigger a cpuset_attach() call with no change in effective cpus
 	 * and mems. In that case, we can optimize out by skipping the task
-	 * iteration and update.
+	 * iteration and update, but the destination cpuset list is iterated to
+	 * set old_mems_allowed.
 	 */
 	if (cpuset_v2()) {
 		cpuset_attach_nodemask_to = cs->effective_mems;
-		if (!attach_cpus_updated && !attach_mems_updated)
+		if (!attach_cpus_updated && !attach_mems_updated) {
+			llist_for_each_entry(cs, dst_cs_head.first, attach_node)
+				cs->old_mems_allowed = cs->effective_mems;
 			goto out;
+		}
 	} else {
 		guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
 	}
 
-	cgroup_taskset_for_each(task, css, tset)
+	cgroup_taskset_for_each(task, css, tset) {
+		struct cpuset *newcs = css_cs(css);
+
+		if (newcs != cs) {
+			cs->old_mems_allowed = cpuset_attach_nodemask_to;
+			cs = newcs;
+			if (cpuset_v2())
+				cpuset_attach_nodemask_to = cs->effective_mems;
+			else
+				guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+		}
 		cpuset_attach_task(cs, task);
+	}
 
-out:
 	if (queue_task_work)
 		schedule_flush_migrate_mm();
 	cs->old_mems_allowed = cpuset_attach_nodemask_to;
-
-	if (cs->nr_migrate_dl_tasks) {
-		cs->nr_deadline_tasks += cs->nr_migrate_dl_tasks;
-		oldcs->nr_deadline_tasks -= cs->nr_migrate_dl_tasks;
-		reset_migrate_dl_data(cs);
-	}
-
-	dec_attach_in_progress_locked(cs);
-
+out:
+	reset_attach_in_progress();
+	clear_attach_data(false);
 	mutex_unlock(&cpuset_mutex);
 }
 
-- 
2.54.0



* [PATCH-next v4 5/6] cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside cpuset_attach_task()
From: Waiman Long @ 2026-05-29 21:21 UTC
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Peter Zijlstra
  Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Waiman Long
In-Reply-To: <20260529212108.120506-1-longman@redhat.com>

cpuset_attach_task() was introduced in commit 42a11bf5c543
("cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly")
to enable the CLONE_INTO_CGROUP flag of clone(2) to behave more like
moving a task from one cpuset into another one. That commit didn't
move the mpol_rebind_mm() and cpuset_migrate_mm() calls for the group
leader into cpuset_attach_task().

When the CLONE_INTO_CGROUP flag is used without CLONE_THREAD, the new
task is its own group leader. So it is still not equivalent to moving
a task between cpusets in this case. Make CLONE_INTO_CGROUP behave
more like cpuset_attach() by moving the mpol_rebind_mm() and
cpuset_migrate_mm() calls inside cpuset_attach_task(). As a result,
the following static variables will have to be updated in cpuset_fork():
 - cpuset_attach_old_cs
 - attach_cpus_updated
 - attach_mems_updated
 - queue_task_work

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 89 ++++++++++++++++++++++++------------------
 1 file changed, 51 insertions(+), 38 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 0bb63a9cda0b..a6506b94e60a 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3171,9 +3171,12 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
  */
 static cpumask_var_t cpus_attach;
 static nodemask_t cpuset_attach_nodemask_to;
+static bool queue_task_work;
 
 static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
 {
+	struct mm_struct *mm;
+
 	lockdep_assert_cpuset_lock_held();
 
 	if (cs != &top_cpuset)
@@ -3187,24 +3190,56 @@ static void cpuset_attach_task(struct cpuset *cs, struct task_struct *task)
 	 */
 	WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
 
+	if (cpuset_v2() && !attach_mems_updated)
+		return;
+
 	cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
 	cpuset1_update_task_spread_flags(cs, task);
+
+	if (task != task->group_leader)
+		return;
+
+	/*
+	 * Change mm for threadgroup leader. This is expensive and may
+	 * sleep and should be moved outside migration path proper.
+	 */
+	mm = get_task_mm(task);
+	if (mm) {
+		struct cpuset *oldcs = cpuset_attach_old_cs;
+
+		mpol_rebind_mm(mm, &cs->effective_mems);
+
+		/*
+		 * old_mems_allowed is the same with mems_allowed
+		 * here, except if this task is being moved
+		 * automatically due to hotplug.  In that case
+		 * @mems_allowed has been updated and is empty, so
+		 * @old_mems_allowed is the right nodesets that we
+		 * migrate mm from.
+		 */
+		if (is_memory_migrate(cs)) {
+			cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
+					  &cpuset_attach_nodemask_to);
+			queue_task_work = true;
+		} else {
+			mmput(mm);
+		}
+	}
 }
 
 static void cpuset_attach(struct cgroup_taskset *tset)
 {
 	struct task_struct *task;
-	struct task_struct *leader;
 	struct cgroup_subsys_state *css;
 	struct cpuset *cs;
 	struct cpuset *oldcs = cpuset_attach_old_cs;
-	bool queue_task_work = false;
 
 	cgroup_taskset_first(tset, &css);
 	cs = css_cs(css);
 
 	lockdep_assert_cpus_held();	/* see cgroup_attach_lock() */
 	mutex_lock(&cpuset_mutex);
+	queue_task_work = false;
 
 	/*
 	 * In the default hierarchy, enabling cpuset in the child cgroups
@@ -3223,38 +3258,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	cgroup_taskset_for_each(task, css, tset)
 		cpuset_attach_task(cs, task);
 
-	/*
-	 * Change mm for all threadgroup leaders. This is expensive and may
-	 * sleep and should be moved outside migration path proper. Skip it
-	 * if there is no change in effective_mems and CS_MEMORY_MIGRATE is
-	 * not set.
-	 */
-	if (!is_memory_migrate(cs) && !attach_mems_updated)
-		goto out;
-
-	cgroup_taskset_for_each_leader(leader, css, tset) {
-		struct mm_struct *mm = get_task_mm(leader);
-
-		if (mm) {
-			mpol_rebind_mm(mm, &cs->effective_mems);
-
-			/*
-			 * old_mems_allowed is the same with mems_allowed
-			 * here, except if this task is being moved
-			 * automatically due to hotplug.  In that case
-			 * @mems_allowed has been updated and is empty, so
-			 * @old_mems_allowed is the right nodesets that we
-			 * migrate mm from.
-			 */
-			if (is_memory_migrate(cs)) {
-				cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
-						  &cpuset_attach_nodemask_to);
-				queue_task_work = true;
-			} else
-				mmput(mm);
-		}
-	}
-
 out:
 	if (queue_task_work)
 		schedule_flush_migrate_mm();
@@ -3688,15 +3691,14 @@ static void cpuset_cancel_fork(struct task_struct *task, struct css_set *cset)
  */
 static void cpuset_fork(struct task_struct *task)
 {
-	struct cpuset *cs;
-	bool same_cs;
+	struct cpuset *cs, *oldcs;
 
 	rcu_read_lock();
 	cs = task_cs(task);
-	same_cs = (cs == task_cs(current));
+	oldcs = task_cs(current);
 	rcu_read_unlock();
 
-	if (same_cs) {
+	if (cs == oldcs) {
 		if (cs == &top_cpuset)
 			return;
 
@@ -3708,7 +3710,18 @@ static void cpuset_fork(struct task_struct *task)
 	/* CLONE_INTO_CGROUP */
 	mutex_lock(&cpuset_mutex);
 	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+	/*
+	 * Assume CPUs and memory nodes are updated
+	 * A CLONE_INTO_CGROUP operation should have taken the cgroup mutex
+	 * and so there shouldn't be a competing cpuset_attach() operation.
+	 */
+	attach_cpus_updated = attach_mems_updated = true;
+	queue_task_work = false;
+	cpuset_attach_old_cs = oldcs;
 	cpuset_attach_task(cs, task);
+	attach_cpus_updated = attach_mems_updated = false;
+	if (queue_task_work)
+		schedule_flush_migrate_mm();
 
 	dec_attach_in_progress_locked(cs);
 	mutex_unlock(&cpuset_mutex);
-- 
2.54.0
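
The control flow that the diff above gives cpuset_attach_task() can be
modelled in userspace C. This is only a sketch under stated assumptions:
toy_task, toy_attach_state and toy_attach_task() are hypothetical names,
the cpumask and nodemask updates are elided, and the counters merely
stand in for the mpol_rebind_mm() and cpuset_migrate_mm() calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct; only what the sketch needs. */
struct toy_task {
	struct toy_task *group_leader;
	bool has_mm;		/* false models get_task_mm() returning NULL */
};

/* Hypothetical bundle of the static state used during an attach. */
struct toy_attach_state {
	bool v2;		/* models cpuset_v2() */
	bool mems_updated;	/* models attach_mems_updated */
	bool memory_migrate;	/* models is_memory_migrate(cs) */
	bool queue_task_work;	/* set when a deferred flush is needed */
	int rebinds;		/* counts modelled mpol_rebind_mm() calls */
	int migrations;		/* counts modelled cpuset_migrate_mm() calls */
};

static void toy_attach_task(struct toy_attach_state *st,
			    const struct toy_task *task)
{
	assert(st && task);

	/* The per-task cpumask update would happen here. */

	if (st->v2 && !st->mems_updated)
		return;

	/* The per-task nodemask update would happen here. */

	/* Only the thread group leader touches the mm. */
	if (task != task->group_leader)
		return;

	if (!task->has_mm)
		return;

	st->rebinds++;
	if (st->memory_migrate) {
		st->migrations++;
		st->queue_task_work = true;
	}
}
```

A caller would reset queue_task_work before iterating over the tasks and
schedule the flush once afterwards if the flag ended up set, which is
what both cpuset_attach() and cpuset_fork() do in the diff.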


* [PATCH-next v4 4/6] cgroup/cpuset: Made cpuset_attach_old_cs track task group leaders
From: Waiman Long @ 2026-05-29 21:21 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Peter Zijlstra
  Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Waiman Long,
	Ridong Chen
In-Reply-To: <20260529212108.120506-1-longman@redhat.com>

There are two possible ways that migration of tasks from multiple source
cpusets to a target cpuset can happen. Either a multithreaded application
with threads in different cpusets is wholly moved to a new cpuset,
or disabling the v2 cpuset controller moves all the tasks in child
cpusets to the parent cpuset.

In the former case, it is the mm setting of the group leader that really
matters. So cpuset_attach_old_cs should track the oldcs of the thread
group leader. In the latter case, effective_mems of child cpusets must
always be a subset of the parent's. So no real page migration will be
necessary no matter which child cpuset is selected as cpuset_attach_old_cs.

IOW, cpuset_attach_old_cs should be updated to match the latest task
group leader in cpuset_can_attach().

Suggested-by: Ridong Chen <ridong.chen@linux.dev>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 0f93f3d84494..0bb63a9cda0b 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1111,6 +1111,20 @@ static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs,
 /*
  * cpuset_can_attach() and cpuset_attach() specific internal data
  * Protected by cpuset_mutex
+ *
+ * The cpuset_attach_old_cs is used mainly by cpuset_migrate_mm() to get the
+ * old_mems_allowed value. There are two ways that many-to-one cpuset migration
+ * can happen:
+ * 1) A multithread application with threads in different cpusets is wholely
+ *    moved to a new cpuset.
+ * 2) Disabling v2 cpuset controller will move all the tasks in child cpusets
+ *    to the parent cpuset.
+ *
+ * In the former case, it is the mm setting of the group leader that really
+ * matters. So cpuset_attach_old_cs should track the oldcs of the thread
+ * leader. In the latter case, effective_mems of child cpusets must always
+ * be a subset of the parent. So no real page migration will be necessary no
+ * matter which child cpuset is selected as cpuset_attach_old_cs.
  */
 static struct cpuset *cpuset_attach_old_cs;
 static bool attach_cpus_updated;
@@ -3091,6 +3105,10 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 		if (ret)
 			goto out_unlock;
 
+		/* Update cpuset_attach_old_cs to the latest group leader */
+		if (task == task->group_leader)
+			cpuset_attach_old_cs = task_cs(task);
+
 		if (setsched_check) {
 			ret = security_task_setscheduler(task);
 			if (ret)
-- 
2.54.0
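
The selection rule added to cpuset_can_attach() above can be sketched in
userspace C. The names toy_cpuset, toy_task and toy_pick_old_cs() are
hypothetical and exist only for this illustration; the sketch assumes the
starting value is the cpuset of the first task in the set, with every
group leader seen later overriding it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct cpuset and task_struct. */
struct toy_cpuset {
	int id;
};

struct toy_task {
	struct toy_task *group_leader;
	struct toy_cpuset *cs;	/* cpuset the task is migrating from */
};

/*
 * Return the cpuset that would end up in cpuset_attach_old_cs:
 * start with the first task's cpuset, then let each thread group
 * leader encountered during the iteration replace it.
 */
static struct toy_cpuset *toy_pick_old_cs(struct toy_task **tasks, size_t n)
{
	struct toy_cpuset *old_cs;
	size_t i;

	assert(tasks != NULL && n > 0);

	old_cs = tasks[0]->cs;
	for (i = 0; i < n; i++) {
		if (tasks[i] == tasks[i]->group_leader)
			old_cs = tasks[i]->cs;
	}
	return old_cs;
}
```

With a non-leader thread listed first and its leader second, the
leader's cpuset wins; with no leader in the set, the first task's cpuset
is kept.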


* [PATCH-next v4 3/6] cgroup/cpuset: Expand the scope of cpuset_can_attach_check()
From: Waiman Long @ 2026-05-29 21:21 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Peter Zijlstra
  Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Waiman Long
In-Reply-To: <20260529212108.120506-1-longman@redhat.com>

Expand the scope of cpuset_can_attach_check() by moving the computation
of the setsched flag into it, using the new @oldcs and @psetsched
arguments. As cpuset_can_attach_check() is also called from
cpuset_can_fork(), that caller passes NULL for both new arguments.

While at it, expose the source and destination cpuset cpu/memory check
results in the new attach_cpus_updated and attach_mems_updated static
flags so that these flags can be used directly from cpuset_attach()
without the need to do the same computations again.

These two flags are set in cpuset_can_attach() and used in cpuset_attach()
for optimization. Since cpuset_mutex is released between the two calls,
an intervening cpuset action may change the CPU or node mask of the
relevant cpusets, so checks are added to set these flags whenever the
effective_cpus or effective_mems of a cpuset with an attach in progress
is changed.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 90 ++++++++++++++++++++++++++++--------------
 1 file changed, 60 insertions(+), 30 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a6f191b48529..0f93f3d84494 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1108,6 +1108,14 @@ enum partition_cmd {
 static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs,
 				    struct tmpmasks *tmp);
 
+/*
+ * cpuset_can_attach() and cpuset_attach() specific internal data
+ * Protected by cpuset_mutex
+ */
+static struct cpuset *cpuset_attach_old_cs;
+static bool attach_cpus_updated;
+static bool attach_mems_updated;
+
 /*
  * Update partition exclusive flag
  *
@@ -1192,6 +1200,8 @@ static void reset_partition_data(struct cpuset *cs)
 	}
 	if (!cpumask_and(cs->effective_cpus, parent->effective_cpus, cs->cpus_allowed))
 		cpumask_copy(cs->effective_cpus, parent->effective_cpus);
+	if (cs->attach_in_progress)
+		attach_cpus_updated = true;
 }
 
 /*
@@ -1242,6 +1252,8 @@ static void partition_xcpus_add(int new_prs, struct cpuset *parent,
 				     xcpus);
 
 	cpumask_andnot(parent->effective_cpus, parent->effective_cpus, xcpus);
+	if (parent->attach_in_progress)
+		attach_cpus_updated = true;
 }
 
 /*
@@ -1269,6 +1281,8 @@ static void partition_xcpus_del(int old_prs, struct cpuset *parent,
 
 	cpumask_or(parent->effective_cpus, parent->effective_cpus, xcpus);
 	cpumask_and(parent->effective_cpus, parent->effective_cpus, cpu_active_mask);
+	if (parent->attach_in_progress)
+		attach_cpus_updated = true;
 }
 
 /*
@@ -2217,6 +2231,8 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
 		if (new_prs <= 0)
 			reset_partition_data(cp);
 		spin_unlock_irq(&callback_lock);
+		if (cp->attach_in_progress)
+			attach_cpus_updated = true;
 
 		notify_partition_change(cp, old_prs);
 
@@ -2720,6 +2736,8 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
 		spin_lock_irq(&callback_lock);
 		cp->effective_mems = *new_mems;
 		spin_unlock_irq(&callback_lock);
+		if (cp->attach_in_progress)
+			attach_mems_updated = true;
 
 		WARN_ON(!is_in_v2_mode() &&
 			!nodes_equal(cp->mems_allowed, cp->effective_mems));
@@ -2976,19 +2994,48 @@ static int update_prstate(struct cpuset *cs, int new_prs)
 	return 0;
 }
 
-static struct cpuset *cpuset_attach_old_cs;
-
 /*
  * Check to see if a cpuset can accept a new task
  * For v1, cpus_allowed and mems_allowed can't be empty.
  * For v2, effective_cpus can't be empty.
  * Note that in v1, effective_cpus = cpus_allowed.
+ *
+ * Also set the boolean flag passed in by @psetsched depending on if
+ * security_task_setscheduler() call is needed and @oldcs is not NULL.
  */
-static int cpuset_can_attach_check(struct cpuset *cs)
+static int cpuset_can_attach_check(struct cpuset *cs, struct cpuset *oldcs,
+				   bool *psetsched)
 {
 	if (cpumask_empty(cs->effective_cpus) ||
 	   (!is_in_v2_mode() && nodes_empty(cs->mems_allowed)))
 		return -ENOSPC;
+
+	if (!oldcs)
+		return 0;
+
+	/*
+	 * Update attach specific data
+	 */
+	attach_cpus_updated = !cpumask_equal(cs->effective_cpus, oldcs->effective_cpus);
+	attach_mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
+
+	/*
+	 * Skip rights over task setsched check in v2 when nothing changes,
+	 * migration permission derives from hierarchy ownership in
+	 * cgroup_procs_write_permission()).
+	 */
+	*psetsched = !cpuset_v2() || attach_cpus_updated || attach_mems_updated;
+
+	/*
+	 * A v1 cpuset with tasks will have no CPU left only when CPU hotplug
+	 * brings the last online CPU offline as users are not allowed to empty
+	 * cpuset.cpus when there are active tasks inside. When that happens,
+	 * we should allow tasks to migrate out without security check to make
+	 * sure they will be able to run after migration.
+	 */
+	if (!is_in_v2_mode() && cpumask_empty(oldcs->effective_cpus))
+		*psetsched = false;
+
 	return 0;
 }
 
@@ -3035,29 +3082,10 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	mutex_lock(&cpuset_mutex);
 
 	/* Check to see if task is allowed in the cpuset */
-	ret = cpuset_can_attach_check(cs);
+	ret = cpuset_can_attach_check(cs, oldcs, &setsched_check);
 	if (ret)
 		goto out_unlock;
 
-	/*
-	 * Skip rights over task setsched check in v2 when nothing changes,
-	 * migration permission derives from hierarchy ownership in
-	 * cgroup_procs_write_permission()).
-	 */
-	setsched_check = !cpuset_v2() ||
-		!cpumask_equal(cs->effective_cpus, oldcs->effective_cpus) ||
-		!nodes_equal(cs->effective_mems, oldcs->effective_mems);
-
-	/*
-	 * A v1 cpuset with tasks will have no CPU left only when CPU hotplug
-	 * brings the last online CPU offline as users are not allowed to empty
-	 * cpuset.cpus when there are active tasks inside. When that happens,
-	 * we should allow tasks to migrate out without security check to make
-	 * sure they will be able to run after migration.
-	 */
-	if (!is_in_v2_mode() && cpumask_empty(oldcs->effective_cpus))
-		setsched_check = false;
-
 	cgroup_taskset_for_each(task, css, tset) {
 		ret = task_can_attach(task);
 		if (ret)
@@ -3152,7 +3180,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	struct cgroup_subsys_state *css;
 	struct cpuset *cs;
 	struct cpuset *oldcs = cpuset_attach_old_cs;
-	bool cpus_updated, mems_updated;
 	bool queue_task_work = false;
 
 	cgroup_taskset_first(tset, &css);
@@ -3160,9 +3187,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 
 	lockdep_assert_cpus_held();	/* see cgroup_attach_lock() */
 	mutex_lock(&cpuset_mutex);
-	cpus_updated = !cpumask_equal(cs->effective_cpus,
-				      oldcs->effective_cpus);
-	mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
 
 	/*
 	 * In the default hierarchy, enabling cpuset in the child cgroups
@@ -3172,7 +3196,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	 */
 	if (cpuset_v2()) {
 		cpuset_attach_nodemask_to = cs->effective_mems;
-		if (!cpus_updated && !mems_updated)
+		if (!attach_cpus_updated && !attach_mems_updated)
 			goto out;
 	} else {
 		guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
@@ -3187,7 +3211,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	 * if there is no change in effective_mems and CS_MEMORY_MIGRATE is
 	 * not set.
 	 */
-	if (!is_memory_migrate(cs) && !mems_updated)
+	if (!is_memory_migrate(cs) && !attach_mems_updated)
 		goto out;
 
 	cgroup_taskset_for_each_leader(leader, css, tset) {
@@ -3602,7 +3626,7 @@ static int cpuset_can_fork(struct task_struct *task, struct css_set *cset)
 	mutex_lock(&cpuset_mutex);
 
 	/* Check to see if task is allowed in the cpuset */
-	ret = cpuset_can_attach_check(cs);
+	ret = cpuset_can_attach_check(cs, NULL, NULL);
 	if (ret)
 		goto out_unlock;
 
@@ -3742,6 +3766,8 @@ hotplug_update_tasks(struct cpuset *cs,
 	cpumask_copy(cs->effective_cpus, new_cpus);
 	cs->effective_mems = *new_mems;
 	spin_unlock_irq(&callback_lock);
+	if (cs->attach_in_progress)
+		attach_cpus_updated = attach_mems_updated = true;
 
 	if (cpus_updated)
 		cpuset_update_tasks_cpumask(cs, new_cpus);
@@ -3927,6 +3953,8 @@ static void cpuset_handle_hotplug(void)
 		}
 		cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
 		spin_unlock_irq(&callback_lock);
+		if (top_cpuset.attach_in_progress)
+			attach_cpus_updated = true;
 		/* we don't mess with cpumasks of tasks in top_cpuset */
 	}
 
@@ -3937,6 +3965,8 @@ static void cpuset_handle_hotplug(void)
 			top_cpuset.mems_allowed = new_mems;
 		top_cpuset.effective_mems = new_mems;
 		spin_unlock_irq(&callback_lock);
+		if (top_cpuset.attach_in_progress)
+			attach_mems_updated = true;
 		cpuset_update_tasks_nodemask(&top_cpuset);
 	}
 
-- 
2.54.0
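
The two-step flag handling described in the commit message above can be
sketched in userspace C. This is a toy model, not the kernel code: masks
are reduced to one unsigned long, locking is omitted, and toy_cpuset,
toy_can_attach(), toy_update_cpus() and toy_attach_needs_work() are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long toy_mask_t;	/* one bit per CPU or node */

/* Hypothetical stand-in for struct cpuset. */
struct toy_cpuset {
	toy_mask_t effective_cpus;
	toy_mask_t effective_mems;
	int attach_in_progress;
};

/* Model attach_cpus_updated and attach_mems_updated. */
static bool toy_cpus_updated;
static bool toy_mems_updated;

/* First step: compare source and destination once and remember it. */
static void toy_can_attach(struct toy_cpuset *dst,
			   const struct toy_cpuset *src)
{
	assert(dst && src);
	toy_cpus_updated = dst->effective_cpus != src->effective_cpus;
	toy_mems_updated = dst->effective_mems != src->effective_mems;
	dst->attach_in_progress++;
}

/* A mask change that lands between the two steps. */
static void toy_update_cpus(struct toy_cpuset *cs, toy_mask_t new_cpus)
{
	cs->effective_cpus = new_cpus;
	if (cs->attach_in_progress)
		toy_cpus_updated = true;
}

/* Second step: report whether per-task work is needed. */
static bool toy_attach_needs_work(struct toy_cpuset *dst)
{
	bool work = toy_cpus_updated || toy_mems_updated;

	dst->attach_in_progress--;
	return work;
}
```

The point of the extra check in toy_update_cpus() is that a comparison
made in the first step may be stale by the time the second step runs, so
any change to a cpuset with an attach in progress conservatively forces
the slow path.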


* [PATCH-next v4 1/6] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach()
From: Waiman Long @ 2026-05-29 21:21 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Peter Zijlstra
  Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Waiman Long
In-Reply-To: <20260529212108.120506-1-longman@redhat.com>

Whenever memory node mask is changed, there are 4 places where the node
mask has to be updated or used.
 1) task's node mask via cpuset_change_task_nodemask()
 2) memory policy binding via mpol_rebind_mm()
 3) if memory migration is enabled, migrate from old_mems_allowed to
    the new node mask via cpuset_migrate_mm().
 4) setting old_mems_allowed

These memory actions are done in cpuset_update_tasks_nodemask() and
cpuset_attach(). However there are inconsistencies in what node masks
are being used in these 2 functions.

In cpuset_update_tasks_nodemask(),
 - cpuset_change_task_nodemask(): guarantee_online_mems()
 - mpol_rebind_mm(): mems_allowed
 - cpuset_migrate_mm(): guarantee_online_mems()
 - old_mems_allowed: guarantee_online_mems()

In cpuset_attach(),
 - cpuset_change_task_nodemask(): guarantee_online_mems()
 - mpol_rebind_mm(): effective_mems
 - cpuset_migrate_mm(): effective_mems
 - old_mems_allowed: effective_mems

These inconsistencies date back a long time and it is hard to say what
the correct values should be.

The guarantee_online_mems() function returns a node mask from current or
an ancestor cpuset that is a subset of node_states[N_MEMORY]. Nodes in
node_states[N_MEMORY] are all online, i.e. in node_states[N_ONLINE].
However, a node in node_states[N_ONLINE] may not have memory. So
node_states[N_MEMORY] should be a subset of node_states[N_ONLINE].

The guarantee_online_mems() function should only be useful for v1 where
mems_allowed is the same as effective_mems. With v2, the memory nodes
in effective_mems should always be a subset of node_states[N_MEMORY],
so guarantee_online_mems() should just return cs->effective_mems.

Let's use the following setup for both of them to make them consistent.
 - cpuset_change_task_nodemask(): guarantee_online_mems()
 - mpol_rebind_mm(): effective_mems
 - cpuset_migrate_mm(): guarantee_online_mems()
 - old_mems_allowed: guarantee_online_mems()

So for v2, it is effectively all effective_mems. For v1, mpol_rebind_mm()
uses effective_mems (the same as mems_allowed), which may differ from what
guarantee_online_mems() returns.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 34 +++++++++++++++++++++++-----------
 1 file changed, 23 insertions(+), 11 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 51327333980a..961427cd83a5 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2615,6 +2615,13 @@ static void *cpuset_being_rebound;
  * Iterate through each task of @cs updating its mems_allowed to the
  * effective cpuset's.  As this function is called with cpuset_mutex held,
  * cpuset membership stays stable.
+ *
+ * - cpuset_change_task_nodemask(): guarantee_online_mems()
+ * - mpol_rebind_mm(): effective_mems
+ * - cpuset_migrate_mm(): guarantee_online_mems()
+ * - old_mems_allowed: guarantee_online_mems()
+ *
+ * For v2, guarantee_online_mems() should just return effective_mems.
  */
 void cpuset_update_tasks_nodemask(struct cpuset *cs)
 {
@@ -2624,7 +2631,10 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
 
 	cpuset_being_rebound = cs;		/* causes mpol_dup() rebind */
 
-	guarantee_online_mems(cs, &newmems);
+	if (cpuset_v2())
+		newmems = cs->effective_mems;
+	else
+		guarantee_online_mems(cs, &newmems);
 
 	/*
 	 * The mpol_rebind_mm() call takes mmap_lock, which we couldn't
@@ -2649,7 +2659,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
 
 		migrate = is_memory_migrate(cs);
 
-		mpol_rebind_mm(mm, &cs->mems_allowed);
+		mpol_rebind_mm(mm, &cs->effective_mems);
 		if (migrate)
 			cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
 		else
@@ -2713,6 +2723,8 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
 
 		WARN_ON(!is_in_v2_mode() &&
 			!nodes_equal(cp->mems_allowed, cp->effective_mems));
+		WARN_ON(cpuset_v2() &&
+			!nodes_subset(cp->effective_mems, node_states[N_MEMORY]));
 
 		cpuset_update_tasks_nodemask(cp);
 
@@ -3147,17 +3159,18 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 
 	/*
 	 * In the default hierarchy, enabling cpuset in the child cgroups
-	 * will trigger a number of cpuset_attach() calls with no change
-	 * in effective cpus and mems. In that case, we can optimize out
-	 * by skipping the task iteration and update.
+	 * will trigger a cpuset_attach() call with no change in effective cpus
+	 * and mems. In that case, we can optimize out by skipping the task
+	 * iteration and update.
 	 */
-	if (cpuset_v2() && !cpus_updated && !mems_updated) {
+	if (cpuset_v2()) {
 		cpuset_attach_nodemask_to = cs->effective_mems;
-		goto out;
+		if (!cpus_updated && !mems_updated)
+			goto out;
+	} else {
+		guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
 	}
 
-	guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
-
 	cgroup_taskset_for_each(task, css, tset)
 		cpuset_attach_task(cs, task);
 
@@ -3167,7 +3180,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	 * if there is no change in effective_mems and CS_MEMORY_MIGRATE is
 	 * not set.
 	 */
-	cpuset_attach_nodemask_to = cs->effective_mems;
 	if (!is_memory_migrate(cs) && !mems_updated)
 		goto out;
 
@@ -3175,7 +3187,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 		struct mm_struct *mm = get_task_mm(leader);
 
 		if (mm) {
-			mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
+			mpol_rebind_mm(mm, &cs->effective_mems);
 
 			/*
 			 * old_mems_allowed is the same with mems_allowed
-- 
2.54.0
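
The node-mask choice that the commit message above settles on can be
sketched in userspace C. This is a toy model under stated assumptions:
nodemask_t is reduced to one unsigned long, toy_nodes_with_memory stands
in for node_states[N_MEMORY], and toy_cpuset,
toy_guarantee_online_mems() and toy_newmems() are hypothetical names
introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned long nodemask_t;	/* one bit per NUMA node (toy model) */

/* Hypothetical stand-in for struct cpuset. */
struct toy_cpuset {
	struct toy_cpuset *parent;	/* NULL for the root */
	nodemask_t effective_mems;
};

/* Nodes that currently have memory; models node_states[N_MEMORY]. */
static nodemask_t toy_nodes_with_memory;

/*
 * Walk up the hierarchy until a cpuset's effective_mems intersects
 * the nodes that have memory, then return that intersection.
 */
static nodemask_t toy_guarantee_online_mems(const struct toy_cpuset *cs)
{
	assert(cs != NULL);
	while (cs->parent && !(cs->effective_mems & toy_nodes_with_memory))
		cs = cs->parent;
	return cs->effective_mems & toy_nodes_with_memory;
}

/*
 * Mask used for the task nodemask, page migration and
 * old_mems_allowed: effective_mems on v2, the guaranteed-online
 * mask on v1.
 */
static nodemask_t toy_newmems(const struct toy_cpuset *cs, bool v2)
{
	return v2 ? cs->effective_mems : toy_guarantee_online_mems(cs);
}
```

In this model the v1 path can fall back to an ancestor's nodes when the
cpuset's own nodes have no memory, while the v2 path trusts
effective_mems as is.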


* [PATCH-next v4 0/6] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach()
From: Waiman Long @ 2026-05-29 21:21 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Peter Zijlstra
  Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Waiman Long

 v4:
  - Add a new patch 1 to fix inconsistency in node mask usage in
    cpuset_update_tasks_nodemask() and cpuset_attach() and adjust
    the subsequent patches accordingly.
  - Update patch 3 to set the update flags whenever the CPU or node
    mask is updated to address issue reported by Sashiko.
  - Update patch 5 to remove unneeded setting of old_mems_allowed as
    well as calling schedule_flush_migrate_mm() if queue_task_work is
    set.

 v3:
  - Rebased to the latest linux-next tree.
  - Keep cpuset_attach_old_cs as suggested by Chen Ridong and replace
    patch 3 by a new one to make it track task group leader.

Sashiko AI review of another cpuset patch had found that cpuset_attach()
and cpuset_can_attach() can be passed a cgroup_taskset with tasks
migrating from one source cpuset to multiple destination cpusets and
vice versa.  Further testing of the cpuset code indicates that this is
indeed the case when the v2 cpuset controller is enabled or disabled.

Unfortunately, cpuset_attach() and cpuset_can_attach() still assume that
there will be one source and one destination cpuset, which may result in
incorrect behavior.

This patch series is created to fix this issue.

Patch 1 is to fix an inconsistency in the way node mask update is being
handled in cpuset_update_tasks_nodemask() and cpuset_attach() so that
they match each other.

Patches 2 and 3 are just preparatory patches to make the remaining
patches easier to review.

Patch 4 makes cpuset_attach_old_cs track the group leader for use by
cpuset_migrate_mm().

Patch 5 moves mpol_rebind_mm() and cpuset_migrate_mm() inside
cpuset_attach_task() to make the CLONE_INTO_CGROUP flag of clone3(2) work
more like moving a task from one cpuset to another, while also making
it easier to support multiple source and destination cpusets.

Patch 6 makes the necessary changes to enable the support of multiple
source and destination cpusets by keeping all the source and destination
cpusets found during task iterations in two singly linked lists for
source and destination cpusets respectively.

Waiman Long (6):
  cgroup/cpuset: Fix node inconsistencies between
    cpuset_update_tasks_nodemask() and cpuset_attach()
  cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper
  cgroup/cpuset: Expand the scope of cpuset_can_attach_check()
  cgroup/cpuset: Made cpuset_attach_old_cs track task group leaders
  cgroup/cpuset: Move mpol_rebind_mm/cpuset_migrate_mm() calls inside
    cpuset_attach_task()
  cgroup/cpuset: Support multiple source/destination cpusets for
    cpuset_*attach()

 kernel/cgroup/cpuset-internal.h |   6 +
 kernel/cgroup/cpuset.c          | 424 +++++++++++++++++++++++---------
 2 files changed, 308 insertions(+), 122 deletions(-)

-- 
2.54.0

* [PATCH-next v4 2/6] cgroup/cpuset: Add a cpuset_reserve_dl_bw() helper
From: Waiman Long @ 2026-05-29 21:21 UTC (permalink / raw)
  To: Chen Ridong, Tejun Heo, Johannes Weiner, Michal Koutný,
	Peter Zijlstra
  Cc: cgroups, linux-kernel, Aaron Tomlin, Guopeng Zhang, Waiman Long
In-Reply-To: <20260529212108.120506-1-longman@redhat.com>

Extract the DL bandwidth allocation code in cpuset_attach() to a new
cpuset_reserve_dl_bw() helper to simplify code.

No functional change is expected.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 53 ++++++++++++++++++++++++------------------
 1 file changed, 30 insertions(+), 23 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 961427cd83a5..a6f191b48529 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2992,6 +2992,25 @@ static int cpuset_can_attach_check(struct cpuset *cs)
 	return 0;
 }
 
+static int cpuset_reserve_dl_bw(struct cpuset *cs)
+{
+	int cpu, ret;
+
+	if (!cs->sum_migrate_dl_bw)
+		return 0;
+
+	cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
+	if (unlikely(cpu >= nr_cpu_ids))
+		return -EINVAL;
+
+	ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
+	if (ret)
+		return ret;
+
+	cs->dl_bw_cpu = cpu;
+	return 0;
+}
+
 static void reset_migrate_dl_data(struct cpuset *cs)
 {
 	cs->nr_migrate_dl_tasks = 0;
@@ -3006,7 +3025,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	struct cpuset *cs, *oldcs;
 	struct task_struct *task;
 	bool setsched_check;
-	int cpu, ret;
+	int ret;
 
 	/* used later by cpuset_attach() */
 	cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
@@ -3062,31 +3081,19 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 		}
 	}
 
-	if (!cs->sum_migrate_dl_bw)
-		goto out_success;
-
-	cpu = cpumask_any_and(cpu_active_mask, cs->effective_cpus);
-	if (unlikely(cpu >= nr_cpu_ids)) {
-		ret = -EINVAL;
-		goto out_unlock;
-	}
-
-	ret = dl_bw_alloc(cpu, cs->sum_migrate_dl_bw);
-	if (ret)
-		goto out_unlock;
-
-	cs->dl_bw_cpu = cpu;
-
-out_success:
-	/*
-	 * Mark attach is in progress.  This makes validate_change() fail
-	 * changes which zero cpus/mems_allowed.
-	 */
-	cs->attach_in_progress++;
+	ret = cpuset_reserve_dl_bw(cs);
 
 out_unlock:
-	if (ret)
+	if (ret) {
 		reset_migrate_dl_data(cs);
+	} else {
+		/*
+		 * Mark attach is in progress.  This makes validate_change() fail
+		 * changes which zero cpus/mems_allowed.
+		 */
+		cs->attach_in_progress++;
+	}
+
 	mutex_unlock(&cpuset_mutex);
 	return ret;
 }
-- 
2.54.0


* Re: [PATCH rdma-next v2 0/3] cgroup/rdma: add MR memory size resource tracking
From: yanjun.zhu @ 2026-05-29 21:14 UTC (permalink / raw)
  To: Tao Cui, tj, hannes, mkoutny, leon, jgg; +Cc: linux-rdma, cgroups, Tao Cui
In-Reply-To: <20260529090733.2242822-1-cui.tao@linux.dev>

On 5/29/26 2:07 AM, Tao Cui wrote:
> From: Tao Cui <cuitao@kylinos.cn>
> 
> Currently the RDMA cgroup only tracks two aggregate counters:
> hca_handle and hca_object.  The real scarce resource in multi-tenant
> deployments is pinned memory: how much physical memory gets registered
> through MRs.  The existing hca_object counter is too coarse to capture
> this.
> 
> This series adds a single new resource type:
> 
>    - mr_mem  - Cumulative MR memory size in bytes
> 
> The per-object-type counters (qp, mr) from RFC v1 have been removed
> per review feedback [1]: modern NICs pool objects from the same memory
> pool so the distinction between QP count and MR count is not
> meaningful for resource limiting.  hca_object remains sufficient for
> coarse object accounting.
> 
> After this series, an administrator can set limits like:
> 
>      echo "mlx5_0 mr_mem=1073741824" > rdma.max
> 

Hi,

Thanks for the patchset! Introducing `mr_mem` to track and limit pinned
memory size is a very practical enhancement for multi-tenant deployments.

I have a question regarding how this new resource type interacts with
Fast Registration (FRWR / FRMR), which is widely used in production
environments (e.g., NVMe-oF, iSER) to achieve high performance.

As we know, FRWR decouples the MR object allocation (`ib_alloc_mr`) from
the actual memory page mapping (`ib_map_mr_sg`). The creation of FRWR
Memory Regions is often managed via a pre-allocated page pool.

Could you clarify how `mr_mem` accounts for FRWR in the following scenarios?

1. Accounting Granularity: Does `mr_mem` charge the maximum capacity of
    the FRWR object at its allocation time (`ib_alloc_mr`), or does it
    dynamically track the actual mapped bytes during the fast-reg data
    path? If it's the former, it represents a "static maximum budget" per
    pool, which seems more practical for performance.

2. Kernel-space vs Userspace: FRWR pools are frequently allocated by
    kernel-space drivers (like NVMe-oF target/host). If these kernel
    threads are not bound to a specific user cgroup, will their FRWR
    allocations end up in the root cgroup, potentially bypassing the
    per-tenant limits?

Don't you think it would be beneficial to explicitly document or 
consider the FRWR pattern in the design section, given its prevalence in
real-world storage and networking workloads?

Thanks,
Zhu Yanjun

> Design
> ~~~~~~
> 
> mr_mem is not page-level ownership tracking; it is object-based
> accounting tied to the MR lifetime:
> 
>    - charged at MR registration time
>    - uncharged at MR destruction time
>    - the charge is pinned to the cgroup that created the MR for the
>      entire lifetime of the MR object
> 
> This model intentionally defines accounting semantics around MR
> object lifetime rather than page ownership:
> 
> 1. fork(): fork() does not duplicate MR objects.  Even though the
>     child inherits the uverbs fd and can access the parent's ucontext,
>     the MR remains a single kernel object.  The charge is tied to the
>     MR object, not to the number of processes that can reach it, so
>     no splitting or re-accounting is needed.
> 
> 2. Cgroup migration: mr_mem follows the same semantics as the existing
>     hca_object; charge at creation time against the invoking task's
>     cgroup, uncharge at destruction time.  The RDMA cgroup does not
>     implement can_attach/attach callbacks today, so charges do not
>     migrate with the task.  This is a known limitation that applies
>     equally to hca_handle and hca_object.  mr_mem does not introduce
>     any new complication here.
> 
> 3. Overlap with memory cgroup: mr_mem does not count process memory
>     usage; it represents a per-device DMA registration budget: the
>     amount of memory this cgroup may register through a given HCA.
>     This is a different dimension from what memory cgroup tracks.  An
>     administrator might set mr_mem limits differently per device, which
>     memory cgroup cannot express.
> 
>     In particular, mr_mem tracks the registered memory range associated
>     with the MR rather than exact dynamically pinned pages (e.g. for
>     ODP MRs).  This is a stable, policy-oriented approximation of
>     registration footprint, not an attempt at precise physical page
>     accounting.
> 
> Tao Cui (3):
>    cgroup/rdma: extend charge/uncharge API with s64 amount parameter
>    cgroup/rdma: add MR memory size resource tracking
>    cgroup/rdma: update cgroup resource list for MR_MEM
> 
>   Documentation/admin-guide/cgroup-v2.rst       |  21 ++--
>   drivers/infiniband/core/cgroup.c              |  10 +-
>   drivers/infiniband/core/core_priv.h           |  12 +-
>   drivers/infiniband/core/rdma_core.c           |  20 +++-
>   drivers/infiniband/core/uverbs_cmd.c          |  61 +++++++++-
>   drivers/infiniband/core/uverbs_std_types_mr.c |  37 ++++++
>   include/linux/cgroup_rdma.h                   |   8 +-
>   include/rdma/ib_verbs.h                       |   1 +
>   kernel/cgroup/rdma.c                          | 108 +++++++++++++-----
>   9 files changed, 219 insertions(+), 59 deletions(-)
> 
> ---
> Changes from RFC v1:
> 
>    - Removed RDMACG_RESOURCE_QP and RDMACG_RESOURCE_MR per-type
>      counters following review feedback from Jason Gunthorpe [1].
>    - Retained only RDMACG_RESOURCE_MR_MEM as the sole new resource.
>    - Added detailed semantic notes to the commit messages addressing
>      fork(), cgroup migration, and overlap with memory cgroup [2].
>    - Renamed patches to reflect the narrower scope.
> 
> [1] https://lore.kernel.org/all/20260525134314.GI7702@ziepe.ca/
> [2] https://lore.kernel.org/all/20260528075537.2170697-1-cuitao@kylinos.cn/



* Re: [PATCH 5/5] cgroup: Defer kill_css_finish() in cgroup_apply_control_disable()
From: Mark Brown @ 2026-05-29 21:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Michal Koutný, Sebastian Andrzej Siewior,
	Petr Malat, Bert Karwatzki, kernel test robot, Martin Pitt,
	cgroups, linux-kernel, Aishwarya.TCV
In-Reply-To: <ahnMCQuw2K6zA3Hs@slm.duckdns.org>

On Fri, May 29, 2026 at 07:25:29AM -1000, Tejun Heo wrote:
> On Wed, May 27, 2026 at 11:45:54AM +0100, Mark Brown wrote:
> > On Mon, May 04, 2026 at 02:51:21PM -1000, Tejun Heo wrote:

> > with no further output and given that this is a cgroup locking change
> > this does seem like a plausible commit, though I didn't look into it in
> > detail.  Bisect log and the list of LTP tests we're running in our test
> > job below.  We are running multiple tests in parallel.

> Unfortunately, I can't reproduce this in my environment. Any chance you can
> try testing on x86 too and see whether it reproduces there?

Not readily sadly, I'll see if I can figure something out.  Our rootfs
images are based on Debian Trixie if that's relevant?


* Re: [PATCH v3 4/4] selftests/cgroup: Add tests for zswap proactive writeback
From: Nhat Pham @ 2026-05-29 20:02 UTC (permalink / raw)
  To: Hao Jia
  Cc: akpm, tj, hannes, shakeel.butt, mhocko, yosry, mkoutny,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia
In-Reply-To: <20260526114601.67041-5-jiahao.kernel@gmail.com>

On Tue, May 26, 2026 at 4:46 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>
> From: Hao Jia <jiahao1@lixiang.com>
>
> Add test_zswap_proactive_writeback() to cover the new memory.reclaim
> "zswap_writeback_only" key. The test populates a memory cgroup zswap
> pool, triggers proactive writeback, and verifies the behavior by
> observing the change in zswpwb_proactive. Invalid input combinations
> are also covered.
>
> Extend test_zswap_writeback_one() to assert that the existing
> non-proactive writeback path leaves zswpwb_proactive at zero.
>
> Signed-off-by: Hao Jia <jiahao1@lixiang.com>

LGTM.

Reviewed-by: Nhat Pham <nphamcs@gmail.com>


* Re: [PATCH v3 3/4] mm/zswap: Add per-memcg stat for proactive writeback
From: Nhat Pham @ 2026-05-29 20:01 UTC (permalink / raw)
  To: Hao Jia
  Cc: akpm, tj, hannes, shakeel.butt, mhocko, yosry, mkoutny,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia
In-Reply-To: <20260526114601.67041-4-jiahao.kernel@gmail.com>

On Tue, May 26, 2026 at 4:46 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>
> From: Hao Jia <jiahao1@lixiang.com>
>
> Currently, zswap writeback can be triggered by either the pool limit
> being hit or by the proactive writeback mechanism. However, the
> existing 'zswpwb' metric in memory.stat and /proc/vmstat counts all
> written back pages, making it difficult to distinguish between pages
> written back due to the pool limit and those written back proactively.
>
> Add a new statistic 'zswpwb_proactive' to memory.stat and /proc/vmstat.
> This counter tracks the number of pages written back due to proactive
> writeback. This allows users to better monitor and tune the proactive
> writeback mechanism.
>
> Signed-off-by: Hao Jia <jiahao1@lixiang.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  4 +++
>  include/linux/vm_event_item.h           |  1 +
>  mm/memcontrol.c                         |  1 +
>  mm/vmstat.c                             |  1 +
>  mm/zswap.c                              | 41 ++++++++++++++++++-------
>  5 files changed, 37 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 6564abf0dec5..7d65aef83f7b 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1748,6 +1748,10 @@ The following nested keys are defined.
>           zswpwb
>                 Number of pages written from zswap to swap.
>
> +         zswpwb_proactive
> +               Number of pages written from zswap to swap by proactive
> +               writeback. This is a subset of zswpwb.
> +

nit: I think this is specifically the zswap_writeback_only mode, right?

Technically, normal proactive reclaim (memory.reclaim) can also hit zswap :)

Maybe some clarification here?
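
To illustrate the "subset of zswpwb" wording, here is a minimal userspace
sketch (not kernel code) that derives the non-proactive writeback count
from a memory.stat-style buffer. zswpwb_proactive is the key proposed by
this patch; stat_lookup() and zswpwb_non_proactive() are hypothetical
helper names made up for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Find "key value" in a memory.stat-style buffer; return -1 if absent. */
static long long stat_lookup(const char *buf, const char *key)
{
	size_t klen = strlen(key);
	const char *line = buf;

	assert(buf && key);

	while (line && *line) {
		/* Require a space after the key so "zswpwb" does not
		 * match the "zswpwb_proactive" line. */
		if (!strncmp(line, key, klen) && line[klen] == ' ') {
			long long v;

			if (sscanf(line + klen + 1, "%lld", &v) == 1)
				return v;
			return -1;
		}
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Pages written back by anything other than the proactive path. */
static long long zswpwb_non_proactive(const char *buf)
{
	long long total = stat_lookup(buf, "zswpwb");
	long long proactive = stat_lookup(buf, "zswpwb_proactive");

	if (total < 0 || proactive < 0 || proactive > total)
		return -1;
	return total - proactive;
}
```

Whatever the final documentation says, the subtraction is only meaningful
if it is clear which paths bump the proactive counter.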

>           zswap_incomp
>                 Number of incompressible pages currently stored in zswap
>                 without compression. These pages could not be compressed to
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 03fe95f5a020..7a5bee0a20b6 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -138,6 +138,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>                 ZSWPIN,
>                 ZSWPOUT,
>                 ZSWPWB,
> +               ZSWPWB_PROACTIVE,
>  #endif
>  #ifdef CONFIG_X86
>                 DIRECT_MAP_LEVEL2_SPLIT,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e205e5de193d..7648b3fd940e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -571,6 +571,7 @@ static const unsigned int memcg_vm_event_stat[] = {
>         ZSWPIN,
>         ZSWPOUT,
>         ZSWPWB,
> +       ZSWPWB_PROACTIVE,
>  #endif
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>         THP_FAULT_ALLOC,
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index f534972f517d..66fd06d1bb01 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1452,6 +1452,7 @@ const char * const vmstat_text[] = {
>         [I(ZSWPIN)]                             = "zswpin",
>         [I(ZSWPOUT)]                            = "zswpout",
>         [I(ZSWPWB)]                             = "zswpwb",
> +       [I(ZSWPWB_PROACTIVE)]                   = "zswpwb_proactive",
>  #endif
>  #ifdef CONFIG_X86
>         [I(DIRECT_MAP_LEVEL2_SPLIT)]            = "direct_map_level2_splits",
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 7bcbf788f634..b45d094f532a 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -160,6 +160,11 @@ struct zswap_pool {
>         char tfm_name[CRYPTO_MAX_ALG_NAME];
>  };
>
> +struct zswap_shrink_walk_arg {
> +       bool proactive;
> +       bool encountered_page_in_swapcache;
> +};
> +
>  /* Global LRU lists shared by all zswap pools. */
>  static struct list_lru zswap_list_lru;
>
> @@ -1042,7 +1047,8 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>   * freed.
>   */
>  static int zswap_writeback_entry(struct zswap_entry *entry,
> -                                swp_entry_t swpentry)
> +                                swp_entry_t swpentry,
> +                                bool proactive)
>  {
>         struct xarray *tree;
>         pgoff_t offset = swp_offset(swpentry);
> @@ -1097,6 +1103,12 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>         if (entry->objcg)
>                 count_objcg_events(entry->objcg, ZSWPWB, 1);
>
> +       if (proactive) {
> +               count_vm_event(ZSWPWB_PROACTIVE);
> +               if (entry->objcg)
> +                       count_objcg_events(entry->objcg, ZSWPWB_PROACTIVE, 1);
> +       }
> +

With the above clarification, the rest LGTM.

Reviewed-by: Nhat Pham <nphamcs@gmail.com>


* Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: Nhat Pham @ 2026-05-29 19:58 UTC (permalink / raw)
  To: Hao Jia
  Cc: akpm, tj, hannes, shakeel.butt, mhocko, yosry, mkoutny,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia
In-Reply-To: <20260526114601.67041-3-jiahao.kernel@gmail.com>

On Tue, May 26, 2026 at 4:46 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>
> From: Hao Jia <jiahao1@lixiang.com>
>
> Zswap currently writes back pages to backing swap reactively, triggered
> either by the shrinker or when the pool reaches its size limit. There is
> no mechanism to control the amount of writeback for a specific memory
> cgroup. However, users may want to proactively write back zswap pages,
> e.g., to free up memory for other applications or to prepare for
> memory-intensive workloads.
>
> Introduce a "zswap_writeback_only" key to the memory.reclaim cgroup
> interface. When specified, this key bypasses standard memory reclaim
> and exclusively performs proactive zswap writeback up to the requested
> budget. If omitted, the default reclaim behavior remains unchanged.
>
> Example usage:
>   # Write back 100MB of pages from zswap to the backing swap
>   echo "100M zswap_writeback_only" > memory.reclaim

Hmmm, so this 100MB is the pre-compression size? i.e., if this 100 MB
compresses to 25 MB, then you're only freeing 25 MB?

I'm ok-ish with this, but can you document it?

The rest seems solid to me, FWIW. I'll defer to Johannes and Yosry for
opinions on zswap-only proactive reclaim.
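
To make the pre-compression question concrete, here is a small arithmetic
sketch in userspace C. It assumes the budget is counted in uncompressed
bytes and that the pool shrinks by the compressed size of what is written
back; allocator metadata and incompressible pages are ignored. The function
names and the ratio_pct parameter (compressed size as a percentage of the
original) are hypothetical.

```c
#include <assert.h>

/* Memory freed when `budget` bytes of pre-compression data leave zswap. */
static unsigned long long zswap_freed_bytes(unsigned long long budget,
					    unsigned int ratio_pct)
{
	assert(ratio_pct <= 100);
	return budget * ratio_pct / 100;
}

/* Pre-compression budget needed to free `target` bytes (rounded up). */
static unsigned long long zswap_budget_for(unsigned long long target,
					   unsigned int ratio_pct)
{
	assert(ratio_pct > 0 && ratio_pct <= 100);
	return (target * 100 + ratio_pct - 1) / ratio_pct;
}
```

With a 25% ratio, a 100 MB budget frees 25 MB, and freeing 25 MB needs a
100 MB budget, which is the gap the documentation should spell out.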


* Re: [PATCH v3 1/4] mm/zswap: Make shrink_worker writeback cursor per-memcg
From: Nhat Pham @ 2026-05-29 19:51 UTC (permalink / raw)
  To: Hao Jia
  Cc: akpm, tj, hannes, shakeel.butt, mhocko, yosry, mkoutny,
	chengming.zhou, muchun.song, roman.gushchin, cgroups, linux-mm,
	linux-kernel, linux-doc, Hao Jia
In-Reply-To: <20260526114601.67041-2-jiahao.kernel@gmail.com>

On Tue, May 26, 2026 at 4:46 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>
> From: Hao Jia <jiahao1@lixiang.com>
>
> The zswap background writeback worker shrink_worker() uses a global
> cursor zswap_next_shrink, protected by zswap_shrink_lock, to round-robin
> across the online memcgs under root_mem_cgroup.
>
> Proactive writeback also wants a similar per-memcg cursor that is
> scoped to the specified memcg, so that repeated invocations against
> the same memcg make forward progress across its descendant memcgs
> instead of restarting from the first child memcg each time.
>
> Naturally, group the cursor and its protecting spinlock into a
> zswap_wb_iter struct, and make it a member of struct mem_cgroup to
> realize per-memcg cursor management. Accordingly, shrink_worker() now
> uses the lock and cursor in root_mem_cgroup->zswap_wb_iter.
>
> Because the cursor is now per-memcg, the offline cleanup must visit
> every ancestor that could be holding a reference to the dying memcg.
> Factor out __zswap_memcg_offline_cleanup() and walk from dead_memcg up
> to the root.
>
> No functional change intended for shrink_worker().

LGTM, if the memcg maintainers are happy with the overhead.
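
For readers following along, the forward-progress property can be modelled
in a few lines of userspace C. This is only a sketch under simplifying
assumptions: children are a flat array of indices rather than a memcg
hierarchy, there is no locking, and struct wb_cursor and wb_visit() are
hypothetical names, not the zswap_wb_iter code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical persistent cursor: index of the next child to visit. */
struct wb_cursor {
	size_t next;
};

/*
 * Visit `budget` children round-robin, recording the visited indices in
 * out[]. The cursor outlives the call, so the next call resumes where
 * this one stopped instead of restarting from child 0.
 */
static void wb_visit(struct wb_cursor *cur, size_t nr_children,
		     size_t budget, size_t *out)
{
	size_t i;

	assert(nr_children > 0);

	for (i = 0; i < budget; i++) {
		out[i] = cur->next;
		cur->next = (cur->next + 1) % nr_children;
	}
}
```

With three children and a budget of two per call, a persistent cursor
visits 0,1 and then 2,0, while a cursor reset on every call would visit
0,1 both times and never reach child 2.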

Reviewed-by: Nhat Pham <nphamcs@gmail.com>


* Re: [RFC PATCH v2 6/9] mm: provide anon locality evidence for zswap large swapin
From: Nhat Pham @ 2026-05-29 19:22 UTC (permalink / raw)
  To: fujunjie
  Cc: Andrew Morton, linux-mm, Alexandre Ghiti, Kairui Song, Usama Arif,
	Chris Li, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
	Hugh Dickins, Roman Gushchin, Shakeel Butt, linux-kernel, cgroups
In-Reply-To: <tencent_913470853E9B289ECF0379248E24DFB4590A@qq.com>

On Fri, May 29, 2026 at 5:19 AM fujunjie <fujunjie1@qq.com> wrote:
>
> The common zswap large-swapin policy needs locality evidence from
> callers before it can admit a large folio. For anonymous faults, provide
> that evidence from existing VMA hints and from the PTE young state left
> by earlier zswap-backed large swapins.
>
> Keep non-faulting PTEs old when mapping a speculative all-zswap large
> folio. A later fault can then require a dense young previous range before
> admitting another large swapin without adding VMA state.

Makes sense to me.
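
As a worked example of the "dense young previous range" rule, here is a
userspace sketch of the arithmetic using the two constants from the patch
below. DIV_ROUND_UP is redefined locally as a stand-in for the kernel
macro, and admit_by_neighbour() is a hypothetical helper that folds the
two checks together; it leaves out the requirement that every PTE in the
previous range is present and maps the same folio.

```c
#include <assert.h>
#include <stdbool.h>

/* Values taken from the quoted patch. */
#define SWAPIN_ANON_YOUNG_MIN_PERCENT		75
#define SWAPIN_ANON_MAX_FAULT_SKIP_SHIFT	2

/* Userspace stand-in for the kernel's DIV_ROUND_UP(). */
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Minimum number of young PTEs required in the previous range. */
static unsigned int young_threshold(unsigned int order)
{
	unsigned int nr_pages = 1U << order;

	assert(order > 0 && order < 16);
	return DIV_ROUND_UP(nr_pages * SWAPIN_ANON_YOUNG_MIN_PERCENT, 100);
}

/* Largest fault index (pages from the range base) still admitted. */
static unsigned long max_fault_skip(unsigned int order)
{
	return (1UL << order) >> SWAPIN_ANON_MAX_FAULT_SKIP_SHIFT;
}

/* Hypothetical helper: apply both checks to one fault. */
static bool admit_by_neighbour(unsigned int order, unsigned long fault_idx,
			       unsigned int young_in_prev)
{
	if (fault_idx > max_fault_skip(order))
		return false;
	return young_in_prev >= young_threshold(order);
}
```

For a 64K range on 4K pages (order 4) this means at least 12 of the 16
previous PTEs must be young, and the fault must land within the first
five pages (index 0 to 4) of the candidate range.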

>
> This also removes the old zswap-enabled guard from the THP swapin
> candidate scan. The common swapin path now classifies the backend range
> and falls back to order-0 for mixed zswap/disk ranges or races.
>
> Signed-off-by: fujunjie <fujunjie1@qq.com>
> ---
>  mm/memory.c     | 234 +++++++++++++++++++++++++++++++++++++++++++-----
>  mm/swap.h       |   6 ++
>  mm/swap_state.c |  15 ++++
>  3 files changed, 235 insertions(+), 20 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 92a82008d583..7bbb89632000 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4556,6 +4556,35 @@ static void memcg1_swapin_retry_folio(struct folio *folio,
>         folio_unlock(folio);
>  }
>
> +static void set_swapin_ptes(struct vm_area_struct *vma,
> +                           unsigned long address, pte_t *ptep, pte_t pte,
> +                           unsigned int nr_pages, unsigned int fault_pte_idx,
> +                           bool fault_only_young)
> +{
> +       struct mm_struct *mm = vma->vm_mm;
> +       pte_t old_pte;
> +
> +       if (!fault_only_young || nr_pages == 1) {
> +               set_ptes(mm, address, ptep, pte, nr_pages);
> +               return;
> +       }
> +
> +       old_pte = pte_mkold(pte);
> +       if (fault_pte_idx)
> +               set_ptes(mm, address, ptep, old_pte, fault_pte_idx);
> +
> +       set_pte_at(mm, address + fault_pte_idx * PAGE_SIZE,
> +                  ptep + fault_pte_idx,
> +                  pte_mkyoung(pte_advance_pfn(pte, fault_pte_idx)));

Hmm, does this mean that without THP swapin the faulting PTE is not
marked young, but with a THP swapin it is? That's a behavioral change,
right? Would this throw off other heuristics relying on this bit, or is
there any justification that this is fine?
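
To spell out what I mean, here is a userspace model of the accessed bits
that set_swapin_ptes() appears to leave behind, going only by the hunk
quoted here. The young[] array stands in for the PTE young bits, and
model_swapin_young() and count_young() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Model of the young bits after mapping nr_pages entries. pte_young is
 * the young bit carried by the incoming pte value.
 */
static void model_swapin_young(bool *young, size_t nr_pages,
			       size_t fault_idx, bool pte_young,
			       bool fault_only_young)
{
	size_t i;

	assert(fault_idx < nr_pages);

	if (!fault_only_young || nr_pages == 1) {
		/* set_ptes() path: every entry keeps the incoming bit. */
		for (i = 0; i < nr_pages; i++)
			young[i] = pte_young;
		return;
	}

	/* pte_mkold() for the neighbours, pte_mkyoung() for the fault. */
	for (i = 0; i < nr_pages; i++)
		young[i] = (i == fault_idx);
}

/* Count how many modelled entries ended up young. */
static size_t count_young(const bool *young, size_t nr_pages)
{
	size_t i, n = 0;

	for (i = 0; i < nr_pages; i++)
		n += young[i];
	return n;
}
```

In this model the order-0 path leaves the faulting entry with whatever
bit the pte already had, while the fault_only_young path forces it young,
which is the asymmetry I am asking about.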

> +
> +       fault_pte_idx++;
> +       if (fault_pte_idx < nr_pages)
> +               set_ptes(mm, address + fault_pte_idx * PAGE_SIZE,
> +                        ptep + fault_pte_idx,
> +                        pte_advance_pfn(old_pte, fault_pte_idx),
> +                        nr_pages - fault_pte_idx);
> +}
> +
>  static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
>  {
>         vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> @@ -4628,6 +4657,157 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
>  }
>
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#define SWAPIN_ANON_YOUNG_MIN_PERCENT          75
> +#define SWAPIN_ANON_MAX_FAULT_SKIP_SHIFT       2
> +
> +static bool swapin_anon_prev_young_dense(struct vm_fault *vmf,
> +                                        unsigned int order)
> +{
> +       struct vm_area_struct *vma;
> +       unsigned int nr_pages;
> +       unsigned int threshold;
> +       unsigned long size;
> +       unsigned long base, prev, addr;
> +       struct folio *first = NULL;
> +       unsigned int present = 0;
> +       unsigned int young = 0;
> +       pmd_t *pmd;
> +       pmd_t pmdval;
> +       spinlock_t *ptl; /* protects the previous PTE range */
> +       pte_t *ptep;
> +       unsigned int i;
> +
> +       if (!IS_ENABLED(CONFIG_MMU) || !arch_has_hw_pte_young() || !vmf ||
> +           !vmf->vma || !vmf->pmd || !order || order > MAX_PAGE_ORDER)
> +               return false;
> +
> +       nr_pages = 1U << order;
> +       threshold = DIV_ROUND_UP(nr_pages *
> +                                SWAPIN_ANON_YOUNG_MIN_PERCENT, 100);
> +       size = PAGE_SIZE << order;
> +
> +       vma = vmf->vma;
> +       base = ALIGN_DOWN(vmf->address, size);
> +       if (base < size)
> +               return false;
> +
> +       prev = base - size;
> +       if (prev < vma->vm_start || prev + size > vma->vm_end)
> +               return false;
> +
> +       pmd = vmf->pmd;
> +       if ((prev & PMD_MASK) != (base & PMD_MASK)) {
> +               pmd = mm_find_pmd(vma->vm_mm, prev);
> +               if (!pmd)
> +                       return false;
> +       }
> +
> +       pmdval = pmdp_get_lockless(pmd);
> +       if (!pmd_present(pmdval) || pmd_leaf(pmdval))
> +               return false;
> +
> +       ptep = pte_offset_map_lock(vma->vm_mm, pmd, prev, &ptl);
> +       if (!ptep)
> +               return false;
> +
> +       for (i = 0, addr = prev; i < nr_pages; i++, addr += PAGE_SIZE) {
> +               struct folio *folio;
> +               pte_t pte = ptep_get(ptep + i);
> +
> +               if (!pte_present(pte))
> +                       break;
> +
> +               folio = vm_normal_folio(vma, addr, pte);
> +               if (!folio || folio_order(folio) != order)
> +                       break;
> +               if (!first)
> +                       first = folio;
> +               else if (folio != first)
> +                       break;
> +
> +               present++;
> +               if (pte_young(pte))
> +                       young++;
> +       }
> +
> +       pte_unmap_unlock(ptep, ptl);
> +       if (present != nr_pages)
> +               return false;
> +
> +       return young >= threshold;
> +}
> +
> +static bool swapin_anon_accessed_neighbour(struct vm_fault *vmf,
> +                                          unsigned int order)
> +{
> +       unsigned long size;
> +       unsigned long base;
> +       unsigned long fault_idx;
> +       unsigned long max_skip;
> +
> +       if (!vmf || !vmf->vma || !order || order > MAX_PAGE_ORDER)
> +               return false;
> +
> +       size = PAGE_SIZE << order;
> +       base = ALIGN_DOWN(vmf->address, size);
> +
> +       /*
> +        * Without a sequential hint, require prior young-density evidence and
> +        * only allow faults near the start of the candidate range.
> +        */
> +       fault_idx = (vmf->address - base) >> PAGE_SHIFT;
> +       max_skip = (1UL << order) >> SWAPIN_ANON_MAX_FAULT_SKIP_SHIFT;
> +       if (fault_idx > max_skip)
> +               return false;
> +
> +       return swapin_anon_prev_young_dense(vmf, order);
> +}
> +
> +static bool swapin_anon_fault_starts_range(struct vm_fault *vmf,
> +                                          unsigned int order)
> +{
> +       struct vm_area_struct *vma;
> +       unsigned long size;
> +       unsigned long base;
> +       unsigned long first;
> +
> +       if (!vmf || !vmf->vma || !order || order > MAX_PAGE_ORDER)
> +               return false;
> +
> +       vma = vmf->vma;
> +       size = PAGE_SIZE << order;
> +       base = ALIGN_DOWN(vmf->address, size);
> +       first = ALIGN(vma->vm_start, size);
> +
> +       return base == first && vmf->address == base &&
> +              base + size <= vma->vm_end;
> +}
> +
> +static unsigned long swapin_anon_locality_orders(struct vm_fault *vmf,
> +                                                unsigned long orders)
> +{
> +       struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
> +       unsigned long locality_orders = 0;
> +       unsigned long candidates = orders & ~BIT(0);
> +       int order;
> +
> +       if (vma && (vma->vm_flags & VM_RAND_READ))
> +               return 0;
> +
> +       if (vma && (vma->vm_flags & VM_SEQ_READ))
> +               return candidates;
> +
> +       while (candidates) {
> +               order = fls_long(candidates) - 1;
> +               if (swapin_anon_fault_starts_range(vmf, order) ||
> +                   swapin_anon_accessed_neighbour(vmf, order))
> +                       locality_orders |= BIT(order);
> +               candidates &= ~BIT(order);
> +       }
> +
> +       return locality_orders;
> +}
> +
>  /*
>   * Check if the PTEs within a range are contiguous swap entries.
>   */
> @@ -4644,9 +4824,9 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
>         if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
>                 return false;
>         /*
> -        * swap_read_folio() can't handle the case a large folio is hybridly
> -        * from different backends. And they are likely corner cases. Similar
> -        * things might be added once zswap support large folios.
> +        * swap_read_folio() can't do mixed-backend large folio IO. The common
> +        * synchronous swapin path will recheck backend state and fall back to
> +        * order-0 if a zswap/disk race makes the range mixed.
>          */
>         if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
>                 return false;
> @@ -4693,14 +4873,6 @@ static unsigned long thp_swapin_suitable_orders(struct vm_fault *vmf)
>         if (unlikely(userfaultfd_armed(vma)))
>                 return 0;
>
> -       /*
> -        * A large swapped out folio could be partially or fully in zswap. We
> -        * lack handling for such cases, so fallback to swapping in order-0
> -        * folio.
> -        */
> -       if (!zswap_never_enabled())
> -               return 0;
> -
>         entry = softleaf_from_pte(vmf->orig_pte);
>         /*
>          * Get a list of all the (large) orders below PMD_ORDER that are enabled
> @@ -4708,10 +4880,13 @@ static unsigned long thp_swapin_suitable_orders(struct vm_fault *vmf)
>          */
>         orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
>                                           BIT(PMD_ORDER) - 1);
> +       if (!orders)
> +               return 0;
>         orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> +       if (!orders)
> +               return 0;
>         orders = thp_swap_suitable_orders(swp_offset(entry),
>                                           vmf->address, orders);
> -
>         if (!orders)
>                 return 0;
>
> @@ -4741,6 +4916,12 @@ static unsigned long thp_swapin_suitable_orders(struct vm_fault *vmf)
>  {
>         return 0;
>  }
> +
> +static unsigned long swapin_anon_locality_orders(struct vm_fault *vmf,
> +                                                unsigned long orders)
> +{
> +       return 0;
> +}
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
>  /* Sanity check that a folio is fully exclusive */
> @@ -4777,6 +4958,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         unsigned long page_idx;
>         unsigned long address;
>         pte_t *ptep;
> +       bool fault_only_young = false;
>
>         if (!pte_unmap_same(vmf))
>                 goto out;
> @@ -4845,13 +5027,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         if (folio)
>                 swap_update_readahead(folio, vma, vmf->address);
>         if (!folio) {
> -               /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */
> -               if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
> +               /*
> +                * Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices.
> +                * The swap device is pinned while checking the flag, matching
> +                * the existing fault path.
> +                */
> +               if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> +                       unsigned long swapin_orders = thp_swapin_suitable_orders(vmf);
> +                       unsigned long locality_orders =
> +                               swapin_anon_locality_orders(vmf, swapin_orders);
> +
>                         folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
> -                                           thp_swapin_suitable_orders(vmf) | BIT(0),
> -                                           0, vmf, NULL, 0);
> -               else
> +                                           swapin_orders | BIT(0),
> +                                           locality_orders, vmf, NULL, 0);
> +               } else {
>                         folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
> +               }
>
>                 if (IS_ERR_OR_NULL(folio)) {
>                         /*
> @@ -5110,9 +5301,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>
>         VM_BUG_ON(!folio_test_anon(folio) ||
>                         (pte_write(pte) && !PageAnonExclusive(page)));
> -       set_ptes(vma->vm_mm, address, ptep, pte, nr_pages);
> -       arch_do_swap_page_nr(vma->vm_mm, vma, address,
> -                       pte, pte, nr_pages);
> +       if (folio == swapcache && nr_pages == folio_nr_pages(folio) &&
> +           arch_has_hw_pte_young())
> +               fault_only_young = swapin_fault_only_young(folio);
> +       set_swapin_ptes(vma, address, ptep, pte, nr_pages, page_idx,
> +                       fault_only_young);
> +       arch_do_swap_page_nr(vma->vm_mm, vma, address, pte, pte, nr_pages);
>
>         /*
>          * Remove the swap entry and conditionally try to free up the swapcache.
> diff --git a/mm/swap.h b/mm/swap.h
> index dd35a310d06d..5d1c81ab49b9 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -327,6 +327,7 @@ struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
>  struct folio *swapin_sync(swp_entry_t entry, gfp_t flag, unsigned long orders,
>                           unsigned long locality_orders, struct vm_fault *vmf,
>                           struct mempolicy *mpol, pgoff_t ilx);
> +bool swapin_fault_only_young(struct folio *folio);
>  void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
>                            unsigned long addr);
>
> @@ -430,6 +431,11 @@ static inline void swap_update_readahead(struct folio *folio,
>  {
>  }
>
> +static inline bool swapin_fault_only_young(struct folio *folio)
> +{
> +       return false;
> +}
> +
>  static inline int swap_writeout(struct folio *folio,
>                 struct swap_iocb **swap_plug)
>  {
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 5a4ca289009a..80dff6a1ee65 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -747,6 +747,21 @@ static bool zswap_needs_order0_retry(struct folio *folio)
>                ZSWAP_RANGE_MIXED;
>  }
>
> +/*
> + * A speculative large swapin may install PTEs for pages that did not fault.
> + * Keep those non-faulting PTEs old so a later anon fault can report
> + * PTE-young density as caller-provided locality evidence without storing
> + * state in the VMA.
> + */
> +bool swapin_fault_only_young(struct folio *folio)
> +{
> +       if (!folio_test_large(folio) || !folio_test_swapcache(folio))
> +               return false;
> +
> +       return zswap_probe_range(folio->swap, folio_nr_pages(folio)) ==
> +              ZSWAP_RANGE_ALL_ZSWAP;
> +}
> +
>  /*
>   * If we are the only user, then try to free up the swap cache.
>   *
> --
> 2.34.1
>


* Re: [RFC PATCH v2 5/9] mm: add common locality admission for zswap large swapin
From: Nhat Pham @ 2026-05-29 19:00 UTC (permalink / raw)
  To: fujunjie
  Cc: Andrew Morton, linux-mm, Alexandre Ghiti, Kairui Song, Usama Arif,
	Chris Li, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
	Hugh Dickins, Roman Gushchin, Shakeel Butt, linux-kernel, cgroups
In-Reply-To: <tencent_69E7033C2446FE6E922D28B82E9F59142D09@qq.com>

On Fri, May 29, 2026 at 5:19 AM fujunjie <fujunjie1@qq.com> wrote:
>
> Fully zswap-backed ranges are safe to load as a large folio only when
> the caller has a reason to expect the neighbouring slots to be useful.
> Otherwise a sparse refault can turn one 4K demand fault into a 64K
> decompression and swapcache fill.
>
> Add a common admission gate for zswap-backed large swapin. The common
> layer keeps backend checks, the 64K cap, recent-refault rejection, and
> zswap reclaim-pressure rejection. It consumes a caller-provided locality
> order mask instead of looking at anon or shmem state directly.

Can you add more documentation about these policies, both in the patch
changelog and in the code? I'm pretty confused by the
zswap_pool_reclaim_pressure heuristics, for example.
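
As an example of the kind of note that would help, the 64K cap from the
changelog can be worked through in userspace C. The sketch mirrors the
SWAPIN_ZSWAP_MAX_ORDER definition in the hunk below, with the page size
passed in as a parameter; ilog2_pow2() is a local stand-in for the
kernel's ilog2(), and swapin_zswap_max_order() is a hypothetical name.

```c
#include <assert.h>

#define SZ_64K	0x10000UL	/* same value as the kernel's SZ_64K */

/* Userspace stand-in for ilog2() on a power of two. */
static unsigned int ilog2_pow2(unsigned long v)
{
	unsigned int r = 0;

	assert(v != 0 && (v & (v - 1)) == 0);
	while (v >>= 1)
		r++;
	return r;
}

/* Largest folio order that stays within the 64K cap. */
static unsigned int swapin_zswap_max_order(unsigned long page_size)
{
	if (page_size >= SZ_64K)
		return 0;
	return ilog2_pow2(SZ_64K / page_size);
}
```

This gives order 4 with 4K pages, order 2 with 16K pages, and order 0
(no large swapin from zswap) once the base page is already 64K.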

>
> Callers pass no locality evidence for now, so this patch only installs
> the common policy hook. Later patches add anon and shmem producers.
>
> Signed-off-by: fujunjie <fujunjie1@qq.com>
> ---
>  mm/memory.c     |   2 +-
>  mm/shmem.c      |   2 +-
>  mm/swap.h       |   8 ++--
>  mm/swap_state.c | 118 ++++++++++++++++++++++++++++++++++++++++++++----
>  4 files changed, 117 insertions(+), 13 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index d73a19692dea..92a82008d583 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4849,7 +4849,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
>                         folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
>                                             thp_swapin_suitable_orders(vmf) | BIT(0),
> -                                           vmf, NULL, 0);
> +                                           0, vmf, NULL, 0);
>                 else
>                         folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 56c23a7b15c7..fa99b48ed62b 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2031,7 +2031,7 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
>
>  again:
>         mpol = shmem_get_pgoff_policy(info, index, order, &ilx);
> -       folio = swapin_sync(entry, gfp, BIT(order), vmf, mpol, ilx);
> +       folio = swapin_sync(entry, gfp, BIT(order), 0, vmf, mpol, ilx);
>         mpol_cond_put(mpol);
>
>         if (!IS_ERR(folio))
> diff --git a/mm/swap.h b/mm/swap.h
> index ea7e1f3c4410..dd35a310d06d 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -323,9 +323,10 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
>                 struct mempolicy *mpol, pgoff_t ilx);
>  struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
> -               struct vm_fault *vmf);
> +                       struct vm_fault *vmf);
>  struct folio *swapin_sync(swp_entry_t entry, gfp_t flag, unsigned long orders,
> -                          struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx);
> +                         unsigned long locality_orders, struct vm_fault *vmf,
> +                         struct mempolicy *mpol, pgoff_t ilx);
>  void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
>                            unsigned long addr);
>
> @@ -418,7 +419,8 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
>
>  static inline struct folio *swapin_sync(
>         swp_entry_t entry, gfp_t flag, unsigned long orders,
> -       struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx)
> +       unsigned long locality_orders, struct vm_fault *vmf,
> +       struct mempolicy *mpol, pgoff_t ilx)
>  {
>         return NULL;
>  }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index f03ad4832f16..5a4ca289009a 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -21,6 +21,7 @@
>  #include <linux/migrate.h>
>  #include <linux/vmalloc.h>
>  #include <linux/huge_mm.h>
> +#include <linux/sizes.h>
>  #include <linux/zswap.h>
>  #include <linux/shmem_fs.h>
>  #include "internal.h"
> @@ -556,6 +557,24 @@ static struct folio *swap_cache_alloc_speculative_folio(swp_entry_t targ_entry,
>                                         mpol, ilx, true);
>  }
>
> +/*
> + * Initial conservative cap for speculative zswap large swapin. Locality
> + * evidence is supplied by the caller or by generic VMA hints; the common
> + * swapin layer keeps backend safety and pressure decisions here.
> + */
> +#define SWAPIN_ZSWAP_MAX_SIZE                  SZ_64K
> +#if PAGE_SIZE < SWAPIN_ZSWAP_MAX_SIZE
> +#define SWAPIN_ZSWAP_MAX_ORDER                 \
> +       ilog2(SWAPIN_ZSWAP_MAX_SIZE / PAGE_SIZE)
> +#else
> +#define SWAPIN_ZSWAP_MAX_ORDER                 0
> +#endif
> +
> +struct zswap_admit_ctx {
> +       bool pressure_checked;
> +       bool reclaim_pressure;
> +};
> +
>  static bool swapin_zeromap_same(swp_entry_t entry, unsigned int nr_pages)
>  {
>         unsigned int ci_start = swp_cluster_offset(entry);
> @@ -586,11 +605,84 @@ static bool swapin_zeromap_same(swp_entry_t entry, unsigned int nr_pages)
>         return true;
>  }
>
> +static bool swapin_zswap_locality(struct vm_fault *vmf, unsigned int order,
> +                                 unsigned long locality_orders)
> +{
> +       struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
> +
> +       if (!order || order > MAX_PAGE_ORDER)
> +               return false;
> +
> +       if (vma && (vma->vm_flags & VM_RAND_READ))
> +               return false;

What about VM_SEQ_READ?

> +
> +       return locality_orders & BIT(order);
> +}
> +
> +static bool swapin_zswap_refaulted(swp_entry_t entry, unsigned int nr_pages)

nit: this does not seem zswap-specific. Just call it
swapin_range_refaulted or something like that, maybe?

> +{
> +       unsigned int type = swp_type(entry);
> +       pgoff_t offset = swp_offset(entry);
> +       unsigned int i;
> +
> +       for (i = 0; i < nr_pages; i++) {
> +               bool workingset;
> +               void *shadow;
> +
> +               shadow = swap_cache_get_shadow(swp_entry(type, offset + i));

This seems inefficient. Can't we just lock the swap cluster once and
check all the shadows in the range, instead of repeatedly taking and
dropping the swap cluster lock?
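
To make the suggestion concrete, here is a minimal userspace sketch (not
kernel code). It assumes each shadow lookup takes and drops a per-cluster
lock and that the whole range sits inside one cluster. Every name here
(cluster_model, MODEL_CLUSTER_SLOTS, the two range_refaulted_* helpers) is
hypothetical; the lock is modelled by a counter so the difference in lock
round trips is visible.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_CLUSTER_SLOTS 512	/* hypothetical cluster size for this model */

/* Hypothetical stand-in for a swap cluster: one lock, one flag per slot. */
struct cluster_model {
	bool locked;
	unsigned long lock_acquisitions;
	/* true: slot holds a shadow that tests as a recent workingset refault */
	bool recent_refault[MODEL_CLUSTER_SLOTS];
};

static void cluster_model_lock(struct cluster_model *ci)
{
	assert(!ci->locked);
	ci->locked = true;
	ci->lock_acquisitions++;
}

static void cluster_model_unlock(struct cluster_model *ci)
{
	assert(ci->locked);
	ci->locked = false;
}

/* Shape being questioned: one lock round trip per slot. */
static bool range_refaulted_per_slot(struct cluster_model *ci,
				     unsigned int start, unsigned int nr)
{
	unsigned int i;

	assert(start + nr <= MODEL_CLUSTER_SLOTS);
	for (i = 0; i < nr; i++) {
		bool hit;

		cluster_model_lock(ci);
		hit = ci->recent_refault[start + i];
		cluster_model_unlock(ci);
		if (hit)
			return true;
	}
	return false;
}

/* Suggested shape: lock once, scan the whole range, unlock once. */
static bool range_refaulted_locked_once(struct cluster_model *ci,
					unsigned int start, unsigned int nr)
{
	bool hit = false;
	unsigned int i;

	assert(start + nr <= MODEL_CLUSTER_SLOTS);
	cluster_model_lock(ci);
	for (i = 0; i < nr; i++) {
		if (ci->recent_refault[start + i]) {
			hit = true;
			break;
		}
	}
	cluster_model_unlock(ci);
	return hit;
}
```

For a 16-slot range whose only hit is in the last slot, the per-slot
variant takes the lock 16 times and the locked-once variant takes it once.
Whether that matters in practice depends on what the real lookup does
under the hood, which this model does not claim to reproduce.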

> +               if (!shadow)
> +                       continue;
> +               if (workingset_test_recent(shadow, false, &workingset, false) &&
> +                   workingset)
> +                       return true;
> +       }
> +
> +       return false;
> +}
> +
> +static bool swapin_zswap_admit(swp_entry_t entry,
> +                              unsigned int order, unsigned int nr_pages,
> +                              struct vm_fault *vmf,
> +                              unsigned long locality_orders,
> +                              struct zswap_admit_ctx *ctx)
> +{
> +       if (order > SWAPIN_ZSWAP_MAX_ORDER)
> +               return false;
> +
> +       /*
> +        * Treat zswap-backed large swapin as speculative. The common layer
> +        * consumes caller-provided locality orders, but does not inspect
> +        * anon-specific PTE state or shmem-specific mapping state directly.
> +        */
> +       if (!swapin_zswap_locality(vmf, order, locality_orders))
> +               return false;
> +
> +       /*
> +        * A recent workingset refault shadow in the target range means reclaim
> +        * already saw churn there. Keep the refault path narrow instead of
> +        * speculatively decompressing neighbouring slots.
> +        */
> +       if (swapin_zswap_refaulted(entry, nr_pages))
> +               return false;

Hmm, this depends. If it's just a refault coming from a speculative
read (readahead or THP (z)swapin) that is then promptly discarded,
then yeah, we should back off here. But maybe the refaulted page is a
workingset one?

But yeah, I guess it is better to be cautious when you are uncertain :)


> +
> +       if (!ctx->pressure_checked) {
> +               ctx->reclaim_pressure = zswap_pool_reclaim_pressure();
> +               ctx->pressure_checked = true;
> +       }

Why do we back off if there is zswap_pool_reclaim_pressure (which only
checks if the pool is full ONCE in its lifetime)? What's the rationale
here?
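
For reference while discussing this, a minimal userspace sketch of the
flag semantics as the comment in patch 1/9 describes them: set once the
pool reaches the hard limit, held until the pool falls below the accept
threshold. This is a model under that assumption, not the zswap code; the
struct, field and function names are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a pool with a sticky "reached full" flag. */
struct pool_model {
	unsigned long max_pages;	/* hard limit */
	unsigned long accept_thr_pages;	/* flag clears at or below this */
	bool reached_full;
};

/*
 * Update the flag with hysteresis. The flag changes only when this
 * function runs; between calls it keeps whatever value it last had.
 */
static bool pool_model_check_limits(struct pool_model *p,
				    unsigned long cur_pages)
{
	assert(p->accept_thr_pages <= p->max_pages);

	if (cur_pages >= p->max_pages)
		p->reached_full = true;
	else if (p->reached_full && cur_pages <= p->accept_thr_pages)
		p->reached_full = false;

	return p->reached_full;
}

/* What a reader of the flag sees: the last recorded state, nothing more. */
static bool pool_model_pressure(const struct pool_model *p)
{
	return p->reached_full;
}
```

With max_pages = 100 and accept_thr_pages = 90, the flag turns on at 100
pages, stays on at 95, and turns off at 90. A reader that only calls
pool_model_pressure() never triggers an update itself, so the value it
sees is only as fresh as the last limit check.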

> +       if (ctx->reclaim_pressure)
> +               return false;
> +
> +       return true;
> +}
> +
>  static unsigned long swapin_admit_orders(swp_entry_t entry,
> -                                        unsigned long orders)
> +                                        unsigned long orders,
> +                                        struct vm_fault *vmf,
> +                                        unsigned long locality_orders)
>  {
>         unsigned long candidates = orders & ~BIT(0);
>         unsigned long admitted = orders & BIT(0);
> +       struct zswap_admit_ctx zswap_ctx = {};
>         int order;
>
>         if (!candidates)
> @@ -616,9 +708,14 @@ static unsigned long swapin_admit_orders(swp_entry_t entry,
>
>                 state = zswap_probe_range(range_entry, nr_pages);
>                 switch (state) {
> +               case ZSWAP_RANGE_ALL_ZSWAP:
> +                       admit = swapin_zswap_admit(range_entry, order,
> +                                                  nr_pages, vmf,
> +                                                  locality_orders,
> +                                                  &zswap_ctx);
> +                       break;
>                 case ZSWAP_RANGE_MIXED:
>                         break;
> -               case ZSWAP_RANGE_ALL_ZSWAP:
>                 case ZSWAP_RANGE_NEVER_ENABLED:
>                 case ZSWAP_RANGE_NO_ZSWAP:
>                         admit = true;
> @@ -769,8 +866,8 @@ static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp,
>         ret = swap_read_folio(folio, plug);
>         /*
>          * Swap readahead allocates order-0 folios. -EAGAIN is reserved for
> -        * retryable large zswap backend races and must be handled by the
> -        * synchronous common swapin path.
> +        * retryable large zswap backend races and should never escape to this
> +        * order-0 path.
>          */
>         VM_WARN_ON_ONCE(ret == -EAGAIN);
>         if (readahead) {
> @@ -786,6 +883,7 @@ static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp,
>   * @entry: swap entry indicating the target slot
>   * @gfp: memory allocation flags
>   * @orders: allocation orders
> + * @locality_orders: orders with caller-provided locality evidence
>   * @vmf: fault information
>   * @mpol: NUMA memory allocation policy to be applied
>   * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
> @@ -794,16 +892,20 @@ static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp,
>   * existing folio in the swap cache for @entry. This initiates the IO, too,
>   * if needed. @entry is rounded down if @orders allow large allocation.
>   *
> - * Context: Caller must ensure @entry is valid and pin the swap device with refcount.
> + * Context: Caller must ensure @entry is valid and pin the swap device with
> + * refcount.
>   * Return: Returns the folio on success, error code if failed.
>   */
> -struct folio *swapin_sync(swp_entry_t entry, gfp_t gfp, unsigned long orders,
> -                          struct vm_fault *vmf, struct mempolicy *mpol, pgoff_t ilx)
> +struct folio *swapin_sync(swp_entry_t entry, gfp_t gfp,
> +                         unsigned long orders,
> +                         unsigned long locality_orders,
> +                         struct vm_fault *vmf, struct mempolicy *mpol,
> +                         pgoff_t ilx)
>  {
>         struct folio *folio;
>         int ret;
>
> -       orders = swapin_admit_orders(entry, orders);
> +       orders = swapin_admit_orders(entry, orders, vmf, locality_orders);
>  again:
>         do {
>                 folio = swap_cache_get_folio(entry);
> --
> 2.34.1
>


* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask()
From: Joshua Hahn @ 2026-05-29 18:40 UTC (permalink / raw)
  To: Yury Norov
  Cc: Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, linux-mm, linux-kernel, Farhad Alemi,
	Waiman Long, Rasmus Villemoes, cgroups
In-Reply-To: <ahnRIDBk4bQ3xX2q@yury>

On Fri, 29 May 2026 13:47:12 -0400 Yury Norov <ynorov@nvidia.com> wrote:

> On Fri, May 29, 2026 at 08:26:15AM -0700, Joshua Hahn wrote:
> > On Thu, 28 May 2026 12:41:33 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> > 
> > > On Thu, 28 May 2026 15:03:37 -0400 Yury Norov <ynorov@nvidia.com> wrote:
> > > 
> > > > Reassigning nodes relative an empty user-provided nodemask is useless,
> > > > and triggers divide-by-zero in the function.
> > > > 
> > > > Reported-by: Farhad Alemi <farhad.alemi@berkeley.edu>
> > > > Link: https://lore.kernel.org/all/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/
> > > 
> > > Thanks both.
> > > 
> > > It looks like this is very old code, so we'll be wanting a cc:stable in
> > > this.
> > > 
> > > > --- a/mm/mempolicy.c
> > > > +++ b/mm/mempolicy.c
> > > > @@ -370,8 +370,13 @@ static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
> > > >  static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
> > > >  				   const nodemask_t *rel)
> > > >  {
> > > > +	unsigned int w = nodes_weight(*rel);
> > > >  	nodemask_t tmp;
> > > > -	nodes_fold(tmp, *orig, nodes_weight(*rel));
> > > > +
> > > > +	if (w == 0)
> > > > +		return -EINVAL;
> > > > +
> > > > +	nodes_fold(tmp, *orig, w);
> > > >  	nodes_onto(*ret, tmp, *rel);
> > > >  }
> > > 
> > > I suspect we should address this at the mpol level - it should never
> > > have got that far.  Hopefully the mempolicy maintainers can have a
> > > think.
> > 
> > Hello Andrew, hello Yury,
> > 
> > I agree with Andrew here.
> > mpol_relative_nodemask is called from two places, the first being
> > mpol_rebind_nodemask which is the calling function seen in the bug report as
> > well.
> > 
> > The other place is mpol_set_nodemask, which has a helpful comment that notes:
> > "mpol_set_nodemask is called after mpol_new() [...snip...] mpol_new() has
> > already validated the nodes parameter with respect to the policy mode and
> > flags".
> > 
> > So it seems like we are missing the big if-else if-else if block from mpol_new
> > in other places that should in fact have it, like mpol_rebind_nodemask.
> > 
> > The approach proposed here of just checking whether the node weight is 0
> > won't work for a few cases, namely for MPOL_DEFAULT and MPOL_PREFERRED where
> > empty nodemasks are actually allowed. So what should really be done here is to
> > do the full policy-nodemask checking section in mpol_new and call that from
> > mpol_set_nodemask as well.
> > 
> > Thank you for taking a shot at fixing the bug report, please let me know what
> > you think! Have a great day : -)
> 
> Hi Joshua.
> 
> Indeed, quick and dirty shot.
> 
> The problem is that nodes_fold() can't work with the sz == 0. In
> other words, folding to a 0-bit bitmap is an error. We don't check
> that on bitmaps level because it's an internal helper, and it's a
> caller's responsibility to validate the parameters.
> 
> nodes_onto(), or more specifically bitmap_onto(), is a different
> story. In case of empty relmap, the function actually clears all the
> bits in dst and returns.

I see, thank you for helping me understand. Yeah, we probably don't want
an empty nodemask here regardless of policy, as long as MPOL_F_RELATIVE_NODES
is set.
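
To spell out the fold/onto behaviour described above, here is a small
userspace sketch on 64-bit masks. It is a model, not the kernel bitmap
code: the mask_* names are hypothetical, and the "bit % sz" fold and
"n-th set bit of rel" onto rules are my reading of the two helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_BITS 64

/* Number of set bits in the mask. */
static unsigned int mask_weight(uint64_t mask)
{
	unsigned int w = 0;

	for (; mask; mask &= mask - 1)
		w++;
	return w;
}

/* Fold: every set bit i of orig lands on bit (i % sz). sz must be > 0. */
static uint64_t mask_fold(uint64_t orig, unsigned int sz)
{
	uint64_t dst = 0;
	unsigned int bit;

	assert(sz > 0 && sz <= MODEL_BITS);	/* sz == 0 would divide by zero */
	for (bit = 0; bit < MODEL_BITS; bit++)
		if (orig & (UINT64_C(1) << bit))
			dst |= UINT64_C(1) << (bit % sz);
	return dst;
}

/*
 * Onto: if bit n of orig is set, set the position of the n-th set bit
 * of rel. An empty rel simply yields an empty result.
 */
static uint64_t mask_onto(uint64_t orig, uint64_t rel)
{
	uint64_t dst = 0;
	unsigned int pos, n = 0;

	for (pos = 0; pos < MODEL_BITS; pos++) {
		if (!(rel & (UINT64_C(1) << pos)))
			continue;
		if (orig & (UINT64_C(1) << n))
			dst |= UINT64_C(1) << pos;
		n++;
	}
	return dst;
}

/* Relative remap that rejects an empty rel before folding. */
static bool mask_relative(uint64_t *ret, uint64_t orig, uint64_t rel)
{
	unsigned int w = mask_weight(rel);

	if (w == 0)
		return false;	/* caller decides: propagate or warn */
	*ret = mask_onto(mask_fold(orig, w), rel);
	return true;
}
```

With orig = 0x6 (bits 1 and 2) and rel = 0x34 (bits 2, 4 and 5), the
fold by 3 leaves 0x6 and the onto step gives 0x30. With an empty rel,
mask_onto() alone returns 0, while mask_relative() refuses to fold and
leaves *ret untouched.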

> I see 2 options to move this forward.
> 
> 1. Simply disallow empty relmap in mpol_relative_nodemask(). There's
> no valid cases for it, AFAIK, so the nodes_fold() limitation looks
> reasonable. We can consider it as a new policy.
> 
> We've got 2 users for mpol_relative_nodemask(). In mpol_set_nodemask()
> we can simply propagate the error; and in mpol_rebind_nodemask() we
> can throw a warning and do nothing.

I think we should never be able to reach mpol_set_nodemask with an empty
nodemask if MPOL_F_RELATIVE_NODES is set. Not sure if we need to be extra
defensive here.

For mpol_rebind_nodemask I think we should actually do some more checks,
and do them in mpol_rebind_policy, since that gives us an opportunity
to catch other sources of failure too, like calling mpol_rebind_preferred
with an empty nodemask (which shouldn't be allowed for
MPOL_F_{RELATIVE,STATIC}_NODES, as far as I can tell from the checks in
mpol_new).

Setting an empty nodemask for mpol_rebind_preferred won't trigger a div0
like it does for mpol_rebind_nodemask, but we can at least throw a warning
like you suggested.

Does that make sense? This is your fix and if you would prefer to address only
the div0 case, that makes sense too, since the empty nodemask for preferred
is more of a semantic incorrectness and will not cause panics.
Entirely up to you! : -)

> 2. Follow the spirit of the nodes_onto(), and in case of empty
> relmask, clean the ret mask and bail out
> 
> I'm in a favor for the 1st option, because empty relmask looks buggy
> anyways.
> 
> > The approach proposed here of just checking whether the node weight is 0
> > won't work for a few cases, namely for MPOL_DEFAULT and MPOL_PREFERRED where
> > empty nodemasks are actually allowed.
> 
> Not sure I understand this. The mpol_relative_nodemask() is called
> only if MPOL_F_RELATIVE_NODES is set. In mpol_rebind_nodemask(), if
> both MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES are set, the former
> wins. How would the RELATIVE mode mess with the others?

Yes, you're right: the case where MPOL_DEFAULT and MPOL_PREFERRED allow empty
nodemasks is precisely when !STATIC && !RELATIVE :p This is my bad for missing
that case completely.

> Anyways, I'm not really deep in mempolicy domain, so please educate me if
> I miss something.

Thank you, I have also learned a lot looking into this to think about what the
best solution is!
Joshua


* Re: [RFC PATCH v2 9/9] docs: mm: update THP swapin counter descriptions
From: Nhat Pham @ 2026-05-29 18:37 UTC (permalink / raw)
  To: fujunjie
  Cc: Andrew Morton, linux-mm, Alexandre Ghiti, Kairui Song, Usama Arif,
	Chris Li, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
	Hugh Dickins, Roman Gushchin, Shakeel Butt, linux-kernel, cgroups
In-Reply-To: <tencent_0F6F682C60B99E9E0F1553E6BF3D86468409@qq.com>

On Fri, May 29, 2026 at 5:19 AM fujunjie <fujunjie1@qq.com> wrote:
>
> The THP swapin counter descriptions still describe large swapin as
> coming only from non-zswap swap devices. Update them now that
> zswap-backed large folio swapin can also increment swpin.
>
> Also describe policy and backend rejection as swpin_fallback cases,
> since speculative zswap large swapin can intentionally fall back before
> doing large IO.
>
> Signed-off-by: fujunjie <fujunjie1@qq.com>
> ---
>  Documentation/admin-guide/mm/transhuge.rst | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 23f8d13c2629..59b7a0d09243 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -667,13 +667,14 @@ zswpout
>         piece without splitting.
>
>  swpin
> -       is incremented every time a huge page is swapped in from a non-zswap
> -       swap device in one piece.
> +       is incremented every time a huge page is swapped in from swap or
> +       zswap in one piece.
>
>  swpin_fallback
> -       is incremented if swapin fails to allocate or charge a huge page
> -       and instead falls back to using huge pages with lower orders or
> -       small pages.
> +       is incremented if swapin cannot use a huge page and instead falls
> +       back to using huge pages with lower orders or small pages. This can
> +       happen because allocation or charging fails, or because policy or
> +       backend state rejects a speculative large swapin.

I think we should add separate zswpin and zswpin fallback counters for
THP rather than overloading swpin. We already do that for zswpout vs
swpout.


* Re: [RFC PATCH v2 1/9] mm/zswap: expose range state for swapin policy
From: Nhat Pham @ 2026-05-29 18:35 UTC (permalink / raw)
  To: fujunjie
  Cc: Andrew Morton, linux-mm, Alexandre Ghiti, Kairui Song, Usama Arif,
	Chris Li, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
	Hugh Dickins, Roman Gushchin, Shakeel Butt, linux-kernel, cgroups
In-Reply-To: <tencent_C78A02F3C41E15233C371816825C7DCF8708@qq.com>

On Fri, May 29, 2026 at 5:19 AM fujunjie <fujunjie1@qq.com> wrote:
>
> Large folio swapin needs to know whether a candidate swap range is fully
> backed by zswap before it can choose an order. That decision should stay
> in common swapin code, not inside zswap.
>
> Export two zswap facts for that caller: a lockless range occupancy snapshot
> and the current zswap reclaim-pressure state. The range state is
> advisory only. Writeback or invalidation can change the backend after the
> snapshot, so users must recheck before issuing large-folio IO.
>
> Signed-off-by: fujunjie <fujunjie1@qq.com>
> ---
>  include/linux/zswap.h | 26 +++++++++++++++++++++++++
>  mm/zswap.c            | 44 +++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 70 insertions(+)
>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 30c193a1207e..8f9aee97517c 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -9,6 +9,18 @@ struct lruvec;
>
>  extern atomic_long_t zswap_stored_pages;
>
> +/*
> + * Advisory zswap occupancy snapshot for a swap range. This is not a complete
> + * backend classifier; callers must recheck before depending on ALL_ZSWAP for
> + * large-folio IO.
> + */
> +enum zswap_range_state {
> +       ZSWAP_RANGE_NEVER_ENABLED,
> +       ZSWAP_RANGE_NO_ZSWAP,
> +       ZSWAP_RANGE_ALL_ZSWAP,
> +       ZSWAP_RANGE_MIXED,
> +};
> +
>  #ifdef CONFIG_ZSWAP
>
>  struct zswap_lruvec_state {
> @@ -27,6 +39,9 @@ struct zswap_lruvec_state {
>  unsigned long zswap_total_pages(void);
>  bool zswap_store(struct folio *folio);
>  int zswap_load(struct folio *folio);
> +enum zswap_range_state zswap_probe_range(swp_entry_t swp,
> +                                        unsigned int nr_pages);
> +bool zswap_pool_reclaim_pressure(void);
>  void zswap_invalidate(swp_entry_t swp);
>  int zswap_swapon(int type, unsigned long nr_pages);
>  void zswap_swapoff(int type);
> @@ -49,6 +64,17 @@ static inline int zswap_load(struct folio *folio)
>         return -ENOENT;
>  }
>
> +static inline enum zswap_range_state zswap_probe_range(swp_entry_t swp,
> +                                                      unsigned int nr_pages)
> +{
> +       return ZSWAP_RANGE_NEVER_ENABLED;
> +}
> +
> +static inline bool zswap_pool_reclaim_pressure(void)
> +{
> +       return false;
> +}
> +
>  static inline void zswap_invalidate(swp_entry_t swp) {}
>  static inline int zswap_swapon(int type, unsigned long nr_pages)
>  {
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 761cd699e0a3..da5297f7bd69 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -506,6 +506,19 @@ unsigned long zswap_total_pages(void)
>         return total;
>  }
>
> +/*
> + * Expose whether zswap reclaim pressure is active. This is a backend fact:
> + * zswap_check_limits() sets the state once the pool reaches the hard limit and
> + * keeps it set until the pool falls below the accept threshold.
> + */
> +bool zswap_pool_reclaim_pressure(void)
> +{
> +       if (zswap_never_enabled())
> +               return false;
> +
> +       return READ_ONCE(zswap_pool_reached_full);
> +}
> +
>  static bool zswap_check_limits(void)
>  {
>         unsigned long cur_pages = zswap_total_pages();
> @@ -1559,6 +1572,37 @@ bool zswap_store(struct folio *folio)
>         return ret;
>  }
>
> +enum zswap_range_state zswap_probe_range(swp_entry_t swp,
> +                                        unsigned int nr_pages)
> +{
> +       unsigned int type = swp_type(swp);
> +       pgoff_t offset = swp_offset(swp);
> +       bool present = false, missing = false;
> +       unsigned int i;
> +
> +       /*
> +        * This is an advisory, lockless snapshot for common swapin admission.
> +        * Callers must recheck before depending on an all-zswap range for IO:
> +        * concurrent writeback or invalidation can change the backend state.
> +        */
> +       if (zswap_never_enabled())
> +               return ZSWAP_RANGE_NEVER_ENABLED;
> +
> +       for (i = 0; i < nr_pages; i++) {
> +               struct xarray *tree = swap_zswap_tree(swp_entry(type, offset + i));
> +
> +               if (xa_load(tree, offset + i))
> +                       present = true;
> +               else
> +                       missing = true;
> +
> +               if (present && missing)
> +                       return ZSWAP_RANGE_MIXED;
> +       }

Can we use xas_load() to make this check more efficient? IIUC,
xa_load() walks the tree every time.

(We used to use a bitmap here back in frontswap days. Good times....)
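
As an illustration of the bitmap aside only, here is a minimal userspace
sketch of the same none/all/mixed classification when occupancy is a
plain bitmap, so each slot costs one bit test instead of a tree lookup.
It does not use the xarray API, and all names (model_set_bit,
model_test_bit, classify_range, the MODEL_RANGE_* values) are
hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

enum range_state_model {
	MODEL_RANGE_NONE,	/* no slot in the range is present */
	MODEL_RANGE_ALL,	/* every slot in the range is present */
	MODEL_RANGE_MIXED,	/* some present, some missing */
};

/* Mark one slot as present in the occupancy bitmap. */
static void model_set_bit(unsigned long *map, unsigned long bit)
{
	map[bit / MODEL_BITS_PER_LONG] |= 1UL << (bit % MODEL_BITS_PER_LONG);
}

/* Test whether one slot is present. */
static bool model_test_bit(const unsigned long *map, unsigned long bit)
{
	return (map[bit / MODEL_BITS_PER_LONG] >>
		(bit % MODEL_BITS_PER_LONG)) & 1UL;
}

/* Classify [start, start + nr), returning early once the range is mixed. */
static enum range_state_model classify_range(const unsigned long *map,
					     unsigned long start,
					     unsigned int nr)
{
	bool present = false, missing = false;
	unsigned int i;

	assert(nr > 0);
	for (i = 0; i < nr; i++) {
		if (model_test_bit(map, start + i))
			present = true;
		else
			missing = true;

		if (present && missing)
			return MODEL_RANGE_MIXED;
	}
	return present ? MODEL_RANGE_ALL : MODEL_RANGE_NONE;
}
```

With slots 4..7 set, the range starting at 4 of length 4 classifies as
all-present, the range starting at 0 as none, and the range starting at 2
as mixed.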


* Re: [RFC PATCH v2 4/9] mm: admit large swapin by backend range in swapin_sync()
From: Nhat Pham @ 2026-05-29 18:34 UTC (permalink / raw)
  To: fujunjie
  Cc: Andrew Morton, linux-mm, Alexandre Ghiti, Kairui Song, Usama Arif,
	Chris Li, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
	Hugh Dickins, Roman Gushchin, Shakeel Butt, linux-kernel, cgroups
In-Reply-To: <tencent_EB78848E34DC7858C873193D67286ECD4B0A@qq.com>

On Fri, May 29, 2026 at 5:19 AM fujunjie <fujunjie1@qq.com> wrote:
>
> A large swapin can only read one folio when the whole range has compatible
> backing. Mixed zswap/disk ranges must not reach large-folio IO, and zswap
> range probes are only snapshots.
>
> Filter the orders passed to swap_cache_alloc_folio() in swapin_sync().
> Uniform zeromap ranges and all-disk ranges keep the existing large swapin
> path. Fully zswap-backed ranges may be tried. Mixed zswap/disk ranges fall
> back before allocation.
>
> After a large swapcache folio is installed, recheck the zswap range and
> drop the fresh folio if it became mixed. Also consume -EAGAIN from
> swap_read_folio() the same way. Both cases retry order-0, where each slot
> can resolve its current backend independently.
>
> Signed-off-by: fujunjie <fujunjie1@qq.com>
> ---
>  mm/memcontrol-v1.c |   8 ++-
>  mm/memory.c        |  31 ++++++++-
>  mm/swap_state.c    | 169 ++++++++++++++++++++++++++++++++++++++++++---
>  3 files changed, 194 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index 765069211567..5b11b8055c66 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -682,8 +682,8 @@ void __memcg1_swapout(struct folio *folio, struct swap_cluster_info *ci)
>   * memcg1_swapin - uncharge swap slot on swapin
>   * @folio: folio being swapped in
>   *
> - * Call this function after successfully adding the charged
> - * folio to swapcache.
> + * Call this after the charged folio has been added to swapcache and the caller
> + * is no longer going to drop it back to swapped-out state.
>   *
>   * Context: The folio has to be in swap cache and locked.
>   */
> @@ -721,7 +721,9 @@ void memcg1_swapin(struct folio *folio)
>         id = __swap_cgroup_clear(ci, swp_cluster_offset(folio->swap),
>                                  nr_pages);
>         swap_cluster_unlock(ci);
> -       mem_cgroup_uncharge_swap(id, nr_pages);
> +
> +       if (id)
> +               mem_cgroup_uncharge_swap(id, nr_pages);
>  }
>  #endif
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 5a365492a9a2..d73a19692dea 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4538,6 +4538,24 @@ static inline bool should_try_to_free_swap(struct swap_info_struct *si,
>                 folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
>  }
>
> +static void memcg1_swapin_retry_folio(struct folio *folio,
> +                                     struct vm_fault *vmf)
> +{
> +       if (!folio_test_large(folio) || !folio_test_swapcache(folio))
> +               return;
> +
> +       if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) {
> +               if (!folio_trylock(folio))
> +                       return;
> +       } else {
> +               folio_lock(folio);
> +       }
> +
> +       if (folio_test_large(folio) && folio_test_swapcache(folio))
> +               memcg1_swapin(folio);
> +       folio_unlock(folio);
> +}
> +
>  static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
>  {
>         vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> @@ -4857,8 +4875,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>
>         swapcache = folio;
>         ret |= folio_lock_or_retry(folio, vmf);
> -       if (ret & VM_FAULT_RETRY)
> +       if (ret & VM_FAULT_RETRY) {
> +               memcg1_swapin_retry_folio(folio, vmf);
>                 goto out_release;
> +       }
>
>         page = folio_file_page(folio, swp_offset(entry));
>         /*
> @@ -5067,6 +5087,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         if (unlikely(folio != swapcache)) {
>                 folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
>                 folio_add_lru_vma(folio, vma);
> +               if (folio_test_large(swapcache))
> +                       memcg1_swapin(swapcache);
>                 folio_put_swap(swapcache, NULL);
>         } else if (!folio_test_anon(folio)) {
>                 /*
> @@ -5076,6 +5098,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                 VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
>                 VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
>                 folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
> +               if (folio_test_large(folio))
> +                       memcg1_swapin(folio);
>                 folio_put_swap(folio, NULL);
>         } else {
>                 VM_WARN_ON_ONCE(nr_pages != 1 && nr_pages != folio_nr_pages(folio));
> @@ -5132,8 +5156,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         if (vmf->pte)
>                 pte_unmap_unlock(vmf->pte, vmf->ptl);
>  out_page:
> -       if (folio_test_swapcache(folio))
> +       if (folio_test_swapcache(folio)) {
> +               if (folio_test_large(folio))
> +                       memcg1_swapin(folio);
>                 folio_free_swap(folio);
> +       }
>         folio_unlock(folio);
>  out_release:
>         folio_put(folio);
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index d37097913b30..f03ad4832f16 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -21,6 +21,7 @@
>  #include <linux/migrate.h>
>  #include <linux/vmalloc.h>
>  #include <linux/huge_mm.h>
> +#include <linux/zswap.h>
>  #include <linux/shmem_fs.h>
>  #include "internal.h"
>  #include "swap_table.h"
> @@ -403,7 +404,8 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci,
>  static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
>                                         swp_entry_t targ_entry, gfp_t gfp,
>                                         unsigned int order, struct vm_fault *vmf,
> -                                       struct mempolicy *mpol, pgoff_t ilx)
> +                                       struct mempolicy *mpol, pgoff_t ilx,
> +                                       bool defer_memcg1_swapin)
>  {
>         int err;
>         swp_entry_t entry;
> @@ -466,7 +468,8 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
>         }
>
>         /* memsw uncharges swap when folio is added to swap cache */
> -       memcg1_swapin(folio);
> +       if (!defer_memcg1_swapin || !order)
> +               memcg1_swapin(folio);
>         if (shadow)
>                 workingset_refault(folio, shadow);
>
> @@ -495,9 +498,12 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
>   * Return: Returns the folio if allocation succeeded and folio is in the swap
>   * cache. Returns error code if failed due to race, OOM or invalid arguments.
>   */
> -struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
> -                                    unsigned long orders, struct vm_fault *vmf,
> -                                    struct mempolicy *mpol, pgoff_t ilx)
> +static struct folio *__swap_cache_alloc_folio(swp_entry_t targ_entry,
> +                                             gfp_t gfp, unsigned long orders,
> +                                             struct vm_fault *vmf,
> +                                             struct mempolicy *mpol,
> +                                             pgoff_t ilx,
> +                                             bool defer_memcg1_swapin)
>  {
>         int order, err;
>         struct folio *ret;
> @@ -512,7 +518,8 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
>
>         do {
>                 ret = __swap_cache_alloc(ci, targ_entry, gfp, order,
> -                                        vmf, mpol, ilx);
> +                                        vmf, mpol, ilx,
> +                                        defer_memcg1_swapin);
>                 if (!IS_ERR(ret))
>                         break;
>                 err = PTR_ERR(ret);
> @@ -525,6 +532,124 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
>         return ret;
>  }
>
> +struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
> +                                    unsigned long orders, struct vm_fault *vmf,
> +                                    struct mempolicy *mpol, pgoff_t ilx)
> +{
> +       return __swap_cache_alloc_folio(targ_entry, gfp, orders, vmf,
> +                                       mpol, ilx, false);
> +}
> +
> +static struct folio *swap_cache_alloc_speculative_folio(swp_entry_t targ_entry,
> +                                                       gfp_t gfp,
> +                                                       unsigned long orders,
> +                                                       struct vm_fault *vmf,
> +                                                       struct mempolicy *mpol,
> +                                                       pgoff_t ilx)
> +{
> +       /*
> +        * Speculative large swapin may drop this fresh swapcache folio and
> +        * retry order-0 after backend or page-table revalidation. Keep the
> +        * cgroup v1 memsw swap owner until the caller commits the folio.
> +        */
> +       return __swap_cache_alloc_folio(targ_entry, gfp, orders, vmf,
> +                                       mpol, ilx, true);
> +}
> +
> +static bool swapin_zeromap_same(swp_entry_t entry, unsigned int nr_pages)
> +{
> +       unsigned int ci_start = swp_cluster_offset(entry);
> +       struct swap_cluster_info *ci = __swap_entry_to_cluster(entry);
> +       bool is_zero;
> +       unsigned int i;
> +
> +       if (ci_start + nr_pages > SWAPFILE_CLUSTER) {
> +               VM_WARN_ON_ONCE(1);
> +               return false;
> +       }
> +
> +       rcu_read_lock();
> +       if (!rcu_dereference(ci->table)) {
> +               rcu_read_unlock();
> +               return true;
> +       }
> +
> +       is_zero = __swap_table_test_zero(ci, ci_start);
> +       for (i = 1; i < nr_pages; i++) {
> +               if (is_zero != __swap_table_test_zero(ci, ci_start + i)) {
> +                       rcu_read_unlock();
> +                       return false;
> +               }
> +       }
> +       rcu_read_unlock();
> +
> +       return true;
> +}
> +
> +static unsigned long swapin_admit_orders(swp_entry_t entry,
> +                                        unsigned long orders)
> +{
> +       unsigned long candidates = orders & ~BIT(0);
> +       unsigned long admitted = orders & BIT(0);
> +       int order;
> +
> +       if (!candidates)
> +               return orders;
> +
> +       while (candidates) {
> +               enum zswap_range_state state;
> +               unsigned int nr_pages;
> +               swp_entry_t range_entry;
> +               bool admit = false;
> +
> +               order = fls_long(candidates) - 1;
> +               if (order > MAX_PAGE_ORDER) {
> +                       candidates &= ~BIT(order);
> +                       continue;
> +               }
> +
> +               nr_pages = 1U << order;
> +               range_entry = swp_entry(swp_type(entry),
> +                                       round_down(swp_offset(entry), nr_pages));
> +               if (!swapin_zeromap_same(range_entry, nr_pages))
> +                       goto next;
> +
> +               state = zswap_probe_range(range_entry, nr_pages);
> +               switch (state) {
> +               case ZSWAP_RANGE_MIXED:
> +                       break;
> +               case ZSWAP_RANGE_ALL_ZSWAP:
> +               case ZSWAP_RANGE_NEVER_ENABLED:
> +               case ZSWAP_RANGE_NO_ZSWAP:
> +                       admit = true;
> +                       break;
> +               }
> +
> +next:
> +               if (admit)
> +                       admitted |= BIT(order);
> +               else
> +                       count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK);
> +               candidates &= ~BIT(order);
> +       }
> +
> +       return admitted ? admitted : BIT(0);
> +}
> +
> +static bool zswap_needs_order0_retry(struct folio *folio)
> +{
> +       if (!folio_test_large(folio))
> +               return false;
> +
> +       /*
> +        * Admission sees only an advisory zswap snapshot. Recheck after the
> +        * large swapcache folio is installed; if the range became mixed, drop
> +        * the fresh folio before IO and let order-0 handle each slot.
> +        */
> +       return zswap_probe_range(folio->swap, folio_nr_pages(folio)) ==
> +              ZSWAP_RANGE_MIXED;
> +}
> +
>  /*
>   * If we are the only user, then try to free up the swap cache.
>   *
> @@ -634,7 +759,8 @@ static struct folio *swap_cache_read_folio(swp_entry_t entry, gfp_t gfp,
>                 folio = swap_cache_get_folio(entry);
>                 if (folio)
>                         return folio;
> -               folio = swap_cache_alloc_folio(entry, gfp, BIT(0), NULL, mpol, ilx);
> +               folio = swap_cache_alloc_folio(entry, gfp, BIT(0), NULL,
> +                                              mpol, ilx);
>         } while (PTR_ERR(folio) == -EEXIST);
>
>         if (IS_ERR_OR_NULL(folio))
> @@ -677,18 +803,43 @@ struct folio *swapin_sync(swp_entry_t entry, gfp_t gfp, unsigned long orders,
>         struct folio *folio;
>         int ret;
>
> +       orders = swapin_admit_orders(entry, orders);
> +again:
>         do {
>                 folio = swap_cache_get_folio(entry);
>                 if (folio)
>                         return folio;
> -               folio = swap_cache_alloc_folio(entry, gfp, orders, vmf, mpol, ilx);
> +               folio = swap_cache_alloc_speculative_folio(entry, gfp, orders,
> +                                                          vmf, mpol, ilx);
>         } while (PTR_ERR(folio) == -EEXIST);
>
>         if (IS_ERR(folio))
>                 return folio;
>
> +       if (zswap_needs_order0_retry(folio)) {
> +               count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN_FALLBACK);
> +               /*
> +                * The folio is newly allocated, locked, clean and not uptodate;
> +                * no data has been read into it. Removing it only restores the
> +                * swap table entries so order-0 swapin can resolve a backend
> +                * race without attempting speculative large-folio zswapin.
> +                */
> +               swap_cache_del_folio(folio);
> +               folio_unlock(folio);
> +               folio_put(folio);
> +               orders = BIT(0);
> +               goto again;
> +       }
> +
>         ret = swap_read_folio(folio, NULL);
> -       VM_WARN_ON_ONCE(ret == -EAGAIN);
> +       if (ret == -EAGAIN) {

Can this happen? After you add the entire swap range to the swap cache,
the backend is locked. Zswap writeback bails out if it fails to add the
page to the swap cache.

I think you can just check the range (with zswap_probe_range() or
whatever) before swap_read_folio(). If the range is still fully backed
by zswap, you are good to go. Otherwise, bail here immediately.

Then you don't need all the complexity with extending swap_read_folio
to handle mixed range errors (for now at least).
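
To make the suggestion above concrete, here is a rough userspace sketch
of the pre-read check, not kernel code. The enum, probe_range() and
large_read_ok() are hypothetical stand-ins: the bool array models whether
each slot in the range currently has a zswap entry, and the real
zswap_probe_range() in this series may classify ranges differently. I am
assuming that only a mixed range needs to bail to order-0, which matches
what zswap_needs_order0_retry() checks in the quoted hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for enum zswap_range_state. */
enum range_state {
	RANGE_NO_ZSWAP,		/* no slot in the range has a zswap entry */
	RANGE_ALL_ZSWAP,	/* every slot in the range has a zswap entry */
	RANGE_MIXED,		/* some slots do, some do not */
};

/*
 * Classify a range. in_zswap[i] is true if slot i currently has a
 * zswap entry (assumption: the caller holds the range stable, as the
 * locked large swapcache folio would in the kernel).
 */
static enum range_state probe_range(const bool *in_zswap, size_t nr_pages)
{
	size_t i;

	assert(nr_pages > 0);

	/* Any slot that disagrees with slot 0 makes the range mixed. */
	for (i = 1; i < nr_pages; i++) {
		if (in_zswap[i] != in_zswap[0])
			return RANGE_MIXED;
	}

	return in_zswap[0] ? RANGE_ALL_ZSWAP : RANGE_NO_ZSWAP;
}

/*
 * Single check done before issuing the read: a uniform range may go
 * ahead as a large folio, a mixed range must bail and retry order-0.
 */
static bool large_read_ok(const bool *in_zswap, size_t nr_pages)
{
	return probe_range(in_zswap, nr_pages) != RANGE_MIXED;
}
```

In swapin_sync() the equivalent check would sit right before
swap_read_folio(), and the bail path could reuse the
swap_cache_del_folio() / folio_unlock() / folio_put() sequence already
present in the hunk above.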

* Re: [PATCH] cgroup/cpuset: Free sched domains on rebuild guard failure
From: Tejun Heo @ 2026-05-29 18:28 UTC (permalink / raw)
  To: Guopeng Zhang
  Cc: Waiman Long, Johannes Weiner, Michal Koutný, Chen Ridong,
	Guopeng Zhang, cgroups, linux-kernel
In-Reply-To: <20260528093742.1792456-1-guopeng.zhang@linux.dev>

Applied to cgroup/for-7.2.

Thanks.

--
tejun

* Re: [RFC PATCH v2 3/9] mm/zswap: support fully zswap-backed large folio loads
From: Nhat Pham @ 2026-05-29 18:25 UTC (permalink / raw)
  To: fujunjie
  Cc: Andrew Morton, linux-mm, Alexandre Ghiti, Kairui Song, Usama Arif,
	Chris Li, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
	Hugh Dickins, Roman Gushchin, Shakeel Butt, linux-kernel, cgroups
In-Reply-To: <tencent_7D186EDC2C9AB9009F9915C1E68F3CF44609@qq.com>

On Fri, May 29, 2026 at 5:19 AM fujunjie <fujunjie1@qq.com> wrote:
>
> zswap currently refuses large swapcache folios. That is correct for mixed
> backend ranges, but it also prevents the common swapin path from loading a
> range that is still fully backed by zswap.
>
> Teach zswap_load() to fill a locked large swapcache folio by decompressing
> each base-page entry into the matching folio offset, then flushing the
> folio once. A missing entry after zswap data has been seen is reported as
> -EAGAIN so the caller can drop the speculative large folio and retry
> order-0.
>
> The large load keeps the zswap entries in place. It is a clean speculative
> fill: until the swap slots are freed, zswap remains the backing copy if
> reclaim drops the large folio before PTEs are installed.
>
> Signed-off-by: fujunjie <fujunjie1@qq.com>
> ---
>  mm/zswap.c | 105 ++++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 87 insertions(+), 18 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index da5297f7bd69..94ba112a2982 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -15,6 +15,8 @@
>
>  #include <linux/module.h>
>  #include <linux/cpu.h>
> +#include <linux/mm.h>
> +#include <linux/huge_mm.h>
>  #include <linux/highmem.h>
>  #include <linux/slab.h>
>  #include <linux/spinlock.h>
> @@ -934,7 +936,8 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
>         return comp_ret == 0 && alloc_ret == 0;
>  }
>
> -static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
> +static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio,
> +                            unsigned int page_idx, bool flush_dcache)
>  {
>         struct zswap_pool *pool = entry->pool;
>         struct scatterlist input[2]; /* zsmalloc returns an SG list 1-2 entries */
> @@ -952,14 +955,15 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
>
>                 WARN_ON_ONCE(input->length != PAGE_SIZE);
>
> -               dst = kmap_local_folio(folio, 0);
> +               dst = kmap_local_folio(folio, page_idx * PAGE_SIZE);
>                 memcpy_from_sglist(dst, input, 0, PAGE_SIZE);
>                 dlen = PAGE_SIZE;
>                 kunmap_local(dst);
> -               flush_dcache_folio(folio);
> +               if (flush_dcache)
> +                       flush_dcache_folio(folio);
>         } else {
>                 sg_init_table(&output, 1);
> -               sg_set_folio(&output, folio, PAGE_SIZE, 0);
> +               sg_set_folio(&output, folio, PAGE_SIZE, page_idx * PAGE_SIZE);
>                 acomp_request_set_params(acomp_ctx->req, input, &output,
>                                          entry->length, PAGE_SIZE);
>                 ret = crypto_acomp_decompress(acomp_ctx->req);
> @@ -1042,7 +1046,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>                 goto out;
>         }
>
> -       if (!zswap_decompress(entry, folio)) {
> +       if (!zswap_decompress(entry, folio, 0, true)) {
>                 ret = -EIO;
>                 goto out;
>         }
> @@ -1615,10 +1619,9 @@ enum zswap_range_state zswap_probe_range(swp_entry_t swp,
>   *  NOT marked up-to-date, so that an IO error is emitted (e.g. do_swap_page()
>   *  will SIGBUS).
>   *
> - *  -EINVAL: if the swapped out content was in zswap, but the page belongs
> - *  to a large folio, which is not supported by zswap. The folio is unlocked,
> - *  but NOT marked up-to-date, so that an IO error is emitted (e.g.
> - *  do_swap_page() will SIGBUS).
> + *  -EAGAIN: if the swapped out content belongs to a large folio, but the
> + *  range is mixed or raced with writeback. The folio remains locked so the
> + *  caller can drop the large swapcache folio and retry order-0.
>   *
>   *  -ENOENT: if the swapped out content was not in zswap. The folio remains
>   *  locked on return.
> @@ -1626,9 +1629,12 @@ enum zswap_range_state zswap_probe_range(swp_entry_t swp,
>  int zswap_load(struct folio *folio)
>  {
>         swp_entry_t swp = folio->swap;
> +       unsigned int nr_pages = folio_nr_pages(folio);
> +       unsigned int type = swp_type(swp);
>         pgoff_t offset = swp_offset(swp);
> -       struct xarray *tree = swap_zswap_tree(swp);
> +       struct xarray *tree;
>         struct zswap_entry *entry;
> +       unsigned int i;
>
>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> @@ -1636,21 +1642,84 @@ int zswap_load(struct folio *folio)
>         if (zswap_never_enabled())
>                 return -ENOENT;
>
> -       /*
> -        * Large folios should not be swapped in while zswap is being used, as
> -        * they are not properly handled. Zswap does not properly load large
> -        * folios, and a large folio may only be partially in zswap.
> -        */
> -       if (WARN_ON_ONCE(folio_test_large(folio))) {
> +       if (folio_test_large(folio)) {
> +               struct obj_cgroup *first_objcg = NULL;
> +               bool same_objcg = true;
> +               bool saw_zswap = false;
> +               bool saw_non_zswap = false;
> +
> +               /*
> +                * The locked large swapcache folio now covers the range and
> +                * conflicts with zswap writeback's order-0 swapcache allocation.
> +                * If the range is mixed or an entry disappears, retry order-0.
> +                */
> +               for (i = 0; i < nr_pages; i++) {
> +                       tree = swap_zswap_tree(swp_entry(type, offset + i));
> +                       entry = xa_load(tree, offset + i);
> +                       if (!entry) {
> +                               if (saw_zswap)
> +                                       return -EAGAIN;
> +                               saw_non_zswap = true;
> +                               continue;
> +                       }

Can we use the xas_load() API here instead of walking down the tree
from the root again for every index?

> +                       if (saw_non_zswap)
> +                               return -EAGAIN;
> +
> +                       if (!saw_zswap)
> +                               first_objcg = entry->objcg;
> +                       else if (entry->objcg != first_objcg)
> +                               same_objcg = false;

Can we get different objcg at this point?

> +                       saw_zswap = true;
> +               }
> +               if (!saw_zswap)
> +                       return -ENOENT;
> +
> +               for (i = 0; i < nr_pages; i++) {
> +                       tree = swap_zswap_tree(swp_entry(type, offset + i));
> +                       entry = xa_load(tree, offset + i);
> +                       if (!entry)
> +                               return -EAGAIN;
> +
> +                       if (!zswap_decompress(entry, folio, i, false)) {
> +                               folio_unlock(folio);
> +                               return -EIO;
> +                       }
> +               }
> +
> +               flush_dcache_folio(folio);
> +               /*
> +                * Keep zswap entries until swap slots are freed. This is a clean
> +                * speculative fill; zswap remains the backing copy if reclaim
> +                * drops the large folio before PTEs are installed.
> +                */
> +               folio_mark_uptodate(folio);
> +               count_vm_events(ZSWPIN, nr_pages);
> +               count_mthp_stat(folio_order(folio), MTHP_STAT_SWPIN);
> +
> +               if (same_objcg) {
> +                       if (first_objcg)
> +                               count_objcg_events(first_objcg, ZSWPIN, nr_pages);
> +               } else {
> +                       for (i = 0; i < nr_pages; i++) {
> +                               tree = swap_zswap_tree(swp_entry(type, offset + i));
> +                               entry = xa_load(tree, offset + i);
> +                               if (WARN_ON_ONCE(!entry))
> +                                       continue;
> +                               if (entry->objcg)
> +                                       count_objcg_events(entry->objcg, ZSWPIN, 1);

xas_load() here too?


> +                       }
> +               }
> +
>                 folio_unlock(folio);
> -               return -EINVAL;
> +               return 0;
>         }

>
> +       tree = swap_zswap_tree(swp);
>         entry = xa_load(tree, offset);
>         if (!entry)
>                 return -ENOENT;
>
> -       if (!zswap_decompress(entry, folio)) {
> +       if (!zswap_decompress(entry, folio, 0, true)) {
>                 folio_unlock(folio);
>                 return -EIO;
>         }

I wonder how much of these two paths (order 0 and larger order) can be
unified...

> --
> 2.34.1
>
