Linux cgroups development
* [PATCH 3/3] memcg: bail out proactive reclaim when memcg is dying
From: Jiayuan Chen @ 2026-06-23  6:27 UTC
  To: linux-mm
  Cc: yingfu.zhou, jiayuan.chen, Jiayuan Chen, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, David Hildenbrand, Qi Zheng, Lorenzo Stoakes,
	Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	cgroups, linux-kernel
In-Reply-To: <20260623062800.298514-1-jiayuan.chen@linux.dev>

From: Jiayuan Chen <jiayuan.chen@shopee.com>

Proactive reclaim via memory.reclaim can run for a long time, with swap
I/O or thrashing again dominating the latency, and it delays cgroup
removal in the same way.

Mitigate this by stopping the reclaim once memcg_is_dying() returns true.

Reported-by: Zhou Yingfu <yingfu.zhou@shopee.com>
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
---
 mm/vmscan.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8190c4abec84..1162b7f76655 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7922,6 +7922,9 @@ int user_proactive_reclaim(char *buf,
 		if (memcg) {
 			unsigned int reclaim_options;
 
+			if (memcg_is_dying(memcg))
+				break;
+
 			reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
 					  MEMCG_RECLAIM_PROACTIVE;
 			reclaimed = try_to_free_mem_cgroup_pages(memcg,
-- 
2.43.0



* [PATCH 2/3] memcg: bail out memory.max when memcg is dying
From: Jiayuan Chen @ 2026-06-23  6:27 UTC
  To: linux-mm
  Cc: yingfu.zhou, jiayuan.chen, Jiayuan Chen, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Kairui Song, Qi Zheng, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, David Hildenbrand, Lorenzo Stoakes, cgroups,
	linux-kernel
In-Reply-To: <20260623062800.298514-1-jiayuan.chen@linux.dev>

From: Jiayuan Chen <jiayuan.chen@shopee.com>

memory.max has the same high-latency reclaim loop as memory.high, and
may additionally invoke the OOM killer on a cgroup that is already going
away, further delaying its removal.

Mitigate this by bailing out of the loop once memcg_is_dying() returns true.

Reported-by: Zhou Yingfu <yingfu.zhou@shopee.com>
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
---
 mm/memcontrol.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d5cd056a25e..06bde6c5318f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4847,6 +4847,9 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 		if (signal_pending(current))
 			break;
 
+		if (memcg_is_dying(memcg))
+			break;
+
 		if (!drained) {
 			drain_all_stock(memcg);
 			drained = true;
-- 
2.43.0



* [PATCH 0/3] memcg: bail out reclaim when memcg is dying
From: Jiayuan Chen @ 2026-06-23  6:27 UTC
  To: linux-mm
  Cc: yingfu.zhou, jiayuan.chen, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Kairui Song, Qi Zheng, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, David Hildenbrand, Lorenzo Stoakes, cgroups, linux-kernel


Hi,

This series mitigates a system-wide stall we hit when a cgroup is
removed while one of its memory control files is doing synchronous
reclaim.

Problem Description
===================

Writing to memory.high, memory.max or memory.reclaim runs reclaim
synchronously in the writer's context, looping until the usage drops
below the target (or, for memory.reclaim, until the requested amount has
been reclaimed). On a large cgroup this can take a long time. The
latency is especially bad when reclaim has to perform swap I/O, where it
is bound by the swap device write bandwidth, and under thrashing it is
effectively unbounded - each round reclaims a few pages that the
workload immediately faults back in, so the loop keeps making "progress"
and never converges.

These writes go through cgroup_file_write(), which does not take
cgroup_mutex and does not pin the css. Instead, kernfs guarantees the
node (and thus the css) stays alive for the duration of the operation by
holding an active reference. So while the reclaim loop runs, the active
reference on the file is held.

If another task removes the same cgroup in parallel, cgroup_rmdir()
takes cgroup_mutex and then blocks in kernfs_drain() waiting for that
active reference to drain. Because cgroup_mutex is held throughout the
wait, every other task that needs it piles up behind the remover - in
our case the whole machine ground to a halt, with hung_task reports for
the remover and for unrelated tasks merely reading /proc/<pid>/cgroup:

INFO: task cgdelete:366634 blocked for more than 159 seconds.
      Not tainted 6.6.102+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Call Trace:
 <TASK>
 __schedule+0x3da/0x1650
 schedule+0x58/0x100
 kernfs_drain+0xe6/0x150
 __kernfs_remove.part.0+0xd0/0x200
 kernfs_remove_by_name_ns+0x75/0xd0
 cgroup_addrm_files+0x325/0x410
 css_clear_dir+0x50/0xf0
 cgroup_destroy_locked+0xdf/0x1e0
 cgroup_rmdir+0x2d/0xd0
 kernfs_iop_rmdir+0x53/0x90
 vfs_rmdir+0x98/0x240
 do_rmdir+0x172/0x1b0
 __x64_sys_rmdir+0x42/0x70
 x64_sys_call+0xeb0/0x2210
 do_syscall_64+0x56/0x90
 entry_SYSCALL_64_after_hwframe+0x78/0xe2


INFO: task systemd-journal:2352 blocked for more than 182 seconds.
      Not tainted 6.6.102+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Call Trace:
 <TASK>
 __schedule+0x3da/0x1650
 schedule+0x58/0x100
 schedule_preempt_disabled+0xe/0x20
 __mutex_lock.constprop.0+0x3bb/0x640
 __mutex_lock_slowpath+0x13/0x20
 mutex_lock+0x3c/0x50
 proc_cgroup_show+0x4d/0x380
 proc_single_show+0x53/0xe0
 seq_read_iter+0x12f/0x4b0
 seq_read+0xcd/0x110
 vfs_read+0xb1/0x360
 ? __seccomp_filter+0x368/0x590
 ksys_read+0x73/0x100
 __x64_sys_read+0x19/0x30
 x64_sys_call+0x18d3/0x2210
 do_syscall_64+0x56/0x90
 entry_SYSCALL_64_after_hwframe+0x78/0xe2

The system recovers only once the reclaim finally finishes and releases
the active reference. The reclaim itself is pointless here: the cgroup
is being torn down and its remaining pages will be reparented to the
parent anyway.

Even though the reclaim loop checks signal_pending(current), that only
helps if someone knows which task to signal. The typical symptom is that
cat /proc/<pid>/cgroup gets stuck, and by the time someone works out
which task is actually stuck in reclaim, the hung task timeout has
already been hit. This makes the problem particularly nasty to debug from
a hung-task report alone, because the blocked tasks shown are often the
victims, not the reclaim writer itself.

Our Mitigation
==============

cgroup destruction sets CSS_DYING in kill_css_sync() *before*
css_clear_dir() triggers the kernfs_drain() that blocks the remover. The
in-flight reclaim loop is therefore guaranteed to observe it. This series
checks memcg_is_dying() in the three reclaim loops (memory.high,
memory.max and proactive reclaim) and bails out early, so the writer
drops the active reference promptly and the remover can make progress.

Unlike the no-progress guard (MAX_RECLAIM_RETRIES), which only fires when
reclaim makes zero progress, the dying check also covers the slow swap
I/O and thrashing cases, where reclaim keeps succeeding a little and the
loop would otherwise never converge.
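
As a rough illustration of the difference described above, here is a
minimal userspace sketch, not kernel code. The struct, the toy_* names
and the "8 pages out, 7 pages back" thrashing model are all assumptions
made for the example; only the position of the dying check mirrors the
patches in this series.

```c
#include <stdbool.h>

/* Toy stand-in for a memcg; field names are illustrative only. */
struct toy_memcg {
	long usage;	/* pages currently charged */
	bool dying;	/* models CSS_DYING */
	long rmdir_at;	/* round at which a concurrent rmdir marks us dying */
};

/* One reclaim round under thrashing: frees 8 pages, workload refaults 7. */
static long toy_reclaim_round(struct toy_memcg *cg)
{
	cg->usage -= 8;
	cg->usage += 7;
	return 8;	/* always nonzero, so a no-progress guard never fires */
}

/*
 * Shape of a memory.high/memory.max style write loop with the added check.
 * Returns the number of reclaim rounds executed before the loop exits.
 */
static long toy_limit_write(struct toy_memcg *cg, long target, bool check_dying)
{
	long rounds = 0;

	while (cg->usage > target) {
		if (rounds == cg->rmdir_at)
			cg->dying = true;	/* remover has marked the css dying */

		if (check_dying && cg->dying)
			break;			/* the bail-out this series adds */

		if (toy_reclaim_round(cg) == 0)
			break;			/* stand-in for a no-progress guard */
		rounds++;
	}
	return rounds;
}
```

With the check enabled the loop stops on the first iteration after the
cgroup is marked dying; without it, the loop keeps going until the
usage finally reaches the target, however slowly it converges.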

This is orthogonal to commit c8e6002bd611 ("memcg: introduce
non-blocking limit setting option"): O_NONBLOCK lets a caller avoid the
synchronous reclaim up front, while this series handles the case where
reclaim is already running when the cgroup starts being removed.

The legacy (v1) reclaim loops in mem_cgroup_force_empty() and
mem_cgroup_resize_max() share the same pattern but are left out for now.

Jiayuan Chen (3):
  memcg: bail out memory.high when memcg is dying
  memcg: bail out memory.max when memcg is dying
  memcg: bail out proactive reclaim when memcg is dying

 mm/memcontrol.c | 6 ++++++
 mm/vmscan.c     | 3 +++
 2 files changed, 9 insertions(+)

-- 
2.43.0



* [PATCH 1/3] memcg: bail out memory.high when memcg is dying
From: Jiayuan Chen @ 2026-06-23  6:27 UTC
  To: linux-mm
  Cc: yingfu.zhou, jiayuan.chen, Jiayuan Chen, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Kairui Song, Qi Zheng, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, David Hildenbrand, Lorenzo Stoakes, cgroups,
	linux-kernel
In-Reply-To: <20260623062800.298514-1-jiayuan.chen@linux.dev>

From: Jiayuan Chen <jiayuan.chen@shopee.com>

memory.high reclaims synchronously in the writer's context, and the
latency can be very high - especially when reclaim performs swap I/O, or
under thrashing where the loop may not converge for a long time.

While this runs, the kernfs active reference on the file is held, so a
concurrent removal of the same cgroup blocks in kernfs_drain() under
cgroup_mutex until it finishes. Reclaiming a dying cgroup is pointless,
as its pages are reparented to the parent anyway.

Mitigate this by bailing out of the reclaim loop once memcg_is_dying()
returns true.

Reported-by: Zhou Yingfu <yingfu.zhou@shopee.com>
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
---
 mm/memcontrol.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 56cd4af08232..2d5cd056a25e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4793,6 +4793,9 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 		if (signal_pending(current))
 			break;
 
+		if (memcg_is_dying(memcg))
+			break;
+
 		if (!drained) {
 			drain_all_stock(memcg);
 			drained = true;
-- 
2.43.0



* Re: [PATCH] cgroup: Use READ_ONCE() for task->flags in task_css_set_check()
From: Tao Cui @ 2026-06-23  5:58 UTC
  To: Guopeng Zhang, Tejun Heo, Johannes Weiner, Michal Koutný
  Cc: cui.tao, cgroups, linux-kernel, Guopeng Zhang
In-Reply-To: <20260623022946.525885-1-guopeng.zhang@linux.dev>


Looks fine: this is a benign, PROVE_RCU-only race, and READ_ONCE()
documents the lockless snapshot with no functional change.

Acked-by: Tao Cui <cuitao@kylinos.cn>

On 2026/6/23 10:29, Guopeng Zhang wrote:
> From: Guopeng Zhang <zhangguopeng@kylinos.cn>
> 
> task_css_set_check() uses rcu_dereference_check() to verify that
> task->cgroups can be dereferenced. One accepted condition is that the
> task is already exiting, tested by checking PF_EXITING in task->flags.
> 
> This is a lockless snapshot used only for the CONFIG_PROVE_RCU debug
> predicate. This was found by KCSAN during fuzz testing. KCSAN can report
> a data race when another task flag bit is updated concurrently. One report
> shows pids_release() reading task->flags through task_css_set_check() while
> do_task_dead() sets PF_NOFREEZE:
> ...
> The changed bit is PF_NOFREEZE, not PF_EXITING. PF_EXITING remains set
> before and after the update, so the task_css_set_check() condition does
> not change. This is not a race on task->cgroups and does not indicate
> incorrect pids charging or uncharging.
> 
> Use READ_ONCE() to document the intended lockless snapshot of task->flags.
> 
> No functional change intended.
> 
> Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>


* Re: [PATCH 1/2] cgroup/cpuset: Avoid unnecessary cpus & mems update in cpuset_hotplug_update_tasks()
From: Waiman Long @ 2026-06-23  5:58 UTC
  To: Ridong Chen, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc
In-Reply-To: <e24b8145-7a67-4cc0-8ba0-24bd89243c04@linux.dev>

On 6/22/26 9:14 PM, Ridong Chen wrote:
>
>
> On 6/23/2026 6:45 AM, Waiman Long wrote:
>> As reported by sashiko [1], cpuset_hotplug_update_tasks() may perform
>> unnecessary task iteration and updating of tasks' CPU and node masks
>> when mems_allowed and/or cpus_allowed are not set in cpuset v2. It is
>> due to the fact that the temporary new_cpus and new_mems masks do not
>> inherit parent's effective_cpus/mems when they are empty which is the
>> expected behavior for cpuset v2 since commit 4ec22e9c5a90 ("cpuset:
>> Enable cpuset controller in default hierarchy").
>>
>> Fix that and avoid unnecessary work by adding the empty mask checks and
>> inheriting the parent's versions if empty.
>>
>> [1] https://sashiko.dev/#/patchset/20260621032816.1806773-1-longman%40redhat.com
>>
>> Fixes: 4ec22e9c5a90 ("cpuset: Enable cpuset controller in default hierarchy")
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   kernel/cgroup/cpuset.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index aff86acea701..bc0207fd6e57 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -3925,6 +3925,14 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
>>       compute_effective_cpumask(&new_cpus, cs, parent);
>>       nodes_and(new_mems, cs->mems_allowed, parent->effective_mems);
>>   +    if (is_in_v2_mode()) {
>> +        /* Inherit parent's effective_cpus/mems if empty */
>> +        if (cpumask_empty(&new_cpus))
>> +            cpumask_copy(&new_cpus, parent->effective_cpus);
>> +        if (nodes_empty(new_mems))
>> +            new_mems = parent->effective_mems;
>> +    }
>> +
>>       if (!tmp || !cs->partition_root_state)
>>           goto update_tasks;
>
> I noticed that compute_effective_cpumask(...) is called in several 
> places, so I think the logic should be consolidated into that function.
>
> ```
> static void compute_effective_cpumask(struct cpumask *new_cpus,
>                       struct cpuset *cs, struct cpuset *parent)
> {
>     cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus);
>     if (cpumask_empty(new_cpus) && is_in_v2_mode())
>         cpumask_copy(new_cpus, parent->effective_cpus);
> }
>
> ```
>
> Similarly, for new_mems, should we introduce a dedicated helper like 
> compute_effective_nodemask? The same fallback logic is needed in 
> update_nodemasks_hier:
>
>
> ```
> static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
> {
> ...
>         bool has_mems = nodes_and(*new_mems, cp->mems_allowed, parent->effective_mems);
>
>         /*
>          * If it becomes empty, inherit the effective mask of the
>          * parent, which is guaranteed to have some MEMs.
>          */
>         if (is_in_v2_mode() && !has_mems)
>             *new_mems = parent->effective_mems;
> ...
> ```
>
Yes, that makes sense. Will adopt this approach in the next version.

Cheers,
Longman
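
The fallback discussed in this thread can be sketched in plain
userspace C. This is only an illustration under stated assumptions: a
64-bit integer stands in for struct cpumask / nodemask_t, and
toy_effective_mask() is a hypothetical name, not a kernel helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy 64-bit mask standing in for struct cpumask / nodemask_t. */
typedef uint64_t toy_mask_t;

/*
 * Effective mask = allowed & parent's effective mask. In v2 mode an
 * empty result inherits the parent's effective mask instead.
 */
static toy_mask_t toy_effective_mask(toy_mask_t allowed,
				     toy_mask_t parent_effective,
				     bool v2_mode)
{
	toy_mask_t eff = allowed & parent_effective;

	if (eff == 0 && v2_mode)
		eff = parent_effective;
	return eff;
}
```

Keeping the empty-mask check inside the one helper means every caller
gets the same v2 behaviour, which is the point of the consolidation
suggested above.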



* Re: [PATCH v2] selftests/cgroup: Adjust cpu.max quota based on HZ
From: Tao Cui @ 2026-06-23  5:32 UTC
  To: Joe Simmons-Talbott, Tejun Heo, Johannes Weiner,
	Michal Koutný, Shuah Khan
  Cc: cui.tao, cgroups, linux-kselftest, linux-kernel
In-Reply-To: <20260622194305.601392-1-joest@redhat.com>


Hi Joe,

One comment on the fallback:

  quota_usec = hz != -1 ? USEC_PER_SEC / hz : 1000;

When HZ can't be determined (no CONFIG_IKCONFIG_PROC, or zcat missing),
the fallback to 1000 is the exact value that fails at low HZ, so this
doesn't actually fix such kernels. A larger fallback (e.g. 10000, the
HZ=100 equivalent) would make the tests robust regardless of whether the
config is exposed.
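
A minimal sketch of what that could look like, as standalone C rather
than a change to test_cpu.c. The helper names and the
FALLBACK_QUOTA_USEC macro are made up for illustration; the 10000
value is the one suggested above.

```c
#include <stdio.h>

#define USEC_PER_SEC		1000000L
#define FALLBACK_QUOTA_USEC	10000L	/* one tick at HZ=100 */

/* Parse a "CONFIG_HZ=<n>" line; return -1 if it does not match. */
static long parse_config_hz(const char *line)
{
	long hz;

	if (sscanf(line, "CONFIG_HZ=%ld", &hz) != 1 || hz <= 0)
		return -1;
	return hz;
}

/* One tick in microseconds, or the conservative fallback if HZ is unknown. */
static long quota_usec_for_hz(long hz)
{
	return hz > 0 ? USEC_PER_SEC / hz : FALLBACK_QUOTA_USEC;
}
```

With this shape an unknown HZ yields the same quota as a HZ=100 kernel,
so the tests do not depend on /proc/config.gz being available.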

On 2026/6/23 03:43, Joe Simmons-Talbott wrote:
> For lower HZ values a quota of 1000us is much lower than the number
> of microseconds per tick, which makes the tests test_cpucg_max and
> test_cpucg_max_nested fail. Use the number of microseconds per tick
> as the quota value.
> 
> Signed-off-by: Joe Simmons-Talbott <joest@redhat.com>
> ---
> changes since v1:
> - Try checking /proc/config.gz to get the actual kernel HZ value and
>   fall back to 1000 if the value cannot be determined.
> 
>  tools/testing/selftests/cgroup/test_cpu.c | 33 +++++++++++++++++++++--
>  1 file changed, 31 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/cgroup/test_cpu.c b/tools/testing/selftests/cgroup/test_cpu.c
> index 7a40d76b9548..65e09555309f 100644
> --- a/tools/testing/selftests/cgroup/test_cpu.c
> +++ b/tools/testing/selftests/cgroup/test_cpu.c
> @@ -639,6 +639,29 @@ test_cpucg_nested_weight_underprovisioned(const char *root)
>  	return run_cpucg_nested_weight_test(root, false);
>  }
>  
> +/*
> + * Best effort attempt to get the kernel's HZ value from the config.
> + * Return the HZ value if found otherwise return -1 to indicate failure.
> + */
> +static long
> +_get_config_hz(void)
> +{
> +	long hz = -1;
> +	FILE *f;
> +	char cmd[256] = "zcat /proc/config.gz 2>/dev/null | grep '^CONFIG_HZ='";
> +
> +	f = popen(cmd, "r");
> +
> +	if (!f)
> +		goto out;
> +
> +	fscanf(f, "CONFIG_HZ=%ld", &hz);
> +
> +out:
> +	pclose(f);
> +	return hz;
> +}
> +
>  /*
>   * This test creates a cgroup with some maximum value within a period, and
>   * verifies that a process in the cgroup is not overscheduled.
> @@ -646,7 +669,8 @@ test_cpucg_nested_weight_underprovisioned(const char *root)
>  static int test_cpucg_max(const char *root)
>  {
>  	int ret = KSFT_FAIL;
> -	long quota_usec = 1000;
> +	long hz = _get_config_hz();
> +	long quota_usec;
>  	long default_period_usec = 100000; /* cpu.max's default period */
>  	long duration_seconds = 1;
>  
> @@ -655,6 +679,8 @@ static int test_cpucg_max(const char *root)
>  	char *cpucg;
>  	char quota_buf[32];
>  
> +	quota_usec = hz != -1 ? USEC_PER_SEC / hz : 1000;
> +
>  	snprintf(quota_buf, sizeof(quota_buf), "%ld", quota_usec);
>  
>  	cpucg = cg_name(root, "cpucg_test");
> @@ -710,7 +736,8 @@ static int test_cpucg_max(const char *root)
>  static int test_cpucg_max_nested(const char *root)
>  {
>  	int ret = KSFT_FAIL;
> -	long quota_usec = 1000;
> +	long quota_usec;
> +	long hz = _get_config_hz();
>  	long default_period_usec = 100000; /* cpu.max's default period */
>  	long duration_seconds = 1;
>  
> @@ -719,6 +746,8 @@ static int test_cpucg_max_nested(const char *root)
>  	char *parent, *child;
>  	char quota_buf[32];
>  
> +	quota_usec = hz != -1 ? USEC_PER_SEC / hz : 1000;
> +
>  	snprintf(quota_buf, sizeof(quota_buf), "%ld", quota_usec);
>  
>  	parent = cg_name(root, "cpucg_parent");



* Re: [PATCH v3 08/13] genirq: Add explicit housekeeping callback for managed IRQ migration
From: Jing Wu @ 2026-06-23  4:36 UTC
  To: Thomas Gleixner
  Cc: Jing Wu, Waiman Long, linux-kernel, rcu, cgroups, Qiliang Yuan
In-Reply-To: <87cxxnegqa.ffs@fw13>

On Thu, Jun 18 2026 at 22:27, Thomas Gleixner wrote:
> While this series might work for you by some definition of "works",
> it's broken beyond repair [...]
> Please coordinate with Waiman or whoever is working on it at RH right now.

Thank you for the detailed review. I want to clarify the timeline and
highlight a key distinction before proceeding with v4.

DHM was posted as RFC on 2026-02-06 [1], v1 on 2026-03-25 [2], and v2
on 2026-04-13 [3]. Waiman Long's series was posted on 2026-04-20 [4],
seven days after DHM v2. The development appears to have been parallel.

More importantly, DHM and Waiman's series differ in a key requirement:
Waiman's series requires "nohz_full=" to be present at boot (even with
an empty CPU list) to opt into runtime updates. DHM's goal is to enable
CPU noise isolation at runtime on systems where no nohz_full= was
configured at boot, a use case his series does not cover.

That said, I fully accept the architectural feedback: the on-the-fly
subsystem modification approach in v3 is wrong, and v4 should use the
CPU hotplug machinery.

We are open to coordinating with Waiman on a unified approach that
covers both use cases. Before starting v4, two questions:

  1. Is the "no boot parameter required" use case worth pursuing
     independently, or should it be folded into Waiman's series?

  2. For the hotplug path: is CPU-by-CPU offline/online the expected
     mechanism, given that you rejected the cpuhp_offline_cb() bulk
     approach in Waiman's v1?

[1] https://lore.kernel.org/r/20260206-feature-dynamic_isolcpus_dhei-v1-0-00a711eb0c74@gmail.com
[2] https://lore.kernel.org/r/20260325-dhei-v12-final-v1-0-919cca23cadf@gmail.com
[3] https://lore.kernel.org/r/20260413-wujing-dhm-v2-0-06df21caba5d@gmail.com
[4] https://lore.kernel.org/r/20260421030351.281436-1-longman@redhat.com

Jing Wu <realwujing@gmail.com>


* [PATCH] cgroup: Use READ_ONCE() for task->flags in task_css_set_check()
From: Guopeng Zhang @ 2026-06-23  2:29 UTC
  To: Tejun Heo, Johannes Weiner, Michal Koutný
  Cc: cgroups, linux-kernel, Guopeng Zhang

From: Guopeng Zhang <zhangguopeng@kylinos.cn>

task_css_set_check() uses rcu_dereference_check() to verify that
task->cgroups can be dereferenced. One accepted condition is that the
task is already exiting, tested by checking PF_EXITING in task->flags.

This is a lockless snapshot used only for the CONFIG_PROVE_RCU debug
predicate. This was found by KCSAN during fuzz testing. KCSAN can report
a data race when another task flag bit is updated concurrently. One report
shows pids_release() reading task->flags through task_css_set_check() while
do_task_dead() sets PF_NOFREEZE:

  KCSAN: data-race in task_css() [inline]
  KCSAN: data-race in pids_release()

  task_css()
  pids_release()
  cgroup_release()
  release_task()
  wait_task_zombie()

  value changed: 0x0040004c -> 0x0040804c

The changed bit is PF_NOFREEZE, not PF_EXITING. PF_EXITING remains set
before and after the update, so the task_css_set_check() condition does
not change. This is not a race on task->cgroups and does not indicate
incorrect pids charging or uncharging.
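
The bit arithmetic behind that statement can be checked with a small
standalone sketch. The two PF_* values below are copied from
include/linux/sched.h as I understand them and should be treated as an
assumption for other kernel versions; the helper names are made up for
this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, as in include/linux/sched.h. */
#define PF_EXITING	0x00000004u
#define PF_NOFREEZE	0x00008000u

/* Bits that differ between two snapshots of task->flags. */
static uint32_t changed_bits(uint32_t before, uint32_t after)
{
	return before ^ after;
}

/* True if the PF_EXITING test gives the same answer for both snapshots. */
static bool exiting_test_unchanged(uint32_t before, uint32_t after)
{
	return !!(before & PF_EXITING) == !!(after & PF_EXITING);
}
```

Applied to the values in the report (0x0040004c and 0x0040804c), the
only differing bit is 0x8000 and the PF_EXITING test is true for both
snapshots.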

Use READ_ONCE() to document the intended lockless snapshot of task->flags.

No functional change intended.

Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
---
 include/linux/cgroup.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index f2aa46a4f871..8afc4ec7f7a1 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -480,7 +480,7 @@ static inline void cgroup_unlock(void)
 		rcu_read_lock_sched_held() ||				\
 		lockdep_is_held(&cgroup_mutex) ||			\
 		lockdep_is_held(&css_set_lock) ||			\
-		((task)->flags & PF_EXITING) || (__c))
+		(READ_ONCE((task)->flags) & PF_EXITING) || (__c))
 #else
 #define task_css_set_check(task, __c)					\
 	rcu_dereference((task)->cgroups)
-- 
2.25.1



* Re: [PATCH 1/2] blk-cgroup: fix blkg leak in blkg_create() error path
From: Zizhi Wo @ 2026-06-23  1:38 UTC
  To: Tao Cui, axboe, tj, josef, linux-block
  Cc: cgroups, yangerkun, chengzhihao1, houtao1, yukuai
In-Reply-To: <38704548-786f-4ec7-afd4-228aa8d68ad7@linux.dev>



On 2026/6/23 9:16, Tao Cui wrote:
> Hi Zizhi,
> 
> Thanks for the patch.  I ran into the same issue and posted a fix for it
> earlier:
> 
>    https://lore.kernel.org/all/20260507061229.57466-1-cuitao@kylinos.cn/
> 
> The leak fix is identical to yours (blkg_put() -> percpu_ref_kill()),
> plus one extra change: moving blkg->online = true into the success
> block:
> 
> 	if (likely(!ret)) {
> 		...
> +		blkg->online = true;
> 	}
> -	blkg->online = true;
> 
> On the failure path the blkg was never inserted into any list, and its
> blkg->pd[i]->online flags were not set either (those are in the same
> block).  Leaving blkg->online = true marks a blkg as online that was
> never created -- inconsistent with pd[]->online and with
> blkg_destroy(), which clears blkg->online = false.  Not observable
> today, since the failed blkg is on no list and unreachable by the
> online readers, but the flag should track the actual insertion.
> 
> (This was sent to the cgroups list rather than linux-block, hence the
> overlap.)
> 
> Thanks,
> Tao

I'm not subscribed to the cgroup mailing list, so I didn't see that this
issue had already been fixed. :( And indeed, your patch nicely updates
blkg->online as well; I hadn't realized that.

Thanks for the heads-up!

Thanks,
Zizhi Wo

> 
> On 2026/6/22 15:07, Zizhi Wo wrote:
>> When radix_tree_insert() fails in blkg_create(), the error path calls
>> blkg_put() to release the blkg. This was correct when blkg->refcnt was an
>> atomic_t: blkg_put() dropped it to 0 and triggered the release path.
>>
>> But commit 7fcf2b033b84 ("blkcg: change blkg reference counting to use
>> percpu_ref") switched refcnt to a percpu_ref. In percpu mode
>> percpu_ref_put() never checks for zero, so the release callback is never
>> invoked. This blkg is on neither blkcg->blkg_list nor queue->blkg_list, so
>> blkg_destroy_all() / blkcg_destroy_blkgs() can never reach it to call
> >> blkg_destroy()->percpu_ref_kill() either, causing the leak.
>>
>> Fix it by killing the percpu_ref instead, which switches it to atomic mode
>> and drops the initial ref.
>>
>> Fixes: 7fcf2b033b84 ("blkcg: change blkg reference counting to use percpu_ref")
>> Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
>> Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
>> ---
>>   block/blk-cgroup.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> index bc63bd220865..6386fe413994 100644
>> --- a/block/blk-cgroup.c
>> +++ b/block/blk-cgroup.c
>> @@ -437,11 +437,11 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>>   
>>   	if (!ret)
>>   		return blkg;
>>   
>>   	/* @blkg failed fully initialized, use the usual release path */
>> -	blkg_put(blkg);
>> +	percpu_ref_kill(&blkg->refcnt);
>>   	return ERR_PTR(ret);
>>   
>>   err_put_css:
>>   	css_put(&blkcg->css);
>>   err_free_blkg:



* Re: [PATCH] block/cgroup: Drop stale -EBUSY retry from blkg_conf_prep()
From: Tao Cui @ 2026-06-23  1:33 UTC
  To: Yang Xiuwei, Tejun Heo, Josef Bacik, Jens Axboe
  Cc: cui.tao, cgroups, linux-block
In-Reply-To: <20260622085623.520209-1-yangxiuwei@kylinos.cn>



On 2026/6/22 16:56, Yang Xiuwei wrote:
> Since commit 8f4236d9008b ("block: remove QUEUE_FLAG_BYPASS and
> ->bypass") nothing in the blkcg blkg lookup/creation path
> returns -EBUSY anymore...

Correct. I traced every error path in blkg_conf_prep() (and blkg_create()
underneath it): the only possible values are -EINVAL, -EOPNOTSUPP, -ENOMEM,
-ENODEV and -EEXIST (from radix_tree_insert). The -EBUSY source was indeed
the blk_queue_bypass() check removed by 8f4236d9008b, so the retry branch
has been dead since 2018. Clean removal with no behavioral change.

Reviewed-by: Tao Cui <cuitao@kylinos.cn>


* Re: [PATCH v9 0/6] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
From: Youngjun Park @ 2026-06-23  1:29 UTC
  To: Yosry Ahmed
  Cc: Youngjun Park, akpm, chrisl, linux-mm, cgroups, linux-kernel,
	kasong, hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	shikemeng, nphamcs, baoquan.he, baohua, gunho.lee, taejoon.song,
	hyungjun.cho, mkoutny, baver.bae, matia.kim
In-Reply-To: <CAO9r8zOy99szvC4W0+SUv4b3P2UxppJuBeZDV3HZzQuHUc1P1g@mail.gmail.com>

On Mon, Jun 22, 2026 at 02:23:40PM -0700, Yosry Ahmed wrote:
> On Sat, Jun 20, 2026 at 11:16 AM Youngjun Park <her0gyugyu@gmail.com> wrote:
> >
> > This is the v9 series of the swap tier patchset.
> >
> > The main change in this version is the addition of selftests for the tier
> > interfaces, requested by Nhat; see the changelog below for the other changes.
> > I designed the test cases and wrote the selftests with some AI assistance.
> >
> > For context, the bulk of the series is unchanged since v8, with great thanks
> > to Shakeel Butt and Yosry for the reviews and discussions [1] that shaped it.
> > The main change in v8 was the interface change to use memory.swap.tiers.max
> > with '0' (disable) and 'max' (enable) values. This mechanism was suggested
> > by Shakeel and Yosry.
> >
> > This change allows for future extensions to control swap between tiers and
> > aligns better with existing memcg interfaces. It is confined to patch #3's
> > user-facing interface; internally, patch #3 still uses the existing mask
> > processing method, which is implementation-efficient.
> >
> > We also discussed tier extensions. Thanks to Yosry, Nhat and Shakeel for their
> > valuable feedback.
> >
> > Here is a brief summary of our tentative conclusions. Please correct me
> > if anything is misrepresented (details in references):
> >
> > * Zswap tiering [2]:
> >   Tiering applies only to the vswap + zswap combo. Zswap itself will
> >   not be tiered, as the current architecture requires a physical device
> >   for zswap allocation.
> 
> I thought we agreed that zswap should be a tier, so that proactive
> zswap writeback can be implemented as proactive swap demotion?
> 
> The only restriction we talked about is that zswap cannot be the only
> allowed tier as long as vswap isn't supported. We can lift the
> restriction when vswap support is added.

Okay, I misunderstood that part. Thanks for the clarification.

To summarize our agreement:
zswap can be the first tier regardless of vswap support.

- With vswap: zswap can be the only allowed tier, as it can operate
  independently.
- Without vswap: zswap cannot be the only allowed tier, as it cannot
  operate without a physical backing device.

I will proceed with this understanding.

Youngjun Park


* Re: [PATCH 2/2] cgroup/cpuset: Rebind/migrate mm only for threadgroup leader in cpuset_update_tasks_nodemask()
From: Ridong Chen @ 2026-06-23  1:22 UTC
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc
In-Reply-To: <20260622224509.1927419-2-longman@redhat.com>



On 6/23/2026 6:45 AM, Waiman Long wrote:
> As reported by sashiko [1], cpuset_update_tasks_nodemask() will do
> mpol_rebind_mm() and possibly cpuset_migrate_mm() for all threads of
> a multithreaded process. Since commit 3df9ca0a2b8b ("cpuset: migrate
> memory only for threadgroup leaders"), cpuset_attach() had been updated
> to rebind and migrate memory only for threadgroup leaders to mark the
> group leader as the owner of the mm_struct.
> 
> To be consistent and avoid unnecessary performance overhead for heavily
> multithreaded processes, follow the cpuset_attach() example and perform
> memory rebind and migration only for threadgroup leaders.
> 
> Also add a paragraph in cgroup-v2.rst under cpuset.mems stating that the
> threadgroup leader is the memory owner of that threadgroup. Therefore
> the non-leading threads shouldn't be in other cgroups whose "cpuset.mems"
> doesn't fully overlap that of the group leader.
> 
> [1] https://sashiko.dev/#/patchset/20260621032816.1806773-1-longman%40redhat.com
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>   Documentation/admin-guide/cgroup-v2.rst | 7 +++++++
>   kernel/cgroup/cpuset.c                  | 4 ++++
>   2 files changed, 11 insertions(+)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 993446ab66d0..341037c7ec9d 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -2527,6 +2527,13 @@ Cpuset Interface Files
>   	a need to change "cpuset.mems" with active tasks, it shouldn't
>   	be done frequently.
>   
> +	For a multithreaded process, the threadgroup leader is
> +	considered the owner of the group's memory. Memory policy
> +	rebinding and migration will only happen with respect to the
> +	threadgroup leader. To avoid unexpected results, non-leading
> +	threads shouldn't be put into another cgroup whose "cpuset.mems"
> +	doesn't fully overlap that of the threadgroup leader.
> +
>     cpuset.mems.effective
>   	A read-only multiple values file which exists on all
>   	cpuset-enabled cgroups.
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index bc0207fd6e57..27bc7a466468 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2659,6 +2659,10 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
>   
>   		cpuset_change_task_nodemask(task, &newmems);
>   
> +		/* Rebind and migrate mm only for task group leader */
> +		if (task != task->group_leader)
> +			continue;
> +

Nit.

if (!thread_group_leader(task))
     continue;

>   		mm = get_task_mm(task);
>   		if (!mm)
>   			continue;

Reviewed-by: Ridong Chen <ridong.chen@linux.dev>

-- 
Best regards
Ridong


^ permalink raw reply

* Re: [PATCH 1/2] blk-cgroup: fix blkg leak in blkg_create() error path
From: Tao Cui @ 2026-06-23  1:16 UTC (permalink / raw)
  To: Zizhi Wo, axboe, tj, josef, linux-block
  Cc: cui.tao, cgroups, yangerkun, chengzhihao1, houtao1, yukuai
In-Reply-To: <20260622070714.1158886-2-wozizhi@huaweicloud.com>

Hi Zizhi,

Thanks for the patch.  I ran into the same issue and posted a fix for it
earlier:

  https://lore.kernel.org/all/20260507061229.57466-1-cuitao@kylinos.cn/

The leak fix is identical to yours (blkg_put() -> percpu_ref_kill()),
plus one extra change: moving blkg->online = true into the success
block:

	if (likely(!ret)) {
		...
+		blkg->online = true;
	}
-	blkg->online = true;

On the failure path the blkg was never inserted into any list, and its
blkg->pd[i]->online flags were not set either (those are in the same
block).  Leaving blkg->online = true marks a blkg as online that was
never created -- inconsistent with pd[]->online and with
blkg_destroy(), which clears blkg->online = false.  Not observable
today, since the failed blkg is on no list and unreachable by the
online readers, but the flag should track the actual insertion.

(This was sent to the cgroups list rather than linux-block, hence the
overlap.)
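
For anyone following along, here is a userspace toy model of why
blkg_put() alone cannot free the blkg on this path. It is only a sketch:
the toy_ref type and its helpers are made up for illustration and are
not the kernel's percpu_ref implementation. It shows just the two
properties described in the quoted commit message below: a put in percpu
mode never checks for zero, and a kill switches to atomic mode before
dropping the initial reference.

```c
#include <stdbool.h>

/* Toy stand-in for the kernel's percpu_ref; not the real implementation. */
struct toy_ref {
	long count;        /* starts at 1: the initial reference */
	bool atomic_mode;  /* false while in "percpu" mode */
	bool released;     /* set once the release callback would have run */
};

static void toy_ref_init(struct toy_ref *ref)
{
	ref->count = 1;
	ref->atomic_mode = false;
	ref->released = false;
}

/* In percpu mode a put only decrements; zero is never checked. */
static void toy_ref_put(struct toy_ref *ref)
{
	ref->count--;
	if (ref->atomic_mode && ref->count == 0)
		ref->released = true;
}

/* Kill switches to atomic mode, then drops the initial reference. */
static void toy_ref_kill(struct toy_ref *ref)
{
	ref->atomic_mode = true;
	toy_ref_put(ref);
}
```

In this model a lone toy_ref_put() after toy_ref_init() leaves released
false, while toy_ref_kill() sets it, which matches the leak and the fix.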

Thanks,
Tao

On 2026/6/22 15:07, Zizhi Wo wrote:
> When radix_tree_insert() fails in blkg_create(), the error path calls
> blkg_put() to release the blkg. This was correct when blkg->refcnt was an
> atomic_t: blkg_put() dropped it to 0 and triggered the release path.
> 
> But commit 7fcf2b033b84 ("blkcg: change blkg reference counting to use
> percpu_ref") switched refcnt to a percpu_ref. In percpu mode
> percpu_ref_put() never checks for zero, so the release callback is never
> invoked. This blkg is on neither blkcg->blkg_list nor queue->blkg_list, so
> blkg_destroy_all() / blkcg_destroy_blkgs() can never reach it to call
> blkg_destroy()->percpu_ref_kill() either, causing the leak.
> 
> Fix it by killing the percpu_ref instead, which switches it to atomic mode
> and drops the initial ref.
> 
> Fixes: 7fcf2b033b84 ("blkcg: change blkg reference counting to use percpu_ref")
> Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
> Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
> ---
>  block/blk-cgroup.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index bc63bd220865..6386fe413994 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -437,11 +437,11 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct gendisk *disk,
>  
>  	if (!ret)
>  		return blkg;
>  
>  	/* @blkg failed fully initialized, use the usual release path */
> -	blkg_put(blkg);
> +	percpu_ref_kill(&blkg->refcnt);
>  	return ERR_PTR(ret);
>  
>  err_put_css:
>  	css_put(&blkcg->css);
>  err_free_blkg:


^ permalink raw reply

* Re: [PATCH 1/2] cgroup/cpuset: Avoid unnecessary cpus & mems update in cpuset_hotplug_update_tasks()
From: Ridong Chen @ 2026-06-23  1:14 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc
In-Reply-To: <20260622224509.1927419-1-longman@redhat.com>



On 6/23/2026 6:45 AM, Waiman Long wrote:
> As reported by sashiko [1], cpuset_hotplug_update_tasks() may perform
> unnecessary task iteration and updating of tasks' CPU and node masks
> when mems_allowed and/or cpus_allowed are not set in cpuset v2. It is
> due to the fact that the temporary new_cpus and new_mems masks do not
> inherit parent's effective_cpus/mems when they are empty which is the
> expected behavior for cpuset v2 since commit 4ec22e9c5a90 ("cpuset:
> Enable cpuset controller in default hierarchy").
> 
> Fix that and avoid unnecessary work by adding the empty mask checks and
> inheriting the parent's versions if empty.
> 
> [1] https://sashiko.dev/#/patchset/20260621032816.1806773-1-longman%40redhat.com
> 
> Fixes: 4ec22e9c5a90 ("cpuset: Enable cpuset controller in default hierarchy")
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>   kernel/cgroup/cpuset.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index aff86acea701..bc0207fd6e57 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3925,6 +3925,14 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
>   	compute_effective_cpumask(&new_cpus, cs, parent);
>   	nodes_and(new_mems, cs->mems_allowed, parent->effective_mems);
>   
> +	if (is_in_v2_mode()) {
> +		/* Inherit parent's effective_cpus/mems if empty */
> +		if (cpumask_empty(&new_cpus))
> +			cpumask_copy(&new_cpus, parent->effective_cpus);
> +		if (nodes_empty(new_mems))
> +			new_mems = parent->effective_mems;
> +	}
> +
>   	if (!tmp || !cs->partition_root_state)
>   		goto update_tasks;
>   

I noticed that compute_effective_cpumask(...) is called in several 
places, so I think the logic should be consolidated into that function.

```
static void compute_effective_cpumask(struct cpumask *new_cpus,
				      struct cpuset *cs, struct cpuset *parent)
{
	cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus);
	/* new_cpus is already a pointer here, so pass it through directly */
	if (cpumask_empty(new_cpus) && is_in_v2_mode())
		cpumask_copy(new_cpus, parent->effective_cpus);
}

```

Similarly, for new_mems, should we introduce a dedicated helper like 
compute_effective_nodemask? The same fallback logic is needed in 
update_nodemasks_hier:


```
static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
{
...
		bool has_mems = nodes_and(*new_mems, cp->mems_allowed,
					  parent->effective_mems);

		/*
		 * If it becomes empty, inherit the effective mask of the
		 * parent, which is guaranteed to have some MEMs.
		 */
		if (is_in_v2_mode() && !has_mems)
			*new_mems = parent->effective_mems;
...
```
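
To make the idea concrete, here is a rough userspace sketch of what such
a helper could look like. Everything in it is an assumption for
illustration: toy_nodemask stands in for nodemask_t, the function name
is hypothetical, and v2_mode stands in for is_in_v2_mode(). It is not a
proposed patch.

```c
#include <stdbool.h>

/* Userspace stand-in for nodemask_t: one bit per memory node. */
typedef unsigned long toy_nodemask;

/*
 * Hypothetical helper mirroring the fallback above: intersect the
 * cpuset's mems_allowed with the parent's effective mask and, in v2
 * mode, inherit the parent's mask when the intersection is empty.
 */
static toy_nodemask toy_compute_effective_nodemask(toy_nodemask mems_allowed,
						   toy_nodemask parent_effective,
						   bool v2_mode)
{
	toy_nodemask new_mems = mems_allowed & parent_effective;

	if (v2_mode && new_mems == 0)
		new_mems = parent_effective;
	return new_mems;
}
```

With one helper, both cpuset_hotplug_update_tasks() and
update_nodemasks_hier() could share the same empty-mask fallback.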

-- 
Best regards
Ridong


^ permalink raw reply

* Re: [PATCH v9 3/6] mm: memcontrol: add interface for swap tier selection
From: Joshua Hahn @ 2026-06-23  0:40 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Youngjun Park, Shakeel Butt, akpm, chrisl, youngjun.park,
	linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, muchun.song, shikemeng, nphamcs, baoquan.he,
	baohua, gunho.lee, taejoon.song, hyungjun.cho, mkoutny, baver.bae,
	matia.kim
In-Reply-To: <ajnIasdb6j6yDUdy@google.com>

On Mon, 22 Jun 2026 23:46:31 +0000 Yosry Ahmed <yosry@kernel.org> wrote:

> > > > If that is the case, I think auto-scaling makes sense but can be a bit
> > > > tricky, since there is no universal tiered ratio; each workload will
> > > > have different tiers it can swap to, so they will all have to calculate
> > > > their own ratios. Tiered memory limits escapes this difficulty since we
> > > > assume all memory can be placed on all tiers, so we have a system-wide
> > > > ratio : -)
> > > 
> > > Hmm I don't follow. It's also possible (maybe not initially) that a
> > > memcg cannot use specific memory tiers, right? I am not sure what the
> > > difference is.
> > 
> > You're right, I was speaking more to the current state of memory tiers.
> > The majority of the feedback I received was that we already have too
> > many memcg knobs, so I just opted to make tiered memcg limits a
> > cgroup mount, with no ability for individual memcgs to tune their
> > limits or opt-in/out.
> 
> Right, I think this is similar to the approach taken here. We have a
> single interface for per-tier limits. The main difference is that we're
> allowing 0/max values to disable/enable different swap tiers per-memcg,
> as there's a use case for that.
> 
> Seems like for memory tiering there's no use case for that yet.

Yes, I would agree with that.

> > What do you think Yosry? Would it make sense for us to be able to 
> > tune these values? Personally I think it makes sense but just wanted to
> > make the basic features merged before I went to push for making those
> > knobs tunable.
> 
> Right now we're not proposing to allow tuning swap tier limits either,
> just enable or disable a tier. My main question is about the default
> values.
> 
> IIUC, for memory tiering, if you set memory.max, then the limits for
> tiers are auto-scaled. I think it makes sense to do the same for swap
> tiers for consistency. Or am I wrong about the memory tiering limits
> behavior?

No, you're right about that. Sorry for steering the thread to my 
series ; -)

To get back to the question of how the auto-tuning should work, the
main question is which ratio we scale the swap limits by.
Do we set the swap limits proportional to how much swap is present
in the system, or to how much swap is available to the cgroup?

So if we have 3 swap tiers A, B, C, with 50G, 30G, and 20G capacity
respectively, how much should a cgroup with swap.max = 10G have if
it is limited to tiers A and B?

This is what I was getting at earlier when I said we have to calculate
different ratios for different cgroups, based on what tiers they have
access to.
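
To put numbers on the two options, here is a small userspace sketch.
The toy_tier struct and toy_tier_share() are made-up names for
illustration, and the proportional split is just one possible scaling
rule, not something any posted patch implements. Sizes are in MiB so
the integer arithmetic stays exact.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_tier {
	unsigned long capacity_mib; /* tier capacity in MiB */
	bool allowed;               /* may this cgroup use the tier? */
};

/*
 * Scale limit_mib to tier idx. With per_cgroup == false the
 * denominator is the capacity of all tiers; with per_cgroup == true
 * only tiers the cgroup may use are counted. Disallowed tiers get 0.
 */
static unsigned long toy_tier_share(const struct toy_tier *tiers, size_t n,
				    size_t idx, unsigned long limit_mib,
				    bool per_cgroup)
{
	unsigned long total = 0;
	size_t i;

	if (!tiers[idx].allowed)
		return 0;
	for (i = 0; i < n; i++)
		if (!per_cgroup || tiers[i].allowed)
			total += tiers[i].capacity_mib;
	if (total == 0)
		return 0;
	return limit_mib * tiers[idx].capacity_mib / total;
}
```

For the example above, the system-wide ratio gives A = 5G and B = 3G,
so 2G of the 10G limit is tied to a tier the cgroup cannot use. The
per-cgroup ratio gives A = 6.25G and B = 3.75G, which adds up to the
full 10G but has to be recomputed for every distinct set of tiers.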

Sorry if that doesn't make sense, please let me know how I can
elaborate!

^ permalink raw reply

* Re: [RFC PATCH v2 3/7] mm, swap: support physical swap as a vswap backend
From: Yosry Ahmed @ 2026-06-23  0:23 UTC (permalink / raw)
  To: Nhat Pham
  Cc: akpm, chrisl, kasong, hannes, mhocko, roman.gushchin,
	shakeel.butt, david, muchun.song, shikemeng, baoquan.he, baohua,
	youngjun.park, chengming.zhou, ljs, liam, vbabka, rppt, surenb,
	qi.zheng, axelrasmussen, yuanchu, weixugc, riel, gourry,
	haowenchao22, kernel-team, linux-mm, linux-kernel, cgroups
In-Reply-To: <20260612193738.2183968-4-nphamcs@gmail.com>

On Fri, Jun 12, 2026 at 12:37:34PM -0700, Nhat Pham wrote:
> Add physical swap as a backend for the virtual swap layer.
> 
> With physical swap backing, vswap can allocate a physical slot on
> demand when needed: as a fallback for zswap_store failures, or as
> the destination for zswap writeback.
> 
> Each vswap entry's physical slot is tracked via a Pointer-tagged
> swap_table entry on the physical cluster (rmap back to the vswap
> entry).
> 
> Suggested-by: Kairui Song <kasong@tencent.com>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> ---
[..]
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 466f8a182716..5daff7a25f67 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -993,6 +993,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  	struct folio *folio;
>  	struct mempolicy *mpol;
>  	struct swap_info_struct *si;
> +	swp_entry_t phys = {};
>  	int ret = 0;
>  
>  	/* try to allocate swap cache folio */
> @@ -1000,16 +1001,6 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  	if (!si)
>  		return -EEXIST;
>  
> -	/*
> -	 * Vswap entries have no physical backing - writeback would fail
> -	 * and SIGBUS the caller. Bail before we waste a swap-cache folio
> -	 * allocation.
> -	 */
> -	if (si->flags & SWP_VSWAP) {
> -		put_swap_device(si);
> -		return -EINVAL;
> -	}
> -
>  	mpol = get_task_policy(current);
>  	folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, BIT(0), NULL, mpol,
>  				       NO_INTERLEAVE_INDEX);
> @@ -1028,40 +1019,78 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  	/*
>  	 * folio is locked, and the swapcache is now secured against
>  	 * concurrent swapping to and from the slot, and concurrent
> -	 * swapoff so we can safely dereference the zswap tree here.
> -	 * Verify that the swap entry hasn't been invalidated and recycled
> -	 * behind our backs, to avoid overwriting a new swap folio with
> -	 * old compressed data. Only when this is successful can the entry
> -	 * be dereferenced.
> +	 * swapoff so we can safely dereference the zswap tree (or vswap
> +	 * vtable) here. Verify that the swap entry hasn't been
> +	 * invalidated and recycled behind our backs, to avoid overwriting
> +	 * a new swap folio with old compressed data. Only when this is
> +	 * successful can the entry be dereferenced.
>  	 */
> -	tree = swap_zswap_tree(swpentry);
> -	if (entry != xa_load(tree, offset)) {
> -		ret = -ENOMEM;
> -		goto out;
> +	if (swap_is_vswap(si)) {
> +		if (entry != vswap_zswap_load(swpentry)) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +		/*
> +		 * Allocate physical backing BEFORE decompress - if it fails,
> +		 * no wasted work. folio_realloc_swap sets vtable to PHYS,
> +		 * overwriting ZSWAP - the old entry pointer is only held
> +		 * by the caller now.
> +		 */
> +		phys = folio_realloc_swap(folio);
> +		if (!phys.val) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}

I didn't look through the rest of the series, but are there use cases
for calling folio_realloc_swap() without calling vswap_zswap_load()
first? I wonder if the realloc_swap API should take the swpentry
directly and do the load within? Something like
vswap_alloc_phys(swpentry, folio)?

^ permalink raw reply

* Re: [RFC PATCH v2 2/7] mm, swap: support zswap and zeroswap as vswap backends
From: Yosry Ahmed @ 2026-06-23  0:18 UTC (permalink / raw)
  To: Nhat Pham
  Cc: akpm, chrisl, kasong, hannes, mhocko, roman.gushchin,
	shakeel.butt, david, muchun.song, shikemeng, baoquan.he, baohua,
	youngjun.park, chengming.zhou, ljs, liam, vbabka, rppt, surenb,
	qi.zheng, axelrasmussen, yuanchu, weixugc, riel, gourry,
	haowenchao22, kernel-team, linux-mm, linux-kernel, cgroups
In-Reply-To: <20260612193738.2183968-3-nphamcs@gmail.com>

[..]  
> @@ -1623,16 +1642,14 @@ int zswap_load(struct folio *folio)
>  	if (entry->objcg)
>  		count_objcg_events(entry->objcg, ZSWPIN, 1);
>  
> -	/*
> -	 * We are reading into the swapcache, invalidate zswap entry.
> -	 * The swapcache is the authoritative owner of the page and
> -	 * its mappings, and the pressure that results from having two
> -	 * in-memory copies outweighs any benefits of caching the
> -	 * compression work.
> -	 */

Forgot to ask, is dropping this comment intentional?

>  	folio_mark_dirty(folio);
> -	xa_erase(tree, offset);
> -	zswap_entry_free(entry);
> +
> +	if (swap_is_vswap(si)) {
> +		folio_release_vswap_backing(folio);
> +	} else {
> +		xa_erase(swap_zswap_tree(swp), swp_offset(swp));
> +		zswap_entry_free(entry);
> +	}
>  
>  	folio_unlock(folio);
>  	return 0;
> -- 
> 2.53.0-Meta
> 

^ permalink raw reply

* Re: [RFC PATCH v2 2/7] mm, swap: support zswap and zeroswap as vswap backends
From: Yosry Ahmed @ 2026-06-23  0:15 UTC (permalink / raw)
  To: Nhat Pham
  Cc: akpm, chrisl, kasong, hannes, mhocko, roman.gushchin,
	shakeel.butt, david, muchun.song, shikemeng, baoquan.he, baohua,
	youngjun.park, chengming.zhou, ljs, liam, vbabka, rppt, surenb,
	qi.zheng, axelrasmussen, yuanchu, weixugc, riel, gourry,
	haowenchao22, kernel-team, linux-mm, linux-kernel, cgroups
In-Reply-To: <20260612193738.2183968-3-nphamcs@gmail.com>

On Fri, Jun 12, 2026 at 12:37:33PM -0700, Nhat Pham wrote:
> Build the virtual swap layer on top of the swap-table infrastructure.
> Virtual swap entries decouple PTE swap entries from physical backing,
> allowing pages to be compressed by zswap (or detected as zero-filled)
> without pre-allocating a physical swap slot.
> 
> This patch only supports zswap and zero-page backends. If zswap_store
> fails, the page stays dirty in the swap cache (AOP_WRITEPAGE_ACTIVATE)
> - physical disk backing fallback comes in the next patch. Zswap
> writeback of vswap-backed entries is also disabled - the shrinker
> skips when no physical swap pages are available.
> 
> Suggested-by: Kairui Song <kasong@tencent.com>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
[..]
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 993406074d58..466f8a182716 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -38,6 +38,7 @@
>  #include <linux/zsmalloc.h>
>  
>  #include "swap.h"
> +#include "vswap.h"
>  #include "internal.h"
>  
>  /*********************************
> @@ -762,7 +763,7 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
>   * Carries out the common pattern of freeing an entry's zsmalloc allocation,
>   * freeing the entry itself, and decrementing the number of stored pages.
>   */
> -static void zswap_entry_free(struct zswap_entry *entry)
> +void zswap_entry_free(struct zswap_entry *entry)
>  {
>  	zswap_lru_del(&zswap_list_lru, entry);
>  	zs_free(entry->pool->zs_pool, entry->handle);
> @@ -994,16 +995,21 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  	struct swap_info_struct *si;
>  	int ret = 0;
>  
> +	/* try to allocate swap cache folio */
>  	si = get_swap_device(swpentry);
>  	if (!si)
>  		return -EEXIST;
>  
> +	/*
> +	 * Vswap entries have no physical backing - writeback would fail
> +	 * and SIGBUS the caller. Bail before we waste a swap-cache folio
> +	 * allocation.
> +	 */

Seems like this comment belongs in the previous patch, and the other
comment movement is undoing what last patch did.

>  	if (si->flags & SWP_VSWAP) {
>  		put_swap_device(si);
>  		return -EINVAL;
>  	}
>  
> -	/* try to allocate swap cache folio */
>  	mpol = get_task_policy(current);
>  	folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, BIT(0), NULL, mpol,
>  				       NO_INTERLEAVE_INDEX);
> @@ -1416,25 +1422,25 @@ static bool zswap_store_page(struct page *page,
>  	if (!zswap_compress(page, entry, pool))
>  		goto compress_failed;
>  
> -	old = xa_store(swap_zswap_tree(page_swpentry),
> -		       swp_offset(page_swpentry),
> -		       entry, GFP_KERNEL);
> -	if (xa_is_err(old)) {
> -		int err = xa_err(old);
> +	if (is_vswap_entry(page_swpentry)) {
> +		vswap_zswap_store(page_swpentry, entry);
> +	} else {
> +		old = xa_store(swap_zswap_tree(page_swpentry),
> +			       swp_offset(page_swpentry),
> +			       entry, GFP_KERNEL);
> +		if (xa_is_err(old)) {
> +			int err = xa_err(old);
> +
> +			WARN_ONCE(err != -ENOMEM,
> +				  "unexpected xarray error: %d\n", err);
> +			zswap_reject_alloc_fail++;
> +			goto store_failed;
> +		}
>  
> -		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> -		zswap_reject_alloc_fail++;
> -		goto store_failed;
> +		if (old)
> +			zswap_entry_free(old);
>  	}
>  
> -	/*
> -	 * We may have had an existing entry that became stale when
> -	 * the folio was redirtied and now the new version is being
> -	 * swapped out. Get rid of the old.
> -	 */
> -	if (old)
> -		zswap_entry_free(old);
> -
>  	/*
>  	 * The entry is successfully compressed and stored in the tree, there is
>  	 * no further possibility of failure. Grab refs to the pool and objcg,
> @@ -1487,6 +1493,7 @@ bool zswap_store(struct folio *folio)
>  	struct mem_cgroup *memcg = NULL;
>  	struct zswap_pool *pool;
>  	bool ret = false;
> +	bool partial_store = false;
>  	long index;
>  
>  	VM_WARN_ON_ONCE(!folio_test_locked(folio));
> @@ -1524,8 +1531,10 @@ bool zswap_store(struct folio *folio)
>  	for (index = 0; index < nr_pages; ++index) {
>  		struct page *page = folio_page(folio, index);
>  
> -		if (!zswap_store_page(page, objcg, pool))
> +		if (!zswap_store_page(page, objcg, pool)) {
> +			partial_store = index > 0;
>  			goto put_pool;
> +		}
>  	}
>  
>  	if (objcg)
> @@ -1548,7 +1557,9 @@ bool zswap_store(struct folio *folio)
>  	 * offsets corresponding to each page of the folio. Otherwise,
>  	 * writeback could overwrite the new data in the swapfile.
>  	 */
> -	if (!ret) {
> +	if (partial_store && is_vswap_entry(swp))
> +		folio_release_vswap_backing(folio);

Hmm the above should also only happen in the !ret case, but that's not
obvious from the code here. I think all of this should go under if
(!ret), but maybe reverse the polarity to avoid the indentation?

	if (ret)
		return ret;

	if (is_vswap_entry(swp)) {
		if (partial_store)
			folio_release_vswap_backing(folio);
		return ret;
	}

	...

Alternatively you can move the check_old code for xarray into a helper
and do:

	if (!ret) {
		if (is_vswap_entry(swp)) {
			if (partial_store)
				folio_release_vswap_backing(folio);
		} else {
			zswap_free_old_xa_entries(swp, nr_pages);
		}
	}

Also, I think you can probably drop partial_store and check the index
directly here.

> +	else if (!ret && !is_vswap_entry(swp)) {
>  		unsigned type = swp_type(swp);
>  		pgoff_t offset = swp_offset(swp);
>  		struct zswap_entry *entry;
> @@ -1588,8 +1599,7 @@ bool zswap_store(struct folio *folio)
>  int zswap_load(struct folio *folio)
>  {
>  	swp_entry_t swp = folio->swap;
> -	pgoff_t offset = swp_offset(swp);
> -	struct xarray *tree = swap_zswap_tree(swp);
> +	struct swap_info_struct *si = __swap_entry_to_info(swp);
>  	struct zswap_entry *entry;
>  
>  	VM_WARN_ON_ONCE(!folio_test_locked(folio));
> @@ -1599,16 +1609,25 @@ int zswap_load(struct folio *folio)
>  		return -ENOENT;
>  
>  	/*
> -	 * Large folios should not be swapped in while zswap is being used, as
> -	 * they are not properly handled. Zswap does not properly load large
> -	 * folios, and a large folio may only be partially in zswap.
> +	 * zswap_load() does not support large folios. For non-vswap
> +	 * entries this is unexpected on the swapin path: WARN and
> +	 * sigbus. For vswap entries __swap_cache_add_check() has already
> +	 * filtered out ZSWAP-backed THPs under the cluster lock, so the
> +	 * large folio here is zero- or phys-backed; return -ENOENT to
> +	 * fall through to the phys/zero IO path.

Hmm should we start simple and avoid THP swapin for vswap initially?

IIUC, it isn't really vswap specific. Even without vswap, it's possible
that an entire folio is on-disk, not in zswap, in which case THP swap
should be allowed.

I assume it's not common for zswap to be enabled and an entire THP worth
of pages are not in zswap, so maybe we can add this later?

>  	 */
> -	if (WARN_ON_ONCE(folio_test_large(folio))) {
> -		folio_unlock(folio);
> -		return -EINVAL;
> +	if (folio_test_large(folio)) {
> +		if (WARN_ON_ONCE(!swap_is_vswap(si))) {
> +			folio_unlock(folio);
> +			return -EINVAL;
> +		}
> +		return -ENOENT;
>  	}
>  
> -	entry = xa_load(tree, offset);
> +	if (swap_is_vswap(si))
> +		entry = vswap_zswap_load(swp);
> +	else
> +		entry = xa_load(swap_zswap_tree(swp), swp_offset(swp));
>  	if (!entry)
>  		return -ENOENT;
>  
> @@ -1623,16 +1642,14 @@ int zswap_load(struct folio *folio)
>  	if (entry->objcg)
>  		count_objcg_events(entry->objcg, ZSWPIN, 1);
>  
> -	/*
> -	 * We are reading into the swapcache, invalidate zswap entry.
> -	 * The swapcache is the authoritative owner of the page and
> -	 * its mappings, and the pressure that results from having two
> -	 * in-memory copies outweighs any benefits of caching the
> -	 * compression work.
> -	 */
>  	folio_mark_dirty(folio);
> -	xa_erase(tree, offset);
> -	zswap_entry_free(entry);
> +
> +	if (swap_is_vswap(si)) {
> +		folio_release_vswap_backing(folio);

Is there any advantage to calling folio_release_vswap_backing() over
zswap_entry_free()? Seems like __vswap_release_backing() ends up just
calling zswap_entry_free() -- and I don't see any vswap-specific state
being cleaned up.

I wonder if the zswap code should call zswap_entry_free() directly? Same
goes for the call in zswap_store() above.

> +	} else {
> +		xa_erase(swap_zswap_tree(swp), swp_offset(swp));
> +		zswap_entry_free(entry);
> +	}
>  
>  	folio_unlock(folio);
>  	return 0;
> -- 
> 2.53.0-Meta
> 

^ permalink raw reply

* Re: [PATCH v9 3/6] mm: memcontrol: add interface for swap tier selection
From: Yosry Ahmed @ 2026-06-22 23:46 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Youngjun Park, Shakeel Butt, akpm, chrisl, youngjun.park,
	linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, muchun.song, shikemeng, nphamcs, baoquan.he,
	baohua, gunho.lee, taejoon.song, hyungjun.cho, mkoutny, baver.bae,
	matia.kim
In-Reply-To: <20260622231948.1002174-1-joshua.hahnjy@gmail.com>

> > > If that is the case, I think auto-scaling makes sense but can be a bit
> > > tricky, since there is no universal tiered ratio; each workload will
> > > have different tiers it can swap to, so they will all have to calculate
> > > their own ratios. Tiered memory limits escapes this difficulty since we
> > > assume all memory can be placed on all tiers, so we have a system-wide
> > > ratio : -)
> > 
> > Hmm I don't follow. It's also possible (maybe not initially) that a
> > memcg cannot use specific memory tiers, right? I am not sure what the
> > difference is.
> 
> You're right, I was speaking more to the current state of memory tiers.
> The majority of the feedback I received was that we already have too
> many memcg knobs, so I just opted to make tiered memcg limits a
> cgroup mount, with no ability for individual memcgs to tune their
> limits or opt-in/out.

Right, I think this is similar to the approach taken here. We have a
single interface for per-tier limits. The main difference is that we're
allowing 0/max values to disable/enable different swap tiers per-memcg,
as there's a use case for that.

Seems like for memory tiering there's no use case for that yet.

> What do you think Yosry? Would it make sense for us to be able to 
> tune these values? Personally I think it makes sense but just wanted to
> make the basic features merged before I went to push for making those
> knobs tunable.

Right now we're not proposing to allow tuning swap tier limits either,
just enable or disable a tier. My main question is about the default
values.

IIUC, for memory tiering, if you set memory.max, then the limits for
tiers are auto-scaled. I think it makes sense to do the same for swap
tiers for consistency. Or am I wrong about the memory tiering limits
behavior?

> If we want to make the tuning the same across swap & memory we should
> probably align on the file names and how we interact with them.

Yeah I think we should make the interfaces as consistent as possible,
within reason.

^ permalink raw reply

* Re: [PATCH v9 3/6] mm: memcontrol: add interface for swap tier selection
From: Joshua Hahn @ 2026-06-22 23:19 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Youngjun Park, Shakeel Butt, akpm, chrisl, youngjun.park,
	linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, muchun.song, shikemeng, nphamcs, baoquan.he,
	baohua, gunho.lee, taejoon.song, hyungjun.cho, mkoutny, baver.bae,
	matia.kim
In-Reply-To: <CAO9r8zP6zDshSGU4chaHiPocahQZpiK5Z-eP9VKH+2_xjNM+4g@mail.gmail.com>

On Mon, 22 Jun 2026 15:26:17 -0700 Yosry Ahmed <yosry@kernel.org> wrote:

> On Mon, Jun 22, 2026 at 3:10 PM Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> >
> > On Mon, 22 Jun 2026 14:21:30 -0700 Yosry Ahmed <yosry@kernel.org> wrote:
> >
> > > On Sat, Jun 20, 2026 at 11:17 AM Youngjun Park <her0gyugyu@gmail.com> wrote:
> > > >
> > > > Introduce memory.swap.tiers.max, a flat-keyed file listing each
> > > > tier defined in /sys/kernel/mm/swap/tiers with its state, "max"
> > > > (allowed, the default) or "0" (disabled).  A tier is one bit in the
> > > > cgroup's tier mask, so writing "<tier> max" or "<tier> 0" sets or
> > > > clears that bit.
> > > >
> > > > Since the current use case lacks amount control, it only supports
> > > > "max" (on) and "0" (off). Therefore, it does not track per-tier swap
> > > > usage, relying instead on a fast runtime bitmask check.
> > > >
> > > > We maintain both `mask` and `effective_mask`. The `effective_mask` is
> > > > strictly bounded by the parent (e.g., if a parent is "0", the child's
> > > > effective state is "0" even if its `mask` is "max"). Maintaining this
> > > > separately avoids costly cgroup tree traversals to check ancestors at
> > > > runtime.
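
As a reading aid for the mask/effective_mask split described above,
here is a minimal userspace sketch. The struct and function names are
hypothetical and not taken from the patch; the only thing it relies on
is the stated rule that a child's effective state is bounded by its
parent's.

```c
#include <stdint.h>

/* Hypothetical per-cgroup tier state: one bit per swap tier. */
struct toy_tier_state {
	uint32_t mask;           /* what was written to this cgroup */
	uint32_t effective_mask; /* mask bounded by all ancestors */
};

/* Recompute a child's effective mask from its parent's. */
static void toy_update_effective(struct toy_tier_state *child,
				 const struct toy_tier_state *parent)
{
	child->effective_mask = child->mask & parent->effective_mask;
}

/* Runtime check: a single bit test, no walk up the hierarchy. */
static int toy_tier_allowed(const struct toy_tier_state *cg, unsigned int tier)
{
	return (cg->effective_mask >> tier) & 1u;
}
```

The cost of the ancestor walk moves to write time (propagating the
parent's effective mask down), so the swap-out path only tests one bit.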
> > > >
> > > > Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
> > > > Suggested-by: Yosry Ahmed <yosry@kernel.org>
> > > > Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> > > > ---
> > > >  Documentation/admin-guide/cgroup-v2.rst |  20 +++++
> > > >  Documentation/mm/swap-tier.rst          |   9 +++
> > > >  include/linux/memcontrol.h              |   5 ++
> > > >  mm/memcontrol.c                         |  67 ++++++++++++++++
> > > >  mm/swap_state.c                         |   5 +-
> > > >  mm/swap_tier.c                          | 102 +++++++++++++++++++++++-
> > > >  mm/swap_tier.h                          |  57 +++++++++++--
> > > >  7 files changed, 255 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > > > index 6efd0095ed99..4843ffcfd110 100644
> > > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > > @@ -1850,6 +1850,26 @@ The following nested keys are defined.
> > > >         Swap usage hard limit.  If a cgroup's swap usage reaches this
> > > >         limit, anonymous memory of the cgroup will not be swapped out.
> > > >
> > > > +  memory.swap.tiers.max
> > > > +       A read-write flat-keyed file which exists on non-root
> > > > +       cgroups.  The default is "max" for every tier.
> >
> > Hi Yosry,
> >
> > Sorry, I feel like I'm joining the party late. Apologies if I'm missing
> > some context or repeating a discussion that's already been had.
> > Please let me know if that is the case.
> >
> > One quick tangent:
> > I was chatting with Nhat last week about swap tiers and its relation to
> > memory tiering. Nhat brought up a good point, which is that while both
> > swap tiers and memory tiers provide a clear hierarchy of performance,
> > only memory tiering allows for movement between the tiers.
> > AFAICT, swap tiering does not allow for direct migration from a higher
> > tier swap backend to a lower tier swap backend if the higher tier
> > backend runs out of memory.
> >
> > In that sense, I'm not entirely sure if we need to enforce similar
> > semantics across swap tiering and memory tiering; it seems like there
> > are some fundamental differences anyways to how we treat these tiers.
> >
> > > I wonder what should the default behavior be if memory.swap.max is set
> > > to a value other than "max". Should the limits in
> > > memory.swap.tiers.max auto-scale or remain as "max"? We probably want
> > > to keep the behavior consistent with memory tiering.
> > >
> > > Shakeel/Joshua, WDYT?
> >
> > I think that the motivation behind these tiers is different for swap
> > and memory. Tiered memory limits are motivated by preventing one
> > workload from consuming all of a valuable resource, while swap tiers
> > seems more to do with excluding certain workloads from using performant
> > tiers and ensuring other workloads stay on those performant tiers.
> >
> > IOW memory tiers exist for fairness, but it seems like swap tiers exist
> > for workload performance tiering. But maybe there's a usecase out there
> > that would want fairness to apply in the swap tiers as well that I am
> > not seeing.
> 
> I am not sure what use cases exist, but I think it's possible we end
> up wanting to enforce fairness for swap tiers as well. Maybe not as
> aggressively as memory (e.g. to avoid wearing out SSDs), but maybe at
> least proactively through userspace?
> 
> At the end of the day, faster swap tiers are also valuable resources
> that we probably don't want a few workloads to hog. I also think the
> interfaces being consistent makes everyone's lives easier, even if
> it's a bit of an overkill for swap tiers.

I see, thank you for the explanation. That makes sense to me.

> > If that is the case, I think auto-scaling makes sense but can be a bit
> > tricky, since there is no universal tiered ratio; each workload will
> > have different tiers it can swap to, so they will all have to calculate
> > their own ratios. Tiered memory limits escapes this difficulty since we
> > assume all memory can be placed on all tiers, so we have a system-wide
> > ratio : -)
> 
> Hmm I don't follow. It's also possible (maybe not initially) that a
> memcg cannot use specific memory tiers, right? I am not sure what the
> difference is.

You're right, I was speaking more to the current state of memory tiers.
The majority of the feedback I received was that we already have too
many memcg knobs, so I just opted to make tiered memcg limits a
cgroup mount, with no ability for individual memcgs to tune their
limits or opt-in/out.

What do you think, Yosry? Would it make sense for us to be able to
tune these values? Personally I think it makes sense, but I just wanted
to get the basic features merged before pushing to make those knobs
tunable.

If we want to make the tuning the same across swap & memory we should
probably align on the file names and how we interact with them.

Thanks,
Joshua

^ permalink raw reply

* [PATCH 2/2] cgroup/cpuset: Rebind/migrate mm only for threadgroup leader in cpuset_update_tasks_nodemask()
From: Waiman Long @ 2026-06-22 22:45 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Ridong Chen,
	Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, Waiman Long
In-Reply-To: <20260622224509.1927419-1-longman@redhat.com>

As reported by sashiko [1], cpuset_update_tasks_nodemask() will do
mpol_rebind_mm() and possibly cpuset_migrate_mm() for all threads of
a multithreaded process. Since commit 3df9ca0a2b8b ("cpuset: migrate
memory only for threadgroup leaders"), cpuset_attach() has been updated
to rebind and migrate memory only for threadgroup leaders to mark the
group leader as the owner of the mm_struct.

To be consistent and avoid unnecessary performance overhead for heavily
multithreaded processes, follow the cpuset_attach() example and perform
memory rebind and migration only for threadgroup leaders.

Also add a paragraph in cgroup-v2.rst under cpuset.mems that the
threadgroup leader is the memory owner of that threadgroup. Therefore
the non-leading threads shouldn't be in other cgroups whose "cpuset.mems"
doesn't fully overlap that of the group leader.

[1] https://sashiko.dev/#/patchset/20260621032816.1806773-1-longman%40redhat.com

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 7 +++++++
 kernel/cgroup/cpuset.c                  | 4 ++++
 2 files changed, 11 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 993446ab66d0..341037c7ec9d 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2527,6 +2527,13 @@ Cpuset Interface Files
 	a need to change "cpuset.mems" with active tasks, it shouldn't
 	be done frequently.
 
+	For a multithreaded process, the threadgroup leader is
+	considered the owner of the group's memory. Memory policy
+	rebinding and migration will only happen with respect to the
+	threadgroup leader. To avoid unexpected results, non-leading
+	threads shouldn't be put into another cgroup whose "cpuset.mems"
+	doesn't fully overlap that of the threadgroup leader.
+
   cpuset.mems.effective
 	A read-only multiple values file which exists on all
 	cpuset-enabled cgroups.
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index bc0207fd6e57..27bc7a466468 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2659,6 +2659,10 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
 
 		cpuset_change_task_nodemask(task, &newmems);
 
+		/* Rebind and migrate mm only for the threadgroup leader */
+		if (task != task->group_leader)
+			continue;
+
 		mm = get_task_mm(task);
 		if (!mm)
 			continue;
-- 
2.54.0


^ permalink raw reply related
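
The leader-only filter added by the patch above can be illustrated with a
small userspace sketch. Everything here is hypothetical scaffolding: `struct
task` is a stand-in with only the one field the check needs, and
`count_mm_updates()` merely counts how many tasks would reach the
rebind/migrate step. It is not the kernel code, only a model of the
`task != task->group_leader` test.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's task_struct. */
struct task {
	int pid;
	const struct task *group_leader;
};

/*
 * Count how many tasks would reach the mm rebind/migrate step when
 * non-leader threads are skipped, mirroring the patch's
 * "if (task != task->group_leader) continue;".
 */
static size_t count_mm_updates(const struct task *tasks, size_t n)
{
	size_t updates = 0;

	assert(tasks != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		const struct task *task = &tasks[i];

		/* The per-task nodemask change would still run for every thread. */

		if (task != task->group_leader)
			continue;	/* non-leader thread: skipped */

		updates++;	/* stands in for the mm rebind/migrate work */
	}
	return updates;
}
```

With a three-thread process and a single-threaded process in the same
cpuset, only two of the four tasks would reach the mm step in this model.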

* [PATCH 1/2] cgroup/cpuset: Avoid unnecessary cpus & mems update in cpuset_hotplug_update_tasks()
From: Waiman Long @ 2026-06-22 22:45 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Ridong Chen,
	Jonathan Corbet, Shuah Khan
  Cc: cgroups, linux-kernel, linux-doc, Waiman Long

As reported by sashiko [1], cpuset_hotplug_update_tasks() may perform
unnecessary task iteration and updating of tasks' CPU and node masks
when mems_allowed and/or cpus_allowed are not set in cpuset v2. It is
due to the fact that the temporary new_cpus and new_mems masks do not
inherit parent's effective_cpus/mems when they are empty which is the
expected behavior for cpuset v2 since commit 4ec22e9c5a90 ("cpuset:
Enable cpuset controller in default hierarchy").

Fix that and avoid unnecessary work by adding the empty mask checks and
inheriting the parent's versions if empty.

[1] https://sashiko.dev/#/patchset/20260621032816.1806773-1-longman%40redhat.com

Fixes: 4ec22e9c5a90 ("cpuset: Enable cpuset controller in default hierarchy")
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index aff86acea701..bc0207fd6e57 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3925,6 +3925,14 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 	compute_effective_cpumask(&new_cpus, cs, parent);
 	nodes_and(new_mems, cs->mems_allowed, parent->effective_mems);
 
+	if (is_in_v2_mode()) {
+		/* Inherit parent's effective_cpus/mems if empty */
+		if (cpumask_empty(&new_cpus))
+			cpumask_copy(&new_cpus, parent->effective_cpus);
+		if (nodes_empty(new_mems))
+			new_mems = parent->effective_mems;
+	}
+
 	if (!tmp || !cs->partition_root_state)
 		goto update_tasks;
 
-- 
2.54.0


^ permalink raw reply related
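
The empty-mask inheritance in the patch above can be modelled in userspace
as follows. This is a sketch under simplifying assumptions: masks are plain
64-bit bitmaps, `struct cpuset_sketch` and `compute_new_masks()` are
hypothetical names, and the real compute_effective_cpumask() is reduced to
a bitwise AND with the parent's effective mask.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simplified cpuset: masks are plain 64-bit bitmaps. */
struct cpuset_sketch {
	uint64_t cpus_allowed;
	uint64_t mems_allowed;
	uint64_t effective_cpus;
	uint64_t effective_mems;
};

struct new_masks {
	uint64_t cpus;
	uint64_t mems;
};

/*
 * Compute the temporary masks, then (in v2 mode only) fall back to the
 * parent's effective masks when the result is empty.
 */
static struct new_masks compute_new_masks(const struct cpuset_sketch *cs,
					  const struct cpuset_sketch *parent,
					  int v2_mode)
{
	struct new_masks m;

	assert(cs && parent);

	/* Simplification: the kernel helper does more than a plain AND. */
	m.cpus = cs->cpus_allowed & parent->effective_cpus;
	m.mems = cs->mems_allowed & parent->effective_mems;

	if (v2_mode) {
		/* Inherit the parent's effective masks if empty. */
		if (!m.cpus)
			m.cpus = parent->effective_cpus;
		if (!m.mems)
			m.mems = parent->effective_mems;
	}
	return m;
}
```

In this model, a v2 cpuset with nothing configured ends up with the same
masks as its parent's effective masks, so a later comparison against its
current effective masks can find nothing to update.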

* Re: [PATCH v9 3/6] mm: memcontrol: add interface for swap tier selection
From: Yosry Ahmed @ 2026-06-22 22:26 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Youngjun Park, Shakeel Butt, akpm, chrisl, youngjun.park,
	linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, muchun.song, shikemeng, nphamcs, baoquan.he,
	baohua, gunho.lee, taejoon.song, hyungjun.cho, mkoutny, baver.bae,
	matia.kim
In-Reply-To: <20260622221037.255359-1-joshua.hahnjy@gmail.com>

On Mon, Jun 22, 2026 at 3:10 PM Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
>
> On Mon, 22 Jun 2026 14:21:30 -0700 Yosry Ahmed <yosry@kernel.org> wrote:
>
> > On Sat, Jun 20, 2026 at 11:17 AM Youngjun Park <her0gyugyu@gmail.com> wrote:
> > >
> > > Introduce memory.swap.tiers.max, a flat-keyed file listing each
> > > tier defined in /sys/kernel/mm/swap/tiers with its state, "max"
> > > (allowed, the default) or "0" (disabled).  A tier is one bit in the
> > > cgroup's tier mask, so writing "<tier> max" or "<tier> 0" sets or
> > > clears that bit.
> > >
> > > Since the current use case lacks amount control, it only supports
> > > "max" (on) and "0" (off). Therefore, it does not track per-tier swap
> > > usage, relying instead on a fast runtime bitmask check.
> > >
> > > We maintain both `mask` and `effective_mask`. The `effective_mask` is
> > > strictly bounded by the parent (e.g., if a parent is "0", the child's
> > > effective state is "0" even if its `mask` is "max"). Maintaining this
> > > separately avoids costly cgroup tree traversals to check ancestors at
> > > runtime.
> > >
> > > Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
> > > Suggested-by: Yosry Ahmed <yosry@kernel.org>
> > > Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> > > ---
> > >  Documentation/admin-guide/cgroup-v2.rst |  20 +++++
> > >  Documentation/mm/swap-tier.rst          |   9 +++
> > >  include/linux/memcontrol.h              |   5 ++
> > >  mm/memcontrol.c                         |  67 ++++++++++++++++
> > >  mm/swap_state.c                         |   5 +-
> > >  mm/swap_tier.c                          | 102 +++++++++++++++++++++++-
> > >  mm/swap_tier.h                          |  57 +++++++++++--
> > >  7 files changed, 255 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > > index 6efd0095ed99..4843ffcfd110 100644
> > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > @@ -1850,6 +1850,26 @@ The following nested keys are defined.
> > >         Swap usage hard limit.  If a cgroup's swap usage reaches this
> > >         limit, anonymous memory of the cgroup will not be swapped out.
> > >
> > > +  memory.swap.tiers.max
> > > +       A read-write flat-keyed file which exists on non-root
> > > +       cgroups.  The default is "max" for every tier.
>
> Hi Yosry,
>
> Sorry, I feel like I'm joining the party late. Apologies if I'm missing
> some context or repeating a discussion that's already been had.
> Please let me know if that is the case.
>
> One quick tangent:
> I was chatting with Nhat last week about swap tiers and its relation to
> memory tiering. Nhat brought up a good point, which is that while both
> swap tiers and memory tiers provide a clear hierarchy of performance,
> only memory tiering allows for movement between the tiers.
> AFAICT, swap tiering does not allow for direct migration from a higher
> tier swap backend to a lower tier swap backend if the higher tier
> backend runs out of memory.
>
> In that sense, I'm not entirely sure if we need to enforce similar
> semantics across swap tiering and memory tiering; it seems like there
> are some fundamental differences anyways to how we treat these tiers.
>
> > I wonder what should the default behavior be if memory.swap.max is set
> > to a value other than "max". Should the limits in
> > memory.swap.tiers.max auto-scale or remain as "max"? We probably want
> > to keep the behavior consistent with memory tiering.
> >
> > Shakeel/Joshua, WDYT?
>
> I think that the motivation behind these tiers is different for swap
> and memory. Tiered memory limits are motivated by preventing one
> workload from consuming all of a valuable resource, while swap tiers
> seem more to do with excluding certain workloads from using performant
> tiers and ensuring other workloads stay on those performant tiers.
>
> IOW memory tiers exist for fairness, but it seems like swap tiers exist
> for workload performance tiering. But maybe there's a usecase out there
> that would want fairness to apply in the swap tiers as well that I am
> not seeing.

I am not sure what use cases exist, but I think it's possible we end
up wanting to enforce fairness for swap tiers as well. Maybe not as
aggressively as memory (e.g. to avoid wearing out SSDs), but maybe at
least proactively through userspace?

At the end of the day, faster swap tiers are also valuable resources
that we probably don't want a few workloads to hog. I also think the
interfaces being consistent makes everyone's lives easier, even if
it's a bit of an overkill for swap tiers.

>
> If that is the case, I think auto-scaling makes sense but can be a bit
> tricky, since there is no universal tiered ratio; each workload will
> have different tiers it can swap to, so they will all have to calculate
> their own ratios. Tiered memory limits escapes this difficulty since we
> assume all memory can be placed on all tiers, so we have a system-wide
> ratio : -)

Hmm I don't follow. It's also possible (maybe not initially) that a
memcg cannot use specific memory tiers, right? I am not sure what the
difference is.

>
> Let me know what you think! Have a great day :D
> Joshua

^ permalink raw reply
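
The patch description quoted in the message above says that each tier is
one bit in a cgroup's tier mask, that writing "<tier> max" or "<tier> 0"
sets or clears that bit, and that a separate effective mask is bounded by
the parent. A minimal userspace sketch of that bookkeeping might look like
the following. The struct and function names are hypothetical and are not
taken from the patch; the tier is passed as a bit index rather than parsed
from a name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-cgroup state; the names are not from the patch. */
struct tier_state {
	uint32_t mask;			/* value written to this cgroup */
	uint32_t effective_mask;	/* mask bounded by the parent */
};

/* Apply "max" (set bit) or "0" (clear bit); -1 for any other value. */
static int tier_write(struct tier_state *st, unsigned int tier,
		      const char *val)
{
	assert(st && val && tier < 32);

	if (strcmp(val, "max") == 0)
		st->mask |= UINT32_C(1) << tier;
	else if (strcmp(val, "0") == 0)
		st->mask &= ~(UINT32_C(1) << tier);
	else
		return -1;
	return 0;
}

/* A child can never be more permissive than its parent. */
static void tier_update_effective(struct tier_state *child,
				  const struct tier_state *parent)
{
	child->effective_mask = child->mask & parent->effective_mask;
}

/* Runtime check: a single bit test, no walk up the cgroup tree. */
static int tier_allowed(const struct tier_state *st, unsigned int tier)
{
	return (st->effective_mask >> tier) & 1u;
}
```

Keeping `effective_mask` precomputed means the cost of propagating a change
is paid at write time, while the swap-out path only needs the bit test in
`tier_allowed()`.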

* Re: [PATCH v9 3/6] mm: memcontrol: add interface for swap tier selection
From: Joshua Hahn @ 2026-06-22 22:10 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Youngjun Park, Shakeel Butt, akpm, chrisl, youngjun.park,
	linux-mm, cgroups, linux-kernel, kasong, hannes, mhocko,
	roman.gushchin, muchun.song, shikemeng, nphamcs, baoquan.he,
	baohua, gunho.lee, taejoon.song, hyungjun.cho, mkoutny, baver.bae,
	matia.kim
In-Reply-To: <CAO9r8zNjyW1rh26vv2vavCM_2-r70EuynU+-7XdEmrBdLL=TkQ@mail.gmail.com>

On Mon, 22 Jun 2026 14:21:30 -0700 Yosry Ahmed <yosry@kernel.org> wrote:

> On Sat, Jun 20, 2026 at 11:17 AM Youngjun Park <her0gyugyu@gmail.com> wrote:
> >
> > Introduce memory.swap.tiers.max, a flat-keyed file listing each
> > tier defined in /sys/kernel/mm/swap/tiers with its state, "max"
> > (allowed, the default) or "0" (disabled).  A tier is one bit in the
> > cgroup's tier mask, so writing "<tier> max" or "<tier> 0" sets or
> > clears that bit.
> >
> > Since the current use case lacks amount control, it only supports
> > "max" (on) and "0" (off). Therefore, it does not track per-tier swap
> > usage, relying instead on a fast runtime bitmask check.
> >
> > We maintain both `mask` and `effective_mask`. The `effective_mask` is
> > strictly bounded by the parent (e.g., if a parent is "0", the child's
> > effective state is "0" even if its `mask` is "max"). Maintaining this
> > separately avoids costly cgroup tree traversals to check ancestors at
> > runtime.
> >
> > Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
> > Suggested-by: Yosry Ahmed <yosry@kernel.org>
> > Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> > ---
> >  Documentation/admin-guide/cgroup-v2.rst |  20 +++++
> >  Documentation/mm/swap-tier.rst          |   9 +++
> >  include/linux/memcontrol.h              |   5 ++
> >  mm/memcontrol.c                         |  67 ++++++++++++++++
> >  mm/swap_state.c                         |   5 +-
> >  mm/swap_tier.c                          | 102 +++++++++++++++++++++++-
> >  mm/swap_tier.h                          |  57 +++++++++++--
> >  7 files changed, 255 insertions(+), 10 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 6efd0095ed99..4843ffcfd110 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1850,6 +1850,26 @@ The following nested keys are defined.
> >         Swap usage hard limit.  If a cgroup's swap usage reaches this
> >         limit, anonymous memory of the cgroup will not be swapped out.
> >
> > +  memory.swap.tiers.max
> > +       A read-write flat-keyed file which exists on non-root
> > +       cgroups.  The default is "max" for every tier.

Hi Yosry,

Sorry, I feel like I'm joining the party late. Apologies if I'm missing
some context or repeating a discussion that's already been had.
Please let me know if that is the case.

One quick tangent:
I was chatting with Nhat last week about swap tiers and its relation to
memory tiering. Nhat brought up a good point, which is that while both
swap tiers and memory tiers provide a clear hierarchy of performance,
only memory tiering allows for movement between the tiers.
AFAICT, swap tiering does not allow for direct migration from a higher
tier swap backend to a lower tier swap backend if the higher tier
backend runs out of memory.

In that sense, I'm not entirely sure if we need to enforce similar
semantics across swap tiering and memory tiering; it seems like there
are some fundamental differences anyways to how we treat these tiers.

> I wonder what should the default behavior be if memory.swap.max is set
> to a value other than "max". Should the limits in
> memory.swap.tiers.max auto-scale or remain as "max"? We probably want
> to keep the behavior consistent with memory tiering.
> 
> Shakeel/Joshua, WDYT?

I think that the motivation behind these tiers is different for swap
and memory. Tiered memory limits are motivated by preventing one
workload from consuming all of a valuable resource, while swap tiers
seem more to do with excluding certain workloads from using performant
tiers and ensuring other workloads stay on those performant tiers.

IOW memory tiers exist for fairness, but it seems like swap tiers exist
for workload performance tiering. But maybe there's a usecase out there
that would want fairness to apply in the swap tiers as well that I am
not seeing.

If that is the case, I think auto-scaling makes sense but can be a bit
tricky, since there is no universal tiered ratio; each workload will
have different tiers it can swap to, so they will all have to calculate
their own ratios. Tiered memory limits escapes this difficulty since we
assume all memory can be placed on all tiers, so we have a system-wide
ratio : -)

Let me know what you think! Have a great day :D
Joshua

^ permalink raw reply
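
The point in the message above about there being no universal ratio for
swap tiers can be made concrete with a small calculation. The thread does
not define how auto-scaling would work for swap tiers, so the formula below
is purely an assumption for illustration: a tier's scaled limit is taken to
be the cgroup's overall swap limit multiplied by that tier's share of the
capacity of the tiers the cgroup is allowed to use. `NR_TIERS`,
`scaled_tier_limit()` and the capacities are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NR_TIERS 3	/* hypothetical number of swap tiers */

/*
 * Assumed formula (not from the thread):
 *   limit[tier] = swap_max * capacity[tier] / sum(capacity of allowed tiers)
 * Returns 0 for a tier the cgroup is not allowed to use.
 * No overflow handling; this is an arithmetic illustration only.
 */
static uint64_t scaled_tier_limit(uint64_t swap_max,
				  const uint64_t capacity[NR_TIERS],
				  uint32_t allowed_mask, unsigned int tier)
{
	uint64_t total = 0;

	assert(capacity && tier < NR_TIERS);

	for (unsigned int i = 0; i < NR_TIERS; i++)
		if (allowed_mask & (1u << i))
			total += capacity[i];

	if (!(allowed_mask & (1u << tier)) || total == 0)
		return 0;

	return swap_max * capacity[tier] / total;
}
```

Under this assumed formula, with capacities of 100, 300 and 600 units and a
swap limit of 1000, a cgroup allowed on all three tiers would get 100 on
tier 0, while a cgroup restricted to tiers 0 and 1 would get 250 on the
same tier. The denominator depends on each cgroup's allowed set, which is
why the ratio would have to be computed per cgroup rather than system-wide.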

