Linux cgroups development


* Re: [PATCH] cgroup: Fix a typo of the function name in comment
From: Tejun Heo @ 2026-06-24 21:13 UTC (permalink / raw)
  To: Zenghui Yu; +Cc: Johannes Weiner, Michal Koutný, cgroups, linux-kernel
In-Reply-To: <20260622110708.15593-1-zenghui.yu@linux.dev>

Hello,

Applied to cgroup/for-7.3.

Thanks.

--
tejun

^ permalink raw reply

* Re: [PATCH-next v5 0/6] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach()
From: Waiman Long @ 2026-06-24 21:00 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Peter Zijlstra, cgroups,
	linux-kernel, Aaron Tomlin, Guopeng Zhang
In-Reply-To: <ajv79c9bTlrGThdF@localhost.localdomain>


On 6/24/26 11:51 AM, Michal Koutný wrote:
> On Mon, Jun 01, 2026 at 10:31:57PM -0400, Waiman Long <longman@redhat.com> wrote:
>> Patch 6 makes the necessary changes to enable the support of multiple
>> source and destination cpusets by keeping all the source and destination
>> cpusets found during task iterations in two singly linked lists for
>> source and destination cpusets respectively.
> Thanks for looking into this!
> I've played with a coding assistant and produced the following selftest
> (it (expectedly) fails on my machine), feel free to include in the
> series (if it validates the fix).

Thanks for the provided selftest update. Will include that in the next
version.

Cheers,
Longman

>
> -- 8< --
>  From ed4e6cf91413bb4b64befb1c15412c8cfd205d73 Mon Sep 17 00:00:00 2001
> From: =?UTF-8?q?Michal=20Koutn=C3=BD?= <mkoutny@suse.com>
> Date: Wed, 24 Jun 2026 16:39:30 +0200
> Subject: [PATCH] selftests/cgroup: Add test for cpuset affinity on controller
>   disable
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> Add a new selftest that exposes a bug in cpuset_attach() where thread
> CPU affinity is not properly updated when the cpuset controller is
> disabled in a threaded cgroup hierarchy.
>
> The test creates a threaded cgroup hierarchy with two child cgroups
> (A and B) having different cpuset.cpus constraints:
> - Parent: cpuset.cpus=0-1
> - Child A: cpuset.cpus=0-1
> - Child B: cpuset.cpus=1 (restricted to CPU 1 only)
>
> A multithreaded process is created with threads placed in different
> cgroups. When the cpuset controller is disabled on the parent, thread
> affinities should be updated to match the parent's cpuset.
>
> Expected behavior:
> - thread_a affinity: {0-1} before and after (unchanged)
> - thread_b affinity: {1} before, {0-1} after (expanded)
>
> Current buggy behavior:
> - thread_b affinity remains {1} after controller disable
>
> Assisted-by: Claude:claude-sonnet-4-5
> Signed-off-by: Michal Koutný <mkoutny@suse.com>
> ---
>   tools/testing/selftests/cgroup/test_cpuset.c | 243 +++++++++++++++++++
>   1 file changed, 243 insertions(+)
>
> diff --git a/tools/testing/selftests/cgroup/test_cpuset.c b/tools/testing/selftests/cgroup/test_cpuset.c
> index c5cf8b56ceb8f..1d72a199ca552 100644
> --- a/tools/testing/selftests/cgroup/test_cpuset.c
> +++ b/tools/testing/selftests/cgroup/test_cpuset.c
> @@ -1,7 +1,13 @@
>   // SPDX-License-Identifier: GPL-2.0
>   
> +#define _GNU_SOURCE
> +#include <assert.h>
>   #include <linux/limits.h>
> +#include <pthread.h>
> +#include <sched.h>
>   #include <signal.h>
> +#include <sys/syscall.h>
> +#include <unistd.h>
>   
>   #include "kselftest.h"
>   #include "cgroup_util.h"
> @@ -232,6 +238,242 @@ static int test_cpuset_perms_subtree(const char *root)
>   	return ret;
>   }
>   
> +static int get_cpu_affinity(cpu_set_t *mask)
> +{
> +	CPU_ZERO(mask);
> +	return sched_getaffinity(0, sizeof(*mask), mask);
> +}
> +
> +static int cpu_set_equal(cpu_set_t *dst, unsigned long mask)
> +{
> +	cpu_set_t expected;
> +
> +	CPU_ZERO(&expected);
> +	assert(sizeof(mask) * 8 <= CPU_SETSIZE);
> +
> +	for (int cpu = 0; cpu < sizeof(mask) * 8; ++cpu)
> +		if ((1UL << cpu) & mask)
> +			CPU_SET(cpu, &expected);
> +
> +	return CPU_EQUAL(&expected, dst);
> +}
> +
> +enum test_phase {
> +	AFFINITY_SETUP,
> +	AFFINITY_THREAD_A_READY,
> +	AFFINITY_THREADS_READY,
> +	AFFINITY_CONTROLLER_DISABLED,
> +	AFFINITY_COMPLETE,
> +	AFFINITY_ERROR
> +};
> +
> +struct thread_args {
> +	const char *cgroup;
> +	cpu_set_t *affinity_before;
> +	cpu_set_t *affinity_after;
> +	enum test_phase ready_phase;
> +};
> +
> +static pthread_mutex_t test_mutex = PTHREAD_MUTEX_INITIALIZER;
> +static pthread_cond_t test_cond = PTHREAD_COND_INITIALIZER;
> +static enum test_phase test_phase;
> +
> +static void *affinity_thread_fn(void *arg)
> +{
> +	struct thread_args *args = (struct thread_args *)arg;
> +
> +	if (cg_enter_current_thread(args->cgroup))
> +		goto fail;
> +
> +	if (get_cpu_affinity(args->affinity_before) != 0)
> +		goto fail;
> +
> +	pthread_mutex_lock(&test_mutex);
> +	if (test_phase < args->ready_phase)
> +		test_phase = args->ready_phase;
> +	pthread_cond_broadcast(&test_cond);
> +
> +	while (test_phase < AFFINITY_CONTROLLER_DISABLED)
> +		pthread_cond_wait(&test_cond, &test_mutex);
> +	pthread_mutex_unlock(&test_mutex);
> +
> +	if (get_cpu_affinity(args->affinity_after) != 0)
> +		goto fail;
> +
> +
> +	return NULL;
> +
> +fail:
> +	pthread_mutex_lock(&test_mutex);
> +	test_phase = AFFINITY_ERROR;
> +	pthread_cond_broadcast(&test_cond);
> +	pthread_mutex_unlock(&test_mutex);
> +	return NULL;
> +}
> +
> +/*
> + * Test that disabling cpuset controller properly updates thread affinity.
> + *
> + * This test exposes a bug in cpuset_attach() where threads in child cgroups
> + * don't get their affinity updated when the cpuset controller is disabled.
> + *
> + * Setup:
> + * - Create parent cgroup with cpuset.cpus=0-1
> + * - Create child A with cpuset.cpus=0-1
> + * - Create child B with cpuset.cpus=1
> + * - Place multithreaded process: group leader + thread_a in A, thread_b in B
> + * - Disable cpuset controller on parent
> + *
> + * Expected: thread_b's affinity should expand from {1} to {0-1}
> + * Buggy: thread_b's affinity remains {1}
> + */
> +static int test_cpuset_affinity_on_controller_disable(const char *root)
> +{
> +	char *parent = NULL, *child_a = NULL, *child_b = NULL;
> +	pthread_t thread_a, thread_b;
> +	int thread_a_created = 0, thread_b_created = 0;
> +	cpu_set_t affinity_a_before, affinity_a_after;
> +	cpu_set_t affinity_b_before, affinity_b_after;
> +	int ret = KSFT_FAIL;
> +
> +	parent = cg_name(root, "cpuset_affinity_test");
> +	if (!parent)
> +		goto cleanup;
> +	if (cg_create(parent))
> +		goto cleanup;
> +	if (cg_write(parent, "cgroup.type", "threaded"))
> +		goto cleanup;
> +
> +	child_a = cg_name(parent, "A");
> +	if (!child_a)
> +		goto cleanup;
> +	if (cg_create(child_a))
> +		goto cleanup;
> +	if (cg_write(child_a, "cgroup.type", "threaded"))
> +		goto cleanup;
> +
> +	child_b = cg_name(parent, "B");
> +	if (!child_b)
> +		goto cleanup;
> +	if (cg_create(child_b))
> +		goto cleanup;
> +	if (cg_write(child_b, "cgroup.type", "threaded"))
> +		goto cleanup;
> +
> +	/* Now enable cpuset controller in parent */
> +	if (cg_write(parent, "cgroup.subtree_control", "+cpuset")) {
> +		ret = KSFT_SKIP;
> +		goto cleanup;
> +	}
> +
> +	/* Set CPU affinity constraints */
> +	if (cg_write(parent, "cpuset.cpus", "0-1"))
> +		goto cleanup;
> +	if (cg_write(child_a, "cpuset.cpus", "0-1"))
> +		goto cleanup;
> +	if (cg_write(child_b, "cpuset.cpus", "1"))
> +		goto cleanup;
> +
> +	/* Move group leader (main thread) to child A */
> +	if (cg_enter_current(child_a))
> +		goto cleanup;
> +
> +	/* Create threads - they will move themselves to their respective cgroups */
> +	test_phase = AFFINITY_SETUP;
> +
> +	struct thread_args args_a = {
> +		.cgroup = child_a,
> +		.affinity_before = &affinity_a_before,
> +		.affinity_after = &affinity_a_after,
> +		.ready_phase = AFFINITY_THREAD_A_READY,
> +	};
> +	if (pthread_create(&thread_a, NULL, affinity_thread_fn, &args_a))
> +		goto cleanup;
> +	thread_a_created = 1;
> +
> +	struct thread_args args_b = {
> +		.cgroup = child_b,
> +		.affinity_before = &affinity_b_before,
> +		.affinity_after = &affinity_b_after,
> +		.ready_phase = AFFINITY_THREADS_READY,
> +	};
> +	if (pthread_create(&thread_b, NULL, affinity_thread_fn, &args_b))
> +		goto cleanup_threads;
> +	thread_b_created = 1;
> +
> +	pthread_mutex_lock(&test_mutex);
> +	while (test_phase < AFFINITY_THREADS_READY)
> +		pthread_cond_wait(&test_cond, &test_mutex);
> +
> +	/* If a thread failed during setup, bail out */
> +	if (test_phase == AFFINITY_ERROR) {
> +		pthread_mutex_unlock(&test_mutex);
> +		goto cleanup_threads;
> +	}
> +	pthread_mutex_unlock(&test_mutex);
> +
> +	if (!cpu_set_equal(&affinity_a_before, 0x3)) {
> +		ksft_print_msg("FAIL: thread_a initial affinity incorrect\n");
> +		goto cleanup_threads;
> +	}
> +
> +	if (!cpu_set_equal(&affinity_b_before, 0x2)) {
> +		ksft_print_msg("FAIL: thread_b initial affinity incorrect\n");
> +		goto cleanup_threads;
> +	}
> +
> +	/* Disable cpuset controller - this should trigger affinity update */
> +	if (cg_write(parent, "cgroup.subtree_control", "-cpuset"))
> +		goto cleanup_threads;
> +
> +	/* Signal threads to save their final affinity and exit */
> +	pthread_mutex_lock(&test_mutex);
> +	test_phase = AFFINITY_CONTROLLER_DISABLED;
> +	pthread_cond_broadcast(&test_cond);
> +	pthread_mutex_unlock(&test_mutex);
> +
> +	pthread_join(thread_a, NULL);
> +	pthread_join(thread_b, NULL);
> +
> +	/* Verify thread affinities AFTER disabling controller */
> +	if (!cpu_set_equal(&affinity_a_after, 0x3)) {
> +		ksft_print_msg("FAIL: thread_a final affinity incorrect\n");
> +		goto cleanup;
> +	}
> +
> +	if (!cpu_set_equal(&affinity_b_after, 0x3)) {
> +		ksft_print_msg("FAIL: thread_b affinity did not expand to {0-1}\n");
> +		goto cleanup;
> +	}
> +
> +	ret = KSFT_PASS;
> +	goto cleanup;
> +
> +cleanup_threads:
> +	pthread_mutex_lock(&test_mutex);
> +	test_phase = AFFINITY_COMPLETE;
> +	pthread_cond_broadcast(&test_cond);
> +	pthread_mutex_unlock(&test_mutex);
> +
> +	if (thread_a_created)
> +		pthread_join(thread_a, NULL);
> +	if (thread_b_created)
> +		pthread_join(thread_b, NULL);
> +
> +cleanup:
> +	/* Move back to root before cleanup */
> +	cg_enter_current(root);
> +
> +	cg_destroy(child_b);
> +	free(child_b);
> +	cg_destroy(child_a);
> +	free(child_a);
> +	cg_destroy(parent);
> +	free(parent);
> +
> +	return ret;
> +}
> +
>   
>   #define T(x) { x, #x }
>   struct cpuset_test {
> @@ -241,6 +483,7 @@ struct cpuset_test {
>   	T(test_cpuset_perms_object_allow),
>   	T(test_cpuset_perms_object_deny),
>   	T(test_cpuset_perms_subtree),
> +	T(test_cpuset_affinity_on_controller_disable),
>   };
>   #undef T
>   


^ permalink raw reply

* Re: [PATCH v2 0/2] cgroup/cpuset: Miscellaneous fixes and cleanups
From: Waiman Long @ 2026-06-24 20:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Johannes Weiner, Michal Koutný, Ridong Chen, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest
In-Reply-To: <038bfbbc34714676b7a672b7f748aee4@kernel.org>


On 6/24/26 3:47 PM, Tejun Heo wrote:
>> Waiman Long (2):
>>    cgroup/cpuset: Avoid unnecessary cpus & mems update in
>>      cpuset_hotplug_update_tasks()
>>    cgroup/cpuset: Rebind/migrate mm only for threadgroup leader in
>>      cpuset_update_tasks_nodemask()
> Applied 1-2 to cgroup/for-7.3. I folded in a few minor fixups: a
> changelog typo, the compute_effective_nodemask() kerneldoc parameter
> name (new_cpus to new_mems), and the comment and doc grammar nits Manuel
> noted. Also added Ridong's Reviewed-by to patch 1.

Thanks for the fixups.

Cheers,
Longman


^ permalink raw reply

* Re: [PATCH] cgroup: Use READ_ONCE() for task->flags in task_css_set_check()
From: Tejun Heo @ 2026-06-24 20:26 UTC (permalink / raw)
  To: Guopeng Zhang
  Cc: Johannes Weiner, Michal Koutný, cgroups, linux-kernel,
	Guopeng Zhang, Tao Cui
In-Reply-To: <20260623022946.525885-1-guopeng.zhang@linux.dev>

On Tue, Jun 23, 2026 at 10:29:46AM +0800, Guopeng Zhang wrote:
> -		((task)->flags & PF_EXITING) || (__c))
> +		(READ_ONCE((task)->flags) & PF_EXITING) || (__c))

This only feeds the CONFIG_PROVE_RCU lockdep predicate, so it's a
diagnostic-only read. tools/memory-model/Documentation/access-marking.txt
recommends data_race() over READ_ONCE() for those:

	(data_race((task)->flags) & PF_EXITING) || (__c))

Please update the changelog to match.

Thanks.

^ permalink raw reply

* Re: [PATCH RFC 0/4] memcg,slab: kmalloc_nolock() fixes
From: Harry Yoo @ 2026-06-24 20:19 UTC (permalink / raw)
  To: Alexei Starovoitov, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Vlastimil Babka, Hao Li,
	Christoph Lameter, David Rientjes, Alexei Starovoitov,
	Pedro Falcato
  Cc: cgroups, linux-mm, linux-kernel, bpf
In-Reply-To: <DJHF7S039QNX.KNVMFISSMLMU@gmail.com>

On 6/25/26 1:30 AM, Alexei Starovoitov wrote:
> On Wed Jun 24, 2026 at 6:11 AM PDT, Harry Yoo (Oracle) wrote:
>>
>> Bug 1 was reported by lockdep, and bugs 2 [2] and 3 [3] were
>> reported by Sashiko.
> 
> ... and in fixes for sashiko's complaints, sashiko finds more issues.
> I don't think it will ever end. I suggest fixing realistic scenarios
> instead of one-in-a-billion cases that sashiko thinks are plausible
> but that will never be hit in reality.

But we can trigger debug warnings for the first two bugs fairly
easily with slub_kunit. Don't those count as realistic scenarios?

(Ok, I admit that the last bug was purely theoretical, and I would not
 have bothered if the fix were not straightforward)

You might argue that it's not as urgent as we might assume
(e.g., it's okay to not fix them asap or backport), but I don't think
we can just ignore them.

It might be a bit harder to cause an actual deadlock than to
trigger a debug warning, though. We can discuss that [1] [2].

> The chance of a server crashing
> due to cosmic rays is higher than that of such bugs.

I'm not convinced that it's the case.

Well, I don't know what the chances are of calling kmalloc_nolock()
in NMI, or within slab or memcg (via tracing), and that is an important
factor here.

>> To BPF folks: do we need to backport kmalloc_nolock() support
>> for architectures without __CMPXCHG_DOUBLE to v6.18?
> 
> nope.

Thanks, that was what I was hoping :)

# The discussion

[1] Bug 1: freeing a slab object via kfree_nolock() or draining
the stock in kmalloc_nolock() happens very frequently. The objcg should
have been reparented (which happens upon cgroup removal, which is not
too rare) at some point if the objcg stock or a slab object is holding
the last reference.

Can this cause an actual deadlock? That depends on the chances of
calling kmalloc/kfree_nolock() in the middle of reparenting (see
reparent_[un]locks()) or objcg list manipulation under objcg_lock.

[2] Bug 2: You have to exceed the memcg limit to invoke
memcg_alloc_abort_single(), but you don't even have to be under
memory pressure to exceed it. (Yeah, I had to modify the
kernel to implement a fault-injection-like feature to trigger this.)
Unfortunately, you cannot reclaim memory in an unknown context when you
hit the limit. This should be fairly easy to trigger.

Can this cause an actual deadlock? That depends on the chances
of calling kmalloc/kfree_nolock() within the slab allocator.

-- 
Cheers,
Harry / Hyeonggon

^ permalink raw reply

* Re: [PATCH] Docs/admin-guide/cgroup-v2: fix memory.stat doc details
From: Tejun Heo @ 2026-06-24 20:07 UTC (permalink / raw)
  To: Doehyun Baek
  Cc: Jonathan Corbet, Johannes Weiner, Michal Koutný,
	Andrew Morton, Shakeel Butt, Roman Gushchin, Yosry Ahmed,
	Nhat Pham, cgroups, linux-doc, linux-kernel
In-Reply-To: <20260620122751.388770-1-doehyunbaek@gmail.com>

Applied to cgroup/for-7.2-fixes.

Thanks.

--
tejun

^ permalink raw reply

* Re: [PATCH v2 0/2] cgroup/cpuset: Miscellaneous fixes and cleanups
From: Tejun Heo @ 2026-06-24 19:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Johannes Weiner, Michal Koutný, Ridong Chen, Jonathan Corbet,
	Shuah Khan, cgroups, linux-kernel, linux-doc, linux-kselftest
In-Reply-To: <20260623230413.1984188-1-longman@redhat.com>

> Waiman Long (2):
>   cgroup/cpuset: Avoid unnecessary cpus & mems update in
>     cpuset_hotplug_update_tasks()
>   cgroup/cpuset: Rebind/migrate mm only for threadgroup leader in
>     cpuset_update_tasks_nodemask()

Applied 1-2 to cgroup/for-7.3. I folded in a few minor fixups: a
changelog typo, the compute_effective_nodemask() kerneldoc parameter
name (new_cpus to new_mems), and the comment and doc grammar nits Manuel
noted. Also added Ridong's Reviewed-by to patch 1.

Thanks.

--
tejun

^ permalink raw reply

* Re: [PATCH] tools/cgroup: iocost_monitor: parse help before importing drgn
From: Tejun Heo @ 2026-06-24 18:59 UTC (permalink / raw)
  To: Yousef Alhouseen; +Cc: josef, axboe, cgroups, linux-block, linux-kernel
In-Reply-To: <20260624123652.8108-1-alhouseenyousef@gmail.com>

On Wed, Jun 24, 2026 at 02:36:52PM +0200, Yousef Alhouseen wrote:
> iocost_monitor.py imports drgn before argparse can handle "-h" or report
> argument errors. That makes basic usage help fail on systems where drgn is
> not installed.
> 
> Parse arguments before importing drgn so the help and argument-error paths
> work without the runtime debugging dependency. Normal execution still
> imports drgn before reading kernel state.
> 
> Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>

Applied to cgroup/for-7.3.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH] mm/memcontrol: remove unused for_each_mem_cgroup macro and cleanup
From: Shakeel Butt @ 2026-06-24 18:57 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: linux-mm, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Andrew Morton, cgroups, linux-kernel, kernel-team
In-Reply-To: <20260624183700.1152742-1-joshua.hahnjy@gmail.com>

On Wed, Jun 24, 2026 at 11:36:59AM -0700, Joshua Hahn wrote:
> Commit 7e1c0d6f58207 ("memcg: switch lruvec stats to rstat") removed the
> last caller of for_each_mem_cgroup back in 2021, and there have not been
> any new callers since. Remove the macro.
> 
> A comment in mem_cgroup_css_online has also been out of date since 2021,
> when 2bfd36374edd9 ("mm: vmscan: consolidate shrinker_maps handling
> code") open-coded the for_each_mem_cgroup iterator. Update the comment.
> 
> Finally, 99430ab8b804c ("mm: introduce BPF kfuncs to access memcg
> statistics and events") added a second declaration for memcg_events to
> include/linux/memcontrol.h, duplicating the one in mm/memcontrol-v1.h.
> Let's clean that up too.
> 
> No functional changes intended.
> 
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>

Thanks for the cleanup.

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>

^ permalink raw reply

* Re: [PATCH 1/2] cgroup/dmem: add per-region event counters
From: Tejun Heo @ 2026-06-24 18:52 UTC (permalink / raw)
  To: Hongfu Li
  Cc: hannes, mkoutny, corbet, skhan, dev, mripard, natalie.vock,
	cgroups, linux-doc, linux-kernel, dri-devel
In-Reply-To: <20260624031107.667253-2-lihongfu@kylinos.cn>

On Wed, Jun 24, 2026 at 11:11:06AM +0800, Hongfu Li wrote:
> Add dmem.events to report hierarchical low/max event counts per DMEM
> region.  Increment counters on dmem.max allocation failures and
> dmem.low protection events.  The file is available for non-root cgroups
> only.

Please don't double space in descs or comments. Also, maybe it's obvious but
it'd help if you list why and how this is useful. Why do we want to add
this?

> +  dmem.events
> +	A read-only file that reports the number of times each cgroup
> +	has hit its configured memory limits.  The format lists each
> +	region on a single line, followed by the event counters::
> +
> +	  drm/0000:03:00.0/vram0 low 0 max 3
> +	  drm/0000:03:00.0/stolen low 0 max 0

This isn't a supported file format. Please read the documentation on allowed
formats.

Thanks.

-- 
tejun

^ permalink raw reply

* [PATCH] mm/memcontrol: remove unused for_each_mem_cgroup macro and cleanup
From: Joshua Hahn @ 2026-06-24 18:36 UTC (permalink / raw)
  To: linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, cgroups, linux-kernel, kernel-team

Commit 7e1c0d6f58207 ("memcg: switch lruvec stats to rstat") removed the
last caller of for_each_mem_cgroup back in 2021, and there have not been
any new callers since. Remove the macro.

A comment in mem_cgroup_css_online has also been out of date since 2021,
when 2bfd36374edd9 ("mm: vmscan: consolidate shrinker_maps handling
code") open-coded the for_each_mem_cgroup iterator. Update the comment.

Finally, 99430ab8b804c ("mm: introduce BPF kfuncs to access memcg
statistics and events") added a second declaration for memcg_events to
include/linux/memcontrol.h, duplicating the one in mm/memcontrol-v1.h.
Let's clean that up too.

No functional changes intended.

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
This is intended for the next release cycle. Thank you!

 mm/memcontrol-v1.h | 6 ------
 mm/memcontrol.c    | 2 +-
 2 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index f92f81108d5ed..d3ed5b93290fb 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -17,14 +17,8 @@
 	     iter != NULL;				\
 	     iter = mem_cgroup_iter(root, iter, NULL))
 
-#define for_each_mem_cgroup(iter)			\
-	for (iter = mem_cgroup_iter(NULL, NULL, NULL);	\
-	     iter != NULL;				\
-	     iter = mem_cgroup_iter(NULL, iter, NULL))
-
 void drain_all_stock(struct mem_cgroup *root_memcg);
 
-unsigned long memcg_events(struct mem_cgroup *memcg, int event);
 int memory_stat_show(struct seq_file *m, void *v);
 
 struct mem_cgroup *mem_cgroup_private_id_get_online(struct mem_cgroup *memcg,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 56cd4af082326..e171fe36b0711 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4216,7 +4216,7 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	/*
 	 * A memcg must be visible for expand_shrinker_info()
 	 * by the time the maps are allocated. So, we allocate maps
-	 * here, when for_each_mem_cgroup() can't skip it.
+	 * here, when mem_cgroup_iter() can't skip it.
 	 */
 	if (alloc_shrinker_info(memcg))
 		goto offline_kmem;
-- 
2.53.0-Meta


^ permalink raw reply related

* Re: [PATCH v4 4/5] mm/memcontrol: convert memcg to use page_counter_stock
From: Joshua Hahn @ 2026-06-24 18:24 UTC (permalink / raw)
  To: Usama Arif
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, cgroups, linux-mm, linux-kernel, kernel-team
In-Reply-To: <120367a5-0a3c-40ba-a821-f46f8494ef85@linux.dev>

On Wed, 24 Jun 2026 17:43:56 +0100 Usama Arif <usama.arif@linux.dev> wrote:

> 
> 
> On 24/06/2026 16:23, Joshua Hahn wrote:
> > On Wed, 24 Jun 2026 07:43:47 -0700 Usama Arif <usama.arif@linux.dev> wrote:
> > 
> >> On Tue, 23 Jun 2026 11:01:22 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> > 
> > Hello Usama!!
> > 
> > Thank you for reviewing the patch : -)
> > 
> > [...snip...]
> > 
> >>> @@ -2595,7 +2596,6 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
> >>>  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >>>  			    unsigned int nr_pages)
> >>>  {
> >>> -	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
> >>>  	int nr_retries = MAX_RECLAIM_RETRIES;
> >>>  	struct mem_cgroup *mem_over_limit;
> >>>  	struct page_counter *counter;
> >>> @@ -2606,36 +2606,30 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >>>  	bool raised_max_event = false;
> >>>  	unsigned long pflags;
> >>>  	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
> >>> +	unsigned long nr_charged = 0;
> >>>  
> >>>  retry:
> >>> -	if (consume_stock(memcg, nr_pages))
> >>> -		return 0;
> >>> -
> >>> -	if (!allow_spinning)
> >>> -		/* Avoid the refill and flush of the older stock */
> >>> -		batch = nr_pages;
> >>> -
> >>>  	reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
> >>>  	if (do_memsw_account() &&
> >>> -	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
> >>> +	    !page_counter_try_charge_stock(&memcg->memsw, nr_pages,
> >>> +					   &counter, NULL)) {
> >>>  		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
> >>>  		reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP;
> >>>  		goto reclaim;
> >>>  	}
> >>>  
> >>> -	if (page_counter_try_charge(&memcg->memory, batch, &counter))
> >>> -		goto done_restock;
> >>> +	if (page_counter_try_charge_stock(&memcg->memory, nr_pages,
> >>> +					  &counter, &nr_charged)) {
> >>> +		if (!nr_charged)
> >>> +			return 0;
> >>> +		goto handle_high;
> >>> +	}
> >>>  
> >>>  	if (do_memsw_account())
> >>> -		page_counter_uncharge(&memcg->memsw, batch);
> >>> +		page_counter_uncharge(&memcg->memsw, nr_pages);
> >>
> >> This needs a transactional rollback. page_counter_try_charge_stock() can
> >> succeed by consuming memsw stock and charging 0 new pages, but the
> >> memory-failure path unconditionally uncharges nr_pages from memsw.
> >> That turns a failed allocation into a real memsw usage decrement.
> > 
> > Hmmmmmmmmmm....... I'm not sure.
> > 
> > At this point in the code, we are either (1) using cgroup v1 with memsw
> > and charged successfully, or (2) not using cgroup v1 with memsw. So I'm
> > not sure if this really is unconditional, we're just distinguishing
> > between cases (1) and (2) by checking if we're using cgroupv1.
> > 
> > Or is your concern with taking a charge via stock, but uncharging with
> > a hierarchical page_counter walk?
> 
> This was my concern. But I re-read the page_counter stock invariant,
> and the stock-hit case is not an undercount? Consuming stock transfers
> already-charged credit to the pending allocation; if the later memory charge
> fails, page_counter_uncharge() discards that consumed credit from the
> hierarchy. That should keep usage equal to real charges plus remaining stock?

Yes, stock-hit case just does some math without doing any actual
charging. It's stuff that was pre-charged before, so we're not doing
any undercounting or leaking any charges.

What do you mean by "consumed credit"? From what I can see
page_counter_uncharge --> page_counter_cancel subtracts from
counter->usage, which should be the real charge + hierarchy walk.

Am I missing something :p please feel free to let me know!
Joshua

^ permalink raw reply

* Re: [PATCH v4 4/5] mm/memcontrol: convert memcg to use page_counter_stock
From: Usama Arif @ 2026-06-24 16:43 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, cgroups, linux-mm, linux-kernel, kernel-team
In-Reply-To: <20260624152331.2228828-1-joshua.hahnjy@gmail.com>



On 24/06/2026 16:23, Joshua Hahn wrote:
> On Wed, 24 Jun 2026 07:43:47 -0700 Usama Arif <usama.arif@linux.dev> wrote:
> 
>> On Tue, 23 Jun 2026 11:01:22 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> 
> Hello Usama!!
> 
> Thank you for reviewing the patch : -)
> 
> [...snip...]
> 
>>> @@ -2595,7 +2596,6 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
>>>  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>>  			    unsigned int nr_pages)
>>>  {
>>> -	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
>>>  	int nr_retries = MAX_RECLAIM_RETRIES;
>>>  	struct mem_cgroup *mem_over_limit;
>>>  	struct page_counter *counter;
>>> @@ -2606,36 +2606,30 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>>  	bool raised_max_event = false;
>>>  	unsigned long pflags;
>>>  	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
>>> +	unsigned long nr_charged = 0;
>>>  
>>>  retry:
>>> -	if (consume_stock(memcg, nr_pages))
>>> -		return 0;
>>> -
>>> -	if (!allow_spinning)
>>> -		/* Avoid the refill and flush of the older stock */
>>> -		batch = nr_pages;
>>> -
>>>  	reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
>>>  	if (do_memsw_account() &&
>>> -	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
>>> +	    !page_counter_try_charge_stock(&memcg->memsw, nr_pages,
>>> +					   &counter, NULL)) {
>>>  		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
>>>  		reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP;
>>>  		goto reclaim;
>>>  	}
>>>  
>>> -	if (page_counter_try_charge(&memcg->memory, batch, &counter))
>>> -		goto done_restock;
>>> +	if (page_counter_try_charge_stock(&memcg->memory, nr_pages,
>>> +					  &counter, &nr_charged)) {
>>> +		if (!nr_charged)
>>> +			return 0;
>>> +		goto handle_high;
>>> +	}
>>>  
>>>  	if (do_memsw_account())
>>> -		page_counter_uncharge(&memcg->memsw, batch);
>>> +		page_counter_uncharge(&memcg->memsw, nr_pages);
>>
>> This needs a transactional rollback. page_counter_try_charge_stock() can
>> succeed by consuming memsw stock and charging 0 new pages, but the
>> memory-failure path unconditionally uncharges nr_pages from memsw.
>> That turns a failed allocation into a real memsw usage decrement.
> 
> Hmmmmmmmmmm....... I'm not sure.
> 
> At this point in the code, we are either (1) using cgroup v1 with memsw
> and charged successfully, or (2) not using cgroup v1 with memsw. So I'm
> not sure if this really is unconditional, we're just distinguishing
> between cases (1) and (2) by checking if we're using cgroupv1.
> 
> Or is your concern with taking a charge via stock, but uncharging with
> a hierarchical page_counter walk?

This was my concern. But I re-read the page_counter stock invariant,
and the stock-hit case is not an undercount? Consuming stock transfers
already-charged credit to the pending allocation; if the later memory charge
fails, page_counter_uncharge() discards that consumed credit from the
hierarchy. That should keep usage equal to real charges plus remaining stock?

> If so, I think there's a case to be
> made here with just simply returning the stock. I just wanted to keep
> it consistent with the original memcontrol code, which only used
> stock to fulfill charges, not uncharges, since this could make the
> stock grow without bound.
> 
> What do you think? Thanks again for reviewing Usama, I hope you have a
> great day!!!
> Joshua


^ permalink raw reply

* Re: [PATCH RFC 0/4] memcg,slab: kmalloc_nolock() fixes
From: Alexei Starovoitov @ 2026-06-24 16:30 UTC (permalink / raw)
  To: Harry Yoo (Oracle), Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Vlastimil Babka, Hao Li,
	Christoph Lameter, David Rientjes, Alexei Starovoitov,
	Pedro Falcato
  Cc: cgroups, linux-mm, linux-kernel, bpf
In-Reply-To: <20260624-kmalloc-nolock-fixes-v1-0-fdf4d17351dd@kernel.org>

On Wed Jun 24, 2026 at 6:11 AM PDT, Harry Yoo (Oracle) wrote:
>
> Bug 1 was reported by lockdep, and bugs 2 [2] and 3 [3] were
> reported by Sashiko.

... and in fixes for sashiko's complaints, sashiko finds more issues.
I don't think it will ever end. I suggest fixing realistic scenarios
instead of one-in-a-billion cases that sashiko thinks are plausible
but that will never be hit in reality. The chance of a server crashing
due to cosmic rays is higher than that of such bugs. Hence do not fix them.

> To BPF folks: do we need to backport kmalloc_nolock() support
> for architectures without __CMPXCHG_DOUBLE to v6.18?

nope.

> There are still few users in v6.18, but I can't tell whether it is
> necessary to backport it to v6.18 (hopefully not as urgent as other
> bugfixes).

imo none of these 'fixes' are necessary. Humans are not hitting them.

^ permalink raw reply

* [PATCH v3] selftests/cgroup: Adjust cpu test duration based on HZ
From: Joe Simmons-Talbott @ 2026-06-24 16:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Shuah Khan
  Cc: Joe Simmons-Talbott, cgroups, linux-kselftest, linux-kernel

For lower HZ values a quota of 1000us is much lower than the number
of microseconds per tick, which makes the tests test_cpucg_max and
test_cpucg_max_nested fail. Increase the test duration to accommodate
lower HZ values.

Link: https://lore.kernel.org/lkml/20260623194239.GA899029@oak/
Signed-off-by: Joe Simmons-Talbott <joest@redhat.com>
---
v2 -> v3:
- Instead of changing cpu.max quota extend the test duration based on
  the HZ value.
- Don't call pclose() if popen() fails.
- Check return value of fscanf().

v1 -> v2:
- Try checking /proc/config.gz to get the actual kernel HZ value and
  fall back to 1000 if the value cannot be determined.
 tools/testing/selftests/cgroup/test_cpu.c | 44 ++++++++++++++++++++---
 1 file changed, 40 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/cgroup/test_cpu.c b/tools/testing/selftests/cgroup/test_cpu.c
index 7a40d76b9548..feb7eb6a875c 100644
--- a/tools/testing/selftests/cgroup/test_cpu.c
+++ b/tools/testing/selftests/cgroup/test_cpu.c
@@ -639,6 +639,30 @@ test_cpucg_nested_weight_underprovisioned(const char *root)
 	return run_cpucg_nested_weight_test(root, false);
 }
 
+/*
+ * Best effort attempt to get the kernel's HZ value from the config.
+ * Return the HZ value if found otherwise return -1 to indicate failure.
+ */
+static long
+_get_config_hz(void)
+{
+	long hz = -1;
+	FILE *f;
+	char cmd[256] = "zcat /proc/config.gz 2>/dev/null | grep '^CONFIG_HZ='";
+
+	f = popen(cmd, "r");
+
+	if (!f)
+		return hz;
+
+	if (fscanf(f, "CONFIG_HZ=%ld", &hz) == EOF)
+		goto out;
+
+out:
+	pclose(f);
+	return hz;
+}
+
 /*
  * This test creates a cgroup with some maximum value within a period, and
  * verifies that a process in the cgroup is not overscheduled.
@@ -646,15 +670,21 @@ test_cpucg_nested_weight_underprovisioned(const char *root)
 static int test_cpucg_max(const char *root)
 {
 	int ret = KSFT_FAIL;
+	long hz = _get_config_hz();
 	long quota_usec = 1000;
 	long default_period_usec = 100000; /* cpu.max's default period */
-	long duration_seconds = 1;
+	long duration_seconds;
 
-	long duration_usec = duration_seconds * USEC_PER_SEC;
+	long duration_usec;
 	long usage_usec, n_periods, remainder_usec, expected_usage_usec;
 	char *cpucg;
 	char quota_buf[32];
 
+	if (hz == -1)
+		hz = 1000;
+	duration_seconds = 1000 / hz;
+	duration_usec = duration_seconds * USEC_PER_SEC;
+
 	snprintf(quota_buf, sizeof(quota_buf), "%ld", quota_usec);
 
 	cpucg = cg_name(root, "cpucg_test");
@@ -710,15 +740,21 @@ static int test_cpucg_max(const char *root)
 static int test_cpucg_max_nested(const char *root)
 {
 	int ret = KSFT_FAIL;
+	long hz = _get_config_hz();
 	long quota_usec = 1000;
 	long default_period_usec = 100000; /* cpu.max's default period */
-	long duration_seconds = 1;
+	long duration_seconds;
 
-	long duration_usec = duration_seconds * USEC_PER_SEC;
+	long duration_usec;
 	long usage_usec, n_periods, remainder_usec, expected_usage_usec;
 	char *parent, *child;
 	char quota_buf[32];
 
+	if (hz == -1)
+		hz = 1000;
+	duration_seconds = 1000 / hz;
+	duration_usec = duration_seconds * USEC_PER_SEC;
+
 	snprintf(quota_buf, sizeof(quota_buf), "%ld", quota_usec);
 
 	parent = cg_name(root, "cpucg_parent");
-- 
2.54.0
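
The arithmetic behind the commit message can be sketched in isolation.
This is a hedged illustration, not code from the patch: usec_per_tick()
and scaled_duration_seconds() are hypothetical helpers, and the clamp to
a minimum of one second is an added assumption (the patch computes
1000 / hz directly).

```c
#include <assert.h>

#define TOY_USEC_PER_SEC 1000000L

/* Microseconds covered by one scheduler tick for a given HZ. */
static long usec_per_tick(long hz)
{
	assert(hz > 0);
	return TOY_USEC_PER_SEC / hz;
}

/*
 * Test duration scaled the same way as in the patch (1000 / hz), with
 * two assumptions added for this sketch: an unknown HZ (<= 0) is treated
 * as 1000, and the result is never allowed to drop below one second.
 */
static long scaled_duration_seconds(long hz)
{
	long secs;

	if (hz <= 0)
		hz = 1000;
	secs = 1000 / hz;
	return secs > 0 ? secs : 1;
}
```

With HZ=100 a tick spans 10000us, ten times the 1000us quota, and the
scaled duration becomes 10 seconds. With HZ=250 a tick spans 4000us and
the duration becomes 4 seconds.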


^ permalink raw reply related

* Re: [PATCH-next v5 0/6] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach()
From: Michal Koutný @ 2026-06-24 15:51 UTC (permalink / raw)
  To: Waiman Long
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Peter Zijlstra, cgroups,
	linux-kernel, Aaron Tomlin, Guopeng Zhang
In-Reply-To: <20260602023203.248077-1-longman@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 9453 bytes --]

On Mon, Jun 01, 2026 at 10:31:57PM -0400, Waiman Long <longman@redhat.com> wrote:
> Patch 6 makes the necessary changes to enable the support of multiple
> source and destination cpusets by keeping all the source and destination
> cpusets found during task iterations in two singly linked lists for
> source and destination cpusets respectively.

Thanks for looking into this!
I've played with a coding assistant and produced the following selftest
(it (expectedly) fails on my machine), feel free to include in the
series (if it validates the fix).

-- 8< --
From ed4e6cf91413bb4b64befb1c15412c8cfd205d73 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Michal=20Koutn=C3=BD?= <mkoutny@suse.com>
Date: Wed, 24 Jun 2026 16:39:30 +0200
Subject: [PATCH] selftests/cgroup: Add test for cpuset affinity on controller
 disable
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add a new selftest that exposes a bug in cpuset_attach() where thread
CPU affinity is not properly updated when the cpuset controller is
disabled in a threaded cgroup hierarchy.

The test creates a threaded cgroup hierarchy with two child cgroups
(A and B) having different cpuset.cpus constraints:
- Parent: cpuset.cpus=0-1
- Child A: cpuset.cpus=0-1
- Child B: cpuset.cpus=1 (restricted to CPU 1 only)

A multithreaded process is created with threads placed in different
cgroups. When the cpuset controller is disabled on the parent, thread
affinities should be updated to match the parent's cpuset.

Expected behavior:
- thread_a affinity: {0-1} before and after (unchanged)
- thread_b affinity: {1} before, {0-1} after (expanded)

Current buggy behavior:
- thread_b affinity remains {1} after controller disable

Assisted-by: Claude:claude-sonnet-4-5
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
 tools/testing/selftests/cgroup/test_cpuset.c | 243 +++++++++++++++++++
 1 file changed, 243 insertions(+)

diff --git a/tools/testing/selftests/cgroup/test_cpuset.c b/tools/testing/selftests/cgroup/test_cpuset.c
index c5cf8b56ceb8f..1d72a199ca552 100644
--- a/tools/testing/selftests/cgroup/test_cpuset.c
+++ b/tools/testing/selftests/cgroup/test_cpuset.c
@@ -1,7 +1,13 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#define _GNU_SOURCE
+#include <assert.h>
 #include <linux/limits.h>
+#include <pthread.h>
+#include <sched.h>
 #include <signal.h>
+#include <sys/syscall.h>
+#include <unistd.h>
 
 #include "kselftest.h"
 #include "cgroup_util.h"
@@ -232,6 +238,242 @@ static int test_cpuset_perms_subtree(const char *root)
 	return ret;
 }
 
+static int get_cpu_affinity(cpu_set_t *mask)
+{
+	CPU_ZERO(mask);
+	return sched_getaffinity(0, sizeof(*mask), mask);
+}
+
+static int cpu_set_equal(cpu_set_t *dst, unsigned long mask)
+{
+	cpu_set_t expected;
+
+	CPU_ZERO(&expected);
+	assert(8 * sizeof(mask) <= CPU_SETSIZE);
+
+	for (int cpu = 0; cpu < (int)(8 * sizeof(mask)); ++cpu)
+		if ((1UL << cpu) & mask)
+			CPU_SET(cpu, &expected);
+
+	return CPU_EQUAL(&expected, dst);
+}
+
+enum test_phase {
+	AFFINITY_SETUP,
+	AFFINITY_THREAD_A_READY,
+	AFFINITY_THREADS_READY,
+	AFFINITY_CONTROLLER_DISABLED,
+	AFFINITY_COMPLETE,
+	AFFINITY_ERROR
+};
+
+struct thread_args {
+	const char *cgroup;
+	cpu_set_t *affinity_before;
+	cpu_set_t *affinity_after;
+	enum test_phase ready_phase;
+};
+
+static pthread_mutex_t test_mutex = PTHREAD_MUTEX_INITIALIZER;
+static pthread_cond_t test_cond = PTHREAD_COND_INITIALIZER;
+static enum test_phase test_phase;
+
+static void *affinity_thread_fn(void *arg)
+{
+	struct thread_args *args = (struct thread_args *)arg;
+
+	if (cg_enter_current_thread(args->cgroup))
+		goto fail;
+
+	if (get_cpu_affinity(args->affinity_before) != 0)
+		goto fail;
+
+	pthread_mutex_lock(&test_mutex);
+	if (test_phase < args->ready_phase)
+		test_phase = args->ready_phase;
+	pthread_cond_broadcast(&test_cond);
+
+	while (test_phase < AFFINITY_CONTROLLER_DISABLED)
+		pthread_cond_wait(&test_cond, &test_mutex);
+	pthread_mutex_unlock(&test_mutex);
+
+	if (get_cpu_affinity(args->affinity_after) != 0)
+		goto fail;
+
+
+	return NULL;
+
+fail:
+	pthread_mutex_lock(&test_mutex);
+	test_phase = AFFINITY_ERROR;
+	pthread_cond_broadcast(&test_cond);
+	pthread_mutex_unlock(&test_mutex);
+	return NULL;
+}
+
+/*
+ * Test that disabling cpuset controller properly updates thread affinity.
+ *
+ * This test exposes a bug in cpuset_attach() where threads in child cgroups
+ * don't get their affinity updated when the cpuset controller is disabled.
+ *
+ * Setup:
+ * - Create parent cgroup with cpuset.cpus=0-1
+ * - Create child A with cpuset.cpus=0-1
+ * - Create child B with cpuset.cpus=1
+ * - Place multithreaded process: group leader + thread_a in A, thread_b in B
+ * - Disable cpuset controller on parent
+ *
+ * Expected: thread_b's affinity should expand from {1} to {0-1}
+ * Buggy: thread_b's affinity remains {1}
+ */
+static int test_cpuset_affinity_on_controller_disable(const char *root)
+{
+	char *parent = NULL, *child_a = NULL, *child_b = NULL;
+	pthread_t thread_a, thread_b;
+	int thread_a_created = 0, thread_b_created = 0;
+	cpu_set_t affinity_a_before, affinity_a_after;
+	cpu_set_t affinity_b_before, affinity_b_after;
+	int ret = KSFT_FAIL;
+
+	parent = cg_name(root, "cpuset_affinity_test");
+	if (!parent)
+		goto cleanup;
+	if (cg_create(parent))
+		goto cleanup;
+	if (cg_write(parent, "cgroup.type", "threaded"))
+		goto cleanup;
+
+	child_a = cg_name(parent, "A");
+	if (!child_a)
+		goto cleanup;
+	if (cg_create(child_a))
+		goto cleanup;
+	if (cg_write(child_a, "cgroup.type", "threaded"))
+		goto cleanup;
+
+	child_b = cg_name(parent, "B");
+	if (!child_b)
+		goto cleanup;
+	if (cg_create(child_b))
+		goto cleanup;
+	if (cg_write(child_b, "cgroup.type", "threaded"))
+		goto cleanup;
+
+	/* Now enable cpuset controller in parent */
+	if (cg_write(parent, "cgroup.subtree_control", "+cpuset")) {
+		ret = KSFT_SKIP;
+		goto cleanup;
+	}
+
+	/* Set CPU affinity constraints */
+	if (cg_write(parent, "cpuset.cpus", "0-1"))
+		goto cleanup;
+	if (cg_write(child_a, "cpuset.cpus", "0-1"))
+		goto cleanup;
+	if (cg_write(child_b, "cpuset.cpus", "1"))
+		goto cleanup;
+
+	/* Move group leader (main thread) to child A */
+	if (cg_enter_current(child_a))
+		goto cleanup;
+
+	/* Create threads - they will move themselves to their respective cgroups */
+	test_phase = AFFINITY_SETUP;
+
+	struct thread_args args_a = {
+		.cgroup = child_a,
+		.affinity_before = &affinity_a_before,
+		.affinity_after = &affinity_a_after,
+		.ready_phase = AFFINITY_THREAD_A_READY,
+	};
+	if (pthread_create(&thread_a, NULL, affinity_thread_fn, &args_a))
+		goto cleanup;
+	thread_a_created = 1;
+
+	struct thread_args args_b = {
+		.cgroup = child_b,
+		.affinity_before = &affinity_b_before,
+		.affinity_after = &affinity_b_after,
+		.ready_phase = AFFINITY_THREADS_READY,
+	};
+	if (pthread_create(&thread_b, NULL, affinity_thread_fn, &args_b))
+		goto cleanup_threads;
+	thread_b_created = 1;
+
+	pthread_mutex_lock(&test_mutex);
+	while (test_phase < AFFINITY_THREADS_READY)
+		pthread_cond_wait(&test_cond, &test_mutex);
+
+	/* If a thread failed during setup, bail out */
+	if (test_phase == AFFINITY_ERROR) {
+		pthread_mutex_unlock(&test_mutex);
+		goto cleanup_threads;
+	}
+	pthread_mutex_unlock(&test_mutex);
+
+	if (!cpu_set_equal(&affinity_a_before, 0x3)) {
+		ksft_print_msg("FAIL: thread_a initial affinity incorrect\n");
+		goto cleanup_threads;
+	}
+
+	if (!cpu_set_equal(&affinity_b_before, 0x2)) {
+		ksft_print_msg("FAIL: thread_b initial affinity incorrect\n");
+		goto cleanup_threads;
+	}
+
+	/* Disable cpuset controller - this should trigger affinity update */
+	if (cg_write(parent, "cgroup.subtree_control", "-cpuset"))
+		goto cleanup_threads;
+
+	/* Signal threads to save their final affinity and exit */
+	pthread_mutex_lock(&test_mutex);
+	test_phase = AFFINITY_CONTROLLER_DISABLED;
+	pthread_cond_broadcast(&test_cond);
+	pthread_mutex_unlock(&test_mutex);
+
+	pthread_join(thread_a, NULL);
+	pthread_join(thread_b, NULL);
+
+	/* Verify thread affinities AFTER disabling controller */
+	if (!cpu_set_equal(&affinity_a_after, 0x3)) {
+		ksft_print_msg("FAIL: thread_a final affinity incorrect\n");
+		goto cleanup;
+	}
+
+	if (!cpu_set_equal(&affinity_b_after, 0x3)) {
+		ksft_print_msg("FAIL: thread_b affinity did not expand to {0-1}\n");
+		goto cleanup;
+	}
+
+	ret = KSFT_PASS;
+	goto cleanup;
+
+cleanup_threads:
+	pthread_mutex_lock(&test_mutex);
+	test_phase = AFFINITY_COMPLETE;
+	pthread_cond_broadcast(&test_cond);
+	pthread_mutex_unlock(&test_mutex);
+
+	if (thread_a_created)
+		pthread_join(thread_a, NULL);
+	if (thread_b_created)
+		pthread_join(thread_b, NULL);
+
+cleanup:
+	/* Move back to root before cleanup */
+	cg_enter_current(root);
+
+	cg_destroy(child_b);
+	free(child_b);
+	cg_destroy(child_a);
+	free(child_a);
+	cg_destroy(parent);
+	free(parent);
+
+	return ret;
+}
+
 
 #define T(x) { x, #x }
 struct cpuset_test {
@@ -241,6 +483,7 @@ struct cpuset_test {
 	T(test_cpuset_perms_object_allow),
 	T(test_cpuset_perms_object_deny),
 	T(test_cpuset_perms_subtree),
+	T(test_cpuset_affinity_on_controller_disable),
 };
 #undef T
 
-- 
2.54.0
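
The 0x3 and 0x2 masks passed to cpu_set_equal() encode the CPU sets {0-1}
and {1}. Below is a minimal sketch of that mask-to-CPU mapping, kept
independent of cpu_set_t so that it builds anywhere. mask_to_cpus() is a
hypothetical helper and not part of the selftest. It visits
CHAR_BIT * sizeof(mask) bit positions, because sizeof() on its own counts
bytes rather than bits.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Expand a bitmask into a list of CPU ids, lowest CPU first.
 * Returns the number of ids written to cpus[] (at most max).
 */
static size_t mask_to_cpus(unsigned long mask, int *cpus, size_t max)
{
	const int nbits = (int)(CHAR_BIT * sizeof(mask));	/* bits, not bytes */
	size_t n = 0;

	assert(cpus != NULL);

	for (int cpu = 0; cpu < nbits && n < max; cpu++)
		if (mask & (1UL << cpu))
			cpus[n++] = cpu;

	return n;
}
```

A mask with only bit 9 set yields the single CPU id 9, which a loop
bounded by the byte count of the mask would never reach.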


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply related

* Re: [PATCH-next v5 6/6] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach()
From: Michal Koutný @ 2026-06-24 15:45 UTC (permalink / raw)
  To: Waiman Long
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Peter Zijlstra, cgroups,
	linux-kernel, Aaron Tomlin, Guopeng Zhang
In-Reply-To: <20260602023203.248077-7-longman@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 3179 bytes --]

Hello Waiman.

On Mon, Jun 01, 2026 at 10:32:03PM -0400, Waiman Long <longman@redhat.com> wrote:
> This problem is less an issue when enabling the cpuset controller as all
> the newly created child cpusets will have exactly the same set of CPUs
> and memory nodes except when deadline tasks are involved in migration
> as the deadline task accounting data can be off.
> 
> It can be more problematic when the cpuset controller is disabled as
> their set of CPUs and memory nodes may differ from their parent or with
> the moving of multi-threaded process from different threaded cgroups.

Generalizing, this can be an issue for any threaded controller
that somehow relies on the _difference_ between old and new thread
membership.

So I checked some: pids and perf_events look alright (no
diff-dependency) but I noticed the very same issue is tackled in
sched_change_group/scx_cgroup_move_task and that there is a member
inside task_struct allocated for this state tracking already:
  task_struct::scx::cgrp_moving_from

> Fix that by tracking the set of source (old) and destination cpusets
> in singly linked lists and iterating them all to properly update the
> internal data. Also keep the current cs and oldcs variables up-to-date
> with the css and task iterators.

So there would be more than a single use for something conceptually
like:

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 004e6d56a499a..740c02f220c75 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1326,6 +1326,9 @@ struct task_struct {
 #ifdef CONFIG_PREEMPT_RT
        struct llist_node               cg_dead_lnode;
 #endif /* CONFIG_PREEMPT_RT */
+#ifdef CONFIG_CGROUPS_MOVING_FROM
+       struct cgroup                   *cgrp_moving_from;
+#endif
 #endif /* CONFIG_CGROUPS */
 #ifdef CONFIG_X86_CPU_RESCTRL
        u32                             closid;
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 1a3af2ea2a794..5b63afe83f333 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -240,9 +240,6 @@ struct sched_ext_entity {
        bool                    disallow;       /* reject switching into SCX */
 
        /* cold fields */
-#ifdef CONFIG_EXT_GROUP_SCHED
-       struct cgroup           *cgrp_moving_from;
-#endif
        struct list_head        tasks_node;
 };
 
diff --git a/init/Kconfig b/init/Kconfig
index 2937c4d308aec..d7e7d4477f862 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1186,6 +1186,7 @@ config EXT_GROUP_SCHED
        depends on SCHED_CLASS_EXT && CGROUP_SCHED
        select GROUP_SCHED_WEIGHT
        select GROUP_SCHED_BANDWIDTH
+       select CGROUPS_MOVING_FROM
        default y
 
 endif #CGROUP_SCHED
@@ -1288,6 +1289,7 @@ config CPUSETS
        depends on SMP
        select UNION_FIND
        select CPU_ISOLATION
+       select CGROUPS_MOVING_FROM
        help
          This option will let you create and manage CPUSETs which
          allow dynamically partitioning a system into sets of CPUs and

I think this could simplify the before-after state tracking generally,
WDYT?

Michal
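
The idea of remembering each task's source group during a migration can
be sketched in user space. Everything below is a toy under stated
assumptions: struct toy_task, toy_begin_move() and toy_attach() are
hypothetical names, groups are plain integer ids, and the only per-group
state is a task count. It is not the kernel's cgroup migration code.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NO_GROUP (-1)

struct toy_task {
	int group;		/* current group id */
	int moving_from;	/* source group while a move is in flight */
};

/* Phase 1: record where the task came from, then switch membership. */
static void toy_begin_move(struct toy_task *t, int dst)
{
	t->moving_from = t->group;
	t->group = dst;
}

/*
 * Phase 2: fix up per-group state from the recorded source of each task.
 * Because the source is stored per task, any number of source and
 * destination groups can be handled in a single pass.
 */
static void toy_attach(struct toy_task *tasks, size_t n, int *nr_tasks,
		       int nr_groups)
{
	for (size_t i = 0; i < n; i++) {
		struct toy_task *t = &tasks[i];

		if (t->moving_from == TOY_NO_GROUP)
			continue;	/* task is not part of this migration */

		assert(t->moving_from >= 0 && t->moving_from < nr_groups);
		assert(t->group >= 0 && t->group < nr_groups);

		nr_tasks[t->moving_from]--;
		nr_tasks[t->group]++;
		t->moving_from = TOY_NO_GROUP;
	}
}
```

Moving three threads from groups 1 and 2 into group 0 updates all three
counters without the caller tracking a single "old" and "new" group.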

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply related

* Re: [PATCH v4 4/5] mm/memcontrol: convert memcg to use page_counter_stock
From: Joshua Hahn @ 2026-06-24 15:23 UTC (permalink / raw)
  To: Usama Arif
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, cgroups, linux-mm, linux-kernel, kernel-team
In-Reply-To: <20260624144348.4117578-1-usama.arif@linux.dev>

On Wed, 24 Jun 2026 07:43:47 -0700 Usama Arif <usama.arif@linux.dev> wrote:

> On Tue, 23 Jun 2026 11:01:22 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

Hello Usama!!

Thank you for reviewing the patch : -)

[...snip...]

> > @@ -2595,7 +2596,6 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
> >  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  			    unsigned int nr_pages)
> >  {
> > -	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
> >  	int nr_retries = MAX_RECLAIM_RETRIES;
> >  	struct mem_cgroup *mem_over_limit;
> >  	struct page_counter *counter;
> > @@ -2606,36 +2606,30 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	bool raised_max_event = false;
> >  	unsigned long pflags;
> >  	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
> > +	unsigned long nr_charged = 0;
> >  
> >  retry:
> > -	if (consume_stock(memcg, nr_pages))
> > -		return 0;
> > -
> > -	if (!allow_spinning)
> > -		/* Avoid the refill and flush of the older stock */
> > -		batch = nr_pages;
> > -
> >  	reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
> >  	if (do_memsw_account() &&
> > -	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
> > +	    !page_counter_try_charge_stock(&memcg->memsw, nr_pages,
> > +					   &counter, NULL)) {
> >  		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
> >  		reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP;
> >  		goto reclaim;
> >  	}
> >  
> > -	if (page_counter_try_charge(&memcg->memory, batch, &counter))
> > -		goto done_restock;
> > +	if (page_counter_try_charge_stock(&memcg->memory, nr_pages,
> > +					  &counter, &nr_charged)) {
> > +		if (!nr_charged)
> > +			return 0;
> > +		goto handle_high;
> > +	}
> >  
> >  	if (do_memsw_account())
> > -		page_counter_uncharge(&memcg->memsw, batch);
> > +		page_counter_uncharge(&memcg->memsw, nr_pages);
> 
> This needs a transactional rollback. page_counter_try_charge_stock() can
> succeed by consuming memsw stock and charging 0 new pages, but the
> memory-failure path unconditionally uncharges nr_pages from memsw.
> That turns a failed allocation into a real memsw usage decrement.

Hmmmmmmmmmm....... I'm not sure.

At this point in the code, we are either (1) using cgroup v1 with memsw
and charged successfully, or (2) not using cgroup v1 with memsw. So I'm
not sure if this really is unconditional, we're just distinguishing
between cases (1) and (2) by checking if we're using cgroupv1.

Or is your concern with taking a charge via stock, but uncharging with
a hierarchical page_counter walk? If so, I think there's a case to be
made here with just simply returning the stock. I just wanted to keep
it consistent with the original memcontrol code, which only used
stock to fulfill charges, not uncharges, since this could make the
stock grow without bound.

What do you think? Thanks again for reviewing Usama, I hope you have a
great day!!!
Joshua

^ permalink raw reply

* Re: [PATCH v4 4/5] mm/memcontrol: convert memcg to use page_counter_stock
From: Usama Arif @ 2026-06-24 14:43 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Usama Arif, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, cgroups, linux-mm, linux-kernel, kernel-team
In-Reply-To: <20260623180124.868655-5-joshua.hahnjy@gmail.com>

On Tue, 23 Jun 2026 11:01:22 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> Now with all of the memcg_stock handling logic replicated in
> page_counter_stock, switch memcg to use the page_counter_stock for the
> memory (and for cgroup v1 users, memsw) page_counters.
> 
> There are a few details that have changed:
> 
> First, the old special-casing for the !allow_spinning check to avoid
> refilling and flushing of the old stock is removed. This special casing
> was important previously, because refilling the stock could do a lot of
> extra work by evicting one of 7 random victim memcgs in the percpu
> memcg_stock slots. In the new per-counter design, refilling stock just
> adds pages to the counter's own local cache without affecting other memcgs,
> so the original reason for the special case no longer applies.
> 
> Also, we can now fail during page_counter_alloc_stock(), if there is
> not enough memory to allocate a percpu page_counter_stock. This failure
> is rare and nonfatal; the system can continue to operate, with the page
> counter working without stock and falling back to walking the hierarchy.
> 
> drain_all_stock and memcg_hotplug_cpu_dead also now use the page_counter
> stock drain variant, which uses remote atomic_xchg to retrieve stock
> across CPUs, instead of scheduling asynchronous work.
> 
> Finally, as a side-effect of separating the per-memcg stock to per-
> page_counter, the memsw and memory page_counters have independent stock.
> This means that the reported memsw may transiently be lower than memory
> usage if the stock for memory and memsw page_counters go out of sync.
> 
> Note that obj_stock is untouched by this change.
> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> ---
>  mm/memcontrol.c | 87 +++++++++++++++++++++++--------------------------
>  1 file changed, 41 insertions(+), 46 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 306658fd55512..846800917af49 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2269,39 +2269,36 @@ static void schedule_drain_work(int cpu, struct work_struct *work)
>  		queue_work_on(cpu, memcg_wq, work);
>  }
>  
> +static void memcg_drain_stock(struct mem_cgroup *memcg, int cpu)
> +{
> +	page_counter_drain_stock(&memcg->memory, cpu);
> +	if (do_memsw_account())
> +		page_counter_drain_stock(&memcg->memsw, cpu);
> +}
> +
>  /*
>   * Drains all per-CPU charge caches for given root_memcg resp. subtree
>   * of the hierarchy under it.
>   */
>  void drain_all_stock(struct mem_cgroup *root_memcg)
>  {
> +	struct mem_cgroup *memcg;
>  	int cpu, curcpu;
>  
>  	/* If someone's already draining, avoid adding running more workers. */
>  	if (!mutex_trylock(&percpu_charge_mutex))
>  		return;
> -	/*
> -	 * Notify other cpus that system-wide "drain" is running
> -	 * We do not care about races with the cpu hotplug because cpu down
> -	 * as well as workers from this path always operate on the local
> -	 * per-cpu data. CPU up doesn't touch memcg_stock at all.
> -	 */
> +
> +	for_each_mem_cgroup_tree(memcg, root_memcg) {
> +		for_each_online_cpu(cpu)
> +			memcg_drain_stock(memcg, cpu);
> +	}
> +
>  	migrate_disable();
>  	curcpu = smp_processor_id();
>  	for_each_online_cpu(cpu) {
> -		struct memcg_stock_pcp *memcg_st = &per_cpu(memcg_stock, cpu);
>  		struct obj_stock_pcp *obj_st = &per_cpu(obj_stock, cpu);
>  
> -		if (!test_bit(FLUSHING_CACHED_CHARGE, &memcg_st->flags) &&
> -		    is_memcg_drain_needed(memcg_st, root_memcg) &&
> -		    !test_and_set_bit(FLUSHING_CACHED_CHARGE,
> -				      &memcg_st->flags)) {
> -			if (cpu == curcpu)
> -				drain_local_memcg_stock(&memcg_st->work);
> -			else
> -				schedule_drain_work(cpu, &memcg_st->work);
> -		}
> -
>  		if (!test_bit(FLUSHING_CACHED_CHARGE, &obj_st->flags) &&
>  		    obj_stock_flush_required(obj_st, root_memcg) &&
>  		    !test_and_set_bit(FLUSHING_CACHED_CHARGE,
> @@ -2318,9 +2315,13 @@ void drain_all_stock(struct mem_cgroup *root_memcg)
>  
>  static int memcg_hotplug_cpu_dead(unsigned int cpu)
>  {
> +	struct mem_cgroup *memcg;
> +
>  	/* no need for the local lock */
>  	drain_obj_stock(&per_cpu(obj_stock, cpu));
> -	drain_stock_fully(&per_cpu(memcg_stock, cpu));
> +
> +	for_each_mem_cgroup(memcg)
> +		memcg_drain_stock(memcg, cpu);
>  
>  	return 0;
>  }
> @@ -2595,7 +2596,6 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
>  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  			    unsigned int nr_pages)
>  {
> -	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
>  	int nr_retries = MAX_RECLAIM_RETRIES;
>  	struct mem_cgroup *mem_over_limit;
>  	struct page_counter *counter;
> @@ -2606,36 +2606,30 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	bool raised_max_event = false;
>  	unsigned long pflags;
>  	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
> +	unsigned long nr_charged = 0;
>  
>  retry:
> -	if (consume_stock(memcg, nr_pages))
> -		return 0;
> -
> -	if (!allow_spinning)
> -		/* Avoid the refill and flush of the older stock */
> -		batch = nr_pages;
> -
>  	reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
>  	if (do_memsw_account() &&
> -	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
> +	    !page_counter_try_charge_stock(&memcg->memsw, nr_pages,
> +					   &counter, NULL)) {
>  		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
>  		reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP;
>  		goto reclaim;
>  	}
>  
> -	if (page_counter_try_charge(&memcg->memory, batch, &counter))
> -		goto done_restock;
> +	if (page_counter_try_charge_stock(&memcg->memory, nr_pages,
> +					  &counter, &nr_charged)) {
> +		if (!nr_charged)
> +			return 0;
> +		goto handle_high;
> +	}
>  
>  	if (do_memsw_account())
> -		page_counter_uncharge(&memcg->memsw, batch);
> +		page_counter_uncharge(&memcg->memsw, nr_pages);

This needs a transactional rollback. page_counter_try_charge_stock() can
succeed by consuming memsw stock and charging 0 new pages, but the
memory-failure path unconditionally uncharges nr_pages from memsw.
That turns a failed allocation into a real memsw usage decrement.


>  	mem_over_limit = mem_cgroup_from_counter(counter, memory);
>  
>  reclaim:
> -	if (batch > nr_pages) {
> -		batch = nr_pages;
> -		goto retry;
> -	}
> -
>  	/*
>  	 * Prevent unbounded recursion when reclaim operations need to
>  	 * allocate memory. This might exceed the limits temporarily,
> @@ -2731,10 +2725,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  
>  	return 0;
>  
> -done_restock:
> -	if (batch > nr_pages)
> -		refill_stock(memcg, batch - nr_pages);
> -
> +handle_high:
>  	/*
>  	 * If the hierarchy is above the normal consumption range, schedule
>  	 * reclaim on returning to userland.  We can perform reclaim here
> @@ -2771,7 +2762,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  			 * and distribute reclaim work and delay penalties
>  			 * based on how much each task is actually allocating.
>  			 */
> -			current->memcg_nr_pages_over_high += batch;
> +			current->memcg_nr_pages_over_high += nr_charged;
>  			set_notify_resume(current);
>  			break;
>  		}
> @@ -3076,7 +3067,7 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
>  	account_kmem_nmi_safe(memcg, -nr_pages);
>  	memcg1_account_kmem(memcg, -nr_pages);
>  	if (!mem_cgroup_is_root(memcg))
> -		refill_stock(memcg, nr_pages);
> +		memcg_uncharge(memcg, nr_pages);
>  
>  	css_put(&memcg->css);
>  }
> @@ -4080,6 +4071,8 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
>  
>  static void mem_cgroup_free(struct mem_cgroup *memcg)
>  {
> +	page_counter_free_stock(&memcg->memory);
> +	page_counter_free_stock(&memcg->memsw);
>  	lru_gen_exit_memcg(memcg);
>  	memcg_wb_domain_exit(memcg);
>  	__mem_cgroup_free(memcg);
> @@ -4247,6 +4240,11 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  	refcount_set(&memcg->id.ref, 1);
>  	css_get(css);
>  
> +	/* failure is nonfatal, charges fall back to direct hierarchy */
> +	page_counter_alloc_stock(&memcg->memory, MEMCG_CHARGE_BATCH);
> +	if (do_memsw_account())
> +		page_counter_alloc_stock(&memcg->memsw, MEMCG_CHARGE_BATCH);
> +
>  	/*
>  	 * Ensure mem_cgroup_from_private_id() works once we're fully online.
>  	 *
> @@ -5502,7 +5500,7 @@ void mem_cgroup_sk_uncharge(const struct sock *sk, unsigned int nr_pages)
>  
>  	mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages);
>  
> -	refill_stock(memcg, nr_pages);
> +	page_counter_uncharge(&memcg->memory, nr_pages);
>  }
>  
>  void mem_cgroup_flush_workqueue(void)
> @@ -5555,12 +5553,9 @@ int __init mem_cgroup_init(void)
>  	memcg_wq = alloc_workqueue("memcg", WQ_PERCPU, 0);
>  	WARN_ON(!memcg_wq);
>  
> -	for_each_possible_cpu(cpu) {
> -		INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
> -			  drain_local_memcg_stock);
> +	for_each_possible_cpu(cpu)
>  		INIT_WORK(&per_cpu_ptr(&obj_stock, cpu)->work,
>  			  drain_local_obj_stock);
> -	}
>  
>  	memcg_size = struct_size_t(struct mem_cgroup, nodeinfo, nr_node_ids);
>  	memcg_cachep = kmem_cache_create("mem_cgroup", memcg_size, 0,
> -- 
> 2.53.0-Meta
> 
> 

^ permalink raw reply

* Re: [PATCH 3/3] memcg: bail out proactive reclaim when memcg is dying
From: Jiayuan Chen @ 2026-06-24 14:41 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-mm, yingfu.zhou, Jiayuan Chen, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, David Hildenbrand, Qi Zheng, Lorenzo Stoakes,
	Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	cgroups, linux-kernel
In-Reply-To: <20260624135839.2596358-1-usama.arif@linux.dev>


On 6/24/26 9:58 PM, Usama Arif wrote:
> On Tue, 23 Jun 2026 14:27:56 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
>> From: Jiayuan Chen <jiayuan.chen@shopee.com>
>>
>> Proactive reclaim via memory.reclaim can run for a long time - swap I/O
>> or thrashing again dominating the latency - and delays cgroup removal in
>> the same way.
>>
>> Mitigate this by stopping the reclaim once memcg_is_dying().
>>
>> Reported-by: Zhou Yingfu <yingfu.zhou@shopee.com>
>> Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
>> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
>> ---
>>   mm/vmscan.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 8190c4abec84..1162b7f76655 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -7922,6 +7922,9 @@ int user_proactive_reclaim(char *buf,
>>   		if (memcg) {
>>   			unsigned int reclaim_options;
>>   
>> +			if (memcg_is_dying(memcg))
>> +				break;
>> +
> This exits the reclaim loop with nr_reclaimed < nr_to_reclaim, but the
> function then returns 0 and memory_reclaim() reports a successful write.
> I think you want to return -EAGAIN here?


You are right that an error should be returned instead of 0.


But since memcg is being deleted, I'm reconsidering the appropriate error
code.

-EAGAIN, -ENOENT, -EINTR are possible candidates
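
The control flow under discussion can be modelled outside the kernel.
This is a sketch under stated assumptions: struct toy_memcg,
toy_is_dying() and toy_proactive_reclaim() are hypothetical, and -EAGAIN
is used only as a placeholder because the thread has not settled on an
error code. The point is that bailing out early must not be reported as
success.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a memory cgroup; not the kernel's struct mem_cgroup. */
struct toy_memcg {
	unsigned long reclaimable;	/* pages that can still be reclaimed */
	int passes_until_dying;		/* 0 = dying now, < 0 = never dies */
};

static bool toy_is_dying(const struct toy_memcg *m)
{
	return m->passes_until_dying == 0;
}

/* Returns 0 only if the full target was reclaimed. */
static int toy_proactive_reclaim(struct toy_memcg *m,
				 unsigned long nr_to_reclaim,
				 unsigned long batch)
{
	unsigned long nr_reclaimed = 0;

	assert(m);

	while (nr_reclaimed < nr_to_reclaim) {
		unsigned long got;

		/* Placeholder errno: the thread has not picked one yet. */
		if (toy_is_dying(m))
			return -EAGAIN;

		got = m->reclaimable < batch ? m->reclaimable : batch;
		if (!got)
			return -EAGAIN;	/* no progress possible */

		m->reclaimable -= got;
		nr_reclaimed += got;

		if (m->passes_until_dying > 0)
			m->passes_until_dying--;
	}

	return 0;
}
```

A group that starts dying after two passes returns the error with the
target only partly reclaimed, so the writer can tell the request was cut
short.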



^ permalink raw reply

* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: David Laight @ 2026-06-24 14:23 UTC (permalink / raw)
  To: Christian König
  Cc: Kaitao Cheng, Andrew Morton, David Hildenbrand, Jens Axboe,
	Tejun Heo, Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt, David Howells,
	Simona Vetter, Randy Dunlap, Luca Ceresoli, Philipp Stanner,
	linux-block, linux-kernel, cgroups, linux-ntfs-dev, linux-fsdevel,
	io-uring, audit, bpf, netdev, dri-devel, linux-perf-users,
	linux-trace-kernel, kexec, live-patching, linux-modules,
	linux-crypto, linux-pm, rcu, sched-ext, linux-mm, virtualization,
	damon, llvm, Kaitao Cheng
In-Reply-To: <cf8467c7-b98f-44a5-9cf9-60b43b5da711@amd.com>

On Wed, 24 Jun 2026 15:23:47 +0200
Christian König <christian.koenig@amd.com> wrote:

> On 6/24/26 15:14, Kaitao Cheng wrote:
> > 
> > 
> > On 2026/6/22 16:42, David Laight wrote:
> >> On Mon, 22 Jun 2026 12:05:31 +0800
> >> Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
> >>  
> >>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
> >>>
> >>> The list_for_each*_safe() helpers are used when the loop body may
> >>> remove the current entry.  Their API exposes the temporary cursor at
> >>> every call site, even though most users only need it for the iterator
> >>> implementation and never reference it in the loop body.
> >>>
> >>> Add *_mutable() variants for list and hlist iteration.  The new helpers
> >>> support both forms: callers may keep passing an explicit temporary cursor
> >>> when they need to inspect or reset it, or omit it and let the helper use
> >>> a unique internal cursor.  
> >>
> >> I'm not really sure 'mutable' means anything either.
> >> It is possible to make it valid for the loop body (or even other threads)
> >> to delete arbitrary list items - but that needs significant extra overheads.
> >>
> >> It might be worth doing something that doesn't need the extra variable,
> >> but there is little point doing all the churn just to rename things.
> >>  
> >>>
> >>> This makes call sites that only mutate the list through the current entry
> >>> less noisy, while keeping the existing *_safe() helpers available for
> >>> compatibility.
> >>>
> >>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> >>> ---
> >>>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
> >>>  1 file changed, 231 insertions(+), 38 deletions(-)
> >>>
> >>> diff --git a/include/linux/list.h b/include/linux/list.h
> >>> index 09d979976b3b..1081def7cea9 100644
> >>> --- a/include/linux/list.h
> >>> +++ b/include/linux/list.h
> >>> @@ -7,6 +7,7 @@
> >>>  #include <linux/stddef.h>
> >>>  #include <linux/poison.h>
> >>>  #include <linux/const.h>
> >>> +#include <linux/args.h>
> >>>  
> >>>  #include <asm/barrier.h>
> >>>  
> >>> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
> >>>  #define list_for_each_prev(pos, head) \
> >>>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
> >>>  
> >>> -/**
> >>> - * list_for_each_safe - iterate over a list safe against removal of list entry
> >>> - * @pos:	the &struct list_head to use as a loop cursor.
> >>> - * @n:		another &struct list_head to use as temporary storage
> >>> - * @head:	the head for your list.
> >>> +/*
> >>> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
> >>>   */
> >>>  #define list_for_each_safe(pos, n, head) \
> >>>  	for (pos = (head)->next, n = pos->next; \
> >>>  	     !list_is_head(pos, (head)); \
> >>>  	     pos = n, n = pos->next)
> >>>  
> >>> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
> >>> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\  
> >>
> >> Use auto
> >>  
> >>> +	     !list_is_head(pos, (head));				\
> >>> +	     pos = tmp, tmp = pos->next)
> >>> +
> >>> +#define __list_for_each_mutable1(pos, head)				\
> >>> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
> >>> +
> >>> +#define __list_for_each_mutable2(pos, next, head)			\
> >>> +	list_for_each_safe(pos, next, head)
> >>> +
> >>>  /**
> >>> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
> >>> + * list_for_each_mutable - iterate over a list safe against entry removal
> >>>   * @pos:	the &struct list_head to use as a loop cursor.
> >>> - * @n:		another &struct list_head to use as temporary storage
> >>> - * @head:	the head for your list.
> >>> + * @...:	either (head) or (next, head)
> >>> + *
> >>> + * next:	another &struct list_head to use as optional temporary storage.
> >>> + *		The temporary cursor is internal unless explicitly supplied by
> >>> + *		the caller.
> >>> + * head:	the head for your list.
> >>> + */
> >>> +#define list_for_each_mutable(pos, ...)					\
> >>> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
> >>> +		(pos, __VA_ARGS__)  
> >>
> >> The variable argument count logic really just slows down compilation.
> >> Maybe there aren't enough copies of this code to make that significant.
> >> But just because you can do it doesn't mean it is a good idea.
> >> I'm also not sure it really adds anything to the readability.
> >>
> >> And, if you are going to make the middle argument optional there is
> >> no need to change the macro name.  
> > 
> > Christian König and Jani Nikula also disagree with the variadic-argument
> > implementation approach. If we abandon that method, it means we will
> > inevitably need to add some new macros. If mutable is not a good name,
> > suggestions for better alternatives would be welcome; coming up with a
> > suitable name is indeed rather tricky.  
> 
> I don't think you need to add a new macro for the specific use case where people want to modify the next element of the iteration.
> 
> If I remember your numbers correctly, that is really a corner case, and continuing to use the existing *_safe() macros for that sounds perfectly fine to me.

IIRC currently you have a choice of either:
	define               Item that can't be deleted
	list_for_each()	     The current item.
	list_for_each_safe() The next item.
There is also likely to be code that updates the variables to allow
for other scenarios.

Note that if you increase a reference count and release a lock, then
list_for_each() is likely safer than list_for_each_safe() :-)
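
To make the table above concrete, here is a minimal userspace sketch of
the list_for_each_safe() case. The struct and helpers are simplified
stand-ins written for this illustration, not the kernel's list.h, and
drain() and demo_drain() are hypothetical names. The loop can unlink the
current entry because the next pointer was sampled beforehand, which is
also why the next entry is the one that must not be deleted.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Unlink and clear the pointers, so a stale dereference is easy to spot. */
static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = NULL;
	entry->prev = NULL;
}

/* Same shape as the macro quoted in the patch, with a plain head test. */
#define list_for_each_safe(pos, n, head) \
	for (pos = (head)->next, n = pos->next; \
	     pos != (head); \
	     pos = n, n = pos->next)

/* Delete every entry while iterating; returns the number removed. */
static int drain(struct list_head *head)
{
	struct list_head *pos, *n;
	int removed = 0;

	list_for_each_safe(pos, n, head) {
		/* Fine: n was read before pos is unlinked. */
		list_del(pos);
		removed++;
	}
	return removed;
}

/* Build a three-entry list, drain it, and report how many were removed. */
static int demo_drain(void)
{
	struct list_head head, a, b, c;
	int removed;

	list_init(&head);
	list_add_tail(&a, &head);
	list_add_tail(&b, &head);
	list_add_tail(&c, &head);

	removed = drain(&head);

	/* The list must be empty again. */
	assert(head.next == &head && head.prev == &head);
	return removed;
}
```

With a plain list_for_each() the same body would read pos->next after
list_del() had cleared it, which is the "current item can't be deleted"
row of the table.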

list.h has 9 variants of the 'safe' loop.
The bloat of another 9 is getting excessive.

It has to be said that this is one of my least favourite types of list...

	David

> 
> Regards,
> Christian.


^ permalink raw reply

* Re: [PATCH 3/3] memcg: bail out proactive reclaim when memcg is dying
From: Usama Arif @ 2026-06-24 13:58 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: Usama Arif, linux-mm, yingfu.zhou, Jiayuan Chen, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, David Hildenbrand, Qi Zheng, Lorenzo Stoakes,
	Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	cgroups, linux-kernel
In-Reply-To: <20260623062800.298514-4-jiayuan.chen@linux.dev>

On Tue, 23 Jun 2026 14:27:56 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:

> From: Jiayuan Chen <jiayuan.chen@shopee.com>
> 
> Proactive reclaim via memory.reclaim can run for a long time - swap I/O
> or thrashing again dominating the latency - and delays cgroup removal in
> the same way.
> 
> Mitigate this by stopping the reclaim once memcg_is_dying().
> 
> Reported-by: Zhou Yingfu <yingfu.zhou@shopee.com>
> Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
> ---
>  mm/vmscan.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8190c4abec84..1162b7f76655 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -7922,6 +7922,9 @@ int user_proactive_reclaim(char *buf,
>  		if (memcg) {
>  			unsigned int reclaim_options;
>  
> +			if (memcg_is_dying(memcg))
> +				break;
> +

This exits the reclaim loop with nr_reclaimed < nr_to_reclaim, but the
function then returns 0 and memory_reclaim() reports a successful write.
I think you want to return -EAGAIN here?
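
A rough userspace model of the concern, as a sketch only: struct
fake_memcg and reclaim_sketch() are hypothetical names and do not mirror
the real user_proactive_reclaim(). The point is just that an early exit
has to be distinguishable from reaching the target. -EAGAIN is used
because it is the value suggested above; the thread has not settled on
an error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg; not the kernel's struct mem_cgroup. */
struct fake_memcg {
	bool dying;             /* models what memcg_is_dying() would report */
	unsigned long per_pass; /* pages freed by each reclaim pass */
};

/*
 * Keep reclaiming until the target is met, but stop early if the group
 * is being removed. Returns 0 only when the target was reached.
 */
static int reclaim_sketch(struct fake_memcg *memcg, unsigned long nr_to_reclaim)
{
	unsigned long nr_reclaimed = 0;

	assert(memcg != NULL);

	while (nr_reclaimed < nr_to_reclaim) {
		/* Report the bail-out instead of falling through to 0. */
		if (memcg->dying)
			return -EAGAIN;
		/* Guard so this model cannot spin forever without progress. */
		if (!memcg->per_pass)
			return -EAGAIN;
		nr_reclaimed += memcg->per_pass;
	}
	return 0;
}
```

In this model a caller sees 0 only when the request was satisfied, so a
write that was cut short is never reported as a success.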


>  			reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
>  					  MEMCG_RECLAIM_PROACTIVE;
>  			reclaimed = try_to_free_mem_cgroup_pages(memcg,
> -- 
> 2.43.0
> 
> 

^ permalink raw reply

* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: Christian König @ 2026-06-24 13:23 UTC (permalink / raw)
  To: Kaitao Cheng, David Laight
  Cc: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt, David Howells,
	Simona Vetter, Randy Dunlap, Luca Ceresoli, Philipp Stanner,
	linux-block, linux-kernel, cgroups, linux-ntfs-dev, linux-fsdevel,
	io-uring, audit, bpf, netdev, dri-devel, linux-perf-users,
	linux-trace-kernel, kexec, live-patching, linux-modules,
	linux-crypto, linux-pm, rcu, sched-ext, linux-mm, virtualization,
	damon, llvm, Kaitao Cheng
In-Reply-To: <351a6b67-b394-4c58-aee2-88b6c8089ad5@linux.dev>

On 6/24/26 15:14, Kaitao Cheng wrote:
> 
> 
> On 2026/6/22 16:42, David Laight wrote:
>> On Mon, 22 Jun 2026 12:05:31 +0800
>> Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
>>
>>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>
>>> The list_for_each*_safe() helpers are used when the loop body may
>>> remove the current entry.  Their API exposes the temporary cursor at
>>> every call site, even though most users only need it for the iterator
>>> implementation and never reference it in the loop body.
>>>
>>> Add *_mutable() variants for list and hlist iteration.  The new helpers
>>> support both forms: callers may keep passing an explicit temporary cursor
>>> when they need to inspect or reset it, or omit it and let the helper use
>>> a unique internal cursor.
>>
>> I'm not really sure 'mutable' means anything either.
>> It is possible to make it valid for the loop body (or even other threads)
>> to delete arbitrary list items - but that needs significant extra overheads.
>>
>> It might be worth doing something that doesn't need the extra variable,
>> but there is little point doing all the churn just to rename things.
>>
>>>
>>> This makes call sites that only mutate the list through the current entry
>>> less noisy, while keeping the existing *_safe() helpers available for
>>> compatibility.
>>>
>>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
>>> ---
>>>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>>>  1 file changed, 231 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/include/linux/list.h b/include/linux/list.h
>>> index 09d979976b3b..1081def7cea9 100644
>>> --- a/include/linux/list.h
>>> +++ b/include/linux/list.h
>>> @@ -7,6 +7,7 @@
>>>  #include <linux/stddef.h>
>>>  #include <linux/poison.h>
>>>  #include <linux/const.h>
>>> +#include <linux/args.h>
>>>  
>>>  #include <asm/barrier.h>
>>>  
>>> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>>>  #define list_for_each_prev(pos, head) \
>>>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>>>  
>>> -/**
>>> - * list_for_each_safe - iterate over a list safe against removal of list entry
>>> - * @pos:	the &struct list_head to use as a loop cursor.
>>> - * @n:		another &struct list_head to use as temporary storage
>>> - * @head:	the head for your list.
>>> +/*
>>> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>>>   */
>>>  #define list_for_each_safe(pos, n, head) \
>>>  	for (pos = (head)->next, n = pos->next; \
>>>  	     !list_is_head(pos, (head)); \
>>>  	     pos = n, n = pos->next)
>>>  
>>> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
>>> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\
>>
>> Use auto
>>
>>> +	     !list_is_head(pos, (head));				\
>>> +	     pos = tmp, tmp = pos->next)
>>> +
>>> +#define __list_for_each_mutable1(pos, head)				\
>>> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
>>> +
>>> +#define __list_for_each_mutable2(pos, next, head)			\
>>> +	list_for_each_safe(pos, next, head)
>>> +
>>>  /**
>>> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
>>> + * list_for_each_mutable - iterate over a list safe against entry removal
>>>   * @pos:	the &struct list_head to use as a loop cursor.
>>> - * @n:		another &struct list_head to use as temporary storage
>>> - * @head:	the head for your list.
>>> + * @...:	either (head) or (next, head)
>>> + *
>>> + * next:	another &struct list_head to use as optional temporary storage.
>>> + *		The temporary cursor is internal unless explicitly supplied by
>>> + *		the caller.
>>> + * head:	the head for your list.
>>> + */
>>> +#define list_for_each_mutable(pos, ...)					\
>>> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
>>> +		(pos, __VA_ARGS__)
>>
>> The variable argument count logic really just slows down compilation.
>> Maybe there aren't enough copies of this code to make that significant.
>> But just because you can do it doesn't mean it is a good idea.
>> I'm also not sure it really adds anything to the readability.
>>
>> And, if you are going to make the middle argument optional there is
>> no need to change the macro name.
> 
> Christian König and Jani Nikula also disagree with the variadic-argument
> implementation approach. If we abandon that method, it means we will
> inevitably need to add some new macros. If mutable is not a good name,
> suggestions for better alternatives would be welcome; coming up with a
> suitable name is indeed rather tricky.

I don't think you need to add a new macro for the specific use case where people want to modify the next element of the iteration.

If I remember your numbers correctly, that is really a corner case, and continuing to use the existing *_safe() macros for that sounds perfectly fine to me.

Regards,
Christian.

^ permalink raw reply

* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: Kaitao Cheng @ 2026-06-24 13:14 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt,
	Christian König, David Howells, Simona Vetter, Randy Dunlap,
	Luca Ceresoli, Philipp Stanner, linux-block, linux-kernel,
	cgroups, linux-ntfs-dev, linux-fsdevel, io-uring, audit, bpf,
	netdev, dri-devel, linux-perf-users, linux-trace-kernel, kexec,
	live-patching, linux-modules, linux-crypto, linux-pm, rcu,
	sched-ext, linux-mm, virtualization, damon, llvm, Kaitao Cheng
In-Reply-To: <20260622094242.64531b9a@pumpkin>



On 2026/6/22 16:42, David Laight wrote:
> On Mon, 22 Jun 2026 12:05:31 +0800
> Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
> 
>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>>
>> The list_for_each*_safe() helpers are used when the loop body may
>> remove the current entry.  Their API exposes the temporary cursor at
>> every call site, even though most users only need it for the iterator
>> implementation and never reference it in the loop body.
>>
>> Add *_mutable() variants for list and hlist iteration.  The new helpers
>> support both forms: callers may keep passing an explicit temporary cursor
>> when they need to inspect or reset it, or omit it and let the helper use
>> a unique internal cursor.
> 
> I'm not really sure 'mutable' means anything either.
> It is possible to make it valid for the loop body (or even other threads)
> to delete arbitrary list items - but that needs significant extra overheads.
> 
> It might be worth doing something that doesn't need the extra variable,
> but there is little point doing all the churn just to rename things.
> 
>>
>> This makes call sites that only mutate the list through the current entry
>> less noisy, while keeping the existing *_safe() helpers available for
>> compatibility.
>>
>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
>> ---
>>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>>  1 file changed, 231 insertions(+), 38 deletions(-)
>>
>> diff --git a/include/linux/list.h b/include/linux/list.h
>> index 09d979976b3b..1081def7cea9 100644
>> --- a/include/linux/list.h
>> +++ b/include/linux/list.h
>> @@ -7,6 +7,7 @@
>>  #include <linux/stddef.h>
>>  #include <linux/poison.h>
>>  #include <linux/const.h>
>> +#include <linux/args.h>
>>  
>>  #include <asm/barrier.h>
>>  
>> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>>  #define list_for_each_prev(pos, head) \
>>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>>  
>> -/**
>> - * list_for_each_safe - iterate over a list safe against removal of list entry
>> - * @pos:	the &struct list_head to use as a loop cursor.
>> - * @n:		another &struct list_head to use as temporary storage
>> - * @head:	the head for your list.
>> +/*
>> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>>   */
>>  #define list_for_each_safe(pos, n, head) \
>>  	for (pos = (head)->next, n = pos->next; \
>>  	     !list_is_head(pos, (head)); \
>>  	     pos = n, n = pos->next)
>>  
>> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
>> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\
> 
> Use auto
> 
>> +	     !list_is_head(pos, (head));				\
>> +	     pos = tmp, tmp = pos->next)
>> +
>> +#define __list_for_each_mutable1(pos, head)				\
>> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
>> +
>> +#define __list_for_each_mutable2(pos, next, head)			\
>> +	list_for_each_safe(pos, next, head)
>> +
>>  /**
>> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
>> + * list_for_each_mutable - iterate over a list safe against entry removal
>>   * @pos:	the &struct list_head to use as a loop cursor.
>> - * @n:		another &struct list_head to use as temporary storage
>> - * @head:	the head for your list.
>> + * @...:	either (head) or (next, head)
>> + *
>> + * next:	another &struct list_head to use as optional temporary storage.
>> + *		The temporary cursor is internal unless explicitly supplied by
>> + *		the caller.
>> + * head:	the head for your list.
>> + */
>> +#define list_for_each_mutable(pos, ...)					\
>> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
>> +		(pos, __VA_ARGS__)
> 
> The variable argument count logic really just slows down compilation.
> Maybe there aren't enough copies of this code to make that significant.
> But just because you can do it doesn't mean it is a good idea.
> I'm also not sure it really adds anything to the readability.
> 
> And, if you are going to make the middle argument optional there is
> no need to change the macro name.

Christian König and Jani Nikula also disagree with the variadic-argument
implementation approach. If we abandon that method, it means we will
inevitably need to add some new macros. If mutable is not a good name,
suggestions for better alternatives would be welcome; coming up with a
suitable name is indeed rather tricky.
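
For reference, the argument-count dispatch being debated can be sketched
in userspace as follows. The SK_* helpers are simplified stand-ins
written for this illustration (the patch itself uses COUNT_ARGS() and
CONCATENATE()), and sk_sum() is a hypothetical macro whose middle
argument is optional, in the same way the temporary cursor is optional
in list_for_each_mutable().

```c
#include <assert.h>

/* Count up to three arguments by shifting a descending list of numbers. */
#define SK_COUNT_N(_1, _2, _3, n, ...) n
#define SK_COUNT_ARGS(...) SK_COUNT_N(__VA_ARGS__, 3, 2, 1, 0)

/* Two-level paste so the count is expanded before it is glued on. */
#define SK_CAT_(a, b) a##b
#define SK_CAT(a, b) SK_CAT_(a, b)

/* One implementation per arity, selected by the pasted suffix. */
#define sk_sum_impl1(a, c)    ((a) + (c))
#define sk_sum_impl2(a, b, c) ((a) + (b) + (c))

/* Hypothetical front end: sk_sum(a, c) or sk_sum(a, b, c). */
#define sk_sum(a, ...) \
	SK_CAT(sk_sum_impl, SK_COUNT_ARGS(__VA_ARGS__))(a, __VA_ARGS__)

/* Both forms resolve entirely in the preprocessor. */
static_assert(sk_sum(1, 2) == 3, "two-argument form");
static_assert(sk_sum(1, 2, 3) == 6, "three-argument form");

static int two_arg_form(void)   { return sk_sum(1, 2); }
static int three_arg_form(void) { return sk_sum(1, 2, 3); }
```

Every use of the front-end macro goes through the counting and pasting
steps before the real loop macro is expanded, which is the preprocessing
cost raised in the review.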

-- 
Thanks
Kaitao Cheng


^ permalink raw reply

* [PATCH RFC 4/4] mm/slab: serialize defer_free_barrier()
From: Harry Yoo (Oracle) @ 2026-06-24 13:11 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Vlastimil Babka, Hao Li,
	Christoph Lameter, David Rientjes, Alexei Starovoitov,
	Pedro Falcato
  Cc: cgroups, linux-mm, linux-kernel, bpf
In-Reply-To: <20260624-kmalloc-nolock-fixes-v1-0-fdf4d17351dd@kernel.org>

irq_work_sync() uses rcuwait instead of busy waiting in two cases:

  1. The kernel is using PREEMPT_RT and the irq work does not run in a
     hardirq context.

  2. The architecture cannot send inter-processor interrupts to make
     busy waiting reasonably short.

However, rcuwait.h says:
> The caller is responsible for locking around rcuwait_wait_event(),
> and [prepare_to/finish]_rcuwait() such that writes to @task are
> properly serialized.

Since defer_free_barrier() calls irq_work_sync() without any locks,
it can potentially cause a hang as writes to @task are not serialized.

Fix this by calling defer_free_barrier() under slab_mutex and
cpus_read_lock() and add lockdep asserts.

Now that defer_free_barrier() is called inside cpus_read_lock(), iterate
over online cpus instead of possible cpus.

Reported-by: Sashiko <sashiko+bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260615-kfree_rcu_nolock-v3-0-70a54f3775bb%40kernel.org?part=5
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Cc: stable@vger.kernel.org
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
 mm/slab_common.c | 5 ++---
 mm/slub.c        | 6 +++++-
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 388eb5980859..27f77273fabe 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -550,11 +550,10 @@ void kmem_cache_destroy(struct kmem_cache *s)
 		rcu_barrier();
 	}
 
-	/* Wait for deferred work from kmalloc/kfree_nolock() */
-	defer_free_barrier();
-
 	cpus_read_lock();
 	mutex_lock(&slab_mutex);
+	/* Wait for deferred work from kmalloc/kfree_nolock() */
+	defer_free_barrier();
 
 	s->refcount--;
 	if (s->refcount) {
diff --git a/mm/slub.c b/mm/slub.c
index 4a3618e3967e..52c8d3f33782 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -6411,7 +6411,11 @@ void defer_free_barrier(void)
 {
 	int cpu;
 
-	for_each_possible_cpu(cpu)
+	/* irq_work_sync() may use rcuwait that requires serialization */
+	lockdep_assert_held(&slab_mutex);
+	lockdep_assert_cpus_held();
+
+	for_each_online_cpu(cpu)
 		irq_work_sync(&per_cpu_ptr(&defer_free_objects, cpu)->work);
 }
 

-- 
2.53.0


^ permalink raw reply related
