Linux cgroups development
 help / color / mirror / Atom feed
* Re: [PATCH v4 4/5] mm/memcontrol: convert memcg to use page_counter_stock
From: Joshua Hahn @ 2026-06-24 18:24 UTC (permalink / raw)
  To: Usama Arif
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, cgroups, linux-mm, linux-kernel, kernel-team
In-Reply-To: <120367a5-0a3c-40ba-a821-f46f8494ef85@linux.dev>

On Wed, 24 Jun 2026 17:43:56 +0100 Usama Arif <usama.arif@linux.dev> wrote:

> 
> 
> On 24/06/2026 16:23, Joshua Hahn wrote:
> > On Wed, 24 Jun 2026 07:43:47 -0700 Usama Arif <usama.arif@linux.dev> wrote:
> > 
> >> On Tue, 23 Jun 2026 11:01:22 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> > 
> > Hello Usama!!
> > 
> > Thank you for reviewing the patch : -)
> > 
> > [...snip...]
> > 
> >>> @@ -2595,7 +2596,6 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
> >>>  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >>>  			    unsigned int nr_pages)
> >>>  {
> >>> -	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
> >>>  	int nr_retries = MAX_RECLAIM_RETRIES;
> >>>  	struct mem_cgroup *mem_over_limit;
> >>>  	struct page_counter *counter;
> >>> @@ -2606,36 +2606,30 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >>>  	bool raised_max_event = false;
> >>>  	unsigned long pflags;
> >>>  	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
> >>> +	unsigned long nr_charged = 0;
> >>>  
> >>>  retry:
> >>> -	if (consume_stock(memcg, nr_pages))
> >>> -		return 0;
> >>> -
> >>> -	if (!allow_spinning)
> >>> -		/* Avoid the refill and flush of the older stock */
> >>> -		batch = nr_pages;
> >>> -
> >>>  	reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
> >>>  	if (do_memsw_account() &&
> >>> -	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
> >>> +	    !page_counter_try_charge_stock(&memcg->memsw, nr_pages,
> >>> +					   &counter, NULL)) {
> >>>  		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
> >>>  		reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP;
> >>>  		goto reclaim;
> >>>  	}
> >>>  
> >>> -	if (page_counter_try_charge(&memcg->memory, batch, &counter))
> >>> -		goto done_restock;
> >>> +	if (page_counter_try_charge_stock(&memcg->memory, nr_pages,
> >>> +					  &counter, &nr_charged)) {
> >>> +		if (!nr_charged)
> >>> +			return 0;
> >>> +		goto handle_high;
> >>> +	}
> >>>  
> >>>  	if (do_memsw_account())
> >>> -		page_counter_uncharge(&memcg->memsw, batch);
> >>> +		page_counter_uncharge(&memcg->memsw, nr_pages);
> >>
> >> This needs a transactional rollback. page_counter_try_charge_stock() can
> >> succeed by consuming memsw stock and charging 0 new pages, but the
> >> memory-failure path unconditionally uncharges nr_pages from memsw.
> >> That turns a failed allocation into a real memsw usage decrement.
> > 
> > Hmmmmmmmmmm....... I'm not sure.
> > 
> > At this point in the code, we are either (1) using cgroup v1 with memsw
> > and charged successfully, or (2) not using cgroup v1 with memsw. So I'm
> > not sure if this really is unconditional, we're just distinguishing
> > between cases (1) and (2) by checking if we're using cgroupv1.
> > 
> > Or is your concern with taking a charge via stock, but uncharging with
> > a hierarchical page_counter walk?
> 
> This was my concern. But I re-read the page_counter stock invariant,
> and the stock-hit case is not an undercount? Consuming stock transfers
> already-charged credit to the pending allocation; if the later memory charge
> fails, page_counter_uncharge() discards that consumed credit from the
> hierarchy. That should keeps usage equal to real charges plus remaining stock?

Yes, stock-hit case just does some math without doing any actual
charging. It's stuff that was pre-charged before, so we're not doing
any undercounting or leaking any charges.

What do you mean by "consumed credit"? From what I can see
page_counter_uncharge --> page_counter_cancel subtracts from
counter->usage, which should be the real charge + hierarchy walk.

Am I missing something :p please feel free to let me know!
Joshua

^ permalink raw reply

* Re: [PATCH v4 4/5] mm/memcontrol: convert memcg to use page_counter_stock
From: Usama Arif @ 2026-06-24 16:43 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, cgroups, linux-mm, linux-kernel, kernel-team
In-Reply-To: <20260624152331.2228828-1-joshua.hahnjy@gmail.com>



On 24/06/2026 16:23, Joshua Hahn wrote:
> On Wed, 24 Jun 2026 07:43:47 -0700 Usama Arif <usama.arif@linux.dev> wrote:
> 
>> On Tue, 23 Jun 2026 11:01:22 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> 
> Hello Usama!!
> 
> Thank you for reviewing the patch : -)
> 
> [...snip...]
> 
>>> @@ -2595,7 +2596,6 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
>>>  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>>  			    unsigned int nr_pages)
>>>  {
>>> -	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
>>>  	int nr_retries = MAX_RECLAIM_RETRIES;
>>>  	struct mem_cgroup *mem_over_limit;
>>>  	struct page_counter *counter;
>>> @@ -2606,36 +2606,30 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>>  	bool raised_max_event = false;
>>>  	unsigned long pflags;
>>>  	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
>>> +	unsigned long nr_charged = 0;
>>>  
>>>  retry:
>>> -	if (consume_stock(memcg, nr_pages))
>>> -		return 0;
>>> -
>>> -	if (!allow_spinning)
>>> -		/* Avoid the refill and flush of the older stock */
>>> -		batch = nr_pages;
>>> -
>>>  	reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
>>>  	if (do_memsw_account() &&
>>> -	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
>>> +	    !page_counter_try_charge_stock(&memcg->memsw, nr_pages,
>>> +					   &counter, NULL)) {
>>>  		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
>>>  		reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP;
>>>  		goto reclaim;
>>>  	}
>>>  
>>> -	if (page_counter_try_charge(&memcg->memory, batch, &counter))
>>> -		goto done_restock;
>>> +	if (page_counter_try_charge_stock(&memcg->memory, nr_pages,
>>> +					  &counter, &nr_charged)) {
>>> +		if (!nr_charged)
>>> +			return 0;
>>> +		goto handle_high;
>>> +	}
>>>  
>>>  	if (do_memsw_account())
>>> -		page_counter_uncharge(&memcg->memsw, batch);
>>> +		page_counter_uncharge(&memcg->memsw, nr_pages);
>>
>> This needs a transactional rollback. page_counter_try_charge_stock() can
>> succeed by consuming memsw stock and charging 0 new pages, but the
>> memory-failure path unconditionally uncharges nr_pages from memsw.
>> That turns a failed allocation into a real memsw usage decrement.
> 
> Hmmmmmmmmmm....... I'm not sure.
> 
> At this point in the code, we are either (1) using cgroup v1 with memsw
> and charged successfully, or (2) not using cgroup v1 with memsw. So I'm
> not sure if this really is unconditional, we're just distinguishing
> between cases (1) and (2) by checking if we're using cgroupv1.
> 
> Or is your concern with taking a charge via stock, but uncharging with
> a hierarchical page_counter walk?

This was my concern. But I re-read the page_counter stock invariant,
and the stock-hit case is not an undercount? Consuming stock transfers
already-charged credit to the pending allocation; if the later memory charge
fails, page_counter_uncharge() discards that consumed credit from the
hierarchy. That should keeps usage equal to real charges plus remaining stock?

> If so, I think there's a case to be
> made here with just simply returning the stock. I just wanted to keep
> it consistent with the original memcontrol code, which only used
> stock to fulfill charges, not uncharges, since this could make the
> stock grow without bound.
> 
> What do you think? Thanks again for reviewing Usama, I hope you have a
> great day!!!
> Joshua


^ permalink raw reply

* Re: [PATCH RFC 0/4] memcg,slab: kmalloc_nolock() fixes
From: Alexei Starovoitov @ 2026-06-24 16:30 UTC (permalink / raw)
  To: Harry Yoo (Oracle), Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, Vlastimil Babka, Hao Li,
	Christoph Lameter, David Rientjes, Alexei Starovoitov,
	Pedro Falcato
  Cc: cgroups, linux-mm, linux-kernel, bpf
In-Reply-To: <20260624-kmalloc-nolock-fixes-v1-0-fdf4d17351dd@kernel.org>

On Wed Jun 24, 2026 at 6:11 AM PDT, Harry Yoo (Oracle) wrote:
>
> Bug 1 was reported by lockdep, and bugs 2 [2] and 3 [3] were
> reported by Sashiko.

... and in fixes for sashiko complains sashiko finds more issues.
I don't think it will ever end. I suggest to fix realistic scenarios
instead of one out of billion cases that sashiko think is plausible
but will never be hit in reality. The chance of server crashing
due to cosmic rays are higher than such bugs. Hence do not fix them.

> To BPF folks: do we need to backport kmalloc_nolock() support
> for architectures without __CMPXCHG_DOUBLE to v6.18?

nope.

> There are still few users in v6.18, but I can't tell whether it is
> necessary to backport it to v6.18 (hopefully not as urgent as other
> bugfixes).

imo none of these 'fixes' are necessary. Humans are not hitting them.


^ permalink raw reply

* [PATCH v3] selftests/cgroup: Adjust cpu test duration based on HZ
From: Joe Simmons-Talbott @ 2026-06-24 16:03 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Shuah Khan
  Cc: Joe Simmons-Talbott, cgroups, linux-kselftest, linux-kernel

For lower HZ values a quota of 1000us is much lower than the amount
of microseconds per tick which makes the tests test_cpucg_max and
test_cpugc_max_nested fail. Increase the test duration to accommodate
for lower HZ values.

Link: https://lore.kernel.org/lkml/20260623194239.GA899029@oak/
Signed-off-by: Joe Simmons-Talbott <joest@redhat.com>
---
v2 -> v3:
- Instead of changing cpu.max quota extend the test duration based on
  the HZ value.
- don't call pclose() if popen() fails.
- check return value of fscanf().

v1 -> v2:
- Try checking /proc/config.gz to get the actual kernel HZ value and
  fallback to 1000 if the value cannot be determined.
 tools/testing/selftests/cgroup/test_cpu.c | 44 ++++++++++++++++++++---
 1 file changed, 40 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/cgroup/test_cpu.c b/tools/testing/selftests/cgroup/test_cpu.c
index 7a40d76b9548..feb7eb6a875c 100644
--- a/tools/testing/selftests/cgroup/test_cpu.c
+++ b/tools/testing/selftests/cgroup/test_cpu.c
@@ -639,6 +639,30 @@ test_cpucg_nested_weight_underprovisioned(const char *root)
 	return run_cpucg_nested_weight_test(root, false);
 }
 
+/*
+ * Best effort attempt to get the kernel's HZ value from the config.
+ * Return the HZ value if found otherwise return -1 to indicate failure.
+ */
+static long
+_get_config_hz(void)
+{
+	long hz = -1;
+	FILE *f;
+	char cmd[256] = "zcat /proc/config.gz 2>/dev/null | grep '^CONFIG_HZ='";
+
+	f = popen(cmd, "r");
+
+	if (!f)
+		return hz;
+
+	if (fscanf(f, "CONFIG_HZ=%ld", &hz) == EOF)
+		goto out;
+
+out:
+	pclose(f);
+	return hz;
+}
+
 /*
  * This test creates a cgroup with some maximum value within a period, and
  * verifies that a process in the cgroup is not overscheduled.
@@ -646,15 +670,21 @@ test_cpucg_nested_weight_underprovisioned(const char *root)
 static int test_cpucg_max(const char *root)
 {
 	int ret = KSFT_FAIL;
+	long hz = _get_config_hz();
 	long quota_usec = 1000;
 	long default_period_usec = 100000; /* cpu.max's default period */
-	long duration_seconds = 1;
+	long duration_seconds;
 
-	long duration_usec = duration_seconds * USEC_PER_SEC;
+	long duration_usec;
 	long usage_usec, n_periods, remainder_usec, expected_usage_usec;
 	char *cpucg;
 	char quota_buf[32];
 
+	if (hz == -1)
+		hz = 1000;
+	duration_seconds = 1000 / hz;
+	duration_usec = duration_seconds * USEC_PER_SEC;
+
 	snprintf(quota_buf, sizeof(quota_buf), "%ld", quota_usec);
 
 	cpucg = cg_name(root, "cpucg_test");
@@ -710,15 +740,21 @@ static int test_cpucg_max(const char *root)
 static int test_cpucg_max_nested(const char *root)
 {
 	int ret = KSFT_FAIL;
+	long hz = _get_config_hz();
 	long quota_usec = 1000;
 	long default_period_usec = 100000; /* cpu.max's default period */
-	long duration_seconds = 1;
+	long duration_seconds;
 
-	long duration_usec = duration_seconds * USEC_PER_SEC;
+	long duration_usec;
 	long usage_usec, n_periods, remainder_usec, expected_usage_usec;
 	char *parent, *child;
 	char quota_buf[32];
 
+	if (hz == -1)
+		hz = 1000;
+	duration_seconds = 1000 / hz;
+	duration_usec = duration_seconds * USEC_PER_SEC;
+
 	snprintf(quota_buf, sizeof(quota_buf), "%ld", quota_usec);
 
 	parent = cg_name(root, "cpucg_parent");
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH-next v5 0/6] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach()
From: Michal Koutný @ 2026-06-24 15:51 UTC (permalink / raw)
  To: Waiman Long
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Peter Zijlstra, cgroups,
	linux-kernel, Aaron Tomlin, Guopeng Zhang
In-Reply-To: <20260602023203.248077-1-longman@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 9453 bytes --]

On Mon, Jun 01, 2026 at 10:31:57PM -0400, Waiman Long <longman@redhat.com> wrote:
> Patch 6 makes the necessary changes to enable the support of multiple
> source and destination cpusets by keeping all the source and destination
> cpusets found during task iterations in two singly linked lists for
> source and destination cpusets respectively.

Thanks for looking into this!
I've played with a coding assistant and produced the following selftest
(it (expectedly) fails on my machine), feel free to include in the
series (if it validates the fix).

-- 8< --
From ed4e6cf91413bb4b64befb1c15412c8cfd205d73 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Michal=20Koutn=C3=BD?= <mkoutny@suse.com>
Date: Wed, 24 Jun 2026 16:39:30 +0200
Subject: [PATCH] selftests/cgroup: Add test for cpuset affinity on controller
 disable
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add a new selftest that exposes a bug in cpuset_attach() where thread
CPU affinity is not properly updated when the cpuset controller is
disabled in a threaded cgroup hierarchy.

The test creates a threaded cgroup hierarchy with two child cgroups
(A and B) having different cpuset.cpus constraints:
- Parent: cpuset.cpus=0-1
- Child A: cpuset.cpus=0-1
- Child B: cpuset.cpus=1 (restricted to CPU 1 only)

A multithreaded process is created with threads placed in different
cgroups. When the cpuset controller is disabled on the parent, thread
affinities should be updated to match the parent's cpuset.

Expected behavior:
- thread_a affinity: {0-1} before and after (unchanged)
- thread_b affinity: {1} before, {0-1} after (expanded)

Current buggy behavior:
- thread_b affinity remains {1} after controller disable

Assisted-by: Claude:claude-sonnet-4-5
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
 tools/testing/selftests/cgroup/test_cpuset.c | 243 +++++++++++++++++++
 1 file changed, 243 insertions(+)

diff --git a/tools/testing/selftests/cgroup/test_cpuset.c b/tools/testing/selftests/cgroup/test_cpuset.c
index c5cf8b56ceb8f..1d72a199ca552 100644
--- a/tools/testing/selftests/cgroup/test_cpuset.c
+++ b/tools/testing/selftests/cgroup/test_cpuset.c
@@ -1,7 +1,13 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#define _GNU_SOURCE
+#include <assert.h>
 #include <linux/limits.h>
+#include <pthread.h>
+#include <sched.h>
 #include <signal.h>
+#include <sys/syscall.h>
+#include <unistd.h>
 
 #include "kselftest.h"
 #include "cgroup_util.h"
@@ -232,6 +238,242 @@ static int test_cpuset_perms_subtree(const char *root)
 	return ret;
 }
 
+static int get_cpu_affinity(cpu_set_t *mask)
+{
+	CPU_ZERO(mask);
+	return sched_getaffinity(0, sizeof(*mask), mask);
+}
+
+static int cpu_set_equal(cpu_set_t *dst, unsigned long mask)
+{
+	cpu_set_t expected;
+
+	CPU_ZERO(&expected);
+	assert(sizeof(mask) < CPU_SETSIZE);
+
+	for (int cpu = 0; cpu < sizeof(mask); ++cpu)
+		if ((1UL << cpu) & mask)
+			CPU_SET(cpu, &expected);
+	
+	return CPU_EQUAL(&expected, dst);
+}
+
+enum test_phase {
+	AFFINITY_SETUP,
+	AFFINITY_THREAD_A_READY,
+	AFFINITY_THREADS_READY,
+	AFFINITY_CONTROLLER_DISABLED,
+	AFFINITY_COMPLETE,
+	AFFINITY_ERROR
+};
+
+struct thread_args {
+	const char *cgroup;
+	cpu_set_t *affinity_before;
+	cpu_set_t *affinity_after;
+	enum test_phase ready_phase;
+};
+
+static pthread_mutex_t test_mutex = PTHREAD_MUTEX_INITIALIZER;
+static pthread_cond_t test_cond = PTHREAD_COND_INITIALIZER;
+static enum test_phase test_phase;
+
+static void *affinity_thread_fn(void *arg)
+{
+	struct thread_args *args = (struct thread_args *)arg;
+
+	if (cg_enter_current_thread(args->cgroup))
+		goto fail;
+
+	if (get_cpu_affinity(args->affinity_before) != 0)
+		goto fail;
+
+	pthread_mutex_lock(&test_mutex);
+	if (test_phase < args->ready_phase)
+		test_phase = args->ready_phase;
+	pthread_cond_broadcast(&test_cond);
+
+	while (test_phase < AFFINITY_CONTROLLER_DISABLED)
+		pthread_cond_wait(&test_cond, &test_mutex);
+	pthread_mutex_unlock(&test_mutex);
+
+	if (get_cpu_affinity(args->affinity_after) != 0)
+		goto fail;
+
+
+	return NULL;
+
+fail:
+	pthread_mutex_lock(&test_mutex);
+	test_phase = AFFINITY_ERROR;
+	pthread_cond_broadcast(&test_cond);
+	pthread_mutex_unlock(&test_mutex);
+	return NULL;
+}
+
+/*
+ * Test that disabling cpuset controller properly updates thread affinity.
+ *
+ * This test exposes a bug in cpuset_attach() where threads in child cgroups
+ * don't get their affinity updated when the cpuset controller is disabled.
+ *
+ * Setup:
+ * - Create parent cgroup with cpuset.cpus=0-1
+ * - Create child A with cpuset.cpus=0-1
+ * - Create child B with cpuset.cpus=1
+ * - Place multithreaded process: group leader + thread_a in A, thread_b in B
+ * - Disable cpuset controller on parent
+ *
+ * Expected: thread_b's affinity should expand from {1} to {0-1}
+ * Buggy: thread_b's affinity remains {1}
+ */
+static int test_cpuset_affinity_on_controller_disable(const char *root)
+{
+	char *parent = NULL, *child_a = NULL, *child_b = NULL;
+	pthread_t thread_a, thread_b;
+	int thread_a_created = 0, thread_b_created = 0;
+	cpu_set_t affinity_a_before, affinity_a_after;
+	cpu_set_t affinity_b_before, affinity_b_after;
+	int ret = KSFT_FAIL;
+
+	parent = cg_name(root, "cpuset_affinity_test");
+	if (!parent)
+		goto cleanup;
+	if (cg_create(parent))
+		goto cleanup;
+	if (cg_write(parent, "cgroup.type", "threaded"))
+		goto cleanup;
+
+	child_a = cg_name(parent, "A");
+	if (!child_a)
+		goto cleanup;
+	if (cg_create(child_a))
+		goto cleanup;
+	if (cg_write(child_a, "cgroup.type", "threaded"))
+		goto cleanup;
+
+	child_b = cg_name(parent, "B");
+	if (!child_b)
+		goto cleanup;
+	if (cg_create(child_b))
+		goto cleanup;
+	if (cg_write(child_b, "cgroup.type", "threaded"))
+		goto cleanup;
+
+	/* Now enable cpuset controller in parent */
+	if (cg_write(parent, "cgroup.subtree_control", "+cpuset")) {
+		ret = KSFT_SKIP;
+		goto cleanup;
+	}
+
+	/* Set CPU affinity constraints */
+	if (cg_write(parent, "cpuset.cpus", "0-1"))
+		goto cleanup;
+	if (cg_write(child_a, "cpuset.cpus", "0-1"))
+		goto cleanup;
+	if (cg_write(child_b, "cpuset.cpus", "1"))
+		goto cleanup;
+
+	/* Move group leader (main thread) to child A */
+	if (cg_enter_current(child_a))
+		goto cleanup;
+
+	/* Create threads - they will move themselves to their respective cgroups */
+	test_phase = AFFINITY_SETUP;
+
+	struct thread_args args_a = {
+		.cgroup = child_a,
+		.affinity_before = &affinity_a_before,
+		.affinity_after = &affinity_a_after,
+		.ready_phase = AFFINITY_THREAD_A_READY,
+	};
+	if (pthread_create(&thread_a, NULL, affinity_thread_fn, &args_a))
+		goto cleanup;
+	thread_a_created = 1;
+
+	struct thread_args args_b = {
+		.cgroup = child_b,
+		.affinity_before = &affinity_b_before,
+		.affinity_after = &affinity_b_after,
+		.ready_phase = AFFINITY_THREADS_READY,
+	};
+	if (pthread_create(&thread_b, NULL, affinity_thread_fn, &args_b))
+		goto cleanup_threads;
+	thread_b_created = 1;
+
+	pthread_mutex_lock(&test_mutex);
+	while (test_phase < AFFINITY_THREADS_READY)
+		pthread_cond_wait(&test_cond, &test_mutex);
+
+	/* If a thread failed during setup, bail out */
+	if (test_phase == AFFINITY_ERROR) {
+		pthread_mutex_unlock(&test_mutex);
+		goto cleanup_threads;
+	}
+	pthread_mutex_unlock(&test_mutex);
+
+	if (!cpu_set_equal(&affinity_a_before, 0x3)) {
+		ksft_print_msg("FAIL: thread_a initial affinity incorrect\n");
+		goto cleanup_threads;
+	}
+
+	if (!cpu_set_equal(&affinity_b_before, 0x2)) {
+		ksft_print_msg("FAIL: thread_b initial affinity incorrect\n");
+		goto cleanup_threads;
+	}
+
+	/* Disable cpuset controller - this should trigger affinity update */
+	if (cg_write(parent, "cgroup.subtree_control", "-cpuset"))
+		goto cleanup_threads;
+
+	/* Signal threads to save their final affinity and exit */
+	pthread_mutex_lock(&test_mutex);
+	test_phase = AFFINITY_CONTROLLER_DISABLED;
+	pthread_cond_broadcast(&test_cond);
+	pthread_mutex_unlock(&test_mutex);
+
+	pthread_join(thread_a, NULL);
+	pthread_join(thread_b, NULL);
+
+	/* Verify thread affinities AFTER disabling controller */
+	if (!cpu_set_equal(&affinity_a_after, 0x3)) {
+		ksft_print_msg("FAIL: thread_a final affinity incorrect\n");
+		goto cleanup;
+	}
+
+	if (!cpu_set_equal(&affinity_b_after, 0x3)) {
+		ksft_print_msg("FAIL: thread_b affinity did not expand to {0-1}\n");
+		goto cleanup;
+	}
+
+	ret = KSFT_PASS;
+	goto cleanup;
+
+cleanup_threads:
+	pthread_mutex_lock(&test_mutex);
+	test_phase = AFFINITY_COMPLETE;
+	pthread_cond_broadcast(&test_cond);
+	pthread_mutex_unlock(&test_mutex);
+
+	if (thread_a_created)
+		pthread_join(thread_a, NULL);
+	if (thread_b_created)
+		pthread_join(thread_b, NULL);
+
+cleanup:
+	/* Move back to root before cleanup */
+	cg_enter_current(root);
+
+	cg_destroy(child_b);
+	free(child_b);
+	cg_destroy(child_a);
+	free(child_a);
+	cg_destroy(parent);
+	free(parent);
+
+	return ret;
+}
+
 
 #define T(x) { x, #x }
 struct cpuset_test {
@@ -241,6 +483,7 @@ struct cpuset_test {
 	T(test_cpuset_perms_object_allow),
 	T(test_cpuset_perms_object_deny),
 	T(test_cpuset_perms_subtree),
+	T(test_cpuset_affinity_on_controller_disable),
 };
 #undef T
 
-- 
2.54.0


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply related

* Re: [PATCH-next v5 6/6] cgroup/cpuset: Support multiple source/destination cpusets for cpuset_*attach()
From: Michal Koutný @ 2026-06-24 15:45 UTC (permalink / raw)
  To: Waiman Long
  Cc: Chen Ridong, Tejun Heo, Johannes Weiner, Peter Zijlstra, cgroups,
	linux-kernel, Aaron Tomlin, Guopeng Zhang
In-Reply-To: <20260602023203.248077-7-longman@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 3179 bytes --]

Hello Waiman.

On Mon, Jun 01, 2026 at 10:32:03PM -0400, Waiman Long <longman@redhat.com> wrote:
> This problem is less an issue when enabling the cpuset controller as all
> the newly created child cpusets will have exactly the same set of CPUs
> and memory nodes except when deadline tasks are involved in migration
> as the deadline task accounting data can be off.
> 
> It can be more problematic when the cpuset controller is disabled as
> their set of CPUs and memory nodes may differ from their parent or with
> the moving of multi-threaded process from different threaded cgroups.

When I generalize that it can be an issue for any threaded controller
that somehow relies on the _difference_ between old and new thread
membership.

So I checked some: pids and perf_events look alright (no
diff-dependency) but I noticed the very same issue is tackled in
sched_change_group/scx_cgroup_move_task and that there is a member
inside task_struct allocated for this state tracking already:
  task_struct::scx::cgrp_moving_from

> Fix that by tracking the set of source (old) and destination cpusets
> in singly linked lists and iterating them all to properly update the
> internal data. Also keep the current cs and oldcs variables up-to-date
> with the css and task iterators.

So there would be more than a single use for something conceptually
like:

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 004e6d56a499a..740c02f220c75 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1326,6 +1326,9 @@ struct task_struct {
 #ifdef CONFIG_PREEMPT_RT
        struct llist_node               cg_dead_lnode;
 #endif /* CONFIG_PREEMPT_RT */
+#ifdef CONFIG_CGROUPS_MOVING_FROM
+       struct cgroup                   *cgrp_moving_from;
+#endif
 #endif /* CONFIG_CGROUPS */
 #ifdef CONFIG_X86_CPU_RESCTRL
        u32                             closid;
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 1a3af2ea2a794..5b63afe83f333 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -240,9 +240,6 @@ struct sched_ext_entity {
        bool                    disallow;       /* reject switching into SCX */
 
        /* cold fields */
-#ifdef CONFIG_EXT_GROUP_SCHED
-       struct cgroup           *cgrp_moving_from;
-#endif
        struct list_head        tasks_node;
 };
 
diff --git a/init/Kconfig b/init/Kconfig
index 2937c4d308aec..d7e7d4477f862 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1186,6 +1186,7 @@ config EXT_GROUP_SCHED
        depends on SCHED_CLASS_EXT && CGROUP_SCHED
        select GROUP_SCHED_WEIGHT
        select GROUP_SCHED_BANDWIDTH
+       select CGROUPS_MOVING_FROM
        default y
 
 endif #CGROUP_SCHED
@@ -1288,6 +1289,7 @@ config CPUSETS
        depends on SMP
        select UNION_FIND
        select CPU_ISOLATION
+       select CGROUPS_MOVING_FROM
        help
          This option will let you create and manage CPUSETs which
          allow dynamically partitioning a system into sets of CPUs and

I think this could simplify the before-after state tracking generally,
WDYT?

Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply related

* Re: [PATCH v4 4/5] mm/memcontrol: convert memcg to use page_counter_stock
From: Joshua Hahn @ 2026-06-24 15:23 UTC (permalink / raw)
  To: Usama Arif
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, cgroups, linux-mm, linux-kernel, kernel-team
In-Reply-To: <20260624144348.4117578-1-usama.arif@linux.dev>

On Wed, 24 Jun 2026 07:43:47 -0700 Usama Arif <usama.arif@linux.dev> wrote:

> On Tue, 23 Jun 2026 11:01:22 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

Hello Usama!!

Thank you for reviewing the patch : -)

[...snip...]

> > @@ -2595,7 +2596,6 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
> >  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  			    unsigned int nr_pages)
> >  {
> > -	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
> >  	int nr_retries = MAX_RECLAIM_RETRIES;
> >  	struct mem_cgroup *mem_over_limit;
> >  	struct page_counter *counter;
> > @@ -2606,36 +2606,30 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	bool raised_max_event = false;
> >  	unsigned long pflags;
> >  	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
> > +	unsigned long nr_charged = 0;
> >  
> >  retry:
> > -	if (consume_stock(memcg, nr_pages))
> > -		return 0;
> > -
> > -	if (!allow_spinning)
> > -		/* Avoid the refill and flush of the older stock */
> > -		batch = nr_pages;
> > -
> >  	reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
> >  	if (do_memsw_account() &&
> > -	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
> > +	    !page_counter_try_charge_stock(&memcg->memsw, nr_pages,
> > +					   &counter, NULL)) {
> >  		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
> >  		reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP;
> >  		goto reclaim;
> >  	}
> >  
> > -	if (page_counter_try_charge(&memcg->memory, batch, &counter))
> > -		goto done_restock;
> > +	if (page_counter_try_charge_stock(&memcg->memory, nr_pages,
> > +					  &counter, &nr_charged)) {
> > +		if (!nr_charged)
> > +			return 0;
> > +		goto handle_high;
> > +	}
> >  
> >  	if (do_memsw_account())
> > -		page_counter_uncharge(&memcg->memsw, batch);
> > +		page_counter_uncharge(&memcg->memsw, nr_pages);
> 
> This needs a transactional rollback. page_counter_try_charge_stock() can
> succeed by consuming memsw stock and charging 0 new pages, but the
> memory-failure path unconditionally uncharges nr_pages from memsw.
> That turns a failed allocation into a real memsw usage decrement.

Hmmmmmmmmmm....... I'm not sure.

At this point in the code, we are either (1) using cgroup v1 with memsw
and charged successfully, or (2) not using cgroup v1 with memsw. So I'm
not sure if this really is unconditional, we're just distinguishing
between cases (1) and (2) by checking if we're using cgroupv1.

Or is your concern with taking a charge via stock, but uncharging with
a hierarchical page_counter walk? If so, I think there's a case to be
made here with just simply returning the stock. I just wanted to keep
it consistent with the original memcontrol code, which only used
stock to fulfill charges, not uncharges, since this could make the
stock grow without bound.

What do you think? Thanks again for reviewing Usama, I hope you have a
great day!!!
Joshua

^ permalink raw reply

* Re: [PATCH v4 4/5] mm/memcontrol: convert memcg to use page_counter_stock
From: Usama Arif @ 2026-06-24 14:43 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Usama Arif, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, cgroups, linux-mm, linux-kernel, kernel-team
In-Reply-To: <20260623180124.868655-5-joshua.hahnjy@gmail.com>

On Tue, 23 Jun 2026 11:01:22 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> Now with all of the memcg_stock handling logic replicated in
> page_counter_stock, switch memcg to use the page_counter_stock for the
> memory (and for cgroup v1 users, memsw) page_counters.
> 
> There are a few details that have changed:
> 
> First, the old special-casing for the !allow_spinning check to avoid
> refilling and flushing of the old stock is removed. This special casing
> was important previously, because refilling the stock could do a lot of
> extra work by evicting one of 7 random victim memcgs in the percpu
> memcg_stock slots. In the new per-counter design, refilling stock just
> adds pages to the counter's own local cache without affecting other memcgs,
> so the original reason for the special case no longer applies.
> 
> Also, we can now fail during page_counter_alloc_stock(), if there is
> not enough memory to allocate a percpu page_counter_stock. This failure
> is rare and nonfatal; the system can continue to operate, with the page
> counter working without stock and falling back to walking the hierarchy.
> 
> drain_all_stock and memcg_hotplug_cpu_dead also now use the page_counter
> stock drain variant, which uses remote atomic_xchg to retrieve stock
> across CPUs, instead of scheduling asynchronous work.
> 
> Finally, as a side-effect of separating the per-memcg stock to per-
> page_counter, the memsw and memory page_counters have independent stock.
> This means that the reported memsw may transiently be lower than memory
> usage if the stock for memory and memsw page_counters go out of sync.
> 
> Note that obj_stock is untouched by this change.
> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> ---
>  mm/memcontrol.c | 87 +++++++++++++++++++++++--------------------------
>  1 file changed, 41 insertions(+), 46 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 306658fd55512..846800917af49 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2269,39 +2269,36 @@ static void schedule_drain_work(int cpu, struct work_struct *work)
>  		queue_work_on(cpu, memcg_wq, work);
>  }
>  
> +static void memcg_drain_stock(struct mem_cgroup *memcg, int cpu)
> +{
> +	page_counter_drain_stock(&memcg->memory, cpu);
> +	if (do_memsw_account())
> +		page_counter_drain_stock(&memcg->memsw, cpu);
> +}
> +
>  /*
>   * Drains all per-CPU charge caches for given root_memcg resp. subtree
>   * of the hierarchy under it.
>   */
>  void drain_all_stock(struct mem_cgroup *root_memcg)
>  {
> +	struct mem_cgroup *memcg;
>  	int cpu, curcpu;
>  
>  	/* If someone's already draining, avoid adding running more workers. */
>  	if (!mutex_trylock(&percpu_charge_mutex))
>  		return;
> -	/*
> -	 * Notify other cpus that system-wide "drain" is running
> -	 * We do not care about races with the cpu hotplug because cpu down
> -	 * as well as workers from this path always operate on the local
> -	 * per-cpu data. CPU up doesn't touch memcg_stock at all.
> -	 */
> +
> +	for_each_mem_cgroup_tree(memcg, root_memcg) {
> +		for_each_online_cpu(cpu)
> +			memcg_drain_stock(memcg, cpu);
> +	}
> +
>  	migrate_disable();
>  	curcpu = smp_processor_id();
>  	for_each_online_cpu(cpu) {
> -		struct memcg_stock_pcp *memcg_st = &per_cpu(memcg_stock, cpu);
>  		struct obj_stock_pcp *obj_st = &per_cpu(obj_stock, cpu);
>  
> -		if (!test_bit(FLUSHING_CACHED_CHARGE, &memcg_st->flags) &&
> -		    is_memcg_drain_needed(memcg_st, root_memcg) &&
> -		    !test_and_set_bit(FLUSHING_CACHED_CHARGE,
> -				      &memcg_st->flags)) {
> -			if (cpu == curcpu)
> -				drain_local_memcg_stock(&memcg_st->work);
> -			else
> -				schedule_drain_work(cpu, &memcg_st->work);
> -		}
> -
>  		if (!test_bit(FLUSHING_CACHED_CHARGE, &obj_st->flags) &&
>  		    obj_stock_flush_required(obj_st, root_memcg) &&
>  		    !test_and_set_bit(FLUSHING_CACHED_CHARGE,
> @@ -2318,9 +2315,13 @@ void drain_all_stock(struct mem_cgroup *root_memcg)
>  
>  static int memcg_hotplug_cpu_dead(unsigned int cpu)
>  {
> +	struct mem_cgroup *memcg;
> +
>  	/* no need for the local lock */
>  	drain_obj_stock(&per_cpu(obj_stock, cpu));
> -	drain_stock_fully(&per_cpu(memcg_stock, cpu));
> +
> +	for_each_mem_cgroup(memcg)
> +		memcg_drain_stock(memcg, cpu);
>  
>  	return 0;
>  }
> @@ -2595,7 +2596,6 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
>  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  			    unsigned int nr_pages)
>  {
> -	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
>  	int nr_retries = MAX_RECLAIM_RETRIES;
>  	struct mem_cgroup *mem_over_limit;
>  	struct page_counter *counter;
> @@ -2606,36 +2606,30 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	bool raised_max_event = false;
>  	unsigned long pflags;
>  	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
> +	unsigned long nr_charged = 0;
>  
>  retry:
> -	if (consume_stock(memcg, nr_pages))
> -		return 0;
> -
> -	if (!allow_spinning)
> -		/* Avoid the refill and flush of the older stock */
> -		batch = nr_pages;
> -
>  	reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
>  	if (do_memsw_account() &&
> -	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
> +	    !page_counter_try_charge_stock(&memcg->memsw, nr_pages,
> +					   &counter, NULL)) {
>  		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
>  		reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP;
>  		goto reclaim;
>  	}
>  
> -	if (page_counter_try_charge(&memcg->memory, batch, &counter))
> -		goto done_restock;
> +	if (page_counter_try_charge_stock(&memcg->memory, nr_pages,
> +					  &counter, &nr_charged)) {
> +		if (!nr_charged)
> +			return 0;
> +		goto handle_high;
> +	}
>  
>  	if (do_memsw_account())
> -		page_counter_uncharge(&memcg->memsw, batch);
> +		page_counter_uncharge(&memcg->memsw, nr_pages);

This needs a transactional rollback. page_counter_try_charge_stock() can
succeed by consuming memsw stock and charging 0 new pages, but the
memory-failure path unconditionally uncharges nr_pages from memsw.
That turns a failed allocation into a real memsw usage decrement.


>  	mem_over_limit = mem_cgroup_from_counter(counter, memory);
>  
>  reclaim:
> -	if (batch > nr_pages) {
> -		batch = nr_pages;
> -		goto retry;
> -	}
> -
>  	/*
>  	 * Prevent unbounded recursion when reclaim operations need to
>  	 * allocate memory. This might exceed the limits temporarily,
> @@ -2731,10 +2725,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  
>  	return 0;
>  
> -done_restock:
> -	if (batch > nr_pages)
> -		refill_stock(memcg, batch - nr_pages);
> -
> +handle_high:
>  	/*
>  	 * If the hierarchy is above the normal consumption range, schedule
>  	 * reclaim on returning to userland.  We can perform reclaim here
> @@ -2771,7 +2762,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  			 * and distribute reclaim work and delay penalties
>  			 * based on how much each task is actually allocating.
>  			 */
> -			current->memcg_nr_pages_over_high += batch;
> +			current->memcg_nr_pages_over_high += nr_charged;
>  			set_notify_resume(current);
>  			break;
>  		}
> @@ -3076,7 +3067,7 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
>  	account_kmem_nmi_safe(memcg, -nr_pages);
>  	memcg1_account_kmem(memcg, -nr_pages);
>  	if (!mem_cgroup_is_root(memcg))
> -		refill_stock(memcg, nr_pages);
> +		memcg_uncharge(memcg, nr_pages);
>  
>  	css_put(&memcg->css);
>  }
> @@ -4080,6 +4071,8 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
>  
>  static void mem_cgroup_free(struct mem_cgroup *memcg)
>  {
> +	page_counter_free_stock(&memcg->memory);
> +	page_counter_free_stock(&memcg->memsw);
>  	lru_gen_exit_memcg(memcg);
>  	memcg_wb_domain_exit(memcg);
>  	__mem_cgroup_free(memcg);
> @@ -4247,6 +4240,11 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  	refcount_set(&memcg->id.ref, 1);
>  	css_get(css);
>  
> +	/* failure is nonfatal, charges fall back to direct hierarchy */
> +	page_counter_alloc_stock(&memcg->memory, MEMCG_CHARGE_BATCH);
> +	if (do_memsw_account())
> +		page_counter_alloc_stock(&memcg->memsw, MEMCG_CHARGE_BATCH);
> +
>  	/*
>  	 * Ensure mem_cgroup_from_private_id() works once we're fully online.
>  	 *
> @@ -5502,7 +5500,7 @@ void mem_cgroup_sk_uncharge(const struct sock *sk, unsigned int nr_pages)
>  
>  	mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages);
>  
> -	refill_stock(memcg, nr_pages);
> +	page_counter_uncharge(&memcg->memory, nr_pages);
>  }
>  
>  void mem_cgroup_flush_workqueue(void)
> @@ -5555,12 +5553,9 @@ int __init mem_cgroup_init(void)
>  	memcg_wq = alloc_workqueue("memcg", WQ_PERCPU, 0);
>  	WARN_ON(!memcg_wq);
>  
> -	for_each_possible_cpu(cpu) {
> -		INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
> -			  drain_local_memcg_stock);
> +	for_each_possible_cpu(cpu)
>  		INIT_WORK(&per_cpu_ptr(&obj_stock, cpu)->work,
>  			  drain_local_obj_stock);
> -	}
>  
>  	memcg_size = struct_size_t(struct mem_cgroup, nodeinfo, nr_node_ids);
>  	memcg_cachep = kmem_cache_create("mem_cgroup", memcg_size, 0,
> -- 
> 2.53.0-Meta
> 
> 

^ permalink raw reply

* Re: [PATCH 3/3] memcg: bail out proactive reclaim when memcg is dying
From: Jiayuan Chen @ 2026-06-24 14:41 UTC (permalink / raw)
  To: Usama Arif
  Cc: linux-mm, yingfu.zhou, Jiayuan Chen, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, David Hildenbrand, Qi Zheng, Lorenzo Stoakes,
	Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	cgroups, linux-kernel
In-Reply-To: <20260624135839.2596358-1-usama.arif@linux.dev>


On 6/24/26 9:58 PM, Usama Arif wrote:
> On Tue, 23 Jun 2026 14:27:56 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
>> From: Jiayuan Chen <jiayuan.chen@shopee.com>
>>
>> Proactive reclaim via memory.reclaim can run for a long time - swap I/O
>> or thrashing again dominating the latency - and delays cgroup removal in
>> the same way.
>>
>> Mitigate this by stopping the reclaim once memcg_is_dying().
>>
>> Reported-by: Zhou Yingfu <yingfu.zhou@shopee.com>
>> Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
>> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
>> ---
>>   mm/vmscan.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 8190c4abec84..1162b7f76655 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -7922,6 +7922,9 @@ int user_proactive_reclaim(char *buf,
>>   		if (memcg) {
>>   			unsigned int reclaim_options;
>>   
>> +			if (memcg_is_dying(memcg))
>> +				break;
>> +
> This exits the reclaim loop with nr_reclaimed < nr_to_reclaim, but the
> function then returns 0 and memory_reclaim() reports a successful write.
> I think you want to return -EAGAIN here?


You are right that an error should be returned instead of 0.


But since memcg is being deleted, I'm reconsidering the appropriateerror 
code.

-EAGAIN, -ENOENT, -EINTR are possible candidates



^ permalink raw reply

* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: David Laight @ 2026-06-24 14:23 UTC (permalink / raw)
  To: Christian König
  Cc: Kaitao Cheng, Andrew Morton, David Hildenbrand, Jens Axboe,
	Tejun Heo, Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt, David Howells,
	Simona Vetter, Randy Dunlap, Luca Ceresoli, Philipp Stanner,
	linux-block, linux-kernel, cgroups, linux-ntfs-dev, linux-fsdevel,
	io-uring, audit, bpf, netdev, dri-devel, linux-perf-users,
	linux-trace-kernel, kexec, live-patching, linux-modules,
	linux-crypto, linux-pm, rcu, sched-ext, linux-mm, virtualization,
	damon, llvm, Kaitao Cheng
In-Reply-To: <cf8467c7-b98f-44a5-9cf9-60b43b5da711@amd.com>

On Wed, 24 Jun 2026 15:23:47 +0200
Christian König <christian.koenig@amd.com> wrote:

> On 6/24/26 15:14, Kaitao Cheng wrote:
> > 
> > 
> > 在 2026/6/22 16:42, David Laight 写道:  
> >> On Mon, 22 Jun 2026 12:05:31 +0800
> >> Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
> >>  
> >>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
> >>>
> >>> The list_for_each*_safe() helpers are used when the loop body may
> >>> remove the current entry.  Their API exposes the temporary cursor at
> >>> every call site, even though most users only need it for the iterator
> >>> implementation and never reference it in the loop body.
> >>>
> >>> Add *_mutable() variants for list and hlist iteration.  The new helpers
> >>> support both forms: callers may keep passing an explicit temporary cursor
> >>> when they need to inspect or reset it, or omit it and let the helper use
> >>> a unique internal cursor.  
> >>
> >> I'm not really sure 'mutable' means anything either.
> >> It is possible to make it valid for the loop body (or even other threads)
> >> to delete arbitrary list items - but that needs significant extra overheads.
> >>
> >> It might be worth doing something that doesn't need the extra variable,
> >> but there is little point doing all the churn just to rename things.
> >>  
> >>>
> >>> This makes call sites that only mutate the list through the current entry
> >>> less noisy, while keeping the existing *_safe() helpers available for
> >>> compatibility.
> >>>
> >>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> >>> ---
> >>>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
> >>>  1 file changed, 231 insertions(+), 38 deletions(-)
> >>>
> >>> diff --git a/include/linux/list.h b/include/linux/list.h
> >>> index 09d979976b3b..1081def7cea9 100644
> >>> --- a/include/linux/list.h
> >>> +++ b/include/linux/list.h
> >>> @@ -7,6 +7,7 @@
> >>>  #include <linux/stddef.h>
> >>>  #include <linux/poison.h>
> >>>  #include <linux/const.h>
> >>> +#include <linux/args.h>
> >>>  
> >>>  #include <asm/barrier.h>
> >>>  
> >>> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
> >>>  #define list_for_each_prev(pos, head) \
> >>>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
> >>>  
> >>> -/**
> >>> - * list_for_each_safe - iterate over a list safe against removal of list entry
> >>> - * @pos:	the &struct list_head to use as a loop cursor.
> >>> - * @n:		another &struct list_head to use as temporary storage
> >>> - * @head:	the head for your list.
> >>> +/*
> >>> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
> >>>   */
> >>>  #define list_for_each_safe(pos, n, head) \
> >>>  	for (pos = (head)->next, n = pos->next; \
> >>>  	     !list_is_head(pos, (head)); \
> >>>  	     pos = n, n = pos->next)
> >>>  
> >>> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
> >>> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\  
> >>
> >> Use auto
> >>  
> >>> +	     !list_is_head(pos, (head));				\
> >>> +	     pos = tmp, tmp = pos->next)
> >>> +
> >>> +#define __list_for_each_mutable1(pos, head)				\
> >>> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
> >>> +
> >>> +#define __list_for_each_mutable2(pos, next, head)			\
> >>> +	list_for_each_safe(pos, next, head)
> >>> +
> >>>  /**
> >>> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
> >>> + * list_for_each_mutable - iterate over a list safe against entry removal
> >>>   * @pos:	the &struct list_head to use as a loop cursor.
> >>> - * @n:		another &struct list_head to use as temporary storage
> >>> - * @head:	the head for your list.
> >>> + * @...:	either (head) or (next, head)
> >>> + *
> >>> + * next:	another &struct list_head to use as optional temporary storage.
> >>> + *		The temporary cursor is internal unless explicitly supplied by
> >>> + *		the caller.
> >>> + * head:	the head for your list.
> >>> + */
> >>> +#define list_for_each_mutable(pos, ...)					\
> >>> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
> >>> +		(pos, __VA_ARGS__)  
> >>
> >> The variable argument count logic really just slows down compilation.
> >> Maybe there aren't enough copies of this code to make that significant.
> >> But just because you can do it doesn't mean it is a gooD idea.
> >> I'm also not sure it really adds anything to the readability.
> >>
> >> And, it you are going to make the middle argument optional there is
> >> no need to change the macro name.  
> > 
> > Christian König and Jani Nikula also disagree with the variadic-argument
> > implementation approach. If we abandon that method, it means we will
> > inevitably need to add some new macros. If mutable is not a good name,
> > suggestions for better alternatives would be welcome; coming up with a
> > suitable name is indeed rather tricky.  
> 
> I don't think you need to add a new macro for the specific use case that people want to modify the next element of the iteration.
> 
> If I remember your numbers correctly that is a really corner case and keeping using the existing *_safe() macros for that sounds perfectly fine to me.

IIRC currently you have a choice of either:
	define               Item that can't be deleted
	list_for_each()	     The current item.
	list_for_each_safe() The next item.
There is also likely to be code that updates the variables to allow
for other scenarios.

Note that if increase a reference count and release a lock then list_for_each()
is likely safer than list_for_each_safe() :-)

list.h has 9 variants of the 'safe' loop.
The bloat of another 9 is getting excessive.

It has to be said that this is one of my least favourite type of list...

	David

> 
> Regards,
> Christian.


^ permalink raw reply

* Re: [PATCH 3/3] memcg: bail out proactive reclaim when memcg is dying
From: Usama Arif @ 2026-06-24 13:58 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: Usama Arif, linux-mm, yingfu.zhou, Jiayuan Chen, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, David Hildenbrand, Qi Zheng, Lorenzo Stoakes,
	Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	cgroups, linux-kernel
In-Reply-To: <20260623062800.298514-4-jiayuan.chen@linux.dev>

On Tue, 23 Jun 2026 14:27:56 +0800 Jiayuan Chen <jiayuan.chen@linux.dev> wrote:

> From: Jiayuan Chen <jiayuan.chen@shopee.com>
> 
> Proactive reclaim via memory.reclaim can run for a long time - swap I/O
> or thrashing again dominating the latency - and delays cgroup removal in
> the same way.
> 
> Mitigate this by stopping the reclaim once memcg_is_dying().
> 
> Reported-by: Zhou Yingfu <yingfu.zhou@shopee.com>
> Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
> ---
>  mm/vmscan.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8190c4abec84..1162b7f76655 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -7922,6 +7922,9 @@ int user_proactive_reclaim(char *buf,
>  		if (memcg) {
>  			unsigned int reclaim_options;
>  
> +			if (memcg_is_dying(memcg))
> +				break;
> +

This exits the reclaim loop with nr_reclaimed < nr_to_reclaim, but the
function then returns 0 and memory_reclaim() reports a successful write.
I think you want to return -EAGAIN here?


>  			reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
>  					  MEMCG_RECLAIM_PROACTIVE;
>  			reclaimed = try_to_free_mem_cgroup_pages(memcg,
> -- 
> 2.43.0
> 
> 

^ permalink raw reply

* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: Christian König @ 2026-06-24 13:23 UTC (permalink / raw)
  To: Kaitao Cheng, David Laight
  Cc: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt, David Howells,
	Simona Vetter, Randy Dunlap, Luca Ceresoli, Philipp Stanner,
	linux-block, linux-kernel, cgroups, linux-ntfs-dev, linux-fsdevel,
	io-uring, audit, bpf, netdev, dri-devel, linux-perf-users,
	linux-trace-kernel, kexec, live-patching, linux-modules,
	linux-crypto, linux-pm, rcu, sched-ext, linux-mm, virtualization,
	damon, llvm, Kaitao Cheng
In-Reply-To: <351a6b67-b394-4c58-aee2-88b6c8089ad5@linux.dev>

On 6/24/26 15:14, Kaitao Cheng wrote:
> 
> 
> 在 2026/6/22 16:42, David Laight 写道:
>> On Mon, 22 Jun 2026 12:05:31 +0800
>> Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
>>
>>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>
>>> The list_for_each*_safe() helpers are used when the loop body may
>>> remove the current entry.  Their API exposes the temporary cursor at
>>> every call site, even though most users only need it for the iterator
>>> implementation and never reference it in the loop body.
>>>
>>> Add *_mutable() variants for list and hlist iteration.  The new helpers
>>> support both forms: callers may keep passing an explicit temporary cursor
>>> when they need to inspect or reset it, or omit it and let the helper use
>>> a unique internal cursor.
>>
>> I'm not really sure 'mutable' means anything either.
>> It is possible to make it valid for the loop body (or even other threads)
>> to delete arbitrary list items - but that needs significant extra overheads.
>>
>> It might be worth doing something that doesn't need the extra variable,
>> but there is little point doing all the churn just to rename things.
>>
>>>
>>> This makes call sites that only mutate the list through the current entry
>>> less noisy, while keeping the existing *_safe() helpers available for
>>> compatibility.
>>>
>>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
>>> ---
>>>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>>>  1 file changed, 231 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/include/linux/list.h b/include/linux/list.h
>>> index 09d979976b3b..1081def7cea9 100644
>>> --- a/include/linux/list.h
>>> +++ b/include/linux/list.h
>>> @@ -7,6 +7,7 @@
>>>  #include <linux/stddef.h>
>>>  #include <linux/poison.h>
>>>  #include <linux/const.h>
>>> +#include <linux/args.h>
>>>  
>>>  #include <asm/barrier.h>
>>>  
>>> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>>>  #define list_for_each_prev(pos, head) \
>>>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>>>  
>>> -/**
>>> - * list_for_each_safe - iterate over a list safe against removal of list entry
>>> - * @pos:	the &struct list_head to use as a loop cursor.
>>> - * @n:		another &struct list_head to use as temporary storage
>>> - * @head:	the head for your list.
>>> +/*
>>> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>>>   */
>>>  #define list_for_each_safe(pos, n, head) \
>>>  	for (pos = (head)->next, n = pos->next; \
>>>  	     !list_is_head(pos, (head)); \
>>>  	     pos = n, n = pos->next)
>>>  
>>> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
>>> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\
>>
>> Use auto
>>
>>> +	     !list_is_head(pos, (head));				\
>>> +	     pos = tmp, tmp = pos->next)
>>> +
>>> +#define __list_for_each_mutable1(pos, head)				\
>>> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
>>> +
>>> +#define __list_for_each_mutable2(pos, next, head)			\
>>> +	list_for_each_safe(pos, next, head)
>>> +
>>>  /**
>>> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
>>> + * list_for_each_mutable - iterate over a list safe against entry removal
>>>   * @pos:	the &struct list_head to use as a loop cursor.
>>> - * @n:		another &struct list_head to use as temporary storage
>>> - * @head:	the head for your list.
>>> + * @...:	either (head) or (next, head)
>>> + *
>>> + * next:	another &struct list_head to use as optional temporary storage.
>>> + *		The temporary cursor is internal unless explicitly supplied by
>>> + *		the caller.
>>> + * head:	the head for your list.
>>> + */
>>> +#define list_for_each_mutable(pos, ...)					\
>>> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
>>> +		(pos, __VA_ARGS__)
>>
>> The variable argument count logic really just slows down compilation.
>> Maybe there aren't enough copies of this code to make that significant.
>> But just because you can do it doesn't mean it is a gooD idea.
>> I'm also not sure it really adds anything to the readability.
>>
>> And, it you are going to make the middle argument optional there is
>> no need to change the macro name.
> 
> Christian König and Jani Nikula also disagree with the variadic-argument
> implementation approach. If we abandon that method, it means we will
> inevitably need to add some new macros. If mutable is not a good name,
> suggestions for better alternatives would be welcome; coming up with a
> suitable name is indeed rather tricky.

I don't think you need to add a new macro for the specific use case that people want to modify the next element of the iteration.

If I remember your numbers correctly that is a really corner case and keeping using the existing *_safe() macros for that sounds perfectly fine to me.

Regards,
Christian.

^ permalink raw reply

* Re: [PATCH v3 1/7] list: Add mutable iterator variants
From: Kaitao Cheng @ 2026-06-24 13:14 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, David Hildenbrand, Jens Axboe, Tejun Heo,
	Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt,
	Christian König, David Howells, Simona Vetter, Randy Dunlap,
	Luca Ceresoli, Philipp Stanner, linux-block, linux-kernel,
	cgroups, linux-ntfs-dev, linux-fsdevel, io-uring, audit, bpf,
	netdev, dri-devel, linux-perf-users, linux-trace-kernel, kexec,
	live-patching, linux-modules, linux-crypto, linux-pm, rcu,
	sched-ext, linux-mm, virtualization, damon, llvm, Kaitao Cheng
In-Reply-To: <20260622094242.64531b9a@pumpkin>



在 2026/6/22 16:42, David Laight 写道:
> On Mon, 22 Jun 2026 12:05:31 +0800
> Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
> 
>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>>
>> The list_for_each*_safe() helpers are used when the loop body may
>> remove the current entry.  Their API exposes the temporary cursor at
>> every call site, even though most users only need it for the iterator
>> implementation and never reference it in the loop body.
>>
>> Add *_mutable() variants for list and hlist iteration.  The new helpers
>> support both forms: callers may keep passing an explicit temporary cursor
>> when they need to inspect or reset it, or omit it and let the helper use
>> a unique internal cursor.
> 
> I'm not really sure 'mutable' means anything either.
> It is possible to make it valid for the loop body (or even other threads)
> to delete arbitrary list items - but that needs significant extra overheads.
> 
> It might be worth doing something that doesn't need the extra variable,
> but there is little point doing all the churn just to rename things.
> 
>>
>> This makes call sites that only mutate the list through the current entry
>> less noisy, while keeping the existing *_safe() helpers available for
>> compatibility.
>>
>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
>> ---
>>  include/linux/list.h | 269 +++++++++++++++++++++++++++++++++++++------
>>  1 file changed, 231 insertions(+), 38 deletions(-)
>>
>> diff --git a/include/linux/list.h b/include/linux/list.h
>> index 09d979976b3b..1081def7cea9 100644
>> --- a/include/linux/list.h
>> +++ b/include/linux/list.h
>> @@ -7,6 +7,7 @@
>>  #include <linux/stddef.h>
>>  #include <linux/poison.h>
>>  #include <linux/const.h>
>> +#include <linux/args.h>
>>  
>>  #include <asm/barrier.h>
>>  
>> @@ -763,28 +764,72 @@ static inline void list_splice_tail_init(struct list_head *list,
>>  #define list_for_each_prev(pos, head) \
>>  	for (pos = (head)->prev; !list_is_head(pos, (head)); pos = pos->prev)
>>  
>> -/**
>> - * list_for_each_safe - iterate over a list safe against removal of list entry
>> - * @pos:	the &struct list_head to use as a loop cursor.
>> - * @n:		another &struct list_head to use as temporary storage
>> - * @head:	the head for your list.
>> +/*
>> + * list_for_each_safe is an old interface, use list_for_each_mutable instead.
>>   */
>>  #define list_for_each_safe(pos, n, head) \
>>  	for (pos = (head)->next, n = pos->next; \
>>  	     !list_is_head(pos, (head)); \
>>  	     pos = n, n = pos->next)
>>  
>> +#define __list_for_each_mutable_internal(pos, tmp, head)		\
>> +	for (typeof(pos) tmp = (pos = (head)->next)->next;		\
> 
> Use auto
> 
>> +	     !list_is_head(pos, (head));				\
>> +	     pos = tmp, tmp = pos->next)
>> +
>> +#define __list_for_each_mutable1(pos, head)				\
>> +	__list_for_each_mutable_internal(pos, __UNIQUE_ID(next), head)
>> +
>> +#define __list_for_each_mutable2(pos, next, head)			\
>> +	list_for_each_safe(pos, next, head)
>> +
>>  /**
>> - * list_for_each_prev_safe - iterate over a list backwards safe against removal of list entry
>> + * list_for_each_mutable - iterate over a list safe against entry removal
>>   * @pos:	the &struct list_head to use as a loop cursor.
>> - * @n:		another &struct list_head to use as temporary storage
>> - * @head:	the head for your list.
>> + * @...:	either (head) or (next, head)
>> + *
>> + * next:	another &struct list_head to use as optional temporary storage.
>> + *		The temporary cursor is internal unless explicitly supplied by
>> + *		the caller.
>> + * head:	the head for your list.
>> + */
>> +#define list_for_each_mutable(pos, ...)					\
>> +	CONCATENATE(__list_for_each_mutable, COUNT_ARGS(__VA_ARGS__))	\
>> +		(pos, __VA_ARGS__)
> 
> The variable argument count logic really just slows down compilation.
> Maybe there aren't enough copies of this code to make that significant.
> But just because you can do it doesn't mean it is a gooD idea.
> I'm also not sure it really adds anything to the readability.
> 
> And, it you are going to make the middle argument optional there is
> no need to change the macro name.

Christian König and Jani Nikula also disagree with the variadic-argument
implementation approach. If we abandon that method, it means we will
inevitably need to add some new macros. If mutable is not a good name,
suggestions for better alternatives would be welcome; coming up with a
suitable name is indeed rather tricky.

-- 
Thanks
Kaitao Cheng


^ permalink raw reply

* [PATCH RFC 4/4] mm/slab: serialize defer_free_barrier()
From: Harry Yoo (Oracle) @ 2026-06-24 13:11 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Vlastimil Babka, Hao Li,
	Christoph Lameter, David Rientjes, Alexei Starovoitov,
	Pedro Falcato
  Cc: cgroups, linux-mm, linux-kernel, bpf
In-Reply-To: <20260624-kmalloc-nolock-fixes-v1-0-fdf4d17351dd@kernel.org>

irq_work_sync() uses rcuwait instead of busy waiting in two cases:

  1. The kernel is using PREEMPT_RT and the irq work does not run in a
     hardirq context.

  2. The architecture cannot send inter-processor interrupts to make
     busy waiting reasonably short.

However, rcuwait.h says:
> The caller is responsible for locking around rcuwait_wait_event(),
> and [prepare_to/finish]_rcuwait() such that writes to @task are
> properly serialized.

Since defer_free_barrier() calls irq_work_sync() without any locks,
it can potentially cause a hang as writes to @task are not serialized.

Fix this by calling defer_free_barrier() under slab_mutex and
cpus_read_lock() and add lockdep asserts.

Now that defer_free_barrier() is called inside cpus_read_lock(), iterate
over online cpus instead of possible cpus.

Reported-by: Sashiko <sashiko+bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260615-kfree_rcu_nolock-v3-0-70a54f3775bb%40kernel.org?part=5
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Cc: stable@vger.kernel.org
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
 mm/slab_common.c | 5 ++---
 mm/slub.c        | 6 +++++-
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 388eb5980859..27f77273fabe 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -550,11 +550,10 @@ void kmem_cache_destroy(struct kmem_cache *s)
 		rcu_barrier();
 	}
 
-	/* Wait for deferred work from kmalloc/kfree_nolock() */
-	defer_free_barrier();
-
 	cpus_read_lock();
 	mutex_lock(&slab_mutex);
+	/* Wait for deferred work from kmalloc/kfree_nolock() */
+	defer_free_barrier();
 
 	s->refcount--;
 	if (s->refcount) {
diff --git a/mm/slub.c b/mm/slub.c
index 4a3618e3967e..52c8d3f33782 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -6411,7 +6411,11 @@ void defer_free_barrier(void)
 {
 	int cpu;
 
-	for_each_possible_cpu(cpu)
+	/* irq_work_sync() may use rcuwait that requires serialization */
+	lockdep_assert_held(&slab_mutex);
+	lockdep_assert_cpus_held();
+
+	for_each_online_cpu(cpu)
 		irq_work_sync(&per_cpu_ptr(&defer_free_objects, cpu)->work);
 }
 

-- 
2.53.0


^ permalink raw reply related

* [PATCH RFC 3/4] mm/slab: fix a deadlock in memcg_alloc_abort_single()
From: Harry Yoo (Oracle) @ 2026-06-24 13:11 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Vlastimil Babka, Hao Li,
	Christoph Lameter, David Rientjes, Alexei Starovoitov,
	Pedro Falcato
  Cc: cgroups, linux-mm, linux-kernel, bpf
In-Reply-To: <20260624-kmalloc-nolock-fixes-v1-0-fdf4d17351dd@kernel.org>

When kmalloc_nolock() successfully grabs a slab object but memcg aborts
the allocation, the object is freed via __slab_free(). Calling
__slab_free() in unknown context is not allowed and can lead to a
deadlock.

Free the object via defer_free() when spinning is not allowed.

Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260610-slab_alloc_flags-v2-0-7190909db118%40kernel.org?part=9
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Cc: stable@vger.kernel.org
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
 mm/slub.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 85760c8ff2e2..4a3618e3967e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2459,7 +2459,8 @@ alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p,
 
 #ifdef CONFIG_MEMCG
 
-static void memcg_alloc_abort_single(struct kmem_cache *s, void *object);
+static void memcg_alloc_abort_single(struct kmem_cache *s, void *object,
+				     bool allow_spin);
 
 static __fastpath_inline
 bool memcg_slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
@@ -2477,7 +2478,9 @@ bool memcg_slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
 		return true;
 
 	if (likely(size == 1)) {
-		memcg_alloc_abort_single(s, *p);
+		bool allow_spin = alloc_flags_allow_spinning(ac->alloc_flags);
+
+		memcg_alloc_abort_single(s, *p, allow_spin);
 		*p = NULL;
 	} else {
 		kmem_cache_free_bulk(s, size, p);
@@ -6436,16 +6439,20 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
 #ifdef CONFIG_MEMCG
 /* Do not inline the rare memcg charging failed path into the allocation path */
 static noinline
-void memcg_alloc_abort_single(struct kmem_cache *s, void *object)
+void memcg_alloc_abort_single(struct kmem_cache *s, void *object, bool allow_spin)
 {
 	struct slab *slab = virt_to_slab(object);
 	bool init = slab_want_init_on_free(s);
-	bool allow_spin = true;
 
 	alloc_tagging_slab_free_hook(s, slab, &object, 1);
 
-	if (likely(slab_free_hook(s, object, init, false, allow_spin)))
+	if (unlikely(!slab_free_hook(s, object, init, false, allow_spin)))
+		return;
+
+	if (likely(allow_spin))
 		__slab_free(s, slab, object, object, 1, _RET_IP_);
+	else
+		defer_free(s, object);
 }
 #endif
 

-- 
2.53.0


^ permalink raw reply related

* [PATCH RFC 2/4] mm/slab: handle allow_spin in slab_free_hook() instead of open coding
From: Harry Yoo (Oracle) @ 2026-06-24 13:11 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Vlastimil Babka, Hao Li,
	Christoph Lameter, David Rientjes, Alexei Starovoitov,
	Pedro Falcato
  Cc: cgroups, linux-mm, linux-kernel, bpf
In-Reply-To: <20260624-kmalloc-nolock-fixes-v1-0-fdf4d17351dd@kernel.org>

kfree_nolock() deliberately skips slab free hooks that should not be
called in unknown context. However, there is one more place that
needs the same thing: memcg_alloc_abort_single().

Instead of open coding slab_free_hook(), let it handle the allow_spin
parameter. This is required to correctly handle allocation failures due
to memcg.

There are two functional changes from this.

First, memcg_alloc_abort_single() -> slab_free_hook() path no longer
calls free hooks that acquire spinlocks if !allow_spin.

Second, kfree_nolock() now follows init_on_free which has been ignored.

Besides that, no functional change intended.

Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Cc: stable@vger.kernel.org
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
 mm/slub.c | 100 +++++++++++++++++++++++++++++++-------------------------------
 1 file changed, 50 insertions(+), 50 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 32672a92581b..85760c8ff2e2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2611,41 +2611,52 @@ struct rcu_delayed_free {
  * - memcg_slab_post_alloc_hook()
  * - alloc_tagging_slab_alloc_hook()
  *
+ * slab_free_hook() assumes that memcg_slab_free_hook() and
+ * alloc_tagging_slab_free_hook() are already invoked.
+ *
  * Free hooks that must handle missing corresponding alloc hooks:
  * - kmemleak_free_recursive()
  * - kfence_free()
  *
- * Free hooks that have no alloc hook counterpart, and thus safe to call:
+ * Free hooks that have no alloc hook counterpart, and thus safe to call
+ * when spinning is allowed:
  * - debug_check_no_locks_freed()
  * - debug_check_no_obj_freed()
  * - __kcsan_check_access()
  */
 static __always_inline
 bool slab_free_hook(struct kmem_cache *s, void *x, bool init,
-		    bool after_rcu_delay)
+		    bool after_rcu_delay, bool allow_spin)
 {
 	/* Are the object contents still accessible? */
 	bool still_accessible = (s->flags & SLAB_TYPESAFE_BY_RCU) && !after_rcu_delay;
 
-	kmemleak_free_recursive(x, s->flags);
 	kmsan_slab_free(s, x);
 
-	debug_check_no_locks_freed(x, s->object_size);
-
-	if (!(s->flags & SLAB_DEBUG_OBJECTS))
-		debug_check_no_obj_freed(x, s->object_size);
-
-	/* Use KCSAN to help debug racy use-after-free. */
-	if (!still_accessible)
-		__kcsan_check_access(x, s->object_size,
-				     KCSAN_ACCESS_WRITE | KCSAN_ACCESS_ASSERT);
-
-	if (kfence_free(x))
-		return false;
+	if (likely(allow_spin)) {
+		/* These hooks acquire spinlocks */
+		kmemleak_free_recursive(x, s->flags);
+		debug_check_no_locks_freed(x, s->object_size);
+		if (!(s->flags & SLAB_DEBUG_OBJECTS))
+			debug_check_no_obj_freed(x, s->object_size);
+		/* Use KCSAN to help debug racy use-after-free. */
+		if (!still_accessible)
+			__kcsan_check_access(x, s->object_size,
+					     KCSAN_ACCESS_WRITE | KCSAN_ACCESS_ASSERT);
+		if (kfence_free(x))
+			return false;
+	}
 
 	/*
 	 * Give KASAN a chance to notice an invalid free operation before we
 	 * modify the object.
+	 *
+	 * If KASAN finds a bug it will do kasan_report_invalid_free() which
+	 * will call raw_spin_lock_irqsave() which is technically unsafe from
+	 * NMI, but take chance and report kernel bug. The sequence of
+	 * kasan_report_invalid_free() -> raw_spin_lock_irqsave() -> NMI
+	 * -> kfree_nolock() -> kasan_report_invalid_free() on the same CPU
+	 *  is double buggy and deserves to deadlock.
 	 */
 	if (kasan_slab_pre_free(s, x))
 		return false;
@@ -2654,6 +2665,7 @@ bool slab_free_hook(struct kmem_cache *s, void *x, bool init,
 	if (still_accessible) {
 		struct rcu_delayed_free *delayed_free;
 
+		VM_WARN_ON_ONCE(!allow_spin);
 		delayed_free = kmalloc_obj(*delayed_free, GFP_NOWAIT);
 		if (delayed_free) {
 			/*
@@ -2701,8 +2713,11 @@ bool slab_free_hook(struct kmem_cache *s, void *x, bool init,
 		set_orig_size(s, x, orig_size);
 
 	}
-	/* KASAN might put x into memory quarantine, delaying its reuse. */
-	return !kasan_slab_free(s, x, init, still_accessible, false);
+	/*
+	 * KASAN might put x into memory quarantine, delaying its reuse.
+	 * Skip quarantine when spinning is not allowed as it uses a spinlock.
+	 */
+	return !kasan_slab_free(s, x, init, still_accessible, !allow_spin);
 }
 
 static __fastpath_inline
@@ -2714,9 +2729,10 @@ bool slab_free_freelist_hook(struct kmem_cache *s, void **head, void **tail,
 	void *next = *head;
 	void *old_tail = *tail;
 	bool init;
+	bool allow_spin = true;
 
 	if (is_kfence_address(next)) {
-		slab_free_hook(s, next, false, false);
+		slab_free_hook(s, next, false, false, allow_spin);
 		return false;
 	}
 
@@ -2731,7 +2747,7 @@ bool slab_free_freelist_hook(struct kmem_cache *s, void **head, void **tail,
 		next = get_freepointer(s, object);
 
 		/* If object's reuse doesn't have to be delayed */
-		if (likely(slab_free_hook(s, object, init, false))) {
+		if (likely(slab_free_hook(s, object, init, false, allow_spin))) {
 			/* Move object to the new freelist */
 			set_freepointer(s, object, *head);
 			*head = object;
@@ -2954,7 +2970,7 @@ static bool __rcu_free_sheaf_prepare(struct kmem_cache *s,
 		memcg_slab_free_hook(s, slab, p + i, 1, allow_spin);
 		alloc_tagging_slab_free_hook(s, slab, p + i, 1);
 
-		if (unlikely(!slab_free_hook(s, p[i], init, true))) {
+		if (unlikely(!slab_free_hook(s, p[i], init, true, allow_spin))) {
 			p[i] = p[--sheaf->size];
 			continue;
 		}
@@ -6225,7 +6241,7 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 		memcg_slab_free_hook(s, slab, p + i, 1, allow_spin);
 		alloc_tagging_slab_free_hook(s, slab, p + i, 1);
 
-		if (unlikely(!slab_free_hook(s, p[i], init, false))) {
+		if (unlikely(!slab_free_hook(s, p[i], init, false, allow_spin))) {
 			p[i] = p[--size];
 			continue;
 		}
@@ -6401,11 +6417,12 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
 	       unsigned long addr)
 {
 	bool allow_spin = true;
+	bool init = slab_want_init_on_free(s);
 
 	memcg_slab_free_hook(s, slab, &object, 1, allow_spin);
 	alloc_tagging_slab_free_hook(s, slab, &object, 1);
 
-	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
+	if (unlikely(!slab_free_hook(s, object, init, false, allow_spin)))
 		return;
 
 	if (likely(can_free_to_pcs(slab)) &&
@@ -6422,10 +6439,12 @@ static noinline
 void memcg_alloc_abort_single(struct kmem_cache *s, void *object)
 {
 	struct slab *slab = virt_to_slab(object);
+	bool init = slab_want_init_on_free(s);
+	bool allow_spin = true;
 
 	alloc_tagging_slab_free_hook(s, slab, &object, 1);
 
-	if (likely(slab_free_hook(s, object, slab_want_init_on_free(s), false)))
+	if (likely(slab_free_hook(s, object, init, false, allow_spin)))
 		__slab_free(s, slab, object, object, 1, _RET_IP_);
 }
 #endif
@@ -6456,6 +6475,8 @@ static void slab_free_after_rcu_debug(struct rcu_head *rcu_head)
 	void *object = delayed_free->object;
 	struct slab *slab = virt_to_slab(object);
 	struct kmem_cache *s;
+	bool allow_spin = true;
+	bool init;
 
 	kfree(delayed_free);
 
@@ -6469,8 +6490,9 @@ static void slab_free_after_rcu_debug(struct rcu_head *rcu_head)
 	if (WARN_ON(!(s->flags & SLAB_TYPESAFE_BY_RCU)))
 		return;
 
+	init = slab_want_init_on_free(s);
 	/* resume freeing */
-	if (slab_free_hook(s, object, slab_want_init_on_free(s), true)) {
+	if (slab_free_hook(s, object, init, true, allow_spin)) {
 		__slab_free(s, slab, object, object, 1, _THIS_IP_);
 		stat(s, FREE_SLOWPATH);
 	}
@@ -6742,6 +6764,7 @@ void kfree_nolock(const void *object)
 	struct kmem_cache *s;
 	void *x = (void *)object;
 	bool allow_spin = false;
+	bool init;
 
 	if (unlikely(ZERO_OR_NULL_PTR(object)))
 		return;
@@ -6756,33 +6779,10 @@ void kfree_nolock(const void *object)
 
 	memcg_slab_free_hook(s, slab, &x, 1, allow_spin);
 	alloc_tagging_slab_free_hook(s, slab, &x, 1);
-	/*
-	 * Unlike slab_free() do NOT call the following:
-	 * kmemleak_free_recursive(x, s->flags);
-	 * debug_check_no_locks_freed(x, s->object_size);
-	 * debug_check_no_obj_freed(x, s->object_size);
-	 * __kcsan_check_access(x, s->object_size, ..);
-	 * kfence_free(x);
-	 * since they take spinlocks or not safe from any context.
-	 */
-	kmsan_slab_free(s, x);
-	/*
-	 * If KASAN finds a kernel bug it will do kasan_report_invalid_free()
-	 * which will call raw_spin_lock_irqsave() which is technically
-	 * unsafe from NMI, but take chance and report kernel bug.
-	 * The sequence of
-	 * kasan_report_invalid_free() -> raw_spin_lock_irqsave() -> NMI
-	 *  -> kfree_nolock() -> kasan_report_invalid_free() on the same CPU
-	 * is double buggy and deserves to deadlock.
-	 */
-	if (kasan_slab_pre_free(s, x))
+
+	init = slab_want_init_on_free(s);
+	if (!slab_free_hook(s, x, init, false, allow_spin))
 		return;
-	/*
-	 * memcg, kasan_slab_pre_free are done for 'x'.
-	 * The only thing left is kasan_poison without quarantine,
-	 * since kasan quarantine takes locks and not supported from NMI.
-	 */
-	kasan_slab_free(s, x, false, false, /* skip quarantine */true);
 
 	if (likely(can_free_to_pcs(slab)) &&
 			likely(free_to_pcs(s, x, allow_spin)))

-- 
2.53.0


^ permalink raw reply related

* [PATCH RFC 1/4] mm/memcontrol: do not drain objcg stock when spinning is not allowed
From: Harry Yoo (Oracle) @ 2026-06-24 13:11 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Vlastimil Babka, Hao Li,
	Christoph Lameter, David Rientjes, Alexei Starovoitov,
	Pedro Falcato
  Cc: cgroups, linux-mm, linux-kernel, bpf
In-Reply-To: <20260624-kmalloc-nolock-fixes-v1-0-fdf4d17351dd@kernel.org>

When kmalloc_nolock() drains objcg stock, the stock might be holding
the last reference to the objcg. Since obj_cgroup_release() is a
callback for percpu refcount and does not know whether spinning is
allowed, it is not safe to invoke obj_cgroup_put().

This was caught by lockdep on PREEMPT_RT because acquiring
a sleeping lock (objcg_lock) violates lock nesting rules:

  kernel: BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
  kernel: in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1267, name: systemd-resolve
  preempt_count: 1, expected: 0
  RCU nest depth: 3, expected: 3
  6 locks held by systemd-resolve/1267:
   #0: ffff888a8165fa20 ((&pcs->lock)){+.+.}-{3:3}, at: kmem_cache_alloc_noprof+0x185/0xa20
   #1: ffffffff9658a4c0 (rcu_read_lock){....}-{1:3}, at: rt_spin_trylock+0x74/0x2a0
   #2: ffff888a81648598 ((lock)#4){+.+.}-{3:3}, at: trylock_stock+0x118/0x380
   #3: ffffffff9658a4c0 (rcu_read_lock){....}-{1:3}, at: rt_spin_trylock+0x74/0x2a0
   #4: ffffffff9658a4c0 (rcu_read_lock){....}-{1:3}, at: percpu_ref_put_many.constprop.0+0x40/0x270
   #5: ffffffff96af11d8 (objcg_lock){+.+.}-{3:3}, at: obj_cgroup_release+0x8a/0x410
  [...]
  Call Trace:
   <TASK>
   dump_stack_lvl+0x8a/0xe0
   dump_stack+0x14/0x1c
   __might_resched.cold+0x233/0x2bb
   rt_spin_lock+0xd3/0x410
   obj_cgroup_release+0x8a/0x410
   percpu_ref_put_many.constprop.0+0x226/0x270
   drain_obj_stock_slot+0x27e/0x8d0
   __refill_obj_stock+0x409/0x6d0
   __memcg_slab_post_alloc_hook+0xa45/0x1500
   __kmalloc_nolock_noprof+0x988/0xc40
   [...]

However, this is illegal in !RT kernels too because the objcg release
callback acquires a spinlock even when spinning is not allowed.

To fix this issue, fall back to atomics when the cached objcg doesn't
match, but it is unsafe to drain because spinning is not allowed.

This is expected to affect performance of kmalloc_nolock() since
it can no longer drain and refill the stock and falls back to a
per-objcg atomic counter (objcg->nr_charged_bytes).

Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Cc: stable@vger.kernel.org
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
 mm/memcontrol.c | 34 +++++++++++++++++++++++-----------
 mm/slab.h       |  3 ++-
 mm/slub.c       | 29 +++++++++++++++++++----------
 3 files changed, 44 insertions(+), 22 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 29390ba13baa..5bb5e75ef5b0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3316,18 +3316,19 @@ static bool obj_stock_flush_required(struct obj_stock_pcp *stock,
 static void __refill_obj_stock(struct obj_cgroup *objcg,
 			       struct obj_stock_pcp *stock,
 			       unsigned int nr_bytes,
-			       bool allow_uncharge)
+			       bool allow_uncharge,
+			       bool allow_spin)
 {
 	unsigned int nr_pages = 0;
 
-	if (!stock) {
-		nr_pages = nr_bytes >> PAGE_SHIFT;
-		nr_bytes = nr_bytes & (PAGE_SIZE - 1);
-		atomic_add(nr_bytes, &objcg->nr_charged_bytes);
-		goto out;
-	}
+	if (!stock)
+		goto fallback;
 
 	if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
+		/* Not safe to drain since objcg release acquires spinlock */
+		if (unlikely(!allow_spin))
+			goto fallback;
+
 		drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
 		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
@@ -3346,6 +3347,13 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
 out:
 	if (nr_pages)
 		obj_cgroup_uncharge_pages(objcg, nr_pages);
+	return;
+
+fallback:
+	nr_pages = nr_bytes >> PAGE_SHIFT;
+	nr_bytes = nr_bytes & (PAGE_SIZE - 1);
+	atomic_add(nr_bytes, &objcg->nr_charged_bytes);
+	goto out;
 }
 
 static void refill_obj_stock(struct obj_cgroup *objcg,
@@ -3353,7 +3361,8 @@ static void refill_obj_stock(struct obj_cgroup *objcg,
 			     bool allow_uncharge)
 {
 	struct obj_stock_pcp *stock = trylock_stock();
-	__refill_obj_stock(objcg, stock, nr_bytes, allow_uncharge);
+	__refill_obj_stock(objcg, stock, nr_bytes, allow_uncharge,
+			   /* allow_spin = */ true);
 	unlock_stock(stock);
 }
 
@@ -3428,6 +3437,7 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 				  size_t size, void **p)
 {
 	size_t obj_size = obj_full_size(s);
+	bool allow_spin = alloc_flags_allow_spinning(slab_alloc_flags);
 	struct obj_cgroup *objcg;
 	struct slab *slab;
 	unsigned long off;
@@ -3497,7 +3507,8 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 				return false;
 			stock = trylock_stock();
 			if (remainder)
-				__refill_obj_stock(objcg, stock, remainder, false);
+				__refill_obj_stock(objcg, stock, remainder, false,
+						   allow_spin);
 		}
 		__account_obj_stock(objcg, stock, obj_size,
 				    slab_pgdat(slab), cache_vmstat_idx(s));
@@ -3516,7 +3527,8 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 }
 
 void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
-			    void **p, int objects, unsigned long obj_exts)
+			    void **p, int objects, unsigned long obj_exts,
+			    bool allow_spin)
 {
 	size_t obj_size = obj_full_size(s);
 
@@ -3535,7 +3547,7 @@ void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
 		obj_ext->objcg = NULL;
 
 		stock = trylock_stock();
-		__refill_obj_stock(objcg, stock, obj_size, true);
+		__refill_obj_stock(objcg, stock, obj_size, true, allow_spin);
 		__account_obj_stock(objcg, stock, -obj_size,
 				    slab_pgdat(slab), cache_vmstat_idx(s));
 		unlock_stock(stock);
diff --git a/mm/slab.h b/mm/slab.h
index 281a65233795..a6b4ac298d08 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -660,7 +660,8 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 				  gfp_t flags, unsigned int slab_alloc_flags,
 				  size_t size, void **p);
 void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
-			    void **p, int objects, unsigned long obj_exts);
+			    void **p, int objects, unsigned long obj_exts,
+			    bool allow_spin);
 #endif
 
 void kvfree_rcu_cb(struct rcu_head *head);
diff --git a/mm/slub.c b/mm/slub.c
index 917635203f73..32672a92581b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2488,7 +2488,7 @@ bool memcg_slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
 
 static __fastpath_inline
 void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p,
-			  int objects)
+			  int objects, bool allow_spin)
 {
 	unsigned long obj_exts;
 
@@ -2500,7 +2500,7 @@ void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p,
 		return;
 
 	get_slab_obj_exts(obj_exts);
-	__memcg_slab_free_hook(s, slab, p, objects, obj_exts);
+	__memcg_slab_free_hook(s, slab, p, objects, obj_exts, allow_spin);
 	put_slab_obj_exts(obj_exts);
 }
 
@@ -2575,7 +2575,7 @@ static inline bool memcg_slab_post_alloc_hook(struct kmem_cache *s,
 }
 
 static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
-					void **p, int objects)
+					void **p, int objects, bool allow_spin)
 {
 }
 
@@ -2946,11 +2946,12 @@ static bool __rcu_free_sheaf_prepare(struct kmem_cache *s,
 	void **p = &sheaf->objects[0];
 	unsigned int i = 0;
 	bool pfmemalloc = false;
+	bool allow_spin = true;
 
 	while (i < sheaf->size) {
 		struct slab *slab = virt_to_slab(p[i]);
 
-		memcg_slab_free_hook(s, slab, p + i, 1);
+		memcg_slab_free_hook(s, slab, p + i, 1, allow_spin);
 		alloc_tagging_slab_free_hook(s, slab, p + i, 1);
 
 		if (unlikely(!slab_free_hook(s, p[i], init, true))) {
@@ -6215,12 +6216,13 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 	struct node_barn *barn;
 	void *remote_objects[PCS_BATCH_MAX];
 	unsigned int remote_nr = 0;
+	bool allow_spin = true;
 
 next_remote_batch:
 	while (i < size) {
 		struct slab *slab = virt_to_slab(p[i]);
 
-		memcg_slab_free_hook(s, slab, p + i, 1);
+		memcg_slab_free_hook(s, slab, p + i, 1, allow_spin);
 		alloc_tagging_slab_free_hook(s, slab, p + i, 1);
 
 		if (unlikely(!slab_free_hook(s, p[i], init, false))) {
@@ -6398,13 +6400,16 @@ static __fastpath_inline
 void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
 	       unsigned long addr)
 {
-	memcg_slab_free_hook(s, slab, &object, 1);
+	bool allow_spin = true;
+
+	memcg_slab_free_hook(s, slab, &object, 1, allow_spin);
 	alloc_tagging_slab_free_hook(s, slab, &object, 1);
 
 	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
 		return;
 
-	if (likely(can_free_to_pcs(slab)) && likely(free_to_pcs(s, object, true)))
+	if (likely(can_free_to_pcs(slab)) &&
+			likely(free_to_pcs(s, object, allow_spin)))
 		return;
 
 	__slab_free(s, slab, object, object, 1, addr);
@@ -6429,7 +6434,9 @@ static __fastpath_inline
 void slab_free_bulk(struct kmem_cache *s, struct slab *slab, void *head,
 		    void *tail, void **p, int cnt, unsigned long addr)
 {
-	memcg_slab_free_hook(s, slab, p, cnt);
+	bool allow_spin = true;
+
+	memcg_slab_free_hook(s, slab, p, cnt, allow_spin);
 	alloc_tagging_slab_free_hook(s, slab, p, cnt);
 	/*
 	 * With KASAN enabled slab_free_freelist_hook modifies the freelist
@@ -6734,6 +6741,7 @@ void kfree_nolock(const void *object)
 	struct slab *slab;
 	struct kmem_cache *s;
 	void *x = (void *)object;
+	bool allow_spin = false;
 
 	if (unlikely(ZERO_OR_NULL_PTR(object)))
 		return;
@@ -6746,7 +6754,7 @@ void kfree_nolock(const void *object)
 
 	s = slab->slab_cache;
 
-	memcg_slab_free_hook(s, slab, &x, 1);
+	memcg_slab_free_hook(s, slab, &x, 1, allow_spin);
 	alloc_tagging_slab_free_hook(s, slab, &x, 1);
 	/*
 	 * Unlike slab_free() do NOT call the following:
@@ -6776,7 +6784,8 @@ void kfree_nolock(const void *object)
 	 */
 	kasan_slab_free(s, x, false, false, /* skip quarantine */true);
 
-	if (likely(can_free_to_pcs(slab)) && likely(free_to_pcs(s, x, false)))
+	if (likely(can_free_to_pcs(slab)) &&
+			likely(free_to_pcs(s, x, allow_spin)))
 		return;
 
 	/*

-- 
2.53.0


^ permalink raw reply related

* [PATCH RFC 0/4] memcg,slab: kmalloc_nolock() fixes
From: Harry Yoo (Oracle) @ 2026-06-24 13:11 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Vlastimil Babka, Hao Li,
	Christoph Lameter, David Rientjes, Alexei Starovoitov,
	Pedro Falcato
  Cc: cgroups, linux-mm, linux-kernel, bpf

Apologies for posting another series during the merge window.
But these are bug fixes and there are other features that need to be
rebased on top, so...

Overview
========

This patchset tries to fix three kmalloc_nolock() bugs.

1. obj_cgroup_put() takes a spinlock in the release path
   when it is holding the last reference.
   (This needs some thoughts from the memcg folks)

2. A spinlock may be taken in the following path and may lead
   to deadlock:

   kmalloc_nolock()
   -> slab_post_alloc_hook()
   -> memcg_slab_post_alloc_hook()
   -> memcg_alloc_abort_single().

3. irq_work_sync() is called without synchronization for rcuwait
   (on PREEMPT_RT or some architectures), potentially causing a
   hang.

Bug 1 was reported by lockdep, and bugs 2 [2] and 3 [3] were
reported by Sashiko.

To MEMCG folks: obj_cgroup_put() is not safe in unknown context
===============================================================

I tried to fix the bug 1 in the __refill_obj_stock() path in patch 1.

Patch 1 considers correctness aspect only, and performance may
degrade because we have to fall back to per-objcg atomics unless
some else drains it for us.

Ouch, while writing the cover letter, I realized that two paths need
some attention:

  1. __memcg_slab_free_hook() -> obj_cgroup_put() and
  2. current_obj_cgroup() -> current_objcg_update() -> obj_cgroup_put()

An easy solution would be to somehow defer obj_cgroup_put() or
obj_cgroup_release(). I would like to hear thoughts from the memcg folks
on which is the preferred way.

To BPF folks: do we need to backport kmalloc_nolock() support
for architectures without __CMPXCHG_DOUBLE to v6.18?
=============================================================

Originally I intended to fix this as part of this series.
However, the issue reported by Levi Zim [1] was on kernel v6.19,
(Thanks to Vlastimil for mentioning this), and v6.18 does not use
kmalloc_nolock() for BPF local storage.

There are still few users in v6.18, but I can't tell whether it is
necessary to backport it to v6.18 (hopefully not as urgent as other
bugfixes).

Thoughts?

[1] https://lore.kernel.org/linux-mm/9bea1536-534a-4a59-9b5f-92389fb05688@kxxt.dev
[2] https://sashiko.dev/#/patchset/20260610-slab_alloc_flags-v2-0-7190909db118%40kernel.org?part=9
[3] https://sashiko.dev/#/patchset/20260615-kfree_rcu_nolock-v3-0-70a54f3775bb%40kernel.org?part=5

Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
Harry Yoo (Oracle) (4):
      mm/memcontrol: do not drain objcg stock when spinning is not allowed
      mm/slab: handle allow_spin in slab_free_hook() instead of open coding
      mm/slab: fix a deadlock in memcg_alloc_abort_single()
      mm/slab: serialize defer_free_barrier()

 mm/memcontrol.c  |  34 ++++++++-----
 mm/slab.h        |   3 +-
 mm/slab_common.c |   5 +-
 mm/slub.c        | 148 +++++++++++++++++++++++++++++++------------------------
 4 files changed, 111 insertions(+), 79 deletions(-)
---
base-commit: 892a7864730775c3dbee2a39e9ead4fa8d4256e7
change-id: 20260624-kmalloc-nolock-fixes-c97675328773

Best regards,
-- 
Harry Yoo (Oracle) <harry@kernel.org>


^ permalink raw reply

* Re: [PATCH v3 0/7] Prepare mutable list iterators to cache cursor state
From: Kaitao Cheng @ 2026-06-24 13:05 UTC (permalink / raw)
  To: Jani Nikula, Andrew Morton, David Hildenbrand, Jens Axboe,
	Tejun Heo, Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Andy Shevchenko, Paul E. McKenney, Shakeel Butt,
	Christian König
  Cc: David Howells, Simona Vetter, Randy Dunlap, Luca Ceresoli,
	Philipp Stanner, linux-block, linux-kernel, cgroups,
	linux-ntfs-dev, linux-fsdevel, io-uring, audit, bpf, netdev,
	dri-devel, linux-perf-users, linux-trace-kernel, kexec,
	live-patching, linux-modules, linux-crypto, linux-pm, rcu,
	sched-ext, linux-mm, virtualization, damon, llvm, chengkaitao
In-Reply-To: <88f34c7fa5a3d1700cc8005818751d6aa31f09df@intel.com>



在 2026/6/22 16:37, Jani Nikula 写道:
> On Mon, 22 Jun 2026, Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
>> Add *_mutable() iterator variants for list, hlist and llist.  The new
>> helpers are variadic and support both forms.  In the common case, the
>> caller omits the temporary cursor and the macro creates a unique internal
>> cursor with typeof(pos) and __UNIQUE_ID().  If a loop really needs an
>> explicit temporary cursor, the caller can still pass it and the helper
>> keeps the existing *_safe() behaviour.
>>
>> For example, a call site may use the shorter form:
>>
>>   list_for_each_entry_mutable(pos, head, member)
>>
>> or keep the explicit temporary cursor form:
>>
>>   list_for_each_entry_mutable(pos, tmp, head, member)
> 
> I'm unconvinced it's a good idea to allow two forms with macro trickery,
> *especially* when it's not the last argument you can omit. I think it's
> a footgun.
> 
> IMO stick with the first form only, and there'll always be the _safe
> variant that can be used when the temp pointer is needed.

Could we go back to the v1 version? What do you think of that
implementation approach?

https://lore.kernel.org/all/20260529082149.76764-1-kaitao.cheng@linux.dev/

-- 
Thanks
Kaitao Cheng


^ permalink raw reply

* Re: [PATCH v3 0/7] Prepare mutable list iterators to cache cursor state
From: Kaitao Cheng @ 2026-06-24 12:58 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Alexei Starovoitov
  Cc: Andrew Morton, Jens Axboe, Tejun Heo, Alexander Viro,
	Christian Brauner, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner,
	Juri Lelli, Vincent Guittot, Paul Moore, Andy Shevchenko,
	Paul E. McKenney, Shakeel Butt, Christian König,
	David Howells, Simona Vetter, Randy Dunlap, Luca Ceresoli,
	Philipp Stanner, linux-block, LKML,
	open list:CONTROL GROUP (CGROUP), linux-ntfs-dev, Linux-Fsdevel,
	io-uring, audit, bpf, Network Development, dri-devel,
	linux-perf-use., linux-trace-kernel, kexec, live-patching,
	linux-modules, Linux Crypto Mailing List, Linux Power Management,
	rcu, sched-ext, linux-mm, virtualization, damon,
	clang-built-linux, chengkaitao
In-Reply-To: <8f98a3a6-f97b-4673-964f-fb09c8879e2e@kernel.org>



在 2026/6/22 19:27, David Hildenbrand (Arm) 写道:
> On 6/22/26 07:28, Alexei Starovoitov wrote:
>> On Sun, Jun 21, 2026 at 9:06 PM Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
>>>
>>> From: chengkaitao <chengkaitao@kylinos.cn>
>>>
>>> The list_for_each*_safe() helpers are used when the loop body may remove
>>> the current entry.  Their current interface, however, forces every caller
>>> to define a temporary cursor outside the macro and pass it in, even when
>>> the caller never uses that cursor directly.  For most call sites this
>>> extra cursor is just boilerplate required by the macro implementation.
>>>
>>> This is awkward because the saved next pointer is an internal detail of
>>> the iteration.  Callers that only remove or move the current entry do not
>>> need to spell it out.
>>>
>>> The _safe() suffix has also caused confusion.  Christian Koenig pointed
>>> out that the name is easy to read as a thread-safe variant, especially
>>> for beginners, even though it only means that the iterator keeps enough
>>> state to tolerate removal of the current entry.  He suggested _mutable()
>>> as a clearer description of what the loop permits.
>>>
>>> Add *_mutable() iterator variants for list, hlist and llist.  The new
>>> helpers are variadic and support both forms.  In the common case, the
>>> caller omits the temporary cursor and the macro creates a unique internal
>>> cursor with typeof(pos) and __UNIQUE_ID().  If a loop really needs an
>>> explicit temporary cursor, the caller can still pass it and the helper
>>> keeps the existing *_safe() behaviour.
>>>
>>> For example, a call site may use the shorter form:
>>>
>>>   list_for_each_entry_mutable(pos, head, member)
>>>
>>> or keep the explicit temporary cursor form:
>>>
>>>   list_for_each_entry_mutable(pos, tmp, head, member)
>>>
>>> The existing *_safe() helpers remain available for compatibility.  This
>>> series only converts users in mm, block, kernel, init and io_uring.  If
>>> this approach looks acceptable, the remaining users can be converted in
>>> follow-up series.
>>>
>>> Changes in v3 (Christian König, Andy Shevchenko):
>>> - Convert safe list walks to mutable iterators
>>>
>>> Changes in v2 (Muchun Song, Andy Shevchenko):
>>> - Drop the list_for_each_entry_mutable*() helpers from v1 and make the
>>>   cursor change directly in the existing list_for_each_entry*() helpers.
>>> - Open-code special list walks that rely on updating the loop cursor in
>>>   the body, preserving their existing traversal semantics.
>>>
>>> Link to v2:
>>> https://lore.kernel.org/all/20260609061347.93688-1-kaitao.cheng@linux.dev/
>>>
>>> Link to v1:
>>> https://lore.kernel.org/all/20260529082149.76764-1-kaitao.cheng@linux.dev/
>>>
>>> Kaitao Cheng (7):
>>>   list: Add mutable iterator variants
>>>   llist: Add mutable iterator variants
>>>   mm: Use mutable list iterators
>>>   block: Use mutable list iterators
>>>   kernel: Use mutable list iterators
>>>   initramfs: Use mutable list iterator
>>>   io_uring: Use mutable list iterators
>>>
>>>  block/bfq-iosched.c                 |  17 +-
>>>  block/blk-cgroup.c                  |  12 +-
>>>  block/blk-flush.c                   |   4 +-
>>>  block/blk-iocost.c                  |  18 +-
>>>  block/blk-mq.c                      |   8 +-
>>>  block/blk-throttle.c                |   4 +-
>>>  block/kyber-iosched.c               |   4 +-
>>>  block/partitions/ldm.c              |   8 +-
>>>  block/sed-opal.c                    |   4 +-
>>>  include/linux/list.h                | 269 ++++++++++++++++++++++++----
>>>  include/linux/llist.h               |  81 +++++++--
>>>  init/initramfs.c                    |   5 +-
>>>  io_uring/cancel.c                   |   6 +-
>>>  io_uring/poll.c                     |   3 +-
>>>  io_uring/rw.c                       |   4 +-
>>>  io_uring/timeout.c                  |   8 +-
>>>  io_uring/uring_cmd.c                |   3 +-
>>>  kernel/audit_tree.c                 |   4 +-
>>>  kernel/audit_watch.c                |  16 +-
>>>  kernel/auditfilter.c                |   4 +-
>>>  kernel/auditsc.c                    |   4 +-
>>>  kernel/bpf/arena.c                  |  10 +-
>>>  kernel/bpf/arraymap.c               |   8 +-
>>>  kernel/bpf/bpf_local_storage.c      |   3 +-
>>>  kernel/bpf/bpf_lru_list.c           |  25 ++-
>>>  kernel/bpf/btf.c                    |  18 +-
>>>  kernel/bpf/cgroup.c                 |   7 +-
>>>  kernel/bpf/cpumap.c                 |   4 +-
>>>  kernel/bpf/devmap.c                 |  10 +-
>>>  kernel/bpf/helpers.c                |   8 +-
>>>  kernel/bpf/local_storage.c          |   4 +-
>>>  kernel/bpf/memalloc.c               |  16 +-
>>>  kernel/bpf/offload.c                |   8 +-
>>>  kernel/bpf/states.c                 |   4 +-
>>>  kernel/bpf/stream.c                 |   4 +-
>>>  kernel/bpf/verifier.c               |   6 +-
>>>  kernel/cgroup/cgroup-v1.c           |   4 +-
>>>  kernel/cgroup/cgroup.c              |  54 +++---
>>>  kernel/cgroup/dmem.c                |  12 +-
>>>  kernel/cgroup/rdma.c                |   8 +-
>>>  kernel/events/core.c                |  44 +++--
>>>  kernel/events/uprobes.c             |  12 +-
>>>  kernel/exit.c                       |   8 +-
>>>  kernel/fail_function.c              |   4 +-
>>>  kernel/gcov/clang.c                 |   4 +-
>>>  kernel/irq_work.c                   |   4 +-
>>>  kernel/kexec_core.c                 |   4 +-
>>>  kernel/kprobes.c                    |  16 +-
>>>  kernel/livepatch/core.c             |   4 +-
>>>  kernel/livepatch/core.h             |   4 +-
>>>  kernel/liveupdate/kho_block.c       |   4 +-
>>>  kernel/liveupdate/luo_flb.c         |   4 +-
>>>  kernel/locking/rwsem.c              |   2 +-
>>>  kernel/locking/test-ww_mutex.c      |   2 +-
>>>  kernel/module/main.c                |  11 +-
>>>  kernel/padata.c                     |   4 +-
>>>  kernel/power/snapshot.c             |   8 +-
>>>  kernel/power/wakelock.c             |   4 +-
>>>  kernel/printk/printk.c              |  11 +-
>>>  kernel/ptrace.c                     |   4 +-
>>>  kernel/rcu/rcutorture.c             |   3 +-
>>>  kernel/rcu/tasks.h                  |   9 +-
>>>  kernel/rcu/tree.c                   |   6 +-
>>>  kernel/resource.c                   |   4 +-
>>>  kernel/sched/core.c                 |   4 +-
>>>  kernel/sched/ext.c                  |  22 +--
>>>  kernel/sched/fair.c                 |  28 +--
>>>  kernel/sched/topology.c             |   4 +-
>>>  kernel/sched/wait.c                 |   4 +-
>>>  kernel/seccomp.c                    |   4 +-
>>>  kernel/signal.c                     |  11 +-
>>>  kernel/smp.c                        |   4 +-
>>>  kernel/taskstats.c                  |   8 +-
>>>  kernel/time/clockevents.c           |   6 +-
>>>  kernel/time/clocksource.c           |   4 +-
>>>  kernel/time/posix-cpu-timers.c      |   4 +-
>>>  kernel/time/posix-timers.c          |   3 +-
>>>  kernel/torture.c                    |   3 +-
>>>  kernel/trace/bpf_trace.c            |   4 +-
>>>  kernel/trace/ftrace.c               |  49 +++--
>>>  kernel/trace/ring_buffer.c          |  25 ++-
>>>  kernel/trace/trace.c                |  12 +-
>>>  kernel/trace/trace_dynevent.c       |   6 +-
>>>  kernel/trace/trace_dynevent.h       |   5 +-
>>>  kernel/trace/trace_events.c         |  35 ++--
>>>  kernel/trace/trace_events_filter.c  |   4 +-
>>>  kernel/trace/trace_events_hist.c    |   8 +-
>>>  kernel/trace/trace_events_trigger.c |  17 +-
>>>  kernel/trace/trace_events_user.c    |  16 +-
>>>  kernel/trace/trace_stat.c           |   4 +-
>>>  kernel/user-return-notifier.c       |   3 +-
>>>  kernel/workqueue.c                  |  16 +-
>>>  mm/backing-dev.c                    |   8 +-
>>>  mm/balloon.c                        |   8 +-
>>>  mm/cma.c                            |   4 +-
>>>  mm/compaction.c                     |   4 +-
>>>  mm/damon/core.c                     |   4 +-
>>>  mm/damon/sysfs-schemes.c            |   4 +-
>>>  mm/dmapool.c                        |   4 +-
>>>  mm/huge_memory.c                    |   8 +-
>>>  mm/hugetlb.c                        |  56 +++---
>>>  mm/hugetlb_vmemmap.c                |  16 +-
>>>  mm/khugepaged.c                     |  14 +-
>>>  mm/kmemleak.c                       |   7 +-
>>>  mm/ksm.c                            |  25 +--
>>>  mm/list_lru.c                       |   4 +-
>>>  mm/memcontrol-v1.c                  |   8 +-
>>>  mm/memory-failure.c                 |  12 +-
>>>  mm/memory-tiers.c                   |   4 +-
>>>  mm/migrate.c                        |  23 ++-
>>>  mm/mmu_notifier.c                   |   9 +-
>>>  mm/page_alloc.c                     |   8 +-
>>>  mm/page_reporting.c                 |   2 +-
>>>  mm/percpu.c                         |  11 +-
>>>  mm/pgtable-generic.c                |   4 +-
>>>  mm/rmap.c                           |  10 +-
>>>  mm/shmem.c                          |   9 +-
>>>  mm/slab_common.c                    |  14 +-
>>>  mm/slub.c                           |  33 ++--
>>>  mm/swapfile.c                       |   4 +-
>>>  mm/userfaultfd.c                    |  12 +-
>>>  mm/vmalloc.c                        |  24 +--
>>>  mm/vmscan.c                         |   7 +-
>>>  mm/zsmalloc.c                       |   4 +-
>>>  124 files changed, 875 insertions(+), 681 deletions(-)
>>
>> Not sure what you were thinking, but this diff stat
>> is not landable.
> 
> Agreed. If we decide we want this, I guess we should target per-subsystem
> conversions.
> 
> If this goes through the MM tree, I would even appreciate doing this on a per-MM
> component granularity.
> 
> (unless we have some magic "Linus converts all of them" script, which I doubt we
> will have)

I strongly agree with the point above.

> Is there a way forward to replace list_for_each_*_safe entirely, possibly just
> reusing the old name but simply the parameter?

David Laight, Christian König, and Jani Nikula do not agree with using
clever macro syntax to support both calling forms at the same time,
so for now it is not possible to keep the original macro name and only
simplify the parameter. I may revert to the v1 version and ask everyone
for their opinions again.

-- 
Thanks
Kaitao Cheng


^ permalink raw reply

* Re: [PATCH 0/8] blk-cgroup: remove queue_lock nesting from blkcg paths
From: Jens Axboe @ 2026-06-24 12:53 UTC (permalink / raw)
  To: nilay, tom.leiming, bvanassche, tj, josef, yukuai, Yu Kuai
  Cc: akpm, chrisl, kasong, shikemeng, nphamcs, baohua, youngjun.park,
	cgroups, linux-block, linux-kernel, linux-mm, Baoquan He
In-Reply-To: <cover.1780621988.git.yukuai@fygo.io>


On Mon, 08 Jun 2026 11:42:41 +0800, Yu Kuai wrote:
> This series is the follow-up blk-cgroup locking cleanup on top of the
> earlier blkg-list protection fixes, and prepares blk-cgroup to stop using
> q->queue_lock as the global blkg lifetime/iteration lock.
> 
> The current queue_lock based protection is hard to maintain because
> queue_lock is used from hardirq and softirq completion paths, while some
> blkcg cgroup file paths also need to iterate blkgs, print policy data, or
> create blkgs from RCU-protected contexts.  This series first tightens the
> blkcg-side lifetime rules:
> 
> [...]

Applied, thanks!

[1/8] blk-cgroup: protect iterating blkgs with blkcg->lock in blkcg_print_stat()
      commit: 25656304dabd26198ec69460c594a19d086ef099
[2/8] blk-cgroup: delay freeing policy data after rcu grace period
      commit: 0af3fedb8c8ed3c07b4f76927bd7fc88f6f82efb
[3/8] blk-cgroup: don't nest queue_lock under rcu in blkcg_print_blkgs()
      commit: 56cc24f59c145ce6938959f792df04b8a4f5a4d8
[4/8] blk-cgroup: don't nest queue_lock under rcu in blkg_lookup_create()
      commit: 9327a865e395a53f67dffac4710beb1d4730495e
[5/8] blk-cgroup: don't nest queue_lock under rcu in bio_associate_blkg()
      commit: 457d3c4f0fdd6cf8a4bd8115bf470809984a9f02
[6/8] blk-cgroup: don't nest queue_lock under blkcg->lock in blkcg_destroy_blkgs()
      commit: 4cfd7c1cff8f4c863b99d420cdbe0563802a9e80
[7/8] mm/page_io: don't nest queue_lock under rcu in bio_associate_blkg_from_page()
      commit: f928145cbcb52544203808f159461d0a25543df7
[8/8] block, bfq: don't grab queue_lock to initialize bfq
      commit: 3ca4f4e3ae811d414076a491cbf0dfcdae0dc01e

Best regards,
-- 
Jens Axboe




^ permalink raw reply

* Re: [PATCH 0/8] blk-cgroup: remove queue_lock nesting from blkcg paths
From: Jens Axboe @ 2026-06-24 12:43 UTC (permalink / raw)
  To: yukuai, Yu Kuai, nilay, tom.leiming, bvanassche, tj, josef
  Cc: akpm, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
	youngjun.park, cgroups, linux-block, linux-kernel, linux-mm
In-Reply-To: <1c739fcc-5132-4cb2-bf34-cec94de26509@fygo.io>

On 6/24/26 12:57 AM, yu kuai wrote:
> Friendly ping ...
> 
> This set can still be applied cleanly for block-7.2 branch.

Not sure how you checked that, because patch 3 very much needs some
manual attention to get applied. I have applied it now.

-- 
Jens Axboe


^ permalink raw reply

* [PATCH] tools/cgroup: iocost_monitor: parse help before importing drgn
From: Yousef Alhouseen @ 2026-06-24 12:36 UTC (permalink / raw)
  To: tj, josef, axboe; +Cc: cgroups, linux-block, linux-kernel, Yousef Alhouseen

iocost_monitor.py imports drgn before argparse can handle "-h" or report
argument errors. That makes basic usage help fail on systems where drgn is
not installed.

Parse arguments before importing drgn so the help and argument-error paths
work without the runtime debugging dependency. Normal execution still
imports drgn before reading kernel state.

Signed-off-by: Yousef Alhouseen <alhouseenyousef@gmail.com>
---
 tools/cgroup/iocost_monitor.py | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/cgroup/iocost_monitor.py b/tools/cgroup/iocost_monitor.py
index 933c750b3..bdd78ba27 100644
--- a/tools/cgroup/iocost_monitor.py
+++ b/tools/cgroup/iocost_monitor.py
@@ -15,11 +15,6 @@ import time
 import json
 import math
 
-import drgn
-from drgn import container_of
-from drgn.helpers.linux.list import list_for_each_entry,list_empty
-from drgn.helpers.linux.radixtree import radix_tree_for_each,radix_tree_lookup
-
 import argparse
 parser = argparse.ArgumentParser(description=desc,
                                  formatter_class=argparse.RawTextHelpFormatter)
@@ -34,6 +29,11 @@ parser.add_argument('--json', action='store_true',
                     help='Output in json')
 args = parser.parse_args()
 
+import drgn
+from drgn import container_of
+from drgn.helpers.linux.list import list_for_each_entry,list_empty
+from drgn.helpers.linux.radixtree import radix_tree_for_each,radix_tree_lookup
+
 def err(s):
     print(s, file=sys.stderr, flush=True)
     sys.exit(1)
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH v3 0/7] Prepare mutable list iterators to cache cursor state
From: Kaitao Cheng @ 2026-06-24 12:29 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Alexei Starovoitov, Andrew Morton, David Hildenbrand, Jens Axboe,
	Tejun Heo, Alexander Viro, Christian Brauner, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Johannes Weiner, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Thomas Gleixner, Juri Lelli, Vincent Guittot, Paul Moore,
	Paul E. McKenney, Shakeel Butt, Christian König,
	David Howells, Simona Vetter, Randy Dunlap, Luca Ceresoli,
	Philipp Stanner, linux-block, LKML,
	open list:CONTROL GROUP (CGROUP), linux-ntfs-dev, Linux-Fsdevel,
	io-uring, audit, bpf, Network Development, dri-devel,
	linux-perf-use., linux-trace-kernel, kexec, live-patching,
	linux-modules, Linux Crypto Mailing List, Linux Power Management,
	rcu, sched-ext, linux-mm, virtualization, damon,
	clang-built-linux, chengkaitao, Muchun Song
In-Reply-To: <ajkSftEbdGoiJXYs@ashevche-desk.local>



在 2026/6/22 18:46, Andy Shevchenko 写道:
> On Mon, Jun 22, 2026 at 02:15:01PM +0800, Kaitao Cheng wrote:
>> 在 2026/6/22 13:28, Alexei Starovoitov 写道:
>>> On Sun, Jun 21, 2026 at 9:06 PM Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
> 
> ...
> 
>>>>  block/bfq-iosched.c                 |  17 +-
>>>>  block/blk-cgroup.c                  |  12 +-
>>>>  block/blk-flush.c                   |   4 +-
>>>>  block/blk-iocost.c                  |  18 +-
>>>>  block/blk-mq.c                      |   8 +-
>>>>  block/blk-throttle.c                |   4 +-
>>>>  block/kyber-iosched.c               |   4 +-
>>>>  block/partitions/ldm.c              |   8 +-
>>>>  block/sed-opal.c                    |   4 +-
>>>>  include/linux/list.h                | 269 ++++++++++++++++++++++++----
>>>>  include/linux/llist.h               |  81 +++++++--
>>>>  init/initramfs.c                    |   5 +-
>>>>  io_uring/cancel.c                   |   6 +-
>>>>  io_uring/poll.c                     |   3 +-
>>>>  io_uring/rw.c                       |   4 +-
>>>>  io_uring/timeout.c                  |   8 +-
>>>>  io_uring/uring_cmd.c                |   3 +-
>>>>  kernel/audit_tree.c                 |   4 +-
>>>>  kernel/audit_watch.c                |  16 +-
>>>>  kernel/auditfilter.c                |   4 +-
>>>>  kernel/auditsc.c                    |   4 +-
>>>>  kernel/bpf/arena.c                  |  10 +-
>>>>  kernel/bpf/arraymap.c               |   8 +-
>>>>  kernel/bpf/bpf_local_storage.c      |   3 +-
>>>>  kernel/bpf/bpf_lru_list.c           |  25 ++-
>>>>  kernel/bpf/btf.c                    |  18 +-
>>>>  kernel/bpf/cgroup.c                 |   7 +-
>>>>  kernel/bpf/cpumap.c                 |   4 +-
>>>>  kernel/bpf/devmap.c                 |  10 +-
>>>>  kernel/bpf/helpers.c                |   8 +-
>>>>  kernel/bpf/local_storage.c          |   4 +-
>>>>  kernel/bpf/memalloc.c               |  16 +-
>>>>  kernel/bpf/offload.c                |   8 +-
>>>>  kernel/bpf/states.c                 |   4 +-
>>>>  kernel/bpf/stream.c                 |   4 +-
>>>>  kernel/bpf/verifier.c               |   6 +-
>>>>  kernel/cgroup/cgroup-v1.c           |   4 +-
>>>>  kernel/cgroup/cgroup.c              |  54 +++---
>>>>  kernel/cgroup/dmem.c                |  12 +-
>>>>  kernel/cgroup/rdma.c                |   8 +-
>>>>  kernel/events/core.c                |  44 +++--
>>>>  kernel/events/uprobes.c             |  12 +-
>>>>  kernel/exit.c                       |   8 +-
>>>>  kernel/fail_function.c              |   4 +-
>>>>  kernel/gcov/clang.c                 |   4 +-
>>>>  kernel/irq_work.c                   |   4 +-
>>>>  kernel/kexec_core.c                 |   4 +-
>>>>  kernel/kprobes.c                    |  16 +-
>>>>  kernel/livepatch/core.c             |   4 +-
>>>>  kernel/livepatch/core.h             |   4 +-
>>>>  kernel/liveupdate/kho_block.c       |   4 +-
>>>>  kernel/liveupdate/luo_flb.c         |   4 +-
>>>>  kernel/locking/rwsem.c              |   2 +-
>>>>  kernel/locking/test-ww_mutex.c      |   2 +-
>>>>  kernel/module/main.c                |  11 +-
>>>>  kernel/padata.c                     |   4 +-
>>>>  kernel/power/snapshot.c             |   8 +-
>>>>  kernel/power/wakelock.c             |   4 +-
>>>>  kernel/printk/printk.c              |  11 +-
>>>>  kernel/ptrace.c                     |   4 +-
>>>>  kernel/rcu/rcutorture.c             |   3 +-
>>>>  kernel/rcu/tasks.h                  |   9 +-
>>>>  kernel/rcu/tree.c                   |   6 +-
>>>>  kernel/resource.c                   |   4 +-
>>>>  kernel/sched/core.c                 |   4 +-
>>>>  kernel/sched/ext.c                  |  22 +--
>>>>  kernel/sched/fair.c                 |  28 +--
>>>>  kernel/sched/topology.c             |   4 +-
>>>>  kernel/sched/wait.c                 |   4 +-
>>>>  kernel/seccomp.c                    |   4 +-
>>>>  kernel/signal.c                     |  11 +-
>>>>  kernel/smp.c                        |   4 +-
>>>>  kernel/taskstats.c                  |   8 +-
>>>>  kernel/time/clockevents.c           |   6 +-
>>>>  kernel/time/clocksource.c           |   4 +-
>>>>  kernel/time/posix-cpu-timers.c      |   4 +-
>>>>  kernel/time/posix-timers.c          |   3 +-
>>>>  kernel/torture.c                    |   3 +-
>>>>  kernel/trace/bpf_trace.c            |   4 +-
>>>>  kernel/trace/ftrace.c               |  49 +++--
>>>>  kernel/trace/ring_buffer.c          |  25 ++-
>>>>  kernel/trace/trace.c                |  12 +-
>>>>  kernel/trace/trace_dynevent.c       |   6 +-
>>>>  kernel/trace/trace_dynevent.h       |   5 +-
>>>>  kernel/trace/trace_events.c         |  35 ++--
>>>>  kernel/trace/trace_events_filter.c  |   4 +-
>>>>  kernel/trace/trace_events_hist.c    |   8 +-
>>>>  kernel/trace/trace_events_trigger.c |  17 +-
>>>>  kernel/trace/trace_events_user.c    |  16 +-
>>>>  kernel/trace/trace_stat.c           |   4 +-
>>>>  kernel/user-return-notifier.c       |   3 +-
>>>>  kernel/workqueue.c                  |  16 +-
>>>>  mm/backing-dev.c                    |   8 +-
>>>>  mm/balloon.c                        |   8 +-
>>>>  mm/cma.c                            |   4 +-
>>>>  mm/compaction.c                     |   4 +-
>>>>  mm/damon/core.c                     |   4 +-
>>>>  mm/damon/sysfs-schemes.c            |   4 +-
>>>>  mm/dmapool.c                        |   4 +-
>>>>  mm/huge_memory.c                    |   8 +-
>>>>  mm/hugetlb.c                        |  56 +++---
>>>>  mm/hugetlb_vmemmap.c                |  16 +-
>>>>  mm/khugepaged.c                     |  14 +-
>>>>  mm/kmemleak.c                       |   7 +-
>>>>  mm/ksm.c                            |  25 +--
>>>>  mm/list_lru.c                       |   4 +-
>>>>  mm/memcontrol-v1.c                  |   8 +-
>>>>  mm/memory-failure.c                 |  12 +-
>>>>  mm/memory-tiers.c                   |   4 +-
>>>>  mm/migrate.c                        |  23 ++-
>>>>  mm/mmu_notifier.c                   |   9 +-
>>>>  mm/page_alloc.c                     |   8 +-
>>>>  mm/page_reporting.c                 |   2 +-
>>>>  mm/percpu.c                         |  11 +-
>>>>  mm/pgtable-generic.c                |   4 +-
>>>>  mm/rmap.c                           |  10 +-
>>>>  mm/shmem.c                          |   9 +-
>>>>  mm/slab_common.c                    |  14 +-
>>>>  mm/slub.c                           |  33 ++--
>>>>  mm/swapfile.c                       |   4 +-
>>>>  mm/userfaultfd.c                    |  12 +-
>>>>  mm/vmalloc.c                        |  24 +--
>>>>  mm/vmscan.c                         |   7 +-
>>>>  mm/zsmalloc.c                       |   4 +-
>>>>  124 files changed, 875 insertions(+), 681 deletions(-)
>>>
>>> Not sure what you were thinking, but this diff stat
>>> is not landable.
>>
>> [PATCH v3 1/7] and [PATCH v3 2/7] contain the main logic and can
>> be merged directly. They are also compatible with the old API.
>> [PATCH v3 3/7] through [PATCH v3 7/7] are just simple interface
>> replacements and do not change any functional logic. They can be
>> left unmerged for now; individual modules can pick them up later
>> if needed.
>>
>> In v2, Andy Shevchenko mentioned: "If it's done by Linus himself
>> during the day when he prepares -rc1, it's fine."
> 
> Yes, but you need to get his blessing first to go with this.
> Have you communicated with him on this?

Not yet, because the overall approach is still not mature. People
have different opinions on the implementation details and on how
to move this forward, so I think we should iterate through a few
versions first before making a final decision.

>> Even so, the
>> changes in this patch series are indeed quite large and touch
>> almost every subsystem. I have only converted part of them for
>> now, so I wanted to send this out first and see what people think.
> 
> That's why it's better to provide a script to convert (e.g., coccinelle)
> instead of tons of patches.

I tried writing conversion scripts with Coccinelle, but there were
always cases that got missed. In contrast, I found that using AI
for focused replacements was actually more efficient.

As David Hildenbrand mentioned, "If we decide we want this, I guess
we should target per-subsystem conversions." I would like to provide
the new interface first; adapting each subsystem on demand later may
be easier to achieve.
-- 
Thanks
Kaitao Cheng


^ permalink raw reply

* Re: [RFC PATCH v6 00/25] Hierarchical Constant Bandwidth Server
From: Juri Lelli @ 2026-06-24 10:35 UTC (permalink / raw)
  To: luca abeni
  Cc: Yuri Andriaccio, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tejun Heo, Johannes Weiner,
	Michal Koutný, cgroups, linux-kernel, Yuri Andriaccio
In-Reply-To: <20260624091912.4fca8428@nowhere>

Hi Luca,

On 24/06/26 09:19, luca abeni wrote:
> Hi Juri,
> 
> very interesting demo, thanks!
> 
> On Tue, 23 Jun 2026 12:15:08 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
> [...]
> >  - At 1ms task periods, the dl-server period is the critical tuning
> >    parameter, less the bandwidth. A 10ms dl-server with 60% bandwidth
> >    caused ~10% miss rates because the worst-case throttle gap (4ms)
> >    spanned multiple 1ms deadlines. Switching to a 2ms dl-server period
> >    at just 30% bandwidth eliminated all misses.
> > 
> >  - A simple Rule of thumb might be to set the dl-server period to at
> >    most 2x the shortest task period in the cgroup (e.g., 2ms dl-server
> >    for 1ms tasks, 10ms for 10ms tasks). Would you (and Luca?) agree or
> >    would you suggest something different?
> 
> With one single RT task in the cgroup (or with multiple synchronized RT
> tasks having the same period), I agree... Technically, the cgroup period
> P should be such that P - Q = T - WCET (where "Q" is the cgroup's
> runtime and "T" is the period of the task), but to see missed deadlines
> you need a relevant competing deadline (or HCBS) workload.
> 
> So, yes, I agree with your findings above.

Great, thanks a lot for taking a look!

> If we consider multi-core analysis, or multiple RT threads with
> different, non synchronized, periods, then analysis tool by Yuri
> (leveraging CSF analysis from real-time literature) is needed... But
> that is pretty pessimistic. The rule you suggest above is a better
> starting point in practical situations.

Indeed. I actually wonder if it would make sense to "extract" that tool
from the test suite and package it somehow so that it's easier for end
users to design their interfaces.

Best,
Juri


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox