* Re: [PATCH v3 08/13] genirq: Add explicit housekeeping callback for managed IRQ migration
From: Thomas Gleixner @ 2026-06-23 20:45 UTC
To: Jing Wu; +Cc: Jing Wu, Waiman Long, linux-kernel, rcu, cgroups, Qiliang Yuan
In-Reply-To: <20260623043641.2391662-1-realwujing@gmail.com>
On Tue, Jun 23 2026 at 12:36, Jing Wu wrote:
> On Thu, Jun 18 2026 at 22:27, Thomas Gleixner wrote:
> That said, I fully accept the architectural feedback: the on-the-fly
> subsystem modification approach in v3 is wrong, and v4 should use the
> CPU hotplug machinery.
>
> We are open to coordinating with Waiman on a unified approach that
> covers both use cases. Before starting v4, two questions:
>
> 1. Is the "no boot parameter required" use case worth pursuing
> independently, or should it be folded into Waiman's series?
Sort it out with him.
> 2. For the hotplug path: is CPU-by-CPU offline/online the expected
> mechanism, given that you rejected the cpuhp_offline_cb() bulk
> approach in Waiman's v1?
I think so. It makes the most sense.
* [PATCH v2 0/2] cgroup/cpuset: Miscellaneous fixes and cleanups
From: Waiman Long @ 2026-06-23 23:04 UTC
To: Tejun Heo, Johannes Weiner, Michal Koutný, Ridong Chen,
Jonathan Corbet, Shuah Khan
Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Waiman Long
v2:
- Update patch 1 as suggested by Ridong Chen and add new test cases.
- Minor update to patch 2 code and commit log.
Patch 1 updates compute_effective_cpumask() and adds a new
compute_effective_nodemask() helper to make sure that effective_cpus
and effective_mems inherit the parent's versions in v2 if
cpuset.cpus/cpuset.mems is empty.
Patch 2 makes cpuset_update_tasks_nodemask() perform memory rebind
and migration only for the thread group leader, like cpuset_attach().
Waiman Long (2):
cgroup/cpuset: Avoid unnecessary cpus & mems update in
cpuset_hotplug_update_tasks()
cgroup/cpuset: Rebind/migrate mm only for threadgroup leader in
cpuset_update_tasks_nodemask()
Documentation/admin-guide/cgroup-v2.rst | 7 +++
kernel/cgroup/cpuset.c | 49 ++++++++++++-------
.../selftests/cgroup/test_cpuset_prs.sh | 11 ++++-
3 files changed, 46 insertions(+), 21 deletions(-)
--
2.54.0
* [PATCH v2 1/2] cgroup/cpuset: Avoid unnecessary cpus & mems update in cpuset_hotplug_update_tasks()
From: Waiman Long @ 2026-06-23 23:04 UTC
To: Tejun Heo, Johannes Weiner, Michal Koutný, Ridong Chen,
Jonathan Corbet, Shuah Khan
Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Waiman Long
In-Reply-To: <20260623230413.1984188-1-longman@redhat.com>
As reported by sashiko [1], cpuset_hotplug_update_tasks() may perform
unnecessary task iteration and updating of tasks' CPU and node masks
when mems_allowed and/or cpus_allowed are not set in cpuset v2. This
happens because the temporary new_cpus and new_mems masks do not
inherit the parent's effective_cpus/mems when they are empty, which is
the expected behavior for cpuset v2 since commit 4ec22e9c5a90 ("cpuset:
Enable cpuset controller in default hierarchy").
Fix that and avoid unnecessary work by enhancing
compute_effective_cpumask() to check for an empty cpumask and
inherit the parent's version when in v2 mode. A new
compute_effective_nodemask() helper is also added to perform the same
function for the new effective_mems.
Add new test_cpuset_prs.sh test cases to confirm that effective_cpus
will inherit the parent's version if cpuset.cpus is empty.
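The inheritance rule described above can be sketched in userspace C.
This is only a toy model, not the kernel code: struct toy_cpuset, the
64-bit bitmaps and the toy_* helpers are hypothetical stand-ins for
struct cpuset, cpumask_t and compute_effective_cpumask(), and the
"needs update" check assumes the hotplug path compares the freshly
computed mask against the stored effective mask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpuset: masks are plain 64-bit bitmaps. */
struct toy_cpuset {
	uint64_t cpus_allowed;   /* models cpuset.cpus (0 means "not set") */
	uint64_t effective_cpus; /* models cpuset.cpus.effective */
};

/*
 * Models the patched helper: intersect the child's configured mask with
 * the parent's effective mask and, in v2 mode, fall back to the parent's
 * effective mask when the intersection is empty.
 */
static inline uint64_t toy_effective_cpus(const struct toy_cpuset *cs,
					  const struct toy_cpuset *parent,
					  bool v2_mode)
{
	uint64_t new_cpus = cs->cpus_allowed & parent->effective_cpus;

	if (!new_cpus && v2_mode)
		new_cpus = parent->effective_cpus;
	return new_cpus;
}

/* True when the recomputed mask differs from the stored one. */
static inline bool toy_needs_task_update(const struct toy_cpuset *cs,
					 const struct toy_cpuset *parent,
					 bool v2_mode)
{
	return toy_effective_cpus(cs, parent, v2_mode) != cs->effective_cpus;
}
```

With a child whose cpus_allowed is empty and whose effective_cpus
already equals the parent's, the v2 fallback makes the recomputed mask
match, so no task update is flagged. Without the fallback the
recomputed mask is empty and an update is flagged every time.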
[1] https://sashiko.dev/#/patchset/20260621032816.1806773-1-longman%40redhat.com
Suggested-by: Ridong Chen <ridong.chen@linux.dev>
Fixes: 4ec22e9c5a90 ("cpuset: Enable cpuset controller in default hierarchy")
Signed-off-by: Waiman Long <longman@redhat.com>
---
kernel/cgroup/cpuset.c | 45 +++++++++++--------
.../selftests/cgroup/test_cpuset_prs.sh | 11 ++++-
2 files changed, 35 insertions(+), 21 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index aff86acea701..044ddbf66f8e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1094,12 +1094,35 @@ void cpuset_update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
* @cs: the cpuset the need to recompute the new effective_cpus mask
* @parent: the parent cpuset
*
+ * For v2, the parent's effective_cpus is inherited if cpumask empty.
* The result is valid only if the given cpuset isn't a partition root.
*/
static void compute_effective_cpumask(struct cpumask *new_cpus,
struct cpuset *cs, struct cpuset *parent)
{
- cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus);
+ bool has_cpus;
+
+ has_cpus = cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus);
+ if (!has_cpus && is_in_v2_mode())
+ cpumask_copy(new_cpus, parent->effective_cpus);
+}
+
+/**
+ * compute_effective_nodemask - Compute the effective nodemask of the cpuset
+ * @new_mems: the temp variable for the new effective_mems mask
+ * @cs: the cpuset that needs its effective_mems mask recomputed
+ * @parent: the parent cpuset
+ *
+ * For v2, the parent's effective_mems is inherited if nodemask empty.
+ */
+static void compute_effective_nodemask(nodemask_t *new_mems,
+ struct cpuset *cs, struct cpuset *parent)
+{
+ bool has_mems;
+
+ has_mems = nodes_and(*new_mems, cs->mems_allowed, parent->effective_mems);
+ if (!has_mems && is_in_v2_mode())
+ nodes_copy(*new_mems, parent->effective_mems);
}
/*
@@ -2148,15 +2171,6 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
goto update_parent_effective;
}
- /*
- * If it becomes empty, inherit the effective mask of the
- * parent, which is guaranteed to have some CPUs unless
- * it is a partition root that has explicitly distributed
- * out all its CPUs.
- */
- if (is_in_v2_mode() && !remote && cpumask_empty(tmp->new_cpus))
- cpumask_copy(tmp->new_cpus, parent->effective_cpus);
-
/*
* Skip the whole subtree if
* 1) the cpumask remains the same,
@@ -2704,14 +2718,7 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
cpuset_for_each_descendant_pre(cp, pos_css, cs) {
struct cpuset *parent = parent_cs(cp);
- bool has_mems = nodes_and(*new_mems, cp->mems_allowed, parent->effective_mems);
-
- /*
- * If it becomes empty, inherit the effective mask of the
- * parent, which is guaranteed to have some MEMs.
- */
- if (is_in_v2_mode() && !has_mems)
- *new_mems = parent->effective_mems;
+ compute_effective_nodemask(new_mems, cp, parent);
/* Skip the whole subtree if the nodemask remains the same. */
if (nodes_equal(*new_mems, cp->effective_mems)) {
@@ -3923,7 +3930,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
parent = parent_cs(cs);
compute_effective_cpumask(&new_cpus, cs, parent);
- nodes_and(new_mems, cs->mems_allowed, parent->effective_mems);
+ compute_effective_nodemask(&new_mems, cs, parent);
if (!tmp || !cs->partition_root_state)
goto update_tasks;
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index 0d41aa0d343d..ca9bc38fdb95 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -495,13 +495,20 @@ REMOTE_TEST_MATRIX=(
# Narrowing cpuset.cpus to previously sibling-excluded CPUs should
# not return CPUs that were never actually owned.
" C1-4:P1 . C1-2:P1 C1-3:P2 . . \
- . . . C3 . . p1:4|c11:1-2|c12:3 \
+ . . . C3 . . p1:4|c11:1-2|c12:3 \
p1:P1|c11:P1|c12:P2 3"
# Expanding cpuset.cpus to include a previously sibling-excluded CPU
# after the sibling has become a member should correctly request it.
" C1-4:P1 . C1-2:P1 C1-3:P2 . . \
- . . P0 C2-3 . . p1:1,4|c11:1|c12:2-3 \
+ . . P0 C2-3 . . p1:1,4|c11:1|c12:2-3 \
p1:P1|c11:P0|c12:P2 2-3"
+ # Cpusets with empty cpuset.cpus should inherit parent's effective_cpus
+ " C1-4:P1 C5-6 C1-2 . C5 . \
+ . P1 P1 . . . p1:3-4|p2:5-6|c11:1-2|c12:3-4|c21:5|c22:5-6 \
+ p1:P1|p2:P1|c11:P1"
+ " C1-4:P1 C5-6 C1-2 . C5 . \
+ . P1 P1 . O5=0 . p1:3-4|p2:6|c11:1-2|c12:3-4|c21:6|c22:6 \
+ p1:P1|p2:P1|c11:P1"
)
#
--
2.54.0
* [PATCH v2 2/2] cgroup/cpuset: Rebind/migrate mm only for threadgroup leader in cpuset_update_tasks_nodemask()
From: Waiman Long @ 2026-06-23 23:04 UTC
To: Tejun Heo, Johannes Weiner, Michal Koutný, Ridong Chen,
Jonathan Corbet, Shuah Khan
Cc: cgroups, linux-kernel, linux-doc, linux-kselftest, Waiman Long
In-Reply-To: <20260623230413.1984188-1-longman@redhat.com>
As reported by sashiko [1], cpuset_update_tasks_nodemask() will do
mpol_rebind_mm() and possibly cpuset_migrate_mm() for all threads of
a multithreaded process. Since commit 3df9ca0a2b8b ("cpuset: migrate
memory only for threadgroup leaders"), cpuset_attach() has rebound and
migrated memory only for threadgroup leaders, treating the group
leader as the owner of the mm_struct.
To be consistent and avoid unnecessary performance overhead for heavily
multithreaded processes, follow the cpuset_attach() example and perform
memory rebind and migration only for threadgroup leaders.
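To illustrate the saving, here is a small userspace sketch that counts
how many mm rebind/migrate operations a task walk would trigger with
and without the leader filter. The struct toy_task type and the toy_*
helpers are hypothetical, and the leader test is modeled as
pid == tgid, which is an assumption made for the sketch rather than how
the kernel's thread_group_leader() is implemented.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task: only the IDs matter here. */
struct toy_task {
	int pid;  /* thread ID */
	int tgid; /* thread group ID, equal to pid for the group leader */
};

/* Modeled leader test (assumption: leader iff pid == tgid). */
static inline bool toy_thread_group_leader(const struct toy_task *t)
{
	return t->pid == t->tgid;
}

/* Count how many tasks in the walk would trigger an mm rebind/migrate. */
static inline size_t toy_count_mm_rebinds(const struct toy_task *tasks,
					  size_t n, bool leaders_only)
{
	size_t rebinds = 0;

	for (size_t i = 0; i < n; i++) {
		/* The per-task nodemask update would still happen here. */
		if (leaders_only && !toy_thread_group_leader(&tasks[i]))
			continue;
		rebinds++;
	}
	return rebinds;
}
```

For one process with three threads plus one single-threaded process,
the unfiltered walk performs four rebinds while the filtered walk
performs two, one per mm.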
Also add a paragraph to cgroup-v2.rst under cpuset.mems stating that the
threadgroup leader is the memory owner of that threadgroup. Therefore,
non-leading threads shouldn't be placed in other cgroups whose "cpuset.mems"
doesn't fully overlap that of the group leader.
[1] https://sashiko.dev/#/patchset/20260621032816.1806773-1-longman%40redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Ridong Chen <ridong.chen@linux.dev>
---
Documentation/admin-guide/cgroup-v2.rst | 7 +++++++
kernel/cgroup/cpuset.c | 4 ++++
2 files changed, 11 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 993446ab66d0..f9c353174a7e 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2527,6 +2527,13 @@ Cpuset Interface Files
a need to change "cpuset.mems" with active tasks, it shouldn't
be done frequently.
+ For a multithreaded process, the threadgroup leader is
+ considered the owner of the group's memory. Memory policy
+ rebinding and migration will only happen with respect to the
+ threadgroup leader. To avoid unexpected results, non-leading
+ threads shouldn't be put into another cgroup whose "cpuset.mems"
+ doesn't fully overlap that of the threadgroup leader.
+
cpuset.mems.effective
A read-only multiple values file which exists on all
cpuset-enabled cgroups.
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 044ddbf66f8e..055ae54a040a 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2673,6 +2673,10 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
cpuset_change_task_nodemask(task, &newmems);
+ /* Rebind and migrate mm only for thread group leader */
+ if (!thread_group_leader(task))
+ continue;
+
mm = get_task_mm(task);
if (!mm)
continue;
--
2.54.0
* [PATCH] sched/core: Add core_sibling_idle accounting
From: Yuxuan Liu @ 2026-06-23 23:43 UTC
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Tejun Heo, Zefan Li, Johannes Weiner, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider, cgroups,
linux-kernel, Fernand Sieber, David Woodhouse, Alexander Graf,
Misha Karataiev, nh-open-source, Yuxuan Liu
When a VM runs on one SMT thread and core scheduling leaves the sibling
idle because no compatible workload can share the core, existing metrics
(forced idle time) only report this cost if another task is actively
waiting. Without waiting tasks, the stranded capacity looks like free
capacity to external fleet management software, leading it to place
additional workloads onto hosts that are already effectively fully
loaded. Add core_sibling_idle to capture all idle time caused by core
scheduling constraints so orchestrators can make accurate placement
decisions.
To avoid redundant bookkeeping, forceidle and sibling_idle accounting
are consolidated into a single function __sched_core_account_idle().
Both metrics share a common timestamp (core_sibling_idle_start) and
occupation count (core_sibling_idle_occupation), replacing the separate
core_forceidle_start and core_forceidle_occupation fields. The
forceidle subset is derived from core_forceidle_count within the same
accounting pass.
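The shared scaling can be sketched as a standalone function. This is a
userspace model of the arithmetic only: struct toy_idle_charge and
toy_account_idle() are hypothetical names, plain 64-bit division stands
in for div_u64(), and locking, clocks and the per-task charging loop
are left out.

```c
#include <stdint.h>

/* Per-task charge produced by one accounting pass (hypothetical type). */
struct toy_idle_charge {
	uint64_t si_delta; /* sibling idle time charged to each running task */
	uint64_t fi_delta; /* forced idle time charged to each running task */
};

/*
 * delta:     time elapsed since the shared start timestamp
 * smt_width: number of SMT siblings in the core
 * occ:       number of siblings running a cookied task
 * fi_count:  number of idle siblings that have tasks waiting
 */
static inline struct toy_idle_charge toy_account_idle(uint64_t delta,
						      unsigned int smt_width,
						      unsigned int occ,
						      unsigned int fi_count)
{
	struct toy_idle_charge c = { 0, 0 };
	unsigned int idle_count;

	/* Nothing to charge without a running task or an idle sibling. */
	if (!occ || occ >= smt_width)
		return c;

	idle_count = smt_width - occ;
	c.si_delta = delta * idle_count / occ;
	if (fi_count)
		c.fi_delta = delta * fi_count / occ;
	return c;
}
```

On SMT-2 with one running task, both charges equal delta when the idle
sibling has waiters, and only si_delta is non-zero when it does not. On
a hypothetical SMT-4 core with one running task and two forced-idle
siblings, a delta of 1200 gives si_delta = 3600 and fi_delta = 2400.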
The new metric is exposed as core_sched.sibling_idle_usec in cgroup v2
cpu.stat, alongside the existing core_sched.force_idle_usec. The
per-task core_sibling_idle_sum is also available via /proc/<pid>/sched
for debugging.
== Testing ==
Testing is done using QEMU.
=== Scenario 1: No CPU Contention ===
The system has 2 CPUs, with 1 VM (2 vCPUs) that uses core scheduling and
runs an infinite loop pinned to vCPU 0:
taskset -c 0 sh -c 'while true; do :; done' &
In the VM's cpu.stat, core_sched.force_idle_usec is near 0 (199 us)
while core_sched.sibling_idle_usec (117796370 us) is close to
usage_usec (123946273 us), at roughly 95% of it.
=== Scenario 2: With CPU Contention ===
Same setup as Scenario 1 except with 2 VMs (2 vCPUs each).
Both VMs report nearly identical core_sched.force_idle_usec and
core_sched.sibling_idle_usec values in their respective cpu.stat, with
sibling_idle_usec being slightly higher.
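As a sketch of how an orchestrator might consume the new field, the
helper below looks up keys in cpu.stat-style text and expresses sibling
idle time as basis points of usage. The toy_* function names are
hypothetical, and the sketch assumes each line has the form
"key value" as shown in the numbers quoted above.

```c
#include <stdio.h>
#include <string.h>

/* Find "key value" in cpu.stat-style text. Returns 1 if the key exists. */
static inline int toy_stat_lookup(const char *text, const char *key,
				  unsigned long long *out)
{
	const char *line = text;

	while (line && *line) {
		char name[64];
		unsigned long long val;

		if (sscanf(line, "%63s %llu", name, &val) == 2 &&
		    strcmp(name, key) == 0) {
			*out = val;
			return 1;
		}
		/* Advance to the character after the next newline, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return 0;
}

/* Sibling idle time in basis points (1/100 of a percent) of usage. */
static inline unsigned long long toy_sibling_idle_bp(const char *text)
{
	unsigned long long usage, si;

	if (!toy_stat_lookup(text, "usage_usec", &usage) ||
	    !toy_stat_lookup(text, "core_sched.sibling_idle_usec", &si) ||
	    !usage)
		return 0;
	return si * 10000ULL / usage;
}
```

Fed the Scenario 1 values (usage_usec 123946273 and
core_sched.sibling_idle_usec 117796370), the helper reports 9503 basis
points, which is about 95% of the usage time.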
Signed-off-by: Yuxuan Liu <liuyuxua@amazon.com>
---
include/linux/cgroup-defs.h | 1 +
include/linux/kernel_stat.h | 2 ++
include/linux/sched.h | 1 +
kernel/cgroup/rstat.c | 11 ++++++
kernel/sched/core.c | 33 +++++++++---------
kernel/sched/core_sched.c | 67 +++++++++++++++++++++++++------------
kernel/sched/cputime.c | 12 +++++++
kernel/sched/debug.c | 1 +
kernel/sched/sched.h | 17 ++++++----
9 files changed, 101 insertions(+), 44 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index c0c2b26725d0f..b65c910cbd872 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -301,6 +301,7 @@ struct cgroup_base_stat {
#ifdef CONFIG_SCHED_CORE
u64 forceidle_sum;
+ u64 sibling_idle_sum;
#endif
};
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 9935f7ecbfb9e..0e1386a9816ff 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -30,6 +30,7 @@ enum cpu_usage_stat {
CPUTIME_GUEST_NICE,
#ifdef CONFIG_SCHED_CORE
CPUTIME_FORCEIDLE,
+ CPUTIME_SIBLING_IDLE,
#endif
NR_STATS,
};
@@ -132,6 +133,7 @@ extern void account_idle_ticks(unsigned long ticks);
#ifdef CONFIG_SCHED_CORE
extern void __account_forceidle_time(struct task_struct *tsk, u64 delta);
+extern void __account_sibling_idle_time(struct task_struct *tsk, u64 delta);
#endif
#endif /* _LINUX_KERNEL_STAT_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fad3aad97c7b0..5b1a1c247b12a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -544,6 +544,7 @@ struct sched_statistics {
#ifdef CONFIG_SCHED_CORE
u64 core_forceidle_sum;
+ u64 core_sibling_idle_sum;
#endif
#endif /* CONFIG_SCHEDSTATS */
} ____cacheline_aligned;
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index c32439b855f5d..29ef399d1c7e7 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -326,6 +326,7 @@ static void cgroup_base_stat_add(struct cgroup_base_stat *dst_bstat,
dst_bstat->cputime.sum_exec_runtime += src_bstat->cputime.sum_exec_runtime;
#ifdef CONFIG_SCHED_CORE
dst_bstat->forceidle_sum += src_bstat->forceidle_sum;
+ dst_bstat->sibling_idle_sum += src_bstat->sibling_idle_sum;
#endif
}
@@ -337,6 +338,7 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
dst_bstat->cputime.sum_exec_runtime -= src_bstat->cputime.sum_exec_runtime;
#ifdef CONFIG_SCHED_CORE
dst_bstat->forceidle_sum -= src_bstat->forceidle_sum;
+ dst_bstat->sibling_idle_sum -= src_bstat->sibling_idle_sum;
#endif
}
@@ -430,6 +432,9 @@ void __cgroup_account_cputime_field(struct cgroup *cgrp,
case CPUTIME_FORCEIDLE:
rstatc->bstat.forceidle_sum += delta_exec;
break;
+ case CPUTIME_SIBLING_IDLE:
+ rstatc->bstat.sibling_idle_sum += delta_exec;
+ break;
#endif
default:
break;
@@ -472,6 +477,7 @@ static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
#ifdef CONFIG_SCHED_CORE
bstat->forceidle_sum += cpustat[CPUTIME_FORCEIDLE];
+ bstat->sibling_idle_sum += cpustat[CPUTIME_SIBLING_IDLE];
#endif
}
}
@@ -483,6 +489,7 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
struct cgroup_base_stat bstat;
#ifdef CONFIG_SCHED_CORE
u64 forceidle_time;
+ u64 sibling_idle_time;
#endif
if (cgroup_parent(cgrp)) {
@@ -492,6 +499,7 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
&utime, &stime);
#ifdef CONFIG_SCHED_CORE
forceidle_time = cgrp->bstat.forceidle_sum;
+ sibling_idle_time = cgrp->bstat.sibling_idle_sum;
#endif
cgroup_rstat_flush_release();
} else {
@@ -501,6 +509,7 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
stime = bstat.cputime.stime;
#ifdef CONFIG_SCHED_CORE
forceidle_time = bstat.forceidle_sum;
+ sibling_idle_time = bstat.sibling_idle_sum;
#endif
}
@@ -509,6 +518,7 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
do_div(stime, NSEC_PER_USEC);
#ifdef CONFIG_SCHED_CORE
do_div(forceidle_time, NSEC_PER_USEC);
+ do_div(sibling_idle_time, NSEC_PER_USEC);
#endif
seq_printf(seq, "usage_usec %llu\n"
@@ -518,6 +528,7 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
#ifdef CONFIG_SCHED_CORE
seq_printf(seq, "core_sched.force_idle_usec %llu\n", forceidle_time);
+ seq_printf(seq, "core_sched.sibling_idle_usec %llu\n", sibling_idle_time);
#endif
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d558e43aedcf2..73999633f9059 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -369,7 +369,7 @@ static void __sched_core_flip(bool enabled)
for_each_cpu(t, smt_mask)
cpu_rq(t)->core_enabled = enabled;
- cpu_rq(cpu)->core->core_forceidle_start = 0;
+ cpu_rq(cpu)->core->core_sibling_idle_start = 0;
sched_core_unlock(cpu, &flags);
@@ -6124,18 +6124,19 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
/* reset state */
rq->core->core_cookie = 0UL;
- if (rq->core->core_forceidle_count) {
+ if (rq->core->core_sibling_idle_occupation) {
if (!core_clock_updated) {
update_rq_clock(rq->core);
core_clock_updated = true;
}
- sched_core_account_forceidle(rq);
- /* reset after accounting force idle */
- rq->core->core_forceidle_start = 0;
+ sched_core_account_idle(rq);
+ rq->core->core_sibling_idle_start = 0;
+ rq->core->core_sibling_idle_occupation = 0;
+ if (rq->core->core_forceidle_count) {
+ need_sync = true;
+ fi_before = true;
+ }
rq->core->core_forceidle_count = 0;
- rq->core->core_forceidle_occupation = 0;
- need_sync = true;
- fi_before = true;
}
/*
@@ -6221,9 +6222,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
}
}
- if (schedstat_enabled() && rq->core->core_forceidle_count) {
- rq->core->core_forceidle_start = rq_clock(rq->core);
- rq->core->core_forceidle_occupation = occ;
+ if (schedstat_enabled() && occ < cpumask_weight(smt_mask)) {
+ rq->core->core_sibling_idle_start = rq_clock(rq->core);
+ rq->core->core_sibling_idle_occupation = occ;
}
rq->core->core_pick_seq = rq->core->core_task_seq;
@@ -6480,14 +6481,14 @@ static void sched_core_cpu_deactivate(unsigned int cpu)
core_rq->core_cookie = rq->core_cookie;
core_rq->core_forceidle_count = rq->core_forceidle_count;
core_rq->core_forceidle_seq = rq->core_forceidle_seq;
- core_rq->core_forceidle_occupation = rq->core_forceidle_occupation;
/*
- * Accounting edge for forced idle is handled in pick_next_task().
+ * Accounting edge for sibling idle is handled in pick_next_task().
* Don't need another one here, since the hotplug thread shouldn't
* have a cookie.
*/
- core_rq->core_forceidle_start = 0;
+ core_rq->core_sibling_idle_occupation = rq->core_sibling_idle_occupation;
+ core_rq->core_sibling_idle_start = 0;
/* install new leader */
for_each_cpu(t, smt_mask) {
@@ -10071,8 +10072,8 @@ void __init sched_init(void)
rq->core_enabled = 0;
rq->core_tree = RB_ROOT;
rq->core_forceidle_count = 0;
- rq->core_forceidle_occupation = 0;
- rq->core_forceidle_start = 0;
+ rq->core_sibling_idle_occupation = 0;
+ rq->core_sibling_idle_start = 0;
rq->core_cookie = 0UL;
#endif
diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
index a57fd8f27498f..f9aa119b52afd 100644
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -237,38 +237,59 @@ int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
#ifdef CONFIG_SCHEDSTATS
/* REQUIRES: rq->core's clock recently updated. */
-void __sched_core_account_forceidle(struct rq *rq)
+/*
+ * Account core scheduling idle cost. Both forceidle (idle sibling has
+ * waiting tasks) and sibling_idle (any idle sibling) are derived from
+ * the same time delta and scaled by their respective idle counts.
+ * A single loop charges both metrics to each running cookied task.
+ */
+void __sched_core_account_idle(struct rq *rq)
{
const struct cpumask *smt_mask = cpu_smt_mask(cpu_of(rq));
+ unsigned int occ = rq->core->core_sibling_idle_occupation;
+ unsigned int fi_count = rq->core->core_forceidle_count;
+ unsigned int smt_width, idle_count;
u64 delta, now = rq_clock(rq->core);
+ u64 fi_delta = 0, si_delta = 0;
struct rq *rq_i;
struct task_struct *p;
int i;
lockdep_assert_rq_held(rq);
- WARN_ON_ONCE(!rq->core->core_forceidle_count);
-
- if (rq->core->core_forceidle_start == 0)
+ if (rq->core->core_sibling_idle_start == 0)
return;
- delta = now - rq->core->core_forceidle_start;
+ delta = now - rq->core->core_sibling_idle_start;
if (unlikely((s64)delta <= 0))
return;
- rq->core->core_forceidle_start = now;
+ if (WARN_ON_ONCE(!occ))
+ return;
- if (WARN_ON_ONCE(!rq->core->core_forceidle_occupation)) {
- /* can't be forced idle without a running task */
- } else if (rq->core->core_forceidle_count > 1 ||
- rq->core->core_forceidle_occupation > 1) {
- /*
- * For larger SMT configurations, we need to scale the charged
- * forced idle amount since there can be more than one forced
- * idle sibling and more than one running cookied task.
- */
- delta *= rq->core->core_forceidle_count;
- delta = div_u64(delta, rq->core->core_forceidle_occupation);
+ smt_width = cpumask_weight(smt_mask);
+ idle_count = smt_width - occ;
+ if (!idle_count)
+ return;
+
+ rq->core->core_sibling_idle_start = now;
+
+ /*
+ * For SMT-2 with one idle sibling (the common case), both
+ * idle_count and occ are 1, so si_delta == fi_delta == delta
+ * with no division needed. For larger SMT configurations, we
+ * scale by the respective idle count / occupation since there
+ * can be more than one idle sibling and more than one running
+ * cookied task.
+ */
+ si_delta = delta;
+ if (idle_count > 1 || occ > 1)
+ si_delta = div_u64(delta * idle_count, occ);
+
+ if (fi_count) {
+ fi_delta = delta;
+ if (fi_count > 1 || occ > 1)
+ fi_delta = div_u64(delta * fi_count, occ);
}
for_each_cpu(i, smt_mask) {
@@ -279,22 +300,24 @@ void __sched_core_account_forceidle(struct rq *rq)
continue;
/*
- * Note: this will account forceidle to the current cpu, even
- * if it comes from our SMT sibling.
+ * Note: this will account idle time to the current cpu,
+ * even if it comes from our SMT sibling.
*/
- __account_forceidle_time(p, delta);
+ __account_sibling_idle_time(p, si_delta);
+ if (fi_delta)
+ __account_forceidle_time(p, fi_delta);
}
}
void __sched_core_tick(struct rq *rq)
{
- if (!rq->core->core_forceidle_count)
+ if (!rq->core->core_sibling_idle_occupation)
return;
if (rq != rq->core)
update_rq_clock(rq->core);
- __sched_core_account_forceidle(rq);
+ __sched_core_account_idle(rq);
}
#endif /* CONFIG_SCHEDSTATS */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index b453f8a6a7c76..2a3500323c3c4 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -243,6 +243,18 @@ void __account_forceidle_time(struct task_struct *p, u64 delta)
task_group_account_field(p, CPUTIME_FORCEIDLE, delta);
}
+
+/*
+ * Account for sibling idle time due to core scheduling.
+ *
+ * REQUIRES: schedstat is enabled.
+ */
+void __account_sibling_idle_time(struct task_struct *p, u64 delta)
+{
+ __schedstat_add(p->stats.core_sibling_idle_sum, delta);
+
+ task_group_account_field(p, CPUTIME_SIBLING_IDLE, delta);
+}
#endif
/*
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 115e266db76bf..2c3bf256308dc 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1059,6 +1059,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
#ifdef CONFIG_SCHED_CORE
PN_SCHEDSTAT(core_forceidle_sum);
+ PN_SCHEDSTAT(core_sibling_idle_sum);
#endif
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 65ff0254659ac..c52effdb2e172 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1156,8 +1156,13 @@ struct rq {
unsigned long core_cookie;
unsigned int core_forceidle_count;
unsigned int core_forceidle_seq;
- unsigned int core_forceidle_occupation;
- u64 core_forceidle_start;
+ /*
+ * Shared start timestamp and occupation for both forceidle and
+ * sibling_idle accounting. Set whenever occupation < SMT width
+ * (any sibling is idle), not just when core_forceidle_count > 0.
+ */
+ unsigned int core_sibling_idle_occupation;
+ u64 core_sibling_idle_start;
#endif
/* Scratch cpumask to be temporarily used under rq_lock */
@@ -1966,12 +1971,12 @@ static inline const struct cpumask *task_user_cpus(struct task_struct *p)
#if defined(CONFIG_SCHED_CORE) && defined(CONFIG_SCHEDSTATS)
-extern void __sched_core_account_forceidle(struct rq *rq);
+extern void __sched_core_account_idle(struct rq *rq);
-static inline void sched_core_account_forceidle(struct rq *rq)
+static inline void sched_core_account_idle(struct rq *rq)
{
if (schedstat_enabled())
- __sched_core_account_forceidle(rq);
+ __sched_core_account_idle(rq);
}
extern void __sched_core_tick(struct rq *rq);
@@ -1984,7 +1989,7 @@ static inline void sched_core_tick(struct rq *rq)
#else
-static inline void sched_core_account_forceidle(struct rq *rq) {}
+static inline void sched_core_account_idle(struct rq *rq) {}
static inline void sched_core_tick(struct rq *rq) {}
--
2.47.3
* [PATCH 0/2] cgroup/dmem: add per-region event counters
From: Hongfu Li @ 2026-06-24 3:11 UTC
To: tj, hannes, mkoutny, corbet, skhan, dev, mripard, natalie.vock
Cc: cgroups, linux-doc, linux-kernel, dri-devel, Hongfu Li
This patch series adds event counters to the device memory (dmem) cgroup
controller.
The dmem controller exposes per-region limits and current usage, but
not how often those limits are hit. It is hard to tell whether failures
come from this cgroup, a parent limit, or pressure elsewhere in the
hierarchy.
To provide that visibility, this series introduces:
- dmem.events: reports hierarchical low/max counts per region.
- dmem.events.local: reports per-region low/max counts for this cgroup only.
Patch overview:
Patch 1/2:
- Add dmem.events with hierarchical low/max counters per region.
- Record dmem.max allocation failures and dmem.low protection events.
- Document the interface in cgroup-v2.rst.
Patch 2/2:
- Add dmem.events.local for local-only per-region counts.
- Share the events show logic between both files.
- Update cgroup-v2.rst accordingly.
Example output (dmem.events):
drm/0000:03:00.0/vram0 low 0 max 3
drm/0000:03:00.0/stolen low 0 max 0
low - reclaim/eviction considered the cgroup below its effective
dmem.low protection
max - allocation failed because the cgroup or an ancestor hit dmem.max
Both files exist for all non-root cgroups, like dmem.max and dmem.current.
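A monitoring tool could read the format above with a few lines of C.
This is a minimal sketch, assuming region names contain no whitespace
and stay under 128 bytes; struct toy_dmem_events and the toy_* helpers
are hypothetical names, not part of the series.

```c
#include <stdio.h>
#include <string.h>

/* One parsed dmem.events line (hypothetical type). */
struct toy_dmem_events {
	char region[128];
	long low;
	long max;
};

/* Parse "<region> low <n> max <n>". Returns 1 on success, 0 otherwise. */
static inline int toy_parse_dmem_events_line(const char *line,
					     struct toy_dmem_events *ev)
{
	return sscanf(line, "%127s low %ld max %ld",
		      ev->region, &ev->low, &ev->max) == 3;
}

/* True if the line describes the given region and max was hit at least once. */
static inline int toy_region_hit_max(const char *line, const char *region)
{
	struct toy_dmem_events ev;

	if (!toy_parse_dmem_events_line(line, &ev))
		return 0;
	return strcmp(ev.region, region) == 0 && ev.max > 0;
}
```

Applied to the example output, the vram0 line parses to low 0 and
max 3, so toy_region_hit_max() is true for vram0 and false for the
stolen region.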
These patches have been tested locally.
Hongfu Li (2):
cgroup/dmem: add per-region event counters
cgroup/dmem: introduce dmem.events.local for local counts
Documentation/admin-guide/cgroup-v2.rst | 17 +++++
kernel/cgroup/dmem.c | 85 ++++++++++++++++++++++++-
2 files changed, 101 insertions(+), 1 deletion(-)
--
2.25.1
* [PATCH 1/2] cgroup/dmem: add per-region event counters
From: Hongfu Li @ 2026-06-24 3:11 UTC
To: tj, hannes, mkoutny, corbet, skhan, dev, mripard, natalie.vock
Cc: cgroups, linux-doc, linux-kernel, dri-devel, Hongfu Li
In-Reply-To: <20260624031107.667253-1-lihongfu@kylinos.cn>
Add dmem.events to report hierarchical low/max event counts per DMEM
region. Increment counters on dmem.max allocation failures and
dmem.low protection events. The file is available for non-root cgroups
only.
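The hierarchical counting can be modeled in userspace as a walk up the
parent chain. This is a simplified sketch: struct toy_pool and
toy_memory_event() are hypothetical names, plain long counters replace
atomic_long_t, and the file notification step is omitted.

```c
#include <stddef.h>

enum toy_event {
	TOY_LOW,
	TOY_MAX,
	TOY_NR_EVENTS,
};

/* Hypothetical stand-in for a per-cgroup, per-region pool. */
struct toy_pool {
	struct toy_pool *parent;     /* NULL at the top of the hierarchy */
	long events[TOY_NR_EVENTS];  /* non-atomic in this sketch */
};

/* Bump the counter in the given pool and in every ancestor. */
static inline void toy_memory_event(struct toy_pool *pool, enum toy_event event)
{
	for (; pool; pool = pool->parent)
		pool->events[event]++;
}
```

Note that the walk starts at whichever pool is passed in. In the diff
below, the max event is raised on the pool whose counter rejected the
charge, so in this model a pool beneath that one keeps its own counter
unchanged.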
Signed-off-by: Hongfu Li <lihongfu@kylinos.cn>
---
Documentation/admin-guide/cgroup-v2.rst | 16 +++++++
kernel/cgroup/dmem.c | 61 ++++++++++++++++++++++++-
2 files changed, 76 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 993446ab66d0..afc924539a41 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2881,6 +2881,22 @@ DMEM Interface Files
drm/0000:03:00.0/vram0 12550144
drm/0000:03:00.0/stolen 8650752
+ dmem.events
+ A read-only file that reports the number of times each cgroup
+ has hit its configured memory limits. The format lists each
+ region on a single line, followed by the event counters::
+
+ drm/0000:03:00.0/vram0 low 0 max 3
+ drm/0000:03:00.0/stolen low 0 max 0
+
+ ``low`` counts how many times reclaim or eviction considered
+ the cgroup to be below its effective ``dmem.low`` protection.
+ ``max`` counts how many times an allocation failed because the
+ cgroup or one of its ancestors hit ``dmem.max``.
+
+ ``dmem.events`` contains hierarchical counts. This file exists
+ for all cgroups except root.
+
HugeTLB
-------
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 4753a67d0f0f..79d4c5d0a046 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -8,6 +8,7 @@
* Copyright (C) 2016 Parav Pandit <pandit.parav@gmail.com>
*/
+#include <linux/atomic.h>
#include <linux/cgroup.h>
#include <linux/cgroup_dmem.h>
#include <linux/list.h>
@@ -57,6 +58,14 @@ struct dmemcg_state {
struct cgroup_subsys_state css;
struct list_head pools;
+
+ struct cgroup_file events_file;
+};
+
+enum dmemcg_memory_event {
+ DMEMCG_LOW,
+ DMEMCG_MAX,
+ DMEMCG_NR_EVENTS,
};
struct dmem_cgroup_pool_state {
@@ -74,6 +83,8 @@ struct dmem_cgroup_pool_state {
struct page_counter cnt;
struct dmem_cgroup_pool_state *parent;
+ atomic_long_t events[DMEMCG_NR_EVENTS];
+
refcount_t ref;
bool inited;
};
@@ -182,6 +193,24 @@ static u64 get_resource_current(struct dmem_cgroup_pool_state *pool)
return pool ? page_counter_read(&pool->cnt) : 0;
}
+static void dmemcg_memory_event(struct dmem_cgroup_pool_state *pool,
+ enum dmemcg_memory_event event)
+{
+ for (; pool; pool = pool->parent) {
+ atomic_long_inc(&pool->events[event]);
+ cgroup_file_notify(&pool->cs->events_file);
+ }
+}
+
+static long dmemcg_get_event(struct dmem_cgroup_pool_state *pool,
+ enum dmemcg_memory_event event)
+{
+ if (!pool)
+ return 0;
+
+ return atomic_long_read(&pool->events[event]);
+}
+
static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
{
set_resource_min(rpool, 0);
@@ -345,6 +374,7 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
return true;
*ret_hit_low = true;
+ dmemcg_memory_event(test_pool, DMEMCG_LOW);
return false;
}
return true;
@@ -675,8 +705,12 @@ int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
}
if (!page_counter_try_charge(&pool->cnt, size, &fail)) {
+ struct dmem_cgroup_pool_state *limit_pool;
+
+ limit_pool = container_of(fail, struct dmem_cgroup_pool_state, cnt);
+ dmemcg_memory_event(limit_pool, DMEMCG_MAX);
if (ret_limit_pool) {
- *ret_limit_pool = container_of(fail, struct dmem_cgroup_pool_state, cnt);
+ *ret_limit_pool = limit_pool;
css_get(&(*ret_limit_pool)->cs->css);
dmemcg_pool_get(*ret_limit_pool);
}
@@ -840,6 +874,25 @@ static int dmem_cgroup_region_max_show(struct seq_file *sf, void *v)
return dmemcg_limit_show(sf, v, get_resource_max);
}
+static int dmem_cgroup_region_events_show(struct seq_file *sf, void *v)
+{
+ struct dmemcg_state *dmemcs = css_to_dmemcs(seq_css(sf));
+ struct dmem_cgroup_region *region;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(region, &dmem_cgroup_regions, region_node) {
+ struct dmem_cgroup_pool_state *pool = find_cg_pool_locked(dmemcs, region);
+
+ seq_puts(sf, region->name);
+ seq_printf(sf, " low %ld max %ld\n",
+ dmemcg_get_event(pool, DMEMCG_LOW),
+ dmemcg_get_event(pool, DMEMCG_MAX));
+ }
+ rcu_read_unlock();
+
+ return 0;
+}
+
static ssize_t dmem_cgroup_region_max_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
@@ -874,6 +927,12 @@ static struct cftype files[] = {
.seq_show = dmem_cgroup_region_max_show,
.flags = CFTYPE_NOT_ON_ROOT,
},
+ {
+ .name = "events",
+ .seq_show = dmem_cgroup_region_events_show,
+ .file_offset = offsetof(struct dmemcg_state, events_file),
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
{ } /* Zero entry terminates. */
};
--
2.25.1
^ permalink raw reply related
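For anyone consuming the new file from userspace: dmem_cgroup_region_events_show() in the patch above prints one line per region in the form "<region> low <n> max <n>". A minimal parsing sketch follows. The names `dmem_event_line` and `parse_dmem_event_line` are hypothetical and not part of the patch, and the sketch assumes region names contain no whitespace, as in the examples in cgroup-v2.rst.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for one parsed line of dmem.events. */
struct dmem_event_line {
	char region[128];
	long low;
	long max;
};

/*
 * Parse "<region> low <n> max <n>".
 * Returns 0 on success, -1 if the line does not match that shape.
 * Assumption: the region name contains no whitespace.
 */
static int parse_dmem_event_line(const char *line, struct dmem_event_line *out)
{
	assert(line && out);

	/* Start from a zeroed result so a failed parse leaves no stale data. */
	memset(out, 0, sizeof(*out));

	if (sscanf(line, "%127s low %ld max %ld",
		   out->region, &out->low, &out->max) != 3)
		return -1;
	return 0;
}
```

A caller would typically read the file line by line (for example with fgets()) and pass each line to the parser, skipping lines that return -1.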
* [PATCH 2/2] cgroup/dmem: introduce dmem.events.local for local counts
From: Hongfu Li @ 2026-06-24 3:11 UTC (permalink / raw)
To: tj, hannes, mkoutny, corbet, skhan, dev, mripard, natalie.vock
Cc: cgroups, linux-doc, linux-kernel, dri-devel, Hongfu Li
In-Reply-To: <20260624031107.667253-1-lihongfu@kylinos.cn>
Add dmem.events.local for local-only low/max event counts per DMEM
region. Refactor the shared events show logic used by dmem.events.
Signed-off-by: Hongfu Li <lihongfu@kylinos.cn>
---
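To illustrate the difference between the two counter sets, here is a simplified userspace sketch of the accounting done in dmemcg_memory_event(). It is only a model: `struct pool` and `pool_event` are hypothetical stand-ins, plain longs replace atomic_long_t, the cgroup_file_notify() calls are left out, and the NULL check is an addition of this sketch.

```c
#include <assert.h>
#include <stddef.h>

enum event { EV_LOW, EV_MAX, EV_NR };

/* Simplified stand-in for struct dmem_cgroup_pool_state. */
struct pool {
	struct pool *parent;
	long events[EV_NR];       /* hierarchical: this pool plus descendants */
	long events_local[EV_NR]; /* local: only events raised on this pool */
};

/* Record one event: bump the local counter once, then every ancestor. */
static void pool_event(struct pool *pool, enum event ev)
{
	assert(ev < EV_NR);

	if (!pool)
		return;

	pool->events_local[ev]++;
	for (; pool; pool = pool->parent)
		pool->events[ev]++;
}
```

With a root and one child, an event on the child raises the child's local and hierarchical counters and the root's hierarchical counter, while the root's local counter stays untouched.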
Documentation/admin-guide/cgroup-v2.rst | 5 ++--
kernel/cgroup/dmem.c | 32 +++++++++++++++++++++----
2 files changed, 31 insertions(+), 6 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index afc924539a41..5e4dbe4a75c6 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2881,7 +2881,7 @@ DMEM Interface Files
drm/0000:03:00.0/vram0 12550144
drm/0000:03:00.0/stolen 8650752
- dmem.events
+ dmem.events, dmem.events.local
A read-only file that reports the number of times each cgroup
has hit its configured memory limits. The format lists each
region on a single line, followed by the event counters::
@@ -2894,7 +2894,8 @@ DMEM Interface Files
``max`` counts how many times an allocation failed because the
cgroup or one of its ancestors hit ``dmem.max``.
- ``dmem.events`` contains hierarchical counts. This file exists
+ ``dmem.events`` contains hierarchical counts. ``dmem.events.local``
+ contains counts for only the cgroup itself. These files exist
for all cgroups except root.
HugeTLB
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 79d4c5d0a046..29f8719561e6 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -60,6 +60,7 @@ struct dmemcg_state {
struct list_head pools;
struct cgroup_file events_file;
+ struct cgroup_file events_local_file;
};
enum dmemcg_memory_event {
@@ -84,6 +85,7 @@ struct dmem_cgroup_pool_state {
struct dmem_cgroup_pool_state *parent;
atomic_long_t events[DMEMCG_NR_EVENTS];
+ atomic_long_t events_local[DMEMCG_NR_EVENTS];
refcount_t ref;
bool inited;
@@ -196,6 +198,9 @@ static u64 get_resource_current(struct dmem_cgroup_pool_state *pool)
static void dmemcg_memory_event(struct dmem_cgroup_pool_state *pool,
enum dmemcg_memory_event event)
{
+ atomic_long_inc(&pool->events_local[event]);
+ cgroup_file_notify(&pool->cs->events_local_file);
+
for (; pool; pool = pool->parent) {
atomic_long_inc(&pool->events[event]);
cgroup_file_notify(&pool->cs->events_file);
@@ -203,11 +208,14 @@ static void dmemcg_memory_event(struct dmem_cgroup_pool_state *pool,
}
static long dmemcg_get_event(struct dmem_cgroup_pool_state *pool,
- enum dmemcg_memory_event event)
+ enum dmemcg_memory_event event, bool local)
{
if (!pool)
return 0;
+ if (local)
+ return atomic_long_read(&pool->events_local[event]);
+
return atomic_long_read(&pool->events[event]);
}
@@ -874,7 +882,7 @@ static int dmem_cgroup_region_max_show(struct seq_file *sf, void *v)
return dmemcg_limit_show(sf, v, get_resource_max);
}
-static int dmem_cgroup_region_events_show(struct seq_file *sf, void *v)
+static int dmemcg_events_show(struct seq_file *sf, void *v, bool local)
{
struct dmemcg_state *dmemcs = css_to_dmemcs(seq_css(sf));
struct dmem_cgroup_region *region;
@@ -885,14 +893,24 @@ static int dmem_cgroup_region_events_show(struct seq_file *sf, void *v)
seq_puts(sf, region->name);
seq_printf(sf, " low %ld max %ld\n",
- dmemcg_get_event(pool, DMEMCG_LOW),
- dmemcg_get_event(pool, DMEMCG_MAX));
+ dmemcg_get_event(pool, DMEMCG_LOW, local),
+ dmemcg_get_event(pool, DMEMCG_MAX, local));
}
rcu_read_unlock();
return 0;
}
+static int dmem_cgroup_region_events_show(struct seq_file *sf, void *v)
+{
+ return dmemcg_events_show(sf, v, false);
+}
+
+static int dmem_cgroup_region_events_local_show(struct seq_file *sf, void *v)
+{
+ return dmemcg_events_show(sf, v, true);
+}
+
static ssize_t dmem_cgroup_region_max_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
@@ -933,6 +951,12 @@ static struct cftype files[] = {
.file_offset = offsetof(struct dmemcg_state, events_file),
.flags = CFTYPE_NOT_ON_ROOT,
},
+ {
+ .name = "events.local",
+ .seq_show = dmem_cgroup_region_events_local_show,
+ .file_offset = offsetof(struct dmemcg_state, events_local_file),
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
{ } /* Zero entry terminates. */
};
--
2.25.1
^ permalink raw reply related
* Re: [PATCH] mm: memcg: remove stray text from obj_stock_pcp comment
From: Harry Yoo @ 2026-06-24 4:32 UTC (permalink / raw)
To: Guopeng Zhang, Guopeng Zhang, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton
Cc: cgroups, linux-mm, linux-kernel
In-Reply-To: <5be54565-00be-4c05-91ca-0825fa925167@kylinos.cn>
On 6/23/26 6:02 PM, Guopeng Zhang wrote:
> On 2026/6/23 16:42, Harry Yoo wrote:
>>
>> On 6/23/26 5:26 PM, Guopeng Zhang wrote:
>>> From: Guopeng Zhang <zhangguopeng@kylinos.cn>
>>>
>>> A patch filename was accidentally inserted into the comment describing
>>> the nr_bytes field of struct obj_stock_pcp. Remove it.
>> nit: perhaps add something like
>> "Fix a typo in the comment (target -> targets)"?
> Hi Harry,
Hi Guopeng,
> Thanks for the review and the Ack.
>
> Yes, I also fixed the "target -> targets" typo, but missed mentioning it
> in the commit message.
No worries :)
It's not a big deal, just wanted to mention.
> I'll be more careful about describing all changes
> clearly next time.
Thanks!
> If a respin is needed, I'll add it to the commit
> message and carry your Acked-by.
I guess it's okay to not respin for this.
Thanks!
> Thanks,
> Guopeng
>
>>> No functional change.
>>>
>>> Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
>>> ---
>> FWIW,
>> Acked-by: Harry Yoo (Oracle) <harry@kernel.org>
>>
>> Thanks!
>>
>>> mm/memcontrol.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 6dc4888a90f3..3eedfc4e84a0 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -2039,7 +2039,7 @@ struct obj_stock_pcp {
>>> /*
>>> * On rare archs with 256KiB base page size (hexagon and powerpc 44x)
>>> * keep nr_bytes to unsigned int as uint16_t cannot represent the full
>>> -e patches/memcg-uint16_t-for-nr_bytes-in-obj_stock_pcp.patch * sub-page remainder. Such archs are not cacheline optimization target.
>>> + * sub-page remainder. Such archs are not cacheline optimization targets.
>>> */
>>> unsigned int nr_bytes[NR_OBJ_STOCK];
>>> #else
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply
* Re: [PATCH v2 1/2] cgroup/cpuset: Avoid unnecessary cpus & mems update in cpuset_hotplug_update_tasks()
From: Ridong Chen @ 2026-06-24 5:51 UTC (permalink / raw)
To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
Jonathan Corbet, Shuah Khan
Cc: cgroups, linux-kernel, linux-doc, linux-kselftest
In-Reply-To: <20260623230413.1984188-2-longman@redhat.com>
On 6/24/2026 7:04 AM, Waiman Long wrote:
> As reported by sashiko [1], cpuset_hotplug_update_tasks() may perform
> unnecessary task iteration and updating of tasks' CPU and node masks
> when mems_allowed and/or cpus_allowed are not set in cpuset v2. It is
> due to the fact that the temporary new_cpus and new_mems masks do not
> inherit parent's effective_cpus/mems when they are empty which is the
> expected behavior for cpuset v2 since commit 4ec22e9c5a90 ("cpuset:
> Enable cpuset controller in default hierarchy").
>
> Fix that and avoid unnecessary work by enhancing
> compute_effective_cpumask() to add the empty cpumask check
> and inheriting the parent's versions if empty when in v2. A new
> compute_effective_nodemask() helper is also added to perform similar
> function for new effective_mems.
>
> Add new test_cpuset_prs.sh test cases to confirm that effective_cpus
> will inherit the parent's version if cpuset.cpus is empty.
>
> [1] https://sashiko.dev/#/patchset/20260621032816.1806773-1-longman%40redhat.com
>
> Suggested-by: Ridong Chen <ridong.chen@linux.dev>
> Fixes: 4ec22e9c5a90 ("cpuset: Enable cpuset controller in default hierarchy")
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> kernel/cgroup/cpuset.c | 45 +++++++++++--------
> .../selftests/cgroup/test_cpuset_prs.sh | 11 ++++-
> 2 files changed, 35 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index aff86acea701..044ddbf66f8e 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1094,12 +1094,35 @@ void cpuset_update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
> * @cs: the cpuset the need to recompute the new effective_cpus mask
> * @parent: the parent cpuset
> *
> + * For v2, the parent's effective_cpus is inherited if cpumask empty.
> * The result is valid only if the given cpuset isn't a partition root.
> */
> static void compute_effective_cpumask(struct cpumask *new_cpus,
> struct cpuset *cs, struct cpuset *parent)
> {
> - cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus);
> + bool has_cpus;
> +
> + has_cpus = cpumask_and(new_cpus, cs->cpus_allowed, parent->effective_cpus);
> + if (!has_cpus && is_in_v2_mode())
> + cpumask_copy(new_cpus, parent->effective_cpus);
> +}
> +
> +/**
> + * compute_effective_nodemask - Compute the effective nodemask of the cpuset
> + * @new_mems: the temp variable for the new effective_mems mask
> + * @cs: the cpuset the need to recompute the new effective_mems mask
> + * @parent: the parent cpuset
> + *
> + * For v2, the parent's effective_mems is inherited if nodemask empty.
> + */
> +static void compute_effective_nodemask(nodemask_t *new_mems,
> + struct cpuset *cs, struct cpuset *parent)
> +{
> + bool has_mems;
> +
> + has_mems = nodes_and(*new_mems, cs->mems_allowed, parent->effective_mems);
> + if (!has_mems && is_in_v2_mode())
> + nodes_copy(*new_mems, parent->effective_mems);
> }
>
> /*
> @@ -2148,15 +2171,6 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
> goto update_parent_effective;
> }
>
> - /*
> - * If it becomes empty, inherit the effective mask of the
> - * parent, which is guaranteed to have some CPUs unless
> - * it is a partition root that has explicitly distributed
> - * out all its CPUs.
> - */
> - if (is_in_v2_mode() && !remote && cpumask_empty(tmp->new_cpus))
> - cpumask_copy(tmp->new_cpus, parent->effective_cpus);
> -
> /*
> * Skip the whole subtree if
> * 1) the cpumask remains the same,
> @@ -2704,14 +2718,7 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
> cpuset_for_each_descendant_pre(cp, pos_css, cs) {
> struct cpuset *parent = parent_cs(cp);
>
> - bool has_mems = nodes_and(*new_mems, cp->mems_allowed, parent->effective_mems);
> -
> - /*
> - * If it becomes empty, inherit the effective mask of the
> - * parent, which is guaranteed to have some MEMs.
> - */
> - if (is_in_v2_mode() && !has_mems)
> - *new_mems = parent->effective_mems;
> + compute_effective_nodemask(new_mems, cp, parent);
>
> /* Skip the whole subtree if the nodemask remains the same. */
> if (nodes_equal(*new_mems, cp->effective_mems)) {
> @@ -3923,7 +3930,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
>
> parent = parent_cs(cs);
> compute_effective_cpumask(&new_cpus, cs, parent);
> - nodes_and(new_mems, cs->mems_allowed, parent->effective_mems);
> + compute_effective_nodemask(&new_mems, cs, parent);
>
> if (!tmp || !cs->partition_root_state)
> goto update_tasks;
> diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> index 0d41aa0d343d..ca9bc38fdb95 100755
> --- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
> @@ -495,13 +495,20 @@ REMOTE_TEST_MATRIX=(
> # Narrowing cpuset.cpus to previously sibling-excluded CPUs should
> # not return CPUs that were never actually owned.
> " C1-4:P1 . C1-2:P1 C1-3:P2 . . \
> - . . . C3 . . p1:4|c11:1-2|c12:3 \
> + . . . C3 . . p1:4|c11:1-2|c12:3 \
> p1:P1|c11:P1|c12:P2 3"
> # Expanding cpuset.cpus to include a previously sibling-excluded CPU
> # after the sibling has become a member should correctly request it.
> " C1-4:P1 . C1-2:P1 C1-3:P2 . . \
> - . . P0 C2-3 . . p1:1,4|c11:1|c12:2-3 \
> + . . P0 C2-3 . . p1:1,4|c11:1|c12:2-3 \
> p1:P1|c11:P0|c12:P2 2-3"
> + # Cpusets with empty cpuset.cpus should inherit parent's effective_cpus
> + " C1-4:P1 C5-6 C1-2 . C5 . \
> + . P1 P1 . . . p1:3-4|p2:5-6|c11:1-2|c12:3-4|c21:5|c22:5-6 \
> + p1:P1|p2:P1|c11:P1"
> + " C1-4:P1 C5-6 C1-2 . C5 . \
> + . P1 P1 . O5=0 . p1:3-4|p2:6|c11:1-2|c12:3-4|c21:6|c22:6 \
> + p1:P1|p2:P1|c11:P1"
> )
>
> #
LGTM.
Reviewed-by: Ridong Chen <ridong.chen@linux.dev>
--
Best regards
Ridong
^ permalink raw reply
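The inheritance rule that the quoted patch moves into compute_effective_cpumask() and compute_effective_nodemask() can be modelled in a few lines. This is a sketch under stated assumptions: a 64-bit word stands in for struct cpumask or nodemask_t, the `v2_mode` flag stands in for is_in_v2_mode(), and the names `mask_t` and `compute_effective` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per CPU (or node); a stand-in for struct cpumask / nodemask_t. */
typedef uint64_t mask_t;

/*
 * new_mask = allowed & parent_effective.
 * In v2 mode an empty result inherits the parent's effective mask.
 */
static void compute_effective(mask_t *new_mask, mask_t allowed,
			      mask_t parent_effective, bool v2_mode)
{
	assert(new_mask);

	*new_mask = allowed & parent_effective;
	if (*new_mask == 0 && v2_mode)
		*new_mask = parent_effective;
}
```

Note that the fallback keys off the result of the AND, not off `allowed` itself, so a non-empty cpuset.cpus that shares no CPU with the parent also inherits in v2 mode. This matches the use of the cpumask_and() and nodes_and() return values in the patch.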
* Re: [PATCH] block, bfq: protect async queue reset with blkcg locks
From: yu kuai @ 2026-06-24 6:28 UTC (permalink / raw)
To: Cen Zhang, Tejun Heo, Josef Bacik, Jens Axboe, Arianna Avanzini,
Paolo Valente
Cc: linux-block, cgroups, linux-kernel, baijiaju1990, yukuai
In-Reply-To: <20260621135930.2657810-1-zzzccc427@gmail.com>
Hi,
On 2026/6/21 21:59, Cen Zhang wrote:
> Writing 0 to BFQ's low_latency attribute ends weight raising for active,
> idle and async queues. The async cgroup path walks q->blkg_list, converts
> each blkg to BFQ policy data and then reads bfqg->async_bfqq and
> bfqg->async_idle_bfqq.
>
> That walk was protected only by bfqd->lock. blkcg release work is
> serialized by q->blkcg_mutex and q->queue_lock instead, and
> blkg_free_workfn() can call BFQ's pd_free_fn before it removes
> blkg->q_node from q->blkg_list. A low_latency reset can therefore still
> find the blkg on the queue list after the BFQ policy data has been freed.
>
> The buggy scenario involves two paths, with each column showing the order
> within that path:
>
> BFQ low_latency reset: blkcg blkg release work:
> 1. bfq_low_latency_store() 1. blkg_free_workfn() takes
> calls bfq_end_wr(). q->blkcg_mutex.
> 2. bfq_end_wr_async() walks 2. BFQ pd_free_fn drops the
> q->blkg_list. final bfq_group reference.
> 3. blkg_to_bfqg() returns 3. blkg->q_node remains on
> the stale policy data. q->blkg_list until list_del_init().
> 4. bfq_end_wr_async_queues()
> reads async queue fields.
>
> Fix this by taking q->blkcg_mutex and q->queue_lock around the
> q->blkg_list walk, then taking bfqd->lock before touching BFQ async
> queues. The mutex serializes against policy-data free and queue_lock
> stabilizes the list. Move the async reset out of bfq_end_wr()'s existing
> bfqd->lock critical section so the lock order matches blkcg policy
> callbacks.
>
> Validation reproduced this kernel report:
> BUG: KASAN: slab-use-after-free in bfq_end_wr_async_queues+0x246/0x340
>
> Call Trace:
> <TASK>
> dump_stack_lvl+0x66/0xa0
> print_report+0xce/0x630
> ? bfq_end_wr_async_queues+0x246/0x340
> ? srso_alias_return_thunk+0x5/0xfbef5
> ? __virt_addr_valid+0x20d/0x410
> ? bfq_end_wr_async_queues+0x246/0x340
> kasan_report+0xe0/0x110
> ? bfq_end_wr_async_queues+0x246/0x340
> bfq_end_wr_async_queues+0x246/0x340
> bfq_end_wr_async+0xba/0x180
> bfq_low_latency_store+0x4e5/0x690
> ? 0xffffffffc02150da
> ? __pfx_bfq_low_latency_store+0x10/0x10
> ? __pfx_bfq_low_latency_store+0x10/0x10
> elv_attr_store+0xc4/0x110
> kernfs_fop_write_iter+0x2f5/0x4a0
> vfs_write+0x604/0x11f0
> ? __pfx_locks_remove_posix+0x10/0x10
> ? __pfx_vfs_write+0x10/0x10
> ksys_write+0xf9/0x1d0
> ? __pfx_ksys_write+0x10/0x10
> do_syscall_64+0x115/0x6a0
> entry_SYSCALL_64_after_hwframe+0x77/0x7f
>
> Allocated by task 544:
> kasan_save_stack+0x33/0x60
> kasan_save_track+0x14/0x30
> __kasan_kmalloc+0xaa/0xb0
> bfq_pd_alloc+0xc0/0x1b0
> blkg_alloc+0x346/0x960
> blkg_create+0x8c2/0x10d0
> bio_associate_blkg_from_css+0x9f3/0xfa0
> bio_associate_blkg+0xd9/0x200
> bio_init+0x303/0x640
> __blkdev_direct_IO_simple+0x56b/0x8a0
> blkdev_direct_IO+0x8e7/0x2580
> blkdev_read_iter+0x205/0x400
> vfs_read+0x7b0/0xda0
> ksys_read+0xf9/0x1d0
> do_syscall_64+0x115/0x6a0
> entry_SYSCALL_64_after_hwframe+0x77/0x7f
>
> Freed by task 465:
> kasan_save_stack+0x33/0x60
> kasan_save_track+0x14/0x30
> kasan_save_free_info+0x3b/0x60
> __kasan_slab_free+0x5f/0x80
> kfree+0x307/0x580
> blkg_free_workfn+0xef/0x460
> process_one_work+0x8d0/0x1870
> worker_thread+0x575/0xf80
> kthread+0x2e7/0x3c0
> ret_from_fork+0x576/0x810
> ret_from_fork_asm+0x1a/0x30
>
> Fixes: 44e44a1b329e ("block, bfq: improve responsiveness")
> Assisted-by: Codex:gpt-5.5
> Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
> ---
> block/bfq-cgroup.c | 13 ++++++++++++-
> block/bfq-iosched.c | 3 ++-
> 2 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
> index 0bd0332b3d78..d8fdace464b4 100644
> --- a/block/bfq-cgroup.c
> +++ b/block/bfq-cgroup.c
> @@ -936,14 +936,23 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
>
> void bfq_end_wr_async(struct bfq_data *bfqd)
> {
> + struct request_queue *q = bfqd->queue;
> struct blkcg_gq *blkg;
>
> - list_for_each_entry(blkg, &bfqd->queue->blkg_list, q_node) {
> + mutex_lock(&q->blkcg_mutex);
> + spin_lock_irq(&q->queue_lock);
> + spin_lock(&bfqd->lock);
I just noticed this patch. The same problem is already fixed by another patchset
that I posted, but since Jens has already applied this one, I'll rebase my patchset.
BTW, I'm also trying to get rid of queue_lock for blkg protection.
> +
> + list_for_each_entry(blkg, &q->blkg_list, q_node) {
> struct bfq_group *bfqg = blkg_to_bfqg(blkg);
>
> bfq_end_wr_async_queues(bfqd, bfqg);
> }
> bfq_end_wr_async_queues(bfqd, bfqd->root_group);
> +
> + spin_unlock(&bfqd->lock);
> + spin_unlock_irq(&q->queue_lock);
> + mutex_unlock(&q->blkcg_mutex);
> }
>
> static int bfq_io_show_weight_legacy(struct seq_file *sf, void *v)
> @@ -1416,7 +1425,9 @@ void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio) {}
>
> void bfq_end_wr_async(struct bfq_data *bfqd)
> {
> + spin_lock_irq(&bfqd->lock);
> bfq_end_wr_async_queues(bfqd, bfqd->root_group);
> + spin_unlock_irq(&bfqd->lock);
> }
>
> struct bfq_group *bfq_bio_bfqg(struct bfq_data *bfqd, struct bio *bio)
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 141c602d5e85..eec9be62061b 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -2653,9 +2653,10 @@ static void bfq_end_wr(struct bfq_data *bfqd)
> }
> list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list)
> bfq_bfqq_end_wr(bfqq);
> - bfq_end_wr_async(bfqd);
>
> spin_unlock_irq(&bfqd->lock);
> +
> + bfq_end_wr_async(bfqd);
> }
>
> static sector_t bfq_io_struct_pos(void *io_struct, bool request)
--
Thanks,
Kuai
^ permalink raw reply
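The lock-order argument in the quoted commit message can be made concrete with a tiny rank checker. This is only an illustration, not kernel code: the ranks encode the acquisition order used in the patched bfq_end_wr_async() (q->blkcg_mutex, then q->queue_lock, then bfqd->lock), and `lock_rank` and `lock_order_ok` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Ranks follow the acquisition order in the patched bfq_end_wr_async(). */
enum lock_rank {
	RANK_BLKCG_MUTEX = 1, /* q->blkcg_mutex: sleeping lock, taken first */
	RANK_QUEUE_LOCK  = 2, /* q->queue_lock */
	RANK_BFQD_LOCK   = 3, /* bfqd->lock */
};

/* Return true if the locks in seq are acquired in strictly rising rank. */
static bool lock_order_ok(const enum lock_rank *seq, size_t n)
{
	assert(seq || n == 0);

	for (size_t i = 1; i < n; i++)
		if (seq[i] <= seq[i - 1])
			return false;
	return true;
}
```

Under this model, calling bfq_end_wr_async() from inside bfq_end_wr()'s bfqd->lock section would acquire rank 3 before rank 1, which is why the patch moves the call after spin_unlock_irq(&bfqd->lock).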
* Re: [PATCH-next 00/23] cgroup/cpuset: Enable runtime update of nohz_full and managed_irq CPUs
From: Jing Wu @ 2026-06-24 6:34 UTC (permalink / raw)
To: Waiman Long
Cc: Jing Wu, Thomas Gleixner, linux-kernel, rcu, cgroups,
Qiliang Yuan
In-Reply-To: <20260421030351.281436-1-longman@redhat.com>
Hi Waiman,
Thomas Gleixner suggested we coordinate, so reaching out directly.
We have been working on a similar feature called Dynamic Housekeeping
Management (DHM) [1][2][3][4]. The RFC was posted on 2026-02-06, v1 on
2026-03-25, and v2 on 2026-04-13 — a week before your series appeared.
It seems we developed these independently in parallel.
After Thomas's review of DHM v3, we are rebuilding v4 around the
CPU-by-CPU offline/online hotplug mechanism, which aligns with the
direction of your series.
There is one key difference in scope worth discussing:
Your series requires "nohz_full=" to be present at boot (even with
an empty CPU list) to opt into runtime updates. DHM targets systems
where nohz_full= was never configured at boot — enabling CPU noise
isolation purely at runtime without any boot-time setup.
This requires making the nohz_full infrastructure activatable at
runtime for the first time, rather than just extending an already-
initialized boot configuration.
Before we start coding v4, a few questions:
1. Are you planning a v2 of your series? If so, what is your
timeline? We want to avoid duplicating effort on the subsystem
patches (tick, RCU, genirq).
2. Would you be open to extending your series to cover the
"no boot parameter" use case, or do you think it is better kept
as a separate series?
3. Are there specific patches in your series where you would welcome
our contribution directly?
Happy to collaborate on a unified approach.
[1] DHM RFC (2026-02-06): https://lore.kernel.org/r/20260206-feature-dynamic_isolcpus_dhei-v1-0-00a711eb0c74@gmail.com
[2] DHM v1 (2026-03-25): https://lore.kernel.org/r/20260325-dhei-v12-final-v1-0-919cca23cadf@gmail.com
[3] DHM v2 (2026-04-13): https://lore.kernel.org/r/20260413-wujing-dhm-v2-0-06df21caba5d@gmail.com
[4] DHM v3 (2026-06-18): https://lore.kernel.org/r/20260618-wujing-dhm-v3-0-28f1a4d83b68@gmail.com
[5] Your series v1 (2026-04-20): https://lore.kernel.org/r/20260421030351.281436-1-longman@redhat.com
Jing Wu <realwujing@gmail.com>
Qiliang Yuan <yuanql9@chinatelecom.cn>
^ permalink raw reply
* [PATCH 1/2] md/linear: add fault-tolerant mode for unraid-like setups
From: Yu Kuai @ 2026-06-24 6:46 UTC (permalink / raw)
To: Tejun Heo, Josef Bacik, Jens Axboe
Cc: Zheng Qixing, Christoph Hellwig, Tang Yizhou, Nilay Shroff,
Ming Lei, cgroups, linux-block, linux-kernel
From: Yu Kuai <yukuai@fnnas.com>
Add a module parameter 'fault_tolerant' that changes how md-linear
handles disk failures. When enabled:
- Disk failures are isolated instead of failing the entire array
- I/O to failed disks returns -EIO while healthy disks continue
- The array remains operational with reduced capacity
- Failed disk count is tracked and shown in /proc/mdstat
This enables unraid-like functionality where individual disk failures
don't bring down the entire array, allowing continued access to data
on healthy disks.
The fault_tolerant parameter can be set at module load time or
dynamically via /sys/module/md_linear/parameters/fault_tolerant.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
---
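The per-disk isolation described above can be sketched in userspace as a sector-to-disk lookup that fails only when the target disk is marked faulty. This is a simplified model, not the driver code: `struct disk` and `linear_map` are hypothetical names, the -EINVAL return for out-of-range sectors is a choice of this sketch, and a linear scan replaces the driver's own device lookup.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct dev_info; end_sector is cumulative. */
struct disk {
	unsigned long long end_sector; /* first sector past this disk */
	bool faulty;                   /* models the Faulty rdev flag */
};

/*
 * Return the index of the disk holding sector, -EIO if that disk is
 * faulty, or -EINVAL if sector lies beyond the end of the array.
 */
static int linear_map(const struct disk *disks, size_t n,
		      unsigned long long sector)
{
	assert(disks && n > 0);

	for (size_t i = 0; i < n; i++) {
		if (sector < disks[i].end_sector)
			return disks[i].faulty ? -EIO : (int)i;
	}
	return -EINVAL;
}
```

With three disks and the middle one marked faulty, sectors on the first and third disk still resolve while sectors in the middle range return -EIO, which is the behavior the fault_tolerant parameter is meant to provide.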
drivers/md/md-linear.c | 63 ++++++++++++++++++++++++++++++++++++------
1 file changed, 55 insertions(+), 8 deletions(-)
diff --git a/drivers/md/md-linear.c b/drivers/md/md-linear.c
index 8d7b82c4a723..8afc6665cfde 100644
--- a/drivers/md/md-linear.c
+++ b/drivers/md/md-linear.c
@@ -2,6 +2,10 @@
/*
* linear.c : Multiple Devices driver for Linux Copyright (C) 1994-96 Marc
* ZYNGIER <zyngier@ufr-info-p7.ibp.fr> or <maz@gloups.fdn.fr>
+ *
+ * Fault-tolerant mode added for unraid-like setups.
+ * When fault_tolerant=1, disk failures are isolated - I/O to failed disks
+ * returns -EIO while healthy disks continue operating normally.
*/
#include <linux/blkdev.h>
@@ -21,9 +25,15 @@ struct linear_conf {
sector_t array_sectors;
/* a copy of mddev->raid_disks */
int raid_disks;
+ atomic_t failed_disks; /* count of failed disks */
struct dev_info disks[] __counted_by(raid_disks);
};
+static bool fault_tolerant;
+module_param(fault_tolerant, bool, 0644);
+MODULE_PARM_DESC(fault_tolerant,
+ "Enable fault-tolerant mode: isolate disk failures instead of failing array (default: false)");
+
/*
* find which device holds a particular offset
*/
@@ -96,6 +106,8 @@ static struct linear_conf *linear_conf(struct mddev *mddev, int raid_disks)
if (!conf)
return ERR_PTR(-ENOMEM);
+ atomic_set(&conf->failed_disks, 0);
+
/*
* conf->raid_disks is copy of mddev->raid_disks. The reason to
* keep a copy of mddev->raid_disks in struct linear_conf is,
@@ -251,7 +263,8 @@ static bool linear_make_request(struct mddev *mddev, struct bio *bio)
bio_sector < start_sector))
goto out_of_bounds;
- if (unlikely(is_rdev_broken(tmp_dev->rdev))) {
+ if (unlikely(is_rdev_broken(tmp_dev->rdev) ||
+ test_bit(Faulty, &tmp_dev->rdev->flags))) {
md_error(mddev, tmp_dev->rdev);
bio_io_error(bio);
return true;
@@ -296,16 +309,47 @@ static bool linear_make_request(struct mddev *mddev, struct bio *bio)
static void linear_status(struct seq_file *seq, struct mddev *mddev)
{
+ struct linear_conf *conf = mddev->private;
+
seq_printf(seq, " %dk rounding", mddev->chunk_sectors / 2);
+ if (fault_tolerant) {
+ int failed = atomic_read(&conf->failed_disks);
+
+ seq_puts(seq, " fault-tolerant");
+ if (failed)
+ seq_printf(seq, " [%d failed]", failed);
+ }
}
static void linear_error(struct mddev *mddev, struct md_rdev *rdev)
{
- if (!test_and_set_bit(MD_BROKEN, &mddev->flags)) {
- char *md_name = mdname(mddev);
-
- pr_crit("md/linear%s: Disk failure on %pg detected, failing array.\n",
- md_name, rdev->bdev);
+ char *md_name = mdname(mddev);
+
+ if (fault_tolerant) {
+ /*
+ * Fault-tolerant mode: isolate the failed disk instead of
+ * failing the entire array. I/O to this disk will return -EIO
+ * but other disks continue operating normally.
+ */
+ if (!test_and_set_bit(Faulty, &rdev->flags)) {
+ struct linear_conf *conf = mddev->private;
+
+ atomic_inc(&conf->failed_disks);
+ pr_warn("md/linear%s: Disk failure on %pg detected, isolating device (fault-tolerant mode).\n",
+ md_name, rdev->bdev);
+ pr_warn("md/linear%s: %d disk(s) now failed, array continues with reduced capacity.\n",
+ md_name, atomic_read(&conf->failed_disks));
+ /* Notify userspace about the state change */
+ sysfs_notify_dirent_safe(rdev->sysfs_state);
+ }
+ } else {
+ /*
+ * Standard mode: fail the entire array on any disk failure.
+ */
+ if (!test_and_set_bit(MD_BROKEN, &mddev->flags)) {
+ pr_crit("md/linear%s: Disk failure on %pg detected, failing array.\n",
+ md_name, rdev->bdev);
+ }
}
}
@@ -344,7 +388,7 @@ static void linear_exit(void)
module_init(linear_init);
module_exit(linear_exit);
MODULE_LICENSE("GPL");
-MODULE_DESCRIPTION("Linear device concatenation personality for MD (deprecated)");
+MODULE_DESCRIPTION("Linear device concatenation personality for MD with optional fault-tolerant mode");
MODULE_ALIAS("md-personality-1"); /* LINEAR - deprecated*/
MODULE_ALIAS("md-linear");
MODULE_ALIAS("md-level--1");
--
2.43.0
^ permalink raw reply related
* [PATCH 2/2] ext4: add unraid mount option for single-disk-per-group mode
From: Yu Kuai @ 2026-06-24 6:46 UTC (permalink / raw)
To: Tejun Heo, Josef Bacik, Jens Axboe
Cc: Zheng Qixing, Christoph Hellwig, Tang Yizhou, Nilay Shroff,
Ming Lei, cgroups, linux-block, linux-kernel
In-Reply-To: <20260624064625.1743650-1-yukuai@kernel.org>
From: Yu Kuai <yukuai@fnnas.com>
Add support for an "unraid" mount option that enables a special mode
designed for use with fault-tolerant md-linear arrays. In this mode:
1. Variable block groups: Each block group can have a different size,
allowing one physical disk per group. Lookup tables are used for
block-to-group mapping instead of fixed-size calculations.
2. Distributed metadata: Every block group has its own superblock and
group descriptor table copy, enabling the filesystem to remain
accessible even if some disks fail.
3. Single-group allocation: Files are allocated entirely within a
single block group. If a group doesn't have enough space, the
allocation fails with -ENOSPC instead of trying other groups.
This ensures each file resides on a single physical disk.
4. Inode locality: Inodes are allocated in the same group as their
parent directory, keeping files and their metadata on the same disk.
This enables unraid-like functionality where:
- Each disk is independent and can be read separately
- Disk failures only affect files on that specific disk
- The filesystem continues operating with reduced capacity
Usage:
mount -t ext4 -o unraid /dev/md0 /mnt
Note: This requires a specially formatted filesystem where each block
group corresponds to one physical disk. A future mkfs.ext4 extension
will support creating such filesystems.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
---
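The lookup-table mapping from point 1 can be sketched as a standalone binary search over the first block of each group, mirroring the loop added to ext4_get_group_no_and_offset(). The function name `find_group` and the plain array are hypothetical; the sketch assumes the table is sorted in ascending order and that the block number is not below the first entry.

```c
#include <assert.h>
#include <stddef.h>

/*
 * first_block[i] is the first block of group i, sorted ascending.
 * Returns the largest i such that first_block[i] <= blocknr.
 * Assumption: blocknr >= first_block[0].
 */
static size_t find_group(const unsigned long long *first_block,
			 size_t ngroups, unsigned long long blocknr)
{
	size_t lo = 0, hi;

	assert(first_block && ngroups > 0 && blocknr >= first_block[0]);
	hi = ngroups - 1;

	while (lo < hi) {
		/* Round up so that "lo = mid" always makes progress. */
		size_t mid = lo + (hi - lo + 1) / 2;

		if (blocknr < first_block[mid])
			hi = mid - 1;
		else
			lo = mid;
	}
	return lo;
}
```

The in-group offset is then blocknr minus first_block[group], which the patch additionally shifts right by s_cluster_bits to get a cluster offset.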
fs/ext4/balloc.c | 45 ++++++++++++++++++++++++++++++++++++++++-----
fs/ext4/ext4.h | 15 ++++++++++++++-
fs/ext4/ialloc.c | 13 +++++++++++++
fs/ext4/mballoc.c | 8 ++++++++
fs/ext4/super.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 143 insertions(+), 6 deletions(-)
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 8040c731b3e4..bd151dc5480b 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -54,17 +54,43 @@ ext4_group_t ext4_get_group_number(struct super_block *sb,
void ext4_get_group_no_and_offset(struct super_block *sb, ext4_fsblk_t blocknr,
ext4_group_t *blockgrpp, ext4_grpblk_t *offsetp)
{
- struct ext4_super_block *es = EXT4_SB(sb)->s_es;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_super_block *es = sbi->s_es;
ext4_grpblk_t offset;
blocknr = blocknr - le32_to_cpu(es->s_first_data_block);
+
+ /* Unraid mode: binary search through variable-size groups */
+ if (sbi->s_group_first_block) {
+ ext4_group_t lo = 0, hi = sbi->s_groups_count - 1;
+ ext4_fsblk_t first_data = le32_to_cpu(es->s_first_data_block);
+
+ blocknr += first_data; /* restore original block number */
+
+ while (lo < hi) {
+ ext4_group_t mid = (lo + hi + 1) / 2;
+
+ if (blocknr < sbi->s_group_first_block[mid])
+ hi = mid - 1;
+ else
+ lo = mid;
+ }
+ if (blockgrpp)
+ *blockgrpp = lo;
+ if (offsetp) {
+ offset = (blocknr - sbi->s_group_first_block[lo]) >>
+ sbi->s_cluster_bits;
+ *offsetp = offset;
+ }
+ return;
+ }
+
offset = do_div(blocknr, EXT4_BLOCKS_PER_GROUP(sb)) >>
- EXT4_SB(sb)->s_cluster_bits;
+ sbi->s_cluster_bits;
if (offsetp)
*offsetp = offset;
if (blockgrpp)
*blockgrpp = blocknr;
-
}
/*
@@ -162,8 +188,13 @@ static unsigned ext4_num_overhead_clusters(struct super_block *sb,
static unsigned int num_clusters_in_group(struct super_block *sb,
ext4_group_t block_group)
{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
unsigned int blocks;
+ /* Unraid mode: use per-group blocks count */
+ if (sbi->s_group_blocks_count)
+ return EXT4_NUM_B2C(sbi, sbi->s_group_blocks_count[block_group]);
+
if (block_group == ext4_get_groups_count(sb) - 1) {
/*
* Even though mke2fs always initializes the first and
@@ -171,11 +202,11 @@ static unsigned int num_clusters_in_group(struct super_block *sb,
* we need to make sure we calculate the right free
* blocks.
*/
- blocks = ext4_blocks_count(EXT4_SB(sb)->s_es) -
+ blocks = ext4_blocks_count(sbi->s_es) -
ext4_group_first_block_no(sb, block_group);
} else
blocks = EXT4_BLOCKS_PER_GROUP(sb);
- return EXT4_NUM_B2C(EXT4_SB(sb), blocks);
+ return EXT4_NUM_B2C(sbi, blocks);
}
/* Initializes an uninitialized block bitmap */
@@ -855,6 +886,13 @@ int ext4_bg_has_super(struct super_block *sb, ext4_group_t group)
{
struct ext4_super_block *es = EXT4_SB(sb)->s_es;
+ /*
+ * Unraid mode: every group has a superblock copy for fault tolerance.
+ * This allows mounting the filesystem even if some disks fail.
+ */
+ if (test_opt2(sb, UNRAID))
+ return 1;
+
if (group == 0)
return 1;
if (ext4_has_feature_sparse_super2(sb)) {
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 56112f201cac..063e37a82654 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1295,6 +1295,9 @@ struct ext4_inode_info {
* scanning in mballoc
*/
#define EXT4_MOUNT2_ABORT 0x00000100 /* Abort filesystem */
+#define EXT4_MOUNT2_UNRAID 0x00000200 /* Unraid mode: one disk per
+ * group, single-group alloc
+ */
#define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
~EXT4_MOUNT_##opt
@@ -1687,6 +1690,10 @@ struct ext4_sb_info {
struct flex_groups * __rcu *s_flex_groups;
ext4_group_t s_flex_groups_allocated;
+ /* Unraid mode: variable block groups (one disk per group) */
+ ext4_fsblk_t *s_group_first_block; /* First block of each group */
+ ext4_grpblk_t *s_group_blocks_count; /* Blocks count per group */
+
/* workqueue for reserved extent conversions (buffered io) */
struct workqueue_struct *rsv_conversion_wq;
@@ -2627,8 +2634,14 @@ struct dir_private_info {
static inline ext4_fsblk_t
ext4_group_first_block_no(struct super_block *sb, ext4_group_t group_no)
{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+ /* Unraid mode: variable block groups, use lookup table */
+ if (sbi->s_group_first_block)
+ return sbi->s_group_first_block[group_no];
+
return group_no * (ext4_fsblk_t)EXT4_BLOCKS_PER_GROUP(sb) +
- le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block);
+ le32_to_cpu(sbi->s_es->s_first_data_block);
}
/*
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index b20a1bf866ab..98fda602073e 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -438,6 +438,19 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
int flex_size = ext4_flex_bg_size(sbi);
struct dx_hash_info hinfo;
+ /*
+ * Unraid mode: always allocate inode in parent's group.
+ * This ensures files and their inodes stay on the same disk.
+ */
+ if (test_opt2(sb, UNRAID)) {
+ desc = ext4_get_group_desc(sb, parent_group, NULL);
+ if (desc && ext4_free_inodes_count(sb, desc) > 0) {
+ *group = parent_group;
+ return 0;
+ }
+ return -1; /* No free inodes in parent's group */
+ }
+
ngroups = real_ngroups;
if (flex_size > 1) {
ngroups = (real_ngroups + flex_size - 1) >>
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 56d50fd3310b..9de674ec2f77 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2997,6 +2997,14 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
if (err || ac->ac_status == AC_STATUS_FOUND)
goto out;
+ /*
+ * Unraid mode: files must be allocated entirely within a single group.
+ * If the goal group doesn't have enough space, fail with -ENOSPC
+ * instead of trying other groups.
+ */
+ if (test_opt2(sb, UNRAID))
+ goto out;
+
if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
goto out;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 87205660c5d0..9534a4ffbee7 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1255,6 +1255,12 @@ static void ext4_group_desc_free(struct ext4_sb_info *sbi)
brelse(group_desc[i]);
kvfree(group_desc);
rcu_read_unlock();
+
+ /* Free unraid mode arrays */
+ kvfree(sbi->s_group_first_block);
+ kvfree(sbi->s_group_blocks_count);
+ sbi->s_group_first_block = NULL;
+ sbi->s_group_blocks_count = NULL;
}
static void ext4_flex_groups_free(struct ext4_sb_info *sbi)
@@ -1677,6 +1683,7 @@ enum {
Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
Opt_no_prefetch_block_bitmaps, Opt_mb_optimize_scan,
Opt_errors, Opt_data, Opt_data_err, Opt_jqfmt, Opt_dax_type,
+ Opt_unraid,
#ifdef CONFIG_EXT4_DEBUG
Opt_fc_debug_max_replay, Opt_fc_debug_force
#endif
@@ -1819,6 +1826,7 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
fsparam_flag ("reservation", Opt_removed), /* mount option from ext2/3 */
fsparam_flag ("noreservation", Opt_removed), /* mount option from ext2/3 */
fsparam_u32 ("journal", Opt_removed), /* mount option from ext2/3 */
+ fsparam_flag ("unraid", Opt_unraid),
{}
};
@@ -1912,6 +1920,7 @@ static const struct mount_opts {
MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
#endif
{Opt_abort, EXT4_MOUNT2_ABORT, MOPT_SET | MOPT_2},
+ {Opt_unraid, EXT4_MOUNT2_UNRAID, MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
{Opt_err, 0, 0}
};
@@ -4845,6 +4854,65 @@ static int ext4_check_geometry(struct super_block *sb,
return 0;
}
+/*
+ * Initialize unraid mode data structures.
+ * In unraid mode, each block group can have a different size (one disk per group).
+ * This function allocates and populates the lookup tables for variable-size groups.
+ *
+ * For now, this uses the standard fixed-size groups from the superblock.
+ * A future mkfs extension will store per-group sizes in the group descriptors.
+ */
+static int ext4_unraid_init(struct super_block *sb)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ ext4_group_t ngroups = sbi->s_groups_count;
+ ext4_fsblk_t first_data_block;
+ ext4_group_t i;
+
+ if (!test_opt2(sb, UNRAID))
+ return 0;
+
+ sbi->s_group_first_block = kvmalloc_array(ngroups,
+ sizeof(ext4_fsblk_t),
+ GFP_KERNEL);
+ if (!sbi->s_group_first_block)
+ return -ENOMEM;
+
+ sbi->s_group_blocks_count = kvmalloc_array(ngroups,
+ sizeof(ext4_grpblk_t),
+ GFP_KERNEL);
+ if (!sbi->s_group_blocks_count) {
+ kvfree(sbi->s_group_first_block);
+ sbi->s_group_first_block = NULL;
+ return -ENOMEM;
+ }
+
+ /*
+ * Initialize with standard fixed-size groups for now.
+ * TODO: Read per-group sizes from extended group descriptors
+ * when mkfs supports creating variable-size groups.
+ */
+ first_data_block = le32_to_cpu(sbi->s_es->s_first_data_block);
+ for (i = 0; i < ngroups; i++) {
+ sbi->s_group_first_block[i] = first_data_block +
+ (ext4_fsblk_t)i * EXT4_BLOCKS_PER_GROUP(sb);
+
+ if (i == ngroups - 1) {
+ /* Last group may be smaller */
+ sbi->s_group_blocks_count[i] =
+ ext4_blocks_count(sbi->s_es) -
+ sbi->s_group_first_block[i];
+ } else {
+ sbi->s_group_blocks_count[i] = EXT4_BLOCKS_PER_GROUP(sb);
+ }
+ }
+
+ ext4_msg(sb, KERN_INFO, "unraid mode enabled: %u groups",
+ ngroups);
+
+ return 0;
+}
+
static int ext4_group_desc_init(struct super_block *sb,
struct ext4_super_block *es,
ext4_fsblk_t logical_sb_block,
@@ -4904,7 +4972,8 @@ static int ext4_group_desc_init(struct super_block *sb,
return -EFSCORRUPTED;
}
- return 0;
+ /* Initialize unraid mode data structures if enabled */
+ return ext4_unraid_init(sb);
}
static int ext4_load_and_init_journal(struct super_block *sb,
--
2.43.0
^ permalink raw reply related
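
The per-group lookup tables in the ext4 patch above (s_group_first_block
and s_group_blocks_count) imply that mapping a block number to its group
can no longer be a single division once groups differ in size. A minimal
userspace sketch of one way to do that mapping is a binary search over the
sorted first-block array. The struct and function names below are
hypothetical and are not taken from the patch; it only assumes that the
array is sorted in ascending order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of a per-group lookup table.
 * first_block[g] is the first block of group g; the array is sorted
 * in ascending order. Names are illustrative, not from the patch.
 */
struct group_table {
	const uint64_t *first_block;
	size_t ngroups;
};

/*
 * Return the index of the group containing blocknr, i.e. the last
 * group whose first block is <= blocknr. The caller must ensure
 * blocknr >= first_block[0].
 */
static size_t group_of_block(const struct group_table *t, uint64_t blocknr)
{
	size_t lo = 0, hi = t->ngroups - 1;

	assert(t->ngroups > 0 && blocknr >= t->first_block[0]);
	while (lo < hi) {
		/* Round up so the loop always makes progress. */
		size_t mid = lo + (hi - lo + 1) / 2;

		if (t->first_block[mid] <= blocknr)
			lo = mid;
		else
			hi = mid - 1;
	}
	return lo;
}

/* Offset of blocknr within its group, in blocks. */
static uint64_t offset_in_group(const struct group_table *t, uint64_t blocknr)
{
	return blocknr - t->first_block[group_of_block(t, blocknr)];
}
```

The search costs O(log ngroups) per lookup, compared with the constant-time
division used for fixed-size groups.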
* [PATCH v2 0/4] blk-cgroup: fix blkg list and policy data races
From: Yu Kuai @ 2026-06-24 6:46 UTC (permalink / raw)
To: Tejun Heo, Josef Bacik, Jens Axboe
Cc: Zheng Qixing, Christoph Hellwig, Tang Yizhou, Nilay Shroff,
Ming Lei, cgroups, linux-block, linux-kernel
In-Reply-To: <20260624064625.1743650-1-yukuai@kernel.org>
From: Yu Kuai <yukuai@fygo.io>
Hi,
This series fixes races around q->blkg_list and blkg policy data
lifetime.
Patch 1 protects blkg_destroy_all()'s q->blkg_list walk with
blkcg_mutex.
Patches 2-3 fix races between blkcg_activate_policy() and concurrent
blkg destruction.
Patch 4 factors the policy data teardown loop into a helper after the
race fixes.
Changes since v1:
- Drop the BFQ q->blkg_list patch because the current block tree already
has a stronger fix in commit 17b2d950a3c0 ("block, bfq: protect async
queue reset with blkcg locks").
- Add Reviewed-by tags from Tang Yizhou.
Yu Kuai (1):
blk-cgroup: protect q->blkg_list iteration in blkg_destroy_all() with
blkcg_mutex
Zheng Qixing (3):
blk-cgroup: fix race between policy activation and blkg destruction
blk-cgroup: skip dying blkg in blkcg_activate_policy()
blk-cgroup: factor policy pd teardown loop into helper
block/blk-cgroup.c | 65 +++++++++++++++++++++++++---------------------
1 file changed, 35 insertions(+), 30 deletions(-)
--
2.51.0
^ permalink raw reply
* [PATCH v2 1/4] blk-cgroup: protect q->blkg_list iteration in blkg_destroy_all() with blkcg_mutex
From: Yu Kuai @ 2026-06-24 6:46 UTC (permalink / raw)
To: Tejun Heo, Josef Bacik, Jens Axboe
Cc: Zheng Qixing, Christoph Hellwig, Tang Yizhou, Nilay Shroff,
Ming Lei, cgroups, linux-block, linux-kernel
In-Reply-To: <20260624064625.1743650-1-yukuai@kernel.org>
From: Yu Kuai <yukuai@fygo.io>
blkg_destroy_all() iterates q->blkg_list without holding blkcg_mutex,
which can race with blkg_free_workfn() that removes blkgs from the list
while holding blkcg_mutex.
Add blkcg_mutex protection around the q->blkg_list iteration to prevent
potential list corruption or use-after-free issues.
Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
block/blk-cgroup.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index ee076ab795d3..7baccfb690fe 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -574,10 +574,11 @@ static void blkg_destroy_all(struct gendisk *disk)
struct blkcg_gq *blkg;
int count = BLKG_DESTROY_BATCH_SIZE;
int i;
restart:
+ mutex_lock(&q->blkcg_mutex);
spin_lock_irq(&q->queue_lock);
list_for_each_entry(blkg, &q->blkg_list, q_node) {
struct blkcg *blkcg = blkg->blkcg;
if (hlist_unhashed(&blkg->blkcg_node))
@@ -592,10 +593,11 @@ static void blkg_destroy_all(struct gendisk *disk)
* it when a batch of blkgs are destroyed.
*/
if (!(--count)) {
count = BLKG_DESTROY_BATCH_SIZE;
spin_unlock_irq(&q->queue_lock);
+ mutex_unlock(&q->blkcg_mutex);
cond_resched();
goto restart;
}
}
@@ -611,10 +613,11 @@ static void blkg_destroy_all(struct gendisk *disk)
__clear_bit(pol->plid, q->blkcg_pols);
}
q->root_blkg = NULL;
spin_unlock_irq(&q->queue_lock);
+ mutex_unlock(&q->blkcg_mutex);
wake_up_var(&q->root_blkg);
}
static void blkg_iostat_set(struct blkg_iostat *dst, struct blkg_iostat *src)
--
2.51.0
^ permalink raw reply related
* [PATCH v2 2/4] blk-cgroup: fix race between policy activation and blkg destruction
From: Yu Kuai @ 2026-06-24 6:46 UTC (permalink / raw)
To: Tejun Heo, Josef Bacik, Jens Axboe
Cc: Zheng Qixing, Christoph Hellwig, Tang Yizhou, Nilay Shroff,
Ming Lei, cgroups, linux-block, linux-kernel
In-Reply-To: <20260624064625.1743650-1-yukuai@kernel.org>
From: Zheng Qixing <zhengqixing@huawei.com>
When switching an IO scheduler on a block device, blkcg_activate_policy()
allocates blkg_policy_data (pd) for all blkgs attached to the queue.
However, blkcg_activate_policy() may race with concurrent blkcg deletion,
leading to use-after-free and memory leak issues.
The use-after-free occurs in the following race:
T1 (blkcg_activate_policy):
- Successfully allocates pd for blkg1 (loop0->queue, blkcgA)
- Fails to allocate pd for blkg2 (loop0->queue, blkcgB)
- Enters the enomem rollback path to release blkg1 resources
T2 (blkcg deletion):
- blkcgA is deleted concurrently
- blkg1 is freed via blkg_free_workfn()
- blkg1->pd is freed
T1 (continued):
- Rollback path accesses blkg1->pd->online after pd is freed
- Triggers use-after-free
In addition, blkg_free_workfn() frees pd before removing the blkg from
q->blkg_list. This allows blkcg_activate_policy() to allocate a new pd
for a blkg that is being destroyed, leaving the newly allocated pd
unreachable when the blkg is finally freed.
Fix these races by extending blkcg_mutex coverage to serialize
blkcg_activate_policy() rollback and blkg destruction, ensuring pd
lifecycle is synchronized with blkg list visibility.
Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
block/blk-cgroup.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 7baccfb690fe..f7e788a7fe95 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1563,10 +1563,12 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
if (WARN_ON_ONCE(!pol->pd_alloc_fn || !pol->pd_free_fn))
return -EINVAL;
if (queue_is_mq(q))
memflags = blk_mq_freeze_queue(q);
+
+ mutex_lock(&q->blkcg_mutex);
retry:
spin_lock_irq(&q->queue_lock);
/* blkg_list is pushed at the head, reverse walk to initialize parents first */
list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) {
@@ -1625,10 +1627,11 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
__set_bit(pol->plid, q->blkcg_pols);
ret = 0;
spin_unlock_irq(&q->queue_lock);
out:
+ mutex_unlock(&q->blkcg_mutex);
if (queue_is_mq(q))
blk_mq_unfreeze_queue(q, memflags);
if (pinned_blkg)
blkg_put(pinned_blkg);
if (pd_prealloc)
--
2.51.0
^ permalink raw reply related
* [PATCH v2 3/4] blk-cgroup: skip dying blkg in blkcg_activate_policy()
From: Yu Kuai @ 2026-06-24 6:46 UTC (permalink / raw)
To: Tejun Heo, Josef Bacik, Jens Axboe
Cc: Zheng Qixing, Christoph Hellwig, Tang Yizhou, Nilay Shroff,
Ming Lei, cgroups, linux-block, linux-kernel
In-Reply-To: <20260624064625.1743650-1-yukuai@kernel.org>
From: Zheng Qixing <zhengqixing@huawei.com>
When switching IO schedulers on a block device, blkcg_activate_policy()
can race with concurrent blkcg deletion, leading to a use-after-free in
rcu_accelerate_cbs.
T1: T2:
blkg_destroy
kill(&blkg->refcnt) // blkg->refcnt=1->0
blkg_release // call_rcu(__blkg_release)
...
blkg_free_workfn
->pd_free_fn(pd)
elv_iosched_store
elevator_switch
...
iterate blkg list
blkg_get(blkg) // blkg->refcnt=0->1
list_del_init(&blkg->q_node)
blkg_put(pinned_blkg) // blkg->refcnt=1->0
blkg_release // call_rcu again
rcu_accelerate_cbs // uaf
Fix this by checking hlist_unhashed(&blkg->blkcg_node) before getting
a reference to the blkg. This is the same check used in blkg_destroy()
to detect if a blkg has already been destroyed. If the blkg is already
unhashed, skip processing it since it's being destroyed.
Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
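
The double release described above can be modelled in userspace. This is a
simplified, single-threaded sketch: it uses a plain integer counter instead
of the kernel's reference counting, and the names obj, obj_put(),
obj_destroy() and walker_pin() are hypothetical. It only illustrates why a
walker must skip an entry that is already unhashed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted object that is still on a list. */
struct obj {
	int refcnt;
	bool unhashed;		/* set once destruction has started */
	int release_calls;	/* how many times release was triggered */
};

/* Drop a reference; trigger release when the count reaches zero. */
static void obj_put(struct obj *o)
{
	if (--o->refcnt == 0)
		o->release_calls++;	/* models scheduling the release callback */
}

/* Destruction path: unhash, then drop the initial reference. */
static void obj_destroy(struct obj *o)
{
	o->unhashed = true;
	obj_put(o);
}

/*
 * Walker that pins an object it found on the list. With
 * skip_dying == false it blindly takes a reference and revives a
 * count that already dropped to zero. Returns true if pinned.
 */
static bool walker_pin(struct obj *o, bool skip_dying)
{
	if (skip_dying && o->unhashed)
		return false;
	o->refcnt++;
	return true;
}
```

Without the check, destroy followed by pin and put triggers release twice.
With the check, the dying object is skipped and release runs once.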
block/blk-cgroup.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index f7e788a7fe95..2538d8105e6c 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1574,10 +1574,12 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) {
struct blkg_policy_data *pd;
if (blkg->pd[pol->plid])
continue;
+ if (hlist_unhashed(&blkg->blkcg_node))
+ continue;
/* If prealloc matches, use it; otherwise try GFP_NOWAIT */
if (blkg == pinned_blkg) {
pd = pd_prealloc;
pd_prealloc = NULL;
--
2.51.0
^ permalink raw reply related
* [PATCH v2 4/4] blk-cgroup: factor policy pd teardown loop into helper
From: Yu Kuai @ 2026-06-24 6:46 UTC (permalink / raw)
To: Tejun Heo, Josef Bacik, Jens Axboe
Cc: Zheng Qixing, Christoph Hellwig, Tang Yizhou, Nilay Shroff,
Ming Lei, cgroups, linux-block, linux-kernel
In-Reply-To: <20260624064625.1743650-1-yukuai@kernel.org>
From: Zheng Qixing <zhengqixing@huawei.com>
Move the teardown sequence which offlines and frees per-policy
blkg_policy_data (pd) into a helper for readability.
No functional change intended.
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
Signed-off-by: Yu Kuai <yukuai@fygo.io>
---
block/blk-cgroup.c | 57 ++++++++++++++++++++++------------------------
1 file changed, 27 insertions(+), 30 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 2538d8105e6c..e5e95be4fbc0 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1526,10 +1526,35 @@ struct cgroup_subsys io_cgrp_subsys = {
.depends_on = 1 << memory_cgrp_id,
#endif
};
EXPORT_SYMBOL_GPL(io_cgrp_subsys);
+/*
+ * Tear down per-blkg policy data for @pol on @q.
+ */
+static void blkcg_policy_teardown_pds(struct request_queue *q,
+ const struct blkcg_policy *pol)
+{
+ struct blkcg_gq *blkg;
+
+ list_for_each_entry(blkg, &q->blkg_list, q_node) {
+ struct blkcg *blkcg = blkg->blkcg;
+ struct blkg_policy_data *pd;
+
+ spin_lock(&blkcg->lock);
+ pd = blkg->pd[pol->plid];
+ if (pd) {
+ if (pd->online && pol->pd_offline_fn)
+ pol->pd_offline_fn(pd);
+ pd->online = false;
+ pol->pd_free_fn(pd);
+ blkg->pd[pol->plid] = NULL;
+ }
+ spin_unlock(&blkcg->lock);
+ }
+}
+
/**
* blkcg_activate_policy - activate a blkcg policy on a gendisk
* @disk: gendisk of interest
* @pol: blkcg policy to activate
*
@@ -1641,25 +1666,11 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
return ret;
enomem:
/* alloc failed, take down everything */
spin_lock_irq(&q->queue_lock);
- list_for_each_entry(blkg, &q->blkg_list, q_node) {
- struct blkcg *blkcg = blkg->blkcg;
- struct blkg_policy_data *pd;
-
- spin_lock(&blkcg->lock);
- pd = blkg->pd[pol->plid];
- if (pd) {
- if (pd->online && pol->pd_offline_fn)
- pol->pd_offline_fn(pd);
- pd->online = false;
- pol->pd_free_fn(pd);
- blkg->pd[pol->plid] = NULL;
- }
- spin_unlock(&blkcg->lock);
- }
+ blkcg_policy_teardown_pds(q, pol);
spin_unlock_irq(&q->queue_lock);
ret = -ENOMEM;
goto out;
}
EXPORT_SYMBOL_GPL(blkcg_activate_policy);
@@ -1674,11 +1685,10 @@ EXPORT_SYMBOL_GPL(blkcg_activate_policy);
*/
void blkcg_deactivate_policy(struct gendisk *disk,
const struct blkcg_policy *pol)
{
struct request_queue *q = disk->queue;
- struct blkcg_gq *blkg;
unsigned int memflags;
if (!blkcg_policy_enabled(q, pol))
return;
@@ -1687,24 +1697,11 @@ void blkcg_deactivate_policy(struct gendisk *disk,
mutex_lock(&q->blkcg_mutex);
spin_lock_irq(&q->queue_lock);
__clear_bit(pol->plid, q->blkcg_pols);
-
- list_for_each_entry(blkg, &q->blkg_list, q_node) {
- struct blkcg *blkcg = blkg->blkcg;
-
- spin_lock(&blkcg->lock);
- if (blkg->pd[pol->plid]) {
- if (blkg->pd[pol->plid]->online && pol->pd_offline_fn)
- pol->pd_offline_fn(blkg->pd[pol->plid]);
- pol->pd_free_fn(blkg->pd[pol->plid]);
- blkg->pd[pol->plid] = NULL;
- }
- spin_unlock(&blkcg->lock);
- }
-
+ blkcg_policy_teardown_pds(q, pol);
spin_unlock_irq(&q->queue_lock);
mutex_unlock(&q->blkcg_mutex);
if (queue_is_mq(q))
blk_mq_unfreeze_queue(q, memflags);
--
2.51.0
^ permalink raw reply related
* Re: [PATCH 1/2] md/linear: add fault-tolerant mode for unraid-like setups
From: yu kuai @ 2026-06-24 6:55 UTC (permalink / raw)
To: Yu Kuai, Tejun Heo, Josef Bacik, Jens Axboe
Cc: Zheng Qixing, Christoph Hellwig, Tang Yizhou, Nilay Shroff,
Ming Lei, cgroups, linux-block, linux-kernel, yukuai
In-Reply-To: <20260624064625.1743650-1-yukuai@kernel.org>
Hi,
Please ignore this patch; it is only meant to be used downstream.
The AI somehow generated the command to send it together with the patchset:
blk-cgroup: fix blkg list and policy data races
Same for the other ext4 patch.
Sorry for the noise. :(
On 2026/6/24 14:46, Yu Kuai wrote:
> From: Yu Kuai<yukuai@fnnas.com>
>
> Add a module parameter 'fault_tolerant' that changes how md-linear
> handles disk failures. When enabled:
>
> - Disk failures are isolated instead of failing the entire array
> - I/O to failed disks returns -EIO while healthy disks continue
> - The array remains operational with reduced capacity
> - Failed disk count is tracked and shown in /proc/mdstat
>
> This enables unraid-like functionality where individual disk failures
> don't bring down the entire array, allowing continued access to data
> on healthy disks.
>
> The fault_tolerant parameter can be set at module load time or
> dynamically via /sys/module/md_linear/parameters/fault_tolerant.
>
> Signed-off-by: Yu Kuai<yukuai@fnnas.com>
> ---
> drivers/md/md-linear.c | 63 ++++++++++++++++++++++++++++++++++++------
> 1 file changed, 55 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/md/md-linear.c b/drivers/md/md-linear.c
--
Thanks,
Kuai
^ permalink raw reply
* Re: [PATCH 0/8] blk-cgroup: remove queue_lock nesting from blkcg paths
From: yu kuai @ 2026-06-24 6:57 UTC (permalink / raw)
To: Yu Kuai, nilay, tom.leiming, bvanassche, tj, josef, axboe
Cc: akpm, chrisl, kasong, shikemeng, nphamcs, bhe, baohua,
youngjun.park, cgroups, linux-block, linux-kernel, linux-mm,
yukuai
In-Reply-To: <cover.1780621988.git.yukuai@fygo.io>
Friendly ping ...
This set still applies cleanly to the block-7.2 branch.
On 2026/6/8 11:42, Yu Kuai wrote:
> From: Yu Kuai <yukuai@fygo.io>
>
> Hi,
>
> This series is the follow-up blk-cgroup locking cleanup on top of the
> earlier blkg-list protection fixes, and prepares blk-cgroup to stop using
> q->queue_lock as the global blkg lifetime/iteration lock.
>
> The current queue_lock based protection is hard to maintain because
> queue_lock is used from hardirq and softirq completion paths, while some
> blkcg cgroup file paths also need to iterate blkgs, print policy data, or
> create blkgs from RCU-protected contexts. This series first tightens the
> blkcg-side lifetime rules:
>
> - blkcg_print_stat() iterates blkgs under blkcg->lock with IRQs disabled.
> - policy data freeing is delayed past an RCU grace period.
> - blkcg_print_blkgs(), blkg lookup/create, bio association, page-IO
> association, blkg destruction, and BFQ initialization stop nesting
> queue_lock under RCU or blkcg->lock.
>
> Using blkcg->lock and RCU for blkcg-owned lists/data keeps the lock order
> local to blk-cgroup and avoids extending queue_lock into cgroup file
> iteration paths. It also makes the subsequent conversion to q->blkcg_mutex
> possible without carrying forward queue_lock's interrupt-context
> constraints.
>
> Yu Kuai (8):
> blk-cgroup: protect iterating blkgs with blkcg->lock in
> blkcg_print_stat()
> blk-cgroup: delay freeing policy data after rcu grace period
> blk-cgroup: don't nest queue_lock under rcu in blkcg_print_blkgs()
> blk-cgroup: don't nest queue_lock under rcu in blkg_lookup_create()
> blk-cgroup: don't nest queue_lock under rcu in bio_associate_blkg()
> blk-cgroup: don't nest queue_lock under blkcg->lock in
> blkcg_destroy_blkgs()
> mm/page_io: don't nest queue_lock under rcu in
> bio_associate_blkg_from_page()
> block, bfq: don't grab queue_lock to initialize bfq
>
> block/bfq-cgroup.c | 17 ++++-
> block/bfq-iosched.c | 5 --
> block/blk-cgroup-rwstat.c | 15 ++--
> block/blk-cgroup.c | 151 ++++++++++++++++++++++----------------
> block/blk-cgroup.h | 8 +-
> block/blk-iocost.c | 22 ++++--
> block/blk-iolatency.c | 10 ++-
> block/blk-throttle.c | 13 +++-
> mm/page_io.c | 7 +-
> 9 files changed, 158 insertions(+), 90 deletions(-)
>
>
> base-commit: b23df513de562739af61fa61ba80ef5e8059a636
--
Thanks,
Kuai
^ permalink raw reply
* Re: [RFC PATCH v6 00/25] Hierarchical Constant Bandwidth Server
From: luca abeni @ 2026-06-24 7:19 UTC (permalink / raw)
To: Juri Lelli
Cc: Yuri Andriaccio, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tejun Heo, Johannes Weiner,
Michal Koutný, cgroups, linux-kernel, Yuri Andriaccio
In-Reply-To: <ajpcrDn2g2G9mGKp@jlelli-thinkpadt14gen4.remote.csb>
Hi Juri,
very interesting demo, thanks!
On Tue, 23 Jun 2026 12:15:08 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:
[...]
> - At 1ms task periods, the dl-server period is the critical tuning
> parameter, less the bandwidth. A 10ms dl-server with 60% bandwidth
> caused ~10% miss rates because the worst-case throttle gap (4ms)
> spanned multiple 1ms deadlines. Switching to a 2ms dl-server period
> at just 30% bandwidth eliminated all misses.
>
> - A simple rule of thumb might be to set the dl-server period to at
> most 2x the shortest task period in the cgroup (e.g., 2ms dl-server
> for 1ms tasks, 10ms for 10ms tasks). Would you (and Luca?) agree or
> would you suggest something different?
With one single RT task in the cgroup (or with multiple synchronized RT
tasks having the same period), I agree... Technically, the cgroup period
P should be such that P - Q = T - WCET (where "Q" is the cgroup's
runtime and "T" is the period of the task), but to see missed deadlines
you need a relevant competing deadline (or HCBS) workload.
So, yes, I agree with your findings above.
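
The figures quoted above can be checked with a little arithmetic. The
sketch below assumes integer microsecond units and a bandwidth given as a
whole percentage. The helper names are hypothetical. It computes the
runtime Q from the period P and the bandwidth, the gap P - Q, and the
upper bound suggested by the 2x rule of thumb.

```c
#include <assert.h>
#include <stdint.h>

/* Runtime Q in microseconds for a server with the given period and bandwidth. */
static uint32_t server_runtime_us(uint32_t period_us, uint32_t bandwidth_pct)
{
	assert(bandwidth_pct <= 100);
	return (uint32_t)((uint64_t)period_us * bandwidth_pct / 100);
}

/* Gap P - Q in microseconds: the part of each period without server runtime. */
static uint32_t throttle_gap_us(uint32_t period_us, uint32_t bandwidth_pct)
{
	return period_us - server_runtime_us(period_us, bandwidth_pct);
}

/* Rule of thumb: server period at most 2x the shortest task period. */
static uint32_t max_server_period_us(uint32_t shortest_task_period_us)
{
	return 2 * shortest_task_period_us;
}
```

For the 10ms server at 60% bandwidth this gives Q = 6ms and a gap of 4ms,
matching the throttle gap quoted above. For the 2ms server at 30% it gives
Q = 0.6ms and a gap of 1.4ms.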
If we consider multi-core analysis, or multiple RT threads with
different, non-synchronized periods, then the analysis tool by Yuri
(leveraging CSF analysis from real-time literature) is needed... But
that is pretty pessimistic. The rule you suggest above is a better
starting point in practical situations.
Luca
^ permalink raw reply
* Re: [PATCH v2 2/2] cgroup/cpuset: Rebind/migrate mm only for threadgroup leader in cpuset_update_tasks_nodemask()
From: Manuel Ebner @ 2026-06-24 8:27 UTC (permalink / raw)
To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
Ridong Chen, Jonathan Corbet, Shuah Khan
Cc: cgroups, linux-kernel, linux-doc, linux-kselftest
In-Reply-To: <20260623230413.1984188-3-longman@redhat.com>
Hi
On Tue, 2026-06-23 at 19:04 -0400, Waiman Long wrote:
> [...]
> Also add a paragraph in cgroup-v2.rst under cpuset.mems that the
> threadgroup leader is the memory owner of that threadgroup. Therefore
> the non-leading threads shouldn't be in other cgroups whose "cpuset.mems"
> doesn't fully overlap that of the group leader.
This sentence is long and complex; split it into two if possible. I couldn't
figure out how to do so myself.
> [...]
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -2527,6 +2527,13 @@ Cpuset Interface Files
> a need to change "cpuset.mems" with active tasks, it shouldn't
> be done frequently.
>
> + For a multithreaded process, the threadgroup leader is
> + considered the owner of the group's memory. Memory policy
> + rebinding and migration will only happen with respect to the
> + threadgroup leader. To avoid unexpected result, non-leading
/result/results/
or
To avoid an unexpected result,
> + threads shouldn't be put into another cgroup whose "cpuset.mems"
> + doesn't fully overlap that of the threadgroup leader.
maybe
/threadgroup/threadgroups/
Thanks
Manuel
^ permalink raw reply
* Re: [PATCH v2 1/2] cgroup/cpuset: Avoid unnecessary cpus & mems update in cpuset_hotplug_update_tasks()
From: Manuel Ebner @ 2026-06-24 8:40 UTC (permalink / raw)
To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný,
Ridong Chen, Jonathan Corbet, Shuah Khan
Cc: cgroups, linux-kernel, linux-doc, linux-kselftest
In-Reply-To: <20260623230413.1984188-2-longman@redhat.com>
On Tue, 2026-06-23 at 19:04 -0400, Waiman Long wrote:
> As reported by sashiko [1], cpuset_hotplug_update_tasks() may perform
> unnecessary task iteration and updating of tasks' CPU and node masks
> when mems_allowed and/or cpus_allowed are not set in cpuset v2. It is
> due to the fact that the temporary new_cpus and new_mems masks do not
> inherit parent's effective_cpus/mems when they are empty which is the
> expected behavior for cpuset v2 since commit 4ec22e9c5a90 ("cpuset:
> Enable cpuset controller in default hierarchy").
>
> Fix that and avoid unnecessary work by enhancing
> compute_effective_cpumask() to add the empty cpumask check
> and inheriting the parent's versions if empty when in v2. A new
> compute_effective_nodemask() helper is also added to perform similar
> function for new effective_mems.
perform a similar function
or
perform similar functions
> [...]
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index aff86acea701..044ddbf66f8e 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1094,12 +1094,35 @@ void cpuset_update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
> * @cs: the cpuset the need to recompute the new effective_cpus mask
> * @parent: the parent cpuset
> *
> + * For v2, the parent's effective_cpus is inherited if cpumask empty.
+ * For v2, the parent's effective_cpus is inherited if cpumask is empty.
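
The inheritance rule in that comment can be illustrated with a small
userspace model. This is only a sketch under stated assumptions: masks are
plain 64-bit integers with one bit per CPU, and effective_cpus() here is a
hypothetical helper, not the kernel's compute_effective_cpumask(). It
assumes the effective mask is the intersection with the parent's effective
mask, and that on v2 an empty result falls back to the parent's mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpumask64_t;

/*
 * Compute a child's effective mask from its configured mask and the
 * parent's effective mask. On v2, an empty result inherits the
 * parent's effective mask; on v1 it stays empty.
 */
static cpumask64_t effective_cpus(cpumask64_t cpus_allowed,
				  cpumask64_t parent_effective,
				  bool is_v2)
{
	cpumask64_t eff = cpus_allowed & parent_effective;

	if (is_v2 && eff == 0)
		return parent_effective;
	return eff;
}
```

With an unset cpuset.cpus on v2, the child ends up with the same mask as
its parent, so a hotplug update that leaves the parent unchanged leaves
the child unchanged too.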
Thanks
Manuel
^ permalink raw reply
* Re: [RFC PATCH v6 00/25] Hierarchical Constant Bandwidth Server
From: Juri Lelli @ 2026-06-24 10:35 UTC (permalink / raw)
To: luca abeni
Cc: Yuri Andriaccio, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tejun Heo, Johannes Weiner,
Michal Koutný, cgroups, linux-kernel, Yuri Andriaccio
In-Reply-To: <20260624091912.4fca8428@nowhere>
Hi Luca,
On 24/06/26 09:19, luca abeni wrote:
> Hi Juri,
>
> very interesting demo, thanks!
>
> On Tue, 23 Jun 2026 12:15:08 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
> [...]
> > - At 1ms task periods, the dl-server period is the critical tuning
> > parameter, less the bandwidth. A 10ms dl-server with 60% bandwidth
> > caused ~10% miss rates because the worst-case throttle gap (4ms)
> > spanned multiple 1ms deadlines. Switching to a 2ms dl-server period
> > at just 30% bandwidth eliminated all misses.
> >
> > - A simple rule of thumb might be to set the dl-server period to at
> > most 2x the shortest task period in the cgroup (e.g., 2ms dl-server
> > for 1ms tasks, 10ms for 10ms tasks). Would you (and Luca?) agree or
> > would you suggest something different?
>
> With one single RT task in the cgroup (or with multiple synchronized RT
> tasks having the same period), I agree... Technically, the cgroup period
> P should be such that P - Q = T - WCET (where "Q" is the cgroup's
> runtime and "T" is the period of the task), but to see missed deadlines
> you need a relevant competing deadline (or HCBS) workload.
>
> So, yes, I agree with your findings above.
Great, thanks a lot for taking a look!
> If we consider multi-core analysis, or multiple RT threads with
> different, non-synchronized periods, then the analysis tool by Yuri
> (leveraging CSF analysis from real-time literature) is needed... But
> that is pretty pessimistic. The rule you suggest above is a better
> starting point in practical situations.
Indeed. I actually wonder if it would make sense to "extract" that tool
from the test suite and package it somehow so that it's easier for end
users to design their interfaces.
Best,
Juri
^ permalink raw reply