From: Waiman Long <longman@redhat.com>
To: "Tejun Heo" <tj@kernel.org>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Michal Koutný" <mkoutny@suse.com>,
"Shuah Khan" <shuah@kernel.org>
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-kselftest@vger.kernel.org, Waiman Long <longman@redhat.com>
Subject: [PATCH 01/10] cgroup/cpuset: Fix race between newly created partition and dying one
Date: Sun, 30 Mar 2025 17:52:39 -0400 [thread overview]
Message-ID: <20250330215248.3620801-2-longman@redhat.com> (raw)
In-Reply-To: <20250330215248.3620801-1-longman@redhat.com>
There is a possible race between removing a cgroup diectory that is
a partition root and the creation of a new partition. The partition
to be removed can be dying but still online, it doesn't not currently
participate in checking for exclusive CPUs conflict, but the exclusive
CPUs are still there in subpartitions_cpus and isolated_cpus. These
two cpumasks are global states that affect the operation of cpuset
partitions. The exclusive CPUs in dying cpusets will only be removed
when cpuset_css_offline() function is called after an RCU delay.
As a result, it is possible that a new partition can be created with
exclusive CPUs that overlap with those of a dying one. When that dying
partition is finally offlined, it removes those overlapping exclusive
CPUs from subpartitions_cpus and maybe isolated_cpus resulting in an
incorrect CPU configuration.
This bug was found when a warning was triggered in
remote_partition_disable() during testing because the subpartitions_cpus
mask was empty.
One possible way to fix this is to iterate the dying cpusets as well and
avoid using the exclusive CPUs in those dying cpusets. However, this
can still cause random partition creation failures or other anomalies
due to racing. A better way to fix this race is to reset the partition
state at the moment when a cpuset is being killed.
Introduce a new css_killed() CSS function pointer and call it, if
defined, before setting CSS_DYING flag in kill_css(). Also update the
css_is_dying() helper to use the CSS_DYING flag introduced by commit
33c35aa48178 ("cgroup: Prevent kill_css() from being called more than
once") for proper synchronization.
Add a new cpuset_css_killed() function to reset the partition state of
a valid partition root if it is being killed.
Fixes: ee8dde0cd2ce ("cpuset: Add new v2 cpuset.sched.partition flag")
Signed-off-by: Waiman Long <longman@redhat.com>
---
include/linux/cgroup-defs.h | 1 +
include/linux/cgroup.h | 2 +-
kernel/cgroup/cgroup.c | 6 ++++++
kernel/cgroup/cpuset.c | 20 +++++++++++++++++---
4 files changed, 25 insertions(+), 4 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 485b651869d9..5bc8f55c8cca 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -710,6 +710,7 @@ struct cgroup_subsys {
void (*css_released)(struct cgroup_subsys_state *css);
void (*css_free)(struct cgroup_subsys_state *css);
void (*css_reset)(struct cgroup_subsys_state *css);
+ void (*css_killed)(struct cgroup_subsys_state *css);
void (*css_rstat_flush)(struct cgroup_subsys_state *css, int cpu);
int (*css_extra_stat_show)(struct seq_file *seq,
struct cgroup_subsys_state *css);
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 28e999f2c642..e7da3c3b098b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -344,7 +344,7 @@ static inline u64 cgroup_id(const struct cgroup *cgrp)
*/
static inline bool css_is_dying(struct cgroup_subsys_state *css)
{
- return !(css->flags & CSS_NO_REF) && percpu_ref_is_dying(&css->refcnt);
+ return css->flags & CSS_DYING;
}
static inline void cgroup_get(struct cgroup *cgrp)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index f231fe3a0744..49d622205997 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5909,6 +5909,12 @@ static void kill_css(struct cgroup_subsys_state *css)
if (css->flags & CSS_DYING)
return;
+ /*
+ * Call css_killed(), if defined, before setting the CSS_DYING flag
+ */
+ if (css->ss->css_killed)
+ css->ss->css_killed(css);
+
css->flags |= CSS_DYING;
/*
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 39c1fc643d77..749994312d47 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3475,9 +3475,6 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
cpus_read_lock();
mutex_lock(&cpuset_mutex);
- if (is_partition_valid(cs))
- update_prstate(cs, 0);
-
if (!cpuset_v2() && is_sched_load_balance(cs))
cpuset_update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
@@ -3488,6 +3485,22 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
cpus_read_unlock();
}
+static void cpuset_css_killed(struct cgroup_subsys_state *css)
+{
+ struct cpuset *cs = css_cs(css);
+
+ cpus_read_lock();
+ mutex_lock(&cpuset_mutex);
+
+ /* Reset valid partition back to member */
+ if (is_partition_valid(cs))
+ update_prstate(cs, PRS_MEMBER);
+
+ mutex_unlock(&cpuset_mutex);
+ cpus_read_unlock();
+
+}
+
static void cpuset_css_free(struct cgroup_subsys_state *css)
{
struct cpuset *cs = css_cs(css);
@@ -3609,6 +3622,7 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
.css_alloc = cpuset_css_alloc,
.css_online = cpuset_css_online,
.css_offline = cpuset_css_offline,
+ .css_killed = cpuset_css_killed,
.css_free = cpuset_css_free,
.can_attach = cpuset_can_attach,
.cancel_attach = cpuset_cancel_attach,
--
2.48.1
next prev parent reply other threads:[~2025-03-30 21:53 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-03-30 21:52 [PATCH 00/10] cgroup/cpuset: Miscellaneous partition bug fixes and enhancements Waiman Long
2025-03-30 21:52 ` Waiman Long [this message]
2025-03-31 23:13 ` [PATCH 01/10] cgroup/cpuset: Fix race between newly created partition and dying one Tejun Heo
2025-04-01 3:12 ` Waiman Long
2025-04-01 19:59 ` Tejun Heo
2025-04-01 20:41 ` Waiman Long
2025-04-01 20:56 ` Waiman Long
2025-04-02 7:49 ` Tejun Heo
2025-04-03 13:34 ` Michal Koutný
2025-04-03 16:15 ` Tejun Heo
2025-03-30 21:52 ` [PATCH 02/10] cgroup/cpuset: Fix incorrect isolated_cpus update in update_parent_effective_cpumask() Waiman Long
2025-03-30 21:52 ` [PATCH 03/10] cgroup/cpuset: Fix error handling in remote_partition_disable() Waiman Long
2025-03-30 21:52 ` [PATCH 04/10] cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition Waiman Long
2025-03-30 21:52 ` [PATCH 05/10] cgroup/cpuset: Don't allow creation of local partition over a remote one Waiman Long
2025-04-03 13:33 ` Michal Koutný
2025-04-03 13:48 ` Waiman Long
2025-03-30 21:52 ` [PATCH 06/10] cgroup/cpuset: Code cleanup and comment update Waiman Long
2025-03-30 21:52 ` [PATCH 07/10] cgroup/cpuset: Remove unneeded goto in sched_partition_write() and rename it Waiman Long
2025-04-03 13:33 ` Michal Koutný
2025-04-03 13:49 ` Waiman Long
2025-03-30 21:52 ` [PATCH 08/10] selftest/cgroup: Update test_cpuset_prs.sh to use | as effective CPUs and state separator Waiman Long
2025-03-30 21:52 ` [PATCH 09/10] selftest/cgroup: Clean up and restructure test_cpuset_prs.sh Waiman Long
2025-03-30 21:52 ` [PATCH 10/10] selftest/cgroup: Add a remote partition transition test to test_cpuset_prs.sh Waiman Long
2025-03-31 23:28 ` Tejun Heo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250330215248.3620801-2-longman@redhat.com \
--to=longman@redhat.com \
--cc=cgroups@vger.kernel.org \
--cc=hannes@cmpxchg.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=mkoutny@suse.com \
--cc=shuah@kernel.org \
--cc=tj@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox