From: Waiman Long <llong@redhat.com>
To: "Michal Koutný" <mkoutny@suse.com>,
"Sun Shaojie" <sunshaojie@kylinos.cn>
Cc: llong@redhat.com, cgroups@vger.kernel.org,
chenridong@huaweicloud.com, hannes@cmpxchg.org,
linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
shuah@kernel.org, tj@kernel.org
Subject: Re: [PATCH v6] cpuset: Avoid invalidating sibling partitions on cpuset.cpus conflict.
Date: Tue, 23 Dec 2025 01:03:42 -0500
Message-ID: <5b53f9ec-ebd5-4bea-b6a3-ef35a467e96c@redhat.com>
In-Reply-To: <bzu7va4de6ylaww2xbq67hztyokpui7qm2zcqtiwjlniyvx7dt@wf47lg6etmas>
On 12/22/25 10:26 AM, Michal Koutný wrote:
> Hello Shaojie.
>
> On Mon, Dec 01, 2025 at 05:38:06PM +0800, Sun Shaojie <sunshaojie@kylinos.cn> wrote:
>> Currently, when setting a cpuset's cpuset.cpus to a value that conflicts
>> with its sibling partition, the sibling's partition state becomes invalid.
>> However, this invalidation is often unnecessary.
>>
>> For example: On a machine with 128 CPUs, there are m (m < 128) cpusets
>> under the root cgroup. Each cpuset is used by a single user (user-1 uses
>> A1, ..., user-m uses Am), and the partition states of these cpusets are
>> configured as follows:
>>
>>                     root cgroup
>>          /           /        \           \
>>         A1          A2   ...   An          Am
>>       (root)      (root) ... (root)  (root/root invalid/member)
>>
>> Assume that A1 through Am have not set cpuset.cpus.exclusive. When
>> user-m modifies Am's cpuset.cpus to "0-127", it will cause all partition
>> states from A1 to An to change from root to root invalid, as shown
>> below.
>>
>>                           root cgroup
>>            /               /              \                \
>>           A1              A2      ...      An               Am
>>     (root invalid)  (root invalid) ... (root invalid) (root invalid/member)
>>
>> This outcome is entirely undeserved for all users from A1 to An.
> s/cpuset.cpus/memory.max/
>
> When the permissions are such that the last (any) sibling can come and
> claim so much as to cause overcommit, then it can set up a large limit and
> (potentially) reclaim from others.
>
> s/cpuset.cpus/memory.min/
>
> Here the overcommit is approached by recalculating effective values of
> memory.min; again, one sibling can skew things toward itself and reduce
> every other sibling's effective value.
>
> The above are not exact analogies because the first of them is Limits, the
> second is Protections and cpusets are Allocations (referring to Resource
> Distribution Models from Documentation/admin-guide/cgroup-v2.rst).
>
> But the advice to get some guarantees would be the same in all cases -- if
> some guarantees are expected, the permissions (of respective cgroup
> attributes) should be configured so that it decouples the owner of the
> cgroup from the owner of the resource (i.e. Ai/cpuset.cpus belongs to
> root or there's a middle level cgroup that'd cap each of the siblings
> individually).
>
From a sibling's point of view, CPUs in partitions are exclusive. A cpuset
either has all the requested CPUs to form a partition (assuming that at
least one can be granted from the parent cpuset), or it doesn't have all
of them and fails to form a valid partition. This is different from
memory, where a cgroup can get less memory than it requested and still
work fine.

Anyway, I consider using cpuset.cpus to form a partition to be legacy; it
is supported for backward compatibility reasons. The proper way to form a
partition now is to use cpuset.cpus.exclusive, and a write to it can fail
if it conflicts with siblings.

When only cpuset.cpus is used to form partitions, the cpuset.cpus value
is treated the same as cpuset.cpus.exclusive if a valid partition is
formed. In that sense, the examples listed in the patch will have the
same result if cpuset.cpus.exclusive is used instead of cpuset.cpus. The
difference is that the write to cpuset.cpus.exclusive will fail, whereas
with cpuset.cpus an invalid partition is formed.
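
To illustrate the early-feedback path, here is a rough sketch of a wrapper
around the cpuset.cpus.exclusive write. The helper name set_exclusive_cpus
and its directory argument are made up for this example; it assumes the
directory is a cpuset-enabled cgroup v2 directory and has not been tested
against any particular kernel.

```shell
#!/bin/sh
# Sketch only. set_exclusive_cpus is a hypothetical helper name; $1 is
# assumed to be a cpuset-enabled cgroup v2 directory, $2 a CPU list.
set_exclusive_cpus() {
    cg_dir=$1
    cpus=$2

    # As described above, this write is expected to be rejected when the
    # CPU list conflicts with a sibling, so the caller hears about the
    # conflict right away instead of finding an invalid partition later.
    if ! echo "$cpus" > "$cg_dir/cpuset.cpus.exclusive"; then
        echo "rejected: $cpus for $cg_dir" >&2
        return 1
    fi
    return 0
}
```

A caller could then do something like
`set_exclusive_cpus /sys/fs/cgroup/A1 0-1 && echo root > /sys/fs/cgroup/A1/cpuset.cpus.partition`,
where /sys/fs/cgroup is only the conventional mount point and A1 is the
cpuset from the quoted example.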
>> After applying this patch, the first party to set "root" will maintain
>> its exclusive validity. As follows:
>>
>> Step                                       | A1's prstate | B1's prstate |
>> #1> echo "0-1" > A1/cpuset.cpus            | member       | member       |
>> #2> echo "root" > A1/cpuset.cpus.partition | root         | member       |
>> #3> echo "1-2" > B1/cpuset.cpus            | root         | member       |
>> #4> echo "root" > B1/cpuset.cpus.partition | root         | root invalid |
>>
>> Step                                       | A1's prstate | B1's prstate |
>> #1> echo "0-1" > B1/cpuset.cpus            | member       | member       |
>> #2> echo "root" > B1/cpuset.cpus.partition | member       | root         |
>> #3> echo "1-2" > A1/cpuset.cpus            | member       | root         |
>> #4> echo "root" > A1/cpuset.cpus.partition | root invalid | root         |
> I'm worried that the ordering dependency would lead to situations where
> users may not be immediately aware their config is overcommitting the system.
> Consider that CPUs are vital for A1 but B1 can somehow survive the
> degraded state, depending on the starting order the system may either
> run fine (A1 valid) or fail because of A1.
>
> I'm curious about Waiman's take.
That is why I recommend that users use cpuset.cpus.exclusive to form
partitions, as they get early feedback if they are overcommitting. Of
course, setting cpuset.cpus.exclusive without failure still doesn't
guarantee the formation of a valid partition if none of the exclusive
CPUs can be granted from the parent.
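
Since a successful write does not guarantee a valid partition, the state
still has to be read back from cpuset.cpus.partition after "root" has been
written. A minimal sketch of such a check follows; partition_state is a
hypothetical helper name, and the return codes are arbitrary choices made
for this example.

```shell
#!/bin/sh
# Sketch only. partition_state is a hypothetical helper name; $1 is assumed
# to be a cpuset-enabled cgroup v2 directory.
partition_state() {
    # Fail with 3 if the control file cannot be read at all.
    state=$(cat "$1/cpuset.cpus.partition") || return 3

    case $state in
        # e.g. "root invalid (<reason>)"
        *invalid*)     echo "invalid: $state";         return 1 ;;
        root|isolated) echo "valid: $state";           return 0 ;;
        # e.g. "member"
        *)             echo "not a partition: $state"; return 2 ;;
    esac
}
```

Running `partition_state /sys/fs/cgroup/A1` right after the partition is
set up reports whether it actually became valid (again, the path is just
the conventional mount point plus the cpuset name from the example).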
Cheers,
Longman