Linux Kernel Selftest development
From: Waiman Long <llong@redhat.com>
To: "Michal Koutný" <mkoutny@suse.com>,
	"Sun Shaojie" <sunshaojie@kylinos.cn>
Cc: llong@redhat.com, cgroups@vger.kernel.org,
	chenridong@huaweicloud.com, hannes@cmpxchg.org,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	shuah@kernel.org, tj@kernel.org
Subject: Re: [PATCH v6] cpuset: Avoid invalidating sibling partitions on cpuset.cpus conflict.
Date: Tue, 23 Dec 2025 01:03:42 -0500
Message-ID: <5b53f9ec-ebd5-4bea-b6a3-ef35a467e96c@redhat.com>
In-Reply-To: <bzu7va4de6ylaww2xbq67hztyokpui7qm2zcqtiwjlniyvx7dt@wf47lg6etmas>

On 12/22/25 10:26 AM, Michal Koutný wrote:
> Hello Shaojie.
>
> On Mon, Dec 01, 2025 at 05:38:06PM +0800, Sun Shaojie <sunshaojie@kylinos.cn> wrote:
>> Currently, when setting a cpuset's cpuset.cpus to a value that conflicts
>> with its sibling partition, the sibling's partition state becomes invalid.
>> However, this invalidation is often unnecessary.
>>
>> For example: On a machine with 128 CPUs, there are m (m < 128) cpusets
>> under the root cgroup. Each cpuset is used by a single user(user-1 use
>> A1, ... , user-m use Am), and the partition states of these cpusets are
>> configured as follows:
>>
>>                             root cgroup
>>          /             /                  \                 \
>>         A1            A2        ...       An                Am
>>       (root)        (root)      ...     (root) (root/root invalid/member)
>>
>> Assume that A1 through Am have not set cpuset.cpus.exclusive. When
>> user-m modifies Am's cpuset.cpus to "0-127", it will cause all partition
>> states from A1 to An to change from root to root invalid, as shown
>> below.
>>
>>                             root cgroup
>>          /              /                 \                 \
>>         A1             A2       ...       An                Am
>>   (root invalid) (root invalid) ... (root invalid) (root invalid/member)
>>
>> This outcome is entirely undeserved for all users from A1 to An.
> s/cpuset.cpus/memory.max/
>
> When the permissions are such that the last (any) sibling can come and
> claim so much to cause overcommit, then it can set up large limit and
> (potentially) reclaim from others.
>
> s/cpuset.cpus/memory.min/
>
> Here is the overcommit approached by recalculating effective values of
> memory.min, again one sibling can skew toward itself and reduce every
> other's effective value.
>
> Above are not exact analogies because first of them is Limits, the
> second is Protections and cpusets are Allocations (referring to Resource
> Distribution Models from Documentation/admin-guide/cgroup-v2.rst).
>
> But the advice to get some guarantees would be same in all cases -- if
> some guarantees are expected, the permissions (of respective cgroup
> attributes) should be configured so that it decouples the owner of the
> cgroup from the owner of the resource (i.e. Ai/cpuset.cpus belongs to
> root or there's a middle level cgroup that'd cap each of the siblings
> individually).
>
From a sibling's point of view, CPUs in partitions are exclusive. A cpuset
either has all the requested CPUs to form a partition (assuming that at
least one can be granted from the parent cpuset), or it doesn't have all
of them and fails to form a valid partition. This is different from
memory, where a cgroup can get less memory than requested and still work
fine.

Anyway, I consider using cpuset.cpus to form a partition to be legacy; it
is supported for backward compatibility reasons. The proper way to form a
partition now is to use cpuset.cpus.exclusive, and writing to it can fail
if the value conflicts with siblings.

When only cpuset.cpus is used to form partitions, the cpuset.cpus value
is treated the same as cpuset.cpus.exclusive if a valid partition is
formed. In that sense, the examples listed in the patch will have the
same result if cpuset.cpus.exclusive is used instead of cpuset.cpus. The
difference is that a conflicting write to cpuset.cpus.exclusive fails,
whereas with cpuset.cpus the write succeeds and an invalid partition is
formed.
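
A minimal shell sketch of the cpuset.cpus.exclusive route described
above; it is not part of the patch under review. It assumes cgroup v2 is
mounted at /sys/fs/cgroup, that the cpuset controller is enabled for the
child cgroup, and that the caller may write to its files. CGROOT and
form_partition_exclusive are hypothetical names used only for
illustration.

```shell
#!/bin/sh
# Sketch only: CGROOT and form_partition_exclusive are hypothetical names.
# Assumes cgroup v2 with the cpuset controller enabled for the child.
CGROOT="${CGROOT:-/sys/fs/cgroup}"

# form_partition_exclusive <cgroup> <cpulist>
# Claim CPUs via cpuset.cpus.exclusive, then request partition root.
form_partition_exclusive() {
    cg="$CGROOT/$1"
    cpus="$2"

    # A value that conflicts with a sibling is expected to make this
    # write fail, which is the early feedback cpuset.cpus does not give.
    if ! printf '%s\n' "$cpus" > "$cg/cpuset.cpus.exclusive" 2>/dev/null; then
        echo "$1: cpuset.cpus.exclusive rejected '$cpus'" >&2
        return 1
    fi

    printf 'root\n' > "$cg/cpuset.cpus.partition" || return 1

    # Print the state as reported back by the cgroup file.
    cat "$cg/cpuset.cpus.partition"
}
```

For example, "form_partition_exclusive A1 0-1" prints the resulting
partition state on success and returns non-zero right away if the
exclusive CPU list is rejected.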

>> After applying this patch, the first party to set "root" will maintain
>> its exclusive validity. As follows:
>>
>>   Step                                       | A1's prstate | B1's prstate |
>>   #1> echo "0-1" > A1/cpuset.cpus            | member       | member       |
>>   #2> echo "root" > A1/cpuset.cpus.partition | root         | member       |
>>   #3> echo "1-2" > B1/cpuset.cpus            | root         | member       |
>>   #4> echo "root" > B1/cpuset.cpus.partition | root         | root invalid |
>>
>>   Step                                       | A1's prstate | B1's prstate |
>>   #1> echo "0-1" > B1/cpuset.cpus            | member       | member       |
>>   #2> echo "root" > B1/cpuset.cpus.partition | member       | root         |
>>   #3> echo "1-2" > A1/cpuset.cpus            | member       | root         |
>>   #4> echo "root" > A1/cpuset.cpus.partition | root invalid | root         |
> I'm worried that the ordering dependency would lead to situations where
> users may not be immediately aware their config is overcommitting the system.
> Consider that CPUs are vital for A1 but B1 can somehow survive the
> degraded state, depending on the starting order the system may either
> run fine (A1 valid) or fail because of A1.
>
> I'm curious about Waiman's take.

That is why I recommend that users use cpuset.cpus.exclusive to form
partitions, as they get early feedback if they are overcommitting. Of
course, setting cpuset.cpus.exclusive without failure still doesn't
guarantee the formation of a valid partition if none of the exclusive
CPUs can be granted from the parent.
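
Since a successful write is not the same as a valid partition, the state
can be read back afterwards. A minimal sketch, assuming cgroup v2 is
mounted at /sys/fs/cgroup; CGROOT, partition_state and
partition_is_valid are hypothetical names used only for illustration:

```shell
#!/bin/sh
# Sketch only: CGROOT, partition_state and partition_is_valid are
# hypothetical helper names.
CGROOT="${CGROOT:-/sys/fs/cgroup}"

# Print the current contents of <cgroup>/cpuset.cpus.partition.
partition_state() {
    cat "$CGROOT/$1/cpuset.cpus.partition"
}

# Return 0 only for a valid partition root ("root" or "isolated"),
# 1 for "member" or any "... invalid" state, 2 if the file is unreadable.
partition_is_valid() {
    state="$(partition_state "$1")" || return 2
    case "$state" in
        *invalid*)     return 1 ;;
        root|isolated) return 0 ;;
        *)             return 1 ;;
    esac
}
```

A caller could run "partition_is_valid A1 || partition_state A1" after
setting up A1 to see why the partition did not become valid.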

Cheers,
Longman


