* unexpected behaviour of cgroups v1 on 6.12 kernel
@ 2025-08-22 19:31 Chris Friesen
From: Chris Friesen @ 2025-08-22 19:31 UTC
To: cgroups, lizefan
Hi all,
I'm not subscribed to the list, so please CC me on replies.
I'm seeing some unexpected behaviour with the cpu/cpuset cgroups
controllers (with cgroups v1) on 6.12.18 with PREEMPT_RT enabled.
I set up the following cgroup hierarchy for both cpu and cpuset cgroups:
foo: cpu 15, shares 1024
foo/a: cpu 15, shares 1024
bar: cpu 15-19, shares 1024
bar/a: cpu 15, shares 1024
bar/b: cpu 16, shares 1024
bar/c: cpu 17, shares 1024
bar/d: cpu 18, shares 1024
bar/e: cpu 19, shares 1024
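For reference, the hierarchy was created along these lines (a sketch,
assuming cgroup v1 with the cpu and cpuset controllers mounted
separately at /sys/fs/cgroup/cpu and /sys/fs/cgroup/cpuset, and a
single memory node; exact mount points differ by distro):

  cd /sys/fs/cgroup/cpuset
  mkdir -p foo/a bar/a bar/b bar/c bar/d bar/e
  # parent cpus/mems must be set before the children's;
  # mems is node 0 here, adjust for your topology
  echo 0 > foo/cpuset.mems;   echo 15 > foo/cpuset.cpus
  echo 0 > foo/a/cpuset.mems; echo 15 > foo/a/cpuset.cpus
  echo 0 > bar/cpuset.mems;   echo 15-19 > bar/cpuset.cpus
  cpu=15
  for g in a b c d e; do
      echo 0 > bar/$g/cpuset.mems
      echo $cpu > bar/$g/cpuset.cpus
      cpu=$((cpu + 1))
  done

  # mirror the tree in the cpu controller; cpu.shares stays at 1024
  cd /sys/fs/cgroup/cpu
  mkdir -p foo/a bar/a bar/b bar/c bar/d bar/e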
I then ran a single cpu hog in each of the leaf-node cgroups in the
default SCHED_OTHER class.
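Each "hog" was a plain busy loop (or equivalent), moved into the
matching leaf of both hierarchies; e.g. for bar/b:

  ( while :; do :; done ) &
  echo $! > /sys/fs/cgroup/cpuset/bar/b/tasks
  echo $! > /sys/fs/cgroup/cpu/bar/b/tasks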
As expected, the tasks in bar/b, bar/c, bar/d, and bar/e each got 100%
of their CPU. What I didn't expect was that the task running in foo/a
got 83.3%, while the task in bar/a got 16.7%. Is this expected?
I guess what I'm asking here is whether the cgroups CPU share
calculation is supposed to be performed separately per CPU, or whether
it's global but somehow scaled by the number of CPUs that the cgroup is
runnable on, so that the total CPU time of group "bar" is expected to be
5x the total CPU time of group "foo".
I then killed all the tasks in bar/b, bar/c, bar/d, and bar/e. The
tasks in foo/a and bar/a continued for a while at 83/16, then moved to
80/20, and only about 75 seconds later finally moved to 50/50. Is
such a long "rebalance" time expected? If so, can it be tuned by the
admin at runtime, or is it inherent in the code?
As further data, if I have tasks in foo/a, bar/a, bar/b, bar/c then
foo/a gets 75%, bar/a gets 25%, bar/b and bar/c both get 100%.
If I have tasks in foo/a, bar/a, bar/b then foo/a gets 66%, bar/a gets
33%, bar/b gets 100%. (But it started out with foo/a getting 75% and
switched 10s of seconds later, which seems odd.)
Thanks,
Chris
* Re: unexpected behaviour of cgroups v1 on 6.12 kernel
From: Chen Ridong @ 2025-08-23 0:57 UTC
To: Chris Friesen, cgroups, lizefan
On 2025/8/23 3:31, Chris Friesen wrote:
> Hi all,
>
> I'm not subscribed to the list, so please CC me on replies.
>
> I'm seeing some unexpected behaviour with the cpu/cpuset cgroups controllers (with cgroups v1) on
> 6.12.18 with PREEMPT_RT enabled.
>
> I set up the following cgroup hierarchy for both cpu and cpuset cgroups:
>
> foo: cpu 15, shares 1024
> foo/a: cpu 15, shares 1024
>
> bar: cpu 15-19, shares 1024
> bar/a: cpu 15, shares 1024
> bar/b: cpu 16, shares 1024
> bar/c: cpu 17, shares 1024
> bar/d: cpu 18, shares 1024
> bar/e: cpu 19, shares 1024
>
> I then ran a single cpu hog in each of the leaf-node cgroups in the default SCHED_OTHER class.
>
> As expected, the tasks in bar/b, bar/c, bar/d, and bar/e each got 100% of their CPU. What I didn't
> expect was that the task running in foo/a got 83.3%, while the task in bar/a got 16.7%. Is this
> expected?
>
> I guess what I'm asking here is whether the cgroups CPU share calculation is supposed to be
> performed separately per CPU, or whether it's global but somehow scaled by the number of CPUs that
> the cgroup is runnable on, so that the total CPU time of group "bar" is expected to be 5x the total
> CPU time of group "foo".
>
Hello Chris,
First of all, cpu and cpuset are different control group (cgroup) subsystems. If I understand
correctly, the behavior you're observing is expected.
Have the CPU shares been configured as follows?
                      P
                    /   \
    (1024: 50%)  foo     bar  (1024: 50%)
                  |    / / | \ \
    (50%*100%)    a   a  b c  d e  (1024: 50%*20% each)
The cpu subsystem allocates CPU time proportionally based on share weights. In this case, foo and
bar are each expected to receive 50% of the total CPU time.
Within foo, subgroup a is configured to get 100% of foo's allocation, meaning it receives the full
50% of total CPU.
Within bar, subgroups a, b, c, d, and e each have a share weight of 20% relative to bar's total
allocation when all five have runnable tasks. This means each would get approximately 10% of the
total CPU time (i.e., 50% × 20%).
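Spelled out with the default weights, that model gives:

  foo/a: 1024/(1024+1024) * 1024/1024       = 50% of total CPU time
  bar/*: 1024/(1024+1024) * 1024/(5*1024)   = 10% of total CPU time each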
This behavior is specific to the cpu subsystem and is independent of cpuset.
> I then killed all the tasks in bar/b, bar/c, bar/d, and bar/e. The tasks in foo/a and bar/a
> continued for a while at 83/16, then moved to 80/20, and only about 75 seconds later finally moved
> to 50/50. Is such a long "rebalance" time expected? If so, can it be tuned by the
> admin at runtime, or is it inherent in the code?
>
> As further data, if I have tasks in foo/a, bar/a, bar/b, bar/c then foo/a gets 75%, bar/a gets 25%,
> bar/b and bar/c both get 100%.
>
> If I have tasks in foo/a, bar/a, bar/b then foo/a gets 66%, bar/a gets 33%, bar/b gets 100%. (But
> it started out with foo/a getting 75% and switched 10s of seconds later, which seems odd.)
>
> Thanks,
> Chris
--
Best regards,
Ridong
* Re: unexpected behaviour of cgroups v1 on 6.12 kernel
From: Chris Friesen @ 2025-08-27 17:16 UTC
To: Chen Ridong, cgroups, lizefan
On 8/22/2025 6:57 PM, Chen Ridong wrote:
> On 2025/8/23 3:31, Chris Friesen wrote:
>> Hi all,
>>
>> I'm not subscribed to the list, so please CC me on replies.
>>
>> I'm seeing some unexpected behaviour with the cpu/cpuset cgroups controllers (with cgroups v1) on
>> 6.12.18 with PREEMPT_RT enabled.
>>
>> I set up the following cgroup hierarchy for both cpu and cpuset cgroups:
>>
>> foo: cpu 15, shares 1024
>> foo/a: cpu 15, shares 1024
>>
>> bar: cpu 15-19, shares 1024
>> bar/a: cpu 15, shares 1024
>> bar/b: cpu 16, shares 1024
>> bar/c: cpu 17, shares 1024
>> bar/d: cpu 18, shares 1024
>> bar/e: cpu 19, shares 1024
>>
>> I then ran a single cpu hog in each of the leaf-node cgroups in the default SCHED_OTHER class.
>>
>> As expected, the tasks in bar/b, bar/c, bar/d, and bar/e each got 100% of their CPU. What I didn't
>> expect was that the task running in foo/a got 83.3%, while the task in bar/a got 16.7%. Is this
>> expected?
>>
>> I guess what I'm asking here is whether the cgroups CPU share calculation is supposed to be
>> performed separately per CPU, or whether it's global but somehow scaled by the number of CPUs that
>> the cgroup is runnable on, so that the total CPU time of group "bar" is expected to be 5x the total
>> CPU time of group "foo".
>>
>
> Hello Chris,
>
> First of all, cpu and cpuset are different control group (cgroup) subsystems.
Understood. I created identical cgroup hierarchies in both cpu and
cpuset, and moved the tasks into the equivalent cgroup in each subsystem.
> If I understand correctly, the behavior you're observing is expected.
>
> Have the CPU shares been configured as follows?
>                       P
>                     /   \
>     (1024: 50%)  foo     bar  (1024: 50%)
>                   |    / / | \ \
>     (50%*100%)    a   a  b c  d e  (1024: 50%*20% each)
cpu.shares is left at the default of 1024 for each cgroup, and your
tree diagram accurately represents the cgroup hierarchy.
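(Checked with something along the lines of:

  for g in foo foo/a bar bar/a bar/b bar/c bar/d bar/e; do
      echo -n "$g: "; cat /sys/fs/cgroup/cpu/$g/cpu.shares
  done

which reports 1024 for every group.)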
> The cpu subsystem allocates CPU time proportionally based on share weights. In this case, foo and
> bar are each expected to receive 50% of the total CPU time.
>
> Within foo, subgroup a is configured to get 100% of foo's allocation, meaning it receives the full
> 50% of total CPU.
>
> Within bar, subgroups a, b, c, d, and e each have a share weight of 20% relative to bar's total
> allocation when all five have runnable tasks. This means each would get approximately 10% of the
> total CPU time (i.e., 50% × 20%).
>
> This behavior is specific to the cpu subsystem and is independent of cpuset.
The reason I was asking about cpuset is that it matters whether the cpu
subsystem is evaluating CPU time globally for the system as a whole, or
separately for each CPU, or something more nuanced.
In the test above, only tasks in foo/a and bar/a can run on CPU 15. So
if the cgroup cpu.shares and cpuacct.usage were evaluated per-CPU, they
would be expected to get equal amounts of CPU. On the other hand, if
the CPU shares and usage were evaluated globally, we'd expect cgroup
"foo" to get *all* the runtime on CPU 15, since cgroup "bar" was getting
4 CPUs' worth of runtime on other CPUs.
Instead what we see is that foo/a gets 83.3% of CPU 15, while bar/a gets
16.7%. So there's something more complicated going on.
It looks like the kernel is saying that "bar" is runnable on 5 CPUs
while "foo" is runnable on 1 CPU, so "bar" as a whole should get 5x the
CPU usage of "foo" as a whole. (100% + 100% + 100% + 100% + 16.7% is
very close to 5x 83.3%)
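To make that concrete: let x be foo/a's fraction of CPU 15, so bar/a
gets (1-x) of CPU 15 and bar's remaining tasks get one full CPU each.
If "bar" as a whole is entitled to N times "foo"'s total time, where N
is the number of CPUs that bar has runnable tasks on, then:

  N=5:  4 + (1-x) = 5x  =>  x = 5/6 ~ 83.3%   (observed 83.3 / 16.7)
  N=3:  2 + (1-x) = 3x  =>  x = 3/4 = 75%     (observed 75 / 25)
  N=2:  1 + (1-x) = 2x  =>  x = 2/3 ~ 66.7%   (observed 66 / 33)

That reproduces all three of the steady-state splits above.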
Is the actual formula that is being used documented somewhere?
Thanks,
Chris