public inbox for linux-kernel@vger.kernel.org
From: Waiman Long <longman@redhat.com>
To: zhengzucheng <zhengzucheng@huawei.com>,
	peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	vschneid@redhat.com, oleg@redhat.com,
	Frederic Weisbecker <frederic@kernel.org>,
	mingo@kernel.org, peterx@redhat.com, tj@kernel.org,
	tjcao980311@gmail.com
Cc: linux-kernel@vger.kernel.org
Subject: Re: [Question] sched:the load is unbalanced in the VM overcommitment scenario
Date: Fri, 13 Sep 2024 13:17:15 -0400
Message-ID: <3fd8aa75-ce1b-4d5a-aada-0b2cfbedb36c@redhat.com>
In-Reply-To: <9982cb8d-9346-0640-dd9f-f68390f922e9@huawei.com>

On 9/13/24 00:03, zhengzucheng wrote:
> In a VM overcommitment scenario with an overcommitment ratio of 1:2,
> 8 CPUs are shared by 2 x 8u VMs, so 16 vCPUs are bound to 8 CPUs.
> However, one VM gets only about 2 CPUs' worth of CPU time, while the
> other VM gets about 6.
> The host is configured with 80 CPUs in a sched domain, and the other
> CPUs are idle.
> The root cause is that the load on the host is unbalanced: some vCPUs
> occupy a CPU exclusively.
> When the CPU that triggers load balancing calculates the imbalance
> value, it gets env->imbalance = 0 because
> local->avg_load > sds->avg_load. As a result, load balancing fails.
> The processing logic: 
> https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
>
>
> This is normal behavior for the kernel load balancer, but it is not
> reasonable from the perspective of VM users.
> In cgroup v1, setting cpuset.sched_load_balance=0 changes the sched
> domains and fixes it.
> Is there any other way to fix this problem? Thanks.
>
> Abstracted reproduction case:
> 1.environment information:
>
> [root@localhost ~]# cat /proc/schedstat
>
> cpu0
> domain0 00000000,00000000,00010000,00000000,00000001
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu1
> domain0 00000000,00000000,00020000,00000000,00000002
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu2
> domain0 00000000,00000000,00040000,00000000,00000004
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu3
> domain0 00000000,00000000,00080000,00000000,00000008
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
>
> 2.test case:
>
> vcpu.c
> #include <stdio.h>
> #include <unistd.h>
>
> int main()
> {
>         sleep(20);
>         while (1);
>         return 0;
> }
>
> gcc vcpu.c -o vcpu
> -----------------------------------------------------------------
> test.sh
>
> #!/bin/bash
>
> #vcpu1
> mkdir /sys/fs/cgroup/cpuset/vcpu_1
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
> for i in {1..8}
> do
>         ./vcpu &
>         pid=$!
>         sleep 1
>         echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
> done
>
> #vcpu2
> mkdir /sys/fs/cgroup/cpuset/vcpu_2
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
> for i in {1..8}
> do
>         ./vcpu &
>         pid=$!
>         sleep 1
>         echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
> done
> ------------------------------------------------------------------
> [root@localhost ~]# ./test.sh
>
> [root@localhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
>
> 14591 root      20   0    2448   1012    928 R 100.0   0.0 13:10.73 ./vcpu
> 14582 root      20   0    2448   1012    928 R 100.0   0.0 13:12.71 ./vcpu
> 14606 root      20   0    2448    872    784 R 100.0   0.0 13:09.72 ./vcpu
> 14620 root      20   0    2448    916    832 R 100.0   0.0 13:07.72 ./vcpu
> 14622 root      20   0    2448    920    836 R 100.0   0.0 13:06.72 ./vcpu
> 14629 root      20   0    2448    920    832 R 100.0   0.0 13:05.72 ./vcpu
> 14643 root      20   0    2448    924    836 R  21.0   0.0 2:37.13 ./vcpu
> 14645 root      20   0    2448    868    784 R  21.0   0.0 2:36.51 ./vcpu
> 14589 root      20   0    2448    900    816 R  20.0   0.0 2:45.16 ./vcpu
> 14608 root      20   0    2448    956    872 R  20.0   0.0 2:42.24 ./vcpu
> 14632 root      20   0    2448    872    788 R  20.0   0.0 2:38.08 ./vcpu
> 14638 root      20   0    2448    924    840 R  20.0   0.0 2:37.48 ./vcpu
> 14652 root      20   0    2448    928    844 R  20.0   0.0 2:36.42 ./vcpu
> 14654 root      20   0    2448    924    840 R  20.0   0.0 2:36.14 ./vcpu
> 14663 root      20   0    2448    900    816 R  20.0   0.0 2:35.38 ./vcpu
> 14669 root      20   0    2448    868    784 R  20.0   0.0 2:35.70 ./vcpu
>
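As a rough check of the numbers above, the %CPU column can be summed
per cpuset. This is only a sketch: it assumes the PIDs were handed out
in start order, so that PIDs below 14630 belong to vcpu_1 and the rest
to vcpu_2, which the output itself does not confirm.
sum_cpu_by_group is a made-up helper name, not an existing tool.

```shell
#!/bin/sh
# Sum the %CPU column (field 9 of a top process line) for two groups.
# Assumption: PIDs below the split PID belong to vcpu_1 and all others
# to vcpu_2. This is inferred from the start order in test.sh only.
sum_cpu_by_group() {
    awk -v split_pid="$1" '
        # Only process lines that start with a numeric PID.
        $1 ~ /^[0-9]+$/ {
            if ($1 < split_pid) a += $9; else b += $9
        }
        END { printf "vcpu_1=%.1f vcpu_2=%.1f\n", a, b }'
}
```

It reads top lines on stdin, e.g.
"top -b -n 1 -p $(pgrep -d',' -f vcpu) | sum_cpu_by_group 14630".
With the 16 lines quoted above and that PID assumption, it reports
640.0 for the first group and 162.0 for the second, i.e. roughly 6.4
versus 1.6 CPUs.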
Your script creates two cpusets with the same set of CPUs. The
scheduling of the tasks, however, is not controlled by cpuset. It is
controlled by the cpu cgroup. I suppose that all these tasks are in the
same cpu cgroup. It is possible that the commit you mentioned has caused
some unfairness in allocating CPU time to different processes within the
same cpu cgroup. Maybe you can try putting them into separate cpu
cgroups with equal weight as well, to see if that improves the
scheduling fairness?
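
Something along these lines could be used to try it. It is a minimal
sketch, assuming cgroup v1 with the cpu controller mounted at
/sys/fs/cgroup/cpu (adjust CPU_ROOT if your mount point differs). The
helper names make_cpu_group and join_cpu_group are made up for this
example.

```shell
#!/bin/sh
# Assumption: cgroup v1, cpu controller mounted here.
CPU_ROOT="${CPU_ROOT:-/sys/fs/cgroup/cpu}"

# Create cpu cgroup $1 and set its weight (cpu.shares) to $2.
make_cpu_group() {
    mkdir -p "$CPU_ROOT/$1"
    echo "$2" > "$CPU_ROOT/$1/cpu.shares"
}

# Move the task with PID $2 into cpu cgroup $1.
join_cpu_group() {
    echo "$2" > "$CPU_ROOT/$1/tasks"
}
```

In test.sh you would call "make_cpu_group vcpu_1 1024" and
"make_cpu_group vcpu_2 1024" once, then "join_cpu_group vcpu_1 $pid"
(or vcpu_2) next to the existing write to the cpuset tasks file. 1024
is the default cpu.shares value. What matters is that both groups get
the same number.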

BTW, you don't actually need to use 2 different cpusets if they all get
the same set of CPUs and memory nodes. Also, setting
cpuset.sched_load_balance=0 may not actually get you what you want
unless all the cpusets that use those CPUs, including the root cgroup,
have cpuset.sched_load_balance set to 0. Turning off this flag may
disable load balancing, but it may not guarantee fairness. That depends
on which CPUs those tasks are running on when they start, unless you
explicitly assign CPUs to them when starting these tasks.
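
One way to do the explicit assignment is to pin each task with
taskset(1) from util-linux. The sketch below is an illustration only:
cpu_for_task and pin_task are made-up helper names, and the round-robin
placement is just one possible policy.

```shell
#!/bin/sh
# Print the CPU for the task with 0-based index $1, cycling through
# the CPU numbers given as the remaining arguments.
cpu_for_task() {
    idx=$1
    shift
    # $# is now the number of CPUs; drop (idx mod count) entries.
    shift $(( idx % $# ))
    echo "$1"
}

# Pin the already-running task with PID $1 to the single CPU $2.
pin_task() {
    taskset -cp "$2" "$1"
}
```

For example, inside the loops of test.sh you could run
"pin_task $pid $(cpu_for_task $n 0 1 2 3 80 81 82 83)", with n counting
from 0 to 15 across both loops, so that every CPU ends up with one task
from each group. Do the pinning after the PID has been written to the
cpuset tasks file, since attaching a task to a cpuset may reset its
affinity mask.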

Cheers,
Longman

