From: Waiman Long <longman@redhat.com>
To: zhengzucheng <zhengzucheng@huawei.com>,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, oleg@redhat.com,
Frederic Weisbecker <frederic@kernel.org>,
mingo@kernel.org, peterx@redhat.com, tj@kernel.org,
tjcao980311@gmail.com
Cc: linux-kernel@vger.kernel.org
Subject: Re: [Question] sched:the load is unbalanced in the VM overcommitment scenario
Date: Fri, 13 Sep 2024 13:17:15 -0400
Message-ID: <3fd8aa75-ce1b-4d5a-aada-0b2cfbedb36c@redhat.com>
In-Reply-To: <9982cb8d-9346-0640-dd9f-f68390f922e9@huawei.com>

On 9/13/24 00:03, zhengzucheng wrote:
> In a VM overcommitment scenario with a 1:2 overcommitment ratio, 8 CPUs
> are overcommitted to 2 x 8u VMs, so 16 vCPUs are bound to 8 CPUs.
> However, one VM obtains only 2 CPUs' worth of resources, while the
> other VM gets 6 CPUs.
> The host is configured with 80 CPUs in a sched domain, and the other
> CPUs are idle.
> The root cause is that the load on the host is unbalanced: some vCPUs
> occupy CPUs exclusively.
> When the CPU that triggers load balancing calculates the imbalance
> value, it computes env->imbalance = 0 because
> local->avg_load > sds->avg_load. As a result, the load balance fails.
> The processing logic:
> https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
>
>
> This is normal from the kernel load balancer's point of view, but it is
> not reasonable from the perspective of VM users.
> In cgroup v1, setting cpuset.sched_load_balance=0 to modify the sched
> domain fixes it.
> Is there any other method to fix this problem? Thanks.
>
> Abstracted reproduction case:
> 1. Environment information:
>
> [root@localhost ~]# cat /proc/schedstat
>
> cpu0
> domain0 00000000,00000000,00010000,00000000,00000001
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu1
> domain0 00000000,00000000,00020000,00000000,00000002
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu2
> domain0 00000000,00000000,00040000,00000000,00000004
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu3
> domain0 00000000,00000000,00080000,00000000,00000008
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
>
> 2. Test case:
>
> vcpu.c
> #include <stdio.h>
> #include <unistd.h>
>
> int main(void)
> {
> 	sleep(20);	/* idle for 20 s before generating load */
> 	while (1)	/* then spin forever as a CPU-bound "vCPU" */
> 		;
> 	return 0;
> }
>
> gcc vcpu.c -o vcpu
> -----------------------------------------------------------------
> test.sh
>
> #!/bin/bash
>
> #vcpu1
> mkdir /sys/fs/cgroup/cpuset/vcpu_1
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
> for i in {1..8}
> do
>     ./vcpu &
>     pid=$!
>     sleep 1
>     echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
> done
>
> #vcpu2
> mkdir /sys/fs/cgroup/cpuset/vcpu_2
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
> for i in {1..8}
> do
>     ./vcpu &
>     pid=$!
>     sleep 1
>     echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
> done
> ------------------------------------------------------------------
> [root@localhost ~]# ./test.sh
>
> [root@localhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
>
> 14591 root 20 0 2448 1012 928 R 100.0 0.0 13:10.73 ./vcpu
> 14582 root 20 0 2448 1012 928 R 100.0 0.0 13:12.71 ./vcpu
> 14606 root 20 0 2448 872 784 R 100.0 0.0 13:09.72 ./vcpu
> 14620 root 20 0 2448 916 832 R 100.0 0.0 13:07.72 ./vcpu
> 14622 root 20 0 2448 920 836 R 100.0 0.0 13:06.72 ./vcpu
> 14629 root 20 0 2448 920 832 R 100.0 0.0 13:05.72 ./vcpu
> 14643 root 20 0 2448 924 836 R 21.0 0.0 2:37.13 ./vcpu
> 14645 root 20 0 2448 868 784 R 21.0 0.0 2:36.51 ./vcpu
> 14589 root 20 0 2448 900 816 R 20.0 0.0 2:45.16 ./vcpu
> 14608 root 20 0 2448 956 872 R 20.0 0.0 2:42.24 ./vcpu
> 14632 root 20 0 2448 872 788 R 20.0 0.0 2:38.08 ./vcpu
> 14638 root 20 0 2448 924 840 R 20.0 0.0 2:37.48 ./vcpu
> 14652 root 20 0 2448 928 844 R 20.0 0.0 2:36.42 ./vcpu
> 14654 root 20 0 2448 924 840 R 20.0 0.0 2:36.14 ./vcpu
> 14663 root 20 0 2448 900 816 R 20.0 0.0 2:35.38 ./vcpu
> 14669 root 20 0 2448 868 784 R 20.0 0.0 2:35.70 ./vcpu
>
Your script creates two cpusets with the same set of CPUs. The
scheduling of the tasks, however, is not controlled by cpuset. It is
controlled by the cpu cgroup. I suppose that all these tasks are in the
same cpu cgroup. It is possible that the commit you mentioned has
caused some unfairness in allocating CPU time to different processes
within the same cpu cgroup. Maybe you can try putting them into separate
cpu cgroups as well, with equal weight, to see if that improves the
scheduling fairness?

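A minimal sketch of the separate-cpu-cgroup idea, to be adapted rather than used as is. It assumes the cgroup v1 cpu controller is mounted at /sys/fs/cgroup/cpu (on many systems that path is a symlink to a combined cpu,cpuacct mount). The group names vm_1 and vm_2 and the helper names make_cpu_group and add_task are hypothetical. 1024 is the default cpu.shares value, so what matters here is only that both groups get the same weight.

```shell
#!/bin/bash
# Sketch only: assumes a cgroup v1 cpu controller mount; override CPU_ROOT
# if it lives elsewhere. Helper and group names are hypothetical.
CPU_ROOT=${CPU_ROOT:-/sys/fs/cgroup/cpu}

# make_cpu_group ROOT NAME SHARES
# Create cpu cgroup NAME under ROOT and set its cpu.shares weight.
make_cpu_group() {
    mkdir -p "$1/$2" || return 1
    echo "$3" > "$1/$2/cpu.shares"
}

# add_task ROOT NAME PID
# Move task PID into cpu cgroup NAME under ROOT.
add_task() {
    echo "$3" > "$1/$2/tasks"
}

# Intended use inside test.sh (needs root, so left commented out):
#   make_cpu_group "$CPU_ROOT" vm_1 1024
#   make_cpu_group "$CPU_ROOT" vm_2 1024
#   ... then, next to each existing cpuset "tasks" write:
#   add_task "$CPU_ROOT" vm_1 "$pid"    # first loop
#   add_task "$CPU_ROOT" vm_2 "$pid"    # second loop
```

With equal cpu.shares, the two groups compete for CPU time as two equally weighted entities wherever their tasks share a CPU. Whether that is enough to even out the 6:2 split shown above would need to be checked by rerunning the test.
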
BTW, you don't actually need to use 2 different cpusets if they all get
the same set of CPUs and memory nodes. Also, setting
cpuset.sched_load_balance=0 may not actually get you what you want
unless all the cpusets that use those CPUs, including the root cgroup,
have cpuset.sched_load_balance set to 0. Turning off this flag may
disable load balancing, but it may not guarantee fairness. Without load
balancing, the result depends on which CPUs those tasks happen to be
running on when they start, unless you explicitly assign CPUs to them
when starting these tasks.

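One way to make the explicit assignment concrete is sketched below, under assumptions. The CPU list is copied from the cpuset.cpus value in test.sh, the helper name cpu_for_task is hypothetical, and taskset(1) from util-linux is assumed to be installed. The helper maps a zero-based task index to a CPU in round-robin order, so 16 tasks over 8 CPUs land two per CPU.

```shell
#!/bin/bash
# Sketch only: CPU list taken from the cpuset.cpus setting in test.sh.
CPUS="0 1 2 3 80 81 82 83"

# cpu_for_task INDEX
# Print the CPU for the task with zero-based INDEX, round-robin over $CPUS.
cpu_for_task() {
    set -- "$1" $CPUS          # $1 = index, remaining args = the CPU list
    idx=$1
    shift                      # positional parameters are now only the CPUs
    shift $(( idx % $# ))      # skip ahead to the selected CPU
    echo "$1"
}

# Intended use when launching the tasks (left commented out):
#   for i in $(seq 0 15); do
#       taskset -c "$(cpu_for_task "$i")" ./vcpu &
#   done
```

If tasks 0-7 belong to one VM and tasks 8-15 to the other, each CPU ends up with one task from each VM, which is the even placement that disabling load balancing alone does not give you.
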
Cheers,
Longman