public inbox for linux-kernel@vger.kernel.org
* group scheduler regression since 4.3 (bisect 9d89c257d sched/fair: Rewrite runnable load and utilization average tracking)
@ 2016-09-26 10:42 Christian Borntraeger
  2016-09-26 10:56 ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Christian Borntraeger @ 2016-09-26 10:42 UTC (permalink / raw)
  To: Yuyang Du; +Cc: Peter Zijlstra, Ingo Molnar, Linux Kernel Mailing List

Folks,

I have seen big scalability degradations since 4.3 (bisected to 9d89c257d
"sched/fair: Rewrite runnable load and utilization average tracking").
This has not been fixed by subsequent patches, e.g. the ones that try to
fix this for interactive workloads.

The problem is only visible for sleep/wakeup-heavy workloads, and only
when they are part of a scheduler group (e.g. a sysbench OLTP run inside
a KVM guest, as libvirt puts KVM guests into cgroup instances).
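
Whether a given task is actually placed in a cpu cgroup can be read from
/proc/<pid>/cgroup. The following is only a minimal sketch, assuming the
cgroup v1 line format "hierarchy-ID:controller-list:cgroup-path"; the helper
name cpu_cgroup_path is made up for illustration and is not part of any
kernel or libvirt tooling.

```python
def cpu_cgroup_path(proc_cgroup_text):
    """Return the cgroup path of the v1 'cpu' controller, or None if absent.

    Expects the contents of /proc/<pid>/cgroup, one line per hierarchy in
    the form 'hierarchy-ID:controller-list:cgroup-path' (assumed v1 layout).
    """
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue  # skip malformed or empty lines
        controllers = parts[1].split(",")
        if "cpu" in controllers:
            return parts[2]
    return None
```

Feeding it the contents of /proc/<pid>/cgroup for a QEMU thread should give
a non-root path while group scheduling is in use. A result of "/" or None
would suggest the task is not in a separate cpu group.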

For example, a simple sysbench OLTP run with mysql inside a KVM guest with
16 CPUs backed by 8 host CPUs (16 host threads) scales worse when scaling
up the number of instances inside the guest. The numbers below are
events per second.
Unmounting /sys/fs/cgroup/cpu,cpuacct (thus forcing libvirt to not
use group scheduling for KVM guests) makes the behaviour much better:


instances	group		nogroup
1		3406		3002
2		5078		4940
3		6017		6760
4		6471		8216 (+27%)
5		6716		9196
6		6976		9783
7		7127		10170
8		7399		10385 (+40%)
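
The percentages in the table can be reproduced for every row with a few
lines of Python. This is just a sketch over the numbers quoted above; the
function name nogroup_gain_percent is chosen here for illustration.

```python
# Events per second from the table above: (instances, group, nogroup).
RESULTS = [
    (1, 3406, 3002),
    (2, 5078, 4940),
    (3, 6017, 6760),
    (4, 6471, 8216),
    (5, 6716, 9196),
    (6, 6976, 9783),
    (7, 7127, 10170),
    (8, 7399, 10385),
]


def nogroup_gain_percent(group, nogroup):
    """Relative throughput change of nogroup versus group, in percent."""
    return (nogroup - group) / group * 100.0


for instances, group, nogroup in RESULTS:
    print(f"{instances}\t{nogroup_gain_percent(group, nogroup):+.1f}%")
```

With one or two instances group scheduling is still slightly ahead; from
three instances on the nogroup column wins and the gap keeps growing.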

Before 9d89c257d ("sched/fair: Rewrite runnable load and utilization
average tracking") there was basically no difference between group
and non-group scheduling. These numbers are with 4.7; older kernels after
9d89c257d show a similar difference.

The bad thing is that there is a lot of idle CPU power in the host
when this happens, so the scheduler does not seem to realize that this
workload could use more CPUs in the host.

I tried some experiments, but I have not found a hack that "fixes" the
degradation, which would give me an indication of which part of the code
is broken. So are there any ideas? Is the estimated group load
calculation just not fast enough for sleep/wakeup workloads?

Christian

Thread overview: 9+ messages
2016-09-26 10:42 group scheduler regression since 4.3 (bisect 9d89c257d sched/fair: Rewrite runnable load and utilization average tracking) Christian Borntraeger
2016-09-26 10:56 ` Peter Zijlstra
2016-09-26 11:42   ` Christian Borntraeger
2016-09-26 11:53     ` Peter Zijlstra
2016-09-26 12:01       ` Christian Borntraeger
2016-09-26 12:10         ` Peter Zijlstra
2016-09-26 12:49           ` Christian Borntraeger
2016-09-26 14:12           ` Christian Borntraeger
2016-09-26 12:25   ` Vincent Guittot
