Date: Mon, 26 Sep 2016 12:56:21 +0200
From: Peter Zijlstra
To: Christian Borntraeger
Cc: Yuyang Du, Ingo Molnar, Linux Kernel Mailing List, vincent.guittot@linaro.org, Morten.Rasmussen@arm.com, dietmar.eggemann@arm.com, pjt@google.com, bsegall@google.com
Subject: Re: group scheduler regression since 4.3 (bisect 9d89c257d sched/fair: Rewrite runnable load and utilization average tracking)
Message-ID: <20160926105621.GZ5016@twins.programming.kicks-ass.net>
References: <45222b6f-4849-f1f4-fdf5-2a26ac9a3ed4@de.ibm.com>
In-Reply-To: <45222b6f-4849-f1f4-fdf5-2a26ac9a3ed4@de.ibm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Sep 26, 2016 at 12:42:22PM +0200, Christian Borntraeger wrote:
> Folks,
>
> I have seen big scalability degradations since 4.3 (bisected 9d89c257d
> sched/fair: Rewrite runnable load and utilization average tracking)
> This has not been fixed by subsequent patches, e.g. the ones that try to
> fix this for interactive workload.
>
> The problem is only visible for sleep/wakeup heavy workload which must
> be part of the scheduler group (e.g. a sysbench OLTP inside a KVM guest
> as libvirt will put KVM guests into cgroup instances).
>
> For example a simple sysbench oltp with mysql inside a KVM guest with
> 16 CPUs backed by 8 host cpus (16 host threads) scales less (scale up
> inside a guest, having multiple instances). These are the numbers of
> events per second.
> Unmounting /sys/fs/cgroup/cpu,cpuacct (thus forcing libvirt to not
> use group scheduling for KVM guests) makes the behaviour much better:
>
>
> instances    group    nogroup
>     1         3406       3002
>     2         5078       4940
>     3         6017       6760
>     4         6471       8216 (+27%)
>     5         6716       9196
>     6         6976       9783
>     7         7127      10170
>     8         7399      10385 (+40%)
>
> before 9d89c257d ("sched/fair: Rewrite runnable load and utilization
> average tracking") there was basically no difference between group
> or non-group scheduling. These numbers are with 4.7, older kernels after
> 9d89c257d show a similar difference.
>
> The bad thing is that there is a lot of idle cpu power in the host
> when this happens so the scheduler seems to not realize that this
> workload could use more cpus in the host.
>
> I tried some experiments, but I have not found a hack that "fixes" the
> degradation, which would give me an indication which part of the code
> is broken. So are there any ideas? Is the estimated group load
> calculation just not fast enough for sleep/wakeup workload?

One of the differences between the old and new thing is being addressed
by these patches:

  https://lkml.kernel.org/r/1473666472-13749-1-git-send-email-vincent.guittot@linaro.org

Could you see if those patches make a difference? If not, we'll have to
go poke elsewhere, of course ;-)
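
As a quick sanity check on the quoted table, the relative difference between the group and nogroup columns can be recomputed for every row. This is only a minimal sketch, assuming the figures are events per second as stated in the report; the names `results` and `relative_gain` are made up for illustration and do not come from the thread.

```python
# Events per second from the quoted table: instances -> (group, nogroup).
results = {
    1: (3406, 3002),
    2: (5078, 4940),
    3: (6017, 6760),
    4: (6471, 8216),
    5: (6716, 9196),
    6: (6976, 9783),
    7: (7127, 10170),
    8: (7399, 10385),
}


def relative_gain(group, nogroup):
    """Percentage change of the nogroup throughput relative to group."""
    return (nogroup - group) / group * 100.0


# Hypothetical helper dict: per-instance-count gain of nogroup over group.
gains = {n: relative_gain(g, ng) for n, (g, ng) in results.items()}

for n, (g, ng) in results.items():
    print(f"{n:>2} {g:>6} {ng:>6} {gains[n]:+6.1f}%")
```

With one or two instances group scheduling is slightly ahead; from three instances on the nogroup numbers pull away, which matches the +27% and +40% annotations in the report.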