From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1030403AbcIZK4b (ORCPT <rfc822;w@1wt.eu>);
        Mon, 26 Sep 2016 06:56:31 -0400
Received: from merlin.infradead.org ([205.233.59.134]:35006 "EHLO
        merlin.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S964818AbcIZK4a (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Mon, 26 Sep 2016 06:56:30 -0400
Date: Mon, 26 Sep 2016 12:56:21 +0200
From: Peter Zijlstra <peterz@infradead.org>
To: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Yuyang Du <yuyang.du@intel.com>, Ingo Molnar <mingo@kernel.org>,
        Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
        vincent.guittot@linaro.org, Morten.Rasmussen@arm.com,
        dietmar.eggemann@arm.com, pjt@google.com, bsegall@google.com
Subject: Re: group scheduler regression since 4.3 (bisect 9d89c257d
 sched/fair: Rewrite runnable load and utilization average tracking)
Message-ID: <20160926105621.GZ5016@twins.programming.kicks-ass.net>
References: <45222b6f-4849-f1f4-fdf5-2a26ac9a3ed4@de.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <45222b6f-4849-f1f4-fdf5-2a26ac9a3ed4@de.ibm.com>
User-Agent: Mutt/1.5.23.1 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Sep 26, 2016 at 12:42:22PM +0200, Christian Borntraeger wrote:
> Folks,
> 
> I have seen big scalability degradations since 4.3 (bisected to 9d89c257d
> sched/fair: Rewrite runnable load and utilization average tracking).
> This has not been fixed by subsequent patches, e.g. the ones that try to
> fix this for interactive workloads.
> 
> The problem is only visible for sleep/wakeup-heavy workloads, and only
> when they are part of a scheduler group (e.g. a sysbench OLTP inside a
> KVM guest, as libvirt will put KVM guests into cgroup instances).
> 
> For example, a simple sysbench OLTP run with mysql inside a KVM guest
> with 16 CPUs, backed by 8 host CPUs (16 host threads), scales worse
> (scaling up inside a guest by running multiple instances). These are
> the numbers of events per second.
> Unmounting /sys/fs/cgroup/cpu,cpuacct (thus forcing libvirt to not
> use group scheduling for KVM guests) makes the behaviour much better:
> 
> 
> instances	group		nogroup
> 1		3406		3002
> 2		5078		4940
> 3		6017		6760
> 4		6471		8216 (+27%)
> 5		6716		9196
> 6		6976		9783
> 7		7127		10170
> 8		7399		10385 (+40%)
> 
> Before 9d89c257d ("sched/fair: Rewrite runnable load and utilization
> average tracking") there was basically no difference between group
> and non-group scheduling. These numbers are with 4.7; older kernels after
> 9d89c257d show a similar difference.
> 
> The bad thing is that there is a lot of idle cpu power in the host
> when this happens so the scheduler seems to not realize that this
> workload could use more cpus in the host.
> 
> I tried some experiments, but I have not found a hack that "fixes" the
> degradation, which would give me an indication of which part of the code
> is broken. So are there any ideas? Is the estimated group load
> calculation just not fast enough for sleep/wakeup workloads?

One of the differences between the old and the new code is being
addressed by these patches:

  https://lkml.kernel.org/r/1473666472-13749-1-git-send-email-vincent.guittot@linaro.org

Could you see if those patches make a difference? If not, we'll have to
go poke elsewhere of course ;-)
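
For reference, the percentages in the quoted table can be recomputed
from the raw events-per-second figures. The following is only a small
sketch, not part of the original report: the helper name relative_gain
is made up for illustration, and the data is copied verbatim from the
table above.

```python
# Events per second for 1..8 sysbench instances, copied from the quoted table.
group = [3406, 5078, 6017, 6471, 6716, 6976, 7127, 7399]
nogroup = [3002, 4940, 6760, 8216, 9196, 9783, 10170, 10385]


def relative_gain(with_group, without_group):
    """Percentage by which the nogroup figure exceeds the group figure.

    Hypothetical helper, introduced here for illustration only.
    """
    return (without_group - with_group) / with_group * 100.0


gains = [relative_gain(g, n) for g, n in zip(group, nogroup)]

for instances, gain in enumerate(gains, start=1):
    print(f"{instances} instances: {gain:+.0f}%")
```

With these numbers, group scheduling is ahead for one and two
instances, and the nogroup advantage grows from three instances
onwards, matching the +27% and +40% annotations at four and eight
instances.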