From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753260AbbCWUVw (ORCPT <rfc822;w@1wt.eu>);
	Mon, 23 Mar 2015 16:21:52 -0400
Received: from service87.mimecast.com ([91.220.42.44]:53968 "EHLO
	service87.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752221AbbCWUVt convert rfc822-to-8bit (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Mon, 23 Mar 2015 16:21:49 -0400
Message-ID: <551075D9.2040409@arm.com>
Date: Mon, 23 Mar 2015 20:21:45 +0000
From: Dietmar Eggemann <dietmar.eggemann@arm.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0
MIME-Version: 1.0
To: Peter Zijlstra <peterz@infradead.org>,
        Morten Rasmussen <Morten.Rasmussen@arm.com>
CC: Sai Gurrappadi <sgurrappadi@nvidia.com>,
        "mingo@redhat.com" <mingo@redhat.com>,
        "vincent.guittot@linaro.org" <vincent.guittot@linaro.org>,
        "yuyang.du@intel.com" <yuyang.du@intel.com>,
        "preeti@linux.vnet.ibm.com" <preeti@linux.vnet.ibm.com>,
        "mturquette@linaro.org" <mturquette@linaro.org>,
        "nico@linaro.org" <nico@linaro.org>,
        "rjw@rjwysocki.net" <rjw@rjwysocki.net>,
        Juri Lelli <Juri.Lelli@arm.com>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        Peter Boonstoppel <pboonstoppel@nvidia.com>
Subject: Re: [RFCv3 PATCH 30/48] sched: Calculate energy consumption of sched_group
References: <1423074685-6336-1-git-send-email-morten.rasmussen@arm.com> <1423074685-6336-31-git-send-email-morten.rasmussen@arm.com> <55036AA1.7000801@nvidia.com> <20150316141546.GQ4081@e105550-lin.cambridge.arm.com> <20150323164702.GL23123@twins.programming.kicks-ass.net>
In-Reply-To: <20150323164702.GL23123@twins.programming.kicks-ass.net>
X-OriginalArrivalTime: 23 Mar 2015 20:21:44.0945 (UTC) FILETIME=[FB562E10:01D065A6]
X-MC-Unique: 115032320214701301
Content-Type: text/plain; charset=WINDOWS-1252; format=flowed
Content-Transfer-Encoding: 8BIT
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 23/03/15 16:47, Peter Zijlstra wrote:
> On Mon, Mar 16, 2015 at 02:15:46PM +0000, Morten Rasmussen wrote:
>> You are absolutely right. The current code is broken for system
>> topologies where all cpus share the same clock source. To be honest, it
>> is actually worse than that and you already pointed out the reason. We
>> don't have a way of representing top level contributions to power
>> consumption in RFCv3, as we don't have sched_group spanning all cpus in
>> single cluster system. For example, we can't represent L2 cache and
>> interconnect power consumption on such systems.
>>
>> In RFCv2 we had a system wide sched_group dangling by itself for that
>> purpose. We chose to remove that in this rewrite as it led to messy
>> code. In my opinion, a more elegant solution is to introduce an
>> additional sched_domain above the current top level which has a single
>> sched_group spanning all cpus in the system. That should fix the
>> SD_SHARE_CAP_STATES problem and allow us to attach power data for the
>> top level.
>
> Maybe remind us why this needs to be tied to sched_groups ? Why can't we
> attach the energy information to the domains?

Currently, on our two-cluster (big.LITTLE) system (cluster0: big cpus, 
cluster1: little cpus), we attach energy information to all sg's at the 
MC sd level (cpu/core related energy data) and at the DIE sd level 
(cluster related energy data).

For the MC level (cpus sharing the same u-arch), attaching the energy 
information onto the sd is clearly much easier than attaching it onto 
the individual sg's.

But on the DIE level, when we want to figure out the cluster energy data 
for a cluster represented by an sg other than the first sg (sg0), then we 
would have to access its cluster energy data via the DIE sd of one of 
the cpus of this cluster. I haven't seen code actually doing that in CFS.

IMHO, the current code is always iterating over the sg's of the sd and 
accessing either sg (sched_group) or sg->sgc (sched_group_capacity) 
data. Our energy data follows the sched_group_capacity example.
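
To make the access pattern above concrete, here is a minimal userspace
sketch of walking the sg's of an sd and reading data hanging off each sg.
The structs are simplified, hypothetical stand-ins (not the real kernel
layouts), and struct group_energy, its busy_power field and
sd_sum_group_power() are names made up purely for illustration:

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the real kernel structures. */
struct group_energy {
	unsigned long busy_power;	/* illustrative per-group figure */
};

struct sched_group {
	struct sched_group *next;	/* circular list of groups */
	struct group_energy *sge;	/* energy data attached to the group */
};

struct sched_domain {
	struct sched_group *groups;	/* first group (sg0) of this domain */
};

/*
 * Walk every group of @sd once and sum the per-group figure.
 * Data of any group, not only sg0, is reachable from the same sd,
 * because it hangs off the group rather than off the domain.
 */
static unsigned long sd_sum_group_power(const struct sched_domain *sd)
{
	const struct sched_group *sg = sd->groups;
	unsigned long sum = 0;

	if (!sg)
		return 0;

	do {
		if (sg->sge)
			sum += sg->sge->busy_power;
		sg = sg->next;
	} while (sg != sd->groups);

	return sum;
}
```

With the data on the sd instead, reaching the figures of a cluster other
than sg0 would require going through the sd of one of that cluster's cpus.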

> There is an additional problem with groups you've not yet discovered and
> that is overlapping groups. Certain NUMA topologies result in this.
> There the sum of cpus over the groups is greater than the total cpus in
> the domain.

Yeah, we haven't tried EAS on such a system, nor have we enabled the 
FORCE_SD_OVERLAP sched feature for a long time.
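
For reference, the overlap condition Peter describes (sum of cpus over
the groups greater than the total cpus in the domain) can be sketched in
plain C as below. This is only an illustration using 64-bit masks with
one bit per cpu; mask_weight() and groups_overlap() are hypothetical
helpers, not kernel functions:

```c
#include <stddef.h>
#include <stdint.h>

/* Number of set bits in a cpu mask (one bit per cpu, up to 64 cpus). */
static unsigned int mask_weight(uint64_t mask)
{
	unsigned int n = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Return 1 if the groups cover more cpus in total than the domain
 * contains, i.e. at least one cpu is counted in more than one group.
 */
static int groups_overlap(uint64_t domain_span,
			  const uint64_t *group_spans, size_t nr_groups)
{
	unsigned int sum = 0;
	size_t i;

	for (i = 0; i < nr_groups; i++)
		sum += mask_weight(group_spans[i] & domain_span);

	return sum > mask_weight(domain_span);
}
```

Any per-group energy sum taken over such a domain would count the shared
cpus more than once.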