Date: Fri, 20 Dec 2024 15:48:09 +0100
X-Mailing-List: linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched/fair: Decrease util_est in presence of idle time
To:
Vincent Guittot, Pierre Gondois
Cc: linux-kernel@vger.kernel.org, Christian Loehle, Hongyan Xia,
 Ingo Molnar, Peter Zijlstra, Juri Lelli, Steven Rostedt, Ben Segall,
 Mel Gorman, Valentin Schneider
References: <20241219091207.2001051-1-pierre.gondois@arm.com>
From: Dietmar Eggemann

On 20/12/2024 08:47, Vincent Guittot wrote:
> On Thu, 19 Dec 2024 at 18:53, Vincent Guittot wrote:
>>
>> On Thu, 19 Dec 2024 at 10:12, Pierre Gondois wrote:
>>>
>>> The util_est signal does not decay if the task utilization is lower
>>> than its runnable signal by a value of 10. This was done to keep
>>
>> The value of 10 is UTIL_EST_MARGIN, which is used to decide whether
>> it's worth updating util_est.

Might it be that UTIL_EST_MARGIN is simply too small for this use case?
Maybe the mechanism is too sensitive? It already triggers when running
ten 5% tasks on a Juno-r0 (446 1024 1024 446 446 446) in cases where 2
tasks are scheduled on the same little CPU:

...
task_n7-7-2623 [003] nr_queued=2 dequeued=17 rbl=40
task_n9-9-2625 [003] nr_queued=2 dequeued=13 rbl=29
task_n9-9-2625 [004] nr_queued=2 dequeued=23 rbl=55
task_n9-9-2625 [004] nr_queued=2 dequeued=22 rbl=53
...

I'm not sure whether the original case (Speedometer on Pixel 6?) which
led to this implementation was tested with perf/energy numbers back
then.

>>> the util_est signal high in case a task shares a rq with another
>>> task and doesn't obtain a desired running time.
>>>
>>> However, tasks sharing a rq obtain the running time they desire
>>> provided that the rq has some idle time. Indeed, either:
>>> - a CPU is always running. The utilization signal of tasks reflects
>>>   the running time they obtained. This running time depends on the
>>>   niceness of the tasks. A decreasing utilization signal doesn't
>>>   reflect a decrease of the task activity and the util_est signal
>>>   should not be decayed in this case.
>>> - a CPU is not always running (i.e. there is some idle time). Tasks
>>>   might be waiting to run, increasing their runnable signal, but
>>>   eventually run to completion. A decreasing utilization signal
>>>   does reflect a decrease of the task activity and the util_est
>>>   signal should be decayed in this case.
>>
>> This is not always true.
>> Run a task for 40ms with a period of 100ms alone on the biggest CPU
>> at max compute capacity. Its util_avg is up to 674 at dequeue, as is
>> its util_est.
>> Then start a 2nd task with the exact same behavior on the same CPU.
>> The util_avg of this 2nd task will be only 496 at dequeue, as will
>> its util_est, but there is still 20ms of idle time. Furthermore, the
>> util_avg of the 1st task is also around 496 at dequeue but
>
> the end of the sentence was missing...
>
> but there is still 20ms of idle time.

But these two tasks are still able to finish their activity within this
100ms window. So why should we keep their util_est values high when
dequeuing?

[...]

>>> The initial patch [2] aimed to solve an issue detected while running
>>> Speedometer 2.0 [3]. While running Speedometer 2.0 on a Pixel 6, 3
>>> versions are compared:
>>> - base: the current version
>>
>> What do you mean by current version? tip/sched/core?
>>
>>> - patch: the new version, with this patch applied
>>> - revert: the initial version, with commit [2] reverted
>>>
>>> Score (higher is better):
>>> ┌────────────┬────────────┬─────────────┬─────────────┬──────────────┐
>>> │ base mean  ┆ patch mean ┆ revert mean ┆ ratio_patch ┆ ratio_revert │
>>> ╞════════════╪════════════╪═════════════╪═════════════╪══════════════╡
>>> │ 108.16     ┆ 104.06     ┆ 105.82      ┆ -3.94%      ┆ -2.16%       │
>>> └────────────┴────────────┴─────────────┴─────────────┴──────────────┘
>>> ┌────────────┬────────────┬────────────┐
>>> │ base std   ┆ patch std  ┆ revert std │
>>> ╞════════════╪════════════╪════════════╡
>>> │ 0.57       ┆ 0.49       ┆ 0.58       │
>>> └────────────┴────────────┴────────────┘
>>>
>>> Energy measured with energy counters:
>>> ┌────────────┬────────────┬─────────────┬─────────────┬──────────────┐
>>> │ base mean  ┆ patch mean ┆ revert mean ┆ ratio_patch ┆ ratio_revert │
>>> ╞════════════╪════════════╪═════════════╪═════════════╪══════════════╡
>>> │ 141262.79  ┆ 130630.09  ┆ 134108.07   ┆ -7.52%      ┆ -5.64%       │
>>> └────────────┴────────────┴─────────────┴─────────────┴──────────────┘
>>> ┌────────────┬────────────┬────────────┐
>>> │ base std   ┆ patch std  ┆ revert std │
>>> ╞════════════╪════════════╪════════════╡
>>> │ 1347.13    ┆ 2431.67    ┆ 510.88     │
>>> └────────────┴────────────┴────────────┘
>>>
>>> Energy computed from util signals and energy model:
>>> ┌────────────┬────────────┬─────────────┬─────────────┬──────────────┐
>>> │ base mean  ┆ patch mean ┆ revert mean ┆ ratio_patch ┆ ratio_revert │
>>> ╞════════════╪════════════╪═════════════╪═════════════╪══════════════╡
>>> │ 2.0539e12  ┆ 1.3569e12  ┆ 1.3637e12   ┆ -33.93%     ┆ -33.60%      │
>>> └────────────┴────────────┴─────────────┴─────────────┴──────────────┘
>>> ┌────────────┬────────────┬────────────┐
>>> │ base std   ┆ patch std  ┆ revert std │
>>> ╞════════════╪════════════╪════════════╡
>>> │ 2.9206e10  ┆ 2.5434e10  ┆ 1.7106e10  │
>>> └────────────┴────────────┴────────────┘
>>>
>>> OU ratio in % (ratio of time being overutilized over total time).
>>> The test lasts ~65s:
>>> ┌────────────┬────────────┬─────────────┐
>>> │ base mean  ┆ patch mean ┆ revert mean │
>>> ╞════════════╪════════════╪═════════════╡
>>> │ 63.39%     ┆ 12.48%     ┆ 12.28%      │
>>> └────────────┴────────────┴─────────────┘
>>> ┌────────────┬────────────┬────────────┐
>>> │ base std   ┆ patch std  ┆ revert std │
>>> ╞════════════╪════════════╪════════════╡
>>> │ 0.97       ┆ 0.28       ┆ 0.88       │
>>> └────────────┴────────────┴────────────┘
>>>
>>> The energy gain can be explained by the fact that the system is
>>> overutilized during most of the test with the base version.
>>>
>>> During the test, the base condition is evaluated to true ~40%
>>> of the time. The new condition is evaluated to true ~2% of
>>> the time. Preventing util_est signals from decaying with the base
>>> condition has a significant impact on the overutilized state
>>> due to an overestimation of the resulting utilization of tasks.
>>>
>>> The score is impacted by the patch, but:
>>> - slightly lower scores are expected with EAS running more often
>>> - the base version makes the system run at higher frequencies by
>>>   overestimating task utilization, so higher scores are expected
>>>   with it
>>
>> I'm not sure I get what you are trying to solve here.

Yeah, the question is how much perf loss we accept for energy savings.
IMHO, that's impossible to answer generically based on one specific
workload/platform incarnation.

[...]

>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 3e9ca38512de..d058ab29e52e 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5033,7 +5033,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
>>>  	 * To avoid underestimate of task utilization, skip updates of EWMA if
>>>  	 * we cannot grant that thread got all CPU time it wanted.
>>>  	 */
>>> -	if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
>>> +	if (rq_no_idle_pelt(rq_of(cfs_rq)))
>>
>> You can't use here the test that is done in
>> update_idle_rq_clock_pelt() to detect whether we lost some idle
>> time, because this test is only relevant when the rq becomes idle,
>> which is not the case here.

Do you mean this test?

  util_avg = util_sum / divider
  util_sum >= divider * util_avg

With 'divider = LOAD_AVG_MAX - 1024' and 'util_avg = 1024 - 1' and the
upper bound of the window (+ 1024):

  util_sum >= ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX

Why can't we use it here?

>> With this test you skip completely the cases where the task has to
>> share the CPU with others. As an example on the Pixel 6, the little

True. But I assume that's anticipated here. The assumption is that as
long as there is idle time, tasks get what they want within a time
frame.

>> cpus must run more than 1.2 seconds at their max freq before
>> detecting that there is no idle time

BTW, I tried to figure out where the 1.2s comes from:

  323ms * 1024/160 = 2.07s (with CPU capacity of Pixel 5 little CPU = 160)?

[...]
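To make the test under discussion concrete, here is a small self-contained
sketch (not the kernel code itself) of the "rq has no idle time" condition
derived from update_idle_rq_clock_pelt(). The standalone helper taking a
plain util_sum instead of a struct rq is a simplification for illustration;
LOAD_AVG_MAX = 47742 and SCHED_CAPACITY_SHIFT = 10 match the kernel's PELT
constants:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Constants as used by the kernel's PELT code. */
#define LOAD_AVG_MAX          47742
#define SCHED_CAPACITY_SHIFT  10

/*
 * Simplified stand-in for the check in update_idle_rq_clock_pelt():
 * the rq had no idle time when its util_sum is saturated, i.e. when
 * util_avg = util_sum / (LOAD_AVG_MAX - 1024) would reach ~1023,
 * with util_sum kept in SCHED_CAPACITY-scaled units.
 */
static bool rq_no_idle_pelt(uint32_t util_sum)
{
	uint32_t divider = ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT)
			   - LOAD_AVG_MAX;

	return util_sum >= divider;
}

int run_demo(void)
{
	/* divider = 46718 * 1024 - 47742 = 47791490 */
	assert(rq_no_idle_pelt(47791490));   /* saturated: no idle time  */
	assert(!rq_no_idle_pelt(47791489));  /* one below: still idle    */
	assert(!rq_no_idle_pelt(0));         /* fully idle rq            */
	return 0;
}
```

Because util_sum only saturates once the CPU has been busy for roughly a
whole PELT window, this condition flips late; that latency is what the
remark about little CPUs needing more than 1.2s at max freq refers to.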
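For comparison, the current upstream condition that the patch replaces can
be modelled with a simplified sketch of util_est's EWMA decay at dequeue.
util_est_decay() is a hypothetical flattened helper, not the actual
util_est_update() (which orders its checks differently and applies a
variable shift); UTIL_EST_MARGIN = 10 and UTIL_EST_WEIGHT_SHIFT = 2 match
kernel/sched/fair.c:

```c
#include <assert.h>

#define UTIL_EST_MARGIN        10  /* SCHED_CAPACITY_SCALE / 100 */
#define UTIL_EST_WEIGHT_SHIFT  2   /* new sample weighted 1/4    */

/*
 * Simplified model of the util_est EWMA update at dequeue: skip the
 * decay when the task's runnable signal exceeds its utilization by
 * more than UTIL_EST_MARGIN, i.e. when the task likely did not get
 * all the CPU time it wanted.
 */
static unsigned int util_est_decay(unsigned int ewma, unsigned int dequeued,
				   unsigned int runnable)
{
	/* Current upstream condition: keep util_est of starved tasks. */
	if ((dequeued + UTIL_EST_MARGIN) < runnable)
		return ewma;

	/* Utilization grew: track it directly. */
	if (ewma <= dequeued)
		return dequeued;

	/* ewma := ewma - (ewma - dequeued) / 4 */
	return ewma - ((ewma - dequeued) >> UTIL_EST_WEIGHT_SHIFT);
}
```

With this model, a task dequeued at util 100 but with runnable 200 keeps
its ewma of 400 untouched, while the same task with runnable 105 decays
to 400 - 300/4 = 325; the patch under review replaces the per-task
runnable check with the per-rq idle-time check.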