Date: Fri, 20 Dec 2024 15:48:09 +0100
X-Mailing-List: linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched/fair: Decrease util_est in presence of idle time
To:
Vincent Guittot, Pierre Gondois
Cc: linux-kernel@vger.kernel.org, Christian Loehle, Hongyan Xia,
 Ingo Molnar, Peter Zijlstra, Juri Lelli, Steven Rostedt, Ben Segall,
 Mel Gorman, Valentin Schneider
References: <20241219091207.2001051-1-pierre.gondois@arm.com>
From: Dietmar Eggemann

On 20/12/2024 08:47, Vincent Guittot wrote:
> On Thu, 19 Dec 2024 at 18:53, Vincent Guittot wrote:
>>
>> On Thu, 19 Dec 2024 at 10:12, Pierre Gondois wrote:
>>>
>>> The util_est signal does not decay if the task utilization is lower
>>> than its runnable signal by a value of 10. This was done to keep
>>
>> The value of 10 is UTIL_EST_MARGIN, which is used to decide whether
>> it's worth updating util_est.

Might it be that UTIL_EST_MARGIN is simply too small for this use case?
Maybe the mechanism is too sensitive? It already triggers when running
ten 5% tasks on a Juno-r0 (446 1024 1024 446 446 446) in cases where 2
tasks are scheduled on the same little CPU:

...
task_n7-7-2623 [003] nr_queued=2 dequeued=17 rbl=40
task_n9-9-2625 [003] nr_queued=2 dequeued=13 rbl=29
task_n9-9-2625 [004] nr_queued=2 dequeued=23 rbl=55
task_n9-9-2625 [004] nr_queued=2 dequeued=22 rbl=53
...

I'm not sure whether the original case (Speedometer on Pixel 6?) which
led to this implementation was tested with perf/energy numbers back
then.

>>> the util_est signal high in case a task shares a rq with another
>>> task and doesn't obtain a desired running time.
>>>
>>> However, tasks sharing a rq obtain the running time they desire
>>> provided that the rq has some idle time. Indeed, either:
>>> - a CPU is always running. The utilization signal of tasks reflects
>>>   the running time they obtained. This running time depends on the
>>>   niceness of the tasks. A decreasing utilization signal doesn't
>>>   reflect a decrease of the task activity and the util_est signal
>>>   should not be decayed in this case.
>>> - a CPU is not always running (i.e. there is some idle time). Tasks
>>>   might be waiting to run, increasing their runnable signal, but
>>>   eventually run to completion. A decreasing utilization signal
>>>   does reflect a decrease of the task activity and the util_est
>>>   signal should be decayed in this case.
>>
>> This is not always true.
>> Run a task for 40ms with a period of 100ms alone on the biggest CPU
>> at max compute capacity. Its util_avg is up to 674 at dequeue, as is
>> its util_est.
>> Then start a 2nd task with the exact same behavior on the same CPU.
>> The util_avg of this 2nd task will be only 496 at dequeue, as will
>> its util_est, but there is still 20ms of idle time. Furthermore, the
>> util_avg of the 1st task is also around 496 at dequeue but
>
> the end of the sentence was missing...
>
> but there is still 20ms of idle time.

But these two tasks are still able to finish their activity within this
100ms window. So why should we keep their util_est values high when
dequeuing?

[...]

>>> The initial patch [2] aimed to solve an issue detected while running
>>> Speedometer 2.0 [3]. While running Speedometer 2.0 on a Pixel 6, 3
>>> versions are compared:
>>> - base: the current version
>>
>> What do you mean by current version? tip/sched/core?
>>
>>> - patch: the new version, with this patch applied
>>> - revert: the initial version, with commit [2] reverted
>>>
>>> Score (higher is better):
>>> ┌────────────┬────────────┬─────────────┬─────────────┬──────────────┐
>>> │ base mean  ┆ patch mean ┆ revert mean ┆ ratio_patch ┆ ratio_revert │
>>> ╞════════════╪════════════╪═════════════╪═════════════╪══════════════╡
>>> │ 108.16     ┆ 104.06     ┆ 105.82      ┆ -3.94%      ┆ -2.16%       │
>>> └────────────┴────────────┴─────────────┴─────────────┴──────────────┘
>>> ┌────────────┬────────────┬────────────┐
>>> │ base std   ┆ patch std  ┆ revert std │
>>> ╞════════════╪════════════╪════════════╡
>>> │ 0.57       ┆ 0.49       ┆ 0.58       │
>>> └────────────┴────────────┴────────────┘
>>>
>>> Energy measured with energy counters:
>>> ┌────────────┬────────────┬─────────────┬─────────────┬──────────────┐
>>> │ base mean  ┆ patch mean ┆ revert mean ┆ ratio_patch ┆ ratio_revert │
>>> ╞════════════╪════════════╪═════════════╪═════════════╪══════════════╡
>>> │ 141262.79  ┆ 130630.09  ┆ 134108.07   ┆ -7.52%      ┆ -5.64%       │
>>> └────────────┴────────────┴─────────────┴─────────────┴──────────────┘
>>> ┌────────────┬────────────┬────────────┐
>>> │ base std   ┆ patch std  ┆ revert std │
>>> ╞════════════╪════════════╪════════════╡
>>> │ 1347.13    ┆ 2431.67    ┆ 510.88     │
>>> └────────────┴────────────┴────────────┘
>>>
>>> Energy computed from util signals and energy model:
>>> ┌────────────┬────────────┬─────────────┬─────────────┬──────────────┐
>>> │ base mean  ┆ patch mean ┆ revert mean ┆ ratio_patch ┆ ratio_revert │
>>> ╞════════════╪════════════╪═════════════╪═════════════╪══════════════╡
>>> │ 2.0539e12  ┆ 1.3569e12  ┆ 1.3637e12   ┆ -33.93%     ┆ -33.60%      │
>>> └────────────┴────────────┴─────────────┴─────────────┴──────────────┘
>>> ┌────────────┬────────────┬────────────┐
>>> │ base std   ┆ patch std  ┆ revert std │
>>> ╞════════════╪════════════╪════════════╡
>>> │ 2.9206e10  ┆ 2.5434e10  ┆ 1.7106e10  │
>>> └────────────┴────────────┴────────────┘
>>>
>>> OU ratio in % (ratio of time being overutilized over total time).
>>> The test lasts ~65s:
>>> ┌────────────┬────────────┬─────────────┐
>>> │ base mean  ┆ patch mean ┆ revert mean │
>>> ╞════════════╪════════════╪═════════════╡
>>> │ 63.39%     ┆ 12.48%     ┆ 12.28%      │
>>> └────────────┴────────────┴─────────────┘
>>> ┌────────────┬────────────┬────────────┐
>>> │ base std   ┆ patch std  ┆ revert std │
>>> ╞════════════╪════════════╪════════════╡
>>> │ 0.97       ┆ 0.28       ┆ 0.88       │
>>> └────────────┴────────────┴────────────┘
>>>
>>> The energy gain can be explained by the fact that the system is
>>> overutilized during most of the test with the base version.
>>>
>>> During the test, the base condition is evaluated to true ~40%
>>> of the time. The new condition is evaluated to true ~2% of
>>> the time. Preventing util_est signals from decaying with the base
>>> condition has a significant impact on the overutilized state
>>> due to an overestimation of the resulting utilization of tasks.
>>>
>>> The score is impacted by the patch, but:
>>> - slightly lower scores are expected with EAS running more often
>>> - the base version makes the system run at higher frequencies by
>>>   overestimating task utilization, so higher scores are expected
>>>   with it
>>
>> I'm not sure I get what you are trying to solve here.

Yeah, the question is how much perf loss we accept for energy savings.
IMHO, that's impossible to answer generically based on one specific
workload/platform incarnation.

[...]

>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 3e9ca38512de..d058ab29e52e 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5033,7 +5033,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
>>>  	 * To avoid underestimate of task utilization, skip updates of EWMA if
>>>  	 * we cannot grant that thread got all CPU time it wanted.
>>>  	 */
>>> -	if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
>>> +	if (rq_no_idle_pelt(rq_of(cfs_rq)))
>>
>> You can't use here the test that is done in
>> update_idle_rq_clock_pelt() to detect whether we lost some idle
>> time, because this test is only relevant when the rq becomes idle,
>> which is not the case here.

Do you mean this test?

  util_avg = util_sum / divider
  util_sum >= divider * util_avg

With 'divider = LOAD_AVG_MAX - 1024' and 'util_avg = 1024 - 1' and the
upper bound of the window (+ 1024):

  util_sum >= ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX

Why can't we use it here?

>> With this test you skip completely the cases where the task has to
>> share the CPU with others. As an example on the Pixel 6, the little

True. But I assume that's anticipated here. The assumption is that as
long as there is idle time, tasks get what they want within a time
frame.

>> cpus must run more than 1.2 seconds at their max freq before
>> detecting that there is no idle time

BTW, I tried to figure out where the 1.2s comes from:

  323ms * 1024/160 = 2.07s (with CPU capacity of Pixel 5 little CPU = 160)?

[...]
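To make the test under discussion concrete, here is a small self-contained
sketch (not the kernel code itself) of the "rq has no idle time" condition
derived from update_idle_rq_clock_pelt(). The standalone helper taking a
plain util_sum instead of a struct rq is a simplification for illustration;
LOAD_AVG_MAX = 47742 and SCHED_CAPACITY_SHIFT = 10 match the kernel's PELT
constants:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Constants as used by the kernel's PELT code. */
#define LOAD_AVG_MAX          47742
#define SCHED_CAPACITY_SHIFT  10

/*
 * Simplified stand-in for the check in update_idle_rq_clock_pelt():
 * the rq had no idle time when its util_sum is saturated, i.e. when
 * util_avg = util_sum / (LOAD_AVG_MAX - 1024) would reach ~1023,
 * with util_sum kept in SCHED_CAPACITY-scaled units.
 */
static bool rq_no_idle_pelt(uint32_t util_sum)
{
	uint32_t divider = ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT)
			   - LOAD_AVG_MAX;

	return util_sum >= divider;
}

int run_demo(void)
{
	/* divider = 46718 * 1024 - 47742 = 47791490 */
	assert(rq_no_idle_pelt(47791490));   /* saturated: no idle time  */
	assert(!rq_no_idle_pelt(47791489));  /* one below: still idle    */
	assert(!rq_no_idle_pelt(0));         /* fully idle rq            */
	return 0;
}
```

Because util_sum only saturates once the CPU has been busy for roughly a
whole PELT window, this condition flips late; that latency is what the
remark about little CPUs needing more than 1.2s at max freq refers to.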
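For comparison, the current upstream condition that the patch replaces can
be modelled with a simplified sketch of util_est's EWMA decay at dequeue.
util_est_decay() is a hypothetical flattened helper, not the actual
util_est_update() (which orders its checks differently and applies a
variable shift); UTIL_EST_MARGIN = 10 and UTIL_EST_WEIGHT_SHIFT = 2 match
kernel/sched/fair.c:

```c
#include <assert.h>

#define UTIL_EST_MARGIN        10  /* SCHED_CAPACITY_SCALE / 100 */
#define UTIL_EST_WEIGHT_SHIFT  2   /* new sample weighted 1/4    */

/*
 * Simplified model of the util_est EWMA update at dequeue: skip the
 * decay when the task's runnable signal exceeds its utilization by
 * more than UTIL_EST_MARGIN, i.e. when the task likely did not get
 * all the CPU time it wanted.
 */
static unsigned int util_est_decay(unsigned int ewma, unsigned int dequeued,
				   unsigned int runnable)
{
	/* Current upstream condition: keep util_est of starved tasks. */
	if ((dequeued + UTIL_EST_MARGIN) < runnable)
		return ewma;

	/* Utilization grew: track it directly. */
	if (ewma <= dequeued)
		return dequeued;

	/* ewma := ewma - (ewma - dequeued) / 4 */
	return ewma - ((ewma - dequeued) >> UTIL_EST_WEIGHT_SHIFT);
}
```

With this model, a task dequeued at util 100 but with runnable 200 keeps
its ewma of 400 untouched, while the same task with runnable 105 decays
to 400 - 300/4 = 325; the patch under review replaces the per-task
runnable check with the per-rq idle-time check.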