Linux Power Management development
 help / color / mirror / Atom feed
From: Hongyan Xia <hongyan.xia2@arm.com>
To: Qais Yousef <qyousef@layalina.io>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	Viresh Kumar <viresh.kumar@linaro.org>,
	Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Juri Lelli <juri.lelli@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Christian Loehle <christian.loehle@arm.com>,
	linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4] sched: Consolidate cpufreq updates
Date: Fri, 17 May 2024 12:06:45 +0100	[thread overview]
Message-ID: <b4036b48-7d04-4bba-b405-f64ee309e874@arm.com> (raw)
In-Reply-To: <20240516204802.846520-1-qyousef@layalina.io>

On 16/05/2024 21:48, Qais Yousef wrote:
> Improve the interaction with cpufreq governors by making the
> cpufreq_update_util() calls more intentional.
> 
> At the moment we send them when load is updated for CFS, bandwidth for
> DL and at enqueue/dequeue for RT. But this can lead to too many updates
> sent in a short period of time and potentially be ignored at a critical
> moment due to the rate_limit_us in schedutil.
> 
> For example, simultaneous task enqueue on the CPU where 2nd task is
> bigger and requires higher freq. The trigger to cpufreq_update_util() by
> the first task will lead to dropping the 2nd request until tick. Or
> another CPU in the same policy triggers a freq update shortly after.
> 
> Updates at enqueue for RT are not strictly required. Though they do help
> to reduce the delay for switching the frequency and the potential
> observation of lower frequency during this delay. But current logic
> doesn't intentionally (at least to my understanding) try to speed up the
> request.
> 
> To help reduce the amount of cpufreq updates and make them more
> purposeful, consolidate them into these locations:
> 
> 1. context_switch()
> 2. task_tick_fair()
> 3. update_blocked_averages()
> 4. on syscall that changes policy or uclamp values
> 
> The update at context switch should help guarantee that DL and RT get
> the right frequency straightaway when they're RUNNING. As mentioned
> though the update will happen slightly after enqueue_task(); though in
> an ideal world these tasks should be RUNNING ASAP and this additional
> delay should be negligible. For fair tasks we need to make sure we send
> a single update for every decay for the root cfs_rq. Any changes to the
> rq will be deferred until the next task is ready to run, or we hit TICK.
> But we are guaranteed the task is running at a level that meets its
> requirements after enqueue.
> 
> To guarantee RT and DL tasks updates are never missed, we add a new
> SCHED_CPUFREQ_FORCE_UPDATE to ignore the rate_limit_us. If we are
> already running at the right freq, the governor will end up doing
> nothing, but we eliminate the risk of the task ending up accidentally
> running at the wrong freq due to rate_limit_us.
> 
> Similarly for iowait boost, we ignore rate limits. We also handle a case
> of a boost reset prematurely by adding a guard in sugov_iowait_apply()
> to reduce the boost after 1ms which seems iowait boost mechanism relied
> on rate_limit_us and cfs_rq.decay preventing any updates to happen soon
> after iowait boost.
> 
> The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit
> time stamps otherwise we can end up delaying updates for normal
> requests.
> 
> As a simple optimization, we avoid sending cpufreq updates when
> switching from RT to another RT as RT tasks run at max freq by default.
> If CONFIG_UCLAMP_TASK is enabled, we can do a simple check to see if
> uclamp_min is different to avoid unnecessary cpufreq update as most RT
> tasks are likely to be running at the same performance level, so we can
> avoid unnecessary overhead of forced updates when there's nothing to do.
> 
> We also ensure to ignore cpufreq udpates for sugov workers at context
> switch. It doesn't make sense for the kworker that applies the frequency
> update (which is a DL task) to trigger a frequency update itself.
> 
> The update at task_tick_fair will guarantee that the governor will
> follow any updates to load for tasks/CPU or due to new enqueues/dequeues
> to the rq. Since DL and RT always run at constant frequencies and have
> no load tracking, this is only required for fair tasks.
> 
> The update at update_blocked_averages() will ensure we decay frequency
> as the CPU becomes idle for long enough.
> 
> If the currently running task changes its policy or uclamp values, we
> ensure we follow up with cpufreq update to ensure we follow up with any
> potential new perf requirements based on the new change.
> 
> [...]
> 
> diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> index bdd31ab93bc5..2d0a45aba16f 100644
> --- a/include/linux/sched/cpufreq.h
> +++ b/include/linux/sched/cpufreq.h
> @@ -8,7 +8,8 @@
>    * Interface between cpufreq drivers and the scheduler:
>    */
>   
> -#define SCHED_CPUFREQ_IOWAIT	(1U << 0)
> +#define SCHED_CPUFREQ_IOWAIT		(1U << 0)
> +#define SCHED_CPUFREQ_FORCE_UPDATE	(1U << 1) /* ignore transition_delay_us */
>   
>   #ifdef CONFIG_CPU_FREQ
>   struct cpufreq_policy;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1a914388144a..d0c97a66627a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -152,6 +152,9 @@ const_debug unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;
>   
>   __read_mostly int scheduler_running;
>   
> +static __always_inline void
> +update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev);
> +
>   #ifdef CONFIG_SCHED_CORE
>   
>   DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
> @@ -1958,7 +1961,7 @@ static bool uclamp_reset(const struct sched_attr *attr,
>   	return false;
>   }
>   
> -static void __setscheduler_uclamp(struct task_struct *p,
> +static void __setscheduler_uclamp(struct rq *rq, struct task_struct *p,
>   				  const struct sched_attr *attr)
>   {
>   	enum uclamp_id clamp_id;
> @@ -1980,7 +1983,6 @@ static void __setscheduler_uclamp(struct task_struct *p,
>   			value = uclamp_none(clamp_id);
>   
>   		uclamp_se_set(uc_se, value, false);
> -
>   	}
>   
>   	if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
> @@ -1997,6 +1999,13 @@ static void __setscheduler_uclamp(struct task_struct *p,
>   		uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
>   			      attr->sched_util_max, true);
>   	}
> +
> +	/*
> +	 * Updating uclamp values has impact on freq, ensure it is taken into
> +	 * account.
> +	 */
> +	if (task_current(rq, p))
> +		update_cpufreq_ctx_switch(rq, NULL);

Do we care about updating the frequency here? p is dequeued during the 
__setscheduler_uclamp() call, so I think it's better to do this after 
the uclamp() call and after enqueue_task(), so that uclamp_rq_inc() 
comes into effect.

Also, do we want to limit the update to task_current()? Updating a 
uclamp_min of a task on this rq (even though it is not current) should 
raise the minimum OPP for the whole rq. An example is that if a 
uclamp_min task gets enqueued, the uclamp_min should kick in even if 
this task isn't immediately run and the current task isn't this task.

>   }
>   
>   static void uclamp_fork(struct task_struct *p)
> [...]

  reply	other threads:[~2024-05-17 11:06 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-16 20:48 [PATCH v4] sched: Consolidate cpufreq updates Qais Yousef
2024-05-17 11:06 ` Hongyan Xia [this message]
2024-05-29  1:11   ` Qais Yousef

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b4036b48-7d04-4bba-b405-f64ee309e874@arm.com \
    --to=hongyan.xia2@arm.com \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=christian.loehle@arm.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=qyousef@layalina.io \
    --cc=rafael@kernel.org \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=viresh.kumar@linaro.org \
    --cc=vschneid@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox