public inbox for linux-kernel@vger.kernel.org
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Qais Yousef <qyousef@layalina.io>, Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	"Rafael J. Wysocki"	 <rafael@kernel.org>,
	Viresh Kumar <viresh.kumar@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	 John Stultz <jstultz@google.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	"Chen, Yu C"	 <yu.c.chen@intel.com>,
	Thomas Gleixner <tglx@kernel.org>,
		linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org
Subject: Re: [PATCH v2 RFC 08/13] sched/qos: Add a new sched-qos interface
Date: Wed, 06 May 2026 13:38:26 -0700	[thread overview]
Message-ID: <2b9fd875df1f71d2c12c21938784a6c1fd38c04a.camel@linux.intel.com> (raw)
In-Reply-To: <20260504020003.71306-9-qyousef@layalina.io>

On Mon, 2026-05-04 at 02:59 +0100, Qais Yousef wrote:
> Provide a generic and extensible interface to describe arbitrary QoS
> tags to tell the kernel about specific behavior that doesn't fall
> into the existing sched_attr.
> 
> The interface is broken into three parts:
> 
> * Type
> * Value
> * Cookie
> 
> Type is an enum that should give us enough space to extend (and
> deprecate) comfortably.
> 
> Value is a signed 64bit number to allow for arbitrary high values.
> 
> Cookie is to help group tasks selectively so that a QoS can operate on
> tasks per group. A value of 0 indicates system wide.
> 
> There are two anticipated users being discussed on the list.
> 
> 1. Per task rampup multiplier to allow controlling how fast util rises,
>    and by implication how fast a task can migrate between cores on HMP
>    systems and cause freqs to rise with schedutil.
> 
> 2. Tag a group of tasks that are memory dependent for Cache Aware
>    Scheduling.
> 
> The interface is anticipated to be provisioned to apps via utilities and
> libraries. schedqos [1] is an example of how such an interface can be
> used to provide a higher level QoS abstraction to describe workloads
> without baking it into the binaries, and by implication without worrying
> about potential abuse. The interface requires privileged access since QoS
> is considered a scarce resource and requires admin control to ensure it
> is set properly. Again, that admin control is anticipated to be the
> schedqos utility service.
> 
> QoS is treated as a scarce resource and the intention is for a syscall
> to be issued for each individual QoS tag. For the same reason, QoS tags
> are also not inherited on fork by default.
> 
> A reasonable point of debate is whether to make sched_qos an array of
> 3 or 5 values, in case this grows large and users end up hitting the
> bottleneck of having to issue too many syscalls to set all QoS hints.
> Being limited as it is now helps enforce intentionality and scarcity of
> tagging.
> 
> [1] https://github.com/qais-yousef/schedqos
> 
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
>  Documentation/scheduler/index.rst             |  1 +
>  Documentation/scheduler/sched-qos.rst         | 44 ++++++++++++++++++
>  include/uapi/linux/sched.h                    |  4 ++
>  include/uapi/linux/sched/types.h              | 46 +++++++++++++++++++
>  kernel/sched/syscalls.c                       | 10 ++++
>  .../trace/beauty/include/uapi/linux/sched.h   |  4 ++
>  6 files changed, 109 insertions(+)
>  create mode 100644 Documentation/scheduler/sched-qos.rst
> 
> diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
> index 17ce8d76befc..6652f18e553b 100644
> --- a/Documentation/scheduler/index.rst
> +++ b/Documentation/scheduler/index.rst
> @@ -23,5 +23,6 @@ Scheduler
>      sched-stats
>      sched-ext
>      sched-debug
> +    sched-qos
>  
>      text_files
> diff --git a/Documentation/scheduler/sched-qos.rst b/Documentation/scheduler/sched-qos.rst
> new file mode 100644
> index 000000000000..0911261cb124
> --- /dev/null
> +++ b/Documentation/scheduler/sched-qos.rst
> @@ -0,0 +1,44 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=============
> +Scheduler QoS
> +=============
> +
> +1. Introduction
> +===============
> +
> +Different workloads have different scheduling requirements to operate
> +optimally. The same applies to tasks within the same workload.
> +
> +To enable smarter usage of system resources and to cater for the conflicting
> +demands of various tasks, Scheduler QoS provides a mechanism to supply more
> +information about those demands so that the scheduler can make a best effort
> +to honour them.
> +
> +  @sched_qos_type	what QoS hint to apply
> +  @sched_qos_value	value of the QoS hint
> +  @sched_qos_cookie	magic cookie to tag a group of tasks for which the QoS
> +			applies. If 0, the hint will apply globally system
> +			wide. If not 0, the hint will be relative to tasks that
> +			have the same cookie value only.

Qais,

Thanks for your proposal. I have some follow up thoughts.

How can we query all the tasks that use a cookie?
A scenario I can think of is that there may be two groups of tasks, and
we may want to merge them into one when they start sharing data in the
context of cache aware scheduling.  In that case, we need to get all the
tasks under the second cookie and change them to that of the first.  We
may need to link together tasks sharing a cookie.

We probably need a sched_qos_cookie structure defined, analogous to
sched_core_cookie, to anchor the tasks.  The sched_qos_cookie field could
then be a pointer value to such a structure, as with sched_core_cookie,
instead of being a __u32 as in the patch below.

Tim

> +
> +QoS hints are set once and not inherited by children by design. The
> +rationale is that each task has its own characteristics and it is
> +encouraged to describe each of these separately. Also, since system
> +resources are finite, there's a limit to what can be done to honour these
> +requests before reaching a tipping point where there are too many requests
> +for a particular QoS to service all of them at once and some will start to
> +lose out. For example, if 10 tasks require better wakeup latencies on
> +a 4 CPU SMP system and they all wake up at once, only 4 can perceive the
> +hint as honoured and the rest will have to wait. Inheritance can easily
> +turn these 10 into 100 or 1000, and then the QoS hint will rapidly lose
> +its meaning and effectiveness. The chance of 10 tasks waking up at the
> +same time is lower than that of 100, and lower still than that of 1000.
> +
> +To set multiple QoS hints, a syscall is required for each. This is a
> +trade-off to reduce the churn of extending the interface, as the hope is
> +for it to evolve as workloads and hardware get more sophisticated and the
> +need for extension arises; when this happens, it should be simpler to add
> +the kernel extension and let userspace readily use it by setting the newly
> +added flag, without having to update the whole of sched_attr.
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 52b69ce89368..3cdba44bc1cb 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -102,6 +102,9 @@ struct clone_args {
>  	__aligned_u64 set_tid_size;
>  	__aligned_u64 cgroup;
>  };
> +
> +enum sched_qos_type {
> +};
>  #endif
>  
>  #define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
> @@ -133,6 +136,7 @@ struct clone_args {
>  #define SCHED_FLAG_KEEP_PARAMS		0x10
>  #define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
>  #define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
> +#define SCHED_FLAG_QOS			0x80
>  
>  #define SCHED_FLAG_KEEP_ALL	(SCHED_FLAG_KEEP_POLICY | \
>  				 SCHED_FLAG_KEEP_PARAMS)
> diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
> index bf6e9ae031c1..b65da4938f43 100644
> --- a/include/uapi/linux/sched/types.h
> +++ b/include/uapi/linux/sched/types.h
> @@ -94,6 +94,48 @@
>   * scheduled on a CPU with no more capacity than the specified value.
>   *
>   * A task utilization boundary can be reset by setting the attribute to -1.
> + *
> + * Scheduler QoS
> + * =============
> + *
> + * Different workloads have different scheduling requirements to operate
> + * optimally. The same applies to tasks within the same workload.
> + *
> + * To enable smarter usage of system resources and to cater for the conflicting
> + * demands of various tasks, Scheduler QoS provides a mechanism to supply more
> + * information about those demands so that the scheduler can make a best effort
> + * to honour them.
> + *
> + *  @sched_qos_type	what QoS hint to apply
> + *  @sched_qos_value	value of the QoS hint
> + *  @sched_qos_cookie	magic cookie to tag a group of tasks for which the QoS
> + *			applies. If 0, the hint will apply globally system
> + *			wide. If not 0, the hint will be relative to tasks that
> + *			have the same cookie value only.
> + *
> + * QoS hints are set once and not inherited by children by design. The
> + * rationale is that each task has its own characteristics and it is
> + * encouraged to describe each of these separately. Also, since system
> + * resources are finite, there's a limit to what can be done to honour these
> + * requests before reaching a tipping point where there are too many requests
> + * for a particular QoS to service all of them at once and some will start to
> + * lose out. For example, if 10 tasks require better wakeup latencies on
> + * a 4 CPU SMP system and they all wake up at once, only 4 can perceive the
> + * hint as honoured and the rest will have to wait. Inheritance can easily
> + * turn these 10 into 100 or 1000, and then the QoS hint will rapidly lose
> + * its meaning and effectiveness. The chance of 10 tasks waking up at the
> + * same time is lower than that of 100, and lower still than that of 1000.
> + *
> + * To set multiple QoS hints, a syscall is required for each. This is a
> + * trade-off to reduce the churn of extending the interface, as the hope is
> + * for it to evolve as workloads and hardware get more sophisticated and the
> + * need for extension arises; when this happens, it should be simpler to add
> + * the kernel extension and let userspace readily use it by setting the newly
> + * added flag, without having to update the whole of sched_attr.
> + *
> + * Details about the available QoS hints can be found in:
> + * Documentation/scheduler/sched-qos.rst
>   */
>  struct sched_attr {
>  	__u32 size;
> @@ -116,6 +158,10 @@ struct sched_attr {
>  	__u32 sched_util_min;
>  	__u32 sched_util_max;
>  
> +	__u32 sched_qos_type;
> +	__s64 sched_qos_value;
> +	__u32 sched_qos_cookie;
> +
>  };
>  
>  #endif /* _UAPI_LINUX_SCHED_TYPES_H */
> diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
> index b215b0ead9a6..88feedd2f7c9 100644
> --- a/kernel/sched/syscalls.c
> +++ b/kernel/sched/syscalls.c
> @@ -481,6 +481,13 @@ static int user_check_sched_setscheduler(struct task_struct *p,
>  	if (p->sched_reset_on_fork && !reset_on_fork)
>  		goto req_priv;
>  
> +	/*
> +	 * Normal users can't set QoS on their own, must go via admin
> +	 * controlled service
> +	 */
> +	if (attr->sched_flags & SCHED_FLAG_QOS)
> +		goto req_priv;
> +
>  	return 0;
>  
>  req_priv:
> @@ -552,6 +559,9 @@ int __sched_setscheduler(struct task_struct *p,
>  			return retval;
>  	}
>  
> +	if (attr->sched_flags & SCHED_FLAG_QOS)
> +		return -EOPNOTSUPP;
> +
>  	/*
>  	 * SCHED_DEADLINE bandwidth accounting relies on stable cpusets
>  	 * information.
> diff --git a/tools/perf/trace/beauty/include/uapi/linux/sched.h b/tools/perf/trace/beauty/include/uapi/linux/sched.h
> index 359a14cc76a4..4ff525928430 100644
> --- a/tools/perf/trace/beauty/include/uapi/linux/sched.h
> +++ b/tools/perf/trace/beauty/include/uapi/linux/sched.h
> @@ -102,6 +102,9 @@ struct clone_args {
>  	__aligned_u64 set_tid_size;
>  	__aligned_u64 cgroup;
>  };
> +
> +enum sched_qos_type {
> +};
>  #endif
>  
>  #define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
> @@ -133,6 +136,7 @@ struct clone_args {
>  #define SCHED_FLAG_KEEP_PARAMS		0x10
>  #define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
>  #define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
> +#define SCHED_FLAG_QOS			0x80
>  
>  #define SCHED_FLAG_KEEP_ALL	(SCHED_FLAG_KEEP_POLICY | \
>  				 SCHED_FLAG_KEEP_PARAMS)


Thread overview: 15+ messages
2026-05-04  1:59 [PATCH v2 00/13] sched/fair/schedutil: Better manage system response time Qais Yousef
2026-05-04  1:59 ` [PATCH v2 01/13] sched: cpufreq: Rename map_util_perf to sugov_apply_dvfs_headroom Qais Yousef
2026-05-04  1:59 ` [PATCH v2 02/13] sched/pelt: Add a new function to approximate the future util_avg value Qais Yousef
2026-05-04  1:59 ` [PATCH v2 03/13] sched/pelt: Add a new function to approximate runtime to reach given util Qais Yousef
2026-05-04  1:59 ` [PATCH v2 04/13] sched/fair: Remove magic hardcoded margin in fits_capacity() Qais Yousef
2026-05-04  1:59 ` [PATCH v2 05/13] sched: cpufreq: Remove magic 1.25 headroom from sugov_apply_dvfs_headroom() Qais Yousef
2026-05-04  1:59 ` [PATCH v2 06/13] sched/fair: Extend util_est to improve rampup time Qais Yousef
2026-05-04  1:59 ` [PATCH v2 07/13] sched/fair: util_est: Take into account periodic tasks Qais Yousef
2026-05-04  1:59 ` [PATCH v2 RFC 08/13] sched/qos: Add a new sched-qos interface Qais Yousef
2026-05-06 20:38   ` Tim Chen [this message]
2026-05-04  1:59 ` [PATCH v2 09/13] sched/qos: Add rampup multiplier QoS Qais Yousef
2026-05-04  2:00 ` [PATCH v2 10/13] sched/fair: Disable util_est when rampup_multiplier is 0 Qais Yousef
2026-05-04  2:00 ` [PATCH v2 11/13] sched/fair: Don't mess with util_avg post init Qais Yousef
2026-05-04  2:00 ` [PATCH v2 12/13] sched/fair: Call update_util_est() after dequeue_entities() Qais Yousef
2026-05-04  2:00 ` [PATCH v2 RFC 13/13] sched/pelt: Always allow load updates Qais Yousef
