* Re: [PATCH v2 RFC 08/13] sched/qos: Add a new sched-qos interface
  [not found] ` <20260504020003.71306-9-qyousef@layalina.io>
@ 2026-05-06 20:38 ` Tim Chen
  2026-05-07  9:55   ` Qais Yousef
  2026-05-11 10:57   ` Peter Zijlstra
  1 sibling, 1 reply; 7+ messages in thread
From: Tim Chen @ 2026-05-06 20:38 UTC
To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
    Rafael J. Wysocki, Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, John Stultz, Dietmar Eggemann,
    Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm

On Mon, 2026-05-04 at 02:59 +0100, Qais Yousef wrote:
> Provide a generic and extensible interface to describe arbitrary QoS
> tags to tell the kernel about specific behavior that doesn't fall into
> the existing sched_attr.
>
> The interface is broken into three parts:
>
>  * Type
>  * Value
>  * Cookie
>
> Type is an enum that should give us enough space to extend (and
> deprecate) comfortably.
>
> Value is a signed 64-bit number to allow for arbitrarily high values.
>
> Cookie is to help group tasks selectively, so that some QoS types can
> operate on tasks per group. A value of 0 indicates system wide.
>
> There are two anticipated users being discussed on the list.
>
> 1. A per-task rampup multiplier to control how fast util rises, and by
>    implication how fast a task can migrate between cores on HMP
>    systems and cause frequencies to rise with schedutil.
>
> 2. Tagging a group of tasks that are memory dependent for Cache Aware
>    Scheduling.
>
> The interface is anticipated to be provisioned to apps via utilities
> and libraries. schedqos [1] is an example of how such an interface can
> be used to provide a higher-level QoS abstraction that describes
> workloads without baking it into the binaries, and by implication
> without worrying about potential abuse. The interface requires
> privileged access since QoS is considered a scarce resource and
> requires admin control to ensure it is set properly. Again, that admin
> control is anticipated to be the schedqos utility service.
>
> QoS is treated as a scarce resource, and the intention is for a
> syscall to be done for each individual QoS tag. QoS tags are not
> inherited on fork by default either, for the same reason.
>
> A reasonable point of debate is whether to make sched_qos an array of
> 3 or 5 values, to avoid a potential bottleneck if this grows large and
> users end up having to issue too many syscalls to set all QoS. Being
> limited as it is now helps enforce intentionality and scarcity of
> tagging.
>
> [1] https://github.com/qais-yousef/schedqos
>
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
>  Documentation/scheduler/index.rst             |  1 +
>  Documentation/scheduler/sched-qos.rst         | 44 ++++++++++++++++++
>  include/uapi/linux/sched.h                    |  4 ++
>  include/uapi/linux/sched/types.h              | 46 +++++++++++++++++++
>  kernel/sched/syscalls.c                       | 10 ++++
>  .../trace/beauty/include/uapi/linux/sched.h   |  4 ++
>  6 files changed, 109 insertions(+)
>  create mode 100644 Documentation/scheduler/sched-qos.rst
>
> diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
> index 17ce8d76befc..6652f18e553b 100644
> --- a/Documentation/scheduler/index.rst
> +++ b/Documentation/scheduler/index.rst
> @@ -23,5 +23,6 @@ Scheduler
>      sched-stats
>      sched-ext
>      sched-debug
> +    sched-qos
>
>      text_files
> diff --git a/Documentation/scheduler/sched-qos.rst b/Documentation/scheduler/sched-qos.rst
> new file mode 100644
> index 000000000000..0911261cb124
> --- /dev/null
> +++ b/Documentation/scheduler/sched-qos.rst
> @@ -0,0 +1,44 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=============
> +Scheduler QoS
> +=============
> +
> +1. Introduction
> +===============
> +
> +Different workloads have different scheduling requirements to operate
> +optimally. The same applies to tasks within the same workload.
> +
> +To enable smarter usage of system resources and to cater for the
> +conflicting demands of various tasks, Scheduler QoS provides a
> +mechanism to pass more information about those demands so that the
> +scheduler can make a best effort to honour them.
> +
> +  @sched_qos_type    what QoS hint to apply
> +  @sched_qos_value   value of the QoS hint
> +  @sched_qos_cookie  magic cookie to tag a group of tasks for which
> +                     the QoS applies. If 0, the hint applies globally,
> +                     system wide. If not 0, the hint is relative only
> +                     to tasks that have the same cookie value.

Qais,

Thanks for your proposal. I have some follow-up thoughts.

How can we query all the tasks that use a cookie?

A scenario I can think of: there may be two groups of tasks, and we may
want to merge them into one group when they start sharing data, in the
context of cache aware scheduling. In that case, we need to get all the
tasks under the second cookie and change them to the first. We may need
to link together tasks sharing a cookie.

We probably need a sched_qos_cookie structure, defined analogously to
sched_core_cookie, to anchor the tasks. And sched_qos_cookie could then
be a pointer to that structure, as with sched_core_cookie, instead of
the __u32 in the patch below.

Tim

> +
> +QoS hints are set once and not inherited by children, by design. The
> +rationale is that each task has its individual characteristics and it
> +is encouraged to describe each of these separately. Also, since system
> +resources are finite, there's a limit to what can be done to honour
> +these requests before reaching a tipping point where there are too
> +many requests for a particular QoS, it becomes impossible to service
> +all of them at once, and some start to lose out. For example, if 10
> +tasks require better wake-up latencies on a 4-CPU SMP system and they
> +all wake up at once, only 4 can have the hint honoured and the rest
> +have to wait. Inheritance can easily turn these 10 into 100 or 1000,
> +and then the QoS hint rapidly loses its meaning and effectiveness. The
> +chance of 10 tasks waking up at the same time is lower than that of
> +100, and lower still than that of 1000.
> +
> +To set multiple QoS hints, a syscall is required for each. This is a
> +trade-off to reduce churn when extending the interface. The hope is
> +for this to evolve as workloads and hardware get more sophisticated
> +and the need for extension arises; when that happens, it should be
> +simple to add the kernel extension and let userspace use it readily by
> +setting the newly added flag, without having to update the whole of
> +sched_attr.
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 52b69ce89368..3cdba44bc1cb 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -102,6 +102,9 @@ struct clone_args {
>         __aligned_u64 set_tid_size;
>         __aligned_u64 cgroup;
>  };
> +
> +enum sched_qos_type {
> +};
>  #endif
>
>  #define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
> @@ -133,6 +136,7 @@ struct clone_args {
>  #define SCHED_FLAG_KEEP_PARAMS          0x10
>  #define SCHED_FLAG_UTIL_CLAMP_MIN       0x20
>  #define SCHED_FLAG_UTIL_CLAMP_MAX       0x40
> +#define SCHED_FLAG_QOS                  0x80
>
>  #define SCHED_FLAG_KEEP_ALL     (SCHED_FLAG_KEEP_POLICY | \
>                                  SCHED_FLAG_KEEP_PARAMS)
> diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
> index bf6e9ae031c1..b65da4938f43 100644
> --- a/include/uapi/linux/sched/types.h
> +++ b/include/uapi/linux/sched/types.h
> @@ -94,6 +94,48 @@
>   * scheduled on a CPU with no more capacity than the specified value.
>   *
>   * A task utilization boundary can be reset by setting the attribute to -1.
> + *
> + * Scheduler QoS
> + * =============
> + *

[ ... the same text as Documentation/scheduler/sched-qos.rst above,
duplicated as a kernel-doc style comment ... ]

> + *
> + * Details about the available QoS hints can be found in:
> + * Documentation/scheduler/sched-qos.rst
>   */
>  struct sched_attr {
>         __u32 size;
> @@ -116,6 +158,10 @@ struct sched_attr {
>         __u32 sched_util_min;
>         __u32 sched_util_max;
>
> +       __u32 sched_qos_type;
> +       __s64 sched_qos_value;
> +       __u32 sched_qos_cookie;
> +
>  };
>
>  #endif /* _UAPI_LINUX_SCHED_TYPES_H */
> diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
> index b215b0ead9a6..88feedd2f7c9 100644
> --- a/kernel/sched/syscalls.c
> +++ b/kernel/sched/syscalls.c
> @@ -481,6 +481,13 @@ static int user_check_sched_setscheduler(struct task_struct *p,
>         if (p->sched_reset_on_fork && !reset_on_fork)
>                 goto req_priv;
>
> +       /*
> +        * Normal users can't set QoS on their own; they must go via an
> +        * admin-controlled service.
> +        */
> +       if (attr->sched_flags & SCHED_FLAG_QOS)
> +               goto req_priv;
> +
>         return 0;
>
>  req_priv:
> @@ -552,6 +559,9 @@ int __sched_setscheduler(struct task_struct *p,
>                 return retval;
>         }
>
> +       if (attr->sched_flags & SCHED_FLAG_QOS)
> +               return -EOPNOTSUPP;
> +
>         /*
>          * SCHED_DEADLINE bandwidth accounting relies on stable cpusets
>          * information.
> diff --git a/tools/perf/trace/beauty/include/uapi/linux/sched.h b/tools/perf/trace/beauty/include/uapi/linux/sched.h
> index 359a14cc76a4..4ff525928430 100644
> --- a/tools/perf/trace/beauty/include/uapi/linux/sched.h
> +++ b/tools/perf/trace/beauty/include/uapi/linux/sched.h
> @@ -102,6 +102,9 @@ struct clone_args {
>         __aligned_u64 set_tid_size;
>         __aligned_u64 cgroup;
>  };
> +
> +enum sched_qos_type {
> +};
>  #endif
>
>  #define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
> @@ -133,6 +136,7 @@ struct clone_args {
>  #define SCHED_FLAG_KEEP_PARAMS          0x10
>  #define SCHED_FLAG_UTIL_CLAMP_MIN       0x20
>  #define SCHED_FLAG_UTIL_CLAMP_MAX       0x40
> +#define SCHED_FLAG_QOS                  0x80
>
>  #define SCHED_FLAG_KEEP_ALL     (SCHED_FLAG_KEEP_POLICY | \
>                                  SCHED_FLAG_KEEP_PARAMS)
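As a concrete illustration of the interface under discussion, here is a
minimal userspace sketch of how a privileged service could set a QoS
hint through the proposed flag. It is a sketch only: there is no glibc
wrapper for sched_setattr(), so the raw syscall is used; the struct is
redefined locally to mirror the patch, since installed headers predate
this series; and type 0 being the rampup multiplier is an assumption
based on patch 09, as noted later in the thread.

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Constants from the patch above; not yet in installed uapi headers. */
#define SCHED_FLAG_KEEP_POLICY  0x08
#define SCHED_FLAG_KEEP_PARAMS  0x10
#define SCHED_FLAG_QOS          0x80

/* Mirrors the extended struct sched_attr from the patch above. */
struct sched_attr_qos {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
        uint32_t sched_util_min;
        uint32_t sched_util_max;
        uint32_t sched_qos_type;
        int64_t  sched_qos_value;
        uint32_t sched_qos_cookie;
};

int main(void)
{
        struct sched_attr_qos attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        /* Keep the existing policy/params; only apply the QoS hint. */
        attr.sched_flags = SCHED_FLAG_QOS | SCHED_FLAG_KEEP_POLICY |
                           SCHED_FLAG_KEEP_PARAMS;
        attr.sched_qos_type   = 0;  /* assumed: rampup multiplier */
        attr.sched_qos_value  = 4;  /* the doc advises a 0-8 range */
        attr.sched_qos_cookie = 0;  /* system wide */

        /* pid 0 means the calling task; privileged per the patch. */
        if (syscall(SYS_sched_setattr, 0, &attr, 0))
                perror("sched_setattr");
        return 0;
}

With only patch 08 applied this fails with EOPNOTSUPP from the stub in
__sched_setscheduler(), and on kernels without the series the larger
size with non-zero trailing fields fails with E2BIG, so either way it
exercises the plumbing end to end.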
* Re: [PATCH v2 RFC 08/13] sched/qos: Add a new sched-qos interface
  2026-05-06 20:38 ` [PATCH v2 RFC 08/13] sched/qos: Add a new sched-qos interface Tim Chen
@ 2026-05-07  9:55   ` Qais Yousef
  2026-05-07 14:20     ` Chen, Yu C
  0 siblings, 1 reply; 7+ messages in thread
From: Qais Yousef @ 2026-05-07 9:55 UTC
To: Tim Chen
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
    Viresh Kumar, Juri Lelli, Steven Rostedt, John Stultz,
    Dietmar Eggemann, Chen, Yu C, Thomas Gleixner, linux-kernel,
    linux-pm

On 05/06/26 13:38, Tim Chen wrote:
> On Mon, 2026-05-04 at 02:59 +0100, Qais Yousef wrote:

[ ... ]

> Qais,
>
> Thanks for your proposal. I have some follow-up thoughts.
>
> How can we query all the tasks that use a cookie?

At the moment you can use sched_getattr() to query the cookie set for a
specific QoS. We'll probably have to expose something in procfs to
allow parsing all tasks that share a cookie.

Generally the idea is that this is managed by userspace, and I'd expect
the schedqos service to already know this info, as it has to set it.

> A scenario I can think of: there may be two groups of tasks, and we may

To be clear, two groups of tasks belonging to the same process, right?

I am still not clear on grouping tasks across different processes.
Nothing in the API prevents it, but I am wary of inter-process task
grouping, as the same application can have several instances running in
the system, and this is a layer of complexity that I am not sure is
warranted.

That said, nothing in the proposal prevents us from handling this if it
ends up really making sense. It'd be a matter of saying the
QOS_TYPE_XYZ cookie has to be system-wide unique and can be used to tag
tasks in different processes, unlike the default behavior, which is
unique per process. It'd still have to be unique per sched_qos_type.

> want to merge them into one group when they start sharing data, in the
> context of cache aware scheduling. In that case, we need to get all
> the tasks under the second cookie and change them to the first. We may
> need to link together tasks sharing a cookie.

My implementation should have been explicit and returned EOPNOTSUPP if
the cookie is specified. And the doc text could have been clearer.

The idea is that the cookie is per QoS per process. So QOS_TYPE_A would
have its own unique cookie range, and QOS_TYPE_B would have its own
independent unique cookie range. This allows the flexibility and
extensibility to describe independent behaviors that require
independent grouping.

So in your example, assuming the grouping is done for a single process,
group_a would have QOS_DATA_DEP with a cookie value of 1, and group_b
would have QOS_DATA_DEP with a cookie value of 2. To merge them, you'd
change group_a's or group_b's cookie to match the other group's value.

I still haven't thought fully through how to do this with schedqos
configs, but one idea:

        {
                "process_a" {
                        "thread_qos" {
                                "task_1": [ ... ],
                                "task_2": [ ... ],
                                "task_3": [ ... ],
                                "task_4": [ ... ]
                        },
                        "qos_groups" {
                                "group_a" {
                                        "QOS_DATA_DEP": ["task_1", "task_2"],
                                },
                                "group_b" {
                                        "QOS_DATA_DEP": ["task_3", "task_4"],
                                }
                        }
                }
        }

It implies, of course, that to merge you'd have to change the
description and restart the service. If you want to merge and unmerge
at runtime, then I'd have big question marks over whether they belong
in a group at all. From experience, for this to be really beneficial
you need to describe the dominant behavior, even if sometimes it is
untrue. Trying to be exact tends to backfire.

If we really want to do something at runtime, then group control (the
planned userspace extension to exercise QoS control based on cgroup
grouping) would be the way to go, IMO.

> We probably need a sched_qos_cookie structure, defined analogously to
> sched_core_cookie, to anchor the tasks. And sched_qos_cookie could
> then be a pointer to that structure, as with sched_core_cookie,
> instead of the __u32 in the patch below.

As part of the API, or as an internal implementation detail? I think we
do need a cookie structure that stores the (sched_qos_type,
sched_qos_cookie) tuple internally as an implementation detail. But we
should not expose it as an interface.

I think the cookie values should be userspace managed. From experience,
this has to be done in a centralized way via a service, otherwise you
end up with a mess. There has to be an all-knowing entity managing
things, which is what I am proposing in the schedqos service. That's
why the whole QoS is now protected with the CAP_SYS_NICE capability - a
change from v1 which I forgot to mention.

We want the interface to be flexible and to survive the test of time. I
don't just want to support extensibility, but for us to be able to say
we know better now and must deprecate something in favour of a new
thing that does it better. It means schedqos has to deal with some
complexity to manage deprecation, but kernel side we can hopefully just
delete code with ease. Hopefully there will be one or a few centralized
entities around.
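To make the merge Qais describes concrete, a sketch of what the
schedqos service would do to fold group_b into group_a. Everything here
is hypothetical: set_qos() is assumed to wrap the sched_setattr() call
from the earlier sketch, and QOS_DATA_DEP is a type value this series
does not yet define.

#include <stdint.h>
#include <sys/types.h>

/* Assumed helper wrapping the sched_setattr() sketch shown earlier. */
extern int set_qos(pid_t tid, uint32_t type, int64_t value,
                   uint32_t cookie);

/* Hypothetical type value; the posted series defines no QoS yet. */
#define QOS_DATA_DEP 1

/* Retag every task in group_b with group_a's cookie so the two
 * groups become one. One syscall per task is the cost of the
 * deliberately narrow interface. */
static int merge_groups(const pid_t *group_b_tids, int nr,
                        uint32_t group_a_cookie)
{
        for (int i = 0; i < nr; i++) {
                int ret = set_qos(group_b_tids[i], QOS_DATA_DEP, 1,
                                  group_a_cookie);
                if (ret)
                        return ret; /* partial merge; caller must unwind */
        }
        return 0;
}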
* Re: [PATCH v2 RFC 08/13] sched/qos: Add a new sched-qos interface
  2026-05-07  9:55 ` Qais Yousef
@ 2026-05-07 14:20     ` Chen, Yu C
  2026-05-09  9:39       ` Qais Yousef
  0 siblings, 1 reply; 7+ messages in thread
From: Chen, Yu C @ 2026-05-07 14:20 UTC
To: Qais Yousef, Tim Chen
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
    Viresh Kumar, Juri Lelli, Steven Rostedt, John Stultz,
    Dietmar Eggemann, Thomas Gleixner, linux-kernel, linux-pm,
    Vern Hao, Vern Hao

On 5/7/2026 5:55 PM, Qais Yousef wrote:
> On 05/06/26 13:38, Tim Chen wrote:
> > On Mon, 2026-05-04 at 02:59 +0100, Qais Yousef wrote:

[ ... ]

> The idea is that the cookie is per QoS per process. So QOS_TYPE_A
> would have its own unique cookie range, and QOS_TYPE_B would have its
> own independent unique cookie range. This allows the flexibility and
> extensibility to describe independent behaviors that require
> independent grouping.

From a user point of view, I can think of the following use cases for
fine-grained cache-aware scheduling:

u1. A user wants to enable or disable cache-aware scheduling for all
    threads of a process. (No extra tagging is needed.)

u2. A user wants to enable or disable cache-aware scheduling for all
    tasks within a cgroup. (No extra tagging is needed.) Vern from
    Tencent was advocating for this model.

u3. A user wants to enable or disable cache-aware scheduling for an
    arbitrary set of tasks. (Userspace tagging is required.)

If I understand correctly, u3 is exactly the use case where the
schedqos cookie can help. Under your design, we cannot tag an arbitrary
set of tasks with the same cookie; we are only allowed to assign the
same cookie to threads within the same process and under the same QoS
type. So this might rule out the case where different processes share
data with each other and we want to aggregate them (NUMA balancing's
numa_group is an indicator of tasks sharing data).

> > We probably need a sched_qos_cookie structure, defined analogously
> > to sched_core_cookie, to anchor the tasks. And sched_qos_cookie
> > could then be a pointer to that structure, as with
> > sched_core_cookie, instead of the __u32 in the patch below.
>
> As part of the API, or as an internal implementation detail? I think
> we do need a cookie structure that stores the (sched_qos_type,
> sched_qos_cookie) tuple internally as an implementation detail. But we
> should not expose it as an interface.

Yes, I think Tim was referring to the internal implementation. We need
a pointer to link tasks to their shared sched_qos_cookie.

> I think the cookie values should be userspace managed. From
> experience, this has to be done in a centralized way via a service,
> otherwise you end up with a mess. There has to be an all-knowing
> entity managing things, which is what I am proposing in the schedqos
> service. That's why the whole QoS is now protected with the
> CAP_SYS_NICE capability - a change from v1 which I forgot to mention.

Not sure why we do not leverage the OS to allocate and manage cookies.
The OS has full visibility of system-wide information and can maintain
globally unique cookies. Users would only need to ask the OS to
allocate a cookie, or to attach or detach tasks to an existing group,
without supplying an explicit cookie value. One possible reason I can
think of: since the schedqos cookie is defined per QoS type and per
process, it may be more convenient to manage it entirely within the
schedqos service?

thanks,
Chenyu
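For reference, the kernel-internal anchoring Tim and Chen Yu have in
mind could look roughly like the refcounted sched_core_cookie object
that core scheduling uses. Nothing in the posted patches defines this;
the sketch below is an assumption about one possible shape.

/* Kernel-side sketch of a refcounted cookie object analogous to
 * sched_core_cookie. All fields beyond refcnt are assumptions. */
struct sched_qos_cookie {
        refcount_t              refcnt;
        unsigned int            qos_type;    /* enum sched_qos_type */
        u32                     user_cookie; /* value from sched_attr */
        struct list_head        tasks;       /* tasks sharing the cookie */
};

/*
 * task_struct would then carry a pointer rather than a raw integer:
 *
 *      struct sched_qos_cookie *qos_cookie;
 *
 * which answers Tim's query question (walk cookie->tasks) and turns a
 * merge into a pointer swap plus refcounting, instead of rewriting a
 * per-task integer.
 */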
* Re: [PATCH v2 RFC 08/13] sched/qos: Add a new sched-qos interface
  2026-05-07 14:20 ` Chen, Yu C
@ 2026-05-09  9:39       ` Qais Yousef
  0 siblings, 0 replies; 7+ messages in thread
From: Qais Yousef @ 2026-05-09 9:39 UTC
To: Chen, Yu C
Cc: Tim Chen, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
    Rafael J. Wysocki, Viresh Kumar, Juri Lelli, Steven Rostedt,
    John Stultz, Dietmar Eggemann, Thomas Gleixner, linux-kernel,
    linux-pm, Vern Hao, Vern Hao

On 05/07/26 22:20, Chen, Yu C wrote:
> On 5/7/2026 5:55 PM, Qais Yousef wrote:
> > On 05/06/26 13:38, Tim Chen wrote:
>
> [ ... ]
>
> > The idea is that the cookie is per QoS per process. So QOS_TYPE_A
> > would have its own unique cookie range, and QOS_TYPE_B would have
> > its own independent unique cookie range. This allows the flexibility
> > and extensibility to describe independent behaviors that require
> > independent grouping.
>
> From a user point of view, I can think of the following use cases for
> fine-grained cache-aware scheduling:
>
> u1. A user wants to enable or disable cache-aware scheduling for all
>     threads of a process. (No extra tagging is needed.)

This is a special case of u3 where you say all threads are part of one
group. So tagging is required; to enable/disable you'd have to have a
knob to switch that behavior, but you're deferring the grouping to the
kernel, which IMO is a problem. More on this below.

> u2. A user wants to enable or disable cache-aware scheduling for all
>     tasks within a cgroup. (No extra tagging is needed.) Vern from
>     Tencent was advocating for this model.

Same as above. We can have a netlink notification to tell us when tasks
switch cgroups and auto-tag based on that. Although I am still worried
that this is not a great way to tag, and it should be based on process
and task. But I guess we can try things out and see what works best.
According to the plan, this becomes a userspace (schedqos service)
description problem rather than a kernel implementation detail.

> u3. A user wants to enable or disable cache-aware scheduling for an
>     arbitrary set of tasks. (Userspace tagging is required.)
>
> If I understand correctly, u3 is exactly the use case where the
> schedqos cookie can help. Under your design, we cannot tag an
> arbitrary set of tasks with the same cookie; we are only allowed to
> assign the same cookie to threads within the same process and under
> the same QoS type. So this might rule out the case where different
> processes share data

Is this a real case? I'd really love to know more details on why. I
still think inter-process grouping is better done via cpuset, as this
tends more towards a partitioning problem.

That said, nothing in the API actually prevents us from adding a
QOS_INTER_PROCESS_MEM_DEP tag and making the cookie global and unique.
But this will come with implementation challenges and complexity. As a
starter, I'll make sure to catch and error out on this case, so as not
to repeat the latest tcmalloc mistake. If it makes sense at any time,
it'd be a matter of adding a new QoS for inter-process support.

> with each other and we want to aggregate them (NUMA balancing's
> numa_group is an indicator of tasks sharing data).
>
> > > We probably need a sched_qos_cookie structure, defined analogously
> > > to sched_core_cookie, to anchor the tasks. And sched_qos_cookie
> > > could then be a pointer to that structure, as with
> > > sched_core_cookie, instead of the __u32 in the patch below.
> >
> > As part of the API, or as an internal implementation detail? I think
> > we do need a cookie structure that stores the (sched_qos_type,
> > sched_qos_cookie) tuple internally as an implementation detail. But
> > we should not expose it as an interface.
>
> Yes, I think Tim was referring to the internal implementation. We need
> a pointer to link tasks to their shared sched_qos_cookie.
>
> > I think the cookie values should be userspace managed. From
> > experience, this has to be done in a centralized way via a service,
> > otherwise you end up with a mess. There has to be an all-knowing
> > entity managing things, which is what I am proposing in the schedqos
> > service. That's why the whole QoS is now protected with the
> > CAP_SYS_NICE capability - a change from v1 which I forgot to
> > mention.
>
> Not sure why we do not leverage the OS to allocate and manage cookies.
> The OS has full visibility of system-wide information and can maintain
> globally unique cookies. Users would only need to ask the OS to
> allocate a cookie, or to attach or detach tasks to an existing group,
> without supplying an explicit cookie value. One possible reason I can
> think of: since the schedqos cookie is defined per QoS type and per
> process, it may be more convenient to manage it entirely within the
> schedqos service?

Because for the kernel to manage the cookies, it needs to understand
all the rules for grouping, the corner cases and the trade-offs. We
lose the flexibility of adding new QoS types easily, since for each new
QoS we'd need to nail down all these rules, and progress would never
happen - or very slowly at best. AND the scheduler would have more
policy embedded in it, which would make it trickier to change and
evolve the code without breaking userspace behavior, since the behavior
would be purely embedded in kernel space.

By delegating to userspace we remove all of this. We provide the
mechanisms, and the trade-offs, grouping rules, etc. are all managed by
an all-knowing entity. If grouping rule A is better than grouping rule
B, we don't care. Even the schedqos service hopefully wouldn't care.
It'd be a matter for each admin to specify the grouping that makes
sense to them in a config file and restart the service, and hopefully
everyone will stroll away happy. That's the dream at least :)

Note users will still get an abstraction via schedqos, but this
abstraction is in userspace rather than in the kernel. I anticipate
users will just interact with config files to describe their cases.

And as I mentioned at LPC, we can easily add a new service to schedqos
to help admins find out which tasks are memory dependent and help them
fine-tune their configs. Perf can do that, and writing a simple daemon
that helps admins monitor/profile a live workload and spit out advice
that grouping these tasks based on cache would be beneficial shouldn't
be too hard. The process can be fully automated too, to change things
on the fly, if folks really want to.
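A toy sketch of the userspace-side management Qais argues for. Since
cookies only need to be unique per (process, QoS type) pair, the
schedqos service can hand them out with nothing more than a counter per
pair; every name below is hypothetical.

#include <stdint.h>
#include <sys/types.h>

struct qos_key { pid_t tgid; uint32_t qos_type; };

/* Hypothetical allocator inside the schedqos service. Bounded toy
 * table for illustration; a real service would use a hash map. */
static uint32_t next_cookie(struct qos_key key)
{
        static struct { struct qos_key key; uint32_t last; } tbl[256];
        static int used;

        for (int i = 0; i < used; i++)
                if (tbl[i].key.tgid == key.tgid &&
                    tbl[i].key.qos_type == key.qos_type)
                        return ++tbl[i].last;

        tbl[used].key = key;
        return tbl[used++].last = 1; /* 0 is reserved: system wide */
}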
* Re: [PATCH v2 RFC 08/13] sched/qos: Add a new sched-qos interface
  [not found] ` <20260504020003.71306-9-qyousef@layalina.io>
  2026-05-06 20:38 ` [PATCH v2 RFC 08/13] sched/qos: Add a new sched-qos interface Tim Chen
@ 2026-05-11 10:57 ` Peter Zijlstra
  1 sibling, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2026-05-11 10:57 UTC
To: Qais Yousef
Cc: Ingo Molnar, Vincent Guittot, Rafael J. Wysocki, Viresh Kumar,
    Juri Lelli, Steven Rostedt, John Stultz, Dietmar Eggemann,
    Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm

On Mon, May 04, 2026 at 02:59:58AM +0100, Qais Yousef wrote:
> Provide a generic and extensible interface to describe arbitrary QoS
> tags to tell the kernel about specific behavior that doesn't fall into
> the existing sched_attr.
>
> The interface is broken into three parts:
>
>  * Type
>  * Value
>  * Cookie
>
> Type is an enum that should give us enough space to extend (and
> deprecate) comfortably.
>
> Value is a signed 64-bit number to allow for arbitrarily high values.
>
> Cookie is to help group tasks selectively, so that some QoS types can
> operate on tasks per group. A value of 0 indicates system wide.

[ ... ]

> +To set multiple QoS hints, a syscall is required for each. This is a
> +trade-off to reduce churn when extending the interface. The hope is
> +for this to evolve as workloads and hardware get more sophisticated
> +and the need for extension arises; when that happens, it should be
> +simple to add the kernel extension and let userspace use it readily by
> +setting the newly added flag, without having to update the whole of
> +sched_attr.

So 'type' is effectively meant to be an ephemeral space of hints. A
kernel can, or can not, support this arbitrary set of hints. If a
particular type is supported across two kernels, it is assumed to mean
the same thing - although its implementation might be different.

Your next patch implements type 0 to be this pelt multiplier thing.

I wonder about discoverability. Suppose we create and discard a fair
number of these types, just because. Then how is someone (this
muddle-ware component for example) to discover which set of hints is
supported by the kernel of the day?

I suppose it can go and scan the space, by trying to set hints on
itself or something, but that seems sub-optimal.
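The scan Peter calls sub-optimal would look something like the sketch
below: try each type on the calling task and classify the errno. It
assumes set_qos() wraps sched_setattr() as in the earlier sketch, and
that unsupported types fail with EOPNOTSUPP - which this series does
not actually specify; and a "successful" probe has the side effect of
setting the hint. Both caveats illustrate the discoverability gap.

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

extern int set_qos(pid_t tid, uint32_t type, int64_t value,
                   uint32_t cookie);

/* Crude discoverability probe over an assumed 64-entry type space;
 * probing mutates state on success, which is exactly why this
 * approach is sub-optimal. */
static void probe_qos_types(void)
{
        for (uint32_t type = 0; type < 64; type++) {
                errno = 0;
                if (!set_qos(0, type, 0, 0))
                        printf("type %u: supported\n", type);
                else if (errno == EOPNOTSUPP)
                        printf("type %u: unsupported\n", type);
                else
                        printf("type %u: unclear (errno %d)\n",
                               type, errno);
        }
}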
[parent not found: <20260504020003.71306-10-qyousef@layalina.io>]
* Re: [PATCH v2 09/13] sched/qos: Add rampup multiplier QoS
  [not found] ` <20260504020003.71306-10-qyousef@layalina.io>
@ 2026-05-11 11:03 ` Peter Zijlstra
  0 siblings, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2026-05-11 11:03 UTC
To: Qais Yousef
Cc: Ingo Molnar, Vincent Guittot, Rafael J. Wysocki, Viresh Kumar,
    Juri Lelli, Steven Rostedt, John Stultz, Dietmar Eggemann,
    Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm

On Mon, May 04, 2026 at 02:59:59AM +0100, Qais Yousef wrote:
> diff --git a/Documentation/scheduler/sched-qos.rst b/Documentation/scheduler/sched-qos.rst
> index 0911261cb124..f68856f23b6b 100644
> --- a/Documentation/scheduler/sched-qos.rst
> +++ b/Documentation/scheduler/sched-qos.rst
> @@ -42,3 +42,25 @@
>  simple to add the kernel extension and let userspace use it readily by
>  setting the newly added flag, without having to update the whole of
>  sched_attr.
> +
> +2. QoS Tags
> +===========
> +
> +SCHED_QOS_RAMPUP_MULTIPLIER
> +---------------------------
> +
> +Controls how fast the util signal rises. This affects frequency
> +selection when schedutil is in use, and how fast tasks migrate between
> +clusters on HMP systems.
> +
> +It affects bursty tasks only. Perfectly periodic tasks are well
> +described by util_avg, and the rampup multiplier has no effect on
> +them.
> +
> +When set to 0, util_est will be disabled to help further with power
> +saving. This behavior can be controlled via the UTIL_EST_RAMPUP_ZERO
> +sched_feature.
> +
> +The value is not capped, to retain flexibility, but the effect tapers
> +off very quickly and it is hard to notice a difference above 16.
> +Roughly, it takes ~200ms for util_avg to reach 1000 starting from 0;
> +with a multiplier of 16 it should take ~12.5ms. A range of 0-8 is
> +advised for general use.
> +
> +Cookie must always be set to 0.

So this is a very specific feature. It is made possible by basically
having a huge type space, allowing for throw-away hints (as per the
previous email).

I suppose having these specific hints is easy, but as per always there
is the discussion about describing task behaviour vs implementation
details. With the argument being that task behaviour might be a more
lasting / stable hint, while implementation details are far easier to
actually do.

I'm missing this discussion.
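The figures in the quoted doc can be sanity-checked against PELT's 32ms
halflife: util converges as 1024 * (1 - 0.5^(t/32ms)), and the
multiplier effectively scales elapsed time. The sketch below is a
continuous-time approximation (the real signal moves in discrete 1ms
steps); the doc's ~200ms is in the same ballpark as the ~173ms this
gives.

#include <math.h>
#include <stdio.h>

/* Approximate wall-clock ms for util_avg to ramp from 0 to 'target'
 * (out of 1024), assuming PELT's 32ms halflife and that the rampup
 * multiplier simply scales elapsed time. Build with -lm. */
static double ramp_ms(double target, int mult)
{
        double t_eff = 32.0 * log2(1024.0 / (1024.0 - target));

        return t_eff / (mult > 0 ? mult : 1);
}

int main(void)
{
        printf("mult  1: %6.1f ms\n", ramp_ms(1000, 1));  /* ~173 ms */
        printf("mult 16: %6.1f ms\n", ramp_ms(1000, 16)); /* ~10.8 ms */
        return 0;
}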
* Re: [PATCH v2 00/13] sched/fair/schedutil: Better manage system response time
  [not found] <20260504020003.71306-1-qyousef@layalina.io>
  [not found] ` <20260504020003.71306-9-qyousef@layalina.io>
  [not found] ` <20260504020003.71306-10-qyousef@layalina.io>
@ 2026-05-11 17:58 ` John Stultz
  2 siblings, 0 replies; 7+ messages in thread
From: John Stultz @ 2026-05-11 17:58 UTC
To: Qais Yousef
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
    Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann,
    Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm

On Sun, May 3, 2026 at 7:00 PM Qais Yousef <qyousef@layalina.io> wrote:
>
> This is the long-delayed follow-up to the series sent back in August
> 2024 [1]. Life got in the way to some extent (I had a baby, and the
> time I used to spend on upstream work late at night was stolen :).
> Apologies to those who replied and whom I didn't get a chance to
> respond to.
>
...
> Open questions:
>
>  * The details of the QoS interface are the biggest one.
>  * Would debugfs be better than sysctl for setting the default rampup
>    multiplier?
>  * Patch 13 makes updating load_avg unconditional, not on period
>    boundaries.
>
> Patches 1-3 are preparatory patches renaming a function and
> introducing new ones.
>
> Patches 4-5 handle the magic margin problem by making the margins
> dynamic, based on actual hardware limitations.
>
> Patches 6-7 fix the black hole problem and teach the scheduler how to
> handle bursty and periodic tasks by extending util_est.
>
> Patches 8-9 are where I expect most of the discussion, as I introduce
> a new sched_qos interface to support the new rampup_multiplier to help
> manage DVFS.
>
> Patches 10-11 introduce a couple of necessary optimizations to counter
> the power impact of the increased responsiveness, by disabling some
> features that we now know how to handle better.
>
> Patches 12-13 fix a couple of issues causing the util_est and util_avg
> values to swing for a periodic task. Patch 12 must go via stable.

Just a minor nit: if 12/13 are fixes, should they not be at the front
of the series (or possibly sent separately) so they can move forward
while the bigger changes in this series are discussed?

thanks
-john