From: Qais Yousef <qyousef@layalina.io>
To: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
Vincent Guittot <vincent.guittot@linaro.org>,
"Rafael J. Wysocki" <rafael@kernel.org>,
Viresh Kumar <viresh.kumar@linaro.org>,
Juri Lelli <juri.lelli@redhat.com>,
Steven Rostedt <rostedt@goodmis.org>,
John Stultz <jstultz@google.com>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
"Chen, Yu C" <yu.c.chen@intel.com>,
Thomas Gleixner <tglx@kernel.org>,
linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org
Subject: Re: [PATCH v2 RFC 08/13] sched/qos: Add a new sched-qos interface
Date: Thu, 7 May 2026 10:55:16 +0100
Message-ID: <20260507095516.vv7blulzskkyezin@airbuntu>
In-Reply-To: <2b9fd875df1f71d2c12c21938784a6c1fd38c04a.camel@linux.intel.com>
On 05/06/26 13:38, Tim Chen wrote:
> On Mon, 2026-05-04 at 02:59 +0100, Qais Yousef wrote:
> > Provide a generic and extensible interface to describe arbitrary QoS
> > tags to tell the kernel about specific behavior that doesn't fall
> > into the existing sched_attr.
> >
> > The interface is broken into three parts:
> >
> > * Type
> > * Value
> > * Cookie
> >
> > Type is an enum that should give us enough space to extend (and
> > deprecate) comfortably.
> >
> > Value is a signed 64-bit number to allow for arbitrarily high values.
> >
> > Cookie is to help group tasks selectively, since some QoS hints might want
> > to operate on tasks per group. A value of 0 indicates system wide.
> >
> > There are two anticipated users being discussed on the list.
> >
> > 1. Per-task rampup multiplier to control how fast util rises, and by
> >    implication how quickly a task can migrate between cores on HMP
> >    systems and cause frequencies to rise with schedutil.
> >
> > 2. Tag a group of tasks that are memory dependent for Cache Aware
> >    Scheduling.
> >
> > The interface is anticipated to be provisioned to apps via utilities and
> > libraries. schedqos [1] is an example of how such an interface can be used
> > to provide a higher level QoS abstraction to describe workloads without
> > baking it into the binaries, and by implication without worrying about
> > potential abuse. The interface requires privileged access since QoS is
> > considered a scarce resource and requires admin control to ensure it is
> > set properly. Again, that admin control is anticipated to be the schedqos
> > utility service.
> >
> > QoS is treated as a scarce resource and the intention is for a syscall
> > to be done for each individual QoS tag. For the same reason, QoS tags
> > are not inherited on fork by default.
> >
> > A reasonable point of debate is whether to make sched_qos an array of
> > 3 or 5 values, to avoid a potential bottleneck if this grows large and
> > users end up having to issue too many syscalls to set all QoS tags.
> > Being limited as it is now helps enforce intentionality and scarcity of
> > tagging.
> >
> > [1] https://github.com/qais-yousef/schedqos
> >
> > Signed-off-by: Qais Yousef <qyousef@layalina.io>
> > ---
> > Documentation/scheduler/index.rst | 1 +
> > Documentation/scheduler/sched-qos.rst | 44 ++++++++++++++++++
> > include/uapi/linux/sched.h | 4 ++
> > include/uapi/linux/sched/types.h | 46 +++++++++++++++++++
> > kernel/sched/syscalls.c | 10 ++++
> > .../trace/beauty/include/uapi/linux/sched.h | 4 ++
> > 6 files changed, 109 insertions(+)
> > create mode 100644 Documentation/scheduler/sched-qos.rst
> >
> > diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
> > index 17ce8d76befc..6652f18e553b 100644
> > --- a/Documentation/scheduler/index.rst
> > +++ b/Documentation/scheduler/index.rst
> > @@ -23,5 +23,6 @@ Scheduler
> > sched-stats
> > sched-ext
> > sched-debug
> > + sched-qos
> >
> > text_files
> > diff --git a/Documentation/scheduler/sched-qos.rst b/Documentation/scheduler/sched-qos.rst
> > new file mode 100644
> > index 000000000000..0911261cb124
> > --- /dev/null
> > +++ b/Documentation/scheduler/sched-qos.rst
> > @@ -0,0 +1,44 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +=============
> > +Scheduler QoS
> > +=============
> > +
> > +1. Introduction
> > +===============
> > +
> > +Different workloads have different scheduling requirements to operate
> > +optimally. The same applies to tasks within the same workload.
> > +
> > +To enable smarter usage of system resources and to cater for the conflicting
> > +demands of various tasks, Scheduler QoS offers a mechanism to provide more
> > +information about those demands so that the scheduler can make a best effort
> > +to honour them.
> > +
> > + @sched_qos_type what QoS hint to apply
> > + @sched_qos_value value of the QoS hint
> > + @sched_qos_cookie magic cookie to tag a group of tasks to which the QoS
> > + applies. If 0, the hint will apply globally system
> > + wide. If not 0, the hint will be relative only to tasks
> > + that have the same cookie value.
>
> Qais,
>
> Thanks for your proposal. I have some follow up thoughts.
>
> How can we query all the tasks that use a cookie?
At the moment you can use sched_getattr() to query the cookie set for
a specific QOS.
We'll probably have to expose something in procfs to allow parsing all tasks
that share a cookie.
Generally the idea is that this is managed by userspace, and I'd expect the
schedqos service to already know this info as it has to set it.
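To illustrate, a rough sketch of the query side (the sched_qos_* fields only
exist with this series applied, and how the queried type is selected isn't
settled yet, so treat the details as illustrative):

	#include <sys/syscall.h>
	#include <unistd.h>
	#include <linux/sched/types.h>

	/* Read back the cookie a task has set for one QoS type. */
	static int get_qos_cookie(pid_t pid, unsigned int qos_type, __u32 *cookie)
	{
		struct sched_attr attr = { .size = sizeof(attr) };

		/* Illustrative: type selects which QoS hint to query. */
		attr.sched_qos_type = qos_type;
		if (syscall(SYS_sched_getattr, pid, &attr, sizeof(attr), 0))
			return -1;

		*cookie = attr.sched_qos_cookie;
		return 0;
	}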
> A scenario I can think of is there may be two groups of tasks, and we may
To be clear, two groups of tasks belonging to the same process, right?
I am still not clear on grouping tasks across different processes. Nothing in
the API prevents it, but I am wary of inter-process task grouping as the same
application can have several instances running in the system, and this is
a layer of complexity that I am not sure is warranted. That said, nothing in
the proposal prevents us from handling this if it ends up really making sense.
It'd be a matter of saying the QOS_TYPE_XYZ cookie has to be system wide
unique and can be used to tag tasks in different processes, unlike the default
behavior which is unique per process. It'd still have to be unique per
sched_qos_type.
> want to merge the two groups of tasks into one when they start sharing
> data in the context of cache aware scheduling. In that case, we
> need to get all the tasks under the second cookie and change them to
> that of the first. We may need to link together tasks sharing a cookie.
My implementation should have been explicit about returning EOPNOTSUPP if
a cookie is specified. And the doc text could have been clearer.
The idea is that the cookie is per QOS per process. So QOS_TYPE_A would have
its unique cookie range, and QOS_TYPE_B would have its independent unique
cookie range. This allows flexibility and extensibility to describe
independent behavior that requires independent grouping.
So in your example, assuming the grouping is done for a single process,
group_a would have QOS_DATA_DEP with a cookie value of 1, and group_b would
have QOS_DATA_DEP with a cookie value of 2. To merge them, you'd change
group_a's or group_b's cookie to match the other group's value.
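In sched_setattr() terms, the merge would look roughly like this (a sketch
only; SCHED_FLAG_QOS is a hypothetical flag name and the QOS_DATA_DEP
semantics follow this discussion, not a final interface):

	#include <sys/syscall.h>
	#include <unistd.h>
	#include <linux/sched.h>
	#include <linux/sched/types.h>

	/* Re-tag one task's QOS_DATA_DEP cookie. */
	static int set_data_dep_cookie(pid_t pid, __u32 cookie)
	{
		struct sched_attr attr = {
			.size             = sizeof(attr),
			.sched_flags      = SCHED_FLAG_QOS,	/* hypothetical */
			.sched_qos_type   = QOS_DATA_DEP,
			.sched_qos_cookie = cookie,
		};

		return syscall(SYS_sched_setattr, pid, &attr, 0);
	}

The merge is then just calling set_data_dep_cookie() on task_3 and task_4
with group_a's cookie value of 1 (or vice versa).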
I still haven't fully thought through how to do this with schedqos configs,
but one idea:
	{
	    "process_a": {
	        "thread_qos": {
	            "task_1": [ ... ],
	            "task_2": [ ... ],
	            "task_3": [ ... ],
	            "task_4": [ ... ]
	        },
	        "qos_groups": {
	            "group_a": {
	                "QOS_DATA_DEP": ["task_1", "task_2"]
	            },
	            "group_b": {
	                "QOS_DATA_DEP": ["task_3", "task_4"]
	            }
	        }
	    }
	}
It implies, of course, that to merge you'd have to change the description and
restart the service.
If you want to merge and unmerge at runtime, then I'd have big question marks
over whether they really belong to a group. From experience, for this to be
really beneficial you need to describe the dominant behavior even if sometimes
it is untrue. Trying to be exact tends to backfire.
If we really want to do something at runtime, then group control (the planned
userspace extension to exercise QOS control based on cgroup grouping) would be
the way to go IMO.
>
> We probably need a sched_qos_cookie structure defined analogous to
> sched_core_cookie to anchor the tasks. And the sched_qos_cookie field could
> be a pointer to a sched_qos_cookie structure, as with sched_core_cookie,
> instead of being a __u32 as in the patch below.
As part of the API or an internal implementation detail? I think we do need
a cookie structure that stores the (sched_qos_type, sched_qos_cookie) tuple
internally as an implementation detail, but not expose it as an interface.
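Something like this is the kind of internal structure I have in mind (kernel
side, names illustrative only, not proposed UAPI):

	/*
	 * Kernel-internal grouping anchor: one instance per
	 * (sched_qos_type, cookie) pair within a process.
	 */
	struct sched_qos_cookie {
		unsigned int		type;		/* sched_qos_type */
		u32			cookie;		/* userspace-chosen id */
		refcount_t		refcount;	/* tasks holding this tag */
		struct hlist_node	node;		/* per-process hash lookup */
	};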
I think the cookie values should be userspace managed. From experience, this
has to be done in a centralized way via a service, otherwise you'd end up with
a mess. There has to be an all-knowing entity managing things, which is what
I am proposing with the schedqos service. That's why the whole QOS is now
protected with the CAP_SYS_NICE capability - a change from v1 that I forgot
to mention.
We want the interface to be flexible and stand the test of time. I don't just
want to support extensibility, but for us to be able to say we know better now
and must deprecate something in favour of a new thing that does it better. It
means schedqos has to deal with some complexity to manage deprecation, but on
the kernel side hopefully we can just delete code with ease. Hopefully there
will be one or a few centralized entities around.