From: Qais Yousef <qyousef@layalina.io>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>,
Vincent Guittot <vincent.guittot@linaro.org>,
"Rafael J. Wysocki" <rafael@kernel.org>,
Viresh Kumar <viresh.kumar@linaro.org>,
Juri Lelli <juri.lelli@redhat.com>,
Steven Rostedt <rostedt@goodmis.org>,
John Stultz <jstultz@google.com>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Tim Chen <tim.c.chen@linux.intel.com>,
"Chen, Yu C" <yu.c.chen@intel.com>,
Thomas Gleixner <tglx@kernel.org>,
linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org
Subject: Re: [PATCH v2 RFC 08/13] sched/qos: Add a new sched-qos interface
Date: Tue, 12 May 2026 08:58:09 +0100
Message-ID: <20260512075809.5on43u3wrnelqe4i@airbuntu>
In-Reply-To: <20260511105704.GR3126523@noisy.programming.kicks-ass.net>
On 05/11/26 12:57, Peter Zijlstra wrote:
> On Mon, May 04, 2026 at 02:59:58AM +0100, Qais Yousef wrote:
> > Provide a generic and extensible interface to describe arbitrary QoS
> > tags that tell the kernel about specific behavior that doesn't fall
> > into the existing sched_attr.
> >
> > The interface is broken into three parts:
> >
> > * Type
> > * Value
> > * Cookie
> >
> > Type is an enum that should give us enough space to extend (and
> > deprecate) comfortably.
> >
> > Value is a signed 64bit number to allow for arbitrarily high values.
> >
> > Cookie is to help group tasks selectively, since some QoS types may want
> > to operate on tasks per group. A value of 0 indicates system wide.
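Interleaving an illustration here: the three-part interface above could be sketched as a userspace-visible struct along these lines. The field names follow the @sched_qos_* names documented in this patch; the enum contents and exact layout are illustrative assumptions, not taken from the series:

```c
#include <stdint.h>

/* Illustrative only: the real enum values are whatever the series defines. */
enum sched_qos_type {
	SCHED_QOS_RAMPUP_MULTIPLIER = 0,	/* e.g. the hint added in patch 09/13 */
};

/* One QoS tag; the intention is one syscall per tag. */
struct sched_qos {
	uint32_t sched_qos_type;	/* which QoS hint to apply */
	int64_t  sched_qos_value;	/* signed 64bit: allows arbitrarily high values */
	uint64_t sched_qos_cookie;	/* 0 = system wide; else relative to the group */
};
```

Keeping the struct this small, rather than an array of hints, matches the stated goal of enforcing intentionality and scarcity of tagging.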
> >
> > There are two anticipated users being discussed on the list.
> >
> > 1. A per-task rampup multiplier to allow controlling how fast util rises,
> >    and by implication how quickly a task can migrate between cores on HMP
> >    systems and cause frequencies to rise with schedutil.
> >
> > 2. Tag a group of tasks that are memory dependent for Cache Aware
> >    Scheduling.
> >
> > The interface is anticipated to be provisioned to apps via utilities and
> > libraries. schedqos [1] is an example of how such an interface can be
> > used to provide a higher level QoS abstraction that describes workloads
> > without baking it into the binaries, and by implication without worrying
> > about potential abuse. The interface requires privileged access since QoS
> > is considered a scarce resource and requires admin control to ensure it
> > is set properly. Again, that admin control is anticipated to be the
> > schedqos utility service.
> >
> > QoS is treated as a scarce resource and the intention is for a syscall
> > to be done for each individual QoS tag. For the same reason, QoS tags
> > are not inherited on fork by default.
> >
> > A reasonable point of debate is whether to make sched_qos an array of
> > 3 or 5 values to avoid a potential bottleneck if this grows large and
> > users end up having to issue too many syscalls to set all QoS tags.
> > Being limited as it is now helps enforce intentionality and scarcity of
> > tagging.
>
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +=============
> > +Scheduler QoS
> > +=============
> > +
> > +1. Introduction
> > +===============
> > +
> > +Different workloads have different scheduling requirements to operate
> > +optimally. The same applies to tasks within the same workload.
> > +
> > +To enable smarter usage of system resources and to cater for the
> > +conflicting demands of various tasks, Scheduler QoS provides a mechanism
> > +to give the scheduler more information about those demands so that it
> > +can make a best effort to honour them.
> > +
> > + @sched_qos_type what QoS hint to apply
> > + @sched_qos_value value of the QoS hint
> > + @sched_qos_cookie magic cookie to tag a group of tasks for which the QoS
> > +                   applies. If 0, the hint will apply globally system
> > +                   wide. If not 0, the hint will be relative only to
> > +                   tasks that have the same cookie value.
> > +
> > +QoS hints are set once and not inherited by children by design. The
> > +rationale is that each task has its own characteristics, and users are
> > +encouraged to describe each of these separately. Also, since system
> > +resources are finite, there's a limit to what can be done to honour
> > +these requests before reaching a tipping point where there are too many
> > +requests for a particular QoS to service all of them at once, and some
> > +will start to lose out. For example, if 10 tasks require better wake up
> > +latencies on a 4-CPU SMP system and they all wake up at once, only 4 can
> > +perceive the hint as honoured and the rest will have to wait.
> > +Inheritance can easily turn these 10 into 100 or 1000, at which point
> > +the QoS hint rapidly loses its meaning and effectiveness. The chances of
> > +10 tasks waking up at the same time are lower than those of 100, and
> > +lower still than those of 1000.
> > +
> > +To set multiple QoS hints, a syscall is required for each. This is a
> > +trade-off to reduce churn when extending the interface. The hope is for
> > +this interface to evolve as workloads and hardware get more
> > +sophisticated and the need for extension arises; when this happens, the
> > +task should be simpler: add the kernel extension and let userspace use
> > +it readily by setting the newly added flag, without having to update the
> > +whole of sched_attr.
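As an illustration of the one-syscall-per-hint model described above: userspace setting two hints issues two separate calls. sched_setqos() below is a hypothetical stand-in stubbed in userspace (the RFC does not fix a syscall name or number); it exists only to show the calling pattern and the EOPNOTSUPP behaviour for an unsupported type:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical wrapper; the real entry point is whatever the series ends up
 * defining. Stubbed so only type 0 (the rampup multiplier) is "supported". */
static int sched_setqos(int pid, uint32_t type, int64_t value, uint64_t cookie)
{
	(void)pid; (void)value; (void)cookie;
	if (type != 0)
		return -EOPNOTSUPP;
	return 0;
}

/* Set several hints on one task: one syscall per QoS tag, by design. */
static int apply_hints(int pid)
{
	/* system-wide rampup multiplier hint (cookie == 0) */
	int ret = sched_setqos(pid, 0, 2, 0);

	if (ret)
		return ret;
	/* a second, unsupported type is rejected rather than recycled */
	return sched_setqos(pid, 1, 1, 0);
}
```

Note how the caller must be prepared for -EOPNOTSUPP on any individual tag, since types can be deprecated across kernel versions.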
>
> So 'type' is effectively meant to be an ephemeral space of hints. A
> kernel can, or can not, support this arbitrary set of hints.
Yes. A 'type' is not expected to be recycled if deprecated. It'll just return
an error if no longer supported.
>
> If a particular type is supported across two kernels, it is assumed to
> be the same -- although its implementation might be different.
Yes. If the implementation details became so different that a 'type' no longer
makes sense, it'd just be deprecated (return EOPNOTSUPP) in favour of whatever
makes sense then, if still necessary.
Will document this more explicitly.
>
> Your next patch implements type-0 to be this pelt multiplier thing.
>
> I wonder about discoverability, suppose we create and discard a fair
> number of these types, just because. Then how is someone (this
> muddle-ware component for example) to discover which set of hints is
> supported by the kernel of the day?
>
> I suppose it can go and scan the space, by trying to set hints on itself
> or something, but that seems sub-optimal.
Yes, I think that would be the best starting point. If you saw the schedqos
code, I already have to parse procfs to discover all existing running
processes/tasks after connecting to a netlink socket to listen for new
forks/execs (and plug the race). Inefficient, but done once at service start
it is simple enough and unlikely to be a real bottleneck or pain point.
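To sketch what scanning the type space could look like: probe each type on the calling task and record which ones the kernel accepts. The syscall is again a hypothetical stub here (pretending types 0 and 2 are supported); a real probe would also need to read back or restore the previous value to stay side-effect free:

```c
#include <errno.h>
#include <stdint.h>

#define SCHED_QOS_TYPE_MAX 64	/* arbitrary scan bound for this sketch */

/* Stub standing in for the real syscall: pretend types 0 and 2 exist. */
static int sched_setqos_stub(uint32_t type)
{
	return (type == 0 || type == 2) ? 0 : -EOPNOTSUPP;
}

/* Discover supported hint types by scanning the space and recording which
 * calls succeed. Returns a bitmask with one bit set per supported type. */
static uint64_t probe_qos_types(void)
{
	uint64_t supported = 0;

	for (uint32_t t = 0; t < SCHED_QOS_TYPE_MAX; t++) {
		if (sched_setqos_stub(t) == 0)
			supported |= 1ULL << t;
	}
	return supported;
}
```

Sub-optimal, as noted, but cheap enough when done once at service start.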
The current line of thought is to keep as much as possible in userspace until
we have enough data on usage and bottlenecks to drive any kernel changes.
Userspace here means the schedqos service.
One thing to note also: with the schedqos integration, when folks need to add
a new hint they can actually deploy it at scale more easily. I am hoping the
development process would be:
* Add new hint to address particular problem
* Integrate with schedqos to auto tag applications that can benefit from this
hint
* Deploy on production or production like system to prove the benefit,
trade-offs
* Discuss inclusion with upstream
IOW, I am hoping we can get more real-life data on the benefit for real
workloads and systems outside of the usual synthetic cases. I am expecting
most kernel hints to be connected somehow to a higher level of abstraction in
the schedqos service - the expectation so far is that kernel-level hints are
hard for apps to use directly. More on this in my reply to your next email.