From: Andrea Righi <arighi@nvidia.com>
To: Qiliang Yuan <realwujing@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Tejun Heo <tj@kernel.org>, Emil Tsalapatis <emil@etsalapatis.com>,
Ryan Newton <newton@meta.com>, David Dai <david.dai@linux.dev>,
zhidao su <suzhidao@xiaomi.com>,
Jake Hillion <jake@hillion.co.uk>,
Qiliang Yuan <yuanql9@chinatelecom.cn>,
David Vernet <void@manifault.com>,
Changwoo Min <changwoo@igalia.com>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Valentin Schneider <vschneid@redhat.com>,
Dan Schatzberg <schatzberg.dan@gmail.com>,
sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched/ext: Add cpumask to skip unsuitable dispatch queues
Date: Tue, 3 Feb 2026 09:37:14 +0100 [thread overview]
Message-ID: <aYGzumDW2sQ8xQSD@gpd4>
In-Reply-To: <20260203030400.3313990-1-realwujing@gmail.com>
Hi Qiliang,
On Mon, Feb 02, 2026 at 10:03:46PM -0500, Qiliang Yuan wrote:
> Add a cpumask field to struct scx_dispatch_q to track the union of
> allowed CPUs for all tasks in the queue. Use this mask to perform an
> O(1) check in consume_dispatch_q() before scanning the queue.
>
> When a CPU attempts to consume from a queue, it currently must iterate
> through all N tasks to determine if any can run on that CPU. If the
> queue contains only tasks pinned to other CPUs (via sched_setaffinity
> or cgroups), this O(N) scan finds nothing.
>
> With the cpumask, if the current CPU is not in the allowed set, skip
> the entire queue immediately with a single bit test. This changes the
> "queue is unsuitable" case from O(N) to O(1).
>
> The mask is updated when tasks are enqueued and cleared when the queue
> becomes empty, preventing permanent saturation from transient pinned
> tasks.
>
> This benefits large systems with CPU-pinned workloads, where CPUs
> frequently scan queues containing no eligible tasks.
Did you run any benchmarks / do you have numbers?

It's true that we save the O(N) scan when the DSQ has no eligible tasks,
but we're also adding cost to every enqueue: cpumask_or() on potentially
large cpumasks isn't free.

This optimization should help when queues frequently contain only tasks
pinned to other CPUs, or when queues are long (N is large). For small
queues or mixed workloads, though, I suspect the cpumask overhead would
exceed the savings...
>
> Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
> Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
> ---
> include/linux/sched/ext.h | 1 +
> kernel/sched/ext.c | 21 ++++++++++++++++++++-
> 2 files changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index bcb962d5ee7d..f20e57cf53a3 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -79,6 +79,7 @@ struct scx_dispatch_q {
> struct rhash_head hash_node;
> struct llist_node free_node;
> struct rcu_head rcu;
> + struct cpumask *cpus_allowed; /* union of all tasks' allowed cpus */
> };
>
> /* scx_entity.flags */
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index afe28c04d5aa..5a060c97cd64 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1120,8 +1120,12 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
>
> if (is_local)
> local_dsq_post_enq(dsq, p, enq_flags);
> - else
> + else {
> + /* Update cpumask to track union of all tasks' allowed CPUs */
> + if (dsq->cpus_allowed)
> + cpumask_or(dsq->cpus_allowed, dsq->cpus_allowed, p->cpus_ptr);
> raw_spin_unlock(&dsq->lock);
> + }
> }
The cpumask is only updated during enqueue and cleared when the queue
empties. If a task's affinity changes while it's already queued (e.g.,
via sched_setaffinity()), the cpus_allowed mask becomes stale. This
means: 1) the mask may include CPUs that no queued task can actually run
on anymore (false positive), or, more critically, 2) if a task's
affinity expands, the mask won't reflect it, causing CPUs to skip a
queue that actually has eligible tasks (false negative).

I think we need to hook into the affinity-change path (e.g., somewhere
around sched_change) to update the mask when p->cpus_ptr changes.
>
> static void task_unlink_from_dsq(struct task_struct *p,
> @@ -1138,6 +1142,10 @@ static void task_unlink_from_dsq(struct task_struct *p,
> list_del_init(&p->scx.dsq_list.node);
> dsq_mod_nr(dsq, -1);
>
> + /* Clear cpumask when queue becomes empty to prevent saturation */
> + if (dsq->nr == 0 && dsq->cpus_allowed)
> + cpumask_clear(dsq->cpus_allowed);
> +
> if (!(dsq->id & SCX_DSQ_FLAG_BUILTIN) && dsq->first_task == p) {
> struct task_struct *first_task;
>
> @@ -1897,6 +1905,14 @@ static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
> if (list_empty(&dsq->list))
> return false;
>
> + /*
> + * O(1) optimization: Check if any task in the queue can run on this CPU.
> + * If the cpumask is allocated and this CPU is not in the allowed set,
> + * we can skip the entire queue without scanning.
> + */
> + if (dsq->cpus_allowed && !cpumask_test_cpu(cpu_of(rq), dsq->cpus_allowed))
> + return false;
> +
> raw_spin_lock(&dsq->lock);
>
> nldsq_for_each_task(p, dsq) {
> @@ -3397,6 +3413,9 @@ static void init_dsq(struct scx_dispatch_q *dsq, u64 dsq_id)
> raw_spin_lock_init(&dsq->lock);
> INIT_LIST_HEAD(&dsq->list);
> dsq->id = dsq_id;
> +
> + /* Allocate cpumask for tracking allowed CPUs */
> + dsq->cpus_allowed = kzalloc(cpumask_size(), GFP_KERNEL);
I don't see the corresponding kfree() in the cleanup path (I'd expect it
around free_dsq_irq_workfn() below), so this leaks the cpumask when the
DSQ is freed. At least an allocation failure here is handled gracefully,
since every use is guarded by a dsq->cpus_allowed NULL check.
> }
>
> static void free_dsq_irq_workfn(struct irq_work *irq_work)
> --
> 2.51.0
>
Thanks,
-Andrea
Thread overview: 3+ messages
2026-02-03 3:03 [PATCH] sched/ext: Add cpumask to skip unsuitable dispatch queues Qiliang Yuan
2026-02-03 8:37 ` Andrea Righi [this message]
2026-02-04 9:41 ` Qiliang Yuan