public inbox for linux-kernel@vger.kernel.org
From: Andrea Righi <arighi@nvidia.com>
To: Qiliang Yuan <realwujing@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Tejun Heo <tj@kernel.org>, Emil Tsalapatis <emil@etsalapatis.com>,
	Ryan Newton <newton@meta.com>, David Dai <david.dai@linux.dev>,
	zhidao su <suzhidao@xiaomi.com>,
	Jake Hillion <jake@hillion.co.uk>,
	Qiliang Yuan <yuanql9@chinatelecom.cn>,
	David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Dan Schatzberg <schatzberg.dan@gmail.com>,
	sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched/ext: Add cpumask to skip unsuitable dispatch queues
Date: Tue, 3 Feb 2026 09:37:14 +0100	[thread overview]
Message-ID: <aYGzumDW2sQ8xQSD@gpd4> (raw)
In-Reply-To: <20260203030400.3313990-1-realwujing@gmail.com>

Hi Qiliang,

On Mon, Feb 02, 2026 at 10:03:46PM -0500, Qiliang Yuan wrote:
> Add a cpumask field to struct scx_dispatch_q to track the union of
> allowed CPUs for all tasks in the queue. Use this mask to perform an
> O(1) check in consume_dispatch_q() before scanning the queue.
> 
> When a CPU attempts to consume from a queue, it currently must iterate
> through all N tasks to determine if any can run on that CPU. If the
> queue contains only tasks pinned to other CPUs (via sched_setaffinity
> or cgroups), this O(N) scan finds nothing.
> 
> With the cpumask, if the current CPU is not in the allowed set, skip
> the entire queue immediately with a single bit test. This changes the
> "queue is unsuitable" case from O(N) to O(1).
> 
> The mask is updated when tasks are enqueued and cleared when the queue
> becomes empty, preventing permanent saturation from transient pinned
> tasks.
> 
> This benefits large systems with CPU-pinned workloads, where CPUs
> frequently scan queues containing no eligible tasks.

Did you run some benchmarks / have some numbers?

It's true that we save the O(N) scan when the DSQ has no eligible tasks,
but we're adding cost on every enqueue: cpumask_or() on potentially large
cpumasks can be expensive.

I think this optimization can help when queues frequently contain only
tasks pinned to other CPUs or when the queue has many tasks (N is large).
I have the feeling that for small queues or mixed workloads, the cpumask
overhead probably exceeds the savings...

> 
> Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
> Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
> ---
>  include/linux/sched/ext.h |  1 +
>  kernel/sched/ext.c        | 21 ++++++++++++++++++++-
>  2 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index bcb962d5ee7d..f20e57cf53a3 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -79,6 +79,7 @@ struct scx_dispatch_q {
>  	struct rhash_head	hash_node;
>  	struct llist_node	free_node;
>  	struct rcu_head		rcu;
> +	struct cpumask		*cpus_allowed; /* union of all tasks' allowed cpus */
>  };
>  
>  /* scx_entity.flags */
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index afe28c04d5aa..5a060c97cd64 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1120,8 +1120,12 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
>  
>  	if (is_local)
>  		local_dsq_post_enq(dsq, p, enq_flags);
> -	else
> +	else {
> +		/* Update cpumask to track union of all tasks' allowed CPUs */
> +		if (dsq->cpus_allowed)
> +			cpumask_or(dsq->cpus_allowed, dsq->cpus_allowed, p->cpus_ptr);
>  		raw_spin_unlock(&dsq->lock);
> +	}
>  }

The cpumask is only updated during enqueue and cleared when the queue
empties. If a task's affinity changes while it's already queued (e.g.
via sched_setaffinity()), the cpus_allowed mask becomes stale. This
means: 1) the mask may include CPUs that no queued task can actually
run on anymore (false positive), or, more critically, 2) if a task's
affinity expands, the mask won't reflect the new CPUs, causing those
CPUs to skip a queue that actually has eligible tasks (false negative).

I think we need to hook something in sched_change to update the mask when
p->cpus_ptr changes.

>  
>  static void task_unlink_from_dsq(struct task_struct *p,
> @@ -1138,6 +1142,10 @@ static void task_unlink_from_dsq(struct task_struct *p,
>  	list_del_init(&p->scx.dsq_list.node);
>  	dsq_mod_nr(dsq, -1);
>  
> +	/* Clear cpumask when queue becomes empty to prevent saturation */
> +	if (dsq->nr == 0 && dsq->cpus_allowed)
> +		cpumask_clear(dsq->cpus_allowed);
> +
>  	if (!(dsq->id & SCX_DSQ_FLAG_BUILTIN) && dsq->first_task == p) {
>  		struct task_struct *first_task;
>  
> @@ -1897,6 +1905,14 @@ static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
>  	if (list_empty(&dsq->list))
>  		return false;
>  
> +	/*
> +	 * O(1) optimization: Check if any task in the queue can run on this CPU.
> +	 * If the cpumask is allocated and this CPU is not in the allowed set,
> +	 * we can skip the entire queue without scanning.
> +	 */
> +	if (dsq->cpus_allowed && !cpumask_test_cpu(cpu_of(rq), dsq->cpus_allowed))
> +		return false;
> +
>  	raw_spin_lock(&dsq->lock);
>  
>  	nldsq_for_each_task(p, dsq) {
> @@ -3397,6 +3413,9 @@ static void init_dsq(struct scx_dispatch_q *dsq, u64 dsq_id)
>  	raw_spin_lock_init(&dsq->lock);
>  	INIT_LIST_HEAD(&dsq->list);
>  	dsq->id = dsq_id;
> +	
> +	/* Allocate cpumask for tracking allowed CPUs */
> +	dsq->cpus_allowed = kzalloc(cpumask_size(), GFP_KERNEL);

I don't see the corresponding kfree() in the cleanup path.

>  }
>  
>  static void free_dsq_irq_workfn(struct irq_work *irq_work)
> -- 
> 2.51.0
> 

Thanks,
-Andrea

Thread overview: 3+ messages
2026-02-03  3:03 [PATCH] sched/ext: Add cpumask to skip unsuitable dispatch queues Qiliang Yuan
2026-02-03  8:37 ` Andrea Righi [this message]
2026-02-04  9:41   ` Qiliang Yuan
