All of lore.kernel.org
 help / color / mirror / Atom feed
From: Yury Norov <yury.norov@gmail.com>
To: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Yury Norov <yury.norov@gmail.com>,
	linux-kernel@vger.kernel.org, mingo@kernel.org,
	peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, kprateek.nayak@amd.com,
	iii@linux.ibm.com, tglx@kernel.org, gregkh@linuxfoundation.org,
	pbonzini@redhat.com, seanjc@google.com, vschneid@redhat.com,
	huschle@linux.ibm.com, rostedt@goodmis.org,
	dietmar.eggemann@arm.com, mgorman@suse.de, bsegall@google.com,
	maddy@linux.ibm.com, srikar@linux.ibm.com, hdanton@sina.com,
	chleroy@kernel.org, vineeth@bitbyteword.org, frederic@kernel.org,
	arighi@nvidia.com, pauld@redhat.com, christian.loehle@arm.com,
	tj@kernel.org, tommaso.cucinotta@gmail.com, maz@kernel.org,
	rafael@kernel.org
Subject: Re: [PATCH v4 06/20] sched/core: allow only preferred CPUs in is_cpu_allowed
Date: Thu, 18 Jun 2026 00:49:56 -0400	[thread overview]
Message-ID: <ajN49OZUt3kvN9hm@yury> (raw)
In-Reply-To: <c4546759-b316-47e7-aa97-408e20d0f6ed@linux.ibm.com>

On Thu, Jun 18, 2026 at 09:47:49AM +0530, Shrikanth Hegde wrote:
> 
> 
> On 6/18/26 9:02 AM, Yury Norov wrote:
> > On Wed, Jun 17, 2026 at 11:11:25PM +0530, Shrikanth Hegde wrote:
> > > When possible, choose a preferred CPUs to pick.
> > > 
> > > Push task mechanism uses stopper thread which going to call
> > > select_fallback_rq and use this mechanism to pick only a preferred CPU.
> > > 
> > > When task is affined only to non-preferred CPUs it should continue to
> > > run there. Detect that by checking if cpus_ptr and cpu_preferred_mask
> > > intersect or not.
> > > 
> > > Since is_cpu_allowed can be called directly or repeatedly in
> > > select_fallback_rq, encode the info in task_struct->has_preferred_cpu_state
> > > if the path is via select_fallback_rq or not.
> > > This helps to avoid N**2 complexity for the rare cases.
> > > 
> > > Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> > > ---
> > > v3->v4:
> > > - Missing case of PF_KTHREAD is avoided.
> > > - Add a new field in task_struct which encodes intersection of
> > >    tasks affinity and preferred CPUs and path its coming from.
> > > 
> > >   include/linux/sched.h |  1 +
> > >   kernel/sched/core.c   | 34 ++++++++++++++++++++++++++++++++--
> > >   kernel/sched/sched.h  | 18 ++++++++++++++++++
> > >   3 files changed, 51 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index fc6ecb3869dd..2d0b1a6d50ac 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -1657,6 +1657,7 @@ struct task_struct {
> > >   #ifdef CONFIG_UNWIND_USER
> > >   	struct unwind_task_info		unwind_info;
> > >   #endif
> > > +	int				has_preferred_cpu_state;
> > 
> > Shouldn't this be protected with the config?
> 
> Since preferred is defined always, i don;t see a reason to add it again here.
> 
> > 
> > >   	/* CPU-specific state of this task: */
> > >   	struct thread_struct		thread;
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 9e16946c9d62..714816cfa975 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -2500,6 +2500,8 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
> > >    */
> > >   static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
> > >   {
> > > +	bool task_check_preferred_cpu = false;
> > 
> > Initialization is not needed.
> 
> ok
> 
> > 
> > > +
> > >   	/* When not in the task's cpumask, no point in looking further. */
> > >   	if (!task_allowed_on_cpu(p, cpu))
> > >   		return false;
> > > @@ -2508,9 +2510,22 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
> > >   	if (is_migration_disabled(p))
> > >   		return cpu_online(cpu);
> > > +	/*
> > > +	 * This is essential to maintain user affinities when preferred
> > > +	 * CPUs change. A task pinned on non-preferred CPU should continue
> > > +	 * to run there, since this is non-user triggered.
> > > +	 *
> > > +	 * If CPU is non-preferred and task can run on other CPUs which are
> > > +	 * currently preferred, then choose those other CPUs instead
> > > +	 */
> > > +	task_check_preferred_cpu = !cpu_preferred(cpu) && task_has_preferred_cpus(p);
> > > +
> > >   	/* Non kernel threads are not allowed during either online or offline. */
> > > -	if (!(p->flags & PF_KTHREAD))
> > > +	if (!(p->flags & PF_KTHREAD)) {
> > > +		if (task_check_preferred_cpu)
> > > +			return false;
> > >   		return cpu_active(cpu);
> > > +	}
> > >   	/* KTHREAD_IS_PER_CPU is always allowed. */
> > >   	if (kthread_is_per_cpu(p))
> > > @@ -2520,6 +2535,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
> > >   	if (cpu_dying(cpu))
> > >   		return false;
> > > +	/* Try on preferred CPU first if possible*/
> > > +	if (task_check_preferred_cpu)
> > > +		return false;
> > > +
> > >   	/* But are allowed during online. */
> > >   	return cpu_online(cpu);
> > >   }
> > > @@ -3549,6 +3568,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
> > >   	enum { cpuset, possible, fail } state = cpuset;
> > >   	int dest_cpu;
> > > +	/*
> > > +	 * Cache value whether task's affinity spans preferred CPUs.
> > 
> > Because it's cached, it should go inside is_cpu_allowed(), I think.
> > 
> > > +	 * This helps to avoid repeating the same for each CPU
> > > +	 * later in the loop. Encode call to is_cpu_allowed coming
> > > +	 * via select_fallback_rq.
> > > +	 */
> > > +	p->has_preferred_cpu_state = task_has_preferred_cpus(p) << 8 | 0x1;
> > 
> > This looks weird. Your intention is to store three states: not cached, has
> > preferred CPUs and has not preferred CPUs,
> > 
> > Why don't you create an enum for it? Or a couple of flags?
> 
> I think what prateek suggested in other thread looks same. I will give that a try.
> 
> > 
> > > +
> > >   	/*
> > >   	 * If the node that the CPU is on has been offlined, cpu_to_node()
> > >   	 * will return -1. There is no CPU on the node, and we should
> > > @@ -3560,7 +3587,7 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
> > >   		/* Look for allowed, online CPU in same node. */
> > >   		for_each_cpu(dest_cpu, nodemask) {
> > >   			if (is_cpu_allowed(p, dest_cpu))
> > > -				return dest_cpu;
> > > +				goto clear_and_return;
> > >   		}
> > >   	}
> > > @@ -3604,6 +3631,8 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
> > >   		}
> > >   	}
> > > +clear_and_return:
> > > +	p->has_preferred_cpu_state = 0;
> > 
> 
> It is reset to indicate that any subsequent direct calls to is_cpu_allowed can't use the
> old cached value of select_fallback_rq.
> 
> So events could be,
> 
> - cpu marked as non preferred - select_fallback_rq (sets the p->has_preferred_cpu_state)
>   Lets say CPU(300-450) are marked as non-preferred and Task affinity is (200-350)
> - task moved out. Now either task's affinity changed or preferred_mask has changed.
>   while CPU(400) maybe still marked as non-preferred but CPU(340) is marked as preferred.
> - Subsequent call to is_cpu_allowed (CPU=340) can't assume the old value.

Please, no top-posting.

My point is: out of the scope of the select_fallback_rq(), the
p->has_preferred_cpu_state is always 0 because of the line above. It
means, it doesn't belong to the task_struct, it belongs the current
scope.

So either make this caching really surviving the scope exit, or make
it a local variable.

Not sure I understood the passage above about the possible events, but
variables that are always zero out of the function scope should not be
placed in global structures.
 
> > What for resetting it here? I think it should be zeroed only on update
> > of preferred cpumask. In other words, to properly implement caching,
> > you need to have a global counter incremented on each
> > cpu_preferred_mask update, and in task_has_preferred_cpus() you do:
> > 
> >   {
> >          if (p->preferred_cpu_updates == atomic_read(preferred_cpumask_updates))
> >                  return p->has_preferred_cpus;
> > 
> >          p->preferred_cpu_updates = atomic_read(preferred_cpumask_updates);
> >          p->has_preferred_cpus = cpumask_intersects(...);
> >   }
> > 
> > Do you have any numbers that justify this caching? The best practice
> > is to put performance optimizations at the end of the series and
> > provide some sort of benchmark supporting it.
> > 
> 
> This was to avoid N**2 aspect that was there in select_fallback_rq.
> Its more of the functional aspect which i mentioned above which this needs
> to take care as well.

Please, collect the performance data first, then optimize your code,
not vice-versa.
 
> > >   	return dest_cpu;
> > >   }
> > > @@ -4612,6 +4641,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
> > >   	init_numa_balancing(clone_flags, p);
> > >   	p->wake_entry.u_flags = CSD_TYPE_TTWU;
> > >   	p->migration_pending = NULL;
> > > +	p->has_preferred_cpu_state = 0;
> > >   	init_sched_mm(p);
> > >   }
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > index c7c2dea65edd..38fd84b0b8f8 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -4213,4 +4213,22 @@ DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
> > >   #include "ext.h"
> > > +/*
> > > + * has_preferred_cpu_state is encoding two bits of information.
> > > + * First Byte is to encode where the call to is_cpu_allowed coming from.
> > > + * Second Byte is to encode the intersection of task affinity
> > > + * and cpu_preferred_mask.
> > > + *
> > > + * If 1st Byte is set, call to is_cpu_allowed coming from select_fallback_rq.
> > > + * That helps to avoid repeated calculation keeping time complexity same.
> > > + */
> > > +static inline bool task_has_preferred_cpus(struct task_struct *p)
> > 
> > This function should be void because you change the task state.
> > 
> 
> It doesn't alter p->has_preferred_cpu_state. No?

It doesn't, but it should.

> > > +{
> > > +	int cached_value = p->has_preferred_cpu_state;
> > > +
> > > +	if (cached_value & 0x1)
> > > +		return p->has_preferred_cpu_state >> 8;
> > > +	else
> > > +		return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
> > > +}
> > >   #endif /* _KERNEL_SCHED_SCHED_H */
> > > -- 
> > > 2.47.3

  reply	other threads:[~2026-06-18  4:49 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-17 17:41 [PATCH v4 00/20] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 01/20] sched/debug: Remove unused schedstats Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 02/20] sched/docs: Document cpu_preferred_mask and Preferred CPU concept Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 03/20] kconfig: Provide PREFERRED_CPU option Shrikanth Hegde
2026-06-18  0:51   ` Yury Norov
2026-06-18  3:44     ` Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 04/20] cpumask: Introduce cpu_preferred_mask Shrikanth Hegde
2026-06-18  1:29   ` Yury Norov
2026-06-18  3:53     ` Shrikanth Hegde
2026-06-18  8:27       ` Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 05/20] sysfs: Add preferred CPU file Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 06/20] sched/core: allow only preferred CPUs in is_cpu_allowed Shrikanth Hegde
2026-06-18  3:32   ` Yury Norov
2026-06-18  4:17     ` Shrikanth Hegde
2026-06-18  4:49       ` Yury Norov [this message]
2026-06-18  5:14         ` Shrikanth Hegde
2026-06-18  3:49   ` K Prateek Nayak
2026-06-18  4:22     ` Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 07/20] sched/fair: Select preferred CPU at wakeup when possible Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 08/20] sched/fair: load balance only among preferred CPUs Shrikanth Hegde
2026-06-18  3:03   ` K Prateek Nayak
2026-06-18  3:54     ` Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 09/20] sched/core: Keep tick on non-preferred CPUs until tasks are out Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 10/20] sched/core: Push current task from non preferred CPU Shrikanth Hegde
2026-06-18  4:09   ` K Prateek Nayak
2026-06-18  6:05     ` Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 11/20] sched/debug: Add migration stats due to non preferred CPUs Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 12/20] sched/debug: Create debugfs folder steal monitor Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 13/20] sched/debug: Provide debugfs to enable/disable " Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 14/20] sched/core: Introduce a simple " Shrikanth Hegde
2026-06-18  4:30   ` Yury Norov
2026-06-18  4:44     ` Shrikanth Hegde
2026-06-18  5:32       ` K Prateek Nayak
2026-06-18  6:01         ` Shrikanth Hegde
2026-06-18  6:39           ` Yury Norov
2026-06-18  6:45             ` Shrikanth Hegde
2026-06-18  7:16               ` Yury Norov
2026-06-17 17:41 ` [PATCH v4 15/20] sched/core: Compute steal values at regular intervals Shrikanth Hegde
2026-06-18  4:04   ` Yury Norov
2026-06-18  5:39     ` Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 16/20] sched/core: Introduce default arch handling code for inc/dec preferred CPUs Shrikanth Hegde
2026-06-18  4:15   ` Yury Norov
2026-06-18  4:42     ` Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 17/20] sched/core: Handle steal values and mark CPUs as preferred Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 18/20] sched/core: Mark the direction of steal values to avoid oscillations Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 19/20] sched/debug: Add debug knobs for steal monitor Shrikanth Hegde
2026-06-17 17:41 ` [PATCH v4 20/20] sched/core: Add a few check for valid CPU in inc/dec of preferred CPUs Shrikanth Hegde
2026-06-18  4:21   ` Yury Norov
2026-06-18  4:40     ` Shrikanth Hegde

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ajN49OZUt3kvN9hm@yury \
    --to=yury.norov@gmail.com \
    --cc=arighi@nvidia.com \
    --cc=bsegall@google.com \
    --cc=chleroy@kernel.org \
    --cc=christian.loehle@arm.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=frederic@kernel.org \
    --cc=gregkh@linuxfoundation.org \
    --cc=hdanton@sina.com \
    --cc=huschle@linux.ibm.com \
    --cc=iii@linux.ibm.com \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maddy@linux.ibm.com \
    --cc=maz@kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@kernel.org \
    --cc=pauld@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rafael@kernel.org \
    --cc=rostedt@goodmis.org \
    --cc=seanjc@google.com \
    --cc=srikar@linux.ibm.com \
    --cc=sshegde@linux.ibm.com \
    --cc=tglx@kernel.org \
    --cc=tj@kernel.org \
    --cc=tommaso.cucinotta@gmail.com \
    --cc=vincent.guittot@linaro.org \
    --cc=vineeth@bitbyteword.org \
    --cc=vschneid@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.