linux-kernel.vger.kernel.org archive mirror
* [RFC v2 0/9] cpu avoid state and push task mechanism
@ 2025-06-25 19:10 Shrikanth Hegde
  2025-06-25 19:11 ` [RFC v2 1/9] sched/docs: Document avoid_cpu_mask and avoid CPU concept Shrikanth Hegde
                   ` (9 more replies)
  0 siblings, 10 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-06-25 19:10 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
	maddy
  Cc: sshegde, vschneid, dietmar.eggemann, rostedt, kprateek.nayak,
	huschle, srikar, linux-kernel, christophe.leroy, linuxppc-dev,
	gregkh

This is a followup version of [1] with a few additions. This is still an RFC
and I would like to get feedback on the idea and suggestions for improvement.

v1->v2:
- Renamed to cpu_avoid_mask in place of cpu_parked_mask.
- Used a static key so that there is no impact on the regular case.
- Add a sysfs file to show avoid CPUs.
- Make RT understand avoid CPUs.
- Add a documentation patch.
- Fixed the compile error reported in [1] when NR_CPUS=1.

-----------------
Problem statement
-----------------
vCPU - virtual CPU - a CPU in the VM world.
pCPU - physical CPU - a CPU in the bare-metal world.

The hypervisor manages vCPUs from different VMs. When a vCPU requests
CPU time, the hypervisor does the job of scheduling it on a pCPU.

The issue occurs when there are more vCPUs (combined across all VMs)
than pCPUs. When *all* vCPUs are requesting CPU time, the hypervisor
can only run a few of them and the remaining ones get preempted (waiting
for a pCPU).

Take two VMs: when the hypervisor preempts a vCPU of VM1 to run a vCPU of
VM2, it has to save/restore VM context. Instead, if the VMs can co-ordinate
among each other and request a *limited* number of vCPUs, that overhead is
avoided and the context switching happens within a vCPU (less expensive).
Even if the hypervisor preempts one vCPU to run another within the same VM,
it is still more expensive than task preemption within the vCPU. So the
*basic* aim is to avoid vCPU preemption.

To achieve this, use the "CPU avoid" concept: it is better if the workload
avoids these vCPUs at the moment.
(The vCPUs stay online; we don't want the overhead of a sched domain rebuild.)

Contention is dynamic in nature. When there is contention for pCPUs, it is
to be detected and determined by the architecture. Archs need to update the
mask accordingly.

When there is contention, use a limited number of vCPUs as indicated by the arch.
When there is no contention, use all vCPUs.
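
As a rough sketch, the arch-side flow could look something like below
(simplified from the debug patch 9 in this series; the helper name
arch_update_avoid_mask is made up for illustration):

	/* Arch detected pCPU contention: restrict workload to 'limit' vCPUs */
	static void arch_update_avoid_mask(unsigned int limit)
	{
		unsigned int cpu;

		if (limit < num_present_cpus())
			static_branch_enable(&paravirt_cpu_avoid_enabled);
		else
			static_branch_disable(&paravirt_cpu_avoid_enabled);

		/* CPUs at or above the limit are marked as avoid */
		for_each_present_cpu(cpu)
			set_cpu_avoid(cpu, cpu >= limit);
	}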

-------------------------
To be done and Questions: 
-------------------------
1. IRQ - IRQs still don't understand this cpu_avoid_mask. Maybe the irqbalance
code could be modified to honor it (see the sketch after this list). When
running stress-ng --hrtimers, IRQs moved out of avoid CPUs though. So need to
see whether changes to irqbalance are required or not.

2. If a task is spawned affined only to avoid CPUs, should that fail
or throw a warning to the user?

3. Other classes such as SCHED_EXT and SCHED_DL don't understand this infra
yet.

4. Performance testing is yet to be done. The RFC only verified the functional
aspect of whether tasks move out of avoid CPUs or not. The move happens quite
fast (around 1-2 seconds even on large systems with very high utilization).

5. Haven't come up with an infra which could combine all push-task related
changes. It is currently spread across rt, dl and fair. Maybe some
consolidation can be done, but which tasks to push/pull still remains in
the class.

6. cpu_avoid_mask may need some sort of locking to ensure reads/writes are
correct.
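
For item 1, a userspace consumer such as irqbalance could read the sysfs
file added later in this series (a hypothetical sketch; the file prints a
CPU list such as "70-479"):

	#include <stdio.h>

	/* Hypothetical irqbalance-side helper: fetch the current avoid
	 * CPU list so those CPUs can be skipped when spreading IRQs. */
	static int read_avoid_cpus(char *buf, int len)
	{
		FILE *f = fopen("/sys/devices/system/cpu/avoid", "r");
		int ret = 0;

		if (!f)
			return -1;
		if (!fgets(buf, len, f))	/* e.g. "70-479\n" */
			ret = -1;
		fclose(f);
		return ret;
	}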

[1]: https://lore.kernel.org/all/20250523181448.3777233-1-sshegde@linux.ibm.com/

Shrikanth Hegde (9):
  sched/docs: Document avoid_cpu_mask and avoid CPU concept
  cpumask: Introduce cpu_avoid_mask
  sched/core: Don't allow to use CPU marked as avoid
  sched/fair: Don't use CPU marked as avoid for wakeup and load balance
  sched/rt: Don't select CPU marked as avoid for wakeup and push/pull rt task
  sched/core: Push current task out if CPU is marked as avoid
  sched: Add static key check for cpu_avoid
  sysfs: Add cpu_avoid file
  powerpc: add debug file for set/unset cpu avoid

 Documentation/scheduler/sched-arch.rst | 25 +++++++++++++
 arch/powerpc/include/asm/paravirt.h    |  2 ++
 arch/powerpc/kernel/smp.c              | 50 ++++++++++++++++++++++++++
 drivers/base/cpu.c                     |  8 +++++
 include/linux/cpumask.h                | 17 +++++++++
 kernel/cpu.c                           |  3 ++
 kernel/sched/core.c                    | 50 +++++++++++++++++++++++++-
 kernel/sched/fair.c                    | 11 +++++-
 kernel/sched/rt.c                      |  9 +++--
 kernel/sched/sched.h                   | 10 ++++++
 10 files changed, 181 insertions(+), 4 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC v2 1/9] sched/docs: Document avoid_cpu_mask and avoid CPU concept
  2025-06-25 19:10 [RFC v2 0/9] cpu avoid state and push task mechanism Shrikanth Hegde
@ 2025-06-25 19:11 ` Shrikanth Hegde
  2025-06-26  6:27   ` Hillf Danton
  2025-06-25 19:11 ` [RFC v2 2/9] cpumask: Introduce cpu_avoid_mask Shrikanth Hegde
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Shrikanth Hegde @ 2025-06-25 19:11 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
	maddy
  Cc: sshegde, vschneid, dietmar.eggemann, rostedt, kprateek.nayak,
	huschle, srikar, linux-kernel, christophe.leroy, linuxppc-dev,
	gregkh

This describes what an avoid CPU means and what the scheduler aims to do
when a CPU is marked as avoid.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 Documentation/scheduler/sched-arch.rst | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
index ed07efea7d02..d32755298fca 100644
--- a/Documentation/scheduler/sched-arch.rst
+++ b/Documentation/scheduler/sched-arch.rst
@@ -62,6 +62,31 @@ Your cpu_idle routines need to obey the following rules:
 arch/x86/kernel/process.c has examples of both polling and
 sleeping idle functions.
 
+CPU Avoid
+=========
+
+Under paravirt conditions it is possible to overcommit CPU resources,
+i.e. the sum of virtual CPUs (vCPUs) of all VMs is greater than the number
+of physical CPUs (pCPUs). Under such conditions, when all or many VMs have
+high utilization, the hypervisor won't be able to satisfy the demand and
+has to context switch within or across VMs. A VM-level context switch is
+more expensive than a task context switch within the VM.
+
+In such cases it is better that VMs co-ordinate among themselves and lower
+their CPU demand by not using some of the vCPUs. Such vCPUs, where workload
+can be avoided at the moment, are called "avoid CPUs". Note that when the
+pCPU contention goes away, these vCPUs can be used again by the workload.
+
+The arch needs to set/unset a vCPU as avoid in cpu_avoid_mask. When set,
+avoid the CPU; when unset, use it as usual.
+
+The scheduler will try to avoid those CPUs as much as it can.
+This is achieved by:
+1. Not selecting those CPUs at wakeup.
+2. Pushing tasks away from avoid CPUs at tick.
+3. Not selecting avoid CPUs at load balance.
+
+This works only for SCHED_RT and SCHED_NORMAL.
 
 Possible arch/ problems
 =======================
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC v2 2/9] cpumask: Introduce cpu_avoid_mask
  2025-06-25 19:10 [RFC v2 0/9] cpu avoid state and push task mechanism Shrikanth Hegde
  2025-06-25 19:11 ` [RFC v2 1/9] sched/docs: Document avoid_cpu_mask and avoid CPU concept Shrikanth Hegde
@ 2025-06-25 19:11 ` Shrikanth Hegde
  2025-06-25 19:11 ` [RFC v2 3/9] sched/core: Dont allow to use CPU marked as avoid Shrikanth Hegde
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-06-25 19:11 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
	maddy
  Cc: sshegde, vschneid, dietmar.eggemann, rostedt, kprateek.nayak,
	huschle, srikar, linux-kernel, christophe.leroy, linuxppc-dev,
	gregkh

Introduce cpu_avoid_mask and get/set routines for it.

By having the mask, it is easier for other kernel subsystems to consume
it as well. One can quickly know which CPUs are currently marked as
avoid.
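
For illustration, another subsystem could consume it along these lines
(hypothetical caller; only the cpu_avoid() accessor comes from this patch):

	/* Hypothetical consumer: pick the first usable CPU, skipping
	 * CPUs currently marked as avoid. */
	static int pick_non_avoid_cpu(void)
	{
		int cpu;

		for_each_online_cpu(cpu) {
			if (!cpu_avoid(cpu))
				return cpu;
		}
		return -1;
	}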
 
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
There is a sysfs patch later in the series which prints this mask. If it
should be squashed into this patch, let me know.
 
 include/linux/cpumask.h | 17 +++++++++++++++++
 kernel/cpu.c            |  3 +++
 2 files changed, 20 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 7ae80a7ca81e..6394c67a4fb3 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -84,6 +84,7 @@ static __always_inline void set_nr_cpu_ids(unsigned int nr)
  *     cpu_enabled_mask - has bit 'cpu' set iff cpu can be brought online
  *     cpu_online_mask  - has bit 'cpu' set iff cpu available to scheduler
  *     cpu_active_mask  - has bit 'cpu' set iff cpu available to migration
+ *     cpu_avoid_mask   - has bit 'cpu' set iff cpu is to be avoided now
  *
  *  If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
  *
@@ -101,6 +102,10 @@ static __always_inline void set_nr_cpu_ids(unsigned int nr)
  *  (*) Well, cpu_present_mask is dynamic in the hotplug case.  If not
  *      hotplug, it's a copy of cpu_possible_mask, hence fixed at boot.
  *
+ *  A CPU is said to be avoided when there is contention for the underlying
+ *  physical CPU resource in a paravirtualized environment. It is recommended
+ *  not to run anything on that CPU even though it is online.
+ *
  * Subtleties:
  * 1) UP ARCHes (NR_CPUS == 1, CONFIG_SMP not defined) hardcode
  *    assumption that their single CPU is online.  The UP
@@ -118,12 +123,14 @@ extern struct cpumask __cpu_enabled_mask;
 extern struct cpumask __cpu_present_mask;
 extern struct cpumask __cpu_active_mask;
 extern struct cpumask __cpu_dying_mask;
+extern struct cpumask __cpu_avoid_mask;
 #define cpu_possible_mask ((const struct cpumask *)&__cpu_possible_mask)
 #define cpu_online_mask   ((const struct cpumask *)&__cpu_online_mask)
 #define cpu_enabled_mask   ((const struct cpumask *)&__cpu_enabled_mask)
 #define cpu_present_mask  ((const struct cpumask *)&__cpu_present_mask)
 #define cpu_active_mask   ((const struct cpumask *)&__cpu_active_mask)
 #define cpu_dying_mask    ((const struct cpumask *)&__cpu_dying_mask)
+#define cpu_avoid_mask    ((const struct cpumask *)&__cpu_avoid_mask)
 
 extern atomic_t __num_online_cpus;
 
@@ -1133,6 +1140,7 @@ void init_cpu_possible(const struct cpumask *src);
 #define set_cpu_present(cpu, present)	assign_cpu((cpu), &__cpu_present_mask, (present))
 #define set_cpu_active(cpu, active)	assign_cpu((cpu), &__cpu_active_mask, (active))
 #define set_cpu_dying(cpu, dying)	assign_cpu((cpu), &__cpu_dying_mask, (dying))
+#define set_cpu_avoid(cpu, avoid)       assign_cpu((cpu), &__cpu_avoid_mask, (avoid))
 
 void set_cpu_online(unsigned int cpu, bool online);
 
@@ -1222,6 +1230,11 @@ static __always_inline bool cpu_dying(unsigned int cpu)
 	return cpumask_test_cpu(cpu, cpu_dying_mask);
 }
 
+static __always_inline bool cpu_avoid(unsigned int cpu)
+{
+	return cpumask_test_cpu(cpu, cpu_avoid_mask);
+}
+
 #else
 
 #define num_online_cpus()	1U
@@ -1260,6 +1273,10 @@ static __always_inline bool cpu_dying(unsigned int cpu)
 	return false;
 }
 
+static __always_inline bool cpu_avoid(unsigned int cpu)
+{
+	return false;
+}
 #endif /* NR_CPUS > 1 */
 
 #define cpu_is_offline(cpu)	unlikely(!cpu_online(cpu))
diff --git a/kernel/cpu.c b/kernel/cpu.c
index a59e009e0be4..44e8c66d2839 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -3107,6 +3107,9 @@ EXPORT_SYMBOL(__cpu_active_mask);
 struct cpumask __cpu_dying_mask __read_mostly;
 EXPORT_SYMBOL(__cpu_dying_mask);
 
+struct cpumask __cpu_avoid_mask __read_mostly;
+EXPORT_SYMBOL(__cpu_avoid_mask);
+
 atomic_t __num_online_cpus __read_mostly;
 EXPORT_SYMBOL(__num_online_cpus);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC v2 3/9] sched/core: Dont allow to use CPU marked as avoid
  2025-06-25 19:10 [RFC v2 0/9] cpu avoid state and push task mechanism Shrikanth Hegde
  2025-06-25 19:11 ` [RFC v2 1/9] sched/docs: Document avoid_cpu_mask and avoid CPU concept Shrikanth Hegde
  2025-06-25 19:11 ` [RFC v2 2/9] cpumask: Introduce cpu_avoid_mask Shrikanth Hegde
@ 2025-06-25 19:11 ` Shrikanth Hegde
  2025-06-25 19:11 ` [RFC v2 4/9] sched/fair: Don't use CPU marked as avoid for wakeup and load balance Shrikanth Hegde
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-06-25 19:11 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
	maddy
  Cc: sshegde, vschneid, dietmar.eggemann, rostedt, kprateek.nayak,
	huschle, srikar, linux-kernel, christophe.leroy, linuxppc-dev,
	gregkh

Don't allow a CPU marked as avoid to be used. This is used when a task is
pushed out of a CPU marked as avoid in select_fallback_rq().

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0e3a00e2a2cc..13e44d7a0b90 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2437,6 +2437,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 	if (cpu_dying(cpu))
 		return false;
 
+	/* A CPU marked as avoid shouldn't be chosen to run any task */
+	if (cpu_avoid(cpu))
+		return false;
+
 	/* But are allowed during online. */
 	return cpu_online(cpu);
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC v2 4/9] sched/fair: Don't use CPU marked as avoid for wakeup and load balance
  2025-06-25 19:10 [RFC v2 0/9] cpu avoid state and push task mechanism Shrikanth Hegde
                   ` (2 preceding siblings ...)
  2025-06-25 19:11 ` [RFC v2 3/9] sched/core: Dont allow to use CPU marked as avoid Shrikanth Hegde
@ 2025-06-25 19:11 ` Shrikanth Hegde
  2025-06-26  0:02   ` Yury Norov
  2025-06-25 19:11 ` [RFC v2 5/9] sched/rt: Don't select CPU marked as avoid for wakeup and push/pull rt task Shrikanth Hegde
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Shrikanth Hegde @ 2025-06-25 19:11 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
	maddy
  Cc: sshegde, vschneid, dietmar.eggemann, rostedt, kprateek.nayak,
	huschle, srikar, linux-kernel, christophe.leroy, linuxppc-dev,
	gregkh

The load balancer shouldn't spread CFS tasks onto a CPU marked as avoid.
Remove those CPUs from load balancing decisions.

At wakeup, don't select a CPU marked as avoid.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
While testing, didn't see 'cpu' being marked as avoid while 'new_cpu' is.
May need some more probing to see if even 'cpu' can be; if so, it could
lead to a crash.

 kernel/sched/fair.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e2963efe800..406288aef535 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8546,7 +8546,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	}
 	rcu_read_unlock();
 
-	return new_cpu;
+	/* Don't select a CPU marked as avoid for wakeup */
+	if (cpu_avoid(new_cpu))
+		return cpu;
+	else
+		return new_cpu;
+
 }
 
 /*
@@ -11662,6 +11667,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 
 	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
 
+	/* Don't spread load into CPUs marked as avoid */
+	cpumask_andnot(cpus, cpus, cpu_avoid_mask);
+
 	schedstat_inc(sd->lb_count[idle]);
 
 redo:
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC v2 5/9] sched/rt: Don't select CPU marked as avoid for wakeup and push/pull rt task
  2025-06-25 19:10 [RFC v2 0/9] cpu avoid state and push task mechanism Shrikanth Hegde
                   ` (3 preceding siblings ...)
  2025-06-25 19:11 ` [RFC v2 4/9] sched/fair: Don't use CPU marked as avoid for wakeup and load balance Shrikanth Hegde
@ 2025-06-25 19:11 ` Shrikanth Hegde
  2025-06-25 19:11 ` [RFC v2 6/9] sched/core: Push current task out if CPU is marked as avoid Shrikanth Hegde
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-06-25 19:11 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
	maddy
  Cc: sshegde, vschneid, dietmar.eggemann, rostedt, kprateek.nayak,
	huschle, srikar, linux-kernel, christophe.leroy, linuxppc-dev,
	gregkh

- At wakeup, don't select the CPU if it is marked as avoid.
- Don't pull a task if the CPU is marked as avoid.
- Don't push a task to a CPU marked as avoid.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/rt.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 15d5855c542c..fd9df6f46135 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1549,6 +1549,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 		if (!test && target != -1 && !rt_task_fits_capacity(p, target))
 			goto out_unlock;
 
+		if (cpu_avoid(target))
+			goto out_unlock;
 		/*
 		 * Don't bother moving it if the destination CPU is
 		 * not running a lower priority task.
@@ -1871,7 +1873,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
 	for (tries = 0; tries < RT_MAX_TRIES; tries++) {
 		cpu = find_lowest_rq(task);
 
-		if ((cpu == -1) || (cpu == rq->cpu))
+		if ((cpu == -1) || (cpu == rq->cpu) || cpu_avoid(cpu))
 			break;
 
 		lowest_rq = cpu_rq(cpu);
@@ -1969,7 +1971,7 @@ static int push_rt_task(struct rq *rq, bool pull)
 			return 0;
 
 		cpu = find_lowest_rq(rq->curr);
-		if (cpu == -1 || cpu == rq->cpu)
+		if (cpu == -1 || cpu == rq->cpu || cpu_avoid(cpu))
 			return 0;
 
 		/*
@@ -2232,6 +2234,9 @@ static void pull_rt_task(struct rq *this_rq)
 	if (likely(!rt_overload_count))
 		return;
 
+	if (cpu_avoid(this_rq->cpu))
+		return;
+
 	/*
 	 * Match the barrier from rt_set_overloaded; this guarantees that if we
 	 * see overloaded we must also see the rto_mask bit.
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC v2 6/9] sched/core: Push current task out if CPU is marked as avoid
  2025-06-25 19:10 [RFC v2 0/9] cpu avoid state and push task mechanism Shrikanth Hegde
                   ` (4 preceding siblings ...)
  2025-06-25 19:11 ` [RFC v2 5/9] sched/rt: Don't select CPU marked as avoid for wakeup and push/pull rt task Shrikanth Hegde
@ 2025-06-25 19:11 ` Shrikanth Hegde
  2025-08-12 18:40   ` Shrikanth Hegde
  2025-06-25 19:11 ` [RFC v2 7/9] sched: Add static key check for cpu_avoid Shrikanth Hegde
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Shrikanth Hegde @ 2025-06-25 19:11 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
	maddy
  Cc: sshegde, vschneid, dietmar.eggemann, rostedt, kprateek.nayak,
	huschle, srikar, linux-kernel, christophe.leroy, linuxppc-dev,
	gregkh

Actively push out any task running on a CPU marked as avoid.
If a task is sleeping, it is pushed out when it wakes up on that CPU.

Since the task is running, we need to use the stopper class to push the
task out. Use __balance_push_cpu_stop to achieve that.

This currently works only for CFS and RT.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/core.c  | 44 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  1 +
 2 files changed, 45 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 13e44d7a0b90..aea4232e3ec4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5577,6 +5577,10 @@ void sched_tick(void)
 
 	sched_clock_tick();
 
+	/* push the current task out if cpu is marked as avoid */
+	if (cpu_avoid(cpu))
+		push_current_task(rq);
+
 	rq_lock(rq, &rf);
 	donor = rq->donor;
 
@@ -8028,6 +8032,43 @@ static void balance_hotplug_wait(void)
 			   TASK_UNINTERRUPTIBLE);
 }
 
+static DEFINE_PER_CPU(struct cpu_stop_work, push_task_work);
+
+/* A CPU is marked as avoid when there is contention for the underlying
+ * physical CPU and using this CPU will lead to hypervisor preemptions.
+ * It is better not to use this CPU.
+ *
+ * In case any task is scheduled on such a CPU, move it out. In
+ * select_fallback_rq() a non-avoid CPU will be chosen and henceforth
+ * the task shouldn't come back to this CPU.
+ */
+void push_current_task(struct rq *rq)
+{
+	struct task_struct *push_task = rq->curr;
+	unsigned long flags;
+
+	/* idle task can't be pushed out */
+	if (rq->curr == rq->idle || !cpu_avoid(rq->cpu))
+		return;
+
+	/* Do this only for SCHED_NORMAL and RT for now */
+	if (push_task->sched_class != &fair_sched_class &&
+	    push_task->sched_class != &rt_sched_class)
+		return;
+
+	if (kthread_is_per_cpu(push_task) ||
+	    is_migration_disabled(push_task))
+		return;
+
+	local_irq_save(flags);
+	get_task_struct(push_task);
+	preempt_disable();
+
+	stop_one_cpu_nowait(rq->cpu, __balance_push_cpu_stop, push_task,
+			    this_cpu_ptr(&push_task_work));
+	preempt_enable();
+	local_irq_restore(flags);
+}
 #else /* !CONFIG_HOTPLUG_CPU: */
 
 static inline void balance_push(struct rq *rq)
@@ -8042,6 +8083,9 @@ static inline void balance_hotplug_wait(void)
 {
 }
 
+void push_current_task(struct rq *rq)
+{
+}
 #endif /* !CONFIG_HOTPLUG_CPU */
 
 void set_rq_online(struct rq *rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 105190b18020..b9614873762e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1709,6 +1709,7 @@ struct rq_flags {
 };
 
 extern struct balance_callback balance_push_callback;
+void push_current_task(struct rq *rq);
 
 #ifdef CONFIG_SCHED_CLASS_EXT
 extern const struct sched_class ext_sched_class;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC v2 7/9] sched: Add static key check for cpu_avoid
  2025-06-25 19:10 [RFC v2 0/9] cpu avoid state and push task mechanism Shrikanth Hegde
                   ` (5 preceding siblings ...)
  2025-06-25 19:11 ` [RFC v2 6/9] sched/core: Push current task out if CPU is marked as avoid Shrikanth Hegde
@ 2025-06-25 19:11 ` Shrikanth Hegde
  2025-06-26  0:12   ` Yury Norov
  2025-06-25 19:11 ` [RFC v2 8/9] sysfs: Add cpu_avoid file Shrikanth Hegde
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Shrikanth Hegde @ 2025-06-25 19:11 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
	maddy
  Cc: sshegde, vschneid, dietmar.eggemann, rostedt, kprateek.nayak,
	huschle, srikar, linux-kernel, christophe.leroy, linuxppc-dev,
	gregkh

Checking if a CPU is avoid can add a slight overhead and should be
done only when necessary.

Add a static key check which makes it almost a nop when the key is false.
The arch needs to enable the key when it decides to. Refer to the debug
patch for an example.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
This method avoids additional ifdefs, so kept it that way instead of
using CONFIG_PARAVIRT.

Added a helper function for cpu_avoid, since including sched.h fails in
cpumask.h.

 kernel/sched/core.c  | 8 ++++----
 kernel/sched/fair.c  | 5 +++--
 kernel/sched/rt.c    | 8 ++++----
 kernel/sched/sched.h | 9 +++++++++
 4 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aea4232e3ec4..51426b17ef55 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -148,9 +148,9 @@ __read_mostly int sysctl_resched_latency_warn_once = 1;
  * Limited because this is done with IRQs disabled.
  */
 __read_mostly unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;
-
 __read_mostly int scheduler_running;
 
+DEFINE_STATIC_KEY_FALSE(paravirt_cpu_avoid_enabled);
 #ifdef CONFIG_SCHED_CORE
 
 DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
@@ -2438,7 +2438,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
 		return false;
 
 	/* A CPU marked as avoid shouldn't be chosen to run any task */
-	if (cpu_avoid(cpu))
+	if (cpu_avoid_check(cpu))
 		return false;
 
 	/* But are allowed during online. */
@@ -5578,7 +5578,7 @@ void sched_tick(void)
 	sched_clock_tick();
 
 	/* push the current task out if cpu is marked as avoid */
-	if (cpu_avoid(cpu))
+	if (cpu_avoid_check(cpu))
 		push_current_task(rq);
 
 	rq_lock(rq, &rf);
@@ -8048,7 +8048,7 @@ void push_current_task(struct rq *rq)
 	unsigned long flags;
 
 	/* idle task can't be pushed out */
-	if (rq->curr == rq->idle || !cpu_avoid(rq->cpu))
+	if (rq->curr == rq->idle || !cpu_avoid_check(rq->cpu))
 		return;
 
 	/* Do for only SCHED_NORMAL AND RT for now */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 406288aef535..21370f76d61b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8547,7 +8547,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	rcu_read_unlock();
 
 	/* Don't select a CPU marked as avoid for wakeup */
-	if (cpu_avoid(new_cpu))
+	if (cpu_avoid_check(new_cpu))
 		return cpu;
 	else
 		return new_cpu;
@@ -11668,7 +11668,8 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
 
 	/* Don't spread load into CPUs marked as avoid */
-	cpumask_andnot(cpus, cpus, cpu_avoid_mask);
+	if (static_branch_unlikely(&paravirt_cpu_avoid_enabled))
+		cpumask_andnot(cpus, cpus, cpu_avoid_mask);
 
 	schedstat_inc(sd->lb_count[idle]);
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index fd9df6f46135..0ab3fdf7a637 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1549,7 +1549,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 		if (!test && target != -1 && !rt_task_fits_capacity(p, target))
 			goto out_unlock;
 
-		if (cpu_avoid(target))
+		if (cpu_avoid_check(target))
 			goto out_unlock;
 		/*
 		 * Don't bother moving it if the destination CPU is
@@ -1873,7 +1873,7 @@ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq)
 	for (tries = 0; tries < RT_MAX_TRIES; tries++) {
 		cpu = find_lowest_rq(task);
 
-		if ((cpu == -1) || (cpu == rq->cpu) || cpu_avoid(cpu))
+		if ((cpu == -1) || (cpu == rq->cpu) || cpu_avoid_check(cpu))
 			break;
 
 		lowest_rq = cpu_rq(cpu);
@@ -1971,7 +1971,7 @@ static int push_rt_task(struct rq *rq, bool pull)
 			return 0;
 
 		cpu = find_lowest_rq(rq->curr);
-		if (cpu == -1 || cpu == rq->cpu || cpu_avoid(cpu))
+		if (cpu == -1 || cpu == rq->cpu || cpu_avoid_check(cpu))
 			return 0;
 
 		/*
@@ -2234,7 +2234,7 @@ static void pull_rt_task(struct rq *this_rq)
 	if (likely(!rt_overload_count))
 		return;
 
-	if (cpu_avoid(this_rq->cpu))
+	if (cpu_avoid_check(this_rq->cpu))
 		return;
 
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b9614873762e..707fdfa46772 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1710,6 +1710,15 @@ struct rq_flags {
 
 extern struct balance_callback balance_push_callback;
 void push_current_task(struct rq *rq);
+DECLARE_STATIC_KEY_FALSE(paravirt_cpu_avoid_enabled);
+
+static inline bool cpu_avoid_check(int cpu)
+{
+	if (static_branch_unlikely(&paravirt_cpu_avoid_enabled))
+		return cpu_avoid(cpu);
+
+	return false;
+}
 
 #ifdef CONFIG_SCHED_CLASS_EXT
 extern const struct sched_class ext_sched_class;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC v2 8/9] sysfs: Add cpu_avoid file
  2025-06-25 19:10 [RFC v2 0/9] cpu avoid state and push task mechanism Shrikanth Hegde
                   ` (6 preceding siblings ...)
  2025-06-25 19:11 ` [RFC v2 7/9] sched: Add static key check for cpu_avoid Shrikanth Hegde
@ 2025-06-25 19:11 ` Shrikanth Hegde
  2025-07-01  9:35   ` Greg KH
  2025-06-25 19:11 ` [RFC v2 9/9] [DEBUG] powerpc: add debug file for set/unset cpu avoid Shrikanth Hegde
  2025-06-25 21:55 ` [RFC v2 0/9] cpu avoid state and push task mechanism Yury Norov
  9 siblings, 1 reply; 25+ messages in thread
From: Shrikanth Hegde @ 2025-06-25 19:11 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
	maddy
  Cc: sshegde, vschneid, dietmar.eggemann, rostedt, kprateek.nayak,
	huschle, srikar, linux-kernel, christophe.leroy, linuxppc-dev,
	gregkh

Add a sysfs file called "avoid" which prints the CPUs currently
marked as avoid.

This could be used by userspace components or tools such as irqbalance. 

/sys/devices/system/cpu # cat avoid 
70-479

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 drivers/base/cpu.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 7779ab0ca7ce..51c1207f6f33 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -300,6 +300,13 @@ static ssize_t print_cpus_isolated(struct device *dev,
 }
 static DEVICE_ATTR(isolated, 0444, print_cpus_isolated, NULL);
 
+static ssize_t print_cpus_avoid(struct device *dev,
+				struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(cpu_avoid_mask));
+}
+static DEVICE_ATTR(avoid, 0444, print_cpus_avoid, NULL);
+
 #ifdef CONFIG_NO_HZ_FULL
 static ssize_t print_cpus_nohz_full(struct device *dev,
 				    struct device_attribute *attr, char *buf)
@@ -505,6 +512,7 @@ static struct attribute *cpu_root_attrs[] = {
 	&dev_attr_offline.attr,
 	&dev_attr_enabled.attr,
 	&dev_attr_isolated.attr,
+	&dev_attr_avoid.attr,
 #ifdef CONFIG_NO_HZ_FULL
 	&dev_attr_nohz_full.attr,
 #endif
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC v2 9/9] [DEBUG] powerpc: add debug file for set/unset cpu avoid
  2025-06-25 19:10 [RFC v2 0/9] cpu avoid state and push task mechanism Shrikanth Hegde
                   ` (7 preceding siblings ...)
  2025-06-25 19:11 ` [RFC v2 8/9] sysfs: Add cpu_avoid file Shrikanth Hegde
@ 2025-06-25 19:11 ` Shrikanth Hegde
  2025-06-25 22:53   ` Yury Norov
  2025-06-25 21:55 ` [RFC v2 0/9] cpu avoid state and push task mechanism Yury Norov
  9 siblings, 1 reply; 25+ messages in thread
From: Shrikanth Hegde @ 2025-06-25 19:11 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
	maddy
  Cc: sshegde, vschneid, dietmar.eggemann, rostedt, kprateek.nayak,
	huschle, srikar, linux-kernel, christophe.leroy, linuxppc-dev,
	gregkh

Reference patch for how an architecture can make use of this infra.

This is not meant to be merged. Instead, the vp_manual_hint should either
come from hardware or be derived using steal time.

When the provided hint is less than the total number of CPUs in the system,
it will enable the cpu avoid static key and mark the CPUs beyond the hint
as avoid.
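
For example, assuming arch_debugfs_dir maps to /sys/kernel/debug/powerpc,
restricting the workload to CPUs 0-39 would look roughly like this
(hypothetical usage; output depends on the number of present CPUs):

	# echo 40 > /sys/kernel/debug/powerpc/vp_manual_hint
	# cat /sys/devices/system/cpu/avoid
	40-<last present CPU>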

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 arch/powerpc/include/asm/paravirt.h |  2 ++
 arch/powerpc/kernel/smp.c           | 50 +++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+)

diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h
index b78b82d66057..b6497e0b60d8 100644
--- a/arch/powerpc/include/asm/paravirt.h
+++ b/arch/powerpc/include/asm/paravirt.h
@@ -10,6 +10,8 @@
 #include <asm/hvcall.h>
 #endif
 
+DECLARE_STATIC_KEY_FALSE(paravirt_cpu_avoid_enabled);
+
 #ifdef CONFIG_PPC_SPLPAR
 #include <linux/smp.h>
 #include <asm/kvm_guest.h>
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5ac7084eebc0..e00cdc4de441 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -64,6 +64,7 @@
 #include <asm/systemcfg.h>
 
 #include <trace/events/ipi.h>
+#include <linux/debugfs.h>
 
 #ifdef DEBUG
 #include <asm/udbg.h>
@@ -82,6 +83,7 @@ bool has_big_cores __ro_after_init;
 bool coregroup_enabled __ro_after_init;
 bool thread_group_shares_l2 __ro_after_init;
 bool thread_group_shares_l3 __ro_after_init;
+static int vp_manual_hint = NR_CPUS;
 
 DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
@@ -1727,6 +1729,7 @@ static void __init build_sched_topology(void)
 	BUG_ON(i >= ARRAY_SIZE(powerpc_topology) - 1);
 
 	set_sched_topology(powerpc_topology);
+	vp_manual_hint = num_present_cpus();
 }
 
 void __init smp_cpus_done(unsigned int max_cpus)
@@ -1807,4 +1810,51 @@ void __noreturn arch_cpu_idle_dead(void)
 	start_secondary_resume();
 }
 
+/*
+ * debugfs hint to mark CPUs as avoid. This helps in restricting
+ * the workload to the specified number of CPUs.
+ * For example, vp_manual_hint = 40 means the workload will run on
+ * CPUs 0-39.
+ */
+
+static int pv_vp_manual_hint_set(void *data, u64 val)
+{
+	int cpu;
+
+	if (val == 0 || vp_manual_hint > num_present_cpus())
+		vp_manual_hint = num_present_cpus();
+
+	if (val != vp_manual_hint)
+		vp_manual_hint = val;
+
+	if (vp_manual_hint < num_present_cpus())
+		static_branch_enable(&paravirt_cpu_avoid_enabled);
+	else
+		static_branch_disable(&paravirt_cpu_avoid_enabled);
+
+	for_each_present_cpu(cpu) {
+		if (cpu >= vp_manual_hint)
+			set_cpu_avoid(cpu, true);
+		else
+			set_cpu_avoid(cpu, false);
+	}
+	return 0;
+}
+
+static int pv_vp_manual_hint_get(void *data, u64 *val)
+{
+	*val = vp_manual_hint;
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_vp_manual_hint, pv_vp_manual_hint_get, pv_vp_manual_hint_set, "%llu\n");
+
+static __init int paravirt_debugfs_init(void)
+{
+	if (is_shared_processor())
+		debugfs_create_file("vp_manual_hint", 0600, arch_debugfs_dir, NULL, &fops_pv_vp_manual_hint);
+	return 0;
+}
+
+device_initcall(paravirt_debugfs_init);
 #endif
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC v2 0/9] cpu avoid state and push task mechanism
  2025-06-25 19:10 [RFC v2 0/9] cpu avoid state and push task mechanism Shrikanth Hegde
                   ` (8 preceding siblings ...)
  2025-06-25 19:11 ` [RFC v2 9/9] [DEBUG] powerpc: add debug file for set/unset cpu avoid Shrikanth Hegde
@ 2025-06-25 21:55 ` Yury Norov
  2025-06-26 14:33   ` Shrikanth Hegde
  9 siblings, 1 reply; 25+ messages in thread
From: Yury Norov @ 2025-06-25 21:55 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, maddy, vschneid,
	dietmar.eggemann, rostedt, kprateek.nayak, huschle, srikar,
	linux-kernel, christophe.leroy, linuxppc-dev, gregkh

On Thu, Jun 26, 2025 at 12:40:59AM +0530, Shrikanth Hegde wrote:
> This is a followup version if [1] with few additions. This is still an RFC 
> and would like get feedback on the idea and suggestions on improvement. 
> 
> v1->v2:
> - Renamed to cpu_avoid_mask in place of cpu_parked_mask.

This one is not any better than the previous one. Why avoid? When avoid?
I already said that: for objects, having positive self-explaining
noun names is much better than negative and/or function-style verb
names. I suggested cpu_paravirt_mask, and I still believe it's a much
better option.

> - Used a static key such that no impact to regular case. 

Static keys are not free and are designed for a different purpose. You have
CONFIG_PARAVIRT, and I don't understand why you're trying to avoid
using it.

I don't mind about static keys, if you prefer them, I just want to
have feature-specific code under corresponding config.

Can you please print a bloat-o-meter report for CONFIG_PARAVIRT=n?
Do you have any perf numbers to advocate static keys here?
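
E.g. a rough sketch of gating it under the config (reusing the helper
added later in patch 7 of this series):

	#ifdef CONFIG_PARAVIRT
	static inline bool cpu_avoid_check(int cpu)
	{
		if (static_branch_unlikely(&paravirt_cpu_avoid_enabled))
			return cpu_avoid(cpu);
		return false;
	}
	#else
	static inline bool cpu_avoid_check(int cpu)
	{
		return false;
	}
	#endif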

> - add sysfs file to show avoid CPUs.
> - Make RT understand avoid CPUs. 
> - Add documentation patch 
> - Took care of reported compile error in [1] when NR_CPUS=1
> 
> -----------------
> Problem statement
> -----------------
> vCPU - Virtual CPUs - CPU in VM world.
> pCPU - Physical CPUs - CPU in baremetal world.
> 
> A hypervisor is managing these vCPUs from different VMs. When a vCPU 
> requests for CPU, hypervisor does the job of scheduling them on a pCPU.
> 
> So this issue occurs when there are more vCPUs(combined across all VMs) 
> than the pCPU. So when *all* vCPUs are requesting for CPUs, hypervisor 
> can only run a few of them and remaining will be preempted(waiting for pCPU).
> 
> If we take two VM's, When hypervisor preempts vCPU from VM1 to run vCPU from 
> VM2, it has to do save/restore VM context.Instead if VM's can co-ordinate among
> each other and request for *limited*  vCPUs, it avoids the above overhead and 
                                       ^
Did this extra whitespace escape from the previous line, or the following?
                                        v
> there is context switching within vCPU(less expensive). Even if hypervisor
> is preempting one vCPU to run another within the same VM, it is still more 
> expensive than the task preemption within the vCPU. So *basic* aim to avoid 
> vCPU preemption.
> 
> So to achieve this, use "CPU Avoid" concept, where it is better
> if workload avoids these vCPUs at this moment.
> (vCPUs stays online, we don't want the overhead of sched domain rebuild).
> 
> Contention is dynamic in nature. When there is contention for pCPU is to be 
> detected and determined by architecture. Archs needs to update the mask 
> accordingly.
> 
> When there is contention, use limited vCPUs as indicated by arch.
> When there is no contention, use all vCPUs.
> 
> -------------------------
> To be done and Questions: 
> -------------------------
> 1. IRQ - still don't understand this cpu_avoid_mask. Maybe irqbalance
> code could be modified to do the same. Ran stress-ng --hrtimers, irq
> moved out of avoid cpu though. So need to see if changes to irqbalance is
> required or not.
> 
> 2. If a task is spawned by affining to only avoid CPUs. Should that fail
> or throw a warning to user. 

I think it's possible that the existing codebase will do that. And because
you don't want to break userspace, you should not restrict it.

> 3. Other classes such as SCHED_EXT, SCHED_DL won't understand this infra
> yet.
> 
> 4. Performance testing yet to be done. RFC only verified the functional
> aspects of whether task move out of avoid CPUs or not. Move happens quite
> fast (around 1-2 seconds even on large systems with very high utilization) 
> 
> 5. Haven't come up an infra which could combine all push task related
> changes. It is currently spread across rt, dl, fair. Maybe some
> consolidation can be done. but which tasks to push/pull still remains in
> the class. 
> 
> 6. cpu_avoid_mask may need some sort of locking to ensure read/write is
> correct. 
> 
> [1]: https://lore.kernel.org/all/20250523181448.3777233-1-sshegde@linux.ibm.com/
> 
> Shrikanth Hegde (9):
>   sched/docs: Document avoid_cpu_mask and avoid CPU concept
>   cpumask: Introduce cpu_avoid_mask
>   sched/core: Don't allow to use CPU marked as avoid
>   sched/fair: Don't use CPU marked as avoid for wakeup and load balance
>   sched/rt: Don't select CPU marked as avoid for wakeup and push/pull rt task
>   sched/core: Push current task out if CPU is marked as avoid
>   sched: Add static key check for cpu_avoid
>   sysfs: Add cpu_avoid file
>   powerpc: add debug file for set/unset cpu avoid
> 
>  Documentation/scheduler/sched-arch.rst | 25 +++++++++++++
>  arch/powerpc/include/asm/paravirt.h    |  2 ++
>  arch/powerpc/kernel/smp.c              | 50 ++++++++++++++++++++++++++
>  drivers/base/cpu.c                     |  8 +++++
>  include/linux/cpumask.h                | 17 +++++++++
>  kernel/cpu.c                           |  3 ++
>  kernel/sched/core.c                    | 50 +++++++++++++++++++++++++-
>  kernel/sched/fair.c                    | 11 +++++-
>  kernel/sched/rt.c                      |  9 +++--
>  kernel/sched/sched.h                   | 10 ++++++
>  10 files changed, 181 insertions(+), 4 deletions(-)
> 
> -- 
> 2.43.0

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2 9/9] [DEBUG] powerpc: add debug file for set/unset cpu avoid
  2025-06-25 19:11 ` [RFC v2 9/9] [DEBUG] powerpc: add debug file for set/unset cpu avoid Shrikanth Hegde
@ 2025-06-25 22:53   ` Yury Norov
  2025-06-26 13:39     ` Shrikanth Hegde
  0 siblings, 1 reply; 25+ messages in thread
From: Yury Norov @ 2025-06-25 22:53 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, maddy, vschneid,
	dietmar.eggemann, rostedt, kprateek.nayak, huschle, srikar,
	linux-kernel, christophe.leroy, linuxppc-dev, gregkh

On Thu, Jun 26, 2025 at 12:41:08AM +0530, Shrikanth Hegde wrote:
> Reference patch for how an architecture can make use of this infra. 
> 
> This is not meant to be merged. Instead the vp_manual_hint should either
> come from hardware or could be derived using steal time. 

If you don't add any code that manages the 'avoid' mask on the host
side, all this becomes a dead code.
 
> When the provided hint is less than the total CPUs in the system, it
> will enable the cpu avoid static key and set those CPUs as avoid. 
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
>  arch/powerpc/include/asm/paravirt.h |  2 ++
>  arch/powerpc/kernel/smp.c           | 50 +++++++++++++++++++++++++++++
>  2 files changed, 52 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h
> index b78b82d66057..b6497e0b60d8 100644
> --- a/arch/powerpc/include/asm/paravirt.h
> +++ b/arch/powerpc/include/asm/paravirt.h
> @@ -10,6 +10,8 @@
>  #include <asm/hvcall.h>
>  #endif
>  
> +DECLARE_STATIC_KEY_FALSE(paravirt_cpu_avoid_enabled);
> +
>  #ifdef CONFIG_PPC_SPLPAR
>  #include <linux/smp.h>
>  #include <asm/kvm_guest.h>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 5ac7084eebc0..e00cdc4de441 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -64,6 +64,7 @@
>  #include <asm/systemcfg.h>
>  
>  #include <trace/events/ipi.h>
> +#include <linux/debugfs.h>
>  
>  #ifdef DEBUG
>  #include <asm/udbg.h>
> @@ -82,6 +83,7 @@ bool has_big_cores __ro_after_init;
>  bool coregroup_enabled __ro_after_init;
>  bool thread_group_shares_l2 __ro_after_init;
>  bool thread_group_shares_l3 __ro_after_init;
> +static int vp_manual_hint = NR_CPUS;
>  
>  DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
> @@ -1727,6 +1729,7 @@ static void __init build_sched_topology(void)
>  	BUG_ON(i >= ARRAY_SIZE(powerpc_topology) - 1);
>  
>  	set_sched_topology(powerpc_topology);
> +	vp_manual_hint = num_present_cpus();
>  }
>  
>  void __init smp_cpus_done(unsigned int max_cpus)
> @@ -1807,4 +1810,51 @@ void __noreturn arch_cpu_idle_dead(void)
>  	start_secondary_resume();
>  }
>  
> +/*
> + * sysfs hint to mark CPUs as Avoid. This will help in restricting
> + * the workload to specified number of CPUs.
> + * For example 40 > vp_manual_hint means, workload will run on
> + * 0-39 CPUs.
> + */
> +
> +static int pv_vp_manual_hint_set(void *data, u64 val)
> +{
> +	int cpu;
> +
> +	if (val == 0 || vp_manual_hint > num_present_cpus())
> +		vp_manual_hint = num_present_cpus();
> +
> +	if (val != vp_manual_hint)
> +		vp_manual_hint = val;

This all is effectively just:

	vp_manual_hint = val;

Isn't?

> +	if (vp_manual_hint < num_present_cpus())
> +		static_branch_enable(&paravirt_cpu_avoid_enabled);
> +	else
> +		static_branch_disable(&paravirt_cpu_avoid_enabled);
> +
> +	for_each_present_cpu(cpu) {
> +		if (cpu >= vp_manual_hint)
> +			set_cpu_avoid(cpu, true);
> +		else
> +			set_cpu_avoid(cpu, false);
> +	}
> +	return 0;
> +}
> +
> +static int pv_vp_manual_hint_get(void *data, u64 *val)
> +{
> +	*val = vp_manual_hint;
> +	return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_vp_manual_hint, pv_vp_manual_hint_get, pv_vp_manual_hint_set, "%llu\n");
> +
> +static __init int paravirt_debugfs_init(void)
> +{
> +	if (is_shared_processor())
> +		debugfs_create_file("vp_manual_hint", 0600, arch_debugfs_dir, NULL, &fops_pv_vp_manual_hint);
> +	return 0;
> +}
> +
> +device_initcall(paravirt_debugfs_init)
>  #endif
> -- 
> 2.43.0

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2 4/9] sched/fair: Don't use CPU marked as avoid for wakeup and load balance
  2025-06-25 19:11 ` [RFC v2 4/9] sched/fair: Don't use CPU marked as avoid for wakeup and load balance Shrikanth Hegde
@ 2025-06-26  0:02   ` Yury Norov
  2025-06-26 13:42     ` Shrikanth Hegde
  0 siblings, 1 reply; 25+ messages in thread
From: Yury Norov @ 2025-06-26  0:02 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, maddy, vschneid,
	dietmar.eggemann, rostedt, kprateek.nayak, huschle, srikar,
	linux-kernel, christophe.leroy, linuxppc-dev, gregkh

On Thu, Jun 26, 2025 at 12:41:03AM +0530, Shrikanth Hegde wrote:
> Load balancer shouldn't spread CFS tasks into a CPU marked as Avoid. 
> Remove those CPUs from load balancing decisions. 
> 
> At wakeup, don't select a CPU marked as avoid. 
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
> while tesing didn't see cpu being marked as avoid while new_cpu is. 
> May need some more probing to see if even cpu can be. if so it could
> lead to crash.  
> 
>  kernel/sched/fair.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7e2963efe800..406288aef535 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8546,7 +8546,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>  	}
>  	rcu_read_unlock();
>  
> -	return new_cpu;
> +	/* Don't select a CPU marked as avoid for wakeup */
> +	if (cpu_avoid(new_cpu))
> +		return cpu;
> +	else
> +		return new_cpu;
> +
>  }

There are more 'return's in this function, but you patch only one...

>  
>  /*
> @@ -11662,6 +11667,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>  
>  	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>  
> +	/* Don't spread load into CPUs marked as avoid */
> +	cpumask_andnot(cpus, cpus, cpu_avoid_mask);
> +
>  	schedstat_inc(sd->lb_count[idle]);
>  
>  redo:
> -- 
> 2.43.0

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2 7/9] sched: Add static key check for cpu_avoid
  2025-06-25 19:11 ` [RFC v2 7/9] sched: Add static key check for cpu_avoid Shrikanth Hegde
@ 2025-06-26  0:12   ` Yury Norov
  0 siblings, 0 replies; 25+ messages in thread
From: Yury Norov @ 2025-06-26  0:12 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, maddy, vschneid,
	dietmar.eggemann, rostedt, kprateek.nayak, huschle, srikar,
	linux-kernel, christophe.leroy, linuxppc-dev, gregkh

On Thu, Jun 26, 2025 at 12:41:06AM +0530, Shrikanth Hegde wrote:
> Checking if a CPU is avoid can add a slight overhead and should be 
> done only when necessary. 
> 
> Add a static key check which makes it almost nop when key is false. 
> Arch needs to set the key when it decides to. Refer to debug patch
> for example. 
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
> This method avoids additional ifdefs. So kept it that way instead of 
> CONFIG_PARAVIRT. 
> 
> Added a helper function for cpu_avoid, since including sched.h fails in 
> cpumask.h
> 
>  kernel/sched/core.c  | 8 ++++----
>  kernel/sched/fair.c  | 5 +++--
>  kernel/sched/rt.c    | 8 ++++----
>  kernel/sched/sched.h | 9 +++++++++
>  4 files changed, 20 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index aea4232e3ec4..51426b17ef55 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -148,9 +148,9 @@ __read_mostly int sysctl_resched_latency_warn_once = 1;
>   * Limited because this is done with IRQs disabled.
>   */
>  __read_mostly unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;
> -
>  __read_mostly int scheduler_running;
>  
> +DEFINE_STATIC_KEY_FALSE(paravirt_cpu_avoid_enabled);
>  #ifdef CONFIG_SCHED_CORE
>  
>  DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
> @@ -2438,7 +2438,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
>  		return false;
>  
>  	/* CPU marked as avoid, shouldn't chosen to run any task*/
> -	if (cpu_avoid(cpu))
> +	if (cpu_avoid_check(cpu))
>  		return false;

Here you're patching the code that you've just added. Can you simply
add it in a proper way?..

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2 1/9] sched/docs: Document avoid_cpu_mask and avoid CPU concept
  2025-06-25 19:11 ` [RFC v2 1/9] sched/docs: Document avoid_cpu_mask and avoid CPU concept Shrikanth Hegde
@ 2025-06-26  6:27   ` Hillf Danton
  2025-06-26 14:46     ` Shrikanth Hegde
  0 siblings, 1 reply; 25+ messages in thread
From: Hillf Danton @ 2025-06-26  6:27 UTC (permalink / raw)
  To: Shrikanth Hegde; +Cc: peterz, kprateek.nayak, linux-kernel

On Thu, 26 Jun 2025 00:41:00 +0530 Shrikanth Hegde wrote
> This describes what avoid CPU means and what scheduler aims to do 
> when a CPU is marked as avoid. 
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
>  Documentation/scheduler/sched-arch.rst | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
> index ed07efea7d02..d32755298fca 100644
> --- a/Documentation/scheduler/sched-arch.rst
> +++ b/Documentation/scheduler/sched-arch.rst
> @@ -62,6 +62,31 @@ Your cpu_idle routines need to obey the following rules:
>  arch/x86/kernel/process.c has examples of both polling and
>  sleeping idle functions.
>  
> +CPU Avoid
> +=========
> +
> +Under paravirt conditions it is possible to overcommit CPU resources.
> +i.e sum of virtual CPU(vCPU) of all VM is greater than number of physical
> +CPUs(pCPU). Under such conditions when all or many VM have high utilization,
> +hypervisor won't be able to satisfy the requirement and has to context switch
> +within or across VM. VM level context switch is more expensive compared to
> +task context switch within the VM.
> +
Sounds like the VMs are not well configured (or the pCPUs not well partitioned).

> +In such cases it is better that VM's co-ordinate among themselves and ask for
> +less CPU request by not using some of the vCPUs. Such vCPUs where workload
> +can be avoided at the moment are called as "Avoid CPUs". Note that when the
> +pCPU contention goes away, these vCPUs can be used again by the workload.
> +
In the car cockpit scenario, for example with a type-1 hypervisor, there is an
app kicking a watchdog bound to every vCPU, so no vCPU should be avoided.

> +Arch need to set/unset the vCPU as avoid in cpu_avoid_mask. When set, avoid
> +the CPU and when unset, use it as usual.
> +
> +Scheduler will try to avoid those CPUs as much as it can.
> +This is achived by
> +1. Not selecting those CPU at wakeup.
> +2. Push the task away from avoid CPU at tick.
> +3. Not selecting avoid CPU at load balance.
> +
> +This works only for SCHED_RT and SCHED_NORMAL.
>  
Sounds like forcing a pill down Peter's throat because of Steve's headache.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2 9/9] [DEBUG] powerpc: add debug file for set/unset cpu avoid
  2025-06-25 22:53   ` Yury Norov
@ 2025-06-26 13:39     ` Shrikanth Hegde
  0 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-06-26 13:39 UTC (permalink / raw)
  To: Yury Norov
  Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, maddy, vschneid,
	dietmar.eggemann, rostedt, kprateek.nayak, huschle, srikar,
	linux-kernel, christophe.leroy, linuxppc-dev, gregkh


Hi Yury, Thanks for taking a look at this.

> On Thu, Jun 26, 2025 at 12:41:08AM +0530, Shrikanth Hegde wrote:
>> Reference patch for how an architecture can make use of this infra.
>>
>> This is not meant to be merged. Instead the vp_manual_hint should either
>> come from hardware or could be derived using steal time.
> 
> If you don't add any code that manages the 'avoid' mask on the host
> side, all this becomes a dead code.

Ok.

Maybe I can keep this debug file until we get the infra where
the hint derivation is done by hardware by means of an hcall or gets
calculated based on steal time.

I think I will have to polish this a bit and move it to an appropriate
place if it is to be kept.

>   
>> When the provided hint is less than the total CPUs in the system, it
>> will enable the cpu avoid static key and set those CPUs as avoid.
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>>   arch/powerpc/include/asm/paravirt.h |  2 ++
>>   arch/powerpc/kernel/smp.c           | 50 +++++++++++++++++++++++++++++
>>   2 files changed, 52 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h
>> index b78b82d66057..b6497e0b60d8 100644
>> --- a/arch/powerpc/include/asm/paravirt.h
>> +++ b/arch/powerpc/include/asm/paravirt.h
>> @@ -10,6 +10,8 @@
>>   #include <asm/hvcall.h>
>>   #endif
>>   
>> +DECLARE_STATIC_KEY_FALSE(paravirt_cpu_avoid_enabled);
>> +
>>   #ifdef CONFIG_PPC_SPLPAR
>>   #include <linux/smp.h>
>>   #include <asm/kvm_guest.h>
>> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
>> index 5ac7084eebc0..e00cdc4de441 100644
>> --- a/arch/powerpc/kernel/smp.c
>> +++ b/arch/powerpc/kernel/smp.c
>> @@ -64,6 +64,7 @@
>>   #include <asm/systemcfg.h>
>>   
>>   #include <trace/events/ipi.h>
>> +#include <linux/debugfs.h>
>>   
>>   #ifdef DEBUG
>>   #include <asm/udbg.h>
>> @@ -82,6 +83,7 @@ bool has_big_cores __ro_after_init;
>>   bool coregroup_enabled __ro_after_init;
>>   bool thread_group_shares_l2 __ro_after_init;
>>   bool thread_group_shares_l3 __ro_after_init;
>> +static int vp_manual_hint = NR_CPUS;
>>   
>>   DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
>>   DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
>> @@ -1727,6 +1729,7 @@ static void __init build_sched_topology(void)
>>   	BUG_ON(i >= ARRAY_SIZE(powerpc_topology) - 1);
>>   
>>   	set_sched_topology(powerpc_topology);
>> +	vp_manual_hint = num_present_cpus();
>>   }
>>   
>>   void __init smp_cpus_done(unsigned int max_cpus)
>> @@ -1807,4 +1810,51 @@ void __noreturn arch_cpu_idle_dead(void)
>>   	start_secondary_resume();
>>   }
>>   
>> +/*
>> + * sysfs hint to mark CPUs as Avoid. This will help in restricting
>> + * the workload to specified number of CPUs.
>> + * For example 40 > vp_manual_hint means, workload will run on
>> + * 0-39 CPUs.
>> + */
>> +
>> +static int pv_vp_manual_hint_set(void *data, u64 val)
>> +{
>> +	int cpu;
>> +
>> +	if (val == 0 || vp_manual_hint > num_present_cpus())

This should be
	if (val == 0 || val > num_present_cpus())

>> +		vp_manual_hint = num_present_cpus();
>> +
>> +	if (val != vp_manual_hint)
>> +		vp_manual_hint = val;
> 
> This all is effectively just:
> 
> 	vp_manual_hint = val;
> 
> Isn't?

Yes, with some checks for sane values.

> 
>> +	if (vp_manual_hint < num_present_cpus())
>> +		static_branch_enable(&paravirt_cpu_avoid_enabled);
>> +	else
>> +		static_branch_disable(&paravirt_cpu_avoid_enabled);
>> +
>> +	for_each_present_cpu(cpu) {
>> +		if (cpu >= vp_manual_hint)
>> +			set_cpu_avoid(cpu, true);
>> +		else
>> +			set_cpu_avoid(cpu, false);
>> +	}
>> +	return 0;
>> +}
>> +
>> +static int pv_vp_manual_hint_get(void *data, u64 *val)
>> +{
>> +	*val = vp_manual_hint;
>> +	return 0;
>> +}
>> +
>> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_vp_manual_hint, pv_vp_manual_hint_get, pv_vp_manual_hint_set, "%llu\n");
>> +
>> +static __init int paravirt_debugfs_init(void)
>> +{
>> +	if (is_shared_processor())
>> +		debugfs_create_file("vp_manual_hint", 0600, arch_debugfs_dir, NULL, &fops_pv_vp_manual_hint);
>> +	return 0;
>> +}
>> +
>> +device_initcall(paravirt_debugfs_init)
>>   #endif
>> -- 
>> 2.43.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2 4/9] sched/fair: Don't use CPU marked as avoid for wakeup and load balance
  2025-06-26  0:02   ` Yury Norov
@ 2025-06-26 13:42     ` Shrikanth Hegde
  0 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-06-26 13:42 UTC (permalink / raw)
  To: Yury Norov
  Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, maddy, vschneid,
	dietmar.eggemann, rostedt, kprateek.nayak, huschle, srikar,
	linux-kernel, christophe.leroy, linuxppc-dev, gregkh



On 6/26/25 05:32, Yury Norov wrote:
> On Thu, Jun 26, 2025 at 12:41:03AM +0530, Shrikanth Hegde wrote:
>> Load balancer shouldn't spread CFS tasks into a CPU marked as Avoid.
>> Remove those CPUs from load balancing decisions.
>>
>> At wakeup, don't select a CPU marked as avoid.
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>> While testing, I didn't see cpu being marked as avoid while new_cpu is.
>> May need some more probing to see if even cpu can be; if so, it could
>> lead to a crash.
>>
>>   kernel/sched/fair.c | 10 +++++++++-
>>   1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7e2963efe800..406288aef535 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8546,7 +8546,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>>   	}
>>   	rcu_read_unlock();
>>   
>> -	return new_cpu;
>> +	/* Don't select a CPU marked as avoid for wakeup */
>> +	if (cpu_avoid(new_cpu))
>> +		return cpu;
>> +	else
>> +		return new_cpu;
>> +
>>   }
> 
> There are more 'return's in this function, but you patch only one... 

I had seen it but forgot to add it (since EAS wasn't enabled on the system,
I missed those paths).
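One way to cover all the return paths could be a tiny helper at each exit
(a sketch; the helper name is made up). The fallback is the waking CPU, as
in the hunk above:

	/* Never hand back a CPU that is marked as avoid. */
	static inline int avoid_filter_wakeup_cpu(int new_cpu, int fallback_cpu)
	{
		return cpu_avoid(new_cpu) ? fallback_cpu : new_cpu;
	}

That would keep the policy in one place instead of duplicating the check at
every return statement, including the EAS path.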

> 
>>   
>>   /*
>> @@ -11662,6 +11667,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>>   
>>   	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>>   
>> +	/* Don't spread load into CPUs marked as avoid */
>> +	cpumask_andnot(cpus, cpus, cpu_avoid_mask);
>> +
>>   	schedstat_inc(sd->lb_count[idle]);
>>   
>>   redo:
>> -- 
>> 2.43.0



* Re: [RFC v2 0/9] cpu avoid state and push task mechanism
  2025-06-25 21:55 ` [RFC v2 0/9] cpu avoid state and push task mechanism Yury Norov
@ 2025-06-26 14:33   ` Shrikanth Hegde
  0 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-06-26 14:33 UTC (permalink / raw)
  To: Yury Norov
  Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, maddy, vschneid,
	dietmar.eggemann, rostedt, kprateek.nayak, huschle, srikar,
	linux-kernel, christophe.leroy, linuxppc-dev, gregkh



On 6/26/25 03:25, Yury Norov wrote:
> On Thu, Jun 26, 2025 at 12:40:59AM +0530, Shrikanth Hegde wrote:
>> This is a followup version if [1] with few additions. This is still an RFC
>> and would like get feedback on the idea and suggestions on improvement.
>>
>> v1->v2:
>> - Renamed to cpu_avoid_mask in place of cpu_parked_mask.
> 
> This one is not any better than the previous one. Why avoid? When avoid?
> I already said that: for objects, having positive self-explaining
> noun names is much better than negative and/or function-style verb
> names. I suggested cpu_paravirt_mask, and I still believe it's a much
> better option.
> 

OK. The only reason is that the CPU is always paravirtualized in those
environments, right? We want to set this mask only when there is contention
for the pCPUs, so I thought the name might have to reflect that.


I can keep cpu_paravirt_mask. Could you please suggest set/get names that
would go with it? cpu_paravirt(cpu)?
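Following the existing cpumask accessors, something like this could work
(a sketch, assuming the rename goes through):

	extern struct cpumask __cpu_paravirt_mask;
	#define cpu_paravirt_mask ((const struct cpumask *)&__cpu_paravirt_mask)

	/* Test whether a CPU is currently marked as paravirt-contended. */
	static inline bool cpu_paravirt(unsigned int cpu)
	{
		return cpumask_test_cpu(cpu, cpu_paravirt_mask);
	}

	/* Arch-side setter, mirroring set_cpu_online() and friends. */
	static inline void set_cpu_paravirt(unsigned int cpu, bool paravirt)
	{
		if (paravirt)
			cpumask_set_cpu(cpu, &__cpu_paravirt_mask);
		else
			cpumask_clear_cpu(cpu, &__cpu_paravirt_mask);
	}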

>> - Used a static key such that no impact to regular case.
> 
> Static keys are not free and are designed for a different purpose. You have
> CONFIG_PARAVIRT, and I don't understand why you're trying to avoid
> using it.
> 
> I don't mind static keys if you prefer them; I just want to
> have feature-specific code under the corresponding config.
> 
> Can you please print a bloat-o-meter report for CONFIG_PARAVIRT=n?
> Do you have any perf numbers to advocate static keys here?
> 

I wanted to see if there could be any other use cases apart from the paravirt case.

One I thought of: on SMT systems under low utilization, it could help achieve higher IPC
by keeping the tasks on only one thread per core; if base_slice is kept low, latency could
stay relatively low.

The other was that workloads or system usage can be dynamic in nature, with peaks and
troughs. During a trough, one may not want to use all the cores (and instead pack onto
SMT siblings), thereby saving some power.

Let me get back with bloat-o-meter and performance numbers. Using CONFIG_PARAVIRT could
end up sprinkling ifdefs around; I need to see how I can minimize that, perhaps along the
lines sketched below.
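One common pattern to keep ifdefs out of the callers is to stub the accessor
in the header, roughly like this (a sketch, assuming the static key also
moves under CONFIG_PARAVIRT):

	#ifdef CONFIG_PARAVIRT
	static inline bool cpu_avoid(unsigned int cpu)
	{
		/* The static key keeps the common case to a single nop. */
		if (static_branch_unlikely(&paravirt_cpu_avoid_enabled))
			return cpumask_test_cpu(cpu, cpu_avoid_mask);
		return false;
	}
	#else
	static inline bool cpu_avoid(unsigned int cpu) { return false; }
	#endif

With that, the scheduler call sites stay ifdef-free and compile away to
nothing when CONFIG_PARAVIRT=n.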

>> - add sysfs file to show avoid CPUs.
>> - Make RT understand avoid CPUs.
>> - Add documentation patch
>> - Took care of reported compile error in [1] when NR_CPUS=1
>>
>> -----------------
>> Problem statement
>> -----------------
>> vCPU - Virtual CPUs - CPU in VM world.
>> pCPU - Physical CPUs - CPU in baremetal world.
>>
>> A hypervisor is managing these vCPUs from different VMs. When a vCPU
>> requests for CPU, hypervisor does the job of scheduling them on a pCPU.
>>
>> So this issue occurs when there are more vCPUs(combined across all VMs)
>> than the pCPU. So when *all* vCPUs are requesting for CPUs, hypervisor
>> can only run a few of them and remaining will be preempted(waiting for pCPU).
>>
>> If we take two VM's, When hypervisor preempts vCPU from VM1 to run vCPU from
>> VM2, it has to do save/restore VM context.Instead if VM's can co-ordinate among
>> each other and request for *limited*  vCPUs, it avoids the above overhead and
>                                         ^
> Did this extra whitespace escape from the previous line, or the following?
>

Thanks for noticing it.
                                           v
>> there is context switching within vCPU(less expensive). Even if hypervisor
>> is preempting one vCPU to run another within the same VM, it is still more
>> expensive than the task preemption within the vCPU. So *basic* aim to avoid
>> vCPU preemption.
>>
>> So to achieve this, use "CPU Avoid" concept, where it is better
>> if workload avoids these vCPUs at this moment.
>> (vCPUs stays online, we don't want the overhead of sched domain rebuild).
>>
>> Contention is dynamic in nature. When there is contention for pCPU is to be
>> detected and determined by architecture. Archs needs to update the mask
>> accordingly.
>>
>> When there is contention, use limited vCPUs as indicated by arch.
>> When there is no contention, use all vCPUs.
>>
>> -------------------------
>> To be done and Questions:
>> -------------------------
>> 1. IRQ - still don't understand this cpu_avoid_mask. Maybe irqbalance
>> code could be modified to do the same. Ran stress-ng --hrtimers, irq
>> moved out of avoid cpu though. So need to see if changes to irqbalance is
>> required or not.
>>
>> 2. If a task is spawned by affining to only avoid CPUs. Should that fail
>> or throw a warning to user.
> 
> I think it's possible that the existing codebase will do that. And because
> you don't want to break userspace, you should not restrict it.

OK, got it. Currently it is allowed.

> 
>> 3. Other classes such as SCHED_EXT, SCHED_DL won't understand this infra
>> yet.
>>
>> 4. Performance testing yet to be done. RFC only verified the functional
>> aspects of whether task move out of avoid CPUs or not. Move happens quite
>> fast (around 1-2 seconds even on large systems with very high utilization)
>>
>> 5. Haven't come up an infra which could combine all push task related
>> changes. It is currently spread across rt, dl, fair. Maybe some
>> consolidation can be done. but which tasks to push/pull still remains in
>> the class.
>>
>> 6. cpu_avoid_mask may need some sort of locking to ensure read/write is
>> correct.
>>
>> [1]: https://lore.kernel.org/all/20250523181448.3777233-1-sshegde@linux.ibm.com/
>>
>> Shrikanth Hegde (9):
>>    sched/docs: Document avoid_cpu_mask and avoid CPU concept
>>    cpumask: Introduce cpu_avoid_mask
>>    sched/core: Don't allow to use CPU marked as avoid
>>    sched/fair: Don't use CPU marked as avoid for wakeup and load balance
>>    sched/rt: Don't select CPU marked as avoid for wakeup and push/pull rt task
>>    sched/core: Push current task out if CPU is marked as avoid
>>    sched: Add static key check for cpu_avoid
>>    sysfs: Add cpu_avoid file
>>    powerpc: add debug file for set/unset cpu avoid
>>
>>   Documentation/scheduler/sched-arch.rst | 25 +++++++++++++
>>   arch/powerpc/include/asm/paravirt.h    |  2 ++
>>   arch/powerpc/kernel/smp.c              | 50 ++++++++++++++++++++++++++
>>   drivers/base/cpu.c                     |  8 +++++
>>   include/linux/cpumask.h                | 17 +++++++++
>>   kernel/cpu.c                           |  3 ++
>>   kernel/sched/core.c                    | 50 +++++++++++++++++++++++++-
>>   kernel/sched/fair.c                    | 11 +++++-
>>   kernel/sched/rt.c                      |  9 +++--
>>   kernel/sched/sched.h                   | 10 ++++++
>>   10 files changed, 181 insertions(+), 4 deletions(-)
>>
>> -- 
>> 2.43.0



* Re: [RFC v2 1/9] sched/docs: Document avoid_cpu_mask and avoid CPU concept
  2025-06-26  6:27   ` Hillf Danton
@ 2025-06-26 14:46     ` Shrikanth Hegde
  2025-06-27  0:27       ` Hillf Danton
  0 siblings, 1 reply; 25+ messages in thread
From: Shrikanth Hegde @ 2025-06-26 14:46 UTC (permalink / raw)
  To: Hillf Danton; +Cc: peterz, kprateek.nayak, linux-kernel

Hi Hillf.

> On Thu, 26 Jun 2025 00:41:00 +0530 Shrikanth Hegde wrote
>> This describes what avoid CPU means and what scheduler aims to do
>> when a CPU is marked as avoid.
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>>   Documentation/scheduler/sched-arch.rst | 25 +++++++++++++++++++++++++
>>   1 file changed, 25 insertions(+)
>>
>> diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
>> index ed07efea7d02..d32755298fca 100644
>> --- a/Documentation/scheduler/sched-arch.rst
>> +++ b/Documentation/scheduler/sched-arch.rst
>> @@ -62,6 +62,31 @@ Your cpu_idle routines need to obey the following rules:
>>   arch/x86/kernel/process.c has examples of both polling and
>>   sleeping idle functions.
>>   
>> +CPU Avoid
>> +=========
>> +
>> +Under paravirt conditions it is possible to overcommit CPU resources.
>> +i.e sum of virtual CPU(vCPU) of all VM is greater than number of physical
>> +CPUs(pCPU). Under such conditions when all or many VM have high utilization,
>> +hypervisor won't be able to satisfy the requirement and has to context switch
>> +within or across VM. VM level context switch is more expensive compared to
>> +task context switch within the VM.
>> +
> Sounds like VMs not well configured (or pCPUs not well partitioned).

No, that's how VMs are configured in the paravirtualized case, as I understand it.
Correct me if I am wrong.

On powerpc, we have Shared Processor Logical Partitions (SPLPAR), which allow overcommit.
When other LPARs (VMs) are idle, overcommit lets one get more work done. It also allows one
to configure more VMs. The said issue happens only when every/most VMs ask for
CPU at the same time.

> 
>> +In such cases it is better that VM's co-ordinate among themselves and ask for
>> +less CPU request by not using some of the vCPUs. Such vCPUs where workload
>> +can be avoided at the moment are called as "Avoid CPUs". Note that when the
>> +pCPU contention goes away, these vCPUs can be used again by the workload.
>> +
> In the car cockpit scenario, for example with a type-1 hypervisor, there is an app
> kicking a watchdog bound to every vCPU, so no vCPU should be avoided.

I don't understand what is meant here. Any reference links? Also, in such cases
the arch shouldn't set any CPU as avoid, though it then won't get the benefit of
this feature.

> 
>> +Arch need to set/unset the vCPU as avoid in cpu_avoid_mask. When set, avoid
>> +the CPU and when unset, use it as usual.
>> +
>> +Scheduler will try to avoid those CPUs as much as it can.
>> +This is achieved by
>> +1. Not selecting those CPU at wakeup.
>> +2. Push the task away from avoid CPU at tick.
>> +3. Not selecting avoid CPU at load balance.
>> +
>> +This works only for SCHED_RT and SCHED_NORMAL.
>>   
> Sounds like forcing a pill down Peter's throat because of Steve's headache.

I meant that this series so far addresses only RT and NORMAL. It could be made to work
for other classes too, but I didn't see a point.

Since the mask is available, SCHED_EXT users could design their BPF hooks accordingly, and
SCHED_DL isn't designed to work under such conditions. I don't know of any user/workload
that deploys SCHED_DL in CPU-overcommitted cases.



* Re: [RFC v2 1/9] sched/docs: Document avoid_cpu_mask and avoid CPU concept
  2025-06-26 14:46     ` Shrikanth Hegde
@ 2025-06-27  0:27       ` Hillf Danton
  2025-06-27  4:37         ` Shrikanth Hegde
  0 siblings, 1 reply; 25+ messages in thread
From: Hillf Danton @ 2025-06-27  0:27 UTC (permalink / raw)
  To: Shrikanth Hegde; +Cc: peterz, kprateek.nayak, linux-kernel

On Thu, 26 Jun 2025 20:16:36 +0530 Shrikanth Hegde wrote
> > On Thu, 26 Jun 2025 00:41:00 +0530 Shrikanth Hegde wrote
> >> This describes what avoid CPU means and what scheduler aims to do
> >> when a CPU is marked as avoid.
> >>
> >> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> >> ---
> >>   Documentation/scheduler/sched-arch.rst | 25 +++++++++++++++++++++++++
> >>   1 file changed, 25 insertions(+)
> >>
> >> diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
> >> index ed07efea7d02..d32755298fca 100644
> >> --- a/Documentation/scheduler/sched-arch.rst
> >> +++ b/Documentation/scheduler/sched-arch.rst
> >> @@ -62,6 +62,31 @@ Your cpu_idle routines need to obey the following rules:
> >>   arch/x86/kernel/process.c has examples of both polling and
> >>   sleeping idle functions.
> >>   
> >> +CPU Avoid
> >> +=========
> >> +
> >> +Under paravirt conditions it is possible to overcommit CPU resources.
> >> +i.e sum of virtual CPU(vCPU) of all VM is greater than number of physical
> >> +CPUs(pCPU). Under such conditions when all or many VM have high utilization,
> >> +hypervisor won't be able to satisfy the requirement and has to context switch
> >> +within or across VM. VM level context switch is more expensive compared to
> >> +task context switch within the VM.
> >> +
> > Sounds like VMs not well configured (or pCPUs not well partitioned).
> 
> No, that's how VMs are configured in the paravirtualized case, as I understand it.
> Correct me if I am wrong.
> 
> On powerpc, we have Shared Processor Logical Partitions (SPLPAR), which allow overcommit.
> When other LPARs (VMs) are idle, overcommit lets one get more work done. It also allows one
> to configure more VMs. The said issue happens only when every/most VMs ask for
> CPU at the same time.
> 
After putting virtualization aside, let's see a simpler case where more
than 1024 apps are bound to a single CPU (a ppc having 4 CPUs, for instance):
what can we do wrt app responsiveness in the kernel? Nothing, because
resource/budget is never enough without a sane config.


* Re: [RFC v2 1/9] sched/docs: Document avoid_cpu_mask and avoid CPU concept
  2025-06-27  0:27       ` Hillf Danton
@ 2025-06-27  4:37         ` Shrikanth Hegde
  2025-06-28 22:02           ` Hillf Danton
  0 siblings, 1 reply; 25+ messages in thread
From: Shrikanth Hegde @ 2025-06-27  4:37 UTC (permalink / raw)
  To: Hillf Danton; +Cc: peterz, kprateek.nayak, linux-kernel



On 6/27/25 05:57, Hillf Danton wrote:
> On Thu, 26 Jun 2025 20:16:36 +0530 Shrikanth Hegde wrote
>>> On Thu, 26 Jun 2025 00:41:00 +0530 Shrikanth Hegde wrote
>>>> This describes what avoid CPU means and what scheduler aims to do
>>>> when a CPU is marked as avoid.
>>>>
>>>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>>>> ---
>>>>    Documentation/scheduler/sched-arch.rst | 25 +++++++++++++++++++++++++
>>>>    1 file changed, 25 insertions(+)
>>>>
>>>> diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
>>>> index ed07efea7d02..d32755298fca 100644
>>>> --- a/Documentation/scheduler/sched-arch.rst
>>>> +++ b/Documentation/scheduler/sched-arch.rst
>>>> @@ -62,6 +62,31 @@ Your cpu_idle routines need to obey the following rules:
>>>>    arch/x86/kernel/process.c has examples of both polling and
>>>>    sleeping idle functions.
>>>>    
>>>> +CPU Avoid
>>>> +=========
>>>> +
>>>> +Under paravirt conditions it is possible to overcommit CPU resources.
>>>> +i.e sum of virtual CPU(vCPU) of all VM is greater than number of physical
>>>> +CPUs(pCPU). Under such conditions when all or many VM have high utilization,
>>>> +hypervisor won't be able to satisfy the requirement and has to context switch
>>>> +within or across VM. VM level context switch is more expensive compared to
>>>> +task context switch within the VM.
>>>> +
> >>> Sounds like VMs not well configured (or pCPUs not well partitioned).
>>
> >> No, that's how VMs are configured in the paravirtualized case, as I understand it.
> >> Correct me if I am wrong.
> >>
> >> On powerpc, we have Shared Processor Logical Partitions (SPLPAR), which allow overcommit.
> >> When other LPARs (VMs) are idle, overcommit lets one get more work done. It also allows one
> >> to configure more VMs. The said issue happens only when every/most VMs ask for
> >> CPU at the same time.
>>
> After putting virtualization aside, let's see a simpler case where more
> than 1024 apps are bound to a single CPU (a ppc having 4 CPUs, for instance):
> what can we do wrt app responsiveness in the kernel?

In this case you will likely not have vCPU preemption; you will have task
preemption, and that is OK. The patch doesn't aim to solve the case you
have mentioned above.

In the generic SPLPAR configuration a virtual processor usually has a large
number of vCPUs, and powerpc systems are fairly large in terms of CPUs as
well.

I hope that answers it.


* Re: [RFC v2 1/9] sched/docs: Document avoid_cpu_mask and avoid CPU concept
  2025-06-27  4:37         ` Shrikanth Hegde
@ 2025-06-28 22:02           ` Hillf Danton
  0 siblings, 0 replies; 25+ messages in thread
From: Hillf Danton @ 2025-06-28 22:02 UTC (permalink / raw)
  To: Shrikanth Hegde; +Cc: peterz, kprateek.nayak, linux-kernel

On Fri, 27 Jun 2025 10:07:22 +0530 Shrikanth Hegde wrote
> On 6/27/25 05:57, Hillf Danton wrote:
> > On Thu, 26 Jun 2025 20:16:36 +0530 Shrikanth Hegde wrote
> >>> On Thu, 26 Jun 2025 00:41:00 +0530 Shrikanth Hegde wrote
> >>>> This describes what avoid CPU means and what scheduler aims to do
> >>>> when a CPU is marked as avoid.
> >>>>
> >>>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> >>>> ---
> >>>>    Documentation/scheduler/sched-arch.rst | 25 +++++++++++++++++++++++++
> >>>>    1 file changed, 25 insertions(+)
> >>>>
> >>>> diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
> >>>> index ed07efea7d02..d32755298fca 100644
> >>>> --- a/Documentation/scheduler/sched-arch.rst
> >>>> +++ b/Documentation/scheduler/sched-arch.rst
> >>>> @@ -62,6 +62,31 @@ Your cpu_idle routines need to obey the following rules:
> >>>>    arch/x86/kernel/process.c has examples of both polling and
> >>>>    sleeping idle functions.
> >>>>    
> >>>> +CPU Avoid
> >>>> +=========
> >>>> +
> >>>> +Under paravirt conditions it is possible to overcommit CPU resources.
> >>>> +i.e sum of virtual CPU(vCPU) of all VM is greater than number of physical
> >>>> +CPUs(pCPU). Under such conditions when all or many VM have high utilization,
> >>>> +hypervisor won't be able to satisfy the requirement and has to context switch
> >>>> +within or across VM. VM level context switch is more expensive compared to
> >>>> +task context switch within the VM.
> >>>> +
> >>> Sounds like VMs not well configured (or pCPUs not well partitioned).
> >>
> >> No, that's how VMs are configured in the paravirtualized case, as I understand it.
> >> Correct me if I am wrong.
> >>
> >> On powerpc, we have Shared Processor Logical Partitions (SPLPAR), which allow overcommit.
> >> When other LPARs (VMs) are idle, overcommit lets one get more work done. It also allows one
> >> to configure more VMs. The said issue happens only when every/most VMs ask for
> >> CPU at the same time.
> >>
> > After putting virtualization aside, let's see a simpler case where more
> > than 1024 apps are bound to a single CPU (a ppc having 4 CPUs, for instance):
> > what can we do wrt app responsiveness in the kernel?
> 
> In this case you will likely not have vCPU preemption; you will have task
> preemption, and that is OK. The patch doesn't aim to solve the case you
> have mentioned above.
> 
It is a case of overcommit due to misconfiguration, where the scheduler does not
help, simply because the kernel is not the pill that cures all pains.

> In the generic SPLPAR configuration a virtual processor usually has a large
> number of vCPUs, and powerpc systems are fairly large in terms of CPUs as
> well.
>
Overcommit is not SPLPAR-specific, nor PPC-specific, because it is buggy for the
scheduler to create overcommit on either PPC or Arm64.


* Re: [RFC v2 8/9] sysfs: Add cpu_avoid file
  2025-06-25 19:11 ` [RFC v2 8/9] sysfs: Add cpu_avoid file Shrikanth Hegde
@ 2025-07-01  9:35   ` Greg KH
  2025-07-02  6:05     ` Shrikanth Hegde
  0 siblings, 1 reply; 25+ messages in thread
From: Greg KH @ 2025-07-01  9:35 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
	maddy, vschneid, dietmar.eggemann, rostedt, kprateek.nayak,
	huschle, srikar, linux-kernel, christophe.leroy, linuxppc-dev

On Thu, Jun 26, 2025 at 12:41:07AM +0530, Shrikanth Hegde wrote:
> Add a sysfs file called "avoid" which prints the current CPUs 
> makred as avoid. 
> 
> This could be used by userspace components or tools such as irqbalance. 
> 
> /sys/devices/system/cpu # cat avoid 
> 70-479

You forgot to document the new sysfs file in Documentation/ABI/ :(

Also, you have trailing whitespace in your changelog here, was that
intentional?

thanks,

greg k-h


* Re: [RFC v2 8/9] sysfs: Add cpu_avoid file
  2025-07-01  9:35   ` Greg KH
@ 2025-07-02  6:05     ` Shrikanth Hegde
  0 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-07-02  6:05 UTC (permalink / raw)
  To: Greg KH
  Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
	maddy, vschneid, dietmar.eggemann, rostedt, kprateek.nayak,
	huschle, srikar, linux-kernel, christophe.leroy, linuxppc-dev



Hi Greg, thanks for looking into the patches.

> On Thu, Jun 26, 2025 at 12:41:07AM +0530, Shrikanth Hegde wrote:
>> Add a sysfs file called "avoid" which prints the current CPUs
>> makred as avoid.
>>
>> This could be used by userspace components or tools such as irqbalance.
>>
>> /sys/devices/system/cpu # cat avoid
>> 70-479
> 
> You forgot to document the new sysfs file in Documentation/ABI/ :( 

Sorry, didn't realize that. Will fix it in v3.
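A draft entry for Documentation/ABI/testing/sysfs-devices-system-cpu could
look something like this (wording and Date are placeholders for now):

	What:		/sys/devices/system/cpu/avoid
	Date:		July 2025
	Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
	Description:	Read-only list of CPUs currently marked as avoid.
			While a CPU is in this list, the scheduler tries to
			keep SCHED_NORMAL and SCHED_RT tasks off it. Uses
			the standard CPU list format, e.g. "70-479"; empty
			when no CPU is marked.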

> 
> Also, you have trailing whitespace in your changelog here, was that
> intentional?
> 

My bad, didn't realize it while editing. Will fix it too.


Also, checkpatch --strict doesn't complain about trailing whitespace in the
changelog.

> thanks,
> 
> greg k-h



* Re: [RFC v2 6/9] sched/core: Push current task out if CPU is marked as avoid
  2025-06-25 19:11 ` [RFC v2 6/9] sched/core: Push current task out if CPU is marked as avoid Shrikanth Hegde
@ 2025-08-12 18:40   ` Shrikanth Hegde
  0 siblings, 0 replies; 25+ messages in thread
From: Shrikanth Hegde @ 2025-08-12 18:40 UTC (permalink / raw)
  To: yury.norov
  Cc: vschneid, dietmar.eggemann, rostedt, mingo, peterz,
	kprateek.nayak, huschle, srikar, linux-kernel, christophe.leroy,
	linuxppc-dev, gregkh, maddy, tglx, juri.lelli, vincent.guittot


Sorry for the delay in responding with the bloat-o-meter report. Since stop_one_cpu_nowait()
needs protection against a race, I need to add a field in rq, so an ifdef check on
CONFIG_PARAVIRT makes sense.

> 
> Since the task is running, we need to use the stopper class to push the
> task out. Use __balance_push_cpu_stop to achieve that.
> 
> This currently works only for CFS and RT.
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
>   kernel/sched/core.c  | 44 ++++++++++++++++++++++++++++++++++++++++++++
>   kernel/sched/sched.h |  1 +
>   2 files changed, 45 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 13e44d7a0b90..aea4232e3ec4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5577,6 +5577,10 @@ void sched_tick(void)
>   
>   	sched_clock_tick();
>   
> +	/* push the current task out if cpu is marked as avoid */
> +	if (cpu_avoid(cpu))
> +		push_current_task(rq);
> +
>   	rq_lock(rq, &rf);
>   	donor = rq->donor;
>   
> @@ -8028,6 +8032,43 @@ static void balance_hotplug_wait(void)
>   			   TASK_UNINTERRUPTIBLE);
>   }
>   
> +static DEFINE_PER_CPU(struct cpu_stop_work, push_task_work);
> +
> +/* A CPU is marked as Avoid when there is contention for underlying
> + * physical CPU and using this CPU will lead to hypervisor preemptions.
> + * It is better not to use this CPU.
> + *
> + * In case any task is scheduled on such CPU, move it out. In
> + * select_fallback_rq a non_avoid CPU will be chosen and henceforth
> + * task shouldn't come back to this CPU
> + */
> +void push_current_task(struct rq *rq)
> +{
> +	struct task_struct *push_task = rq->curr;
> +	unsigned long flags;
> +
> +	/* idle task can't be pushed out */
> +	if (rq->curr == rq->idle || !cpu_avoid(rq->cpu))
> +		return;
> +
> +	/* Do for only SCHED_NORMAL AND RT for now */
> +	if (push_task->sched_class != &fair_sched_class &&
> +	    push_task->sched_class != &rt_sched_class)
> +		return;
> +
> +	if (kthread_is_per_cpu(push_task) ||
> +	    is_migration_disabled(push_task))
> +		return;
> +
> +	local_irq_save(flags);
> +	get_task_struct(push_task);
> +	preempt_disable();
> +
> +	stop_one_cpu_nowait(rq->cpu, __balance_push_cpu_stop, push_task,
> +			    this_cpu_ptr(&push_task_work));

Doing a perf record occasionally caused a crash. This happens because stop_one_cpu_nowait()
expects the callers to synchronize, and push_task_work must stay untouched until the stopper
executes.

So I had to do something similar to what's done for active_balance:
add a field in rq and set/unset it accordingly.

Using this field in __balance_push_cpu_stop is also hacky. I have to do something like below,
	if (rq->balance_callback != &balance_push_callback)
		rq->push_task_work_pending = 0;
or I have to copy __balance_push_cpu_stop and do the above.

After this, it makes sense to put all of this under CONFIG_PARAVIRT.
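For reference, the caller side could then look roughly like this (a sketch;
the field name push_task_work_pending is tentative):

	/* Queue new stopper work only if the previous one has completed. */
	if (!rq->push_task_work_pending) {
		rq->push_task_work_pending = 1;
		stop_one_cpu_nowait(rq->cpu, __balance_push_cpu_stop, push_task,
				    this_cpu_ptr(&push_task_work));
	}

with the stopper side clearing the flag once it has run, much like
rq->active_balance is handled for active load balancing.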


(Also, I did explore using the stop_one_cpu() variant, reached by scheduling a work item and
then executing it in preemptible context. That occasionally ends up in a deadlock; due to some
issues at my end, I haven't debugged it further. It remains a backup option for the nowait
approach.)

> +	preempt_enable();
> +	local_irq_restore(flags);
> +}
>   #else /* !CONFIG_HOTPLUG_CPU: */
>   
>   static inline void balance_push(struct rq *rq)
> @@ -8042,6 +8083,9 @@ static inline void balance_hotplug_wait(void)
>   {
>   }
>   
> +void push_current_task(struct rq *rq)
> +{
> +}
>   #endif /* !CONFIG_HOTPLUG_CPU */
>   
>   void set_rq_online(struct rq *rq)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 105190b18020..b9614873762e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1709,6 +1709,7 @@ struct rq_flags {
>   };
>   
>   extern struct balance_callback balance_push_callback;
> +void push_current_task(struct rq *rq);
>   
>   #ifdef CONFIG_SCHED_CLASS_EXT
>   extern const struct sched_class ext_sched_class;

Hopefully I should be able to send out v3 soon, addressing the comments.

Name-wise, I am going to keep cpu_paravirt_mask and cpu_paravirt(cpu).



end of thread

Thread overview: 25+ messages
2025-06-25 19:10 [RFC v2 0/9] cpu avoid state and push task mechanism Shrikanth Hegde
2025-06-25 19:11 ` [RFC v2 1/9] sched/docs: Document avoid_cpu_mask and avoid CPU concept Shrikanth Hegde
2025-06-26  6:27   ` Hillf Danton
2025-06-26 14:46     ` Shrikanth Hegde
2025-06-27  0:27       ` Hillf Danton
2025-06-27  4:37         ` Shrikanth Hegde
2025-06-28 22:02           ` Hillf Danton
2025-06-25 19:11 ` [RFC v2 2/9] cpumask: Introduce cpu_avoid_mask Shrikanth Hegde
2025-06-25 19:11 ` [RFC v2 3/9] sched/core: Dont allow to use CPU marked as avoid Shrikanth Hegde
2025-06-25 19:11 ` [RFC v2 4/9] sched/fair: Don't use CPU marked as avoid for wakeup and load balance Shrikanth Hegde
2025-06-26  0:02   ` Yury Norov
2025-06-26 13:42     ` Shrikanth Hegde
2025-06-25 19:11 ` [RFC v2 5/9] sched/rt: Don't select CPU marked as avoid for wakeup and push/pull rt task Shrikanth Hegde
2025-06-25 19:11 ` [RFC v2 6/9] sched/core: Push current task out if CPU is marked as avoid Shrikanth Hegde
2025-08-12 18:40   ` Shrikanth Hegde
2025-06-25 19:11 ` [RFC v2 7/9] sched: Add static key check for cpu_avoid Shrikanth Hegde
2025-06-26  0:12   ` Yury Norov
2025-06-25 19:11 ` [RFC v2 8/9] sysfs: Add cpu_avoid file Shrikanth Hegde
2025-07-01  9:35   ` Greg KH
2025-07-02  6:05     ` Shrikanth Hegde
2025-06-25 19:11 ` [RFC v2 9/9] [DEBUG] powerpc: add debug file for set/unset cpu avoid Shrikanth Hegde
2025-06-25 22:53   ` Yury Norov
2025-06-26 13:39     ` Shrikanth Hegde
2025-06-25 21:55 ` [RFC v2 0/9] cpu avoid state and push task mechanism Yury Norov
2025-06-26 14:33   ` Shrikanth Hegde
