public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/5] sched: cpu parked and push current task mechanism
@ 2025-05-23 18:14 Shrikanth Hegde
  2025-05-23 18:14 ` [RFC PATCH 1/5] cpumask: Introduce cpu parked mask Shrikanth Hegde
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Shrikanth Hegde @ 2025-05-23 18:14 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
	maddy
  Cc: sshegde, vschneid, dietmar.eggemann, rostedt, jstultz,
	kprateek.nayak, huschle, srikar, linux-kernel, linux

In a para-virtualised environment, there can be multiple overcommitted
VMs, i.e. the sum of virtual CPUs (vCPUs) exceeds the number of physical
CPUs (pCPUs). When all such VMs request CPU cycles at the same time, it
is not possible to serve all of them. This leads to VM-level preemptions
and hence steal time.

Bring in the notion of a CPU parked state, which implies the underlying
pCPU may not be available at this time, so it is better to avoid that
vCPU. When a CPU is marked as parked, tasks should vacate it as soon as
possible. The parked state is dynamic at runtime and can change often.

In general, task-level preemption (driven by the VM) is less expensive
than VM-level preemption (driven by the hypervisor). So packing work
onto fewer CPUs helps improve overall workload throughput/latency.

The architecture needs to decide which CPUs are parked. Currently we are
exploring deriving the hint from steal time and hypervisor-provided
statistics. A simple powerpc debug patch shows how one can make use of
the cpu parked feature.

CPU parking and the need for it have been explained here as well [1].
Much of the context explained in that cover letter applies to this
problem as well.
[1]: https://lore.kernel.org/all/20250512115325.30022-1-huschle@linux.ibm.com/

While trying the above method on a large system (480 vCPUs), it was
taking around 8-10 seconds for the workload to move, which is too long.
Hence this approach, where the workload moves within 1-2 seconds.

Pros:
- Once tasks move, no load balancer overhead
- Less need for stats; minimal load balancer changes
- Faster, since it is based on sched_tick
- The system maintains the state of parked CPUs; other subsystems may
  find it useful

Cons:
- Uses stop machine to move the current task, so it can't be moved
  before it gets scheduled
- Depends on CONFIG_HOTPLUG_CPU since it relies on __balance_push_cpu_stop
  (might not be a big concern)

Sending this out to get feedback on the idea. This mechanism seems
lightweight and fast. There are other push-task-related patches sent for
EAS [2] and newidle balance [3]. Maybe it is time to come up with a push
task framework that each of them can make use of; need to dig more into
it [4]. RT, DL, IRQ and taskset concerns still need to be addressed.
There may be subtle races too (no warnings/bugs on the console while
testing CFS tasks).

[2]: https://lore.kernel.org/all/20250302210539.1563190-1-vincent.guittot@linaro.org/
[3]: https://lore.kernel.org/lkml/20250409111539.23791-1-kprateek.nayak@amd.com/
[4]: https://lore.kernel.org/all/xhsmh1putoxbz.mognet@vschneid-thinkpadt14sgen2i.remote.csb/

Based on tip/master  at fa95dea97bd1 (Merge branch into tip/master: 'perf/core')

Shrikanth Hegde (5):
  cpumask: Introduce cpu parked mask
  sched/core: Don't use parked cpu for selection
  sched/fair: Don't use parked cpu for load balancing
  sched/core: Push current task when cpu is parked
  powerpc: Use manual hint for cpu parking

 arch/powerpc/kernel/smp.c | 45 +++++++++++++++++++++++++++++++++++++++
 include/linux/cpumask.h   | 14 ++++++++++++
 kernel/cpu.c              |  3 +++
 kernel/sched/core.c       | 43 +++++++++++++++++++++++++++++++++++--
 kernel/sched/fair.c       |  1 +
 kernel/sched/sched.h      |  1 +
 6 files changed, 105 insertions(+), 2 deletions(-)

-- 
2.39.3


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC PATCH 1/5] cpumask: Introduce cpu parked mask
  2025-05-23 18:14 [RFC PATCH 0/5] sched: cpu parked and push current task mechanism Shrikanth Hegde
@ 2025-05-23 18:14 ` Shrikanth Hegde
  2025-05-27 15:06   ` Yury Norov
  2025-05-23 18:14 ` [RFC PATCH 2/5] sched/core: Don't use parked cpu for selection Shrikanth Hegde
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Shrikanth Hegde @ 2025-05-23 18:14 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
	maddy
  Cc: sshegde, vschneid, dietmar.eggemann, rostedt, jstultz,
	kprateek.nayak, huschle, srikar, linux-kernel, linux

A CPU is said to be parked when the underlying physical CPU is not
available. This happens when there is contention for CPU resources in
the para-virtualized case. One should avoid using these CPUs.

Build and maintain this state of parked CPUs. The scheduler will use
this information and push tasks out as soon as it can.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
- Not sure if the __read_mostly attribute suits cpu_parked since it can
change often. But 'often' here means every few minutes, which is a long
time from the scheduler's perspective, hence it is kept.

 include/linux/cpumask.h | 14 ++++++++++++++
 kernel/cpu.c            |  3 +++
 2 files changed, 17 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 6a569c7534db..501848303800 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -84,6 +84,7 @@ static __always_inline void set_nr_cpu_ids(unsigned int nr)
  *     cpu_enabled_mask - has bit 'cpu' set iff cpu can be brought online
  *     cpu_online_mask  - has bit 'cpu' set iff cpu available to scheduler
  *     cpu_active_mask  - has bit 'cpu' set iff cpu available to migration
+ *     cpu_parked_mask  - has bit 'cpu' set iff cpu is parked
  *
  *  If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
  *
@@ -93,6 +94,11 @@ static __always_inline void set_nr_cpu_ids(unsigned int nr)
  *  representing which CPUs are currently plugged in.  And
  *  cpu_online_mask is the dynamic subset of cpu_present_mask,
  *  indicating those CPUs available for scheduling.
+ *
+ *  A CPU is said to be parked when underlying physical CPU(pCPU) is not
+ *  available at the moment. It is recommended not to run any workload on
+ *  that CPU.
+ *
  *
  *  If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
  *  depending on what ACPI reports as currently plugged in, otherwise
@@ -118,12 +124,14 @@ extern struct cpumask __cpu_enabled_mask;
 extern struct cpumask __cpu_present_mask;
 extern struct cpumask __cpu_active_mask;
 extern struct cpumask __cpu_dying_mask;
+extern struct cpumask __cpu_parked_mask;
 #define cpu_possible_mask ((const struct cpumask *)&__cpu_possible_mask)
 #define cpu_online_mask   ((const struct cpumask *)&__cpu_online_mask)
 #define cpu_enabled_mask   ((const struct cpumask *)&__cpu_enabled_mask)
 #define cpu_present_mask  ((const struct cpumask *)&__cpu_present_mask)
 #define cpu_active_mask   ((const struct cpumask *)&__cpu_active_mask)
 #define cpu_dying_mask    ((const struct cpumask *)&__cpu_dying_mask)
+#define cpu_parked_mask    ((const struct cpumask *)&__cpu_parked_mask)
 
 extern atomic_t __num_online_cpus;
 
@@ -1146,6 +1154,7 @@ void init_cpu_possible(const struct cpumask *src);
 #define set_cpu_present(cpu, present)	assign_cpu((cpu), &__cpu_present_mask, (present))
 #define set_cpu_active(cpu, active)	assign_cpu((cpu), &__cpu_active_mask, (active))
 #define set_cpu_dying(cpu, dying)	assign_cpu((cpu), &__cpu_dying_mask, (dying))
+#define set_cpu_parked(cpu, parked)    assign_cpu((cpu), &__cpu_parked_mask, (parked))
 
 void set_cpu_online(unsigned int cpu, bool online);
 
@@ -1235,6 +1244,11 @@ static __always_inline bool cpu_dying(unsigned int cpu)
 	return cpumask_test_cpu(cpu, cpu_dying_mask);
 }
 
+static __always_inline bool cpu_parked(unsigned int cpu)
+{
+	return cpumask_test_cpu(cpu, cpu_parked_mask);
+}
+
 #else
 
 #define num_online_cpus()	1U
diff --git a/kernel/cpu.c b/kernel/cpu.c
index a59e009e0be4..532fbfbe3226 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -3110,6 +3110,9 @@ EXPORT_SYMBOL(__cpu_dying_mask);
 atomic_t __num_online_cpus __read_mostly;
 EXPORT_SYMBOL(__num_online_cpus);
 
+struct cpumask __cpu_parked_mask __read_mostly;
+EXPORT_SYMBOL(__cpu_parked_mask);
+
 void init_cpu_present(const struct cpumask *src)
 {
 	cpumask_copy(&__cpu_present_mask, src);
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 2/5] sched/core: Don't use parked cpu for selection
  2025-05-23 18:14 [RFC PATCH 0/5] sched: cpu parked and push current task mechanism Shrikanth Hegde
  2025-05-23 18:14 ` [RFC PATCH 1/5] cpumask: Introduce cpu parked mask Shrikanth Hegde
@ 2025-05-23 18:14 ` Shrikanth Hegde
  2025-05-27 14:59   ` Yury Norov
  2025-05-23 18:14 ` [RFC PATCH 3/5] sched/fair: Don't use parked cpu for load balancing Shrikanth Hegde
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Shrikanth Hegde @ 2025-05-23 18:14 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
	maddy
  Cc: sshegde, vschneid, dietmar.eggemann, rostedt, jstultz,
	kprateek.nayak, huschle, srikar, linux-kernel, linux

When the current running task is pushed using the stop-class mechanism,
the new CPU chosen shouldn't be a parked CPU.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 62b3416f5e43..9ec12f9b3b08 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3526,7 +3526,7 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
 		nodemask = cpumask_of_node(nid);
 
 		/* Look for allowed, online CPU in same node. */
-		for_each_cpu(dest_cpu, nodemask) {
+		for_each_cpu_andnot(dest_cpu, nodemask, cpu_parked_mask) {
 			if (is_cpu_allowed(p, dest_cpu))
 				return dest_cpu;
 		}
@@ -3534,7 +3534,7 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
 
 	for (;;) {
 		/* Any allowed, online CPU? */
-		for_each_cpu(dest_cpu, p->cpus_ptr) {
+		for_each_cpu_andnot(dest_cpu, p->cpus_ptr, cpu_parked_mask) {
 			if (!is_cpu_allowed(p, dest_cpu))
 				continue;
 
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 3/5] sched/fair: Don't use parked cpu for load balancing
  2025-05-23 18:14 [RFC PATCH 0/5] sched: cpu parked and push current task mechanism Shrikanth Hegde
  2025-05-23 18:14 ` [RFC PATCH 1/5] cpumask: Introduce cpu parked mask Shrikanth Hegde
  2025-05-23 18:14 ` [RFC PATCH 2/5] sched/core: Don't use parked cpu for selection Shrikanth Hegde
@ 2025-05-23 18:14 ` Shrikanth Hegde
  2025-05-23 18:14 ` [RFC PATCH 4/5] sched/core: Push current task when cpu is parked Shrikanth Hegde
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Shrikanth Hegde @ 2025-05-23 18:14 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
	maddy
  Cc: sshegde, vschneid, dietmar.eggemann, rostedt, jstultz,
	kprateek.nayak, huschle, srikar, linux-kernel, linux

While doing load balance, don't consider parked CPUs. As far as load
balancing is concerned, a parked CPU is as good as an offline CPU.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 kernel/sched/fair.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 125912c0e9dd..f48f55ca1522 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11761,6 +11761,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 	};
 
 	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
+	cpumask_andnot(cpus, cpus, cpu_parked_mask);
 
 	schedstat_inc(sd->lb_count[idle]);
 
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH 4/5] sched/core: Push current task when cpu is parked
  2025-05-23 18:14 [RFC PATCH 0/5] sched: cpu parked and push current task mechanism Shrikanth Hegde
                   ` (2 preceding siblings ...)
  2025-05-23 18:14 ` [RFC PATCH 3/5] sched/fair: Don't use parked cpu for load balancing Shrikanth Hegde
@ 2025-05-23 18:14 ` Shrikanth Hegde
  2025-05-23 18:14 ` [DEBUG PATCH 5/5] powerpc: Use manual hint for cpu parking Shrikanth Hegde
  2025-05-27 15:10 ` [RFC PATCH 0/5] sched: cpu parked and push current task mechanism Peter Zijlstra
  5 siblings, 0 replies; 15+ messages in thread
From: Shrikanth Hegde @ 2025-05-23 18:14 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
	maddy
  Cc: sshegde, vschneid, dietmar.eggemann, rostedt, jstultz,
	kprateek.nayak, huschle, srikar, linux-kernel, linux

When a CPU becomes parked, all tasks present on that CPU should vacate
it. Use the existing __balance_push_cpu_stop mechanism to move out the
currently running task.

Which CPUs need to be parked is decided by the architecture.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
Note: Maybe this could be done only for CFS and EXT tasks if it is not
recommended for RT, DL etc.

 kernel/sched/core.c  | 39 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  1 +
 2 files changed, 40 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9ec12f9b3b08..dd8e824bc030 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5656,6 +5656,10 @@ void sched_tick(void)
 
 	sched_clock_tick();
 
+	/* push the current task out if cpu is parked */
+	if (cpu_parked(cpu))
+		push_current_task(rq);
+
 	rq_lock(rq, &rf);
 	donor = rq->donor;
 
@@ -8482,6 +8486,41 @@ void __init sched_init_smp(void)
 }
 #endif /* CONFIG_SMP */
 
+#if defined(CONFIG_SMP) && defined(CONFIG_HOTPLUG_CPU)
+static DEFINE_PER_CPU(struct cpu_stop_work, push_task_work);
+
+/* A CPU is parked when the underlying physical CPU is not available.
+ * Scheduling on such a CPU is going to cause OS preemption.
+ * In case any task is scheduled on such a CPU, move it out. In
+ * select_fallback_rq() a non-parked CPU will be chosen and henceforth
+ * the task shouldn't come back to this CPU.
+ */
+void push_current_task(struct rq *rq)
+{
+	struct task_struct *push_task = rq->curr;
+	unsigned long flags;
+
+	/* idle task can't be pushed out */
+	if (rq->curr == rq->idle || !cpu_parked(rq->cpu))
+		return;
+
+	if (kthread_is_per_cpu(push_task) ||
+	    is_migration_disabled(push_task))
+		return;
+
+	local_irq_save(flags);
+	get_task_struct(push_task);
+	preempt_disable();
+
+	stop_one_cpu_nowait(rq->cpu, __balance_push_cpu_stop, push_task,
+			    this_cpu_ptr(&push_task_work));
+	preempt_enable();
+	local_irq_restore(flags);
+}
+#else
+void push_current_task(struct rq *rq) { }
+#endif
+
 int in_sched_functions(unsigned long addr)
 {
 	return in_lock_functions(addr) ||
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c5a6a503eb6d..86bcd9401d41 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -104,6 +104,7 @@ extern void calc_global_load_tick(struct rq *this_rq);
 extern long calc_load_fold_active(struct rq *this_rq, long adjust);
 
 extern void call_trace_sched_update_nr_running(struct rq *rq, int count);
+void push_current_task(struct rq *rq);
 
 extern int sysctl_sched_rt_period;
 extern int sysctl_sched_rt_runtime;
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [DEBUG PATCH 5/5] powerpc: Use manual hint for cpu parking
  2025-05-23 18:14 [RFC PATCH 0/5] sched: cpu parked and push current task mechanism Shrikanth Hegde
                   ` (3 preceding siblings ...)
  2025-05-23 18:14 ` [RFC PATCH 4/5] sched/core: Push current task when cpu is parked Shrikanth Hegde
@ 2025-05-23 18:14 ` Shrikanth Hegde
  2025-05-27 15:10 ` [RFC PATCH 0/5] sched: cpu parked and push current task mechanism Peter Zijlstra
  5 siblings, 0 replies; 15+ messages in thread
From: Shrikanth Hegde @ 2025-05-23 18:14 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, tglx, yury.norov,
	maddy
  Cc: sshegde, vschneid, dietmar.eggemann, rostedt, jstultz,
	kprateek.nayak, huschle, srikar, linux-kernel, linux

Use debugfs to provide the hint. Depending on the system configuration
one needs to decide the number. vp - virtual processor or vCPU.

For example, writing 40 to vp_manual_hint (echo 40 > vp_manual_hint)
means the scheduler is supposed to use only vCPUs 0-39. By default,
vp_manual_hint is set to all possible CPUs and it has to be at least 1.

This is for illustration only, not meant to be merged. One can modify it
as per their arch.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 arch/powerpc/kernel/smp.c | 45 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5ac7084eebc0..37eb6aa71613 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -64,6 +64,7 @@
 #include <asm/systemcfg.h>
 
 #include <trace/events/ipi.h>
+#include <linux/debugfs.h>
 
 #ifdef DEBUG
 #include <asm/udbg.h>
@@ -82,6 +83,7 @@ bool has_big_cores __ro_after_init;
 bool coregroup_enabled __ro_after_init;
 bool thread_group_shares_l2 __ro_after_init;
 bool thread_group_shares_l3 __ro_after_init;
+static int vp_manual_hint = NR_CPUS;
 
 DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
@@ -1727,6 +1729,7 @@ static void __init build_sched_topology(void)
 	BUG_ON(i >= ARRAY_SIZE(powerpc_topology) - 1);
 
 	set_sched_topology(powerpc_topology);
+	vp_manual_hint = num_present_cpus();
 }
 
 void __init smp_cpus_done(unsigned int max_cpus)
@@ -1807,4 +1810,46 @@ void __noreturn arch_cpu_idle_dead(void)
 	start_secondary_resume();
 }
 
+/*
+ * debugfs hint to mark CPUs as parked. This helps restrict the
+ * workload to the specified number of CPUs.
+ * For example, writing 40 to vp_manual_hint means the workload
+ * will run on CPUs 0-39.
+ */
+
+static int pv_vp_manual_hint_set(void *data, u64 val)
+{
+	int cpu;
+
+	if (val == 0 || val > num_present_cpus())
+		val = num_present_cpus();
+
+	if (val != vp_manual_hint)
+		vp_manual_hint = val;
+
+	for_each_present_cpu(cpu) {
+		if (cpu >= vp_manual_hint)
+			set_cpu_parked(cpu, true);
+		else
+			set_cpu_parked(cpu, false);
+	}
+	return 0;
+}
+
+static int pv_vp_manual_hint_get(void *data, u64 *val)
+{
+	*val = vp_manual_hint;
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_pv_vp_manual_hint, pv_vp_manual_hint_get, pv_vp_manual_hint_set, "%llu\n");
+
+static __init int paravirt_debugfs_init(void)
+{
+	if (is_shared_processor())
+		debugfs_create_file("vp_manual_hint", 0600, arch_debugfs_dir, NULL, &fops_pv_vp_manual_hint);
+	return 0;
+}
+
+device_initcall(paravirt_debugfs_init);
 #endif
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 2/5] sched/core: Don't use parked cpu for selection
  2025-05-23 18:14 ` [RFC PATCH 2/5] sched/core: Don't use parked cpu for selection Shrikanth Hegde
@ 2025-05-27 14:59   ` Yury Norov
  2025-05-27 17:35     ` Shrikanth Hegde
  0 siblings, 1 reply; 15+ messages in thread
From: Yury Norov @ 2025-05-27 14:59 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, maddy, vschneid,
	dietmar.eggemann, rostedt, jstultz, kprateek.nayak, huschle,
	srikar, linux-kernel, linux

On Fri, May 23, 2025 at 11:44:45PM +0530, Shrikanth Hegde wrote:
> When the current running task is pushed using stop class mechanism, the
> new CPU that going to be chosen shouldn't be a parked CPU. 
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
>  kernel/sched/core.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 62b3416f5e43..9ec12f9b3b08 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3526,7 +3526,7 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>  		nodemask = cpumask_of_node(nid);
>  
>  		/* Look for allowed, online CPU in same node. */
> -		for_each_cpu(dest_cpu, nodemask) {
> +		for_each_cpu_andnot(dest_cpu, nodemask, cpu_parked_mask) {
>  			if (is_cpu_allowed(p, dest_cpu))
>  				return dest_cpu;
>  		}
> @@ -3534,7 +3534,7 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>  
>  	for (;;) {
>  		/* Any allowed, online CPU? */
> -		for_each_cpu(dest_cpu, p->cpus_ptr) {
> +		for_each_cpu_andnot(dest_cpu, p->cpus_ptr, cpu_parked_mask) {
>  			if (!is_cpu_allowed(p, dest_cpu))
>  				continue;

You test for online and dying CPUs in is_cpu_allowed(). Why is this
new 'parked' creature different?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 1/5] cpumask: Introduce cpu parked mask
  2025-05-23 18:14 ` [RFC PATCH 1/5] cpumask: Introduce cpu parked mask Shrikanth Hegde
@ 2025-05-27 15:06   ` Yury Norov
  2025-06-23  8:10     ` Shrikanth Hegde
  0 siblings, 1 reply; 15+ messages in thread
From: Yury Norov @ 2025-05-27 15:06 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, maddy, vschneid,
	dietmar.eggemann, rostedt, jstultz, kprateek.nayak, huschle,
	srikar, linux-kernel, linux

On Fri, May 23, 2025 at 11:44:44PM +0530, Shrikanth Hegde wrote:
> CPU is said to be parked, when underlying physical CPU is not 
> available. This happens when there is contention for CPU resource in
> para-virtualized case. One should avoid using these CPUs. 
> 
> Build and maintain this state of parked CPUs. Scheduler will use this
> information and push the tasks out as soon as it can. 

This 'parked' term sounds pretty obscured. Maybe name it in
a positive sense, and more explicit, like cpu_paravirt_mask.

Also, shouldn't this be conditional on CONFIG_PARAVIRT?

Thanks,
Yury
 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
> - Not sure if __read_mostly attribute suits for cpu_parked 
> since it can change often. Since often means a few mins, it is long time
> from scheduler perspective, hence kept it. 
> 
>  include/linux/cpumask.h | 14 ++++++++++++++
>  kernel/cpu.c            |  3 +++
>  2 files changed, 17 insertions(+)
> 
> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index 6a569c7534db..501848303800 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -84,6 +84,7 @@ static __always_inline void set_nr_cpu_ids(unsigned int nr)
>   *     cpu_enabled_mask - has bit 'cpu' set iff cpu can be brought online
>   *     cpu_online_mask  - has bit 'cpu' set iff cpu available to scheduler
>   *     cpu_active_mask  - has bit 'cpu' set iff cpu available to migration
> + *     cpu_parked_mask  - has bit 'cpu' set iff cpu is parked
>   *
>   *  If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
>   *
> @@ -93,6 +94,11 @@ static __always_inline void set_nr_cpu_ids(unsigned int nr)
>   *  representing which CPUs are currently plugged in.  And
>   *  cpu_online_mask is the dynamic subset of cpu_present_mask,
>   *  indicating those CPUs available for scheduling.
> + *
> + *  A CPU is said to be parked when underlying physical CPU(pCPU) is not
> + *  available at the moment. It is recommended not to run any workload on
> + *  that CPU.
> +
>   *
>   *  If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
>   *  depending on what ACPI reports as currently plugged in, otherwise
> @@ -118,12 +124,14 @@ extern struct cpumask __cpu_enabled_mask;
>  extern struct cpumask __cpu_present_mask;
>  extern struct cpumask __cpu_active_mask;
>  extern struct cpumask __cpu_dying_mask;
> +extern struct cpumask __cpu_parked_mask;
>  #define cpu_possible_mask ((const struct cpumask *)&__cpu_possible_mask)
>  #define cpu_online_mask   ((const struct cpumask *)&__cpu_online_mask)
>  #define cpu_enabled_mask   ((const struct cpumask *)&__cpu_enabled_mask)
>  #define cpu_present_mask  ((const struct cpumask *)&__cpu_present_mask)
>  #define cpu_active_mask   ((const struct cpumask *)&__cpu_active_mask)
>  #define cpu_dying_mask    ((const struct cpumask *)&__cpu_dying_mask)
> +#define cpu_parked_mask    ((const struct cpumask *)&__cpu_parked_mask)
>  
>  extern atomic_t __num_online_cpus;
>  
> @@ -1146,6 +1154,7 @@ void init_cpu_possible(const struct cpumask *src);
>  #define set_cpu_present(cpu, present)	assign_cpu((cpu), &__cpu_present_mask, (present))
>  #define set_cpu_active(cpu, active)	assign_cpu((cpu), &__cpu_active_mask, (active))
>  #define set_cpu_dying(cpu, dying)	assign_cpu((cpu), &__cpu_dying_mask, (dying))
> +#define set_cpu_parked(cpu, parked)    assign_cpu((cpu), &__cpu_parked_mask, (parked))
>  
>  void set_cpu_online(unsigned int cpu, bool online);
>  
> @@ -1235,6 +1244,11 @@ static __always_inline bool cpu_dying(unsigned int cpu)
>  	return cpumask_test_cpu(cpu, cpu_dying_mask);
>  }
>  
> +static __always_inline bool cpu_parked(unsigned int cpu)
> +{
> +	return cpumask_test_cpu(cpu, cpu_parked_mask);
> +}
> +
>  #else
>  
>  #define num_online_cpus()	1U
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index a59e009e0be4..532fbfbe3226 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -3110,6 +3110,9 @@ EXPORT_SYMBOL(__cpu_dying_mask);
>  atomic_t __num_online_cpus __read_mostly;
>  EXPORT_SYMBOL(__num_online_cpus);
>  
> +struct cpumask __cpu_parked_mask __read_mostly;
> +EXPORT_SYMBOL(__cpu_parked_mask);
> +
>  void init_cpu_present(const struct cpumask *src)
>  {
>  	cpumask_copy(&__cpu_present_mask, src);
> -- 
> 2.39.3

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/5] sched: cpu parked and push current task mechanism
  2025-05-23 18:14 [RFC PATCH 0/5] sched: cpu parked and push current task mechanism Shrikanth Hegde
                   ` (4 preceding siblings ...)
  2025-05-23 18:14 ` [DEBUG PATCH 5/5] powerpc: Use manual hint for cpu parking Shrikanth Hegde
@ 2025-05-27 15:10 ` Peter Zijlstra
  2025-05-27 15:47   ` Yury Norov
  5 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2025-05-27 15:10 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: mingo, juri.lelli, vincent.guittot, tglx, yury.norov, maddy,
	vschneid, dietmar.eggemann, rostedt, jstultz, kprateek.nayak,
	huschle, srikar, linux-kernel, linux

On Fri, May 23, 2025 at 11:44:43PM +0530, Shrikanth Hegde wrote:
> In a para-virtualised environment, there could be multiple
> overcommitted VMs. i.e sum of virtual CPUs(vCPU) > physical CPU(pCPU). 
> When all such VMs request for cpu cycles at the same, it is not possible
> to serve all of them. This leads to VM level preemptions and hence the
> steal time. 
> 
> Bring the notion of CPU parked state which implies underlying pCPU may
> not be available for use at this time. This means it is better to avoid
> this vCPU. So when a CPU is marked as parked, one should vacate it as
> soon as it can. So it is going to dynamic at runtime and can change
> often.

You've lost me here already. Why would pCPU not be available? Simply
because it is running another vCPU? I would say this means the pCPU is
available, its just doing something else.

Not available to me means it is going offline or something like that.

> In general, task level preemption(driven by VM) is less expensive than VM
> level preemption(driven by hypervisor). So pack to less CPUs helps to
> improve the overall workload throughput/latency. 

This seems to suggest you're 'parking' vCPUs, while above you seemed to
suggest pCPU. More confusion.

> cpu parking and need for cpu parking has been explained here as well [1]. Much
> of the context explained in the cover letter there applies to this
> problem context as well. 
> [1]: https://lore.kernel.org/all/20250512115325.30022-1-huschle@linux.ibm.com/

Yeah, totally not following any of that either :/


Mostly I have only confusion and no idea what you're actually wanting to
do.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/5] sched: cpu parked and push current task mechanism
  2025-05-27 15:10 ` [RFC PATCH 0/5] sched: cpu parked and push current task mechanism Peter Zijlstra
@ 2025-05-27 15:47   ` Yury Norov
  2025-05-27 17:30     ` Shrikanth Hegde
  0 siblings, 1 reply; 15+ messages in thread
From: Yury Norov @ 2025-05-27 15:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Shrikanth Hegde, mingo, juri.lelli, vincent.guittot, tglx, maddy,
	vschneid, dietmar.eggemann, rostedt, jstultz, kprateek.nayak,
	huschle, srikar, linux-kernel, linux

On Tue, May 27, 2025 at 05:10:20PM +0200, Peter Zijlstra wrote:
> On Fri, May 23, 2025 at 11:44:43PM +0530, Shrikanth Hegde wrote:
> > In a para-virtualised environment, there could be multiple
> > overcommitted VMs. i.e sum of virtual CPUs(vCPU) > physical CPU(pCPU). 
> > When all such VMs request for cpu cycles at the same, it is not possible
> > to serve all of them. This leads to VM level preemptions and hence the
> > steal time. 
> > 
> > Bring the notion of CPU parked state which implies underlying pCPU may
> > not be available for use at this time. This means it is better to avoid
> > this vCPU. So when a CPU is marked as parked, one should vacate it as
> > soon as it can. So it is going to dynamic at runtime and can change
> > often.
> 
> You've lost me here already. Why would pCPU not be available? Simply
> because it is running another vCPU? I would say this means the pCPU is
> available, its just doing something else.
> 
> Not available to me means it is going offline or something like that.
> 
> > In general, task level preemption(driven by VM) is less expensive than VM
> > level preemption(driven by hypervisor). So pack to less CPUs helps to
> > improve the overall workload throughput/latency. 
> 
> This seems to suggest you're 'parking' vCPUs, while above you seemed to
> suggest pCPU. More confusion.
> 
> > cpu parking and need for cpu parking has been explained here as well [1]. Much
> > of the context explained in the cover letter there applies to this
> > problem context as well. 
> > [1]: https://lore.kernel.org/all/20250512115325.30022-1-huschle@linux.ibm.com/
> 
> Yeah, totally not following any of that either :/
> 
> 
> Mostly I have only confusion and no idea what you're actually wanting to
> do.

My wild guess is that the idea is to not preempt the pCPU while running
a particular vCPU workload. But I agree, this should all be reworded and
explained better. I didn't understand this, either.

Thanks,
Yury

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 0/5] sched: cpu parked and push current task mechanism
  2025-05-27 15:47   ` Yury Norov
@ 2025-05-27 17:30     ` Shrikanth Hegde
  2025-06-02  4:25       ` Shrikanth Hegde
  2025-06-02 14:22       ` Tobias Huschle
  0 siblings, 2 replies; 15+ messages in thread
From: Shrikanth Hegde @ 2025-05-27 17:30 UTC (permalink / raw)
  To: Yury Norov, Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, tglx, maddy, vschneid,
	dietmar.eggemann, rostedt, jstultz, kprateek.nayak, huschle,
	srikar, linux-kernel, linux


Hi Peter, Yury.

Thanks for taking a look at this series.


On 5/27/25 21:17, Yury Norov wrote:
> On Tue, May 27, 2025 at 05:10:20PM +0200, Peter Zijlstra wrote:
>> On Fri, May 23, 2025 at 11:44:43PM +0530, Shrikanth Hegde wrote:
>>> In a para-virtualised environment, there could be multiple
>>> overcommitted VMs. i.e sum of virtual CPUs(vCPU) > physical CPU(pCPU).
>>> When all such VMs request for cpu cycles at the same, it is not possible
>>> to serve all of them. This leads to VM level preemptions and hence the
>>> steal time.
>>>
>>> Bring the notion of CPU parked state which implies underlying pCPU may
>>> not be available for use at this time. This means it is better to avoid
>>> this vCPU. So when a CPU is marked as parked, one should vacate it as
>>> soon as it can. So it is going to dynamic at runtime and can change
>>> often.
>>
>> You've lost me here already. Why would pCPU not be available? Simply
>> because it is running another vCPU? I would say this means the pCPU is
>> available, its just doing something else.
>>
>> Not available to me means it is going offline or something like that.
>>
>>> In general, task level preemption(driven by VM) is less expensive than VM
>>> level preemption(driven by hypervisor). So pack to less CPUs helps to
>>> improve the overall workload throughput/latency.
>>
>> This seems to suggest you're 'parking' vCPUs, while above you seemed to
>> suggest pCPU. More confusion.

Yes. I meant parking of vCPUs only. pCPU is running one of those vCPU at any point in time.

>>
>>> cpu parking and need for cpu parking has been explained here as well [1]. Much
>>> of the context explained in the cover letter there applies to this
>>> problem context as well.
>>> [1]: https://lore.kernel.org/all/20250512115325.30022-1-huschle@linux.ibm.com/
>>
>> Yeah, totally not following any of that either :/
>>
>>
>> Mostly I have only confusion and no idea what you're actually wanting to
>> do.
> 
> My wild guess is that the idea is to not preempt the pCPU while running
> a particular vCPU workload. But I agree, this should all be reworded and
> explained better. I didn't understand this, either.
> 
> Thanks,
> YUry

Apologies for not explaining it clearly. My bad.
Let me take a shot at it again:

----------------------------

vCPU - virtual CPU - a CPU as seen inside the VM.
pCPU - physical CPU - a CPU on the bare-metal host.

A hypervisor manages these vCPUs from different VMs. When a vCPU requests CPU time, the hypervisor
does the job of scheduling it on a pCPU.

The issue occurs when there are more vCPUs (combined across all VMs) than pCPUs. When *all* vCPUs are
requesting CPU time, the hypervisor can only run a few of them and the rest are preempted (left waiting for a pCPU).


Take two VMs: when the hypervisor preempts a vCPU of VM1 to run a vCPU of VM2, it has to
save/restore VM context. If instead the VMs can coordinate among each other and request a *limited* number of vCPUs,
that overhead is avoided and the context switching happens within a vCPU (less expensive). Even if the hypervisor
preempts one vCPU to run another within the same VM, it is still more expensive than a task preemption within
the vCPU. So the *basic* aim is to avoid vCPU preemption.


To achieve this, we use this parking concept (we need a better name for sure), where it is better
if workloads avoid some vCPUs at the moment. (The vCPUs stay online; we don't want the overhead of a sched domain rebuild.)


Contention is dynamic in nature. Whether there is contention for pCPUs is to be detected and determined
by the architecture; archs need to update the mask regularly.

When there is contention, use only the limited set of vCPUs indicated by the arch.
When there is no contention, use all vCPUs.
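To make the intended behaviour concrete, here is a rough userspace sketch (plain C, with a 64-bit word standing in for cpumask_t; all names here are illustrative, not the actual code in the series):

```c
#include <stdint.h>

/* Toy model of the parked mask: bit i set means vCPU i should be
 * vacated for now. The arch flips bits as pCPU contention comes and
 * goes; the scheduler side only ever reads the mask. */
uint64_t parked_mask;

/* Arch hook: derive the mask from steal time / hypervisor hints. */
void arch_update_parked(uint64_t mask)
{
	parked_mask = mask;
}

/* Pick the lowest allowed, non-parked CPU. If every allowed CPU is
 * parked, fall back to the allowed set anyway - running on a parked
 * vCPU beats not running at all. Returns -1 if nothing is allowed. */
int pick_cpu(uint64_t allowed)
{
	uint64_t usable = allowed & ~parked_mask;

	if (!usable)
		usable = allowed;	/* better parked than nowhere */
	for (int cpu = 0; cpu < 64; cpu++)
		if (usable & (1ULL << cpu))
			return cpu;
	return -1;
}
```

When the arch clears bits again (contention gone), tasks naturally spread back over all vCPUs.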



* Re: [RFC PATCH 2/5] sched/core: Don't use parked cpu for selection
  2025-05-27 14:59   ` Yury Norov
@ 2025-05-27 17:35     ` Shrikanth Hegde
  0 siblings, 0 replies; 15+ messages in thread
From: Shrikanth Hegde @ 2025-05-27 17:35 UTC (permalink / raw)
  To: Yury Norov
  Cc: mingo, peterz, juri.lelli, vincent.guittot, tglx, maddy, vschneid,
	dietmar.eggemann, rostedt, jstultz, kprateek.nayak, huschle,
	srikar, linux-kernel, linux



On 5/27/25 20:29, Yury Norov wrote:
> On Fri, May 23, 2025 at 11:44:45PM +0530, Shrikanth Hegde wrote:
>> When the current running task is pushed using stop class mechanism, the
>> new CPU that going to be chosen shouldn't be a parked CPU.
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>>   kernel/sched/core.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 62b3416f5e43..9ec12f9b3b08 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3526,7 +3526,7 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>>   		nodemask = cpumask_of_node(nid);
>>   
>>   		/* Look for allowed, online CPU in same node. */
>> -		for_each_cpu(dest_cpu, nodemask) {
>> +		for_each_cpu_andnot(dest_cpu, nodemask, cpu_parked_mask) {
>>   			if (is_cpu_allowed(p, dest_cpu))
>>   				return dest_cpu;
>>   		}
>> @@ -3534,7 +3534,7 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
>>   
>>   	for (;;) {
>>   		/* Any allowed, online CPU? */
>> -		for_each_cpu(dest_cpu, p->cpus_ptr) {
>> +		for_each_cpu_andnot(dest_cpu, p->cpus_ptr, cpu_parked_mask) {
>>   			if (!is_cpu_allowed(p, dest_cpu))
>>   				continue;
> 
> You test for online and dying CPUs in the is_cpu_allowed(). Why this
> new 'parked' creature is different?

Agreed. Let me try to move that logic into is_cpu_allowed().
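For reference, the shape I have in mind - folding the test into the allowed-CPU predicate so callers keep their plain loops (a standalone toy model, not kernel code; masks are 64-bit words and all names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model: fold the parked test into is_cpu_allowed() itself, the
 * way online/dying are already handled there, instead of masking at
 * every call site with for_each_cpu_andnot(). */
uint64_t online_mask = ~0ULL;
uint64_t parked_mask;

struct task {
	uint64_t cpus_ptr;	/* the task's affinity mask */
};

bool is_cpu_allowed(const struct task *p, int cpu)
{
	uint64_t bit = 1ULL << cpu;

	if (!(p->cpus_ptr & bit))
		return false;		/* affinity forbids it */
	if (!(online_mask & bit))
		return false;		/* offline */
	if (parked_mask & bit)
		return false;		/* new: parked, avoid it */
	return true;
}

/* The caller stays a plain loop - no extra mask juggling needed. */
int select_fallback_cpu(const struct task *p)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (is_cpu_allowed(p, cpu))
			return cpu;
	return -1;
}
```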



* Re: [RFC PATCH 0/5] sched: cpu parked and push current task mechanism
  2025-05-27 17:30     ` Shrikanth Hegde
@ 2025-06-02  4:25       ` Shrikanth Hegde
  2025-06-02 14:22       ` Tobias Huschle
  1 sibling, 0 replies; 15+ messages in thread
From: Shrikanth Hegde @ 2025-06-02  4:25 UTC (permalink / raw)
  To: Yury Norov, Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, tglx, maddy, vschneid,
	dietmar.eggemann, rostedt, jstultz, kprateek.nayak, huschle,
	srikar, linux-kernel, linux



Hi.

> 
> ----------------------------
> 
> vCPU - Virtual CPUs - CPU in VM world.
> pCPU - Physical CPUs - CPU in baremetal world.
> 
> A hypervisor is managing these vCPUs from different VMs. When a vCPU 
> requests for CPU, hypervisor does the job
> of scheduling them on a pCPU.
> 
> So this issue occurs when there are more vCPUs(combined across all VMs) 
> than the pCPU. So when *all* vCPUs are
> requesting for CPUs, hypervisor can only run a few of them and remaining 
> will be preempted(waiting for pCPU).
> 
> 
> If we take two VM's, When hypervisor preempts vCPU from VM1 to run vCPU 
> from VM2, it has to do
> save/restore VM context.Instead if VM's can co-ordinate among each other 
> and request for *limited*  vCPUs,
> it avoids the above overhead and there is context switching within 
> vCPU(less expensive). Even if hypervisor
> is preempting one vCPU to run another withing the same VM, it is still 
> more expensive than the task preemption within
> the vCPU. So *basic* aim to avoid vCPU preemption.
> 
> 
> So to achieve this, use this parking(we need better name for sure) 
> concept, where it is better
> if workloads avoid some vCPUs at this moment. (vCPUs stays online, we 
> don't want the overhead of sched domain rebuild).
> 
> 
> contention is dynamic in nature. When there is contention for pCPU is to 
> be detected and determined
> by architecture. Archs needs to update the mask regularly.
> 
> When there is contention, use limited vCPUs as indicated by arch.
> When there is no contention, use all vCPUs.
> 

I hope this helped set the problem context. I am trying to get feedback on whether the approach makes sense.
I will go through the other push mechanisms we have (for example, in rt/dl).


* Re: [RFC PATCH 0/5] sched: cpu parked and push current task mechanism
  2025-05-27 17:30     ` Shrikanth Hegde
  2025-06-02  4:25       ` Shrikanth Hegde
@ 2025-06-02 14:22       ` Tobias Huschle
  1 sibling, 0 replies; 15+ messages in thread
From: Tobias Huschle @ 2025-06-02 14:22 UTC (permalink / raw)
  To: Shrikanth Hegde, Yury Norov, Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, tglx, maddy, vschneid,
	dietmar.eggemann, rostedt, jstultz, kprateek.nayak, srikar,
	linux-kernel, linux



On 27/05/2025 19:30, Shrikanth Hegde wrote:
> 
> Hi Peter, Yury.
> 
> Thanks for taking a look at this series.
> 
> 
> On 5/27/25 21:17, Yury Norov wrote:
>> On Tue, May 27, 2025 at 05:10:20PM +0200, Peter Zijlstra wrote:
>>> On Fri, May 23, 2025 at 11:44:43PM +0530, Shrikanth Hegde wrote:
>>>> In a para-virtualised environment, there could be multiple
>>>> overcommitted VMs. i.e sum of virtual CPUs(vCPU) > physical CPU(pCPU).
>>>> When all such VMs request for cpu cycles at the same, it is not 
>>>> possible
>>>> to serve all of them. This leads to VM level preemptions and hence the
>>>> steal time.
>>>>
>>>> Bring the notion of CPU parked state which implies underlying pCPU may
>>>> not be available for use at this time. This means it is better to avoid
>>>> this vCPU. So when a CPU is marked as parked, one should vacate it as
>>>> soon as it can. So it is going to dynamic at runtime and can change
>>>> often.
>>>
>>> You've lost me here already. Why would pCPU not be available? Simply
>>> because it is running another vCPU? I would say this means the pCPU is
>>> available, its just doing something else.
>>>
>>> Not available to me means it is going offline or something like that.
>>>
>>>> In general, task level preemption(driven by VM) is less expensive 
>>>> than VM
>>>> level preemption(driven by hypervisor). So pack to less CPUs helps to
>>>> improve the overall workload throughput/latency.
>>>
>>> This seems to suggest you're 'parking' vCPUs, while above you seemed to
>>> suggest pCPU. More confusion.
> 
> Yes. I meant parking of vCPUs only. pCPU is running one of those vCPU at 
> any point in time.
> 
>>>
>>>> cpu parking and need for cpu parking has been explained here as well 
>>>> [1]. Much
>>>> of the context explained in the cover letter there applies to this
>>>> problem context as well.
>>>> [1]: https://lore.kernel.org/all/20250512115325.30022-1- 
>>>> huschle@linux.ibm.com/
>>>
>>> Yeah, totally not following any of that either :/
>>>
>>>
>>> Mostly I have only confusion and no idea what you're actually wanting to
>>> do.
>>
>> My wild guess is that the idea is to not preempt the pCPU while running
>> a particular vCPU workload. But I agree, this should all be reworded and
>> explained better. I didn't understand this, either.
>>
>> Thanks,
>> YUry
> 
> Sorry, Apologies for not explaining it clearly. My bad.
> Let me take a shot at it again:
> 
> ----------------------------
> 
> vCPU - Virtual CPUs - CPU in VM world.
> pCPU - Physical CPUs - CPU in baremetal world.
> 
> A hypervisor is managing these vCPUs from different VMs. When a vCPU 
> requests for CPU, hypervisor does the job
> of scheduling them on a pCPU.
> 
> So this issue occurs when there are more vCPUs(combined across all VMs) 
> than the pCPU. So when *all* vCPUs are
> requesting for CPUs, hypervisor can only run a few of them and remaining 
> will be preempted(waiting for pCPU).
> 
> 
> If we take two VM's, When hypervisor preempts vCPU from VM1 to run vCPU 
> from VM2, it has to do
> save/restore VM context.Instead if VM's can co-ordinate among each other 
> and request for *limited*  vCPUs,
> it avoids the above overhead and there is context switching within 
> vCPU(less expensive). Even if hypervisor
> is preempting one vCPU to run another withing the same VM, it is still 
> more expensive than the task preemption within
> the vCPU. So *basic* aim to avoid vCPU preemption.
> 

There is a dilemma for the hypervisor scheduler, as it does not have 
many good indicators for when to preempt a vCPU in favor of another one.

Assume we have a hypervisor facing high load, hosting, among others, 1 VM 
with 2 vCPUs running 2 tasks. Naturally, the scheduler in the VM places 
each task on one of the vCPUs.

Assume further that, due to the high load, the hypervisor scheduler 
cannot schedule both vCPUs at the same time consistently. This means 
that the hypervisor scheduler now decides which of the 2 tasks gets to run.

The scheduler in the VM, on the other hand, has better insight into which 
of the two tasks should execute. If the hypervisor can guarantee the 
guest that certain vCPUs will be granted runtime on pCPUs consistently, 
the VM scheduler has a clear expectation of the availability of its vCPUs 
and can make use of that information.

Essentially, we avoid forcing the hypervisor scheduler to take decisions 
which it does not have good information on. We'd rather let the VM 
scheduler take those decisions.

This requires, of course, that the hypervisor can give the VM a somewhat 
accurate estimate of how many CPUs it can safely use. It should be able 
to do so, as it knows the overall load on the system, which the VM does 
not necessarily know.
A naive approach would be to just divide the available pCPUs by the 
number of VMs. The more interesting part will be deriving how many 
vCPUs can be overconsumed when other VMs are underconsuming.
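As a toy illustration of that naive split plus redistribution (just arithmetic, not an actual hypervisor policy; function and parameter names are made up):

```c
/* demand[i] = number of vCPUs VM i wants to run right now.
 * Base share: pcpus / nr_vms. Capacity that underconsuming VMs leave
 * unused is split evenly among the VMs that want more than their
 * share. A real policy would be far more dynamic than this. */
int usable_vcpus(int pcpus, int nr_vms, const int *demand, int vm)
{
	int base = pcpus / nr_vms;
	int spare = 0, hungry = 0;

	for (int i = 0; i < nr_vms; i++) {
		if (demand[i] < base)
			spare += base - demand[i];	/* underconsumer */
		else if (demand[i] > base)
			hungry++;			/* wants extra */
	}
	if (demand[vm] <= base)
		return demand[vm];	/* take only what it needs */

	/* Overconsumer: base share plus a slice of the spare capacity. */
	int extra = hungry ? spare / hungry : 0;
	int grant = base + extra;
	return grant < demand[vm] ? grant : demand[vm];
}
```

E.g. with 8 pCPUs and two VMs demanding 2 and 6 vCPUs, the idle capacity of the first lets the second run all 6 without forcing the hypervisor to preempt anything.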

In the end, both layers, hypervisor and VM, would take decisions which 
they have accurate information on. The VM scheduler knows about its 
tasks, the hypervisor knows about the overall system load.

I played around with the concept quite a bit, especially when there is a 
lot of load on the VM itself and it tries to squeeze in short-running 
networking operations.
In that case the VM scheduler can make better decisions if it runs fewer 
vCPUs but stays in full control, instead of relying on the hypervisor 
scheduler to schedule all its vCPUs.

> 
> So to achieve this, use this parking(we need better name for sure) 
> concept, where it is better
> if workloads avoid some vCPUs at this moment. (vCPUs stays online, we 
> don't want the overhead of sched domain rebuild).
> 
> 
> contention is dynamic in nature. When there is contention for pCPU is to 
> be detected and determined
> by architecture. Archs needs to update the mask regularly.
> 
> When there is contention, use limited vCPUs as indicated by arch.
> When there is no contention, use all vCPUs.
> 

The patches work as expected on s390.


* Re: [RFC PATCH 1/5] cpumask: Introduce cpu parked mask
  2025-05-27 15:06   ` Yury Norov
@ 2025-06-23  8:10     ` Shrikanth Hegde
  0 siblings, 0 replies; 15+ messages in thread
From: Shrikanth Hegde @ 2025-06-23  8:10 UTC (permalink / raw)
  To: Yury Norov, peterz
  Cc: mingo, juri.lelli, vincent.guittot, tglx, maddy, vschneid,
	dietmar.eggemann, rostedt, jstultz, kprateek.nayak, huschle,
	srikar, linux-kernel, linux



On 5/27/25 20:36, Yury Norov wrote:
> On Fri, May 23, 2025 at 11:44:44PM +0530, Shrikanth Hegde wrote:
>> CPU is said to be parked, when underlying physical CPU is not
>> available. This happens when there is contention for CPU resource in
>> para-virtualized case. One should avoid using these CPUs.
>>
>> Build and maintain this state of parked CPUs. Scheduler will use this
>> information and push the tasks out as soon as it can.
> 
> This 'parked' term sounds pretty obscured. Maybe name it in
> a positive sense, and more explicit, like cpu_paravirt_mask.
> 

I still don't know a better name. Maybe something like cpu_avoid_mask and
cpu_avoid(cpu)? I would like to retain the notion that these CPUs shouldn't be used
at the moment.

> Also, shouldn't this be conditional on CONFIG_PARAVIRT?
> 

I moved the uses of it under a static key, which should make it a nop for
others. That keeps the code relatively simple. Hope that's ok.
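Roughly this pattern, modelled in plain C (a bool stands in for the jump label; the real code would use static_branch_unlikely(), and all names here are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the static key: until some arch actually parks a
 * CPU, the check short-circuits to false, so bare metal and archs
 * without the paravirt hint pay (almost) nothing for it. */
bool parked_key_enabled;	/* flipped once, like a jump label */
uint64_t parked_mask;

void arch_set_parked(uint64_t mask)
{
	parked_mask = mask;
	if (mask)
		parked_key_enabled = true;	/* one-way enable */
}

bool cpu_parked(int cpu)
{
	if (!parked_key_enabled)	/* the would-be nop path */
		return false;
	return parked_mask & (1ULL << cpu);
}
```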

> Thanks,
> Yury
>   


end of thread, other threads:[~2025-06-23  8:10 UTC | newest]

Thread overview: 15+ messages
2025-05-23 18:14 [RFC PATCH 0/5] sched: cpu parked and push current task mechanism Shrikanth Hegde
2025-05-23 18:14 ` [RFC PATCH 1/5] cpumask: Introduce cpu parked mask Shrikanth Hegde
2025-05-27 15:06   ` Yury Norov
2025-06-23  8:10     ` Shrikanth Hegde
2025-05-23 18:14 ` [RFC PATCH 2/5] sched/core: Don't use parked cpu for selection Shrikanth Hegde
2025-05-27 14:59   ` Yury Norov
2025-05-27 17:35     ` Shrikanth Hegde
2025-05-23 18:14 ` [RFC PATCH 3/5] sched/fair: Don't use parked cpu for load balancing Shrikanth Hegde
2025-05-23 18:14 ` [RFC PATCH 4/5] sched/core: Push current task when cpu is parked Shrikanth Hegde
2025-05-23 18:14 ` [DEBUG PATCH 5/5] powerpc: Use manual hint for cpu parking Shrikanth Hegde
2025-05-27 15:10 ` [RFC PATCH 0/5] sched: cpu parked and push current task mechanism Peter Zijlstra
2025-05-27 15:47   ` Yury Norov
2025-05-27 17:30     ` Shrikanth Hegde
2025-06-02  4:25       ` Shrikanth Hegde
2025-06-02 14:22       ` Tobias Huschle
