* [RFC PATCH 0/3] kvm,sched: Add gtime halted
@ 2025-02-18 20:26 Fernand Sieber
2025-02-18 20:26 ` [RFC PATCH 1/3] fs/proc: Add gtime halted to proc/<pid>/stat Fernand Sieber
` (3 more replies)
0 siblings, 4 replies; 12+ messages in thread
From: Fernand Sieber @ 2025-02-18 20:26 UTC (permalink / raw)
To: sieberf, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
Paolo Bonzini, x86, kvm, linux-kernel, nh-open-source
With guest hlt, pause and mwait pass through, the hypervisor loses
visibility into real guest cpu activity. From the point of view of the
host, such vcpus are always 100% active even when the guest is
completely halted.
Typically, hlt, pause and mwait pass through is only implemented on
non-timeshared pcpus. However, there are cases where this assumption
cannot be strictly met, as occasional housekeeping work needs to be
scheduled on such cpus while we generally want to preserve the pass
through performance gains. This applies to systems which don't have
dedicated cpus for housekeeping purposes.
In such cases, the hypervisor's lack of visibility is problematic
from a load balancing point of view. In the absence of a better signal,
it will preempt vcpus at random. For example, it could decide to interrupt
a vcpu doing critical idle poll work while another vcpu sits idle.
Another motivation for gaining visibility into real guest cpu activity
is to enable the hypervisor to vend metrics about it for external
consumption.
In this RFC we introduce the concept of guest halted time to address
these concerns. Guest halted time (gtime_halted) accounts for cycles
spent in guest mode while the cpu is halted. gtime_halted relies on
measuring the mperf msr register (x86) around VM enter/exits to compute
the number of unhalted cycles; halted cycles are then derived from the
tsc difference minus the mperf difference.
gtime_halted is exposed to proc/<pid>/stat as a new entry, which enables
users to monitor real guest activity.
gtime_halted is also plumbed to the scheduler infrastructure to discount
halted cycles from fair load accounting. This enlightens the load
balancer to real guest activity for better task placement.
This initial RFC has a few limitations and open questions:
* only the x86 infrastructure is supported as it relies on architecture
dependent registers. Future development will extend this to ARM.
* we assume that mperf accumulates at the same rate as tsc. While I am
not certain whether this assumption is ever violated, the spec doesn't
seem to offer this guarantee [1], so we may want to calibrate mperf.
* the sched enlightenment logic relies on periodic gtime_halted updates.
As such, it is incompatible with nohz full because this could result
in long periods of no update followed by a massive halted time update
which doesn't play well with the existing PELT integration. It is
possible to address this limitation with generalized, more complex
accounting.
[1]
https://cdrdv2.intel.com/v1/dl/getContent/671427
"The TSC, IA32_MPERF, and IA32_FIXED_CTR2 operate at close to the
maximum non-turbo frequency, which is equal to the product of scalable
bus frequency and maximum non-turbo ratio."
Fernand Sieber (3):
fs/proc: Add gtime halted to proc/<pid>/stat
kvm/x86: Add support for gtime halted
sched,x86: Make the scheduler guest unhalted aware
Documentation/filesystems/proc.rst | 1 +
arch/x86/include/asm/tsc.h | 1 +
arch/x86/kernel/tsc.c | 13 +++++++++
arch/x86/kvm/x86.c | 30 +++++++++++++++++++++
fs/proc/array.c | 7 ++++-
include/linux/sched.h | 5 ++++
include/linux/sched/signal.h | 1 +
kernel/exit.c | 1 +
kernel/fork.c | 2 +-
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 25 ++++++++++++++++++
kernel/sched/pelt.c | 42 +++++++++++++++++++++++++-----
kernel/sched/sched.h | 2 ++
13 files changed, 122 insertions(+), 9 deletions(-)
=== TESTING ===
For testing I use a host running a VM via QEMU, and I simulate host
interference via instances of stress.
The VM uses 16 vCPUs, which are pinned to pCPUs 0-15. Each vCPU is
pinned to a dedicated pCPU which follows the 'mostly non-timeshared CPU'
model.
We use QEMU's -overcommit cpu-pm=on flag to enable hlt, mwait and
pause pass through.
On the host, alongside qEMU, there are 8 stressors pinned to the same
CPUs (taskset -c 0-15 stress --cpu 8).
The VM then runs rtla on 8 cores to measure host interference. With the
enlightenment in the patch we expect the load balancer to move the
stressors to the remaining 8 idle cores and to mostly eliminate
interference.
With enlightenment:
rtla hwnoise -c 0-7 -P f:50 -p 27000 -r 26000 -d 2m -T 1000 -q --warm-up 60
Hardware-related Noise
duration: 0 00:02:00 | time is in us
CPU Period Runtime Noise % CPU Aval Max Noise Max Single HW NMI
0 #4443 115518000 0 100.00000 0 0 0 0
1 #4442 115512416 144178 99.87518 4006 4006 37 0
2 #4443 115518000 0 100.00000 0 0 0 0
3 #4443 115518000 0 100.00000 0 0 0 0
4 #4443 115518000 0 100.00000 0 0 0 0
5 #4443 115518000 0 100.00000 0 0 0 0
6 #4444 115547479 11018 99.99046 4006 4006 3 0
7 #4444 115544000 12015 99.98960 4005 4005 3 0
Baseline without patches:
rtla hwnoise -c 0-7 -P f:50 -p 27000 -r 26000 -d 2m -T 1000 -q --warm-up 60
Hardware-related Noise
duration: 0 00:02:00 | time is in us
CPU Period Runtime Noise % CPU Aval Max Noise Max Single HW NMI
0 #4171 112394904 36139505 67.84595 29015 13006 4533 0
1 #4153 111960227 38277963 65.81110 29015 13006 4748 0
2 #3882 108016483 73845612 31.63486 29017 16005 8628 0
3 #3881 108088929 73946692 31.58717 30017 14006 8636 0
4 #4177 112380299 36646487 67.39064 28018 14007 4551 0
5 #4157 112059732 37863899 66.21096 28017 13005 4689 0
6 #4166 112312643 37458217 66.64826 29016 14005 4653 0
7 #4157 112034934 36922368 67.04387 29015 14006 4609 0
--
2.43.0
Amazon Development Centre (South Africa) (Proprietary) Limited
29 Gogosoa Street, Observatory, Cape Town, Western Cape, 7925, South Africa
Registration Number: 2004 / 034463 / 07
* [RFC PATCH 1/3] fs/proc: Add gtime halted to proc/<pid>/stat
2025-02-18 20:26 [RFC PATCH 0/3] kvm,sched: Add gtime halted Fernand Sieber
@ 2025-02-18 20:26 ` Fernand Sieber
2025-02-18 20:26 ` [RFC PATCH 2/3] kvm/x86: Add support for gtime halted Fernand Sieber
` (2 subsequent siblings)
3 siblings, 0 replies; 12+ messages in thread
From: Fernand Sieber @ 2025-02-18 20:26 UTC (permalink / raw)
To: sieberf, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
Paolo Bonzini, x86, kvm, linux-kernel, nh-open-source
The hypervisor may need visibility into guest CPU activity for various
purposes, such as reporting it to monitoring systems that track the amount
of work done on behalf of a guest.
With guest hlt, pause and mwait passthrough, gtime is not useful since the
guest never tells the hypervisor that it has halted execution. The reported
guest time is therefore always 100%, even when the guest is completely halted.
Add a new concept of guest halted time that allows the hypervisor to keep
track of the number of halted cycles a CPU spends in guest mode.
The value is reported in proc/<pid>/stat and defaults to zero for architectures
that do not support it.
---
Documentation/filesystems/proc.rst | 1 +
fs/proc/array.c | 7 ++++++-
include/linux/sched.h | 1 +
include/linux/sched/signal.h | 1 +
kernel/exit.c | 1 +
kernel/fork.c | 2 +-
6 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 09f0aed5a08b..bbb230420fa4 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -386,6 +386,7 @@ It's slow but very precise.
env_end address below which program environment is placed
exit_code the thread's exit_code in the form reported by the waitpid
system call
+ gtime_halted guest time of the task while the cpu is halted, in jiffies
============= ===============================================================
The /proc/PID/maps file contains the currently mapped memory regions and
diff --git a/fs/proc/array.c b/fs/proc/array.c
index d6a0369caa93..0788ef0fa710 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -478,7 +478,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
struct mm_struct *mm;
unsigned long long start_time;
unsigned long cmin_flt, cmaj_flt, min_flt, maj_flt;
- u64 cutime, cstime, cgtime, utime, stime, gtime;
+ u64 cutime, cstime, cgtime, utime, stime, gtime, gtime_halted;
unsigned long rsslim = 0;
unsigned long flags;
int exit_code = task->exit_code;
@@ -556,12 +556,14 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
min_flt = sig->min_flt;
maj_flt = sig->maj_flt;
gtime = sig->gtime;
+ gtime_halted = sig->gtime_halted;
rcu_read_lock();
__for_each_thread(sig, t) {
min_flt += t->min_flt;
maj_flt += t->maj_flt;
gtime += task_gtime(t);
+ gtime_halted += t->gtime_halted;
}
rcu_read_unlock();
}
@@ -575,6 +577,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
min_flt = task->min_flt;
maj_flt = task->maj_flt;
gtime = task_gtime(task);
+ gtime_halted = task->gtime_halted;
}
/* scale priority and nice values from timeslices to -20..20 */
@@ -662,6 +665,8 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
else
seq_puts(m, " 0");
+ seq_put_decimal_ull(m, " ", nsec_to_clock_t(gtime_halted));
+
seq_putc(m, '\n');
if (mm)
mmput(mm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9632e3318e0d..5f6745357e20 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1087,6 +1087,7 @@ struct task_struct {
u64 stimescaled;
#endif
u64 gtime;
+ u64 gtime_halted;
struct prev_cputime prev_cputime;
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
struct vtime vtime;
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index d5d03d919df8..633082f7c7b8 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -187,6 +187,7 @@ struct signal_struct {
seqlock_t stats_lock;
u64 utime, stime, cutime, cstime;
u64 gtime;
+ u64 gtime_halted;
u64 cgtime;
struct prev_cputime prev_cputime;
unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
diff --git a/kernel/exit.c b/kernel/exit.c
index 3485e5fc499e..ba6efc6900d0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -188,6 +188,7 @@ static void __exit_signal(struct task_struct *tsk)
sig->utime += utime;
sig->stime += stime;
sig->gtime += task_gtime(tsk);
+ sig->gtime_halted += tsk->gtime_halted;
sig->min_flt += tsk->min_flt;
sig->maj_flt += tsk->maj_flt;
sig->nvcsw += tsk->nvcsw;
diff --git a/kernel/fork.c b/kernel/fork.c
index 735405a9c5f3..e3453084bb5a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2296,7 +2296,7 @@ __latent_entropy struct task_struct *copy_process(
init_sigpending(&p->pending);
- p->utime = p->stime = p->gtime = 0;
+ p->utime = p->stime = p->gtime = p->gtime_halted = 0;
#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
p->utimescaled = p->stimescaled = 0;
#endif
--
2.43.0
* [RFC PATCH 2/3] kvm/x86: Add support for gtime halted
2025-02-18 20:26 [RFC PATCH 0/3] kvm,sched: Add gtime halted Fernand Sieber
2025-02-18 20:26 ` [RFC PATCH 1/3] fs/proc: Add gtime halted to proc/<pid>/stat Fernand Sieber
@ 2025-02-18 20:26 ` Fernand Sieber
2025-02-18 20:26 ` [RFC PATCH 3/3] sched,x86: Make the scheduler guest unhalted aware Fernand Sieber
2025-02-26 2:17 ` [RFC PATCH 0/3] kvm,sched: Add gtime halted Sean Christopherson
3 siblings, 0 replies; 12+ messages in thread
From: Fernand Sieber @ 2025-02-18 20:26 UTC (permalink / raw)
To: sieberf, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
Paolo Bonzini, x86, kvm, linux-kernel, nh-open-source
The previous commit introduced the concept of guest time halted to allow
the hypervisor to track real guest CPU activity (halted cycles) with
mwait/hlt/pause pass through enabled.
This commit implements it for the x86 architecture. We track the number of
actual cycles executed by the guest by taking two reads of MSR_IA32_MPERF,
one before vcpu enter and the other after vcpu exit. These two reads happen
immediately before and after guest_timing_enter/exit_irqoff, which are the
architecture independent points for gtime accounting. The difference between
the reads corresponds to the number of unhalted cycles. We get the number
of halted cycles by subtracting the unhalted cycles from the tsc difference,
tolerating slight approximations.
---
arch/x86/include/asm/tsc.h | 1 +
arch/x86/kernel/tsc.c | 13 +++++++++++++
arch/x86/kvm/x86.c | 26 ++++++++++++++++++++++++++
3 files changed, 40 insertions(+)
diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 94408a784c8e..00ad09e7268e 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -37,6 +37,7 @@ extern void mark_tsc_async_resets(char *reason);
extern unsigned long native_calibrate_cpu_early(void);
extern unsigned long native_calibrate_tsc(void);
extern unsigned long long native_sched_clock_from_tsc(u64 tsc);
+extern unsigned long long cycles2ns(unsigned long long cycles);
extern int tsc_clocksource_reliable;
#ifdef CONFIG_X86_TSC
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 34dec0b72ea8..80bb12357148 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -144,6 +144,19 @@ static __always_inline unsigned long long cycles_2_ns(unsigned long long cyc)
return ns;
}
+unsigned long long cycles2ns(unsigned long long cyc)
+{
+ struct cyc2ns_data data;
+ unsigned long long ns;
+
+ cyc2ns_read_begin(&data);
+ ns = mul_u64_u32_shr(cyc, data.cyc2ns_mul, data.cyc2ns_shift);
+ cyc2ns_read_end();
+
+ return ns;
+}
+EXPORT_SYMBOL(cycles2ns);
+
static void __set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
{
unsigned long long ns_now;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 02159c967d29..46975b0a63a5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10688,6 +10688,19 @@ static void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
kvm_x86_call(set_apic_access_page_addr)(vcpu);
}
+static bool needs_halted_accounting(struct kvm_vcpu *vcpu)
+{
+ return (vcpu->kvm->arch.mwait_in_guest ||
+ vcpu->kvm->arch.hlt_in_guest ||
+ vcpu->kvm->arch.pause_in_guest) &&
+ boot_cpu_has(X86_FEATURE_APERFMPERF);
+}
+
+static long long get_unhalted_cycles(void)
+{
+ return __rdmsr(MSR_IA32_MPERF);
+}
+
/*
* Called within kvm->srcu read side.
* Returns 1 to let vcpu_run() continue the guest execution loop without
@@ -10697,6 +10710,8 @@ static void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
{
int r;
+ unsigned long long cycles, cycles_start = 0;
+ unsigned long long unhalted_cycles, unhalted_cycles_start = 0;
bool req_int_win =
dm_request_for_irq_injection(vcpu) &&
kvm_cpu_accept_dm_intr(vcpu);
@@ -10968,6 +10983,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
set_debugreg(0, 7);
}
+ if (needs_halted_accounting(vcpu)) {
+ cycles_start = get_cycles();
+ unhalted_cycles_start = get_unhalted_cycles();
+ }
guest_timing_enter_irqoff();
for (;;) {
@@ -11060,6 +11079,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
* acceptable for all known use cases.
*/
guest_timing_exit_irqoff();
+ if (needs_halted_accounting(vcpu)) {
+ cycles = get_cycles() - cycles_start;
+ unhalted_cycles = get_unhalted_cycles() -
+ unhalted_cycles_start;
+ if (likely(cycles > unhalted_cycles))
+ current->gtime_halted += cycles2ns(cycles - unhalted_cycles);
+ }
local_irq_enable();
preempt_enable();
--
2.43.0
* [RFC PATCH 3/3] sched,x86: Make the scheduler guest unhalted aware
2025-02-18 20:26 [RFC PATCH 0/3] kvm,sched: Add gtime halted Fernand Sieber
2025-02-18 20:26 ` [RFC PATCH 1/3] fs/proc: Add gtime halted to proc/<pid>/stat Fernand Sieber
2025-02-18 20:26 ` [RFC PATCH 2/3] kvm/x86: Add support for gtime halted Fernand Sieber
@ 2025-02-18 20:26 ` Fernand Sieber
2025-02-27 7:34 ` Vincent Guittot
2025-02-26 2:17 ` [RFC PATCH 0/3] kvm,sched: Add gtime halted Sean Christopherson
3 siblings, 1 reply; 12+ messages in thread
From: Fernand Sieber @ 2025-02-18 20:26 UTC (permalink / raw)
To: sieberf, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
Paolo Bonzini, x86, kvm, linux-kernel, nh-open-source
With guest hlt/mwait/pause pass through, the scheduler has no visibility into
real vCPU activity as it sees them all as 100% active. As such, load balancing
cannot make informed decisions on where it is preferable to collocate
tasks when necessary. I.e., as far as the load balancer is concerned, a
halted vCPU and an idle polling vCPU look exactly the same, so it may decide
that either should be preempted when in reality it would be preferable to
preempt the idle one.
This commit enlightens the scheduler to real guest activity in this
situation. Leveraging gtime_halted, it adds a hook for kvm to communicate
to the scheduler the duration that a vCPU spends halted. This is then used in
PELT accounting to discount it from real activity. This results in better
placement and overall steal time reduction.
This initial implementation assumes that non-idle CPUs are ticking, as it
hooks the halted time into the PELT decaying load accounting. As such, it
doesn't work well if PELT is updated infrequently with large chunks of
halted time. This is not a fundamental limitation, but more complex
accounting is needed to generalize the use case to nohz full.
---
arch/x86/kvm/x86.c | 8 ++++++--
include/linux/sched.h | 4 ++++
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 25 +++++++++++++++++++++++++
kernel/sched/pelt.c | 42 +++++++++++++++++++++++++++++++++++-------
kernel/sched/sched.h | 2 ++
6 files changed, 73 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 46975b0a63a5..156cf05b325f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10712,6 +10712,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
int r;
unsigned long long cycles, cycles_start = 0;
unsigned long long unhalted_cycles, unhalted_cycles_start = 0;
+ unsigned long long halted_cycles_ns = 0;
bool req_int_win =
dm_request_for_irq_injection(vcpu) &&
kvm_cpu_accept_dm_intr(vcpu);
@@ -11083,8 +11084,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
cycles = get_cycles() - cycles_start;
unhalted_cycles = get_unhalted_cycles() -
unhalted_cycles_start;
- if (likely(cycles > unhalted_cycles))
- current->gtime_halted += cycles2ns(cycles - unhalted_cycles);
+ if (likely(cycles > unhalted_cycles)) {
+ halted_cycles_ns = cycles2ns(cycles - unhalted_cycles);
+ current->gtime_halted += halted_cycles_ns;
+ sched_account_gtime_halted(current, halted_cycles_ns);
+ }
}
local_irq_enable();
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5f6745357e20..5409fac152c9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -367,6 +367,8 @@ struct vtime {
u64 gtime;
};
+extern void sched_account_gtime_halted(struct task_struct *p, u64 gtime_halted);
+
/*
* Utilization clamp constraints.
* @UCLAMP_MIN: Minimum utilization
@@ -588,6 +590,8 @@ struct sched_entity {
*/
struct sched_avg avg;
#endif
+
+ u64 gtime_halted;
};
struct sched_rt_entity {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9aecd914ac69..1f3ced2b2636 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4487,6 +4487,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.nr_migrations = 0;
p->se.vruntime = 0;
p->se.vlag = 0;
+ p->se.gtime_halted = 0;
INIT_LIST_HEAD(&p->se.group_node);
/* A delayed task cannot be in clone(). */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1c0ef435a7aa..5ff52711d459 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13705,4 +13705,29 @@ __init void init_sched_fair_class(void)
#endif
#endif /* SMP */
+
+}
+
+#ifdef CONFIG_NO_HZ_FULL
+void sched_account_gtime_halted(struct task_struct *p, u64 gtime_halted)
+{
}
+#else
+/*
+ * The implementation hooking into PELT requires regular updates of
+ * gtime_halted. This is guaranteed unless we run on CONFIG_NO_HZ_FULL.
+ */
+void sched_account_gtime_halted(struct task_struct *p, u64 gtime_halted)
+{
+ struct sched_entity *se = &p->se;
+
+ if (unlikely(!gtime_halted))
+ return;
+
+ for_each_sched_entity(se) {
+ se->gtime_halted += gtime_halted;
+ se->cfs_rq->gtime_halted += gtime_halted;
+ }
+}
+#endif
+EXPORT_SYMBOL(sched_account_gtime_halted);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 7a8534a2deff..9f96b7c46c00 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -305,10 +305,23 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
- cfs_rq->curr == se)) {
+ int ret = 0;
+ u64 delta = now - se->avg.last_update_time;
+ u64 gtime_halted = min(delta, se->gtime_halted);
- ___update_load_avg(&se->avg, se_weight(se));
+ ret = ___update_load_sum(now - gtime_halted, &se->avg, !!se->on_rq, se_runnable(se),
+ cfs_rq->curr == se);
+
+ if (gtime_halted) {
+ ret |= ___update_load_sum(now, &se->avg, 0, 0, 0);
+ se->gtime_halted -= gtime_halted;
+
+ /* decay residual halted time */
+ if (ret && se->gtime_halted)
+ se->gtime_halted = decay_load(se->gtime_halted, delta / 1024);
+ }
+
+ if (ret) {
cfs_se_util_change(&se->avg);
trace_pelt_se_tp(se);
return 1;
@@ -319,10 +332,25 @@ int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se
int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
{
- if (___update_load_sum(now, &cfs_rq->avg,
- scale_load_down(cfs_rq->load.weight),
- cfs_rq->h_nr_runnable,
- cfs_rq->curr != NULL)) {
+ int ret = 0;
+ u64 delta = now - cfs_rq->avg.last_update_time;
+ u64 gtime_halted = min(delta, cfs_rq->gtime_halted);
+
+ ret = ___update_load_sum(now - gtime_halted, &cfs_rq->avg,
+ scale_load_down(cfs_rq->load.weight),
+ cfs_rq->h_nr_runnable,
+ cfs_rq->curr != NULL);
+
+ if (gtime_halted) {
+ ret |= ___update_load_sum(now, &cfs_rq->avg, 0, 0, 0);
+ cfs_rq->gtime_halted -= gtime_halted;
+
+ /* decay any residual halted time */
+ if (ret && cfs_rq->gtime_halted)
+ cfs_rq->gtime_halted = decay_load(cfs_rq->gtime_halted, delta / 1024);
+ }
+
+ if (ret) {
___update_load_avg(&cfs_rq->avg, 1);
trace_pelt_cfs_tp(cfs_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b93c8c3dc05a..79b1166265bf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -744,6 +744,8 @@ struct cfs_rq {
struct list_head throttled_csd_list;
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+ u64 gtime_halted;
};
#ifdef CONFIG_SCHED_CLASS_EXT
--
2.43.0
* Re: [RFC PATCH 0/3] kvm,sched: Add gtime halted
2025-02-18 20:26 [RFC PATCH 0/3] kvm,sched: Add gtime halted Fernand Sieber
` (2 preceding siblings ...)
2025-02-18 20:26 ` [RFC PATCH 3/3] sched,x86: Make the scheduler guest unhalted aware Fernand Sieber
@ 2025-02-26 2:17 ` Sean Christopherson
2025-02-26 20:27 ` Sieber, Fernand
3 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2025-02-26 2:17 UTC (permalink / raw)
To: Fernand Sieber
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Paolo Bonzini, x86,
kvm, linux-kernel, nh-open-source
On Tue, Feb 18, 2025, Fernand Sieber wrote:
> With guest hlt, pause and mwait pass through, the hypervisor loses
> visibility on real guest cpu activity. From the point of view of the
> host, such vcpus are always 100% active even when the guest is
> completely halted.
>
> Typically hlt, pause and mwait pass through is only implemented on
> non-timeshared pcpus. However, there are cases where this assumption
> cannot be strictly met as some occasional housekeeping work needs to be
What housekeeping work?
> scheduled on such cpus while we generally want to preserve the pass
> through performance gains. This applies for system which don't have
> dedicated cpus for housekeeping purposes.
>
> In such cases, the lack of visibility of the hypervisor is problematic
> from a load balancing point of view. In the absence of a better signal,
> it will preemt vcpus at random. For example it could decide to interrupt
> a vcpu doing critical idle poll work while another vcpu sits idle.
>
> Another motivation for gaining visibility into real guest cpu activity
> is to enable the hypervisor to vend metrics about it for external
> consumption.
Such as?
> In this RFC we introduce the concept of guest halted time to address
> these concerns. Guest halted time (gtime_halted) accounts for cycles
> spent in guest mode while the cpu is halted. gtime_halted relies on
> measuring the mperf msr register (x86) around VM enter/exits to compute
> the number of unhalted cycles; halted cycles are then derived from the
> tsc difference minus the mperf difference.
IMO, there are better ways to solve this than having KVM sample MPERF on every
entry and exit.
The kernel already samples APERF/MPERF on every tick and provides that information
via /proc/cpuinfo, just use that. If your userspace is unable to use /proc/cpuinfo
or similar, that needs to be explained.
And if you're running vCPUs on tickless CPUs, and you're doing HLT/MWAIT passthrough,
*and* you want to schedule other tasks on those CPUs, then IMO you're abusing all
of those things and it's not KVM's problem to solve, especially now that sched_ext
is a thing.
> gtime_halted is exposed to proc/<pid>/stat as a new entry, which enables
> users to monitor real guest activity.
>
> gtime_halted is also plumbed to the scheduler infrastructure to discount
> halted cycles from fair load accounting. This enlightens the load
> balancer to real guest activity for better task placement.
>
> This initial RFC has a few limitations and open questions:
> * only the x86 infrastructure is supported as it relies on architecture
> dependent registers. Future development will extend this to ARM.
> * we assume that mperf accumulates as the same rate as tsc. While I am
> not certain whether this assumption is ever violated, the spec doesn't
> seem to offer this guarantee [1] so we may want to calibrate mperf.
> * the sched enlightenment logic relies on periodic gtime_halted updates.
> As such, it is incompatible with nohz full because this could result
> in long periods of no update followed by a massive halted time update
> which doesn't play well with the existing PELT integration. It is
> possible to address this limitation with generalized, more complex
> accounting.
* Re: [RFC PATCH 0/3] kvm,sched: Add gtime halted
2025-02-26 2:17 ` [RFC PATCH 0/3] kvm,sched: Add gtime halted Sean Christopherson
@ 2025-02-26 20:27 ` Sieber, Fernand
2025-02-26 21:00 ` Sean Christopherson
0 siblings, 1 reply; 12+ messages in thread
From: Sieber, Fernand @ 2025-02-26 20:27 UTC (permalink / raw)
To: seanjc@google.com
Cc: x86@kernel.org, peterz@infradead.org,
linux-kernel@vger.kernel.org, mingo@redhat.com,
vincent.guittot@linaro.org, kvm@vger.kernel.org,
nh-open-source@amazon.com, pbonzini@redhat.com
On Tue, 2025-02-25 at 18:17 -0800, Sean Christopherson wrote:
>
>
> On Tue, Feb 18, 2025, Fernand Sieber wrote:
> > With guest hlt, pause and mwait pass through, the hypervisor loses
> > visibility on real guest cpu activity. From the point of view of
> > the
> > host, such vcpus are always 100% active even when the guest is
> > completely halted.
> >
> > Typically hlt, pause and mwait pass through is only implemented on
> > non-timeshared pcpus. However, there are cases where this
> > assumption
> > cannot be strictly met as some occasional housekeeping work needs
> > to be
>
> What housekeeping work?
In the case we want to solve, the housekeeping work is mainly userspace
tasks implementing hypervisor functionality such as gathering metrics,
performing health checks, handling VM lifecycle, etc.
The platforms don't have dedicated cpus for housekeeping purposes and
try as much as possible to fully dedicate the cpus to VMs, hence
HLT/MWAIT pass through. The housekeeping work is low but can still
interfere with guests that are running very latency sensitive
operations on a subset of vCPUs (e.g idle poll), which is what we want
to detect and avoid.
>
> > scheduled on such cpus while we generally want to preserve the pass
> > through performance gains. This applies for system which don't have
> > dedicated cpus for housekeeping purposes.
> >
> > In such cases, the lack of visibility of the hypervisor is
> > problematic
> > from a load balancing point of view. In the absence of a better
> > signal,
> > it will preemt vcpus at random. For example it could decide to
> > interrupt
> > a vcpu doing critical idle poll work while another vcpu sits idle.
> >
> > Another motivation for gaining visibility into real guest cpu
> > activity
> > is to enable the hypervisor to vend metrics about it for external
> > consumption.
>
> Such as?
An example is feeding VM utilisation metrics to other systems, like auto
scaling of guest applications. While it is possible to implement this
functionality purely on the guest side, having the hypervisor handle
it means that it's available out of the box for all VMs in a standard
way, without relying on guest side configuration.
>
> > In this RFC we introduce the concept of guest halted time to
> > address
> > these concerns. Guest halted time (gtime_halted) accounts for
> > cycles
> > spent in guest mode while the cpu is halted. gtime_halted relies on
> > measuring the mperf msr register (x86) around VM enter/exits to
> > compute
> > the number of unhalted cycles; halted cycles are then derived from
> > the
> > tsc difference minus the mperf difference.
>
> IMO, there are better ways to solve this than having KVM sample MPERF
> on every
> entry and exit.
>
> The kernel already samples APERF/MPREF on every tick and provides
> that information
> via /proc/cpuinfo, just use that. If your userspace is unable to use
> /proc/cpuinfo
> or similar, that needs to be explained.
If I understand correctly, what you are suggesting is to have userspace
regularly sample these values to detect the most idle CPUs and then
use CPU affinity to re-pin housekeeping tasks to them. While possible,
this essentially requires implementing another scheduling layer in
userspace through constant re-pinning of tasks. It also requires
constantly identifying the full set of tasks that can induce
undesirable overhead so that they can be pinned accordingly. For these
reasons we would rather have the logic implemented directly in
the scheduler.
>
> And if you're running vCPUs on tickless CPUs, and you're doing
> HLT/MWAIT passthrough,
> *and* you want to schedule other tasks on those CPUs, then IMO you're
> abusing all
> of those things and it's not KVM's problem to solve, especially now
> that sched_ext
> is a thing.
We are running vCPUs with ticks, the rest of your observations are
correct.
>
> > gtime_halted is exposed to proc/<pid>/stat as a new entry, which enables
> > users to monitor real guest activity.
> >
> > gtime_halted is also plumbed to the scheduler infrastructure to discount
> > halted cycles from fair load accounting. This enlightens the load
> > balancer to real guest activity for better task placement.
> >
> > This initial RFC has a few limitations and open questions:
> > * only the x86 infrastructure is supported as it relies on architecture
> >   dependent registers. Future development will extend this to ARM.
> > * we assume that mperf accumulates at the same rate as tsc. While I am
> >   not certain whether this assumption is ever violated, the spec doesn't
> >   seem to offer this guarantee [1] so we may want to calibrate mperf.
> > * the sched enlightenment logic relies on periodic gtime_halted updates.
> >   As such, it is incompatible with nohz full because this could result
> >   in long periods of no update followed by a massive halted time update,
> >   which doesn't play well with the existing PELT integration. It is
> >   possible to address this limitation with generalized, more complex
> >   accounting.
Amazon Development Centre (South Africa) (Proprietary) Limited
29 Gogosoa Street, Observatory, Cape Town, Western Cape, 7925, South Africa
Registration Number: 2004 / 034463 / 07
* Re: [RFC PATCH 0/3] kvm,sched: Add gtime halted
2025-02-26 20:27 ` Sieber, Fernand
@ 2025-02-26 21:00 ` Sean Christopherson
2025-02-27 7:20 ` Sieber, Fernand
0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2025-02-26 21:00 UTC (permalink / raw)
To: Fernand Sieber
Cc: x86@kernel.org, peterz@infradead.org,
linux-kernel@vger.kernel.org, mingo@redhat.com,
vincent.guittot@linaro.org, kvm@vger.kernel.org,
nh-open-source@amazon.com, pbonzini@redhat.com
On Wed, Feb 26, 2025, Fernand Sieber wrote:
> On Tue, 2025-02-25 at 18:17 -0800, Sean Christopherson wrote:
> > > In this RFC we introduce the concept of guest halted time to address
> > > these concerns. Guest halted time (gtime_halted) accounts for cycles
> > > spent in guest mode while the cpu is halted. gtime_halted relies on
> > > measuring the mperf msr register (x86) around VM enter/exits to compute
> > > the number of unhalted cycles; halted cycles are then derived from the
> > > tsc difference minus the mperf difference.
> >
> > IMO, there are better ways to solve this than having KVM sample MPERF on
> > every entry and exit.
> >
> > The kernel already samples APERF/MPERF on every tick and provides that
> > information via /proc/cpuinfo, just use that. If your userspace is unable
> > to use /proc/cpuinfo or similar, that needs to be explained.
>
> If I understand correctly, what you are suggesting is to have userspace
> regularly sample these values to detect the most idle CPUs and then
> use CPU affinity to repin housekeeping tasks to them. While that is
> possible, it essentially requires implementing another scheduling
> layer in userspace through constant re-pinning of tasks. It also
> requires constantly identifying the full set of tasks that can induce
> undesirable overhead so that they can be pinned accordingly. For these
> reasons we would rather have the logic implemented directly in the
> scheduler.
>
> > And if you're running vCPUs on tickless CPUs, and you're doing HLT/MWAIT
> > passthrough, *and* you want to schedule other tasks on those CPUs, then IMO
> > you're abusing all of those things and it's not KVM's problem to solve,
> > especially now that sched_ext is a thing.
>
> We are running vCPUs with ticks; the rest of your observations are
> correct.
If there's a host tick, why do you need KVM's help to make scheduling decisions?
It sounds like what you want is a scheduler that is primarily driven by MPERF
(and APERF?), and sched_tick() => arch_scale_freq_tick() already knows about MPERF.
* Re: [RFC PATCH 0/3] kvm,sched: Add gtime halted
2025-02-26 21:00 ` Sean Christopherson
@ 2025-02-27 7:20 ` Sieber, Fernand
2025-02-27 14:39 ` Sean Christopherson
0 siblings, 1 reply; 12+ messages in thread
From: Sieber, Fernand @ 2025-02-27 7:20 UTC (permalink / raw)
To: seanjc@google.com
Cc: linux-kernel@vger.kernel.org, x86@kernel.org,
peterz@infradead.org, mingo@redhat.com,
vincent.guittot@linaro.org, pbonzini@redhat.com,
nh-open-source@amazon.com, kvm@vger.kernel.org
On Wed, 2025-02-26 at 13:00 -0800, Sean Christopherson wrote:
> On Wed, Feb 26, 2025, Fernand Sieber wrote:
> > On Tue, 2025-02-25 at 18:17 -0800, Sean Christopherson wrote:
> > > > In this RFC we introduce the concept of guest halted time to address
> > > > these concerns. Guest halted time (gtime_halted) accounts for cycles
> > > > spent in guest mode while the cpu is halted. gtime_halted relies on
> > > > measuring the mperf msr register (x86) around VM enter/exits to
> > > > compute the number of unhalted cycles; halted cycles are then
> > > > derived from the tsc difference minus the mperf difference.
> > >
> > > IMO, there are better ways to solve this than having KVM sample
> > > MPERF on every entry and exit.
> > >
> > > The kernel already samples APERF/MPERF on every tick and provides
> > > that information via /proc/cpuinfo, just use that. If your userspace
> > > is unable to use /proc/cpuinfo or similar, that needs to be explained.
> >
> > If I understand correctly, what you are suggesting is to have
> > userspace regularly sample these values to detect the most idle CPUs
> > and then use CPU affinity to repin housekeeping tasks to them. While
> > that is possible, it essentially requires implementing another
> > scheduling layer in userspace through constant re-pinning of tasks.
> > It also requires constantly identifying the full set of tasks that
> > can induce undesirable overhead so that they can be pinned
> > accordingly. For these reasons we would rather have the logic
> > implemented directly in the scheduler.
> >
> > > And if you're running vCPUs on tickless CPUs, and you're doing
> > > HLT/MWAIT passthrough, *and* you want to schedule other tasks on
> > > those CPUs, then IMO you're abusing all of those things and it's not
> > > KVM's problem to solve, especially now that sched_ext is a thing.
> >
> > We are running vCPUs with ticks; the rest of your observations are
> > correct.
>
> If there's a host tick, why do you need KVM's help to make scheduling
> decisions? It sounds like what you want is a scheduler that is primarily
> driven by MPERF (and APERF?), and sched_tick() => arch_scale_freq_tick()
> already knows about MPERF.
Having the measure around VM enter/exit makes it easy to attribute the
unhalted cycles to a specific task (vCPU), which solves both our use
cases of VM metrics and scheduling. That said, we may be able to avoid
it and achieve the same results, i.e.:

* the VM metrics use case can be solved by using /proc/cpuinfo from
  userspace.
* for the scheduling use case, the tick based sampling of MPERF means
  we could potentially introduce a correcting factor on PELT accounting
  of pinned vCPU tasks based on its value (similar to what I do in the
  last patch of the series).

The combination of these would remove the requirement of adding any
logic around VM enter/exit to support our use cases.

I'm happy to prototype that if we think it's going in the right
direction?
* Re: [RFC PATCH 3/3] sched,x86: Make the scheduler guest unhalted aware
2025-02-18 20:26 ` [RFC PATCH 3/3] sched,x86: Make the scheduler guest unhalted aware Fernand Sieber
@ 2025-02-27 7:34 ` Vincent Guittot
2025-02-27 8:27 ` [RFC PATCH 3/3] sched, x86: " Sieber, Fernand
0 siblings, 1 reply; 12+ messages in thread
From: Vincent Guittot @ 2025-02-27 7:34 UTC (permalink / raw)
To: Fernand Sieber
Cc: Ingo Molnar, Peter Zijlstra, Paolo Bonzini, x86, kvm,
linux-kernel, nh-open-source
On Tue, 18 Feb 2025 at 21:27, Fernand Sieber <sieberf@amazon.com> wrote:
>
> With guest hlt/mwait/pause pass through, the scheduler has no visibility into
> real vCPU activity as it sees them all 100% active. As such, load balancing
> cannot make informed decisions on where it is preferable to collocate
> tasks when necessary. I.e. as far as the load balancer is concerned, a
> halted vCPU and an idle polling vCPU look exactly the same, so it may decide
> that either should be preempted when in reality it would be preferable to
> preempt the idle one.
>
> This commit enlightens the scheduler to real guest activity in this
> situation. Leveraging gtime unhalted, it adds a hook for kvm to communicate
> to the scheduler the duration that a vCPU spends halted. This is then used in
> PELT accounting to discount it from real activity. This results in better
> placement and overall steal time reduction.
NAK, PELT accounts for time spent by an se on the CPU. If your
thread/vcpu doesn't do anything but burn cycles, find another way to
report that to the host.

Furthermore, this breaks all the hierarchy dependencies.
>
> This initial implementation assumes that non-idle CPUs are ticking as it
> hooks the unhalted time into the PELT decaying load accounting. As such it
> doesn't work well if PELT is updated infrequently with large chunks of
> halted time. This is not a fundamental limitation but more complex
> accounting is needed to generalize the use case to nohz full.
> ---
> arch/x86/kvm/x86.c | 8 ++++++--
> include/linux/sched.h | 4 ++++
> kernel/sched/core.c | 1 +
> kernel/sched/fair.c | 25 +++++++++++++++++++++++++
> kernel/sched/pelt.c | 42 +++++++++++++++++++++++++++++++++++-------
> kernel/sched/sched.h | 2 ++
> 6 files changed, 73 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 46975b0a63a5..156cf05b325f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10712,6 +10712,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> int r;
> unsigned long long cycles, cycles_start = 0;
> unsigned long long unhalted_cycles, unhalted_cycles_start = 0;
> + unsigned long long halted_cycles_ns = 0;
> bool req_int_win =
> dm_request_for_irq_injection(vcpu) &&
> kvm_cpu_accept_dm_intr(vcpu);
> @@ -11083,8 +11084,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> cycles = get_cycles() - cycles_start;
> unhalted_cycles = get_unhalted_cycles() -
> unhalted_cycles_start;
> - if (likely(cycles > unhalted_cycles))
> - current->gtime_halted += cycles2ns(cycles - unhalted_cycles);
> + if (likely(cycles > unhalted_cycles)) {
> + halted_cycles_ns = cycles2ns(cycles - unhalted_cycles);
> + current->gtime_halted += halted_cycles_ns;
> + sched_account_gtime_halted(current, halted_cycles_ns);
> + }
> }
>
> local_irq_enable();
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5f6745357e20..5409fac152c9 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -367,6 +367,8 @@ struct vtime {
> u64 gtime;
> };
>
> +extern void sched_account_gtime_halted(struct task_struct *p, u64 gtime_halted);
> +
> /*
> * Utilization clamp constraints.
> * @UCLAMP_MIN: Minimum utilization
> @@ -588,6 +590,8 @@ struct sched_entity {
> */
> struct sched_avg avg;
> #endif
> +
> + u64 gtime_halted;
> };
>
> struct sched_rt_entity {
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9aecd914ac69..1f3ced2b2636 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4487,6 +4487,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> p->se.nr_migrations = 0;
> p->se.vruntime = 0;
> p->se.vlag = 0;
> + p->se.gtime_halted = 0;
> INIT_LIST_HEAD(&p->se.group_node);
>
> /* A delayed task cannot be in clone(). */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1c0ef435a7aa..5ff52711d459 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -13705,4 +13705,29 @@ __init void init_sched_fair_class(void)
> #endif
> #endif /* SMP */
>
> +
> +}
> +
> +#ifdef CONFIG_NO_HZ_FULL
> +void sched_account_gtime_halted(struct task_struct *p, u64 gtime_halted)
> +{
> }
> +#else
> +/*
> + * The implementation hooking into PELT requires regular updates of
> + * gtime_halted. This is guaranteed unless we run on CONFIG_NO_HZ_FULL.
> + */
> +void sched_account_gtime_halted(struct task_struct *p, u64 gtime_halted)
> +{
> + struct sched_entity *se = &p->se;
> +
> + if (unlikely(!gtime_halted))
> + return;
> +
> + for_each_sched_entity(se) {
> + se->gtime_halted += gtime_halted;
> + se->cfs_rq->gtime_halted += gtime_halted;
> + }
> +}
> +#endif
> +EXPORT_SYMBOL(sched_account_gtime_halted);
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index 7a8534a2deff..9f96b7c46c00 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -305,10 +305,23 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
>
> int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> - if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
> - cfs_rq->curr == se)) {
> + int ret = 0;
> + u64 delta = now - se->avg.last_update_time;
> + u64 gtime_halted = min(delta, se->gtime_halted);
>
> - ___update_load_avg(&se->avg, se_weight(se));
> + ret = ___update_load_sum(now - gtime_halted, &se->avg, !!se->on_rq, se_runnable(se),
> + cfs_rq->curr == se);
> +
> + if (gtime_halted) {
> + ret |= ___update_load_sum(now, &se->avg, 0, 0, 0);
> + se->gtime_halted -= gtime_halted;
> +
> + /* decay residual halted time */
> + if (ret && se->gtime_halted)
> + se->gtime_halted = decay_load(se->gtime_halted, delta / 1024);
> + }
> +
> + if (ret) {
> cfs_se_util_change(&se->avg);
> trace_pelt_se_tp(se);
> return 1;
> @@ -319,10 +332,25 @@ int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se
>
> int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
> {
> - if (___update_load_sum(now, &cfs_rq->avg,
> - scale_load_down(cfs_rq->load.weight),
> - cfs_rq->h_nr_runnable,
> - cfs_rq->curr != NULL)) {
> + int ret = 0;
> + u64 delta = now - cfs_rq->avg.last_update_time;
> + u64 gtime_halted = min(delta, cfs_rq->gtime_halted);
> +
> + ret = ___update_load_sum(now - gtime_halted, &cfs_rq->avg,
> + scale_load_down(cfs_rq->load.weight),
> + cfs_rq->h_nr_runnable,
> + cfs_rq->curr != NULL);
> +
> + if (gtime_halted) {
> + ret |= ___update_load_sum(now, &cfs_rq->avg, 0, 0, 0);
> + cfs_rq->gtime_halted -= gtime_halted;
> +
> + /* decay any residual halted time */
> + if (ret && cfs_rq->gtime_halted)
> + cfs_rq->gtime_halted = decay_load(cfs_rq->gtime_halted, delta / 1024);
> + }
> +
> + if (ret) {
>
> ___update_load_avg(&cfs_rq->avg, 1);
> trace_pelt_cfs_tp(cfs_rq);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b93c8c3dc05a..79b1166265bf 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -744,6 +744,8 @@ struct cfs_rq {
> struct list_head throttled_csd_list;
> #endif /* CONFIG_CFS_BANDWIDTH */
> #endif /* CONFIG_FAIR_GROUP_SCHED */
> +
> + u64 gtime_halted;
> };
>
> #ifdef CONFIG_SCHED_CLASS_EXT
> --
> 2.43.0
>
>
>
>
* Re: [RFC PATCH 3/3] sched, x86: Make the scheduler guest unhalted aware
2025-02-27 7:34 ` Vincent Guittot
@ 2025-02-27 8:27 ` Sieber, Fernand
2025-02-27 9:03 ` Vincent Guittot
0 siblings, 1 reply; 12+ messages in thread
From: Sieber, Fernand @ 2025-02-27 8:27 UTC (permalink / raw)
To: vincent.guittot@linaro.org
Cc: peterz@infradead.org, mingo@redhat.com, pbonzini@redhat.com,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org, x86@kernel.org,
nh-open-source@amazon.com
On Thu, 2025-02-27 at 08:34 +0100, Vincent Guittot wrote:
> On Tue, 18 Feb 2025 at 21:27, Fernand Sieber <sieberf@amazon.com> wrote:
> >
> > With guest hlt/mwait/pause pass through, the scheduler has no
> > visibility into real vCPU activity as it sees them all 100% active.
> > As such, load balancing cannot make informed decisions on where it is
> > preferable to collocate tasks when necessary. I.e. as far as the load
> > balancer is concerned, a halted vCPU and an idle polling vCPU look
> > exactly the same, so it may decide that either should be preempted
> > when in reality it would be preferable to preempt the idle one.
> >
> > This commit enlightens the scheduler to real guest activity in this
> > situation. Leveraging gtime unhalted, it adds a hook for kvm to
> > communicate to the scheduler the duration that a vCPU spends halted.
> > This is then used in PELT accounting to discount it from real
> > activity. This results in better placement and overall steal time
> > reduction.
>
> NAK, PELT accounts for time spent by an se on the CPU.

I was essentially aiming to adjust this concept to "PELT accounts for
the time spent by an se *unhalted* on the CPU". Would such an adjustment
of the definition cause problems?
> If your thread/vcpu doesn't do anything but burn cycles, find another
> way to report that to the host
The main advantage of hooking into PELT is that it means that load
balancing will just work out of the box as it immediately adjusts the
sched_group util/load/runnable values.
It may be possible to scope down my change to load balancing without
touching PELT if that is not viable. For example instead of using PELT
we could potentially adjust the calculation of sgs->avg_load in
update_sg_lb_stats for overloaded groups to include a correcting factor
based on recent halted cycles of the CPU. The comparison of two
overloaded groups would then favor pulling tasks on the one that has
the most halted cycles. This approach is more scoped down as it doesn't
change the classification of scheduling groups, instead it just changes
how overloaded groups are compared. I would need to prototype to see if
it works.
Let me know if this would go in the right direction or if you have any
other ideas of alternate options?
> Furthermore, this breaks all the hierarchy dependencies.

I don't understand the meaning of this comment, could you please
provide more details?
>
> > This initial implementation assumes that non-idle CPUs are ticking as
> > it hooks the unhalted time into the PELT decaying load accounting. As
> > such it doesn't work well if PELT is updated infrequently with large
> > chunks of halted time. This is not a fundamental limitation but more
> > complex accounting is needed to generalize the use case to nohz full.
* Re: [RFC PATCH 3/3] sched, x86: Make the scheduler guest unhalted aware
2025-02-27 8:27 ` [RFC PATCH 3/3] sched, x86: " Sieber, Fernand
@ 2025-02-27 9:03 ` Vincent Guittot
0 siblings, 0 replies; 12+ messages in thread
From: Vincent Guittot @ 2025-02-27 9:03 UTC (permalink / raw)
To: Sieber, Fernand
Cc: peterz@infradead.org, mingo@redhat.com, pbonzini@redhat.com,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org, x86@kernel.org,
nh-open-source@amazon.com
On Thu, 27 Feb 2025 at 09:27, Sieber, Fernand <sieberf@amazon.com> wrote:
>
> On Thu, 2025-02-27 at 08:34 +0100, Vincent Guittot wrote:
> > On Tue, 18 Feb 2025 at 21:27, Fernand Sieber <sieberf@amazon.com>
> > wrote:
> > >
> > > With guest hlt/mwait/pause pass through, the scheduler has no
> > > visibility into real vCPU activity as it sees them all 100% active.
> > > As such, load balancing cannot make informed decisions on where it
> > > is preferable to collocate tasks when necessary. I.e. as far as the
> > > load balancer is concerned, a halted vCPU and an idle polling vCPU
> > > look exactly the same, so it may decide that either should be
> > > preempted when in reality it would be preferable to preempt the
> > > idle one.
> > >
> > > This commit enlightens the scheduler to real guest activity in this
> > > situation. Leveraging gtime unhalted, it adds a hook for kvm to
> > > communicate to the scheduler the duration that a vCPU spends
> > > halted. This is then used in PELT accounting to discount it from
> > > real activity. This results in better placement and overall steal
> > > time reduction.
> >
> > NAK, PELT accounts for time spent by an se on the CPU.
>
> I was essentially aiming to adjust this concept to "PELT accounts for
> the time spent by an se *unhalted* on the CPU". Would such an
> adjustment of the definition cause problems?

Yes, it's not in the scope of PELT to know that an se is a vcpu, nor
whether that vcpu is halted or not.
>
> > If your thread/vcpu doesn't do anything but burn cycles, find
> > another way to report that to the host
> The main advantage of hooking into PELT is that it means that load
> balancing will just work out of the box as it immediately adjusts the
> sched_group util/load/runnable values.
>
> It may be possible to scope down my change to load balancing without
> touching PELT if that is not viable. For example instead of using PELT
> we could potentially adjust the calculation of sgs->avg_load in
> update_sg_lb_stats for overloaded groups to include a correcting factor
> based on recent halted cycles of the CPU. The comparison of two
> overloaded groups would then favor pulling tasks on the one that has
> the most halted cycles. This approach is more scoped down as it doesn't
> change the classification of scheduling groups, instead it just changes
> how overloaded groups are compared. I would need to prototype to see if
> it works.
This is not better than PELT
>
> Let me know if this would go in the right direction or if you have any
> other ideas of alternate options?
The below should give you some insights
https://lore.kernel.org/kvm/CAO7JXPhMfibNsX6Nx902PRo7_A2b4Rnc3UP=bpKYeOuQnHvtrw@mail.gmail.com/
I don't think that you need any change in the scheduler. Use the
current public scheduler interfaces to adjust the priority of your
vcpu. For example, switching your thread to SCHED_IDLE is a good way
to say that your thread has a very low priority, and the scheduler is
able to handle such information.
>
> > Furthermore, this breaks all the hierarchy dependencies.
>
> I don't understand the meaning of this comment, could you please
> provide more details?
>
> >
> >
> > > This initial implementation assumes that non-idle CPUs are ticking
> > > as it hooks the unhalted time into the PELT decaying load
> > > accounting. As such it doesn't work well if PELT is updated
> > > infrequently with large chunks of halted time. This is not a
> > > fundamental limitation but more complex accounting is needed to
> > > generalize the use case to nohz full.
>
>
>
* Re: [RFC PATCH 0/3] kvm,sched: Add gtime halted
2025-02-27 7:20 ` Sieber, Fernand
@ 2025-02-27 14:39 ` Sean Christopherson
0 siblings, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2025-02-27 14:39 UTC (permalink / raw)
To: Fernand Sieber
Cc: linux-kernel@vger.kernel.org, x86@kernel.org,
peterz@infradead.org, mingo@redhat.com,
vincent.guittot@linaro.org, pbonzini@redhat.com,
nh-open-source@amazon.com, kvm@vger.kernel.org
On Thu, Feb 27, 2025, Fernand Sieber wrote:
> On Wed, 2025-02-26 at 13:00 -0800, Sean Christopherson wrote:
> > On Wed, Feb 26, 2025, Fernand Sieber wrote:
> > > On Tue, 2025-02-25 at 18:17 -0800, Sean Christopherson wrote:
> > > > And if you're running vCPUs on tickless CPUs, and you're doing
> > > > HLT/MWAIT passthrough, *and* you want to schedule other tasks on those
> > > > CPUs, then IMO you're abusing all of those things and it's not KVM's
> > > > problem to solve, especially now that sched_ext is a thing.
> > >
> > > We are running vCPUs with ticks; the rest of your observations are
> > > correct.
> >
> > If there's a host tick, why do you need KVM's help to make scheduling
> > decisions? It sounds like what you want is a scheduler that is primarily
> > driven by MPERF (and APERF?), and sched_tick() => arch_scale_freq_tick()
> > already knows about MPERF.
>
> Having the measure around VM enter/exit makes it easy to attribute the
> unhalted cycles to a specific task (vCPU), which solves both our use
> cases of VM metrics and scheduling. That said, we may be able to avoid
> it and achieve the same results, i.e.:
>
> * the VM metrics use case can be solved by using /proc/cpuinfo from
>   userspace.
> * for the scheduling use case, the tick based sampling of MPERF means
>   we could potentially introduce a correcting factor on PELT accounting
>   of pinned vCPU tasks based on its value (similar to what I do in the
>   last patch of the series).
>
> The combination of these would remove the requirement of adding any
> logic around VM enter/exit to support our use cases.
>
> I'm happy to prototype that if we think it's going in the right
> direction?
That's mostly a question for the scheduler folks. That said, from a KVM perspective,
sampling MPERF around entry/exit for scheduling purposes is a non-starter.
end of thread, other threads:[~2025-02-27 14:39 UTC | newest]
Thread overview: 12+ messages
2025-02-18 20:26 [RFC PATCH 0/3] kvm,sched: Add gtime halted Fernand Sieber
2025-02-18 20:26 ` [RFC PATCH 1/3] fs/proc: Add gtime halted to proc/<pid>/stat Fernand Sieber
2025-02-18 20:26 ` [RFC PATCH 2/3] kvm/x86: Add support for gtime halted Fernand Sieber
2025-02-18 20:26 ` [RFC PATCH 3/3] sched,x86: Make the scheduler guest unhalted aware Fernand Sieber
2025-02-27 7:34 ` Vincent Guittot
2025-02-27 8:27 ` [RFC PATCH 3/3] sched, x86: " Sieber, Fernand
2025-02-27 9:03 ` Vincent Guittot
2025-02-26 2:17 ` [RFC PATCH 0/3] kvm,sched: Add gtime halted Sean Christopherson
2025-02-26 20:27 ` Sieber, Fernand
2025-02-26 21:00 ` Sean Christopherson
2025-02-27 7:20 ` Sieber, Fernand
2025-02-27 14:39 ` Sean Christopherson