Linux Kernel Selftest development
* [PATCH 0/5] Fix and enhance KVM steal accounting for both guest and host
@ 2026-05-05  0:30 Dongli Zhang
  2026-05-05  0:30 ` [PATCH 1/5] x86/kvm: Reset prev_steal_time and prev_steal_time_rq when enabling steal time Dongli Zhang
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Dongli Zhang @ 2026-05-05  0:30 UTC (permalink / raw)
  To: kvm, x86, linux-kselftest
  Cc: seanjc, pbonzini, vkuznets, tglx, mingo, bp, dave.hansen, shuah,
	hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, kprateek.nayak, jgross,
	dwmw2, joe.jin

This patchset resolves two issues related to KVM steal time accounting.

1. KVM does not support vCPU hotplug. When a vCPU is removed, its
corresponding data structures are not freed by KVM. Instead, QEMU destroys
only the userspace state and the vCPU thread, while the KVM vCPU fd remains
open and parked in QEMU.

As a result, vcpu->arch.st.last_steal is not reset. If the same vCPU is
later re-created by QEMU, last_steal retains its old value, while
current->sched_info.run_delay starts from zero since a new vCPU thread is
created. This causes current->sched_info.run_delay - vcpu->arch.st.last_steal
to produce a large, bogus value.

For instance, current->sched_info.run_delay can become smaller than
vcpu->arch.st.last_steal (see line 3832) if a QEMU vCPU is re-added after
it has previously been removed.

As a result, st->steal restarts from a very small value, close to
current->sched_info.run_delay.

3748 static void record_steal_time(struct kvm_vcpu *vcpu)
3749 {
3831         unsafe_get_user(steal, &st->steal, out);
3832         steal += current->sched_info.run_delay -
3833                 vcpu->arch.st.last_steal;
3834         vcpu->arch.st.last_steal = current->sched_info.run_delay;
3835         unsafe_put_user(steal, &st->steal, out);

This means that, from the guest's perspective, paravirt_steal_clock() for a
newly added vCPU restarts from a very small value.

Since this_rq()->prev_steal_time is not reset during vCPU hotplug, it may
exceed paravirt_steal_clock(). This results in a negative delta (interpreted as
a large u64) being accounted to cpustat[CPUTIME_STEAL], causing it to appear
either very small or to start from a large u64 value (see line 275).

268 static __always_inline u64 steal_account_process_time(u64 maxtime)
269 {
270 #ifdef CONFIG_PARAVIRT
271         if (static_key_false(&paravirt_steal_enabled)) {
272                 u64 steal;
273
274                 steal = paravirt_steal_clock(smp_processor_id());
275                 steal -= this_rq()->prev_steal_time;
276                 steal = min(steal, maxtime);
277                 account_steal_time(steal);
278                 this_rq()->prev_steal_time += steal;
279
280                 return steal;
281         }
282 #endif /* CONFIG_PARAVIRT */
283         return 0;
284 }

This patchset fixes the issue on both the KVM guest and host sides by resetting
prev_steal_time/prev_steal_time_rq and vcpu->arch.st.last_steal when KVM steal
time is enabled.


2. KVM_CLOCK_REALTIME was introduced to help track the downtime of
live migration. KVM uses that realtime value to advance the guest clock, but
the same blackout is not reflected in KVM steal time.

Account that same delta in steal time directly in kvm_vm_ioctl_set_clock(),
only when KVM_CLOCK_REALTIME is used. This keeps the KVM-only solution
self-contained and avoids adding a new KVM ioctl or requiring additional
userspace changes (e.g. QEMU).


I have also created two KVM selftests.


Dongli Zhang (5):
  x86/kvm: Reset prev_steal_time and prev_steal_time_rq when enabling steal time
  KVM: x86: Reset vcpu->arch.st.last_steal when enabling steal time
  KVM: x86: account KVM_SET_CLOCK downtime in steal time
  KVM: selftests: Test steal time when re-adding a vCPU on a new thread
  KVM: selftests: Test KVM_SET_CLOCK downtime in steal time

 arch/x86/include/asm/kvm_host.h                 |   4 +
 arch/x86/kernel/kvm.c                           |  40 +++---
 arch/x86/kvm/x86.c                              |  32 ++++-
 include/linux/sched/cputime.h                   |   2 +
 kernel/sched/cputime.c                          |  10 ++
 tools/testing/selftests/kvm/Makefile.kvm        |   1 +
 .../testing/selftests/kvm/x86/kvm_clock_test.c  |  42 ++++--
 .../selftests/kvm/x86/steal_time_reset_test.c   | 144 +++++++++++++++++++
 8 files changed, 248 insertions(+), 27 deletions(-)

Thank you very much!

Dongli Zhang


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/5] x86/kvm: Reset prev_steal_time and prev_steal_time_rq when enabling steal time
  2026-05-05  0:30 [PATCH 0/5] Fix and enhance KVM steal accounting for both guest and host Dongli Zhang
@ 2026-05-05  0:30 ` Dongli Zhang
  2026-05-05  0:30 ` [PATCH 2/5] KVM: x86: Reset vcpu->arch.st.last_steal " Dongli Zhang
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Dongli Zhang @ 2026-05-05  0:30 UTC (permalink / raw)
  To: kvm, x86, linux-kselftest
  Cc: seanjc, pbonzini, vkuznets, tglx, mingo, bp, dave.hansen, shuah,
	hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, kprateek.nayak, jgross,
	dwmw2, joe.jin

kvm_steal_clock() is not guaranteed to be monotonic, especially during vCPU
hotplug, and may restart from a small value due to a KVM bug.

Since per-vCPU prev_steal_time and prev_steal_time_rq are not reset on vCPU
hotplug, they can become larger than the value returned by
paravirt_steal_clock(), leading to incorrect accounting.

Reset both prev_steal_time and prev_steal_time_rq when enabling KVM steal
time paravirtualization to avoid this issue.

A fix for the underlying KVM hypervisor steal time accounting bug will be
addressed in a subsequent patch.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 arch/x86/kernel/kvm.c         | 40 ++++++++++++++++++++---------------
 include/linux/sched/cputime.h |  2 ++
 kernel/sched/cputime.c        | 10 +++++++++
 3 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 29226d112029..819abd3a9a26 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -328,6 +328,23 @@ static void __init paravirt_ops_setup(void)
 #endif
 }
 
+static u64 kvm_steal_clock(int cpu)
+{
+	u64 steal;
+	struct kvm_steal_time *src;
+	int version;
+
+	src = &per_cpu(steal_time, cpu);
+	do {
+		version = src->version;
+		virt_rmb();
+		steal = src->steal;
+		virt_rmb();
+	} while ((version & 1) || (version != src->version));
+
+	return steal;
+}
+
 static void kvm_register_steal_time(void)
 {
 	int cpu = smp_processor_id();
@@ -337,6 +354,12 @@ static void kvm_register_steal_time(void)
 		return;
 
 	wrmsrq(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
+
+	/*
+	 * This CPU is not ready to be scheduled yet.
+	 */
+	sched_steal_time_cpu_init(cpu, kvm_steal_clock(cpu));
+
 	pr_debug("stealtime: cpu %d, msr %llx\n", cpu,
 		(unsigned long long) slow_virt_to_phys(st));
 }
@@ -411,23 +434,6 @@ static void kvm_disable_steal_time(void)
 	wrmsrq(MSR_KVM_STEAL_TIME, 0);
 }
 
-static u64 kvm_steal_clock(int cpu)
-{
-	u64 steal;
-	struct kvm_steal_time *src;
-	int version;
-
-	src = &per_cpu(steal_time, cpu);
-	do {
-		version = src->version;
-		virt_rmb();
-		steal = src->steal;
-		virt_rmb();
-	} while ((version & 1) || (version != src->version));
-
-	return steal;
-}
-
 static inline __init void __set_percpu_decrypted(void *ptr, unsigned long size)
 {
 	early_set_memory_decrypted((unsigned long) ptr, size);
diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
index e90efaf6d26e..7a0313bd053a 100644
--- a/include/linux/sched/cputime.h
+++ b/include/linux/sched/cputime.h
@@ -186,6 +186,8 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
+void sched_steal_time_cpu_init(int cpu, u64 steal);
+
 #ifdef CONFIG_HAVE_PV_STEAL_CLOCK_GEN
 u64 dummy_steal_clock(int cpu);
 
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index fbf31db0d2f3..1490d1bcf3b4 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -255,6 +255,16 @@ void __account_forceidle_time(struct task_struct *p, u64 delta)
 #ifdef CONFIG_PARAVIRT
 struct static_key paravirt_steal_enabled;
 
+void sched_steal_time_cpu_init(int cpu, u64 steal)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	rq->prev_steal_time = steal;
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+	rq->prev_steal_time_rq = steal;
+#endif
+}
+
 #ifdef CONFIG_HAVE_PV_STEAL_CLOCK_GEN
 static u64 native_steal_clock(int cpu)
 {
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/5] KVM: x86: Reset vcpu->arch.st.last_steal when enabling steal time
  2026-05-05  0:30 [PATCH 0/5] Fix and enhance KVM steal accounting for both guest and host Dongli Zhang
  2026-05-05  0:30 ` [PATCH 1/5] x86/kvm: Reset prev_steal_time and prev_steal_time_rq when enabling steal time Dongli Zhang
@ 2026-05-05  0:30 ` Dongli Zhang
  2026-05-08 22:40   ` Sean Christopherson
  2026-05-05  0:30 ` [PATCH 3/5] KVM: x86: account KVM_SET_CLOCK downtime in " Dongli Zhang
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Dongli Zhang @ 2026-05-05  0:30 UTC (permalink / raw)
  To: kvm, x86, linux-kselftest
  Cc: seanjc, pbonzini, vkuznets, tglx, mingo, bp, dave.hansen, shuah,
	hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, kprateek.nayak, jgross,
	dwmw2, joe.jin

KVM does not support vCPU hotplug. When a vCPU is removed, its
corresponding data structures are not freed by KVM. Instead, QEMU destroys
only the userspace state and the vCPU thread, while the KVM vCPU fd remains
open and parked in QEMU.

As a result, vcpu->arch.st.last_steal is not reset.

If the same vCPU is later re-created by QEMU, last_steal retains its old
value, while current->sched_info.run_delay starts from zero since a new
vCPU thread is created. This causes
current->sched_info.run_delay - vcpu->arch.st.last_steal to produce a
large, bogus value.

Fix this by resetting vcpu->arch.st.last_steal to
current->sched_info.run_delay when KVM steal time is enabled.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/x86.c              | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c470e40a00aa..1f1f29128c5d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -960,6 +960,7 @@ struct kvm_vcpu_arch {
 		u64 msr_val;
 		u64 last_steal;
 		struct gfn_to_hva_cache cache;
+		bool need_reset;
 	} st;
 
 	u64 l1_tsc_offset;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0a1b63c63d1a..eec578894ad5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3829,6 +3829,12 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 	smp_wmb();
 
 	unsafe_get_user(steal, &st->steal, out);
+
+	if (vcpu->arch.st.need_reset) {
+		vcpu->arch.st.need_reset = false;
+		vcpu->arch.st.last_steal = current->sched_info.run_delay;
+	}
+
 	steal += current->sched_info.run_delay -
 		vcpu->arch.st.last_steal;
 	vcpu->arch.st.last_steal = current->sched_info.run_delay;
@@ -4178,6 +4184,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (!(data & KVM_MSR_ENABLED))
 			break;
 
+		vcpu->arch.st.need_reset = true;
 		kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
 
 		break;
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 3/5] KVM: x86: account KVM_SET_CLOCK downtime in steal time
  2026-05-05  0:30 [PATCH 0/5] Fix and enhance KVM steal accounting for both guest and host Dongli Zhang
  2026-05-05  0:30 ` [PATCH 1/5] x86/kvm: Reset prev_steal_time and prev_steal_time_rq when enabling steal time Dongli Zhang
  2026-05-05  0:30 ` [PATCH 2/5] KVM: x86: Reset vcpu->arch.st.last_steal " Dongli Zhang
@ 2026-05-05  0:30 ` Dongli Zhang
  2026-05-10 18:54   ` David Woodhouse
  2026-05-05  0:30 ` [PATCH 4/5] KVM: selftests: Test steal time when re-adding a vCPU on a new thread Dongli Zhang
  2026-05-05  0:30 ` [PATCH 5/5] KVM: selftests: Test KVM_SET_CLOCK downtime in steal time Dongli Zhang
  4 siblings, 1 reply; 12+ messages in thread
From: Dongli Zhang @ 2026-05-05  0:30 UTC (permalink / raw)
  To: kvm, x86, linux-kselftest
  Cc: seanjc, pbonzini, vkuznets, tglx, mingo, bp, dave.hansen, shuah,
	hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, kprateek.nayak, jgross,
	dwmw2, joe.jin

KVM_CLOCK_REALTIME was introduced to help track the downtime of
live migration. KVM uses that realtime value to advance the guest clock, but
the same blackout is not reflected in KVM steal time.

Account that same delta in steal time directly in kvm_vm_ioctl_set_clock(),
only when KVM_CLOCK_REALTIME is used. This keeps the KVM-only solution
self-contained and avoids adding a new KVM ioctl or requiring additional
userspace changes (e.g. QEMU).

Record the per-VM downtime delta when KVM_SET_CLOCK receives
KVM_CLOCK_REALTIME, and fold it into the existing x86 steal accounting
path. Initialize each vCPU's local cursor
(vcpu->arch.st.last_downtime_steal) when the guest enables
MSR_KVM_STEAL_TIME so previously accumulated blackout is not charged.

Note that this means a vCPU may observe additional steal time after
blackout even if the host-side contribution from current->sched_info
did not increase during that interval.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/x86.c              | 25 +++++++++++++++++++++++--
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1f1f29128c5d..920441b1abf0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -959,6 +959,7 @@ struct kvm_vcpu_arch {
 		u8 preempted;
 		u64 msr_val;
 		u64 last_steal;
+		u64 last_downtime_steal;
 		struct gfn_to_hva_cache cache;
 		bool need_reset;
 	} st;
@@ -1506,6 +1507,8 @@ struct kvm_arch {
 	u64 master_kernel_ns;
 	u64 master_cycle_now;
 
+	atomic64_t downtime_steal;
+
 #ifdef CONFIG_KVM_HYPERV
 	struct kvm_hv hyperv;
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eec578894ad5..452293fc0505 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3751,6 +3751,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 	struct kvm_steal_time __user *st;
 	struct kvm_memslots *slots;
 	gpa_t gpa = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
+	u64 downtime_steal;
 	u64 steal;
 	u32 version;
 
@@ -3838,6 +3839,11 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 	steal += current->sched_info.run_delay -
 		vcpu->arch.st.last_steal;
 	vcpu->arch.st.last_steal = current->sched_info.run_delay;
+
+	downtime_steal = atomic64_read(&vcpu->kvm->arch.downtime_steal);
+	steal += downtime_steal - vcpu->arch.st.last_downtime_steal;
+	vcpu->arch.st.last_downtime_steal = downtime_steal;
+
 	unsafe_put_user(steal, &st->steal, out);
 
 	version += 1;
@@ -4185,6 +4191,9 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			break;
 
 		vcpu->arch.st.need_reset = true;
+		vcpu->arch.st.last_downtime_steal =
+			atomic64_read(&vcpu->kvm->arch.downtime_steal);
+
 		kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
 
 		break;
@@ -7250,8 +7259,18 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
 		/*
 		 * Avoid stepping the kvmclock backwards.
 		 */
-		if (now_real_ns > data.realtime)
-			data.clock += now_real_ns - data.realtime;
+		if (now_real_ns > data.realtime) {
+			u64 downtime_ns = now_real_ns - data.realtime;
+
+			data.clock += downtime_ns;
+
+			if (sched_info_on()) {
+				atomic64_add(downtime_ns,
+					     &kvm->arch.downtime_steal);
+				kvm_make_all_cpus_request(kvm,
+							  KVM_REQ_STEAL_UPDATE);
+			}
+		}
 	}
 
 	if (ka->use_master_clock)
@@ -13389,6 +13408,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	kvm->arch.hv_root_tdp = INVALID_PAGE;
 #endif
 
+	atomic64_set(&kvm->arch.downtime_steal, 0);
+
 	kvm_apicv_init(kvm);
 	kvm_hv_init_vm(kvm);
 	kvm_xen_init_vm(kvm);
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 4/5] KVM: selftests: Test steal time when re-adding a vCPU on a new thread
  2026-05-05  0:30 [PATCH 0/5] Fix and enhance KVM steal accounting for both guest and host Dongli Zhang
                   ` (2 preceding siblings ...)
  2026-05-05  0:30 ` [PATCH 3/5] KVM: x86: account KVM_SET_CLOCK downtime in " Dongli Zhang
@ 2026-05-05  0:30 ` Dongli Zhang
  2026-05-05  0:30 ` [PATCH 5/5] KVM: selftests: Test KVM_SET_CLOCK downtime in steal time Dongli Zhang
  4 siblings, 0 replies; 12+ messages in thread
From: Dongli Zhang @ 2026-05-05  0:30 UTC (permalink / raw)
  To: kvm, x86, linux-kselftest
  Cc: seanjc, pbonzini, vkuznets, tglx, mingo, bp, dave.hansen, shuah,
	hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, kprateek.nayak, jgross,
	dwmw2, joe.jin

Add a selftest for the case where a vCPU is re-added on a new host thread,
with its state reset and steal time re-enabled on the same vCPU fd.

Run the vCPU once after enabling steal time to establish a baseline, induce
host-side run_delay, and run it again to capture the accumulated steal
time. Then reset the vCPU state, re-enable steal time, and run the same
vCPU fd from a newly created host thread.

Verify that the first steal time update observed after the vCPU is re-added
stays sane and monotonic relative to the pre-reset value.

This models QEMU's vCPU hot-unplug/hotplug flow. KVM does not destroy the
vCPU fd when a vCPU is removed, while QEMU tears down the old vCPU thread
and later reuses the parked vCPU fd from a new thread when the vCPU is
added again.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../selftests/kvm/x86/steal_time_reset_test.c | 144 ++++++++++++++++++
 2 files changed, 145 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86/steal_time_reset_test.c

diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index 9118a5a51b89..b452d5691a24 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -148,6 +148,7 @@ TEST_GEN_PROGS_x86 += x86/max_vcpuid_cap_test
 TEST_GEN_PROGS_x86 += x86/triple_fault_event_test
 TEST_GEN_PROGS_x86 += x86/recalc_apic_map_test
 TEST_GEN_PROGS_x86 += x86/aperfmperf_test
+TEST_GEN_PROGS_x86 += x86/steal_time_reset_test
 TEST_GEN_PROGS_x86 += access_tracking_perf_test
 TEST_GEN_PROGS_x86 += coalesced_io_test
 TEST_GEN_PROGS_x86 += dirty_log_perf_test
diff --git a/tools/testing/selftests/kvm/x86/steal_time_reset_test.c b/tools/testing/selftests/kvm/x86/steal_time_reset_test.c
new file mode 100644
index 000000000000..6d5991227c4a
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86/steal_time_reset_test.c
@@ -0,0 +1,144 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Verify that resetting a vCPU and re-enabling KVM steal time on the same
+ * vCPU fd does not corrupt the accumulated steal-time value when the vCPU
+ * is re-added on a new host thread.
+ */
+#include <pthread.h>
+#include <asm/kvm_para.h>
+#include "kvm_util.h"
+#include "processor.h"
+
+#define ST_GPA_BASE		(1 << 30)
+
+static void *st_gva;
+static u64 guest_stolen_time;
+static u64 main_steal;
+static u64 thread_steal;
+
+static void guest_code(void)
+{
+	struct kvm_steal_time *st = st_gva;
+
+	WRITE_ONCE(guest_stolen_time, READ_ONCE(st->steal));
+	GUEST_SYNC(0);
+	WRITE_ONCE(guest_stolen_time, READ_ONCE(st->steal));
+	GUEST_DONE();
+}
+
+static void run_vcpu(struct kvm_vcpu *vcpu)
+{
+	struct ucall uc;
+
+	vcpu_run(vcpu);
+
+	switch (get_ucall(vcpu, &uc)) {
+	case UCALL_SYNC:
+	case UCALL_DONE:
+		break;
+	case UCALL_ABORT:
+		REPORT_GUEST_ASSERT(uc);
+	default:
+		TEST_ASSERT(false, "Unexpected exit: %s",
+			    exit_reason_str(vcpu->run->exit_reason));
+	}
+}
+
+static void *do_steal_time(void *arg)
+{
+	struct timespec ts, stop;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	stop = timespec_add_ns(ts, 100 * MIN_RUN_DELAY_NS);
+
+	while (1) {
+		clock_gettime(CLOCK_MONOTONIC, &ts);
+		if (timespec_to_ns(timespec_sub(ts, stop)) >= 0)
+			break;
+	}
+
+	return NULL;
+}
+
+static void *vcpu_thread(void *arg)
+{
+	struct kvm_vcpu *vcpu = arg;
+
+	run_vcpu(vcpu);
+	sync_global_from_guest(vcpu->vm, guest_stolen_time);
+	thread_steal = guest_stolen_time;
+
+	return NULL;
+}
+
+int main(void)
+{
+	struct kvm_x86_state *reset_state;
+	struct kvm_steal_time *st;
+	struct kvm_vcpu *vcpu;
+	pthread_attr_t attr;
+	struct kvm_vm *vm;
+	pthread_t thread;
+	cpu_set_t cpuset;
+	long run_delay;
+
+	ksft_print_header();
+	ksft_set_plan(1);
+
+	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_KVM_STEAL_TIME));
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(0, &cpuset);
+	pthread_attr_init(&attr);
+	pthread_attr_setaffinity_np(&attr, sizeof(cpuset), &cpuset);
+	pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
+
+	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
+
+	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, ST_GPA_BASE, 1, 1, 0);
+	virt_map(vm, ST_GPA_BASE, ST_GPA_BASE, 1);
+
+	st_gva = (void *)ST_GPA_BASE;
+	sync_global_to_guest(vm, st_gva);
+
+	st = addr_gva2hva(vm, ST_GPA_BASE);
+	memset(st, 0, sizeof(*st));
+
+	reset_state = vcpu_save_state(vcpu);
+
+	vcpu_set_msr(vcpu, MSR_KVM_STEAL_TIME, ST_GPA_BASE | KVM_MSR_ENABLED);
+	run_vcpu(vcpu);
+
+	run_delay = get_run_delay();
+	pthread_create(&thread, &attr, do_steal_time, NULL);
+
+	while (get_run_delay() - run_delay < MIN_RUN_DELAY_NS)
+		sched_yield();
+
+	pthread_join(thread, NULL);
+	run_delay = get_run_delay() - run_delay;
+	TEST_ASSERT(run_delay >= MIN_RUN_DELAY_NS,
+		    "Expected run_delay >= %lu, got %ld",
+		    MIN_RUN_DELAY_NS, run_delay);
+
+	run_vcpu(vcpu);
+	sync_global_from_guest(vm, guest_stolen_time);
+	main_steal = guest_stolen_time;
+
+	vcpu_load_state(vcpu, reset_state);
+	vcpu_set_msr(vcpu, MSR_KVM_STEAL_TIME, ST_GPA_BASE | KVM_MSR_ENABLED);
+
+	pthread_create(&thread, NULL, vcpu_thread, vcpu);
+
+	pthread_join(thread, NULL);
+	TEST_ASSERT(thread_steal >= main_steal &&
+		    thread_steal - main_steal < (1ULL << 63),
+		    "Expected sane steal in new vCPU thread: main=%"PRIu64", thread=%"PRIu64,
+		    main_steal, thread_steal);
+	ksft_test_result_pass("reset preserved steal time across threads\n");
+
+	pthread_attr_destroy(&attr);
+	kvm_x86_state_cleanup(reset_state);
+	kvm_vm_free(vm);
+	ksft_finished();
+}
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 5/5] KVM: selftests: Test KVM_SET_CLOCK downtime in steal time
  2026-05-05  0:30 [PATCH 0/5] Fix and enhance KVM steal accounting for both guest and host Dongli Zhang
                   ` (3 preceding siblings ...)
  2026-05-05  0:30 ` [PATCH 4/5] KVM: selftests: Test steal time when re-adding a vCPU on a new thread Dongli Zhang
@ 2026-05-05  0:30 ` Dongli Zhang
  4 siblings, 0 replies; 12+ messages in thread
From: Dongli Zhang @ 2026-05-05  0:30 UTC (permalink / raw)
  To: kvm, x86, linux-kselftest
  Cc: seanjc, pbonzini, vkuznets, tglx, mingo, bp, dave.hansen, shuah,
	hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, kprateek.nayak, jgross,
	dwmw2, joe.jin

Extend kvm_clock_test to enable KVM steal time and verify that a
KVM_SET_CLOCK adjustment with KVM_CLOCK_REALTIME also advances the
guest's steal time when the realtime value is in the past.

For the negative realtime-offset case, verify that the observed steal
time delta is at least the downtime injected through KVM_SET_CLOCK.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 .../selftests/kvm/x86/kvm_clock_test.c        | 42 +++++++++++++++----
 1 file changed, 34 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/kvm/x86/kvm_clock_test.c b/tools/testing/selftests/kvm/x86/kvm_clock_test.c
index 5ad4aeb8e373..7e0f00f21144 100644
--- a/tools/testing/selftests/kvm/x86/kvm_clock_test.c
+++ b/tools/testing/selftests/kvm/x86/kvm_clock_test.c
@@ -28,16 +28,20 @@ static struct test_case test_cases[] = {
 	{ .kvmclock_base = 0, .realtime_offset = 180 * NSEC_PER_SEC },
 };
 
-#define GUEST_SYNC_CLOCK(__stage, __val)			\
-		GUEST_SYNC_ARGS(__stage, __val, 0, 0, 0)
+#define GUEST_SYNC_CLOCK(__stage, __clock, __steal)		\
+		GUEST_SYNC_ARGS(__stage, __clock, __steal, 0, 0)
 
-static void guest_main(gpa_t pvti_pa, struct pvclock_vcpu_time_info *pvti)
+static void guest_main(gpa_t pvti_pa, struct pvclock_vcpu_time_info *pvti,
+			gpa_t st_pa, struct kvm_steal_time *st)
 {
 	int i;
 
 	wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
+	wrmsr(MSR_KVM_STEAL_TIME, st_pa | KVM_MSR_ENABLED);
+
 	for (i = 0; i < ARRAY_SIZE(test_cases); i++)
-		GUEST_SYNC_CLOCK(i, __pvclock_read_cycles(pvti, rdtsc()));
+		GUEST_SYNC_CLOCK(i, __pvclock_read_cycles(pvti, rdtsc()),
+				 READ_ONCE(st->steal));
 }
 
 #define EXPECTED_FLAGS (KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC)
@@ -50,11 +54,13 @@ static inline void assert_flags(struct kvm_clock_data *data)
 }
 
 static void handle_sync(struct ucall *uc, struct kvm_clock_data *start,
-			struct kvm_clock_data *end)
+			struct kvm_clock_data *end,
+			struct test_case *test_case, u64 *last_steal)
 {
-	u64 obs, exp_lo, exp_hi;
+	u64 obs, exp_lo, exp_hi, obs_steal;
 
 	obs = uc->args[2];
+	obs_steal = uc->args[3];
 	exp_lo = start->clock;
 	exp_hi = end->clock;
 
@@ -67,6 +73,18 @@ static void handle_sync(struct ucall *uc, struct kvm_clock_data *start,
 
 	pr_info("kvm-clock value: %"PRIu64" expected range [%"PRIu64", %"PRIu64"]\n",
 		obs, exp_lo, exp_hi);
+
+	if (test_case->realtime_offset < 0) {
+		u64 min_downtime = -test_case->realtime_offset;
+
+		TEST_ASSERT(obs_steal >= *last_steal &&
+			    obs_steal - *last_steal >= min_downtime,
+			    "unexpected steal values: obs=%"PRIu64
+			    " last=%"PRIu64" min_downtime=%"PRIu64,
+			    obs_steal, *last_steal, min_downtime);
+	}
+
+	*last_steal = obs_steal;
 }
 
 static void handle_abort(struct ucall *uc)
@@ -106,6 +124,7 @@ static void enter_guest(struct kvm_vcpu *vcpu)
 {
 	struct kvm_clock_data start, end;
 	struct kvm_vm *vm = vcpu->vm;
+	u64 last_steal = 0;
 	struct ucall uc;
 	int i;
 
@@ -121,7 +140,8 @@ static void enter_guest(struct kvm_vcpu *vcpu)
 
 		switch (get_ucall(vcpu, &uc)) {
 		case UCALL_SYNC:
-			handle_sync(&uc, &start, &end);
+			handle_sync(&uc, &start, &end,
+				    &test_cases[i], &last_steal);
 			break;
 		case UCALL_ABORT:
 			handle_abort(&uc);
@@ -137,6 +157,8 @@ int main(void)
 	struct kvm_vcpu *vcpu;
 	gva_t pvti_gva;
 	gpa_t pvti_gpa;
+	gva_t st_gva;
+	gpa_t st_gpa;
 	struct kvm_vm *vm;
 	int flags;
 
@@ -144,12 +166,16 @@ int main(void)
 	TEST_REQUIRE(flags & KVM_CLOCK_REALTIME);
 
 	TEST_REQUIRE(sys_clocksource_is_based_on_tsc());
+	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_KVM_STEAL_TIME));
 
 	vm = vm_create_with_one_vcpu(&vcpu, guest_main);
 
 	pvti_gva = vm_alloc(vm, getpagesize(), 0x10000);
 	pvti_gpa = addr_gva2gpa(vm, pvti_gva);
-	vcpu_args_set(vcpu, 2, pvti_gpa, pvti_gva);
+	st_gva = vm_alloc(vm, getpagesize(), 0x20000);
+	st_gpa = addr_gva2gpa(vm, st_gva);
+	memset(addr_gva2hva(vm, st_gva), 0, getpagesize());
+	vcpu_args_set(vcpu, 4, pvti_gpa, pvti_gva, st_gpa, st_gva);
 
 	enter_guest(vcpu);
 	kvm_vm_free(vm);
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/5] KVM: x86: Reset vcpu->arch.st.last_steal when enabling steal time
  2026-05-05  0:30 ` [PATCH 2/5] KVM: x86: Reset vcpu->arch.st.last_steal " Dongli Zhang
@ 2026-05-08 22:40   ` Sean Christopherson
  2026-05-10 17:09     ` David Woodhouse
  0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2026-05-08 22:40 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: kvm, x86, linux-kselftest, pbonzini, vkuznets, tglx, mingo, bp,
	dave.hansen, shuah, hpa, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	kprateek.nayak, jgross, dwmw2, joe.jin

On Mon, May 04, 2026, Dongli Zhang wrote:
> KVM does not support vCPU hotplug. When a vCPU is removed, its
> corresponding data structures are not freed by KVM. Instead, QEMU destroys
> only the userspace state and the vCPU thread, while the KVM vCPU fd remains
> open and parked in QEMU.
> 
> As a result, vcpu->arch.st.last_steal is not reset.
> 
> If the same vCPU is later re-created by QEMU, last_steal retains its old
> value, while current->sched_info.run_delay starts from zero since a new
> vCPU thread is created. This causes
> current->sched_info.run_delay - vcpu->arch.st.last_steal to produce a
> large, bogus value.
> 
> Fix this by resetting vcpu->arch.st.last_steal to
> current->sched_info.run_delay when KVM steal time is enabled.

This is quite arbitrary.  E.g. if userspace hands the vCPU off to a different
task without going through QEMU's hotplug dance, then current->sched_info.run_delay
will also change.

Shouldn't x86 hook kvm_arch_vcpu_run_pid_change() and reset last_steal in there?
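For reference, the suggested hook might look something like this (an
untested sketch, not part of the posted series; it assumes
kvm_arch_vcpu_run_pid_change() keeps its current int-returning signature):

```c
/* Untested sketch: reset the steal-time baseline whenever the task that
 * runs the vCPU changes, rather than only when the MSR is written. */
int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
{
	vcpu->arch.st.last_steal = current->sched_info.run_delay;
	return 0;
}
```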

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/5] KVM: x86: Reset vcpu->arch.st.last_steal when enabling steal time
  2026-05-08 22:40   ` Sean Christopherson
@ 2026-05-10 17:09     ` David Woodhouse
  2026-05-10 18:40       ` David Woodhouse
  0 siblings, 1 reply; 12+ messages in thread
From: David Woodhouse @ 2026-05-10 17:09 UTC (permalink / raw)
  To: Sean Christopherson, Dongli Zhang
  Cc: kvm, x86, linux-kselftest, pbonzini, vkuznets, tglx, mingo, bp,
	dave.hansen, shuah, hpa, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	kprateek.nayak, jgross, joe.jin


On Fri, 2026-05-08 at 15:40 -0700, Sean Christopherson wrote:
> On Mon, May 04, 2026, Dongli Zhang wrote:
> > KVM does not support vCPU hotplug. When a vCPU is removed, its
> > corresponding data structures are not freed by KVM. Instead, QEMU destroys
> > only the userspace state and the vCPU thread, while the KVM vCPU fd remains
> > open and parked in QEMU.
> > 
> > As a result, vcpu->arch.st.last_steal is not reset.
> > 
> > If the same vCPU is later re-created by QEMU, last_steal retains its old
> > value, while current->sched_info.run_delay starts from zero since a new
> > vCPU thread is created. This causes
> > current->sched_info.run_delay - vcpu->arch.st.last_steal to produce a
> > large, bogus value.
> > 
> > Fix this by resetting vcpu->arch.st.last_steal to
> > current->sched_info.run_delay when KVM steal time is enabled.
> 
> This is quite arbitrary.  E.g. if userspace hands the vCPU off to a different
> task without going through QEMU's hotplug dance, then current->sched_info.run_delay
> will also change.
> 
> Shouldn't x86 hook kvm_arch_vcpu_run_pid_change() and reset last_steal in there?

I'd like to be sure that we get this right for live update and live migration.

I think we *do* get it right for the Xen runstate info. When the guest
is paused on the source (or before the kexec), the VMM takes the
runstate info which contains the current kvmclock *and* the time spent
in each state.

On restore on the destination (or after kexec), userspace provides that
back to the kernel, along with the KVM clock information.

As the KVM clock has naturally progressed in the intervening time, the
delta between the kvmclock from the runstate, and the kvmclock when the
vCPU actually gets to run again, is reported (correctly) as steal time.




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/5] KVM: x86: Reset vcpu->arch.st.last_steal when enabling steal time
  2026-05-10 17:09     ` David Woodhouse
@ 2026-05-10 18:40       ` David Woodhouse
  0 siblings, 0 replies; 12+ messages in thread
From: David Woodhouse @ 2026-05-10 18:40 UTC (permalink / raw)
  To: Sean Christopherson, Dongli Zhang
  Cc: kvm, x86, linux-kselftest, pbonzini, vkuznets, tglx, mingo, bp,
	dave.hansen, shuah, hpa, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	kprateek.nayak, jgross, joe.jin


On Sun, 2026-05-10 at 18:09 +0100, David Woodhouse wrote:
> On Fri, 2026-05-08 at 15:40 -0700, Sean Christopherson wrote:
> > On Mon, May 04, 2026, Dongli Zhang wrote:
> > > KVM does not support vCPU hotplug. When a vCPU is removed, its
> > > corresponding data structures are not freed by KVM. Instead, QEMU destroys
> > > only the userspace state and the vCPU thread, while the KVM vCPU fd remains
> > > open and parked in QEMU.
> > > 
> > > As a result, vcpu->arch.st.last_steal is not reset.
> > > 
> > > If the same vCPU is later re-created by QEMU, last_steal retains its old
> > > value, while current->sched_info.run_delay starts from zero since a new
> > > vCPU thread is created. This causes
> > > current->sched_info.run_delay - vcpu->arch.st.last_steal to produce a
> > > large, bogus value.
> > > 
> > > Fix this by resetting vcpu->arch.st.last_steal to
> > > current->sched_info.run_delay when KVM steal time is enabled.
> > 
> > This is quite arbitrary.  E.g. if userspace hands the vCPU off to a different
> > task without going through QEMU's hotplug dance, then current->sched_info.run_delay
> > will also change.
> > 
> > Shouldn't x86 hook kvm_arch_vcpu_run_pid_change() and reset last_steal in there?
> 
> I'd like to be sure that we get this right for live update and live migration.
> 
> I think we *do* get it right for the Xen runstate info...

Since I'm adding selftests to my kvmclock branch today... I now *know*
this to be true :)


https://git.infradead.org/?p=users/dwmw2/linux.git;a=commitdiff;h=d667349116

Looks like Sean is right about the pid change though.



* Re: [PATCH 3/5] KVM: x86: account KVM_SET_CLOCK downtime in steal time
  2026-05-05  0:30 ` [PATCH 3/5] KVM: x86: account KVM_SET_CLOCK downtime in " Dongli Zhang
@ 2026-05-10 18:54   ` David Woodhouse
  2026-05-10 19:11     ` H. Peter Anvin
  0 siblings, 1 reply; 12+ messages in thread
From: David Woodhouse @ 2026-05-10 18:54 UTC (permalink / raw)
  To: Dongli Zhang, kvm, x86, linux-kselftest
  Cc: seanjc, pbonzini, vkuznets, tglx, mingo, bp, dave.hansen, shuah,
	hpa, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, kprateek.nayak, jgross,
	joe.jin


On Mon, 2026-05-04 at 17:30 -0700, Dongli Zhang wrote:
> The KVM_CLOCK_REALTIME has been introduced to help track the downtime of
> live migration. KVM uses that realtime value to advance guest clock, but
> the same blackout is not reflected in KVM steal time.
> 
> Account that same delta in steal time directly in kvm_vm_ioctl_set_clock(),
> only when KVM_CLOCK_REALTIME is used. This keeps the KVM-only solution
> self-contained and avoids adding a new KVM ioctl or requiring additional
> userspace changes (i.e. QEMU).
> 
> Record the per-VM downtime delta when KVM_SET_CLOCK receives
> KVM_CLOCK_REALTIME, and fold it into the existing x86 steal accounting
> path. Initialize each vCPU's local cursor
> (vcpu->arch.st.last_downtime_steal) when the guest enables
> MSR_KVM_STEAL_TIME so previously accumulated blackout is not charged.
> 
> Note that this means a vCPU may observe additional steal time after
> blackout even if the host side contribution from current->sched_info
> did not increase during that interval.
> 
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>

I really don't want to see KVM_CLOCK_REALTIME used for anything more
than it already is. Or, indeed, even for that.

There is precisely *one* place where it's OK to use 'real time' as a
comparator, and that's when setting the guest's TSC. And even then it
should be using TAI not UTC unless you like your guests' clocks jumping
around by a second if you migrate at the wrong time. KVM_CLOCK_REALTIME
was never the right thing to use, for anything.

The KVM clock is a function of the guest's TSC (see
KVM_SET_CLOCK_GUEST), and steal time is a function of that (as it's
measured in nanoseconds).

Don't bring UTC into it *anywhere*.





* Re: [PATCH 3/5] KVM: x86: account KVM_SET_CLOCK downtime in steal time
  2026-05-10 18:54   ` David Woodhouse
@ 2026-05-10 19:11     ` H. Peter Anvin
  2026-05-10 20:13       ` David Woodhouse
  0 siblings, 1 reply; 12+ messages in thread
From: H. Peter Anvin @ 2026-05-10 19:11 UTC (permalink / raw)
  To: David Woodhouse, Dongli Zhang, kvm, x86, linux-kselftest
  Cc: seanjc, pbonzini, vkuznets, tglx, mingo, bp, dave.hansen, shuah,
	peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, kprateek.nayak, jgross, joe.jin

On May 10, 2026 11:54:38 AM PDT, David Woodhouse <dwmw2@infradead.org> wrote:
>On Mon, 2026-05-04 at 17:30 -0700, Dongli Zhang wrote:
>> The KVM_CLOCK_REALTIME has been introduced to help track the downtime of
>> live migration. KVM uses that realtime value to advance guest clock, but
>> the same blackout is not reflected in KVM steal time.
>> 
>> Account that same delta in steal time directly in kvm_vm_ioctl_set_clock(),
>> only when KVM_CLOCK_REALTIME is used. This keeps the KVM-only solution
>> self-contained and avoids adding a new KVM ioctl or requiring additional
>> userspace changes (i.e. QEMU).
>> 
>> Record the per-VM downtime delta when KVM_SET_CLOCK receives
>> KVM_CLOCK_REALTIME, and fold it into the existing x86 steal accounting
>> path. Initialize each vCPU's local cursor
>> (vcpu->arch.st.last_downtime_steal) when the guest enables
>> MSR_KVM_STEAL_TIME so previously accumulated blackout is not charged.
>> 
>> Note that this means a vCPU may observe additional steal time after
>> blackout even if the host side contribution from current->sched_info
>> did not increase during that interval.
>> 
>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>
>I really don't want to see KVM_CLOCK_REALTIME used for anything more
>than it already is. Or, indeed, even for that.
>
>There is precisely *one* place where it's OK to use 'real time' as a
>comparator, and that's when setting the guest's TSC. And even then it
>should be using TAI not UTC unless you like your guests' clocks jumping
>around by a second if you migrate at the wrong time. KVM_CLOCK_REALTIME
>was never the right thing to use, for anything.
>
>The KVM clock is a function of the guest's TSC (see
>KVM_SET_CLOCK_GUEST), and steal time is a function of that (as it's
>measured in nanoseconds).
>
>Don't bring UTC into it *anywhere*.
>
>

Unfortunately TAI is often unavailable. One can hope that the proposal of abolishing leap seconds by 2035, fixing the TAI-UTC offset permanently, actually happens. The difference between atomic and solar time is better handled with the already-existing "time zones" mechanism, which tends to change far more frequently for entirely different reasons than the TAI-UT1 difference slowly accumulates.


* Re: [PATCH 3/5] KVM: x86: account KVM_SET_CLOCK downtime in steal time
  2026-05-10 19:11     ` H. Peter Anvin
@ 2026-05-10 20:13       ` David Woodhouse
  0 siblings, 0 replies; 12+ messages in thread
From: David Woodhouse @ 2026-05-10 20:13 UTC (permalink / raw)
  To: H. Peter Anvin, Dongli Zhang, kvm, x86, linux-kselftest
  Cc: seanjc, pbonzini, vkuznets, tglx, mingo, bp, dave.hansen, shuah,
	peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, kprateek.nayak, jgross, joe.jin


On Sun, 2026-05-10 at 12:11 -0700, H. Peter Anvin wrote:
> On May 10, 2026 11:54:38 AM PDT, David Woodhouse <dwmw2@infradead.org> wrote:
> > On Mon, 2026-05-04 at 17:30 -0700, Dongli Zhang wrote:
> > > The KVM_CLOCK_REALTIME has been introduced to help track the downtime of
> > > live migration. KVM uses that realtime value to advance guest clock, but
> > > the same blackout is not reflected in KVM steal time.
> > > 
> > > Account that same delta in steal time directly in kvm_vm_ioctl_set_clock(),
> > > only when KVM_CLOCK_REALTIME is used. This keeps the KVM-only solution
> > > self-contained and avoids adding a new KVM ioctl or requiring additional
> > > userspace changes (i.e. QEMU).
> > > 
> > > Record the per-VM downtime delta when KVM_SET_CLOCK receives
> > > KVM_CLOCK_REALTIME, and fold it into the existing x86 steal accounting
> > > path. Initialize each vCPU's local cursor
> > > (vcpu->arch.st.last_downtime_steal) when the guest enables
> > > MSR_KVM_STEAL_TIME so previously accumulated blackout is not charged.
> > > 
> > > Note that this means a vCPU may observe additional steal time after
> > > blackout even if the host side contribution from current->sched_info
> > > did not increase during that interval.
> > > 
> > > Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> > 
> > I really don't want to see KVM_CLOCK_REALTIME used for anything more
> > than it already is. Or, indeed, even for that.
> > 
> > There is precisely *one* place where it's OK to use 'real time' as a
> > comparator, and that's when setting the guest's TSC. And even then it
> > should be using TAI not UTC unless you like your guests' clocks jumping
> > around by a second if you migrate at the wrong time. KVM_CLOCK_REALTIME
> > was never the right thing to use, for anything.
> > 
> > The KVM clock is a function of the guest's TSC (see
> > KVM_SET_CLOCK_GUEST), and steal time is a function of that (as it's
> > measured in nanoseconds).
> > 
> > Don't bring UTC into it *anywhere*.
> > 
> > 
> 
> Unfortunately TAI is often unavailable. One can hope that the
> proposal of abolishing leap seconds by 2035, fixing the TAI-UTC
> offset permanently, actually happens.

I was hoping for the opposite; it's just pandering to stupid bugs.
Yes, leap seconds are fairly rare; instead maybe we should *always*
have a leap second in one direction or the other at the end of the
year. Otherwise it's just building up to be a bigger problem later.

>  The difference between atomic and solar time is better handled with
> the already-existing "time zones" mechanism, which tends to change
> far more frequently for entirely different reasons than the TAI-UT1
> difference slowly accumulates.

I have absolutely no faith in a 'time zones with second precision'
model ever actually working either. Although maybe if we ditched UTC
completely (and the pointless 37-second offset frozen in time for ever
like the GPS offset), and our second-precision time zones were based on
*TAI* we could exercise them from day one?

Either way, as long as it isn't the awful abomination of *smearing*
leap seconds and screwing up time precision, nobody actually needs to
be nailed to anything.

And none of it matters here for *steal time*, since the *only* thing in
a migration that should be based on any kind of real time is the guest
TSC, and everything else should be based purely on that (perhaps via
the kvmclock).

And even then it's only for live *migration* to a different host, as
live update on the same host across kexec should be purely based on
offsets from the host's TSC which remains unperturbed.



end of thread, other threads:[~2026-05-10 20:13 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-05-05  0:30 [PATCH 0/5] Fix and enhance KVM steal accounting for both guest and host Dongli Zhang
2026-05-05  0:30 ` [PATCH 1/5] x86/kvm: Reset prev_steal_time and prev_steal_time_rq when enabling steal time Dongli Zhang
2026-05-05  0:30 ` [PATCH 2/5] KVM: x86: Reset vcpu->arch.st.last_steal " Dongli Zhang
2026-05-08 22:40   ` Sean Christopherson
2026-05-10 17:09     ` David Woodhouse
2026-05-10 18:40       ` David Woodhouse
2026-05-05  0:30 ` [PATCH 3/5] KVM: x86: account KVM_SET_CLOCK downtime in " Dongli Zhang
2026-05-10 18:54   ` David Woodhouse
2026-05-10 19:11     ` H. Peter Anvin
2026-05-10 20:13       ` David Woodhouse
2026-05-05  0:30 ` [PATCH 4/5] KVM: selftests: Test steal time when re-adding a vCPU on a new thread Dongli Zhang
2026-05-05  0:30 ` [PATCH 5/5] KVM: selftests: Test KVM_SET_CLOCK downtime in steal time Dongli Zhang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox