* [PATCH] arm64: kvm: Expose timer offset directly via KVM_{GET,SET}_ONE_REG
@ 2023-02-02 12:13 Simon Veith
2023-02-02 12:54 ` David Woodhouse
2023-02-02 13:50 ` Marc Zyngier
0 siblings, 2 replies; 5+ messages in thread
From: Simon Veith @ 2023-02-02 12:13 UTC (permalink / raw)
Cc: Simon Veith, dwmw2, Catalin Marinas, Will Deacon, Marc Zyngier,
James Morse, Suzuki K Poulose, Oliver Upton, Zenghui Yu,
linux-arm-kernel
The virtual timer count register (CNTVCT_EL0) is virtualized by
configuring offset register CNTVOFF_EL2 to subtract from the underlying
raw hardware timer count when the guest reads the current count.
Currently, we offer userspace the ability to serialize and deserialize
only the absolute count register value, using KVM_{GET,SET}_ONE_REG with
KVM_REG_ARM_TIMER_CNT. Internally, we then compute and set the offset
register accordingly to obtain the requested count value.
Allowing to set this timer count register only by absolute value poses
some problems to virtual machine monitors that try to maintain the
illusion of continuously ticking clocks to the guest: In workflows like
live migration or liveupdate, the timers must be increased artificially
to account for pause time.
Any delays between userspace computing the correct timer count value and
actually setting it in kernel space by KVM_SET_ONE_REG (such as can be
incurred by scheduling) become visible as under-accounted pause time in
the guest, meaning the guest observes that its system clock seems to
have fallen behind its NTP time reference.
The issue is further complicated when vCPU setup is performed by
independent threads which may experience different delays, leading to
jitter between the clocks of different vCPUs.
We could deliver a more stable timer in such scenarios if we allowed
userspace to set the offset with regards to the physical counter
directly.
Expose the KVM_REG_ARM_TIMER_OFF register directly to userspace, as an
alternative view of the timer counts. By default, userspace still sees
only the existing KVM_REG_ARM_TIMER_CNT register when querying the list
with KVM_GET_REG_LIST, as that register value is portable across
different VM hosts and thus safe to persist.
Signed-off-by: Simon Veith <sveith@amazon.de>
CC: dwmw2@infradead.org
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will@kernel.org>
CC: Marc Zyngier <maz@kernel.org>
CC: James Morse <james.morse@arm.com>
CC: Suzuki K Poulose <suzuki.poulose@arm.com>
CC: Oliver Upton <oliver.upton@linux.dev>
CC: Zenghui Yu <yuzenghui@huawei.com>
CC: linux-arm-kernel@lists.infradead.org
---
arch/arm64/include/uapi/asm/kvm.h | 1 +
arch/arm64/kvm/arch_timer.c | 10 ++++++++++
arch/arm64/kvm/guest.c | 9 +++++++++
include/kvm/arm_arch_timer.h | 1 +
4 files changed, 21 insertions(+)
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index a7a857f1784d..077699e403ab 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -260,6 +260,7 @@ struct kvm_arm_copy_mte_tags {
#define KVM_REG_ARM_TIMER_CTL ARM64_SYS_REG(3, 3, 14, 3, 1)
#define KVM_REG_ARM_TIMER_CVAL ARM64_SYS_REG(3, 3, 14, 0, 2)
#define KVM_REG_ARM_TIMER_CNT ARM64_SYS_REG(3, 3, 14, 3, 2)
+#define KVM_REG_ARM_TIMER_OFF ARM64_SYS_REG(3, 4, 14, 0, 3)
/* KVM-as-firmware specific pseudo-registers */
#define KVM_REG_ARM_FW (0x0014 << KVM_REG_ARM_COPROC_SHIFT)
diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
index bb24a76b4224..f68b9edbea6b 100644
--- a/arch/arm64/kvm/arch_timer.c
+++ b/arch/arm64/kvm/arch_timer.c
@@ -830,6 +830,9 @@ int kvm_arm_timer_set_reg(struct kvm_vcpu *vcpu, u64 regid, u64 value)
timer = vcpu_vtimer(vcpu);
update_vtimer_cntvoff(vcpu, kvm_phys_timer_read() - value);
break;
+ case KVM_REG_ARM_TIMER_OFF:
+ update_vtimer_cntvoff(vcpu, value);
+ break;
case KVM_REG_ARM_TIMER_CVAL:
timer = vcpu_vtimer(vcpu);
kvm_arm_timer_write(vcpu, timer, TIMER_REG_CVAL, value);
@@ -875,6 +878,9 @@ u64 kvm_arm_timer_get_reg(struct kvm_vcpu *vcpu, u64 regid)
case KVM_REG_ARM_TIMER_CNT:
return kvm_arm_timer_read(vcpu,
vcpu_vtimer(vcpu), TIMER_REG_CNT);
+ case KVM_REG_ARM_TIMER_OFF:
+ return kvm_arm_timer_read(vcpu,
+ vcpu_vtimer(vcpu), TIMER_REG_OFF);
case KVM_REG_ARM_TIMER_CVAL:
return kvm_arm_timer_read(vcpu,
vcpu_vtimer(vcpu), TIMER_REG_CVAL);
@@ -915,6 +921,10 @@ static u64 kvm_arm_timer_read(struct kvm_vcpu *vcpu,
val = kvm_phys_timer_read() - timer_get_offset(timer);
break;
+ case TIMER_REG_OFF:
+ val = timer_get_offset(timer);
+ break;
+
default:
BUG();
}
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index cf4c495a4321..a934189e1811 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -586,6 +586,10 @@ static unsigned long num_core_regs(const struct kvm_vcpu *vcpu)
/**
* ARM64 versions of the TIMER registers, always available on arm64
+ *
+ * Note that the _OFF register is another view of the _CNT register, and is
+ * therefore not counted separately in NUM_TIMER_REGS, nor included in
+ * KVM_GET_REG_LIST.
*/
#define NUM_TIMER_REGS 3
@@ -595,6 +599,7 @@ static bool is_timer_reg(u64 index)
switch (index) {
case KVM_REG_ARM_TIMER_CTL:
case KVM_REG_ARM_TIMER_CNT:
+ case KVM_REG_ARM_TIMER_OFF:
case KVM_REG_ARM_TIMER_CVAL:
return true;
}
@@ -609,6 +614,10 @@ static int copy_timer_indices(struct kvm_vcpu *vcpu, u64 __user *uindices)
if (put_user(KVM_REG_ARM_TIMER_CNT, uindices))
return -EFAULT;
uindices++;
+ /*
+ * KVM_REG_ARM_TIMER_OFF is another view of KVM_REG_ARM_TIMER_CNT and
+ * therefore not included in the register list.
+ */
if (put_user(KVM_REG_ARM_TIMER_CVAL, uindices))
return -EFAULT;
diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h
index cd6d8f260eab..66de7aa2018e 100644
--- a/include/kvm/arm_arch_timer.h
+++ b/include/kvm/arm_arch_timer.h
@@ -21,6 +21,7 @@ enum kvm_arch_timer_regs {
TIMER_REG_CVAL,
TIMER_REG_TVAL,
TIMER_REG_CTL,
+ TIMER_REG_OFF,
};
struct arch_timer_context {
--
2.34.1
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH] arm64: kvm: Expose timer offset directly via KVM_{GET,SET}_ONE_REG
2023-02-02 12:13 [PATCH] arm64: kvm: Expose timer offset directly via KVM_{GET,SET}_ONE_REG Simon Veith
@ 2023-02-02 12:54 ` David Woodhouse
2023-02-02 13:50 ` Marc Zyngier
1 sibling, 0 replies; 5+ messages in thread
From: David Woodhouse @ 2023-02-02 12:54 UTC (permalink / raw)
To: Simon Veith
Cc: Catalin Marinas, Will Deacon, Marc Zyngier, James Morse,
Suzuki K Poulose, Oliver Upton, Zenghui Yu, linux-arm-kernel
[-- Attachment #1.1: Type: text/plain, Size: 2983 bytes --]
On Thu, 2023-02-02 at 13:13 +0100, Simon Veith wrote:
> The virtual timer count register (CNTVCT_EL0) is virtualized by
> configuring offset register CNTVOFF_EL2 to subtract from the underlying
> raw hardware timer count when the guest reads the current count.
>
> Currently, we offer userspace the ability to serialize and deserialize
> only the absolute count register value, using KVM_{GET,SET}_ONE_REG with
> KVM_REG_ARM_TIMER_CNT. Internally, we then compute and set the offset
> register accordingly to obtain the requested count value.
>
> Allowing to set this timer count register only by absolute value poses
> some problems to virtual machine monitors that try to maintain the
> illusion of continuously ticking clocks to the guest: In workflows like
> live migration or liveupdate, the timers must be increased artificially
> to account for pause time.
>
> Any delays between userspace computing the correct timer count value and
> actually setting it in kernel space by KVM_SET_ONE_REG (such as can be
> incurred by scheduling) become visible as under-accounted pause time in
> the guest, meaning the guest observes that its system clock seems to
> have fallen behind its NTP time reference.
>
> The issue is further complicated when vCPU setup is performed by
> independent threads which may experience different delays, leading to
> jitter between the clocks of different vCPUs.
>
> We could deliver a more stable timer in such scenarios if we allowed
> userspace to set the offset with regards to the physical counter
> directly.
LGTM in principle. To allow for correct timekeeping across a live
update — whether it just be restarting the VMM, or kexec into a new
kernel and start a new VMM — we absolutely need to have preserve the
*offset*, and none of this "the timer was <x> then and now it's about
<y> later so let's calculate what the timer should have been at the
start of this sentence and set it to roughly that..." as you describe
in your commit message.
>
> Expose the KVM_REG_ARM_TIMER_OFF register directly to userspace, as an
> alternative view of the timer counts. By default, userspace still sees
> only the existing KVM_REG_ARM_TIMER_CNT register when querying the list
> with KVM_GET_REG_LIST, as that register value is portable across
> different VM hosts and thus safe to persist.
>
> Signed-off-by: Simon Veith <sveith@amazon.de>
>
I don't have a strong opinion on not counting it as a "new" register
and just continuing to list KVM_REG_ARM_TIMER_{CTL,CNT,CVAL} *but*
surely we do need a way for userspace to detect that this new feature
is present in the kernel? I don't think that just "let them try and get
-ENOENT" is the right approach.
So maybe explicitly reporting KVM_REG_ARM_TIMER_OFF *is* the better
choice? After all, userspace needs to make an informed decision about
which to use depending on whether the guest will be resumed on the same
hardware or not.
[-- Attachment #1.2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
[-- Attachment #2: Type: text/plain, Size: 176 bytes --]
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] arm64: kvm: Expose timer offset directly via KVM_{GET,SET}_ONE_REG
2023-02-02 12:13 [PATCH] arm64: kvm: Expose timer offset directly via KVM_{GET,SET}_ONE_REG Simon Veith
2023-02-02 12:54 ` David Woodhouse
@ 2023-02-02 13:50 ` Marc Zyngier
2023-02-02 15:18 ` David Woodhouse
1 sibling, 1 reply; 5+ messages in thread
From: Marc Zyngier @ 2023-02-02 13:50 UTC (permalink / raw)
To: Simon Veith
Cc: dwmw2, Catalin Marinas, Will Deacon, James Morse,
Suzuki K Poulose, Oliver Upton, Zenghui Yu, linux-arm-kernel
Hi Simon,
On Thu, 02 Feb 2023 12:13:14 +0000,
Simon Veith <sveith@amazon.de> wrote:
>
> The virtual timer count register (CNTVCT_EL0) is virtualized by
> configuring offset register CNTVOFF_EL2 to subtract from the underlying
> raw hardware timer count when the guest reads the current count.
>
> Currently, we offer userspace the ability to serialize and deserialize
> only the absolute count register value, using KVM_{GET,SET}_ONE_REG with
> KVM_REG_ARM_TIMER_CNT. Internally, we then compute and set the offset
> register accordingly to obtain the requested count value.
>
> Allowing to set this timer count register only by absolute value poses
> some problems to virtual machine monitors that try to maintain the
> illusion of continuously ticking clocks to the guest: In workflows like
> live migration or liveupdate, the timers must be increased artificially
> to account for pause time.
"must" is a pretty strong word. Given that this isn't advertised as
stolen time to the guest, any sort of time-sensitive process (such as
an in-guest watchdog) is likely to be ticked the wrong way if you
start adding that time to the counter.
For example, QEMU doesn't do that, and wants time continuity, hence
the current behaviour.
>
> Any delays between userspace computing the correct timer count value and
> actually setting it in kernel space by KVM_SET_ONE_REG (such as can be
> incurred by scheduling) become visible as under-accounted pause time in
> the guest, meaning the guest observes that its system clock seems to
> have fallen behind its NTP time reference.
>
> The issue is further complicated when vCPU setup is performed by
> independent threads which may experience different delays, leading to
> jitter between the clocks of different vCPUs.
How? I really hope that you will have restored *all* the vcpus before
restarting any. If you don't, your userspace is buggy.
>
> We could deliver a more stable timer in such scenarios if we allowed
> userspace to set the offset with regards to the physical counter
> directly.
>
> Expose the KVM_REG_ARM_TIMER_OFF register directly to userspace, as an
> alternative view of the timer counts. By default, userspace still sees
> only the existing KVM_REG_ARM_TIMER_CNT register when querying the list
> with KVM_GET_REG_LIST, as that register value is portable across
> different VM hosts and thus safe to persist.
I can see a few things are not quite right with this approach:
- You hijack a register that isn't an EL1 register. This should never
be exposed to a userspace as such, as it would otherwise change
behaviour with NV, which is definitely in control of it.
- What is the ordering between restoring the timer value and restoring
the timer offset? Both do the same thing, and impact all vcpus. How
does it make anything better if your userspace (such as QEMU) saves
*all* the available registers and restores them all, on all vcpus?
- You make this a per-vcpu value. But what this does is to provide an
offset for the *whole VM*. Why not take the bullet and simply make
this a per-VM adjustment?
- What about the physical timer? Doesn't it need some similar
treatment as well, irrespective of the presence of ECV?
We have been around that particular block a few times in the past, and
I may have changed my mind more than once. But as the NV code has
finally reached a point where these things matter, we really shouldn't
go into a direction where we'd end-up with varying semantics depending
on whether CNT{P,V}OFF_EL2 is under control of the host or the guest.
It should also be a feature that is advertised, and bought into from
the VMM. It cannot be an implicit behaviour.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] arm64: kvm: Expose timer offset directly via KVM_{GET,SET}_ONE_REG
2023-02-02 13:50 ` Marc Zyngier
@ 2023-02-02 15:18 ` David Woodhouse
2023-02-06 19:55 ` Oliver Upton
0 siblings, 1 reply; 5+ messages in thread
From: David Woodhouse @ 2023-02-02 15:18 UTC (permalink / raw)
To: Marc Zyngier, Simon Veith
Cc: Catalin Marinas, Will Deacon, James Morse, Suzuki K Poulose,
Oliver Upton, Zenghui Yu, linux-arm-kernel
[-- Attachment #1.1: Type: text/plain, Size: 3917 bytes --]
On Thu, 2023-02-02 at 13:50 +0000, Marc Zyngier wrote:
> Hi Simon,
>
> On Thu, 02 Feb 2023 12:13:14 +0000,
> Simon Veith <sveith@amazon.de> wrote:
> >
> > The virtual timer count register (CNTVCT_EL0) is virtualized by
> > configuring offset register CNTVOFF_EL2 to subtract from the underlying
> > raw hardware timer count when the guest reads the current count.
> >
> > Currently, we offer userspace the ability to serialize and deserialize
> > only the absolute count register value, using KVM_{GET,SET}_ONE_REG with
> > KVM_REG_ARM_TIMER_CNT. Internally, we then compute and set the offset
> > register accordingly to obtain the requested count value.
> >
> > Allowing to set this timer count register only by absolute value poses
> > some problems to virtual machine monitors that try to maintain the
> > illusion of continuously ticking clocks to the guest: In workflows like
> > live migration or liveupdate, the timers must be increased artificially
> > to account for pause time.
>
> "must" is a pretty strong word. Given that this isn't advertised as
> stolen time to the guest, any sort of time-sensitive process (such as
> an in-guest watchdog) is likely to be ticked the wrong way if you
> start adding that time to the counter.
I'd have said that it *is* stolen time. Whether your hypervisor/VMM is
off in the weeds not running your CPU because it's in swap death, or
because it's live updating itself to a new version, it isn't running
the guest vCPUs and that time is stolen.
(Whether we actually account it to the guest as such is a quality of
implementation issue. As it is, doesn't the steal_time we report to a
guest go backwards to zero on migration with no way for the VMM to
restore it? That seems weird...)
And if the delta is so long that watchdogs tick (and if those watchdogs
don't explicitly check for something like x86's PVCLOCK_GUEST_STOPPED),
then those watchdogs are basically working as designed, surely?
> For example, QEMU doesn't do that, and wants time continuity, hence
> the current behaviour.
That's as may be, but VMMs other than QEMU can and do add a delta to
the timer before 'restoring' it, based on a delta calculated from the
wall clock. We're just asking to give those VMMs a more accurate way of
doing it, so the guest timer's relationship to both the host timer and
to actual wall clock / NTP time is *precisely* as it was before without
any drift.
I may yet fix QEMU; I've been working on the Xen PV timer support
across migration, and it's all a bit weird how an immediate stop/start
has it whining about clock skew and TSC instability; I can't tell
what's my bug and what was already broken.
> > Any delays between userspace computing the correct timer count value and
> > actually setting it in kernel space by KVM_SET_ONE_REG (such as can be
> > incurred by scheduling) become visible as under-accounted pause time in
> > the guest, meaning the guest observes that its system clock seems to
> > have fallen behind its NTP time reference.
> >
> > The issue is further complicated when vCPU setup is performed by
> > independent threads which may experience different delays, leading to
> > jitter between the clocks of different vCPUs.
>
> How? I really hope that you will have restored *all* the vcpus before
> restarting any. If you don't, your userspace is buggy.
But the timer counts from the epoch of when the KVM itself was
initialised, doesn't it? I haven't looked hard at the arm side but on
x86 if the various vCPU threads all use the "TSC is <x> now" API, they
only end up in sync because there's a hack for "if it's within one
second of the previously-set vCPU's TSC, make it precisely match". And
then they're only in sync with each *other* rather than what they were
before the live update.
Actually *running* the vCPUs comes later, of course.
>
[-- Attachment #1.2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
[-- Attachment #2: Type: text/plain, Size: 176 bytes --]
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] arm64: kvm: Expose timer offset directly via KVM_{GET,SET}_ONE_REG
2023-02-02 15:18 ` David Woodhouse
@ 2023-02-06 19:55 ` Oliver Upton
0 siblings, 0 replies; 5+ messages in thread
From: Oliver Upton @ 2023-02-06 19:55 UTC (permalink / raw)
To: David Woodhouse
Cc: Marc Zyngier, Simon Veith, Catalin Marinas, Will Deacon,
James Morse, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel
On Thu, Feb 02, 2023 at 03:18:55PM +0000, David Woodhouse wrote:
> On Thu, 2023-02-02 at 13:50 +0000, Marc Zyngier wrote:
> > On Thu, 02 Feb 2023 12:13:14 +0000, Simon Veith <sveith@amazon.de> wrote:
[...]
> > > The issue is further complicated when vCPU setup is performed by
> > > independent threads which may experience different delays, leading to
> > > jitter between the clocks of different vCPUs.
> >
> > How? I really hope that you will have restored *all* the vcpus before
> > restarting any. If you don't, your userspace is buggy.
>
> But the timer counts from the epoch of when the KVM itself was
> initialised, doesn't it? I haven't looked hard at the arm side but on
> x86 if the various vCPU threads all use the "TSC is <x> now" API, they
> only end up in sync because there's a hack for "if it's within one
> second of the previously-set vCPU's TSC, make it precisely match". And
> then they're only in sync with each *other* rather than what they were
> before the live update.
We don't have any analog to the x86 TSC synchronization game on arm64.
The UAPI we have relies strictly on the architecture, which effectively
states that all PEs in a system have a consistent view of the generic
counter.
As it relates to KVM, whenever the virtual counter value is written from
userspace, we apply the calculated offset to all vCPUs in the VM. And
yes, this results in some wasted cycles doing this broadcast from every
single vCPU. But oh well, it hasn't been consequential (yet).
--
Thanks,
Oliver
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2023-02-06 19:56 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-02-02 12:13 [PATCH] arm64: kvm: Expose timer offset directly via KVM_{GET,SET}_ONE_REG Simon Veith
2023-02-02 12:54 ` David Woodhouse
2023-02-02 13:50 ` Marc Zyngier
2023-02-02 15:18 ` David Woodhouse
2023-02-06 19:55 ` Oliver Upton
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).