Re: [PATCH] arm64: kvm: Expose timer offset directly via KVM_{GET,SET}_ONE_REG

From: Marc Zyngier <maz@kernel.org>
To: Simon Veith <sveith@amazon.de>
Cc: <dwmw2@infradead.org>, Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>, James Morse <james.morse@arm.com>,
	Suzuki K Poulose <suzuki.poulose@arm.com>,
	Oliver Upton <oliver.upton@linux.dev>,
	Zenghui Yu <yuzenghui@huawei.com>,
	<linux-arm-kernel@lists.infradead.org>
Date: Thu, 02 Feb 2023 13:50:58 +0000
Message-ID: <86a61w18hp.wl-maz@kernel.org>
In-Reply-To: <20230202121314.206195-1-sveith@amazon.de>

Hi Simon,

On Thu, 02 Feb 2023 12:13:14 +0000,
Simon Veith <sveith@amazon.de> wrote:
> 
> The virtual timer count register (CNTVCT_EL0) is virtualized by
> configuring offset register CNTVOFF_EL2 to subtract from the underlying
> raw hardware timer count when the guest reads the current count.
> 
> Currently, we offer userspace the ability to serialize and deserialize
> only the absolute count register value, using KVM_{GET,SET}_ONE_REG with
> KVM_REG_ARM_TIMER_CNT. Internally, we then compute and set the offset
> register accordingly to obtain the requested count value.
> 
> Allowing to set this timer count register only by absolute value poses
> some problems to virtual machine monitors that try to maintain the
> illusion of continuously ticking clocks to the guest: In workflows like
> live migration or liveupdate, the timers must be increased artificially
> to account for pause time.

"must" is a pretty strong word. Given that this isn't advertised as
stolen time to the guest, any sort of time-sensitive process (such as
an in-guest watchdog) is likely to be ticked the wrong way if you
start adding that time to the counter.

For example, QEMU doesn't do that, and wants time continuity, hence
the current behaviour.

> 
> Any delays between userspace computing the correct timer count value and
> actually setting it in kernel space by KVM_SET_ONE_REG (such as can be
> incurred by scheduling) become visible as under-accounted pause time in
> the guest, meaning the guest observes that its system clock seems to
> have fallen behind its NTP time reference.
> 
> The issue is further complicated when vCPU setup is performed by
> independent threads which may experience different delays, leading to
> jitter between the clocks of different vCPUs.

How? I really hope that you will have restored *all* the vcpus before
restarting any. If you don't, your userspace is buggy.
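
For reference, the arithmetic under discussion can be sketched as a toy
model. This is not kernel code, and every name in it (guest_count,
offset_for_count, example) is hypothetical. It only assumes what the
quoted text states: the guest reads the physical count minus the offset,
and the absolute-count interface derives the offset when the value is
applied.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model, not kernel code. All names here are hypothetical. */

/* Guest view of the virtual counter: physical count minus CNTVOFF_EL2. */
static uint64_t guest_count(uint64_t phys, uint64_t voff)
{
	return phys - voff;	/* unsigned arithmetic wraps modulo 2^64 */
}

/* Absolute-count interface: the offset is derived from the physical
 * count sampled when the value is actually applied. */
static uint64_t offset_for_count(uint64_t phys_at_apply, uint64_t requested)
{
	return phys_at_apply - requested;
}

static void example(void)
{
	uint64_t target = 5000;	/* count userspace wants at phys == 1000 */
	uint64_t delay = 30;	/* ticks that pass before the value is applied */

	/* Applied on time: the guest reads the target at phys == 1000. */
	uint64_t off = offset_for_count(1000, target);
	assert(guest_count(1000, off) == target);

	/* Applied late: the guest reads the target at phys == 1030 instead,
	 * so it ends up 'delay' ticks behind what userspace intended. */
	uint64_t late = offset_for_count(1000 + delay, target);
	assert(guest_count(1000 + delay, off) -
	       guest_count(1000 + delay, late) == delay);
}
```

In this model an offset, once chosen, gives the same guest count no
matter when it is written, which is the property the patch description
is after. Whether the pause time should be folded into the count at all
is a separate question.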

> 
> We could deliver a more stable timer in such scenarios if we allowed
> userspace to set the offset with regards to the physical counter
> directly.
> 
> Expose the KVM_REG_ARM_TIMER_OFF register directly to userspace, as an
> alternative view of the timer counts. By default, userspace still sees
> only the existing KVM_REG_ARM_TIMER_CNT register when querying the list
> with KVM_GET_REG_LIST, as that register value is portable across
> different VM hosts and thus safe to persist.

I can see a few things that are not quite right with this approach:

- You hijack a register that isn't an EL1 register. This should never
  be exposed to userspace as such, as it would otherwise change
  behaviour with NV, which is definitely in control of it.

- What is the ordering between restoring the timer value and restoring
  the timer offset? Both do the same thing, and impact all vcpus. How
  does it make anything better if your userspace (such as QEMU) saves
  *all* the available registers and restores them all, on all vcpus?

- You make this a per-vcpu value. But what this does is provide an
  offset for the *whole VM*. Why not bite the bullet and simply make
  this a per-VM adjustment?

- What about the physical timer? Doesn't it need some similar
  treatment as well, irrespective of the presence of ECV?
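
To illustrate the per-VM point, here is a minimal sketch with a single
offset stored once per VM and read by every vcpu. The structures and
helpers (toy_vm, toy_vcpu, toy_vm_set_offset, toy_vcpu_read_count) are
made up for the example and are not the kernel's data structures or a
proposed UAPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, not the kernel's actual structures. */
struct toy_vm {
	uint64_t voff;		/* one virtual counter offset for the whole VM */
};

struct toy_vcpu {
	struct toy_vm *vm;	/* every vcpu points back at its VM */
};

/* Any vcpu reads the counter through the shared per-VM offset. */
static uint64_t toy_vcpu_read_count(const struct toy_vcpu *vcpu, uint64_t phys)
{
	assert(vcpu != NULL && vcpu->vm != NULL);
	return phys - vcpu->vm->voff;
}

/* A single write changes the view of all vcpus at once. */
static void toy_vm_set_offset(struct toy_vm *vm, uint64_t voff)
{
	vm->voff = voff;
}
```

With the offset held in one place there is no per-vcpu state that could
diverge, and no question of which vcpu's write wins.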

We have been around that particular block a few times in the past, and
I may have changed my mind more than once. But as the NV code has
finally reached a point where these things matter, we really shouldn't
go in a direction where we'd end up with varying semantics depending
on whether CNT{P,V}OFF_EL2 is under control of the host or the guest.

It should also be a feature that is advertised, and bought into by
the VMM. It cannot be an implicit behaviour.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.
