From: Peter Hilber <peter.hilber@opensynergy.com>
To: David Woodhouse <dwmw2@infradead.org>,
linux-kernel@vger.kernel.org, virtualization@lists.linux.dev,
linux-arm-kernel@lists.infradead.org, linux-rtc@vger.kernel.org,
"Ridoux, Julien" <ridouxj@amazon.com>,
virtio-dev@lists.linux.dev, "Luu, Ryan" <rluu@amazon.com>
Cc: "Christopher S. Hall" <christopher.s.hall@intel.com>,
Jason Wang <jasowang@redhat.com>,
John Stultz <jstultz@google.com>,
"Michael S. Tsirkin" <mst@redhat.com>,
netdev@vger.kernel.org,
Richard Cochran <richardcochran@gmail.com>,
Stephen Boyd <sboyd@kernel.org>,
Thomas Gleixner <tglx@linutronix.de>,
Xuan Zhuo <xuanzhuo@linux.alibaba.com>,
Marc Zyngier <maz@kernel.org>,
Mark Rutland <mark.rutland@arm.com>,
Daniel Lezcano <daniel.lezcano@linaro.org>,
Alessandro Zummo <a.zummo@towertech.it>,
Alexandre Belloni <alexandre.belloni@bootlin.com>
Subject: Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
Date: Thu, 27 Jun 2024 15:50:27 +0200 [thread overview]
Message-ID: <8d9d7ce2-4dd1-4f54-a468-79ef5970a708@opensynergy.com> (raw)
In-Reply-To: <4a0a240dffc21dde4d69179288547b945142259f.camel@infradead.org>
On 25.06.24 21:01, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> The vmclock "device" provides a shared memory region with precision clock
> information. By using shared memory, it is safe across Live Migration.
>
> Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> actually helpful.
>
> The memory region of the device is also exposed to userspace so it can be
> read or memory mapped by application which need reliable notification of
> clock disruptions.
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>
> v2:
> • Add gettimex64() support
> • Convert TSC values to KVM clock when appropriate
> • Require int128 support
> • Add counter_period_shift
> • Add timeout when seq_count is invalid
> • Add flags field
> • Better comments in vmclock ABI structure
> • Explicitly forbid smearing (as clock rates would need to change)
Leap second smearing information could still be conveyed through the
vmclock_abi. AFAIU, to cover the popular smearing variants, it should be
enough to indicate whether the driver should apply linear or cosine
smearing, and the start time and end time.
>
> drivers/ptp/Kconfig | 13 +
> drivers/ptp/Makefile | 1 +
> drivers/ptp/ptp_vmclock.c | 516 +++++++++++++++++++++++++++++++++++
> include/uapi/linux/vmclock.h | 138 ++++++++++
> 4 files changed, 668 insertions(+)
> create mode 100644 drivers/ptp/ptp_vmclock.c
> create mode 100644 include/uapi/linux/vmclock.h
>
[...]
> +
> +/*
> + * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
> + * and add the fractional second part of the reference time.
> + *
> + * The result is a 128-bit value, the top 64 bits of which are seconds, and
> + * the low 64 bits are (seconds >> 64).
> + *
> + * If __int128 isn't available, perform the calculation 32 bits at a time to
> + * avoid overflow.
> + */
> +static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t delta,
> + uint64_t period, uint8_t shift,
> + uint64_t frac_sec)
> +{
> + unsigned __int128 res = (unsigned __int128)delta * period;
> +
> + res >>= shift;
> + res += frac_sec;
> + *res_hi = res >> 64;
> + return (uint64_t)res;
> +}
> +
> +static int vmclock_get_crosststamp(struct vmclock_state *st,
> + struct ptp_system_timestamp *sts,
> + struct system_counterval_t *system_counter,
> + struct timespec64 *tspec)
> +{
> + ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
> + struct system_time_snapshot systime_snapshot;
> + uint64_t cycle, delta, seq, frac_sec;
> +
> +#ifdef CONFIG_X86
> + /*
> + * We'd expect the hypervisor to know this and to report the clock
> + * status as VMCLOCK_STATUS_UNRELIABLE. But be paranoid.
> + */
> + if (check_tsc_unstable())
> + return -EINVAL;
> +#endif
> +
> + while (1) {
> + seq = st->clk->seq_count & ~1ULL;
> + virt_rmb();
> +
> + if (st->clk->clock_status == VMCLOCK_STATUS_UNRELIABLE)
> + return -EINVAL;
> +
> + /*
> + * When invoked for gettimex64(), fill in the pre/post system
> + * times. The simple case is when system time is based on the
> + * same counter as st->cs_id, in which case all three times
> + * will be derived from the *same* counter value.
> + *
> + * If the system isn't using the same counter, then the value
> + * from ktime_get_snapshot() will still be used as pre_ts, and
> + * ptp_read_system_postts() is called to populate postts after
> + * calling get_cycles().
> + *
> + * The conversion to timespec64 happens further down, outside
> + * the seq_count loop.
> + */
> + if (sts) {
> + ktime_get_snapshot(&systime_snapshot);
> + if (systime_snapshot.cs_id == st->cs_id) {
> + cycle = systime_snapshot.cycles;
> + } else {
> + cycle = get_cycles();
> + ptp_read_system_postts(sts);
> + }
> + } else
> + cycle = get_cycles();
> +
> + delta = cycle - st->clk->counter_value;
AFAIU in the general case this needs to be masked for non 64-bit counters.
> +
> + frac_sec = mul_u64_u64_shr_add_u64(&tspec->tv_sec, delta,
> + st->clk->counter_period_frac_sec,
> + st->clk->counter_period_shift,
> + st->clk->utc_time_frac_sec);
> + tspec->tv_nsec = mul_u64_u64_shr(frac_sec, NSEC_PER_SEC, 64);
> + tspec->tv_sec += st->clk->utc_time_sec;
> +
> + virt_rmb();
> + if (seq == st->clk->seq_count)
> + break;
> +
> + if (ktime_after(ktime_get(), deadline))
> + return -ETIMEDOUT;
> + }
> +
> + if (system_counter) {
> + system_counter->cycles = cycle;
> + system_counter->cs_id = st->cs_id;
> + }
> +
> + if (sts) {
> + sts->pre_ts = ktime_to_timespec64(systime_snapshot.real);
> + if (systime_snapshot.cs_id == st->cs_id)
> + sts->post_ts = sts->pre_ts;
> + }
> +
> + return 0;
> +}
> +
[...]
> +
> +static const struct ptp_clock_info ptp_vmclock_info = {
> + .owner = THIS_MODULE,
> + .max_adj = 0,
> + .n_ext_ts = 0,
> + .n_pins = 0,
> + .pps = 0,
> + .adjfine = ptp_vmclock_adjfine,
> + .adjtime = ptp_vmclock_adjtime,
> + .gettime64 = ptp_vmclock_gettime,
The .gettime64 op is now unneeded.
> + .gettimex64 = ptp_vmclock_gettimex,
> + .settime64 = ptp_vmclock_settime,
> + .enable = ptp_vmclock_enable,
> + .getcrosststamp = ptp_vmclock_getcrosststamp,
> +};
> +
[...]
> diff --git a/include/uapi/linux/vmclock.h b/include/uapi/linux/vmclock.h
> new file mode 100644
> index 000000000000..cf0f22205e79
> --- /dev/null
> +++ b/include/uapi/linux/vmclock.h
> @@ -0,0 +1,138 @@
> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
> +
> +/*
> + * This structure provides a vDSO-style clock to VM guests, exposing the
> + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
> + * counter, etc.) and real time. It is designed to address the problem of
> + * live migration, which other clock enlightenments do not.
> + *
> + * When a guest is live migrated, this affects the clock in two ways.
> + *
> + * First, even between identical hosts the actual frequency of the underlying
> + * counter will change within the tolerances of its specification (typically
> + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the
> + * same host, but can be tracked by NTP as it generally varies slowly. With
> + * live migration there is a step change in the frequency, with no warning.
> + *
> + * Second, there may be a step change in the value of the counter itself, as
> + * its accuracy is limited by the precision of the NTP synchronization on the
> + * source and destination hosts.
> + *
> + * So any calibration (NTP, PTP, etc.) which the guest has done on the source
> + * host before migration is invalid, and needs to be redone on the new host.
> + *
> + * In its most basic mode, this structure provides only an indication to the
> + * guest that live migration has occurred. This allows the guest to know that
> + * its clock is invalid and take remedial action. For applications that need
> + * reliable accurate timestamps (e.g. distributed databases), the structure
> + * can be mapped all the way to userspace. This allows the application to see
> + * directly for itself that the clock is disrupted and take appropriate
> + * action, even when using a vDSO-style method to get the time instead of a
> + * system call.
> + *
> + * In its more advanced mode. this structure can also be used to expose the
> + * precise relationship of the CPU counter to real time, as calibrated by the
> + * host. This means that userspace applications can have accurate time
> + * immediately after live migration, rather than having to pause operations
> + * and wait for NTP to recover. This mode does, of course, rely on the
> + * counter being reliable and consistent across CPUs.
> + *
> + * Note that this must be true UTC, never with smeared leap seconds. If a
> + * guest wishes to construct a smeared clock, it can do so. Presenting a
> + * smeared clock through this interface would be problematic because it
> + * actually messes with the apparent counter *period*. A linear smearing
> + * of 1 ms per second would effectively tweak the counter period by 1000PPM
> + * at the start/end of the smearing period, while a sinusoidal smear would
> + * basically be impossible to represent.
Clock types other than UTC could also be supported: TAI, monotonic.
> + */
> +
> +#ifndef __VMCLOCK_H__
> +#define __VMCLOCK_H__
> +
> +#ifdef __KERNEL__
> +#include <linux/types.h>
> +#else
> +#include <stdint.h>
> +#endif
> +
> +struct vmclock_abi {
> + uint32_t magic;
> +#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */
> + uint16_t size; /* Size of page containing this structure */
> + uint16_t version; /* 1 */
> +
> + /* Sequence lock. Low bit means an update is in progress. */
> + uint64_t seq_count;
> +
> + /*
> + * This field changes to another non-repeating value when the CPU
> + * counter is disrupted, for example on live migration.
> + */
> + uint64_t disruption_marker;
The field could also change when the clock is stepped (leap seconds
excepted), or when the clock frequency is slewed.
next prev parent reply other threads:[~2024-06-27 13:51 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-12-18 7:38 [virtio-dev] [RFC PATCH v3 0/7] Add virtio_rtc module and related changes Peter Hilber
2023-12-18 7:38 ` [virtio-dev] [RFC PATCH v3 4/7] virtio_rtc: Add module and driver core Peter Hilber
2023-12-18 7:38 ` [virtio-dev] [RFC PATCH v3 5/7] virtio_rtc: Add PTP clocks Peter Hilber
2023-12-18 7:38 ` [virtio-dev] [RFC PATCH v3 6/7] virtio_rtc: Add Arm Generic Timer cross-timestamping Peter Hilber
[not found] ` <e410d65754ba6a11ad7f74b27dc28d9a25d8c82e.camel@infradead.org>
2024-06-20 12:06 ` Peter Hilber
2023-12-18 7:38 ` [virtio-dev] [RFC PATCH v3 7/7] virtio_rtc: Add RTC class driver Peter Hilber
[not found] ` <684eac07834699889fdb67be4cee09319c994a42.camel@infradead.org>
2024-06-20 12:37 ` [RFC PATCH v3 0/7] Add virtio_rtc module and related changes Peter Hilber
2024-06-20 16:19 ` David Woodhouse
2024-06-21 8:45 ` David Woodhouse
2024-06-25 19:01 ` [RFC PATCH v2] ptp: Add vDSO-style vmclock support David Woodhouse
2024-06-27 13:50 ` Peter Hilber [this message]
2024-06-27 14:52 ` David Woodhouse
2024-06-28 11:33 ` Peter Hilber
2024-06-28 12:15 ` David Woodhouse
2024-06-28 16:38 ` Peter Hilber
2024-06-28 21:27 ` David Woodhouse
2024-07-01 8:57 ` David Woodhouse
2024-07-02 15:03 ` Peter Hilber
2024-07-02 16:39 ` David Woodhouse
2024-07-02 18:12 ` Peter Hilber
2024-07-02 18:40 ` David Woodhouse
2024-07-03 9:56 ` Peter Hilber
2024-07-03 10:40 ` David Woodhouse
2024-07-05 8:12 ` Peter Hilber
2024-07-05 15:02 ` David Woodhouse
2024-07-06 7:50 ` Peter Hilber
2024-06-27 16:03 ` David Woodhouse
2024-06-28 11:33 ` Peter Hilber
2024-06-28 11:41 ` David Woodhouse
[not found] ` <20240630132859.GC17134@kernel.org>
2024-07-01 8:02 ` David Woodhouse
[not found] ` <202407010838.D45C67B86@keescook>
2024-07-03 8:00 ` David Woodhouse
2024-06-27 13:50 ` [RFC PATCH v3 0/7] Add virtio_rtc module and related changes Peter Hilber
2024-06-21 14:02 ` David Woodhouse
[not found] <87jzic4sgv.ffs@tglx>
2024-06-25 21:48 ` [RFC PATCH v2] ptp: Add vDSO-style vmclock support David Woodhouse
[not found] ` <CANDhNCpi_MyGWH2jZcSRB4RU28Ga08Cqm8cyY_6wkZhNMJsNSQ@mail.gmail.com>
2024-06-26 8:32 ` David Woodhouse
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8d9d7ce2-4dd1-4f54-a468-79ef5970a708@opensynergy.com \
--to=peter.hilber@opensynergy.com \
--cc=a.zummo@towertech.it \
--cc=alexandre.belloni@bootlin.com \
--cc=christopher.s.hall@intel.com \
--cc=daniel.lezcano@linaro.org \
--cc=dwmw2@infradead.org \
--cc=jasowang@redhat.com \
--cc=jstultz@google.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-rtc@vger.kernel.org \
--cc=mark.rutland@arm.com \
--cc=maz@kernel.org \
--cc=mst@redhat.com \
--cc=netdev@vger.kernel.org \
--cc=richardcochran@gmail.com \
--cc=ridouxj@amazon.com \
--cc=rluu@amazon.com \
--cc=sboyd@kernel.org \
--cc=tglx@linutronix.de \
--cc=virtio-dev@lists.linux.dev \
--cc=virtualization@lists.linux.dev \
--cc=xuanzhuo@linux.alibaba.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox