* [virtio-dev] [RFC PATCH v3 0/7] Add virtio_rtc module and related changes
@ 2023-12-18 7:38 Peter Hilber
2023-12-18 7:38 ` [virtio-dev] [RFC PATCH v3 4/7] virtio_rtc: Add module and driver core Peter Hilber
` (4 more replies)
0 siblings, 5 replies; 35+ messages in thread
From: Peter Hilber @ 2023-12-18 7:38 UTC (permalink / raw)
To: linux-kernel, virtualization, virtio-dev, linux-arm-kernel,
linux-rtc
Cc: Peter Hilber, Christopher S. Hall, Jason Wang, John Stultz,
Michael S. Tsirkin, netdev, Richard Cochran, Stephen Boyd,
Thomas Gleixner, Xuan Zhuo, Marc Zyngier, Mark Rutland,
Daniel Lezcano, Alessandro Zummo, Alexandre Belloni
RFC v3 updates
--------------
This series implements a driver for a virtio-rtc device conforming to spec
RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
the PTP clock driver already present before.
This patch series depends on the patch series "treewide: Use clocksource id
for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
series on top of mainline.
Overview
--------
This patch series adds the virtio_rtc module, and related bugfixes. The
virtio_rtc module implements a driver compatible with the proposed Virtio
RTC device specification [1]. The Virtio RTC (Real Time Clock) device
provides information about current time. The device can provide different
clocks, e.g. for the UTC or TAI time standards, or for physical time
elapsed since some past epoch. The driver can read the clocks with simple
or more accurate methods. Optionally, the driver can set an alarm.
The series first fixes some bugs in the get_device_system_crosststamp()
interpolation code, which is required for reliable virtio_rtc operation.
Then, add the virtio_rtc implementation.
For the Virtio RTC device, there is currently a proprietary implementation,
which has been used for provisional testing.
PTP clock interface
-------------------
virtio_rtc exposes clocks as PTP clocks to userspace, similar to ptp_kvm.
If both the Virtio RTC device and this driver have special support for the
current clocksource, time synchronization programs can use
cross-timestamping using ioctl PTP_SYS_OFFSET_PRECISE2 aka
PTP_SYS_OFFSET_PRECISE. Similar to ptp_kvm, system time synchronization
with single-digit ns precision is possible with a quiescent reference clock
(from the Virtio RTC device). This works even when the Virtio device
response is slow compared to ptp_kvm hypercalls.
The following illustrates a test using PTP_SYS_OFFSET_PRECISE, with
interspersed strace log and chrony [2] refclocks log, on arm64. In the
example, chrony tracks a virtio_rtc PTP clock ("PHCV", /dev/ptp0). The raw
offset between the virtio_rtc clock and CLOCK_REALTIME is 0 to 1 ns. At the
device side, the Virtio RTC device artificially delays both the clock read
request, and the response, by 50 ms. Cross-timestamp interpolation still
works with this delay. chrony also monitors a ptp_kvm clock ("PHCK",
/dev/ptp3) for comparison, which yields a similar offset.
ioctl(5</dev/ptp3>, PTP_SYS_OFFSET_PRECISE, 0xffffe86691c8) = 0 <0.000329>
===============================================================================
Date (UTC) Time Refid DP L P Raw offset Cooked offset Disp.
===============================================================================
2023-06-29 18:49:55.595742 PHCK 0 N 0 1.000000e-09 8.717931e-10 5.500e-08
2023-06-29 18:49:55.595742 PHCK - N - - 8.717931e-10 5.500e-08
ioctl(6</dev/ptp0>, PTP_SYS_OFFSET_PRECISE, 0xffffe86691c8) = 0 <0.101545>
2023-06-29 18:49:56.147766 PHCV 0 N 0 1.000000e-09 8.801870e-10 5.500e-08
2023-06-29 18:49:56.147766 PHCV - N - - 8.801870e-10 5.500e-08
ioctl(5</dev/ptp3>, PTP_SYS_OFFSET_PRECISE, 0xffffe86691c8) = 0 <0.000195>
2023-06-29 18:49:56.202446 PHCK 0 N 0 1.000000e-09 7.364180e-10 5.500e-08
2023-06-29 18:49:56.202446 PHCK - N - - 7.364180e-10 5.500e-08
ioctl(6</dev/ptp0>, PTP_SYS_OFFSET_PRECISE, 0xffffe86691c8) = 0 <0.101484>
2023-06-29 18:49:56.754641 PHCV 0 N 0 0.000000e+00 -2.617368e-10 5.500e-08
2023-06-29 18:49:56.754641 PHCV - N - - -2.617368e-10 5.500e-08
ioctl(5</dev/ptp3>, PTP_SYS_OFFSET_PRECISE, 0xffffe86691c8) = 0 <0.000270>
2023-06-29 18:49:56.809282 PHCK 0 N 0 1.000000e-09 7.779321e-10 5.500e-08
2023-06-29 18:49:56.809282 PHCK - N - - 7.779321e-10 5.500e-08
ioctl(6</dev/ptp0>, PTP_SYS_OFFSET_PRECISE, 0xffffe86691c8) = 0 <0.101510>
2023-06-29 18:49:57.361376 PHCV 0 N 0 0.000000e+00 -2.198794e-10 5.500e-08
2023-06-29 18:49:57.361376 PHCV - N - - -2.198794e-10 5.500e-08
This patch series only adds special support for the Arm Generic Timer
clocksource. At the driver side, it should be easy to support more
clocksources.
Fallback PTP clock interface
----------------------------
Without special support for the current clocksource, time synchronization
programs can still use ioctl PTP_SYS_OFFSET_EXTENDED2 aka
PTP_SYS_OFFSET_EXTENDED. In this case, precision will generally be worse
and will depend on the Virtio device response characteristics.
The following illustrates a test using PTP_SYS_OFFSET_EXTENDED, with
interspersed strace log and chrony refclocks log, on x86-64 (with `ts'
values omitted):
ioctl(5, PTP_SYS_OFFSET_EXTENDED, {n_samples=10, ts=OMITTED}) = 0
===============================================================================
Date (UTC) Time Refid DP L P Raw offset Cooked offset Disp.
===============================================================================
2023-06-28 14:11:26.697782 PHCV 0 N 0 3.318200e-05 3.450891e-05 4.611e-06
2023-06-28 14:11:26.697782 PHCV - N - - 3.450891e-05 4.611e-06
ioctl(5, PTP_SYS_OFFSET_EXTENDED, {n_samples=10, ts=OMITTED}) = 0
2023-06-28 14:11:27.208763 PHCV 0 N 0 -3.792800e-05 -4.023965e-05 4.611e-06
2023-06-28 14:11:27.208763 PHCV - N - - -4.023965e-05 4.611e-06
ioctl(5, PTP_SYS_OFFSET_EXTENDED, {n_samples=10, ts=OMITTED}) = 0
2023-06-28 14:11:27.722818 PHCV 0 N 0 -3.328600e-05 -3.134404e-05 4.611e-06
2023-06-28 14:11:27.722818 PHCV - N - - -3.134404e-05 4.611e-06
ioctl(5, PTP_SYS_OFFSET_EXTENDED, {n_samples=10, ts=OMITTED}) = 0
2023-06-28 14:11:28.233572 PHCV 0 N 0 -4.966900e-05 -4.584331e-05 4.611e-06
2023-06-28 14:11:28.233572 PHCV - N - - -4.584331e-05 4.611e-06
ioctl(5, PTP_SYS_OFFSET_EXTENDED, {n_samples=10, ts=OMITTED}) = 0
2023-06-28 14:11:28.742737 PHCV 0 N 0 4.902700e-05 5.361388e-05 4.611e-06
2023-06-28 14:11:28.742737 PHCV - N - - 5.361388e-05 4.611e-06
PTP clock setup
---------------
The following udev rule can be used to get a symlink /dev/ptp_virtio to the
UTC clock:
SUBSYSTEM=="ptp", ATTR{clock_name}=="Virtio PTP UTC", SYMLINK += "ptp_virtio"
The following chrony configuration directive can then be added in
/etc/chrony/chrony.conf to synchronize to the Virtio UTC clock:
refclock PHC /dev/ptp_virtio refid PHCV poll -1 dpoll -1
RTC interface
-------------
This patch series adds virtio_rtc as a generic Virtio driver, including
both a PTP clock driver and an RTC class driver.
Feedback is greatly appreciated.
[1] https://lore.kernel.org/virtio-comment/20231218064253.9734-1-peter.hilber@opensynergy.com/
[2] https://chrony.tuxfamily.org/
[3] https://lore.kernel.org/lkml/20231215220612.173603-1-peter.hilber@opensynergy.com/
[4] https://github.com/OpenSynergy/linux.git virtio-rtc-v3-on-master
v3:
- Update to conform to virtio spec RFC v3 (no significant behavioral
changes).
- Add RTC class driver with alarm according to virtio spec RFC v3.
- For cross-timestamp corner case fix, switch back to v1 style closed
interval test (Thomas Gleixner).
v2:
- Depend on patch series "treewide: Use clocksource id for
get_device_system_crosststamp()" to avoid requiring a clocksource pointer
with get_device_system_crosststamp().
- Assume Arm Generic Timer will use CP15 virtual counter. Drop
arm_arch_timer helper functions (Marc Zyngier).
- Improve cross-timestamp fixes problem description and implementation
(John Stultz).
Peter Hilber (7):
timekeeping: Fix cross-timestamp interpolation on counter wrap
timekeeping: Fix cross-timestamp interpolation corner case decision
timekeeping: Fix cross-timestamp interpolation for non-x86
virtio_rtc: Add module and driver core
virtio_rtc: Add PTP clocks
virtio_rtc: Add Arm Generic Timer cross-timestamping
virtio_rtc: Add RTC class driver
MAINTAINERS | 7 +
drivers/virtio/Kconfig | 62 ++
drivers/virtio/Makefile | 5 +
drivers/virtio/virtio_rtc_arm.c | 22 +
drivers/virtio/virtio_rtc_class.c | 269 +++++
drivers/virtio/virtio_rtc_driver.c | 1357 ++++++++++++++++++++++++++
drivers/virtio/virtio_rtc_internal.h | 122 +++
drivers/virtio/virtio_rtc_ptp.c | 342 +++++++
include/uapi/linux/virtio_rtc.h | 230 +++++
kernel/time/timekeeping.c | 24 +-
10 files changed, 2428 insertions(+), 12 deletions(-)
create mode 100644 drivers/virtio/virtio_rtc_arm.c
create mode 100644 drivers/virtio/virtio_rtc_class.c
create mode 100644 drivers/virtio/virtio_rtc_driver.c
create mode 100644 drivers/virtio/virtio_rtc_internal.h
create mode 100644 drivers/virtio/virtio_rtc_ptp.c
create mode 100644 include/uapi/linux/virtio_rtc.h
base-commit: 2c41b211d72c1eb350c7629a8c85234fef0d12c1
--
2.40.1
---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
^ permalink raw reply [flat|nested] 35+ messages in thread
* [virtio-dev] [RFC PATCH v3 4/7] virtio_rtc: Add module and driver core
2023-12-18 7:38 [virtio-dev] [RFC PATCH v3 0/7] Add virtio_rtc module and related changes Peter Hilber
@ 2023-12-18 7:38 ` Peter Hilber
2023-12-18 7:38 ` [virtio-dev] [RFC PATCH v3 5/7] virtio_rtc: Add PTP clocks Peter Hilber
` (3 subsequent siblings)
4 siblings, 0 replies; 35+ messages in thread
From: Peter Hilber @ 2023-12-18 7:38 UTC (permalink / raw)
To: linux-kernel, virtualization, virtio-dev
Cc: Peter Hilber, Michael S. Tsirkin, Jason Wang, Xuan Zhuo,
Richard Cochran, netdev, Marc Zyngier, Mark Rutland,
Daniel Lezcano, Thomas Gleixner, linux-arm-kernel, linux-rtc,
Alessandro Zummo, Alexandre Belloni
Add the virtio_rtc module and driver core. The virtio_rtc module implements
a driver compatible with the proposed Virtio RTC device specification.
The Virtio RTC (Real Time Clock) device provides information about current
time. The device can provide different clocks, e.g. for the UTC or TAI time
standards, or for physical time elapsed since some past epoch. The driver
can read the clocks with simple or more accurate methods.
Implement the core, which interacts with the Virtio RTC device. Apart from
this, the core does not expose functionality outside of the virtio_rtc
module. A follow-up patch will expose PTP clocks.
Provide synchronous messaging, which is enough for the expected time
synchronization use cases through PTP clocks (similar to ptp_kvm) or RTC
Class driver.
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
---
Notes:
v3:
- merge readq and controlq into a single requestq (spec v3)
- don't guard cross-timestamping with feature bit (spec v3)
- pad message headers to 64 bits (spec v3)
- reduce clock id to 16 bits (spec v3)
- change Virtio status codes (spec v3)
- use 'VIRTIO_RTC_REQ_' prefix for request messages (spec v3)
MAINTAINERS | 7 +
drivers/virtio/Kconfig | 13 +
drivers/virtio/Makefile | 2 +
drivers/virtio/virtio_rtc_driver.c | 761 +++++++++++++++++++++++++++
drivers/virtio/virtio_rtc_internal.h | 23 +
include/uapi/linux/virtio_rtc.h | 144 +++++
6 files changed, 950 insertions(+)
create mode 100644 drivers/virtio/virtio_rtc_driver.c
create mode 100644 drivers/virtio/virtio_rtc_internal.h
create mode 100644 include/uapi/linux/virtio_rtc.h
diff --git a/MAINTAINERS b/MAINTAINERS
index b589218605b4..0c157a19bbfd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -23200,6 +23200,13 @@ S: Maintained
F: drivers/nvdimm/nd_virtio.c
F: drivers/nvdimm/virtio_pmem.c
+VIRTIO RTC DRIVER
+M: Peter Hilber <peter.hilber@opensynergy.com>
+L: virtualization@lists.linux.dev
+S: Maintained
+F: drivers/virtio/virtio_rtc_*
+F: include/uapi/linux/virtio_rtc.h
+
VIRTIO SOUND DRIVER
M: Anton Yakovlev <anton.yakovlev@opensynergy.com>
M: "Michael S. Tsirkin" <mst@redhat.com>
diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 0a53a61231c2..834dd14bc070 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -173,4 +173,17 @@ config VIRTIO_DMA_SHARED_BUFFER
This option adds a flavor of dma buffers that are backed by
virtio resources.
+config VIRTIO_RTC
+ tristate "Virtio RTC driver"
+ depends on VIRTIO
+ depends on PTP_1588_CLOCK_OPTIONAL
+ help
+ This driver provides current time from a Virtio RTC device. The driver
+ provides the time through one or more clocks.
+
+ To compile this code as a module, choose M here: the module will be
+ called virtio_rtc.
+
+ If unsure, say M.
+
endif # VIRTIO_MENU
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 8e98d24917cc..f760414ed6ab 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -12,3 +12,5 @@ obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
obj-$(CONFIG_VIRTIO_VDPA) += virtio_vdpa.o
obj-$(CONFIG_VIRTIO_MEM) += virtio_mem.o
obj-$(CONFIG_VIRTIO_DMA_SHARED_BUFFER) += virtio_dma_buf.o
+obj-$(CONFIG_VIRTIO_RTC) += virtio_rtc.o
+virtio_rtc-y := virtio_rtc_driver.o
diff --git a/drivers/virtio/virtio_rtc_driver.c b/drivers/virtio/virtio_rtc_driver.c
new file mode 100644
index 000000000000..ef1ea14b3bec
--- /dev/null
+++ b/drivers/virtio/virtio_rtc_driver.c
@@ -0,0 +1,761 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * virtio_rtc driver core
+ *
+ * Copyright (C) 2022-2023 OpenSynergy GmbH
+ */
+
+#include <linux/completion.h>
+#include <linux/virtio.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_config.h>
+#include <linux/module.h>
+
+#include <uapi/linux/virtio_rtc.h>
+
+#include "virtio_rtc_internal.h"
+
+/* virtqueue order */
+enum {
+ VIORTC_REQUESTQ,
+ VIORTC_MAX_NR_QUEUES,
+};
+
+/**
+ * struct viortc_vq - virtqueue abstraction
+ * @vq: virtqueue
+ * @lock: protects access to vq
+ */
+struct viortc_vq {
+ struct virtqueue *vq;
+ spinlock_t lock;
+};
+
+/**
+ * struct viortc_dev - virtio_rtc device data
+ * @vdev: virtio device
+ * @vqs: virtqueues
+ * @num_clocks: # of virtio_rtc clocks
+ */
+struct viortc_dev {
+ struct virtio_device *vdev;
+ struct viortc_vq vqs[VIORTC_MAX_NR_QUEUES];
+ u16 num_clocks;
+};
+
+/**
+ * struct viortc_msg - Message requested by driver, responded by device.
+ * @viortc: device data
+ * @req: request buffer
+ * @resp: response buffer
+ * @responded: vqueue callback signals response reception
+ * @refcnt: Message reference count, message and buffers will be deallocated
+ * once 0. refcnt is decremented in the vqueue callback and in the
+ * thread waiting on the responded completion.
+ * If a message response wait function times out, the message will be
+ * freed upon late reception (refcnt will reach 0 in the callback), or
+ * device removal.
+ * @req_size: size of request in bytes
+ * @resp_cap: maximum size of response in bytes
+ * @resp_actual_size: actual size of response
+ */
+struct viortc_msg {
+ struct viortc_dev *viortc;
+ void *req;
+ void *resp;
+ struct completion responded;
+ refcount_t refcnt;
+ unsigned int req_size;
+ unsigned int resp_cap;
+ unsigned int resp_actual_size;
+};
+
+/**
+ * viortc_msg_init() - Allocate and initialize requestq message.
+ * @viortc: device data
+ * @msg_type: virtio_rtc message type
+ * @req_size: size of request buffer to be allocated
+ * @resp_cap: size of response buffer to be allocated
+ *
+ * Initializes the message refcnt to 2. The refcnt will be decremented once in
+ * the virtqueue callback, and once in the thread waiting on the message (on
+ * completion or timeout).
+ *
+ * Context: Process context.
+ * Return: non-NULL on success.
+ */
+static struct viortc_msg *viortc_msg_init(struct viortc_dev *viortc,
+ u16 msg_type, unsigned int req_size,
+ unsigned int resp_cap)
+{
+ struct viortc_msg *msg;
+ struct device *dev = &viortc->vdev->dev;
+ struct virtio_rtc_req_head *req_head;
+
+ msg = devm_kzalloc(dev, sizeof(*msg), GFP_KERNEL);
+ if (!msg)
+ return NULL;
+
+ init_completion(&msg->responded);
+
+ msg->req = devm_kzalloc(dev, req_size, GFP_KERNEL);
+ if (!msg->req)
+ goto err_free_msg;
+
+ req_head = msg->req;
+
+ msg->resp = devm_kzalloc(dev, resp_cap, GFP_KERNEL);
+ if (!msg->resp)
+ goto err_free_msg_req;
+
+ msg->viortc = viortc;
+ msg->req_size = req_size;
+ msg->resp_cap = resp_cap;
+
+ refcount_set(&msg->refcnt, 2);
+
+ req_head->msg_type = virtio_cpu_to_le(msg_type, req_head->msg_type);
+
+ return msg;
+
+err_free_msg_req:
+ devm_kfree(dev, msg->req);
+
+err_free_msg:
+ devm_kfree(dev, msg);
+
+ return NULL;
+}
+
+/**
+ * viortc_msg_release() - Decrement message refcnt, potentially free message.
+ * @msg: message requested by driver
+ *
+ * Context: Any context.
+ */
+static void viortc_msg_release(struct viortc_msg *msg)
+{
+ if (refcount_dec_and_test(&msg->refcnt)) {
+ struct device *dev = &msg->viortc->vdev->dev;
+
+ devm_kfree(dev, msg->req);
+ devm_kfree(dev, msg->resp);
+ devm_kfree(dev, msg);
+ }
+}
+
+/**
+ * viortc_do_cb() - generic virtqueue callback logic
+ * @vq: virtqueue
+ * @handle_buf: function to process a used buffer
+ *
+ * Context: virtqueue callback, typically interrupt. Takes and releases vq lock.
+ */
+static void viortc_do_cb(struct virtqueue *vq,
+ void (*handle_buf)(void *token, unsigned int len,
+ struct virtqueue *vq,
+ struct viortc_vq *viortc_vq,
+ struct viortc_dev *viortc))
+{
+ struct viortc_dev *viortc = vq->vdev->priv;
+ struct viortc_vq *viortc_vq;
+ bool cb_enabled = true;
+ unsigned long flags;
+ spinlock_t *lock;
+ unsigned int len;
+ void *token;
+
+ viortc_vq = &viortc->vqs[vq->index];
+ lock = &viortc_vq->lock;
+
+ for (;;) {
+ spin_lock_irqsave(lock, flags);
+
+ if (cb_enabled) {
+ virtqueue_disable_cb(vq);
+ cb_enabled = false;
+ }
+
+ token = virtqueue_get_buf(vq, &len);
+ if (!token) {
+ if (virtqueue_enable_cb(vq)) {
+ spin_unlock_irqrestore(lock, flags);
+ return;
+ }
+ cb_enabled = true;
+ }
+
+ spin_unlock_irqrestore(lock, flags);
+
+ if (token)
+ handle_buf(token, len, vq, viortc_vq, viortc);
+ }
+}
+
+/**
+ * viortc_requestq_hdlr() - process a requestq used buffer
+ * @token: token identifying the buffer
+ * @len: bytes written by device
+ * @vq: virtqueue
+ * @viortc_vq: device specific data for virtqueue
+ * @viortc: device data
+ *
+ * Signals completion for each received message.
+ *
+ * Context: virtqueue callback
+ */
+static void viortc_requestq_hdlr(void *token, unsigned int len,
+ struct virtqueue *vq,
+ struct viortc_vq *viortc_vq,
+ struct viortc_dev *viortc)
+{
+ struct viortc_msg *msg = token;
+
+ msg->resp_actual_size = len;
+
+ /*
+ * completion waiter must see our msg metadata, but complete() does not
+ * guarantee a memory barrier
+ */
+ smp_wmb();
+
+ complete(&msg->responded);
+ viortc_msg_release(msg);
+}
+
+/**
+ * viortc_cb_requestq() - callback for requestq
+ * @vq: virtqueue
+ *
+ * Context: virtqueue callback
+ */
+static void viortc_cb_requestq(struct virtqueue *vq)
+{
+ viortc_do_cb(vq, viortc_requestq_hdlr);
+}
+
+/**
+ * viortc_get_resp_errno() - converts virtio_rtc errnos to system errnos
+ * @resp_head: message response header
+ *
+ * Return: negative system errno, or 0
+ */
+static int viortc_get_resp_errno(struct virtio_rtc_resp_head *resp_head)
+{
+ switch (virtio_le_to_cpu(resp_head->status)) {
+ case VIRTIO_RTC_S_OK:
+ return 0;
+ case VIRTIO_RTC_S_EOPNOTSUPP:
+ return -EOPNOTSUPP;
+ case VIRTIO_RTC_S_EINVAL:
+ return -EINVAL;
+ case VIRTIO_RTC_S_ENODEV:
+ return -ENODEV;
+ case VIRTIO_RTC_S_EIO:
+ default:
+ return -EIO;
+ }
+}
+
+/**
+ * viortc_msg_xfer() - send message request, wait until message response
+ * @vq: virtqueue
+ * @msg: message with driver request
+ * @timeout_jiffies: message response timeout, 0 for no timeout
+ *
+ * Context: Process context. Takes and releases vq.lock. May sleep.
+ */
+static int viortc_msg_xfer(struct viortc_vq *vq, struct viortc_msg *msg,
+ unsigned long timeout_jiffies)
+{
+ int ret;
+ unsigned long flags;
+ struct scatterlist out_sg[1];
+ struct scatterlist in_sg[1];
+ struct scatterlist *sgs[2] = { out_sg, in_sg };
+ bool notify;
+
+ sg_init_one(out_sg, msg->req, msg->req_size);
+ sg_init_one(in_sg, msg->resp, msg->resp_cap);
+
+ spin_lock_irqsave(&vq->lock, flags);
+
+ ret = virtqueue_add_sgs(vq->vq, sgs, 1, 1, msg, GFP_ATOMIC);
+ if (ret) {
+ spin_unlock_irqrestore(&vq->lock, flags);
+ /*
+ * Release in place of the response callback, which will never
+ * come.
+ */
+ viortc_msg_release(msg);
+ return ret;
+ }
+
+ notify = virtqueue_kick_prepare(vq->vq);
+
+ spin_unlock_irqrestore(&vq->lock, flags);
+
+ if (notify)
+ virtqueue_notify(vq->vq);
+
+ if (timeout_jiffies) {
+ long timeout_ret;
+
+ timeout_ret = wait_for_completion_interruptible_timeout(
+ &msg->responded, timeout_jiffies);
+
+ if (!timeout_ret)
+ return -ETIMEDOUT;
+ else if (timeout_ret < 0)
+ return (int)timeout_ret;
+ } else {
+ ret = wait_for_completion_interruptible(&msg->responded);
+ if (ret)
+ return ret;
+ }
+
+ /*
+ * Ensure we can read message metadata written in the virtqueue
+ * callback.
+ */
+ smp_rmb();
+
+ /*
+ * There is not yet a case where returning a short message would make
+ * sense, so consider any deviation an error.
+ */
+ if (msg->resp_actual_size != msg->resp_cap)
+ return -EINVAL;
+
+ return viortc_get_resp_errno(msg->resp);
+}
+
+/*
+ * common message handle macros for messages of different types
+ */
+
+/**
+ * VIORTC_DECLARE_MSG_HDL_ONSTACK() - declare message handle on stack
+ * @hdl: message handle name
+ * @msg_suf_lowerc: message type suffix in lowercase
+ * @msg_suf_upperc: message type suffix in uppercase
+ */
+#define VIORTC_DECLARE_MSG_HDL_ONSTACK(hdl, msg_suf_lowerc, msg_suf_upperc) \
+ struct { \
+ struct viortc_msg *msg; \
+ struct virtio_rtc_req_##msg_suf_lowerc *req; \
+ struct virtio_rtc_resp_##msg_suf_lowerc *resp; \
+ unsigned int req_size; \
+ unsigned int resp_cap; \
+ u16 msg_type; \
+ } hdl = { \
+ NULL, \
+ NULL, \
+ NULL, \
+ sizeof(struct virtio_rtc_req_##msg_suf_lowerc), \
+ sizeof(struct virtio_rtc_resp_##msg_suf_lowerc), \
+ VIRTIO_RTC_REQ_##msg_suf_upperc, \
+ }
+
+/**
+ * VIORTC_MSG() - extract message from message handle
+ *
+ * Return: struct viortc_msg
+ */
+#define VIORTC_MSG(hdl) ((hdl).msg)
+
+/**
+ * VIORTC_MSG_INIT() - initialize message handle
+ * @hdl: message handle
+ * @viortc: device data (struct viortc_dev *)
+ *
+ * Context: Process context.
+ * Return: 0 on success, -ENOMEM otherwise.
+ */
+#define VIORTC_MSG_INIT(hdl, viortc) \
+ ({ \
+ typeof(hdl) *_hdl = &(hdl); \
+ \
+ _hdl->msg = viortc_msg_init((viortc), _hdl->msg_type, \
+ _hdl->req_size, _hdl->resp_cap); \
+ if (_hdl->msg) { \
+ _hdl->req = _hdl->msg->req; \
+ _hdl->resp = _hdl->msg->resp; \
+ } \
+ _hdl->msg ? 0 : -ENOMEM; \
+ })
+
+/**
+ * VIORTC_MSG_WRITE() - write a request message field
+ * @hdl: message handle
+ * @dest_member: request message field name
+ * @src_ptr: pointer to data of compatible type
+ *
+ * Writes the field in little-endian format.
+ */
+#define VIORTC_MSG_WRITE(hdl, dest_member, src_ptr) \
+ do { \
+ typeof(hdl) _hdl = (hdl); \
+ typeof(src_ptr) _src_ptr = (src_ptr); \
+ \
+ /* Sanity check: must match the member's type */ \
+ typecheck(typeof(_hdl.req->dest_member), *_src_ptr); \
+ \
+ _hdl.req->dest_member = \
+ virtio_cpu_to_le(*_src_ptr, _hdl.req->dest_member); \
+ } while (0)
+
+/**
+ * VIORTC_MSG_READ() - read from a response message field
+ * @hdl: message handle
+ * @src_member: response message field name
+ * @dest_ptr: pointer to data of compatible type
+ *
+ * Converts from little-endian format and writes to dest_ptr.
+ */
+#define VIORTC_MSG_READ(hdl, src_member, dest_ptr) \
+ do { \
+ typeof(dest_ptr) _dest_ptr = (dest_ptr); \
+ \
+ /* Sanity check: must match the member's type */ \
+ typecheck(typeof((hdl).resp->src_member), *_dest_ptr); \
+ \
+ *_dest_ptr = virtio_le_to_cpu((hdl).resp->src_member); \
+ } while (0)
+
+/*
+ * read requests
+ */
+
+/** timeout for clock readings, where timeouts are considered non-fatal */
+#define VIORTC_MSG_READ_TIMEOUT (msecs_to_jiffies(60 * 1000))
+
+/**
+ * viortc_read() - VIRTIO_RTC_REQ_READ wrapper
+ * @viortc: device data
+ * @vio_clk_id: virtio_rtc clock id
+ * @reading: clock reading [ns]
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+int viortc_read(struct viortc_dev *viortc, u16 vio_clk_id, u64 *reading)
+{
+ int ret;
+ VIORTC_DECLARE_MSG_HDL_ONSTACK(hdl, read, READ);
+
+ ret = VIORTC_MSG_INIT(hdl, viortc);
+ if (ret)
+ return ret;
+
+ VIORTC_MSG_WRITE(hdl, clock_id, &vio_clk_id);
+
+ ret = viortc_msg_xfer(&viortc->vqs[VIORTC_REQUESTQ], VIORTC_MSG(hdl),
+ VIORTC_MSG_READ_TIMEOUT);
+ if (ret) {
+ dev_dbg(&viortc->vdev->dev, "%s: xfer returned %d\n", __func__,
+ ret);
+ goto out_release;
+ }
+
+ VIORTC_MSG_READ(hdl, clock_reading, reading);
+
+out_release:
+ viortc_msg_release(VIORTC_MSG(hdl));
+
+ return ret;
+}
+
+/**
+ * viortc_read_cross() - VIRTIO_RTC_REQ_READ_CROSS wrapper
+ * @viortc: device data
+ * @vio_clk_id: virtio_rtc clock id
+ * @hw_counter: virtio_rtc HW counter type
+ * @reading: clock reading [ns]
+ * @cycles: HW counter cycles during clock reading
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+int viortc_read_cross(struct viortc_dev *viortc, u16 vio_clk_id, u16 hw_counter,
+ u64 *reading, u64 *cycles)
+{
+ int ret;
+ VIORTC_DECLARE_MSG_HDL_ONSTACK(hdl, read_cross, READ_CROSS);
+
+ ret = VIORTC_MSG_INIT(hdl, viortc);
+ if (ret)
+ return ret;
+
+ VIORTC_MSG_WRITE(hdl, clock_id, &vio_clk_id);
+ VIORTC_MSG_WRITE(hdl, hw_counter, &hw_counter);
+
+ ret = viortc_msg_xfer(&viortc->vqs[VIORTC_REQUESTQ], VIORTC_MSG(hdl),
+ VIORTC_MSG_READ_TIMEOUT);
+ if (ret) {
+ dev_dbg(&viortc->vdev->dev, "%s: xfer returned %d\n", __func__,
+ ret);
+ goto out_release;
+ }
+
+ VIORTC_MSG_READ(hdl, clock_reading, reading);
+ VIORTC_MSG_READ(hdl, counter_cycles, cycles);
+
+out_release:
+ viortc_msg_release(VIORTC_MSG(hdl));
+
+ return ret;
+}
+
+/*
+ * control requests
+ */
+
+/**
+ * viortc_cfg() - VIRTIO_RTC_REQ_CFG wrapper
+ * @viortc: device data
+ * @num_clocks: # of virtio_rtc clocks
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_cfg(struct viortc_dev *viortc, u16 *num_clocks)
+{
+ int ret;
+ VIORTC_DECLARE_MSG_HDL_ONSTACK(hdl, cfg, CFG);
+
+ ret = VIORTC_MSG_INIT(hdl, viortc);
+ if (ret)
+ return ret;
+
+ ret = viortc_msg_xfer(&viortc->vqs[VIORTC_REQUESTQ], VIORTC_MSG(hdl),
+ 0);
+ if (ret) {
+ dev_dbg(&viortc->vdev->dev, "%s: xfer returned %d\n", __func__,
+ ret);
+ goto out_release;
+ }
+
+ VIORTC_MSG_READ(hdl, num_clocks, num_clocks);
+
+out_release:
+ viortc_msg_release(VIORTC_MSG(hdl));
+
+ return ret;
+}
+
+/**
+ * viortc_clock_cap() - VIRTIO_RTC_REQ_CLOCK_CAP wrapper
+ * @viortc: device data
+ * @vio_clk_id: virtio_rtc clock id
+ * @type: virtio_rtc clock type
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_clock_cap(struct viortc_dev *viortc, u16 vio_clk_id,
+ u16 *type)
+{
+ int ret;
+ VIORTC_DECLARE_MSG_HDL_ONSTACK(hdl, clock_cap, CLOCK_CAP);
+
+ ret = VIORTC_MSG_INIT(hdl, viortc);
+ if (ret)
+ return ret;
+
+ VIORTC_MSG_WRITE(hdl, clock_id, &vio_clk_id);
+
+ ret = viortc_msg_xfer(&viortc->vqs[VIORTC_REQUESTQ], VIORTC_MSG(hdl),
+ 0);
+ if (ret) {
+ dev_dbg(&viortc->vdev->dev, "%s: xfer returned %d\n", __func__,
+ ret);
+ goto out_release;
+ }
+
+ VIORTC_MSG_READ(hdl, type, type);
+
+out_release:
+ viortc_msg_release(VIORTC_MSG(hdl));
+
+ return ret;
+}
+
+/**
+ * viortc_cross_cap() - VIRTIO_RTC_REQ_CROSS_CAP wrapper
+ * @viortc: device data
+ * @vio_clk_id: virtio_rtc clock id
+ * @hw_counter: virtio_rtc HW counter type
+ * @supported: xtstamping is supported for the vio_clk_id/hw_counter pair
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+int viortc_cross_cap(struct viortc_dev *viortc, u16 vio_clk_id, u16 hw_counter,
+ bool *supported)
+{
+ int ret;
+ VIORTC_DECLARE_MSG_HDL_ONSTACK(hdl, cross_cap, CROSS_CAP);
+ u8 flags;
+
+ ret = VIORTC_MSG_INIT(hdl, viortc);
+ if (ret)
+ return ret;
+
+ VIORTC_MSG_WRITE(hdl, clock_id, &vio_clk_id);
+ VIORTC_MSG_WRITE(hdl, hw_counter, &hw_counter);
+
+ ret = viortc_msg_xfer(&viortc->vqs[VIORTC_REQUESTQ], VIORTC_MSG(hdl),
+ 0);
+ if (ret) {
+ dev_dbg(&viortc->vdev->dev, "%s: xfer returned %d\n", __func__,
+ ret);
+ goto out_release;
+ }
+
+ VIORTC_MSG_READ(hdl, flags, &flags);
+ *supported = !!(flags & VIRTIO_RTC_FLAG_CROSS_CAP);
+
+out_release:
+ viortc_msg_release(VIORTC_MSG(hdl));
+
+ return ret;
+}
+
+/*
+ * init, deinit
+ */
+
+/**
+ * viortc_clocks_init() - init local representations of virtio_rtc clocks
+ * @viortc: device data
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_clocks_init(struct viortc_dev *viortc)
+{
+ int ret;
+ u16 num_clocks;
+
+ ret = viortc_cfg(viortc, &num_clocks);
+ if (ret)
+ return ret;
+
+ if (num_clocks < 1) {
+ dev_err(&viortc->vdev->dev, "device reported 0 clocks\n");
+ return -ENODEV;
+ }
+
+ viortc->num_clocks = num_clocks;
+
+ /* In the future, PTP clocks will be initialized here. */
+ (void)viortc_clock_cap;
+
+ return ret;
+}
+
+/**
+ * viortc_init_vqs() - init virtqueues
+ * @viortc: device data
+ *
+ * Inits virtqueues and associated data.
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_init_vqs(struct viortc_dev *viortc)
+{
+ int ret;
+ struct virtio_device *vdev = viortc->vdev;
+ const char *names[VIORTC_MAX_NR_QUEUES];
+ vq_callback_t *callbacks[VIORTC_MAX_NR_QUEUES];
+ struct virtqueue *vqs[VIORTC_MAX_NR_QUEUES];
+ int nr_queues;
+
+ nr_queues = VIORTC_REQUESTQ + 1;
+ names[VIORTC_REQUESTQ] = "requestq";
+ callbacks[VIORTC_REQUESTQ] = viortc_cb_requestq;
+
+ ret = virtio_find_vqs(vdev, nr_queues, vqs, callbacks, names, NULL);
+ if (ret)
+ return ret;
+
+ viortc->vqs[VIORTC_REQUESTQ].vq = vqs[VIORTC_REQUESTQ];
+ spin_lock_init(&viortc->vqs[VIORTC_REQUESTQ].lock);
+
+ return 0;
+}
+
+/**
+ * viortc_probe() - probe a virtio_rtc virtio device
+ * @vdev: virtio device
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_probe(struct virtio_device *vdev)
+{
+ struct viortc_dev *viortc;
+ int ret;
+
+ viortc = devm_kzalloc(&vdev->dev, sizeof(*viortc), GFP_KERNEL);
+ if (!viortc)
+ return -ENOMEM;
+
+ vdev->priv = viortc;
+ viortc->vdev = vdev;
+
+ ret = viortc_init_vqs(viortc);
+ if (ret)
+ return ret;
+
+ virtio_device_ready(vdev);
+
+ /* Ready vdev for use by frontend devices initialized next. */
+ smp_wmb();
+
+ ret = viortc_clocks_init(viortc);
+ if (ret)
+ goto err_reset_vdev;
+
+ return 0;
+
+err_reset_vdev:
+ virtio_reset_device(vdev);
+ vdev->config->del_vqs(vdev);
+
+ return ret;
+}
+
+/**
+ * viortc_remove() - remove a virtio_rtc virtio device
+ * @vdev: virtio device
+ */
+static void viortc_remove(struct virtio_device *vdev)
+{
+ /* In the future, PTP clocks will be deinitialized here. */
+
+ virtio_reset_device(vdev);
+ vdev->config->del_vqs(vdev);
+}
+
+static struct virtio_device_id id_table[] = {
+ { VIRTIO_ID_CLOCK, VIRTIO_DEV_ANY_ID },
+ { 0 },
+};
+MODULE_DEVICE_TABLE(virtio, id_table);
+
+static struct virtio_driver virtio_rtc_drv = {
+ .driver.name = KBUILD_MODNAME,
+ .driver.owner = THIS_MODULE,
+ .id_table = id_table,
+ .probe = viortc_probe,
+ .remove = viortc_remove,
+};
+
+module_virtio_driver(virtio_rtc_drv);
+
+MODULE_DESCRIPTION("Virtio RTC driver");
+MODULE_AUTHOR("OpenSynergy GmbH");
+MODULE_LICENSE("GPL");
diff --git a/drivers/virtio/virtio_rtc_internal.h b/drivers/virtio/virtio_rtc_internal.h
new file mode 100644
index 000000000000..9267661b8030
--- /dev/null
+++ b/drivers/virtio/virtio_rtc_internal.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * virtio_rtc internal interfaces
+ *
+ * Copyright (C) 2022-2023 OpenSynergy GmbH
+ */
+
+#ifndef _VIRTIO_RTC_INTERNAL_H_
+#define _VIRTIO_RTC_INTERNAL_H_
+
+#include <linux/types.h>
+
+/* driver core IFs */
+
+struct viortc_dev;
+
+int viortc_read(struct viortc_dev *viortc, u16 vio_clk_id, u64 *reading);
+int viortc_read_cross(struct viortc_dev *viortc, u16 vio_clk_id, u16 hw_counter,
+ u64 *reading, u64 *cycles);
+int viortc_cross_cap(struct viortc_dev *viortc, u16 vio_clk_id, u16 hw_counter,
+ bool *supported);
+
+#endif /* _VIRTIO_RTC_INTERNAL_H_ */
diff --git a/include/uapi/linux/virtio_rtc.h b/include/uapi/linux/virtio_rtc.h
new file mode 100644
index 000000000000..f469276562d7
--- /dev/null
+++ b/include/uapi/linux/virtio_rtc.h
@@ -0,0 +1,144 @@
+/* SPDX-License-Identifier: ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) */
+/*
+ * Copyright (C) 2022-2023 OpenSynergy GmbH
+ */
+
+#ifndef _LINUX_VIRTIO_RTC_H
+#define _LINUX_VIRTIO_RTC_H
+
+#include <linux/types.h>
+
+/* read request message types */
+
+#define VIRTIO_RTC_REQ_READ 0x0001
+#define VIRTIO_RTC_REQ_READ_CROSS 0x0002
+
+/* control request message types */
+
+#define VIRTIO_RTC_REQ_CFG 0x1000
+#define VIRTIO_RTC_REQ_CLOCK_CAP 0x1001
+#define VIRTIO_RTC_REQ_CROSS_CAP 0x1002
+
+/* Message headers */
+
+/** common request header */
+struct virtio_rtc_req_head {
+ __le16 msg_type;
+ __u8 reserved[6];
+};
+
+/** common response header */
+struct virtio_rtc_resp_head {
+#define VIRTIO_RTC_S_OK 0
+#define VIRTIO_RTC_S_EOPNOTSUPP 2
+#define VIRTIO_RTC_S_ENODEV 3
+#define VIRTIO_RTC_S_EINVAL 4
+#define VIRTIO_RTC_S_EIO 5
+ __u8 status;
+ __u8 reserved[7];
+};
+
+/* read requests */
+
+/* VIRTIO_RTC_REQ_READ message */
+
+struct virtio_rtc_req_read {
+ struct virtio_rtc_req_head head;
+ __le16 clock_id;
+ __u8 reserved[6];
+};
+
+struct virtio_rtc_resp_read {
+ struct virtio_rtc_resp_head head;
+ __le64 clock_reading;
+};
+
+/* VIRTIO_RTC_REQ_READ_CROSS message */
+
+struct virtio_rtc_req_read_cross {
+ struct virtio_rtc_req_head head;
+ __le16 clock_id;
+/** Arm Generic Timer Virtual Count */
+#define VIRTIO_RTC_COUNTER_ARM_VIRT 0
+/** Arm Generic Timer Physical Count */
+#define VIRTIO_RTC_COUNTER_ARM_PHYS 1
+/** x86 Time Stamp Counter */
+#define VIRTIO_RTC_COUNTER_X86_TSC 2
+ __le16 hw_counter;
+ __u8 reserved[4];
+};
+
+struct virtio_rtc_resp_read_cross {
+ struct virtio_rtc_resp_head head;
+ __le64 clock_reading;
+ __le64 counter_cycles;
+};
+
+/* control requests */
+
+/* VIRTIO_RTC_REQ_CFG message */
+
+struct virtio_rtc_req_cfg {
+ struct virtio_rtc_req_head head;
+ /* no request params */
+};
+
+struct virtio_rtc_resp_cfg {
+ struct virtio_rtc_resp_head head;
+ /** # of clocks -> clock ids < num_clocks are valid */
+ __le16 num_clocks;
+ __u8 reserved[6];
+};
+
+/* VIRTIO_RTC_REQ_CLOCK_CAP message */
+
+struct virtio_rtc_req_clock_cap {
+ struct virtio_rtc_req_head head;
+ __le16 clock_id;
+ __u8 reserved[6];
+};
+
+struct virtio_rtc_resp_clock_cap {
+ struct virtio_rtc_resp_head head;
+#define VIRTIO_RTC_CLOCK_UTC 0
+#define VIRTIO_RTC_CLOCK_TAI 1
+#define VIRTIO_RTC_CLOCK_MONO 2
+ __le16 type;
+ __u8 reserved[6];
+};
+
+/* VIRTIO_RTC_REQ_CROSS_CAP message */
+
+struct virtio_rtc_req_cross_cap {
+ struct virtio_rtc_req_head head;
+ __le16 clock_id;
+ __le16 hw_counter;
+ __u8 reserved[4];
+};
+
+struct virtio_rtc_resp_cross_cap {
+ struct virtio_rtc_resp_head head;
+#define VIRTIO_RTC_FLAG_CROSS_CAP (1 << 0)
+ __u8 flags;
+ __u8 reserved[7];
+};
+
+/** Union of request types for requestq */
+union virtio_rtc_req_requestq {
+ struct virtio_rtc_req_read read;
+ struct virtio_rtc_req_read_cross read_cross;
+ struct virtio_rtc_req_cfg cfg;
+ struct virtio_rtc_req_clock_cap clock_cap;
+ struct virtio_rtc_req_cross_cap cross_cap;
+};
+
+/** Union of response types for requestq */
+union virtio_rtc_resp_requestq {
+ struct virtio_rtc_resp_read read;
+ struct virtio_rtc_resp_read_cross read_cross;
+ struct virtio_rtc_resp_cfg cfg;
+ struct virtio_rtc_resp_clock_cap clock_cap;
+ struct virtio_rtc_resp_cross_cap cross_cap;
+};
+
+#endif /* _LINUX_VIRTIO_RTC_H */
--
2.40.1
---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [virtio-dev] [RFC PATCH v3 5/7] virtio_rtc: Add PTP clocks
2023-12-18 7:38 [virtio-dev] [RFC PATCH v3 0/7] Add virtio_rtc module and related changes Peter Hilber
2023-12-18 7:38 ` [virtio-dev] [RFC PATCH v3 4/7] virtio_rtc: Add module and driver core Peter Hilber
@ 2023-12-18 7:38 ` Peter Hilber
2023-12-18 7:38 ` [virtio-dev] [RFC PATCH v3 6/7] virtio_rtc: Add Arm Generic Timer cross-timestamping Peter Hilber
` (2 subsequent siblings)
4 siblings, 0 replies; 35+ messages in thread
From: Peter Hilber @ 2023-12-18 7:38 UTC (permalink / raw)
To: linux-kernel, virtualization, virtio-dev
Cc: Peter Hilber, Michael S. Tsirkin, Jason Wang, Xuan Zhuo,
Richard Cochran, netdev, Marc Zyngier, Mark Rutland,
Daniel Lezcano, Thomas Gleixner, linux-arm-kernel
Expose the virtio_rtc clocks as PTP clocks to userspace, similar to
ptp_kvm. virtio_rtc can expose multiple clocks, e.g. a UTC clock and a
monotonic clock. Userspace should distinguish different clocks through the
name assigned by the driver. A udev rule such as the following can be used
to get a symlink /dev/ptp_virtio to the UTC clock:
SUBSYSTEM=="ptp", ATTR{clock_name}=="Virtio PTP UTC", SYMLINK += "ptp_virtio"
The preferred PTP clock reading method is ioctl PTP_SYS_OFFSET_PRECISE2,
through the ptp_clock_info.getcrosststamp() op. For now,
PTP_SYS_OFFSET_PRECISE2 will return -EOPNOTSUPP through a weak function.
PTP_SYS_OFFSET_PRECISE2 requires cross-timestamping support for specific
clocksources, which will be added in the following. If the clocksource
specific code is enabled, check that the Virtio RTC device supports the
respective HW counter before obtaining an actual cross-timestamp from the
Virtio device.
The Virtio RTC device response time may be higher than the timekeeper
seqcount increment interval. Therefore, obtain the cross-timestamp before
calling get_device_system_crosststamp().
As a fallback, support the ioctl PTP_SYS_OFFSET_EXTENDED2 for all
platforms.
Assume that concurrency issues during PTP clock removal are avoided by the
posix_clock framework.
Kconfig recursive dependencies prevent virtio_rtc from implicitly enabling
PTP_1588_CLOCK, therefore just warn the user if PTP_1588_CLOCK is not
available. Since virtio_rtc should in the future also expose clocks as RTC
class devices, do not have VIRTIO_RTC depend on PTP_1588_CLOCK.
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
---
Notes:
v3:
- don't guard cross-timestamping with feature bit (spec v3)
- reduce clock id to 16 bits (spec v3)
v2:
- Depend on prerequisite patch series "treewide: Use clocksource id for
get_device_system_crosststamp()".
- Check clocksource id before sending crosststamp message to device.
- Do not support multiple hardware counters at runtime any more, since
distinction of Arm physical and virtual counter appears unneeded after
discussion with Marc Zyngier.
drivers/virtio/Kconfig | 23 +-
drivers/virtio/Makefile | 1 +
drivers/virtio/virtio_rtc_driver.c | 131 +++++++++-
drivers/virtio/virtio_rtc_internal.h | 46 ++++
drivers/virtio/virtio_rtc_ptp.c | 342 +++++++++++++++++++++++++++
5 files changed, 539 insertions(+), 4 deletions(-)
create mode 100644 drivers/virtio/virtio_rtc_ptp.c
diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 834dd14bc070..8542b2f20201 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -179,11 +179,32 @@ config VIRTIO_RTC
depends on PTP_1588_CLOCK_OPTIONAL
help
This driver provides current time from a Virtio RTC device. The driver
- provides the time through one or more clocks.
+ provides the time through one or more clocks. The Virtio RTC PTP
+ clocks must be enabled to expose the clocks to userspace.
To compile this code as a module, choose M here: the module will be
called virtio_rtc.
If unsure, say M.
+if VIRTIO_RTC
+
+comment "WARNING: Consider enabling VIRTIO_RTC_PTP."
+ depends on !VIRTIO_RTC_PTP
+
+comment "Enable PTP_1588_CLOCK in order to enable VIRTIO_RTC_PTP."
+ depends on PTP_1588_CLOCK=n
+
+config VIRTIO_RTC_PTP
+ bool "Virtio RTC PTP clocks"
+ default y
+ depends on PTP_1588_CLOCK
+ help
+ This exposes any Virtio RTC clocks as PTP Hardware Clocks (PHCs) to
+ userspace.
+
+ If unsure, say Y.
+
+endif # VIRTIO_RTC
+
endif # VIRTIO_MENU
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index f760414ed6ab..4d48cbcae6bb 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -14,3 +14,4 @@ obj-$(CONFIG_VIRTIO_MEM) += virtio_mem.o
obj-$(CONFIG_VIRTIO_DMA_SHARED_BUFFER) += virtio_dma_buf.o
obj-$(CONFIG_VIRTIO_RTC) += virtio_rtc.o
virtio_rtc-y := virtio_rtc_driver.o
+virtio_rtc-$(CONFIG_VIRTIO_RTC_PTP) += virtio_rtc_ptp.o
diff --git a/drivers/virtio/virtio_rtc_driver.c b/drivers/virtio/virtio_rtc_driver.c
index ef1ea14b3bec..c331b7383285 100644
--- a/drivers/virtio/virtio_rtc_driver.c
+++ b/drivers/virtio/virtio_rtc_driver.c
@@ -35,11 +35,16 @@ struct viortc_vq {
* struct viortc_dev - virtio_rtc device data
* @vdev: virtio device
* @vqs: virtqueues
+ * @clocks_to_unregister: Clock references, which are only used during device
+ * removal.
+ * For other uses, there would be a race between device
+ * creation and setting the pointers here.
* @num_clocks: # of virtio_rtc clocks
*/
struct viortc_dev {
struct virtio_device *vdev;
struct viortc_vq vqs[VIORTC_MAX_NR_QUEUES];
+ struct viortc_ptp_clock **clocks_to_unregister;
u16 num_clocks;
};
@@ -626,6 +631,109 @@ int viortc_cross_cap(struct viortc_dev *viortc, u16 vio_clk_id, u16 hw_counter,
* init, deinit
*/
+/**
+ * viortc_init_ptp_clock() - init and register PTP clock
+ * @viortc: device data
+ * @vio_clk_id: virtio_rtc clock id
+ * @clock_type: virtio_rtc clock type
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_init_ptp_clock(struct viortc_dev *viortc, u16 vio_clk_id,
+ u16 clock_type)
+{
+ struct device *dev = &viortc->vdev->dev;
+ char ptp_clock_name[PTP_CLOCK_NAME_LEN];
+ const char *type_name;
+ /* fit prefix + u16 in decimal */
+ char type_name_buf[5 + 5 + 1];
+ struct viortc_ptp_clock *vio_ptp;
+
+ switch (clock_type) {
+ case VIRTIO_RTC_CLOCK_UTC:
+ type_name = "UTC";
+ break;
+ case VIRTIO_RTC_CLOCK_TAI:
+ type_name = "TAI";
+ break;
+ case VIRTIO_RTC_CLOCK_MONO:
+ type_name = "monotonic";
+ break;
+ default:
+ snprintf(type_name_buf, sizeof(type_name_buf), "type %hu",
+ clock_type);
+ type_name = type_name_buf;
+ }
+
+ snprintf(ptp_clock_name, PTP_CLOCK_NAME_LEN, "Virtio PTP %s",
+ type_name);
+
+ vio_ptp = viortc_ptp_register(viortc, dev, vio_clk_id, ptp_clock_name);
+ if (IS_ERR(vio_ptp)) {
+ dev_err(dev, "failed to register PTP clock '%s'\n",
+ ptp_clock_name);
+ return PTR_ERR(vio_ptp);
+ }
+
+ viortc->clocks_to_unregister[vio_clk_id] = vio_ptp;
+
+ if (!vio_ptp)
+ dev_warn(dev, "clock %d is not exposed to userspace\n",
+ vio_clk_id);
+
+ return 0;
+}
+
+/**
+ * viortc_init_clock() - init local representation of virtio_rtc clock
+ * @viortc: device data
+ * @vio_clk_id: virtio_rtc clock id
+ *
+ * Initializes PHC and/or RTC class device to represent virtio_rtc clock.
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_init_clock(struct viortc_dev *viortc, u16 vio_clk_id)
+{
+ u16 clock_type;
+ int ret;
+
+ ret = viortc_clock_cap(viortc, vio_clk_id, &clock_type);
+ if (ret)
+ return ret;
+
+ if (IS_ENABLED(CONFIG_VIRTIO_RTC_PTP)) {
+ ret = viortc_init_ptp_clock(viortc, vio_clk_id, clock_type);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+/**
+ * viortc_clocks_exit() - unregister PHCs
+ * @viortc: device data
+ */
+static void viortc_clocks_exit(struct viortc_dev *viortc)
+{
+ unsigned int i;
+ struct viortc_ptp_clock *vio_ptp;
+
+ for (i = 0; i < viortc->num_clocks; i++) {
+ vio_ptp = viortc->clocks_to_unregister[i];
+
+ if (!vio_ptp)
+ continue;
+
+ viortc->clocks_to_unregister[i] = NULL;
+
+ WARN_ON(viortc_ptp_unregister(vio_ptp, &viortc->vdev->dev));
+ }
+}
+
/**
* viortc_clocks_init() - init local representations of virtio_rtc clocks
* @viortc: device data
@@ -637,6 +745,7 @@ static int viortc_clocks_init(struct viortc_dev *viortc)
{
int ret;
u16 num_clocks;
+ unsigned int i;
ret = viortc_cfg(viortc, &num_clocks);
if (ret)
@@ -649,8 +758,22 @@ static int viortc_clocks_init(struct viortc_dev *viortc)
viortc->num_clocks = num_clocks;
- /* In the future, PTP clocks will be initialized here. */
- (void)viortc_clock_cap;
+ viortc->clocks_to_unregister =
+ devm_kcalloc(&viortc->vdev->dev, num_clocks,
+ sizeof(*viortc->clocks_to_unregister), GFP_KERNEL);
+ if (!viortc->clocks_to_unregister)
+ return -ENOMEM;
+
+ for (i = 0; i < num_clocks; i++) {
+ ret = viortc_init_clock(viortc, i);
+ if (ret)
+ goto err_free_clocks;
+ }
+
+ return 0;
+
+err_free_clocks:
+ viortc_clocks_exit(viortc);
return ret;
}
@@ -734,7 +857,9 @@ static int viortc_probe(struct virtio_device *vdev)
*/
static void viortc_remove(struct virtio_device *vdev)
{
- /* In the future, PTP clocks will be deinitialized here. */
+ struct viortc_dev *viortc = vdev->priv;
+
+ viortc_clocks_exit(viortc);
virtio_reset_device(vdev);
vdev->config->del_vqs(vdev);
diff --git a/drivers/virtio/virtio_rtc_internal.h b/drivers/virtio/virtio_rtc_internal.h
index 9267661b8030..0dedced4aeae 100644
--- a/drivers/virtio/virtio_rtc_internal.h
+++ b/drivers/virtio/virtio_rtc_internal.h
@@ -9,6 +9,7 @@
#define _VIRTIO_RTC_INTERNAL_H_
#include <linux/types.h>
+#include <linux/ptp_clock_kernel.h>
/* driver core IFs */
@@ -20,4 +21,49 @@ int viortc_read_cross(struct viortc_dev *viortc, u16 vio_clk_id, u16 hw_counter,
int viortc_cross_cap(struct viortc_dev *viortc, u16 vio_clk_id, u16 hw_counter,
bool *supported);
+/* PTP IFs */
+
+struct viortc_ptp_clock;
+
+#if IS_ENABLED(CONFIG_VIRTIO_RTC_PTP)
+
+struct viortc_ptp_clock *viortc_ptp_register(struct viortc_dev *viortc,
+ struct device *parent_dev,
+ u16 vio_clk_id,
+ const char *ptp_clock_name);
+int viortc_ptp_unregister(struct viortc_ptp_clock *vio_ptp,
+ struct device *parent_dev);
+
+#else
+
+static inline struct viortc_ptp_clock *
+viortc_ptp_register(struct viortc_dev *viortc, struct device *parent_dev,
+ u16 vio_clk_id, const char *ptp_clock_name)
+{
+ return NULL;
+}
+
+static inline int viortc_ptp_unregister(struct viortc_ptp_clock *vio_ptp,
+ struct device *parent_dev)
+{
+ return -ENODEV;
+}
+
+#endif
+
+/* HW counter IFs */
+
+/**
+ * viortc_hw_xtstamp_params() - get HW-specific xtstamp params
+ * @hw_counter: virtio_rtc HW counter type
+ * @cs_id: clocksource id corresponding to hw_counter
+ *
+ * Gets the HW-specific xtstamp params. Returns an error if the driver cannot
+ * support xtstamp.
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+int viortc_hw_xtstamp_params(u16 *hw_counter, enum clocksource_ids *cs_id);
+
#endif /* _VIRTIO_RTC_INTERNAL_H_ */
diff --git a/drivers/virtio/virtio_rtc_ptp.c b/drivers/virtio/virtio_rtc_ptp.c
new file mode 100644
index 000000000000..4592cd070772
--- /dev/null
+++ b/drivers/virtio/virtio_rtc_ptp.c
@@ -0,0 +1,342 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Expose virtio_rtc clocks as PTP clocks.
+ *
+ * Copyright (C) 2022-2023 OpenSynergy GmbH
+ *
+ * Derived from ptp_kvm_common.c, virtual PTP 1588 clock for use with KVM
+ * guests.
+ *
+ * Copyright (C) 2017 Red Hat Inc.
+ */
+
+#include <linux/device.h>
+#include <linux/err.h>
+#include <linux/ptp_clock_kernel.h>
+
+#include <uapi/linux/virtio_rtc.h>
+
+#include "virtio_rtc_internal.h"
+
+/**
+ * struct viortc_ptp_clock - PTP clock abstraction
+ * @ptp_clock: PTP clock handle
+ * @viortc: virtio_rtc device data
+ * @ptp_info: PTP clock description
+ * @vio_clk_id: virtio_rtc clock id
+ * @have_cross: device supports crosststamp with available HW counter
+ */
+struct viortc_ptp_clock {
+ struct ptp_clock *ptp_clock;
+ struct viortc_dev *viortc;
+ struct ptp_clock_info ptp_info;
+ u16 vio_clk_id;
+ bool have_cross;
+};
+
+/**
+ * struct viortc_ptp_cross_ctx - context for get_device_system_crosststamp()
+ * @device_time: device clock reading
+ * @system_counterval: HW counter value at device_time
+ *
+ * Provides the already obtained crosststamp to get_device_system_crosststamp().
+ */
+struct viortc_ptp_cross_ctx {
+ ktime_t device_time;
+ struct system_counterval_t system_counterval;
+};
+
+/* Weak function in case get_device_system_crosststamp() is not supported */
+int __weak viortc_hw_xtstamp_params(u16 *hw_counter,
+ enum clocksource_ids *cs_id)
+{
+ return -EOPNOTSUPP;
+}
+
+/**
+ * viortc_ptp_get_time_fn() - callback for get_device_system_crosststamp()
+ * @device_time: device clock reading
+ * @system_counterval: HW counter value at device_time
+ * @ctx: context with already obtained crosststamp
+ *
+ * Return: zero (success).
+ */
+static int viortc_ptp_get_time_fn(ktime_t *device_time,
+ struct system_counterval_t *system_counterval,
+ void *ctx)
+{
+ struct viortc_ptp_cross_ctx *vio_ctx = ctx;
+
+ *device_time = vio_ctx->device_time;
+ *system_counterval = vio_ctx->system_counterval;
+
+ return 0;
+}
+
+/**
+ * viortc_ptp_do_xtstamp() - get crosststamp from device
+ * @vio_ptp: virtio_rtc PTP clock
+ * @ctx: context for get_device_system_crosststamp()
+ * @hw_counter: virtio_rtc HW counter type
+ * @cs_id: clocksource id corresponding to hw_counter
+ *
+ * Reads HW-specific crosststamp from device.
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_ptp_do_xtstamp(struct viortc_ptp_clock *vio_ptp,
+ struct viortc_ptp_cross_ctx *ctx,
+ u16 hw_counter, enum clocksource_ids cs_id)
+{
+ u64 ns;
+ u64 max_ns;
+ int ret;
+
+ ctx->system_counterval.cs_id = cs_id;
+
+ ret = viortc_read_cross(vio_ptp->viortc, vio_ptp->vio_clk_id,
+ hw_counter, &ns,
+ &ctx->system_counterval.cycles);
+ if (ret)
+ return ret;
+
+ max_ns = (u64)ktime_to_ns(KTIME_MAX);
+ if (ns > max_ns)
+ return -EINVAL;
+
+ ctx->device_time = ns_to_ktime(ns);
+
+ return 0;
+}
+
+/*
+ * PTP clock operations
+ */
+
+/**
+ * viortc_ptp_getcrosststamp() - PTP clock getcrosststamp op
+ * @vio_ptp: virtio_rtc PTP clock
+ * @xtstamp: crosststamp
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_ptp_getcrosststamp(struct ptp_clock_info *ptp,
+ struct system_device_crosststamp *xtstamp)
+{
+ struct viortc_ptp_clock *vio_ptp =
+ container_of(ptp, struct viortc_ptp_clock, ptp_info);
+ int ret;
+ struct system_time_snapshot history_begin;
+ struct viortc_ptp_cross_ctx ctx;
+ enum clocksource_ids cs_id;
+ u16 hw_counter;
+
+ if (!vio_ptp->have_cross)
+ return -EOPNOTSUPP;
+
+ ret = viortc_hw_xtstamp_params(&hw_counter, &cs_id);
+ if (ret)
+ return ret;
+
+ ktime_get_snapshot(&history_begin);
+ if (history_begin.cs_id != cs_id)
+ return -EOPNOTSUPP;
+
+ /*
+ * Getting the timestamp can take many milliseconds with a slow Virtio
+ * device. This is too long for viortc_ptp_get_time_fn() passed to
+ * get_device_system_crosststamp(), which has to usually return before
+ * the timekeeper seqcount increases (every tick or so).
+ *
+ * So, get the actual cross-timestamp first.
+ */
+ ret = viortc_ptp_do_xtstamp(vio_ptp, &ctx, hw_counter, cs_id);
+ if (ret)
+ return ret;
+
+ ret = get_device_system_crosststamp(viortc_ptp_get_time_fn, &ctx,
+ &history_begin, xtstamp);
+ if (ret)
+ pr_debug("%s: get_device_system_crosststamp() returned %d\n",
+ __func__, ret);
+
+ return ret;
+}
+
+/** viortc_ptp_adjfine() - unsupported PTP clock adjfine op */
+static int viortc_ptp_adjfine(struct ptp_clock_info *ptp, long scaled_ppm)
+{
+ return -EOPNOTSUPP;
+}
+
+/** viortc_ptp_adjtime() - unsupported PTP clock adjtime op */
+static int viortc_ptp_adjtime(struct ptp_clock_info *ptp, s64 delta)
+{
+ return -EOPNOTSUPP;
+}
+
+/** viortc_ptp_settime64() - unsupported PTP clock settime64 op */
+static int viortc_ptp_settime64(struct ptp_clock_info *ptp,
+ const struct timespec64 *ts)
+{
+ return -EOPNOTSUPP;
+}
+
+/**
+ * viortc_ptp_gettimex64() - PTP clock gettimex64 op
+ *
+ * Context: Process context.
+ */
+static int viortc_ptp_gettimex64(struct ptp_clock_info *ptp,
+ struct timespec64 *ts,
+ struct ptp_system_timestamp *sts)
+{
+ struct viortc_ptp_clock *vio_ptp =
+ container_of(ptp, struct viortc_ptp_clock, ptp_info);
+ u64 ns;
+ int ret;
+
+ ptp_read_system_prets(sts);
+ ret = viortc_read(vio_ptp->viortc, vio_ptp->vio_clk_id, &ns);
+ ptp_read_system_postts(sts);
+
+ if (ret)
+ return ret;
+
+ if (ns > (u64)S64_MAX)
+ return -EINVAL;
+
+ *ts = ns_to_timespec64((s64)ns);
+
+ return 0;
+}
+
+/** viortc_ptp_enable() - unsupported PTP clock enable op */
+static int viortc_ptp_enable(struct ptp_clock_info *ptp,
+ struct ptp_clock_request *rq, int on)
+{
+ return -EOPNOTSUPP;
+}
+
+/**
+ * viortc_ptp_info_template - ptp_clock_info template
+ *
+ * The .name member will be set for individual virtio_rtc PTP clocks.
+ */
+static const struct ptp_clock_info viortc_ptp_info_template = {
+ .owner = THIS_MODULE,
+ /* .name is set according to clock type */
+ .adjfine = viortc_ptp_adjfine,
+ .adjtime = viortc_ptp_adjtime,
+ .gettimex64 = viortc_ptp_gettimex64,
+ .settime64 = viortc_ptp_settime64,
+ .enable = viortc_ptp_enable,
+ .getcrosststamp = viortc_ptp_getcrosststamp,
+};
+
+/**
+ * viortc_ptp_unregister() - PTP clock unregistering wrapper
+ * @vio_ptp: virtio_rtc PTP clock
+ * @parent_dev: parent device of PTP clock
+ *
+ * Return: Zero on success, negative error code otherwise.
+ */
+int viortc_ptp_unregister(struct viortc_ptp_clock *vio_ptp,
+ struct device *parent_dev)
+{
+ int ret = ptp_clock_unregister(vio_ptp->ptp_clock);
+
+ if (!ret)
+ devm_kfree(parent_dev, vio_ptp);
+
+ return ret;
+}
+
+/**
+ * viortc_ptp_get_cross_cap() - get xtstamp support info from device
+ * @viortc: virtio_rtc device data
+ * @vio_ptp: virtio_rtc PTP clock abstraction
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_ptp_get_cross_cap(struct viortc_dev *viortc,
+ struct viortc_ptp_clock *vio_ptp)
+{
+ int ret;
+ enum clocksource_ids cs_id;
+ u16 hw_counter;
+ bool xtstamp_supported;
+
+ ret = viortc_hw_xtstamp_params(&hw_counter, &cs_id);
+ if (ret) {
+ vio_ptp->have_cross = false;
+ return 0;
+ }
+
+ ret = viortc_cross_cap(viortc, vio_ptp->vio_clk_id, hw_counter,
+ &xtstamp_supported);
+ if (ret)
+ return ret;
+
+ vio_ptp->have_cross = xtstamp_supported;
+
+ return 0;
+}
+
+/**
+ * viortc_ptp_register() - prepare and register PTP clock
+ * @viortc: virtio_rtc device data
+ * @parent_dev: parent device for PTP clock
+ * @vio_clk_id: id of virtio_rtc clock which backs PTP clock
+ * @ptp_clock_name: PTP clock name
+ *
+ * Context: Process context.
+ * Return: Pointer on success, ERR_PTR() otherwise; NULL if PTP clock support
+ * not available.
+ */
+struct viortc_ptp_clock *viortc_ptp_register(struct viortc_dev *viortc,
+ struct device *parent_dev,
+ u16 vio_clk_id,
+ const char *ptp_clock_name)
+{
+ struct viortc_ptp_clock *vio_ptp;
+ struct ptp_clock *ptp_clock;
+ ssize_t len;
+ int ret;
+
+ vio_ptp = devm_kzalloc(parent_dev, sizeof(*vio_ptp), GFP_KERNEL);
+ if (!vio_ptp)
+ return ERR_PTR(-ENOMEM);
+
+ vio_ptp->viortc = viortc;
+ vio_ptp->vio_clk_id = vio_clk_id;
+ vio_ptp->ptp_info = viortc_ptp_info_template;
+ len = strscpy(vio_ptp->ptp_info.name, ptp_clock_name,
+ sizeof(vio_ptp->ptp_info.name));
+ if (len < 0) {
+ ret = len;
+ goto err_free_dev;
+ }
+
+ ret = viortc_ptp_get_cross_cap(viortc, vio_ptp);
+ if (ret)
+ goto err_free_dev;
+
+ ptp_clock = ptp_clock_register(&vio_ptp->ptp_info, parent_dev);
+ if (IS_ERR(ptp_clock))
+ goto err_on_register;
+
+ vio_ptp->ptp_clock = ptp_clock;
+
+ return vio_ptp;
+
+err_on_register:
+ ret = PTR_ERR(ptp_clock);
+
+err_free_dev:
+ devm_kfree(parent_dev, vio_ptp);
+ return ERR_PTR(ret);
+}
--
2.40.1
---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [virtio-dev] [RFC PATCH v3 6/7] virtio_rtc: Add Arm Generic Timer cross-timestamping
2023-12-18 7:38 [virtio-dev] [RFC PATCH v3 0/7] Add virtio_rtc module and related changes Peter Hilber
2023-12-18 7:38 ` [virtio-dev] [RFC PATCH v3 4/7] virtio_rtc: Add module and driver core Peter Hilber
2023-12-18 7:38 ` [virtio-dev] [RFC PATCH v3 5/7] virtio_rtc: Add PTP clocks Peter Hilber
@ 2023-12-18 7:38 ` Peter Hilber
[not found] ` <e410d65754ba6a11ad7f74b27dc28d9a25d8c82e.camel@infradead.org>
2023-12-18 7:38 ` [virtio-dev] [RFC PATCH v3 7/7] virtio_rtc: Add RTC class driver Peter Hilber
[not found] ` <684eac07834699889fdb67be4cee09319c994a42.camel@infradead.org>
4 siblings, 1 reply; 35+ messages in thread
From: Peter Hilber @ 2023-12-18 7:38 UTC (permalink / raw)
To: linux-kernel, virtualization, virtio-dev, linux-arm-kernel
Cc: Peter Hilber, Michael S. Tsirkin, Jason Wang, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Thomas Gleixner
Add PTP_SYS_OFFSET_PRECISE2 support on platforms using the Arm Generic
Timer.
Always report the CP15 virtual counter as the HW counter in use by
arm_arch_timer, since the Linux kernel's usage of the Arm Generic Timer
should always be compatible with this.
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
---
Notes:
v2:
- Depend on prerequisite patch series "treewide: Use clocksource id for
get_device_system_crosststamp()".
- Return clocksource id instead of calling dropped arm_arch_timer helpers.
- Always report the CP15 virtual counter to be in use by arm_arch_timer,
since distinction of Arm physical and virtual counter appears unneeded
after discussion with Marc Zyngier.
drivers/virtio/Kconfig | 13 +++++++++++++
drivers/virtio/Makefile | 1 +
drivers/virtio/virtio_rtc_arm.c | 22 ++++++++++++++++++++++
3 files changed, 36 insertions(+)
create mode 100644 drivers/virtio/virtio_rtc_arm.c
diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 8542b2f20201..d35c728778d2 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -205,6 +205,19 @@ config VIRTIO_RTC_PTP
If unsure, say Y.
+config VIRTIO_RTC_ARM
+ bool "Virtio RTC cross-timestamping using Arm Generic Timer"
+ default y
+ depends on VIRTIO_RTC_PTP && ARM_ARCH_TIMER
+ help
+ This enables Virtio RTC cross-timestamping using the Arm Generic Timer.
+ It only has an effect if the Virtio RTC device also supports this. The
+ cross-timestamp is available through the PTP clock driver precise
+ cross-timestamp ioctl (PTP_SYS_OFFSET_PRECISE2 or
+ PTP_SYS_OFFSET_PRECISE).
+
+ If unsure, say Y.
+
endif # VIRTIO_RTC
endif # VIRTIO_MENU
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 4d48cbcae6bb..781dff9f8822 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -15,3 +15,4 @@ obj-$(CONFIG_VIRTIO_DMA_SHARED_BUFFER) += virtio_dma_buf.o
obj-$(CONFIG_VIRTIO_RTC) += virtio_rtc.o
virtio_rtc-y := virtio_rtc_driver.o
virtio_rtc-$(CONFIG_VIRTIO_RTC_PTP) += virtio_rtc_ptp.o
+virtio_rtc-$(CONFIG_VIRTIO_RTC_ARM) += virtio_rtc_arm.o
diff --git a/drivers/virtio/virtio_rtc_arm.c b/drivers/virtio/virtio_rtc_arm.c
new file mode 100644
index 000000000000..5185b130b3f1
--- /dev/null
+++ b/drivers/virtio/virtio_rtc_arm.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Provides cross-timestamp params for Arm.
+ *
+ * Copyright (C) 2022-2023 OpenSynergy GmbH
+ */
+
+#include <linux/clocksource_ids.h>
+
+#include <uapi/linux/virtio_rtc.h>
+
+#include "virtio_rtc_internal.h"
+
+/* see header for doc */
+
+int viortc_hw_xtstamp_params(u16 *hw_counter, enum clocksource_ids *cs_id)
+{
+ *hw_counter = VIRTIO_RTC_COUNTER_ARM_VIRT;
+ *cs_id = CSID_ARM_ARCH_COUNTER;
+
+ return 0;
+}
--
2.40.1
---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [virtio-dev] [RFC PATCH v3 7/7] virtio_rtc: Add RTC class driver
2023-12-18 7:38 [virtio-dev] [RFC PATCH v3 0/7] Add virtio_rtc module and related changes Peter Hilber
` (2 preceding siblings ...)
2023-12-18 7:38 ` [virtio-dev] [RFC PATCH v3 6/7] virtio_rtc: Add Arm Generic Timer cross-timestamping Peter Hilber
@ 2023-12-18 7:38 ` Peter Hilber
[not found] ` <684eac07834699889fdb67be4cee09319c994a42.camel@infradead.org>
4 siblings, 0 replies; 35+ messages in thread
From: Peter Hilber @ 2023-12-18 7:38 UTC (permalink / raw)
To: linux-kernel, virtualization, virtio-dev, linux-rtc
Cc: Peter Hilber, Michael S. Tsirkin, Jason Wang, Xuan Zhuo,
Alessandro Zummo, Alexandre Belloni
Expose the virtio-rtc UTC clock as an RTC clock to userspace, if it is
present. Support RTC alarm if the virtio-rtc alarm feature is present. The
virtio-rtc device signals an alarm by marking an alarmq buffer as used.
Peculiarities
-------------
A virtio-rtc clock is a bit special for an RTC clock in that
- the clock may step (also backwards) autonomously at any time and
- the device, and its notification mechanism, will be reset during boot or
resume from sleep.
The virtio-rtc device avoids that the driver might miss an alarm. The
device signals an alarm whenever the clock has reached or passed the alarm
time, and also when the device is reset (on boot or resume from sleep), if
the alarm time is in the past.
Open Issue
----------
The CLOCK_BOOTTIME_ALARM will use the RTC clock to wake up from sleep, and
implicitly assumes that no RTC clock steps will occur during sleep. The RTC
class driver does not know whether the current alarm is a real-time alarm
or a boot-time alarm.
Perhaps this might be handled by the driver also setting a virtio-rtc
monotonic alarm (which uses a clock similar to CLOCK_BOOTTIME_ALARM). The
virtio-rtc monotonic alarm would just be used to wake up in case it was a
CLOCK_BOOTTIME_ALARM alarm.
Otherwise, the behavior should not differ from other RTC class drivers.
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
---
Notes:
Added in v3.
drivers/virtio/Kconfig | 21 +-
drivers/virtio/Makefile | 1 +
drivers/virtio/virtio_rtc_class.c | 269 +++++++++++++++
drivers/virtio/virtio_rtc_driver.c | 477 ++++++++++++++++++++++++++-
drivers/virtio/virtio_rtc_internal.h | 53 +++
include/uapi/linux/virtio_rtc.h | 88 ++++-
6 files changed, 902 insertions(+), 7 deletions(-)
create mode 100644 drivers/virtio/virtio_rtc_class.c
diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index d35c728778d2..e97bb2e9eca1 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -180,7 +180,8 @@ config VIRTIO_RTC
help
This driver provides current time from a Virtio RTC device. The driver
provides the time through one or more clocks. The Virtio RTC PTP
- clocks must be enabled to expose the clocks to userspace.
+ clocks and/or the Real Time Clock driver for Virtio RTC must be
+ enabled to expose the clocks to userspace.
To compile this code as a module, choose M here: the module will be
called virtio_rtc.
@@ -189,8 +190,8 @@ config VIRTIO_RTC
if VIRTIO_RTC
-comment "WARNING: Consider enabling VIRTIO_RTC_PTP."
- depends on !VIRTIO_RTC_PTP
+comment "WARNING: Consider enabling VIRTIO_RTC_PTP and/or VIRTIO_RTC_CLASS."
+ depends on !VIRTIO_RTC_PTP && !VIRTIO_RTC_CLASS
comment "Enable PTP_1588_CLOCK in order to enable VIRTIO_RTC_PTP."
depends on PTP_1588_CLOCK=n
@@ -218,6 +219,20 @@ config VIRTIO_RTC_ARM
If unsure, say Y.
+comment "Enable RTC_CLASS in order to enable VIRTIO_RTC_CLASS."
+ depends on RTC_CLASS=n
+
+config VIRTIO_RTC_CLASS
+ bool "Real Time Clock driver for Virtio RTC"
+ default y
+ depends on RTC_CLASS
+ help
+ This exposes the Virtio RTC UTC clock as a Linux Real Time Clock. It
+ only has an effect if the Virtio RTC device has a UTC clock. The Real
+ Time Clock is read-only, and may support setting an alarm.
+
+ If unsure, say Y.
+
endif # VIRTIO_RTC
endif # VIRTIO_MENU
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 781dff9f8822..6c26bad777db 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -16,3 +16,4 @@ obj-$(CONFIG_VIRTIO_RTC) += virtio_rtc.o
virtio_rtc-y := virtio_rtc_driver.o
virtio_rtc-$(CONFIG_VIRTIO_RTC_PTP) += virtio_rtc_ptp.o
virtio_rtc-$(CONFIG_VIRTIO_RTC_ARM) += virtio_rtc_arm.o
+virtio_rtc-$(CONFIG_VIRTIO_RTC_CLASS) += virtio_rtc_class.o
diff --git a/drivers/virtio/virtio_rtc_class.c b/drivers/virtio/virtio_rtc_class.c
new file mode 100644
index 000000000000..fcb694f0f9a0
--- /dev/null
+++ b/drivers/virtio/virtio_rtc_class.c
@@ -0,0 +1,269 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * virtio_rtc RTC class driver
+ *
+ * Copyright (C) 2023 OpenSynergy GmbH
+ */
+
+#include <linux/overflow.h>
+#include <linux/rtc.h>
+#include <linux/time64.h>
+
+#include <uapi/linux/virtio_rtc.h>
+
+#include "virtio_rtc_internal.h"
+
+/**
+ * struct viortc_class - RTC class wrapper
+ * @viortc: virtio_rtc device data
+ * @rtc: RTC device
+ * @vio_clk_id: virtio_rtc clock id
+ * @stopped: Whether RTC ops are disallowed. Access protected by rtc_lock().
+ */
+struct viortc_class {
+ struct viortc_dev *viortc;
+ struct rtc_device *rtc;
+ u16 vio_clk_id;
+ bool stopped;
+};
+
+/**
+ * viortc_class_get_locked() - get RTC class wrapper, if ops allowed
+ * @dev: virtio device
+ *
+ * Gets the RTC class wrapper from the virtio device, if it is available and
+ * ops are allowed.
+ *
+ * Context: Caller must hold rtc_lock().
+ * Return: RTC class wrapper if available and ops allowed, ERR_PTR otherwise.
+ */
+static struct viortc_class *viortc_class_get_locked(struct device *dev)
+{
+ struct viortc_class *viortc_class;
+
+ viortc_class = viortc_class_from_dev(dev);
+ if (IS_ERR(viortc_class))
+ return viortc_class;
+
+ if (viortc_class->stopped)
+ return ERR_PTR(-EBUSY);
+
+ return viortc_class;
+}
+
+/**
+ * viortc_class_read_time() - RTC class op read_time
+ * @dev: virtio device
+ * @tm: read time
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_class_read_time(struct device *dev, struct rtc_time *tm)
+{
+ struct viortc_class *viortc_class;
+ time64_t sec;
+ int ret;
+ u64 ns;
+
+ viortc_class = viortc_class_get_locked(dev);
+ if (IS_ERR(viortc_class))
+ return PTR_ERR(viortc_class);
+
+ ret = viortc_read(viortc_class->viortc, viortc_class->vio_clk_id, &ns);
+ if (ret)
+ return ret;
+
+ sec = ns / NSEC_PER_SEC;
+
+ rtc_time64_to_tm(sec, tm);
+
+ return 0;
+}
+
+/**
+ * viortc_class_read_alarm() - RTC class op read_alarm
+ * @dev: virtio device
+ * @alrm: alarm read out
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_class_read_alarm(struct device *dev, struct rtc_wkalrm *alrm)
+{
+ struct viortc_class *viortc_class;
+ time64_t alarm_time_sec;
+ u64 alarm_time_ns;
+ bool enabled;
+ int ret;
+
+ viortc_class = viortc_class_get_locked(dev);
+ if (IS_ERR(viortc_class))
+ return PTR_ERR(viortc_class);
+
+ ret = viortc_read_alarm(viortc_class->viortc, viortc_class->vio_clk_id,
+ &alarm_time_ns, &enabled);
+ if (ret)
+ return ret;
+
+ alarm_time_sec = alarm_time_ns / NSEC_PER_SEC;
+ rtc_time64_to_tm(alarm_time_sec, &alrm->time);
+
+ alrm->enabled = enabled;
+
+ return 0;
+}
+
+/**
+ * viortc_class_set_alarm() - RTC class op set_alarm
+ * @dev: virtio device
+ * @alrm: alarm to set
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_class_set_alarm(struct device *dev, struct rtc_wkalrm *alrm)
+{
+ struct viortc_class *viortc_class;
+ time64_t alarm_time_sec;
+ u64 alarm_time_ns;
+
+ viortc_class = viortc_class_get_locked(dev);
+ if (IS_ERR(viortc_class))
+ return PTR_ERR(viortc_class);
+
+ alarm_time_sec = rtc_tm_to_time64(&alrm->time);
+
+ if (alarm_time_sec < 0)
+ return -EINVAL;
+
+ if (check_mul_overflow((u64)alarm_time_sec, (u64)NSEC_PER_SEC,
+ &alarm_time_ns))
+ return -EINVAL;
+
+ return viortc_set_alarm(viortc_class->viortc, viortc_class->vio_clk_id,
+ alarm_time_ns, alrm->enabled);
+}
+
+/**
+ * viortc_class_alarm_irq_enable() - RTC class op alarm_irq_enable
+ * @dev: virtio device
+ * @enabled: enable or disable alarm IRQ
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_class_alarm_irq_enable(struct device *dev,
+ unsigned int enabled)
+{
+ struct viortc_class *viortc_class;
+
+ viortc_class = viortc_class_get_locked(dev);
+ if (IS_ERR(viortc_class))
+ return PTR_ERR(viortc_class);
+
+ return viortc_set_alarm_enabled(viortc_class->viortc,
+ viortc_class->vio_clk_id, enabled);
+}
+
+static const struct rtc_class_ops viortc_class_with_alarm_ops = {
+ .read_time = viortc_class_read_time,
+ .read_alarm = viortc_class_read_alarm,
+ .set_alarm = viortc_class_set_alarm,
+ .alarm_irq_enable = viortc_class_alarm_irq_enable,
+};
+
+static const struct rtc_class_ops viortc_class_no_alarm_ops = {
+ .read_time = viortc_class_read_time,
+};
+
+/**
+ * viortc_class_alarm() - propagate alarm notification as alarm interrupt
+ * @viortc_class: RTC class wrapper
+ * @vio_clk_id: virtio_rtc clock id
+ *
+ * Context: Any context.
+ */
+void viortc_class_alarm(struct viortc_class *viortc_class, u16 vio_clk_id)
+{
+ if (WARN_ONCE(
+ !viortc_class,
+ "virtio_rtc: unexpected alarm, no RTC class device available\n"))
+ return;
+
+ if (vio_clk_id != viortc_class->vio_clk_id) {
+ dev_err_ratelimited(&viortc_class->rtc->dev,
+ "%s: unexpected clock id %d != %d\n",
+ __func__, vio_clk_id,
+ viortc_class->vio_clk_id);
+ return;
+ }
+
+ rtc_update_irq(viortc_class->rtc, 1, RTC_AF | RTC_IRQF);
+}
+
+/**
+ * viortc_class_stop() - disallow RTC class ops
+ * @viortc_class: RTC class wrapper
+ *
+ * Context: Process context. Caller must NOT hold rtc_lock().
+ */
+void viortc_class_stop(struct viortc_class *viortc_class)
+{
+ rtc_lock(viortc_class->rtc);
+
+ viortc_class->stopped = true;
+
+ rtc_unlock(viortc_class->rtc);
+}
+
+/**
+ * viortc_class_register() - register RTC class device
+ * @viortc_class: RTC class wrapper
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+int viortc_class_register(struct viortc_class *viortc_class)
+{
+ return devm_rtc_register_device(viortc_class->rtc);
+}
+
+/**
+ * viortc_class_init() - init RTC class wrapper and device
+ * @viortc: device data
+ * @vio_clk_id: virtio_rtc clock id
+ * @have_alarm: expose alarm ops
+ * @parent_dev: virtio device
+ *
+ * Context: Process context.
+ * Return: RTC class wrapper on success, ERR_PTR otherwise.
+ */
+struct viortc_class *viortc_class_init(struct viortc_dev *viortc,
+ u16 vio_clk_id, bool have_alarm,
+ struct device *parent_dev)
+{
+ struct viortc_class *viortc_class;
+ struct rtc_device *rtc;
+
+ viortc_class =
+ devm_kzalloc(parent_dev, sizeof(*viortc_class), GFP_KERNEL);
+ if (!viortc_class)
+ return ERR_PTR(-ENOMEM);
+
+ viortc_class->viortc = viortc;
+
+ rtc = devm_rtc_allocate_device(parent_dev);
+ if (IS_ERR(rtc))
+ return ERR_PTR(PTR_ERR(rtc));
+
+ viortc_class->rtc = rtc;
+
+ clear_bit(RTC_FEATURE_UPDATE_INTERRUPT, rtc->features);
+
+ rtc->ops = have_alarm ? &viortc_class_with_alarm_ops :
+ &viortc_class_no_alarm_ops;
+ rtc->range_max = U64_MAX / NSEC_PER_SEC;
+
+ return viortc_class;
+}
diff --git a/drivers/virtio/virtio_rtc_driver.c b/drivers/virtio/virtio_rtc_driver.c
index c331b7383285..a4b5c3634de9 100644
--- a/drivers/virtio/virtio_rtc_driver.c
+++ b/drivers/virtio/virtio_rtc_driver.c
@@ -9,15 +9,19 @@
#include <linux/virtio.h>
#include <linux/virtio_ids.h>
#include <linux/virtio_config.h>
+#include <linux/device.h>
#include <linux/module.h>
#include <uapi/linux/virtio_rtc.h>
#include "virtio_rtc_internal.h"
+#define VIORTC_ALARMQ_BUF_CAP sizeof(union virtio_rtc_notif_alarmq)
+
/* virtqueue order */
enum {
VIORTC_REQUESTQ,
+ VIORTC_ALARMQ,
VIORTC_MAX_NR_QUEUES,
};
@@ -34,17 +38,23 @@ struct viortc_vq {
/**
* struct viortc_dev - virtio_rtc device data
* @vdev: virtio device
+ * @viortc_class: RTC class wrapper for UTC clock, NULL if not available
* @vqs: virtqueues
* @clocks_to_unregister: Clock references, which are only used during device
* removal.
* For other uses, there would be a race between device
* creation and setting the pointers here.
+ * @alarmq_bufs: alarmq buffers list
+ * @num_alarmq_bufs: # of alarmq buffers
* @num_clocks: # of virtio_rtc clocks
*/
struct viortc_dev {
struct virtio_device *vdev;
+ struct viortc_class *viortc_class;
struct viortc_vq vqs[VIORTC_MAX_NR_QUEUES];
struct viortc_ptp_clock **clocks_to_unregister;
+ void **alarmq_bufs;
+ unsigned int num_alarmq_bufs;
u16 num_clocks;
};
@@ -75,6 +85,60 @@ struct viortc_msg {
unsigned int resp_actual_size;
};
+/**
+ * viortc_class_from_dev() - Get RTC class object from virtio device.
+ * @dev: virtio device
+ *
+ * Context: Any context.
+ * Return: RTC class object if available, ERR_PTR otherwise.
+ */
+struct viortc_class *viortc_class_from_dev(struct device *dev)
+{
+ struct virtio_device *vdev;
+ struct viortc_dev *viortc;
+
+ vdev = container_of(dev, typeof(*vdev), dev);
+ viortc = vdev->priv;
+
+ return viortc->viortc_class ?: ERR_PTR(-ENODEV);
+}
+
+/**
+ * viortc_alarms_supported() - Whether device and driver support alarms.
+ * @vdev: virtio device
+ *
+ * NB: Device and driver may not support alarms for the same clocks.
+ *
+ * Context: Any context.
+ * Return: True if both device and driver can support alarms.
+ */
+static bool viortc_alarms_supported(struct virtio_device *vdev)
+{
+ return IS_ENABLED(CONFIG_VIRTIO_RTC_CLASS) &&
+ virtio_has_feature(vdev, VIRTIO_RTC_F_ALARM);
+}
+
+/**
+ * viortc_feed_vq() - Make a device write-only buffer available.
+ * @viortc: device data
+ * @vq: notification virtqueue
+ * @buf: buffer
+ * @buf_len: buffer capacity in bytes
+ * @data: token, identifying buffer
+ *
+ * Context: Caller must prevent concurrent access to vq.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_feed_vq(struct viortc_dev *viortc, struct virtqueue *vq,
+ void *buf, unsigned int buf_len, void *data)
+{
+ struct scatterlist sg;
+
+ sg_init_one(&sg, buf, buf_len);
+
+ return virtqueue_add_inbuf(vq, &sg, 1, data, GFP_ATOMIC);
+}
+
/**
* viortc_msg_init() - Allocate and initialize requestq message.
* @viortc: device data
@@ -239,6 +303,81 @@ static void viortc_cb_requestq(struct virtqueue *vq)
viortc_do_cb(vq, viortc_requestq_hdlr);
}
+/**
+ * viortc_alarmq_hdlr() - process an alarmq used buffer
+ * @token: token identifying the buffer
+ * @len: bytes written by device
+ * @vq: virtqueue
+ * @viortc_vq: device specific data for virtqueue
+ * @viortc: device data
+ *
+ * Processes a VIRTIO_RTC_NOTIF_ALARM notification by calling the RTC class
+ * driver. Makes the buffer available again.
+ *
+ * Context: virtqueue callback
+ */
+static void viortc_alarmq_hdlr(void *token, unsigned int len,
+ struct virtqueue *vq,
+ struct viortc_vq *viortc_vq,
+ struct viortc_dev *viortc)
+{
+ struct virtio_rtc_notif_alarm *notif = token;
+ struct virtio_rtc_notif_head *head = token;
+ unsigned long flags;
+ u16 clock_id;
+ bool notify;
+
+ if (len < sizeof(*head)) {
+ dev_err_ratelimited(
+ &viortc->vdev->dev,
+ "%s: ignoring notification with short header\n",
+ __func__);
+ goto feed_vq;
+ }
+
+ if (virtio_le_to_cpu(head->msg_type) != VIRTIO_RTC_NOTIF_ALARM) {
+ dev_err_ratelimited(&viortc->vdev->dev,
+ "%s: unknown notification type\n",
+ __func__);
+ goto feed_vq;
+ }
+
+ if (len < sizeof(*notif)) {
+ dev_err_ratelimited(&viortc->vdev->dev,
+ "%s: alarm notification too small\n",
+ __func__);
+ goto feed_vq;
+ }
+
+ clock_id = virtio_le_to_cpu(notif->clock_id);
+
+ viortc_class_alarm(viortc->viortc_class, clock_id);
+
+feed_vq:
+ spin_lock_irqsave(&viortc_vq->lock, flags);
+
+ WARN_ON(viortc_feed_vq(viortc, vq, notif, VIORTC_ALARMQ_BUF_CAP,
+ token));
+
+ notify = virtqueue_kick_prepare(vq);
+
+ spin_unlock_irqrestore(&viortc_vq->lock, flags);
+
+ if (notify)
+ virtqueue_notify(vq);
+}
+
+/**
+ * viortc_cb_alarmq() - callback for alarmq
+ * @vq: virtqueue
+ *
+ * Context: virtqueue callback
+ */
+static void viortc_cb_alarmq(struct virtqueue *vq)
+{
+ viortc_do_cb(vq, viortc_alarmq_hdlr);
+}
+
/**
* viortc_get_resp_errno() - converts virtio_rtc errnos to system errnos
* @resp_head: message response header
@@ -554,12 +693,13 @@ static int viortc_cfg(struct viortc_dev *viortc, u16 *num_clocks)
* @viortc: device data
* @vio_clk_id: virtio_rtc clock id
* @type: virtio_rtc clock type
+ * @flags: struct virtio_rtc_resp_clock_cap.flags
*
* Context: Process context.
* Return: Zero on success, negative error code otherwise.
*/
static int viortc_clock_cap(struct viortc_dev *viortc, u16 vio_clk_id,
- u16 *type)
+ u16 *type, u8 *flags)
{
int ret;
VIORTC_DECLARE_MSG_HDL_ONSTACK(hdl, clock_cap, CLOCK_CAP);
@@ -579,6 +719,7 @@ static int viortc_clock_cap(struct viortc_dev *viortc, u16 vio_clk_id,
}
VIORTC_MSG_READ(hdl, type, type);
+ VIORTC_MSG_READ(hdl, flags, flags);
out_release:
viortc_msg_release(VIORTC_MSG(hdl));
@@ -627,10 +768,176 @@ int viortc_cross_cap(struct viortc_dev *viortc, u16 vio_clk_id, u16 hw_counter,
return ret;
}
+/**
+ * viortc_read_alarm() - VIRTIO_RTC_REQ_READ_ALARM wrapper
+ * @viortc: device data
+ * @vio_clk_id: virtio_rtc clock id
+ * @alarm_time: alarm time in ns
+ * @enabled: whether alarm is enabled
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+int viortc_read_alarm(struct viortc_dev *viortc, u16 vio_clk_id,
+ u64 *alarm_time, bool *enabled)
+{
+ VIORTC_DECLARE_MSG_HDL_ONSTACK(hdl, read_alarm, READ_ALARM);
+ u8 flags;
+ int ret;
+
+ ret = VIORTC_MSG_INIT(hdl, viortc);
+ if (ret)
+ return ret;
+
+ VIORTC_MSG_WRITE(hdl, clock_id, &vio_clk_id);
+
+ ret = viortc_msg_xfer(&viortc->vqs[VIORTC_REQUESTQ], VIORTC_MSG(hdl),
+ 0);
+ if (ret) {
+ dev_dbg(&viortc->vdev->dev, "%s: xfer returned %d\n", __func__,
+ ret);
+ goto out_release;
+ }
+
+ VIORTC_MSG_READ(hdl, alarm_time, alarm_time);
+ VIORTC_MSG_READ(hdl, flags, &flags);
+
+ *enabled = !!(flags & VIRTIO_RTC_FLAG_ALARM_ENABLED);
+
+out_release:
+ viortc_msg_release(VIORTC_MSG(hdl));
+
+ return ret;
+}
+
+/**
+ * viortc_set_alarm() - VIRTIO_RTC_REQ_SET_ALARM wrapper
+ * @viortc: device data
+ * @vio_clk_id: virtio_rtc clock id
+ * @alarm_time: alarm time in ns
+ * @alarm_enable: enable or disable alarm
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+int viortc_set_alarm(struct viortc_dev *viortc, u16 vio_clk_id, u64 alarm_time,
+ bool alarm_enable)
+{
+ VIORTC_DECLARE_MSG_HDL_ONSTACK(hdl, set_alarm, SET_ALARM);
+ u8 flags = 0;
+ int ret;
+
+ ret = VIORTC_MSG_INIT(hdl, viortc);
+ if (ret)
+ return ret;
+
+ if (alarm_enable)
+ flags |= VIRTIO_RTC_FLAG_ALARM_ENABLED;
+
+ VIORTC_MSG_WRITE(hdl, clock_id, &vio_clk_id);
+ VIORTC_MSG_WRITE(hdl, alarm_time, &alarm_time);
+ VIORTC_MSG_WRITE(hdl, flags, &flags);
+
+ ret = viortc_msg_xfer(&viortc->vqs[VIORTC_REQUESTQ], VIORTC_MSG(hdl),
+ 0);
+ if (ret) {
+ dev_dbg(&viortc->vdev->dev, "%s: xfer returned %d\n", __func__,
+ ret);
+ goto out_release;
+ }
+
+out_release:
+ viortc_msg_release(VIORTC_MSG(hdl));
+
+ return ret;
+}
+
+/**
+ * viortc_set_alarm_enabled() - VIRTIO_RTC_REQ_SET_ALARM_ENABLED wrapper
+ * @viortc: device data
+ * @vio_clk_id: virtio_rtc clock id
+ * @alarm_enable: enable or disable alarm
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+int viortc_set_alarm_enabled(struct viortc_dev *viortc, u16 vio_clk_id,
+ bool alarm_enable)
+{
+ VIORTC_DECLARE_MSG_HDL_ONSTACK(hdl, set_alarm_enabled,
+ SET_ALARM_ENABLED);
+ u8 flags = 0;
+ int ret;
+
+ ret = VIORTC_MSG_INIT(hdl, viortc);
+ if (ret)
+ return ret;
+
+ if (alarm_enable)
+ flags |= VIRTIO_RTC_FLAG_ALARM_ENABLED;
+
+ VIORTC_MSG_WRITE(hdl, clock_id, &vio_clk_id);
+ VIORTC_MSG_WRITE(hdl, flags, &flags);
+
+ ret = viortc_msg_xfer(&viortc->vqs[VIORTC_REQUESTQ], VIORTC_MSG(hdl),
+ 0);
+ if (ret) {
+ dev_dbg(&viortc->vdev->dev, "%s: xfer returned %d\n", __func__,
+ ret);
+ goto out_release;
+ }
+
+out_release:
+ viortc_msg_release(VIORTC_MSG(hdl));
+
+ return ret;
+}
+
/*
* init, deinit
*/
+/**
+ * viortc_init_clock_rtc_class() - init and register a RTC class device
+ * @viortc: device data
+ * @vio_clk_id: virtio_rtc clock id
+ * @flags: struct virtio_rtc_resp_clock_cap.flags
+ *
+ * The clock must be a UTC clock.
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_init_clock_rtc_class(struct viortc_dev *viortc,
+ u16 vio_clk_id, u8 flags)
+{
+ struct virtio_device *vdev = viortc->vdev;
+ struct viortc_class *viortc_class;
+ struct device *dev = &vdev->dev;
+ bool have_alarm;
+
+ if (viortc->viortc_class) {
+ dev_warn_once(
+ dev,
+ "multiple UTC clocks are present, but creating only one RTC class device\n");
+ return 0;
+ }
+
+ have_alarm = viortc_alarms_supported(vdev) &&
+ !!(flags & VIRTIO_RTC_FLAG_ALARM_CAP);
+
+ viortc_class = viortc_class_init(viortc, vio_clk_id, have_alarm, dev);
+ if (IS_ERR(viortc_class))
+ return PTR_ERR(viortc_class);
+
+ viortc->viortc_class = viortc_class;
+
+ if (have_alarm)
+ device_init_wakeup(dev, true);
+
+ return viortc_class_register(viortc_class);
+}
+
/**
* viortc_init_ptp_clock() - init and register PTP clock
* @viortc: device data
@@ -698,12 +1005,20 @@ static int viortc_init_ptp_clock(struct viortc_dev *viortc, u16 vio_clk_id,
static int viortc_init_clock(struct viortc_dev *viortc, u16 vio_clk_id)
{
u16 clock_type;
+ u8 flags;
int ret;
- ret = viortc_clock_cap(viortc, vio_clk_id, &clock_type);
+ ret = viortc_clock_cap(viortc, vio_clk_id, &clock_type, &flags);
if (ret)
return ret;
+ if (clock_type == VIRTIO_RTC_CLOCK_UTC &&
+ IS_ENABLED(CONFIG_VIRTIO_RTC_CLASS)) {
+ ret = viortc_init_clock_rtc_class(viortc, vio_clk_id, flags);
+ if (ret)
+ return ret;
+ }
+
if (IS_ENABLED(CONFIG_VIRTIO_RTC_PTP)) {
ret = viortc_init_ptp_clock(viortc, vio_clk_id, clock_type);
if (ret)
@@ -714,7 +1029,7 @@ static int viortc_init_clock(struct viortc_dev *viortc, u16 vio_clk_id)
}
/**
- * viortc_clocks_exit() - unregister PHCs
+ * viortc_clocks_exit() - unregister PHCs, stop RTC ops
* @viortc: device data
*/
static void viortc_clocks_exit(struct viortc_dev *viortc)
@@ -732,6 +1047,9 @@ static void viortc_clocks_exit(struct viortc_dev *viortc)
WARN_ON(viortc_ptp_unregister(vio_ptp, &viortc->vdev->dev));
}
+
+ if (viortc->viortc_class)
+ viortc_class_stop(viortc->viortc_class);
}
/**
@@ -778,6 +1096,74 @@ static int viortc_clocks_init(struct viortc_dev *viortc)
return ret;
}
+/**
+ * viortc_alloc_vq_bufs() - allocate alarmq buffers
+ * @viortc: device data
+ * @num_elems: # of buffers
+ * @buf_cap: per-buffer device-writable capacity in bytes
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_alloc_vq_bufs(struct viortc_dev *viortc,
+ unsigned int num_elems, u32 buf_cap)
+{
+ struct device *dev = &viortc->vdev->dev;
+ void **buf_list;
+ unsigned int i;
+ void *buf;
+
+ buf_list = devm_kcalloc(dev, num_elems, sizeof(*buf_list), GFP_KERNEL);
+ if (!buf_list)
+ return -ENOMEM;
+
+ viortc->alarmq_bufs = buf_list;
+ viortc->num_alarmq_bufs = num_elems;
+
+ for (i = 0; i < num_elems; i++) {
+ buf = devm_kzalloc(dev, buf_cap, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ buf_list[i] = buf;
+ }
+
+ return 0;
+}
+
+/**
+ * viortc_populate_vq() - populate alarmq with device-writable buffers
+ * @viortc: device data
+ * @vq: virtqueue
+ * @buf_cap: device-writable buffer size in bytes
+ *
+ * Populates the alarmq with pre-allocated buffers.
+ *
+ * The caller is responsible for kicking the device.
+ *
+ * Context: Process context.
+ * Return: Zero on success, negative error code otherwise.
+ */
+static int viortc_populate_vq(struct viortc_dev *viortc, struct virtqueue *vq,
+ u32 buf_cap)
+{
+ unsigned int num_elems, i;
+ void *buf;
+ int ret;
+
+ num_elems = viortc->num_alarmq_bufs;
+
+ for (i = 0; i < num_elems; i++) {
+ buf = viortc->alarmq_bufs[i];
+
+ ret = viortc_feed_vq(viortc, vq, buf, buf_cap, buf);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
/**
* viortc_init_vqs() - init virtqueues
* @viortc: device data
@@ -794,12 +1180,22 @@ static int viortc_init_vqs(struct viortc_dev *viortc)
const char *names[VIORTC_MAX_NR_QUEUES];
vq_callback_t *callbacks[VIORTC_MAX_NR_QUEUES];
struct virtqueue *vqs[VIORTC_MAX_NR_QUEUES];
+ unsigned int num_elems;
+ bool have_alarms;
int nr_queues;
+ have_alarms = viortc_alarms_supported(vdev);
+
nr_queues = VIORTC_REQUESTQ + 1;
names[VIORTC_REQUESTQ] = "requestq";
callbacks[VIORTC_REQUESTQ] = viortc_cb_requestq;
+ if (have_alarms) {
+ nr_queues = VIORTC_ALARMQ + 1;
+ names[VIORTC_ALARMQ] = "alarmq";
+ callbacks[VIORTC_ALARMQ] = viortc_cb_alarmq;
+ }
+
ret = virtio_find_vqs(vdev, nr_queues, vqs, callbacks, names, NULL);
if (ret)
return ret;
@@ -807,6 +1203,25 @@ static int viortc_init_vqs(struct viortc_dev *viortc)
viortc->vqs[VIORTC_REQUESTQ].vq = vqs[VIORTC_REQUESTQ];
spin_lock_init(&viortc->vqs[VIORTC_REQUESTQ].lock);
+ if (have_alarms) {
+ viortc->vqs[VIORTC_ALARMQ].vq = vqs[VIORTC_ALARMQ];
+ spin_lock_init(&viortc->vqs[VIORTC_ALARMQ].lock);
+
+ num_elems = virtqueue_get_vring_size(vqs[VIORTC_ALARMQ]);
+ if (num_elems == 0)
+ return -ENOSPC;
+
+ if (!viortc->alarmq_bufs) {
+ ret = viortc_alloc_vq_bufs(viortc, num_elems,
+ VIORTC_ALARMQ_BUF_CAP);
+ if (ret)
+ return ret;
+ } else {
+ viortc->num_alarmq_bufs =
+ min(num_elems, viortc->num_alarmq_bufs);
+ }
+ }
+
return 0;
}
@@ -819,6 +1234,7 @@ static int viortc_init_vqs(struct viortc_dev *viortc)
*/
static int viortc_probe(struct virtio_device *vdev)
{
+ struct virtqueue *alarm_vq;
struct viortc_dev *viortc;
int ret;
@@ -842,6 +1258,26 @@ static int viortc_probe(struct virtio_device *vdev)
if (ret)
goto err_reset_vdev;
+ if (viortc_alarms_supported(vdev)) {
+ /*
+ * Now that the RTC device was registered, ready viortc to
+ * receive alarms.
+ */
+ smp_wmb();
+
+ alarm_vq = viortc->vqs[VIORTC_ALARMQ].vq;
+
+ ret = viortc_populate_vq(viortc, alarm_vq,
+ VIORTC_ALARMQ_BUF_CAP);
+ if (ret)
+ goto err_reset_vdev;
+
+ if (!virtqueue_kick(alarm_vq)) {
+ ret = -EIO;
+ goto err_reset_vdev;
+ }
+ }
+
return 0;
err_reset_vdev:
@@ -865,6 +1301,35 @@ static void viortc_remove(struct virtio_device *vdev)
vdev->config->del_vqs(vdev);
}
+#ifdef CONFIG_PM_SLEEP
+static int viortc_freeze(struct virtio_device *dev)
+{
+ return 0;
+}
+
+static int viortc_restore(struct virtio_device *dev)
+{
+ struct viortc_dev *viortc = dev->priv;
+ int ret;
+
+ ret = viortc_init_vqs(viortc);
+ if (ret)
+ return ret;
+
+ if (viortc_alarms_supported(dev))
+ ret = viortc_populate_vq(viortc, viortc->vqs[VIORTC_ALARMQ].vq,
+ VIORTC_ALARMQ_BUF_CAP);
+
+ return ret;
+}
+#endif
+
+static unsigned int features[] = {
+#if IS_ENABLED(CONFIG_VIRTIO_RTC_CLASS)
+ VIRTIO_RTC_F_ALARM,
+#endif
+};
+
static struct virtio_device_id id_table[] = {
{ VIRTIO_ID_CLOCK, VIRTIO_DEV_ANY_ID },
{ 0 },
@@ -874,9 +1339,15 @@ MODULE_DEVICE_TABLE(virtio, id_table);
static struct virtio_driver virtio_rtc_drv = {
.driver.name = KBUILD_MODNAME,
.driver.owner = THIS_MODULE,
+ .feature_table = features,
+ .feature_table_size = ARRAY_SIZE(features),
.id_table = id_table,
.probe = viortc_probe,
.remove = viortc_remove,
+#ifdef CONFIG_PM_SLEEP
+ .freeze = viortc_freeze,
+ .restore = viortc_restore,
+#endif
};
module_virtio_driver(virtio_rtc_drv);
diff --git a/drivers/virtio/virtio_rtc_internal.h b/drivers/virtio/virtio_rtc_internal.h
index 0dedced4aeae..ebd5dbf5bbc5 100644
--- a/drivers/virtio/virtio_rtc_internal.h
+++ b/drivers/virtio/virtio_rtc_internal.h
@@ -8,8 +8,11 @@
#ifndef _VIRTIO_RTC_INTERNAL_H_
#define _VIRTIO_RTC_INTERNAL_H_
+#include <linux/device.h>
+#include <linux/err.h>
#include <linux/types.h>
#include <linux/ptp_clock_kernel.h>
+#include <linux/virtio.h>
/* driver core IFs */
@@ -20,6 +23,16 @@ int viortc_read_cross(struct viortc_dev *viortc, u16 vio_clk_id, u16 hw_counter,
u64 *reading, u64 *cycles);
int viortc_cross_cap(struct viortc_dev *viortc, u16 vio_clk_id, u16 hw_counter,
bool *supported);
+int viortc_read_alarm(struct viortc_dev *viortc, u16 vio_clk_id,
+ u64 *alarm_time, bool *enabled);
+int viortc_set_alarm(struct viortc_dev *viortc, u16 vio_clk_id, u64 alarm_time,
+ bool alarm_enable);
+int viortc_set_alarm_enabled(struct viortc_dev *viortc, u16 vio_clk_id,
+ bool alarm_enable);
+
+struct viortc_class;
+
+struct viortc_class *viortc_class_from_dev(struct device *dev);
/* PTP IFs */
@@ -66,4 +79,44 @@ static inline int viortc_ptp_unregister(struct viortc_ptp_clock *vio_ptp,
*/
int viortc_hw_xtstamp_params(u16 *hw_counter, enum clocksource_ids *cs_id);
+/* RTC class IFs */
+
+#if IS_ENABLED(CONFIG_VIRTIO_RTC_CLASS)
+
+void viortc_class_alarm(struct viortc_class *viortc_class, u16 vio_clk_id);
+
+void viortc_class_stop(struct viortc_class *viortc_class);
+
+int viortc_class_register(struct viortc_class *viortc_class);
+
+struct viortc_class *viortc_class_init(struct viortc_dev *viortc,
+ u16 vio_clk_id, bool have_alarm,
+ struct device *parent_dev);
+
+#else /* CONFIG_VIRTIO_RTC_CLASS */
+
+static inline void viortc_class_alarm(struct viortc_class *viortc_class,
+ u16 vio_clk_id)
+{
+}
+
+static inline void viortc_class_stop(struct viortc_class *viortc_class)
+{
+}
+
+static inline int viortc_class_register(struct viortc_class *viortc_class)
+{
+ return -ENODEV;
+}
+
+static inline struct viortc_class *viortc_class_init(struct viortc_dev *viortc,
+ u16 vio_clk_id,
+ bool have_alarm,
+ struct device *parent_dev)
+{
+ return ERR_PTR(-ENODEV);
+}
+
+#endif /* CONFIG_VIRTIO_RTC_CLASS */
+
#endif /* _VIRTIO_RTC_INTERNAL_H_ */
diff --git a/include/uapi/linux/virtio_rtc.h b/include/uapi/linux/virtio_rtc.h
index f469276562d7..26be3574fabe 100644
--- a/include/uapi/linux/virtio_rtc.h
+++ b/include/uapi/linux/virtio_rtc.h
@@ -8,6 +8,9 @@
#include <linux/types.h>
+/* alarm feature */
+#define VIRTIO_RTC_F_ALARM 0
+
/* read request message types */
#define VIRTIO_RTC_REQ_READ 0x0001
@@ -18,6 +21,13 @@
#define VIRTIO_RTC_REQ_CFG 0x1000
#define VIRTIO_RTC_REQ_CLOCK_CAP 0x1001
#define VIRTIO_RTC_REQ_CROSS_CAP 0x1002
+#define VIRTIO_RTC_REQ_READ_ALARM 0x1003
+#define VIRTIO_RTC_REQ_SET_ALARM 0x1004
+#define VIRTIO_RTC_REQ_SET_ALARM_ENABLED 0x1005
+
+/* alarmq message types */
+
+#define VIRTIO_RTC_NOTIF_ALARM 0x2000
/* Message headers */
@@ -38,6 +48,12 @@ struct virtio_rtc_resp_head {
__u8 reserved[7];
};
+/** common notification header */
+struct virtio_rtc_notif_head {
+ __le16 msg_type;
+ __u8 reserved[6];
+};
+
/* read requests */
/* VIRTIO_RTC_REQ_READ message */
@@ -104,7 +120,9 @@ struct virtio_rtc_resp_clock_cap {
#define VIRTIO_RTC_CLOCK_TAI 1
#define VIRTIO_RTC_CLOCK_MONO 2
__le16 type;
- __u8 reserved[6];
+#define VIRTIO_RTC_FLAG_ALARM_CAP (1 << 0)
+ __u8 flags;
+ __u8 reserved[5];
};
/* VIRTIO_RTC_REQ_CROSS_CAP message */
@@ -123,6 +141,53 @@ struct virtio_rtc_resp_cross_cap {
__u8 reserved[7];
};
+/* VIRTIO_RTC_REQ_READ_ALARM message */
+
+struct virtio_rtc_req_read_alarm {
+ struct virtio_rtc_req_head head;
+ __le16 clock_id;
+ __u8 reserved[6];
+};
+
+struct virtio_rtc_resp_read_alarm {
+ struct virtio_rtc_resp_head head;
+ __le64 alarm_time;
+#define VIRTIO_RTC_FLAG_ALARM_ENABLED (1 << 0)
+ __u8 flags;
+ __u8 reserved[7];
+};
+
+/* VIRTIO_RTC_REQ_SET_ALARM message */
+
+struct virtio_rtc_req_set_alarm {
+ struct virtio_rtc_req_head head;
+ __le64 alarm_time;
+ __le16 clock_id;
+ /* flag VIRTIO_RTC_ALARM_ENABLED */
+ __u8 flags;
+ __u8 reserved[5];
+};
+
+struct virtio_rtc_resp_set_alarm {
+ struct virtio_rtc_resp_head head;
+ /* no response params */
+};
+
+/* VIRTIO_RTC_REQ_SET_ALARM_ENABLED message */
+
+struct virtio_rtc_req_set_alarm_enabled {
+ struct virtio_rtc_req_head head;
+ __le16 clock_id;
+ /* flag VIRTIO_RTC_ALARM_ENABLED */
+ __u8 flags;
+ __u8 reserved[5];
+};
+
+struct virtio_rtc_resp_set_alarm_enabled {
+ struct virtio_rtc_resp_head head;
+ /* no response params */
+};
+
/** Union of request types for requestq */
union virtio_rtc_req_requestq {
struct virtio_rtc_req_read read;
@@ -130,6 +195,9 @@ union virtio_rtc_req_requestq {
struct virtio_rtc_req_cfg cfg;
struct virtio_rtc_req_clock_cap clock_cap;
struct virtio_rtc_req_cross_cap cross_cap;
+ struct virtio_rtc_req_read_alarm read_alarm;
+ struct virtio_rtc_req_set_alarm set_alarm;
+ struct virtio_rtc_req_set_alarm_enabled set_alarm_enabled;
};
/** Union of response types for requestq */
@@ -139,6 +207,24 @@ union virtio_rtc_resp_requestq {
struct virtio_rtc_resp_cfg cfg;
struct virtio_rtc_resp_clock_cap clock_cap;
struct virtio_rtc_resp_cross_cap cross_cap;
+ struct virtio_rtc_resp_read_alarm read_alarm;
+ struct virtio_rtc_resp_set_alarm set_alarm;
+ struct virtio_rtc_resp_set_alarm_enabled set_alarm_enabled;
+};
+
+/* alarmq notifications */
+
+/* VIRTIO_RTC_NOTIF_ALARM notification */
+
+struct virtio_rtc_notif_alarm {
+ struct virtio_rtc_notif_head head;
+ __le16 clock_id;
+ __u8 reserved[6];
+};
+
+/** Union of notification types for alarmq */
+union virtio_rtc_notif_alarmq {
+ struct virtio_rtc_notif_alarm alarm;
};
#endif /* _LINUX_VIRTIO_RTC_H */
--
2.40.1
---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v3 6/7] virtio_rtc: Add Arm Generic Timer cross-timestamping
[not found] ` <e410d65754ba6a11ad7f74b27dc28d9a25d8c82e.camel@infradead.org>
@ 2024-06-20 12:06 ` Peter Hilber
0 siblings, 0 replies; 35+ messages in thread
From: Peter Hilber @ 2024-06-20 12:06 UTC (permalink / raw)
To: David Woodhouse, linux-kernel, virtualization, linux-arm-kernel,
virtio-dev
Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Marc Zyngier,
Mark Rutland, Daniel Lezcano, Thomas Gleixner
Changing virtio-dev address to the new one.
On 15.06.24 09:50, David Woodhouse wrote:
> On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
>>
>> +int viortc_hw_xtstamp_params(u16 *hw_counter, enum clocksource_ids *cs_id)
>> +{
>> + *hw_counter = VIRTIO_RTC_COUNTER_ARM_VIRT;
>
> Hm, but what if it isn't? I think you need to put this in
> drivers/clocksource/arm_arch_timer.c where it can do something like
> kvm_arch_ptp_get_crosststamp() does to decide:
>
> if (arch_timer_uses_ppi == ARCH_TIMER_VIRT_PPI)
> ptp_counter = KVM_PTP_VIRT_COUNTER;
> else
> ptp_counter = KVM_PTP_PHYS_COUNTER;
>
>
>> + *cs_id = CSID_ARM_ARCH_COUNTER;
>> +
>> + return 0;
>> +}
>
I had such a check in v1, but Marc Zyngier didn't like the distinction [1,
2] - maybe also exposing the timer through a generic helper was not
desired. Quoting from [2]:
>> This was the rationale to come up with the physical/virtual counter
>> distinction for the Virtio RTC device. Looking at extensions such as
>> FEAT_ECV, where the CNTPCT_EL0 value can depend on the EL, or FEAT_NV*,
>> it might be a bit simplistic.
>
> Not just simplistic. It doesn't make sense. For this to work, you'd
> need to know the global offset that KVM applies to the global counter,
> plus the *virtualised* CNTPOFF/CNTVOFF values that the guest can
> change at any time on a *per-CPU* basis. None of that is available
> outside of KVM, nor would it make any sense anyway.
>
>> Does this physical/virtual counter distinction sound like a good idea?
>> Otherwise I would drop the arch_timer_counter_get_type() in the next
>> iteration.
>
> My take on this is that only the global counter value makes any sense.
> That value is already available from the host as the virtual counter,
> because we guarantee that CNTVOFF is 0 when outside of the guest
> (well, technically, outside of the vcpu_load/vcpu_put section).
So I put the assumption that the virtual counter will always be the right
choice for current Linux kernels, from the Virtio device POV.
Thanks for the comment,
Peter
[1] https://lore.kernel.org/lkml/20230630171052.985577-1-peter.hilber@opensynergy.com/T/#ma8d596de1cbc8f4a78a18b2aa995db18423494a7
[2] https://lore.kernel.org/lkml/20230630171052.985577-1-peter.hilber@opensynergy.com/T/#m65fa1d715933360498c4e33d7225e4220215a9d6
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes
[not found] ` <684eac07834699889fdb67be4cee09319c994a42.camel@infradead.org>
@ 2024-06-20 12:37 ` Peter Hilber
2024-06-20 16:19 ` David Woodhouse
2024-06-21 14:02 ` David Woodhouse
0 siblings, 2 replies; 35+ messages in thread
From: Peter Hilber @ 2024-06-20 12:37 UTC (permalink / raw)
To: David Woodhouse, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
Changing virtio-dev address to the new one. The discussion might also be
relevant for virtio-comment, but it is discouraged to forward it to both.
On 15.06.24 10:40, David Woodhouse wrote:
> On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
>> RFC v3 updates
>> --------------
>>
>> This series implements a driver for a virtio-rtc device conforming to spec
>> RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
>> the PTP clock driver already present before.
>>
>> This patch series depends on the patch series "treewide: Use clocksource id
>> for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
>> series on top of mainline.
>>
>> Overview
>> --------
>>
>> This patch series adds the virtio_rtc module, and related bugfixes. The
>> virtio_rtc module implements a driver compatible with the proposed Virtio
>> RTC device specification [1]. The Virtio RTC (Real Time Clock) device
>> provides information about current time. The device can provide different
>> clocks, e.g. for the UTC or TAI time standards, or for physical time
>> elapsed since some past epoch. The driver can read the clocks with simple
>> or more accurate methods. Optionally, the driver can set an alarm.
>>
>> The series first fixes some bugs in the get_device_system_crosststamp()
>> interpolation code, which is required for reliable virtio_rtc operation.
>> Then, add the virtio_rtc implementation.
>>
>> For the Virtio RTC device, there is currently a proprietary implementation,
>> which has been used for provisional testing.
>
> As discussed before, I don't think it makes sense to design a new high-
> precision virtual clock which only gets it right *most* of the time. We
> absolutely need to address the issue of live migration.
>
> When live migration occurs, the guest's time precision suffers in two
> ways.
>
> First, even when migrating to a supposedly identical host the precise
> rate of the underlying counter (TSC, arch counter, etc.) can change
> within the tolerances (e.g. ±50PPM) of the hardware. Unlike the natural
> changes that NTP normally manages to track, this is a *step* change,
> potentially from -50PPM to +50PPM from one host to the next.
>
> Second, the accuracy of the counter as preserved across migration is
> limited by the accuracy of each host's NTP synchronization. So there is
> also a step change in the value of the counter itself.
>
> At the moment of migration, the guest's timekeeping should be
> considered invalid. Any previous cross-timestamps are meaningless, and
> blindly using the previously-known relationship between the counter and
> real time can lead to problems such as corruption in distributed
> databases, fines for mis-timestamped transactions, and other general
> unhappiness.
>
> We obviously can't get a new timestamp from the virtio_rtc device every
> time an application wants to know the time reliably. We don't even want
> *system* calls for that, which is why we have it in a vDSO.
>
> We can take the same approach to warning the guest about clock
> disruption due to live migration. A shared memory region from the
> virtual clock device even be mapped all the way to userspace, for those
> applications which need precise and *reliable* time to check a
> 'disruption' marker in it, and do whatever is appropriate (e.g. abort
> transactions and wait for time to resync) when it happens.
>
> We can do better than just letting the guest know about disruption, of
> course. We can put the actual counter→realtime relationship into the
> memory region too. As well as being able to provide a PTP driver with
> this, the applications which care about *reliable* timestamps can mmap
> the page directly and use it, vDSO-style, to have accurate timestamps
> even from the first cycle after migration.
>
> When disruption is signalled, the guest needs to throw away any
> *additional* refinement that it's done with NTP/PTP/PPS/etc. and revert
> to knowing nothing more than what the hypervisor advertises here.
>
> Here's a first attempt at defining such a memory structure. For now
> I've done it as a "vmclock" ACPI device based loosely on vmgenid, but I
> think it makes most sense for this to be part of the virtio_rtc spec.
> Ultimately it doesn't matter *how* the guest finds the memory region.
This looks sensible to me. I also think it would be possible to adapt this for
the virtio-rtc spec. The proposal also supports some other use cases which are
not in the virtio-rtc RFC spec v4 [2], notably vDSO-style clock reading, and
others such as indication of a past leap second.
Compared to the virtio-rtc RFC spec v4 [2], this proposal does not seem to
support leap second smearing. Also, it would be helpful to allow indicating
when some of the fields are not valid (such as leapsecond_counter_value,
leapsecond_tai_time, tai_offset_sec, utc_time_maxerror_picosec, ...).
Do you have plans to contribute this to the Virtio specification and Linux
driver implementation?
I also added a few comments below.
Thanks for sharing,
Peter
[2] https://lore.kernel.org/virtio-comment/20240522170003.52565-1-peter.hilber@opensynergy.com/
>
> Very preliminary QEMU hacks at
> https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock
> (I still need to do the KVM side helper for actually filling in the
> host clock information, converted to the *guest* TSC)
>
> This is the guest side. H aving heckled your assumption that we can use
> the virt counter on Arm, I concede I'm doing the same thing for now.
> The structure itself is at the end, or see
> https://git.infradead.org/?p=users/dwmw2/linux.git;a=blob;f=include/uapi/linux/vmclock.h;hb=vmclock
>
> From 9e1c3b823d497efa4e0acb21b226a72e4d6e8a53 Mon Sep 17 00:00:00 2001
> From: David Woodhouse <dwmw@amazon.co.uk>
> Date: Mon, 10 Jun 2024 15:10:11 +0100
> Subject: [PATCH] ptp: Add vDSO-style vmclock support
>
> The vmclock "device" provides a shared memory region with precision clock
> information. By using shared memory, it is safe across Live Migration.
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
> drivers/ptp/Kconfig | 13 ++
> drivers/ptp/Makefile | 1 +
> drivers/ptp/ptp_vmclock.c | 404 +++++++++++++++++++++++++++++++++++
> include/uapi/linux/vmclock.h | 102 +++++++++
> 4 files changed, 520 insertions(+)
> create mode 100644 drivers/ptp/ptp_vmclock.c
> create mode 100644 include/uapi/linux/vmclock.h
>
[...]
> +
> +static const struct ptp_clock_info ptp_vmclock_info = {
> + .owner = THIS_MODULE,
> + .max_adj = 0,
> + .n_ext_ts = 0,
> + .n_pins = 0,
> + .pps = 0,
> + .adjfine = ptp_vmclock_adjfine,
> + .adjtime = ptp_vmclock_adjtime,
> + .gettime64 = ptp_vmclock_gettime,
Should implement .gettimex64 instead.
> + .settime64 = ptp_vmclock_settime,
> + .enable = ptp_vmclock_enable,
> + .getcrosststamp = ptp_vmclock_getcrosststamp,
> +};
> +
[...]
> +/*
> + * This structure provides a vDSO-style clock to VM guests, exposing the
> + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
> + * counter, etc.) and real time. It is designed to address the problem of
> + * live migration, which other clock enlightenments do not.
> + *
> + * When a guest is live migrated, this affects the clock in two ways.
> + *
> + * First, even between identical hosts the actual frequency of the underlying
> + * counter will change within the tolerances of its specification (typically
> + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the
> + * same host, but can be tracked by NTP as it generally varies slowly. With
> + * live migration there is a step change in the frequency, with no warning.
> + *
> + * Second, there may be a step change in the value of the counter itself, as
> + * its accuracy is limited by the precision of the NTP synchronization on the
> + * source and destination hosts.
> + *
> + * So any calibration (NTP, PTP, etc.) which the guest has done on the source
> + * host before migration is invalid, and needs to be redone on the new host.
> + *
> + * In its most basic mode, this structure provides only an indication to the
> + * guest that live migration has occurred. This allows the guest to know that
> + * its clock is invalid and take remedial action. For applications that need
> + * reliable accurate timestamps (e.g. distributed databases), the structure
> + * can be mapped all the way to userspace. This allows the application to see
> + * directly for itself that the clock is disrupted and take appropriate
> + * action, even when using a vDSO-style method to get the time instead of a
> + * system call.
> + *
> + * In its more advanced mode. this structure can also be used to expose the
> + * precise relationship of the CPU counter to real time, as calibrated by the
> + * host. This means that userspace applications can have accurate time
> + * immediately after live migration, rather than having to pause operations
> + * and wait for NTP to recover. This mode does, of course, rely on the
> + * counter being reliable and consistent across CPUs.
> + */
> +
> +#ifdef __KERNEL__
> +#include <linux/types.h>
> +#else
> +#include <stdint.h>
> +#endif
> +
> +struct vmclock_abi {
> + uint32_t magic;
> +#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */
> + uint16_t size; /* Size of page containing this structure */
> + uint16_t version; /* 1 */
> +
> + /* Sequence lock. Low bit means an update is in progress. */
> + uint64_t seq_count;
> +
> + /*
> + * This field changes to another non-repeating value when the CPU
> + * counter is disrupted, for example on live migration.
> + */
> + uint64_t disruption_marker;
> +
> + uint8_t clock_status;
> +#define VMCLOCK_STATUS_UNKNOWN 0
> +#define VMCLOCK_STATUS_INITIALIZING 1
> +#define VMCLOCK_STATUS_SYNCHRONIZED 2
> +#define VMCLOCK_STATUS_FREERUNNING 3
> +#define VMCLOCK_STATUS_UNRELIABLE 4
> +
> + uint8_t counter_id;
> +#define VMCLOCK_COUNTER_INVALID 0
> +#define VMCLOCK_COUNTER_X86_TSC 1
> +#define VMCLOCK_COUNTER_ARM_VCNT 2
> +
> + uint8_t pad[3];
> +
> + /*
> + * UTC on its own is non-monotonic and ambiguous.
> + *
> + * Inform the guest about the TAI offset, so that it can have an
> + * actual monotonic and reliable time reference if it needs it.
> + *
> + * Also indicate a nearby leap second, if one exists. Unlike in
> + * NTP, this may indicate a leap second in the past in order to
> + * indicate that it *has* been taken into account.
> + */
> + int8_t leapsecond_direction;
> + int16_t tai_offset_sec;
> + uint64_t leapsecond_counter_value;
> + uint64_t leapsecond_tai_time;
> +
> + /* Paired values of counter and UTC at a given point in time. */
> + uint64_t counter_value;
> + uint64_t utc_time_sec;
> + uint64_t utc_time_frac_sec;
> +
> + /* Counter frequency, and error margin. Units of (second >> 64) */
> + uint64_t counter_period_frac_sec;
AFAIU this might limit the precision in case of high counter frequencies.
Could the unit be aligned to the expected frequency band of counters?
> + uint64_t counter_period_error_rate_frac_sec;
> +
> + /* Error margin of UTC reading above (± picoseconds) */
> + uint64_t utc_time_maxerror_picosec;
> +};
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes
2024-06-20 12:37 ` [RFC PATCH v3 0/7] Add virtio_rtc module and related changes Peter Hilber
@ 2024-06-20 16:19 ` David Woodhouse
2024-06-21 8:45 ` David Woodhouse
2024-06-21 14:02 ` David Woodhouse
1 sibling, 1 reply; 35+ messages in thread
From: David Woodhouse @ 2024-06-20 16:19 UTC (permalink / raw)
To: Peter Hilber, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
[-- Attachment #1: Type: text/plain, Size: 5442 bytes --]
On Thu, 2024-06-20 at 14:37 +0200, Peter Hilber wrote:
> Changing virtio-dev address to the new one. The discussion might also be
> relevant for virtio-comment, but it is discouraged to forward it to both.
I will happily take it to whichever forum you think is most
appropriate. (And you have my permission to direct replies to whichever
you choose.)
> On 15.06.24 10:40, David Woodhouse wrote:
> > As discussed before, I don't think it makes sense to design a new high-
> > precision virtual clock which only gets it right *most* of the time. We
> > absolutely need to address the issue of live migration.
...
> > Here's a first attempt at defining such a memory structure. For now
> > I've done it as a "vmclock" ACPI device based loosely on vmgenid, but I
> > think it makes most sense for this to be part of the virtio_rtc spec.
> > Ultimately it doesn't matter *how* the guest finds the memory region.
>
> This looks sensible to me. I also think it would be possible to adapt this for
> the virtio-rtc spec. The proposal also supports some other use cases which are
> not in the virtio-rtc RFC spec v4 [2], notably vDSO-style clock reading, and
> others such as indication of a past leap second.
Right. The vDSO-style clock reading is key to solving the live
migration problem.
The other key thing this adds is the error bounds, which some
applications care deeply about. I've been working with the team that
owns ClockBound on that part: https://github.com/aws/clock-bound
> Compared to the virtio-rtc RFC spec v4 [2], this proposal does not seem to
> support leap second smearing.
That's kind of intentional. Leap second smearing should be considered
the *antithesis* of precise time. Anyone who wants a monotonic realtime
clock should be using the POSIX CLOCK_TAI.
Part of my motivation for fixing the LM problem is because some
financial institutions can incur significant penalties for putting
inaccurate timestamps on transactions — even the disruption caused by
live migration is enough to trigger that. So deliberately lying to them
about what the UTC time is, by up to a second in either direction, is
not necessarily in their best interest.
As you noted, this proposal does expose leap seconds in the recent
past, which is all that's needed to allow a guest to generate a smeared
clock *from* the accurate clock that is provided through this
mechanism.
(Knowledge of past leap seconds is needed because in some modes,
smearing adjustments continue for some hours *afte* the leap second
should have occurred. So the NTP style of leap indicator isn't
sufficient).
> Also, it would be helpful to allow indicating
> when some of the fields are not valid (such as leapsecond_counter_value,
> leapsecond_tai_time, tai_offset_sec, utc_time_maxerror_picosec, ...).
Right. For some of those the answer can just be 'zero means invalid',
for the max error, perhaps MAX_UINT64. But we should definitely make
that explicit.
I'm also not entirely sure I understood Julien's insistence that we
include the leapsecond_counter_value as *well* as the
leapsecond_tai_time. It seems to me that the implementation would have
to recalculate that every time the frequency is adjusted.
For some of those fields, I was expecting a certain amount of
bikeshedding to occur and figured it was better to post an early straw
man and solicit feedback.
> Do you have plans to contribute this to the Virtio specification and Linux
> driver implementation?
Yes, absolutely. For now I've implemented it in the Linux guest¹ and in
QEMU² as an ACPI device modelled on vmgenid, but I'd love *not* to have
to do that, and just to do it based on virtio instead.
¹ https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock
² https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/vmclock
> > +static const struct ptp_clock_info ptp_vmclock_info = {
> > + .owner = THIS_MODULE,
> > + .max_adj = 0,
> > + .n_ext_ts = 0,
> > + .n_pins = 0,
> > + .pps = 0,
> > + .adjfine = ptp_vmclock_adjfine,
> > + .adjtime = ptp_vmclock_adjtime,
> > + .gettime64 = ptp_vmclock_gettime,
>
> Should implement .gettimex64 instead.
Ack, thanks. I'll go play with that.
>
> > +
> > + /* Counter frequency, and error margin. Units of (second >> 64) */
> > + uint64_t counter_period_frac_sec;
>
> AFAIU this might limit the precision in case of high counter frequencies.
> Could the unit be aligned to the expected frequency band of counters?
This field indicates the period of a single tick, in units of 1>>64 of
a second. That's about 5.4e-20 seconds, or 54 zeptoseconds?
Can you walk me through a calculation where you believe that level of
precision is insufficient?
I guess the precision matters if the structure isn't updated for a long
period of time, and the delta between the current counter and the
snapshot is high? That's a *lot* of 54 zeptosecondses? But you really
would need a *lot* of them before you care? And if nobody's been
calibrating your counter for that long, surely you have bigger worries?
Am I missing something there?
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes
2024-06-20 16:19 ` David Woodhouse
@ 2024-06-21 8:45 ` David Woodhouse
2024-06-25 19:01 ` [RFC PATCH v2] ptp: Add vDSO-style vmclock support David Woodhouse
2024-06-27 13:50 ` [RFC PATCH v3 0/7] Add virtio_rtc module and related changes Peter Hilber
0 siblings, 2 replies; 35+ messages in thread
From: David Woodhouse @ 2024-06-21 8:45 UTC (permalink / raw)
To: Peter Hilber, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
[-- Attachment #1: Type: text/plain, Size: 2144 bytes --]
On Thu, 2024-06-20 at 17:19 +0100, David Woodhouse wrote:
>
> >
> > > +
> > > + /* Counter frequency, and error margin. Units of (second >> 64) */
> > > + uint64_t counter_period_frac_sec;
> >
> > AFAIU this might limit the precision in case of high counter frequencies.
> > Could the unit be aligned to the expected frequency band of counters?
>
> This field indicates the period of a single tick, in units of 1>>64 of
> a second. That's about 5.4e-20 seconds, or 54 zeptoseconds?
>
> Can you walk me through a calculation where you believe that level of
> precision is insufficient?
>
> I guess the precision matters if the structure isn't updated for a long
> period of time, and the delta between the current counter and the
> snapshot is high? That's a *lot* of 54 zeptosecondses? But you really
> would need a *lot* of them before you care? And if nobody's been
> calibrating your counter for that long, surely you have bigger worries?
>
> Am I missing something there?
Hm, that was a bit rushed at the end of the day; let's take a better look...
Let's take a hypothetical example of a 100GHz counter. That's two
orders of magnitude more than today's Arm arch counter.
The period of such a counter would be 10 picoseconds.
(Let's ignore the question of how far light actually travels in that
time and how *realistic* that example is, for the moment.)
It turns out that at that rate, there *are* a lot of 54 zeptosecondses
of precision loss in the day. It could be half a millisecond a day, or
20µs an hour.
That particular example of 10 picoseconds is 184467440.7370955
(seconds>>64) which could be truncated to 184467440 — losing about 4PPB
(a third of a millisecond a day; 14µs an hour).
So yeah, I suppose a 'shift' field could make sense. It's easy enough
to consume on the guest side as it doesn't really perturb the 128-bit
multiplication very much; especially if we don't let it be negative.
And implementations *can* just set it to zero. It hurts nobody.
Or were you thinking of just using a fixed shift like (seconds>>80)
instead?
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes
2024-06-20 12:37 ` [RFC PATCH v3 0/7] Add virtio_rtc module and related changes Peter Hilber
2024-06-20 16:19 ` David Woodhouse
@ 2024-06-21 14:02 ` David Woodhouse
1 sibling, 0 replies; 35+ messages in thread
From: David Woodhouse @ 2024-06-21 14:02 UTC (permalink / raw)
To: Peter Hilber, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
[-- Attachment #1: Type: text/plain, Size: 3750 bytes --]
On Thu, 2024-06-20 at 14:37 +0200, Peter Hilber wrote:
> Should implement .gettimex64 instead.
Thanks. This look sane?
As noted in the code comment, in the *ideal* case we just build all
three pre/post/device timestamps from the very same counter read. So
sts->pre_ts == sts->post_ts.
In the less ideal case (which will happen on x86 when kvmclock is being
used for the system time), we use the time from ktime_get_snapshot() as
the pre_ts and take a new snapshot immediately after the get_cycles().
diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
index e8c65405a8f3..07a81a94d29a 100644
--- a/drivers/ptp/ptp_vmclock.c
+++ b/drivers/ptp/ptp_vmclock.c
@@ -96,9 +96,11 @@ static inline uint64_t mul_u64_u64_add_u64(uint64_t *res_hi, uint64_t delta,
}
static int vmclock_get_crosststamp(struct vmclock_state *st,
+ struct ptp_system_timestamp *sts,
struct system_counterval_t *system_counter,
struct timespec64 *tspec)
{
+ struct system_time_snapshot systime_snapshot;
uint64_t cycle, delta, seq, frac_sec;
int ret = 0;
@@ -119,7 +121,17 @@ static int vmclock_get_crosststamp(struct vmclock_state *st,
continue;
}
- cycle = get_cycles();
+ if (sts) {
+ ktime_get_snapshot(&systime_snapshot);
+
+ if (systime_snapshot.cs_id == st->cs_id) {
+ cycle = systime_snapshot.cycles;
+ } else {
+ cycle = get_cycles();
+ ptp_read_system_postts(sts);
+ }
+ } else
+ cycle = get_cycles();
delta = cycle - st->clk->counter_value;
@@ -139,6 +151,21 @@ static int vmclock_get_crosststamp(struct vmclock_state *st,
if (ret)
return ret;
+ /*
+ * When invoked for gettimex64, fill in the pre/post system times.
+ * The ideal case is when system time is based on the the same
+ * counter as st->cs_id, in which case all three pre/post/device
+ * times are derived from the *same* counter value. If cs_id does
+ * not match, then the value from ktime_get_snapshot() is used as
+ * pre_ts, and ptp_read_system_postts() was already called above
+ * for the post_ts. Those are either side of the get_cycles() call.
+ */
+ if (sts) {
+ sts->pre_ts = ktime_to_timespec64(systime_snapshot.real);
+ if (systime_snapshot.cs_id == st->cs_id)
+ sts->post_ts = sts->pre_ts;
+ }
+
if (system_counter) {
system_counter->cycles = cycle;
system_counter->cs_id = st->cs_id;
@@ -155,7 +182,7 @@ static int ptp_vmclock_get_time_fn(ktime_t *device_time,
struct timespec64 tspec;
int ret;
- ret = vmclock_get_crosststamp(st, system_counter, &tspec);
+ ret = vmclock_get_crosststamp(st, NULL, system_counter, &tspec);
if (!ret)
*device_time = timespec64_to_ktime(tspec);
@@ -198,7 +225,16 @@ static int ptp_vmclock_gettime(struct ptp_clock_info *ptp, struct timespec64 *ts
struct vmclock_state *st = container_of(ptp, struct vmclock_state,
ptp_clock_info);
- return vmclock_get_crosststamp(st, NULL, ts);
+ return vmclock_get_crosststamp(st, NULL, NULL, ts);
+}
+
+static int ptp_vmclock_gettimex(struct ptp_clock_info *ptp, struct timespec64 *ts,
+ struct ptp_system_timestamp *sts)
+{
+ struct vmclock_state *st = container_of(ptp, struct vmclock_state,
+ ptp_clock_info);
+
+ return vmclock_get_crosststamp(st, sts, NULL, ts);
}
static int ptp_vmclock_enable(struct ptp_clock_info *ptp,
@@ -216,6 +252,7 @@ static const struct ptp_clock_info ptp_vmclock_info = {
.adjfine = ptp_vmclock_adjfine,
.adjtime = ptp_vmclock_adjtime,
.gettime64 = ptp_vmclock_gettime,
+ .gettimex64 = ptp_vmclock_gettimex,
.settime64 = ptp_vmclock_settime,
.enable = ptp_vmclock_enable,
.getcrosststamp = ptp_vmclock_getcrosststamp,
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [RFC PATCH v2] ptp: Add vDSO-style vmclock support
2024-06-21 8:45 ` David Woodhouse
@ 2024-06-25 19:01 ` David Woodhouse
2024-06-27 13:50 ` Peter Hilber
[not found] ` <20240630132859.GC17134@kernel.org>
2024-06-27 13:50 ` [RFC PATCH v3 0/7] Add virtio_rtc module and related changes Peter Hilber
1 sibling, 2 replies; 35+ messages in thread
From: David Woodhouse @ 2024-06-25 19:01 UTC (permalink / raw)
To: Peter Hilber, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev, Luu, Ryan
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
[-- Attachment #1: Type: text/plain, Size: 23144 bytes --]
From: David Woodhouse <dwmw@amazon.co.uk>
The vmclock "device" provides a shared memory region with precision clock
information. By using shared memory, it is safe across Live Migration.
Like the KVM PTP clock, this can convert TSC-based cross timestamps into
KVM clock values. Unlike the KVM PTP clock, it does so only when such is
actually helpful.
The memory region of the device is also exposed to userspace so it can be
read or memory mapped by application which need reliable notification of
clock disruptions.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
v2:
• Add gettimex64() support
• Convert TSC values to KVM clock when appropriate
• Require int128 support
• Add counter_period_shift
• Add timeout when seq_count is invalid
• Add flags field
• Better comments in vmclock ABI structure
• Explicitly forbid smearing (as clock rates would need to change)
drivers/ptp/Kconfig | 13 +
drivers/ptp/Makefile | 1 +
drivers/ptp/ptp_vmclock.c | 516 +++++++++++++++++++++++++++++++++++
include/uapi/linux/vmclock.h | 138 ++++++++++
4 files changed, 668 insertions(+)
create mode 100644 drivers/ptp/ptp_vmclock.c
create mode 100644 include/uapi/linux/vmclock.h
diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 604541dcb320..e98c9767e0ef 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -131,6 +131,19 @@ config PTP_1588_CLOCK_KVM
To compile this driver as a module, choose M here: the module
will be called ptp_kvm.
+config PTP_1588_CLOCK_VMCLOCK
+ tristate "Virtual machine PTP clock"
+ depends on X86_TSC || ARM_ARCH_TIMER
+ depends on PTP_1588_CLOCK && ACPI && ARCH_SUPPORTS_INT128
+ default y
+ help
+ This driver adds support for using a virtual precision clock
+ advertised by the hypervisor. This clock is only useful in virtual
+ machines where such a device is present.
+
+ To compile this driver as a module, choose M here: the module
+ will be called ptp_vmclock.
+
config PTP_1588_CLOCK_IDT82P33
tristate "IDT 82P33xxx PTP clock"
depends on PTP_1588_CLOCK && I2C
diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
index 68bf02078053..01b5cd91eb61 100644
--- a/drivers/ptp/Makefile
+++ b/drivers/ptp/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_PTP_1588_CLOCK_DTE) += ptp_dte.o
obj-$(CONFIG_PTP_1588_CLOCK_INES) += ptp_ines.o
obj-$(CONFIG_PTP_1588_CLOCK_PCH) += ptp_pch.o
obj-$(CONFIG_PTP_1588_CLOCK_KVM) += ptp_kvm.o
+obj-$(CONFIG_PTP_1588_CLOCK_VMCLOCK) += ptp_vmclock.o
obj-$(CONFIG_PTP_1588_CLOCK_QORIQ) += ptp-qoriq.o
ptp-qoriq-y += ptp_qoriq.o
ptp-qoriq-$(CONFIG_DEBUG_FS) += ptp_qoriq_debugfs.o
diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
new file mode 100644
index 000000000000..e19c2eed8009
--- /dev/null
+++ b/drivers/ptp/ptp_vmclock.c
@@ -0,0 +1,516 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtual PTP 1588 clock for use with LM-safe VMclock device.
+ *
+ * Copyright © 2024 Amazon.com, Inc. or its affiliates.
+ */
+
+#include <linux/device.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/miscdevice.h>
+#include <linux/acpi.h>
+#include <uapi/linux/vmclock.h>
+
+#include <linux/ptp_clock_kernel.h>
+
+#ifdef CONFIG_X86
+#include <asm/pvclock.h>
+#include <asm/kvmclock.h>
+#endif
+
+static DEFINE_IDA(vmclock_ida);
+
+ACPI_MODULE_NAME("vmclock");
+
+struct vmclock_state {
+ phys_addr_t phys_addr;
+ struct vmclock_abi *clk;
+ struct miscdevice miscdev;
+ struct ptp_clock_info ptp_clock_info;
+ struct ptp_clock *ptp_clock;
+ enum clocksource_ids cs_id, sys_cs_id;
+ int index;
+ char *name;
+};
+
+#define VMCLOCK_MAX_WAIT ms_to_ktime(100)
+
+/*
+ * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
+ * and add the fractional second part of the reference time.
+ *
+ * The result is a 128-bit value, the top 64 bits of which are seconds, and
+ * the low 64 bits are (seconds >> 64).
+ *
+ * If __int128 isn't available, perform the calculation 32 bits at a time to
+ * avoid overflow.
+ */
+static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t delta,
+ uint64_t period, uint8_t shift,
+ uint64_t frac_sec)
+{
+ unsigned __int128 res = (unsigned __int128)delta * period;
+
+ res >>= shift;
+ res += frac_sec;
+ *res_hi = res >> 64;
+ return (uint64_t)res;
+}
+
+static int vmclock_get_crosststamp(struct vmclock_state *st,
+ struct ptp_system_timestamp *sts,
+ struct system_counterval_t *system_counter,
+ struct timespec64 *tspec)
+{
+ ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
+ struct system_time_snapshot systime_snapshot;
+ uint64_t cycle, delta, seq, frac_sec;
+
+#ifdef CONFIG_X86
+ /*
+ * We'd expect the hypervisor to know this and to report the clock
+ * status as VMCLOCK_STATUS_UNRELIABLE. But be paranoid.
+ */
+ if (check_tsc_unstable())
+ return -EINVAL;
+#endif
+
+ while (1) {
+ seq = st->clk->seq_count & ~1ULL;
+ virt_rmb();
+
+ if (st->clk->clock_status == VMCLOCK_STATUS_UNRELIABLE)
+ return -EINVAL;
+
+ /*
+ * When invoked for gettimex64(), fill in the pre/post system
+ * times. The simple case is when system time is based on the
+ * same counter as st->cs_id, in which case all three times
+ * will be derived from the *same* counter value.
+ *
+ * If the system isn't using the same counter, then the value
+ * from ktime_get_snapshot() will still be used as pre_ts, and
+ * ptp_read_system_postts() is called to populate postts after
+ * calling get_cycles().
+ *
+ * The conversion to timespec64 happens further down, outside
+ * the seq_count loop.
+ */
+ if (sts) {
+ ktime_get_snapshot(&systime_snapshot);
+ if (systime_snapshot.cs_id == st->cs_id) {
+ cycle = systime_snapshot.cycles;
+ } else {
+ cycle = get_cycles();
+ ptp_read_system_postts(sts);
+ }
+ } else
+ cycle = get_cycles();
+
+ delta = cycle - st->clk->counter_value;
+
+ frac_sec = mul_u64_u64_shr_add_u64(&tspec->tv_sec, delta,
+ st->clk->counter_period_frac_sec,
+ st->clk->counter_period_shift,
+ st->clk->utc_time_frac_sec);
+ tspec->tv_nsec = mul_u64_u64_shr(frac_sec, NSEC_PER_SEC, 64);
+ tspec->tv_sec += st->clk->utc_time_sec;
+
+ virt_rmb();
+ if (seq == st->clk->seq_count)
+ break;
+
+ if (ktime_after(ktime_get(), deadline))
+ return -ETIMEDOUT;
+ }
+
+ if (system_counter) {
+ system_counter->cycles = cycle;
+ system_counter->cs_id = st->cs_id;
+ }
+
+ if (sts) {
+ sts->pre_ts = ktime_to_timespec64(systime_snapshot.real);
+ if (systime_snapshot.cs_id == st->cs_id)
+ sts->post_ts = sts->pre_ts;
+ }
+
+ return 0;
+}
+
+#ifdef CONFIG_X86
+/*
+ * In the case where the system is using the KVM clock for timekeeping, convert
+ * the TSC value into a KVM clock time in order to return a paired reading that
+ * get_device_system_crosststamp() can cope with.
+ */
+static int vmclock_get_crosststamp_kvmclock(struct vmclock_state *st,
+ struct ptp_system_timestamp *sts,
+ struct system_counterval_t *system_counter,
+ struct timespec64 *tspec)
+{
+ struct pvclock_vcpu_time_info *pvti = this_cpu_pvti();
+ unsigned pvti_ver;
+ int ret;
+
+ preempt_disable_notrace();
+
+ do {
+ pvti_ver = pvclock_read_begin(pvti);
+
+ ret = vmclock_get_crosststamp(st, sts, system_counter, tspec);
+ if (ret)
+ break;
+
+ system_counter->cycles = __pvclock_read_cycles(pvti,
+ system_counter->cycles);
+ system_counter->cs_id = CSID_X86_KVM_CLK;
+
+ /*
+ * This retry should never really happen; if the TSC is
+ * stable and reliable enough across vCPUS that it is sane
+ * for the hypervisor to expose a VMCLOCK device which uses
+ * it as the reference counter, then the KVM clock sohuld be
+ * in 'master clock mode' and basically never changed. But
+ * the KVM clock is a fickle and often broken thing, so do
+ * it "properly" just in case.
+ */
+ } while (pvclock_read_retry(pvti, pvti_ver));
+
+ preempt_enable_notrace();
+
+ return ret;
+}
+#endif
+
+static int ptp_vmclock_get_time_fn(ktime_t *device_time,
+ struct system_counterval_t *system_counter,
+ void *ctx)
+{
+ struct vmclock_state *st = ctx;
+ struct timespec64 tspec;
+ int ret;
+
+#ifdef CONFIG_X86
+ if (READ_ONCE(st->sys_cs_id) == CSID_X86_KVM_CLK)
+ ret = vmclock_get_crosststamp_kvmclock(st, NULL, system_counter,
+ &tspec);
+ else
+#endif
+ ret = vmclock_get_crosststamp(st, NULL, system_counter, &tspec);
+
+ if (!ret)
+ *device_time = timespec64_to_ktime(tspec);
+
+ return ret;
+}
+
+
+static int ptp_vmclock_getcrosststamp(struct ptp_clock_info *ptp,
+ struct system_device_crosststamp *xtstamp)
+{
+ struct vmclock_state *st = container_of(ptp, struct vmclock_state,
+ ptp_clock_info);
+ int ret = get_device_system_crosststamp(ptp_vmclock_get_time_fn, st,
+ NULL, xtstamp);
+#ifdef CONFIG_X86
+ /*
+ * On x86, the KVM clock may be used for the system time. We can
+ * actually convert a TSC reading to that, and return a paired
+ * timestamp that get_device_system_crosststamp() *can* handle.
+ */
+ if (ret == -ENODEV) {
+ struct system_time_snapshot systime_snapshot;
+ ktime_get_snapshot(&systime_snapshot);
+
+ if (systime_snapshot.cs_id == CSID_X86_TSC ||
+ systime_snapshot.cs_id == CSID_X86_KVM_CLK) {
+ WRITE_ONCE(st->sys_cs_id, systime_snapshot.cs_id);
+ ret = get_device_system_crosststamp(ptp_vmclock_get_time_fn,
+ st, NULL, xtstamp);
+ }
+ }
+#endif
+ return ret;
+}
+
+/*
+ * PTP clock operations
+ */
+
+static int ptp_vmclock_adjfine(struct ptp_clock_info *ptp, long delta)
+{
+ return -EOPNOTSUPP;
+}
+
+static int ptp_vmclock_adjtime(struct ptp_clock_info *ptp, s64 delta)
+{
+ return -EOPNOTSUPP;
+}
+
+static int ptp_vmclock_settime(struct ptp_clock_info *ptp,
+ const struct timespec64 *ts)
+{
+ return -EOPNOTSUPP;
+}
+
+static int ptp_vmclock_gettime(struct ptp_clock_info *ptp, struct timespec64 *ts)
+{
+ struct vmclock_state *st = container_of(ptp, struct vmclock_state,
+ ptp_clock_info);
+
+ return vmclock_get_crosststamp(st, NULL, NULL, ts);
+}
+
+static int ptp_vmclock_gettimex(struct ptp_clock_info *ptp, struct timespec64 *ts,
+ struct ptp_system_timestamp *sts)
+{
+ struct vmclock_state *st = container_of(ptp, struct vmclock_state,
+ ptp_clock_info);
+
+ return vmclock_get_crosststamp(st, sts, NULL, ts);
+}
+
+static int ptp_vmclock_enable(struct ptp_clock_info *ptp,
+ struct ptp_clock_request *rq, int on)
+{
+ return -EOPNOTSUPP;
+}
+
+static const struct ptp_clock_info ptp_vmclock_info = {
+ .owner = THIS_MODULE,
+ .max_adj = 0,
+ .n_ext_ts = 0,
+ .n_pins = 0,
+ .pps = 0,
+ .adjfine = ptp_vmclock_adjfine,
+ .adjtime = ptp_vmclock_adjtime,
+ .gettime64 = ptp_vmclock_gettime,
+ .gettimex64 = ptp_vmclock_gettimex,
+ .settime64 = ptp_vmclock_settime,
+ .enable = ptp_vmclock_enable,
+ .getcrosststamp = ptp_vmclock_getcrosststamp,
+};
+
+static int vmclock_miscdev_mmap(struct file *fp, struct vm_area_struct *vma)
+{
+ struct vmclock_state *st = container_of(fp->private_data,
+ struct vmclock_state, miscdev);
+
+ if ((vma->vm_flags & (VM_READ|VM_WRITE)) != VM_READ)
+ return -EROFS;
+
+ if (vma->vm_end - vma->vm_start != PAGE_SIZE || vma->vm_pgoff)
+ return -EINVAL;
+
+ if (io_remap_pfn_range(vma, vma->vm_start,
+ st->phys_addr >> PAGE_SHIFT, PAGE_SIZE,
+ vma->vm_page_prot))
+ return -EAGAIN;
+
+ return 0;
+}
+
+static ssize_t vmclock_miscdev_read(struct file *fp, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct vmclock_state *st = container_of(fp->private_data,
+ struct vmclock_state, miscdev);
+ ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
+ size_t max_count;
+ int32_t seq;
+
+ if (*ppos >= PAGE_SIZE)
+ return 0;
+
+ max_count = PAGE_SIZE - *ppos;
+ if (count > max_count)
+ count = max_count;
+
+ while (1) {
+ seq = st->clk->seq_count & ~1ULL;
+ virt_rmb();
+
+ if (copy_to_user(buf, ((char *)st->clk) + *ppos, count))
+ return -EFAULT;
+
+ virt_rmb();
+ if (seq == st->clk->seq_count)
+ break;
+
+ if (ktime_after(ktime_get(), deadline))
+ return -ETIMEDOUT;
+ }
+
+ *ppos += count;
+ return count;
+}
+
+static const struct file_operations vmclock_miscdev_fops = {
+ .mmap = vmclock_miscdev_mmap,
+ .read = vmclock_miscdev_read,
+};
+
+/* module operations */
+
+static void vmclock_remove(struct platform_device *pdev)
+{
+ struct device *dev = &pdev->dev;
+ struct vmclock_state *st = dev_get_drvdata(dev);
+
+ if (st->ptp_clock)
+ ptp_clock_unregister(st->ptp_clock);
+
+ if (st->miscdev.minor == MISC_DYNAMIC_MINOR)
+ misc_deregister(&st->miscdev);
+}
+
+static int vmclock_probe_acpi(struct device *dev, struct vmclock_state *st)
+{
+ struct acpi_buffer parsed = { ACPI_ALLOCATE_BUFFER };
+ struct acpi_device *adev = ACPI_COMPANION(dev);
+ union acpi_object *obj;
+ acpi_status status;
+
+ status = acpi_evaluate_object(adev->handle, "ADDR", NULL, &parsed);
+ if (ACPI_FAILURE(status)) {
+ ACPI_EXCEPTION((AE_INFO, status, "Evaluating ADDR"));
+ return -ENODEV;
+ }
+ obj = parsed.pointer;
+ if (!obj || obj->type != ACPI_TYPE_PACKAGE || obj->package.count != 2 ||
+ obj->package.elements[0].type != ACPI_TYPE_INTEGER ||
+ obj->package.elements[1].type != ACPI_TYPE_INTEGER)
+ return -EINVAL;
+
+ st->phys_addr = (obj->package.elements[0].integer.value << 0) |
+ (obj->package.elements[1].integer.value << 32);
+
+ return 0;
+}
+
+static void vmclock_put_idx(void *data)
+{
+ struct vmclock_state *st = data;
+
+ ida_free(&vmclock_ida, st->index);
+}
+
+static int vmclock_probe(struct platform_device *pdev)
+{
+ struct device *dev = &pdev->dev;
+ struct vmclock_state *st;
+ int ret;
+
+ st = devm_kzalloc(dev, sizeof (*st), GFP_KERNEL);
+ if (!st)
+ return -ENOMEM;
+
+ if (has_acpi_companion(dev))
+ ret = vmclock_probe_acpi(dev, st);
+ else
+ ret = -EINVAL; /* Only ACPI for now */
+
+ if (ret) {
+ dev_info(dev, "Failed to obtain physical address: %d\n", ret);
+ goto out;
+ }
+
+ st->clk = devm_memremap(dev, st->phys_addr, sizeof(*st->clk),
+ MEMREMAP_WB);
+ if (IS_ERR(st->clk)) {
+ ret = PTR_ERR(st->clk);
+ dev_info(dev, "failed to map shared memory\n");
+ st->clk = NULL;
+ goto out;
+ }
+
+ if (st->clk->magic != VMCLOCK_MAGIC ||
+ st->clk->size < sizeof(*st->clk) ||
+ st->clk->version != 1) {
+ dev_info(dev, "vmclock magic fields invalid\n");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (IS_ENABLED(CONFIG_ARM64) &&
+ st->clk->counter_id == VMCLOCK_COUNTER_ARM_VCNT) {
+ /* Can we check it's the virtual counter? */
+ st->cs_id = CSID_ARM_ARCH_COUNTER;
+ } else if (IS_ENABLED(CONFIG_X86) &&
+ st->clk->counter_id == VMCLOCK_COUNTER_X86_TSC) {
+ st->cs_id = CSID_X86_TSC;
+ }
+ st->sys_cs_id = st->cs_id;
+
+ ret = ida_alloc(&vmclock_ida, GFP_KERNEL);
+ if (ret < 0)
+ goto out;
+
+ st->index = ret;
+ ret = devm_add_action_or_reset(&pdev->dev, vmclock_put_idx, st);
+ if (ret)
+ goto out;
+
+ st->name = devm_kasprintf(&pdev->dev, GFP_KERNEL, "vmclock%d", st->index);
+ if (!st->name) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ /* If the structure is big enough, it can be mapped to userspace */
+ if (st->clk->size >= PAGE_SIZE) {
+ st->miscdev.minor = MISC_DYNAMIC_MINOR;
+ st->miscdev.fops = &vmclock_miscdev_fops;
+ st->miscdev.name = st->name;
+
+ ret = misc_register(&st->miscdev);
+ if (ret)
+ goto out;
+ }
+
+ /* If there is valid clock information, register a PTP clock */
+ if (st->cs_id) {
+ st->ptp_clock_info = ptp_vmclock_info;
+ strncpy(st->ptp_clock_info.name, st->name, sizeof(st->ptp_clock_info.name));
+ st->ptp_clock = ptp_clock_register(&st->ptp_clock_info, dev);
+
+ if (IS_ERR(st->ptp_clock)) {
+ ret = PTR_ERR(st->ptp_clock);
+ st->ptp_clock = NULL;
+ vmclock_remove(pdev);
+ goto out;
+ }
+ }
+
+ dev_set_drvdata(dev, st);
+
+ out:
+ return ret;
+}
+
+static const struct acpi_device_id vmclock_acpi_ids[] = {
+ { "VMCLOCK", 0 },
+ {}
+};
+MODULE_DEVICE_TABLE(acpi, vmclock_acpi_ids);
+
+static struct platform_driver vmclock_platform_driver = {
+ .probe = vmclock_probe,
+ .remove_new = vmclock_remove,
+ .driver = {
+ .name = "vmclock",
+ .acpi_match_table = vmclock_acpi_ids,
+ },
+};
+
+module_platform_driver(vmclock_platform_driver)
+
+MODULE_AUTHOR("David Woodhouse <dwmw2@infradead.org>");
+MODULE_DESCRIPTION("PTP clock using VMCLOCK");
+MODULE_LICENSE("GPL v2");
diff --git a/include/uapi/linux/vmclock.h b/include/uapi/linux/vmclock.h
new file mode 100644
index 000000000000..cf0f22205e79
--- /dev/null
+++ b/include/uapi/linux/vmclock.h
@@ -0,0 +1,138 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
+
+/*
+ * This structure provides a vDSO-style clock to VM guests, exposing the
+ * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
+ * counter, etc.) and real time. It is designed to address the problem of
+ * live migration, which other clock enlightenments do not.
+ *
+ * When a guest is live migrated, this affects the clock in two ways.
+ *
+ * First, even between identical hosts the actual frequency of the underlying
+ * counter will change within the tolerances of its specification (typically
+ * ±50PPM, or 4 seconds a day). The frequency also varies over time on the
+ * same host, but can be tracked by NTP as it generally varies slowly. With
+ * live migration there is a step change in the frequency, with no warning.
+ *
+ * Second, there may be a step change in the value of the counter itself, as
+ * its accuracy is limited by the precision of the NTP synchronization on the
+ * source and destination hosts.
+ *
+ * So any calibration (NTP, PTP, etc.) which the guest has done on the source
+ * host before migration is invalid, and needs to be redone on the new host.
+ *
+ * In its most basic mode, this structure provides only an indication to the
+ * guest that live migration has occurred. This allows the guest to know that
+ * its clock is invalid and take remedial action. For applications that need
+ * reliable accurate timestamps (e.g. distributed databases), the structure
+ * can be mapped all the way to userspace. This allows the application to see
+ * directly for itself that the clock is disrupted and take appropriate
+ * action, even when using a vDSO-style method to get the time instead of a
+ * system call.
+ *
+ * In its more advanced mode. this structure can also be used to expose the
+ * precise relationship of the CPU counter to real time, as calibrated by the
+ * host. This means that userspace applications can have accurate time
+ * immediately after live migration, rather than having to pause operations
+ * and wait for NTP to recover. This mode does, of course, rely on the
+ * counter being reliable and consistent across CPUs.
+ *
+ * Note that this must be true UTC, never with smeared leap seconds. If a
+ * guest wishes to construct a smeared clock, it can do so. Presenting a
+ * smeared clock through this interface would be problematic because it
+ * actually messes with the apparent counter *period*. A linear smearing
+ * of 1 ms per second would effectively tweak the counter period by 1000PPM
+ * at the start/end of the smearing period, while a sinusoidal smear would
+ * basically be impossible to represent.
+ */
+
+#ifndef __VMCLOCK_H__
+#define __VMCLOCK_H__
+
+#ifdef __KERNEL__
+#include <linux/types.h>
+#else
+#include <stdint.h>
+#endif
+
+struct vmclock_abi {
+ uint32_t magic;
+#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */
+ uint16_t size; /* Size of page containing this structure */
+ uint16_t version; /* 1 */
+
+ /* Sequence lock. Low bit means an update is in progress. */
+ uint64_t seq_count;
+
+ /*
+ * This field changes to another non-repeating value when the CPU
+ * counter is disrupted, for example on live migration.
+ */
+ uint64_t disruption_marker;
+
+ /*
+ * By providing the TAI offset, the guest can know both UTC and TAI
+ * reliably. There is no need to choose one *or* the other. Valid if
+ * VMCLOCK_FLAG_TAI_OFFSET_VALID is set in flags.
+ */
+ int16_t tai_offset_sec;
+
+ uint16_t flags;
+ /* Indicates that the tai_offset_sec field is valid */
+#define VMCLOCK_FLAG_TAI_OFFSET_VALID (1 << 0)
+ /*
+ * Optionally used to notify guests of pending maintenance events.
+ * A guest may wish to remove itself from service if an event is
+ * coming up. Two flags indicate the rough imminence of the event.
+ */
+#define VMCLOCK_FLAG_DISRUPTION_SOON (1 << 1) /* About a day */
+#define VMCLOCK_FLAG_DISRUPTION_IMMINENT (1 << 2) /* About an hour */
+ /* Indicates that the utc_time_maxerror_picosec field is valid */
+#define VMCLOCK_FLAG_UTC_MAXERROR_VALID (1 << 3)
+ /* Indicates counter_period_error_rate_frac_sec is valid */
+#define VMCLOCK_FLAG_UTC_PERIOD_ERROR_VALID (1 << 4)
+
+ uint8_t clock_status;
+#define VMCLOCK_STATUS_UNKNOWN 0
+#define VMCLOCK_STATUS_INITIALIZING 1
+#define VMCLOCK_STATUS_SYNCHRONIZED 2
+#define VMCLOCK_STATUS_FREERUNNING 3
+#define VMCLOCK_STATUS_UNRELIABLE 4
+
+ uint8_t counter_id;
+#define VMCLOCK_COUNTER_INVALID 0
+#define VMCLOCK_COUNTER_X86_TSC 1
+#define VMCLOCK_COUNTER_ARM_VCNT 2
+
+ /* Bit shift for counter_period_frac_sec and its error rate */
+ uint8_t counter_period_shift;
+
+ /*
+ * Unlike in NTP, this can indicate a leap second in the past. This
+ * is needed to allow guests to derive an imprecise clock with
+ * smeared leap seconds for themselves, as some modes of smearing
+ * need the adjustments to continue even after the moment at which
+ * the leap second should have occurred.
+ */
+ int8_t leapsecond_direction;
+ uint64_t leapsecond_tai_sec; /* Since 1970-01-01 00:00:00z */
+
+ /*
+ * Paired values of counter and UTC at a given point in time.
+ */
+ uint64_t counter_value;
+ uint64_t utc_time_sec; /* Since 1970-01-01 00:00:00z */
+ uint64_t utc_time_frac_sec;
+
+ /*
+ * Counter frequency, and error margin. The unit of these fields is
+ * seconds >> (64 + counter_period_shift)
+ */
+ uint64_t counter_period_frac_sec;
+ uint64_t counter_period_error_rate_frac_sec;
+
+ /* Error margin of UTC reading above (± picoseconds) */
+ uint64_t utc_time_maxerror_picosec;
+};
+
+#endif /* __VMCLOCK_H__ */
--
2.44.0
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
[not found] <87jzic4sgv.ffs@tglx>
@ 2024-06-25 21:48 ` David Woodhouse
[not found] ` <CANDhNCpi_MyGWH2jZcSRB4RU28Ga08Cqm8cyY_6wkZhNMJsNSQ@mail.gmail.com>
0 siblings, 1 reply; 35+ messages in thread
From: David Woodhouse @ 2024-06-25 21:48 UTC (permalink / raw)
To: Thomas Gleixner, Peter Hilber, linux-kernel, virtualization,
linux-arm-kernel, linux-rtc, Ridoux, Julien, virtio-dev,
Luu, Ryan
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Xuan Zhuo, Marc Zyngier,
Mark Rutland, Daniel Lezcano, Alessandro Zummo, Alexandre Belloni
[-- Attachment #1: Type: text/plain, Size: 1599 bytes --]
On Tue, 2024-06-25 at 23:34 +0200, Thomas Gleixner wrote:
> On Tue, Jun 25 2024 at 20:01, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> >
> > The vmclock "device" provides a shared memory region with precision clock
> > information. By using shared memory, it is safe across Live Migration.
> >
> > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > actually helpful.
> >
> > The memory region of the device is also exposed to userspace so it can be
> > read or memory mapped by application which need reliable notification of
> > clock disruptions.
>
> There is effort underway to expose PTP clocks to user space via VDSO.
Ooh, interesting. Got a reference to that please?
> Can we please not expose an ad hoc interface for that?
Absolutely. I'm explicitly trying to intercept the virtio-rtc
specification here, to *avoid* having to do anything ad hoc.
Note that this is a "vDSO-style" interface from hypervisor to guest via
a shared memory region, not necessarily an actual vDSO.
But yes, it *is* intended to be exposed to userspace, so that userspace
can know the *accurate* time without a system call, and know that it
hasn't been perturbed by live migration.
> As you might have heard the sad news, I'm not feeling up to the task to
> dig deeper into this right now. Give me a couple of days to look at this
> with working brain.
I have not heard any news, although now I'm making inferences.
Wishing you the best!
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
[not found] ` <CANDhNCpi_MyGWH2jZcSRB4RU28Ga08Cqm8cyY_6wkZhNMJsNSQ@mail.gmail.com>
@ 2024-06-26 8:32 ` David Woodhouse
0 siblings, 0 replies; 35+ messages in thread
From: David Woodhouse @ 2024-06-26 8:32 UTC (permalink / raw)
To: John Stultz
Cc: Thomas Gleixner, Peter Hilber, linux-kernel, virtualization,
linux-arm-kernel, linux-rtc, Ridoux, Julien, virtio-dev,
Luu, Ryan, Christopher S. Hall, Jason Wang, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Xuan Zhuo, Marc Zyngier,
Mark Rutland, Daniel Lezcano, Alessandro Zummo, Alexandre Belloni
[-- Attachment #1: Type: text/plain, Size: 4597 bytes --]
On Tue, 2024-06-25 at 15:22 -0700, John Stultz wrote:
> On Tue, Jun 25, 2024 at 2:48 PM David Woodhouse <dwmw2@infradead.org> wrote:
> > On Tue, 2024-06-25 at 23:34 +0200, Thomas Gleixner wrote:
> > > On Tue, Jun 25 2024 at 20:01, David Woodhouse wrote:
> > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > >
> > > > The vmclock "device" provides a shared memory region with precision clock
> > > > information. By using shared memory, it is safe across Live Migration.
> > > >
> > > > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > > > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > > > actually helpful.
> > > >
> > > > The memory region of the device is also exposed to userspace so it can be
> > > > read or memory mapped by application which need reliable notification of
> > > > clock disruptions.
> > >
> > > There is effort underway to expose PTP clocks to user space via VDSO.
> >
> > Ooh, interesting. Got a reference to that please?
> >
> > > Can we please not expose an ad hoc interface for that?
> >
> > Absolutely. I'm explicitly trying to intercept the virtio-rtc
> > specification here, to *avoid* having to do anything ad hoc.
> >
> > Note that this is a "vDSO-style" interface from hypervisor to guest via
> > a shared memory region, not necessarily an actual vDSO.
> >
> > But yes, it *is* intended to be exposed to userspace, so that userspace
> > can know the *accurate* time without a system call, and know that it
> > hasn't been perturbed by live migration.
>
> Yea, I was going to raise a concern that just defining an mmaped
> structure means it has to trust the guest logic is as expected. It's
> good that it's versioned! :)
Right. Although it's basically a pvclock, and we've had those for ages.
The main difference here is that we add an indicator that tells the
guest that it's been live migrated, so any additional NTP/PTP
refinement that the *guest* has done of its oscillator, should now be
discarded.
It's also designed to be useful in "disruption-only" mode, where the
pvclock information isn't actually populated, so *all* it does is tell
guests that their clock is now hosed due to live migration.
That part is why it needs to be mappable directly to userspace, so that
userspace can not only get a timestamp but *also* know that it's
actually valid. All without a system call.
The critical use cases are financial systems where they incur massive
fines if they submit mis-timestamped transactions, and distributed
databases which rely on accurate timestamps (and error bounds) for
eventual coherence. Live migration can screw those completely.
I'm open to changing fairly much anything about the proposal as long as
we can address those use cases (which the existing virtio-rtc and other
KVM enlightenments do not).
> I'd fret a bit about exposing this to userland. It feels very similar
> to the old powerpc systemcfg implementation that similarly mapped just
> kernel data out to userland and was difficult to maintain as changes
> were made. Would including a code page like a proper vdso make sense
> to make this more flexible of an UABI to maintain?
I think the structure itself should be stable once we've bikeshedded it
a bit. But there is certainly some potential for vDSO functions which
help us expose it to the user...
This structure exposes a 'disruption count' which is updated every time
the TSC/counter is messed with by live migration. But what is userspace
actually going to *compare* it with?
It basically needs to compare it with the disruption count when the
clock was last synchronized, so maybe the kernel could export *that* to
vDSO too, then expose a simple vDSO function which reports whether the
clock is valid?
The 'invalid' code path could turn into an actual system call which
makes the kernel (check for itself and) call ntp_clear() when the
disruption occurs. Or maybe not just ntp_clear() but actually consume
the pvclock rate information directly and apply the *new* calibration?
That kind of thing would be great, and I've definitely tried to design
the structure so that it *can* be made a first-class citizen within the
kernel's timekeeping code and used like that.
But I was going to start with a more modest proposal that it's "just a
device", and applications which care about reliable time after LM would
have to /dev/vmclock0 and mmap it and check for themselves. (Which
would be assisted by things like the ClockBound library).
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes
2024-06-21 8:45 ` David Woodhouse
2024-06-25 19:01 ` [RFC PATCH v2] ptp: Add vDSO-style vmclock support David Woodhouse
@ 2024-06-27 13:50 ` Peter Hilber
1 sibling, 0 replies; 35+ messages in thread
From: Peter Hilber @ 2024-06-27 13:50 UTC (permalink / raw)
To: David Woodhouse, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
On 21.06.24 10:45, David Woodhouse wrote:
> On Thu, 2024-06-20 at 17:19 +0100, David Woodhouse wrote:
>>
>>>
>>>> +
>>>> + /* Counter frequency, and error margin. Units of (second >> 64) */
>>>> + uint64_t counter_period_frac_sec;
>>>
>>> AFAIU this might limit the precision in case of high counter frequencies.
>>> Could the unit be aligned to the expected frequency band of counters?
>>
>> This field indicates the period of a single tick, in units of 1>>64 of
>> a second. That's about 5.4e-20 seconds, or 54 zeptoseconds?
>>
>> Can you walk me through a calculation where you believe that level of
>> precision is insufficient?
>>
>> I guess the precision matters if the structure isn't updated for a long
>> period of time, and the delta between the current counter and the
>> snapshot is high? That's a *lot* of 54 zeptosecondses? But you really
>> would need a *lot* of them before you care? And if nobody's been
>> calibrating your counter for that long, surely you have bigger worries?
>>
>> Am I missing something there?
>
> Hm, that was a bit rushed at the end of the day; let's take a better look...
>
> Let's take a hypothetical example of a 100GHz counter. That's two
> orders of magnitude more than today's Arm arch counter.
>
> The period of such a counter would be 10 picoseconds.
>
> (Let's ignore the question of how far light actually travels in that
> time and how *realistic* that example is, for the moment.)
>
> It turns out that at that rate, there *are* a lot of 54 zeptosecondses
> of precision loss in the day. It could be half a millisecond a day, or
> 20µs an hour.
>
> That particular example of 10 picoseconds is 184467440.7370955
> (seconds>>64) which could be truncated to 184467440 — losing about 4PPB
> (a third of a millisecond a day; 14µs an hour).
>
> So yeah, I suppose a 'shift' field could make sense. It's easy enough
> to consume on the guest side as it doesn't really perturb the 128-bit
> multiplication very much; especially if we don't let it be negative.
>
> And implementations *can* just set it to zero. It hurts nobody.
>
> Or were you thinking of just using a fixed shift like (seconds>>80)
> instead?
The 'shift' field should be fine.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
2024-06-25 19:01 ` [RFC PATCH v2] ptp: Add vDSO-style vmclock support David Woodhouse
@ 2024-06-27 13:50 ` Peter Hilber
2024-06-27 14:52 ` David Woodhouse
2024-06-27 16:03 ` David Woodhouse
[not found] ` <20240630132859.GC17134@kernel.org>
1 sibling, 2 replies; 35+ messages in thread
From: Peter Hilber @ 2024-06-27 13:50 UTC (permalink / raw)
To: David Woodhouse, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev, Luu, Ryan
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
On 25.06.24 21:01, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> The vmclock "device" provides a shared memory region with precision clock
> information. By using shared memory, it is safe across Live Migration.
>
> Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> actually helpful.
>
> The memory region of the device is also exposed to userspace so it can be
> read or memory mapped by application which need reliable notification of
> clock disruptions.
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>
> v2:
> • Add gettimex64() support
> • Convert TSC values to KVM clock when appropriate
> • Require int128 support
> • Add counter_period_shift
> • Add timeout when seq_count is invalid
> • Add flags field
> • Better comments in vmclock ABI structure
> • Explicitly forbid smearing (as clock rates would need to change)
Leap second smearing information could still be conveyed through the
vmclock_abi. AFAIU, to cover the popular smearing variants, it should be
enough to indicate whether the driver should apply linear or cosine
smearing, and the start time and end time.
>
> drivers/ptp/Kconfig | 13 +
> drivers/ptp/Makefile | 1 +
> drivers/ptp/ptp_vmclock.c | 516 +++++++++++++++++++++++++++++++++++
> include/uapi/linux/vmclock.h | 138 ++++++++++
> 4 files changed, 668 insertions(+)
> create mode 100644 drivers/ptp/ptp_vmclock.c
> create mode 100644 include/uapi/linux/vmclock.h
>
[...]
> +
> +/*
> + * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
> + * and add the fractional second part of the reference time.
> + *
> + * The result is a 128-bit value, the top 64 bits of which are seconds, and
> + * the low 64 bits are (seconds >> 64).
> + *
> + * If __int128 isn't available, perform the calculation 32 bits at a time to
> + * avoid overflow.
> + */
> +static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t delta,
> + uint64_t period, uint8_t shift,
> + uint64_t frac_sec)
> +{
> + unsigned __int128 res = (unsigned __int128)delta * period;
> +
> + res >>= shift;
> + res += frac_sec;
> + *res_hi = res >> 64;
> + return (uint64_t)res;
> +}
> +
> +static int vmclock_get_crosststamp(struct vmclock_state *st,
> + struct ptp_system_timestamp *sts,
> + struct system_counterval_t *system_counter,
> + struct timespec64 *tspec)
> +{
> + ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
> + struct system_time_snapshot systime_snapshot;
> + uint64_t cycle, delta, seq, frac_sec;
> +
> +#ifdef CONFIG_X86
> + /*
> + * We'd expect the hypervisor to know this and to report the clock
> + * status as VMCLOCK_STATUS_UNRELIABLE. But be paranoid.
> + */
> + if (check_tsc_unstable())
> + return -EINVAL;
> +#endif
> +
> + while (1) {
> + seq = st->clk->seq_count & ~1ULL;
> + virt_rmb();
> +
> + if (st->clk->clock_status == VMCLOCK_STATUS_UNRELIABLE)
> + return -EINVAL;
> +
> + /*
> + * When invoked for gettimex64(), fill in the pre/post system
> + * times. The simple case is when system time is based on the
> + * same counter as st->cs_id, in which case all three times
> + * will be derived from the *same* counter value.
> + *
> + * If the system isn't using the same counter, then the value
> + * from ktime_get_snapshot() will still be used as pre_ts, and
> + * ptp_read_system_postts() is called to populate postts after
> + * calling get_cycles().
> + *
> + * The conversion to timespec64 happens further down, outside
> + * the seq_count loop.
> + */
> + if (sts) {
> + ktime_get_snapshot(&systime_snapshot);
> + if (systime_snapshot.cs_id == st->cs_id) {
> + cycle = systime_snapshot.cycles;
> + } else {
> + cycle = get_cycles();
> + ptp_read_system_postts(sts);
> + }
> + } else
> + cycle = get_cycles();
> +
> + delta = cycle - st->clk->counter_value;
AFAIU in the general case this needs to be masked for non 64-bit counters.
> +
> + frac_sec = mul_u64_u64_shr_add_u64(&tspec->tv_sec, delta,
> + st->clk->counter_period_frac_sec,
> + st->clk->counter_period_shift,
> + st->clk->utc_time_frac_sec);
> + tspec->tv_nsec = mul_u64_u64_shr(frac_sec, NSEC_PER_SEC, 64);
> + tspec->tv_sec += st->clk->utc_time_sec;
> +
> + virt_rmb();
> + if (seq == st->clk->seq_count)
> + break;
> +
> + if (ktime_after(ktime_get(), deadline))
> + return -ETIMEDOUT;
> + }
> +
> + if (system_counter) {
> + system_counter->cycles = cycle;
> + system_counter->cs_id = st->cs_id;
> + }
> +
> + if (sts) {
> + sts->pre_ts = ktime_to_timespec64(systime_snapshot.real);
> + if (systime_snapshot.cs_id == st->cs_id)
> + sts->post_ts = sts->pre_ts;
> + }
> +
> + return 0;
> +}
> +
[...]
> +
> +static const struct ptp_clock_info ptp_vmclock_info = {
> + .owner = THIS_MODULE,
> + .max_adj = 0,
> + .n_ext_ts = 0,
> + .n_pins = 0,
> + .pps = 0,
> + .adjfine = ptp_vmclock_adjfine,
> + .adjtime = ptp_vmclock_adjtime,
> + .gettime64 = ptp_vmclock_gettime,
The .gettime64 op is now unneeded.
> + .gettimex64 = ptp_vmclock_gettimex,
> + .settime64 = ptp_vmclock_settime,
> + .enable = ptp_vmclock_enable,
> + .getcrosststamp = ptp_vmclock_getcrosststamp,
> +};
> +
[...]
> diff --git a/include/uapi/linux/vmclock.h b/include/uapi/linux/vmclock.h
> new file mode 100644
> index 000000000000..cf0f22205e79
> --- /dev/null
> +++ b/include/uapi/linux/vmclock.h
> @@ -0,0 +1,138 @@
> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
> +
> +/*
> + * This structure provides a vDSO-style clock to VM guests, exposing the
> + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
> + * counter, etc.) and real time. It is designed to address the problem of
> + * live migration, which other clock enlightenments do not.
> + *
> + * When a guest is live migrated, this affects the clock in two ways.
> + *
> + * First, even between identical hosts the actual frequency of the underlying
> + * counter will change within the tolerances of its specification (typically
> + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the
> + * same host, but can be tracked by NTP as it generally varies slowly. With
> + * live migration there is a step change in the frequency, with no warning.
> + *
> + * Second, there may be a step change in the value of the counter itself, as
> + * its accuracy is limited by the precision of the NTP synchronization on the
> + * source and destination hosts.
> + *
> + * So any calibration (NTP, PTP, etc.) which the guest has done on the source
> + * host before migration is invalid, and needs to be redone on the new host.
> + *
> + * In its most basic mode, this structure provides only an indication to the
> + * guest that live migration has occurred. This allows the guest to know that
> + * its clock is invalid and take remedial action. For applications that need
> + * reliable accurate timestamps (e.g. distributed databases), the structure
> + * can be mapped all the way to userspace. This allows the application to see
> + * directly for itself that the clock is disrupted and take appropriate
> + * action, even when using a vDSO-style method to get the time instead of a
> + * system call.
> + *
> + * In its more advanced mode. this structure can also be used to expose the
> + * precise relationship of the CPU counter to real time, as calibrated by the
> + * host. This means that userspace applications can have accurate time
> + * immediately after live migration, rather than having to pause operations
> + * and wait for NTP to recover. This mode does, of course, rely on the
> + * counter being reliable and consistent across CPUs.
> + *
> + * Note that this must be true UTC, never with smeared leap seconds. If a
> + * guest wishes to construct a smeared clock, it can do so. Presenting a
> + * smeared clock through this interface would be problematic because it
> + * actually messes with the apparent counter *period*. A linear smearing
> + * of 1 ms per second would effectively tweak the counter period by 1000PPM
> + * at the start/end of the smearing period, while a sinusoidal smear would
> + * basically be impossible to represent.
Clock types other than UTC could also be supported: TAI, monotonic.
> + */
> +
> +#ifndef __VMCLOCK_H__
> +#define __VMCLOCK_H__
> +
> +#ifdef __KERNEL__
> +#include <linux/types.h>
> +#else
> +#include <stdint.h>
> +#endif
> +
> +struct vmclock_abi {
> + uint32_t magic;
> +#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */
> + uint16_t size; /* Size of page containing this structure */
> + uint16_t version; /* 1 */
> +
> + /* Sequence lock. Low bit means an update is in progress. */
> + uint64_t seq_count;
> +
> + /*
> + * This field changes to another non-repeating value when the CPU
> + * counter is disrupted, for example on live migration.
> + */
> + uint64_t disruption_marker;
The field could also change when the clock is stepped (leap seconds
excepted), or when the clock frequency is slewed.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
2024-06-27 13:50 ` Peter Hilber
@ 2024-06-27 14:52 ` David Woodhouse
2024-06-28 11:33 ` Peter Hilber
2024-06-27 16:03 ` David Woodhouse
1 sibling, 1 reply; 35+ messages in thread
From: David Woodhouse @ 2024-06-27 14:52 UTC (permalink / raw)
To: Peter Hilber, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev, Luu, Ryan
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
[-- Attachment #1: Type: text/plain, Size: 16284 bytes --]
On Thu, 2024-06-27 at 15:50 +0200, Peter Hilber wrote:
> On 25.06.24 21:01, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> >
> > The vmclock "device" provides a shared memory region with precision clock
> > information. By using shared memory, it is safe across Live Migration.
> >
> > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > actually helpful.
> >
> > The memory region of the device is also exposed to userspace so it can be
> > read or memory mapped by application which need reliable notification of
> > clock disruptions.
> >
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > ---
> >
> > v2:
> > • Add gettimex64() support
> > • Convert TSC values to KVM clock when appropriate
> > • Require int128 support
> > • Add counter_period_shift
> > • Add timeout when seq_count is invalid
> > • Add flags field
> > • Better comments in vmclock ABI structure
> > • Explicitly forbid smearing (as clock rates would need to change)
>
> Leap second smearing information could still be conveyed through the
> vmclock_abi. AFAIU, to cover the popular smearing variants, it should be
> enough to indicate whether the driver should apply linear or cosine
> smearing, and the start time and end time.
Yes. The clock information actually conveyed through the {counter,
time, rate} tuple should never be smeared, and should only ever be UTC.
But we could provide a hint to the guest operating system about what
type of smearing to perform, *if* it chooses to offer a clock other
than the standard CLOCK_REALTIME to its users.
I already added a flags field, so this might look something like:
/*
* Smearing flags. The UTC clock exposed through this structure
* is only ever true UTC, but a guest operating system may
* choose to offer a monotonic smeared clock to its users. This
* merely offers a hint about what kind of smearing to perform,
* for consistency with systems in the nearby environment.
*/
#define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */
(UTC-SLS is probably a bad example but are there formal definitions for
anything else?)
> > But we
> > drivers/ptp/Kconfig | 13 +
> > drivers/ptp/Makefile | 1 +
> > drivers/ptp/ptp_vmclock.c | 516 +++++++++++++++++++++++++++++++++++
> > include/uapi/linux/vmclock.h | 138 ++++++++++
> > 4 files changed, 668 insertions(+)
> > create mode 100644 drivers/ptp/ptp_vmclock.c
> > create mode 100644 include/uapi/linux/vmclock.h
> >
>
> [...]
>
> > +
> > +/*
> > + * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
> > + * and add the fractional second part of the reference time.
> > + *
> > + * The result is a 128-bit value, the top 64 bits of which are seconds, and
> > + * the low 64 bits are (seconds >> 64).
> > + *
> > + * If __int128 isn't available, perform the calculation 32 bits at a time to
> > + * avoid overflow.
> > + */
> > +static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t delta,
> > + uint64_t period, uint8_t shift,
> > + uint64_t frac_sec)
> > +{
> > + unsigned __int128 res = (unsigned __int128)delta * period;
> > +
> > + res >>= shift;
> > + res += frac_sec;
> > + *res_hi = res >> 64;
> > + return (uint64_t)res;
> > +}
> > +
> > +static int vmclock_get_crosststamp(struct vmclock_state *st,
> > + struct ptp_system_timestamp *sts,
> > + struct system_counterval_t *system_counter,
> > + struct timespec64 *tspec)
> > +{
> > + ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
> > + struct system_time_snapshot systime_snapshot;
> > + uint64_t cycle, delta, seq, frac_sec;
> > +
> > +#ifdef CONFIG_X86
> > + /*
> > + * We'd expect the hypervisor to know this and to report the clock
> > + * status as VMCLOCK_STATUS_UNRELIABLE. But be paranoid.
> > + */
> > + if (check_tsc_unstable())
> > + return -EINVAL;
> > +#endif
> > +
> > + while (1) {
> > + seq = st->clk->seq_count & ~1ULL;
> > + virt_rmb();
> > +
> > + if (st->clk->clock_status == VMCLOCK_STATUS_UNRELIABLE)
> > + return -EINVAL;
> > +
> > + /*
> > + * When invoked for gettimex64(), fill in the pre/post system
> > + * times. The simple case is when system time is based on the
> > + * same counter as st->cs_id, in which case all three times
> > + * will be derived from the *same* counter value.
> > + *
> > + * If the system isn't using the same counter, then the value
> > + * from ktime_get_snapshot() will still be used as pre_ts, and
> > + * ptp_read_system_postts() is called to populate postts after
> > + * calling get_cycles().
> > + *
> > + * The conversion to timespec64 happens further down, outside
> > + * the seq_count loop.
> > + */
> > + if (sts) {
> > + ktime_get_snapshot(&systime_snapshot);
> > + if (systime_snapshot.cs_id == st->cs_id) {
> > + cycle = systime_snapshot.cycles;
> > + } else {
> > + cycle = get_cycles();
> > + ptp_read_system_postts(sts);
> > + }
> > + } else
> > + cycle = get_cycles();
> > +
> > + delta = cycle - st->clk->counter_value;
>
> AFAIU in the general case this needs to be masked for non 64-bit counters.
Are there any non-64-bit counters that will be exposed this way? And
can we handle that with explicit knowledge of the counter type, rather
than a separate explicit field?
If we really dig deep, it gets horrible for scaled TSCs. The guest TSC
is actually calculated from the host TSC with a scale and offset:
guest_tsc = ((host_tsc * SCALE) >> 48_or_32) + OFFSET
(The shift is 48 bits for Intel, 32 for AMD).
So when the host TSC wraps to zero, the guest TSC jumps from some non-
power-of-two upper limit, to OFFSET which is generally a *negative*
value. And will continue counting from there until it actually reaches
the 64-bit limit and wraps back to zero.
>
> > +
> > + frac_sec = mul_u64_u64_shr_add_u64(&tspec->tv_sec, delta,
> > + st->clk->counter_period_frac_sec,
> > + st->clk->counter_period_shift,
> > + st->clk->utc_time_frac_sec);
> > + tspec->tv_nsec = mul_u64_u64_shr(frac_sec, NSEC_PER_SEC, 64);
> > + tspec->tv_sec += st->clk->utc_time_sec;
> > +
> > + virt_rmb();
> > + if (seq == st->clk->seq_count)
> > + break;
> > +
> > + if (ktime_after(ktime_get(), deadline))
> > + return -ETIMEDOUT;
> > + }
> > +
> > + if (system_counter) {
> > + system_counter->cycles = cycle;
> > + system_counter->cs_id = st->cs_id;
> > + }
> > +
> > + if (sts) {
> > + sts->pre_ts = ktime_to_timespec64(systime_snapshot.real);
> > + if (systime_snapshot.cs_id == st->cs_id)
> > + sts->post_ts = sts->pre_ts;
> > + }
> > +
> > + return 0;
> > +}
> > +
>
> [...]
>
> > +
> > +static const struct ptp_clock_info ptp_vmclock_info = {
> > + .owner = THIS_MODULE,
> > + .max_adj = 0,
> > + .n_ext_ts = 0,
> > + .n_pins = 0,
> > + .pps = 0,
> > + .adjfine = ptp_vmclock_adjfine,
> > + .adjtime = ptp_vmclock_adjtime,
> > + .gettime64 = ptp_vmclock_gettime,
>
> The .gettime64 op is now unneeded.
Ack, thanks.
> > + .gettimex64 = ptp_vmclock_gettimex,
> > + .settime64 = ptp_vmclock_settime,
> > + .enable = ptp_vmclock_enable,
> > + .getcrosststamp = ptp_vmclock_getcrosststamp,
> > +};
> > +
>
> [...]
>
> > diff --git a/include/uapi/linux/vmclock.h b/include/uapi/linux/vmclock.h
> > new file mode 100644
> > index 000000000000..cf0f22205e79
> > --- /dev/null
> > +++ b/include/uapi/linux/vmclock.h
> > @@ -0,0 +1,138 @@
> > +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
> > +
> > +/*
> > + * This structure provides a vDSO-style clock to VM guests, exposing the
> > + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
> > + * counter, etc.) and real time. It is designed to address the problem of
> > + * live migration, which other clock enlightenments do not.
> > + *
> > + * When a guest is live migrated, this affects the clock in two ways.
> > + *
> > + * First, even between identical hosts the actual frequency of the underlying
> > + * counter will change within the tolerances of its specification (typically
> > + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the
> > + * same host, but can be tracked by NTP as it generally varies slowly. With
> > + * live migration there is a step change in the frequency, with no warning.
> > + *
> > + * Second, there may be a step change in the value of the counter itself, as
> > + * its accuracy is limited by the precision of the NTP synchronization on the
> > + * source and destination hosts.
> > + *
> > + * So any calibration (NTP, PTP, etc.) which the guest has done on the source
> > + * host before migration is invalid, and needs to be redone on the new host.
> > + *
> > + * In its most basic mode, this structure provides only an indication to the
> > + * guest that live migration has occurred. This allows the guest to know that
> > + * its clock is invalid and take remedial action. For applications that need
> > + * reliable accurate timestamps (e.g. distributed databases), the structure
> > + * can be mapped all the way to userspace. This allows the application to see
> > + * directly for itself that the clock is disrupted and take appropriate
> > + * action, even when using a vDSO-style method to get the time instead of a
> > + * system call.
> > + *
> > + * In its more advanced mode. this structure can also be used to expose the
> > + * precise relationship of the CPU counter to real time, as calibrated by the
> > + * host. This means that userspace applications can have accurate time
> > + * immediately after live migration, rather than having to pause operations
> > + * and wait for NTP to recover. This mode does, of course, rely on the
> > + * counter being reliable and consistent across CPUs.
> > + *
> > + * Note that this must be true UTC, never with smeared leap seconds. If a
> > + * guest wishes to construct a smeared clock, it can do so. Presenting a
> > + * smeared clock through this interface would be problematic because it
> > + * actually messes with the apparent counter *period*. A linear smearing
> > + * of 1 ms per second would effectively tweak the counter period by 1000PPM
> > + * at the start/end of the smearing period, while a sinusoidal smear would
> > + * basically be impossible to represent.
>
> Clock types other than UTC could also be supported: TAI, monotonic.
This exposes both TAI *and* UTC, by exposing the TAI offset. Or it can
expose UTC only, without the TAI offset if that's unknown.
Should we also have a mode for exposing TAI only, for when TAI is known
but not the offset from UTC? Is that really a likely scenario? Isn't a
host much more likely to know UTC and *not* the TAI offset?
I suppose if we have *hardware* implementations of this, they could be
based on an atomic clock and all they'll have is TAI? So OK, maybe that
makes sense?
(We'd have to add something like the ART as the counter to pair with
over an actual PCI bus, of course.)
We can add a type field like the one you have for virtio-rtc, yes?
>
> > + */
> > +
> > +#ifndef __VMCLOCK_H__
> > +#define __VMCLOCK_H__
> > +
> > +#ifdef __KERNEL
> > +#include <linux/types.h>
> > +#else
> > +#include <stdint.h>
> > +#endif
> > +
> > +struct vmclock_abi {
> > + uint32_t magic;
> > +#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */
> > + uint16_t size; /* Size of page containing this structure */
> > + uint16_t version; /* 1 */
> > +
> > + /* Sequence lock. Low bit means an update is in progress. */
> > + uint64_t seq_count;
> > +
> > + /*
> > + * This field changes to another non-repeating value when the CPU
> > + * counter is disrupted, for example on live migration.
> > + */
> > + uint64_t disruption_marker;
>
> The field could also change when the clock is stepped (leap seconds
> excepted), or when the clock frequency is slewed.
I'm not sure. The concept of the disruption marker is that it tells the
guest to throw away any calibration of the counter that the guest has
done for *itself* (with NTP, other PTP devices, etc.).
One mode for this device would be not to populate the clock fields at
all, but *only* to signal disruption when it occurs. So the guest can
abort transactions until it's resynced its clocks (to avoid incurring
fines if breaking databases, etc.).
Exposing the host timekeeping through the structure means that the
migrated guest can keep working because it can trust the timekeeping
performed by the (new) host and exposed to it.
If the counter is actually varying in frequency over time, and the host
is slewing the clock frequency that it reports, that *isn't* a step
change and doesn't mean that the guest should throw away any
calibration that it's been doing for itself. One hopes that the guest
would have detected the *same* frequency change, and be adapting for
itself. So I don't think that should indicate a disruption.
I think the same is even true if the clock is stepped by the host. The
actual *counter* hasn't changed, so the guest is better off ignoring
the vacillating host and continuing to derive its idea of time from the
hardware counter itself, as calibrated against some external NTP/PTP
sources. Surely we actively *don't* to tell the guest to throw its own
calibrations away, in this case?
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
2024-06-27 13:50 ` Peter Hilber
2024-06-27 14:52 ` David Woodhouse
@ 2024-06-27 16:03 ` David Woodhouse
2024-06-28 11:33 ` Peter Hilber
1 sibling, 1 reply; 35+ messages in thread
From: David Woodhouse @ 2024-06-27 16:03 UTC (permalink / raw)
To: Peter Hilber, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev, Luu, Ryan
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
[-- Attachment #1: Type: text/plain, Size: 4848 bytes --]
I've updated the tree at
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock
(but not yet the qemu one).
I think I've taken into account all your comments apart from the one
about non-64-bit counters wrapping. I reduced the seq_count to 32 bit
to make room for a 32-bit flags field, added the time type
(UTC/TAI/MONOTONIC) and a smearing hint, with some straw man
definitions for smearing algorithms for which I could actually find
definitions.
The structure now looks like this:
struct vmclock_abi {
uint32_t magic;
#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */
uint16_t size; /* Size of page containing this structure */
uint16_t version; /* 1 */
/* Sequence lock. Low bit means an update is in progress. */
uint32_t seq_count;
uint32_t flags;
/* Indicates that the tai_offset_sec field is valid */
#define VMCLOCK_FLAG_TAI_OFFSET_VALID (1 << 0)
/*
* Optionally used to notify guests of pending maintenance events.
* A guest may wish to remove itself from service if an event is
* coming up. Two flags indicate the rough imminence of the event.
*/
#define VMCLOCK_FLAG_DISRUPTION_SOON (1 << 1) /* About a day */
#define VMCLOCK_FLAG_DISRUPTION_IMMINENT (1 << 2) /* About an hour */
/* Indicates that the utc_time_maxerror_picosec field is valid */
#define VMCLOCK_FLAG_UTC_MAXERROR_VALID (1 << 3)
/* Indicates counter_period_error_rate_frac_sec is valid */
#define VMCLOCK_FLAG_PERIOD_ERROR_VALID (1 << 4)
/*
* This field changes to another non-repeating value when the CPU
* counter is disrupted, for example on live migration. This lets
* the guest know that it should discard any calibration it has
* performed of the counter against external sources (NTP/PTP/etc.).
*/
uint64_t disruption_marker;
uint8_t clock_status;
#define VMCLOCK_STATUS_UNKNOWN 0
#define VMCLOCK_STATUS_INITIALIZING 1
#define VMCLOCK_STATUS_SYNCHRONIZED 2
#define VMCLOCK_STATUS_FREERUNNING 3
#define VMCLOCK_STATUS_UNRELIABLE 4
uint8_t counter_id;
#define VMCLOCK_COUNTER_INVALID 0
#define VMCLOCK_COUNTER_X86_TSC 1
#define VMCLOCK_COUNTER_ARM_VCNT 2
#define VMCLOCK_COUNTER_X86_ART 3
/*
* By providing the offset from UTC to TAI, the guest can know both
* UTC and TAI reliably, whichever is indicated in the time_type
* field. Valid if VMCLOCK_FLAG_TAI_OFFSET_VALID is set in flags.
*/
int16_t tai_offset_sec;
/*
* The time exposed through this device is never smeaared; if it
* claims to be VMCLOCK_TIME_UTC then it MUST be UTC. This field
* provides a hint to the guest operating system, such that *if*
* the guest OS wants to provide its users with an alternative
* clock which does not follow the POSIX CLOCK_REALTIME standard,
* it may do so in a fashion consistent with the other systems
* in the nearby environment.
*/
uint8_t leap_second_smearing_hint;
/* Provide true UTC to users, unsmeared. */;
#define VMCLOCK_SMEARING_NONE 0
/*
* https://aws.amazon.com/blogs/aws/look-before-you-leap-the-coming-leap-second-and-aws/
* From noon on the day before to noon on the day after, smear the
* clock by a linear 1/86400s per second.
*/
#define VMCLOCK_SMEARING_LINEAR_86400 1
/*
* draft-kuhn-leapsecond-00
* For the 1000s leading up to the leap second, smear the clock by
* clock by a linear 1ms per second.
*/
#define VMCLOCK_SMEARING_UTC_SLS 2
/*
* What time is exposed in the time_sec/time_frac_sec fields?
*/
uint8_t time_type;
#define VMCLOCK_TIME_UNKNOWN 0 /* Invalid / no time exposed */
#define VMCLOCK_TIME_UTC 1 /* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_TAI 2 /* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_MONOTONIC 3 /* Since undefined epoch */
/* Bit shift for counter_period_frac_sec and its error rate */
uint8_t counter_period_shift;
/*
* Unlike in NTP, this can indicate a leap second in the past. This
* is needed to allow guests to derive an imprecise clock with
* smeared leap seconds for themselves, as some modes of smearing
* need the adjustments to continue even after the moment at which
* the leap second should have occurred.
*/
int8_t leapsecond_direction;
uint64_t leapsecond_tai_sec; /* Since 1970-01-01 00:00:00z */
/*
* Paired values of counter and UTC at a given point in time.
*/
uint64_t counter_value;
uint64_t time_sec; /* Since 1970-01-01 00:00:00z */
uint64_t time_frac_sec;
/*
* Counter frequency, and error margin. The unit of these fields is
* seconds >> (64 + counter_period_shift)
*/
uint64_t counter_period_frac_sec;
uint64_t counter_period_error_rate_frac_sec;
/* Error margin of UTC reading above (± picoseconds) */
uint64_t utc_time_maxerror_picosec;
};
#endif /* __VMCLOCK_H__ */
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
2024-06-27 14:52 ` David Woodhouse
@ 2024-06-28 11:33 ` Peter Hilber
2024-06-28 12:15 ` David Woodhouse
0 siblings, 1 reply; 35+ messages in thread
From: Peter Hilber @ 2024-06-28 11:33 UTC (permalink / raw)
To: David Woodhouse, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev, Luu, Ryan
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
On 27.06.24 16:52, David Woodhouse wrote:
> On Thu, 2024-06-27 at 15:50 +0200, Peter Hilber wrote:
>> On 25.06.24 21:01, David Woodhouse wrote:
>>> From: David Woodhouse <dwmw@amazon.co.uk>
>>>
>>> The vmclock "device" provides a shared memory region with precision clock
>>> information. By using shared memory, it is safe across Live Migration.
>>>
>>> Like the KVM PTP clock, this can convert TSC-based cross timestamps into
>>> KVM clock values. Unlike the KVM PTP clock, it does so only when such is
>>> actually helpful.
>>>
>>> The memory region of the device is also exposed to userspace so it can be
>>> read or memory mapped by application which need reliable notification of
>>> clock disruptions.
>>>
>>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>>> ---
>>>
>>> v2:
>>> • Add gettimex64() support
>>> • Convert TSC values to KVM clock when appropriate
>>> • Require int128 support
>>> • Add counter_period_shift
>>> • Add timeout when seq_count is invalid
>>> • Add flags field
>>> • Better comments in vmclock ABI structure
>>> • Explicitly forbid smearing (as clock rates would need to change)
>>
>> Leap second smearing information could still be conveyed through the
>> vmclock_abi. AFAIU, to cover the popular smearing variants, it should be
>> enough to indicate whether the driver should apply linear or cosine
>> smearing, and the start time and end time.
>
> Yes. The clock information actually conveyed through the {counter,
> time, rate} tuple should never be smeared, and should only ever be UTC.
>
> But we could provide a hint to the guest operating system about what
> type of smearing to perform, *if* it chooses to offer a clock other
> than the standard CLOCK_REALTIME to its users.
>
> I already added a flags field, so this might look something like:
>
> /*
> * Smearing flags. The UTC clock exposed through this structure
> * is only ever true UTC, but a guest operating system may
> * choose to offer a monotonic smeared clock to its users. This
> * merely offers a hint about what kind of smearing to perform,
> * for consistency with systems in the nearby environment.
> */
> #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */
>
>
> (UTC-SLS is probably a bad example but are there formal definitions for
> anything else?)
>
>
I think it could also be more generic, like flags for linear smearing,
cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative
to the leap second start). That could also represent UTC-SLS, and
noon-to-noon, and it would be well-defined.
This should reduce the likelihood that the guest doesn't know the smearing
variant.
[...]
>>> diff --git a/include/uapi/linux/vmclock.h b/include/uapi/linux/vmclock.h
>>> new file mode 100644
>>> index 000000000000..cf0f22205e79
>>> --- /dev/null
>>> +++ b/include/uapi/linux/vmclock.h
>>> @@ -0,0 +1,138 @@
>>> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) */
>>> +
>>> +/*
>>> + * This structure provides a vDSO-style clock to VM guests, exposing the
>>> + * relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
>>> + * counter, etc.) and real time. It is designed to address the problem of
>>> + * live migration, which other clock enlightenments do not.
>>> + *
>>> + * When a guest is live migrated, this affects the clock in two ways.
>>> + *
>>> + * First, even between identical hosts the actual frequency of the underlying
>>> + * counter will change within the tolerances of its specification (typically
>>> + * ±50PPM, or 4 seconds a day). The frequency also varies over time on the
>>> + * same host, but can be tracked by NTP as it generally varies slowly. With
>>> + * live migration there is a step change in the frequency, with no warning.
>>> + *
>>> + * Second, there may be a step change in the value of the counter itself, as
>>> + * its accuracy is limited by the precision of the NTP synchronization on the
>>> + * source and destination hosts.
>>> + *
>>> + * So any calibration (NTP, PTP, etc.) which the guest has done on the source
>>> + * host before migration is invalid, and needs to be redone on the new host.
>>> + *
>>> + * In its most basic mode, this structure provides only an indication to the
>>> + * guest that live migration has occurred. This allows the guest to know that
>>> + * its clock is invalid and take remedial action. For applications that need
>>> + * reliable accurate timestamps (e.g. distributed databases), the structure
>>> + * can be mapped all the way to userspace. This allows the application to see
>>> + * directly for itself that the clock is disrupted and take appropriate
>>> + * action, even when using a vDSO-style method to get the time instead of a
>>> + * system call.
>>> + *
>>> + * In its more advanced mode. this structure can also be used to expose the
>>> + * precise relationship of the CPU counter to real time, as calibrated by the
>>> + * host. This means that userspace applications can have accurate time
>>> + * immediately after live migration, rather than having to pause operations
>>> + * and wait for NTP to recover. This mode does, of course, rely on the
>>> + * counter being reliable and consistent across CPUs.
>>> + *
>>> + * Note that this must be true UTC, never with smeared leap seconds. If a
>>> + * guest wishes to construct a smeared clock, it can do so. Presenting a
>>> + * smeared clock through this interface would be problematic because it
>>> + * actually messes with the apparent counter *period*. A linear smearing
>>> + * of 1 ms per second would effectively tweak the counter period by 1000PPM
>>> + * at the start/end of the smearing period, while a sinusoidal smear would
>>> + * basically be impossible to represent.
>>
>> Clock types other than UTC could also be supported: TAI, monotonic.
>
> This exposes both TAI *and* UTC, by exposing the TAI offset. Or it can
> expose UTC only, without the TAI offset if that's unknown.
>
> Should we also have a mode for exposing TAI only, for when TAI is known
> but not the offset from UTC? Is that really a likely scenario? Isn't a
> host much more likely to know UTC and *not* the TAI offset?
>
> I suppose if we have *hardware* implementations of this, they could be
> based on an atomic clock and all they'll have is TAI? So OK, maybe that
> makes sense?
>
> (We'd have to add something like the ART as the counter to pair with
> over an actual PCI bus, of course.)
>
> We can add a type field like the one you have for virtio-rtc, yes?
>
>
Ack.
>>
>>> + */
>>> +
>>> +#ifndef __VMCLOCK_H__
>>> +#define __VMCLOCK_H__
>>> +
>>> +#ifdef __KERNEL
>>> +#include <linux/types.h>
>>> +#else
>>> +#include <stdint.h>
>>> +#endif
>>> +
>>> +struct vmclock_abi {
>>> + uint32_t magic;
>>> +#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */
>>> + uint16_t size; /* Size of page containing this structure */
>>> + uint16_t version; /* 1 */
>>> +
>>> + /* Sequence lock. Low bit means an update is in progress. */
>>> + uint64_t seq_count;
>>> +
>>> + /*
>>> + * This field changes to another non-repeating value when the CPU
>>> + * counter is disrupted, for example on live migration.
>>> + */
>>> + uint64_t disruption_marker;
>>
>> The field could also change when the clock is stepped (leap seconds
>> excepted), or when the clock frequency is slewed.
>
> I'm not sure. The concept of the disruption marker is that it tells the
> guest to throw away any calibration of the counter that the guest has
> done for *itself* (with NTP, other PTP devices, etc.).
>
> One mode for this device would be not to populate the clock fields at
> all, but *only* to signal disruption when it occurs. So the guest can
> abort transactions until it's resynced its clocks (to avoid incurring
> fines if breaking databases, etc.).
>
> Exposing the host timekeeping through the structure means that the
> migrated guest can keep working because it can trust the timekeeping
> performed by the (new) host and exposed to it.
>
> If the counter is actually varying in frequency over time, and the host
> is slewing the clock frequency that it reports, that *isn't* a step
> change and doesn't mean that the guest should throw away any
> calibration that it's been doing for itself. One hopes that the guest
> would have detected the *same* frequency change, and be adapting for
> itself. So I don't think that should indicate a disruption.
>
> I think the same is even true if the clock is stepped by the host. The
> actual *counter* hasn't changed, so the guest is better off ignoring
> the vacillating host and continuing to derive its idea of time from the
> hardware counter itself, as calibrated against some external NTP/PTP
> sources. Surely we actively *don't* to tell the guest to throw its own
> calibrations away, in this case?
In case the guest is also considering other time sources, it might indeed
not be a good idea to mix host clock changes into the hardware counter
disruption marker.
But if the vmclock is the authoritative source of time, it can still be
helpful to know about such changes, maybe through another marker.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
2024-06-27 16:03 ` David Woodhouse
@ 2024-06-28 11:33 ` Peter Hilber
2024-06-28 11:41 ` David Woodhouse
0 siblings, 1 reply; 35+ messages in thread
From: Peter Hilber @ 2024-06-28 11:33 UTC (permalink / raw)
To: David Woodhouse, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev, Luu, Ryan
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
On 27.06.24 18:03, David Woodhouse wrote:
>
> I've updated the tree at
> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/vmclock
> (but not yet the qemu one).
>
> I think I've taken into account all your comments apart from the one
> about non-64-bit counters wrapping. I reduced the seq_count to 32 bit
> to make room for a 32-bit flags field, added the time type
> (UTC/TAI/MONOTONIC) and a smearing hint, with some straw man
> definitions for smearing algorithms for which I could actually find
> definitions.
>
> The structure now looks like this:
>
>
> struct vmclock_abi {
[...]
>
> /*
> * What time is exposed in the time_sec/time_frac_sec fields?
> */
> uint8_t time_type;
> #define VMCLOCK_TIME_UNKNOWN 0 /* Invalid / no time exposed */
> #define VMCLOCK_TIME_UTC 1 /* Since 1970-01-01 00:00:00z */
> #define VMCLOCK_TIME_TAI 2 /* Since 1970-01-01 00:00:00z */
> #define VMCLOCK_TIME_MONOTONIC 3 /* Since undefined epoch */
>
> /* Bit shift for counter_period_frac_sec and its error rate */
> uint8_t counter_period_shift;
>
> /*
> * Unlike in NTP, this can indicate a leap second in the past. This
> * is needed to allow guests to derive an imprecise clock with
> * smeared leap seconds for themselves, as some modes of smearing
> * need the adjustments to continue even after the moment at which
> * the leap second should have occurred.
> */
> int8_t leapsecond_direction;
> uint64_t leapsecond_tai_sec; /* Since 1970-01-01 00:00:00z */
>
> /*
> * Paired values of counter and UTC at a given point in time.
> */
> uint64_t counter_value;
> uint64_t time_sec; /* Since 1970-01-01 00:00:00z */
Nitpick: The comment is not valid any more for TIME_MONOTONIC.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
2024-06-28 11:33 ` Peter Hilber
@ 2024-06-28 11:41 ` David Woodhouse
0 siblings, 0 replies; 35+ messages in thread
From: David Woodhouse @ 2024-06-28 11:41 UTC (permalink / raw)
To: Peter Hilber, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev, Luu, Ryan
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
[-- Attachment #1: Type: text/plain, Size: 1821 bytes --]
On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote:
>
> >
> > /*
> > * What time is exposed in the time_sec/time_frac_sec fields?
> > */
> > uint8_t time_type;
> > #define VMCLOCK_TIME_UNKNOWN 0 /* Invalid / no time exposed */
> > #define VMCLOCK_TIME_UTC 1 /* Since 1970-01-01 00:00:00z */
> > #define VMCLOCK_TIME_TAI 2 /* Since 1970-01-01 00:00:00z */
> > #define VMCLOCK_TIME_MONOTONIC 3 /* Since undefined epoch */
> >
> > /* Bit shift for counter_period_frac_sec and its error rate */
> > uint8_t counter_period_shift;
> >
> > /*
> > * Unlike in NTP, this can indicate a leap second in the past. This
> > * is needed to allow guests to derive an imprecise clock with
> > * smeared leap seconds for themselves, as some modes of smearing
> > * need the adjustments to continue even after the moment at which
> > * the leap second should have occurred.
> > */
> > int8_t leapsecond_direction;
> > uint64_t leapsecond_tai_sec; /* Since 1970-01-01 00:00:00z */
> >
> > /*
> > * Paired values of counter and UTC at a given point in time.
> > */
> > uint64_t counter_value;
> > uint64_t time_sec; /* Since 1970-01-01 00:00:00z */
>
> Nitpick: The comment is not valid any more for TIME_MONOTONIC.
Ah yes, I "moved" that comment up to the UTC/TAI time_type values, but
neglected to actually delete it from here. Fixed; thanks.
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
2024-06-28 11:33 ` Peter Hilber
@ 2024-06-28 12:15 ` David Woodhouse
2024-06-28 16:38 ` Peter Hilber
0 siblings, 1 reply; 35+ messages in thread
From: David Woodhouse @ 2024-06-28 12:15 UTC (permalink / raw)
To: Peter Hilber, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev, Luu, Ryan
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
[-- Attachment #1: Type: text/plain, Size: 4931 bytes --]
On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote:
> On 27.06.24 16:52, David Woodhouse wrote:
> > I already added a flags field, so this might look something like:
> >
> > /*
> > * Smearing flags. The UTC clock exposed through this structure
> > * is only ever true UTC, but a guest operating system may
> > * choose to offer a monotonic smeared clock to its users. This
> > * merely offers a hint about what kind of smearing to perform,
> > * for consistency with systems in the nearby environment.
> > */
> > #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */
> >
> > (UTC-SLS is probably a bad example but are there formal definitions for
> > anything else?)
>
> I think it could also be more generic, like flags for linear smearing,
> cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative
> to the leap second start). That could also represent UTC-SLS, and
> noon-to-noon, and it would be well-defined.
>
> This should reduce the likelihood that the guest doesn't know the smearing
> variant.
I'm wary of making it too generic. That would seem to encourage a
*proliferation* of false "UTC-like" clocks.
It's bad enough that we do smearing at all, let alone that we don't
have a single definition of how to do it.
I made the smearing hint a full uint8_t instead of using bits in flags,
in the end. That gives us a full 255 ways of lying to users about what
the time is, so we're unlikely to run out. And it's easy enough to add
a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods
that get invented.
> > > > + /*
> > > > + * This field changes to another non-repeating value when the CPU
> > > > + * counter is disrupted, for example on live migration.
> > > > + */
> > > > + uint64_t disruption_marker;
> > >
> > > The field could also change when the clock is stepped (leap seconds
> > > excepted), or when the clock frequency is slewed.
> >
> > I'm not sure. The concept of the disruption marker is that it tells the
> > guest to throw away any calibration of the counter that the guest has
> > done for *itself* (with NTP, other PTP devices, etc.).
> >
> > One mode for this device would be not to populate the clock fields at
> > all, but *only* to signal disruption when it occurs. So the guest can
> > abort transactions until it's resynced its clocks (to avoid incurring
> > fines if breaking databases, etc.).
> >
> > Exposing the host timekeeping through the structure means that the
> > migrated guest can keep working because it can trust the timekeeping
> > performed by the (new) host and exposed to it.
> >
> > If the counter is actually varying in frequency over time, and the host
> > is slewing the clock frequency that it reports, that *isn't* a step
> > change and doesn't mean that the guest should throw away any
> > calibration that it's been doing for itself. One hopes that the guest
> > would have detected the *same* frequency change, and be adapting for
> > itself. So I don't think that should indicate a disruption.
> >
> > I think the same is even true if the clock is stepped by the host. The
> > actual *counter* hasn't changed, so the guest is better off ignoring
> > the vacillating host and continuing to derive its idea of time from the
> > hardware counter itself, as calibrated against some external NTP/PTP
> > sources. Surely we actively *don't* to tell the guest to throw its own
> > calibrations away, in this case?
>
> In case the guest is also considering other time sources, it might indeed
> not be a good idea to mix host clock changes into the hardware counter
> disruption marker.
>
> But if the vmclock is the authoritative source of time, it can still be
> helpful to know about such changes, maybe through another marker.
Could that be the existing seq_count field?
Skewing the counter_period_frac_sec as the underlying oscillator speeds
up and slows down is perfectly normal and expected, and we already
expect the seq_count to change when that happens.
Maybe step changes are different, but arguably if the time advertised
by the host steps *outside* the error bounds previously advertised,
that's just broken?
Depending on how the clock information is fed, a change in seq_count
may even result in non-monotonicity. If the underlying oscillator has
sped up and the structure is updated accordingly, the time calculated
the moment *before* that update may appear later than the time
calculated immediately after it.
It's up to the guest operating system to feed that information into its
own timekeeping system and skew towards correctness instead of stepping
the time it reports to its users.
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
2024-06-28 12:15 ` David Woodhouse
@ 2024-06-28 16:38 ` Peter Hilber
2024-06-28 21:27 ` David Woodhouse
0 siblings, 1 reply; 35+ messages in thread
From: Peter Hilber @ 2024-06-28 16:38 UTC (permalink / raw)
To: David Woodhouse, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev, Luu, Ryan
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
On 28.06.24 14:15, David Woodhouse wrote:
> On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote:
>> On 27.06.24 16:52, David Woodhouse wrote:
>>> I already added a flags field, so this might look something like:
>>>
>>> /*
>>> * Smearing flags. The UTC clock exposed through this structure
>>> * is only ever true UTC, but a guest operating system may
>>> * choose to offer a monotonic smeared clock to its users. This
>>> * merely offers a hint about what kind of smearing to perform,
>>> * for consistency with systems in the nearby environment.
>>> */
>>> #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */
>>>
>>> (UTC-SLS is probably a bad example but are there formal definitions for
>>> anything else?)
>>
>> I think it could also be more generic, like flags for linear smearing,
>> cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative
>> to the leap second start). That could also represent UTC-SLS, and
>> noon-to-noon, and it would be well-defined.
>>
>> This should reduce the likelihood that the guest doesn't know the smearing
>> variant.
>
> I'm wary of making it too generic. That would seem to encourage a
> *proliferation* of false "UTC-like" clocks.
>
> It's bad enough that we do smearing at all, let alone that we don't
> have a single definition of how to do it.
>
> I made the smearing hint a full uint8_t instead of using bits in flags,
> in the end. That gives us a full 255 ways of lying to users about what
> the time is, so we're unlikely to run out. And it's easy enough to add
> a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods
> that get invented.
>
>
My concern is that the registry update may come after a driver has already
been implemented, so that it may be hard to ensure that the smearing which
has been chosen is actually implemented.
>>>>> + /*
>>>>> + * This field changes to another non-repeating value when the CPU
>>>>> + * counter is disrupted, for example on live migration.
>>>>> + */
>>>>> + uint64_t disruption_marker;
>>>>
>>>> The field could also change when the clock is stepped (leap seconds
>>>> excepted), or when the clock frequency is slewed.
>>>
>>> I'm not sure. The concept of the disruption marker is that it tells the
>>> guest to throw away any calibration of the counter that the guest has
>>> done for *itself* (with NTP, other PTP devices, etc.).
>>>
>>> One mode for this device would be not to populate the clock fields at
>>> all, but *only* to signal disruption when it occurs. So the guest can
>>> abort transactions until it's resynced its clocks (to avoid incurring
>>> fines if breaking databases, etc.).
>>>
>>> Exposing the host timekeeping through the structure means that the
>>> migrated guest can keep working because it can trust the timekeeping
>>> performed by the (new) host and exposed to it.
>>>
>>> If the counter is actually varying in frequency over time, and the host
>>> is slewing the clock frequency that it reports, that *isn't* a step
>>> change and doesn't mean that the guest should throw away any
>>> calibration that it's been doing for itself. One hopes that the guest
>>> would have detected the *same* frequency change, and be adapting for
>>> itself. So I don't think that should indicate a disruption.
>>>
>>> I think the same is even true if the clock is stepped by the host. The
>>> actual *counter* hasn't changed, so the guest is better off ignoring
>>> the vacillating host and continuing to derive its idea of time from the
>>> hardware counter itself, as calibrated against some external NTP/PTP
>>> sources. Surely we actively *don't* to tell the guest to throw its own
>>> calibrations away, in this case?
>>
>> In case the guest is also considering other time sources, it might indeed
>> not be a good idea to mix host clock changes into the hardware counter
>> disruption marker.
>>
>> But if the vmclock is the authoritative source of time, it can still be
>> helpful to know about such changes, maybe through another marker.
>
> Could that be the existing seq_count field?
>
> Skewing the counter_period_frac_sec as the underlying oscillator speeds
> up and slows down is perfectly normal and expected, and we already
> expect the seq_count to change when that happens.
>
> Maybe step changes are different, but arguably if the time advertised
> by the host steps *outside* the error bounds previously advertised,
> that's just broken?
But the error bounds could be large or missing. I am trying to address use
cases where the host steps or slews the clock as well.
>
> Depending on how the clock information is fed, a change in seq_count
> may even result in non-monotonicity. If the underlying oscillator has
> sped up and the structure is updated accordingly, the time calculated
> the moment *before* that update may appear later than the time
> calculated immediately after it.
>
> It's up to the guest operating system to feed that information into its
> own timekeeping system and skew towards correctness instead of stepping
> the time it reports to its users.
>
The guest can anyway infer from the other information that the clock
changed, so my proposal might not be that useful. Maybe it can be added in
a future version if there is any need.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
2024-06-28 16:38 ` Peter Hilber
@ 2024-06-28 21:27 ` David Woodhouse
2024-07-01 8:57 ` David Woodhouse
0 siblings, 1 reply; 35+ messages in thread
From: David Woodhouse @ 2024-06-28 21:27 UTC (permalink / raw)
To: Peter Hilber, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev, Luu, Ryan
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
On 28 June 2024 17:38:15 BST, Peter Hilber <peter.hilber@opensynergy.com> wrote:
>On 28.06.24 14:15, David Woodhouse wrote:
>> On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote:
>>> On 27.06.24 16:52, David Woodhouse wrote:
>>>> I already added a flags field, so this might look something like:
>>>>
>>>> /*
>>>> * Smearing flags. The UTC clock exposed through this structure
>>>> * is only ever true UTC, but a guest operating system may
>>>> * choose to offer a monotonic smeared clock to its users. This
>>>> * merely offers a hint about what kind of smearing to perform,
>>>> * for consistency with systems in the nearby environment.
>>>> */
>>>> #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */
>>>>
>>>> (UTC-SLS is probably a bad example but are there formal definitions for
>>>> anything else?)
>>>
>>> I think it could also be more generic, like flags for linear smearing,
>>> cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative
>>> to the leap second start). That could also represent UTC-SLS, and
>>> noon-to-noon, and it would be well-defined.
>>>
>>> This should reduce the likelihood that the guest doesn't know the smearing
>>> variant.
>>
>> I'm wary of making it too generic. That would seem to encourage a
>> *proliferation* of false "UTC-like" clocks.
>>
>> It's bad enough that we do smearing at all, let alone that we don't
>> have a single definition of how to do it.
>>
>> I made the smearing hint a full uint8_t instead of using bits in flags,
>> in the end. That gives us a full 255 ways of lying to users about what
>> the time is, so we're unlikely to run out. And it's easy enough to add
>> a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods
>> that get invented.
>>
>>
>
>My concern is that the registry update may come after a driver has already
>been implemented, so that it may be hard to ensure that the smearing which
>has been chosen is actually implemented.
Well yes, but why in the name of all that is holy would anyone want to invent *new* ways to lie to users about the time? If we capture the existing ones as we write this, surely it's a good thing that there's a barrier to entry for adding more?
>But the error bounds could be large or missing. I am trying to address use
>cases where the host steps or slews the clock as well.
The host is absolutely intended to be skewing the clock to keep it accurate as the frequency of the underlying oscillator changes, and the seq_count field will change every time the host does so.
Do we need to handle steps differently? Or just let the guest deal with it?
If an NTP server suddenly steps the time it reports, what does the client do?
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
[not found] ` <20240630132859.GC17134@kernel.org>
@ 2024-07-01 8:02 ` David Woodhouse
[not found] ` <202407010838.D45C67B86@keescook>
0 siblings, 1 reply; 35+ messages in thread
From: David Woodhouse @ 2024-07-01 8:02 UTC (permalink / raw)
To: Simon Horman
Cc: Peter Hilber, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev, Luu, Ryan,
Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni, Kees Cook, linux-hardening
[-- Attachment #1: Type: text/plain, Size: 503 bytes --]
On Sun, 2024-06-30 at 14:28 +0100, Simon Horman wrote:
>
> W=1 allmodconfig builds with gcc-13 flag the following.
> Reading the documentation of strncpy() in fortify-string.h,
> I wonder if strscpy() would be more appropriate in this case.
It's never going to matter in practice, as the buffer is 32 bytes, so
"vmclock%d" would never get to the end even if was a 64-bit integer.
But as a matter of hygiene and best practice, yes strscpy() would be
more appropriate. Fixed in git, thanks.
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
2024-06-28 21:27 ` David Woodhouse
@ 2024-07-01 8:57 ` David Woodhouse
2024-07-02 15:03 ` Peter Hilber
0 siblings, 1 reply; 35+ messages in thread
From: David Woodhouse @ 2024-07-01 8:57 UTC (permalink / raw)
To: Peter Hilber, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev, Luu, Ryan
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
[-- Attachment #1: Type: text/plain, Size: 5361 bytes --]
On Fri, 2024-06-28 at 22:27 +0100, David Woodhouse wrote:
> On 28 June 2024 17:38:15 BST, Peter Hilber <peter.hilber@opensynergy.com> wrote:
> > On 28.06.24 14:15, David Woodhouse wrote:
> > > On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote:
> > > > On 27.06.24 16:52, David Woodhouse wrote:
> > > > > I already added a flags field, so this might look something like:
> > > > >
> > > > > /*
> > > > > * Smearing flags. The UTC clock exposed through this structure
> > > > > * is only ever true UTC, but a guest operating system may
> > > > > * choose to offer a monotonic smeared clock to its users. This
> > > > > * merely offers a hint about what kind of smearing to perform,
> > > > > * for consistency with systems in the nearby environment.
> > > > > */
> > > > > #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */
> > > > >
> > > > > (UTC-SLS is probably a bad example but are there formal definitions for
> > > > > anything else?)
> > > >
> > > > I think it could also be more generic, like flags for linear smearing,
> > > > cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative
> > > > to the leap second start). That could also represent UTC-SLS, and
> > > > noon-to-noon, and it would be well-defined.
> > > >
> > > > This should reduce the likelihood that the guest doesn't know the smearing
> > > > variant.
> > >
> > > I'm wary of making it too generic. That would seem to encourage a
> > > *proliferation* of false "UTC-like" clocks.
> > >
> > > It's bad enough that we do smearing at all, let alone that we don't
> > > have a single definition of how to do it.
> > >
> > > I made the smearing hint a full uint8_t instead of using bits in flags,
> > > in the end. That gives us a full 255 ways of lying to users about what
> > > the time is, so we're unlikely to run out. And it's easy enough to add
> > > a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods
> > > that get invented.
> > >
> > >
> >
> > My concern is that the registry update may come after a driver has already
> > been implemented, so that it may be hard to ensure that the smearing which
> > has been chosen is actually implemented.
>
> Well yes, but why in the name of all that is holy would anyone want
> to invent *new* ways to lie to users about the time? If we capture
> the existing ones as we write this, surely it's a good thing that
> there's a barrier to entry for adding more?
Ultimately though, this isn't the hill for me to die on. I'm pushing on
that topic because I want to avoid the proliferation of *ambiguity*. If
we have a precision clock, we should *know* what the time is.
So how about this proposal. I line up the fields in the proposed shared
memory structure to match your virtio-rtc proposal, using 'subtype' as
you proposed. But, instead of the 'subtype' being valid only for
VIRTIO_RTC_CLOCK_UTC, we define a new top-level type for *smeared* UTC.
So, you have:
+\begin{lstlisting}
+#define VIRTIO_RTC_CLOCK_UTC 0
+#define VIRTIO_RTC_CLOCK_TAI 1
+#define VIRTIO_RTC_CLOCK_MONO 2
+\end{lstlisting}
I propose that you add
#define VIRTIO_RTC_CLOCK_SMEARED_UTC 3
If my proposed memory structure is subsumed into the virtio-rtc
proposal we'd literally use the same names, but for the time being I'll
update mine to:
/*
* What time is exposed in the time_sec/time_frac_sec fields?
*/
uint8_t time_type;
#define VMCLOCK_TIME_UTC 0 /* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_TAI 1 /* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_MONOTONIC 2 /* Since undefined epoch */
#define VMCLOCK_TIME_INVALID 3 /* virtio-rtc uses this for smeared UTC */
I can then use your smearing subtype values as the 'hint' field in the
shared memory structure. You currently have:
+\begin{lstlisting}
+#define VIRTIO_RTC_SUBTYPE_STRICT 0
+#define VIRTIO_RTC_SUBTYPE_SMEAR 1
+#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 2
+#define VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED 3
+\end{lstlisting}
I can certainly ensure that 'noon linear' has the same value. I don't
think you need both 'SMEAR' and 'LEAP_UNSPECIFIED' though:
+\item VIRTIO_RTC_SUBTYPE_SMEAR deviates from the UTC standard by
+ smearing time in the vicinity of the leap second, in a not
+ precisely defined manner. This avoids clock steps due to UTC
+ leap seconds.
...
+\item VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED may deviate from the UTC
+ standard w.r.t.\ leap second introduction in an unspecified
way
+ (leap seconds may, or may not, be smeared).
To the client, both of those just mean "for a day or so around a leap
second event, you can't trust this device to know what the time is".
There isn't any point in separating "does lie to you" from "might lie
to you", surely? The guest can't do anything useful with that
distinction. Let's drop SMEAR and keep only LEAP_UNSPECIFIED?
And if you *really* want to parameterise it, I think that's a bad idea
and it encourages the proliferation of different time "standards", but
I'd probably just suck it up and do whatever you do because that's not
strictly within the remit of my live-migration part.
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
2024-07-01 8:57 ` David Woodhouse
@ 2024-07-02 15:03 ` Peter Hilber
2024-07-02 16:39 ` David Woodhouse
0 siblings, 1 reply; 35+ messages in thread
From: Peter Hilber @ 2024-07-02 15:03 UTC (permalink / raw)
To: David Woodhouse, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev, Luu, Ryan
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
On 01.07.24 10:57, David Woodhouse wrote:
> On Fri, 2024-06-28 at 22:27 +0100, David Woodhouse wrote:
>> On 28 June 2024 17:38:15 BST, Peter Hilber <peter.hilber@opensynergy.com> wrote:
>>> On 28.06.24 14:15, David Woodhouse wrote:
>>>> On Fri, 2024-06-28 at 13:33 +0200, Peter Hilber wrote:
>>>>> On 27.06.24 16:52, David Woodhouse wrote:
>>>>>> I already added a flags field, so this might look something like:
>>>>>>
>>>>>> /*
>>>>>> * Smearing flags. The UTC clock exposed through this structure
>>>>>> * is only ever true UTC, but a guest operating system may
>>>>>> * choose to offer a monotonic smeared clock to its users. This
>>>>>> * merely offers a hint about what kind of smearing to perform,
>>>>>> * for consistency with systems in the nearby environment.
>>>>>> */
>>>>>> #define VMCLOCK_FLAGS_SMEAR_UTC_SLS (1<<5) /* draft-kuhn-leapsecond-00.txt */
>>>>>>
>>>>>> (UTC-SLS is probably a bad example but are there formal definitions for
>>>>>> anything else?)
>>>>>
>>>>> I think it could also be more generic, like flags for linear smearing,
>>>>> cosine smearing(?), and smear_start_sec and smear_end_sec fields (relative
>>>>> to the leap second start). That could also represent UTC-SLS, and
>>>>> noon-to-noon, and it would be well-defined.
>>>>>
>>>>> This should reduce the likelihood that the guest doesn't know the smearing
>>>>> variant.
>>>>
>>>> I'm wary of making it too generic. That would seem to encourage a
>>>> *proliferation* of false "UTC-like" clocks.
>>>>
>>>> It's bad enough that we do smearing at all, let alone that we don't
>>>> have a single definition of how to do it.
>>>>
>>>> I made the smearing hint a full uint8_t instead of using bits in flags,
>>>> in the end. That gives us a full 255 ways of lying to users about what
>>>> the time is, so we're unlikely to run out. And it's easy enough to add
>>>> a new VMCLOCK_SMEARING_XXX type to the 'registry' for any new methods
>>>> that get invented.
>>>>
>>>>
>>>
>>> My concern is that the registry update may come after a driver has already
>>> been implemented, so that it may be hard to ensure that the smearing which
>>> has been chosen is actually implemented.
>>
>> Well yes, but why in the name of all that is holy would anyone want
>> to invent *new* ways to lie to users about the time? If we capture
>> the existing ones as we write this, surely it's a good thing that
>> there's a barrier to entry for adding more?
>
> Ultimately though, this isn't the hill for me to die on. I'm pushing on
> that topic because I want to avoid the proliferation of *ambiguity*. If
> we have a precision clock, we should *know* what the time is.
>
> So how about this proposal. I line up the fields in the proposed shared
> memory structure to match your virtio-rtc proposal, using 'subtype' as
> you proposed. But, instead of the 'subtype' being valid only for
> VIRTIO_RTC_CLOCK_UTC, we define a new top-level type for *smeared* UTC.
>
> So, you have:
>
> +\begin{lstlisting}
> +#define VIRTIO_RTC_CLOCK_UTC 0
> +#define VIRTIO_RTC_CLOCK_TAI 1
> +#define VIRTIO_RTC_CLOCK_MONO 2
> +\end{lstlisting}
>
> I propose that you add
>
> #define VIRTIO_RTC_CLOCK_SMEARED_UTC 3
>
> If my proposed memory structure is subsumed into the virtio-rtc
> proposal we'd literally use the same names, but for the time being I'll
> update mine to:
Do you intend vmclock and virtio-rtc to be ABI compatible? FYI, I see a
potential problem in that Virtio does avoid the use of signed integers so
far. I did not check carefully if there might be other problems, yet.
>
> /*
> * What time is exposed in the time_sec/time_frac_sec fields?
> */
> uint8_t time_type;
> #define VMCLOCK_TIME_UTC 0 /* Since 1970-01-01 00:00:00z */
> #define VMCLOCK_TIME_TAI 1 /* Since 1970-01-01 00:00:00z */
> #define VMCLOCK_TIME_MONOTONIC 2 /* Since undefined epoch */
> #define VMCLOCK_TIME_INVALID 3 /* virtio-rtc uses this for smeared UTC */
>
>
> I can then use your smearing subtype values as the 'hint' field in the
> shared memory structure. You currently have:
>
> +\begin{lstlisting}
> +#define VIRTIO_RTC_SUBTYPE_STRICT 0
> +#define VIRTIO_RTC_SUBTYPE_SMEAR 1
> +#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 2
> +#define VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED 3
> +\end{lstlisting}
>
I agree with the above part of your proposal.
> I can certainly ensure that 'noon linear' has the same value. I don't
> think you need both 'SMEAR' and 'LEAP_UNSPECIFIED' though:
>
>
> +\item VIRTIO_RTC_SUBTYPE_SMEAR deviates from the UTC standard by
> + smearing time in the vicinity of the leap second, in a not
> + precisely defined manner. This avoids clock steps due to UTC
> + leap seconds.
>
> ...
>
> +\item VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED may deviate from the UTC
> + standard w.r.t.\ leap second introduction in an unspecified
> way
> + (leap seconds may, or may not, be smeared).
>
> To the client, both of those just mean "for a day or so around a leap
> second event, you can't trust this device to know what the time is".
> There isn't any point in separating "does lie to you" from "might lie
> to you", surely? The guest can't do anything useful with that
> distinction. Let's drop SMEAR and keep only LEAP_UNSPECIFIED?
As for VIRTIO_RTC_SUBTYPE_SMEAR, I think this could be dropped indeed
(resp., UTC_SLS may be added).
But VIRTIO_RTC_CLOCK_SMEARED_UTC is an assurance that there will be no
steps (in particular, steps backwards, which some clients might not like)
due to leap seconds, while LEAP_UNSPECIFIED provides no such guarantee.
So I think this might be better handled by adding, alongside
> #define VIRTIO_RTC_CLOCK_SMEARED_UTC 3
#define VIRTIO_RTC_CLOCK_LEAP_UNSPECIFIED_UTC 4
(or any better name, like VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC).
>
> And if you *really* want to parameterise it, I think that's a bad idea
> and it encourages the proliferation of different time "standards", but
> I'd probably just suck it up and do whatever you do because that's not
> strictly within the remit of my live-migration part.
I think the above proposal to have subtypes for
VIRTIO_RTC_CLOCK_SMEARED_UTC should work.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
2024-07-02 15:03 ` Peter Hilber
@ 2024-07-02 16:39 ` David Woodhouse
2024-07-02 18:12 ` Peter Hilber
0 siblings, 1 reply; 35+ messages in thread
From: David Woodhouse @ 2024-07-02 16:39 UTC (permalink / raw)
To: Peter Hilber, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev, Luu, Ryan
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
[-- Attachment #1: Type: text/plain, Size: 5592 bytes --]
On Tue, 2024-07-02 at 17:03 +0200, Peter Hilber wrote:
> > On 01.07.24 10:57, David Woodhouse wrote:
> > > > If my proposed memory structure is subsumed into the virtio-rtc
> > > > proposal we'd literally use the same names, but for the time being I'll
> > > > update mine to:
> >
> > Do you intend vmclock and virtio-rtc to be ABI compatible?
I intend you to incorporate a shared memory structure like the vmclock
one into the virtio-rtc standard :)
Because precision clocks in a VM are fundamentally nonsensical without
a way for the guest to know when live migration has screwed them up.
So yes, so facilitate that, I'm trying to align things so that the
numeric values of fields like the time_type and smearing hint are
*precisely* the same as the VIRTIO_RTC_xxx values.
Time pressure *may* mean I have to ship an ACPI-based device with a
preliminary version of the structure before I've finished persuading
you, and before we've completely finalized the structure. I am hoping
to avoid that.
(In fact, my time pressure only applies to a version of the structure
which carries the disruption_marker field; the actual clock calibration
information doesn't have to be present in the interim implementation.)
> > FYI, I see a
> > potential problem in that Virtio does avoid the use of signed integers so
> > far. I did not check carefully if there might be other problems, yet.
Hm, you use an unsigned integer to convey the tai_offset. I suppose at
+37 and with a plan to stop doing leap seconds in the next decade,
we're unlikely to get back below zero?
The other signed integer I had was for the leap second direction, but I
think I'm happy to drop that and just adopt your uint8_t leap field
with VIRTIO_RTC_LEAP_{PRE_POS,PRE_NEG,etc.}.
> > > >
> > > > /*
> > > > * What time is exposed in the time_sec/time_frac_sec fields?
> > > > */
> > > > uint8_t time_type;
> > > > #define VMCLOCK_TIME_UTC 0 /* Since 1970-01-01 00:00:00z */
> > > > #define VMCLOCK_TIME_TAI 1 /* Since 1970-01-01 00:00:00z */
> > > > #define VMCLOCK_TIME_MONOTONIC 2 /* Since undefined epoch */
> > > > #define VMCLOCK_TIME_INVALID 3 /* virtio-rtc uses this for smeared UTC */
> > > >
> > > >
> > > > I can then use your smearing subtype values as the 'hint' field in the
> > > > shared memory structure. You currently have:
> > > >
> > > > +\begin{lstlisting}
> > > > +#define VIRTIO_RTC_SUBTYPE_STRICT 0
> > > > +#define VIRTIO_RTC_SUBTYPE_SMEAR 1
> > > > +#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 2
> > > > +#define VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED 3
> > > > +\end{lstlisting}
> > > >
> >
> > I agree with the above part of your proposal.
> >
> > > > I can certainly ensure that 'noon linear' has the same value. I don't
> > > > think you need both 'SMEAR' and 'LEAP_UNSPECIFIED' though:
> > > >
> > > >
> > > > +\item VIRTIO_RTC_SUBTYPE_SMEAR deviates from the UTC standard by
> > > > + smearing time in the vicinity of the leap second, in a not
> > > > + precisely defined manner. This avoids clock steps due to UTC
> > > > + leap seconds.
> > > >
> > > > ...
> > > >
> > > > +\item VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED may deviate from the UTC
> > > > + standard w.r.t.\ leap second introduction in an unspecified
> > > > way
> > > > + (leap seconds may, or may not, be smeared).
> > > >
> > > > To the client, both of those just mean "for a day or so around a leap
> > > > second event, you can't trust this device to know what the time is".
> > > > There isn't any point in separating "does lie to you" from "might lie
> > > > to you", surely? The guest can't do anything useful with that
> > > > distinction. Let's drop SMEAR and keep only LEAP_UNSPECIFIED?
> >
> > As for VIRTIO_RTC_SUBTYPE_SMEAR, I think this could be dropped indeed
> > (resp., UTC_SLS may be added).
> >
> > But VIRTIO_RTC_CLOCK_SMEARED_UTC is an assurance that there will be no
> > steps (in particular, steps backwards, which some clients might not like)
> > due to leap seconds, while LEAP_UNSPECIFIED provides no such guarantee.
> >
> > So I think this might be better handled by adding, alongside
> >
> > > > #define VIRTIO_RTC_CLOCK_SMEARED_UTC 3
> >
> > #define VIRTIO_RTC_CLOCK_LEAP_UNSPECIFIED_UTC 4
> >
> > (or any better name, like VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC).
> >
> > > >
> > > > And if you *really* want to parameterise it, I think that's a bad idea
> > > > and it encourages the proliferation of different time "standards", but
> > > > I'd probably just suck it up and do whatever you do because that's not
> > > > strictly within the remit of my live-migration part.
> >
> > I think the above proposal to have subtypes for
> > VIRTIO_RTC_CLOCK_SMEARED_UTC should work.
To clarify then, the main types are
VIRTIO_RTC_CLOCK_UTC == 0
VIRTIO_RTC_CLOCK_TAI == 1
VIRTIO_RTC_CLOCK_MONOTONIC == 2
VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
And the subtypes are *only* for the case of
VIRTIO_RTC_CLOCK_SMEARED_UTC. They include
VIRTIO_RTC_SUBTYPE_STRICT
VIRTIO_RTC_SUBTYPE_UNDEFINED /* or whatever you want to call it */
VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR
VIRTIO_RTC_SUBTYPE_UTC_SLS /* if it's worth doing this one */
Is that what we just agreed on?
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
2024-07-02 16:39 ` David Woodhouse
@ 2024-07-02 18:12 ` Peter Hilber
2024-07-02 18:40 ` David Woodhouse
0 siblings, 1 reply; 35+ messages in thread
From: Peter Hilber @ 2024-07-02 18:12 UTC (permalink / raw)
To: David Woodhouse, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev, Luu, Ryan
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
On 02.07.24 18:39, David Woodhouse wrote:
> On Tue, 2024-07-02 at 17:03 +0200, Peter Hilber wrote:
>>> On 01.07.24 10:57, David Woodhouse wrote:
>>>>> If my proposed memory structure is subsumed into the virtio-rtc
>>>>> proposal we'd literally use the same names, but for the time being I'll
>>>>> update mine to:
>>>
>>> Do you intend vmclock and virtio-rtc to be ABI compatible?
>
> I intend you to incorporate a shared memory structure like the vmclock
> one into the virtio-rtc standard :)
>
> Because precision clocks in a VM are fundamentally nonsensical without
> a way for the guest to know when live migration has screwed them up.
>
> So yes, so facilitate that, I'm trying to align things so that the
> numeric values of fields like the time_type and smearing hint are
> *precisely* the same as the VIRTIO_RTC_xxx values.
>
> Time pressure *may* mean I have to ship an ACPI-based device with a
> preliminary version of the structure before I've finished persuading
> you, and before we've completely finalized the structure. I am hoping
> to avoid that.
>
> (In fact, my time pressure only applies to a version of the structure
> which carries the disruption_marker field; the actual clock calibration
> information doesn't have to be present in the interim implementation.)
>
>
>>> FYI, I see a
>>> potential problem in that Virtio does avoid the use of signed integers so
>>> far. I did not check carefully if there might be other problems, yet.
>
> Hm, you use an unsigned integer to convey the tai_offset. I suppose at
> +37 and with a plan to stop doing leap seconds in the next decade,
> we're unlikely to get back below zero?
>
I think so.
> The other signed integer I had was for the leap second direction, but I
> think I'm happy to drop that and just adopt your uint8_t leap field
> with VIRTIO_RTC_LEAP_{PRE_POS,PRE_NEG,etc.}.
>
>
>
>
>
>>>>>
>>>>> /*
>>>>> * What time is exposed in the time_sec/time_frac_sec fields?
>>>>> */
>>>>> uint8_t time_type;
>>>>> #define VMCLOCK_TIME_UTC 0 /* Since 1970-01-01 00:00:00z */
>>>>> #define VMCLOCK_TIME_TAI 1 /* Since 1970-01-01 00:00:00z */
>>>>> #define VMCLOCK_TIME_MONOTONIC 2 /* Since undefined epoch */
>>>>> #define VMCLOCK_TIME_INVALID 3 /* virtio-rtc uses this for smeared UTC */
>>>>>
>>>>>
>>>>> I can then use your smearing subtype values as the 'hint' field in the
>>>>> shared memory structure. You currently have:
>>>>>
>>>>> +\begin{lstlisting}
>>>>> +#define VIRTIO_RTC_SUBTYPE_STRICT 0
>>>>> +#define VIRTIO_RTC_SUBTYPE_SMEAR 1
>>>>> +#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 2
>>>>> +#define VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED 3
>>>>> +\end{lstlisting}
>>>>>
>>>
>>> I agree with the above part of your proposal.
>>>
>>>>> I can certainly ensure that 'noon linear' has the same value. I don't
>>>>> think you need both 'SMEAR' and 'LEAP_UNSPECIFIED' though:
>>>>>
>>>>>
>>>>> +\item VIRTIO_RTC_SUBTYPE_SMEAR deviates from the UTC standard by
>>>>> + smearing time in the vicinity of the leap second, in a not
>>>>> + precisely defined manner. This avoids clock steps due to UTC
>>>>> + leap seconds.
>>>>>
>>>>> ...
>>>>>
>>>>> +\item VIRTIO_RTC_SUBTYPE_LEAP_UNSPECIFIED may deviate from the UTC
>>>>> + standard w.r.t.\ leap second introduction in an unspecified
>>>>> way
>>>>> + (leap seconds may, or may not, be smeared).
>>>>>
>>>>> To the client, both of those just mean "for a day or so around a leap
>>>>> second event, you can't trust this device to know what the time is".
>>>>> There isn't any point in separating "does lie to you" from "might lie
>>>>> to you", surely? The guest can't do anything useful with that
>>>>> distinction. Let's drop SMEAR and keep only LEAP_UNSPECIFIED?
>>>
>>> As for VIRTIO_RTC_SUBTYPE_SMEAR, I think this could be dropped indeed
>>> (resp., UTC_SLS may be added).
>>>
>>> But VIRTIO_RTC_CLOCK_SMEARED_UTC is an assurance that there will be no
>>> steps (in particular, steps backwards, which some clients might not like)
>>> due to leap seconds, while LEAP_UNSPECIFIED provides no such guarantee.
>>>
>>> So I think this might be better handled by adding, alongside
>>>
>>>>> #define VIRTIO_RTC_CLOCK_SMEARED_UTC 3
>>>
>>> #define VIRTIO_RTC_CLOCK_LEAP_UNSPECIFIED_UTC 4
>>>
>>> (or any better name, like VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC).
>>>
>>>>>
>>>>> And if you *really* want to parameterise it, I think that's a bad idea
>>>>> and it encourages the proliferation of different time "standards", but
>>>>> I'd probably just suck it up and do whatever you do because that's not
>>>>> strictly within the remit of my live-migration part.
>>>
>>> I think the above proposal to have subtypes for
>>> VIRTIO_RTC_CLOCK_SMEARED_UTC should work.
>
> To clarify then, the main types are
>
> VIRTIO_RTC_CLOCK_UTC == 0
> VIRTIO_RTC_CLOCK_TAI == 1
> VIRTIO_RTC_CLOCK_MONOTONIC == 2
> VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
>
> And the subtypes are *only* for the case of
> VIRTIO_RTC_CLOCK_SMEARED_UTC. They include
>
> VIRTIO_RTC_SUBTYPE_STRICT
> VIRTIO_RTC_SUBTYPE_UNDEFINED /* or whatever you want to call it */
> VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR
> VIRTIO_RTC_SUBTYPE_UTC_SLS /* if it's worth doing this one */
>
> Is that what we just agreed on?
>
>
This is a misunderstanding. My idea was that the main types are
> VIRTIO_RTC_CLOCK_UTC == 0
> VIRTIO_RTC_CLOCK_TAI == 1
> VIRTIO_RTC_CLOCK_MONOTONIC == 2
> VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC == 4
The subtypes would be (1st for clocks other than
VIRTIO_RTC_CLOCK_SMEARED_UTC, 2nd to last for
VIRTIO_RTC_CLOCK_SMEARED_UTC):
#define VIRTIO_RTC_SUBTYPE_STRICT 0
#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 1
#define VIRTIO_RTC_SUBTYPE_SMEAR_UTC_SLS 2
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
2024-07-02 18:12 ` Peter Hilber
@ 2024-07-02 18:40 ` David Woodhouse
2024-07-03 9:56 ` Peter Hilber
0 siblings, 1 reply; 35+ messages in thread
From: David Woodhouse @ 2024-07-02 18:40 UTC (permalink / raw)
To: Peter Hilber, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev, Luu, Ryan
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
On 2 July 2024 19:12:00 BST, Peter Hilber <peter.hilber@opensynergy.com> wrote:
>On 02.07.24 18:39, David Woodhouse wrote:
>> To clarify then, the main types are
>>
>> VIRTIO_RTC_CLOCK_UTC == 0
>> VIRTIO_RTC_CLOCK_TAI == 1
>> VIRTIO_RTC_CLOCK_MONOTONIC == 2
>> VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
>>
>> And the subtypes are *only* for the case of
>> VIRTIO_RTC_CLOCK_SMEARED_UTC. They include
>>
>> VIRTIO_RTC_SUBTYPE_STRICT
>> VIRTIO_RTC_SUBTYPE_UNDEFINED /* or whatever you want to call it */
>> VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR
>> VIRTIO_RTC_SUBTYPE_UTC_SLS /* if it's worth doing this one */
>>
>> Is that what we just agreed on?
>>
>>
>
>This is a misunderstanding. My idea was that the main types are
>
>> VIRTIO_RTC_CLOCK_UTC == 0
>> VIRTIO_RTC_CLOCK_TAI == 1
>> VIRTIO_RTC_CLOCK_MONOTONIC == 2
>> VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
>
>VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC == 4
>
>The subtypes would be (1st for clocks other than
>VIRTIO_RTC_CLOCK_SMEARED_UTC, 2nd to last for
>VIRTIO_RTC_CLOCK_SMEARED_UTC):
>
>#define VIRTIO_RTC_SUBTYPE_STRICT 0
>#define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 1
>#define VIRTIO_RTC_SUBTYPE_SMEAR_UTC_SLS 2
>
Thanks. I really do think that from the guest point of view there's really no distinction between "maybe smeared" and "undefined smearing", and have a preference for using the latter form, which is the key difference there?
Again though, not a hill for me to die on.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
[not found] ` <202407010838.D45C67B86@keescook>
@ 2024-07-03 8:00 ` David Woodhouse
0 siblings, 0 replies; 35+ messages in thread
From: David Woodhouse @ 2024-07-03 8:00 UTC (permalink / raw)
To: Kees Cook
Cc: Simon Horman, Peter Hilber, linux-kernel, virtualization,
linux-arm-kernel, linux-rtc, Ridoux, Julien, virtio-dev,
Luu, Ryan, Christopher S. Hall, Jason Wang, John Stultz,
Michael S. Tsirkin, netdev, Richard Cochran, Stephen Boyd,
Thomas Gleixner, Xuan Zhuo, Marc Zyngier, Mark Rutland,
Daniel Lezcano, Alessandro Zummo, Alexandre Belloni,
linux-hardening
[-- Attachment #1: Type: text/plain, Size: 876 bytes --]
On Mon, 2024-07-01 at 08:39 -0700, Kees Cook wrote:
> On Mon, Jul 01, 2024 at 09:02:38AM +0100, David Woodhouse wrote:
> > On Sun, 2024-06-30 at 14:28 +0100, Simon Horman wrote:
> > >
> > > W=1 allmodconfig builds with gcc-13 flag the following.
> > > Reading the documentation of strncpy() in fortify-string.h,
> > > I wonder if strscpy() would be more appropriate in this case.
> >
> > It's never going to matter in practice, as the buffer is 32 bytes, so
> > "vmclock%d" would never get to the end even if was a 64-bit integer.
> >
> > But as a matter of hygiene and best practice, yes strscpy() would be
> > more appropriate. Fixed in git, thanks.
>
> Thanks! You can even use the 2-argument version. :)
2-argument version? ... oh, that's *special*. I don't know whether to
be impressed or disturbed. Perhaps a little bit of both.
Done; thanks.
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
2024-07-02 18:40 ` David Woodhouse
@ 2024-07-03 9:56 ` Peter Hilber
2024-07-03 10:40 ` David Woodhouse
0 siblings, 1 reply; 35+ messages in thread
From: Peter Hilber @ 2024-07-03 9:56 UTC (permalink / raw)
To: David Woodhouse, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev, Luu, Ryan
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
On 02.07.24 20:40, David Woodhouse wrote:
> On 2 July 2024 19:12:00 BST, Peter Hilber <peter.hilber@opensynergy.com> wrote:
>> On 02.07.24 18:39, David Woodhouse wrote:
>>> To clarify then, the main types are
>>>
>>> VIRTIO_RTC_CLOCK_UTC == 0
>>> VIRTIO_RTC_CLOCK_TAI == 1
>>> VIRTIO_RTC_CLOCK_MONOTONIC == 2
>>> VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
>>>
>>> And the subtypes are *only* for the case of
>>> VIRTIO_RTC_CLOCK_SMEARED_UTC. They include
>>>
>>> VIRTIO_RTC_SUBTYPE_STRICT
>>> VIRTIO_RTC_SUBTYPE_UNDEFINED /* or whatever you want to call it */
>>> VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR
>>> VIRTIO_RTC_SUBTYPE_UTC_SLS /* if it's worth doing this one */
>>>
>>> Is that what we just agreed on?
>>>
>>>
>>
>> This is a misunderstanding. My idea was that the main types are
>>
>>> VIRTIO_RTC_CLOCK_UTC == 0
>>> VIRTIO_RTC_CLOCK_TAI == 1
>>> VIRTIO_RTC_CLOCK_MONOTONIC == 2
>>> VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
>>
>> VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC == 4
>>
>> The subtypes would be (1st for clocks other than
>> VIRTIO_RTC_CLOCK_SMEARED_UTC, 2nd to last for
>> VIRTIO_RTC_CLOCK_SMEARED_UTC):
>>
>> #define VIRTIO_RTC_SUBTYPE_STRICT 0
>> #define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 1
>> #define VIRTIO_RTC_SUBTYPE_SMEAR_UTC_SLS 2
>>
>
> Thanks. I really do think that from the guest point of view there's really no distinction between "maybe smeared" and "undefined smearing", and have a preference for using the latter form, which is the key difference there?
>
> Again though, not a hill for me to die on.
I have no issue with staying with "undefined smearing", so would you agree
to something like
VIRTIO_RTC_CLOCK_SMEAR_UNDEFINED_UTC == 4
(or another name if you prefer)?
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
2024-07-03 9:56 ` Peter Hilber
@ 2024-07-03 10:40 ` David Woodhouse
2024-07-05 8:12 ` Peter Hilber
0 siblings, 1 reply; 35+ messages in thread
From: David Woodhouse @ 2024-07-03 10:40 UTC (permalink / raw)
To: Peter Hilber, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev, Luu, Ryan, Chashper, David
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
[-- Attachment #1: Type: text/plain, Size: 11141 bytes --]
On Wed, 2024-07-03 at 11:56 +0200, Peter Hilber wrote:
> On 02.07.24 20:40, David Woodhouse wrote:
> > On 2 July 2024 19:12:00 BST, Peter Hilber <peter.hilber@opensynergy.com> wrote:
> > > On 02.07.24 18:39, David Woodhouse wrote:
> > > > To clarify then, the main types are
> > > >
> > > > VIRTIO_RTC_CLOCK_UTC == 0
> > > > VIRTIO_RTC_CLOCK_TAI == 1
> > > > VIRTIO_RTC_CLOCK_MONOTONIC == 2
> > > > VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
> > > >
> > > > And the subtypes are *only* for the case of
> > > > VIRTIO_RTC_CLOCK_SMEARED_UTC. They include
> > > >
> > > > VIRTIO_RTC_SUBTYPE_STRICT
> > > > VIRTIO_RTC_SUBTYPE_UNDEFINED /* or whatever you want to call it */
> > > > VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR
> > > > VIRTIO_RTC_SUBTYPE_UTC_SLS /* if it's worth doing this one */
> > > >
> > > > Is that what we just agreed on?
> > > >
> > > >
> > >
> > > This is a misunderstanding. My idea was that the main types are
> > >
> > > > VIRTIO_RTC_CLOCK_UTC == 0
> > > > VIRTIO_RTC_CLOCK_TAI == 1
> > > > VIRTIO_RTC_CLOCK_MONOTONIC == 2
> > > > VIRTIO_RTC_CLOCK_SMEARED_UTC == 3
> > >
> > > VIRTIO_RTC_CLOCK_MAYBE_SMEARED_UTC == 4
> > >
> > > The subtypes would be (1st for clocks other than
> > > VIRTIO_RTC_CLOCK_SMEARED_UTC, 2nd to last for
> > > VIRTIO_RTC_CLOCK_SMEARED_UTC):
> > >
> > > #define VIRTIO_RTC_SUBTYPE_STRICT 0
> > > #define VIRTIO_RTC_SUBTYPE_SMEAR_NOON_LINEAR 1
> > > #define VIRTIO_RTC_SUBTYPE_SMEAR_UTC_SLS 2
> > >
> >
> > Thanks. I really do think that from the guest point of view there's
> > really no distinction between "maybe smeared" and "undefined
> > smearing", and have a preference for using the latter form, which
> > is the key difference there?
> >
> > Again though, not a hill for me to die on.
>
> I have no issue with staying with "undefined smearing", so would you agree
> to something like
>
> VIRTIO_RTC_CLOCK_SMEAR_UNDEFINED_UTC == 4
>
> (or another name if you prefer)?
Well, the point of contention was really whether that was a *type* or a
*subtype*.
Either way, it's a "precision clock" telling its consumer that the
device *itself* doesn't really know what time is being exposed. Which
seems like a bizarre thing to support.
But I think I've constructed an argument which persuades me to your
point of view that *if* we permit it, it should be a primary type...
A clock can *either* be UTC, *or* it can be monotonic. The whole point
of smearing is to produce a monotonic clock, of course.
VIRTIO_RTC_CLOCK_UTC is UTC. It is not monotonic.
VIRTIO_RTC_CLOCK_SMEARED is, presumably, monotonic (and I think we
should explicitly require that to be true in virtio-rtc).
But VIRTIO_RTC_CLOCK_MAYBE_SMEARED is the worst of both worlds. It is
neither known to be correct UTC, *nor* is it known to be monotonic. So
(again, if we permit it at all) I think it probably does make sense for
that to be a primary type.
This is what I currently have for 'struct vmclock_abi' that I'd like to
persuade you to adopt. I need to tweak it some more, for at least the
following reasons, as well as any more you can see:
• size isn't big enough for 64KiB pages
• Should be explicitly little-endian
• Does it need esterror as well as maxerror?
• Why is maxerror in picoseconds? It's the only use of that unit
• Where do the clock_status values come from? Do they make sense?
• Are signed integers OK? (I think so!).
/*
* This structure provides a vDSO-style clock to VM guests, exposing the
* relationship (or lack thereof) between the CPU clock (TSC, timebase, arch
* counter, etc.) and real time. It is designed to address the problem of
* live migration, which other clock enlightenments do not.
*
* When a guest is live migrated, this affects the clock in two ways.
*
* First, even between identical hosts the actual frequency of the underlying
* counter will change within the tolerances of its specification (typically
* ±50PPM, or 4 seconds a day). This frequency also varies over time on the
* same host, but can be tracked by NTP as it generally varies slowly. With
* live migration there is a step change in the frequency, with no warning.
*
* Second, there may be a step change in the value of the counter itself, as
* its accuracy is limited by the precision of the NTP synchronization on the
* source and destination hosts.
*
* So any calibration (NTP, PTP, etc.) which the guest has done on the source
* host before migration is invalid, and needs to be redone on the new host.
*
* In its most basic mode, this structure provides only an indication to the
* guest that live migration has occurred. This allows the guest to know that
* its clock is invalid and take remedial action. For applications that need
* reliable accurate timestamps (e.g. distributed databases), the structure
* can be mapped all the way to userspace. This allows the application to see
* directly for itself that the clock is disrupted and take appropriate
* action, even when using a vDSO-style method to get the time instead of a
* system call.
*
* In its more advanced mode. this structure can also be used to expose the
* precise relationship of the CPU counter to real time, as calibrated by the
* host. This means that userspace applications can have accurate time
* immediately after live migration, rather than having to pause operations
* and wait for NTP to recover. This mode does, of course, rely on the
* counter being reliable and consistent across CPUs.
*
* Note that this must be true UTC, never with smeared leap seconds. If a
* guest wishes to construct a smeared clock, it can do so. Presenting a
* smeared clock through this interface would be problematic because it
* actually messes with the apparent counter *period*. A linear smearing
* of 1 ms per second would effectively tweak the counter period by 1000PPM
* at the start/end of the smearing period, while a sinusoidal smear would
* basically be impossible to represent.
*
* This structure is offered with the intent that it be adopted into the
* nascent virtio-rtc standard, as a virtio-rtc that does not address the live
* migration problem seems a little less than fit for purpose. For that
* reason, certain fields use precisely the same numeric definitions as in
* the virtio-rtc proposal. The structure can also be exposed through an ACPI
* device with the CID "VMCLOCK", modelled on the "VMGENID" device except for
* the fact that it uses a real _CRS to convey the address of the structure
* (which should be a full page, to allow for mapping directly to userspace).
*/
#ifndef __VMCLOCK_ABI_H__
#define __VMCLOCK_ABI_H__
#ifdef __KERNEL__
#include <linux/types.h>
#else
#include <stdint.h>
#endif
struct vmclock_abi {
uint64_t magic;
#define VMCLOCK_MAGIC 0x4b4c4356 /* "VCLK" */
uint16_t size; /* Size of page containing this structure */
uint16_t version; /* 1 */
/* Sequence lock. Low bit means an update is in progress. */
uint32_t seq_count;
uint32_t flags;
/* Indicates that the tai_offset_sec field is valid */
#define VMCLOCK_FLAG_TAI_OFFSET_VALID (1 << 0)
/*
* Optionally used to notify guests of pending maintenance events.
* A guest may wish to remove itself from service if an event is
* coming up. Two flags indicate the rough imminence of the event.
*/
#define VMCLOCK_FLAG_DISRUPTION_SOON (1 << 1) /* About a day */
#define VMCLOCK_FLAG_DISRUPTION_IMMINENT (1 << 2) /* About an hour */
/* Indicates that the utc_time_maxerror_picosec field is valid */
#define VMCLOCK_FLAG_UTC_MAXERROR_VALID (1 << 3)
/* Indicates counter_period_error_rate_frac_sec is valid */
#define VMCLOCK_FLAG_PERIOD_ERROR_VALID (1 << 4)
/*
* This field changes to another non-repeating value when the CPU
* counter is disrupted, for example on live migration. This lets
* the guest know that it should discard any calibration it has
* performed of the counter against external sources (NTP/PTP/etc.).
*/
uint64_t disruption_marker;
uint8_t clock_status;
#define VMCLOCK_STATUS_UNKNOWN 0
#define VMCLOCK_STATUS_INITIALIZING 1
#define VMCLOCK_STATUS_SYNCHRONIZED 2
#define VMCLOCK_STATUS_FREERUNNING 3
#define VMCLOCK_STATUS_UNRELIABLE 4
uint8_t counter_id; /* Matches VIRTIO_RTC_COUNTER_xxx */
#define VMCLOCK_COUNTER_ARM_VCNT 0
#define VMCLOCK_COUNTER_X86_TSC 1
#define VMCLOCK_COUNTER_INVALID 0xff
/*
* By providing the offset from UTC to TAI, the guest can know both
* UTC and TAI reliably, whichever is indicated in the time_type
* field. Valid if VMCLOCK_FLAG_TAI_OFFSET_VALID is set in flags.
*/
int16_t tai_offset_sec;
/*
* What time is exposed in the time_sec/time_frac_sec fields?
*/
uint8_t time_type; /* Matches VIRTIO_RTC_TYPE_xxx */
#define VMCLOCK_TIME_UTC 0 /* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_TAI 1 /* Since 1970-01-01 00:00:00z */
#define VMCLOCK_TIME_MONOTONIC 2 /* Since undefined epoch */
#define VMCLOCK_TIME_INVALID_SMEARED 3 /* Not supported */
#define VMCLOCK_TIME_INVALID_MAYBE_SMEARED 4 /* Not supported */
/*
* The time exposed through this device is never smeared. This field
* corresponds to the 'subtype' field in virtio-rtc, which indicates
* the smearing method. However in this case it provides a *hint* to
* the guest operating system, such that *if* the guest OS wants to
* provide its users with an alternative clock which does not follow
* the POSIX CLOCK_REALTIME standard, it may do so in a fashion
* consistent with the other systems in the nearby environment.
*/
uint8_t leap_second_smearing_hint; /* Matches VIRTIO_RTC_SUBTYPE_xxx */
#define VMCLOCK_SMEARING_STRICT 0
#define VMCLOCK_SMEARING_NOON_LINEAR 1
#define VMCLOCK_SMEARING_UTC_SLS 2
/* Bit shift for counter_period_frac_sec and its error rate */
uint8_t counter_period_shift;
/*
* Unlike in NTP, this can indicate a leap second in the past. This
* is needed to allow guests to derive an imprecise clock with
* smeared leap seconds for themselves, as some modes of smearing
* need the adjustments to continue even after the moment at which
* the leap second should have occurred.
*/
uint8_t leap_indicator; /* Matches VIRTIO_RTC_LEAP_xxx */
#define VMCLOCK_LEAP_NONE 0
#define VMCLOCK_LEAP_PRE_POS 1
#define VMCLOCK_LEAP_PRE_NEG 2
#define VMCLOCK_LEAP_POS 3
#define VMCLOCK_LEAP_NEG 4
uint64_t leapsecond_tai_sec; /* Since 1970-01-01 00:00:00z */
/*
* Paired values of counter and UTC at a given point in time.
*/
uint64_t counter_value;
uint64_t time_sec;
uint64_t time_frac_sec;
/*
* Counter frequency, and error margin. The unit of these fields is
* seconds >> (64 + counter_period_shift)
*/
uint64_t counter_period_frac_sec;
uint64_t counter_period_error_rate_frac_sec;
/* Error margin of UTC reading above (± picoseconds) */
uint64_t utc_time_maxerror_picosec;
};
#endif /* __VMCLOCK_ABI_H__ */
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
2024-07-03 10:40 ` David Woodhouse
@ 2024-07-05 8:12 ` Peter Hilber
2024-07-05 15:02 ` David Woodhouse
0 siblings, 1 reply; 35+ messages in thread
From: Peter Hilber @ 2024-07-05 8:12 UTC (permalink / raw)
To: David Woodhouse, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev, Luu, Ryan, Chashper, David
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
On 03.07.24 12:40, David Woodhouse wrote:
[...]
>
>
> This is what I currently have for 'struct vmclock_abi' that I'd like to
> persuade you to adopt. I need to tweak it some more, for at least the
> following reasons, as well as any more you can see:
>
> • size isn't big enough for 64KiB pages
> • Should be explicitly little-endian
> • Does it need esterror as well as maxerror?
I have no opinion about this. I can drop esterror if unwanted.
> • Why is maxerror in picoseconds? It's the only use of that unit
> • Where do the clock_status values come from? Do they make sense?
> • Are signed integers OK? (I think so!).
Signed integers would need to be introduced to Virtio, which so far only
uses explicitly unsigned types: u8, le16 etc.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
2024-07-05 8:12 ` Peter Hilber
@ 2024-07-05 15:02 ` David Woodhouse
2024-07-06 7:50 ` Peter Hilber
0 siblings, 1 reply; 35+ messages in thread
From: David Woodhouse @ 2024-07-05 15:02 UTC (permalink / raw)
To: Peter Hilber, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev, Luu, Ryan, Chashper, David
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
[-- Attachment #1: Type: text/plain, Size: 1838 bytes --]
On Fri, 2024-07-05 at 10:12 +0200, Peter Hilber wrote:
> On 03.07.24 12:40, David Woodhouse wrote:
>
> [...]
>
> >
> >
> > This is what I currently have for 'struct vmclock_abi' that I'd like to
> > persuade you to adopt. I need to tweak it some more, for at least the
> > following reasons, as well as any more you can see:
> >
> > • size isn't big enough for 64KiB pages
> > • Should be explicitly little-endian
> > • Does it need esterror as well as maxerror?
>
> I have no opinion about this. I can drop esterror if unwanted.
I also don't care. I'm just observing the inconsistency.
> > • Why is maxerror in picoseconds? It's the only use of that unit
Between us we now have picoseconds, nanoseconds, (seconds >> 64) and
(seconds >> 64+n).
The power-of-two fractions seem to make a lot of sense for the counter
period, because they mean we don't have to perform divisions.
Does it makes sense to harmonise on (seconds >> 64) for all of the
fractional seconds? Again I don't have a strong opinion; I only want us
to have a *reason* for any differences that exist.
> > • Where do the clock_status values come from? Do they make sense?
> > • Are signed integers OK? (I think so!).
>
> Signed integers would need to be introduced to Virtio, which so far only
> uses explicitly unsigned types: u8, le16 etc.
Perhaps. Although it would also be possible (if not ideal) to define
that e.g. the tai_offset field is a 16-bit "unsigned" integer according
to virtio, but to be interpreted as follows:
If the number is <= 32767 then the TAI offset is that value, but if the
number is >= 32768 then the TAI offset is that value minus 65536.
Perhaps not pretty, but there isn't a *fundamental* dependency on
virtio supporting signed integers as a primary type.
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support
2024-07-05 15:02 ` David Woodhouse
@ 2024-07-06 7:50 ` Peter Hilber
0 siblings, 0 replies; 35+ messages in thread
From: Peter Hilber @ 2024-07-06 7:50 UTC (permalink / raw)
To: David Woodhouse, linux-kernel, virtualization, linux-arm-kernel,
linux-rtc, Ridoux, Julien, virtio-dev, Luu, Ryan, Chashper, David
Cc: Christopher S. Hall, Jason Wang, John Stultz, Michael S. Tsirkin,
netdev, Richard Cochran, Stephen Boyd, Thomas Gleixner, Xuan Zhuo,
Marc Zyngier, Mark Rutland, Daniel Lezcano, Alessandro Zummo,
Alexandre Belloni
On 05.07.24 17:02, David Woodhouse wrote:
> On Fri, 2024-07-05 at 10:12 +0200, Peter Hilber wrote:
>> On 03.07.24 12:40, David Woodhouse wrote:
[...]
>>> • Why is maxerror in picoseconds? It's the only use of that unit
>
> Between us we now have picoseconds, nanoseconds, (seconds >> 64) and
> (seconds >> 64+n).
>
> The power-of-two fractions seem to make a lot of sense for the counter
> period, because they mean we don't have to perform divisions.
>
> Does it makes sense to harmonise on (seconds >> 64) for all of the
> fractional seconds? Again I don't have a strong opinion; I only want us
> to have a *reason* for any differences that exist.
>
I don't have the expertise with fixed-point arithmetic to judge if this
would become unwieldy.
I selected ns for the virtio-rtc drafts so far because that didn't have any
impact on the precision with the Linux kernel driver message-based use
cases, but that would be different for SHM in my understanding.
So I would tend to retain ns for convenience for messages (where it doesn't
impact precision) but do not have any preference for SHM.
>>> • Where do the clock_status values come from? Do they make sense?
>>> • Are signed integers OK? (I think so!).
>>
>> Signed integers would need to be introduced to Virtio, which so far only
>> uses explicitly unsigned types: u8, le16 etc.
>
> Perhaps. Although it would also be possible (if not ideal) to define
> that e.g. the tai_offset field is a 16-bit "unsigned" integer according
> to virtio, but to be interpreted as follows:
>
> If the number is <= 32767 then the TAI offset is that value, but if the
> number is >= 32768 then the TAI offset is that value minus 65536.
>
> Perhaps not pretty, but there isn't a *fundamental* dependency on
> virtio supporting signed integers as a primary type.
>
Agreed.
^ permalink raw reply [flat|nested] 35+ messages in thread
end of thread, other threads:[~2024-07-06 8:04 UTC | newest]
Thread overview: 35+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-12-18 7:38 [virtio-dev] [RFC PATCH v3 0/7] Add virtio_rtc module and related changes Peter Hilber
2023-12-18 7:38 ` [virtio-dev] [RFC PATCH v3 4/7] virtio_rtc: Add module and driver core Peter Hilber
2023-12-18 7:38 ` [virtio-dev] [RFC PATCH v3 5/7] virtio_rtc: Add PTP clocks Peter Hilber
2023-12-18 7:38 ` [virtio-dev] [RFC PATCH v3 6/7] virtio_rtc: Add Arm Generic Timer cross-timestamping Peter Hilber
[not found] ` <e410d65754ba6a11ad7f74b27dc28d9a25d8c82e.camel@infradead.org>
2024-06-20 12:06 ` Peter Hilber
2023-12-18 7:38 ` [virtio-dev] [RFC PATCH v3 7/7] virtio_rtc: Add RTC class driver Peter Hilber
[not found] ` <684eac07834699889fdb67be4cee09319c994a42.camel@infradead.org>
2024-06-20 12:37 ` [RFC PATCH v3 0/7] Add virtio_rtc module and related changes Peter Hilber
2024-06-20 16:19 ` David Woodhouse
2024-06-21 8:45 ` David Woodhouse
2024-06-25 19:01 ` [RFC PATCH v2] ptp: Add vDSO-style vmclock support David Woodhouse
2024-06-27 13:50 ` Peter Hilber
2024-06-27 14:52 ` David Woodhouse
2024-06-28 11:33 ` Peter Hilber
2024-06-28 12:15 ` David Woodhouse
2024-06-28 16:38 ` Peter Hilber
2024-06-28 21:27 ` David Woodhouse
2024-07-01 8:57 ` David Woodhouse
2024-07-02 15:03 ` Peter Hilber
2024-07-02 16:39 ` David Woodhouse
2024-07-02 18:12 ` Peter Hilber
2024-07-02 18:40 ` David Woodhouse
2024-07-03 9:56 ` Peter Hilber
2024-07-03 10:40 ` David Woodhouse
2024-07-05 8:12 ` Peter Hilber
2024-07-05 15:02 ` David Woodhouse
2024-07-06 7:50 ` Peter Hilber
2024-06-27 16:03 ` David Woodhouse
2024-06-28 11:33 ` Peter Hilber
2024-06-28 11:41 ` David Woodhouse
[not found] ` <20240630132859.GC17134@kernel.org>
2024-07-01 8:02 ` David Woodhouse
[not found] ` <202407010838.D45C67B86@keescook>
2024-07-03 8:00 ` David Woodhouse
2024-06-27 13:50 ` [RFC PATCH v3 0/7] Add virtio_rtc module and related changes Peter Hilber
2024-06-21 14:02 ` David Woodhouse
[not found] <87jzic4sgv.ffs@tglx>
2024-06-25 21:48 ` [RFC PATCH v2] ptp: Add vDSO-style vmclock support David Woodhouse
[not found] ` <CANDhNCpi_MyGWH2jZcSRB4RU28Ga08Cqm8cyY_6wkZhNMJsNSQ@mail.gmail.com>
2024-06-26 8:32 ` David Woodhouse
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).