* [PATCH 0/2] ptp: vmclock: Add VM generation counter and ACPI notification
@ 2025-11-27 10:32 Chalios, Babis
2025-11-27 10:32 ` [PATCH 1/2] ptp: vmclock: add vm generation counter Chalios, Babis
2025-11-27 10:32 ` [PATCH 2/2] ptp: vmclock: support device notifications Chalios, Babis
0 siblings, 2 replies; 7+ messages in thread
From: Chalios, Babis @ 2025-11-27 10:32 UTC (permalink / raw)
To: richardcochran@gmail.com, dwmw2@infradead.org,
andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
kuba@kernel.org, pabeni@redhat.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
Cc: Chalios, Babis, Graf (AWS), Alexander, mzxreary@0pointer.de
Similar to live migration, starting a VM from some serialized state
(aka a snapshot) is an event that calls for adjusting guest clocks, so
a hypervisor should increase the disruption_marker before resuming the
VM's vCPUs, letting the guest know.
However, loading a snapshot is slightly different from live migration,
especially since multiple VMs can be started from the same serialized
state. Apart from adjusting clocks, the guest needs to take additional
action during such events, e.g. recreate UUIDs, reset network
adapters/connections, reseed entropy pools, etc. These actions are not
necessary during live migration, which calls for a differentiation
between the two triggering events.
We differentiate between the two events via an extra field in
vmclock_abi, called vm_generation_counter. Hypervisors should increase
the disruption marker in both cases, but should only increase
vm_generation_counter when a VM is loaded from a snapshot (not during
live migration).
Additionally, we attach an ACPI notification to VMClock. Implementing
the notification is optional for the device. A VMClock device declares
that it implements the notification by setting the
VMCLOCK_FLAG_NOTIFICATION_PRESENT bit in the vmclock_abi flags field.
Hypervisors that implement the notification must send an ACPI
notification every time seq_count changes to an even number. The driver
propagates these notifications to userspace via the poll() interface.
Once we are happy with the semantics and implementation, I will post a
patchset for QEMU so people can test this.
Changes:
* RFC -> v1:
- Made the notification support optional. The hypervisor needs to
advertise support for the notification via a flag in vmclock_abi.
Subsequently, poll() will return POLLHUP when the feature is not
supported, to avoid having userspace block indefinitely waiting
for events that won't arrive
- Reworded the comment around the vm_generation_counter field to avoid
speaking about "jumping forward in time".
Babis Chalios (2):
ptp: vmclock: add vm generation counter
ptp: vmclock: support device notifications
drivers/ptp/ptp_vmclock.c | 113 +++++++++++++++++++++++++++++--
include/uapi/linux/vmclock-abi.h | 20 ++++++
2 files changed, 128 insertions(+), 5 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 1/2] ptp: vmclock: add vm generation counter
2025-11-27 10:32 [PATCH 0/2] ptp: vmclock: Add VM generation counter and ACPI notification Chalios, Babis
@ 2025-11-27 10:32 ` Chalios, Babis
2025-11-28 12:36 ` David Woodhouse
2025-11-27 10:32 ` [PATCH 2/2] ptp: vmclock: support device notifications Chalios, Babis
1 sibling, 1 reply; 7+ messages in thread
From: Chalios, Babis @ 2025-11-27 10:32 UTC (permalink / raw)
To: richardcochran@gmail.com, dwmw2@infradead.org,
andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
kuba@kernel.org, pabeni@redhat.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
Cc: Chalios, Babis, Graf (AWS), Alexander, mzxreary@0pointer.de
Similar to live migration, loading a VM from some saved state (aka a
snapshot) is also an event that calls for clock adjustments in the
guest. However, guests might want to take more actions in response to
such events, e.g. discarding UUIDs, resetting network connections,
reseeding entropy pools, etc. These are actions that guests don't
typically take during live migration, so add a new field in
vmclock_abi called vm_generation_counter which informs the guest about
such events.
The hypervisor advertises support for vm_generation_counter through the
VMCLOCK_FLAG_VM_GEN_COUNTER_PRESENT flag. Users need to check for the
presence of this bit in the vmclock_abi flags field before using the
vm_generation_counter field.
Signed-off-by: Babis Chalios <bchalios@amazon.es>
---
include/uapi/linux/vmclock-abi.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/include/uapi/linux/vmclock-abi.h b/include/uapi/linux/vmclock-abi.h
index 2d99b29ac44a..937fe00e4f33 100644
--- a/include/uapi/linux/vmclock-abi.h
+++ b/include/uapi/linux/vmclock-abi.h
@@ -115,6 +115,12 @@ struct vmclock_abi {
* bit again after the update, using the about-to-be-valid fields.
*/
#define VMCLOCK_FLAG_TIME_MONOTONIC (1 << 7)
+ /*
+ * If the VM_GEN_COUNTER_PRESENT flag is set, the hypervisor will
+ * bump the vm_generation_counter field every time the guest is
+ * loaded from some save state (restored from a snapshot).
+ */
+#define VMCLOCK_FLAG_VM_GEN_COUNTER_PRESENT (1 << 8)
__u8 pad[2];
__u8 clock_status;
@@ -177,6 +183,15 @@ struct vmclock_abi {
__le64 time_frac_sec; /* Units of 1/2^64 of a second */
__le64 time_esterror_nanosec;
__le64 time_maxerror_nanosec;
+
+ /*
+ * This field changes to another non-repeating value when the guest
+ * has been loaded from a snapshot. In addition to handling a
+ * disruption in time (which will also be signalled through the
+ * disruption_marker field), a guest may wish to discard UUIDs,
+ * reset network connections, reseed entropy, etc.
+ */
+ __le64 vm_generation_counter;
};
#endif /* __VMCLOCK_ABI_H__ */
--
2.34.1
* [PATCH 2/2] ptp: vmclock: support device notifications
2025-11-27 10:32 [PATCH 0/2] ptp: vmclock: Add VM generation counter and ACPI notification Chalios, Babis
2025-11-27 10:32 ` [PATCH 1/2] ptp: vmclock: add vm generation counter Chalios, Babis
@ 2025-11-27 10:32 ` Chalios, Babis
2025-11-28 13:00 ` David Woodhouse
1 sibling, 1 reply; 7+ messages in thread
From: Chalios, Babis @ 2025-11-27 10:32 UTC (permalink / raw)
To: richardcochran@gmail.com, dwmw2@infradead.org,
andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
kuba@kernel.org, pabeni@redhat.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
Cc: Chalios, Babis, Graf (AWS), Alexander, mzxreary@0pointer.de
Add optional support for device notifications in VMClock. When
supported, the hypervisor will send a device notification every time it
updates the seq_count to a new even value.
Moreover, add support for poll() in VMClock as a means to propagate this
notification to user space. poll() will return a POLLIN event to
listeners every time seq_count changes to a value different from the one
last seen (at open() or the last read()/pread()). This means that when
poll() returns a POLLIN event, listeners need to use read() to observe
what has changed and to update the reader's view of seq_count. In other
words, after poll() returns, all subsequent calls to poll() will
immediately return a POLLIN event until the listener calls read().
The device advertises support for the notification mechanism by setting
the VMCLOCK_FLAG_NOTIFICATION_PRESENT flag in the vmclock_abi flags
field. If the flag is not present, the driver won't set up the ACPI
notification handler and poll() will always immediately return POLLHUP.
Signed-off-by: Babis Chalios <bchalios@amazon.es>
---
drivers/ptp/ptp_vmclock.c | 113 +++++++++++++++++++++++++++++--
include/uapi/linux/vmclock-abi.h | 5 ++
2 files changed, 113 insertions(+), 5 deletions(-)
diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
index b3a83b03d9c1..fa515375d54f 100644
--- a/drivers/ptp/ptp_vmclock.c
+++ b/drivers/ptp/ptp_vmclock.c
@@ -5,6 +5,9 @@
* Copyright © 2024 Amazon.com, Inc. or its affiliates.
*/
+#include <linux/poll.h>
+#include <linux/types.h>
+#include <linux/wait.h>
#include <linux/acpi.h>
#include <linux/device.h>
#include <linux/err.h>
@@ -39,6 +42,7 @@ struct vmclock_state {
struct resource res;
struct vmclock_abi *clk;
struct miscdevice miscdev;
+ wait_queue_head_t disrupt_wait;
struct ptp_clock_info ptp_clock_info;
struct ptp_clock *ptp_clock;
enum clocksource_ids cs_id, sys_cs_id;
@@ -357,10 +361,15 @@ static struct ptp_clock *vmclock_ptp_register(struct device *dev,
return ptp_clock_register(&st->ptp_clock_info, dev);
}
+struct vmclock_file_state {
+ struct vmclock_state *st;
+ uint32_t seq;
+};
+
static int vmclock_miscdev_mmap(struct file *fp, struct vm_area_struct *vma)
{
- struct vmclock_state *st = container_of(fp->private_data,
- struct vmclock_state, miscdev);
+ struct vmclock_file_state *fst = fp->private_data;
+ struct vmclock_state *st = fst->st;
if ((vma->vm_flags & (VM_READ|VM_WRITE)) != VM_READ)
return -EROFS;
@@ -379,8 +388,9 @@ static int vmclock_miscdev_mmap(struct file *fp, struct vm_area_struct *vma)
static ssize_t vmclock_miscdev_read(struct file *fp, char __user *buf,
size_t count, loff_t *ppos)
{
- struct vmclock_state *st = container_of(fp->private_data,
- struct vmclock_state, miscdev);
+ struct vmclock_file_state *fst = fp->private_data;
+ struct vmclock_state *st = fst->st;
+
ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
size_t max_count;
uint32_t seq;
@@ -402,8 +412,10 @@ static ssize_t vmclock_miscdev_read(struct file *fp, char __user *buf,
/* Pairs with hypervisor wmb */
virt_rmb();
- if (seq == le32_to_cpu(st->clk->seq_count))
+ if (seq == le32_to_cpu(st->clk->seq_count)) {
+ fst->seq = seq;
break;
+ }
if (ktime_after(ktime_get(), deadline))
return -ETIMEDOUT;
@@ -413,10 +425,58 @@ static ssize_t vmclock_miscdev_read(struct file *fp, char __user *buf,
return count;
}
+static __poll_t vmclock_miscdev_poll(struct file *fp, poll_table *wait)
+{
+ struct vmclock_file_state *fst = fp->private_data;
+ struct vmclock_state *st = fst->st;
+ uint32_t seq;
+
+ /*
+ * The hypervisor will not send us any notifications, so fail immediately
+ * to avoid having the caller sleep forever.
+ */
+ if (!(st->clk->flags & VMCLOCK_FLAG_NOTIFICATION_PRESENT))
+ return POLLHUP;
+
+ poll_wait(fp, &st->disrupt_wait, wait);
+
+ seq = le32_to_cpu(st->clk->seq_count);
+ if (fst->seq != seq)
+ return POLLIN | POLLRDNORM;
+
+ return 0;
+}
+
+static int vmclock_miscdev_open(struct inode *inode, struct file *fp)
+{
+ struct vmclock_state *st = container_of(fp->private_data,
+ struct vmclock_state, miscdev);
+ struct vmclock_file_state *fst = kzalloc(sizeof(*fst), GFP_KERNEL);
+
+ if (!fst)
+ return -ENOMEM;
+
+ fst->st = st;
+ fst->seq = le32_to_cpu(st->clk->seq_count);
+
+ fp->private_data = fst;
+
+ return 0;
+}
+
+static int vmclock_miscdev_release(struct inode *inode, struct file *fp)
+{
+ kfree(fp->private_data);
+ return 0;
+}
+
static const struct file_operations vmclock_miscdev_fops = {
.owner = THIS_MODULE,
+ .open = vmclock_miscdev_open,
+ .release = vmclock_miscdev_release,
.mmap = vmclock_miscdev_mmap,
.read = vmclock_miscdev_read,
+ .poll = vmclock_miscdev_poll,
};
/* module operations */
@@ -459,6 +519,44 @@ static acpi_status vmclock_acpi_resources(struct acpi_resource *ares, void *data
return AE_ERROR;
}
+static void
+vmclock_acpi_notification_handler(acpi_handle __always_unused handle,
+ u32 __always_unused event, void *dev)
+{
+ struct device *device = dev;
+ struct vmclock_state *st = device->driver_data;
+
+ wake_up_interruptible(&st->disrupt_wait);
+}
+
+static int vmclock_setup_notification(struct device *dev, struct vmclock_state *st)
+{
+ struct acpi_device *adev = ACPI_COMPANION(dev);
+ acpi_status status;
+
+ /*
+ * This should never happen as this function is only called when
+ * has_acpi_companion(dev) is true, but the logic is sufficiently
+ * complex that Coverity can't see the tautology.
+ */
+ if (!adev)
+ return -ENODEV;
+
+ /* The device does not support notifications. Nothing else to do */
+ if (!(le64_to_cpu(st->clk->flags) & VMCLOCK_FLAG_NOTIFICATION_PRESENT))
+ return 0;
+
+ status = acpi_install_notify_handler(adev->handle, ACPI_DEVICE_NOTIFY,
+ vmclock_acpi_notification_handler,
+ dev);
+ if (ACPI_FAILURE(status)) {
+ dev_err(dev, "failed to install notification handler");
+ return -ENODEV;
+ }
+
+ return 0;
+}
+
static int vmclock_probe_acpi(struct device *dev, struct vmclock_state *st)
{
struct acpi_device *adev = ACPI_COMPANION(dev);
@@ -549,6 +647,9 @@ static int vmclock_probe(struct platform_device *pdev)
if (ret)
return ret;
+ init_waitqueue_head(&st->disrupt_wait);
+ vmclock_setup_notification(dev, st);
+
/*
* If the structure is big enough, it can be mapped to userspace.
* Theoretically a guest OS even using larger pages could still
@@ -581,6 +682,8 @@ static int vmclock_probe(struct platform_device *pdev)
return -ENODEV;
}
+ dev->driver_data = st;
+
dev_info(dev, "%s: registered %s%s%s\n", st->name,
st->miscdev.minor ? "miscdev" : "",
(st->miscdev.minor && st->ptp_clock) ? ", " : "",
diff --git a/include/uapi/linux/vmclock-abi.h b/include/uapi/linux/vmclock-abi.h
index 937fe00e4f33..d320623b0118 100644
--- a/include/uapi/linux/vmclock-abi.h
+++ b/include/uapi/linux/vmclock-abi.h
@@ -121,6 +121,11 @@ struct vmclock_abi {
* loaded from some save state (restored from a snapshot).
*/
#define VMCLOCK_FLAG_VM_GEN_COUNTER_PRESENT (1 << 8)
+ /*
+ * If the NOTIFICATION_PRESENT flag is set, the hypervisor will send
+ * a notification every time it updates seq_count to a new even number.
+ */
+#define VMCLOCK_FLAG_NOTIFICATION_PRESENT (1 << 9)
__u8 pad[2];
__u8 clock_status;
--
2.34.1
* Re: [PATCH 1/2] ptp: vmclock: add vm generation counter
2025-11-27 10:32 ` [PATCH 1/2] ptp: vmclock: add vm generation counter Chalios, Babis
@ 2025-11-28 12:36 ` David Woodhouse
0 siblings, 0 replies; 7+ messages in thread
From: David Woodhouse @ 2025-11-28 12:36 UTC (permalink / raw)
To: Chalios, Babis, richardcochran@gmail.com, andrew+netdev@lunn.ch,
davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
pabeni@redhat.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
Cc: Graf (AWS), Alexander, mzxreary@0pointer.de
On Thu, 2025-11-27 at 10:32 +0000, Chalios, Babis wrote:
> Similar to live migration, loading a VM from some saved state (aka
> snapshot) is also an event that calls for clock adjustments in the
> guest. However, guests might want to take more actions as a response to
> such events, e.g. as discarding UUIDs, resetting network connections,
> reseeding entropy pools, etc. These are actions that guests don't
> typically take during live migration, so add a new field in the
> vmclock_abi called vm_generation_counter which informs the guest about
> such events.
>
> Hypervisor advertises support for vm_generation_counter through the
> VMCLOCK_FLAG_VM_GEN_COUNTER_PRESENT flag. Users need to check the
> presence of this bit in vmclock_abi flags field before using this flag.
>
> Signed-off-by: Babis Chalios <bchalios@amazon.es>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
We are writing up a formal definition of the VMClock device which
includes these updates, and I believe the QEMU patches are to follow.
I've put an early draft at https://david.woodhou.se/VMClock.pdf
> ---
> include/uapi/linux/vmclock-abi.h | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/include/uapi/linux/vmclock-abi.h b/include/uapi/linux/vmclock-abi.h
> index 2d99b29ac44a..937fe00e4f33 100644
> --- a/include/uapi/linux/vmclock-abi.h
> +++ b/include/uapi/linux/vmclock-abi.h
> @@ -115,6 +115,12 @@ struct vmclock_abi {
> * bit again after the update, using the about-to-be-valid fields.
> */
> #define VMCLOCK_FLAG_TIME_MONOTONIC (1 << 7)
> + /*
> + * If the VM_GEN_COUNTER_PRESENT flag is set, the hypervisor will
> + * bump the vm_generation_counter field every time the guest is
> + * loaded from some save state (restored from a snapshot).
> + */
> +#define VMCLOCK_FLAG_VM_GEN_COUNTER_PRESENT (1 << 8)
>
> __u8 pad[2];
> __u8 clock_status;
> @@ -177,6 +183,15 @@ struct vmclock_abi {
> __le64 time_frac_sec; /* Units of 1/2^64 of a second */
> __le64 time_esterror_nanosec;
> __le64 time_maxerror_nanosec;
> +
> + /*
> + * This field changes to another non-repeating value when the guest
> + * has been loaded from a snapshot. In addition to handling a
> + * disruption in time (which will also be signalled through the
> + * disruption_marker field), a guest may wish to discard UUIDs,
> + * reset network connections, reseed entropy, etc.
> + */
> + __le64 vm_generation_counter;
> };
>
> #endif /* __VMCLOCK_ABI_H__ */
* Re: [PATCH 2/2] ptp: vmclock: support device notifications
2025-11-27 10:32 ` [PATCH 2/2] ptp: vmclock: support device notifications Chalios, Babis
@ 2025-11-28 13:00 ` David Woodhouse
2025-12-01 12:15 ` Babis Chalios
0 siblings, 1 reply; 7+ messages in thread
From: David Woodhouse @ 2025-11-28 13:00 UTC (permalink / raw)
To: Chalios, Babis, richardcochran@gmail.com, andrew+netdev@lunn.ch,
davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
pabeni@redhat.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
Cc: Graf (AWS), Alexander, mzxreary@0pointer.de
On Thu, 2025-11-27 at 10:32 +0000, Chalios, Babis wrote:
> Add optional support for device notifications in VMClock. When
> supported, the hypervisor will send a device notification every time it
> updates the seq_count to a new even value.
>
> Moreover, add support for poll() in VMClock as a means to propagate this
> notification to user space. poll() will return a POLLIN event to
> listeners every time seq_count changes to a value different than the one
> last seen (since open() or last read()/pread()). This means that when
> poll() returns a POLLIN event, listeners need to use read() to observe
> what has changed and update the reader's view of seq_count. In other
> words, after a poll() returned, all subsequent calls to poll() will
> immediately return with a POLLIN event until the listener calls read().
>
> The device advertises support for the notification mechanism by setting
> flag VMCLOCK_FLAG_NOTIFICATION_PRESENT in vmclock_abi flags field. If
> the flag is not present the driver won't setup the ACPI notification
> handler and poll() will always immediately return POLLHUP.
>
> Signed-off-by: Babis Chalios <bchalios@amazon.es>
Generally looks sane to me, thanks.
I haven't given much brain power to whether POLLHUP is the right thing
to return when poll() is invalid; I guess you have.
I also haven't looked hard into the locking on fst->seq which is
accessed during poll() and read(). Have you?
Your vmclock_setup_notification() function can return error, but you
ignore that. Which *might* have been intentional, to allow the device
to be used even without notifications if something goes wrong... but
then the condition for poll() returning POLLHUP is wrong, because that
only checks the flag that the hypervisor set, and not whether
notifications are *actually* working.
In open() you simply read seq_count from the vmclock structure but it
might be odd at that point. Do we want to wait for it to be even, like
read() does? Or just initialise fst->seq to zero?
And is there really no devm-like helper which will free your
fp->private_data for you on release()? That seems surprising.
> ---
> drivers/ptp/ptp_vmclock.c | 113 +++++++++++++++++++++++++++++--
> include/uapi/linux/vmclock-abi.h | 5 ++
> 2 files changed, 113 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
> index b3a83b03d9c1..fa515375d54f 100644
> --- a/drivers/ptp/ptp_vmclock.c
> +++ b/drivers/ptp/ptp_vmclock.c
> @@ -5,6 +5,9 @@
> * Copyright © 2024 Amazon.com, Inc. or its affiliates.
> */
>
> +#include "linux/poll.h"
> +#include "linux/types.h"
> +#include "linux/wait.h"
> #include <linux/acpi.h>
> #include <linux/device.h>
> #include <linux/err.h>
> @@ -39,6 +42,7 @@ struct vmclock_state {
> struct resource res;
> struct vmclock_abi *clk;
> struct miscdevice miscdev;
> + wait_queue_head_t disrupt_wait;
> struct ptp_clock_info ptp_clock_info;
> struct ptp_clock *ptp_clock;
> enum clocksource_ids cs_id, sys_cs_id;
> @@ -357,10 +361,15 @@ static struct ptp_clock *vmclock_ptp_register(struct device *dev,
> return ptp_clock_register(&st->ptp_clock_info, dev);
> }
>
> +struct vmclock_file_state {
> + struct vmclock_state *st;
> + uint32_t seq;
> +};
> +
> static int vmclock_miscdev_mmap(struct file *fp, struct vm_area_struct *vma)
> {
> - struct vmclock_state *st = container_of(fp->private_data,
> - struct vmclock_state, miscdev);
> + struct vmclock_file_state *fst = fp->private_data;
> + struct vmclock_state *st = fst->st;
>
> if ((vma->vm_flags & (VM_READ|VM_WRITE)) != VM_READ)
> return -EROFS;
> @@ -379,8 +388,9 @@ static int vmclock_miscdev_mmap(struct file *fp, struct vm_area_struct *vma)
> static ssize_t vmclock_miscdev_read(struct file *fp, char __user *buf,
> size_t count, loff_t *ppos)
> {
> - struct vmclock_state *st = container_of(fp->private_data,
> - struct vmclock_state, miscdev);
> + struct vmclock_file_state *fst = fp->private_data;
> + struct vmclock_state *st = fst->st;
> +
> ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
> size_t max_count;
> uint32_t seq;
> @@ -402,8 +412,10 @@ static ssize_t vmclock_miscdev_read(struct file *fp, char __user *buf,
>
> /* Pairs with hypervisor wmb */
> virt_rmb();
> - if (seq == le32_to_cpu(st->clk->seq_count))
> + if (seq == le32_to_cpu(st->clk->seq_count)) {
> + fst->seq = seq;
> break;
> + }
>
> if (ktime_after(ktime_get(), deadline))
> return -ETIMEDOUT;
> @@ -413,10 +425,58 @@ static ssize_t vmclock_miscdev_read(struct file *fp, char __user *buf,
> return count;
> }
>
> +static __poll_t vmclock_miscdev_poll(struct file *fp, poll_table *wait)
> +{
> + struct vmclock_file_state *fst = fp->private_data;
> + struct vmclock_state *st = fst->st;
> + uint32_t seq;
> +
> + /*
> + * Hypervisor will not send us any notifications, so fail immediately
> + * to avoid having caller sleeping for ever.
> + */
> + if (!(st->clk->flags & VMCLOCK_FLAG_NOTIFICATION_PRESENT))
> + return POLLHUP;
> +
> + poll_wait(fp, &st->disrupt_wait, wait);
> +
> + seq = le32_to_cpu(st->clk->seq_count);
> + if (fst->seq != seq)
> + return POLLIN | POLLRDNORM;
> +
> + return 0;
> +}
> +
> +static int vmclock_miscdev_open(struct inode *inode, struct file *fp)
> +{
> + struct vmclock_state *st = container_of(fp->private_data,
> + struct vmclock_state, miscdev);
> + struct vmclock_file_state *fst = kzalloc(sizeof(*fst), GFP_KERNEL);
> +
> + if (!fst)
> + return -ENOMEM;
> +
> + fst->st = st;
> + fst->seq = le32_to_cpu(st->clk->seq_count);
> +
> + fp->private_data = fst;
> +
> + return 0;
> +}
> +
> +static int vmclock_miscdev_release(struct inode *inode, struct file *fp)
> +{
> + kfree(fp->private_data);
> + return 0;
> +}
> +
> static const struct file_operations vmclock_miscdev_fops = {
> .owner = THIS_MODULE,
> + .open = vmclock_miscdev_open,
> + .release = vmclock_miscdev_release,
> .mmap = vmclock_miscdev_mmap,
> .read = vmclock_miscdev_read,
> + .poll = vmclock_miscdev_poll,
> };
>
> /* module operations */
> @@ -459,6 +519,44 @@ static acpi_status vmclock_acpi_resources(struct acpi_resource *ares, void *data
> return AE_ERROR;
> }
>
> +static void
> +vmclock_acpi_notification_handler(acpi_handle __always_unused handle,
> + u32 __always_unused event, void *dev)
> +{
> + struct device *device = dev;
> + struct vmclock_state *st = device->driver_data;
> +
> + wake_up_interruptible(&st->disrupt_wait);
> +}
> +
> +static int vmclock_setup_notification(struct device *dev, struct vmclock_state *st)
> +{
> + struct acpi_device *adev = ACPI_COMPANION(dev);
> + acpi_status status;
> +
> + /*
> + * This should never happen as this function is only called when
> + * has_acpi_companion(dev) is true, but the logic is sufficiently
> + * complex that Coverity can't see the tautology.
> + */
> + if (!adev)
> + return -ENODEV;
> +
> + /* The device does not support notifications. Nothing else to do */
> + if (!(le64_to_cpu(st->clk->flags) & VMCLOCK_FLAG_NOTIFICATION_PRESENT))
> + return 0;
> +
> + status = acpi_install_notify_handler(adev->handle, ACPI_DEVICE_NOTIFY,
> + vmclock_acpi_notification_handler,
> + dev);
> + if (ACPI_FAILURE(status)) {
> + dev_err(dev, "failed to install notification handler");
> + return -ENODEV;
> + }
> +
> + return 0;
> +}
> +
> static int vmclock_probe_acpi(struct device *dev, struct vmclock_state *st)
> {
> struct acpi_device *adev = ACPI_COMPANION(dev);
> @@ -549,6 +647,9 @@ static int vmclock_probe(struct platform_device *pdev)
> if (ret)
> return ret;
>
> + init_waitqueue_head(&st->disrupt_wait);
> + vmclock_setup_notification(dev, st);
> +
> /*
> * If the structure is big enough, it can be mapped to userspace.
> * Theoretically a guest OS even using larger pages could still
> @@ -581,6 +682,8 @@ static int vmclock_probe(struct platform_device *pdev)
> return -ENODEV;
> }
>
> + dev->driver_data = st;
> +
> dev_info(dev, "%s: registered %s%s%s\n", st->name,
> st->miscdev.minor ? "miscdev" : "",
> (st->miscdev.minor && st->ptp_clock) ? ", " : "",
> diff --git a/include/uapi/linux/vmclock-abi.h b/include/uapi/linux/vmclock-abi.h
> index 937fe00e4f33..d320623b0118 100644
> --- a/include/uapi/linux/vmclock-abi.h
> +++ b/include/uapi/linux/vmclock-abi.h
> @@ -121,6 +121,11 @@ struct vmclock_abi {
> * loaded from some save state (restored from a snapshot).
> */
> #define VMCLOCK_FLAG_VM_GEN_COUNTER_PRESENT (1 << 8)
> + /*
> + * If the NOTIFICATION_PRESENT flag is set, the hypervisor will send
> + * a notification every time it updates seq_count to a new even number.
> + */
> +#define VMCLOCK_FLAG_NOTIFICATION_PRESENT (1 << 9)
>
> __u8 pad[2];
> __u8 clock_status;
* Re: [PATCH 2/2] ptp: vmclock: support device notifications
2025-11-28 13:00 ` David Woodhouse
@ 2025-12-01 12:15 ` Babis Chalios
2025-12-02 9:50 ` David Woodhouse
0 siblings, 1 reply; 7+ messages in thread
From: Babis Chalios @ 2025-12-01 12:15 UTC (permalink / raw)
To: David Woodhouse, richardcochran@gmail.com, andrew+netdev@lunn.ch,
davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
pabeni@redhat.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
Cc: Graf (AWS), Alexander, mzxreary@0pointer.de
On Fri, 2025-11-28 at 13:00 +0000, David Woodhouse wrote:
> On Thu, 2025-11-27 at 10:32 +0000, Chalios, Babis wrote:
> > Add optional support for device notifications in VMClock. When
> > supported, the hypervisor will send a device notification every
> > time it
> > updates the seq_count to a new even value.
> >
> > Moreover, add support for poll() in VMClock as a means to propagate
> > this
> > notification to user space. poll() will return a POLLIN event to
> > listeners every time seq_count changes to a value different than
> > the one
> > last seen (since open() or last read()/pread()). This means that
> > when
> > poll() returns a POLLIN event, listeners need to use read() to
> > observe
> > what has changed and update the reader's view of seq_count. In
> > other
> > words, after a poll() returned, all subsequent calls to poll() will
> > immediately return with a POLLIN event until the listener calls
> > read().
> >
> > The device advertises support for the notification mechanism by
> > setting
> > flag VMCLOCK_FLAG_NOTIFICATION_PRESENT in vmclock_abi flags field.
> > If
> > the flag is not present the driver won't setup the ACPI
> > notification
> > handler and poll() will always immediately return POLLHUP.
> >
> > Signed-off-by: Babis Chalios <bchalios@amazon.es>
>
> Generally looks sane to me, thanks.
>
> I haven't given much brain power to whether POLLHUP is the right
> thing
> to return when poll() is invalid; I guess you have.
>
I was looking at the possible alternatives. The semantics of POLLHUP
seem to me the correct ones, i.e. the other end (the device) "closed its
end of the channel", which is exactly what has happened.
> I also haven't looked hard into the locking on fst->seq which is
> accessed during poll() and read(). Have you?
>
Hmmm, I hadn't considered it because I can't think of any reason why
someone would be poll()ing and read()ing concurrently from different
threads, but you're right, there's a race. I wonder though what the
correct semantics are here. Imagine thread A is waiting (blocked in
poll()) and thread B is reading from the same file descriptor while a
notification arrives. I can see a scenario where this causes thread A
to lose a notification (and consequently a missed chance to act on
disruption_marker and vm_generation_counter changes). What do you
think?
> Your vmclock_setup_notification() function can return error, but you
> ignore that. Which *might* have been intentional, to allow the device
> to be used even without notifications if something goes wrong... but
> then the condition for poll() returning POLLHUP is wrong, because
> that
> only checks the flag that the hypervisor set, and not whether
> notifications are *actually* working.
>
>
This is just a bug. My intention was to mark the device probing as
failed if the device has advertised support for notifications **and**
setting up the notification handler failed.
> In open() you simply read seq_count from the vmclock structure but it
> might be odd at that point. Do we want to wait for it to be even,
> like
> read() does? Or just initialise fst->seq to zero?
>
Good catch. I think the easiest thing is to set fst->seq to zero.
> And is there really no devm-like helper which will free your
> fp->private_data for you on release()? That seems surprising.
>
I'm not aware of any such mechanism. It seems weird to me, though: how
would it know what kind of data I put there and what exactly it should
call to free it?
Also, it seems like we're good on the ABI and protocol between
hypervisor and guest, so I'll go ahead and post the QEMU patches so
people can test the Linux patches if they want.
Cheers,
Babis
> > ---
> > drivers/ptp/ptp_vmclock.c | 113
> > +++++++++++++++++++++++++++++--
> > include/uapi/linux/vmclock-abi.h | 5 ++
> > 2 files changed, 113 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
> > index b3a83b03d9c1..fa515375d54f 100644
> > --- a/drivers/ptp/ptp_vmclock.c
> > +++ b/drivers/ptp/ptp_vmclock.c
> > @@ -5,6 +5,9 @@
> > * Copyright © 2024 Amazon.com, Inc. or its affiliates.
> > */
> >
> > +#include "linux/poll.h"
> > +#include "linux/types.h"
> > +#include "linux/wait.h"
> > #include <linux/acpi.h>
> > #include <linux/device.h>
> > #include <linux/err.h>
> > @@ -39,6 +42,7 @@ struct vmclock_state {
> > struct resource res;
> > struct vmclock_abi *clk;
> > struct miscdevice miscdev;
> > + wait_queue_head_t disrupt_wait;
> > struct ptp_clock_info ptp_clock_info;
> > struct ptp_clock *ptp_clock;
> > enum clocksource_ids cs_id, sys_cs_id;
> > @@ -357,10 +361,15 @@ static struct ptp_clock
> > *vmclock_ptp_register(struct device *dev,
> > return ptp_clock_register(&st->ptp_clock_info, dev);
> > }
> >
> > +struct vmclock_file_state {
> > + struct vmclock_state *st;
> > + uint32_t seq;
> > +};
> > +
> > static int vmclock_miscdev_mmap(struct file *fp, struct
> > vm_area_struct *vma)
> > {
> > - struct vmclock_state *st = container_of(fp->private_data,
> > - struct
> > vmclock_state, miscdev);
> > + struct vmclock_file_state *fst = fp->private_data;
> > + struct vmclock_state *st = fst->st;
> >
> > if ((vma->vm_flags & (VM_READ|VM_WRITE)) != VM_READ)
> > return -EROFS;
> > @@ -379,8 +388,9 @@ static int vmclock_miscdev_mmap(struct file *fp, struct vm_area_struct *vma)
> > static ssize_t vmclock_miscdev_read(struct file *fp, char __user *buf,
> >                                     size_t count, loff_t *ppos)
> > {
> > - struct vmclock_state *st = container_of(fp->private_data,
> > -                                         struct vmclock_state, miscdev);
> > + struct vmclock_file_state *fst = fp->private_data;
> > + struct vmclock_state *st = fst->st;
> > +
> > ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
> > size_t max_count;
> > uint32_t seq;
> > @@ -402,8 +412,10 @@ static ssize_t vmclock_miscdev_read(struct file *fp, char __user *buf,
> >
> > /* Pairs with hypervisor wmb */
> > virt_rmb();
> > - if (seq == le32_to_cpu(st->clk->seq_count))
> > + if (seq == le32_to_cpu(st->clk->seq_count)) {
> > + fst->seq = seq;
> > break;
> > + }
> >
> > if (ktime_after(ktime_get(), deadline))
> > return -ETIMEDOUT;
> > @@ -413,10 +425,58 @@ static ssize_t vmclock_miscdev_read(struct file *fp, char __user *buf,
> > return count;
> > }
> >
> > +static __poll_t vmclock_miscdev_poll(struct file *fp, poll_table *wait)
> > +{
> > + struct vmclock_file_state *fst = fp->private_data;
> > + struct vmclock_state *st = fst->st;
> > + uint32_t seq;
> > +
> > + /*
> > + * The hypervisor will not send us any notifications, so fail
> > + * immediately to avoid having the caller sleep forever.
> > + */
> > + if (!(st->clk->flags & VMCLOCK_FLAG_NOTIFICATION_PRESENT))
> > + return POLLHUP;
> > +
> > + poll_wait(fp, &st->disrupt_wait, wait);
> > +
> > + seq = le32_to_cpu(st->clk->seq_count);
> > + if (fst->seq != seq)
> > + return POLLIN | POLLRDNORM;
> > +
> > + return 0;
> > +}
> > +
> > +static int vmclock_miscdev_open(struct inode *inode, struct file *fp)
> > +{
> > + struct vmclock_state *st = container_of(fp->private_data,
> > +                                         struct vmclock_state, miscdev);
> > + struct vmclock_file_state *fst = kzalloc(sizeof(*fst), GFP_KERNEL);
> > +
> > + if (!fst)
> > + return -ENOMEM;
> > +
> > + fst->st = st;
> > + fst->seq = le32_to_cpu(st->clk->seq_count);
> > +
> > + fp->private_data = fst;
> > +
> > + return 0;
> > +}
> > +
> > +static int vmclock_miscdev_release(struct inode *inode, struct file *fp)
> > +{
> > + kfree(fp->private_data);
> > + return 0;
> > +}
> > +
> > static const struct file_operations vmclock_miscdev_fops = {
> > .owner = THIS_MODULE,
> > + .open = vmclock_miscdev_open,
> > + .release = vmclock_miscdev_release,
> > .mmap = vmclock_miscdev_mmap,
> > .read = vmclock_miscdev_read,
> > + .poll = vmclock_miscdev_poll,
> > };
> >
> > /* module operations */
> > @@ -459,6 +519,44 @@ static acpi_status vmclock_acpi_resources(struct acpi_resource *ares, void *data
> > return AE_ERROR;
> > }
> >
> > +static void
> > +vmclock_acpi_notification_handler(acpi_handle __always_unused handle,
> > +                                  u32 __always_unused event, void *dev)
> > +{
> > + struct device *device = dev;
> > + struct vmclock_state *st = device->driver_data;
> > +
> > + wake_up_interruptible(&st->disrupt_wait);
> > +}
> > +
> > +static int vmclock_setup_notification(struct device *dev, struct vmclock_state *st)
> > +{
> > + struct acpi_device *adev = ACPI_COMPANION(dev);
> > + acpi_status status;
> > +
> > + /*
> > + * This should never happen as this function is only called when
> > + * has_acpi_companion(dev) is true, but the logic is sufficiently
> > + * complex that Coverity can't see the tautology.
> > + */
> > + if (!adev)
> > + return -ENODEV;
> > +
> > + /* The device does not support notifications. Nothing else to do */
> > + if (!(le64_to_cpu(st->clk->flags) & VMCLOCK_FLAG_NOTIFICATION_PRESENT))
> > + return 0;
> > +
> > + status = acpi_install_notify_handler(adev->handle, ACPI_DEVICE_NOTIFY,
> > +                                      vmclock_acpi_notification_handler,
> > +                                      dev);
> > + if (ACPI_FAILURE(status)) {
> > + dev_err(dev, "failed to install notification handler");
> > + return -ENODEV;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > static int vmclock_probe_acpi(struct device *dev, struct vmclock_state *st)
> > {
> > struct acpi_device *adev = ACPI_COMPANION(dev);
> > @@ -549,6 +647,9 @@ static int vmclock_probe(struct platform_device *pdev)
> > if (ret)
> > return ret;
> >
> > + init_waitqueue_head(&st->disrupt_wait);
> > + vmclock_setup_notification(dev, st);
> > +
> > /*
> > * If the structure is big enough, it can be mapped to userspace.
> > * Theoretically a guest OS even using larger pages could still
> > @@ -581,6 +682,8 @@ static int vmclock_probe(struct platform_device *pdev)
> > return -ENODEV;
> > }
> >
> > + dev->driver_data = st;
> > +
> > dev_info(dev, "%s: registered %s%s%s\n", st->name,
> > st->miscdev.minor ? "miscdev" : "",
> > (st->miscdev.minor && st->ptp_clock) ? ", " : "",
> > diff --git a/include/uapi/linux/vmclock-abi.h b/include/uapi/linux/vmclock-abi.h
> > index 937fe00e4f33..d320623b0118 100644
> > --- a/include/uapi/linux/vmclock-abi.h
> > +++ b/include/uapi/linux/vmclock-abi.h
> > @@ -121,6 +121,11 @@ struct vmclock_abi {
> > * loaded from some save state (restored from a snapshot).
> > */
> > #define VMCLOCK_FLAG_VM_GEN_COUNTER_PRESENT (1 << 8)
> > + /*
> > + * If the NOTIFICATION_PRESENT flag is set, the hypervisor will send
> > + * a notification every time it updates seq_count to a new even number.
> > + */
> > +#define VMCLOCK_FLAG_NOTIFICATION_PRESENT (1 << 9)
> >
> > __u8 pad[2];
> > __u8 clock_status;
>
* Re: [PATCH 2/2] ptp: vmclock: support device notifications
2025-12-01 12:15 ` Babis Chalios
@ 2025-12-02 9:50 ` David Woodhouse
0 siblings, 0 replies; 7+ messages in thread
From: David Woodhouse @ 2025-12-02 9:50 UTC (permalink / raw)
To: Babis Chalios, richardcochran@gmail.com, andrew+netdev@lunn.ch,
davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
pabeni@redhat.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
Cc: Graf (AWS), Alexander, mzxreary@0pointer.de
On Mon, 2025-12-01 at 13:15 +0100, Babis Chalios wrote:
> On Fri, 2025-11-28 at 13:00 +0000, David Woodhouse wrote:
> > Generally looks sane to me, thanks.
> >
> > I haven't given much brain power to whether POLLHUP is the right
> > thing to return when poll() is invalid; I guess you have.
> >
>
> I was looking at the possible alternatives. The semantics of POLLHUP
> seem to me the correct ones, i.e. the other end (the device) "closed its
> end of the channel", which is exactly what has happened.
Makes sense.
> > I also haven't looked hard into the locking on fst->seq which is
> > accessed during poll() and read(). Have you?
> >
>
> Hmmm, I hadn't considered it because I can't think of any reason why
> someone would be poll()ing and read()ing concurrently from different
> threads, but you're right, there's a race. I wonder though what the
> correct semantics are here. Imagine thread A is waiting (blocked in
> poll()) and thread B is reading from the same file descriptor while a
> notification arrives. I can see a scenario where this causes thread A to
> lose a notification (and consequently a missed chance to act on
> disruption_marker and vm_generation_counter changes). What do you
> think?
There's a theoretical race condition even with read(), where you set
fst->seq to the seqno you've just read. As long as they're using
pread() and not serialised by f_pos_lock, I think that two threads can
be reading at about the same time and see *different* seqnos, for
example if they happen just before/after a LM event. Then, with all the
steal time and preemption, it's random *which* one of them gets to set
fst->seq first, and which one comes afterwards.
So even the setting of fst->seq wants to use an atomic_t and/or cmpxchg
to ensure it's only ever moving forwards.
Then it's either a READ_ONCE() or atomic_read() in poll().
> > Your vmclock_setup_notification() function can return error, but you
> > ignore that. Which *might* have been intentional, to allow the device
> > to be used even without notifications if something goes wrong... but
> > then the condition for poll() returning POLLHUP is wrong, because that
> > only checks the flag that the hypervisor set, and not whether
> > notifications are *actually* working.
>
> This is just a bug. My intention is to mark the device probing as
> failed if the device has advertised support for notifications **and**
> setting up the notification handler failed.
OK.
> > In open() you simply read seq_count from the vmclock structure but it
> > might be odd at that point. Do we want to wait for it to be even,
> > like read() does? Or just initialise fst->seq to zero?
> >
>
> Good catch. I think the easiest thing is to set fst->seq to zero.
>
> > And is there really no devm-like helper which will free your
> > fp->private_data for you on release()? That seems surprising.
> >
>
> I'm not aware of any such mechanism. It seems weird to me though: how
> would it know what kind of data I put there, and what exactly it should
> call to free it?
>
In fs/ alone there seem to be 11 examples of release() functions which
do nothing but kfree(file->private_data); return 0;.
Maybe there *should* be such a helper?
int simple_release(struct inode *inode, struct file *file)
{
        kfree(file->private_data);
        return 0;
}
EXPORT_SYMBOL(simple_release);
2025-11-27 10:32 [PATCH 0/2] ptp: vmclock: Add VM generation counter and ACPI notification Chalios, Babis
2025-11-27 10:32 ` [PATCH 1/2] ptp: vmclock: add vm generation counter Chalios, Babis
2025-11-28 12:36 ` David Woodhouse
2025-11-27 10:32 ` [PATCH 2/2] ptp: vmclock: support device notifications Chalios, Babis
2025-11-28 13:00 ` David Woodhouse
2025-12-01 12:15 ` Babis Chalios
2025-12-02 9:50 ` David Woodhouse