* [PATCH net-next v8 1/3] gve: skip error logging for retryable AdminQ commands
2026-05-14 22:58 [PATCH net-next v8 0/3] gve: add support for PTP gettimex64 Harshitha Ramamurthy
@ 2026-05-14 22:58 ` Harshitha Ramamurthy
2026-05-14 22:58 ` [PATCH net-next v8 2/3] gve: make nic clock reads thread safe Harshitha Ramamurthy
` (2 subsequent siblings)
3 siblings, 0 replies; 19+ messages in thread
From: Harshitha Ramamurthy @ 2026-05-14 22:58 UTC (permalink / raw)
To: netdev
Cc: joshwash, hramamurthy, andrew+netdev, davem, edumazet, kuba,
pabeni, richardcochran, jstultz, tglx, sboyd, willemb, nktgrg,
jfraker, ziweixiao, maolson, jordanrhee, thostet, alok.a.tiwari,
pkaligineedi, horms, dwmw2, jacob.e.keller, yyd, jefrogers,
linux-kernel
From: Jordan Rhee <jordanrhee@google.com>
AdminQ commands may return -EAGAIN under certain transient conditions.
These commands are intended to be retried by the driver, so logging
a formal error to the system log is misleading and creates
unnecessary noise.
Modify the logging logic to skip the error message when the result
is -EAGAIN, and move logging to dev_err_ratelimited() to avoid
spamming the log.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Joshua Washington <joshwash@google.com>
Signed-off-by: Jordan Rhee <jordanrhee@google.com>
Signed-off-by: Harshitha Ramamurthy <hramamurthy@google.com>
---
Changes in v8:
- Fixed opcode extraction for extended commands (Sashiko)
Changes in v7:
- fixed type of err to be int instead of u32 (Sashiko)
- Picked up Jake Keller's review tag
Changes in v4:
- call out change to dev_err_ratelimited() in the commit message (Jacob Keller)
- remove extra print when adminQ status is GVE_ADMINQ_COMMAND_UNSET (Jacob Keller)
---
---
drivers/net/ethernet/google/gve/gve_adminq.c | 41 ++++++++++++++++----
1 file changed, 33 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/google/gve/gve_adminq.c b/drivers/net/ethernet/google/gve/gve_adminq.c
index 08587bf40ed4..0d5a67523cba 100644
--- a/drivers/net/ethernet/google/gve/gve_adminq.c
+++ b/drivers/net/ethernet/google/gve/gve_adminq.c
@@ -416,16 +416,10 @@ static bool gve_adminq_wait_for_cmd(struct gve_priv *priv, u32 prod_cnt)
static int gve_adminq_parse_err(struct gve_priv *priv, u32 status)
{
- if (status != GVE_ADMINQ_COMMAND_PASSED &&
- status != GVE_ADMINQ_COMMAND_UNSET) {
- dev_err(&priv->pdev->dev, "AQ command failed with status %d\n", status);
- priv->adminq_cmd_fail++;
- }
switch (status) {
case GVE_ADMINQ_COMMAND_PASSED:
return 0;
case GVE_ADMINQ_COMMAND_UNSET:
- dev_err(&priv->pdev->dev, "parse_aq_err: err and status both unset, this should not be possible.\n");
return -EINVAL;
case GVE_ADMINQ_COMMAND_ERROR_ABORTED:
case GVE_ADMINQ_COMMAND_ERROR_CANCELLED:
@@ -455,6 +449,27 @@ static int gve_adminq_parse_err(struct gve_priv *priv, u32 status)
}
}
+static bool gve_adminq_is_retryable(enum gve_adminq_opcodes opcode)
+{
+ switch (opcode) {
+ case GVE_ADMINQ_REPORT_NIC_TIMESTAMP:
+ return true;
+ default:
+ return false;
+ }
+}
+
+static enum gve_adminq_opcodes gve_extract_opcode(union gve_adminq_command *cmd)
+{
+ u32 opcode;
+
+ opcode = be32_to_cpu(READ_ONCE(cmd->opcode));
+ if (opcode == GVE_ADMINQ_EXTENDED_COMMAND)
+ opcode = be32_to_cpu(cmd->extended_command.inner_opcode);
+
+ return opcode;
+}
+
/* Flushes all AQ commands currently queued and waits for them to complete.
* If there are failures, it will return the first error.
*/
@@ -477,14 +492,24 @@ static int gve_adminq_kick_and_wait(struct gve_priv *priv)
for (i = tail; i < head; i++) {
union gve_adminq_command *cmd;
- u32 status, err;
+ u32 status;
+ int err;
cmd = &priv->adminq[i & priv->adminq_mask];
status = be32_to_cpu(READ_ONCE(cmd->status));
err = gve_adminq_parse_err(priv, status);
- if (err)
+ if (err) {
+ enum gve_adminq_opcodes opcode = gve_extract_opcode(cmd);
+
+ priv->adminq_cmd_fail++;
+ if (!gve_adminq_is_retryable(opcode) || err != -EAGAIN)
+ dev_err_ratelimited(&priv->pdev->dev,
+ "AQ command %d failed with status %d\n",
+ opcode, status);
+
// Return the first error if we failed.
return err;
+ }
}
return 0;
--
2.54.0.563.g4f69b47b94-goog
^ permalink raw reply related [flat|nested] 19+ messages in thread* [PATCH net-next v8 2/3] gve: make nic clock reads thread safe
2026-05-14 22:58 [PATCH net-next v8 0/3] gve: add support for PTP gettimex64 Harshitha Ramamurthy
2026-05-14 22:58 ` [PATCH net-next v8 1/3] gve: skip error logging for retryable AdminQ commands Harshitha Ramamurthy
@ 2026-05-14 22:58 ` Harshitha Ramamurthy
2026-05-14 22:58 ` [PATCH net-next v8 3/3] gve: implement PTP gettimex64 Harshitha Ramamurthy
2026-05-20 1:30 ` [PATCH net-next v8 0/3] gve: add support for " patchwork-bot+netdevbpf
3 siblings, 0 replies; 19+ messages in thread
From: Harshitha Ramamurthy @ 2026-05-14 22:58 UTC (permalink / raw)
To: netdev
Cc: joshwash, hramamurthy, andrew+netdev, davem, edumazet, kuba,
pabeni, richardcochran, jstultz, tglx, sboyd, willemb, nktgrg,
jfraker, ziweixiao, maolson, jordanrhee, thostet, alok.a.tiwari,
pkaligineedi, horms, dwmw2, jacob.e.keller, yyd, jefrogers,
linux-kernel
From: Ankit Garg <nktgrg@google.com>
Add a mutex to protect the shared DMA buffer that receives NIC
timestamp reports. The NIC timestamp will be read from two different
threads: the periodic worker and upcoming `gettimex64`.
Move clock registration to the last step of initialization to ensure
that all data needed by the clock module is initialized before
the clock is exposed to usermode.
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Joshua Washington <joshwash@google.com>
Signed-off-by: Ankit Garg <nktgrg@google.com>
Signed-off-by: Jordan Rhee <jordanrhee@google.com>
Signed-off-by: Harshitha Ramamurthy <hramamurthy@google.com>
---
Changes in v7:
- Picked up Jake Keller's review tag
Changes in v3:
- Reorder init/teardown to register PTP clock last, and simplify code
- Move ptp-related members from gve_priv to gve_ptp
- Only assign priv->ptp after ptp module is successfully initialized
---
---
drivers/net/ethernet/google/gve/gve.h | 12 +-
drivers/net/ethernet/google/gve/gve_ethtool.c | 3 +-
drivers/net/ethernet/google/gve/gve_ptp.c | 134 ++++++++----------
3 files changed, 63 insertions(+), 86 deletions(-)
diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
index 1d66d3834f7e..7b69d0cfc0d5 100644
--- a/drivers/net/ethernet/google/gve/gve.h
+++ b/drivers/net/ethernet/google/gve/gve.h
@@ -792,6 +792,9 @@ struct gve_ptp {
struct ptp_clock_info info;
struct ptp_clock *clock;
struct gve_priv *priv;
+ struct mutex nic_ts_read_lock; /* Protects nic_ts_report */
+ struct gve_nic_ts_report *nic_ts_report;
+ dma_addr_t nic_ts_report_bus;
};
struct gve_priv {
@@ -923,8 +926,6 @@ struct gve_priv {
bool nic_timestamp_supported;
struct gve_ptp *ptp;
struct kernel_hwtstamp_config ts_config;
- struct gve_nic_ts_report *nic_ts_report;
- dma_addr_t nic_ts_report_bus;
u64 last_sync_nic_counter; /* Clock counter from last NIC TS report */
};
@@ -1201,7 +1202,7 @@ static inline bool gve_supports_xdp_xmit(struct gve_priv *priv)
static inline bool gve_is_clock_enabled(struct gve_priv *priv)
{
- return priv->nic_ts_report;
+ return priv->ptp;
}
/* gqi napi handler defined in gve_main.c */
@@ -1321,14 +1322,9 @@ int gve_flow_rules_reset(struct gve_priv *priv);
int gve_init_rss_config(struct gve_priv *priv, u16 num_queues);
/* PTP and timestamping */
#if IS_ENABLED(CONFIG_PTP_1588_CLOCK)
-int gve_clock_nic_ts_read(struct gve_priv *priv);
int gve_init_clock(struct gve_priv *priv);
void gve_teardown_clock(struct gve_priv *priv);
#else /* CONFIG_PTP_1588_CLOCK */
-static inline int gve_clock_nic_ts_read(struct gve_priv *priv)
-{
- return -EOPNOTSUPP;
-}
static inline int gve_init_clock(struct gve_priv *priv)
{
diff --git a/drivers/net/ethernet/google/gve/gve_ethtool.c b/drivers/net/ethernet/google/gve/gve_ethtool.c
index dc2213b5ce24..4fd7e8a442c5 100644
--- a/drivers/net/ethernet/google/gve/gve_ethtool.c
+++ b/drivers/net/ethernet/google/gve/gve_ethtool.c
@@ -972,8 +972,7 @@ static int gve_get_ts_info(struct net_device *netdev,
info->rx_filters |= BIT(HWTSTAMP_FILTER_NONE) |
BIT(HWTSTAMP_FILTER_ALL);
- if (priv->ptp)
- info->phc_index = ptp_clock_index(priv->ptp->clock);
+ info->phc_index = ptp_clock_index(priv->ptp->clock);
}
return 0;
diff --git a/drivers/net/ethernet/google/gve/gve_ptp.c b/drivers/net/ethernet/google/gve/gve_ptp.c
index 06b1cf4a5efc..ad15f1209a83 100644
--- a/drivers/net/ethernet/google/gve/gve_ptp.c
+++ b/drivers/net/ethernet/google/gve/gve_ptp.c
@@ -11,19 +11,20 @@
#define GVE_NIC_TS_SYNC_INTERVAL_MS 250
/* Read the nic timestamp from hardware via the admin queue. */
-int gve_clock_nic_ts_read(struct gve_priv *priv)
+static int gve_clock_nic_ts_read(struct gve_ptp *ptp, u64 *nic_raw)
{
- u64 nic_raw;
int err;
- err = gve_adminq_report_nic_ts(priv, priv->nic_ts_report_bus);
+ mutex_lock(&ptp->nic_ts_read_lock);
+ err = gve_adminq_report_nic_ts(ptp->priv, ptp->nic_ts_report_bus);
if (err)
- return err;
+ goto out;
- nic_raw = be64_to_cpu(priv->nic_ts_report->nic_timestamp);
- WRITE_ONCE(priv->last_sync_nic_counter, nic_raw);
+ *nic_raw = be64_to_cpu(ptp->nic_ts_report->nic_timestamp);
- return 0;
+out:
+ mutex_unlock(&ptp->nic_ts_read_lock);
+ return err;
}
static int gve_ptp_gettimex64(struct ptp_clock_info *info,
@@ -41,17 +42,21 @@ static int gve_ptp_settime64(struct ptp_clock_info *info,
static long gve_ptp_do_aux_work(struct ptp_clock_info *info)
{
- const struct gve_ptp *ptp = container_of(info, struct gve_ptp, info);
+ struct gve_ptp *ptp = container_of(info, struct gve_ptp, info);
struct gve_priv *priv = ptp->priv;
+ u64 nic_raw;
int err;
if (gve_get_reset_in_progress(priv) || !gve_get_admin_queue_ok(priv))
goto out;
- err = gve_clock_nic_ts_read(priv);
- if (err && net_ratelimit())
- dev_err(&priv->pdev->dev,
- "%s read err %d\n", __func__, err);
+ err = gve_clock_nic_ts_read(ptp, &nic_raw);
+ if (err) {
+ dev_err_ratelimited(&priv->pdev->dev, "%s read err %d\n",
+ __func__, err);
+ goto out;
+ }
+ WRITE_ONCE(priv->last_sync_nic_counter, nic_raw);
out:
return msecs_to_jiffies(GVE_NIC_TS_SYNC_INTERVAL_MS);
@@ -65,94 +70,71 @@ static const struct ptp_clock_info gve_ptp_caps = {
.do_aux_work = gve_ptp_do_aux_work,
};
-static int gve_ptp_init(struct gve_priv *priv)
+int gve_init_clock(struct gve_priv *priv)
{
struct gve_ptp *ptp;
+ u64 nic_raw;
int err;
- priv->ptp = kzalloc_obj(*priv->ptp);
- if (!priv->ptp)
+ ptp = kzalloc_obj(*priv->ptp);
+ if (!ptp)
return -ENOMEM;
- ptp = priv->ptp;
ptp->info = gve_ptp_caps;
- ptp->clock = ptp_clock_register(&ptp->info, &priv->pdev->dev);
-
- if (IS_ERR(ptp->clock)) {
- dev_err(&priv->pdev->dev, "PTP clock registration failed\n");
- err = PTR_ERR(ptp->clock);
- goto free_ptp;
- }
-
ptp->priv = priv;
- return 0;
-
-free_ptp:
- kfree(ptp);
- priv->ptp = NULL;
- return err;
-}
-
-static void gve_ptp_release(struct gve_priv *priv)
-{
- struct gve_ptp *ptp = priv->ptp;
-
- if (!ptp)
- return;
-
- if (ptp->clock)
- ptp_clock_unregister(ptp->clock);
-
- kfree(ptp);
- priv->ptp = NULL;
-}
-
-int gve_init_clock(struct gve_priv *priv)
-{
- int err;
-
- err = gve_ptp_init(priv);
- if (err)
- return err;
-
- priv->nic_ts_report =
+ mutex_init(&ptp->nic_ts_read_lock);
+ ptp->nic_ts_report =
dma_alloc_coherent(&priv->pdev->dev,
sizeof(struct gve_nic_ts_report),
- &priv->nic_ts_report_bus,
- GFP_KERNEL);
- if (!priv->nic_ts_report) {
+ &ptp->nic_ts_report_bus, GFP_KERNEL);
+ if (!ptp->nic_ts_report) {
dev_err(&priv->pdev->dev, "%s dma alloc error\n", __func__);
err = -ENOMEM;
- goto release_ptp;
+ goto free_ptp;
}
- err = gve_clock_nic_ts_read(priv);
+
+ err = gve_clock_nic_ts_read(ptp, &nic_raw);
if (err) {
dev_err(&priv->pdev->dev, "failed to read NIC clock %d\n", err);
- goto release_nic_ts_report;
+ goto free_dma_mem;
}
- ptp_schedule_worker(priv->ptp->clock,
+ WRITE_ONCE(priv->last_sync_nic_counter, nic_raw);
+
+ ptp->clock = ptp_clock_register(&ptp->info, &priv->pdev->dev);
+ if (IS_ERR(ptp->clock)) {
+ dev_err(&priv->pdev->dev, "PTP clock registration failed\n");
+ err = PTR_ERR(ptp->clock);
+ goto free_dma_mem;
+ }
+
+ priv->ptp = ptp;
+ ptp_schedule_worker(ptp->clock,
msecs_to_jiffies(GVE_NIC_TS_SYNC_INTERVAL_MS));
return 0;
-release_nic_ts_report:
- dma_free_coherent(&priv->pdev->dev,
- sizeof(struct gve_nic_ts_report),
- priv->nic_ts_report, priv->nic_ts_report_bus);
- priv->nic_ts_report = NULL;
-release_ptp:
- gve_ptp_release(priv);
+free_dma_mem:
+ dma_free_coherent(&priv->pdev->dev, sizeof(struct gve_nic_ts_report),
+ ptp->nic_ts_report, ptp->nic_ts_report_bus);
+ ptp->nic_ts_report = NULL;
+free_ptp:
+ mutex_destroy(&ptp->nic_ts_read_lock);
+ kfree(ptp);
return err;
}
void gve_teardown_clock(struct gve_priv *priv)
{
- gve_ptp_release(priv);
+ struct gve_ptp *ptp = priv->ptp;
- if (priv->nic_ts_report) {
- dma_free_coherent(&priv->pdev->dev,
- sizeof(struct gve_nic_ts_report),
- priv->nic_ts_report, priv->nic_ts_report_bus);
- priv->nic_ts_report = NULL;
- }
+ if (!ptp)
+ return;
+
+ priv->ptp = NULL;
+ ptp_clock_unregister(ptp->clock);
+ dma_free_coherent(&priv->pdev->dev, sizeof(struct gve_nic_ts_report),
+ ptp->nic_ts_report, ptp->nic_ts_report_bus);
+ ptp->nic_ts_report = NULL;
+ mutex_destroy(&ptp->nic_ts_read_lock);
+ kfree(ptp);
}
--
2.54.0.563.g4f69b47b94-goog
^ permalink raw reply related [flat|nested] 19+ messages in thread* [PATCH net-next v8 3/3] gve: implement PTP gettimex64
2026-05-14 22:58 [PATCH net-next v8 0/3] gve: add support for PTP gettimex64 Harshitha Ramamurthy
2026-05-14 22:58 ` [PATCH net-next v8 1/3] gve: skip error logging for retryable AdminQ commands Harshitha Ramamurthy
2026-05-14 22:58 ` [PATCH net-next v8 2/3] gve: make nic clock reads thread safe Harshitha Ramamurthy
@ 2026-05-14 22:58 ` Harshitha Ramamurthy
2026-05-20 21:08 ` Thomas Gleixner
2026-05-21 11:43 ` David Woodhouse
2026-05-20 1:30 ` [PATCH net-next v8 0/3] gve: add support for " patchwork-bot+netdevbpf
3 siblings, 2 replies; 19+ messages in thread
From: Harshitha Ramamurthy @ 2026-05-14 22:58 UTC (permalink / raw)
To: netdev
Cc: joshwash, hramamurthy, andrew+netdev, davem, edumazet, kuba,
pabeni, richardcochran, jstultz, tglx, sboyd, willemb, nktgrg,
jfraker, ziweixiao, maolson, jordanrhee, thostet, alok.a.tiwari,
pkaligineedi, horms, dwmw2, jacob.e.keller, yyd, jefrogers,
linux-kernel, Naman Gulati
From: Jordan Rhee <jordanrhee@google.com>
Enable chrony and phc2sys to synchronize system clock to NIC clock.
Two paths are implemented: a precise path using system counter values
sampled by the device, and a fallback path using system counter values
sampled in the driver using ptp_read_system_prets()/postts().
To use the precise path, the current system clocksource must match the
units returned by the device, which on x86 is X86_TSC and on ARM64 is
ARM_ARCH_COUNTER. The clockid requested for the cross-timestamp must
be either CLOCK_REALTIME or CLOCK_MONOTONIC_RAW. These conditions hold
by default on GCP VMs using Chrony, so we expect the precise path to be
used the vast majority of the time. If the system clocksource is changed
to kvm-clock, it activates the fallback path. Ethtool counters have been
added to count how many times each path is used.
The uncertainty window in the precise path is typically around 1-2us,
while in the fallback path is around 60-80us.
Stub implementions of adjfine and adjtime are added to avoid NULL
dereference when phc2sys tries to adjust the clock.
Cc: John Stultz <jstultz@google.com>
Cc: Thomas Gleixner <tglx@kernel.org>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Kevin Yang <yyd@google.com>
Reviewed-by: Naman Gulati <namangulati@google.com>
Signed-off-by: Jordan Rhee <jordanrhee@google.com>
Signed-off-by: Harshitha Ramamurthy <hramamurthy@google.com>
---
Changes in v7:
- Actually committed the stubs for adjtime and adjfine. (Sashiko)
- Picked up Jake Keller's review tag
Changes in v6:
- Added a fallback to driver-sampled time sandwich that is used when
the following conditions are not met:
- The system clock source is X86_TSC or ARM_ARCH_COUNTER
- The requested clockid is CLOCK_REALTIME or CLOCK_MONOTONIC_RAW
- The architecture is x86 or ARM64
- Added ethtool statistics to count how many cross-timestamps used the
precise path versus fallback path.
- Fixed printf format specifier.
- Added stub implementions of adjtime and adjfine to prevent NULL
dereference when phc2sys tries to adjust clock.
- Moved system time snapshot back to gve_ptp_gettimex64() so we can get the
current system clock source from it. It is OK for it to not be inside
the mutex or retry loop because lock contention and retries should be
extremely rare, and chrony filters out bad samples.
Changes in v5:
- Reformulate retry loop in terms of total timeout (Jakub Kicinski)
Changes in v3:
- Take system time snapshot inside the mutex
- Return -EOPNOTSUPP if cross-timestamp is requested on an arch other
than x86 or arm64
Changes in v2:
- fix compilation warning on ARM by casting cycles_t to u64
---
---
drivers/net/ethernet/google/gve/gve.h | 8 +
drivers/net/ethernet/google/gve/gve_adminq.h | 4 +-
drivers/net/ethernet/google/gve/gve_ethtool.c | 3 +
drivers/net/ethernet/google/gve/gve_ptp.c | 249 +++++++++++++++++-
4 files changed, 255 insertions(+), 9 deletions(-)
diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
index 7b69d0cfc0d5..4de3ce60060e 100644
--- a/drivers/net/ethernet/google/gve/gve.h
+++ b/drivers/net/ethernet/google/gve/gve.h
@@ -880,6 +880,14 @@ struct gve_priv {
u32 stats_report_trigger_cnt; /* count of device-requested stats-reports since last reset */
u32 suspend_cnt; /* count of times suspended */
u32 resume_cnt; /* count of times resumed */
+ /* count of cross-timestamps attempted using system timestamps
+ * from the AQ command
+ */
+ u32 ptp_precise_xtstamps;
+ /* count of cross-timestamps attempted using system timestamps sampled
+ * by the driver
+ */
+ u32 ptp_fallback_xtstamps;
struct workqueue_struct *gve_wq;
struct work_struct service_task;
struct work_struct stats_report_task;
diff --git a/drivers/net/ethernet/google/gve/gve_adminq.h b/drivers/net/ethernet/google/gve/gve_adminq.h
index 22a74b6aa17e..e6dcf6da9091 100644
--- a/drivers/net/ethernet/google/gve/gve_adminq.h
+++ b/drivers/net/ethernet/google/gve/gve_adminq.h
@@ -411,8 +411,8 @@ static_assert(sizeof(struct gve_adminq_report_nic_ts) == 16);
struct gve_nic_ts_report {
__be64 nic_timestamp; /* NIC clock in nanoseconds */
- __be64 reserved1;
- __be64 reserved2;
+ __be64 pre_cycles; /* System cycle counter before NIC clock read */
+ __be64 post_cycles; /* System cycle counter after NIC clock read */
__be64 reserved3;
__be64 reserved4;
};
diff --git a/drivers/net/ethernet/google/gve/gve_ethtool.c b/drivers/net/ethernet/google/gve/gve_ethtool.c
index 4fd7e8a442c5..8a088dcc3603 100644
--- a/drivers/net/ethernet/google/gve/gve_ethtool.c
+++ b/drivers/net/ethernet/google/gve/gve_ethtool.c
@@ -46,6 +46,7 @@ static const char gve_gstrings_main_stats[][ETH_GSTRING_LEN] = {
"rx_hsplit_unsplit_pkt",
"interface_up_cnt", "interface_down_cnt", "reset_cnt",
"page_alloc_fail", "dma_mapping_error", "stats_report_trigger_cnt",
+ "ptp_precise_xtstamps", "ptp_fallback_xtstamps",
};
static const char gve_gstrings_rx_stats[][ETH_GSTRING_LEN] = {
@@ -269,6 +270,8 @@ gve_get_ethtool_stats(struct net_device *netdev,
data[i++] = priv->page_alloc_fail;
data[i++] = priv->dma_mapping_error;
data[i++] = priv->stats_report_trigger_cnt;
+ data[i++] = priv->ptp_precise_xtstamps;
+ data[i++] = priv->ptp_fallback_xtstamps;
i = GVE_MAIN_STATS_LEN;
rx_base_stats_idx = 0;
diff --git a/drivers/net/ethernet/google/gve/gve_ptp.c b/drivers/net/ethernet/google/gve/gve_ptp.c
index ad15f1209a83..bc230e68eb1d 100644
--- a/drivers/net/ethernet/google/gve/gve_ptp.c
+++ b/drivers/net/ethernet/google/gve/gve_ptp.c
@@ -10,28 +10,261 @@
/* Interval to schedule a nic timestamp calibration, 250ms. */
#define GVE_NIC_TS_SYNC_INTERVAL_MS 250
+/*
+ * Stores cycle counter samples in get_cycles() units from a
+ * sandwiched NIC clock read
+ */
+struct gve_sysclock_sample {
+ /* Cycle counter from NIC before clock read */
+ u64 nic_pre_cycles;
+ /* Cycle counter from NIC after clock read */
+ u64 nic_post_cycles;
+ /* Cycle counter from host before issuing AQ command */
+ cycles_t host_pre_cycles;
+ /* Cycle counter from host after AQ command returns */
+ cycles_t host_post_cycles;
+};
+
+/*
+ * Read NIC clock by issuing the AQ command. The command is subject to
+ * rate limiting and may need to be retried. Requires nic_ts_read_lock
+ * to be held.
+ */
+static int gve_ptp_read_timestamp(struct gve_ptp *ptp, cycles_t *pre_cycles,
+ cycles_t *post_cycles)
+{
+ unsigned long deadline = jiffies + msecs_to_jiffies(100);
+ unsigned long delay_us = 1000;
+ int err;
+
+ lockdep_assert_held(&ptp->nic_ts_read_lock);
+
+ do {
+ *pre_cycles = get_cycles();
+ err = gve_adminq_report_nic_ts(ptp->priv,
+ ptp->nic_ts_report_bus);
+
+ /* Prevent get_cycles() from being speculatively executed
+ * before the AdminQ command
+ */
+ rmb();
+ *post_cycles = get_cycles();
+ if (likely(err != -EAGAIN))
+ return err;
+
+ fsleep(delay_us);
+
+ /* Exponential backoff */
+ delay_us *= 2;
+ } while (time_before(jiffies, deadline));
+
+ return -ETIMEDOUT;
+}
+
/* Read the nic timestamp from hardware via the admin queue. */
-static int gve_clock_nic_ts_read(struct gve_ptp *ptp, u64 *nic_raw)
+static int gve_clock_nic_ts_read(struct gve_ptp *ptp, u64 *nic_raw,
+ struct gve_sysclock_sample *sysclock)
{
+ cycles_t host_pre_cycles, host_post_cycles;
+ struct gve_nic_ts_report *ts_report;
int err;
mutex_lock(&ptp->nic_ts_read_lock);
- err = gve_adminq_report_nic_ts(ptp->priv, ptp->nic_ts_report_bus);
- if (err)
+ err = gve_ptp_read_timestamp(ptp, &host_pre_cycles, &host_post_cycles);
+ if (err) {
+ dev_err_ratelimited(&ptp->priv->pdev->dev,
+ "AdminQ timestamp read failed: %d\n", err);
goto out;
+ }
+
+ ts_report = ptp->nic_ts_report;
+ *nic_raw = be64_to_cpu(ts_report->nic_timestamp);
- *nic_raw = be64_to_cpu(ptp->nic_ts_report->nic_timestamp);
+ if (sysclock) {
+ sysclock->nic_pre_cycles = be64_to_cpu(ts_report->pre_cycles);
+ sysclock->nic_post_cycles = be64_to_cpu(ts_report->post_cycles);
+ sysclock->host_pre_cycles = host_pre_cycles;
+ sysclock->host_post_cycles = host_post_cycles;
+ }
out:
mutex_unlock(&ptp->nic_ts_read_lock);
return err;
}
+struct gve_cycles_to_clock_callback_ctx {
+ u64 cycles;
+};
+
+static int gve_cycles_to_clock_fn(ktime_t *device_time,
+ struct system_counterval_t *system_counterval,
+ void *ctx)
+{
+ struct gve_cycles_to_clock_callback_ctx *context = ctx;
+
+ *device_time = 0;
+
+ system_counterval->cycles = context->cycles;
+ system_counterval->use_nsecs = false;
+
+ if (IS_ENABLED(CONFIG_X86))
+ system_counterval->cs_id = CSID_X86_TSC;
+ else if (IS_ENABLED(CONFIG_ARM64))
+ system_counterval->cs_id = CSID_ARM_ARCH_COUNTER;
+ else
+ return -EOPNOTSUPP;
+
+ return 0;
+}
+
+/*
+ * Convert a raw cycle count (e.g. from get_cycles()) to the system clock
+ * type specified by clockid. The system_time_snapshot must be taken before
+ * the cycle counter is sampled.
+ */
+static int gve_cycles_to_timespec64(struct gve_priv *priv, clockid_t clockid,
+ struct system_time_snapshot *snap,
+ u64 cycles, struct timespec64 *ts)
+{
+ struct gve_cycles_to_clock_callback_ctx ctx = {0};
+ struct system_device_crosststamp xtstamp;
+ int err;
+
+ ctx.cycles = cycles;
+ err = get_device_system_crosststamp(gve_cycles_to_clock_fn, &ctx, snap,
+ &xtstamp);
+ if (err) {
+ dev_err_ratelimited(&priv->pdev->dev,
+ "get_device_system_crosststamp() failed to convert %llu cycles to system time: %d\n",
+ cycles,
+ err);
+ return err;
+ }
+
+ switch (clockid) {
+ case CLOCK_REALTIME:
+ *ts = ktime_to_timespec64(xtstamp.sys_realtime);
+ break;
+ case CLOCK_MONOTONIC_RAW:
+ *ts = ktime_to_timespec64(xtstamp.sys_monoraw);
+ break;
+ default:
+ dev_err_ratelimited(&priv->pdev->dev,
+ "Cycle count conversion to clockid %d not supported\n",
+ clockid);
+ return -EOPNOTSUPP;
+ }
+
+ return 0;
+}
+
+static bool
+gve_can_use_system_ts_from_device(enum clocksource_ids system_clock_source,
+ clockid_t clockid)
+{
+ if (clockid != CLOCK_REALTIME && clockid != CLOCK_MONOTONIC_RAW)
+ return false;
+
+ /* If the system clock source matches the system clock
+ * returned by the AdminQ command, we can use the system
+ * timestamps returned by the device, otherwise we have to
+ * fall back to sampling system time from the host which
+ * is less accurate.
+ */
+ if (IS_ENABLED(CONFIG_X86))
+ return system_clock_source == CSID_X86_TSC;
+ else if (IS_ENABLED(CONFIG_ARM64))
+ return system_clock_source == CSID_ARM_ARCH_COUNTER;
+
+ return false;
+}
+
+static int gve_ptp_adjfine(struct ptp_clock_info *ptp, long scaled_ppm)
+{
+ return -EOPNOTSUPP;
+}
+
+static int gve_ptp_adjtime(struct ptp_clock_info *ptp, s64 delta)
+{
+ return -EOPNOTSUPP;
+}
+
static int gve_ptp_gettimex64(struct ptp_clock_info *info,
struct timespec64 *ts,
struct ptp_system_timestamp *sts)
{
- return -EOPNOTSUPP;
+ struct gve_ptp *ptp = container_of(info, struct gve_ptp, info);
+ struct gve_sysclock_sample sysclock = {0};
+ bool use_system_ts_from_device = false;
+ struct gve_priv *priv = ptp->priv;
+ struct system_time_snapshot snap;
+ u64 nic_ts;
+ int err;
+
+ if (sts) {
+ /* This snapshot is used both to query the current system
+ * clocksource and to convert the cycle counts returned
+ * by the AdminQ command to ktime. It does not need to be
+ * taken inside the retry loop because retries and lock
+ * contention are expected to be extremely rare.
+ *
+ * If the system clock source changes between here and
+ * when get_device_system_crosststamp() is called,
+ * get_device_system_crosststamp() will fail which will
+ * cause one failed sample, and the next one will succeed.
+ */
+ ktime_get_snapshot(&snap);
+ use_system_ts_from_device =
+ gve_can_use_system_ts_from_device(snap.cs_id,
+ sts->clockid);
+ if (use_system_ts_from_device)
+ priv->ptp_precise_xtstamps++;
+ else
+ priv->ptp_fallback_xtstamps++;
+ }
+
+ if (unlikely(!use_system_ts_from_device))
+ ptp_read_system_prets(sts);
+
+ err = gve_clock_nic_ts_read(ptp, &nic_ts, sts ? &sysclock : NULL);
+ if (err)
+ return err;
+
+ if (unlikely(!use_system_ts_from_device))
+ ptp_read_system_postts(sts);
+
+ if (sts && likely(use_system_ts_from_device)) {
+ /* Reject samples with out of order system clock values.
+ * Firmware must return valid non-zero cycle counts.
+ */
+ if (!(sysclock.host_pre_cycles <= sysclock.nic_pre_cycles &&
+ sysclock.nic_pre_cycles <= sysclock.nic_post_cycles &&
+ sysclock.nic_post_cycles <= sysclock.host_post_cycles)) {
+ dev_err_ratelimited(&priv->pdev->dev,
+ "AdminQ system clock cycle counts out of order. Expecting %llu <= %llu <= %llu <= %llu\n",
+ (u64)sysclock.host_pre_cycles,
+ sysclock.nic_pre_cycles,
+ sysclock.nic_post_cycles,
+ (u64)sysclock.host_post_cycles);
+ return -EBADMSG;
+ }
+
+ err = gve_cycles_to_timespec64(priv, sts->clockid, &snap,
+ sysclock.nic_pre_cycles,
+ &sts->pre_ts);
+ if (err)
+ return err;
+
+ err = gve_cycles_to_timespec64(priv, sts->clockid, &snap,
+ sysclock.nic_post_cycles,
+ &sts->post_ts);
+ if (err)
+ return err;
+ }
+
+ *ts = ns_to_timespec64(nic_ts);
+
+ return 0;
}
static int gve_ptp_settime64(struct ptp_clock_info *info,
@@ -50,7 +283,7 @@ static long gve_ptp_do_aux_work(struct ptp_clock_info *info)
if (gve_get_reset_in_progress(priv) || !gve_get_admin_queue_ok(priv))
goto out;
- err = gve_clock_nic_ts_read(ptp, &nic_raw);
+ err = gve_clock_nic_ts_read(ptp, &nic_raw, NULL);
if (err) {
dev_err_ratelimited(&priv->pdev->dev, "%s read err %d\n",
__func__, err);
@@ -65,6 +298,8 @@ static long gve_ptp_do_aux_work(struct ptp_clock_info *info)
static const struct ptp_clock_info gve_ptp_caps = {
.owner = THIS_MODULE,
.name = "gve clock",
+ .adjfine = gve_ptp_adjfine,
+ .adjtime = gve_ptp_adjtime,
.gettimex64 = gve_ptp_gettimex64,
.settime64 = gve_ptp_settime64,
.do_aux_work = gve_ptp_do_aux_work,
@@ -93,7 +328,7 @@ int gve_init_clock(struct gve_priv *priv)
goto free_ptp;
}
- err = gve_clock_nic_ts_read(ptp, &nic_raw);
+ err = gve_clock_nic_ts_read(ptp, &nic_raw, NULL);
if (err) {
dev_err(&priv->pdev->dev, "failed to read NIC clock %d\n", err);
goto free_dma_mem;
--
2.54.0.563.g4f69b47b94-goog
^ permalink raw reply related [flat|nested] 19+ messages in thread* Re: [PATCH net-next v8 3/3] gve: implement PTP gettimex64
2026-05-14 22:58 ` [PATCH net-next v8 3/3] gve: implement PTP gettimex64 Harshitha Ramamurthy
@ 2026-05-20 21:08 ` Thomas Gleixner
2026-05-20 22:16 ` Jakub Kicinski
` (2 more replies)
2026-05-21 11:43 ` David Woodhouse
1 sibling, 3 replies; 19+ messages in thread
From: Thomas Gleixner @ 2026-05-20 21:08 UTC (permalink / raw)
To: Harshitha Ramamurthy, netdev
Cc: joshwash, hramamurthy, andrew+netdev, davem, edumazet, kuba,
pabeni, richardcochran, jstultz, sboyd, willemb, nktgrg, jfraker,
ziweixiao, maolson, jordanrhee, thostet, alok.a.tiwari,
pkaligineedi, horms, dwmw2, jacob.e.keller, yyd, jefrogers,
linux-kernel, Naman Gulati
On Thu, May 14 2026 at 22:58, Harshitha Ramamurthy wrote:
> From: Jordan Rhee <jordanrhee@google.com>
>
> Enable chrony and phc2sys to synchronize system clock to NIC clock.
>
> Two paths are implemented: a precise path using system counter values
> sampled by the device, and a fallback path using system counter values
> sampled in the driver using ptp_read_system_prets()/postts().
>
> To use the precise path, the current system clocksource must match the
> units returned by the device, which on x86 is X86_TSC and on ARM64 is
> ARM_ARCH_COUNTER. The clockid requested for the cross-timestamp must
> be either CLOCK_REALTIME or CLOCK_MONOTONIC_RAW. These conditions hold
> by default on GCP VMs using Chrony, so we expect the precise path to be
> used the vast majority of the time. If the system clocksource is changed
> to kvm-clock, it activates the fallback path. Ethtool counters have been
> added to count how many times each path is used.
>
> The uncertainty window in the precise path is typically around 1-2us,
> while in the fallback path is around 60-80us.
>
> Stub implementions of adjfine and adjtime are added to avoid NULL
> dereference when phc2sys tries to adjust the clock.
Sorry that I did not react to this earlier, but I was burried in
regressions and allowed myself to take a few days off.
> +/*
> + * Read NIC clock by issuing the AQ command. The command is subject to
> + * rate limiting and may need to be retried. Requires nic_ts_read_lock
> + * to be held.
> + */
> +static int gve_ptp_read_timestamp(struct gve_ptp *ptp, cycles_t *pre_cycles,
> + cycles_t *post_cycles)
> +{
> + unsigned long deadline = jiffies + msecs_to_jiffies(100);
> + unsigned long delay_us = 1000;
> + int err;
> +
> + lockdep_assert_held(&ptp->nic_ts_read_lock);
> +
> + do {
> + *pre_cycles = get_cycles();
This is a completely non-sensical hack. Why do you need get_cycles()
here, which is a horrible and ill-defined interface that others are
trying to get rid of?
Just because it's still there is not a good answer and it's actually not
required at all. See below.
> + err = gve_adminq_report_nic_ts(ptp->priv,
> + ptp->nic_ts_report_bus);
> +
> + /* Prevent get_cycles() from being speculatively executed
> + * before the AdminQ command
> + */
> + rmb();
> + *post_cycles = get_cycles();
> + if (likely(err != -EAGAIN))
> + return err;
> +
> + fsleep(delay_us);
> +
> + /* Exponential backoff */
> + delay_us *= 2;
> + } while (time_before(jiffies, deadline));
> +
> + return -ETIMEDOUT;
> +}
> +
> /* Read the nic timestamp from hardware via the admin queue. */
> -static int gve_clock_nic_ts_read(struct gve_ptp *ptp, u64 *nic_raw)
> +static int gve_clock_nic_ts_read(struct gve_ptp *ptp, u64 *nic_raw,
> + struct gve_sysclock_sample *sysclock)
> {
> + cycles_t host_pre_cycles, host_post_cycles;
> + struct gve_nic_ts_report *ts_report;
> int err;
>
> mutex_lock(&ptp->nic_ts_read_lock);
Why is this lock taken _after_ the regular (fallback) pre/post
timestamps are aquired?
What's wrong with doing:
do {
guard(mutex)(&ptp->nic_ts_read_lock);
ptp_read_system_prets(sts);
err = read_nic_timestamp(...);
ptp_read_system_postts(sts);
} while (err && !timeout(...));
return err;
which moves the "fallback" timestamp right next to the actual hardware
readout?
That would make the "fallback" case too precise, right?
After noticing the obvious, I just checked what our new AI review
overlord "sashiko" has to say about this:
"Does wrapping the timestamp read outside the locks and retry loops skew the
measurements?
In the fallback path, ptp_read_system_prets() is captured before calling
gve_clock_nic_ts_read(), and postts() is captured after it returns. However,
gve_clock_nic_ts_read() acquires ptp->nic_ts_read_lock and calls
gve_ptp_read_timestamp(), which can sleep using fsleep() during retries. This
means the prets/postts window could include mutex wait times and sleep
durations."
It _IS_ actually on spot. So why the heck did this garbage get merged
without addressing this obvious fail?
What "sashiko" actually saw aside of that, which I didn't see myself when
looking at this, is:
"Similarly, in gve_ptp_read_timestamp(), the get_cycles() bounds are taken
outside the gve_adminq_report_nic_ts() call, which acquires the shared
priv->adminq_lock. If there is lock contention, the cycle bounds could become
extremely wide."
Oh well.
So what you really want is:
do {
guard(mutex)(&ptp->nic_ts_read_lock);
guard(mutex)(&gpe_priv->adminq_lock);
ptp_read_system_prets(sts);
err = read_nic_timestamp(...) {
...
return gve_adminq_execute_cmd_locked(....);
}
ptp_read_system_postts(sts);
} while (err && !timeout(...));
return err;
Or just rely on gpe_priv->adminq_lock in the first place. No?
But what's worse and escaped the AI scrutiny is:
Your attempt to convert these 'get_cycles()' time stamps after the
fact and outside of the timekeeping core protection is fundamentally
wrong.
get_device_system_crosststamp() does
do {
seq = read_seqcount_begin(&tk_core.seq);
take_timestamps(...);
convert_timestamps(...)
} while (read_seqcount_retry(&tk_core.seq, seq));
which covers the whole sequence of taking the snapshots from the
hardware and the conversion.
Your abuse of this is completely out of spec. Just because it did not
blow up in your face yet does not make it in any way correct. You do not
even get bonus points for creative abuse.
Now let me look at the "firmware" provided timestamps.
That's the poor mans emulation of actual hardware timestamps aka PTM.
Why do you need all this pre_nic/post_nic muck there? Fix your firmware
to actually honor PTM and not provide some made up version of it.
Also the whole debug mechanism of comparing the time stamps of
get_cycles() with the timestamps provided by firmware tells me that your
firmware is a hacked together piece of garbage.
If you want to use "hardware" timestamps then populate the
ptp.info.getcrosststamp() callback.
If your firmware really can't do PTM and only provides pre and post
values, then you simply can take the midpoint in your getcrosststamp()
callback:
do {
guard(mutex)(&ptp->nic_ts_read_lock);
err = read_nic_timestamp(...);
system_counterval->cycles = (nic_pre + nic_post) / 2;
} while (err && !timeout(...));
If you do not even trust your firmware then you still can validate it
easily without creating horrible hacks:
do {
guard(mutex)(...);
ptp_read_system_prets(&ctx->sts);
err = read_nic_timestamp(...);
system_counterval->cycles = (nic_pre + nic_post) / 2;
ptp_read_system_postts(&ctx->sts);
} while (err && !timeout(...));
and then check the ctx->sts timestamps against the results provided by
get_device_system_crosststamp() after conversion in your
ptp.info.getcrosststamp() callback:
err = get_device_system_crosststamp(..., &ctx);
if (err)
return err;
if (ctx.sts.pre_ts > xtstamp.sys... || ctx.sts.post_ts < xtstamp.sys...)
return -EBROKENFIRMWARE;
no?
Jakub, please back out this hackery before it makes a precedence for others
to copy from.
Thanks,
tglx
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH net-next v8 3/3] gve: implement PTP gettimex64
2026-05-20 21:08 ` Thomas Gleixner
@ 2026-05-20 22:16 ` Jakub Kicinski
2026-05-21 4:35 ` Jordan Rhee
2026-05-21 10:08 ` David Laight
2 siblings, 0 replies; 19+ messages in thread
From: Jakub Kicinski @ 2026-05-20 22:16 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Harshitha Ramamurthy, netdev, joshwash, andrew+netdev, davem,
edumazet, pabeni, richardcochran, jstultz, sboyd, willemb, nktgrg,
jfraker, ziweixiao, maolson, jordanrhee, thostet, alok.a.tiwari,
pkaligineedi, horms, dwmw2, jacob.e.keller, yyd, jefrogers,
linux-kernel, Naman Gulati
On Wed, 20 May 2026 23:08:42 +0200 Thomas Gleixner wrote:
> Jakub, please back out this hackery before it makes a precedence for others
> to copy from.
Done! 0ae361f7e43c in net-next.
I'd also appreciate if you could cast your eyes over:
https://lore.kernel.org/20260515164033.6403-3-akiyano@amazon.com
Different kettle of fish, but firmly more time than network..
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next v8 3/3] gve: implement PTP gettimex64
2026-05-20 21:08 ` Thomas Gleixner
2026-05-20 22:16 ` Jakub Kicinski
@ 2026-05-21 4:35 ` Jordan Rhee
2026-05-21 10:08 ` David Laight
2 siblings, 0 replies; 19+ messages in thread
From: Jordan Rhee @ 2026-05-21 4:35 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Harshitha Ramamurthy, netdev, joshwash, andrew+netdev, davem,
edumazet, kuba, pabeni, richardcochran, jstultz, sboyd, willemb,
nktgrg, jfraker, ziweixiao, maolson, thostet, alok.a.tiwari,
pkaligineedi, horms, dwmw2, jacob.e.keller, yyd, jefrogers,
linux-kernel, Naman Gulati
Hi Thomas,
Thank you for the review. We plan to rework this to use
https://lore.kernel.org/all/20260515164033.6403-1-akiyano@amazon.com/,
which will let us use the midpoint of pre and post while still
propagating the uncertainty window up the stack. I am going to address
some of the concerns you raised simply for the record, not because I
am trying to convince you one way or another. I do not expect a
response. Thank you,
Jordan
On Wed, May 20, 2026 at 2:08 PM Thomas Gleixner <tglx@kernel.org> wrote:
>
> On Thu, May 14 2026 at 22:58, Harshitha Ramamurthy wrote:
> > From: Jordan Rhee <jordanrhee@google.com>
> >
> > Enable chrony and phc2sys to synchronize system clock to NIC clock.
> >
> > Two paths are implemented: a precise path using system counter values
> > sampled by the device, and a fallback path using system counter values
> > sampled in the driver using ptp_read_system_prets()/postts().
> >
> > To use the precise path, the current system clocksource must match the
> > units returned by the device, which on x86 is X86_TSC and on ARM64 is
> > ARM_ARCH_COUNTER. The clockid requested for the cross-timestamp must
> > be either CLOCK_REALTIME or CLOCK_MONOTONIC_RAW. These conditions hold
> > by default on GCP VMs using Chrony, so we expect the precise path to be
> > used the vast majority of the time. If the system clocksource is changed
> > to kvm-clock, it activates the fallback path. Ethtool counters have been
> > added to count how many times each path is used.
> >
> > The uncertainty window in the precise path is typically around 1-2us,
> > while in the fallback path is around 60-80us.
> >
> > Stub implementions of adjfine and adjtime are added to avoid NULL
> > dereference when phc2sys tries to adjust the clock.
>
> Sorry that I did not react to this earlier, but I was burried in
> regressions and allowed myself to take a few days off.
>
> > +/*
> > + * Read NIC clock by issuing the AQ command. The command is subject to
> > + * rate limiting and may need to be retried. Requires nic_ts_read_lock
> > + * to be held.
> > + */
> > +static int gve_ptp_read_timestamp(struct gve_ptp *ptp, cycles_t *pre_cycles,
> > + cycles_t *post_cycles)
> > +{
> > + unsigned long deadline = jiffies + msecs_to_jiffies(100);
> > + unsigned long delay_us = 1000;
> > + int err;
> > +
> > + lockdep_assert_held(&ptp->nic_ts_read_lock);
> > +
> > + do {
> > + *pre_cycles = get_cycles();
>
> This is a completely non-sensical hack. Why do you need get_cycles()
> here, which is a horrible and ill-defined interface that others are
> trying to get rid of?
We use get_cycles() purely as a sanity check against the TSC values
returned by firmware. We do not use them in the time conversion.
This check found a serious correctness bug in the firmware.
>
> Just because it's still there is not a good answer and it's actually not
> required at all. See below.
>
> > + err = gve_adminq_report_nic_ts(ptp->priv,
> > + ptp->nic_ts_report_bus);
> > +
> > + /* Prevent get_cycles() from being speculatively executed
> > + * before the AdminQ command
> > + */
> > + rmb();
> > + *post_cycles = get_cycles();
> > + if (likely(err != -EAGAIN))
> > + return err;
> > +
> > + fsleep(delay_us);
> > +
> > + /* Exponential backoff */
> > + delay_us *= 2;
> > + } while (time_before(jiffies, deadline));
> > +
> > + return -ETIMEDOUT;
> > +}
> > +
> > /* Read the nic timestamp from hardware via the admin queue. */
> > -static int gve_clock_nic_ts_read(struct gve_ptp *ptp, u64 *nic_raw)
> > +static int gve_clock_nic_ts_read(struct gve_ptp *ptp, u64 *nic_raw,
> > + struct gve_sysclock_sample *sysclock)
> > {
> > + cycles_t host_pre_cycles, host_post_cycles;
> > + struct gve_nic_ts_report *ts_report;
> > int err;
> >
> > mutex_lock(&ptp->nic_ts_read_lock);
>
> Why is this lock taken _after_ the regular (fallback) pre/post
> timestamps are aquired?
>
> What's wrong with doing:
>
> do {
> guard(mutex)(&ptp->nic_ts_read_lock);
>
> ptp_read_system_prets(sts);
> err = read_nic_timestamp(...);
> ptp_read_system_postts(sts);
> } while (err && !timeout(...));
>
> return err;
>
> which moves the "fallback" timestamp right next to the actual hardware
> readout?
The lock is taken at the smallest scope possible, which does not
include the pre/post TS reading. This routine is called by multiple
callers, not all of which need the pre/post timestamp. Options are to:
- move the pre/post TS reading here and add branches to conditionally
sample them only when needed
- always capture the pre/post TS readings and discard them when not needed
- duplicate the locking logic, once for path that needs pre/post TS,
once for path that doesn't need pre/post TS
Since the fallback path is expected to be rarely used, lock contention
and retries are expected to be extremely rare, and chrony is resilient
to samples with high uncertainty window, I chose to prioritize
cleanliness of the code and keep the pre/post TS reading in the caller
where it is actually needed. That said, I do not feel strongly about this
and would be open to moving pre/post inside the mutex if I were sending
another revision.
>
> That would make the "fallback" case too precise, right?
Moving pre/post inside the mutex makes no measurable difference
because the lock is rarely contended. There is 4 orders of
magnitude difference in RMS offset between precise and fallback paths.
>
> After noticing the obvious, I just checked what our new AI review
> overlord "sashiko" has to say about this:
>
> "Does wrapping the timestamp read outside the locks and retry loops skew the
> measurements?
>
> In the fallback path, ptp_read_system_prets() is captured before calling
> gve_clock_nic_ts_read(), and postts() is captured after it returns. However,
> gve_clock_nic_ts_read() acquires ptp->nic_ts_read_lock and calls
> gve_ptp_read_timestamp(), which can sleep using fsleep() during retries. This
> means the prets/postts window could include mutex wait times and sleep
> durations."
>
> It _IS_ actually on spot. So why the heck did this garbage get merged
> without addressing this obvious fail?
- The fallback path is expected to be rarely used since it requires
changing the default configuration
- The lock is rarely contended as there are only 2 callers, one of
which only executes once every 250ms,
and retries would only occur if a VM exceeded their rate limits
which have been tuned not to occur
under normal chrony calling patterns.
>
> What "sashiko" actually saw aside of that, which I didn't see myself when
> looking at this, is:
>
> "Similarly, in gve_ptp_read_timestamp(), the get_cycles() bounds are taken
> outside the gve_adminq_report_nic_ts() call, which acquires the shared
> priv->adminq_lock. If there is lock contention, the cycle bounds could become
> extremely wide."
>
> Oh well.
get_cycles() is only used for sanity checking the values from
firmware, not in any time conversion.
>
> So what you really want is:
>
> do {
> guard(mutex)(&ptp->nic_ts_read_lock);
> guard(mutex)(&gpe_priv->adminq_lock);
>
> ptp_read_system_prets(sts);
> err = read_nic_timestamp(...) {
> ...
> return gve_adminq_execute_cmd_locked(....);
> }
> ptp_read_system_postts(sts);
> } while (err && !timeout(...));
>
> return err;
>
> Or just rely on gpe_priv->adminq_lock in the first place. No?
adminq_lock is private to the adminq layer and protects adminq shared data
structures. nic_ts_read_lock protects the global DMA buffer that receives
the timestamps. Since retries are a response to rate limiting, we must wait
between retries, so we do not want to hold the adminq lock across
these retries as it would block other non rate-limited commands.
>
> But what's worse and escaped the AI scrutiny is:
>
> Your attempt to convert these 'get_cycles()' time stamps after the
> fact and outside of the timekeeping core protection is fundamentally
> wrong.
>
> get_device_system_crosststamp() does
>
> do {
> seq = read_seqcount_begin(&tk_core.seq);
> take_timestamps(...);
> convert_timestamps(...)
> } while (read_seqcount_retry(&tk_core.seq, seq));
>
> which covers the whole sequence of taking the snapshots from the
> hardware and the conversion.
>
> Your abuse of this is completely out of spec. Just because it did not
> blow up in your face yet does not make it in any way correct. You do not
> even get bonus points for creative abuse.
We use ktime_get_snapshot() to capture a historical snapshot before sampling
the TSC, and we pass it to get_device_system_crosststamp() which
uses the snapshot to interpolate the sample if it occurs before the current
timekeeping interval. Perfectly within spec.
>
>
> Now let me look at the "firmware" provided timestamps.
>
> That's the poor mans emulation of actual hardware timestamps aka PTM.
>
> Why do you need all this pre_nic/post_nic muck there? Fix your firmware
> to actually honor PTM and not provide some made up version of it.
This has to run on platforms that don't support PTM.
>
> Also the whole debug mechanism of comparing the time stamps of
> get_cycles() with the timestamps provided by firmware tells me that your
> firmware is a hacked together piece of garbage.
>
> If you want to use "hardware" timestamps then populate the
> ptp.info.getcrosststamp() callback.
>
> If your firmware really can't do PTM and only provides pre and post
> values, then you simply can take the midpoint in your getcrosststamp()
> callback:
>
> do {
> guard(mutex)(&ptp->nic_ts_read_lock);
> err = read_nic_timestamp(...);
> system_counterval->cycles = (nic_pre + nic_post) / 2;
> } while (err && !timeout(...));
We considered returning the midpoint of pre/post up through
getcrosststamp(), but there was
no way to communicate the uncertainty interval which is important
for some of our use cases.
https://lore.kernel.org/all/20260515164033.6403-1-akiyano@amazon.com/
fixes that issue, so we can use that when it lands.
>
> If you do not even trust your firmware then you still can validate it
> easily without creating horrible hacks:
>
> do {
> guard(mutex)(...);
>
> ptp_read_system_prets(&ctx->sts);
> err = read_nic_timestamp(...);
> system_counterval->cycles = (nic_pre + nic_post) / 2;
> ptp_read_system_postts(&ctx->sts);
> } while (err && !timeout(...));
>
> and then check the ctx->sts timestamps against the results provided by
> get_device_system_crosststamp() after conversion in your
> ptp.info.getcrosststamp() callback:
>
> err = get_device_system_crosststamp(..., &ctx);
> if (err)
> return err;
> if (ctx.sts.pre_ts > xtstamp.sys... || ctx.sts.post_ts < xtstamp.sys...)
> return -EBROKENFIRMWARE;
>
> no?
get_device_system_crosststamp() has nontrivial logic that operates on the system
timestamps, so I am not confident that a bad TSC value would be detected
if we ran it through get_device_system_crosststamp() first. The check
is simplest
and strongest when performed against the raw TSC values received from the
device, before passing them to any kernel API. But I hear your point
about mistrusting
the firmware enough to need this check.
>
> Jakub, please back out this hackery before it makes a precedence for others
> to copy from.
>
> Thanks,
>
> tglx
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH net-next v8 3/3] gve: implement PTP gettimex64
2026-05-20 21:08 ` Thomas Gleixner
2026-05-20 22:16 ` Jakub Kicinski
2026-05-21 4:35 ` Jordan Rhee
@ 2026-05-21 10:08 ` David Laight
2 siblings, 0 replies; 19+ messages in thread
From: David Laight @ 2026-05-21 10:08 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Harshitha Ramamurthy, netdev, joshwash, andrew+netdev, davem,
edumazet, kuba, pabeni, richardcochran, jstultz, sboyd, willemb,
nktgrg, jfraker, ziweixiao, maolson, jordanrhee, thostet,
alok.a.tiwari, pkaligineedi, horms, dwmw2, jacob.e.keller, yyd,
jefrogers, linux-kernel, Naman Gulati
On Wed, 20 May 2026 23:08:42 +0200
Thomas Gleixner <tglx@kernel.org> wrote:
> On Thu, May 14 2026 at 22:58, Harshitha Ramamurthy wrote:
...
> > + do {
> > + *pre_cycles = get_cycles();
>
> This is a completely non-sensical hack. Why do you need get_cycles()
> here, which is a horrible and ill-defined interface that others are
> trying to get rid of?
ISTR that it has always returned 0 on some architectures.
-- David
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next v8 3/3] gve: implement PTP gettimex64
2026-05-14 22:58 ` [PATCH net-next v8 3/3] gve: implement PTP gettimex64 Harshitha Ramamurthy
2026-05-20 21:08 ` Thomas Gleixner
@ 2026-05-21 11:43 ` David Woodhouse
2026-05-21 18:06 ` Thomas Gleixner
1 sibling, 1 reply; 19+ messages in thread
From: David Woodhouse @ 2026-05-21 11:43 UTC (permalink / raw)
To: Harshitha Ramamurthy, netdev, Arthur Kiyanovski, tglx
Cc: joshwash, andrew+netdev, davem, edumazet, kuba, pabeni,
richardcochran, jstultz, tglx, sboyd, willemb, nktgrg, jfraker,
ziweixiao, maolson, jordanrhee, thostet, alok.a.tiwari,
pkaligineedi, horms, jacob.e.keller, yyd, jefrogers, linux-kernel,
Naman Gulati
[-- Attachment #1: Type: text/plain, Size: 2975 bytes --]
On Thu, 2026-05-14 at 22:58 +0000, Harshitha Ramamurthy wrote:
>
> +static bool
> +gve_can_use_system_ts_from_device(enum clocksource_ids system_clock_source,
> + clockid_t clockid)
> +{
> + if (clockid != CLOCK_REALTIME && clockid != CLOCK_MONOTONIC_RAW)
> + return false;
> +
> + /* If the system clock source matches the system clock
> + * returned by the AdminQ command, we can use the system
> + * timestamps returned by the device, otherwise we have to
> + * fall back to sampling system time from the host which
> + * is less accurate.
> + */
> + if (IS_ENABLED(CONFIG_X86))
> + return system_clock_source == CSID_X86_TSC;
> + else if (IS_ENABLED(CONFIG_ARM64))
> + return system_clock_source == CSID_ARM_ARCH_COUNTER;
> +
> + return false;
Strictly, you could manage to convert to CSID_X86_KVM_CLK here too. The
x86 kvm_arch_ptp_get_crosststamp() for ptp_kvm does so unconditionally.
For ptp_vmclock we support both; we do the conversion only if the cs_id
in the snapshot *is* the KVM clock, which seemed a bit more reasonable.
And here you add a third variant which *only* copes with CSID_X86_TSC,
to complete the set of possibilities :)
With PTM and the various virt enlightenments, there are an increasing
number of "PTP" clocks which literally *just* tell us a cycle count at
the moment of the corresponding reading, and we don't need the ABA
sandwich of local clock readings around it any more.
I've got another one waiting in the wings which will literally just
latch the counter value when a PPS signal comes in, too.
It might be good to harmonise the way we convert to KVM clock if we're
going to support that (although I'll accept an argument that any guest
which chooses to use the KVM clock doesn't *deserve* to have accurate
timekeeping).
I'd also like consensus on exposing it to userspace. I think Thomas is
going to suggest that we should always convert to MONOTONIC_RAW, but
what I asked Arthur to do in
https://lore.kernel.org/all/20260515164033.6403-1-akiyano@amazon.com/
exposes the raw counter values instead.
From the dedicating hosting point of view, we *only* care about the
counter, and absolutely DGAF about the host's timekeeping. Migrating
KVM guests requires getting the guest TSC right (in terms of scaling
factors and offset from the host TSC), and feeding accurate time to
guests is done in terms of the guest TSC to realtime relationship.
So our ideal state is that we discipline the TSC against real time in
userspace, and enabling those PHC drivers to *give* us the raw cycle
counts that they actually measured seemed best.
Could we do that all on MONOTONIC_RAW, if Thomas insists? Maybe...
although I like the model of disciplining one *specific* counter, and
then telling the kernel about *that* counter. If the kernel for some
reason has randomly switched to the HPET then the kernel is free to
ignore what we're telling it about the TSC.
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH net-next v8 3/3] gve: implement PTP gettimex64
2026-05-21 11:43 ` David Woodhouse
@ 2026-05-21 18:06 ` Thomas Gleixner
2026-05-21 19:59 ` David Woodhouse
0 siblings, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2026-05-21 18:06 UTC (permalink / raw)
To: David Woodhouse, Harshitha Ramamurthy, netdev, Arthur Kiyanovski
Cc: joshwash, andrew+netdev, davem, edumazet, kuba, pabeni,
richardcochran, jstultz, sboyd, willemb, nktgrg, jfraker,
ziweixiao, maolson, jordanrhee, thostet, alok.a.tiwari,
pkaligineedi, horms, jacob.e.keller, yyd, jefrogers, linux-kernel,
Naman Gulati, Thomas Weißschuh
On Thu, May 21 2026 at 12:43, David Woodhouse wrote:
> On Thu, 2026-05-14 at 22:58 +0000, Harshitha Ramamurthy wrote:
>> + if (IS_ENABLED(CONFIG_X86))
>> + return system_clock_source == CSID_X86_TSC;
>> + else if (IS_ENABLED(CONFIG_ARM64))
>> + return system_clock_source == CSID_ARM_ARCH_COUNTER;
>> +
>> + return false;
>
> Strictly, you could manage to convert to CSID_X86_KVM_CLK here too. The
> x86 kvm_arch_ptp_get_crosststamp() for ptp_kvm does so unconditionally.
>
> For ptp_vmclock we support both; we do the conversion only if the cs_id
> in the snapshot *is* the KVM clock, which seemed a bit more reasonable.
>
> And here you add a third variant which *only* copes with CSID_X86_TSC,
> to complete the set of possibilities :)
Yeah, let's sprinkle that all over the tree so nobody can figure out a
coherent picture anymore. :)
> With PTM and the various virt enlightenments, there are an increasing
> number of "PTP" clocks which literally *just* tell us a cycle count at
> the moment of the corresponding reading, and we don't need the ABA
> sandwich of local clock readings around it any more.
That's what the PRECISE ioctl and ptp::info::getcrosstimestamp() is for,
which utilizes get_device_system_crosststamp() and provides the
translation from PTM, on x86 that's ART which is converted to TSC.
It then correlates the converted system time stamp to CLOCK_REALTIME and
CLOCK_MONOTONIC_RAW.
That has been implemented long ago exactly for that purpose, but people
need to reinvent the wheel over and over just because it makes the
kernel so more maintainable.
> I've got another one waiting in the wings which will literally just
> latch the counter value when a PPS signal comes in, too.
That can use a similar mechanism.
> It might be good to harmonise the way we convert to KVM clock if we're
> going to support that (although I'll accept an argument that any guest
> which chooses to use the KVM clock doesn't *deserve* to have accurate
> timekeeping).
:)
> I'd also like consensus on exposing it to userspace. I think Thomas is
> going to suggest that we should always convert to MONOTONIC_RAW, but
> what I asked Arthur to do in
> https://lore.kernel.org/all/20260515164033.6403-1-akiyano@amazon.com/
> exposes the raw counter values instead.
I'm not against that per se, but that really wants to be a consistent
snapshot and not some pulled out of thin air cycles value.
ktime_get_snapshot() provides that already today, so that can be
arguably used for pre/post timestamp mechanisms, except that it lacks
support for AUX clocks.
get_device_system_crosststamp() does not provide cycles and clock id,
but that's not rocket science to extend.
> From the dedicating hosting point of view, we *only* care about the
> counter, and absolutely DGAF about the host's timekeeping. Migrating
> KVM guests requires getting the guest TSC right (in terms of scaling
> factors and offset from the host TSC), and feeding accurate time to
> guests is done in terms of the guest TSC to realtime relationship.
The offset has nothing to do with the scaling. It's the delta to the TSC
value which the migrated guest saw last when it was packed up for
migration.
The scaling is not required to be cycles based either. You need to know
the frequency ratio of the guest TSC to the new host TSC. cycles are not
telling you that. So you need extra information from the previous host,
i.e. the assumed raw TSC frequency of the guest plus the steered
frequency in your virtual PCH interface.
So sure you can utilize the raw counter values for that, but you can do
the same with CLOCK_MONOTONIC_RAW by exposing the mult/shift factor
which the guest keeps unmodified after boot or a frequency value derived
from that. That's IMO way more sensible than measuring it somehow and
the shift/mult conversion inaccuracy you mentioned is laughable TBH. TSC
usually ends up with a shift value of 24, so the conversion loss of the
TSC to CLOCK_MONOTONIC_RAW is +/- 1ns, which is totally irrelevant in
the larger picture.
> So our ideal state is that we discipline the TSC against real time in
> userspace, and enabling those PHC drivers to *give* us the raw cycle
> counts that they actually measured seemed best.
As I said, I'm not against that per se, but then we want to provide
interfaces which are generic enough so they can be used for your
purposes, but also for the requirements of chrony et al. Looking at the
patch you linked to above:
> +static long ptp_sys_offset_precise_attrs(struct ptp_clock *ptp, void __user *arg)
> +{
> + struct ptp_sys_offset_precise_attrs precise_offset_attrs;
> + struct system_device_crosststamp xtstamp;
> + struct ptp_clock_attributes att = {};
> + struct timespec64 ts;
> + int err;
> +
> + if (!ptp->info->getcrosststampattrs)
> + return -EOPNOTSUPP;
> +
> + err = ptp->info->getcrosststampattrs(ptp->info, &xtstamp, &att);
> + if (err)
> + return err;
> +
> + memset(&precise_offset_attrs, 0, sizeof(precise_offset_attrs));
> + ts = ktime_to_timespec64(xtstamp.device);
> + precise_offset_attrs.device.pct.sec = ts.tv_sec;
> + precise_offset_attrs.device.pct.nsec = ts.tv_nsec;
> + precise_offset_attrs.device.att.error_bound = att.error_bound;
> + precise_offset_attrs.device.att.timescale = att.timescale;
> + precise_offset_attrs.device.att.status = att.status;
> + precise_offset_attrs.device.att.counter_id = att.counter_id;
> + precise_offset_attrs.device.att.counter_value = att.counter_value;
That's exactly the hackery which is counterproductive. Why?
Where does ptp::info::getcrosststampattrs() retrieve the system counter
value from?
Either it has to convert from the ART based timestamp to TSC magically
by circumventing the core code or it has to use get_cycles() separately
which defeats the whole purpose of a hardware latched combo timestamp.
While this:
diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index aee2c1a46e47..43a55f618235 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -296,19 +296,6 @@ struct system_time_snapshot {
u8 cs_was_changed_seq;
};
-/**
- * struct system_device_crosststamp - system/device cross-timestamp
- * (synchronized capture)
- * @device: Device time
- * @sys_realtime: Realtime simultaneous with device time
- * @sys_monoraw: Monotonic raw simultaneous with device time
- */
-struct system_device_crosststamp {
- ktime_t device;
- ktime_t sys_realtime;
- ktime_t sys_monoraw;
-};
-
/**
* struct system_counterval_t - system counter value with the ID of the
* corresponding clocksource
@@ -325,6 +312,21 @@ struct system_counterval_t {
bool use_nsecs;
};
+/**
+ * struct system_device_crosststamp - system/device cross-timestamp
+ * (synchronized capture)
+ * @device: Device time
+ * @sys_counter: Clocksource counter value simultaneous with device time
+ * @sys_realtime: Realtime simultaneous with device time
+ * @sys_monoraw: Monotonic raw simultaneous with device time
+ */
+struct system_device_crosststamp {
+ ktime_t device;
+ struct system_counterval_t sys_counter;
+ ktime_t sys_realtime;
+ ktime_t sys_monoraw;
+};
+
extern bool ktime_real_to_base_clock(ktime_t treal,
enum clocksource_ids base_id, u64 *cycles);
extern bool timekeeping_clocksource_has_base(enum clocksource_ids id);
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index c493a4010305..11df7e25bdca 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1506,6 +1506,7 @@ int get_device_system_crosststamp(int (*get_time_fn)
nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, cycles);
} while (read_seqcount_retry(&tk_core.seq, seq));
+ xtstamp->sys_counter = system_counterval;
xtstamp->sys_realtime = ktime_add_ns(base_real, nsec_real);
xtstamp->sys_monoraw = ktime_add_ns(base_raw, nsec_raw);
Provides the information coherently and without any extra hacks. No?
Extending get_device_system_crosststamp() for AUX clocks is on my todo
list for a long time, but I did not come around to it yet. It's not
really complicated to make that work.
The actual ena_phc_gettimexattrs64() implementation does not provide the
counter value at all despite the fact that it could trivially do so with
a modified ktime_get_snapshot() except for AUX clock IDs, but that's a
solvable problem. See uncompiled PoC below.
I really have to ask why we put a lot of effort into consolidated
infrastructure, when every drivers/foo/ subsystem decides it needs it's
own absymal hacks just because sitting down and extending the generic
infrastructure is asked too much. TBH, that's frustrating as hell and
no, none of these usecases is so special that it would justify a single
one of these hacks.
> Could we do that all on MONOTONIC_RAW, if Thomas insists? Maybe...
> although I like the model of disciplining one *specific* counter, and
> then telling the kernel about *that* counter. If the kernel for some
> reason has randomly switched to the HPET then the kernel is free to
> ignore what we're telling it about the TSC.
Well, if the kernel has switched to HPET for timekeeping, then there is
a good reason especially with the reworked clocksource watchdog, which
is not longer being stupid under load and on big machines. If the
sysadmin thinks it's cool to switch to HPET, then he can keep the
pieces. In those cases the machine has other problems than taming the
TSC for guests....
Thanks,
tglx
---
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -276,19 +276,24 @@ static inline bool ktime_get_aux_ts64(cl
#endif
/**
- * struct system_time_snapshot - simultaneous raw/real time capture with
+ * struct system_time_snapshot - simultaneous raw/CLOCK_REALTIME or CLOCK_AUX$N time capture with
* counter value
* @cycles: Clocksource counter value to produce the system times
* @real: Realtime system time
+ * @aux: Auxiliary system time if selected in ktime_get_snapshot_id()
* @boot: Boot time
* @raw: Monotonic raw system time
* @cs_id: Clocksource ID
* @clock_was_set_seq: The sequence number of clock-was-set events
* @cs_was_changed_seq: The sequence number of clocksource change events
+ * @clock_valid: The clock is valid
*/
struct system_time_snapshot {
u64 cycles;
- ktime_t real;
+ union {
+ ktime_t real;
+ ktime_t aux;
+ };
ktime_t boot;
ktime_t raw;
enum clocksource_ids cs_id;
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -67,6 +67,7 @@ static inline bool tk_is_aux(const struc
{
return tk->id >= TIMEKEEPER_AUX_FIRST && tk->id <= TIMEKEEPER_AUX_LAST;
}
+static inline struct tk_data *aux_get_tk_data(clockid_t id);
#else
static inline bool tk_get_aux_ts64(unsigned int tkid, struct timespec64 *ts)
{
@@ -77,6 +78,10 @@ static inline bool tk_is_aux(const struc
{
return false;
}
+static inline struct tk_data *aux_get_tk_data(clockid_t id)
+{
+ return NULL;
+}
#endif
static inline void tk_update_aux_offs(struct timekeeper *tk, ktime_t offs)
@@ -1183,15 +1188,18 @@ noinstr time64_t __ktime_get_real_second
}
/**
- * ktime_get_snapshot - snapshots the realtime/monotonic raw clocks with counter
+ * ktime_get_snapshot_id - snapshots the monotonic raw clock, the selected clock and the counter
* @systime_snapshot: pointer to struct receiving the system time snapshot
+ * @clock_id: The clock ID to snapshot
*/
-void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
+bool ktime_get_snapshot_id(struct system_time_snapshot *systime_snapshot, clockid_t clock_id)
{
- struct timekeeper *tk = &tk_core.timekeeper;
+ struct timekeeper *tk;
+ struct tk_data *tkd;
unsigned int seq;
ktime_t base_raw;
ktime_t base_real;
+ ktime_t base_mono;
ktime_t base_boot;
u64 nsec_raw;
u64 nsec_real;
@@ -1199,27 +1207,56 @@ void ktime_get_snapshot(struct system_ti
WARN_ON_ONCE(timekeeping_suspended);
+ switch (clock_id) {
+ case CLOCK_REALTIME:
+ tkd = &tk_core;
+ break;
+ case CLOCK_AUX ... CLOCK_AUX_LAST: {
+ tkd = aux_get_tk_data(clock_id);
+ if (!tkd)
+ return false;
+ break;
+ }
+ default:
+ WARN_ON_ONCE(1);
+ return false;
+ }
+
+ tk = &tkd->timekeeper;
+
do {
- seq = read_seqcount_begin(&tk_core.seq);
+ seq = read_seqcount_begin(&tkd->seq);
+
+ /* Aux clocks can be invalid */
+ if (!tk->clock_valid)
+ return false;
+
now = tk_clock_read(&tk->tkr_mono);
systime_snapshot->cs_id = tk->tkr_mono.clock->id;
systime_snapshot->cs_was_changed_seq = tk->cs_was_changed_seq;
systime_snapshot->clock_was_set_seq = tk->clock_was_set_seq;
- base_real = ktime_add(tk->tkr_mono.base,
- tk_core.timekeeper.offs_real);
- base_boot = ktime_add(tk->tkr_mono.base,
- tk_core.timekeeper.offs_boot);
+ base_mono = tk->tkr_mono.base;
+ if (clock_id == CLOCK_REALTIME) {
+ base_real = ktime_add(base_mono, tk_core.timekeeper.offs_real);
+ base_boot = ktime_add(base_mono, tk_core.timekeeper.offs_boot);
+ }
base_raw = tk->tkr_raw.base;
nsec_real = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
- } while (read_seqcount_retry(&tk_core.seq, seq));
+ } while (read_seqcount_retry(&tkd->seq, seq));
systime_snapshot->cycles = now;
- systime_snapshot->real = ktime_add_ns(base_real, nsec_real);
- systime_snapshot->boot = ktime_add_ns(base_boot, nsec_real);
+ if (clock_id == CLOCK_REALTIME) {
+ systime_snapshot->real = ktime_add_ns(base_real, nsec_real);
+ systime_snapshot->boot = ktime_add_ns(base_boot, nsec_real);
+ } else {
+ systime_snapshot->aux = ktime_add_ns(base_mono, nsec_real);
+ systime_snapshot->boot = 0;
+ }
systime_snapshot->raw = ktime_add_ns(base_raw, nsec_raw);
+ return true;
}
-EXPORT_SYMBOL_GPL(ktime_get_snapshot);
+EXPORT_SYMBOL_GPL(ktime_get_snapshot_id);
/* Scale base by mult/div checking for overflow */
static int scale64_check_overflow(u64 mult, u64 div, u64 *base)
^ permalink raw reply related [flat|nested] 19+ messages in thread* Re: [PATCH net-next v8 3/3] gve: implement PTP gettimex64
2026-05-21 18:06 ` Thomas Gleixner
@ 2026-05-21 19:59 ` David Woodhouse
2026-05-21 22:43 ` Thomas Gleixner
0 siblings, 1 reply; 19+ messages in thread
From: David Woodhouse @ 2026-05-21 19:59 UTC (permalink / raw)
To: Thomas Gleixner, Harshitha Ramamurthy, netdev, Arthur Kiyanovski
Cc: joshwash, andrew+netdev, davem, edumazet, kuba, pabeni,
richardcochran, jstultz, sboyd, willemb, nktgrg, jfraker,
ziweixiao, maolson, jordanrhee, thostet, alok.a.tiwari,
pkaligineedi, horms, jacob.e.keller, yyd, jefrogers, linux-kernel,
Naman Gulati, Thomas Weißschuh
[-- Attachment #1: Type: text/plain, Size: 11966 bytes --]
On Thu, 2026-05-21 at 20:06 +0200, Thomas Gleixner wrote:
> On Thu, May 21 2026 at 12:43, David Woodhouse wrote:
>
> > With PTM and the various virt enlightenments, there are an increasing
> > number of "PTP" clocks which literally *just* tell us a cycle count at
> > the moment of the corresponding reading, and we don't need the ABA
> > sandwich of local clock readings around it any more.
>
> That's what the PRECISE ioctl and ptp::info::getcrosstimestamp() is for,
> which utilizes get_device_system_crosststamp() and provides the
> translation from PTM, on x86 that's ART which is converted to TSC.
>
> It then correlates the converted system time stamp to CLOCK_REALTIME and
> CLOCK_MONOTONIC_RAW.
>
> That has been implemented long ago exactly for that purpose, but people
> need to reinvent the wheel over and over just because it makes the
> kernel so more maintainable.
>
> > I've got another one waiting in the wings which will literally just
> > latch the counter value when a PPS signal comes in, too.
>
> That can use a similar mechanism.
Yep. In both cases, some userspace actually wants the *counter value*.
> > It might be good to harmonise the way we convert to KVM clock if we're
> > going to support that (although I'll accept an argument that any guest
> > which chooses to use the KVM clock doesn't *deserve* to have accurate
> > timekeeping).
>
> :)
I'm not sure I'm joking. The KVM clock is.... *hosed*. It was invented
to work around the ancient broken variable-speed TSCs which stopped on
HLT, etc.
In *those* days, sure it made sense for the hypervisor to constantly
update a data structure which let you approximate a TSC→nanoseconds
conversion.
These days, a guest using the *actual* TSC is going to be a lot more
accurate, and guests basically shouldn't be using the kvmclock AT ALL.
When I made it DTRT for ptp_vmclock, I did look briefly at what it
would take to do it centrally in common code.
But that lumbers the common PTP code with x86 arch-specific horridness
about the KVM clock. And it literally has to call back into the driver
in the seqcount loop, so I guess drivers would need to have a flag or
something that says it's OK to do that (they're fast enough, or
something?).
I didn't spent *that* long trying, but I couldn't see a way of doing it
centrally, which didn't end up making me sadder than the initial
duplication.
Honestly, whether it was intent or omission, I suspect the approach
Harshitha took for GVE is the right one: Support TSC, and anyone who
uses KVM clock doesn't deserve accurate time anyway.
> > I'd also like consensus on exposing it to userspace. I think Thomas is
> > going to suggest that we should always convert to MONOTONIC_RAW, but
> > what I asked Arthur to do in
> > https://lore.kernel.org/all/20260515164033.6403-1-akiyano@amazon.com/
> > exposes the raw counter values instead.
>
> I'm not against that per se, but that really wants to be a consistent
> snapshot and not some pulled out of thin air cycles value.
Definitely.
> ktime_get_snapshot() provides that already today, so that can be
> arguably used for pre/post timestamp mechanisms, except that it lacks
> support for AUX clocks.
>
> get_device_system_crosststamp() does not provide cycles and clock id,
> but that's not rocket science to extend.
Yep.
> > From the dedicating hosting point of view, we *only* care about the
> > counter, and absolutely DGAF about the host's timekeeping. Migrating
> > KVM guests requires getting the guest TSC right (in terms of scaling
> > factors and offset from the host TSC), and feeding accurate time to
> > guests is done in terms of the guest TSC to realtime relationship.
>
> The offset has nothing to do with the scaling. It's the delta to the TSC
> value which the migrated guest saw last when it was packed up for
> migration.
Sure. As long as that guest TSC value is a consistent snapshot and not
some pulled out of thin air cycles value :)
There is lots to be said about accurate KVM migration, and a series of
30-odd patches on the KVM list which attempts to improve it.
But either way, we still DGAF about the host's CLOCK_REALTIME when
we're doing it at scale.
KVM doesn't *have* a clean way to get/set the guest TSC based on
realtime; it only has the scaling and offsets from the host TSC. So to
do things precisely, we end up converting via the host TSC.
So we end up comparing a {host TSC, realtime} pair between source and
destination hosts, calculating the *guest* ticks that should have
occurred in the intervening time, and setting the "offset from host" on
the destination host to achieve the desired results. Almost all of
which is in terms of TSC cycles and offsets.
> > So our ideal state is that we discipline the TSC against real time in
> > userspace, and enabling those PHC drivers to *give* us the raw cycle
> > counts that they actually measured seemed best.
>
> As I said, I'm not against that per se, but then we want to provide
> interfaces which are generic enough so they can be used for your
> purposes, but also for the requirements of chrony et al. Looking at the
> patch you linked to above:
>
> > +static long ptp_sys_offset_precise_attrs(struct ptp_clock *ptp, void __user *arg)
> > +{
> > + struct ptp_sys_offset_precise_attrs precise_offset_attrs;
> > + struct system_device_crosststamp xtstamp;
> > + struct ptp_clock_attributes att = {};
> > + struct timespec64 ts;
> > + int err;
> > +
> > + if (!ptp->info->getcrosststampattrs)
> > + return -EOPNOTSUPP;
> > +
> > + err = ptp->info->getcrosststampattrs(ptp->info, &xtstamp, &att);
> > + if (err)
> > + return err;
> > +
> > + memset(&precise_offset_attrs, 0, sizeof(precise_offset_attrs));
> > + ts = ktime_to_timespec64(xtstamp.device);
> > + precise_offset_attrs.device.pct.sec = ts.tv_sec;
> > + precise_offset_attrs.device.pct.nsec = ts.tv_nsec;
> > + precise_offset_attrs.device.att.error_bound = att.error_bound;
> > + precise_offset_attrs.device.att.timescale = att.timescale;
> > + precise_offset_attrs.device.att.status = att.status;
> > + precise_offset_attrs.device.att.counter_id = att.counter_id;
> > + precise_offset_attrs.device.att.counter_value = att.counter_value;
>
> That's exactly the hackery which is counterproductive. Why?
That's the same pattern as in the existing ptp_sys_offset_precise(),
isn't it? Copying from the internal data structures it got from the
driver, into the specific userspace structure for the ioctl that was
called.
> Where does ptp::info::getcrosststampattrs() retrieve the system counter
> value from?
>
> Either it has to convert from the ART based timestamp to TSC magically
> by circumventing the core code or it has to use get_cycles() separately
> which defeats the whole purpose of a hardware latched combo timestamp.
The point in ptp_sys_offset_precise is that all the timestamps are the
*same* time, isn't it?
So surely a device can only support this if it *does* have a precise
hardware timestamp, e.g. from ART? And the corresponding sys_realtime
and sys_monoraw fields also have to be from the *same* value, surely?
Is that not how it works? If not.... WTF *is* it for?
We're allowed to set *fire* to anyone who just throws a random
get_cycles() call in there, aren't we? That's basically just self-
defence, m'lud.
> While this:
>
> diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
> index aee2c1a46e47..43a55f618235 100644
> --- a/include/linux/timekeeping.h
> +++ b/include/linux/timekeeping.h
> @@ -296,19 +296,6 @@ struct system_time_snapshot {
> u8 cs_was_changed_seq;
> };
>
> -/**
> - * struct system_device_crosststamp - system/device cross-timestamp
> - * (synchronized capture)
> - * @device: Device time
> - * @sys_realtime: Realtime simultaneous with device time
> - * @sys_monoraw: Monotonic raw simultaneous with device time
> - */
> -struct system_device_crosststamp {
> - ktime_t device;
> - ktime_t sys_realtime;
> - ktime_t sys_monoraw;
> -};
> -
> /**
> * struct system_counterval_t - system counter value with the ID of the
> * corresponding clocksource
> @@ -325,6 +312,21 @@ struct system_counterval_t {
> bool use_nsecs;
> };
>
> +/**
> + * struct system_device_crosststamp - system/device cross-timestamp
> + * (synchronized capture)
> + * @device: Device time
> + * @sys_counter: Clocksource counter value simultaneous with device time
> + * @sys_realtime: Realtime simultaneous with device time
> + * @sys_monoraw: Monotonic raw simultaneous with device time
> + */
> +struct system_device_crosststamp {
> + ktime_t device;
> + struct system_counterval_t sys_counter;
> + ktime_t sys_realtime;
> + ktime_t sys_monoraw;
> +};
> +
> extern bool ktime_real_to_base_clock(ktime_t treal,
> enum clocksource_ids base_id, u64 *cycles);
> extern bool timekeeping_clocksource_has_base(enum clocksource_ids id);
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index c493a4010305..11df7e25bdca 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -1506,6 +1506,7 @@ int get_device_system_crosststamp(int (*get_time_fn)
> nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, cycles);
> } while (read_seqcount_retry(&tk_core.seq, seq));
>
> + xtstamp->sys_counter = system_counterval;
> xtstamp->sys_realtime = ktime_add_ns(base_real, nsec_real);
> xtstamp->sys_monoraw = ktime_add_ns(base_raw, nsec_raw);
>
> Provides the information coherently and without any extra hacks. No?
Yeah, that looks eminently reasonable. It isn't actually that far off
what Arthur did in patch 1 of the series, is it? He was adding other
clock quality attributes at the same time, and ended up shoe-horning
the cs_id and cycle_count into those instead of using a separate
system_counterval_t as you have here, which *is* cleaner. So yes, let's
get him to do that, but that part is cosmetic?
> Extending get_device_system_crosststamp() for AUX clocks is on my todo
> list for a long time, but I did not come around to it yet. It's not
> really complicated to make that work.
>
> The actual ena_phc_gettimexattrs64() implementation does not provide the
> counter value at all despite the fact that it could trivially do so with
> a modified ktime_get_snapshot() except for AUX clock IDs, but that's a
> solvable problem. See uncompiled PoC below.
Huh? Unless it has PTM and is converting from ART (or is something
virtual like vmclock or the older non-LM-safe KVM hacks), surely the
driver has no business pretending to support 'precise' mode, and no
business providing a counter value?
In this case, we'd want the pre_ts and post_ts to be generated by the
common code using ktime_get_snapshot_id(cs_id) so that we have the
corresponding system_counterval_t in the 'A' parts of the ABA tuple...
but the device part of it can't pull one out of thin air unless it is
*part* of the device reading, surely?
> I really have to ask why we put a lot of effort into consolidated
> infrastructure, when every drivers/foo/ subsystem decides it needs it's
> own absymal hacks just because sitting down and extending the generic
> infrastructure is asked too much. TBH, that's frustrating as hell and
> no, none of these usecases is so special that it would justify a single
> one of these hacks.
Meh. I'm literally here trying to join the dots between timekeeping,
PTP and KVM, and come up with a holistic solution that makes sense for
all the use cases. And my brain hurts... :)
> -EXPORT_SYMBOL_GPL(ktime_get_snapshot);
> +EXPORT_SYMBOL_GPL(ktime_get_snapshot_id);
Ack. I think Arthur needs to use that in the pre_ts/post_ts helpers. Thanks.
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH net-next v8 3/3] gve: implement PTP gettimex64
2026-05-21 19:59 ` David Woodhouse
@ 2026-05-21 22:43 ` Thomas Gleixner
2026-05-22 0:12 ` David Woodhouse
2026-05-22 10:34 ` David Woodhouse
0 siblings, 2 replies; 19+ messages in thread
From: Thomas Gleixner @ 2026-05-21 22:43 UTC (permalink / raw)
To: David Woodhouse, Harshitha Ramamurthy, netdev, Arthur Kiyanovski
Cc: joshwash, andrew+netdev, davem, edumazet, kuba, pabeni,
richardcochran, jstultz, sboyd, willemb, nktgrg, jfraker,
ziweixiao, maolson, jordanrhee, thostet, alok.a.tiwari,
pkaligineedi, horms, jacob.e.keller, yyd, jefrogers, linux-kernel,
Naman Gulati, Thomas Weißschuh
On Thu, May 21 2026 at 20:59, David Woodhouse wrote:
> On Thu, 2026-05-21 at 20:06 +0200, Thomas Gleixner wrote:
>> > It might be good to harmonise the way we convert to KVM clock if we're
>> > going to support that (although I'll accept an argument that any guest
>> > which chooses to use the KVM clock doesn't *deserve* to have accurate
>> > timekeeping).
>>
>> :)
>
> I'm not sure I'm joking. The KVM clock is.... *hosed*. It was invented
> to work around the ancient broken variable-speed TSCs which stopped on
> HLT, etc.
>
> In *those* days, sure it made sense for the hypervisor to constantly
> update a data structure which let you approximate a TSC→nanoseconds
> conversion.
It was broken from day one and I explained the why in great length
several times. But features first, correctness later...
At some point I gave up and ignored all complaints about time
inconsistencies in guests which use kvmclock.
> These days, a guest using the *actual* TSC is going to be a lot more
> accurate, and guests basically shouldn't be using the kvmclock AT ALL.
You're preaching to the choir.
> When I made it DTRT for ptp_vmclock, I did look briefly at what it
> would take to do it centrally in common code.
>
> But that lumbers the common PTP code with x86 arch-specific horridness
> about the KVM clock. And it literally has to call back into the driver
> in the seqcount loop, so I guess drivers would need to have a flag or
> something that says it's OK to do that (they're fast enough, or
> something?).
>
> I didn't spent *that* long trying, but I couldn't see a way of doing it
> centrally, which didn't end up making me sadder than the initial
> duplication.
>
> Honestly, whether it was intent or omission, I suspect the approach
> Harshitha took for GVE is the right one: Support TSC, and anyone who
> uses KVM clock doesn't deserve accurate time anyway.
I agree with the sentiment, but the implementation is just bonkers TBH.
The driver should not even care about TSC/ARM_XX. Drivers have to be
mostly agnostic of this and we went to a great length for that with the
existing PTM core support. The core might need extensions for the pre/post
muck, but that's not a justification for hacks like this GVE stuff.
>> > From the dedicating hosting point of view, we *only* care about the
>> > counter, and absolutely DGAF about the host's timekeeping. Migrating
>> > KVM guests requires getting the guest TSC right (in terms of scaling
>> > factors and offset from the host TSC), and feeding accurate time to
>> > guests is done in terms of the guest TSC to realtime relationship.
>>
>> The offset has nothing to do with the scaling. It's the delta to the TSC
>> value which the migrated guest saw last when it was packed up for
>> migration.
>
> Sure. As long as that guest TSC value is a consistent snapshot and not
> some pulled out of thin air cycles value :)
That's a different problem, which has nothing to do with the issue at
hand and if that's not solved, then providing accurate frequency ratios
is not making it much better :)
> There is lots to be said about accurate KVM migration, and a series of
> 30-odd patches on the KVM list which attempts to improve it.
>
> But either way, we still DGAF about the host's CLOCK_REALTIME when
> we're doing it at scale.
>
> KVM doesn't *have* a clean way to get/set the guest TSC based on
> realtime; it only has the scaling and offsets from the host TSC. So to
> do things precisely, we end up converting via the host TSC.
>
> So we end up comparing a {host TSC, realtime} pair between source and
> destination hosts, calculating the *guest* ticks that should have
> occurred in the intervening time, and setting the "offset from host" on
> the destination host to achieve the desired results. Almost all of
> which is in terms of TSC cycles and offsets.
Fine, but you can only do that accurately if you have the relevant
parameters:
1) Guest TSC value at freeze
2) Guest nominal TSC frequency
3) Old host REALTIME at freeze - Ideally you use TAI
4) New host TSC frequency
5) New host TSC/REALTIME/TAI snapshot
#1 is a KVM problem, but see #3
#2 ideally communicated from the guest to the host after early
initialization at boot.
You really want this information because the guest won't change the
mult/shift pair for it ever.
That is more accurate than using the scaling factor on the old host
plus the old host frequency. Especially if you are migrating the VM
several times because every migration results obviously in rounding
errors, while if you use the nominal guest frequency you keep this
back and forth conversions and the resulting errors out of the way.
#3 That's obviously related to #1
The old host knows the guest TSC scaling and the guest offset, so
it can take a coherent snapshot on the host side of REALTIME/TAI
and the host TSC value, which is convertible to the guest TSC
value.
#4, 5
are obviously required to set up the new host scaling/offset for
the guest TSC, but if based on the nominal assumed TSC frequency of
the guest it becomes actually accurate and usable across several
migrations.
Anything else is just crystal ball magic, which is what KVM has today....
>> > + precise_offset_attrs.device.att.counter_id = att.counter_id;
>> > + precise_offset_attrs.device.att.counter_value = att.counter_value;
>>
>> That's exactly the hackery which is counterproductive. Why?
>
> That's the same pattern as in the existing ptp_sys_offset_precise(),
> isn't it? Copying from the internal data structures it got from the
> driver, into the specific userspace structure for the ioctl that was
> called.
No.
Q: How does the driver know what the snapshot of NIC/SYSTEM time means?
A: It does not.
That's what get_device_system_crosststamp() encapsulates in a generic
way, which removes architecture/system specific knowledge from the
driver.
Q: How can the driver fill in att.counter_id and att.counter_value
_without_ specific knowlegde?
A: It _cannot_ without get_device_system_crosststamp() providing it.
Which is what I tried to explain you here:
>> Where does ptp::info::getcrosststampattrs() retrieve the system counter
>> value from?
>>
>> Either it has to convert from the ART based timestamp to TSC magically
>> by circumventing the core code or it has to use get_cycles() separately
>> which defeats the whole purpose of a hardware latched combo timestamp.
>
> The point in ptp_sys_offset_precise is that all the timestamps are the
> *same* time, isn't it?
Yes.
> So surely a device can only support this if it *does* have a precise
> hardware timestamp, e.g. from ART? And the corresponding sys_realtime
> and sys_monoraw fields also have to be from the *same* value, surely?
That is correct, but you still fail to explain how this new made up
att.counter_id/att.counter_value is filled in coherently without the
driver doing something abysmal?
You can't explain it, because the only way to do that correctly is by
extending the cross time stamp struct and get_device_system_crosststamp().
> Is that not how it works? If not.... WTF *is* it for?
Yes, that's the way it works since get_device_system_crosststamp() got
introduced, but that does not give a random driver access to the actual
converted (TSC) counter value. The driver can only observe the PTM value
which is ART on x86 and whatever on other architectures. So it does not
know anything about it.
> We're allowed to set *fire* to anyone who just throws a random
> get_cycles() call in there, aren't we? That's basically just self-
> defence, m'lud.
I agree, but the way how this attr.counter_id/value extension is defined
either requires unholy hacks to get access to the ART/TSC conversion or
resort to get_cycles().
>> While this:
...
>> + xtstamp->sys_counter = system_counterval;
...
simple core code change:
>> Provides the information coherently and without any extra hacks. No?
>
> Yeah, that looks eminently reasonable. It isn't actually that far off
> what Arthur did in patch 1 of the series, is it? He was adding other
> clock quality attributes at the same time, and ended up shoe-horning
> the cs_id and cycle_count into those instead of using a separate
> system_counterval_t as you have here, which *is* cleaner. So yes, let's
> get him to do that, but that part is cosmetic?
No. It's not cosmetic.
This patch #1 just defines new attributes, which include the CS ID and
the CS counter value with absolutely ZERO explanation where these values
are supposed to come from and ZERO explanation how they are coherent.
It suggests that the driver can fill them in along with the other truly
driver related data, e.g. error_bound.
But as I explained before the driver _cannot_ fill them in coherently
neither for the new gettimexattrs64() nor for the getcrosststampattrs()
callbacks.
So you add an extension to the existing interface without core support,
which means once that is merged the driver crowd will happily take any
shortcut to support it.
Are _you_ going to mop up the resulting mess?
>> Extending get_device_system_crosststamp() for AUX clocks is on my todo
>> list for a long time, but I did not come around to it yet. It's not
>> really complicated to make that work.
>>
>> The actual ena_phc_gettimexattrs64() implementation does not provide the
>> counter value at all despite the fact that it could trivially do so with
>> a modified ktime_get_snapshot() except for AUX clock IDs, but that's a
>> solvable problem. See uncompiled PoC below.
>
> Huh? Unless it has PTM and is converting from ART (or is something
> virtual like vmclock or the older non-LM-safe KVM hacks), surely the
> driver has no business pretending to support 'precise' mode, and no
> business providing a counter value?
Huch?
According to your earlier argumentation about counter values, you want
the counter value, REALTIME, RAW combo even for gettimexattrs64() for
the pre/post timestamps in case the hardware does not support PTM, no?
> In this case, we'd want the pre_ts and post_ts to be generated by the
> common code using ktime_get_snapshot_id(cs_id) so that we have the
> corresponding system_counterval_t in the 'A' parts of the ABA tuple...
> but the device part of it can't pull one out of thin air unless it is
> *part* of the device reading, surely?
Correct, but where does the proposed patch set implement any of this?
In the ena driver it sets cs_id and cs_value to 0, which will be
replaced by abysmal hacks within no time because someone figures out
that adding these values will give "better" results ....
Neither the gettimexattrs64() nor the getcrosststampattrs() have any
coherent way to fill in the CS related data, but you are pushing for
adding this infrastructure first before thinking about the
consequences. You can't be serious about that.
>> I really have to ask why we put a lot of effort into consolidated
>> infrastructure, when every drivers/foo/ subsystem decides it needs it's
>> own absymal hacks just because sitting down and extending the generic
>> infrastructure is asked too much. TBH, that's frustrating as hell and
>> no, none of these usecases is so special that it would justify a single
>> one of these hacks.
>
> Meh. I'm literally here trying to join the dots between timekeeping,
> PTP and KVM, and come up with a holistic solution that makes sense for
> all the use cases. And my brain hurts... :)
Welcome to the wonderful world of time(keeping). Once you have done that
for 20 years your brain stops to hurt :)
Thanks,
tglx
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH net-next v8 3/3] gve: implement PTP gettimex64
2026-05-21 22:43 ` Thomas Gleixner
@ 2026-05-22 0:12 ` David Woodhouse
2026-05-22 7:49 ` Thomas Gleixner
2026-05-22 10:34 ` David Woodhouse
1 sibling, 1 reply; 19+ messages in thread
From: David Woodhouse @ 2026-05-22 0:12 UTC (permalink / raw)
To: Thomas Gleixner, Harshitha Ramamurthy, netdev, Arthur Kiyanovski
Cc: joshwash, andrew+netdev, davem, edumazet, kuba, pabeni,
richardcochran, jstultz, sboyd, willemb, nktgrg, jfraker,
ziweixiao, maolson, jordanrhee, thostet, alok.a.tiwari,
pkaligineedi, horms, jacob.e.keller, yyd, jefrogers, linux-kernel,
Naman Gulati, Thomas Weißschuh
[-- Attachment #1: Type: text/plain, Size: 8916 bytes --]
On Fri, 2026-05-22 at 00:43 +0200, Thomas Gleixner wrote:
> On Thu, May 21 2026 at 20:59, David Woodhouse wrote:
> > I didn't spent *that* long trying, but I couldn't see a way of doing it
> > centrally, which didn't end up making me sadder than the initial
> > duplication.
> >
> > Honestly, whether it was intent or omission, I suspect the approach
> > Harshitha took for GVE is the right one: Support TSC, and anyone who
> > uses KVM clock doesn't deserve accurate time anyway.
>
> I agree with the sentiment, but the implementation is just bonkers TBH.
>
> The driver should not even care about TSC/ARM_XX. Drivers have to be
> mostly agnostic of this and we went to a great length for that with the
> existing PTM core support. The core might need extensions for the pre/post
> muck, but that's not a justification for hacks like this GVE stuff.
I haven't looked hard at what GVE is trying to do; only glanced at the
consistency of CSID_X86_TSC vs. CSID_X86_KVM_CLK handling as I was
already sad about that one.
> > <... migration...>
> Anything else is just crystal ball magic, which is what KVM has today....
Yeah, we digress. There's a 30-odd patch series on the kvm list to
clean up guest tsc/kvmclock handling and migration, and document the
process, if you want to come play. But honestly, there aren't enough
drugs in the world for someone to want to get involved in that crap
*voluntarily*. I recommend you don't ;)
> > So surely a device can only support this if it *does* have a precise
> > hardware timestamp, e.g. from ART? And the corresponding sys_realtime
> > and sys_monoraw fields also have to be from the *same* value, surely?
>
> That is correct, but you still fail to explain how this new made up
> att.counter_id/att.counter_value is filled in coherently without the
> driver doing something abysmal?
>
> You can't explain it, because the only way to do that correctly is by
> extending the cross time stamp struct and get_device_system_crosststamp().
Yes. Isn't that what I asked Arthur to do?
> > Is that not how it works? If not.... WTF *is* it for?
>
> Yes, that's the way it works since get_device_system_crosststamp() got
> introduced, but that does not give a random driver access to the actual
> converted (TSC) counter value. The driver can only observe the PTM value
> which is ART on x86 and whatever on other architectures. So it does not
> know anything about it.
Sure, the driver can only return what it *knows*.
We're certainly missing some infrastructure here... maybe the pci_bus
should have a CSID which corresponds to its PTM domain, which on x86
would be CSID_X86_ART?
(I really do want to get my hands on one of those PTM-capable atomic
clock TimeCard things, and have a play with implementing vmclock on
them using ART as the reference)
> > We're allowed to set *fire* to anyone who just throws a random
> > get_cycles() call in there, aren't we? That's basically just self-
> > defence, m'lud.
>
> I agree, but the way how this attr.counter_id/value extension is defined
> either requires unholy hacks to get access to the ART/TSC conversion or
> resort to get_cycles().
Never get_cycles(). Let's perhaps make it *not* need unholy hacks to
interpret a PTM counter value. Otherwise what's the point in ART and
PTM at all?
> > > While this:
> ...
>
> > > + xtstamp->sys_counter = system_counterval;
>
> ...
>
> simple core code change:
>
> > > Provides the information coherently and without any extra hacks. No?
> >
> > Yeah, that looks eminently reasonable. It isn't actually that far off
> > what Arthur did in patch 1 of the series, is it? He was adding other
> > clock quality attributes at the same time, and ended up shoe-horning
> > the cs_id and cycle_count into those instead of using a separate
> > system_counterval_t as you have here, which *is* cleaner. So yes, let's
> > get him to do that, but that part is cosmetic?
>
> No. It's not cosmetic.
>
> This patch #1 just defines new attributes, which include the CS ID and
> the CS counter value with absolutely ZERO explanation where these values
> are supposed to come from and ZERO explanation how they are coherent.
Pfft. Explanations are *definitely* cosmetic. And so is the difference
between a system_counterval_t, and a separate pair of {csid, cycles}
fields that precisely correspond to it; at least if we ignore use_nsecs
(which I have thus far).
I'll stand by calling that part cosmetic.
Not that we shouldn't fix it, but we should try to make our request of
the author clear and concise, and not catastrophise over the tweaks we
need to be made.
> It suggests that the driver can fill them in along with the other truly
> driver related data, e.g. error_bound.
>
> But as I explained before the driver _cannot_ fill them in coherently
> neither for the new gettimexattrs64() nor for the getcrosststampattrs()
> callbacks.
And yet in patch 4/7, vmclock_populate_ptp_attributes() is *given* the
count value from which the time reading is generated, puts it into the
{csid, cycles} fields which we agree should be turned into a
system_counterval_t as part of the device timestamp, and I don't see
the problem.
> So you add an extension to the existing interface without core support,
> which means once that is merged the driver crowd will happily take any
> shortcut to support it.
Please be specific. Aside from the values being a system_counterval_t
in each timestamp (both pre/post and, for those devices which can, the
device time), what is missing?
> Are _you_ going to mop up the resulting mess?
>
> > > Extending get_device_system_crosststamp() for AUX clocks is on my todo
> > > list for a long time, but I did not come around to it yet. It's not
> > > really complicated to make that work.
> > >
> > > The actual ena_phc_gettimexattrs64() implementation does not provide the
> > > counter value at all despite the fact that it could trivially do so with
> > > a modified ktime_get_snapshot() except for AUX clock IDs, but that's a
> > > solvable problem. See uncompiled PoC below.
> >
> > Huh? Unless it has PTM and is converting from ART (or is something
> > virtual like vmclock or the older non-LM-safe KVM hacks), surely the
> > driver has no business pretending to support 'precise' mode, and no
> > business providing a counter value?
>
> Huch?
>
> According to your earlier argumentation about counter values, you want
> the counter value, REALTIME, RAW combo even for gettimexattrs64() for
> the pre/post timestamps in case the hardware does not support PTM, no?
For the pre/post timestamps, sure. Not the *device* timestamp when it
doesn't support that.
> > In this case, we'd want the pre_ts and post_ts to be generated by the
> > common code using ktime_get_snapshot_id(cs_id) so that we have the
> > corresponding system_counterval_t in the 'A' parts of the ABA tuple...
> > but the device part of it can't pull one out of thin air unless it is
> > *part* of the device reading, surely?
>
> Correct, but where does the proposed patch set implement any of this?
It doesn't yet; I only asked for it this evening. Should be in the next
round.
> In the ena driver it sets cs_id and cs_value to 0,
Good. I hadn't looked hard at the ENA driver yet. So I asked Arthur if
he'd tried to expose a cycle value for the ENA PHC and the answer was
essentially "of course not; it doesn't have PTM".
> which will be replaced by abysmal hacks within no time because
> someone figures out that adding these values will give "better"
> results ....
Those were the ones we were allowed to set fire to, right?
> Neither the gettimexattrs64() nor the getcrosststampattrs() have any
> coherent way to fill in the CS related data, but you are pushing for
> adding this infrastructure first before thinking about the
> consequences. You can't be serious about that.
I do not believe I agree. I'm adding the infrastructure to report this
information because we *do* have it, and userspace wants to see it.
Some drivers *do* have it. The vmclock gettimex64 method literally just
calls ktime_get_snapshot(), populates both its pre and post timestamps
with the *same* result, and the device timestamp is calculated from the
same counter value at that precise moment.
It can *absolutely* also return the actual counter value used, if the
infrastructure supports that. As can the ptp_kvm drivers. And, a device
with PTM could report that too (although we do need a little more work
on cleanly converting that to an understandable domain, as you say).
And even for drivers which don't have simultaneous cycle readings,
userspace still wants to get the counter values corresponding to the
pre/post readings, which again can absolutely be supported.
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH net-next v8 3/3] gve: implement PTP gettimex64
2026-05-22 0:12 ` David Woodhouse
@ 2026-05-22 7:49 ` Thomas Gleixner
2026-05-22 8:56 ` David Woodhouse
0 siblings, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2026-05-22 7:49 UTC (permalink / raw)
To: David Woodhouse, Harshitha Ramamurthy, netdev, Arthur Kiyanovski
Cc: joshwash, andrew+netdev, davem, edumazet, kuba, pabeni,
richardcochran, jstultz, sboyd, willemb, nktgrg, jfraker,
ziweixiao, maolson, jordanrhee, thostet, alok.a.tiwari,
pkaligineedi, horms, jacob.e.keller, yyd, jefrogers, linux-kernel,
Naman Gulati, Thomas Weißschuh
On Fri, May 22 2026 at 01:12, David Woodhouse wrote:
> On Fri, 2026-05-22 at 00:43 +0200, Thomas Gleixner wrote:
>> > So surely a device can only support this if it *does* have a precise
>> > hardware timestamp, e.g. from ART? And the corresponding sys_realtime
>> > and sys_monoraw fields also have to be from the *same* value, surely?
>>
>> That is correct, but you still fail to explain how this new made up
>> att.counter_id/att.counter_value is filled in coherently without the
>> driver doing something abysmal?
>>
>> You can't explain it, because the only way to do that correctly is by
>> extending the cross time stamp struct and get_device_system_crosststamp().
>
> Yes. Isn't that what I asked Arthur to do?
You maybe asked him, but I can't see any evidence of it anywhere. All I
can see is a magic struct attr, which conveys CS related data that is
supposed to be produced magically at some undefined place.
>> > Is that not how it works? If not.... WTF *is* it for?
>>
>> Yes, that's the way it works since get_device_system_crosststamp() got
>> introduced, but that does not give a random driver access to the actual
>> converted (TSC) counter value. The driver can only observe the PTM value
>> which is ART on x86 and whatever on other architectures. So it does not
>> know anything about it.
>
> Sure, the driver can only return what it *knows*.
>
> We're certainly missing some infrastructure here... maybe the pci_bus
> should have a CSID which corresponds to its PTM domain, which on x86
> would be CSID_X86_ART?
Yes, that's correct. But that does not solve the underlying problem:
The driver gets the PTP/SystemTime pair from the hardware/firmware,
which is what it hands back to get_device_system_crosststamp(). What
does SystemTime mean?
Here it gets interesting as that depends on the NIC/PTP magic firmware
and the architecture:
On X86:
- The plain ART value, which is converted to TSC in
get_device_system_crosststamp().
- The ART value which is magically converted to "nanoseconds" by
the firmware.
That is conveyed in system_counterval_t and the 'use_nsec' flag,
which is a misnomer, determines the conversion in the core.
- If set it uses the nominal frequency of the underlying base clock
(ART) for conversion.
- If not it uses the nominator/denominator pair of the underlying base
clock (ART) for conversion.
On ARM64:
- The ARM ARCH counter value (however that is converted from PTM
magically)
- The ARM ARCH counter which is magically converted to "nanoseconds"
by the firmware. Actually it's not, but the drivers claim it is.
It does not matter because the core does no conversion for ARM64 as
the SystemTime CSID is the same as the core timekeeper CSID.
Q: How is a driver supposed to fill in attr->cs_cycles coherently?
A: Not at all.
The consequence of this is that the next driver writer will just go and
do:
attr->cs_id = IS_ENABLED(X86) ? CSID_X86_TSC : CSID_ARM_COUNTER;
attr->cs_cycles = get_cycles();
or
attr->cs_cycles = ptm_cycles * pulledout_of_thin_air_factor +
pulledout_of_thin_air_offset;
and this mess is going to proliferate faster than you can breathe.
The same applies to the pre/post timestamp mechanism for
gettimexattrs64().
These new attr IOCTLs are not restricted to your favorite vmclock PTP
toy. They are available for every PTP driver which implements them, but
there is no infrastructure to actually provide the required data. That
ENA driver implementing gettimexattrs64() is demonstrating it. It at
least does not try to make them up, but that's just a matter of time to
happen.
I know you don't care because that's outside of your sandpit, but I care
because providing this interface is going to result in random drivers
growing the most absurd warts just to work around the lack of core
infrastructure.
Are you going to monitor drivers/net/ for those instances and make sure
they are burned before seeing the light of day?
Surely not and neither is netdev going to care as demonstrated with
this GEV mess.
It took me half an hour to add basic support for that into the core
infrastructure, but this new IOCTL stuff has been in the making for
almost a year without anyone spending a single brain cycle to even think
about it.
And you just have proven it by still claiming that this is a completely
cosmetic problem.
Seriously?
>> I agree, but the way how this attr.counter_id/value extension is defined
>> either requires unholy hacks to get access to the ART/TSC conversion or
>> resort to get_cycles().
> Never get_cycles(). Let's perhaps make it *not* need unholy hacks to
> interpret a PTM counter value. Otherwise what's the point in ART and
> PTM at all?
Perhaps? There is no perhaps and aside of that I gave you the solution
for the problem already in the mail you just replied to, no?
I'm glad we agree on "Never get_cycles()" at least. May I ask you then
to get a stiff drink ready and run:
# git blame drivers/ptp/ptp_vmclock.c | grep get_cycles
:)
Thanks,
tglx
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH net-next v8 3/3] gve: implement PTP gettimex64
2026-05-22 7:49 ` Thomas Gleixner
@ 2026-05-22 8:56 ` David Woodhouse
2026-05-22 14:41 ` Thomas Gleixner
0 siblings, 1 reply; 19+ messages in thread
From: David Woodhouse @ 2026-05-22 8:56 UTC (permalink / raw)
To: Thomas Gleixner, Harshitha Ramamurthy, netdev, Arthur Kiyanovski
Cc: joshwash, andrew+netdev, davem, edumazet, kuba, pabeni,
richardcochran, jstultz, sboyd, willemb, nktgrg, jfraker,
ziweixiao, maolson, jordanrhee, thostet, alok.a.tiwari,
pkaligineedi, horms, jacob.e.keller, yyd, jefrogers, linux-kernel,
Naman Gulati, Thomas Weißschuh
[-- Attachment #1: Type: text/plain, Size: 8367 bytes --]
On Fri, 2026-05-22 at 09:49 +0200, Thomas Gleixner wrote:
> On Fri, May 22 2026 at 01:12, David Woodhouse wrote:
> > On Fri, 2026-05-22 at 00:43 +0200, Thomas Gleixner wrote:
> > > > So surely a device can only support this if it *does* have a precise
> > > > hardware timestamp, e.g. from ART? And the corresponding sys_realtime
> > > > and sys_monoraw fields also have to be from the *same* value, surely?
> > >
> > > That is correct, but you still fail to explain how this new made up
> > > att.counter_id/att.counter_value is filled in coherently without the
> > > driver doing something abysmal?
> > >
> > > You can't explain it, because the only way to do that correctly is by
> > > extending the cross time stamp struct and get_device_system_crosststamp().
> >
> > Yes. Isn't that what I asked Arthur to do?
>
> You maybe asked him, but I can't see any evidence of it anywhere. All I
> can see is a magic struct attr, which conveys CS related data that is
> supposed to be produced magically at some undefined place.
https://lore.kernel.org/all/8b8a748a2ac6e8e3d970f5e74a2774133e7e7d8c.camel@infradead.org/
> > > > Is that not how it works? If not.... WTF *is* it for?
> > >
> > > Yes, that's the way it works since get_device_system_crosststamp() got
> > > introduced, but that does not give a random driver access to the actual
> > > converted (TSC) counter value. The driver can only observe the PTM value
> > > which is ART on x86 and whatever on other architectures. So it does not
> > > know anything about it.
> >
> > Sure, the driver can only return what it *knows*.
> >
> > We're certainly missing some infrastructure here... maybe the pci_bus
> > should have a CSID which corresponds to its PTM domain, which on x86
> > would be CSID_X86_ART?
>
> Yes, that's correct. But that does not solve the underlying problem:
>
> ...
>
> Q: How is a driver supposed to fill in attr->cs_cycles coherently?
>
> A: Not at all.
>
Let me rephrase that to check my full understanding of it in context:
Q: There are drivers today which operate on the TSC or arch counter,
and they're just fine. But how is a future driver which uses PCIe
PTM supposed to full in attr->cs_cycles coherently?
A: Yeah, our PCI and platform code lack the proper arch-agnostic
support for interpreting PTM time values; a driver wanting to do
that is going to need to do some infrastructure work first to
enable a proper conversion. But that's literally what PTM/ART is
*for* so that work has to happen at some point anyway.
> The consequence of this is that the next driver writer will just go and
> do:
>
> attr->cs_id = IS_ENABLED(X86) ? CSID_X86_TSC : CSID_ARM_COUNTER;
> attr->cs_cycles = get_cycles();
> or
> attr->cs_cycles = ptm_cycles * pulledout_of_thin_air_factor +
> pulledout_of_thin_air_offset;
Driver authors are crap, sure, but that would just be egregiously wrong
and hopefully it would be caught in review. It has to be a literally
*synchronous* capture or it's pointless.
> The same applies to the pre/post timestamp mechanism for
> gettimexattrs64().
There's literally a ptp helper to generate those, which can use
ktime_get_snapshot().
> These new attr IOCTLs are not restricted to your favorite vmclock PTP
> toy. They are available for every PTP driver which implements them, but
> there is no infrastructure to actually provide the required data. That
> ENA driver implementing gettimexattrs64() is demonstrating it. It at
> least does not try to make them up, but that's just a matter of time to
> happen.
The ENA driver demonstrates nothing of the sort. It's *not* filling in
those fields because it *doesn't* have PTM and it has nothing to fill
in.
Not *all* driver authors will fill in an optional field with random
bytes just because they *can*, Thomas.
> I know you don't care because that's outside of your sandpit,
Thomas, I'd like you to stop saying that. It's insulting, and it's wrong.
Of *course* I'm trying to solve things for the general case.
But that doesn't extend to implementing random parts of PCI and
platform infrastructure to allow for hypothetical future drivers which
want to use PTM to implement the optional fields we're adding today.
This yak is deep enough already :)
And although we're having this discussion *in* the GVE thread, I've
barely even *looked* at those patches and I think I'm still calling it
a 'hypothetical future driver', because obviously if it's trying to use
PTM then it's going to need that generic support.
> Are you going to monitor drivers/net/ for those instances and make
> sure they are burned before seeing the light of day?
>
> Surely not and neither is netdev going to care as demonstrated with
> this GEV mess.
I think that message has been received loud and clear, Thomas :)
> > Never get_cycles(). Let's perhaps make it *not* need unholy hacks to
> > interpret a PTM counter value. Otherwise what's the point in ART and
> > PTM at all?
>
> Perhaps? There is no perhaps and aside of that I gave you the solution
> for the problem already in the mail you just replied to, no?
Hm, did you? I'm sorry, I must have missed it. Even rereading it this
morning, I see a lot of 'is bonkers' and 'unholy hacks' and '_cannot_
fill them in coherently' but I can't see a concrete constructive
suggestion. You meant <87bje8v0xl.ffs@tglx>, yes?
Maybe it's lost in the noise. Let's try to simplify.
• Userspace would like to be able to use PTP clocks to discipline the
actual counter which is directly accessible in userspace and by KVM
guests. Do do this, we only really need a system_counterval_t in
the pre/post timestamps, and the helpers can add that.
• *Some* drivers can actually provide a system_counterval_t in the
device reading itself, either because they're naturally based on
the CPU counter or because they have PTM. Drivers should report
what they *do* know, in the correct time domain, and not make
crap up.
• Any driver which actually wants to do this based on PTM in the
future will need to do the infrastructure work to make that work
cleanly, and avoid having to make crap up.
And if I'm understanding correctly:
• The GVE driver made crap up? Although they do so orthogonally to the
patch set we're *actually* talking about in this thread, because
they did that entirely on their own *without* reference to Arthur's
userspace API changes?
> I'm glad we agree on "Never get_cycles()" at least. May I ask you then
> to get a stiff drink ready and run:
>
> # git blame drivers/ptp/ptp_vmclock.c | grep get_cycles
That's different because it's actually using it for the *device* timestamp.
But if you see an actual problem there, I'd love to know:
/*
* When invoked for gettimex64(), fill in the pre/post system
* times. The simple case is when system time is based on the
* same counter as st->cs_id, in which case all three times
* will be derived from the *same* counter value.
*
* If the system isn't using the same counter, then the value
* from ktime_get_snapshot() will still be used as pre_ts, and
* ptp_read_system_postts() is called to populate postts after
* calling get_cycles().
*
* The conversion to timespec64 happens further down, outside
* the seq_count loop.
*/
if (sts) {
ktime_get_snapshot(&systime_snapshot);
if (systime_snapshot.cs_id == st->cs_id) {
cycle = systime_snapshot.cycles;
} else {
cycle = get_cycles();
ptp_read_system_postts(sts);
}
} else {
cycle = get_cycles();
}
delta = cycle - le64_to_cpu(st->clk->counter_value);
...
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH net-next v8 3/3] gve: implement PTP gettimex64
2026-05-22 8:56 ` David Woodhouse
@ 2026-05-22 14:41 ` Thomas Gleixner
0 siblings, 0 replies; 19+ messages in thread
From: Thomas Gleixner @ 2026-05-22 14:41 UTC (permalink / raw)
To: David Woodhouse, Harshitha Ramamurthy, netdev, Arthur Kiyanovski
Cc: joshwash, andrew+netdev, davem, edumazet, kuba, pabeni,
richardcochran, jstultz, sboyd, willemb, nktgrg, jfraker,
ziweixiao, maolson, jordanrhee, thostet, alok.a.tiwari,
pkaligineedi, horms, jacob.e.keller, yyd, jefrogers, linux-kernel,
Naman Gulati, Thomas Weißschuh
On Fri, May 22 2026 at 09:56, David Woodhouse wrote:
> On Fri, 2026-05-22 at 09:49 +0200, Thomas Gleixner wrote:
>> > Yes. Isn't that what I asked Arthur to do?
>>
>> You maybe asked him, but I can't see any evidence of it anywhere. All I
>> can see is a magic struct attr, which conveys CS related data that is
>> supposed to be produced magically at some undefined place.
>
> https://lore.kernel.org/all/8b8a748a2ac6e8e3d970f5e74a2774133e7e7d8c.camel@infradead.org/
Right, after I pointed it out :)
>> > > > Is that not how it works? If not.... WTF *is* it for?
>> > >
>> > > Yes, that's the way it works since get_device_system_crosststamp() got
>> > > introduced, but that does not give a random driver access to the actual
>> > > converted (TSC) counter value. The driver can only observe the PTM value
>> > > which is ART on x86 and whatever on other architectures. So it does not
>> > > know anything about it.
>> >
>> > Sure, the driver can only return what it *knows*.
>> >
>> > We're certainly missing some infrastructure here... maybe the pci_bus
>> > should have a CSID which corresponds to its PTM domain, which on x86
>> > would be CSID_X86_ART?
>>
>> Yes, that's correct. But that does not solve the underlying problem:
>>
>> ...
>>
>> Q: How is a driver supposed to fill in attr->cs_cycles coherently?
>>
>> A: Not at all.
>>
>
> Let me rephrase that to check my full understanding of it in context:
>
> Q: There are drivers today which operate on the TSC or arch counter,
> and they're just fine. But how is a future driver which uses PCIe
> PTM supposed to full in attr->cs_cycles coherently?
>
> A: Yeah, our PCI and platform code lack the proper arch-agnostic
> support for interpreting PTM time values; a driver wanting to do
> that is going to need to do some infrastructure work first to
> enable a proper conversion. But that's literally what PTM/ART is
> *for* so that work has to happen at some point anyway.
That work is there:
get_device_system_cross_timestamp()
The only lacking feature is that there is no generic PCI method to tell
the driver which PMT clocksource ID it should use.
So the drivers do it manually:
system->cycles = PTM_READ();
system->cs_id = CSID_ART;
instead of doing:
system->cs_id = pci_dev_ptm_csid(pdev);
everything else is done by the core code for several years already.
The core code provides:
- PCH timestamp
- sys_realtime derived from system->cycles
- sys_monoraw derived from system->cycles
So now with your new proposed attr extenstions, you want attr to return
system->cycles to user space and you don't want the raw PTM value (ART)
but you want the corresponding TSC value which is:
TSC = M * ART + offset;
But the driver has no access to this information. That's why I'm ranting
about an interface which lacks the infrastructure to utilize it
properly.
I've seen this in the past 20 years over and over. The lack of
infrastructure makes driver people do abysmal hacks because they are
tasked to utilize the new interface. The result is a horrible mess,
which others have to clean up later when they want to do core changes.
>> The consequence of this is that the next driver writer will just go and
>> do:
>>
>> attr->cs_id = IS_ENABLED(X86) ? CSID_X86_TSC : CSID_ARM_COUNTER;
>> attr->cs_cycles = get_cycles();
>> or
>> attr->cs_cycles = ptm_cycles * pulledout_of_thin_air_factor +
>> pulledout_of_thin_air_offset;
>
> Driver authors are crap, sure, but that would just be egregiously wrong
> and hopefully it would be caught in review.
You wish. Shit get's merged faster than you can mop it up.
I agree that driver authors are interesting creatures, but you can't
whack them over the head when there is no generic infrastructure they
can use. They never will go and extend existing or provide new
infrastructure. That's on the people who add new functionality, which
will be consumed by driver writers.
> It has to be a literally *synchronous* capture or it's pointless.
You know that and I know that, but ...
>> The same applies to the pre/post timestamp mechanism for
>> gettimexattrs64().
>
> There's literally a ptp helper to generate those, which can use
> ktime_get_snapshot().
With some massaging, yes. Now that someone provided an extensible
snapshot function. :)
> Of *course* I'm trying to solve things for the general case.
>
> But that doesn't extend to implementing random parts of PCI and
> platform infrastructure to allow for hypothetical future drivers which
> want to use PTM to implement the optional fields we're adding today.
Optional fields are going to get filled when people see value in
them. You need no new drivers for that. Existing drivers hop on the new
bus as soon as it hits the road.
>> > Never get_cycles(). Let's perhaps make it *not* need unholy hacks to
>> > interpret a PTM counter value. Otherwise what's the point in ART and
>> > PTM at all?
>>
>> Perhaps? There is no perhaps and aside of that I gave you the solution
>> for the problem already in the mail you just replied to, no?
>
> Hm, did you? I'm sorry, I must have missed it. Even rereading it this
> morning, I see a lot of 'is bonkers' and 'unholy hacks' and '_cannot_
> fill them in coherently' but I can't see a concrete constructive
> suggestion. You meant <87bje8v0xl.ffs@tglx>, yes?
The demonstration patches to extend ktime_get_snapshot() and the half
assed support in get_device_system_crosststamp(). You actually asked me
to provide a SOB for the ktime_get_snapshot_id() part. No?
> Maybe it's lost in the noise. Let's try to simplify.
>
> • Userspace would like to be able to use PTP clocks to discipline the
> actual counter which is directly accessible in userspace and by KVM
> guests. Do do this, we only really need a system_counterval_t in
> the pre/post timestamps, and the helpers can add that.
>
> • *Some* drivers can actually provide a system_counterval_t in the
> device reading itself, either because they're naturally based on
> the CPU counter or because they have PTM. Drivers should report
> what they *do* know, in the correct time domain, and not make
> crap up.
Which needs support in get_device_system_crosststamp() because
that's what drivers have to use to utilize PTM.
> • Any driver which actually wants to do this based on PTM in the
> future will need to do the infrastructure work to make that work
> cleanly, and avoid having to make crap up.
That's wishful thinking. It did not happen in the last two decades
so why would this be different today?
> And if I'm understanding correctly:
>
> • The GVE driver made crap up? Although they do so orthogonally to the
> patch set we're *actually* talking about in this thread, because
> they did that entirely on their own *without* reference to Arthur's
> userspace API changes?
Correct.
Let's move this to the other thread which actually talks about your
vmclock stuff.
Thanks,
tglx
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next v8 3/3] gve: implement PTP gettimex64
2026-05-21 22:43 ` Thomas Gleixner
2026-05-22 0:12 ` David Woodhouse
@ 2026-05-22 10:34 ` David Woodhouse
2026-05-22 14:46 ` Thomas Gleixner
1 sibling, 1 reply; 19+ messages in thread
From: David Woodhouse @ 2026-05-22 10:34 UTC (permalink / raw)
To: Thomas Gleixner, Harshitha Ramamurthy, netdev, Arthur Kiyanovski
Cc: joshwash, andrew+netdev, davem, edumazet, kuba, pabeni,
richardcochran, jstultz, sboyd, willemb, nktgrg, jfraker,
ziweixiao, maolson, jordanrhee, thostet, alok.a.tiwari,
pkaligineedi, horms, jacob.e.keller, yyd, jefrogers, linux-kernel,
Naman Gulati, Thomas Weißschuh
[-- Attachment #1: Type: text/plain, Size: 4313 bytes --]
On Fri, 2026-05-22 at 00:43 +0200, Thomas Gleixner wrote:
>
> Fine, but you can only do that accurately if you have the relevant
> parameters:
>
I had elided this discussion about KVM migration as a digression, but
I'm coming back to it because I think it could use the
clock_get_reference_time() thing we were talking about.
> 1) Guest TSC value at freeze
> 2) Guest nominal TSC frequency
> 3) Old host REALTIME at freeze - Ideally you use TAI
> 4) New host TSC frequency
> 5) New host TSC/REALTIME/TAI snapshot
>
> #1 is a KVM problem, but see #3
>
> #2 ideally communicated from the guest to the host after early
> initialization at boot.
>
> You really want this information because the guest won't change the
> mult/shift pair for it ever.
If *tell* the guest the frequency in CPUID, then it shouldn't be trying
to manually calibrate it against an emulated PIT while suffering steal
time, and its mult/shift should have a little bit less entropy.
Even a system which *has* to do that crappy calibration still does it
with a lot more *precision* than accuracy, so I suspect we ought to be
rounding the result to the nearest 1MHz as long as that's within 10PPM
or something like that. But that really *is* a digression :)
> That is more accurate than using the scaling factor on the old host
> plus the old host frequency. Especially if you are migrating the VM
> several times because every migration results obviously in rounding
> errors, while if you use the nominal guest frequency you keep this
> back and forth conversions and the resulting errors out of the way.
>
> #3 That's obviously related to #1
>
> The old host knows the guest TSC scaling and the guest offset, so
> it can take a coherent snapshot on the host side of REALTIME/TAI
> and the host TSC value, which is convertible to the guest TSC
> value.
>
> #4, 5
>
> are obviously required to set up the new host scaling/offset for
> the guest TSC, but if based on the nominal assumed TSC frequency of
> the guest it becomes actually accurate and usable across several
> migrations.
>
> Anything else is just crystal ball magic, which is what KVM has today....
The model I'm enabling and documenting for KVM migration is basically
within the noise of what you describe above, yes.
But if we want to give the illusion of the TSC just ticking away while
the guest happens to experience a little steal time, when in fact it's
been completely migrated to a new host, we actually want to work with
the *true* running frequency of the TSC at the moment of migration.
So...
1) Use clock_get_time_reference() to get a { host tsc, time, rate }
from the source host at 'freeze' time.
2) Use clock_get_time_reference() to get a { host tsc, time, rate }
from the destination host, when resuming.
3) (Optionally) scale the guest's TSC frequency, not by the *nominal*
rates, but by the *actual* ratio of the rates from (1) and (2)
above (plus any original nominal scaling of the guest's TSC from
the original host).
4) Calculate the guest TSC *offset* in order to convey the effect
that the guest's TSC continued to tick at the rate from (1),
during the time period between (1) and (2).
5) (Optionally) Once the guest is running, slowly undo the scaling
in (1) in order to get the guest back to a nice simple unscaled
TSC (or scaled only by nominal frequencies as it was when launched)
Obviously, a dedicated environment which disciplines its TSC directly
can do all of that right now already because it *has* all the
information it would get from clock_get_time_reference().
But as you know perfectly well, Thomas, I'm never happy to keep the
blinkers on and focus only on my specific use case at hand; I want this
to work for the *general* case, including people running QEMU in a
fairly standard environment. And I think clock_get_time_reference()
might be a reasonable way of doing that, and a fairly clean counterpart
to the clock_set_time_reference() you suggested?
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: [PATCH net-next v8 3/3] gve: implement PTP gettimex64
2026-05-22 10:34 ` David Woodhouse
@ 2026-05-22 14:46 ` Thomas Gleixner
0 siblings, 0 replies; 19+ messages in thread
From: Thomas Gleixner @ 2026-05-22 14:46 UTC (permalink / raw)
To: David Woodhouse, Harshitha Ramamurthy, netdev, Arthur Kiyanovski
Cc: joshwash, andrew+netdev, davem, edumazet, kuba, pabeni,
richardcochran, jstultz, sboyd, willemb, nktgrg, jfraker,
ziweixiao, maolson, jordanrhee, thostet, alok.a.tiwari,
pkaligineedi, horms, jacob.e.keller, yyd, jefrogers, linux-kernel,
Naman Gulati, Thomas Weißschuh
On Fri, May 22 2026 at 11:34, David Woodhouse wrote:
> On Fri, 2026-05-22 at 00:43 +0200, Thomas Gleixner wrote:
>> 1) Guest TSC value at freeze
>> 2) Guest nominal TSC frequency
>> 3) Old host REALTIME at freeze - Ideally you use TAI
>> 4) New host TSC frequency
>> 5) New host TSC/REALTIME/TAI snapshot
>>
>> #1 is a KVM problem, but see #3
>>
>> #2 ideally communicated from the guest to the host after early
>> initialization at boot.
>>
>> You really want this information because the guest won't change the
>> mult/shift pair for it ever.
>
> If *tell* the guest the frequency in CPUID, then it shouldn't be trying
> to manually calibrate it against an emulated PIT while suffering steal
> time, and its mult/shift should have a little bit less entropy.
They are identical on every boot evaluation.
> Even a system which *has* to do that crappy calibration still does it
> with a lot more *precision* than accuracy, so I suspect we ought to be
> rounding the result to the nearest 1MHz as long as that's within 10PPM
> or something like that. But that really *is* a digression :)
:)
> The model I'm enabling and documenting for KVM migration is basically
> within the noise of what you describe above, yes.
>
> But if we want to give the illusion of the TSC just ticking away while
> the guest happens to experience a little steal time, when in fact it's
> been completely migrated to a new host, we actually want to work with
> the *true* running frequency of the TSC at the moment of migration.
>
> So...
>
> 1) Use clock_get_time_reference() to get a { host tsc, time, rate }
> from the source host at 'freeze' time.
>
> 2) Use clock_get_time_reference() to get a { host tsc, time, rate }
> from the destination host, when resuming.
>
> 3) (Optionally) scale the guest's TSC frequency, not by the *nominal*
> rates, but by the *actual* ratio of the rates from (1) and (2)
> above (plus any original nominal scaling of the guest's TSC from
> the original host).
>
> 4) Calculate the guest TSC *offset* in order to convey the effect
> that the guest's TSC continued to tick at the rate from (1),
> during the time period between (1) and (2).
>
> 5) (Optionally) Once the guest is running, slowly undo the scaling
> in (1) in order to get the guest back to a nice simple unscaled
> TSC (or scaled only by nominal frequencies as it was when launched)
>
>
> Obviously, a dedicated environment which disciplines its TSC directly
> can do all of that right now already because it *has* all the
> information it would get from clock_get_time_reference().
>
> But as you know perfectly well, Thomas, I'm never happy to keep the
> blinkers on and focus only on my specific use case at hand; I want this
> to work for the *general* case, including people running QEMU in a
> fairly standard environment. And I think clock_get_time_reference()
> might be a reasonable way of doing that, and a fairly clean counterpart
> to the clock_set_time_reference() you suggested?
Agreed.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next v8 0/3] gve: add support for PTP gettimex64
2026-05-14 22:58 [PATCH net-next v8 0/3] gve: add support for PTP gettimex64 Harshitha Ramamurthy
` (2 preceding siblings ...)
2026-05-14 22:58 ` [PATCH net-next v8 3/3] gve: implement PTP gettimex64 Harshitha Ramamurthy
@ 2026-05-20 1:30 ` patchwork-bot+netdevbpf
3 siblings, 0 replies; 19+ messages in thread
From: patchwork-bot+netdevbpf @ 2026-05-20 1:30 UTC (permalink / raw)
To: Harshitha Ramamurthy
Cc: netdev, joshwash, andrew+netdev, davem, edumazet, kuba, pabeni,
richardcochran, jstultz, tglx, sboyd, willemb, nktgrg, jfraker,
ziweixiao, maolson, jordanrhee, thostet, alok.a.tiwari,
pkaligineedi, horms, dwmw2, jacob.e.keller, yyd, jefrogers,
linux-kernel
Hello:
This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Thu, 14 May 2026 22:58:39 +0000 you wrote:
> From: Jordan Rhee <jordanrhee@google.com>
>
> This patch series adds support to obtain near-simultaneous NIC and
> system timestamps with gettimex64. This enables daemons like
> chrony and phc2sys to synchronize the system clock to the NIC clock.
>
> GVE does not have direct register access to the NIC hardware clock, so
> it must issue an AdminQ command to read the NIC clock. Two paths
> for obtaining a cross-timestamp are implemented: a precise path using
> system counter values sampled by the device, and a fallback path using
> system counter values sampled in the driver using
> ptp_read_system_prets()/postts().
>
> [...]
Here is the summary with links:
- [net-next,v8,1/3] gve: skip error logging for retryable AdminQ commands
https://git.kernel.org/netdev/net-next/c/55687124a86f
- [net-next,v8,2/3] gve: make nic clock reads thread safe
https://git.kernel.org/netdev/net-next/c/4bbcd7e24aaa
- [net-next,v8,3/3] gve: implement PTP gettimex64
https://git.kernel.org/netdev/net-next/c/22b2aae747f7
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply [flat|nested] 19+ messages in thread