Embedded Linux development

Embedded Linux development
 help / color / mirror / Atom feed

* Re: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: David Laight @ 2026-03-30 13:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Bird, Tim, pmladek@suse.com, rostedt@goodmis.org,
	senozhatsky@chromium.org, Shashank Balaji,
	john.ogness@linutronix.de, francesco@valla.it,
	geert@linux-m68k.org, linux-embedded@vger.kernel.org,
	linux-kernel@vger.kernel.org, x86, Paolo Bonzini,
	Sean Christopherson, KVM
In-Reply-To: <87fr5ib6ks.ffs@tglx>

On Mon, 30 Mar 2026 00:42:59 +0200
Thomas Gleixner <tglx@kernel.org> wrote:

...
> If you want to have early timestamps then you have to go through the low
> level code on every architecture and do the proper sanity checks,
> enumerations, etc. no matter what. There is no guarantee that you can
> use any clock unconditionally during very early boot.
> 
> As you have to do that anyway, there is _zero_ justification for an
> extra facility side stepping sched clock, which would just create
> another pile of technical debt.
> 
> Quite the contrary, I'm going to get rid of technical debt:
> 
>   The get_cycles() related changes in tsc.h are going to end up in a
>   obviously revised formal patch tomorrow as there is exactly _zero_
>   requirement to provide this as a functional interface.
> 
>   1) The default implementation in asm-generic returns 0
> 
>      Ergo any code depending on a functional implementation is broken
>      by definition.
> 
>   2) The ops/cycles metrics are as bogus as the infamous BogoMIPS metrics
>   
>      The "cycle" counter runs on almost all contemporary CPUs at a fixed
>      frequency, which is completely unrelated to the actual CPU
>      frequency and therefore to the actual CPU cycles.
> 
>      The only realistic metric is ops/sec and nothing else and that can
>      be trivially achieved by using the generic time getter interfaces.

ops/sec is also horrid because you get all all the interrupt time getting
in the way.
Additionally it is only any use for relative comparisons done on exactly
the same hardware, and is entirely useless is you are trying to show that
a short code loop is running at/near the theoretical rate.

The only thing I've found that works is the PERF_COUNT_HW_CPU_CYCLES
counter (low 32 bits).
In userspace you do have to directly read the register (rather than using
the library function - too much overhead), and ignore any silly values
generated by interrupts and process switches.
But it is accurate enough to time single function calls and see the
effect of things like mispredicted branches and the data-dependency of
divide instructions.

But I do agree that the TSC is entirely useless for measuring what it was
originally intended to measure.

	David

> 
>      Those might end up resulting in bogus benchmark results too if the
>      underlying clocksource is jiffies, but that's again a matter of
>      accepting reality.
> 
>      If people really mandate that ops/bogo_cycles is required for the
>      very wrong reasons, then I'm happy to bring it back with a global
>      name change from get_cycles() to get_bogo_cycles() which excludes
>      it from any serious usage including printk.
> 
> Thanks,
> 
>         tglx

^ permalink raw reply

* Re: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Peter Zijlstra @ 2026-03-30 12:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Bird, Tim, pmladek@suse.com, rostedt@goodmis.org,
	senozhatsky@chromium.org, Shashank Balaji,
	john.ogness@linutronix.de, francesco@valla.it,
	geert@linux-m68k.org, linux-embedded@vger.kernel.org,
	linux-kernel@vger.kernel.org, x86, Paolo Bonzini,
	Sean Christopherson, KVM
In-Reply-To: <87fr5ib6ks.ffs@tglx>

On Mon, Mar 30, 2026 at 12:42:59AM +0200, Thomas Gleixner wrote:
> Quite the contrary, I'm going to get rid of technical debt:
> 
>   The get_cycles() related changes in tsc.h are going to end up in a
>   obviously revised formal patch tomorrow as there is exactly _zero_
>   requirement to provide this as a functional interface.
> 
>   1) The default implementation in asm-generic returns 0
> 
>      Ergo any code depending on a functional implementation is broken
>      by definition.
> 
>   2) The ops/cycles metrics are as bogus as the infamous BogoMIPS metrics
>   
>      The "cycle" counter runs on almost all contemporary CPUs at a fixed
>      frequency, which is completely unrelated to the actual CPU
>      frequency and therefore to the actual CPU cycles.
> 
>      The only realistic metric is ops/sec and nothing else and that can
>      be trivially achieved by using the generic time getter interfaces.
> 
>      Those might end up resulting in bogus benchmark results too if the
>      underlying clocksource is jiffies, but that's again a matter of
>      accepting reality.
> 
>      If people really mandate that ops/bogo_cycles is required for the
>      very wrong reasons, then I'm happy to bring it back with a global
>      name change from get_cycles() to get_bogo_cycles() which excludes
>      it from any serious usage including printk.
> 
> Thanks,
> 
>         tglx
> ---
>  arch/x86/include/asm/tsc.h        |   11 ++--
>  arch/x86/kernel/platform-quirks.c |    2 
>  arch/x86/kernel/tsc.c             |   96 ++++++++++++++++++++++++++++++--------
>  3 files changed, 85 insertions(+), 24 deletions(-)
> 
> --- a/arch/x86/include/asm/tsc.h
> +++ b/arch/x86/include/asm/tsc.h
> @@ -76,10 +76,7 @@ extern void disable_TSC(void);
>  
>  static inline cycles_t get_cycles(void)
>  {
> -	if (!IS_ENABLED(CONFIG_X86_TSC) &&
> -	    !cpu_feature_enabled(X86_FEATURE_TSC))
> -		return 0;
> -	return rdtsc();
> +	return 0;
>  }
>  #define get_cycles get_cycles
>  

So this is probably going to break a bunch of generic stuff. There is a
fair amount of 'get_cycles' usage out in the tree.

Mostly it appears to be various test code.

But I do agree that get_cycles() is terrible and should go away.

^ permalink raw reply

* RE: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Thomas Gleixner @ 2026-03-29 22:42 UTC (permalink / raw)
  To: Bird, Tim, pmladek@suse.com, rostedt@goodmis.org,
	senozhatsky@chromium.org, Shashank Balaji,
	john.ogness@linutronix.de
  Cc: francesco@valla.it, geert@linux-m68k.org,
	linux-embedded@vger.kernel.org, linux-kernel@vger.kernel.org, x86,
	Paolo Bonzini, Sean Christopherson, KVM
In-Reply-To: <87jyuvboo2.ffs@tglx>

Tim!

On Sat, Mar 28 2026 at 22:59, Thomas Gleixner wrote:
> The only clean way to solve this cleanly is moving the sched clock
> initialization to the earliest point possible and accepting that due to
> hardware, enumeration and virtualization constraints this point might be
> suboptimal. Everything else is just an attempt to defy reality.

As the wheather was lousy, I sat down for a couple of hours and
implemented the obviously incomplete PoC below for illustration.

As the FIXME comment in tsc_early_boot_setup() tells clearly, this is far
from complete and needs way more thoughts and scrutiny, but it's good
enough to prove the point:

[    0.000019] tsc: Boot sched clock enabled

As early as possible, but with at least the minimal safe guards in place.

[    0.000041] Linux version 7.0.0-rc5+ (tglx@tglx-pladt) (gcc (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44) #893 SMP PREEMPT_DYNAMIC Sun Mar 29 21:38:08 CEST 2026
[    0.000043] Command line: BOOT_IMAGE=/boot/vmlinuz-7.0.0-rc5+ root=UUID=fa44d501-1716-402a-8602-6315c3166f7b ro console=tty0 console=ttyS0,115200n8 ftrace_dump_on_oops earlyprintk=ttyS0,115200n8 tsc_early_khz=2095070 no-kvmclock

   1) tsc_early_khz is required because KVM does not provide the frequency
      leaf.

   2) no-kvmclock is required because KVM thinks it is more important than
      everything else.

Both issues are completely unrelated to the problem at hand.

[    0.009415] BIOS-provided physical RAM map:
[    0.009416] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] System RAM
...
[    0.009473] printk: legacy bootconsole [earlyser0] enabled
...
[    0.021037] Hypervisor detected: KVM

Without 'no-kvmclock' KVM would take over sched clock and reset sched
clock time to 0 because KVM... Trivial problem to be solved and left as
an exercise for the reader.

[    0.021256] last_pfn = 0x7ffdb max_arch_pfn = 0x400000000
[    0.051429] tsc: Fast TSC calibration using PIT
[    0.051699] tsc: Detected 2094.808 MHz processor
[    0.051990] tsc: Detected 2095.070 MHz TSC

That's in tsc_early_init() which enables the sched clock static key.

[    0.052871] e820: update [mem 0x00000000-0x00000fff] System RAM ==> device reserved

Unsurprisingly time continues without a gap, jumping backwards or whatever.

If you want to have early timestamps then you have to go through the low
level code on every architecture and do the proper sanity checks,
enumerations, etc. no matter what. There is no guarantee that you can
use any clock unconditionally during very early boot.

As you have to do that anyway, there is _zero_ justification for an
extra facility side stepping sched clock, which would just create
another pile of technical debt.

Quite the contrary, I'm going to get rid of technical debt:

  The get_cycles() related changes in tsc.h are going to end up in a
  obviously revised formal patch tomorrow as there is exactly _zero_
  requirement to provide this as a functional interface.

  1) The default implementation in asm-generic returns 0

     Ergo any code depending on a functional implementation is broken
     by definition.

  2) The ops/cycles metrics are as bogus as the infamous BogoMIPS metrics
  
     The "cycle" counter runs on almost all contemporary CPUs at a fixed
     frequency, which is completely unrelated to the actual CPU
     frequency and therefore to the actual CPU cycles.

     The only realistic metric is ops/sec and nothing else and that can
     be trivially achieved by using the generic time getter interfaces.

     Those might end up resulting in bogus benchmark results too if the
     underlying clocksource is jiffies, but that's again a matter of
     accepting reality.

     If people really mandate that ops/bogo_cycles is required for the
     very wrong reasons, then I'm happy to bring it back with a global
     name change from get_cycles() to get_bogo_cycles() which excludes
     it from any serious usage including printk.

Thanks,

        tglx
---
 arch/x86/include/asm/tsc.h        |   11 ++--
 arch/x86/kernel/platform-quirks.c |    2 
 arch/x86/kernel/tsc.c             |   96 ++++++++++++++++++++++++++++++--------
 3 files changed, 85 insertions(+), 24 deletions(-)

--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -76,10 +76,7 @@ extern void disable_TSC(void);
 
 static inline cycles_t get_cycles(void)
 {
-	if (!IS_ENABLED(CONFIG_X86_TSC) &&
-	    !cpu_feature_enabled(X86_FEATURE_TSC))
-		return 0;
-	return rdtsc();
+	return 0;
 }
 #define get_cycles get_cycles
 
@@ -114,6 +111,12 @@ static inline void tsc_verify_tsc_adjust
 static inline void check_tsc_sync_target(void) { }
 #endif
 
+#if defined(CONFIG_X86_TSC) && defined(CONFIG_X86_64)
+void tsc_early_boot_setup(void);
+#else
+static inline void tsc_early_boot_setup(void) { }
+#endif
+
 extern int notsc_setup(char *);
 extern void tsc_save_sched_clock_state(void);
 extern void tsc_restore_sched_clock_state(void);
--- a/arch/x86/kernel/platform-quirks.c
+++ b/arch/x86/kernel/platform-quirks.c
@@ -32,6 +32,8 @@ void __init x86_early_init_platform_quir
 
 	if (x86_platform.set_legacy_features)
 		x86_platform.set_legacy_features();
+
+	tsc_early_boot_setup();
 }
 
 bool __init x86_pnpbios_disabled(void)
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -18,6 +18,7 @@
 #include <linux/static_call.h>
 
 #include <asm/cpuid/api.h>
+#include <asm/cmdline.h>
 #include <asm/hpet.h>
 #include <asm/timer.h>
 #include <asm/vgtod.h>
@@ -48,6 +49,7 @@ EXPORT_SYMBOL(tsc_khz);
  */
 static int __read_mostly tsc_unstable;
 static unsigned int __initdata tsc_early_khz;
+static bool tsc_early_boot_enabled __ro_after_init;
 
 static DEFINE_STATIC_KEY_FALSE_RO(__use_tsc);
 
@@ -202,12 +204,12 @@ static void set_cyc2ns_scale(unsigned lo
 /*
  * Initialize cyc2ns for boot cpu
  */
-static void __init cyc2ns_init_boot_cpu(void)
+static void __init cyc2ns_init_boot_cpu(unsigned long khz)
 {
 	struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns);
 
 	seqcount_latch_init(&c2n->seq);
-	__set_cyc2ns_scale(tsc_khz, smp_processor_id(), rdtsc());
+	__set_cyc2ns_scale(khz, smp_processor_id(), rdtsc());
 }
 
 /*
@@ -231,17 +233,24 @@ static void __init cyc2ns_init_secondary
 	}
 }
 
+static __always_inline u64 __native_sched_clock(void)
+{
+	u64 tsc_now = rdtsc();
+
+	/* return the value in ns */
+	return __cycles_2_ns(tsc_now);
+}
+
 /*
  * Scheduler clock - returns current time in nanosec units.
  */
 noinstr u64 native_sched_clock(void)
 {
-	if (static_branch_likely(&__use_tsc)) {
-		u64 tsc_now = rdtsc();
+	if (static_branch_likely(&__use_tsc))
+		return __native_sched_clock();
 
-		/* return the value in ns */
-		return __cycles_2_ns(tsc_now);
-	}
+	if (tsc_early_boot_enabled)
+		return __native_sched_clock();
 
 	/*
 	 * Fall back to jiffies if there's no TSC available:
@@ -371,12 +380,12 @@ static u64 tsc_read_refs(u64 *p, int hpe
 	int i;
 
 	for (i = 0; i < MAX_RETRIES; i++) {
-		t1 = get_cycles();
+		t1 = rdtsc();
 		if (hpet)
 			*p = hpet_readl(HPET_COUNTER) & 0xFFFFFFFF;
 		else
 			*p = acpi_pm_read_early();
-		t2 = get_cycles();
+		t2 = rdtsc();
 		if ((t2 - t1) < thresh)
 			return t2;
 	}
@@ -468,13 +477,13 @@ static unsigned long pit_calibrate_tsc(u
 	outb(latch & 0xff, 0x42);
 	outb(latch >> 8, 0x42);
 
-	tsc = t1 = t2 = get_cycles();
+	tsc = t1 = t2 = rdtsc();
 
 	pitcnt = 0;
 	tscmax = 0;
 	tscmin = ULONG_MAX;
 	while ((inb(0x61) & 0x20) == 0) {
-		t2 = get_cycles();
+		t2 = rdtsc();
 		delta = t2 - tsc;
 		tsc = t2;
 		if ((unsigned long) delta < tscmin)
@@ -553,9 +562,9 @@ static inline int pit_expect_msb(unsigne
 		if (!pit_verify_msb(val))
 			break;
 		prev_tsc = tsc;
-		tsc = get_cycles();
+		tsc = rdtsc();
 	}
-	*deltap = get_cycles() - prev_tsc;
+	*deltap = rdtsc() - prev_tsc;
 	*tscp = tsc;
 
 	/*
@@ -745,19 +754,23 @@ unsigned long native_calibrate_tsc(void)
 
 static unsigned long cpu_khz_from_cpuid(void)
 {
-	unsigned int eax_base_mhz, ebx_max_mhz, ecx_bus_mhz, edx;
+	struct cpuid_regs regs;
 
-	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+	/* Do it manually as this can be called _before_ boot_cpu_data is initialized */
+	cpuid_leaf(0, &regs);
+	/* Intel only */
+	if (regs.ebx != 0x756e6547 && regs.ecx != 0x6c65746e && regs.edx != 0x49656e69)
 		return 0;
 
-	if (boot_cpu_data.cpuid_level < CPUID_LEAF_FREQ)
+	if (regs.eax < CPUID_LEAF_FREQ)
 		return 0;
 
-	eax_base_mhz = ebx_max_mhz = ecx_bus_mhz = edx = 0;
+	memset(&regs, 0, sizeof(&regs));
 
-	cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx_max_mhz, &ecx_bus_mhz, &edx);
+	cpuid_leaf(CPUID_LEAF_FREQ, &regs);
 
-	return eax_base_mhz * 1000;
+	/* EAX contains base MHz */
+	return regs.eax * 1000;
 }
 
 /*
@@ -1512,8 +1525,9 @@ static void __init tsc_enable_sched_cloc
 
 	/* Sanitize TSC ADJUST before cyc2ns gets initialized */
 	tsc_store_and_check_tsc_adjust(true);
-	cyc2ns_init_boot_cpu();
+	cyc2ns_init_boot_cpu(tsc_khz);
 	static_branch_enable(&__use_tsc);
+	tsc_early_boot_enabled = false;
 }
 
 void __init tsc_early_init(void)
@@ -1576,6 +1590,48 @@ void __init tsc_init(void)
 	detect_art();
 }
 
+/* Only 64bit guarantees that CPUID is available */
+#ifdef CONFIG_X86_64
+void __init tsc_early_boot_setup(void)
+{
+	struct cpuid_regs regs;
+	unsigned long khz = 0;
+	char optstr[64];
+
+	if (cmdline_find_option_bool(boot_command_line, "notsc"))
+		return;
+
+	// FIXME: This needs way more sanity checks vs. virt etc.
+	if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
+		return;
+
+	// Why don't we have proper leaf structs yet?
+	cpuid_leaf(0, &regs);
+	if (regs.eax < 1)
+		return;
+
+	/* TSC enabled in CPUID? */
+	cpuid_leaf(1, &regs);
+	if (!(regs.edx & BIT(4)))
+		return;
+
+	if (cmdline_find_option(boot_command_line, "tsc_early_khz", optstr, sizeof(optstr)) > 0) {
+		// Shut up must check by doing a pointless zeroing
+		if (kstrtoul(optstr, 10, &khz) < 0)
+			khz = 0;
+	}
+
+	if (!khz)
+		khz = cpu_khz_from_cpuid();
+	if (!khz)
+		return;
+
+	cyc2ns_init_boot_cpu(khz);
+	tsc_early_boot_enabled = true;
+	pr_info("Boot sched clock enabled\n");
+}
+#endif
+
 #ifdef CONFIG_SMP
 /*
  * Check whether existing calibration data can be reused.


^ permalink raw reply

* RE: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Thomas Gleixner @ 2026-03-28 21:59 UTC (permalink / raw)
  To: Bird, Tim, pmladek@suse.com, rostedt@goodmis.org,
	senozhatsky@chromium.org, Shashank Balaji,
	john.ogness@linutronix.de
  Cc: francesco@valla.it, geert@linux-m68k.org,
	linux-embedded@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <MW5PR13MB5632FC91EE0EF8754331400DFD57A@MW5PR13MB5632.namprd13.prod.outlook.com>

Tim!

On Fri, Mar 27 2026 at 18:48, Tim Bird wrote:
> Well, this is using get_cycles(), which already exists on most architectures.

The fact that get_cycles() exists does not make it a good choice. There
is a reason why anything which deals with even remotely reliable time
requirements stopped using it. It's still there as a low level
architecture specific interface and most other usage is purely
historical or wrong to begin with and should be removed completely.

A lot of people spent a significant amount of time to get rid of this
ill defined mechanism and it's just sad that they did not manage to
eliminate it completely.

> This patch just adds a funky way to use cycles (which are available
> from power on, rather than from the start of kernel timekeeping) to
> allow saving timing data for some early printks (usually about 20 to
> 60 printks).

I can see that, but I'm not accepting yet another ill defined glued on
mechanism which relies on a historical ill defined mistake.

> Also, my current plan is to back off of adjusting the offset of
> unrelated (non-pre-time_init()) printks, and limit the effect in the
> system to just those first early (pre-time_init()) printks.  The
> complication to add an offset to all following printks was just to
> avoid a discontinuity in printk timestamps, once time_init() was
> called and "real" timestamps started producing non-zeros.  Given how
> confusing this seems to have made things, I'm thinking of backing off
> of that approach.

This discontinuity results from the fact that you glued it into the
printk code and sched_clock() does not know about it.

>> printk()
>> 
>>    time_ns = local_clock();
> that's ts_nsec = local_clock()

That obviously changes the illustrative nature of my narrative
significantly. Thanks for pointing it out.

>> As this needs to be supported by the architecture/platform in any case
>> there is close to zero benefit from creating complicated generic
>> infrastructure for this.
>
> The problem with this is that tsc_early_uncalibrated() can't return
> nanoseconds until after calibration.

In theory it could for most modern x86 CPUs as they advertise the
nominal TSC frequency in CPUID. Other architectures have well known
clocksource frequencies, e.g. S390 has a known nominal frequency of 1GHz
(IIRC).

But that does not solve any of the other problems. See below.

> I don't think it's a good idea to returns cycles sometimes and nanoseconds
> at other times, from a deep-seated timing function like this.
> Also tsc_available() might itself depend on initialization that hasn't happened yet
> (in early boot).

Access to the TSC requires the X86_FEATURE_TSC bit being set, which
happens in early_cpu_init(). Before that get_cycles() returns firmly 0.

> My approach of saving cycles in ts_nsec for the early printks works
> because there's a limited number of places (only 2) inside the printk
> code where ts_nsec is accessed, meaning that the code to detect a
> cycles value instead of a nanoseconds value can be constrained to just
> those two places.  Basically, I'm doing the conversion from cycles to
> nanoseconds at printk presentation time, rather than at the time of
> printk message submission.

I know, but that again requires to add more ill defined infrastructure.

We are not aiming to add more, we want to get rid of it completely to
the extent possible.

> The approach that I originally started with
> (see https://lore.kernel.org/linux-embedded/39b09edb-8998-4ebd-a564-7d594434a981@bird.org/
> was to use hardcoded multiplier and shift values for converting from cycles
> to nanoseconds.  These multiplier and shift values would be set at kernel
> configuration time (ie, using CONFIG values).

Which makes it unusable for distro kernels and therefore a non-starter.

> There are other approaches, but none really work early enough in the
> kernel boot to not be a pain.  The goal is to provide timing info
> before: timekeeping init, jiffies startup, and even CPU features
> determination,

As I pointed out before that's wishful thinking:

   You _cannot_ access a resource before it has been determined to be
   available.

Period.

It does not matter at all if _you_ know for sure that it is the case in
_your_ personal setup.

> and to keep the effect narrow -- limited only to printks, and the
> first few pre-time_init() printk messages, at that.

Either it is solved in a generic way or we have to agree that it's not
solvable at all. Your narrow effect argument is bogus and you know that
very well.

> I'm now researching a suggestion from Shashank Balaji to use the
> existing calibration data from tsc initialization, which might
> simplify the current patch even further.  I'll make sure to CC you on
> the next version of the patch.

If you want to use the calibration data from tsc_early_init() then you
achieve exactly _nothing_ because tsc_early_init() also enables the
early sched clock on bare metal. On a VM with KVM clock available the
KVM clock setup enables the early sched clock even before that via
init_hypervisor_platform().

The early TSC init happens in setup_arch() via tsc_early_init() and it's
completely unclear whether you can always access the TSC safely before
that unconditionally due to SNP, which requires to enable the secure TSC
first. There is a reason why all of this is ordered the way it is.

While reading TSC way before that might work on bare metal and in most
VMs, it's not guaranteed to be safe unconditionally unless someone sits
down and provides proof to the contrary. As always I'm happy to be
proven wrong.

When tsc_early_init() was introduced for the very same reason you are
looking into that, quite some people spent a lot of time to come up with
a solution which was deemed safe enough to be used unconditionally.

Please consult the LKML archive for the full history. The commit links
will give you a proper starting point.

That said, I completely understand the itch you are trying to scratch
and I'm the least person to prevent an architecturally sound solution,
but I'm also the first person to NAK any attempt which is based on
uninformed claims and 'works for me' arguments.

The only clean way to solve this cleanly is moving the sched clock
initialization to the earliest point possible and accepting that due to
hardware, enumeration and virtualization constraints this point might be
suboptimal. Everything else is just an attempt to defy reality.

Thanks,

        tglx

^ permalink raw reply

* Re: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Thomas Gleixner @ 2026-03-28 15:56 UTC (permalink / raw)
  To: Petr Mladek, Bird, Tim
  Cc: Michael Kelley, rostedt@goodmis.org, senozhatsky@chromium.org,
	francesco@valla.it, geert@linux-m68k.org,
	linux-embedded@vger.kernel.org, linux-kernel@vger.kernel.org,
	John Stultz, Stephen Boyd, John Ogness
In-Reply-To: <abPqrqgXHnDu8XWf@pathway>

On Fri, Mar 13 2026 at 11:45, Petr Mladek wrote:
> [    9.296302] cpuidle: using governor menu
>
> Here the thermal governors are registered. I guess that they might
> reduce speed of some HW.

No. The frequency of the TSC and KVM clock are stable. For TSC that's
only true for CPUs which have X86_FEATURE_CONSTANT_TSC set, which is the
case for every x86 CPU shipped in the last 20 years.

Thanks,

        tglx

^ permalink raw reply

* RE: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Bird, Tim @ 2026-03-27 18:48 UTC (permalink / raw)
  To: Thomas Gleixner, pmladek@suse.com, rostedt@goodmis.org,
	senozhatsky@chromium.org, Shashank Balaji,
	john.ogness@linutronix.de
  Cc: francesco@valla.it, geert@linux-m68k.org,
	linux-embedded@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <87zf3ud92r.ffs@tglx>

Hey Thomas,

Thanks for the review...

> -----Original Message-----
> From: Thomas Gleixner <tglx@kernel.org>
> On Tue, Feb 10 2026 at 16:47, Tim Bird wrote:
> > During early boot, printk timestamps are reported as zero before
> > kernel timekeeping starts (e.g. before time_init()).  This
> > hinders boot-time optimization efforts.  This period is about 400
> > milliseconds for many current desktop and embedded machines
> > running Linux.
> >
> > Add support to save cycles during early boot, and output correct
> > timestamp values after timekeeping is initialized.  get_cycles()
> > is operational on arm64 and x86_64 from kernel start.  Add code
> > and variables to save calibration values used to later convert
> > cycle counts to time values in the early printks.  Add a config
> > to control the feature.
> >
> > This yields non-zero timestamps for printks from the very start
> > of kernel execution.  The timestamps are relative to the start of
> > the architecture-specified counter used in get_cycles
> > (e.g. the TSC on x86_64 and cntvct_el0 on arm64).
> >
> > All timestamps reflect time from processor power-on instead of
> > time from the kernel's timekeeping initialization.
> 
> Can we pretty please _not_ introduce yet another side channel to
> generate time stamps?

Well, this is using get_cycles(), which already exists on most architectures.
This patch just adds a funky way to use cycles (which are available
from power on, rather than from the start of kernel timekeeping) to
allow saving timing data for some early printks (usually about 20 to 60 printks).

Also, my current plan is to back off of adjusting the offset of unrelated (non-pre-time_init()) printks,
and limit the effect in the system to just those first early (pre-time_init()) printks.
The complication to add an offset to all following printks was just to avoid a discontinuity
in printk timestamps, once time_init() was called and "real" timestamps started producing non-zeros.
Given how confusing this seems to have made things, I'm thinking of backing
off of that approach.

> 
> printk()
> 
>    time_ns = local_clock();
that's ts_nsec = local_clock()

> 
> local_clock()
>   local_clock_noinstr()
>   	// After boot
> 	if (static_branch_likely(&__sched_clock_stable))
> 		return sched_clock_noinstr() + __sched_clock_offset;
> 
> 	// Before sched_clock_init()
> 	if (!static_branch_likely(&sched_clock_running))
> 		return sched_clock_noinstr();
> 
> 	clock = sched_clock_local(this_scd());
> 
> On x86:
> sched_clock_noinstr()
>   // bare metal
>   native_sched_clock()
>      // After TSC calibration
>      if (static_branch_likely(&__use_tsc)) {
>        ...
>      }
> 
>   // Jiffies fallback.
> 
> So the obvious solution is to expand the fallback with:
> 
>     if (tsc_available())
>     	return tsc_early_uncalibrated();
> 
>     return jiffies ....;
> 
> As this needs to be supported by the architecture/platform in any case
> there is close to zero benefit from creating complicated generic
> infrastructure for this.

The problem with this is that  tsc_early_uncalibrated() can't return nanoseconds
until after calibration.
I don't think it's a good idea to returns cycles sometimes and nanoseconds
at other times, from a deep-seated timing function like this.
Also tsc_available() might itself depend on initialization that hasn't happened yet
(in early boot).

My approach of saving cycles in ts_nsec  for the early printks works because there's
a limited number of places (only 2) inside the printk code where ts_nsec
is accessed, meaning that the code to detect a cycles value instead of a nanoseconds
value can be constrained to just those two places.  Basically, I'm doing the conversion
from cycles to nanoseconds at printk presentation time, rather than at the time of
printk message submission.

The approach that I originally started with
(see https://lore.kernel.org/linux-embedded/39b09edb-8998-4ebd-a564-7d594434a981@bird.org/
was to use hardcoded multiplier and shift values for converting from cycles
to nanoseconds.  These multiplier and shift values would be set at kernel
configuration time (ie, using CONFIG values). This approach allows generating nanosecond
values from TSC cycles from the very first kernel instruction, and hence removes the need for a
calibration step at all (but of course, it requires boot-time-optimizing developers to
build a custom kernel for their machine).  This might be a reasonable approach, as the
number of people working on reducing the boot-time of this section of the kernel
(pre-time_init()), is pretty small, IMHO.  Also, the patch for that was much simpler than this
one.

There are other approaches, but none really work early enough in the kernel boot
to not be a pain.   The goal is to provide timing info before: timekeeping init,
jiffies startup, and even CPU features determination, and to keep the effect narrow --
limited only to printks, and the first few pre-time_init() printk messages, at that.

I'm now researching a suggestion from Shashank Balaji to use the existing calibration
data from tsc initialization, which might simplify the current patch even further.  I'll make sure
to CC you on the next version of the patch.

Thanks for your review and feedback of this feature.
 -- Tim

^ permalink raw reply

* RE: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Bird, Tim @ 2026-03-27 18:04 UTC (permalink / raw)
  To: John Ogness, Michael Kelley, Petr Mladek, Shashank Balaji
  Cc: rostedt@goodmis.org, senozhatsky@chromium.org, francesco@valla.it,
	geert@linux-m68k.org, linux-embedded@vger.kernel.org,
	linux-kernel@vger.kernel.org, Thomas Gleixner, John Stultz,
	Stephen Boyd
In-Reply-To: <87fr5nx7rr.fsf@jogness.linutronix.de>

> -----Original Message-----
> From: John Ogness <john.ogness@linutronix.de>
> Hi Tim,
> 
> On 2026-03-24, "Bird, Tim" <Tim.Bird@sony.com> wrote:
> > Yeah - I was worried about other timestamps not being in synchronization
> > with my adjusted printk timestamps.  Looks like that worry was justified.
> >
> > At this point, there are two avenues:
> >
> >  1) double-down and embed the offset (from power-on rather than from time_init)
> >     all the way into local_clock() and/or whatever is providing CLOCK_MONOTONIC
> >     and CLOCK_BOOTTIME, or
> >
> >  2) back off, and abandon adding the offset to local_clock()-based printk timestamps.
> >     This would leave a discontinuity when EARLY_PRINTK_TIMES was enabled,
> >     between the (now) non-zero early printk timestamps and the ones following
> >     time_init().  This has the benefit of changing less code, and only affecting
> >     the early printk timestamps (and none of the rest of the system).  And it has
> >     the downside of leaving the possibly confusing time discontinuity
> >     early in the kernel log.  So far, I haven't seen any tools confused by this, and
> >     I can put a message before time_init() to inform humans about the switch.
> >
> > The purpose of this patch is really focused on that early period of boot, where all
> > other timing and tracing mechanisms are unavailable, and limiting the impact to
> > just those early (currently zero) timestamps seems like the best course.
> 
> I have always felt that printk timestamping should be using a
> userspace-accessible clock, such as CLOCK_MONOTONIC, rather than the CPU
> local clock. This simplifies applications coordinating their own logs
> with raw kernel logs.
> 
> I was wondering if your pre-boot timing could be used as the init values
> for CLOCK_MONOTONIC, so that CLOCK_MONOTONIC is a clean continuation of
> your pre-boot clocking.

That's interesting.  It does sound like it would solve the discontinuity problem.
However, this does fall in  the 'double-down' category of changes,
(ie embedding the offset from power-on provided by early cycles into CLOCK_MONOTONIC,
and making additional changes to printk to use that clock.)

I would have to do some more research on the clock system for this, and it's getting
a bit afield of my little patch to support getting timing info on 20 to 60 early boot printks.

> And then we could use this opportunity to switch printk to
> CLOCK_MONOTONIC.
> 
> This might also make sense if initializing CLOCK_MONOTONIC is somehow
> more straight forward than tracking an extra CPU local clock diff.

I'd have to figure out what clock is providing CLOCK_MONOTONIC, and see
how the initialization works.  In terms of complications, I'd guess it's about equal.
The code in my patch to add the early_times offset (early_ts_offset) should
resolve down to a single inline add (inside adjust_early_ts, inside print_time).

This add only happens for printk timestamps, so the overall effect on the system
is quite limited.  If we put the offset add into CLOCK_MONOTONIC, it will affect
all users (which, of course, is the point in order to harmonize timestamps between
printk and user-space, and possibly other timestamp users in the kernel).  But it
makes the effect more general. Since the purpose of the patch was just to
allow me to get timestamps for a small number of pre-time_init() printks, I really
wanted to minimize any long-term runtime impacts.

Having said that, I think it's likely there's already some offset calculation or scaling
that this might fold into for CLOCK_MONOTIC, where adjusting the offset to be
from power-on rather than from initializing of timekeeping would generate no actual
increase in the overhead of getting the CLOCK_MONOTONIC value.

I'll do some research and see if what I can find out, to see if that makes sense.
Overall, I think it would be a net positive if:
 - printk and user-space timestamps were harmonized
 - the time base for the system (0-time) was from power-on rather than from kernel
    timekeeping initialization.

Thanks for the ideas and feedback.
 -- Tim

^ permalink raw reply

* Re: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Thomas Gleixner @ 2026-03-26 13:17 UTC (permalink / raw)
  To: Tim Bird, pmladek, rostedt, john.ogness, senozhatsky
  Cc: francesco, geert, linux-embedded, linux-kernel, Tim Bird
In-Reply-To: <20260210234741.3262320-1-tim.bird@sony.com>

On Tue, Feb 10 2026 at 16:47, Tim Bird wrote:
> During early boot, printk timestamps are reported as zero before
> kernel timekeeping starts (e.g. before time_init()).  This
> hinders boot-time optimization efforts.  This period is about 400
> milliseconds for many current desktop and embedded machines
> running Linux.
>
> Add support to save cycles during early boot, and output correct
> timestamp values after timekeeping is initialized.  get_cycles()
> is operational on arm64 and x86_64 from kernel start.  Add code
> and variables to save calibration values used to later convert
> cycle counts to time values in the early printks.  Add a config
> to control the feature.
>
> This yields non-zero timestamps for printks from the very start
> of kernel execution.  The timestamps are relative to the start of
> the architecture-specified counter used in get_cycles
> (e.g. the TSC on x86_64 and cntvct_el0 on arm64).
>
> All timestamps reflect time from processor power-on instead of
> time from the kernel's timekeeping initialization.

Can we pretty please _not_ introduce yet another side channel to
generate time stamps?

printk()

   time_ns = local_clock();

local_clock()
  local_clock_noinstr()
  	// After boot
	if (static_branch_likely(&__sched_clock_stable))
		return sched_clock_noinstr() + __sched_clock_offset;

	// Before sched_clock_init()
	if (!static_branch_likely(&sched_clock_running))
		return sched_clock_noinstr();

	clock = sched_clock_local(this_scd());

On x86:
sched_clock_noinstr()
  // bare metal
  native_sched_clock()
     // After TSC calibration
     if (static_branch_likely(&__use_tsc)) {
       ...
     }

  // Jiffies fallback.

So the obvious solution is to expand the fallback with:

    if (tsc_available())
    	return tsc_early_uncalibrated();

    return jiffies ....;

As this needs to be supported by the architecture/platform in any case
there is close to zero benefit from creating complicated generic
infrastructure for this.

Thanks,

        tglx

^ permalink raw reply

* RE: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: John Ogness @ 2026-03-26  9:24 UTC (permalink / raw)
  To: Bird, Tim, Michael Kelley, Petr Mladek, Shashank Balaji
  Cc: rostedt@goodmis.org, senozhatsky@chromium.org, francesco@valla.it,
	geert@linux-m68k.org, linux-embedded@vger.kernel.org,
	linux-kernel@vger.kernel.org, Thomas Gleixner, John Stultz,
	Stephen Boyd
In-Reply-To: <MW5PR13MB56324CC9F20373A19325061CFD48A@MW5PR13MB5632.namprd13.prod.outlook.com>

Hi Tim,

On 2026-03-24, "Bird, Tim" <Tim.Bird@sony.com> wrote:
> Yeah - I was worried about other timestamps not being in synchronization
> with my adjusted printk timestamps.  Looks like that worry was justified.
>
> At this point, there are two avenues:
>
>  1) double-down and embed the offset (from power-on rather than from time_init)
>     all the way into local_clock() and/or whatever is providing CLOCK_MONOTONIC
>     and CLOCK_BOOTTIME, or
>
>  2) back off, and abandon adding the offset to local_clock()-based printk timestamps.
>     This would leave a discontinuity when EARLY_PRINTK_TIMES was enabled,
>     between the (now) non-zero early printk timestamps and the ones following
>     time_init().  This has the benefit of changing less code, and only affecting
>     the early printk timestamps (and none of the rest of the system).  And it has
>     the downside of leaving the possibly confusing time discontinuity
>     early in the kernel log.  So far, I haven't seen any tools confused by this, and
>     I can put a message before time_init() to inform humans about the switch.
>
> The purpose of this patch is really focused on that early period of boot, where all
> other timing and tracing mechanisms are unavailable, and limiting the impact to
> just those early (currently zero) timestamps seems like the best course.

I have always felt that printk timestamping should be using a
userspace-accessible clock, such as CLOCK_MONOTONIC, rather than the CPU
local clock. This simplifies applications coordinating their own logs
with raw kernel logs.

I was wondering if your pre-boot timing could be used as the init values
for CLOCK_MONOTONIC, so that CLOCK_MONOTONIC is a clean continuation of
your pre-boot clocking.

And then we could use this opportunity to switch printk to
CLOCK_MONOTONIC.

This might also make sense if initializing CLOCK_MONOTONIC is somehow
more straight forward that tracking an extra CPU local clock diff.

John Ogness

^ permalink raw reply

* RE: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Bird, Tim @ 2026-03-24 20:07 UTC (permalink / raw)
  To: Shashank Balaji, Petr Mladek
  Cc: Michael Kelley, rostedt@goodmis.org, senozhatsky@chromium.org,
	francesco@valla.it, geert@linux-m68k.org,
	linux-embedded@vger.kernel.org, linux-kernel@vger.kernel.org,
	Thomas Gleixner, John Stultz, Stephen Boyd, John Ogness
In-Reply-To: <abVttIDVZ4y2BZAU@fedora>

Hey Shashank,

Thanks for looking into this, and providing some ideas.  See my responses below.

> -----Original Message-----
> From: Shashank Balaji <shashankbalaji02@gmail.com>
> 
> Hi Tim, Petr,
> 
> On Fri, Mar 13, 2026 at 11:45:02AM +0100, Petr Mladek wrote:
> > Finally added timekeeping maintainers and John into Cc.
> > We should have added time since v1.
> >
> > Anyway, you might see the entire history at
> > https://lore.kernel.org/all/39b09edb-8998-4ebd-a564-7d594434a981@bird.org/
> >
> >
...

> 
> Based on the discussions thus far, I see three problems currently:
> 
> 1. Three phases of timestamps on consoles printing synchronously is awkward
Agreed.

> 2. Possible userspace breakage:
> 
>     I don't know how big of a concern this is, but if the timestamps were to not start at zero, any
>     tools expecting them to, would be shocked. printk timestamp comparison with CLOCK_BOOTTIME
>     wouldn't work anymore.
> 
> 3. Counter frequency constancy requirement:
> 
>     For the early time calibration to be accurate, the counter frequency should be constant from
>     processor power on to the end of the calibration. We don't know what the firmware or the bootloader
>     may do in-between. If the counter frequency depends on cpu frequency or is just unstable, then we
>     should err on the side of caution and bail out of early times. On the other hand, if constant counter
>     frequency is architecturally guaranteed or if the cpu advertises so, then we should be good. Like,
>     x86 has one only if boot_cpu_has(X86_FEATURE_CONSTANT_TSC), rest are dubious.

I'll need to check when the data for boot_cpu_has() is initialized.  I believe it's not available
from the very first kernel instruction, but I may be misremembering.  I think almost all
x86_64 processors have invariant TSCs since about 2008, so it might be better to just
document this issue in the config.  I don't think a lot of people are optimizing this
100ms to 400ms region of early boot (that this patch is targeted at) on older processors.

> 
> Idea: Making the early timestamps start at 0 would partially address these problems:
> 
> 1. No awkward three phases of timestamps:
> 
>     The only difference would be that serial would have a number of timestamps exactly at 0, while userspace
>     wouldn't. The early time calibration data would only be used for userspace to convert the counter values
>     to timestamps with a zero origin.
> 
> 2. Userspace breakage:
> 
>     CLOCK_BOOTTIME's origin is timekeeping init. So relative to that, the origin of early timestamps would be
>     closer, but still different.
> 
> Making the early timestamps start at 0 would also lower counter frequency constancy requirement. Now the constancy
> requirement would only be between the period of kernel start to calibration end, a phase in which the kernel is
> in full control. Maybe even if !boot_cpu_has(X86_FEATURE_CONSTANT_TSC), the calibration data would be fine to use?
> This is debatable.
> 
> A disadvantage of this method is that visibility into per-kernel duration is lost. But that can be made up for by
> adding something like:
> 
>     pr_info("Pre-kernel duration: %llu ms\n", pre_kernel_time);
> 
> But this information should be taken with a grain of salt if constant counter frequency is not guaranteed.
> 
> Another point, since constancy of counter frequency is being relied upon anyway, conversion of counter value to
> timestamp can be done later once counter frequency calibration is complete, instead of doing our own calibration.
> An arch hook like arch_counter_freq() could be provided from which we can get this info. On x86, it would be wired
> up to tsc_khz. If arch_counter_freq() returns 0, that means constant frequency is not guaranteed, so we'd bail
> out of the counter value to timestamp conversion.

Interesting.  If I can validate that local_clock() is using the same clock source (TSC) as my early_times calibration,
maybe I *can* just steal the calibration already done when initializing that clock.  I'm using the same
routine 'clocks_calc_mult_shift', and maybe it would be better to just grab the numbers that result from
the TSC clock initialization, rather than calling it again (with different numbers) myself.  I'll look into this.

Thanks very much for the feedback and ideas.
 -- Tim


^ permalink raw reply

* RE: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Bird, Tim @ 2026-03-24 19:47 UTC (permalink / raw)
  To: Michael Kelley, Petr Mladek, Shashank Balaji
  Cc: rostedt@goodmis.org, senozhatsky@chromium.org, francesco@valla.it,
	geert@linux-m68k.org, linux-embedded@vger.kernel.org,
	linux-kernel@vger.kernel.org, Thomas Gleixner, John Stultz,
	Stephen Boyd, John Ogness
In-Reply-To: <BN7PR02MB4148F06A53EA3A4061765405D442A@BN7PR02MB4148.namprd02.prod.outlook.com>

Sorry for the slow response.  See my response and plan inline below.

> -----Original Message-----
> From: Michael Kelley <mhklinux@outlook.com>
> 
> From: Petr Mladek <pmladek@ suse. com> Sent: Friday, March 13, 2026 3: 45 AM > > Finally added timekeeping maintainers and John into Cc.
> > We should have added time since v1. > > Anyway, you might see the entire history at
> 
> From: Petr Mladek <pmladek@suse.com> Sent: Friday, March 13, 2026 3:45 AM
> >
> > Finally added timekeeping maintainers and John into Cc.
> > We should have added time since v1.
> >
> > Anyway, you might see the entire history at
> > https://lore.kernel.org/all/39b09edb-8998-4ebd-a564-7d594434a981@bird.org/
> >
> > On Fri 2026-03-13 04:52:40, Bird, Tim wrote:
> > > Hey Micheal,
> > >
> > > This report is very interesting.
> > >
> > > Thanks very much for trying it out!
> > >
> > > > -----Original Message-----
> > > > From: Michael Kelley <mhklinux@outlook.com>
> > > > Sent: Wednesday, March 11, 2026 9:47 AM
> > > > From: Tim Bird <tim.bird@sony.com> Sent: Tuesday, February 10, 2026 3:48 PM
> > > > >
> > > > > During early boot, printk timestamps are reported as zero before
> > > > > kernel timekeeping starts (e.g. before time_init()).  This
> > > > > hinders boot-time optimization efforts.  This period is about 400
> > > > > milliseconds for many current desktop and embedded machines
> > > > > running Linux.
> > > > >
> > > > > Add support to save cycles during early boot, and output correct
> > > > > timestamp values after timekeeping is initialized.  get_cycles()
> > > > > is operational on arm64 and x86_64 from kernel start.  Add code
> > > > > and variables to save calibration values used to later convert
> > > > > cycle counts to time values in the early printks.  Add a config
> > > > > to control the feature.
> > > > >
> > > > > This yields non-zero timestamps for printks from the very start
> > > > > of kernel execution.  The timestamps are relative to the start of
> > > > > the architecture-specified counter used in get_cycles
> > > > > (e.g. the TSC on x86_64 and cntvct_el0 on arm64).
> > > > >
> > > > > All timestamps reflect time from processor power-on instead of
> > > > > time from the kernel's timekeeping initialization.
> > > >
> > > > I tried this patch in linux-next20260302 kernel running as a guest VM
> > > > on a Hyper-V host. Two things:
> > > >
> > > > 1) In the dmesg output, I'm seeing a place where the timestamps briefly go
> > > > backwards -- i.e., they are not monotonically increasing. Here's a snippet,
> > > > where there's a smaller timestamp immediately after the tsc is detected:
> > > >
> > > > [   27.994891] SMBIOS 3.1.0 present.
> > > > [   27.994893] DMI: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025
> > > > [   27.994898] DMI: Memory slots populated: 2/2
> > > > [   27.995202] Hypervisor detected: Microsoft Hyper-V
> > > > [   27.995205] Hyper-V: privilege flags low 0xae7f, high 0x3b8030, ext 0x62, hints 0xa0e24, misc 0xe0bed7b2
> > > > [   27.995208] Hyper-V: Nested features: 0x0
> > > > [   27.995209] Hyper-V: LAPIC Timer Frequency: 0xc3500
> > > > [   27.995210] Hyper-V: Using hypercall for remote TLB flush
> > > > [   27.995216] clocksource: hyperv_clocksource_tsc_page: mask: 0xffffffffffffffff max_cycles: 0x24e6a1710, max_idle_ns: 440795202120
> ns
> > > > [   27.995218] clocksource: hyperv_clocksource_msr: mask: 0xffffffffffffffff max_cycles: 0x24e6a1710, max_idle_ns: 440795202120 ns
> > > > [   27.995220] tsc: Detected 2918.401 MHz processor
> > >
> > > I wonder if the tsc is getting fiddled with or virtualized somewhere
> > > in here, as part of clocksource initialization. I believe each clocksource in
> > > the kernel maintains it's own internal offset, and maybe the offset that is
> > > being used ends up being slightly different from the cycle-counter offset
> > > that the early_times feature uses. I'm just throwing out guesses.  It's about
> > > a 4ms delta, which is pretty big.
> 
> I'm fairly certain the TSC frequency is not being fiddled with. In a guest VM on
> Hyper-V, the x86 instruction to read the TSC executes directly in hardware and
> is not virtualized. There *is* per-VM scaling of the TSC value to handle live
> migrations across virtualization hosts with different TSC frequencies, but that's
> not in play during my experiments.

OK - thanks.  I suspected this was the case.
> 
> > >
> > > > [   27.991060] e820: update [mem 0x00000000-0x00000fff] System RAM ==> device reserved
> > > > [   27.991062] e820: remove [mem 0x000a0000-0x000fffff] System RAM
> > > > [   27.991064] last_pfn = 0x210000 max_arch_pfn = 0x400000000
> > > > [   27.991065] x86/PAT: PAT support disabled because CONFIG_X86_PAT is disabled in the kernel.
> > > > [   27.991066] x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WB  WT  UC- UC
> >
> > I wonder how the calibration of the cycles is precise. I wonder if
> > the problem might be that cycles were faster right after boot than
> > later during the calibration.
> >
> > I added the following debug output on top of this patch:
> >
> > diff --git a/include/linux/early_times.h b/include/linux/early_times.h
> > index 05388dcb8573..cdb467345bcc 100644
> > --- a/include/linux/early_times.h
> > +++ b/include/linux/early_times.h
> > @@ -20,6 +20,7 @@ static inline void early_times_start_calibration(void)
> >  {
> >  	start_cycles = get_cycles();
> >  	start_ns = local_clock();
> > +	pr_info("Early printk times: started callibration: %llu ns\n", start_ns);
> >  }
> >
> >  static inline void early_times_finish_calibration(void)
> > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > index 774ffb1fa5ac..836cb03aaa6d 100644
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -2308,6 +2308,8 @@ int vprintk_store(int facility, int level,
> >  	ts_nsec = local_clock();
> >  	if (!ts_nsec)
> >  		ts_nsec = early_cycles();
> > +	else
> > +		pr_info_once("local_clock() returned non-zero timestamp: %llu nsec\n", ts_nsec);
> >
> >  	caller_id = printk_caller_id();
> >
> >
> > And it produced in my kvm:
> >
> > Let's say that start of the cycle counter is
> >
> > Start of stage A
> >
> > [    8.684438] Linux version 7.0.0-rc2-default+ (pmladek@pathway) (gcc (SUSE Linux)
> > 15.2.1 20260202, GNU ld (GNU Binutils; openSUSE Tumbleweed) 2.45.0.20251103-2)
> > #571 SMP PREEMPT_DYNAMIC Fri Mar 13 10:23:54 CET 2026
> > [    8.684442] Command line: BOOT_IMAGE=/boot/vmlinuz-7.0.0-rc2-default+
> > root=/dev/vda2 resume=/dev/disk/by-uuid/369c7453-3d16-409d-88b2-
> > 5de027891a12 mitigations=auto nosplash earlycon=uart8250,io,0x3f8,115200
> > console=ttyS0,115200 console=tty0 ignore_loglevel log_buf_len=1M
> > crashkernel=512M,high crashkernel=72M,low
> > [...]
> > [    8.696633] earlycon: uart8250 at I/O port 0x3f8 (options '115200')
> > [    8.696639] printk: legacy bootconsole [uart8250] enabled
> > [    8.731303] printk: debug: ignoring loglevel setting.
> > [    8.732349] NX (Execute Disable) protection: active
> > [    8.733447] APIC: Static calls initialized
> > [    8.734667] SMBIOS 2.8 present.
> > [    8.735358] DMI: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-2-g4f253b9b-prebuilt.qemu.org 04/01/2014
> > [    8.737285] DMI: Memory slots populated: 1/1
> > [    8.738380] Hypervisor detected: KVM
> > [    8.739151] last_pfn = 0x7ffdc max_arch_pfn = 0x400000000
> > [    8.740254] kvm-clock: Using msrs 4b564d01 and 4b564d00
> > [    8.732971] printk: local_clock() returned non-zero timestamp: 3486 nsec
> >
> > End of stage A
> >
> > This is the point where printk() started storing the values from
> > local_clock() instead of cycles.
> >
> > Start of stage B
> >
> > [    8.732971] kvm-clock: using sched offset of 252367014082295 cycles
> > [    8.735471] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> > [    8.738880] tsc: Detected 3293.776 MHz processor
> > [    8.740474] e820: update [mem 0x00000000-0x00000fff] System RAM ==> device reserved
> > [...]
> > [    8.932671] rcu: srcu_init: Setting srcu_struct sizes based on contention.
> > [    8.934047] Early printk times: started callibration: 201079507 ns
> >
> > End of stage B
> >
> > This is where we started calibration of early cycles.
> >
> > Start of stage C
> >
> > [    8.935571] Console: colour dummy device 80x25
> > [...]
> > [    9.289673] thermal_sys: Registered thermal governor 'fair_share'
> > [    9.290077] thermal_sys: Registered thermal governor 'bang_bang'
> > [    9.292290] thermal_sys: Registered thermal governor 'step_wise'
> > [    9.293530] thermal_sys: Registered thermal governor 'user_space'
> > [    9.294856] cpuidle: using governor ladder
> > [    9.296302] cpuidle: using governor menu
> >
> > Here the thermal governors are registered. I guess that they might
> > reduce speed of some HW.
> >
> > [...]
> > [   11.974147] clk: Disabling unused clocks
> >
> > Some unused clocks are disabled. I wonder if this might affect
> > counting the cycles.
> >
> > [   12.330852] Freeing unused kernel image (rodata/data gap) memory: 1500K
> > [   12.351191] Early printk times: mult=19634245, shift=26, offset=8732967929 ns
> >
> > End of stage C
> >
> > Here is the end on calibration.
> >
> > Now, if the frequence of the cycles was:
> >
> >    + was higher in the stage A when only cycles were stored
> >    + was lower in stage C when it was calibrated against local_clock()
> >
> > Then it might result in higher (calibrated) timestamps in stage A
> > and step back in stage B.
> >
> > Or something like this. It is possible that even local_clock() does
> > not have a stable frequence during the early boot.
> 
> In my VM on Hyper-V, I do see a problem with the results of your
> calibration code. Over the calibration interval, you calculate the delta
> number of nanoseconds from local_clock() and the delta number of
> TSC cycles. The delta TSC cycles divided by the delta nanoseconds
> should yield the TSC frequency. But the result of your calibration code
> is about 3.05 cycles/nsec, when the actual TSC frequency is 2.918
> cycles/nsec for the hardware I'm running on.
> 
> In a Linux VM where CONFIG_PARAVIRT=y, local_clock() eventually
> comes down to native_sched_clock(), which just reads the TSC and
> then converts to nanoseconds based on the kernel's understanding
> of the TSC frequency. So I don't think the local_clock() frequency is
> varying. But I'm thinking there are some adjustments being made
> to the value returned by local_clock() during early initialization, and
> I didn't try to track those down.

Thanks very much for this data point and information!

> 
> Hyper-V provides guest VMs with a synthetic clock (that is based
> on the TSC). As an experiment, I used that clock in the early time
> calibration, and everything worked properly. The calibration code
> produced delta nanoseconds and delta cycles that were exactly
> 2.918 cycles/nsec, and the transition from Stage A to Stage B was
> correct -- no cases of a smaller timestamp following a larger
> timestamp. So my conclusion is that the calibration is indeed
> problematic, though I haven't identified why the nanoseconds
> delta from local_clock() is smaller than it should be.

I'll do a deep dive on my calibration math and see if I can
figure out the problem.  One guess I have is that it could
be an error in precision due to not enough time accumulating in local_clock
before the start of calibration.  I plan to move the calibration start around,
as well as compare the math from my calibration with the one done by 
local_clock() to try to figure out the discrepancy.
 
> >
> > Idea: A solution might be to start calibration when printk()
> >       gets first non-zero time from local_clock.
> >
> >       Something like:
> >
> > diff --git a/include/linux/early_times.h b/include/linux/early_times.h
> > index 05388dcb8573..09d278996184 100644
> > --- a/include/linux/early_times.h
> > +++ b/include/linux/early_times.h
> > @@ -16,10 +16,13 @@ extern u64 start_ns;
> >  extern u32 early_mult, early_shift;
> >  extern u64 early_ts_offset;
> >
> > -static inline void early_times_start_calibration(void)
> > +static inline void early_times_may_start_calibration(u64 ts_ns)
> >  {
> > +	if (start_ns)
> > +		return;
> > +
> > +	start_ns = ts_ns;
> >  	start_cycles = get_cycles();
> > -	start_ns = local_clock();
> >  }
> >
> >  static inline void early_times_finish_calibration(void)
> > diff --git a/init/main.c b/init/main.c
> > index 27835270dfb5..a333b0da69cf 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -1123,9 +1123,6 @@ void start_kernel(void)
> >  	timekeeping_init();
> >  	time_init();
> >
> > -	/* This must be after timekeeping is initialized */
> > -	early_times_start_calibration();
> > -
> >  	/* This must be after timekeeping is initialized */
> >  	random_init();
> >
> > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > index 774ffb1fa5ac..19330b6b4eb2 100644
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -2306,7 +2306,9 @@ int vprintk_store(int facility, int level,
> >  	 * timestamp with respect to the caller.
> >  	 */
> >  	ts_nsec = local_clock();
> > -	if (!ts_nsec)
> > +	if (ts_nsec)
> > +		early_times_may_start_calibration(ts_nsec);
> > +	else
> >  		ts_nsec = early_cycles();
> >
> >  	caller_id = printk_caller_id();
> >
> >
> > > >
> > > > 2) A Linux VM running in the Azure cloud is also running on Hyper-V. Such a
> > > > VM typically uses cloud-init to set everything up at boot time, and cloud-init
> > > > is outputting lines to the serial console with a timestamp that looks like the
> > > > printk() timestamp, but apparently is not adjusted for the early timestamping
> > > > that this patch adds. Again, I haven't debugged what's going on -- I'm not
> > > > immediately sure of the mechanism that cloud-init uses to do output to the
> > > > serial console. The use of the Hyper-V synthetic clock source might the cause
> > > > of the problem here as well. Here's an output snippet from the serial console:
> > > >
> > > > [   20.330414] systemd[1]: Condition check resulted in OpenVSwitch configuration for cleanup being skipped.
> > > > [   20.332911] systemd[1]: Condition check resulted in Set Up Additional Binary Formats being skipped.
> > > > [   20.333257] pstore: Registered efi_pstore as persistent store backend
> > > > [   20.334360] systemd[1]: Condition check resulted in File System Check on Root Device being skipped.
> > > > [   20.338319] systemd[1]: Starting Load Kernel Modules...
> > > > [   20.341094] systemd[1]: Starting Remount Root and Kernel File Systems...
> > > > [   20.350993] systemd[1]: Starting udev Coldplug all Devices...
> > > > [   20.356255] systemd[1]: Starting Uncomplicated firewall...
> > > > [   20.361536] systemd[1]: Started Journal Service.
> > > > [   20.386902] EXT4-fs (sda1): re-mounted c02dce0c-0c40-4e6e-88af-c5a0987b0adb r/w.
> > > > [   22.532033] /dev/sr0: Can't lookup blockdev
> > > > [    7.955973] cloud-init[783]: Cloud-init v. 24.3.1-0ubuntu0~20.04.1 running 'init-local' at Wed, 11 Mar 2026 15:27:06 +0000. Up 7.48
> seconds.
> > > > [    9.933120] cloud-init[822]: Cloud-init v. 24.3.1-0ubuntu0~20.04.1 running 'init' at Wed, 11 Mar 2026 15:27:08 +0000. Up 9.82
> seconds.
> > > > [    9.935483] cloud-init[822]: ci-info: ++++++++++++++++++++++++++++++++++++++Net device
> info+++++++++++++++++++++++++++++++++++++++
> > > > [    9.937726] cloud-init[822]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
> > > > [    9.939905] cloud-init[822]: ci-info: | Device |  Up  |           Address           |      Mask | Scope  |     Hw-Address    |
> > > > [    9.942059] cloud-init[822]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
> 
> The cloud-init output to the serial console is coming from syslog, which is
> presumably writing directly to /dev/console. This output is also recorded
> in the syslog log files (/var/log/syslog in my Ubuntu system), though with
> timestamps in a text format like
> 
> 2026-03-14T08:44:24.781241-07:00
> 
> The timestamps recorded in /var/log/syslog, and as shown with the
> 'journalctl' command, are monontonic using that full date/time format. But
> 'journalctl' with the "-o short-monotonic" option shows the seconds-since-boot
> format, and in that case, the cloud-init timestamps are discontinuous with the
> kernel messages, like in the serial console output. I don't know exactly where
> journalctl gets its knowledge of the boot time, but among the possibilities are:
> 
> /proc/uptime
> /proc/stat (the "btime" field)
> 
> These are not adjusted for using early boot times. And it's not clear whether
> they should be -- I don't know what the big picture implications would be.
> And there are probably other places the boot time is available to user space.
> If using early boot times is intended to be only for occasional diagnostic use,
> then maybe living with the discontinuity is OK. I see that Shashank Balaji has
> also commented about userspace issues, which covers this syslog case.
> 
> I think all this gives a first-level explanation of what I'm seeing in a
> Hyper-V guest. I don't think any of it is specific to Hyper-V guests
> or the Hyper-V synthetic clock sources. The issues are more generic.
> Sorry. :-(

Yeah - I was worried about other timestamps not being in synchronization
with my adjusted printk timestamps.  Looks like that worry was justified.

At this point, there are two avenues:

 1) double-down and embed the offset (from power-on rather than from time_init)
    all the way into local_clock() and/or whatever is providing CLOCK_MONOTONIC
    and CLOCK_BOOTTIME, or

 2) back off, and abandon adding the offset to local_clock()-based printk timestamps.
    This would leave a discontinuity when EARLY_PRINTK_TIMES was enabled,
    between the (now) non-zero early printk timestamps and the ones following
    time_init().  This has the benefit of changing less code, and only affecting
    the early printk timestamps (and none of the rest of the system).  And it has
    the downside of leaving the possibly confusing time discontinuity
    early in the kernel log.  So far, I haven't seen any tools confused by this, and
    I can put a message before time_init() to inform humans about the switch.

The purpose of this patch is really focused on that early period of boot, where all
other timing and tracing mechanisms are unavailable, and limiting the impact to
just those early (currently zero) timestamps seems like the best course.

Of course I still need to get the calibration correct, and I'll work on that
before sending another update.
 -- Tim

^ permalink raw reply

* RE: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Bird, Tim @ 2026-03-20 18:15 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Michael Kelley, rostedt@goodmis.org, senozhatsky@chromium.org,
	francesco@valla.it, geert@linux-m68k.org,
	linux-embedded@vger.kernel.org, linux-kernel@vger.kernel.org,
	Thomas Gleixner, John Stultz, Stephen Boyd, John Ogness
In-Reply-To: <abPqrqgXHnDu8XWf@pathway>



> -----Original Message-----
> From: Petr Mladek <pmladek@suse.com>
> Finally added timekeeping maintainers and John into Cc.
> We should have added time since v1.

Thanks for catching that.  I'll try to remember to add those (and use John Ogness' correct
email) going forward.

> 
> Anyway, you might see the entire history at
> https://lore.kernel.org/all/39b09edb-8998-4ebd-a564-7d594434a981@bird.org/
> 
> 
> On Fri 2026-03-13 04:52:40, Bird, Tim wrote:
> > Hey Micheal,
> >
> > This report is very interesting.
> >
> > Thanks very much for trying it out!
> >
> > > -----Original Message-----
> > > From: Michael Kelley <mhklinux@outlook.com>
> > > Sent: Wednesday, March 11, 2026 9:47 AM
> > > From: Tim Bird <tim.bird@sony.com> Sent: Tuesday, February 10, 2026 3:48 PM
> > > >
> > > > During early boot, printk timestamps are reported as zero before
> > > > kernel timekeeping starts (e.g. before time_init()).  This
> > > > hinders boot-time optimization efforts.  This period is about 400
> > > > milliseconds for many current desktop and embedded machines
> > > > running Linux.
> > > >
> > > > Add support to save cycles during early boot, and output correct
> > > > timestamp values after timekeeping is initialized.  get_cycles()
> > > > is operational on arm64 and x86_64 from kernel start.  Add code
> > > > and variables to save calibration values used to later convert
> > > > cycle counts to time values in the early printks.  Add a config
> > > > to control the feature.
> > > >
> > > > This yields non-zero timestamps for printks from the very start
> > > > of kernel execution.  The timestamps are relative to the start of
> > > > the architecture-specified counter used in get_cycles
> > > > (e.g. the TSC on x86_64 and cntvct_el0 on arm64).
> > > >
> > > > All timestamps reflect time from processor power-on instead of
> > > > time from the kernel's timekeeping initialization.
> > >
> > > I tried this patch in linux-next20260302 kernel running as a guest VM
> > > on a Hyper-V host. Two things:
> > >
> > > 1) In the dmesg output, I'm seeing a place where the timestamps briefly go
> > > backwards -- i.e., they are not monotonically increasing. Here's a snippet,
> > > where there's a smaller timestamp immediately after the tsc is detected:
> > >
> > > [   27.994891] SMBIOS 3.1.0 present.
> > > [   27.994893] DMI: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025
> > > [   27.994898] DMI: Memory slots populated: 2/2
> > > [   27.995202] Hypervisor detected: Microsoft Hyper-V
> > > [   27.995205] Hyper-V: privilege flags low 0xae7f, high 0x3b8030, ext 0x62, hints 0xa0e24, misc 0xe0bed7b2
> > > [   27.995208] Hyper-V: Nested features: 0x0
> > > [   27.995209] Hyper-V: LAPIC Timer Frequency: 0xc3500
> > > [   27.995210] Hyper-V: Using hypercall for remote TLB flush
> > > [   27.995216] clocksource: hyperv_clocksource_tsc_page: mask: 0xffffffffffffffff max_cycles: 0x24e6a1710, max_idle_ns: 440795202120
> ns
> > > [   27.995218] clocksource: hyperv_clocksource_msr: mask: 0xffffffffffffffff max_cycles: 0x24e6a1710, max_idle_ns: 440795202120 ns
> > > [   27.995220] tsc: Detected 2918.401 MHz processor
> >
> > I wonder if the tsc is getting fiddled with or virtualized somewhere in here, as part of clocksource initialization.
> > I believe each clocksource in the kernel maintains it's own internal offset, and maybe the offset that is
> > being used ends up being slightly different from the cycle-counter offset that the early_times feature uses.
> > I'm just throwing out guesses.  It's about a 4ms delta, which is pretty big.
> >
> > > [   27.991060] e820: update [mem 0x00000000-0x00000fff] System RAM ==> device reserved
> > > [   27.991062] e820: remove [mem 0x000a0000-0x000fffff] System RAM
> > > [   27.991064] last_pfn = 0x210000 max_arch_pfn = 0x400000000
> > > [   27.991065] x86/PAT: PAT support disabled because CONFIG_X86_PAT is disabled in the kernel.
> > > [   27.991066] x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WB  WT  UC- UC
> 
> I wonder how the calibration of the cycles is precise. I wonder if
> the problem might be that cycles were faster right after boot than
> later during the calibration.

On x86_64 TSC has been stable on all processors produced since about
2008, when invariant TSC were introduced.  They shouldn't be changing
speeds or doing weird things.

> 
> I added the following debug output on top of this patch:
> 
> diff --git a/include/linux/early_times.h b/include/linux/early_times.h
> index 05388dcb8573..cdb467345bcc 100644
> --- a/include/linux/early_times.h
> +++ b/include/linux/early_times.h
> @@ -20,6 +20,7 @@ static inline void early_times_start_calibration(void)
>  {
>  	start_cycles = get_cycles();
>  	start_ns = local_clock();
> +	pr_info("Early printk times: started callibration: %llu ns\n", start_ns);
>  }
> 
>  static inline void early_times_finish_calibration(void)
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 774ffb1fa5ac..836cb03aaa6d 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2308,6 +2308,8 @@ int vprintk_store(int facility, int level,
>  	ts_nsec = local_clock();
>  	if (!ts_nsec)
>  		ts_nsec = early_cycles();
> +	else
> +		pr_info_once("local_clock() returned non-zero timestamp: %llu nsec\n", ts_nsec);
> 
>  	caller_id = printk_caller_id();
> 
> 
> And it produced in my kvm:
> 
> Let's say that start of the cycle counter is
> 
> Start of stage A
> 
> [    8.684438] Linux version 7.0.0-rc2-default+ (pmladek@pathway) (gcc (SUSE Linux) 15.2.1 20260202, GNU ld (GNU Binutils; openSUSE
> Tumbleweed) 2.45.0.20251103-2) #571 SMP PREEMPT_DYNAMIC Fri Mar 13 10:23:54 CET 2026
> [    8.684442] Command line: BOOT_IMAGE=/boot/vmlinuz-7.0.0-rc2-default+ root=/dev/vda2 resume=/dev/disk/by-uuid/369c7453-3d16-
> 409d-88b2-5de027891a12 mitigations=auto nosplash earlycon=uart8250,io,0x3f8,115200 console=ttyS0,115200 console=tty0 ignore_loglevel
> log_buf_len=1M crashkernel=512M,high crashkernel=72M,low
> [...]
> [    8.696633] earlycon: uart8250 at I/O port 0x3f8 (options '115200')
> [    8.696639] printk: legacy bootconsole [uart8250] enabled
> [    8.731303] printk: debug: ignoring loglevel setting.
> [    8.732349] NX (Execute Disable) protection: active
> [    8.733447] APIC: Static calls initialized
> [    8.734667] SMBIOS 2.8 present.
> [    8.735358] DMI: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-2-g4f253b9b-prebuilt.qemu.org 04/01/2014
> [    8.737285] DMI: Memory slots populated: 1/1
> [    8.738380] Hypervisor detected: KVM
> [    8.739151] last_pfn = 0x7ffdc max_arch_pfn = 0x400000000
> [    8.740254] kvm-clock: Using msrs 4b564d01 and 4b564d00
> [    8.732971] printk: local_clock() returned non-zero timestamp: 3486 nsec
> 
> End of stage A
> 
> This is the point where printk() started storing the values from
> local_clock() instead of cycles.

Thanks for measuring this.

That's suspiciously close to the approximate 4ms different that Michael saw.
It may be that my offset calculation is off.  Michael also mentioned that the
math seemed to be off on the calibration.  I'll review the math and run some tests
and see if I can figure out where the problem is.  Another option
is to just eliminate the early_ts_offset, and live with the discontinuity going
from cycle-based timestamps to local-clock()-based timestamps.

> 
> Start of stage B
> 
> [    8.732971] kvm-clock: using sched offset of 252367014082295 cycles
> [    8.735471] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> [    8.738880] tsc: Detected 3293.776 MHz processor
> [    8.740474] e820: update [mem 0x00000000-0x00000fff] System RAM ==> device reserved
> [...]
> [    8.932671] rcu: srcu_init: Setting srcu_struct sizes based on contention.
> [    8.934047] Early printk times: started callibration: 201079507 ns
> 
> End of stage B
> 
> This is where we started calibration of early cycles.
> 
> Start of stage C
> 
> [    8.935571] Console: colour dummy device 80x25
> [...]
> [    9.289673] thermal_sys: Registered thermal governor 'fair_share'
> [    9.290077] thermal_sys: Registered thermal governor 'bang_bang'
> [    9.292290] thermal_sys: Registered thermal governor 'step_wise'
> [    9.293530] thermal_sys: Registered thermal governor 'user_space'
> [    9.294856] cpuidle: using governor ladder
> [    9.296302] cpuidle: using governor menu
> 
> Here the thermal governors are registered. I guess that they might
> reduce speed of some HW.
> 
> [...]
> [   11.974147] clk: Disabling unused clocks
> 
> Some unused clocks are disabled. I wonder if this might affect
> counting the cycles.
> 
> [   12.330852] Freeing unused kernel image (rodata/data gap) memory: 1500K
> [   12.351191] Early printk times: mult=19634245, shift=26, offset=8732967929 ns
> 
> End of stage C
> 
> Here is the end on calibration.
> 
> Now, if the frequence of the cycles was:
> 
>    + was higher in the stage A when only cycles were stored
>    + was lower in stage C when it was calibrated against local_clock()
> 
> Then it might result in higher (calibrated) timestamps in stage A
> and step back in stage B.
> 
> Or something like this. It is possible that even local_clock() does
> not have a stable frequence during the early boot.

Most x86_64 platforms are using TSC as the basis for local_clock(), and it should
be stable (have invariant frequency) on all modern processors.
In the vast majority of cases, I expect that the cycles being used for
early_cycles is coming from the exact same base hardware (the TSC) as local_clock(), so
there shouldn't be discrepancies.
(Someone with more clock expertise than me, please correct me if I'm wrong.
Even on virtualized systems, the underlying hardware for the TSC instruction is still just the real
TSC hardware on the physical platform).

I suspect that I just got the math wrong somewhere, or there's an unaccounted-for delay
somewhere I didn't take into account when calculating the offset.  I'll review it and let
you know what I find.

> 
> Idea: A solution might be to start calibration when printk()
>       gets first non-zero time from local_clock.

Actually, I want the values used in the calibration to have accumulated enough
time for some accuracy to set in.  If I grab local_clock() too close to zero,
then the math is less accurate.

> 
>       Something like:
> 
> diff --git a/include/linux/early_times.h b/include/linux/early_times.h
> index 05388dcb8573..09d278996184 100644
> --- a/include/linux/early_times.h
> +++ b/include/linux/early_times.h
> @@ -16,10 +16,13 @@ extern u64 start_ns;
>  extern u32 early_mult, early_shift;
>  extern u64 early_ts_offset;
> 
> -static inline void early_times_start_calibration(void)
> +static inline void early_times_may_start_calibration(u64 ts_ns)
>  {
> +	if (start_ns)
> +		return;
> +
> +	start_ns = ts_ns;
>  	start_cycles = get_cycles();
> -	start_ns = local_clock();
>  }
> 
>  static inline void early_times_finish_calibration(void)
> diff --git a/init/main.c b/init/main.c
> index 27835270dfb5..a333b0da69cf 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -1123,9 +1123,6 @@ void start_kernel(void)
>  	timekeeping_init();
>  	time_init();
> 
> -	/* This must be after timekeeping is initialized */
> -	early_times_start_calibration();
> -
>  	/* This must be after timekeeping is initialized */
>  	random_init();
> 
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 774ffb1fa5ac..19330b6b4eb2 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2306,7 +2306,9 @@ int vprintk_store(int facility, int level,
>  	 * timestamp with respect to the caller.
>  	 */
>  	ts_nsec = local_clock();
> -	if (!ts_nsec)
> +	if (ts_nsec)
> +		early_times_may_start_calibration(ts_nsec);
> +	else
>  		ts_nsec = early_cycles();
> 
>  	caller_id = printk_caller_id();
> 
> 
> > >
> > > 2) A Linux VM running in the Azure cloud is also running on Hyper-V. Such a
> > > VM typically uses cloud-init to set everything up at boot time, and cloud-init
> > > is outputting lines to the serial console with a timestamp that looks like the
> > > printk() timestamp, but apparently is not adjusted for the early timestamping
> > > that this patch adds. Again, I haven't debugged what's going on -- I'm not
> > > immediately sure of the mechanism that cloud-init uses to do output to the
> > > serial console. The use of the Hyper-V synthetic clock source might the cause
> > > of the problem here as well. Here's an output snippet from the serial console:
> > >
> > > [   20.330414] systemd[1]: Condition check resulted in OpenVSwitch configuration for cleanup being skipped.
> > > [   20.332911] systemd[1]: Condition check resulted in Set Up Additional Binary Formats being skipped.
> > > [   20.333257] pstore: Registered efi_pstore as persistent store backend
> > > [   20.334360] systemd[1]: Condition check resulted in File System Check on Root Device being skipped.
> > > [   20.338319] systemd[1]: Starting Load Kernel Modules...
> > > [   20.341094] systemd[1]: Starting Remount Root and Kernel File Systems...
> > > [   20.350993] systemd[1]: Starting udev Coldplug all Devices...
> > > [   20.356255] systemd[1]: Starting Uncomplicated firewall...
> > > [   20.361536] systemd[1]: Started Journal Service.
> > > [   20.386902] EXT4-fs (sda1): re-mounted c02dce0c-0c40-4e6e-88af-c5a0987b0adb r/w.
> > > [   22.532033] /dev/sr0: Can't lookup blockdev
> > > [    7.955973] cloud-init[783]: Cloud-init v. 24.3.1-0ubuntu0~20.04.1 running 'init-local' at Wed, 11 Mar 2026 15:27:06 +0000. Up 7.48
> > > seconds.
> > > [    9.933120] cloud-init[822]: Cloud-init v. 24.3.1-0ubuntu0~20.04.1 running 'init' at Wed, 11 Mar 2026 15:27:08 +0000. Up 9.82 seconds.
> > > [    9.935483] cloud-init[822]: ci-info: ++++++++++++++++++++++++++++++++++++++Net device
> > > info+++++++++++++++++++++++++++++++++++++++
> > > [    9.937726] cloud-init[822]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
> > > [    9.939905] cloud-init[822]: ci-info: | Device |  Up  |           Address           |      Mask     | Scope  |     Hw-Address    |
> > > [    9.942059] cloud-init[822]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
> 
> This is more complicated. I wonder if the timestamps from cloud-init()
> are somehow synchronized with local_clock().

This raises an important issue, which is that anyone using local_clock() will get timestamps
relative to time_init(), whereas with the early times patch applied, printk is using timestamps
relative to power on (well, TSC init which is almost always the same thing).  I'm still a bit
confused with how cloud-init is getting it's messages intermingled with the printk
messages?  Is this a feature of systemd?

> 
> We might need to synchronize local_clock() with the cycles as well.

I thought about modifying local_clock to use early_ts_offset, so that all users
of local_clock() would get timestamps with an offset adjusted so they were
relative to power on, but that was a much more intrusive patch.

> But there is the chicken&egg problem. We need:
> 
>     + to know the offset caused by cycles when local_clock() gets initialized.
>     + local_clock() running for some time to calibrate cycles.
> 
> Hmm, I see that time-management people are not in Cc. We should have
> added them since v1.
> 
> I add them now. Better late than never.

Thanks for the feedback and testing!  I really appreciate it.
 -- Tim


^ permalink raw reply

* RE: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Michael Kelley @ 2026-03-14 16:15 UTC (permalink / raw)
  To: Petr Mladek, Bird, Tim, Shashank Balaji
  Cc: Michael Kelley, rostedt@goodmis.org, senozhatsky@chromium.org,
	francesco@valla.it, geert@linux-m68k.org,
	linux-embedded@vger.kernel.org, linux-kernel@vger.kernel.org,
	Thomas Gleixner, John Stultz, Stephen Boyd, John Ogness
In-Reply-To: <abPqrqgXHnDu8XWf@pathway>

From: Petr Mladek <pmladek@suse.com> Sent: Friday, March 13, 2026 3:45 AM
> 
> Finally added timekeeping maintainers and John into Cc.
> We should have added time since v1.
> 
> Anyway, you might see the entire history at
> https://lore.kernel.org/all/39b09edb-8998-4ebd-a564-7d594434a981@bird.org/
> 
> On Fri 2026-03-13 04:52:40, Bird, Tim wrote:
> > Hey Micheal,
> >
> > This report is very interesting.
> >
> > Thanks very much for trying it out!
> >
> > > -----Original Message-----
> > > From: Michael Kelley <mhklinux@outlook.com>
> > > Sent: Wednesday, March 11, 2026 9:47 AM
> > > From: Tim Bird <tim.bird@sony.com> Sent: Tuesday, February 10, 2026 3:48 PM
> > > >
> > > > During early boot, printk timestamps are reported as zero before
> > > > kernel timekeeping starts (e.g. before time_init()).  This
> > > > hinders boot-time optimization efforts.  This period is about 400
> > > > milliseconds for many current desktop and embedded machines
> > > > running Linux.
> > > >
> > > > Add support to save cycles during early boot, and output correct
> > > > timestamp values after timekeeping is initialized.  get_cycles()
> > > > is operational on arm64 and x86_64 from kernel start.  Add code
> > > > and variables to save calibration values used to later convert
> > > > cycle counts to time values in the early printks.  Add a config
> > > > to control the feature.
> > > >
> > > > This yields non-zero timestamps for printks from the very start
> > > > of kernel execution.  The timestamps are relative to the start of
> > > > the architecture-specified counter used in get_cycles
> > > > (e.g. the TSC on x86_64 and cntvct_el0 on arm64).
> > > >
> > > > All timestamps reflect time from processor power-on instead of
> > > > time from the kernel's timekeeping initialization.
> > >
> > > I tried this patch in linux-next20260302 kernel running as a guest VM
> > > on a Hyper-V host. Two things:
> > >
> > > 1) In the dmesg output, I'm seeing a place where the timestamps briefly go
> > > backwards -- i.e., they are not monotonically increasing. Here's a snippet,
> > > where there's a smaller timestamp immediately after the tsc is detected:
> > >
> > > [   27.994891] SMBIOS 3.1.0 present.
> > > [   27.994893] DMI: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025
> > > [   27.994898] DMI: Memory slots populated: 2/2
> > > [   27.995202] Hypervisor detected: Microsoft Hyper-V
> > > [   27.995205] Hyper-V: privilege flags low 0xae7f, high 0x3b8030, ext 0x62, hints 0xa0e24, misc 0xe0bed7b2
> > > [   27.995208] Hyper-V: Nested features: 0x0
> > > [   27.995209] Hyper-V: LAPIC Timer Frequency: 0xc3500
> > > [   27.995210] Hyper-V: Using hypercall for remote TLB flush
> > > [   27.995216] clocksource: hyperv_clocksource_tsc_page: mask: 0xffffffffffffffff max_cycles: 0x24e6a1710, max_idle_ns: 440795202120 ns
> > > [   27.995218] clocksource: hyperv_clocksource_msr: mask: 0xffffffffffffffff max_cycles: 0x24e6a1710, max_idle_ns: 440795202120 ns
> > > [   27.995220] tsc: Detected 2918.401 MHz processor
> >
> > I wonder if the tsc is getting fiddled with or virtualized somewhere
> > in here, as part of clocksource initialization. I believe each clocksource in
> > the kernel maintains it's own internal offset, and maybe the offset that is
> > being used ends up being slightly different from the cycle-counter offset
> > that the early_times feature uses. I'm just throwing out guesses.  It's about
> > a 4ms delta, which is pretty big.

I'm fairly certain the TSC frequency is not being fiddled with. In a guest VM on
Hyper-V, the x86 instruction to read the TSC executes directly in hardware and
is not virtualized. There *is* per-VM scaling of the TSC value to handle live
migrations across virtualization hosts with different TSC frequencies, but that's
not in play during my experiments.

> >
> > > [   27.991060] e820: update [mem 0x00000000-0x00000fff] System RAM ==> device reserved
> > > [   27.991062] e820: remove [mem 0x000a0000-0x000fffff] System RAM
> > > [   27.991064] last_pfn = 0x210000 max_arch_pfn = 0x400000000
> > > [   27.991065] x86/PAT: PAT support disabled because CONFIG_X86_PAT is disabled in the kernel.
> > > [   27.991066] x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WB  WT  UC- UC
> 
> I wonder how the calibration of the cycles is precise. I wonder if
> the problem might be that cycles were faster right after boot than
> later during the calibration.
> 
> I added the following debug output on top of this patch:
> 
> diff --git a/include/linux/early_times.h b/include/linux/early_times.h
> index 05388dcb8573..cdb467345bcc 100644
> --- a/include/linux/early_times.h
> +++ b/include/linux/early_times.h
> @@ -20,6 +20,7 @@ static inline void early_times_start_calibration(void)
>  {
>  	start_cycles = get_cycles();
>  	start_ns = local_clock();
> +	pr_info("Early printk times: started callibration: %llu ns\n", start_ns);
>  }
> 
>  static inline void early_times_finish_calibration(void)
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 774ffb1fa5ac..836cb03aaa6d 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2308,6 +2308,8 @@ int vprintk_store(int facility, int level,
>  	ts_nsec = local_clock();
>  	if (!ts_nsec)
>  		ts_nsec = early_cycles();
> +	else
> +		pr_info_once("local_clock() returned non-zero timestamp: %llu nsec\n", ts_nsec);
> 
>  	caller_id = printk_caller_id();
> 
> 
> And it produced in my kvm:
> 
> Let's say that start of the cycle counter is
> 
> Start of stage A
> 
> [    8.684438] Linux version 7.0.0-rc2-default+ (pmladek@pathway) (gcc (SUSE Linux)
> 15.2.1 20260202, GNU ld (GNU Binutils; openSUSE Tumbleweed) 2.45.0.20251103-2)
> #571 SMP PREEMPT_DYNAMIC Fri Mar 13 10:23:54 CET 2026
> [    8.684442] Command line: BOOT_IMAGE=/boot/vmlinuz-7.0.0-rc2-default+
> root=/dev/vda2 resume=/dev/disk/by-uuid/369c7453-3d16-409d-88b2-
> 5de027891a12 mitigations=auto nosplash earlycon=uart8250,io,0x3f8,115200
> console=ttyS0,115200 console=tty0 ignore_loglevel log_buf_len=1M
> crashkernel=512M,high crashkernel=72M,low
> [...]
> [    8.696633] earlycon: uart8250 at I/O port 0x3f8 (options '115200')
> [    8.696639] printk: legacy bootconsole [uart8250] enabled
> [    8.731303] printk: debug: ignoring loglevel setting.
> [    8.732349] NX (Execute Disable) protection: active
> [    8.733447] APIC: Static calls initialized
> [    8.734667] SMBIOS 2.8 present.
> [    8.735358] DMI: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-2-g4f253b9b-prebuilt.qemu.org 04/01/2014
> [    8.737285] DMI: Memory slots populated: 1/1
> [    8.738380] Hypervisor detected: KVM
> [    8.739151] last_pfn = 0x7ffdc max_arch_pfn = 0x400000000
> [    8.740254] kvm-clock: Using msrs 4b564d01 and 4b564d00
> [    8.732971] printk: local_clock() returned non-zero timestamp: 3486 nsec
> 
> End of stage A
> 
> This is the point where printk() started storing the values from
> local_clock() instead of cycles.
> 
> Start of stage B
> 
> [    8.732971] kvm-clock: using sched offset of 252367014082295 cycles
> [    8.735471] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> [    8.738880] tsc: Detected 3293.776 MHz processor
> [    8.740474] e820: update [mem 0x00000000-0x00000fff] System RAM ==> device reserved
> [...]
> [    8.932671] rcu: srcu_init: Setting srcu_struct sizes based on contention.
> [    8.934047] Early printk times: started callibration: 201079507 ns
> 
> End of stage B
> 
> This is where we started calibration of early cycles.
> 
> Start of stage C
> 
> [    8.935571] Console: colour dummy device 80x25
> [...]
> [    9.289673] thermal_sys: Registered thermal governor 'fair_share'
> [    9.290077] thermal_sys: Registered thermal governor 'bang_bang'
> [    9.292290] thermal_sys: Registered thermal governor 'step_wise'
> [    9.293530] thermal_sys: Registered thermal governor 'user_space'
> [    9.294856] cpuidle: using governor ladder
> [    9.296302] cpuidle: using governor menu
> 
> Here the thermal governors are registered. I guess that they might
> reduce speed of some HW.
> 
> [...]
> [   11.974147] clk: Disabling unused clocks
> 
> Some unused clocks are disabled. I wonder if this might affect
> counting the cycles.
> 
> [   12.330852] Freeing unused kernel image (rodata/data gap) memory: 1500K
> [   12.351191] Early printk times: mult=19634245, shift=26, offset=8732967929 ns
> 
> End of stage C
> 
> Here is the end on calibration.
> 
> Now, if the frequence of the cycles was:
> 
>    + was higher in the stage A when only cycles were stored
>    + was lower in stage C when it was calibrated against local_clock()
> 
> Then it might result in higher (calibrated) timestamps in stage A
> and step back in stage B.
> 
> Or something like this. It is possible that even local_clock() does
> not have a stable frequence during the early boot.

In my VM on Hyper-V, I do see a problem with the results of your
calibration code. Over the calibration interval, you calculate the delta
number of nanoseconds from local_clock() and the delta number of
TSC cycles. The delta TSC cycles divided by the delta nanoseconds
should yield the TSC frequency. But the result of your calibration code
is about 3.05 cycles/nsec, when the actual TSC frequency is 2.918
cycles/nsec for the hardware I'm running on.

In a Linux VM where CONFIG_PARAVIRT=y, local_clock() eventually
comes down to native_sched_clock(), which just reads the TSC and
then converts to nanoseconds based on the kernel's understanding
of the TSC frequency. So I don't think the local_clock() frequency is
varying. But I'm thinking there are some adjustments being made
to the value returned by local_clock() during early initialization, and
I didn't try to track those down.

Hyper-V provides guest VMs with a synthetic clock (that is based
on the TSC). As an experiment, I used that clock in the early time
calibration, and everything worked properly. The calibration code
produced delta nanoseconds and delta cycles that were exactly
2.918 cycles/nsec, and the transition from Stage A to Stage B was
correct -- no cases of a smaller timestamp following a larger
timestamp. So my conclusion is that the calibration is indeed
problematic, though I haven't identified why the nanoseconds
delta from local_clock() is smaller than it should be.

> 
> Idea: A solution might be to start calibration when printk()
>       gets first non-zero time from local_clock.
> 
>       Something like:
> 
> diff --git a/include/linux/early_times.h b/include/linux/early_times.h
> index 05388dcb8573..09d278996184 100644
> --- a/include/linux/early_times.h
> +++ b/include/linux/early_times.h
> @@ -16,10 +16,13 @@ extern u64 start_ns;
>  extern u32 early_mult, early_shift;
>  extern u64 early_ts_offset;
> 
> -static inline void early_times_start_calibration(void)
> +static inline void early_times_may_start_calibration(u64 ts_ns)
>  {
> +	if (start_ns)
> +		return;
> +
> +	start_ns = ts_ns;
>  	start_cycles = get_cycles();
> -	start_ns = local_clock();
>  }
> 
>  static inline void early_times_finish_calibration(void)
> diff --git a/init/main.c b/init/main.c
> index 27835270dfb5..a333b0da69cf 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -1123,9 +1123,6 @@ void start_kernel(void)
>  	timekeeping_init();
>  	time_init();
> 
> -	/* This must be after timekeeping is initialized */
> -	early_times_start_calibration();
> -
>  	/* This must be after timekeeping is initialized */
>  	random_init();
> 
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 774ffb1fa5ac..19330b6b4eb2 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2306,7 +2306,9 @@ int vprintk_store(int facility, int level,
>  	 * timestamp with respect to the caller.
>  	 */
>  	ts_nsec = local_clock();
> -	if (!ts_nsec)
> +	if (ts_nsec)
> +		early_times_may_start_calibration(ts_nsec);
> +	else
>  		ts_nsec = early_cycles();
> 
>  	caller_id = printk_caller_id();
> 
> 
> > >
> > > 2) A Linux VM running in the Azure cloud is also running on Hyper-V. Such a
> > > VM typically uses cloud-init to set everything up at boot time, and cloud-init
> > > is outputting lines to the serial console with a timestamp that looks like the
> > > printk() timestamp, but apparently is not adjusted for the early timestamping
> > > that this patch adds. Again, I haven't debugged what's going on -- I'm not
> > > immediately sure of the mechanism that cloud-init uses to do output to the
> > > serial console. The use of the Hyper-V synthetic clock source might the cause
> > > of the problem here as well. Here's an output snippet from the serial console:
> > >
> > > [   20.330414] systemd[1]: Condition check resulted in OpenVSwitch configuration for cleanup being skipped.
> > > [   20.332911] systemd[1]: Condition check resulted in Set Up Additional Binary Formats being skipped.
> > > [   20.333257] pstore: Registered efi_pstore as persistent store backend
> > > [   20.334360] systemd[1]: Condition check resulted in File System Check on Root Device being skipped.
> > > [   20.338319] systemd[1]: Starting Load Kernel Modules...
> > > [   20.341094] systemd[1]: Starting Remount Root and Kernel File Systems...
> > > [   20.350993] systemd[1]: Starting udev Coldplug all Devices...
> > > [   20.356255] systemd[1]: Starting Uncomplicated firewall...
> > > [   20.361536] systemd[1]: Started Journal Service.
> > > [   20.386902] EXT4-fs (sda1): re-mounted c02dce0c-0c40-4e6e-88af-c5a0987b0adb r/w.
> > > [   22.532033] /dev/sr0: Can't lookup blockdev
> > > [    7.955973] cloud-init[783]: Cloud-init v. 24.3.1-0ubuntu0~20.04.1 running 'init-local' at Wed, 11 Mar 2026 15:27:06 +0000. Up 7.48 seconds.
> > > [    9.933120] cloud-init[822]: Cloud-init v. 24.3.1-0ubuntu0~20.04.1 running 'init' at Wed, 11 Mar 2026 15:27:08 +0000. Up 9.82 seconds.
> > > [    9.935483] cloud-init[822]: ci-info: ++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++
> > > [    9.937726] cloud-init[822]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
> > > [    9.939905] cloud-init[822]: ci-info: | Device |  Up  |           Address           |      Mask | Scope  |     Hw-Address    |
> > > [    9.942059] cloud-init[822]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+

The cloud-init output to the serial console is coming from syslog, which is
presumably writing directly to /dev/console. This output is also recorded
in the syslog log files (/var/log/syslog in my Ubuntu system), though with
timestamps in a text format like

2026-03-14T08:44:24.781241-07:00

The timestamps recorded in /var/log/syslog, and as shown with the
'journalctl' command, are monontonic using that full date/time format. But
'journalctl' with the "-o short-monotonic" option shows the seconds-since-boot
format, and in that case, the cloud-init timestamps are discontinuous with the
kernel messages, like in the serial console output. I don't know exactly where
journalctl gets its knowledge of the boot time, but among the possibilities are:

/proc/uptime
/proc/stat (the "btime" field)

These are not adjusted for using early boot times. And it's not clear whether
they should be -- I don't know what the big picture implications would be.
And there are probably other places the boot time is available to user space.
If using early boot times is intended to be only for occasional diagnostic use,
then maybe living with the discontinuity is OK. I see that Shashank Balaji has
also commented about userspace issues, which covers this syslog case.

I think all this gives a first-level explanation of what I'm seeing in a
Hyper-V guest. I don't think any of it is specific to Hyper-V guests
or the Hyper-V synthetic clock sources. The issues are more generic.
Sorry. :-(

Michael

> 
> This is more complicated. I wonder if the timestamps from cloud-init()
> are somehow synchronized with local_clock().
> 
> We might need to synchronize local_clock() with the cycles as well.
> But there is the chicken&egg problem. We need:
> 
>     + to know the offset caused by cycles when local_clock() gets initialized.
>     + local_clock() running for some time to calibrate cycles.
> 
> Hmm, I see that time-management people are not in Cc. We should have
> added them since v1.
> 
> I add them now. Better late than never.
> 
> Best Regards,
> Petr


^ permalink raw reply

* Re: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Shashank Balaji @ 2026-03-14 14:16 UTC (permalink / raw)
  To: Petr Mladek, Bird, Tim
  Cc: Michael Kelley, rostedt@goodmis.org, senozhatsky@chromium.org,
	francesco@valla.it, geert@linux-m68k.org,
	linux-embedded@vger.kernel.org, linux-kernel@vger.kernel.org,
	Thomas Gleixner, John Stultz, Stephen Boyd, John Ogness
In-Reply-To: <abPqrqgXHnDu8XWf@pathway>

Hi Tim, Petr,

On Fri, Mar 13, 2026 at 11:45:02AM +0100, Petr Mladek wrote:
> Finally added timekeeping maintainers and John into Cc.
> We should have added time since v1.
> 
> Anyway, you might see the entire history at
> https://lore.kernel.org/all/39b09edb-8998-4ebd-a564-7d594434a981@bird.org/
> 
> 
> On Fri 2026-03-13 04:52:40, Bird, Tim wrote:
> > Hey Micheal,
> > 
> > This report is very interesting. 
> > 
> > Thanks very much for trying it out!
> > 
> > > -----Original Message-----
> > > From: Michael Kelley <mhklinux@outlook.com>
> > > Sent: Wednesday, March 11, 2026 9:47 AM
> > > From: Tim Bird <tim.bird@sony.com> Sent: Tuesday, February 10, 2026 3:48 PM
> > > >
> > > > During early boot, printk timestamps are reported as zero before
> > > > kernel timekeeping starts (e.g. before time_init()).  This
> > > > hinders boot-time optimization efforts.  This period is about 400
> > > > milliseconds for many current desktop and embedded machines
> > > > running Linux.
> > > >
> > > > Add support to save cycles during early boot, and output correct
> > > > timestamp values after timekeeping is initialized.  get_cycles()
> > > > is operational on arm64 and x86_64 from kernel start.  Add code
> > > > and variables to save calibration values used to later convert
> > > > cycle counts to time values in the early printks.  Add a config
> > > > to control the feature.
> > > >
> > > > This yields non-zero timestamps for printks from the very start
> > > > of kernel execution.  The timestamps are relative to the start of
> > > > the architecture-specified counter used in get_cycles
> > > > (e.g. the TSC on x86_64 and cntvct_el0 on arm64).
> > > >
> > > > All timestamps reflect time from processor power-on instead of
> > > > time from the kernel's timekeeping initialization.
> > > 
> > > I tried this patch in linux-next20260302 kernel running as a guest VM
> > > on a Hyper-V host. Two things:
> > > 
> > > 1) In the dmesg output, I'm seeing a place where the timestamps briefly go
> > > backwards -- i.e., they are not monotonically increasing. Here's a snippet,
> > > where there's a smaller timestamp immediately after the tsc is detected:
> > > 
> > > [   27.994891] SMBIOS 3.1.0 present.
> > > [   27.994893] DMI: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025
> > > [   27.994898] DMI: Memory slots populated: 2/2
> > > [   27.995202] Hypervisor detected: Microsoft Hyper-V
> > > [   27.995205] Hyper-V: privilege flags low 0xae7f, high 0x3b8030, ext 0x62, hints 0xa0e24, misc 0xe0bed7b2
> > > [   27.995208] Hyper-V: Nested features: 0x0
> > > [   27.995209] Hyper-V: LAPIC Timer Frequency: 0xc3500
> > > [   27.995210] Hyper-V: Using hypercall for remote TLB flush
> > > [   27.995216] clocksource: hyperv_clocksource_tsc_page: mask: 0xffffffffffffffff max_cycles: 0x24e6a1710, max_idle_ns: 440795202120 ns
> > > [   27.995218] clocksource: hyperv_clocksource_msr: mask: 0xffffffffffffffff max_cycles: 0x24e6a1710, max_idle_ns: 440795202120 ns
> > > [   27.995220] tsc: Detected 2918.401 MHz processor
> > 
> > I wonder if the tsc is getting fiddled with or virtualized somewhere in here, as part of clocksource initialization.
> > I believe each clocksource in the kernel maintains it's own internal offset, and maybe the offset that is
> > being used ends up being slightly different from the cycle-counter offset that the early_times feature uses.
> > I'm just throwing out guesses.  It's about a 4ms delta, which is pretty big.
> > 
> > > [   27.991060] e820: update [mem 0x00000000-0x00000fff] System RAM ==> device reserved
> > > [   27.991062] e820: remove [mem 0x000a0000-0x000fffff] System RAM
> > > [   27.991064] last_pfn = 0x210000 max_arch_pfn = 0x400000000
> > > [   27.991065] x86/PAT: PAT support disabled because CONFIG_X86_PAT is disabled in the kernel.
> > > [   27.991066] x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WB  WT  UC- UC
> 
> I wonder how the calibration of the cycles is precise. I wonder if
> the problem might be that cycles were faster right after boot than
> later during the calibration.
> 
> I added the following debug output on top of this patch:
> 
> diff --git a/include/linux/early_times.h b/include/linux/early_times.h
> index 05388dcb8573..cdb467345bcc 100644
> --- a/include/linux/early_times.h
> +++ b/include/linux/early_times.h
> @@ -20,6 +20,7 @@ static inline void early_times_start_calibration(void)
>  {
>  	start_cycles = get_cycles();
>  	start_ns = local_clock();
> +	pr_info("Early printk times: started callibration: %llu ns\n", start_ns);
>  }
>  
>  static inline void early_times_finish_calibration(void)
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 774ffb1fa5ac..836cb03aaa6d 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2308,6 +2308,8 @@ int vprintk_store(int facility, int level,
>  	ts_nsec = local_clock();
>  	if (!ts_nsec)
>  		ts_nsec = early_cycles();
> +	else
> +		pr_info_once("local_clock() returned non-zero timestamp: %llu nsec\n", ts_nsec);
>  
>  	caller_id = printk_caller_id();
>  
> 
> And it produced in my kvm:
> 
> Let's say that start of the cycle counter is
> 
> Start of stage A
> 
> [    8.684438] Linux version 7.0.0-rc2-default+ (pmladek@pathway) (gcc (SUSE Linux) 15.2.1 20260202, GNU ld (GNU Binutils; openSUSE Tumbleweed) 2.45.0.20251103-2) #571 SMP PREEMPT_DYNAMIC Fri Mar 13 10:23:54 CET 2026
> [    8.684442] Command line: BOOT_IMAGE=/boot/vmlinuz-7.0.0-rc2-default+ root=/dev/vda2 resume=/dev/disk/by-uuid/369c7453-3d16-409d-88b2-5de027891a12 mitigations=auto nosplash earlycon=uart8250,io,0x3f8,115200 console=ttyS0,115200 console=tty0 ignore_loglevel log_buf_len=1M crashkernel=512M,high crashkernel=72M,low
> [...]
> [    8.696633] earlycon: uart8250 at I/O port 0x3f8 (options '115200')
> [    8.696639] printk: legacy bootconsole [uart8250] enabled
> [    8.731303] printk: debug: ignoring loglevel setting.
> [    8.732349] NX (Execute Disable) protection: active
> [    8.733447] APIC: Static calls initialized
> [    8.734667] SMBIOS 2.8 present.
> [    8.735358] DMI: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-2-g4f253b9b-prebuilt.qemu.org 04/01/2014
> [    8.737285] DMI: Memory slots populated: 1/1
> [    8.738380] Hypervisor detected: KVM
> [    8.739151] last_pfn = 0x7ffdc max_arch_pfn = 0x400000000
> [    8.740254] kvm-clock: Using msrs 4b564d01 and 4b564d00
> [    8.732971] printk: local_clock() returned non-zero timestamp: 3486 nsec
> 
> End of stage A
> 
> This is the point where printk() started storing the values from
> local_clock() instead of cycles.
> 
> Start of stage B
> 
> [    8.732971] kvm-clock: using sched offset of 252367014082295 cycles
> [    8.735471] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> [    8.738880] tsc: Detected 3293.776 MHz processor
> [    8.740474] e820: update [mem 0x00000000-0x00000fff] System RAM ==> device reserved
> [...]
> [    8.932671] rcu: srcu_init: Setting srcu_struct sizes based on contention.
> [    8.934047] Early printk times: started callibration: 201079507 ns
> 
> End of stage B
> 
> This is where we started calibration of early cycles.
> 
> Start of stage C
> 
> [    8.935571] Console: colour dummy device 80x25
> [...]
> [    9.289673] thermal_sys: Registered thermal governor 'fair_share'
> [    9.290077] thermal_sys: Registered thermal governor 'bang_bang'
> [    9.292290] thermal_sys: Registered thermal governor 'step_wise'
> [    9.293530] thermal_sys: Registered thermal governor 'user_space'
> [    9.294856] cpuidle: using governor ladder
> [    9.296302] cpuidle: using governor menu
> 
> Here the thermal governors are registered. I guess that they might
> reduce speed of some HW.
> 
> [...]
> [   11.974147] clk: Disabling unused clocks
> 
> Some unused clocks are disabled. I wonder if this might affect
> counting the cycles.
> 
> [   12.330852] Freeing unused kernel image (rodata/data gap) memory: 1500K
> [   12.351191] Early printk times: mult=19634245, shift=26, offset=8732967929 ns
> 
> End of stage C
> 
> Here is the end on calibration.
> 
> Now, if the frequence of the cycles was:
> 
>    + was higher in the stage A when only cycles were stored
>    + was lower in stage C when it was calibrated against local_clock()
> 
> Then it might result in higher (calibrated) timestamps in stage A
> and step back in stage B.
> 
> Or something like this. It is possible that even local_clock() does
> not have a stable frequence during the early boot.
> 
> Idea: A solution might be to start calibration when printk()
>       gets first non-zero time from local_clock.
> 
>       Something like:
> 
> diff --git a/include/linux/early_times.h b/include/linux/early_times.h
> index 05388dcb8573..09d278996184 100644
> --- a/include/linux/early_times.h
> +++ b/include/linux/early_times.h
> @@ -16,10 +16,13 @@ extern u64 start_ns;
>  extern u32 early_mult, early_shift;
>  extern u64 early_ts_offset;
>  
> -static inline void early_times_start_calibration(void)
> +static inline void early_times_may_start_calibration(u64 ts_ns)
>  {
> +	if (start_ns)
> +		return;
> +
> +	start_ns = ts_ns;
>  	start_cycles = get_cycles();
> -	start_ns = local_clock();
>  }
>  
>  static inline void early_times_finish_calibration(void)
> diff --git a/init/main.c b/init/main.c
> index 27835270dfb5..a333b0da69cf 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -1123,9 +1123,6 @@ void start_kernel(void)
>  	timekeeping_init();
>  	time_init();
>  
> -	/* This must be after timekeeping is initialized */
> -	early_times_start_calibration();
> -
>  	/* This must be after timekeeping is initialized */
>  	random_init();
>  
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 774ffb1fa5ac..19330b6b4eb2 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2306,7 +2306,9 @@ int vprintk_store(int facility, int level,
>  	 * timestamp with respect to the caller.
>  	 */
>  	ts_nsec = local_clock();
> -	if (!ts_nsec)
> +	if (ts_nsec)
> +		early_times_may_start_calibration(ts_nsec);
> +	else
>  		ts_nsec = early_cycles();
>  
>  	caller_id = printk_caller_id();
> 
> 
> > > 
> > > 2) A Linux VM running in the Azure cloud is also running on Hyper-V. Such a
> > > VM typically uses cloud-init to set everything up at boot time, and cloud-init
> > > is outputting lines to the serial console with a timestamp that looks like the
> > > printk() timestamp, but apparently is not adjusted for the early timestamping
> > > that this patch adds. Again, I haven't debugged what's going on -- I'm not
> > > immediately sure of the mechanism that cloud-init uses to do output to the
> > > serial console. The use of the Hyper-V synthetic clock source might the cause
> > > of the problem here as well. Here's an output snippet from the serial console:
> > > 
> > > [   20.330414] systemd[1]: Condition check resulted in OpenVSwitch configuration for cleanup being skipped.
> > > [   20.332911] systemd[1]: Condition check resulted in Set Up Additional Binary Formats being skipped.
> > > [   20.333257] pstore: Registered efi_pstore as persistent store backend
> > > [   20.334360] systemd[1]: Condition check resulted in File System Check on Root Device being skipped.
> > > [   20.338319] systemd[1]: Starting Load Kernel Modules...
> > > [   20.341094] systemd[1]: Starting Remount Root and Kernel File Systems...
> > > [   20.350993] systemd[1]: Starting udev Coldplug all Devices...
> > > [   20.356255] systemd[1]: Starting Uncomplicated firewall...
> > > [   20.361536] systemd[1]: Started Journal Service.
> > > [   20.386902] EXT4-fs (sda1): re-mounted c02dce0c-0c40-4e6e-88af-c5a0987b0adb r/w.
> > > [   22.532033] /dev/sr0: Can't lookup blockdev
> > > [    7.955973] cloud-init[783]: Cloud-init v. 24.3.1-0ubuntu0~20.04.1 running 'init-local' at Wed, 11 Mar 2026 15:27:06 +0000. Up 7.48
> > > seconds.
> > > [    9.933120] cloud-init[822]: Cloud-init v. 24.3.1-0ubuntu0~20.04.1 running 'init' at Wed, 11 Mar 2026 15:27:08 +0000. Up 9.82 seconds.
> > > [    9.935483] cloud-init[822]: ci-info: ++++++++++++++++++++++++++++++++++++++Net device
> > > info+++++++++++++++++++++++++++++++++++++++
> > > [    9.937726] cloud-init[822]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
> > > [    9.939905] cloud-init[822]: ci-info: | Device |  Up  |           Address           |      Mask     | Scope  |     Hw-Address    |
> > > [    9.942059] cloud-init[822]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
> 
> This is more complicated. I wonder if the timestamps from cloud-init()
> are somehow synchronized with local_clock().
> 
> We might need to synchronize local_clock() with the cycles as well.
> But there is the chicken&egg problem. We need:
> 
>     + to know the offset caused by cycles when local_clock() gets initialized.
>     + local_clock() running for some time to calibrate cycles.
>

Based on the discussions thus far, I see three problems currently:

1. Three phases of timestamps on consoles printing synchronously is awkward
2. Possible userspace breakage:

    I don't know how big of a concern this is, but if the timestamps were to not start at zero, any
    tools expecting them to, would be shocked. printk timestamp comparison with CLOCK_BOOTTIME
    wouldn't work anymore.

3. Counter frequency constancy requirement:

    For the early time calibration to be accurate, the counter frequency should be constant from
    processor power on to the end of the calibration. We don't know what the firmware or the bootloader
    may do in-between. If the counter frequency depends on cpu frequency or is just unstable, then we
    should err on the side of caution and bail out of early times. On the other hand, if constant counter
    frequency is architecturally guaranteed or if the cpu advertises so, then we should be good. Like,
    x86 has one only if boot_cpu_has(X86_FEATURE_CONSTANT_TSC), rest are dubious.

Idea: Making the early timestamps start at 0 would partially address these problems:

1. No awkward three phases of timestamps:

    The only difference would be that serial would have a number of timestamps exactly at 0, while userspace
    wouldn't. The early time calibration data would only be used for userspace to convert the counter values
    to timestamps with a zero origin.

2. Userspace breakage:

    CLOCK_BOOTTIME's origin is timekeeping init. So relative to that, the origin of early timestamps would be
    closer, but still different.

Making the early timestamps start at 0 would also lower counter frequency constancy requirement. Now the constancy
requirement would only be between the period of kernel start to calibration end, a phase in which the kernel is
in full control. Maybe even if !boot_cpu_has(X86_FEATURE_CONSTANT_TSC), the calibration data would be fine to use?
This is debatable.

A disadvantage of this method is that visibility into per-kernel duration is lost. But that can be made up for by
adding something like:

    pr_info("Pre-kernel duration: %llu ms\n", pre_kernel_time);

But this information should be taken with a grain of salt if constant counter frequency is not guaranteed.

Another point, since constancy of counter frequency is being relied upon anyway, conversion of counter value to
timestamp can be done later once counter frequency calibration is complete, instead of doing our own calibration. 
An arch hook like arch_counter_freq() could be provided from which we can get this info. On x86, it would be wired
up to tsc_khz. If arch_counter_freq() returns 0, that means constant frequency is not guaranteed, so we'd bail
out of the counter value to timestamp conversion.

Thanks,
Shashank

^ permalink raw reply

* Re: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Petr Mladek @ 2026-03-13 10:45 UTC (permalink / raw)
  To: Bird, Tim
  Cc: Michael Kelley, rostedt@goodmis.org, senozhatsky@chromium.org,
	francesco@valla.it, geert@linux-m68k.org,
	linux-embedded@vger.kernel.org, linux-kernel@vger.kernel.org,
	Thomas Gleixner, John Stultz, Stephen Boyd, John Ogness
In-Reply-To: <MW5PR13MB563216F6D786D4DCDFEBED2BFD45A@MW5PR13MB5632.namprd13.prod.outlook.com>

Finally added timekeeping maintainers and John into Cc.
We should have added time since v1.

Anyway, you might see the entire history at
https://lore.kernel.org/all/39b09edb-8998-4ebd-a564-7d594434a981@bird.org/


On Fri 2026-03-13 04:52:40, Bird, Tim wrote:
> Hey Micheal,
> 
> This report is very interesting. 
> 
> Thanks very much for trying it out!
> 
> > -----Original Message-----
> > From: Michael Kelley <mhklinux@outlook.com>
> > Sent: Wednesday, March 11, 2026 9:47 AM
> > From: Tim Bird <tim.bird@sony.com> Sent: Tuesday, February 10, 2026 3:48 PM
> > >
> > > During early boot, printk timestamps are reported as zero before
> > > kernel timekeeping starts (e.g. before time_init()).  This
> > > hinders boot-time optimization efforts.  This period is about 400
> > > milliseconds for many current desktop and embedded machines
> > > running Linux.
> > >
> > > Add support to save cycles during early boot, and output correct
> > > timestamp values after timekeeping is initialized.  get_cycles()
> > > is operational on arm64 and x86_64 from kernel start.  Add code
> > > and variables to save calibration values used to later convert
> > > cycle counts to time values in the early printks.  Add a config
> > > to control the feature.
> > >
> > > This yields non-zero timestamps for printks from the very start
> > > of kernel execution.  The timestamps are relative to the start of
> > > the architecture-specified counter used in get_cycles
> > > (e.g. the TSC on x86_64 and cntvct_el0 on arm64).
> > >
> > > All timestamps reflect time from processor power-on instead of
> > > time from the kernel's timekeeping initialization.
> > 
> > I tried this patch in linux-next20260302 kernel running as a guest VM
> > on a Hyper-V host. Two things:
> > 
> > 1) In the dmesg output, I'm seeing a place where the timestamps briefly go
> > backwards -- i.e., they are not monotonically increasing. Here's a snippet,
> > where there's a smaller timestamp immediately after the tsc is detected:
> > 
> > [   27.994891] SMBIOS 3.1.0 present.
> > [   27.994893] DMI: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025
> > [   27.994898] DMI: Memory slots populated: 2/2
> > [   27.995202] Hypervisor detected: Microsoft Hyper-V
> > [   27.995205] Hyper-V: privilege flags low 0xae7f, high 0x3b8030, ext 0x62, hints 0xa0e24, misc 0xe0bed7b2
> > [   27.995208] Hyper-V: Nested features: 0x0
> > [   27.995209] Hyper-V: LAPIC Timer Frequency: 0xc3500
> > [   27.995210] Hyper-V: Using hypercall for remote TLB flush
> > [   27.995216] clocksource: hyperv_clocksource_tsc_page: mask: 0xffffffffffffffff max_cycles: 0x24e6a1710, max_idle_ns: 440795202120 ns
> > [   27.995218] clocksource: hyperv_clocksource_msr: mask: 0xffffffffffffffff max_cycles: 0x24e6a1710, max_idle_ns: 440795202120 ns
> > [   27.995220] tsc: Detected 2918.401 MHz processor
> 
> I wonder if the tsc is getting fiddled with or virtualized somewhere in here, as part of clocksource initialization.
> I believe each clocksource in the kernel maintains it's own internal offset, and maybe the offset that is
> being used ends up being slightly different from the cycle-counter offset that the early_times feature uses.
> I'm just throwing out guesses.  It's about a 4ms delta, which is pretty big.
> 
> > [   27.991060] e820: update [mem 0x00000000-0x00000fff] System RAM ==> device reserved
> > [   27.991062] e820: remove [mem 0x000a0000-0x000fffff] System RAM
> > [   27.991064] last_pfn = 0x210000 max_arch_pfn = 0x400000000
> > [   27.991065] x86/PAT: PAT support disabled because CONFIG_X86_PAT is disabled in the kernel.
> > [   27.991066] x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WB  WT  UC- UC

I wonder how the calibration of the cycles is precise. I wonder if
the problem might be that cycles were faster right after boot than
later during the calibration.

I added the following debug output on top of this patch:

diff --git a/include/linux/early_times.h b/include/linux/early_times.h
index 05388dcb8573..cdb467345bcc 100644
--- a/include/linux/early_times.h
+++ b/include/linux/early_times.h
@@ -20,6 +20,7 @@ static inline void early_times_start_calibration(void)
 {
 	start_cycles = get_cycles();
 	start_ns = local_clock();
+	pr_info("Early printk times: started callibration: %llu ns\n", start_ns);
 }
 
 static inline void early_times_finish_calibration(void)
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 774ffb1fa5ac..836cb03aaa6d 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2308,6 +2308,8 @@ int vprintk_store(int facility, int level,
 	ts_nsec = local_clock();
 	if (!ts_nsec)
 		ts_nsec = early_cycles();
+	else
+		pr_info_once("local_clock() returned non-zero timestamp: %llu nsec\n", ts_nsec);
 
 	caller_id = printk_caller_id();
 

And it produced in my kvm:

Let's say that start of the cycle counter is

Start of stage A

[    8.684438] Linux version 7.0.0-rc2-default+ (pmladek@pathway) (gcc (SUSE Linux) 15.2.1 20260202, GNU ld (GNU Binutils; openSUSE Tumbleweed) 2.45.0.20251103-2) #571 SMP PREEMPT_DYNAMIC Fri Mar 13 10:23:54 CET 2026
[    8.684442] Command line: BOOT_IMAGE=/boot/vmlinuz-7.0.0-rc2-default+ root=/dev/vda2 resume=/dev/disk/by-uuid/369c7453-3d16-409d-88b2-5de027891a12 mitigations=auto nosplash earlycon=uart8250,io,0x3f8,115200 console=ttyS0,115200 console=tty0 ignore_loglevel log_buf_len=1M crashkernel=512M,high crashkernel=72M,low
[...]
[    8.696633] earlycon: uart8250 at I/O port 0x3f8 (options '115200')
[    8.696639] printk: legacy bootconsole [uart8250] enabled
[    8.731303] printk: debug: ignoring loglevel setting.
[    8.732349] NX (Execute Disable) protection: active
[    8.733447] APIC: Static calls initialized
[    8.734667] SMBIOS 2.8 present.
[    8.735358] DMI: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-2-g4f253b9b-prebuilt.qemu.org 04/01/2014
[    8.737285] DMI: Memory slots populated: 1/1
[    8.738380] Hypervisor detected: KVM
[    8.739151] last_pfn = 0x7ffdc max_arch_pfn = 0x400000000
[    8.740254] kvm-clock: Using msrs 4b564d01 and 4b564d00
[    8.732971] printk: local_clock() returned non-zero timestamp: 3486 nsec

End of stage A

This is the point where printk() started storing the values from
local_clock() instead of cycles.

Start of stage B

[    8.732971] kvm-clock: using sched offset of 252367014082295 cycles
[    8.735471] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[    8.738880] tsc: Detected 3293.776 MHz processor
[    8.740474] e820: update [mem 0x00000000-0x00000fff] System RAM ==> device reserved
[...]
[    8.932671] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    8.934047] Early printk times: started callibration: 201079507 ns

End of stage B

This is where we started calibration of early cycles.

Start of stage C

[    8.935571] Console: colour dummy device 80x25
[...]
[    9.289673] thermal_sys: Registered thermal governor 'fair_share'
[    9.290077] thermal_sys: Registered thermal governor 'bang_bang'
[    9.292290] thermal_sys: Registered thermal governor 'step_wise'
[    9.293530] thermal_sys: Registered thermal governor 'user_space'
[    9.294856] cpuidle: using governor ladder
[    9.296302] cpuidle: using governor menu

Here the thermal governors are registered. I guess that they might
reduce speed of some HW.

[...]
[   11.974147] clk: Disabling unused clocks

Some unused clocks are disabled. I wonder if this might affect
counting the cycles.

[   12.330852] Freeing unused kernel image (rodata/data gap) memory: 1500K
[   12.351191] Early printk times: mult=19634245, shift=26, offset=8732967929 ns

End of stage C

Here is the end on calibration.

Now, if the frequence of the cycles was:

   + was higher in the stage A when only cycles were stored
   + was lower in stage C when it was calibrated against local_clock()

Then it might result in higher (calibrated) timestamps in stage A
and step back in stage B.

Or something like this. It is possible that even local_clock() does
not have a stable frequence during the early boot.

Idea: A solution might be to start calibration when printk()
      gets first non-zero time from local_clock.

      Something like:

diff --git a/include/linux/early_times.h b/include/linux/early_times.h
index 05388dcb8573..09d278996184 100644
--- a/include/linux/early_times.h
+++ b/include/linux/early_times.h
@@ -16,10 +16,13 @@ extern u64 start_ns;
 extern u32 early_mult, early_shift;
 extern u64 early_ts_offset;
 
-static inline void early_times_start_calibration(void)
+static inline void early_times_may_start_calibration(u64 ts_ns)
 {
+	if (start_ns)
+		return;
+
+	start_ns = ts_ns;
 	start_cycles = get_cycles();
-	start_ns = local_clock();
 }
 
 static inline void early_times_finish_calibration(void)
diff --git a/init/main.c b/init/main.c
index 27835270dfb5..a333b0da69cf 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1123,9 +1123,6 @@ void start_kernel(void)
 	timekeeping_init();
 	time_init();
 
-	/* This must be after timekeeping is initialized */
-	early_times_start_calibration();
-
 	/* This must be after timekeeping is initialized */
 	random_init();
 
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 774ffb1fa5ac..19330b6b4eb2 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2306,7 +2306,9 @@ int vprintk_store(int facility, int level,
 	 * timestamp with respect to the caller.
 	 */
 	ts_nsec = local_clock();
-	if (!ts_nsec)
+	if (ts_nsec)
+		early_times_may_start_calibration(ts_nsec);
+	else
 		ts_nsec = early_cycles();
 
 	caller_id = printk_caller_id();


> > 
> > 2) A Linux VM running in the Azure cloud is also running on Hyper-V. Such a
> > VM typically uses cloud-init to set everything up at boot time, and cloud-init
> > is outputting lines to the serial console with a timestamp that looks like the
> > printk() timestamp, but apparently is not adjusted for the early timestamping
> > that this patch adds. Again, I haven't debugged what's going on -- I'm not
> > immediately sure of the mechanism that cloud-init uses to do output to the
> > serial console. The use of the Hyper-V synthetic clock source might the cause
> > of the problem here as well. Here's an output snippet from the serial console:
> > 
> > [   20.330414] systemd[1]: Condition check resulted in OpenVSwitch configuration for cleanup being skipped.
> > [   20.332911] systemd[1]: Condition check resulted in Set Up Additional Binary Formats being skipped.
> > [   20.333257] pstore: Registered efi_pstore as persistent store backend
> > [   20.334360] systemd[1]: Condition check resulted in File System Check on Root Device being skipped.
> > [   20.338319] systemd[1]: Starting Load Kernel Modules...
> > [   20.341094] systemd[1]: Starting Remount Root and Kernel File Systems...
> > [   20.350993] systemd[1]: Starting udev Coldplug all Devices...
> > [   20.356255] systemd[1]: Starting Uncomplicated firewall...
> > [   20.361536] systemd[1]: Started Journal Service.
> > [   20.386902] EXT4-fs (sda1): re-mounted c02dce0c-0c40-4e6e-88af-c5a0987b0adb r/w.
> > [   22.532033] /dev/sr0: Can't lookup blockdev
> > [    7.955973] cloud-init[783]: Cloud-init v. 24.3.1-0ubuntu0~20.04.1 running 'init-local' at Wed, 11 Mar 2026 15:27:06 +0000. Up 7.48
> > seconds.
> > [    9.933120] cloud-init[822]: Cloud-init v. 24.3.1-0ubuntu0~20.04.1 running 'init' at Wed, 11 Mar 2026 15:27:08 +0000. Up 9.82 seconds.
> > [    9.935483] cloud-init[822]: ci-info: ++++++++++++++++++++++++++++++++++++++Net device
> > info+++++++++++++++++++++++++++++++++++++++
> > [    9.937726] cloud-init[822]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
> > [    9.939905] cloud-init[822]: ci-info: | Device |  Up  |           Address           |      Mask     | Scope  |     Hw-Address    |
> > [    9.942059] cloud-init[822]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+

This is more complicated. I wonder if the timestamps from cloud-init()
are somehow synchronized with local_clock().

We might need to synchronize local_clock() with the cycles as well.
But there is the chicken&egg problem. We need:

    + to know the offset caused by cycles when local_clock() gets initialized.
    + local_clock() running for some time to calibrate cycles.

Hmm, I see that time-management people are not in Cc. We should have
added them since v1.

I add them now. Better late than never.

Best Regards,
Petr

^ permalink raw reply related

* RE: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Bird, Tim @ 2026-03-13  4:52 UTC (permalink / raw)
  To: Michael Kelley, pmladek@suse.com, rostedt@goodmis.org,
	senozhatsky@chromium.org
  Cc: francesco@valla.it, geert@linux-m68k.org,
	linux-embedded@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB41571D45C1C8ADF89BD1FE8DD447A@SN6PR02MB4157.namprd02.prod.outlook.com>

Hey Micheal,

This report is very interesting. 

Thanks very much for trying it out!
 

> -----Original Message-----
> From: Michael Kelley <mhklinux@outlook.com>
> Sent: Wednesday, March 11, 2026 9:47 AM
> From: Tim Bird <tim.bird@sony.com> Sent: Tuesday, February 10, 2026 3:48 PM
> >
> > During early boot, printk timestamps are reported as zero before
> > kernel timekeeping starts (e.g. before time_init()).  This
> > hinders boot-time optimization efforts.  This period is about 400
> > milliseconds for many current desktop and embedded machines
> > running Linux.
> >
> > Add support to save cycles during early boot, and output correct
> > timestamp values after timekeeping is initialized.  get_cycles()
> > is operational on arm64 and x86_64 from kernel start.  Add code
> > and variables to save calibration values used to later convert
> > cycle counts to time values in the early printks.  Add a config
> > to control the feature.
> >
> > This yields non-zero timestamps for printks from the very start
> > of kernel execution.  The timestamps are relative to the start of
> > the architecture-specified counter used in get_cycles
> > (e.g. the TSC on x86_64 and cntvct_el0 on arm64).
> >
> > All timestamps reflect time from processor power-on instead of
> > time from the kernel's timekeeping initialization.
> 
> I tried this patch in linux-next20260302 kernel running as a guest VM
> on a Hyper-V host. Two things:
> 
> 1) In the dmesg output, I'm seeing a place where the timestamps briefly go
> backwards -- i.e., they are not monotonically increasing. Here's a snippet,
> where there's a smaller timestamp immediately after the tsc is detected:
> 
> [   27.994891] SMBIOS 3.1.0 present.
> [   27.994893] DMI: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025
> [   27.994898] DMI: Memory slots populated: 2/2
> [   27.995202] Hypervisor detected: Microsoft Hyper-V
> [   27.995205] Hyper-V: privilege flags low 0xae7f, high 0x3b8030, ext 0x62, hints 0xa0e24, misc 0xe0bed7b2
> [   27.995208] Hyper-V: Nested features: 0x0
> [   27.995209] Hyper-V: LAPIC Timer Frequency: 0xc3500
> [   27.995210] Hyper-V: Using hypercall for remote TLB flush
> [   27.995216] clocksource: hyperv_clocksource_tsc_page: mask: 0xffffffffffffffff max_cycles: 0x24e6a1710, max_idle_ns: 440795202120 ns
> [   27.995218] clocksource: hyperv_clocksource_msr: mask: 0xffffffffffffffff max_cycles: 0x24e6a1710, max_idle_ns: 440795202120 ns
> [   27.995220] tsc: Detected 2918.401 MHz processor

I wonder if the tsc is getting fiddled with or virtualized somewhere in here, as part of clocksource initialization.
I believe each clocksource in the kernel maintains it's own internal offset, and maybe the offset that is
being used ends up being slightly different from the cycle-counter offset that the early_times feature uses.
I'm just throwing out guesses.  It's about a 4ms delta, which is pretty big.

> [   27.991060] e820: update [mem 0x00000000-0x00000fff] System RAM ==> device reserved
> [   27.991062] e820: remove [mem 0x000a0000-0x000fffff] System RAM
> [   27.991064] last_pfn = 0x210000 max_arch_pfn = 0x400000000
> [   27.991065] x86/PAT: PAT support disabled because CONFIG_X86_PAT is disabled in the kernel.
> [   27.991066] x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WB  WT  UC- UC
> 
> Hyper-V provides a synthetic clocksource (two actually), and perhaps they
> are the cause of the problem, though I haven't spent any time debugging.
> 
> 2) A Linux VM running in the Azure cloud is also running on Hyper-V. Such a
> VM typically uses cloud-init to set everything up at boot time, and cloud-init
> is outputting lines to the serial console with a timestamp that looks like the
> printk() timestamp, but apparently is not adjusted for the early timestamping
> that this patch adds. Again, I haven't debugged what's going on -- I'm not
> immediately sure of the mechanism that cloud-init uses to do output to the
> serial console. The use of the Hyper-V synthetic clock source might the cause
> of the problem here as well. Here's an output snippet from the serial console:
> 
> [   20.330414] systemd[1]: Condition check resulted in OpenVSwitch configuration for cleanup being skipped.
> [   20.332911] systemd[1]: Condition check resulted in Set Up Additional Binary Formats being skipped.
> [   20.333257] pstore: Registered efi_pstore as persistent store backend
> [   20.334360] systemd[1]: Condition check resulted in File System Check on Root Device being skipped.
> [   20.338319] systemd[1]: Starting Load Kernel Modules...
> [   20.341094] systemd[1]: Starting Remount Root and Kernel File Systems...
> [   20.350993] systemd[1]: Starting udev Coldplug all Devices...
> [   20.356255] systemd[1]: Starting Uncomplicated firewall...
> [   20.361536] systemd[1]: Started Journal Service.
> [   20.386902] EXT4-fs (sda1): re-mounted c02dce0c-0c40-4e6e-88af-c5a0987b0adb r/w.
> [   22.532033] /dev/sr0: Can't lookup blockdev
> [    7.955973] cloud-init[783]: Cloud-init v. 24.3.1-0ubuntu0~20.04.1 running 'init-local' at Wed, 11 Mar 2026 15:27:06 +0000. Up 7.48
> seconds.
> [    9.933120] cloud-init[822]: Cloud-init v. 24.3.1-0ubuntu0~20.04.1 running 'init' at Wed, 11 Mar 2026 15:27:08 +0000. Up 9.82 seconds.
> [    9.935483] cloud-init[822]: ci-info: ++++++++++++++++++++++++++++++++++++++Net device
> info+++++++++++++++++++++++++++++++++++++++
> [    9.937726] cloud-init[822]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
> [    9.939905] cloud-init[822]: ci-info: | Device |  Up  |           Address           |      Mask     | Scope  |     Hw-Address    |
> [    9.942059] cloud-init[822]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
> 
> The cloud-init lines don't show up in dmesg, so there's no problem there.

Are the timestamp values for cloud-init usually consistent with the ones from the kernel?
That is, without the early_times patch, do they usually match kernel's printk timestamps?

It sounds like these cloud-init messages are emitted directly to the serial console, and do
not go through the printk system.  Is that right?  How do they avoid having the printk
messages and cloud-init messages intermingle?  I assume cloud-init is a systemd service?

Can you share the calibration lines from this boot?  I'm curious if the timestamps
just relative to local_clock().

> I will look into both issues further, but probably not today.

Thanks for trying the code out.  Let me know if you find out anything.
 -- Tim

> Michael
> 
> >
> > Signed-off-by: Tim Bird <tim.bird@sony.com>
> > ---
> > V2->V3
> >  Default CONFIG option to 'n'
> >  Move more code from into early_times.h
> >   (reducing ifdefs in init/main.c)
> >  Use math64 helper routines
> >  Use cycles_t instead of u64 type
> >  Add #defines for EARLY_CYCLES_BIT and MASK
> >  Invert if logic in adjust_early_ts()
> >  (note: no change to 'depends on' in Kconfig entry)
> >
> > V1->V2
> >  Remove calibration CONFIG vars
> >  Add 'depends on' to restrict arches (to handle ppc bug)
> >  Add early_ts_offset to avoid discontinuity
> >  Save cycles in ts_nsec, and convert on output
> >  Move conditional code to include file (early_times.h)
> >
> >  include/linux/early_times.h | 85 +++++++++++++++++++++++++++++++++++++
> >  init/Kconfig                | 14 ++++++
> >  init/main.c                 |  6 +++
> >  kernel/printk/printk.c      | 18 +++++++-
> >  4 files changed, 121 insertions(+), 2 deletions(-)
> >  create mode 100644 include/linux/early_times.h
> >
> > diff --git a/include/linux/early_times.h b/include/linux/early_times.h
> > new file mode 100644
> > index 000000000000..05388dcb8573
> > --- /dev/null
> > +++ b/include/linux/early_times.h
> > @@ -0,0 +1,85 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +#ifndef _KERNEL_PRINTK_EARLY_TIMES_H
> > +#define _KERNEL_PRINTK_EARLY_TIMES_H
> > +
> > +#include <linux/timex.h>
> > +#include <linux/clocksource.h>
> > +
> > +/* use high bit of a u64 to indicate cycles instead of a timestamp */
> > +#define EARLY_CYCLES_BIT	BIT_ULL(63)
> > +#define EARLY_CYCLES_MASK	~(BIT_ULL(63))
> > +
> > +#if defined(CONFIG_EARLY_PRINTK_TIMES)
> > +extern cycles_t start_cycles;
> > +extern u64 start_ns;
> > +extern u32 early_mult, early_shift;
> > +extern u64 early_ts_offset;
> > +
> > +static inline void early_times_start_calibration(void)
> > +{
> > +	start_cycles = get_cycles();
> > +	start_ns = local_clock();
> > +}
> > +
> > +static inline void early_times_finish_calibration(void)
> > +{
> > +	cycles_t end_cycles;
> > +	u64 end_ns;
> > +
> > +	/* set calibration data for early_printk_times */
> > +	end_cycles = get_cycles();
> > +	end_ns = local_clock();
> > +	clocks_calc_mult_shift(&early_mult, &early_shift,
> > +		mul_u64_u64_div_u64(end_cycles - start_cycles,
> > +			NSEC_PER_SEC, end_ns - start_ns),
> > +		NSEC_PER_SEC, 100);
> > +	early_ts_offset = mul_u64_u32_shr(start_cycles, early_mult, early_shift) -
> > start_ns;
> > +
> > +	pr_debug("Early printk times: mult=%u, shift=%u, offset=%llu ns\n",
> > +		early_mult, early_shift, early_ts_offset);
> > +}
> > +
> > +static inline u64 early_cycles(void)
> > +{
> > +	return (get_cycles() | EARLY_CYCLES_BIT);
> > +}
> > +
> > +/*
> > + * adjust_early_ts detects whether ts in is cycles or nanoseconds
> > + * and converts it or adjusts it, taking into account the offset
> > + * from cycle-counter start.
> > + *
> > + * Note that early_mult may be 0, but that's OK because
> > + * we'll just multiply by 0 and return 0. This will
> > + * only occur if we're outputting a printk message
> > + * before the calibration of the early timestamp.
> > + * Any output after user space start (eg. from dmesg or
> > + * journalctl) will show correct values.
> > + */
> > +static inline u64 adjust_early_ts(u64 ts)
> > +{
> > +	if (likely(!(ts & EARLY_CYCLES_BIT)))
> > +		/* if timestamp is not in cycles, just add offset */
> > +		return ts + early_ts_offset;
> > +
> > +	/* mask high bit and convert to nanoseconds */
> > +	return mul_u64_u32_shr(ts & EARLY_CYCLES_MASK, early_mult, early_shift);
> > +}
> > +
> > +#else
> > +# define early_times_start_calibration() do { } while (0)
> > +# define early_times_finish_calibration() do { } while (0)
> > +
> > +static inline u64 early_cycles(void)
> > +{
> > +	return 0;
> > +}
> > +
> > +static inline u64 adjust_early_ts(u64 ts)
> > +{
> > +	return ts;
> > +}
> > +#endif /* CONFIG_EARLY_PRINTK_TIMES */
> > +
> > +#endif /* _KERNEL_PRINTK_EARLY_TIMES_H */
> > diff --git a/init/Kconfig b/init/Kconfig
> > index fa79feb8fe57..a928c1efb09d 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -777,6 +777,20 @@ config IKHEADERS
> >  	  or similar programs.  If you build the headers as a module, a module called
> >  	  kheaders.ko is built which can be loaded on-demand to get access to headers.
> >
> > +config EARLY_PRINTK_TIMES
> > +	bool "Show non-zero printk timestamps early in boot"
> > +	default n
> > +	depends on PRINTK
> > +	depends on ARM64 || X86_64
> > +	help
> > +	  Use a cycle-counter to provide printk timestamps during
> > +	  early boot.  This allows seeing timestamps for printks that
> > +	  would otherwise show as 0.  Note that this will shift the
> > +	  printk timestamps to be relative to processor power on, instead
> > +	  of relative to the start of kernel timekeeping.  This should be
> > +	  closer to machine power on, giving a better indication of
> > +	  overall boot time.
> > +
> >  config LOG_BUF_SHIFT
> >  	int "Kernel log buffer size (16 => 64KB, 17 => 128KB)"
> >  	range 12 25
> > diff --git a/init/main.c b/init/main.c
> > index b84818ad9685..d5774aec1aff 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -104,6 +104,7 @@
> >  #include <linux/pidfs.h>
> >  #include <linux/ptdump.h>
> >  #include <linux/time_namespace.h>
> > +#include <linux/early_times.h>
> >  #include <net/net_namespace.h>
> >
> >  #include <asm/io.h>
> > @@ -1118,6 +1119,9 @@ void start_kernel(void)
> >  	timekeeping_init();
> >  	time_init();
> >
> > +	/* This must be after timekeeping is initialized */
> > +	early_times_start_calibration();
> > +
> >  	/* This must be after timekeeping is initialized */
> >  	random_init();
> >
> > @@ -1600,6 +1604,8 @@ static int __ref kernel_init(void *unused)
> >
> >  	do_sysctl_args();
> >
> > +	early_times_finish_calibration();
> > +
> >  	if (ramdisk_execute_command) {
> >  		ret = run_init_process(ramdisk_execute_command);
> >  		if (!ret)
> > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > index 1d765ad242b8..5afd31c3345c 100644
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -46,6 +46,7 @@
> >  #include <linux/ctype.h>
> >  #include <linux/uio.h>
> >  #include <linux/sched/clock.h>
> > +#include <linux/early_times.h>
> >  #include <linux/sched/debug.h>
> >  #include <linux/sched/task_stack.h>
> >  #include <linux/panic.h>
> > @@ -75,6 +76,13 @@ EXPORT_SYMBOL(ignore_console_lock_warning);
> >
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(console);
> >
> > +#ifdef CONFIG_EARLY_PRINTK_TIMES
> > +cycles_t start_cycles;
> > +u64 start_ns;
> > +u32 early_mult, early_shift;
> > +u64 early_ts_offset;
> > +#endif
> > +
> >  /*
> >   * Low level drivers may need that to know if they can schedule in
> >   * their unblank() callback or not. So let's export it.
> > @@ -639,7 +647,7 @@ static void append_char(char **pp, char *e, char c)
> >  static ssize_t info_print_ext_header(char *buf, size_t size,
> >  				     struct printk_info *info)
> >  {
> > -	u64 ts_usec = info->ts_nsec;
> > +	u64 ts_usec = adjust_early_ts(info->ts_nsec);
> >  	char caller[20];
> >  #ifdef CONFIG_PRINTK_CALLER
> >  	u32 id = info->caller_id;
> > @@ -1352,7 +1360,11 @@ static size_t print_syslog(unsigned int level, char *buf)
> >
> >  static size_t print_time(u64 ts, char *buf)
> >  {
> > -	unsigned long rem_nsec = do_div(ts, 1000000000);
> > +	unsigned long rem_nsec;
> > +
> > +	ts = adjust_early_ts(ts);
> > +
> > +	rem_nsec = do_div(ts, 1000000000);
> >
> >  	return sprintf(buf, "[%5lu.%06lu]",
> >  		       (unsigned long)ts, rem_nsec / 1000);
> > @@ -2242,6 +2254,8 @@ int vprintk_store(int facility, int level,
> >  	 * timestamp with respect to the caller.
> >  	 */
> >  	ts_nsec = local_clock();
> > +	if (!ts_nsec)
> > +		ts_nsec = early_cycles();
> >
> >  	caller_id = printk_caller_id();
> >
> > --
> > 2.43.0
> >


^ permalink raw reply

* RE: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Michael Kelley @ 2026-03-11 15:47 UTC (permalink / raw)
  To: Tim Bird, pmladek@suse.com, rostedt@goodmis.org,
	john.ogness@linuxtronix.de, senozhatsky@chromium.org
  Cc: francesco@valla.it, geert@linux-m68k.org,
	linux-embedded@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260210234741.3262320-1-tim.bird@sony.com>

From: Tim Bird <tim.bird@sony.com> Sent: Tuesday, February 10, 2026 3:48 PM
> 
> During early boot, printk timestamps are reported as zero before
> kernel timekeeping starts (e.g. before time_init()).  This
> hinders boot-time optimization efforts.  This period is about 400
> milliseconds for many current desktop and embedded machines
> running Linux.
> 
> Add support to save cycles during early boot, and output correct
> timestamp values after timekeeping is initialized.  get_cycles()
> is operational on arm64 and x86_64 from kernel start.  Add code
> and variables to save calibration values used to later convert
> cycle counts to time values in the early printks.  Add a config
> to control the feature.
> 
> This yields non-zero timestamps for printks from the very start
> of kernel execution.  The timestamps are relative to the start of
> the architecture-specified counter used in get_cycles
> (e.g. the TSC on x86_64 and cntvct_el0 on arm64).
> 
> All timestamps reflect time from processor power-on instead of
> time from the kernel's timekeeping initialization.

I tried this patch in linux-next20260302 kernel running as a guest VM
on a Hyper-V host. Two things:

1) In the dmesg output, I'm seeing a place where the timestamps briefly go
backwards -- i.e., they are not monotonically increasing. Here's a snippet,
where there's a smaller timestamp immediately after the tsc is detected:

[   27.994891] SMBIOS 3.1.0 present.
[   27.994893] DMI: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/25/2025
[   27.994898] DMI: Memory slots populated: 2/2
[   27.995202] Hypervisor detected: Microsoft Hyper-V
[   27.995205] Hyper-V: privilege flags low 0xae7f, high 0x3b8030, ext 0x62, hints 0xa0e24, misc 0xe0bed7b2
[   27.995208] Hyper-V: Nested features: 0x0
[   27.995209] Hyper-V: LAPIC Timer Frequency: 0xc3500
[   27.995210] Hyper-V: Using hypercall for remote TLB flush
[   27.995216] clocksource: hyperv_clocksource_tsc_page: mask: 0xffffffffffffffff max_cycles: 0x24e6a1710, max_idle_ns: 440795202120 ns
[   27.995218] clocksource: hyperv_clocksource_msr: mask: 0xffffffffffffffff max_cycles: 0x24e6a1710, max_idle_ns: 440795202120 ns
[   27.995220] tsc: Detected 2918.401 MHz processor
[   27.991060] e820: update [mem 0x00000000-0x00000fff] System RAM ==> device reserved
[   27.991062] e820: remove [mem 0x000a0000-0x000fffff] System RAM
[   27.991064] last_pfn = 0x210000 max_arch_pfn = 0x400000000
[   27.991065] x86/PAT: PAT support disabled because CONFIG_X86_PAT is disabled in the kernel.
[   27.991066] x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WB  WT  UC- UC

Hyper-V provides a synthetic clocksource (two actually), and perhaps they
are the cause of the problem, though I haven't spent any time debugging.

2) A Linux VM running in the Azure cloud is also running on Hyper-V. Such a
VM typically uses cloud-init to set everything up at boot time, and cloud-init
is outputting lines to the serial console with a timestamp that looks like the
printk() timestamp, but apparently is not adjusted for the early timestamping
that this patch adds. Again, I haven't debugged what's going on -- I'm not
immediately sure of the mechanism that cloud-init uses to do output to the
serial console. The use of the Hyper-V synthetic clock source might the cause
of the problem here as well. Here's an output snippet from the serial console:

[   20.330414] systemd[1]: Condition check resulted in OpenVSwitch configuration for cleanup being skipped.
[   20.332911] systemd[1]: Condition check resulted in Set Up Additional Binary Formats being skipped.
[   20.333257] pstore: Registered efi_pstore as persistent store backend
[   20.334360] systemd[1]: Condition check resulted in File System Check on Root Device being skipped.
[   20.338319] systemd[1]: Starting Load Kernel Modules...
[   20.341094] systemd[1]: Starting Remount Root and Kernel File Systems...
[   20.350993] systemd[1]: Starting udev Coldplug all Devices...
[   20.356255] systemd[1]: Starting Uncomplicated firewall...
[   20.361536] systemd[1]: Started Journal Service.
[   20.386902] EXT4-fs (sda1): re-mounted c02dce0c-0c40-4e6e-88af-c5a0987b0adb r/w.
[   22.532033] /dev/sr0: Can't lookup blockdev
[    7.955973] cloud-init[783]: Cloud-init v. 24.3.1-0ubuntu0~20.04.1 running 'init-local' at Wed, 11 Mar 2026 15:27:06 +0000. Up 7.48 seconds.
[    9.933120] cloud-init[822]: Cloud-init v. 24.3.1-0ubuntu0~20.04.1 running 'init' at Wed, 11 Mar 2026 15:27:08 +0000. Up 9.82 seconds.
[    9.935483] cloud-init[822]: ci-info: ++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++
[    9.937726] cloud-init[822]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
[    9.939905] cloud-init[822]: ci-info: | Device |  Up  |           Address           |      Mask     | Scope  |     Hw-Address    |
[    9.942059] cloud-init[822]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+

The cloud-init lines don't show up in dmesg, so there's no problem there.

I will look into both issues further, but probably not today.

Michael

> 
> Signed-off-by: Tim Bird <tim.bird@sony.com>
> ---
> V2->V3
>  Default CONFIG option to 'n'
>  Move more code from into early_times.h
>   (reducing ifdefs in init/main.c)
>  Use math64 helper routines
>  Use cycles_t instead of u64 type
>  Add #defines for EARLY_CYCLES_BIT and MASK
>  Invert if logic in adjust_early_ts()
>  (note: no change to 'depends on' in Kconfig entry)
> 
> V1->V2
>  Remove calibration CONFIG vars
>  Add 'depends on' to restrict arches (to handle ppc bug)
>  Add early_ts_offset to avoid discontinuity
>  Save cycles in ts_nsec, and convert on output
>  Move conditional code to include file (early_times.h)
> 
>  include/linux/early_times.h | 85 +++++++++++++++++++++++++++++++++++++
>  init/Kconfig                | 14 ++++++
>  init/main.c                 |  6 +++
>  kernel/printk/printk.c      | 18 +++++++-
>  4 files changed, 121 insertions(+), 2 deletions(-)
>  create mode 100644 include/linux/early_times.h
> 
> diff --git a/include/linux/early_times.h b/include/linux/early_times.h
> new file mode 100644
> index 000000000000..05388dcb8573
> --- /dev/null
> +++ b/include/linux/early_times.h
> @@ -0,0 +1,85 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _KERNEL_PRINTK_EARLY_TIMES_H
> +#define _KERNEL_PRINTK_EARLY_TIMES_H
> +
> +#include <linux/timex.h>
> +#include <linux/clocksource.h>
> +
> +/* use high bit of a u64 to indicate cycles instead of a timestamp */
> +#define EARLY_CYCLES_BIT	BIT_ULL(63)
> +#define EARLY_CYCLES_MASK	~(BIT_ULL(63))
> +
> +#if defined(CONFIG_EARLY_PRINTK_TIMES)
> +extern cycles_t start_cycles;
> +extern u64 start_ns;
> +extern u32 early_mult, early_shift;
> +extern u64 early_ts_offset;
> +
> +static inline void early_times_start_calibration(void)
> +{
> +	start_cycles = get_cycles();
> +	start_ns = local_clock();
> +}
> +
> +static inline void early_times_finish_calibration(void)
> +{
> +	cycles_t end_cycles;
> +	u64 end_ns;
> +
> +	/* set calibration data for early_printk_times */
> +	end_cycles = get_cycles();
> +	end_ns = local_clock();
> +	clocks_calc_mult_shift(&early_mult, &early_shift,
> +		mul_u64_u64_div_u64(end_cycles - start_cycles,
> +			NSEC_PER_SEC, end_ns - start_ns),
> +		NSEC_PER_SEC, 100);
> +	early_ts_offset = mul_u64_u32_shr(start_cycles, early_mult, early_shift) -
> start_ns;
> +
> +	pr_debug("Early printk times: mult=%u, shift=%u, offset=%llu ns\n",
> +		early_mult, early_shift, early_ts_offset);
> +}
> +
> +static inline u64 early_cycles(void)
> +{
> +	return (get_cycles() | EARLY_CYCLES_BIT);
> +}
> +
> +/*
> + * adjust_early_ts detects whether ts in is cycles or nanoseconds
> + * and converts it or adjusts it, taking into account the offset
> + * from cycle-counter start.
> + *
> + * Note that early_mult may be 0, but that's OK because
> + * we'll just multiply by 0 and return 0. This will
> + * only occur if we're outputting a printk message
> + * before the calibration of the early timestamp.
> + * Any output after user space start (eg. from dmesg or
> + * journalctl) will show correct values.
> + */
> +static inline u64 adjust_early_ts(u64 ts)
> +{
> +	if (likely(!(ts & EARLY_CYCLES_BIT)))
> +		/* if timestamp is not in cycles, just add offset */
> +		return ts + early_ts_offset;
> +
> +	/* mask high bit and convert to nanoseconds */
> +	return mul_u64_u32_shr(ts & EARLY_CYCLES_MASK, early_mult, early_shift);
> +}
> +
> +#else
> +# define early_times_start_calibration() do { } while (0)
> +# define early_times_finish_calibration() do { } while (0)
> +
> +static inline u64 early_cycles(void)
> +{
> +	return 0;
> +}
> +
> +static inline u64 adjust_early_ts(u64 ts)
> +{
> +	return ts;
> +}
> +#endif /* CONFIG_EARLY_PRINTK_TIMES */
> +
> +#endif /* _KERNEL_PRINTK_EARLY_TIMES_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index fa79feb8fe57..a928c1efb09d 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -777,6 +777,20 @@ config IKHEADERS
>  	  or similar programs.  If you build the headers as a module, a module called
>  	  kheaders.ko is built which can be loaded on-demand to get access to headers.
> 
> +config EARLY_PRINTK_TIMES
> +	bool "Show non-zero printk timestamps early in boot"
> +	default n
> +	depends on PRINTK
> +	depends on ARM64 || X86_64
> +	help
> +	  Use a cycle-counter to provide printk timestamps during
> +	  early boot.  This allows seeing timestamps for printks that
> +	  would otherwise show as 0.  Note that this will shift the
> +	  printk timestamps to be relative to processor power on, instead
> +	  of relative to the start of kernel timekeeping.  This should be
> +	  closer to machine power on, giving a better indication of
> +	  overall boot time.
> +
>  config LOG_BUF_SHIFT
>  	int "Kernel log buffer size (16 => 64KB, 17 => 128KB)"
>  	range 12 25
> diff --git a/init/main.c b/init/main.c
> index b84818ad9685..d5774aec1aff 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -104,6 +104,7 @@
>  #include <linux/pidfs.h>
>  #include <linux/ptdump.h>
>  #include <linux/time_namespace.h>
> +#include <linux/early_times.h>
>  #include <net/net_namespace.h>
> 
>  #include <asm/io.h>
> @@ -1118,6 +1119,9 @@ void start_kernel(void)
>  	timekeeping_init();
>  	time_init();
> 
> +	/* This must be after timekeeping is initialized */
> +	early_times_start_calibration();
> +
>  	/* This must be after timekeeping is initialized */
>  	random_init();
> 
> @@ -1600,6 +1604,8 @@ static int __ref kernel_init(void *unused)
> 
>  	do_sysctl_args();
> 
> +	early_times_finish_calibration();
> +
>  	if (ramdisk_execute_command) {
>  		ret = run_init_process(ramdisk_execute_command);
>  		if (!ret)
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 1d765ad242b8..5afd31c3345c 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -46,6 +46,7 @@
>  #include <linux/ctype.h>
>  #include <linux/uio.h>
>  #include <linux/sched/clock.h>
> +#include <linux/early_times.h>
>  #include <linux/sched/debug.h>
>  #include <linux/sched/task_stack.h>
>  #include <linux/panic.h>
> @@ -75,6 +76,13 @@ EXPORT_SYMBOL(ignore_console_lock_warning);
> 
>  EXPORT_TRACEPOINT_SYMBOL_GPL(console);
> 
> +#ifdef CONFIG_EARLY_PRINTK_TIMES
> +cycles_t start_cycles;
> +u64 start_ns;
> +u32 early_mult, early_shift;
> +u64 early_ts_offset;
> +#endif
> +
>  /*
>   * Low level drivers may need that to know if they can schedule in
>   * their unblank() callback or not. So let's export it.
> @@ -639,7 +647,7 @@ static void append_char(char **pp, char *e, char c)
>  static ssize_t info_print_ext_header(char *buf, size_t size,
>  				     struct printk_info *info)
>  {
> -	u64 ts_usec = info->ts_nsec;
> +	u64 ts_usec = adjust_early_ts(info->ts_nsec);
>  	char caller[20];
>  #ifdef CONFIG_PRINTK_CALLER
>  	u32 id = info->caller_id;
> @@ -1352,7 +1360,11 @@ static size_t print_syslog(unsigned int level, char *buf)
> 
>  static size_t print_time(u64 ts, char *buf)
>  {
> -	unsigned long rem_nsec = do_div(ts, 1000000000);
> +	unsigned long rem_nsec;
> +
> +	ts = adjust_early_ts(ts);
> +
> +	rem_nsec = do_div(ts, 1000000000);
> 
>  	return sprintf(buf, "[%5lu.%06lu]",
>  		       (unsigned long)ts, rem_nsec / 1000);
> @@ -2242,6 +2254,8 @@ int vprintk_store(int facility, int level,
>  	 * timestamp with respect to the caller.
>  	 */
>  	ts_nsec = local_clock();
> +	if (!ts_nsec)
> +		ts_nsec = early_cycles();
> 
>  	caller_id = printk_caller_id();
> 
> --
> 2.43.0
> 


^ permalink raw reply

* Re: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Petr Mladek @ 2026-03-11 15:45 UTC (permalink / raw)
  To: Bird, Tim
  Cc: Shashank Balaji, rostedt@goodmis.org, john.ogness@linuxtronix.de,
	senozhatsky@chromium.org, francesco@valla.it,
	geert@linux-m68k.org, linux-embedded@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <SA1PR13MB5634C64991FD94776E9D3D82FD46A@SA1PR13MB5634.namprd13.prod.outlook.com>

On Tue 2026-03-10 18:54:22, Bird, Tim wrote:
> > From: Petr Mladek <pmladek@suse.com>
> > There are three stages:
> > 
> >    1. Early messages where the cycles are stored.
> > 
> >       The serial console shows zero time stamp because
> >       it reads the messages _before the calibration_, e.g.
> > 
> > 	[    0.000000] Linux version 6.19.0-rc7-default+ (pmladek@pathway) (gcc (SUSE Linux) 15.2.1 20251006, GNU ld (GNU Binutils;
> > openSUSE Tumbleweed) 2.45.0.20251103-2) #521 SMP PREEMPT_DYNAMIC Mon Feb  2 16:36:53 CET 2026
> > 	[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.19.0-rc7-default+ root=UUID=587ae802-e330-4059-9b48-
> > d5b845e1075a resume=/dev/disk/by-uuid/369c7453-3d16-409d-88b2-5de027891a12 mitigations=auto nosplash
> > earlycon=uart8250,io,0x3f8,115200 console=ttyS0,115200 console=ttynull console=tty0 debug_non_panic_cpus=1 panic=10 ignore_loglevel
> > log_buf_len=1M
> > 	[    0.000000] BIOS-provided physical RAM map:
> > 	[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
> > 	[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> > 
> >       But "dmesg" shows some numbers because it reads the messages
> >       _after the calibration_:
> > 
> > 	[    8.853613] Linux version 6.19.0-rc7-default+ (pmladek@pathway) (gcc (SUSE Linux) 15.2.1 20251006, GNU ld (GNU Binutils;
> > openSUSE Tumbleweed) 2.45.0.20251103-2) #521 SMP PREEMPT_DYNAMIC Mon Feb  2 16:36:53 CET 2026
> > 	[    8.853617] Command line: BOOT_IMAGE=/boot/vmlinuz-6.19.0-rc7-default+ root=UUID=587ae802-e330-4059-9b48-
> > d5b845e1075a resume=/dev/disk/by-uuid/369c7453-3d16-409d-88b2-5de027891a12 mitigations=auto nosplash
> > earlycon=uart8250,io,0x3f8,115200 console=ttyS0,115200 console=ttynull console=tty0 debug_non_panic_cpus=1 panic=10 ignore_loglevel
> > log_buf_len=1M
> > 	[    8.865086] BIOS-provided physical RAM map:
> > 	[    8.865087] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
> > 	[    8.865089] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> > 
> > 
> >    2. Early messages added _after the timekeeping_ is initialized
> >       but _before the early cycles calibration_.
> > 
> >       They serial console prints them _without the offset_ because
> >       it reads them _before the calibration_, e.g.
> > 
> > 	[    3.288049][    T1] Write protecting the kernel read-only data: 36864k
> > 	[    3.298554][    T1] Freeing unused kernel image (text/rodata gap) memory: 1656K
> > 	[    3.318942][    T1] Freeing unused kernel image (rodata/data gap) memory: 1540K
> > 
> >        But "dmesg" prints them _with the offset_ because it reads them
> >        _after the calibration_, e.g.
> > 
> > 	[   12.179999] [    T1] Write protecting the kernel read-only data: 36864k
> > 	[   12.190505] [    T1] Freeing unused kernel image (text/rodata gap) memory: 1656K
> > 	[   12.210893] [    T1] Freeing unused kernel image (rodata/data gap) memory: 1540K
> > 
> > 
> >    3. Messages added after the calibration of the early cycles.
> > 
> >       They are printed with the offset by both serial console and
> >       dmesg, e.g.
> > 
> > 	[   12.230014][    T1] Early printk times: mult=38775352, shift=27, offset=8891950261 ns
> > 	[   12.246008][    T1] Run /init as init process
> > 	[   12.254944][    T1]   with arguments:
> > 	[   12.264341][    T1]     /init
> 
> This is correct.  I don't want to overwhelm users of this, but there are three time-gathering periods, and roughly
> 2 output times (before calibration and after calibration).
> 
> early boot = before time init, before cycles calibration and offset calculation
> mid boot = after time init, before cycles calibration and offset calculation
> late boot = after time init, after cycles calibration and offset calculation
> All of these are before the start of user space processes.
>
> time of printk	output time	timestamp type	stored	timestamp output
> ------------------	----------------	-----------------------------	------------------------
> early boot	early boot	cycles			0
> mid boot	mid-boot	nanosecs		seconds, with offset from time_init
> late boot	late-boot	nanosecs		seconds with offset from cycle counter start

This is a bit confusing. I looks like the offset from time_init() is
not longer added.

> -----
> early boot	post-boot	(cycles already stored)	seconds with offset from cycle counter start
> mid boot	post-boot	(ns already stored)	seconds with offset from cycle counter start
> late boot	post-boot	(ns already stored)	seconds with offset from cycle counter start
> all others	post-boot	nanosecs		seconds with offset from cycle counter start			

> The confusing thing is messages that are output to the console before calibration and offset calculation.
> Reports from all user space tools (ie from dmesg or journalctl) should be correct and consistent.

Yup.

I like that table. I just wonder how to better distinguish the offset from
timekeeping and calibrated cycles. My variant:

<proposal>
The printk timestamps are stored and interpretted differently in
the following periods:

  - early boot: before timekeeping init, before cycles calibration
  - mid boot:   after timekeeping init, before cycles calibration
  - late boot:  after timekeeping init, after cycles calibration

Console output (immediately):

printk() time	stored value	immediate output(sec)
-----------------------------------------------------------------------------
early boot	cycles		0
mid boot	get_time()	get_time()
late boot	get_time()	get_time + calibrated(cycles offset)

User space tools and late registered consoles:

printk() time	stored value	output after calibration (sec)
-----------------------------------------------------------------------------
early boot	cycles		calibrated(cycles)
mid boot	get_time()	get_time() + calibrated(cycles offset)
late boot	get_time()	get_time() + calibrated(cycles offset)

I am not sure where to put this. One place might be
Documentation/core-api/printk-basics.rst. But it might be better
add a separate file either under core-api/ or under admin-guide/.

> > 
> > > An accessor can be implemented for (struct printk_info).ts_nsec, say
> > > get_timestamp(), which can be called from both the places.
> > 
> > Yeah, a helper function for reading the timestamp might be a cleaner solution.
> > 
> I consider adjust_early_ts() to be such an accessor function.  It's supposed
> to hide the details of the type of the timestamp (cycles or ns) and
> the offset. Maybe this could be renamed to something better,
> like: get_adjusted_ts()? Let me know what you think.

This a bike shedding area ;-)

I personally find

    u64 ts = get_printk_info_ts(info);

a bit cleaner than

    u64 ts = adjust_early_ts(info->ts_nsec);

because you might add a comment into struct printk_info
definition that nobody should read the timestamp directly.
They should use the helper intead.

The helper would do something like:

/*
 * The number of early cycles is stored before the timekeeping gets initialized.
 * The local_clock() value is stored later.
 *
 * Note that early_ts_offset, early_mult, and early_shift are 0
 * before the cycles get calibrated against the official time keeping.
 *
 * Any output after user space start (eg. from dmesg or journalctl)
 * will show consistent values with calibrated cycles and offset.
 */
static inline u64 get_printk_info_ts(const struct printk_info *info)
{
	if (likely(!(info->ts_cycles & EARLY_CYCLES_BIT)))
		/* if timestamp is not in cycles, just add offset */
		return info->ts_nsec + early_ts_offset;

	/* mask high bit and convert to nanoseconds */
	return mul_u64_u32_shr(info->ts_cycles & EARLY_CYCLES_MASK,
			       early_mult, early_shift);
}

Note that I have already used two names (ts_cycles and ts_nsec) which
would point to the same data via an union.

> I'll rebase this patch to resolve the #include conflicts, and address
> this feedback, and hopefully get a new version out this week.

Thanks a lot. v3 looked good enough to me. But v4 will be even better
after the feedback.

Take your time ;-)

Best Regards,
Petr

^ permalink raw reply

* RE: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Bird, Tim @ 2026-03-10 19:17 UTC (permalink / raw)
  To: Shashank Balaji
  Cc: pmladek@suse.com, rostedt@goodmis.org, John Ogness,
	senozhatsky@chromium.org, francesco@valla.it,
	geert@linux-m68k.org, linux-embedded@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <aa8C_1eUDoSgMGIv@fedora>



> -----Original Message-----
> From: Shashank Balaji <shashankbalaji02@gmail.com>
> Hi Tim,
> 
> Tested-by: Shashank Balaji <shashankbalaji02@gmail.com>

Thanks for the testing!!

> 
> ...on top of rc3 on an AMD Ryzen 7 4800H laptop. This patch conflicts
> with these commits with trivial fixes:
> 
> 032a730268a3 init/main.c: wrap long kernel cmdline when printing to logs
> 60325c27d3cf printk: Add execution context (task name/CPU) to printk_info
> 499f86de4f8c init/main: read bootconfig header with get_unaligned_le32()
> 
> Comment below.
> 
> On Tue, Feb 10, 2026 at 04:47:41PM -0700, Tim Bird wrote:
> > During early boot, printk timestamps are reported as zero before
> 	<snip>
> > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > index 1d765ad242b8..5afd31c3345c 100644
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -46,6 +46,7 @@
> >  #include <linux/ctype.h>
> >  #include <linux/uio.h>
> >  #include <linux/sched/clock.h>
> > +#include <linux/early_times.h>
> >  #include <linux/sched/debug.h>
> >  #include <linux/sched/task_stack.h>
> >  #include <linux/panic.h>
> > @@ -75,6 +76,13 @@ EXPORT_SYMBOL(ignore_console_lock_warning);
> >
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(console);
> >
> > +#ifdef CONFIG_EARLY_PRINTK_TIMES
> > +cycles_t start_cycles;
> > +u64 start_ns;
> > +u32 early_mult, early_shift;
> > +u64 early_ts_offset;
> > +#endif
> > +
> >  /*
> >   * Low level drivers may need that to know if they can schedule in
> >   * their unblank() callback or not. So let's export it.
> > @@ -639,7 +647,7 @@ static void append_char(char **pp, char *e, char c)
> >  static ssize_t info_print_ext_header(char *buf, size_t size,
> >  				     struct printk_info *info)
> >  {
> > -	u64 ts_usec = info->ts_nsec;
> > +	u64 ts_usec = adjust_early_ts(info->ts_nsec);
> >  	char caller[20];
> >  #ifdef CONFIG_PRINTK_CALLER
> >  	u32 id = info->caller_id;
> > @@ -1352,7 +1360,11 @@ static size_t print_syslog(unsigned int level, char *buf)
> >
> >  static size_t print_time(u64 ts, char *buf)
> >  {
> > -	unsigned long rem_nsec = do_div(ts, 1000000000);
> > +	unsigned long rem_nsec;
> > +
> > +	ts = adjust_early_ts(ts);
> > +
> > +	rem_nsec = do_div(ts, 1000000000);
> >
> >  	return sprintf(buf, "[%5lu.%06lu]",
> >  		       (unsigned long)ts, rem_nsec / 1000);
> > @@ -2242,6 +2254,8 @@ int vprintk_store(int facility, int level,
> >  	 * timestamp with respect to the caller.
> >  	 */
> >  	ts_nsec = local_clock();
> > +	if (!ts_nsec)
> > +		ts_nsec = early_cycles();
> 
> ts_nsec goes on to be stored in a struct printk_info's ts_nsec which is
> documented to be "timestamp in nanoseconds":
> 
> 	/*
> 	 * Meta information about each stored message.
> 	 *
> 	 * All fields are set by the printk code except for @seq, which is
> 	 * set by the ringbuffer code.
> 	 */
> 	struct printk_info {
> 		u64	seq;		/* sequence number */
> 		u64	ts_nsec;	/* timestamp in nanoseconds */
> 		u16	text_len;	/* length of text message */
> 		u8	facility;	/* syslog facility */
> 		u8	flags:5;	/* internal record flags */
> 		u8	level:3;	/* syslog level */
> 		u32	caller_id;	/* thread id or processor id */
> 	#ifdef CONFIG_PRINTK_EXECUTION_CTX
> 		u32	caller_id2;	/* caller_id complement */
> 		/* name of the task that generated the message */
> 		char	comm[TASK_COMM_LEN];
> 	#endif
> 
> 		struct dev_printk_info	dev_info;
> 	};
> 
> Since with this patch, ts_nsec can either be a timestamp in ns or a
> cycle count, the comment should be updated. Ideally, I'd like the member
> name to be changed as well to reflect the new semantic. I'm thinking
> ts_raw or ts_ns_or_cyc... naming is hard :)

Nice catch!  I'm considering either changing the comment, or
using a union here (or maybe both).

Thanks,
 -- Tim


^ permalink raw reply

* RE: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Bird, Tim @ 2026-03-10 18:54 UTC (permalink / raw)
  To: Petr Mladek, Shashank Balaji
  Cc: rostedt@goodmis.org, john.ogness@linuxtronix.de,
	senozhatsky@chromium.org, francesco@valla.it,
	geert@linux-m68k.org, linux-embedded@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <abADDuc43S3Vk-u-@pathway.suse.cz>

Thank you both Petr and Shashank for the review and feedback.

See my responses inline below.

> -----Original Message-----
> From: Petr Mladek <pmladek@suse.com>
> 
> On Tue 2026-03-10 04: 25: 33, Shashank Balaji wrote: > Hi again, > > On Tue, Feb 10, 2026 at 04: 47: 41PM -0700, Tim Bird wrote: > > During
> early boot, printk timestamps are reported as zero before > > kernel timekeeping starts
> On Tue 2026-03-10 04:25:33, Shashank Balaji wrote:
> > Hi again,
> >
> > On Tue, Feb 10, 2026 at 04:47:41PM -0700, Tim Bird wrote:
> > > During early boot, printk timestamps are reported as zero before
> > > kernel timekeeping starts (e.g. before time_init()).  This
> > > hinders boot-time optimization efforts.  This period is about 400
> > > milliseconds for many current desktop and embedded machines
> > > running Linux.
> > >
> > > Add support to save cycles during early boot, and output correct
> > > timestamp values after timekeeping is initialized.  get_cycles()
> > > is operational on arm64 and x86_64 from kernel start.  Add code
> > > and variables to save calibration values used to later convert
> > > cycle counts to time values in the early printks.  Add a config
> > > to control the feature.
> > >
> > > This yields non-zero timestamps for printks from the very start
> > > of kernel execution.  The timestamps are relative to the start of
> > > the architecture-specified counter used in get_cycles
> > > (e.g. the TSC on x86_64 and cntvct_el0 on arm64).
> > >
> > > All timestamps reflect time from processor power-on instead of
> > > time from the kernel's timekeeping initialization.
> > >
> > > Signed-off-by: Tim Bird <tim.bird@sony.com>
> >
> > So if a console is read before the cycles -> timestamp conversion can
> > happen, then they'll see 0. But reading from userspace will give the
> > right timestamps.
> >
> > Based on the previous discussions, to address this possible confusion,
> > if changing the timestamp format, like adding '?', is a no-go because
> > of concerns of breaking existing monitoring tools, what about appending
> > something to the printk string after the timestamp? Hmm, no, that'll
> > affect grep'ability _and_ may break monitoring tools. Or what about a
> > pr_warn() early in boot to warn about the possible timestamp difference?
> 
> Or we could make it more obvious from the message in
> early_times_finish_calibration(), see below.
> 
> > At the very least the possibility of this difference should be
> > documented in the Kconfig description.
> 
> Yeah, it would be nice to mention this in the Kconfig description.

I'll work on a v4 version of the patch, and add some text
to mention this.

> 
> > > --- /dev/null
> > > +++ b/include/linux/early_times.h
> > > @@ -0,0 +1,85 @@
> > > +static inline void early_times_finish_calibration(void)
> > > +{
> > > +	cycles_t end_cycles;
> > > +	u64 end_ns;
> > > +
> > > +	/* set calibration data for early_printk_times */
> > > +	end_cycles = get_cycles();
> > > +	end_ns = local_clock();
> > > +	clocks_calc_mult_shift(&early_mult, &early_shift,
> > > +		mul_u64_u64_div_u64(end_cycles - start_cycles,
> > > +			NSEC_PER_SEC, end_ns - start_ns),
> > > +		NSEC_PER_SEC, 100);
> > > +	early_ts_offset = mul_u64_u32_shr(start_cycles, early_mult, early_shift) - start_ns;
> > > +
> > > +	pr_debug("Early printk times: mult=%u, shift=%u, offset=%llu ns\n",
> > > +		early_mult, early_shift, early_ts_offset);
> 
> We might make it more obvious that an offset will get added to the
> existing timestamp since this point.
> 
> Also it has a "surprising" user visible effect so that it should
> be pr_info() instead of pr_debug(). Note pr_debug() messages might be hidden.
> 
> A minimalist change would be:
> 
> 	pr_info("Calibrated offset for early printk times: mult=%u, shift=%u, offset=%llu ns\n",
> 		early_mult, early_shift, early_ts_offset);

This sounds like a good idea.  Will do.

 
> And/Or we might add one more line:
> 
> 	pr_info("The time offset is added for existing and newly added printk messages since now!");
I think adding a message is a possibility, to explain the offset.

Let me consider the wordings for this.  Using 'now' is a bit ambiguous, because the offset is
always used in post-boot output, such as from dmesg or journalctl, even for messages earlier
in the log.
 
> 
> > > +}
> > > +
> > > --- a/kernel/printk/printk.c
> > > +++ b/kernel/printk/printk.c
> > > @@ -639,7 +647,7 @@ static void append_char(char **pp, char *e, char c)
> > >  static ssize_t info_print_ext_header(char *buf, size_t size,
> > >  				     struct printk_info *info)
> > >  {
> > > -	u64 ts_usec = info->ts_nsec;
> > > +	u64 ts_usec = adjust_early_ts(info->ts_nsec);
> >
> > printk_get_next_message() calls info_print_ext_header() for an extended
> > console (/dev/kmsg and netcon_ext use this), whereas for
> > non-extended consoles, record_print_text() -> info_print_prefix() ->
> > print_time() is called. So, this adjustment should be made in
> > print_time() too, otherwise non-extended console users are gonna be
> > spooked with insane timestamps.
> 
> The v3 patch already modifies print_time().
> 
> > This may explain the non-zero early
> > timestamps Petr saw in his serial console output [1].
> 
> I am a bit confused now. There are three stages:
> 
>    1. Early messages where the cycles are stored.
> 
>       The serial console shows zero time stamp because
>       it reads the messages _before the calibration_, e.g.
> 
> 	[    0.000000] Linux version 6.19.0-rc7-default+ (pmladek@pathway) (gcc (SUSE Linux) 15.2.1 20251006, GNU ld (GNU Binutils;
> openSUSE Tumbleweed) 2.45.0.20251103-2) #521 SMP PREEMPT_DYNAMIC Mon Feb  2 16:36:53 CET 2026
> 	[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.19.0-rc7-default+ root=UUID=587ae802-e330-4059-9b48-
> d5b845e1075a resume=/dev/disk/by-uuid/369c7453-3d16-409d-88b2-5de027891a12 mitigations=auto nosplash
> earlycon=uart8250,io,0x3f8,115200 console=ttyS0,115200 console=ttynull console=tty0 debug_non_panic_cpus=1 panic=10 ignore_loglevel
> log_buf_len=1M
> 	[    0.000000] BIOS-provided physical RAM map:
> 	[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
> 	[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> 
>       But "dmesg" shows some numbers because it reads the messages
>       _after the calibration_:
> 
> 	[    8.853613] Linux version 6.19.0-rc7-default+ (pmladek@pathway) (gcc (SUSE Linux) 15.2.1 20251006, GNU ld (GNU Binutils;
> openSUSE Tumbleweed) 2.45.0.20251103-2) #521 SMP PREEMPT_DYNAMIC Mon Feb  2 16:36:53 CET 2026
> 	[    8.853617] Command line: BOOT_IMAGE=/boot/vmlinuz-6.19.0-rc7-default+ root=UUID=587ae802-e330-4059-9b48-
> d5b845e1075a resume=/dev/disk/by-uuid/369c7453-3d16-409d-88b2-5de027891a12 mitigations=auto nosplash
> earlycon=uart8250,io,0x3f8,115200 console=ttyS0,115200 console=ttynull console=tty0 debug_non_panic_cpus=1 panic=10 ignore_loglevel
> log_buf_len=1M
> 	[    8.865086] BIOS-provided physical RAM map:
> 	[    8.865087] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
> 	[    8.865089] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> 
> 
>    2. Early messages added _after the timekeeping_ is initialized
>       but _before the early cycles calibration_.
> 
>       They serial console prints them _without the offset_ because
>       it reads them _before the calibration_, e.g.
> 
> 	[    3.288049][    T1] Write protecting the kernel read-only data: 36864k
> 	[    3.298554][    T1] Freeing unused kernel image (text/rodata gap) memory: 1656K
> 	[    3.318942][    T1] Freeing unused kernel image (rodata/data gap) memory: 1540K
> 
>        But "dmesg" prints them _with the offset_ because it reads them
>        _after the calibration_, e.g.
> 
> 	[   12.179999] [    T1] Write protecting the kernel read-only data: 36864k
> 	[   12.190505] [    T1] Freeing unused kernel image (text/rodata gap) memory: 1656K
> 	[   12.210893] [    T1] Freeing unused kernel image (rodata/data gap) memory: 1540K
> 
> 
>    3. Messages added after the calibration of the early cycles.
> 
>       They are printed with the offset by both serial console and
>       dmesg, e.g.
> 
> 	[   12.230014][    T1] Early printk times: mult=38775352, shift=27, offset=8891950261 ns
> 	[   12.246008][    T1] Run /init as init process
> 	[   12.254944][    T1]   with arguments:
> 	[   12.264341][    T1]     /init

This is correct.  I don't want to overwhelm users of this, but there are three time-gathering periods, and roughly
2 output times (before calibration and after calibration).

early boot = before time init, before cycles calibration and offset calculation
mid boot = after time init, before cycles calibration and offset calculation
late boot = after time init, after cycles calibration and offset calculation
All of these are before the start of user space processes.

time of printk	output time	timestamp type	stored	timestamp output
------------------	----------------	-----------------------------	------------------------
early boot	early boot	cycles			0
mid boot	mid-boot	nanosecs		seconds, with offset from time_init
late boot	late-boot	nanosecs		seconds with offset from cycle counter start
-----
early boot	post-boot	(cycles already stored)	seconds with offset from cycle counter start
mid boot	post-boot	(ns already stored)	seconds with offset from cycle counter start
late boot	post-boot	(ns already stored)	seconds with offset from cycle counter start
all others	post-boot	nanosecs		seconds with offset from cycle counter start			

The confusing thing is messages that are output to the console before calibration and offset calculation.
Reports from all user space tools (ie from dmesg or journalctl) should be correct and consistent.

> 
> > An accessor can be implemented for (struct printk_info).ts_nsec, say
> > get_timestamp(), which can be called from both the places.
> 
> Yeah, a helper function for reading the timestamp might be a cleaner solution.
> 

I consider adjust_early_ts() to be such an accessor function.  It's supposed
to hide the details of the type of the timestamp (cycles or ns) and the offset.
Maybe this could be renamed to something better, like: get_adjusted_ts()?
Let me know what you think.

> > [1] https://lore.kernel.org/all/aYDPn2EJgJIWGDhM@pathway/
> 
> Everything seems to be as expected there. The non-zero timestamps
> on the serial console are from messages added after the timekeeping
> was initialized.

I'll rebase this patch to resolve the #include conflicts, and address
this feedback, and hopefully get a new version out this week.

Thanks again for the discussion and ideas!
 -- Tim


^ permalink raw reply

* Re: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Petr Mladek @ 2026-03-10 11:39 UTC (permalink / raw)
  To: Shashank Balaji
  Cc: Tim Bird, rostedt, john.ogness, senozhatsky, francesco, geert,
	linux-embedded, linux-kernel
In-Reply-To: <aa8erTKI1K7eO1LI@fedora>

On Tue 2026-03-10 04:25:33, Shashank Balaji wrote:
> Hi again,
> 
> On Tue, Feb 10, 2026 at 04:47:41PM -0700, Tim Bird wrote:
> > During early boot, printk timestamps are reported as zero before
> > kernel timekeeping starts (e.g. before time_init()).  This
> > hinders boot-time optimization efforts.  This period is about 400
> > milliseconds for many current desktop and embedded machines
> > running Linux.
> > 
> > Add support to save cycles during early boot, and output correct
> > timestamp values after timekeeping is initialized.  get_cycles()
> > is operational on arm64 and x86_64 from kernel start.  Add code
> > and variables to save calibration values used to later convert
> > cycle counts to time values in the early printks.  Add a config
> > to control the feature.
> > 
> > This yields non-zero timestamps for printks from the very start
> > of kernel execution.  The timestamps are relative to the start of
> > the architecture-specified counter used in get_cycles
> > (e.g. the TSC on x86_64 and cntvct_el0 on arm64).
> > 
> > All timestamps reflect time from processor power-on instead of
> > time from the kernel's timekeeping initialization.
> > 
> > Signed-off-by: Tim Bird <tim.bird@sony.com>
> 
> So if a console is read before the cycles -> timestamp conversion can
> happen, then they'll see 0. But reading from userspace will give the
> right timestamps.
> 
> Based on the previous discussions, to address this possible confusion,
> if changing the timestamp format, like adding '?', is a no-go because
> of concerns of breaking existing monitoring tools, what about appending
> something to the printk string after the timestamp? Hmm, no, that'll
> affect grep'ability _and_ may break monitoring tools. Or what about a
> pr_warn() early in boot to warn about the possible timestamp difference?

Or we could make it more obvious from the message in
early_times_finish_calibration(), see below.

> At the very least the possibility of this difference should be
> documented in the Kconfig description.

Yeah, it would be nice to mention this in the Kconfig description.

> > --- /dev/null
> > +++ b/include/linux/early_times.h
> > @@ -0,0 +1,85 @@
> > +static inline void early_times_finish_calibration(void)
> > +{
> > +	cycles_t end_cycles;
> > +	u64 end_ns;
> > +
> > +	/* set calibration data for early_printk_times */
> > +	end_cycles = get_cycles();
> > +	end_ns = local_clock();
> > +	clocks_calc_mult_shift(&early_mult, &early_shift,
> > +		mul_u64_u64_div_u64(end_cycles - start_cycles,
> > +			NSEC_PER_SEC, end_ns - start_ns),
> > +		NSEC_PER_SEC, 100);
> > +	early_ts_offset = mul_u64_u32_shr(start_cycles, early_mult, early_shift) - start_ns;
> > +
> > +	pr_debug("Early printk times: mult=%u, shift=%u, offset=%llu ns\n",
> > +		early_mult, early_shift, early_ts_offset);

We might make it more obvious that an offset will get added to the
existing timestamp since this point.

Also it has a "surprising" user visible effect so that it should
be pr_info() instead of pr_debug(). Note pr_debug() messages might be hidden.

A minimalist change would be:

	pr_info("Calibrated offset for early printk times: mult=%u, shift=%u, offset=%llu ns\n",
		early_mult, early_shift, early_ts_offset);

And/Or we might add one more line:

	pr_info("The time offset is added for existing and newly added printk messages since now!");


> > +}
> > +
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -639,7 +647,7 @@ static void append_char(char **pp, char *e, char c)
> >  static ssize_t info_print_ext_header(char *buf, size_t size,
> >  				     struct printk_info *info)
> >  {
> > -	u64 ts_usec = info->ts_nsec;
> > +	u64 ts_usec = adjust_early_ts(info->ts_nsec);
> 
> printk_get_next_message() calls info_print_ext_header() for an extended
> console (/dev/kmsg and netcon_ext use this), whereas for
> non-extended consoles, record_print_text() -> info_print_prefix() ->
> print_time() is called. So, this adjustment should be made in
> print_time() too, otherwise non-extended console users are gonna be
> spooked with insane timestamps.

The v3 patch already modifies print_time().

> This may explain the non-zero early
> timestamps Petr saw in his serial console output [1].

I am a bit confused now. There are three stages:

   1. Early messages where the cycles are stored.

      The serial console shows zero time stamp because
      it reads the messages _before the calibration_, e.g.

	[    0.000000] Linux version 6.19.0-rc7-default+ (pmladek@pathway) (gcc (SUSE Linux) 15.2.1 20251006, GNU ld (GNU Binutils; openSUSE Tumbleweed) 2.45.0.20251103-2) #521 SMP PREEMPT_DYNAMIC Mon Feb  2 16:36:53 CET 2026
	[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.19.0-rc7-default+ root=UUID=587ae802-e330-4059-9b48-d5b845e1075a resume=/dev/disk/by-uuid/369c7453-3d16-409d-88b2-5de027891a12 mitigations=auto nosplash earlycon=uart8250,io,0x3f8,115200 console=ttyS0,115200 console=ttynull console=tty0 debug_non_panic_cpus=1 panic=10 ignore_loglevel log_buf_len=1M
	[    0.000000] BIOS-provided physical RAM map:
	[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
	[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved

      But "dmesg" shows some numbers because it reads the messages
      _after the calibration_:

	[    8.853613] Linux version 6.19.0-rc7-default+ (pmladek@pathway) (gcc (SUSE Linux) 15.2.1 20251006, GNU ld (GNU Binutils; openSUSE Tumbleweed) 2.45.0.20251103-2) #521 SMP PREEMPT_DYNAMIC Mon Feb  2 16:36:53 CET 2026
	[    8.853617] Command line: BOOT_IMAGE=/boot/vmlinuz-6.19.0-rc7-default+ root=UUID=587ae802-e330-4059-9b48-d5b845e1075a resume=/dev/disk/by-uuid/369c7453-3d16-409d-88b2-5de027891a12 mitigations=auto nosplash earlycon=uart8250,io,0x3f8,115200 console=ttyS0,115200 console=ttynull console=tty0 debug_non_panic_cpus=1 panic=10 ignore_loglevel log_buf_len=1M
	[    8.865086] BIOS-provided physical RAM map:
	[    8.865087] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
	[    8.865089] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved


   2. Early messages added _after the timekeeping_ is initialized
      but _before the early cycles calibration_.

      They serial console prints them _without the offset_ because
      it reads them _before the calibration_, e.g.

	[    3.288049][    T1] Write protecting the kernel read-only data: 36864k
	[    3.298554][    T1] Freeing unused kernel image (text/rodata gap) memory: 1656K
	[    3.318942][    T1] Freeing unused kernel image (rodata/data gap) memory: 1540K

       But "dmesg" prints them _with the offset_ because it reads them
       _after the calibration_, e.g.

	[   12.179999] [    T1] Write protecting the kernel read-only data: 36864k
	[   12.190505] [    T1] Freeing unused kernel image (text/rodata gap) memory: 1656K
	[   12.210893] [    T1] Freeing unused kernel image (rodata/data gap) memory: 1540K


   3. Messages added after the calibration of the early cycles.

      They are printed with the offset by both serial console and
      dmesg, e.g.

	[   12.230014][    T1] Early printk times: mult=38775352, shift=27, offset=8891950261 ns
	[   12.246008][    T1] Run /init as init process
	[   12.254944][    T1]   with arguments:
	[   12.264341][    T1]     /init


> An accessor can be implemented for (struct printk_info).ts_nsec, say
> get_timestamp(), which can be called from both the places.

Yeah, a helper function for reading the timestamp might be a cleaner solution.

> [1] https://lore.kernel.org/all/aYDPn2EJgJIWGDhM@pathway/

Everything seems to be as expected there. The non-zero timestamps
on the serial console are from messages added after the timekeeping
was initialized.

Best Regards,
Petr 

^ permalink raw reply

* Re: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Petr Mladek @ 2026-03-10 10:43 UTC (permalink / raw)
  To: Shashank Balaji
  Cc: Tim Bird, rostedt, john.ogness, senozhatsky, francesco, geert,
	linux-embedded, linux-kernel
In-Reply-To: <aa8C_1eUDoSgMGIv@fedora>

On Tue 2026-03-10 02:27:27, Shashank Balaji wrote:
> Hi Tim,
> 
> Tested-by: Shashank Balaji <shashankbalaji02@gmail.com>
> 
> ...on top of rc3 on an AMD Ryzen 7 4800H laptop. This patch conflicts
> with these commits with trivial fixes:
> 
> 032a730268a3 init/main.c: wrap long kernel cmdline when printing to logs
> 60325c27d3cfq printk: Add execution context (task name/CPU) to printk_info
> 499f86de4f8c init/main: read bootconfig header with get_unaligned_le32()

Good to know.

> On Tue, Feb 10, 2026 at 04:47:41PM -0700, Tim Bird wrote:
> > During early boot, printk timestamps are reported as zero before
> 	<snip>
> > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > index 1d765ad242b8..5afd31c3345c 100644
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -2242,6 +2254,8 @@ int vprintk_store(int facility, int level,
> >  	 * timestamp with respect to the caller.
> >  	 */
> >  	ts_nsec = local_clock();
> > +	if (!ts_nsec)
> > +		ts_nsec = early_cycles();
> 
> ts_nsec goes on to be stored in a struct printk_info's ts_nsec which is
> documented to be "timestamp in nanoseconds":
> 
> 	/*
> 	 * Meta information about each stored message.
> 	 *
> 	 * All fields are set by the printk code except for @seq, which is
> 	 * set by the ringbuffer code.
> 	 */
> 	struct printk_info {
> 		u64	seq;		/* sequence number */
> 		u64	ts_nsec;	/* timestamp in nanoseconds */
> 		u16	text_len;	/* length of text message */
> 		u8	facility;	/* syslog facility */
> 		u8	flags:5;	/* internal record flags */
> 		u8	level:3;	/* syslog level */
> 		u32	caller_id;	/* thread id or processor id */
> 	#ifdef CONFIG_PRINTK_EXECUTION_CTX
> 		u32	caller_id2;	/* caller_id complement */
> 		/* name of the task that generated the message */
> 		char	comm[TASK_COMM_LEN];
> 	#endif
> 
> 		struct dev_printk_info	dev_info;
> 	};
> 
> Since with this patch, ts_nsec can either be a timestamp in ns or a
> cycle count, the comment should be updated.

Yup, great catch!

> Ideally, I'd like the member
> name to be changed as well to reflect the new semantic. I'm thinking 
> ts_raw or ts_ns_or_cyc... naming is hard :)

Hmm, we could not change it easily because it would break user space
tools for reading kernel crash dump.

Alternative solution would be usign an union.

	union {
		u64	ts_nsec;
		u64	ts_cycles;
	};

Best Regards,
Petr

^ permalink raw reply

* Re: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Shashank Balaji @ 2026-03-09 19:25 UTC (permalink / raw)
  To: Tim Bird
  Cc: pmladek, rostedt, john.ogness, senozhatsky, francesco, geert,
	linux-embedded, linux-kernel
In-Reply-To: <20260210234741.3262320-1-tim.bird@sony.com>

Hi again,

On Tue, Feb 10, 2026 at 04:47:41PM -0700, Tim Bird wrote:
> During early boot, printk timestamps are reported as zero before
> kernel timekeeping starts (e.g. before time_init()).  This
> hinders boot-time optimization efforts.  This period is about 400
> milliseconds for many current desktop and embedded machines
> running Linux.
> 
> Add support to save cycles during early boot, and output correct
> timestamp values after timekeeping is initialized.  get_cycles()
> is operational on arm64 and x86_64 from kernel start.  Add code
> and variables to save calibration values used to later convert
> cycle counts to time values in the early printks.  Add a config
> to control the feature.
> 
> This yields non-zero timestamps for printks from the very start
> of kernel execution.  The timestamps are relative to the start of
> the architecture-specified counter used in get_cycles
> (e.g. the TSC on x86_64 and cntvct_el0 on arm64).
> 
> All timestamps reflect time from processor power-on instead of
> time from the kernel's timekeeping initialization.
> 
> Signed-off-by: Tim Bird <tim.bird@sony.com>

So if a console is read before the cycles -> timestamp conversion can
happen, then they'll see 0. But reading from userspace will give the
right timestamps.

Based on the previous discussions, to address this possible confusion,
if changing the timestamp format, like adding '?', is a no-go because
of concerns of breaking existing monitoring tools, what about appending
something to the printk string after the timestamp? Hmm, no, that'll
affect grep'ability _and_ may break monitoring tools. Or what about a
pr_warn() early in boot to warn about the possible timestamp difference?

At the very least the possibility of this difference should be
documented in the Kconfig description.

Continuing below...

> ---
> V2->V3
>  Default CONFIG option to 'n'
>  Move more code from into early_times.h
>   (reducing ifdefs in init/main.c)
>  Use math64 helper routines
>  Use cycles_t instead of u64 type
>  Add #defines for EARLY_CYCLES_BIT and MASK
>  Invert if logic in adjust_early_ts()
>  (note: no change to 'depends on' in Kconfig entry)
> 
> V1->V2
>  Remove calibration CONFIG vars
>  Add 'depends on' to restrict arches (to handle ppc bug)
>  Add early_ts_offset to avoid discontinuity
>  Save cycles in ts_nsec, and convert on output
>  Move conditional code to include file (early_times.h)
> 
>  include/linux/early_times.h | 85 +++++++++++++++++++++++++++++++++++++
>  init/Kconfig                | 14 ++++++
>  init/main.c                 |  6 +++
>  kernel/printk/printk.c      | 18 +++++++-
>  4 files changed, 121 insertions(+), 2 deletions(-)
>  create mode 100644 include/linux/early_times.h
> 
> diff --git a/include/linux/early_times.h b/include/linux/early_times.h
> new file mode 100644
> index 000000000000..05388dcb8573
> --- /dev/null
> +++ b/include/linux/early_times.h
> @@ -0,0 +1,85 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _KERNEL_PRINTK_EARLY_TIMES_H
> +#define _KERNEL_PRINTK_EARLY_TIMES_H
> +
> +#include <linux/timex.h>
> +#include <linux/clocksource.h>
> +
> +/* use high bit of a u64 to indicate cycles instead of a timestamp */
> +#define EARLY_CYCLES_BIT	BIT_ULL(63)
> +#define EARLY_CYCLES_MASK	~(BIT_ULL(63))
> +
> +#if defined(CONFIG_EARLY_PRINTK_TIMES)
> +extern cycles_t start_cycles;
> +extern u64 start_ns;
> +extern u32 early_mult, early_shift;
> +extern u64 early_ts_offset;
> +
> +static inline void early_times_start_calibration(void)
> +{
> +	start_cycles = get_cycles();
> +	start_ns = local_clock();
> +}
> +
> +static inline void early_times_finish_calibration(void)
> +{
> +	cycles_t end_cycles;
> +	u64 end_ns;
> +
> +	/* set calibration data for early_printk_times */
> +	end_cycles = get_cycles();
> +	end_ns = local_clock();
> +	clocks_calc_mult_shift(&early_mult, &early_shift,
> +		mul_u64_u64_div_u64(end_cycles - start_cycles,
> +			NSEC_PER_SEC, end_ns - start_ns),
> +		NSEC_PER_SEC, 100);
> +	early_ts_offset = mul_u64_u32_shr(start_cycles, early_mult, early_shift) - start_ns;
> +
> +	pr_debug("Early printk times: mult=%u, shift=%u, offset=%llu ns\n",
> +		early_mult, early_shift, early_ts_offset);
> +}
> +
> +static inline u64 early_cycles(void)
> +{
> +	return (get_cycles() | EARLY_CYCLES_BIT);
> +}
> +
> +/*
> + * adjust_early_ts detects whether ts in is cycles or nanoseconds
> + * and converts it or adjusts it, taking into account the offset
> + * from cycle-counter start.
> + *
> + * Note that early_mult may be 0, but that's OK because
> + * we'll just multiply by 0 and return 0. This will
> + * only occur if we're outputting a printk message
> + * before the calibration of the early timestamp.
> + * Any output after user space start (eg. from dmesg or
> + * journalctl) will show correct values.
> + */
> +static inline u64 adjust_early_ts(u64 ts)
> +{
> +	if (likely(!(ts & EARLY_CYCLES_BIT)))
> +		/* if timestamp is not in cycles, just add offset */
> +		return ts + early_ts_offset;
> +
> +	/* mask high bit and convert to nanoseconds */
> +	return mul_u64_u32_shr(ts & EARLY_CYCLES_MASK, early_mult, early_shift);
> +}
> +
> +#else
> +# define early_times_start_calibration() do { } while (0)
> +# define early_times_finish_calibration() do { } while (0)
> +
> +static inline u64 early_cycles(void)
> +{
> +	return 0;
> +}
> +
> +static inline u64 adjust_early_ts(u64 ts)
> +{
> +	return ts;
> +}
> +#endif /* CONFIG_EARLY_PRINTK_TIMES */
> +
> +#endiV
> diff --git a/init/Kconfig b/init/Kconfig
> index fa79feb8fe57..a928c1efb09d 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -777,6 +777,20 @@ config IKHEADERS
>  	  or similar programs.  If you build the headers as a module, a module called
>  	  kheaders.ko is built which can be loaded on-demand to get access to headers.
>  
> +config EARLY_PRINTK_TIMES
> +	bool "Show non-zero printk timestamps early in boot"
> +	default n
> +	depends on PRINTK
> +	depends on ARM64 || X86_64
> +	help
> +	  Use a cycle-counter to provide printk timestamps during
> +	  early boot.  This allows seeing timestamps for printks that
> +	  would otherwise show as 0.  Note that this will shift the
> +	  printk timestamps to be relative to processor power on, instead
> +	  of relative to the start of kernel timekeeping.  This should be
> +	  closer to machine power on, giving a better indication of
> +	  overall boot time.
> +
>  config LOG_BUF_SHIFT
>  	int "Kernel log buffer size (16 => 64KB, 17 => 128KB)"
>  	range 12 25
> diff --git a/init/main.c b/init/main.c
> index b84818ad9685..d5774aec1aff 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -104,6 +104,7 @@
>  #include <linux/pidfs.h>
>  #include <linux/ptdump.h>
>  #include <linux/time_namespace.h>
> +#include <linux/early_times.h>
>  #include <net/net_namespace.h>
>  
>  #include <asm/io.h>
> @@ -1118,6 +1119,9 @@ void start_kernel(void)
>  	timekeeping_init();
>  	time_init();
>  
> +	/* This must be after timekeeping is initialized */
> +	early_times_start_calibration();
> +
>  	/* This must be after timekeeping is initialized */
>  	random_init();
>  
> @@ -1600,6 +1604,8 @@ static int __ref kernel_init(void *unused)
>  
>  	do_sysctl_args();
>  
> +	early_times_finish_calibration();
> +
>  	if (ramdisk_execute_command) {
>  		ret = run_init_process(ramdisk_execute_command);
>  		if (!ret)
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 1d765ad242b8..5afd31c3345c 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -46,6 +46,7 @@
>  #include <linux/ctype.h>
>  #include <linux/uio.h>
>  #include <linux/sched/clock.h>
> +#include <linux/early_times.h>
>  #include <linux/sched/debug.h>
>  #include <linux/sched/task_stack.h>
>  #include <linux/panic.h>
> @@ -75,6 +76,13 @@ EXPORT_SYMBOL(ignore_console_lock_warning);
>  
>  EXPORT_TRACEPOINT_SYMBOL_GPL(console);
>  
> +#ifdef CONFIG_EARLY_PRINTK_TIMES
> +cycles_t start_cycles;
> +u64 start_ns;
> +u32 early_mult, early_shift;
> +u64 early_ts_offset;
> +#endif
> +
>  /*
>   * Low level drivers may need that to know if they can schedule in
>   * their unblank() callback or not. So let's export it.
> @@ -639,7 +647,7 @@ static void append_char(char **pp, char *e, char c)
>  static ssize_t info_print_ext_header(char *buf, size_t size,
>  				     struct printk_info *info)
>  {
> -	u64 ts_usec = info->ts_nsec;
> +	u64 ts_usec = adjust_early_ts(info->ts_nsec);

printk_get_next_message() calls info_print_ext_header() for an extended
console (/dev/kmsg and netcon_ext use this), whereas for
non-extended consoles, record_print_text() -> info_print_prefix() ->
print_time() is called. So, this adjustment should be made in
print_time() too, otherwise non-extended console users are gonna be
spooked with insane timestamps. This may explain the non-zero early
timestamps Petr saw in his serial console output [1].

An accessor can be implemented for (struct printk_info).ts_nsec, say
get_timestamp(), which can be called from both the places.

[1] https://lore.kernel.org/all/aYDPn2EJgJIWGDhM@pathway/

Thanks,
Shashank

^ permalink raw reply

* Re: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Shashank Balaji @ 2026-03-09 17:27 UTC (permalink / raw)
  To: Tim Bird
  Cc: pmladek, rostedt, john.ogness, senozhatsky, francesco, geert,
	linux-embedded, linux-kernel
In-Reply-To: <20260210234741.3262320-1-tim.bird@sony.com>

Hi Tim,

Tested-by: Shashank Balaji <shashankbalaji02@gmail.com>

...on top of rc3 on an AMD Ryzen 7 4800H laptop. This patch conflicts
with these commits with trivial fixes:

032a730268a3 init/main.c: wrap long kernel cmdline when printing to logs
60325c27d3cf printk: Add execution context (task name/CPU) to printk_info
499f86de4f8c init/main: read bootconfig header with get_unaligned_le32()

Comment below.

On Tue, Feb 10, 2026 at 04:47:41PM -0700, Tim Bird wrote:
> During early boot, printk timestamps are reported as zero before
	<snip>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 1d765ad242b8..5afd31c3345c 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -46,6 +46,7 @@
>  #include <linux/ctype.h>
>  #include <linux/uio.h>
>  #include <linux/sched/clock.h>
> +#include <linux/early_times.h>
>  #include <linux/sched/debug.h>
>  #include <linux/sched/task_stack.h>
>  #include <linux/panic.h>
> @@ -75,6 +76,13 @@ EXPORT_SYMBOL(ignore_console_lock_warning);
>  
>  EXPORT_TRACEPOINT_SYMBOL_GPL(console);
>  
> +#ifdef CONFIG_EARLY_PRINTK_TIMES
> +cycles_t start_cycles;
> +u64 start_ns;
> +u32 early_mult, early_shift;
> +u64 early_ts_offset;
> +#endif
> +
>  /*
>   * Low level drivers may need that to know if they can schedule in
>   * their unblank() callback or not. So let's export it.
> @@ -639,7 +647,7 @@ static void append_char(char **pp, char *e, char c)
>  static ssize_t info_print_ext_header(char *buf, size_t size,
>  				     struct printk_info *info)
>  {
> -	u64 ts_usec = info->ts_nsec;
> +	u64 ts_usec = adjust_early_ts(info->ts_nsec);
>  	char caller[20];
>  #ifdef CONFIG_PRINTK_CALLER
>  	u32 id = info->caller_id;
> @@ -1352,7 +1360,11 @@ static size_t print_syslog(unsigned int level, char *buf)
>  
>  static size_t print_time(u64 ts, char *buf)
>  {
> -	unsigned long rem_nsec = do_div(ts, 1000000000);
> +	unsigned long rem_nsec;
> +
> +	ts = adjust_early_ts(ts);
> +
> +	rem_nsec = do_div(ts, 1000000000);
>  
>  	return sprintf(buf, "[%5lu.%06lu]",
>  		       (unsigned long)ts, rem_nsec / 1000);
> @@ -2242,6 +2254,8 @@ int vprintk_store(int facility, int level,
>  	 * timestamp with respect to the caller.
>  	 */
>  	ts_nsec = local_clock();
> +	if (!ts_nsec)
> +		ts_nsec = early_cycles();

ts_nsec goes on to be stored in a struct printk_info's ts_nsec which is
documented to be "timestamp in nanoseconds":

	/*
	 * Meta information about each stored message.
	 *
	 * All fields are set by the printk code except for @seq, which is
	 * set by the ringbuffer code.
	 */
	struct printk_info {
		u64	seq;		/* sequence number */
		u64	ts_nsec;	/* timestamp in nanoseconds */
		u16	text_len;	/* length of text message */
		u8	facility;	/* syslog facility */
		u8	flags:5;	/* internal record flags */
		u8	level:3;	/* syslog level */
		u32	caller_id;	/* thread id or processor id */
	#ifdef CONFIG_PRINTK_EXECUTION_CTX
		u32	caller_id2;	/* caller_id complement */
		/* name of the task that generated the message */
		char	comm[TASK_COMM_LEN];
	#endif

		struct dev_printk_info	dev_info;
	};

Since with this patch, ts_nsec can either be a timestamp in ns or a
cycle count, the comment should be updated. Ideally, I'd like the member
name to be changed as well to reflect the new semantic. I'm thinking 
ts_raw or ts_ns_or_cyc... naming is hard :)

Thanks,
Shashank

^ permalink raw reply

* Re: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Petr Mladek @ 2026-03-04 11:23 UTC (permalink / raw)
  To: Tim Bird
  Cc: rostedt, john.ogness, senozhatsky, francesco, geert,
	linux-embedded, linux-kernel
In-Reply-To: <20260210234741.3262320-1-tim.bird@sony.com>

On Tue 2026-02-10 16:47:41, Tim Bird wrote:
> During early boot, printk timestamps are reported as zero before
> kernel timekeeping starts (e.g. before time_init()).  This
> hinders boot-time optimization efforts.  This period is about 400
> milliseconds for many current desktop and embedded machines
> running Linux.
> 
> Add support to save cycles during early boot, and output correct
> timestamp values after timekeeping is initialized.  get_cycles()
> is operational on arm64 and x86_64 from kernel start.  Add code
> and variables to save calibration values used to later convert
> cycle counts to time values in the early printks.  Add a config
> to control the feature.
> 
> This yields non-zero timestamps for printks from the very start
> of kernel execution.  The timestamps are relative to the start of
> the architecture-specified counter used in get_cycles
> (e.g. the TSC on x86_64 and cntvct_el0 on arm64).
> 
> All timestamps reflect time from processor power-on instead of
> time from the kernel's timekeeping initialization.
> 
> Signed-off-by: Tim Bird <tim.bird@sony.com>

It looks good to me and seems to work fine. Feel free to use:

Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>

See a note below.

> --- a/init/main.c
> +++ b/init/main.c
> @@ -104,6 +104,7 @@
>  #include <linux/pidfs.h>
>  #include <linux/ptdump.h>
>  #include <linux/time_namespace.h>
> +#include <linux/early_times.h>

JFYI, I have tried this patch on top of the current Linus' tree (v7.0-rc2+)
and it conflicted with the commit 499f86de4f8c34e19 ("init/main: read
bootconfig header with get_unaligned_le32()") which added here:

 #include <linux/unaligned.h>

>  #include <net/net_namespace.h>
>  
>  #include <asm/io.h>

Best Regards,
Petr

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox