[PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work()
@ 2026-05-25 13:08 Jinjie Ruan
  2026-06-11  1:40 ` Jinjie Ruan
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Jinjie Ruan @ 2026-05-25 13:08 UTC (permalink / raw)
  To: alexandre.belloni, linux-rtc, linux-kernel; +Cc: ruanjinjie

In virtualization environments like QEMU [1], or during hardware
clocksource anomalies, an extreme time-warp event can occur. When
the system time abruptly jumps forward, the rtc_timer_do_work() handler
falls into a prolonged processing loop to clear accumulated historical
timers via timerqueue_getnext(). Running this loop indefinitely under
the rtc->ops_lock mutex triggers a kernel softlockup, stalling
the system.

Introduce an adaptive telemetry and loop guard mechanism to enhance debug
visibility and prevent softlockups:

1. Record `start_jiffies` upon entry and leverage `time_after()` to
   check if the loop has monopolized the CPU for more than 1s (HZ). If so,
   the handler prints a telemetry warning, triggers a WARN stack dump, and
   breaks the loop to safely yield the CPU.

2. Track the execution via a `loop_count` metric. Printing this counter
   in the warning log provides vital diagnostics to distinguish
   an aggressive time-warp storm (high count) from a bogged-down callback
   bug (low count).

3. Utilize the kernel format specifier `%ptR` to convert the raw ktime
   into a human-readable timestamp (YYYY-MM-DD HH:MM:SS), allowing
   developers to instantly pinpoint the exact boundary of the time
   jump in dmesg.

This non-destructive telemetry guard provides precise hardware/emulator
diagnostic visibility while ensuring core kernel availability.

[1]: https://lore.kernel.org/all/20260114013257.3500578-1-ruanjinjie@huawei.com/
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
---
 drivers/rtc/interface.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c
index 1906f4884a83..f6c5fd16cc4e 100644
--- a/drivers/rtc/interface.c
+++ b/drivers/rtc/interface.c
@@ -927,10 +927,12 @@ static void rtc_timer_remove(struct rtc_device *rtc, struct rtc_timer *timer)
  */
 void rtc_timer_do_work(struct work_struct *work)
 {
-	struct rtc_timer *timer;
+	unsigned long start_jiffies = jiffies;
 	struct timerqueue_node *next;
-	ktime_t now;
+	struct rtc_timer *timer;
 	struct rtc_time tm;
+	int loop_count = 0;
+	ktime_t now;
 	int err;

 	struct rtc_device *rtc =
@@ -945,6 +947,15 @@ void rtc_timer_do_work(struct work_struct *work)
 	}
 	now = rtc_tm_to_ktime(tm);
 	while ((next = timerqueue_getnext(&rtc->timerqueue))) {
+		loop_count++;
+
+		if (unlikely(time_after(jiffies, start_jiffies + HZ))) {
+			dev_warn(&rtc->dev, "RTC time jump (loop: %d) to %ptR.\n",
+				 loop_count, &tm);
+			WARN_ON_ONCE(1);
+			break;
+		}
+
 		if (next->expires > now)
 			break;

-- 
2.34.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work()
  2026-05-25 13:08 [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work() Jinjie Ruan
@ 2026-06-11  1:40 ` Jinjie Ruan
  2026-06-11  2:01 ` Jinjie Ruan
  2026-06-15 15:22 ` Alexandre Belloni
  2 siblings, 0 replies; 7+ messages in thread
From: Jinjie Ruan @ 2026-06-11  1:40 UTC (permalink / raw)
  To: alexandre.belloni, linux-rtc, linux-kernel

Gentle ping.

On 5/25/2026 9:08 PM, Jinjie Ruan wrote:
> In virtualization environments like QEMU [1], or during hardware
> clocksource anomalies, an extreme time-warp event can occur. When
> the system time abruptly jumps forward, the rtc_timer_do_work() handler
> falls into a prolonged processing loop to clear accumulated historical
> timers via timerqueue_getnext(). Running this loop indefinitely under
> the rtc->ops_lock mutex triggers a kernel softlockup, stalling
> the system.
> 
> Introduce an adaptive telemetry and loop guard mechanism to enhance debug
> visibility and prevent softlockups:
> 
> 1. Record `start_jiffies` upon entry and leverage `time_after()` to
>    check if the loop has monopolized the CPU for more than 1s (HZ). If so,
>    the handler prints a telemetry warning, triggers a WARN stack dump, and
>    breaks the loop to safely yield the CPU.
> 
> 2. Track the execution via a `loop_count` metric. Printing this counter
>    in the warning log provides vital diagnostics to distinguish
>    an aggressive time-warp storm (high count) from a bogged-down callback
>    bug (low count).
> 
> 3. Utilize the kernel format specifier `%ptR` to convert the raw ktime
>    into a human-readable timestamp (YYYY-MM-DD HH:MM:SS), allowing
>    developers to instantly pinpoint the exact boundary of the time
>    jump in dmesg.
> 
> This non-destructive telemetry guard provides precise hardware/emulator
> diagnostic visibility while ensuring core kernel availability.
> 
> [1]: https://lore.kernel.org/all/20260114013257.3500578-1-ruanjinjie@huawei.com/
> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
> ---
>  drivers/rtc/interface.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c
> index 1906f4884a83..f6c5fd16cc4e 100644
> --- a/drivers/rtc/interface.c
> +++ b/drivers/rtc/interface.c
> @@ -927,10 +927,12 @@ static void rtc_timer_remove(struct rtc_device *rtc, struct rtc_timer *timer)
>   */
>  void rtc_timer_do_work(struct work_struct *work)
>  {
> -	struct rtc_timer *timer;
> +	unsigned long start_jiffies = jiffies;
>  	struct timerqueue_node *next;
> -	ktime_t now;
> +	struct rtc_timer *timer;
>  	struct rtc_time tm;
> +	int loop_count = 0;
> +	ktime_t now;
>  	int err;
>  
>  	struct rtc_device *rtc =
> @@ -945,6 +947,15 @@ void rtc_timer_do_work(struct work_struct *work)
>  	}
>  	now = rtc_tm_to_ktime(tm);
>  	while ((next = timerqueue_getnext(&rtc->timerqueue))) {
> +		loop_count++;
> +
> +		if (unlikely(time_after(jiffies, start_jiffies + HZ))) {
> +			dev_warn(&rtc->dev, "RTC time jump (loop: %d) to %ptR.\n",
> +				 loop_count, &tm);
> +			WARN_ON_ONCE(1);
> +			break;
> +		}
> +
>  		if (next->expires > now)
>  			break;
>  


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work()
  2026-05-25 13:08 [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work() Jinjie Ruan
  2026-06-11  1:40 ` Jinjie Ruan
@ 2026-06-11  2:01 ` Jinjie Ruan
  2026-06-15 15:22 ` Alexandre Belloni
  2 siblings, 0 replies; 7+ messages in thread
From: Jinjie Ruan @ 2026-06-11  2:01 UTC (permalink / raw)
  To: alexandre.belloni, linux-rtc, linux-kernel, tglx

+cc Thomas Gleixner

On 5/25/2026 9:08 PM, Jinjie Ruan wrote:
> In virtualization environments like QEMU [1], or during hardware
> clocksource anomalies, an extreme time-warp event can occur. When
> the system time abruptly jumps forward, the rtc_timer_do_work() handler
> falls into a prolonged processing loop to clear accumulated historical
> timers via timerqueue_getnext(). Running this loop indefinitely under
> the rtc->ops_lock mutex triggers a kernel softlockup, stalling
> the system.
> 
> Introduce an adaptive telemetry and loop guard mechanism to enhance debug
> visibility and prevent softlockups:
> 
> 1. Record `start_jiffies` upon entry and leverage `time_after()` to
>    check if the loop has monopolized the CPU for more than 1s (HZ). If so,
>    the handler prints a telemetry warning, triggers a WARN stack dump, and
>    breaks the loop to safely yield the CPU.
> 
> 2. Track the execution via a `loop_count` metric. Printing this counter
>    in the warning log provides vital diagnostics to distinguish
>    an aggressive time-warp storm (high count) from a bogged-down callback
>    bug (low count).
> 
> 3. Utilize the kernel format specifier `%ptR` to convert the raw ktime
>    into a human-readable timestamp (YYYY-MM-DD HH:MM:SS), allowing
>    developers to instantly pinpoint the exact boundary of the time
>    jump in dmesg.
> 
> This non-destructive telemetry guard provides precise hardware/emulator
> diagnostic visibility while ensuring core kernel availability.
> 
> [1]: https://lore.kernel.org/all/20260114013257.3500578-1-ruanjinjie@huawei.com/
> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
> ---
>  drivers/rtc/interface.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c
> index 1906f4884a83..f6c5fd16cc4e 100644
> --- a/drivers/rtc/interface.c
> +++ b/drivers/rtc/interface.c
> @@ -927,10 +927,12 @@ static void rtc_timer_remove(struct rtc_device *rtc, struct rtc_timer *timer)
>   */
>  void rtc_timer_do_work(struct work_struct *work)
>  {
> -	struct rtc_timer *timer;
> +	unsigned long start_jiffies = jiffies;
>  	struct timerqueue_node *next;
> -	ktime_t now;
> +	struct rtc_timer *timer;
>  	struct rtc_time tm;
> +	int loop_count = 0;
> +	ktime_t now;
>  	int err;
>  
>  	struct rtc_device *rtc =
> @@ -945,6 +947,15 @@ void rtc_timer_do_work(struct work_struct *work)
>  	}
>  	now = rtc_tm_to_ktime(tm);
>  	while ((next = timerqueue_getnext(&rtc->timerqueue))) {
> +		loop_count++;
> +
> +		if (unlikely(time_after(jiffies, start_jiffies + HZ))) {
> +			dev_warn(&rtc->dev, "RTC time jump (loop: %d) to %ptR.\n",
> +				 loop_count, &tm);
> +			WARN_ON_ONCE(1);
> +			break;
> +		}
> +
>  		if (next->expires > now)
>  			break;
>  


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work()
  2026-05-25 13:08 [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work() Jinjie Ruan
  2026-06-11  1:40 ` Jinjie Ruan
  2026-06-11  2:01 ` Jinjie Ruan
@ 2026-06-15 15:22 ` Alexandre Belloni
  2026-06-16  2:10   ` Jinjie Ruan
  2 siblings, 1 reply; 7+ messages in thread
From: Alexandre Belloni @ 2026-06-15 15:22 UTC (permalink / raw)
  To: Jinjie Ruan; +Cc: linux-rtc, linux-kernel

Hello,

On 25/05/2026 21:08:25+0800, Jinjie Ruan wrote:
> In virtualization environments like QEMU [1], or during hardware
> clocksource anomalies, an extreme time-warp event can occur. When
> the system time abruptly jumps forward, the rtc_timer_do_work() handler
> falls into a prolonged processing loop to clear accumulated historical
> timers via timerqueue_getnext(). Running this loop indefinitely under
> the rtc->ops_lock mutex triggers a kernel softlockup, stalling
> the system.
> 
> Introduce an adaptive telemetry and loop guard mechanism to enhance debug
> visibility and prevent softlockups:
> 
> 1. Record `start_jiffies` upon entry and leverage `time_after()` to
>    check if the loop has monopolized the CPU for more than 1s (HZ). If so,
>    the handler prints a telemetry warning, triggers a WARN stack dump, and
>    breaks the loop to safely yield the CPU.
> 
> 2. Track the execution via a `loop_count` metric. Printing this counter
>    in the warning log provides vital diagnostics to distinguish
>    an aggressive time-warp storm (high count) from a bogged-down callback
>    bug (low count).
> 
> 3. Utilize the kernel format specifier `%ptR` to convert the raw ktime
>    into a human-readable timestamp (YYYY-MM-DD HH:MM:SS), allowing
>    developers to instantly pinpoint the exact boundary of the time
>    jump in dmesg.
> 
> This non-destructive telemetry guard provides precise hardware/emulator
> diagnostic visibility while ensuring core kernel availability.
> 
> [1]: https://lore.kernel.org/all/20260114013257.3500578-1-ruanjinjie@huawei.com/
> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
> ---
>  drivers/rtc/interface.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c
> index 1906f4884a83..f6c5fd16cc4e 100644
> --- a/drivers/rtc/interface.c
> +++ b/drivers/rtc/interface.c
> @@ -927,10 +927,12 @@ static void rtc_timer_remove(struct rtc_device *rtc, struct rtc_timer *timer)
>   */
>  void rtc_timer_do_work(struct work_struct *work)
>  {
> -	struct rtc_timer *timer;
> +	unsigned long start_jiffies = jiffies;
>  	struct timerqueue_node *next;
> -	ktime_t now;
> +	struct rtc_timer *timer;
>  	struct rtc_time tm;
> +	int loop_count = 0;
> +	ktime_t now;
>  	int err;
>  
>  	struct rtc_device *rtc =
> @@ -945,6 +947,15 @@ void rtc_timer_do_work(struct work_struct *work)
>  	}
>  	now = rtc_tm_to_ktime(tm);
>  	while ((next = timerqueue_getnext(&rtc->timerqueue))) {
> +		loop_count++;
> +
> +		if (unlikely(time_after(jiffies, start_jiffies + HZ))) {
> +			dev_warn(&rtc->dev, "RTC time jump (loop: %d) to %ptR.\n",
> +				 loop_count, &tm);
> +			WARN_ON_ONCE(1);

So, your issue is that it is too slow so you make it even slower? There
are already plenty of tracepoints that allow proper debugging in this
loop, I'm pretty sure we don't want to bloat the kernel with more
messages.


-- 
Alexandre Belloni, co-owner and COO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work()
  2026-06-15 15:22 ` Alexandre Belloni
@ 2026-06-16  2:10   ` Jinjie Ruan
  2026-06-16  6:57     ` Alexandre Belloni
  0 siblings, 1 reply; 7+ messages in thread
From: Jinjie Ruan @ 2026-06-16  2:10 UTC (permalink / raw)
  To: Alexandre Belloni; +Cc: linux-rtc, linux-kernel



On 6/15/2026 11:22 PM, Alexandre Belloni wrote:
> Hello,
> 
> On 25/05/2026 21:08:25+0800, Jinjie Ruan wrote:
>> In virtualization environments like QEMU [1], or during hardware
>> clocksource anomalies, an extreme time-warp event can occur. When
>> the system time abruptly jumps forward, the rtc_timer_do_work() handler
>> falls into a prolonged processing loop to clear accumulated historical
>> timers via timerqueue_getnext(). Running this loop indefinitely under
>> the rtc->ops_lock mutex triggers a kernel softlockup, stalling
>> the system.
>>
>> Introduce an adaptive telemetry and loop guard mechanism to enhance debug
>> visibility and prevent softlockups:
>>
>> 1. Record `start_jiffies` upon entry and leverage `time_after()` to
>>    check if the loop has monopolized the CPU for more than 1s (HZ). If so,
>>    the handler prints a telemetry warning, triggers a WARN stack dump, and
>>    breaks the loop to safely yield the CPU.
>>
>> 2. Track the execution via a `loop_count` metric. Printing this counter
>>    in the warning log provides vital diagnostics to distinguish
>>    an aggressive time-warp storm (high count) from a bogged-down callback
>>    bug (low count).
>>
>> 3. Utilize the kernel format specifier `%ptR` to convert the raw ktime
>>    into a human-readable timestamp (YYYY-MM-DD HH:MM:SS), allowing
>>    developers to instantly pinpoint the exact boundary of the time
>>    jump in dmesg.
>>
>> This non-destructive telemetry guard provides precise hardware/emulator
>> diagnostic visibility while ensuring core kernel availability.
>>
>> [1]: https://lore.kernel.org/all/20260114013257.3500578-1-ruanjinjie@huawei.com/
>> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
>> ---
>>  drivers/rtc/interface.c | 15 +++++++++++++--
>>  1 file changed, 13 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c
>> index 1906f4884a83..f6c5fd16cc4e 100644
>> --- a/drivers/rtc/interface.c
>> +++ b/drivers/rtc/interface.c
>> @@ -927,10 +927,12 @@ static void rtc_timer_remove(struct rtc_device *rtc, struct rtc_timer *timer)
>>   */
>>  void rtc_timer_do_work(struct work_struct *work)
>>  {
>> -	struct rtc_timer *timer;
>> +	unsigned long start_jiffies = jiffies;
>>  	struct timerqueue_node *next;
>> -	ktime_t now;
>> +	struct rtc_timer *timer;
>>  	struct rtc_time tm;
>> +	int loop_count = 0;
>> +	ktime_t now;
>>  	int err;
>>  
>>  	struct rtc_device *rtc =
>> @@ -945,6 +947,15 @@ void rtc_timer_do_work(struct work_struct *work)
>>  	}
>>  	now = rtc_tm_to_ktime(tm);
>>  	while ((next = timerqueue_getnext(&rtc->timerqueue))) {
>> +		loop_count++;
>> +
>> +		if (unlikely(time_after(jiffies, start_jiffies + HZ))) {
>> +			dev_warn(&rtc->dev, "RTC time jump (loop: %d) to %ptR.\n",
>> +				 loop_count, &tm);
>> +			WARN_ON_ONCE(1);
> 
> So, your issue is that it is too slow so you make it even slower? There
> are already plenty of tracepoints that allow proper debugging in this
> loop, I'm pretty sure we don't want to bloat the kernel with more
> messages.

Hi, Alexandre,

The point here is not about the performance of the rtc_timer_do_work()
loop — it’s about making the problem debuggable when things go wrong.
And we can put it under a debug Kconfig option, so production kernels
see no extra overhead at all.

The patch is installed in the following scenarios:

If the RTC hardware fails, or if the QEMU-emulated RTC device code in a
KVM virtual machine has a problem (for example, the x86 RTC emulation
hardware mc146818 has an overflow issue[1]), the time may jump as shown
in the log below, which can cause a soft lockup.

However, when the issue occurs, it is only possible to know that too
many pending timers have accumulated in the timerqueue (for example, the
log shows that tens of millions of timer nodes have been processed) by
temporarily adding diagnostic code in rtc_timer_do_work().

To determine whether the root cause is hardware, kernel RTC code, an RTC
driver issue, or an RTC hardware problem, more debugging is needed. But
if the problem is indeed caused by RTC hardware, adding a diagnostic
print of the current RTC time when the loop takes too long (as this
diagnostic patch does) would make it easy to tell whether QEMU or the
hardware is faulty.

[1]: https://lore.kernel.org/all/20260613195116.1807273-21-mjt@tls.msk.ru/

     kworker/0:1-37      [000] .N..   489.159634: rtc_timer_do_work:
timerqueue_getnext handle timer node count: 13281423
     kworker/0:1-37      [000] .N..   489.159635: rtc_timer_do_work:
timerqueue_getnext handle timer node count: 13281424
     kworker/0:1-37      [000] .N..   489.159635: rtc_timer_do_work:
timerqueue_getnext handle timer node count: 13281425
     kworker/0:1-37      [000] .N..   489.159636: rtc_timer_do_work:
timerqueue_getnext handle timer node count: 13281426
     kworker/0:1-37      [000] .N..   489.159637: rtc_timer_do_work:
timerqueue_getnext handle timer node count: 13281427
     kworker/0:1-37      [000] .N..   489.159638: rtc_timer_do_work:
timerqueue_getnext handle timer node count: 13281428
     kworker/0:1-37      [000] .N..   489.159638: rtc_timer_do_work:
timerqueue_getnext handle timer node count: 13281429
     kworker/0:1-37      [000] .N..   489.159639: rtc_timer_do_work:
timerqueue_getnext handle timer node count: 13281430
     kworker/0:1-37      [000] .N..   489.159640: rtc_timer_do_work:
timerqueue_getnext handle timer node count: 13281431
     kworker/0:1-37      [000] .N..   489.159641: rtc_timer_do_work:
timerqueue_getnext handle timer node count: 13281432


   swapper/0-1       [001] .N..    11.579334: __rtc_read_time: rtc:
0xff11000109896800, ops->read_time:2026:01:05:09:36:21
       swapper/0-1       [001] .N..    11.579421: __rtc_read_time: rtc:
0xff11000109896800, ops->read_time:2026:01:05:09:36:21
       swapper/0-1       [001] .N..    11.579469: __rtc_read_time: rtc:
0xff11000109896800, ops->read_time:2026:01:05:09:36:21
       swapper/0-1       [001] .N..    11.580816: __rtc_read_time: rtc:
0xff11000109896800, ops->read_time:2026:01:05:09:36:21
  syz-executor.5-7492    [003] ....   129.807406: __rtc_read_time: rtc:
0xff11000109896800, ops->read_time:2033:05:04:07:03:51
  syz-executor.5-7492    [003] ....   129.807419:
__rtc_update_irq_enable.part.8: rtc uie on: 0xff11000109896800, now:
2033:05:04:07:03:51, expire: 2033:05:04:07:03:52


Best regards,
Jinjie

> 
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work()
  2026-06-16  2:10   ` Jinjie Ruan
@ 2026-06-16  6:57     ` Alexandre Belloni
  2026-06-16  8:12       ` Jinjie Ruan
  0 siblings, 1 reply; 7+ messages in thread
From: Alexandre Belloni @ 2026-06-16  6:57 UTC (permalink / raw)
  To: Jinjie Ruan; +Cc: linux-rtc, linux-kernel

Hi,

On 16/06/2026 10:10:19+0800, Jinjie Ruan wrote:
> 
> 
> On 6/15/2026 11:22 PM, Alexandre Belloni wrote:
> > Hello,
> > 
> > On 25/05/2026 21:08:25+0800, Jinjie Ruan wrote:
> >> In virtualization environments like QEMU [1], or during hardware
> >> clocksource anomalies, an extreme time-warp event can occur. When
> >> the system time abruptly jumps forward, the rtc_timer_do_work() handler
> >> falls into a prolonged processing loop to clear accumulated historical
> >> timers via timerqueue_getnext(). Running this loop indefinitely under
> >> the rtc->ops_lock mutex triggers a kernel softlockup, stalling
> >> the system.
> >>
> >> Introduce an adaptive telemetry and loop guard mechanism to enhance debug
> >> visibility and prevent softlockups:
> >>
> >> 1. Record `start_jiffies` upon entry and leverage `time_after()` to
> >>    check if the loop has monopolized the CPU for more than 1s (HZ). If so,
> >>    the handler prints a telemetry warning, triggers a WARN stack dump, and
> >>    breaks the loop to safely yield the CPU.
> >>
> >> 2. Track the execution via a `loop_count` metric. Printing this counter
> >>    in the warning log provides vital diagnostics to distinguish
> >>    an aggressive time-warp storm (high count) from a bogged-down callback
> >>    bug (low count).
> >>
> >> 3. Utilize the kernel format specifier `%ptR` to convert the raw ktime
> >>    into a human-readable timestamp (YYYY-MM-DD HH:MM:SS), allowing
> >>    developers to instantly pinpoint the exact boundary of the time
> >>    jump in dmesg.
> >>
> >> This non-destructive telemetry guard provides precise hardware/emulator
> >> diagnostic visibility while ensuring core kernel availability.
> >>
> >> [1]: https://lore.kernel.org/all/20260114013257.3500578-1-ruanjinjie@huawei.com/
> >> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
> >> ---
> >>  drivers/rtc/interface.c | 15 +++++++++++++--
> >>  1 file changed, 13 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c
> >> index 1906f4884a83..f6c5fd16cc4e 100644
> >> --- a/drivers/rtc/interface.c
> >> +++ b/drivers/rtc/interface.c
> >> @@ -927,10 +927,12 @@ static void rtc_timer_remove(struct rtc_device *rtc, struct rtc_timer *timer)
> >>   */
> >>  void rtc_timer_do_work(struct work_struct *work)
> >>  {
> >> -	struct rtc_timer *timer;
> >> +	unsigned long start_jiffies = jiffies;
> >>  	struct timerqueue_node *next;
> >> -	ktime_t now;
> >> +	struct rtc_timer *timer;
> >>  	struct rtc_time tm;
> >> +	int loop_count = 0;
> >> +	ktime_t now;
> >>  	int err;
> >>  
> >>  	struct rtc_device *rtc =
> >> @@ -945,6 +947,15 @@ void rtc_timer_do_work(struct work_struct *work)
> >>  	}
> >>  	now = rtc_tm_to_ktime(tm);
> >>  	while ((next = timerqueue_getnext(&rtc->timerqueue))) {
> >> +		loop_count++;
> >> +
> >> +		if (unlikely(time_after(jiffies, start_jiffies + HZ))) {
> >> +			dev_warn(&rtc->dev, "RTC time jump (loop: %d) to %ptR.\n",
> >> +				 loop_count, &tm);
> >> +			WARN_ON_ONCE(1);
> > 
> > So, your issue is that it is too slow so you make it even slower? There
> > are already plenty of tracepoints that allow proper debugging in this
> > loop, I'm pretty sure we don't want to bloat the kernel with more
> > messages.
> 
> Hi, Alexandre,
> 
> The point here is not about the performance of the rtc_timer_do_work()
> loop — it’s about making the problem debuggable when things go wrong.
> And we can put it under a debug Kconfig option, so production kernels
> see no extra overhead at all.


But then aren't the tracepoint enough? There are 3 tracepoints in the
loop that are exactly for debugging your issue.

> 
> The patch is installed in the following scenarios:
> 
> If the RTC hardware fails, or if the QEMU-emulated RTC device code in a
> KVM virtual machine has a problem (for example, the x86 RTC emulation
> hardware mc146818 has an overflow issue[1]), the time may jump as shown
> in the log below, which can cause a soft lockup.

This is not clear, you are not explaining the issue. You seem to mix
system time and RTC time. What I understand is that there was a periodic
timer enqueued and then for some reason, the system time jumped forward
by a large amount and now rtc_timer_do_work is firing events for each of
the missed timers. You are not explaining the relationship with the RTC
hardware (I see none). 

> 
> However, when the issue occurs, it is only possible to know that too
> many pending timers have accumulated in the timerqueue (for example, the
> log shows that tens of millions of timer nodes have been processed) by
> temporarily adding diagnostic code in rtc_timer_do_work().
> 

This is not true, there are 3 tracepoints to know what is happening with
the timers. Also, there are always exactly zero, one or two timers in
the queue, never tens of millions.

> To determine whether the root cause is hardware, kernel RTC code, an RTC
> driver issue, or an RTC hardware problem, more debugging is needed. But
> if the problem is indeed caused by RTC hardware, adding a diagnostic
> print of the current RTC time when the loop takes too long (as this
> diagnostic patch does) would make it easy to tell whether QEMU or the
> hardware is faulty.
> 
> [1]: https://lore.kernel.org/all/20260613195116.1807273-21-mjt@tls.msk.ru/
> 
>      kworker/0:1-37      [000] .N..   489.159634: rtc_timer_do_work:
> timerqueue_getnext handle timer node count: 13281423
>      kworker/0:1-37      [000] .N..   489.159635: rtc_timer_do_work:
> timerqueue_getnext handle timer node count: 13281424
>      kworker/0:1-37      [000] .N..   489.159635: rtc_timer_do_work:
> timerqueue_getnext handle timer node count: 13281425
>      kworker/0:1-37      [000] .N..   489.159636: rtc_timer_do_work:
> timerqueue_getnext handle timer node count: 13281426
>      kworker/0:1-37      [000] .N..   489.159637: rtc_timer_do_work:
> timerqueue_getnext handle timer node count: 13281427
>      kworker/0:1-37      [000] .N..   489.159638: rtc_timer_do_work:
> timerqueue_getnext handle timer node count: 13281428
>      kworker/0:1-37      [000] .N..   489.159638: rtc_timer_do_work:
> timerqueue_getnext handle timer node count: 13281429
>      kworker/0:1-37      [000] .N..   489.159639: rtc_timer_do_work:
> timerqueue_getnext handle timer node count: 13281430
>      kworker/0:1-37      [000] .N..   489.159640: rtc_timer_do_work:
> timerqueue_getnext handle timer node count: 13281431
>      kworker/0:1-37      [000] .N..   489.159641: rtc_timer_do_work:
> timerqueue_getnext handle timer node count: 13281432
> 
> 
>    swapper/0-1       [001] .N..    11.579334: __rtc_read_time: rtc:
> 0xff11000109896800, ops->read_time:2026:01:05:09:36:21
>        swapper/0-1       [001] .N..    11.579421: __rtc_read_time: rtc:
> 0xff11000109896800, ops->read_time:2026:01:05:09:36:21
>        swapper/0-1       [001] .N..    11.579469: __rtc_read_time: rtc:
> 0xff11000109896800, ops->read_time:2026:01:05:09:36:21
>        swapper/0-1       [001] .N..    11.580816: __rtc_read_time: rtc:
> 0xff11000109896800, ops->read_time:2026:01:05:09:36:21
>   syz-executor.5-7492    [003] ....   129.807406: __rtc_read_time: rtc:
> 0xff11000109896800, ops->read_time:2033:05:04:07:03:51
>   syz-executor.5-7492    [003] ....   129.807419:
> __rtc_update_irq_enable.part.8: rtc uie on: 0xff11000109896800, now:
> 2033:05:04:07:03:51, expire: 2033:05:04:07:03:52
> 
> 
> Best regards,
> Jinjie
> 
> > 
> > 
> 

-- 
Alexandre Belloni, co-owner and COO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work()
  2026-06-16  6:57     ` Alexandre Belloni
@ 2026-06-16  8:12       ` Jinjie Ruan
  0 siblings, 0 replies; 7+ messages in thread
From: Jinjie Ruan @ 2026-06-16  8:12 UTC (permalink / raw)
  To: Alexandre Belloni; +Cc: linux-rtc, linux-kernel



On 6/16/2026 2:57 PM, Alexandre Belloni wrote:
> Hi,
> 
> On 16/06/2026 10:10:19+0800, Jinjie Ruan wrote:
>>
>>
>> On 6/15/2026 11:22 PM, Alexandre Belloni wrote:
>>> Hello,
>>>
>>> On 25/05/2026 21:08:25+0800, Jinjie Ruan wrote:
>>>> In virtualization environments like QEMU [1], or during hardware
>>>> clocksource anomalies, an extreme time-warp event can occur. When
>>>> the system time abruptly jumps forward, the rtc_timer_do_work() handler
>>>> falls into a prolonged processing loop to clear accumulated historical
>>>> timers via timerqueue_getnext(). Running this loop indefinitely under
>>>> the rtc->ops_lock mutex triggers a kernel softlockup, stalling
>>>> the system.
>>>>
>>>> Introduce an adaptive telemetry and loop guard mechanism to enhance debug
>>>> visibility and prevent softlockups:
>>>>
>>>> 1. Record `start_jiffies` upon entry and leverage `time_after()` to
>>>>    check if the loop has monopolized the CPU for more than 1s (HZ). If so,
>>>>    the handler prints a telemetry warning, triggers a WARN stack dump, and
>>>>    breaks the loop to safely yield the CPU.
>>>>
>>>> 2. Track the execution via a `loop_count` metric. Printing this counter
>>>>    in the warning log provides vital diagnostics to distinguish
>>>>    an aggressive time-warp storm (high count) from a bogged-down callback
>>>>    bug (low count).
>>>>
>>>> 3. Utilize the kernel format specifier `%ptR` to convert the raw ktime
>>>>    into a human-readable timestamp (YYYY-MM-DD HH:MM:SS), allowing
>>>>    developers to instantly pinpoint the exact boundary of the time
>>>>    jump in dmesg.
>>>>
>>>> This non-destructive telemetry guard provides precise hardware/emulator
>>>> diagnostic visibility while ensuring core kernel availability.
>>>>
>>>> [1]: https://lore.kernel.org/all/20260114013257.3500578-1-ruanjinjie@huawei.com/
>>>> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
>>>> ---
>>>>  drivers/rtc/interface.c | 15 +++++++++++++--
>>>>  1 file changed, 13 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c
>>>> index 1906f4884a83..f6c5fd16cc4e 100644
>>>> --- a/drivers/rtc/interface.c
>>>> +++ b/drivers/rtc/interface.c
>>>> @@ -927,10 +927,12 @@ static void rtc_timer_remove(struct rtc_device *rtc, struct rtc_timer *timer)
>>>>   */
>>>>  void rtc_timer_do_work(struct work_struct *work)
>>>>  {
>>>> -	struct rtc_timer *timer;
>>>> +	unsigned long start_jiffies = jiffies;
>>>>  	struct timerqueue_node *next;
>>>> -	ktime_t now;
>>>> +	struct rtc_timer *timer;
>>>>  	struct rtc_time tm;
>>>> +	int loop_count = 0;
>>>> +	ktime_t now;
>>>>  	int err;
>>>>  
>>>>  	struct rtc_device *rtc =
>>>> @@ -945,6 +947,15 @@ void rtc_timer_do_work(struct work_struct *work)
>>>>  	}
>>>>  	now = rtc_tm_to_ktime(tm);
>>>>  	while ((next = timerqueue_getnext(&rtc->timerqueue))) {
>>>> +		loop_count++;
>>>> +
>>>> +		if (unlikely(time_after(jiffies, start_jiffies + HZ))) {
>>>> +			dev_warn(&rtc->dev, "RTC time jump (loop: %d) to %ptR.\n",
>>>> +				 loop_count, &tm);
>>>> +			WARN_ON_ONCE(1);
>>>
>>> So, your issue is that it is too slow so you make it even slower? There
>>> are already plenty of tracepoints that allow proper debugging in this
>>> loop, I'm pretty sure we don't want to bloat the kernel with more
>>> messages.
>>
>> Hi, Alexandre,
>>
>> The point here is not about the performance of the rtc_timer_do_work()
>> loop — it’s about making the problem debuggable when things go wrong.
>> And we can put it under a debug Kconfig option, so production kernels
>> see no extra overhead at all.
> 
> 
> But then aren't the tracepoint enough? There are 3 tracepoints in the
> loop that are exactly for debugging your issue.

Hi, Alexandre,

Regarding the `trace_rtc_timer_dequeue` and `trace_rtc_timer_enqueue`
tracepoints, they are unfortunately insufficient to pinpoint this issue
for three reasons:

1. Ring Buffer Overwrite during Softlockup:

When the loop iterates tens of millions of times continuously on a
locked-up CPU, it floods the ftrace ring buffer in milliseconds. The
earliest trace logs—which contain the exact moment the time jumped—will
be completely overwritten and lost before anyone can read them.

2. Lack of Causality Context:
These tracepoints only log the timer's expiration and queue state.They
show the symptom (infinite re-enqueuing). But these are not the primary
scene, the time jump is the primary scene. They do not capture why are
there so many UIE timers being re-enqueued?

> 
>>
>> The patch is installed in the following scenarios:
>>
>> If the RTC hardware fails, or if the QEMU-emulated RTC device code in a
>> KVM virtual machine has a problem (for example, the x86 RTC emulation
>> hardware mc146818 has an overflow issue[1]), the time may jump as shown
>> in the log below, which can cause a soft lockup.
> 
> This is not clear, you are not explaining the issue. You seem to mix
> system time and RTC time. What I understand is that there was a periodic
> timer enqueued and then for some reason, the system time jumped forward
> by a large amount and now rtc_timer_do_work is firing events for each of
> the missed timers. You are not explaining the relationship with the RTC
> hardware (I see none). 

The time here all means the RTC time got by __rtc_read_time(), not the
system time.

And the RTC time jump is caused by RTC hardware failure or
`QEMU-emulated RTC device` code bug, which means the rtc hardware
returns an unstable rtc time or is not completely linear growth..

The original issue is as follows：

On kvm qemu with cmos rtc and mc146818 chip, after set the UIE timer
expire with a normal RTC time （for example 2026 year）, In
rtc_timer_do_work(), the rtc time jump to a future time (for example
2033 year), it will loop for a while util softlockup because all
subsequent enqueued UIE timers expires, as below:

RTC_UIE_ON:
	read now: 2019:04:08:12:32:27, add timer0 (expire: 2019:04:08:12:32:28)
^^^^^^^^^^^^^^^^^^^^
...
rtc_timer_do_work() iterate the list in a loop:
	read now: 2033:12:02:07:27:15
		  ^^^^^^^^^^^^^^^^^^^
	handle timer0, add timer1 to the list (expire: 2019:04:08:12:32:29)
	handle timer1, add timer2 to the list (expire: 2019:04:08:12:32:30)
	handle timer2, add timer3: 2019:04:08:12:32:31
	...
	-> softlockup

> 
>>
>> However, when the issue occurs, it is only possible to know that too
>> many pending timers have accumulated in the timerqueue (for example, the
>> log shows that tens of millions of timer nodes have been processed) by
>> temporarily adding diagnostic code in rtc_timer_do_work().
>>
> 
> This is not true, there are 3 tracepoints to know what is happening with
> the timers. Also, there are always exactly zero, one or two timers in
> the queue, never tens of millions.

Timer Queue Size vs. Loop Iteration Count:

You are completely correct that there are only ever 0, 1, or 2 timer
nodes linked in the timerqueue. My previous phrasing was inaccurate, I
mean the loop Iteration.

> 
>> To determine whether the root cause is hardware, kernel RTC code, an RTC
>> driver issue, or an RTC hardware problem, more debugging is needed. But
>> if the problem is indeed caused by RTC hardware, adding a diagnostic
>> print of the current RTC time when the loop takes too long (as this
>> diagnostic patch does) would make it easy to tell whether QEMU or the
>> hardware is faulty.
>>
>> [1]: https://lore.kernel.org/all/20260613195116.1807273-21-mjt@tls.msk.ru/
>>
>>      kworker/0:1-37      [000] .N..   489.159634: rtc_timer_do_work:
>> timerqueue_getnext handle timer node count: 13281423
>>      kworker/0:1-37      [000] .N..   489.159635: rtc_timer_do_work:
>> timerqueue_getnext handle timer node count: 13281424
>>      kworker/0:1-37      [000] .N..   489.159635: rtc_timer_do_work:
>> timerqueue_getnext handle timer node count: 13281425
>>      kworker/0:1-37      [000] .N..   489.159636: rtc_timer_do_work:
>> timerqueue_getnext handle timer node count: 13281426
>>      kworker/0:1-37      [000] .N..   489.159637: rtc_timer_do_work:
>> timerqueue_getnext handle timer node count: 13281427
>>      kworker/0:1-37      [000] .N..   489.159638: rtc_timer_do_work:
>> timerqueue_getnext handle timer node count: 13281428
>>      kworker/0:1-37      [000] .N..   489.159638: rtc_timer_do_work:
>> timerqueue_getnext handle timer node count: 13281429
>>      kworker/0:1-37      [000] .N..   489.159639: rtc_timer_do_work:
>> timerqueue_getnext handle timer node count: 13281430
>>      kworker/0:1-37      [000] .N..   489.159640: rtc_timer_do_work:
>> timerqueue_getnext handle timer node count: 13281431
>>      kworker/0:1-37      [000] .N..   489.159641: rtc_timer_do_work:
>> timerqueue_getnext handle timer node count: 13281432
>>
>>
>>    swapper/0-1       [001] .N..    11.579334: __rtc_read_time: rtc:
>> 0xff11000109896800, ops->read_time:2026:01:05:09:36:21
>>        swapper/0-1       [001] .N..    11.579421: __rtc_read_time: rtc:
>> 0xff11000109896800, ops->read_time:2026:01:05:09:36:21
>>        swapper/0-1       [001] .N..    11.579469: __rtc_read_time: rtc:
>> 0xff11000109896800, ops->read_time:2026:01:05:09:36:21
>>        swapper/0-1       [001] .N..    11.580816: __rtc_read_time: rtc:
>> 0xff11000109896800, ops->read_time:2026:01:05:09:36:21
>>   syz-executor.5-7492    [003] ....   129.807406: __rtc_read_time: rtc:
>> 0xff11000109896800, ops->read_time:2033:05:04:07:03:51
>>   syz-executor.5-7492    [003] ....   129.807419:
>> __rtc_update_irq_enable.part.8: rtc uie on: 0xff11000109896800, now:
>> 2033:05:04:07:03:51, expire: 2033:05:04:07:03:52
>>
>>
>> Best regards,
>> Jinjie
>>
>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2026-06-16  8:12 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-25 13:08 [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work() Jinjie Ruan
2026-06-11  1:40 ` Jinjie Ruan
2026-06-11  2:01 ` Jinjie Ruan
2026-06-15 15:22 ` Alexandre Belloni
2026-06-16  2:10   ` Jinjie Ruan
2026-06-16  6:57     ` Alexandre Belloni
2026-06-16  8:12       ` Jinjie Ruan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.