* [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work()
@ 2026-05-25 13:08 Jinjie Ruan
2026-06-11 1:40 ` Jinjie Ruan
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Jinjie Ruan @ 2026-05-25 13:08 UTC (permalink / raw)
To: alexandre.belloni, linux-rtc, linux-kernel; +Cc: ruanjinjie
In virtualization environments like QEMU [1], or during hardware
clocksource anomalies, an extreme time-warp event can occur. When
the system time abruptly jumps forward, the rtc_timer_do_work() handler
falls into a prolonged processing loop to clear accumulated historical
timers via timerqueue_getnext(). Running this loop indefinitely under
the rtc->ops_lock mutex triggers a kernel softlockup, stalling
the system.
Introduce an adaptive telemetry and loop guard mechanism to enhance debug
visibility and prevent softlockups:
1. Record `start_jiffies` upon entry and leverage `time_after()` to
check if the loop has monopolized the CPU for more than 1s (HZ). If so,
the handler prints a telemetry warning, triggers a WARN stack dump, and
breaks the loop to safely yield the CPU.
2. Track the execution via a `loop_count` metric. Printing this counter
in the warning log provides vital diagnostics to distinguish
an aggressive time-warp storm (high count) from a bogged-down callback
bug (low count).
3. Utilize the kernel format specifier `%ptR` to convert the raw ktime
into a human-readable timestamp (YYYY-MM-DD HH:MM:SS), allowing
developers to instantly pinpoint the exact boundary of the time
jump in dmesg.
This non-destructive telemetry guard provides precise hardware/emulator
diagnostic visibility while ensuring core kernel availability.
[1]: https://lore.kernel.org/all/20260114013257.3500578-1-ruanjinjie@huawei.com/
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
---
drivers/rtc/interface.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c
index 1906f4884a83..f6c5fd16cc4e 100644
--- a/drivers/rtc/interface.c
+++ b/drivers/rtc/interface.c
@@ -927,10 +927,12 @@ static void rtc_timer_remove(struct rtc_device *rtc, struct rtc_timer *timer)
*/
void rtc_timer_do_work(struct work_struct *work)
{
- struct rtc_timer *timer;
+ unsigned long start_jiffies = jiffies;
struct timerqueue_node *next;
- ktime_t now;
+ struct rtc_timer *timer;
struct rtc_time tm;
+ int loop_count = 0;
+ ktime_t now;
int err;
struct rtc_device *rtc =
@@ -945,6 +947,15 @@ void rtc_timer_do_work(struct work_struct *work)
}
now = rtc_tm_to_ktime(tm);
while ((next = timerqueue_getnext(&rtc->timerqueue))) {
+ loop_count++;
+
+ if (unlikely(time_after(jiffies, start_jiffies + HZ))) {
+ dev_warn(&rtc->dev, "RTC time jump (loop: %d) to %ptR.\n",
+ loop_count, &tm);
+ WARN_ON_ONCE(1);
+ break;
+ }
+
if (next->expires > now)
break;
--
2.34.1
^ permalink raw reply related [flat|nested] 7+ messages in thread* Re: [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work() 2026-05-25 13:08 [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work() Jinjie Ruan @ 2026-06-11 1:40 ` Jinjie Ruan 2026-06-11 2:01 ` Jinjie Ruan 2026-06-15 15:22 ` Alexandre Belloni 2 siblings, 0 replies; 7+ messages in thread From: Jinjie Ruan @ 2026-06-11 1:40 UTC (permalink / raw) To: alexandre.belloni, linux-rtc, linux-kernel Gentle ping. On 5/25/2026 9:08 PM, Jinjie Ruan wrote: > In virtualization environments like QEMU [1], or during hardware > clocksource anomalies, an extreme time-warp event can occur. When > the system time abruptly jumps forward, the rtc_timer_do_work() handler > falls into a prolonged processing loop to clear accumulated historical > timers via timerqueue_getnext(). Running this loop indefinitely under > the rtc->ops_lock mutex triggers a kernel softlockup, stalling > the system. > > Introduce an adaptive telemetry and loop guard mechanism to enhance debug > visibility and prevent softlockups: > > 1. Record `start_jiffies` upon entry and leverage `time_after()` to > check if the loop has monopolized the CPU for more than 1s (HZ). If so, > the handler prints a telemetry warning, triggers a WARN stack dump, and > breaks the loop to safely yield the CPU. > > 2. Track the execution via a `loop_count` metric. Printing this counter > in the warning log provides vital diagnostics to distinguish > an aggressive time-warp storm (high count) from a bogged-down callback > bug (low count). > > 3. Utilize the kernel format specifier `%ptR` to convert the raw ktime > into a human-readable timestamp (YYYY-MM-DD HH:MM:SS), allowing > developers to instantly pinpoint the exact boundary of the time > jump in dmesg. > > This non-destructive telemetry guard provides precise hardware/emulator > diagnostic visibility while ensuring core kernel availability. > > [1]: https://lore.kernel.org/all/20260114013257.3500578-1-ruanjinjie@huawei.com/ > Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> > --- > drivers/rtc/interface.c | 15 +++++++++++++-- > 1 file changed, 13 insertions(+), 2 deletions(-) > > diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c > index 1906f4884a83..f6c5fd16cc4e 100644 > --- a/drivers/rtc/interface.c > +++ b/drivers/rtc/interface.c > @@ -927,10 +927,12 @@ static void rtc_timer_remove(struct rtc_device *rtc, struct rtc_timer *timer) > */ > void rtc_timer_do_work(struct work_struct *work) > { > - struct rtc_timer *timer; > + unsigned long start_jiffies = jiffies; > struct timerqueue_node *next; > - ktime_t now; > + struct rtc_timer *timer; > struct rtc_time tm; > + int loop_count = 0; > + ktime_t now; > int err; > > struct rtc_device *rtc = > @@ -945,6 +947,15 @@ void rtc_timer_do_work(struct work_struct *work) > } > now = rtc_tm_to_ktime(tm); > while ((next = timerqueue_getnext(&rtc->timerqueue))) { > + loop_count++; > + > + if (unlikely(time_after(jiffies, start_jiffies + HZ))) { > + dev_warn(&rtc->dev, "RTC time jump (loop: %d) to %ptR.\n", > + loop_count, &tm); > + WARN_ON_ONCE(1); > + break; > + } > + > if (next->expires > now) > break; > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work() 2026-05-25 13:08 [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work() Jinjie Ruan 2026-06-11 1:40 ` Jinjie Ruan @ 2026-06-11 2:01 ` Jinjie Ruan 2026-06-15 15:22 ` Alexandre Belloni 2 siblings, 0 replies; 7+ messages in thread From: Jinjie Ruan @ 2026-06-11 2:01 UTC (permalink / raw) To: alexandre.belloni, linux-rtc, linux-kernel, tglx +cc Thomas Gleixner On 5/25/2026 9:08 PM, Jinjie Ruan wrote: > In virtualization environments like QEMU [1], or during hardware > clocksource anomalies, an extreme time-warp event can occur. When > the system time abruptly jumps forward, the rtc_timer_do_work() handler > falls into a prolonged processing loop to clear accumulated historical > timers via timerqueue_getnext(). Running this loop indefinitely under > the rtc->ops_lock mutex triggers a kernel softlockup, stalling > the system. > > Introduce an adaptive telemetry and loop guard mechanism to enhance debug > visibility and prevent softlockups: > > 1. Record `start_jiffies` upon entry and leverage `time_after()` to > check if the loop has monopolized the CPU for more than 1s (HZ). If so, > the handler prints a telemetry warning, triggers a WARN stack dump, and > breaks the loop to safely yield the CPU. > > 2. Track the execution via a `loop_count` metric. Printing this counter > in the warning log provides vital diagnostics to distinguish > an aggressive time-warp storm (high count) from a bogged-down callback > bug (low count). > > 3. Utilize the kernel format specifier `%ptR` to convert the raw ktime > into a human-readable timestamp (YYYY-MM-DD HH:MM:SS), allowing > developers to instantly pinpoint the exact boundary of the time > jump in dmesg. > > This non-destructive telemetry guard provides precise hardware/emulator > diagnostic visibility while ensuring core kernel availability. > > [1]: https://lore.kernel.org/all/20260114013257.3500578-1-ruanjinjie@huawei.com/ > Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> > --- > drivers/rtc/interface.c | 15 +++++++++++++-- > 1 file changed, 13 insertions(+), 2 deletions(-) > > diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c > index 1906f4884a83..f6c5fd16cc4e 100644 > --- a/drivers/rtc/interface.c > +++ b/drivers/rtc/interface.c > @@ -927,10 +927,12 @@ static void rtc_timer_remove(struct rtc_device *rtc, struct rtc_timer *timer) > */ > void rtc_timer_do_work(struct work_struct *work) > { > - struct rtc_timer *timer; > + unsigned long start_jiffies = jiffies; > struct timerqueue_node *next; > - ktime_t now; > + struct rtc_timer *timer; > struct rtc_time tm; > + int loop_count = 0; > + ktime_t now; > int err; > > struct rtc_device *rtc = > @@ -945,6 +947,15 @@ void rtc_timer_do_work(struct work_struct *work) > } > now = rtc_tm_to_ktime(tm); > while ((next = timerqueue_getnext(&rtc->timerqueue))) { > + loop_count++; > + > + if (unlikely(time_after(jiffies, start_jiffies + HZ))) { > + dev_warn(&rtc->dev, "RTC time jump (loop: %d) to %ptR.\n", > + loop_count, &tm); > + WARN_ON_ONCE(1); > + break; > + } > + > if (next->expires > now) > break; > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work() 2026-05-25 13:08 [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work() Jinjie Ruan 2026-06-11 1:40 ` Jinjie Ruan 2026-06-11 2:01 ` Jinjie Ruan @ 2026-06-15 15:22 ` Alexandre Belloni 2026-06-16 2:10 ` Jinjie Ruan 2 siblings, 1 reply; 7+ messages in thread From: Alexandre Belloni @ 2026-06-15 15:22 UTC (permalink / raw) To: Jinjie Ruan; +Cc: linux-rtc, linux-kernel Hello, On 25/05/2026 21:08:25+0800, Jinjie Ruan wrote: > In virtualization environments like QEMU [1], or during hardware > clocksource anomalies, an extreme time-warp event can occur. When > the system time abruptly jumps forward, the rtc_timer_do_work() handler > falls into a prolonged processing loop to clear accumulated historical > timers via timerqueue_getnext(). Running this loop indefinitely under > the rtc->ops_lock mutex triggers a kernel softlockup, stalling > the system. > > Introduce an adaptive telemetry and loop guard mechanism to enhance debug > visibility and prevent softlockups: > > 1. Record `start_jiffies` upon entry and leverage `time_after()` to > check if the loop has monopolized the CPU for more than 1s (HZ). If so, > the handler prints a telemetry warning, triggers a WARN stack dump, and > breaks the loop to safely yield the CPU. > > 2. Track the execution via a `loop_count` metric. Printing this counter > in the warning log provides vital diagnostics to distinguish > an aggressive time-warp storm (high count) from a bogged-down callback > bug (low count). > > 3. Utilize the kernel format specifier `%ptR` to convert the raw ktime > into a human-readable timestamp (YYYY-MM-DD HH:MM:SS), allowing > developers to instantly pinpoint the exact boundary of the time > jump in dmesg. > > This non-destructive telemetry guard provides precise hardware/emulator > diagnostic visibility while ensuring core kernel availability. > > [1]: https://lore.kernel.org/all/20260114013257.3500578-1-ruanjinjie@huawei.com/ > Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> > --- > drivers/rtc/interface.c | 15 +++++++++++++-- > 1 file changed, 13 insertions(+), 2 deletions(-) > > diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c > index 1906f4884a83..f6c5fd16cc4e 100644 > --- a/drivers/rtc/interface.c > +++ b/drivers/rtc/interface.c > @@ -927,10 +927,12 @@ static void rtc_timer_remove(struct rtc_device *rtc, struct rtc_timer *timer) > */ > void rtc_timer_do_work(struct work_struct *work) > { > - struct rtc_timer *timer; > + unsigned long start_jiffies = jiffies; > struct timerqueue_node *next; > - ktime_t now; > + struct rtc_timer *timer; > struct rtc_time tm; > + int loop_count = 0; > + ktime_t now; > int err; > > struct rtc_device *rtc = > @@ -945,6 +947,15 @@ void rtc_timer_do_work(struct work_struct *work) > } > now = rtc_tm_to_ktime(tm); > while ((next = timerqueue_getnext(&rtc->timerqueue))) { > + loop_count++; > + > + if (unlikely(time_after(jiffies, start_jiffies + HZ))) { > + dev_warn(&rtc->dev, "RTC time jump (loop: %d) to %ptR.\n", > + loop_count, &tm); > + WARN_ON_ONCE(1); So, your issue is that it is too slow so you make it even slower? There are already plenty of tracepoints that allow proper debugging in this loop, I'm pretty sure we don't want to bloat the kernel with more messages. -- Alexandre Belloni, co-owner and COO, Bootlin Embedded Linux and Kernel engineering https://bootlin.com ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work() 2026-06-15 15:22 ` Alexandre Belloni @ 2026-06-16 2:10 ` Jinjie Ruan 2026-06-16 6:57 ` Alexandre Belloni 0 siblings, 1 reply; 7+ messages in thread From: Jinjie Ruan @ 2026-06-16 2:10 UTC (permalink / raw) To: Alexandre Belloni; +Cc: linux-rtc, linux-kernel On 6/15/2026 11:22 PM, Alexandre Belloni wrote: > Hello, > > On 25/05/2026 21:08:25+0800, Jinjie Ruan wrote: >> In virtualization environments like QEMU [1], or during hardware >> clocksource anomalies, an extreme time-warp event can occur. When >> the system time abruptly jumps forward, the rtc_timer_do_work() handler >> falls into a prolonged processing loop to clear accumulated historical >> timers via timerqueue_getnext(). Running this loop indefinitely under >> the rtc->ops_lock mutex triggers a kernel softlockup, stalling >> the system. >> >> Introduce an adaptive telemetry and loop guard mechanism to enhance debug >> visibility and prevent softlockups: >> >> 1. Record `start_jiffies` upon entry and leverage `time_after()` to >> check if the loop has monopolized the CPU for more than 1s (HZ). If so, >> the handler prints a telemetry warning, triggers a WARN stack dump, and >> breaks the loop to safely yield the CPU. >> >> 2. Track the execution via a `loop_count` metric. Printing this counter >> in the warning log provides vital diagnostics to distinguish >> an aggressive time-warp storm (high count) from a bogged-down callback >> bug (low count). >> >> 3. Utilize the kernel format specifier `%ptR` to convert the raw ktime >> into a human-readable timestamp (YYYY-MM-DD HH:MM:SS), allowing >> developers to instantly pinpoint the exact boundary of the time >> jump in dmesg. >> >> This non-destructive telemetry guard provides precise hardware/emulator >> diagnostic visibility while ensuring core kernel availability. >> >> [1]: https://lore.kernel.org/all/20260114013257.3500578-1-ruanjinjie@huawei.com/ >> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> >> --- >> drivers/rtc/interface.c | 15 +++++++++++++-- >> 1 file changed, 13 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c >> index 1906f4884a83..f6c5fd16cc4e 100644 >> --- a/drivers/rtc/interface.c >> +++ b/drivers/rtc/interface.c >> @@ -927,10 +927,12 @@ static void rtc_timer_remove(struct rtc_device *rtc, struct rtc_timer *timer) >> */ >> void rtc_timer_do_work(struct work_struct *work) >> { >> - struct rtc_timer *timer; >> + unsigned long start_jiffies = jiffies; >> struct timerqueue_node *next; >> - ktime_t now; >> + struct rtc_timer *timer; >> struct rtc_time tm; >> + int loop_count = 0; >> + ktime_t now; >> int err; >> >> struct rtc_device *rtc = >> @@ -945,6 +947,15 @@ void rtc_timer_do_work(struct work_struct *work) >> } >> now = rtc_tm_to_ktime(tm); >> while ((next = timerqueue_getnext(&rtc->timerqueue))) { >> + loop_count++; >> + >> + if (unlikely(time_after(jiffies, start_jiffies + HZ))) { >> + dev_warn(&rtc->dev, "RTC time jump (loop: %d) to %ptR.\n", >> + loop_count, &tm); >> + WARN_ON_ONCE(1); > > So, your issue is that it is too slow so you make it even slower? There > are already plenty of tracepoints that allow proper debugging in this > loop, I'm pretty sure we don't want to bloat the kernel with more > messages. Hi, Alexandre, The point here is not about the performance of the rtc_timer_do_work() loop — it’s about making the problem debuggable when things go wrong. And we can put it under a debug Kconfig option, so production kernels see no extra overhead at all. The patch is installed in the following scenarios: If the RTC hardware fails, or if the QEMU-emulated RTC device code in a KVM virtual machine has a problem (for example, the x86 RTC emulation hardware mc146818 has an overflow issue[1]), the time may jump as shown in the log below, which can cause a soft lockup. However, when the issue occurs, it is only possible to know that too many pending timers have accumulated in the timerqueue (for example, the log shows that tens of millions of timer nodes have been processed) by temporarily adding diagnostic code in rtc_timer_do_work(). To determine whether the root cause is hardware, kernel RTC code, an RTC driver issue, or an RTC hardware problem, more debugging is needed. But if the problem is indeed caused by RTC hardware, adding a diagnostic print of the current RTC time when the loop takes too long (as this diagnostic patch does) would make it easy to tell whether QEMU or the hardware is faulty. [1]: https://lore.kernel.org/all/20260613195116.1807273-21-mjt@tls.msk.ru/ kworker/0:1-37 [000] .N.. 489.159634: rtc_timer_do_work: timerqueue_getnext handle timer node count: 13281423 kworker/0:1-37 [000] .N.. 489.159635: rtc_timer_do_work: timerqueue_getnext handle timer node count: 13281424 kworker/0:1-37 [000] .N.. 489.159635: rtc_timer_do_work: timerqueue_getnext handle timer node count: 13281425 kworker/0:1-37 [000] .N.. 489.159636: rtc_timer_do_work: timerqueue_getnext handle timer node count: 13281426 kworker/0:1-37 [000] .N.. 489.159637: rtc_timer_do_work: timerqueue_getnext handle timer node count: 13281427 kworker/0:1-37 [000] .N.. 489.159638: rtc_timer_do_work: timerqueue_getnext handle timer node count: 13281428 kworker/0:1-37 [000] .N.. 489.159638: rtc_timer_do_work: timerqueue_getnext handle timer node count: 13281429 kworker/0:1-37 [000] .N.. 489.159639: rtc_timer_do_work: timerqueue_getnext handle timer node count: 13281430 kworker/0:1-37 [000] .N.. 489.159640: rtc_timer_do_work: timerqueue_getnext handle timer node count: 13281431 kworker/0:1-37 [000] .N.. 489.159641: rtc_timer_do_work: timerqueue_getnext handle timer node count: 13281432 swapper/0-1 [001] .N.. 11.579334: __rtc_read_time: rtc: 0xff11000109896800, ops->read_time:2026:01:05:09:36:21 swapper/0-1 [001] .N.. 11.579421: __rtc_read_time: rtc: 0xff11000109896800, ops->read_time:2026:01:05:09:36:21 swapper/0-1 [001] .N.. 11.579469: __rtc_read_time: rtc: 0xff11000109896800, ops->read_time:2026:01:05:09:36:21 swapper/0-1 [001] .N.. 11.580816: __rtc_read_time: rtc: 0xff11000109896800, ops->read_time:2026:01:05:09:36:21 syz-executor.5-7492 [003] .... 129.807406: __rtc_read_time: rtc: 0xff11000109896800, ops->read_time:2033:05:04:07:03:51 syz-executor.5-7492 [003] .... 129.807419: __rtc_update_irq_enable.part.8: rtc uie on: 0xff11000109896800, now: 2033:05:04:07:03:51, expire: 2033:05:04:07:03:52 Best regards, Jinjie > > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work() 2026-06-16 2:10 ` Jinjie Ruan @ 2026-06-16 6:57 ` Alexandre Belloni 2026-06-16 8:12 ` Jinjie Ruan 0 siblings, 1 reply; 7+ messages in thread From: Alexandre Belloni @ 2026-06-16 6:57 UTC (permalink / raw) To: Jinjie Ruan; +Cc: linux-rtc, linux-kernel Hi, On 16/06/2026 10:10:19+0800, Jinjie Ruan wrote: > > > On 6/15/2026 11:22 PM, Alexandre Belloni wrote: > > Hello, > > > > On 25/05/2026 21:08:25+0800, Jinjie Ruan wrote: > >> In virtualization environments like QEMU [1], or during hardware > >> clocksource anomalies, an extreme time-warp event can occur. When > >> the system time abruptly jumps forward, the rtc_timer_do_work() handler > >> falls into a prolonged processing loop to clear accumulated historical > >> timers via timerqueue_getnext(). Running this loop indefinitely under > >> the rtc->ops_lock mutex triggers a kernel softlockup, stalling > >> the system. > >> > >> Introduce an adaptive telemetry and loop guard mechanism to enhance debug > >> visibility and prevent softlockups: > >> > >> 1. Record `start_jiffies` upon entry and leverage `time_after()` to > >> check if the loop has monopolized the CPU for more than 1s (HZ). If so, > >> the handler prints a telemetry warning, triggers a WARN stack dump, and > >> breaks the loop to safely yield the CPU. > >> > >> 2. Track the execution via a `loop_count` metric. Printing this counter > >> in the warning log provides vital diagnostics to distinguish > >> an aggressive time-warp storm (high count) from a bogged-down callback > >> bug (low count). > >> > >> 3. Utilize the kernel format specifier `%ptR` to convert the raw ktime > >> into a human-readable timestamp (YYYY-MM-DD HH:MM:SS), allowing > >> developers to instantly pinpoint the exact boundary of the time > >> jump in dmesg. > >> > >> This non-destructive telemetry guard provides precise hardware/emulator > >> diagnostic visibility while ensuring core kernel availability. > >> > >> [1]: https://lore.kernel.org/all/20260114013257.3500578-1-ruanjinjie@huawei.com/ > >> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> > >> --- > >> drivers/rtc/interface.c | 15 +++++++++++++-- > >> 1 file changed, 13 insertions(+), 2 deletions(-) > >> > >> diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c > >> index 1906f4884a83..f6c5fd16cc4e 100644 > >> --- a/drivers/rtc/interface.c > >> +++ b/drivers/rtc/interface.c > >> @@ -927,10 +927,12 @@ static void rtc_timer_remove(struct rtc_device *rtc, struct rtc_timer *timer) > >> */ > >> void rtc_timer_do_work(struct work_struct *work) > >> { > >> - struct rtc_timer *timer; > >> + unsigned long start_jiffies = jiffies; > >> struct timerqueue_node *next; > >> - ktime_t now; > >> + struct rtc_timer *timer; > >> struct rtc_time tm; > >> + int loop_count = 0; > >> + ktime_t now; > >> int err; > >> > >> struct rtc_device *rtc = > >> @@ -945,6 +947,15 @@ void rtc_timer_do_work(struct work_struct *work) > >> } > >> now = rtc_tm_to_ktime(tm); > >> while ((next = timerqueue_getnext(&rtc->timerqueue))) { > >> + loop_count++; > >> + > >> + if (unlikely(time_after(jiffies, start_jiffies + HZ))) { > >> + dev_warn(&rtc->dev, "RTC time jump (loop: %d) to %ptR.\n", > >> + loop_count, &tm); > >> + WARN_ON_ONCE(1); > > > > So, your issue is that it is too slow so you make it even slower? There > > are already plenty of tracepoints that allow proper debugging in this > > loop, I'm pretty sure we don't want to bloat the kernel with more > > messages. > > Hi, Alexandre, > > The point here is not about the performance of the rtc_timer_do_work() > loop — it’s about making the problem debuggable when things go wrong. > And we can put it under a debug Kconfig option, so production kernels > see no extra overhead at all. But then aren't the tracepoint enough? There are 3 tracepoints in the loop that are exactly for debugging your issue. > > The patch is installed in the following scenarios: > > If the RTC hardware fails, or if the QEMU-emulated RTC device code in a > KVM virtual machine has a problem (for example, the x86 RTC emulation > hardware mc146818 has an overflow issue[1]), the time may jump as shown > in the log below, which can cause a soft lockup. This is not clear, you are not explaining the issue. You seem to mix system time and RTC time. What I understand is that there was a periodic timer enqueued and then for some reason, the system time jumped forward by a large amount and now rtc_timer_do_work is firing events for each of the missed timers. You are not explaining the relationship with the RTC hardware (I see none). > > However, when the issue occurs, it is only possible to know that too > many pending timers have accumulated in the timerqueue (for example, the > log shows that tens of millions of timer nodes have been processed) by > temporarily adding diagnostic code in rtc_timer_do_work(). > This is not true, there are 3 tracepoints to know what is happening with the timers. Also, there are always exactly zero, one or two timers in the queue, never tens of millions. > To determine whether the root cause is hardware, kernel RTC code, an RTC > driver issue, or an RTC hardware problem, more debugging is needed. But > if the problem is indeed caused by RTC hardware, adding a diagnostic > print of the current RTC time when the loop takes too long (as this > diagnostic patch does) would make it easy to tell whether QEMU or the > hardware is faulty. > > [1]: https://lore.kernel.org/all/20260613195116.1807273-21-mjt@tls.msk.ru/ > > kworker/0:1-37 [000] .N.. 489.159634: rtc_timer_do_work: > timerqueue_getnext handle timer node count: 13281423 > kworker/0:1-37 [000] .N.. 489.159635: rtc_timer_do_work: > timerqueue_getnext handle timer node count: 13281424 > kworker/0:1-37 [000] .N.. 489.159635: rtc_timer_do_work: > timerqueue_getnext handle timer node count: 13281425 > kworker/0:1-37 [000] .N.. 489.159636: rtc_timer_do_work: > timerqueue_getnext handle timer node count: 13281426 > kworker/0:1-37 [000] .N.. 489.159637: rtc_timer_do_work: > timerqueue_getnext handle timer node count: 13281427 > kworker/0:1-37 [000] .N.. 489.159638: rtc_timer_do_work: > timerqueue_getnext handle timer node count: 13281428 > kworker/0:1-37 [000] .N.. 489.159638: rtc_timer_do_work: > timerqueue_getnext handle timer node count: 13281429 > kworker/0:1-37 [000] .N.. 489.159639: rtc_timer_do_work: > timerqueue_getnext handle timer node count: 13281430 > kworker/0:1-37 [000] .N.. 489.159640: rtc_timer_do_work: > timerqueue_getnext handle timer node count: 13281431 > kworker/0:1-37 [000] .N.. 489.159641: rtc_timer_do_work: > timerqueue_getnext handle timer node count: 13281432 > > > swapper/0-1 [001] .N.. 11.579334: __rtc_read_time: rtc: > 0xff11000109896800, ops->read_time:2026:01:05:09:36:21 > swapper/0-1 [001] .N.. 11.579421: __rtc_read_time: rtc: > 0xff11000109896800, ops->read_time:2026:01:05:09:36:21 > swapper/0-1 [001] .N.. 11.579469: __rtc_read_time: rtc: > 0xff11000109896800, ops->read_time:2026:01:05:09:36:21 > swapper/0-1 [001] .N.. 11.580816: __rtc_read_time: rtc: > 0xff11000109896800, ops->read_time:2026:01:05:09:36:21 > syz-executor.5-7492 [003] .... 129.807406: __rtc_read_time: rtc: > 0xff11000109896800, ops->read_time:2033:05:04:07:03:51 > syz-executor.5-7492 [003] .... 129.807419: > __rtc_update_irq_enable.part.8: rtc uie on: 0xff11000109896800, now: > 2033:05:04:07:03:51, expire: 2033:05:04:07:03:52 > > > Best regards, > Jinjie > > > > > > -- Alexandre Belloni, co-owner and COO, Bootlin Embedded Linux and Kernel engineering https://bootlin.com ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work() 2026-06-16 6:57 ` Alexandre Belloni @ 2026-06-16 8:12 ` Jinjie Ruan 0 siblings, 0 replies; 7+ messages in thread From: Jinjie Ruan @ 2026-06-16 8:12 UTC (permalink / raw) To: Alexandre Belloni; +Cc: linux-rtc, linux-kernel On 6/16/2026 2:57 PM, Alexandre Belloni wrote: > Hi, > > On 16/06/2026 10:10:19+0800, Jinjie Ruan wrote: >> >> >> On 6/15/2026 11:22 PM, Alexandre Belloni wrote: >>> Hello, >>> >>> On 25/05/2026 21:08:25+0800, Jinjie Ruan wrote: >>>> In virtualization environments like QEMU [1], or during hardware >>>> clocksource anomalies, an extreme time-warp event can occur. When >>>> the system time abruptly jumps forward, the rtc_timer_do_work() handler >>>> falls into a prolonged processing loop to clear accumulated historical >>>> timers via timerqueue_getnext(). Running this loop indefinitely under >>>> the rtc->ops_lock mutex triggers a kernel softlockup, stalling >>>> the system. >>>> >>>> Introduce an adaptive telemetry and loop guard mechanism to enhance debug >>>> visibility and prevent softlockups: >>>> >>>> 1. Record `start_jiffies` upon entry and leverage `time_after()` to >>>> check if the loop has monopolized the CPU for more than 1s (HZ). If so, >>>> the handler prints a telemetry warning, triggers a WARN stack dump, and >>>> breaks the loop to safely yield the CPU. >>>> >>>> 2. Track the execution via a `loop_count` metric. Printing this counter >>>> in the warning log provides vital diagnostics to distinguish >>>> an aggressive time-warp storm (high count) from a bogged-down callback >>>> bug (low count). >>>> >>>> 3. Utilize the kernel format specifier `%ptR` to convert the raw ktime >>>> into a human-readable timestamp (YYYY-MM-DD HH:MM:SS), allowing >>>> developers to instantly pinpoint the exact boundary of the time >>>> jump in dmesg. >>>> >>>> This non-destructive telemetry guard provides precise hardware/emulator >>>> diagnostic visibility while ensuring core kernel availability. >>>> >>>> [1]: https://lore.kernel.org/all/20260114013257.3500578-1-ruanjinjie@huawei.com/ >>>> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> >>>> --- >>>> drivers/rtc/interface.c | 15 +++++++++++++-- >>>> 1 file changed, 13 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c >>>> index 1906f4884a83..f6c5fd16cc4e 100644 >>>> --- a/drivers/rtc/interface.c >>>> +++ b/drivers/rtc/interface.c >>>> @@ -927,10 +927,12 @@ static void rtc_timer_remove(struct rtc_device *rtc, struct rtc_timer *timer) >>>> */ >>>> void rtc_timer_do_work(struct work_struct *work) >>>> { >>>> - struct rtc_timer *timer; >>>> + unsigned long start_jiffies = jiffies; >>>> struct timerqueue_node *next; >>>> - ktime_t now; >>>> + struct rtc_timer *timer; >>>> struct rtc_time tm; >>>> + int loop_count = 0; >>>> + ktime_t now; >>>> int err; >>>> >>>> struct rtc_device *rtc = >>>> @@ -945,6 +947,15 @@ void rtc_timer_do_work(struct work_struct *work) >>>> } >>>> now = rtc_tm_to_ktime(tm); >>>> while ((next = timerqueue_getnext(&rtc->timerqueue))) { >>>> + loop_count++; >>>> + >>>> + if (unlikely(time_after(jiffies, start_jiffies + HZ))) { >>>> + dev_warn(&rtc->dev, "RTC time jump (loop: %d) to %ptR.\n", >>>> + loop_count, &tm); >>>> + WARN_ON_ONCE(1); >>> >>> So, your issue is that it is too slow so you make it even slower? There >>> are already plenty of tracepoints that allow proper debugging in this >>> loop, I'm pretty sure we don't want to bloat the kernel with more >>> messages. >> >> Hi, Alexandre, >> >> The point here is not about the performance of the rtc_timer_do_work() >> loop — it’s about making the problem debuggable when things go wrong. >> And we can put it under a debug Kconfig option, so production kernels >> see no extra overhead at all. > > > But then aren't the tracepoint enough? There are 3 tracepoints in the > loop that are exactly for debugging your issue. Hi, Alexandre, Regarding the `trace_rtc_timer_dequeue` and `trace_rtc_timer_enqueue` tracepoints, they are unfortunately insufficient to pinpoint this issue for three reasons: 1. Ring Buffer Overwrite during Softlockup: When the loop iterates tens of millions of times continuously on a locked-up CPU, it floods the ftrace ring buffer in milliseconds. The earliest trace logs—which contain the exact moment the time jumped—will be completely overwritten and lost before anyone can read them. 2. Lack of Causality Context: These tracepoints only log the timer's expiration and queue state.They show the symptom (infinite re-enqueuing). But these are not the primary scene, the time jump is the primary scene. They do not capture why are there so many UIE timers being re-enqueued? > >> >> The patch is installed in the following scenarios: >> >> If the RTC hardware fails, or if the QEMU-emulated RTC device code in a >> KVM virtual machine has a problem (for example, the x86 RTC emulation >> hardware mc146818 has an overflow issue[1]), the time may jump as shown >> in the log below, which can cause a soft lockup. > > This is not clear, you are not explaining the issue. You seem to mix > system time and RTC time. What I understand is that there was a periodic > timer enqueued and then for some reason, the system time jumped forward > by a large amount and now rtc_timer_do_work is firing events for each of > the missed timers. You are not explaining the relationship with the RTC > hardware (I see none). The time here all means the RTC time got by __rtc_read_time(), not the system time. And the RTC time jump is caused by RTC hardware failure or `QEMU-emulated RTC device` code bug, which means the rtc hardware returns an unstable rtc time or is not completely linear growth.. The original issue is as follows: On kvm qemu with cmos rtc and mc146818 chip, after set the UIE timer expire with a normal RTC time (for example 2026 year), In rtc_timer_do_work(), the rtc time jump to a future time (for example 2033 year), it will loop for a while util softlockup because all subsequent enqueued UIE timers expires, as below: RTC_UIE_ON: read now: 2019:04:08:12:32:27, add timer0 (expire: 2019:04:08:12:32:28) ^^^^^^^^^^^^^^^^^^^^ ... rtc_timer_do_work() iterate the list in a loop: read now: 2033:12:02:07:27:15 ^^^^^^^^^^^^^^^^^^^ handle timer0, add timer1 to the list (expire: 2019:04:08:12:32:29) handle timer1, add timer2 to the list (expire: 2019:04:08:12:32:30) handle timer2, add timer3: 2019:04:08:12:32:31 ... -> softlockup > >> >> However, when the issue occurs, it is only possible to know that too >> many pending timers have accumulated in the timerqueue (for example, the >> log shows that tens of millions of timer nodes have been processed) by >> temporarily adding diagnostic code in rtc_timer_do_work(). >> > > This is not true, there are 3 tracepoints to know what is happening with > the timers. Also, there are always exactly zero, one or two timers in > the queue, never tens of millions. Timer Queue Size vs. Loop Iteration Count: You are completely correct that there are only ever 0, 1, or 2 timer nodes linked in the timerqueue. My previous phrasing was inaccurate, I mean the loop Iteration. > >> To determine whether the root cause is hardware, kernel RTC code, an RTC >> driver issue, or an RTC hardware problem, more debugging is needed. But >> if the problem is indeed caused by RTC hardware, adding a diagnostic >> print of the current RTC time when the loop takes too long (as this >> diagnostic patch does) would make it easy to tell whether QEMU or the >> hardware is faulty. >> >> [1]: https://lore.kernel.org/all/20260613195116.1807273-21-mjt@tls.msk.ru/ >> >> kworker/0:1-37 [000] .N.. 489.159634: rtc_timer_do_work: >> timerqueue_getnext handle timer node count: 13281423 >> kworker/0:1-37 [000] .N.. 489.159635: rtc_timer_do_work: >> timerqueue_getnext handle timer node count: 13281424 >> kworker/0:1-37 [000] .N.. 489.159635: rtc_timer_do_work: >> timerqueue_getnext handle timer node count: 13281425 >> kworker/0:1-37 [000] .N.. 489.159636: rtc_timer_do_work: >> timerqueue_getnext handle timer node count: 13281426 >> kworker/0:1-37 [000] .N.. 489.159637: rtc_timer_do_work: >> timerqueue_getnext handle timer node count: 13281427 >> kworker/0:1-37 [000] .N.. 489.159638: rtc_timer_do_work: >> timerqueue_getnext handle timer node count: 13281428 >> kworker/0:1-37 [000] .N.. 489.159638: rtc_timer_do_work: >> timerqueue_getnext handle timer node count: 13281429 >> kworker/0:1-37 [000] .N.. 489.159639: rtc_timer_do_work: >> timerqueue_getnext handle timer node count: 13281430 >> kworker/0:1-37 [000] .N.. 489.159640: rtc_timer_do_work: >> timerqueue_getnext handle timer node count: 13281431 >> kworker/0:1-37 [000] .N.. 489.159641: rtc_timer_do_work: >> timerqueue_getnext handle timer node count: 13281432 >> >> >> swapper/0-1 [001] .N.. 11.579334: __rtc_read_time: rtc: >> 0xff11000109896800, ops->read_time:2026:01:05:09:36:21 >> swapper/0-1 [001] .N.. 11.579421: __rtc_read_time: rtc: >> 0xff11000109896800, ops->read_time:2026:01:05:09:36:21 >> swapper/0-1 [001] .N.. 11.579469: __rtc_read_time: rtc: >> 0xff11000109896800, ops->read_time:2026:01:05:09:36:21 >> swapper/0-1 [001] .N.. 11.580816: __rtc_read_time: rtc: >> 0xff11000109896800, ops->read_time:2026:01:05:09:36:21 >> syz-executor.5-7492 [003] .... 129.807406: __rtc_read_time: rtc: >> 0xff11000109896800, ops->read_time:2033:05:04:07:03:51 >> syz-executor.5-7492 [003] .... 129.807419: >> __rtc_update_irq_enable.part.8: rtc uie on: 0xff11000109896800, now: >> 2033:05:04:07:03:51, expire: 2033:05:04:07:03:52 >> >> >> Best regards, >> Jinjie >> >>> >>> >> > ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2026-06-16 8:12 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-05-25 13:08 [PATCH] rtc: interface: Add rtc time jump debug in rtc_timer_do_work() Jinjie Ruan 2026-06-11 1:40 ` Jinjie Ruan 2026-06-11 2:01 ` Jinjie Ruan 2026-06-15 15:22 ` Alexandre Belloni 2026-06-16 2:10 ` Jinjie Ruan 2026-06-16 6:57 ` Alexandre Belloni 2026-06-16 8:12 ` Jinjie Ruan
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.