[Question] Voltage droop from synchronized timer interrupts(tick) on many-core SoCs leads to system instability

public inbox for linux-kernel@vger.kernel.org

* [Question] Voltage droop from synchronized timer interrupts(tick) on many-core SoCs leads to system instability
@ 2026-02-05  4:52 连子涵
  2026-02-05  6:37 ` Hillf Danton
  2026-02-06 13:37 ` Thomas Gleixner
  0 siblings, 2 replies; 4+ messages in thread
From: 连子涵 @ 2026-02-05  4:52 UTC (permalink / raw)
  To: tglx, mingo, frederic; +Cc: linux-kernel

Hi all,
We have observed a critical voltage droop issue on large-core-count SoC platforms (e.g., 64+ cores) that appears to stem directly from the synchronized periodic timer interrupts (ticks) in the Linux kernel.

In our testing and power simulations, we found that: 
When all CPU cores enter the timer interrupt handler simultaneously, there is a sharp, instantaneous power surge and continuous power fluctuations during the interrupt handling window (which lasts several microseconds), leading to significant voltage droop. In severe cases, this droop can cause system instability or even prevent the OS from booting.

We understand that enabling skew_tick=1 effectively mitigates this by staggering the per-CPU tick timers. However, in certain deployment scenarios, modifying any kernel boot parameter—including skew_tick—is not permitted.
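
For reference, the staggering can be modeled with a few lines of arithmetic. The sketch below assumes that skew_tick=1 spreads the CPUs over half a tick period, shifting each CPU's tick by (tick period / 2) / (number of possible CPUs) * (CPU number). That formula is a recollection of the tick setup code in kernel/time/tick-sched.c and should be checked against the kernel source actually in use. The function name and the example values (HZ=250, 64 CPUs) are illustrative assumptions, not taken from this thread.

```python
NSEC_PER_SEC = 1_000_000_000


def skew_offset_ns(cpu: int, num_cpus: int, hz: int) -> int:
    """Model of the tick offset of `cpu` relative to CPU 0, in nanoseconds.

    Assumed formula: half a tick period divided evenly among all CPUs,
    using integer division as kernel code would.
    """
    tick_nsec = NSEC_PER_SEC // hz
    step = (tick_nsec // 2) // num_cpus
    return step * cpu


if __name__ == "__main__":
    # Illustrative values only: HZ=250 and 64 CPUs.
    for cpu in (0, 1, 2, 63):
        print(f"cpu {cpu:2d}: +{skew_offset_ns(cpu, num_cpus=64, hz=250)} ns")
```

Under these assumptions, neighbouring CPUs tick 31.25 microseconds apart and the last CPU fires just under 2 ms after CPU 0, so no two CPUs enter the tick handler at the same instant.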

Given this constraint, we would greatly appreciate your insights on the following technical questions: 
1. Why does the timer interrupt path consume so much power and exhibit such large instantaneous variations? Our power simulation shows that the average power during timer interrupt handling is comparable to that of the Dhrystone benchmark.
2. What is the typical duration of a single timer interrupt handler (tick_nohz_handler, etc.) on a modern x86 or ARM core? Is it generally on the order of a few microseconds? 
3. Beyond skew_tick=1, are there other kernel mechanisms or runtime strategies that could reduce the power impact of synchronized timer events? Are there plans in future kernel versions to address this issue more fundamentally—especially for many-core platforms? 

Thank you very much for your time and expertise. 

Best regards, 
Zihan Lian <17317795071@163.com>

* Re: [Question] Voltage droop from synchronized timer interrupts(tick) on many-core SoCs leads to system instability
  2026-02-05  4:52 [Question] Voltage droop from synchronized timer interrupts(tick) on many-core SoCs leads to system instability 连子涵
@ 2026-02-05  6:37 ` Hillf Danton
  2026-02-09 18:33   ` Christoph Lameter (Ampere)
  2026-02-06 13:37 ` Thomas Gleixner
  1 sibling, 1 reply; 4+ messages in thread
From: Hillf Danton @ 2026-02-05  6:37 UTC (permalink / raw)
  To: 连子涵; +Cc: tglx, Christoph Lameter (Ampere), linux-kernel

On Thu, 5 Feb 2026 12:52:04 +0800 (CST) 连子涵 wrote:
> Hi all,
> We have observed a critical voltage droop issue on large-core-count SoC platforms (e.g., 64+ cores) that appears to stem directly from the synchronized periodic timer interrupts(tick) in the Linux kernel. 
> 
> In our testing and power simulations, we found that: 
> When all CPU cores enter the timer interrupt handler simultaneously, there is a sharp, instantaneous power surge and continuous power fluctuations during the interrupt handling window (which lasts several microseconds), leading to significant voltage droop. In severe cases, this droop can cause system instability or even prevent the OS from booting.
> 
> We understand that enabling skew_tick=1 effectively mitigates this by staggering the per-CPU tick timers. However, in certain deployment scenarios, modifying any kernel boot parameter—including skew_tick—is not permitted.
> 
> Given this constraint, we would greatly appreciate your insights on the following technical questions: 
> 1. Why does the timer interrupt path consume so much power and exhibit such large instantaneous variations? Our power simulation shows that the average power during timer interrupt handling is comparable to Dhrystone benchmark. 
> 2. What is the typical duration of a single timer interrupt handler (tick_nohz_handler, etc.) on a modern x86 or ARM core? Is it generally on the order of a few microseconds? 
> 3. Beyond skew_tick=1, are there other kernel mechanisms or runtime strategies that could reduce the power impact of synchronized timer events? Are there plans in future kernel versions to address this issue more fundamentally—especially for many-core platforms? 
> 
> 
> Thank you very much for your time and expertise. 
> 
This sounds like a known issue; see the discussion from 2025 [1].

[1] Subject: Re: [PATCH] Skew tick for systems with a large number of processors
https://lore.kernel.org/lkml/87sejew87r.ffs@tglx/
> 
> Best regards, 
> Zihan Lian <17317795071@163.com>

* Re: [Question] Voltage droop from synchronized timer interrupts(tick) on many-core SoCs leads to system instability
  2026-02-05  4:52 [Question] Voltage droop from synchronized timer interrupts(tick) on many-core SoCs leads to system instability 连子涵
  2026-02-05  6:37 ` Hillf Danton
@ 2026-02-06 13:37 ` Thomas Gleixner
  1 sibling, 0 replies; 4+ messages in thread
From: Thomas Gleixner @ 2026-02-06 13:37 UTC (permalink / raw)
  To: 连子涵, mingo, frederic; +Cc: linux-kernel

On Thu, Feb 05 2026 at 12:52, 连子涵 wrote:
> Given this constraint, we would greatly appreciate your insights on the following technical questions: 

Who is 'we'? You are hiding behind an anonymized email address and
completely fail to provide details about your secret sauce SoC.

> 1. Why does the timer interrupt path consume so much power and exhibit
>    such large instantaneous variations? Our power simulation shows that
>    the average power during timer interrupt handling is comparable to
>    Dhrystone benchmark.

Is that a serious question?

How should we know what makes your SoC design sensitive to it? You have
the tools which observe the problem, so you should be able to pinpoint
what the actual issue is, no?

> 3. Beyond skew_tick=1, are there other kernel mechanisms or runtime
>    strategies that could reduce the power impact of synchronized timer
>    events? Are there plans in future kernel versions to address this
>    issue more fundamentally—especially for many-core platforms?

Again. How should we address a problem which is only described by
hand-waving?  You completely fail to provide context and circumstances.

Provide a proper analysis that explains what the actual root cause is
and then we can debate whether that's solvable in software or not. Just
crying 'timer interrupt' is not even close to an analysis.

If there is a systematic problem somewhere then we are happy to look for
a solution, but without analysis and data we are not doing anything.

Just for the record: The voltage droop issue has been known for more
than a decade and there have been solutions published way before your
SoC was designed. Aside from yours (whatever it is) and some odd Ampere
SoC, none of the contemporary multi-core designs, even with hundreds of
cores, suffer from this. Seems there are hardware architects out there
who pay attention to research.

Thanks,

        tglx

* Re: [Question] Voltage droop from synchronized timer interrupts(tick) on many-core SoCs leads to system instability
  2026-02-05  6:37 ` Hillf Danton
@ 2026-02-09 18:33   ` Christoph Lameter (Ampere)
  0 siblings, 0 replies; 4+ messages in thread
From: Christoph Lameter (Ampere) @ 2026-02-09 18:33 UTC (permalink / raw)
  To: Hillf Danton; +Cc: 连子涵, tglx, linux-kernel

On Thu, 5 Feb 2026, Hillf Danton wrote:

> On Thu, 5 Feb 2026 12:52:04 +0800 (CST) 连子涵 wrote:
> > Hi all,
> > We have observed a critical voltage droop issue on large-core-count SoC platforms (e.g., 64+ cores) that appears to stem directly from the synchronized periodic timer interrupts(tick) in the Linux kernel.
> >
> > In our testing and power simulations, we found that:
> > When all CPU cores enter the timer interrupt handler simultaneously, there is a sharp, instantaneous power surge and continuous power fluctuations during the interrupt handling window (which lasts several microseconds), leading to significant voltage droop. In severe cases, this droop can cause system instability or even prevent the OS from booting.
> >
> > We understand that enabling skew_tick=1 effectively mitigates this by
> > staggering the per-CPU tick timers. However, in certain deployment
> > scenarios, modifying any kernel boot parameter—including skew_tick—is
> > not permitted.

You could build a custom kernel that enables it by default.

Could you post test results that may convince us to make skew_tick the
default for certain configurations?

I have had issues getting good power readings for smaller configurations
since the SOC power state fluctuated. If we had some results that show
skew_tick to not be hurtful at low core counts but good at high ones then
we could change the default.

> > Given this constraint, we would greatly appreciate your insights on
> > the following technical questions:

> > 1. Why does the timer interrupt
> > path consume so much power and exhibit such large instantaneous
> > variations? Our power simulation shows that the average power during
> > timer interrupt handling is comparable to Dhrystone benchmark.

Because all processors need to be active and running at the same time. The
SOC must power up instantly and power will then drop again rapidly. This
is a pretty bad scenario that requires the SOC manufacturers to actually
increase the default voltage to the SOC to deal with this instability.

> > 2. What
> > is the typical duration of a single timer interrupt handler
> > (tick_nohz_handler, etc.) on a modern x86 or ARM core? Is it generally
> > on the order of a few microseconds?

My estimate would be 2-10 microseconds, but Thomas may know better.
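
The 2-10 microsecond estimate can be set against the spacing that skew_tick=1 provides. The sketch below assumes the per-CPU spacing is half a tick period divided by the number of CPUs; that is an assumption about the kernel's skew formula and is not stated in this thread. It then checks whether a handler of a given length could still be running when the next CPU's tick fires. The function names and example values are hypothetical.

```python
def skew_step_us(num_cpus: int, hz: int) -> float:
    """Assumed spacing between neighbouring CPUs' ticks, in microseconds."""
    tick_us = 1_000_000 / hz
    return (tick_us / 2) / num_cpus


def handlers_overlap(num_cpus: int, hz: int, handler_us: float) -> bool:
    """True if a handler lasting `handler_us` outlasts the skew spacing,
    so that neighbouring CPUs would still be in the handler concurrently."""
    return handler_us > skew_step_us(num_cpus, hz)


if __name__ == "__main__":
    # Illustrative configurations only.
    for cpus, hz, handler in ((64, 250, 10.0), (256, 1000, 2.0)):
        print(cpus, hz, handler,
              skew_step_us(cpus, hz), handlers_overlap(cpus, hz, handler))
```

With 64 CPUs at HZ=250 the assumed spacing is 31.25 microseconds, comfortably longer than a 10 microsecond handler. With 256 CPUs at HZ=1000 it shrinks to about 1.95 microseconds, so even a 2 microsecond handler would overlap with its neighbour, although far fewer CPUs would be active at once than without skewing.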

> > 3. Beyond skew_tick=1, are there
> > other kernel mechanisms or runtime strategies that could reduce the
> > power impact of synchronized timer events? Are there plans in future
> > kernel versions to address this issue more fundamentally—especially
> > for many-core platforms?

Could the SoC be modified to delay interrupt delivery when too many
interrupts hit the CPUs at once?
