the stuttering regression in 7.0: should I have done something different?

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

* the stuttering regression in 7.0: should I have done something different?
@ 2026-04-23 16:30 Thorsten Leemhuis
  2026-04-26 21:16 ` Greg KH
  2026-05-08  5:51 ` John Paul Adrian Glaubitz
  0 siblings, 2 replies; 19+ messages in thread
From: Thorsten Leemhuis @ 2026-04-23 16:30 UTC (permalink / raw)
  To: Greg KH, Linus Torvalds; +Cc: Linux kernel regressions list, LKML

Linus, Greg, if you have a minute, please help me out with something I'm
wondering about:

Should I have done something different wrt. to the periodic lockup aka
stuttering regression? The one quite a few users encountered, bisected,
and reported last week following the 7.0 release before it was fixed in
mainline & 7.0.1 during the first half of this week; see below for a
timeline of the whole thing.

I'm not asking if this could have been prevented or if the developers
did anything wrong handling this. I ask because I want to do the right
thing if similar situations arise in the future to ensure they are
handled like you folks want them to.

I for example wonder if I maybe should have made noise about this
regression earlier when I noticed that it affected quite a few (many?)
people. Or made more noise at the point where I spoke up, as then the
path to 7.0.1 might have been easier? Or should I have asked Linus for a
revert (or submit one myself) shortly after the impact became clear --
so that the problem would have been solved quickly while an improved
version of the culprit was developed and mainlined in parallel via the
regular channels?

On a related note: Do you think it would be wise if I started
maintaining a "regression fixes git tree" that collects temporary
reverts and wip fixes while fully formed reverts and/or fixes for
regressions are developed and make it through the ranks to mainline
(which often takes a few days or a week -- or multiple weeks in some
cases)? The idea comes up every now and then, as interested users and
distros then could easily avoid known regressions through that tree; it
could also serve as a test bed for reverts I could send to Linus in case
a proper revert/fix takes too long through the regular channels.

But I'm not sure if such an approach is really a good idea. I'd prefer
if our processes would be so quick and flexible that such a tree would
not be needed. At the same time I see that "quick and flexible" is often
not the case and that we are unlikely to get there any time soon
(especially for fixes that need to reach stable/longterm trees, as it
simply takes a while from initial reporting of a regression to
mainlining a fix [or at least hitting -next] to preparing and releasing
a new stable/longterm release with it).

FWIW, here is the rough timeline of the regression, just to be sure we
are all on the same page:

* The regression I'm talking about is caused by d6e152d905bdb1
("clockevents: Prevent timer interrupt starvation") [authored:
2026-04-07 10:54:17; committed: 2026-04-10 22:45:38; next arrival:
next-20260413; merged: 2026-04-12 19:01:55; v7.0 (2026-04-12 22:48:06)]

* On Monday and thus within 24 of the 7.0 release the first report about
the regression came in and immediately mentioned that a revert was able
to fix things:
https://lore.kernel.org/all/68d1e9ac-2780-4be3-8ee3-0788062dd3a4@gmail.com/

* On Tuesday someone else confirmed the findings and mentioned that
"several users" were seeing the problem:
https://lore.kernel.org/all/aeb848aa-404a-40fb-bd41-329644623b1d@cachyos.org/

* A few hours later (aka within 24 hours of the first report) Thomas had
a rough fix ready https://lore.kernel.org/all/87340xfeje.ffs@tglx/ (yeah!)

* On Thursday the fix was committed to the tip tree:
https://lore.kernel.org/all/177636758252.1323100.5283878386670888513.tip-bot2@tip-bot2/

* On Sunday I asked when the fix was going to be mainlined (with Linus
in CC) -- I feared Greg would soon start preparing 7.0.1-rc1 and I
wanted to ensure the fix was included there:
https://lore.kernel.org/all/5cbb14d8-46f9-4197-917f-51da852d7500@leemhuis.info/

* On Monday morning (UTC) mingo submitted a PR wit the fix:
https://lore.kernel.org/all/aeXYPt1FEbFRZNJf@gmail.com/

* On Monday Greg released 7.0.1-rc1 without the fix -- and a backport of
the culprit was in the -rc1 of various earlier series. Thomas quickly
told the stable team to not backport the culprit before the fix was
mainlined https://lore.kernel.org/all/87pl3ten5y.ffs@tglx/

* On Monday night Linus merged the PR from mingo as 4096fd0e8eaea1
("clockevents: Add missing resets of the next_event_forced flag")
[authored: 2026-04-14 22:55:01; committed: 2026-04-16 21:22:04; next
arrival: next-20260417; merged: 2026-04-21 00:30:08; v7.0-post]

* On Tuesday morning I wrote a mail to Greg about including the fix in
7.0.1; Thomas round about the same time provided the necessary backport,
which Greg then included out-of-band:
https://lore.kernel.org/all/2026042105-malformed-probation-232b@gregkh/
https://lore.kernel.org/all/87jyu0de2c.ffs@tglx/

* v7.0.1 is released on Wednesday, 2026-04-22 13:32:23

I lost track of how many people reported the regression exactly, but I
noticed at least seven reports (most of them in the past week) – and a
few people mentioned to me privately that they were affected, too. So it
was something that annoyed quite a few people afaics -- and made them
bisect, just to find out that the problem was known and a fix existed
already. This widespread effect is why I was wondering if I should have
done something differently, as a quicker fix could have avoided a few
people some pain.

Ciao, Thorsten

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: the stuttering regression in 7.0: should I have done something different?
  2026-04-23 16:30 the stuttering regression in 7.0: should I have done something different? Thorsten Leemhuis
@ 2026-04-26 21:16 ` Greg KH
  2026-05-08  5:51 ` John Paul Adrian Glaubitz
  1 sibling, 0 replies; 19+ messages in thread
From: Greg KH @ 2026-04-26 21:16 UTC (permalink / raw)
  To: Thorsten Leemhuis; +Cc: Linus Torvalds, Linux kernel regressions list, LKML

On Thu, Apr 23, 2026 at 06:30:24PM +0200, Thorsten Leemhuis wrote:
> Linus, Greg, if you have a minute, please help me out with something I'm
> wondering about:
> 
> Should I have done something different wrt. to the periodic lockup aka
> stuttering regression? The one quite a few users encountered, bisected,
> and reported last week following the 7.0 release before it was fixed in
> mainline & 7.0.1 during the first half of this week; see below for a
> timeline of the whole thing.
> 
> I'm not asking if this could have been prevented or if the developers
> did anything wrong handling this. I ask because I want to do the right
> thing if similar situations arise in the future to ensure they are
> handled like you folks want them to.

I think this went fine.  It was caught properly, and fixed, and pushed
out to users pretty quickly.  We can't really ask for more :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: the stuttering regression in 7.0: should I have done something different?
  2026-04-23 16:30 the stuttering regression in 7.0: should I have done something different? Thorsten Leemhuis
  2026-04-26 21:16 ` Greg KH
@ 2026-05-08  5:51 ` John Paul Adrian Glaubitz
  2026-05-08  6:33   ` Thorsten Leemhuis
  1 sibling, 1 reply; 19+ messages in thread
From: John Paul Adrian Glaubitz @ 2026-05-08  5:51 UTC (permalink / raw)
  To: Thorsten Leemhuis, Greg KH, Linus Torvalds
  Cc: Linux kernel regressions list, LKML, Tony Rodriguez

Hi Thorsten,

On Thu, 2026-04-23 at 18:30 +0200, Thorsten Leemhuis wrote:
> FWIW, here is the rough timeline of the regression, just to be sure we
> are all on the same page:
> 
> * The regression I'm talking about is caused by d6e152d905bdb1
> ("clockevents: Prevent timer interrupt starvation") [authored:
> 2026-04-07 10:54:17; committed: 2026-04-10 22:45:38; next arrival:
> next-20260413; merged: 2026-04-12 19:01:55; v7.0 (2026-04-12 22:48:06)]
> 
> * On Monday and thus within 24 of the 7.0 release the first report about
> the regression came in and immediately mentioned that a revert was able
> to fix things:
> https://lore.kernel.org/all/68d1e9ac-2780-4be3-8ee3-0788062dd3a4@gmail.com/
> 
> * On Tuesday someone else confirmed the findings and mentioned that
> "several users" were seeing the problem:
> https://lore.kernel.org/all/aeb848aa-404a-40fb-bd41-329644623b1d@cachyos.org/
> 
> * A few hours later (aka within 24 hours of the first report) Thomas had
> a rough fix ready https://lore.kernel.org/all/87340xfeje.ffs@tglx/ (yeah!)
> 
> * On Thursday the fix was committed to the tip tree:
> https://lore.kernel.org/all/177636758252.1323100.5283878386670888513.tip-bot2@tip-bot2/
> 
> * On Sunday I asked when the fix was going to be mainlined (with Linus
> in CC) -- I feared Greg would soon start preparing 7.0.1-rc1 and I
> wanted to ensure the fix was included there:
> https://lore.kernel.org/all/5cbb14d8-46f9-4197-917f-51da852d7500@leemhuis.info/
> 
> * On Monday morning (UTC) mingo submitted a PR wit the fix:
> https://lore.kernel.org/all/aeXYPt1FEbFRZNJf@gmail.com/
> 
> * On Monday Greg released 7.0.1-rc1 without the fix -- and a backport of
> the culprit was in the -rc1 of various earlier series. Thomas quickly
> told the stable team to not backport the culprit before the fix was
> mainlined https://lore.kernel.org/all/87pl3ten5y.ffs@tglx/
> 
> * On Monday night Linus merged the PR from mingo as 4096fd0e8eaea1
> ("clockevents: Add missing resets of the next_event_forced flag")
> [authored: 2026-04-14 22:55:01; committed: 2026-04-16 21:22:04; next
> arrival: next-20260417; merged: 2026-04-21 00:30:08; v7.0-post]
> 
> * On Tuesday morning I wrote a mail to Greg about including the fix in
> 7.0.1; Thomas round about the same time provided the necessary backport,
> which Greg then included out-of-band:
> https://lore.kernel.org/all/2026042105-malformed-probation-232b@gregkh/
> https://lore.kernel.org/all/87jyu0de2c.ffs@tglx/
> 
> * v7.0.1 is released on Wednesday, 2026-04-22 13:32:23

Tony Rodriguez from the SPARC community has observed the regression on SPARC as well
and proposed a fix to address it [1]. Not sure whether he has retested on the latest
commit of Linus' tree yet.

Tony, can you verify that 4096fd0e8eaea1 fixes the issue for you?

Adrian

> [1] https://github.com/sparclinux/issues/issues/79

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: the stuttering regression in 7.0: should I have done something different?
  2026-05-08  5:51 ` John Paul Adrian Glaubitz
@ 2026-05-08  6:33   ` Thorsten Leemhuis
       [not found]     ` <D5D19776-C809-4284-9417-F9A860877B98@gmail.com>
  0 siblings, 1 reply; 19+ messages in thread
From: Thorsten Leemhuis @ 2026-05-08  6:33 UTC (permalink / raw)
  To: John Paul Adrian Glaubitz, Greg KH, Linus Torvalds
  Cc: Linux kernel regressions list, LKML, Tony Rodriguez

On 5/8/26 07:51, John Paul Adrian Glaubitz wrote:
> On Thu, 2026-04-23 at 18:30 +0200, Thorsten Leemhuis wrote:
>> FWIW, here is the rough timeline of the regression, just to be sure we
>> are all on the same page:
>>
>> * The regression I'm talking about is caused by d6e152d905bdb1
>> ("clockevents: Prevent timer interrupt starvation") [authored:
>> 2026-04-07 10:54:17; committed: 2026-04-10 22:45:38; next arrival:
>> next-20260413; merged: 2026-04-12 19:01:55; v7.0 (2026-04-12 22:48:06)]
> [...]
> Tony Rodriguez from the SPARC community has observed the regression on SPARC as well
> and proposed a fix to address it [1]. Not sure whether he has retested on the latest
> commit of Linus' tree yet.
> 
> Tony, can you verify that 4096fd0e8eaea1 fixes the issue for you?
>
>> [1] https://github.com/sparclinux/issues/issues/79

It's likely a different regressions, as that report's title says that
v7.0.1, v7.0.2, v7.0.3, and v7.1‑rc1 are affected, which all contain the
fix, aka 4096fd0e8eaea1. Reporting in a new thread is likely best, as
the authors of the culprit are not even CCed here.

Ciao, Thorsten



^ permalink raw reply	[flat|nested] 19+ messages in thread

[parent not found: <D5D19776-C809-4284-9417-F9A860877B98@gmail.com>]

* Re: the stuttering regression in 7.0: should I have done something different?
       [not found]     ` <D5D19776-C809-4284-9417-F9A860877B98@gmail.com>
@ 2026-05-08  7:50       ` Thorsten Leemhuis
  2026-05-08 20:15         ` Tony Rodriguez
  0 siblings, 1 reply; 19+ messages in thread
From: Thorsten Leemhuis @ 2026-05-08  7:50 UTC (permalink / raw)
  To: Tony Rodriguez
  Cc: John Paul Adrian Glaubitz, Greg KH, Linus Torvalds,
	Linux kernel regressions list, LKML, Thomas Gleixner

[+tglx so he knows about it; details about the problem that Tony faces
can be found in https://github.com/sparclinux/issues/issues/79 ]

On 5/8/26 09:38, Tony Rodriguez wrote:
> I still don't believe this is fixed upstream as of v7.03 and v7.1-rc1,

Yes and no. It looks like d6e152d905bdb1 ("clockevents: Prevent timer
interrupt starvation") causes two regression.

Thomas fixed one with 4096fd0e8eaea1 ("clockevents: Add missing resets
of the next_event_forced flag") -- and feedback shows that it definitely
solved the problem for quite a few people. If that's not the case for
you, then you seem to face a different problem caused by the same
change. Happens, that's life sometimes.

Ciao, Thorsten

> only when my patch is applied does the SPARC74 S7-2 system become stable
> again. I also tested my patch with v7.04 and it works their as well.
> Will perform additional tests without my fix against v7.0.4 and v7.1-rc2
> later today to revalidate the regression (USA Pacific time).
> 
> Tony Rodriguez
> www.linkedin.com/in/unixpro1970
> 
>> On May 7, 2026, at 11:33 PM, Thorsten Leemhuis <linux@leemhuis.info>
>> wrote:
>>
>> On 5/8/26 07:51, John Paul Adrian Glaubitz wrote:
>>> On Thu, 2026-04-23 at 18:30 +0200, Thorsten Leemhuis wrote:
>>>> FWIW, here is the rough timeline of the regression, just to be sure we
>>>> are all on the same page:
>>>>
>>>> * The regression I'm talking about is caused by d6e152d905bdb1
>>>> ("clockevents: Prevent timer interrupt starvation") [authored:
>>>> 2026-04-07 10:54:17; committed: 2026-04-10 22:45:38; next arrival:
>>>> next-20260413; merged: 2026-04-12 19:01:55; v7.0 (2026-04-12 22:48:06)]
>>> [...]
>>> Tony Rodriguez from the SPARC community has observed the regression
>>> on SPARC as well
>>> and proposed a fix to address it [1]. Not sure whether he has
>>> retested on the latest
>>> commit of Linus' tree yet.
>>>
>>> Tony, can you verify that 4096fd0e8eaea1 fixes the issue for you?
>>>
>>>> [1] https://github.com/sparclinux/issues/issues/79
>>
>> It's likely a different regressions, as that report's title says that
>> v7.0.1, v7.0.2, v7.0.3, and v7.1‑rc1 are affected, which all contain the
>> fix, aka 4096fd0e8eaea1. Reporting in a new thread is likely best, as
>> the authors of the culprit are not even CCed here.
>>
>> Ciao, Thorsten


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: the stuttering regression in 7.0: should I have done something different?
  2026-05-08  7:50       ` Thorsten Leemhuis
@ 2026-05-08 20:15         ` Tony Rodriguez
  2026-05-08 20:21           ` Tony Rodriguez
  2026-05-10 21:29           ` Thomas Gleixner
  0 siblings, 2 replies; 19+ messages in thread
From: Tony Rodriguez @ 2026-05-08 20:15 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: John Paul Adrian Glaubitz, Greg KH, Linus Torvalds,
	Linux kernel regressions list, LKML, Thomas Gleixner

Just confirmed on my end today.  This regression also impacts both 
SPARC64 S7-2 and SPARC64 T7-1 on v7.0.4 and v7.1-rc2 as well. Different 
systems using the same exact kernels.

** Please see points (A1) (A2) (B1) (B2)

Once again, I am not experiencing such issues when "my patch" (link 
below) is added to address this regression.

https://github.com/sparclinux/issues/issues/79#issuecomment-4362173884

Output demonstrating issues for SPARC64 S7-2 and T7-1 systems (without 
my regression patch):

PS - On May 2nd 2026 at 9:42 PM: I also sent an email to Thomas Gleixner 
regarding this issue.  Will be happy to validate any patches from your 
end regarding this issue, as time permits me to do so.


Best Regards,

Tony Rodriguez


  A1) SPARC64 S7-2: Kernel v7.1.0-rc2
uname -a
Linux s7t7-debian-test 7.1.0-rc2-test01 #1 SMP Fri May  8 10:02:12 PDT 
2026 sparc64 GNU/Linux

cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-7.1.0-rc2-test01 
root=UUID=ce937a4b-126a-41bd-a54b-03a424421086 ro console=ttyHV0,9600n81 
systemd.log_level=info systemd.show_status=1 
systemd.journald.forward_to_console=0 plymouth.enable=0 quiet

[  243.267359] INFO: task kworker/u512:1:706 blocked for more than 120 
seconds.
[  243.281208]       Not tainted 7.1.0-rc2-test01 #1
[  243.290583] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  243.306349] INFO: task kworker/127:1:714 blocked for more than 120 
seconds.
[  243.320106]       Not tainted 7.1.0-rc2-test01 #1
[  243.329476] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.

[  364.099350] INFO: task kworker/u512:1:706 blocked for more than 241 
seconds.
[  364.113199]       Not tainted 7.1.0-rc2-test01 #1
[  364.122585] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  364.138328] INFO: task kworker/127:1:714 blocked for more than 241 
seconds.
[  364.152086]       Not tainted 7.1.0-rc2-test01 #1
[  364.161470] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.

[  485.295360] INFO: task kworker/u512:1:706 blocked for more than 362 
seconds.
[  485.309209]       Not tainted 7.1.0-rc2-test01 #1
[  485.318581] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  485.334345] INFO: task kworker/127:1:714 blocked for more than 362 
seconds.
[  485.348099]       Not tainted 7.1.0-rc2-test01 #1
[  485.357467] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.

[  726.849598] INFO: task kworker/u512:1:706 blocked for more than 604 
seconds.
[  726.863444]       Not tainted 7.1.0-rc2-test01 #1
[  726.872832] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  726.888573] INFO: task kworker/127:1:714 blocked for more than 604 
seconds.
[  726.902340]       Not tainted 7.1.0-rc2-test01 #1
[  726.911708] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.

sudo dmesg | grep -iE block | grep -iE worker
[  243.267359] INFO: task kworker/u512:1:706 blocked for more than 120 
seconds.
[  243.306349] INFO: task kworker/127:1:714 blocked for more than 120 
seconds.
[  364.099350] INFO: task kworker/u512:1:706 blocked for more than 241 
seconds.
[  364.138328] INFO: task kworker/127:1:714 blocked for more than 241 
seconds.
[  485.295360] INFO: task kworker/u512:1:706 blocked for more than 362 
seconds.
[  485.334345] INFO: task kworker/127:1:714 blocked for more than 362 
seconds.
[  605.849474] INFO: task kworker/u512:1:706 blocked for more than 483 
seconds.
[  605.888461] INFO: task kworker/127:1:714 blocked for more than 483 
seconds.

sudo poweroff or sudo reboot
NOTE(S): Random hangs during startup. Also, hangs during shutdown/reboot 
process.
-------------------------------------------------------------------------------------------
A2) SPARC64 S7-2: Kernel v7.0.4
[  OK  ] Finished e2scrub_reap.service - Remove Stale Online ext4 
Metadata Check Snapshots.

Debian GNU/Linux forky/sid s7t7-debian-test ttyHV0

s7t7-debian-test login: tonyr
Password:
Linux s7t7-debian-test 7.0.4-test01 #1 SMP Fri May  8 09:27:58 PDT 2026 
sparc64
[...]
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.

login: timed [  114.687722] rcu: INFO: rcu_sched detected stalls on 
CPUs/tasks:
[  114.699319] rcu:     67-...!: (240 GPs behind) idle=e9c0/0/0x0 
softirq=174/174 fqs=0 (false positive?)
[  114.717370] rcu:     102-...!: (73 GPs behind) idle=77e0/0/0x0 
softirq=286/287 fqs=0 (false positive?)
[  114.735419] rcu:     111-...!: (52 GPs behind) idle=11d8/0/0x0 
softirq=860/861 fqs=0 (false positive?)
[  114.753489] rcu:     (detected by 11, t=5268 jiffies, g=4457, q=528 
ncpus=128)
[  114.767628] rcu: rcu_sched kthread timer wakeup didn't happen for 
5270 jiffies! g4457 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  114.789647] rcu:     Possible timer handling issue on cpu=105 
timer-softirq=98
[  114.803535] rcu: rcu_sched kthread starved for 5280 jiffies! g4457 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=105
[  114.824201] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  114.842080] rcu: RCU grace-period kthread stack dump:
[  114.852221] rcu: Stack dump where RCU GP kthread last ran:
[  135.867723] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  135.879326] rcu:     65-...!: (1 GPs behind) idle=35b0/0/0x0 
softirq=483/484 fqs=0 (false positive?)
[  135.897024] rcu:     67-...!: (241 GPs behind) idle=ecc0/0/0x0 
softirq=174/174 fqs=0 (false positive?)
[  135.915082] rcu:     102-...!: (74 GPs behind) idle=7800/0/0x0 
softirq=286/287 fqs=0 (false positive?)
[  135.933123] rcu:     111-...!: (53 GPs behind) idle=1238/0/0x0 
softirq=860/861 fqs=0 (false positive?)
[  135.951184] rcu:     (detected by 64, t=5272 jiffies, g=4461, q=752 
ncpus=128)
[  135.965398] rcu: rcu_sched kthread timer wakeup didn't happen for 
5275 jiffies! g4461 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  135.987393] rcu:     Possible timer handling issue on cpu=105 
timer-softirq=98
[  136.001287] rcu: rcu_sched kthread starved for 5285 jiffies! g4461 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=105
[  136.021944] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  136.039829] rcu: RCU grace-period kthread stack dump:
[  136.049971] rcu: Stack dump where RCU GP kthread last ran:

NOTE(S): Unable to login and random hangs during system startup.
-------------------------------------------------------------------------
B1) SPARC64 T7-1: Kernel v7.1.0-rc2
lscpu;uname -a
Architecture:          sparc64
   CPU op-mode(s):      32-bit, 64-bit
   Byte Order:          Big Endian
CPU(s):                256
   On-line CPU(s) list: 0-255
Model name:            SPARC-M7
   Thread(s) per core:  8
   Core(s) per socket:  32
   Socket(s):           1
   Flags:               sun4v
Caches (sum of all):
   L1d:                 4 MiB (256 instances)
   L1i:                 4 MiB (256 instances)
   L2:                  64 MiB (256 instances)
Linux s7t7-debian-test 7.1.0-rc2-test01 #1 SMP Fri May  8 10:02:12 PDT 
2026 sparc64 GNU/Linux

   526.766867] rcu:     8-...!: (806 GPs behind) idle=069c/0/0x1 
softirq=682/682 fqs=0
[  526.781618] rcu:     22-...!: (0 ticks this GP) idle=7b40/0/0x0 
softirq=739/739 fqs=0 (false positive?)
[  526.799841] rcu:     89-...!: (770 GPs behind) idle=7800/0/0x0 
softirq=270/273 fqs=0 (false positive?)
[  526.817901] rcu:     112-...!: (225 GPs behind) idle=c0c8/0/0x0 
softirq=193/193 fqs=0 (false positive?)
[  526.836131] rcu:     189-...!: (0 ticks this GP) idle=8ef0/0/0x0 
softirq=1016/1016 fqs=0 (false positive?)
[  526.854885] rcu:     204-...!: (0 ticks this GP) idle=5d20/0/0x0 
softirq=774/774 fqs=0 (false positive?)
[  526.873278] rcu:     219-...!: (225 GPs behind) idle=d580/0/0x0 
softirq=605/607 fqs=0 (false positive?)
[  526.891508] rcu:     226-...!: (233 GPs behind) idle=ec08/0/0x0 
softirq=1189/1190 fqs=0 (false positive?)
[  526.910079] rcu:     (detected by 157, t=5289 jiffies, g=5989, q=5339 
ncpus=256)
[  526.924916] rcu: rcu_sched kthread timer wakeup didn't happen for 
5295 jiffies! g5989 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  526.946930] rcu:     Possible timer handling issue on cpu=94 
timer-softirq=279
[  526.960818] rcu: rcu_sched kthread starved for 5302 jiffies! g5989 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=94
[  526.981300] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  526.999182] rcu: RCU grace-period kthread stack dump:
[  527.009301] rcu: Stack dump where RCU GP kthread last ran:
[  548.035259] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  548.046861] rcu:     8-...!: (807 GPs behind) idle=071c/0/0x1 
softirq=682/682 fqs=0
[  548.061608] rcu:     17-...!: (0 ticks this GP) idle=00e8/0/0x0 
softirq=812/812 fqs=0 (false positive?)
[  548.079831] rcu:     84-...!: (0 ticks this GP) idle=d2b0/0/0x0 
softirq=797/797 fqs=0 (false positive?)
[  548.098070] rcu:     89-...!: (771 GPs behind) idle=7be8/0/0x0 
softirq=270/273 fqs=0 (false positive?)
[  548.116122] rcu:     112-...!: (226 GPs behind) idle=c110/0/0x0 
softirq=193/193 fqs=0 (false positive?)
[  548.134342] rcu:     185-...!: (0 ticks this GP) idle=45b8/0/0x0 
softirq=871/871 fqs=0 (false positive?)
[  548.152759] rcu:     193-...!: (0 ticks this GP) idle=1758/0/0x0 
softirq=1520/1520 fqs=0 (false positive?)
[  548.171509] rcu:     205-...!: (0 ticks this GP) idle=1e98/0/0x0 
softirq=852/852 fqs=0 (false positive?)
[  548.189893] rcu:     219-...!: (226 GPs behind) idle=d5c8/0/0x0 
softirq=605/607 fqs=0 (false positive?)
[  548.208128] rcu:     226-...!: (234 GPs behind) idle=eff0/0/0x0 
softirq=1189/1190 fqs=0 (false positive?)
[  548.226699] rcu:     (detected by 115, t=5300 jiffies, g=5993, q=5539 
ncpus=256)
[  548.241699] rcu: rcu_sched kthread timer wakeup didn't happen for 
5303 jiffies! g5993 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  548.263704] rcu:     Possible timer handling issue on cpu=94 
timer-softirq=279
[  548.277593] rcu: rcu_sched kthread starved for 5311 jiffies! g5993 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=94
[  548.298081] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  548.315971] rcu: RCU grace-period kthread stack dump:
[  548.326084] rcu: Stack dump where RCU GP kthread last ran:
[  569.343268] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  569.354868] rcu:     8-...!: (808 GPs behind) idle=07ac/0/0x1 
softirq=682/682 fqs=0
[  569.369617] rcu:     89-...!: (772 GPs behind) idle=8518/0/0x0 
softirq=270/273 fqs=0 (false positive?)
[  569.387674] rcu:     112-...!: (227 GPs behind) idle=c168/0/0x0 
softirq=193/193 fqs=0 (false positive?)
[  569.405894] rcu:     219-...!: (227 GPs behind) idle=d620/0/0x0 
softirq=605/607 fqs=0 (false positive?)
[  569.424128] rcu:     226-...!: (235 GPs behind) idle=f920/0/0x0 
softirq=1189/1190 fqs=0 (false positive?)
[  569.442700] rcu:     (detected by 76, t=5276 jiffies, g=5997, q=5665 
ncpus=256)
[  569.457146] rcu: rcu_sched kthread timer wakeup didn't happen for 
5278 jiffies! g5997 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  569.479149] rcu:     Possible timer handling issue on cpu=94 
timer-softirq=279
[  569.493043] rcu: rcu_sched kthread starved for 5285 jiffies! g5997 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=94
[  569.513534] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  569.531419] rcu: RCU grace-period kthread stack dump:
[  569.541536] rcu: Stack dump where RCU GP kthread last ran:
[  590.563260] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  590.574870] rcu:     8-...!: (809 GPs behind) idle=0824/0/0x1 
softirq=682/682 fqs=0
[  590.589618] rcu:     89-...!: (773 GPs behind) idle=8850/0/0x0 
softirq=270/273 fqs=0 (false positive?)
[  590.607682] rcu:     112-...!: (228 GPs behind) idle=c198/0/0x0 
softirq=193/193 fqs=0 (false positive?)
[  590.625904] rcu:     195-...!: (0 ticks this GP) idle=7178/0/0x0 
softirq=1038/1038 fqs=0 (false positive?)
[  590.644660] rcu:     207-...!: (0 ticks this GP) idle=9440/0/0x0 
softirq=809/809 fqs=0 (false positive?)
[  590.663056] rcu:     219-...!: (228 GPs behind) idle=d650/0/0x0 
softirq=605/607 fqs=0 (false positive?)
[  590.681285] rcu:     226-...!: (236 GPs behind) idle=fc78/0/0x0 
softirq=1189/1190 fqs=0 (false positive?)
[  590.699859] rcu:     (detected by 138, t=5286 jiffies, g=6001, q=5524 
ncpus=256)
[  590.714623] rcu: rcu_sched kthread timer wakeup didn't happen for 
5288 jiffies! g6001 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  590.736635] rcu:     Possible timer handling issue on cpu=94 
timer-softirq=279
[  590.750524] rcu: rcu_sched kthread starved for 5296 jiffies! g6001 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=94
[  590.771021] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  590.788903] rcu: RCU grace-period kthread stack dump:
[  590.799012] rcu: Stack dump where RCU GP kthread last ran:
[  606.363275] INFO: task kworker/u1024:0:12 blocked for more than 483 
seconds.
[  606.377139]       Tainted: G        W  7.1.0-rc2-test01 #1
[  606.389636] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  611.823259] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  611.834860] rcu:     8-...!: (810 GPs behind) idle=08bc/0/0x1 
softirq=682/682 fqs=0
[  611.849612] rcu:     89-...!: (774 GPs behind) idle=91a8/0/0x0 
softirq=270/273 fqs=0 (false positive?)
[  611.867665] rcu:     112-...!: (229 GPs behind) idle=c1d8/0/0x0 
softirq=193/193 fqs=0 (false positive?)
[  611.885887] rcu:     205-...!: (0 ticks this GP) idle=2160/0/0x0 
softirq=865/865 fqs=0 (false positive?)
[  611.904290] rcu:     219-...!: (229 GPs behind) idle=d690/0/0x0 
softirq=605/607 fqs=0 (false positive?)
[  611.922525] rcu:     226-...!: (237 GPs behind) idle=05e0/0/0x0 
softirq=1189/1190 fqs=0 (false positive?)
[  611.941095] rcu:     (detected by 166, t=5283 jiffies, g=6005, q=5522 
ncpus=256)
[  611.955789] rcu: rcu_sched kthread timer wakeup didn't happen for 
5285 jiffies! g6005 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  611.977793] rcu:     Possible timer handling issue on cpu=94 
timer-softirq=279
[  611.991685] rcu: rcu_sched kthread starved for 5292 jiffies! g6005 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=94
[  612.012174] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  612.030060] rcu: RCU grace-period kthread stack dump:
[  612.040180] rcu: Stack dump where RCU GP kthread last ran:

r[  727.195272] INFO: task kworker/u1024:0:12 blocked for more than 604 
seconds.
[  727.209134]       Tainted: G        W  7.1.0-rc2-test01 #1
[  727.221628] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.

NOTE(S): Random hangs and same messages as S7-2. Takes about 15 minutes 
to see the messages.
---------------------------------------------------------
B2) SPARC64 T7-1 Kernel v7.0.4
[  OK  ] Finished e2scrub_reap.service - Remove Stale Online ext4 
Metadata Check Snapshots.

Debian GNU/Linux forky/sid s7t7-debian-test ttyHV0

s7t7-debian-test login: tonyr
Password:
Linux s7t7-debian-test 7.0.4-test01 #1 SMP Fri May  8 09:27:58 PDT 2026 
sparc64
[..]
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
[   79.468871] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[   79.480483] rcu:     114-...!: (424 GPs behind) idle=8760/0/0x0 
softirq=126/126 fqs=0 (false positive?)
[   79.498713] rcu:     (detected by 90, t=5259 jiffies, g=3769, q=818 
ncpus=256)
[   79.512702] rcu: rcu_sched kthread timer wakeup didn't happen for 
5260 jiffies! g3769 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[   79.534808] rcu:     Possible timer handling issue on cpu=2 
timer-softirq=330
[   79.548516] rcu: rcu_sched kthread starved for 5267 jiffies! g3769 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
[   79.568838] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[   79.586709] rcu: RCU grace-period kthread stack dump:
[   79.596867] rcu: Stack dump where RCU GP kthread last ran:
[  100.612874] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  100.624477] rcu:     114-...!: (425 GPs behind) idle=88f0/0/0x0 
softirq=126/126 fqs=0 (false positive?)
[  100.642709] rcu:     157-...!: (0 ticks this GP) idle=4c08/0/0x0 
softirq=122/122 fqs=0 (false positive?)
[  100.661106] rcu:     (detected by 3, t=5264 jiffies, g=3773, q=1046 
ncpus=256)
[  100.675155] rcu: rcu_sched kthread timer wakeup didn't happen for 
5265 jiffies! g3773 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  100.697211] rcu:     Possible timer handling issue on cpu=2 
timer-softirq=330
[  100.710934] rcu: rcu_sched kthread starved for 5276 jiffies! g3773 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
[  100.731244] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  100.749125] rcu: RCU grace-period kthread stack dump:
[  100.759255] rcu: Stack dump where RCU GP kthread last ran:
login: ti[  121.776867] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  121.788467] rcu:     114-...!: (426 GPs behind) idle=8a20/0/0x0 
softirq=126/126 fqs=0 (false positive?)
[  121.806703] rcu:     (detected by 3, t=5259 jiffies, g=3777, q=1267 
ncpus=256)
[  121.820664] rcu: rcu_sched kthread timer wakeup didn't happen for 
5260 jiffies! g3777 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  121.842799] rcu:     Possible timer handling issue on cpu=2 
timer-softirq=330
[  121.856521] rcu: rcu_sched kthread starved for 5271 jiffies! g3777 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
[  121.876836] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  121.894717] rcu: RCU grace-period kthread stack dump:
[  121.904824] rcu: Stack dump where RCU GP kthread last ran:
[  142.920877] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  142.932481] rcu:     114-...!: (427 GPs behind) idle=8b98/0/0x0 
softirq=126/126 fqs=0 (false positive?)
[  142.950709] rcu:     158-...!: (1 GPs behind) idle=5220/0/0x0 
softirq=142/148 fqs=0 (false positive?)
[  142.968586] rcu:     (detected by 122, t=5260 jiffies, g=3781, q=722 
ncpus=256)
[  142.982808] rcu: rcu_sched kthread timer wakeup didn't happen for 
5265 jiffies! g3781 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  143.004857] rcu:     Possible timer handling issue on cpu=2 
timer-softirq=330
[  143.018583] rcu: rcu_sched kthread starved for 5273 jiffies! g3781 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
[  143.038893] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  143.056761] rcu: RCU grace-period kthread stack dump:
[  143.066898] rcu: Stack dump where RCU GP kthread last ran:
[  164.084863] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  164.096463] rcu:     14-...!: (0 ticks this GP) idle=56b0/0/0x0 
softirq=165/165 fqs=0 (false positive?)
[  164.114695] rcu:     114-...!: (428 GPs behind) idle=8ed0/0/0x0 
softirq=126/126 fqs=0 (false positive?)
[  164.132916] rcu:     (detected by 96, t=5264 jiffies, g=3785, q=750 
ncpus=256)
[  164.146969] rcu: rcu_sched kthread timer wakeup didn't happen for 
5265 jiffies! g3785 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  164.169019] rcu:     Possible timer handling issue on cpu=2 
timer-softirq=330
[  164.182728] rcu: rcu_sched kthread starved for 5276 jiffies! g3785 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
[  164.203055] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  164.220922] rcu: RCU grace-period kthread stack dump:
[  164.231039] rcu: Stack dump where RCU GP kthread last ran:
[  185.248867] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  185.260468] rcu:     21-...!: (0 ticks this GP) idle=36c8/0/0x0 
softirq=154/154 fqs=0 (false positive?)
[  185.278684] rcu:     114-...!: (429 GPs behind) idle=8f68/0/0x0 
softirq=126/126 fqs=0 (false positive?)
[  185.296922] rcu:     (detected by 116, t=5264 jiffies, g=3789, q=760 
ncpus=256)
[  185.311140] rcu: rcu_sched kthread timer wakeup didn't happen for 
5265 jiffies! g3789 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  185.333205] rcu:     Possible timer handling issue on cpu=2 
timer-softirq=330
[  185.346918] rcu: rcu_sched kthread starved for 5276 jiffies! g3789 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
[  185.367224] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  185.385113] rcu: RCU grace-period kthread stack dump:
[  185.395229] rcu: Stack dump where RCU GP kthread last ran:
   OK  ] Reached target network-online.target - Network is Online.
[  OK  ] Started anacron.service - Run anacron jobs.
[  OK  ] Started cups-browsed.service - Make remote CUPS printers 
available locally.
          Starting exim4.service - exim Mail Transport Agent...
          Starting xrdp.service - xrdp daemon...
[  OK  ] Finished user-runtime-dir@1000.service - User Runtime Directory 
/run/user/1000.
[  OK  ] Started xrdp.service - xrdp daemon.
[  OK  ] Started serial-getty@ttyHV0.service - Serial Getty on ttyHV0.
          Starting user@1000.service - User Manager for UID 1000...
[  OK  ] Started exim4.service - exim Mail Transport Agent.
[  OK  ] Reached target multi-user.target - Multi-User System.
[  OK  ] Reached target graphical.target - Graphical Interface.
[  OK  ] Started user@1000.service - User Manager for UID 1000.
[FAILED] Failed to start session-1.scope - Session 1 of User tonyr.
See 'systemctl status session-1.scope' for details.
[  206.412865] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  206.424477] rcu:     114-...!: (430 GPs behind) idle=97b0/0/0x0 
softirq=126/126 fqs=0 (false positive?)
[  206.442691] rcu:     (detected by 123, t=5259 jiffies, g=3793, q=5473 
ncpus=256)
[  206.457056] rcu: rcu_sched kthread timer wakeup didn't happen for 
5261 jiffies! g3793 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  206.479157] rcu:     Possible timer handling issue on cpu=2 
timer-softirq=330
[  206.492868] rcu: rcu_sched kthread starved for 5271 jiffies! g3793 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
[  206.513173] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  206.531061] rcu: RCU grace-period kthread stack dump:
[  206.541202] rcu: Stack dump where RCU GP kthread last ran:

NOTE(S): Unable to login and random hangs during system startup. Same 
messages/issues as S7-2.


On 5/8/26 12:50 AM, Thorsten Leemhuis wrote:
> [+tglx so he knows about it; details about the problem that Tony faces
> can be found in https://github.com/sparclinux/issues/issues/79 ]
>
> On 5/8/26 09:38, Tony Rodriguez wrote:
>> I still don't believe this is fixed upstream as of v7.03 and v7.1-rc1,
> Yes and no. It looks like d6e152d905bdb1 ("clockevents: Prevent timer
> interrupt starvation") causes two regression.
>
> Thomas fixed one with 4096fd0e8eaea1 ("clockevents: Add missing resets
> of the next_event_forced flag") -- and feedback shows that it definitely
> solved the problem for quite a few people. If that's not the case for
> you, then you seem to face a different problem caused by the same
> change. Happens, that's life sometimes.
>
> Ciao, Thorsten
>
>> only when my patch is applied does the SPARC74 S7-2 system become stable
>> again. I also tested my patch with v7.0.4 and it works their as well.
>> Will perform additional tests without my fix against v7.0.4 and v7.1-rc2
>> later today to revalidate the regression (USA Pacific time).
>>
>> Tony Rodriguez
>> www.linkedin.com/in/unixpro1970
>>
>>> On May 7, 2026, at 11:33 PM, Thorsten Leemhuis <linux@leemhuis.info>
>>> wrote:
>>>
>>> On 5/8/26 07:51, John Paul Adrian Glaubitz wrote:
>>>> On Thu, 2026-04-23 at 18:30 +0200, Thorsten Leemhuis wrote:
>>>>> FWIW, here is the rough timeline of the regression, just to be sure we
>>>>> are all on the same page:
>>>>>
>>>>> * The regression I'm talking about is caused by d6e152d905bdb1
>>>>> ("clockevents: Prevent timer interrupt starvation") [authored:
>>>>> 2026-04-07 10:54:17; committed: 2026-04-10 22:45:38; next arrival:
>>>>> next-20260413; merged: 2026-04-12 19:01:55; v7.0 (2026-04-12 22:48:06)]
>>>> [...]
>>>> Tony Rodriguez from the SPARC community has observed the regression
>>>> on SPARC as well
>>>> and proposed a fix to address it [1]. Not sure whether he has
>>>> retested on the latest
>>>> commit of Linus' tree yet.
>>>>
>>>> Tony, can you verify that 4096fd0e8eaea1 fixes the issue for you?
>>>>
>>>>> [1] https://github.com/sparclinux/issues/issues/79
>>> It's likely a different regressions, as that report's title says that
>>> v7.0.1, v7.0.2, v7.0.3, and v7.1‑rc1 are affected, which all contain the
>>> fix, aka 4096fd0e8eaea1. Reporting in a new thread is likely best, as
>>> the authors of the culprit are not even CCed here.
>>>
>>> Ciao, Thorsten

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: the stuttering regression in 7.0: should I have done something different?
  2026-05-08 20:15         ` Tony Rodriguez
@ 2026-05-08 20:21           ` Tony Rodriguez
  2026-05-10 21:29           ` Thomas Gleixner
  1 sibling, 0 replies; 19+ messages in thread
From: Tony Rodriguez @ 2026-05-08 20:21 UTC (permalink / raw)
  To: Thorsten Leemhuis
  Cc: John Paul Adrian Glaubitz, Greg KH, Linus Torvalds,
	Linux kernel regressions list, LKML, Thomas Gleixner

Just confirmed on my end today.  This regression also impacts both 
SPARC64 S7-2 and SPARC64 T7-1 on v7.0.4 and v7.1-rc2 as well. Different 
systems using the same exact kernels.

** Please see points (A1) (A2) (B1) (B2)

Once again, I am not experiencing such issues when "my patch" (link 
below) is added to address this regression.

https://github.com/sparclinux/issues/issues/79#issuecomment-4362173884

Output demonstrating issues for SPARC64 S7-2 and T7-1 systems (without 
my regression patch):

PS - On May 2nd, 2026 at 9:42 PM: I also sent an email to Thomas 
Gleixner regarding this issue.  I will be happy to validate any patches 
from your end regarding this issue, as time permits me to do so.


Best regards,

Tony Rodriguez


  A1) SPARC64 S7-2: Kernel v7.1.0-rc2
uname -a
Linux s7t7-debian-test 7.1.0-rc2-test01 #1 SMP Fri May  8 10:02:12 PDT 
2026 sparc64 GNU/Linux

cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-7.1.0-rc2-test01 
root=UUID=ce937a4b-126a-41bd-a54b-03a424421086 ro console=ttyHV0,9600n81 
systemd.log_level=info systemd.show_status=1 
systemd.journald.forward_to_console=0 plymouth.enable=0 quiet

[  243.267359] INFO: task kworker/u512:1:706 blocked for more than 120 
seconds.
[  243.281208]       Not tainted 7.1.0-rc2-test01 #1
[  243.290583] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  243.306349] INFO: task kworker/127:1:714 blocked for more than 120 
seconds.
[  243.320106]       Not tainted 7.1.0-rc2-test01 #1
[  243.329476] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.

[  364.099350] INFO: task kworker/u512:1:706 blocked for more than 241 
seconds.
[  364.113199]       Not tainted 7.1.0-rc2-test01 #1
[  364.122585] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  364.138328] INFO: task kworker/127:1:714 blocked for more than 241 
seconds.
[  364.152086]       Not tainted 7.1.0-rc2-test01 #1
[  364.161470] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.

[  485.295360] INFO: task kworker/u512:1:706 blocked for more than 362 
seconds.
[  485.309209]       Not tainted 7.1.0-rc2-test01 #1
[  485.318581] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  485.334345] INFO: task kworker/127:1:714 blocked for more than 362 
seconds.
[  485.348099]       Not tainted 7.1.0-rc2-test01 #1
[  485.357467] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.

[  726.849598] INFO: task kworker/u512:1:706 blocked for more than 604 
seconds.
[  726.863444]       Not tainted 7.1.0-rc2-test01 #1
[  726.872832] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  726.888573] INFO: task kworker/127:1:714 blocked for more than 604 
seconds.
[  726.902340]       Not tainted 7.1.0-rc2-test01 #1
[  726.911708] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.

sudo dmesg | grep -iE block | grep -iE worker
[  243.267359] INFO: task kworker/u512:1:706 blocked for more than 120 
seconds.
[  243.306349] INFO: task kworker/127:1:714 blocked for more than 120 
seconds.
[  364.099350] INFO: task kworker/u512:1:706 blocked for more than 241 
seconds.
[  364.138328] INFO: task kworker/127:1:714 blocked for more than 241 
seconds.
[  485.295360] INFO: task kworker/u512:1:706 blocked for more than 362 
seconds.
[  485.334345] INFO: task kworker/127:1:714 blocked for more than 362 
seconds.
[  605.849474] INFO: task kworker/u512:1:706 blocked for more than 483 
seconds.
[  605.888461] INFO: task kworker/127:1:714 blocked for more than 483 
seconds.

sudo poweroff or sudo reboot
NOTE(S): Random hangs during startup. Also, hangs during shutdown/reboot 
process.
-------------------------------------------------------------------------------------------
A2) SPARC64 S7-2: Kernel v7.0.4
[  OK  ] Finished e2scrub_reap.service - Remove Stale Online ext4 
Metadata Check Snapshots.

Debian GNU/Linux forky/sid s7t7-debian-test ttyHV0

s7t7-debian-test login: tonyr
Password:
Linux s7t7-debian-test 7.0.4-test01 #1 SMP Fri May  8 09:27:58 PDT 2026 
sparc64
[...]
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.

login: timed [  114.687722] rcu: INFO: rcu_sched detected stalls on 
CPUs/tasks:
[  114.699319] rcu:     67-...!: (240 GPs behind) idle=e9c0/0/0x0 
softirq=174/174 fqs=0 (false positive?)
[  114.717370] rcu:     102-...!: (73 GPs behind) idle=77e0/0/0x0 
softirq=286/287 fqs=0 (false positive?)
[  114.735419] rcu:     111-...!: (52 GPs behind) idle=11d8/0/0x0 
softirq=860/861 fqs=0 (false positive?)
[  114.753489] rcu:     (detected by 11, t=5268 jiffies, g=4457, q=528 
ncpus=128)
[  114.767628] rcu: rcu_sched kthread timer wakeup didn't happen for 
5270 jiffies! g4457 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  114.789647] rcu:     Possible timer handling issue on cpu=105 
timer-softirq=98
[  114.803535] rcu: rcu_sched kthread starved for 5280 jiffies! g4457 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=105
[  114.824201] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  114.842080] rcu: RCU grace-period kthread stack dump:
[  114.852221] rcu: Stack dump where RCU GP kthread last ran:
[  135.867723] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  135.879326] rcu:     65-...!: (1 GPs behind) idle=35b0/0/0x0 
softirq=483/484 fqs=0 (false positive?)
[  135.897024] rcu:     67-...!: (241 GPs behind) idle=ecc0/0/0x0 
softirq=174/174 fqs=0 (false positive?)
[  135.915082] rcu:     102-...!: (74 GPs behind) idle=7800/0/0x0 
softirq=286/287 fqs=0 (false positive?)
[  135.933123] rcu:     111-...!: (53 GPs behind) idle=1238/0/0x0 
softirq=860/861 fqs=0 (false positive?)
[  135.951184] rcu:     (detected by 64, t=5272 jiffies, g=4461, q=752 
ncpus=128)
[  135.965398] rcu: rcu_sched kthread timer wakeup didn't happen for 
5275 jiffies! g4461 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  135.987393] rcu:     Possible timer handling issue on cpu=105 
timer-softirq=98
[  136.001287] rcu: rcu_sched kthread starved for 5285 jiffies! g4461 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=105
[  136.021944] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  136.039829] rcu: RCU grace-period kthread stack dump:
[  136.049971] rcu: Stack dump where RCU GP kthread last ran:

NOTE(S): Unable to login and random hangs during system startup.
-------------------------------------------------------------------------
B1) SPARC64 T7-1: Kernel v7.1.0-rc2
lscpu;uname -a
Architecture:          sparc64
   CPU op-mode(s):      32-bit, 64-bit
   Byte Order:          Big Endian
CPU(s):                256
   On-line CPU(s) list: 0-255
Model name:            SPARC-M7
   Thread(s) per core:  8
   Core(s) per socket:  32
   Socket(s):           1
   Flags:               sun4v
Caches (sum of all):
   L1d:                 4 MiB (256 instances)
   L1i:                 4 MiB (256 instances)
   L2:                  64 MiB (256 instances)
Linux s7t7-debian-test 7.1.0-rc2-test01 #1 SMP Fri May  8 10:02:12 PDT 
2026 sparc64 GNU/Linux

   526.766867] rcu:     8-...!: (806 GPs behind) idle=069c/0/0x1 
softirq=682/682 fqs=0
[  526.781618] rcu:     22-...!: (0 ticks this GP) idle=7b40/0/0x0 
softirq=739/739 fqs=0 (false positive?)
[  526.799841] rcu:     89-...!: (770 GPs behind) idle=7800/0/0x0 
softirq=270/273 fqs=0 (false positive?)
[  526.817901] rcu:     112-...!: (225 GPs behind) idle=c0c8/0/0x0 
softirq=193/193 fqs=0 (false positive?)
[  526.836131] rcu:     189-...!: (0 ticks this GP) idle=8ef0/0/0x0 
softirq=1016/1016 fqs=0 (false positive?)
[  526.854885] rcu:     204-...!: (0 ticks this GP) idle=5d20/0/0x0 
softirq=774/774 fqs=0 (false positive?)
[  526.873278] rcu:     219-...!: (225 GPs behind) idle=d580/0/0x0 
softirq=605/607 fqs=0 (false positive?)
[  526.891508] rcu:     226-...!: (233 GPs behind) idle=ec08/0/0x0 
softirq=1189/1190 fqs=0 (false positive?)
[  526.910079] rcu:     (detected by 157, t=5289 jiffies, g=5989, q=5339 
ncpus=256)
[  526.924916] rcu: rcu_sched kthread timer wakeup didn't happen for 
5295 jiffies! g5989 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  526.946930] rcu:     Possible timer handling issue on cpu=94 
timer-softirq=279
[  526.960818] rcu: rcu_sched kthread starved for 5302 jiffies! g5989 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=94
[  526.981300] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  526.999182] rcu: RCU grace-period kthread stack dump:
[  527.009301] rcu: Stack dump where RCU GP kthread last ran:
[  548.035259] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  548.046861] rcu:     8-...!: (807 GPs behind) idle=071c/0/0x1 
softirq=682/682 fqs=0
[  548.061608] rcu:     17-...!: (0 ticks this GP) idle=00e8/0/0x0 
softirq=812/812 fqs=0 (false positive?)
[  548.079831] rcu:     84-...!: (0 ticks this GP) idle=d2b0/0/0x0 
softirq=797/797 fqs=0 (false positive?)
[  548.098070] rcu:     89-...!: (771 GPs behind) idle=7be8/0/0x0 
softirq=270/273 fqs=0 (false positive?)
[  548.116122] rcu:     112-...!: (226 GPs behind) idle=c110/0/0x0 
softirq=193/193 fqs=0 (false positive?)
[  548.134342] rcu:     185-...!: (0 ticks this GP) idle=45b8/0/0x0 
softirq=871/871 fqs=0 (false positive?)
[  548.152759] rcu:     193-...!: (0 ticks this GP) idle=1758/0/0x0 
softirq=1520/1520 fqs=0 (false positive?)
[  548.171509] rcu:     205-...!: (0 ticks this GP) idle=1e98/0/0x0 
softirq=852/852 fqs=0 (false positive?)
[  548.189893] rcu:     219-...!: (226 GPs behind) idle=d5c8/0/0x0 
softirq=605/607 fqs=0 (false positive?)
[  548.208128] rcu:     226-...!: (234 GPs behind) idle=eff0/0/0x0 
softirq=1189/1190 fqs=0 (false positive?)
[  548.226699] rcu:     (detected by 115, t=5300 jiffies, g=5993, q=5539 
ncpus=256)
[  548.241699] rcu: rcu_sched kthread timer wakeup didn't happen for 
5303 jiffies! g5993 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  548.263704] rcu:     Possible timer handling issue on cpu=94 
timer-softirq=279
[  548.277593] rcu: rcu_sched kthread starved for 5311 jiffies! g5993 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=94
[  548.298081] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  548.315971] rcu: RCU grace-period kthread stack dump:
[  548.326084] rcu: Stack dump where RCU GP kthread last ran:
[  569.343268] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  569.354868] rcu:     8-...!: (808 GPs behind) idle=07ac/0/0x1 
softirq=682/682 fqs=0
[  569.369617] rcu:     89-...!: (772 GPs behind) idle=8518/0/0x0 
softirq=270/273 fqs=0 (false positive?)
[  569.387674] rcu:     112-...!: (227 GPs behind) idle=c168/0/0x0 
softirq=193/193 fqs=0 (false positive?)
[  569.405894] rcu:     219-...!: (227 GPs behind) idle=d620/0/0x0 
softirq=605/607 fqs=0 (false positive?)
[  569.424128] rcu:     226-...!: (235 GPs behind) idle=f920/0/0x0 
softirq=1189/1190 fqs=0 (false positive?)
[  569.442700] rcu:     (detected by 76, t=5276 jiffies, g=5997, q=5665 
ncpus=256)
[  569.457146] rcu: rcu_sched kthread timer wakeup didn't happen for 
5278 jiffies! g5997 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  569.479149] rcu:     Possible timer handling issue on cpu=94 
timer-softirq=279
[  569.493043] rcu: rcu_sched kthread starved for 5285 jiffies! g5997 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=94
[  569.513534] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  569.531419] rcu: RCU grace-period kthread stack dump:
[  569.541536] rcu: Stack dump where RCU GP kthread last ran:
[  590.563260] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  590.574870] rcu:     8-...!: (809 GPs behind) idle=0824/0/0x1 
softirq=682/682 fqs=0
[  590.589618] rcu:     89-...!: (773 GPs behind) idle=8850/0/0x0 
softirq=270/273 fqs=0 (false positive?)
[  590.607682] rcu:     112-...!: (228 GPs behind) idle=c198/0/0x0 
softirq=193/193 fqs=0 (false positive?)
[  590.625904] rcu:     195-...!: (0 ticks this GP) idle=7178/0/0x0 
softirq=1038/1038 fqs=0 (false positive?)
[  590.644660] rcu:     207-...!: (0 ticks this GP) idle=9440/0/0x0 
softirq=809/809 fqs=0 (false positive?)
[  590.663056] rcu:     219-...!: (228 GPs behind) idle=d650/0/0x0 
softirq=605/607 fqs=0 (false positive?)
[  590.681285] rcu:     226-...!: (236 GPs behind) idle=fc78/0/0x0 
softirq=1189/1190 fqs=0 (false positive?)
[  590.699859] rcu:     (detected by 138, t=5286 jiffies, g=6001, q=5524 
ncpus=256)
[  590.714623] rcu: rcu_sched kthread timer wakeup didn't happen for 
5288 jiffies! g6001 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  590.736635] rcu:     Possible timer handling issue on cpu=94 
timer-softirq=279
[  590.750524] rcu: rcu_sched kthread starved for 5296 jiffies! g6001 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=94
[  590.771021] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  590.788903] rcu: RCU grace-period kthread stack dump:
[  590.799012] rcu: Stack dump where RCU GP kthread last ran:
[  606.363275] INFO: task kworker/u1024:0:12 blocked for more than 483 
seconds.
[  606.377139]       Tainted: G        W  7.1.0-rc2-test01 #1
[  606.389636] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  611.823259] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  611.834860] rcu:     8-...!: (810 GPs behind) idle=08bc/0/0x1 
softirq=682/682 fqs=0
[  611.849612] rcu:     89-...!: (774 GPs behind) idle=91a8/0/0x0 
softirq=270/273 fqs=0 (false positive?)
[  611.867665] rcu:     112-...!: (229 GPs behind) idle=c1d8/0/0x0 
softirq=193/193 fqs=0 (false positive?)
[  611.885887] rcu:     205-...!: (0 ticks this GP) idle=2160/0/0x0 
softirq=865/865 fqs=0 (false positive?)
[  611.904290] rcu:     219-...!: (229 GPs behind) idle=d690/0/0x0 
softirq=605/607 fqs=0 (false positive?)
[  611.922525] rcu:     226-...!: (237 GPs behind) idle=05e0/0/0x0 
softirq=1189/1190 fqs=0 (false positive?)
[  611.941095] rcu:     (detected by 166, t=5283 jiffies, g=6005, q=5522 
ncpus=256)
[  611.955789] rcu: rcu_sched kthread timer wakeup didn't happen for 
5285 jiffies! g6005 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  611.977793] rcu:     Possible timer handling issue on cpu=94 
timer-softirq=279
[  611.991685] rcu: rcu_sched kthread starved for 5292 jiffies! g6005 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=94
[  612.012174] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  612.030060] rcu: RCU grace-period kthread stack dump:
[  612.040180] rcu: Stack dump where RCU GP kthread last ran:

r[  727.195272] INFO: task kworker/u1024:0:12 blocked for more than 604 
seconds.
[  727.209134]       Tainted: G        W  7.1.0-rc2-test01 #1
[  727.221628] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.

NOTE(S): Random hangs and same messages as S7-2. Takes about 15 minutes 
to see the messages.
---------------------------------------------------------
B2) SPARC64 T7-1 Kernel v7.0.4
[  OK  ] Finished e2scrub_reap.service - Remove Stale Online ext4 
Metadata Check Snapshots.

Debian GNU/Linux forky/sid s7t7-debian-test ttyHV0

s7t7-debian-test login: tonyr
Password:
Linux s7t7-debian-test 7.0.4-test01 #1 SMP Fri May  8 09:27:58 PDT 2026 
sparc64
[..]
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
[   79.468871] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[   79.480483] rcu:     114-...!: (424 GPs behind) idle=8760/0/0x0 
softirq=126/126 fqs=0 (false positive?)
[   79.498713] rcu:     (detected by 90, t=5259 jiffies, g=3769, q=818 
ncpus=256)
[   79.512702] rcu: rcu_sched kthread timer wakeup didn't happen for 
5260 jiffies! g3769 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[   79.534808] rcu:     Possible timer handling issue on cpu=2 
timer-softirq=330
[   79.548516] rcu: rcu_sched kthread starved for 5267 jiffies! g3769 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
[   79.568838] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[   79.586709] rcu: RCU grace-period kthread stack dump:
[   79.596867] rcu: Stack dump where RCU GP kthread last ran:
[  100.612874] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  100.624477] rcu:     114-...!: (425 GPs behind) idle=88f0/0/0x0 
softirq=126/126 fqs=0 (false positive?)
[  100.642709] rcu:     157-...!: (0 ticks this GP) idle=4c08/0/0x0 
softirq=122/122 fqs=0 (false positive?)
[  100.661106] rcu:     (detected by 3, t=5264 jiffies, g=3773, q=1046 
ncpus=256)
[  100.675155] rcu: rcu_sched kthread timer wakeup didn't happen for 
5265 jiffies! g3773 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  100.697211] rcu:     Possible timer handling issue on cpu=2 
timer-softirq=330
[  100.710934] rcu: rcu_sched kthread starved for 5276 jiffies! g3773 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
[  100.731244] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  100.749125] rcu: RCU grace-period kthread stack dump:
[  100.759255] rcu: Stack dump where RCU GP kthread last ran:
login: ti[  121.776867] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  121.788467] rcu:     114-...!: (426 GPs behind) idle=8a20/0/0x0 
softirq=126/126 fqs=0 (false positive?)
[  121.806703] rcu:     (detected by 3, t=5259 jiffies, g=3777, q=1267 
ncpus=256)
[  121.820664] rcu: rcu_sched kthread timer wakeup didn't happen for 
5260 jiffies! g3777 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  121.842799] rcu:     Possible timer handling issue on cpu=2 
timer-softirq=330
[  121.856521] rcu: rcu_sched kthread starved for 5271 jiffies! g3777 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
[  121.876836] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  121.894717] rcu: RCU grace-period kthread stack dump:
[  121.904824] rcu: Stack dump where RCU GP kthread last ran:
[  142.920877] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  142.932481] rcu:     114-...!: (427 GPs behind) idle=8b98/0/0x0 
softirq=126/126 fqs=0 (false positive?)
[  142.950709] rcu:     158-...!: (1 GPs behind) idle=5220/0/0x0 
softirq=142/148 fqs=0 (false positive?)
[  142.968586] rcu:     (detected by 122, t=5260 jiffies, g=3781, q=722 
ncpus=256)
[  142.982808] rcu: rcu_sched kthread timer wakeup didn't happen for 
5265 jiffies! g3781 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  143.004857] rcu:     Possible timer handling issue on cpu=2 
timer-softirq=330
[  143.018583] rcu: rcu_sched kthread starved for 5273 jiffies! g3781 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
[  143.038893] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  143.056761] rcu: RCU grace-period kthread stack dump:
[  143.066898] rcu: Stack dump where RCU GP kthread last ran:
[  164.084863] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  164.096463] rcu:     14-...!: (0 ticks this GP) idle=56b0/0/0x0 
softirq=165/165 fqs=0 (false positive?)
[  164.114695] rcu:     114-...!: (428 GPs behind) idle=8ed0/0/0x0 
softirq=126/126 fqs=0 (false positive?)
[  164.132916] rcu:     (detected by 96, t=5264 jiffies, g=3785, q=750 
ncpus=256)
[  164.146969] rcu: rcu_sched kthread timer wakeup didn't happen for 
5265 jiffies! g3785 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  164.169019] rcu:     Possible timer handling issue on cpu=2 
timer-softirq=330
[  164.182728] rcu: rcu_sched kthread starved for 5276 jiffies! g3785 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
[  164.203055] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  164.220922] rcu: RCU grace-period kthread stack dump:
[  164.231039] rcu: Stack dump where RCU GP kthread last ran:
[  185.248867] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  185.260468] rcu:     21-...!: (0 ticks this GP) idle=36c8/0/0x0 
softirq=154/154 fqs=0 (false positive?)
[  185.278684] rcu:     114-...!: (429 GPs behind) idle=8f68/0/0x0 
softirq=126/126 fqs=0 (false positive?)
[  185.296922] rcu:     (detected by 116, t=5264 jiffies, g=3789, q=760 
ncpus=256)
[  185.311140] rcu: rcu_sched kthread timer wakeup didn't happen for 
5265 jiffies! g3789 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  185.333205] rcu:     Possible timer handling issue on cpu=2 
timer-softirq=330
[  185.346918] rcu: rcu_sched kthread starved for 5276 jiffies! g3789 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
[  185.367224] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  185.385113] rcu: RCU grace-period kthread stack dump:
[  185.395229] rcu: Stack dump where RCU GP kthread last ran:
   OK  ] Reached target network-online.target - Network is Online.
[  OK  ] Started anacron.service - Run anacron jobs.
[  OK  ] Started cups-browsed.service - Make remote CUPS printers 
available locally.
          Starting exim4.service - exim Mail Transport Agent...
          Starting xrdp.service - xrdp daemon...
[  OK  ] Finished user-runtime-dir@1000.service - User Runtime Directory 
/run/user/1000.
[  OK  ] Started xrdp.service - xrdp daemon.
[  OK  ] Started serial-getty@ttyHV0.service - Serial Getty on ttyHV0.
          Starting user@1000.service - User Manager for UID 1000...
[  OK  ] Started exim4.service - exim Mail Transport Agent.
[  OK  ] Reached target multi-user.target - Multi-User System.
[  OK  ] Reached target graphical.target - Graphical Interface.
[  OK  ] Started user@1000.service - User Manager for UID 1000.
[FAILED] Failed to start session-1.scope - Session 1 of User tonyr.
See 'systemctl status session-1.scope' for details.
[  206.412865] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  206.424477] rcu:     114-...!: (430 GPs behind) idle=97b0/0/0x0 
softirq=126/126 fqs=0 (false positive?)
[  206.442691] rcu:     (detected by 123, t=5259 jiffies, g=3793, q=5473 
ncpus=256)
[  206.457056] rcu: rcu_sched kthread timer wakeup didn't happen for 
5261 jiffies! g3793 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  206.479157] rcu:     Possible timer handling issue on cpu=2 
timer-softirq=330
[  206.492868] rcu: rcu_sched kthread starved for 5271 jiffies! g3793 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
[  206.513173] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  206.531061] rcu: RCU grace-period kthread stack dump:
[  206.541202] rcu: Stack dump where RCU GP kthread last ran:

NOTE(S): Unable to login and random hangs during system startup. Same 
messages/issues as S7-2.

>
> On 5/8/26 12:50 AM, Thorsten Leemhuis wrote:
>> [+tglx so he knows about it; details about the problem that Tony faces
>> can be found in https://github.com/sparclinux/issues/issues/79 ]
>>
>> On 5/8/26 09:38, Tony Rodriguez wrote:
>>> I still don't believe this is fixed upstream as of v7.03 and v7.1-rc1,
>> Yes and no. It looks like d6e152d905bdb1 ("clockevents: Prevent timer
>> interrupt starvation") causes two regression.
>>
>> Thomas fixed one with 4096fd0e8eaea1 ("clockevents: Add missing resets
>> of the next_event_forced flag") -- and feedback shows that it definitely
>> solved the problem for quite a few people. If that's not the case for
>> you, then you seem to face a different problem caused by the same
>> change. Happens, that's life sometimes.
>>
>> Ciao, Thorsten
>>
>>> only when my patch is applied does the SPARC64 S7-2 system become 
>>> stable
>>> again. I also tested my patch with v7.0.4 and it works their as well.
>>> Will perform additional tests without my fix against v7.0.4 and 
>>> v7.1-rc2
>>> later today to revalidate the regression (USA Pacific time).
>>>
>>> Tony Rodriguez
>>> www.linkedin.com/in/unixpro1970
>>>
>>>> On May 7, 2026, at 11:33 PM, Thorsten Leemhuis <linux@leemhuis.info>
>>>> wrote:
>>>>
>>>> On 5/8/26 07:51, John Paul Adrian Glaubitz wrote:
>>>>> On Thu, 2026-04-23 at 18:30 +0200, Thorsten Leemhuis wrote:
>>>>>> FWIW, here is the rough timeline of the regression, just to be 
>>>>>> sure we
>>>>>> are all on the same page:
>>>>>>
>>>>>> * The regression I'm talking about is caused by d6e152d905bdb1
>>>>>> ("clockevents: Prevent timer interrupt starvation") [authored:
>>>>>> 2026-04-07 10:54:17; committed: 2026-04-10 22:45:38; next arrival:
>>>>>> next-20260413; merged: 2026-04-12 19:01:55; v7.0 (2026-04-12 
>>>>>> 22:48:06)]
>>>>> [...]
>>>>> Tony Rodriguez from the SPARC community has observed the regression
>>>>> on SPARC as well
>>>>> and proposed a fix to address it [1]. Not sure whether he has
>>>>> retested on the latest
>>>>> commit of Linus' tree yet.
>>>>>
>>>>> Tony, can you verify that 4096fd0e8eaea1 fixes the issue for you?
>>>>>
>>>>>> [1] https://github.com/sparclinux/issues/issues/79
>>>> It's likely a different regressions, as that report's title says that
>>>> v7.0.1, v7.0.2, v7.0.3, and v7.1‑rc1 are affected, which all 
>>>> contain the
>>>> fix, aka 4096fd0e8eaea1. Reporting in a new thread is likely best, as
>>>> the authors of the culprit are not even CCed here.
>>>>
>>>> Ciao, Thorsten

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: the stuttering regression in 7.0: should I have done something different?
  2026-05-08 20:15         ` Tony Rodriguez
  2026-05-08 20:21           ` Tony Rodriguez
@ 2026-05-10 21:29           ` Thomas Gleixner
  2026-05-11  3:13             ` Tony Rodriguez
  1 sibling, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2026-05-10 21:29 UTC (permalink / raw)
  To: Tony Rodriguez, Thorsten Leemhuis
  Cc: John Paul Adrian Glaubitz, Greg KH, Linus Torvalds,
	Linux kernel regressions list, LKML

On Fri, May 08 2026 at 13:15, Tony Rodriguez wrote:
> Just confirmed on my end today.  This regression also impacts both 
> SPARC64 S7-2 and SPARC64 T7-1 on v7.0.4 and v7.1-rc2 as well. Different 
> systems using the same exact kernels.
>
> ** Please see points (A1) (A2) (B1) (B2)
>
> Once again, I am not experiencing such issues when "my patch" (link 
> below) is added to address this regression.
>
> https://github.com/sparclinux/issues/issues/79#issuecomment-4362173884

Github issues are really not helpful.

> PS - On May 2nd 2026 at 9:42 PM: I also sent an email to Thomas Gleixner 
> regarding this issue.  Will be happy to validate any patches from your 
> end regarding this issue, as time permits me to do so.

Sorry, that mail got lost as it was in reply to a random other archived
thread which has absolutely nothing to do with the problem at hand.

I just looked at your github thing. Despite your changelog claiming
otherwise your "fix" breaks the DoS protection completely. It's a
polished version of a revert.

It also lacks a proper root cause analysis. This list:

    - skipped programming events when delta <= min_delta_ns
    - changed force semantics for overdue events
    - introduced a sticky next_event_forced state
    - returned success even when no event was programmed

does not qualify and is actually wrong.

The code does not unconditionally skip the programming of events when
delta <= min_delta_ns. It only does so conditionally when the previous
force programmed min_delta_ns event has not been delivered to the kernel
yet, i.e. dev->next_event_forced is still set.

That flag is only set when the minimal value has been successfully
programmed and it _is_ cleared on the next timer interrupt, which should
obviously happen due to this minimal delta programming. It is also
cleared when a new event > min_delta_ns is successfully programmed
_before_ the previous one was delivered.

IOW, the core code programmed the hardware with the min_delta_ns
(min_delta_ticks) timeout and the SPARC clockevents driver returned
success (0). Now the core code refuses to do further reprogramming with
the min_delta_ns timeout as that would shift the expiry (interrupt)
further out until the interrupt actually is delivered or some other
event which is not below the min_delta_ns threshold is programmed.

So let's assume that this logic is causing the problem, then the only
explanation for the observed behaviour is that the expected interrupt
due to a forced min_delta_ns programming is never delivered. 

That made me look into the SPARC specific set_next_event() functions. I
don't know which variant your machines are using, but all of them have
the same underlying problem. The interrupt is based on a equal
comparator, so the programming logic for each of the tick variants is:

$variant_add_compare(delta)
{
      cmp = read_timer() + delta;
      write_comparator(cmp);
      now = read_timer();
      return (now - cmp) > 0;
}

and the actual set_next_event() function which is invoked from the core
code does:

   return tick_operations.add_compare(delta) ? -ETIME : 0;

IOW, when the timer read _after_ writing the comparator value is ahead
of the comparator value the operation failed. Looks about right in
theory.

But then there is the reality of hardware which ruins everything. I've
banged my head against the wall many years ago when debugging a similar
issue with the x86 HPET which has the same hardware design failure of
using a compare equal comparator instead of having a compare less than
equal one. See the lengthy comment in hpet_clkevt_set_next_event() for
further information.

Can you apply the debug patch below, which will disable tracing once it
hits the hung task detector and then retrieve the trace?

If that's not possible as the system is unresponsive, then please add
'ftrace_dump_on_oops' on the kernel command line or enable it after boot
in /proc/sys/kernel and let the kernel panic when it hits the hung task
detector fail.

Thanks,

        tglx
---
--- a/arch/sparc/kernel/time_64.c
+++ b/arch/sparc/kernel/time_64.c
@@ -732,8 +732,10 @@ void __irq_entry timer_interrupt(int irq
 	if (unlikely(!evt->event_handler)) {
 		printk(KERN_WARNING
 		       "Spurious SPARC64 timer interrupt on cpu %d\n", cpu);
-	} else
+	} else {
+		trace_printk("Invoking handler %pS\n", evt->event_handler);
 		evt->event_handler(evt);
+	}

 	irq_exit();

--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -248,6 +248,7 @@ static void hung_task_info(struct task_s
 	 * accordingly
 	 */
 	if (sysctl_hung_task_warnings || hung_task_call_panic) {
+		tracing_off();
 		if (sysctl_hung_task_warnings > 0)
 			sysctl_hung_task_warnings--;
 		pr_err("INFO: task %s:%d blocked%s for more than %ld seconds.\n",
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -370,18 +370,22 @@ int clockevents_program_event(struct clo
 		delta = min(delta, (int64_t) dev->max_delta_ns);
 		cycles = ((u64)delta * dev->mult) >> dev->shift;
 		if (!dev->set_next_event((unsigned long) cycles, dev)) {
+			trace_printk("Successfully programmed %lld %lld\n", expires, delta);
 			dev->next_event_forced = 0;
 			return 0;
 		}
 	}

-	if (dev->next_event_forced)
+	if (dev->next_event_forced) {
+		trace_printk("Skipping %lld %lld\n", expires, delta);
 		return 0;
+	}

 	if (dev->set_next_event(dev->min_delta_ticks, dev)) {
 		if (!force || clockevents_program_min_delta(dev))
 			return -ETIME;
 	}
+	trace_printk("Force programmed min delta %lld %lld\n", expires, delta);
 	dev->next_event_forced = 1;
 	return 0;
 }

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: the stuttering regression in 7.0: should I have done something different?
  2026-05-10 21:29           ` Thomas Gleixner
@ 2026-05-11  3:13             ` Tony Rodriguez
  2026-05-12  5:03               ` the stuttering regression in 7.0: should I have done something different Tony Rodriguez
  0 siblings, 1 reply; 19+ messages in thread
From: Tony Rodriguez @ 2026-05-11  3:13 UTC (permalink / raw)
  To: Thomas Gleixner, Thorsten Leemhuis
  Cc: John Paul Adrian Glaubitz, Greg KH, Linus Torvalds,
	Linux kernel regressions list, LKML

Hi Thomas,

Thank you for the detailed analysis — this helps clarify the situation 
on the SPARC side.

You are correct that my earlier explanation focused too much on the core 
changes and not enough on the SPARC clockevents behaviour under the new 
forced-min-delta semantics. However, having a stable system is equally 
important and is the main reason that I developed the test patch (to 
help). However, your explanation makes sense.

I will apply your debug patch and capture the trace as requested.  If 
the system becomes unresponsive, I will enable ftrace_dump_on_oops so 
the trace is emitted when the hung task detector triggers.

Once I have the trace, I’ll send another email.

Thanks again for the guidance — I’ll follow up with trace results 
sometime tomorrow.

Tony

On 5/10/26 2:29 PM, Thomas Gleixner wrote:
> On Fri, May 08 2026 at 13:15, Tony Rodriguez wrote:
>> Just confirmed on my end today.  This regression also impacts both
>> SPARC64 S7-2 and SPARC64 T7-1 on v7.0.4 and v7.1-rc2 as well. Different
>> systems using the same exact kernels.
>>
>> ** Please see points (A1) (A2) (B1) (B2)
>>
>> Once again, I am not experiencing such issues when "my patch" (link
>> below) is added to address this regression.
>>
>> https://github.com/sparclinux/issues/issues/79#issuecomment-4362173884
> Github issues are really not helpful.
>
>> PS - On May 2nd 2026 at 9:42 PM: I also sent an email to Thomas Gleixner
>> regarding this issue.  Will be happy to validate any patches from your
>> end regarding this issue, as time permits me to do so.
> Sorry, that mail got lost as it was in reply to a random other archived
> thread which has absolutely nothing to do with the problem at hand.
>
> I just looked at your github thing. Despite your changelog claiming
> otherwise your "fix" breaks the DoS protection completely. It's a
> polished version of a revert.
>
> It also lacks a proper root cause analysis. This list:
>
>      - skipped programming events when delta <= min_delta_ns
>      - changed force semantics for overdue events
>      - introduced a sticky next_event_forced state
>      - returned success even when no event was programmed
>
> does not qualify and is actually wrong.
>
> The code does not unconditionally skip the programming of events when
> delta <= min_delta_ns. It only does so conditionally when the previous
> force programmed min_delta_ns event has not been delivered to the kernel
> yet, i.e. dev->next_event_forced is still set.
>
> That flag is only set when the minimal value has been successfully
> programmed and it _is_ cleared on the next timer interrupt, which should
> obviously happen due to this minimal delta programming. It is also
> cleared when a new event > min_delta_ns is successfully programmed
> _before_ the previous one was delivered.
>
> IOW, the core code programmed the hardware with the min_delta_ns
> (min_delta_ticks) timeout and the SPARC clockevents driver returned
> success (0). Now the core code refuses to do further reprogramming with
> the min_delta_ns timeout as that would shift the expiry (interrupt)
> further out until the interrupt actually is delivered or some other
> event which is not below the min_delta_ns threshold is programmed.
>
> So let's assume that this logic is causing the problem, then the only
> explanation for the observed behaviour is that the expected interrupt
> due to a forced min_delta_ns programming is never delivered.
>
> That made me look into the SPARC specific set_next_event() functions. I
> don't know which variant your machines are using, but all of them have
> the same underlying problem. The interrupt is based on a equal
> comparator, so the programming logic for each of the tick variants is:
>
> $variant_add_compare(delta)
> {
>        cmp = read_timer() + delta;
>        write_comparator(cmp);
>        now = read_timer();
>        return (now - cmp) > 0;
> }
>
> and the actual set_next_event() function which is invoked from the core
> code does:
>
>     return tick_operations.add_compare(delta) ? -ETIME : 0;
>
> IOW, when the timer read _after_ writing the comparator value is ahead
> of the comparator value the operation failed. Looks about right in
> theory.
>
> But then there is the reality of hardware which ruins everything. I've
> banged my head against the wall many years ago when debugging a similar
> issue with the x86 HPET which has the same hardware design failure of
> using a compare equal comparator instead of having a compare less than
> equal one. See the lengthy comment in hpet_clkevt_set_next_event() for
> further information.
>
> Can you apply the debug patch below, which will disable tracing once it
> hits the hung task detector and then retrieve the trace?
>
> If that's not possible as the system is unresponsive, then please add
> 'ftrace_dump_on_oops' on the kernel command line or enable it after boot
> in /proc/sys/kernel and let the kernel panic when it hits the hung task
> detector fail.
>
> Thanks,
>
>          tglx
> ---
> --- a/arch/sparc/kernel/time_64.c
> +++ b/arch/sparc/kernel/time_64.c
> @@ -732,8 +732,10 @@ void __irq_entry timer_interrupt(int irq
>   	if (unlikely(!evt->event_handler)) {
>   		printk(KERN_WARNING
>   		       "Spurious SPARC64 timer interrupt on cpu %d\n", cpu);
> -	} else
> +	} else {
> +		trace_printk("Invoking handler %pS\n", evt->event_handler);
>   		evt->event_handler(evt);
> +	}
>   
>   	irq_exit();
>   
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -248,6 +248,7 @@ static void hung_task_info(struct task_s
>   	 * accordingly
>   	 */
>   	if (sysctl_hung_task_warnings || hung_task_call_panic) {
> +		tracing_off();
>   		if (sysctl_hung_task_warnings > 0)
>   			sysctl_hung_task_warnings--;
>   		pr_err("INFO: task %s:%d blocked%s for more than %ld seconds.\n",
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -370,18 +370,22 @@ int clockevents_program_event(struct clo
>   		delta = min(delta, (int64_t) dev->max_delta_ns);
>   		cycles = ((u64)delta * dev->mult) >> dev->shift;
>   		if (!dev->set_next_event((unsigned long) cycles, dev)) {
> +			trace_printk("Successfully programmed %lld %lld\n", expires, delta);
>   			dev->next_event_forced = 0;
>   			return 0;
>   		}
>   	}
>   
> -	if (dev->next_event_forced)
> +	if (dev->next_event_forced) {
> +		trace_printk("Skipping %lld %lld\n", expires, delta);
>   		return 0;
> +	}
>   
>   	if (dev->set_next_event(dev->min_delta_ticks, dev)) {
>   		if (!force || clockevents_program_min_delta(dev))
>   			return -ETIME;
>   	}
> +	trace_printk("Force programmed min delta %lld %lld\n", expires, delta);
>   	dev->next_event_forced = 1;
>   	return 0;
>   }

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: the stuttering regression in 7.0: should I have done something different
  2026-05-11  3:13             ` Tony Rodriguez
@ 2026-05-12  5:03               ` Tony Rodriguez
  2026-05-12  8:17                 ` Thomas Gleixner
  0 siblings, 1 reply; 19+ messages in thread
From: Tony Rodriguez @ 2026-05-12  5:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linux kernel regressions list, LKML, sparclinux,
	John Paul Adrian Glaubitz, Thorsten Leemhuis

  On 5/10/26 02:29 PM,  Thomas Gleixner wrote:

> Can you apply the debug patch below, which will disable tracing once it
> hits the hung task detector and then retrieve the trace? 
As requested, I applied your debug patch to v7.1‑rc3 and captured the 
trace output.
On the SPARC64 S7‑2 system the machine becomes unresponsive and produces 
many thousands of
lines of trace data that do not appear to terminate. Posting the full 
output inline
or as an attachment may be impractical, so I’ve included the key 
sections below.

If you prefer the complete trace, please let me know the best way to 
provide it. Guessing the kernel

mailing isn't best to attach that?


A) Output with avahi-daemon and avahi-utils installed
(This shows avahi and systemd activity before the hang.)

Note: The system is using the standard Debian-provided avahi and systemd 
packages.
No custom scripts or modifications are in place.

BOOT_IMAGE=/boot/vmlinuz-7.1.0-rc3-test01 
root=UUID=ce937a4b-126a-41bd-a54b-03a424421086 ro console=ttyHV0,9600n81 
systemd.log_level=info systemd.show_status=1 
systemd.journald.forward_to_console=0 plymouth.enable=0 ignore_loglevel 
loglevel=8 ftrace_dump_on_oops=1 hung_task_panic=1

[    1.206192] printk: log_buf_len individual max cpu contribution: 4096 
bytes
[    1.219999] printk: log_buf_len total cpu_extra contributions: 520192 
bytes
[    1.233883] printk: log_buf_len min size: 131072 bytes
[    1.249357] printk: log buffer data + meta data: 1048576 + 4456448 = 
5505024 bytes
[    1.264204] printk: early log buf free: 126896(96%)
[    1.328220] Dentry cache hash table entries: 8388608 (order: 13, 
67108864 bytes, linear)
[    1.371366] Inode-cache hash table entries: 4194304 (order: 12, 
33554432 bytes, linear)
[    1.387117] Sorting __ex_table...
[    1.394073] Built 1 zonelists, mobility grouping on.  Total pages: 
16545911
[    1.407711] mem auto-init: stack:all(zero), heap alloc:on, heap free:off
[    1.434383] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=128, Nodes=1
[    1.467254] ftrace: allocating 36740 entries in 72 pages
[    1.477580] ftrace: allocated 72 pages with 2 groups
[    1.487945]
     1.516665] **     **
[    1.529690] ** trace_printk() being used. Allocating extra memory.  **
[    1.542707] **     **
[    1.555727] ** This means that this is a DEBUG kernel and it is     **
[    1.487945]
[    1.490630] **********************************************************
[    1.503647] **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
[    1.516665] **     **
[    1.529690] ** trace_printk() being used. Allocating extra memory.  **
[    1.542707] **     **
[    1.555727] ** This means that this is a DEBUG kernel and it is     **
[    1.568743] ** unsafe for production use.      **
[    1.581778] **     **
[    1.594794] ** If you see this message and you are not debugging    **
[    1.607811] ** the kernel, report this immediately to your vendor!  **
[    1.620828] **     **
[    1.633847] **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
[    1.646880] **********************************************************
[  OK  ] Started avahi-daemon.service - Avahi mDNS/DNS-SD Stack.
[  248.416424] INFO: task systemd:1 blocked for more than 120 seconds.
[  248.428721]       Not tainted 7.1.0-rc3-test01 #2
[  248.438087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  248.453706] task:systemd         state:D stack:20312 pid:1  tgid:1    
  ppid:0      task_flags:0x400100 flags:0x208000101000000
[  248.476970] Call Trace:
[  248.481829] [<0000000000fd993c>] schedule+0x1c/0x180
[  248.491728] [<0000000000fe1a10>] schedule_timeout+0x90/0x100
[  248.503017] [<0000000000fda5b0>] __wait_for_common+0xb0/0x180
[  248.514472] [<0000000000fda7a0>] wait_for_completion_state+0x20/0x60
[  248.527151] [<0000000000527234>] __wait_rcu_gp+0x114/0x1e0
[  248.538089] [<000000000052d9c8>] synchronize_rcu_normal.part.0+0x48/0x60
[  248.551452] [<000000000052f860>] synchronize_rcu_normal+0xc0/0xe0
[  248.563605] [<0000000000532820>] synchronize_rcu+0xe0/0x140
[  248.574720] [<00000000005285c4>] rcu_sync_enter+0x44/0x140
[  248.585650] [<0000000000fdf114>] percpu_down_write+0x14/0x240
[  248.597105] [<000000000057ef20>] cgroup_procs_write_start+0x1c0/0x240
[  248.609956] [<000000000057f8d0>] __cgroup_procs_write+0x30/0x1c0
[  248.621939] [<000000000057fab4>] cgroup_procs_write+0x14/0x40
[  248.633393] [<0000000000577910>] cgroup_file_write+0x90/0x160
[  248.644851] [<00000000008b536c>] kernfs_fop_write_iter+0x14c/0x240
[  248.657182] [<00000000007e7210>] vfs_write+0x210/0x460
[  248.667496] INFO: task (systemd-hostn):1968 blocked for more than 120 
seconds.
[  248.681833]       Not tainted 7.1.0-rc3-test01 #2
[  248.691196] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  248.706831] task:(systemd-hostn) state:D stack:25352 pid:1968 
tgid:1968  ppid:1      task_flags:0x400100 flags:0x408000102000000
[  248.730091] Call Trace:
[  248.734952] [<0000000000fd993c>] schedule+0x1c/0x180
[  248.744843] [<0000000000fd9b2c>] schedule_preempt_disabled+0xc/0x20
[  248.757352] [<0000000000fdca0c>] __mutex_lock.constprop.0+0x58c/0xf00
[  248.770184] [<0000000000fdd490>] __mutex_lock_slowpath+0x10/0x20
[  248.782166] [<0000000000fdd4d0>] mutex_lock+0x30/0x40
[  248.792243] [<000000000057f34c>] cgroup_kn_lock_live+0x4c/0x120
[  248.804040] [<000000000057f8b8>] __cgroup_procs_write+0x18/0x1c0
[  248.816021] [<000000000057fab4>] cgroup_procs_write+0x14/0x40
[  248.827476] [<0000000000577910>] cgroup_file_write+0x90/0x160
[  248.838946] [<00000000008b536c>] kernfs_fop_write_iter+0x14c/0x240
[  248.851269] [<00000000007e7210>] vfs_write+0x210/0x460
[  248.861500] [<00000000007e75d0>] ksys_write+0x50/0xe0
[  248.871579] [<00000000007e7674>] sys_write+0x14/0x40
[  248.881469] [<00000000004062b4>] linux_sparc_syscall+0x34/0x44
[  248.893098] INFO: task (systemd-hostn):1968 is blocked on a mutex 
likely owned by task systemd:1.
[  248.910809] Kernel panic - not syncing: hung_task: blocked tasks
[  248.922783] CPU: 48 UID: 0 PID: 677 Comm: khungtaskd Not tainted 
7.1.0-rc3-test01 #2 VOLUNTARY
[  248.940151] Call Trace:
[  248.945010] [<0000000000436fcc>] dump_stack+0x8/0x18
[  248.954902] [<00000000004293a4>] vpanic+0xfc/0x33c
[  248.964453] [<0000000000429608>] panic+0x24/0x30
[  248.973648] [<00000000005a8158>] watchdog+0x238/0x840
[  248.983725] [<00000000004af254>] kthread+0x114/0x160
[  248.993622] [<00000000004060f0>] ret_from_fork+0x24/0x34
[  249.004209] [<0000000000000000>] 0x0
[  249.019116] Dumping ftrace buffer:
[  249.025666] ---------------------------------
[  249.034534]   <idle>-0         0d.... 1836659us : 
clockevents_program_event: Successfully programmed 4000000 4000000
[  249.055418]   <idle>-0         0d.h.. 1845926us : timer_interrupt: 
Invoking handler tick_handle_periodic+0x0/0xa0
[  249.075895]   <idle>-0         0d.h.. 1845938us : 
clockevents_program_event: Successfully programmed 8000000 4000000
[  249.096899]   <idle>-0         0d.h.. 1849938us : timer_interrupt: 
Invoking handler tick_handle_periodic+0x0/0xa0
[  249.117390]   <idle>-0         0d.h.. 1849940us : 
clockevents_program_event: Successfully programmed 12000000 4000000
[  249.138563]   <idle>-0         0d.h.. 1853940us : timer_interrupt: 
Invoking handler tick_handle_periodic+0x0/0xa0
[  249.159053]   <idle>-0         0d.h.. 1853942us : 
clockevents_program_event: Successfully programmed 16000000 4000000
[  249.180226]   <idle>-0         0d.h.. 1857942us : timer_interrupt: 
Invoking handler tick_handle_periodic+0x0/0xa0
[  249.200718]   <idle>-0         0d.h.. 1857943us : 
clockevents_program_event: Successfully programmed 20000000 40000

B) This is without avahi-daemon and avahi-utils  installed.  Just to 
rule out a possible confiict with avahi.

BOOT_IMAGE=/boot/vmlinuz-7.1.0-rc3-test01 
root=UUID=ce937a4b-126a-41bd-a54b-03a424421086 ro console=ttyHV0,9600n81 
systemd.log_level=info systemd.show_status=1 
systemd.journald.forward_to_console=0 plymouth.enable=0 ignore_loglevel 
loglevel=8 ftrace_dump_on_oops=1 hung_task_panic=1

[  OK  ] Reached target graphical.target - Graphical Interface.
[  310.338420] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  310.350060] rcu:     3-...!: (259 GPs behind) 
idle=bb6c/1/0x4000000000000000 softirq=1081/1081 fqs=0
[  310.367759] rcu:     27-...!: (313 GPs behind) 
idle=55cc/1/0x4000000000000000 softirq=284/286 fqs=0
[  310.385298] rcu:     34-...!: (261 GPs behind) 
idle=3a64/1/0x4000000000000000 softirq=524/524 fqs=0
[  310.402834] rcu:     46-...!: (261 GPs behind) 
idle=743c/1/0x4000000000000000 softirq=258/259 fqs=0
[  310.420366] rcu:     (detected by 73, t=5275 jiffies, g=5933, q=80255 
ncpus=128)
[  310.434745]   CPU[  3]: TSTATE[00000000f0001206] 
TPC[00000000010d8694] TNPC[00000000010d8698] TASK[cc1:14002]
[  310.434759]              TPC[10d8694] O7[8f32fc] I7[89d438] RPC[8a4bac]
[  310.434866]   CPU[ 27]: TSTATE[00000099f0001202] 
TPC[00000000010f09f0] TNPC[00000000008dc138] TASK[cc1:13964]
[  310.434875]              TPC[10f09f0] O7[8dc130] I7[8dcea0] RPC[8dc90c]
[  310.434965]   CPU[ 34]: TSTATE[00000044f0001202] 
TPC[0000000000b687d8] TNPC[0000000000b687dc] TASK[cc1:13408]
[  310.434973]              TPC[b687d8] O7[b68754] I7[b668c8] RPC[8af5e0]
[  310.435065]   CPU[ 46]: TSTATE[00000044f0001206] 
TPC[00000000008cf138] TNPC[00000000008cf13c] TASK[cc1:13823]
[  310.435073]              TPC[8cf138] O7[8cf0dc] I7[8ce048] RPC[11152c0]
[  310.435103] rcu: rcu_sched kthread timer wakeup didn't happen for 
5274 jiffies! g5933 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  310.588885] rcu:     Possible timer handling issue on cpu=66 
timer-softirq=248
[  310.602770] rcu: rcu_sched kthread starved for 5320 jiffies! g5933 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=66
[  310.623249] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  310.641135] rcu: RCU grace-period kthread stack dump:
[  310.651207] task:rcu_sched       state:I stack:26936 pid:15 tgid:15  
   ppid:2      task_flags:0x208040 flags:0x07000000
[  310.673256] Call Trace:
[  310.678113] [<0000000000fd993c>] schedule+0x1c/0x180
[  310.688017] [<0000000000fe19f0>] schedule_timeout+0x70/0x100
[  310.699294] [<0000000000530064>] rcu_gp_fqs_loop+0x104/0x4e0
[  310.710578] [<0000000000535474>] rcu_gp_kthread+0x134/0x180
[  310.721686] [<00000000004af254>] kthread+0x114/0x160
[  310.731588] [<00000000004060f0>] ret_from_fork+0x24/0x34
[  310.742169] [<0000000000000000>] 0x0
[  310.749288] rcu: Stack dump where RCU GP kthread last ran:
[  310.760431]   CPU[ 66]: TSTATE[00000044f0001204] 
TPC[0000000000988ccc] TNPC[0000000000988cd0] TASK[cc1:13619]
[  310.780027]              TPC[988ccc] O7[cc5ab8] I7[989064] RPC[98901c]
[  373.795586] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  373.807192] rcu:     3-...!: (259 GPs behind) idle=cef8/0/0x0 
softirq=1081/1081 fqs=1 (false positive?)
[  373.825415] rcu:     27-...!: (313 GPs behind) idle=5d60/0/0x0 
softirq=284/286 fqs=1 (false positive?)
[  373.843469] rcu:     34-...!: (261 GPs behind) idle=4088/0/0x0 
softirq=524/524 fqs=1 (false positive?)
[  373.861522] rcu:     46-...!: (261 GPs behind) idle=7a78/0/0x0 
softirq=258/259 fqs=1 (false positive?)
[  373.879591] rcu:     (detected by 98, t=21140 jiffies, g=5933, 
q=98416 ncpus=128)
[  373.894056]   CPU[  3]: TSTATE[0000004411081605] 
TPC[000000000043d524] TNPC[000000000043d528] TASK[swapper/3:0]
[  373.894061]              TPC[arch_cpu_idle+0x84/0xc0] 
O7[arch_cpu_idle+0x70/0xc0] I7[default_idle_call+0x30/0x160] 
RPC[do_idle+0x104/0x1e0]
RPC[do_idle+0x104/0x1e0]
[  373.894138]   CPU[ 27]: TSTATE[0000004411081605] 
TPC[000000000043d524] TNPC[000000000043d528] TASK[swapper/27:0]
[  373.894142]              TPC[arch_cpu_idle+0x84/0xc0] 
O7[arch_cpu_idle+0x70/0xc0] I7[default_idle_call+0x30/0x160] 
RPC[do_idle+0x104/0x1e0]
[  373.894209]   CPU[ 34]: TSTATE[0000004411081605] 
TPC[000000000043d524] TNPC[000000000043d528] TASK[swapper/34:0]
[  373.894214]              TPC[arch_cpu_idle+0x84/0xc0] 
O7[arch_cpu_idle+0x70/0xc0] I7[default_idle_call+0x30/0x160] 
RPC[do_idle+0x104/0x1e0]
[  373.894280]   CPU[ 46]: TSTATE[0000004411081605] 
TPC[000000000043d524] TNPC[000000000043d528] TASK[swapper/46:0]
[  373.894284]              TPC[arch_cpu_idle+0x84/0xc0] 
O7[arch_cpu_idle+0x70/0xc0] I7[default_idle_call+0x30/0x160] 
RPC[do_idle+0x104/0x1e0]
[  373.894303] rcu: rcu_sched kthread timer wakeup didn't happen for 
15744 jiffies! g5933 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  374.097545] rcu:     Possible timer handling issue on cpu=66 
timer-softirq=248
[  374.111434] rcu: rcu_sched kthread starved for 15800 jiffies! g5933 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=66
[  374.132103] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  374.149976] rcu: RCU grace-period kthread stack dump:
[  374.160042] task:rcu_sched       state:I stack:26936 pid:15 tgid:15  
   ppid:2      task_flags:0x208040 flags:0x07000000
[  374.182088] Call Trace:
[  374.186950] [<0000000000fd993c>] schedule+0x1c/0x180

  [  374.196844] [<0000000000fe19f0>] schedule_timeout+0x70/0x100
[  374.208132] [<0000000000530064>] rcu_gp_fqs_loop+0x104/0x4e0
[  374.219420] [<0000000000535474>] rcu_gp_kthread+0x134/0x180
[  374.230538] [<00000000004af254>] kthread+0x114/0x160
[  374.240433] [<00000000004060f0>] ret_from_fork+0x24/0x34
[  374.251009] [<0000000000000000>] 0x0
[  374.258131] rcu: Stack dump where RCU GP kthread last ran:
[  374.269126]   CPU[ 66]: TSTATE[0000004411081605] 
TPC[000000000043d524] TNPC[000000000043d528] TASK[swapper/66:0]
[  374.289376]              TPC[arch_cpu_idle+0x84/0xc0] 
O7[arch_cpu_idle+0x70/0xc0] I7[default_idle_call+0x30/0x160] 
RPC[do_idle+0x104/0x1e0]
[  395.314730] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  395.326335] rcu:     3-...!: (260 GPs behind) idle=d8c8/0/0x0 
softirq=1081/1081 fqs=0 (false positive?)
[  395.344552] rcu:     27-...!: (314 GPs behind) idle=5da8/0/0x0 
softirq=284/286 fqs=0 (false positive?)
[  395.362602] rcu:     34-...!: (262 GPs behind) idle=4670/0/0x0 
softirq=524/524 fqs=0 (false positive?)
[  395.380667] rcu:     46-...!: (262 GPs behind) idle=7b00/0/0x0 
softirq=258/259 fqs=0 (false positive?)
[  395.398729] rcu:     (detected by 54, t=5275 jiffies, g=5937, q=96560 
ncpus=128)
[  395.413016]   CPU[  3]: TSTATE[0000004411081605] 
TPC[000000000043d524] TNPC[000000000043d528] TASK[swapper/3:0]
[  395.413022]              TPC[arch_cpu_idle+0x84/0xc0] 
O7[arch_cpu_idle+0x70/0xc0] I7[default_idle_call+0x30/0x160] 
RPC[do_idle+0x104/0x1e0]
[  395.413106]   CPU[ 27]: TSTATE[0000004411081605] 
TPC[000000000043d524] TNPC[000000000043d528] TASK[swapper/27:0]
[  395.413111]              TPC[arch_cpu_idle+0x84/0xc0] 
O7[arch_cpu_idle+0x70/0xc0] I7[default_idle_call+0x30/0x160] 
RPC[do_idle+0x104/0x1e0]
[  395.413178]   CPU[ 34]: TSTATE[0000004411081605] 
TPC[000000000043d524] TNPC[000000000043d528] TASK[swapper/34:0]
[  395.413183]              TPC[arch_cpu_idle+0x84/0xc0] 
O7[arch_cpu_idle+0x70/0xc0] I7[default_idle_call+0x30/0x160] 
RPC[do_idle+0x104/0x1e0]
[  395.413249]   CPU[ 46]: TSTATE[0000004411081605] 
TPC[000000000043d524] TNPC[000000000043d528] TASK[swapper/46:0]
[  395.413254]              TPC[arch_cpu_idle+0x84/0xc0] 
O7[arch_cpu_idle+0x70/0xc0] I7[default_idle_call+0x30/0x160] 
RPC[do_idle+0x104/0x1e0]
[  395.413273] rcu: rcu_sched kthread timer wakeup didn't happen for 
5274 jiffies! g5937 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[  395.616352] rcu:     Possible timer handling issue on cpu=66 
timer-softirq=248
[  395.630237] rcu: rcu_sched kthread starved for 5330 jiffies! g5937 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=66
[  395.650718] rcu:     Unless rcu_sched kthread gets sufficient CPU 
time, OOM is now expected behavior.
[  395.668597] rcu: RCU grace-period kthread stack dump:
[  395.678674] task:rcu_sched       state:I stack:26936 pid:15 tgid:15  
   ppid:2      task_flags:0x208040 flags:0x07000000
[  395.700722] Call Trace:
[  395.705582] [<0000000000fd993c>] schedule+0x1c/0x180
[  395.715474] [<0000000000fe19f0>] schedule_timeout+0x70/0x100
[  395.726758] [<0000000000530064>] rcu_gp_fqs_loop+0x104/0x4e0
[  395.738043] [<0000000000535474>] rcu_gp_kthread+0x134/0x180
[  395.749158] [<00000000004af254>] kthread+0x114/0x160
[  395.759054] [<00000000004060f0>] ret_from_fork+0x24/0x34
[  395.769643] [<0000000000000000>] 0x0
[  395.776763] rcu: Stack dump where RCU GP kthread last ran:
[  395.787741]   CPU[ 66]: TSTATE[0000004411081605] 
TPC[000000000043d524] TNPC[000000000043d528] TASK[swapper/66:0]
[  395.808014]              TPC[arch_cpu_idle+0x84/0xc0] 
O7[arch_cpu_idle+0x70/0xc0] I7[default_idle_call+0x30/0x160] 
RPC[do_idle+0x104/0x1e0]

>
> If that's not possible as the system is unresponsive, then please add
> 'ftrace_dump_on_oops' on the kernel command line or enable it after boot
> in /proc/sys/kernel and let the kernel panic when it hits the hung task
> detector fail.
>
> Thanks,
>
>          tglx
> ---
> --- a/arch/sparc/kernel/time_64.c
> +++ b/arch/sparc/kernel/time_64.c
> @@ -732,8 +732,10 @@ void __irq_entry timer_interrupt(int irq
>       if (unlikely(!evt->event_handler)) {
>           printk(KERN_WARNING
>                  "Spurious SPARC64 timer interrupt on cpu %d\n", cpu);
> -    } else
> +    } else {
> +        trace_printk("Invoking handler %pS\n", evt->event_handler);
>           evt->event_handler(evt);
> +    }
>         irq_exit();
>   --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -248,6 +248,7 @@ static void hung_task_info(struct task_s
>        * accordingly
>        */
>       if (sysctl_hung_task_warnings || hung_task_call_panic) {
> +        tracing_off();
>           if (sysctl_hung_task_warnings > 0)
>               sysctl_hung_task_warnings--;
>           pr_err("INFO: task %s:%d blocked%s for more than %ld 
> seconds.\n",
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -370,18 +370,22 @@ int clockevents_program_event(struct clo
>           delta = min(delta, (int64_t) dev->max_delta_ns);
>           cycles = ((u64)delta * dev->mult) >> dev->shift;
>           if (!dev->set_next_event((unsigned long) cycles, dev)) {
> +            trace_printk("Successfully programmed %lld %lld\n", 
> expires, delta);
>               dev->next_event_forced = 0;
>               return 0;
>           }
>       }
>   -    if (dev->next_event_forced)
> +    if (dev->next_event_forced) {
> +        trace_printk("Skipping %lld %lld\n", expires, delta);
>           return 0;
> +    }
>         if (dev->set_next_event(dev->min_delta_ticks, dev)) {
>           if (!force || clockevents_program_min_delta(dev))
>               return -ETIME;
>       }
> +    trace_printk("Force programmed min delta %lld %lld\n", expires, 
> delta);
>       dev->next_event_forced = 1;
>       return 0;
>   } 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: the stuttering regression in 7.0: should I have done something different
  2026-05-12  5:03               ` the stuttering regression in 7.0: should I have done something different Tony Rodriguez
@ 2026-05-12  8:17                 ` Thomas Gleixner
  2026-05-12 21:43                   ` Tony Rodriguez
  0 siblings, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2026-05-12  8:17 UTC (permalink / raw)
  To: Tony Rodriguez
  Cc: Linux kernel regressions list, LKML, sparclinux,
	John Paul Adrian Glaubitz, Thorsten Leemhuis

On Mon, May 11 2026 at 22:03, Tony Rodriguez wrote:
>> Can you apply the debug patch below, which will disable tracing once it
>> hits the hung task detector and then retrieve the trace? 
> As requested, I applied your debug patch to v7.1‑rc3 and captured the
> trace output.  On the SPARC64 S7‑2 system the machine becomes
> unresponsive and produces many thousands of lines of trace data that
> do not appear to terminate.

Yes, it takes a while to spill out over serial.

> Posting the full output inline or as an attachment may be impractical,
> so I’ve included the key sections below.

Kinda.

> If you prefer the complete trace, please let me know the best way to 
> provide it. Guessing the kernel mailing isn't best to attach that?

Correct.

> [  249.004209] [<0000000000000000>] 0x0
> [  249.019116] Dumping ftrace buffer:
> [  249.025666] ---------------------------------
> [  249.034534]   <idle>-0         0d.... 1836659us : 
> clockevents_program_event: Successfully programmed 4000000 4000000
> [  249.055418]   <idle>-0         0d.h.. 1845926us : timer_interrupt: 

So this is the interesting part, but that's starting at 1.836659s
while the actual problem happens ~120 seconds later and the detection
takes another 120 seconds.

Assuming that one of the CPUs does not get timer interrupts anymore, the
trace of that CPU should end around the time the last programming
happened. So the interesting part is at the end of the output. The
default buffer size per CPU is 1408k, which holds about 150k entries, so
we can just shorten the buffers to make this less painful.

Can you add 'trace_buf_size=50k' to the kernel command line, which
limits the buffer size to about 640 entries. Assuming 115200 Baud this
should then take about 4 seconds per CPU to dump, which still is a bunch
on a large machine, but definitely way more workable than the default.

IIRC, SPARC64 S7‑2 has 128 threads total, so the resulting uncompressed
output should be around 7-8M. That's highly compressable text, so the
resulting dump.xz should be suitable to be stored in github. If github
does not allow you, let me know and we work something out.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: the stuttering regression in 7.0: should I have done something different
  2026-05-12  8:17                 ` Thomas Gleixner
@ 2026-05-12 21:43                   ` Tony Rodriguez
  2026-05-13 20:28                     ` Thomas Gleixner
  0 siblings, 1 reply; 19+ messages in thread
From: Tony Rodriguez @ 2026-05-12 21:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linux kernel regressions list, LKML, sparclinux,
	John Paul Adrian Glaubitz, Thorsten Leemhuis


On 5/12/26 1:17 AM, Thomas Gleixner wrote:
>> [  249.004209] [<0000000000000000>] 0x0
>> [  249.019116] Dumping ftrace buffer:
>> [  249.025666] ---------------------------------
>> [  249.034534]   <idle>-0         0d.... 1836659us :
>> clockevents_program_event: Successfully programmed 4000000 4000000
>> [  249.055418]   <idle>-0         0d.h.. 1845926us : timer_interrupt:
> So this is the interesting part, but that's starting at 1.836659s
> while the actual problem happens ~120 seconds later and the detection
> takes another 120 seconds.
>
> Assuming that one of the CPUs does not get timer interrupts anymore, the
> trace of that CPU should end around the time the last programming
> happened. So the interesting part is at the end of the output. The
> default buffer size per CPU is 1408k, which holds about 150k entries, so
> we can just shorten the buffers to make this less painful.
>
> Can you add 'trace_buf_size=50k' to the kernel command line, which
> limits the buffer size to about 640 entries. Assuming 115200 Baud this
> should then take about 4 seconds per CPU to dump, which still is a bunch
> on a large machine, but definitely way more workable than the default.

Done.  The complete trace file "s7-2-05122026-dump.tar.gz" can be 
obtained from my GitHub repo:

https://github.com/unixpro1970/Sparc64-Kernel-Debugging-Dumps

> IIRC, SPARC64 S7‑2 has 128 threads total, so the resulting uncompressed
> output should be around 7-8M. That's highly compressable text, so the
> resulting dump.xz should be suitable to be stored in github. If github
> does not allow you, let me know and we work something out.
>
> Thanks,
>
>          tglx
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: the stuttering regression in 7.0: should I have done something different
  2026-05-12 21:43                   ` Tony Rodriguez
@ 2026-05-13 20:28                     ` Thomas Gleixner
  2026-05-14  7:24                       ` Tony Rodriguez
  0 siblings, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2026-05-13 20:28 UTC (permalink / raw)
  To: Tony Rodriguez
  Cc: Linux kernel regressions list, LKML, sparclinux,
	John Paul Adrian Glaubitz, Thorsten Leemhuis

Tony!

On Tue, May 12 2026 at 14:43, Tony Rodriguez wrote:
>> Can you add 'trace_buf_size=50k' to the kernel command line, which
>> limits the buffer size to about 640 entries. Assuming 115200 Baud this
>> should then take about 4 seconds per CPU to dump, which still is a bunch
>> on a large machine, but definitely way more workable than the default.
>
> Done.  The complete trace file "s7-2-05122026-dump.tar.gz" can be 
> obtained from my GitHub repo:
>
> https://github.com/unixpro1970/Sparc64-Kernel-Debugging-Dumps

Thanks for providing the data. So in both traces there is a clear
indication that the forced programmed min delta does not result in an
interrupt. Here are the last trace events on the affected CPUs.

No AHAVI CPU 116:

[  280.939873]   <idle>-0       116d.h.. 11612209us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[  280.980493]   <idle>-0       116d.h.. 11612213us : clockevents_program_event: Successfully programmed 9580000000 3991235
[  281.023902]   <idle>-0       116d.... 11612218us : clockevents_program_event: Successfully programmed 10112024440 536010830
[  281.089687]   <idle>-0       116dn... 11636205us : clockevents_program_event: Force programmed min delta 9600000000 10

No AHAVI CPU 100:

[  299.943989]  systemd-1       100d.h.. 27594794us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[  299.964303]  systemd-1       100d.h.. 27594796us : clockevents_program_event: Successfully programmed 25560000000 1407865
[  299.986182]  systemd-1       100d.... 27594932us : clockevents_program_event: Force programmed min delta 1 -25558727644
[  300.007707]  systemd-1       100d.h.. 27594933us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[  300.028019]  systemd-1       100d.h.. 27594934us : clockevents_program_event: Successfully programmed 25560000000 1269565
[  300.049894]  systemd-1       100d.... 27594971us : clockevents_program_event: Force programmed min delta 1 -25558767244
[  300.071415]  systemd-1       100d.... 27598043us : clockevents_program_event: Skipping 25560000000 -1838405

AHAVI CPU 6:

[ 1247.573212]   <idle>-0         6d.h.. 84194945us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 1247.613828]   <idle>-0         6d.h.. 84194947us : clockevents_program_event: Successfully programmed 80140000000 3928334
[ 1247.762267]   <idle>-0         6d.h.. 84198876us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 1247.844549]   <idle>-0         6d.h.. 84198878us : clockevents_program_event: Force programmed min delta 80140000000 771

AHAVI CPU 61:

[ 1258.222440]   <idle>-0        61d.h.. 84234905us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 1258.516354]   <idle>-0        61dnh.. 84234910us : clockevents_program_event: Successfully programmed 84176000000 3999995280
[ 1258.648636]   <idle>-0        61dn... 84234914us : clockevents_program_event: Successfully programmed 80180000000 3991863
[ 1258.868940]   <idle>-0        61d.h.. 84238906us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 1258.993594]   <idle>-0        61d.h.. 84238908us : clockevents_program_event: Force programmed min delta 80180000000 612

So there is only one case (CPU116) where another event in the past
programming (delta < 0) is skipped due to the force bit being set. But
that skip happens ~3ms after the min delta was programmed, which should
have resulted in an interrupt which never happened.

The original code is not really different vs. that min delta
programming, except that it does not have the next_event_forced
logic. But as you can see above this logic is not really making a
difference.

So I went through the differences line by line again and I found a very
subtle difference, but I can't seen how that would magically cure the
actual problem of the non-firing interrupt. The missing update of
dev->next_event in the force reprogram case of (delta <= 0) is
completely irrelevant as both events are in the past so it does not
matter at all. Nevertheless see the pointless and purely cosmetic delta
patch below.

But coming back to the trace data. There are tons of instances where the
forced programmed min delta results in an interrupt right afterwards:

[ 1258.868940]   <idle>-0        61d.h.. 84238906us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 1258.889262]   <idle>-0        60d.h.. 84238906us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 1258.909570]   <idle>-0        63d.h.. 84238906us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 1258.929877]   <idle>-0        70d.h.. 84238906us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 1258.950197]   <idle>-0        60d.h.. 84238907us : clockevents_program_event: Force programmed min delta 80180000000 552
[ 1258.971896]   <idle>-0        63d.h.. 84238908us : clockevents_program_event: Force programmed min delta 80180000000 627
[ 1258.993594]   <idle>-0        61d.h.. 84238908us : clockevents_program_event: Force programmed min delta 80180000000 612
[ 1259.015292]   <idle>-0        70d.h.. 84238908us : clockevents_program_event: Force programmed min delta 80180000000 313
[ 1259.036992]   <idle>-0        60d.h.. 84238910us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 1259.057313]   <idle>-0        63d.h.. 84238910us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 1259.077620]   <idle>-0        70d.h.. 84238912us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280

So all four involved CPUs force program min delta from the timer
interrupt context, but only three of them actually get an interrupt
afterwards. CPU61 fails to deliver one and as a result it goes stale.

As the set_next_event() callback returns 0 (success) in all cases -
otherwise we wouldn't see the trace entry - this all points to a problem
with that rearming logic:

        exp = read_cnt() + delta_ticks;
        write_cmp(exp);
        return (read_cnt() - exp) > 0 ? -ETIME : 0;

Your machine uses 'stick', which runs according to the conversion
factors in dmesg at 1GHz, but the CPU runs at 4.27GHz AFAIK. So you can
clearly run into a situation like this:

        TICK_CNT        CPU
        T1              exp = read_cnt() + D
                        ...                     // Some delay
        T1 + D
                        write_cmp(T1 + D)
                        now = read_cnt()	// Reads T1 + D
        T1 + D + 1
                        ---> returns success and the interrupt is never firing
Why?

Just to be clear: I never saw the VHDL code of that CPU, but that
pattern is way too familiar.

Those equal comparators, which were designed by AI (Absence of
Intelligence) before AI got popular, generally work this way:

  The comparator is only evaluated on the clock edge which increments
  the counter, but not when the comparator value is written. So a write
  of the same value does not result in an interrupt.

That's an "optimization" which spares quite a few gates and is obviously
nowhere documented. So software has to deal with the consequences by
using a crystal ball, which is trivial to get wrong and can go unnoticed
for a long time until it roars it's ugly head at some point for whatever
reasons.

I'm willing to bet a round of beers at the next conference that this is
the problem and that it will magically disappear when you change that
condition to:

        return (read_cnt() - exp) >= 0 ? -ETIME : 0;

unless they managed to add some extra propagation delay to that
comparator write like the HPET folks did at some point without telling
anyone. I doubt the SPARC janitor who implemented it did so because
that would have made the failure way more likely.

I have truly no idea why the original code did not expose this problem,
though it might have been just papered over by sheer luck and timing.

Thanks,

        tglx
---
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -381,6 +381,8 @@ int clockevents_program_event(struct clo
 	if (dev->set_next_event(dev->min_delta_ticks, dev)) {
 		if (!force || clockevents_program_min_delta(dev))
 			return -ETIME;
+	} else if (delta <= 0) {
+		dev->next_event = ktime_add_ns(ktime_get(), dev->min_delta_ns);
 	}
 	dev->next_event_forced = 1;
 	return 0;

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: the stuttering regression in 7.0: should I have done something different
  2026-05-13 20:28                     ` Thomas Gleixner
@ 2026-05-14  7:24                       ` Tony Rodriguez
  2026-05-14 10:24                         ` Thomas Gleixner
  0 siblings, 1 reply; 19+ messages in thread
From: Tony Rodriguez @ 2026-05-14  7:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linux kernel regressions list, LKML, sparclinux,
	John Paul Adrian Glaubitz, Thorsten Leemhuis, Linus Torvalds

Hi Thomas,

Cheers!

Initial validation of the test patches for v7.0.6 and 7.1-rc3 on the 
S7-2 looks promising: I have not observed panics, timer delays, or other 
timer-related issues so far. I will pause broader validation on the S7-2 
and T7-1 until I receive your recommendation or any requested revisions 
(see inline comments below).

Note: I did see an intermittent error on the S7-2 running 7.1-rc3, 
usually when the system is under heavy load during a kernel build. I’m 
not sure whether it is a separate problem?

"[676.464681] BUG: Bad rss-counter state mm:000000008d9f1cf2 
type:MM_FILEPAGES val:-4096 Comm:cc1 Pid:78165".

On 5/13/26 1:28 PM, Thomas Gleixner wrote:

> Just to be clear: I never saw the VHDL code of that CPU, but that
> pattern is way too familiar.
>
> Those equal comparators, which were designed by AI (Absence of
> Intelligence) before AI got popular, generally work this way:
>
>    The comparator is only evaluated on the clock edge which increments
>    the counter, but not when the comparator value is written. So a write
>    of the same value does not result in an interrupt.
>
> That's an "optimization" which spares quite a few gates and is obviously
> nowhere documented. So software has to deal with the consequences by
> using a crystal ball, which is trivial to get wrong and can go unnoticed
> for a long time until it roars it's ugly head at some point for whatever
> reasons.
>
> I'm willing to bet a round of beers at the next conference that this is
> the problem and that it will magically disappear when you change that
> condition to:
>
>          return (read_cnt() - exp) >= 0 ? -ETIME : 0;

Attempted to locate "return (read_cnt() - exp) >= 0 ? -ETIME : 0;" but 
could not find an exact match. After additional inspection I updated the 
following functions "tick_add_compare()" and "stick_add_compare()" in 
arch/sparc/kernel/time_64.c to from "> 0L" to ">= 0L". This appears to 
have resolved the lost-timer behavior.

--- time_64.c.orig
+++ time_64.c
@@ -146,7 +146,7 @@
                              : "=r" (new_tick));
         new_tick &= ~TICKCMP_IRQ_BIT;

-       return ((long)(new_tick - (orig_tick+adj))) > 0L;
+       return ((long)(new_tick - (orig_tick+adj))) >= 0L;
  }

  static unsigned long tick_add_tick(unsigned long adj)
@@ -277,7 +277,7 @@
                              : "=r" (new_tick));
         new_tick &= ~TICKCMP_IRQ_BIT;

-       return ((long)(new_tick - (orig_tick+adj))) > 0L;
+       return ((long)(new_tick - (orig_tick+adj))) >= 0L;
  }

  static unsigned long stick_get_frequency(void)

>
> unless they managed to add some extra propagation delay to that
> comparator write like the HPET folks did at some point without telling
> anyone. I doubt the SPARC janitor who implemented it did so because
> that would have made the failure way more likely.
>
> I have truly no idea why the original code did not expose this problem,
> though it might have been just papered over by sheer luck and timing.
>
> Thanks,
>
>          tglx
> ---
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -381,6 +381,8 @@ int clockevents_program_event(struct clo
>   	if (dev->set_next_event(dev->min_delta_ticks, dev)) {
>   		if (!force || clockevents_program_min_delta(dev))
>   			return -ETIME;
> +	} else if (delta <= 0) {
> +		dev->next_event = ktime_add_ns(ktime_get(), dev->min_delta_ns);
>   	}
>   	dev->next_event_forced = 1;
>   	return 0;
>
You mentioned this kernel/time/clockevents.c patch is optional, but I 
propose revising clockevents_program_event(). If the requested event 
time is already at or before now, record a sane next_event (now + 
min_delta) so core code sees a future expected time and can behave 
correctly. Does this seem reasonable?

  --- clockevents.c.orig
+++ clockevents.c
@@ -347,6 +347,11 @@
         if (dev->set_next_event(dev->min_delta_ticks, dev)) {
                 if (!force || clockevents_program_min_delta(dev))
                         return -ETIME;
+       } else {
+               ktime_t now = ktime_get();
+               s64 delta_ns = ktime_to_ns(ktime_sub(expires, now));
+               if (delta_ns <= 0)
+                       dev->next_event = ktime_add_ns(now, 
dev->min_delta_ns);
         }
         dev->next_event_forced = 1;
         return 0;



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: the stuttering regression in 7.0: should I have done something different
  2026-05-14  7:24                       ` Tony Rodriguez
@ 2026-05-14 10:24                         ` Thomas Gleixner
  2026-05-15  4:47                           ` Tony Rodriguez
  0 siblings, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2026-05-14 10:24 UTC (permalink / raw)
  To: Tony Rodriguez
  Cc: Linux kernel regressions list, LKML, sparclinux,
	John Paul Adrian Glaubitz, Thorsten Leemhuis, Linus Torvalds

On Thu, May 14 2026 at 00:24, Tony Rodriguez wrote:
> Initial validation of the test patches for v7.0.6 and 7.1-rc3 on the 
> S7-2 looks promising: I have not observed panics, timer delays, or other 
> timer-related issues so far. I will pause broader validation on the S7-2 
> and T7-1 until I receive your recommendation or any requested revisions 
> (see inline comments below).
>
> Note: I did see an intermittent error on the S7-2 running 7.1-rc3, 
> usually when the system is under heavy load during a kernel build. I’m 
> not sure whether it is a separate problem?
>
> "[676.464681] BUG: Bad rss-counter state mm:000000008d9f1cf2 
> type:MM_FILEPAGES val:-4096 Comm:cc1 Pid:78165".

That's unrelated and an accounting issue in the MM code. Please report
it separately to the MM people.

> On 5/13/26 1:28 PM, Thomas Gleixner wrote:
>> I'm willing to bet a round of beers at the next conference that this is
>> the problem and that it will magically disappear when you change that
>> condition to:
>>
>>          return (read_cnt() - exp) >= 0 ? -ETIME : 0;
>
> Attempted to locate "return (read_cnt() - exp) >= 0 ? -ETIME : 0;" but 
> could not find an exact match. After additional inspection I updated the 
> following functions "tick_add_compare()" and "stick_add_compare()" in 
> arch/sparc/kernel/time_64.c to from "> 0L" to ">= 0L". This appears to 
> have resolved the lost-timer behavior.

I condensed the logic for illustration and rightfully assumed that you
will figure it out. :)

> --- time_64.c.orig
> +++ time_64.c
> @@ -146,7 +146,7 @@
>                               : "=r" (new_tick));
>          new_tick &= ~TICKCMP_IRQ_BIT;
>
> -       return ((long)(new_tick - (orig_tick+adj))) > 0L;
> +       return ((long)(new_tick - (orig_tick+adj))) >= 0L;
>   }
>
>   static unsigned long tick_add_tick(unsigned long adj)
> @@ -277,7 +277,7 @@
>                               : "=r" (new_tick));
>          new_tick &= ~TICKCMP_IRQ_BIT;
>
> -       return ((long)(new_tick - (orig_tick+adj))) > 0L;
> +       return ((long)(new_tick - (orig_tick+adj))) >= 0L;
>   }

Looks correct, but you missed the one in hbtick_add_compare() which has
the same issue.

>> --- a/kernel/time/clockevents.c
>> +++ b/kernel/time/clockevents.c
>> @@ -381,6 +381,8 @@ int clockevents_program_event(struct clo
>>   	if (dev->set_next_event(dev->min_delta_ticks, dev)) {
>>   		if (!force || clockevents_program_min_delta(dev))
>>   			return -ETIME;
>> +	} else if (delta <= 0) {
>> +		dev->next_event = ktime_add_ns(ktime_get(), dev->min_delta_ns);
>>   	}
>>   	dev->next_event_forced = 1;
>>   	return 0;
>>
> You mentioned this kernel/time/clockevents.c patch is optional, but I 
> propose revising clockevents_program_event(). If the requested event 
> time is already at or before now, record a sane next_event (now + 
> min_delta) so core code sees a future expected time and can behave 
> correctly. Does this seem reasonable?

The related core code only cares what the last programmed expiry value
in clock monotonic (i.e. the @expires argument) was. And the only
interesting information is whether it's in the future or not. If it's in
the past then it does not matter how much in the past it is.

Whatever we fake into it is never going to reflect anything related to
reality anyway and there is no guarantee that the code which reads it
will see a future expected time depending on the time elapsed between
faking it and reading it. So it's truly a cosmetic exercise for no real
value.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: the stuttering regression in 7.0: should I have done something different
  2026-05-14 10:24                         ` Thomas Gleixner
@ 2026-05-15  4:47                           ` Tony Rodriguez
  2026-05-15 15:35                             ` Thomas Gleixner
  0 siblings, 1 reply; 19+ messages in thread
From: Tony Rodriguez @ 2026-05-15  4:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linux kernel regressions list, LKML, sparclinux,
	John Paul Adrian Glaubitz, Thorsten Leemhuis, Linus Torvalds

Hi Thomas,

I’ve completed validation with the v7.0.7 release and v7.1‑rc3 on both 
S7‑2 and T7‑1 systems. Everything looks good.

Thank you again for the debugging guidance and for the feedback on my 
original patch addressing the timer starvation issue. It was a pleasure 
contributing to the resolution.

PS: I agree that the second patch we discussed isn’t needed—the systems 
run correctly without it. The following patch alone is sufficient:

Best regards,
Tony Rodriguez

--- linux-7.1-rc1/arch/sparc/kernel/time_64.c.orig
+++ linux-7.1-rc1/arch/sparc/kernel/time_64.c
@@ -146,7 +146,7 @@
                   : "=r" (new_tick));
      new_tick &= ~TICKCMP_IRQ_BIT;

-    return ((long)(new_tick - (orig_tick+adj))) > 0L;
+    return ((long)(new_tick - (orig_tick+adj))) >= 0L;
  }

  static unsigned long tick_add_tick(unsigned long adj)
@@ -277,7 +277,7 @@
                   : "=r" (new_tick));
      new_tick &= ~TICKCMP_IRQ_BIT;

-    return ((long)(new_tick - (orig_tick+adj))) > 0L;
+    return ((long)(new_tick - (orig_tick+adj))) >= 0L;
  }

  static unsigned long stick_get_frequency(void)
@@ -411,7 +411,7 @@

      val2 = __hbird_read_stick() & ~TICKCMP_IRQ_BIT;

-    return ((long)(val2 - val)) > 0L;
+    return ((long)(val2 - val)) >= 0L;
  }

  static unsigned long hbtick_get_frequency(void)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: the stuttering regression in 7.0: should I have done something different
  2026-05-15  4:47                           ` Tony Rodriguez
@ 2026-05-15 15:35                             ` Thomas Gleixner
  2026-05-15 17:51                               ` John Paul Adrian Glaubitz
  0 siblings, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2026-05-15 15:35 UTC (permalink / raw)
  To: Tony Rodriguez
  Cc: Linux kernel regressions list, LKML, sparclinux,
	John Paul Adrian Glaubitz, Thorsten Leemhuis, Linus Torvalds

Tony!

On Thu, May 14 2026 at 21:47, Tony Rodriguez wrote:
> I’ve completed validation with the v7.0.7 release and v7.1‑rc3 on both 
> S7‑2 and T7‑1 systems. Everything looks good.

Cool!

> Thank you again for the debugging guidance and for the feedback on my 
> original patch addressing the timer starvation issue. It was a pleasure 
> contributing to the resolution.

Thank you for going through the hassle of chasing it down and providing
the debug data to analyze it.

I'm still puzzled how this went unnoticed for almost two decades:

   112f48716d9f ("[SPARC64]: Add clocksource/clockevents support.")

> PS: I agree that the second patch we discussed isn’t needed—the systems 
> run correctly without it. The following patch alone is sufficient:

I assume this patch will surface on a mailing list with a lengthy change
log full of details and find it's way into the sparc tree through the
usual channels.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: the stuttering regression in 7.0: should I have done something different
  2026-05-15 15:35                             ` Thomas Gleixner
@ 2026-05-15 17:51                               ` John Paul Adrian Glaubitz
  2026-05-15 19:57                                 ` Thomas Gleixner
  0 siblings, 1 reply; 19+ messages in thread
From: John Paul Adrian Glaubitz @ 2026-05-15 17:51 UTC (permalink / raw)
  To: Thomas Gleixner, Tony Rodriguez
  Cc: Linux kernel regressions list, LKML, sparclinux,
	Thorsten Leemhuis, Linus Torvalds

Hi Thomas,

On Fri, 2026-05-15 at 17:35 +0200, Thomas Gleixner wrote:
> > Thank you again for the debugging guidance and for the feedback on my 
> > original patch addressing the timer starvation issue. It was a pleasure 
> > contributing to the resolution.
> 
> Thank you for going through the hassle of chasing it down and providing
> the debug data to analyze it.
> 
> I'm still puzzled how this went unnoticed for almost two decades:
> 
>    112f48716d9f ("[SPARC64]: Add clocksource/clockevents support.")

My suspicion is that it was previously visible only in certain edge cases,
in particular on machines with many cores and high load.

Case in point: In the past, SPARC LDOMs with lots of virtual CPUs could
crash in rares cases when building packages such as GCC or LLVM and running
their testsuites.

I don't know if Tony's patch fixes this long-time issue that we have observed
in the past on Debian's buildds, but I think that the chances aren't too bad.

Tony, please clean up your patch and add an elaborate explanation in the commit
message! Hope to see this fix landed as soon as possible!

Thanks to both of you for hunting this down!

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: the stuttering regression in 7.0: should I have done something different
  2026-05-15 17:51                               ` John Paul Adrian Glaubitz
@ 2026-05-15 19:57                                 ` Thomas Gleixner
  0 siblings, 0 replies; 19+ messages in thread
From: Thomas Gleixner @ 2026-05-15 19:57 UTC (permalink / raw)
  To: John Paul Adrian Glaubitz, Tony Rodriguez
  Cc: Linux kernel regressions list, LKML, sparclinux,
	Thorsten Leemhuis, Linus Torvalds

Hi!

On Fri, May 15 2026 at 19:51, John Paul Adrian Glaubitz wrote:
> On Fri, 2026-05-15 at 17:35 +0200, Thomas Gleixner wrote:
>> > Thank you again for the debugging guidance and for the feedback on my 
>> > original patch addressing the timer starvation issue. It was a pleasure 
>> > contributing to the resolution.
>> 
>> Thank you for going through the hassle of chasing it down and providing
>> the debug data to analyze it.
>> 
>> I'm still puzzled how this went unnoticed for almost two decades:
>> 
>>    112f48716d9f ("[SPARC64]: Add clocksource/clockevents support.")
>
> My suspicion is that it was previously visible only in certain edge cases,
> in particular on machines with many cores and high load.
>
> Case in point: In the past, SPARC LDOMs with lots of virtual CPUs could
> crash in rares cases when building packages such as GCC or LLVM and running
> their testsuites.

I assume those occasional failures did not leave conclusive hints around.

> I don't know if Tony's patch fixes this long-time issue that we have observed
> in the past on Debian's buildds, but I think that the chances aren't too bad.

Good luck!

> Thanks to both of you for hunting this down!

For some stupid reasons I like such puzzles :)

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2026-05-15 19:57 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-23 16:30 the stuttering regression in 7.0: should I have done something different? Thorsten Leemhuis
2026-04-26 21:16 ` Greg KH
2026-05-08  5:51 ` John Paul Adrian Glaubitz
2026-05-08  6:33   ` Thorsten Leemhuis
     [not found]     ` <D5D19776-C809-4284-9417-F9A860877B98@gmail.com>
2026-05-08  7:50       ` Thorsten Leemhuis
2026-05-08 20:15         ` Tony Rodriguez
2026-05-08 20:21           ` Tony Rodriguez
2026-05-10 21:29           ` Thomas Gleixner
2026-05-11  3:13             ` Tony Rodriguez
2026-05-12  5:03               ` the stuttering regression in 7.0: should I have done something different Tony Rodriguez
2026-05-12  8:17                 ` Thomas Gleixner
2026-05-12 21:43                   ` Tony Rodriguez
2026-05-13 20:28                     ` Thomas Gleixner
2026-05-14  7:24                       ` Tony Rodriguez
2026-05-14 10:24                         ` Thomas Gleixner
2026-05-15  4:47                           ` Tony Rodriguez
2026-05-15 15:35                             ` Thomas Gleixner
2026-05-15 17:51                               ` John Paul Adrian Glaubitz
2026-05-15 19:57                                 ` Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox