The Linux Kernel Mailing List
 help / color / mirror / Atom feed
* [PATCH v2 1/1] sparc64: Fix comparator problem with timer interrupts
@ 2026-05-19  2:24 Tony Rodriguez
  2026-05-19  2:24 ` Tony Rodriguez
  0 siblings, 1 reply; 4+ messages in thread
From: Tony Rodriguez @ 2026-05-19  2:24 UTC (permalink / raw)
  To: davem, sparclinux
  Cc: linux-kernel, andreas, tglx, thomas.weissschuh, regressions,
	glaubitz, linux, torvalds, Tony Rodriguez

sparc64: Fix comparator problem with timer interrupts

This patch fixes SPARC64 tick/stick comparator logic that can cause missed
timer interrupts

Full dmesg and crash dumps:
https://github.com/unixpro1970/Sparc64-Kernel-Debugging-Dumps/blob/main/s7-2-05122026-dump.tar.gz

Debugging discussion:
https://lore.kernel.org/all/87tssb6olo.ffs@tglx/
https://lore.kernel.org/all/871pfcznw0.ffs@tglx/

Short dmesg excerpt:
[   48.632215] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[   48.676018] rcu: rcu_sched kthread timer wakeup didn't happen for 5259 jiffies!
[   48.743621] rcu:     Possible timer handling issue on cpu=100 timer-softirq=15

Test summary:
After applying this patch, S7-2 and T7-1 systems no longer hang on v7 and v7.1 kernels.

Tony Rodriguez (1):
  sparc64: Fix comparator problem with timer interrupts

 arch/sparc/kernel/time_64.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--
2.53.0


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH v2 1/1] sparc64: Fix comparator problem with timer interrupts
  2026-05-19  2:24 [PATCH v2 1/1] sparc64: Fix comparator problem with timer interrupts Tony Rodriguez
@ 2026-05-19  2:24 ` Tony Rodriguez
  2026-05-19 14:22   ` Thomas Gleixner
  0 siblings, 1 reply; 4+ messages in thread
From: Tony Rodriguez @ 2026-05-19  2:24 UTC (permalink / raw)
  To: davem, sparclinux
  Cc: linux-kernel, andreas, tglx, thomas.weissschuh, regressions,
	glaubitz, linux, torvalds, Tony Rodriguez


On SPARC64 the check:

    return ((long)(new_tick - (orig_tick + adj))) > 0L;

Is safe only if retries make forward progress. The comparator can
take effect with a latency, so the moment when counter == comparator
may be missed, which can cause delays or hangs on some SPARC64 systems.

For clarity:
    exp = orig_tick + adj   /* expected comparator value */

The current check requires new_tick to be strictly greater than exp;
equality (new_tick == exp) is treated as not yet passed and the caller
will retry.

By contrast, using:

    return ((long)(new_tick - (orig_tick + adj))) >= 0L;

causes the caller to stop retrying and assume the timer is scheduled;
both equality and greater-than are accepted (new_tick == exp or
new_tick > exp).

Signed-off-by: Tony Rodriguez <unixpro1970@gmail.com>
---
 arch/sparc/kernel/time_64.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/sparc/kernel/time_64.c b/arch/sparc/kernel/time_64.c
index 87b267043ccd..783b60e547c4 100644
--- a/arch/sparc/kernel/time_64.c
+++ b/arch/sparc/kernel/time_64.c
@@ -146,7 +146,7 @@ static int tick_add_compare(unsigned long adj)
 			     : "=r" (new_tick));
 	new_tick &= ~TICKCMP_IRQ_BIT;

-	return ((long)(new_tick - (orig_tick+adj))) > 0L;
+	return ((long)(new_tick - (orig_tick+adj))) >= 0L;
 }

 static unsigned long tick_add_tick(unsigned long adj)
@@ -277,7 +277,7 @@ static int stick_add_compare(unsigned long adj)
 			     : "=r" (new_tick));
 	new_tick &= ~TICKCMP_IRQ_BIT;

-	return ((long)(new_tick - (orig_tick+adj))) > 0L;
+	return ((long)(new_tick - (orig_tick+adj))) >= 0L;
 }

 static unsigned long stick_get_frequency(void)
@@ -411,7 +411,7 @@ static int hbtick_add_compare(unsigned long adj)

 	val2 = __hbird_read_stick() & ~TICKCMP_IRQ_BIT;

-	return ((long)(val2 - val)) > 0L;
+	return ((long)(val2 - val)) >= 0L;
 }

 static unsigned long hbtick_get_frequency(void)
--
2.53.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH v2 1/1] sparc64: Fix comparator problem with timer interrupts
  2026-05-19  2:24 ` Tony Rodriguez
@ 2026-05-19 14:22   ` Thomas Gleixner
  2026-05-19 23:25     ` Tony Rodriguez
  0 siblings, 1 reply; 4+ messages in thread
From: Thomas Gleixner @ 2026-05-19 14:22 UTC (permalink / raw)
  To: Tony Rodriguez, davem, sparclinux
  Cc: linux-kernel, andreas, thomas.weissschuh, regressions, glaubitz,
	linux, torvalds, Tony Rodriguez

On Mon, May 18 2026 at 19:24, Tony Rodriguez wrote:
> On SPARC64 the check:
>
>     return ((long)(new_tick - (orig_tick + adj))) > 0L;
>
> Is safe only if retries make forward progress. The comparator can
> take effect with a latency, so the moment when counter == comparator
> may be missed, which can cause delays or hangs on some SPARC64 systems.
>
> For clarity:
>     exp = orig_tick + adj   /* expected comparator value */
>
> The current check requires new_tick to be strictly greater than exp;
> equality (new_tick == exp) is treated as not yet passed and the caller
> will retry.

That's confusing at best. You really want to explain how the ordering is
similar to what I described in the analysis:

        exp = read_cnt() + delta_ticks;
        write_cmp(exp);
        return (read_cnt() - exp) > 0;

If the counter advanced past the expected expiry time, after writing it,
then the caller will retry, as the calling code does:

     return tick.add_compare(delta_ticks) ? -ETIME : 0;

But it won't do so when the counter is equal, which is causing the
problem.

> By contrast, using:
>
>     return ((long)(new_tick - (orig_tick + adj))) >= 0L;
>
> causes the caller to stop retrying and assume the timer is scheduled;
> both equality and greater-than are accepted (new_tick == exp or
> new_tick > exp).

It's the other way round. When counter >= expiry time, then the write is
considered failed. If the counter has not yet reached expiry time,
i.e. it is smaller, then it assumes the timer is scheduled.

> Signed-off-by: Tony Rodriguez <unixpro1970@gmail.com>

It would be nice to have a link to the original thread in the change log
itself as that gives people quick access when they are wondering about
this a year down the road.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH v2 1/1] sparc64: Fix comparator problem with timer interrupts
  2026-05-19 14:22   ` Thomas Gleixner
@ 2026-05-19 23:25     ` Tony Rodriguez
  0 siblings, 0 replies; 4+ messages in thread
From: Tony Rodriguez @ 2026-05-19 23:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, andreas, thomas.weissschuh, regressions, glaubitz,
	linux, torvalds, davem, sparclinux

Hi Thomas,

Thanks again for the careful analysis of the comparator ordering. After 
applying the changes in the patch I confirmed S7‑2 and T7‑1 systems no 
longer hang; timer scheduling now behaves as expected.

I don’t have bandwidth right now to finish the remaining tidy steps and 
I’d rather avoid delaying this fix. Since we worked on this together and 
you’re familiar with the issue, would you be willing to take it over so 
upstream has everything required for approval? I’m happy to answer any 
questions; if you’re busy I can ask the maintainer to reassign.

Thanks again for the review and the clear explanation.

Tony

On 5/19/26 7:22 AM, Thomas Gleixner wrote:
>> On SPARC64 the check:
>>
>>      return ((long)(new_tick - (orig_tick + adj))) > 0L;
>>
>> Is safe only if retries make forward progress. The comparator can
>> take effect with a latency, so the moment when counter == comparator
>> may be missed, which can cause delays or hangs on some SPARC64 systems.
>>
>> For clarity:
>>      exp = orig_tick + adj   /* expected comparator value */
>>
>> The current check requires new_tick to be strictly greater than exp;
>> equality (new_tick == exp) is treated as not yet passed and the caller
>> will retry.
> That's confusing at best. You really want to explain how the ordering is
> similar to what I described in the analysis:
>
>          exp = read_cnt() + delta_ticks;
>          write_cmp(exp);
>          return (read_cnt() - exp) > 0;
>
> If the counter advanced past the expected expiry time, after writing it,
> then the caller will retry, as the calling code does:
>
>       return tick.add_compare(delta_ticks) ? -ETIME : 0;
>
> But it won't do so when the counter is equal, which is causing the
> problem.
>
>> By contrast, using:
>>
>>      return ((long)(new_tick - (orig_tick + adj))) >= 0L;
>>
>> causes the caller to stop retrying and assume the timer is scheduled;
>> both equality and greater-than are accepted (new_tick == exp or
>> new_tick > exp).
> It's the other way round. When counter >= expiry time, then the write is
> considered failed. If the counter has not yet reached expiry time,
> i.e. it is smaller, then it assumes the timer is scheduled.
>
>> Signed-off-by: Tony Rodriguez<unixpro1970@gmail.com>
> It would be nice to have a link to the original thread in the change log
> itself as that gives people quick access when they are wondering about
> this a year down the road.
>
> Thanks,
>
>          tglx

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2026-05-19 23:25 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-19  2:24 [PATCH v2 1/1] sparc64: Fix comparator problem with timer interrupts Tony Rodriguez
2026-05-19  2:24 ` Tony Rodriguez
2026-05-19 14:22   ` Thomas Gleixner
2026-05-19 23:25     ` Tony Rodriguez

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox