* Re: the stuttering regression in 7.0: should I have done something different
       [not found] ` <a7a0d78b-435e-43c8-b436-5e7f4dd39dee@gmail.com>
@ 2026-05-12  5:03   ` Tony Rodriguez
  2026-05-12  8:17     ` Thomas Gleixner
  0 siblings, 1 reply; 10+ messages in thread
From: Tony Rodriguez @ 2026-05-12  5:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linux kernel regressions list, LKML, sparclinux,
      John Paul Adrian Glaubitz, Thorsten Leemhuis

On 5/10/26 02:29 PM, Thomas Gleixner wrote:
> Can you apply the debug patch below, which will disable tracing once it
> hits the hung task detector and then retrieve the trace?

As requested, I applied your debug patch to v7.1-rc3 and captured the
trace output. On the SPARC64 S7-2 system the machine becomes
unresponsive and produces many thousands of lines of trace data that
do not appear to terminate.

Posting the full output inline or as an attachment may be impractical,
so I've included the key sections below.

If you prefer the complete trace, please let me know the best way to
provide it. Guessing the kernel mailing list isn't best to attach that?

A) Output with avahi-daemon and avahi-utils installed
(This shows avahi and systemd activity before the hang.)

Note: The system is using the standard Debian-provided avahi and systemd
packages. No custom scripts or modifications are in place.

BOOT_IMAGE=/boot/vmlinuz-7.1.0-rc3-test01 root=UUID=ce937a4b-126a-41bd-a54b-03a424421086 ro console=ttyHV0,9600n81 systemd.log_level=info systemd.show_status=1 systemd.journald.forward_to_console=0 plymouth.enable=0 ignore_loglevel loglevel=8 ftrace_dump_on_oops=1 hung_task_panic=1

[ 1.206192] printk: log_buf_len individual max cpu contribution: 4096 bytes
[ 1.219999] printk: log_buf_len total cpu_extra contributions: 520192 bytes
[ 1.233883] printk: log_buf_len min size: 131072 bytes
[ 1.249357] printk: log buffer data + meta data: 1048576 + 4456448 = 5505024 bytes
[ 1.264204] printk: early log buf free: 126896(96%)
[ 1.328220] Dentry cache hash table entries: 8388608 (order: 13, 67108864 bytes, linear)
[ 1.371366] Inode-cache hash table entries: 4194304 (order: 12, 33554432 bytes, linear)
[ 1.387117] Sorting __ex_table...
[ 1.394073] Built 1 zonelists, mobility grouping on. Total pages: 16545911
[ 1.407711] mem auto-init: stack:all(zero), heap alloc:on, heap free:off
[ 1.434383] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=128, Nodes=1
[ 1.467254] ftrace: allocating 36740 entries in 72 pages
[ 1.477580] ftrace: allocated 72 pages with 2 groups
[ 1.487945]
[ 1.490630] **********************************************************
[ 1.503647] ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
[ 1.516665] ** **
[ 1.529690] ** trace_printk() being used. Allocating extra memory. **
[ 1.542707] ** **
[ 1.555727] ** This means that this is a DEBUG kernel and it is **
[ 1.568743] ** unsafe for production use. **
[ 1.581778] ** **
[ 1.594794] ** If you see this message and you are not debugging **
[ 1.607811] ** the kernel, report this immediately to your vendor! **
[ 1.620828] ** **
[ 1.633847] ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
[ 1.646880] **********************************************************

[ OK ] Started avahi-daemon.service - Avahi mDNS/DNS-SD Stack.
[ 248.416424] INFO: task systemd:1 blocked for more than 120 seconds.
[ 248.428721] Not tainted 7.1.0-rc3-test01 #2
[ 248.438087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 248.453706] task:systemd state:D stack:20312 pid:1 tgid:1 ppid:0 task_flags:0x400100 flags:0x208000101000000
[ 248.476970] Call Trace:
[ 248.481829] [<0000000000fd993c>] schedule+0x1c/0x180
[ 248.491728] [<0000000000fe1a10>] schedule_timeout+0x90/0x100
[ 248.503017] [<0000000000fda5b0>] __wait_for_common+0xb0/0x180
[ 248.514472] [<0000000000fda7a0>] wait_for_completion_state+0x20/0x60
[ 248.527151] [<0000000000527234>] __wait_rcu_gp+0x114/0x1e0
[ 248.538089] [<000000000052d9c8>] synchronize_rcu_normal.part.0+0x48/0x60
[ 248.551452] [<000000000052f860>] synchronize_rcu_normal+0xc0/0xe0
[ 248.563605] [<0000000000532820>] synchronize_rcu+0xe0/0x140
[ 248.574720] [<00000000005285c4>] rcu_sync_enter+0x44/0x140
[ 248.585650] [<0000000000fdf114>] percpu_down_write+0x14/0x240
[ 248.597105] [<000000000057ef20>] cgroup_procs_write_start+0x1c0/0x240
[ 248.609956] [<000000000057f8d0>] __cgroup_procs_write+0x30/0x1c0
[ 248.621939] [<000000000057fab4>] cgroup_procs_write+0x14/0x40
[ 248.633393] [<0000000000577910>] cgroup_file_write+0x90/0x160
[ 248.644851] [<00000000008b536c>] kernfs_fop_write_iter+0x14c/0x240
[ 248.657182] [<00000000007e7210>] vfs_write+0x210/0x460
[ 248.667496] INFO: task (systemd-hostn):1968 blocked for more than 120 seconds.
[ 248.681833] Not tainted 7.1.0-rc3-test01 #2
[ 248.691196] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 248.706831] task:(systemd-hostn) state:D stack:25352 pid:1968 tgid:1968 ppid:1 task_flags:0x400100 flags:0x408000102000000
[ 248.730091] Call Trace:
[ 248.734952] [<0000000000fd993c>] schedule+0x1c/0x180
[ 248.744843] [<0000000000fd9b2c>] schedule_preempt_disabled+0xc/0x20
[ 248.757352] [<0000000000fdca0c>] __mutex_lock.constprop.0+0x58c/0xf00
[ 248.770184] [<0000000000fdd490>] __mutex_lock_slowpath+0x10/0x20
[ 248.782166] [<0000000000fdd4d0>] mutex_lock+0x30/0x40
[ 248.792243] [<000000000057f34c>] cgroup_kn_lock_live+0x4c/0x120
[ 248.804040] [<000000000057f8b8>] __cgroup_procs_write+0x18/0x1c0
[ 248.816021] [<000000000057fab4>] cgroup_procs_write+0x14/0x40
[ 248.827476] [<0000000000577910>] cgroup_file_write+0x90/0x160
[ 248.838946] [<00000000008b536c>] kernfs_fop_write_iter+0x14c/0x240
[ 248.851269] [<00000000007e7210>] vfs_write+0x210/0x460
[ 248.861500] [<00000000007e75d0>] ksys_write+0x50/0xe0
[ 248.871579] [<00000000007e7674>] sys_write+0x14/0x40
[ 248.881469] [<00000000004062b4>] linux_sparc_syscall+0x34/0x44
[ 248.893098] INFO: task (systemd-hostn):1968 is blocked on a mutex likely owned by task systemd:1.
[ 248.910809] Kernel panic - not syncing: hung_task: blocked tasks
[ 248.922783] CPU: 48 UID: 0 PID: 677 Comm: khungtaskd Not tainted 7.1.0-rc3-test01 #2 VOLUNTARY
[ 248.940151] Call Trace:
[ 248.945010] [<0000000000436fcc>] dump_stack+0x8/0x18
[ 248.954902] [<00000000004293a4>] vpanic+0xfc/0x33c
[ 248.964453] [<0000000000429608>] panic+0x24/0x30
[ 248.973648] [<00000000005a8158>] watchdog+0x238/0x840
[ 248.983725] [<00000000004af254>] kthread+0x114/0x160
[ 248.993622] [<00000000004060f0>] ret_from_fork+0x24/0x34
[ 249.004209] [<0000000000000000>] 0x0
[ 249.019116] Dumping ftrace buffer:
[ 249.025666] ---------------------------------
[ 249.034534] <idle>-0 0d.... 1836659us : clockevents_program_event: Successfully programmed 4000000 4000000
[ 249.055418] <idle>-0 0d.h.. 1845926us : timer_interrupt: Invoking handler tick_handle_periodic+0x0/0xa0
[ 249.075895] <idle>-0 0d.h.. 1845938us : clockevents_program_event: Successfully programmed 8000000 4000000
[ 249.096899] <idle>-0 0d.h.. 1849938us : timer_interrupt: Invoking handler tick_handle_periodic+0x0/0xa0
[ 249.117390] <idle>-0 0d.h.. 1849940us : clockevents_program_event: Successfully programmed 12000000 4000000
[ 249.138563] <idle>-0 0d.h.. 1853940us : timer_interrupt: Invoking handler tick_handle_periodic+0x0/0xa0
[ 249.159053] <idle>-0 0d.h.. 1853942us : clockevents_program_event: Successfully programmed 16000000 4000000
[ 249.180226] <idle>-0 0d.h.. 1857942us : timer_interrupt: Invoking handler tick_handle_periodic+0x0/0xa0
[ 249.200718] <idle>-0 0d.h.. 1857943us : clockevents_program_event: Successfully programmed 20000000 40000

B) Output without avahi-daemon and avahi-utils installed, to rule out a
possible conflict with avahi.

BOOT_IMAGE=/boot/vmlinuz-7.1.0-rc3-test01 root=UUID=ce937a4b-126a-41bd-a54b-03a424421086 ro console=ttyHV0,9600n81 systemd.log_level=info systemd.show_status=1 systemd.journald.forward_to_console=0 plymouth.enable=0 ignore_loglevel loglevel=8 ftrace_dump_on_oops=1 hung_task_panic=1

[ OK ] Reached target graphical.target - Graphical Interface.
[ 310.338420] rcu: INFO: rcu_sched detected stalls on CPUs/tasks: [ 310.350060] rcu: 3-...!: (259 GPs behind) idle=bb6c/1/0x4000000000000000 softirq=1081/1081 fqs=0 [ 310.367759] rcu: 27-...!: (313 GPs behind) idle=55cc/1/0x4000000000000000 softirq=284/286 fqs=0 [ 310.385298] rcu: 34-...!: (261 GPs behind) idle=3a64/1/0x4000000000000000 softirq=524/524 fqs=0 [ 310.402834] rcu: 46-...!: (261 GPs behind) idle=743c/1/0x4000000000000000 softirq=258/259 fqs=0 [ 310.420366] rcu: (detected by 73, t=5275 jiffies, g=5933, q=80255 ncpus=128) [ 310.434745] CPU[ 3]: TSTATE[00000000f0001206] TPC[00000000010d8694] TNPC[00000000010d8698] TASK[cc1:14002] [ 310.434759] TPC[10d8694] O7[8f32fc] I7[89d438] RPC[8a4bac] [ 310.434866] CPU[ 27]: TSTATE[00000099f0001202] TPC[00000000010f09f0] TNPC[00000000008dc138] TASK[cc1:13964] [ 310.434875] TPC[10f09f0] O7[8dc130] I7[8dcea0] RPC[8dc90c] [ 310.434965] CPU[ 34]: TSTATE[00000044f0001202] TPC[0000000000b687d8] TNPC[0000000000b687dc] TASK[cc1:13408] [ 310.434973] TPC[b687d8] O7[b68754] I7[b668c8] RPC[8af5e0] [ 310.435065] CPU[ 46]: TSTATE[00000044f0001206] TPC[00000000008cf138] TNPC[00000000008cf13c] TASK[cc1:13823] [ 310.435073] TPC[8cf138] O7[8cf0dc] I7[8ce048] RPC[11152c0] [ 310.435103] rcu: rcu_sched kthread timer wakeup didn't happen for 5274 jiffies! g5933 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 [ 310.588885] rcu: Possible timer handling issue on cpu=66 timer-softirq=248 [ 310.602770] rcu: rcu_sched kthread starved for 5320 jiffies! g5933 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=66 [ 310.623249] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior. 
[ 310.641135] rcu: RCU grace-period kthread stack dump: [ 310.651207] task:rcu_sched state:I stack:26936 pid:15 tgid:15 ppid:2 task_flags:0x208040 flags:0x07000000 [ 310.673256] Call Trace: [ 310.678113] [<0000000000fd993c>] schedule+0x1c/0x180 [ 310.688017] [<0000000000fe19f0>] schedule_timeout+0x70/0x100 [ 310.699294] [<0000000000530064>] rcu_gp_fqs_loop+0x104/0x4e0 [ 310.710578] [<0000000000535474>] rcu_gp_kthread+0x134/0x180 [ 310.721686] [<00000000004af254>] kthread+0x114/0x160 [ 310.731588] [<00000000004060f0>] ret_from_fork+0x24/0x34 [ 310.742169] [<0000000000000000>] 0x0 [ 310.749288] rcu: Stack dump where RCU GP kthread last ran: [ 310.760431] CPU[ 66]: TSTATE[00000044f0001204] TPC[0000000000988ccc] TNPC[0000000000988cd0] TASK[cc1:13619] [ 310.780027] TPC[988ccc] O7[cc5ab8] I7[989064] RPC[98901c] [ 373.795586] rcu: INFO: rcu_sched detected stalls on CPUs/tasks: [ 373.807192] rcu: 3-...!: (259 GPs behind) idle=cef8/0/0x0 softirq=1081/1081 fqs=1 (false positive?) [ 373.825415] rcu: 27-...!: (313 GPs behind) idle=5d60/0/0x0 softirq=284/286 fqs=1 (false positive?) [ 373.843469] rcu: 34-...!: (261 GPs behind) idle=4088/0/0x0 softirq=524/524 fqs=1 (false positive?) [ 373.861522] rcu: 46-...!: (261 GPs behind) idle=7a78/0/0x0 softirq=258/259 fqs=1 (false positive?) 
[ 373.879591] rcu: (detected by 98, t=21140 jiffies, g=5933, q=98416 ncpus=128) [ 373.894056] CPU[ 3]: TSTATE[0000004411081605] TPC[000000000043d524] TNPC[000000000043d528] TASK[swapper/3:0] [ 373.894061] TPC[arch_cpu_idle+0x84/0xc0] O7[arch_cpu_idle+0x70/0xc0] I7[default_idle_call+0x30/0x160] RPC[do_idle+0x104/0x1e0] RPC[do_idle+0x104/0x1e0] [ 373.894138] CPU[ 27]: TSTATE[0000004411081605] TPC[000000000043d524] TNPC[000000000043d528] TASK[swapper/27:0] [ 373.894142] TPC[arch_cpu_idle+0x84/0xc0] O7[arch_cpu_idle+0x70/0xc0] I7[default_idle_call+0x30/0x160] RPC[do_idle+0x104/0x1e0] [ 373.894209] CPU[ 34]: TSTATE[0000004411081605] TPC[000000000043d524] TNPC[000000000043d528] TASK[swapper/34:0] [ 373.894214] TPC[arch_cpu_idle+0x84/0xc0] O7[arch_cpu_idle+0x70/0xc0] I7[default_idle_call+0x30/0x160] RPC[do_idle+0x104/0x1e0] [ 373.894280] CPU[ 46]: TSTATE[0000004411081605] TPC[000000000043d524] TNPC[000000000043d528] TASK[swapper/46:0] [ 373.894284] TPC[arch_cpu_idle+0x84/0xc0] O7[arch_cpu_idle+0x70/0xc0] I7[default_idle_call+0x30/0x160] RPC[do_idle+0x104/0x1e0] [ 373.894303] rcu: rcu_sched kthread timer wakeup didn't happen for 15744 jiffies! g5933 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 [ 374.097545] rcu: Possible timer handling issue on cpu=66 timer-softirq=248 [ 374.111434] rcu: rcu_sched kthread starved for 15800 jiffies! g5933 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=66 [ 374.132103] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior. 
[ 374.149976] rcu: RCU grace-period kthread stack dump: [ 374.160042] task:rcu_sched state:I stack:26936 pid:15 tgid:15 ppid:2 task_flags:0x208040 flags:0x07000000 [ 374.182088] Call Trace: [ 374.186950] [<0000000000fd993c>] schedule+0x1c/0x180 [ 374.196844] [<0000000000fe19f0>] schedule_timeout+0x70/0x100 [ 374.208132] [<0000000000530064>] rcu_gp_fqs_loop+0x104/0x4e0 [ 374.219420] [<0000000000535474>] rcu_gp_kthread+0x134/0x180 [ 374.230538] [<00000000004af254>] kthread+0x114/0x160 [ 374.240433] [<00000000004060f0>] ret_from_fork+0x24/0x34 [ 374.251009] [<0000000000000000>] 0x0 [ 374.258131] rcu: Stack dump where RCU GP kthread last ran: [ 374.269126] CPU[ 66]: TSTATE[0000004411081605] TPC[000000000043d524] TNPC[000000000043d528] TASK[swapper/66:0] [ 374.289376] TPC[arch_cpu_idle+0x84/0xc0] O7[arch_cpu_idle+0x70/0xc0] I7[default_idle_call+0x30/0x160] RPC[do_idle+0x104/0x1e0] [ 395.314730] rcu: INFO: rcu_sched detected stalls on CPUs/tasks: [ 395.326335] rcu: 3-...!: (260 GPs behind) idle=d8c8/0/0x0 softirq=1081/1081 fqs=0 (false positive?) [ 395.344552] rcu: 27-...!: (314 GPs behind) idle=5da8/0/0x0 softirq=284/286 fqs=0 (false positive?) [ 395.362602] rcu: 34-...!: (262 GPs behind) idle=4670/0/0x0 softirq=524/524 fqs=0 (false positive?) [ 395.380667] rcu: 46-...!: (262 GPs behind) idle=7b00/0/0x0 softirq=258/259 fqs=0 (false positive?) 
[ 395.398729] rcu: (detected by 54, t=5275 jiffies, g=5937, q=96560 ncpus=128) [ 395.413016] CPU[ 3]: TSTATE[0000004411081605] TPC[000000000043d524] TNPC[000000000043d528] TASK[swapper/3:0] [ 395.413022] TPC[arch_cpu_idle+0x84/0xc0] O7[arch_cpu_idle+0x70/0xc0] I7[default_idle_call+0x30/0x160] RPC[do_idle+0x104/0x1e0] [ 395.413106] CPU[ 27]: TSTATE[0000004411081605] TPC[000000000043d524] TNPC[000000000043d528] TASK[swapper/27:0] [ 395.413111] TPC[arch_cpu_idle+0x84/0xc0] O7[arch_cpu_idle+0x70/0xc0] I7[default_idle_call+0x30/0x160] RPC[do_idle+0x104/0x1e0] [ 395.413178] CPU[ 34]: TSTATE[0000004411081605] TPC[000000000043d524] TNPC[000000000043d528] TASK[swapper/34:0] [ 395.413183] TPC[arch_cpu_idle+0x84/0xc0] O7[arch_cpu_idle+0x70/0xc0] I7[default_idle_call+0x30/0x160] RPC[do_idle+0x104/0x1e0] [ 395.413249] CPU[ 46]: TSTATE[0000004411081605] TPC[000000000043d524] TNPC[000000000043d528] TASK[swapper/46:0] [ 395.413254] TPC[arch_cpu_idle+0x84/0xc0] O7[arch_cpu_idle+0x70/0xc0] I7[default_idle_call+0x30/0x160] RPC[do_idle+0x104/0x1e0] [ 395.413273] rcu: rcu_sched kthread timer wakeup didn't happen for 5274 jiffies! g5937 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 [ 395.616352] rcu: Possible timer handling issue on cpu=66 timer-softirq=248 [ 395.630237] rcu: rcu_sched kthread starved for 5330 jiffies! g5937 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=66 [ 395.650718] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior. 
[ 395.668597] rcu: RCU grace-period kthread stack dump:
[ 395.678674] task:rcu_sched state:I stack:26936 pid:15 tgid:15 ppid:2 task_flags:0x208040 flags:0x07000000
[ 395.700722] Call Trace:
[ 395.705582] [<0000000000fd993c>] schedule+0x1c/0x180
[ 395.715474] [<0000000000fe19f0>] schedule_timeout+0x70/0x100
[ 395.726758] [<0000000000530064>] rcu_gp_fqs_loop+0x104/0x4e0
[ 395.738043] [<0000000000535474>] rcu_gp_kthread+0x134/0x180
[ 395.749158] [<00000000004af254>] kthread+0x114/0x160
[ 395.759054] [<00000000004060f0>] ret_from_fork+0x24/0x34
[ 395.769643] [<0000000000000000>] 0x0
[ 395.776763] rcu: Stack dump where RCU GP kthread last ran:
[ 395.787741] CPU[ 66]: TSTATE[0000004411081605] TPC[000000000043d524] TNPC[000000000043d528] TASK[swapper/66:0]
[ 395.808014] TPC[arch_cpu_idle+0x84/0xc0] O7[arch_cpu_idle+0x70/0xc0] I7[default_idle_call+0x30/0x160] RPC[do_idle+0x104/0x1e0]

>
> If that's not possible as the system is unresponsive, then please add
> 'ftrace_dump_on_oops' on the kernel command line or enable it after boot
> in /proc/sys/kernel and let the kernel panic when it hits the hung task
> detector fail.
>
> Thanks,
>
>         tglx
> ---
> --- a/arch/sparc/kernel/time_64.c
> +++ b/arch/sparc/kernel/time_64.c
> @@ -732,8 +732,10 @@ void __irq_entry timer_interrupt(int irq
>          if (unlikely(!evt->event_handler)) {
>                  printk(KERN_WARNING
>                         "Spurious SPARC64 timer interrupt on cpu %d\n", cpu);
> -        } else
> +        } else {
> +                trace_printk("Invoking handler %pS\n", evt->event_handler);
>                  evt->event_handler(evt);
> +        }
>          irq_exit();
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -248,6 +248,7 @@ static void hung_task_info(struct task_s
>           * accordingly
>           */
>          if (sysctl_hung_task_warnings || hung_task_call_panic) {
> +                tracing_off();
>                  if (sysctl_hung_task_warnings > 0)
>                          sysctl_hung_task_warnings--;
>                  pr_err("INFO: task %s:%d blocked%s for more than %ld
> seconds.\n",
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -370,18 +370,22 @@ int clockevents_program_event(struct clo
>                  delta = min(delta, (int64_t) dev->max_delta_ns);
>                  cycles = ((u64)delta * dev->mult) >> dev->shift;
>                  if (!dev->set_next_event((unsigned long) cycles, dev)) {
> +                        trace_printk("Successfully programmed %lld %lld\n",
> expires, delta);
>                          dev->next_event_forced = 0;
>                          return 0;
>                  }
>          }
> -        if (dev->next_event_forced)
> +        if (dev->next_event_forced) {
> +                trace_printk("Skipping %lld %lld\n", expires, delta);
>                  return 0;
> +        }
>          if (dev->set_next_event(dev->min_delta_ticks, dev)) {
>                  if (!force || clockevents_program_min_delta(dev))
>                          return -ETIME;
>          }
> +        trace_printk("Force programmed min delta %lld %lld\n", expires,
> delta);
>          dev->next_event_forced = 1;
>          return 0;
>  }

^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: the stuttering regression in 7.0: should I have done something different
  2026-05-12  5:03 ` the stuttering regression in 7.0: should I have done something different Tony Rodriguez
@ 2026-05-12  8:17   ` Thomas Gleixner
  2026-05-12 21:43     ` Tony Rodriguez
  0 siblings, 1 reply; 10+ messages in thread
From: Thomas Gleixner @ 2026-05-12  8:17 UTC (permalink / raw)
  To: Tony Rodriguez
  Cc: Linux kernel regressions list, LKML, sparclinux,
      John Paul Adrian Glaubitz, Thorsten Leemhuis

On Mon, May 11 2026 at 22:03, Tony Rodriguez wrote:
>> Can you apply the debug patch below, which will disable tracing once it
>> hits the hung task detector and then retrieve the trace?
> As requested, I applied your debug patch to v7.1-rc3 and captured the
> trace output. On the SPARC64 S7-2 system the machine becomes
> unresponsive and produces many thousands of lines of trace data that
> do not appear to terminate.

Yes, it takes a while to spill out over serial.

> Posting the full output inline or as an attachment may be impractical,
> so I've included the key sections below.

Kinda.

> If you prefer the complete trace, please let me know the best way to
> provide it. Guessing the kernel mailing list isn't best to attach that?

Correct.

> [ 249.004209] [<0000000000000000>] 0x0
> [ 249.019116] Dumping ftrace buffer:
> [ 249.025666] ---------------------------------
> [ 249.034534] <idle>-0 0d.... 1836659us :
> clockevents_program_event: Successfully programmed 4000000 4000000
> [ 249.055418] <idle>-0 0d.h.. 1845926us : timer_interrupt:

So this is the interesting part, but that's starting at 1.836659s
while the actual problem happens ~120 seconds later and the detection
takes another 120 seconds.

Assuming that one of the CPUs does not get timer interrupts anymore, the
trace of that CPU should end around the time the last programming
happened. So the interesting part is at the end of the output.

The default buffer size per CPU is 1408k, which holds about 150k
entries, so we can just shorten the buffers to make this less painful.

Can you add 'trace_buf_size=50k' to the kernel command line, which
limits the buffer size to about 640 entries. Assuming 115200 Baud this
should then take about 4 seconds per CPU to dump, which still is a bunch
on a large machine, but definitely way more workable than the default.

IIRC, SPARC64 S7-2 has 128 threads total, so the resulting uncompressed
output should be around 7-8M. That's highly compressible text, so the
resulting dump.xz should be suitable to be stored in github. If github
does not allow you, let me know and we work something out.

Thanks,

        tglx

^ permalink raw reply [flat|nested] 10+ messages in thread
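The sizing argument in this message can be checked with a little arithmetic. The sketch below is only an illustration: `estimate_dump()` and `struct dump_estimate` are made-up names, it assumes 8N1 framing (10 bits on the wire per character), and the average printed line length is a guess supplied by the caller, not a figure taken from the thread.

```c
#include <stdint.h>

/*
 * Rough estimate of how long an ftrace dump takes on a serial console.
 * Assumptions (not from the thread): 8N1 framing, so 10 bits per
 * character, and an average printed line length chosen by the caller.
 */
struct dump_estimate {
	uint64_t bytes_per_cpu;		/* printed bytes for one CPU buffer */
	uint64_t bytes_total;		/* printed bytes for all CPUs */
	double seconds_per_cpu;		/* wire time for one CPU buffer */
	double seconds_total;		/* wire time for the whole dump */
};

static struct dump_estimate estimate_dump(unsigned int entries_per_cpu,
					  unsigned int bytes_per_line,
					  unsigned int nr_cpus,
					  unsigned int baud)
{
	struct dump_estimate e;
	/* 8N1: one start bit, eight data bits, one stop bit per byte */
	double bytes_per_second = baud / 10.0;

	e.bytes_per_cpu = (uint64_t)entries_per_cpu * bytes_per_line;
	e.bytes_total = e.bytes_per_cpu * nr_cpus;
	e.seconds_per_cpu = e.bytes_per_cpu / bytes_per_second;
	e.seconds_total = e.bytes_total / bytes_per_second;
	return e;
}
```

Calling `estimate_dump(640, 90, 128, 115200)` with an assumed 90 bytes per printed line yields 57600 bytes and 5 seconds per CPU, and about 7.4 MB for all 128 CPUs, which is in the same range as the "about 4 seconds" and "7-8M" quoted above.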
* Re: the stuttering regression in 7.0: should I have done something different
  2026-05-12  8:17   ` Thomas Gleixner
@ 2026-05-12 21:43     ` Tony Rodriguez
  2026-05-13 20:28       ` Thomas Gleixner
  0 siblings, 1 reply; 10+ messages in thread
From: Tony Rodriguez @ 2026-05-12 21:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linux kernel regressions list, LKML, sparclinux,
      John Paul Adrian Glaubitz, Thorsten Leemhuis

On 5/12/26 1:17 AM, Thomas Gleixner wrote:
>> [ 249.004209] [<0000000000000000>] 0x0
>> [ 249.019116] Dumping ftrace buffer:
>> [ 249.025666] ---------------------------------
>> [ 249.034534] <idle>-0 0d.... 1836659us :
>> clockevents_program_event: Successfully programmed 4000000 4000000
>> [ 249.055418] <idle>-0 0d.h.. 1845926us : timer_interrupt:
> So this is the interesting part, but that's starting at 1.836659s
> while the actual problem happens ~120 seconds later and the detection
> takes another 120 seconds.
>
> Assuming that one of the CPUs does not get timer interrupts anymore, the
> trace of that CPU should end around the time the last programming
> happened. So the interesting part is at the end of the output. The
> default buffer size per CPU is 1408k, which holds about 150k entries, so
> we can just shorten the buffers to make this less painful.
>
> Can you add 'trace_buf_size=50k' to the kernel command line, which
> limits the buffer size to about 640 entries. Assuming 115200 Baud this
> should then take about 4 seconds per CPU to dump, which still is a bunch
> on a large machine, but definitely way more workable than the default.

Done. The complete trace file "s7-2-05122026-dump.tar.gz" can be
obtained from my GitHub repo:

https://github.com/unixpro1970/Sparc64-Kernel-Debugging-Dumps

> IIRC, SPARC64 S7-2 has 128 threads total, so the resulting uncompressed
> output should be around 7-8M. That's highly compressible text, so the
> resulting dump.xz should be suitable to be stored in github. If github
> does not allow you, let me know and we work something out.
>
> Thanks,
>
> tglx
>

^ permalink raw reply [flat|nested] 10+ messages in thread
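The quoted reasoning (the trace of a CPU that stopped getting timer interrupts should end around its last programming) suggests a simple way to skim such a dump. The following is a minimal sketch under stated assumptions, not a tool from the thread: `record_line()`, `trace_ends_unanswered()`, `struct cpu_state` and `MAX_CPUS` are hypothetical names, and the parser assumes each dump line has the shape `... <CPU><flags> <timestamp>us : <function>: <message>` with the message strings emitted by the debug patch.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CPUS 128	/* assumption: enough for the 128 threads mentioned */

enum ev_kind { EV_NONE = 0, EV_IRQ, EV_PROGRAMMED, EV_FORCED, EV_SKIPPED };

struct cpu_state {
	enum ev_kind last;		/* kind of the most recent entry */
	unsigned long long last_us;	/* its trace timestamp in microseconds */
};

/*
 * Classify one dump line and record it as the latest event of its CPU.
 * Returns the CPU number, or -1 if the line is not a recognised entry.
 */
static int record_line(const char *line, struct cpu_state cpus[MAX_CPUS])
{
	const char *sep = strstr(line, "us : ");
	const char *ts, *tok;
	enum ev_kind kind;
	char *end;
	unsigned long cpu;

	if (!sep)
		return -1;

	/* Timestamp: the run of digits directly in front of "us". */
	ts = sep;
	while (ts > line && isdigit((unsigned char)ts[-1]))
		ts--;
	if (ts == sep)
		return -1;

	/* CPU+flags token: the blank-delimited word before the timestamp. */
	tok = ts;
	while (tok > line && tok[-1] == ' ')
		tok--;
	while (tok > line && tok[-1] != ' ')
		tok--;

	cpu = strtoul(tok, &end, 10);
	if (end == tok || cpu >= MAX_CPUS)
		return -1;

	if (strstr(sep, "timer_interrupt:"))
		kind = EV_IRQ;
	else if (strstr(sep, "Force programmed min delta"))
		kind = EV_FORCED;
	else if (strstr(sep, "Skipping"))
		kind = EV_SKIPPED;
	else if (strstr(sep, "Successfully programmed"))
		kind = EV_PROGRAMMED;
	else
		return -1;

	cpus[cpu].last = kind;
	cpus[cpu].last_us = strtoull(ts, NULL, 10);
	return (int)cpu;
}

/* True if the CPU's trace ends with a forced or skipped programming. */
static int trace_ends_unanswered(const struct cpu_state *c)
{
	return c->last == EV_FORCED || c->last == EV_SKIPPED;
}
```

A hit is only a candidate: a healthy CPU can also end on a programming entry if tracing was switched off right after it, so `last_us` should be compared against the newest timestamp in the dump before drawing conclusions.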
* Re: the stuttering regression in 7.0: should I have done something different
  2026-05-12 21:43     ` Tony Rodriguez
@ 2026-05-13 20:28       ` Thomas Gleixner
  2026-05-14  7:24         ` Tony Rodriguez
  0 siblings, 1 reply; 10+ messages in thread
From: Thomas Gleixner @ 2026-05-13 20:28 UTC (permalink / raw)
  To: Tony Rodriguez
  Cc: Linux kernel regressions list, LKML, sparclinux,
      John Paul Adrian Glaubitz, Thorsten Leemhuis

Tony!

On Tue, May 12 2026 at 14:43, Tony Rodriguez wrote:
>> Can you add 'trace_buf_size=50k' to the kernel command line, which
>> limits the buffer size to about 640 entries. Assuming 115200 Baud this
>> should then take about 4 seconds per CPU to dump, which still is a bunch
>> on a large machine, but definitely way more workable than the default.
>
> Done. The complete trace file "s7-2-05122026-dump.tar.gz" can be
> obtained from my GitHub repo:
>
> https://github.com/unixpro1970/Sparc64-Kernel-Debugging-Dumps

Thanks for providing the data.

So in both traces there is a clear indication that the forced programmed
min delta does not result in an interrupt. Here are the last trace
events on the affected CPUs.

Without avahi, CPU 116:

[ 280.939873] <idle>-0 116d.h.. 11612209us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 280.980493] <idle>-0 116d.h.. 11612213us : clockevents_program_event: Successfully programmed 9580000000 3991235
[ 281.023902] <idle>-0 116d.... 11612218us : clockevents_program_event: Successfully programmed 10112024440 536010830
[ 281.089687] <idle>-0 116dn... 11636205us : clockevents_program_event: Force programmed min delta 9600000000 10

Without avahi, CPU 100:

[ 299.943989] systemd-1 100d.h.. 27594794us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 299.964303] systemd-1 100d.h.. 27594796us : clockevents_program_event: Successfully programmed 25560000000 1407865
[ 299.986182] systemd-1 100d.... 27594932us : clockevents_program_event: Force programmed min delta 1 -25558727644
[ 300.007707] systemd-1 100d.h.. 27594933us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 300.028019] systemd-1 100d.h.. 27594934us : clockevents_program_event: Successfully programmed 25560000000 1269565
[ 300.049894] systemd-1 100d.... 27594971us : clockevents_program_event: Force programmed min delta 1 -25558767244
[ 300.071415] systemd-1 100d.... 27598043us : clockevents_program_event: Skipping 25560000000 -1838405

With avahi, CPU 6:

[ 1247.573212] <idle>-0 6d.h.. 84194945us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 1247.613828] <idle>-0 6d.h.. 84194947us : clockevents_program_event: Successfully programmed 80140000000 3928334
[ 1247.762267] <idle>-0 6d.h.. 84198876us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 1247.844549] <idle>-0 6d.h.. 84198878us : clockevents_program_event: Force programmed min delta 80140000000 771

With avahi, CPU 61:

[ 1258.222440] <idle>-0 61d.h.. 84234905us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 1258.516354] <idle>-0 61dnh.. 84234910us : clockevents_program_event: Successfully programmed 84176000000 3999995280
[ 1258.648636] <idle>-0 61dn... 84234914us : clockevents_program_event: Successfully programmed 80180000000 3991863
[ 1258.868940] <idle>-0 61d.h.. 84238906us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 1258.993594] <idle>-0 61d.h.. 84238908us : clockevents_program_event: Force programmed min delta 80180000000 612

So there is only one case (CPU100) where another programming of an event
in the past (delta < 0) is skipped due to the force bit being set. But
that skip happens ~3ms after the min delta was programmed, which should
have resulted in an interrupt which never happened.

The original code is not really different vs. that min delta
programming, except that it does not have the next_event_forced
logic. But as you can see above this logic is not really making a
difference.

So I went through the differences line by line again and I found a very
subtle difference, but I can't see how that would magically cure the
actual problem of the non-firing interrupt.

The missing update of dev->next_event in the force reprogram case of
(delta <= 0) is completely irrelevant as both events are in the past so
it does not matter at all. Nevertheless see the pointless and purely
cosmetic delta patch below.

But coming back to the trace data. There are tons of instances where the
forced programmed min delta results in an interrupt right afterwards:

[ 1258.868940] <idle>-0 61d.h.. 84238906us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 1258.889262] <idle>-0 60d.h.. 84238906us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 1258.909570] <idle>-0 63d.h.. 84238906us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 1258.929877] <idle>-0 70d.h.. 84238906us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 1258.950197] <idle>-0 60d.h.. 84238907us : clockevents_program_event: Force programmed min delta 80180000000 552
[ 1258.971896] <idle>-0 63d.h.. 84238908us : clockevents_program_event: Force programmed min delta 80180000000 627
[ 1258.993594] <idle>-0 61d.h.. 84238908us : clockevents_program_event: Force programmed min delta 80180000000 612
[ 1259.015292] <idle>-0 70d.h.. 84238908us : clockevents_program_event: Force programmed min delta 80180000000 313
[ 1259.036992] <idle>-0 60d.h.. 84238910us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 1259.057313] <idle>-0 63d.h.. 84238910us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280
[ 1259.077620] <idle>-0 70d.h.. 84238912us : timer_interrupt: Invoking handler hrtimer_interrupt+0x0/0x280

So all four involved CPUs force program min delta from the timer
interrupt context, but only three of them actually get an interrupt
afterwards. CPU61 fails to deliver one and as a result it goes stale.

As the set_next_event() callback returns 0 (success) in all cases -
otherwise we wouldn't see the trace entry - this all points to a problem
with that rearming logic:

        exp = read_cnt() + delta_ticks;
        write_cmp(exp);
        return (read_cnt() - exp) > 0 ? -ETIME : 0;

Your machine uses 'stick', which runs according to the conversion
factors in dmesg at 1GHz, but the CPU runs at 4.27GHz AFAIK. So you can
clearly run into a situation like this:

        TICK_CNT        CPU
        T1              exp = read_cnt() + D
        ...             // Some delay
        T1 + D          write_cmp(T1 + D)
                        now = read_cnt()        // Reads T1 + D
        T1 + D + 1
                        ---> returns success and the interrupt is never firing

Why?

Just to be clear: I never saw the VHDL code of that CPU, but that
pattern is way too familiar.

Those equal comparators, which were designed by AI (Absence of
Intelligence) before AI got popular, generally work this way:

    The comparator is only evaluated on the clock edge which increments
    the counter, but not when the comparator value is written. So a write
    of the same value does not result in an interrupt.

That's an "optimization" which spares quite a few gates and is obviously
nowhere documented. So software has to deal with the consequences by
using a crystal ball, which is trivial to get wrong and can go unnoticed
for a long time until it rears its ugly head at some point for whatever
reasons.

I'm willing to bet a round of beers at the next conference that this is
the problem and that it will magically disappear when you change that
condition to:

        return (read_cnt() - exp) >= 0 ? -ETIME : 0;

unless they managed to add some extra propagation delay to that
comparator write like the HPET folks did at some point without telling
anyone. I doubt the SPARC janitor who implemented it did so because
that would have made the failure way more likely.

I have truly no idea why the original code did not expose this problem,
though it might have been just papered over by sheer luck and timing.

Thanks,

        tglx
---
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -381,6 +381,8 @@ int clockevents_program_event(struct clo
         if (dev->set_next_event(dev->min_delta_ticks, dev)) {
                 if (!force || clockevents_program_min_delta(dev))
                         return -ETIME;
+        } else if (delta <= 0) {
+                dev->next_event = ktime_add_ns(ktime_get(), dev->min_delta_ns);
         }
         dev->next_event_forced = 1;
         return 0;

^ permalink raw reply [flat|nested] 10+ messages in thread
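The race described in this message can be reproduced with a toy model. This is a sketch of the hypothesis only, not of the real hardware or of the kernel code: all `toy_*` names are invented for illustration, and the model simply assumes that the comparator is evaluated on counter increments and never on a write to the compare register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a free-running counter with an equality comparator. */
struct toy_timer {
	uint64_t cnt;	/* counter value */
	uint64_t cmp;	/* compare register */
	bool fired;	/* set once the comparator matched */
};

/* One counter clock edge: increment, then evaluate the comparator. */
static void toy_tick(struct toy_timer *t)
{
	t->cnt++;
	if (t->cnt == t->cmp)
		t->fired = true;
}

/* Assumed behaviour: a register write does NOT evaluate the comparator. */
static void toy_write_cmp(struct toy_timer *t, uint64_t val)
{
	t->cmp = val;
}

/*
 * Rearm sequence from the mail. @delay_ticks counter ticks elapse between
 * reading the counter and writing the compare register. With @inclusive
 * false the check is "> 0", with true it is ">= 0". Returns nonzero when
 * the caller should treat the event as already expired (-ETIME in the
 * kernel).
 */
static int toy_set_next_event(struct toy_timer *t, uint64_t delta,
			      uint64_t delay_ticks, bool inclusive)
{
	uint64_t exp = t->cnt + delta;
	int64_t diff;

	for (uint64_t i = 0; i < delay_ticks; i++)
		toy_tick(t);

	toy_write_cmp(t, exp);
	diff = (int64_t)(t->cnt - exp);
	return inclusive ? diff >= 0 : diff > 0;
}

/* Let the counter run and report whether the interrupt ever fired. */
static bool toy_run(struct toy_timer *t, unsigned int ticks)
{
	while (ticks--)
		toy_tick(t);
	return t->fired;
}
```

When the delay equals the delta, the write lands exactly on the current counter value: the `> 0` variant reports success although the model never fires, while the `>= 0` variant reports the event as expired so the caller can retry. A delay one tick shorter still fires normally in both variants.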
* Re: the stuttering regression in 7.0: should I have done something different
  2026-05-13 20:28 ` Thomas Gleixner
@ 2026-05-14  7:24 ` Tony Rodriguez
  2026-05-14 10:24   ` Thomas Gleixner
  0 siblings, 1 reply; 10+ messages in thread
From: Tony Rodriguez @ 2026-05-14 7:24 UTC
To: Thomas Gleixner
Cc: Linux kernel regressions list, LKML, sparclinux, John Paul Adrian Glaubitz, Thorsten Leemhuis, Linus Torvalds

Hi Thomas,

Cheers! Initial validation of the test patches for v7.0.6 and 7.1-rc3 on
the S7-2 looks promising: I have not observed panics, timer delays, or
other timer-related issues so far. I will pause broader validation on the
S7-2 and T7-1 until I receive your recommendation or any requested
revisions (see inline comments below).

Note: I did see an intermittent error on the S7-2 running 7.1-rc3, usually
when the system is under heavy load during a kernel build. I’m not sure
whether it is a separate problem?

"[676.464681] BUG: Bad rss-counter state mm:000000008d9f1cf2
type:MM_FILEPAGES val:-4096 Comm:cc1 Pid:78165".

On 5/13/26 1:28 PM, Thomas Gleixner wrote:
> Just to be clear: I never saw the VHDL code of that CPU, but that
> pattern is way too familiar.
>
> Those equal comparators, which were designed by AI (Absence of
> Intelligence) before AI got popular, generally work this way:
>
>   The comparator is only evaluated on the clock edge which increments
>   the counter, but not when the comparator value is written. So a write
>   of the same value does not result in an interrupt.
>
> That's an "optimization" which spares quite a few gates and is obviously
> nowhere documented. So software has to deal with the consequences by
> using a crystal ball, which is trivial to get wrong and can go unnoticed
> for a long time until it rears its ugly head at some point for whatever
> reasons.
>
> I'm willing to bet a round of beers at the next conference that this is
> the problem and that it will magically disappear when you change that
> condition to:
>
>       return (read_cnt() - exp) >= 0 ? -ETIME : 0;

Attempted to locate "return (read_cnt() - exp) >= 0 ? -ETIME : 0;" but
could not find an exact match. After additional inspection I updated the
following functions "tick_add_compare()" and "stick_add_compare()" in
arch/sparc/kernel/time_64.c from "> 0L" to ">= 0L". This appears to have
resolved the lost-timer behavior.

--- time_64.c.orig
+++ time_64.c
@@ -146,7 +146,7 @@
 		: "=r" (new_tick));
 	new_tick &= ~TICKCMP_IRQ_BIT;
 
-	return ((long)(new_tick - (orig_tick+adj))) > 0L;
+	return ((long)(new_tick - (orig_tick+adj))) >= 0L;
 }
 
 static unsigned long tick_add_tick(unsigned long adj)
@@ -277,7 +277,7 @@
 		: "=r" (new_tick));
 	new_tick &= ~TICKCMP_IRQ_BIT;
 
-	return ((long)(new_tick - (orig_tick+adj))) > 0L;
+	return ((long)(new_tick - (orig_tick+adj))) >= 0L;
 }
 
 static unsigned long stick_get_frequency(void)

>
> unless they managed to add some extra propagation delay to that
> comparator write like the HPET folks did at some point without telling
> anyone. I doubt the SPARC janitor who implemented it did so because
> that would have made the failure way more likely.
>
> I have truly no idea why the original code did not expose this problem,
> though it might have been just papered over by sheer luck and timing.
>
> Thanks,
>
>         tglx
> ---
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -381,6 +381,8 @@ int clockevents_program_event(struct clo
>  	if (dev->set_next_event(dev->min_delta_ticks, dev)) {
>  		if (!force || clockevents_program_min_delta(dev))
>  			return -ETIME;
> +	} else if (delta <= 0) {
> +		dev->next_event = ktime_add_ns(ktime_get(), dev->min_delta_ns);
>  	}
>  	dev->next_event_forced = 1;
>  	return 0;
>

You mentioned this kernel/time/clockevents.c patch is optional, but I
propose revising clockevents_program_event(). If the requested event time
is already at or before now, record a sane next_event (now + min_delta) so
core code sees a future expected time and can behave correctly. Does this
seem reasonable?

--- clockevents.c.orig
+++ clockevents.c
@@ -347,6 +347,11 @@
 	if (dev->set_next_event(dev->min_delta_ticks, dev)) {
 		if (!force || clockevents_program_min_delta(dev))
 			return -ETIME;
+	} else {
+		ktime_t now = ktime_get();
+		s64 delta_ns = ktime_to_ns(ktime_sub(expires, now));
+		if (delta_ns <= 0)
+			dev->next_event = ktime_add_ns(now, dev->min_delta_ns);
 	}
 	dev->next_event_forced = 1;
 	return 0;

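The comparator behaviour Thomas describes above can be modelled in a few
lines of C. This is only a sketch under stated assumptions: struct
cmp_timer, timer_clock_edge(), timer_add_compare() and check_equal_case()
are hypothetical names, not kernel or SPARC code, and the model simply
assumes a comparator that is evaluated on the counter's increment edge and
not on a compare-register write, as guessed in the thread. It shows why a
"> 0" check can report a timer as armed that will not fire, while ">= 0"
flags the same situation as missed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: free-running counter with an equality comparator. */
struct cmp_timer {
	uint64_t counter;	/* current counter value */
	uint64_t compare;	/* programmed compare value */
	bool irq_pending;	/* set once the comparator has matched */
};

/* One clock edge: increment the counter, then evaluate the comparator. */
void timer_clock_edge(struct cmp_timer *t)
{
	t->counter++;
	if (t->counter == t->compare)
		t->irq_pending = true;
}

/*
 * Program "counter + adj" as the next event. 'edges' models counter ticks
 * that elapse between reading the counter and the compare write landing.
 * The write itself does not evaluate the comparator (the assumption above).
 * Returns true if the caller should treat the event as already missed.
 */
bool timer_add_compare(struct cmp_timer *t, uint64_t adj, unsigned int edges,
		       bool inclusive)
{
	uint64_t target = t->counter + adj;
	int64_t diff;

	for (unsigned int i = 0; i < edges; i++)
		timer_clock_edge(t);

	t->compare = target;
	diff = (int64_t)(t->counter - target);

	return inclusive ? diff >= 0 : diff > 0;
}

/* The equal case: the write lands exactly when counter == target. */
void check_equal_case(void)
{
	struct cmp_timer t = { .counter = 1000, .compare = 0 };

	/* "> 0" reports the timer as armed ... */
	assert(!timer_add_compare(&t, 5, 5, false));

	/* ... but the matching edge is already gone, so nothing fires. */
	for (int i = 0; i < 1000; i++)
		timer_clock_edge(&t);
	assert(!t.irq_pending);
}
```

Calling timer_add_compare() with inclusive set to true on the same starting
state returns true, which corresponds to the -ETIME path in which the
caller reprograms the event instead of waiting for an interrupt that, in
this model, would only arrive after the counter wraps.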
* Re: the stuttering regression in 7.0: should I have done something different
  2026-05-14  7:24 ` Tony Rodriguez
@ 2026-05-14 10:24 ` Thomas Gleixner
  2026-05-15  4:47   ` Tony Rodriguez
  0 siblings, 1 reply; 10+ messages in thread
From: Thomas Gleixner @ 2026-05-14 10:24 UTC
To: Tony Rodriguez
Cc: Linux kernel regressions list, LKML, sparclinux, John Paul Adrian Glaubitz, Thorsten Leemhuis, Linus Torvalds

On Thu, May 14 2026 at 00:24, Tony Rodriguez wrote:
> Initial validation of the test patches for v7.0.6 and 7.1-rc3 on the
> S7-2 looks promising: I have not observed panics, timer delays, or other
> timer-related issues so far. I will pause broader validation on the S7-2
> and T7-1 until I receive your recommendation or any requested revisions
> (see inline comments below).
>
> Note: I did see an intermittent error on the S7-2 running 7.1-rc3,
> usually when the system is under heavy load during a kernel build. I’m
> not sure whether it is a separate problem?
>
> "[676.464681] BUG: Bad rss-counter state mm:000000008d9f1cf2
> type:MM_FILEPAGES val:-4096 Comm:cc1 Pid:78165".

That's unrelated and an accounting issue in the MM code. Please report it
separately to the MM people.

> On 5/13/26 1:28 PM, Thomas Gleixner wrote:
>> I'm willing to bet a round of beers at the next conference that this is
>> the problem and that it will magically disappear when you change that
>> condition to:
>>
>>       return (read_cnt() - exp) >= 0 ? -ETIME : 0;
>
> Attempted to locate "return (read_cnt() - exp) >= 0 ? -ETIME : 0;" but
> could not find an exact match. After additional inspection I updated the
> following functions "tick_add_compare()" and "stick_add_compare()" in
> arch/sparc/kernel/time_64.c from "> 0L" to ">= 0L". This appears to
> have resolved the lost-timer behavior.

I condensed the logic for illustration and rightfully assumed that you
will figure it out. :)

> --- time_64.c.orig
> +++ time_64.c
> @@ -146,7 +146,7 @@
>  		: "=r" (new_tick));
>  	new_tick &= ~TICKCMP_IRQ_BIT;
>
> -	return ((long)(new_tick - (orig_tick+adj))) > 0L;
> +	return ((long)(new_tick - (orig_tick+adj))) >= 0L;
>  }
>
>  static unsigned long tick_add_tick(unsigned long adj)
> @@ -277,7 +277,7 @@
>  		: "=r" (new_tick));
>  	new_tick &= ~TICKCMP_IRQ_BIT;
>
> -	return ((long)(new_tick - (orig_tick+adj))) > 0L;
> +	return ((long)(new_tick - (orig_tick+adj))) >= 0L;
>  }

Looks correct, but you missed the one in hbtick_add_compare() which has
the same issue.

>> --- a/kernel/time/clockevents.c
>> +++ b/kernel/time/clockevents.c
>> @@ -381,6 +381,8 @@ int clockevents_program_event(struct clo
>>  	if (dev->set_next_event(dev->min_delta_ticks, dev)) {
>>  		if (!force || clockevents_program_min_delta(dev))
>>  			return -ETIME;
>> +	} else if (delta <= 0) {
>> +		dev->next_event = ktime_add_ns(ktime_get(), dev->min_delta_ns);
>>  	}
>>  	dev->next_event_forced = 1;
>>  	return 0;
>>
> You mentioned this kernel/time/clockevents.c patch is optional, but I
> propose revising clockevents_program_event(). If the requested event
> time is already at or before now, record a sane next_event (now +
> min_delta) so core code sees a future expected time and can behave
> correctly. Does this seem reasonable?

The related core code only cares what the last programmed expiry value in
clock monotonic (i.e. the @expires argument) was. And the only interesting
information is whether it's in the future or not. If it's in the past then
it does not matter how much in the past it is.

Whatever we fake into it is never going to reflect anything related to
reality anyway and there is no guarantee that the code which reads it will
see a future expected time depending on the time elapsed between faking it
and reading it.

So it's truly a cosmetic exercise for no real value.

Thanks,

        tglx

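Thomas's argument about the faked next_event value can be illustrated with
a trivial sketch. The helper names fake_next_event(), looks_like_future()
and check_fake_goes_stale() are hypothetical, the times are plain
nanosecond integers rather than ktime_t, and nothing here is kernel code.
It only shows that whether a reader sees "a future expected time" depends
on how much time passes between recording the fake value and reading it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All times are nanoseconds on a hypothetical monotonic clock. */

/* The proposed fake: pretend the event expires min_delta after 'now'. */
int64_t fake_next_event(int64_t now_ns, int64_t min_delta_ns)
{
	return now_ns + min_delta_ns;
}

/* What a later reader would conclude from the recorded value. */
bool looks_like_future(int64_t next_event_ns, int64_t read_time_ns)
{
	return next_event_ns > read_time_ns;
}

void check_fake_goes_stale(void)
{
	int64_t faked = fake_next_event(1000, 100);	/* recorded as 1100 */

	assert(looks_like_future(faked, 1050));		/* read soon enough */
	assert(!looks_like_future(faked, 1200));	/* read later: past again */
}
```

In this sketch the reader ends up with the same "in the past" answer that
the unmodified expiry value would have given once more than min_delta has
elapsed, which is the sense in which the change is described as cosmetic.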
* Re: the stuttering regression in 7.0: should I have done something different
  2026-05-14 10:24 ` Thomas Gleixner
@ 2026-05-15  4:47 ` Tony Rodriguez
  2026-05-15 15:35   ` Thomas Gleixner
  0 siblings, 1 reply; 10+ messages in thread
From: Tony Rodriguez @ 2026-05-15 4:47 UTC
To: Thomas Gleixner
Cc: Linux kernel regressions list, LKML, sparclinux, John Paul Adrian Glaubitz, Thorsten Leemhuis, Linus Torvalds

Hi Thomas,

I’ve completed validation with the v7.0.7 release and v7.1-rc3 on both
S7-2 and T7-1 systems. Everything looks good.

Thank you again for the debugging guidance and for the feedback on my
original patch addressing the timer starvation issue. It was a pleasure
contributing to the resolution.

PS: I agree that the second patch we discussed isn’t needed; the systems
run correctly without it. The following patch alone is sufficient:

Best regards,
Tony Rodriguez

--- linux-7.1-rc1/arch/sparc/kernel/time_64.c.orig
+++ linux-7.1-rc1/arch/sparc/kernel/time_64.c
@@ -146,7 +146,7 @@
 		: "=r" (new_tick));
 	new_tick &= ~TICKCMP_IRQ_BIT;
 
-	return ((long)(new_tick - (orig_tick+adj))) > 0L;
+	return ((long)(new_tick - (orig_tick+adj))) >= 0L;
 }
 
 static unsigned long tick_add_tick(unsigned long adj)
@@ -277,7 +277,7 @@
 		: "=r" (new_tick));
 	new_tick &= ~TICKCMP_IRQ_BIT;
 
-	return ((long)(new_tick - (orig_tick+adj))) > 0L;
+	return ((long)(new_tick - (orig_tick+adj))) >= 0L;
 }
 
 static unsigned long stick_get_frequency(void)
@@ -411,7 +411,7 @@
 
 	val2 = __hbird_read_stick() & ~TICKCMP_IRQ_BIT;
 
-	return ((long)(val2 - val)) > 0L;
+	return ((long)(val2 - val)) >= 0L;
 }
 
 static unsigned long hbtick_get_frequency(void)

* Re: the stuttering regression in 7.0: should I have done something different
  2026-05-15  4:47 ` Tony Rodriguez
@ 2026-05-15 15:35 ` Thomas Gleixner
  2026-05-15 17:51   ` John Paul Adrian Glaubitz
  0 siblings, 1 reply; 10+ messages in thread
From: Thomas Gleixner @ 2026-05-15 15:35 UTC
To: Tony Rodriguez
Cc: Linux kernel regressions list, LKML, sparclinux, John Paul Adrian Glaubitz, Thorsten Leemhuis, Linus Torvalds

Tony!

On Thu, May 14 2026 at 21:47, Tony Rodriguez wrote:
> I’ve completed validation with the v7.0.7 release and v7.1-rc3 on both
> S7-2 and T7-1 systems. Everything looks good.

Cool!

> Thank you again for the debugging guidance and for the feedback on my
> original patch addressing the timer starvation issue. It was a pleasure
> contributing to the resolution.

Thank you for going through the hassle of chasing it down and providing
the debug data to analyze it.

I'm still puzzled how this went unnoticed for almost two decades:

  112f48716d9f ("[SPARC64]: Add clocksource/clockevents support.")

> PS: I agree that the second patch we discussed isn’t needed; the systems
> run correctly without it. The following patch alone is sufficient:

I assume this patch will surface on a mailing list with a lengthy change
log full of details and find its way into the sparc tree through the usual
channels.

Thanks,

        tglx

* Re: the stuttering regression in 7.0: should I have done something different
  2026-05-15 15:35 ` Thomas Gleixner
@ 2026-05-15 17:51 ` John Paul Adrian Glaubitz
  2026-05-15 19:57   ` Thomas Gleixner
  0 siblings, 1 reply; 10+ messages in thread
From: John Paul Adrian Glaubitz @ 2026-05-15 17:51 UTC
To: Thomas Gleixner, Tony Rodriguez
Cc: Linux kernel regressions list, LKML, sparclinux, Thorsten Leemhuis, Linus Torvalds

Hi Thomas,

On Fri, 2026-05-15 at 17:35 +0200, Thomas Gleixner wrote:
> > Thank you again for the debugging guidance and for the feedback on my
> > original patch addressing the timer starvation issue. It was a pleasure
> > contributing to the resolution.
>
> Thank you for going through the hassle of chasing it down and providing
> the debug data to analyze it.
>
> I'm still puzzled how this went unnoticed for almost two decades:
>
>   112f48716d9f ("[SPARC64]: Add clocksource/clockevents support.")

My suspicion is that it was previously visible only in certain edge cases,
in particular on machines with many cores and high load.

Case in point: In the past, SPARC LDOMs with lots of virtual CPUs could
crash in rare cases when building packages such as GCC or LLVM and running
their testsuites.

I don't know if Tony's patch fixes this long-time issue that we have observed
in the past on Debian's buildds, but I think that the chances aren't too bad.

Tony, please clean up your patch and add an elaborate explanation in the
commit message! Hope to see this fix landed as soon as possible!

Thanks to both of you for hunting this down!

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913

* Re: the stuttering regression in 7.0: should I have done something different
  2026-05-15 17:51 ` John Paul Adrian Glaubitz
@ 2026-05-15 19:57 ` Thomas Gleixner
  0 siblings, 0 replies; 10+ messages in thread
From: Thomas Gleixner @ 2026-05-15 19:57 UTC
To: John Paul Adrian Glaubitz, Tony Rodriguez
Cc: Linux kernel regressions list, LKML, sparclinux, Thorsten Leemhuis, Linus Torvalds

Hi!

On Fri, May 15 2026 at 19:51, John Paul Adrian Glaubitz wrote:
> On Fri, 2026-05-15 at 17:35 +0200, Thomas Gleixner wrote:
>> > Thank you again for the debugging guidance and for the feedback on my
>> > original patch addressing the timer starvation issue. It was a pleasure
>> > contributing to the resolution.
>>
>> Thank you for going through the hassle of chasing it down and providing
>> the debug data to analyze it.
>>
>> I'm still puzzled how this went unnoticed for almost two decades:
>>
>>   112f48716d9f ("[SPARC64]: Add clocksource/clockevents support.")
>
> My suspicion is that it was previously visible only in certain edge cases,
> in particular on machines with many cores and high load.
>
> Case in point: In the past, SPARC LDOMs with lots of virtual CPUs could
> crash in rare cases when building packages such as GCC or LLVM and running
> their testsuites.

I assume those occasional failures did not leave conclusive hints around.

> I don't know if Tony's patch fixes this long-time issue that we have observed
> in the past on Debian's buildds, but I think that the chances aren't too bad.

Good luck!

> Thanks to both of you for hunting this down!

For some stupid reasons I like such puzzles :)

end of thread, other threads:[~2026-05-15 19:57 UTC | newest]
Thread overview: 10+ messages
[not found] <ffb44522-f01c-4be3-849d-27dc17fbca7f@leemhuis.info>
[not found] ` <D5D19776-C809-4284-9417-F9A860877B98@gmail.com>
[not found] ` <1c165caf-36b4-4673-97fd-ed86bef17b88@leemhuis.info>
[not found] ` <3332123b-9e11-4895-9ab3-1707fba5815c@gmail.com>
[not found] ` <871pfj9cmj.ffs@tglx>
[not found] ` <a7a0d78b-435e-43c8-b436-5e7f4dd39dee@gmail.com>
2026-05-12 5:03 ` the stuttering regression in 7.0: should I have done something different Tony Rodriguez
2026-05-12 8:17 ` Thomas Gleixner
2026-05-12 21:43 ` Tony Rodriguez
2026-05-13 20:28 ` Thomas Gleixner
2026-05-14 7:24 ` Tony Rodriguez
2026-05-14 10:24 ` Thomas Gleixner
2026-05-15 4:47 ` Tony Rodriguez
2026-05-15 15:35 ` Thomas Gleixner
2026-05-15 17:51 ` John Paul Adrian Glaubitz
2026-05-15 19:57 ` Thomas Gleixner