* sched: Unexpected reschedule of offline CPU#2!
@ 2019-07-27 16:44 Guenter Roeck
2019-07-29 9:35 ` Peter Zijlstra
0 siblings, 1 reply; 22+ messages in thread
From: Guenter Roeck @ 2019-07-27 16:44 UTC
To: x86; +Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Borislav Petkov
Hi,
I see the following traceback (or similar tracebacks) once in a while
during my boot tests. In this specific case it is with mainline
(v5.3-rc1-195-g3ea54d9b0d65), but I have seen it with other branches
as well. This isn't a new problem; I have seen it for quite some time.
There is no specific action required to make it appear; just running
reboot loops is sufficient. The problem doesn't happen a lot;
non-scientifically I would say I see it maybe once every few hundred
boots.
No specific action is requested; this is just informational.
A complete log is at:
https://kerneltests.org/builders/qemu-x86-master/builds/1285/steps/qemubuildcommand/logs/stdio
Guenter
---
[ 61.248329] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 61.268277] e1000e: EEE TX LPI TIMER: 00000000
[ 61.311435] reboot: Restarting system
[ 61.312321] reboot: machine restart
[ 61.342193] ------------[ cut here ]------------
[ 61.342660] sched: Unexpected reschedule of offline CPU#2!
ILLOPC: ce241f83: 0f 0b
[ 61.344323] WARNING: CPU: 1 PID: 15 at arch/x86/kernel/smp.c:126 native_smp_send_reschedule+0x33/0x40
[ 61.344836] Modules linked in:
[ 61.345694] CPU: 1 PID: 15 Comm: ksoftirqd/1 Not tainted 5.3.0-rc1+ #1
[ 61.345998] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[ 61.346569] EIP: native_smp_send_reschedule+0x33/0x40
[ 61.347099] Code: cf 73 1c 8b 15 60 54 2b cf 8b 4a 18 ba fd 00 00 00 e8 05 65 c7 00 c9 c3 8d b4 26 00 00 00 00 50 68 04 ca 1a cf e8 fe e3 01 00 <0f> 0b 58 5a c9 c3 8d b4 26 00 00 00 00 55 89 e5 56 53 83 ec 0c 65
[ 61.347726] EAX: 0000002e EBX: 00000002 ECX: 00000000 EDX: cdd64140
[ 61.347977] ESI: 00000002 EDI: 00000000 EBP: cdd73c88 ESP: cdd73c80
[ 61.348234] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00000096
[ 61.348514] CR0: 80050033 CR2: b7ee7048 CR3: 0c28f000 CR4: 000006d0
[ 61.348866] Call Trace:
[ 61.349392] kick_ilb+0x90/0xa0
[ 61.349629] trigger_load_balance+0xf0/0x5c0
[ 61.349859] ? check_preempt_wakeup+0x1b0/0x1b0
[ 61.350057] scheduler_tick+0xa7/0xd0
[ 61.350266] update_process_times+0x4a/0x60
[ 61.350467] tick_sched_handle+0x3e/0x50
[ 61.350650] tick_sched_timer+0x37/0x90
[ 61.350847] __hrtimer_run_queues+0xf7/0x440
[ 61.351056] ? tick_sched_do_timer+0x70/0x70
[ 61.351281] hrtimer_interrupt+0x10e/0x260
[ 61.351541] smp_apic_timer_interrupt+0x68/0x210
[ 61.351750] apic_timer_interrupt+0x106/0x10c
[ 61.352040] EIP: _raw_spin_unlock_irqrestore+0x47/0x50
[ 61.352254] Code: 66 40 ff f6 c7 02 75 1b 53 9d e8 c4 67 49 ff 64 ff 0d 84 27 50 cf 5b 5e 5d c3 8d b4 26 00 00 00 00 66 90 e8 ab 69 49 ff 53 9d <eb> e3 8d b4 26 00 00 00 00 55 64 ff 05 84 27 50 cf 89 e5 53 89 c3
[ 61.352810] EAX: cdd64140 EBX: 00000282 ECX: 00000003 EDX: 00000002
[ 61.353041] ESI: cdc01940 EDI: 00000001 EBP: cdd73e08 ESP: cdd73e00
[ 61.353273] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00000282
[ 61.353705] ? _raw_spin_unlock_irqrestore+0x47/0x50
[ 61.362142] free_debug_processing+0x199/0x220
[ 61.362413] __slab_free+0x220/0x3b0
[ 61.362599] ? irq_kobj_release+0x1c/0x20
[ 61.362845] ? kfree+0x1ad/0x270
[ 61.363002] ? kfree+0x1ad/0x270
[ 61.363162] kfree+0x264/0x270
[ 61.363305] ? kfree+0x264/0x270
[ 61.363458] ? irq_kobj_release+0x1c/0x20
[ 61.363624] ? irq_kobj_release+0x1c/0x20
[ 61.363824] irq_kobj_release+0x1c/0x20
[ 61.364018] kobject_put+0x58/0xc0
[ 61.364211] ? hwirq_show+0x50/0x50
[ 61.364439] delayed_free_desc+0xb/0x10
[ 61.364621] rcu_core+0x288/0xb50
[ 61.364805] ? __do_softirq+0x7e/0x3bb
[ 61.365042] rcu_core_si+0x8/0x10
[ 61.365209] __do_softirq+0xa9/0x3bb
[ 61.365445] run_ksoftirqd+0x25/0x50
[ 61.365615] smpboot_thread_fn+0xef/0x1d0
[ 61.365834] kthread+0xf2/0x110
[ 61.365986] ? sort_range+0x20/0x20
[ 61.366156] ? kthread_create_on_node+0x20/0x20
[ 61.366360] ret_from_fork+0x2e/0x38
[ 61.366818] irq event stamp: 1267
[ 61.367115] hardirqs last enabled at (1266): [<ceeb37f5>] _raw_spin_unlock_irqrestore+0x45/0x50
[ 61.367448] hardirqs last disabled at (1267): [<ce20178a>] trace_hardirqs_off_thunk+0xc/0x12
[ 61.367769] softirqs last enabled at (1232): [<ceeb7a45>] __do_softirq+0x2c5/0x3bb
[ 61.368057] softirqs last disabled at (1237): [<ce267605>] run_ksoftirqd+0x25/0x50
[ 61.368389] ---[ end trace 3465d631a21844b8 ]---
* Re: sched: Unexpected reschedule of offline CPU#2!
2019-07-27 16:44 sched: Unexpected reschedule of offline CPU#2! Guenter Roeck
@ 2019-07-29 9:35 ` Peter Zijlstra
2019-07-29 9:58 ` Thomas Gleixner
0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-07-29 9:35 UTC
To: Guenter Roeck
Cc: x86, Ingo Molnar, Thomas Gleixner, linux-kernel, Borislav Petkov
On Sat, Jul 27, 2019 at 09:44:50AM -0700, Guenter Roeck wrote:
> Hi,
>
> I see the following traceback (or similar tracebacks) once in a while
> during my boot tests. In this specific case it is with mainline
> (v5.3-rc1-195-g3ea54d9b0d65), but I have seen it with other branches
> as well. This isn't a new problem; I have seen it for quite some time.
> There is no specific action required to make it appear; just running
> reboot loops is sufficient. The problem doesn't happen a lot;
> non-scientifically I would say I see it maybe once every few hundred
> boots.
>
> No specific action is requested; this is just informational.
>
> A complete log is at:
> https://kerneltests.org/builders/qemu-x86-master/builds/1285/steps/qemubuildcommand/logs/stdio
>
> Guenter
>
> ---
> [ 61.248329] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> [ 61.268277] e1000e: EEE TX LPI TIMER: 00000000
> [ 61.311435] reboot: Restarting system
> [ 61.312321] reboot: machine restart
> [ 61.342193] ------------[ cut here ]------------
> [ 61.342660] sched: Unexpected reschedule of offline CPU#2!
> ILLOPC: ce241f83: 0f 0b
> [ 61.344323] WARNING: CPU: 1 PID: 15 at arch/x86/kernel/smp.c:126 native_smp_send_reschedule+0x33/0x40
> [ 61.344836] Modules linked in:
> [ 61.345694] CPU: 1 PID: 15 Comm: ksoftirqd/1 Not tainted 5.3.0-rc1+ #1
> [ 61.345998] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> [ 61.346569] EIP: native_smp_send_reschedule+0x33/0x40
> [ 61.347099] Code: cf 73 1c 8b 15 60 54 2b cf 8b 4a 18 ba fd 00 00 00 e8 05 65 c7 00 c9 c3 8d b4 26 00 00 00 00 50 68 04 ca 1a cf e8 fe e3 01 00 <0f> 0b 58 5a c9 c3 8d b4 26 00 00 00 00 55 89 e5 56 53 83 ec 0c 65
> [ 61.347726] EAX: 0000002e EBX: 00000002 ECX: 00000000 EDX: cdd64140
> [ 61.347977] ESI: 00000002 EDI: 00000000 EBP: cdd73c88 ESP: cdd73c80
> [ 61.348234] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00000096
> [ 61.348514] CR0: 80050033 CR2: b7ee7048 CR3: 0c28f000 CR4: 000006d0
> [ 61.348866] Call Trace:
> [ 61.349392] kick_ilb+0x90/0xa0
> [ 61.349629] trigger_load_balance+0xf0/0x5c0
> [ 61.349859] ? check_preempt_wakeup+0x1b0/0x1b0
> [ 61.350057] scheduler_tick+0xa7/0xd0
kick_ilb() iterates nohz.idle_cpus_mask to find itself an idle_cpu().
idle_cpus_mask() is set from nohz_balance_enter_idle() and cleared from
nohz_balance_exit_idle(). nohz_balance_enter_idle() is called from
__tick_nohz_idle_stop_tick() when entering nohz idle, this includes the
cpu_is_offline() clause of the idle loop.
However, when offline, cpu_active() should also be false, and this
function should no-op.
Then we have nohz_balance_exit_idle() from sched_cpu_dying(), which
should explicitly clear the CPU from the mask when going offline.
So I'm not immediately seeing how we can select an offline CPU to kick.
* Re: sched: Unexpected reschedule of offline CPU#2!
2019-07-29 9:35 ` Peter Zijlstra
@ 2019-07-29 9:58 ` Thomas Gleixner
2019-07-29 10:13 ` Peter Zijlstra
0 siblings, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2019-07-29 9:58 UTC
To: Peter Zijlstra
Cc: Guenter Roeck, x86, Ingo Molnar, linux-kernel, Borislav Petkov
On Mon, 29 Jul 2019, Peter Zijlstra wrote:
> On Sat, Jul 27, 2019 at 09:44:50AM -0700, Guenter Roeck wrote:
> > [ 61.348866] Call Trace:
> > [ 61.349392] kick_ilb+0x90/0xa0
> > [ 61.349629] trigger_load_balance+0xf0/0x5c0
> > [ 61.349859] ? check_preempt_wakeup+0x1b0/0x1b0
> > [ 61.350057] scheduler_tick+0xa7/0xd0
>
> kick_ilb() iterates nohz.idle_cpus_mask to find itself an idle_cpu().
>
> idle_cpus_mask() is set from nohz_balance_enter_idle() and cleared from
> nohz_balance_exit_idle(). nohz_balance_enter_idle() is called from
> __tick_nohz_idle_stop_tick() when entering nohz idle, this includes the
> cpu_is_offline() clause of the idle loop.
>
> However, when offline, cpu_active() should also be false, and this
> function should no-op.
Ha. That reboot mess is not clearing cpu active as it's not going through
the regular cpu hotplug path. It's using reboot IPI which 'stops' the cpus
dead in their tracks after clearing cpu online....
Thanks,
tglx
* Re: sched: Unexpected reschedule of offline CPU#2!
2019-07-29 9:58 ` Thomas Gleixner
@ 2019-07-29 10:13 ` Peter Zijlstra
2019-07-29 10:38 ` Thomas Gleixner
0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-07-29 10:13 UTC
To: Thomas Gleixner
Cc: Guenter Roeck, x86, Ingo Molnar, linux-kernel, Borislav Petkov
On Mon, Jul 29, 2019 at 11:58:24AM +0200, Thomas Gleixner wrote:
> On Mon, 29 Jul 2019, Peter Zijlstra wrote:
> > On Sat, Jul 27, 2019 at 09:44:50AM -0700, Guenter Roeck wrote:
> > > [ 61.348866] Call Trace:
> > > [ 61.349392] kick_ilb+0x90/0xa0
> > > [ 61.349629] trigger_load_balance+0xf0/0x5c0
> > > [ 61.349859] ? check_preempt_wakeup+0x1b0/0x1b0
> > > [ 61.350057] scheduler_tick+0xa7/0xd0
> >
> > kick_ilb() iterates nohz.idle_cpus_mask to find itself an idle_cpu().
> >
> > idle_cpus_mask() is set from nohz_balance_enter_idle() and cleared from
> > nohz_balance_exit_idle(). nohz_balance_enter_idle() is called from
> > __tick_nohz_idle_stop_tick() when entering nohz idle, this includes the
> > cpu_is_offline() clause of the idle loop.
> >
> > However, when offline, cpu_active() should also be false, and this
> > function should no-op.
>
> Ha. That reboot mess is not clearing cpu active as it's not going through
> the regular cpu hotplug path. It's using reboot IPI which 'stops' the cpus
> dead in their tracks after clearing cpu online....
$string-of-cock-compliant-curses
What a trainwreck...
So if it doesn't play by the normal rules; how does it expect to work?
So what do we do? 'Fix' reboot or extend the rules?
* Re: sched: Unexpected reschedule of offline CPU#2!
2019-07-29 10:13 ` Peter Zijlstra
@ 2019-07-29 10:38 ` Thomas Gleixner
2019-07-29 10:47 ` Peter Zijlstra
0 siblings, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2019-07-29 10:38 UTC
To: Peter Zijlstra
Cc: Guenter Roeck, x86, Ingo Molnar, linux-kernel, Borislav Petkov
On Mon, 29 Jul 2019, Peter Zijlstra wrote:
> On Mon, Jul 29, 2019 at 11:58:24AM +0200, Thomas Gleixner wrote:
> > On Mon, 29 Jul 2019, Peter Zijlstra wrote:
> > > On Sat, Jul 27, 2019 at 09:44:50AM -0700, Guenter Roeck wrote:
> > > > [ 61.348866] Call Trace:
> > > > [ 61.349392] kick_ilb+0x90/0xa0
> > > > [ 61.349629] trigger_load_balance+0xf0/0x5c0
> > > > [ 61.349859] ? check_preempt_wakeup+0x1b0/0x1b0
> > > > [ 61.350057] scheduler_tick+0xa7/0xd0
> > >
> > > kick_ilb() iterates nohz.idle_cpus_mask to find itself an idle_cpu().
> > >
> > > idle_cpus_mask() is set from nohz_balance_enter_idle() and cleared from
> > > nohz_balance_exit_idle(). nohz_balance_enter_idle() is called from
> > > __tick_nohz_idle_stop_tick() when entering nohz idle, this includes the
> > > cpu_is_offline() clause of the idle loop.
> > >
> > > However, when offline, cpu_active() should also be false, and this
> > > function should no-op.
> >
> > Ha. That reboot mess is not clearing cpu active as it's not going through
> > the regular cpu hotplug path. It's using reboot IPI which 'stops' the cpus
> > dead in their tracks after clearing cpu online....
>
> $string-of-cock-compliant-curses
>
> What a trainwreck...
>
> So if it doesn't play by the normal rules; how does it expect to work?
>
> So what do we do? 'Fix' reboot or extend the rules?
Reboot has two modes:
- Regular reboot initiated from user space
- Panic reboot
For the regular reboot we can make it go through proper hotplug, for the
panic case not so much.
thanks,
tglx
* Re: sched: Unexpected reschedule of offline CPU#2!
2019-07-29 10:38 ` Thomas Gleixner
@ 2019-07-29 10:47 ` Peter Zijlstra
2019-07-29 20:50 ` Guenter Roeck
0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-07-29 10:47 UTC
To: Thomas Gleixner
Cc: Guenter Roeck, x86, Ingo Molnar, linux-kernel, Borislav Petkov
On Mon, Jul 29, 2019 at 12:38:30PM +0200, Thomas Gleixner wrote:
> On Mon, 29 Jul 2019, Peter Zijlstra wrote:
> > On Mon, Jul 29, 2019 at 11:58:24AM +0200, Thomas Gleixner wrote:
> > > On Mon, 29 Jul 2019, Peter Zijlstra wrote:
> > > > On Sat, Jul 27, 2019 at 09:44:50AM -0700, Guenter Roeck wrote:
> > > > > [ 61.348866] Call Trace:
> > > > > [ 61.349392] kick_ilb+0x90/0xa0
> > > > > [ 61.349629] trigger_load_balance+0xf0/0x5c0
> > > > > [ 61.349859] ? check_preempt_wakeup+0x1b0/0x1b0
> > > > > [ 61.350057] scheduler_tick+0xa7/0xd0
> > > >
> > > > kick_ilb() iterates nohz.idle_cpus_mask to find itself an idle_cpu().
> > > >
> > > > idle_cpus_mask() is set from nohz_balance_enter_idle() and cleared from
> > > > nohz_balance_exit_idle(). nohz_balance_enter_idle() is called from
> > > > __tick_nohz_idle_stop_tick() when entering nohz idle, this includes the
> > > > cpu_is_offline() clause of the idle loop.
> > > >
> > > > However, when offline, cpu_active() should also be false, and this
> > > > function should no-op.
> > >
> > > Ha. That reboot mess is not clearing cpu active as it's not going through
> > > the regular cpu hotplug path. It's using reboot IPI which 'stops' the cpus
> > > dead in their tracks after clearing cpu online....
> >
> > $string-of-cock-compliant-curses
> >
> > What a trainwreck...
> >
> > So if it doesn't play by the normal rules; how does it expect to work?
> >
> > So what do we do? 'Fix' reboot or extend the rules?
>
> Reboot has two modes:
>
> - Regular reboot initiated from user space
>
> - Panic reboot
>
> For the regular reboot we can make it go through proper hotplug,
That seems sensible.
> for the panic case not so much.
It's panic, shit has already hit fan, one or two more pieces shouldn't
be something anybody cares about.
* Re: sched: Unexpected reschedule of offline CPU#2!
2019-07-29 10:47 ` Peter Zijlstra
@ 2019-07-29 20:50 ` Guenter Roeck
2019-08-16 10:22 ` Thomas Gleixner
0 siblings, 1 reply; 22+ messages in thread
From: Guenter Roeck @ 2019-07-29 20:50 UTC
To: Peter Zijlstra
Cc: Thomas Gleixner, x86, Ingo Molnar, linux-kernel, Borislav Petkov
On Mon, Jul 29, 2019 at 12:47:45PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 29, 2019 at 12:38:30PM +0200, Thomas Gleixner wrote:
> > On Mon, 29 Jul 2019, Peter Zijlstra wrote:
> > > On Mon, Jul 29, 2019 at 11:58:24AM +0200, Thomas Gleixner wrote:
> > > > On Mon, 29 Jul 2019, Peter Zijlstra wrote:
> > > > > On Sat, Jul 27, 2019 at 09:44:50AM -0700, Guenter Roeck wrote:
> > > > > > [ 61.348866] Call Trace:
> > > > > > [ 61.349392] kick_ilb+0x90/0xa0
> > > > > > [ 61.349629] trigger_load_balance+0xf0/0x5c0
> > > > > > [ 61.349859] ? check_preempt_wakeup+0x1b0/0x1b0
> > > > > > [ 61.350057] scheduler_tick+0xa7/0xd0
> > > > >
> > > > > kick_ilb() iterates nohz.idle_cpus_mask to find itself an idle_cpu().
> > > > >
> > > > > idle_cpus_mask() is set from nohz_balance_enter_idle() and cleared from
> > > > > nohz_balance_exit_idle(). nohz_balance_enter_idle() is called from
> > > > > __tick_nohz_idle_stop_tick() when entering nohz idle, this includes the
> > > > > cpu_is_offline() clause of the idle loop.
> > > > >
> > > > > However, when offline, cpu_active() should also be false, and this
> > > > > function should no-op.
> > > >
> > > > Ha. That reboot mess is not clearing cpu active as it's not going through
> > > > the regular cpu hotplug path. It's using reboot IPI which 'stops' the cpus
> > > > dead in their tracks after clearing cpu online....
> > >
> > > $string-of-cock-compliant-curses
> > >
> > > What a trainwreck...
> > >
> > > So if it doesn't play by the normal rules; how does it expect to work?
> > >
> > > So what do we do? 'Fix' reboot or extend the rules?
> >
> > Reboot has two modes:
> >
> > - Regular reboot initiated from user space
> >
> > - Panic reboot
> >
> > For the regular reboot we can make it go through proper hotplug,
>
> That seems sensible.
>
> > for the panic case not so much.
>
> It's panic, shit has already hit fan, one or two more pieces shouldn't
> be something anybody cares about.
>
Some more digging shows that this happens a lot with Google GCE instances,
typically after a panic. The problem with that, if I understand correctly,
is that it may prevent coredumps from being written. So, while of course
the panic is what needs to be fixed, it is still quite annoying, and it
would help if this can be fixed for panic handling as well.
How about the patch suggested by Hillf Danton? Would that help for the
panic case?
Thanks,
Guenter
* Re: sched: Unexpected reschedule of offline CPU#2!
2019-07-29 20:50 ` Guenter Roeck
@ 2019-08-16 10:22 ` Thomas Gleixner
2019-08-16 19:32 ` Guenter Roeck
0 siblings, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2019-08-16 10:22 UTC
To: Guenter Roeck
Cc: Peter Zijlstra, x86, Ingo Molnar, linux-kernel, Borislav Petkov
On Mon, 29 Jul 2019, Guenter Roeck wrote:
> On Mon, Jul 29, 2019 at 12:47:45PM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 29, 2019 at 12:38:30PM +0200, Thomas Gleixner wrote:
> > > Reboot has two modes:
> > >
> > > - Regular reboot initiated from user space
> > >
> > > - Panic reboot
> > >
> > > For the regular reboot we can make it go through proper hotplug,
> >
> > That seems sensible.
> >
> > > for the panic case not so much.
> >
> > It's panic, shit has already hit fan, one or two more pieces shouldn't
> > be something anybody cares about.
> >
>
> Some more digging shows that this happens a lot with Google GCE instances,
> typically after a panic. The problem with that, if I understand correctly,
> is that it may prevent coredumps from being written. So, while of course
> the panic is what needs to be fixed, it is still quite annoying, and it
> would help if this can be fixed for panic handling as well.
>
> How about the patch suggested by Hillf Danton? Would that help for the
> panic case?
I have no idea how that patch looks like, but the quick hack is below.
Thanks,
tglx
8<---------------
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 75fea0d48c0e..625627b1457c 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -601,6 +601,7 @@ void stop_this_cpu(void *dummy)
/*
* Remove this CPU:
*/
+ set_cpu_active(smp_processor_id(), false);
set_cpu_online(smp_processor_id(), false);
disable_local_APIC();
mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
* Re: sched: Unexpected reschedule of offline CPU#2!
2019-08-16 10:22 ` Thomas Gleixner
@ 2019-08-16 19:32 ` Guenter Roeck
2019-08-17 20:21 ` Thomas Gleixner
0 siblings, 1 reply; 22+ messages in thread
From: Guenter Roeck @ 2019-08-16 19:32 UTC
To: Thomas Gleixner
Cc: Peter Zijlstra, x86, Ingo Molnar, linux-kernel, Borislav Petkov
On Fri, Aug 16, 2019 at 12:22:22PM +0200, Thomas Gleixner wrote:
> On Mon, 29 Jul 2019, Guenter Roeck wrote:
> > On Mon, Jul 29, 2019 at 12:47:45PM +0200, Peter Zijlstra wrote:
> > > On Mon, Jul 29, 2019 at 12:38:30PM +0200, Thomas Gleixner wrote:
> > > > Reboot has two modes:
> > > >
> > > > - Regular reboot initiated from user space
> > > >
> > > > - Panic reboot
> > > >
> > > > For the regular reboot we can make it go through proper hotplug,
> > >
> > > That seems sensible.
> > >
> > > > for the panic case not so much.
> > >
> > > It's panic, shit has already hit fan, one or two more pieces shouldn't
> > > be something anybody cares about.
> > >
> >
> > Some more digging shows that this happens a lot with Google GCE instances,
> > typically after a panic. The problem with that, if I understand correctly,
> > is that it may prevent coredumps from being written. So, while of course
> > the panic is what needs to be fixed, it is still quite annoying, and it
> > would help if this can be fixed for panic handling as well.
> >
> > How about the patch suggested by Hillf Danton? Would that help for the
> > panic case?
>
> I have no idea how that patch looks like, but the quick hack is below.
>
> Thanks,
>
> tglx
>
> 8<---------------
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 75fea0d48c0e..625627b1457c 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -601,6 +601,7 @@ void stop_this_cpu(void *dummy)
> /*
> * Remove this CPU:
> */
> + set_cpu_active(smp_processor_id(), false);
> set_cpu_online(smp_processor_id(), false);
> disable_local_APIC();
> mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
>
No luck. The problem is still seen with this patch applied on top of
the mainline kernel (commit a69e90512d9def6).
Guenter
---
[ 22.315834] e1000e: EEE TX LPI TIMER: 00000000
[ 22.323624] reboot: Restarting system
[ 22.324260] reboot: machine restart
[ 22.325885] ------------[ cut here ]------------
[ 22.330425] sched: Unexpected reschedule of offline CPU#3!
ILLOPC: ffffffffb524403f: 0f 0b
[ 22.330926] WARNING: CPU: 1 PID: 0 at arch/x86/kernel/smp.c:126 native_smp_send_reschedule+0x2f/0x40
[ 22.331238] Modules linked in:
[ 22.331427] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.3.0-rc4+ #1
[ 22.331626] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[ 22.331971] RIP: 0010:native_smp_send_reschedule+0x2f/0x40
[ 22.332164] Code: 05 de 81 95 01 73 15 48 8b 05 bd fa 61 01 be fd 00 00 00 48 8b 40 30 e9 6f d0 fb 00 89 fe 48 c7 c7 88 da 74 b6 e8 7f 6c 02 00 <0f> 0b c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 55 53 48 83 ec
[ 22.332705] RSP: 0018:ffffa457800d0d68 EFLAGS: 00000086
[ 22.332884] RAX: 0000000000000000 RBX: ffff9a8cbb9ba000 RCX: 0000000000000103
[ 22.333109] RDX: 0000000080000103 RSI: 0000000000000000 RDI: 00000000ffffffff
[ 22.333327] RBP: ffffa457800d0e90 R08: 0000000000000000 R09: 0000000000000000
[ 22.333546] R10: 0000000000000000 R11: ffffa457800d0c10 R12: 000000000000a1b9
[ 22.333767] R13: ffff9a8cbae26030 R14: ffff9a8cbae25f80 R15: ffff9a8cbb83a000
[ 22.334045] FS: 0000000000000000(0000) GS:ffff9a8cbb880000(0000) knlGS:0000000000000000
[ 22.334321] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 22.334520] CR2: 00007fba66a35010 CR3: 0000000176cd6000 CR4: 00000000007406e0
[ 22.334794] PKRU: 55555554
[ 22.334915] Call Trace:
[ 22.335062] <IRQ>
[ 22.335148] check_preempt_curr+0x7f/0xc0
[ 22.335295] load_balance+0x589/0xc50
[ 22.335513] rebalance_domains+0x30d/0x410
[ 22.335684] _nohz_idle_balance+0x1bd/0x200
[ 22.335854] __do_softirq+0xe5/0x478
[ 22.336023] irq_exit+0xa9/0xc0
[ 22.336163] reschedule_interrupt+0xf/0x20
[ 22.336317] </IRQ>
[ 22.336409] RIP: 0010:default_idle+0x23/0x180
[ 22.336561] Code: ff 90 90 90 90 90 90 41 55 41 54 55 53 e8 45 75 7c ff 0f 1f 44 00 00 e8 0b aa 40 ff e9 07 00 00 00 0f 00 2d 31 94 4a 00 fb f4 <e8> 28 75 7c ff 89 c5 0f 1f 44 00 00 5b 5d 41 5c 41 5d c3 65 8b 05
[ 22.337102] RSP: 0018:ffffa4578006bec0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff02
[ 22.337342] RAX: ffff9a8cbae23fc0 RBX: 0000000000000001 RCX: 0000000000000001
[ 22.337561] RDX: 0000000000000046 RSI: 0000000000000006 RDI: ffffffffb6852dd6
[ 22.337780] RBP: ffffffffb6b9c1f8 R08: 0000000000000001 R09: 0000000000000000
[ 22.337996] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 22.338229] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 22.338501] do_idle+0x1df/0x260
[ 22.338588] ? _raw_spin_unlock_irqrestore+0x4c/0x60
[ 22.338706] cpu_startup_entry+0x14/0x20
[ 22.338793] start_secondary+0x151/0x180
[ 22.338885] secondary_startup_64+0xa4/0xb0
[ 22.339060] irq event stamp: 61631
[ 22.339176] hardirqs last enabled at (61630): [<ffffffffb5f5c6dc>] _raw_spin_unlock_irqrestore+0x4c/0x60
[ 22.339373] hardirqs last disabled at (61631): [<ffffffffb5f5c46d>] _raw_spin_lock_irqsave+0xd/0x50
[ 22.339568] softirqs last enabled at (61626): [<ffffffffb5272bc8>] irq_enter+0x58/0x60
[ 22.339726] softirqs last disabled at (61627): [<ffffffffb5272c79>] irq_exit+0xa9/0xc0
[ 22.339897] ---[ end trace 8ad53445879058cc ]---
[ 22.340384] ACPI MEMORY or I/O RESET_REG.
* Re: sched: Unexpected reschedule of offline CPU#2!
2019-08-16 19:32 ` Guenter Roeck
@ 2019-08-17 20:21 ` Thomas Gleixner
2021-07-27 8:00 ` Henning Schild
0 siblings, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2019-08-17 20:21 UTC
To: Guenter Roeck
Cc: Peter Zijlstra, x86, Ingo Molnar, linux-kernel, Borislav Petkov
On Fri, 16 Aug 2019, Guenter Roeck wrote:
> On Fri, Aug 16, 2019 at 12:22:22PM +0200, Thomas Gleixner wrote:
> > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > index 75fea0d48c0e..625627b1457c 100644
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -601,6 +601,7 @@ void stop_this_cpu(void *dummy)
> > /*
> > * Remove this CPU:
> > */
> > + set_cpu_active(smp_processor_id(), false);
> > set_cpu_online(smp_processor_id(), false);
> > disable_local_APIC();
> > mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
> >
> No luck. The problem is still seen with this patch applied on top of
> the mainline kernel (commit a69e90512d9def6).
Yeah, was a bit too naive ....
We actually need to do the full cpuhotplug dance for a regular reboot. In
the panic case, there is nothing we can do about it. I'll have a look tomorrow.
Thanks,
tglx
* Re: sched: Unexpected reschedule of offline CPU#2!
2019-08-17 20:21 ` Thomas Gleixner
@ 2021-07-27 8:00 ` Henning Schild
2021-07-27 8:46 ` Jan Kiszka
0 siblings, 1 reply; 22+ messages in thread
From: Henning Schild @ 2021-07-27 8:00 UTC
To: Thomas Gleixner
Cc: Guenter Roeck, Peter Zijlstra, x86, Ingo Molnar, linux-kernel,
Borislav Petkov, xenomai
Was this ever resolved and if so can someone please point me to the
patches? I started digging a bit but could not yet find how that
continued.
I am seeing similar or maybe the same problem on 4.19.192 with the
ipipe patch from the xenomai project applied.
regards,
Henning
Am Sat, 17 Aug 2019 22:21:48 +0200
schrieb Thomas Gleixner <tglx@linutronix.de>:
> On Fri, 16 Aug 2019, Guenter Roeck wrote:
> > On Fri, Aug 16, 2019 at 12:22:22PM +0200, Thomas Gleixner wrote:
> > > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> > > index 75fea0d48c0e..625627b1457c 100644
> > > --- a/arch/x86/kernel/process.c
> > > +++ b/arch/x86/kernel/process.c
> > > @@ -601,6 +601,7 @@ void stop_this_cpu(void *dummy)
> > > /*
> > > * Remove this CPU:
> > > */
> > > + set_cpu_active(smp_processor_id(), false);
> > > set_cpu_online(smp_processor_id(), false);
> > > disable_local_APIC();
> > > mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
> > >
> > No luck. The problem is still seen with this patch applied on top of
> > the mainline kernel (commit a69e90512d9def6).
>
> Yeah, was a bit too naive ....
>
> We actually need to do the full cpuhotplug dance for a regular
> reboot. In the panic case, there is nothing we can do about it. I'll
> have a look tomorrow.
>
> Thanks,
>
> tglx
* Re: sched: Unexpected reschedule of offline CPU#2!
2021-07-27 8:00 ` Henning Schild
@ 2021-07-27 8:46 ` Jan Kiszka
2024-09-03 6:15 ` guocai.he.cn
2024-09-03 15:27 ` Thomas Gleixner
0 siblings, 2 replies; 22+ messages in thread
From: Jan Kiszka @ 2021-07-27 8:46 UTC
To: Henning Schild, Thomas Gleixner
Cc: Peter Zijlstra, x86, linux-kernel, Ingo Molnar, Borislav Petkov,
Guenter Roeck, xenomai
[Henning, don't top-post ;)]
On 27.07.21 10:00, Henning Schild via Xenomai wrote:
> Was this ever resolved and if so can someone please point me to the
> patches? I started digging a bit but could not yet find how that
> continued.
>
> I am seeing similar or maybe the same problem on 4.19.192 with the
> ipipe patch from the xenomai project applied.
>
Before blaming the usual suspects, I have a general ordering question on
mainline below.
> regards,
> Henning
>
> Am Sat, 17 Aug 2019 22:21:48 +0200
> schrieb Thomas Gleixner <tglx@linutronix.de>:
>
>> On Fri, 16 Aug 2019, Guenter Roeck wrote:
>>> On Fri, Aug 16, 2019 at 12:22:22PM +0200, Thomas Gleixner wrote:
>>>> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>>>> index 75fea0d48c0e..625627b1457c 100644
>>>> --- a/arch/x86/kernel/process.c
>>>> +++ b/arch/x86/kernel/process.c
>>>> @@ -601,6 +601,7 @@ void stop_this_cpu(void *dummy)
>>>> /*
>>>> * Remove this CPU:
>>>> */
>>>> + set_cpu_active(smp_processor_id(), false);
>>>> set_cpu_online(smp_processor_id(), false);
>>>> disable_local_APIC();
>>>> mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
>>>>
>>> No luck. The problem is still seen with this patch applied on top of
>>> the mainline kernel (commit a69e90512d9def6).
>>
>> Yeah, was a bit too naive ....
>>
>> We actually need to do the full cpuhotplug dance for a regular
>> reboot. In the panic case, there is nothing we can do about it. I'll
>> have a look tomorrow.
>>
What is supposed to prevent the following in mainline:
CPU 0                      CPU 1                      CPU 2

native_stop_other_cpus                                <INTERRUPT>
  send_IPI_allbutself                                 ...
                           <INTERRUPT>
                           sysvec_reboot
                             stop_this_cpu
                               set_cpu_online(false)
                                                      native_smp_send_reschedule(1)
                                                        if (cpu_is_offline(1)) ...
Jan
--
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux
* Re: sched: Unexpected reschedule of offline CPU#2!
2021-07-27 8:46 ` Jan Kiszka
@ 2024-09-03 6:15 ` guocai.he.cn
2024-09-03 15:27 ` Thomas Gleixner
1 sibling, 0 replies; 22+ messages in thread
From: guocai.he.cn @ 2024-09-03 6:15 UTC
To: jan.kiszka
Cc: bp, henning.schild, linux-kernel, linux, mingo, peterz, tglx, x86,
xenomai
Are there any updates or fixes for this issue?
We are seeing a similar issue; logs as follows:
root@doon:~# poweroff
...............
...............
-----------[ cut here ]-----------
sched: Unexpected reschedule of offline CPU#10!
WARNING: CPU: 0 PID: 446324 at arch/x86/kernel/smp.c:126 native_smp_send_reschedule+0x3a/0x40
Modules linked in: vhost_net vhost macvtap tap xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 iptable_mangle iptable_nat linux_user_bde(PO) linux_kernel_bde(PO) xt_tcpudp bridge stp llc ip6table_filter ip6_tables iptable_filter ip_tables x_tables kvm_intel kvm vfio_pci vfio_virqfd vfio_iommu_type1 vfio pci_stub iavf uio_pci_hostif i40e(O) configfs qfx_pci_static_map(O) macvlan socktun(O) i2c_dev uio_fpga(O) uio intel_rapl_msr iTCO_wdt iTCO_vendor_support watchdog intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crct10dif_common aesni_intel aes_x86_64 glue_helper crypto_simd cryptd i2c_i801 igb(O) lpc_ich pcc_cpufreq sch_fq_codel nfsd openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 irqbypass fuse [last unloaded: ebtables]
CPU: 0 PID: 446324 Comm: kworker/0:11 Tainted: P O 5.2.60-rt15-LTS19 #1
Workqueue: 0x0 (rcu_gp)
RIP: 0010:native_smp_send_reschedule+0x3a/0x40
Code: 4f 9b 01 73 17 48 8b 05 34 1c 5b 01 be fd 00 00 00 48 8b 40 30 e8 a6 ac fb 00 5d c3 89 fe 48 c7
c7 28 08 b1 b8 e8 42 5c 02 00 <0f> 0b 5d c3 66 90 0f 1f 44 00 00 8b 05 3d f8 ba 01 85 c0 0f 85 e1
RSP: 0018:ffff9dc940003c68 EFLAGS: 00010086
RAX: 0000000000000000 RBX: ffff9138c00a3400 RCX: 0000000000000006
RDX: 0000000000000007 RSI: 0000000000000003 RDI: ffff9138bfe16450
RBP: ffff9dc940003c68 R08: 0000099801c9bf9b R09: 0000000000000000
R10: ffff9dc940003a08 R11: 0000000000000002 R12: 000000000000000a
R13: ffff9dc940003d30 R14: 0000000000000000 R15: ffff9138c00a3400
FS: 0000000000000000(0000) GS:ffff9138bfe00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000b0 CR3: 0000000305364006 CR4: 00000000003606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
resched_curr+0x69/0xc0
check_preempt_curr+0x54/0x90
ttwu_do_wakeup.isra.0+0x1e/0x150
ttwu_do_activate+0x5b/0x70
try_to_wake_up+0x224/0x570
? enqueue_task_fair+0x1f0/0xa70
? tracing_record_taskinfo_skip+0x3f/0x50
default_wake_function+0x12/0x20
autoremove_wake_function+0x12/0x40
__wake_up_common+0x7e/0x140
__wake_up_common_lock+0x7b/0xf0
__wake_up+0x13/0x20
wake_up_klogd_work_func+0x39/0x40
irq_work_run_list+0x4f/0x70
irq_work_tick+0x3b/0x50
update_process_times+0x65/0x70
tick_sched_timer+0x59/0x170
? tick_switch_to_oneshot.cold+0x79/0x79
__hrtimer_run_queues+0x10f/0x290
? recalibrate_cpu_khz+0x10/0x10
hrtimer_interrupt+0x109/0x220
smp_apic_timer_interrupt+0x76/0x150
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:finish_task_switch+0x87/0x280
Code: 85 c0 0f 8f dc 00 00 00 8b 05 e5 6b bd 01 85 c0 0f 8f e7 00 00 00 41 c7 45 40 00 00 00 00 41 c6
04 24 00 fb 8b 05 e9 f0 b5 01 <65> 48 8b 14 25 00 5d 01 00 85 c0 0f 8f 23 01 00 00 4d 85 f6 74 1d
RSP: 0018:ffff9dc94725bdf8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000000 RBX: ffff9138a093c180 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff9138a093c180 RDI: ffff9138b9298040
RBP: ffff9dc94725be20 R08: 000000000000022d R09: ffff9138a093c258
R10: 0000000000000000 R11: 0000000000000000 R12: ffff9138bfe23400
R13: ffff9138b9298040 R14: 0000000000000000 R15: 0000000000000002
? __switch_to_asm+0x34/0x70
__schedule+0x30b/0x690
schedule+0x42/0xb0
worker_thread+0xc1/0x3c0
kthread+0x106/0x140
? process_one_work+0x3f0/0x3f0
? kthread_park+0x90/0x90
ret_from_fork+0x35/0x40
---[ end trace 4ff5842bcc9fa5e0 ]---
* Re: sched: Unexpected reschedule of offline CPU#2!
2021-07-27 8:46 ` Jan Kiszka
2024-09-03 6:15 ` guocai.he.cn
@ 2024-09-03 15:27 ` Thomas Gleixner
2024-09-04 7:46 ` guocai he
` (2 more replies)
1 sibling, 3 replies; 22+ messages in thread
From: Thomas Gleixner @ 2024-09-03 15:27 UTC (permalink / raw)
To: Jan Kiszka, Henning Schild
Cc: Peter Zijlstra, x86, linux-kernel, Ingo Molnar, Borislav Petkov,
Guenter Roeck, xenomai, guocai.he.cn
On Tue, Jul 27 2021 at 10:46, Jan Kiszka wrote:
Picking up this dead thread again.
> What is supposed to prevent the following in mainline:
>
> CPU 0 CPU 1 CPU 2
>
> native_stop_other_cpus <INTERRUPT>
> send_IPI_allbutself ...
> <INTERRUPT>
> sysvec_reboot
> stop_this_cpu
> set_cpu_online(false)
> native_smp_send_reschedule(1)
> if (cpu_is_offline(1)) ...
Nothing. And that's what probably happens if I read the stack trace
correctly.
But we can be slightly smarter about this for the reboot IPI (the NMI
case does not have that issue).
CPU 0                    CPU 1                     CPU 2

native_stop_other_cpus                            <INTERRUPT>
  send_IPI_allbutself                               ...
                         <IPI>
                         sysvec_reboot
                         wait_for_others();
                                                  </INTERRUPT>
                                                  <IPI>
                                                  sysvec_reboot
                                                  wait_for_others();
                         stop_this_cpu();         stop_this_cpu();
                         set_cpu_online(false);   set_cpu_online(false);
Something like the uncompiled below.
Thanks,
tglx
---
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -68,5 +68,6 @@ bool intel_find_matching_signature(void
int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type);
extern struct cpumask cpus_stop_mask;
+atomic_t cpus_stop_in_ipi;
#endif /* _ASM_X86_CPU_H */
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -721,7 +721,7 @@ bool xen_set_default_idle(void);
#define xen_set_default_idle 0
#endif
-void __noreturn stop_this_cpu(void *dummy);
+void __noreturn stop_this_cpu(bool sync);
void microcode_check(struct cpuinfo_x86 *prev_info);
void store_cpu_caps(struct cpuinfo_x86 *info);
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -791,9 +791,10 @@ bool xen_set_default_idle(void)
}
#endif
+atomic_t cpus_stop_in_ipi;
struct cpumask cpus_stop_mask;
-void __noreturn stop_this_cpu(void *dummy)
+void __noreturn stop_this_cpu(bool sync)
{
struct cpuinfo_x86 *c = this_cpu_ptr(&cpu_info);
unsigned int cpu = smp_processor_id();
@@ -801,6 +802,16 @@ void __noreturn stop_this_cpu(void *dumm
local_irq_disable();
/*
+ * Account this CPU and loop until the other CPUs reached this
+ * point. If they don't react, the control CPU will raise an NMI.
+ */
+ if (sync) {
+ atomic_dec(&cpus_stop_in_ipi);
+ while (atomic_read(&cpus_stop_in_ipi))
+ cpu_relax();
+ }
+
+ /*
* Remove this CPU from the online mask and disable it
* unconditionally. This might be redundant in case that the reboot
* vector was handled late and stop_other_cpus() sent an NMI.
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -788,7 +788,7 @@ static void native_machine_halt(void)
tboot_shutdown(TB_SHUTDOWN_HALT);
- stop_this_cpu(NULL);
+ stop_this_cpu(false);
}
static void native_machine_power_off(void)
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -125,7 +125,7 @@ static int smp_stop_nmi_callback(unsigne
return NMI_HANDLED;
cpu_emergency_disable_virtualization();
- stop_this_cpu(NULL);
+ stop_this_cpu(false);
return NMI_HANDLED;
}
@@ -137,7 +137,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_reboot)
{
apic_eoi();
cpu_emergency_disable_virtualization();
- stop_this_cpu(NULL);
+ stop_this_cpu(true);
}
static int register_stop_handler(void)
@@ -189,6 +189,7 @@ static void native_stop_other_cpus(int w
*/
cpumask_copy(&cpus_stop_mask, cpu_online_mask);
cpumask_clear_cpu(this_cpu, &cpus_stop_mask);
+ atomic_set(&cpus_stop_in_ipi, num_online_cpus() - 1);
if (!cpumask_empty(&cpus_stop_mask)) {
apic_send_IPI_allbutself(REBOOT_VECTOR);
@@ -235,10 +236,12 @@ static void native_stop_other_cpus(int w
local_irq_restore(flags);
/*
- * Ensure that the cpus_stop_mask cache lines are invalidated on
- * the other CPUs. See comment vs. SME in stop_this_cpu().
+ * Ensure that the cpus_stop_mask and cpus_stop_in_ipi cache lines
+ * are invalidated on the other CPUs. See comment vs. SME in
+ * stop_this_cpu().
*/
cpumask_clear(&cpus_stop_mask);
+ atomic_set(&cpus_stop_in_ipi, 0);
}
/*
* Re: sched: Unexpected reschedule of offline CPU#2!
2024-09-03 15:27 ` Thomas Gleixner
@ 2024-09-04 7:46 ` guocai he
2024-09-18 1:50 ` My branch is v5.2/standard/preempt-rt/intel-x86 and I make a patch according guocai.he.cn
2024-09-18 2:59 ` [PATCH] patch for poweroff guocai.he.cn
2025-07-09 13:44 ` sched: Unexpected reschedule of offline CPU#2! Phil Auld
2 siblings, 1 reply; 22+ messages in thread
From: guocai he @ 2024-09-04 7:46 UTC (permalink / raw)
To: Thomas Gleixner, Jan Kiszka, Henning Schild
Cc: Peter Zijlstra, x86, linux-kernel, Ingo Molnar, Borislav Petkov,
Guenter Roeck, xenomai
Thanks very much.
I will ask our customer to try this patch and let you know the result.
-guocai
On 9/3/24 23:27, Thomas Gleixner wrote:
>
> On Tue, Jul 27 2021 at 10:46, Jan Kiszka wrote:
>
> Picking up this dead thread again.
>
>> What is supposed to prevent the following in mainline:
>>
>> CPU 0 CPU 1 CPU 2
>>
>> native_stop_other_cpus <INTERRUPT>
>> send_IPI_allbutself ...
>> <INTERRUPT>
>> sysvec_reboot
>> stop_this_cpu
>> set_cpu_online(false)
>> native_smp_send_reschedule(1)
>> if (cpu_is_offline(1)) ...
> Nothing. And that's what probably happens if I read the stack trace
> correctly.
>
> But we can be slightly smarter about this for the reboot IPI (the NMI
> case does not have that issue).
>
> CPU 0 CPU 1 CPU 2
>
> native_stop_other_cpus <INTERRUPT>
> send_IPI_allbutself ...
> <IPI>
> sysvec_reboot
> wait_for_others();
> </INTERRUPT>
> <IPI>
> sysvec_reboot
> wait_for_others();
> stop_this_cpu(); stop_this_cpu();
> set_cpu_online(false); set_cpu_online(false);
>
> Something like the uncompiled below.
>
> Thanks,
>
> tglx
> ---
> --- a/arch/x86/include/asm/cpu.h
> +++ b/arch/x86/include/asm/cpu.h
> @@ -68,5 +68,6 @@ bool intel_find_matching_signature(void
> int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type);
>
> extern struct cpumask cpus_stop_mask;
> +atomic_t cpus_stop_in_ipi;
>
> #endif /* _ASM_X86_CPU_H */
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -721,7 +721,7 @@ bool xen_set_default_idle(void);
> #define xen_set_default_idle 0
> #endif
>
> -void __noreturn stop_this_cpu(void *dummy);
> +void __noreturn stop_this_cpu(bool sync);
> void microcode_check(struct cpuinfo_x86 *prev_info);
> void store_cpu_caps(struct cpuinfo_x86 *info);
>
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -791,9 +791,10 @@ bool xen_set_default_idle(void)
> }
> #endif
>
> +atomic_t cpus_stop_in_ipi;
> struct cpumask cpus_stop_mask;
>
> -void __noreturn stop_this_cpu(void *dummy)
> +void __noreturn stop_this_cpu(bool sync)
> {
> struct cpuinfo_x86 *c = this_cpu_ptr(&cpu_info);
> unsigned int cpu = smp_processor_id();
> @@ -801,6 +802,16 @@ void __noreturn stop_this_cpu(void *dumm
> local_irq_disable();
>
> /*
> + * Account this CPU and loop until the other CPUs reached this
> + * point. If they don't react, the control CPU will raise an NMI.
> + */
> + if (sync) {
> + atomic_dec(&cpus_stop_in_ipi);
> + while (atomic_read(&cpus_stop_in_ipi))
> + cpu_relax();
> + }
> +
> + /*
> * Remove this CPU from the online mask and disable it
> * unconditionally. This might be redundant in case that the reboot
> * vector was handled late and stop_other_cpus() sent an NMI.
> --- a/arch/x86/kernel/reboot.c
> +++ b/arch/x86/kernel/reboot.c
> @@ -788,7 +788,7 @@ static void native_machine_halt(void)
>
> tboot_shutdown(TB_SHUTDOWN_HALT);
>
> - stop_this_cpu(NULL);
> + stop_this_cpu(false);
> }
>
> static void native_machine_power_off(void)
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -125,7 +125,7 @@ static int smp_stop_nmi_callback(unsigne
> return NMI_HANDLED;
>
> cpu_emergency_disable_virtualization();
> - stop_this_cpu(NULL);
> + stop_this_cpu(false);
>
> return NMI_HANDLED;
> }
> @@ -137,7 +137,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_reboot)
> {
> apic_eoi();
> cpu_emergency_disable_virtualization();
> - stop_this_cpu(NULL);
> + stop_this_cpu(true);
> }
>
> static int register_stop_handler(void)
> @@ -189,6 +189,7 @@ static void native_stop_other_cpus(int w
> */
> cpumask_copy(&cpus_stop_mask, cpu_online_mask);
> cpumask_clear_cpu(this_cpu, &cpus_stop_mask);
> + atomic_set(&cpus_stop_in_ipi, num_online_cpus() - 1);
>
> if (!cpumask_empty(&cpus_stop_mask)) {
> apic_send_IPI_allbutself(REBOOT_VECTOR);
> @@ -235,10 +236,12 @@ static void native_stop_other_cpus(int w
> local_irq_restore(flags);
>
> /*
> - * Ensure that the cpus_stop_mask cache lines are invalidated on
> - * the other CPUs. See comment vs. SME in stop_this_cpu().
> + * Ensure that the cpus_stop_mask and cpus_stop_in_ipi cache lines
> + * are invalidated on the other CPUs. See comment vs. SME in
> + * stop_this_cpu().
> */
> cpumask_clear(&cpus_stop_mask);
> + atomic_set(&cpus_stop_in_ipi, 0);
> }
>
> /*
>
* My branch is v5.2/standard/preempt-rt/intel-x86 and I make a patch according
2024-09-04 7:46 ` guocai he
@ 2024-09-18 1:50 ` guocai.he.cn
0 siblings, 0 replies; 22+ messages in thread
From: guocai.he.cn @ 2024-09-18 1:50 UTC (permalink / raw)
To: tglx
Cc: bp, henning.schild, jan.kiszka, linux-kernel, linux, mingo,
peterz, guocai.he.cn, x86, xenomai
From cbf1606332c48b32c4bb8d61ac6911e3064e79fc Mon Sep 17 00:00:00 2001
From: Guocai He <guocai.he.cn@windriver.com>
Date: Wed, 4 Sep 2024 04:45:26 +0000
Subject: [PATCH] patch for poweroff
---
arch/x86/include/asm/processor.h | 2 +-
arch/x86/kernel/process.c | 14 +++++++++++++-
arch/x86/kernel/reboot.c | 2 +-
arch/x86/kernel/smp.c | 9 ++++++---
4 files changed, 21 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 5f4e79d14613..4c1cf610807a 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -975,7 +975,7 @@ bool xen_set_default_idle(void);
#define xen_set_default_idle 0
#endif
-void stop_this_cpu(void *dummy);
+void stop_this_cpu(bool sync);
void df_debug(struct pt_regs *regs, long error_code);
void microcode_check(void);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 2243af6530f8..35d5cf73716e 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -590,9 +590,21 @@ bool xen_set_default_idle(void)
}
#endif
-void stop_this_cpu(void *dummy)
+atomic_t cpus_stop_in_ipi;
+void stop_this_cpu(bool sync)
{
local_irq_disable();
+
+ /*
+ * Account this cpu and loop until the other cpus reached this
+ * point. If they don't react, the control cpu will raise an NMI.
+ */
+ if(sync) {
+ atomic_dec(&cpus_stop_in_ipi);
+ while (atomic_read(&cpus_stop_in_ipi))
+ cpu_relax();
+ }
+
/*
* Remove this CPU:
*/
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 3f677832fc12..389643727e37 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -742,7 +742,7 @@ static void native_machine_halt(void)
tboot_shutdown(TB_SHUTDOWN_HALT);
- stop_this_cpu(NULL);
+ stop_this_cpu(false);
}
static void native_machine_power_off(void)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index f2a749586252..9dee65b96115 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -112,6 +112,7 @@
* about nothing of note with C stepping upwards.
*/
+extern atomic_t cpus_stop_in_ipi;
static atomic_t stopping_cpu = ATOMIC_INIT(-1);
static bool smp_no_nmi_ipi = false;
@@ -162,7 +163,7 @@ static int smp_stop_nmi_callback(unsigned int val, struct pt_regs *regs)
return NMI_HANDLED;
cpu_emergency_vmxoff();
- stop_this_cpu(NULL);
+ stop_this_cpu(false);
return NMI_HANDLED;
}
@@ -175,7 +176,7 @@ asmlinkage __visible void smp_reboot_interrupt(void)
{
ipi_entering_ack_irq();
cpu_emergency_vmxoff();
- stop_this_cpu(NULL);
+ stop_this_cpu(true);
irq_exit();
}
@@ -192,7 +193,8 @@ static void native_stop_other_cpus(int wait)
if (reboot_force)
return;
-
+
+ atomic_set(&cpus_stop_in_ipi, num_online_cpus() - 1);
/*
* Use an own vector here because smp_call_function
* does lots of things not suitable in a panic situation.
@@ -256,6 +258,7 @@ static void native_stop_other_cpus(int wait)
local_irq_save(flags);
disable_local_APIC();
mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
+ atomic_set(&cpus_stop_in_ipi, 0);
local_irq_restore(flags);
}
--
2.25.1
* [PATCH] patch for poweroff
2024-09-03 15:27 ` Thomas Gleixner
2024-09-04 7:46 ` guocai he
@ 2024-09-18 2:59 ` guocai.he.cn
2025-07-09 13:44 ` sched: Unexpected reschedule of offline CPU#2! Phil Auld
2 siblings, 0 replies; 22+ messages in thread
From: guocai.he.cn @ 2024-09-18 2:59 UTC (permalink / raw)
To: tglx
Cc: bp, guocai.he.cn, henning.schild, jan.kiszka, linux-kernel, linux,
mingo, peterz, x86, xenomai
From: Guocai He <guocai.he.cn@windriver.com>
My branch is v5.2/standard/preempt-rt/intel-x86, and I made a patch
according to your suggested patch, but it does not work.
Do you have more advice?
Thanks very much.
The following is my patch on v5.2:
---
arch/x86/include/asm/processor.h | 2 +-
arch/x86/kernel/process.c | 14 +++++++++++++-
arch/x86/kernel/reboot.c | 2 +-
arch/x86/kernel/smp.c | 9 ++++++---
4 files changed, 21 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 5f4e79d14613..4c1cf610807a 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -975,7 +975,7 @@ bool xen_set_default_idle(void);
#define xen_set_default_idle 0
#endif
-void stop_this_cpu(void *dummy);
+void stop_this_cpu(bool sync);
void df_debug(struct pt_regs *regs, long error_code);
void microcode_check(void);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 2243af6530f8..35d5cf73716e 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -590,9 +590,21 @@ bool xen_set_default_idle(void)
}
#endif
-void stop_this_cpu(void *dummy)
+atomic_t cpus_stop_in_ipi;
+void stop_this_cpu(bool sync)
{
local_irq_disable();
+
+ /*
+ * Account this cpu and loop until the other cpus reached this
+ * point. If they don't react, the control cpu will raise an NMI.
+ */
+ if(sync) {
+ atomic_dec(&cpus_stop_in_ipi);
+ while (atomic_read(&cpus_stop_in_ipi))
+ cpu_relax();
+ }
+
/*
* Remove this CPU:
*/
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 3f677832fc12..389643727e37 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -742,7 +742,7 @@ static void native_machine_halt(void)
tboot_shutdown(TB_SHUTDOWN_HALT);
- stop_this_cpu(NULL);
+ stop_this_cpu(false);
}
static void native_machine_power_off(void)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index f2a749586252..9dee65b96115 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -112,6 +112,7 @@
* about nothing of note with C stepping upwards.
*/
+extern atomic_t cpus_stop_in_ipi;
static atomic_t stopping_cpu = ATOMIC_INIT(-1);
static bool smp_no_nmi_ipi = false;
@@ -162,7 +163,7 @@ static int smp_stop_nmi_callback(unsigned int val, struct pt_regs *regs)
return NMI_HANDLED;
cpu_emergency_vmxoff();
- stop_this_cpu(NULL);
+ stop_this_cpu(false);
return NMI_HANDLED;
}
@@ -175,7 +176,7 @@ asmlinkage __visible void smp_reboot_interrupt(void)
{
ipi_entering_ack_irq();
cpu_emergency_vmxoff();
- stop_this_cpu(NULL);
+ stop_this_cpu(true);
irq_exit();
}
@@ -192,7 +193,8 @@ static void native_stop_other_cpus(int wait)
if (reboot_force)
return;
-
+
+ atomic_set(&cpus_stop_in_ipi, num_online_cpus() - 1);
/*
* Use an own vector here because smp_call_function
* does lots of things not suitable in a panic situation.
@@ -256,6 +258,7 @@ static void native_stop_other_cpus(int wait)
local_irq_save(flags);
disable_local_APIC();
mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
+ atomic_set(&cpus_stop_in_ipi, 0);
local_irq_restore(flags);
}
--
2.25.1
* Re: sched: Unexpected reschedule of offline CPU#2!
2024-09-03 15:27 ` Thomas Gleixner
2024-09-04 7:46 ` guocai he
2024-09-18 2:59 ` [PATCH] patch for poweroff guocai.he.cn
@ 2025-07-09 13:44 ` Phil Auld
2025-07-19 21:17 ` Thomas Gleixner
2 siblings, 1 reply; 22+ messages in thread
From: Phil Auld @ 2025-07-09 13:44 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Jan Kiszka, Henning Schild, Peter Zijlstra, x86, linux-kernel,
Ingo Molnar, Borislav Petkov, Guenter Roeck, xenomai,
guocai.he.cn, pauld
Hi Thomas,
On Tue, Sep 03, 2024 at 05:27:58PM +0200 Thomas Gleixner wrote:
> On Tue, Jul 27 2021 at 10:46, Jan Kiszka wrote:
>
> Picking up this dead thread again.
Necro-ing this again...
I keep getting occasional reports of this case. Unfortunately
though, I've never been able to reproduce it myself.
Did the below patch ever go anywhere?
It seems to be stable in my testing with the addition of
an "extern" in asm/cpu.h to get it to build.
>
> > What is supposed to prevent the following in mainline:
> >
> > CPU 0 CPU 1 CPU 2
> >
> > native_stop_other_cpus <INTERRUPT>
> > send_IPI_allbutself ...
> > <INTERRUPT>
> > sysvec_reboot
> > stop_this_cpu
> > set_cpu_online(false)
> > native_smp_send_reschedule(1)
> > if (cpu_is_offline(1)) ...
>
> Nothing. And that's what probably happens if I read the stack trace
> correctly.
>
> But we can be slightly smarter about this for the reboot IPI (the NMI
> case does not have that issue).
>
> CPU 0 CPU 1 CPU 2
>
> native_stop_other_cpus <INTERRUPT>
> send_IPI_allbutself ...
> <IPI>
> sysvec_reboot
> wait_for_others();
> </INTERRUPT>
> <IPI>
> sysvec_reboot
> wait_for_others();
> stop_this_cpu(); stop_this_cpu();
> set_cpu_online(false); set_cpu_online(false);
>
> Something like the uncompiled below.
>
> Thanks,
>
> tglx
> ---
> --- a/arch/x86/include/asm/cpu.h
> +++ b/arch/x86/include/asm/cpu.h
> @@ -68,5 +68,6 @@ bool intel_find_matching_signature(void
> int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type);
>
> extern struct cpumask cpus_stop_mask;
> +atomic_t cpus_stop_in_ipi;
extern
>
> #endif /* _ASM_X86_CPU_H */
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -721,7 +721,7 @@ bool xen_set_default_idle(void);
> #define xen_set_default_idle 0
> #endif
>
> -void __noreturn stop_this_cpu(void *dummy);
> +void __noreturn stop_this_cpu(bool sync);
> void microcode_check(struct cpuinfo_x86 *prev_info);
> void store_cpu_caps(struct cpuinfo_x86 *info);
>
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -791,9 +791,10 @@ bool xen_set_default_idle(void)
> }
> #endif
>
> +atomic_t cpus_stop_in_ipi;
> struct cpumask cpus_stop_mask;
>
> -void __noreturn stop_this_cpu(void *dummy)
> +void __noreturn stop_this_cpu(bool sync)
> {
> struct cpuinfo_x86 *c = this_cpu_ptr(&cpu_info);
> unsigned int cpu = smp_processor_id();
> @@ -801,6 +802,16 @@ void __noreturn stop_this_cpu(void *dumm
> local_irq_disable();
>
> /*
> + * Account this CPU and loop until the other CPUs reached this
> + * point. If they don't react, the control CPU will raise an NMI.
> + */
> + if (sync) {
> + atomic_dec(&cpus_stop_in_ipi);
> + while (atomic_read(&cpus_stop_in_ipi))
> + cpu_relax();
> + }
> +
> + /*
> * Remove this CPU from the online mask and disable it
> * unconditionally. This might be redundant in case that the reboot
> * vector was handled late and stop_other_cpus() sent an NMI.
> --- a/arch/x86/kernel/reboot.c
> +++ b/arch/x86/kernel/reboot.c
> @@ -788,7 +788,7 @@ static void native_machine_halt(void)
>
> tboot_shutdown(TB_SHUTDOWN_HALT);
>
> - stop_this_cpu(NULL);
> + stop_this_cpu(false);
> }
>
> static void native_machine_power_off(void)
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -125,7 +125,7 @@ static int smp_stop_nmi_callback(unsigne
> return NMI_HANDLED;
>
> cpu_emergency_disable_virtualization();
> - stop_this_cpu(NULL);
> + stop_this_cpu(false);
>
> return NMI_HANDLED;
> }
> @@ -137,7 +137,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_reboot)
> {
> apic_eoi();
> cpu_emergency_disable_virtualization();
> - stop_this_cpu(NULL);
> + stop_this_cpu(true);
> }
>
> static int register_stop_handler(void)
> @@ -189,6 +189,7 @@ static void native_stop_other_cpus(int w
> */
> cpumask_copy(&cpus_stop_mask, cpu_online_mask);
> cpumask_clear_cpu(this_cpu, &cpus_stop_mask);
> + atomic_set(&cpus_stop_in_ipi, num_online_cpus() - 1);
>
> if (!cpumask_empty(&cpus_stop_mask)) {
> apic_send_IPI_allbutself(REBOOT_VECTOR);
> @@ -235,10 +236,12 @@ static void native_stop_other_cpus(int w
> local_irq_restore(flags);
>
> /*
> - * Ensure that the cpus_stop_mask cache lines are invalidated on
> - * the other CPUs. See comment vs. SME in stop_this_cpu().
> + * Ensure that the cpus_stop_mask and cpus_stop_in_ipi cache lines
> + * are invalidated on the other CPUs. See comment vs. SME in
> + * stop_this_cpu().
> */
> cpumask_clear(&cpus_stop_mask);
> + atomic_set(&cpus_stop_in_ipi, 0);
> }
>
> /*
>
Thanks,
Phil
--
* Re: sched: Unexpected reschedule of offline CPU#2!
2025-07-09 13:44 ` sched: Unexpected reschedule of offline CPU#2! Phil Auld
@ 2025-07-19 21:17 ` Thomas Gleixner
2025-07-20 10:47 ` Thomas Gleixner
0 siblings, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2025-07-19 21:17 UTC (permalink / raw)
To: Phil Auld
Cc: Jan Kiszka, Henning Schild, Peter Zijlstra, x86, linux-kernel,
Ingo Molnar, Borislav Petkov, Guenter Roeck, xenomai,
guocai.he.cn, pauld
On Wed, Jul 09 2025 at 09:44, Phil Auld wrote:
> Hi Thomas,
>
> On Tue, Sep 03, 2024 at 05:27:58PM +0200 Thomas Gleixner wrote:
>> On Tue, Jul 27 2021 at 10:46, Jan Kiszka wrote:
>>
>> Picking up this dead thread again.
>
> Necro-ing this again...
>
> I keep getting occasional reports of this case. Unfortunately
> though, I've never been able to reproduce it myself.
>
> Did the below patch ever go anywhere?
Nope. Guocai said it does not work and I have no reproducer either to
actually look at it deeper and there was further debug data provided.
> It seems to be stable in my testing with the addition of
> an "extern" in asm/cpu.h to get it to build.
I don't see why it wouldn't be stable, but as it does not seem to solve
the issue merging it as is does not make sense.
Thanks,
tglx
* Re: sched: Unexpected reschedule of offline CPU#2!
2025-07-19 21:17 ` Thomas Gleixner
@ 2025-07-20 10:47 ` Thomas Gleixner
2025-07-20 14:14 ` Guenter Roeck
2025-07-28 13:13 ` Phil Auld
0 siblings, 2 replies; 22+ messages in thread
From: Thomas Gleixner @ 2025-07-20 10:47 UTC (permalink / raw)
To: Phil Auld
Cc: Jan Kiszka, Henning Schild, Peter Zijlstra, x86, linux-kernel,
Ingo Molnar, Borislav Petkov, Guenter Roeck, xenomai,
guocai.he.cn, pauld
On Sat, Jul 19 2025 at 23:17, Thomas Gleixner wrote:
> On Wed, Jul 09 2025 at 09:44, Phil Auld wrote:
>> Hi Thomas,
>>
>> On Tue, Sep 03, 2024 at 05:27:58PM +0200 Thomas Gleixner wrote:
>>> On Tue, Jul 27 2021 at 10:46, Jan Kiszka wrote:
>>>
>>> Picking up this dead thread again.
>>
>> Necro-ing this again...
>>
>> I keep getting occasional reports of this case. Unfortunately
>> though, I've never been able to reproduce it myself.
>>
>> Did the below patch ever go anywhere?
>
> Nope. Guocai said it does not work and I have no reproducer either to
> actually look at it deeper and there was further debug data provided.
Obviously:
no further debug data was provided, so I can only grope around in the
dark.
* Re: sched: Unexpected reschedule of offline CPU#2!
2025-07-20 10:47 ` Thomas Gleixner
@ 2025-07-20 14:14 ` Guenter Roeck
2025-07-28 13:13 ` Phil Auld
1 sibling, 0 replies; 22+ messages in thread
From: Guenter Roeck @ 2025-07-20 14:14 UTC (permalink / raw)
To: Thomas Gleixner, Phil Auld
Cc: Jan Kiszka, Henning Schild, Peter Zijlstra, x86, linux-kernel,
Ingo Molnar, Borislav Petkov, xenomai, guocai.he.cn
On 7/20/25 03:47, Thomas Gleixner wrote:
> On Sat, Jul 19 2025 at 23:17, Thomas Gleixner wrote:
>> On Wed, Jul 09 2025 at 09:44, Phil Auld wrote:
>>> Hi Thomas,
>>>
>>> On Tue, Sep 03, 2024 at 05:27:58PM +0200 Thomas Gleixner wrote:
>>>> On Tue, Jul 27 2021 at 10:46, Jan Kiszka wrote:
>>>>
>>>> Picking up this dead thread again.
>>>
>>> Necro-ing this again...
>>>
>>> I keep getting occasional reports of this case. Unfortunately
>>> though, I've never been able to reproduce it myself.
>>>
>>> Did the below patch ever go anywhere?
>>
>> Nope. Guocai said it does not work and I have no reproducer either to
>> actually look at it deeper and there was further debug data provided.
>
> Obviously:
>
> no further debug data was provided, so I can only tap in the
> dark.
>
FWIW, I have not seen this problem for a long time. I know it existed in
5.10, and I can find logs showing the problem in my company's bug system,
but only for 5.10 and older kernels. I don't see it in my test system either,
not even for 5.4.y or 5.10.y kernels.
Guenter
* Re: sched: Unexpected reschedule of offline CPU#2!
2025-07-20 10:47 ` Thomas Gleixner
2025-07-20 14:14 ` Guenter Roeck
@ 2025-07-28 13:13 ` Phil Auld
1 sibling, 0 replies; 22+ messages in thread
From: Phil Auld @ 2025-07-28 13:13 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Jan Kiszka, Henning Schild, Peter Zijlstra, x86, linux-kernel,
Ingo Molnar, Borislav Petkov, Guenter Roeck, xenomai,
guocai.he.cn
On Sun, Jul 20, 2025 at 12:47:24PM +0200 Thomas Gleixner wrote:
> On Sat, Jul 19 2025 at 23:17, Thomas Gleixner wrote:
> > On Wed, Jul 09 2025 at 09:44, Phil Auld wrote:
> >> Hi Thomas,
> >>
> >> On Tue, Sep 03, 2024 at 05:27:58PM +0200 Thomas Gleixner wrote:
> >>> On Tue, Jul 27 2021 at 10:46, Jan Kiszka wrote:
> >>>
> >>> Picking up this dead thread again.
> >>
> >> Necro-ing this again...
> >>
> >> I keep getting occasional reports of this case. Unfortunately
> >> though, I've never been able to reproduce it myself.
> >>
> >> Did the below patch ever go anywhere?
> >
> > Nope. Guocai said it does not work and I have no reproducer either to
> > actually look at it deeper and there was further debug data provided.
>
> Obviously:
>
> no further debug data was provided, so I can only tap in the
> dark.
>
Fair enough, thanks. I did not see the "does not work" part :)
Cheers,
Phil
--
end of thread, other threads:[~2025-07-28 13:13 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-07-27 16:44 sched: Unexpected reschedule of offline CPU#2! Guenter Roeck
2019-07-29 9:35 ` Peter Zijlstra
2019-07-29 9:58 ` Thomas Gleixner
2019-07-29 10:13 ` Peter Zijlstra
2019-07-29 10:38 ` Thomas Gleixner
2019-07-29 10:47 ` Peter Zijlstra
2019-07-29 20:50 ` Guenter Roeck
2019-08-16 10:22 ` Thomas Gleixner
2019-08-16 19:32 ` Guenter Roeck
2019-08-17 20:21 ` Thomas Gleixner
2021-07-27 8:00 ` Henning Schild
2021-07-27 8:46 ` Jan Kiszka
2024-09-03 6:15 ` guocai.he.cn
2024-09-03 15:27 ` Thomas Gleixner
2024-09-04 7:46 ` guocai he
2024-09-18 1:50 ` My branch is v5.2/standard/preempt-rt/intel-x86 and I make a patch according guocai.he.cn
2024-09-18 2:59 ` [PATCH] patch for poweroff guocai.he.cn
2025-07-09 13:44 ` sched: Unexpected reschedule of offline CPU#2! Phil Auld
2025-07-19 21:17 ` Thomas Gleixner
2025-07-20 10:47 ` Thomas Gleixner
2025-07-20 14:14 ` Guenter Roeck
2025-07-28 13:13 ` Phil Auld