question about detect hard lockups without NMIs using secondary cpus

* question about detect hard lockups without NMIs using secondary cpus
@ 2015-07-29 16:03 yoma sophian
  2015-07-29 18:29 ` Russell King - ARM Linux
  0 siblings, 1 reply; 3+ messages in thread
From: yoma sophian @ 2015-07-29 16:03 UTC
  To: linux-arm-kernel

Hi all,

The link below describes how to emulate NMIs on systems where they are
not available, by using timer interrupts on other CPUs.

http://article.gmane.org/gmane.linux.kernel/1419661

In kernel/watchdog.c:
    --> watchdog_overflow_callback()
        if (is_hardlockup()) {
                ...
                if (hardlockup_panic)
                        panic("Watchdog detected hard LOCKUP on cpu %d",
                              this_cpu); /*************/
                else
                        WARN(1, "Watchdog detected hard LOCKUP on cpu %d",
                             this_cpu);
                ...
        }

I have some questions:
a.
On an SMP system with, say, 4 cores and hardlockup_panic set to 1,
suppose Core0 finds Core1 in a hard lockup in hardIRQ context.
The panic() call marked with '*' above will then fail in
smp_send_stop(), and we will have no idea where Core1 is stuck,
right?

b.
Things get worse on a single-core system if a hard lockup happens:
we have no idea at all what happened.

If my conclusions above are correct, is there any way to debug
situations a) and b)?

Thanks in advance for your kind help.

* question about detect hard lockups without NMIs using secondary cpus
  2015-07-29 16:03 question about detect hard lockups without NMIs using secondary cpus yoma sophian
@ 2015-07-29 18:29 ` Russell King - ARM Linux
  2015-07-30 16:20   ` yoma sophian
  0 siblings, 1 reply; 3+ messages in thread
From: Russell King - ARM Linux @ 2015-07-29 18:29 UTC
  To: linux-arm-kernel

On Thu, Jul 30, 2015 at 12:03:46AM +0800, yoma sophian wrote:
> Hi all,
>
> The link below describes how to emulate NMIs on systems where they are
> not available, by using timer interrupts on other CPUs.
> 
> http://article.gmane.org/gmane.linux.kernel/1419661
> 
> in kernel/watchdog.c
>     --> watchdog_overflow_callback
>           if (is_hardlockup()) {
>            ...........................
>                 if (hardlockup_panic)
>                         panic("Watchdog detected hard LOCKUP on cpu %d",
>                               this_cpu); /*************/
>                 else
>                         WARN(1, "Watchdog detected hard LOCKUP on cpu %d",
>                              this_cpu);
>              .......................
>         }
> 
> I have some questions:
> a.
> On an SMP system with, say, 4 cores and hardlockup_panic set to 1,
> suppose Core0 finds Core1 in a hard lockup in hardIRQ context.
> The panic() call marked with '*' above will then fail in
> smp_send_stop(), and we will have no idea where Core1 is stuck,
> right?

watchdog_overflow_callback() is only ever entered for the failed core.
What you missed out on is:

	int this_cpu = smp_processor_id();

which gets the CPU number of the CPU executing this code.  So, Core 0
will never find Core 1 having locked up via this code path.
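
As a rough illustration of why this check is purely per-CPU, here is a
minimal user-space sketch of the comparison that is_hardlockup() makes.
The two counters are named after the per-CPU variables in
kernel/watchdog.c; the array indexing, NR_CPUS_MODEL and the *_model
function names are assumptions made for this sketch, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_MODEL 4	/* hypothetical CPU count for this sketch */

/* Stand-ins for the kernel's per-CPU variables, indexed by CPU number. */
static unsigned long hrtimer_interrupts[NR_CPUS_MODEL];
static unsigned long hrtimer_interrupts_saved[NR_CPUS_MODEL];

/* Models the per-CPU timer interrupt: proof that IRQs are still serviced. */
void timer_tick_model(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
	hrtimer_interrupts[cpu]++;
}

/*
 * Models the check made from the NMI handler running on 'cpu' itself.
 * Returns true if no timer interrupt ran on this CPU since the last check.
 * Only this CPU's own counters are read, never another CPU's.
 */
bool is_hardlockup_model(int cpu)
{
	unsigned long hrint;

	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
	hrint = hrtimer_interrupts[cpu];
	if (hrtimer_interrupts_saved[cpu] == hrint)
		return true;

	hrtimer_interrupts_saved[cpu] = hrint;
	return false;
}
```

In this model, two consecutive calls to is_hardlockup_model(1) with no
timer_tick_model(1) in between make the second call report a lockup,
and nothing that CPU 0 does can change the result for CPU 1.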

> b.
> Things get worse on a single-core system if a hard lockup happens:
> we have no idea at all what happened.

Basically, without NMIs (or FIQs in ARM speak) lockups with IRQs off are
undetectable by the kernel other than "the system stopped responding".
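
To make the IRQ/FIQ distinction concrete, here is a small sketch of the
two ARM CPSR mask bits involved. The bit values match PSR_I_BIT and
PSR_F_BIT in the ARM kernel headers; the enum and the helper function
are hypothetical names used only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define PSR_I_BIT 0x00000080u	/* CPSR bit 7: IRQs are masked when set */
#define PSR_F_BIT 0x00000040u	/* CPSR bit 6: FIQs are masked when set */

enum exc_kind { EXC_IRQ, EXC_FIQ };	/* hypothetical, for this sketch */

/* Returns true if a pending exception of 'kind' can be taken by the CPU. */
bool exception_deliverable(uint32_t cpsr, enum exc_kind kind)
{
	uint32_t mask = (kind == EXC_FIQ) ? PSR_F_BIT : PSR_I_BIT;

	return (cpsr & mask) == 0;
}
```

A CPU stuck in an IRQs-off region has only the I bit set, so in this
model exception_deliverable(PSR_I_BIT, EXC_IRQ) is false while
exception_deliverable(PSR_I_BIT, EXC_FIQ) is still true.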

In a SMP system, there are mechanisms by which other CPUs can detect a
locked-up CPU, and they can call trigger_all_cpu_backtrace() - and that
can only get a trace out of the locked up CPU if it uses FIQs.  A CPU
which has locked up in an IRQs-off region won't be able to receive an
IRQ or IPI by definition.
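
One way such cross-CPU detection could work is sketched below as a
user-space model: each CPU's periodic timer also checks whether the
next CPU's timer-interrupt count has advanced. All names here are
hypothetical, and the sketch assumes the check runs less often than
the timer fires; it is not the code from the patch linked in the
original question.

```c
#include <stdbool.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count */

static unsigned long tick_count[NR_CPUS_SKETCH];	/* timer IRQs taken per CPU */
static unsigned long tick_seen[NR_CPUS_SKETCH];	/* last count seen by the watcher */
static bool irqs_stuck_off[NR_CPUS_SKETCH];	/* models a CPU spinning with IRQs off */

/* Put 'cpu' into a modelled IRQs-off lockup. */
void lock_up_cpu(int cpu)
{
	irqs_stuck_off[cpu] = true;
}

/* Timer interrupt on 'cpu'; never delivered while that CPU is stuck. */
void sketch_timer_tick(int cpu)
{
	if (!irqs_stuck_off[cpu])
		tick_count[cpu]++;
}

/*
 * Run on 'cpu' to watch the next CPU. Returns the number of the watched
 * CPU if its tick count has not advanced since the last check, else -1.
 */
int sketch_check_next_cpu(int cpu)
{
	int other = (cpu + 1) % NR_CPUS_SKETCH;

	if (irqs_stuck_off[cpu])
		return -1;	/* a stuck watcher cannot run the check */
	if (tick_seen[other] == tick_count[other])
		return other;

	tick_seen[other] = tick_count[other];
	return -1;
}
```

Note that even when sketch_check_next_cpu(0) returns 1 in this model,
CPU 0 only learns that CPU 1 has stopped taking interrupts. Getting a
backtrace out of CPU 1 still needs something that can interrupt it.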

Work has been going on for the last 9 months to try and bring a working
trigger_all_cpu_backtrace() implementation, initially with IRQs and
later with FIQs.

In previous merge windows, we have moved forward with getting some FIQ
changes merged, and in the next merge window, I have patches queued up
(available in linux-next) which add IRQ-based trigger_all_cpu_backtrace()
support.

The next piece of the puzzle is sorting out the patches which bring FIQ
based trigger_all_cpu_backtrace() support - but even if we do, that won't
be available everywhere - for example, it won't be available if your kernel
runs in the non-secure world with a secure monitor, because FIQs generally
aren't usable in that world.

The only other alternative is a hardware JTAG debugger to inspect the
state of all CPUs in the system.

* question about detect hard lockups without NMIs using secondary cpus
  2015-07-29 18:29 ` Russell King - ARM Linux
@ 2015-07-30 16:20   ` yoma sophian
  0 siblings, 0 replies; 3+ messages in thread
From: yoma sophian @ 2015-07-30 16:20 UTC
  To: linux-arm-kernel

Hi Russell,

2015-07-30 2:29 GMT+08:00 Russell King - ARM Linux <linux@arm.linux.org.uk>:
> On Thu, Jul 30, 2015 at 12:03:46AM +0800, yoma sophian wrote:
>> Hi all,
>>
>> The link below describes how to emulate NMIs on systems where they are
>> not available, by using timer interrupts on other CPUs.
>>
>> http://article.gmane.org/gmane.linux.kernel/1419661
>>
>> in kernel/watchdog.c
>>     --> watchdog_overflow_callback
>>           if (is_hardlockup()) {
>>            ...........................
>>                 if (hardlockup_panic)
>>                         panic("Watchdog detected hard LOCKUP on cpu %d",
>>                               this_cpu); /*************/
>>                 else
>>                         WARN(1, "Watchdog detected hard LOCKUP on cpu %d",
>>                              this_cpu);
>>              .......................
>>         }
>>
>> I have some questions:
>> a.
>> On an SMP system with, say, 4 cores and hardlockup_panic set to 1,
>> suppose Core0 finds Core1 in a hard lockup in hardIRQ context.
>> The panic() call marked with '*' above will then fail in
>> smp_send_stop(), and we will have no idea where Core1 is stuck,
>> right?
>
> watchdog_overflow_callback() is only ever entered for the failed core.
> What you missed out on is:
>
>         int this_cpu = smp_processor_id();
>
> which gets the CPU number of the CPU executing this code.  So, Core 0
> will never find Core 1 having locked up via this code path.

If Core0 finds that it is locked up in lockdep, that means the
situation is not so bad: Core0 still has a chance to recover, right?

>> b.
>> Things get worse on a single-core system if a hard lockup happens:
>> we have no idea at all what happened.
>
> Basically, without NMIs (or FIQs in ARM speak) lockups with IRQs off are
> undetectable by the kernel other than "the system stopped responding".
>
> In a SMP system, there are mechanisms by which other CPUs can detect a
> locked-up CPU, and they can call trigger_all_cpu_backtrace() - and that
> can only get a trace out of the locked up CPU if it uses FIQs.  A CPU
> which has locked up in an IRQs-off region won't be able to receive an
> IRQ or IPI by definition.

So in the case you describe, "A CPU which has locked up in an IRQs-off
region won't be able to receive an IRQ or IPI by definition", the only
way to trigger them is by FIQ, right?

> Work has been going on for the last 9 months to try and bring a working
> trigger_all_cpu_backtrace() implementation, initially with IRQs and
> later with FIQs.
>
> In previous merge windows, we have moved forward with getting some FIQ
> changes merged, and in the next merge window, I have patches queued up
> (available in linux-next) which add IRQ-based trigger_all_cpu_backtrace()
> support.
>
> The next piece of the puzzle is sorting out the patches which bring FIQ
> based trigger_all_cpu_backtrace() support - but even if we do, that won't
> be available everywhere - for example, it won't be available if your kernel
> runs in the non-secure world with a secure monitor, because FIQs generally
> aren't usable in that world.

Is there a patch we can refer to for what you describe above?
If we stay in the secure world and handle some PPIs as FIQs while the
other PPIs/SPIs remain IRQs, isn't that enough to cover the idea above?
(The GIC has the capability to let NS IRQs be handled by a secure CPU.)

I appreciate your kind help.
