From mboxrd@z Thu Jan 1 00:00:00 1970
From: linux@arm.linux.org.uk (Russell King - ARM Linux)
Date: Wed, 29 Jul 2015 19:29:06 +0100
Subject: question about detect hard lockups without NMIs using secondary cpus
In-Reply-To:
References:
Message-ID: <20150729182905.GO7557@n2100.arm.linux.org.uk>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

On Thu, Jul 30, 2015 at 12:03:46AM +0800, yoma sophian wrote:
> hi all:
> below link introduced how to emulate NMIs on systems where they are
> not available by using timer interrupts on other cpus.
>
> http://article.gmane.org/gmane.linux.kernel/1419661
>
> in kernel/watchdog.c
> --> watchdog_overflow_callback
>         if (is_hardlockup()) {
>                 ...........................
>                 if (hardlockup_panic)
>                         panic("Watchdog detected hard LOCKUP on cpu %d",
>                               this_cpu);        /*************/
>                 else
>                         WARN(1, "Watchdog detected hard LOCKUP on cpu %d",
>                              this_cpu);
>                 .......................
>         }
>
> I have some questions:
> a.
> in SMP system, suppose 4 cores, and hardlockup_panic is 1.
> Core0 find Core1 hard lockup in hardIRQ context
> the panic function, above with '*' marked, will fail on
> smp_send_stop(), and we will have no idea where core1 is trapped in,
> right?

watchdog_overflow_callback() is only ever entered for the failed core.
What you missed out on is:

        int this_cpu = smp_processor_id();

which gets the CPU number of the CPU executing this code.  So, Core 0 will
never find Core 1 having locked up via this code path.

> b.
> things will get worse if we are running single core system if hard
> lockup happen.
> We even have no idea what happen.

Basically, without NMIs (or FIQs in ARM speak) lockups with IRQs off are
undetectable by the kernel other than "the system stopped responding".

In an SMP system, there are mechanisms by which other CPUs can detect a
locked-up CPU, and they can call trigger_all_cpu_backtrace() - and that
can only get a trace out of the locked-up CPU if it uses FIQs.
A CPU which has locked up in an IRQs-off region won't be able to receive
an IRQ or IPI by definition.

Work has been going on for the last 9 months to try and bring a working
trigger_all_cpu_backtrace() implementation, initially with IRQs and later
with FIQs.  In previous merge windows, we have moved forward with getting
some FIQ changes merged, and in the next merge window, I have patches
queued up (available in linux-next) which add IRQ-based
trigger_all_cpu_backtrace() support.

The next piece of the puzzle is sorting out the patches which bring
FIQ-based trigger_all_cpu_backtrace() support - but even if we do, that
won't be available everywhere - for example, it won't be available if
your kernel runs in the non-secure world with a secure monitor, because
FIQs generally aren't usable in that world.

The only other alternative is a hardware JTAG debugger to inspect the
state of all CPUs in the system.

-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.
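
To illustrate the point above that, in an SMP system, other CPUs can
detect a locked-up CPU, here is a minimal user-space sketch in C of a
"buddy" check: each CPU's timer interrupt bumps its own counter, and each
CPU periodically checks that its neighbour's counter has moved.  This is
a simulation only, not kernel code; every name in it (timer_ticks,
ticks_seen, timer_tick, buddy_is_locked_up, simulate, NR_CPUS as a fixed
4) is hypothetical and chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4   /* assumption: fixed CPU count for the simulation */

/* Bumped by each CPU's own periodic timer interrupt. */
static unsigned long timer_ticks[NR_CPUS];
/* Value of each CPU's counter as last seen by the CPU watching it. */
static unsigned long ticks_seen[NR_CPUS];

/* Simulate one timer interrupt on 'cpu'.  A CPU stuck with IRQs off
 * never gets here, so its counter stops moving. */
void timer_tick(int cpu)
{
    timer_ticks[cpu]++;
}

/* Run on 'cpu': has the next CPU taken a timer interrupt since the
 * previous check?  Returns true if it appears to be locked up. */
bool buddy_is_locked_up(int cpu)
{
    int buddy = (cpu + 1) % NR_CPUS;

    if (timer_ticks[buddy] == ticks_seen[buddy])
        return true;            /* no progress since the last check */

    ticks_seen[buddy] = timer_ticks[buddy];
    return false;
}

/* Run 'rounds' check periods with 'stuck_cpu' taking no timer
 * interrupts (pass -1 for a healthy system).  Returns the CPU reported
 * as locked up, or -1 if none was reported. */
int simulate(int stuck_cpu, int rounds)
{
    assert(stuck_cpu < NR_CPUS);

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        timer_ticks[cpu] = ticks_seen[cpu] = 0;

    for (int r = 0; r < rounds; r++) {
        /* Every CPU except the stuck one takes its timer interrupt. */
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
            if (cpu != stuck_cpu)
                timer_tick(cpu);

        /* The stuck CPU cannot run the check; the others can. */
        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
            if (cpu == stuck_cpu)
                continue;
            if (buddy_is_locked_up(cpu))
                return (cpu + 1) % NR_CPUS;
        }
    }
    return -1;
}
```

Calling simulate(1, 3) models Core 1 stuck with IRQs off; Core 0 notices
that Core 1's counter has stopped.  Note that this only covers detection:
it says nothing about where the stuck CPU is spinning, and it cannot help
on a single-core system, which is why a backtrace from the locked-up CPU
still needs FIQs (or a JTAG debugger).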