From mboxrd@z Thu Jan 1 00:00:00 1970 From: Sebastian Andrzej Siewior Subject: Re: [PATCH 1/2] irq_work: allow certain work in hard irq context Date: Sun, 02 Feb 2014 21:10:43 +0100 Message-ID: <52EEA643.1010200@linutronix.de> References: <1391178845-15837-1-git-send-email-bigeasy@linutronix.de> <1391314950.5444.18.camel@marge.simpson.net> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: linux-rt-users@vger.kernel.org, linux-kernel@vger.kernel.org, rostedt@goodmis.org, tglx@linutronix.de To: Mike Galbraith Return-path: In-Reply-To: <1391314950.5444.18.camel@marge.simpson.net> Sender: linux-kernel-owner@vger.kernel.org List-Id: linux-rt-users.vger.kernel.org On 02/02/2014 05:22 AM, Mike Galbraith wrote: > This patch (w. too noisy to live pr_err whacked) reliable kills my 64 > core test box, but only in _virgin_ 3.12-rt11. Add my local patches, > and it runs and runs, happy as a clam. Odd. But whatever, box with > virgin source running says it's busted. Sorry for that, I removed that line from the patch in my queue but the sent version still had it=E2=80=A6 > Killing what was killable in this run before box had a chance to turn > into a brick, the two tasks below were left, burning 100% CPU until 5 > minute RCU deadline expired. All other cores were idle. >=20 > [ 705.466006] NMI backtrace for cpu 5 > [ 705.466009] CPU: 5 PID: 21792 Comm: cc1 Tainted: GF 3.1= 2.9-rt11 #376 > [ 705.466015] RIP: 0010:[] [] _= raw_spin_unlock_irq+0x40/0x40 > [ 705.466030] =20 > [ 705.466033] [] ? hrtimer_try_to_cancel+0x44/0x1= 10 > [ 705.466035] [] hrtimer_cancel+0x20/0x30 > [ 705.466037] [] tick_nohz_restart+0x12/0x90 > [ 705.466039] [] tick_nohz_restart_sched_tick+0x4= a/0x60 > [ 705.466041] [] __tick_nohz_full_check+0x89/0x90 > [ 705.466043] [] nohz_full_kick_work_func+0x9/0x1= 0 > [ 705.466047] [] __irq_work_run+0x79/0xb0 > [ 705.466049] [] irq_work_run+0x9/0x10 > [ 705.466051] [] update_process_times+0x62/0x80 > [ 705.466053] [] tick_sched_handle+0x32/0x70 > [ 705.466055] [] tick_sched_timer+0x40/0x70 > [ 705.466057] [] __run_hrtimer+0x14d/0x280 > [ 705.466059] [] ? tick_nohz_handler+0xa0/0xa0 > [ 705.466060] [] hrtimer_interrupt+0x12a/0x310 > [ 705.466065] [] ? vtime_account_user+0x6c/0x100 > [ 705.466067] [] local_apic_timer_interrupt+0x36/= 0x60 > [ 705.466069] [] ? native_apic_msr_eoi_write+0x14= /0x20 > [ 705.466071] [] smp_apic_timer_interrupt+0x3e/0x= 60 > [ 705.466074] [] apic_timer_interrupt+0x6d/0x80 > [ 705.466075] =20 > [ 705.468619] NMI backtrace for cpu 52 > [ 705.468622] CPU: 52 PID: 23285 Comm: objdump Tainted: GF = 3.12.9-rt11 #376 > [ 705.468634] RIP: 0010:[] [] h= rtimer_try_to_cancel+0x53/0x110 > [ 705.468650] Call Trace: > [ 705.468651] =20 > [ 705.468653] [] ? hrtimer_cancel+0x20/0x30 > [ 705.468660] [] tick_nohz_restart+0x12/0x90 > [ 705.468662] [] tick_nohz_restart_sched_tick+0x4= a/0x60 > [ 705.468665] [] __tick_nohz_full_check+0x89/0x90 > [ 705.468667] [] nohz_full_kick_work_func+0x9/0x1= 0 > [ 705.468674] [] __irq_work_run+0x79/0xb0 > [ 705.468676] [] irq_work_run+0x9/0x10 > [ 705.468681] [] update_process_times+0x62/0x80 > [ 705.468683] [] tick_sched_handle+0x32/0x70 > [ 705.468685] [] tick_sched_timer+0x40/0x70 > [ 705.468687] [] __run_hrtimer+0x14d/0x280 > [ 705.468689] [] ? tick_nohz_handler+0xa0/0xa0 > [ 705.468691] [] hrtimer_interrupt+0x12a/0x310 > [ 705.468700] [] ? vtime_account_system+0x52/0xe0 > [ 705.468703] [] local_apic_timer_interrupt+0x36/= 0x60 > [ 705.468708] [] ? native_apic_msr_eoi_write+0x14= /0x20 > [ 705.468710] [] smp_apic_timer_interrupt+0x3e/0x= 60 > [ 705.468721] [] apic_timer_interrupt+0x6d/0x80 > [ 705.468722] =20 > [ 705.468733] [] ? pin_current_cpu+0x63/0x180 > [ 705.468742] [] migrate_disable+0x95/0x100 > [ 705.468746] [] __do_fault+0x181/0x590 > [ 705.468748] [] handle_pte_fault+0x93/0x250 > [ 705.468750] [] __handle_mm_fault+0x137/0x1e0 > [ 705.468752] [] handle_mm_fault+0xf3/0x1a0 > [ 705.468755] [] __do_page_fault+0x291/0x550 > [ 705.468758] [] ? native_sched_clock+0x20/0xa0 > [ 705.468766] [] ? acct_account_cputime+0x17/0x20 > [ 705.468768] [] ? account_user_time+0xd2/0xf0 > [ 705.468770] [] ? vtime_account_user+0x6c/0x100 > [ 705.468772] [] do_page_fault+0x40/0x70 > [ 705.468774] [] page_fault+0x28/0x30 So CPU5 & CPU52 were eating 100% CPU doing "nothing" instead of running cc1 & objdump right? According to the backtrace both of them are trying to access the per-cpu hrtimer (sched_timer) in order to cancel but they seem to fail to get the timer lock here. They shouldn't spin there for minutes, I have no idea why they did so=E2=80=A6 I guess this problem does not occur without -RT and before that patch you saw only that one warning from can_stop_full_tick()? Sebastian