From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
To: Mike Galbraith <bitbucket@online.de>
Cc: linux-rt-users@vger.kernel.org, linux-kernel@vger.kernel.org,
rostedt@goodmis.org, tglx@linutronix.de
Subject: Re: [PATCH 1/2] irq_work: allow certain work in hard irq context
Date: Sun, 02 Feb 2014 21:10:43 +0100
Message-ID: <52EEA643.1010200@linutronix.de>
In-Reply-To: <1391314950.5444.18.camel@marge.simpson.net>
On 02/02/2014 05:22 AM, Mike Galbraith wrote:
> This patch (w. too noisy to live pr_err whacked) reliable kills my 64
> core test box, but only in _virgin_ 3.12-rt11. Add my local patches,
> and it runs and runs, happy as a clam. Odd. But whatever, box with
> virgin source running says it's busted.
Sorry for that, I removed that line from the patch in my queue but the
sent version still had it…
> Killing what was killable in this run before box had a chance to turn
> into a brick, the two tasks below were left, burning 100% CPU until 5
> minute RCU deadline expired. All other cores were idle.
>
> [ 705.466006] NMI backtrace for cpu 5
> [ 705.466009] CPU: 5 PID: 21792 Comm: cc1 Tainted: GF 3.12.9-rt11 #376
> [ 705.466015] RIP: 0010:[<ffffffff815d5450>] [<ffffffff815d5450>] _raw_spin_unlock_irq+0x40/0x40
> [ 705.466030] <IRQ>
> [ 705.466033] [<ffffffff81085074>] ? hrtimer_try_to_cancel+0x44/0x110
> [ 705.466035] [<ffffffff81085160>] hrtimer_cancel+0x20/0x30
> [ 705.466037] [<ffffffff810c52b2>] tick_nohz_restart+0x12/0x90
> [ 705.466039] [<ffffffff810c56da>] tick_nohz_restart_sched_tick+0x4a/0x60
> [ 705.466041] [<ffffffff810c5e99>] __tick_nohz_full_check+0x89/0x90
> [ 705.466043] [<ffffffff810c5ea9>] nohz_full_kick_work_func+0x9/0x10
> [ 705.466047] [<ffffffff81129e89>] __irq_work_run+0x79/0xb0
> [ 705.466049] [<ffffffff81129ec9>] irq_work_run+0x9/0x10
> [ 705.466051] [<ffffffff81068362>] update_process_times+0x62/0x80
> [ 705.466053] [<ffffffff810c4f02>] tick_sched_handle+0x32/0x70
> [ 705.466055] [<ffffffff810c51d0>] tick_sched_timer+0x40/0x70
> [ 705.466057] [<ffffffff81084b8d>] __run_hrtimer+0x14d/0x280
> [ 705.466059] [<ffffffff810c5190>] ? tick_nohz_handler+0xa0/0xa0
> [ 705.466060] [<ffffffff81084dea>] hrtimer_interrupt+0x12a/0x310
> [ 705.466065] [<ffffffff81096e4c>] ? vtime_account_user+0x6c/0x100
> [ 705.466067] [<ffffffff81034af6>] local_apic_timer_interrupt+0x36/0x60
> [ 705.466069] [<ffffffff8103a8c4>] ? native_apic_msr_eoi_write+0x14/0x20
> [ 705.466071] [<ffffffff810359fe>] smp_apic_timer_interrupt+0x3e/0x60
> [ 705.466074] [<ffffffff815ddcdd>] apic_timer_interrupt+0x6d/0x80
> [ 705.466075] <EOI>
> [ 705.468619] NMI backtrace for cpu 52
> [ 705.468622] CPU: 52 PID: 23285 Comm: objdump Tainted: GF 3.12.9-rt11 #376
> [ 705.468634] RIP: 0010:[<ffffffff81085083>] [<ffffffff81085083>] hrtimer_try_to_cancel+0x53/0x110
> [ 705.468650] Call Trace:
> [ 705.468651] <IRQ>
> [ 705.468653] [<ffffffff81085160>] ? hrtimer_cancel+0x20/0x30
> [ 705.468660] [<ffffffff810c52b2>] tick_nohz_restart+0x12/0x90
> [ 705.468662] [<ffffffff810c56da>] tick_nohz_restart_sched_tick+0x4a/0x60
> [ 705.468665] [<ffffffff810c5e99>] __tick_nohz_full_check+0x89/0x90
> [ 705.468667] [<ffffffff810c5ea9>] nohz_full_kick_work_func+0x9/0x10
> [ 705.468674] [<ffffffff81129e89>] __irq_work_run+0x79/0xb0
> [ 705.468676] [<ffffffff81129ec9>] irq_work_run+0x9/0x10
> [ 705.468681] [<ffffffff81068362>] update_process_times+0x62/0x80
> [ 705.468683] [<ffffffff810c4f02>] tick_sched_handle+0x32/0x70
> [ 705.468685] [<ffffffff810c51d0>] tick_sched_timer+0x40/0x70
> [ 705.468687] [<ffffffff81084b8d>] __run_hrtimer+0x14d/0x280
> [ 705.468689] [<ffffffff810c5190>] ? tick_nohz_handler+0xa0/0xa0
> [ 705.468691] [<ffffffff81084dea>] hrtimer_interrupt+0x12a/0x310
> [ 705.468700] [<ffffffff81096c22>] ? vtime_account_system+0x52/0xe0
> [ 705.468703] [<ffffffff81034af6>] local_apic_timer_interrupt+0x36/0x60
> [ 705.468708] [<ffffffff8103a8c4>] ? native_apic_msr_eoi_write+0x14/0x20
> [ 705.468710] [<ffffffff810359fe>] smp_apic_timer_interrupt+0x3e/0x60
> [ 705.468721] [<ffffffff815ddcdd>] apic_timer_interrupt+0x6d/0x80
> [ 705.468722] <EOI>
> [ 705.468733] [<ffffffff8105ae13>] ? pin_current_cpu+0x63/0x180
> [ 705.468742] [<ffffffff81090505>] migrate_disable+0x95/0x100
> [ 705.468746] [<ffffffff81168d21>] __do_fault+0x181/0x590
> [ 705.468748] [<ffffffff811691c3>] handle_pte_fault+0x93/0x250
> [ 705.468750] [<ffffffff811694b7>] __handle_mm_fault+0x137/0x1e0
> [ 705.468752] [<ffffffff81169653>] handle_mm_fault+0xf3/0x1a0
> [ 705.468755] [<ffffffff815d90f1>] __do_page_fault+0x291/0x550
> [ 705.468758] [<ffffffff8100a8d0>] ? native_sched_clock+0x20/0xa0
> [ 705.468766] [<ffffffff81108547>] ? acct_account_cputime+0x17/0x20
> [ 705.468768] [<ffffffff81096dc2>] ? account_user_time+0xd2/0xf0
> [ 705.468770] [<ffffffff81096e4c>] ? vtime_account_user+0x6c/0x100
> [ 705.468772] [<ffffffff815d93f0>] do_page_fault+0x40/0x70
> [ 705.468774] [<ffffffff815d5d48>] page_fault+0x28/0x30
So CPU5 & CPU52 were eating 100% CPU doing "nothing" instead of running
cc1 & objdump, right?
According to the backtrace, both of them are trying to access the
per-CPU hrtimer (sched_timer) in order to cancel it, but they seem to
fail to get the timer lock here. They shouldn't spin there for minutes;
I have no idea why they did so…
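One possible reading of the backtrace, offered as an assumption and not
as a confirmed diagnosis: hrtimer_cancel() is reached from inside
tick_sched_timer(), which is the callback of sched_timer itself. If
hrtimer_try_to_cancel() returns -1 while the timer's callback is
running, and hrtimer_cancel() simply retries on -1, then the retry loop
cannot finish from that context. The user-space sketch below models only
that retry behaviour. struct model_timer, model_try_to_cancel() and
model_cancel() are hypothetical names, not kernel code, and the loop is
bounded by max_tries so that it terminates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct hrtimer: only the state we need. */
struct model_timer {
	bool callback_running;	/* true while the timer's handler executes */
	bool enqueued;		/* true while the timer is queued to fire */
};

/*
 * Models the assumed result convention of hrtimer_try_to_cancel():
 *  -1  the callback is currently running, cancel not possible now
 *   1  the timer was queued and has been removed
 *   0  the timer was not queued
 */
static int model_try_to_cancel(struct model_timer *t)
{
	if (t->callback_running)
		return -1;
	if (t->enqueued) {
		t->enqueued = false;
		return 1;
	}
	return 0;
}

/*
 * Models the retry loop of hrtimer_cancel(), bounded by max_tries
 * (an assumption for the demo; the modelled loop has no bound).
 * Stores the number of attempts in *tries and returns the last result.
 */
static int model_cancel(struct model_timer *t, unsigned int max_tries,
			unsigned int *tries)
{
	assert(max_tries > 0);

	for (*tries = 1; ; (*tries)++) {
		int ret = model_try_to_cancel(t);

		/* Done on success, or give up once the bound is hit. */
		if (ret >= 0 || *tries == max_tries)
			return ret;
	}
}
```

With callback_running set, model_cancel() uses every attempt and still
returns -1, because nothing inside the loop can clear the flag. With the
flag clear and the timer queued, the first attempt succeeds.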
I guess this problem does not occur without -RT, and before that patch
you saw only that one warning from can_stop_full_tick()?
Sebastian