From: "Holger Hoffstätte" <holger.hoffstaette@googlemail.com>
To: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>,
Eric Dumazet <edumazet@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
LKML <linux-kernel@vger.kernel.org>,
stable@vger.kernel.org
Subject: Re: Soft lockup issue in Linux 4.1.9
Date: Thu, 8 Oct 2015 21:27:51 +0200
Message-ID: <5616C3B7.9000305@googlemail.com>
In-Reply-To: <1444322507@msgid.manchmal.in-ulm.de>

On 10/08/15 18:56, Christoph Biedl wrote:
> Eric Dumazet wrote...
>
> [ commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af ]
>
>> It definitely should help !
>
> Yesterday, I've experienced issues somewhat similar to this, but I'm
> not entirely sure:
>
> Four of five systems running 4.1.9 stopped working. No reaction on
> network, keyboard, serial console. In one case, the stack trace as
> below made it to the loghost.
>
> Two things are quite different. First, the systems had a reasonable
> uptime, about a week.
>
> And second, the scary part: All incidents happened within a rather
> short time span of three minutes the most, beginning after 16:41:28 and
> before 16:41:54 UTC. So I assumed a brownout first - until I realized
> the systems faded away at slightly different times, and one is at a
> different location. While other systems using different kernel versions
> continued to operate on both sites.
>
> So, I'd be glad for answers for
>
> - Is this the same issue or should I be even more afraid?

There's always room for more. :-)

> - What might be the reason for this temporal coincidence? I have no
> plausible idea.

More bugs?

> Confused,
> Christoph
>
>
> INFO: rcu_sched self-detected stall on CPU { 3} (t=6000 jiffies g=8932806 c=8932805 q=58491)
> rcu_sched kthread starved for 5999 jiffies!
> Task dump for CPU 3:
> swapper/3 R running task 0 0 1 0x00000008
> ffffffff81e396c0 ffff88042dcc3b20 ffffffff810807da 0000000000000003
> ffffffff81e396c0 ffff88042dcc3b40 ffffffff81083b78 ffff88042dcc3b80
> 0000000000000003 ffff88042dcc3b70 ffffffff810a945c ffff88042dcd5740
> Call Trace:
> <IRQ> [<ffffffff810807da>] sched_show_task+0xaa/0x110
> [<ffffffff81083b78>] dump_cpu_task+0x38/0x40
> [<ffffffff810a945c>] rcu_dump_cpu_stacks+0x8c/0xc0
> [<ffffffff810abf31>] rcu_check_callbacks+0x3b1/0x680
> [<ffffffff810e7bb7>] ? acct_account_cputime+0x17/0x20
> [<ffffffff8108484e>] ? account_system_time+0x8e/0x180
> [<ffffffff810ae4d3>] update_process_times+0x33/0x60
> [<ffffffff810bcae0>] tick_sched_handle.isra.14+0x30/0x40
> [<ffffffff810bcbd3>] tick_sched_timer+0x43/0x80
> [<ffffffff810aea2a>] __run_hrtimer.isra.32+0x4a/0xd0
> [<ffffffff810af225>] hrtimer_interrupt+0xd5/0x1f0
> [<ffffffff81034d84>] local_apic_timer_interrupt+0x34/0x60
> INFO: rcu_sched self-detected stall on CPU { 3} (t=6000 jiffies g=8932806 c=8932805 q=58491)
> rcu_sched kthread starved for 5999 jiffies!
> Task dump for CPU 3:
> swapper/3 R running task 0 0 1 0x00000008
> ffffffff81e396c0 ffff88042dcc3b20 ffffffff810807da 0000000000000003
> ffffffff81e396c0 ffff88042dcc3b40 ffffffff81083b78 ffff88042dcc3b80
> 0000000000000003 ffff88042dcc3b70 ffffffff810a945c ffff88042dcd5740
> Call Trace:
> <IRQ> [<ffffffff810807da>] sched_show_task+0xaa/0x110
> [<ffffffff81083b78>] dump_cpu_task+0x38/0x40
> [<ffffffff8103516c>] smp_apic_timer_interrupt+0x3c/0x60
> [<ffffffff8190db7b>] apic_timer_interrupt+0x6b/0x70
> [<ffffffff8190c8a9>] ? _raw_spin_unlock_irqrestore+0x9/0x10
> [<ffffffff810ade58>] try_to_del_timer_sync+0x48/0x60
> [<ffffffff810adeb2>] ? del_timer_sync+0x42/0x60
> [<ffffffff810adeba>] del_timer_sync+0x4a/0x60
> [<ffffffff8178b7da>] inet_csk_reqsk_queue_drop+0x7a/0x1f0
> [<ffffffff8178ba7f>] reqsk_timer_handler+0x12f/0x290
> [<ffffffff8178b950>] ? inet_csk_reqsk_queue_drop+0x1f0/0x1f0
> [<ffffffff810ad9e6>] call_timer_fn.isra.26+0x26/0x80
> [<ffffffff810a945c>] rcu_dump_cpu_stacks+0x8c/0xc0
> [<ffffffff810abf31>] rcu_check_callbacks+0x3b1/0x680
> [<ffffffff810e7bb7>] ? acct_account_cputime+0x17/0x20
> [<ffffffff8108484e>] ? account_system_time+0x8e/0x180
> [<ffffffff810ae4d3>] update_process_times+0x33/0x60
> [<ffffffff810bcae0>] tick_sched_handle.isra.14+0x30/0x40
> [<ffffffff810bcbd3>] tick_sched_timer+0x43/0x80
> [<ffffffff810aea2a>] __run_hrtimer.isra.32+0x4a/0xd0
> [<ffffffff810af225>] hrtimer_interrupt+0xd5/0x1f0
> [<ffffffff81034d84>] local_apic_timer_interrupt+0x34/0x60
> [<ffffffff810ae1ae>] run_timer_softirq+0x18e/0x220
> [<ffffffff81060b1a>] __do_softirq+0xda/0x1f0
> [<ffffffff81060e16>] irq_exit+0x76/0xa0
> [<ffffffff81035175>] smp_apic_timer_interrupt+0x45/0x60
> [<ffffffff8190db7b>] apic_timer_interrupt+0x6b/0x70
> <EOI> [<ffffffff810844be>] ? sched_clock_cpu+0x9e/0xb0
> [<ffffffff8100bc15>] ? amd_e400_idle+0x35/0xd0
> [<ffffffff8100bc13>] ? amd_e400_idle+0x33/0xd0
> [<ffffffff8100c42a>] arch_cpu_idle+0xa/0x10
> [<ffffffff810929e3>] cpu_startup_entry+0x2c3/0x330
> [<ffffffff8103516c>] smp_apic_timer_interrupt+0x3c/0x60
> [<ffffffff8190db7b>] apic_timer_interrupt+0x6b/0x70
> [<ffffffff8190c8a9>] ? _raw_spin_unlock_irqrestore+0x9/0x10
> [<ffffffff810ade58>] try_to_del_timer_sync+0x48/0x60
> [<ffffffff810adeb2>] ? del_timer_sync+0x42/0x60
> [<ffffffff810adeba>] del_timer_sync+0x4a/0x60
> [<ffffffff8178b7da>] inet_csk_reqsk_queue_drop+0x7a/0x1f0
> [<ffffffff8178ba7f>] reqsk_timer_handler+0x12f/0x290
> [<ffffffff8178b950>] ? inet_csk_reqsk_queue_drop+0x1f0/0x1f0
> [<ffffffff810ad9e6>] call_timer_fn.isra.26+0x26/0x80
> [<ffffffff810332dc>] start_secondary+0x17c/0x1a0
> [<ffffffff810ae1ae>] run_timer_softirq+0x18e/0x220
> [<ffffffff81060b1a>] __do_softirq+0xda/0x1f0
> [<ffffffff81060e16>] irq_exit+0x76/0xa0
> [<ffffffff81035175>] smp_apic_timer_interrupt+0x45/0x60
> [<ffffffff8190db7b>] apic_timer_interrupt+0x6b/0x70
> <EOI> [<ffffffff810844be>] ? sched_clock_cpu+0x9e/0xb0
> [<ffffffff8100bc15>] ? amd_e400_idle+0x35/0xd0
> [<ffffffff8100bc13>] ? amd_e400_idle+0x33/0xd0
> [<ffffffff8100c42a>] arch_cpu_idle+0xa/0x10
> [<ffffffff810929e3>] cpu_startup_entry+0x2c3/0x330
> [<ffffffff810332dc>] start_secondary+0x17c/0x1a0
>
The timer fixes were followups to a patch that went into 4.1 called
"tcp/dccp: get rid of central timewait timer", and it seems there were
a few more patches in that area very recently.

So after some git spelunking I am now running with the following patches
on top of 4.1.10 + 83fccfc3940.. (for the lockups), in the following
order:

  fc01538f9fb75572c969ca9988176ffc2a8741d6 simplify timewait refcounting
  dbe7faa4045ea83a37b691b12bb02a8f86c2d2e9 inet_twsk_deschedule factorization
  29c6852602e259d2c1882f320b29d5c3fec0de04 fix races in reqsk_queue_hash_req()
  ed2e923945892a8372ab70d2f61d364b0b6d9054 fix timewait races in timer handling

They may not all be required for the particular problem you just summoned,
but (from what I could tell) are required to apply everything properly.
They certainly can't make things worse. :-)
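
If you would rather script the order above than pick by hand, here is a
minimal sketch. It assumes a git checkout that is already at 4.1.10 and
has the mainline commits available locally; the helper names
(cherry_pick_commands, apply_in_order) are made up for illustration, and
by default it only prints the commands instead of running them.

```python
import subprocess

# Commit IDs copied from this mail, in the order they were applied.
LOCKUP_FIX = "83fccfc3940c4a2db90fd7e7079f5b465cd8c6af"
TIMER_FOLLOWUPS = [
    "fc01538f9fb75572c969ca9988176ffc2a8741d6",  # simplify timewait refcounting
    "dbe7faa4045ea83a37b691b12bb02a8f86c2d2e9",  # inet_twsk_deschedule factorization
    "29c6852602e259d2c1882f320b29d5c3fec0de04",  # fix races in reqsk_queue_hash_req()
    "ed2e923945892a8372ab70d2f61d364b0b6d9054",  # fix timewait races in timer handling
]


def cherry_pick_commands(commits):
    """Return one `git cherry-pick -x` argv list per commit, keeping the order."""
    return [["git", "cherry-pick", "-x", commit] for commit in commits]


def apply_in_order(commits, dry_run=True):
    """Print the commands (dry run) or execute them, stopping at the first failure."""
    for argv in cherry_pick_commands(commits):
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True raises CalledProcessError on a conflict, so later
            # patches are not applied on top of a half-finished pick.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    apply_in_order([LOCKUP_FIX] + TIMER_FOLLOWUPS)
```

Run it once as-is to eyeball the commands, then call apply_in_order(...,
dry_run=False) from inside the kernel tree. Conflicts still have to be
resolved by hand.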

Oh, and while you're at it you can apply these l33t cubic fixes :-)

  30927520dbae297182990bb21d08762bcc35ce1d better follow cubic curve after idle period
  c2e7204d180f8efc80f27959ca9cf16fa17f67db do not set epoch_start in the future

I've been running these on 3 machines for almost 10 minutes without issue,
so they are totally safe to go into production right away.
-h