All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Holger Hoffstätte" <holger.hoffstaette@googlemail.com>
To: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>,
	Eric Dumazet <edumazet@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
	LKML <linux-kernel@vger.kernel.org>,
	stable@vger.kernel.org
Subject: Re: Soft lockup issue in Linux 4.1.9
Date: Thu, 8 Oct 2015 21:27:51 +0200	[thread overview]
Message-ID: <5616C3B7.9000305@googlemail.com> (raw)
In-Reply-To: <1444322507@msgid.manchmal.in-ulm.de>

On 10/08/15 18:56, Christoph Biedl wrote:
> Eric Dumazet wrote...
> 
> [ commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af ]
> 
>> It definitely should help !
> 
> Yesterday, I've experienced issues somewhat similar to this, but I'm
> not entirely sure:
> 
> Four of five systems running 4.1.9 stopped working. No reaction on
> network, keyboard, serial console. In one case, the stack trace as
> below made it to the loghost.
> 
> Two things are quite different. First, the systems had a reasonable
> uptime, about a week.
> 
> And second, the scary part: All incidents happened within a rather
> short time span of three minutes the most, beginning after 16:41:28 and
> before 16:41:54 UTC. So I assumed a brownout first - until I realized
> the systems faded away at slightly different times, and one is at a
> different location. While other systems using different kernel versions
> continued to operate on both sites.
> 
> So, I'd be glad for answers for
> 
> - Is this the same issue or should I be even more afraid?

There's always room for more. :-)

> - What might be the reason for this temporal coincidence? I have no
>   plausible idea.

More bugs?

> Confused,
>     Christoph
> 
> 
>  INFO: rcu_sched self-detected stall on CPU { 3}  (t=6000 jiffies g=8932806 c=8932805 q=58491)
>  rcu_sched kthread starved for 5999 jiffies!
>  Task dump for CPU 3:
>  swapper/3       R  running task        0     0      1 0x00000008
>   ffffffff81e396c0 ffff88042dcc3b20 ffffffff810807da 0000000000000003
>   ffffffff81e396c0 ffff88042dcc3b40 ffffffff81083b78 ffff88042dcc3b80
>   0000000000000003 ffff88042dcc3b70 ffffffff810a945c ffff88042dcd5740
>  Call Trace:
>   <IRQ>  [<ffffffff810807da>] sched_show_task+0xaa/0x110
>   [<ffffffff81083b78>] dump_cpu_task+0x38/0x40
>   [<ffffffff810a945c>] rcu_dump_cpu_stacks+0x8c/0xc0
>   [<ffffffff810abf31>] rcu_check_callbacks+0x3b1/0x680
>   [<ffffffff810e7bb7>] ? acct_account_cputime+0x17/0x20
>   [<ffffffff8108484e>] ? account_system_time+0x8e/0x180
>   [<ffffffff810ae4d3>] update_process_times+0x33/0x60
>   [<ffffffff810bcae0>] tick_sched_handle.isra.14+0x30/0x40
>   [<ffffffff810bcbd3>] tick_sched_timer+0x43/0x80
>   [<ffffffff810aea2a>] __run_hrtimer.isra.32+0x4a/0xd0
>   [<ffffffff810af225>] hrtimer_interrupt+0xd5/0x1f0
>   [<ffffffff81034d84>] local_apic_timer_interrupt+0x34/0x60
>  INFO: rcu_sched self-detected stall on CPU { 3}  (t=6000 jiffies g=8932806 c=8932805 q=58491)
>  rcu_sched kthread starved for 5999 jiffies!
>  Task dump for CPU 3:
>  swapper/3       R  running task        0     0      1 0x00000008
>   ffffffff81e396c0 ffff88042dcc3b20 ffffffff810807da 0000000000000003
>   ffffffff81e396c0 ffff88042dcc3b40 ffffffff81083b78 ffff88042dcc3b80
>   0000000000000003 ffff88042dcc3b70 ffffffff810a945c ffff88042dcd5740
>  Call Trace:
>   <IRQ>  [<ffffffff810807da>] sched_show_task+0xaa/0x110
>   [<ffffffff81083b78>] dump_cpu_task+0x38/0x40
>   [<ffffffff8103516c>] smp_apic_timer_interrupt+0x3c/0x60
>   [<ffffffff8190db7b>] apic_timer_interrupt+0x6b/0x70
>   [<ffffffff8190c8a9>] ? _raw_spin_unlock_irqrestore+0x9/0x10
>   [<ffffffff810ade58>] try_to_del_timer_sync+0x48/0x60
>   [<ffffffff810adeb2>] ? del_timer_sync+0x42/0x60
>   [<ffffffff810adeba>] del_timer_sync+0x4a/0x60
>   [<ffffffff8178b7da>] inet_csk_reqsk_queue_drop+0x7a/0x1f0
>   [<ffffffff8178ba7f>] reqsk_timer_handler+0x12f/0x290
>   [<ffffffff8178b950>] ? inet_csk_reqsk_queue_drop+0x1f0/0x1f0
>   [<ffffffff810ad9e6>] call_timer_fn.isra.26+0x26/0x80
>   [<ffffffff810a945c>] rcu_dump_cpu_stacks+0x8c/0xc0
>   [<ffffffff810abf31>] rcu_check_callbacks+0x3b1/0x680
>   [<ffffffff810e7bb7>] ? acct_account_cputime+0x17/0x20
>   [<ffffffff8108484e>] ? account_system_time+0x8e/0x180
>   [<ffffffff810ae4d3>] update_process_times+0x33/0x60
>   [<ffffffff810bcae0>] tick_sched_handle.isra.14+0x30/0x40
>   [<ffffffff810bcbd3>] tick_sched_timer+0x43/0x80
>   [<ffffffff810aea2a>] __run_hrtimer.isra.32+0x4a/0xd0
>   [<ffffffff810af225>] hrtimer_interrupt+0xd5/0x1f0
>   [<ffffffff81034d84>] local_apic_timer_interrupt+0x34/0x60
>   [<ffffffff810ae1ae>] run_timer_softirq+0x18e/0x220
>   [<ffffffff81060b1a>] __do_softirq+0xda/0x1f0
>   [<ffffffff81060e16>] irq_exit+0x76/0xa0
>   [<ffffffff81035175>] smp_apic_timer_interrupt+0x45/0x60
>   [<ffffffff8190db7b>] apic_timer_interrupt+0x6b/0x70
>   <EOI>  [<ffffffff810844be>] ? sched_clock_cpu+0x9e/0xb0
>   [<ffffffff8100bc15>] ? amd_e400_idle+0x35/0xd0
>   [<ffffffff8100bc13>] ? amd_e400_idle+0x33/0xd0
>   [<ffffffff8100c42a>] arch_cpu_idle+0xa/0x10
>   [<ffffffff810929e3>] cpu_startup_entry+0x2c3/0x330
>   [<ffffffff8103516c>] smp_apic_timer_interrupt+0x3c/0x60
>   [<ffffffff8190db7b>] apic_timer_interrupt+0x6b/0x70
>   [<ffffffff8190c8a9>] ? _raw_spin_unlock_irqrestore+0x9/0x10
>   [<ffffffff810ade58>] try_to_del_timer_sync+0x48/0x60
>   [<ffffffff810adeb2>] ? del_timer_sync+0x42/0x60
>   [<ffffffff810adeba>] del_timer_sync+0x4a/0x60
>   [<ffffffff8178b7da>] inet_csk_reqsk_queue_drop+0x7a/0x1f0
>   [<ffffffff8178ba7f>] reqsk_timer_handler+0x12f/0x290
>   [<ffffffff8178b950>] ? inet_csk_reqsk_queue_drop+0x1f0/0x1f0
>   [<ffffffff810ad9e6>] call_timer_fn.isra.26+0x26/0x80
>   [<ffffffff810332dc>] start_secondary+0x17c/0x1a0
>   [<ffffffff810ae1ae>] run_timer_softirq+0x18e/0x220
>   [<ffffffff81060b1a>] __do_softirq+0xda/0x1f0
>   [<ffffffff81060e16>] irq_exit+0x76/0xa0
>   [<ffffffff81035175>] smp_apic_timer_interrupt+0x45/0x60
>   [<ffffffff8190db7b>] apic_timer_interrupt+0x6b/0x70
>   <EOI>  [<ffffffff810844be>] ? sched_clock_cpu+0x9e/0xb0
>   [<ffffffff8100bc15>] ? amd_e400_idle+0x35/0xd0
>   [<ffffffff8100bc13>] ? amd_e400_idle+0x33/0xd0
>   [<ffffffff8100c42a>] arch_cpu_idle+0xa/0x10
>   [<ffffffff810929e3>] cpu_startup_entry+0x2c3/0x330
>   [<ffffffff810332dc>] start_secondary+0x17c/0x1a0
> 

The timer fixes were followups to a patch that went into 4.1 called
"tcp/dccp: get rid of central timewait timer", and it seems there were
a few more patches in that area very recently.

So after some git spelunking I am now running with the following patches
on top of 4.1.10 + 83fccfc3940.. (for the lockups), in the following
order:

fc01538f9fb75572c969ca9988176ffc2a8741d6 simplify timewait refcounting
dbe7faa4045ea83a37b691b12bb02a8f86c2d2e9 inet_twsk_deschedule factorization
29c6852602e259d2c1882f320b29d5c3fec0de04 fix races in reqsk_queue_hash_req()
ed2e923945892a8372ab70d2f61d364b0b6d9054 fix timewait races in timer handling

They may not all be required for the particular problem you just summoned,
but (from what I could tell) are required to apply everything properly.
They certainly can't make things worse. :-)

Oh and while you're at it you can apply these l33t cubic fixes :-)
30927520dbae297182990bb21d08762bcc35ce1d better follow cubic curve after idle period
c2e7204d180f8efc80f27959ca9cf16fa17f67db do not set epoch_start in the future
 
I've been running these on 3 machines for almost 10 minutes without issue,
so they are totally safe to go into production right away.

-h


      reply	other threads:[~2015-10-08 19:27 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-09-30 21:59 Soft lockup issue in Linux 4.1.9 Olivier Bonvalet
2015-09-30 22:37 ` Holger Hoffstätte
2015-10-01  4:41   ` Andre Tomt
2015-10-01 10:51     ` Holger Hoffstätte
     [not found] ` <560D1223.3070606@googlemail.com>
     [not found]   ` <CANn89i+B5T4Rhs8HnrC0+f+GhLvBFfpr4BVDvhkVOveSfy9B8Q@mail.gmail.com>
2015-10-01 11:43     ` Holger Hoffstätte
2015-10-01 11:52       ` Eric Dumazet
2015-10-02  6:52         ` Andre Tomt
2015-10-02  7:17           ` Holger Hoffstätte
2015-10-02 19:25             ` Wolfgang Walter
2015-10-02 19:25               ` Wolfgang Walter
2015-10-03 19:14             ` Thomas D.
2015-10-17 23:41               ` Greg Kroah-Hartman
2015-10-17 23:41                 ` Greg Kroah-Hartman
2015-10-02 20:04         ` Thomas Gleixner
2015-10-02 20:59           ` Eric Dumazet
2015-10-02 21:04             ` Thomas Gleixner
2015-10-02 21:32               ` Eric Dumazet
2015-10-08 16:56         ` Christoph Biedl
2015-10-08 19:27           ` Holger Hoffstätte [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5616C3B7.9000305@googlemail.com \
    --to=holger.hoffstaette@googlemail.com \
    --cc=ebiederm@xmission.com \
    --cc=edumazet@google.com \
    --cc=linux-kernel.bfrz@manchmal.in-ulm.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.