* Re: Soft lockup issue in Linux 4.1.9
[not found] ` <560CB98A.10107@tomt.net>
@ 2015-10-01 10:51 ` Holger Hoffstätte
0 siblings, 0 replies; 14+ messages in thread
From: Holger Hoffstätte @ 2015-10-01 10:51 UTC (permalink / raw)
To: linux-kernel; +Cc: stable
On Thu, 01 Oct 2015 06:41:46 +0200, Andre Tomt wrote:
> On 01. okt. 2015 00:37, Holger Hoffstätte wrote:
>> On Wed, 30 Sep 2015 23:59:43 +0200, Olivier Bonvalet wrote:
>>
>>> for information, I've just upgraded 6 servers from Linux 4.1.8 to Linux
>>> 4.1.9, and have some random soft lockup. If this can help :
>>
>> Congratulations! You're not the first one to get hit by this, but
>> you are probably the first one to get a meaningful stacktrace! \o/
>>
>>> [ 204.478380] Call Trace:
>>> [ 204.478381] <IRQ>
>>> [ 204.478385] [<ffffffff81076121>] ? try_to_del_timer_sync+0x43/0x4d
>>> [ 204.478386] [<ffffffff810760de>] ? del_timer+0x4d/0x4d
>>> [ 204.478388] [<ffffffff8107614b>] ? del_timer_sync+0x20/0x3d
>>
>> Can you try to revert
>>
>> [PATCH 4.1 157/159] inet: fix races with reqsk timers
>>
>> and see how that works for you? I'll do the same on my end. So far the
>> only thing I ever could glean was an rcu stall after cpuidle_enter(),
>> but never anything regarding the timer - though it was definitely
>> related to NIC activity after idle.
>
> I'm running with this patch reverted now as well. 2 hours no issues so
> far, but I can't conclude anything yet as I've seen it take up to 6+
> hours to explode here. As a result the bisect was going veeery slowly.
Now 12+ hours going without problems, never got this far with the patch
included, as it would usually freeze during idle periods.
As far as I'm concerned this is the culprit and should be reverted in
4.1.x, unless Eric can suggest how to fix this. (cc'ed).
cheers
Holger
* Re: Soft lockup issue in Linux 4.1.9
[not found] ` <CANn89i+B5T4Rhs8HnrC0+f+GhLvBFfpr4BVDvhkVOveSfy9B8Q@mail.gmail.com>
@ 2015-10-01 11:43 ` Holger Hoffstätte
2015-10-01 11:52 ` Eric Dumazet
0 siblings, 1 reply; 14+ messages in thread
From: Holger Hoffstätte @ 2015-10-01 11:43 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, Eric W. Biederman, Stephen Hemminger,
Greg Kroah-Hartman, linux-kernel, stable, netdev
On 10/01/15 13:29, Eric Dumazet wrote:
> On Thu, Oct 1, 2015 at 3:59 AM, Holger Hoffstätte
> <holger.hoffstaette@googlemail.com> wrote:
>>
>> On Thu, 01 Oct 2015 06:41:46 +0200, Andre Tomt wrote:
>>
>>> On 01. okt. 2015 00:37, Holger Hoffstätte wrote:
>>>> On Wed, 30 Sep 2015 23:59:43 +0200, Olivier Bonvalet wrote:
>>>>
>>>>> for information, I've just upgraded 6 servers from Linux 4.1.8 to Linux
>>>>> 4.1.9, and have some random soft lockup. If this can help :
>>>>
>>>> Congratulations! You're not the first one to get hit by this, but
>>>> you are probably the first one to get a meaningful stacktrace! \o/
>>>>
>>>>> [ 204.478380] Call Trace:
>>>>> [ 204.478381] <IRQ>
>>>>> [ 204.478385] [<ffffffff81076121>] ? try_to_del_timer_sync+0x43/0x4d
>>>>> [ 204.478386] [<ffffffff810760de>] ? del_timer+0x4d/0x4d
>>>>> [ 204.478388] [<ffffffff8107614b>] ? del_timer_sync+0x20/0x3d
>>>>
>>>> Can you try to revert
>>>>
>>>> [PATCH 4.1 157/159] inet: fix races with reqsk timers
>>>>
>>>> and see how that works for you? I'll do the same on my end. So far the
>>>> only thing I ever could glean was an rcu stall after cpuidle_enter(),
>>>> but never anything regarding the timer - though it was definitely
>>>> related to NIC activity after idle.
>>>
>>> I'm running with this patch reverted now as well. 2 hours no issues so
>>> far, but I can't conclude anything yet as I've seen it take up to 6+
>>> hours to explode here. As a result the bisect was going veeery slowly.
>>
>> Now 12+ hours going without problems, never got this far with the patch
>> included, as it would usually freeze during idle periods.
>>
>> As far as I'm concerned this is the culprit and should be reverted in
>> 4.1.x, unless Eric can suggest how to fix this. (cc'ed).
>>
>
> Looks like an old and known problem...
>
> Following commit should be sent/added for 4.1 stable tree :
>
> commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
> Author: Eric Dumazet <edumazet@google.com>
> Date: Thu Aug 13 15:44:51 2015 -0700
>
> inet: fix potential deadlock in reqsk_queue_unlink()
>
> When replacing del_timer() with del_timer_sync(), I introduced
> a deadlock condition :
>
> reqsk_queue_unlink() is called from inet_csk_reqsk_queue_drop()
>
> inet_csk_reqsk_queue_drop() can be called from many contexts,
> one being the timer handler itself (reqsk_timer_handler()).
>
> In this case, del_timer_sync() loops forever.
>
> Simple fix is to test if timer is pending.
>
> Fixes: 2235f2ac75fd ("inet: fix races with reqsk timers")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
Whohoo! It applies/builds cleanly to 4.1.10-rc1 and is running as
we speak. Let's hope that this fixes the lockups.
Thanks for the quick reply!
Holger
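The self-deadlock described in the quoted commit message can be sketched
in userspace. This is only a toy model, not kernel code: every name in it
(model_timer, model_try_del, model_del_sync, model_unlink_unfixed,
model_unlink_fixed, model_fire) is hypothetical, and the spin is capped at
a fixed count so that the "loops forever" case returns -1 instead of
hanging. It assumes what the kernel timer code does in outline: a timer is
no longer pending once its handler has started, and a synchronous delete
keeps retrying for as long as the handler is running.

```c
/* Toy userspace model of the reqsk timer self-deadlock; not kernel code.
 * All identifiers are hypothetical and exist only for illustration. */
#include <stdbool.h>

struct model_timer {
    bool pending;  /* queued, handler has not started yet */
    bool running;  /* handler is currently executing */
};

/* Modelled on try_to_del_timer_sync(): refuses (-1) while the handler runs,
 * otherwise deactivates the timer and reports whether it was pending. */
static int model_try_del(struct model_timer *t)
{
    if (t->running)
        return -1;
    bool was_pending = t->pending;
    t->pending = false;
    return was_pending ? 1 : 0;
}

/* Modelled on del_timer_sync(): retry until the handler has finished.
 * The real function has no retry limit; max_spins exists only so the
 * model can report the hang (as -1) instead of reproducing it. */
static int model_del_sync(struct model_timer *t, int max_spins)
{
    for (int i = 0; i < max_spins; i++) {
        int ret = model_try_del(t);
        if (ret >= 0)
            return ret;
    }
    return -1;
}

/* Unlink path with an unconditional synchronous delete. */
static int model_unlink_unfixed(struct model_timer *t)
{
    return model_del_sync(t, 1000);
}

/* Unlink path that first tests whether the timer is pending. */
static int model_unlink_fixed(struct model_timer *t)
{
    if (t->pending)
        return model_del_sync(t, 1000);
    return 0;
}

/* Simulate expiry: the handler runs and drops its own request, which
 * reaches the unlink path for the very timer that is executing. */
static int model_fire(struct model_timer *t,
                      int (*unlink)(struct model_timer *))
{
    t->pending = false;  /* detached before the handler is invoked */
    t->running = true;
    int ret = unlink(t);
    t->running = false;
    return ret;
}
```

In this model, model_fire() with model_unlink_unfixed exhausts its spins
and returns -1, which stands in for the soft lockup; with
model_unlink_fixed the pending test fails inside the handler and the call
returns 0 immediately. Outside the handler both variants still delete a
pending timer.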
* Re: Soft lockup issue in Linux 4.1.9
2015-10-01 11:43 ` Holger Hoffstätte
@ 2015-10-01 11:52 ` Eric Dumazet
2015-10-02 6:52 ` Andre Tomt
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Eric Dumazet @ 2015-10-01 11:52 UTC (permalink / raw)
To: Holger Hoffstätte
Cc: David S. Miller, Eric W. Biederman, Stephen Hemminger,
Greg Kroah-Hartman, LKML, stable, netdev
On Thu, Oct 1, 2015 at 4:43 AM, Holger Hoffstätte
<holger.hoffstaette@googlemail.com> wrote:
> On 10/01/15 13:29, Eric Dumazet wrote:
>> commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
>> Author: Eric Dumazet <edumazet@google.com>
>> Date: Thu Aug 13 15:44:51 2015 -0700
>>
>> inet: fix potential deadlock in reqsk_queue_unlink()
>>
>> When replacing del_timer() with del_timer_sync(), I introduced
>> a deadlock condition :
>>
>> reqsk_queue_unlink() is called from inet_csk_reqsk_queue_drop()
>>
>> inet_csk_reqsk_queue_drop() can be called from many contexts,
>> one being the timer handler itself (reqsk_timer_handler()).
>>
>> In this case, del_timer_sync() loops forever.
>>
>> Simple fix is to test if timer is pending.
>>
>> Fixes: 2235f2ac75fd ("inet: fix races with reqsk timers")
>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>> Signed-off-by: David S. Miller <davem@davemloft.net>
>
> Whohoo! It applies/builds cleanly to 4.1.10-rc1 and is running as
> we speak. Let's hope that this fixes the lockups.
>
It definitely should help !
David, since patch is not yet seen on
http://patchwork.ozlabs.org/bundle/davem/stable/?state=*
could you please add it to your queue ?
Thanks.
* Re: Soft lockup issue in Linux 4.1.9
2015-10-01 11:52 ` Eric Dumazet
@ 2015-10-02 6:52 ` Andre Tomt
2015-10-02 7:17 ` Holger Hoffstätte
2015-10-02 20:04 ` Thomas Gleixner
2015-10-08 16:56 ` Christoph Biedl
2 siblings, 1 reply; 14+ messages in thread
From: Andre Tomt @ 2015-10-02 6:52 UTC (permalink / raw)
To: Eric Dumazet, Holger Hoffstätte
Cc: David S. Miller, Eric W. Biederman, Stephen Hemminger,
Greg Kroah-Hartman, LKML, stable, netdev
On 01. okt. 2015 13:52, Eric Dumazet wrote:
> On Thu, Oct 1, 2015 at 4:43 AM, Holger Hoffstätte
> <holger.hoffstaette@googlemail.com> wrote:
>> On 10/01/15 13:29, Eric Dumazet wrote:
>
>>> commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
>>> Author: Eric Dumazet <edumazet@google.com>
>>> Date: Thu Aug 13 15:44:51 2015 -0700
>>>
>>> inet: fix potential deadlock in reqsk_queue_unlink()
<snip>
>> Whohoo! It applies/builds cleanly to 4.1.10-rc1 and is running as
>> we speak. Let's hope that this fixes the lockups.
>>
>
> It definitely should help !
>
> David, since patch is not yet seen on
> http://patchwork.ozlabs.org/bundle/davem/stable/?state=*
> could you please add it to your queue ?
Seems to fix it for me as well. 3 systems have been running varying
types of production-like loads with it for 14+ hours without hanging.
* Re: Soft lockup issue in Linux 4.1.9
2015-10-02 6:52 ` Andre Tomt
@ 2015-10-02 7:17 ` Holger Hoffstätte
2015-10-02 19:25 ` Wolfgang Walter
2015-10-03 19:14 ` Thomas D.
0 siblings, 2 replies; 14+ messages in thread
From: Holger Hoffstätte @ 2015-10-02 7:17 UTC (permalink / raw)
To: Andre Tomt, Eric Dumazet
Cc: David S. Miller, Eric W. Biederman, Stephen Hemminger,
Greg Kroah-Hartman, LKML, stable, netdev
On 10/02/15 08:52, Andre Tomt wrote:
> On 01. okt. 2015 13:52, Eric Dumazet wrote:
>> On Thu, Oct 1, 2015 at 4:43 AM, Holger Hoffstätte
>> <holger.hoffstaette@googlemail.com> wrote:
>>> On 10/01/15 13:29, Eric Dumazet wrote:
>>
>>>> commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
>>>> Author: Eric Dumazet <edumazet@google.com>
>>>> Date: Thu Aug 13 15:44:51 2015 -0700
>>>>
>>>> inet: fix potential deadlock in reqsk_queue_unlink()
> <snip>
>>> Whohoo! It applies/builds cleanly to 4.1.10-rc1 and is running as
>>> we speak. Let's hope that this fixes the lockups.
>>>
>>
>> It definitely should help !
>>
>> David, since patch is not yet seen on
>> http://patchwork.ozlabs.org/bundle/davem/stable/?state=*
>> could you please add it to your queue ?
>
> Seems to fix it for me as well. 3 systems have been running varying
> types of production-like loads with it for 14+ hours without hanging.
Just got up, and yes - my systems survived the night as well, no issues.
Greg, any chance you can drop this into the pending 4.1.10? Otherwise people
will get another broken release.
cheers
Holger
* Re: Soft lockup issue in Linux 4.1.9
2015-10-02 7:17 ` Holger Hoffstätte
@ 2015-10-02 19:25 ` Wolfgang Walter
2015-10-03 19:14 ` Thomas D.
1 sibling, 0 replies; 14+ messages in thread
From: Wolfgang Walter @ 2015-10-02 19:25 UTC (permalink / raw)
To: Holger Hoffstätte
Cc: Andre Tomt, Eric Dumazet, David S. Miller, Eric W. Biederman,
Stephen Hemminger, Greg Kroah-Hartman, LKML, stable, netdev
On Friday, 2 October 2015, 09:17:16, Holger Hoffstätte wrote:
> On 10/02/15 08:52, Andre Tomt wrote:
> > On 01. okt. 2015 13:52, Eric Dumazet wrote:
> >> On Thu, Oct 1, 2015 at 4:43 AM, Holger Hoffstätte
> >>
> >> <holger.hoffstaette@googlemail.com> wrote:
> >>> On 10/01/15 13:29, Eric Dumazet wrote:
> >>>> commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
> >>>> Author: Eric Dumazet <edumazet@google.com>
> >>>> Date: Thu Aug 13 15:44:51 2015 -0700
> >>>>
> >>>> inet: fix potential deadlock in reqsk_queue_unlink()
> >
> > <snip>
> >
> >>> Whohoo! It applies/builds cleanly to 4.1.10-rc1 and is running as
> >>> we speak. Let's hope that this fixes the lockups.
> >>
> >> It definitely should help !
> >>
> >> David, since patch is not yet seen on
> >> http://patchwork.ozlabs.org/bundle/davem/stable/?state=*
> >> could you please add it to your queue ?
> >
> > Seems to fix it for me as well. 3 systems have been running varying
> > types of production-like loads with it for 14+ hours without hanging.
>
> Just got up, and yes - my systems survived the night as well, no issues.
>
> Greg, any chance you can drop this into the pending 4.1.10? Otherwise people
> will get another broken release.
>
Fixes the problem here, too.
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
* Re: Soft lockup issue in Linux 4.1.9
2015-10-01 11:52 ` Eric Dumazet
2015-10-02 6:52 ` Andre Tomt
@ 2015-10-02 20:04 ` Thomas Gleixner
2015-10-02 20:59 ` Eric Dumazet
2015-10-08 16:56 ` Christoph Biedl
2 siblings, 1 reply; 14+ messages in thread
From: Thomas Gleixner @ 2015-10-02 20:04 UTC (permalink / raw)
To: Eric Dumazet
Cc: Holger Hoffstätte, David S. Miller, Eric W. Biederman,
Stephen Hemminger, Greg Kroah-Hartman, LKML, stable, netdev
On Thu, 1 Oct 2015, Eric Dumazet wrote:
> On Thu, Oct 1, 2015 at 4:43 AM, Holger Hoffstätte
> <holger.hoffstaette@googlemail.com> wrote:
> > On 10/01/15 13:29, Eric Dumazet wrote:
>
> >> commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
> >> Author: Eric Dumazet <edumazet@google.com>
> >> Date: Thu Aug 13 15:44:51 2015 -0700
> >>
> >> inet: fix potential deadlock in reqsk_queue_unlink()
> >>
> >> When replacing del_timer() with del_timer_sync(), I introduced
> >> a deadlock condition :
> >>
> >> reqsk_queue_unlink() is called from inet_csk_reqsk_queue_drop()
> >>
> >> inet_csk_reqsk_queue_drop() can be called from many contexts,
> >> one being the timer handler itself (reqsk_timer_handler()).
> >>
> >> In this case, del_timer_sync() loops forever.
> >>
> >> Simple fix is to test if timer is pending.
> >>
> >> Fixes: 2235f2ac75fd ("inet: fix races with reqsk timers")
> >> Signed-off-by: Eric Dumazet <edumazet@google.com>
> >> Signed-off-by: David S. Miller <davem@davemloft.net>
> >
> > Whohoo! It applies/builds cleanly to 4.1.10-rc1 and is running as
> > we speak. Let's hope that this fixes the lockups.
> >
>
> It definitely should help !
What makes sure, that the timer cannot be readded while that timer
callback is running?
Thanks,
tglx
* Re: Soft lockup issue in Linux 4.1.9
2015-10-02 20:04 ` Thomas Gleixner
@ 2015-10-02 20:59 ` Eric Dumazet
2015-10-02 21:04 ` Thomas Gleixner
0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2015-10-02 20:59 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Eric Dumazet, Holger Hoffstätte, David S. Miller,
Eric W. Biederman, Stephen Hemminger, Greg Kroah-Hartman, LKML,
stable, netdev
On Fri, 2015-10-02 at 22:04 +0200, Thomas Gleixner wrote:
> What makes sure, that the timer cannot be readded while that timer
> callback is running?
What is exactly your question ?
* Re: Soft lockup issue in Linux 4.1.9
2015-10-02 20:59 ` Eric Dumazet
@ 2015-10-02 21:04 ` Thomas Gleixner
2015-10-02 21:32 ` Eric Dumazet
0 siblings, 1 reply; 14+ messages in thread
From: Thomas Gleixner @ 2015-10-02 21:04 UTC (permalink / raw)
To: Eric Dumazet
Cc: Eric Dumazet, Holger Hoffstätte, David S. Miller,
Eric W. Biederman, Stephen Hemminger, Greg Kroah-Hartman, LKML,
stable, netdev
On Fri, 2 Oct 2015, Eric Dumazet wrote:
> On Fri, 2015-10-02 at 22:04 +0200, Thomas Gleixner wrote:
>
> > What makes sure, that the timer cannot be readded while that timer
> > callback is running?
>
> What is exactly your question ?
CPU0                      CPU1

timer expires
  callback
                          add timer
  timer_pending() == true
  ===> del_timer_sync()
I was just curious how this is prevented as I got lost in the
networking code as usual :)
Thanks,
tglx
* Re: Soft lockup issue in Linux 4.1.9
2015-10-02 21:04 ` Thomas Gleixner
@ 2015-10-02 21:32 ` Eric Dumazet
0 siblings, 0 replies; 14+ messages in thread
From: Eric Dumazet @ 2015-10-02 21:32 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Eric Dumazet, Holger Hoffstätte, David S. Miller,
Eric W. Biederman, Stephen Hemminger, Greg Kroah-Hartman, LKML,
stable, netdev
On Fri, 2015-10-02 at 23:04 +0200, Thomas Gleixner wrote:
> On Fri, 2 Oct 2015, Eric Dumazet wrote:
> > On Fri, 2015-10-02 at 22:04 +0200, Thomas Gleixner wrote:
> >
> > > What makes sure, that the timer cannot be readded while that timer
> > > callback is running?
> >
> > What is exactly your question ?
>
> CPU0                      CPU1
>
> timer expires
>   callback
>                           add timer
>   timer_pending() == true
>   ===> del_timer_sync()
>
> I was just curious how this is prevented as I got lost in the
> networking code as usual :)
Sure ;)
I believe this cannot happen, for the following reasons:
mod_timer_pinned() is used only when req is created, while timer cannot
possibly be running on the same req. The _pinned part is critical
because we set the req->refcnt _after_ starting the timer,
to avoid being visible and caught from rcu lookups in hash tables.
Then, timer might be modified only by mod_timer_pending() from
tcp_check_req() : This should not re-start timer if another cpu is in
the timer callback.
Thanks
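The second reason above rests on the difference between mod_timer() and
mod_timer_pending(): the former arms a timer whether or not it is active,
while the latter only changes the expiry of a timer that is already
pending. The sketch below models just that difference in userspace. The
names (sketch_timer, sketch_mod_timer, sketch_mod_timer_pending) are
hypothetical stand-ins, and the model deliberately ignores the pinned-CPU
and refcount aspects mentioned above.

```c
/* Toy userspace model of mod_timer() vs. mod_timer_pending() semantics.
 * Not kernel code; all identifiers are hypothetical. */
#include <stdbool.h>

struct sketch_timer {
    bool pending;           /* true while the timer is queued */
    unsigned long expires;  /* expiry time in model ticks */
};

/* Like mod_timer(): always (re)arms the timer.
 * Returns 1 if the timer was already pending, 0 if it was inactive. */
static int sketch_mod_timer(struct sketch_timer *t, unsigned long expires)
{
    int was_pending = t->pending ? 1 : 0;
    t->pending = true;
    t->expires = expires;
    return was_pending;
}

/* Like mod_timer_pending(): updates the expiry only if the timer is
 * still pending; an inactive timer is left untouched and not re-armed. */
static int sketch_mod_timer_pending(struct sketch_timer *t,
                                    unsigned long expires)
{
    if (!t->pending)
        return 0;
    t->expires = expires;
    return 1;
}
```

While a handler runs, its timer is no longer pending, so in this model a
concurrent sketch_mod_timer_pending() call returns 0 and leaves the timer
inactive. That is the property the explanation relies on to rule out the
"add timer" step in the CPU0/CPU1 scenario.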
* Re: Soft lockup issue in Linux 4.1.9
2015-10-02 7:17 ` Holger Hoffstätte
2015-10-02 19:25 ` Wolfgang Walter
@ 2015-10-03 19:14 ` Thomas D.
2015-10-17 23:41 ` Greg Kroah-Hartman
1 sibling, 1 reply; 14+ messages in thread
From: Thomas D. @ 2015-10-03 19:14 UTC (permalink / raw)
To: Holger Hoffstätte, Andre Tomt, Eric Dumazet, stable
Cc: David S. Miller, Eric W. Biederman, Stephen Hemminger,
Greg Kroah-Hartman, LKML, netdev
Hi,
Holger Hoffstätte wrote:
> Greg, any chance you can drop this into the pending 4.1.10? Otherwise people
> will get another broken release.
For me it looks like the request was too late, the patch is not included
in 4.1.10. So don't forget to re-apply the patch when doing the upgrade.
Greg, do you need a dedicated inclusion request for
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
in 4.1.x or is it already on your list?
-Thomas
* Re: Soft lockup issue in Linux 4.1.9
2015-10-01 11:52 ` Eric Dumazet
2015-10-02 6:52 ` Andre Tomt
2015-10-02 20:04 ` Thomas Gleixner
@ 2015-10-08 16:56 ` Christoph Biedl
2015-10-08 19:27 ` Holger Hoffstätte
2 siblings, 1 reply; 14+ messages in thread
From: Christoph Biedl @ 2015-10-08 16:56 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Holger Hoffstätte, Eric W. Biederman, LKML, stable
Eric Dumazet wrote...
[ commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af ]
> It definitely should help !
Yesterday, I've experienced issues somewhat similar to this, but I'm
not entirely sure:
Four of five systems running 4.1.9 stopped working. No reaction on
network, keyboard, serial console. In one case, the stack trace as
below made it to the loghost.
Two things are quite different. First, the systems had a reasonable
uptime, about a week.
And second, the scary part: All incidents happened within a rather
short time span of three minutes at the most, beginning after 16:41:28 and
before 16:41:54 UTC. So I assumed a brownout first - until I realized
the systems faded away at slightly different times, and one is at a
different location. While other systems using different kernel versions
continued to operate on both sites.
So, I'd be glad for answers for
- Is this the same issue or should I be even more afraid?
- What might be the reason for this temporal coincidence? I have no
plausible idea.
Confused,
Christoph
INFO: rcu_sched self-detected stall on CPU { 3} (t=6000 jiffies g=8932806 c=8932805 q=58491)
rcu_sched kthread starved for 5999 jiffies!
Task dump for CPU 3:
swapper/3 R running task 0 0 1 0x00000008
ffffffff81e396c0 ffff88042dcc3b20 ffffffff810807da 0000000000000003
ffffffff81e396c0 ffff88042dcc3b40 ffffffff81083b78 ffff88042dcc3b80
0000000000000003 ffff88042dcc3b70 ffffffff810a945c ffff88042dcd5740
Call Trace:
<IRQ> [<ffffffff810807da>] sched_show_task+0xaa/0x110
[<ffffffff81083b78>] dump_cpu_task+0x38/0x40
[<ffffffff810a945c>] rcu_dump_cpu_stacks+0x8c/0xc0
[<ffffffff810abf31>] rcu_check_callbacks+0x3b1/0x680
[<ffffffff810e7bb7>] ? acct_account_cputime+0x17/0x20
[<ffffffff8108484e>] ? account_system_time+0x8e/0x180
[<ffffffff810ae4d3>] update_process_times+0x33/0x60
[<ffffffff810bcae0>] tick_sched_handle.isra.14+0x30/0x40
[<ffffffff810bcbd3>] tick_sched_timer+0x43/0x80
[<ffffffff810aea2a>] __run_hrtimer.isra.32+0x4a/0xd0
[<ffffffff810af225>] hrtimer_interrupt+0xd5/0x1f0
[<ffffffff81034d84>] local_apic_timer_interrupt+0x34/0x60
INFO: rcu_sched self-detected stall on CPU { 3} (t=6000 jiffies g=8932806 c=8932805 q=58491)
rcu_sched kthread starved for 5999 jiffies!
Task dump for CPU 3:
swapper/3 R running task 0 0 1 0x00000008
ffffffff81e396c0 ffff88042dcc3b20 ffffffff810807da 0000000000000003
ffffffff81e396c0 ffff88042dcc3b40 ffffffff81083b78 ffff88042dcc3b80
0000000000000003 ffff88042dcc3b70 ffffffff810a945c ffff88042dcd5740
Call Trace:
<IRQ> [<ffffffff810807da>] sched_show_task+0xaa/0x110
[<ffffffff81083b78>] dump_cpu_task+0x38/0x40
[<ffffffff8103516c>] smp_apic_timer_interrupt+0x3c/0x60
[<ffffffff8190db7b>] apic_timer_interrupt+0x6b/0x70
[<ffffffff8190c8a9>] ? _raw_spin_unlock_irqrestore+0x9/0x10
[<ffffffff810ade58>] try_to_del_timer_sync+0x48/0x60
[<ffffffff810adeb2>] ? del_timer_sync+0x42/0x60
[<ffffffff810adeba>] del_timer_sync+0x4a/0x60
[<ffffffff8178b7da>] inet_csk_reqsk_queue_drop+0x7a/0x1f0
[<ffffffff8178ba7f>] reqsk_timer_handler+0x12f/0x290
[<ffffffff8178b950>] ? inet_csk_reqsk_queue_drop+0x1f0/0x1f0
[<ffffffff810ad9e6>] call_timer_fn.isra.26+0x26/0x80
[<ffffffff810a945c>] rcu_dump_cpu_stacks+0x8c/0xc0
[<ffffffff810abf31>] rcu_check_callbacks+0x3b1/0x680
[<ffffffff810e7bb7>] ? acct_account_cputime+0x17/0x20
[<ffffffff8108484e>] ? account_system_time+0x8e/0x180
[<ffffffff810ae4d3>] update_process_times+0x33/0x60
[<ffffffff810bcae0>] tick_sched_handle.isra.14+0x30/0x40
[<ffffffff810bcbd3>] tick_sched_timer+0x43/0x80
[<ffffffff810aea2a>] __run_hrtimer.isra.32+0x4a/0xd0
[<ffffffff810af225>] hrtimer_interrupt+0xd5/0x1f0
[<ffffffff81034d84>] local_apic_timer_interrupt+0x34/0x60
[<ffffffff810ae1ae>] run_timer_softirq+0x18e/0x220
[<ffffffff81060b1a>] __do_softirq+0xda/0x1f0
[<ffffffff81060e16>] irq_exit+0x76/0xa0
[<ffffffff81035175>] smp_apic_timer_interrupt+0x45/0x60
[<ffffffff8190db7b>] apic_timer_interrupt+0x6b/0x70
<EOI> [<ffffffff810844be>] ? sched_clock_cpu+0x9e/0xb0
[<ffffffff8100bc15>] ? amd_e400_idle+0x35/0xd0
[<ffffffff8100bc13>] ? amd_e400_idle+0x33/0xd0
[<ffffffff8100c42a>] arch_cpu_idle+0xa/0x10
[<ffffffff810929e3>] cpu_startup_entry+0x2c3/0x330
[<ffffffff8103516c>] smp_apic_timer_interrupt+0x3c/0x60
[<ffffffff8190db7b>] apic_timer_interrupt+0x6b/0x70
[<ffffffff8190c8a9>] ? _raw_spin_unlock_irqrestore+0x9/0x10
[<ffffffff810ade58>] try_to_del_timer_sync+0x48/0x60
[<ffffffff810adeb2>] ? del_timer_sync+0x42/0x60
[<ffffffff810adeba>] del_timer_sync+0x4a/0x60
[<ffffffff8178b7da>] inet_csk_reqsk_queue_drop+0x7a/0x1f0
[<ffffffff8178ba7f>] reqsk_timer_handler+0x12f/0x290
[<ffffffff8178b950>] ? inet_csk_reqsk_queue_drop+0x1f0/0x1f0
[<ffffffff810ad9e6>] call_timer_fn.isra.26+0x26/0x80
[<ffffffff810332dc>] start_secondary+0x17c/0x1a0
[<ffffffff810ae1ae>] run_timer_softirq+0x18e/0x220
[<ffffffff81060b1a>] __do_softirq+0xda/0x1f0
[<ffffffff81060e16>] irq_exit+0x76/0xa0
[<ffffffff81035175>] smp_apic_timer_interrupt+0x45/0x60
[<ffffffff8190db7b>] apic_timer_interrupt+0x6b/0x70
<EOI> [<ffffffff810844be>] ? sched_clock_cpu+0x9e/0xb0
[<ffffffff8100bc15>] ? amd_e400_idle+0x35/0xd0
[<ffffffff8100bc13>] ? amd_e400_idle+0x33/0xd0
[<ffffffff8100c42a>] arch_cpu_idle+0xa/0x10
[<ffffffff810929e3>] cpu_startup_entry+0x2c3/0x330
[<ffffffff810332dc>] start_secondary+0x17c/0x1a0
* Re: Soft lockup issue in Linux 4.1.9
2015-10-08 16:56 ` Christoph Biedl
@ 2015-10-08 19:27 ` Holger Hoffstätte
0 siblings, 0 replies; 14+ messages in thread
From: Holger Hoffstätte @ 2015-10-08 19:27 UTC (permalink / raw)
To: Christoph Biedl, Eric Dumazet; +Cc: Eric W. Biederman, LKML, stable
On 10/08/15 18:56, Christoph Biedl wrote:
> Eric Dumazet wrote...
>
> [ commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af ]
>
>> It definitely should help !
>
> Yesterday, I've experienced issues somewhat similar to this, but I'm
> not entirely sure:
>
> Four of five systems running 4.1.9 stopped working. No reaction on
> network, keyboard, serial console. In one case, the stack trace as
> below made it to the loghost.
>
> Two things are quite different. First, the systems had a reasonable
> uptime, about a week.
>
> And second, the scary part: All incidents happened within a rather
> short time span of three minutes at the most, beginning after 16:41:28 and
> before 16:41:54 UTC. So I assumed a brownout first - until I realized
> the systems faded away at slightly different times, and one is at a
> different location. While other systems using different kernel versions
> continued to operate on both sites.
>
> So, I'd be glad for answers for
>
> - Is this the same issue or should I be even more afraid?
There's always room for more. :-)
> - What might be the reason for this temporal coincidence? I have no
> plausible idea.
More bugs?
> Confused,
> Christoph
>
>
> INFO: rcu_sched self-detected stall on CPU { 3} (t=6000 jiffies g=8932806 c=8932805 q=58491)
> rcu_sched kthread starved for 5999 jiffies!
> Task dump for CPU 3:
> swapper/3 R running task 0 0 1 0x00000008
> ffffffff81e396c0 ffff88042dcc3b20 ffffffff810807da 0000000000000003
> ffffffff81e396c0 ffff88042dcc3b40 ffffffff81083b78 ffff88042dcc3b80
> 0000000000000003 ffff88042dcc3b70 ffffffff810a945c ffff88042dcd5740
> Call Trace:
> <IRQ> [<ffffffff810807da>] sched_show_task+0xaa/0x110
> [<ffffffff81083b78>] dump_cpu_task+0x38/0x40
> [<ffffffff810a945c>] rcu_dump_cpu_stacks+0x8c/0xc0
> [<ffffffff810abf31>] rcu_check_callbacks+0x3b1/0x680
> [<ffffffff810e7bb7>] ? acct_account_cputime+0x17/0x20
> [<ffffffff8108484e>] ? account_system_time+0x8e/0x180
> [<ffffffff810ae4d3>] update_process_times+0x33/0x60
> [<ffffffff810bcae0>] tick_sched_handle.isra.14+0x30/0x40
> [<ffffffff810bcbd3>] tick_sched_timer+0x43/0x80
> [<ffffffff810aea2a>] __run_hrtimer.isra.32+0x4a/0xd0
> [<ffffffff810af225>] hrtimer_interrupt+0xd5/0x1f0
> [<ffffffff81034d84>] local_apic_timer_interrupt+0x34/0x60
> INFO: rcu_sched self-detected stall on CPU { 3} (t=6000 jiffies g=8932806 c=8932805 q=58491)
> rcu_sched kthread starved for 5999 jiffies!
> Task dump for CPU 3:
> swapper/3 R running task 0 0 1 0x00000008
> ffffffff81e396c0 ffff88042dcc3b20 ffffffff810807da 0000000000000003
> ffffffff81e396c0 ffff88042dcc3b40 ffffffff81083b78 ffff88042dcc3b80
> 0000000000000003 ffff88042dcc3b70 ffffffff810a945c ffff88042dcd5740
> Call Trace:
> <IRQ> [<ffffffff810807da>] sched_show_task+0xaa/0x110
> [<ffffffff81083b78>] dump_cpu_task+0x38/0x40
> [<ffffffff8103516c>] smp_apic_timer_interrupt+0x3c/0x60
> [<ffffffff8190db7b>] apic_timer_interrupt+0x6b/0x70
> [<ffffffff8190c8a9>] ? _raw_spin_unlock_irqrestore+0x9/0x10
> [<ffffffff810ade58>] try_to_del_timer_sync+0x48/0x60
> [<ffffffff810adeb2>] ? del_timer_sync+0x42/0x60
> [<ffffffff810adeba>] del_timer_sync+0x4a/0x60
> [<ffffffff8178b7da>] inet_csk_reqsk_queue_drop+0x7a/0x1f0
> [<ffffffff8178ba7f>] reqsk_timer_handler+0x12f/0x290
> [<ffffffff8178b950>] ? inet_csk_reqsk_queue_drop+0x1f0/0x1f0
> [<ffffffff810ad9e6>] call_timer_fn.isra.26+0x26/0x80
> [<ffffffff810a945c>] rcu_dump_cpu_stacks+0x8c/0xc0
> [<ffffffff810abf31>] rcu_check_callbacks+0x3b1/0x680
> [<ffffffff810e7bb7>] ? acct_account_cputime+0x17/0x20
> [<ffffffff8108484e>] ? account_system_time+0x8e/0x180
> [<ffffffff810ae4d3>] update_process_times+0x33/0x60
> [<ffffffff810bcae0>] tick_sched_handle.isra.14+0x30/0x40
> [<ffffffff810bcbd3>] tick_sched_timer+0x43/0x80
> [<ffffffff810aea2a>] __run_hrtimer.isra.32+0x4a/0xd0
> [<ffffffff810af225>] hrtimer_interrupt+0xd5/0x1f0
> [<ffffffff81034d84>] local_apic_timer_interrupt+0x34/0x60
> [<ffffffff810ae1ae>] run_timer_softirq+0x18e/0x220
> [<ffffffff81060b1a>] __do_softirq+0xda/0x1f0
> [<ffffffff81060e16>] irq_exit+0x76/0xa0
> [<ffffffff81035175>] smp_apic_timer_interrupt+0x45/0x60
> [<ffffffff8190db7b>] apic_timer_interrupt+0x6b/0x70
> <EOI> [<ffffffff810844be>] ? sched_clock_cpu+0x9e/0xb0
> [<ffffffff8100bc15>] ? amd_e400_idle+0x35/0xd0
> [<ffffffff8100bc13>] ? amd_e400_idle+0x33/0xd0
> [<ffffffff8100c42a>] arch_cpu_idle+0xa/0x10
> [<ffffffff810929e3>] cpu_startup_entry+0x2c3/0x330
> [<ffffffff8103516c>] smp_apic_timer_interrupt+0x3c/0x60
> [<ffffffff8190db7b>] apic_timer_interrupt+0x6b/0x70
> [<ffffffff8190c8a9>] ? _raw_spin_unlock_irqrestore+0x9/0x10
> [<ffffffff810ade58>] try_to_del_timer_sync+0x48/0x60
> [<ffffffff810adeb2>] ? del_timer_sync+0x42/0x60
> [<ffffffff810adeba>] del_timer_sync+0x4a/0x60
> [<ffffffff8178b7da>] inet_csk_reqsk_queue_drop+0x7a/0x1f0
> [<ffffffff8178ba7f>] reqsk_timer_handler+0x12f/0x290
> [<ffffffff8178b950>] ? inet_csk_reqsk_queue_drop+0x1f0/0x1f0
> [<ffffffff810ad9e6>] call_timer_fn.isra.26+0x26/0x80
> [<ffffffff810332dc>] start_secondary+0x17c/0x1a0
> [<ffffffff810ae1ae>] run_timer_softirq+0x18e/0x220
> [<ffffffff81060b1a>] __do_softirq+0xda/0x1f0
> [<ffffffff81060e16>] irq_exit+0x76/0xa0
> [<ffffffff81035175>] smp_apic_timer_interrupt+0x45/0x60
> [<ffffffff8190db7b>] apic_timer_interrupt+0x6b/0x70
> <EOI> [<ffffffff810844be>] ? sched_clock_cpu+0x9e/0xb0
> [<ffffffff8100bc15>] ? amd_e400_idle+0x35/0xd0
> [<ffffffff8100bc13>] ? amd_e400_idle+0x33/0xd0
> [<ffffffff8100c42a>] arch_cpu_idle+0xa/0x10
> [<ffffffff810929e3>] cpu_startup_entry+0x2c3/0x330
> [<ffffffff810332dc>] start_secondary+0x17c/0x1a0
>
The timer fixes were followups to a patch that went into 4.1 called
"tcp/dccp: get rid of central timewait timer", and it seems there were
a few more patches in that area very recently.
So after some git spelunking I am now running with the following patches
on top of 4.1.10 + 83fccfc3940.. (for the lockups), in the following
order:
fc01538f9fb75572c969ca9988176ffc2a8741d6 simplify timewait refcounting
dbe7faa4045ea83a37b691b12bb02a8f86c2d2e9 inet_twsk_deschedule factorization
29c6852602e259d2c1882f320b29d5c3fec0de04 fix races in reqsk_queue_hash_req()
ed2e923945892a8372ab70d2f61d364b0b6d9054 fix timewait races in timer handling
They may not all be required for the particular problem you just summoned,
but (from what I could tell) are required to apply everything properly.
They certainly can't make things worse. :-)
Oh and while you're at it you can apply these l33t cubic fixes :-)
30927520dbae297182990bb21d08762bcc35ce1d better follow cubic curve after idle period
c2e7204d180f8efc80f27959ca9cf16fa17f67db do not set epoch_start in the future
I've been running these on 3 machines for almost 10 minutes without issue,
so they are totally safe to go into production right away.
-h
* Re: Soft lockup issue in Linux 4.1.9
2015-10-03 19:14 ` Thomas D.
@ 2015-10-17 23:41 ` Greg Kroah-Hartman
0 siblings, 0 replies; 14+ messages in thread
From: Greg Kroah-Hartman @ 2015-10-17 23:41 UTC (permalink / raw)
To: Thomas D.
Cc: Holger Hoffstätte, Andre Tomt, Eric Dumazet, stable,
David S. Miller, Eric W. Biederman, Stephen Hemminger, LKML,
netdev
On Sat, Oct 03, 2015 at 09:14:16PM +0200, Thomas D. wrote:
> Hi,
>
> Holger Hoffstätte wrote:
> > Greg, any chance you can drop this into the pending 4.1.10? Otherwise people
> > will get another broken release.
>
> For me it looks like the request was too late, the patch is not included
> in 4.1.10. So don't forget to re-apply the patch when doing the upgrade.
>
> Greg, do you need a dedicated inclusion request for
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=83fccfc3940c4a2db90fd7e7079f5b465cd8c6af
> in 4.1.x or is it already on your list?
Now applied, thanks.
greg k-h