Subject: Re: Soft lockup issue in Linux 4.1.9
From: Holger Hoffstätte
To: Christoph Biedl, Eric Dumazet
Cc: "Eric W. Biederman", LKML, stable@vger.kernel.org
Date: Thu, 8 Oct 2015 21:27:51 +0200
Message-ID: <5616C3B7.9000305@googlemail.com>
In-Reply-To: <1444322507@msgid.manchmal.in-ulm.de>
References: <1443650383.13282.10.camel@daevel.fr> <560D1223.3070606@googlemail.com>
 <560D1C5A.3050508@googlemail.com> <1444322507@msgid.manchmal.in-ulm.de>

On 10/08/15 18:56, Christoph Biedl wrote:
> Eric Dumazet wrote...
>
> [ commit 83fccfc3940c4a2db90fd7e7079f5b465cd8c6af ]
>
>> It definitely should help !
>
> Yesterday I experienced issues somewhat similar to this, but I'm not
> entirely sure:
>
> Four of five systems running 4.1.9 stopped working. No reaction on
> network, keyboard, or serial console. In one case the stack trace
> below made it to the loghost.
>
> Two things are quite different, though. First, the systems had a
> reasonable uptime, about a week.
>
> And second, the scary part: all incidents happened within a rather
> short time span of three minutes at most, beginning after 16:41:28
> and before 16:41:54 UTC. So at first I assumed a brownout - until I
> realized the systems faded away at slightly different times, and one
> is at a different location, while other systems running different
> kernel versions continued to operate at both sites.
>
> So I'd be glad for answers to:
>
> - Is this the same issue, or should I be even more afraid?

There's always room for more. :-)

> - What might be the reason for this temporal coincidence? I have no
>   plausible idea.

More bugs?

> Confused,
>     Christoph
>
> INFO: rcu_sched self-detected stall on CPU { 3} (t=6000 jiffies g=8932806 c=8932805 q=58491)
> rcu_sched kthread starved for 5999 jiffies!
> Task dump for CPU 3:
> swapper/3       R  running task        0     0      1 0x00000008
>  ffffffff81e396c0 ffff88042dcc3b20 ffffffff810807da 0000000000000003
>  ffffffff81e396c0 ffff88042dcc3b40 ffffffff81083b78 ffff88042dcc3b80
>  0000000000000003 ffff88042dcc3b70 ffffffff810a945c ffff88042dcd5740
> Call Trace:
>  [] sched_show_task+0xaa/0x110
>  [] dump_cpu_task+0x38/0x40
>  [] rcu_dump_cpu_stacks+0x8c/0xc0
>  [] rcu_check_callbacks+0x3b1/0x680
>  [] ? acct_account_cputime+0x17/0x20
>  [] ? account_system_time+0x8e/0x180
>  [] update_process_times+0x33/0x60
>  [] tick_sched_handle.isra.14+0x30/0x40
>  [] tick_sched_timer+0x43/0x80
>  [] __run_hrtimer.isra.32+0x4a/0xd0
>  [] hrtimer_interrupt+0xd5/0x1f0
>  [] local_apic_timer_interrupt+0x34/0x60
>  [] smp_apic_timer_interrupt+0x3c/0x60
>  [] apic_timer_interrupt+0x6b/0x70
>  [] ? _raw_spin_unlock_irqrestore+0x9/0x10
>  [] try_to_del_timer_sync+0x48/0x60
>  [] ? del_timer_sync+0x42/0x60
>  [] del_timer_sync+0x4a/0x60
>  [] inet_csk_reqsk_queue_drop+0x7a/0x1f0
>  [] reqsk_timer_handler+0x12f/0x290
>  [] ? inet_csk_reqsk_queue_drop+0x1f0/0x1f0
>  [] call_timer_fn.isra.26+0x26/0x80
>  [] run_timer_softirq+0x18e/0x220
>  [] __do_softirq+0xda/0x1f0
>  [] irq_exit+0x76/0xa0
>  [] smp_apic_timer_interrupt+0x45/0x60
>  [] apic_timer_interrupt+0x6b/0x70
>  [] ? sched_clock_cpu+0x9e/0xb0
>  [] ? amd_e400_idle+0x35/0xd0
>  [] ? amd_e400_idle+0x33/0xd0
>  [] arch_cpu_idle+0xa/0x10
>  [] cpu_startup_entry+0x2c3/0x330
>  [] start_secondary+0x17c/0x1a0
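
In case anyone wonders what is actually wedging the CPU here:
reqsk_timer_handler() is itself a timer callback running in softirq
context, and via inet_csk_reqsk_queue_drop() it ends up in
del_timer_sync() on its own rsk_timer. del_timer_sync() spins until the
currently running handler has finished - but the running handler is the
caller, so it spins forever, the CPU never leaves softirq, and the RCU
stall detector eventually complains. Boiled down to a toy module (a
rough sketch with made-up names and the 4.1-era timer API, not the
actual net/ipv4 code):

    #include <linux/module.h>
    #include <linux/timer.h>

    static struct timer_list my_timer;  /* stand-in for req->rsk_timer */

    /* Timer callbacks run in softirq context; the timer has already
     * been dequeued and is marked as running on this CPU. */
    static void my_timer_handler(unsigned long data)
    {
            /*
             * del_timer_sync() waits for the currently running
             * handler to finish. That handler is us, so
             * try_to_del_timer_sync() fails forever and we spin right
             * here - compare the try_to_del_timer_sync /
             * _raw_spin_unlock_irqrestore frames in the trace above.
             */
            del_timer_sync(&my_timer);
    }

    static int __init demo_init(void)
    {
            setup_timer(&my_timer, my_timer_handler, 0);
            mod_timer(&my_timer, jiffies + HZ);  /* fires in ~1s, then spins */
            return 0;
    }
    module_init(demo_init);
    MODULE_LICENSE("GPL");

That is exactly the window that 83fccfc3940 closes.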

The timer fixes were followups to a patch that went into 4.1 called
"tcp/dccp: get rid of central timewait timer", and it seems there were
a few more patches in that area very recently. So after some git
spelunking I am now running the following patches on top of 4.1.10 +
83fccfc3940 (for the lockups), in this order:

fc01538f9fb75572c969ca9988176ffc2a8741d6 simplify timewait refcounting
dbe7faa4045ea83a37b691b12bb02a8f86c2d2e9 inet_twsk_deschedule factorization
29c6852602e259d2c1882f320b29d5c3fec0de04 fix races in reqsk_queue_hash_req()
ed2e923945892a8372ab70d2f61d364b0b6d9054 fix timewait races in timer handling

They may not all be strictly required for the particular problem you
just summoned, but (from what I could tell) they are needed so that
everything applies cleanly. They certainly can't make things worse. :-)

Oh, and while you're at it you can apply these l33t cubic fixes :-)

30927520dbae297182990bb21d08762bcc35ce1d better follow cubic curve after idle period
c2e7204d180f8efc80f27959ca9cf16fa17f67db do not set epoch_start in the future

I've been running these on 3 machines for almost 10 minutes without
issue, so they are totally safe to go into production right away.

-h
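
P.S.: For completeness, the shape of the fix in 83fccfc3940 (going from
memory here, so check the actual commit) is a guard in
reqsk_queue_unlink() that skips the synchronous cancel when the timer
is no longer pending - which it is not while its own handler is running:

    /*
     * A timer is dequeued before its handler runs, so when we get
     * here from reqsk_timer_handler() itself, timer_pending() is
     * false and we never call del_timer_sync() on the timer we are
     * currently running from.
     */
    if (timer_pending(&req->rsk_timer) && del_timer_sync(&req->rsk_timer))
            reqsk_put(req);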