From: Ben Greear <greearb@candelatech.com>
To: ath9k-devel@lists.ath9k.org
Subject: [ath9k-devel] stop_machine lockup issue in 3.9.y.
Date: Wed, 05 Jun 2013 14:33:08 -0700
Message-ID: <51AFAE94.5030007@candelatech.com>
In-Reply-To: <20130605211157.GK10693@mtj.dyndns.org>

On 06/05/2013 02:11 PM, Tejun Heo wrote:
> (cc'ing wireless crowd, tglx and Ingo. The original thread is at
> http://thread.gmane.org/gmane.linux.kernel/1500158/focus=55005 )
>
> Hello, Ben.
>
> On Wed, Jun 05, 2013 at 01:58:31PM -0700, Ben Greear wrote:
>> Hmm, wonder if I found it. I previously saw times where it appears
>> jiffies does not increment. __do_softirq has a break-out based on
>> jiffies timeout. Maybe that is failing to get us out of __do_softirq
>> in my lockup case because for whatever reason the system cannot update
>> jiffies in this case?
>>
>> I added this (probably whitespace damaged) hack and now I have not been
>> able to reproduce the problem.
>
> Ah, nice catch. :)
>
>> diff --git a/kernel/softirq.c b/kernel/softirq.c
>> index 14d7758..621ea3b 100644
>> --- a/kernel/softirq.c
>> +++ b/kernel/softirq.c
>> @@ -212,6 +212,7 @@ asmlinkage void __do_softirq(void)
>>          unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
>>          int cpu;
>>          unsigned long old_flags = current->flags;
>> +        unsigned long loops = 0;
>>
>>          /*
>>           * Mask out PF_MEMALLOC s current task context is borrowed for the
>> @@ -241,6 +242,7 @@ restart:
>>                          unsigned int vec_nr = h - softirq_vec;
>>                          int prev_count = preempt_count();
>>
>> +                        loops++;
>>                          kstat_incr_softirqs_this_cpu(vec_nr);
>>
>>                          trace_softirq_entry(vec_nr);
>> @@ -265,7 +267,7 @@ restart:
>>
>>          pending = local_softirq_pending();
>>          if (pending) {
>> -                if (time_before(jiffies, end) && !need_resched())
>> +                if (time_before(jiffies, end) && !need_resched() && (loops < 500))
>>                          goto restart;
>
> So, softirq most likely kicked off from ath9k is rescheduling itself
> to the extent where it ends up locking out the CPU completely. The
> problem is usually okay because the processing would break out in 2ms
> but as jiffies is stopped in this case with all other CPUs trapped in
> stop_machine, the loop never breaks and the machine hangs. While
> adding the counter limit probably isn't a bad idea, softirq requeueing
> itself indefinitely sounds pretty buggy.

Just to be clear on the ath9k part for the wifi folks:

This is basically un-patched 3.9.4, but I have 200 virtual stations
configured on each of two ath9k radios. I cannot reproduce the problem
without ath9k, but I do not know for certain ath9k is the real
culprit.

In the case where I can most easily reproduce the lockup, ath9k virtual
stations would be trying to associate, so I'd expect a fair amount
of packet processing to be happening...

> ath9k people, do you guys have any idea what's going on? Why would
> softirq repeat itself indefinitely?
>
> Ingo, Thomas, we're seeing a stop_machine hanging because
>
> * All other CPUs entered IRQ disabled stage. Jiffies is not being
> updated.
>
> * The last CPU gets caught up executing softirq indefinitely. As
> jiffies doesn't get updated, it never breaks out of softirq
> handling. This is a deadlock. This CPU won't break out of softirq
> handling unless jiffies is updated and other CPUs can't do anything
> until this CPU enters the same stop_machine stage.
>
> Ben found out that breaking out of softirq handling after a certain
> number of repetitions makes the issue go away, which isn't a proper
> fix but we might want it anyway. What do you guys think?
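
To make the failure mode above concrete, here is a minimal userspace
sketch of the break-out logic, not the kernel code itself. All names
(sketch_state, sketch_do_softirq, SKETCH_MAX_LOOPS, hard_stop) are
hypothetical stand-ins, one tick per pass is an arbitrary assumption,
and the counter is bumped once per pass rather than once per vector as
in the hack. It only illustrates why a jiffies-based timeout cannot end
the loop when jiffies is frozen, and why a pass counter can.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cap, mirroring the "loops < 500" test in the hack. */
#define SKETCH_MAX_LOOPS 500UL

/* Hypothetical stand-in for the state the real loop looks at. */
struct sketch_state {
    unsigned long jiffies;  /* tick counter */
    bool jiffies_frozen;    /* true: nobody is updating jiffies */
    bool always_pending;    /* true: the handler re-raises itself every pass */
};

/* Wrap-safe "a is before b", same idea as the kernel's time_before(). */
static bool sketch_time_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

/*
 * Runs the simulated restart loop and returns the number of passes made.
 * hard_stop exists only so the simulation terminates; hitting it means
 * the loop would never have ended on its own.
 */
static unsigned long sketch_do_softirq(struct sketch_state *s,
                                       bool use_loop_cap,
                                       unsigned long hard_stop)
{
    assert(s != NULL);

    unsigned long end = s->jiffies + 2;  /* assumed 2-tick budget */
    unsigned long loops = 0;

    for (;;) {
        loops++;
        if (!s->jiffies_frozen)
            s->jiffies++;                /* assumption: one tick per pass */

        if (!s->always_pending)
            break;                       /* nothing left to process */
        if (!sketch_time_before(s->jiffies, end))
            break;                       /* time budget used up */
        if (use_loop_cap && loops >= SKETCH_MAX_LOOPS)
            break;                       /* pass counter limit reached */
        if (loops >= hard_stop)
            break;                       /* simulation guard only */
    }
    return loops;
}
```

With jiffies advancing, the time check ends the loop after a couple of
passes. With jiffies frozen and no cap, only the simulation guard stops
it. With the cap enabled, it stops at SKETCH_MAX_LOOPS regardless of
jiffies.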

Thanks,
Ben

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com