Re: stop_machine lockup issue in 3.9.y.

Linux wireless drivers development
 help / color / mirror / Atom feed

* Re: stop_machine lockup issue in 3.9.y.
       [not found]                 ` <51AFA677.9010605@candelatech.com>
@ 2013-06-05 21:11                   ` Tejun Heo
  2013-06-05 21:33                     ` Ben Greear
  2013-06-06  1:34                     ` Eric Dumazet
  0 siblings, 2 replies; 12+ messages in thread
From: Tejun Heo @ 2013-06-05 21:11 UTC (permalink / raw)
  To: Ben Greear
  Cc: Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable,
	Luis R. Rodriguez, Jouni Malinen, Vasanthakumar Thiagarajan,
	Senthil Balasubramanian, linux-wireless, ath9k-devel,
	Thomas Gleixner, Ingo Molnar

(cc'ing wireless crowd, tglx and Ingo.  The original thread is at
 http://thread.gmane.org/gmane.linux.kernel/1500158/focus=55005 )

Hello, Ben.

On Wed, Jun 05, 2013 at 01:58:31PM -0700, Ben Greear wrote:
> Hmm, wonder if I found it.  I previously saw times where it appears
> jiffies does not increment.  __do_softirq has a break-out based on
> jiffies timeout.  Maybe that is failing to get us out of __do_softirq
> in my lockup case because for whatever reason the system cannot update
> jiffies in this case?
> 
> I added this (probably whitespace damaged) hack and now I have not been
> able to reproduce the problem.

Ah, nice catch. :)

> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 14d7758..621ea3b 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -212,6 +212,7 @@ asmlinkage void __do_softirq(void)
>         unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
>         int cpu;
>         unsigned long old_flags = current->flags;
> +       unsigned long loops = 0;
> 
>         /*
>          * Mask out PF_MEMALLOC s current task context is borrowed for the
> @@ -241,6 +242,7 @@ restart:
>                         unsigned int vec_nr = h - softirq_vec;
>                         int prev_count = preempt_count();
> 
> +                       loops++;
>                         kstat_incr_softirqs_this_cpu(vec_nr);
> 
>                         trace_softirq_entry(vec_nr);
> @@ -265,7 +267,7 @@ restart:
> 
>         pending = local_softirq_pending();
>         if (pending) {
> -               if (time_before(jiffies, end) && !need_resched())
> +               if (time_before(jiffies, end) && !need_resched() && (loops < 500))
>                         goto restart;

So, softirq most likely kicked off from ath9k is rescheduling itself
to the extent where it ends up locking out the CPU completely.  The
problem is usually okay because the processing would break out in 2ms
but as jiffies is stopped in this case with all other CPUs trapped in
stop_machine, the loop never breaks and the machine hangs.  While
adding the counter limit probably isn't a bad idea, softirq requeueing
itself indefinitely sounds pretty buggy.

ath9k people, do you guys have any idea what's going on?  Why would
softirq repeat itself indefinitely?

Ingo, Thomas, we're seeing a stop_machine hanging because

* All other CPUs entered IRQ disabled stage.  Jiffies is not being
  updated.

* The last CPU get caught up executing softirq indefinitely.  As
  jiffies doesn't get updated, it never breaks out of softirq
  handling.  This is a deadlock.  This CPU won't break out of softirq
  handling unless jiffies is updated and other CPUs can't do anything
  until this CPU enters the same stop_machine stage.

Ben found out that breaking out of softirq handling after certain
number of repetitions makes the issue go away, which isn't a proper
fix but we might want anyway.  What do you guys think?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-05 21:11                   ` stop_machine lockup issue in 3.9.y Tejun Heo
@ 2013-06-05 21:33                     ` Ben Greear
  2013-06-06  1:34                     ` Eric Dumazet
  1 sibling, 0 replies; 12+ messages in thread
From: Ben Greear @ 2013-06-05 21:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Rusty Russell, Joe Lawrence, Linux Kernel Mailing List, stable,
	Luis R. Rodriguez, Jouni Malinen, Vasanthakumar Thiagarajan,
	Senthil Balasubramanian, linux-wireless, ath9k-devel,
	Thomas Gleixner, Ingo Molnar

On 06/05/2013 02:11 PM, Tejun Heo wrote:
> (cc'ing wireless crowd, tglx and Ingo.  The original thread is at
>   http://thread.gmane.org/gmane.linux.kernel/1500158/focus=55005 )
>
> Hello, Ben.
>
> On Wed, Jun 05, 2013 at 01:58:31PM -0700, Ben Greear wrote:
>> Hmm, wonder if I found it.  I previously saw times where it appears
>> jiffies does not increment.  __do_softirq has a break-out based on
>> jiffies timeout.  Maybe that is failing to get us out of __do_softirq
>> in my lockup case because for whatever reason the system cannot update
>> jiffies in this case?
>>
>> I added this (probably whitespace damaged) hack and now I have not been
>> able to reproduce the problem.
>
> Ah, nice catch. :)
>
>> diff --git a/kernel/softirq.c b/kernel/softirq.c
>> index 14d7758..621ea3b 100644
>> --- a/kernel/softirq.c
>> +++ b/kernel/softirq.c
>> @@ -212,6 +212,7 @@ asmlinkage void __do_softirq(void)
>>          unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
>>          int cpu;
>>          unsigned long old_flags = current->flags;
>> +       unsigned long loops = 0;
>>
>>          /*
>>           * Mask out PF_MEMALLOC s current task context is borrowed for the
>> @@ -241,6 +242,7 @@ restart:
>>                          unsigned int vec_nr = h - softirq_vec;
>>                          int prev_count = preempt_count();
>>
>> +                       loops++;
>>                          kstat_incr_softirqs_this_cpu(vec_nr);
>>
>>                          trace_softirq_entry(vec_nr);
>> @@ -265,7 +267,7 @@ restart:
>>
>>          pending = local_softirq_pending();
>>          if (pending) {
>> -               if (time_before(jiffies, end) && !need_resched())
>> +               if (time_before(jiffies, end) && !need_resched() && (loops < 500))
>>                          goto restart;
>
> So, softirq most likely kicked off from ath9k is rescheduling itself
> to the extent where it ends up locking out the CPU completely.  The
> problem is usually okay because the processing would break out in 2ms
> but as jiffies is stopped in this case with all other CPUs trapped in
> stop_machine, the loop never breaks and the machine hangs.  While
> adding the counter limit probably isn't a bad idea, softirq requeueing
> itself indefinitely sounds pretty buggy.

Just to be clear on the ath9k part for the wifi folks:

This is basically un-patched 3.9.4, but I have 200 virtual stations
configured on each of two ath9k radios.  I cannot reproduce the problem
without ath9k, but I do not know for certain ath9k is the real
culprit.

In the case where I can most easily reproduce the lockup, ath9k virtual
stations would be trying to associate, so I'd expect a fair amount
of packet processing to be happening...

> ath9k people, do you guys have any idea what's going on?  Why would
> softirq repeat itself indefinitely?
>
> Ingo, Thomas, we're seeing a stop_machine hanging because
>
> * All other CPUs entered IRQ disabled stage.  Jiffies is not being
>    updated.
>
> * The last CPU get caught up executing softirq indefinitely.  As
>    jiffies doesn't get updated, it never breaks out of softirq
>    handling.  This is a deadlock.  This CPU won't break out of softirq
>    handling unless jiffies is updated and other CPUs can't do anything
>    until this CPU enters the same stop_machine stage.
>
> Ben found out that breaking out of softirq handling after certain
> number of repetitions makes the issue go away, which isn't a proper
> fix but we might want anyway.  What do you guys think?

Thanks,
Ben



-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-05 21:11                   ` stop_machine lockup issue in 3.9.y Tejun Heo
  2013-06-05 21:33                     ` Ben Greear
@ 2013-06-06  1:34                     ` Eric Dumazet
  2013-06-06  3:14                       ` Tejun Heo
  1 sibling, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2013-06-06  1:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ben Greear, Rusty Russell, Joe Lawrence,
	Linux Kernel Mailing List, stable, Luis R. Rodriguez,
	Jouni Malinen, Vasanthakumar Thiagarajan, Senthil Balasubramanian,
	linux-wireless, ath9k-devel, Thomas Gleixner, Ingo Molnar

On Wed, 2013-06-05 at 14:11 -0700, Tejun Heo wrote:
> (cc'ing wireless crowd, tglx and Ingo.  The original thread is at
>  http://thread.gmane.org/gmane.linux.kernel/1500158/focus=55005 )
> 
> Hello, Ben.
> 
> On Wed, Jun 05, 2013 at 01:58:31PM -0700, Ben Greear wrote:
> > Hmm, wonder if I found it.  I previously saw times where it appears
> > jiffies does not increment.  __do_softirq has a break-out based on
> > jiffies timeout.  Maybe that is failing to get us out of __do_softirq
> > in my lockup case because for whatever reason the system cannot update
> > jiffies in this case?
> > 
> > I added this (probably whitespace damaged) hack and now I have not been
> > able to reproduce the problem.
> 
> Ah, nice catch. :)
> 
> > diff --git a/kernel/softirq.c b/kernel/softirq.c
> > index 14d7758..621ea3b 100644
> > --- a/kernel/softirq.c
> > +++ b/kernel/softirq.c
> > @@ -212,6 +212,7 @@ asmlinkage void __do_softirq(void)
> >         unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
> >         int cpu;
> >         unsigned long old_flags = current->flags;
> > +       unsigned long loops = 0;
> > 
> >         /*
> >          * Mask out PF_MEMALLOC s current task context is borrowed for the
> > @@ -241,6 +242,7 @@ restart:
> >                         unsigned int vec_nr = h - softirq_vec;
> >                         int prev_count = preempt_count();
> > 
> > +                       loops++;
> >                         kstat_incr_softirqs_this_cpu(vec_nr);
> > 
> >                         trace_softirq_entry(vec_nr);
> > @@ -265,7 +267,7 @@ restart:
> > 
> >         pending = local_softirq_pending();
> >         if (pending) {
> > -               if (time_before(jiffies, end) && !need_resched())
> > +               if (time_before(jiffies, end) && !need_resched() && (loops < 500))
> >                         goto restart;
> 
> So, softirq most likely kicked off from ath9k is rescheduling itself
> to the extent where it ends up locking out the CPU completely.  The
> problem is usually okay because the processing would break out in 2ms
> but as jiffies is stopped in this case with all other CPUs trapped in
> stop_machine, the loop never breaks and the machine hangs.  While
> adding the counter limit probably isn't a bad idea, softirq requeueing
> itself indefinitely sounds pretty buggy.
> 
> ath9k people, do you guys have any idea what's going on?  Why would
> softirq repeat itself indefinitely?
> 
> Ingo, Thomas, we're seeing a stop_machine hanging because
> 
> * All other CPUs entered IRQ disabled stage.  Jiffies is not being
>   updated.
> 
> * The last CPU get caught up executing softirq indefinitely.  As
>   jiffies doesn't get updated, it never breaks out of softirq
>   handling.  This is a deadlock.  This CPU won't break out of softirq
>   handling unless jiffies is updated and other CPUs can't do anything
>   until this CPU enters the same stop_machine stage.
> 
> Ben found out that breaking out of softirq handling after certain
> number of repetitions makes the issue go away, which isn't a proper
> fix but we might want anyway.  What do you guys think?
> 

Interesting....

Before 3.9 and commit c10d73671ad30f5469
("softirq: reduce latencies") we used to limit the __do_softirq() loop
to 10.




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06  1:34                     ` Eric Dumazet
@ 2013-06-06  3:14                       ` Tejun Heo
  2013-06-06  3:26                         ` Eric Dumazet
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2013-06-06  3:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ben Greear, Rusty Russell, Joe Lawrence,
	Linux Kernel Mailing List, stable, Luis R. Rodriguez,
	Jouni Malinen, Vasanthakumar Thiagarajan, Senthil Balasubramanian,
	linux-wireless, ath9k-devel, Thomas Gleixner, Ingo Molnar

Hello, Eric.

On Wed, Jun 05, 2013 at 06:34:52PM -0700, Eric Dumazet wrote:
> > Ingo, Thomas, we're seeing a stop_machine hanging because
> > 
> > * All other CPUs entered IRQ disabled stage.  Jiffies is not being
> >   updated.
> > 
> > * The last CPU get caught up executing softirq indefinitely.  As
> >   jiffies doesn't get updated, it never breaks out of softirq
> >   handling.  This is a deadlock.  This CPU won't break out of softirq
> >   handling unless jiffies is updated and other CPUs can't do anything
> >   until this CPU enters the same stop_machine stage.
> > 
> > Ben found out that breaking out of softirq handling after certain
> > number of repetitions makes the issue go away, which isn't a proper
> > fix but we might want anyway.  What do you guys think?
> > 
> 
> Interesting....
> 
> Before 3.9 and commit c10d73671ad30f5469
> ("softirq: reduce latencies") we used to limit the __do_softirq() loop
> to 10.

Ah, so, that's why it's showing up now.  We probably have had the same
issue all along but it used to be masked by the softirq limiting.  Do
you care to revive the 10 iterations limit so that it's limited by
both the count and timing?  We do wanna find out why softirq is
spinning indefinitely tho.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06  3:14                       ` Tejun Heo
@ 2013-06-06  3:26                         ` Eric Dumazet
  2013-06-06  3:41                           ` Ben Greear
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2013-06-06  3:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ben Greear, Rusty Russell, Joe Lawrence,
	Linux Kernel Mailing List, stable, Luis R. Rodriguez,
	Jouni Malinen, Vasanthakumar Thiagarajan, Senthil Balasubramanian,
	linux-wireless, ath9k-devel, Thomas Gleixner, Ingo Molnar

On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote:

> 
> Ah, so, that's why it's showing up now.  We probably have had the same
> issue all along but it used to be masked by the softirq limiting.  Do
> you care to revive the 10 iterations limit so that it's limited by
> both the count and timing?  We do wanna find out why softirq is
> spinning indefinitely tho.

Yes, no problem, I can do that.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06  3:26                         ` Eric Dumazet
@ 2013-06-06  3:41                           ` Ben Greear
  2013-06-06  3:46                             ` Eric Dumazet
  2013-06-06 20:55                             ` Tejun Heo
  0 siblings, 2 replies; 12+ messages in thread
From: Ben Greear @ 2013-06-06  3:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tejun Heo, Rusty Russell, Joe Lawrence, Linux Kernel Mailing List,
	stable, Luis R. Rodriguez, Jouni Malinen,
	Vasanthakumar Thiagarajan, Senthil Balasubramanian,
	linux-wireless, ath9k-devel, Thomas Gleixner, Ingo Molnar

On 06/05/2013 08:26 PM, Eric Dumazet wrote:
> On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote:
>
>>
>> Ah, so, that's why it's showing up now.  We probably have had the same
>> issue all along but it used to be masked by the softirq limiting.  Do
>> you care to revive the 10 iterations limit so that it's limited by
>> both the count and timing?  We do wanna find out why softirq is
>> spinning indefinitely tho.
>
> Yes, no problem, I can do that.

Limiting it to 5000 fixes my problem, so if you wanted it larger than 10, that would
be fine by me.

I can send a version of my patch easily enough if we can agree on the max number of
loops (and if indeed my version of the patch is acceptable).

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06  3:41                           ` Ben Greear
@ 2013-06-06  3:46                             ` Eric Dumazet
  2013-06-06  3:50                               ` Ben Greear
  2013-06-06 20:55                             ` Tejun Heo
  1 sibling, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2013-06-06  3:46 UTC (permalink / raw)
  To: Ben Greear
  Cc: Tejun Heo, Rusty Russell, Joe Lawrence, Linux Kernel Mailing List,
	stable, Luis R. Rodriguez, Jouni Malinen,
	Vasanthakumar Thiagarajan, Senthil Balasubramanian,
	linux-wireless, ath9k-devel, Thomas Gleixner, Ingo Molnar

On Wed, 2013-06-05 at 20:41 -0700, Ben Greear wrote:
> On 06/05/2013 08:26 PM, Eric Dumazet wrote:
> > On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote:
> >
> >>
> >> Ah, so, that's why it's showing up now.  We probably have had the same
> >> issue all along but it used to be masked by the softirq limiting.  Do
> >> you care to revive the 10 iterations limit so that it's limited by
> >> both the count and timing?  We do wanna find out why softirq is
> >> spinning indefinitely tho.
> >
> > Yes, no problem, I can do that.
> 
> Limiting it to 5000 fixes my problem, so if you wanted it larger than 10, that would
> be fine by me.
> 
> I can send a version of my patch easily enough if we can agree on the max number of
> loops (and if indeed my version of the patch is acceptable).

Well, 10 was the prior limit and seems really fine.

The non update on jiffies seems quite exceptional condition (I hope...)

We use in Google a patch triggering warning is a thread holds the cpu
without taking care to need_resched() for more than xx ms




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06  3:46                             ` Eric Dumazet
@ 2013-06-06  3:50                               ` Ben Greear
  2013-06-06  4:08                                 ` Eric Dumazet
  0 siblings, 1 reply; 12+ messages in thread
From: Ben Greear @ 2013-06-06  3:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tejun Heo, Rusty Russell, Joe Lawrence, Linux Kernel Mailing List,
	stable, Luis R. Rodriguez, Jouni Malinen,
	Vasanthakumar Thiagarajan, Senthil Balasubramanian,
	linux-wireless, ath9k-devel, Thomas Gleixner, Ingo Molnar

On 06/05/2013 08:46 PM, Eric Dumazet wrote:
> On Wed, 2013-06-05 at 20:41 -0700, Ben Greear wrote:
>> On 06/05/2013 08:26 PM, Eric Dumazet wrote:
>>> On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote:
>>>
>>>>
>>>> Ah, so, that's why it's showing up now.  We probably have had the same
>>>> issue all along but it used to be masked by the softirq limiting.  Do
>>>> you care to revive the 10 iterations limit so that it's limited by
>>>> both the count and timing?  We do wanna find out why softirq is
>>>> spinning indefinitely tho.
>>>
>>> Yes, no problem, I can do that.
>>
>> Limiting it to 5000 fixes my problem, so if you wanted it larger than 10, that would
>> be fine by me.
>>
>> I can send a version of my patch easily enough if we can agree on the max number of
>> loops (and if indeed my version of the patch is acceptable).
>
> Well, 10 was the prior limit and seems really fine.
>
> The non update on jiffies seems quite exceptional condition (I hope...)
>
> We use in Google a patch triggering warning is a thread holds the cpu
> without taking care to need_resched() for more than xx ms

Well, I'm sure that patch works nicely until the clock stops moving
forward :)

I'll post a patch with limit of 10 shortly.

Thanks,
Ben



-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06  3:50                               ` Ben Greear
@ 2013-06-06  4:08                                 ` Eric Dumazet
  0 siblings, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2013-06-06  4:08 UTC (permalink / raw)
  To: Ben Greear
  Cc: Tejun Heo, Rusty Russell, Joe Lawrence, Linux Kernel Mailing List,
	stable, Luis R. Rodriguez, Jouni Malinen,
	Vasanthakumar Thiagarajan, Senthil Balasubramanian,
	linux-wireless, ath9k-devel, Thomas Gleixner, Ingo Molnar

On Wed, 2013-06-05 at 20:50 -0700, Ben Greear wrote:
> On 06/05/2013 08:46 PM, Eric Dumazet wrote:
> >
> > We use in Google a patch triggering warning is a thread holds the cpu
> > without taking care to need_resched() for more than xx ms
> 
> Well, I'm sure that patch works nicely until the clock stops moving
> forward :)
> 

This is not using jiffies, but the clock used in kernel/sched/core.c,
with ns resolution ;)

> I'll post a patch with limit of 10 shortly.

ok



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06  3:41                           ` Ben Greear
  2013-06-06  3:46                             ` Eric Dumazet
@ 2013-06-06 20:55                             ` Tejun Heo
  2013-06-06 21:15                               ` Ben Greear
  1 sibling, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2013-06-06 20:55 UTC (permalink / raw)
  To: Ben Greear
  Cc: Eric Dumazet, Rusty Russell, Joe Lawrence,
	Linux Kernel Mailing List, stable, Luis R. Rodriguez,
	Jouni Malinen, Vasanthakumar Thiagarajan, Senthil Balasubramanian,
	linux-wireless, ath9k-devel, Thomas Gleixner, Ingo Molnar

Hello, Ben.

On Wed, Jun 05, 2013 at 08:41:01PM -0700, Ben Greear wrote:
> On 06/05/2013 08:26 PM, Eric Dumazet wrote:
> >On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote:
> >>Ah, so, that's why it's showing up now.  We probably have had the same
> >>issue all along but it used to be masked by the softirq limiting.  Do
> >>you care to revive the 10 iterations limit so that it's limited by
> >>both the count and timing?  We do wanna find out why softirq is
> >>spinning indefinitely tho.
> >
> >Yes, no problem, I can do that.
> 
> Limiting it to 5000 fixes my problem, so if you wanted it larger than 10, that would
> be fine by me.

First of all, kudos for tracking the issue down.  While the removal of
looping limit in softirq handling was the direct cause for making the
problem visible, it's very bothering that we have softirq runaway.
Finding out the perpetrator shouldn't be hard.  Something like the
following should work (untested).  Once we know which softirq (prolly
the network one), we can dig deeper.

Thanks.

diff --git a/kernel/softirq.c b/kernel/softirq.c
index b5197dc..5af3682 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -212,6 +212,7 @@ asmlinkage void __do_softirq(void)
 	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
 	int cpu;
 	unsigned long old_flags = current->flags;
+	int cnt = 0;
 
 	/*
 	 * Mask out PF_MEMALLOC s current task context is borrowed for the
@@ -244,6 +245,9 @@ restart:
 			kstat_incr_softirqs_this_cpu(vec_nr);
 
 			trace_softirq_entry(vec_nr);
+			if (++cnt >= 5000 && cnt < 5010)
+				printk("XXX __do_softirq: stuck handling softirqs, cnt=%d action=%pf\n",
+				       cnt, h->action);
 			h->action(h);
 			trace_softirq_exit(vec_nr);
 			if (unlikely(prev_count != preempt_count())) {


-- 
tejun

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06 20:55                             ` Tejun Heo
@ 2013-06-06 21:15                               ` Ben Greear
  2013-06-06 21:17                                 ` Tejun Heo
  0 siblings, 1 reply; 12+ messages in thread
From: Ben Greear @ 2013-06-06 21:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Eric Dumazet, Rusty Russell, Joe Lawrence,
	Linux Kernel Mailing List, stable, Luis R. Rodriguez,
	Jouni Malinen, Vasanthakumar Thiagarajan, Senthil Balasubramanian,
	linux-wireless, ath9k-devel, Thomas Gleixner, Ingo Molnar

On 06/06/2013 01:55 PM, Tejun Heo wrote:
> Hello, Ben.
>
> On Wed, Jun 05, 2013 at 08:41:01PM -0700, Ben Greear wrote:
>> On 06/05/2013 08:26 PM, Eric Dumazet wrote:
>>> On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote:
>>>> Ah, so, that's why it's showing up now.  We probably have had the same
>>>> issue all along but it used to be masked by the softirq limiting.  Do
>>>> you care to revive the 10 iterations limit so that it's limited by
>>>> both the count and timing?  We do wanna find out why softirq is
>>>> spinning indefinitely tho.
>>>
>>> Yes, no problem, I can do that.
>>
>> Limiting it to 5000 fixes my problem, so if you wanted it larger than 10, that would
>> be fine by me.
>
> First of all, kudos for tracking the issue down.  While the removal of
> looping limit in softirq handling was the direct cause for making the
> problem visible, it's very bothering that we have softirq runaway.
> Finding out the perpetrator shouldn't be hard.  Something like the
> following should work (untested).  Once we know which softirq (prolly
> the network one), we can dig deeper.

The patch below assumes my fix is not in the code, right?

I'll work on this, but it will probably be next week before
I have time...gotta catch up on some other things first.

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: stop_machine lockup issue in 3.9.y.
  2013-06-06 21:15                               ` Ben Greear
@ 2013-06-06 21:17                                 ` Tejun Heo
  0 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2013-06-06 21:17 UTC (permalink / raw)
  To: Ben Greear
  Cc: Eric Dumazet, Rusty Russell, Joe Lawrence,
	Linux Kernel Mailing List, stable, Luis R. Rodriguez,
	Jouni Malinen, Vasanthakumar Thiagarajan, Senthil Balasubramanian,
	linux-wireless, ath9k-devel, Thomas Gleixner, Ingo Molnar

On Thu, Jun 06, 2013 at 02:15:40PM -0700, Ben Greear wrote:
> >First of all, kudos for tracking the issue down.  While the removal of
> >looping limit in softirq handling was the direct cause for making the
> >problem visible, it's very bothering that we have softirq runaway.
> >Finding out the perpetrator shouldn't be hard.  Something like the
> >following should work (untested).  Once we know which softirq (prolly
> >the network one), we can dig deeper.
> 
> The patch below assumes my fix is not in the code, right?

Yeap.

> I'll work on this, but it will probably be next week before
> I have time...gotta catch up on some other things first.

Thanks a lot for hunting this down!

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2013-06-06 21:17 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20130604100744.7cdf8777@jlaw-desktop.mno.stratus.com>
     [not found] ` <51AE1B81.20900@candelatech.com>
     [not found]   ` <51AE27D5.7050202@candelatech.com>
     [not found]     ` <87sj0xry1k.fsf@rustcorp.com.au>
     [not found]       ` <20130605071539.GA3429@mtj.dyndns.org>
     [not found]         ` <51AF6E54.3050108@candelatech.com>
     [not found]           ` <20130605184807.GD10693@mtj.dyndns.org>
     [not found]             ` <51AF8D4B.4090407@candelatech.com>
     [not found]               ` <51AF91F5.6090801@candelatech.com>
     [not found]                 ` <51AFA677.9010605@candelatech.com>
2013-06-05 21:11                   ` stop_machine lockup issue in 3.9.y Tejun Heo
2013-06-05 21:33                     ` Ben Greear
2013-06-06  1:34                     ` Eric Dumazet
2013-06-06  3:14                       ` Tejun Heo
2013-06-06  3:26                         ` Eric Dumazet
2013-06-06  3:41                           ` Ben Greear
2013-06-06  3:46                             ` Eric Dumazet
2013-06-06  3:50                               ` Ben Greear
2013-06-06  4:08                                 ` Eric Dumazet
2013-06-06 20:55                             ` Tejun Heo
2013-06-06 21:15                               ` Ben Greear
2013-06-06 21:17                                 ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox