public inbox for stable@vger.kernel.org
From: Ben Greear <greearb@candelatech.com>
To: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>,
	Joe Lawrence <joe.lawrence@stratus.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	stable@vger.kernel.org
Subject: Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
Date: Wed, 05 Jun 2013 12:11:07 -0700
Message-ID: <51AF8D4B.4090407@candelatech.com>
In-Reply-To: <20130605184807.GD10693@mtj.dyndns.org>

On 06/05/2013 11:48 AM, Tejun Heo wrote:
> Hello, Ben.
>
> On Wed, Jun 05, 2013 at 09:59:00AM -0700, Ben Greear wrote:
>> One pattern I notice repeating for at least most of the hangs is that all but one
>> CPU thread has irqs disabled and is in state 2.  But, there will be one thread
>> in state 1 that still has IRQs enabled and it is reported to be in soft-lockup
>> instead of hard-lockup.  In 'sysrq l' it always shows some IRQ processing,
>> but typically that of the sysrq itself.  I added printk that would always
>> print if the thread notices that smdata->state != curstate, and the soft-lockup
>> thread (cpu 2 below) never shows that message.
>
> It sounds like one of the CPUs gets live-locked by IRQs.  I can't tell
> why the situation is made worse by other CPUs being tied up.  Do you
> ever see CPUs being live locked by IRQs during normal operation?

No, I have not noticed any live locks aside from this, at least in
the 3.9 kernels.

>> I thought it might be because it was reading stale smdata->state, so I changed
>> that to atomic_t hoping that would mitigate that.  I also tried adding smp_rmb()
>> below the cpu_relax().  Neither had any effect, so I am left assuming that the
>
> I looked at the code again and the memory accesses seem properly
> interlocked.  It's a bit tricky and should probably have used spinlock
> instead considering it's already a hugely expensive path anyway, but
> it does seem correct to me.
>
>> thread instead is stuck handling IRQs and never gets out of the IRQ handler.
>
> Seems that way to me too.
>
>> Maybe since I have 2 real cores, and 3 processes busy-spinning on their CPU cores,
>> the remaining process can just never handle all the IRQs and get back to the
>> cpu shutdown state machine?  The various soft-hang stacks below show at least slightly
>> different stacks, so I assume that thread is doing at least something.
>
> What's the source of all those IRQs tho?  I don't think the IRQs are
> from actual events.  The system is quiesced.  Even if it's from
> receiving packets, it's gonna quiet down pretty quickly.  The hang
> doesn't go away if you disconnect the network cable while hung, right?
>
> What could be happening is that IRQ handling is handled by a thread
> but the IRQ handler itself doesn't clear the IRQ properly and depends
> on the handling thread to clear the condition.  If no CPU is available
> for scheduling, it might end up raising and re-raising IRQs for the
> same condition without ever being handled.  If that's the case, such
> lockup could happen on a normally functioning UP machine or if the IRQ
> is pinned to a single CPU which happens to be running the handling
> thread.  At any rate, it'd be a plain live-lock bug on the driver
> side.
>
> Can you please try to confirm the specific interrupt being
> continuously raised?  Detecting the hang shouldn't be too difficult.
> Just recording the starting jiffies and if progress hasn't been made
> for, say, ten seconds, it can set a flag and then print the IRQs being
> handled if the flag is set.  If it indeed is the ath device, we
> probably wanna get the driver maintainer involved.

I am not sure how to tell which IRQ is being handled.  Do the
stack traces (showing smp_apic_timer_interrupt, for instance)
indicate potential culprits, or is that more a symptom of just
when the soft-lockup check is called?


Where should I add code to print out irqs?  In the lockup state,
the thread (probably) stuck handling irqs isn't executing any code in
the stop_machine file as far as I can tell.

Maybe I need to instrument __do_softirq or a similar function?

For what it's worth, previous debugging appears to show that jiffies
stops incrementing in many of these lockups.
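
For reference, the detection idea quoted above (record a start time, set a
flag once no progress has been made for about ten seconds, and print the
IRQs being handled only while the flag is set) can be sketched in plain
user-space C.  This is an illustration under stated assumptions, not kernel
code: the names stall_detector, stall_progress, stall_check and
stall_note_irq, and the constants STALL_TICKS and MAX_IRQ, are all
hypothetical.  An in-kernel version would presumably use printk and some
kernel time source in place of the tick argument; since jiffies appears to
stop incrementing in many of these lockups, which time source would work is
left open here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define STALL_TICKS 10   /* hypothetical threshold, standing in for "ten seconds" */
#define MAX_IRQ     256  /* hypothetical size of the per-IRQ counter table */

struct stall_detector {
	unsigned long start;          /* tick at which progress was last seen */
	bool stalled;                 /* set once STALL_TICKS pass with no progress */
	unsigned long hits[MAX_IRQ];  /* per-IRQ counts, recorded only while stalled */
};

/* Call whenever the waiting loop observes progress (e.g. a state change). */
static void stall_progress(struct stall_detector *d, unsigned long now)
{
	d->start = now;
	d->stalled = false;
}

/* Call from the waiting loop; returns true once the stall flag is set. */
static bool stall_check(struct stall_detector *d, unsigned long now)
{
	/* Unsigned subtraction stays correct if the tick counter wraps. */
	if (!d->stalled && now - d->start >= STALL_TICKS)
		d->stalled = true;
	return d->stalled;
}

/* Call at IRQ entry; counts and prints the IRQ only while stalled. */
static bool stall_note_irq(struct stall_detector *d, unsigned int irq)
{
	assert(irq < MAX_IRQ);
	if (!d->stalled)
		return false;
	d->hits[irq]++;
	printf("stalled: handling irq %u (seen %lu times)\n", irq, d->hits[irq]);
	return true;
}
```

The flag keeps the IRQ path quiet during normal operation, so the output
only starts once the wait has already gone on too long.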

Also, I have been trying for 20+ minutes to reproduce the lockup
with the ath9k module removed (and my user-space app that uses it
stopped), and I have not reproduced it yet.  So, possibly it is
related to ath9k, but my user-space app pokes at lots of other
stuff and starts loads of dhcp client processes and such too,
so not sure yet.


Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


