Re: Please add to stable: module: don't unlink the module until we've removed all exposure.

From: Ben Greear <greearb@candelatech.com>
To: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>,
	Joe Lawrence <joe.lawrence@stratus.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	stable@vger.kernel.org
Subject: Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
Date: Wed, 05 Jun 2013 12:11:07 -0700
Message-ID: <51AF8D4B.4090407@candelatech.com>
In-Reply-To: <20130605184807.GD10693@mtj.dyndns.org>

On 06/05/2013 11:48 AM, Tejun Heo wrote:
> Hello, Ben.
>
> On Wed, Jun 05, 2013 at 09:59:00AM -0700, Ben Greear wrote:
>> One pattern I notice repeating for at least most of the hangs is that all but one
>> CPU thread has irqs disabled and is in state 2.  But, there will be one thread
>> in state 1 that still has IRQs enabled and it is reported to be in soft-lockup
>> instead of hard-lockup.  In 'sysrq l' it always shows some IRQ processing,
>> but typically that of the sysrq itself.  I added printk that would always
>> print if the thread notices that smdata->state != curstate, and the soft-lockup
>> thread (cpu 2 below) never shows that message.
>
> It sounds like one of the cpus gets live-locked by IRQs.  I can't tell
> why the situation is made worse by other CPUs being tied up.  Do you
> ever see CPUs being live locked by IRQs during normal operation?

No, I have not noticed any live locks aside from this, at least in
the 3.9 kernels.

>> I thought it might be because it was reading stale smdata->state, so I changed
>> that to atomic_t hoping that would mitigate that.  I also tried adding smp_rmb()
>> below the cpu_relax().  Neither had any effect, so I am left assuming that the
>
> I looked at the code again and the memory accesses seem properly
> interlocked.  It's a bit tricky and should probably have used spinlock
> instead considering it's already a hugely expensive path anyway, but
> it does seem correct to me.
>
>> thread instead is stuck handling IRQs and never gets out of the IRQ handler.
>
> Seems that way to me too.
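
For reference, the handshake being discussed can be modelled in user
space roughly as below.  This is only a sketch under assumptions, not
the kernel's stop_machine code: the names sm_data, sm_set_state and
sm_ack_state are hypothetical, and the state numbering is chosen to
match the "state 1" / "state 2" values mentioned in this thread.  It
shows why a single thread that never acks keeps everyone else spinning:
the shared state only advances when the last ack arrives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical state numbering (assumed, not copied from the kernel). */
enum sm_state { SM_NONE, SM_PREPARE, SM_DISABLE_IRQ, SM_RUN, SM_EXIT };

struct sm_data {
	atomic_int state;       /* step that every thread must reach */
	atomic_int thread_ack;  /* threads that still have to ack this step */
	int num_threads;
};

static void sm_set_state(struct sm_data *d, int newstate)
{
	assert(d != NULL);
	/* Re-arm the ack counter before publishing the new state. */
	atomic_store(&d->thread_ack, d->num_threads);
	atomic_store(&d->state, newstate);
}

/* Called by each thread once it has done the work for the current step. */
static void sm_ack_state(struct sm_data *d)
{
	assert(d != NULL);
	/* fetch_sub returns the old value: 1 means this was the last ack. */
	if (atomic_fetch_sub(&d->thread_ack, 1) == 1)
		sm_set_state(d, atomic_load(&d->state) + 1);
}
```

With four threads, three acks leave the state where it is; only the
fourth ack moves it forward, so the three early threads keep spinning
with IRQs off until the straggler catches up.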
>
>> Maybe since I have 2 real cores, and 3 processes busy-spinning on their CPU cores,
>> the remaining process can just never handle all the IRQs and get back to the
>> cpu shutdown state machine?  The various soft-hang stacks below show at least slightly
>> different stacks, so I assume that thread is doing at least something.
>
> What's the source of all those IRQs tho?  I don't think the IRQs are
> from actual events.  The system is quiesced.  Even if it's from
> receiving packets, it's gonna quiet down pretty quickly.  The hang
> doesn't go away if you disconnect the network cable while hung, right?
>
> What could be happening is that IRQ handling is handled by a thread
> but the IRQ handler itself doesn't clear the IRQ properly and depends
> on the handling thread to clear the condition.  If no CPU is available
> for scheduling, it might end up raising and re-raising IRQs for the
> same condition without ever being handled.  If that's the case, such
> lockup could happen on a normally functioning UP machine or if the IRQ
> is pinned to a single CPU which happens to be running the handling
> thread.  At any rate, it'd be a plain live-lock bug on the driver
> side.
>
> Can you please try to confirm the specific interrupt being
> continuously raised?  Detecting the hang shouldn't be too difficult.
> Just recording the starting jiffies and if progress hasn't been made
> for, say, ten seconds, it can set a flag and then print the IRQs being
> handled if the flag is set.  If it indeed is the ath device, we
> probably wanna get the driver maintainer involved.
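
A minimal user-space sketch of the check suggested above, assuming a
ten-second threshold at 1000 ticks per second.  All names here
(stall_detector, stall_progress, stall_check, stall_note_irq, irq_hits)
are hypothetical and nothing below is existing kernel code.  The tick
value is passed in by the caller, so the sketch does not assume any
particular clock source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define STALL_TICKS 10000UL  /* assumed: ten seconds at 1000 ticks/sec */
#define MAX_IRQ     256      /* assumed table size for this sketch */

struct stall_detector {
	unsigned long start;  /* tick count when progress was last made */
	bool stalled;         /* set once STALL_TICKS pass without progress */
};

static unsigned long irq_hits[MAX_IRQ];  /* per-IRQ counts while stalled */

/* Call whenever the state machine advances. */
static void stall_progress(struct stall_detector *sd, unsigned long now)
{
	assert(sd != NULL);
	sd->start = now;
	sd->stalled = false;
}

/* Call from the spin loop; unsigned subtraction survives tick wraparound. */
static bool stall_check(struct stall_detector *sd, unsigned long now)
{
	assert(sd != NULL);
	if (now - sd->start >= STALL_TICKS)
		sd->stalled = true;
	return sd->stalled;
}

/* Call on IRQ entry; only records anything once the flag is set. */
static void stall_note_irq(const struct stall_detector *sd, unsigned int irq)
{
	assert(sd != NULL);
	if (sd->stalled && irq < MAX_IRQ)
		irq_hits[irq]++;
}
```

Counting per-IRQ hits rather than printing on every interrupt keeps the
recording path cheap; the counts can be dumped once after the flag has
been set for a while.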

I am not sure how to tell which IRQ is being handled.  Do the
stack traces (showing smp_apic_timer_interrupt, for instance)
indicate potential culprits, or is that more a symptom of just
when the soft-lockup check is called?


Where should I add code to print out irqs?  In the lockup state,
the thread (probably) stuck handling irqs isn't executing any code in
the stop_machine file as far as I can tell.

Maybe I need to instrument __do_softirq or a similar function?

For what it's worth, previous debugging appears to show that jiffies
stops incrementing in many of these lockups.

Also, I have been trying for 20+ minutes to reproduce the lockup
with the ath9k module removed (and my user-space app that uses it
stopped), and I have not reproduced it yet.  So, possibly it is
related to ath9k, but my user-space app pokes at lots of other
stuff and starts loads of dhcp client processes and such too,
so not sure yet.


Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
