public inbox for stable@vger.kernel.org
From: Tejun Heo <tj@kernel.org>
To: Ben Greear <greearb@candelatech.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>,
	Joe Lawrence <joe.lawrence@stratus.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	stable@vger.kernel.org
Subject: Re: Please add to stable:  module: don't unlink the module until we've removed all exposure.
Date: Wed, 5 Jun 2013 11:48:07 -0700	[thread overview]
Message-ID: <20130605184807.GD10693@mtj.dyndns.org> (raw)
In-Reply-To: <51AF6E54.3050108@candelatech.com>

Hello, Ben.

On Wed, Jun 05, 2013 at 09:59:00AM -0700, Ben Greear wrote:
> One pattern I notice repeating for at least most of the hangs is that all but one
> CPU thread has irqs disabled and is in state 2.  But, there will be one thread
> in state 1 that still has IRQs enabled and it is reported to be in soft-lockup
> instead of hard-lockup.  In 'sysrq l' it always shows some IRQ processing,
> but typically that of the sysrq itself.  I added printk that would always
> print if the thread notices that smdata->state != curstate, and the soft-lockup
> thread (cpu 2 below) never shows that message.

It sounds like one of the CPUs gets live-locked by IRQs.  I can't tell
why the situation is made worse by the other CPUs being tied up.  Do you
ever see CPUs being live-locked by IRQs during normal operation?

> I thought it might be because it was reading stale smdata->state, so I changed
> that to atomic_t hoping that would mitigate that.  I also tried adding smp_rmb()
> below the cpu_relax().  Neither had any effect, so I am left assuming that the

I looked at the code again and the memory accesses seem properly
interlocked.  It's a bit tricky and should probably have used a
spinlock instead, considering it's already a hugely expensive path
anyway, but it does seem correct to me.

> thread instead is stuck handling IRQs and never gets out of the IRQ handler.

Seems that way to me too.

> Maybe since I have 2 real cores, and 3 processes busy-spinning on their CPU cores,
> the remaining process can just never handle all the IRQs and get back to the
> cpu shutdown state machine?  The various soft-hang stacks below show at least slightly
> different stacks, so I assume that thread is doing at least something.

What's the source of all those IRQs tho?  I don't think the IRQs are
from actual events.  The system is quiesced.  Even if it's from
receiving packets, it's gonna quiet down pretty quickly.  The hang
doesn't go away if you disconnect the network cable while hung, right?

What could be happening is that the IRQ is handled by a thread, but
the hard IRQ handler itself doesn't clear the interrupt and depends
on the handling thread to clear the condition.  If no CPU is available
for scheduling, the device might end up raising and re-raising the
same IRQ without it ever being handled.  If that's the case, such a
lockup could happen on a normally functioning UP machine or if the IRQ
is pinned to a single CPU which happens to be running the handling
thread.  At any rate, it'd be a plain live-lock bug on the driver
side.

Can you please try to confirm which specific interrupt is being
continuously raised?  Detecting the hang shouldn't be too difficult.
Just record the starting jiffies, and if progress hasn't been made
for, say, ten seconds, set a flag and then print the IRQs being
handled while the flag is set.  If it indeed is the ath device, we
probably wanna get the driver maintainer involved.

Thanks.

-- 
tejun


Thread overview: 31+ messages
     [not found] <51A8E884.1080009@candelatech.com>
     [not found] ` <87ehclumhr.fsf@rustcorp.com.au>
     [not found]   ` <alpine.DEB.2.02.1306022212200.10003@jlaw-desktop.mno.stratus.com>
     [not found]     ` <alpine.DEB.2.02.1306030724230.19647@jlaw-desktop.mno.stratus.com>
2013-06-03 14:17       ` Please add to stable: module: don't unlink the module until we've removed all exposure Joe Lawrence
2013-06-03 15:59         ` Ben Greear
2013-06-03 16:36           ` Ben Greear
2013-06-04  4:37             ` Rusty Russell
2013-06-04  5:56             ` Rusty Russell
2013-06-04 14:07               ` Joe Lawrence
2013-06-04 16:50                 ` Joe Lawrence
2013-06-04 16:53                 ` Ben Greear
2013-06-04 17:45                   ` Ben Greear
2013-06-05  4:17                     ` Rusty Russell
2013-06-05  7:15                       ` Tejun Heo
2013-06-05 16:59                         ` Ben Greear
2013-06-05 18:48                           ` Tejun Heo [this message]
2013-06-05 19:11                             ` Ben Greear
2013-06-05 19:31                               ` stop_machine lockup issue in 3.9.y Ben Greear
2013-06-05 20:58                                 ` Ben Greear
2013-06-05 21:11                                   ` Tejun Heo
2013-06-05 21:33                                     ` Ben Greear
2013-06-06  1:34                                     ` Eric Dumazet
2013-06-06  3:14                                       ` Tejun Heo
2013-06-06  3:26                                         ` Eric Dumazet
2013-06-06  3:41                                           ` Ben Greear
2013-06-06  3:46                                             ` Eric Dumazet
2013-06-06  3:50                                               ` Ben Greear
2013-06-06  4:08                                                 ` Eric Dumazet
2013-06-06 20:55                                             ` Tejun Heo
2013-06-06 21:15                                               ` Ben Greear
2013-06-06 21:17                                                 ` Tejun Heo
2013-06-05  3:29                 ` Please add to stable: module: don't unlink the module until we've removed all exposure Rusty Russell
2013-06-05  5:07         ` Greg KH
2013-06-05  7:13           ` Rusty Russell
