linux-scsi.vger.kernel.org archive mirror
From: bugme-daemon@bugzilla.kernel.org
To: linux-scsi@vger.kernel.org
Subject: [Bug 11646] New: QLA2xxx: Kernel deadlock on high load somewhere after 2.6.20
Date: Thu, 25 Sep 2008 06:55:18 -0700 (PDT)
Message-ID: <bug-11646-11613@http.bugzilla.kernel.org/>

http://bugzilla.kernel.org/show_bug.cgi?id=11646

           Summary: QLA2xxx: Kernel deadlock on high load somewhere after
                    2.6.20
           Product: IO/Storage
           Version: 2.5
     KernelVersion: 2.6.26.5
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: high
          Priority: P1
         Component: SCSI
        AssignedTo: linux-scsi@vger.kernel.org
        ReportedBy: grin@grin.hu


Latest working kernel version: 2.6.20
Earliest failing kernel version: known 2.6.24
Distribution: Debian stable
Hardware Environment: QLogic Corp. QLA2422 Fibre Channel Adapter (rev 02), IBM
(Intel-based) HS21 blade server, external SAN storage [IBM DS4200], optional
full multipath (happens with or without); further details available on request
Software Environment: multipathd handling dm devices, lvm2, xfs

Problem Description:
The machines go dead under heavy IO load. "Go dead" may mean a complete crash
(rare), more often processes waiting indefinitely on resources, or stuck udev
threads all over.

The diagnosis was pretty hard: many components were checked and it finally
boiled down to the qla2xxx driver.
It seems that somewhere after 2.6.20 the driver developed a problem with high
loads, where it:
- starts to see (or generate) link downs without reason
- tries to handle these, logging thousands of "try to dump firmware"
messages, while
- somehow screwing up IRQ handling, because more often than not even eth0
starts complaining about transmit timeouts, and the kernel often says "..no IRQ
handler for vector"
- never recovers. I've seen many messages like:
== mailbox command timeout
== performing isp recovery
== loop up 4gbps
== SNS scan failed - assuming zero entry result
== scsi: abort command issued ...
then often
== FC repot port time out
== SCSI DEVICE RESET ISSUED
and it sometimes ends with a stack trace and the happy message
== RIP 0x10
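
For anyone triaging a saved kernel log for the same symptoms, a minimal sketch
for tallying the messages listed above might look like the following. The
pattern strings are paraphrased from this report and are an assumption, not
exact driver output, and count_symptoms is a hypothetical helper name.

```python
from collections import Counter

# Substrings paraphrased from the report; the real qla2xxx / SCSI log
# wording may differ, so adjust these before relying on the counts.
PATTERNS = [
    "mailbox command timeout",
    "performing isp recovery",
    "loop up",
    "sns scan failed",
    "abort command issued",
    "device reset issued",
    "no irq handler for vector",
]


def count_symptoms(lines):
    """Count how many log lines contain each pattern (case-insensitive)."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for pattern in PATTERNS:
            if pattern in lowered:
                counts[pattern] += 1
    return counts
```

Passing an open log file (for example a saved dmesg or remote console capture)
to count_symptoms gives a rough per-message count, which may help compare a
failing kernel against 2.6.20.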

Diagnosis is hard because I cannot easily force a crash: even bonnie++
survives multiple runs without problems, but a busy postgres can usually crash
it in a few hours.

After we changed and upgraded almost everything in both paths and nothing
helped (including a kernel upgrade to the latest official one), I went back to
2.6.20 and the problem disappeared. It is not easy to tell when it broke,
because I cannot experiment on live servers and I cannot make it crash on a
test server. But if you have any tests that should crash it, I can try them
on a different (testing) machine.


Steps to reproduce: I wish I knew. Loads of IO in an unknown pattern make it
die in a few hours, or days.

I can provide any info you ask for that I'm able to pry out of the machines:
kernel logs, etc. For most crashes I only have screenshots of the remote
console, since the crash killed all disk IO.


-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
