linux-scsi.vger.kernel.org archive mirror
* [Bug 11646] New: QLA2xxx: Kernel deadlock on high load somewhere after 2.6.20
@ 2008-09-25 13:55 bugme-daemon
From: bugme-daemon @ 2008-09-25 13:55 UTC (permalink / raw)
  To: linux-scsi

http://bugzilla.kernel.org/show_bug.cgi?id=11646

           Summary: QLA2xxx: Kernel deadlock on high load somewhere after
                    2.6.20
           Product: IO/Storage
           Version: 2.5
     KernelVersion: 2.6.26.5
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: high
          Priority: P1
         Component: SCSI
        AssignedTo: linux-scsi@vger.kernel.org
        ReportedBy: grin@grin.hu


Latest working kernel version: 2.6.20
Earliest failing kernel version: known 2.6.24
Distribution: Debian stable
Hardware Environment: QLogic Corp. QLA2422 Fibre Channel Adapter (rev 02), IBM
(intel based) HS21 blade server, external SAN storage [IBM DS4200], optional
full multipath (happens with or without), further details on specified requests
Software Environment: multipathd handling dm devices, lvm2, xfs

Problem Description:
The machines go dead under heavy IO load. "Go dead" may mean rare complete
crashes, more often infinite resource wait states, or stuck udev threads all
over.

The diagnosis was pretty hard: many components were checked, and finally it
boiled down to the qla2xxx driver.
It seems that somewhere after 2.6.20 the driver developed a problem with high
loads, where it:
- starts to see (or generate) link downs without reason
- tries to handle these by logging thousands of "try to dump firmware"
messages, while
- somehow screwing up IRQ handling, because more often than not even eth0
starts complaining about transmit timeouts, and the kernel often says "..no IRQ
handler for vector"
- never recovers. I've seen many messages like:
== mailbox command timeout
== performing isp recovery
== loop up 4gbps
== SNS scan failed - assuming zero entry result
== scsi: abort command issued ...
then often
== FC remote port time out
== SCSI DEVICE RESET ISSUED
and it sometimes ends with a stack trace and the happy message
== RIP 0x10

The diagnosis is hard because I cannot easily force a crash: even bonnie++
survives multiple runs without problems, but a busy postgres can usually crash
it in a few hours.

After we changed and upgraded almost everything in both paths and nothing
helped (including a kernel upgrade to the latest official one), I went back to
2.6.20 and the problem disappeared. It is not easy to tell when it was broken,
because I cannot just start playing with live servers and I cannot make it
crash on a test server. But if you have any tests which should crash it, I can
try them on a different (testing) machine.


Steps to reproduce: I wish I knew. Loads of IO in an unknown pattern make it
die in a few hours, or days.

I can provide any info you ask for that I'm able to pry out of the machines:
kernel [logs], etc. For most crashes I only have screenshots of the remote
console, since the crash killed all disk IO.




