Linux SCSI subsystem development
* What's the "right" qla2100 driver?
@ 2004-01-22  0:32 Poul Petersen
  2004-01-22  5:22 ` Andrew Vasquez
  0 siblings, 1 reply; 6+ messages in thread
From: Poul Petersen @ 2004-01-22  0:32 UTC (permalink / raw)
  To: linux-scsi

	I've got a Qlogic qla2100 card connected via copper to an array
of 28 ~18GB disks. I'm trying to create two raid level 5 md devices with
13disk+1spare each. What I'm experiencing is a lot of hangs and strange
disk failures that appear to be related to the drivers for the 2100. As
there seem to be many different drivers to choose from for the qla2100,
I've tried them all and collected my (all bad) experiences, a summary of
which I have attached below.

	What I am wondering is if anyone else is using this card with Linux
and if so what driver are they using? More importantly, which driver is
actively being worked on (if any) so that I might contribute some failure
information? 

	It's possible that I am experiencing a hardware problem, since these
disks and controller are a few years old now, but I doubt it since most of
the errors seem inconsistent with bad hardware (all disks failing, etc). If
nothing else, if I get feedback saying that someone else is successfully
using a certain driver, then I can start playing with hardware using the
same driver and maybe get to the bottom of this...

Many thanks for any help,

-poul

Oh, machine specs:

Dell Optiplex GX110
PIII-1GHz w/ 260MB RAM
RedHat 9


Summary of test results:

---
Test #1
kernel 2.4.24 with kernel qlogicfc driver
---

# modprobe qlogicfc
# mkraid /dev/md0

At this point the system hangs with:

qlogicfc0: no handle slots, this should not happen
hostdata->queued is 58, in_ptr: 77

---
Test #2
kernel 2.4.24 with Qlogic 4.46.12b
---

I am unable to find a recent driver on the Qlogic site as support for
this card has been discontinued. I just happened to have this 4.x
series driver around.

# modprobe qla2x00
# mkraid /dev/md0
# mkraid /dev/md1

# cat /proc/mdstat | grep speed

      [=>...................]  resync =  5.1% (904396/17671424) finish=7654.5min speed=35K/sec
      [====>................]  resync = 22.1% (3919392/17671424) finish=34.8min speed=6577K/sec

The driver seems to work, but its ability to balance requests between the
two raid sets seems very poor. Watching the activity lights reflects the
stats shown above: the second raid set blinks once or twice, then hangs for
5~10 seconds, blinks, hangs, etc. I haven't tried running with this driver
much longer than the mkraid...
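
The "finish" figures in the mdstat output above can be sanity-checked from the block counters and the speed on the same line. Here is a minimal sketch, assuming the counters are in 1K blocks and the speed is in K/sec; eta_minutes is a made-up helper name, not part of any md tool:

```shell
# Estimate minutes remaining from one mdstat resync line.
#   $1 = blocks done, $2 = total blocks, $3 = speed in K/sec
# Assumption: counters are 1K blocks, so blocks / (K/sec) = seconds.
eta_minutes() {
    awk -v d="$1" -v t="$2" -v s="$3" \
        'BEGIN { if (s <= 0) exit 1; printf "%.1f\n", (t - d) / s / 60 }'
}

# Second raid set from the output above
eta_minutes 3919392 17671424 6577    # prints 34.8
```

The second line reproduces its reported finish=34.8min. The first line does not reproduce exactly from the printed speed (this arithmetic gives roughly 7980 minutes against the reported 7654.5), presumably because 35K/sec is a rounded value. Either way, one set is resyncing a couple of hundred times faster than the other.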

---
Test #3
kernel 2.4.24 with kernel qlogicfc driver + backported patches
---

	In doing a bunch of google searching, I found the following threads
that seem to be related to the "no handle slots" problem:

http://www.ussg.iu.edu/hypermail/linux/kernel/0101.2/0267.html
http://groups.google.com/groups?selm=linux.scsi.1019759258.2413.1.camel%40lvadp.fc.hp.com

	Since one of these patches was for a 2.5 kernel, I "massaged" the
patch into the 2.4.24 kernel.

# modprobe qlogicfc
# mkraid /dev/md0
# mkraid /dev/md1

	The resync operation got to about 1% done, then stopped. Cat
/proc/mdstat shows no activity. dd if=/dev/sda of=/dev/null shows disk is
still responsive. MD is hung. Strange.
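
A "no activity" resync like this one can be confirmed by comparing the resync block counter in two /proc/mdstat snapshots taken some time apart. The following is only a sketch; resync_done and is_stalled are hypothetical helper names, and the pattern assumes the "resync = N% (done/total)" layout shown in the mdstat output earlier in this message:

```shell
# Print the "done" block counter of every resync line read on stdin.
resync_done() {
    sed -n 's/.*resync *= *[0-9.]*% (\([0-9]*\)\/[0-9]*).*/\1/p'
}

# Succeed if a resync is in progress in both snapshots and its
# counters have not moved between them.
#   $1, $2 = text of /proc/mdstat captured at two different times
is_stalled() {
    before=$(printf '%s\n' "$1" | resync_done)
    after=$(printf '%s\n' "$2" | resync_done)
    [ -n "$before" ] && [ "$before" = "$after" ]
}
```

Usage would be along the lines of: a=$(cat /proc/mdstat); sleep 60; b=$(cat /proc/mdstat); is_stalled "$a" "$b" && echo "resync is hung". Comparing the counter rather than the finish estimate matters, because the estimate keeps changing even when nothing is being written.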

---
Test #4
2.6.1 with kernel qlogicfc driver
---

# modprobe qlogicfc
# mkraid /dev/md0

qlogicfc0: no handle slots, this should not happen
hostdata->queued is 49, in_ptr: f8

	Same as 2.4.24 qlogicfc driver (Test #1)

---
Test #5
2.6.1 with sourceforge qlogic driver 8.00.00.b8 (
http://sourceforge.net/projects/linux-qla2xxx/ )
---

# modprobe qla2xxx
# mkraid /dev/md0
# mkraid /dev/md1
# mkfs.ext3 /dev/md0

	After a while, the mkfs hung and when I tried to reboot the machine,
I got a bunch of errors like the following for all 28 disks:

qla2xxx_eh_abort scsi(1:0:12:0) cmd_timeout_in_sec=0x1e.
qla2xxx_eh_abort Exiting: status=Failed
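
The cmd_timeout_in_sec value in these messages is printed in hex. A small sketch for turning it into seconds (timeout_seconds is a hypothetical helper, not something shipped with the driver):

```shell
# Extract cmd_timeout_in_sec from one log line and print it in decimal.
#   $1 = a single qla2xxx_eh_abort log line
timeout_seconds() {
    hex=$(printf '%s\n' "$1" |
        sed -n 's/.*cmd_timeout_in_sec=\(0x[0-9a-fA-F]*\).*/\1/p')
    [ -n "$hex" ] || return 1     # line carries no timeout field
    echo $(( $hex ))              # shell arithmetic accepts 0x constants
}

timeout_seconds 'qla2xxx_eh_abort scsi(1:0:12:0) cmd_timeout_in_sec=0x1e.'   # prints 30
```

So each of these aborts follows a command that had a 30 second timeout.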

---
Test #6
RHAS 2.1 Update 3 kernel (2.4.9-e3)
---

# modprobe qlogicfc
# mkraid /dev/md0

At this point the system hangs with a bunch of:

qlogicfc0: no handle slots, this should not happen
hostdata->queued is 58, in_ptr: 77

etc.

	So, the RedHat kernel qlogicfc driver appears to be no more
functional than the 2.4.24 driver.


---
Test #7
RHAS 2.1 Update 3 kernel (2.4.9-e3) + patches from Test #3
---

# modprobe qlogicfc
# mkraid /dev/md0
# mkraid /dev/md1
# mkfs.ext3 /dev/md0

	This has been the most successful test setup, but the machine has
revealed strange behavior under load tests. At one point issuing a "raidstop
/dev/md0" led to a kernel panic. Another time, tar-copying 100GB of data
began failing disks. Pretty weird stuff.

---
Test #8
2.6.1 with qlogicfc + patches from Test #3
---

# modprobe qlogicfc
# mkraid /dev/md0
# mkraid /dev/md1
# mkfs.ext3
 
	At this point, the machine hung with no error messages. 


* RE: What's the "right" qla2100 driver?
@ 2004-01-22 22:54 Poul Petersen
  0 siblings, 0 replies; 6+ messages in thread
From: Poul Petersen @ 2004-01-22 22:54 UTC (permalink / raw)
  To: linux-scsi

> For 2.4, you could try another driver -- 6.06.10 (available at the

	Ah, so there is. I hadn't noticed that! After running the raidstart
resync with this qla2100 driver, disks eventually started failing with
errors like:

SCSI disk error : host 4 channel 0 id 6 lun 0 return code = 20000
 I/O error: dev 08:71, sector 10072520
SCSI disk error : host 4 channel 0 id 5 lun 0 return code = 20000
 I/O error: dev 08:61, sector 10072520
SCSI disk error : host 4 channel 0 id 4 lun 0 return code = 20000
 I/O error: dev 08:51, sector 10072520
SCSI disk error : host 4 channel 0 id 3 lun 0 return code = 20000
 I/O error: dev 08:41, sector 10072520
SCSI disk error : host 4 channel 0 id 2 lun 0 return code = 20000
 I/O error: dev 08:31, sector 10072520
SCSI disk error : host 4 channel 0 id 1 lun 0 return code = 20000
 I/O error: dev 08:21, sector 10072520
SCSI disk error : host 4 channel 0 id 0 lun 0 return code = 20000
 I/O error: dev 08:11, sector 10072520
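
For anyone reading along, the two numbers in each pair of lines above can be unpacked mechanically. This is a sketch under two assumptions: that the 2.4 kernel prints the return code in hex with the usual layout (driver byte, host byte, message byte, status byte, from high to low), and that "dev 08:71" is major:minor in hex with 16 minors per sd disk. Both function names are made up for illustration:

```shell
# Split a SCSI result word (hex, no 0x prefix) into its four bytes.
decode_scsi_result() {
    code=$(( 0x$1 ))
    echo "driver=$(( (code >> 24) & 0xff )) host=$(( (code >> 16) & 0xff )) msg=$(( (code >> 8) & 0xff )) status=$(( code & 0xff ))"
}

# Map an "MM:mm" hex device number on major 8 to an sd device name.
decode_sd_dev() {
    major=$(( 0x${1%%:*} ))
    minor=$(( 0x${1##*:} ))
    [ "$major" -eq 8 ] || { echo "not an sd major: $major" >&2; return 1; }
    disk=$(( minor >> 4 ))        # 16 minors per disk
    part=$(( minor & 15 ))        # 0 means the whole disk
    letter=$(printf '%s' abcdefghijklmnop | cut -c "$(( disk + 1 ))")
    if [ "$part" -eq 0 ]; then echo "sd$letter"; else echo "sd$letter$part"; fi
}

decode_scsi_result 20000    # prints driver=0 host=2 msg=0 status=0
decode_sd_dev 08:71         # prints sdh1
```

If those assumptions hold, a host byte of 2 is DID_BUS_BUSY in the kernel's SCSI headers, meaning the error was raised on the host adapter side and not returned as a status by the disk itself.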

	Doesn't it seem odd that these seven disks all failed complaining
about the same sector? All 28 disks eventually failed, some with the same
sector errors. The machine hung when I tried to reboot it. Strange. 

> For 2.6, why don't we start out with the default qla2xxx driver in
> 2.6.2-rc1.  

	I'm getting the same results with this combination as I did with
2.6.1 before. I should point out that I am not waiting for the resync to
start before trying to lay down a file system:

# mkraid /dev/md0
# mkraid /dev/md1

	At this point I waited a few minutes, and both resyncs cruised
along. Everything was looking good, so I did a:

# mkfs.ext3 /dev/md0
mke2fs 1.32 (09-Nov-2002)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
26509312 inodes, 53014272 blocks
2650713 blocks (5.00%) reserved for the super user
First data block=0
1618 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872

Writing inode tables:  120/1618

	At this point, the mkfs hung and the resync stopped. The status of
md is:

# cat /proc/mdstat
Personalities : [raid5]
md1 : active raid5 sdab1[13] sdaa1[12] sdz1[11] sdy1[10] sdx1[9] sdw1[8] sdv1[7] sdu1[6] sdt1[5] sds1[4] sdr1[3] sdq1[2] sdp1[1] sdo1[0]
      212057088 blocks level 5, 64k chunk, algorithm 2 [13/13] [UUUUUUUUUUUUU]
      [>....................]  resync =  3.6% (645636/17671424) finish=1271.7min speed=222K/sec
md0 : active raid5 sdn1[13] sdm1[12] sdl1[11] sdk1[10] sdj1[9] sdi1[8] sdh1[7] sdg1[6] sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
      212057088 blocks level 5, 64k chunk, algorithm 2 [13/13] [UUUUUUUUUUUUU]
      [=>...................]  resync =  6.6% (1173248/17671424) finish=1381.3min speed=198K/sec
unused devices: <none>
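
As a cross-check that the arrays themselves were assembled as intended, the sizes above are consistent with each other. A RAID 5 set gives up one member's worth of space to parity, so 13 active members of 17671424 blocks each should yield 12 members' worth of usable space. A quick sketch (raid5_blocks is just an illustrative name):

```shell
# Usable blocks of a RAID 5 set: (active members - 1) * blocks per member.
#   $1 = number of active members, $2 = blocks per member
raid5_blocks() {
    echo $(( ($1 - 1) * $2 ))
}

raid5_blocks 13 17671424              # prints 212057088
echo $(( 212057088 / 4 ))             # prints 53014272
```

The first figure matches the "212057088 blocks" reported for md0 and md1, and dividing it by 4 (1K blocks into 4096-byte filesystem blocks) gives the 53014272 blocks that mke2fs reported.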

	But the "finish" time is the only thing that changes, as it
continually increases. I also see the following in dmesg:

md: using maximum available idle IO bandwith (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 17671424 blocks.
qla2100 0000:01:09.0: qla2xxx_eh_abort scsi(2:0:7:0): cmd_timeout_in_sec=0x1e.
qla2100 0000:01:09.0: qla2xxx_eh_abort Exiting: status=Failed
qla2100 0000:01:09.0: qla2xxx_eh_abort scsi(2:0:12:0): cmd_timeout_in_sec=0x1e.
qla2100 0000:01:09.0: qla2xxx_eh_abort Exiting: status=Failed
qla2100 0000:01:09.0: qla2xxx_eh_abort scsi(2:0:12:0): cmd_timeout_in_sec=0x1e.
qla2100 0000:01:09.0: qla2xxx_eh_abort Exiting: status=Failed
qla2100 0000:01:09.0: qla2xxx_eh_abort scsi(2:0:11:0): cmd_timeout_in_sec=0x1e.
qla2100 0000:01:09.0: qla2xxx_eh_abort Exiting: status=Failed
qla2100 0000:01:09.0: qla2xxx_eh_abort scsi(2:0:10:0): cmd_timeout_in_sec=0x1e.
qla2100 0000:01:09.0: qla2xxx_eh_abort Exiting: status=Failed

	Doing a simple collection of these error messages:

# dmesg | grep scsi\( | cut -d: -f6 | sort -un | xargs
0 1 2 3 4 5 6 7 8 9 10 11 12
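
The pipeline above depends on the target ID being the sixth colon-separated field, which only holds while the line prefix keeps the same number of colons. A slightly more robust sketch pulls the ID straight out of the scsi(host:channel:target:lun) tuple; failed_targets is a hypothetical name:

```shell
# Read kernel log text on stdin and print the unique SCSI target IDs
# that appear in scsi(host:channel:target:lun) tuples, sorted numerically.
failed_targets() {
    sed -n 's/.*scsi(\([0-9]*\):\([0-9]*\):\([0-9]*\):\([0-9]*\)).*/\3/p' |
        sort -un | xargs
}
```

It would be used as "dmesg | failed_targets". This form also works on the 2.6.1 messages from Test #5, which lack the "qla2100 0000:01:09.0:" prefix and so would need a different field number with cut.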

	So, all 13 disks in the first array (md0) are generating errors. I
decided to try this again (after a reboot), without issuing the mkfs. I let
the sync run and watched dmesg for strange errors. Here is a summary:

md: using 128k window, over a total of 17671424 blocks.
qla2100 0000:01:09.0: qla2xxx_eh_abort scsi(1:0:1:0): cmd_timeout_in_sec=0x1e.
qla2100 0000:01:09.0: Performing ISP error recovery - ha= cd7b01c0.
qla2100 0000:01:09.0: LIP reset occured (f8ef).
qla2100 0000:01:09.0: LIP occured (f8ef).
qla2100 0000:01:09.0: LOOP UP detected (1 Gbps).
qla2100 0000:01:09.0: qla2xxx_eh_abort: cmd already done sp=00000000
qla2100 0000:01:09.0: qla2xxx_eh_abort: cmd already done sp=00000000
qla2100 0000:01:09.0: qla2xxx_eh_abort: cmd already done sp=00000000
qla2100 0000:01:09.0: qla2xxx_eh_abort: cmd already done sp=00000000

... Repeated ~150 times ...

qla2100 0000:01:09.0: ISP System Error - mbx1=1935h mbx2=0h mbx3=8004h.
qla2100 0000:01:09.0: Failed to dump firmware (256)!!!
qla2100 0000:01:09.0: Performing ISP error recovery - ha= cd7b01c0.
qla2100 0000:01:09.0: LIP reset occured (f8f7).
qla2100 0000:01:09.0: LIP occured (f8f7).
qla2100 0000:01:09.0: LOOP UP detected (1 Gbps).
qla2100 0000:01:09.0: qla2xxx_eh_abort scsi(1:0:23:0): cmd_timeout_in_sec=0x1e.
qla2100 0000:01:09.0: qla2xxx_eh_abort Exiting: status=Failed
qla2100 0000:01:09.0: qla2xxx_eh_abort scsi(1:0:22:0): cmd_timeout_in_sec=0x1e.
qla2100 0000:01:09.0: qla2xxx_eh_abort Exiting: status=Failed
qla2100 0000:01:09.0: qla2xxx_eh_abort scsi(1:0:21:0): cmd_timeout_in_sec=0x1e.
qla2100 0000:01:09.0: qla2xxx_eh_abort Exiting: status=Failed
... Repeated lots ...

	This time it failed all of the disks in md1, plus two from md0
(targets 1 and 4):

# dmesg | grep '^qla2100.*scsi(' | cut -d: -f6 | sort -un | xargs
1 4 14 15 16 17 18 19 20 21 22 23 24 25 26
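
To see how a target list like this splits across the two raid sets, the IDs can simply be counted per array. This sketch assumes targets 0-13 back md0 and targets 14-27 back md1, which is inferred from the device order in /proc/mdstat and not confirmed anywhere; count_by_array is a made-up helper:

```shell
# Count failed targets per array.
# Assumption: targets 0-13 belong to md0, 14-27 to md1.
#   $@ = target IDs
count_by_array() {
    md0=0
    md1=0
    for t in "$@"; do
        if [ "$t" -le 13 ]; then
            md0=$(( md0 + 1 ))
        else
            md1=$(( md1 + 1 ))
        fi
    done
    echo "md0=$md0 md1=$md1"
}

count_by_array 1 4 14 15 16 17 18 19 20 21 22 23 24 25 26   # prints md0=2 md1=13
```

Under that mapping, the list printed above covers 13 targets in md1 and targets 1 and 4 in md0.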

	Thoughts?

Thanks for your help,

-poul


* RE: What's the "right" qla2100 driver?
@ 2004-05-27  0:25 Poul Petersen
  0 siblings, 0 replies; 6+ messages in thread
From: Poul Petersen @ 2004-05-27  0:25 UTC (permalink / raw)
  To: 'Andrew Vasquez', linux-scsi

	OK - The box is now running kernel 2.6.6 and the qla2xxx driver
version 8.00.00b12. I tried making a single raid set and about 17% of the
way into the sync I got the following:

qla2100 0000:00:0e.0: LIP reset occured (f7ef).
qla2100 0000:00:0e.0: LIP occured (f7ef).
qla2100 0000:00:0e.0: Mailbox command timeout occured. Issuing ISP abort.
qla2100 0000:00:0e.0: Performing ISP error recovery - ha= c121c218.
qla2100 0000:00:0e.0: LIP reset occured (f8f7).
qla2100 0000:00:0e.0: LIP occured (f8f7).
qla2100 0000:00:0e.0: LOOP UP detected (1 Gbps).
qla2100 0000:00:0e.0: qla2xxx_eh_abort: cmd already done sp=00000000
qla2100 0000:00:0e.0: qla2xxx_eh_abort: cmd already done sp=00000000
qla2100 0000:00:0e.0: qla2xxx_eh_abort: cmd already done sp=00000000
qla2100 0000:00:0e.0: qla2xxx_eh_abort: cmd already done sp=00000000
qla2100 0000:00:0e.0: qla2xxx_eh_abort scsi(0:0:2:0): cmd_timeout_in_sec=0x1e.
qla2100 0000:00:0e.0: qla2xxx_eh_abort Exiting: status=Failed
	
	And so on, each disk failing with an error similar to the last two
lines. I changed DEBUG_QLA2100 to "1" in qla_settings.h and then I got a
bunch of these when I loaded the driver:

scsi(1): RLC failed to issue iocb! fcport=[000c/cf5e84ec] rval=0 cs=0 ss=b02

	Followed by a bunch of:

cmd_timeout: Found in ISP

	And after resyncing the raid set for some time, the failure messages:

qla2xxx_eh_abort: refcount 1
qla2100 0000:00:0e.0: qla2xxx_eh_abort scsi(1:0:3:0): cmd_timeout_in_sec=0x1e.
SCSI Command @=0xc4eabaa8, Handle=0x00000335
  chan=0x00, target=0x03, lun=0x00, cmd_len=0x0a
 CDB: 0x28 0x00 0x00 0xe6 0xc6 0x3f 0x00 0x01 0x00 0x00
  seg_cnt=32, allowed=20, retries=0, serial_number_at_timeout=0x103982
  request buffer=0xca53365c, request buffer len=0x20000
  tag=90, transfersize=0x200
  serial_number=103982, SP=cb214b38
  data direction=2
  sp flags=0x2
  r_start=0x190678, u_start=0x190678, f_start=0x5a5a5a5a, state=7
 e_start= 0x5a5a5a5a, ext_history=1515870810, fo retry=0, loopid=4, port path=0
qla2100 0000:00:0e.0: qla2xxx_eh_abort Exiting: status=Failed
qla2xxx_eh_abort: refcount 1
qla2100 0000:00:0e.0: qla2xxx_eh_abort scsi(1:0:6:0): cmd_timeout_in_sec=0x1e.
SCSI Command @=0xca4f033c, Handle=0x00000339
  chan=0x00, target=0x06, lun=0x00, cmd_len=0x0a
 CDB: 0x28 0x00 0x00 0xe6 0xc7 0xbf 0x00 0x00 0x48 0x00
  seg_cnt=9, allowed=20, retries=0, serial_number_at_timeout=0x103986
  request buffer=0xc82088b4, request buffer len=0x9000
  tag=90, transfersize=0x200
  serial_number=103986, SP=cb214e98
  data direction=2
  sp flags=0x2
  r_start=0x19067a, u_start=0x19067a, f_start=0x5a5a5a5a, state=7
 e_start= 0x5a5a5a5a, ext_history=1515870810, fo retry=0, loopid=7, port path=0
qla2100 0000:00:0e.0: qla2xxx_eh_abort Exiting: status=Failed
qla2xxx_eh_abort: refcount 1
qla2100 0000:00:0e.0: qla2xxx_eh_abort scsi(1:0:0:0): cmd_timeout_in_sec=0x1e.
SCSI Command @=0xcbc51c24, Handle=0x0000033b
  chan=0x00, target=0x00, lun=0x00, cmd_len=0x0a
 CDB: 0x28 0x00 0x00 0xe6 0xc7 0xbf 0x00 0x00 0x48 0x00
  seg_cnt=9, allowed=20, retries=0, serial_number_at_timeout=0x103988
  request buffer=0xcf5dd9c0, request buffer len=0x9000
  tag=90, transfersize=0x200
  serial_number=103988, SP=cb214238
  data direction=2
  sp flags=0x2
  r_start=0x19067a, u_start=0x19067a, f_start=0x5a5a5a5a, state=7
 e_start= 0x5a5a5a5a, ext_history=1515870810, fo retry=0, loopid=1, port path=0
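
The CDB lines in these dumps are 10-byte READ(10) commands (opcode 0x28), so the block address and length of each timed-out read can be decoded by hand. A minimal sketch, assuming 512-byte sectors (which agrees with transfersize=0x200 above); decode_read10 is a hypothetical helper:

```shell
# Decode a READ(10) CDB given as ten 0xNN arguments.
# Bytes 2-5 hold the big-endian LBA, bytes 7-8 the transfer length in blocks.
decode_read10() {
    [ "$#" -eq 10 ] || { echo "need 10 CDB bytes" >&2; return 1; }
    [ "$(( $1 ))" -eq 40 ] || { echo "not READ(10)" >&2; return 1; }   # 0x28 == 40
    lba=$(( ($3 << 24) | ($4 << 16) | ($5 << 8) | $6 ))
    blocks=$(( ($8 << 8) | $9 ))
    # Assumption: 512-byte sectors
    echo "lba=$lba blocks=$blocks bytes=$(( blocks * 512 ))"
}

decode_read10 0x28 0x00 0x00 0xe6 0xc6 0x3f 0x00 0x01 0x00 0x00
# prints lba=15124031 blocks=256 bytes=131072
```

The byte counts line up with the dumps: 131072 is the 0x20000 request buffer length of the first command, and the other two commands decode to 72 blocks, or 36864 bytes, which is 0x9000. The second and third commands also read the same LBA (15124415) on different targets, which is what a RAID 5 resync reading the same stripe from every member would look like.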

Thoughts? Many thanks,

-poul


> -----Original Message-----
> From: Andrew Vasquez [mailto:praka@users.sourceforge.net]
> Sent: Wednesday, January 21, 2004 9:22 PM
> To: linux-scsi@vger.kernel.org
> Subject: Re: What's the "right" qla2100 driver?
> 
> 
> On Wed, 21 Jan 2004, Poul Petersen wrote:
> 
> > I've got a Qlogic qla2100 card connected via copper to an array of
> > 28 ~18GB disks. I'm trying to create two raid level 5 md devices
> > with 13disk+1spare each. What I'm experiencing is a lot of hangs and
> > strange disk failures that appear to be related to the drivers for
> > the 2100. As there seems to be many different drivers to choose from
> > for the qla2100, I've tried them all and collected my (all bad)
> > experiences, a summary of which I have attached below. 
> > 
> > What I am wondering is if anyone else is using this card with Linux
> > and if so what driver are they using? More importantly, which driver
> > is actively being worked on (if any) so that I might contribute some
> > failure information? 
> > 
> 
> For 2.4, you could try another driver -- 6.06.10 (available at the
> QLogic website).  This driver has support for the 2100 (though it's
> not actually documented).  The makefile will not build the driver
> though, try something similar to the following:
> 
> 	# make qla2100.o
> 
> For 2.6, why don't we start out with the default qla2xxx driver in
> 2.6.2-rc1.  
> 
> > It's possible that I am experiencing a hardware problem, since these
> > disks and controller are a few years old now, but I doubt it since
> > most of the errors seem inconsistent with bad hardware (all disks
> > failing, etc). If nothing else, if I get feedback saying that
> > someone else is successfully using a certain driver, then I can
> > start playing with hardware using the same driver and maybe get to
> > the bottom of this...
> > 
> 
> From there let's see how far you get with the drivers.  We may need to
> enable some extra debugging for additional information.
> 
> > ---
> > Test #5
> > 2.6.1 with sourceforge qlogic driver 8.00.00.b8 (
> > http://sourceforge.net/projects/linux-qla2xxx/ )
> > ---
> > 
> > # modprobe qla2xxx
> > # mkraid /dev/md0
> > # mkraid /dev/md1
> > # mkfs.ext3 /dev/md0
> > 
> >  	After awhile, the mkfs hung and when I tried to reboot the machine,
> > I got a bunch of errors like the following for all 28 disks:
> > 
> > qla2xxx_eh_abort scsi(1:0:12:0) cmd_timeout_in_sec=0x1e.
> > qla2xxx_eh_abort Exiting: status=Failed
> > 
> 
> Interesting...
> 
> Regards,
> Andrew Vasquez

