Linux SCSI subsystem development
* What's the "right" qla2100 driver?
@ 2004-01-22  0:32 Poul Petersen
  2004-01-22  5:22 ` Andrew Vasquez
  0 siblings, 1 reply; 5+ messages in thread
From: Poul Petersen @ 2004-01-22  0:32 UTC (permalink / raw)
  To: linux-scsi

	I've got a Qlogic qla2100 card connected via copper to an array
of 28 ~18GB disks. I'm trying to create two RAID level 5 md devices with
13 disks + 1 spare each. What I'm experiencing is a lot of hangs and
strange disk failures that appear to be related to the drivers for the
2100. As there seem to be many different drivers to choose from for the
qla2100, I've tried them all and collected my (all bad) experiences, a
summary of which I have attached below.
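
For reference, the mkraid runs below are driven by /etc/raidtab; an
entry of this general shape would describe one of the sets. This is a
shortened illustrative sketch -- the 4-disk + 1-spare layout, device
names, and chunk size here are placeholders, not the actual config:

```
# Illustrative raidtools /etc/raidtab entry (shortened; real sets are 13+1)
raiddev /dev/md0
    raid-level              5
    nr-raid-disks           4
    nr-spare-disks          1
    chunk-size              64
    persistent-superblock   1
    device                  /dev/sda1
    raid-disk               0
    device                  /dev/sdb1
    raid-disk               1
    device                  /dev/sdc1
    raid-disk               2
    device                  /dev/sdd1
    raid-disk               3
    device                  /dev/sde1
    spare-disk              0
```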

	What I am wondering is whether anyone else is using this card with
Linux and, if so, what driver they are using. More importantly, which
driver is actively being worked on (if any), so that I might contribute
some failure information?

	It's possible that I am experiencing a hardware problem, since
these disks and controller are a few years old now, but I doubt it since
most of the errors seem inconsistent with bad hardware (all disks
failing, etc). If nothing else, if I get feedback saying that someone
else is successfully using a certain driver, then I can start playing
with hardware using the same driver and maybe get to the bottom of
this...

Many thanks for any help,

-poul

Oh, machine specs:

Dell Optiplex GX110
PIII-1GHz w/ 260MB RAM
RedHat 9


Summary of test results:

---
Test #1
kernel 2.4.24 with kernel qlogicfc driver
---

# modprobe qlogicfc
# mkraid /dev/md0

At this point the system hangs with:

qlogicfc0: no handle slots, this should not happen
hostdata->queued is 58, in_ptr: 77

---
Test #2
kernel 2.4.24 with Qlogic 4.46.12b
---

I am unable to find a recent driver on the Qlogic site as support for
this card has been discontinued. I just happened to have this 4.x
series driver around.

# modprobe qla2x00
# mkraid /dev/md0
# mkraid /dev/md1

# cat /proc/mdstat | grep speed

      [=>...................]  resync =  5.1% (904396/17671424) finish=7654.5min speed=35K/sec
      [====>................]  resync = 22.1% (3919392/17671424) finish=34.8min speed=6577K/sec

The driver seems to work, but its ability to balance requests between the
two raid sets seems very poor. Watching the activity lights reflects the
stats shown above: the second raid set blinks once or twice, then hangs
for 5~10 seconds, blinks, hangs, etc. I haven't tried running with this
driver much longer than the mkraid...

---
Test #3
kernel 2.4.24 with kernel qlogicfc driver + backported patches
---

	In doing a bunch of google searching, I found the following
threads that seem to be related to the "no handle slots" problem:

http://www.ussg.iu.edu/hypermail/linux/kernel/0101.2/0267.html
http://groups.google.com/groups?selm=linux.scsi.1019759258.2413.1.camel%40lvadp.fc.hp.com

	Since one of these patches was for a 2.5 kernel, I "massaged" the
patch into the 2.4.24 kernel.
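
The "massaging" amounts to the usual patch workflow; here is a minimal
sketch (the tree, file names, and hunk contents are purely illustrative,
not the real qlogicfc patch) of dry-running and then applying a
back-ported patch:

```shell
# Set up a toy tree and patch to illustrate the back-port workflow
# (paths and contents are illustrative, not the real qlogicfc patch).
mkdir -p linux-2.4.24/drivers/scsi
printf 'line one\nline two\n' > linux-2.4.24/drivers/scsi/qlogicfc.c
cat > handle-slots.patch <<'EOF'
--- a/drivers/scsi/qlogicfc.c
+++ b/drivers/scsi/qlogicfc.c
@@ -1,2 +1,2 @@
 line one
-line two
+line two patched
EOF
# --dry-run reports whether every hunk applies; hunks that fail would be
# written out to *.rej files for hand-merging against the 2.4.24 source.
patch -d linux-2.4.24 -p1 --dry-run < handle-slots.patch
patch -d linux-2.4.24 -p1 < handle-slots.patch
```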

# modprobe qlogicfc
# mkraid /dev/md0
# mkraid /dev/md1

	The resync operation got to about 1% done, then stopped.
cat /proc/mdstat shows no activity. dd if=/dev/sda of=/dev/null shows the
disks are still responsive. MD is hung. Strange.

---
Test #4
2.6.1 with kernel qlogicfc driver
---

# modprobe qlogicfc
# mkraid /dev/md0

qlogicfc0: no handle slots, this should not happen
hostdata->queued is 49, in_ptr: f8

	Same as the 2.4.24 qlogicfc driver (Test #1)

---
Test #5
2.6.1 with sourceforge qlogic driver 8.00.00.b8 (
http://sourceforge.net/projects/linux-qla2xxx/ )
---

# modprobe qla2xxx
# mkraid /dev/md0
# mkraid /dev/md1
# mkfs.ext3 /dev/md0

 	After a while, the mkfs hung, and when I tried to reboot the
machine I got a bunch of errors like the following for all 28 disks:

qla2xxx_eh_abort scsi(1:0:12:0) cmd_timeout_in_sec=0x1e.
qla2xxx_eh_abort Exiting: status=Failed

---
Test #6
RHAS 2.1 Update 3 kernel (2.4.9-e3)
---

# modprobe qlogicfc
# mkraid /dev/md0

At this point the system hangs with a bunch of:

qlogicfc0: no handle slots, this should not happen
hostdata->queued is 58, in_ptr: 77

etc.

	So, the RedHat kernel qlogicfc driver appears to be no more
functional than the 2.4.24 driver.


---
Test #7
RHAS 2.1 Update 3 kernel (2.4.9-e3) + patches from Test #3
---

# modprobe qlogicfc
# mkraid /dev/md0
# mkraid /dev/md1
# mkfs.ext3 /dev/md0

	This has been the most successful test setup, but the machine has
revealed strange behavior under load tests. At one point issuing a "raidstop
/dev/md0" led to a kernel panic. Another time, tar-copying 100GB of data
began failing disks. Pretty weird stuff.

---
Test #8
2.6.1 with qlogicfc + patches from Test #3
---

# modprobe qlogicfc
# mkraid /dev/md0
# mkraid /dev/md1
# mkfs.ext3
 
	At this point, the machine hung with no error messages. 


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: What's the "right" qla2100 driver?
       [not found] <F888C30C3021D411B9DA00B0D0209BE8038F9BD1@cvo-exchange.cvo.roguewave.com>
@ 2004-01-22  0:40 ` Lincoln Dale
  0 siblings, 0 replies; 5+ messages in thread
From: Lincoln Dale @ 2004-01-22  0:40 UTC (permalink / raw)
  To: Poul Petersen; +Cc: linux-scsi

At 11:32 AM 22/01/2004, Poul Petersen wrote:
>         I've got a Qlogic qla2100 card connected via copper to an array
>of 28 ~18GB disks. I'm trying to create two raid level 5 md devices with
>13disk+1spare each. What I'm experiencing is a lot of hangs and strange
>disk failures that appear to be related to the drivers for the 2100.
[..]
># cat /proc/mdstat | grep speed
>
>       [=>...................]  resync =  5.1% (904396/17671424)
>finish=7654.5min speed=35K/sec
>       [====>................]  resync = 22.1% (3919392/17671424)
>finish=34.8min speed=6577K/sec
>
>The driver seems to work, but the ability of the driver to balance requests
>between the two
>raid sets seems very poor. Watching the activity lights reflects the stats
>showed above: the
>second raid set blinks once or twice, then hangs for 5~10 seconds, blinks,
>hangs, etc.
>I haven't tried running with this driver much longer than the mkraid...

no one said that FC Arbitrated Loop provides FAIR arbitration between
devices ...



* RE: What's the "right" qla2100 driver?
       [not found] <47F3C2BE74738E4683574107469DFA201DF5BF@XYUSEX01.xyus.xyratex.com>
@ 2004-01-22  5:06 ` Lincoln Dale
  0 siblings, 0 replies; 5+ messages in thread
From: Lincoln Dale @ 2004-01-22  5:06 UTC (permalink / raw)
  To: Frank Borich; +Cc: Poul Petersen, linux-scsi

At 12:39 PM 22/01/2004, Frank Borich wrote:
>I heard the fc driver is bad.
>Has anyone experienced very poor performance when using MD- SCSI 
>mid-layer, and FC HBA?
>I create a 3 drive raid 5 array using MD, during initialization I get 50 + 
>MB/sec
>When I  write 00's using dd I get 3 MB/sec.  I can tell by using a fc 
>analyzer that my write commands
>are being chopped up into pieces.  Just wondering if anyone else has seen 
>this or knows of any issues using MD, and SCSI mid-layer, so I can shift 
>focus to FC side?  I lost all of my SCSI arrays! argh......

<shrug>

works ok here.  i wonder if you realize exactly how RAID5 works ...?

Linux 2.6.2-rc1, using the QLogic driver that is integrated into the kernel 
with a QLA2300 attached to a Cisco MDS FC switch in turn connected to a FC 
JBOD with 8 x 15K RPM disks:

[root@mel-stglab-host31 linux]# head -2 /proc/scsi/qla2xxx/0
QLogic PCI to Fibre Channel Host Adapter for QLA2310:
         Firmware version 3.02.18 TPX, Driver version 8.00.00b8

constructing a 3-drive raid5 array (2 active 1 spare) using MD, i get 
~31MB/sec:
to 'create' a 3-disk RAID5 array requires reading the data off 2 disks to 
write the parity to the 3rd disk -- the overall speed of constructing a 
RAID5 array is limited by the performance of a single disk spindle.

         (output from mdstat)
         [root@mel-stglab-host31 root]# cat /proc/mdstat
         Personalities : [linear] [raid0] [raid1] [raid5] [multipath]
         md0 : active raid5 sdg1[2] sdd1[1] sdb1[0]
               17498880 blocks level 5, 256k chunk, algorithm 2 [2/2] [UU]
               [==>..................]  resync = 10.5% (1852292/17498880) 
finish=8.5min speed=30369K/sec
         unused devices: <none>

         (output from my FC switch connected to this HBA showing the true 
i/o that is going on - both reads & writes):
         mel-stglab-mds9509-1# show interface fc2/9 | include rate
             5 minutes input rate 137039200 bits/sec, 17129900 bytes/sec, 
8709 frames/sec
             5 minutes output rate 347203424 bits/sec, 43400428 bytes/sec, 
21507 frames/sec

once the RAID5 array is built, i get ~44 MB/sec on a 'dd' -- once again, 
about the speed limit of a single disk spindle:

         [root@mel-stglab-host31 root]# time dd if=/dev/md0 of=/dev/null 
bs=256K count=4000
         4000+0 records in
         4000+0 records out

         real    0m22.639s
         user    0m0.020s
         sys     0m4.235s

if i do a 'dd' on one of the disk spindles that makes up the raid5 array,
i get around the same number (actually, it's a bit better: i get 57MB/sec
-- that is probably because ALL I/O is guaranteed to be sequential now):

         [root@mel-stglab-host31 root]# time dd if=/dev/sdg1 of=/dev/null 
bs=256K count=4000
         4000+0 records in
         4000+0 records out

         real    0m17.538s
         user    0m0.013s
         sys     0m2.698s


.... now, if i convert the RAID5 array to be a RAID0 array instead, its 
performance is much much better -- 105MB/sec:

         [root@mel-stglab-host31 root]# !mkraid
         mkraid --really-force /dev/md0
         DESTROYING the contents of /dev/md0 in 5 seconds, Ctrl-C if unsure!
         handling MD device /dev/md0
         analyzing super-block
         disk 0: /dev/sdb1, 17499120kB, raid superblock at 17499008kB
         disk 1: /dev/sdd1, 17499120kB, raid superblock at 17499008kB
         disk 2: /dev/sdf1, 17775891kB, raid superblock at 17775808kB
         disk 3: /dev/sdg1, 17775891kB, raid superblock at 17775808kB
         [root@mel-stglab-host31 root]# cat /proc/mdstat
         Personalities : [linear] [raid0] [raid1] [raid5] [multipath]
         md0 : active raid0 sdg1[3] sdf1[2] sdd1[1] sdb1[0]
               70548992 blocks 256k chunks

         unused devices: <none>
         [root@mel-stglab-host31 root]# time dd if=/dev/md0 of=/dev/null 
bs=256K count=4000
         4000+0 records in
         4000+0 records out

         real    0m9.557s
         user    0m0.013s
         sys     0m2.357s
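
as a sanity check, the dd numbers above fall out of simple arithmetic:
each run moves 4000 x 256 KiB = 1000 MiB, so dividing by the 'real'
times gives the per-configuration throughput (a quick sketch, using the
timings quoted above):

```python
# Each dd run above transfers 4000 blocks x 256 KiB = 1000 MiB;
# throughput is simply that volume over the measured wall-clock time.
def throughput_mib_s(count, bs_kib, seconds):
    return count * bs_kib / 1024 / seconds

raid5_dd   = throughput_mib_s(4000, 256, 22.639)  # ~44 MiB/s (RAID5 md0)
spindle_dd = throughput_mib_s(4000, 256, 17.538)  # ~57 MiB/s (single disk)
raid0_dd   = throughput_mib_s(4000, 256, 9.557)   # ~105 MiB/s (RAID0 md0)
print(round(raid5_dd), round(spindle_dd), round(raid0_dd))
```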


cheers,

lincoln.



* Re: What's the "right" qla2100 driver?
  2004-01-22  0:32 What's the "right" qla2100 driver? Poul Petersen
@ 2004-01-22  5:22 ` Andrew Vasquez
  0 siblings, 0 replies; 5+ messages in thread
From: Andrew Vasquez @ 2004-01-22  5:22 UTC (permalink / raw)
  To: linux-scsi

On Wed, 21 Jan 2004, Poul Petersen wrote:

> I've got a Qlogic qla2100 card connected via copper to an array of
> 28 ~18GB disks. I'm trying to create two raid level 5 md devices
> with 13disk+1spare each. What I'm experiencing is a lot of hangs and
> strange disk failures that appear to be related to the drivers for
> the 2100. As there seems to be many different drivers to choose from
> for the qla2100, I've tried them all and collected my (all bad)
> experiences, a summary of which I have attached below. 
> 
> What I am wondering is if anyone else is using this card with Linux
> and if so what driver are they using? More importantly, which driver
> is actively being worked on (if any) so that I might contribute some
> failure information? 
> 

For 2.4, you could try another driver -- 6.06.10 (available at the
QLogic website).  This driver has support for the 2100 (though it's
not actually documented).  The makefile will not build the driver,
though; try something similar to the following:

	# make qla2100.o

For 2.6, why don't we start out with the default qla2xxx driver in
2.6.2-rc1?

> It's possible that I am experiencing a hardware problem, since these
> disks and controller are a few years old now, but I doubt it since
> most of the errors seem inconsistent with bad hardware (all disks
> failing, etc). If nothing else, if I get feedback saying that
> someone else is successfully using a certain driver, then I can
> start playing with hardware using the same driver and maybe get to
> the bottom of this...
> 

From there let's see how far you get with the drivers.  We may need to
enable some extra debugging for additional information.

> ---
> Test #5
> 2.6.1 with sourceforge qlogic driver 8.00.00.b8 (
> http://sourceforge.net/projects/linux-qla2xxx/ )
> ---
> 
> # modprobe qla2xxx
> # mkraid /dev/md0
> # mkraid /dev/md1
> # mkfs.ext3 /dev/md0
> 
>  	After awhile, the mkfs hung and when I tried to reboot the machine,
> I got a bunch of errors like the following for all 28 disks:
> 
> qla2xxx_eh_abort scsi(1:0:12:0) cmd_timeout_in_sec=0x1e.
> qla2xxx_eh_abort Exiting: status=Failed
> 

Interesting...

Regards,
Andrew Vasquez


* RE: What's the "right" qla2100 driver?
@ 2004-01-22 22:54 Poul Petersen
  0 siblings, 0 replies; 5+ messages in thread
From: Poul Petersen @ 2004-01-22 22:54 UTC (permalink / raw)
  To: linux-scsi

> For 2.4, you could try another driver -- 6.06.10 (available at the

	Ah, so there is. I hadn't noticed that! After running the
raidstart resync with this qla2100 driver, disks eventually started
failing with errors like:

SCSI disk error : host 4 channel 0 id 6 lun 0 return code = 20000
 I/O error: dev 08:71, sector 10072520
SCSI disk error : host 4 channel 0 id 5 lun 0 return code = 20000
 I/O error: dev 08:61, sector 10072520
SCSI disk error : host 4 channel 0 id 4 lun 0 return code = 20000
 I/O error: dev 08:51, sector 10072520
SCSI disk error : host 4 channel 0 id 3 lun 0 return code = 20000
 I/O error: dev 08:41, sector 10072520
SCSI disk error : host 4 channel 0 id 2 lun 0 return code = 20000
 I/O error: dev 08:31, sector 10072520
SCSI disk error : host 4 channel 0 id 1 lun 0 return code = 20000
 I/O error: dev 08:21, sector 10072520
SCSI disk error : host 4 channel 0 id 0 lun 0 return code = 20000
 I/O error: dev 08:11, sector 10072520

	Doesn't it seem odd that these seven disks all failed complaining
about the same sector? All 28 disks eventually failed, some with the same
sector errors. The machine hung when I tried to reboot it. Strange. 

> For 2.6, why don't we start out with the default qla2xxx driver in
> 2.6.2-rc1.  

	I'm getting the same results with this combination as I did with
2.6.1 before. I should point out that I am now waiting for the resync to
start before trying to lay down a file system:

# mkraid /dev/md0
# mkraid /dev/md1

	At this point I waited a few minutes, and both resyncs cruised
along. Everything was looking good, so I did a:

# mkfs.ext3 /dev/md0
mke2fs 1.32 (09-Nov-2002)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
26509312 inodes, 53014272 blocks
2650713 blocks (5.00%) reserved for the super user
First data block=0
1618 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
        2654208, 4096000, 7962624, 11239424, 20480000, 23887872

Writing inode tables:  120/1618

	At this point, the mkfs hung and the resync stopped. The status of
md is:

# cat /proc/mdstat
Personalities : [raid5]
md1 : active raid5 sdab1[13] sdaa1[12] sdz1[11] sdy1[10] sdx1[9] sdw1[8]
sdv1[7] sdu1[6] sdt1[5] sds1[4] sdr1[3] sdq1[2] sdp1[1] sdo1[0]
      212057088 blocks level 5, 64k chunk, algorithm 2 [13/13]
[UUUUUUUUUUUUU]
      [>....................]  resync =  3.6% (645636/17671424)
finish=1271.7min speed=222K/sec
md0 : active raid5 sdn1[13] sdm1[12] sdl1[11] sdk1[10] sdj1[9] sdi1[8]
sdh1[7] sdg1[6] sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
      212057088 blocks level 5, 64k chunk, algorithm 2 [13/13]
[UUUUUUUUUUUUU]
      [=>...................]  resync =  6.6% (1173248/17671424)
finish=1381.3min speed=198K/sec
unused devices: <none>

	But the "finish" time is the only thing that changes, as it
continually increases. I also see the following in dmesg:

md: using maximum available idle IO bandwith (but not more than 200000
KB/sec) for reconstruction.
md: using 128k window, over a total of 17671424 blocks.
qla2100 0000:01:09.0: qla2xxx_eh_abort scsi(2:0:7:0):
cmd_timeout_in_sec=0x1e.
qla2100 0000:01:09.0: qla2xxx_eh_abort Exiting: status=Failed
qla2100 0000:01:09.0: qla2xxx_eh_abort scsi(2:0:12:0):
cmd_timeout_in_sec=0x1e.
qla2100 0000:01:09.0: qla2xxx_eh_abort Exiting: status=Failed
qla2100 0000:01:09.0: qla2xxx_eh_abort scsi(2:0:12:0):
cmd_timeout_in_sec=0x1e.
qla2100 0000:01:09.0: qla2xxx_eh_abort Exiting: status=Failed
qla2100 0000:01:09.0: qla2xxx_eh_abort scsi(2:0:11:0):
cmd_timeout_in_sec=0x1e.
qla2100 0000:01:09.0: qla2xxx_eh_abort Exiting: status=Failed
qla2100 0000:01:09.0: qla2xxx_eh_abort scsi(2:0:10:0):
cmd_timeout_in_sec=0x1e.
qla2100 0000:01:09.0: qla2xxx_eh_abort Exiting: status=Failed

	Doing a simple collection of these error messages:

# dmesg | grep scsi\( | cut -d: -f6 | sort -un | xargs
0 1 2 3 4 5 6 7 8 9 10 11 12
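
	That shell pipeline can equally be written as a small Python
sketch (the regex and the sample lines are mine) that pulls the SCSI
target id out of each scsi(host:channel:target:lun) tuple:

```python
import re

# Collect the sorted, unique SCSI target ids from qla2xxx abort messages,
# mirroring the dmesg | grep | cut -d: -f6 | sort -un pipeline above.
def failing_targets(dmesg_lines):
    ids = {int(m.group(3))
           for line in dmesg_lines
           for m in re.finditer(r'scsi\((\d+):(\d+):(\d+):(\d+)\)', line)}
    return sorted(ids)

sample = [
    "qla2100 0000:01:09.0: qla2xxx_eh_abort scsi(2:0:7:0): cmd_timeout_in_sec=0x1e.",
    "qla2100 0000:01:09.0: qla2xxx_eh_abort scsi(2:0:12:0): cmd_timeout_in_sec=0x1e.",
]
print(failing_targets(sample))  # [7, 12]
```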

	So, all 13 disks in the first array (md0) are generating errors. I
decided to try this again (after a reboot), without issuing the mkfs. I let
the sync run and watched dmesg for strange errors. Here is a summary:

md: using 128k window, over a total of 17671424 blocks.
qla2100 0000:01:09.0: qla2xxx_eh_abort scsi(1:0:1:0):
cmd_timeout_in_sec=0x1e.
qla2100 0000:01:09.0: Performing ISP error recovery - ha= cd7b01c0.
qla2100 0000:01:09.0: LIP reset occured (f8ef).
qla2100 0000:01:09.0: LIP occured (f8ef).
qla2100 0000:01:09.0: LOOP UP detected (1 Gbps).
qla2100 0000:01:09.0: qla2xxx_eh_abort: cmd already done sp=00000000
qla2100 0000:01:09.0: qla2xxx_eh_abort: cmd already done sp=00000000
qla2100 0000:01:09.0: qla2xxx_eh_abort: cmd already done sp=00000000
qla2100 0000:01:09.0: qla2xxx_eh_abort: cmd already done sp=00000000

... Repeated ~150 times ...

qla2100 0000:01:09.0: ISP System Error - mbx1=1935h mbx2=0h mbx3=8004h.
qla2100 0000:01:09.0: Failed to dump firmware (256)!!!
qla2100 0000:01:09.0: Performing ISP error recovery - ha= cd7b01c0.
qla2100 0000:01:09.0: LIP reset occured (f8f7).
qla2100 0000:01:09.0: LIP occured (f8f7).
qla2100 0000:01:09.0: LOOP UP detected (1 Gbps).
qla2100 0000:01:09.0: qla2xxx_eh_abort scsi(1:0:23:0):
cmd_timeout_in_sec=0x1e.
qla2100 0000:01:09.0: qla2xxx_eh_abort Exiting: status=Failed
qla2100 0000:01:09.0: qla2xxx_eh_abort scsi(1:0:22:0):
cmd_timeout_in_sec=0x1e.
qla2100 0000:01:09.0: qla2xxx_eh_abort Exiting: status=Failed
qla2100 0000:01:09.0: qla2xxx_eh_abort scsi(1:0:21:0):
cmd_timeout_in_sec=0x1e.
qla2100 0000:01:09.0: qla2xxx_eh_abort Exiting: status=Failed
... Repeated lots ...

	This time it failed all of the disks in md1, plus one from md0:

# dmesg | grep '^qla2100.*scsi(' | cut -d: -f6 | sort -un | xargs
1 4 14 15 16 17 18 19 20 21 22 23 24 25 26

	Thoughts?

Thanks for your help,

-poul



end of thread, other threads:[~2004-01-22 22:55 UTC | newest]

Thread overview: 5+ messages
2004-01-22  0:32 What's the "right" qla2100 driver? Poul Petersen
2004-01-22  5:22 ` Andrew Vasquez
     [not found] <F888C30C3021D411B9DA00B0D0209BE8038F9BD1@cvo-exchange.cvo.roguewave.com>
2004-01-22  0:40 ` Lincoln Dale
     [not found] <47F3C2BE74738E4683574107469DFA201DF5BF@XYUSEX01.xyus.xyratex.com>
2004-01-22  5:06 ` Lincoln Dale
  -- strict thread matches above, loose matches on Subject: below --
2004-01-22 22:54 Poul Petersen
