Re: aic94xx: failing on high load (another data point)

linux-scsi.vger.kernel.org archive mirror

* Re: aic94xx: failing on high load (another data point)
       [not found] ` <20080130091403.GA14887@alaris.suse.cz>
@ 2008-01-30 10:59   ` Keith Hopkins
  2008-01-30 19:29     ` Darrick J. Wong
  0 siblings, 1 reply; 16+ messages in thread
From: Keith Hopkins @ 2008-01-30 10:59 UTC (permalink / raw)
  To: Jan Sembera; +Cc: linux-scsi

On 01/30/2008 05:14 PM, Jan Sembera wrote:
> 
> 	We tried firmware versions V28, V30, and even V32 that is, as
> far as I know, not yet available on the Adaptec website. All of them were
> unfortunately displaying exactly the same behaviour :-(. Did you get your
> SAS controller working? And if so, with which firmware was that?
> 

V28.  My controller functions well with a single drive (low-medium load).  Unfortunately, all attempts to get the mirrors in sync fail and usually hang the whole box.

--Keith

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: aic94xx: failing on high load (another data point)
  2008-01-30 10:59   ` aic94xx: failing on high load (another data point) Keith Hopkins
@ 2008-01-30 19:29     ` Darrick J. Wong
  2008-02-14 16:11       ` Keith Hopkins
  0 siblings, 1 reply; 16+ messages in thread
From: Darrick J. Wong @ 2008-01-30 19:29 UTC (permalink / raw)
  To: Keith Hopkins; +Cc: Jan Sembera, linux-scsi

On Wed, Jan 30, 2008 at 06:59:34PM +0800, Keith Hopkins wrote:
> 
> V28.  My controller functions well with a single drive (low-medium load).  Unfortunately, all attempts to get the mirrors in sync fail and usually hang the whole box.

Adaptec posted a V30 sequencer on their website; does that fix the
problems?

http://www.adaptec.com/en-US/speed/scsi/linux/aic94xx-seq-30-1_tar_gz.htm

--D


* Re: aic94xx: failing on high load (another data point)
  2008-01-30 19:29     ` Darrick J. Wong
@ 2008-02-14 16:11       ` Keith Hopkins
  2008-02-15 15:28         ` James Bottomley
  0 siblings, 1 reply; 16+ messages in thread
From: Keith Hopkins @ 2008-02-14 16:11 UTC (permalink / raw)
  To: Darrick J. Wong, Jan Sembera; +Cc: linux-scsi

[-- Attachment #1: Type: text/plain, Size: 669 bytes --]

On 01/31/2008 03:29 AM, Darrick J. Wong wrote:
> On Wed, Jan 30, 2008 at 06:59:34PM +0800, Keith Hopkins wrote:
>> V28.  My controller functions well with a single drive (low-medium load).  Unfortunately, all attempts to get the mirrors in sync fail and usually hang the whole box.
> 
> Adaptec posted a V30 sequencer on their website; does that fix the
> problems?
> 
> http://www.adaptec.com/en-US/speed/scsi/linux/aic94xx-seq-30-1_tar_gz.htm
> 

I lost connectivity to the drive again, and had to reboot to recover the drive, so it seemed a good time to try out the V30 firmware.  Unfortunately, it didn't work any better.  Details are in the attachment.

--Keith



[-- Attachment #2: post-v30-update.txt --]
[-- Type: text/plain, Size: 17928 bytes --]

Running V28 Firmware

Feb 14 21:45:55 titan syslog-ng[28369]: STATS: dropped 60
Feb 14 21:47:59 titan kernel: sd 0:0:1:0: rejecting I/O to offline device
Feb 14 21:47:59 titan kernel: raid1: sdb2: rescheduling sector 0
Feb 14 21:47:59 titan kernel: sd 0:0:1:0: rejecting I/O to offline device
Feb 14 21:47:59 titan kernel: sd 0:0:1:0: rejecting I/O to offline device
Feb 14 21:47:59 titan kernel: raid1: Disk failure on sdb2, disabling device. 
Feb 14 21:47:59 titan kernel: 	Operation continuing on 1 devices
Feb 14 21:47:59 titan kernel: raid1: sda2: redirecting sector 0 to another mirror
Feb 14 21:47:59 titan kernel: RAID1 conf printout:
Feb 14 21:47:59 titan kernel:  --- wd:1 rd:2
Feb 14 21:47:59 titan kernel:  disk 0, wo:1, o:0, dev:sdb2
Feb 14 21:47:59 titan kernel:  disk 1, wo:0, o:1, dev:sda2
Feb 14 21:47:59 titan kernel: RAID1 conf printout:
Feb 14 21:47:59 titan kernel:  --- wd:1 rd:2
Feb 14 21:47:59 titan kernel:  disk 1, wo:0, o:1, dev:sda2
Feb 14 21:50:08 titan smartd[28072]: Device: /dev/sdb, No such device or address, open() failed

V30 Firmware was installed in OS via rpm.  Ran mkinitrd and...

== manually reboot to get drive back online ==

(lots of cruft removed)

Linux version 2.6.22.16-0.1-default (geeko@buildhost) (gcc version 4.2.1 (SUSE Linux)) #1 SMP 2008/01/23 14:28:52 UTC
Command line: root=/dev/vgtitan/lvroot vga=0x346 noresume splash=silent PROFILE=default profile=default crashkernel=128M@16M  splash=off nosplash
SMP: Allowing 8 CPUs, 0 hotplug CPUs
Kernel command line: root=/dev/vgtitan/lvroot vga=0x346 noresume splash=silent PROFILE=default profile=default crashkernel=128M@16M  splash=off nosplash
bootsplash: silent mode.
Initializing CPU#0
time.c: Detected 2327.500 MHz processor.
Memory: 14256200k/15728640k available (2053k kernel code, 422808k reserved, 1017k data, 316k init)
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 4096K
CPU 0/0 -> Node 0
using mwait in idle threads.
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
CPU0: Thermal monitoring handled by SMI
SMP alternatives: switching to UP code
Unpacking initramfs... done
Freeing initrd memory: 4931k freed
ACPI: Core revision 20070126
Brought up 8 CPUs
io scheduler cfq registered (default)
Boot video device is 0000:08:00.0
Freeing unused kernel memory: 316k freed
ACPI Error (dsopcode-0250): No pointer back to NS node in buffer obj ffff8103b0131c60 [20070126]
ACPI Error (psparse-0537): Method parse/execution failed [\_PR_.CPU0._PDC] (Node ffff8103b0ae0770), AE_AML_INTERNAL
ACPI: Processor [CPU0] (supports 8 throttling states)
md: raid1 personality registered for level 1
device-mapper: ioctl: 4.11.0-ioctl (2006-10-12) initialised: dm-devel@redhat.com
BIOS EDD facility v0.16 2004-Jun-25, 4 devices found
SCSI subsystem initialized
aic94xx: Adaptec aic94xx SAS/SATA driver version 1.0.3 loaded
ACPI: PCI Interrupt 0000:05:02.0[A] -> GSI 16 (level, low) -> IRQ 16
aic94xx: found Adaptec AIC-9410W SAS/SATA Host Adapter, device 0000:05:02.0
scsi0 : aic94xx
aic94xx: BIOS present (1,1), 1918
aic94xx: ue num:2, ue size:88
aic94xx: manuf sect SAS_ADDR 50000d10002d9380
aic94xx: manuf sect PCBA SN 0BB0C54904VA
aic94xx: ms: num_phy_desc: 8
aic94xx: ms: phy0: ENABLED
aic94xx: ms: phy1: ENABLED
aic94xx: ms: phy2: ENABLED
aic94xx: ms: phy3: ENABLED
aic94xx: ms: phy4: ENABLED
aic94xx: ms: phy5: ENABLED
aic94xx: ms: phy6: ENABLED
aic94xx: ms: phy7: ENABLED
aic94xx: ms: max_phys:0x8, num_phys:0x8
aic94xx: ms: enabled_phys:0xff
aic94xx: ctrla: phy0: sas_addr: 50000d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy1: sas_addr: 50000d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy2: sas_addr: 50000d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy3: sas_addr: 50000d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy4: sas_addr: 50000d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy5: sas_addr: 50000d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy6: sas_addr: 50000d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy7: sas_addr: 50000d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: max_scbs:512, max_ddbs:128
aic94xx: setting phy0 addr to 50000d10002d9380
aic94xx: setting phy1 addr to 50000d10002d9380
aic94xx: setting phy2 addr to 50000d10002d9380
aic94xx: setting phy3 addr to 50000d10002d9380
aic94xx: setting phy4 addr to 50000d10002d9380
aic94xx: setting phy5 addr to 50000d10002d9380
aic94xx: setting phy6 addr to 50000d10002d9380
aic94xx: setting phy7 addr to 50000d10002d9380
aic94xx: num_edbs:21
aic94xx: num_escbs:3
aic94xx: Found sequencer Firmware version 1.1 (V30)
aic94xx: downloading CSEQ...
aic94xx: dma-ing 8192 bytes
aic94xx: verified 8192 bytes, passed
aic94xx: downloading LSEQs...
aic94xx: dma-ing 14336 bytes
aic94xx: LSEQ0 verified 14336 bytes, passed
aic94xx: LSEQ1 verified 14336 bytes, passed
aic94xx: LSEQ2 verified 14336 bytes, passed
aic94xx: LSEQ3 verified 14336 bytes, passed
aic94xx: LSEQ4 verified 14336 bytes, passed
aic94xx: LSEQ5 verified 14336 bytes, passed
aic94xx: LSEQ6 verified 14336 bytes, passed
aic94xx: LSEQ7 verified 14336 bytes, passed
aic94xx: max_scbs:446
aic94xx: first_scb_site_no:0x20
aic94xx: last_scb_site_no:0x1fe
aic94xx: First SCB dma_handle: 0x3ac6f6000
aic94xx: device 0000:05:02.0: SAS addr 50000d10002d9380, PCBA SN 0BB0C54904VA, 8 phys, 8 enabled phys, flash present, BIOS build 1918
aic94xx: posting 3 escbs
aic94xx: escbs posted
aic94xx: posting 8 control phy scbs
aic94xx: control_phy_tasklet_complete: phy4, lrate:0x9, proto:0xe
aic94xx: escb_tasklet_complete: phy4: BYTES_DMAED
aic94xx: SAS proto IDENTIFY:
aic94xx: 00: 10 00 00 08
aic94xx: 04: 00 00 00 00
aic94xx: 08: 00 00 00 00
aic94xx: 0c: 50 00 c5 00
aic94xx: 10: 01 a5 20 c5
aic94xx: 14: 00 00 00 00
aic94xx: 18: 00 00 00 00
aic94xx: asd_form_port: updating phy_mask 0x10 for phy4
aic94xx: control_phy_tasklet_complete: phy6, lrate:0x9, proto:0xe
aic94xx: escb_tasklet_complete: phy6: BYTES_DMAED
aic94xx: SAS proto IDENTIFY:
aic94xx: 00: 10 00 00 08
aic94xx: 04: 00 00 00 00
aic94xx: 08: 00 00 00 00
aic94xx: 0c: 50 00 c5 00
aic94xx: 10: 01 a5 1d e9
aic94xx: 14: 00 00 00 00
aic94xx: 18: 00 00 00 00
aic94xx: asd_form_port: updating phy_mask 0x40 for phy6
aic94xx: control_phy_tasklet_complete: phy0: no device present: oob_status:0x0
aic94xx: control_phy_tasklet_complete: phy1: no device present: oob_status:0x0
aic94xx: control_phy_tasklet_complete: phy2: no device present: oob_status:0x0
aic94xx: control_phy_tasklet_complete: phy3: no device present: oob_status:0x0
aic94xx: control_phy_tasklet_complete: phy5: no device present: oob_status:0x0
aic94xx: control_phy_tasklet_complete: phy7: no device present: oob_status:0x0
sas: phy4 added to port0, phy_mask:0x10
sas: phy6 added to port1, phy_mask:0x40
sas: DOING DISCOVERY on port 0, pid:902
scsi 0:0:0:0: Direct-Access     SEAGATE  ST3146855SS      S513 PQ: 0 ANSI: 5
sas: DONE DISCOVERY on port 0, pid:902, result:0
sas: DOING DISCOVERY on port 1, pid:902
scsi 0:0:1:0: Direct-Access     SEAGATE  ST3146855SS      S513 PQ: 0 ANSI: 5
sas: DONE DISCOVERY on port 1, pid:902, result:0
sd 0:0:0:0: [sda] 286749480 512-byte hardware sectors (146816 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: b3 00 10 08
sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
sd 0:0:0:0: [sda] 286749480 512-byte hardware sectors (146816 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: b3 00 10 08
sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
 sda: sda1 sda2 sda3
sd 0:0:0:0: [sda] Attached SCSI disk
sd 0:0:1:0: [sdb] 286749480 512-byte hardware sectors (146816 MB)
sd 0:0:1:0: [sdb] Write Protect is off
sd 0:0:1:0: [sdb] Mode Sense: b3 00 10 08
sd 0:0:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
sd 0:0:1:0: [sdb] 286749480 512-byte hardware sectors (146816 MB)
sd 0:0:1:0: [sdb] Write Protect is off
sd 0:0:1:0: [sdb] Mode Sense: b3 00 10 08
sd 0:0:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
 sdb: sdb1 sdb2 sdb3
sd 0:0:1:0: [sdb] Attached SCSI disk
md: raid0 personality registered for level 0
raid5: automatically using best checksumming function: generic_sse
   generic_sse:  8311.000 MB/sec
raid5: using function: generic_sse (8311.000 MB/sec)
raid6: int64x1   2185 MB/s
raid6: int64x2   2757 MB/s
raid6: int64x4   2672 MB/s
raid6: int64x8   2162 MB/s
raid6: sse2x1    2978 MB/s
raid6: sse2x2    6625 MB/s
raid6: sse2x4    7339 MB/s
raid6: using algorithm sse2x4 (7339 MB/s)
md: raid6 personality registered for level 6
md: raid5 personality registered for level 5
md: raid4 personality registered for level 4
md: md2 stopped.
md: bind<sdb3>
md: bind<sda3>
md: kicking non-fresh sdb3 from array!
md: unbind<sdb3>
md: export_rdev(sdb3)
raid1: raid set md2 active with 1 out of 2 mirrors
md: linear personality registered for level -1
sd 0:0:0:0: Attached scsi generic sg0 type 0
sd 0:0:1:0: Attached scsi generic sg1 type 0
Unable to find swap-space signature
md: md0 stopped.
md: bind<sdb1>
md: bind<sda1>
md: kicking non-fresh sdb1 from array!
md: unbind<sdb1>
md: export_rdev(sdb1)
raid1: raid set md0 active with 1 out of 2 mirrors
md: md1 stopped.
md: bind<sdb2>
md: bind<sda2>
md: kicking non-fresh sdb2 from array!
md: unbind<sdb2>
md: export_rdev(sdb2)
raid1: raid set md1 active with 1 out of 2 mirrors
md: bind<sdb1>
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:1, o:1, dev:sdb1
 disk 1, wo:0, o:1, dev:sda1
md: recovery of RAID array md0
md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
md: using 128k window, over a total of 530048 blocks.
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!
md: md0: recovery done.
RAID1 conf printout:
 --- wd:2 rd:2
 disk 0, wo:0, o:1, dev:sdb1
 disk 1, wo:0, o:1, dev:sda1
md: bind<sdb2>
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:1, o:1, dev:sdb2
 disk 1, wo:0, o:1, dev:sda2
md: recovery of RAID array md1
md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
md: using 128k window, over a total of 16779776 blocks.
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=4) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=28) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=14) to abort!
md: md1: recovery done.
RAID1 conf printout:
 --- wd:2 rd:2
 disk 0, wo:0, o:1, dev:sdb2
 disk 1, wo:0, o:1, dev:sda2
md: bind<sdb3>
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:1, o:1, dev:sdb3
 disk 1, wo:0, o:1, dev:sda3
md: recovery of RAID array md2
md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
md: using 128k window, over a total of 126053952 blocks.
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x5
aic94xx: escb_tasklet_complete: Can't find task (tc=16) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=4) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=15) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=23) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=12) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=13) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=10) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=16) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=15) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=10) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=11) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=14) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=17) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=9) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=4) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=14) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=15) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=14) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=14) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=5) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=5) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=17) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=8) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=16) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=20) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=7) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=12) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=12) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=7) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=15) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=14) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=16) to abort!
md: md2: recovery done.
RAID1 conf printout:
 --- wd:2 rd:2
 disk 0, wo:0, o:1, dev:sdb3
 disk 1, wo:0, o:1, dev:sda3

==EOdmesg==

05:02.0 Serial Attached SCSI controller: Adaptec AIC-9410W SAS (Razor HBA non-RAID) (rev 08)
	Subsystem: Adaptec ASC-48300 (Spirit non-RAID)
	Flags: bus master, 66MHz, slow devsel, latency 32, IRQ 16
	Memory at d0200000 (64-bit, non-prefetchable) [size=256K]
	Memory at d0000000 (64-bit, prefetchable) [size=128K]
	I/O ports at 3000 [size=256]
	[virtual] Expansion ROM at d0080000 [disabled] [size=512K]
	Capabilities: [40] PCI-X non-bridge device
	Capabilities: [58] Power Management version 2
	Capabilities: [e0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/2 Enable-

ACPI: PCI Interrupt 0000:00:02.0[A] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:00:03.0[A] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:00:04.0[A] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:05:02.0[A] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:00:1d.3[D] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:09:03.0[A] -> GSI 16 (level, low) -> IRQ 16
00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 2 (rev 31)
00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 3 (rev 31)
02:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Upstream Port (rev 01)
03:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E1 (rev 01)
00:04.0 PCI bridge: Intel Corporation 5000X Chipset PCI Express x16 Port 4-7 (rev 31)
05:02.0 Serial Attached SCSI controller: Adaptec AIC-9410W SAS (Razor HBA non-RAID) (rev 08) <- PCI-X Card / PCI-X/133 Slot
00:1d.3 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #4 (rev 09)
09:03.0 SCSI storage controller: Adaptec AHA-2944UW / AIC-7884U (rev 01)  <- old 5v PCI Card


* Re: aic94xx: failing on high load (another data point)
  2008-02-14 16:11       ` Keith Hopkins
@ 2008-02-15 15:28         ` James Bottomley
  2008-02-15 16:28           ` Keith Hopkins
  2008-02-18 14:26           ` Keith Hopkins
  0 siblings, 2 replies; 16+ messages in thread
From: James Bottomley @ 2008-02-15 15:28 UTC (permalink / raw)
  To: Keith Hopkins; +Cc: Darrick J. Wong, Jan Sembera, linux-scsi

On Fri, 2008-02-15 at 00:11 +0800, Keith Hopkins wrote:
> On 01/31/2008 03:29 AM, Darrick J. Wong wrote:
> > On Wed, Jan 30, 2008 at 06:59:34PM +0800, Keith Hopkins wrote:
> >> V28.  My controller functions well with a single drive (low-medium load).  Unfortunately, all attempts to get the mirrors in sync fail and usually hang the whole box.
> > 
> > Adaptec posted a V30 sequencer on their website; does that fix the
> > problems?
> > 
> > http://www.adaptec.com/en-US/speed/scsi/linux/aic94xx-seq-30-1_tar_gz.htm
> > 
> 
> I lost connectivity to the drive again, and had to reboot to recover
> the drive, so it seemed a good time to try out the V30 firmware.
> Unfortunately, it didn't work any better.  Details are in the
> attachment.

Well, I can offer some hope.  The errors you report:

> aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
> aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!

Are requests by the sequencer to abort a task because of a protocol
error.  IBM did some extensive testing with seagate drives and found
that the protocol errors were genuine and the result of drive firmware
problems.  IBM released a version of seagate firmware (BA17) to correct
these.  Unfortunately, your drive identifies its firmware as S513 which
is likely OEM firmware from another vendor ... however, that vendor may
have an update which corrects the problem.

Of course, the other issue is this:

> aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!

This is a bug in the driver.  It's not finding the task in the
outstanding list.  The problem seems to be that it's taking the task
from the escb which, by definition, is always NULL.  It should be taking
the task from the ascb it finds by looping over the pending queue.

If you're willing, could you try this patch which may correct the
problem?  It's sort of like falling off a cliff: if you never go near
the edge (i.e. you upgrade the drive fw) you never fall off;
alternatively, it would be nice if you could help me put up guard rails
just in case.

Thanks,

James

---
diff --git a/drivers/scsi/aic94xx/aic94xx_scb.c b/drivers/scsi/aic94xx/aic94xx_scb.c
index 0febad4..ab35050 100644
--- a/drivers/scsi/aic94xx/aic94xx_scb.c
+++ b/drivers/scsi/aic94xx/aic94xx_scb.c
@@ -458,13 +458,19 @@ static void escb_tasklet_complete(struct asd_ascb *ascb,
 		tc_abort = le16_to_cpu(tc_abort);

 		list_for_each_entry_safe(a, b, &asd_ha->seq.pend_q, list) {
-			struct sas_task *task = ascb->uldd_task;
+			struct sas_task *task = a->uldd_task;
+
+			if (a->tc_index != tc_abort)
+				continue;

-			if (task && a->tc_index == tc_abort) {
+			if (task) {
 				failed_dev = task->dev;
 				sas_task_abort(task);
-				break;
+			} else {
+				ASD_DPRINTK("R_T_A for non TASK scb 0x%x\n",
+					    a->scb->header.opcode);
 			}
+			break;
 		}

 		if (!failed_dev) {
@@ -478,7 +484,7 @@ static void escb_tasklet_complete(struct asd_ascb *ascb,
 		 * that the EH will wake up and do something.
 		 */
 		list_for_each_entry_safe(a, b, &asd_ha->seq.pend_q, list) {
-			struct sas_task *task = ascb->uldd_task;
+			struct sas_task *task = a->uldd_task;

 			if (task &&
 			    task->dev == failed_dev &&
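
For readers following the patch, here is a minimal standalone sketch of the
lookup it is aiming for. It is not the driver code: the structures below are
simplified, hypothetical stand-ins (a singly linked queue instead of the
kernel list, an `aborted` flag instead of a call to sas_task_abort()), and
the function names find_pending_by_tc() and handle_req_task_abort() are made
up for illustration. The point it tries to show is only that the task has to
be read from the pending-queue entry whose tc_index matches, not from the
escb that delivered the request.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the driver structures. */
struct sas_task {
    int dev_id;              /* stands in for task->dev */
    int aborted;             /* set here instead of calling sas_task_abort() */
};

struct ascb {
    unsigned short tc_index;     /* transaction context index of this SCB */
    struct sas_task *uldd_task;  /* NULL for non-task SCBs (e.g. an escb) */
    struct ascb *next;           /* simplified pending-queue link */
};

/*
 * Find the pending entry the sequencer asked us to abort.
 * Returns the matched entry, or NULL if tc_abort is not pending.
 */
struct ascb *find_pending_by_tc(struct ascb *pend_q, unsigned short tc_abort)
{
    struct ascb *a;

    for (a = pend_q; a != NULL; a = a->next) {
        if (a->tc_index != tc_abort)
            continue;           /* not the SCB the sequencer named */
        return a;
    }
    return NULL;
}

/*
 * Handle a REQ_TASK_ABORT for tc_abort.
 * Returns the dev_id of the failed device, or -1 if no task was found.
 * The task pointer comes from the matched queue entry "a".
 */
int handle_req_task_abort(struct ascb *pend_q, unsigned short tc_abort)
{
    struct ascb *a = find_pending_by_tc(pend_q, tc_abort);

    if (a == NULL || a->uldd_task == NULL)
        return -1;              /* nothing pending, or a non-task SCB */

    assert(a->tc_index == tc_abort);
    a->uldd_task->aborted = 1;  /* placeholder for sas_task_abort(task) */
    return a->uldd_task->dev_id;
}
```

With a queue holding a non-task entry at tc=4 and a task entry at tc=6, a
request for tc=6 marks that task aborted, while requests for tc=4 or for an
index that is not pending return -1 in this sketch.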


* Re: aic94xx: failing on high load (another data point)
  2008-02-15 15:28         ` James Bottomley
@ 2008-02-15 16:28           ` Keith Hopkins
  2008-02-18 14:26           ` Keith Hopkins
  1 sibling, 0 replies; 16+ messages in thread
From: Keith Hopkins @ 2008-02-15 16:28 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-scsi

On 02/15/2008 11:28 PM, James Bottomley wrote:
> If you're willing, could you try this patch which may correct the
> problem?  It's sort of like falling off a cliff: if you never go near
> the edge (i.e. you upgrade the drive fw) you never fall off;
> alternatively, it would be nice if you could help me put up guard rails
> just in case.

Hi James,

  Thanks for your feedback & suggestions.  Yes, I'll give the patch a try.  It might take a few days to get onto the system.  The system/drive isn't IBM, but I'll also see if I can track down a firmware update for the protocol errors.

--Keith




* Re: aic94xx: failing on high load (another data point)
  2008-02-15 15:28         ` James Bottomley
  2008-02-15 16:28           ` Keith Hopkins
@ 2008-02-18 14:26           ` Keith Hopkins
  2008-02-18 16:18             ` James Bottomley
  2008-02-19 16:22             ` James Bottomley
  1 sibling, 2 replies; 16+ messages in thread
From: Keith Hopkins @ 2008-02-18 14:26 UTC (permalink / raw)
  To: James Bottomley; +Cc: Darrick J. Wong, Jan Sembera, linux-scsi

On 02/15/2008 11:28 PM, James Bottomley wrote:
> On Fri, 2008-02-15 at 00:11 +0800, Keith Hopkins wrote:
>> On 01/31/2008 03:29 AM, Darrick J. Wong wrote:
>>> On Wed, Jan 30, 2008 at 06:59:34PM +0800, Keith Hopkins wrote:
>>>> V28.  My controller functions well with a single drive (low-medium load).  Unfortunately, all attempts to get the mirrors in sync fail and usually hang the whole box.
>>> Adaptec posted a V30 sequencer on their website; does that fix the
>>> problems?
>>>
>>> http://www.adaptec.com/en-US/speed/scsi/linux/aic94xx-seq-30-1_tar_gz.htm
>>>
>> I lost connectivity to the drive again, and had to reboot to recover
>> the drive, so it seemed a good time to try out the V30 firmware.
>> Unfortunately, it didn't work any better.  Details are in the
>> attachment.
> 
> Well, I can offer some hope.  The errors you report:
> 
>> aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
>> aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!
> 
> Are requests by the sequencer to abort a task because of a protocol
> error.  IBM did some extensive testing with seagate drives and found
> that the protocol errors were genuine and the result of drive firmware
> problems.  IBM released a version of seagate firmware (BA17) to correct
> these.  Unfortunately, your drive identifies its firmware as S513 which
> is likely OEM firmware from another vendor ... however, that vendor may
> have an update which corrects the problem.
> 
> Of course, the other issue is this:
> 
>> aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!
> 
> This is a bug in the driver.  It's not finding the task in the
> outstanding list.  The problem seems to be that it's taking the task
> from the escb which, by definition, is always NULL.  It should be taking
> the task from the ascb it finds by looping over the pending queue.
> 
> If you're willing, could you try this patch which may correct the
> problem?  It's sort of like falling off a cliff: if you never go near
> the edge (i.e. you upgrade the drive fw) you never fall off;
> alternatively, it would be nice if you could help me put up guard rails
> just in case.
> 

Well, that made life interesting....
  but didn't seem to fix anything.

The behavior is about the same as before, but with more verbose errors.  I failed one member of the raid and had it rebuild as a test...which hangs for a while and the drive falls off-line.

Please grab the dmesg output in all its gory glory from here: http://wiki.hopnet.net/dokuwiki/lib/exe/fetch.php?media=myit:sas:dmesg-20080218-wpatch-fail.txt.gz

The drive is a Dell OEM drive, but it's not in a Dell system.  There is at least one firmware (S527) upgrade for it, but the Dell loader refuses to load it (because it isn't in a Dell system...)
Does anyone know a generic way to load a new firmware onto a SAS drive?

--Keith


* Re: aic94xx: failing on high load (another data point)
  2008-02-18 14:26           ` Keith Hopkins
@ 2008-02-18 16:18             ` James Bottomley
  2008-02-19 16:22             ` James Bottomley
  1 sibling, 0 replies; 16+ messages in thread
From: James Bottomley @ 2008-02-18 16:18 UTC (permalink / raw)
  To: Keith Hopkins; +Cc: Darrick J. Wong, Jan Sembera, linux-scsi

On Mon, 2008-02-18 at 22:26 +0800, Keith Hopkins wrote:
> Well, that made life interesting....
>   but didn't seem to fix anything.
> 
> The behavior is about the same as before, but with more verbose
> errors.  I failed one member of the raid and had it rebuild as a
> test...which hangs for a while and the drive falls off-line.

Actually, it now finds the task and tries to do error handling for
it ... so we've now uncovered bugs in the error handler.  It may not
look like it, but this is actually progress.  Although, I'm afraid it's
going to be a bit like peeling an onion: every time one error gets
fixed, you just get to the next layer of errors.

> Please grab the dmesg output in all its gory glory from here:
> http://wiki.hopnet.net/dokuwiki/lib/exe/fetch.php?media=myit:sas:dmesg-20080218-wpatch-fail.txt.gz
> 
> The drive is a Dell OEM drive, but it's not in a Dell system.  There
> is at least one firmware (S527) upgrade for it, but the Dell loader
> refuses to load it (because it isn't in a Dell system...)
> Does anyone know a generic way to load a new firmware onto a SAS drive?

The firmware upgrade tools are usually vendor specific, though, because
the format of the firmware file is vendor specific.  Could you just put
it in a Dell box to upgrade?

James




* Re: aic94xx: failing on high load (another data point)
  2008-02-18 14:26           ` Keith Hopkins
  2008-02-18 16:18             ` James Bottomley
@ 2008-02-19 16:22             ` James Bottomley
  2008-02-19 18:44               ` [PATCH] aic94xx: Don't free ABORT_TASK SCBs that are timed out (Was: Re: aic94xx: failing on high load) Darrick J. Wong
  2008-02-20  3:48               ` aic94xx: failing on high load (another data point) James Bottomley
  1 sibling, 2 replies; 16+ messages in thread
From: James Bottomley @ 2008-02-19 16:22 UTC (permalink / raw)
  To: Keith Hopkins; +Cc: Darrick J. Wong, Jan Sembera, linux-scsi

On Mon, 2008-02-18 at 22:26 +0800, Keith Hopkins wrote:
> Well, that made life interesting....
>   but didn't seem to fix anything.
> 
> The behavior is about the same as before, but with more verbose
> errors.  I failed one member of the raid and had it rebuild as a
> test...which hangs for a while and the drive falls off-line.
> 
> Please grab the dmesg output in all its gory glory from here:
> http://wiki.hopnet.net/dokuwiki/lib/exe/fetch.php?media=myit:sas:dmesg-20080218-wpatch-fail.txt.gz

I had a look through this.  Amazingly, in spite of the message spew, up
to here:

> sas: Enter sas_scsi_recover_host
> sas: trying to find task 0xffff81033c3d3d80
> sas: sas_scsi_find_task: aborting task 0xffff81033c3d3d80
> aic94xx: tmf timed out
> aic94xx: tmf came back

everything is going normally (the REQ_TASK_ABORTs are properly aborted and
retried).  At this point (around L3449 in the trace) the aborts start
failing.

Unfortunately, there's a bug in TMF timeout handling in the driver, it
leaves the sequencer entry pending, but frees the ascb.  If the
sequencer ever picks this up it will get very confused, as it does a
while down in the trace:

> aic94xx: BUG:sequencer:dl:no ascb?!
> aic94xx: BUG:sequencer:dl:no ascb?!

That's where the sequencer adds an ascb to the done list that we've
already freed.  From this point on confusion reigns and the error
handler eventually offlines the device.
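
To make the ownership problem concrete, here is a minimal user-space
sketch of it.  This is not driver code: struct toy_ascb and the toy_*
functions are hypothetical names invented for illustration, and a
"freed" flag stands in for the real free so the example stays well
defined.  It only models the sequence described above: the host posts
an abort, times out, and the sequencer completes the request later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an ascb; not the driver's structure. */
struct toy_ascb {
	bool freed;             /* set instead of calling free() */
	bool held_by_sequencer; /* toy sequencer still references the block */
};

/* Host posts the abort request; the sequencer now holds a reference. */
static void toy_post_abort(struct toy_ascb *a)
{
	assert(a != NULL);
	a->held_by_sequencer = true;
}

/*
 * Host-side timeout handling.  free_on_timeout == true models the
 * behaviour described above (free while the entry is still pending);
 * false models leaving the block alone until the sequencer returns it.
 */
static void toy_handle_timeout(struct toy_ascb *a, bool free_on_timeout)
{
	assert(a != NULL);
	if (free_on_timeout)
		a->freed = true;
}

/*
 * Late completion from the sequencer.  Returns true when the block it
 * hands back has already been freed by the host, which is the toy
 * equivalent of the "no ascb?!" condition in the trace.
 */
static bool toy_late_completion(struct toy_ascb *a)
{
	assert(a != NULL && a->held_by_sequencer);
	a->held_by_sequencer = false;
	return a->freed;
}
```

Calling toy_post_abort(), then toy_handle_timeout() with
free_on_timeout set, then toy_late_completion() reports the stale
block; with free_on_timeout clear, the late completion finds the block
intact.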

I'll see if I can come up with patches to fix this ... or at least
mitigate the problems it causes.

James


* [PATCH] aic94xx: Don't free ABORT_TASK SCBs that are timed out (Was: Re: aic94xx: failing on high load)
  2008-02-19 16:22             ` James Bottomley
@ 2008-02-19 18:44               ` Darrick J. Wong
  2008-02-19 18:52                 ` James Bottomley
  2008-02-28 14:56                 ` Keith Hopkins
  2008-02-20  3:48               ` aic94xx: failing on high load (another data point) James Bottomley
  1 sibling, 2 replies; 16+ messages in thread
From: Darrick J. Wong @ 2008-02-19 18:44 UTC (permalink / raw)
  To: James Bottomley
  Cc: Keith Hopkins, Jan Sembera, linux-scsi, Alexis Bruemmer,
	Peter Bogdanovic, Gilbert Wu

If we send an ABORT_TASK ascb that doesn't return within the timeout period,
we should not free that ascb because the sequencer is still holding onto it.
Hopefully it will fix what James Bottomley describes below:

On Tue, Feb 19, 2008 at 10:22:20AM -0600, James Bottomley wrote:

> Unfortunately, there's a bug in TMF timeout handling in the driver, it
> leaves the sequencer entry pending, but frees the ascb.  If the
> sequencer ever picks this up it will get very confused, as it does a
> while down in the trace:
> 
> > aic94xx: BUG:sequencer:dl:no ascb?!
> > aic94xx: BUG:sequencer:dl:no ascb?!
> 
> That's where the sequencer adds an ascb to the done list that we've
> already freed.  From this point on confusion reigns and the error
> handler eventually offlines the device.
> 
> I'll see if I can come up with patches to fix this ... or at least
> mitigate the problems it causes.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
---

 drivers/scsi/aic94xx/aic94xx_tmf.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/aic94xx/aic94xx_tmf.c b/drivers/scsi/aic94xx/aic94xx_tmf.c
index b52124f..4b24bd3 100644
--- a/drivers/scsi/aic94xx/aic94xx_tmf.c
+++ b/drivers/scsi/aic94xx/aic94xx_tmf.c
@@ -463,7 +463,7 @@ int asd_abort_task(struct sas_task *task)
 						       AIC94XX_SCB_TIMEOUT);
 		spin_lock_irqsave(&task->task_state_lock, flags);
 		if (leftover < 1)
-			res = TMF_RESP_FUNC_FAILED;
+			goto out_not_reported;
 		if (task->task_state_flags & SAS_TASK_STATE_DONE)
 			res = TMF_RESP_FUNC_COMPLETE;
 		spin_unlock_irqrestore(&task->task_state_lock, flags);
@@ -487,6 +487,11 @@ out:
 	asd_ascb_free(ascb);
 	ASD_DPRINTK("task 0x%p aborted, res: 0x%x\n", task, res);
 	return res;
+
+out_not_reported:
+	spin_unlock_irqrestore(&task->task_state_lock, flags);
+	ASD_DPRINTK("task 0x%p aborted? but not reported.\n", task);
+	return res;
 }
 
 /**


* Re: [PATCH] aic94xx: Don't free ABORT_TASK SCBs that are timed out (Was: Re: aic94xx: failing on high load)
  2008-02-19 18:44               ` [PATCH] aic94xx: Don't free ABORT_TASK SCBs that are timed out (Was: Re: aic94xx: failing on high load) Darrick J. Wong
@ 2008-02-19 18:52                 ` James Bottomley
  2008-02-28 14:56                 ` Keith Hopkins
  1 sibling, 0 replies; 16+ messages in thread
From: James Bottomley @ 2008-02-19 18:52 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Keith Hopkins, Jan Sembera, linux-scsi, Alexis Bruemmer,
	Peter Bogdanovic, Gilbert Wu

On Tue, 2008-02-19 at 10:44 -0800, Darrick J. Wong wrote:
> If we send an ABORT_TASK ascb that doesn't return within the timeout period,
> we should not free that ascb because the sequencer is still holding onto it.
> Hopefully it will fix what James Bottomley describes below:
> 
> On Tue, Feb 19, 2008 at 10:22:20AM -0600, James Bottomley wrote:
> 
> > Unfortunately, there's a bug in TMF timeout handling in the driver, it
> > leaves the sequencer entry pending, but frees the ascb.  If the
> > sequencer ever picks this up it will get very confused, as it does a
> > while down in the trace:
> > 
> > > aic94xx: BUG:sequencer:dl:no ascb?!
> > > aic94xx: BUG:sequencer:dl:no ascb?!
> > 
> > That's where the sequencer adds an ascb to the done list that we've
> > already freed.  From this point on confusion reigns and the error
> > handler eventually offlines the device.
> > 
> > I'll see if I can come up with patches to fix this ... or at least
> > mitigate the problems it causes.
> 
> Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>

Actually, unfortunately, this is only a tiny part of it.  The message
that triggered all of this is

> sas: sas_scsi_find_task: aborting task 0xffff81033c3d3d80
> aic94xx: tmf timed out
> aic94xx: tmf came back

That's caused by a timeout at asd_enqueue_internal() further up in the
code base.

James




* Re: [PATCH] aic94xx: Don't free ABORT_TASK SCBs that are timed out (Was: Re: aic94xx: failing on high load)
  2008-02-19 18:44               ` [PATCH] aic94xx: Don't free ABORT_TASK SCBs that are timed out (Was: Re: aic94xx: failing on high load) Darrick J. Wong
  2008-02-19 18:52                 ` James Bottomley
@ 2008-02-28 14:56                 ` Keith Hopkins
  2008-02-28 16:10                   ` James Bottomley
  1 sibling, 1 reply; 16+ messages in thread
From: Keith Hopkins @ 2008-02-28 14:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-scsi

On 02/20/2008 02:44 AM, Darrick J. Wong wrote:
> If we send an ABORT_TASK ascb that doesn't return within the timeout period,
> we should not free that ascb because the sequencer is still holding onto it.
> Hopefully it will fix what James Bottomley describes below:
> 
> On Tue, Feb 19, 2008 at 10:22:20AM -0600, James Bottomley wrote:
> 
>> Unfortunately, there's a bug in TMF timeout handling in the driver, it
>> leaves the sequencer entry pending, but frees the ascb.  If the
>> sequencer ever picks this up it will get very confused, as it does a
>> while down in the trace:
>>
>>> aic94xx: BUG:sequencer:dl:no ascb?!
>>> aic94xx: BUG:sequencer:dl:no ascb?!
>> That's where the sequencer adds an ascb to the done list that we've
>> already freed.  From this point on confusion reigns and the error
>> handler eventually offlines the device.
>>
>> I'll see if I can come up with patches to fix this ... or at least
>> mitigate the problems it causes.
> 
> Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
> ---
> 
>  drivers/scsi/aic94xx/aic94xx_tmf.c |    7 ++++++-
>  1 files changed, 6 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/scsi/aic94xx/aic94xx_tmf.c b/drivers/scsi/aic94xx/aic94xx_tmf.c
> index b52124f..4b24bd3 100644
> --- a/drivers/scsi/aic94xx/aic94xx_tmf.c
> +++ b/drivers/scsi/aic94xx/aic94xx_tmf.c
> @@ -463,7 +463,7 @@ int asd_abort_task(struct sas_task *task)
>  						       AIC94XX_SCB_TIMEOUT);
>  		spin_lock_irqsave(&task->task_state_lock, flags);
>  		if (leftover < 1)
> -			res = TMF_RESP_FUNC_FAILED;
> +			goto out_not_reported;
>  		if (task->task_state_flags & SAS_TASK_STATE_DONE)
>  			res = TMF_RESP_FUNC_COMPLETE;
>  		spin_unlock_irqrestore(&task->task_state_lock, flags);
> @@ -487,6 +487,11 @@ out:
>  	asd_ascb_free(ascb);
>  	ASD_DPRINTK("task 0x%p aborted, res: 0x%x\n", task, res);
>  	return res;
> +
> +out_not_reported:
> +	spin_unlock_irqrestore(&task->task_state_lock, flags);
> +	ASD_DPRINTK("task 0x%p aborted? but not reported.\n", task);
> +	return res;
>  }
>  
>  /**
> -

Hi Darrick,

  Is this the only patch for ascb sequencer use after free problems, or are you still looking into that?

--Keith


* Re: [PATCH] aic94xx: Don't free ABORT_TASK SCBs that are timed out (Was: Re: aic94xx: failing on high load)
  2008-02-28 14:56                 ` Keith Hopkins
@ 2008-02-28 16:10                   ` James Bottomley
  0 siblings, 0 replies; 16+ messages in thread
From: James Bottomley @ 2008-02-28 16:10 UTC (permalink / raw)
  To: Keith Hopkins; +Cc: Darrick J. Wong, linux-scsi

On Thu, 2008-02-28 at 22:56 +0800, Keith Hopkins wrote:
> On 02/20/2008 02:44 AM, Darrick J. Wong wrote:
> > If we send an ABORT_TASK ascb that doesn't return within the timeout period,
> > we should not free that ascb because the sequencer is still holding onto it.
> > Hopefully it will fix what James Bottomley describes below:
> > 
> > On Tue, Feb 19, 2008 at 10:22:20AM -0600, James Bottomley wrote:
> > 
> >> Unfortunately, there's a bug in TMF timeout handling in the driver, it
> >> leaves the sequencer entry pending, but frees the ascb.  If the
> >> sequencer ever picks this up it will get very confused, as it does a
> >> while down in the trace:
> >>
> >>> aic94xx: BUG:sequencer:dl:no ascb?!
> >>> aic94xx: BUG:sequencer:dl:no ascb?!
> >> That's where the sequencer adds an ascb to the done list that we've
> >> already freed.  From this point on confusion reigns and the error
> >> handler eventually offlines the device.
> >>
> >> I'll see if I can come up with patches to fix this ... or at least
> >> mitigate the problems it causes.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
> > ---
> > 
> >  drivers/scsi/aic94xx/aic94xx_tmf.c |    7 ++++++-
> >  1 files changed, 6 insertions(+), 1 deletions(-)
> > 
> > diff --git a/drivers/scsi/aic94xx/aic94xx_tmf.c b/drivers/scsi/aic94xx/aic94xx_tmf.c
> > index b52124f..4b24bd3 100644
> > --- a/drivers/scsi/aic94xx/aic94xx_tmf.c
> > +++ b/drivers/scsi/aic94xx/aic94xx_tmf.c
> > @@ -463,7 +463,7 @@ int asd_abort_task(struct sas_task *task)
> >  						       AIC94XX_SCB_TIMEOUT);
> >  		spin_lock_irqsave(&task->task_state_lock, flags);
> >  		if (leftover < 1)
> > -			res = TMF_RESP_FUNC_FAILED;
> > +			goto out_not_reported;
> >  		if (task->task_state_flags & SAS_TASK_STATE_DONE)
> >  			res = TMF_RESP_FUNC_COMPLETE;
> >  		spin_unlock_irqrestore(&task->task_state_lock, flags);
> > @@ -487,6 +487,11 @@ out:
> >  	asd_ascb_free(ascb);
> >  	ASD_DPRINTK("task 0x%p aborted, res: 0x%x\n", task, res);
> >  	return res;
> > +
> > +out_not_reported:
> > +	spin_unlock_irqrestore(&task->task_state_lock, flags);
> > +	ASD_DPRINTK("task 0x%p aborted? but not reported.\n", task);
> > +	return res;
> >  }
> >  
> >  /**
> > -
> 
> Hi Darrick,
> 
>   Is this the only patch for ascb sequencer use after free problems, or are you still looking into that?

Sorry, I forgot to cc you.  Actually this one is the full one:

http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-rc-fixes-2.6.git;a=commit;h=e2396f1e4ecd438a15fa653a028b93e95013caa3

Unfortunately, there are another five patches in that git tree that
you'll also need to see if we can get aic94xx working on your box.

If you're willing, could you use 2.6.25-rc3 as the base kernel and just
apply

http://www.kernel.org/pub/linux/kernel/people/jejb/scsi-rc-fixes-2.6.diff

on top of it?  That should give you a kernel patched with all of the
pending aic94xx and libsas fixes.

Thanks,

James




* Re: aic94xx: failing on high load (another data point)
  2008-02-19 16:22             ` James Bottomley
  2008-02-19 18:44               ` [PATCH] aic94xx: Don't free ABORT_TASK SCBs that are timed out (Was: Re: aic94xx: failing on high load) Darrick J. Wong
@ 2008-02-20  3:48               ` James Bottomley
  2008-02-20  9:54                 ` Keith Hopkins
  1 sibling, 1 reply; 16+ messages in thread
From: James Bottomley @ 2008-02-20  3:48 UTC (permalink / raw)
  To: Keith Hopkins; +Cc: Darrick J. Wong, Jan Sembera, linux-scsi

On Tue, 2008-02-19 at 10:22 -0600, James Bottomley wrote:
> I'll see if I can come up with patches to fix this ... or at least
> mitigate the problems it causes.

Darrick's working on the ascb sequencer use after free problem.

I looked into some of the error handling in libsas, and apparently
that's a bit of a huge screw up too.  There are a number of places where
we won't complete a task that is being errored out and thus causes
timeout errors.  This patch is actually for libsas to fix all of this.
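
As a rough illustration of the failure mode (a toy model only, not
libsas code; struct toy_cmd and the toy_* helpers are hypothetical
names), the sketch below contrasts an exit path that merely drops a
command from the error handler's queue with one that goes through a
single helper which both completes the task and moves the command to
the done list.  Commands that leave the queue without being completed
are the ones that would later show up as timeouts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a command being handled by the error handler. */
struct toy_cmd {
	bool task_completed;  /* the task's completion callback has run */
	bool on_pending_list; /* still on the error handler's work queue */
	bool on_done_list;    /* handed back via the done list */
};

/* Problematic exit path: takes the command off the queue and nothing else. */
static void toy_drop_from_queue(struct toy_cmd *c)
{
	assert(c != NULL);
	c->on_pending_list = false;
}

/* Single helper used by every exit path: complete, then move to done list. */
static void toy_eh_finish_cmd(struct toy_cmd *c)
{
	assert(c != NULL);
	c->task_completed = true;
	c->on_pending_list = false;
	c->on_done_list = true;
}

/* Count commands that left the queue without ever being completed. */
static size_t toy_count_leaked(const struct toy_cmd *cmds, size_t n)
{
	size_t leaked = 0;

	for (size_t i = 0; i < n; i++)
		if (!cmds[i].on_pending_list && !cmds[i].task_completed)
			leaked++;
	return leaked;
}
```

Routing every path through one helper means a path cannot take a
command off the queue while forgetting to complete it.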

I've managed to reproduce some of your problem by firing random resets
across a disk under load, and this recovers the protocol errors for me.
However, I can't reproduce the TMF timeout which caused the sequencer
screw up, so you still need to wait for Darrick's fix as well.

James

---

diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index f869fba..b656e29 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -51,8 +51,6 @@ static void sas_scsi_task_done(struct sas_task *task)
 {
 	struct task_status_struct *ts = &task->task_status;
 	struct scsi_cmnd *sc = task->uldd_task;
-	struct sas_ha_struct *sas_ha = SHOST_TO_SAS_HA(sc->device->host);
-	unsigned ts_flags = task->task_state_flags;
 	int hs = 0, stat = 0;
 
 	if (unlikely(!sc)) {
@@ -120,11 +118,7 @@ static void sas_scsi_task_done(struct sas_task *task)
 	sc->result = (hs << 16) | stat;
 	list_del_init(&task->list);
 	sas_free_task(task);
-	/* This is very ugly but this is how SCSI Core works. */
-	if (ts_flags & SAS_TASK_STATE_ABORTED)
-		scsi_eh_finish_cmd(sc, &sas_ha->eh_done_q);
-	else
-		sc->scsi_done(sc);
+	sc->scsi_done(sc);
 }
 
 static enum task_attribute sas_scsi_get_task_attr(struct scsi_cmnd *cmd)
@@ -255,13 +249,27 @@ out:
 	return res;
 }
 
+static void sas_eh_finish_cmd(struct scsi_cmnd *cmd)
+{
+	struct sas_task *task = TO_SAS_TASK(cmd);
+	struct sas_ha_struct *sas_ha = SHOST_TO_SAS_HA(cmd->device->host);
+
+	/* First off call task_done.  However, task will
+	 * be free'd after this */
+	task->task_done(task);
+	/* now finish the command and move it on to the error
+	 * handler done list, this also takes it off the
+	 * error handler pending list */
+	scsi_eh_finish_cmd(cmd, &sas_ha->eh_done_q);
+}
+
 static void sas_scsi_clear_queue_lu(struct list_head *error_q, struct scsi_cmnd *my_cmd)
 {
 	struct scsi_cmnd *cmd, *n;
 
 	list_for_each_entry_safe(cmd, n, error_q, eh_entry) {
 		if (cmd == my_cmd)
-			list_del_init(&cmd->eh_entry);
+			sas_eh_finish_cmd(cmd);
 	}
 }
 
@@ -274,7 +282,7 @@ static void sas_scsi_clear_queue_I_T(struct list_head *error_q,
 		struct domain_device *x = cmd_to_domain_dev(cmd);
 
 		if (x == dev)
-			list_del_init(&cmd->eh_entry);
+			sas_eh_finish_cmd(cmd);
 	}
 }
 
@@ -288,7 +296,7 @@ static void sas_scsi_clear_queue_port(struct list_head *error_q,
 		struct asd_sas_port *x = dev->port;
 
 		if (x == port)
-			list_del_init(&cmd->eh_entry);
+			sas_eh_finish_cmd(cmd);
 	}
 }
 
@@ -528,14 +536,14 @@ Again:
 		case TASK_IS_DONE:
 			SAS_DPRINTK("%s: task 0x%p is done\n", __FUNCTION__,
 				    task);
-			task->task_done(task);
+			sas_eh_finish_cmd(cmd);
 			if (need_reset)
 				try_to_reset_cmd_device(shost, cmd);
 			continue;
 		case TASK_IS_ABORTED:
 			SAS_DPRINTK("%s: task 0x%p is aborted\n",
 				    __FUNCTION__, task);
-			task->task_done(task);
+			sas_eh_finish_cmd(cmd);
 			if (need_reset)
 				try_to_reset_cmd_device(shost, cmd);
 			continue;
@@ -547,7 +555,7 @@ Again:
 					    "recovered\n",
 					    SAS_ADDR(task->dev),
 					    cmd->device->lun);
-				task->task_done(task);
+				sas_eh_finish_cmd(cmd);
 				if (need_reset)
 					try_to_reset_cmd_device(shost, cmd);
 				sas_scsi_clear_queue_lu(work_q, cmd);
@@ -562,7 +570,7 @@ Again:
 			if (tmf_resp == TMF_RESP_FUNC_COMPLETE) {
 				SAS_DPRINTK("I_T %016llx recovered\n",
 					    SAS_ADDR(task->dev->sas_addr));
-				task->task_done(task);
+				sas_eh_finish_cmd(cmd);
 				if (need_reset)
 					try_to_reset_cmd_device(shost, cmd);
 				sas_scsi_clear_queue_I_T(work_q, task->dev);
@@ -577,7 +585,7 @@ Again:
 				if (res == TMF_RESP_FUNC_COMPLETE) {
 					SAS_DPRINTK("clear nexus port:%d "
 						    "succeeded\n", port->id);
-					task->task_done(task);
+					sas_eh_finish_cmd(cmd);
 					if (need_reset)
 						try_to_reset_cmd_device(shost, cmd);
 					sas_scsi_clear_queue_port(work_q,
@@ -591,10 +599,10 @@ Again:
 				if (res == TMF_RESP_FUNC_COMPLETE) {
 					SAS_DPRINTK("clear nexus ha "
 						    "succeeded\n");
-					task->task_done(task);
+					sas_eh_finish_cmd(cmd);
 					if (need_reset)
 						try_to_reset_cmd_device(shost, cmd);
-					goto out;
+					goto clear_q;
 				}
 			}
 			/* If we are here -- this means that no amount
@@ -606,21 +614,18 @@ Again:
 				    SAS_ADDR(task->dev->sas_addr),
 				    cmd->device->lun);
 
-			task->task_done(task);
+			sas_eh_finish_cmd(cmd);
 			if (need_reset)
 				try_to_reset_cmd_device(shost, cmd);
 			goto clear_q;
 		}
 	}
-out:
 	return list_empty(work_q);
 clear_q:
 	SAS_DPRINTK("--- Exit %s -- clear_q\n", __FUNCTION__);
-	list_for_each_entry_safe(cmd, n, work_q, eh_entry) {
-		struct sas_task *task = TO_SAS_TASK(cmd);
-		list_del_init(&cmd->eh_entry);
-		task->task_done(task);
-	}
+	list_for_each_entry_safe(cmd, n, work_q, eh_entry)
+		sas_eh_finish_cmd(cmd);
+
 	return list_empty(work_q);
 }
 




* Re: aic94xx: failing on high load (another data point)
  2008-02-20  3:48               ` aic94xx: failing on high load (another data point) James Bottomley
@ 2008-02-20  9:54                 ` Keith Hopkins
  2008-02-20 16:22                   ` James Bottomley
  0 siblings, 1 reply; 16+ messages in thread
From: Keith Hopkins @ 2008-02-20  9:54 UTC (permalink / raw)
  To: James Bottomley, Darrick J. Wong; +Cc: Jan Sembera, linux-scsi

On 02/20/2008 11:48 AM, James Bottomley wrote:
> On Tue, 2008-02-19 at 10:22 -0600, James Bottomley wrote:
>> I'll see if I can come up with patches to fix this ... or at least
>> mitigate the problems it causes.
> 
> Darrick's working on the ascb sequencer use after free problem.
> 
> I looked into some of the error handling in libsas, and apparently
> that's a bit of a huge screw up too.  There are a number of places where
> we won't complete a task that is being errored out and thus causes
> timeout errors.  This patch is actually for libsas to fix all of this.
> 
> I've managed to reproduce some of your problem by firing random resets
> across a disk under load, and this recovers the protocol errors for me.
> However, I can't reproduce the TMF timeout which caused the sequencer
> screw up, so you still need to wait for Darrick's fix as well.
> 
> James
> 

Hi James, Darrick,

  Thanks again for looking more into this.  I'll wait for Darrick's patch and try it together with this libsas patch.  Should I leave James' first patch in also?

  I'm still looking for a Dell machine to use, and will upgrade the drives' firmware the first chance I get.

Thanks again,
--Keith



* Re: aic94xx: failing on high load (another data point)
  2008-02-20  9:54                 ` Keith Hopkins
@ 2008-02-20 16:22                   ` James Bottomley
  0 siblings, 0 replies; 16+ messages in thread
From: James Bottomley @ 2008-02-20 16:22 UTC (permalink / raw)
  To: Keith Hopkins; +Cc: Darrick J. Wong, Jan Sembera, linux-scsi

On Wed, 2008-02-20 at 17:54 +0800, Keith Hopkins wrote:
> On 02/20/2008 11:48 AM, James Bottomley wrote:
> > On Tue, 2008-02-19 at 10:22 -0600, James Bottomley wrote:
> >> I'll see if I can come up with patches to fix this ... or at least
> >> mitigate the problems it causes.
> > 
> > Darrick's working on the ascb sequencer use after free problem.
> > 
> > I looked into some of the error handling in libsas, and apparently
> > that's a bit of a huge screw up too.  There are a number of places where
> > we won't complete a task that is being errored out and thus causes
> > timeout errors.  This patch is actually for libsas to fix all of this.
> > 
> > I've managed to reproduce some of your problem by firing random resets
> > across a disk under load, and this recovers the protocol errors for me.
> > However, I can't reproduce the TMF timeout which caused the sequencer
> > screw up, so you still need to wait for Darrick's fix as well.
> > 
> > James
> > 
> 
> Hi James, Darrick,
> 
>   Thanks again for looking more into this.  I'll wait for Darrick's
> patch and try it together with this libsas patch.  Should I leave
> James' first patch in also?

Yes, that's a requirement just to get the REQ_TASK_ABORT for the
protocol errors actually to work ...

I'm afraid this is like peeling an onion as I said .. and you're going
to build up layers of patches.  However, the ones that are obvious bug
fixes and I can test (all of them so far), I'm putting in the rc fixes
tree of SCSI, so you can download a rollup here:

http://www.kernel.org/pub/linux/kernel/people/jejb/scsi-rc-fixes-2.6.diff

James




* Re: aic94xx: failing on high load (another data point)
@ 2008-01-30 10:55 Keith Hopkins
  0 siblings, 0 replies; 16+ messages in thread
From: Keith Hopkins @ 2008-01-30 10:55 UTC (permalink / raw)
  To: linux-scsi

>     We've tried new adaptec firmware shipped with SLES and we got
> ourselves new error string that appears just above error messages that you
> have seen before and that were attached to the original message:

> kernel: aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
> kernel: aic94xx: escb_tasklet_complete: Can't find task (tc=71) to abort!
> 
> Do you think they have any significance?

Hi Jan,

Which firmware version is that?

I get similar errors under high load (rebuilding software RAID1 partitions) with sequencer firmware version 1.1 (V28), and they eventually hang my box.  Previous firmware versions would also hang in similar situations.

My box is running OpenSuSE 10.3 with a 2.6.22.13-0.3-default kernel,
2x quad core Xeon CPU E5345 @ 2.33GHz stepping 0b,
14G of DDR2-667 memory, and an
Adaptec 48300 (AIC-9410W SAS/SATA Host Adapter, device 0000:05:02.0)
directly connected to two SEAGATE ST3146855SS drives.

My 2 bits.

--Keith Hopkins


